# Module 5: Model Optimization

---
## Introduction Script
Hello and welcome. 


As I mentioned before, please watch the video to learn the concepts behind the algorithms, and more importantly, go through the lesson notebooks and practice as much as you can.

---
## Lesson 1: Introduction to Feature Selection

---
### Slide 1
#### The key benefits of feature selection:

- Reduces Overfitting
- Improves Accuracy
- Reduces Training Time
- Improves Interpretability


### Slide 1 Script
This lesson explores feature selection, which is a technique for improving the performance of machine learning algorithms by focusing on those features that contain the most predictive power.

The key benefits of performing feature selection on the data are:

- Reduces Overfitting: Less redundant data means less chance to make decisions based on noise.
- Improves Accuracy: Less misleading data means improvements in modeling accuracy.
- Reduces Training Time: Less data means algorithms train faster.
- Improves Interpretability: A simpler model is easier to interpret.


---
### Slide 2

#### Feature selection algorithms:
- Filter methods
- Wrapper methods
- Embedded methods


---
### Slide 2 Script


---
### Slide 3

#### Filter Methods
- **Variance Threshold**
  - Rank features by variance
  - Feature Scaling is necessary
  - Doesn't use target feature
- **Univariate Techniques**
  - Rate features individually
  - Relationship between individual feature and target feature
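
The univariate idea above can be sketched with scikit-learn's `SelectKBest`. This is only an illustration: the iris dataset, the `f_classif` score function, and `k=2` are assumptions chosen for the example, not values taken from the lesson notebook.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in data (assumption): the iris dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# Score each feature on its own against the target with an ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

scores = selector.scores_        # one F-statistic per feature
kept = selector.get_support()    # boolean mask of the two highest-scoring features
```

Because each feature is scored independently, this approach is fast, but it cannot detect features that are only useful in combination.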

---
### Slide 3 Script
Filter methods typically apply a statistical measure to score each feature, and these scores allow the features to be ranked. We introduce `VarianceThreshold` in this lesson. This method scores features by their variance and removes those whose variance falls below a chosen threshold. The idea is straightforward: a feature with higher variance normally contains more predictive information, while a feature that barely changes across samples has little to offer a model.
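
As a minimal sketch of this idea, the example below applies `VarianceThreshold` to a small made-up array. The data, the use of `MinMaxScaler` for scaling, and the `0.01` threshold are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): the middle column is constant, so its variance is zero
X = np.array([
    [1.0, 5.0, 100.0],
    [2.0, 5.0, 300.0],
    [3.0, 5.0, 200.0],
    [4.0, 5.0, 400.0],
])

# Rescale every feature to [0, 1] so variances are comparable across features
X_scaled = MinMaxScaler().fit_transform(X)

# Keep only features whose variance exceeds the threshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X_scaled)

kept = selector.get_support()   # boolean mask of retained columns
```

Note that the target is never passed to the selector, which is why this method can be applied before any labels are considered.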


---
### Slide 4
#### Wrapper methods

- Recursive Feature Elimination
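
A wrapper method can be sketched with scikit-learn's `RFE` (recursive feature elimination) class. The iris data, the logistic regression estimator, and the choice to keep two features are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Stand-in data (assumption): the iris dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# The wrapped estimator must expose coef_ or feature_importances_
estimator = LogisticRegression(max_iter=1000)

# Repeatedly fit the model and drop the weakest feature until two remain
rfe = RFE(estimator=estimator, n_features_to_select=2, step=1)
rfe.fit(X, y)

selected_mask = rfe.support_   # True for the features that were kept
ranking = rfe.ranking_         # 1 marks a kept feature; larger values were dropped earlier
```

Unlike a filter method, the model is refit at every step, so the cost grows with the number of features.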

---
### Slide 4 Script


---


### Slide 5
#### Embedded Methods
- Rank features during training
- Model dependent
  - Decision Tree
  - Random Forest
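
As a rough illustration of an embedded method, a fitted random forest exposes a `feature_importances_` attribute. The iris dataset and the forest settings below are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): the iris dataset bundled with scikit-learn
data = load_iris()
X, y = data.data, data.target

# Importances are computed as a by-product of fitting the forest
forest = RandomForestClassifier(n_estimators=100, random_state=23)
forest.fit(X, y)

importances = forest.feature_importances_   # one value per feature, summing to 1

# Print features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"{data.feature_names[idx]}: {importances[idx]:.3f}")
```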

### Slide 5 Script

---
## Lesson 2: Introduction to Cross Validation

Make images 1, 2, and 3 consistent

---
### Slide 1
#### Train Test Split vs. Cross Validation

<img src='https://miro.medium.com/max/2984/1*pJ5jQHPfHDyuJa4-7LR11Q.png' width=500>

---
### Slide 1 Script

This lesson introduces cross-validation, a technique for evaluating machine learning models by training them on subsets of the available data and evaluating them on the complementary subsets. Cross-validation gives a more reliable estimate of model performance than a single train/test split, and it can be used to select the best model hyperparameters.
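
The procedure described above can be written out by hand as a short loop. This is a sketch only: the iris data, the five folds, and the k-nearest neighbors model are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the iris dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=23)
scores = []

for train_idx, test_idx in kf.split(X):
    # Train on four folds, then evaluate on the held-out fold
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

mean_score = np.mean(scores)   # average accuracy across the five folds
```

Every observation is used for testing exactly once, which is what makes the averaged score less dependent on one lucky or unlucky split.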



---
### Slide 2--Not needed
#### Cross Validation

- `KFold`
- `StratifiedKFold`
- `GroupKFold` is similar to `KFold`, but keeps all observations from the same group together, so a group never appears in both the training and testing data of a fold.
- `LeaveOneOut` iteratively leaves one observation out to validate the model trained on the remaining data.
- `LeavePOut` iteratively leaves $P$ observations out to validate the model trained on the remaining data.
- `ShuffleSplit` generates a user defined number of train/validate data sets, by first randomly shuffling the data.
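
To make the differences between these splitters concrete, the sketch below counts how many train/test splits each one produces on a tiny made-up dataset. The data, the group labels, and the split settings are hypothetical.

```python
import numpy as np
from sklearn.model_selection import (
    GroupKFold, KFold, LeaveOneOut, LeavePOut, ShuffleSplit, StratifiedKFold
)

# Toy data (hypothetical): 6 samples, 2 classes, 3 groups
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

n_splits = {
    "KFold": KFold(n_splits=3).get_n_splits(X),
    "StratifiedKFold": StratifiedKFold(n_splits=3).get_n_splits(X, y),
    "GroupKFold": GroupKFold(n_splits=3).get_n_splits(X, y, groups),
    "LeaveOneOut": LeaveOneOut().get_n_splits(X),   # one split per sample
    "LeavePOut": LeavePOut(p=2).get_n_splits(X),    # one split per pair of samples
    "ShuffleSplit": ShuffleSplit(n_splits=4, test_size=2, random_state=23).get_n_splits(X),
}

# With GroupKFold, a group never appears on both sides of a split
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

The `LeavePOut` count grows combinatorially with the number of samples, so it is rarely practical outside very small datasets.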


---
### Slide 2 Script


---
### Slide 3
#### KFold

<img src="https://miro.medium.com/max/2736/1*rgba1BIOUys7wQcXcL4U5A.png" width="600">



---
### Slide 3 Script



---
### Slide 4
#### Stratified KFold
<img src="https://image.noelshack.com/fichiers/2018/20/6/1526716452-general-tips-for-participating-kaggle-competitions-13-638.jpg" width='500'>


---
### Slide 4 Script



---
### Slide 5
#### Evaluate Model with Cross-Validation
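```
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# shuffle=True is needed for random_state to have an effect
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)

# Returns one score per fold, using the estimator's default scorer
score = cross_val_score(adult_model, features, label, cv=skf)
```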


---
### Slide 5 Script

When there are two outcomes, 0 and 1, a confusion matrix is a table with 4 different combinations of predicted and actual values.


### Slide 6
#### Custom Scoring
```
precision_score = cross_val_score(adult_model, features, label, cv=skf, scoring='precision')
recall_score = cross_val_score(adult_model, features, label, cv=skf, scoring='recall')
auc_score = cross_val_score(adult_model, features, label, cv=skf, scoring='roc_auc')
```
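
When several metrics are needed, `cross_validate` can compute them in a single pass over the folds. The sketch below is self-contained, so it substitutes synthetic data and a k-nearest neighbors classifier for the lesson's `features`, `label`, and `adult_model`; those substitutions are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the lesson's features and label
features, label = make_classification(n_samples=300, n_features=8, random_state=23)

model = KNeighborsClassifier()   # hypothetical stand-in for adult_model
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)

# One pass over the folds computes every requested metric
results = cross_validate(
    model, features, label, cv=skf,
    scoring=["precision", "recall", "roc_auc"],
)

for metric in ("precision", "recall", "roc_auc"):
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```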

### Slide 6 Script

---
## Lesson 3: Introduction to Model Selection

---
### Slide 1
#### Model Selection
- Use cross validation
- Grid Search
- Randomized Grid Search


---
### Slide 1 Script


---
### Slide 2
#### Grid Search  (Shade lines while speaking)
```
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# shuffle=True is needed for random_state to have an effect
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)
knc = KNeighborsClassifier()

# Create a dictionary of hyperparameters and values
neighbors = [1, 3, 5, 11, 17, 23, 31, 53, 107]
params = {'n_neighbors': neighbors}

# Create grid search cross validator
gse = GridSearchCV(estimator=knc, param_grid=params, cv=skf)
gse.fit(features, label)

# Best value found for n_neighbors
best_n_neighbors = gse.best_params_['n_neighbors']
```


### Slide 2 Script

### Slide 3
#### Multi-dimensional Grid Search
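```
neighbors = [1, 3, 5, 11, 17, 23, 31, 53, 107]
weights = ['uniform', 'distance']

knc = KNeighborsClassifier()
# shuffle=True is needed for random_state to have an effect
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)

params = {'n_neighbors': neighbors, 'weights': weights}

# Search every combination of n_neighbors and weights
gse = GridSearchCV(estimator=knc, param_grid=params, cv=skf)
gse.fit(features, label)
```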
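
A grid over two hyperparameters evaluates every combination of their values. The self-contained sketch below shows this with the same two lists; the iris dataset and the five-fold split are assumptions that stand in for the lesson's data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the iris dataset bundled with scikit-learn
features, label = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
params = {
    "n_neighbors": [1, 3, 5, 11, 17, 23, 31, 53, 107],
    "weights": ["uniform", "distance"],
}

gse = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params, cv=skf)
gse.fit(features, label)

# Every combination is evaluated: 9 neighbor values x 2 weightings = 18 candidates
n_candidates = len(gse.cv_results_["params"])

print(gse.best_params_)   # combination with the best mean cross-validation score
print(gse.best_score_)    # that mean score
```

Each additional hyperparameter multiplies the number of candidates, which is the main cost of an exhaustive grid.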

### Slide 3 Script

### Slide 4
#### Randomized Grid Search
```
from sklearn.model_selection import RandomizedSearchCV

knc = KNeighborsClassifier()
# shuffle=True is needed for random_state to have an effect
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)

neighbors = range(1, 51)
weights = ['uniform', 'distance']
params = {'n_neighbors': neighbors, 'weights': weights}

# Run randomized search over 20 sampled combinations, using the same folds
rscv = RandomizedSearchCV(knc, param_distributions=params, n_iter=20,
                          cv=skf, random_state=23)
rscv.fit(features, label)
```
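
A randomized search can also draw values from a distribution instead of a fixed list. The sketch below assumes `scipy.stats.randint` for `n_neighbors`, with the iris dataset and a five-fold split standing in for the lesson's data.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the iris dataset bundled with scikit-learn
features, label = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)

# n_neighbors is drawn from the integers 1..50; weights is drawn from a list
params = {"n_neighbors": randint(1, 51), "weights": ["uniform", "distance"]}

rscv = RandomizedSearchCV(
    KNeighborsClassifier(), param_distributions=params,
    n_iter=20, cv=skf, random_state=23,
)
rscv.fit(features, label)

# Only 20 sampled candidates are evaluated, not all 100 possible combinations
n_candidates = len(rscv.cv_results_["params"])
print(rscv.best_params_, rscv.best_score_)
```

The `n_iter` argument fixes the search budget up front, so the cost stays the same no matter how large the search space becomes.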

### Slide 4 Script

---
## Review Script


This module's assignment is fairly straightforward. Just remember to work on the problems in order.

Good luck.