# Module 5 Model Optimization

# Module 5 Introduction

### Slide 1
#### Module 5 Model Optimization
- Lesson 1: Feature Selection
- Lesson 2: Cross Validation
- Lesson 3: Model Selection

---
### Slide 1 Script
Hello and welcome. 

We've learned six supervised learning algorithms so far, and we've also learned different ways to evaluate a model. The natural next step is model optimization.

In this module, we will discuss how to improve the performance of a model. 

In lesson one, we will discuss feature selection. A dataset can have many features, and not all of them are useful or even relevant. Selecting features properly has a big impact on model performance.

In lesson two, we will introduce cross validation, which helps evaluate models more reliably. With cross validation, we can select the best model hyperparameters, a topic covered in lesson three.

As I mentioned before, for each lesson, please watch the video to learn the concepts, and more importantly, go through the lesson notebooks and practice as much as you can.

---
## Lesson 1: Introduction to Feature Selection

---
### Slide 1
#### Key Benefits

- Reduces Overfitting
- Improves Accuracy
- Reduces Training Time
- Improves Interpretability


### Slide 1 Script
This lesson explores feature selection, which is a technique for improving the performance of machine learning algorithms by focusing on those features that contain the most predictive power.

A dataset may have many features, and it's natural to think that more data leads to a better model. But not all features are equal. Some features may contain too much noise, some may be redundant, and some are simply irrelevant to the objective of the analysis.

Features like these may cause overfitting, which makes the model perform well on training data but poorly on unseen data. Too many features also increase training time and make the model difficult to interpret.

So it's critical to pick a proper subset of the features in a dataset to mitigate these problems.


---
### Slide 2

#### Feature selection algorithms:
- Filter methods
- Wrapper methods
- Embedded methods


---
### Slide 2 Script

In this lesson, we will introduce three kinds of feature selection methods: filter methods, wrapper methods, and embedded methods.


---
### Slide 3

#### Filter Methods
- **Variance Threshold**
 - Rank features by variance
 - Feature Scaling is necessary
 - Doesn't use target feature
- **Univariate Techniques**
 - Relationship between individual feature and target feature

---
### Slide 3 Script
Filter methods typically apply a statistical measure to score the different features. This score allows the features to be ranked. We introduce two filter methods in this lesson.

The first method is Variance Threshold. This method ranks features by their variances. The reasoning behind this method is that features with higher variance tend to contain more predictive information. As an extreme case, if a feature has a constant value, its variance is 0 and the feature has no predictive power at all.

A feature's variance is very sensitive to the scale of its values. For example, take two features that measure the height of the same group of people, one in centimeters and one in meters. The variance of the first feature is 10,000 times larger than that of the second, even though the two features carry the same information in different units.
So it's very important to scale the features before ranking them by variance.
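The effect of scale on variance can be seen in a minimal sketch. The height values below are made up for illustration, and the 0.01 threshold is an arbitrary choice. `MinMaxScaler` and `VarianceThreshold` are scikit-learn classes; the lesson notebook may use a different scaler or threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Hypothetical heights of the same five people, in centimeters and in meters
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
height_m = height_cm / 100.0
constant = np.ones(5)  # a constant feature, so its variance is zero

X = np.column_stack([height_cm, height_m, constant])

# Raw variances differ by a factor of 100**2 = 10,000 between cm and m
raw_variances = X.var(axis=0)
print(raw_variances)

# After min-max scaling, the two height columns have the same variance
X_scaled = MinMaxScaler().fit_transform(X)
scaled_variances = X_scaled.var(axis=0)
print(scaled_variances)

# Keep only the features whose scaled variance exceeds the (arbitrary) threshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X_scaled)
print(selector.get_support())
```

Once scaled, the two height features are treated equally, and only the constant feature is removed.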

Another filter method is univariate feature selection, which involves the target feature. A feature has high predictive power if it is strongly related to the target feature, just like total bill in the tips dataset. Univariate feature selection ranks features by the strength of the relationship between each individual feature and the target feature.

Many statistical measures can be used to quantify this relationship. They are listed in the lesson notebook, and we are not going to discuss them in more detail in this video.
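As a rough sketch of univariate selection, the example below scores each feature against the target with scikit-learn's `SelectKBest` and `f_regression`. The data is a synthetic stand-in for a tips-like dataset, so the column names and coefficients are assumptions made only for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(23)
n = 200

# Synthetic stand-in for a tips-like dataset (all values are made up)
total_bill = rng.uniform(5, 50, size=n)
party_size = rng.integers(1, 7, size=n).astype(float)
noise_feature = rng.normal(size=n)  # unrelated to the target

# The target depends only on total_bill plus some random noise
tip = 0.15 * total_bill + rng.normal(scale=0.5, size=n)

X = np.column_stack([total_bill, party_size, noise_feature])
feature_names = ["total_bill", "party_size", "noise_feature"]

# Score each feature individually against the target and keep the best one
selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X, tip)

for name, score in zip(feature_names, selector.scores_):
    print(f"{name}: F = {score:.1f}")

best_feature = feature_names[int(np.argmax(selector.scores_))]
print(best_feature)
```

The feature that actually drives the target receives by far the highest score, which is the ranking behavior described above.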


---
### Slide 4
#### Wrapper methods

- A model is needed
- Evaluate different feature combinations
- More time and resource demanding
- Recursive Feature Elimination (RFE)

---
### Slide 4 Script
The next feature selection method we introduce is the wrapper method, which ranks features by training a model with different combinations of features. This method is more expensive, since a model has to be trained on each combination of features.

A popular wrapper method is Recursive Feature Elimination, or RFE, which is defined in the scikit-learn feature_selection module. To use RFE, you need to construct a machine learning model first. RFE works with any model that exposes feature weights after training, through either a coef_ or a feature_importances_ attribute.
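A minimal sketch of `RFE` follows. It assumes a synthetic dataset from `make_classification` and a logistic regression as the wrapped model; both are illustrative choices rather than the ones used in the lesson notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 are informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=23)

# The wrapped estimator must expose coef_ or feature_importances_ after fitting
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # 1 marks a selected feature; larger values were eliminated earlier
```

By default, RFE refits the model after removing the weakest feature in each round, which is why it costs more than a filter method.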

---


### Slide 5
#### Embedded Methods
- Rank features during training
- Model dependent
 - Decision Tree
 - Random Forest

### Slide 5 Script

Some machine learning models, like decision trees, evaluate feature importance during the learning process. For this kind of model, we can retrieve feature importance from the trained model directly. Methods of this kind are called embedded methods. They are very convenient to use, but not all models can rank features during training. Among the machine learning models we've learned so far, only decision tree and random forest have this ability.

The Python code for all three feature selection methods is easy to understand. Please refer to the lesson notebook for more details.
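As a small illustration of the embedded approach, the sketch below trains a random forest on synthetic data (an assumption made for the example) and reads the importances from the fitted model's `feature_importances_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which 2 are informative
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=23)

forest = RandomForestClassifier(n_estimators=100, random_state=23)
forest.fit(X, y)

# Importances are computed during training and are normalized to sum to 1
importances = forest.feature_importances_
for index, importance in enumerate(importances):
    print(f"feature {index}: {importance:.3f}")
```

No separate selection step is needed here; the ranking is a by-product of fitting the model.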

---
## Lesson 2: Introduction to Cross Validation

Make images 1, 2, 3 consistent

---
### Slide 1
#### Train Test Split vs. Cross Validation

<img src='https://miro.medium.com/max/2984/1*pJ5jQHPfHDyuJa4-7LR11Q.png' width=500>

---
### Slide 1 Script

This lesson introduces cross-validation, which is a technique used to evaluate supervised learning models.

In the previous lessons, we split the dataset into a train set and a test set, then trained the model with the train set and evaluated it with the test set.

The problem with this approach is that the data splitting process is random. With a different train test split, the model's evaluation metrics can differ considerably. In other words, evaluation metrics from a single split are not reliable.
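The sketch below illustrates this sensitivity under assumed conditions: a small synthetic dataset and a k-nearest neighbors classifier, evaluated on five different random splits. The exact scores depend on the data and the library version, so none are quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic dataset, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=23)

scores = []
for seed in range(5):
    # Each seed produces a different random train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# The accuracy typically changes from one split to the next
print(scores)
```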

To solve this problem, we can split the training data into multiple subgroups and train the model multiple times, each time using one subgroup as the validation group and the remaining subgroups as the train group. We use the train group to train the model and the validation group to compute the model's evaluation metrics. Then we average the metrics across all runs to get a more reliable estimate.

Depending on how the dataset is split, there are many different cross validation methods. In this video, we will focus on the two most commonly used methods, k-fold and stratified k-fold. You can find introductions to other cross validation methods in the lesson notebook.


---
### Slide 2--Not needed
#### Cross Validation

- `KFold`
- `StratifiedKFold`
- `GroupKFold` similar to `KFold`, but keeps all observations from the same group in the same fold, so a group never appears in both the training and the validation data.
- `LeaveOneOut` iteratively leaves one observation out to validate the model trained on the remaining data.
- `LeavePOut` iteratively leaves $P$ observations out to validate the model trained on the remaining data.
- `ShuffleSplit` generates a user defined number of train/validate data sets, by first randomly shuffling the data.


---
### Slide 2 Script


---
### Slide 3
#### KFold

<img src="https://miro.medium.com/max/2736/1*rgba1BIOUys7wQcXcL4U5A.png" width="600">



---
### Slide 3 Script
K-fold is one of the most popular cross validation techniques. It splits the dataset into k folds of roughly equal size, optionally shuffling the data first. The image in this slide demonstrates k-fold with k equal to 5. The dataset is split into 5 equal folds. A machine learning model will be trained 5 times. In each iteration, a different fold serves as the validation set, or test set, and the other 4 folds form the train set.
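To make the rotation of folds concrete, here is a minimal sketch with scikit-learn's `KFold` on ten toy observations. The data and the random seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy observations

# shuffle=True randomizes which observations land in each fold
kf = KFold(n_splits=5, shuffle=True, random_state=23)

validation_indices = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")
    validation_indices.extend(val_idx.tolist())
```

Across the 5 iterations, every observation is used for validation exactly once and for training 4 times.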


---
### Slide 4
#### Stratified KFold
<img src="https://image.noelshack.com/fichiers/2018/20/6/1526716452-general-tips-for-participating-kaggle-competitions-13-638.jpg" width='500'>


---
### Slide 4 Script

Stratified k-fold is similar to k-fold, but instead of splitting the dataset randomly into k folds, stratified k-fold splits the dataset so that each fold has approximately the same class distribution as the whole dataset.

For example, as shown in this image, the outcome of the dataset is gender. In the original dataset, there are about 3 times as many males as females in the outcome class. When we split the dataset with stratified k-fold, each fold will have a similar outcome class distribution, with about 3 times as many males as females.
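The same idea can be sketched with `StratifiedKFold` on toy labels. The 30 male and 10 female labels below are made up to mirror the 3-to-1 ratio in the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with three times as many "male" as "female" observations
y = np.array(["male"] * 30 + ["female"] * 10)
X = np.zeros((len(y), 1))  # the feature values do not affect how the split is made

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)

fold_counts = []
for train_idx, val_idx in skf.split(X, y):
    n_male = int(np.sum(y[val_idx] == "male"))
    n_female = int(np.sum(y[val_idx] == "female"))
    fold_counts.append((n_male, n_female))

# Each validation fold keeps the 3-to-1 ratio of the whole dataset
print(fold_counts)
```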

You may ask, isn't stratified k-fold always preferred to k-fold? This question is a little tricky to answer. When we use stratified k-fold, we assume that the class distribution of the original dataset is the true distribution. But this assumption may not hold. In most cases, we can't get data for the whole population; what we have is just a sample. The class distribution of the sample may not reflect the class distribution of the population. So if we make this assumption when splitting the data, we essentially add information that may not be true to the training process. We therefore need to pick a cross validation method based on the dataset and the problem we are trying to solve.

---
### Slide 5
#### Evaluate Model with Cross-Validation
```
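from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# Recent scikit-learn versions raise an error if random_state is set without shuffle=True
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)
# adult_model, features, and label are assumed to be defined earlier (see the lesson notebook)
score = cross_val_score(adult_model, features, label, cv=skf)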

```


---
### Slide 5 Script

The scikit-learn model_selection module has functions for cross validation. We can get model evaluation metrics with the cross_val_score function. By default, cross_val_score returns the accuracy score for a classification model and the R-squared score for a regression model. You can set the scoring argument to get other metric values. Please refer to the lesson notebook for coding details.
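For a version that runs on its own, the sketch below swaps in a synthetic dataset and a k-nearest neighbors classifier. These are assumptions made for illustration; the lesson notebook uses its own model and data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=23)
model = KNeighborsClassifier(n_neighbors=5)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)

# Without a scoring argument, a classifier is scored by accuracy
accuracy = cross_val_score(model, X, y, cv=skf)

# The scoring argument selects a different metric
recall = cross_val_score(model, X, y, cv=skf, scoring="recall")

# One score per fold; the mean summarizes them
print(accuracy.mean(), accuracy.std())
print(recall.mean())
```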

### Slide 6(Not needed)
#### Custom Scoring
```
precision_score = cross_val_score(adult_model, features, label, cv=skf, scoring='precision')
recall_score = cross_val_score(adult_model, features, label, cv=skf, scoring='recall')
auc_score = cross_val_score(adult_model, features, label, cv=skf, scoring='roc_auc')
```

### Slide 6 Script

---
## Lesson 3: Introduction to Model Selection

---
### Slide 1
#### Model Selection
- Use cross validation
- Grid Search
- Randomized Grid Search


---
### Slide 1 Script

In this lesson, we will discuss model selection techniques. Here model selection doesn't mean selecting the best model from different models. It actually means selecting proper hyperparameter values to optimize a model.

We talked about the difference between model hyperparameters and model parameters before, and I'd like to explain it again here. Model hyperparameters are the settings used to construct a model; they are determined before the model is trained. Examples are the value of k in k-nearest neighbors and the number of trees in a random forest. Model parameters, on the other hand, are determined by the training process. For example, the intercept and coefficients of a linear regression model are model parameters.
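The distinction can be shown in a few lines. This is a minimal sketch on made-up data: `n_neighbors` is set by us when the model is constructed, while `coef_` and `intercept_` only exist after training.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameter: chosen by us when the model is constructed
knc = KNeighborsClassifier(n_neighbors=5)
print(knc.get_params()["n_neighbors"])

# Parameters: learned from the data during training
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0  # points on the line y = 2x + 1
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)
```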

The goal of model selection is to find the best hyperparameter values to construct a model. Model selection is enabled by cross validation, which was introduced in the previous lesson.

In this lesson, we will introduce two model selection techniques: grid search and randomized grid search.

---
### Slide 2
#### Grid Search  (Shade lines while speaking)
```
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Recent scikit-learn versions raise an error if random_state is set without shuffle=True
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)
knc = KNeighborsClassifier()

# Create a dictionary of hyperparameters and values
neighbors = [1, 3, 5, 11, 17, 23, 31, 53, 107]
params = {'n_neighbors': neighbors}

# Create grid search cross validator
gse = GridSearchCV(estimator=knc, param_grid=params, cv=skf)
gse.fit(features, label)

# Retrieve the best value of n_neighbors
best_n_neighbors = gse.best_estimator_.get_params()["n_neighbors"]
```


### Slide 2 Script

The concept of grid search is very simple. We define a range of values for a hyperparameter of a model, then train the model with each hyperparameter value and get an evaluation score with cross validation. The hyperparameter value that leads to the best evaluation score is the best hyperparameter value.

In this piece of code, we first construct a stratified k-fold cross validation object and a k-nearest neighbors classifier. We then define candidate values for the classifier's n_neighbors hyperparameter and apply grid search with the GridSearchCV class defined in the scikit-learn model_selection module.

We can get the best hyperparameter value from the best_estimator_ attribute of the GridSearchCV object. The same value is also available in the best_params_ attribute.
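The slide code relies on `features` and `label` from the lesson notebook. A self-contained sketch of the same workflow is shown below, with a synthetic dataset and a shorter list of candidate values substituted as assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the notebook's features and label
X, y = make_classification(n_samples=300, n_features=8, random_state=23)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
params = {"n_neighbors": [1, 3, 5, 11, 17, 23]}

# Every candidate value is evaluated with 5-fold cross validation
gse = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params, cv=skf)
gse.fit(X, y)

print(gse.best_params_)  # hyperparameter value with the best mean score
print(gse.best_score_)   # mean cross-validated score for that value
```

By default, the search also refits the model on all the data with the winning value, so `gse.best_estimator_` is ready to make predictions.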

### Slide 3
#### Multi-dimensional Grid Search
```
neighbors = [1, 3, 5, 11, 17, 23, 31, 53, 107]
weights = ['uniform', 'distance']

knc = KNeighborsClassifier()
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)

params = {'n_neighbors':neighbors, 'weights':weights}
```

### Slide 3 Script

We can also apply grid search to more than one hyperparameter. For example, in this code sample, we define value ranges for two hyperparameters. Grid search will train the model with every combination of hyperparameter values. Here there are 9 values for n_neighbors and 2 values for weights, so there are 18 different combinations in total.
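The count can be checked with scikit-learn's `ParameterGrid`, which enumerates the combinations a grid search would try. The multiplication by 10 assumes the 10-fold cross validation used in the slides.

```python
from sklearn.model_selection import ParameterGrid

neighbors = [1, 3, 5, 11, 17, 23, 31, 53, 107]
weights = ["uniform", "distance"]
params = {"n_neighbors": neighbors, "weights": weights}

# Enumerate every combination of the two hyperparameters
grid = ParameterGrid(params)
n_combinations = len(grid)     # 9 values * 2 values
n_fits = n_combinations * 10   # one model fit per combination per fold

print(n_combinations, n_fits)
```

Each additional hyperparameter multiplies the number of combinations, and cross validation multiplies the number of model fits again.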

A model may have many hyperparameters, and a hyperparameter may have many different values. The total number of combinations can be very large, which makes grid search very time consuming and sometimes infeasible.

In that case, we can use a variation of grid search called randomized grid search.

### Slide 4
#### Randomized Grid Search
```
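from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

knc = KNeighborsClassifier()
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)

neighbors = list(range(1, 51))
weights = ['uniform', 'distance']
params = {'n_neighbors': neighbors, 'weights': weights}

# Run randomized search on 20 sampled combinations, using skf for cross validation
rscv = RandomizedSearchCV(knc, param_distributions=params, n_iter=20, cv=skf, random_state=23)
rscv.fit(features, label)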
```

### Slide 4 Script
The idea of randomized grid search is pretty simple. Instead of training a model on every hyperparameter combination, it randomly picks a fixed number of combinations to train the model on. This way, the training time depends on a predefined number of iterations. For example, in this code sample, n_iter is set to 20, which means that no matter how many different combinations there are, the randomized grid search will only train the model on 20 randomly picked combinations. The best combination among these 20 will be the winner.

Of course, randomized grid search is not guaranteed to find the best possible combination, but it can achieve a relatively good result with much less training time.
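A self-contained sketch of the randomized search is given below. The synthetic dataset and the 5-fold setting are assumptions chosen to keep the example fast; the search space matches the one on the slide.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=23)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
params = {"n_neighbors": list(range(1, 51)), "weights": ["uniform", "distance"]}

# 100 combinations exist, but only n_iter=20 of them are sampled and evaluated
rscv = RandomizedSearchCV(KNeighborsClassifier(), param_distributions=params,
                          n_iter=20, cv=skf, random_state=23)
rscv.fit(X, y)

print(len(rscv.cv_results_["params"]))  # number of combinations actually tried
print(rscv.best_params_)                # best combination among those tried
```

Raising `n_iter` trades more training time for a better chance of landing near the best combination.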

# Module 5 Review

### Slide 1
#### Module 5 Review

- Feature Selection
 - Filter Methods
 - Wrapper Methods
 - Embedded Methods
- Cross Validation
- Model Selection

---
## Review Script

In this module, we introduced model optimization techniques. Feature selection is used to select the best subset of features in the dataset. Model selection is used to find the best model hyperparameter values. Model selection is enabled by cross validation, which provides a more reliable way to evaluate a model.

For feature selection, you need to understand the different methods and how to apply them with Python code.

For cross validation, you need to understand the different cross validation methods and how to get cross validation scores for a model.

For model selection, you need to understand the difference between model hyperparameters and model parameters. You also need to be able to perform grid search and randomized grid search to select the best hyperparameter values.

If you understand all the Python code in this module, you will have no trouble finishing the assignment. But please remember to work on the problems in order.

Good luck.