## Scikit-Learn (Sklearn) Course

<span>
0. sklearn workflow overview<br>
1. preparing data (exploring, cleaning, transforming, reducing, splitting)<br>
2. selecting machine learning model / algorithm<br>
3. training algorithm and making predictions<br>
4. evaluating algorithm<br>
<span style="color:orange">5. improving the model</span><br>
6. saving and loading algorithm<br>
7. putting it all together
</span>

## 5. Improving the Model

#### General concepts

--- resources  
https://colab.research.google.com/drive/1ISey96a5Ag6z2CvVZKVqTKNWRwZbZl0m  
colab notebook correcting a model comparison error in the lesson videos

--- baseline  
first model = baseline model  
first prediction = baseline prediction

--- improving the model / data perspective  
collecting more data (more representative samples usually help the model generalize)  
improving current data by adding more informative features

--- correlation analysis  
when two features are highly correlated, they carry largely redundant information, so one of them may be removed from the model  
backward feature selection: removing features one at a time and checking whether model performance drops  
forward feature selection: adding features one at a time and checking whether model performance improves
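A minimal sketch of both ideas, on synthetic data rather than the heart disease dataset. The helper `drop_highly_correlated`, the 0.9 threshold, and the choice of `LogisticRegression` are illustrative assumptions, not part of the lesson.

```python
import numpy
import pandas
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression


def drop_highly_correlated(frame, threshold=0.9):
    """Drop the later feature of every pair whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    ### keeping only the upper triangle so every pair is checked once
    upper = corr.where(numpy.triu(numpy.ones(corr.shape, dtype=bool), k=1))
    to_drop = [column for column in upper.columns if (upper[column] > threshold).any()]
    return frame.drop(columns=to_drop)


### correlation analysis on a toy dataframe (hypothetical columns)
rng = numpy.random.default_rng(0)
a = rng.normal(size=200)
demo = pandas.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + rng.normal(scale=0.01, size=200),  # nearly duplicates "a"
    "b": rng.normal(size=200)})
reduced = drop_highly_correlated(demo)
print(list(reduced.columns))

### forward feature selection: greedily adding the feature that improves the cv score most
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2, direction="forward", cv=5)
selector.fit(X, y)
selected_mask = selector.get_support()
print(selected_mask)
```

Setting `direction="backward"` starts from all features and removes them one at a time instead.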

--- improving the model / algorithm perspective  
switching to a more suitable (often more complex) algorithm  
improving current algorithm with hyperparameter tuning

--- hyperparameters  
settings of the algorithm that the user chooses before training (they are not learned from the data)  
in sklearn, hyperparameters are passed as keyword arguments when the algorithm instance is created  
hyperparameters are detailed in the documentation of each algorithm
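A short sketch of how hyperparameters can be set and read on an sklearn estimator. The values 200, 5 and 10 are arbitrary illustrations.

```python
from sklearn.ensemble import RandomForestClassifier

### hyperparameters are set when the instance is created
forest = RandomForestClassifier(n_estimators=200, max_depth=5)

### get_params() returns all hyperparameters as a dictionary
params = forest.get_params()
print(params["n_estimators"], params["max_depth"])

### set_params() changes hyperparameters on an existing instance
forest.set_params(max_depth=10)
print(forest.get_params()["max_depth"])
```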

--- hyperparameter adjustment methods  
by hand (user guessing)  
randomized search cross validation (machine guessing: a fixed number of randomly sampled combinations)  
grid search cross validation (brute force: every combination in the grid)
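The difference in cost between the two search methods can be sketched by counting combinations. The grid below is the one used for the searches in this section; `ParameterGrid` and `ParameterSampler` are the sklearn helpers that enumerate and sample combinations, used here only for counting.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

grid = {
    "max_depth": [None, 5, 10, 20, 30],
    "max_features": ["sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 6],
    "n_estimators": [10, 100, 200, 500, 1000, 1200]}

### grid search tries every combination: 5 * 1 * 3 * 3 * 6
all_combinations = list(ParameterGrid(grid))

### randomized search tries only n_iter randomly sampled combinations
sampled_combinations = list(ParameterSampler(grid, n_iter=10, random_state=42))

print(len(all_combinations), len(sampled_combinations))  # 270 10
```

With `cv=5`, every combination is trained five times, so the full grid costs 1350 fits against 50 for the randomized search.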

#### Preparing data and evaluation function

In [None]:
### imports
import numpy
import pandas
from sklearn.model_selection import cross_val_score

In [None]:
### preparing data

### loading heart disease data into dataframe
heart_disease = pandas.read_csv("data-heart-disease.csv")

### splitting data features <> target
features = heart_disease.drop(columns="target")
target = heart_disease.loc[:, "target"]

In [None]:
### classification algorithm evaluation function
def evaluateAlgo(algorithm, features, target):
    """Print and return the mean 5-fold cross-validated classification metrics."""
    ### fixing the global numpy seed so algorithms without random_state give reproducible results
    numpy.random.seed(42)
    scorers = {"Accuracy": "accuracy", "Precision": "precision", "Recall": "recall", "F1 Score": "f1"}
    metrics_dict = {}
    for name, scoring in scorers.items():
        ### each metric is scored by a separate 5-fold cross validation run
        metrics_dict[name] = cross_val_score(estimator=algorithm, X=features, y=target, cv=5, scoring=scoring).mean()
        print(f"{name}: {100.0 * metrics_dict[name]:.3f}%")
    return metrics_dict
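An alternative sketch, assuming it is acceptable to swap `cross_val_score` for `cross_validate`: all four metrics are then computed in a single cross validation pass, so they come from the same fitted models. The function name `evaluate_once` and the synthetic data are hypothetical and only serve the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate


def evaluate_once(algorithm, features, target):
    """Return the mean 5-fold scores for four metrics from one cross validation run."""
    scoring = ["accuracy", "precision", "recall", "f1"]
    results = cross_validate(estimator=algorithm, X=features, y=target, cv=5, scoring=scoring)
    ### cross_validate stores each metric under the key "test_<metric name>"
    return {name: results[f"test_{name}"].mean() for name in scoring}


### synthetic stand-in for the heart disease data
demo_features, demo_target = make_classification(n_samples=200, n_features=8, random_state=0)
demo_metrics = evaluate_once(
    RandomForestClassifier(n_estimators=25, random_state=42), demo_features, demo_target)
print(demo_metrics)
```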

#### Tuning hyperparameters by hand

In [None]:
### imports
from sklearn.ensemble import RandomForestClassifier

In [None]:
### creating and evaluating baseline algorithm
algo_baseline = evaluateAlgo(RandomForestClassifier(n_jobs=-1), features, target)

In [None]:
### reading default hyperparameters of baseline algorithm
RandomForestClassifier(n_jobs=-1).get_params()

--- hyperparameters to adjust  
`max_depth=`  
`max_features=`  
`min_samples_leaf=`  
`min_samples_split=`  
`n_estimators=`

In [None]:
### creating and evaluating adjusted algorithm (max_depth=10)
algo_hand1 = evaluateAlgo(RandomForestClassifier(max_depth=10, n_jobs=-1), features, target)

In [None]:
### creating and evaluating adjusted algorithm (n_estimators=500)
algo_hand2 = evaluateAlgo(RandomForestClassifier(n_estimators=500, n_jobs=-1), features, target)
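The manual trials above can be generalized into a loop over candidate values. This is a sketch on synthetic data; the candidate `max_depth` values and `n_estimators=25` are arbitrary choices to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

### synthetic stand-in for the heart disease data
demo_features, demo_target = make_classification(n_samples=200, n_features=8, random_state=0)

### trying several max_depth values by hand and recording the mean cv accuracy
accuracy_by_depth = {}
for depth in [None, 5, 10]:
    algorithm = RandomForestClassifier(n_estimators=25, max_depth=depth, random_state=42)
    scores = cross_val_score(estimator=algorithm, X=demo_features, y=demo_target, cv=5, scoring="accuracy")
    accuracy_by_depth[depth] = scores.mean()

for depth, accuracy in accuracy_by_depth.items():
    print(f"max_depth={depth}: {100.0 * accuracy:.3f}%")
```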

#### Tuning hyperparameters with randomized search cross validation

In [None]:
### imports
from sklearn.model_selection import RandomizedSearchCV

In [None]:
### running randomized search

### creating random grid
random_grid = {
    "max_depth": [None, 5, 10, 20, 30],
    "max_features": ["sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 6],
    "n_estimators": [10, 100, 200, 500, 1000, 1200]}

### creating randomized search object (n_iter=10 combinations x cv=5 folds = 50 fits, plus one final refit)
classifier_rscv = RandomizedSearchCV(
    estimator=RandomForestClassifier(n_jobs=-1),
    param_distributions=random_grid,
    n_iter=10, cv=5, verbose=True)

### training randomized search object
numpy.random.seed(42)
classifier_rscv.fit(X=features, y=target);

In [None]:
### reading best parameters
classifier_rscv.best_params_

In [None]:
### evaluating randomized search object
algo_rscv = evaluateAlgo(classifier_rscv.best_estimator_, features, target)
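Besides `best_params_`, the fitted search object keeps the score of every tried combination in `cv_results_`. A sketch on synthetic data with a deliberately small, hypothetical grid:

```python
import pandas
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

### synthetic stand-in for the heart disease data
demo_features, demo_target = make_classification(n_samples=200, n_features=8, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={"max_depth": [None, 5, 10], "n_estimators": [10, 25, 50]},
    n_iter=4, cv=3, random_state=42)
search.fit(X=demo_features, y=demo_target)

### one row per sampled combination, best combination first
results = pandas.DataFrame(search.cv_results_)
ranking = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
ranking = ranking.sort_values("rank_test_score")
print(ranking)
```

Looking at the whole ranking shows whether the best combination clearly stands out or is only marginally ahead of the others.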

#### Tuning hyperparameters with grid search cross validation

In [None]:
### imports
from sklearn.model_selection import GridSearchCV

In [None]:
### running grid search

### creating search grid
search_grid = {
    "max_depth": [None, 5, 10, 20, 30],
    "max_features": ["sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 6],
    "n_estimators": [10, 100, 200, 500, 1000, 1200]}

### creating grid search object (5 x 1 x 3 x 3 x 6 = 270 combinations x cv=5 folds = 1350 fits, plus one final refit)
classifier_gscv = GridSearchCV(
    estimator=RandomForestClassifier(n_jobs=-1),
    param_grid=search_grid,
    cv=5, verbose=True)

### training grid search object
numpy.random.seed(42)
classifier_gscv.fit(X=features, y=target);

In [None]:
### reading best parameters
classifier_gscv.best_params_

In [None]:
### evaluating grid search object
algo_gscv = evaluateAlgo(classifier_gscv.best_estimator_, features, target)
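The evaluation above scores the tuned algorithm on the same data the search used to pick its hyperparameters, which tends to give optimistic results. A common alternative, sketched here on synthetic data with a small hypothetical grid, is to hold out a test set that the search never sees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

### synthetic stand-in for the heart disease data
demo_features, demo_target = make_classification(n_samples=300, n_features=8, random_state=0)

### holding out 20% of the data before any tuning takes place
X_train, X_test, y_train, y_test = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=42, stratify=demo_target)

### the search only sees the training portion
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [None, 5], "n_estimators": [10, 50]},
    cv=3)
search.fit(X=X_train, y=y_train)

### score() uses the refitted best estimator; for classifiers the default metric is accuracy
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, f"{100.0 * test_accuracy:.3f}%")
```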

#### Comparing algorithms

In [None]:
### creating metrics dataframe
metrics_df = pandas.DataFrame(data={
    "Baseline": algo_baseline,
    "max_depth=10": algo_hand1,
    "n_estimators=500": algo_hand2,
    "Random Search": algo_rscv,
    "Grid Search": algo_gscv})
metrics_df

In [None]:
### plotting metrics dataframe
metrics_df.plot.bar(figsize=(10,8));