## Scikit-Learn (Sklearn) Course

<span>
0. sklearn workflow overview<br>
1. preparing data (exploring, cleaning, transforming, reducing, splitting)<br>
2. selecting machine learning model / algorithm<br>
3. training algorithm and making predictions<br>
4. evaluating algorithm<br>
<span style="color:orange">5. improving model</span><br>
6. saving and loading algorithm<br>
7. putting it all together
</span>

## 5. Improving the Model

#### General concepts

--- baseline  
first model = baseline model  
first prediction = baseline prediction

--- improving model / data perspective  
collecting more data (more representative samples usually help)  
improving data, e.g. adding more informative features

--- improving model / algorithm perspective  
trying a different, possibly more complex algorithm  
improving current algorithm with hyperparameter tuning

--- hyperparameters  
settings of the algorithm that the user can adjust  
in sklearn, hyperparameters are the keyword arguments passed when creating the algorithm instance  
hyperparameters are detailed in the documentation of each algorithm
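
A minimal sketch of the idea above, assuming a random forest classifier as the algorithm; the specific values (`n_estimators=200`, `max_depth=10`, `max_depth=5`) are chosen only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# hyperparameters are passed as keyword arguments when creating the instance
classifier = RandomForestClassifier(n_estimators=200, max_depth=10)

# get_params() returns the current settings as a dictionary
params = classifier.get_params()
print(params["n_estimators"], params["max_depth"])

# set_params() changes settings on an existing instance
classifier.set_params(max_depth=5)
print(classifier.get_params()["max_depth"])
```

Hyperparameters are set before training; values learned during `fit()` are stored separately in attributes ending with an underscore.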

--- hyperparameter adjustment methods  
by hand (guessing)  
random search with randomized search cross validation  
exhaustive search (brute force) with grid search cross validation
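
The difference between the two search methods can be sketched by counting candidates. The grid below is hypothetical and only for illustration; `ParameterGrid` and `ParameterSampler` are sklearn helpers that enumerate or sample hyperparameter combinations.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# hypothetical search grid, chosen only for illustration
search_grid = {
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "n_estimators": [100, 200]}

# exhaustive search: every combination is a candidate (3 * 3 * 2 = 18)
grid_candidates = list(ParameterGrid(search_grid))
print(len(grid_candidates))

# random search: only n_iter combinations are drawn from the grid
random_candidates = list(ParameterSampler(search_grid, n_iter=5, random_state=42))
print(len(random_candidates))
```

With cross validation, each candidate is trained once per fold, so the exhaustive search grows quickly as values are added to the grid.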

#### Tuning hyperparameters by hand

In [None]:
### imports
import numpy, pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
### preparing data

### loading heart disease data into dataframe
heart_disease = pandas.read_csv("data-heart-disease.csv")

### shuffling heart disease dataframe
numpy.random.seed(42)
heart_disease = heart_disease.sample(frac=1.0)

### splitting data features <> target
features = heart_disease.drop(columns="target")
target = heart_disease.loc[:, "target"]

### splitting data train <> validation <> test (70% / 15% / 15%)
train_index = round(0.7 * heart_disease.index.size)
valid_index = round(0.85 * heart_disease.index.size)
features_train, target_train = features[:train_index], target[:train_index]
features_valid, target_valid = features[train_index:valid_index], target[train_index:valid_index]
features_test, target_test = features[valid_index:], target[valid_index:]
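
As an alternative sketch, a similar 70% / 15% / 15% split can probably be obtained by calling `train_test_split` twice. The data below is a synthetic stand-in (an assumption, not the heart disease dataset), and the `features_rest` / `target_rest` names are hypothetical.

```python
import pandas
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic stand-in for the heart disease data (assumption: 200 rows, 5 features)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
features = pandas.DataFrame(X)
target = pandas.Series(y, name="target")

# first split: 70% train, 30% held out
features_train, features_rest, target_train, target_rest = train_test_split(
    features, target, test_size=0.3, random_state=42)

# second split: held-out part halved into validation and test (15% each)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_rest, target_rest, test_size=0.5, random_state=42)

print(len(features_train), len(features_valid), len(features_test))
```

`train_test_split` shuffles by default, so no separate shuffling step is needed in this variant.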

In [None]:
### classification algorithm evaluation function
### target_preds = predicted labels, target_true = true labels (validation or test)
def evaluatePreds(target_preds, target_true):
    metrics_dict = {
        "Accuracy": accuracy_score(y_pred=target_preds, y_true=target_true),
        "Precision": precision_score(y_pred=target_preds, y_true=target_true),
        "Recall": recall_score(y_pred=target_preds, y_true=target_true),
        "F1 Score": f1_score(y_pred=target_preds, y_true=target_true)}
    ### printing each metric as a percentage
    for name, value in metrics_dict.items():
        print(f"{name}: {100.0 * value:.3f}%")
    return metrics_dict

In [None]:
### creating and evaluating baseline algorithm

### creating, training, predicting algorithm
numpy.random.seed(42)
classifier = RandomForestClassifier()
classifier.fit(X=features_train, y=target_train)
target_preds = classifier.predict(X=features_valid)

### evaluating algorithm
algo_baseline = evaluatePreds(target_preds, target_valid)

In [None]:
### reading default hyperparameters of baseline algorithm
classifier.get_params()

--- hyperparameters to adjust  
`max_depth=`  
`max_features=`  
`min_samples_leaf=`  
`min_samples_split=`  
`n_estimators=`

In [None]:
### creating and evaluating adjusted algorithm (max_depth=10)

### creating, training, predicting algorithm
numpy.random.seed(42)
classifier = RandomForestClassifier(max_depth=10)
classifier.fit(X=features_train, y=target_train)
target_preds = classifier.predict(X=features_valid)

### evaluating algorithm
algo_hand1 = evaluatePreds(target_preds, target_valid)

In [None]:
### creating and evaluating adjusted algorithm (n_estimators=500)

### creating, training, predicting algorithm
numpy.random.seed(42)
classifier = RandomForestClassifier(n_estimators=500)
classifier.fit(X=features_train, y=target_train)
target_preds = classifier.predict(X=features_valid)

### evaluating algorithm
algo_hand2 = evaluatePreds(target_preds, target_valid)
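
Tuning by hand can also be sketched as a small loop that collects one row per experiment, which makes the runs easier to compare. This sketch assumes synthetic stand-in data and a hand-picked list of `n_estimators` values; neither comes from the heart disease notebook.

```python
import pandas
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data (assumption, not the heart disease dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# trying a few hand-picked values and scoring each on the validation set
rows = []
for n_estimators in [10, 50, 100]:
    classifier = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    classifier.fit(X_train, y_train)
    accuracy = accuracy_score(y_true=y_valid, y_pred=classifier.predict(X_valid))
    rows.append({"n_estimators": n_estimators, "accuracy": accuracy})

# one row per experiment
comparison = pandas.DataFrame(rows)
print(comparison)
```

Scoring on the validation set keeps the test set untouched until a final configuration has been chosen.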

#### Tuning hyperparameters with randomized search cross validation

In [None]:
### imports
import numpy, pandas
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

In [None]:
### preparing data

### loading heart disease data into dataframe
heart_disease = pandas.read_csv("data-heart-disease.csv")

### splitting data features <> target
features = heart_disease.drop(columns="target")
target = heart_disease.loc[:, "target"]

### splitting data train <> test
numpy.random.seed(42)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)

In [None]:
### running randomized search

### creating search grid
search_grid = {
    "max_depth": [None, 5, 10, 20, 30],
    "max_features": ["sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 6],
    "n_estimators": [10, 100, 200, 500, 1000, 1200]}

### creating randomized search algorithm
numpy.random.seed(42)
rscv_classifier = RandomizedSearchCV(
    estimator=RandomForestClassifier(n_jobs=-1),
    param_distributions=search_grid,
    n_iter=10, cv=5, verbose=2)

### training randomized search algorithm
### (fitting on the training split only, so the test split stays unseen)
rscv_classifier.fit(X=features_train, y=target_train);

In [None]:
### reading best parameters
rscv_classifier.best_params_
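
Besides `best_params_`, a fitted search also exposes `cv_results_` and `best_score_`. A minimal sketch, assuming synthetic stand-in data and a deliberately small hypothetical grid so it runs quickly:

```python
import pandas
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in data (assumption, not the heart disease dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# small hypothetical grid: 3 * 2 = 6 combinations, 4 of them sampled
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={"max_depth": [None, 5, 10], "n_estimators": [10, 50]},
    n_iter=4, cv=3, random_state=42)
search.fit(X, y)

# one row per sampled combination, best rank first
results = pandas.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))

# mean cross validation score of the best combination
print(search.best_score_)
```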

In [None]:
### evaluating algorithm
target_preds = rscv_classifier.predict(features_test)
algo_rscv = evaluatePreds(target_preds, target_test)
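
The exhaustive counterpart listed in the general concepts, grid search cross validation, follows the same pattern. A minimal sketch with `GridSearchCV`, assuming synthetic stand-in data and a small hypothetical grid (the `gscv_classifier` name is also hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in data (assumption, not the heart disease dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# small hypothetical grid: 2 * 2 = 4 combinations, each fitted cv=3 times
search_grid = {"max_depth": [5, 10], "n_estimators": [10, 50]}

# creating and training grid search algorithm on the training split only
gscv_classifier = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=search_grid, cv=3)
gscv_classifier.fit(X_train, y_train)

# best combination and its accuracy on the unseen test split
print(gscv_classifier.best_params_)
print(gscv_classifier.score(X_test, y_test))
```

A common approach is to narrow the grid around the values a randomized search found first, since every added value multiplies the number of fits.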

#### Creating regression model

In [None]:
### imports
import numpy, pandas
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [None]:
### preparing data

### loading california housing dataset
housing_dict = fetch_california_housing()

### creating california housing dataframe
housing_df = pandas.DataFrame(data=housing_dict["data"], columns=housing_dict["feature_names"])
housing_df["MedHouseVal"] = housing_dict["target"]

### splitting data features/target
features = housing_df.drop(columns="MedHouseVal")
target = housing_df.loc[:, "MedHouseVal"]

### splitting data train/test
numpy.random.seed(42)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)
target_test: pandas.Series  ### type hint only: train_test_split keeps the pandas type of its inputs

In [None]:
### random forest regressor training and prediction

### instantiating model
numpy.random.seed(42)
regressor = RandomForestRegressor(n_estimators=100)

### training model / prediction
regressor.fit(X=features_train, y=target_train)
target_preds = regressor.predict(X=features_test)
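
The predictions could then be scored with regression metrics. A minimal sketch, assuming synthetic stand-in data from `make_regression` instead of the California housing dataset (which needs a download), and R^2 plus mean absolute error as the metrics of choice:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data (assumption, not the california housing dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# training model / prediction
regressor = RandomForestRegressor(n_estimators=50, random_state=42)
regressor.fit(X_train, y_train)
preds = regressor.predict(X_test)

# R^2: 1.0 is a perfect fit, 0.0 matches always predicting the mean
r2 = r2_score(y_true=y_test, y_pred=preds)

# mean absolute error: average distance between prediction and truth
mae = mean_absolute_error(y_true=y_test, y_pred=preds)

print(f"R^2: {r2:.3f}")
print(f"MAE: {mae:.3f}")
```

`regressor.score(X_test, y_test)` returns the same R^2 value, so it can serve as a shortcut.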