## Scikit-Learn (sklearn) Course

<span>
0. sklearn workflow overview<br>
1. preparing data (collecting, exploring, cleaning, transforming, reducing, splitting)<br>
2. defining problem / selecting machine learning model<br>
3. training model and making predictions<br>
<span style="color:orange">4. evaluating model</span><br>
5. improving model<br>
6. saving and loading model<br>
7. putting it all together
</span>

## 4. Evaluating Model

#### Concepts

--- resources  
[sklearn documentation > model evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html)  
[statquest youtube video: ROC and AUC explained](https://www.youtube.com/watch?v=4jRBRDbJemM)  
[sklearn documentation > ROC curve for multiclass classification models](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html)  
<span style="color:red">>>> Load and prepare data</span>

--- sklearn built-in evaluation methods  
`model.score()` method  
cross validation with `scoring=` parameter  
metric functions

--- default `model.score()` and cross validation metrics  
classification models: mean accuracy (correct predictions / all predictions)  
regression models: R^2 - [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)
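
Both default metrics can be reproduced by hand. The sketch below uses made-up toy values (not taken from the course data) to show what the two formulas compute.

```python
# Hypothetical toy labels, chosen only for illustration.
class_targets = [1, 0, 1, 1, 0]
class_predictions = [1, 0, 0, 1, 0]

# Mean accuracy: correct predictions / all predictions.
correct = sum(t == p for t, p in zip(class_targets, class_predictions))
accuracy = correct / len(class_targets)

# Hypothetical toy regression values.
reg_targets = [3.0, 5.0, 7.0, 9.0]
reg_predictions = [2.5, 5.0, 7.5, 9.0]

# R^2 = 1 - (residual sum of squares / total sum of squares).
mean_target = sum(reg_targets) / len(reg_targets)
ss_res = sum((t - p) ** 2 for t, p in zip(reg_targets, reg_predictions))
ss_tot = sum((t - mean_target) ** 2 for t in reg_targets)
r_squared = 1 - ss_res / ss_tot

print(accuracy, r_squared)
```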

--- cross validation  
creates `cv=k` different train/test splits from the same dataset (k-fold cross validation)  
trains and scores the model on every split > each sample is used for testing exactly once and for training k-1 times  
scoring metric is defined by the `scoring=` parameter (`scoring=None` invokes the default scorer)
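
A minimal sketch of the `scoring=` parameter, assuming a synthetic dataset from `make_classification` and a `LogisticRegression` model (both are stand-ins, not the course's heart disease setup).

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical; stands in for any features/target pair).
features, target = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000)

# scoring=None falls back to model.score(), i.e., mean accuracy for classifiers.
default_scores = cross_val_score(model, features, target, cv=5, scoring=None)

# Naming the metric explicitly; "roc_auc" is one of several other scorer names.
accuracy_scores = cross_val_score(model, features, target, cv=5, scoring="accuracy")
roc_auc_scores = cross_val_score(model, features, target, cv=5, scoring="roc_auc")

# One score per split, usually summarized by the mean.
print(default_scores, numpy.mean(default_scores))
```

For a classifier, the default scores and the `scoring="accuracy"` scores should match, since both measure mean accuracy on the same splits.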

--- classification model metrics  
**mean accuracy** correct predictions / all predictions  
**receiver operating characteristic (ROC) curve**  see below  
**area under the ROC curve (AUC)** see below  
**confusion matrix**  see below  
**classification report** precision, recall, and f1-score per class
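
The classification report is not used later in this notebook, so here is a short sketch with `sklearn.metrics.classification_report` on made-up labels (the values are illustrative only).

```python
from sklearn.metrics import classification_report

# Hypothetical toy targets and predictions.
toy_targets = [1, 1, 1, 1, 0, 0, 0, 0]
toy_predictions = [1, 1, 1, 0, 1, 0, 0, 0]

# Text table with precision, recall, f1-score, and support per class.
print(classification_report(toy_targets, toy_predictions))

# Same content as a nested dictionary, convenient for further processing.
report = classification_report(toy_targets, toy_predictions, output_dict=True)
print(report["1"]["precision"], report["1"]["recall"], report["accuracy"])
```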

--- receiver operating characteristic (ROC) curve  
plots true positive rate (tpr) against false positive rate (fpr)  
tpr = sensitivity = recall: true positive predictions / all positive targets  
fpr = 1 - specificity: false positive predictions / all negative targets  
suitable for binary classification models  
visualizes the effect of varying the algorithm decision threshold  
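
To make the threshold effect concrete, the sketch below computes one (tpr, fpr) point per threshold in plain Python. The helper `rates_at_threshold` and the toy values are hypothetical, written only for this illustration.

```python
def rates_at_threshold(targets, positive_probs, threshold):
    """Returns (tpr, fpr) when probabilities >= threshold count as positive."""
    predictions = [int(prob >= threshold) for prob in positive_probs]
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn)  # true positives / all positive targets
    fpr = fp / (fp + tn)  # false positives / all negative targets
    return tpr, fpr

# Hypothetical toy targets and predicted positive-class probabilities.
toy_targets = [0, 0, 1, 1]
toy_probs = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves the point up and to the right on the curve.
for threshold in (0.8, 0.5, 0.3, 0.0):
    print(threshold, rates_at_threshold(toy_targets, toy_probs, threshold))
```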


--- ROC curve for multiclass classification models  
a ROC curve works with binary output, so multiclass output must be binarized  
one-vs-rest: comparing each class to all the others  
one-vs-one: comparing every pairwise combination of classes
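
A sketch of both strategies, assuming the iris dataset and a `LogisticRegression` model as stand-ins (neither appears elsewhere in this notebook). `label_binarize` produces the one-vs-rest columns, and `roc_auc_score` accepts `multi_class="ovr"` or `multi_class="ovo"`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Three-class example data (assumption: iris, used only for illustration).
iris_features, iris_target = load_iris(return_X_y=True)
iris_train, iris_test, iris_target_train, iris_target_test = train_test_split(
    iris_features, iris_target, test_size=0.3, random_state=42, stratify=iris_target
)

model = LogisticRegression(max_iter=1000).fit(iris_train, iris_target_train)
class_probs = model.predict_proba(iris_test)  # one probability column per class

# One-vs-rest binarization: one 0/1 target column per class.
binarized_targets = label_binarize(iris_target_test, classes=[0, 1, 2])
per_class_auc = [
    roc_auc_score(binarized_targets[:, k], class_probs[:, k]) for k in range(3)
]

# roc_auc_score can also apply either strategy directly.
ovr_auc = roc_auc_score(iris_target_test, class_probs, multi_class="ovr")
ovo_auc = roc_auc_score(iris_target_test, class_probs, multi_class="ovo")

print(per_class_auc, ovr_auc, ovo_auc)
```

With the default macro averaging, the one-vs-rest score is the unweighted mean of the per-class scores.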


--- area under the ROC curve (AUC)  
integral of the ROC curve > ranges from 0.0 to 1.0 (0.5 corresponds to random guessing, 1.0 to a perfect classifier)  
used to compare the performance of different algorithms
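
As a small check of the "integral" description, the sketch below integrates the ROC points with `sklearn.metrics.auc` (trapezoidal rule) and compares the result to `roc_auc_score`. The toy values are made up for illustration.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical toy targets and predicted positive-class probabilities.
toy_targets = [0, 0, 1, 1]
toy_probs = [0.1, 0.4, 0.35, 0.8]

# ROC points, then the area under them via the trapezoidal rule.
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_targets, toy_probs)
area_from_curve = auc(toy_fpr, toy_tpr)

# Same quantity computed in one call.
area_direct = roc_auc_score(toy_targets, toy_probs)

print(area_from_curve, area_direct)
```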

--- confusion matrix  
a quick way to compare predictions to targets  
gives an idea of where the model is confused
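
A minimal sketch with `sklearn.metrics.confusion_matrix` on made-up labels. Rows are targets and columns are predictions, with classes in sorted order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy targets and predictions.
toy_targets = [1, 1, 1, 1, 0, 0, 0, 0]
toy_predictions = [1, 1, 1, 0, 1, 0, 0, 0]

# For binary labels 0/1 the layout is [[tn, fp], [fn, tp]].
matrix = confusion_matrix(toy_targets, toy_predictions)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print("false positives:", fp, "false negatives:", fn)
```

The off-diagonal cells show where the model is confused: here one negative predicted as positive and one positive predicted as negative.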

--- coding tricks within jupyter notebook  
**`!command`, e.g., `!dir`** runs a terminal command within jupyter notebook (`!dir` on Windows, `!ls` on macOS/Linux)  
**`sklearn.__version__`** displays version of installed module

#### Creating classification model

In [None]:
### imports ------------------------------------------------------------------------------------------------------------

import numpy, pandas

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [None]:
### preparing data -----------------------------------------------------------------------------------------------------

### loading heart disease classification data into dataframe
heart_disease = pandas.read_csv("data-heart-disease.csv")

### splitting data features/target
features = heart_disease.drop(columns="target")
target = heart_disease.loc[:, "target"]

### splitting data train/test
numpy.random.seed(42)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)

In [None]:
### creating random forest classifier ----------------------------------------------------------------------------------

### instantiating model
numpy.random.seed(42)
classifier = RandomForestClassifier(n_estimators=100)

### training model
classifier.fit(features_train, target_train);

#### Evaluating classification model

#### Note: Update kwargs!!!

In [None]:
### imports ------------------------------------------------------------------------------------------------------------

from matplotlib import pyplot

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, roc_auc_score, ConfusionMatrixDisplay

In [None]:
### evaluating model with model.score() method on training data --------------------------------------------------------
classifier.score(X=features_train, y=target_train)

In [None]:
### evaluating model with model.score() method on test data ------------------------------------------------------------
classifier.score(X=features_test, y=target_test)

In [None]:
### evaluating model with default score (accuracy) cross validation ----------------------------------------------------
cv_list = cross_val_score(estimator=classifier, X=features, y=target, cv=5, scoring=None)
cv_mean = numpy.mean(cv_list)
cv_list, cv_mean

In [None]:
### function for plotting ROC curve ------------------------------------------------------------------------------------

### function init
def plotRoc(fpr, tpr):
    """
    Plots ROC curve, i.e., true positive rate (tpr) over false positive rate (fpr)
    """

    ### plotting ROC curve
    pyplot.plot(fpr, tpr, color="orange", label="ROC curve")

    ### plotting baseline
    pyplot.plot([0,1], [0,1], color="blue", linestyle="--", label="Guessing")

    ### customizing plot
    pyplot.title("Receiver Operating Characteristic (ROC) Curve")
    pyplot.ylabel("True Positive Rate")
    pyplot.xlabel("False Positive Rate")
    pyplot.legend()

    ### rendering plot
    pyplot.show()

    ### function termination
    return

In [None]:
### evaluating model with ROC curve ------------------------------------------------------------------------------------
predict_positive_probs = classifier.predict_proba(features_test)[:, 1]
model_fpr, model_tpr, model_thresholds = roc_curve(target_test, predict_positive_probs)
perfect_fpr, perfect_tpr, perfect_threshold = roc_curve(target_test, target_test)
plotRoc(model_fpr, model_tpr)
plotRoc(perfect_fpr, perfect_tpr)

In [None]:
### evaluating model with AUC score ------------------------------------------------------------------------------------
model_auc = roc_auc_score(target_test, predict_positive_probs)
perfect_auc = roc_auc_score(target_test, target_test)
model_auc, perfect_auc

In [None]:
### evaluating model with confusion matrix -----------------------------------------------------------------------------
target_preds = classifier.predict(features_test)
pandas.crosstab(target_test, target_preds, rownames=["Targets"], colnames=["Predictions"])

In [None]:
### visualizing confusion matrix with sklearn --------------------------------------------------------------------------
ConfusionMatrixDisplay.from_predictions(y_true=target_test, y_pred=target_preds);

#### Evaluating regression models

In [None]:
### imports ------------------------------------------------------------------------------------------------------------

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

In [None]:
### preparing data -----------------------------------------------------------------------------------------------------

### loading california housing dataset
housing_dict = fetch_california_housing()

### creating california housing dataframe
housing_df = pandas.DataFrame(data=housing_dict["data"], columns=housing_dict["feature_names"])
housing_df["MedHouseVal"] = housing_dict["target"]

### splitting data features/target
features = housing_df.drop(columns="MedHouseVal")
target = housing_df.loc[:, "MedHouseVal"]

### splitting data train/test
numpy.random.seed(42)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)

In [None]:
### creating random forest regressor -----------------------------------------------------------------------------------

### instantiating model
numpy.random.seed(42)
regressor = RandomForestRegressor(n_estimators=100)

### training model
regressor.fit(features_train, target_train);

In [None]:
### evaluating model with .score() method ------------------------------------------------------------------------------
regressor.score(features_test, target_test)

In [None]:
### predicting with .predict() method ----------------------------------------------------------------------------------
target_prediction = regressor.predict(features_test)
target_prediction[:10]

In [None]:
### comparing predictions to true values / metrics.mean_absolute_error function ----------------------------------------
from sklearn.metrics import mean_absolute_error

mean_absolute_error(target_test, target_prediction)