In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Choose Estimator

**Notes**
* Sklearn refers to machine learning models and algorithms as **estimators**
* Types of Problems:
    * Classification Problem: Predicting a category, ex. heart disease or not, red or blue
        * Classification estimators are often stored in a variable named `clf` (short for classifier)
    * Regression Problem: Predicting a number, ex. a price

To choose an estimator, refer to the map below.

<img alt="Sklearn Map" src="images/sklearn-map.png" width="80%"/>

If you try a model and it doesn't give you the score you need, try a different model. Choose from the map.

An **ensemble model** combines the predictions of several smaller models into one prediction, like taking the advice of 10 doctors instead of one.

A **random forest model** is an ensemble model that passes the input through multiple decision trees (`n_estimators` = number of trees). Each tree returns a prediction, and the forest combines them into the final prediction: typically the majority class for classification, or the average of the trees' outputs for regression.
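A minimal sketch of the idea above, using a small synthetic dataset from `make_classification` as a stand-in (an assumption, not this notebook's data). It shows that `n_estimators` controls how many trees are built and that each tree can be asked for its own prediction. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes, so this is an illustration of the concept, not of the exact internals.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the notebook's dataset)
x_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# n_estimators sets how many decision trees the forest builds
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(x_demo, y_demo)

# Each fitted tree is stored in the `estimators_` attribute
n_trees = len(forest.estimators_)
print("Number of trees:", n_trees)

# Ask every tree for its own prediction on the first row
tree_votes = [int(tree.predict(x_demo[:1])[0]) for tree in forest.estimators_]
print("Individual tree predictions:", tree_votes)
print("Forest prediction:", forest.predict(x_demo[:1])[0])
```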

**Note to Self**: Dive deeper into **preprocessors**, **pipelines**, and **ColumnTransformer**

**Tidbit**:
* If you are working with structured data, use ensemble models
* If you are working with unstructured data, ex. images, use deep learning or transfer learning

**Self Note**: What are deep learning and transfer learning?

___

## Fit a model to data

**Note**: Different names for:
* `x`: features, feature variables, data
* `y`: labels, targets, target variables, ground truth

**Self Note**: More terms to note? estimators, preprocessors, pipelines, etc.? What is a model?

Fitting a model to data makes the model go through the example data in `x_train` and the corresponding targets in `y_train` and try to figure out the pattern between them (what combination of values in `x_train` could lead to this specific value in `y_train`?), similar to how your brain tries to find patterns in things. How it finds the pattern depends on the type of model you choose.
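As a rough sketch of what fitting looks like in code, assuming synthetic data from `make_regression` in place of a real dataset (the `_demo` names are made up for this example): the data is split, and `fit()` is given only the training portion.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
x_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hold out 20% of the rows for testing; the model never sees them during fit
x_train_demo, x_test_demo, y_train_demo, y_test_demo = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

# fit() learns the relationship between the training features and targets
model_demo = RandomForestRegressor(n_estimators=20, random_state=0)
model_demo.fit(x_train_demo, y_train_demo)

print("Training rows:", x_train_demo.shape[0])
print("Test rows:", x_test_demo.shape[0])
print("Features seen during fit:", model_demo.n_features_in_)
```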

___

## Use the model to predict

Two ways of predicting:
1. `predict()`
2. `predict_proba()`

The model needs to be passed data with the same structure it was trained on (the same feature columns, in the same order) in order to predict.

We compare the ground truth `y_test` to the predicted values to see how accurate the predictions were. This helps us evaluate the model.
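The comparison described above can be sketched as follows, assuming a synthetic classification dataset (not this notebook's data). The fraction of predictions that match the ground truth should agree with what the classifier's own `score()` reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
x_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(x_tr, y_tr)

# Predict on test data that has the same columns the model was trained on
y_preds = clf.predict(x_te)

# Compare predictions with the ground truth: fraction of matching labels
manual_accuracy = np.mean(y_preds == y_te)
print("Manual comparison:", manual_accuracy)
print("clf.score():      ", clf.score(x_te, y_te))
```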

**Trying to confirm a theory...**  
Does setting the random seed make the model score fixed?

In [10]:
from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing()
x = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)

x

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 |


In [18]:
y = pd.Series(california_housing.target)

y

0        4.526
1        3.585
2        3.521
3        3.413
4        3.422
         ...  
20635    0.781
20636    0.771
20637    0.923
20638    0.847
20639    0.894
Length: 20640, dtype: float64

In [19]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train, y_train)

model.score(x_test, y_test)

0.8113492879523005

In [22]:
# Scoring again
model.score(x_test, y_test)

0.8113492879523005

Okay...

In [23]:
# Fit again
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.8129310591248636

In [24]:
# Aha, let's try again
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.8113784828959737

Okay, I'll try setting the random seed

In [26]:
np.random.seed(42)
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.8124166785360362

In [27]:
# Now again...
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.810691739062182

Hmm... what if I set the seed in every cell?

In [28]:
np.random.seed(42)
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.8124166785360362

In [29]:
# Again
np.random.seed(42)
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.8124166785360362

**Theory Confirmed**  
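Setting the random seed right before fitting trains the model the same way every time, so it gives the same score no matter how many times you retrain it. The seed has to be set again before each `fit()` call: in cell 27, fitting a second time without resetting the seed gave a different score. This should help when writing notes in a Jupyter notebook and returning to them later, since I'll have to rerun the cells.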
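An alternative worth trying, sketched here on synthetic data (an assumption, not the California housing data used above): scikit-learn estimators and `train_test_split` accept a `random_state` parameter, which pins the randomness to the object itself instead of relying on the global NumPy seed being reset before every `fit()`. The helper `fit_and_score` is a made-up name for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
x_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# random_state makes the split itself repeatable
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)


def fit_and_score():
    """Hypothetical helper: train a fresh forest with a fixed random_state and score it."""
    model = RandomForestRegressor(n_estimators=20, random_state=42)
    model.fit(x_tr, y_tr)
    return model.score(x_te, y_te)


first_score = fit_and_score()
second_score = fit_and_score()
print("First fit: ", first_score)
print("Second fit:", second_score)
print("Same score:", first_score == second_score)
```

With `random_state` set on the estimator, no call to `np.random.seed()` is needed between fits.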

**predict vs predict_proba**  
`predict` returns the model's predicted target value for each row.  
`predict_proba` (available on classifiers) returns, for each row of data, the probability of each possible class. If the output is either 0 or 1, it gives the probability that the record's output is 0 and the probability that it is 1.  

`predict_proba` helps us see whether the model was confident in its answers (one probability very high, the others very low) or not (almost equal probabilities, where the class with the slightly higher probability won).
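A short sketch of the difference, assuming a synthetic binary classification dataset (not this notebook's data). Each row of the `predict_proba` output sums to 1, and the columns follow the order of `clf.classes_`, so picking the most probable class per row should reproduce what `predict` returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
x_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(x_tr, y_tr)

# predict: one label per row
labels = clf.predict(x_te[:5])

# predict_proba: one probability per class per row (columns ordered as clf.classes_)
probas = clf.predict_proba(x_te[:5])

# The predicted label is the class with the highest probability
labels_from_probas = clf.classes_[np.argmax(probas, axis=1)]

print("Classes:", clf.classes_)
print("Labels: ", labels)
print("Probabilities:\n", probas)
```

Rows where the two probabilities are close to 0.5 are the ones the model was least sure about.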

___

## Evaluating the model

Three ways to evaluate a model:  
1. Estimator's built-in `score()` method
2. The `scoring` parameter
3. Problem-specific metric functions
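The three options listed above, sketched side by side on synthetic data (an assumption, not this notebook's dataset). For a classifier, `score()` and `accuracy_score` should report the same number, while `cross_val_score` with `scoring="accuracy"` returns one accuracy value per fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
x_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(x_tr, y_tr)

# 1. Built-in score() method (mean accuracy for classifiers)
built_in = clf.score(x_te, y_te)

# 2. The `scoring` parameter, here passed to cross_val_score
cv_scores = cross_val_score(clf, x_demo, y_demo, cv=5, scoring="accuracy")

# 3. A problem-specific metric function from sklearn.metrics
metric = accuracy_score(y_te, clf.predict(x_te))

print("score():        ", built_in)
print("scoring param:  ", cv_scores)
print("accuracy_score: ", metric)
```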

### Cross-Validation

`cross_val_score()` takes 4 main parameters:  
1. the estimator
2. x (entire x)
3. y (entire y)
4. cv: number of folds, default 5

#### What it does:
1. Splits `x` and `y` into 5 roughly equal parts called "folds" (with the default `cv=5`)
2. Trains the model on 4 folds, tests it on the 5th, then scores it
3. Repeats the process using a different fold as the test set
4. Continues until it has used every fold as the test set once
5. Returns an array containing all scores
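The steps above can be sketched by hand with `KFold` and compared against `cross_val_score`. This is a rough illustration on synthetic regression data (an assumption, not this notebook's dataset), not the library's actual implementation. For a regressor, an integer `cv` corresponds to unshuffled `KFold` splits; for classifiers scikit-learn uses stratified folds instead, so the manual loop below would need `StratifiedKFold` there.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the notebook's dataset)
x_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0)

manual_scores = []
# Steps 1-4: split into 5 folds, train on 4, test on the remaining one, repeat
for train_idx, test_idx in KFold(n_splits=5).split(x_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(x_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(x_demo[test_idx], y_demo[test_idx]))

# Step 5: the library call returns the same kind of array of per-fold scores
library_scores = cross_val_score(model, x_demo, y_demo, cv=5)

print("Manual scores: ", np.round(manual_scores, 4))
print("Library scores:", np.round(library_scores, 4))
print("Mean score:    ", np.mean(library_scores))
```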

#### What you can do:
* You can take the average of all scores to evaluate the model
* Change the number of folds by specifying `cv`

#### Why use it?
Because one test split might give you a biased result. Using multiple splits for testing gives us a more reliable and fairer evaluation.

### Classification Model Evaluation Metrics

```mermaid
graph LR
A((Accuracy)) --- B
B((Area under ROC curve)) --- C
C((Confusion matrix)) --- D
D((Classification report))
```

#### Accuracy

The same metric we've been using so far: a classifier's `score()` method returns mean accuracy by default.
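Accuracy is simply the fraction of predictions that match the true labels. A tiny worked example with made-up labels (hypothetical values, not model output), checked against `accuracy_score`:

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# Count matching positions: 4 of the 5 predictions are correct
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy)   # 0.8
print(library_accuracy)  # 0.8
```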

#### Area Under The Receiver Operating Characteristic Curve (AUC/ROC)