## 2. Choosing the right algorithm for the problem

Some things to note:
* Scikit-Learn (`sklearn`) refers to machine learning models and algorithms as estimators
* Classification problem - predicting a category
  * `clf` (short for classifier) is commonly used as the variable name for a classification estimator
* Regression problem - predicting a number
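
To make the distinction concrete, here is a minimal sketch showing that a classification estimator and a regression estimator share the same `fit()` / `predict()` interface but return different kinds of output. The synthetic data and the names `x_clf`, `x_reg`, `clf_preds`, and `reg_preds` are assumptions for illustration only and are not part of the notebook's datasets.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge

# Synthetic data stands in for a real dataset (illustrative assumption).
x_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
x_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Classification estimator: predicts a category (here 0 or 1).
clf = LogisticRegression()
clf.fit(x_clf, y_clf)
clf_preds = clf.predict(x_clf[:5])

# Regression estimator: predicts a number.
reg = Ridge()
reg.fit(x_reg, y_reg)
reg_preds = reg.predict(x_reg[:5])

print(clf_preds)  # class labels
print(reg_preds)  # continuous values
```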

### 2.1 Picking a machine learning model for a regression problem

Using California housing dataset as an example.

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [None]:
housing = fetch_california_housing()
housing

In [None]:
housing_df = pd.DataFrame(data=housing.data, columns=housing.feature_names)
housing_df["target"] = housing.target # Median house value, in units of $100,000

In [None]:
housing_df.head()

In [None]:
np.random.seed(42)
x = housing_df.drop("target", axis=1)
y = housing_df["target"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2
)

x_train.shape, x_test.shape, y_train.shape, y_test.shape
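
The cell above relies on `np.random.seed(42)` to make the split repeatable. As an alternative sketch, `train_test_split` also accepts a `random_state` parameter, which pins the shuffle for that one call without touching NumPy's global random state. The toy arrays `x_demo` and `y_demo` below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (illustrative assumption).
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state produce identical splits.
split_a = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)  # train and test feature shapes
```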

In [None]:
from sklearn.linear_model import Ridge

model = Ridge()
model.fit(x_train, y_train)
model.score(x_test, y_test)

In [None]:
from sklearn.linear_model import ElasticNet

model = ElasticNet()
model.fit(x_train, y_train)
model.score(x_test, y_test)

In [None]:
from sklearn.linear_model import Lasso

model = Lasso()
model.fit(x_train, y_train)
model.score(x_test, y_test)

In [None]:
# from sklearn.svm import SVR

# model = SVR(kernel="linear")
# model.fit(x_train, y_train)
# model.score(x_test, y_test)

In [None]:
# from sklearn.svm import SVR

# model = SVR(kernel="rbf")
# model.fit(x_train, y_train)
# model.score(x_test, y_test)

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)
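
The cells above fit one regressor at a time. A rough sketch of the same comparison as a loop is shown below. For regressors, `score()` returns the coefficient of determination (R^2), where 1.0 is a perfect fit. The synthetic data and the names `candidates` and `results` are assumptions for illustration, so the numbers will not match the California housing results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative assumption).
x_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)

# Candidate estimators, all with the same fit/score interface.
candidates = {
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each estimator on the training set and record its R^2 on the test set.
results = {}
for name, estimator in candidates.items():
    estimator.fit(x_tr, y_tr)
    results[name] = estimator.score(x_te, y_te)

# Print the candidates from best to worst test score.
for name, r2 in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R^2 = {r2:.3f}")
```

Keeping the estimators in a dictionary makes it easy to add or remove a candidate without repeating the fit and score calls.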

### 2.2 Picking a machine learning model for a classification problem

Using the heart disease dataset as an example.

In [None]:
heart_disease = pd.read_csv("../ztm-ml/data/heart-disease.csv")
heart_disease.head()

In [None]:
np.random.seed(42)
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2
)

In [None]:
from sklearn.svm import LinearSVC

model = LinearSVC()
model.fit(x_train, y_train)
model.score(x_test, y_test)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=10)
model.fit(x_train, y_train)
model.score(x_test, y_test)

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=50)
model.fit(x_train, y_train)
model.score(x_test, y_test)

In [None]:
# Pair each feature name with its importance, then sort from most to least important
importances = pd.DataFrame(
    {"feature": x_train.columns, "importance": model.feature_importances_}
)
importances = importances.sort_values(by="importance", ascending=False)
importances

### Random Forest model deep dive

These resources will help you understand what's happening inside the Random Forest models we've been using.

* [Random Forest Wikipedia](https://en.wikipedia.org/wiki/Random_forest)
* [Random Forest Wikipedia (simple version)](https://simple.wikipedia.org/wiki/Random_forest)
* [Random Forests in Python](http://blog.yhat.com/posts/random-forests-in-python.html) by yhat
* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) by Will Koehrsen

## 3. Fitting a model/algorithm and making predictions

### 3.1 Fitting a model to the data
* `x` - features, inputs, independent variables, data
* `y` - labels, targets, dependent variables, answers

In [None]:
heart_disease = pd.read_csv("../ztm-ml/data/heart-disease.csv")
heart_disease.head()

In [None]:
np.random.seed(42)
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2
)

# Model
clf = RandomForestClassifier(n_estimators=100)

# Fit the model to the data
clf.fit(x_train, y_train)

# Evaluate the model
clf.score(x_test, y_test)

### 3.2 Making predictions using a machine learning model

Two ways to make predictions:
1. `predict()` - predicts a label (classification) or a value (regression)
2. `predict_proba()` - predicts the probability of each class label (classification only)

In [None]:
y_pred = clf.predict(x_test)

In [None]:
np.mean(y_pred == y_test), clf.score(x_test, y_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

accuracy_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
confusion_matrix(y_test, y_pred)
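
As a small worked example of how to read the confusion matrix, the sketch below uses hand-written binary labels. In Scikit-Learn's layout, rows are the true classes and columns are the predicted classes, so for a binary problem `ravel()` yields true negatives, false positives, false negatives, and true positives in that order. The lists `y_true_demo` and `y_pred_demo` are made up for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels (illustrative assumption, not the heart disease data).
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[1 1]
#  [1 2]]

# Rows are true labels, columns are predicted labels.
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of predictions on the diagonal.
accuracy_from_cm = (tp + tn) / cm.sum()
print(accuracy_from_cm, accuracy_score(y_true_demo, y_pred_demo))  # both 0.6
```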

#### `predict_proba()` on classification models

`predict_proba()` returns an array of probabilities with one row per sample and one column per class the model was trained to predict.

In [None]:
y_pred_proba = clf.predict_proba(x_test)
y_pred_proba[:5]
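
To show how the two prediction methods relate, the sketch below checks that, for a random forest, the label from `predict()` is the class with the highest probability from `predict_proba()`. The synthetic data and the names `forest`, `proba`, and `labels_from_proba` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (illustrative assumption).
x_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(x_demo, y_demo)

# One row per sample, one column per class; each row sums to 1.
proba = forest.predict_proba(x_demo[:5])
labels = forest.predict(x_demo[:5])

# Map the most probable column back to its class label.
labels_from_proba = forest.classes_[np.argmax(proba, axis=1)]

print(proba.shape)
print(np.array_equal(labels, labels_from_proba))
```

The columns of `proba` follow the order of `forest.classes_`, which is why the column index is mapped through that attribute rather than used directly as a label.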

#### `predict()` on regression models

In [None]:
housing_df = pd.DataFrame(data=housing.data, columns=housing.feature_names)
housing_df["target"] = housing.target # Median house value, in units of $100,000

In [None]:
np.random.seed(42)
x = housing_df.drop("target", axis=1)
y = housing_df["target"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2
)

x_train.shape, x_test.shape, y_train.shape, y_test.shape

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)

In [None]:
y_pred = model.predict(x_test)
y_pred[:10]

In [None]:
pred_df = pd.DataFrame({"Predicted": y_pred, "Actual": y_test})

In [None]:
pred_df['difference'] = pred_df['Actual'] - pred_df['Predicted']

In [None]:
# Mean absolute error: the average size of the prediction errors
pred_df['difference'].abs().mean()
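
The average absolute difference computed above is the mean absolute error (MAE). As a quick check on small hand-written values, the sketch below compares the manual calculation with `mean_absolute_error` from `sklearn.metrics`. The arrays `y_true_demo` and `y_pred_demo` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hand-written values (illustrative assumption, not the housing data).
y_true_demo = np.array([3.0, 0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# Manual MAE: absolute errors are 0.5, 0.5, 0.0, 1.0, so the mean is 0.5.
manual_mae = np.abs(y_true_demo - y_pred_demo).mean()

# The same metric from Scikit-Learn.
sklearn_mae = mean_absolute_error(y_true_demo, y_pred_demo)

print(manual_mae, sklearn_mae)  # both 0.5
```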

## 4. Evaluating a machine learning model

Three ways to evaluate Scikit-Learn models/estimators: 

1. Estimator's built-in `score()` method
2. The `scoring` parameter (accepted by model-evaluation tools such as `cross_val_score()` and `GridSearchCV`)
3. Problem-specific metric functions
    
You can read more about these here: https://scikit-learn.org/stable/modules/model_evaluation.html 
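
The three approaches listed above can be sketched side by side as follows. This is a minimal example on synthetic data, not the heart disease dataset, and the names `forest`, `builtin_score`, `cv_scores`, and `metric_score` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (illustrative assumption).
x_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(x_tr, y_tr)

# 1. Built-in score() method: mean accuracy for classifiers.
builtin_score = forest.score(x_te, y_te)

# 2. The scoring parameter: cross_val_score fits a fresh clone on each of 5 folds.
cv_scores = cross_val_score(forest, x_demo, y_demo, cv=5, scoring="accuracy")

# 3. Problem-specific metric function applied to the test predictions.
metric_score = accuracy_score(y_te, forest.predict(x_te))

print(builtin_score, metric_score)  # identical: both are test-set accuracy
print(cv_scores.mean())             # average accuracy across the folds
```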

In [None]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2
)

model = RandomForestClassifier()
model.fit(x_train, y_train)
model.score(x_test, y_test)