# Introduction to Scikit-Learn (Sklearn)

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

**What is Covered:**

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choosing the right estimator/algorithm/model for the problem
3. Fitting the model and using it to make predictions on the data
4. Evaluating a model
5. Improving a model
6. Saving and loading a trained model
7. Putting it all together

In [None]:
# standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## 0. An end-to-end Scikit-Learn workflow

In [None]:
# 1. Getting the data ready
heart_disease = pd.read_csv('https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv')
heart_disease.head(3)

In [None]:
# Create X (features)
X = heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease["target"]

In [None]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# We'll keep the default hyperparameters
clf.get_params()


In [None]:
# 3. Fit the model to the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)

In [None]:
# Make a prediction
y_preds = clf.predict(X_test)
y_preds

In [None]:
# 4. Evaluate the model on the training data and the test data
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

In [None]:
# 5. Improve a model

# Try different values of n_estimators
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators . . .")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%", end='\n________________\n')
    

In [None]:
# 6. Save a model and load it
import pickle

# Use a context manager so the file is closed once the model is written
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [None]:
with open('./random_forest_model_1.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

## 1. Getting our data ready to be used with machine learning

Three main things to do:
1. Splitting the data into features and labels (usually called `X` and `y`)
2. Filling (also called imputing) or disregarding missing values
3. Converting non-numerical values to numerical values (also called feature encoding)
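As a rough illustration of the third step, feature encoding can be sketched with pandas alone. The `toy` DataFrame and its values are made up for this example and are not part of the car sales data.

```python
import pandas as pd

# Hypothetical toy data: one text column and one numeric column
toy = pd.DataFrame({
    "Colour": ["Red", "Blue", "Red", "White"],
    "Doors": [4, 4, 3, 5],
})

# One-hot encode the text column; the numeric column is left unchanged
encoded = pd.get_dummies(toy, columns=["Colour"], dtype=int)
print(encoded)
```

Each distinct colour becomes its own 0/1 column, which is the same idea `OneHotEncoder` applies inside a Scikit-Learn workflow.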

### 1.1 Making sure the data is all numerical

In [None]:
car_sales = pd.read_csv('https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended.csv')
car_sales.dtypes

In [None]:
car_sales.head(3)

In [None]:
# Split into X/y
X = car_sales.drop('Price', axis=1)
y = car_sales['Price']
X.shape, y.shape

In [None]:
# Turn Categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Categorical features
categorical_features = ['Make', 'Colour']
door_feature = ['Doors']

one_hot = OneHotEncoder()
transformer = ColumnTransformer([
    ('one_hot', one_hot, categorical_features),
], remainder='passthrough')

transformed_X = transformer.fit_transform(X)
pd.DataFrame(transformed_X).head(3)

In [None]:
# Fit the model
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

### 1.2 Dealing with missing values

1. Fill them with some value
2. Remove the samples with missing data altogether
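Both options can be sketched with pandas before bringing Scikit-Learn in. The `toy` DataFrame below is a made-up example (not the car sales data), and the fill values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Toyota"],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, 84714.0],
})

# Option 1: fill missing values (a constant for text, the mean for numbers)
filled = toy.copy()
filled["Make"] = filled["Make"].fillna("missing")
filled["Odometer (KM)"] = filled["Odometer (KM)"].fillna(filled["Odometer (KM)"].mean())

# Option 2: remove every row that contains a missing value
dropped = toy.dropna()

print(filled.isna().sum().sum())  # number of missing values left after filling
print(len(dropped))               # number of rows left after dropping
```

Filling keeps every sample at the cost of inventing values, while dropping keeps only real values at the cost of losing samples.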

In [None]:
car_sales_missing = pd.read_csv('https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv')
car_sales_missing.isnull().sum()

In [None]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=['Price'], inplace=True)
car_sales_missing.isnull().sum()

In [None]:
# Split into X & y
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']
X.shape, y.shape

In [None]:
car_sales_missing.head(3)

In [None]:
# Fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with `missing` and numerical values with mean
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
door_imputer = SimpleImputer(strategy='constant', fill_value=4)
num_imputer = SimpleImputer(strategy='mean')

# Define columns
cat_features = ['Make', 'Colour']
door_feature = ['Doors']
num_features = ['Odometer (KM)']

# Create an imputer
imputer = ColumnTransformer([
    ('cat_imputer', cat_imputer, cat_features),
    ('door_imputer', door_imputer, door_feature),
    ('num_imputer', num_imputer, num_features)
])

# Transform the data
filled_X = imputer.fit_transform(X)
filled_X_df = pd.DataFrame(filled_X, columns=['Make', 'Colour', 'Doors', 'Odometer (KM)'])
filled_X_df.isnull().sum()

In [None]:
# Convert data to numbers
# Turn the categories into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([
    ('one_hot',
     one_hot,
     categorical_features)
], remainder='passthrough')

transformed_X = transformer.fit_transform(filled_X_df)

In [None]:
# fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

## 2. Choosing the right estimator for our problem


<a href='https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html' target='_blank'><img src='https://scikit-learn.org/stable/_static/ml_map.png' style='width: 1000px;' id='ml_map'></a>

### 2.1 Picking a machine learning model for our regression problem

In [None]:
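# Import the Boston housing dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2
from sklearn.datasets import load_boston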

boston = load_boston()
boston_df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
boston_df['target'] = pd.Series(boston['target'])
boston_df.head(3)

In [None]:
boston_df.shape

In [None]:
# Let's try the ridge regression model
from sklearn.linear_model import Ridge

# Setup random seed
np.random.seed(42)

# Create the data
X = boston_df.drop('target', axis=1)
y = boston_df['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the Ridge model
model = Ridge()
model.fit(X_train, y_train)

# Check the score of the ridge model on test data
model.score(X_test, y_test)

How do we improve this score?

What if `Ridge` does not work well enough?

In [None]:
# Let's try RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
X = boston_df.drop('target', axis=1)
y = boston_df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

# Evaluate the model (check the score for RandomForestRegressor)
rf.score(X_test, y_test)

### 2.2 Choosing an estimator for classification problems

In [None]:
heart_disease = pd.read_csv('https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv')
heart_disease.head(3)

In [None]:
heart_disease.shape

Let's check [the ML map](#ml_map).

The map advises using <a href='https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC' target='_blank'>LinearSVC</a>.

In [None]:
# Import the linear svc estimator class
from sklearn.svm import LinearSVC

# Setup random seed
np.random.seed(42)

# prepare the data
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the model
clf = LinearSVC()

# fit the model
clf.fit(X_train, y_train)

# score the model
clf.score(X_test, y_test)

Let's check an ensemble classifier.

In [None]:
# Import RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize and fit the model
clf = RandomForestClassifier().fit(X_train, y_train)

# Score the model
clf.score(X_test, y_test)

## 3. Fit the model/algorithm and use it to make predictions

### 3.1 Fitting the model to the data

### 3.2 Making predictions using a machine learning model
There are two ways to make predictions:
1. `predict()`
2. `predict_proba()`
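The relationship between the two methods can be sketched on synthetic data. The dataset from `make_classification` and the names `X_demo`, `y_demo` and `demo_clf` are assumptions made for this example, not part of the heart disease workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

probs = demo_clf.predict_proba(X_demo[:5])  # shape (5, 2): one column per class
labels = demo_clf.predict(X_demo[:5])       # shape (5,): one label per sample

# predict() picks the class with the highest predicted probability
labels_from_probs = demo_clf.classes_[np.argmax(probs, axis=1)]
print(probs)
print(labels, labels_from_probs)
```

`predict_proba()` is useful when the confidence of a prediction matters, for example when choosing a decision threshold other than 0.5.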

In [None]:
# use a trained model to make predictions
y_preds = clf.predict(X_test)
y_preds

In [None]:
y_test = np.array(y_test)
y_test

In [None]:
# Compare predictions to truth labels
np.mean(y_preds == y_test)

In [None]:
clf.score(X_test, y_test)

In [None]:
from sklearn.metrics import accuracy_score
# returns the mean accuracy on the given data and labels
accuracy_score(y_test, y_preds)

Make predictions with `predict_proba()`


In [None]:
# predict_proba() returns the probabilities of each classification label
clf.predict_proba(X_test[:5])

In [None]:
# Let's predict on the same data
clf.predict(X_test[:5])

`predict()` can also be used for regression models

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Create the data
X = boston_df.drop('target', axis=1)
y = boston_df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize and fit the model
model = RandomForestRegressor().fit(X_train, y_train)

# Make predictions
y_preds = model.predict(X_test)
y_preds[:10]

In [None]:
# Compare predictions to the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

## 4. Evaluating a machine learning model

Three ways to evaluate a Scikit-Learn model/estimator:
1. Estimator `score` method
2. The `scoring` parameter
3. Problem-specific metric functions

### 4.1 Evaluating a model with the `score` method

In [None]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)

X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = RandomForestClassifier().fit(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

### 4.2 Evaluating a model using the `scoring` parameter

In [None]:
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

X_train, X_test, y_train, y_test = train_test_split(X, y)

clf = RandomForestClassifier().fit(X_train, y_train)

clf.score(X_test, y_test)

In [None]:
cross_val_score(clf, X, y, cv=5)

The `scoring` parameter is set to `None` by default.

When `scoring` is `None`, the estimator's own `score` method is used as the evaluation metric. For a classifier, that is mean accuracy.
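One way to check this for a classifier is to compare `scoring=None` with `scoring="accuracy"`. The sketch below uses synthetic data and a fixed `random_state` (both assumptions for the example) so that the two runs train identical models on identical folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=25, random_state=42)

default_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring=None)
accuracy_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring="accuracy")

# Both runs should report the same per-fold scores
print(np.allclose(default_scores, accuracy_scores))
```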

In [None]:
cross_val_score(clf, X, y, cv=5, scoring=None)


### 4.2.1 Classification model evaluation metrics
1. Accuracy
2. Area under the ROC curve
3. Confusion matrix
4. Classification report

**Accuracy**

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

clf = RandomForestClassifier()

accuracy_cv_score = cross_val_score(clf, X, y, cv=5)
accuracy_cv_score

In [None]:
print(f"Heart disease classifier cross-validated accuracy: {np.mean(accuracy_cv_score) * 100:.2f}%")

**Area under the receiver operating characteristic curve (AUC/ROC)**
* Area under curve (AUC)
* ROC Curve

ROC curves are a comparison of a model's true positive rate (tpr) versus a model's false positive rate (fpr)
* True positive = model predicts 1 when truth is 1
* False positive = model predicts 1 when truth is 0
* True negative = model predicts 0 when truth is 0
* False negative = model predicts 0 when truth is 1

True Positive Rate = $\frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$

False Positive Rate = $\frac{\text{False Positive}}{\text{False Positive} + \text{True Negative}}$
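As a small worked example, the true positive rate and false positive rate can be computed from the four counts of a confusion matrix. The label arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical truth labels and predictions (4 positives, 6 negatives)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels 0/1, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

tpr_demo = tp / (tp + fn)  # 3 / (3 + 1) = 0.75
fpr_demo = fp / (fp + tn)  # 1 / (1 + 5), roughly 0.17
print(tpr_demo, fpr_demo)
```

A ROC curve repeats this calculation at many decision thresholds and plots the resulting (fpr, tpr) pairs.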

**See <a href='https://www.youtube.com/watch?v=4jRBRDbJemM'>this video</a>**

In [None]:
from sklearn.ensemble import RandomForestClassifier

X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize and fit the model
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Make predictions with probabilities
y_probs = clf.predict_proba(X_test)

y_probs[:7], y_probs.shape[0]

In [None]:
y_probs_positive = y_probs[:,1]
y_probs_positive[:7]

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_probs_positive)

In [None]:
# Create a function to plot ROC curve
import matplotlib.pyplot as plt

def plot_roc_curve(fpr, tpr):
    """
    Plots a ROC curve given the false positive rate (fpr)
    and true positive rate (tpr) of a model
    """
    # Plot the ROC curve
    plt.plot(fpr, tpr, color='orange', label='ROC')
    # Plot a diagonal line with no predictive power (baseline)
    plt.plot([0, 1],
             [0, 1],
             color='darkblue',
             linestyle='--',
             label='Guessing')
    # Customize the plot
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()


plot_roc_curve(fpr, tpr)


**Confusion Matrix**

A confusion matrix is a quick way to compare the labels a model
predicts with the actual labels it was supposed to predict.

In essence, it gives you an idea of where the model is getting confused.
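A labelled version of the same table can be sketched with `pd.crosstab`, which makes it easier to see which axis holds the actual labels and which holds the predictions. The two Series below are made-up values for illustration.

```python
import pandas as pd

# Hypothetical actual and predicted labels
actual = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="Actual label")
predicted = pd.Series([1, 0, 0, 1, 0, 1, 1, 0], name="Predicted label")

# Rows are actual labels, columns are predicted labels
table = pd.crosstab(actual, predicted)
print(table)
```

The diagonal cells count correct predictions; the off-diagonal cells count the samples where the model was confused.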

In [None]:
from sklearn.metrics import confusion_matrix

y_preds = clf.predict(X_test)

confusion_matrix(y_test, y_preds)

In [None]:
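# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn 1.0+) replaces it
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf, X, y)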

**Classification Report**

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_preds))

In [None]:
# Where precision and recall become valuable

disease_true = np.zeros(10000)
disease_true[0] = 1  # only one positive case

disease_preds = np.zeros(10000)

pd.DataFrame(classification_report(disease_true,
                                   disease_preds,
                                   output_dict=True))

To summarize classification metrics:
* **Accuracy** is a good measure if all classes are balanced (e.g. the same number of samples labelled 0 and 1)
* **Precision** and **recall** become more important when classes are imbalanced
* If false positive predictions are worse than false negatives, aim for higher precision
* If false negative predictions are worse than false positives, aim for higher recall
* **F1 Score** is a combination of precision and recall (their harmonic mean)
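A small worked example on imbalanced labels shows how these metrics can disagree. The labels below are made up so that the counts are easy to verify by hand.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical imbalanced data: 8 negatives and 2 positives
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Counts by hand: tp = 1, fp = 1, fn = 1, tn = 7
acc = accuracy_score(y_true_demo, y_pred_demo)    # (tp + tn) / total = 8 / 10
prec = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp) = 1 / 2
rec = recall_score(y_true_demo, y_pred_demo)      # tp / (tp + fn) = 1 / 2
f1 = f1_score(y_true_demo, y_pred_demo)           # 2 * prec * rec / (prec + rec)

print(acc, prec, rec, f1)
```

Accuracy looks healthy because of the many true negatives, while precision and recall reveal that only half of the positive cases are handled correctly.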

### 4.2.2 Regression model evaluation metrics
Model evaluation metrics documentation: https://scikit-learn.org/stable/modules/model_evaluation.html

1. R^2, or coefficient of determination
2. Mean absolute error (MAE)
3. Mean squared error (MSE)

**R^2**

Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1.
If all the model does is predict the mean of the targets, its R^2 value is 0.
If the model perfectly predicts the targets, its R^2 value is 1.
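These reference points can be checked directly with `r2_score`. The target values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical regression targets
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])

# Predicting the mean of the targets for every sample
y_mean_preds = np.full(len(y_true_demo), y_true_demo.mean())
r2_mean = r2_score(y_true_demo, y_mean_preds)

# Predicting every target exactly
r2_perfect = r2_score(y_true_demo, y_true_demo)

# Predictions that are worse than the mean (targets in reverse order)
r2_bad = r2_score(y_true_demo, y_true_demo[::-1])

print(r2_mean, r2_perfect, r2_bad)
```

The reversed predictions have a residual sum of squares of 80 against a total sum of squares of 20, so their R^2 is 1 - 80/20 = -3.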

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = boston_df.drop('target', axis=1)

y = boston_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor().fit(X_train, y_train)

model.score(X_test, y_test)  # Default metric is R^2 (the coefficient of determination)

In [None]:
from sklearn.metrics import r2_score

# For comparison, score random integer predictions against the true targets
r2_score(y_test, np.random.randint(0, 9, size=len(y_test)))

**Mean Absolute Error**

The average of the absolute differences between predictions and actual values.
It gives you an idea of how wrong your model's predictions are.

In [None]:
# Mean Absolute Error
from sklearn.metrics import mean_absolute_error

y_preds = model.predict(X_test)

mae = mean_absolute_error(y_test, y_preds)
mae

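**Mean Squared Error**

The average of the squared differences between predictions and actual values.
Squaring penalizes large errors more heavily than small ones.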
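To make the difference between the two error metrics concrete, here is a small worked example computed both by hand and with Scikit-Learn. The values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_demo = np.array([12.0, 18.0, 30.0, 50.0])

errors = y_pred_demo - y_true_demo     # [2, -2, 0, 10]
mae_by_hand = np.mean(np.abs(errors))  # (2 + 2 + 0 + 10) / 4 = 3.5
mse_by_hand = np.mean(errors ** 2)     # (4 + 4 + 0 + 100) / 4 = 27.0

print(mae_by_hand, mean_absolute_error(y_true_demo, y_pred_demo))
print(mse_by_hand, mean_squared_error(y_true_demo, y_pred_demo))
```

The single error of 10 contributes 100 of the 108 total squared error, which is why MSE is more sensitive to outliers than MAE.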


In [None]:
# Mean Squared Error
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_preds)
mse

### 4.2.3 Finally using the `scoring` parameter

#### 4.2.3.1 Classifier model

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

clf = RandomForestClassifier()

**Accuracy**

In [None]:
cv_acc = cross_val_score(clf, X, y, cv=5)
cv_acc

In [None]:
# Average cross-validated accuracy (accuracy is the default scoring metric for RandomForestClassifier)
print(f"The average cross validated accuracy is {np.mean(cv_acc)*100:.2f}%")

**Precision**

In [None]:
cv_precision = cross_val_score(clf, X, y, cv=5, scoring='precision')
cv_precision

In [None]:
# Average cross validated precision
print(f"The average cross validated precision is {np.mean(cv_precision)*100:.2f}%")

**Recall**

In [None]:
cv_recall = cross_val_score(clf, X, y, cv=5, scoring='recall')
cv_recall

In [None]:
print(f"The average cross validated recall is {np.mean(cv_recall)*100:.2f}%")

**F1**

In [None]:
cv_f1 = cross_val_score(clf, X, y, cv=5, scoring='f1')
cv_f1

In [None]:
print(f"The average cross validated f1 is {np.mean(cv_f1)*100:.2f}%")

#### 4.2.3.2 Regression model

In [None]:
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = boston_df.drop('target', axis=1)
y = boston_df['target']

model = RandomForestRegressor()

**R^2**

In [None]:
cv_r2 = cross_val_score(model, X, y, cv=5, scoring=None)  # with scoring=None, the estimator's default score (R^2 for regressors) is used

cv_r2, np.mean(cv_r2)


**Mean Absolute Error**

In [None]:
cv_mae = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')  # negated because the scoring convention is that higher is better

cv_mae, np.mean(cv_mae)

**Mean Squared Error**

In [None]:
cv_mse = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
cv_mse, np.mean(cv_mse)

## 5. Improving a model
First predictions = baseline predictions<br>
First model = baseline model

From a data perspective:
* Could we collect more data? (generally, the more data, the better)
* Could we improve our data?

From a model perspective:
* Is there a better model we could use?
* Could we improve the current model by tuning its hyperparameters?

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.get_params()

**Three ways to adjust hyperparameters**
1. By hand 
2. Randomly with RandomizedSearchCV
3. Exhaustively with GridSearchCV

### 5.1 Tuning hyperparameters by hand
Divide the data into train, validation, and test sets.

We are going to try and adjust:

* `max_depth`
* `max_features`
* `min_samples_leaf`
* `min_samples_split`
* `n_estimators`
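A minimal sketch of tuning by hand, assuming a 70/15/15 train/validation/test split and synthetic data from `make_classification`. The split sizes, the chosen hyperparameter values and every name ending in `_d` or `_demo` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# First split off 30% of the data, then halve it into validation and test sets
X_train_d, X_rest, y_train_d, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)
X_val_d, X_test_d, y_val_d, y_test_d = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Baseline model with default hyperparameters
baseline = RandomForestClassifier(random_state=42).fit(X_train_d, y_train_d)

# Second model with hand-picked hyperparameters
tuned = RandomForestClassifier(n_estimators=50, max_depth=5,
                               random_state=42).fit(X_train_d, y_train_d)

# Compare candidates on the validation set; keep the test set for the final check
print("Baseline validation accuracy:", baseline.score(X_val_d, y_val_d))
print("Tuned validation accuracy:", tuned.score(X_val_d, y_val_d))
```

The validation set is what guides hyperparameter choices, so the test set stays untouched until a final model has been picked.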

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
def evaluate_preds(y_true, y_preds):
    """
    Performs an evaluation comparison of y_true labels vs. y_preds labels
    for a classification model
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metrics_dic = {'accuracy': round(accuracy, 2),
                   'precision': round(precision, 2),
                   'recall': round(recall, 2),
                   'f1': round(f1, 2)
                  }
    
    return metrics_dic


### 5.2 Hyperparameter Tuning with RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

grid = {'n_estimators': [10, 100, 200, 1000, 5000, 10000],
        'max_depth': [None, 5, 10, 20, 30],
        'max_features': ['sqrt', 'log2'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
        'min_samples_split': [2, 4, 6],
        'min_samples_leaf': [1, 2, 4]}

np.random.seed(42)

# Split into X, y
heart_disease_shuffled = heart_disease.sample(frac=1)
X = heart_disease_shuffled.drop('target', axis=1)
y = heart_disease_shuffled['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate a RandomForestClassifier
clf = RandomForestClassifier()

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid,
                            n_iter=3,  # number of models to try
                            cv=5,
                            verbose=2
                           )

# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train)

In [None]:
rs_clf.best_params_

In [None]:
# Make predictions with the best hyperparameters
rs_y_preds = rs_clf.predict(X_test)

# Evaluate the predictions
rs_metrics = evaluate_preds(y_test, rs_y_preds)
rs_metrics

### 5.3 Hyperparameter tuning with GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

grid_2 = {'n_estimators': [100],
          'max_depth': [5, 10],
          'max_features': ['sqrt'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
          'min_samples_split': [2],
          'min_samples_leaf': [2, 4]}

np.random.seed(42)

# Split into X, y
heart_disease_shuffled = heart_disease.sample(frac=1)
X = heart_disease_shuffled.drop('target', axis=1)
y = heart_disease_shuffled['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate a RandomForestClassifier
clf = RandomForestClassifier()

# Setup GridSearchCV
gs_clf = GridSearchCV(estimator=clf,
                      param_grid=grid_2,
                      cv=5,
                      verbose=2)

# Fit the GridSearchCV version of clf
gs_clf.fit(X_train, y_train)

In [None]:
gs_clf.best_params_

In [None]:
gs_y_preds = gs_clf.predict(X_test)

# Evaluate the predictions
gs_metrics = evaluate_preds(y_test, gs_y_preds)
gs_metrics

## 6. Saving and loading trained machine learning models
Two ways to save and load machine learning models:
1. With python's `pickle` module
2. With the `joblib` module

**Pickle**

In [None]:
import pickle

# Save an existing model to file (the context manager closes the file afterwards)
with open('gs_random_forest_model_1.pkl', 'wb') as f:
    pickle.dump(gs_clf, f)

In [None]:
# Load a saved model
with open('gs_random_forest_model_1.pkl', 'rb') as f:
    loaded_pickle_model = pickle.load(f)

# Make some predictions
loaded_pickle_model.predict(X_test)

**joblib**

In [None]:
from joblib import dump, load

# Save model to file
dump(gs_clf, filename='gs_random_forest_model_1.joblib')

In [None]:
# import a saved joblib model
loaded_joblib_model = load(filename='gs_random_forest_model_1.joblib')

# Make some predictions
loaded_joblib_model.predict(X_test)

**For large Scikit-Learn models, `joblib` can be more efficient than `pickle`, because it handles objects that carry large NumPy arrays more efficiently.**
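Whichever module is used, a quick sanity check is to confirm that a reloaded model makes the same predictions as the original. The sketch below does this for both modules in a temporary directory. The synthetic data, the file names and the small `demo_clf` model are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model, used only for illustration
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Round trip with pickle
    pickle_path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(pickle_path, "wb") as f:
        pickle.dump(demo_clf, f)
    with open(pickle_path, "rb") as f:
        from_pickle = pickle.load(f)

    # Round trip with joblib
    joblib_path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(demo_clf, joblib_path)
    from_joblib = load(joblib_path)

# Both reloaded models should reproduce the original predictions
same_pickle = np.array_equal(demo_clf.predict(X_demo), from_pickle.predict(X_demo))
same_joblib = np.array_equal(demo_clf.predict(X_demo), from_joblib.predict(X_demo))
print(same_pickle, same_joblib)
```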

## 7. Putting it all together!

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv')
data.head(3)

In [None]:
data.dtypes

In [None]:
data.isna().sum()

Steps:
1. Fill the missing data
2. Convert the data to numbers
3. Build a model on the data
   
<a href='https://colab.research.google.com/drive/1AX3Llawt0zdjtOxaYuTZX69dhxwinFDi?usp=sharing#scrollTo=KTyDN_BOb0Al'>source</a>

In [None]:
# Getting data ready
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Modeling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# Setup random seed
import numpy as np
np.random.seed(42)

# import data and drop rows with missing labels
data = pd.read_csv('https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv')
data.dropna(subset=['Price'], inplace=True)


# Define different features and transformer pipeline

# Define categorical columns
categorical_features = ["Make", "Colour"]
# Create categorical transformer (imputes missing values, then one hot encodes them)
categorical_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
  ('onehot', OneHotEncoder(handle_unknown='ignore'))                                         
])

# Define door feature
door_feature = ["Doors"]
# Create door transformer (fills all door missing values with 4)
door_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='constant', fill_value=4)),
])

# Define numeric features
numeric_features = ["Odometer (KM)"]
# Create a transformer for filling all missing numeric values with the mean
numeric_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='mean'))  
])

# Setup preprocessing steps (Fill missing values, then convert to numbers)
# Create a column transformer which combines all of the other transformers 
# into one step
preprocessor = ColumnTransformer(
    transformers=[
      # (name, transformer_to_use, features_to_transform)
      ('categorical', categorical_transformer, categorical_features),
      ('door', door_transformer, door_feature),
      ('numerical', numeric_transformer, numeric_features)
])

# Create the preprocessing and modelling pipeline
model = Pipeline(steps=[('preprocessor', preprocessor), # this will fill our missing data and make sure it's all numbers
                        ('model', RandomForestRegressor())]) # this will model our data

# Split data
X = data.drop('Price', axis=1)
y = data['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit and score the model
model.fit(X_train, y_train)
model.score(X_test, y_test)

**Use `GridSearchCV` or `RandomizedSearchCV` with a pipeline**
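When searching over a pipeline, each hyperparameter is addressed as `<step name>__<parameter name>`, with one extra `__` for every level of nesting. The sketch below lists those names for a small two-step pipeline. The pipeline itself, with `SimpleImputer` and `Ridge`, is a made-up example rather than the car sales pipeline.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Hypothetical two-step pipeline, used only to show the naming scheme
demo_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", Ridge()),
])

# get_params() lists every tunable parameter under its double-underscore name
param_names = sorted(demo_pipe.get_params().keys())
print([name for name in param_names if "__" in name])

# set_params() accepts the same names that a search grid uses as keys
demo_pipe.set_params(imputer__strategy="median", model__alpha=0.5)
print(demo_pipe.named_steps["imputer"].strategy, demo_pipe.named_steps["model"].alpha)
```

Calling `get_params()` on a pipeline is a convenient way to find the exact keys to put in a search grid.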

In [None]:
# Use GridSearchCV with our pipeline
pipe_grid = {
    "preprocessor__numerical__imputer__strategy": ['mean', 'median'],
    "model__n_estimators": [100],
    "model__max_depth": [None, 5],
    "model__max_features": [1.0],  # 'auto' (all features for regressors) was removed in scikit-learn 1.3
    "model__min_samples_split": [2]
}

gs_model = GridSearchCV(model, pipe_grid, cv=5, verbose=1)
gs_model.fit(X_train, y_train)

In [None]:
gs_model.score(X_test, y_test)

Practice:
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/scikit-learn-exercises.ipynb