## Scikit-Learn (Sklearn) Course

<span>
0. sklearn workflow overview<br>
1. preparing data (exploring, cleaning, transforming, reducing, splitting)<br>
2. selecting the machine learning model / algorithm<br>
3. training the algorithm and making predictions<br>
4. evaluating the algorithm<br>
5. improving the model<br>
6. saving and loading the algorithm<br>
<span style="color:orange">7. putting it all together</span>
</span>

## 7. Putting It All Together

#### General concepts

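--- sklearn Pipeline class  
a pipeline chains the transforming steps and the training step of modeling into a single estimator  
a pipeline may contain several transforming steps followed by one final training step; calling fit runs each transformer in order and then fits the final estimator on the transformed data  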
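
As a minimal sketch of the idea above, the following chains one transforming step and one training step on made-up data. The toy arrays, the step names, and the choice of `LinearRegression` are assumptions for illustration only; they are not part of the car sales example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data (made up): one feature with a missing value, target is 2 * feature
toy_X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
toy_y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # transforming step
    ("model", LinearRegression())])               # final training step

# fit() imputes the missing value with the column mean, then trains the model
toy_pipeline.fit(toy_X, toy_y)

# predict() applies the same fitted imputer before calling the model
toy_predictions = toy_pipeline.predict(np.array([[3.0], [np.nan]]))
```

The fitted steps stay accessible by name, for example `toy_pipeline.named_steps["imputer"].statistics_` holds the mean learned during `fit`.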

#### Preparing data

In [None]:
### imports

import numpy, pandas

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [None]:
### preparing data

### loading extended car sales data with missing values
car_sales = pandas.read_csv("data-car-sales-missing.csv")

### deleting rows with missing targets (price column)
car_sales = car_sales.dropna(subset=["Price"])

### splitting data into features and target
features = car_sales.drop(columns="Price")
target = car_sales.loc[:, "Price"]

In [None]:
### preprocessing data

### defining category pipeline
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

### defining doors pipeline
doors_feature = ["Doors"]
doors_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4.0))])

### defining odometer pipeline
odometer_feature = ["Odometer (KM)"]
odometer_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])

### defining preprocessor
preprocessor = ColumnTransformer(transformers=[
    ("categorical", categorical_transformer, categorical_features),
    ("doors", doors_transformer, doors_feature),
    ("odometer", odometer_transformer, odometer_feature)])
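
To see what a preprocessor of this kind produces, here is a small sketch on a made-up two-column frame. The `toy_cars` data and the `toy_` names are hypothetical; only the transformer settings mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame (made up) with one missing value per column
toy_cars = pd.DataFrame({
    "Make": pd.Series(["Toyota", "Honda", np.nan, "Toyota"], dtype=object),
    "Doors": [4.0, np.nan, 4.0, 3.0]})

toy_preprocessor = ColumnTransformer(transformers=[
    # missing makes become their own "missing" category before one-hot encoding
    ("categorical", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Make"]),
    # missing door counts are filled with the constant 4.0
    ("doors", SimpleImputer(strategy="constant", fill_value=4.0), ["Doors"])])

toy_transformed = toy_preprocessor.fit_transform(toy_cars)

# The result may be sparse or dense depending on its density, so normalise it
toy_dense = (toy_transformed.toarray() if hasattr(toy_transformed, "toarray")
             else np.asarray(toy_transformed))
```

Each transformer only sees its own columns, and the outputs are stacked side by side in the order the transformers are listed: three one-hot columns for `Make` followed by the imputed `Doors` column.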


#### Selecting, training, and evaluating the algorithm

In [None]:
### imports
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [None]:
### modeling

### defining modeling pipeline
regressor = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_jobs=-1))])

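### training and scoring algorithm
### seeding numpy's global random generator makes the forest reproducible (no random_state is set on the model)
numpy.random.seed(42)
### cross_val_score refits the whole pipeline on each of the 5 folds and returns one r2 score per fold
cross_val_score(estimator=regressor, X=features, y=target, cv=5, scoring="r2").mean()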
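
The call above reduces the fold scores to a single mean. A short sketch of what `cross_val_score` returns before averaging, using synthetic data from `make_regression` and a `StandardScaler` plus `LinearRegression` pipeline as stand-ins (both are assumptions, chosen only so the block runs without the car sales file):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (stand-in for the car sales features and target)
toy_X, toy_y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

toy_regressor = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression())])

# One r2 score per fold; the pipeline is cloned and refit for every fold,
# so the scaler only learns from that fold's training rows
fold_scores = cross_val_score(estimator=toy_regressor, X=toy_X, y=toy_y, cv=5, scoring="r2")

mean_score = fold_scores.mean()   # the single number reported above
score_spread = fold_scores.std()  # how much the folds disagree
```

Looking at the spread alongside the mean gives a rough sense of how stable the score is across folds.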

#### Improving the model

In [None]:
### imports
from sklearn.model_selection import GridSearchCV

In [None]:
### running brute force grid search

### creating search grid
search_grid = {
    "preprocessor__odometer__imputer__strategy": ["mean", "median"],
    "model__max_depth": [None, 5],
    "model__max_features": ["sqrt"],
    "model__min_samples_leaf": [1, 2],
    "model__min_samples_split": [2, 4],
    "model__n_estimators": [100, 1000]}

### creating grid search object (scoring set explicitly to match the r2 scoring used with cross_val_score)
regressor_gscv = GridSearchCV(estimator=regressor, param_grid=search_grid, cv=5, scoring="r2", verbose=True)

### training grid search object
numpy.random.seed(42)
regressor_gscv.fit(X=features, y=target);
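
The keys in the search grid address nested parameters with double underscores: `preprocessor__odometer__imputer__strategy` reads as step `preprocessor`, transformer `odometer`, step `imputer`, parameter `strategy`. As a rough sketch of how much work the grid implies, `ParameterGrid` can enumerate the combinations without fitting anything. The variable names below are illustrative; the grid is a copy of the one above.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the search grid defined above
grid = {
    "preprocessor__odometer__imputer__strategy": ["mean", "median"],
    "model__max_depth": [None, 5],
    "model__max_features": ["sqrt"],
    "model__min_samples_leaf": [1, 2],
    "model__min_samples_split": [2, 4],
    "model__n_estimators": [100, 1000]}

# Every combination of the listed values: 2 * 2 * 1 * 2 * 2 * 2
n_candidates = len(ParameterGrid(grid))

# Each candidate is fit once per fold (cv=5); the final refit adds one more fit
n_fits = n_candidates * 5

# The double-underscore path, split into its parts
path = "preprocessor__odometer__imputer__strategy".split("__")
```

Here `n_candidates` is 32 and `n_fits` is 160, which is why adding values to the grid quickly becomes expensive.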

In [None]:
### reading best parameters
regressor_gscv.best_params_

In [None]:
### evaluating best estimator
regressor_best = regressor_gscv.best_estimator_
numpy.random.seed(42)
cross_val_score(estimator=regressor_best, X=features, y=target, cv=5, scoring="r2").mean()
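
The fitted search object also stores the mean cross-validated score of the best parameters in `best_score_`. The sketch below, on synthetic data with a deterministic `Ridge` model (both assumptions, used so the block is self-contained), shows that re-scoring the best estimator with the same folds reproduces that value. With a random forest the two numbers can differ slightly because the trees are rebuilt with fresh randomness.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data (stand-in for the car sales features and target)
toy_X, toy_y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

toy_search = GridSearchCV(estimator=Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]},
                          cv=5, scoring="r2")
toy_search.fit(toy_X, toy_y)

# Mean r2 over the 5 folds for the winning alpha, recorded during the search
stored_score = toy_search.best_score_

# Re-scoring the best estimator on the same data and folds
rescored = cross_val_score(estimator=toy_search.best_estimator_, X=toy_X, y=toy_y,
                           cv=5, scoring="r2").mean()
```

Both numbers come from the same data that was used to pick the hyperparameters, so they tend to be somewhat optimistic compared with a score on data the search never saw.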