## Scikit-Learn (sklearn) Course

<span style="color:navajowhite">
--- topics covered<br>
0. sklearn workflow overview<br>
1. preparing data (collecting, features/targets, splitting)<br>
2. defining problem / selecting machine learning estimator/algorithm/model<br>
3. training model and making predictions<br>
4. evaluating model<br>
5. improving model<br>
6. saving and loading model<br>
7. putting it all together
</span>

In [None]:
### imports
import pickle
import numpy, pandas
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
### displaying matplotlib figures within notebook
%matplotlib inline

## Sklearn Workflow

#### 1. Preparing data

In [None]:
### pandas dataframe (heart_disease), reading data from csv file
heart_disease = pandas.read_csv("data-heart-disease.csv")
### pandas dataframe (features), contains heart_disease dataframe, dropping heart_disease/target
features = heart_disease.drop(columns="target")
### pandas series (target), contains heart_disease/target
target = heart_disease.loc[:, "target"]
### splitting data into train/test sets
### features >>> features_train, features_test
### target >>> target_train, target_test
### test set size is 20% of all data (test_size)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)
### splitting a dataframe/series returns dataframes/series, not numpy arrays
features_train: pandas.DataFrame; features_test: pandas.DataFrame; target_train: pandas.Series; target_test: pandas.Series
features_train.shape, features_test.shape, target_train.shape, target_test.shape

#### 2. Selecting machine learning model

In [None]:
### instantiating a random forest classifier
classifier = RandomForestClassifier()
### listing classifier parameters
classifier.get_params()

#### 3. Training model / Making predictions

In [None]:
### fitting classifier to training dataset
classifier.fit(features_train, target_train)
### making predictions with trained classifier on test dataset
target_prediction = classifier.predict(features_test)
### viewing classifier predictions
target_prediction

#### 4. Evaluating model

In [None]:
### classifier accuracy score on training dataset
print(classifier.score(features_train, target_train))
### classifier accuracy score on test dataset
print(classifier.score(features_test, target_test))
### same test accuracy, computed from the predictions with accuracy_score
print(accuracy_score(target_test, target_prediction))

In [None]:
### classifier main metrics on test dataset
print(classification_report(target_test, target_prediction))
### classifier confusion matrix on test dataset
print(confusion_matrix(target_test, target_prediction))

#### 5. Improving model

In [None]:
### adjusting number of estimators (n_estimators)
for num_estimators in range(10, 100, 10):
    print(f"Trying classifier with {num_estimators} estimators...")
    classifier = RandomForestClassifier(n_estimators=num_estimators)
    classifier.fit(features_train, target_train)
    print(f"Classifier accuracy score on test dataset: {classifier.score(features_test, target_test)}")
    print()

#### 6. Saving model / Loading model

In [None]:
### saving classifier (dump) into pickle binary file, closing the file afterwards
with open("model-random-forest-1.pkl", "wb") as file:
    pickle.dump(classifier, file)

In [None]:
### loading classifier from pickle binary file (only unpickle files from trusted sources)
with open("model-random-forest-1.pkl", "rb") as file:
    classifier_loaded: RandomForestClassifier = pickle.load(file)
### loaded classifier accuracy score on test dataset
classifier_loaded.score(features_test, target_test)

## 1. Preparing Data

<span style="color:navajowhite">
--- main steps<br>
splitting data (features, targets, train, test)<br>
handling missing values (filling or omitting)<br>
converting non-numerical values (feature encoding)
</span>

In [None]:
### pandas dataframe (heart_disease), reading data from csv file
heart_disease = pandas.read_csv("data-heart-disease.csv")
### pandas dataframe (features), contains heart_disease dataframe, dropping heart_disease/target
features = heart_disease.drop(columns="target")
### pandas series (target), contains heart_disease/target
target = heart_disease.loc[:, "target"]
### splitting data into train/test sets
### features >>> features_train, features_test
### target >>> target_train, target_test
### test set size is 20% of all data (test_size)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)
### splitting a dataframe/series returns dataframes/series, not numpy arrays
features_train: pandas.DataFrame; features_test: pandas.DataFrame; target_train: pandas.Series; target_test: pandas.Series
features_train.shape, features_test.shape, target_train.shape, target_test.shape

#### Data transformation

In [None]:
### imports ------------------------------------------------------------------------------------------------------------

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [None]:
### collecting data ----------------------------------------------------------------------------------------------------

### pandas dataframe (car_sales), reading data from csv file
car_sales = pandas.read_csv("data-car-sales-extended.csv")

car_sales.head(10)

In [None]:
### splitting data: features / target ----------------------------------------------------------------------------------

### pandas dataframe (features), contains car_sales dataframe, dropping car_sales/Price
features = car_sales.drop(columns="Price")
### pandas series (target), contains car_sales/Price
target = car_sales.loc[:, "Price"]

features, target

In [None]:
### sklearn feature encoding -------------------------------------------------------------------------------------------

### column transformer object (transformer), named "one_hot", using OneHotEncoder()
### encodes car_sales/Make, car_sales/Colour, and car_sales/Doors, passing through other columns (remainder)
transformer = ColumnTransformer([("one_hot", OneHotEncoder(), ["Make", "Colour", "Doors"])], remainder="passthrough")
### transformed features matrix (sklearn_encoded): numpy array, or scipy sparse matrix when the output is mostly zeros
sklearn_encoded = transformer.fit_transform(features)

pandas.DataFrame(data=sklearn_encoded)

<span style="color:navajowhite">
--- feature encoding concepts<br>
ColumnTransformer / OneHotEncoder creates individual columns for each category<br>
suppose a "Planets" column contains 4 categories: Mercury / Venus / Earth / Mars<br>
feature encoding deletes "Planets" column and creates 4 new columns, one per category<br>
where "Planets" contained Mercury, the "Mercury" column gets 1, the rest get 0, etc...
</span>
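
The "Planets" example above can be sketched in a few lines. The toy dataframe below is hypothetical (it is not part of the course data), and the dense conversion with `toarray()` is only there to make the result easy to view.

```python
import pandas
from sklearn.preprocessing import OneHotEncoder

### hypothetical toy dataframe with a single categorical column
planets = pandas.DataFrame({"Planets": ["Mercury", "Venus", "Earth", "Mars", "Earth"]})

### OneHotEncoder returns a sparse matrix by default, toarray() converts it to a dense numpy array
encoder = OneHotEncoder()
encoded = encoder.fit_transform(planets[["Planets"]]).toarray()

### one column per category, in the (sorted) order stored in encoder.categories_
planets_encoded = pandas.DataFrame(data=encoded, columns=encoder.categories_[0])
planets_encoded
```

Each row contains exactly one 1, in the column matching the original category.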

In [None]:
### pandas feature encoding --------------------------------------------------------------------------------------------

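### pandas dataframe (pandas_encoded), contains car_sales dataframe
### encodes car_sales/Make, car_sales/Colour, and car_sales/Doors
### columns must be listed explicitly, otherwise numeric columns such as Doors are left unencoded
pandas_encoded = pandas.get_dummies(car_sales, columns=["Make", "Colour", "Doors"])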

pandas_encoded

In [None]:
### splitting data: train / test ---------------------------------------------------------------------------------------

### sklearn_encoded >>> features_train, features_test
### target >>> target_train, target_test
### test set size is 20% of all data (test_size)
features_train, features_test, target_train, target_test = train_test_split(sklearn_encoded, target, test_size=0.2)

### features_train/features_test keep the type of sklearn_encoded, target_train/target_test are pandas series
target_train: pandas.Series; target_test: pandas.Series
features_train.shape, features_test.shape, target_train.shape, target_test.shape

In [None]:
### training and evaluating a machine learning model -------------------------------------------------------------------

### instantiating a random forest regressor (regressor)
regressor = RandomForestRegressor()
### fitting regressor to training dataset
regressor.fit(features_train, target_train)

### regressor r-squared scores (coefficient of determination) on training and test datasets
regressor.score(features_train, target_train), regressor.score(features_test, target_test)

#### Handling missing values with pandas

<span style="color:navajowhite">
--- options for handling missing values<br>
removing records containing missing value(s)<br>
filling missing value(s) (imputation)
</span>

In [None]:
### imports
import pandas

In [None]:
### loading dataframe
carsales_missing = pandas.read_csv("data-car-sales-missing.csv")
carsales_missing

In [None]:
### checking for missing values
carsales_missing.isna().sum()

In [None]:
### filling missing values
carsales_missing["Make"] = carsales_missing["Make"].fillna("Missing")
carsales_missing["Colour"] = carsales_missing["Colour"].fillna("Missing")
carsales_missing["Odometer (KM)"] = carsales_missing["Odometer (KM)"].fillna(carsales_missing["Odometer (KM)"].mean())
carsales_missing["Doors"] = carsales_missing["Doors"].fillna(4)
carsales_missing.isna().sum()

In [None]:
### removing rows (axis="index") with missing values
carsales_missing = carsales_missing.dropna(axis="index")
carsales_missing.isna().sum()

In [None]:
### resetting index, dropping old index column (drop=True)
carsales_missing = carsales_missing.reset_index(drop=True)
carsales_missing

#### Handling missing values with sklearn

In [None]:
### imports
import pandas
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

In [None]:
### loading dataframe
carsales_missing = pandas.read_csv("data-car-sales-missing.csv")
carsales_missing

In [None]:
### checking for missing values
carsales_missing.isna().sum()

In [None]:
### removing rows (axis="index") with missing price value (subset=["Price"])
carsales_missing = carsales_missing.dropna(axis="index", subset=["Price"]).reset_index(drop=True)
carsales_missing.isna().sum()

In [None]:
### instantiating imputers
### constant strategy fills with fill_value
### most_frequent strategy fills with most frequent value in column
### mean strategy fills with mean of column
category_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
door_imputer = SimpleImputer(strategy="most_frequent")
numeric_imputer = SimpleImputer(strategy="mean")

In [None]:
### instantiating column transformer, unspecified columns remain unchanged (remainder="passthrough")
transformer = ColumnTransformer([
    ("category_imputer", category_imputer, ["Make", "Colour"]),
    ("door_imputer", door_imputer, ["Doors"]),
    ("numeric_imputer", numeric_imputer, ["Odometer (KM)"])],
    remainder="passthrough")
### fitting transformers and transforming columns, returns numpy array
carsales_filled = transformer.fit_transform(carsales_missing)

In [None]:
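### converting numpy array > pandas dataframe, re-checking for missing values
### output columns follow the transformer order (Make, Colour, Doors, Odometer (KM)), then passthrough columns (Price)
carsales_filled = pandas.DataFrame(data=carsales_filled, columns=["Make", "Colour", "Doors", "Odometer (KM)", "Price"])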
carsales_filled.isna().sum()

## 2. Selecting Machine Learning Model

In [None]:
### imports
import numpy, pandas
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

In [None]:
### loading california housing dataset (regression problem)
housing_dict = fetch_california_housing()
### creating features dataframe
housing_df = pandas.DataFrame(data=housing_dict["data"], columns=housing_dict["feature_names"])
### adding target column
housing_df["MedHouseVal"] = housing_dict["target"]
housing_df

In [None]:
### splitting data features/target
features = housing_df.drop(columns="MedHouseVal")
target = housing_df.loc[:, "MedHouseVal"]
### splitting data train/test
numpy.random.seed(42)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)

<span style="color:navajowhite">
--- model selection<br>
model is selected from the sklearn "choosing the right estimator" map<br>
when the map suggests several models, each one should be tested
</span>
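
Testing several candidate models can be sketched as a simple loop. This is a minimal illustration, not the course's own procedure: the data is synthetic (`make_regression`), and the three candidates are assumptions chosen as plausible regressors, not a list taken from the map.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

### synthetic regression data (stand-in for a real dataset)
features, target = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

### candidate models (illustrative selection)
candidates = {
    "Ridge": Ridge(),
    "SVR (linear kernel)": SVR(kernel="linear"),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

### fitting each candidate on the training set and scoring it (r-squared) on the test set
scores = {}
for name, model in candidates.items():
    model.fit(features_train, target_train)
    scores[name] = model.score(features_test, target_test)
    print(f"{name}: {scores[name]:.3f}")
```

Because every sklearn estimator exposes the same `fit` / `score` interface, swapping candidates only requires changing the dictionary.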

In [None]:
### instantiating, training, and evaluating a model
regressor = Ridge()
regressor.fit(features_train, target_train)
regressor.score(features_test, target_test)

<span style="color:navajowhite">
--- model evaluation<br>
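the score of a ridge regressor (as for other sklearn regressors) is the coefficient of determination (r-squared)<br>
r-squared is the proportion of target variance explained by the model's predictions: 1.0 is a perfect fit, 0.0 matches always predicting the target mean, and negative values are worse than that baseline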
</span>
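
As a small worked example, r-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, then compared with `r2_score` from `sklearn.metrics`. The four values below are made up for illustration.

```python
import numpy
from sklearn.metrics import r2_score

### hypothetical true and predicted target values
target_true = numpy.array([3.0, 5.0, 7.0, 9.0])
target_predicted = numpy.array([2.5, 5.5, 7.0, 8.0])

### residual sum of squares: squared errors of the predictions
ss_residual = numpy.sum((target_true - target_predicted) ** 2)
### total sum of squares: squared deviations of the target from its mean
ss_total = numpy.sum((target_true - target_true.mean()) ** 2)

r2_manual = 1 - ss_residual / ss_total
r2_sklearn = r2_score(target_true, target_predicted)

### baseline: always predicting the mean of the target
r2_baseline = r2_score(target_true, numpy.full_like(target_true, target_true.mean()))

print(r2_manual, r2_sklearn, r2_baseline)
```

Here the residual sum of squares is 1.5 and the total sum of squares is 20, so r-squared is 1 - 1.5 / 20 = 0.925, while the mean-only baseline scores 0.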