# Introduction to Scikit-learn (sklearn)

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

What we are covering:

0. An end-to-end Scikit-Learn workflow
1. Data pre-processing
2. Choosing the right algorithm for the problem
3. Fitting the model and using it to make predictions
4. Evaluating the model
5. Improving the model
6. Saving and loading a trained model
7. Putting it all together

## 0. An end-to-end scikit-learn workflow

In [None]:
%load_ext cudf.pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
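As a rough preview of the steps listed at the top, here is a minimal sketch of the whole workflow. It does not use the notebook's datasets: the synthetic data from `make_classification`, the choice of `RandomForestClassifier`, the `wf_`-prefixed names, and the use of `pickle` for saving are all assumptions made for illustration.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Get data as features (X) and labels (y); synthetic here, purely for illustration
wf_X, wf_y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hold out 20% of the rows for evaluation
wf_X_train, wf_X_test, wf_y_train, wf_y_test = train_test_split(
    wf_X, wf_y, test_size=0.2, random_state=42
)

# 2-3. Choose a model, fit it on the training set, and predict on the test set
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(wf_X_train, wf_y_train)
wf_preds = clf.predict(wf_X_test)

# 4. Evaluate: for classifiers, score() returns mean accuracy
wf_accuracy = clf.score(wf_X_test, wf_y_test)
print(f"Test accuracy: {wf_accuracy:.2f}")

# 6. Save and load: round-trip the fitted model through pickle (in memory here)
restored = pickle.loads(pickle.dumps(clf))
print(np.array_equal(restored.predict(wf_X_test), wf_preds))
```

Step 5 (improving the model) is left out of this sketch; it would typically mean trying different hyperparameters and comparing scores.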

## 1. Data Pre-Processing

Three main steps:
1. Split the data into features and labels (usually `X` and `y`)
2. Fill (also called impute) or remove missing values
3. Convert non-numeric values to numeric values (also called feature encoding)

In [None]:
heart_disease = pd.read_csv("/home/rapids/workspace/ML_Course/ztm-ml/data/heart-disease.csv")
heart_disease

In [None]:
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
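To see what `train_test_split` does with `test_size=0.2` and a fixed `random_state`, the following sketch uses a made-up ten-row frame. The `toy_X`, `toy_y`, and `X_tr`/`X_te` names are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny, made-up dataset: 10 rows, one feature, a binary target
toy_X = pd.DataFrame({"feature": np.arange(10)})
toy_y = pd.Series(np.arange(10) % 2, name="target")

# test_size=0.2 keeps 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)          # (8, 1) (2, 1)
print(X_te.index.equals(X_te2.index))  # True
```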

In [None]:
X_train.info()

In [None]:
car_sales = pd.read_csv('/home/rapids/workspace/ML_Course/ztm-ml/data/car-sales-extended.csv')

In [None]:
car_sales.info()

In [None]:
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
# model.fit(X_train, y_train)  # fails: string columns such as Make ("Toyota") are not yet encoded as numbers


In [None]:
X.info()

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

In [None]:
# one_hot = OneHotEncoder()
# transformer = ColumnTransformer([("one_hot", 
#                                  one_hot, 
#                                  categorical_features)],
#                                  remainder="passthrough")


# transformed_X = transformer.fit_transform(X)
# transformed_X.head(5)

In [None]:
transformed_X = pd.get_dummies(X, columns=categorical_features, drop_first=True)
transformed_X.head(5)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2, random_state=42)

In [None]:
model = RandomForestRegressor()
model.fit(X_train, y_train)  # works now that the categorical columns are encoded as numbers

In [None]:
model.score(X_test, y_test)
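For regressors such as `RandomForestRegressor`, `score` returns the coefficient of determination (R^2) on the data passed in, so it should match `sklearn.metrics.r2_score` applied to the model's predictions. The sketch below checks this on synthetic data; the `demo_` and `d`-prefixed names and the data-generating formula are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: a linear signal plus a little noise
rng = np.random.default_rng(42)
demo_X = rng.uniform(0, 10, size=(200, 2))
demo_y = 3.0 * demo_X[:, 0] - 2.0 * demo_X[:, 1] + rng.normal(0, 0.5, size=200)

dX_train, dX_test, dy_train, dy_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(dX_train, dy_train)

# score() on a regressor and r2_score on its predictions are the same quantity
score_value = demo_model.score(dX_test, dy_test)
r2_value = r2_score(dy_test, demo_model.predict(dX_test))
print(np.isclose(score_value, r2_value))  # True
```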

## What if there were missing values?
1. Fill them with some value (also called imputation)
2. Remove the samples with missing values altogether
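Both options can be sketched on a made-up frame. Scikit-Learn's `SimpleImputer` is used for the filling here as one possible approach (the cells that follow use pandas `fillna` instead); the `toy` frame and the `cat_imputer`/`num_imputer` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows with one missing category and one missing number
toy = pd.DataFrame({
    "Make": pd.Series(["Toyota", np.nan, "Honda", "BMW"], dtype=object),
    "Odometer (KM)": [10000.0, 20000.0, np.nan, 30000.0],
})

# Option 2: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 1: fill categories with a placeholder and numbers with the column mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
num_imputer = SimpleImputer(strategy="mean")

filled = toy.copy()
filled[["Make"]] = cat_imputer.fit_transform(toy[["Make"]])
filled[["Odometer (KM)"]] = num_imputer.fit_transform(toy[["Odometer (KM)"]])

print(len(dropped))                 # 2 rows survive dropping
print(filled.isna().sum().sum())    # 0 missing values after filling
```

Dropping is simpler but loses data; filling keeps every row at the cost of inserting values that were never observed. In a real project the imputers would usually be fitted on the training split only, so that statistics from the test set do not leak into training.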

In [None]:
# Import the car sales data that contains missing values

car_sales_missing = pd.read_csv("/home/rapids/workspace/ML_Course/ztm-ml/data/car-sales-extended-missing-data.csv")

In [None]:
car_sales_missing.isna().sum()

In [None]:
# Split into features and labels, then try to convert the data to numeric values
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [None]:
X[X.isna().any(axis=1)]

In [None]:
# This runs, but the result is misleading: rows with a missing category get all-zero dummy columns,
# and missing values in the numeric columns that are not encoded are left as NaN
categorical_features = ["Make", "Colour", "Doors"]
transformed_X = pd.get_dummies(X, columns=categorical_features, drop_first=True)
transformed_X

In [None]:
transformed_X.isna().sum()

In [None]:
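# Option 1 - Fill missing data with pandas
# Assign the result back rather than calling fillna(inplace=True) on a selected column,
# which newer pandas versions warn about and which may not modify the DataFrame

car_sales_missing['Make'] = car_sales_missing['Make'].fillna("missing")
car_sales_missing['Colour'] = car_sales_missing['Colour'].fillna("missing")
car_sales_missing['Odometer (KM)'] = car_sales_missing['Odometer (KM)'].fillna(car_sales_missing['Odometer (KM)'].mean())
car_sales_missing['Doors'] = car_sales_missing['Doors'].fillna(4)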

In [None]:
car_sales_missing.isna().sum()

In [None]:
# Option 2 - Drop rows with missing data
# This also removes rows whose label (Price) is missing, which cannot be used for training
car_sales_missing.dropna(inplace=True)

In [None]:
car_sales_missing.isna().sum()

In [None]:
# With the missing values handled, split into features and labels again
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [None]:
# No missing values remain, so encoding the categorical columns now gives a fully numeric table
categorical_features = ["Make", "Colour", "Doors"]
transformed_X = pd.get_dummies(X, columns=categorical_features, drop_first=True)
transformed_X

In [None]:
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2, random_state=42)

In [None]:
y_train

In [None]:
model = RandomForestRegressor()
model.fit(X_train, y_train)  # works: the data is fully numeric and has no missing values

In [None]:
model.score(X_test, y_test)