<a href="https://colab.research.google.com/github/tabaraei/CheatSheet/blob/master/notebooks/Scikit-Learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set up the environment

Please ensure that you have the latest scikit-learn version.
Run `!pip install --upgrade scikit-learn` if necessary.

In [1]:
# restart runtime after running the following command
!pip install --upgrade scikit-learn



In [2]:
import pandas as pd
import numpy as np

## Data configuration

### 1- Load the dataset

#### 1-1- Regression data:

In [3]:
from sklearn.datasets import load_diabetes

reg_dataframe = load_diabetes(as_frame=True).frame
reg_dataframe.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0


#### 1-2- Classification data

In [4]:
from sklearn.datasets import fetch_openml

cls_dataframe = fetch_openml(name='adult', version=2, as_frame=True).frame
cls_dataframe.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25.0,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K
1,38.0,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K
2,28.0,Local-gov,336951.0,Assoc-acdm,12.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K
3,44.0,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K
4,18.0,,103497.0,Some-college,10.0,Never-married,,Own-child,White,Female,0.0,0.0,30.0,United-States,<=50K


### 2- Handling missing values

#### 2-1- Remove rows with missing values

In [5]:
print(reg_dataframe.shape, cls_dataframe.shape)
reg_dataframe.dropna(axis=0, inplace=True)
cls_dataframe.dropna(axis=0, inplace=True)
print(reg_dataframe.shape, cls_dataframe.shape)

(442, 11) (48842, 15)
(442, 11) (45222, 15)


#### 2-2- Remove columns with missing values

In [6]:
reg_cols_with_missing = [col for col in reg_dataframe.columns if reg_dataframe[col].isnull().any()]
reg_dataframe.drop(reg_cols_with_missing, axis=1, inplace=True)

cls_cols_with_missing = [col for col in cls_dataframe.columns if cls_dataframe[col].isnull().any()]
cls_dataframe.drop(cls_cols_with_missing, axis=1, inplace=True)

reg_cols_with_missing, cls_cols_with_missing

([], [])
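
Dropping rows or columns discards data. An alternative is to impute the missing values. The following is a minimal sketch using `SimpleImputer` on a small made-up frame; the `toy` frame and its values are illustrative assumptions, not part of the datasets loaded above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 44.0, 18.0],
    "workclass": pd.Series(["Private", "Private", np.nan, "Local-gov"], dtype=object),
})

# Numeric column: replace missing entries with the column median
num_imputer = SimpleImputer(strategy="median")
age_filled = num_imputer.fit_transform(toy[["age"]])

# Categorical column: replace missing entries with the most frequent value
cat_imputer = SimpleImputer(strategy="most_frequent")
workclass_filled = cat_imputer.fit_transform(toy[["workclass"]])

print(age_filled.ravel(), workclass_filled.ravel())
```

In a real workflow the imputer would typically be fitted on the training split only and then applied to the test split with `transform`.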

### 3- Train Test Split

#### 3-1- Split target class from features

In [7]:
y_reg = reg_dataframe['target']
X_reg = reg_dataframe.drop(['target'], axis='columns')

y_cls = cls_dataframe['class']
X_cls = cls_dataframe.drop(['class'], axis='columns')

#### 3-2- Split training data from test data

In [8]:
from sklearn.model_selection import train_test_split

X_train_reg, X_test_reg, y_train_reg, y_test_reg = \
    train_test_split(X_reg, y_reg, random_state=1, test_size=0.1)
    
X_train_cls, X_test_cls, y_train_cls, y_test_cls = \
    train_test_split(X_cls, y_cls, random_state=1, test_size=0.1)
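
For classification targets with imbalanced classes, a purely random split can shift the class proportions in a small test set. A possible variant is to pass `stratify`, sketched here on made-up labels (`X_toy` and `y_toy` are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=1, stratify=y_toy
)

# Class counts in the 10-sample test set follow the original 80/20 ratio
print(np.bincount(y_te))
```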

#### 3-3- Avoid SettingWithCopyWarning

In [9]:

X_train_reg, X_test_reg, y_train_reg, y_test_reg = \
    X_train_reg.copy(), X_test_reg.copy(), y_train_reg.copy(), y_test_reg.copy()

X_train_cls, X_test_cls, y_train_cls, y_test_cls = \
    X_train_cls.copy(), X_test_cls.copy(), y_train_cls.copy(), y_test_cls.copy()

## Preprocessing

### 1- Handling categorical features

In [10]:
# select the categorical columns, then split them by number of unique values
categories = X_train_cls.select_dtypes(include='category')
low_cardinality_cols = [col for col in categories.columns if categories[col].nunique() < 7]
high_cardinality_cols = list(set(categories.columns) - set(low_cardinality_cols))

low_cardinality_cols, high_cardinality_cols

(['relationship', 'race', 'sex'],
 ['occupation', 'marital-status', 'workclass', 'education', 'native-country'])

#### 1-1- Ordinal Encoder

In [11]:
X_train_cls[high_cardinality_cols].head(3)

Unnamed: 0,occupation,marital-status,workclass,education,native-country
917,Machine-op-inspct,Married-civ-spouse,Private,Some-college,United-States
28178,Sales,Married-civ-spouse,Private,Some-college,United-States
46975,Handlers-cleaners,Married-civ-spouse,Private,HS-grad,United-States


In [12]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
X_train_cls[high_cardinality_cols] = encoder.fit_transform(X_train_cls[high_cardinality_cols].copy())
X_test_cls[high_cardinality_cols] = encoder.transform(X_test_cls[high_cardinality_cols].copy())

In [13]:
X_train_cls[high_cardinality_cols].head(3)

Unnamed: 0,occupation,marital-status,workclass,education,native-country
917,6.0,2.0,2.0,15.0,38.0
28178,11.0,2.0,2.0,15.0,38.0
46975,5.0,2.0,2.0,11.0,38.0
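
By default, `OrdinalEncoder` raises an error if the test data contains a category that was not seen during `fit`. One way to handle this, sketched below on made-up data (the `train` and `test` frames are illustrative assumptions), is to map unseen categories to a sentinel value:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "Doctorate" appears only in the test frame
train = pd.DataFrame({"education": ["HS-grad", "Some-college", "11th"]})
test = pd.DataFrame({"education": ["HS-grad", "Doctorate"]})

# Encode unseen categories as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)
encoded_test = enc.transform(test)

print(enc.categories_)
print(encoded_test.ravel())
```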


#### 1-2- One Hot Encoder

In [14]:
X_train_cls[low_cardinality_cols].head(3)

Unnamed: 0,relationship,race,sex
917,Husband,White,Male
28178,Husband,White,Male
46975,Husband,White,Male


In [15]:
from sklearn.preprocessing import OneHotEncoder

# get dataframe of one-hot encoding on low cardinality columns
# sparse_output replaces the removed `sparse` argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
one_hot_train = pd.DataFrame(encoder.fit_transform(X_train_cls[low_cardinality_cols]))
one_hot_test = pd.DataFrame(encoder.transform(X_test_cls[low_cardinality_cols]))

# fix indices of one-hot encoder before merging
one_hot_train.index = X_train_cls.index
one_hot_test.index = X_test_cls.index

# assign column names (get_feature_names_out replaces the removed get_feature_names)
one_hot_train.columns = encoder.get_feature_names_out()
one_hot_test.columns = encoder.get_feature_names_out()

# Remove categorical columns
X_train_cls = X_train_cls.drop(low_cardinality_cols, axis=1)
X_test_cls = X_test_cls.drop(low_cardinality_cols, axis=1)

# Add one-hot encoded columns to numerical features
X_train_cls = pd.concat([X_train_cls, one_hot_train], axis=1)
X_test_cls = pd.concat([X_test_cls, one_hot_test], axis=1)

In [16]:
one_hot_train.head(3)

Unnamed: 0,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Female,sex_Male
917,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
28178,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
46975,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


#### 1-3- Remove Categorical features and replace with preprocessed ones

### 2- Numeric features

In [17]:
# from sklearn.preprocessing import StandardScaler

# scaler = StandardScaler()
# scaler.fit(data_train)
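
A minimal runnable sketch of standardization, assuming small made-up arrays in place of the undefined `data_train` (the `train` and `test` arrays are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training data only
test_scaled = scaler.transform(test)        # reuse the training statistics on the test data

print(scaler.mean_)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

Fitting on the training data alone avoids leaking test-set statistics into the preprocessing step.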

## Model

Activate the diagram display to see the visualized model.

In [18]:
from sklearn import set_config
set_config(display='diagram')
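
The diagram display is most informative for composite estimators. As a sketch, assuming synthetic data (`X_toy` and `y_toy` are illustrative, not one of the datasets above), a two-step pipeline can be built and fitted like this; in a notebook, evaluating `model` as the last expression of a cell renders its diagram.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: the label depends only on the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Scaling followed by a classifier, chained into a single estimator
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_toy, y_toy)

print(list(model.named_steps))
```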

### 1- Regression

#### 1-1- Linear Regression

In [19]:
from sklearn.linear_model import LinearRegression

X, y = X_train_reg, y_train_reg
clf = LinearRegression()
clf.fit(X, y)

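#### 1-2- Logistic Regression

Despite its name, logistic regression is a classifier, so it is fitted on the classification data rather than on the regression data.

In [20]:
from sklearn.linear_model import LogisticRegression

X, y = X_train_cls.select_dtypes(exclude='category'), y_train_cls
clf = LogisticRegression(random_state=0, max_iter=1000)
clf.fit(X, y)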

### 2- Classification

#### 2-1- Decision Tree

In [21]:
from sklearn.tree import DecisionTreeClassifier

X, y = X_train_cls.select_dtypes(exclude='category'), y_train_cls
clf = DecisionTreeClassifier()
clf.fit(X, y)

## Evaluation

Predict on the test set and evaluate the model's performance. First, let's create a model:

In [22]:
X, y, X_test, y_test = X_train_reg, y_train_reg, X_test_reg, y_test_reg
clf = LinearRegression()
clf.fit(X, y)

### 1- Predict

In [23]:
y_pred = clf.predict(X_test)
y_pred[:5]

array([122.41932407, 111.48365055, 184.18558652,  68.52087272,
       171.44221745])

In [24]:
# exact-match rate: meaningful for class labels, but almost always 0 for continuous predictions
(y_test == y_pred).mean()

0.0
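
The exact-match rate is the accuracy metric for classifiers, which is why it is uninformative for a regressor. A minimal sketch of the classification case follows, assuming the bundled iris dataset (not used elsewhere in this notebook) to keep the example self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for a classification dataset
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=1, stratify=y_iris
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_hat = tree.predict(X_te)

# The manual exact-match rate and accuracy_score compute the same quantity
manual_accuracy = (y_te == y_hat).mean()
library_accuracy = accuracy_score(y_te, y_hat)
print(manual_accuracy, library_accuracy)
```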

In [25]:
# for regressors, score() returns the coefficient of determination (R^2), not accuracy
r2 = clf.score(X_test, y_test)
print(f"The test R^2 score using a {clf.__class__.__name__} is {r2:.3f}")

The test R^2 score using a LinearRegression is 0.317
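
For a regressor, `score` returns the coefficient of determination (R²). Error metrics in the units of the target are often reported alongside it. The sketch below recomputes the split from the diabetes data so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, random_state=1, test_size=0.1)

reg = LinearRegression().fit(X_tr, y_tr)
y_hat = reg.predict(X_te)

mae = mean_absolute_error(y_te, y_hat)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_te, y_hat))  # penalizes large errors more strongly
r2 = r2_score(y_te, y_hat)                       # same quantity as reg.score(X_te, y_te)

print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R^2={r2:.3f}")
```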
