In [1]:
import warnings
warnings.filterwarnings('ignore')

# Machine learning steps summary

Let's have a short, detailed review of some of the steps you have to go through in a machine learning project:

- defining the problem
- collecting and exploring data
- **cleaning/preparing data**
- **training and optimizing model**
- **model evaluation**
- model deployment



## I. Cleaning and preparing the data

The things to do in data cleaning and preparation:
1. Split the data

Then on the train set:
2. Handle missing values, duplicates, outliers (if any)
3. Rescale quantitative features
4. Encode qualitative features

Finally on the test set:
5. Apply handling of missing values... performed on train set
6. Apply rescaling performed on train set
7. Encode qualitative features

# II. Model training and optimization

The steps in the model training and optimization:
1. Train and optimize model on the train dataset, using grid search and cross validation
2. Look for high bias or high variance

In case of high variance (i.e. overfitting):
- Try either to add regularization, add samples, remove features
In case of high bias (i.e. underfitting):
- Try either to remove regularization, add features, more complex model

Then, if features are changed, go back to data preparation and iterate.

If features remain unchanged, go back to model training and optimization and iterate.

# III. Model evaluation

Once both the model and the hyperparameters are optimized: evaluate your model on the test set.

Using the evaluation metric, compute the evaluation on the test set: this is supposed to be your final result.

# IV. Example on census dataset

Let's try to apply this on the [Census income dataset](https://archive.ics.uci.edu/ml/datasets/census+income).

The goal is to predict whether income is above of below k$ per year, based on personal information.

In [2]:
import pandas as pd
import numpy as np

headers_col = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', \
               'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'target']

df = pd.read_csv('adult.data', names=headers_col, index_col=False)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## IV.1. Cleaning and preparing the data

First, do we have duplicates or missing values? We might need to handle them after the split, but can have a look now:

In [3]:
# Check for missing values
df.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
target            0
dtype: int64

No missing values. 

What about duplicates?

In [4]:
df.duplicated().sum()

24

We do have duplicates, for simplicity, we will drop them now.

In [5]:
df = df.drop_duplicates()

### Split the data

We can now choose the features (here we can take them all for simplicity) and then split the dataset:

In [6]:
# Compute the X and y
X = df.drop('target', axis=1)
y = df['target']
X.shape, y.shape

((32537, 14), (32537,))

In [7]:
# Split the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

### Prepare the data

Let's first deal with quantitative data:

In [8]:
# Retrieve the quantitative features, though education is arguable quantitative
quanti_columns = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
quanti_feats_train = X_train[quanti_columns]
quanti_feats_test = X_test[quanti_columns]

In [9]:
from sklearn.preprocessing import StandardScaler
# Rescale the quantitative features
scaler = StandardScaler()
quanti_feats_train = scaler.fit_transform(quanti_feats_train)
quanti_feats_test = scaler.transform(quanti_feats_test)

Let's deal now with qualitative data:

In [10]:
# Retrieve the qualitative features
quali_columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
quali_feats_train = X_train[quali_columns]
quali_feats_test = X_test[quali_columns]

In [11]:
# One hot encode the qualitative features
quali_feats_train = pd.get_dummies(quali_feats_train, drop_first=True)
quali_feats_test = pd.get_dummies(quali_feats_test, drop_first=True)

One should check the features are identical in both train and test:

In [12]:
# Are there missing columns?
missing_cols = set( quali_feats_train.columns ) - set( quali_feats_test.columns )
missing_cols

{'native-country_ Holand-Netherlands',
 'native-country_ Honduras',
 'workclass_ Never-worked'}

Three columns never appear in the test dataset, we have to add them manually:

In [13]:
# Add zeros for those columns
for c in missing_cols:
    quali_feats_test[c] = 0
# Check there are no more missing columns
print('number of missing columns:', len(set( quali_feats_train.columns ) - set( quali_feats_test.columns )))

number of missing columns: 0


We should now concatenate all of this, to have our final train and test sets:

In [14]:
# Concatenate
X_train = np.concatenate([quanti_feats_train, quali_feats_train], axis=1)
X_test = np.concatenate([quanti_feats_test, quali_feats_test], axis=1)
# Check the shape is the expected one
X_train.shape, X_test.shape

((26029, 99), (6508, 99))

## IV.2. Train and optimize a model

We will use a logistic regression and optimize it with grid search and cross validation.

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters we want to test
param_grid = {'penalty': ['l2', 'none'], 
              'C': [0.01, 0.03, 0.1, 0.3, 1.]}
# Define the gridsearch object
grid = GridSearchCV(LogisticRegression(),
                    param_grid,
                    scoring='accuracy',
                    cv=5,
                    return_train_score=True
                   )
# Fit and wait
grid.fit(X_train, y_train)
# Print the best score
print('best score:', grid.best_score_)
# Print the best hyperparams
print('best score:', grid.best_params_)

best score: 0.8532403644210127
best score: {'C': 1.0, 'penalty': 'l2'}


We can look for overfitting:

In [16]:
# Display mean train accuracy
print(grid.cv_results_['mean_train_score'])
# Display mean valid accuracy
print(grid.cv_results_['mean_test_score'])

[0.84928349 0.85434514 0.85162702 0.85434514 0.85300049 0.85434514
 0.85382649 0.85434514 0.85419146 0.85434514]
[0.84916807 0.85251042 0.85085848 0.85251042 0.85170365 0.85251042
 0.85316355 0.85251042 0.85324036 0.85251042]


It seems there is not much overfitting here: values are quite similar in train and validation sets.

## IV.3. Evaluate the model

Finally, we can evaluate the model on the test dataset:

In [17]:
from sklearn.metrics import accuracy_score

y_pred = grid.predict(X_test)
print('Accuracy score:', accuracy_score(y_test, y_pred))

Accuracy score: 0.7988629379225568
