## Machine Learning Model Pipeline: Wrapping up for Deployment


Here we summarise the key pieces of code that we need to carry forward to put this project's model into production.



In [2]:
# to handle datasets
import pandas as pd
import numpy as np

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import MinMaxScaler

# to build the models
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

# to evaluate the models
from sklearn.metrics import mean_squared_error
from math import sqrt

# to persist the model and the scaler
# (sklearn.externals.joblib has been removed from recent scikit-learn releases)
import joblib

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)



### Setting the seed


**Important: always set the seed**, so that every step with a random component gives the same result on each run.


In [3]:
SEED = 123
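
A minimal sketch of what setting the seed buys us, assuming the randomness comes from Python's `random` module and NumPy's global generator. The `draw` helper is hypothetical and only there for illustration: re-seeding with the same value reproduces the same draws.

```python
import random
import numpy as np

SEED = 123

def draw(seed):
    # hypothetical helper: seed both generators, then draw a few random numbers
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3)

py_a, np_a = draw(SEED)
py_b, np_b = draw(SEED)

# same seed -> identical draws
print(py_a == py_b, np.array_equal(np_a, np_b))
```

scikit-learn functions and estimators take the seed through their `random_state` argument instead, which is how `SEED` is used in the rest of this notebook.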

## Load data

We need the training data to train our model in the production environment. 

In [4]:
# load dataset
data = pd.read_csv('../../data/interim/raw_useful_ftrs.csv')
print(data.shape)
data.head()

(329275, 14)


| | code_postal | date_mutation | id_parcelle | latitude | longitude | nature_culture | nom_commune | nombre_lots | nombre_pieces_principales | numero_disposition | surface_reelle_bati | surface_terrain | type_local | valeur_fonciere |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9800.0 | 2019-09-06 | 092630000A0991 | 42.936027 | 0.931099 | jardins | Saint-Jean-du-Castillonnais | 0 | NaN | 1 | NaN | 245.0 | NaN | 167220.0 |
| 1 | 78450.0 | 2019-12-13 | 78674000ZL0016 | 48.840698 | 2.000984 | prés | Villepreux | 0 | NaN | 1 | NaN | 4050.0 | NaN | 1007200.0 |
| 2 | 78440.0 | 2019-03-05 | 783170000B0236 | 49.051101 | 1.843135 | taillis sous futaie | Jambville | 0 | NaN | 1 | NaN | 56465.0 | NaN | 320000.0 |
| 3 | 44550.0 | 2019-10-30 | 44103000AB0121 | 47.327899 | -2.153218 | sols | Montoir-de-Bretagne | 0 | 0.0 | 1 | 60.0 | 98.0 | Local industriel. commercial ou assimilé | 243420.0 |
| 4 | 11000.0 | 2019-04-20 | 11069000AE0372 | 43.218986 | 2.346464 | sols | Carcassonne | 0 | 5.0 | 1 | 106.0 | 1223.0 | Maison | 131500.0 |


In [161]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 329275 entries, 0 to 329274
Data columns (total 14 columns):
code_postal                  329143 non-null float64
date_mutation                329275 non-null object
id_parcelle                  329275 non-null object
latitude                     322415 non-null float64
longitude                    322415 non-null float64
nature_culture               243611 non-null object
nom_commune                  329275 non-null object
nombre_lots                  329275 non-null int64
nombre_pieces_principales    180225 non-null float64
numero_disposition           329275 non-null int64
surface_reelle_bati          141350 non-null float64
surface_terrain              243602 non-null float64
type_local                   180434 non-null object
valeur_fonciere              329275 non-null float64
dtypes: float64(7), int64(2), object(5)
memory usage: 35.2+ MB


Recode `code_postal` as an object (categorical) column. This step belongs in the EDA / preprocessing script.

In [9]:
# plain column assignment replaces the column, so the new dtype is kept
data['code_postal'] = data['code_postal'].astype('object')
data.info()
data.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 329275 entries, 0 to 329274
Data columns (total 14 columns):
code_postal                  329143 non-null object
date_mutation                329275 non-null object
id_parcelle                  329275 non-null object
latitude                     322415 non-null float64
longitude                    322415 non-null float64
nature_culture               243611 non-null object
nom_commune                  329275 non-null object
nombre_lots                  329275 non-null int64
nombre_pieces_principales    180225 non-null float64
numero_disposition           329275 non-null int64
surface_reelle_bati          141350 non-null float64
surface_terrain              243602 non-null float64
type_local                   180434 non-null object
valeur_fonciere              329275 non-null float64
dtypes: float64(6), int64(2), object(6)
memory usage: 35.2+ MB


(329275, 14)

In [8]:
# load the table that flags which features to use
features_csv = pd.read_csv('../../data/interim/features_to_use_summary.csv', index_col=0)
print(features_csv.shape)
features_csv.use_ftr.head(40)

(40, 15)


adresse_code_voie               False
adresse_nom_voie                False
adresse_numero                  False
adresse_suffixe                 False
ancien_code_commune             False
ancien_id_parcelle              False
ancien_nom_commune              False
code_commune                    False
code_departement                False
code_nature_culture             False
code_nature_culture_speciale    False
code_postal                      True
code_type_local                 False
date_mutation                    True
id_mutation                     False
id_parcelle                      True
latitude                         True
longitude                        True
lot1_numero                     False
lot1_surface_carrez             False
lot2_numero                     False
lot2_surface_carrez             False
lot3_numero                     False
lot3_surface_carrez             False
lot4_numero                     False
lot4_surface_carrez             False
lot5_numero 

## Separate dataset into train and test

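Before we engineer any features, we separate the data into a training and a testing set. Every parameter learned during feature engineering (imputation values, frequent labels, scaling ranges) must come from the training set only, so that the test set gives an honest estimate of performance and over-fitting does not go unnoticed. The split is random, so remember to set the seed.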

In [164]:
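# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)
# Note: the target stays in X_train / X_test for now; it is needed below
# for the log transform and the encoding of the categorical variables
X_train, X_test, y_train, y_test = train_test_split(data, data.valeur_fonciere,
                                                    test_size=0.1,
                                                    random_state=SEED) # we are setting the seed here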
X_train.shape, X_test.shape

((296347, 14), (32928, 14))
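
As a quick illustration of why the seed matters here, the sketch below splits a small toy dataframe twice with the same `random_state` and checks that the same rows land in the test set. The toy data is made up for the example; only `train_test_split` and its arguments come from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 123

# toy data, for illustration only
toy = pd.DataFrame({'x': np.arange(20), 'y': np.arange(20) * 2.0})

split_a = train_test_split(toy, toy['y'], test_size=0.1, random_state=SEED)
split_b = train_test_split(toy, toy['y'], test_size=0.1, random_state=SEED)

# the same random_state yields the same rows in the test set
same_test_rows = split_a[1].index.equals(split_b[1].index)
print(same_test_rows, split_a[0].shape, split_a[1].shape)
```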

## Selected features

Remember that we will deploy our model utilising only a subset of features, the most predictive ones. This is to make simpler models, so that we build simpler code for deployment. We will tell you more about this in coming lectures.

In [165]:
# load selected features
features = list(features_csv.loc[features_csv.use_ftr==True, :].index)
...
features

print('Number of features: ', len(features))
# Remember: any extra engineered feature (e.g. distance_to_big_city) would need its feature engineering step added to the production pipeline as well

Number of features:  13


### Missing values

For categorical variables, we will fill missing information by adding an additional category: "missing"

In [166]:
# make a list of the categorical variables that contain missing values
vars_with_na = [var for var in features if X_train[var].isnull().sum()>0 and X_train[var].dtypes=='O']

# print the variable name and the percentage of missing values
# for var in vars_with_na:
#     print(var, np.round(X_train[var].isnull().mean(), 3),  ' % missing values')

Some categorical variables in the final model still contain NA, so the deployment pipeline needs to include this piece of feature engineering logic to impute them.

In [167]:
# The next function can go in the feature engineering notebook:

# function to replace NA in categorical variables
def fill_categorical_na(df, var_list):
    X = df.copy()
    X[var_list] = df[var_list].fillna('Missing')
    return X

# replace missing values with new label: "Missing"
X_train = fill_categorical_na(X_train, vars_with_na)
X_test = fill_categorical_na(X_test, vars_with_na)

# check that we have no missing information in the engineered variables
X_train[vars_with_na].isnull().sum()

code_postal       0
nature_culture    0
type_local        0
dtype: int64

For numerical variables, we replace the missing information in the original variable with the mode, or most frequent value. The additional binary variable that flags missing values was not selected for the final model, so it is not created here:

In [168]:
# make a list of the numerical variables that contain missing values
vars_with_na = [var for var in features if X_train[var].isnull().sum()>0 and X_train[var].dtypes!='O']

# print the variable name and the share of missing values
# (the number is a fraction, not a percentage: 0.021 means 2.1 %)
for var in vars_with_na:
    print(var, np.round(X_train[var].isnull().mean(), 3),  ' % missing values')

latitude 0.021  % missing values
longitude 0.021  % missing values
nombre_pieces_principales 0.452  % missing values
surface_reelle_bati 0.571  % missing values
surface_terrain 0.261  % missing values


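It seems, however, that `surface_reelle_bati` and `surface_terrain` should be imputed with zero, while the other variables can be imputed with the median. For now, the imputation loop in this notebook does not use this list and fills every variable with the mode.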

In [169]:
vars_with_na_at_zero = ['surface_reelle_bati', 'surface_terrain']
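
A minimal sketch of how such a per-variable strategy could look, assuming zero for the variables in `vars_with_na_at_zero` and the training median for the rest. The `build_imputation_dict` helper and the toy data are hypothetical and are not part of the pipeline in this notebook.

```python
import numpy as np
import pandas as pd

# toy training data, for illustration only
train = pd.DataFrame({
    'surface_reelle_bati': [60.0, np.nan, 106.0, np.nan],
    'surface_terrain': [245.0, 4050.0, np.nan, 98.0],
    'latitude': [42.9, np.nan, 49.1, 47.3],
})

vars_with_na_at_zero = ['surface_reelle_bati', 'surface_terrain']

def build_imputation_dict(df, variables, zero_vars):
    # hypothetical helper: 0 for the listed variables, training median otherwise
    return {var: 0.0 if var in zero_vars else float(df[var].median())
            for var in variables}

impute_values = build_imputation_dict(train, list(train.columns), vars_with_na_at_zero)

# fillna accepts a dict that maps each column to its fill value
train_imputed = train.fillna(value=impute_values)
print(impute_values)
```

Keeping the values in one dictionary means the same object can be persisted and reused on the test set and in production.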

#### Important: persisting the value used for NA imputation

As you will see in future sections, one of the key pieces of deploying the model is "Model Validation". Model validation refers to corroborating that the deployed model and the model built during research are identical. The entire pipeline needs to produce identical results.

Therefore, in order to check at the end of the process that the feature engineering pipelines are identical, we will save (persist) the value used to impute each variable, here the mode, so that we can use it at the end to corroborate our models.

In [170]:
# replace the missing values

# note: despite its name, this dictionary stores the mode of each variable
mean_var_dict = {}

for var in vars_with_na:

    # calculate the mode on the train set only
    mode_val = X_train[var].mode()[0]

    # we persist the mode in the dictionary
    mean_var_dict[var] = mode_val

    # train
    # note that the additional binary variable was not selected, so we don't need this step any more
    #X_train[var+'_na'] = np.where(X_train[var].isnull(), 1, 0)
    # assign the result back instead of using inplace=True on a column selection
    X_train[var] = X_train[var].fillna(mode_val)

    # test
    # note that the additional binary variable was not selected, so we don't need this step any more
    #X_test[var+'_na'] = np.where(X_test[var].isnull(), 1, 0)
    X_test[var] = X_test[var].fillna(mode_val)

# we save the dictionary for later
np.save('../../models/mean_var_dict.npy', mean_var_dict)

# check that we have no more missing values in the engineered variables
X_train[vars_with_na].isnull().sum()

latitude                     0
longitude                    0
nombre_pieces_principales    0
surface_reelle_bati          0
surface_terrain              0
dtype: int64
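
To use the persisted dictionary later, it has to be read back from the `.npy` file. The sketch below shows the round trip on a temporary file with made-up values; `np.save` pickles the dictionary inside a 0-d object array, so `np.load` needs `allow_pickle=True` and `.item()` to recover it. Only load pickled files that you trust.

```python
import os
import tempfile
import numpy as np

# illustrative values only, not the ones computed above
mean_var_dict = {'latitude': 48.85, 'nombre_pieces_principales': 0.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mean_var_dict.npy')
    np.save(path, mean_var_dict)  # the dict is pickled inside a 0-d object array

    # allow_pickle=True is required to read object arrays back;
    # .item() unwraps the 0-d array into the original dict
    loaded = np.load(path, allow_pickle=True).item()

print(loaded == mean_var_dict)
```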

### Temporal variables

Temporal features, such as date transformations, can be added here. None are used in the current pipeline, so the template code in the next two cells stays commented out.

In [171]:
# # create the temporal var "elapsed years"
# def elapsed_years(df, var):
#     # capture difference between year variable and year the house was sold
#     df[var] = df['YrSold'] - df[var]
#     return df

In [172]:
# X_train = elapsed_years(X_train, 'YearRemodAdd')
# X_test = elapsed_years(X_test, 'YearRemodAdd')

### Numerical variables

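We log transform numerical variables with skewed distributions in order to get a more Gaussian-like distribution, which tends to help linear machine learning models. Here only the target, `valeur_fonciere`, is transformed. We use `np.log1p`, i.e. log(1 + x), which is also defined when x is zero.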


In [173]:

num_ftrs = X_train.select_dtypes('number').columns
num_ftrs

Index(['latitude', 'longitude', 'nombre_lots', 'nombre_pieces_principales',
       'numero_disposition', 'surface_reelle_bati', 'surface_terrain',
       'valeur_fonciere'],
      dtype='object')

In [174]:
ftrs_to_log_transf = ['valeur_fonciere']
for var in ftrs_to_log_transf:
    X_train[var] = np.log1p(X_train[var])
    X_test[var]= np.log1p(X_test[var])
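
A short sketch of the transform and its inverse on made-up prices: `np.expm1` undoes `np.log1p` exactly, whereas `np.exp` returns the original value plus one.

```python
import numpy as np

prices = np.array([0.0, 131500.0, 1007200.0])  # illustrative values

logged = np.log1p(prices)     # log(1 + x), defined at x = 0
restored = np.expm1(logged)   # exact inverse of log1p
shifted = np.exp(logged)      # inverse of log, so it returns x + 1

print(np.allclose(restored, prices), np.allclose(shifted, prices + 1))
```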

### Categorical variables

We do have categorical variables in our final model. First, we group the categories that are present in less than 1% of the observations under a single label, 'Rare':

In [175]:
# let's capture the categorical variables first
cat_vars = [var for var in features if X_train[var].dtype == 'O']
cat_vars

['code_postal',
 'date_mutation',
 'id_parcelle',
 'nature_culture',
 'nom_commune',
 'type_local']

#### Important: persisting the frequent labels

As with the imputation values above, model validation requires that the research pipeline and the deployed pipeline produce identical results.

Therefore, we will save (persist) the list of frequent labels per variable, so that we can use it at the end of the process to corroborate our models.

In [176]:
def find_frequent_labels(df, var, rare_perc):
    # finds the labels that are shared by more than rare_perc (a fraction) of the observations in the dataset
    df = df.copy()
    tmp = df.groupby(var)['valeur_fonciere'].count() / len(df)
    return tmp[tmp>rare_perc].index

frequent_labels_dict = {}

for var in cat_vars:
    frequent_ls = find_frequent_labels(X_train, var, 0.01)
    
    # we save the list in a dictionary
    frequent_labels_dict[var] = frequent_ls
    
    X_train[var] = np.where(X_train[var].isin(frequent_ls), X_train[var], 'Rare')
    X_test[var] = np.where(X_test[var].isin(frequent_ls), X_test[var], 'Rare')
    
# now we save the dictionary
np.save('../../models/FrequentLabels.npy', frequent_labels_dict)

In [177]:
frequent_labels_dict

{'code_postal': Index([], dtype='object', name='code_postal'),
 'date_mutation': Index([], dtype='object', name='date_mutation'),
 'id_parcelle': Index([], dtype='object', name='id_parcelle'),
 'nature_culture': Index(['Missing', 'futaies résineuses', 'jardins', 'landes', 'prés', 'sols',
        'taillis simples', 'terrains a bâtir', 'terrains d'agrément', 'terres',
        'vignes'],
       dtype='object', name='nature_culture'),
 'nom_commune': Index([], dtype='object', name='nom_commune'),
 'type_local': Index(['Appartement', 'Dépendance', 'Local industriel. commercial ou assimilé',
        'Maison', 'Missing'],
       dtype='object', name='type_local')}
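
In production, the persisted frequent labels are applied to incoming data without recomputing them. A minimal sketch under that assumption follows; the `group_rare_labels` helper, the label subset and the new rows are hypothetical.

```python
import numpy as np
import pandas as pd

# illustrative subset of the labels found on the training set
frequent_labels_dict = {'type_local': pd.Index(['Appartement', 'Maison', 'Missing'])}

def group_rare_labels(df, frequent_labels):
    # hypothetical helper: any label that was not frequent in training becomes 'Rare'
    X = df.copy()
    for var, frequent_ls in frequent_labels.items():
        X[var] = np.where(X[var].isin(frequent_ls), X[var], 'Rare')
    return X

new_data = pd.DataFrame({'type_local': ['Maison', 'Château', 'Missing']})
scored = group_rare_labels(new_data, frequent_labels_dict)
print(scored['type_local'].tolist())
```

A label that never appeared in the training data is handled the same way as an infrequent one, so the encoding step that follows never sees an unknown category.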

Next, we need to transform the strings of these variables into numbers. We will do it so that we capture the monotonic relationship between the label and the target:

In [178]:
# this function will assign discrete values to the strings of the variables, 
# so that the smaller value corresponds to the smaller mean of target

def replace_categories(train, test, var, target):
    ordered_labels = train.groupby([var])[target].mean().sort_values().index
    ordinal_label = {k:i for i, k in enumerate(ordered_labels, 0)} 
    train[var] = train[var].map(ordinal_label)
    test[var] = test[var].map(ordinal_label)

In [179]:
for var in cat_vars:
    replace_categories(X_train, X_test, var, 'valeur_fonciere')
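
The mapping from label to number is learned from the training set, so a deployed pipeline would need it as well. The sketch below is one possible variant, not the function used above: a hypothetical `fit_ordinal_mapping` returns the dictionary so that it can be persisted and later applied to new data. The toy values are made up.

```python
import pandas as pd

def fit_ordinal_mapping(train, var, target):
    # order labels by mean target on the training set, smallest first
    ordered_labels = train.groupby(var)[target].mean().sort_values().index
    return {label: rank for rank, label in enumerate(ordered_labels)}

# toy data, for illustration only
train = pd.DataFrame({
    'type_local': ['Maison', 'Maison', 'Appartement', 'Rare'],
    'valeur_fonciere': [12.0, 12.4, 11.5, 10.0],
})

mapping = fit_ordinal_mapping(train, 'type_local', 'valeur_fonciere')

# apply the stored mapping to unseen rows
new_data = pd.DataFrame({'type_local': ['Appartement', 'Maison']})
new_data['type_local'] = new_data['type_local'].map(mapping)
print(mapping)
```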

In [180]:
# check absence of na
[var for var in features if X_train[var].isnull().sum()>0]

[]

In [181]:
# check absence of na
[var for var in features if X_test[var].isnull().sum()>0]

[]

### Feature Scaling

Linear models need features on comparable scales, so features should be either scaled or normalised. In the next cells, I will scale each feature to the range between its minimum and maximum values on the train set:

In [182]:
# capture the target
y_train = X_train['valeur_fonciere']
y_test = X_test['valeur_fonciere']

In [183]:
# fit scaler
scaler = MinMaxScaler() # create an instance
scaler.fit(X_train[features]) #  fit  the scaler to the train set for later use

# we persist the model for future use
joblib.dump(scaler, '../../models/scaler.pkl')

['../../models/scaler.pkl']

In [184]:
# transform the train and test set; only the selected features are kept
# (the resulting dataframes get a fresh 0..n-1 index)
X_train = pd.DataFrame(scaler.transform(X_train[features]), columns=features)
X_test = pd.DataFrame(scaler.transform(X_test[features]), columns=features)
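
A minimal sketch of how the persisted scaler could be reused at scoring time, assuming it is saved and loaded with `joblib` as above. The two features and their values are illustrative, and a temporary directory stands in for the models folder.

```python
import os
import tempfile
import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

features = ['latitude', 'surface_terrain']  # illustrative subset
train = pd.DataFrame({'latitude': [42.0, 49.0], 'surface_terrain': [0.0, 200.0]})

scaler = MinMaxScaler().fit(train[features])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scaler.pkl')
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# new rows are scaled with the min and max learned on the train set
new_data = pd.DataFrame({'latitude': [45.5], 'surface_terrain': [50.0]})
scaled = pd.DataFrame(restored.transform(new_data[features]), columns=features)
print(scaled)
```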

In [185]:
X_train[features].head()

| | code_postal | date_mutation | id_parcelle | latitude | longitude | nature_culture | nom_commune | nombre_lots | nombre_pieces_principales | numero_disposition | surface_reelle_bati | surface_terrain | type_local |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.970994 | 0.550094 | 0.909091 | 0.0 | 0.005714 | 0.0 | 0.0 | 0.000367 | 0.000726 | 0.25 |
| 1 | 0.0 | 0.0 | 0.0 | 0.89628 | 0.591264 | 0.909091 | 0.0 | 0.005714 | 0.0 | 0.0 | 0.000367 | 0.000726 | 0.25 |
| 2 | 0.0 | 0.0 | 0.0 | 0.941102 | 0.552015 | 0.272727 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000367 | 0.012155 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.947205 | 0.517275 | 0.636364 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000367 | 0.002244 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.914319 | 0.526196 | 0.818182 | 0.0 | 0.0 | 0.121212 | 0.0 | 0.000423 | 0.000842 | 0.5 |


In [186]:
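# number of target values
# (note: .count() counts every row; .isnull().sum() would count the missing ones)
y_train.isnull().count()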

296347

In [187]:
# train the model
# REMEMBER to set the random_state / seed
models = {
    'lasso': Lasso(alpha=0.005, random_state=SEED),
    'treeReg': DecisionTreeRegressor(random_state=SEED),
}


model = models['treeReg']

model.fit(X_train, y_train)

# we persist the model for future use
joblib.dump(model, '../../models/model.pkl')

['../../models/model.pkl']

In [190]:
# evaluate the model:
# remember that we log transformed the target (valeur_fonciere) above with np.log1p.

# In order to get the true performance of the model
# we need to transform both the target and the predictions
# back to the original property prices.
#
# Note: np.exp returns price + 1 here, because the exact inverse of np.log1p
# is np.expm1. The shift cancels in the errors below, but it shows up in the
# price printed at the end, which is the median of the train set.
#
# We will evaluate performance using the mean squared error and the
# root of the mean squared error

pred = model.predict(X_train)
print('Train mse: {}'.format(mean_squared_error(np.exp(y_train), np.exp(pred))))
print('Train rmse: {}'.format(sqrt(mean_squared_error(np.exp(y_train), np.exp(pred)))))
print()
pred = model.predict(X_test)
print('Test mse: {}'.format(mean_squared_error(np.exp(y_test), np.exp(pred))))
print('Test rmse: {}'.format(sqrt(mean_squared_error(np.exp(y_test), np.exp(pred)))))
print()
print('Average property price: ', np.exp(y_train).median())

Train mse: 62821172688.4176
Train rmse: 250641.5222751761

Test mse: 14871396475009.244
Test rmse: 3856344.9631755254

Average property price:  132000.9999999999
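
Because the target was transformed with `np.log1p`, a small helper that inverts it with `np.expm1` keeps the evaluation in price units. The sketch below is a hypothetical helper with made-up numbers, not the evaluation code used above.

```python
import numpy as np

def rmse_original_scale(y_log, pred_log):
    # hypothetical helper: undo log1p with expm1, then compute the RMSE in price units
    y = np.expm1(np.asarray(y_log, dtype=float))
    pred = np.expm1(np.asarray(pred_log, dtype=float))
    return float(np.sqrt(np.mean((y - pred) ** 2)))

# made-up targets and predictions on the log1p scale
y_true = np.log1p([100000.0, 200000.0])
y_pred = np.log1p([110000.0, 170000.0])

print(rmse_original_scale(y_true, y_pred))
```

Calling the same helper on the train and on the test predictions makes the two errors directly comparable.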
