## Intermediate Machine Learning

Source: www.kaggle.com
Date: Mar 2024

Highlights:
- real-world data issues
- pipeline design
- advanced techniques for model validation
- how to avoid common and crucial data science mistakes

There are two sets of data here:  

1. Practice Data: using the Melbourne Housing Snapshot dataset, `melb_data`    

2. Competition Data: using the *"Housing Prices Competition for Kaggle Learn Users"* data, which has separate training data (`train.csv`) and test data (`test.csv`)  


### Getting re-Started


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split as tts


In [2]:
# Load data
X_full      = pd.read_csv("data/train.csv", index_col="Id")
X_test_full = pd.read_csv("data/test.csv", index_col="Id")

# targets and predictors
y        = X_full.SalePrice
features = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']
X        = X_full[ features ].copy()
X_test   = X_test_full[ features ].copy()


In [3]:
# generate validation data with train-test splitting
X_train, X_valid, y_train, y_valid = tts(X, y, train_size=0.80, test_size=0.20, random_state=0)


In [4]:
X_train.head()

Id,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
619,11694,2007,1828,0,2,3,9
871,6600,1962,894,0,1,2,5
93,13360,1921,964,0,1,2,5
818,13265,2002,1689,0,2,3,7
303,13704,2001,1541,0,2,3,6


In [5]:
# Define five different Random Forest models

from sklearn.ensemble import RandomForestRegressor as randomforest

model_1 = randomforest(n_estimators=50,  random_state=0)
model_2 = randomforest(n_estimators=100, random_state=0)
model_3 = randomforest(n_estimators=100, random_state=0, criterion="absolute_error")
model_4 = randomforest(n_estimators=200, random_state=0, min_samples_split=20)
model_5 = randomforest(n_estimators=100, random_state=0, max_depth=7)

models = [model_1, model_2, model_3, model_4, model_5]


In [6]:
from sklearn.metrics import mean_absolute_error as mae

def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mae(y_v, preds)


In [7]:
for i in range(len(models)):
    MAE = score_model(models[i])
    print("Model %d M.A.E.= %d" % (i+1, MAE))
    del MAE


Model 1 M.A.E.= 24015
Model 2 M.A.E.= 23740
Model 3 M.A.E.= 23528
Model 4 M.A.E.= 23996
Model 5 M.A.E.= 23706


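The best model is model 3, which uses a mid-range number of estimators (100) plus the **criterion** option.  
In the scikit-learn documentation, this argument is *"the function to measure the quality of a split"*, and its default is `'squared_error'`.  
*"absolute_error"*: for the mean absolute error, which minimizes the L1 loss using the median of each terminal node. Because the models are scored with MAE, a split criterion that targets the same loss may explain the small edge.  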
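To make the L1-versus-L2 remark concrete, here is a minimal sketch with made-up numbers (`leaf_targets` is hypothetical and not taken from the housing data). It compares a single constant prediction per leaf under both losses: the median gives the lower absolute error, the mean the lower squared error.

```python
import numpy as np

# Hypothetical sale prices (in thousands) that end up in one terminal node
leaf_targets = np.array([100.0, 120.0, 130.0, 150.0, 400.0])

def l1_loss(prediction, targets):
    """Mean absolute error of one constant prediction."""
    return np.mean(np.abs(targets - prediction))

def l2_loss(prediction, targets):
    """Mean squared error of one constant prediction."""
    return np.mean((targets - prediction) ** 2)

mean_pred = leaf_targets.mean()        # pulled upward by the 400 outlier
median_pred = np.median(leaf_targets)  # robust to the outlier

print("L1 loss at mean / median:", l1_loss(mean_pred, leaf_targets), l1_loss(median_pred, leaf_targets))
print("L2 loss at mean / median:", l2_loss(mean_pred, leaf_targets), l2_loss(median_pred, leaf_targets))
```

The outlier drags the mean far from most of the points, which is why an absolute-error criterion tends to be less sensitive to a few very expensive houses.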

In [8]:
my_model = randomforest(n_estimators=250, criterion="absolute_error", random_state=0)

my_model.fit(X, y)

preds_test = my_model.predict(X_test)
output     = pd.DataFrame({'Id': X_test.index, 
                           'SalePrice': preds_test})

output

,Id,SalePrice
0,1461,120377.032
1,1462,157304.400
2,1463,186017.404
3,1464,178798.648
4,1465,191582.792
...,...,...
1454,2915,84563.200
1455,2916,87542.000
1456,2917,154625.532
1457,2918,130817.000


In [9]:
del preds_test, output
del model_1, model_2, model_3, model_4, model_5, models, my_model
del X_train, X_valid, y_train, y_valid
del X_full, X_test_full, y, features, X, X_test

### Missing Values in a Dataset

In [10]:
data_2          = pd.read_csv("data/melb_data.csv")
y_2             = data_2.Price
melb_predictors = data_2.drop(['Price'], axis=1)
X_2             = melb_predictors.select_dtypes(exclude=['object'])

X_t_2, X_v_2, y_t_2, y_v_2 = tts(X_2, 
                                 y_2, 
                                 train_size=0.80, 
                                 test_size=0.20)


In [11]:
print(len(data_2))
print(len(data_2.columns))

13580
21


In [12]:
def score_dataset(X_t=X_t_2, X_v=X_v_2, y_t=y_t_2, y_v=y_v_2):
    model = randomforest(n_estimators=10, random_state=0)
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mae(y_v, preds)


**Three approaches to working with datasets with missing values**  
1. Remove columns with *any* missing data  
2. Imputation  
3. Imputation plus extra boolean columns that flag which entries were imputed  

In [13]:
# from the X_t_2 dataset, determine which columns contain any missing entries
cols_with_missing = [i for i in X_t_2.columns if X_t_2[ i ].isnull().any()]
cols_with_missing

['Car', 'BuildingArea', 'YearBuilt']

In [14]:
reduced_X_t_2 = X_t_2.drop(cols_with_missing, axis=1)
reduced_X_v_2 = X_v_2.drop(cols_with_missing, axis=1)

print("MAE from Approach #1 (drop columns with missing values): %f " % score_dataset(reduced_X_t_2, reduced_X_v_2, y_t_2, y_v_2) )

MAE from Approach #1 (drop columns with missing values): 189450.785322 


In [15]:
from sklearn.impute import SimpleImputer


In [16]:
impute2 = SimpleImputer()

imputed_X_t_2 = pd.DataFrame(impute2.fit_transform(X_t_2))
imputed_X_v_2 = pd.DataFrame(impute2.transform(X_v_2))

imputed_X_t_2.columns = X_t_2.columns
imputed_X_v_2.columns = X_v_2.columns

print("MAE from Approach #2 (Imputation): %f " % score_dataset(imputed_X_t_2, imputed_X_v_2, y_t_2, y_v_2) )

MAE from Approach #2 (Imputation): 184035.302577 


In [17]:
X_t_3 = X_t_2.copy()
X_v_3 = X_v_2.copy()

for col in cols_with_missing:
    print(col)
    X_t_3[col+"_was_missing"] = X_t_3[col].isnull()
    X_v_3[col+"_was_missing"] = X_v_3[col].isnull()

impute3 = SimpleImputer()
imputed_X_t_3 = pd.DataFrame(impute3.fit_transform(X_t_3))
imputed_X_v_3 = pd.DataFrame(impute3.transform(X_v_3))

imputed_X_t_3.columns = X_t_3.columns
imputed_X_v_3.columns = X_v_3.columns

print("MAE from Approach #3 (Categorical Imputation): %f " % score_dataset(imputed_X_t_3, imputed_X_v_3, y_t_2, y_v_2) )

Car
BuildingArea
YearBuilt
MAE from Approach #3 (Categorical Imputation): 182808.073331 
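
Approach #3 can also be written more compactly. The following is a minimal sketch, assuming scikit-learn's `add_indicator=True` option on `SimpleImputer`; the toy array `X_toy` is made up for illustration and is not part of the notebook. The imputer appends one indicator column for each feature that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data (hypothetical): only the second column has a missing entry
X_toy = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [5.0, 6.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X_toy)

# Columns 0-1 hold the imputed features;
# column 2 is 1.0 wherever column 1 was originally missing
print(X_out)
```

This plays the same role as the manually built `_was_missing` columns, although the indicator columns come back unnamed in a NumPy array.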


In [18]:
del cols_with_missing, reduced_X_t_2, reduced_X_v_2, imputed_X_t_2, imputed_X_v_2, X_t_3, X_v_3, imputed_X_t_3, imputed_X_v_3

In [19]:
print(X_t_2.shape)

(10864, 12)


In [20]:
missing_val_count_by_col = X_t_2.isnull().sum()
print(missing_val_count_by_col[missing_val_count_by_col > 0])
del missing_val_count_by_col

Car               48
BuildingArea    5150
YearBuilt       4296
dtype: int64


In [21]:
# difference between .fit(), .transform(), and .fit_transform

A = np.array([[7, 2, 3],
              [4, np.nan, 6], 
              [10, 5, 9]])
B = np.array([[np.nan, 2, 3],
              [4, np.nan, 6],
              [10, np.nan, 9]])
print(A)
print()
print(B)
print()

imp_mean = SimpleImputer()
print("fit model using A, then transform onto B: ")
imp_mean.fit(A)
print(A)
print(imp_mean.transform(B))

imp_mean = SimpleImputer()
print("use fit_transform on matrix B: ")
print(imp_mean.fit_transform(B))

imp_mean = SimpleImputer()
print("fit model using A and transform it, then transform onto B: ")
print(imp_mean.fit_transform(A))
print(imp_mean.transform(B))

del A, B

[[ 7.  2.  3.]
 [ 4. nan  6.]
 [10.  5.  9.]]

[[nan  2.  3.]
 [ 4. nan  6.]
 [10. nan  9.]]

fit model using A, then transform onto B: 
[[ 7.  2.  3.]
 [ 4. nan  6.]
 [10.  5.  9.]]
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
use fit_transform on matrix B: 
[[ 7.  2.  3.]
 [ 4.  2.  6.]
 [10.  2.  9.]]
fit model using A and transform it, then transform onto B: 
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   5.   9. ]]
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
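
The outputs above follow from what the imputer stores when it is fitted. As a small sketch reusing the same `A` and `B` matrices, the fitted `statistics_` attribute shows the per-column means that `.transform()` later reuses (the names `imp`, `refit`, and `B_filled` are introduced here for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

A = np.array([[7, 2, 3],
              [4, np.nan, 6],
              [10, 5, 9]])
B = np.array([[np.nan, 2, 3],
              [4, np.nan, 6],
              [10, np.nan, 9]])

imp = SimpleImputer()            # default strategy="mean"
imp.fit(A)                       # learn one mean per column from A only
print(imp.statistics_)           # column means of A, ignoring NaN

B_filled = imp.transform(B)      # fill the gaps in B with A's means

refit = SimpleImputer().fit(B)   # what fit_transform(B) would learn instead
print(refit.statistics_)         # column means of B, ignoring NaN
```

This is why training data gets `.fit_transform()` while validation data gets only `.transform()`: the fill values should come from the training rows alone.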


In [22]:
del imp_mean, impute2, impute3

del melb_predictors

del X_t_2, X_v_2, y_t_2, y_v_2

del data_2, X_2, y_2

### Categorical Variables

Continue using the *'melb_data'* dataset, but this time **keep** the columns that contain 'object' (text) data  

In [23]:
data_2 = pd.read_csv("data/melb_data.csv")
y_2    = data_2.Price
X_2    = data_2.drop(['Price'], axis=1)

X_t_2, X_v_2, y_t_2, y_v_2 = tts(X_2, 
                                 y_2, 
                                 train_size=0.80, 
                                 test_size=0.20)


In [24]:
# For this exercise, simply drop columns that have missing data
cols_with_missing = [i for i in X_t_2.columns if X_t_2[i].isnull().any()]

red_X_t_2 = X_t_2.drop(cols_with_missing, axis=1)  # drop() already returns a new DataFrame
red_X_v_2 = X_v_2.drop(cols_with_missing, axis=1)


In [25]:
red_X_t_2.head(5)

,Suburb,Address,Rooms,Type,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Regionname,Propertycount
6194,Templestowe Lower,41 Dellfield Dr,4,h,S,Barry,3/12/2016,13.8,3107.0,4.0,2.0,673.0,-37.7647,145.1246,Eastern Metropolitan,5420.0
10729,Lalor,23 Cyprus St,3,h,S,Barry,8/07/2017,16.3,3075.0,3.0,2.0,655.0,-37.67414,145.0205,Northern Metropolitan,8279.0
715,Bentleigh,1/19 Patterson Rd,2,h,S,Buxton,7/11/2016,13.0,3204.0,2.0,1.0,298.0,-37.9247,145.0287,Southern Metropolitan,6795.0
7080,Jacana,108 Sunset Bvd,3,h,S,Barry,16/04/2016,14.5,3047.0,3.0,1.0,650.0,-37.687,144.9084,Northern Metropolitan,851.0
9649,Mulgrave,13 Excelsior Cct,3,h,S,Ray,17/06/2017,18.8,3170.0,3.0,2.0,145.0,-37.93226,145.19097,South-Eastern Metropolitan,7113.0


In [26]:
low_cardinality_cols = [col for col in red_X_t_2.columns 
                        if red_X_t_2[col].nunique() < 10 
                        and red_X_t_2[col].dtype=="object"]

print(low_cardinality_cols)


['Type', 'Method', 'Regionname']


In [27]:
numerical_cols = [col for col in red_X_t_2 if red_X_t_2[col].dtype in ['int64','float64']]
print(numerical_cols)


['Rooms', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude', 'Propertycount']


In [28]:
cat_cols = low_cardinality_cols + numerical_cols
X_train_cat = red_X_t_2[cat_cols].copy()
X_valid_cat = red_X_v_2[cat_cols].copy()


In [29]:
X_train_cat.head(5)

,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Propertycount
6194,h,S,Eastern Metropolitan,4,13.8,3107.0,4.0,2.0,673.0,-37.7647,145.1246,5420.0
10729,h,S,Northern Metropolitan,3,16.3,3075.0,3.0,2.0,655.0,-37.67414,145.0205,8279.0
715,h,S,Southern Metropolitan,2,13.0,3204.0,2.0,1.0,298.0,-37.9247,145.0287,6795.0
7080,h,S,Northern Metropolitan,3,14.5,3047.0,3.0,1.0,650.0,-37.687,144.9084,851.0
9649,h,S,South-Eastern Metropolitan,3,18.8,3170.0,3.0,2.0,145.0,-37.93226,145.19097,7113.0


In [30]:
# Approach #1- Drop categorical variables
drop_X_train = X_train_cat.select_dtypes(exclude=["object"])
drop_X_valid = X_valid_cat.select_dtypes(exclude=["object"])
print(drop_X_train.shape)
print("MAE, Approach #1 (drop all categorical data): %f" % score_dataset(drop_X_train, drop_X_valid, y_t_2, y_v_2) )
del drop_X_train, drop_X_valid

(10864, 9)
MAE, Approach #1 (drop all categorical data): 199530.107046


In [31]:
from sklearn.preprocessing import OrdinalEncoder


In [32]:
label_X_train = X_train_cat.copy()
label_X_valid = X_valid_cat.copy()

ordinal_encoder = OrdinalEncoder()

label_X_train[ low_cardinality_cols ] = ordinal_encoder.fit_transform( X_train_cat[low_cardinality_cols] )
label_X_valid[ low_cardinality_cols ] = ordinal_encoder.transform( X_valid_cat[low_cardinality_cols] )
print(label_X_train.shape)
print("MAE, Approach #2 (ordinal encoding): %f" % score_dataset(label_X_train, label_X_valid, y_t_2, y_v_2) )



(10864, 12)


MAE, Approach #2 (ordinal encoding): 185820.856001


In [33]:
X_train_cat.head(5)

,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Propertycount
6194,h,S,Eastern Metropolitan,4,13.8,3107.0,4.0,2.0,673.0,-37.7647,145.1246,5420.0
10729,h,S,Northern Metropolitan,3,16.3,3075.0,3.0,2.0,655.0,-37.67414,145.0205,8279.0
715,h,S,Southern Metropolitan,2,13.0,3204.0,2.0,1.0,298.0,-37.9247,145.0287,6795.0
7080,h,S,Northern Metropolitan,3,14.5,3047.0,3.0,1.0,650.0,-37.687,144.9084,851.0
9649,h,S,South-Eastern Metropolitan,3,18.8,3170.0,3.0,2.0,145.0,-37.93226,145.19097,7113.0


In [34]:
label_X_train.head(5)

,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Propertycount
6194,0.0,1.0,0.0,4,13.8,3107.0,4.0,2.0,673.0,-37.7647,145.1246,5420.0
10729,0.0,1.0,2.0,3,16.3,3075.0,3.0,2.0,655.0,-37.67414,145.0205,8279.0
715,0.0,1.0,5.0,2,13.0,3204.0,2.0,1.0,298.0,-37.9247,145.0287,6795.0
7080,0.0,1.0,2.0,3,14.5,3047.0,3.0,1.0,650.0,-37.687,144.9084,851.0
9649,0.0,1.0,4.0,3,18.8,3170.0,3.0,2.0,145.0,-37.93226,145.19097,7113.0
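
The integer codes in the table above carry no inherent meaning. By default `OrdinalEncoder` sorts the categories of each column and numbers them in that order, which is consistent with Eastern Metropolitan mapping to 0.0 and Southern Metropolitan to 5.0. A minimal sketch on a made-up three-region sample (the `toy` frame is hypothetical) shows where to inspect the mapping:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample; object dtype mirrors the text columns in the notebook
toy = pd.DataFrame({"Regionname": ["Southern Metropolitan",
                                   "Eastern Metropolitan",
                                   "Northern Metropolitan",
                                   "Eastern Metropolitan"]}, dtype=object)

enc = OrdinalEncoder()
codes = enc.fit_transform(toy)

# categories_ lists the learned categories; position in the list = integer code
print(enc.categories_[0])
print(codes.ravel())
```

Since the order is only alphabetical, a tree model has to work around an arbitrary ranking, which is one reason to compare against one-hot encoding.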


In [35]:
del label_X_train, label_X_valid

In [36]:
from sklearn.preprocessing import OneHotEncoder


In [37]:
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

OH_cols_train = pd.DataFrame(onehot_encoder.fit_transform(X_train_cat[ low_cardinality_cols ]))
OH_cols_valid = pd.DataFrame(onehot_encoder.transform(X_valid_cat[ low_cardinality_cols ]))

OH_cols_train.index = X_train_cat.index
OH_cols_valid.index = X_valid_cat.index

# Make df that removes the categorical columns, which will be replaced with one-hot encoding
num_X_train = X_train_cat.drop(low_cardinality_cols, axis=1)
num_X_valid = X_valid_cat.drop(low_cardinality_cols, axis=1)

# Make df that adds one-hot encoding to the numerical dfs made directly above
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

del OH_cols_train, OH_cols_valid
del num_X_train, num_X_valid

OH_X_train.columns = OH_X_train.columns.astype('str')
OH_X_valid.columns = OH_X_valid.columns.astype('str')
print(OH_X_train.shape)
print("MAE, Approach #3 (one-hot encoding): %f" % score_dataset(OH_X_train, OH_X_valid, y_t_2, y_v_2) )


(10864, 25)
MAE, Approach #3 (one-hot encoding): 185283.846441


In [38]:
X_train_cat.head(5)

,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Propertycount
6194,h,S,Eastern Metropolitan,4,13.8,3107.0,4.0,2.0,673.0,-37.7647,145.1246,5420.0
10729,h,S,Northern Metropolitan,3,16.3,3075.0,3.0,2.0,655.0,-37.67414,145.0205,8279.0
715,h,S,Southern Metropolitan,2,13.0,3204.0,2.0,1.0,298.0,-37.9247,145.0287,6795.0
7080,h,S,Northern Metropolitan,3,14.5,3047.0,3.0,1.0,650.0,-37.687,144.9084,851.0
9649,h,S,South-Eastern Metropolitan,3,18.8,3170.0,3.0,2.0,145.0,-37.93226,145.19097,7113.0


In [39]:
OH_X_train.head(5)

,Rooms,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Propertycount,0,...,6,7,8,9,10,11,12,13,14,15
6194,4,13.8,3107.0,4.0,2.0,673.0,-37.7647,145.1246,5420.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10729,3,16.3,3075.0,3.0,2.0,655.0,-37.67414,145.0205,8279.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
715,2,13.0,3204.0,2.0,1.0,298.0,-37.9247,145.0287,6795.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
7080,3,14.5,3047.0,3.0,1.0,650.0,-37.687,144.9084,851.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9649,3,18.8,3170.0,3.0,2.0,145.0,-37.93226,145.19097,7113.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [40]:
del OH_X_train, OH_X_valid

del cat_cols, cols_with_missing, low_cardinality_cols, numerical_cols
del X_train_cat, X_valid_cat
del red_X_t_2, red_X_v_2

del X_t_2, X_v_2, y_t_2, y_v_2
del data_2, X_2, y_2

### Pipelines  

As model building grows more complex, **pipelines** keep the preprocessing and modeling steps organized and less error-prone

In [41]:
# Below is the example of a simple Pipeline from the scikit-learn documentation
from sklearn.svm import SVC 
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline

X, y = make_classification(random_state=0)

X_t, X_v, y_t, y_v = tts(X, y, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('svc', SVC())])


In [42]:
pipe.fit(X_t, y_t).score(X_v, y_v)


0.88

In [43]:
pipe.set_params(svc__C = 10).fit(X_t, y_t).score(X_v, y_v)


0.76
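
The `svc__C` argument follows the pipeline naming rule `<step name>__<parameter>`. A short sketch of how to inspect and change a nested parameter, rebuilding the same two-step pipeline (the variable `default_C` is introduced here for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", StandardScaler()),
                 ("svc", SVC())])

# get_params() exposes every nested parameter under "<step>__<param>"
default_C = pipe.get_params()["svc__C"]

# set_params() uses the same keys and returns the pipeline itself
pipe.set_params(svc__C=10)

print(default_C, pipe.named_steps["svc"].C)
```

The same keys are what a hyperparameter search would use to tune individual steps of a pipeline.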

In [44]:
del pipe
del X_t, X_v, y_t, y_v 
del X, y


In [45]:
data_2 = pd.read_csv("data/melb_data.csv")

y_2    = data_2.Price
X_2    = data_2.drop(['Price'], axis=1)

X_t_2, X_v_2, y_t_2, y_v_2 = tts(X_2, 
                                 y_2, 
                                 train_size=0.80, 
                                 test_size=0.20)

cat_cols = [col for col in X_t_2.columns 
            if X_t_2[col].nunique() < 10
            and X_t_2[col].dtype=="object"]

num_cols = [col for col in X_t_2.columns
            if X_t_2[col].dtype in ['int64','float64']]

data_cols = cat_cols + num_cols

X_train = X_t_2[data_cols].copy()
X_valid = X_v_2[data_cols].copy()


In [46]:
X_train.head()

,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
10022,h,S,Northern Metropolitan,5,12.0,3073.0,5.0,3.0,4.0,890.0,203.0,1960.0,-37.70295,145.00277,21650.0
3503,t,SP,Western Metropolitan,3,12.8,3033.0,3.0,2.0,1.0,320.0,133.0,2013.0,-37.7415,144.8668,5629.0
2828,u,S,Southern Metropolitan,2,9.2,3146.0,2.0,1.0,1.0,0.0,,1960.0,-37.8566,145.0553,10412.0
10138,h,S,Eastern Victoria,4,35.2,3806.0,4.0,2.0,2.0,796.0,,2004.0,-38.06022,145.35044,17093.0
11321,u,SP,Northern Metropolitan,1,2.0,3066.0,1.0,1.0,0.0,0.0,46.0,1960.0,-37.79597,144.99108,4553.0


#### Three steps in constructing a pipeline  

1. Define preprocessing steps  
2. Define the model  
3. Create and evaluate the pipeline  


In [47]:
numerical_transform   = SimpleImputer(strategy="constant")

categorical_transform = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy="most_frequent")),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


In [48]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transform, num_cols),
    ('cat', categorical_transform, cat_cols)
])


The two cells above did not use *any* of the test or validation data.  
All they did was set up all of the preprocessing steps.  

For the *numerical* data, impute missing values using `strategy="constant"`, which replaces them with `fill_value`. That keyword argument defaults to `None`, which means 0 for numerical data.   

For the *categorical* data, first impute the data with the `strategy="most_frequent"` argument to replace missing values with the most frequent value of that column. This is followed with a one-hot encoding step.  
  
Finally, combine both transformers in a **ColumnTransformer()**, which applies each preprocessing step to its assigned columns of an array or data frame.  
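
To see what this preprocessing produces, here is a minimal sketch on a made-up four-row frame. The names `toy` and `toy_preprocessor` are hypothetical, and `sparse_threshold=0` is added only to force a dense, printable result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one numerical and one categorical column, each with a gap
toy = pd.DataFrame({
    "Rooms": [2.0, np.nan, 4.0, 3.0],
    "Type": pd.Series(["h", "h", np.nan, "u"], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # numerical: missing values become fill_value (0 by default)
        ("num", SimpleImputer(strategy="constant"), ["Rooms"]),
        # categorical: fill with the most frequent value, then one-hot encode
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Type"]),
    ],
    sparse_threshold=0,  # assumption for readability: always return a dense array
)

result = toy_preprocessor.fit_transform(toy)
print(result)  # columns: Rooms, Type == "h", Type == "u"
```

The missing `Rooms` value becomes 0 and the missing `Type` is treated as `h`, the most frequent category, before encoding.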
   

In [49]:
# Construct the model

model = randomforest(n_estimators=100, random_state=0)


In [50]:
# Construct the pipeline, then evaluate

pipeline1 = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', model)
])


In [51]:
pipeline1.fit(X_train, y_t_2)

preds = pipeline1.predict(X_valid)

score = mae(y_v_2, preds)

print("MAE: ", score)


MAE:  165914.6103698541


In [52]:
del score, preds
del model, numerical_transform, categorical_transform, onehot_encoder, ordinal_encoder, preprocessor, pipeline1
del cat_cols, num_cols, data_cols
del X_train, X_valid
del X_t_2, X_v_2, y_t_2, y_v_2
del y_2, X_2, data_2 

### Cross-Validation

In [53]:
data          = pd.read_csv("data/melb_data.csv")
y             = data.Price
cols_to_use   = ['Rooms','Distance','Landsize','BuildingArea','YearBuilt']
X             = data[cols_to_use]


In [54]:
X

,Rooms,Distance,Landsize,BuildingArea,YearBuilt
0,2,2.5,202.0,,
1,2,2.5,156.0,79.0,1900.0
2,3,2.5,134.0,150.0,1900.0
3,3,2.5,94.0,,
4,4,2.5,120.0,142.0,2014.0
...,...,...,...,...,...
13575,4,16.7,652.0,,1981.0
13576,3,6.8,333.0,133.0,1995.0
13577,3,6.8,436.0,,1997.0
13578,4,6.8,866.0,157.0,1920.0


In [55]:
from sklearn.ensemble import RandomForestRegressor as randomforest 
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()), 
    ('model', randomforest(n_estimators=50, random_state=0))
])

my_pipeline

In [56]:
from sklearn.model_selection import cross_val_score as cv_score 

scores = -1*cv_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)


MAE scores:
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]


In [57]:
scores.max() - scores.min()

67102.63072686753

In [58]:
scores.mean()

277707.3795913405
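
For intuition about what `cross_val_score` does with `cv=5`, the sketch below reproduces it with an explicit `KFold` loop. It runs on synthetic data from `make_regression` (an assumption made to keep the example self-contained), and names such as `X_demo` and `demo_pipeline` are hypothetical. Each fold fits a fresh copy of the pipeline on four fifths of the rows and scores it on the remaining fifth.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the housing data
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])

manual_scores = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_demo):
    fold_pipeline = clone(demo_pipeline)                 # unfitted copy per fold
    fold_pipeline.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_pipeline.predict(X_demo[valid_idx])
    manual_scores.append(mean_absolute_error(y_demo[valid_idx], preds))

auto_scores = -1 * cross_val_score(demo_pipeline, X_demo, y_demo,
                                   cv=5, scoring="neg_mean_absolute_error")

print(np.round(manual_scores, 2))
print(np.round(auto_scores, 2))
```

Because the imputer sits inside the pipeline, it is refitted on each fold's training rows, so no validation rows influence the fill values.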

In [59]:
del scores
del my_pipeline
del y, X, cols_to_use, data

### Gradient Boosting

This section uses `xgboost`, an implementation of gradient boosting, described in the course module as "the most accurate modeling technique for structured data"  

As the name suggests, the method applies *gradient descent* to the loss function, using it to decide which model to add to the ensemble at each step
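
The idea can be sketched from scratch. The following is a simplified illustration and not how `xgboost` is implemented internally: it assumes squared-error loss, shallow scikit-learn trees, and synthetic data, and every name in it is hypothetical. Each round fits a small tree to the current residuals, which are the negative gradient of the squared-error loss, and adds a scaled copy of that tree's predictions to the ensemble.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())   # start from a constant model
losses = [np.mean((y_demo - prediction) ** 2)]     # training MSE per round

for _ in range(50):
    residuals = y_demo - prediction                # negative gradient of 0.5 * squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                    # new model targets the remaining error
    prediction += learning_rate * tree.predict(X_demo)
    losses.append(np.mean((y_demo - prediction) ** 2))

print("training MSE before / after boosting:", losses[0], losses[-1])
```

Here the number of rounds corresponds to `n_estimators` and the step size to `learning_rate`, the two settings varied in the cells that follow.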

In [60]:
data               = pd.read_csv("data/melb_data.csv")
y                  = data.Price
cols_to_use        = ['Rooms','Distance','Landsize','BuildingArea','YearBuilt']
X                  = data[cols_to_use]
X_t, X_v, y_t, y_v = tts(X, y)


In [63]:
from xgboost import XGBRegressor 

my_model = XGBRegressor()

my_model.fit(X_t, y_t)


In [64]:
predictions = my_model.predict(X_v)

print("Mean Absolute Error from XGBoost with starting naive model: ", str(mae(predictions, y_v)))


Mean Absolute Error from XGBoost with starting naive model:  235170.43144329896


In [65]:
my_model = XGBRegressor(n_estimators=500)

my_model.fit(X_t, y_t)
predictions = my_model.predict(X_v)

print("Mean Absolute Error from XGBoost, naive model, 500 cycles: ", str(mae(predictions, y_v)))
del my_model, predictions

Mean Absolute Error from XGBoost, naive model, 500 cycles:  247287.21848076215


In [68]:
my_model = XGBRegressor(n_estimators=500, early_stopping_rounds=5)

my_model.fit(X_t, y_t,
             eval_set=[(X_v, y_v)],
             verbose=False)

predictions = my_model.predict(X_v)

print("Mean Absolute Error from XGBoost, naive model, 500 cycles, early_stopping of 5: ", str(mae(predictions, y_v)))
del my_model, predictions

Mean Absolute Error from XGBoost, naive model, 500 cycles, early_stopping of 5:  237326.65647091312


In [72]:
my_model = XGBRegressor(n_estimators=1000, early_stopping_rounds=5, learning_rate=0.05)

my_model.fit(X_t, y_t,
             eval_set=[(X_v, y_v)],
             verbose=False)

predictions = my_model.predict(X_v)

print("Mean Absolute Error from XGBoost, naive model, 1000 cycles, early_stopping of 5, learning rate 0.05: ", str(mae(predictions, y_v)))
del my_model, predictions

Mean Absolute Error from XGBoost, naive model, 1000 cycles, early_stopping of 5, learning rate 0.05:  243260.1128681885


In [76]:
my_model = XGBRegressor(n_estimators=1000, early_stopping_rounds=5, learning_rate=0.05,
                        # n_jobs: parallel processing (helps on large datasets only);
                        # set it to the number of accessible CPUs on the local machine, virtual machine, or HPC
                        n_jobs=3
                        )

my_model.fit(X_t, y_t,
             eval_set=[(X_v, y_v)],
             verbose=False)

predictions = my_model.predict(X_v)

print("Mean Absolute Error from XGBoost, naive model, 1000 cycles, early_stopping of 5, learning rate 0.05: ", str(mae(predictions, y_v)))
del my_model, predictions

Mean Absolute Error from XGBoost, naive model, 1000 cycles, early_stopping of 5, learning rate 0.05:  243260.1128681885


In [77]:
del X_t, X_v, y_t, y_v
del data, y, cols_to_use, X

### Data Leakage

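Data leakage happens when the training data contains information about the target that will not be available when the model is used for prediction.

The result is high performance on the training set (and possibly the validation set), but poor performance once the model is used to make decisions.  

There are two types: (i) target leakage and (ii) train-test contamination.  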
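
Train-test contamination is not demonstrated in this notebook, so here is a minimal sketch with made-up numbers (all names are hypothetical). Fitting a preprocessing step on training and validation rows together lets validation information influence the values used to fill the training data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical split: four training rows (one missing) and one validation row
X_train_demo = np.array([[1.0], [2.0], [3.0], [np.nan]])
X_valid_demo = np.array([[100.0]])

# Contaminated: the fill value is learned from training AND validation rows
leaky = SimpleImputer().fit(np.vstack([X_train_demo, X_valid_demo]))

# Clean: the fill value is learned from the training rows only
clean = SimpleImputer().fit(X_train_demo)

print("fill value with contamination:", leaky.statistics_[0])
print("fill value without contamination:", clean.statistics_[0])
```

Placing preprocessing inside a pipeline, as in the cross-validation section, is one way to keep it fitted on training rows only.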

#### Target Leakage Example:

Take a credit card application dataset, skipping the basic data set-up stage.

In [78]:
data = pd.read_csv("data/AER_credit_card_data.csv", true_values=['yes'], false_values=['no'])

y = data.card 
X = data.drop(['card'], axis=1)

X.shape

(1319, 11)

In [79]:
X.head()

,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,0,37.66667,4.52,0.03327,124.9833,True,False,3,54,1,12
1,0,33.25,2.42,0.005217,9.854167,False,False,3,34,1,13
2,0,33.66667,4.5,0.004156,15.0,True,False,4,58,1,5
3,0,30.5,2.54,0.065214,137.8692,False,False,0,25,1,7
4,0,32.16667,9.7867,0.067051,546.5033,True,False,2,64,1,5


In [81]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))

cv_scores = cross_val_score(my_pipeline, X, y, cv=5, scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())


Cross-validation accuracy: 0.979534


In [82]:
expenditures_cardholders    = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print('Fraction of those who did not receive a card and had no expenditures: %.2f' \
      %((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f' \
      %(( expenditures_cardholders == 0).mean()))

del expenditures_cardholders, expenditures_noncardholders

Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02


In [83]:
potential_leaks = ['expenditure','share','active','majorcards']
X2 = X.drop(potential_leaks, axis=1)

cv_scores2 = cross_val_score(my_pipeline, X2, y, cv=5, scoring='accuracy')
print("Cross-validation accuracy: %f" % cv_scores2.mean())


Cross-validation accuracy: 0.833956


In [84]:
del potential_leaks, X2, cv_scores2
del cv_scores
del my_pipeline, X, y, data

### ------- END -------