# Introduction to Machine Learning
This is my first attempt at building a simple machine learning model with a random forest. I am using the data from the Kaggle <a href="https://www.kaggle.com/c/home-data-for-ml-course/data">Housing Prices Competition for Kaggle Learn Users</a>.

This notebook will document my entire learning process. I will create another notebook with my complete and optimized model.

<h2 style="color: blue">Loading and Reading the Data</h2>

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv('train.csv', index_col='Id')
data_test = pd.read_csv('test.csv', index_col='Id')

<h2 style="color: blue">Prepping Data</h2>

In [24]:
# Remove rows with missing prices, then separate the target from the predictors
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
data.drop(['SalePrice'], axis=1, inplace=True)

# Drop columns with missing values
cols_with_missing = [col for col in data.columns if data[col].isnull().any()] 
data.drop(cols_with_missing, axis=1, inplace=True)
data_test.drop(cols_with_missing, axis=1, inplace=True)

X_train, X_valid, y_train, y_valid = train_test_split(data, y, train_size=0.8, test_size=0.2, random_state=0)

In [25]:
X_train.head()

Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
619,20,RL,11694,Pave,Reg,Lvl,AllPub,Inside,Gtl,NridgHt,...,108,0,0,260,0,0,7,2007,New,Partial
871,20,RL,6600,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,...,0,0,0,0,0,0,8,2009,WD,Normal
93,30,RL,13360,Pave,IR1,HLS,AllPub,Inside,Gtl,Crawfor,...,0,44,0,0,0,0,8,2009,WD,Normal
818,20,RL,13265,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,Mitchel,...,59,0,0,0,0,0,7,2008,WD,Normal
303,20,RL,13704,Pave,IR1,Lvl,AllPub,Corner,Gtl,CollgCr,...,81,0,0,0,0,0,1,2006,WD,Normal


In [26]:
# function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

<h2 style="color: blue">Dealing with Categorical Data</h2>

I will compare three methods for dealing with categorical values: dropping the columns, label encoding, and one-hot encoding.

### Dropping columns with categorical data 

In [27]:
drop_X_train = X_train.select_dtypes(exclude="object")
drop_X_valid = X_valid.select_dtypes(exclude="object")

In [28]:
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables):
17837.82570776256


### Label Encoding

In [29]:
print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())

Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']

Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']


The validation data contains values ('RRAn' and 'RRNn') that do not appear in the training data. A label encoder fitted on the training data will raise an error when asked to transform them.
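
The failure can be reproduced on a toy column. This is a minimal sketch with made-up values rather than the housing data; `train_col` and `valid_col` are illustrative names, not variables from this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for a categorical column (illustrative values only)
train_col = ["Norm", "PosA", "Feedr"]
valid_col = ["Norm", "RRAn"]  # 'RRAn' never appears in train_col

encoder = LabelEncoder()
encoder.fit(train_col)

try:
    encoder.transform(valid_col)
    unseen_error = None
except ValueError as err:
    # LabelEncoder rejects labels it did not see during fit
    unseen_error = err

print("Error raised:", unseen_error)

# A check that can be run ahead of time: which validation values are unseen?
unseen_values = set(valid_col) - set(train_col)
print("Unseen values:", unseen_values)
```

A column is safe to encode this way whenever `unseen_values` is empty.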

In [30]:
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

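# Columns that can be safely label encoded: training and validation data
# contain exactly the same categories. This is stricter than necessary, since
# the validation values only need to be a subset of the training values.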
good_label_cols = [col for col in object_cols if set(X_train[col]) == set(X_valid[col])]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

Categorical columns that will be label encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'LotConfig', 'BldgType', 'HouseStyle', 'ExterQual', 'CentralAir', 'KitchenQual', 'PavedDrive', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Utilities', 'Neighborhood', 'Condition1', 'RoofMatl', 'Heating', 'SaleType', 'Foundation', 'Condition2', 'HeatingQC', 'Exterior1st', 'Functional', 'RoofStyle', 'Exterior2nd', 'ExterCond', 'LandSlope']


In [31]:
# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

my_label_encoder = LabelEncoder()

for col in good_label_cols:
    label_X_train[col] = my_label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = my_label_encoder.transform(X_valid[col])

In [32]:
print("MAE from Approach 2 (Label Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

MAE from Approach 2 (Label Encoding):
17575.291883561644


### One-hot Encoding

In [33]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

sorted(d.items(), key=lambda x: x[1])

[('Street', 2),
 ('Utilities', 2),
 ('CentralAir', 2),
 ('LandSlope', 3),
 ('PavedDrive', 3),
 ('LotShape', 4),
 ('LandContour', 4),
 ('ExterQual', 4),
 ('KitchenQual', 4),
 ('MSZoning', 5),
 ('LotConfig', 5),
 ('BldgType', 5),
 ('ExterCond', 5),
 ('HeatingQC', 5),
 ('Condition2', 6),
 ('RoofStyle', 6),
 ('Foundation', 6),
 ('Heating', 6),
 ('Functional', 6),
 ('SaleCondition', 6),
 ('RoofMatl', 7),
 ('HouseStyle', 8),
 ('Condition1', 9),
 ('SaleType', 9),
 ('Exterior1st', 15),
 ('Exterior2nd', 16),
 ('Neighborhood', 25)]

In [34]:
# Only columns with cardinality of <10 will be encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Exterior2nd', 'Neighborhood', 'Exterior1st']


<strong>Note:</strong>
- Setting handle_unknown='ignore' avoids errors when the validation data contains classes that aren't represented in the training data; such values are encoded as a row of zeros.
- Setting sparse=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix). In scikit-learn 1.2 and later this parameter is named sparse_output.
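
The effect of both settings can be seen on a toy example. This is a sketch with made-up values ('Dirt' is not a category in the housing data). To avoid depending on the name of the sparse flag, which differs between scikit-learn versions, it keeps the default sparse output and converts it with `.toarray()`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data; the values are illustrative only
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
valid = pd.DataFrame({"Street": ["Pave", "Dirt"]})  # 'Dirt' is unseen in train

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; .toarray() makes it a dense numpy array
encoded_valid = encoder.transform(valid).toarray()

print(encoder.categories_)  # categories learned from the training data
print(encoded_valid)        # the row for the unseen 'Dirt' value is all zeros
```

Without handle_unknown='ignore', the call to `transform` would raise an error on the unseen value instead.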

In [35]:
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)

OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding resets the index; restore it
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove the categorical columns (the encoded versions replace them)
drop_X_train = X_train.drop(object_cols, axis=1)
drop_X_valid = X_valid.drop(object_cols, axis=1)

# Add the one-hot encoded columns to the numerical features
OH_X_train = pd.concat([drop_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([drop_X_valid, OH_cols_valid], axis=1)

In [36]:
print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

MAE from Approach 3 (One-Hot Encoding):
17525.345719178084


<h2 style="color: blue">Experimenting With Random Forest Models</h2>

In [37]:
# Note: criterion='mae' was renamed to 'absolute_error' in scikit-learn 1.0
rf_model_1 = RandomForestRegressor(random_state=0)
rf_model_2 = RandomForestRegressor(n_estimators=200, random_state=0)
rf_model_3 = RandomForestRegressor(n_estimators=200, criterion='mae', random_state=0)
rf_model_4 = RandomForestRegressor(n_estimators=200, criterion='mae', min_samples_split=10, random_state=0)
rf_model_5 = RandomForestRegressor(n_estimators=300, min_samples_split=10, max_depth=7, random_state=0)

models = [rf_model_1, rf_model_2, rf_model_3, rf_model_4, rf_model_5]

In [38]:
def error_model(model):
    model.fit(OH_X_train, y_train)
    preds = model.predict(OH_X_valid)
    return mean_absolute_error(y_valid, preds)

In [39]:
for rf_model in models:
    print(error_model(rf_model))

17525.345719178084
17184.68566210046
17247.401875
17753.808202054795
18066.645685243133


In [40]:
OH_X_train.head()

Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,...,112,113,114,115,116,117,118,119,120,121
619,20,11694,9,5,2007,2007,48,0,1774,1822,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
871,20,6600,5,5,1962,1962,0,0,894,894,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
93,30,13360,5,7,1921,2006,713,0,163,876,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
818,20,13265,8,5,2002,2002,1218,0,350,1568,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
303,20,13704,7,5,2001,2002,0,0,1541,1541,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


In [41]:
# Refit the configuration with the lowest MAE above (same settings as rf_model_2)
model = RandomForestRegressor(n_estimators=200, random_state=0)

model.fit(OH_X_train, y_train)
preds_valid = model.predict(OH_X_valid)
print(mean_absolute_error(y_valid, preds_valid))

17184.68566210046


In [42]:
data_test.shape

(1459, 60)

In [43]:
X_train.shape

(1168, 60)

In [44]:
low_cardinality = [col for col in object_cols if data_test[col].nunique() < 10]
print(low_cardinality)

['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']


In [45]:
# Drop test rows with missing values (this removes 12 of the 1459 rows,
# so those houses will not get a prediction)
final_X_test = data_test.dropna()
final_X_test.shape

(1447, 60)
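
Dropping rows leaves 1447 of the 1459 test houses. If a prediction is needed for every Id, one alternative is to fill the gaps instead of dropping the rows. The sketch below is not what this notebook does: it uses a made-up toy frame (`toy_test` is a hypothetical name) and fills numeric columns with the median and other columns with the most frequent value. In practice the fill values would usually be computed from the training data.

```python
import numpy as np
import pandas as pd

# Toy test set with gaps; column names mirror the housing data, values are made up
toy_test = pd.DataFrame(
    {
        "LotArea": [8450.0, np.nan, 11250.0],
        "Street": ["Pave", None, "Pave"],
    },
    index=pd.Index([1461, 1462, 1463], name="Id"),
)

filled_test = toy_test.copy()
for col in filled_test.columns:
    if pd.api.types.is_numeric_dtype(filled_test[col]):
        # Numeric column: fill with the median of the observed values
        filled_test[col] = filled_test[col].fillna(filled_test[col].median())
    else:
        # Categorical column: fill with the most frequent value
        filled_test[col] = filled_test[col].fillna(filled_test[col].mode()[0])

print(filled_test.shape)                 # same number of rows as toy_test
print(filled_test.isnull().sum().sum())  # number of missing values left
```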

In [46]:
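# Refit the encoder on the training data so it can be applied to the test data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)

OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_test = pd.DataFrame(OH_encoder.transform(final_X_test[low_cardinality_cols]))

# One-hot encoding resets the index; restore it
OH_cols_test.index = final_X_test.index

# Replace the categorical columns with their one-hot encoded versions
drop_X_test = final_X_test.drop(object_cols, axis=1)
OH_X_test = pd.concat([drop_X_test, OH_cols_test], axis=1)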

In [47]:
preds_test = model.predict(OH_X_test)