# Introduction to Machine Learning
This is my first attempt at building a simple machine learning model with a random forest. I am using the data from the Kaggle <a href="https://www.kaggle.com/c/home-data-for-ml-course/data">Housing Prices Competition for Kaggle Learn Users</a>.

This notebook will document my entire learning process. I will create another notebook with my complete and optimized model.

<h2 style="color: blue">Loading and Reading the Data</h2>

In [3]:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

data = pd.read_csv('train.csv', index_col='Id')
data_test = pd.read_csv('test.csv', index_col='Id')

<h2 style="color: blue">Understanding the Data</h2>

In [4]:
print(data.columns)
print(f"\nSize: {data.shape}\n")
data.head()

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,NaN,Reg,Lvl,AllPub,Inside,...,0,NaN,NaN,NaN,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,NaN,Reg,Lvl,AllPub,FR2,...,0,NaN,NaN,NaN,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,NaN,IR1,Lvl,AllPub,Inside,...,0,NaN,NaN,NaN,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,NaN,IR1,Lvl,AllPub,Corner,...,0,NaN,NaN,NaN,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,NaN,IR1,Lvl,AllPub,FR2,...,0,NaN,NaN,NaN,0,12,2008,WD,Normal,250000


<h2 style="color: blue">Prepping Data</h2>

In [5]:
# Remove rows with missing prices, then separate the target from the predictors
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
data.drop(['SalePrice'], axis=1, inplace=True)

# Keep only the numerical columns
X = data.select_dtypes(exclude=['object'])
X_test = data_test.select_dtypes(exclude=['object'])

# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [6]:
X_train.shape

(1168, 36)

In [7]:
X_train.columns

Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold'],
      dtype='object')

In [8]:
X_train.head()

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
619,20,90.0,11694,9,5,2007,2007,452.0,48,0,...,774,0,108,0,0,260,0,0,7,2007
871,20,60.0,6600,5,5,1962,1962,0.0,0,0,...,308,0,0,0,0,0,0,0,8,2009
93,30,80.0,13360,5,7,1921,2006,0.0,713,0,...,432,0,0,44,0,0,0,0,8,2009
818,20,NaN,13265,8,5,2002,2002,148.0,1218,0,...,857,150,59,0,0,0,0,0,7,2008
303,20,118.0,13704,7,5,2001,2002,150.0,0,0,...,843,468,81,0,0,0,0,0,1,2006


<h2 style="color: blue">Investigating Missing Numerical Values</h2>

I will do a preliminary investigation of several ways to deal with missing values: dropping the affected columns and a few forms of imputation.

In [9]:
missing_val_count_each_col = X_train.isnull().sum()
print(missing_val_count_each_col[missing_val_count_each_col > 0])

LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64
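
Raw counts are easier to judge next to the share of rows they represent. Here is a minimal sketch on a made-up frame; the `toy` data and the `missing_summary` helper are illustrative names, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X_train
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
})

def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(df),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Applied to `X_train`, the 212 missing `LotFrontage` values out of 1168 rows come to roughly 18%, while `MasVnrArea` and `GarageYrBlt` are each missing in 5% of rows or fewer.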


In [10]:
# Compare the validation error (MAE) of different ways of handling missing values
def error_missing_val(X_t, X_v, y_t, y_v):
    model = RandomForestRegressor(random_state=0)
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

In [11]:
# Dropping columns with missing values
#cols_missing = [col for col in X_train.columns if (X_train[col].isnull().any())]

#drop_col_X_train = X_train.drop(cols_missing, axis=1).copy()
#drop_col_X_valid = X_valid.drop(cols_missing, axis=1).copy()

# Imputation (Replacing missing values with mean value)
#imputer_1 = SimpleImputer()

#imputed_1_X_train = pd.DataFrame(imputer_1.fit_transform(X_train))
#imputed_1_X_valid = pd.DataFrame(imputer_1.transform(X_valid))

#imputed_1_X_train.columns = X_train.columns
#imputed_1_X_valid.columns = X_valid.columns

# Imputation (Replacing missing values with most frequent column value)
#imputer_3 = SimpleImputer(strategy='most_frequent')

#imputed_3_X_train = pd.DataFrame(imputer_3.fit_transform(X_train))
#imputed_3_X_valid = pd.DataFrame(imputer_3.transform(X_valid))

#imputed_3_X_train.columns = X_train.columns
#imputed_3_X_valid.columns = X_valid.columns

# Imputation (Replacing missing values with min column value)
#imputed_4_X_train = X_train.copy()
#imputed_4_X_valid = X_valid.copy()

#for col in cols_missing:
#    col_min = imputed_4_X_train[col].min()
#    imputed_4_X_train[col] = imputed_4_X_train[col].fillna(value=col_min)
#    imputed_4_X_valid[col] = imputed_4_X_valid[col].fillna(value=col_min)

# Imputation (Replacing missing values with a constant; SimpleImputer uses 0 for numerical data by default)
#imputer_5 = SimpleImputer(strategy='constant')

#imputed_5_X_train = pd.DataFrame(imputer_5.fit_transform(X_train))
#imputed_5_X_valid = pd.DataFrame(imputer_5.transform(X_valid))

#imputed_5_X_train.columns = X_train.columns
#imputed_5_X_valid.columns = X_valid.columns

<p style = "background-color: yellow">Based on the MAE values these approaches produced, I will replace missing values with each column's median, since it yielded the lowest error.</p>

In [15]:
# Replace the missing values with the median of each column
imputer_2 = SimpleImputer(strategy='median')

# Fit the imputer on the training set only, then reuse its medians for the validation set
imputed_X_train = pd.DataFrame(imputer_2.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
imputed_X_valid = pd.DataFrame(imputer_2.transform(X_valid), columns=X_valid.columns, index=X_valid.index)

print(error_missing_val(imputed_X_train, imputed_X_valid, y_train, y_valid))

17791.59899543379
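
The comparison behind the choice of the median can be written as a single loop instead of one block per method. The sketch below runs on synthetic data, so its numbers say nothing about the housing set; `X_demo`, `y_demo`, `score`, and `results` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): three numeric features, one with gaps
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 3 * X_demo["a"] + X_demo["b"] + rng.normal(scale=0.1, size=200)
X_demo.loc[rng.choice(200, size=30, replace=False), "b"] = np.nan

Xd_train, Xd_valid, yd_train, yd_valid = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

def score(X_t, X_v):
    """Fit a small random forest and return its MAE on the validation set."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_t, yd_train)
    return mean_absolute_error(yd_valid, model.predict(X_v))

results = {}

# Option 1: drop every column that has a missing value in the training set
cols_missing = [c for c in Xd_train.columns if Xd_train[c].isnull().any()]
results["drop_columns"] = score(Xd_train.drop(columns=cols_missing),
                                Xd_valid.drop(columns=cols_missing))

# Option 2: impute, fitting each imputer on the training set only
for strategy in ["mean", "median", "most_frequent", "constant"]:
    imputer = SimpleImputer(strategy=strategy)
    X_t = pd.DataFrame(imputer.fit_transform(Xd_train), columns=Xd_train.columns)
    X_v = pd.DataFrame(imputer.transform(Xd_valid), columns=Xd_valid.columns)
    results[strategy] = score(X_t, X_v)

# Print the approaches from lowest to highest validation MAE
for name, mae in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:>14}: {mae:.4f}")
```

Keeping the results in a dictionary makes it easy to add or remove a strategy without copying another block of code.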


<h2 style="color: blue">Experimenting With Random Forest Models</h2>

In [16]:
# criterion='absolute_error' optimizes MAE (this option was named 'mae' before scikit-learn 1.0)
rf_model_1 = RandomForestRegressor(random_state=0)
rf_model_2 = RandomForestRegressor(n_estimators=200, random_state=0)
rf_model_3 = RandomForestRegressor(n_estimators=200, criterion='absolute_error', random_state=0)
rf_model_4 = RandomForestRegressor(n_estimators=200, criterion='absolute_error', min_samples_split=10, random_state=0)
rf_model_5 = RandomForestRegressor(n_estimators=300, min_samples_split=10, max_depth=7, random_state=0)

models = [rf_model_1, rf_model_2, rf_model_3, rf_model_4, rf_model_5]

In [20]:
def error_model(model):
    model.fit(imputed_X_train, y_train)
    preds = model.predict(imputed_X_valid)
    return mean_absolute_error(y_valid, preds)

In [25]:
for model in models:
    mae = error_model(model)
    print(mae)

17791.59899543379
17491.11966894977
17897.672243150686
18145.702619863016
18496.001894298246
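
A bare list of numbers is hard to map back to the models, so it can help to label each candidate and pick the lowest error programmatically. This is a sketch on synthetic data with made-up labels (`default`, `more_trees`, `shallow`), not the notebook's actual run.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.2, size=300)
X_t, X_v, y_t, y_v = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# Hypothetical labels for a few candidate configurations
candidates = {
    "default": RandomForestRegressor(n_estimators=50, random_state=0),
    "more_trees": RandomForestRegressor(n_estimators=100, random_state=0),
    "shallow": RandomForestRegressor(n_estimators=100, min_samples_split=10,
                                     max_depth=7, random_state=0),
}

# Fit each candidate and record its validation MAE under its label
scores = {}
for name, model in candidates.items():
    model.fit(X_t, y_t)
    scores[name] = mean_absolute_error(y_v, model.predict(X_v))
    print(f"{name}: {scores[name]:.4f}")

# The candidate with the smallest validation MAE
best_name = min(scores, key=scores.get)
print(f"Lowest validation MAE: {best_name}")
```

In the run above, the lowest MAE (17491.12) belongs to `rf_model_2`, the 200-tree model with otherwise default settings.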


I will keep submitting to the Housing Prices Competition for Kaggle Learn Users as I modify my code:
<p><strong>Most recent update:</strong> Your submission scored 16530.51961, which is an improvement of your previous score of 16615.62614. Great job!</p>
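
For reference, producing a submission file could look roughly like the sketch below. It uses synthetic frames in place of the real data, and it assumes the competition accepts a two-column CSV with `Id` and `SalePrice`; the file name `submission.csv` and all variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

# Synthetic stand-ins for the numeric training and test frames (assumption)
rng = np.random.default_rng(0)
cols = ["f1", "f2", "f3"]
train_X = pd.DataFrame(rng.normal(size=(100, 3)), columns=cols,
                       index=pd.RangeIndex(1, 101, name="Id"))
train_y = 100000 + 20000 * train_X["f1"] + rng.normal(scale=1000, size=100)
test_X = pd.DataFrame(rng.normal(size=(20, 3)), columns=cols,
                      index=pd.RangeIndex(101, 121, name="Id"))
train_X.iloc[::10, 1] = np.nan  # gaps in a training column
test_X.iloc[::5, 2] = np.nan    # gaps that appear only in the test set

# Fit the imputer on the training data, then apply the same medians to the test data
imputer = SimpleImputer(strategy="median")
train_imp = pd.DataFrame(imputer.fit_transform(train_X),
                         columns=train_X.columns, index=train_X.index)
test_imp = pd.DataFrame(imputer.transform(test_X),
                        columns=test_X.columns, index=test_X.index)

# Train on all labeled rows and predict the test set
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(train_imp, train_y)

submission = pd.DataFrame({"Id": test_imp.index,
                           "SalePrice": model.predict(test_imp)})
submission.to_csv("submission.csv", index=False)
print(submission.head())
```

Because the imputer stores a median for every column it was fitted on, it also fills columns that only have gaps in the test set. Refitting the chosen model on all labeled rows, rather than on the 80% training split alone, is a common choice once validation is finished.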