## Level 2 *Learn Machine Learning* series on Kaggle
This is the level 2 part of the [*Learn Machine Learning*](https://www.kaggle.com/learn/machine-learning) series on Kaggle, using Python. The data comes from the [*House Prices: Advanced Regression Techniques*](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition.

Like the post for level 1, this post will show the section name, my code from the corresponding section for the instructions under **Your Turn**, and some brief notes on what is taught in each section.

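First I'll run the necessary code from before and add a new function, `score_dataset`, which fits a random forest on the training data and returns the mean absolute error on the held-out data.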

In [1]:
# Import the necessary libraries
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor


# Save filepath to variable
training_data_filepath = "C:/Development/Kaggle/House Prices - Advanced \
Regression Techniques/train.csv"

# Read the data and store in a dataframe called training_set
training_set = pd.read_csv(training_data_filepath)

# Select the target variable and call it y
y = training_set.SalePrice

# Create a list of the predictor variables
# predictors = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF", "FullBath",
#               "BedroomAbvGr", "TotRmsAbvGrd"]

# Create the dataframe with only numeric predictors
# X = training_set[predictors]
X = training_set.select_dtypes(exclude=["object"])

# Split data into training and validation data, for both predictors and
# target.
# The split is based on a random number generator. Supplying a numeric value
# to the random_state argument guarantees we get the same split every time we
# run this script. It can be any number; I'm choosing 42.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                 train_size=0.7,
                                                 test_size=0.3)

def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)
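
One thing to watch: `select_dtypes(exclude=["object"])` keeps every numeric column, and the missing-value listing in Section 1 shows that this includes `Id` and the target `SalePrice`. With the target among the predictors, the error figures in this post are likely much more optimistic than they would be on unseen data. Below is a minimal sketch of one way to exclude those columns. The frame `toy` and its values are made up for illustration, and `include="number"` is my substitution for the `exclude=["object"]` call used above.

```python
import pandas as pd

# Hypothetical toy frame standing in for training_set
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})

# The target is kept separately
y_toy = toy["SalePrice"]

# Keep numeric columns only, then drop the row identifier and the target
# so that neither can be used as a predictor
X_toy = (toy.select_dtypes(include="number")
            .drop(columns=["Id", "SalePrice"]))

print(list(X_toy.columns))
```

Rerunning the cells in this post with predictors built this way would change every score, so the numbers shown here should be read with that caveat.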

### Section 1
[Handling Missing Values](https://www.kaggle.com/dansbecker/handling-missing-values)

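This section teaches three approaches for dealing with missing values: dropping the columns that contain them, imputing them, and imputing them while adding extra columns that record which values were imputed.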

In [2]:
# Detect which columns have missing values
print(X.isnull().sum())

Id                 0
MSSubClass         0
LotFrontage      259
LotArea            0
OverallQual        0
OverallCond        0
YearBuilt          0
YearRemodAdd       0
MasVnrArea         8
BsmtFinSF1         0
BsmtFinSF2         0
BsmtUnfSF          0
TotalBsmtSF        0
1stFlrSF           0
2ndFlrSF           0
LowQualFinSF       0
GrLivArea          0
BsmtFullBath       0
BsmtHalfBath       0
FullBath           0
HalfBath           0
BedroomAbvGr       0
KitchenAbvGr       0
TotRmsAbvGrd       0
Fireplaces         0
GarageYrBlt       81
GarageCars         0
GarageArea         0
WoodDeckSF         0
OpenPorchSF        0
EnclosedPorch      0
3SsnPorch          0
ScreenPorch        0
PoolArea           0
MiscVal            0
MoSold             0
YrSold             0
SalePrice          0
dtype: int64
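
In the listing above only `LotFrontage`, `MasVnrArea` and `GarageYrBlt` have non-zero counts. A small sketch of filtering the counts down to just the affected columns, shown on a hypothetical toy frame rather than the competition data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
})

# Count missing values per column, then keep only the non-zero counts
missing_counts = toy.isnull().sum()
missing_only = missing_counts[missing_counts > 0]

print(missing_only)
```

The same boolean mask should work on `X.isnull().sum()` from the cell above.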


In [3]:
# Get model score from dropping columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)

print("Mean Absolute Error from dropping columns with missing values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

Mean Absolute Error from dropping columns with missing values:
1094.2438356164384


In [6]:
# Get model score from Imputation
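# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement and also defaults to mean imputation.
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()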
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)

# "fit_transform" learns the mean of each column from the training data and
# fills the missing values in that same data.
# "transform" then fills the missing values in the test set (called the
# "validation set" in the first tutorial) using those training means.

print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

Mean Absolute Error from Imputation:
1075.4383561643833
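
To make the fit/transform split concrete, here is a minimal sketch on made-up numbers. It assumes a scikit-learn version that provides `SimpleImputer` in `sklearn.impute`, the class that replaced `Imputer`; the arrays are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data with one gap in each column
train = np.array([[1.0, 10.0],
                  [3.0, np.nan],
                  [np.nan, 30.0]])
# Hypothetical held-out row with both values missing
test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer()  # default strategy is the column mean

# fit_transform learns the column means from train and fills its gaps
train_filled = imputer.fit_transform(train)

# transform reuses the means learned from train; it does not refit on test
test_filled = imputer.transform(test)

print(imputer.statistics_)  # the learned column means
print(test_filled)
```

The held-out row is filled with the training means, which is why only `transform` is called on the test data.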


In [7]:
# Get model score from Imputation with extra columns showing what was imputed
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

for col in cols_with_missing:
    imputed_X_train_plus[col + "_was_missing"] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + "_was_missing"] = imputed_X_test_plus[col].isnull()

# Imputation
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while tracking what was imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))

Mean Absolute Error from Imputation while tracking what was imputed:
1027.7054794520548
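
Newer scikit-learn versions can also build the indicator columns themselves: `SimpleImputer` accepts an `add_indicator` argument. This is not what the tutorial code above does, just an alternative route to a similar result, sketched here on hypothetical numbers.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: only the second column has a missing value
train = np.array([[1.0, 10.0],
                  [3.0, np.nan],
                  [5.0, 30.0]])

# add_indicator=True appends one 0/1 column for each feature that had
# missing values when the imputer was fitted
imputer = SimpleImputer(add_indicator=True)
out = imputer.fit_transform(train)

print(out)
```

The result has three columns: the two imputed features followed by a flag marking the row where the second feature was originally missing.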
