### Iowa Housing Lab -- Solutions

Welcome! This lab is a more advanced version of yesterday's class. We'll again build a regression model to predict housing prices, but this time with a dataset that has a more interesting mix of data -- ordinal and nominal features, as well as some missing values.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1).  Load in both your training & test sets**

In [1]:
# your code here
import pandas as pd
import numpy as np
train = pd.read_csv('../data/iowa_housing/train.csv')
test  = pd.read_csv('../data/iowa_housing/test.csv')

In [2]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 19 columns):
Id              1460 non-null int64
MSSubClass      1460 non-null int64
MSZoning        1460 non-null object
LotArea         1460 non-null int64
Neighborhood    1460 non-null object
OverallQual     1460 non-null int64
OverallCond     1460 non-null int64
YearBuilt       1460 non-null int64
GrLivArea       1460 non-null int64
1stFlrSF        1460 non-null int64
2ndFlrSF        1460 non-null int64
GrLivArea.1     1460 non-null int64
FullBath        1460 non-null int64
HalfBath        1460 non-null int64
GarageType      1379 non-null object
GarageYrBlt     1379 non-null float64
GarageFinish    1379 non-null object
GarageCars      1460 non-null int64
SalePrice       1460 non-null int64
dtypes: float64(1), int64(14), object(4)
memory usage: 216.8+ KB


Also, when you're cleaning training & test sets, it's usually a good idea to separate the column you're trying to predict from everything else.

For now, declare `y` to be the `SalePrice` column, and then remove it from the training set entirely.  You can drop the `Id` column too, since it encodes nothing meaningful.

In [3]:
# your answer here
y = train['SalePrice']
train.drop('SalePrice', axis=1, inplace=True)
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)

**Step 2).  There are missing values throughout this dataset.  For the time being, let's try and do a few things:**

 - Were these missing values likely to be randomly occurring, or are they likely encoding for something else?

If missing values are encoding for something else, they are usually highly correlated with missing values in related columns, and/or they represent a particular rank in a hierarchy -- e.g., 'None', 0, 'Other', etc.  In other words, the missing values stand for something specific that simply isn't written down.

Take a look at the column descriptions, see what you think they might be.

 - If you think they are missing at random, fill in the missing values with the column's mean (numeric columns) or mode (categorical columns).
 - If you think they are **not** missing at random, then go ahead and fill them in with a value that encodes what they are (0, 'Other', and 'None' are common choices).
 
**Hint:** You can try encoding null & non-null values to 0 and 1, respectively, and then call the `corr()` method on the resulting indicator columns.
 
*If filling in missing values, make sure to perform this operation on the training and test set, using values from the training set for imputation.*

In [4]:
# your code here
train_empty = train.loc[:, train.isnull().sum() > 0]

In [5]:
# the missing values in these columns are perfectly correlated (1.0)
# all three columns describe the garage -- the nulls almost certainly mean the same thing (no garage)
train_empty.isnull().astype(int).corr()

              GarageType  GarageYrBlt  GarageFinish
GarageType           1.0          1.0           1.0
GarageYrBlt          1.0          1.0           1.0
GarageFinish         1.0          1.0           1.0
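
To make the missingness check concrete, here is a minimal sketch on a made-up toy frame (the values are invented for illustration and the `LotShape` column is not part of this lab's dataset). Columns whose nulls always occur together get a correlation of 1.0, while unrelated nulls do not.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration -- not the Iowa dataset
toy = pd.DataFrame({
    'GarageType':   ['Attchd', np.nan, 'Detchd', np.nan, 'Attchd', 'Attchd'],
    'GarageFinish': ['Fin',    np.nan, 'Unf',    np.nan, 'RFn',    'Unf'],
    'LotShape':     ['Reg',    'IR1',  np.nan,   'Reg',  'Reg',    np.nan],
})

# 1 where a value is missing, 0 where it is present
missing_flags = toy.isnull().astype(int)

# correlation between the missingness indicators of each pair of columns
missing_corr = missing_flags.corr()
print(missing_corr)
```

In this toy frame the two garage columns are always missing on the same rows, which is the pattern that suggests the nulls encode something specific rather than occurring at random.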


In [6]:
# grab the columns
cols = train_empty.columns.tolist()
# fill with a value that encodes "no garage" -- 'NA' or 'Other' could also work
train[['GarageType', 'GarageFinish']] = train[['GarageType', 'GarageFinish']].fillna('None')
test[['GarageType', 'GarageFinish']]  = test[['GarageType', 'GarageFinish']].fillna('None')

In [7]:
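# GarageYrBlt is a numeric column, so we'll use 0 to encode "no garage"
# (assigning the result back avoids pandas' chained in-place assignment warnings)
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
test['GarageYrBlt']  = test['GarageYrBlt'].fillna(0)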

In [8]:
# there are still some missing values in the test set; we'll impute these
# using values from the training set
test.isnull().sum()

MSSubClass      0
MSZoning        4
LotArea         0
Neighborhood    0
OverallQual     0
OverallCond     0
YearBuilt       0
GrLivArea       0
1stFlrSF        0
2ndFlrSF        0
GrLivArea.1     0
FullBath        0
HalfBath        0
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageCars      1
dtype: int64

In [9]:
# finding the values to use in the training set
ms_mode   = train['MSZoning'].mode()[0]
gcarsmean = train['GarageCars'].mean()

In [10]:
# and applying them to the test set
test['MSZoning']   = test['MSZoning'].fillna(ms_mode)
test['GarageCars'] = test['GarageCars'].fillna(gcarsmean)
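
The pattern in the cells above (compute fill values on the training set, then apply them to the test set) can be wrapped in a small helper. This is only a sketch on made-up toy data; the helper name `fit_fill_values` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df):
    """Hypothetical helper: learn one fill value per column from df.

    Numeric columns use the mean, all other columns use the mode.
    """
    fills = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fills[col] = df[col].mean()
        else:
            fills[col] = df[col].mode()[0]
    return fills

# Toy data, invented for illustration
toy_train = pd.DataFrame({'MSZoning': ['RL', 'RL', 'RM', 'FV'],
                          'GarageCars': [2.0, 1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({'MSZoning': ['RM', np.nan],
                         'GarageCars': [np.nan, 1.0]})

# learn the fill values from the training set only...
fill_values = fit_fill_values(toy_train)

# ...and apply them to the test set
toy_test_filled = toy_test.fillna(fill_values)
```

Keeping the "learn" and "apply" steps separate makes it harder to accidentally compute statistics on the test set.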

**Step 3): Ordinal vs Nominal Columns**

There are a number of categorical columns in this dataset, and each one could represent either ordinal or nominal data.

Take a look at their descriptions, and decide which type each column belongs to.

In [11]:
# your answer here (no real code required for this one)

**Step 4):  Go Ahead and Map Your Ordinal Variables To Their Appropriate Numeric Ranks, If Any Exist**

In [12]:
# your code here
# we'll assume GarageFinish is ordinal, i.e. no garage < unfinished < partially finished < finished
garage_mapping = {
    'None': 0, # no garage
    'Unf' : 1, # unfinished garage
    'RFn' : 2, # partially finished garage
    'Fin' : 3  # finished garage
}

train['GarageFinish'] = train['GarageFinish'].map(garage_mapping)
test['GarageFinish']  = test['GarageFinish'].map(garage_mapping)
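
One thing to watch for with `map`: any label missing from the dictionary silently becomes `NaN`. A quick check along the following lines can catch that. This is a sketch on a made-up Series, and the `'Partial'` label is invented purely to trigger the problem.

```python
import pandas as pd

garage_mapping = {'None': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

# Toy column; 'Partial' is an invented label that is not in the mapping
toy = pd.Series(['Fin', 'Unf', 'None', 'Partial'], name='GarageFinish')

mapped = toy.map(garage_mapping)

# labels that had no entry in the mapping end up as NaN after map()
unmapped_labels = toy[mapped.isnull()].unique().tolist()
print(unmapped_labels)
```

If `unmapped_labels` is non-empty on the real data, the mapping dictionary is incomplete and the new nulls would need to be handled before fitting a model.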

**Step 5):  Now, OneHot Encode Your Dataset For Your Remaining Categorical Columns** 

**Note:** You want your training and your test sets attached for this one.  Detach them when you're finished.

**2nd Note:** Some columns are categorical, even if they're encoded as numbers.  `MSSubClass`, for example, identifies the type of dwelling, so its numbers are category labels rather than quantities.  It's a good idea to convert these variables to strings using the `astype` method.

**3rd Note:** We'll discuss better ways to get around this, but the test set has a value in the `MSSubClass` column that is **not** in the training set.  For the time being just drop the column `MSSubClass_150` from both training and test sets before proceeding to the next step.

In [13]:
# MSSubClass is really a category rather than a true number,
# so we'll convert it to a string so that it gets one-hot encoded
train['MSSubClass'] = train['MSSubClass'].astype(str)
test['MSSubClass']  = test['MSSubClass'].astype(str)

In [14]:
# concatenate and encode
master = pd.concat([train, test])
master = pd.get_dummies(master)

In [15]:
# drop MSSubClass_150, which only appears in the test set
master.drop('MSSubClass_150', axis=1, inplace=True)

In [16]:
# and split back apart
train  = master.iloc[:1460].copy()
test   = master.iloc[1460:].copy()
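
As one possible alternative to concatenating and then dropping `MSSubClass_150` by hand, the test set's dummy columns can be aligned to the training set's columns with `reindex`. This is a sketch on made-up toy frames, not necessarily the approach the course will cover later.

```python
import pandas as pd

# Toy data: '150' appears only in the test set, '20' only in the training set
toy_train = pd.DataFrame({'MSSubClass': ['20', '60', '20']})
toy_test = pd.DataFrame({'MSSubClass': ['60', '150']})

train_dummies = pd.get_dummies(toy_train)

# encode the test set separately, then force it onto the training columns:
# columns unseen in training are dropped, columns missing in test are filled with 0
test_dummies = pd.get_dummies(toy_test).reindex(
    columns=train_dummies.columns, fill_value=0
)
print(test_dummies)
```

Note the trade-off: a test row with an unseen category ends up with all zeros for that feature, so the model treats it as belonging to none of the known categories.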

**Step 6): Standardize Your Data On Your Training and Test Sets**

**Remember:** Use the values from your training set to standardize your test set!  

Ask me if you have any questions on how to do this.

In [17]:
train_means = train.mean()
train_stds  = train.std()

In [18]:
# standardize the training set
train_std = train - train_means
train_std /= train_stds

In [19]:
# and do the same for the test set
test -= train_means
test /= train_stds
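
The same idea (learn the means and standard deviations on the training set, reuse them on the test set) can also be expressed with scikit-learn's `StandardScaler`. The sketch below uses made-up numbers. One difference worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so the scaled values differ slightly from the manual version.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration
toy_train = pd.DataFrame({'LotArea': [8000.0, 9000.0, 10000.0, 13000.0]})
toy_test = pd.DataFrame({'LotArea': [10000.0, 12000.0]})

scaler = StandardScaler()

# fit_transform learns the mean and std from the training set and scales it
train_scaled = scaler.fit_transform(toy_train)

# transform reuses the training-set statistics on the test set
test_scaled = scaler.transform(toy_test)
```

Because the test set is scaled with training statistics, its columns will generally not have a mean of exactly 0 or a standard deviation of exactly 1, and that is expected.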

**Step 7):  Create a validation set out of your training set, and import Linear Regression**

Since there is no time-based component, random shuffling is fine.

In [20]:
# your answer here
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

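# split the standardized training data (train_std) so the model is fit on
# the same scale that was applied to the test set
X_train, X_val, y_train, y_val = train_test_split(train_std, y, random_state=2020)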

**Step 8): Initialize Linear Regression, fit it on your training set, and score it on your validation set to get a feel for how you did.**

In [21]:
# your answer here
lreg = LinearRegression()
lreg.fit(X_train, y_train)
lreg.score(X_val, y_val)

0.853339945174324
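
For `LinearRegression`, `score` returns the coefficient of determination (R²). Since R² is unitless, it can help to also look at an error in the target's own units, such as RMSE. The sketch below uses synthetic data generated on the spot (the coefficients and noise level are arbitrary assumptions), not the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal plus noise, loosely on the scale of house prices
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 180_000 + X @ np.array([40_000.0, 15_000.0, -5_000.0]) + rng.normal(scale=10_000, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=2020)
model = LinearRegression().fit(X_tr, y_tr)

# R^2 on the validation set (what .score() reports)
r2 = model.score(X_va, y_va)

# root mean squared error, in the same units as the target
rmse = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))
```

On the real data, `rmse` would read as a typical prediction error in dollars, which is often easier to interpret than R² alone.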

**Step 9):  Finally, go ahead and make your predictions on your test set.**

Create a dataframe with the following columns: 

    `ID`: The original ID of each row in the test set.  Goes from 1461 - 2919
    `SalePrice`: The predicted Sale Price of the house in the test set. 

In [22]:
# your answer here
preds = pd.DataFrame()
preds['ID'] = np.arange(1461, 1461+1459)
# name the column `SalePrice`, as specified above
preds['SalePrice'] = lreg.predict(test)

Now, use the `to_csv()` method to write the dataframe to a csv file.  Make sure to use `index=False` as an argument.

In [23]:
preds.to_csv('submission.csv', index=False)

**Bonus:** Can you improve your score?

The first part of this lab was meant to be a walkthrough of the basics of prepping a dataset and getting it ready for modeling.

However, there's a lot that could be improved upon!  

Using validation scores as your guide, you could try and look at some of the following:

 - Removing outliers from the target variable, or applying a log transformation to reduce its skew
 - There are lots of highly correlated variables in this dataset.  Do the 4 different columns about the fireplace really tell you something that different from one another?  You can try averaging multiple columns into one if they're highly correlated, or removing some entirely to see if it improves anything.
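
As a starting point for the log-transformation idea, scikit-learn's `TransformedTargetRegressor` can fit a model on the log of the target and convert predictions back automatically. The sketch below uses synthetic, right-skewed "prices" generated on the spot (all numbers are arbitrary assumptions), so it only illustrates the mechanics and is not a result on the housing data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed target: linear in log space, plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
prices = np.exp(12 + 0.3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.05, size=300))

# fit on log1p(prices); predictions are mapped back with expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X, prices)

# predictions come back on the original price scale
predicted_prices = log_model.predict(X)
```

Whether this helps on the real data is something to check with your validation score, comparing it against the untransformed model on the same split.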

In [24]:
# your answer here