# Kaggle Regression

## Basic set-up for working w/ pandas in Jupyter Notebooks
As long as you have Anaconda installed, the actual importing of data should be the same for you as it is in this notebook. Personally, I work with a lot of csv's, but there are some other methods for getting files into a pandas dataframe that I'll include in another notebook. This notebook is an analysis of some basic housing data from kaggle.com (essentially just a practice for an intermediate regression problem). The top part is mostly what you would be concerned with when looking at how to import your data.<br>
<br>
Some useful links: <br>
Pandas Tutorials: https://pandas.pydata.org/pandas-docs/stable/tutorials.html <br>
Datacamp Tutorial: https://www.datacamp.com/courses/intro-to-python-for-data-science <br>
Datacamp cheatsheets: https://www.datacamp.com/community/data-science-cheatsheets<br>
Python and ML Resources: https://sebastianraschka.com/resources.html
<br>
<br>
Good Books:<br> 
*Python Crash Course* by Eric Matthes<br>
*Python Machine Learning* by Sebastian Raschka<br>
*Python for Data Analysis* by Wes McKinney<br><br>
*Note: Markdown is below the code that it refers to. Comments are inline* 

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Importing the proper libraries
Above, we are just importing the libraries we need to hold the data (numpy and pandas) and visualize it (matplotlib). I actually hardly use numpy in my code, but is a dependency for pandas, so it must be imported. Pandas has a **DataFrame** object that is like a spreadsheet or sql table. This is what you're going to want to work with your data in. 

In [2]:
kag = pd.read_csv('data/train.csv', index_col = 0, header = 0)

### Importing from csv
It's really this easy. One line of code. Here's the breakdown:
- *kag* is the variable you're going to store your dataframe in. Need this to call the dataframe in your code.
- *pd* refers to the pandas library we imported, with all of its methods and classes
- *.read_csv()* is a pandas method used to read from a csv file
- The common parameters are as follows (data, index_col, header):
    - your data is typically going to be a relative file path. It's good practice to create a data folder in the folder you are doing your data analysis in and keeping your data in that. Makes it much easier to call
    - index_col is just the column in your csv you're going to use as the index
    - header is the row that will be the column names

In [3]:
kag.head() #.head is a pandas dataframe method that shows the top n rows of a dataframe (default is 5 rows)

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
kag.info() #.info shows you some basic stuff about your columns (how many non-nulls, data type, etc.)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-

In [5]:
kag.GarageCond.head() #some different ways to slice data

Id
1    TA
2    TA
3    TA
4    TA
5    TA
Name: GarageCond, dtype: object

In [6]:
kag["GarageCond"].head() #same as above

Id
1    TA
2    TA
3    TA
4    TA
5    TA
Name: GarageCond, dtype: object

In [7]:
kag[["GarageCond"]].head() #show as a dataframe with double brackets

Unnamed: 0_level_0,GarageCond
Id,Unnamed: 1_level_1
1,TA
2,TA
3,TA
4,TA
5,TA


In [15]:
kag.iloc[[0],[3]] #slice by integer row and column with .iloc (here shows row 0, column 3)

Unnamed: 0_level_0,LotArea
Id,Unnamed: 1_level_1
1,8450


In [16]:
kag.iloc[0,3] # as a series

8450

In [17]:
kag.iloc[:4,:3] # return a up to row 4 and up to column three (Note that python indexing starts at 0 and is non-inclusive)
#i.e this is technically showing rows 0, 1, 2, and 3

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,60,RL,65.0
2,20,RL,80.0
3,60,RL,68.0
4,70,RL,60.0


In [18]:
kag.iloc[:4,1]

Id
1    RL
2    RL
3    RL
4    RL
Name: MSZoning, dtype: object

In [19]:
kag.iloc[:4,[1]]

Unnamed: 0_level_0,MSZoning
Id,Unnamed: 1_level_1
1,RL
2,RL
3,RL
4,RL


### Filtering

In [21]:
kag[kag.GarageCond == "Gd"] # show where kag dataframe's GarageCond column is equal to "Gd"

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
43,85,RL,,9180,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,,MnPrv,,0,12,2007,WD,Normal,144000
210,20,RL,75.0,8250,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,7,2008,WD,Normal,145000
270,20,RL,,7917,Pave,,IR1,Lvl,AllPub,Corner,...,0,,GdPrv,,0,5,2007,WD,Normal,148000
741,70,RM,60.0,9600,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,,0,5,2007,WD,Abnorml,132000
843,80,RL,82.0,9020,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,,0,5,2008,WD,Normal,174900
884,75,RL,60.0,6204,Pave,,Reg,Bnk,AllPub,Inside,...,0,,,,0,3,2006,WD,Normal,118500
1282,20,RL,50.0,8049,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,,,,0,7,2006,WD,Normal,180000
1313,60,RL,,9572,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,6,2007,WD,Normal,302000
1424,80,RL,,19690,Pave,,IR1,Lvl,AllPub,CulDSac,...,738,Gd,GdPrv,,0,8,2006,WD,Alloca,274970


In [None]:
kag_cnd = kag[kag.GarageCond == "Gd"] # can store it in a variable

### Python comparators
- < less than
- <= less than or equal to
- *>* greater than
- *>=* greater than or equal to
- == equal to
- != not equal to

### Logical Operators
- & --> and
- | --> or
- ! --> not

In [22]:
#in action
kag[(kag.GarageCond == "Gd") & (kag.SalePrice > 175000)]

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1282,20,RL,50.0,8049,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,,,,0,7,2006,WD,Normal,180000
1313,60,RL,,9572,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,6,2007,WD,Normal,302000
1424,80,RL,,19690,Pave,,IR1,Lvl,AllPub,CulDSac,...,738,Gd,GdPrv,,0,8,2006,WD,Alloca,274970


# Summary of Intro
The stuff above should get you started with playing around with data in python. The rest of this notebook trends toward exploratory analysis and feature engineering, which are useful, but you'd be better served looking at some of the documentation or books that I've listed. They will be able to help you out with getting started
# XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

In [5]:
kag.GarageCond.value_counts() 

TA    1326
Fa      35
Gd       9
Po       7
Ex       2
Name: GarageCond, dtype: int64

In [6]:
kag.MiscFeature.value_counts()

Shed    49
Othr     2
Gar2     2
TenC     1
Name: MiscFeature, dtype: int64

In [7]:
kag.Neighborhood.value_counts()

NAmes      225
CollgCr    150
OldTown    113
Edwards    100
Somerst     86
Gilbert     79
NridgHt     77
Sawyer      74
NWAmes      73
SawyerW     59
BrkSide     58
Crawfor     51
Mitchel     49
NoRidge     41
Timber      38
IDOTRR      37
ClearCr     28
StoneBr     25
SWISU       25
MeadowV     17
Blmngtn     17
BrDale      16
Veenker     11
NPkVill      9
Blueste      2
Name: Neighborhood, dtype: int64

In [8]:
neighborhood = pd.DataFrame(kag.Neighborhood.value_counts())

In [9]:
neighborhood = kag[['Neighborhood', 'SalePrice']]

In [10]:
ngrp = neighborhood.groupby('Neighborhood').median()

In [11]:
ngrp

Unnamed: 0_level_0,SalePrice
Neighborhood,Unnamed: 1_level_1
Blmngtn,191000
Blueste,137500
BrDale,106000
BrkSide,124300
ClearCr,200250
CollgCr,197200
Crawfor,200624
Edwards,121750
Gilbert,181000
IDOTRR,103000


In [12]:
ngrp.describe()

Unnamed: 0,SalePrice
count,25.0
mean,176515.96
std,60700.369519
min,88000.0
25%,135000.0
50%,179900.0
75%,200624.0
max,315000.0


In [13]:
ngrp['Neighborhood'] = ngrp.index

In [14]:
ngrp.sort_values('SalePrice', ascending = False)

Unnamed: 0_level_0,SalePrice,Neighborhood
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1
NridgHt,315000,NridgHt
NoRidge,301500,NoRidge
StoneBr,278000,StoneBr
Timber,228475,Timber
Somerst,225500,Somerst
Veenker,218000,Veenker
Crawfor,200624,Crawfor
ClearCr,200250,ClearCr
CollgCr,197200,CollgCr
Blmngtn,191000,Blmngtn


In [15]:
kag[kag.LotFrontage.isnull()]

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,Shed,350,11,2009,WD,Normal,200000
13,20,RL,,12968,Pave,,IR2,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,144000
15,20,RL,,10920,Pave,,IR1,Lvl,AllPub,Corner,...,0,,GdWo,,0,5,2008,WD,Normal,157000
17,20,RL,,11241,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,,,Shed,700,3,2010,WD,Normal,149000
25,20,RL,,8246,Pave,,IR1,Lvl,AllPub,Inside,...,0,,MnPrv,,0,5,2010,WD,Normal,154000
32,20,RL,,8544,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,,MnPrv,,0,6,2008,WD,Normal,149350
43,85,RL,,9180,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,,MnPrv,,0,12,2007,WD,Normal,144000
44,20,RL,,9200,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,,MnPrv,,0,7,2008,WD,Normal,130250
51,60,RL,,13869,Pave,,IR2,Lvl,AllPub,Corner,...,0,,,,0,7,2007,WD,Normal,177000
65,60,RL,,9375,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,,0,2,2009,WD,Normal,219500


In [16]:
#Creating a function that will create a new column
def house_age(d):
    if d["YearBuilt"] == d['YearRemodAdd']:
        return 2014 - d['YearBuilt']
    else:
        return 2014 - d['YearRemodAdd']

In [17]:
kag["HomeAge"] = kag.apply(house_age, axis = 1) #creating the new column

In [18]:
kag[['HomeAge']].head()

Unnamed: 0_level_0,HomeAge
Id,Unnamed: 1_level_1
1,11
2,38
3,12
4,44
5,14


In [19]:
kag[['YearBuilt', 'YearRemodAdd', 'HomeAge']].head()

Unnamed: 0_level_0,YearBuilt,YearRemodAdd,HomeAge
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,2003,2003,11
2,1976,1976,38
3,2001,2002,12
4,1915,1970,44
5,2000,2000,14


In [20]:
kag.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 81 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-

In [21]:
kag.Electrical.value_counts()

SBrkr    1334
FuseA      94
FuseF      27
FuseP       3
Mix         1
Name: Electrical, dtype: int64

In [22]:
kag[kag.Electrical.isnull()]

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,HomeAge
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1380,80,RL,73.0,9735,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,5,2008,WD,Normal,167500,7


In [23]:
kag.loc[1380, 'Electrical'] = 'SBrkr'

In [24]:
kag.Electrical.value_counts()

SBrkr    1335
FuseA      94
FuseF      27
FuseP       3
Mix         1
Name: Electrical, dtype: int64

In [25]:
kag.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 81 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-

# XXXXXXXXXXXXXXXX
#          KAG Fill
# XXXXXXXXXXXXXXXX

In [26]:
kag = kag.fillna(0)

In [27]:
kag.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 81 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 no

In [28]:
def Garage_Age(d):
    if d.GarageYrBlt == 0:
        return 0
    else: 
        return 2014 - d.GarageYrBlt

In [29]:
kag['GarageAge'] = kag.apply(Garage_Age, axis = 1)

In [30]:
continuous_columns = ['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'MasVnrArea', 'BsmtUnfSF', 'TotalBsmtSF', 'GrLivArea', 'LowQualFinSF', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageAge', 'HomeAge', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal']

# XXXXXXXXXXXXXXX
# Creation of Dummies
# XXXXXXXXXXXXXXX

In [31]:
dum_MSSubClass = pd.get_dummies(kag.MSSubClass, prefix = 'mssubclass_')

In [32]:
dum_MSZoning = pd.get_dummies(kag.MSZoning, prefix = 'mszoning_')

In [33]:
dum_Street = pd.get_dummies(kag.Street, prefix = 'street_')
dum_Alley = pd.get_dummies(kag.Alley, prefix = 'alley_')
dum_LotShape = pd.get_dummies(kag.LotShape, prefix = 'lotshape_')
dum_LandContour = pd.get_dummies(kag.LandContour, prefix = 'landcontour_')
dum_Utilities = pd.get_dummies(kag.Utilities, prefix = 'utilities_')
dum_LotConfig = pd.get_dummies(kag.LotConfig, prefix = 'lotconfig_')
dum_LandSlope = pd.get_dummies(kag.LandSlope, prefix = 'landslope_')
dum_Neighborhood = pd.get_dummies(kag.Neighborhood, prefix = 'neighborhood_')
dum_Condition1 = pd.get_dummies(kag.Condition1, prefix = 'condition1_')
dum_Condition2 = pd.get_dummies(kag.Condition2, prefix = 'condition2_')
dum_BldgType = pd.get_dummies(kag.BldgType, prefix = 'bldgtype_')
dum_HouseStyle = pd.get_dummies(kag.HouseStyle, prefix = 'housestyle_')
dum_RoofStyle = pd.get_dummies(kag.RoofStyle, prefix = 'roofstyle_')
dum_RoofMatl = pd.get_dummies(kag.RoofMatl, prefix = 'roofmatl_')

In [34]:
dum_Exterior1st = pd.get_dummies(kag.Exterior1st, prefix = 'exterior1st_')
dum_Exterior2nd = pd.get_dummies(kag.Exterior2nd, prefix = 'exterior2nd_')
dum_MasVnrType = pd.get_dummies(kag.MasVnrType, prefix = 'masvnrtype_')
dum_ExterQual = pd.get_dummies(kag.ExterQual, prefix = 'exterqual_')
dum_ExterCond = pd.get_dummies(kag.ExterCond, prefix = 'extercond_')
dum_Foundation = pd.get_dummies(kag.Foundation, prefix = 'foundation_')
dum_BsmtQual = pd.get_dummies(kag.BsmtQual, prefix = 'bsmtqual_')
dum_BsmtCond = pd.get_dummies(kag.BsmtCond, prefix = 'bsmtcond_')
dum_BsmtExposure = pd.get_dummies(kag.BsmtExposure, prefix = 'bsmtexp_')
dum_BsmtFinType1 = pd.get_dummies(kag.BsmtFinType1, prefix = 'bsmtfin1_')
dum_Heating = pd.get_dummies(kag.Heating, prefix = 'heating_')
dum_HeatingQC = pd.get_dummies(kag.HeatingQC, prefix = 'heatingqc_')
dum_Electrical = pd.get_dummies(kag.Electrical, prefix = 'electrical_')
dum_KitchenQual = pd.get_dummies(kag.KitchenQual, prefix = 'kitchenqual_')
dum_Functional = pd.get_dummies(kag.Functional, prefix = 'functional_')
dum_FireplaceQu = pd.get_dummies(kag.FireplaceQu, prefix = 'fireplacequ_')
dum_GarageType = pd.get_dummies(kag.GarageType, prefix = 'garagetype_')
dum_GarageFinish = pd.get_dummies(kag.GarageFinish, prefix = 'garagefinish_')
dum_GarageQual = pd.get_dummies(kag.GarageQual, prefix = 'garagequal_')
dum_GarageCond = pd.get_dummies(kag.GarageCond, prefix = 'garagecond_')

In [35]:
dum_PavedDrive = pd.get_dummies(kag.PavedDrive, prefix = 'paved_')
dum_PoolQC = pd.get_dummies(kag.PoolQC, prefix = 'poolqc_')
dum_Fence = pd.get_dummies(kag.Fence, prefix = 'fence_')
dum_MiscFeature = pd.get_dummies(kag.MiscFeature, prefix = 'miscfeat_')
dum_MoSold = pd.get_dummies(kag.MoSold, prefix = 'monthsold_')
dum_YrSold = pd.get_dummies(kag.YrSold, prefix = 'yrsold_')
dum_SaleType = pd.get_dummies(kag.SaleType, prefix = 'saletype_')
dum_SaleCondition = pd.get_dummies(kag.SaleCondition, prefix = 'salecondition_')

In [36]:
kag[['YrSold']].sort_values(by = 'YrSold', ascending = False).head()

Unnamed: 0_level_0,YrSold
Id,Unnamed: 1_level_1
731,2010
1106,2010
241,2010
1190,2010
245,2010


In [37]:
def centralheat(d):
    if d.CentralAir == 'Yes':
        return 1
    else:
        return 0

In [38]:
kag['CentralAirCond'] = kag.apply(centralheat, axis = 1)

In [39]:
continuous_columns.append('CentralAirCond')

In [40]:
continuous_columns

['LotFrontage',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'MasVnrArea',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'GrLivArea',
 'LowQualFinSF',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'GarageAge',
 'HomeAge',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'CentralAirCond']

In [41]:
dum_MSSubClass['mssubclass__150'] = 0

In [42]:
dum_Utilities = dum_Utilities.drop('utilities__NoSeWa', axis = 1)

# Below is how you would merge dataframes using .concat

In [43]:
dummies = pd.concat([dum_MSSubClass, dum_MSZoning, dum_Street, dum_Alley, dum_LotShape, dum_LandContour, dum_Utilities, dum_LotConfig, dum_LandSlope, dum_Neighborhood, dum_Condition1, dum_Condition2, dum_BldgType, dum_HouseStyle, dum_RoofStyle, dum_RoofMatl], axis = 1)

In [44]:
dummies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Columns: 112 entries, mssubclass__20 to roofmatl__WdShngl
dtypes: int64(1), uint8(111)
memory usage: 221.1 KB


In [45]:
dummies = pd.concat([dummies, dum_Exterior1st, dum_Exterior2nd, dum_MasVnrType, dum_ExterQual, dum_ExterCond, dum_Foundation, dum_BsmtQual, dum_BsmtCond, dum_BsmtExposure, dum_BsmtFinType1, dum_Heating, dum_HeatingQC, dum_Electrical, dum_KitchenQual, dum_Functional, dum_FireplaceQu, dum_GarageType, dum_GarageFinish, dum_GarageQual, dum_GarageCond], axis = 1)

In [46]:
dummies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Columns: 241 entries, mssubclass__20 to garagecond__TA
dtypes: int64(1), uint8(240)
memory usage: 405.0 KB


In [47]:
dummies = pd.concat([dummies, dum_PavedDrive, dum_PoolQC, dum_Fence, dum_MiscFeature, dum_MoSold, dum_YrSold, dum_SaleType, dum_SaleCondition], axis = 1)

In [48]:
dummies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Columns: 290 entries, mssubclass__20 to salecondition__Partial
dtypes: int64(1), uint8(289)
memory usage: 474.9 KB


In [49]:
kag.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 83 columns):
MSSubClass        1460 non-null int64
MSZoning          1460 non-null object
LotFrontage       1460 non-null float64
LotArea           1460 non-null int64
Street            1460 non-null object
Alley             1460 non-null object
LotShape          1460 non-null object
LandContour       1460 non-null object
Utilities         1460 non-null object
LotConfig         1460 non-null object
LandSlope         1460 non-null object
Neighborhood      1460 non-null object
Condition1        1460 non-null object
Condition2        1460 non-null object
BldgType          1460 non-null object
HouseStyle        1460 non-null object
OverallQual       1460 non-null int64
OverallCond       1460 non-null int64
YearBuilt         1460 non-null int64
YearRemodAdd      1460 non-null int64
RoofStyle         1460 non-null object
RoofMatl          1460 non-null object
Exterior1st       1460 non-null object
E

In [50]:
kag_continuous = kag[continuous_columns]

In [51]:
kag_continuous.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 28 columns):
LotFrontage       1460 non-null float64
LotArea           1460 non-null int64
OverallQual       1460 non-null int64
OverallCond       1460 non-null int64
MasVnrArea        1460 non-null float64
BsmtUnfSF         1460 non-null int64
TotalBsmtSF       1460 non-null int64
GrLivArea         1460 non-null int64
LowQualFinSF      1460 non-null int64
BsmtFullBath      1460 non-null int64
BsmtHalfBath      1460 non-null int64
FullBath          1460 non-null int64
HalfBath          1460 non-null int64
BedroomAbvGr      1460 non-null int64
KitchenAbvGr      1460 non-null int64
TotRmsAbvGrd      1460 non-null int64
GarageAge         1460 non-null float64
HomeAge           1460 non-null int64
GarageCars        1460 non-null int64
GarageArea        1460 non-null int64
WoodDeckSF        1460 non-null int64
OpenPorchSF       1460 non-null int64
EnclosedPorch     1460 non-null int64
3SsnPorch    

In [52]:
ames_full = pd.concat([dummies, kag_continuous, kag['SalePrice']], axis = 1)

In [53]:
ames_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Columns: 319 entries, mssubclass__20 to SalePrice
dtypes: float64(3), int64(27), uint8(289)
memory usage: 805.6 KB


# XXXXXXXXXXXXXX
# Full Train Dataset
# XXXXXXXXXXXXXX

In [54]:
#ames_full.to_csv('data/Deep_housingdata.csv')

# XXXXXXXXXXXXXXXXXXX
# Training RforestRegressor
# XXXXXXXXXXXXXXXXXXX

In [55]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

In [56]:
X = ames_full.drop('SalePrice', axis = 1)
Y = ames_full[['SalePrice']]

In [57]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Columns: 318 entries, mssubclass__20 to CentralAirCond
dtypes: float64(3), int64(26), uint8(289)
memory usage: 794.2 KB


In [58]:
Y.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 1 columns):
SalePrice    1460 non-null int64
dtypes: int64(1)
memory usage: 62.8 KB


In [59]:
X.values.reshape(-1,1)

array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]])

In [60]:
Y.values.reshape(-1,1)

array([[208500],
       [181500],
       [223500],
       ...,
       [266500],
       [142125],
       [147500]], dtype=int64)

In [61]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)

In [62]:
from sklearn.ensemble import RandomForestRegressor

In [63]:
rf1 = RandomForestRegressor(n_estimators = 100, max_depth = 5, random_state = 42)

In [64]:
rf1.fit(X_train, Y_train)

  """Entry point for launching an IPython kernel.


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

In [65]:
y_pred = rf1.predict(X_test)

In [66]:
mean_squared_error(Y_test,y_pred)

991425991.1545523

In [67]:
r2_score(Y_test,y_pred)

0.85792309115934

# XXXXXXXXXXXXXXX
# Formattting Test Data
# XXXXXXXXXXXXXXX

In [68]:
ames_test = pd.read_csv('data/test.csv', index_col = 0, header = 0)

In [69]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1455 non-null object
LotFrontage      1232 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            107 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1457 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1458 non-null object
Exterior2nd      1458 

In [70]:
ames_test.loc[ames_test.MSZoning.isnull()]

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1916,30,,109.0,21780,Grvl,,Reg,Lvl,,Inside,...,0,0,,,,0,3,2009,ConLD,Normal
2217,20,,80.0,14584,Pave,,Reg,Low,AllPub,Inside,...,0,0,,,,0,2,2008,WD,Abnorml
2251,70,,,56600,Pave,,IR1,Low,AllPub,Inside,...,0,0,,,,0,1,2008,WD,Normal
2905,20,,125.0,31250,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,5,2006,WD,Normal


In [71]:
ames_test['MSZoning'].value_counts()

RL         1114
RM          242
FV           74
C (all)      15
RH           10
Name: MSZoning, dtype: int64

In [72]:
ames_test['MSZoning'] = ames_test['MSZoning'].fillna('RL')

In [73]:
lot_ratio = pd.DataFrame(round(ames_test.LotArea / ames_test.LotFrontage, 2))

In [74]:
lot_ratio.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 1 columns):
0    1232 non-null float64
dtypes: float64(1)
memory usage: 22.8 KB


In [75]:
lot_ratio = lot_ratio.dropna(how = 'any')

In [76]:
lot_ratio.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1232 entries, 1461 to 2919
Data columns (total 1 columns):
0    1232 non-null float64
dtypes: float64(1)
memory usage: 19.2 KB


In [77]:
lot_ratio.describe()

Unnamed: 0,0
count,1232.0
mean,139.131112
std,55.914157
min,44.04
25%,114.305
50%,128.715
75%,150.9225
max,999.5


In [78]:
def lot_front(d):
    if d['LotFrontage'] >= 0:
        return d['LotFrontage']
    else:
        return round(d['LotArea'] / 129, 0)

In [79]:
ames_test['LotFrontage'] = ames_test.apply(lot_front, axis = 1)

In [80]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1459 non-null object
LotFrontage      1459 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            107 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1457 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1458 non-null object
Exterior2nd      1458 

In [81]:
ames_test.loc[2251]

MSSubClass            70
MSZoning              RL
LotFrontage          439
LotArea            56600
Street              Pave
Alley                NaN
LotShape             IR1
LandContour          Low
Utilities         AllPub
LotConfig         Inside
LandSlope            Gtl
Neighborhood      IDOTRR
Condition1          Norm
Condition2          Norm
BldgType            1Fam
HouseStyle        2.5Unf
OverallQual            5
OverallCond            1
YearBuilt           1900
YearRemodAdd        1950
RoofStyle            Hip
RoofMatl         CompShg
Exterior1st      Wd Sdng
Exterior2nd      Wd Sdng
MasVnrType          None
MasVnrArea             0
ExterQual             TA
ExterCond             TA
Foundation        BrkTil
BsmtQual              TA
                  ...   
HalfBath               0
BedroomAbvGr           4
KitchenAbvGr           1
KitchenQual           TA
TotRmsAbvGrd           7
Functional          Maj1
Fireplaces             0
FireplaceQu          NaN
GarageType        Detchd


In [82]:
ames_test['Alley'] = ames_test.Alley.fillna(0)

In [83]:
ames_test['Alley'].describe()

count     1459
unique       3
top          0
freq      1352
Name: Alley, dtype: int64

In [84]:
ames_test.Utilities.value_counts()

AllPub    1457
Name: Utilities, dtype: int64

In [85]:
ames_test[ames_test.Utilities.isnull()]

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1916,30,RL,109.0,21780,Grvl,0,Reg,Lvl,,Inside,...,0,0,,,,0,3,2009,ConLD,Normal
1946,20,RL,242.0,31220,Pave,0,IR1,Bnk,,FR2,...,0,0,,,Shed,750,5,2008,WD,Normal


In [86]:
ames_test['Utilities'] = ames_test.Utilities.fillna('AllPub')

In [87]:
ames_test.Utilities.value_counts()

AllPub    1459
Name: Utilities, dtype: int64

In [88]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1459 non-null object
LotFrontage      1459 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            1459 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1459 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1458 non-null object
Exterior2nd      1458

In [89]:
ames_test.Exterior1st.value_counts()

VinylSd    510
MetalSd    230
HdBoard    220
Wd Sdng    205
Plywood    113
CemntBd     65
BrkFace     37
WdShing     30
AsbShng     24
Stucco      18
BrkComm      4
CBlock       1
AsphShn      1
Name: Exterior1st, dtype: int64

In [90]:
ames_test['Exterior1st'] = ames_test.Exterior1st.fillna('VinylSd')

In [91]:
ames_test['Exterior2nd'] = ames_test.Exterior2nd.fillna('Other')

In [92]:
ames_test['MasVnrArea'][ames_test.MasVnrType.isnull()]

Id
1692      NaN
1707      NaN
1883      NaN
1993      NaN
2005      NaN
2042      NaN
2312      NaN
2326      NaN
2341      NaN
2350      NaN
2369      NaN
2593      NaN
2611    198.0
2658      NaN
2687      NaN
2863      NaN
Name: MasVnrArea, dtype: float64

In [93]:
ames_test.loc[2611, 'MasVnrArea']

198.0

In [94]:
ames_test.MasVnrType.value_counts(dropna = False)

None       878
BrkFace    434
Stone      121
NaN         16
BrkCmn      10
Name: MasVnrType, dtype: int64

In [95]:
kag.MasVnrType.value_counts()

None       864
BrkFace    445
Stone      128
BrkCmn      15
0            8
Name: MasVnrType, dtype: int64

In [96]:
ames_test['MasVnrArea'][ames_test['MasVnrType'] == 'None'].head()

Id
1461    0.0
1463    0.0
1465    0.0
1466    0.0
1467    0.0
Name: MasVnrArea, dtype: float64

In [97]:
ames_test.loc[2611, 'MasVnrType'] = 'BrkFace'

In [98]:
ames_test.loc[2611][['MasVnrType', 'MasVnrArea']]

MasVnrType    BrkFace
MasVnrArea        198
Name: 2611, dtype: object

In [99]:
ames_test['MasVnrType'] = ames_test.MasVnrType.fillna(0)
ames_test['MasVnrArea'] = ames_test.MasVnrArea.fillna(0)

In [100]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1459 non-null object
LotFrontage      1459 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            1459 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1459 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1459 non-null object
Exterior2nd      1459

In [101]:
ames_test[['BsmtQual', 'BsmtFinSF1']][ames_test.TotalBsmtSF.isnull()]

Unnamed: 0_level_0,BsmtQual,BsmtFinSF1
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
2121,,


In [102]:
ames_test[['BsmtQual', 'BsmtCond']][(ames_test.BsmtCond.isnull()) & (ames_test.BsmtQual.notnull())]

Unnamed: 0_level_0,BsmtQual,BsmtCond
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
2041,Gd,
2186,TA,
2525,TA,


In [103]:
ames_test[['BsmtQual', 'BsmtCond']][(ames_test.BsmtCond.notnull()) & (ames_test.BsmtQual.isnull())]

Unnamed: 0_level_0,BsmtQual,BsmtCond
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
2218,,Fa
2219,,TA


In [104]:
ames_test.loc[2041, 'BsmtCond'] = 'Gd'

In [105]:
ames_test.loc[2186, 'BsmtCond'] = 'TA'
ames_test.loc[2525, 'BsmtCond'] = 'TA'

In [106]:
ames_test.loc[2218, 'BsmtQual'] = 'Fa'
ames_test.loc[2219, 'BsmtQual'] = 'TA'

In [107]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1459 non-null object
LotFrontage      1459 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            1459 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1459 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1459 non-null object
Exterior2nd      1459

In [108]:
ames_test[['BsmtQual', 'BsmtCond', 'BsmtExposure']][(ames_test.BsmtExposure.isnull()) & (ames_test.BsmtQual.notnull())]

Unnamed: 0_level_0,BsmtQual,BsmtCond,BsmtExposure
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1488,Gd,TA,
2349,Gd,TA,


In [109]:
ames_test[['BsmtExposure']][ames_test['BsmtCond'] == 'TA'].describe(include = 'all')

Unnamed: 0,BsmtExposure
count,1295
unique,4
top,No
freq,867


In [110]:
ames_test.loc[1488, 'BsmtExposure'] = 'No'
ames_test.loc[2349, 'BsmtExposure'] = 'No'

In [111]:
ames_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF']] = ames_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF']].fillna(0)    

In [112]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1459 non-null object
LotFrontage      1459 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            1459 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1459 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1459 non-null object
Exterior2nd      1459

In [113]:
ames_test[['BsmtFullBath', 'BsmtHalfBath']] = ames_test[['BsmtFullBath', 'BsmtHalfBath']].fillna(0)

In [114]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1459 non-null object
LotFrontage      1459 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            1459 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1459 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1459 non-null object
Exterior2nd      1459

In [115]:
ames_test.KitchenQual.value_counts()

TA    757
Gd    565
Ex    105
Fa     31
Name: KitchenQual, dtype: int64

In [116]:
ames_test['KitchenQual'] = ames_test.KitchenQual.fillna('TA')

In [117]:
ames_test.KitchenQual.value_counts()

TA    758
Gd    565
Ex    105
Fa     31
Name: KitchenQual, dtype: int64

In [118]:
ames_test.Functional.value_counts()

Typ     1357
Min2      36
Min1      34
Mod       20
Maj1       5
Maj2       4
Sev        1
Name: Functional, dtype: int64

In [119]:
ames_test.Functional = ames_test.Functional.fillna('Typ')

In [120]:
ames_test.Functional.value_counts()

Typ     1359
Min2      36
Min1      34
Mod       20
Maj1       5
Maj2       4
Sev        1
Name: Functional, dtype: int64

In [121]:
ames_test.FireplaceQu = ames_test.FireplaceQu.fillna(0)

In [122]:
ames_test[['GarageType', 'GarageYrBlt', 'GarageFinish', 'SaleType', 'GarageCond', 'GarageQual']][(ames_test.GarageYrBlt.isnull()) & (ames_test.GarageType.notnull())]

Unnamed: 0_level_0,GarageType,GarageYrBlt,GarageFinish,SaleType,GarageCond,GarageQual
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2127,Detchd,,,WD,,
2577,Detchd,,,WD,,


In [123]:
ames_test.GarageYrBlt.describe()

count    1381.000000
mean     1977.721217
std        26.431175
min      1895.000000
25%      1959.000000
50%      1979.000000
75%      2002.000000
max      2207.000000
Name: GarageYrBlt, dtype: float64

In [124]:
ames_test.loc[2593, 'GarageYrBlt'] = 2007

In [125]:
ames_test.loc[2127, 'GarageYrBlt'] = 1980
ames_test.loc[2127, 'GarageFinish'] = 'Fin'
ames_test.loc[2127, 'GarageCond'] = 'TA'
ames_test.loc[2127, 'GarageQual'] = 'TA'

In [126]:
ames_test.loc[2577, 'GarageYrBlt'] = 1980
ames_test.loc[2577, 'GarageFinish'] = 'Fin'
ames_test.loc[2577, 'GarageCond'] = 'TA'
ames_test.loc[2577, 'GarageQual'] = 'TA'

In [127]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1459 non-null object
LotFrontage      1459 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            1459 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1459 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1459 non-null object
Exterior2nd      1459

In [128]:
ames_test['YearBuilt'][ames_test.SaleType.isnull()]

Id
2490    1958
Name: YearBuilt, dtype: int64

In [129]:
ames_test['SaleType'].value_counts()

WD       1258
New       117
COD        44
ConLD      17
CWD         8
ConLI       4
Oth         4
Con         3
ConLw       3
Name: SaleType, dtype: int64

In [130]:
ames_test.loc[2490, 'SaleType'] = 'WD'

In [131]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1459 non-null object
LotFrontage      1459 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            1459 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1459 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1459 non-null object
Exterior2nd      1459

In [132]:
ames_test = ames_test.fillna(0)

In [133]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1459 non-null object
LotFrontage      1459 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            1459 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1459 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1459 non-null object
Exterior2nd      1459

In [134]:
ames_test["HomeAge"] = ames_test.apply(house_age, axis = 1)

In [135]:
ames_test['GarageAge'] = ames_test.apply(Garage_Age, axis = 1)

In [136]:
ames_test['CentralAirCond'] = ames_test.apply(centralheat, axis = 1)

In [137]:
ames_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 82 columns):
MSSubClass        1459 non-null int64
MSZoning          1459 non-null object
LotFrontage       1459 non-null float64
LotArea           1459 non-null int64
Street            1459 non-null object
Alley             1459 non-null object
LotShape          1459 non-null object
LandContour       1459 non-null object
Utilities         1459 non-null object
LotConfig         1459 non-null object
LandSlope         1459 non-null object
Neighborhood      1459 non-null object
Condition1        1459 non-null object
Condition2        1459 non-null object
BldgType          1459 non-null object
HouseStyle        1459 non-null object
OverallQual       1459 non-null int64
OverallCond       1459 non-null int64
YearBuilt         1459 non-null int64
YearRemodAdd      1459 non-null int64
RoofStyle         1459 non-null object
RoofMatl          1459 non-null object
Exterior1st       1459 non-null objec

In [138]:
ames_continuous = ames_test[continuous_columns]

In [139]:
ames_continuous.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 28 columns):
LotFrontage       1459 non-null float64
LotArea           1459 non-null int64
OverallQual       1459 non-null int64
OverallCond       1459 non-null int64
MasVnrArea        1459 non-null float64
BsmtUnfSF         1459 non-null float64
TotalBsmtSF       1459 non-null float64
GrLivArea         1459 non-null int64
LowQualFinSF      1459 non-null int64
BsmtFullBath      1459 non-null float64
BsmtHalfBath      1459 non-null float64
FullBath          1459 non-null int64
HalfBath          1459 non-null int64
BedroomAbvGr      1459 non-null int64
KitchenAbvGr      1459 non-null int64
TotRmsAbvGrd      1459 non-null int64
GarageAge         1459 non-null float64
HomeAge           1459 non-null int64
GarageCars        1459 non-null float64
GarageArea        1459 non-null float64
WoodDeckSF        1459 non-null int64
OpenPorchSF       1459 non-null int64
EnclosedPorch     1459 non-null int6

# XXXXXXXXXXXXXXXXXX
# Creating Test Dummies
# XXXXXXXXXXXXXXXXXX

In [140]:
Adum_MSSubClass = pd.get_dummies(ames_test.MSSubClass, prefix = 'mssubclass_')
Adum_MSZoning = pd.get_dummies(ames_test.MSZoning, prefix = 'mszoning_')
Adum_Street = pd.get_dummies(ames_test.Street, prefix = 'street_')
Adum_Alley = pd.get_dummies(ames_test.Alley, prefix = 'alley_')
Adum_LotShape = pd.get_dummies(ames_test.LotShape, prefix = 'lotshape_')
Adum_LandContour = pd.get_dummies(ames_test.LandContour, prefix = 'landcontour_')
Adum_Utilities = pd.get_dummies(ames_test.Utilities, prefix = 'utilities_')
Adum_LotConfig = pd.get_dummies(ames_test.LotConfig, prefix = 'lotconfig_')
Adum_LandSlope = pd.get_dummies(ames_test.LandSlope, prefix = 'landslope_')
Adum_Neighborhood = pd.get_dummies(ames_test.Neighborhood, prefix = 'neighborhood_')
Adum_Condition1 = pd.get_dummies(ames_test.Condition1, prefix = 'condition1_')
Adum_Condition2 = pd.get_dummies(ames_test.Condition2, prefix = 'condition2_')
Adum_BldgType = pd.get_dummies(ames_test.BldgType, prefix = 'bldgtype_')
Adum_HouseStyle = pd.get_dummies(ames_test.HouseStyle, prefix = 'housestyle_')
Adum_RoofStyle = pd.get_dummies(ames_test.RoofStyle, prefix = 'roofstyle_')
Adum_RoofMatl = pd.get_dummies(ames_test.RoofMatl, prefix = 'roofmatl_')
Adum_Exterior1st = pd.get_dummies(ames_test.Exterior1st, prefix = 'exterior1st_')
Adum_Exterior2nd = pd.get_dummies(ames_test.Exterior2nd, prefix = 'exterior2nd_')
Adum_MasVnrType = pd.get_dummies(ames_test.MasVnrType, prefix = 'masvnrtype_')
Adum_ExterQual = pd.get_dummies(ames_test.ExterQual, prefix = 'exterqual_')
Adum_ExterCond = pd.get_dummies(ames_test.ExterCond, prefix = 'extercond_')
Adum_Foundation = pd.get_dummies(ames_test.Foundation, prefix = 'foundation_')
Adum_BsmtQual = pd.get_dummies(ames_test.BsmtQual, prefix = 'bsmtqual_')
Adum_BsmtCond = pd.get_dummies(ames_test.BsmtCond, prefix = 'bsmtcond_')
Adum_BsmtExposure = pd.get_dummies(ames_test.BsmtExposure, prefix = 'bsmtexp_')
Adum_BsmtFinType1 = pd.get_dummies(ames_test.BsmtFinType1, prefix = 'bsmtfin1_')
Adum_Heating = pd.get_dummies(ames_test.Heating, prefix = 'heating_')
Adum_HeatingQC = pd.get_dummies(ames_test.HeatingQC, prefix = 'heatingqc_')
Adum_Electrical = pd.get_dummies(ames_test.Electrical, prefix = 'electrical_')
Adum_KitchenQual = pd.get_dummies(ames_test.KitchenQual, prefix = 'kitchenqual_')
Adum_Functional = pd.get_dummies(ames_test.Functional, prefix = 'functional_')
Adum_FireplaceQu = pd.get_dummies(ames_test.FireplaceQu, prefix = 'fireplacequ_')
Adum_GarageType = pd.get_dummies(ames_test.GarageType, prefix = 'garagetype_')
Adum_GarageFinish = pd.get_dummies(ames_test.GarageFinish, prefix = 'garagefinish_')
Adum_GarageQual = pd.get_dummies(ames_test.GarageQual, prefix = 'garagequal_')
Adum_GarageCond = pd.get_dummies(ames_test.GarageCond, prefix = 'garagecond_')
Adum_PavedDrive = pd.get_dummies(ames_test.PavedDrive, prefix = 'paved_')
Adum_PoolQC = pd.get_dummies(ames_test.PoolQC, prefix = 'poolqc_')
Adum_Fence = pd.get_dummies(ames_test.Fence, prefix = 'fence_')
Adum_MiscFeature = pd.get_dummies(ames_test.MiscFeature, prefix = 'miscfeat_')
Adum_MoSold = pd.get_dummies(ames_test.MoSold, prefix = 'monthsold_')
Adum_YrSold = pd.get_dummies(ames_test.YrSold, prefix = 'yrsold_')
Adum_SaleType = pd.get_dummies(ames_test.SaleType, prefix = 'saletype_')
Adum_SaleCondition = pd.get_dummies(ames_test.SaleCondition, prefix = 'salecondition_')

In [141]:
Adummies = pd.concat([Adum_MSSubClass, Adum_MSZoning, Adum_Street, Adum_Alley, Adum_LotShape, Adum_LandContour, Adum_Utilities, Adum_LotConfig, Adum_LandSlope, Adum_Neighborhood, Adum_Condition1, Adum_Condition2, Adum_BldgType, Adum_HouseStyle, Adum_RoofStyle, Adum_RoofMatl], axis = 1)

In [142]:
Adummies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Columns: 104 entries, mssubclass__20 to roofmatl__WdShngl
dtypes: uint8(104)
memory usage: 199.6 KB


In [143]:
Adummies = pd.concat([Adummies, Adum_Exterior1st, Adum_Exterior2nd, Adum_MasVnrType, Adum_ExterQual, Adum_ExterCond, Adum_Foundation, Adum_BsmtQual, Adum_BsmtCond, Adum_BsmtExposure, Adum_BsmtFinType1, Adum_Heating, Adum_HeatingQC, Adum_Electrical, Adum_KitchenQual, Adum_Functional, Adum_FireplaceQu, Adum_GarageType, Adum_GarageFinish, Adum_GarageQual, Adum_GarageCond], axis = 1)

In [144]:
Adummies = pd.concat([Adummies, Adum_PavedDrive, Adum_PoolQC, Adum_Fence, Adum_MiscFeature, Adum_MoSold, Adum_YrSold, Adum_SaleType, Adum_SaleCondition], axis = 1)

In [145]:
Adummies['condition2__RRAe'] = 0
Adummies['condition2__RRAn'] = 0
Adummies['condition2__RRNn'] = 0
Adummies['housestyle__2.5Fin'] = 0
Adummies['roofmatl__ClyTile'] = 0
Adummies['roofmatl__Membran'] = 0
Adummies['roofmatl__Metal'] = 0
Adummies['roofmatl__Roll'] = 0
Adummies['exterior1st__ImStucc'] = 0
Adummies['exterior1st__Stone'] = 0
Adummies['heating__Floor'] = 0
Adummies['heating__OthW'] = 0
Adummies['electrical__Mix'] = 0
Adummies['garagequal__Ex'] = 0
Adummies['poolqc__Fa'] = 0
Adummies['miscfeat__TenC'] = 0

In [147]:
Adum_PoolQC.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 3 columns):
poolqc__0     1459 non-null uint8
poolqc__Ex    1459 non-null uint8
poolqc__Gd    1459 non-null uint8
dtypes: uint8(3)
memory usage: 55.7 KB


# Note on using *pd.get_dummies*
While **pd.get_dummies** is quick and convenient, it does provide some challenges when your test set has different values for categorical variables than the train set. I have erased some work where I was forced to check every dummy variable I created in order to make sure that the train and test sets had the same columns.

In [408]:
Adum_SaleCondition.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 6 columns):
salecondition__Abnorml    1459 non-null uint8
salecondition__AdjLand    1459 non-null uint8
salecondition__Alloca     1459 non-null uint8
salecondition__Family     1459 non-null uint8
salecondition__Normal     1459 non-null uint8
salecondition__Partial    1459 non-null uint8
dtypes: uint8(6)
memory usage: 59.9 KB


In [409]:
dum_SaleCondition.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 6 columns):
salecondition__Abnorml    1460 non-null uint8
salecondition__AdjLand    1460 non-null uint8
salecondition__Alloca     1460 non-null uint8
salecondition__Family     1460 non-null uint8
salecondition__Normal     1460 non-null uint8
salecondition__Partial    1460 non-null uint8
dtypes: uint8(6)
memory usage: 60.0 KB


In [410]:
dummies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Columns: 290 entries, mssubclass__20 to salecondition__Partial
dtypes: int64(1), uint8(289)
memory usage: 474.9 KB


In [411]:
test_dummies = Adummies[dummies.columns]

In [412]:
test_dummies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Columns: 290 entries, mssubclass__20 to salecondition__Partial
dtypes: int64(16), uint8(274)
memory usage: 624.2 KB


In [413]:
test_full = pd.concat([test_dummies, ames_continuous], axis = 1)

In [414]:
test_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Columns: 318 entries, mssubclass__20 to CentralAirCond
dtypes: float64(9), int64(35), uint8(274)
memory usage: 943.3 KB


In [415]:
#test_full.to_csv('data/test_full.csv')

In [418]:
test_full.values.reshape(-1,1)

array([[1.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]])

In [419]:
test_predict = rf1.predict(test_full)

In [421]:
df_test = pd.DataFrame(test_full.index)

In [423]:
df_test.head()

Unnamed: 0,Id
0,1461
1,1462
2,1463
3,1464
4,1465


In [424]:
df_test['SalePrice'] = test_predict

In [425]:
df_test.head()

Unnamed: 0,Id,SalePrice
0,1461,124490.822135
1,1462,146073.478766
2,1463,173196.885511
3,1464,184925.274651
4,1465,225864.86369


In [426]:
#df_test.to_csv('data/RF_1.csv')

# GridSearch

In [427]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [428]:
pipeline = Pipeline([ ('clf',RandomForestRegressor())])

In [429]:
parameters = {
    'clf__n_estimators':(1000,2000,3000),
    'clf__max_depth':(100,200,300),
    'clf__min_samples_split':(2,3),
    'clf__min_samples_leaf':(1,2) }

In [432]:
grid_search = GridSearchCV(pipeline,parameters,n_jobs=1, cv=5, verbose=1, scoring='neg_mean_squared_log_error')

In [433]:
#grid_search.fit(X_train,Y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estima

  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estima

  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estima

  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estima

  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estima

  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estimator.fit(Xt, y, **fit_params)
  self._final_estima

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('clf', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__n_estimators': (1000, 2000, 3000), 'clf__max_depth': (100, 200, 300), 'clf__min_samples_split': (2, 3), 'clf__min_samples_leaf': (1, 2)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_log_error', verbose=1)

In [434]:
grid_search.best_score_

-0.024586361448362515

In [435]:
grid_search.best_estimator_.get_params()

{'clf': RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=300,
            max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=False),
 'clf__bootstrap': True,
 'clf__criterion': 'mse',
 'clf__max_depth': 300,
 'clf__max_features': 'auto',
 'clf__max_leaf_nodes': None,
 'clf__min_impurity_decrease': 0.0,
 'clf__min_impurity_split': None,
 'clf__min_samples_leaf': 2,
 'clf__min_samples_split': 2,
 'clf__min_weight_fraction_leaf': 0.0,
 'clf__n_estimators': 1000,
 'clf__n_jobs': 1,
 'clf__oob_score': False,
 'clf__random_state': None,
 'clf__verbose': 0,
 'clf__warm_start': False,
 'memory': None,
 'steps': [('clf',
   RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=300,
              max_features='auto', max_lea

In [436]:
predictions = grid_search.predict(X_test)

In [437]:
r2_score(Y_test, predictions)

0.8884807593777844

In [438]:
test_predictions = grid_search.predict(test_full)

In [439]:
df_grid = pd.DataFrame(test_full.index)

In [440]:
df_grid['SalePrice'] = test_predictions

In [441]:
df_grid.head()

Unnamed: 0,Id,SalePrice
0,1461,126995.465777
1,1462,152234.377433
2,1463,179688.663669
3,1464,187839.21726
4,1465,209962.446761


In [442]:
#df_grid.to_csv('data/Grid_pred1.csv')

## Results
Grid Search performed better than the RF alone, with RMSE of 0.157 compared to 0.181!