# Model Prep and Submissions

In this notebook I apply my models and features to the test dataset. Each section has a corresponding entry in an EDA notebook. For each model, I edit the test dataset so that it matches my features and cleaned data, then create predictions to submit to Kaggle.

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Base Model
The base model predicts the mean sale price of the training data for every house in the test set. We were walked through this whole submission in Breakfast Hour, but I am copying it here to show how it was determined.

In [2]:
df_train = pd.read_csv('../datasets/provided_data/train.csv')
df_test = pd.read_csv('../datasets/provided_data/test.csv')

In [3]:
df_test['SalePrice'] = df_train['SalePrice'].mean()

In [4]:
submission = df_test[['Id','SalePrice']].copy()  # copy so the in-place set_index below does not act on a slice

In [5]:
submission.set_index('Id', inplace=True)

In [6]:
submission.to_csv('../datasets/submissions/baseline_mean_submission.csv')

The score for the baseline model was 84,362.32370. I will be comparing everything to this model.
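For a local sanity check, the same mean-only baseline can be scored against known prices. This is a minimal sketch that assumes the Kaggle score is a root mean squared error (the notebook does not state the metric). It uses the first five `SalePrice` values from the training data shown later in this notebook in place of the full `df_train['SalePrice']` column.

```python
import numpy as np
import pandas as pd

# Stand-in for df_train['SalePrice']: the first five training prices only.
prices = pd.Series([130500, 220000, 109000, 174000, 138500], dtype=float)

# The baseline predicts the same value, the training mean, for every house.
baseline_pred = np.full(len(prices), prices.mean())

# Assumed metric: root mean squared error between prediction and actual price.
rmse = np.sqrt(np.mean((prices - baseline_pred) ** 2))

print(f"mean price: {prices.mean():,.2f}")
print(f"baseline RMSE on this sample: {rmse:,.2f}")
```

Under that assumption, the baseline's error is simply the population standard deviation of the prices, which is why any model with useful features should beat it.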

### Model 1: Numeric Columns with No Nulls

This model is as basic as it gets. It uses all numeric columns with no nulls, regardless of whether they are really categorical, to see what happens. I don't expect much from this model, but it will be interesting.
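The actual column selection happens in the EDA notebook. As a rough sketch of the idea, numeric columns without nulls could be picked out as follows; the small frame is a hypothetical miniature of the raw training data, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw training data; column names follow the dataset.
raw = pd.DataFrame({
    "Id": [109, 544, 153],
    "Lot Area": [13517, 11492, 7922],
    "Lot Frontage": [np.nan, 43.0, 68.0],   # numeric, but has a null
    "MS Zoning": ["RL", "RL", "RL"],        # not numeric
    "SalePrice": [130500, 220000, 109000],
})

# Keep numeric columns only, then drop any column that contains a null.
numeric_no_nulls = raw.select_dtypes(include="number").dropna(axis="columns")

# The target is not a feature, so it is removed from the feature list.
features = numeric_no_nulls.columns.drop("SalePrice")
print(list(features))
```

Note that identifier columns such as `Id` survive this filter because they are numeric, which matches the feature table shown below.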

In [7]:
training_df_v1 = pd.read_csv('../datasets/cleaned_data/clean_train_v1.csv', na_filter=False)
test_df_v1 = pd.read_csv('../datasets/cleaned_data/clean_test_v1.csv', na_filter=False)
feat_v1 = pd.read_csv('../datasets/cleaned_data/feat_v1.csv', na_filter=False)
feat_v1.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,1st Flr SF,2nd Flr SF,...,Fireplaces,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold
0,109,533352170,60,13517,6,8,1976,2005,725,754,...,0,0,44,0,0,0,0,0,3,2010
1,544,531379050,60,11492,7,5,1996,1997,913,1209,...,1,0,74,0,0,0,0,0,4,2009
2,153,535304180,20,7922,5,7,1953,2007,1057,0,...,0,0,52,0,0,0,0,0,1,2010
3,318,916386060,60,9802,5,5,2006,2007,744,700,...,0,100,0,0,0,0,0,0,4,2010
4,255,906425045,50,14235,6,8,1900,1993,831,614,...,0,0,59,0,0,0,0,0,3,2010


In [8]:
features_v1 = feat_v1.columns
X_v1 = training_df_v1[features_v1]
y_v1 = training_df_v1['SalePrice']
X_test_v1 = test_df_v1[features_v1]

In [9]:
lr_v1 = LinearRegression()
lr_v1.fit(X_v1, y_v1)

LinearRegression()

In [10]:
pred_v1 = lr_v1.predict(X_test_v1)

In [11]:
test_df_v1['SalePrice'] = pred_v1

In [12]:
test_df_v1.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,131885.864704
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,166444.074368
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,214310.547963
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,113989.587277
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,191029.879516


In [13]:
# Select the two submission columns and index by Id in a single chained call
submission_v1 = test_df_v1[['Id','SalePrice']].set_index('Id')
submission_v1.head()

Id,SalePrice
2658,131885.864704
2718,166444.074368
2414,214310.547963
1989,113989.587277
625,191029.879516


In [14]:
submission_v1.to_csv('../datasets/submissions/submission_v1.csv')

The score on this submission was 35,407.82153, which is my mark to beat now. 
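The fit, predict, and save steps used for this submission repeat for every model in this notebook. One way to avoid the repetition is a small helper; `make_submission` is a hypothetical name introduced here, and the tiny frames are made-up values for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def make_submission(train_df, test_df, features, target="SalePrice", id_col="Id"):
    """Fit a linear regression on train_df and return an Id-indexed prediction frame."""
    model = LinearRegression()
    model.fit(train_df[features], train_df[target])
    preds = model.predict(test_df[features])
    # Build a new frame rather than writing predictions back into test_df.
    return pd.DataFrame({id_col: test_df[id_col].to_numpy(), target: preds}).set_index(id_col)


# Tiny illustrative frames (made-up values) where price is exactly 100 * area.
train = pd.DataFrame({"Id": [1, 2, 3], "area": [10.0, 20.0, 30.0],
                      "SalePrice": [1000.0, 2000.0, 3000.0]})
test = pd.DataFrame({"Id": [4, 5], "area": [15.0, 25.0]})

submission = make_submission(train, test, features=["area"])
print(submission)
# submission.to_csv("submission.csv") would then write the file for upload
```

Returning a fresh frame also means the test dataframe never picks up a `SalePrice` column that has to be dropped before the next model.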

### Model 2

This model will have a couple of submissions. I am going to try different types of feature engineering in the 02.1_Numeric_EDA_and_Feat_Engineering notebook and submit predictions along the way to see what works and what doesn't. I will note the results after each submission.

In [15]:
training_df_v2 = pd.read_csv('../datasets/ready_for_model/clean_train_v2.csv', na_filter=False)
test_df_v2 = pd.read_csv('../datasets/cleaned_data/clean_test_v1.csv', na_filter=False)
feat_v2 = pd.read_csv('../datasets/ready_for_model/feat_v2.csv', na_filter=False)

In [16]:
features_v2 = feat_v2.drop(columns='Id').columns
X_v2 = training_df_v2[features_v2]
y_v2 = training_df_v2['SalePrice']
X_test_v2 = test_df_v2[features_v2]

In [17]:
features_v2

Index(['Lot Area', 'Overall Qual', 'Overall Cond', '1st Flr SF', '2nd Flr SF',
       'Low Qual Fin SF', 'Gr Liv Area', 'Full Bath', 'Half Bath',
       'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces',
       'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch',
       'Screen Porch', 'Pool Area', 'Misc Val', 'Year Built Year Remod/Add'],
      dtype='object')

In [18]:
lr_v2 = LinearRegression()
lr_v2.fit(X_v2, y_v2)

LinearRegression()

In [19]:
pred_v2 = lr_v2.predict(X_test_v2)
test_df_v2['SalePrice'] = pred_v2
test_df_v2.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,136302.818541
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,171400.081354
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,210314.027797
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,118377.088923
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,189480.014046


In [20]:
submission_v2 = test_df_v2[['Id','SalePrice']].set_index('Id')
submission_v2.head()

Id,SalePrice
2658,136302.818541
2718,171400.081354
2414,210314.027797
1989,118377.088923
625,189480.014046


In [21]:
submission_v2.to_csv('../datasets/submissions/submission_v2.csv')

The score for this submission was 36,535.52306, which is not an improvement over my first model, but I noticed there is a major outlier to be addressed.

#### Version removing major outlier
This version removes the major outlier in the data to see what the score looks like now. The training dataframe is updated, but everything else stays the same. 
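The outlier itself is identified and removed in the EDA notebook, so the rule below is purely hypothetical. As a sketch of the general pattern, a single extreme training row can be filtered out with a boolean mask while the test data is left alone; the rows and the cutoff are made up.

```python
import pandas as pd

# Made-up rows; the last one mimics a very large house sold at a low price.
train = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Gr Liv Area": [1479, 2122, 1057, 5600],
    "SalePrice": [130500, 220000, 109000, 160000],
})

# Hypothetical rule: treat anything above this living area as the outlier.
AREA_CUTOFF = 4000

trimmed = train[train["Gr Liv Area"] <= AREA_CUTOFF].reset_index(drop=True)
print(f"rows before: {len(train)}, rows after: {len(trimmed)}")
```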

In [22]:
training_df_v2_1 = pd.read_csv('../datasets/ready_for_model/clean_train_v2_1.csv', na_filter=False)
X_v2_1 = training_df_v2_1[features_v2]
y_v2_1 = training_df_v2_1['SalePrice']
X_test_v2_1 = test_df_v2[features_v2]

In [23]:
lr_v2_1 = LinearRegression()
lr_v2_1.fit(X_v2_1, y_v2_1)

LinearRegression()

In [24]:
pred_v2_1 = lr_v2_1.predict(X_test_v2_1)
test_df_v2['SalePrice'] = pred_v2_1
test_df_v2.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,136004.950751
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,171181.802376
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,207688.131604
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,118129.587575
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,187686.908513


In [25]:
submission_v3 = test_df_v2[['Id','SalePrice']].set_index('Id')
submission_v3.head()

Id,SalePrice
2658,136004.950751
2718,171181.802376
2414,207688.131604
1989,118129.587575
625,187686.908513


In [26]:
submission_v3.to_csv('../datasets/submissions/submission_v3.csv')

The score for this submission was 34,867.79248, which is an improvement over the previous entries.

### Model 3
The submissions from here on add the numeric columns that originally had nulls in them. The features are still all numeric.
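How the nulls were handled is covered in the EDA notebook. The `0.0` values that appear in columns such as `Mas Vnr Area` are consistent with a zero fill, so the sketch below assumes that strategy; the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with nulls in numeric columns named after the dataset.
df = pd.DataFrame({
    "Mas Vnr Area": [289.0, np.nan, 0.0],
    "Garage Area": [475.0, 559.0, np.nan],
})

# Assumed strategy: a missing value means the feature is absent, so fill with 0.
null_cols = df.columns[df.isna().any()]
df[null_cols] = df[null_cols].fillna(0)

print(df)
```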

In [27]:
training_df_v3 = pd.read_csv('../datasets/ready_for_model/clean_train_v3.csv', na_filter=False)
test_df_v3 = test_df_v2.drop(columns='SalePrice')
feat_v3 = pd.read_csv('../datasets/ready_for_model/feat_v3.csv', na_filter=False)

In [28]:
feat_v3.head()

Unnamed: 0,Overall Qual,Overall Cond,1st Flr SF,2nd Flr SF,Gr Liv Area,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Misc Val,Year Built Year Remod/Add,Mas Vnr Area,Total Bsmt SF,Bsmt Full Bath,Bsmt Half Bath,Garage Yr Blt,Garage Cars,Garage Area
0,6,8,725,754,1479,2,1,3,1,6,0,0,3961880,289.0,725.0,0.0,0.0,1976.0,2.0,475.0
1,7,5,913,1209,2122,2,1,4,1,8,1,0,3986012,132.0,913.0,1.0,0.0,1997.0,2.0,559.0
2,5,7,1057,0,1057,1,0,3,1,5,0,0,3919671,0.0,1057.0,1.0,0.0,1953.0,1.0,246.0
3,5,5,744,700,1444,2,1,3,1,7,0,0,4026042,0.0,384.0,0.0,0.0,2007.0,2.0,400.0
4,6,8,831,614,1445,2,0,3,1,6,0,0,3786700,0.0,676.0,0.0,0.0,1957.0,2.0,484.0


In [29]:
features_v3 = feat_v3.columns
X_v3 = training_df_v3[features_v3]
y_v3 = training_df_v3['SalePrice']
X_test_v3 = test_df_v3[features_v3]

In [30]:
lr_v3 = LinearRegression()
lr_v3.fit(X_v3, y_v3)

LinearRegression()

In [31]:
pred_v3 = lr_v3.predict(X_test_v3)
test_df_v3['SalePrice'] = pred_v3
test_df_v3.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,141380.602525
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,168768.821533
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,192230.742933
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,119553.579559
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,195729.584181


In [32]:
submission_v4 = test_df_v3[['Id','SalePrice']].set_index('Id')
submission_v4.head()

Id,SalePrice
2658,141380.602525
2718,168768.821533
2414,192230.742933
1989,119553.579559
625,195729.584181


In [33]:
submission_v4.to_csv('../datasets/submissions/submission_v4.csv')

The score for this run was 31,435.46194, a moderate improvement that makes it my best so far.

In [34]:
training_df_v3_1 = pd.read_csv('../datasets/ready_for_model/clean_train_v3_1.csv', na_filter=False)
feat_v3_1 = pd.read_csv('../datasets/ready_for_model/feat_v3_1.csv', na_filter=False)

In [35]:
feat_v3_1.head()

Unnamed: 0,Overall Qual,Overall Cond,1st Flr SF,2nd Flr SF,Gr Liv Area,Half Bath,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Misc Val,Year Built Year Remod/Add,Mas Vnr Area,Total Bsmt SF,Bsmt Full Bath,Garage Area,bed_full_bath
0,6,8,725,754,1479,1,1,6,0,0,3961880,289.0,725.0,0.0,475.0,6.0
1,7,5,913,1209,2122,1,1,8,1,0,3986012,132.0,913.0,1.0,559.0,8.0
2,5,7,1057,0,1057,0,1,5,0,0,3919671,0.0,1057.0,1.0,246.0,3.0
3,5,5,744,700,1444,1,1,7,0,0,4026042,0.0,384.0,0.0,400.0,6.0
4,6,8,831,614,1445,0,1,6,0,0,3786700,0.0,676.0,0.0,484.0,6.0
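
The engineered columns in this table appear to be simple products of two existing columns. In the first row, 1976 × 2005 = 3961880 matches `Year Built Year Remod/Add`, and `bed_full_bath` equals `Bedroom AbvGr` times `Full Bath` from the earlier feature table. The sketch below reproduces those values under that assumption; the actual construction lives in the EDA notebook and may differ.

```python
import pandas as pd

# First three rows of the source columns, taken from the tables above.
df = pd.DataFrame({
    "Year Built": [1976, 1996, 1953],
    "Year Remod/Add": [2005, 1997, 2007],
    "Bedroom AbvGr": [3, 4, 3],
    "Full Bath": [2, 2, 1],
})

# Assumed construction: each interaction term is an element-wise product.
df["Year Built Year Remod/Add"] = df["Year Built"] * df["Year Remod/Add"]
df["bed_full_bath"] = df["Bedroom AbvGr"] * df["Full Bath"]

print(df[["Year Built Year Remod/Add", "bed_full_bath"]])
```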


In [36]:
test_df_v3_1 = test_df_v3.drop(columns='SalePrice')
test_df_v3_1.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_Oth,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,0
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0


In [37]:
features_v3_1 = feat_v3_1.columns
X_v3_1 = training_df_v3_1[features_v3_1]
y_v3_1 = training_df_v3_1['SalePrice']
X_test_v3_1 = test_df_v3_1[features_v3_1]

In [38]:
lr_v3_1 = LinearRegression()
lr_v3_1.fit(X_v3_1, y_v3_1)

LinearRegression()

In [39]:
pred_v3_1 = lr_v3_1.predict(X_test_v3_1)
test_df_v3_1['SalePrice'] = pred_v3_1
test_df_v3_1.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,144725.531797
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,183602.704594
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,193735.32528
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,115755.226857
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,200264.710382


In [40]:
submission_v5 = test_df_v3_1[['Id','SalePrice']].set_index('Id')
submission_v5.head()

Id,SalePrice
2658,144725.531797
2718,183602.704594
2414,193735.32528
1989,115755.226857
625,200264.710382


In [41]:
submission_v5.to_csv('../datasets/submissions/submission_v5.csv')

The score for this entry was 32,155.45560, slightly worse than the previous submission. This was expected, since this feature set's test scores were slightly more biased and had a little more variance.
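Before spending a submission, feature sets can be compared locally with cross-validation, which `cross_val_score` (imported at the top of this notebook) supports. The sketch below is illustrative only: the data is synthetic, `cv_rmse` is a hypothetical helper, and RMSE is assumed as the metric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only: the target depends on "a" and "b"; "noise" is unrelated.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "noise"])
y = 3 * X["a"] + 2 * X["b"] + rng.normal(scale=0.5, size=200)


def cv_rmse(features):
    """Mean 5-fold cross-validated RMSE for a linear model on the given columns."""
    scores = cross_val_score(LinearRegression(), X[features], y, cv=5,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()


rmse_full = cv_rmse(["a", "b", "noise"])
rmse_reduced = cv_rmse(["a"])
print(f"all columns: {rmse_full:.3f}, column 'a' only: {rmse_reduced:.3f}")
```

A feature set that scores worse in cross-validation, as the reduced set does here, is a reasonable candidate to skip rather than submit.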

### Model 4

This is the last model that uses all possible numeric columns, narrowed down and feature engineered to be the best model possible.

In [42]:
feat_v4 = pd.read_csv('../datasets/ready_for_model/feat_v4.csv', na_filter=False)
training_df_v4 = pd.read_csv('../datasets/ready_for_model/clean_train_v4.csv', na_filter=False)
test_df_v4 = test_df_v3_1.drop(columns=['SalePrice'])

In [43]:
feat_v4.head()

Unnamed: 0,Overall Qual,Overall Cond,1st Flr SF,2nd Flr SF,Gr Liv Area,Half Bath,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Misc Val,...,BsmtFin Type 2,Heating QC,Central Air,Kitchen Qual,Functional,Fireplace Qu,Garage Qual,Garage Cond,Paved Drive,Pool QC
0,6,8,725,754,1479,1,1,6,0,0,...,1.0,5.0,1.0,4.0,7.0,0.0,3.0,3.0,1.0,0.0
1,7,5,913,1209,2122,1,1,8,1,0,...,1.0,5.0,1.0,4.0,7.0,3.0,3.0,3.0,1.0,0.0
2,5,7,1057,0,1057,0,1,5,0,0,...,1.0,3.0,1.0,4.0,7.0,0.0,3.0,3.0,1.0,0.0
3,5,5,744,700,1444,1,1,7,0,0,...,1.0,4.0,1.0,3.0,7.0,0.0,3.0,3.0,1.0,0.0
4,6,8,831,614,1445,0,1,6,0,0,...,1.0,3.0,1.0,3.0,7.0,0.0,3.0,3.0,0.0,0.0


In [44]:
training_df_v4.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice,Year Built Year Remod/Add,bed_full_bath
0,109,533352170,60,RL,0.0,13517,1,,1,Lvl,...,0,,,0,3,2010,WD,130500,3961880,6
1,544,531379050,60,RL,43.0,11492,1,,1,Lvl,...,0,,,0,4,2009,WD,220000,3986012,8
2,153,535304180,20,RL,68.0,7922,1,,0,Lvl,...,0,,,0,1,2010,WD,109000,3919671,3
3,318,916386060,60,RL,73.0,9802,1,,0,Lvl,...,0,,,0,4,2010,WD,174000,4026042,6
4,255,906425045,50,RL,82.0,14235,1,,1,Lvl,...,0,,,0,3,2010,WD,138500,3786700,6


In [45]:
test_df_v4.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_Oth,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,0
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0


In [46]:
features_v4 = feat_v4.columns
X_v4 = training_df_v4[features_v4]
y_v4 = training_df_v4['SalePrice']
X_test_v4 = test_df_v4[features_v4]

In [47]:
lr_v4 = LinearRegression()
lr_v4.fit(X_v4, y_v4)

LinearRegression()

In [48]:
pred_v4 = lr_v4.predict(X_test_v4)
test_df_v4['SalePrice'] = pred_v4
test_df_v4.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,142898.11042
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,177383.377152
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,196658.415914
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,117723.160271
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,195267.192095


In [49]:
submission_v6 = test_df_v4[['Id','SalePrice']].set_index('Id')
submission_v6.head()

Id,SalePrice
2658,142898.11042
2718,177383.377152
2414,196658.415914
1989,117723.160271
625,195267.192095


In [50]:
submission_v6.to_csv('../datasets/submissions/submission_v6.csv')

The score for this model was 29,985.47562. I got below 30,000 for the first time, so that's exciting.

In [51]:
feat_v4_1 = pd.read_csv('../datasets/ready_for_model/feat_v4_1.csv', na_filter=False)
training_df_v4_1 = pd.read_csv('../datasets/ready_for_model/clean_train_v4_1.csv', na_filter=False)
test_df_v4_1 = test_df_v4.drop(columns=['SalePrice'])

In [52]:
test_df_v4_1.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_Oth,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,0
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,0,1,0,0,0,0,0,0,0


In [53]:
training_df_v4_1.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Year Built Year Remod/Add,bed_full_bath,overall_qual_cond,exter_qual_cond,bsmt_qual_cond,mas_vnr_ext_qual_cond,bsmt_sf_qual_cond,garage_area_qual_cond,bsmt_exposure_qual,tot_fire_qu
0,109,533352170,60,RL,0.0,13517,1,,1,Lvl,...,3961880,6,48,12,9,3468.0,6525.0,4275.0,3,0
1,544,531379050,60,RL,43.0,11492,1,,1,Lvl,...,3986012,8,35,12,12,1584.0,10956.0,5031.0,4,3
2,153,535304180,20,RL,68.0,7922,1,,0,Lvl,...,3919671,3,35,12,9,0.0,9513.0,2214.0,3,0
3,318,916386060,60,RL,73.0,9802,1,,0,Lvl,...,4026042,6,25,9,12,0.0,4608.0,3600.0,4,0
4,255,906425045,50,RL,82.0,14235,1,,1,Lvl,...,3786700,6,48,9,8,0.0,5408.0,4356.0,2,0


In [54]:
feat_v4_1.columns

Index(['Overall Qual', 'Overall Cond', '1st Flr SF', '2nd Flr SF',
       'Gr Liv Area', 'Half Bath', 'Kitchen AbvGr', 'TotRms AbvGrd',
       'Fireplaces', 'Misc Val', 'Year Built Year Remod/Add', 'Mas Vnr Area',
       'Total Bsmt SF', 'Bsmt Full Bath', 'Garage Area', 'bed_full_bath',
       'Street', 'Lot Shape', 'Land Slope', 'Exter Qual', 'Exter Cond',
       'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1',
       'BsmtFin Type 2', 'Heating QC', 'Central Air', 'Kitchen Qual',
       'Functional', 'Fireplace Qu', 'Garage Qual', 'Garage Cond',
       'Paved Drive', 'Pool QC', 'overall_qual_cond', 'exter_qual_cond',
       'bsmt_qual_cond', 'mas_vnr_ext_qual_cond', 'bsmt_sf_qual_cond',
       'garage_area_qual_cond', 'bsmt_exposure_qual', 'tot_fire_qu'],
      dtype='object')

In [55]:
features_v4_1 = feat_v4_1.columns
X_v4_1 = training_df_v4_1[features_v4_1]
y_v4_1 = training_df_v4_1['SalePrice']
X_test_v4_1 = test_df_v4_1[features_v4_1]

In [56]:
lr_v4_1 = LinearRegression()
lr_v4_1.fit(X_v4_1, y_v4_1)

LinearRegression()

In [57]:
pred_v4_1 = lr_v4_1.predict(X_test_v4_1)
test_df_v4_1['SalePrice'] = pred_v4_1
test_df_v4_1.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,137000.194282
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,190615.37454
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,191502.779332
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,100796.892373
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,196111.009955


In [58]:
submission_v7 = test_df_v4_1[['Id','SalePrice']].set_index('Id')
submission_v7.head()

Id,SalePrice
2658,137000.194282
2718,190615.37454
2414,191502.779332
1989,100796.892373
625,196111.009955


In [59]:
submission_v7.to_csv('../datasets/submissions/submission_v7.csv')

This is easily the best model so far with a score of 24,464.11077. 

In [60]:
feat_v4_2 = pd.read_csv('../datasets/ready_for_model/feat_v4_2.csv')
feat_v4_2.head()

Unnamed: 0,Overall Qual,Overall Cond,1st Flr SF,2nd Flr SF,Gr Liv Area,Half Bath,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Year Built Year Remod/Add,...,Garage Cond,Paved Drive,overall_qual_cond,exter_qual_cond,bsmt_qual_cond,mas_vnr_ext_qual_cond,bsmt_sf_qual_cond,garage_area_qual_cond,bsmt_exposure_qual,tot_fire_qu
0,6,8,725,754,1479,1,1,6,0,3961880,...,3.0,1.0,48.0,12.0,9.0,3468.0,6525.0,4275.0,3.0,0.0
1,7,5,913,1209,2122,1,1,8,1,3986012,...,3.0,1.0,35.0,12.0,12.0,1584.0,10956.0,5031.0,4.0,3.0
2,5,7,1057,0,1057,0,1,5,0,3919671,...,3.0,1.0,35.0,12.0,9.0,0.0,9513.0,2214.0,3.0,0.0
3,5,5,744,700,1444,1,1,7,0,4026042,...,3.0,1.0,25.0,9.0,12.0,0.0,4608.0,3600.0,4.0,0.0
4,6,8,831,614,1445,0,1,6,0,3786700,...,3.0,0.0,48.0,9.0,8.0,0.0,5408.0,4356.0,2.0,0.0


In [61]:
features_v4_2 = feat_v4_2.columns
X_v4_2 = training_df_v4_1[features_v4_2]
y_v4_2 = training_df_v4_1['SalePrice']
X_test_v4_2 = test_df_v4_1[features_v4_2]

In [62]:
lr_v4_2 = LinearRegression()
lr_v4_2.fit(X_v4_2, y_v4_2)

LinearRegression()

In [63]:
pred_v4_2 = lr_v4_2.predict(X_test_v4_2)
test_df_v4_1['SalePrice'] = pred_v4_2
test_df_v4_1.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,135716.086546
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,190671.818899
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,192226.722774
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,100728.041029
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,197528.267927


In [64]:
submission_v8 = test_df_v4_1[['Id','SalePrice']].set_index('Id')
submission_v8.head()

Id,SalePrice
2658,135716.086546
2718,190671.818899
2414,192226.722774
1989,100728.041029
625,197528.267927


In [65]:
submission_v8.to_csv('../datasets/submissions/submission_v8.csv')

This version was not better: it scored 25,035.15235, compared with 24,464.11077 for the previous version. I will stick with the earlier model.

### Model 5
This was where the submissions from the dummies-only model were going to go. Unfortunately, there were issues with that model, so I am not submitting anything from it.

### Model 6
This is where I submit predictions from the model that combines the numeric and dummy features.
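The test frames in this notebook include columns such as `Neighborhood_GrnHill` that are zero in every displayed row, which is consistent with dummy columns from training being added to the test set. The dummy encoding itself is done in the EDA notebooks; the sketch below shows one common way to align the two sets, using made-up rows.

```python
import pandas as pd

# Made-up categorical column; "GrnHill" appears only in the training rows.
train = pd.DataFrame({"Neighborhood": ["NAmes", "GrnHill", "Sawyer"]})
test = pd.DataFrame({"Neighborhood": ["Sawyer", "NAmes"]})

train_dummies = pd.get_dummies(train, columns=["Neighborhood"], dtype=int)
test_dummies = pd.get_dummies(test, columns=["Neighborhood"], dtype=int)

# Give the test frame exactly the training columns; missing ones become all zeros.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

Aligning on the training columns matters because `LinearRegression.predict` expects the same features, in the same order, that the model was fitted on.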

In [66]:
training_df_v6 = pd.read_csv('../datasets/cleaned_data/clean_train_v6.csv', na_filter=False)
test_df_v6 = test_df_v4_1.drop(columns='SalePrice')
final_feat = pd.read_csv('../datasets/ready_for_model/final_feat_v1.csv', na_filter=False)

In [67]:
features_v6 = final_feat.columns
X_v6 = training_df_v6[features_v6]
y_v6 = training_df_v6['SalePrice']
X_test_v6 = test_df_v6[features_v6]

In [68]:
lr_v6 = LinearRegression()
lr_v6.fit(X_v6, y_v6)

LinearRegression()

In [69]:
preds_v6 = lr_v6.predict(X_test_v6)
test_df_v6['SalePrice'] = preds_v6
test_df_v6.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,130362.948868
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,181697.304502
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,198610.329158
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,92044.874227
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,185730.062849


In [70]:
submission_v9 = test_df_v6[['Id','SalePrice']].set_index('Id')
submission_v9.head()

Id,SalePrice
2658,130362.948868
2718,181697.304502
2414,198610.329158
1989,92044.874227
625,185730.062849


In [71]:
submission_v9.to_csv('../datasets/submissions/submission_v9.csv')

This model has given me the best score so far of 23,282.12783.

In [72]:
final_feat_v2 = pd.read_csv('../datasets/ready_for_model/final_feat_v2.csv', na_filter=False)

In [73]:
features_v6 = final_feat_v2.columns
X_v6 = training_df_v6[features_v6]
y_v6 = training_df_v6['SalePrice']
X_test_v6 = test_df_v6[features_v6]

In [74]:
lr_v6 = LinearRegression()
lr_v6.fit(X_v6, y_v6)

LinearRegression()

In [75]:
preds_v6 = lr_v6.predict(X_test_v6)
test_df_v6['SalePrice'] = preds_v6
test_df_v6.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,136853.208294
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,176478.246222
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,214910.8356
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,92593.43158
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,186679.643775


In [76]:
submission_v10 = test_df_v6[['Id','SalePrice']].set_index('Id')
submission_v10.head()

Id,SalePrice
2658,136853.208294
2718,176478.246222
2414,214910.8356
1989,92593.43158
625,186679.643775


In [77]:
submission_v10.to_csv('../datasets/submissions/submission_v10.csv')

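The score for this model was 24,337.70060. That is ever so slightly better than my best Model 4 score (24,464.11077), but worse than the 23,282.12783 from the previous submission.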

In [78]:
final_feat_v3 = pd.read_csv('../datasets/ready_for_model/final_feat_v3.csv', na_filter=False)

In [79]:
features_v6 = final_feat_v3.columns
X_v6 = training_df_v6[features_v6]
y_v6 = training_df_v6['SalePrice']
X_test_v6 = test_df_v6[features_v6]

In [80]:
lr_v6 = LinearRegression()
lr_v6.fit(X_v6, y_v6)

LinearRegression()

In [81]:
preds_v6 = lr_v6.predict(X_test_v6)
test_df_v6['SalePrice'] = preds_v6
test_df_v6.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Sale Type_VWD,Sale Type_WD,MS Zoning_C (all),Neighborhood_GrnHill,Neighborhood_Landmrk,Exterior 1st_CBlock,Exterior 1st_ImStucc,Exterior 1st_Stone,Exterior 2nd_Stone,SalePrice
0,2658,902301120,190,69.0,9142,1,Grvl,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,140035.652984
1,2718,905108090,90,0.0,9662,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,169715.087268
2,2414,528218130,60,58.0,17104,1,,1,Lvl,AllPub,...,0,0,0,0,0,0,0,0,0,218938.512845
3,1989,902207150,30,60.0,8520,1,,0,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,95335.632839
4,625,535105100,20,0.0,9500,1,,1,Lvl,AllPub,...,0,1,0,0,0,0,0,0,0,184464.868577


In [82]:
submission_v11 = test_df_v6[['Id','SalePrice']].set_index('Id')
submission_v11.head()

Id,SalePrice
2658,140035.652984
2718,169715.087268
2414,218938.512845
1989,95335.632839
625,184464.868577


In [83]:
submission_v11.to_csv('../datasets/submissions/submission_v11.csv')

The score for this one was a little worse: 25,312.10289.
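The scores noted throughout this notebook can be gathered in one place to rank the submissions. This is a small convenience sketch; the dictionary keys are just labels chosen here to match the submission file names.

```python
# Scores copied from the notes in this notebook (lower is better).
scores = {
    "baseline_mean": 84362.32370,
    "submission_v1": 35407.82153,
    "submission_v2": 36535.52306,
    "submission_v3": 34867.79248,
    "submission_v4": 31435.46194,
    "submission_v5": 32155.45560,
    "submission_v6": 29985.47562,
    "submission_v7": 24464.11077,
    "submission_v8": 25035.15235,
    "submission_v9": 23282.12783,
    "submission_v10": 24337.70060,
    "submission_v11": 25312.10289,
}

baseline = scores["baseline_mean"]
# Sort from best (lowest) to worst and show the improvement over the baseline.
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:<15} {score:>12,.2f}  ({1 - score / baseline:.1%} below baseline)")

best = min(scores, key=scores.get)
print(f"best submission: {best}")
```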

# Next Steps
I have not been able to improve my model any further, and I am out of time. I am going to stop here and expand on my final thoughts and findings in the last notebook, 04_Final_Submission.