# Iowa Housing Project Data

Before modeling, I explore the data with Exploratory Data Analysis (EDA) to understand what it looks like: whether there are any missing values and what data types the dataframe contains. Knowing this helps explain why plots of the data look the way they do. Before starting the analysis, I import the libraries I need and read in the dataset.

### Import Libraries

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

### Read in Dataset

In [3]:
ames = pd.read_csv('datasets/train.csv')

### Checking for Null Values and Data Types

In [4]:
ames.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 81 columns):
Id                 2051 non-null int64
PID                2051 non-null int64
MS SubClass        2051 non-null int64
MS Zoning          2051 non-null object
Lot Frontage       1721 non-null float64
Lot Area           2051 non-null int64
Street             2051 non-null object
Alley              140 non-null object
Lot Shape          2051 non-null object
Land Contour       2051 non-null object
Utilities          2051 non-null object
Lot Config         2051 non-null object
Land Slope         2051 non-null object
Neighborhood       2051 non-null object
Condition 1        2051 non-null object
Condition 2        2051 non-null object
Bldg Type          2051 non-null object
House Style        2051 non-null object
Overall Qual       2051 non-null int64
Overall Cond       2051 non-null int64
Year Built         2051 non-null int64
Year Remod/Add     2051 non-null int64
Roof Style         20
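
With 81 columns, the `info()` listing is long to scan. A compact alternative is to tabulate only the columns that have missing values. The sketch below uses a small made-up frame rather than the housing data, and `null_summary` is a hypothetical helper name, not something from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "Lot Frontage": [60.0, np.nan, 75.0, np.nan],
    "Alley": [np.nan, "Pave", np.nan, np.nan],
    "Lot Area": [8450, 9600, 11250, 9550],
})

def null_summary(df):
    """Return dtype, null count and null share for columns with missing values."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isnull().sum(),
        "pct_null": df.isnull().mean().round(3),
    })
    # Keep only columns that have at least one null, worst first.
    return summary[summary["n_null"] > 0].sort_values("n_null", ascending=False)

report = null_summary(toy)
print(report)
```

Calling `null_summary(ames)` on the real frame should list columns such as `Alley` and `Lot Frontage` near the top, since the `info()` output shows them with fewer than 2051 non-null entries.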

### Variables Chosen Based on Research into Housing Prices

In [6]:
researched = ['Neighborhood', 'MS SubClass', 'Lot Area', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 
             'Gr Liv Area', 'Total Bsmt SF', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'TotRms AbvGrd', 'Fireplaces',
             'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch',
             'Pool Area']

ames[researched].isnull().sum()

Neighborhood       0
MS SubClass        0
Lot Area           0
1st Flr SF         0
2nd Flr SF         0
Low Qual Fin SF    0
Gr Liv Area        0
Total Bsmt SF      1
Full Bath          0
Half Bath          0
Bedroom AbvGr      0
TotRms AbvGrd      0
Fireplaces         0
Garage Area        1
Wood Deck SF       0
Open Porch SF      0
Enclosed Porch     0
3Ssn Porch         0
Screen Porch       0
Pool Area          0
dtype: int64

### Create X and y, adding dummy variables for the categorical column to X

In [7]:
# Dummy-encode Neighborhood (the only text column here) and fill the remaining nulls with 0
X = pd.get_dummies(ames[researched], prefix='Neighborhood').fillna(0)
X.shape

(2051, 47)
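
To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame (not the housing data). It shows that `pd.get_dummies` expands only the text column, while `MS SubClass`, which is stored as an integer, passes through as a single numeric column even though it is a code for the dwelling type. Treating it as a category, as in the second call, is an assumption for illustration and not what the notebook does.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "Edwards", "NAmes"],
    "MS SubClass": [20, 60, 20],
    "Garage Area": [480.0, None, 300.0],
})

# Naming the column explicitly encodes only Neighborhood;
# numeric columns pass through unchanged, and fillna(0) fills the missing area.
X_toy = pd.get_dummies(toy, columns=["Neighborhood"]).fillna(0)
print(list(X_toy.columns))

# Optional variant: also expand the integer-coded MS SubClass into dummies.
X_cat = pd.get_dummies(toy, columns=["Neighborhood", "MS SubClass"]).fillna(0)
print(list(X_cat.columns))
```

Note that filling a missing `Garage Area` or `Total Bsmt SF` with 0 treats the house as having no garage or basement, which is a modeling choice rather than a known fact about those rows.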

In [8]:
y = ames['SalePrice']
y.shape

(2051,)

### Create, fit, and score model

In [9]:
research_model = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [10]:
cross_val_score(research_model, X_train, y_train, cv = 5).mean()

0.7399297615159182

In [11]:
research_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [12]:
research_model.score(X_train, y_train)

0.8101827173183129

In [13]:
research_model.score(X_test, y_test)

0.8154331392773961
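
Because `train_test_split` is called without `random_state`, each run draws a different split, so the scores above will not reproduce exactly. The following sketch shows one way to pin the split. It runs on synthetic data rather than the housing data, and `split_and_score` and the seed value are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data standing in for X and y (not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

def split_and_score(X, y, seed):
    """Split with a fixed seed, then return (CV mean, train R^2, test R^2)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = LinearRegression()
    cv_mean = cross_val_score(model, X_tr, y_tr, cv=5).mean()
    model.fit(X_tr, y_tr)
    return cv_mean, model.score(X_tr, y_tr), model.score(X_te, y_te)

# Two calls with the same seed give the same split and the same scores.
first = split_and_score(X_demo, y_demo, seed=42)
second = split_and_score(X_demo, y_demo, seed=42)
print(first)
```

With a fixed seed, a gap between the cross-validated score and the train and test scores can be examined on the same split instead of shifting from run to run.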

### Create file to submit to Kaggle

In [14]:
test = pd.read_csv('datasets/test.csv')

In [15]:
# Apply the same encoding to the test set (note that X now holds the test features)
X = pd.get_dummies(test[researched], prefix='Neighborhood').fillna(0)
X.shape

(878, 45)

In [16]:
[col for col in X_train.columns if col not in X.columns]

['Neighborhood_GrnHill', 'Neighborhood_Landmrk']

In [17]:
X['Neighborhood_GrnHill'] = 0

In [18]:
X['Neighborhood_Landmrk'] = 0

In [19]:
X.shape

(878, 47)
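
Assigning the two missing columns one at a time appends them at the end of the frame, so the column order can differ from the training data even when the shapes match. A minimal sketch of an alternative, using `DataFrame.reindex` on made-up frames (the names `train_X` and `test_X` are placeholders, not the notebook's variables):

```python
import pandas as pd

# Made-up training features: GrnHill appears only here.
train_X = pd.DataFrame({
    "Lot Area": [8450, 9600],
    "Neighborhood_Edwards": [0, 1],
    "Neighborhood_GrnHill": [1, 0],
    "Neighborhood_NAmes": [0, 0],
})
# Made-up test features: no GrnHill column after encoding.
test_X = pd.DataFrame({
    "Lot Area": [7000],
    "Neighborhood_Edwards": [0],
    "Neighborhood_NAmes": [1],
})

# reindex adds any missing columns filled with 0, drops columns the
# training data never saw, and puts everything in the training order.
aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
print(aligned)
```

This handles any number of missing categories in one step and guards against column-order mismatches, which matter because a fitted linear model pairs each coefficient with a column by position.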

In [20]:
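# The two dummy columns added above sit at the end of X, so select the
# columns in the training order before predicting
predictions = research_model.predict(X[X_train.columns])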

In [21]:
res_model = pd.DataFrame({'Id': test['Id'], 'SalePrice': predictions})

In [22]:
res_model.to_csv('Preds/res_model.csv', index = False)

### Creating another model based on research

I created a second, smaller set of variables to see how the model would change after dropping variables that, according to another article I read, may not affect the price of a house.

In [23]:
researched_2 = ['Neighborhood', 'MS SubClass', 'Lot Area', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 
             'Gr Liv Area', 'Total Bsmt SF', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'TotRms AbvGrd', 'Fireplaces',
             'Garage Area']

In [24]:
# Same encoding as before, now on the reduced variable list
X = pd.get_dummies(ames[researched_2], prefix='Neighborhood').fillna(0)
y = ames['SalePrice']

In [25]:
research_2_model = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [26]:
research_2_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [27]:
research_2_model.score(X_train, y_train)

0.8231506041955994

In [28]:
research_2_model.score(X_test, y_test)

0.7406523039712374

In [30]:
cross_val_score(research_2_model, X_train, y_train, cv = 5).mean()

0.7907213254869947
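
The two models were scored on different random splits, so part of the difference between their scores may come from the split rather than from the variables. One way to separate the two effects is to score both variable lists on the same cross-validation folds. The sketch below does this on synthetic data; the column names, coefficients, and noise level are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: two informative columns and one pure-noise column.
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "living_area": rng.normal(1500, 400, n),
    "garage_area": rng.normal(450, 120, n),
    "pool_area": rng.normal(0, 1, n),  # unrelated to price by construction
})
price = 100 * demo["living_area"] + 50 * demo["garage_area"] + rng.normal(0, 20000, n)

full = ["living_area", "garage_area", "pool_area"]
reduced = ["living_area", "garage_area"]

# A seeded KFold yields identical folds for every model it is passed to.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

def cv_r2(columns):
    """Mean cross-validated R^2 for a linear model on the given columns."""
    return cross_val_score(LinearRegression(), demo[columns], price, cv=folds).mean()

scores = {"full": cv_r2(full), "reduced": cv_r2(reduced)}
print(scores)
```

Applied to the notebook, the same `folds` object could be passed as `cv` when scoring `researched` and `researched_2`, so that any remaining gap reflects the dropped variables.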