### PROBLEM

Return to your Ames data. We have covered a lot of ground today, so let's summarize what we could do to improve on our original model, which related the above-ground living area to the logarithm of the sale price.
<div class="alert alert-info" role="alert">
1. Clean the data and drop missing values
2. Transform the data, coding categorical variables with either ordinal values or `OneHotEncoder`
3. Create new features from existing features
4. Split the data into training and testing sets
5. Normalize quantitative features
6. Use regularized regression and polynomial regression to improve the model's performance
</div>
Can you use some or all of these ideas to improve upon your initial model?

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import sympy as sy
import pandas as pd
import pandas_profiling

import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler, MinMaxScaler

import warnings
warnings.filterwarnings('ignore')

In [2]:
ames = pd.read_csv('data/ames_housing.csv')

In [3]:
ames.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
ames.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

In [5]:
ames['Alley'].value_counts()

Grvl    50
Pave    41
Name: Alley, dtype: int64

In [6]:
ames['Alley'] = ames['Alley'].fillna("None")

In [7]:
ames['Alley'].value_counts()

None    1369
Grvl      50
Pave      41
Name: Alley, dtype: int64
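`Alley` is a nominal variable (no alley, gravel, or paved), so it has no natural order and is a candidate for one-hot encoding rather than ordinal codes. Below is a minimal sketch using `pd.get_dummies` on a small made-up column. The `toy` frame and its values are illustrative only and are not the Ames data.

```python
import pandas as pd

# Toy stand-in for the Alley column after fillna("None"); values are made up.
toy = pd.DataFrame({"Alley": ["None", "Grvl", "Pave", "None"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
alley_dummies = pd.get_dummies(toy["Alley"], prefix="Alley", dtype=int)
print(alley_dummies)
```

Each row has exactly one indicator set, so the three columns together carry the same information as the original label.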

In [8]:
ames['FireplaceQu'].value_counts()

Gd    380
TA    313
Fa     33
Ex     24
Po     20
Name: FireplaceQu, dtype: int64

In [9]:
ames['FireplaceQu'] = ames['FireplaceQu'].fillna("None")

In [10]:
ames['FireplaceQu'].value_counts()

None    690
Gd      380
TA      313
Fa       33
Ex       24
Po       20
Name: FireplaceQu, dtype: int64

In [11]:
ames['OverallQual'].value_counts()

5     397
6     374
7     319
8     168
4     116
9      43
3      20
10     18
2       3
1       2
Name: OverallQual, dtype: int64

In [12]:
ames['OverallCond'].value_counts()

5    821
6    252
7    205
8     72
4     57
3     25
9     22
2      5
1      1
Name: OverallCond, dtype: int64

In [13]:
ames['OverallGrade'] = ames['OverallCond'] * ames['OverallQual']

In [14]:
ames['OverallGrade'].value_counts()

35    336
30    284
40    177
25    142
42     94
36     84
20     65
45     46
48     37
24     31
49     23
28     22
16     20
56     15
50     15
18     12
12     11
63      9
15      9
6       4
72      3
32      3
9       3
54      3
10      2
90      2
21      2
64      2
60      1
8       1
3       1
1       1
Name: OverallGrade, dtype: int64

In [15]:
ames['BsmtQual'].value_counts()

TA    649
Gd    618
Ex    121
Fa     35
Name: BsmtQual, dtype: int64

In [16]:
ames = ames.replace({"BsmtQual": {"No": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

In [17]:
ames['BsmtQual'].value_counts()

3.0    649
4.0    618
5.0    121
2.0     35
Name: BsmtQual, dtype: int64

In [18]:
ames['BsmtCond'] = ames['BsmtCond'].fillna("None")

In [19]:
ames['BsmtCond'].value_counts()

TA      1311
Gd        65
Fa        45
None      37
Po         2
Name: BsmtCond, dtype: int64

In [20]:
ames = ames.replace({"BsmtCond": {"No": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

In [21]:
ames['BsmtCond'].value_counts()

3       1311
4         65
2         45
None      37
1          2
Name: BsmtCond, dtype: int64
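Notice that the 37 filled rows still carry the string `None` after the replacement: the mapping key is `"No"`, while the fill value was `"None"`, so the two never match. One possible way to keep the column fully numeric is to use the same label in both places. The sketch below does this on a small made-up column. Coding "no basement" as 0 is a modeling assumption, and `quality_codes` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for BsmtCond with one missing entry; values are made up.
toy = pd.DataFrame({"BsmtCond": ["TA", None, "Gd", "Po", "Fa"]})

# Hypothetical mapping: the fill label "None" is coded 0 (an assumption).
quality_codes = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Fill first, then map, using the same "None" label in both steps.
toy["BsmtCond"] = toy["BsmtCond"].fillna("None").map(quality_codes)
print(toy["BsmtCond"].tolist())
```

With every entry numeric, a product such as condition times quality is defined for all rows instead of dropping the houses without a basement.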

In [22]:
ames['BasementOverall'] = ames['BsmtCond'] * ames['BsmtQual']

In [23]:
ames['BasementOverall'].value_counts()

12.0    598
9.0     596
15.0    110
6.0      60
16.0     36
20.0     11
4.0       8
8.0       2
2.0       2
Name: BasementOverall, dtype: int64

In [24]:
ames = ames.replace({"GarageQual": {"No": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

In [25]:
ames['GarageQual'].value_counts()

3.0    1311
2.0      48
4.0      14
1.0       3
5.0       3
Name: GarageQual, dtype: int64

In [26]:
ames['GarageCond'] = ames['GarageCond'].fillna("None")

In [27]:
ames = ames.replace({"GarageCond": {"No": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

In [28]:
ames['GarageCond'].value_counts()

3       1326
None      81
2         35
4          9
1          7
5          2
Name: GarageCond, dtype: int64

In [29]:
ames['GarageOverall'] = ames['GarageCond'] * ames['GarageQual']

In [30]:
ames['GarageOverall'].value_counts()

9.0     1291
6.0       39
4.0       20
12.0      15
16.0       4
2.0        4
1.0        3
25.0       2
15.0       1
Name: GarageOverall, dtype: int64

In [31]:
ames['BedroomAbvGr'].value_counts()

3    804
2    358
4    213
1     50
5     21
6      7
0      6
8      1
Name: BedroomAbvGr, dtype: int64

In [32]:
ames['KitchenAbvGr'].value_counts()

1    1392
2      65
3       2
0       1
Name: KitchenAbvGr, dtype: int64

In [33]:
ames['TotRmsAbvGrd'].value_counts()

6     402
7     329
5     275
8     187
4      97
9      75
10     47
11     18
3      17
12     11
14      1
2       1
Name: TotRmsAbvGrd, dtype: int64

In [34]:
ames['SalePrice'].value_counts()

140000    20
135000    17
145000    14
155000    14
190000    13
110000    13
160000    12
115000    12
139000    11
130000    11
125000    10
143000    10
185000    10
180000    10
144000    10
175000     9
147000     9
100000     9
127000     9
165000     8
176000     8
170000     8
129000     8
230000     8
250000     8
200000     8
141000     8
215000     8
148000     7
173000     7
          ..
64500      1
326000     1
277500     1
259000     1
254900     1
131400     1
181134     1
142953     1
245350     1
121600     1
337500     1
228950     1
274000     1
317000     1
154500     1
52000      1
107400     1
218000     1
104000     1
68500      1
94000      1
466500     1
410000     1
437154     1
219210     1
84900      1
424870     1
415298     1
62383      1
34900      1
Name: SalePrice, Length: 663, dtype: int64

In [35]:
ames.corr().SalePrice

Id              -0.021917
MSSubClass      -0.084284
LotFrontage      0.351799
LotArea          0.263843
OverallQual      0.790982
OverallCond     -0.077856
YearBuilt        0.522897
YearRemodAdd     0.507101
MasVnrArea       0.477493
BsmtQual         0.644019
BsmtFinSF1       0.386420
BsmtFinSF2      -0.011378
BsmtUnfSF        0.214479
TotalBsmtSF      0.613581
1stFlrSF         0.605852
2ndFlrSF         0.319334
LowQualFinSF    -0.025606
GrLivArea        0.708624
BsmtFullBath     0.227122
BsmtHalfBath    -0.016844
FullBath         0.560664
HalfBath         0.284108
BedroomAbvGr     0.168213
KitchenAbvGr    -0.135907
TotRmsAbvGrd     0.533723
Fireplaces       0.466929
GarageYrBlt      0.486362
GarageCars       0.640409
GarageArea       0.623431
GarageQual       0.156693
WoodDeckSF       0.324413
OpenPorchSF      0.315856
EnclosedPorch   -0.128578
3SsnPorch        0.044584
ScreenPorch      0.111447
PoolArea         0.092404
MiscVal         -0.021190
MoSold           0.046432
YrSold      
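A long list of correlations is easier to read when it is ranked. As a rough sketch, we can drop the target's correlation with itself and sort the rest by absolute value. The example below uses synthetic data; the column `Noise` and all numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one informative feature, one pure-noise feature.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, n),
    "SalePrice": 100 * area + rng.normal(0, 20000, n),
})

# Rank features by the strength of their linear association with the target.
corr_with_price = (
    toy.corr()["SalePrice"]
    .drop("SalePrice")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_price)
```

Correlation only measures linear association, so a low value here does not rule out a useful nonlinear relationship.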

In [None]:
corr_mat = ames.corr()

In [None]:
plt.figure()
sns.heatmap(corr_mat, cmap='magma', annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1a19dcd358>

In [None]:
lr = LinearRegression()
lr.fit(ames[['BedroomAbvGr', 'KitchenAbvGr']], ames.SalePrice)

In [None]:
Predictions = lr.predict(ames[['BedroomAbvGr', 'KitchenAbvGr']])

In [None]:
# scikit-learn's convention is (y_true, y_pred)
mse = mean_squared_error(ames.SalePrice, Predictions)
print("The MSE is {:.3f}".format(mse))

In [None]:
rmse = np.sqrt(mse)
rmse

In [None]:
X = ames[['OverallGrade', 'GrLivArea', 'TotRmsAbvGrd']]
y = ames.SalePrice

In [None]:
lr = LinearRegression()
lr.fit(X, y)
pred = lr.predict(X)
mse = mean_squared_error(y, pred)
rmse = np.sqrt(mse)
print("The MSE is {:.4f}".format(mse), '\nRMSE: {:.4f}'.format(rmse))

In [None]:
base = DummyRegressor()

In [None]:
base.fit(X,y)

In [None]:
base.predict(X)

In [None]:
# With its default strategy, DummyRegressor predicts the training mean for every row
dum_pred = base.predict(X)
mean_squared_error(y, dum_pred)

In [None]:
np.sqrt(mean_squared_error(dum_pred,y))
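The baseline RMSE has a simple interpretation: since the default `DummyRegressor` always predicts the mean of the training target, its RMSE on that same data is just the (population) standard deviation of the target. A small check on made-up numbers, where `X_toy` and `y_toy` are illustrative names:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up target values; the features are ignored by the dummy model.
y_toy = np.array([100.0, 150.0, 200.0, 250.0])
X_toy = np.zeros((4, 1))

base_toy = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
baseline_rmse = np.sqrt(mean_squared_error(y_toy, base_toy.predict(X_toy)))

# The two numbers agree: RMSE of a mean-only model is the standard deviation of y.
print(baseline_rmse, np.std(y_toy))
```

Any model worth keeping should beat this number by a clear margin.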

In [None]:
y = ames['SalePrice']
ames = ames.drop('SalePrice', axis = 1)

In [None]:
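# Keep only the int64 columns; float64 columns such as LotFrontage (which has missing values) are left out
ames_numeric = ames.select_dtypes(include='int64')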
ames_numeric.head()

In [None]:
std_scaled = StandardScaler()
minmax_scaled = MinMaxScaler()

In [None]:
cols = ames_numeric.columns

In [None]:
std_df = std_scaled.fit_transform(ames[cols])
minmax_df = minmax_scaled.fit_transform(ames[cols])

In [None]:
pd.DataFrame(std_df).head()

In [None]:
pd.DataFrame(minmax_df).head()
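To see what the two scalers do, it helps to run them on a single tiny column. `StandardScaler` centers each column to mean 0 and rescales it to unit standard deviation, while `MinMaxScaler` maps the column's minimum to 0 and its maximum to 1. The array `X_toy` below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature column with values 1 through 5.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

std_toy = StandardScaler().fit_transform(X_toy)
minmax_toy = MinMaxScaler().fit_transform(X_toy)

print(std_toy.ravel())     # centered at 0 with unit standard deviation
print(minmax_toy.ravel())  # evenly spaced from 0 to 1
```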

In [None]:
lm = LinearRegression()

In [None]:
y = np.log(y)
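From this point on the target is the logarithm of the sale price, so every error we compute is on the log scale. Equal price ratios become equal differences after the transform, and `np.exp` undoes it when we want predictions back in dollars. As a rough guide, an RMSE of 0.1 on the log scale corresponds to errors of about 10% in price, since exp(0.1) is about 1.105. A small sketch on made-up prices:

```python
import numpy as np

# Made-up prices that double at each step.
prices = np.array([100000.0, 200000.0, 400000.0])
log_prices = np.log(prices)

# Each doubling is the same step, log(2), on the log scale.
diffs = np.diff(log_prices)

# np.exp inverts the transform and recovers the original prices.
recovered = np.exp(log_prices)
print(diffs, recovered)
```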

In [None]:
ames_numeric_scaled = std_scaled.fit_transform(ames[cols])

In [None]:
lm.fit(ames_numeric_scaled, y)

In [None]:
predictions = lm.predict(ames_numeric_scaled)

In [None]:
mse = mean_squared_error(y, predictions)

In [None]:
rmse = np.sqrt(mse)
# Score against the true targets; scoring against the model's own predictions always gives 1.0
score = lm.score(ames_numeric_scaled, y)

In [None]:
print('R-squared score: {}'.format(score), '\nRMSE: {:.4f}'.format(rmse))

In [None]:
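# Caveat: the scaler was fit on all rows before this split, so test-set statistics leak into the training features
X_train, X_test, y_train, y_test = train_test_split(ames_numeric_scaled, y)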

In [None]:
lm.fit(X_train, y_train)

In [None]:
pred = lm.predict(X_test)

In [None]:
mse = mean_squared_error(y_test, pred)

In [None]:
rmse = np.sqrt(mse)
rmse
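The last item on our list, regularized and polynomial regression, can be combined with the earlier steps in a single pipeline. The sketch below is one possible arrangement, not a tuned solution: it puts `StandardScaler`, `PolynomialFeatures`, and `Ridge` in a pipeline so the scaler is refit on each training fold, then lets `GridSearchCV` choose the polynomial degree and the penalty strength. The data are synthetic stand-ins for the features and log price, and the grid values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for (features, log price); not the Ames data.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 3))
y_toy = (12 + 0.3 * X_toy[:, 0] + 0.1 * X_toy[:, 1] ** 2
         + rng.normal(scale=0.05, size=300))

# Split first, so nothing about the test rows is seen during fitting.
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, random_state=0)

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(StandardScaler(),
                     PolynomialFeatures(include_bias=False),
                     Ridge())

# Arbitrary illustrative grid over polynomial degree and ridge penalty.
param_grid = {
    "polynomialfeatures__degree": [1, 2],
    "ridge__alpha": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)

# Evaluate the refit best model once on the held-out test set.
test_rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
print(search.best_params_, test_rmse)
```

To try this on the Ames data, the unscaled numeric columns and the log sale price would take the place of `X_toy` and `y_toy`.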