# Ames Housing Data and Kaggle Challenge
### Part 3C: Lasso

In this section, I fit a Lasso regression model to the cleaned Ames housing data, using LassoCV to choose the regularization strength.

Importing packages:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
import matplotlib.pyplot as plt
from sklearn import metrics
import pickle
import statsmodels.api as sm
from sklearn.linear_model import Lasso, LassoCV

Reading in the data:

In [2]:
ames = pd.read_csv('../datasets/ames_housing_cleaned.csv')

In [3]:
ames.head(2)

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_area,street,lot_shape,land_contour,utilities,lot_config,...,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,533352170,60,RL,13517,Pave,IR1,Lvl,AllPub,CulDSac,...,44,0,0,0,0,0,3,2010,WD,130500
1,544,531379050,60,RL,11492,Pave,IR1,Lvl,AllPub,CulDSac,...,74,0,0,0,0,0,4,2009,WD,220000


First, I'll set up the features I'll use with my model. I'm going to stick to the numerical features including ones I've engineered:
- neighborhood_avg
- bed_bath
- gr_liv_area_log
- built/remodel
- only_full_bath
- lot_area_log

In [4]:
neighbor_avg = ames.groupby(ames['neighborhood'])['saleprice'].mean()

In [5]:
neighborhoods = list(ames['neighborhood'].unique())
neighborhoods = sorted(neighborhoods)

In [6]:
# Average sale price per neighborhood, in units of $100,000
neighb_dict = {neighborhood: round(avg / 100_000, 2)
               for neighborhood, avg in zip(neighborhoods, neighbor_avg)}

In [7]:
print(neighb_dict)

{'Blmngtn': 2.0, 'Blueste': 1.45, 'BrDale': 1.03, 'BrkSide': 1.29, 'ClearCr': 2.21, 'CollgCr': 2.02, 'Crawfor': 2.06, 'Edwards': 1.32, 'Gilbert': 1.89, 'Greens': 1.89, 'GrnHill': 3.3, 'IDOTRR': 1.04, 'Landmrk': 1.37, 'MeadowV': 1.0, 'Mitchel': 1.71, 'NAmes': 1.48, 'NPkVill': 1.4, 'NWAmes': 1.95, 'NoRidge': 3.16, 'NridgHt': 3.22, 'OldTown': 1.26, 'SWISU': 1.35, 'Sawyer': 1.39, 'SawyerW': 1.92, 'Somerst': 2.27, 'StoneBr': 3.3, 'Timber': 2.4, 'Veenker': 2.54}


In [8]:
#Adapted from Daniel Kim's potato example

ames['neighborhood_avg'] = ames['neighborhood'].apply(lambda x: neighb_dict[x])
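The lookup above depends on the sorted neighborhood list lining up with the order of the groupby result. As a minimal sketch of an alternative that avoids the zip, the grouped Series can be passed straight to `Series.map`. The `toy` frame and its prices are made up for illustration and are not the Ames data. Since the averages are built from saleprice, computing them on the training split only would be a possible refinement; that is an assumption about how the workflow could change, not what this notebook does.

```python
import pandas as pd

# Toy stand-in for the Ames data; the values are invented for illustration.
toy = pd.DataFrame({
    "neighborhood": ["NAmes", "NAmes", "StoneBr", "StoneBr", "Veenker"],
    "saleprice": [140_000, 160_000, 300_000, 340_000, 250_000],
})

# Mean sale price per neighborhood, in units of $100,000, rounded to 2 places.
toy_neighb_avg = (toy.groupby("neighborhood")["saleprice"].mean() / 100_000).round(2)

# Series.map looks each neighborhood up by index label, so ordering does not matter.
toy["neighborhood_avg"] = toy["neighborhood"].map(toy_neighb_avg)

print(toy)
```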

In [9]:
ames['total_bath'] = ames['bsmt_full_bath'] + (ames['bsmt_half_bath'] *.5) + ames['full_bath'] + (ames['half_bath'] *.5)


In [10]:
ames['bed_bath'] = ames['bedroom_abvgr'] * ames['total_bath']

In [11]:
ames['gr_liv_area_log'] = np.log(ames['gr_liv_area'])

In [12]:
ames['built/remodel'] = ames['year_built'] * ames['year_remod/add']**2

In [13]:
ames['only_full_bath'] = ames['bsmt_full_bath'] + ames['full_bath']

In [14]:
# Note: despite the name, this is log(lot_area) multiplied by gr_liv_area
ames['lot_area_log'] = np.log(ames['lot_area']) * ames['gr_liv_area']
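To make the arithmetic behind the engineered bath and area features concrete, here is a small worked example on a single made-up house. The column names match the notebook, but the `house` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# One invented house, using the same column names as the notebook.
house = pd.DataFrame({
    "bsmt_full_bath": [1], "bsmt_half_bath": [0],
    "full_bath": [2], "half_bath": [1],
    "bedroom_abvgr": [3],
    "gr_liv_area": [1500], "lot_area": [10000],
})

# Half baths count as 0.5: 1 + 0*0.5 + 2 + 1*0.5 = 3.5
house["total_bath"] = (house["bsmt_full_bath"] + house["bsmt_half_bath"] * 0.5
                       + house["full_bath"] + house["half_bath"] * 0.5)

# Bedroom/bathroom interaction: 3 * 3.5 = 10.5
house["bed_bath"] = house["bedroom_abvgr"] * house["total_bath"]

# Full baths only: 1 + 2 = 3
house["only_full_bath"] = house["bsmt_full_bath"] + house["full_bath"]

# log(lot_area) scaled by living area, as in the notebook's lot_area_log
house["lot_area_log"] = np.log(house["lot_area"]) * house["gr_liv_area"]

print(house[["total_bath", "bed_bath", "only_full_bath", "lot_area_log"]])
```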

Setting the features:

In [15]:
ridge_features = [
    'overall_qual',
    'garage_area',
    'mas_vnr_area',
    'neighborhood_avg',
    'bed_bath',
    'gr_liv_area_log',
    'built/remodel',
    'only_full_bath',
    'lot_area_log',
    'year_remod/add',
]

In [16]:
X_L = ames[ridge_features]
y_L = ames['saleprice']

In [17]:
X_l_train, X_l_test, y_l_train, y_l_test = train_test_split(X_L, y_L, random_state = 21)

Instantiating and fitting the scaler:

In [18]:
ss = StandardScaler()
ss.fit(X_l_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

Scaling the data:

In [19]:
Z_train = ss.transform(X_l_train)
Z_test = ss.transform(X_l_test)
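The scaler is fit on the training split only and then applied to both splits, so the test data never influences the scaling parameters. A minimal sketch on synthetic data (the `X_demo` and `y_demo` arrays are invented, not the Ames features) shows the effect: the scaled training columns have mean 0 and standard deviation 1, while the scaled test columns are only close to that.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with a non-zero mean and non-unit spread.
rng = np.random.default_rng(21)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))
y_demo = rng.normal(size=200)

# Default test_size is 0.25, so 200 rows split into 150 train and 50 test.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=21)

# Learn the mean and standard deviation from the training split only.
scaler = StandardScaler().fit(X_tr)
Z_tr = scaler.transform(X_tr)
Z_te = scaler.transform(X_te)

print("train mean/std:", Z_tr.mean(axis=0).round(3), Z_tr.std(axis=0).round(3))
print("test  mean/std:", Z_te.mean(axis=0).round(3), Z_te.std(axis=0).round(3))
```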

Setting the alphas and instantiating the model:

In [20]:
alpha_l = np.logspace(-3, 3, 100)
lasso_model = LassoCV(
    alphas = alpha_l,
    cv = 5
)

Fitting the model:

In [21]:
lasso_model.fit(Z_train, y_l_train);
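After fitting, LassoCV stores the selected penalty in `alpha_` and the fitted coefficients in `coef_`, which is a quick way to see which features the L1 penalty shrank toward zero. The sketch below uses synthetic data in which only the first two of five columns drive the target; the names `X_demo`, `y_demo`, and `demo_lasso` are hypothetical and the numbers say nothing about the Ames model.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: only columns 0 and 1 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

Z_demo = StandardScaler().fit_transform(X_demo)

# Same alpha grid and fold count as the notebook.
alpha_grid = np.logspace(-3, 3, 100)
demo_lasso = LassoCV(alphas=alpha_grid, cv=5).fit(Z_demo, y_demo)

print("chosen alpha:", demo_lasso.alpha_)
for i, coef in enumerate(demo_lasso.coef_):
    print(f"feature {i}: {coef: .3f}")
```

On the notebook's model, the same inspection would be `lasso_model.alpha_` and `pd.Series(lasso_model.coef_, index=ridge_features)`.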

Scoring the model (R² on the training and test splits):

In [22]:
print(lasso_model.score(Z_train, y_l_train))
print(lasso_model.score(Z_test, y_l_test))

0.8148403679358442
0.8229088384605532


Creating predictions on the training set and computing the RMSE:

In [23]:
predictions = lasso_model.predict(Z_train)

In [24]:
mean_2_error = metrics.mean_squared_error(y_l_train, predictions)

In [25]:
print(mean_2_error**.5)

33860.43151914274
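The RMSE above is measured on the training split. A minimal sketch of reporting it for both splits, using a small hypothetical `rmse` helper and synthetic data rather than the Ames features, could look like this:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Synthetic stand-in data (not the Ames data), with a price-like target.
rng = np.random.default_rng(21)
X_demo = rng.normal(size=(400, 4))
y_demo = (180_000 + 40_000 * X_demo[:, 0] + 15_000 * X_demo[:, 1]
          + rng.normal(scale=20_000, size=400))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=21)

# Fit the scaler on the training split, then transform both splits.
scaler = StandardScaler().fit(X_tr)
Z_tr, Z_te = scaler.transform(X_tr), scaler.transform(X_te)

demo_model = LassoCV(alphas=np.logspace(-3, 3, 100), cv=5).fit(Z_tr, y_tr)

train_rmse = rmse(y_tr, demo_model.predict(Z_tr))
test_rmse = rmse(y_te, demo_model.predict(Z_te))
print(f"train RMSE: {train_rmse:,.0f}")
print(f"test RMSE:  {test_rmse:,.0f}")
```

With the notebook's variables, the test-split figure would be `rmse(y_l_test, lasso_model.predict(Z_test))`.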


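With an R² of about 0.815 on the training split and 0.823 on the test split, and a training RMSE of roughly $33,860, this model isn't performing any better than my MLR models, so I won't submit it.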