<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3

### Regression and Classification with the Ames Housing Data

---

You have just joined a new "full stack" real estate company in Ames, Iowa. The strategy of the firm is two-fold:
- Own the entire process from the purchase of the land all the way to sale of the house, and anything in between.
- Use statistical analysis to optimize investment and maximize return.

The company is still small, and though investment is substantial, its short-term goals are oriented more towards purchasing existing houses and flipping them than towards constructing entirely new houses. That being said, the company has access to a large construction workforce operating at rock-bottom prices.

This project uses the [Ames housing data recently made available on kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Estimating the value of homes from fixed characteristics.

---

Your superiors have outlined this year's strategy for the company:
1. Develop an algorithm to reliably estimate the value of residential houses based on *fixed* characteristics.
2. Identify characteristics of houses that the company can cost-effectively change/renovate with their construction team.
3. Evaluate the mean dollar value of different renovations.

Then we can use that to buy houses that are likely to sell for more than the cost of the purchase plus renovations.

Your first job is to tackle #1. You have a dataset of housing sale data with a large number of features describing different aspects of each house. The data and the full description of its features are in two separate files:

    housing.csv
    data_description.txt
    
You need to build a reliable estimator for the price of the house given characteristics of the house that cannot be renovated. Some examples include:
- The neighborhood
- Square feet
- Bedrooms, bathrooms
- Basement and garage space

and many more. 

Some examples of things that **ARE renovate-able:**
- Roof and exterior features
- "Quality" metrics, such as kitchen quality
- "Condition" metrics, such as condition of garage
- Heating and electrical components

and generally anything you deem can be modified without having to undergo major construction on the house.

---

**Your goals:**
1. Perform any cleaning, feature engineering, and EDA you deem necessary.
2. Be sure to remove any houses that are not residential from the dataset.
3. Identify **fixed** features that can predict price.
4. Train a model on pre-2010 data and evaluate its performance on the 2010 houses.
5. Characterize your model. How well does it perform? What are the best estimates of price?

> **Note:** The EDA and feature engineering component to this project is not trivial! Be sure to always think critically and creatively. Justify your actions! Use the data description file!

In [2]:
# Load the data
house = pd.read_csv('./housing.csv')

<title><center><u>PREPROCESSING</u></center></title>

---
##### << Be sure to remove any houses that are not residential from the dataset.

In [3]:
house.shape

(1460, 81)

In [4]:
house.columns

Index([u'Id', u'MSSubClass', u'MSZoning', u'LotFrontage', u'LotArea',
       u'Street', u'Alley', u'LotShape', u'LandContour', u'Utilities',
       u'LotConfig', u'LandSlope', u'Neighborhood', u'Condition1',
       u'Condition2', u'BldgType', u'HouseStyle', u'OverallQual',
       u'OverallCond', u'YearBuilt', u'YearRemodAdd', u'RoofStyle',
       u'RoofMatl', u'Exterior1st', u'Exterior2nd', u'MasVnrType',
       u'MasVnrArea', u'ExterQual', u'ExterCond', u'Foundation', u'BsmtQual',
       u'BsmtCond', u'BsmtExposure', u'BsmtFinType1', u'BsmtFinSF1',
       u'BsmtFinType2', u'BsmtFinSF2', u'BsmtUnfSF', u'TotalBsmtSF',
       u'Heating', u'HeatingQC', u'CentralAir', u'Electrical', u'1stFlrSF',
       u'2ndFlrSF', u'LowQualFinSF', u'GrLivArea', u'BsmtFullBath',
       u'BsmtHalfBath', u'FullBath', u'HalfBath', u'BedroomAbvGr',
       u'KitchenAbvGr', u'KitchenQual', u'TotRmsAbvGrd', u'Functional',
       u'Fireplaces', u'FireplaceQu', u'GarageType', u'GarageYrBlt',
       u'GarageFinish',

##### MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

In [5]:
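not_residential = ['A', 'C', 'I']
house[house['MSZoning'].isin(not_residential)]
# No rows match these exact labels. Note that commercial lots appear to be coded
# 'C (all)' in this data (see the MSZoning_C (all) dummy in Q3), so this check misses them.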

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
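
The empty result above only shows that the exact labels `'A'`, `'C'` and `'I'` do not occur. A minimal sketch of a filter that also catches a label such as `'C (all)'` is shown here. It runs on a hypothetical toy frame (`toy` and its rows are invented for illustration) and assumes the zoning code is always the first whitespace-separated token of `MSZoning`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the housing frame (not real data).
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'MSZoning': ['RL', 'C (all)', 'RM', 'FV'],
})

# Compare on the leading zoning code so that 'C (all)' is treated as 'C'.
zoning_code = toy['MSZoning'].str.split().str[0]
non_residential_codes = ['A', 'C', 'I']

# Keep only rows whose zoning code is not agricultural, commercial or industrial.
residential = toy[~zoning_code.isin(non_residential_codes)]
```

Applied to the real data, the same mask would be built from `house['MSZoning']`. Doing so would change the row count and therefore the outputs shown in the rest of this notebook.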


##### Section end >>

---

---
##### << Train model on pre-2010 data and evaluate its performance on the 2010 houses.
(Finding out what this means first)

In [6]:
house['YrSold'].unique()
# the data covers sales from 2006 through 2010

array([2008, 2007, 2006, 2009, 2010])

##### Section end >>

---

<h2><center><u>QUESTION 1</u></center></h2>

#### WORKFLOW
1. Identify fixed features
2. Train the model on pre-2010 data, then predict on 2010 data.
3. Evaluate
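
Step 2 of the workflow can be sketched as a split on `YrSold` rather than a random split. The frame below is a hypothetical toy example (its rows and the names `toy`, `X_train_time` and so on are invented for illustration); only the column names come from the data set.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the housing data.
toy = pd.DataFrame({
    'YrSold': [2006, 2007, 2008, 2009, 2010, 2010],
    'GrLivArea': [1500, 1200, 1800, 1000, 1600, 1400],
    'SalePrice': [200000, 150000, 250000, 120000, 210000, 180000],
})

# Sales before 2010 form the training set; 2010 sales form the hold-out set.
is_2010 = toy['YrSold'] == 2010
train, test = toy[~is_2010], toy[is_2010]

X_train_time = train.drop(['SalePrice', 'YrSold'], axis=1)
y_train_time = train['SalePrice']
X_test_time = test.drop(['SalePrice', 'YrSold'], axis=1)
y_test_time = test['SalePrice']
```

Note that the cells below use `train_test_split` with a random 70/30 split instead, so their scores are not a pre-2010 versus 2010 evaluation.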

#### I ran into an error here when trying to convert the values in MSSubClass to type 'category'.

#### cont

In [7]:
# Selected only a few fixed and non-fixed features.
# (The full column listing was already printed above, so it is not repeated here.)
irrelevant_features = ['Id']

# NOTE: 'TotalBsmtSF' is listed twice below, so it also appears twice in X
# (and twice in the K-best table later on).
fixed_features = ['Neighborhood', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 
                      '3SsnPorch', 'ScreenPorch', 'PoolArea', 'LotArea', 'MasVnrArea',
                      'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
                     'GrLivArea', 'GarageArea','BsmtFullBath', 'BsmtHalfBath',
                 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'SalePrice']
                      
non_fixed_features = ['RoofMatl', 'Exterior1st', 'MasVnrType', 'HeatingQC', 'KitchenQual', 'Condition1', 'Condition2',
                      'SalePrice', 'ExterQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 
                      'OverallCond', 'BsmtCond']

In [8]:
fixed = house[fixed_features]
X = fixed.iloc[:,:-1]
y = fixed.iloc[:,-1]

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# sklearn.cross_validation is deprecated; the same helpers live in model_selection.
from sklearn.model_selection import cross_val_score, cross_val_predict



<center> Q1 - CREATING DUMMY VARIABLES FOR THE COLUMNS THAT NEED IT </center>

In [10]:
neigh_dummy = pd.get_dummies(X['Neighborhood'])
X_concat = pd.concat([X, neigh_dummy], axis=1)
X_concat.drop('Neighborhood', axis =1, inplace=True)
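
The cell above keeps one dummy column per neighborhood. For an unregularized linear model with an intercept, a full set of dummies is perfectly collinear with the intercept, so a common alternative is to drop one level. A minimal sketch on a hypothetical toy frame (the rows are invented; `drop_first` is a real `pd.get_dummies` parameter):

```python
import pandas as pd

# Hypothetical toy rows; only the column name mirrors the housing data.
toy = pd.DataFrame({
    'Neighborhood': ['NAmes', 'OldTown', 'NAmes', 'Veenker'],
    'LotArea': [9000, 7500, 10000, 12000],
})

# drop_first=True drops the first level (alphabetically 'NAmes' here),
# which then acts as the reference category.
encoded = pd.get_dummies(toy, columns=['Neighborhood'], drop_first=True)
encoded_cols = list(encoded.columns)
```

With this encoding each remaining dummy coefficient reads as the price difference relative to the dropped neighborhood.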

<center> Q1 - REMOVING NULL VALUES FROM MY PREDICTORS </center>

In [11]:
# Coerce MasVnrArea to numeric, then fill missing values with the column mean.
X_concat['MasVnrArea'] = pd.to_numeric(X_concat['MasVnrArea'], errors='coerce')
X_concat['MasVnrArea'] = X_concat['MasVnrArea'].fillna(X_concat['MasVnrArea'].mean())
# astype returns a new Series, so assign it back for the cast to take effect.
X_concat['MasVnrArea'] = X_concat['MasVnrArea'].astype('float')
y = y.astype('float')

<center> Q1 FEATURE SELECTION K-BEST </center>

In [12]:
from sklearn.feature_selection import SelectKBest, chi2, f_classif

cols = list(X_concat.columns)

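# Build the selector: one with each score type.
# NOTE: f_classif and chi2 are classification scores. With a continuous target
# such as SalePrice they treat each distinct price as its own class, so read
# these rankings with caution (f_regression is the regression counterpart).
skb_f = SelectKBest(f_classif, k=10)
skb_chi2 = SelectKBest(chi2, k=10)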

# Train the selector on the data.
skb_f.fit(X_concat, y)
skb_chi2.fit(X_concat, y)

# Examine the results.
kbest = pd.DataFrame([cols, list(skb_f.scores_), list(skb_chi2.scores_)], 
                     index=['feature','f_classif','chi2 score']).T.sort_values('f_classif', ascending=False)
kbest

| | feature | f_classif | chi2 score |
|---|---|---|---|
| 36 | NridgHt | 4.48329 | 1090.23 |
| 12 | GrLivArea | 3.43511 | 196850.0 |
| 35 | NoRidge | 3.43219 | 1050.51 |
| 6 | LotArea | 3.28585 | 10115000.0 |
| 44 | Veenker | 2.69042 | 1001.05 |
| 13 | GarageArea | 2.58334 | 96184.1 |
| 9 | 1stFlrSF | 2.33862 | 123810.0 |
| 8 | TotalBsmtSF | 2.31529 | 174706.0 |
| 19 | TotalBsmtSF | 2.31529 | 174706.0 |
| 41 | Somerst | 2.11225 | 875.173 |
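
Because `SalePrice` is continuous, a regression score may be a better fit for this ranking than the two classification scores used above. The sketch below uses `f_regression` on synthetic data. The data and the names `X_demo`, `y_demo` are invented for illustration and are not results from the housing set.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: column 1 drives the target, columns 0 and 2 are pure noise.
rng = np.random.RandomState(0)
n = 200
informative = rng.normal(size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)
X_demo = np.column_stack([noise_a, informative, noise_b])
y_demo = 50000 * informative + rng.normal(scale=5000, size=n)

# f_regression scores each feature by a univariate linear F-test against y.
selector = SelectKBest(f_regression, k=1).fit(X_demo, y_demo)
best_index = int(np.argmax(selector.scores_))
selected_mask = selector.get_support()
```

On the notebook's data the equivalent call would be `SelectKBest(f_regression, k=10).fit(X_concat, y)`; the resulting ranking may differ from the table above.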


In [13]:
bottom_5features = ['Blueste', 'Sawyer', 'SWISU', 'ClearCr', 'NWAmes']
bottom_10features = ['Blueste', 'Sawyer', 'SWISU', 'ClearCr', 'NWAmes', 'NPkVill', 'Mitchel', 'EnclosedPorch',
                    'Edwards', 'BsmtHalfBath']

In [14]:
X_drop5 = X_concat.drop(bottom_5features, axis=1)
X_drop10 = X_concat.drop(bottom_10features, axis=1)
print X_concat.shape
print X_drop10.shape
print X_drop5.shape

(1460, 45)
(1460, 35)
(1460, 40)


<center> Q1 FEATURE SELECTION BY RFE (RECURSIVE FEATURE ELIMINATION) </center>

In [15]:
from sklearn.feature_selection import RFECV

lr = LinearRegression()
selector = RFECV(lr, step=1, cv=10)
selector = selector.fit(X_concat, y)

print selector.support_
print selector.ranking_ 
print len(selector.ranking_)

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1]
45


REMARKS:
    
When instantiating the RFECV:
1. 'step' is the number of features to remove at each iteration if it is an integer, or the fraction of features to remove if it is a float between 0 and 1.
2. 'cv' is the number of cross-validation folds used to score each candidate feature subset.

Here every feature has rank 1, so RFECV kept all 45 features.



<center> Q1 FEATURE SELECTION BY LASSO PENALTY </center>

I assume feature selection by LASSO penalty is the same as fitting the LASSO regression, which I will carry out later on.
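
As a rough illustration of why a Lasso fit doubles as feature selection, the sketch below fits a Lasso to synthetic standardized data and lists the coefficients that survive. The data, the `alpha=0.5` value and the `*_demo` names are assumptions made for the example only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only column 0 influences the target.
rng = np.random.RandomState(1)
n = 200
X_demo = rng.normal(size=(n, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=n)

# Standardize first so the L1 penalty treats every feature on the same scale.
Xs_demo = StandardScaler().fit_transform(X_demo)
lasso_demo = Lasso(alpha=0.5).fit(Xs_demo, y_demo)

# Features whose coefficients were not shrunk to exactly zero.
kept = np.flatnonzero(lasso_demo.coef_ != 0)
```

The L1 penalty sets the two noise coefficients to exactly zero, which is the behaviour the later "which features got zeroed out" cell relies on.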

<center> Q1 - LINEAR MODEL WITHOUT CV </center>

#### used all 45 features

In [16]:
X_train45, X_test45, y_train45, y_test45 = train_test_split(X_concat, y, train_size=0.7, random_state=88)



In [17]:
X_train45.shape, y_train45.shape

((1021, 45), (1021,))

In [18]:
X_test45.shape, y_test45.shape

((439, 45), (439,))

In [19]:
model45 = LinearRegression()
model45.fit(X_train45, y_train45)
model45.score(X_test45, y_test45)

0.6278060510205139
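
`score` returns R², which is unitless. Since the task asks for price estimates, it can help to report the error in dollars as well. Below is a small sketch of both metrics computed by hand; `r2_and_rmse` is a hypothetical helper and the three prices are invented.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for two equal-length sequences of prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot, np.sqrt(np.mean(residuals ** 2))

# Invented example: every prediction is off by $10,000.
r2, rmse = r2_and_rmse([100000, 200000, 300000], [110000, 190000, 310000])
```

For the model above, the same helper could be called as `r2_and_rmse(y_test45, model45.predict(X_test45))`.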

#### used 40 features

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_drop5, y, train_size=0.7, random_state=88)

In [21]:
X_train.shape, y_train.shape

((1021, 40), (1021,))

In [22]:
model40 = LinearRegression()
model40.fit(X_train, y_train)
model40.score(X_test, y_test)

0.6297707893547342

#### used 35 features

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X_drop10, y, train_size=0.7, random_state=88)

In [24]:
X_train.shape, y_train.shape

((1021, 35), (1021,))

In [25]:
model35 = LinearRegression()
model35.fit(X_train, y_train)
model35.score(X_test, y_test)

0.6253164532160111

<center> Q1 - LINEAR MODEL WITH CV </center>

#### used all 45 features

In [26]:
from sklearn.model_selection import cross_val_score
# with cross validation
lr = LinearRegression()
lr_scores = cross_val_score(lr, X_concat, y, cv=10)
print lr_scores
print '='*20
print np.mean(lr_scores)

# Note: cross_val_score already scores each fold against the target (R^2 by default),
# so no further scoring step is needed.
# Next: test ridge, lasso and elastic net to see whether they improve the score.
# Lasso also acts as a form of feature selection: check which coefficients shrink to 0.

[0.85267508 0.81715762 0.85584122 0.75818831 0.79318915 0.83000955
 0.77493096 0.77493694 0.48589756 0.82256481]
0.7765391190263479


#### used 40 features

In [27]:
from sklearn.model_selection import cross_val_score
# with cross validation
lr = LinearRegression()
lr_scores = cross_val_score(lr, X_drop5, y, cv=10)
print lr_scores
print '='*20
print np.mean(lr_scores)

[0.85119429 0.8156707  0.85333799 0.75661876 0.79741649 0.83444644
 0.77296758 0.76832878 0.48129847 0.82298841]
0.7754267915699017


#### used 35 features

In [28]:
from sklearn.model_selection import cross_val_score
# with cross validation
lr = LinearRegression()
lr_scores = cross_val_score(lr, X_drop10, y, cv=10)
print lr_scores
print '='*20
print np.mean(lr_scores)

[0.85537038 0.81771412 0.84857502 0.75071678 0.80033615 0.83429664
 0.77617481 0.77189836 0.47145316 0.82540927]
0.7751944691790041


<center> Q1 - RIDGE MODEL WITH CV </center>

In [29]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge, RidgeCV

ss = StandardScaler()
Xs = ss.fit_transform(X_concat)

# Test the range of alphas with the number of CV in RidgeCV 
range_alpha = np.logspace(2,3,50)
RCV = RidgeCV(alphas=range_alpha, cv=10)
RCV.fit(Xs,y)
optimal_ridge_alpha = RCV.alpha_
optimal_ridge_alpha

132.57113655901094

In [30]:
ridge = Ridge(alpha=optimal_ridge_alpha,)
ridge_score = cross_val_score(ridge, Xs, y, cv=10)
print ridge_score
print '='*20
print 'mean of ridge_score: ', np.mean(ridge_score)

[0.85514887 0.82168854 0.85670836 0.75814673 0.79268222 0.83209676
 0.77371915 0.78225637 0.4857345  0.82757763]
mean of ridge_score:  0.778575914261703
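
The cells above standardize the whole matrix once and then cross-validate, so each fold's scaler has seen the held-out rows. One way to keep scaling inside each fold is a pipeline. This is a sketch on synthetic data (the data, `alpha=1.0` and the `*_demo` names are assumptions for illustration), not a re-run of the notebook's model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales.
rng = np.random.RandomState(2)
X_demo = rng.normal(size=(120, 4)) * [1.0, 10.0, 100.0, 1000.0]
y_demo = X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=0.5, size=120)

# The scaler is refit on the training part of every fold before Ridge is fit.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
```

For a data set of this size the difference from scaling up front is usually small, but the pipeline keeps the cross-validated score an honest out-of-fold estimate.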


<center> Q1 - LASSO MODEL WITH CV </center>

In [31]:
from sklearn.linear_model import Lasso, LassoCV

# Let LassoCV pick alpha from 500 automatically generated candidates.
optimal_lasso = LassoCV(n_alphas=500, cv=10, verbose=1)
optimal_lasso.fit(Xs, y)
optimal_lasso_alpha = optimal_lasso.alpha_
print optimal_lasso_alpha


152.4705678544128




In [32]:
lasso = Lasso(alpha=optimal_lasso_alpha)
lasso_score = cross_val_score(lasso, Xs, y, cv=10)
print lasso_score
print '='*20
print 'mean of lasso_score: ', np.mean(lasso_score)

[0.85337597 0.81919197 0.85739251 0.75793668 0.79269722 0.83135543
 0.77514095 0.77496517 0.48317322 0.82319669]
mean of lasso_score:  0.7768425811609785


<center> Q1 - WHICH FEATURES GOT ZEROED OUT BY LASSO? </center>

In [103]:
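# NOTE: this cell assumes Xs and y are still the Q1 versions (45 columns).
# The ValueError shown below is consistent with the cell having been run after
# Xs was rebuilt for Q3, which would make lasso.coef_ longer than X_concat.columns.
lasso.fit(Xs,y)
lasso_coef = pd.DataFrame({'Features': X_concat.columns,
                          'Coef': lasso.coef_})
lasso_coef = lasso_coef[['Features', 'Coef']]
# Count the zeroed coefficients instead of hard-coding the number.
n_zero = (lasso_coef['Coef'] == 0).sum()
print '='*20
print lasso_coef[lasso_coef['Coef'] == 0]
print '='*20
print '%d features got zeroed out by lasso' % n_zero
print '='*20
print '\nnumber of predictor features used in total: ', len(lasso_coef)
print '\npercentage of features zeroed out: ', round(n_zero / float(len(lasso_coef)), 2)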

ValueError: arrays must all be same length

<center> Q1 - ELASTIC NET MODEL WITH CV </center>

In [34]:
from sklearn.linear_model import ElasticNetCV, ElasticNet

range_l1_ratios = np.linspace(0.01, 1, 25)
optimal_hyper_param = ElasticNetCV(l1_ratio=range_l1_ratios, n_alphas=30, cv=10)
optimal_hyper_param.fit(Xs,y)
print 'alpha: ', optimal_hyper_param.alpha_
print 'l1_ratio: ', optimal_hyper_param.l1_ratio_
print '\n', '='*20
print '\nSince the l1_ratio is 1.0 then the ElasticNet will perform just like the Lasso Regression'

alpha:  185.1661793602367
l1_ratio:  1.0


Since the l1_ratio is 1.0 then the ElasticNet will perform just like the Lasso Regression
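
The claim that an elastic net with `l1_ratio=1.0` behaves like the Lasso can be checked directly: at the same `alpha` the two estimators should return the same coefficients. The sketch below does this on synthetic data (the data and `alpha=0.1` are assumptions for the example).

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression problem with two informative columns out of five.
rng = np.random.RandomState(3)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# Same alpha, pure L1 mixing: the penalties are identical.
enet_demo = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
same = np.allclose(enet_demo.coef_, lasso_demo.coef_)
```

In this notebook the two searches landed on different alphas (about 185 versus 152), plausibly because `ElasticNetCV` and `LassoCV` were given different `n_alphas` grids, which would explain the small gap between their mean scores.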


In [35]:
enet = ElasticNet(alpha=optimal_hyper_param.alpha_, l1_ratio=optimal_hyper_param.l1_ratio_)
enet_scores = cross_val_score(enet, Xs, y, cv=10)
print enet_scores
print'='*20
print 'mean of LinearReg_scores: ', np.mean(lr_scores)
print 'mean of enet_scores: ' , np.mean(enet_scores)
print 'mean of ridge_score: ', np.mean(ridge_score)
print 'mean of lasso_score: ', np.mean(lasso_score)

[0.85344984 0.81961169 0.85767914 0.75783499 0.79257631 0.8315766
 0.77515122 0.77494128 0.48257558 0.82331534]
mean of LinearReg_scores:  0.7751944691790041
mean of enet_scores:  0.7768711988210377
mean of ridge_score:  0.778575914261703
mean of lasso_score:  0.7768425811609785


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Determine any value of *changeable* property characteristics unexplained by the *fixed* ones.

---

Now that you have a model that estimates the price of a house based on its static characteristics, we can move forward with parts 2 and 3 of the plan: what are the costs/benefits of quality, condition, and renovations?

There are two specific requirements for these estimates:
1. The estimates of effects must be in terms of dollars added or subtracted from the house value. 
2. The effects must be on the variance in price remaining from the first model.

The residuals from the first model (training and testing) represent the variance in price unexplained by the fixed characteristics. Of that variance in price remaining, how much of it can be explained by the easy-to-change aspects of the property?

---

**Your goals:**
1. Evaluate the effect in dollars of the renovate-able features. 
2. How would your company use this second model and its coefficients to determine whether they should buy a property or not? Explain how the company can use the two models you have built to determine if they can make money. 
3. Investigate how much of the variance in price remaining is explained by these features.
4. Do you trust your model? Should it be used to evaluate which properties to buy and fix up?

#### Step 1 : Using the residuals as the target variable

In [36]:
ridge = Ridge()
# Fit on the training split only so the test-set residuals are out-of-sample.
ridge.fit(X_train45, y_train45)
y_ridge_predict = ridge.predict(X_test45)
ridge_residuals = y_test45 - y_ridge_predict
non_fixed_ridge_y = ridge_residuals

In [37]:
y_predict45 = model45.predict(X_test45)
linear_residuals = y_test45 - y_predict45
non_fixed_linear_y = linear_residuals
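
The two-stage idea used here (fit price on fixed features, then fit the leftover residuals on renovate-able features) can be sketched on synthetic data where the true effects are known. Everything in the block is invented for illustration: `fixed`, `renov`, the true slopes 2 and 5, and the helper `fit_line`.

```python
import numpy as np

# Synthetic data: price depends on one fixed and one renovate-able feature.
rng = np.random.RandomState(4)
n = 500
fixed = rng.normal(size=n)
renov = rng.normal(size=n)
price = 2.0 * fixed + 5.0 * renov + rng.normal(scale=0.1, size=n)

def fit_line(x, target):
    """Least-squares fit of target on x; returns [intercept, slope]."""
    design = np.column_stack([np.ones_like(x), x])
    coef, _, _, _ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# Stage 1: explain price with the fixed feature and keep what is left over.
stage1 = fit_line(fixed, price)
residuals = price - (stage1[0] + stage1[1] * fixed)

# Stage 2: explain the residuals with the renovate-able feature.
stage2 = fit_line(renov, residuals)
```

Because the residuals are in the same units as the target, the stage 2 slope reads directly as a change in price per unit of the renovate-able feature, which is what the "effect in dollars" requirement asks for.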

##### Step 2: Take all non_fixed features and convert them to dummy variables

In [38]:
non_fixed_X = house[non_fixed_features].copy().drop('SalePrice', axis=1)
non_fixed_X  = pd.get_dummies(non_fixed_X)

#### I only need the index that corresponds to the X_test from question 1

In [39]:
X_test45.index[:5]

Int64Index([1189, 1231, 327, 645, 210], dtype='int64')

In [40]:
len(X_test45.index[:])

439

In [41]:
# Keep only the rows that belong to the Q1 test split (selected by index label).
non_fixed_X = non_fixed_X.loc[X_test45.index]

In [42]:
non_fixed_X.head()

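| | OverallCond | RoofMatl_ClyTile | RoofMatl_CompShg | RoofMatl_Membran | RoofMatl_Metal | RoofMatl_Roll | RoofMatl_Tar&Grv | RoofMatl_WdShake | RoofMatl_WdShngl | Exterior1st_AsbShng | ... | PoolQC_Fa | PoolQC_Gd | Fence_GdPrv | Fence_GdWo | Fence_MnPrv | Fence_MnWw | BsmtCond_Fa | BsmtCond_Gd | BsmtCond_Po | BsmtCond_TA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1189 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1231 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 327 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 645 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 210 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |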


<center> Q2: TESTING THE RESIDUALS ON LINEAR, RIDGE, LASSO & ENET </center>

#### Linear

In [43]:
lr = LinearRegression()
lr_score = cross_val_score(lr, non_fixed_X,  non_fixed_linear_y, cv=10)
print 'mean of score : ', np.mean(lr_score)

mean of score :  -4.169835641267372e+23


#### Ridge

In [44]:
optimal_alphas = np.logspace(-2,7,50)
RCV = RidgeCV(alphas=optimal_alphas, cv=10)
RCV.fit(non_fixed_X, non_fixed_linear_y)
print RCV.alpha_
print '='*20

ridge=Ridge(alpha=RCV.alpha_)
ridge_score = cross_val_score(ridge, non_fixed_X, non_fixed_linear_y, cv=10)
print ridge_score
print '='*20
print 'mean of ridge scores: ', np.mean(ridge_score)

47.14866363457394
[-0.1213242  -0.01875507  0.04888546  0.08202169 -0.05697737  0.15789197
  0.31222055  0.09053512  0.18502632  0.1721814 ]
mean of ridge scores:  0.08517058743888851


#### Lasso

In [45]:
LCV = LassoCV(n_alphas=500, cv=10)
LCV.fit(non_fixed_X, non_fixed_linear_y)
LCV.alpha_

363.5953796186932

In [46]:
lasso = Lasso(alpha=LCV.alpha_)
lasso_score = cross_val_score(lasso, non_fixed_X, non_fixed_linear_y, cv=10)
print lasso_score
print '='*20
print 'mean of lasso_score: ', np.mean(lasso_score)

[-0.12444874  0.04925458  0.07275198  0.0745331  -0.07201356  0.29955781
  0.37275292  0.13936973  0.10842363  0.12867536]
mean of lasso_score:  0.10488568160985276


#### ENET

In [47]:
l1_ratios = np.linspace(0.01,1.0,25)
ECV = ElasticNetCV(l1_ratio=l1_ratios, n_alphas=50, cv=10)
ECV.fit(non_fixed_X, non_fixed_linear_y)
print ECV.alpha_
print ECV.l1_ratio_

363.38999591528693
1.0


CONCLUSION
1. The fixed features explain about 77% of the variance in sale price (mean cross-validated R² from Q1), leaving about 23% unexplained. The non-fixed features explain roughly 10% of that remainder (Lasso mean R² of about 0.10 on the residuals), which is about 2.3% of the total variance.
2. The remaining 20.7% or so of the variance in sale price is explained by neither the fixed nor the non-fixed features.
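
The arithmetic behind these percentages is spelled out below, using the rounded figures quoted in the conclusion. The variable names are made up for the example.

```python
# Rounded figures taken from the conclusion above.
fixed_r2 = 0.77   # share of price variance explained by fixed features
renov_r2 = 0.10   # share of the *remaining* variance explained by non-fixed features

# Variance left over after the fixed-feature model.
unexplained_after_fixed = 1.0 - fixed_r2

# Non-fixed features explain 10% of that remainder, expressed as a share of the total.
explained_by_renov = renov_r2 * unexplained_after_fixed

# What neither model explains.
still_unexplained = unexplained_after_fixed - explained_by_renov
```

This gives roughly 0.023 and 0.207, matching the 2.3% and 20.7% stated above.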

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. What property characteristics predict an "abnormal" sale?

---

The `SaleCondition` feature indicates the circumstances of the house sale. From the data file, we can see that the possibilities are:

       Normal	Normal Sale
       Abnorml	Abnormal Sale -  trade, foreclosure, short sale
       AdjLand	Adjoining Land Purchase
       Alloca	Allocation - two linked properties with separate deeds, typically condo with a garage unit	
       Family	Sale between family members
       Partial	Home was not completed when last assessed (associated with New Homes)
       
One of the executives at your company has an "in" with higher-ups at the major regional bank. His friends at the bank have made him a proposal: if he can reliably indicate what features, if any, predict "abnormal" sales (foreclosures, short sales, etc.), then in return the bank will give him first dibs on the pre-auction purchase of those properties (at a dirt-cheap price).

He has tasked you with determining (and adequately validating) which features of a property predict this type of sale. 

---

**Your task:**
1. Determine which features predict the `Abnorml` category in the `SaleCondition` feature.
2. Justify your results.

This is a challenging task that tests your ability to perform classification analysis in the face of severe class imbalance. You may find that simply running a classifier on the full dataset to predict the category ends up useless: when there is bad class imbalance classifiers often tend to simply guess the majority class.

It is up to you to determine how you will tackle this problem. I recommend doing some research to find out how others have dealt with the problem in the past. Make sure to justify your solution. Don't worry about it being "the best" solution, but be rigorous.

Be sure to indicate which features are predictive (if any) and whether they are positive or negative predictors of abnormal sales.

#### Create a new column, 1 for abnormal, 0 for the rest.

In [48]:
df = house
X = df.drop(columns=['Id', 'SalePrice'])
print X.shape
# Keep only columns with no missing values: isnull().sum() counts NaNs per
# column, so "< 0.8" keeps a column only when that count is 0.
# (A fraction-based filter would use X.isnull().mean() instead.)
X = X[X.columns[X.isnull().sum() < 0.8]]
print X.shape
print 'dropped 19 columns that contain NaNs'
X = X.drop('SaleCondition', axis=1)

(1460, 79)
(1460, 60)
dropped 19 columns that contain NaNs


In [49]:
# Copy first so the assignment below does not trigger SettingWithCopyWarning.
y = df[['SaleCondition']].copy()
y['SaleCondition'] = y['SaleCondition'].apply(lambda x: 1 if x == 'Abnorml' else 0)


In [50]:
req_dummy = ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood',
            'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle'
            ,'RoofMatl', 'Exterior1st', 'Exterior2nd', 'ExterQual', 'ExterCond', 'Foundation', 'Heating',
            'HeatingQC', 'CentralAir', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
            'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageCars', 'PavedDrive',
            'SaleType']
print len(req_dummy)

37


In [51]:
dummy_cols =  pd.get_dummies(X[req_dummy])
dummy_cols.shape

(1460, 187)
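
One detail worth knowing about the call above: when `pd.get_dummies` is given a whole DataFrame without the `columns` argument, it only encodes object or categorical columns and passes numeric ones through unchanged. That is why integer-coded columns in `req_dummy`, such as `OverallQual`, still show up as `int64` further down. A toy illustration (the rows are invented):

```python
import pandas as pd

# Hypothetical toy rows: one integer-coded column and one string column.
toy = pd.DataFrame({
    'OverallQual': [5, 7, 5],
    'Street': ['Pave', 'Grvl', 'Pave'],
})

# Default behaviour: only the string column is expanded.
default_cols = list(pd.get_dummies(toy).columns)

# Naming the columns explicitly forces the integer column to be expanded too.
forced_cols = list(pd.get_dummies(toy, columns=['OverallQual', 'Street']).columns)
```

Whether ordinal scores like `OverallQual` should be one-hot encoded or left numeric is a modelling choice; the point here is only that listing them in `req_dummy` does not by itself encode them.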

In [52]:
X.drop(columns=req_dummy, axis=1, inplace=True)

In [53]:
X.shape

(1460, 22)

In [54]:
X = pd.concat([X, dummy_cols], axis=1)

In [55]:
X.shape

(1460, 209)

In [56]:
X.dtypes

MSSubClass         int64
LotArea            int64
YearBuilt          int64
YearRemodAdd       int64
BsmtFinSF1         int64
BsmtFinSF2         int64
BsmtUnfSF          int64
TotalBsmtSF        int64
1stFlrSF           int64
2ndFlrSF           int64
LowQualFinSF       int64
GrLivArea          int64
GarageArea         int64
WoodDeckSF         int64
OpenPorchSF        int64
EnclosedPorch      int64
3SsnPorch          int64
ScreenPorch        int64
PoolArea           int64
MiscVal            int64
MoSold             int64
YrSold             int64
OverallQual        int64
OverallCond        int64
BsmtFullBath       int64
BsmtHalfBath       int64
FullBath           int64
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
                   ...  
HeatingQC_Ex       uint8
HeatingQC_Fa       uint8
HeatingQC_Gd       uint8
HeatingQC_Po       uint8
HeatingQC_TA       uint8
CentralAir_N       uint8
CentralAir_Y       uint8
KitchenQual_Ex     uint8
KitchenQual_Fa     uint8


In [57]:
X.columns[X.dtypes == 'object']

Index([], dtype='object')

In [58]:
y.shape

(1460, 1)

<center> Q3 - FEATURE SELECTION USING KBEST </center>

In [59]:
from sklearn.feature_selection import SelectKBest, chi2, f_classif

skb_f = SelectKBest(f_classif, k=50)
skb_chi2 = SelectKBest(chi2, k=50)

# Train the selector on the data.
skb_f.fit(X, y)
skb_chi2.fit(X, y)

# Examine the results.
kbest = pd.DataFrame([X.columns, list(skb_f.scores_), list(skb_chi2.scores_)], 
                     index=['feature','f_classif','chi2 score']).T.sort_values('f_classif', ascending=False)


| | feature | f_classif | chi2 score |
|---|---|---|---|
| 200 | SaleType_COD | 185.134 | 159.655 |
| 207 | SaleType_Oth | 41.5451 | 40.3663 |
| 33 | MSZoning_C (all) | 29.5713 | 28.8244 |
| 208 | SaleType_WD | 29.423 | 3.81777 |
| 3 | YearRemodAdd | 27.4516 | 5.79003 |
| 137 | Exterior1st_Stone | 27.4169 | 26.9109 |
| 2 | YearBuilt | 17.9344 | 8.20402 |
| 32 | GarageCars | 17.8986 | 5.5919 |
| 22 | OverallQual | 15.7983 | 4.90442 |
| 35 | MSZoning_RH | 15.0067 | 14.7112 |


In [84]:
print len(kbest[kbest['f_classif'] < 0.1])
kbest[kbest['f_classif'] < 0.1]

37


| | feature | f_classif | chi2 score |
|---|---|---|---|
| 164 | ExterCond_Gd | 0.0956066 | 0.0861583 |
| 64 | Neighborhood_Crawfor | 0.087876 | 0.0849176 |
| 182 | HeatingQC_Po | 0.0742722 | 0.0743194 |
| 128 | Exterior1st_AsphShn | 0.0742722 | 0.0743194 |
| 131 | Exterior1st_CBlock | 0.0742722 | 0.0743194 |
| 97 | Condition2_RRAe | 0.0742722 | 0.0743194 |
| 146 | Exterior2nd_CBlock | 0.0742722 | 0.0743194 |
| 134 | Exterior1st_ImStucc | 0.0742722 | 0.0743194 |
| 123 | RoofMatl_Roll | 0.0742722 | 0.0743194 |
| 165 | ExterCond_Po | 0.0742722 | 0.0743194 |


<center> Q3 - FEATURE SELECTION USING RECURSIVE FEATURE ELIMINATION </center>

In [61]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
selector = RFECV(lr, step=2, cv=10)
selector = selector.fit(X, y)

print selector.support_
print selector.ranking_

[ True False  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True False  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  T

In [68]:
RFS_df = pd.DataFrame([X.columns, selector.support_, selector.ranking_], index=['col_name', 'support', 'ranking']).T
RFS_df[RFS_df['ranking'] > 1]

| | col_name | support | ranking |
|---|---|---|---|
| 1 | LotArea | False | 2 |
| 97 | Condition2_RRAe | False | 2 |


<center> Q3 - FEATURE SELECTION USING LASSO PENALTY </center>

In [76]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
Xs = ss.fit_transform(X)

from sklearn.linear_model import LogisticRegressionCV

lrcv = LogisticRegressionCV(penalty='l1', Cs=10, cv=3, solver='liblinear')
lrcv.fit(Xs, y)

LogisticRegressionCV(Cs=10, class_weight=None, cv=3, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
           refit=True, scoring=None, solver='liblinear', tol=0.0001,
           verbose=0)
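
The fit above uses `class_weight=None`, so with a rare positive class the model can score well on accuracy while missing most abnormal sales. One commonly used option is `class_weight='balanced'` together with stratified folds and a recall-oriented metric. This is a sketch on synthetic imbalanced data (the data, thresholds and `*_demo` names are invented), not a result for the housing set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced problem: only a small share of rows are positive.
rng = np.random.RandomState(5)
n = 1000
X_demo = rng.normal(size=(n, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=n) > 1.8).astype(int)

# 'balanced' reweights each class inversely to its frequency.
clf = LogisticRegression(penalty='l1', solver='liblinear', class_weight='balanced')

# Stratified folds keep the positive rate similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recall = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='recall')
```

`LogisticRegressionCV` accepts the same `class_weight` argument, so the cell above could be adapted in the same way. The signs of the fitted coefficients then indicate whether a feature is a positive or negative predictor of an abnormal sale.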

In [144]:
lrcv_coef = pd.DataFrame(lrcv.coef_)
lrcv_coef = lrcv_coef.T
lrcv_coef['col_name'] = X.columns
lrcv_coef.rename({0:'coef'}, axis=1, inplace=True)
lrcv_coef = lrcv_coef[['col_name', 'coef']]
print len(lrcv_coef),  ' columns in total'
print len(lrcv_coef[lrcv_coef['coef'] == 0]), ' columns which had their  coefficient zeroed out'
print "="*60
print lrcv_coef[lrcv_coef['coef'] == 0]

209  columns in total
101  columns which had their  coefficient zeroed out
                 col_name  coef
0              MSSubClass   0.0
4              BsmtFinSF1   0.0
6               BsmtUnfSF   0.0
8                1stFlrSF   0.0
9                2ndFlrSF   0.0
11              GrLivArea   0.0
13             WoodDeckSF   0.0
15          EnclosedPorch   0.0
20                 MoSold   0.0
22            OverallQual   0.0
24           BsmtFullBath   0.0
28           BedroomAbvGr   0.0
29           KitchenAbvGr   0.0
32             GarageCars   0.0
36            MSZoning_RL   0.0
37            MSZoning_RM   0.0
40           LotShape_IR1   0.0
41           LotShape_IR2   0.0
42           LotShape_IR3   0.0
43           LotShape_Reg   0.0
46        LandContour_Low   0.0
47        LandContour_Lvl   0.0
53          LotConfig_FR3   0.0
54       LotConfig_Inside   0.0
55          LandSlope_Gtl   0.0
56          LandSlope_Mod   0.0
61   Neighborhood_BrkSide   0.0
64   Neighborhood_Crawfor   0

<h2><center>QUICK REFERENCE</center></h2>

#### Standardizing the predictors when using a regularized model

In [60]:
# Initialize the StandardScaler object.
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

# Use the "fit_transform" function to standardize the X design matrix.
Xs = ss.fit_transform(X)