# Machine Learning Example Case: 
House Sale Price Prediction (like Zillow's "zestimate") 

When you see a line starting with "TASK", do that task!

### TASK: Click on the next cell and press shift-enter
The code in the cell will be executed.   
The result of the last command, or the representation of the last variable in that cell, will be displayed.

In [1]:
import pandas as pd
housing = pd.read_csv('data/housing_processed.csv')
housing.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,...,GarageType_NA,SaleType_COD,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD
0,1,60,65.0,8450,7,5,2003,2003,196.0,Gd,...,0,0,0,0,0,0,0,0,0,1
1,2,20,80.0,9600,6,8,1976,1976,0.0,TA,...,0,0,0,0,0,0,0,0,0,1
2,3,60,68.0,11250,7,5,2001,2002,162.0,Gd,...,0,0,0,0,0,0,0,0,0,1
3,4,70,60.0,9550,7,5,1915,1970,0.0,TA,...,0,0,0,0,0,0,0,0,0,1
4,5,60,84.0,14260,8,5,2000,2000,350.0,Gd,...,0,0,0,0,0,0,0,0,0,1


### Filtering Columns
Some columns were not removed when equivalent coded ones were created

In [2]:
housing[["ExterQual","ExterQual_Coded"]].head()

Unnamed: 0,ExterQual,ExterQual_Coded
0,Gd,3
1,TA,2
2,Gd,3
3,TA,2
4,Gd,3


### Filtering in a series
dtypes returns a Series.   
Filtering a Series works much like filtering a DataFrame.

In [3]:
type(housing.dtypes==object)

pandas.core.series.Series

In [4]:
housing.dtypes[housing.dtypes==object].shape

(16,)

In [5]:
"SalePrice" in housing.columns 

True

### Removing Undesired Columns
In my case, my colleague left the non-numeric columns above in the data during preprocessing, after creating the corresponding coded versions.

In [6]:
len(housing.columns)

238

In [7]:
# We could drop columns by name:
housing_ml = housing.drop(columns=["ExterQual"])

In [8]:
# or wholesale, keeping only numeric:
housing_ml = housing.loc[:,housing.dtypes != object]

In [9]:
len(housing_ml.columns)

222
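As an alternative to the boolean mask on dtypes, pandas also offers `select_dtypes`. The sketch below uses a small made-up frame (the values are hypothetical stand-ins, not rows from the housing file) and keeps only the numeric columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Gd"],   # raw categorical column
    "ExterQual_Coded": [3, 2, 3],      # its numeric encoding
    "LotArea": [100, 200, 300],
})

# Keep only the columns with a numeric dtype
numeric_only = toy.select_dtypes(include="number")
print(list(numeric_only.columns))
```

Note that `include="number"` selects by what to keep rather than what to drop, so columns of other non-numeric types would be removed as well.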

# Separate Target into new Variable
- "SalePrice" is the target.    
 - The value we want to predict from other values (features) for a house.  
- Currently it is a column like the other features.   
- Scikit-learn needs 2 variables: the features (X) and the target (y) to be predicted, with the target in its own 1-D array 

# NumPy
- Both Pandas and scikit-learn are built on top of NumPy
- scikit-learn accepts DataFrames, but converts them to NumPy arrays internally
- Here we pass X and y explicitly as NumPy "ndarrays"
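A minimal sketch of the DataFrame-to-ndarray step, on a made-up two-row frame (the numbers are hypothetical). It uses `.to_numpy()`, which the pandas documentation recommends over the older `.values` attribute; both return an ndarray here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the housing file
toy = pd.DataFrame({"LotArea": [100, 200], "SalePrice": [200000, 180000]})

# pop() removes the target column in place and returns it as a Series
y_toy = toy.pop("SalePrice").to_numpy()   # 1-D target array
X_toy = toy.to_numpy()                    # 2-D feature array

print(type(X_toy).__name__, X_toy.shape, y_toy.shape)
```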

In [10]:
housing_ml.shape

(1460, 222)

In [11]:
# Split data as features and target
# take "SalePrice" values into its own 1-D array 
sale_price = housing_ml.pop('SalePrice')
type(sale_price)

pandas.core.series.Series

In [12]:
# pop removes the column
# "in place" operation
# now housing_ml has one less column
housing_ml.shape

(1460, 221)

In [13]:
y = sale_price.values
type(y)

numpy.ndarray

# See what other methods are available for ndarray

In [14]:
# press tab after putting cursor after dot "."
#y. #uncomment, press tab after . 

In [15]:
y.shape
# (1460,)
# a one-element tuple; the trailing comma distinguishes it from (1460), which is just the integer 1460
# means it is a 1-d array with 1460 elements

(1460,)
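The difference between a 1-D shape and a 2-D shape matters because scikit-learn expects X to be 2-D and y to be 1-D. A small sketch on a made-up array:

```python
import numpy as np

a = np.arange(4)               # 1-D array with 4 elements
print(a.shape)                 # (4,)
print(type((4,)).__name__)     # tuple
print(type((4)).__name__)      # int

# reshape(-1, 1) turns the 1-D array into a single-column 2-D array
col = a.reshape(-1, 1)
print(col.shape)               # (4, 1)
```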

### TASK: get the ndarray version of the feature dataframe and put it into variable X

In [16]:
X = housing_ml.values

### TASK: check the shape of X

In [17]:
X.shape

(1460, 221)

### TASK: programmatically check if X and y have a matching number of rows

In [18]:
X.shape[0] == y.shape[0]

True

# First Model
Q: What would you do if you had no features?

A: You would always estimate the average house price.

We will have to do much better than that.  
We have so much data to base our decision on.   
It can still serve us as a baseline to compare.   
An inferior baseline could be: a random value in the range between the min and max of the training data. 

In [19]:
# Import estimator
from sklearn.dummy import DummyRegressor
# Instantiate estimator
# guess the mean every single time
mean_reg = DummyRegressor(strategy='mean')
# fit estimator
mean_reg.fit(X, y)

DummyRegressor()

In [20]:
# predict
mean_reg.predict(X)

array([180921.19589041, 180921.19589041, 180921.19589041, ...,
       180921.19589041, 180921.19589041, 180921.19589041])

## Evaluating The Model
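scikit-learn regressors have a score function.   
It returns the coefficient of determination (R²): how much better your model does compared to a baseline that always predicts the mean.   
Technically: the fraction of the variance in the target that the model explains, relative to that baseline.

"Mean" *is* that baseline, so its score will be 0. A model that does worse than the mean gets a negative score.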

In [21]:
mean_reg.score(X, y)

0.0
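The score can be reproduced by hand. The sketch below assumes the usual definition R² = 1 - SS_res / SS_tot, which is what scikit-learn documents for regressor `score`; the helper name `r2_score_by_hand` and the toy prices are made up for illustration.

```python
from statistics import mean

def r2_score_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)              # mean-baseline error
    return 1 - ss_res / ss_tot

# Hypothetical toy prices, not taken from the housing data
y_true = [100.0, 150.0, 200.0, 250.0]

mean_pred = [mean(y_true)] * len(y_true)   # always predict the mean
bad_pred = [250.0, 200.0, 150.0, 100.0]    # predictions in reverse order

r2_mean = r2_score_by_hand(y_true, mean_pred)
r2_bad = r2_score_by_hand(y_true, bad_pred)
r2_perfect = r2_score_by_hand(y_true, y_true)

print(r2_mean, r2_bad, r2_perfect)  # 0.0 -3.0 1.0
```

The mean predictor scores exactly 0, a perfect predictor scores 1, and predictions that are worse than the mean push the score below 0.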

## Fitting a linear model 
First, let's use only one feature 

In [22]:
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()

In [23]:
X_lf = housing_ml[['LotFrontage']]

In [24]:
linear_model.fit(X_lf, y)

LinearRegression()

Above, you see that it used defaults to create the estimator.   
You could google "LinearRegression sklearn" and find the documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
to see the options for the other parameters.

In [25]:
y_pred = linear_model.predict(X_lf)

In [26]:
linear_model.score(X_lf, y)

0.11215612336205605

### Chart Showing the Linear Fit
matplotlib is the most common visualization library
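A minimal sketch of such a chart, using synthetic data in place of the notebook's `X_lf` and `y` (the generated numbers are assumptions for illustration only; the axis labels simply reuse the notebook's column names).

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for one feature and the target (assumption)
rng = np.random.default_rng(0)
x_demo = rng.uniform(20, 120, size=50)
y_demo = 1000.0 * x_demo + rng.normal(scale=20000.0, size=50)

# Fit on a single feature; scikit-learn needs a 2-D X, hence reshape(-1, 1)
demo_model = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)

# Evaluate the fitted line on an evenly spaced grid
x_line = np.linspace(x_demo.min(), x_demo.max(), 100)
y_line = demo_model.predict(x_line.reshape(-1, 1))

fig, ax = plt.subplots()
ax.scatter(x_demo, y_demo, alpha=0.5, label="observations")
ax.plot(x_line, y_line, color="red", label="linear fit")
ax.set_xlabel("LotFrontage")
ax.set_ylabel("SalePrice")
ax.legend()
```

In the notebook itself, the same two calls could presumably take `X_lf`, `y` and `y_pred` directly.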

# Using all predictors!

In [28]:
# We had 81 columns (80 features) in original dataset,
# coded as 221 features!
X.shape

(1460, 221)

In [29]:
#linear_model.fit(X, y)

In [30]:
#y_pred3 = linear_model.predict(X)

In [31]:
#linear_model.score(X, y)

1) Split the data for training and testing, using 80 percent as training data. 
Use 21 as your randomization seed (so that you get the same results we grade against). 
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html 
random_state = 21, train_size = .8 


In [32]:
from sklearn.model_selection import train_test_split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

In [73]:
X_train.shape[0] == y_train.shape[0]

True

In [74]:
X_test.shape[0] == y_test.shape[0]

True

2) Analyze which feature alone would give the best prediction; list the scores and RMSE values achieved by the top 10 predictors by score. 
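One possible reading of this task is to fit a separate one-feature linear model per column and rank the columns by held-out score. The sketch below does that on synthetic data (the data, the names `X_demo`/`y_demo`, and the choice to score on the test split are all assumptions, not the notebook's method).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the target depends mostly on column 0
rng = np.random.default_rng(21)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=21
)

results = []
for j in range(X_tr.shape[1]):
    # X[:, [j]] keeps the column 2-D, as scikit-learn expects
    model = LinearRegression().fit(X_tr[:, [j]], y_tr)
    r2 = model.score(X_te[:, [j]], y_te)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te[:, [j]])))
    results.append((j, r2, rmse))

# Highest score first; keep at most the top 10
results.sort(key=lambda r: r[1], reverse=True)
for j, r2, rmse in results[:10]:
    print(f"feature {j}: R^2={r2:.3f}, RMSE={rmse:.3f}")
```

With the housing data, `housing_ml.columns[j]` would map each position back to a feature name.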

In [34]:
input_columns = list(housing_ml.columns)

In [39]:
linear_model.fit(X_train, y_train)

LinearRegression()

In [85]:
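model = LinearRegression()
# fit the model
model.fit(X, y)
# use the coefficients as a rough importance measure
# (they depend on each feature's scale, so they are only comparable after standardizing)
importance = model.coef_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))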

Feature: 0, Score: 0.61873
Feature: 1, Score: -104.76818
Feature: 2, Score: 82.91462
Feature: 3, Score: 0.57011
Feature: 4, Score: 8283.85534
Feature: 5, Score: 5379.56249
Feature: 6, Score: 261.54243
Feature: 7, Score: 6.79768
Feature: 8, Score: 30.77009
Feature: 9, Score: 18.20556
Feature: 10, Score: 2.11373
Feature: 11, Score: -2.06695
Feature: 12, Score: 18.25228
Feature: 13, Score: 12.11180
Feature: 14, Score: 30.51743
Feature: 15, Score: -5.95857
Feature: 16, Score: 36.67065
Feature: 17, Score: 1407.35083
Feature: 18, Score: 1238.64194
Feature: 19, Score: 1796.26409
Feature: 20, Score: 2394.63247
Feature: 21, Score: -6076.78357
Feature: 22, Score: -14240.75205
Feature: 23, Score: 2547.03836
Feature: 24, Score: 6356.12542
Feature: 25, Score: 4446.39314
Feature: 26, Score: 15.88626
Feature: 27, Score: 10.96318
Feature: 28, Score: 9.95645
Feature: 29, Score: -0.21072
Feature: 30, Score: 37.38819
Feature: 31, Score: 34.41951
Feature: 32, Score: 70.84911
Feature: 33, Score: -0.14885
...

In [98]:
import numpy as np

sorted_indices = np.argsort(importance)[::-1]

sorted_indices

array([  4,  16,  14,  12,   9,  25,  13,   6,   2,  26,   3,  11,  51,
        22,  34, 210,  27,   7, 218,  23,   5,  19,  44,  36, 164,   8,
        40,  47, 195,  21,  28,  41,   0,  49,  45,  29,  58,  38,  43,
        95,  10,  52,   1, 139,  48, 220,  65,  62,  77, 187,  50,  17,
       108,  24, 155,  56,  75,  90, 130, 109,  35, 186,  46,  20,  89,
       177, 104, 102,  55,  81,  68, 141, 204, 103,  99,  79,  15, 190,
       180,  37,  59, 206,  39, 188,  57, 145, 152, 182,  94, 147, 175,
       173,  85, 189, 208, 132, 114, 212,  72, 158, 100, 181,  31, 135,
        76, 185, 167,  97, 125,  70,  88, 160, 200, 149, 129,  73,  42,
       101, 191,  30, 113,  82, 161, 170,  18,  93,  61,  66, 184, 110,
       196, 217, 131, 106, 163,  71, 165,  96, 127,  86,  92, 157,  91,
       107, 201, 211, 123, 205, 213, 214, 215, 216, 209, 207, 199, 183,
       203, 202, 198, 197, 194, 193, 192,  32,  33,  53, 179, 174,  54,
       128, 134, 136, 137, 138, 176, 140, 142, 143, 144, 146, 14

In [84]:
Dict = {}
print("Empty Dictionary: ")
print(Dict)

Empty Dictionary: 
{}


In [87]:
# Create a list of features: done
feature_list = list(importance)

# Save the results inside a DataFrame using feature_list as an index
relative_importances = pd.DataFrame(index=feature_list, data=importance, columns=["importance"])

# Sort the DataFrame to learn most important features
relative_importances.sort_values(by="importance", ascending=False)

Unnamed: 0,importance
136960.601807,136960.601807
135492.398731,135492.398731
106326.197955,106326.197955
78280.684021,78280.684021
63900.928961,63900.928961
...,...
-36764.324272,-36764.324272
-45516.380532,-45516.380532
-64876.001927,-64876.001927
-194699.754152,-194699.754152


In [88]:
from sklearn.tree import DecisionTreeRegressor


# define the model
model = DecisionTreeRegressor()
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance

Feature: 0, Score: 0.00132
Feature: 1, Score: 0.00055
Feature: 2, Score: 0.01207
Feature: 3, Score: 0.00778
Feature: 4, Score: 0.63603
Feature: 5, Score: 0.00298
Feature: 6, Score: 0.01233
Feature: 7, Score: 0.00390
Feature: 8, Score: 0.00239
Feature: 9, Score: 0.02391
Feature: 10, Score: 0.00062
Feature: 11, Score: 0.00739
Feature: 12, Score: 0.02841
Feature: 13, Score: 0.01875
Feature: 14, Score: 0.03714
Feature: 15, Score: 0.00013
Feature: 16, Score: 0.10126
Feature: 17, Score: 0.00039
Feature: 18, Score: 0.00000
Feature: 19, Score: 0.00266
Feature: 20, Score: 0.00024
Feature: 21, Score: 0.00185
Feature: 22, Score: 0.00580
Feature: 23, Score: 0.00321
Feature: 24, Score: 0.00035
Feature: 25, Score: 0.02298
Feature: 26, Score: 0.00878
Feature: 27, Score: 0.00392
Feature: 28, Score: 0.00169
Feature: 29, Score: 0.00086
Feature: 30, Score: 0.00000
Feature: 31, Score: 0.00003
Feature: 32, Score: 0.00000
Feature: 33, Score: 0.00000
Feature: 34, Score: 0.00472
Feature: 35, Score: 0.00027
...

3) Select all possible pairs of these top 10 predictors and train the 45 resulting linear models; list the scores and RMSE values achieved by the top 10 by score.
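A hedged sketch of the pair search: `itertools.combinations` enumerates the pairs, and each pair gets its own linear model. The data here is synthetic with only 4 candidate columns (an assumption to keep the example small); 10 candidates would give the 45 models the task mentions.

```python
import itertools
import math

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

print(math.comb(10, 2))  # 45 pairs from 10 candidates

# Synthetic stand-in data (assumption): columns 0 and 1 carry the signal
rng = np.random.default_rng(21)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=21
)

candidate_cols = [0, 1, 2, 3]  # hypothetical column positions
pair_results = []
for pair in itertools.combinations(candidate_cols, 2):
    cols = list(pair)  # index with a list so both columns are selected
    model = LinearRegression().fit(X_tr[:, cols], y_tr)
    r2 = model.score(X_te[:, cols], y_te)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te[:, cols])))
    pair_results.append((pair, r2, rmse))

pair_results.sort(key=lambda r: r[1], reverse=True)
for pair, r2, rmse in pair_results[:10]:
    print(pair, f"R^2={r2:.3f}", f"RMSE={rmse:.3f}")
```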


In [99]:
top10 = [4, 16, 14, 12, 9, 25, 13, 6, 2, 26]

In [100]:
import itertools as it

In [101]:
combinations = it.combinations(top10,2)

In [102]:
for i in combinations:
    print(i, type(i))

(4, 16) <class 'tuple'>
(4, 14) <class 'tuple'>
(4, 12) <class 'tuple'>
(4, 9) <class 'tuple'>
(4, 25) <class 'tuple'>
(4, 13) <class 'tuple'>
(4, 6) <class 'tuple'>
(4, 2) <class 'tuple'>
(4, 26) <class 'tuple'>
(16, 14) <class 'tuple'>
(16, 12) <class 'tuple'>
(16, 9) <class 'tuple'>
(16, 25) <class 'tuple'>
(16, 13) <class 'tuple'>
(16, 6) <class 'tuple'>
(16, 2) <class 'tuple'>
(16, 26) <class 'tuple'>
(14, 12) <class 'tuple'>
(14, 9) <class 'tuple'>
(14, 25) <class 'tuple'>
(14, 13) <class 'tuple'>
(14, 6) <class 'tuple'>
(14, 2) <class 'tuple'>
(14, 26) <class 'tuple'>
(12, 9) <class 'tuple'>
(12, 25) <class 'tuple'>
(12, 13) <class 'tuple'>
(12, 6) <class 'tuple'>
(12, 2) <class 'tuple'>
(12, 26) <class 'tuple'>
(9, 25) <class 'tuple'>
(9, 13) <class 'tuple'>
(9, 6) <class 'tuple'>
(9, 2) <class 'tuple'>
(9, 26) <class 'tuple'>
(25, 13) <class 'tuple'>
(25, 6) <class 'tuple'>
(25, 2) <class 'tuple'>
(25, 26) <class 'tuple'>
(13, 6) <class 'tuple'>
(13, 2) <class 'tuple'>
(13, 26

In [110]:
linear_model

LinearRegression()

In [None]:
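# column positions of one candidate pair (from the combinations above)
features = [4, 16]

# select the two columns by position; 'SalePrice' was already popped into y
X_pair = housing_ml.iloc[:, features].values

# use a separate name so the full feature matrix X is not overwritten
model = linear_model.fit(X_pair, y)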

4) Train a single model using all features. Calculate its RMSE and score. Observe how much of the predictive power was in the feature pairs versus all features.

In [86]:
# example of mutual information feature selection for numerical input data
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from matplotlib import pyplot

# feature selection
def select_features(X_train, y_train, X_test):
	# configure to select all features
	fs = SelectKBest(score_func=mutual_info_regression, k='all')
	# learn relationship from training data
	fs.fit(X_train, y_train)
	# transform train input data
	X_train_fs = fs.transform(X_train)
	# transform test input data
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# what are scores for the features
for i in range(len(fs.scores_)):
	print('Feature %d: %f' % (i, fs.scores_[i]))

Feature 0: 0.003829
Feature 1: 0.276598
Feature 2: 0.164263
Feature 3: 0.161541
Feature 4: 0.516781
Feature 5: 0.092374
Feature 6: 0.360999
Feature 7: 0.241032
Feature 8: 0.082815
Feature 9: 0.169728
Feature 10: 0.011823
Feature 11: 0.136570
Feature 12: 0.360355
Feature 13: 0.297991
Feature 14: 0.205934
Feature 15: 0.002768
Feature 16: 0.467822
Feature 17: 0.023380
Feature 18: 0.014285
Feature 19: 0.273111
Feature 20: 0.082441
Feature 21: 0.060204
Feature 22: 0.020151
Feature 23: 0.225751
Feature 24: 0.163579
Feature 25: 0.377579
Feature 26: 0.373009
Feature 27: 0.096832
Feature 28: 0.147105
Feature 29: 0.024940
Feature 30: 0.000000
Feature 31: 0.015418
Feature 32: 0.000000
Feature 33: 0.014071
Feature 34: 0.000000
Feature 35: 0.020270
Feature 36: 0.309540
Feature 37: 0.009028
Feature 38: 0.324100
Feature 39: 0.042887
Feature 40: 0.088292
Feature 41: 0.141790
Feature 42: 0.033581
Feature 43: 0.162118
Feature 44: 0.325650
Feature 45: 0.028132
Feature 46: 0.210571
Feature 47: 0.264917
...

5) Use the 5NN and 10NN regressors with all features, and list the RMSE and score for these 2 models. 
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html 
Observe whether the results are better than linear regression.
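A sketch of the comparison on synthetic data (an assumption; the housing results may well differ). It fits a linear model, a 5NN regressor and a 10NN regressor on the same split and reports held-out score and RMSE for each.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption): a linear target with noise
rng = np.random.default_rng(21)
X_demo = rng.normal(size=(300, 5))
true_coef = np.array([4.0, -2.0, 1.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=21
)

models = {
    "linear": LinearRegression(),
    "5NN": KNeighborsRegressor(n_neighbors=5),
    "10NN": KNeighborsRegressor(n_neighbors=10),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    r2 = model.score(X_te, y_te)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    scores[name] = (r2, rmse)
    print(f"{name}: R^2={r2:.3f}, RMSE={rmse:.3f}")
```

Because this toy target is linear by construction, the linear model comes out ahead here; that says nothing about which model wins on the housing data. KNN is distance-based, so on real features with very different scales it is usually worth standardizing first.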

In [118]:
from sklearn.neighbors import KNeighborsRegressor
neigh = KNeighborsRegressor(n_neighbors=5)
neigh.fit(X_train, y_train)

KNeighborsRegressor()

In [109]:
from sklearn.neighbors import KNeighborsRegressor
# 10NN regressor; NearestNeighbors only searches for neighbors and cannot predict or score
neigh10 = KNeighborsRegressor(n_neighbors=10)
neigh10.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=10)

6) Which regressor is better for inference? 

KNN would be better for inference.