# The dataset

We will explore regression models using a standard benchmarking dataset, the Ames Housing data. It comprises data about individual housing properties in Ames, Iowa, USA, from 2006 to 2010. More information on the dataset can be found here:
Dataset description: http://jse.amstat.org/v19n3/decock.pdf
Dataset documentation: http://jse.amstat.org/v19n3/decock/DataDocumentation.txt
The actual dataset in tab-separated text format: http://jse.amstat.org/v19n3/decock/AmesHousing.txt

In [5]:
import pandas as pd

We will not use the complete set of attributes in the dataset; instead, we select the following ones:

Overall Qual (Ordinal): Rates the overall material and finish of the house
       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor
       
Overall Cond (Ordinal): Rates the overall condition of the house
       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average	
       5	Average
       4	Below Average	
       3	Fair
       2	Poor
       1	Very Poor 
       
Gr Liv Area (Continuous): Above grade (ground) living area square feet

Central Air (Nominal): Central air conditioning
       N	No
       Y	Yes

Total Bsmt SF (Continuous): Total square feet of basement area

SalePrice (Continuous): Sale price in $$

In [6]:
columns = ['Overall Qual', 'Overall Cond', 'Gr Liv Area', 'Central Air', 'Total Bsmt SF', 'SalePrice']

In [7]:
df = pd.read_csv('http://jse.amstat.org/v19n3/decock/AmesHousing.txt',sep='\t',usecols=columns)

We can also read the dataset from a local copy, assuming AmesHousing.txt has been downloaded to the working directory.

In [8]:
df = pd.read_csv('AmesHousing.txt',sep='\t',usecols=columns)

Let's inspect the first 5 rows of the dataframe populated with the data.

In [9]:
df.head()

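| | Overall Qual | Overall Cond | Total Bsmt SF | Central Air | Gr Liv Area | SalePrice |
|---|---|---|---|---|---|---|
| 0 | 6 | 5 | 1080.0 | Y | 1656 | 215000 |
| 1 | 5 | 6 | 882.0 | Y | 896 | 105000 |
| 2 | 6 | 6 | 1329.0 | Y | 1329 | 172000 |
| 3 | 7 | 5 | 2110.0 | Y | 2110 | 244000 |
| 4 | 5 | 5 | 928.0 | Y | 1629 | 189900 |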


All attributes above are numerical, apart from Central Air.

Let's also inspect the shape (dimensions) of the dataset.

In [10]:
df.shape

(2930, 6)

We can convert the Central Air variable to a numerical one by mapping 'N' values to 0 and 'Y' values to 1, and then inspect the data again.

In [11]:
df['Central Air'] = df['Central Air'].map({'N': 0, 'Y': 1})

In [12]:
df.head()

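| | Overall Qual | Overall Cond | Total Bsmt SF | Central Air | Gr Liv Area | SalePrice |
|---|---|---|---|---|---|---|
| 0 | 6 | 5 | 1080.0 | 1 | 1656 | 215000 |
| 1 | 5 | 6 | 882.0 | 1 | 896 | 105000 |
| 2 | 6 | 6 | 1329.0 | 1 | 1329 | 172000 |
| 3 | 7 | 5 | 2110.0 | 1 | 2110 | 244000 |
| 4 | 5 | 5 | 928.0 | 1 | 1629 | 189900 |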
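
One thing worth knowing about `Series.map` with a dictionary is that any value that is not a key in the dictionary becomes `NaN`. The minimal sketch below uses a made-up toy series (not the Ames data) to show a quick way of checking that a mapping covered every value.

```python
import pandas as pd

# Toy series for illustration only; '?' stands in for an unexpected value.
toy = pd.Series(['Y', 'N', 'Y', '?'])

mapped = toy.map({'N': 0, 'Y': 1})

# Values missing from the dictionary are mapped to NaN,
# so counting nulls after the mapping reveals unmapped entries.
n_unmapped = mapped.isnull().sum()
print(mapped.tolist())
print(n_unmapped)
```

In this notebook, the missing-value check in the next step reports 0 for Central Air, which is consistent with the column containing only 'N' and 'Y'.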


Before progressing any further, let's check for missing values using the isnull() method of the dataframe.

In [13]:
df.isnull().sum()

Overall Qual     0
Overall Cond     0
Total Bsmt SF    1
Central Air      0
Gr Liv Area      0
SalePrice        0
dtype: int64

There is one missing value, in the Total Bsmt SF attribute. As it affects just one record, the simplest approach that sacrifices little information is to remove that record. We will use dropna, which drops records containing missing ("not a number", NaN) values.

In [14]:
df = df.dropna(axis=0)

In [15]:
df.isnull().sum()

Overall Qual     0
Overall Cond     0
Total Bsmt SF    0
Central Air      0
Gr Liv Area      0
SalePrice        0
dtype: int64

Good: no missing values remain.

# Applying Support Vector Regression (SVR)

In [37]:
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt

In [38]:
target = 'SalePrice'
features = df.columns[df.columns != target]
X = df[features].values
y = df[target].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

In [39]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit only on X_train
scaler.fit(X_train)
# Scale both X_train and X_test
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
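
The scaler is fitted on the training set only so that the test set plays no part in choosing the scaling parameters. A small sketch with made-up numbers (not the Ames data) shows what `StandardScaler` stores and applies: the per-column mean and standard deviation of the training data. The names `toy_train`, `toy_test` and `toy_scaler` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, for illustration only.
toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_test = np.array([[10.0]])

toy_scaler = StandardScaler().fit(toy_train)

# transform() applies (x - mean_) / scale_ using the *training* statistics.
manual = (toy_test - toy_scaler.mean_) / toy_scaler.scale_
scaled_test = toy_scaler.transform(toy_test)

print(toy_scaler.mean_, toy_scaler.scale_)
print(scaled_test, manual)
```

The scaled test value is not constrained to look "standard": it is simply expressed in units of the training distribution.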

In [40]:
from sklearn.svm import SVR

In [48]:
regressor = SVR(kernel='rbf')

In [49]:
regressor.fit(X_train,y_train)

In [50]:
y_pred = regressor.predict(X_test)

In [52]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
# RMSE as the square root of MSE; the squared= argument of
# mean_squared_error is deprecated in newer scikit-learn releases
rmse = np.sqrt(mse)
print(f'mae: {mae}')
print(f'mse: {mse}')
print(f'rmse: {rmse}')
print(regressor.score(X_test, y_test))

mae: 54472.012672406
mse: 6415620349.938111
rmse: 80097.56769052423
-0.050928099860566345
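
The last number printed is the value returned by `regressor.score`, which for scikit-learn regressors is the coefficient of determination R². A negative value means the predictions have a larger squared error than simply predicting the mean of `y_test` for every record. The sketch below computes R² by hand on made-up numbers to make that definition concrete; the helper name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up values, for illustration only.
y_true = [100.0, 200.0, 300.0]

# Predicting the mean everywhere gives a score of zero.
r2_mean = r_squared(y_true, [200.0, 200.0, 200.0])
# A constant prediction away from the mean gives a negative score.
r2_off = r_squared(y_true, [150.0, 150.0, 150.0])
print(r2_mean, r2_off)
```

One possible explanation for the weak result here, offered as an assumption rather than something verified in this notebook, is that the default SVR regularisation parameter `C=1.0` is very small relative to sale prices in the hundreds of thousands of dollars, which keeps the fitted function close to a constant.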


# k-fold cross-validation

In [53]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

In [54]:
cv = KFold(n_splits=5, random_state=1, shuffle=True)
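
To see what this `KFold` object does, the sketch below splits ten made-up sample indices using the same settings. Each sample lands in exactly one test fold, and `shuffle=True` together with a fixed `random_state` makes the assignment reproducible. The names `toy_X` and `toy_cv` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(-1, 1)  # ten made-up samples

toy_cv = KFold(n_splits=5, random_state=1, shuffle=True)

test_indices = []
for fold, (train_idx, test_idx) in enumerate(toy_cv.split(toy_X)):
    # Each fold holds out 2 samples for testing and trains on the other 8.
    print(f'fold {fold}: train={train_idx}, test={test_idx}')
    test_indices.extend(test_idx.tolist())
```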

In [58]:
regressor = SVR(kernel='rbf')
# cross_val_score clones the regressor and fits it on each fold,
# so no separate fit/predict call is needed here.
# Note: X and y are the full, unscaled arrays defined earlier.
scores = cross_val_score(regressor, X, y, scoring='neg_mean_absolute_error',
                         cv=cv, n_jobs=-1)
print(scores)

[-58433.58769918 -52042.32903469 -57874.81892764 -53884.53739124
 -57831.76667253]
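
Note that the call above passes the original `X`, which was never scaled; only `X_train` and `X_test` went through the `StandardScaler`. If scaled features are wanted during cross-validation as well, one common approach is to wrap the scaler and the regressor in a scikit-learn pipeline, so that the scaler is re-fitted on the training part of every fold. The sketch below illustrates this on synthetic data from `make_regression`. It is not a rerun of the notebook, so its scores are not comparable to the ones printed above, and the names `X_demo`, `y_demo`, `model` and `cv_demo` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the Ames data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# The scaler is fitted inside each fold, on that fold's training part only.
model = make_pipeline(StandardScaler(), SVR(kernel='rbf'))

cv_demo = KFold(n_splits=5, random_state=1, shuffle=True)
demo_scores = cross_val_score(model, X_demo, y_demo,
                              scoring='neg_mean_absolute_error', cv=cv_demo)
print(demo_scores)
```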


In [59]:
print(np.average(scores))

-56013.407945056344


Above is the average score over the 5 folds, which is the negative of the mean absolute error.
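
scikit-learn reports error metrics with a negative sign because its scorers follow a "higher is better" convention. To read the result as an ordinary mean absolute error, flip the sign. The sketch below does so with the fold scores copied from the printed output; the variable names are illustrative.

```python
import numpy as np

# Fold scores copied from the printed output above.
fold_scores = np.array([-58433.58769918, -52042.32903469, -57874.81892764,
                        -53884.53739124, -57831.76667253])

cv_mae = -fold_scores.mean()  # average MAE across the 5 folds, in dollars
cv_std = fold_scores.std()    # spread of the per-fold scores
print(f'CV MAE: {cv_mae:.2f} (std {cv_std:.2f})')
```

Reporting the spread alongside the mean gives a rough sense of how much the estimate depends on the particular split.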

In [60]:
scores = cross_val_score(regressor, X, y, scoring='neg_mean_squared_error',
                         cv=cv, n_jobs=-1)
print(scores)
print(np.average(scores))

[-7.98434945e+09 -5.42998708e+09 -7.11151291e+09 -6.14488220e+09
 -7.13713535e+09]
-6761573399.537901


In [61]:
scores = cross_val_score(regressor, X, y, scoring='neg_root_mean_squared_error',
                         cv=cv, n_jobs=-1)
print(scores)
print(np.average(scores))

[-89355.18703211 -73688.44606562 -84329.78660209 -78389.29903106
 -84481.56812965]
-82048.8573721068
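
The averaged RMSE above is not the square root of the averaged MSE from the previous cell, because the square root is taken per fold before averaging. A quick check with the two averages printed above:

```python
import math

mean_mse = 6761573399.537901  # average of the per-fold MSE values
mean_rmse = 82048.8573721068  # average of the per-fold RMSE values

rmse_of_mean_mse = math.sqrt(mean_mse)
gap = rmse_of_mean_mse - mean_rmse
print(rmse_of_mean_mse)
print(gap)
```

Because the square root is a concave function, the mean of the per-fold RMSE values can never exceed the square root of the mean MSE. Either quantity is a legitimate summary, as long as it is clear which one is being reported.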
