Regression model

We have built a regression model to predict the most probable house prices in a neighborhood. Based on our observations and analysis so far, we believe the following parameters are the best choices for predicting the output.

Parameters used to train the model

1. Median Income
2. Median TOM
3. Nightlife
4. Education quality
5. 311 data
6. Diversity

We performed feature scaling on all the parameters so that gradient descent converges more quickly; this step subtracts the mean from each feature and then divides by its standard deviation.
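The scaling step described above can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the helper name `standardize` is hypothetical, and the sample values are the TOM and Crime entries from the first five rows of the data shown below.

```python
import pandas as pd

def standardize(df):
    """Return a copy of df with each column shifted to mean 0 and scaled to unit standard deviation."""
    # ddof=0 uses the population standard deviation; this choice is an assumption
    return (df - df.mean()) / df.std(ddof=0)

# First five TOM and Crime values from the regression data
sample = pd.DataFrame({
    'TOM': [43, 49, 39, 36, 30],
    'Crime': [55.1, 46.6, 56.6, 54.9, 46.5],
})

scaled = standardize(sample)
print(scaled.round(3))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the gradient because of its units.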

In [230]:
import pandas as pd
import numpy as np
import pickle
import scipy.stats as ss
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [328]:
baseline_df = pd.read_csv('../BaltimoreRanking-519/Data/regression_data.csv') #change the path accordingly
baseline_df

Unnamed: 0,CSA2010,TOM,Crime,Restaurant,Education,Diversity,Price
0,Allendale/Irvington/S. Hilton,43,55.1,9,83.7,24.138062,33250
1,Beechfield/Ten Hills/West Hills,49,46.6,1,87.6,37.403378,130000
2,Belair-Edison,39,56.6,10,79.7,25.807071,41975
3,Brooklyn/Curtis Bay/Hawkins Point,36,54.9,37,74.6,67.740252,40000
4,Canton,30,46.5,90,76.6,29.319447,275000
5,Cedonia/Frankford,45,52.3,16,83.4,38.68363,78575
6,Cherry Hill,38,53.5,4,76.0,12.653485,23500
7,Chinquapin Park/Belvedere,36,47.7,12,80.2,50.699036,120000
8,Claremont/Armistead,57,46.3,2,79.5,67.91442,90000
9,Clifton-Berea,32,55.0,17,62.9,9.107467,20000


In [232]:
x_df = baseline_df[['TOM', 'Crime', 'Restaurant', 'Education', 'Diversity']] 
y_df = baseline_df['Price']

In [233]:
x_df.corr()

Unnamed: 0,TOM,Crime,Restaurant,Education,Diversity
TOM,1.0,0.221893,-0.02398,-0.258521,0.043518
Crime,0.221893,1.0,0.451788,-0.01208,0.313616
Restaurant,-0.02398,0.451788,1.0,-0.128404,0.306724
Education,-0.258521,-0.01208,-0.128404,1.0,0.096767
Diversity,0.043518,0.313616,0.306724,0.096767,1.0


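The correlation table shows that most feature pairs are only weakly correlated (low coefficients), so we treat them as approximately independent. The exception is Restaurant and Crime (r ≈ 0.45). A Pearson correlation test on these two features gives a low p-value, so the correlation is statistically significant but only moderate in strength, and we keep both features in the model.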

In [234]:
restaurant = x_df['Restaurant']
crime = x_df['Crime']
corr, p = ss.pearsonr(restaurant, crime)
print("R: %s P: %s" % (corr, p))

R: 0.451787704805 P: 0.000535354843477


In [235]:
x_log_df = x_df.copy()
cols = list(x_df.columns)
for col in cols:
    x_log_df[col] = np.log(x_df[col])

y_log_df = np.log(y_df)
x_log_df.head(2)

Unnamed: 0,TOM,Crime,Restaurant,Education,Diversity
0,3.7612,4.00915,2.197225,4.427239,3.18379
1,3.89182,3.841601,0.0,4.472781,3.621761


In [306]:
# Note: the `normalize` argument was removed in scikit-learn 1.2; drop it on newer versions.
model = LinearRegression(fit_intercept=True, normalize=True, copy_X=True)

In [307]:
X = x_log_df
Y = y_log_df

In [311]:
model = train(model, X, Y)
prediction = test(model, X, Y, True)

Regression coefficients: 
 [-0.60500572 -0.20395281  0.09514333  0.51082438  0.83366618]
Bias 9.16
Residual sum of squares: 0.00
Variance score: 0.36


In [312]:
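print("Outliers: ")
outliers = [3, 11, 47, 50, 54]
# Squared residuals of the log-price predictions from the previous cell
residual = (prediction - Y) ** 2
residual[outliers]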

Outliers: 


3     2.150382
11    1.927199
47    1.616619
50    2.086255
54    1.657367
Name: Price, dtype: float64

In [313]:
X_new = X.drop(outliers)
Y_new = Y.drop(outliers)
# reset_index returns a new object, so assign the result back
X_new = X_new.reset_index(drop=True)
Y_new = Y_new.reset_index(drop=True)

X_train = X_new[:40]
X_test = X_new[40:]
Y_train = Y_new[:40]
Y_test = Y_new[40:]

In [314]:
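# The model is fit on all outlier-free rows (X_new), which include the X_test rows,
# so the score below is not a held-out estimate.
model = train(model, X_new, Y_new)
prediction = test(model, X_test, Y_test, False)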

Regression coefficients: 
 [-0.60500572 -0.20395281  0.09514333  0.51082438  0.83366618]
Bias 9.16
Residual sum of squares: 0.05
Variance score: 0.78


In [320]:
# Use a context manager so the file is closed after writing
with open('Baltimore_model.sav', 'wb') as f:
    pickle.dump(model, f)
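A minimal sketch of how a model saved this way might be reloaded and used. Because the model is fit on log-transformed prices, its predictions are log prices and need `np.exp` to return to the price scale. The data, `demo_model`, and the temporary file path here are hypothetical stand-ins, not the Baltimore data or the file written above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data: two log-scaled features and a log-price target
rng = np.random.default_rng(0)
X_demo = rng.uniform(1.0, 5.0, size=(20, 2))
y_demo = 9.0 + 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Save the fitted model to a temporary file (path is illustrative only)
path = os.path.join(tempfile.mkdtemp(), 'demo_model.sav')
with open(path, 'wb') as f:
    pickle.dump(demo_model, f)

# Reload it and predict for one row of features
with open(path, 'rb') as f:
    loaded = pickle.load(f)

log_price = loaded.predict(X_demo[:1])  # prediction on the log scale
price = np.exp(log_price)               # back to the original price scale
print(price)
```

Any new feature row passed to the reloaded model has to go through the same log transform that was applied to the training features.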

The code below defines the helper functions used to train and test the model.

In [309]:
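def test(model, X_test, Y_test, flag):
    """Print evaluation metrics for model on (X_test, Y_test) and return its predictions."""
    prediction = model.predict(X_test)
    # Mean of the squared residuals (square each residual first, then average)
    print("Residual sum of squares: %.2f" % np.mean((prediction - Y_test) ** 2))
    # Coefficient of determination (R^2) on the given data
    print('Variance score: %.2f' % model.score(X_test, Y_test))
    if flag:
        # Per-row squared residuals, useful for spotting outliers
        residual = (prediction - Y_test) ** 2
        #print(residual)
    return prediction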
        
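def train(model, X_train, Y_train):
    """Fit model on the given training data, print its parameters, and return it."""
    # Fit on the arguments passed in, not on the global X_new / Y_new
    model.fit(X_train, Y_train)
    print('Regression coefficients: \n', model.coef_)
    print('Bias %.2f' % model.intercept_)
    return model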
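The two numbers reported by `test` can be reproduced by hand. This is a minimal sketch, assuming the first is meant to be the mean of the squared residuals and the second is the R² value that `model.score` returns for a linear regression. The arrays `y_true` and `y_pred` are made-up values for illustration only.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

residuals = y_pred - y_true

# Mean squared error: square each residual first, then average
mse = np.mean(residuals ** 2)

# Squaring the mean residual instead lets positive and negative errors cancel
squared_mean = np.mean(residuals) ** 2

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE: %.2f, squared mean residual: %.2f, R^2: %.2f" % (mse, squared_mean, r2))
```

The placement of the parentheses matters: with residuals that average to zero, the squared mean is 0 even though every individual prediction is off.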