
# Assignment 2

You are interested in building a model that predicts housing values in Boston suburbs from the predictor variables you have available. The data set is boston.csv (the number of attributes has been reduced from the original form).

Use Multiple Linear Regression to build your model, with the median value of owner-occupied homes as the target variable and the rest as predictors.

Determine the significance of these different predictors, and drop the ones that are not useful for your model. Document your work and explain your decision making as you build your model. Report your final model's accuracy.

Upload your homework here as a Jupyter Notebook, a .ipynb file. You can also submit a second file converted to HTML (by going to file-->download as-->HTML).

Data set Info:
- Title: Boston Housing Data
- Information: Concerns housing values in suburbs of Boston.
- Number of Observations: 506
- Number of Attributes: 9 predictors plus the target MV (the original data set has 13 predictors)

Attribute Information:
- 1. CRIM: per capita crime rate by town
- 2. INDUS: proportion of non-retail business acres per town
- 3. NOX: nitric oxides concentration (parts per 10 million)
- 4. RM: average number of rooms per dwelling
- 5. AGE: proportion of owner-occupied units built prior to 1940
- 6. DIS: weighted distances to five Boston employment centres
- 7. TAX: full-value property-tax rate per \$10,000
- 8. PT: pupil-teacher ratio by town
- 9. B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- 10. MV: median value of owner-occupied homes in \$1000s (target variable)

 

For more information on this dataset, see https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names

## Method 1: Multiple Linear Regression with "Statsmodels"

In [2]:
import statsmodels.api as sm #importing all important libraries for Statsmodels and Scikit-Learn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression



In [0]:
df1 = pd.read_csv("/content/drive/My Drive/UC Machine Learning/Datasets/boston.csv") #reading the csv file from the mounted Google Drive folder

In [4]:
df1.head() #displaying top 5 rows

Unnamed: 0,CRIM,INDUS,NOX,RM,AGE,DIS,TAX,PT,B,MV
0,0.00632,2.31,0.538,6.575,65.199997,4.09,296,15.3,396.899994,24.0
1,0.02731,7.07,0.469,6.421,78.900002,4.9671,242,17.799999,396.899994,21.6
2,0.02729,7.07,0.469,7.185,61.099998,4.9671,242,17.799999,392.829987,34.700001
3,0.03237,2.18,0.458,6.998,45.799999,6.0622,222,18.700001,394.630005,33.400002
4,0.06905,2.18,0.458,7.147,54.200001,6.0622,222,18.700001,396.899994,36.200001


In [5]:
df1.shape 

(506, 10)
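
The `Unnamed: 0` column in the preview looks like a row index that was saved into the CSV, and the shape `(506, 10)` shows it is counted as a regular column, so it would travel along with the predictors. A minimal sketch of how `index_col=0` keeps such a column out of the feature set; the three-row CSV below is a made-up stand-in for boston.csv, not the real file:

```python
import io
import pandas as pd

# Tiny stand-in for boston.csv (illustrative values only):
# the first, unnamed column is a saved row index.
csv_text = (
    ",CRIM,RM,MV\n"
    "0,0.00632,6.575,24.0\n"
    "1,0.02731,6.421,21.6\n"
    "2,0.02729,7.185,34.7\n"
)

# Default read: the unnamed index column becomes a regular data column.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# index_col=0 treats the first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```

Whether to apply this to the real file is a judgment call: it would change the number of predictors and therefore the fitted results shown in this notebook.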

In [0]:
#MV is the target variable; all other columns are used as predictors
#since the records are few, the data is split 80/20 into train and test sets
#random_state=149 makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(
    df1.drop("MV", axis=1), df1['MV'], test_size=0.2, random_state=149)

In [7]:
X_test.shape 

(102, 9)

In [0]:
X_train = sm.add_constant(X_train) #adding the constant to the training data

In [0]:
model_1 = sm.OLS(Y_train, X_train) #selecting OLS method

In [0]:
result = model_1.fit() #fitting the model with train dataset

In [11]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                     MV   R-squared:                       0.643
Model:                            OLS   Adj. R-squared:                  0.635
Method:                 Least Squares   F-statistic:                     78.87
Date:                Wed, 06 May 2020   Prob (F-statistic):           1.51e-82
Time:                        00:16:15   Log-Likelihood:                -1270.4
No. Observations:                 404   AIC:                             2561.
Df Residuals:                     394   BIC:                             2601.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         17.9711      6.332      2.838      0.0

Interesting observations to note:
- 1. Among all the variables, INDUS and TAX have p-values greater than 0.05. This means that, with the other predictors in the model, there is no statistically significant evidence that their coefficients differ from zero.
- 2. Variables with p-values below 0.05 are statistically significant predictors of the target variable. In the steps below, we therefore remove INDUS and then TAX, and check how the R-squared value changes.

In [0]:
X_train = X_train.drop("INDUS", axis=1) #dropping INDUS column

In [0]:
new_result_without_indus = sm.OLS(Y_train, X_train).fit() #Applying OLS method and fitting it with new dataset

In [14]:
print(new_result_without_indus.summary())

                            OLS Regression Results                            
Dep. Variable:                     MV   R-squared:                       0.642
Model:                            OLS   Adj. R-squared:                  0.635
Method:                 Least Squares   F-statistic:                     88.53
Date:                Wed, 06 May 2020   Prob (F-statistic):           2.80e-83
Time:                        00:16:44   Log-Likelihood:                -1271.0
No. Observations:                 404   AIC:                             2560.
Df Residuals:                     395   BIC:                             2596.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         17.6122      6.326      2.784      0.0

In [0]:
X_test = X_test.drop("INDUS", axis=1) #dropping INDUS from test dataset as well

In [0]:
predictions = new_result_without_indus.predict(sm.add_constant(X_test))

In [19]:
r2_score(Y_test, predictions) #checking the r-squared accuracy

0.7500120438513593

In [0]:
X_train = X_train.drop("TAX", axis=1) #dropping TAX from the train data

In [0]:
new_result_without_tax = sm.OLS(Y_train, X_train).fit()

In [22]:
print(new_result_without_tax.summary())

                            OLS Regression Results                            
Dep. Variable:                     MV   R-squared:                       0.642
Model:                            OLS   Adj. R-squared:                  0.635
Method:                 Least Squares   F-statistic:                     101.3
Date:                Wed, 06 May 2020   Prob (F-statistic):           3.00e-84
Time:                        00:18:23   Log-Likelihood:                -1271.1
No. Observations:                 404   AIC:                             2558.
Df Residuals:                     396   BIC:                             2590.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         17.0371      6.203      2.747      0.0

In [0]:
X_test = X_test.drop("TAX", axis=1)

In [0]:
predictions = new_result_without_tax.predict(sm.add_constant(X_test))

In [25]:
r2_score(Y_test, predictions) #checking the r-squared accuracy 

0.7517651887931424

Note that the test-set R-squared rises slightly once TAX is removed in addition to INDUS (from 0.7500 to 0.7518).
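
On the training side, the summaries report R-squared slipping from 0.643 to 0.642 while adjusted R-squared stays at 0.635. Training R-squared cannot go up when a predictor is dropped, which is why adjusted R-squared is the fairer comparison. A small sketch of the standard adjusted R-squared formula, applied to the rounded values printed in the first two summaries (the function name is illustrative):

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a linear model that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Rounded values from the summaries: 404 training rows,
# 9 predictors in the first fit and 8 after dropping INDUS.
full = adjusted_r2(0.643, 404, 9)
no_indus = adjusted_r2(0.642, 404, 8)

print(round(full, 3), round(no_indus, 3))  # 0.635 0.635
```

Because the inputs are already rounded to three decimals, the result is only reliable to about the same precision.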

## Method 2: Multiple Linear Regression with "Scikit-Learn"

In [27]:
df2 = pd.read_csv("/content/drive/My Drive/UC Machine Learning/Datasets/boston.csv") #reading the same csv file again

df2.head()

Unnamed: 0,CRIM,INDUS,NOX,RM,AGE,DIS,TAX,PT,B,MV
0,0.00632,2.31,0.538,6.575,65.199997,4.09,296,15.3,396.899994,24.0
1,0.02731,7.07,0.469,6.421,78.900002,4.9671,242,17.799999,396.899994,21.6
2,0.02729,7.07,0.469,7.185,61.099998,4.9671,242,17.799999,392.829987,34.700001
3,0.03237,2.18,0.458,6.998,45.799999,6.0622,222,18.700001,394.630005,33.400002
4,0.06905,2.18,0.458,7.147,54.200001,6.0622,222,18.700001,396.899994,36.200001


In [0]:
#applying the same test_size and random_state as used earlier
X1_train, X1_test, Y1_train, Y1_test = train_test_split(
    df2.drop("MV", axis=1), df2['MV'], test_size=0.2, random_state=149)



In [0]:
model = LinearRegression(normalize=True) #creating the model with the LinearRegression class; normalize only exists in scikit-learn versions before 1.2

In [30]:
model.fit(X1_train, Y1_train) #fitting the linear regression model to training dataset

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)
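
The `normalize=True` argument shown in the output above belongs to older scikit-learn releases and was removed in version 1.2. For ordinary least squares, rescaling the features does not change the predictions, so on a newer release a plain `LinearRegression()` should give the same score. A minimal sketch on synthetic data (the data and the `StandardScaler` pipeline are assumptions for illustration, not part of the original notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Three synthetic features on very different scales.
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
y = X @ np.array([2.0, -0.1, 30.0]) + 5.0 + rng.normal(size=200)

# Plain least squares on the raw features.
plain = LinearRegression().fit(X, y)

# Least squares after standardizing each feature.
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Predictions agree up to floating-point error.
same = np.allclose(plain.predict(X), scaled.predict(X))
print(same)
```

Scaling does matter for regularized models such as Ridge or Lasso, which is where a scaling pipeline is usually worth the extra step.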

In [0]:
predictions = model.predict(X1_test)

In [32]:
pd.DataFrame({'actual value': Y1_test, 'predictions':predictions}).sample(5)

Unnamed: 0,actual value,predictions
170,17.4,22.416734
127,16.200001,15.902856
432,16.1,19.051471
250,24.4,24.101687
375,15.0,26.215343


In [33]:
model.score(X1_test, Y1_test) #checking the accuracy with R-squared value

0.7525085195003509

Final Comments:

When evaluating the effect of the predictor variables on the target variable, a p-value below 0.05 indicates that a predictor's coefficient is statistically distinguishable from zero given the other predictors. A smaller p-value means stronger evidence of a relationship, not a larger effect.

The same can be checked by removing the predictors with high p-values (INDUS and TAX) and calculating the R-squared value on the test set. It is 0.7500 with INDUS removed and 0.7518 with both removed, while the Scikit-Learn model that keeps every predictor scores 0.7525, so dropping the two variables leaves predictive accuracy essentially unchanged.

This can be interpreted as follows: in the Boston suburbs, the median value of owner-occupied homes is only weakly related to the 'proportion of non-retail business acres per town' and the 'full-value property-tax rate' once the other predictors are accounted for.

According to Statsmodels, the final model explains about 75.2% of the variance in MV on the test set. According to Scikit-Learn, the model with all predictors explains about 75.3%.
