### Multiple Linear Regression Introduction

In this notebook (and following quizzes), you will be creating a few simple linear regression models, as well as a multiple linear regression model, to predict home value.

Let's get started by importing the necessary libraries and reading in the data you will be using.

In [37]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('house_prices.csv')
df.head()

Unnamed: 0,house_id,neighborhood,area,bedrooms,bathrooms,style,price
0,1112,B,1188,3,2,ranch,598291
1,491,B,3512,5,3,victorian,1744259
2,5952,B,1134,3,2,ranch,571669
3,3525,A,1940,4,2,ranch,493675
4,5108,B,2208,6,4,victorian,1101539


`1.` Using statsmodels, fit three individual simple linear regression models to predict price.  You should have a model that uses **area**, another using **bedrooms**, and a final one using **bathrooms**.  You will also want to use an intercept in each of your three models.

Use the results from each of your models to answer the first two quiz questions below.
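
As a rough illustration of what a simple linear regression with an intercept computes, the sketch below solves the same least-squares problem with plain NumPy. The data is synthetic (an assumption for illustration, not `house_prices.csv`), and the names `slope`, `intercept`, and `r_squared` are chosen here only for the example.

```python
import numpy as np

# Synthetic data (made up for illustration): y = 3 + 2 * x + small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 0.1, size=200)

# Design matrix: the predictor plus a constant column of ones for the intercept.
X = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of X @ beta ~= y (the point estimates OLS reports).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
slope, intercept = beta

# R-squared: share of the variation in y explained by the fitted line.
residuals = y - X @ beta
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

print(slope, intercept, r_squared)
```

Without the column of ones, the fitted line would be forced through the origin, which is why each model in this notebook includes an intercept.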


In [38]:
df['intercection'] = 1
lm_area = sm.OLS(df['price'], df[['area', 'intercection']])
res_area = lm_area.fit()
res_area.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.678
Model:,OLS,Adj. R-squared:,0.678
Method:,Least Squares,F-statistic:,12690.0
Date:,"Fri, 17 Apr 2020",Prob (F-statistic):,0.0
Time:,09:41:38,Log-Likelihood:,-84517.0
No. Observations:,6028,AIC:,169000.0
Df Residuals:,6026,BIC:,169100.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
area,348.4664,3.093,112.662,0.000,342.403,354.530
intercection,9587.8878,7637.479,1.255,0.209,-5384.303,2.46e+04

0,1,2,3
Omnibus:,368.609,Durbin-Watson:,2.007
Prob(Omnibus):,0.0,Jarque-Bera (JB):,349.279
Skew:,0.534,Prob(JB):,1.43e-76
Kurtosis:,2.499,Cond. No.,4930.0


In [39]:
df['intercection'] = 1
lm_bathrooms = sm.OLS(df['price'], df[['bathrooms', 'intercection']])
res_bathrooms = lm_bathrooms.fit()
res_bathrooms.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.541
Model:,OLS,Adj. R-squared:,0.541
Method:,Least Squares,F-statistic:,7116.0
Date:,"Fri, 17 Apr 2020",Prob (F-statistic):,0.0
Time:,09:41:38,Log-Likelihood:,-85583.0
No. Observations:,6028,AIC:,171200.0
Df Residuals:,6026,BIC:,171200.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
bathrooms,3.295e+05,3905.540,84.358,0.000,3.22e+05,3.37e+05
intercection,4.314e+04,9587.189,4.500,0.000,2.43e+04,6.19e+04

0,1,2,3
Omnibus:,915.429,Durbin-Watson:,2.003
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1537.531
Skew:,1.01,Prob(JB):,0.0
Kurtosis:,4.428,Cond. No.,5.84


In [40]:
df['intercection'] = 1
lm_bedrooms = sm.OLS(df['price'], df[['bedrooms', 'intercection']])
res_bedrooms = lm_bedrooms.fit()
res_bedrooms.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.553
Model:,OLS,Adj. R-squared:,0.553
Method:,Least Squares,F-statistic:,7446.0
Date:,"Fri, 17 Apr 2020",Prob (F-statistic):,0.0
Time:,09:41:38,Log-Likelihood:,-85509.0
No. Observations:,6028,AIC:,171000.0
Df Residuals:,6026,BIC:,171000.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
bedrooms,2.284e+05,2646.744,86.289,0.000,2.23e+05,2.34e+05
intercection,-9.485e+04,1.08e+04,-8.762,0.000,-1.16e+05,-7.36e+04

0,1,2,3
Omnibus:,967.118,Durbin-Watson:,2.014
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1599.431
Skew:,1.074,Prob(JB):,0.0
Kurtosis:,4.325,Cond. No.,10.3


`2.` Now that you have looked at the results from the simple linear regression models, let's try a multiple linear regression model that uses all three of these variables at the same time.  You will still want an intercept in this model.

In [41]:
lm = sm.OLS(df['price'], df[['area', 'bedrooms', 'bathrooms', 'intercection']])
res = lm.fit()
res.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.678
Model:,OLS,Adj. R-squared:,0.678
Method:,Least Squares,F-statistic:,4230.0
Date:,"Fri, 17 Apr 2020",Prob (F-statistic):,0.0
Time:,09:41:38,Log-Likelihood:,-84517.0
No. Observations:,6028,AIC:,169000.0
Df Residuals:,6024,BIC:,169100.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
area,345.9110,7.227,47.863,0.000,331.743,360.079
bedrooms,-2925.8063,1.03e+04,-0.285,0.775,-2.3e+04,1.72e+04
bathrooms,7345.3917,1.43e+04,0.515,0.607,-2.06e+04,3.53e+04
intercection,1.007e+04,1.04e+04,0.972,0.331,-1.02e+04,3.04e+04

0,1,2,3
Omnibus:,367.658,Durbin-Watson:,2.007
Prob(Omnibus):,0.0,Jarque-Bera (JB):,350.116
Skew:,0.536,Prob(JB):,9.4e-77
Kurtosis:,2.503,Cond. No.,11600.0
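
One possible reading of the summary above, where the **bedrooms** and **bathrooms** coefficients shrink and lose significance once **area** is included, is that these predictors are correlated with area. The sketch below reproduces that effect on synthetic data. All numbers are made up for illustration and do not come from `house_prices.csv`, and the helper `ols` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Assumption: bedrooms tracks area, and price depends on area only.
area = rng.uniform(1000, 4000, size=n)
bedrooms = np.round(area / 700 + rng.normal(0, 0.5, size=n))
price = 350 * area + rng.normal(0, 50_000, size=n)


def ols(y, *columns):
    """Least-squares coefficients for the given columns plus an intercept (last)."""
    X = np.column_stack([*columns, np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Alone, bedrooms picks up the effect of area and gets a large coefficient.
bedrooms_alone = ols(price, bedrooms)[0]

# Next to area, the bedrooms coefficient is small relative to that value.
area_coef, bedrooms_given_area, _ = ols(price, area, bedrooms)

print(bedrooms_alone, area_coef, bedrooms_given_area)
```

In a multiple regression each coefficient describes the change in the response when that predictor changes while the others are held constant, so a predictor that only proxies for another one can lose most of its apparent effect.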


`3.` Along with using the **area**, **bedrooms**, and **bathrooms** you might also want to use **style** to predict the price.  Try adding this to your multiple linear regression model.  What happens?  Use the final quiz below to provide your answer.

In [42]:
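# Adding the raw `style` column fails here: it holds strings rather than numbers.
# The Dummy Variables section below shows how to handle categorical columns.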
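
A minimal sketch of the underlying problem, assuming only that the regression needs a numeric design matrix: category labels such as `ranch` cannot be converted to floats. The exact error raised by statsmodels may differ; the example uses NumPy alone.

```python
import numpy as np

# A categorical column like `style` holds labels, not numbers.
style = ["ranch", "victorian", "ranch"]

try:
    np.asarray(style, dtype=float)
    converted = True
except ValueError as err:
    # NumPy refuses to turn the labels into floats.
    converted = False
    print("conversion failed:", err)
```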

### Dummy Variables

You saw in the earlier notebook that you weren't able to directly add a categorical variable to your multiple linear regression model. In this notebook, you will get some practice incorporating categorical data by converting to dummy variables in your models and interpreting the output.

Let's start by importing the necessary libraries and reading in the data.



In [46]:
df = pd.read_csv('./house_prices_02.csv')
df.head()

Unnamed: 0,house_id,neighborhood,area,bedrooms,bathrooms,style,price
0,1112,B,1188,3,2,ranch,598291
1,491,B,3512,5,3,victorian,1744259
2,5952,B,1134,3,2,ranch,571669
3,3525,A,1940,4,2,ranch,493675
4,5108,B,2208,6,4,victorian,1101539


`1.` Use the [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) documentation to assist you with obtaining dummy variables for the **neighborhood** column.  Then use [join](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) to add the dummy variables to your dataframe, **df**, and store the joined results in **df_new**.

Fit a linear model using **all three levels** of **neighborhood** to predict the price. Don't forget an intercept.

Use your results to answer quiz 1 below.

In [49]:
df['intersection'] = 1
df_neighborhood = pd.get_dummies(df.neighborhood)
df_new = df.join(df_neighborhood)
df_new
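
To see what `pd.get_dummies` and `join` produce, here is a small sketch on a toy dataframe (the rows are invented for illustration). It passes `dtype=int` so the dummy columns are 0/1 integers regardless of the pandas version; that argument is a choice made here and is not part of the original cell.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame(
    {"neighborhood": ["A", "B", "C", "A"], "price": [100, 250, 110, 90]}
)

# One 0/1 column per level of the categorical variable.
dummies = pd.get_dummies(toy["neighborhood"], dtype=int)

# join aligns on the index, so each row gets its own dummy values.
toy_new = toy.join(dummies)
print(toy_new)
```

Each row has exactly one of the columns `A`, `B`, `C` set to 1.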


In [48]:
lm = sm.OLS(df_new['price'], df_new[['intersection', 'A', 'B', 'C']])
res = lm.fit()
res.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.246
Model:,OLS,Adj. R-squared:,0.246
Method:,Least Squares,F-statistic:,983.1
Date:,"Fri, 17 Apr 2020",Prob (F-statistic):,0.0
Time:,09:45:19,Log-Likelihood:,-87082.0
No. Observations:,6028,AIC:,174200.0
Df Residuals:,6025,BIC:,174200.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
intersection,5.381e+05,4439.653,121.210,0.000,5.29e+05,5.47e+05
A,3001.8311,8650.726,0.347,0.729,-1.4e+04,2e+04
B,5.325e+05,7894.313,67.448,0.000,5.17e+05,5.48e+05
C,2669.4717,8925.271,0.299,0.765,-1.48e+04,2.02e+04

0,1,2,3
Omnibus:,689.315,Durbin-Watson:,1.999
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1154.155
Skew:,0.793,Prob(JB):,2.39e-251
Kurtosis:,4.442,Cond. No.,1.32e+15
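
The very large condition number in the summary above is what one would expect when all three dummy columns are used together with an intercept: the dummies always sum to the intercept column, so the design matrix does not have full column rank. The sketch below checks this with NumPy on a few invented labels.

```python
import numpy as np

# Invented neighborhood labels for illustration.
labels = np.array(["A", "B", "C", "A", "B", "C"])

intercept = np.ones(len(labels))
A = (labels == "A").astype(float)
B = (labels == "B").astype(float)
C = (labels == "C").astype(float)

# Intercept plus all three dummies: 4 columns, but A + B + C equals the intercept.
X_full = np.column_stack([intercept, A, B, C])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one level (the baseline) removes the redundancy.
X_base = np.column_stack([intercept, B, C])
rank_base = np.linalg.matrix_rank(X_base)

print(X_full.shape[1], rank_full, rank_base)
```

The rank of 3 is consistent with `Df Model: 2` in the summary: only two of the three neighborhood coefficients carry independent information once an intercept is present.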


`2.`  Now, fit an appropriate linear model that uses **neighborhood** to predict the price of a home. Use **neighborhood A** as your baseline, and remember that the coefficients shown for the other neighborhoods are then comparisons against this baseline. Use your resulting model to answer the questions in Quiz 2 and Quiz 3 below.

In [None]:
# The dummy columns live in df_new, and the intercept column is named 'intersection' there.
lm = sm.OLS(df_new['price'], df_new[['intersection', 'B', 'C']])
res = lm.fit()
res.summary()
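
To make the baseline interpretation concrete, the following sketch fits the same kind of model with NumPy on invented prices. With A as the baseline, the intercept is the mean price in A, and the coefficients for B and C are the differences between their mean prices and that of A.

```python
import numpy as np

# Invented data: mean prices are 110 (A), 320 (B), and 115 (C).
labels = np.array(["A", "A", "B", "B", "C", "C"])
price = np.array([100.0, 120.0, 300.0, 340.0, 105.0, 125.0])

# Intercept plus dummies for B and C; A is the omitted baseline.
X = np.column_stack(
    [
        np.ones(len(labels)),
        (labels == "B").astype(float),
        (labels == "C").astype(float),
    ]
)

beta, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, coef_B, coef_C = beta

print(intercept, coef_B, coef_C)
```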

`3.` Run the two cells below to look at the home prices for the A and C neighborhoods. Add neighborhood B. Together, the histograms give a glimpse into the differences that you found in the previous linear model.

In [None]:
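import matplotlib.pyplot as plt  # needed for the histograms below

# Overlay the price distributions of the three neighborhoods.
plt.hist(df_new.query("C == 1")['price'], alpha=0.3, label='C');
plt.hist(df_new.query("A == 1")['price'], alpha=0.3, label='A');
plt.hist(df_new.query("B == 1")['price'], alpha=0.3, label='B');

plt.legend();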

`4.` Now, add dummy variables for the **style** of house. Create a new linear model using these new dummies, as well as the previous **neighborhood** dummies.  Use **ranch** as the baseline for the **style**.  Additionally, add **bathrooms** and **bedrooms** to your linear model.  Don't forget an intercept.  Use the results of your linear model to answer the last two questions below. **Home prices are measured in dollars, and this dataset is not real.**

To minimize scrolling, it might be useful to open another browser window to this concept to answer the quiz questions.
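
One way to assemble the predictors for step `4.` is sketched below on a toy dataframe. The rows are invented, the toy data contains only the style levels visible in the preview above (the real file may have more), and the column names `neighborhood_A`, `style_ranch`, and `intercept` follow from the choices made in this sketch rather than from the original notebook.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame(
    {
        "neighborhood": ["A", "B", "C", "B"],
        "style": ["ranch", "victorian", "ranch", "victorian"],
        "bedrooms": [3, 5, 3, 6],
        "bathrooms": [2, 3, 2, 4],
    }
)

# Passing a dataframe prefixes each dummy with its source column name.
dummies = pd.get_dummies(toy[["neighborhood", "style"]], dtype=int)

# Drop one level per categorical variable: these are the baselines.
dummies = dummies.drop(columns=["neighborhood_A", "style_ranch"])

# Quantitative predictors, remaining dummies, and a constant column.
X = toy[["bedrooms", "bathrooms"]].join(dummies)
X["intercept"] = 1
print(X.columns.tolist())
```

Every non-baseline level gets its own column, so a dataset with additional style levels would produce additional `style_...` columns.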