# Multiple Linear Regression with Dummies - Exercise

You are given a real estate dataset. 

Real estate is an example that nearly every regression course covers: it is easy to understand, and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year_view.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

In this exercise, the dependent variable is 'price', while the independent variables are 'size', 'year', and 'view'.

#### Regarding the 'view' variable:
There are two options: 'Sea view' and 'No sea view'. You are expected to create a dummy variable for 'view' and include it in the regression.

Good luck!

## Import the relevant libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
sns.set()

## Load the data

In [2]:
raw_data = pd.read_csv('real_estate_price_size_year_view.csv')

In [3]:
raw_data.describe(include='all')

,price,size,year,view
count,100.0,100.0,100.0,100
unique,,,,2
top,,,,No sea view
freq,,,,51
mean,292289.47016,853.0242,2012.6,
std,77051.727525,297.941951,4.729021,
min,154282.128,479.75,2006.0,
25%,234280.148,643.33,2009.0,
50%,280590.716,696.405,2015.0,
75%,335723.696,1029.3225,2018.0,
max,500681.128,1842.51,2018.0,


## Create a dummy variable for 'view'

In [4]:
data = raw_data.copy()
data['view'] = data['view'].map({'No sea view': 0, 'Sea view': 1})
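
As a cross-check on the mapping above, here is a minimal sketch of the same encoding done two ways on a small made-up frame (`toy` is hypothetical and not part of the dataset; pandas is assumed to be installed). `Series.map` turns any label that is missing from the dictionary into `NaN`, so counting missing values afterwards is a cheap sanity check. `pd.get_dummies` with `drop_first=True` is an alternative that builds the 0/1 column automatically.

```python
import pandas as pd

# Hypothetical toy data standing in for raw_data['view']
toy = pd.DataFrame({'view': ['No sea view', 'Sea view', 'Sea view', 'No sea view']})

# Option 1: explicit mapping, as in the exercise
mapped = toy['view'].map({'No sea view': 0, 'Sea view': 1})

# Labels absent from the dictionary would become NaN, so verify none slipped through
n_unmapped = int(mapped.isna().sum())

# Option 2: get_dummies; drop_first=True drops the first category in sorted
# order ('No sea view') and keeps a single 0/1 column named 'Sea view'
dummies = pd.get_dummies(toy['view'], drop_first=True, dtype=int)

print(mapped.tolist())
print(dummies['Sea view'].tolist())
print(n_unmapped)
```

Both options should give the same 0/1 column here; the explicit mapping has the advantage that the reference category is chosen by hand rather than by sort order.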

In [5]:
data.describe()

,price,size,year,view
count,100.0,100.0,100.0,100.0
mean,292289.47016,853.0242,2012.6,0.49
std,77051.727525,297.941951,4.729021,0.502418
min,154282.128,479.75,2006.0,0.0
25%,234280.148,643.33,2009.0,0.0
50%,280590.716,696.405,2015.0,0.0
75%,335723.696,1029.3225,2018.0,1.0
max,500681.128,1842.51,2018.0,1.0


## Create the regression

### Declare the dependent and the independent variables

In [6]:
y = data['price']
x1 = data[['size', 'year', 'view']]

### Regression

In [7]:
x = sm.add_constant(x1)
result = sm.OLS(y, x).fit()
result.summary()

Dep. Variable:,price,R-squared:,0.913
Model:,OLS,Adj. R-squared:,0.91
Method:,Least Squares,F-statistic:,335.2
Date:,"Mon, 05 Feb 2024",Prob (F-statistic):,1.02e-50
Time:,15:41:53,Log-Likelihood:,-1144.6
No. Observations:,100,AIC:,2297.0
Df Residuals:,96,BIC:,2308.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-5.398e+06,9.94e+05,-5.431,0.000,-7.37e+06,-3.43e+06
size,223.0316,7.838,28.455,0.000,207.473,238.590
year,2718.9489,493.502,5.510,0.000,1739.356,3698.542
view,5.673e+04,4627.695,12.258,0.000,4.75e+04,6.59e+04

Omnibus:,29.224,Durbin-Watson:,1.965
Prob(Omnibus):,0.0,Jarque-Bera (JB):,64.957
Skew:,1.088,Prob(JB):,7.85e-15
Kurtosis:,6.295,Cond. No.,942000.0


## Note
Adding the 'view' dummy substantially improves the model's explanatory power compared with the model without it: the adjusted R-squared rises from 0.772 to 0.910, and the F-statistic from 168.5 to 335.2. This was to be expected, since view is a major factor in real estate prices, but it is interesting to see the numbers confirm it.
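
The adjusted R-squared reported in the summary can be reproduced by hand from the plain R-squared, the number of observations and the number of regressors. The sketch below applies the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1), to the values printed in the summary table (R-squared 0.913, 100 observations, 3 regressors). The function name `adjusted_r2` is only an illustrative helper.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R-squared; n_regressors excludes the constant."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Values taken from the regression summary
adj = adjusted_r2(0.913, 100, 3)
print(round(adj, 3))
```

Because the R-squared in the summary is itself rounded to three decimals, the result agrees with the reported 0.910 only to that precision.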

Let's make use of our model by doing some predictions:

In [8]:
x

,const,size,year,view
0,1.0,643.09,2015,0
1,1.0,656.22,2009,0
2,1.0,487.29,2018,1
3,1.0,1504.75,2015,0
4,1.0,1275.46,2009,1
...,...,...,...,...
95,1.0,549.80,2009,1
96,1.0,1037.44,2009,0
97,1.0,1504.75,2006,0
98,1.0,648.29,2015,0


In [9]:
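# Build the inputs for prediction; the scalar 1 for 'const' is broadcast to every row
test_data = pd.DataFrame({'const': 1, 'size': [700, 800, 900], 'year': [2005, 2020, 2012], 'view': [1, 0, 0]})
# Enforce the same column order as x (older pandas versions could sort dict keys alphabetically)
test_data = test_data[['const', 'size', 'year', 'view']]
# Label the rows for display only; test_data itself keeps its integer index
test_data.rename(index={0: 'house 1', 1: 'house 2', 2: 'house 3'})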

,const,size,year,view
house 1,1,700,2005,1
house 2,1,800,2020,0
house 3,1,900,2012,0


In [10]:
prediction = result.predict(test_data)
prediction

0    266426.493564
1    272787.869039
2    273339.439874
dtype: float64
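
To see what `predict` is doing, the same numbers can be approximated by hand: each prediction is the sum of every coefficient times the corresponding feature value, with the constant multiplied by 1. The sketch below copies the coefficients from the summary table. Those values are rounded (the constant only to four significant digits), so the results are expected to differ from the output above by roughly a hundred. The names `coefs`, `houses` and `manual` are illustrative.

```python
# Coefficients as printed in the regression summary (rounded values)
coefs = {'const': -5.398e6, 'size': 223.0316, 'year': 2718.9489, 'view': 5.673e4}

# The three test houses from the exercise
houses = [
    {'const': 1, 'size': 700, 'year': 2005, 'view': 1},
    {'const': 1, 'size': 800, 'year': 2020, 'view': 0},
    {'const': 1, 'size': 900, 'year': 2012, 'view': 0},
]

# Linear prediction: sum of coefficient * feature value over all terms
manual = [sum(coefs[name] * house[name] for name in coefs) for house in houses]

for value in manual:
    print(round(value))
```

For exact agreement, the unrounded coefficients in `result.params` would have to be used instead of the printed ones.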

In [11]:
test_data.join(pd.DataFrame({'Predicted price': prediction})).rename(index={0: 'house 1', 1: 'house 2', 2: 'house 3'})

,const,size,year,view,Predicted price
house 1,1,700,2005,1,266426.493564
house 2,1,800,2020,0,272787.869039
house 3,1,900,2012,0,273339.439874


These predictions show that although 'house 1' is older and smaller, its sea view brings its price almost up to that of the other two. 'House 2' and 'house 3' also have almost the same predicted price: 'house 3' is bigger, but 'house 2' is more recent.
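
The near tie between 'house 2' and 'house 3' can be checked directly from the coefficients, since a linear model prices each feature independently of the others. The sketch below uses the rounded coefficients from the summary table to compare the value of 100 extra units of size with the value of being 8 years newer, and to express the sea view premium in equivalent units of size and in equivalent years. Variable names are illustrative.

```python
size_coef = 223.0316   # price change per unit of size
year_coef = 2718.9489  # price change per year
view_coef = 5.673e4    # premium for a sea view (rounded in the summary)

# house 3 is 100 units larger than house 2; house 2 is 8 years newer
size_gain = size_coef * (900 - 800)
year_gain = year_coef * (2020 - 2012)
gap = size_gain - year_gain

# Sea view premium in equivalent size units and equivalent years
view_in_size = view_coef / size_coef
view_in_years = view_coef / year_coef

print(round(gap, 2))
print(round(view_in_size), round(view_in_years, 1))
```

The gap of roughly 552 agrees with the difference between the two predicted prices (273339.44 - 272787.87), and the last line suggests that, in this model, a sea view is worth about as much as 254 extra units of size or about 21 years of age.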