# Estimating the rent in Paris using multiple linear regression with dummy variables

In this tutorial, we carry out a case study: estimating the rent in Paris, the French capital, and identifying the boroughs (arrondissements) that influence it most. The method used is multiple linear regression with dummy variables.

## Preprocessing

In [1]:
# Importing the libraries
import numpy as np
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('data/house_data.csv')

# Cleaning all observations with nan values
dataset = dataset.dropna()

First, the categorical variable "arrondissement" is transformed into five dummy variables, one per arrondissement.

These five dummy variables are then combined with the two variables "price" and "surface" into a single dataframe.

In [2]:
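# One 0/1 column per arrondissement (dtype=int keeps integers rather than booleans)
dummies = pd.DataFrame({'arrondissement': dataset.arrondissement.astype(str)})
dummies = pd.get_dummies(dummies, dtype=int)
# Combine the dummies with the original columns, then drop the categorical one
dataset = pd.concat([dummies, dataset], axis=1)
del dataset['arrondissement']
dataset.head()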

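| | arrondissement_1.0 | arrondissement_10.0 | arrondissement_2.0 | arrondissement_3.0 | arrondissement_4.0 | price | surface |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1820 | 46.1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1750 | 41.0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1900 | 55.0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1950 | 46.0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 1950 | 49.0 |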
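
To see what the encoding does in isolation, here is a minimal sketch on a made-up three-row frame. The values are illustrative and not taken from house_data.csv, and the names `toy`, `encoded` and `toy_encoded` are hypothetical.

```python
import pandas as pd

# Toy data: illustrative values only, not rows from the real dataset
toy = pd.DataFrame({
    'arrondissement': [1.0, 10.0, 1.0],
    'price': [1800, 1200, 1900],
    'surface': [45.0, 40.0, 50.0],
})

# One 0/1 indicator column per distinct arrondissement value
encoded = pd.get_dummies(toy['arrondissement'].astype(str),
                         prefix='arrondissement', dtype=int)

# Put the indicators next to the numeric columns
toy_encoded = pd.concat([encoded, toy[['price', 'surface']]], axis=1)
print(toy_encoded)
```

Each row has exactly one indicator equal to 1, so the indicator columns always sum to 1 across a row.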


## Training the model
We build the matrix X containing the predictor variables and the vector y containing the values to be predicted.

In [3]:
X = dataset.drop('price', axis=1)
y = dataset['price']

X and y are each split into two parts, one for training the model and one for testing it. The training part contains 80% of the data.

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Fitting the multiple linear regression model using the training dataset.

In [5]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Predicting the results for the test dataset.

In [6]:
y_pred = regressor.predict(X_test)

Displaying the model parameters: the constant (intercept) and the coefficients of the variables.

In [7]:
print('Const: \n', regressor.intercept_)
print('Coefficients: \n', regressor.coef_)

Const: 
 72.11912099989468
Coefficients: 
 [ 193.72618272 -317.07253919   45.99391318  -50.86351754  128.21596083
   32.84421801]


## Calculation of the theta parameters
We now check whether computing the parameter vector theta directly gives the same results. To do so, we must first prepend a column of ones to X_train; the result is named X_train_1.
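
For reference, the closed-form least-squares estimate computed in the next cell is the normal equation, written here with X for the training matrix including the column of ones (X_train_1) and y for the vector of rents (y_train):

```latex
\hat{\theta} = \left( X^{\top} X \right)^{-1} X^{\top} y
```

The formula only makes sense when the matrix product being inverted is invertible, that is, when the columns of X are linearly independent.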

In [8]:
X_train_1 = np.append(arr = np.ones((X_train.shape[0], 1)).astype(int), values = X_train, axis = 1)

theta = np.linalg.inv(X_train_1.T.dot(X_train_1)).dot(X_train_1.T).dot(y_train)
print(theta)

[-1427.32316878   775.34351357  1300.96226403  1965.56352984
   163.42487133  1050.06161534    32.84421801]


The results differ from those given above. The reason is collinearity among the dummy variables: every observation belongs to exactly one arrondissement, so the five dummy columns sum to the column of ones, and the matrix to be inverted is singular.
One variable must therefore be removed. The choice should not be made at random, however: there is a method for identifying the variables that contribute nothing to the model and removing them one by one. This method is called "Backward Elimination".
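
The collinearity can be checked numerically. The following is a minimal sketch on synthetic data (three made-up categories and one numeric feature, all assumptions for illustration): with a constant plus the full set of dummies the design matrix loses one rank, and once a dummy is dropped the normal equation agrees with a least-squares solver.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 3 categories, 1 numeric feature
rng = np.random.default_rng(0)
n = 30
category = np.arange(n) % 3                   # each row belongs to exactly one category
dummies = np.eye(3)[category]                 # full set of 0/1 indicator columns
surface = rng.uniform(20, 100, size=(n, 1))
y = 30 * surface[:, 0] + np.array([100.0, 0.0, -50.0])[category] + rng.normal(0, 5, n)

ones = np.ones((n, 1))

# Constant + all dummies: the dummies sum to the constant column
X_full = np.hstack([ones, dummies, surface])
print(X_full.shape[1], np.linalg.matrix_rank(X_full))   # 5 columns, rank 4

# Constant + all dummies but one: the columns are linearly independent
X_drop = np.hstack([ones, dummies[:, 1:], surface])
print(X_drop.shape[1], np.linalg.matrix_rank(X_drop))   # 4 columns, rank 4

# With full column rank, the normal equation and a least-squares solver agree
theta_normal = np.linalg.inv(X_drop.T @ X_drop) @ X_drop.T @ y
theta_lstsq, *_ = np.linalg.lstsq(X_drop, y, rcond=None)
print(np.allclose(theta_normal, theta_lstsq))
```

When the matrix is rank deficient, a least-squares solver can still return one of the infinitely many solutions, whereas explicitly inverting a singular matrix gives numerically unreliable values.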

### Keeping all the variables in order to select the one to eliminate

In [9]:
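# OLS is imported from statsmodels.api (recent statsmodels releases no longer expose it via statsmodels.formula.api)
import statsmodels.api as sm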

X_train = pd.concat([pd.DataFrame(np.ones((X_train.shape[0], 1)).astype(int),
                                  index=X_train.index), X_train], axis=1)
X_train.rename(columns={0: 'const'}, inplace=True)

X_opt = X_train[['const', 'arrondissement_1.0', 'arrondissement_2.0','arrondissement_3.0', 
                 'arrondissement_4.0', 'arrondissement_10.0', 'surface']]
regressor_OLS = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.86 |
| Model: | OLS | Adj. R-squared: | 0.859 |
| Method: | Least Squares | F-statistic: | 800.5 |
| Date: | Fri, 28 Sep 2018 | Prob (F-statistic): | 3.24e-275 |
| Time: | 14:33:52 | Log-Likelihood: | -5090.1 |
| No. Observations: | 657 | AIC: | 10190.0 |
| Df Residuals: | 651 | BIC: | 10220.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 60.0993 | 32.131 | 1.870 | 0.062 | -2.994 | 123.192 |
| arrondissement_1.0 | 205.7460 | 46.081 | 4.465 | 0.000 | 115.261 | 296.231 |
| arrondissement_2.0 | 58.0138 | 49.317 | 1.176 | 0.240 | -38.826 | 154.853 |
| arrondissement_3.0 | -38.8437 | 42.975 | -0.904 | 0.366 | -123.229 | 45.542 |
| arrondissement_4.0 | 140.2358 | 43.127 | 3.252 | 0.001 | 55.551 | 224.921 |
| arrondissement_10.0 | -305.0527 | 43.395 | -7.030 | 0.000 | -390.263 | -219.842 |
| surface | 32.8442 | 0.549 | 59.868 | 0.000 | 31.767 | 33.921 |

| | | | |
|---|---|---|---|
| Omnibus: | 342.332 | Durbin-Watson: | 1.924 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 5037.096 |
| Skew: | 1.963 | Prob(JB): | 0.0 |
| Kurtosis: | 15.984 | Cond. No. | 4.93e+17 |


Looking at the "coef" column, one would normally expect the same results as with the "regressor" model fitted earlier. As explained above, the difference is caused by the collinearity among the dummy variables. We will see that, once one variable has been eliminated, the results of "regressor", "theta" and "OLS Regression" become identical.

The middle table has a column showing the p-value of each coefficient. The first variable to remove is the one with the largest p-value, provided it exceeds 5%; here this is "arrondissement_3.0" (p = 0.366).

### Elimination of "arrondissement_3.0"

#### Results with "regressor"

In [10]:
X_opt = X_train[['arrondissement_1.0', 'arrondissement_2.0', 
                 'arrondissement_4.0', 'arrondissement_10.0', 'surface']]

regressor = LinearRegression()
regressor.fit(X_opt, y_train)

print('Const: \n', regressor.intercept_)
print('Coefficients: \n', regressor.coef_)

Const: 
 21.25560346291195
Coefficients: 
 [ 244.58970026   96.85743072  179.07947836 -266.20902165   32.84421801]


#### Results with "theta"

In [11]:
X_opt = X_train[['const', 'arrondissement_1.0', 'arrondissement_2.0', 
                 'arrondissement_4.0', 'arrondissement_10.0', 'surface']]

theta = np.linalg.inv(X_opt.T.dot(X_opt)).dot(X_opt.T).dot(y_train)
print(theta)

[  21.25560346  244.58970026   96.85743072  179.07947836 -266.20902165
   32.84421801]


#### Results with "regressor_OLS"

In [12]:
regressor_OLS = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.86 |
| Model: | OLS | Adj. R-squared: | 0.859 |
| Method: | Least Squares | F-statistic: | 800.5 |
| Date: | Fri, 28 Sep 2018 | Prob (F-statistic): | 3.24e-275 |
| Time: | 14:33:52 | Log-Likelihood: | -5090.1 |
| No. Observations: | 657 | AIC: | 10190.0 |
| Df Residuals: | 651 | BIC: | 10220.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 21.2556 | 56.287 | 0.378 | 0.706 | -89.270 | 131.782 |
| arrondissement_1.0 | 244.5897 | 68.606 | 3.565 | 0.000 | 109.875 | 379.305 |
| arrondissement_2.0 | 96.8574 | 73.392 | 1.320 | 0.187 | -47.257 | 240.972 |
| arrondissement_4.0 | 179.0795 | 66.636 | 2.687 | 0.007 | 48.232 | 309.927 |
| arrondissement_10.0 | -266.2090 | 67.506 | -3.943 | 0.000 | -398.765 | -133.653 |
| surface | 32.8442 | 0.549 | 59.868 | 0.000 | 31.767 | 33.921 |

| | | | |
|---|---|---|---|
| Omnibus: | 342.332 | Durbin-Watson: | 1.924 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 5037.096 |
| Skew: | 1.963 | Prob(JB): | 0.0 |
| Kurtosis: | 15.984 | Cond. No. | 374.0 |


The three methods now give identical results, since the cause of the collinearity has been eliminated. To go further and obtain a more compact model, we continue eliminating non-significant variables by examining their p-values. The next variable to eliminate is the constant (p = 0.706). From here on, we use only the OLS regression method.

### Elimination of the constant

In [13]:
X_opt = X_train[['arrondissement_1.0', 'arrondissement_2.0', 'arrondissement_4.0',
                 'arrondissement_10.0', 'surface']]
regressor_OLS = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.949 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 2419.0 |
| Date: | Fri, 28 Sep 2018 | Prob (F-statistic): | 0.0 |
| Time: | 14:33:52 | Log-Likelihood: | -5090.2 |
| No. Observations: | 657 | AIC: | 10190.0 |
| Df Residuals: | 652 | BIC: | 10210.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| arrondissement_1.0 | 257.5807 | 59.318 | 4.342 | 0.000 | 141.103 | 374.058 |
| arrondissement_2.0 | 112.6946 | 60.190 | 1.872 | 0.062 | -5.495 | 230.884 |
| arrondissement_4.0 | 193.4490 | 54.667 | 3.539 | 0.000 | 86.104 | 300.794 |
| arrondissement_10.0 | -250.5037 | 53.138 | -4.714 | 0.000 | -354.846 | -146.161 |
| surface | 32.9569 | 0.460 | 71.629 | 0.000 | 32.053 | 33.860 |

| | | | |
|---|---|---|---|
| Omnibus: | 334.023 | Durbin-Watson: | 1.921 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4804.201 |
| Skew: | 1.906 | Prob(JB): | 0.0 |
| Kurtosis: | 15.687 | Cond. No. | 238.0 |


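If we continued the elimination procedure, the next variable to remove would be "arrondissement_2.0" (p = 0.062). However, this variable may still be significant, as its p-value is only slightly above 5%. When eliminating variables, we should therefore also monitor "R-squared" and "Adj. R-squared". R-squared cannot increase when a variable is removed from a model, so "Adj. R-squared" is the more informative of the two: if it increases or stays the same, the model is no worse without the variable; if it drops, it is better to keep the variable and stop the elimination procedure. Note that the jump in R-squared from 0.86 to 0.949 after removing the constant is a special case: without a constant, statsmodels reports the uncentered R-squared, which is not comparable with the centered value reported before.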
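
The centered R-squared (used when the model has a constant) and the uncentered one (used when it does not) can differ widely for the same residuals. Here is a minimal sketch on synthetic data, which is an assumption for illustration and not the rent dataset: it fits a line through the origin and computes both quantities by hand with NumPy.

```python
import numpy as np

# Synthetic data (illustrative assumption): y has a large mean relative to its spread
rng = np.random.default_rng(0)
x = rng.uniform(20, 100, size=200)
y = 30 * x + rng.normal(0, 200, size=200)

# Least-squares fit of y = b * x, without an intercept
b = (x @ y) / (x @ x)
residuals = y - b * x
ss_res = residuals @ residuals

# Centered R-squared: total variation measured around the mean of y
r2_centered = 1 - ss_res / np.sum((y - y.mean()) ** 2)
# Uncentered R-squared: total variation measured around zero
r2_uncentered = 1 - ss_res / np.sum(y ** 2)

print(round(r2_centered, 3), round(r2_uncentered, 3))
```

Because the sum of squares around zero is never smaller than the sum of squares around the mean, the uncentered value is always at least as large as the centered one for the same fit.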

Let's see what the elimination of "arrondissement_2.0" gives.

### Elimination of "arrondissement_2.0"

In [14]:
X_opt = X_train[['arrondissement_1.0', 'arrondissement_4.0', 
                 'arrondissement_10.0', 'surface']]
regressor_OLS = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.949 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 3011.0 |
| Date: | Fri, 28 Sep 2018 | Prob (F-statistic): | 0.0 |
| Time: | 14:33:52 | Log-Likelihood: | -5091.9 |
| No. Observations: | 657 | AIC: | 10190.0 |
| Df Residuals: | 653 | BIC: | 10210.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| arrondissement_1.0 | 234.3441 | 58.117 | 4.032 | 0.000 | 120.226 | 348.462 |
| arrondissement_4.0 | 174.0883 | 53.783 | 3.237 | 0.001 | 68.480 | 279.697 |
| arrondissement_10.0 | -266.1085 | 52.581 | -5.061 | 0.000 | -369.357 | -162.860 |
| surface | 33.2736 | 0.429 | 77.616 | 0.000 | 32.432 | 34.115 |

| | | | |
|---|---|---|---|
| Omnibus: | 306.362 | Durbin-Watson: | 1.929 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4074.289 |
| Skew: | 1.722 | Prob(JB): | 0.0 |
| Kurtosis: | 14.704 | Cond. No. | 215.0 |


The values of "R-squared" and "Adj. R-squared" did not change when "arrondissement_2.0" was eliminated, so the model has kept its quality. Since this variable has a p-value (0.062) slightly above 5% and its elimination has not degraded the model, we eliminate it and keep the last model, which uses only "arrondissement_1.0", "arrondissement_4.0", "arrondissement_10.0" and "surface".
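
The manual procedure followed in this section can be sketched as a loop. This is only an illustration under stated assumptions, not the exact procedure used above: it runs on synthetic data, uses NumPy alone, applies only the p-value rule (it does not monitor Adj. R-squared), and approximates the p-values with the normal distribution, whereas statsmodels uses the Student t distribution. The names `ols_pvalues` and `backward_elimination` are hypothetical helpers.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Return OLS coefficients and approximate two-sided p-values.

    Assumption: p-values use the normal approximation to the t distribution,
    which is reasonable for large samples.
    """
    n, k = X.shape
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ theta
    sigma2 = residuals @ residuals / (n - k)
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    t_stats = theta / std_err
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return theta, p_values

def backward_elimination(X, y, names, threshold=0.05):
    """Drop the least significant column until every p-value is <= threshold
    (or only one column is left)."""
    names = list(names)
    while True:
        theta, p_values = ols_pvalues(X, y)
        worst = int(np.argmax(p_values))
        if p_values[worst] <= threshold or X.shape[1] == 1:
            break
        # Remove the column with the largest p-value and refit
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names, theta

# Synthetic data (hypothetical): the 'noise' column has no effect on y
rng = np.random.default_rng(0)
n = 500
surface = rng.uniform(20, 100, n)
central = rng.integers(0, 2, n).astype(float)
noise = rng.normal(size=n)
y = 50 + 30 * surface + 200 * central + rng.normal(0, 50, n)

X = np.column_stack([np.ones(n), surface, central, noise])
kept, theta = backward_elimination(X, y, ['const', 'surface', 'central', 'noise'])
print(kept)
```

The informative columns are retained, while a column unrelated to the target is usually removed; by construction of the 5% threshold, such a column can still survive by chance in a small fraction of runs.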

## Interpretations

Now consider the coefficients of these variables in order to estimate the rent in Paris. First, the rent per square meter is estimated at 33.27 euros. This is the whole estimate for "arrondissement_2.0" and "arrondissement_3.0", since these two variables were eliminated and have no influence on the rent in the model. For example, a 27 m² studio would rent for 27 × 33.2736 ≈ 898 euros, or roughly 900 euros, in these two districts.

Rent is more expensive in "arrondissement_1.0", followed by "arrondissement_4.0". The same studio would rent for about 900 + 234.34 ≈ 1134 euros and 900 + 174.09 ≈ 1074 euros in "arrondissement_1.0" and "arrondissement_4.0" respectively. These are presumably the upscale districts of Paris.

On the other hand, "arrondissement_10.0" apparently suffers from a poor reputation: the rent in this borough is about 266.11 euros lower. The same studio would rent for about 900 - 266.11 ≈ 634 euros. This district may be a more working-class ("populaire") borough.
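
The arithmetic above can be reproduced with a small helper. This is a minimal sketch: the coefficients are copied from the final OLS summary, while the function `estimate_rent` and the constant names are hypothetical.

```python
# Coefficients copied from the final OLS summary
COEFFICIENTS = {
    'arrondissement_1.0': 234.3441,
    'arrondissement_4.0': 174.0883,
    'arrondissement_10.0': -266.1085,
}
PRICE_PER_M2 = 33.2736

def estimate_rent(surface, arrondissement=None):
    """Estimated rent in euros; arrondissements absent from the
    final model (2 and 3) get no adjustment."""
    return PRICE_PER_M2 * surface + COEFFICIENTS.get(arrondissement, 0.0)

for name in [None, 'arrondissement_1.0', 'arrondissement_4.0', 'arrondissement_10.0']:
    print(name, round(estimate_rent(27, name), 2))
```

The exact figures come out slightly below the rounded ones in the text, because the text rounds the base rent up to 900 euros before adding the borough adjustments.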