# Multiple Linear Regression with sklearn

Given a real estate dataset, we create a multiple linear regression. The dependent variable is 'price', while the independent variables are 'size' and 'year'.


## Relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

## Load the data

In [3]:
df = pd.read_csv("real_estate_price_size_year.csv")
df.iloc[0]

price    234314.144
size        643.090
year       2015.000
Name: 0, dtype: float64

In [4]:
df.describe(include="all")

|       | price         | size       | year     |
|-------|---------------|------------|----------|
| count | 100.0         | 100.0      | 100.0    |
| mean  | 292289.47016  | 853.0242   | 2012.6   |
| std   | 77051.727525  | 297.941951 | 4.729021 |
| min   | 154282.128    | 479.75     | 2006.0   |
| 25%   | 234280.148    | 643.33     | 2009.0   |
| 50%   | 280590.716    | 696.405    | 2015.0   |
| 75%   | 335723.696    | 1029.3225  | 2018.0   |
| max   | 500681.128    | 1842.51    | 2018.0   |


## Create the regression

### Declare the dependent and the independent variables

In [5]:
x = df[["size","year"]]
y = df["price"]

### Regression

In [11]:
reg = LinearRegression().fit(x,y)


### Intercept

In [12]:
reg.intercept_

-5772267.01746328

### Coefficients

In [13]:
reg.coef_

array([ 227.70085401, 2916.78532684])

### R-squared

In [10]:
reg.score(x,y)

0.7764803683276792

### Calculate the Adjusted R-squared

In [14]:
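def adj_r2(x, y):
    # Adjusted R-squared penalizes R-squared for the number of predictors
    r2 = reg.score(x, y)
    n = x.shape[0]  # number of observations
    p = x.shape[1]  # number of predictors
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return adjusted_r2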

In [15]:
adj_r2(x,y)

0.7718717161282499
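
As a sanity check, the same formula can be evaluated by hand with the numbers reported above: the R-squared of about 0.7765, the 100 observations from the `describe` output, and 2 predictors. The helper name `adjusted_r2_from_values` is hypothetical, introduced only for this sketch.

```python
def adjusted_r2_from_values(r2, n, p):
    """Adjusted R-squared from a plain R-squared, n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values taken from the outputs above: R-squared, row count, number of features
value = adjusted_r2_from_values(0.7764803683276792, n=100, p=2)
print(round(value, 4))  # 0.7719
```

Because the helper takes plain numbers rather than a fitted model, it does not depend on the global `reg` object.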

### R-squared and the Adjusted R-squared

The adjusted R-squared (0.772) is only slightly lower than the R-squared (0.776), so there is only a small penalty for including 2 independent variables.

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

The independent variable "year" could be removed from the equation, since it adds little explanatory value to the model.
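
One way to back this comparison with numbers is to fit the simple regression (price on size only) and the multiple regression side by side, then compare their adjusted R-squared values. The sketch below is a minimal illustration, not the notebook's own code: `adjusted_r2_for` is a hypothetical helper, and the small data frame is synthetic stand-in data so that the block runs on its own. With the real data, pass `df` instead of `demo`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def adjusted_r2_for(data, feature_cols, target_col):
    """Fit a linear regression on the given columns and return its adjusted R-squared."""
    features = data[feature_cols]
    target = data[target_col]
    model = LinearRegression().fit(features, target)
    r2 = model.score(features, target)
    n, p = features.shape
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Synthetic stand-in data (NOT the real estate dataset), only to make the sketch runnable
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "size": rng.uniform(500, 1800, 100),
    "year": rng.integers(2006, 2019, 100),
})
demo["price"] = 200 * demo["size"] + 3000 * demo["year"] + rng.normal(0, 40000, 100)

simple = adjusted_r2_for(demo, ["size"], "price")
multiple = adjusted_r2_for(demo, ["size", "year"], "price")
print(f"simple: {simple:.3f}, multiple: {multiple:.3f}")
```

If the multiple regression's adjusted R-squared is not clearly higher than the simple one's, the extra variable is adding little.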

### Predictions

Predict the price of a 750 sq. ft. apartment from 2009.

In [19]:
reg.predict([[750,2009]])

array([258330.34465995])
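
The prediction is just the fitted equation evaluated at the new point: the intercept plus each coefficient times its feature value. A quick check using the intercept and coefficients printed above (the variable names here are introduced for illustration):

```python
# Intercept and coefficients copied from the outputs of reg.intercept_ and reg.coef_
intercept = -5772267.01746328
coef_size, coef_year = 227.70085401, 2916.78532684

# price = intercept + coef_size * size + coef_year * year
manual_prediction = intercept + coef_size * 750 + coef_year * 2009
print(round(manual_prediction))  # 258330
```

The result agrees with `reg.predict` up to the rounding of the printed coefficients.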

### The univariate p-values of the variables

In [20]:
from sklearn.feature_selection import f_regression

In [29]:
print("F-score & P-value")
f_regression(x,y)

F-score & P-value


(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [30]:
p_values = f_regression(x,y)[1]
p_values

array([8.12763222e-31, 3.57340758e-01])

In [31]:
p_values.round(3)

array([0.   , 0.357])
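
`f_regression` scores each feature on its own, as if it were the only predictor in a simple linear regression, which is why these p-values are called univariate. The sketch below illustrates this on synthetic data (not the real estate dataset) by comparing them with the slope p-values from `scipy.stats.linregress`. The use of SciPy is an addition for illustration only.

```python
import numpy as np
from scipy.stats import linregress
from sklearn.feature_selection import f_regression

# Synthetic data: the first feature drives y, the second is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] + rng.normal(size=50)

# Univariate p-values from sklearn, one per feature
_, p_univariate = f_regression(X, y)

# p-value of the slope in a separate simple regression for each feature
p_simple = np.array([linregress(X[:, j], y).pvalue for j in range(X.shape[1])])

print(np.allclose(p_univariate, p_simple))
```

A feature can therefore look insignificant here and still matter once the other variables are taken into account, so these values are best read as a screening aid.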

### Summary table 

In [32]:
reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features'])
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary

|   | Features | Coefficients | p-values |
|---|----------|--------------|----------|
| 0 | size     | 227.700854   | 0.0      |
| 1 | year     | 2916.785327  | 0.357    |


'year' is not even significant on this univariate test (p-value 0.357, well above 0.05), therefore we should remove it from the model.