# Multiple Linear Regression
- Root mean squared error (RMSE): The square root of the mean squared error. A metric for comparing regression models.
- Residual standard error (RSE): The same as RMSE, but adjusted for the degrees of freedom.
- R-squared (coefficient of determination, r2): The proportion of the variance explained by the model, ranging from 0 to 1.
- t-statistic: The coefficient for a predictor divided by the standard error of that coefficient. It gives a metric for comparing the importance of the variables in the model. "The higher the t-statistic and the lower the p-value, the more significant the predictor." Here "higher" refers to the absolute value of t.
- Weighted regression: Regression in which the records have different weights.
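
Weighted regression is the only term in this list that the notebook does not run on the house data, so here is a minimal sketch of the idea on made-up numbers. The toy arrays and the helper `weighted_ols` are hypothetical and exist only for illustration. The helper minimizes the weighted sum of squared residuals by scaling each row by the square root of its weight. In scikit-learn, the `sample_weight` argument of `LinearRegression.fit` serves the same purpose.

```python
import numpy as np

# Hypothetical toy data: y = 1 + 2x for the first four records,
# while the last record is an outlier.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 30.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column plus x


def weighted_ols(X, y, w):
    """Minimize sum(w_i * residual_i**2) by scaling rows with sqrt(w_i)."""
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef


# Equal weights reproduce ordinary least squares.
equal_fit = weighted_ols(X, y, np.ones_like(y))

# Giving the outlier a weight of zero removes its influence entirely.
downweighted_fit = weighted_ols(X, y, np.array([1.0, 1.0, 1.0, 1.0, 0.0]))

print("equal weights      (intercept, slope):", equal_fit)
print("outlier weighted 0 (intercept, slope):", downweighted_fit)
```

With equal weights the outlier pulls the slope up to about 6.2. With the outlier weighted to zero the fit recovers the intercept 1 and slope 2 of the remaining records.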

In [9]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score

In [10]:
data = pd.read_csv("../Datasets/house_sales.csv",delimiter="\t")

In [11]:
data

Unnamed: 0,DocumentDate,SalePrice,PropertyID,PropertyType,ym,zhvi_px,zhvi_idx,AdjSalePrice,NbrLivingUnits,SqFtLot,...,Bathrooms,Bedrooms,BldgGrade,YrBuilt,YrRenovated,TrafficNoise,LandVal,ImpsVal,ZipCode,NewConstruction
1,2014-09-16,280000,1000102,Multiplex,2014-09-01,405100,0.930836,300805.0,2,9373,...,3.00,6,7,1991,0,0,70000,229000,98002,False
2,2006-06-16,1000000,1200013,Single Family,2006-06-01,404400,0.929228,1076162.0,1,20156,...,3.75,4,10,2005,0,0,203000,590000,98166,True
3,2007-01-29,745000,1200019,Single Family,2007-01-01,425600,0.977941,761805.0,1,26036,...,1.75,4,8,1947,0,0,183000,275000,98166,False
4,2008-02-25,425000,2800016,Single Family,2008-02-01,418400,0.961397,442065.0,1,8618,...,3.75,5,7,1966,0,0,104000,229000,98168,False
5,2013-03-29,240000,2800024,Single Family,2013-03-01,351600,0.807904,297065.0,1,8620,...,1.75,4,7,1948,0,0,104000,205000,98168,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27057,2011-04-08,325000,9842300710,Single Family,2011-04-01,318700,0.732307,443803.0,1,5468,...,1.75,3,7,1951,0,0,201000,172000,98126,False
27058,2007-09-28,1580000,9845500010,Single Family,2007-09-01,433500,0.996094,1586196.0,1,23914,...,4.50,4,11,2000,0,1,703000,951000,98040,False
27061,2012-07-09,165000,9899200010,Single Family,2012-07-01,325300,0.747472,220744.0,1,11170,...,1.00,4,6,1971,0,0,92000,130000,98055,False
27062,2006-05-26,315000,9900000355,Single Family,2006-05-01,400600,0.920496,342207.0,1,6223,...,2.00,3,7,1939,0,0,103000,212000,98166,False


In [12]:
subset = ['AdjSalePrice','SqFtTotLiving','SqFtLot','Bathrooms','Bedrooms']
data = data[subset]
data

Unnamed: 0,AdjSalePrice,SqFtTotLiving,SqFtLot,Bathrooms,Bedrooms
1,300805.0,2400,9373,3.00,6
2,1076162.0,3764,20156,3.75,4
3,761805.0,2060,26036,1.75,4
4,442065.0,3200,8618,3.75,5
5,297065.0,1720,8620,1.75,4
...,...,...,...,...,...
27057,443803.0,1480,5468,1.75,3
27058,1586196.0,4720,23914,4.50,4
27061,220744.0,1070,11170,1.00,4
27062,342207.0,1345,6223,2.00,3


In [13]:
predictors = ['SqFtTotLiving','SqFtLot','Bathrooms','Bedrooms']
outcome = 'AdjSalePrice'

house_lm = LinearRegression()
house_lm.fit(data[predictors],data[outcome])

LinearRegression()

In [14]:
print('Intercept',house_lm.intercept_)
for name, coef in zip(predictors,house_lm.coef_):
    print(name,coef)

Intercept 96960.38147619145
SqFtTotLiving 327.82525951863136
SqFtLot -0.08468587956200668
Bathrooms 13256.963751401256
Bedrooms -71714.7853048912


In [15]:
fitted = house_lm.predict(data[predictors])

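# RMSE: square root of the mean squared error
RMSE = np.sqrt(mean_squared_error(data[outcome],fitted))

# R-squared: proportion of the variance explained by the model
r2 = r2_score(data[outcome],fitted)

# RSE: like RMSE, but divides by the residual degrees of freedom (n - p - 1)
RSE = np.sqrt(np.sum((data[outcome]-fitted)**2)/(data[outcome].size-1-len(predictors)))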
print(RMSE)
print(RSE)
print(r2)

272275.5661481364
272305.5745868494
0.5008781108521302
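
RMSE and RSE are nearly identical here because they differ only in the denominator: n for RMSE and n - p - 1 for RSE. The sketch below checks that relationship using only numbers printed in this notebook. It assumes n = 22687, the number of observations reported in the statsmodels summary, and p = 4 predictors.

```python
import math

n = 22687                  # observations, as reported in the OLS summary
p = 4                      # predictors: SqFtTotLiving, SqFtLot, Bathrooms, Bedrooms
rmse = 272275.5661481364   # RMSE printed by the notebook

# Residual degrees of freedom: one is used per predictor, plus one for the intercept.
df_resid = n - p - 1

# Both metrics share the same sum of squared residuals, so
# RSE = RMSE * sqrt(n / (n - p - 1)).
rse = rmse * math.sqrt(n / df_resid)

print(df_resid)
print(rse)
```

The result agrees with the RSE printed by the notebook (about 272305.57), and `df_resid` matches the "Df Residuals" value of 22682 in the summary. With more than twenty thousand records and only four predictors, the adjustment is tiny.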


In [16]:
# Predict the value of a house with these characteristics:
# 'SqFtTotLiving': 2000
# 'SqFtLot': 9000
# 'Bathrooms': 4
# 'Bedrooms': 5

In [17]:
predictionData = pd.DataFrame({
    'SqFtTotLiving':[2000,3000,2000],
    'SqFtLot':[9000,9000,9000],
    'Bathrooms':[4,4,14],
    'Bedrooms':[5,5,10]
})

house_lm.predict(predictionData)

array([446302.65607855, 774127.91559718, 220298.3670681 ])

In [18]:
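# Extra price attributed to 1000 additional square feet of living area
# (SqFtTotLiving coefficient times 1000)
precioextra = 327.82525951863136*1000

# Subtracting it from the second prediction recovers the first prediction
774127.915597-precioextra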

446302.65607836866
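
The subtraction above works because a linear model's prediction is the intercept plus the sum of each coefficient times its predictor value. As an illustration, the sketch below rebuilds the three predictions by hand from the coefficients printed earlier in this notebook. Nothing is refit, and the variable names are only for this example.

```python
import numpy as np

# Intercept and coefficients as printed by the notebook, in the order
# SqFtTotLiving, SqFtLot, Bathrooms, Bedrooms.
intercept = 96960.38147619145
coefs = np.array([
    327.82525951863136,
    -0.08468587956200668,
    13256.963751401256,
    -71714.7853048912,
])

# The same three houses passed to house_lm.predict.
rows = np.array([
    [2000, 9000, 4, 5],
    [3000, 9000, 4, 5],
    [2000, 9000, 14, 10],
], dtype=float)

# prediction = intercept + sum(coefficient * value)
preds = intercept + rows @ coefs
print(preds)

# The first two houses differ only by 1000 sq ft of living area, so their
# predictions differ by exactly 1000 times the SqFtTotLiving coefficient.
print(preds[1] - preds[0])
```

The hand-computed values match the array returned by `house_lm.predict`, and the gap between the first two houses is about 327825.26.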

In [19]:
model = sm.OLS(data[outcome],data[predictors].assign(const=1))

results = model.fit()
results.summary()

Dep. Variable:,AdjSalePrice,R-squared:,0.501
Model:,OLS,Adj. R-squared:,0.501
Method:,Least Squares,F-statistic:,5690.0
Date:,"Fri, 29 Apr 2022",Prob (F-statistic):,0.0
Time:,18:16:05,Log-Likelihood:,-316110.0
No. Observations:,22687,AIC:,632200.0
Df Residuals:,22682,BIC:,632300.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
SqFtTotLiving,327.8253,3.329,98.476,0.000,321.300,334.350
SqFtLot,-0.0847,0.064,-1.328,0.184,-0.210,0.040
Bathrooms,1.326e+04,3699.503,3.583,0.000,6005.685,2.05e+04
Bedrooms,-7.171e+04,2533.088,-28.311,0.000,-7.67e+04,-6.67e+04
const,9.696e+04,7343.597,13.203,0.000,8.26e+04,1.11e+05

Omnibus:,27750.294,Durbin-Watson:,1.258
Prob(Omnibus):,0.0,Jarque-Bera (JB):,13441212.269
Skew:,6.133,Prob(JB):,0.0
Kurtosis:,121.611,Cond. No.,132000.0
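
To connect the summary back to the definitions at the top, the sketch below recomputes the `t` column as coefficient divided by standard error, and the adjusted R-squared from the plain R-squared. The inputs are copied from this notebook's output. The standard errors are the rounded values shown in the table, so small differences in the last digits are expected.

```python
# Coefficients printed earlier in the notebook (full precision).
coef = {
    "SqFtTotLiving": 327.82525951863136,
    "SqFtLot": -0.08468587956200668,
    "Bathrooms": 13256.963751401256,
    "Bedrooms": -71714.7853048912,
    "const": 96960.38147619145,
}

# Standard errors as displayed (rounded) in the summary table.
std_err = {
    "SqFtTotLiving": 3.329,
    "SqFtLot": 0.064,
    "Bathrooms": 3699.503,
    "Bedrooms": 2533.088,
    "const": 7343.597,
}

# t-statistic = coefficient / standard error of the coefficient
t_stats = {name: coef[name] / std_err[name] for name in coef}
for name, t in t_stats.items():
    print(f"{name:15s} t = {t:9.3f}")

# Adjusted R-squared penalizes R-squared for the number of predictors.
r2 = 0.5008781108521302
n, p = 22687, 4
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))
```

Read by absolute value, `SqFtTotLiving` and `Bedrooms` have the largest t-statistics. `SqFtLot` has a t-statistic near -1.3 and a p-value of 0.184 in the table, so its coefficient is not clearly distinguishable from zero in this model.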
