# Verifying the Assumptions of Linear Regression in Python and R
https://towardsdatascience.com/verifying-the-assumptions-of-linear-regression-in-python-and-r-f4cd2907d4c0

In [1]:
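import pandas as pd
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release.
from sklearn.datasets import load_boston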

In [2]:
boston = load_boston()
X = pd.DataFrame(boston.data, columns = boston.feature_names)

# The focus is on checking the assumptions of linear regression,
# so for simplicity we drop CHAS, which is a categorical variable

X.drop('CHAS', axis=1, inplace=True)

# The response variable is the median home value across Boston neighborhoods
y = pd.Series(boston.target, name='MEDV')


### Inspect the independent variables (the column vectors that make up the design matrix)

In [3]:
X.sample(5)

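| | CRIM | ZN | INDUS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 0.06263 | 0.0 | 11.93 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 |
| 403 | 24.8017 | 0.0 | 18.1 | 0.693 | 5.349 | 96.0 | 1.7028 | 24.0 | 666.0 | 20.2 | 396.9 | 19.77 |
| 457 | 8.20058 | 0.0 | 18.1 | 0.713 | 5.936 | 80.3 | 2.7792 | 24.0 | 666.0 | 20.2 | 3.5 | 16.94 |
| 236 | 0.52058 | 0.0 | 6.2 | 0.507 | 6.631 | 76.5 | 4.148 | 8.0 | 307.0 | 17.4 | 388.45 | 9.54 |
| 186 | 0.05602 | 0.0 | 2.46 | 0.488 | 7.831 | 53.6 | 3.1992 | 3.0 | 193.0 | 17.8 | 392.63 | 4.45 |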


### The canonical way to fit a linear regression in Python with scikit-learn

In [4]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X,y)

# Beta coefficients
print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')

# Coefficient of determination
print(f'R^2 score: {lin_reg.score(X, y)}')

Coefficients: [-1.13139078e-01  4.70524578e-02  4.03114536e-02 -1.73669994e+01
  3.85049169e+00  2.78375651e-03 -1.48537390e+00  3.28311011e-01
 -1.37558288e-02 -9.90958031e-01  9.74145094e-03 -5.34157620e-01]
Intercept: 36.89195979693275
R^2 score: 0.7355165089722999
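
The three numbers printed above can be reproduced by hand, which helps make clear what the fit is doing. The following is a minimal sketch, not the notebook's own code: it uses synthetic data (the names `X_demo`, `y_demo`, and `true_coef` are made up for illustration) and NumPy's least-squares solver in place of `LinearRegression`.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y_demo = 4.0 + X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# Design matrix: a leading column of ones carries the intercept.
design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution of design @ beta = y_demo (in the least-squares sense).
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
intercept_demo, coef_demo = beta[0], beta[1:]

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
residuals = y_demo - design @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f'Coefficients: {coef_demo}')
print(f'Intercept: {intercept_demo}')
print(f'R^2 score: {r2_demo}')
```

On data like this the recovered coefficients should land close to `true_coef` and the intercept close to 4.0, since the noise is small relative to the signal.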


# In R

In [5]:
%load_ext rpy2.ipython

OSError: cannot load library 'C:\PROGRA~1\R\R-36~1.1\bin\x64\R.dll': error 0x7e

### For more detailed output, use the statsmodels library
- With this library, the intercept column (a column vector of ones) must be added to the design matrix explicitly

In [9]:
import statsmodels.api as sm

# Add the column vector of ones to the design matrix
X_constante = sm.add_constant(X)


In [11]:
X_constante.sample(5)

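| | const | CRIM | ZN | INDUS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 221 | 1.0 | 0.40771 | 0.0 | 6.2 | 0.507 | 6.164 | 91.3 | 3.048 | 8.0 | 307.0 | 17.4 | 395.24 | 21.46 |
| 216 | 1.0 | 0.0456 | 0.0 | 13.89 | 0.55 | 5.888 | 56.0 | 3.1121 | 5.0 | 276.0 | 16.4 | 392.8 | 13.51 |
| 64 | 1.0 | 0.01951 | 17.5 | 1.38 | 0.4161 | 7.104 | 59.5 | 9.2229 | 3.0 | 216.0 | 18.6 | 393.24 | 8.05 |
| 425 | 1.0 | 15.8603 | 0.0 | 18.1 | 0.679 | 5.896 | 95.4 | 1.9096 | 24.0 | 666.0 | 20.2 | 7.68 | 24.39 |
| 448 | 1.0 | 9.32909 | 0.0 | 18.1 | 0.713 | 6.185 | 98.7 | 2.2616 | 24.0 | 666.0 | 20.2 | 396.9 | 18.13 |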
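
To see why the column of ones matters, it may help to compare a fit with and without it. The sketch below is only an illustration on toy data (the line `y = 10 + 2x` and all variable names are assumptions, not part of the notebook), and it uses NumPy's least-squares solver rather than statsmodels.

```python
import numpy as np

# Toy data (assumed for illustration): y = 10 + 2x exactly.
x = np.linspace(1.0, 5.0, 20)
y_toy = 10.0 + 2.0 * x

# Without a column of ones the fitted line is forced through the origin.
no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(no_const, y_toy, rcond=None)

# With the column of ones, the first coefficient is the intercept.
with_const = np.column_stack([np.ones_like(x), x])
beta_toy, *_ = np.linalg.lstsq(with_const, y_toy, rcond=None)

print(f'No constant:   slope = {slope_only[0]:.3f}')
print(f'With constant: intercept = {beta_toy[0]:.3f}, slope = {beta_toy[1]:.3f}')
```

Without the constant, the slope has to absorb the missing intercept and comes out larger than the true value of 2. With the constant, both parameters are recovered.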


In [14]:
# Fit the linear regression with the statsmodels library
lin_reg = sm.OLS(y, X_constante).fit()
lin_reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 0.736 |
| Model: | OLS | Adj. R-squared: | 0.729 |
| Method: | Least Squares | F-statistic: | 114.3 |
| Date: | Mon, 10 Aug 2020 | Prob (F-statistic): | 7.30e-134 |
| Time: | 21:58:12 | Log-Likelihood: | -1503.8 |
| No. Observations: | 506 | AIC: | 3034.0 |
| Df Residuals: | 493 | BIC: | 3088.0 |
| Df Model: | 12 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 36.8920 | 5.147 | 7.168 | 0.000 | 26.780 | 47.004 |
| CRIM | -0.1131 | 0.033 | -3.417 | 0.001 | -0.178 | -0.048 |
| ZN | 0.0471 | 0.014 | 3.398 | 0.001 | 0.020 | 0.074 |
| INDUS | 0.0403 | 0.062 | 0.653 | 0.514 | -0.081 | 0.162 |
| NOX | -17.3670 | 3.851 | -4.509 | 0.000 | -24.934 | -9.800 |
| RM | 3.8505 | 0.421 | 9.137 | 0.000 | 3.023 | 4.678 |
| AGE | 0.0028 | 0.013 | 0.209 | 0.834 | -0.023 | 0.029 |
| DIS | -1.4854 | 0.201 | -7.383 | 0.000 | -1.881 | -1.090 |
| RAD | 0.3283 | 0.067 | 4.934 | 0.000 | 0.198 | 0.459 |

| | | | |
|---|---|---|---|
| Omnibus: | 190.856 | Durbin-Watson: | 1.016 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 898.352 |
| Skew: | 1.619 | Prob(JB): | 8.42e-196 |
| Kurtosis: | 8.668 | Cond. No. | 15100.0 |


#### The null hypothesis for each coefficient in a linear regression is that the coefficient is zero. If the p-value computed for a coefficient is below 0.05 (the significance level assumed here), we reject the null hypothesis and conclude that the coefficient differs from zero.
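
The decision rule can be sketched in a few lines. The example below takes three t statistics from the summary table above and turns them into two-sided p-values. It is a rough illustration only: statsmodels uses a Student t distribution, whereas this sketch uses a standard normal approximation (reasonable here because there are 493 residual degrees of freedom), and the helper `two_sided_p_normal` is a made-up name.

```python
import math

ALPHA = 0.05  # significance level assumed in the text

# t statistics copied from the statsmodels summary above.
t_stats = {'INDUS': 0.653, 'RM': 9.137, 'AGE': 0.209}


def two_sided_p_normal(t):
    """Two-sided p-value under a standard normal approximation.

    This is an assumption made to avoid extra dependencies; the exact
    value would come from a Student t distribution with 493 degrees
    of freedom.
    """
    return math.erfc(abs(t) / math.sqrt(2.0))


p_values = {name: two_sided_p_normal(t) for name, t in t_stats.items()}
reject_null = {name: p < ALPHA for name, p in p_values.items()}

for name in t_stats:
    verdict = 'reject H0' if reject_null[name] else 'fail to reject H0'
    print(f'{name}: p = {p_values[name]:.3f} -> {verdict}')
```

Under this rule RM is judged different from zero, while INDUS and AGE are not, which agrees with the `P>|t|` column of the summary.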

##### According to the Gauss-Markov theorem, applying ordinary least squares to a linear regression yields the BLUE (Best Linear Unbiased Estimator) if:
- The mean of the errors (residuals) is zero
- The errors are uncorrelated: $\mathrm{Cov}(e_i, e_j) = 0$ for $i \neq j$
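
The conditions listed above can be checked numerically on the residuals of a fitted model. The sketch below is one possible way to do it, under stated assumptions: it uses synthetic data with independent errors (all names are illustrative), fits by least squares with NumPy, and uses the Durbin-Watson statistic as a check for lag-1 autocorrelation only, not for correlation in general.

```python
import numpy as np

# Synthetic data with independent errors (an assumption for illustration).
rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
y_sim = 1.0 + 2.0 * x + rng.normal(size=n)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(design, y_sim, rcond=None)
resid = y_sim - design @ beta

# Check 1: with an intercept in the model, residuals average to zero
# (up to floating-point error).
resid_mean = resid.mean()

# Check 2: Durbin-Watson statistic. Values near 2 suggest no lag-1
# autocorrelation; values toward 0 suggest positive autocorrelation,
# values toward 4 suggest negative autocorrelation.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

print(f'Mean of residuals: {resid_mean:.2e}')
print(f'Durbin-Watson: {dw:.3f}')
```

For comparison, the statsmodels summary earlier reports a Durbin-Watson value of 1.016, well below 2, which hints at positive autocorrelation between consecutive rows. statsmodels also provides a `durbin_watson` function in `statsmodels.stats.stattools` that computes the same statistic.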