# Multiple linear regression - Backward Elimination
--------------------

Backward elimination starts from a model built with all the independent variables and removes them one at a time, as long as some variable has a p-value greater than a chosen significance level.
Steps:
* **STEP 1**: Set the significance level (SL/α) required to stay in the model (here α = 0.05)
* **STEP 2**: Fit the model with all the candidate independent variables
* **STEP 3**: Take the independent variable with the largest p-value
    * If p-value > α, go to STEP 4
    * Otherwise go to STEP 5 (end)
* **STEP 4**: Remove the predictor with the largest p-value, refit the model and return to STEP 3
* **STEP 5**: End


The example uses a data set containing the profit of 50 startups in the United States, together with their spending in several areas (R&D, marketing, administration) and their location.

In this example we want to see whether profit depends on all of the variables, on some of them, or on none.
Intuition says that a startup that spends more on R&D will probably have a higher profit, but we want to know how location and the spending on marketing and administration influence profit.

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [2]:

  
# Load the dataset
df = pd.read_csv('dataset/50_Startups.csv')
df.head()


Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [12]:

# Split the dataframe into the independent variables (x) and the dependent variable (y)
x = df[['R&D Spend', 'Administration', 'Marketing Spend', 'State']]
y = df['Profit']
x.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State
0,165349.2,136897.8,471784.1,New York
1,162597.7,151377.59,443898.53,California
2,153441.51,101145.55,407934.54,Florida
3,144372.41,118671.85,383199.62,New York
4,142107.34,91391.77,366168.42,Florida


In [4]:
y.head()

0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64

In [13]:
# Build the dummy variables from the categorical variable State
x = pd.get_dummies(x,columns=["State"],drop_first=True)
x.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State_Florida,State_New York
0,165349.2,136897.8,471784.1,0,1
1,162597.7,151377.59,443898.53,0,0
2,153441.51,101145.55,407934.54,1,0
3,144372.41,118671.85,383199.62,0,1
4,142107.34,91391.77,366168.42,1,0
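
`drop_first=True` drops one category (here `California`, the first in alphabetical order). This keeps the dummy columns from being perfectly collinear with the intercept, a situation often called the dummy variable trap. The following is a small sketch on a hypothetical three-row frame; `demo` is not part of the notebook. `dtype=int` is passed explicitly because pandas 2.x returns boolean columns by default, whereas the output above shows 0/1.

```python
import pandas as pd

demo = pd.DataFrame({"State": ["New York", "California", "Florida"]})

# Without drop_first: one column per category; each row sums to 1.
full = pd.get_dummies(demo, columns=["State"], dtype=int)
print(list(full.columns))

# With drop_first: the first category (California) becomes the baseline
# and is encoded as all zeros.
reduced = pd.get_dummies(demo, columns=["State"], drop_first=True, dtype=int)
print(list(reduced.columns))
print(reduced)
```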


In [39]:
# Split the dataset into training and test data.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

In [40]:
print ("------\nTRAIN\n------")
print(x_train)
print(y_train)
print ("------\nTEST\n------")
print(x_test)
print(y_test)

------
TRAIN
------
    R&D Spend  Administration  Marketing Spend  State_Florida  State_New York
7   130298.13       145530.06        323876.68              1               0
14  119943.24       156547.42        256512.92              1               0
45    1000.23       124153.04          1903.93              0               1
48     542.05        51743.15             0.00              0               1
29   65605.48       153032.06        107138.38              0               1
15  114523.61       122616.84        261776.23              0               1
30   61994.48       115641.28         91131.24              1               0
32   63408.86       129219.61         46085.25              0               0
16   78013.11       121597.55        264346.06              0               0
42   23640.93        96189.63        148001.11              0               0
20   76253.86       113867.30        298664.47              0               0
43   15505.73       127382.30         35534.

In [38]:
# Fit the multiple linear regression model
ls = LinearRegression()
ls.fit(x_train, y_train)



LinearRegression()
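
A model fitted with scikit-learn can be checked against the held-out rows. The following is a minimal sketch on synthetic data: the arrays `X_demo` and `y_demo`, and the coefficients used to generate them, are made up for illustration. With 50 rows and `test_size=0.3`, 35 rows go to training and 15 to test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 2*x0 - 1*x1 + 5 plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 5.0 + rng.normal(scale=1.0, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)         # sizes of the two partitions
print(model.coef_, model.intercept_)  # fitted slopes and intercept
print(model.score(X_te, y_te))        # R^2 on the test rows
```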

In [47]:
import statsmodels.regression.linear_model as sm

# Add a column of 1s to act as the column for the intercept term B0
x['terme_indep'] = 1
#x = np.append(arr = np.ones((50, 1)).astype(int), values = x, axis = 1)
x.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State_Florida,State_New York,terme_indep
0,165349.2,136897.8,471784.1,0,1,1
1,162597.7,151377.59,443898.53,0,0,1
2,153441.51,101145.55,407934.54,1,0,1
3,144372.41,118671.85,383199.62,0,1,1
4,142107.34,91391.77,366168.42,1,0,1
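
Why the column of ones matters: an ordinary least squares fit only uses the columns it is given, so without a constant column the fitted line is forced through the origin. The following is a small numerical sketch with NumPy's least squares solver. The data are made up, generated exactly from y = 2x + 5, and `x_demo` and `y_demo` are hypothetical names.

```python
import numpy as np

x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 2.0 * x_demo + 5.0  # exact line, no noise

# No constant column: the model is y = b1 * x (forced through the origin).
X_no_const = x_demo.reshape(-1, 1)
b_no_const, *_ = np.linalg.lstsq(X_no_const, y_demo, rcond=None)

# With a column of ones: the model is y = b1 * x + b0 * 1.
X_const = np.column_stack([x_demo, np.ones_like(x_demo)])
b_const, *_ = np.linalg.lstsq(X_const, y_demo, rcond=None)

print(b_no_const)  # slope distorted by the missing intercept
print(b_const)     # recovers slope 2 and intercept 5
```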


In [59]:
## x_opt is the set of optimal / significant independent variables
## for predicting y.
x_opt =  x.iloc[:, [0, 1, 2, 3, 4, 5]]
x_opt.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State_Florida,State_New York,terme_indep
0,165349.2,136897.8,471784.1,0,1,1
1,162597.7,151377.59,443898.53,0,0,1
2,153441.51,101145.55,407934.54,1,0,1
3,144372.41,118671.85,383199.62,0,1,1
4,142107.34,91391.77,366168.42,1,0,1


## STEP 1

Initialize the significance level

In [None]:
SL = 0.05

## STEP 2

In [60]:
# OLS = Ordinary Least Squares.
# This OLS is the same technique we used in the regressio_linieal_simple case,
# but here it returns a set of statistics that we will use.
# ENDOG = variable to predict (endogenous, intrinsic)
# EXOG = external variable (exogenous)
# The intercept is not included by default, so we add it through a column of 1s
lr_ols = sm.OLS(endog = y, exog = x_opt ).fit()


## STEP 3

We look for the independent variable with the largest p-value and check whether that value is greater than SL.

In [61]:
lr_ols.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,169.9
Date:,"Sat, 03 Dec 2022",Prob (F-statistic):,1.34e-27
Time:,19:46:53,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1063.0
Df Residuals:,44,BIC:,1074.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
R&D Spend,0.8060,0.046,17.369,0.000,0.712,0.900
Administration,-0.0270,0.052,-0.517,0.608,-0.132,0.078
Marketing Spend,0.0270,0.017,1.574,0.123,-0.008,0.062
State_Florida,198.7888,3371.007,0.059,0.953,-6595.030,6992.607
State_New York,-41.8870,3256.039,-0.013,0.990,-6604.003,6520.229
terme_indep,5.013e+04,6884.820,7.281,0.000,3.62e+04,6.4e+04

0,1,2,3
Omnibus:,14.782,Durbin-Watson:,1.283
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.266
Skew:,-0.948,Prob(JB):,2.41e-05
Kurtosis:,5.572,Cond. No.,1450000.0
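
The check in this step can also be done programmatically. The sketch below copies the `P>|t|` column of the summary above into a pandas Series by hand; in the notebook the same values should be available as `lr_ols.pvalues`. It leaves out the intercept `terme_indep`, on the assumption that the intercept is not a candidate for removal.

```python
import pandas as pd

SL = 0.05

# P>|t| values transcribed from the first summary table (intercept left out).
pvalues = pd.Series({
    "R&D Spend": 0.000,
    "Administration": 0.608,
    "Marketing Spend": 0.123,
    "State_Florida": 0.953,
    "State_New York": 0.990,
})

worst = pvalues.idxmax()  # label of the largest p-value
print(worst, pvalues[worst], pvalues[worst] > SL)
```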


## STEP 4
In the summary we see that `State_New York` has the largest p-value (0.990), which is greater than SL (0.05), so we remove it.

In [69]:
# Remove State_New York
x_opt = x_opt.drop(['State_New York'], axis=1)
x_opt.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State_Florida,terme_indep
0,165349.2,136897.8,471784.1,0,1
1,162597.7,151377.59,443898.53,0,1
2,153441.51,101145.55,407934.54,1,1
3,144372.41,118671.85,383199.62,0,1
4,142107.34,91391.77,366168.42,1,1


## STEP 2
We fit the model again without `State_New York`

In [70]:
lr_ols = sm.OLS(endog = y, exog = x_opt ).fit()

## STEP 3
We check which independent variable has the largest p-value

In [71]:
lr_ols.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.946
Method:,Least Squares,F-statistic:,217.2
Date:,"Sat, 03 Dec 2022",Prob (F-statistic):,8.49e-29
Time:,19:55:58,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1061.0
Df Residuals:,45,BIC:,1070.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
R&D Spend,0.8060,0.046,17.606,0.000,0.714,0.898
Administration,-0.0270,0.052,-0.523,0.604,-0.131,0.077
Marketing Spend,0.0270,0.017,1.592,0.118,-0.007,0.061
State_Florida,220.1585,2900.536,0.076,0.940,-5621.821,6062.138
terme_indep,5.011e+04,6647.870,7.537,0.000,3.67e+04,6.35e+04

0,1,2,3
Omnibus:,14.758,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.172
Skew:,-0.948,Prob(JB):,2.53e-05
Kurtosis:,5.563,Cond. No.,1400000.0


## STEP 4
The summary shows that the variable with the largest p-value is `State_Florida` (0.940), which is greater than SL (0.05), so we remove it

In [72]:
# Remove State_Florida
x_opt = x_opt.drop(['State_Florida'], axis=1)
x_opt.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,terme_indep
0,165349.2,136897.8,471784.1,1
1,162597.7,151377.59,443898.53,1
2,153441.51,101145.55,407934.54,1
3,144372.41,118671.85,383199.62,1
4,142107.34,91391.77,366168.42,1


## STEP 2
We fit the model again without `State_Florida`

In [73]:
lr_ols = sm.OLS(endog = y, exog = x_opt ).fit()

## STEP 3
We check which independent variable has the largest p-value

In [74]:
lr_ols.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,296.0
Date:,"Sat, 03 Dec 2022",Prob (F-statistic):,4.53e-30
Time:,20:27:42,Log-Likelihood:,-525.39
No. Observations:,50,AIC:,1059.0
Df Residuals:,46,BIC:,1066.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
R&D Spend,0.8057,0.045,17.846,0.000,0.715,0.897
Administration,-0.0268,0.051,-0.526,0.602,-0.130,0.076
Marketing Spend,0.0272,0.016,1.655,0.105,-0.006,0.060
terme_indep,5.012e+04,6572.353,7.626,0.000,3.69e+04,6.34e+04

0,1,2,3
Omnibus:,14.838,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.442
Skew:,-0.949,Prob(JB):,2.21e-05
Kurtosis:,5.586,Cond. No.,1400000.0


## STEP 4
The summary shows that the variable with the largest p-value is `Administration` (0.602), which is greater than SL (0.05), so we remove it

In [76]:
# Remove Administration
x_opt = x_opt.drop(['Administration'], axis=1)
x_opt.head()

Unnamed: 0,R&D Spend,Marketing Spend,terme_indep
0,165349.2,471784.1,1
1,162597.7,443898.53,1
2,153441.51,407934.54,1
3,144372.41,383199.62,1
4,142107.34,366168.42,1


## STEP 2
We fit the model again without `Administration`

In [77]:
lr_ols = sm.OLS(endog = y, exog = x_opt ).fit()

## STEP 3
We check which independent variable has the largest p-value

In [78]:
lr_ols.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.950
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,450.8
Date:,"Sat, 03 Dec 2022",Prob (F-statistic):,2.16e-31
Time:,20:30:34,Log-Likelihood:,-525.54
No. Observations:,50,AIC:,1057.0
Df Residuals:,47,BIC:,1063.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
R&D Spend,0.7966,0.041,19.266,0.000,0.713,0.880
Marketing Spend,0.0299,0.016,1.927,0.060,-0.001,0.061
terme_indep,4.698e+04,2689.933,17.464,0.000,4.16e+04,5.24e+04

0,1,2,3
Omnibus:,14.677,Durbin-Watson:,1.257
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.161
Skew:,-0.939,Prob(JB):,2.54e-05
Kurtosis:,5.575,Cond. No.,532000.0


## STEP 4
The summary shows that the variable with the largest p-value is `Marketing Spend` (0.060), which is greater than SL (0.05), so we remove it

In [79]:
# Remove Marketing Spend
x_opt = x_opt.drop(['Marketing Spend'], axis=1)
x_opt.head()

Unnamed: 0,R&D Spend,terme_indep
0,165349.2,1
1,162597.7,1
2,153441.51,1
3,144372.41,1
4,142107.34,1


## STEP 2
We fit the model again without `Marketing Spend`

In [80]:
lr_ols = sm.OLS(endog = y, exog = x_opt ).fit()

## STEP 3
We check which independent variable has the largest p-value

In [81]:
lr_ols.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.947
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,849.8
Date:,"Sat, 03 Dec 2022",Prob (F-statistic):,3.50e-32
Time:,20:34:44,Log-Likelihood:,-527.44
No. Observations:,50,AIC:,1059.0
Df Residuals:,48,BIC:,1063.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
R&D Spend,0.8543,0.029,29.151,0.000,0.795,0.913
terme_indep,4.903e+04,2537.897,19.320,0.000,4.39e+04,5.41e+04

0,1,2,3
Omnibus:,13.727,Durbin-Watson:,1.116
Prob(Omnibus):,0.001,Jarque-Bera (JB):,18.536
Skew:,-0.911,Prob(JB):,9.44e-05
Kurtosis:,5.361,Cond. No.,165000.0


## STEP 3
We check which independent variable has the largest p-value. In this case the p-value of the only remaining variable is 0.000, which is below SL (0.05), so nothing else is removed.

## STEP 5
We are done. Only the variable `R&D Spend` (X1) is left as a significant predictor of Y:

\begin{equation}
Y = 0.8543 \cdot X_1 + 49030
\end{equation}

In [84]:
# Show the coefficient B1 (R&D Spend) obtained:
print('Coefficient \u03B21: %.5f' % 0.8543)

# Value of the intercept (X=0)
print('Intercept \u03B20: %.5f' % 49030)

print("Coefficient of determination R^2:", 0.947)

Coefficient β1: 0.85430
Intercept β0: 49030.00000
Coefficient of determination R^2: 0.947
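
As a usage sketch, the fitted equation can be turned into a small function. `predict_profit` is a hypothetical helper, and the coefficients are the rounded values from the last summary, so the results are approximate.

```python
B1 = 0.8543   # coefficient of R&D Spend (rounded, from the last summary)
B0 = 49030.0  # intercept (rounded, from the last summary)


def predict_profit(rd_spend):
    """Estimated profit for a given R&D spend, using Y = B1 * X1 + B0."""
    return B1 * rd_spend + B0


print(round(predict_profit(100000), 2))
```

For an R&D spend of 100000 this gives an estimated profit of about 134460. With no R&D spend, the estimate is the intercept, 49030.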
