# EBAC - Regression II - multiple regression

## Task I

#### Income prediction II

We will keep working with the 'previsao_de_renda.csv' dataset, which is the dataset for your next project, and apply to it the techniques covered so far.

|variable|description|
|-|-|
|data_ref                | Reference date on which the variables were collected |
|id_cliente              | Customer identification code|
|sexo                    | Customer's sex|
|posse_de_veiculo        | Whether the customer owns a vehicle|
|posse_de_imovel         | Whether the customer owns real estate|
|qtd_filhos              | Number of children the customer has|
|tipo_renda              | Customer's type of income|
|educacao                | Customer's education level|
|estado_civil            | Customer's marital status|
|tipo_residencia         | Customer's type of residence (owned, rented, etc.)|
|idade                   | Customer's age|
|tempo_emprego           | Time in the current job|
|qt_pessoas_residencia   | Number of people living in the residence|
|renda                   | Income in reais|

In [29]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels as sm
from sklearn.model_selection import train_test_split


In [2]:
df = pd.read_csv('previsao_de_renda.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   data_ref               15000 non-null  object 
 1   id_cliente             15000 non-null  int64  
 2   sexo                   15000 non-null  object 
 3   posse_de_veiculo       15000 non-null  bool   
 4   posse_de_imovel        15000 non-null  bool   
 5   qtd_filhos             15000 non-null  int64  
 6   tipo_renda             15000 non-null  object 
 7   educacao               15000 non-null  object 
 8   estado_civil           15000 non-null  object 
 9   tipo_residencia        15000 non-null  object 
 10  idade                  15000 non-null  int64  
 11  tempo_emprego          12427 non-null  float64
 12  qt_pessoas_residencia  15000 non-null  float64
 13  renda                  15000 non-null  float64
dtypes: bool(2), float64(3), int64(3), object(6)
memory usa

1. Split the data into training and test sets (25% for test, 75% for training).
2. Fit a *ridge* regularization with alpha = [0, 0.001, 0.005, 0.01, 0.05, 0.1] and evaluate the $R^2$ on the test set. Which model is best?
3. Repeat step 2 with a *LASSO* regression. Which method gives the better result?
4. Run a *stepwise* model. Evaluate the $R^2$ on the test set. What is the best result?
5. Compare the parameters and assess any differences. Which model do you think is the best overall?
6. Starting from the models you fitted, try to improve the $R^2$ on the test set. Be creative: see whether you can add a transformation or a combination of variables.
7. Fit a regression tree and see whether it achieves a better $R^2$.

In [4]:
# 1 - split into 75% training / 25% test
train, test = train_test_split(df, test_size=0.25)
print('Train:', train.shape)
print('Test:', test.shape)

Train: (11250, 14)
Test: (3750, 14)
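The split above draws a new random partition on every run, so the coefficients and $R^2$ values further down cannot be reproduced exactly. The following is a minimal sketch of a seeded split. The `random_state=42` value and the small synthetic frame `demo` are assumptions for illustration and are not part of the original notebook. It also counts the non-missing `tempo_emprego` rows, because formula-based fits drop rows with missing values, which is why the model summaries later report fewer observations than the test split has rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df; the real notebook reads previsao_de_renda.csv
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'idade': rng.integers(22, 68, size=200),
    'tempo_emprego': rng.uniform(0, 20, size=200),
})
# Mimic the missing values that tempo_emprego has in the real data
missing_idx = demo.sample(frac=0.15, random_state=0).index
demo.loc[missing_idx, 'tempo_emprego'] = np.nan

# Fixing random_state makes the partition identical across runs
train_demo, test_demo = train_test_split(demo, test_size=0.25, random_state=42)
train_again, test_again = train_test_split(demo, test_size=0.25, random_state=42)

print('Train:', train_demo.shape)
print('Test:', test_demo.shape)
print('Same test rows on rerun:', test_demo.index.equals(test_again.index))
print('Test rows usable by the formula:', test_demo['tempo_emprego'].notna().sum())
```

With the real data, the same call would be `train_test_split(df, test_size=0.25, random_state=42)`.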


In [42]:
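# 2 - ridge: L1_wt=0 selects the pure L2 penalty of the elastic net
# Response is log(renda); predictors are sex, education, property ownership, age and job tenure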
modelo = '''np.log(renda) ~ 
                               C(sexo, Treatment(0))
                               + C(educacao, Treatment(0))
                               + C(posse_de_imovel, Treatment(1))
                               + idade
                               + tempo_emprego
                               '''
# Note: this fits on the test split, so the parameters below are in-sample estimates for `test`
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=0, alpha=0)

print(reg.params)

[ 7.34283792e+00  7.83383713e-01 -4.77551590e-02  2.66422541e-02
  1.16413519e-01 -8.14666641e-03 -9.94732068e-02  3.48218457e-03
  5.88430847e-02]


In [6]:
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=0, alpha=0.001)
reg.params

array([ 6.64056946,  0.80059847,  0.2641517 ,  0.58266584,  0.6797108 ,
        0.55907247, -0.08520713,  0.00671049,  0.05864884])

In [7]:
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=0, alpha=0.005)
reg.params

array([ 5.59967702,  0.84927429,  0.18868406,  1.14932964,  1.26613313,
        1.10255156, -0.03620908,  0.01706531,  0.05810497])

In [8]:
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=0, alpha=0.01)
reg.params

array([5.00428239, 0.89236413, 0.10529633, 1.25983511, 1.39444894,
       1.16938654, 0.01461208, 0.02762006, 0.0576557 ])

In [9]:
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=0, alpha=0.05)
reg.params

array([3.08178664, 1.00431655, 0.00378968, 0.96359946, 1.16052495,
       0.74724585, 0.21699989, 0.07679084, 0.05538709])

In [10]:
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=0, alpha=0.1)
reg.params

array([ 2.15034718,  0.96798983, -0.0048509 ,  0.68495697,  0.89557392,
        0.48232492,  0.28473763,  0.10491103,  0.05354053])
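The cells above fit each ridge model on `test` and only print the parameters, while step 2 asks for the $R^2$ on the test set. A minimal sketch of one way to do that is shown here: fit on the training split, predict on the test split, and score with `sklearn.metrics.r2_score`. The helper names `make_demo` and `ridge_test_r2` and the synthetic data are hypothetical and only make the block runnable on its own. With a scalar `alpha`, statsmodels applies the same penalty weight to every column of the design matrix, intercept included, which appears consistent with the intercept shrinking from about 7.34 to 2.15 in the runs above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

FORMULA = ('np.log(renda) ~ C(sexo, Treatment(0)) + C(educacao, Treatment(0)) '
           '+ C(posse_de_imovel, Treatment(1)) + idade + tempo_emprego')


def make_demo(n=400, seed=0):
    """Synthetic stand-in (assumption) with the column names used in the formula."""
    rng = np.random.default_rng(seed)
    d = pd.DataFrame({
        'sexo': rng.choice(['F', 'M'], size=n),
        'educacao': rng.choice(['Primário', 'Secundário', 'Superior completo'], size=n),
        'posse_de_imovel': rng.choice([True, False], size=n),
        'idade': rng.integers(22, 68, size=n),
        'tempo_emprego': rng.uniform(0, 20, size=n),
    })
    log_renda = (7 + 0.8 * (d['sexo'] == 'M') + 0.05 * d['tempo_emprego']
                 + rng.normal(0, 0.3, size=n))
    d['renda'] = np.exp(log_renda)
    return d


def ridge_test_r2(train, test, alphas):
    """Fit ridge on train for each alpha; return {alpha: R^2 on test}."""
    # Rows with missing tempo_emprego cannot be predicted, so drop them first
    test = test.dropna(subset=['tempo_emprego'])
    y_test = np.log(test['renda'])
    model = smf.ols(FORMULA, data=train)
    scores = {}
    for alpha in alphas:
        res = model.fit_regularized(method='elastic_net', L1_wt=0, alpha=alpha)
        scores[alpha] = r2_score(y_test, res.predict(test))
    return scores


demo = make_demo()
train_demo, test_demo = train_test_split(demo, test_size=0.25, random_state=42)
alphas = [0, 0.001, 0.005, 0.01, 0.05, 0.1]
scores = ridge_test_r2(train_demo, test_demo, alphas)
for alpha, r2 in scores.items():
    print(f'alpha={alpha}: test R2 = {r2:.4f}')
best_alpha = max(scores, key=scores.get)
print('Best alpha on the test split:', best_alpha)
```

On the real data the call would be `ridge_test_r2(train, test, alphas)` with the notebook's own `train` and `test` frames.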

In [11]:
# 3 - LASSO: L1_wt=1 selects the pure L1 penalty; refit=True refits OLS on the selected terms
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=1, alpha=0.001)
reg.summary()

| | | | |
|-|-|-|-|
| Dep. Variable: | np.log(renda) | R-squared: | 0.336 |
| Model: | OLS | Adj. R-squared: | 0.335 |
| Method: | Least Squares | F-statistic: | 225.4 |
| Date: | Wed, 08 Mar 2023 | Prob (F-statistic): | 1.51e-271 |
| Time: | 19:41:04 | Log-Likelihood: | -3410.6 |
| No. Observations: | 3119 | AIC: | 6837.0 |
| Df Residuals: | 3112 | BIC: | 6885.0 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|-|-|-|-|-|-|-|
| Intercept | 7.4573 | 0.063 | 118.489 | 0.000 | 7.334 | 7.581 |
| C(sexo, Treatment(0))[T.M] | 0.7826 | 0.027 | 28.626 | 0.000 | 0.729 | 0.836 |
| C(educacao, Treatment(0))[T.Pós graduação] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Secundário] | -0.0867 | 0.027 | -3.183 | 0.001 | -0.140 | -0.033 |
| C(educacao, Treatment(0))[T.Superior completo] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Superior incompleto] | -0.1216 | 0.063 | -1.935 | 0.053 | -0.245 | 0.002 |
| C(posse_de_imovel, Treatment(1))[T.False] | -0.0990 | 0.027 | -3.624 | 0.000 | -0.153 | -0.045 |
| idade | 0.0035 | 0.001 | 2.346 | 0.019 | 0.001 | 0.006 |
| tempo_emprego | 0.0589 | 0.002 | 28.541 | 0.000 | 0.055 | 0.063 |

| | | | |
|-|-|-|-|
| Omnibus: | 2.045 | Durbin-Watson: | 1.993 |
| Prob(Omnibus): | 0.36 | Jarque-Bera (JB): | 2.008 |
| Skew: | 0.028 | Prob(JB): | 0.366 |
| Kurtosis: | 2.889 | Cond. No. | 1360.0 |


In [12]:
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=1, alpha=0.005)
reg.summary()

| | | | |
|-|-|-|-|
| Dep. Variable: | np.log(renda) | R-squared: | 0.336 |
| Model: | OLS | Adj. R-squared: | 0.335 |
| Method: | Least Squares | F-statistic: | 262.1 |
| Date: | Wed, 08 Mar 2023 | Prob (F-statistic): | 5.77e-272 |
| Time: | 19:41:04 | Log-Likelihood: | -3412.4 |
| No. Observations: | 3119 | AIC: | 6839.0 |
| Df Residuals: | 3113 | BIC: | 6881.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|-|-|-|-|-|-|-|
| Intercept | 7.4331 | 0.062 | 120.453 | 0.000 | 7.312 | 7.554 |
| C(sexo, Treatment(0))[T.M] | 0.7813 | 0.027 | 28.576 | 0.000 | 0.728 | 0.835 |
| C(educacao, Treatment(0))[T.Pós graduação] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Secundário] | -0.0739 | 0.026 | -2.796 | 0.005 | -0.126 | -0.022 |
| C(educacao, Treatment(0))[T.Superior completo] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Superior incompleto] | 0 | 0 | | | 0 | 0 |
| C(posse_de_imovel, Treatment(1))[T.False] | -0.1001 | 0.027 | -3.663 | 0.000 | -0.154 | -0.047 |
| idade | 0.0037 | 0.001 | 2.549 | 0.011 | 0.001 | 0.007 |
| tempo_emprego | 0.0590 | 0.002 | 28.559 | 0.000 | 0.055 | 0.063 |

| | | | |
|-|-|-|-|
| Omnibus: | 1.976 | Durbin-Watson: | 1.993 |
| Prob(Omnibus): | 0.372 | Jarque-Bera (JB): | 1.951 |
| Skew: | 0.029 | Prob(JB): | 0.377 |
| Kurtosis: | 2.892 | Cond. No. | 1360.0 |


In [13]:
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=1, alpha=0.01)
reg.summary()

| | | | |
|-|-|-|-|
| Dep. Variable: | np.log(renda) | R-squared: | 0.336 |
| Model: | OLS | Adj. R-squared: | 0.335 |
| Method: | Least Squares | F-statistic: | 262.1 |
| Date: | Wed, 08 Mar 2023 | Prob (F-statistic): | 5.77e-272 |
| Time: | 19:41:04 | Log-Likelihood: | -3412.4 |
| No. Observations: | 3119 | AIC: | 6839.0 |
| Df Residuals: | 3113 | BIC: | 6881.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|-|-|-|-|-|-|-|
| Intercept | 7.4331 | 0.062 | 120.453 | 0.000 | 7.312 | 7.554 |
| C(sexo, Treatment(0))[T.M] | 0.7813 | 0.027 | 28.576 | 0.000 | 0.728 | 0.835 |
| C(educacao, Treatment(0))[T.Pós graduação] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Secundário] | -0.0739 | 0.026 | -2.796 | 0.005 | -0.126 | -0.022 |
| C(educacao, Treatment(0))[T.Superior completo] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Superior incompleto] | 0 | 0 | | | 0 | 0 |
| C(posse_de_imovel, Treatment(1))[T.False] | -0.1001 | 0.027 | -3.663 | 0.000 | -0.154 | -0.047 |
| idade | 0.0037 | 0.001 | 2.549 | 0.011 | 0.001 | 0.007 |
| tempo_emprego | 0.0590 | 0.002 | 28.559 | 0.000 | 0.055 | 0.063 |

| | | | |
|-|-|-|-|
| Omnibus: | 1.976 | Durbin-Watson: | 1.993 |
| Prob(Omnibus): | 0.372 | Jarque-Bera (JB): | 1.951 |
| Skew: | 0.029 | Prob(JB): | 0.377 |
| Kurtosis: | 2.892 | Cond. No. | 1360.0 |


In [14]:
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=1, alpha=0.05)
reg.summary()

| | | | |
|-|-|-|-|
| Dep. Variable: | np.log(renda) | R-squared: | 0.331 |
| Model: | OLS | Adj. R-squared: | 0.33 |
| Method: | Least Squares | F-statistic: | 385.2 |
| Date: | Wed, 08 Mar 2023 | Prob (F-statistic): | 7.47e-270 |
| Time: | 19:41:05 | Log-Likelihood: | -3423.3 |
| No. Observations: | 3119 | AIC: | 6857.0 |
| Df Residuals: | 3115 | BIC: | 6887.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|-|-|-|-|-|-|-|
| Intercept | 7.3634 | 0.060 | 122.902 | 0.000 | 7.246 | 7.481 |
| C(sexo, Treatment(0))[T.M] | 0.7772 | 0.027 | 28.369 | 0.000 | 0.724 | 0.831 |
| C(educacao, Treatment(0))[T.Pós graduação] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Secundário] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Superior completo] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Superior incompleto] | 0 | 0 | | | 0 | 0 |
| C(posse_de_imovel, Treatment(1))[T.False] | 0 | 0 | | | 0 | 0 |
| idade | 0.0036 | 0.001 | 2.503 | 0.012 | 0.001 | 0.006 |
| tempo_emprego | 0.0588 | 0.002 | 28.411 | 0.000 | 0.055 | 0.063 |

| | | | |
|-|-|-|-|
| Omnibus: | 1.728 | Durbin-Watson: | 1.994 |
| Prob(Omnibus): | 0.421 | Jarque-Bera (JB): | 1.71 |
| Skew: | 0.021 | Prob(JB): | 0.425 |
| Kurtosis: | 2.893 | Cond. No. | 1360.0 |


In [15]:
md = smf.ols(modelo, data=test)
reg = md.fit_regularized(method='elastic_net', refit=True, L1_wt=1, alpha=0.1)
reg.summary()

| | | | |
|-|-|-|-|
| Dep. Variable: | np.log(renda) | R-squared: | 0.331 |
| Model: | OLS | Adj. R-squared: | 0.33 |
| Method: | Least Squares | F-statistic: | 385.2 |
| Date: | Wed, 08 Mar 2023 | Prob (F-statistic): | 7.47e-270 |
| Time: | 19:41:05 | Log-Likelihood: | -3423.3 |
| No. Observations: | 3119 | AIC: | 6857.0 |
| Df Residuals: | 3115 | BIC: | 6887.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|-|-|-|-|-|-|-|
| Intercept | 7.3634 | 0.060 | 122.902 | 0.000 | 7.246 | 7.481 |
| C(sexo, Treatment(0))[T.M] | 0.7772 | 0.027 | 28.369 | 0.000 | 0.724 | 0.831 |
| C(educacao, Treatment(0))[T.Pós graduação] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Secundário] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Superior completo] | 0 | 0 | | | 0 | 0 |
| C(educacao, Treatment(0))[T.Superior incompleto] | 0 | 0 | | | 0 | 0 |
| C(posse_de_imovel, Treatment(1))[T.False] | 0 | 0 | | | 0 | 0 |
| idade | 0.0036 | 0.001 | 2.503 | 0.012 | 0.001 | 0.006 |
| tempo_emprego | 0.0588 | 0.002 | 28.411 | 0.000 | 0.055 | 0.063 |

| | | | |
|-|-|-|-|
| Omnibus: | 1.728 | Durbin-Watson: | 1.994 |
| Prob(Omnibus): | 0.421 | Jarque-Bera (JB): | 1.71 |
| Skew: | 0.021 | Prob(JB): | 0.425 |
| Kurtosis: | 2.893 | Cond. No. | 1360.0 |
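The `R-squared` reported in each summary above is computed on the same rows the model was fitted to, so it is not the out-of-sample $R^2$ that step 3 asks for. As a rough sketch of how the LASSO grid could be compared on held-out data, the loop below fits on a training frame, counts the non-zero coefficients, and scores predictions on a separate test frame. The synthetic `demo` data, the shortened formula, and the names `lasso_table`, `n_nonzero` and `r2_test` are assumptions made only so the block runs on its own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import r2_score

# Synthetic stand-in (assumption); the notebook would use its own train/test frames
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    'sexo': rng.choice(['F', 'M'], size=n),
    'idade': rng.integers(22, 68, size=n),
    'tempo_emprego': rng.uniform(0, 20, size=n),
})
demo['renda'] = np.exp(7 + 0.8 * (demo['sexo'] == 'M')
                       + 0.05 * demo['tempo_emprego']
                       + rng.normal(0, 0.3, size=n))
train_demo, test_demo = demo.iloc[:300], demo.iloc[300:]

# Shortened version of the notebook's formula, for illustration only
formula = 'np.log(renda) ~ C(sexo, Treatment(0)) + idade + tempo_emprego'
model = smf.ols(formula, data=train_demo)
y_test = np.log(test_demo['renda'])

rows = []
for alpha in [0.001, 0.005, 0.01, 0.05, 0.1]:
    # L1_wt=1 is the pure LASSO penalty; refit=True refits OLS on the kept terms
    res = model.fit_regularized(method='elastic_net', refit=True,
                                L1_wt=1, alpha=alpha)
    params = np.asarray(res.params)
    rows.append({
        'alpha': alpha,
        'n_nonzero': int(np.count_nonzero(params)),
        'r2_test': r2_score(y_test, res.predict(test_demo)),
    })

lasso_table = pd.DataFrame(rows)
print(lasso_table)
```

Collecting the results in one table makes it easier to see how many terms each `alpha` keeps and whether dropping them costs anything on the held-out rows.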
