# EBAC - Regression II - multiple regression

## Task I

#### Income prediction II

We will keep working with the 'previsao_de_renda.csv' dataset, which is the dataset for your next project. We will apply the techniques covered so far to it.

|variable|description|
|-|-|
|data_ref                | Reference date on which the variables were collected |
|index                   | Customer identification code|
|sexo                    | Customer's sex|
|posse_de_veiculo        | Whether the customer owns a vehicle|
|posse_de_imovel         | Whether the customer owns real estate|
|qtd_filhos              | Number of children the customer has|
|tipo_renda              | Customer's type of income|
|educacao                | Customer's education level|
|estado_civil            | Customer's marital status|
|tipo_residencia         | Customer's type of residence (owned, rented, etc.)|
|idade                   | Customer's age|
|tempo_emprego           | Time in the current job|
|qt_pessoas_residencia   | Number of people living in the residence|
|renda                   | Income in reais|

In [49]:
import pandas as pd
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf
import statsmodels.api as sm
import numpy as np
from sklearn import preprocessing
from sklearn.tree import DecisionTreeRegressor

In [3]:
df = pd.read_csv('previsao_de_renda.csv', index_col=[0]).drop(['id_cliente', 'data_ref'], axis=1)
df.dropna(inplace=True)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12427 entries, 0 to 14999
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   sexo                   12427 non-null  object 
 1   posse_de_veiculo       12427 non-null  bool   
 2   posse_de_imovel        12427 non-null  bool   
 3   qtd_filhos             12427 non-null  int64  
 4   tipo_renda             12427 non-null  object 
 5   educacao               12427 non-null  object 
 6   estado_civil           12427 non-null  object 
 7   tipo_residencia        12427 non-null  object 
 8   idade                  12427 non-null  int64  
 9   tempo_emprego          12427 non-null  float64
 10  qt_pessoas_residencia  12427 non-null  float64
 11  renda                  12427 non-null  float64
dtypes: bool(2), float64(3), int64(2), object(5)
memory usage: 1.1+ MB


1. Split the dataset into training and test sets (25% for test, 75% for training).
2. Run a *ridge* regularization with alpha = [0, 0.001, 0.005, 0.01, 0.05, 0.1] and evaluate the $R^2$ on the test set. Which model is best?
3. Repeat step 2 with a *LASSO* regression. Which method gives the better result?
4. Run a *stepwise* model. Evaluate the $R^2$ on the test set. What is the best result?
5. Compare the parameters and assess any differences. Which model do you think is the best of all?
6. Starting from the models you fitted, try to improve the $R^2$ on the test set. Be creative: see whether you can add a transformation or a combination of variables.
7. Fit a regression tree and see whether it achieves a better $R^2$.

In [5]:
# Splitting the dataset into train and test (each split still contains the target column renda)
X_train, X_test = train_test_split(df, test_size=0.25)
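
The split above is random, so the $R^2$ values reported later change from run to run. A minimal sketch of a reproducible split follows, assuming a small synthetic DataFrame (`toy`, hypothetical) in place of the real file and an arbitrary `random_state` of 42:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: 100 rows, two columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tempo_emprego": rng.uniform(0, 30, 100),
    "renda": rng.lognormal(8, 1, 100),
})

# Fixing random_state makes the same rows land in train and test on every run
train_a, test_a = train_test_split(toy, test_size=0.25, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.25, random_state=42)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))
```

Any fixed integer would do; the point is only that reruns compare models on the same rows.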

In [6]:
# Function to build the models
def create_model(alpha, reg_type, data):
    # reg_type is passed on as L1_wt: 0 gives a ridge fit, 1 gives a LASSO fit
    formula = 'np.log(renda) ~ C(sexo) + C(posse_de_veiculo) + C(posse_de_imovel) + C(tipo_renda) + C(educacao) + C(estado_civil) + C(tipo_residencia) + qt_pessoas_residencia + qtd_filhos + tempo_emprego'
    model = smf.ols(formula, data)
    # refit=True refits an unpenalized OLS using only the variables with non-zero coefficients
    reg = model.fit_regularized(method='elastic_net', L1_wt=reg_type, alpha=alpha, refit=True)
    return reg
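
For reference, the statsmodels documentation for `OLS.fit_regularized` describes the elastic net objective being minimized as shown below. The symbols are chosen here for illustration: $n$ is the number of observations, $\beta$ the coefficient vector, $\mathrm{RSS}$ the residual sum of squares, $\alpha$ the `alpha` argument and $w$ the `L1_wt` argument.

$$
\frac{1}{2n}\,\mathrm{RSS}(\beta) + \alpha\left(\frac{1-w}{2}\,\lVert\beta\rVert_2^2 + w\,\lVert\beta\rVert_1\right)
$$

With $w = 0$ only the squared (ridge) penalty remains, with $w = 1$ only the absolute-value (LASSO) penalty remains, and with $\alpha = 0$ the penalty vanishes and the objective is ordinary least squares.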

In [7]:
alphas = [0, 0.001, 0.005, 0.01, 0.05, 0.1]
rsquareds = []

In [8]:
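# Ridge regularization - finding the best alpha (L1_wt=0.01, i.e. almost pure ridge)
# The model is fit on X_test, so reg.rsquared is the in-sample R² on the test split

for alpha in alphas:
    reg = create_model(alpha, 0.01, X_test)
    rsquareds.append({alpha: reg.rsquared})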

In [9]:
rsquareds
# The alpha that gave the best R² was alpha = 0

[{0: 0.3592085250073409},
 {0.001: 0.35917334602204687},
 {0.005: 0.3588295201240439},
 {0.01: 0.3581908004960358},
 {0.05: 0.35841275956065666},
 {0.1: 0.35775351688606016}]
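
The values above come from models fitted on the test split itself. A minimal sketch of an out-of-sample evaluation follows: fit on the training split, predict on the test split, and compute $R^2$ by hand. It assumes synthetic data (`toy`, with hypothetical columns `x1`, `x2`, `y`) and a hypothetical helper `holdout_r2`; it is not the notebook's own procedure.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the real dataset
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 2 + 1.5 * toy["x1"] - 0.8 * toy["x2"] + rng.normal(scale=0.5, size=n)

train, test = train_test_split(toy, test_size=0.25, random_state=42)

def holdout_r2(alpha, l1_wt):
    """Fit on the training split, then compute R² on the held-out test split."""
    fitted = smf.ols("y ~ x1 + x2", data=train).fit_regularized(
        method="elastic_net", alpha=alpha, L1_wt=l1_wt)
    pred = np.asarray(fitted.predict(test))
    ss_res = np.sum((test["y"] - pred) ** 2)
    ss_tot = np.sum((test["y"] - test["y"].mean()) ** 2)
    return 1 - ss_res / ss_tot

alphas = [0, 0.001, 0.005, 0.01, 0.05, 0.1]
ridge_scores = {alpha: holdout_r2(alpha, 0.0) for alpha in alphas}  # L1_wt=0: ridge
lasso_scores = {alpha: holdout_r2(alpha, 1.0) for alpha in alphas}  # L1_wt=1: LASSO
print(ridge_scores)
print(lasso_scores)
```

Computing $R^2$ from the predictions avoids relying on a `rsquared` attribute, which the regularized results object may not expose when `refit` is left at its default.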

In [10]:
modelo_final = create_model(0, 0.01, X_test)
modelo_final.summary()

0,1,2,3
Dep. Variable:,np.log(renda),R-squared:,0.359
Model:,OLS,Adj. R-squared:,0.355
Method:,Least Squares,F-statistic:,75.17
Date:,"Thu, 10 Aug 2023",Prob (F-statistic):,5.13e-277
Time:,20:18:15,Log-Likelihood:,-3341.4
No. Observations:,3107,AIC:,6731.0
Df Residuals:,3084,BIC:,6876.0
Df Model:,23,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.3184,0.514,12.303,0.000,5.311,7.325
C(sexo)[T.M],0.7734,0.029,26.687,0.000,0.717,0.830
C(posse_de_veiculo)[T.True],0.0507,0.028,1.809,0.071,-0.004,0.106
C(posse_de_imovel)[T.True],0.1040,0.028,3.763,0.000,0.050,0.158
C(tipo_renda)[T.Bolsista],0.3061,0.412,0.742,0.458,-0.502,1.115
C(tipo_renda)[T.Empresário],0.1623,0.030,5.475,0.000,0.104,0.220
C(tipo_renda)[T.Servidor público],0.0139,0.045,0.309,0.757,-0.074,0.102
C(educacao)[T.Pós graduação],-0.0255,0.353,-0.072,0.942,-0.718,0.667
C(educacao)[T.Secundário],-0.1522,0.149,-1.018,0.309,-0.445,0.141

0,1,2,3
Omnibus:,0.209,Durbin-Watson:,1.955
Prob(Omnibus):,0.901,Jarque-Bera (JB):,0.165
Skew:,-0.011,Prob(JB):,0.921
Kurtosis:,3.028,Cond. No.,597.0


In [11]:
rsquareds_ = []

In [12]:
# LASSO regularization - finding the best alpha (L1_wt=1.0), again fitted on X_test

for alpha in alphas:
    reg = create_model(alpha, 1.0, X_test)
    rsquareds_.append({alpha: reg.rsquared})

In [13]:
rsquareds_

[{0: 0.3592085250073409},
 {0.001: 0.3565597670362646},
 {0.005: 0.34671406982570585},
 {0.01: 0.3463974620191105},
 {0.05: 0.3402329114511926},
 {0.1: 0.3402329114511926}]

In [14]:
# Once again the best R² came from alpha = 0. With alpha = 0 there is no penalty, so both methods reduce to the same OLS fit and reach the same result
modelo_final_2 = create_model(0, 1.0, X_test)
modelo_final_2.summary()

0,1,2,3
Dep. Variable:,np.log(renda),R-squared:,0.359
Model:,OLS,Adj. R-squared:,0.355
Method:,Least Squares,F-statistic:,75.17
Date:,"Thu, 10 Aug 2023",Prob (F-statistic):,5.13e-277
Time:,20:18:18,Log-Likelihood:,-3341.4
No. Observations:,3107,AIC:,6731.0
Df Residuals:,3084,BIC:,6876.0
Df Model:,23,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.3184,0.514,12.303,0.000,5.311,7.325
C(sexo)[T.M],0.7734,0.029,26.687,0.000,0.717,0.830
C(posse_de_veiculo)[T.True],0.0507,0.028,1.809,0.071,-0.004,0.106
C(posse_de_imovel)[T.True],0.1040,0.028,3.763,0.000,0.050,0.158
C(tipo_renda)[T.Bolsista],0.3061,0.412,0.742,0.458,-0.502,1.115
C(tipo_renda)[T.Empresário],0.1623,0.030,5.475,0.000,0.104,0.220
C(tipo_renda)[T.Servidor público],0.0139,0.045,0.309,0.757,-0.074,0.102
C(educacao)[T.Pós graduação],-0.0255,0.353,-0.072,0.942,-0.718,0.667
C(educacao)[T.Secundário],-0.1522,0.149,-1.018,0.309,-0.445,0.141

0,1,2,3
Omnibus:,0.209,Durbin-Watson:,1.955
Prob(Omnibus):,0.901,Jarque-Bera (JB):,0.165
Skew:,-0.011,Prob(JB):,0.921
Kurtosis:,3.028,Cond. No.,597.0


In [15]:
df_dm = pd.get_dummies(df)
y = df_dm['renda']
X = df_dm.drop('renda', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [20]:
# "Stepwise" regression: as written, this fits a plain OLS on every dummy-encoded column of the
# test split, with renda (not log(renda)) as the target; no variable selection is performed
st_reg = sm.OLS(y_test.astype(float), X_test.astype(float)).fit()

In [21]:
st_reg.summary()

0,1,2,3
Dep. Variable:,renda,R-squared:,0.27
Model:,OLS,Adj. R-squared:,0.265
Method:,Least Squares,F-statistic:,47.59
Date:,"Thu, 10 Aug 2023",Prob (F-statistic):,1.81e-190
Time:,20:21:44,Log-Likelihood:,-31924.0
No. Observations:,3107,AIC:,63900.0
Df Residuals:,3082,BIC:,64050.0
Df Model:,24,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
posse_de_veiculo,553.0579,276.469,2.000,0.046,10.976,1095.140
posse_de_imovel,302.1051,273.988,1.103,0.270,-235.112,839.322
qtd_filhos,-652.0644,1965.759,-0.332,0.740,-4506.394,3202.265
idade,35.4593,15.857,2.236,0.025,4.368,66.550
tempo_emprego,509.0392,20.270,25.113,0.000,469.296,548.783
qt_pessoas_residencia,616.8992,1958.502,0.315,0.753,-3223.203,4457.001
sexo_F,-3045.5169,1241.670,-2.453,0.014,-5480.102,-610.931
sexo_M,2843.6637,1247.685,2.279,0.023,397.286,5290.042
tipo_renda_Assalariado,335.1878,1030.452,0.325,0.745,-1685.255,2355.630

0,1,2,3
Omnibus:,4172.126,Durbin-Watson:,2.007
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1672728.243
Skew:,7.295,Prob(JB):,0.0
Kurtosis:,115.73,Cond. No.,1.27e+16


In [18]:
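# The LASSO and ridge models performed better (R² of 0.359 against 0.270), although they model log(renda) while this fit models renda, so the two R² values are not directly comparable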

In [52]:
# Fitting a regression tree
tree = DecisionTreeRegressor(max_depth=5, min_samples_split=10)

In [53]:
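dum = pd.get_dummies(df)
# X_train is re-split here and still contains the renda column, while y_train comes from the
# earlier split in cell In [15], so the rows of X_train and y_train are no longer aligned
X_train, X_test = train_test_split(dum, test_size=0.25)
# Not used in the cells below
encoded = preprocessing.LabelEncoder().fit_transform(np.array([1.4,0.4]))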

In [59]:
tree.fit(X_train.astype(float), y_train.astype(float))

In [61]:
# R² on the training data
tree.score(X_train, y_train)

0.012339977330830143
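
The score above is computed on the training data, with features and target taken from two different random splits. A minimal sketch of an aligned workflow follows: build `X` and `y` first, split them together, and score on the test split. It assumes synthetic data (`toy`, hypothetical) and modelling the log of `renda`, as in the earlier cells; the tree parameters are the same as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data with one categorical column, standing in for the real dataset
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "sexo": rng.choice(["F", "M"], size=n),
    "tempo_emprego": rng.uniform(0, 30, size=n),
})
toy["renda"] = np.exp(7 + 0.7 * (toy["sexo"] == "M") + 0.05 * toy["tempo_emprego"]
                      + rng.normal(scale=0.2, size=n))

# Separate features and target first, then split them together so rows stay aligned
X = pd.get_dummies(toy.drop(columns="renda")).astype(float)
y = np.log(toy["renda"])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

reg_tree = DecisionTreeRegressor(max_depth=5, min_samples_split=10, random_state=42)
reg_tree.fit(X_tr, y_tr)

train_score = reg_tree.score(X_tr, y_tr)
test_score = reg_tree.score(X_te, y_te)
print("train R²:", train_score)
print("test R²:", test_score)
```

Comparing the train and test scores also gives a quick check for overfitting before tuning `max_depth` or `min_samples_split`.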