### Regression

#### $ \hat{y} = a + b x $

### Least Squares

### $a = \bar{y} - b \bar{x}$

### $b = \frac{\sum_{i=1}^n{(x_i - \bar{x}) (y_i - \bar{y} )}}{\sum_{i=1}^n{(x_i - \bar{x})^2 }} $
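The two closed-form expressions above can be checked numerically. The following is a minimal sketch on synthetic data: the arrays `x` and `y` and the true coefficients 2.0 and 0.5 are made up for illustration and are not part of the dataset used below. It compares the hand-computed intercept and slope with `np.polyfit`.

```python
import numpy as np

# Synthetic data, assumed only for illustration: y = 2.0 + 0.5 * x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 0.1, size=200)

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over the sum of squared deviations of x
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: forces the fitted line through the point (x_bar, y_bar)
a = y_bar - b * x_bar

# np.polyfit with degree 1 returns [slope, intercept]
b_ref, a_ref = np.polyfit(x, y, 1)
print(a, b)
```

Both routes should agree to floating-point precision, since `np.polyfit` with degree 1 solves the same least-squares problem.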

# Imports

In [30]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics as mt
import statsmodels.formula.api as smf
import statsmodels.api as sm
import math as mat

# Load Dataset

In [2]:
df = pd.read_csv('../data/train.csv')

# Data Preparation

In [3]:
df.head()

Unnamed: 0,id_cliente,idade,saldo_atual,divida_atual,renda_anual,valor_em_investimentos,taxa_utilizacao_credito,num_emprestimos,num_contas_bancarias,num_cartoes_credito,dias_atraso_dt_venc,num_pgtos_atrasados,num_consultas_credito,taxa_juros,investe_exterior,pessoa_polit_exp,limite_adicional
0,1767,21,278.172008,2577.05,24196.89636,104.306544,31.038763,6,5,7,21,14,9,15,Não,Não,Negar
1,11920,40,268.874152,2465.39,19227.37796,69.858778,36.917093,5,8,5,40,23,10,18,Não,Não,Negar
2,8910,36,446.643127,1055.29,42822.28223,134.201478,34.561714,0,3,6,26,13,3,15,Sim,Não,Negar
3,4964,58,321.141267,703.05,51786.826,297.350067,31.493561,0,3,7,12,7,2,1,Sim,Não,Negar
4,10100,35,428.716114,891.29,44626.85346,134.201478,28.028887,2,8,7,24,10,8,20,Sim,Não,Negar


In [4]:
features = ['idade',
            'divida_atual',
            'renda_anual',
            'valor_em_investimentos',
            'taxa_utilizacao_credito',
            'num_emprestimos',
            'num_contas_bancarias',
            'num_cartoes_credito',
            'dias_atraso_dt_venc',
            'num_pgtos_atrasados',
            'num_consultas_credito',
            'taxa_juros']

label = ['saldo_atual']

In [5]:
x_train = df[features]
y_train = df[label]

# Model Training

In [6]:
lr_model = LinearRegression()

lr_model.fit(x_train,y_train)

LinearRegression()

In [7]:
y_pred = lr_model.predict(x_train)

In [8]:
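# Keep only the identifier and the target; .copy() makes df1 independent of df,
# so adding a column below cannot trigger a SettingWithCopyWarning
df1 = df.loc[:, ['id_cliente', 'saldo_atual']].copy()

df1['predicted'] = y_pred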

In [9]:
df1

Unnamed: 0,id_cliente,saldo_atual,predicted
0,1767,278.172008,346.669549
1,11920,268.874152,367.840277
2,8910,446.643127,431.468979
3,4964,321.141267,445.506463
4,10100,428.716114,378.271169
...,...,...,...
9495,5155,157.500279,449.221632
9496,11977,497.714090,369.259284
9497,9278,306.557684,412.251748
9498,2525,209.870718,400.685299


In [10]:
lr_model.coef_

array([[ 3.96706202e-04, -4.00595601e-02,  2.77622532e-06,
        -1.04318668e-03,  9.80890872e+00, -1.22353405e-02,
        -6.33015538e-03, -3.57808095e-03, -2.15858165e+00,
         3.77570060e-04,  6.79176336e-03,  3.48471845e-03]])

Computing the regression prediction by hand

In [11]:
np.sum((x_train.loc[0, ].values * lr_model.coef_)) + lr_model.intercept_

array([346.66954862])
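The expression above is the dot product of one row of features with the fitted coefficients, plus the intercept. As a minimal sketch on synthetic data (the matrix `X`, the target `y` and the weights are hypothetical, not the credit dataset), the vectorized form `X @ coef_ + intercept_` reproduces `predict` for every row at once.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, assumed only for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0 + rng.normal(0, 0.1, size=100)

model = LinearRegression().fit(X, y)

# Manual prediction: each row dotted with the coefficients, plus the intercept
manual_pred = X @ model.coef_ + model.intercept_
sklearn_pred = model.predict(X)

print(np.allclose(manual_pred, sklearn_pred))
```

With a 1-D target, as here, `coef_` is a 1-D array. In the notebook the target is a one-column DataFrame, so `coef_` has shape `(1, n_features)` and the results come back wrapped in an array.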

In [12]:
df1.head(10)

Unnamed: 0,id_cliente,saldo_atual,predicted
0,1767,278.172008,346.669549
1,11920,268.874152,367.840277
2,8910,446.643127,431.468979
3,4964,321.141267,445.506463
4,10100,428.716114,378.271169
5,2755,327.437723,352.483346
6,2859,635.332001,410.604085
7,5569,213.441895,210.685639
8,11674,566.245423,388.619965
9,5779,272.577709,301.147706


In [13]:
# Reproduce the first 10 predictions by hand: sum of feature * coefficient, plus intercept
for i in range(10):
    print(i, np.sum(x_train.loc[i].values * lr_model.coef_) + lr_model.intercept_)

0 [346.66954862]
1 [367.84027655]
2 [431.46897895]
3 [445.50646329]
4 [378.27116865]
5 [352.4833463]
6 [410.60408532]
7 [210.68563909]
8 [388.61996465]
9 [301.1477063]


# Performance

### $SSR = {\sum_{i=1}^n{ (\hat{y}_i - \bar{y})^2}}$

### $SSE = {\sum_{i=1}^n{ (y_i - \hat{y}_i)^2}}$

### $SSTO = {\sum_{i=1}^n{ (y_i - \bar{y})^2}}$, or equivalently $SSTO = SSE + SSR$

### $R^2$ = coefficient of determination

### $R^2 = 1 - \frac{ SSE }{ SSTO } $

### $R^2$ = the share of the target's variation that the observed features explain
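The three sums of squares can be computed directly to see how they relate. This is a minimal sketch on synthetic data (`X`, `y` and the coefficients are assumptions for illustration). The identity $SSTO = SSE + SSR$ holds here because the model is an ordinary least squares fit with an intercept, evaluated on its own training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data, assumed only for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 1.0, size=300)

y_hat = LinearRegression().fit(X, y).predict(X)
y_bar = y.mean()

ssr = np.sum((y_hat - y_bar) ** 2)   # variation explained by the model
sse = np.sum((y - y_hat) ** 2)       # residual (unexplained) variation
ssto = np.sum((y - y_bar) ** 2)      # total variation

r2 = 1 - sse / ssto
print(r2, r2_score(y, y_hat))
```

The hand-computed `r2` should match `r2_score`, which uses the same `1 - SSE / SSTO` definition.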

In [14]:
mt.r2_score(y_train, y_pred)

0.16917364489050013

In [15]:
lr_model = smf.ols(formula='saldo_atual ~ idade + divida_atual + num_emprestimos', data=df)
lr_model = lr_model.fit()

# The keyword is `typ` (not `type`); 1 requests a Type I (sequential) ANOVA
anova_results2 = sm.stats.anova_lm(lr_model, typ=1)
anova_results2

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
idade,1.0,408.5925,408.5925,0.009575,0.9220529
divida_atual,1.0,44940050.0,44940050.0,1053.104993,3.537135e-219
num_emprestimos,1.0,3976.109,3976.109,0.093174,0.7601864
Residual,9496.0,405230900.0,42673.85,,


In [16]:
lr_model.summary()

0,1,2,3
Dep. Variable:,saldo_atual,R-squared:,0.1
Model:,OLS,Adj. R-squared:,0.1
Method:,Least Squares,F-statistic:,351.1
Date:,"Fri, 28 Jul 2023",Prob (F-statistic):,3.21e-216
Time:,14:08:43,Log-Likelihood:,-64119.0
No. Observations:,9500,AIC:,128200.0
Df Residuals:,9496,BIC:,128300.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,490.1201,3.419,143.337,0.000,483.417,496.823
idade,-1.985e-05,0.003,-0.006,0.995,-0.006,0.006
divida_atual,-0.0595,0.002,-32.440,0.000,-0.063,-0.056
num_emprestimos,-0.0103,0.034,-0.305,0.760,-0.077,0.056

0,1,2,3
Omnibus:,2666.251,Durbin-Watson:,2.02
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7333.012
Skew:,1.494,Prob(JB):,0.0
Kurtosis:,6.097,Cond. No.,2980.0


To do: look into feature selection tools.

R ships with them natively; in Python you have to look for a library.
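One option on the Python side is scikit-learn's `sklearn.feature_selection` module. As a rough sketch, assuming a univariate F-test is an acceptable first filter, `SelectKBest` with `f_regression` ranks each feature on its own. The data below is synthetic and hypothetical, built so that only columns 0 and 3 drive the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data, assumed only for illustration: 5 features, 2 informative
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(0, 0.5, size=500)

# Keep the k features with the highest univariate F-statistic
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# Column indices of the selected features
selected = np.flatnonzero(selector.get_support())
print(selected)
```

This scores each feature in isolation, so it can miss features that only matter in combination. The same module also offers wrapper methods such as `RFE` and `SequentialFeatureSelector`.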

### Mean Squared Error

### $MSE = \frac{1}{n}{\sum_{i=1}^n{ (y_i - \hat{y}_i)^2}}$

In [17]:
mse = np.round( mt.mean_squared_error( y_train, y_pred ) , 2 )
mse

39370.27

### Root Mean Squared Error

### $RMSE = \sqrt{\frac{1}{n}{\sum_{i=1}^n{ (y_i - \hat{y}_i)^2}}}$

In [21]:
# Take the square root of the unrounded MSE and round only the final result
rmse = round(mat.sqrt(mt.mean_squared_error(y_train, y_pred)), 2)
rmse

198.42
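To make the MSE and RMSE formulas concrete, here is a small sketch that applies them to four made-up values (the arrays `y_true` and `y_hat` are hypothetical) and checks the MSE against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# MSE: mean of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)
# RMSE: square root of the MSE, expressed in the same units as the target
rmse = np.sqrt(mse)

print(mse, rmse)
```

The residuals are 0.5, 0, -1.5 and -1, so the squared residuals sum to 3.5 and the MSE is 3.5 / 4 = 0.875.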

### Root Mean Squared Percentage Error

### $RMSPE = \sqrt{\frac{1}{n}{\sum_{i=1}^n{ \left(\frac{\hat{y}_i - y_i}{y_i}\right)^2}}}$

In [50]:
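# NOTE: this divides the overall MSE by each |y_i|, so it returns one value per row
# (the table below), not the single RMSPE value defined by the formula above.
rmspe = round(np.sqrt( mt.mean_squared_error( y_train, y_pred )/abs(y_train)),2)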
rmspe

Unnamed: 0,saldo_atual
0,11.90
1,12.10
2,9.39
3,11.07
4,9.58
...,...
9495,15.81
9496,8.89
9497,11.33
9498,13.70


RMSPE: saldo_atual    5244.150453
dtype: float64


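For reference, the RMSPE formula can be implemented directly so that it returns a single number. This is a minimal sketch, not the code that produced the output above: the helper name `rmspe_score` and the sample arrays are hypothetical.

```python
import numpy as np

def rmspe_score(y_true, y_pred):
    """Root mean squared percentage error, returned as a fraction (0.1 = 10%)."""
    # Flatten so that (n, 1) predictions and one-column targets line up
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    # Relative error per observation; undefined when y_true contains zeros
    rel_err = (y_pred - y_true) / y_true
    return float(np.sqrt(np.mean(rel_err ** 2)))

# Hypothetical observed and predicted values
y_true = np.array([100.0, 200.0, 400.0])
y_hat = np.array([110.0, 180.0, 400.0])

print(rmspe_score(y_true, y_hat))
```

Because each error is divided by the observed value, targets close to zero can dominate the result.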


### Mean Absolute Error

### $MAE = \frac{1}{n}{\sum_{i=1}^n|y_i - \hat{y}_i|}$

In [24]:
mae = np.round(mt.mean_absolute_error( y_train, y_pred ),2)
mae

145.76

### Mean Absolute Percentage Error

### $MAPE = \frac{1}{n}{\sum_{i=1}^n\left|\frac{y_i - \hat{y}_i}{y_i}\right|}$

In [26]:
mape = round(mt.mean_absolute_percentage_error(y_train,y_pred),2)
mape

1.31
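The MAE and MAPE formulas can be verified by hand on a few made-up values (the arrays below are hypothetical) and compared against the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical observed and predicted values
y_true = np.array([100.0, 200.0, 400.0])
y_hat = np.array([110.0, 180.0, 400.0])

# MAE: mean of the absolute residuals, in the units of the target
mae = np.mean(np.abs(y_true - y_hat))
# MAPE: mean of the absolute residuals relative to the observed values
mape = np.mean(np.abs((y_true - y_hat) / y_true))

print(mae, mean_absolute_error(y_true, y_hat))
print(mape, mean_absolute_percentage_error(y_true, y_hat))
```

Note that `mean_absolute_percentage_error` returns a fraction rather than a percentage, so the value 1.31 above corresponds to an average absolute error of 131% of the observed balance.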