<a href="https://colab.research.google.com/github/thiagot3/Modelos-de-Regressao-nao-Lineares/blob/main/Modelos%20n%C3%A3o%20Lineares.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Decision Trees


| Advantages | Disadvantages |
|:-------------------|:-----------------------|
| Easy to understand | Prone to overfitting |
| Less need for data cleaning | Not well suited to continuous variables |
| Not restricted to a particular data type | |

## Ensemble
An ensemble is a set of predictors whose outputs are combined to reach a decision.

| Bagging:            |
|:-------------------|
| Trains a series of models in parallel. Each model is trained on a random sample of the data. |

<br>

| Boosting:  |
|:-----------------------|
| Trains a series of models sequentially. Each model is trained by learning from the errors of the previous model. |

> Random Forest Regressor:
 * An algorithm that uses the Bagging ensemble method
 * Builds several decision trees during training and returns the mean of the individual trees' predictions
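
The bagging idea behind a random forest can be sketched by hand: fit each tree on a bootstrap sample and average the predictions. This is a minimal illustration on synthetic data, not the notebook's temperature dataset; the names `n_trees`, `X_new`, and `bagged` are made up for the example. A random forest can additionally consider a random subset of features at each split, which this sketch leaves out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem (illustrative data only)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Bagged prediction = mean of the per-tree predictions
X_new = np.array([[2.5], [7.0]])
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, 2)
bagged = per_tree.mean(axis=0)
print(bagged)
```

Because the trees are independent of each other, they could be trained in parallel, which is the property the Bagging table above describes.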

> AdaBoost:
  * An algorithm that uses the Boosting ensemble method
  * Builds trees sequentially, with each new tree refining the result of the previous one

> Gradient Boosting:
 * As the name suggests, it is an algorithm that uses Boosting
 * It uses the residual (actual value minus predicted value) left by the previous tree to improve the next one
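
The residual-fitting loop can be sketched by hand for the squared-error case. This is a simplified illustration on synthetic data, not how `GradientBoostingRegressor` is implemented internally; `learning_rate`, `n_rounds`, and `mse_history` are names chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem (illustrative data only)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residual = y - prediction              # actual minus predicted
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                  # each tree learns the current residual
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

# Training MSE before the first round and after the last one
print(round(mse_history[0], 3), round(mse_history[-1], 3))
```

Unlike bagging, each round depends on the predictions accumulated so far, so the trees have to be built one after another.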

In [136]:
import pandas as pd
import numpy as np
import pydot
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn import metrics
from sklearn.tree import DecisionTreeRegressor, export_graphviz 
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

In [137]:
df =  pd.read_excel("/content/temps.xlsx") # Seattle temperature dataset
df.head()
#temp_1: temperature one day before the date
#temp_2: temperature two days before the date
#average: average temperature for the day
#actual: the temperature observed on the day (target variable)


Unnamed: 0,year,month,day,week,temp_2,temp_1,average,actual
0,2016,1,1,Fri,45,45,45.6,45
1,2016,1,2,Sat,44,45,45.7,44
2,2016,1,3,Sun,45,44,45.8,41
3,2016,1,4,Mon,44,41,45.9,40
4,2016,1,5,Tues,41,40,46.0,44


In [138]:
df.describe()

Unnamed: 0,year,month,day,temp_2,temp_1,average,actual
count,348.0,348.0,348.0,348.0,348.0,348.0,348.0
mean,2016.0,6.477011,15.514368,62.652299,62.701149,59.760632,62.543103
std,0.0,3.49838,8.772982,12.165398,12.120542,10.527306,11.794146
min,2016.0,1.0,1.0,35.0,35.0,45.1,35.0
25%,2016.0,3.0,8.0,54.0,54.0,49.975,54.0
50%,2016.0,6.0,15.0,62.5,62.5,58.2,62.5
75%,2016.0,10.0,23.0,71.0,71.0,69.025,71.0
max,2016.0,12.0,31.0,117.0,117.0,77.4,92.0


In [139]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 348 entries, 0 to 347
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   year     348 non-null    int64  
 1   month    348 non-null    int64  
 2   day      348 non-null    int64  
 3   week     348 non-null    object 
 4   temp_2   348 non-null    int64  
 5   temp_1   348 non-null    int64  
 6   average  348 non-null    float64
 7   actual   348 non-null    int64  
dtypes: float64(1), int64(6), object(1)
memory usage: 21.9+ KB


In [140]:
df = pd.get_dummies(df)
df.head()

Unnamed: 0,year,month,day,temp_2,temp_1,average,actual,week_Fri,week_Mon,week_Sat,week_Sun,week_Thurs,week_Tues,week_Wed
0,2016,1,1,45,45,45.6,45,1,0,0,0,0,0,0
1,2016,1,2,44,45,45.7,44,0,0,1,0,0,0,0
2,2016,1,3,45,44,45.8,41,0,0,0,1,0,0,0
3,2016,1,4,44,41,45.9,40,0,1,0,0,0,0,0
4,2016,1,5,41,40,46.0,44,0,0,0,0,0,1,0
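
The `pd.get_dummies` call above one-hot encodes the text column `week`, which is why the `week_*` columns appear. A minimal sketch on a toy frame (the `toy` data is invented for illustration) shows the mechanics: numeric columns are kept as they are and each category becomes its own indicator column.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (made-up values)
toy = pd.DataFrame({"week": ["Fri", "Sat", "Fri"], "temp_1": [45, 45, 44]})

# Object-typed columns are expanded into one indicator column per category
encoded = pd.get_dummies(toy)
print(list(encoded.columns))  # ['temp_1', 'week_Fri', 'week_Sat']
```

Depending on the pandas version, the indicator columns may be shown as `True`/`False` rather than `1`/`0`.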


Separating the X and y variables

In [141]:
# separating the target (y) and the features (x) as numpy arrays
y = np.array(df['actual'])
x = df.drop('actual', axis = 1)
df_list = list(x.columns)
x = np.array(x)

Creating the training and test sets

In [142]:
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.25, random_state = 42)

Creating a baseline error that uses the `average` column as the prediction

In [143]:
baseline_pred = xtest[:, df_list.index("average")]

baseline_error = abs(baseline_pred - ytest)
print("Baseline error average:", round(np.mean(baseline_error),2))

Baseline error average: 5.06


### RandomForest

In [144]:
rf = RandomForestRegressor(n_estimators = 1000, random_state= 42)
rf.fit(xtrain,ytrain)

In [145]:
prediction_rf = rf.predict(xtest)

error_rf =  abs(prediction_rf - ytest)

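# Note: this R² is computed on the full dataset (training + test rows),
# so it is optimistic; the error metrics below use only the test set.
r_sq = rf.score(x, y)
print("R²", r_sq)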
print("MAE:", metrics.mean_absolute_error(ytest, prediction_rf))
print("MSE:", metrics.mean_squared_error(ytest, prediction_rf))
print("RMSE:", np.sqrt(metrics.mean_squared_error(ytest, prediction_rf)))

R² 0.932094797587982
MAE: 3.932057471264368
MSE: 26.68358100000001
RMSE: 5.165615258611505
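
The R² above is scored on `x, y`, which includes the rows the forest was trained on, while MAE, MSE, and RMSE use only the test set. A minimal sketch on synthetic data (not the temperature dataset; `r2_train`, `r2_test`, and `r2_all` are names chosen for the example) shows how far apart those scores can be.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: one informative feature, two noise features
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

r2_train = model.score(X_train, y_train)  # rows the model has seen
r2_test = model.score(X_test, y_test)     # held-out rows only
r2_all = model.score(X, y)                # mix of both, as in the cell above

print("train:", round(r2_train, 3))
print("test: ", round(r2_test, 3))
print("all:  ", round(r2_all, 3))
```

When comparing models, the held-out score is the one that is comparable with the test-set error metrics.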


Finding the importance of each variable in the model:

In [153]:
# Pair each feature name with its importance, rounded to two decimals
importances = list(rf.feature_importances_)
feature_importance = [(name, round(importance, 2)) for name, importance in zip(df_list, importances)]

# Sort from most to least important
feature_importance = sorted(feature_importance, key = lambda pair: pair[1], reverse = True)

for pair in feature_importance:
    print("Feature: {:20} Importance: {}".format(*pair))

Feature: temp_1               Importance: 0.76
Feature: average              Importance: 0.22
Feature: month                Importance: 0.01
Feature: temp_2               Importance: 0.01
Feature: year                 Importance: 0.0
Feature: day                  Importance: 0.0
Feature: week_Fri             Importance: 0.0
Feature: week_Mon             Importance: 0.0
Feature: week_Sat             Importance: 0.0
Feature: week_Sun             Importance: 0.0
Feature: week_Thurs           Importance: 0.0
Feature: week_Tues            Importance: 0.0
Feature: week_Wed             Importance: 0.0


> Aside: viewing the nodes of a tree from the Random Forest model

In [150]:
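# Fit a separate, shallow forest for visualization so the `rf` model trained above is not overwritten
rf_small = RandomForestRegressor(max_depth=3)
rf_small.fit(xtrain,ytrain)

# Pick one of the fitted trees (index 5 is arbitrary)
tree = rf_small.estimators_[5]

# Export the tree to Graphviz DOT format, then render it as a PNG
export_graphviz(tree, out_file="tree.dot", feature_names= df_list, rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file("tree.dot")
graph.write_png("tree.png")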
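
If Graphviz or `pydot` is not available, a tree's nodes can also be inspected as plain text with `sklearn.tree.export_text`. The sketch below uses synthetic data from `make_regression` and placeholder feature names (`f0`, `f1`, `f2`), so it illustrates the call rather than reproducing the tree above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic regression data with placeholder feature names
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
feature_names = ["f0", "f1", "f2"]

# A small, shallow forest so the printed tree stays readable
forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Take one fitted tree and print its split rules as indented text
tree = forest.estimators_[5]
rules = export_text(tree, feature_names=feature_names, decimals=1)
print(rules)
```

Each indented line is a split condition, and the `value` lines are the predictions stored in the leaves.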

### AdaBoost

In [146]:
ada = AdaBoostRegressor(n_estimators=1000)
ada.fit(xtrain, ytrain)

ada_pred = ada.predict(xtest)

In [147]:
error_ada =  abs(ada_pred - ytest)

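# Note: this R² is computed on the full dataset (training + test rows),
# so it is optimistic; the error metrics below use only the test set.
r_sq = ada.score(x, y)
print("R²", r_sq)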
print("MAE:", metrics.mean_absolute_error(ytest, ada_pred))
print("MSE:", metrics.mean_squared_error(ytest, ada_pred))
print("RMSE:", np.sqrt(metrics.mean_squared_error(ytest, ada_pred)))

R² 0.8790628116945167
MAE: 3.607622188020589
MSE: 23.078259923592586
RMSE: 4.803983755550448


Finding the importance of each variable in the model:

In [154]:
# Pair each feature name with its importance, rounded to two decimals
importances = list(ada.feature_importances_)
feature_importance = [(name, round(importance, 2)) for name, importance in zip(df_list, importances)]

# Sort from most to least important
feature_importance = sorted(feature_importance, key = lambda pair: pair[1], reverse = True)

for pair in feature_importance:
    print("Feature: {:20} Importance: {}".format(*pair))

Feature: temp_1               Importance: 0.47
Feature: average              Importance: 0.26
Feature: temp_2               Importance: 0.1
Feature: month                Importance: 0.07
Feature: day                  Importance: 0.04
Feature: week_Mon             Importance: 0.04
Feature: week_Fri             Importance: 0.01
Feature: week_Sun             Importance: 0.01
Feature: year                 Importance: 0.0
Feature: week_Sat             Importance: 0.0
Feature: week_Thurs           Importance: 0.0
Feature: week_Tues            Importance: 0.0
Feature: week_Wed             Importance: 0.0


### GradientBoosting

In [148]:
gbr = GradientBoostingRegressor(n_estimators=1000)
gbr.fit(xtrain, ytrain)

gbr_pred = gbr.predict(xtest)

In [149]:
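# Absolute errors of the Gradient Boosting predictions (previously computed from ada_pred by mistake)
error_gbr =  abs(gbr_pred - ytest)

# Note: this R² is computed on the full dataset (training + test rows),
# so it is optimistic; the error metrics below use only the test set.
r_sq = gbr.score(x, y)
print("R²", r_sq)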
print("MAE:", metrics.mean_absolute_error(ytest, gbr_pred))
print("MSE:", metrics.mean_squared_error(ytest, gbr_pred))
print("RMSE:", np.sqrt(metrics.mean_squared_error(ytest, gbr_pred)))

R² 0.9398910000332299
MAE: 4.296477133165536
MSE: 33.31931111693481
RMSE: 5.772288204597446


Finding the importance of each variable in the model:

In [155]:
# Pair each feature name with its importance, rounded to two decimals
importances = list(gbr.feature_importances_)
feature_importance = [(name, round(importance, 2)) for name, importance in zip(df_list, importances)]

# Sort from most to least important
feature_importance = sorted(feature_importance, key = lambda pair: pair[1], reverse = True)

for pair in feature_importance:
    print("Feature: {:20} Importance: {}".format(*pair))

Feature: temp_1               Importance: 0.61
Feature: average              Importance: 0.3
Feature: day                  Importance: 0.03
Feature: temp_2               Importance: 0.02
Feature: month                Importance: 0.01
Feature: week_Fri             Importance: 0.01
Feature: year                 Importance: 0.0
Feature: week_Mon             Importance: 0.0
Feature: week_Sat             Importance: 0.0
Feature: week_Sun             Importance: 0.0
Feature: week_Thurs           Importance: 0.0
Feature: week_Tues            Importance: 0.0
Feature: week_Wed             Importance: 0.0


In this case AdaBoost would be the model selected, since it has the lowest test-set MAE, MSE, and RMSE of the three models.
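
The comparison can be made explicit by collecting the test-set metrics printed above (rounded to three decimals) and ranking the models by MAE. This is only a bookkeeping sketch; the `results` dictionary is typed in by hand from the outputs in this notebook, and since the boosting models are created without a `random_state`, a rerun may shift the numbers.

```python
# Test-set metrics copied from the outputs above, rounded to three decimals
results = {
    "RandomForest": {"MAE": 3.932, "MSE": 26.684, "RMSE": 5.166},
    "AdaBoost": {"MAE": 3.608, "MSE": 23.078, "RMSE": 4.804},
    "GradientBoosting": {"MAE": 4.296, "MSE": 33.319, "RMSE": 5.772},
}
baseline_mae = 5.06  # mean absolute error of the `average` baseline

# Rank the models from lowest to highest MAE
ranked = sorted(results.items(), key=lambda item: item[1]["MAE"])
for name, m in ranked:
    gain = baseline_mae - m["MAE"]
    print(f"{name:17} MAE={m['MAE']:.3f}  improvement over baseline={gain:.2f}")

best = ranked[0][0]
print("Lowest MAE:", best)  # Lowest MAE: AdaBoost
```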

---

