This notebook is a translation of the [article](https://towardsdatascience.com/a-complete-machine-learning-project-walk-through-in-python-part-two-300f1f8147e2) originally written by [William Koehrsen](https://twitter.com/koehrsen_will) on [Towards Data Science](https://towardsdatascience.com).

# A Complete Machine Learning Project in Python - Part 2

Assembling all the machine learning pieces needed to solve a problem can be a daunting task. In this series of articles, we walk through implementing a machine learning workflow on a real-world dataset to see how the individual techniques fit together.

In the first [notebook](https://github.com/willsilvano/datascience/blob/master/Towards%20DataScience/Energy%20New%20York%20-%20Part%20One.ipynb), we cleaned and structured the data, performed exploratory data analysis, developed a set of features to use in our model, and established a baseline against which we can measure performance. In this article, we will look at how to implement and compare several machine learning models in Python, perform hyperparameter tuning to optimize the best model, and evaluate the final model on the test set.

# Imports

Below are the libraries used in this notebook:

In [1]:
# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# Turn off the chained-assignment warning
pd.options.mode.chained_assignment = None

# Display up to 60 columns of a dataframe
pd.set_option('display.max_columns', 60)

# Matplotlib for visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Set the default font size
plt.rcParams['font.size'] = 24

from IPython.core.pylabtools import figsize

# Seaborn for visualization
import seaborn as sns
sns.set(font_scale = 2)

# Imputing missing values and scaling values
# Note: Imputer was removed in scikit-learn 0.22; on newer versions use
# `from sklearn.impute import SimpleImputer` instead
from sklearn.preprocessing import Imputer, MinMaxScaler

# Machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Model Evaluation and Selection

As a reminder, we are working on a supervised regression task: using [New York City building energy data](http://www.nyc.gov/html/gbee/html/plan/ll84_scores.shtml), we want to develop a model that can predict the [Energy Star Score](https://www.energystar.gov/buildings/facility-owners-and-managers/existing-buildings/use-portfolio-manager/interpret-your-results/what) of a building. Our focus is on both the accuracy of the predictions and the interpretability of the model.

There are [a ton of machine learning models](http://scikit-learn.org/stable/supervised_learning.html) to choose from, and deciding where to start can be intimidating. Although there are [some charts](https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-cheat-sheet) that try to show which algorithm to use, I prefer to just try several and see which one works best! Machine learning is still a field driven primarily by [empirical (experimental) rather than theoretical results](https://www.quora.com/How-much-of-deep-learning-research-is-empirical-versus-theoretical), and it is almost impossible to [know ahead of time which model will do best](http://www.statsblogs.com/2014/01/25/machine-learning-lesson-of-the-day-the-no-free-lunch-theorem/).

Generally, it is a good idea to start with simple, interpretable models such as linear regression and, if performance is not adequate, move on to more complex but usually more accurate methods. The following chart shows a (highly unscientific) version of the trade-off between accuracy and interpretability:

![alt text](https://cdn-images-1.medium.com/max/800/1*NkffR5Ufy_h4RfSVpTJ2iQ.png)

We will evaluate five different models covering the complexity spectrum:

- **Linear Regression**
- **K-Nearest Neighbors Regression**
- **Random Forest Regression**
- **Gradient Boosted Regression**
- **Support Vector Machine Regression**

In this notebook, we focus on implementing these methods rather than on the theory behind them. For anyone interested in learning the background, I recommend [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) (available free online) or [Hands-On Machine Learning with Scikit-Learn and TensorFlow](http://shop.oreilly.com/product/0636920052289.do). Both textbooks do a great job of explaining the theory and showing how to use the methods effectively in R and Python, respectively.
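
As a rough illustration of how the five models listed above could be compared under one loop, here is a minimal sketch. It runs on synthetic data rather than the New York City dataset, and the hyperparameter values, the `fit_and_evaluate` helper, and the choice of mean absolute error as the metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration, not the NYC data)
rng = np.random.RandomState(42)
X = rng.uniform(0, 1, size=(200, 5))
y = 100 * X[:, 0] + 10 * np.sin(6 * X[:, 1]) + rng.normal(0, 5, size=200)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=42)

# One instance of each model family; hyperparameter values are placeholders
models = {
    'Linear Regression': LinearRegression(),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=10),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosted': GradientBoostingRegressor(random_state=42),
    'Support Vector Machine': SVR(C=1000, gamma=0.1),
}

def fit_and_evaluate(model):
    """Fit on the training split and return the MAE on the validation split."""
    model.fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))

results = {name: fit_and_evaluate(model) for name, model in models.items()}

# Print the models from lowest to highest validation error
for name, mae in sorted(results.items(), key=lambda item: item[1]):
    print(f'{name}: MAE = {mae:.2f}')
```

Keeping the models in a dictionary makes it easy to add or remove candidates without changing the evaluation code.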

# Imputing Missing Values

Although we dropped the columns with more than 50% missing values when we cleaned the data, there are still some missing observations. Most machine learning models cannot handle missing values, so we have to fill them in, a [process known as imputation](https://en.wikipedia.org/wiki/Imputation_%28statistics%29).
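
For reference, the 50% threshold mentioned above can be checked with a few lines of pandas. This is only a sketch on a hypothetical toy frame; the column names are made up, and the actual cleaning code lives in the first notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],
    'some_missing': [1.0, np.nan, 3.0, 4.0],
    'complete': [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values in each column
missing_fraction = toy.isna().mean()

# Keep only the columns with at most 50% missing values
kept = toy.loc[:, missing_fraction <= 0.5]

print(missing_fraction)
print(list(kept.columns))
```

The surviving columns can still contain `NaN` entries, which is why an imputation step is needed afterwards.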

First, we read in all the data we saved earlier:

In [5]:
# Create new dataframes from the files saved earlier
train_features = pd.read_csv('data/energy_new_york_training_features.csv')
test_features = pd.read_csv('data/energy_new_york_testing_features.csv')

train_labels = pd.read_csv('data/energy_new_york_training_labels.csv')
test_labels = pd.read_csv('data/energy_new_york_testing_labels.csv')

# Display the sizes of the dataframes
print('Training Feature Size', train_features.shape)
print('Testing Feature Size ', test_features.shape)
print('Training Labels Size ', train_labels.shape)
print('Testing Labels Size  ', test_labels.shape)

Training Feature Size (6622, 65)
Testing Feature Size  (2839, 65)
Training Labels Size  (6622, 1)
Testing Labels Size   (2839, 1)


In [7]:
# Show the first 10 rows of the dataframe
train_features.head(10)

Unnamed: 0,Direct GHG Emissions (Metric Tons CO2e),Water Intensity (All Water Sources) (gal/ft²),log_Property Id,log_Year Built,log_Number of Buildings - Self-reported,log_Occupancy,log_Weather Normalized Site Natural Gas Use (therms),log_Direct GHG Emissions (Metric Tons CO2e),log_Property GFA - Self-Reported (ft²),log_Water Intensity (All Water Sources) (gal/ft²),log_Source EUI (kBtu/ft²),log_Community Board,log_Census Tract,Borough_Bronx,Borough_Brooklyn,Borough_Manhattan,Borough_Queens,Borough_Staten Island,Largest Property Use Type_Adult Education,Largest Property Use Type_Automobile Dealership,Largest Property Use Type_Bank Branch,Largest Property Use Type_College/University,Largest Property Use Type_Convenience Store without Gas Station,Largest Property Use Type_Courthouse,Largest Property Use Type_Distribution Center,Largest Property Use Type_Enclosed Mall,Largest Property Use Type_Financial Office,Largest Property Use Type_Hospital (General Medical & Surgical),Largest Property Use Type_Hotel,Largest Property Use Type_K-12 School,...,Largest Property Use Type_Museum,Largest Property Use Type_Non-Refrigerated Warehouse,Largest Property Use Type_Office,Largest Property Use Type_Other,Largest Property Use Type_Other - Education,Largest Property Use Type_Other - Entertainment/Public Assembly,Largest Property Use Type_Other - Lodging/Residential,Largest Property Use Type_Other - Mall,Largest Property Use Type_Other - Public Services,Largest Property Use Type_Other - Recreation,Largest Property Use Type_Other - Services,Largest Property Use Type_Other - Specialty Hospital,Largest Property Use Type_Outpatient Rehabilitation/Physical Therapy,Largest Property Use Type_Parking,Largest Property Use Type_Performing Arts,Largest Property Use Type_Pre-school/Daycare,Largest Property Use Type_Refrigerated Warehouse,"Largest Property Use Type_Repair Services (Vehicle, Shoe, Locksmith, etc.)",Largest Property Use Type_Residence Hall/Dormitory,Largest Property Use Type_Residential Care Facility,Largest Property Use Type_Restaurant,Largest Property Use Type_Retail Store,Largest Property Use Type_Self-Storage Facility,Largest Property Use Type_Senior Care Community,Largest Property Use Type_Social/Meeting Hall,Largest Property Use Type_Strip Mall,Largest Property Use Type_Supermarket/Grocery Store,Largest Property Use Type_Urgent Care/Clinic/Other Outpatient,Largest Property Use Type_Wholesale Club/Supercenter,Largest Property Use Type_Worship Facility
0,440.9,99.41,15.581915,7.575585,0.0,4.60517,11.429059,6.088818,11.255449,4.599253,5.143416,,,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,217.9,,15.296761,7.563201,0.0,4.60517,10.708118,5.384036,10.858999,,4.902307,1.098612,5.081404,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,151.0,,15.355679,7.577634,0.0,4.60517,10.255492,5.01728,11.561716,,4.254193,2.197225,5.796058,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,672.4,,15.037325,7.596894,0.0,4.60517,11.823328,6.510853,11.770146,,5.665388,0.0,3.295837,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
4,456.5,28.65,14.810363,7.56372,0.0,4.60517,7.537217,6.123589,11.711366,3.355153,4.487512,1.94591,5.105945,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,248.8,4.8,14.7037,7.564757,0.0,4.49981,10.627995,5.516649,12.398963,1.568616,4.844187,0.693147,3.610918,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,227.3,67.14,15.56253,7.571474,0.0,4.60517,10.722937,5.426271,11.242428,4.20678,4.528289,2.484907,6.194405,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,431.4,30.73,14.797778,7.569412,0.0,4.60517,,6.067036,11.609598,3.425239,4.809742,2.079442,4.955827,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,478.4,41.96,14.902276,7.580189,0.0,4.60517,11.40943,6.170447,12.058674,3.736717,4.508659,,,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,293.2,86.88,15.300449,7.570959,0.0,4.60517,10.892895,5.680855,11.074731,4.464528,4.890349,,,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Every value that is `NaN` represents a missing observation. Although there are [many ways to fill in missing data](https://www.omicsonline.org/open-access/a-comparison-of-six-methods-for-missing-data-imputation-2155-6180-1000224.php?aid=54590), we will use a relatively simple method, median imputation. This replaces every missing value in a column with the median value of that column.
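
To make the idea concrete, here is a minimal sketch of median imputation in plain pandas on hypothetical toy data. The frame and column names are invented for the example, and computing the medians from the training rows only (then reusing them for the test rows) is an assumption made here to keep test information out of the training step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real feature tables
toy_train = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 10.0],
    'b': [np.nan, 2.0, 4.0, 6.0],
})
toy_test = pd.DataFrame({
    'a': [np.nan, 5.0],
    'b': [1.0, np.nan],
})

# Column medians computed from the training rows only (NaN is skipped by default)
train_medians = toy_train.median()

# Fill the gaps in both sets with the training medians
toy_train_imputed = toy_train.fillna(train_medians)
toy_test_imputed = toy_test.fillna(train_medians)

print(train_medians)
print(toy_test_imputed)
```

In this toy example the medians are 3.0 for column `a` and 4.0 for column `b`, so those are the values that replace the missing entries in both frames.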
