# Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [2]:
data = pd.read_csv('melb_data.csv')
data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


# Analyzing the data

In [3]:
data.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [4]:
data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [5]:
# The Melbourne data has some missing values (some houses for which some variables weren't recorded).
# Handling missing values properly is a topic for a later tutorial.
# For now we take the simplest option and drop every row that has at least one missing value:

data = data.dropna(axis = 0)
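Before dropping rows, it can help to see how much data is missing. The following is a minimal sketch on a small made-up DataFrame (the values are illustrative and are not taken from `melb_data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a value that was not recorded.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
    "YearBuilt": [1900.0, 1900.0, np.nan, 2014.0],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna(axis=0) removes every row that has at least one missing value.
complete_rows = toy.dropna(axis=0)
print(len(toy), "rows before,", len(complete_rows), "rows after")
```

In this notebook the same step reduces the data from 13,580 rows to 6,196, so dropping rows is simple but costly.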

## Selecting the Prediction Target
    You can pull out a variable with dot-notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

    We'll use the dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices in the Melbourne data is shown in the next cell.
    
    This is an example of a SUPERVISED model: we have the FEATURES (X) and a TARGET variable (y).

In [6]:
y = data.Price  # or y = data['Price']
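As a small illustration of the two ways to pull out a column, the sketch below uses a made-up DataFrame (the name `toy` and its values are assumptions for the example) and checks that both notations return the same Series:

```python
import pandas as pd

# Hypothetical toy data standing in for the Melbourne DataFrame.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

y_dot = toy.Price          # dot notation
y_bracket = toy["Price"]   # bracket notation

print(type(y_dot).__name__)       # the class name of the selected column
print(y_dot.equals(y_bracket))    # whether both selections hold the same data
```

Bracket notation is the safer choice when a column name contains spaces or clashes with an existing DataFrame attribute.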

## Choosing 'Features'
    The columns that are inputted into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

    For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

    We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

    Here is an example:

In [7]:
data_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

# By convention, this data is called X:

X = data[data_features] 

In [8]:
#Visually checking your data with these commands is an important part of a data scientist's job. 
#You'll frequently find surprises in the dataset that deserve further inspection.

X.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635


## Building your model
    You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

    The steps to building and using a model are:

    Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
    
    Fit: Capture patterns from the provided data. This is the heart of modeling. (Training the model = fit.)
    
    Predict: Just what it sounds like. (Use the trained model on data.)
    
    Evaluate: Determine how accurate the model's predictions are. (Measure how far the predictions are from the actual values.)
    
    
    Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

In [9]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure the same results on each run
melbourne_model = DecisionTreeRegressor(random_state=1)  # random_state fixes the randomness, so the tree is the same every time

# Fit model (train the model)
melbourne_model.fit(X, y)  # (features, target)

DecisionTreeRegressor(random_state=1)

    Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered good practice. You can use any number, and model quality won't depend meaningfully on exactly which value you choose.
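The effect of `random_state` can be checked directly. This is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions made up for the example): two trees defined with the same `random_state` and fitted on the same data give identical predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(size=200)

# Two models defined with the same random_state and fitted on the same data ...
model_a = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
model_b = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)

# ... are compared on new rows they have not seen.
X_new = rng.normal(size=(20, 5))
same = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print(same)
```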

    We now have a fitted model that we can use to make predictions.

    In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [10]:
print("Making predictions for the following 5 houses:")
print(X.head())

print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]
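The four steps listed earlier (define, fit, predict, evaluate) can be sketched end to end on a synthetic dataset. The data and the variable names below are assumptions for illustration; the evaluation is done on training rows only to show the call, not as a sound measure of quality.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): 100 "houses" with 2 numeric features.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = 50_000 * X_demo[:, 0] + 20_000 * X_demo[:, 1]

# 1. Define
model = DecisionTreeRegressor(random_state=1)
# 2. Fit
model.fit(X_demo, y_demo)
# 3. Predict
preds = model.predict(X_demo[:5])
# 4. Evaluate (on rows the model was trained on, only to show the API)
mae = mean_absolute_error(y_demo[:5], preds)
print(preds, mae)
```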


## Model Validation
    Measure the performance of your model, so you can test and compare alternatives. 
    
    Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the target values in the training data. You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.

    You'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.

    There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error** (also called MAE). Let's break down this metric starting with the last word, error.

    The prediction error for each house is:

    error = actual − predicted
    
    So, if a house cost $150,000 and you predicted it would cost $100,000, the error is $50,000.
    
    With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as

    'On average, our predictions are off by about X.'
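To make the definition concrete, here is a small worked example with three made-up houses (the prices are hypothetical). It computes the MAE by hand and compares it with scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices for three houses.
actual = np.array([150_000, 200_000, 300_000])
predicted = np.array([100_000, 210_000, 330_000])

errors = actual - predicted            # signed errors: 50000, -10000, -30000
mae_manual = np.abs(errors).mean()     # (50000 + 10000 + 30000) / 3
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_manual, mae_sklearn)
```

Both values come out to 30,000: on average, these predictions are off by about $30,000.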

In [11]:
# To calculate MAE, we first need a model. It is built in this cell:

# Load data

melbourne_data = pd.read_csv('melb_data.csv')

# Filter out rows with missing values in any column

filtered_melbourne_data = melbourne_data.dropna(axis=0)

# Choose target and features

y = filtered_melbourne_data.Price

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']

X = filtered_melbourne_data[melbourne_features]

from sklearn.tree import DecisionTreeRegressor

# Define model
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor()

**Once we have a model, here is how we calculate the mean absolute error:**

In [12]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)  # array with one prediction per row

mean_absolute_error(y, predicted_home_prices)  # 1st argument: target (actual values); 2nd: predictions

434.71594577146544

In [13]:
# Comparing predictions with the target -> the data was not split into training and test sets -> overfitting

print('Predictions', predicted_home_prices[:5])

print('Target', y[:5].values)

Predictions [1035000. 1465000. 1600000. 1876000. 1636000.]
Target [1035000. 1465000. 1600000. 1876000. 1636000.]


### The Problem with "In-Sample" Scores
    The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it.
    Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use it to test the model's accuracy on data it hasn't seen before. This data is called validation data.

**The scikit-learn library has a function train_test_split to break up the data into two pieces. 
We'll use some of that data as training data to fit the model, 
and we'll use the other data as validation data to calculate mean_absolute_error.**

In [14]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we run this script.

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define model

melbourne_model = DecisionTreeRegressor()

# Fit model

melbourne_model.fit(train_X, train_y)  # train on the training features and the training target

# get predicted prices on validation data

val_predictions = melbourne_model.predict(val_X)  # the model has NEVER seen val_X; it predicts from the validation features

print(mean_absolute_error(val_y, val_predictions))  # (actual validation targets, predictions for the validation features)





263160.2130406714


In [15]:
# Comparing predictions with the target, this time on validation data the model has not seen

print('Predictions', val_predictions[:5])

print('Target', val_y[:5].values)

Predictions [ 937500.  550000. 1182500. 1382500.  910000.]
Target [ 815000.  655000.  957500. 1330000.  722000.]


In [16]:
# Shapes after the split: the full feature set would be (6196, 7)
train_X.shape, val_X.shape

((4647, 7), (1549, 7))
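The 4647/1549 split follows from `train_test_split` holding out 25% of the rows when no `test_size` is given. A minimal sketch with placeholder arrays of the same shape (the arrays are assumptions, filled with zeros) reproduces the shapes and shows how an explicit `test_size` changes them:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the filtered Melbourne data.
X_demo = np.zeros((6196, 7))
y_demo = np.zeros(6196)

# No test_size given, so scikit-learn uses its default of 0.25.
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=0
)
print(train_X_demo.shape, val_X_demo.shape)

# An explicit test_size changes the proportions, e.g. 20% for validation.
_, val_X_20, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(val_X_20.shape)
```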

    Your mean absolute error for the in-sample data was about 435 dollars. Out-of-sample it is more than 260,000 dollars.

    This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes. As a point of reference, the average home value in the validation data is 1.1 million dollars. So the error in new data is about a quarter of the average home value.
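The same gap between in-sample and validation error can be reproduced on synthetic data. The sketch below is an illustration under assumed data (a noisy linear relationship), not a rerun of the Melbourne model:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(1000, 3))
y_demo = 100_000 * X_demo[:, 0] + rng.normal(scale=50_000, size=1000)

train_X_d, val_X_d, train_y_d, val_y_d = train_test_split(
    X_demo, y_demo, random_state=0
)

# An unrestricted tree can memorize its training rows.
model = DecisionTreeRegressor(random_state=0).fit(train_X_d, train_y_d)

in_sample_mae = mean_absolute_error(train_y_d, model.predict(train_X_d))
validation_mae = mean_absolute_error(val_y_d, model.predict(val_X_d))
print(f"in-sample MAE:  {in_sample_mae:,.0f}")
print(f"validation MAE: {validation_mae:,.0f}")
```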

    There are many ways to improve this model, such as experimenting to find better features or different model types.

In [17]:
# print the top few validation predictions
# print(melbourne_model.predict(val_X.head()))

# print the top few actual prices from validation data
# print(val_y.head())

# https://www.youtube.com/watch?v=N0wi3f9PCqg [to watch]


## Underfitting and Overfitting
    1. Overfitting (very good on training data, poor on validation / new data):
             A model matches the training data almost perfectly, but does poorly on validation and other new data. In statistics, the term describes a model that fits the previously observed data very closely but proves ineffective at predicting new results.

    2. Underfitting (poor on training data and poor on validation data):
            On the flip side, if we make our tree very shallow, it doesn't divide the houses into very distinct groups. At the extreme, if a tree divides the houses into only 2 or 4 groups, each group still contains a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be poor in validation for the same reason). When a model fails to capture important distinctions and patterns in the data, it performs poorly even on the training data; this is called underfitting.
            
    Example
    There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. 
    **The more leaves we allow the model to make, the more we move from underfitting toward overfitting.**

    We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

In [18]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with the given max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

**The data is loaded into train_X, val_X, train_y and val_y using the code below:**

In [19]:
#Filter rows with missing values

filtered_melbourne_data = melbourne_data.dropna(axis=0)

#Choose target and features
y = filtered_melbourne_data.Price

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']

X = filtered_melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split

#split data into training and validation data, for both features and target

train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)

**Given that, we can use a for loop to compare the accuracy of models built with different values of 'max_leaf_nodes'.**

In [20]:
# compare MAE with differing values of max_leaf_nodes

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

for candidate in candidate_max_leaf_nodes:
    
    erro_do_candidato = get_mae(candidate, train_X, val_X, train_y, val_y)       
    
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(candidate, erro_do_candidato))  # MAE computed on the VALIDATION data!


Max leaf nodes: 5  		 Mean Absolute Error:  347380
Max leaf nodes: 25  		 Mean Absolute Error:  271044
Max leaf nodes: 50  		 Mean Absolute Error:  258171
Max leaf nodes: 100  		 Mean Absolute Error:  248734
Max leaf nodes: 250  		 Mean Absolute Error:  247206
Max leaf nodes: 500  		 Mean Absolute Error:  243495


    Max leaf nodes: 5  		Mean Absolute Error:  347380 [underfitting -> too few leaves and a larger error; the model does not capture the patterns]
    Max leaf nodes: 50  	Mean Absolute Error:  258171 
    Max leaf nodes: 500  	Mean Absolute Error:  243495
    Max leaf nodes: 5000    Mean Absolute Error:  254983 [overfitting -> more leaves, yet the error is LARGER]

**Of the options listed, 500 is the optimal number of leaves.**
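Picking the best candidate does not have to be done by eye. As a small sketch, the validation MAE values printed by the loop can be stored in a dictionary (here typed in by hand from that output; the name `scores` is an assumption) and the smallest one selected with `min`:

```python
# Validation MAE values copied from the loop output above.
scores = {5: 347380, 25: 271044, 50: 258171, 100: 248734, 250: 247206, 500: 243495}

# The key with the smallest MAE is the best candidate.
best_tree_size_demo = min(scores, key=scores.get)
print(best_tree_size_demo)
```

In the notebook itself, the dictionary could be built directly with `{n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidate_max_leaf_nodes}`.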

In [21]:
best_tree_size = 500

# Define the model with the best tree size found above

final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# Fit the final model on all of the data, now that the tree size has been chosen

final_model.fit(X, y)

DecisionTreeRegressor(max_leaf_nodes=500, random_state=1)

**Conclusion:**
    Here's the takeaway: models can suffer from either:

    Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions. These are patterns that exist in the training data but do not exist in new data.
    or
    Underfitting: failing to capture relevant patterns, again leading to less accurate predictions. This happens when the model is not complex enough to capture the patterns that matter for prediction.

    We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one. [Hyperparameter optimization]
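Both failure modes can be seen side by side by tracking the training and validation MAE as the number of leaves grows. The sketch below uses synthetic data (a noisy sine curve, chosen only for illustration); the exact numbers depend on the data, but typically the training error keeps falling while the validation error first falls and then rises again.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(2000, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=2000)

train_X_d, val_X_d, train_y_d, val_y_d = train_test_split(
    X_demo, y_demo, random_state=0
)

results = []  # (leaves, training MAE, validation MAE)
for leaves in [2, 10, 50, 1500]:
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    model.fit(train_X_d, train_y_d)
    train_mae = mean_absolute_error(train_y_d, model.predict(train_X_d))
    val_mae = mean_absolute_error(val_y_d, model.predict(val_X_d))
    results.append((leaves, train_mae, val_mae))
    print(f"leaves={leaves:5d}  train MAE={train_mae:.3f}  validation MAE={val_mae:.3f}")
```

A tree with 2 leaves is poor on both sets (underfitting), while the largest tree is nearly perfect on the training set but worse on validation than a mid-sized tree (overfitting).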

## Random Forests
Using a more sophisticated machine learning algorithm.

    A random forest uses many trees and makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
    The key to a random forest is that each tree is trained on a different random sample of the same data (and, depending on the settings, on random subsets of the features), so the trees differ from one another. Averaging the predictions of the component trees gives more reliable results.
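The averaging can be verified by hand. This is a minimal sketch on synthetic data (the arrays and the choice of 20 trees are assumptions for the example): it collects the prediction of every fitted tree from the forest's `estimators_` attribute and compares their mean with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_demo, y_demo)

# Predictions of each individual tree for a few new rows.
X_new = rng.uniform(0, 10, size=(5, 4))
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])

# Average over the trees and compare with the forest's prediction.
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict(X_new)))
```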
    

**We build a random forest model similarly to how we built a decision tree in scikit-learn - this time using the RandomForestRegressor class instead of DecisionTreeRegressor.**

In [22]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)

melb_preds = forest_model.predict(val_X)  # one predicted price per row of the validation features

print(mean_absolute_error(val_y, melb_preds))

191669.7536453626


**Conclusion:**
    There is likely room for further improvement, but this is a big improvement over the best decision tree error of about 250,000. There are parameters which allow you to change the performance of the random forest, much as we changed the maximum number of leaves of the single decision tree. But one of the best features of random forest models is that they generally work reasonably well even without this tuning.
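As a rough sketch of what such tuning could look like, the loop below varies `n_estimators` (the number of trees) while keeping `max_leaf_nodes` fixed, and records the validation MAE for each setting. The data is synthetic and the candidate values are arbitrary assumptions; results on the Melbourne data would differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 4))
y_demo = (100_000 * X_demo[:, 0] + 30_000 * X_demo[:, 1]
          + rng.normal(scale=40_000, size=500))

train_X_d, val_X_d, train_y_d, val_y_d = train_test_split(
    X_demo, y_demo, random_state=0
)

# Validation MAE for each candidate number of trees.
forest_scores = {}
for n_trees in [5, 25, 100]:
    forest = RandomForestRegressor(
        n_estimators=n_trees, max_leaf_nodes=50, random_state=1
    )
    forest.fit(train_X_d, train_y_d)
    forest_scores[n_trees] = mean_absolute_error(val_y_d, forest.predict(val_X_d))

for n_trees, mae in forest_scores.items():
    print(f"n_estimators={n_trees:3d}  validation MAE={mae:,.0f}")
```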