In [650]:
import pandas as pd
from pathlib import Path

# Start from the current working directory
melbourne_file_path = Path.cwd()

# Path to the data file (despite its name, data_folder points at the CSV itself)
data_folder = melbourne_file_path / "melb_data.csv"

# Read the data and store it in a DataFrame named melbourne_data
melbourne_data = pd.read_csv(data_folder)

# List the columns of the DataFrame
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [651]:
# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
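
As a small illustration of what `dropna(axis=0)` does, the sketch below uses a made-up three-row DataFrame. The values are invented for this example and are not taken from melb_data.csv.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; it is not part of melb_data.csv
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, np.nan, 142.0],
})

# axis=0 drops every row that contains at least one missing value
toy_clean = toy.dropna(axis=0)

print(len(toy), "rows before,", len(toy_clean), "rows after")
print(toy_clean.index.tolist())  # the surviving rows keep their original labels
```

Because the surviving rows keep their original labels, the index has gaps after dropping. This is presumably why the row labels shown later by `X.head()` skip some numbers.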

# Selecting The Prediction Target

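The prediction target is the column we want to predict, here Price. By convention it is called y. Selecting a single column returns a Series, which is broadly like a DataFrame with only a single column of data.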

In [652]:
y = melbourne_data.Price
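
To make the Series/DataFrame distinction concrete, here is a minimal sketch on a toy DataFrame (invented values). It compares dot notation, single-bracket selection, and list selection.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({"Price": [1035000.0, 1465000.0], "Rooms": [2, 3]})

target = toy.Price          # dot notation returns a Series
same_target = toy["Price"]  # bracket notation returns the same Series
one_col = toy[["Price"]]    # a list of column names returns a DataFrame

print(type(target).__name__)   # Series
print(type(one_col).__name__)  # DataFrame
print(target.shape, one_col.shape)
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. Bracket notation works in every case.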

# Choosing "Features"

Features are the columns passed to the model and used to predict the home price. For now, we'll build a model with only a few features.

In [653]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

X = melbourne_data[melbourne_features]

Let's quickly review the data we'll be using to predict house prices with the describe method, which summarizes each column, and the head method, which shows the top few rows.

In [654]:
X.describe()

             Rooms     Bathroom      Landsize    Lattitude   Longtitude
count  6196.000000  6196.000000   6196.000000  6196.000000  6196.000000
mean      2.931407     1.576340    471.006940   -37.807904   144.990201
std       0.971079     0.711362    897.449881     0.075850     0.099165
min       1.000000     1.000000      0.000000   -38.164920   144.542370
25%       2.000000     1.000000    152.000000   -37.855438   144.926198
50%       3.000000     1.000000    373.000000   -37.802250   144.995800
75%       4.000000     2.000000    628.000000   -37.758200   145.052700
max       8.000000     8.000000  37000.000000   -37.457090   145.526350

In [655]:
X.head()

   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954

# Building Models

We will use the "scikit-learn" library to create our models. 
The steps to building and using a model are:

**Define**: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too. 

**Fit**: Capture patterns from provided data. This is the heart of modeling.

**Predict**: Use the fitted model to estimate the target for a given set of feature values.

**Evaluate**: Determine how accurate the model's predictions are.

In [656]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state = 1)

# Fit model
melbourne_model.fit(X, y)

Many machine learning models incorporate randomness during the training process. By specifying a number for the random_state parameter, we ensure consistent results in each run. This is considered a good practice. You can choose any number, as the quality of the model won't depend significantly on the specific value you select.
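
The reproducibility claim can be checked with a short sketch. The data here is synthetic (generated with NumPy purely for illustration), and the only thing it demonstrates is that two models built with the same settings and the same `random_state` produce identical predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data invented for illustration
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = X_toy[:, 0] * 100 + rng.normal(0, 50, size=200)

# Two models built with the same settings and the same random_state
model_a = DecisionTreeRegressor(max_leaf_nodes=10, random_state=1).fit(X_toy, y_toy)
model_b = DecisionTreeRegressor(max_leaf_nodes=10, random_state=1).fit(X_toy, y_toy)

# Same seed, same data, same settings: the predictions are identical
print(np.array_equal(model_a.predict(X_toy), model_b.predict(X_toy)))  # True
```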

Now we have a fitted model that we can use to make predictions.

In real-world scenarios, we typically want to make predictions for new houses entering the market, rather than using the houses we already have price information for. However, for the purpose of understanding how the predict function works, we will make predictions for the first few rows of the training data.

In [657]:
print("Making predictions for the following 5 houses:")
print(X.head())
print(f"The predictions are: {melbourne_model.predict(X.head())}")

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are: [1035000. 1465000. 1600000. 1876000. 1636000.]


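For these five houses the predicted values are equal to the actual Price values, as the comparison below confirms. This is expected rather than impressive: the houses were part of the training data, and an unrestricted decision tree can keep splitting until it reproduces the prices it was trained on.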

In [658]:
# Making predictions for the first few houses
predictions = melbourne_model.predict(X.head())

# Getting the actual values for the first few houses
actual_values = y.head()

# Comparing the predictions with the actual values
comparison = pd.DataFrame({'Predicted': predictions, 'Actual': actual_values})

# Printing the comparison
print(comparison)


   Predicted     Actual
1  1035000.0  1035000.0
2  1465000.0  1465000.0
4  1600000.0  1600000.0
6  1876000.0  1876000.0
7  1636000.0  1636000.0
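
A perfect match on training rows says little about predictive skill. As a hedged illustration, the sketch below fits a default decision tree to synthetic data in which the target is pure noise, unrelated to the features. The data and names are invented for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target is random noise, unrelated to the features
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(50, 2))
y_toy = rng.uniform(0, 1, size=50)

tree = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)

# With default settings the tree keeps splitting until every leaf is pure,
# so here it ends up with one leaf per training row
print(tree.get_n_leaves())

# The training targets are reproduced exactly, even though there is no pattern
print(np.allclose(tree.predict(X_toy), y_toy))
```

The tree has simply memorized the training rows, which is why predictions on training data are a poor guide to model quality.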


# How good is this model?

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE).
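
For each house the prediction error is error = actual − predicted, and MAE is the average of the absolute values of those errors. A minimal worked example with made-up prices, computed by hand and with scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration
actual = np.array([150000.0, 100000.0, 200000.0])
predicted = np.array([100000.0, 110000.0, 200000.0])

# Absolute errors are 50000, 10000 and 0, so the mean is 20000
mae_by_hand = np.mean(np.abs(actual - predicted))

print(mae_by_hand)                             # 20000.0
print(mean_absolute_error(actual, predicted))  # 20000.0
```

Taking absolute values stops positive and negative errors from cancelling out, and the result is in the same units as the target (dollars here).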

In [659]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

The issue with "in-sample" scores is that they can be misleading. In an "in-sample" evaluation, we use the same data to both build and evaluate the model. A model that has overfit the training data will still score well by this measure, so the score cannot reveal the problem.

Let's consider an example: suppose we are predicting home prices, and we include the color of the door as a feature. In our training dataset, we notice that all homes with green doors are very expensive. The model will pick up on this pattern and consistently predict high prices for homes with green doors.

While the model may appear accurate when evaluated on the training data, this pattern might not hold true in the larger real estate market. When the model encounters new data, it could be highly inaccurate because it has learned a specific pattern that doesn't generalize well.

To address this issue, we need to evaluate the model's performance on data it hasn't seen before. We do this by setting aside a portion of the data as validation data, which is not used during the model-building process. This allows us to assess how well the model performs on unseen data and provides a more reliable measure of its practical value.

By using validation data, we can determine if the model generalizes well and makes accurate predictions on new, unseen data. This helps us avoid the pitfall of overfitting and ensures that the model is useful in real-world scenarios.
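
The gap between in-sample and validation scores can be reproduced on synthetic data. In this sketch (all data and names are invented for illustration), price depends on size plus noise, and a `door_color` feature is unrelated to price. An unrestricted tree fits the training rows almost perfectly but does much worse on the held-out rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data invented for illustration: price depends on size plus noise,
# while door_color is unrelated to price
rng = np.random.default_rng(0)
size = rng.uniform(50, 300, 400)
door_color = rng.integers(0, 5, 400)
price = 5000 * size + rng.normal(0, 100_000, 400)
features = np.column_stack([size, door_color])

# Hold out part of the rows; the toy_ prefix avoids overwriting notebook variables
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    features, price, random_state=0
)

toy_model = DecisionTreeRegressor(random_state=0).fit(toy_train_X, toy_train_y)

in_sample_mae = mean_absolute_error(toy_train_y, toy_model.predict(toy_train_X))
validation_mae = mean_absolute_error(toy_val_y, toy_model.predict(toy_val_X))

print(f"In-sample MAE:  {in_sample_mae:,.0f}")
print(f"Validation MAE: {validation_mae:,.0f}")
```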

# Coding it

"train_test_split" is a function that breaks the data into two pieces. We use one piece as training data to fit the model, and the other as validation data to calculate the mean_absolute_error.
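
Before applying it to the housing data, here is a small sketch of how `train_test_split` divides the rows. The arrays are toy values invented for illustration. By default 25% of the rows are held out, and the `test_size` argument changes that fraction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays invented for illustration: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Default behaviour: 25% of the rows go to the second (validation) piece
a_X, b_X, a_y, b_y = train_test_split(X_toy, y_toy, random_state=0)
print(a_X.shape, b_X.shape)  # (75, 2) (25, 2)

# test_size sets a different validation fraction
c_X, d_X, c_y, d_y = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(c_X.shape, d_X.shape)  # (80, 2) (20, 2)
```

Features and target are shuffled together, so each row of the feature pieces stays paired with its own target value.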

In [660]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model (no random_state is set here, so the MAE below may vary slightly between runs)
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

274597.61846352485


Our in-sample MAE was about 1,100 dollars, but the out-of-sample MAE is about 275,000 dollars, more than 200 times larger. This model is not so good. There are many ways to improve it, such as experimenting to find better features or different model types.

# Underfitting and Overfitting

Overfitting occurs when a machine learning model becomes too complex or too closely fitted to the training data. In other words, the model learns the noise and random fluctuations in the training data, rather than the underlying patterns or relationships that would enable it to make accurate predictions on new, unseen data.

Underfitting occurs when a model is too simple to capture the underlying patterns in the data. In this case, the model fails to learn the relationships between the features and the labels, resulting in low accuracy and poor performance, both on the training data and new data.

The function below helps us choose the number of leaf nodes in the decision tree by returning the validation MAE for a given value of max_leaf_nodes.

In [661]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Create a decision tree regression model with the specified max_leaf_nodes
    model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 0)
    
    # Fit the model on the training data
    model.fit(train_X, train_y)
    
    # Generate predictions on the validation data
    preds_val = model.predict(val_X)
    
    # Calculate the mean absolute error between the true and predicted values
    mae = mean_absolute_error(val_y, preds_val)
    
    # Return the computed MAE
    return mae


We can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.

In [662]:
# compare MAE with differing values of max_leaf_nodes
max_leaf_nodes_list = [5, 50, 500, 5000]

for max_leaf_nodes in max_leaf_nodes_list:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes:{max_leaf_nodes}  \t    Mean absolute error :{my_mae}")


Max leaf nodes:5  	    Mean absolute error :385696.54278937966
Max leaf nodes:50  	    Mean absolute error :279794.61143891385
Max leaf nodes:500  	    Mean absolute error :261718.1134423186
Max leaf nodes:5000  	    Mean absolute error :271320.97310092533


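Of the four values tried, 500 leaf nodes gives the lowest validation MAE (about 261,718 dollars). The much higher error with 5 leaves is consistent with underfitting, and the increase from 500 to 5,000 leaves is consistent with overfitting.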
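
Picking the best size can also be done programmatically instead of by eye. The sketch below is one way to do it. The dictionary values are the validation MAEs printed above, rounded to two decimals, and `scores` and `best_tree_size` are names introduced here for illustration.

```python
# Validation MAE for each max_leaf_nodes value, copied (rounded) from the output above
scores = {5: 385696.54, 50: 279794.61, 500: 261718.11, 5000: 271320.97}

# min over the keys, ranked by their MAE, returns the tree size with the lowest error
best_tree_size = min(scores, key=scores.get)

print(best_tree_size)  # 500
```

In the notebook itself, the same dictionary could be built from `get_mae` with a dict comprehension over `max_leaf_nodes_list`, which avoids copying numbers by hand.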

# Final Model

The best tree size has been determined. If we were to deploy this model in practice, we would likely make it more accurate by fitting it on all of the data while keeping that tree size. In other words, we don't need to hold out the validation data now that we have made all our modeling decisions.

In [663]:

# Combine the training and validation data
combined_train_X = pd.concat([train_X, val_X], ignore_index=True)
combined_train_y = pd.concat([train_y, val_y], ignore_index=True)

# Create the final model with the best tree size found above
# (random_state is fixed, as in get_mae, so that the fit is reproducible)
final_model = DecisionTreeRegressor(max_leaf_nodes=500, random_state=0)

# Fit the final model on the combined training data
final_model.fit(combined_train_X, combined_train_y)