# The First Machine Learning Model

Building my first model. Hurray!

## Selecting Data for Modeling

The dataset has too many variables to wrap our heads around, or even to print out nicely. How can we pare down this overwhelming amount of data to something we can understand?

We'll start by picking a few variables using our intuition. Later courses will show you statistical techniques to automatically prioritize variables. 

To choose variables/columns, we'll need to see a list of all columns in the dataset. That is done with the **columns** property of the **DataFrame** (*the bottom line of code below*).

In [1]:
import pandas as pd


melbourne_path = '/Users/jeanzayas/Desktop/Divergence/DATA ANALYSIS/Learning/ML kaggle learn/Data_ML_try/Melbourne Housing/melb_data.csv'

melbourne_data = pd.read_csv(melbourne_path)
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

The Melbourne data has some missing values (some houses for which some variables weren't recorded).

We'll learn to handle the missing values later.

The Iowa data doesn't have missing values in the columns you use.

So we will take the simplest option for now and drop houses with missing values from our data.

Don't worry about this much for now, though the code is:


In [2]:
# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
melbourne_data

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.00 | 1900.0 | Yarra | -37.80790 | 144.99340 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.00 | 1900.0 | Yarra | -37.80930 | 144.99440 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.00 | 2014.0 | Yarra | -37.80720 | 144.99410 | Northern Metropolitan | 4019.0 |
| 6 | Abbotsford | 124 Yarra St | 3 | h | 1876000.0 | S | Nelson | 7/05/2016 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 245.0 | 210.00 | 1910.0 | Yarra | -37.80240 | 144.99930 | Northern Metropolitan | 4019.0 |
| 7 | Abbotsford | 98 Charles St | 2 | h | 1636000.0 | S | Nelson | 8/10/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 256.0 | 107.00 | 1890.0 | Yarra | -37.80600 | 144.99540 | Northern Metropolitan | 4019.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12205 | Whittlesea | 30 Sherwin St | 3 | h | 601000.0 | S | Ray | 29/07/2017 | 35.5 | 3757.0 | ... | 2.0 | 1.0 | 972.0 | 149.00 | 1996.0 | Whittlesea | -37.51232 | 145.13282 | Northern Victoria | 2170.0 |
| 12206 | Williamstown | 75 Cecil St | 3 | h | 1050000.0 | VB | Williams | 29/07/2017 | 6.8 | 3016.0 | ... | 1.0 | 0.0 | 179.0 | 115.00 | 1890.0 | Hobsons Bay | -37.86558 | 144.90474 | Western Metropolitan | 6380.0 |
| 12207 | Williamstown | 2/29 Dover Rd | 1 | u | 385000.0 | SP | Williams | 29/07/2017 | 6.8 | 3016.0 | ... | 1.0 | 1.0 | 0.0 | 35.64 | 1967.0 | Hobsons Bay | -37.85588 | 144.89936 | Western Metropolitan | 6380.0 |
| 12209 | Windsor | 201/152 Peel St | 2 | u | 560000.0 | PI | hockingstuart | 29/07/2017 | 4.6 | 3181.0 | ... | 1.0 | 1.0 | 0.0 | 61.60 | 2012.0 | Stonnington | -37.85581 | 144.99025 | Southern Metropolitan | 4380.0 |
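
To see what `dropna(axis=0)` does on a small scale, here is a minimal sketch. The three-row `toy` DataFrame and its values are made up for illustration and are not part of the Melbourne file. It also shows why the row labels in the output above skip numbers: the surviving rows keep their original labels.

```python
import pandas as pd

# Hypothetical mini-dataset: the second house has no recorded BuildingArea
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, None, 142.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

# axis=0 drops rows (houses) that contain at least one missing value
cleaned = toy.dropna(axis=0)

print(len(toy), "rows before,", len(cleaned), "rows after")
print(cleaned.index.tolist())  # original row labels are kept, so gaps appear
```

Dropping rows is the simplest option, but it can discard a lot of data when many columns have gaps.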


There are many ways to select a subset of your data. The **Pandas course** covers these in more depth, but we will focus on two approaches for now. 

1. Dot notation, which we use to select the "prediction target"
2. Selecting with a column list, which we use to select the "features"
   
### Selecting The Prediction Target

You can pull out a variable with **dot-notation**. This single column is stored in a **Series**, which is broadly like a DataFrame with only a single column of data. 

We'll use the dot notation to select the column we want to predict, which is called the **prediction target**. By convention, the prediction target is called **y**. So the code we need to save the house prices in the Melbourne data is:

In [3]:
y = melbourne_data.Price

### Choosing "Features"

The columns that are inputted into our model (and later used to make predictions) are called "features". In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features. 

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features. 

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

Here is an example:

In [4]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']

By convention, this data is called **X**.

In [5]:
X = melbourne_data[melbourne_features]
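
The two selection styles return different types, which the following sketch makes visible. The `toy` DataFrame is a made-up stand-in, not the real data. It also includes a column named `1stFlrSF` (a name that appears in the Iowa data) to show a limit of dot notation: a name that starts with a digit is not a valid Python attribute, so it has to be selected with brackets.

```python
import pandas as pd

# Hypothetical stand-in data for illustration only
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "1stFlrSF": [856, 1262, 920],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

y_toy = toy.Price                    # dot notation returns a Series
X_toy = toy[["Rooms", "1stFlrSF"]]   # a list of names returns a DataFrame
first_floor = toy["1stFlrSF"]        # bracket form works for any column name

print(type(y_toy).__name__, type(X_toy).__name__, type(first_floor).__name__)
```

Writing `toy.1stFlrSF` would be a syntax error, so the bracket form is the safer habit when column names are not plain identifiers.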

Let's quickly review the data we'll be using to predict house prices using the describe method and the head method, which shows the top few rows.

In [6]:
X.describe()

| | Rooms | Bathroom | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude |
|---|---|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | 141.568645 | 1964.081988 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 90.834824 | 38.105673 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | 0.0 | 1196.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | 91.0 | 1940.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | 124.0 | 1970.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | 170.0 | 2000.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | 3112.0 | 2018.0 | -37.45709 | 145.52635 |


In [7]:
X.head()

| | Rooms | Bathroom | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | 79.0 | 1900.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | 150.0 | 1900.0 | -37.8093 | 144.9944 |
| 4 | 4 | 1.0 | 120.0 | 142.0 | 2014.0 | -37.8072 | 144.9941 |
| 6 | 3 | 2.0 | 245.0 | 210.0 | 1910.0 | -37.8024 | 144.9993 |
| 7 | 2 | 1.0 | 256.0 | 107.0 | 1890.0 | -37.806 | 144.9954 |


Visually checking your data with these commands is an important part of a data scientist's job. You'll frequently find surprises in the dataset that deserve further inspection. For example, the minimum **YearBuilt** reported above is 1196, which is worth a second look.


--------------------------------------------------

## Building Your Model

I will use the **scikit-learn** library to create my models. When coding, this library is written as **sklearn**, as you will see in the sample code. **Scikit-learn** is easily the most popular library for modeling the types of data typically stored in **DataFrames**.

The steps to building the model are:
 
   1. **Define**: What type of model will it be? A decision tree? Some other type of model?
   *Some other parameters of the model type are specified too.*
   
   2. **Fit**: Capture patterns from provided data. This is the heart of modeling. 
   3. **Predict**: Just what it sounds like
   4. **Evaluate**: Determine how accurate the model's predictions are. 
   
Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable. 

In [8]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run

melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

Many machine learning models allow some randomness in model training. Specifying a number for <mark>random_state</mark> ensures that I get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.
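
As a quick check of that reproducibility claim, the sketch below fits two trees with the same `random_state` on the same data and compares their predictions. The data is synthetic (generated with NumPy purely for illustration), and the names `X_demo`, `y_demo`, `model_a` and `model_b` are placeholders rather than anything from the Melbourne notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, an assumption made only for this illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo[:, 0] * 3 + rng.normal(0, 1, size=200)

# Two separate models, same seed, same training data
model_a = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
model_b = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)

same = np.array_equal(model_a.predict(X_demo), model_b.predict(X_demo))
print("Identical predictions:", same)
```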

We now have a fitted model that we can use to make predictions.

In practice, I'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But I'll make predictions for the first few rows of the training data to see how the predict function works. 

In [9]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  BuildingArea  YearBuilt  Lattitude  Longtitude
1      2       1.0     156.0          79.0     1900.0   -37.8079    144.9934
2      3       2.0     134.0         150.0     1900.0   -37.8093    144.9944
4      4       1.0     120.0         142.0     2014.0   -37.8072    144.9941
6      3       2.0     245.0         210.0     1910.0   -37.8024    144.9993
7      2       1.0     256.0         107.0     1890.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]


## RECAP

So far, I have loaded the data and reviewed it with the following code. 

In [10]:
# Code you have previously used to load data

import pandas as pd

# Path of the file to read 
iowa_file = '/Users/jeanzayas/Desktop/Divergence/DATA ANALYSIS/Learning/ML kaggle learn/Data_ML_try/Housing Prices/train.csv'

iowa_data = pd.read_csv(iowa_file)

print("Setup Complete")

Setup Complete


## Step 1: Specify Prediction Target

Select the target variable, which corresponds to the sales price. Save this to a new variable called <mark>y</mark>. You'll need to print a list of the columns to find the name of the column you need.


In [11]:
# print a list of columns in the dataset to find the name of the prediction target

iowa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [12]:
y = iowa_data.SalePrice

## Step 2: Create X 

Now I will create a **DataFrame** called <mark>**X**</mark>.

Since we only want some columns from the original data, first we will create a list with the names of the columns we want in <mark>**X**</mark>.

Use just the following columns in the list:

  - LotArea
  - YearBuilt
  - 1stFlrSF
  - 2ndFlrSF
  - FullBath
  - BedroomAbvGr
  - TotRmsAbvGrd
 
After we've created that list of features, use it to create the **DataFrame** that we'll use to fit the model.

In [13]:
# Create this list of features below

feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Select data corresponding to features in feature_names

X = iowa_data[feature_names]



## Review Data

Before building a model, take a quick look at <mark>**X**</mark> to verify it looks sensible.

In [14]:
# Review data

# print description or statistics from X

X.describe()

| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 10516.828082 | 1971.267808 | 1162.626712 | 346.992466 | 1.565068 | 2.866438 | 6.517808 |
| std | 9981.264932 | 30.202904 | 386.587738 | 436.528436 | 0.550916 | 0.815778 | 1.625393 |
| min | 1300.0 | 1872.0 | 334.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 7553.5 | 1954.0 | 882.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| 50% | 9478.5 | 1973.0 | 1087.0 | 0.0 | 2.0 | 3.0 | 6.0 |
| 75% | 11601.5 | 2000.0 | 1391.25 | 728.0 | 2.0 | 3.0 | 7.0 |
| max | 215245.0 | 2010.0 | 4692.0 | 2065.0 | 3.0 | 8.0 | 14.0 |


## Step 3: Specify and Fit Model

Create a <mark>DecisionTreeRegressor</mark> and save it as iowa_model. Ensure you've done the relevant import from sklearn to run this command.

Then fit the model you just created using the data in <mark>X</mark> and <mark>y</mark> that you saved above.

In [15]:
from sklearn.tree import DecisionTreeRegressor
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

## Step 4: Make Predictions

Make predictions with the model's <mark>*predict*</mark> command using <mark>X</mark> as the data. Save the results to a variable called <mark>predictions</mark>.

In [16]:
predictions = iowa_model.predict(X)

print(predictions)

[208500. 181500. 223500. ... 266500. 142125. 147500.]


# __________________

# Model Validation 

Measure the performance of your model, so you can test and compare alternatives. 

You've built a model. But how good is it? 

In this lesson, you will learn to use model validation to measure the quality of your model. Measuring model quality is the key to iteratively improving your models.

### What is Model Validation 

You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data. You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.

We first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.

There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error** (also called **MAE**). Let's break down this metric starting with the last word, error.

The prediction error for each house is:
$$\text{error} = \text{actual} - \text{predicted}$$


So, if a house cost 150,000 dollars and you predicted it would cost 100,000 dollars, the error is 50,000 dollars.

With the **MAE** metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as

- *On average, our predictions are off by about X.*
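
The MAE calculation can be worked through by hand in a few lines. The three price pairs below are made up for illustration; the first pair reuses the 150,000 versus 100,000 dollar example from the text.

```python
# Hypothetical actual and predicted prices for three houses
actual = [150000, 200000, 120000]
predicted = [100000, 210000, 120000]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Absolute value makes every error positive, then we average them
abs_errors = [abs(e) for e in errors]
mae = sum(abs_errors) / len(abs_errors)

print(errors)  # [50000, -10000, 0]
print(mae)     # 20000.0
```

Without the absolute value, the positive and negative errors would partly cancel and make the model look better than it is.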
 
To calculate **MAE**, we first need a model. 

In [17]:
# Fit model (X and y were reassigned in the recap above, so they now hold the Iowa data)
melbourne_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

Once we have a model, here is how we calculate the mean absolute error:

In [18]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

62.35433789954339

### The Problem with "In-Sample" Scores

The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.

Imagine that, in the large real estate market, door color is unrelated to home price.

However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data.

But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on data it hasn't seen before. This data is called **validation data**.
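
The gap between in-sample and validation scores can be illustrated with a deliberately extreme toy model. The sketch below is not a decision tree: it is a hypothetical "memorizer" that predicts the price of the training house with the closest size. All sizes and prices are invented for the example.

```python
# Hypothetical (size, price) pairs, invented for illustration
train = [(50, 300000), (80, 480000), (120, 700000), (200, 1250000)]
validation = [(60, 370000), (105, 620000), (150, 910000)]

def predict(size):
    """Return the price of the training house whose size is closest."""
    nearest = min(train, key=lambda house: abs(house[0] - size))
    return nearest[1]

def mae(rows):
    """Mean absolute error of predict() over (size, price) rows."""
    return sum(abs(price - predict(size)) for size, price in rows) / len(rows)

print(mae(train))       # 0.0, every training house is its own nearest match
print(mae(validation))  # 120000.0
```

A perfect in-sample score here says nothing about quality, because the model is only repeating what it was shown. The validation score is the one that reflects how it behaves on new houses.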

### Coding It

The scikit-learn library has a function train_test_split to break up the data into two pieces. We'll use some of that data as training data to fit the model, and we'll use the other data as validation data to calculate mean_absolute_error.

Here is the code:

In [19]:
from sklearn.model_selection import train_test_split

# Split data into training and validation data, for both features and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Fit the model on the training data only
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

29652.931506849316
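
It can help to see how many rows end up on each side of the split. When neither `test_size` nor `train_size` is given, `train_test_split` holds out a quarter of the rows. The sketch below uses synthetic arrays with 1460 rows (the row count reported for the Iowa data) rather than the real file, and the `_demo` names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: row i of X_demo is [2*i, 2*i + 1] and its target is i
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)

print(len(train_X_demo), len(val_X_demo))  # 1095 365

# Features and targets are shuffled together, so rows stay aligned
print(np.array_equal(train_X_demo[:, 0], 2 * train_y_demo))  # True
```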


### Conclusion

**Wow!**

Our mean absolute error for the in-sample data was about 62 dollars. Out-of-sample it's almost 30,000 dollars.

This is the difference between a model that is almost exactly right and one whose predictions are off by tens of thousands of dollars. Note that **X** and **y** were reassigned to the Iowa data in the recap above, so both scores describe the Iowa homes. The 1.1 million dollar average home value appears to describe the Melbourne validation data, so it is not the right point of reference here; compare the error with the average of **val_y** instead.

There are many ways to improve this model, such as experimenting to find better features or different model types. 

## RECAP

Let's test how good the model is. 

In [20]:
print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]


### Step 1: Split your data

Use the train_test_split function to split up your data. 

Give it the argument random_state=1 so the check functions know what to expect when verifying your code.

Recall, your features are loaded in the DataFrame **X** and your target is loaded in **y**.

In [21]:
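# Import the train_test_split function and split the data
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
print(train_X, val_X, train_y, val_y.head())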

      LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
6       10084       2004      1694         0         2             3   
807     21384       1923      1072       504         1             3   
955      7136       1946       979       979         2             4   
1040    13125       1957      1803         0         2             3   
701      9600       1969      1164         0         1             3   
...       ...        ...       ...       ...       ...           ...   
715     10140       1974      1350         0         2             3   
905      9920       1954      1063         0         1             3   
1096     6882       1914       773       582         1             3   
235      1680       1971       483       504         1             2   
1061    18000       1935       894         0         1             2   

      TotRmsAbvGrd  
6                7  
807              6  
955              8  
1040             8  
701              6  
...      

### Step 2: Specify and Fit the Model

Create a DecisionTreeRegressor model and fit it to the relevant data. Set random_state to 1 again when creating the model. 

We imported **DecisionTreeRegressor** in your last exercise and that code has been copied to the setup code above. So, no need to import it again.

In [22]:
# Specify the model

iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)

# Check your answer
print(iowa_model)

DecisionTreeRegressor(random_state=1)


### Step 3: Make Predictions with Validation data


In [23]:
# Predict with all validation observations 

val_predictions = iowa_model.predict(val_X)

In [24]:
import numpy as np

# print the top few validation predictions
print(" The top few validation predictions are:", 
      val_predictions[0],', '
     ,val_predictions[1],', '
     ,val_predictions[2],', '
     ,val_predictions[3],', '
     ,val_predictions[4]
     )
# np.sort(y)[:6] returns the six lowest prices in all of y, not the validation
# rows; val_y.head() would give the actual prices that match the predictions above.
print("The six lowest sale prices in y: ", np.sort(y, kind="quicksort")[:6])

 The top few validation predictions are: 186500.0 ,  184000.0 ,  130000.0 ,  92000.0 ,  164500.0
The six lowest sale prices in y:  [34900 35311 37900 39300 40000 52000]


What do you notice that is different from what you saw with in-sample predictions (which are printed after the top code cell in this page)?

Do you remember why validation predictions differ from in-sample (or training) predictions? This is an important idea from the last lesson.

### Step 4: Calculate the Mean Absolute Error in Validation Data

In [25]:
from sklearn.metrics import mean_absolute_error

val_mae = mean_absolute_error(val_y, val_predictions)

print(val_mae)

29652.931506849316


Is that **MAE** good? There isn't a general rule for what values are good that applies across applications. But you'll see how to use (and improve) this number in the next step.
# _____________________

# Underfitting and Overfitting 

Fine-tune the model for better performance

At the end of this step, we will understand the concepts of underfitting and overfitting, and we'll be able to apply these ideas to make our models more accurate.

## Experimenting With Different Models

Now that I have a reliable way to measure model accuracy, we can experiment with alternative models and see which gives the best predictions. But what alternatives do I have for models?

We can see in ***scikit-learn***'s [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) that the decision tree model has many options (more than you'll want or need for a long time). The most important options determine the tree's depth. Recall from [the first lesson in this course](https://www.kaggle.com/code/dansbecker/how-models-work/tutorial) that a tree's depth is a measure of how many splits it makes before coming to a prediction. 

<figure> 
    <img src="http://i.imgur.com/R3ywQsR.png" alt="Shallow Tree"/>
    <figcaption>Shallow Tree</figcaption>
</figure>

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2<sup>10</sup> groups of houses by the time we get to the 10<sup>th</sup> level. That's 1024 leaves.
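
The doubling argument is easy to tabulate. The sketch below assumes, as the paragraph does, that every group is split in two at each level. The dataset size of 10,000 houses is a hypothetical figure chosen only to show how quickly the leaves thin out.

```python
def leaves_at_depth(depth):
    """Number of leaves if every group is split in two at each level."""
    return 2 ** depth

n_houses = 10_000  # hypothetical dataset size, not taken from either dataset

for depth in [1, 2, 3, 10]:
    leaves = leaves_at_depth(depth)
    print(f"depth {depth}: {leaves} leaves, about {n_houses / leaves:.1f} houses per leaf")
```

At depth 10 each leaf would hold fewer than ten houses on average under these assumptions, which is why deep trees can track the training data so closely.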

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called **overfitting**, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups. 

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called **underfitting**.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below. 



## Mean Absolute Error
<figure>
<img src="http://i.imgur.com/AXSEOfI.png" alt="MAE example"/>

</figure>


## Example 

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the *max_leaf_nodes* argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.
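
To confirm that *max_leaf_nodes* really caps the size of the tree, the sketch below fits trees with different caps and reads back the leaf count with `get_n_leaves()`. The data is synthetic and the `_demo` names are placeholders, so the exact counts say nothing about the housing data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, an assumption made only for this illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 1, size=300)

leaf_counts = {}
for cap in [5, 50, None]:  # None means no limit on the number of leaves
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0)
    tree.fit(X_demo, y_demo)
    leaf_counts[cap] = tree.get_n_leaves()

print(leaf_counts)
```

With no cap, the tree keeps splitting until it can no longer separate the training rows, which is the overfitting end of the curve.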

We can use a utility function to help compare **MAE** scores from different values for *max_leaf_nodes*:

In [26]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)

The data is loaded into **train_X, val_X, train_y** and **val_y** using the code you've already seen (and which you've already written).

We can use a for-loop to compare the accuracy of models built with different values for *max_leaf_nodes*. 


In [27]:
# compare MAE with differing values of max_leaf_nodes

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5 		 Mean Absolute Error: 35044
Max leaf nodes: 50 		 Mean Absolute Error: 27405
Max leaf nodes: 500 		 Mean Absolute Error: 29454
Max leaf nodes: 5000 		 Mean Absolute Error: 30139


Of the options listed, 50 is the optimal number of leaves. It gives the least amount of error. 

________

## Conclusion 

Here's the takeaway: Models can suffer from either:

- **Overfitting**: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or 

- **Underfitting**: failing to capture relevant patterns, again leading to less accurate predictions.

We use **validation** data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one. 


# _________

### Compare Different Tree Sizes

Write a loop that tries each value of *max_leaf_nodes* from a set of candidate values.

Call the *get_mae* function on each value of *max_leaf_nodes*. Store the output  in some way that allows you to select the value of *max_leaf_nodes* that gives the most accurate model on your data. 

In [28]:
# in this case the following loop is another way to produce the same results as above.

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

# Write a loop to find the ideal tree size from candidate_max_leaf_nodes
# Here is a short solution with a dict comprehension

scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}

# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)

best_tree_size = min(scores, key=scores.get)
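
The `min(scores, key=scores.get)` idiom may look unusual at first: iterating a dict yields its keys, and `key=scores.get` tells `min` to rank those keys by their stored values. The sketch below reuses the MAE values printed by the earlier for-loop; the name `scores_demo` is a placeholder.

```python
# MAE values printed by the earlier for-loop, keyed by max_leaf_nodes
scores_demo = {5: 35044, 50: 27405, 500: 29454, 5000: 30139}

# min() walks the keys and compares them by their MAE
best = min(scores_demo, key=scores_demo.get)

print(best)               # 50
print(scores_demo[best])  # 27405
```

Calling `min(scores_demo)` without the `key` argument would return the smallest key (5), not the key with the smallest error.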

### Fit Model Using All Data

You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.

In [29]:
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# Fit the model
final_model.fit(X, y)

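# Note: val_X is part of X, so the model has already seen these rows during
# fitting. This MAE is an in-sample score, not a true validation score.
predict_final = final_model.predict(val_X)
maer = mean_absolute_error(val_y, predict_final)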

print(maer)

16815.938748057826
