**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/dansbecker/model-validation).**

---


## Recap
You've built a model. In this exercise you will test how good your model is.

Run the cell below to set up your coding environment where the previous exercise left off.

In [1]:
# Code you have previously used to load data
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = './data/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

# Set up code checking
print("Setup Complete")

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]
Setup Complete


# Exercises

## Step 1: Split Your Data
Use the `train_test_split` function to split up your data.

Give it the argument `random_state=1` so the `check` functions know what to expect when verifying your code.

Recall, your features are loaded in the DataFrame **X** and your target is loaded in **y**.
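If it helps to see what `train_test_split` returns before using it on the housing data, here is a minimal sketch on made-up data. The names `toy_X`, `toy_y`, and the `toy_*` outputs are hypothetical and are not part of the exercise. It assumes the scikit-learn default of holding out 25% of the rows when no `test_size` is given.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X and y (this is NOT the Iowa data)
toy_X = pd.DataFrame({"LotArea": range(100, 120), "YearBuilt": range(1990, 2010)})
toy_y = pd.Series(range(20), name="SalePrice")

# With no test_size given, scikit-learn holds out 25% of the rows by default
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=1
)
print(len(toy_train_X), len(toy_val_X))

# Features and target stay aligned: the same rows land in each piece
print(toy_train_X.index.equals(toy_train_y.index))

# Repeating the call with the same random_state reproduces the same split
_, toy_val_X_again, _, _ = train_test_split(toy_X, toy_y, random_state=1)
print(toy_val_X.index.equals(toy_val_X_again.index))
```

Fixing `random_state` is what lets the `check` functions (and you) get the same rows in each piece on every run.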


In [2]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the features and target into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

## Step 2: Specify and Fit the Model

Create a `DecisionTreeRegressor` model and fit it to the relevant data.
Set `random_state` to 1 again when creating the model.

In [3]:
# DecisionTreeRegressor was imported in the setup code above,
# so there is no need to import it again

# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data
iowa_model.fit(train_X, train_y)

# Check your answer
iowa_model

DecisionTreeRegressor(random_state=1)

## Step 3: Make Predictions with Validation Data


In [4]:
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

# Check your answer
val_predictions

array([186500., 184000., 130000.,  92000., 164500., 220000., 335000.,
       144152., 215000., 262000., 180000., 121000., 175900., 210000.,
       248900., 131000., 100000., 149350., 235000., 156000., 149900.,
       265979., 193500., 377500., 100000., 162900., 145000., 180000.,
       582933., 146000., 140000.,  91500., 112500., 113000., 145000.,
       312500., 110000., 132000., 305000., 128000., 162900., 115000.,
       110000., 124000., 215200., 180000.,  79000., 192000., 282922.,
       235000., 132000., 325000.,  80000., 237000., 208300., 100000.,
       120500., 162000., 153000., 187000., 185750., 335000., 129000.,
       124900., 185750., 133700., 127000., 230000., 146800., 157900.,
       136000., 153575., 335000., 177500., 143000., 202500., 168500.,
       105000., 305900., 192000., 190000., 140200., 134900., 128950.,
       213000., 108959., 149500., 190000., 175900., 160000., 250580.,
       157000., 120500., 147500., 118000., 117000., 110000., 130000.,
       148500., 1480

Inspect your predictions and actual values from validation data.

In [5]:
# Print the first few validation predictions
print(iowa_model.predict(val_X.head()))
# Print the first few actual prices from the validation data
print(val_y.head().tolist())

[186500. 184000. 130000.  92000. 164500.]
[231500, 179500, 122000, 84500, 142000]


What do you notice that is different from what you saw with in-sample predictions (which are printed after the top code cell on this page)?

Do you remember why validation predictions differ from in-sample (or training) predictions? This is an important idea from the last lesson.
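To make the contrast concrete, here is a small sketch on synthetic data rather than the housing data. All `demo_*` names and the noisy linear target are assumptions made for illustration. An unconstrained decision tree can keep splitting until it reproduces every training target, so its in-sample error is close to zero, while rows it has never seen expose the real error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: one feature and a noisy linear target
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(200, 1))
demo_y = 3 * demo_X[:, 0] + rng.normal(0, 2, size=200)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=1
)

# No depth limit, so the tree is free to memorize the training rows
demo_model = DecisionTreeRegressor(random_state=1)
demo_model.fit(demo_train_X, demo_train_y)

# Error on rows the model was fit on versus rows it has never seen
in_sample_mae = mean_absolute_error(demo_train_y, demo_model.predict(demo_train_X))
validation_mae = mean_absolute_error(demo_val_y, demo_model.predict(demo_val_X))
print("In-sample MAE:", in_sample_mae)
print("Validation MAE:", validation_mae)
```

The gap between the two numbers is the reason to judge a model on validation data instead of the data it was trained on.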

## Step 4: Calculate the Mean Absolute Error in Validation Data


In [6]:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)
# Print the validation MAE
print(val_mae)

29652.931506849316
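As a sanity check on what `mean_absolute_error` computes, the sketch below works through the five validation predictions and actual prices printed in Step 3: take the absolute difference for each home, then average. The variable names are hypothetical, and five homes are only a small sample, so the result is not expected to match the MAE over the full validation set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First five validation predictions and actual prices, copied from the Step 3 output
sample_predictions = np.array([186500, 184000, 130000, 92000, 164500])
sample_actuals = np.array([231500, 179500, 122000, 84500, 142000])

# MAE is the mean of |actual - predicted| over all rows
absolute_errors = np.abs(sample_actuals - sample_predictions)
manual_mae = absolute_errors.mean()
sklearn_mae = mean_absolute_error(sample_actuals, sample_predictions)

print(absolute_errors)  # per-home errors: 45000, 4500, 8000, 7500, 22500
print(manual_mae)       # 17500.0
print(sklearn_mae)      # same value as the manual calculation
```

Because the errors are absolute values, over- and under-predictions do not cancel out, and the result is in the same units as the target (here, dollars).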
