**This notebook was an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.**

---


# Introduction
## Purpose
The "Intro to Machine Learning" tutorial's purpose is to **learn the core ideas in machine learning, to build your first models, and to submit predictions to a Kaggle competition.**

The purpose of this notebook is to give a "one-page" summary of what I've learned.

## Context
>Your cousin has made millions of dollars speculating on real estate. He's offered to become business partners with you because of your interest in data science. He'll supply the money, and you'll supply models that predict how much various houses are worth.

## Concept Model of Machine Learning
![](http://www.plantuml.com/plantuml/png/NP9FYzim4CNl_XJ3zh1sMvkUGrYMGaeB_PF3mkx14BJsE16L9QCP1zBIxzuPAQwtUuhUQ7vldYQ-9pQHvz4L9ziZT3Ps3YaB72U-m8ZZCqOgYjllePUhpXaYk6azm3SfE83Mtu0X69FwNTGG9hQZ_OMnh42a2qI7OVOTs-3BglYpU2I-zViGUGZEXglD58QbunCQdYEsVkUFrYD6wu-fQy2bvI4QwwKClM6JxlGWETx1bIPu0b4F9XwHuLBKjQYfRoAQ_j3HkQn4izeS68aFD3dBQyvrHElLcf1rJ2PapiBAaELuMTdTsRZPwCl_fwMihFuAcGylkHyJneGPbq4emqHLOkMGM4qhy0h1taGp8cCKePsJQaX_oYFQuua97a74Hsi82PvNzFD51awly5Cg2Fu6lgA9QSsI2aL_yPQjST05trlDIR2QxIFsjikFbiu18_eEisUOUPevvMyuvFqBMTpk-YDIaVv-U5lNr-zf0s4Gdk2k612ahBkGAiyUjPYxf7y9cwzW--9ckN3wXXMgeswalbiYRM0qFv5Qty2K7r3c3LV_aAsuwVOtqX4w93NJGAlNadecTO9ci5m-lRzKiCsDaYT_9Aiy6mL6v6WxokMc_50rEgcuHz_Fe_iB)

This diagram includes key concepts used throughout the tutorial and throughout the article [The 7 Steps of Machine Learning](https://towardsdatascience.com/the-7-steps-of-machine-learning-2877d7e5548e).

## Notebook overview
We'll start with a model called the [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree). There are fancier models that give more accurate predictions. But decision trees are easy to understand, and they are the basic building block for some of the best models in data science. We'll also try a [Random forest](https://en.wikipedia.org/wiki/Random_forest) before preparing a submission for competition scoring. The notebook is structured into these parts:
+ Set Up the Environment
+ Prepare the Data
+ Predict with Decision Tree Model
+ Predict with Random Forest Model
+ Prepare a Submission For the Competition

# Set Up the Environment
We'll need these modules and objects:
![](http://www.plantuml.com/plantuml/png/RL3BRiCm3BlxAuoUza3_eOUY7r3i7YWsjnZpmv2K3SMmVv_ZfkXQz1ABf4WfseaIwvoYcOA7TO5TX9m1KjMJJKWZM8nyXbo9ATc9il_ce8fibMVyarB9nKrS4ivAaoA8ittPRXi3cENJqHukI2ZvZO4ZFXWYXM_wK_68Wo32QMiqTtZDf907XUWWDGQz3O1oS6BMT-Ke3pHAYRDY0FFTK6H16YFCfUZiaR8lwL0OejTbZOiaOZUrgIGRIsAhDzLt2uFyuEHhRTgJqe4fmVwOL-lQ-9IbLJ9HHNkzdYeMOcg-f-U5WJWE8phGicIrydVDlmPUrkZdwUdlGRkfCdnmsbtcuRtjvSpcuHzUvHsodrFy0m00)

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

# Instead of data gathering, just check input
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Prepare the Data
## Load and Understand Your Data

In [None]:
pd.options.display.max_rows = None
pd.options.display.max_columns = None

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
# Load data
home_data = pd.read_csv(iowa_file_path)
# Print the list of columns in the dataset to find the name of the prediction target
print(home_data.columns)
home_data.head()

As the `DataFrame.describe()` function reports only numerical columns by default, I use the `report_structure` function to get deeper insight:

In [None]:
def report_structure(df):
    '''
    Function report_structure 
    - takes an input dataframe
    - sets up auxiliary functions
    - returns a new dataframe with description of the input dataframe's columns.
    '''

    descriptors = []
    def myCount(f):
        try:
            return df[f].count()
        except:
            return 'Not applicable'
    
    def cUnique(f):
        try:
            return df[f].nunique()
        except:
            return 'Not applicable'

    def min_item(f):
        try:
            return df[f].min()
        except:
            return 'Not applicable'

    def max_item(f):
        try:
            return df[f].max()
        except:
            return 'Not applicable'

    def top_item(f):
        try:
            return df[f].value_counts().index[0]
        except:
            return 'Not applicable'

    def top_freq(f):
        try:
            return df[f].value_counts().values[0]
        except:
            return 'Not applicable'
    
    def unique(f):
        try:
            l = list(df[f].unique())
            if len(l) < 10:
                return l
            else:
                return l[:2]+['...']+l[-2:]
        except:
            return 'Not applicable'

    for col in list(df.columns):
        descriptors.append([myCount(col), cUnique(col), min_item(col), max_item(col), top_item(col), top_freq(col), unique(col)])
    out = pd.DataFrame.from_records(descriptors, index=df.columns, columns=['count','cUnique','min', 'max', 'top','freq', 'unique'])
    return out


In [None]:
report_structure(home_data)
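
For comparison, pandas can also summarize non-numeric columns on its own when `describe()` is called with `include='all'`. The sketch below uses a small made-up frame (the values are hypothetical, not taken from `train.csv`) to show the difference from the default call; it is an illustration, not a replacement for `report_structure`.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one text column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Default call: only the numeric column is summarized
numeric_only = toy.describe()

# include="all": text columns also get count / unique / top / freq rows
everything = toy.describe(include="all")

print(list(numeric_only.columns))
print(everything.loc[["count", "unique", "top", "freq"], "Street"])
```

Unlike `report_structure`, this built-in summary does not list the unique values themselves, which is one reason to keep a custom helper.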

## Set Up Features and a Target

In [None]:
# Set up a target object and call it y
y = home_data.SalePrice
# Select features and create X
feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_names]
# Review data
# print description or statistics from X
X.describe()

In [None]:
# print the top few lines
X.head()

## Split Features X into Training and Validation Datasets
Split data into training and validation data, for both features and target. The split is based on a random number generator. Supplying a numeric value to the `random_state` argument guarantees we get the same split every time we run this notebook.
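
A minimal sketch of that reproducibility claim, using small made-up arrays (`X_toy` and `y_toy` are hypothetical names, not part of the course data): two calls with the same `random_state` should return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Same random_state on both calls, so the same rows land in each part
a_train, a_val, _, _ = train_test_split(X_toy, y_toy, random_state=1)
b_train, b_val, _, _ = train_test_split(X_toy, y_toy, random_state=1)
same_split = np.array_equal(a_val, b_val)

# With no test_size given, 25% of the rows are held out for validation
print(same_split, len(a_train), len(a_val))
```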

In [None]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
print(f'The original dataframe X with {len(X)} observations has been split with the ratio {round(len(train_X)/len(X)*100)} : {round(len(val_X)/len(X)*100)}.')
print('Shapes of training and validation datasets:',[df.shape for df in [train_X, val_X, train_y, val_y]])

# Predict with Decision Tree Model
## Specify and Fit the Model

In [None]:
# For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit the model
iowa_model.fit(train_X, train_y)

## Make Validation Predictions and Calculate MAE

In [None]:
val_predictions = iowa_model.predict(val_X)
mae = mean_absolute_error(val_y, val_predictions)
print(f'MAE = ${mae:,.2f} and y.mean = ${y.mean():,.2f}, so inaccuracy is {mae/y.mean():.1%} !')
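
MAE (mean absolute error) is the average of the absolute differences between actual and predicted values. As a rough worked example with made-up prices (the numbers below are hypothetical), the hand computation should agree with `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical sale prices and predictions
actual = np.array([200_000, 150_000, 300_000])
predicted = np.array([210_000, 140_000, 270_000])

# MAE is the mean of the absolute errors: (10k + 10k + 30k) / 3
mae_by_hand = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand, mae_sklearn)
```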

## Optimize the Model
There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the `max_leaf_nodes` argument provides a very sensible way **to control overfitting vs underfitting**. The more leaves we allow the model to make, the more we move from underfitting (too few leaves to capture patterns in the data) towards overfitting (so many leaves that the tree memorizes the training data).
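
The trade-off can be sketched on synthetic data. The example below is only an illustration under assumed data (a noisy sine curve, not the Iowa housing data): it compares training and validation MAE for a very small, a moderate, and a very large leaf budget.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: one feature, a smooth signal plus random noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=200)
tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=1)

train_mae, val_mae = {}, {}
for leaves in [2, 10, 500]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=1)
    tree.fit(tr_X, tr_y)
    # Error on the data the tree has seen vs. data it has not seen
    train_mae[leaves] = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae[leaves] = mean_absolute_error(va_y, tree.predict(va_X))
    print(leaves, round(train_mae[leaves], 3), round(val_mae[leaves], 3))
```

With the largest budget the training error drops to essentially zero because every training point can get its own leaf; the validation error is the number to watch when choosing the tree size.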

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

In [None]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
print([str(k)+' : $'+str(round(v)) for k,v in scores.items()])

best_tree_size = min(scores, key=scores.get)
print('With best tree size:', best_tree_size)

iowa_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
mae = mean_absolute_error(val_y, val_predictions)
print(f'MAE = ${mae:,.2f} and y.mean = ${y.mean():,.2f}, so inaccuracy is {mae/y.mean():.1%} !')

## Fit the Final Model
You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.

In [None]:
# Use the best tree size found above and fit the final model on all the data
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
final_model.fit(X, y)
iowa_preds = final_model.predict(val_X)
mae_f = mean_absolute_error(val_y, iowa_preds)
print('There is no sense in validating the final model as')
print(f'MAE = ${mae_f:,.2f} and y.mean = ${y.mean():,.2f}, so inaccuracy is now only {mae_f/y.mean():.1%}. This is a very biased result!!')

You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use Random Forests to improve your models even more.

# Predict with Random Forest Model
The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
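
The averaging can be checked directly on a fitted scikit-learn forest, whose individual trees are exposed through the `estimators_` attribute. The sketch below uses made-up data (`X_toy` and `y_toy` are hypothetical) and a small forest of 10 trees to keep it quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy regression data
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 3))
y_toy = X_toy[:, 0] * 2 + rng.normal(0, 1, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X_toy, y_toy)

# Predict with every fitted tree, then average across trees
per_tree = np.array([tree.predict(X_toy) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should match the manual average
matches = np.allclose(manual_average, forest.predict(X_toy))
print(per_tree.shape, matches)
```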

In [None]:
# Set random_state so the forest gives the same result on every run
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
iowa_preds = forest_model.predict(val_X)
mae_rf = mean_absolute_error(val_y, iowa_preds)
print(f'MAE = ${mae_rf:,.2f} and y.mean = ${y.mean():,.2f}, so inaccuracy is {mae_rf/y.mean():.1%} !')

There is likely room for further improvement, but a random forest is typically a clear improvement over the best single decision tree; compare this MAE with the one printed in the previous section. There are parameters which allow you to change the performance of the Random Forest, much as we changed the maximum number of leaf nodes of the single decision tree. But one of the best features of Random Forest models is that they generally work reasonably well even without this tuning.

So far, you have followed specific instructions at each step of your project. This helped you learn key ideas and build your first model, but now you know enough to try things on your own.

Machine Learning competitions are a great way to try your own ideas and learn more as you independently navigate a machine learning project.

# Prepare a Submission For the Competition

In [None]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state=1)

# Fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(X, y)
# Path to file you will use for predictions
test_data_path = '../input/home-data-for-ml-course/test.csv'

# Read test data file using pandas
test_data = pd.read_csv(test_data_path)

# Create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable
test_X = test_data[feature_names]

# Make predictions which we will submit. 
test_preds = rf_model_on_full_data.predict(test_X)

# The lines below show how to save predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('hpc_submission.csv', index=False)

In [None]:
print('Output shape:', output.shape)
output.head()
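
Before uploading, it can be worth a quick sanity check that the saved file reads back with the two expected columns and no missing predictions. The sketch below does this on a made-up frame written to an in-memory buffer (the ids and prices are hypothetical stand-ins for `test_data.Id` and `test_preds`); the same checks could be applied to the real CSV.

```python
import io
import pandas as pd

# Hypothetical ids and predictions standing in for test_data.Id and test_preds
toy_output = pd.DataFrame({"Id": [1461, 1462, 1463],
                           "SalePrice": [120000.0, 155000.5, 180250.0]})

# index=False keeps the DataFrame's row index out of the file
buffer = io.StringIO()
toy_output.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

missing_predictions = int(reloaded["SalePrice"].isna().sum())
print(list(reloaded.columns), reloaded.shape, missing_predictions)
```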
