# Introduction to Machine Learning

The aim of this notebook is to practice the whole workflow for an introductory machine learning model.

## Steps

1. Extract data to DataFrame
2. Split training and validation data
3. Define function to calculate model error
4. Experiment with hyper-parameters
5. Test ensemble model

## Model components
```
  sklearn
    ├── tree
    │   └── DecisionTreeRegressor
    ├── metrics
    │   └── mean_absolute_error
    ├── model_selection
    │   └── train_test_split
    └── ensemble
        └── RandomForestRegressor

https://tree.nathanfriend.io/?s=(%27opt0s!(%27fancy!true~fullPat2~trailingSlas2)~3(%273%27sklearn*tree4Decis0Tree.metrics4mean_absolute_error*model_select04train_test_split*ensemble4RandomForest.%27)~vers0!%271%27)*%5Cn--%20%20.Regressor*0ion2h!false3source!4*-%014320.-*
```

## Useful pd.DataFrame and pd.Series methods
```
pd.DataFrame
    ├── pd.read_csv(filepath, index_col='colName', parse_dates)  (pandas function that returns a DataFrame)
    ├── select_dtypes(exclude=['dtype'], include=['dtype'])
    ├── drop(['colName'], axis)
    ├── dropna(axis)
    ├── fillna(value/method)
    ├── isnull
    └── notnull 
```
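The methods listed above can be tried on a small table. The following is a minimal sketch; the values are invented for illustration and are not taken from the house-prices data, although the column names mimic it.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    'LotArea': [8450.0, 9600.0, np.nan],
    'Street': ['Pave', 'Pave', 'Grvl'],
    'GarageYrBlt': [2003.0, np.nan, 2001.0],
})

# Keep only the numeric columns
numeric = df.select_dtypes(include=['number'])

# Drop a named column (axis=1 means "columns")
no_street = df.drop(['Street'], axis=1)

# Drop every column that contains at least one missing value
complete_cols = df.dropna(axis=1)

# Replace missing values with a constant
filled = numeric.fillna(0)

# Count missing and non-missing values per column
missing_counts = df.isnull().sum()
present_counts = df.notnull().sum()

print(list(complete_cols.columns))
print(missing_counts)
```

Note that `dropna(axis=1)` is aggressive: a single missing value is enough to remove the whole column.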

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Step 1: Extract data to DataFrame (and explore data).

In [None]:
train_filepath = '../input/house-prices-data/train.csv'
test_filepath = '../input/house-prices-data/test.csv'

# 'Id' is an integer key, not a date, so parse_dates is not needed here
train_data = pd.read_csv(train_filepath, index_col='Id')
test_data = pd.read_csv(test_filepath, index_col='Id')

## Step 2: Separate training and validation data

In [None]:
# Select features
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Drop every column that contains at least one missing value
train_data.dropna(axis=1, inplace=True)

y = train_data.SalePrice
X = train_data[features]

X_test = test_data[features]

# Inspect columns
print(f'Columns: {list(X.columns)}\n')

# Perform train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=1)
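To see what the split produces, here is a minimal sketch on synthetic data. The names `X_demo` and `y_demo` and the random values are assumptions made for illustration; only the `train_test_split` arguments mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house-prices features and target (illustrative only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'LotArea': rng.integers(1000, 20000, size=100),
    'YearBuilt': rng.integers(1900, 2010, size=100),
})
y_demo = pd.Series(rng.integers(50000, 500000, size=100), name='SalePrice')

# Same arguments as in the notebook: 80% training rows, 20% validation rows
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=1
)

print(X_tr.shape, X_va.shape)

# Features and target stay aligned row by row after the split
print(X_tr.index.equals(y_tr.index))
```

Fixing `random_state` makes the split reproducible, so different models are compared on the same validation rows.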

## Step 3: Build model and define model error

In [None]:
def get_mae_DTR(max_leaf_nodes, X_train, X_valid, y_train, y_valid):
    '''
    This function returns the mean absolute error for a given max_leaf_nodes,
    using the DecisionTreeRegressor as the estimator.
    '''
    
    # Define model type
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=2)
    
    # Train the model using training data
    model.fit(X_train, y_train)
    
    # Predict values for the validation data
    preds_y_valid = model.predict(X_valid)
    
    # Calculate mean absolute error (true values first, then predictions)
    return mean_absolute_error(y_valid, preds_y_valid)
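As a quick check on what the metric measures: the mean absolute error is the average of the absolute differences between true and predicted values. The sketch below uses three invented values and compares `mean_absolute_error` with the same calculation done by hand in NumPy.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented values for illustration
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# scikit-learn's signature is mean_absolute_error(y_true, y_pred)
mae_sklearn = mean_absolute_error(y_true, y_pred)

# The same quantity computed by hand: mean of |y_true - y_pred|
mae_manual = np.mean(np.abs(y_true - y_pred))

print(mae_sklearn, mae_manual)
```

The absolute errors here are 10, 10 and 30, so both values equal 50 / 3. Because the differences are taken in absolute value, swapping the two arguments does not change the result, but passing the true values first matches the documented signature.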

## Step 4: Test hyper-parameters

In [None]:
%matplotlib inline

# Generate a range from 2 to 500 (inclusive), in steps of 1
leaf_node_range = range(2, 501)

# Calculate the mean_absolute_error for the validation data
valid_mae = [get_mae_DTR(num_leaf_nodes, X_train, X_valid, y_train, y_valid) for num_leaf_nodes in leaf_node_range]

# Convert data into a pd.DataFrame
results = pd.DataFrame({'leaf_node': leaf_node_range,
                       'valid_mae': valid_mae})

# Look up the row with the lowest validation error
# (idxmin() returns a row label, not a leaf-node count)
best = results.loc[results['valid_mae'].idxmin()]

print(f"Min. mean absolute error: {best['valid_mae']}\nMax. leaf nodes: {int(best['leaf_node'])}")

# Plot the graph (set the style before drawing so that it takes effect)
sns.set_style('whitegrid')
sns.lineplot(x='leaf_node', y='valid_mae', data=results, label='Validation')
sns.scatterplot(x=[best['leaf_node']], y=[best['valid_mae']], color='red', label=f"({int(best['leaf_node'])}, {best['valid_mae']:.0f})")
plt.title('Mean absolute error against max. leaf nodes')
plt.xlabel('Maximum leaf nodes')
plt.ylabel('Mean absolute error')
plt.legend()
plt.show()

From the graph, the optimal number of leaf nodes is 46.
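When reading the optimum from a results table, note that `Series.idxmin()` returns the row label of the minimum, not the hyper-parameter value stored in that row. The sketch below uses invented error values in a hypothetical `results_demo` table to show the difference.

```python
import pandas as pd

# Invented numbers for illustration; leaf_node starts at 2, so labels and values differ
results_demo = pd.DataFrame({
    'leaf_node': [2, 3, 4, 5],
    'valid_mae': [30.0, 25.0, 27.0, 26.0],
})

# Row label of the smallest validation error
best_row = results_demo['valid_mae'].idxmin()

# Hyper-parameter value stored in that row
best_leaf_nodes = results_demo.loc[best_row, 'leaf_node']

print(best_row, best_leaf_nodes)
```

Here the smallest error sits in the row labelled 1, which corresponds to 3 leaf nodes, so the two numbers differ by the offset at which the range starts.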

## Step 5: Test ensemble model

In [None]:
def get_mae_RFR(n_estimators, max_depth, X_train, X_valid, y_train, y_valid):
    '''
    This function returns the mean absolute error for a given n_estimators
    and max_depth, using the RandomForestRegressor as the estimator.
    '''
    
    # Define model type
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=1)
    
    # Train the model using training data
    model.fit(X_train, y_train)
    
    # Predict values for the validation data
    preds_y_valid = model.predict(X_valid)
    
    # Calculate mean absolute error (true values first, then predictions)
    return mean_absolute_error(y_valid, preds_y_valid)
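One possible way to use such a function is to evaluate it over a small grid of `n_estimators` and `max_depth` values and keep the combination with the lowest validation error. The sketch below is self-contained and runs on synthetic data from `make_regression`; the helper `rf_mae`, the grid values and the data are assumptions chosen for illustration, not results from the house-prices dataset.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real data (illustrative only)
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

def rf_mae(n_estimators, max_depth):
    '''Hypothetical helper: validation MAE of a random forest with the given settings.'''
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=1)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va))

# Evaluate every combination in a small, arbitrary grid
scores = {(n, d): rf_mae(n, d) for n, d in product([10, 50], [2, 5, None])}

# Pick the combination with the lowest validation error
best_params = min(scores, key=scores.get)
print(best_params, scores[best_params])
```

A dictionary keyed by `(n_estimators, max_depth)` keeps each score attached to its settings, which avoids confusing a position in a list with a hyper-parameter value.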