
# Reference Notebook: Intermediate Machine Learning

This notebook summarizes the coding techniques covered in the [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) course (both the notes and the exercises) in one place for easier reference. If you find it useful, I have a similar notebook for the follow-up course on [Feature Engineering](https://www.kaggle.com/rsizem2/kaggle-learn-reference-feature-engineering).

In [1]:
# Global variables for testing changes to this notebook quickly
RANDOM_SEED = 0
NUM_FOLDS = 8
SUBMIT = True

## Imports

In [2]:
import time
import os
import warnings
import numpy as np
import pandas as pd 

# Testing, Scoring
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_absolute_error

# Preprocessing
from functools import partial
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder 
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer

# Models 
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Mute warnings
warnings.filterwarnings('ignore')

## Loading Data

In [3]:
# Load training and test data
train = pd.read_csv("../input/home-data-for-ml-course/train.csv") 
test = pd.read_csv("../input/home-data-for-ml-course/test.csv")
    
# Remove rows with missing target
train.dropna(axis=0, subset=['SalePrice'], inplace=True)

# Columns of interest
features = [x for x in train.columns if x not in ['SalePrice','Id']]
categorical = [x for x in features if train[x].dtype == "object"]
numerical = [x for x in features if  train[x].dtype in ['int64', 'float64']]
high_cardinality = [x for x in categorical if train[x].nunique() >= 10]
low_cardinality = [x for x in categorical if x not in high_cardinality]

## Scoring Function

In [4]:
# Function for comparing different preprocessing approaches
def score(preprocessing = None):
    
    # Get train/test split for scoring
    X_train, X_valid, y_train, y_valid = train_test_split(
        train[features], train['SalePrice'], 
        random_state = RANDOM_SEED
    )
    
    # Apply preprocessing, if applicable
    if preprocessing:
        X_train, X_valid = preprocessing(X_train, X_valid)
        
    # Train random forest model
    model = RandomForestRegressor(random_state = RANDOM_SEED)
    model.fit(X_train, y_train)
    valid_preds = model.predict(X_valid)
    
    return round(mean_absolute_error(y_valid, valid_preds), 2)

# Lesson 2: Missing Values

Note: The strategies from the [Missing Values](https://www.kaggle.com/alexisbcook/missing-values) notebook assume that all of the categorical data has been removed.

1. Drop columns containing missing values
2. Impute missing values using various strategies (e.g. mean, median, most-frequent)
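
As a rough illustration of the two strategies, here is a minimal sketch on a toy frame. The column names and values are invented for the example and are not taken from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training/validation frames; values are made up for illustration
toy_train = pd.DataFrame({"area": [100.0, 150.0, 200.0], "age": [10.0, np.nan, 30.0]})
toy_valid = pd.DataFrame({"area": [120.0, 180.0], "age": [np.nan, 5.0]})

# Strategy 1: drop every column that has a missing value in the training data
na_cols = [col for col in toy_train.columns if toy_train[col].isnull().any()]
dropped_train = toy_train.drop(na_cols, axis=1)
dropped_valid = toy_valid.drop(na_cols, axis=1)

# Strategy 2: fill missing values with a statistic learned from the training data only
imputer = SimpleImputer(strategy="mean")
imputed_train = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)
imputed_valid = pd.DataFrame(imputer.transform(toy_valid), columns=toy_valid.columns)

print(dropped_train.columns.tolist())
print(imputed_valid)
```

Note that the imputer is fit on the training frame and only applied to the validation frame, so the validation data never influences the fill value.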

## Strategy 1: Drop columns with NAs

In [5]:
# Missing Values Strategy 1: Drop columns w/ missing values
def drop_missing(X_train, X_valid):
    
    # Get names of columns with missing values
    na_cols = [col for col in numerical if X_train[col].isnull().any()]
    X_train = X_train[numerical].drop(na_cols, axis=1)
    X_valid = X_valid[numerical].drop(na_cols, axis=1)
    
    return X_train, X_valid

In [6]:
# Test Strategy 1: Drop Columns with NAs
print('Strategy 1: Drop Columns with NAs:',score(preprocessing = drop_missing),'MAE\n')

Strategy 1: Drop Columns with NAs: 17668.58 MAE



## Strategy 2: Imputing Missing Values

Note: The lecture notes cover only numerical data; we also include an imputer for categorical data.

In [7]:
# Missing Values Strategy 2: Impute missing values (numerical data only)
def impute_numerical(X_train, X_valid, strategy ='mean', drop = True):
    
    columns = [x for x in X_train.columns if x in numerical]
    if drop:
        X_train, X_valid = X_train[columns].copy(), X_valid[columns].copy()
    else:
        X_train, X_valid = X_train.copy(), X_valid.copy()

    # impute NAs for numerical cols
    imputer = SimpleImputer(strategy=strategy)
    X_train[columns] = imputer.fit_transform(X_train[columns])
    X_valid[columns] = imputer.transform(X_valid[columns])
    
    return X_train, X_valid

# Missing Values Strategy 2.5: Impute missing values (categorical data only)
def impute_categorical(X_train, X_valid, strategy ='constant'):
    
    columns = [x for x in X_train.columns if x in categorical]
    X_train, X_valid = X_train.copy(), X_valid.copy()
    
    # assert valid strategy
    assert strategy in ['constant','most_frequent']
    
    # impute NAs for categorical columns
    imputer = SimpleImputer(strategy = strategy, fill_value = 'None')
    X_train[columns] = imputer.fit_transform(X_train[columns])
    X_valid[columns] = imputer.transform(X_valid[columns])
    
    return X_train, X_valid

In [8]:
# Test Strategy 2.0: Impute using the mean value
preprocessing = impute_numerical
print('Strategy 2.0 (Impute w/ Mean):', score(preprocessing),'MAE\n')
    
# Test Strategy 2.1: Impute using the median value
preprocessing = partial(impute_numerical, strategy = 'median')
print('Strategy 2.1 (Impute w/ Median):', score(preprocessing),'MAE\n')
    
# Test Strategy 2.2: Impute using the most frequent value
preprocessing = partial(impute_numerical, strategy = 'most_frequent')
print('Strategy 2.2 (Impute w/ Mode):', score(preprocessing),'MAE\n')

Strategy 2.0 (Impute w/ Mean): 17502.92 MAE

Strategy 2.1 (Impute w/ Median): 17613.36 MAE

Strategy 2.2 (Impute w/ Mode): 17525.83 MAE



## Strategy 2.5: Imputation (extended)

This approach was covered in the notes but not used in the exercise; we include it for completeness. Missing values are imputed as before, and for each column that had missing values a new indicator column is added that marks which rows were missing (stored as 0/1 in the code below).
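
To make the idea concrete, here is a small sketch on invented data. It builds the indicator column by hand and then compares the result with scikit-learn's built-in `add_indicator=True` option of `SimpleImputer`, which is an alternative the course notes do not use.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up): only "lot" has a missing value
toy = pd.DataFrame({"lot": [50.0, np.nan, 70.0], "rooms": [3.0, 4.0, 5.0]})

# Manual version: record where values were missing, then impute
extended = toy.copy()
extended["lot_was_NA"] = extended["lot"].isnull().astype(int)
extended[["lot", "rooms"]] = SimpleImputer(strategy="mean").fit_transform(toy)

# scikit-learn alternative: add_indicator=True appends one indicator column
# for each feature that had missing values when the imputer was fit
auto = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(toy)

print(extended)
print(auto)
```

Both versions produce the same values here; the manual version keeps readable column names, while the built-in option returns a plain array.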

In [9]:
# Extended imputation: impute missing values and add indicator columns (numerical data only)
def impute_numerical_extended(X_train, X_valid, strategy ='mean'):
    
    columns = [x for x in X_train.columns if x in numerical]
    X_train, X_valid = X_train[columns].copy(), X_valid[columns].copy()
    
    # 1. Get numerical columns w/ NAs
    na_cols = [x for x in numerical if X_train[x].isnull().any()]
    
    # 2. Add indicator column for missing values
    for col in na_cols:
        X_train[col + '_was_NA'] = X_train[col].isnull().astype(int)
        X_valid[col + '_was_NA'] = X_valid[col].isnull().astype(int)
    
    # 3. Fit on training data, apply to validation set
    imputer = SimpleImputer(strategy=strategy)
    X_train = imputer.fit_transform(X_train)
    X_valid = imputer.transform(X_valid)
    
    return X_train, X_valid

# Extended imputation: impute missing values and add indicator columns (categorical data only)
def impute_categorical_extended(X_train, X_valid, strategy ='constant'):
    
    assert strategy in ['constant','most_frequent']
    columns = [x for x in X_train.columns if x in categorical]
    X_train, X_valid = X_train[columns].copy(), X_valid[columns].copy()

    # 1. Determine categorical features with missing values
    na_cols = [col for col in categorical if X_train[col].isnull().any()]
    
    # 2. Add indicator columns for imputed values
    for col in na_cols:
        X_train[col + '_was_NA'] = X_train[col].isnull().astype(int)
        X_valid[col + '_was_NA'] = X_valid[col].isnull().astype(int)
    
    # 3. Fit imputer on training data, apply to validation set
    imputer = SimpleImputer(strategy=strategy, fill_value = 'None')
    X_train = imputer.fit_transform(X_train)
    X_valid = imputer.transform(X_valid)
    
    return X_train, X_valid

In [10]:
# Test extended imputation using the mean value
preprocessing = impute_numerical_extended
print('Strategy 2.0 (Extended Impute w/ Mean):', score(preprocessing),'MAE\n')
    
# Test extended imputation using the median value
preprocessing = partial(impute_numerical_extended, strategy = 'median')
print('Strategy 2.1 (Extended Impute w/ Median):', score(preprocessing),'MAE\n')
    
# Test extended imputation using the most frequent value
preprocessing = partial(impute_numerical_extended, strategy = 'most_frequent')
print('Strategy 2.2 (Extended Impute w/ Mode):', score(preprocessing),'MAE\n')

Strategy 2.0 (Extended Impute w/ Mean): 17536.13 MAE

Strategy 2.1 (Extended Impute w/ Median): 17579.02 MAE

Strategy 2.2 (Extended Impute w/ Mode): 17772.08 MAE



# Lesson 3: Categorical Variables

Note: the strategies in the [Categorical Variables](https://www.kaggle.com/alexisbcook/categorical-variables) notes assume that all columns with missing data have been removed.

1. Drop Categorical Variables
2. Label Encode
3. One-Hot Encode
4. Both Label Encoding and One-Hot Encoding

## Strategy 1: Drop Categorical Variables

In [11]:
def drop_categorical(X_train, X_valid):
    
    # Drop categorical columns
    X_train = X_train.drop(categorical, axis=1)
    X_valid = X_valid.drop(categorical, axis=1)
    
    # Drop NA columns
    na_cols = [col for col in numerical if X_train[col].isnull().any()]
    X_train = X_train.drop(na_cols, axis=1)
    X_valid = X_valid.drop(na_cols, axis=1)
    
    return X_train, X_valid

In [12]:
# Strategy 1: Drop all categorical variables
preprocessing = drop_categorical
print("Strategy 1 (Drop categorical variables):", score(preprocessing),'MAE\n')

Strategy 1 (Drop categorical variables): 17668.58 MAE



## Strategy 2: Ordinal Encoding

Transforms categorical columns to integer columns with values from 0 to N-1 where N is the number of unique values. 

**Note:** This is called "label encoding" in the notes, but the [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder) is intended for transforming the target variable (for classification problems) whereas the [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder) is for transforming categorical variables.
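
The mapping can be seen on a toy column (values invented for illustration). This sketch also shows one possible alternative to dropping columns whose validation values were not seen in training: assuming scikit-learn 0.24 or later, `handle_unknown="use_encoded_value"` assigns unseen categories a placeholder code instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({"quality": ["low", "high", "medium", "low"]})
toy_valid = pd.DataFrame({"quality": ["high", "excellent"]})  # "excellent" is unseen

# Categories are sorted alphabetically, so here: high -> 0, low -> 1, medium -> 2
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train)
valid_codes = encoder.transform(toy_valid)

print(train_codes.ravel())
print(valid_codes.ravel())
```

The integer order is alphabetical rather than semantic ("high" sorts before "low"), which is why the encoding is a naive one unless the category order is supplied explicitly.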

In [13]:
# Strategy 2: Ordinal Encoding
def ordinal_encoding(X_train, X_valid, verbose = True):
    
    # 0. Drop NA columns
    na_cols = [col for col in features if train[col].isnull().any()]
    X_train = X_train.drop(na_cols, axis=1)
    X_valid = X_valid.drop(na_cols, axis=1)
    
    # 1. Columns which share the same values in training and validation sets
    columns = [col for col in X_train.columns if col in categorical]
    good_label_cols = [col for col in columns if set(X_train[col]) == set(X_valid[col])]
    bad_label_cols = list(set(columns)-set(good_label_cols))
    if verbose: print('Columns encoded:', len(good_label_cols))
    if verbose: print('Columns dropped:', len(bad_label_cols), end = '\n\n')

    # 2. Drop categorical columns that will not be encoded
    X_train = X_train.drop(bad_label_cols, axis=1)
    X_valid = X_valid.drop(bad_label_cols, axis=1)

    # 3. Apply ordinal encoder to "good" columns
    ordinal_encoder = OrdinalEncoder()
    X_train[good_label_cols] = ordinal_encoder.fit_transform(X_train[good_label_cols])
    X_valid[good_label_cols] = ordinal_encoder.transform(X_valid[good_label_cols])     
    
    return X_train, X_valid

In [14]:
# Strategy 2: Ordinal Encoding
preprocessing = ordinal_encoding
print("Strategy 2 (Ordinal Encoding):", score(preprocessing), 'MAE\n')

Columns encoded: 13
Columns dropped: 14

Strategy 2 (Ordinal Encoding): 17118.1 MAE



## Strategy 3: One-Hot Encoding

Note: I use the [category_encoders](https://contrib.scikit-learn.org/category_encoders/onehot.html) library rather than sklearn for this encoder. In my opinion it is more convenient and easier to use.
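
For comparison, here is a minimal sketch of the same idea using scikit-learn's own encoder on invented data. It is not the `category_encoders` class used in this notebook, so it is imported under a hypothetical alias (`SkOneHotEncoder`) to avoid shadowing the import above.

```python
import pandas as pd
# Aliased so it does not clash with category_encoders.OneHotEncoder
from sklearn.preprocessing import OneHotEncoder as SkOneHotEncoder

toy_train = pd.DataFrame({"street": ["Pave", "Grvl", "Pave"]})
toy_valid = pd.DataFrame({"street": ["Grvl", "Dirt"]})  # "Dirt" is unseen

# handle_unknown="ignore" encodes unseen categories as an all-zero row
encoder = SkOneHotEncoder(handle_unknown="ignore")
train_onehot = encoder.fit_transform(toy_train).toarray()
valid_onehot = encoder.transform(toy_valid).toarray()

print(encoder.categories_)
print(train_onehot)
print(valid_onehot)
```

Each category becomes its own 0/1 column, which is why one-hot encoding is usually reserved for low-cardinality columns.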

In [15]:
# Strategy 3: One-Hot Encoding
def one_hot_encoding(X_train, X_valid, verbose = True):
    
    # 0. Drop NA columns
    na_cols = [col for col in features if train[col].isnull().any()]
    X_train = X_train.drop(na_cols, axis=1)
    X_valid = X_valid.drop(na_cols, axis=1)
    
    # 1. Determine low cardinality (few unique values) categorical variables
    columns = [col for col in X_train.columns if col in categorical]
    good_cols = [col for col in columns if set(X_train[col]) == set(X_valid[col])]
    low_cols = [col for col in good_cols if X_train[col].nunique() < 10]
    bad_cols = list(set(columns)-set(low_cols))
    if verbose: print('Columns encoded:', len(low_cols))
    if verbose: print('Columns dropped:', len(bad_cols), end = '\n\n')
    
    # 2. Drop categorical columns that will not be encoded
    X_train = X_train.drop(bad_cols, axis=1)
    X_valid = X_valid.drop(bad_cols, axis=1)
    
    # 3. Apply one-hot encoding to low cardinality columns
    encoder = OneHotEncoder(cols=low_cols, use_cat_names=True)
    X_train = encoder.fit_transform(X_train)
    X_valid = encoder.transform(X_valid)
    
    return X_train, X_valid

In [16]:
# Strategy 3: One-Hot Encoding
preprocessing = one_hot_encoding
print("Strategy 3 (One-Hot Encoding):", score(preprocessing), 'MAE\n')

Columns encoded: 13
Columns dropped: 14

Strategy 3 (One-Hot Encoding): 17103.97 MAE



# Competition Submission 1

My first submission uses only the basic techniques (no pipelines or XGBoost) from the [Missing Values](https://www.kaggle.com/alexisbcook/missing-values) and [Categorical Variables](https://www.kaggle.com/alexisbcook/categorical-variables) notebooks. In particular, we do the following:

1. Impute numerical data NAs with median
2. Impute categorical data NAs with constant value (placeholder)
3. One-Hot encode low cardinality data
4. Ordinally encode high cardinality data
5. Train random forest using the full training data.

## Competition Strategy

In [17]:
def basic_strategy(X_train, X_test):
    
    # 1. Determine relevant categorical columns
    good_cols = [col for col in categorical if set(X_train[col]) == set(X_test[col])]
    low_cols = [col for col in good_cols if X_train[col].nunique() < 10]
    high_cols = list(set(good_cols)-set(low_cols))
    bad_cols = list(set(categorical)-set(good_cols))
    
    # 1.2 Drop irrelevant columns
    X_train = X_train.drop(bad_cols, axis=1)
    X_test = X_test.drop(bad_cols, axis=1)
    
    # 2. Impute using median value for numerical data
    X_train, X_test = impute_numerical(
        X_train, X_test, strategy = 'median', drop = False
    )
    
    # 2.1 Impute constant string for categorical data
    X_train, X_test = impute_categorical(
        X_train, X_test, strategy = 'constant'
    )
    
    # 3. One-Hot encode low cardinality columns
    onehot_encoder = OneHotEncoder(cols = low_cols, use_cat_names=True)
    X_train = onehot_encoder.fit_transform(X_train)
    X_test = onehot_encoder.transform(X_test)
    
    # 4. Ordinal encode high cardinality columns
    ordinal_encoder = OrdinalEncoder()
    X_train[high_cols] = ordinal_encoder.fit_transform(X_train[high_cols])
    X_test[high_cols] = ordinal_encoder.transform(X_test[high_cols]) 
    
    return X_train, X_test

## Submission Function

In [18]:
def make_submission(preprocessing):
    
    # Load data
    X_train, y_train = train[features].copy(), train['SalePrice'].copy()
    X_test = test[features].copy()
    
    # Preprocessing
    X_train, X_test = preprocessing(X_train, X_test)

    # Create submission
    model = RandomForestRegressor(random_state = RANDOM_SEED)
    model.fit(X_train, y_train)
    test_preds = model.predict(X_test)
    
    # Use the Id column from the test file; the frame's index is just 0..n-1
    output = pd.DataFrame({'Id': test['Id'],'SalePrice': test_preds})
    output.to_csv(preprocessing.__name__ + ' submission.csv', index=False)

In [19]:
# Competition Strategy 1
preprocessing = basic_strategy

# Get validation score
print("Competition Strategy 1:", score(preprocessing), 'MAE\n')

# Make submission
if SUBMIT: 
    make_submission(preprocessing)
    print('Created Submission.\n')

Competition Strategy 1: 17044.47 MAE

Created Submission.



# Lesson 4: Pipelines

Note: The strategies from the [Pipelines](https://www.kaggle.com/alexisbcook/pipelines) notebook assume that all high-cardinality categorical data (10 or more unique values) has been discarded.

## 4.1 Scoring Function

Similar to the scoring function used in the previous section but accepts a scikit-learn [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) as input.
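
The scoring function clones the pipeline it receives before fitting, so the same pipeline object can be scored repeatedly without carrying over fitted state. A small sketch of that behaviour, using toy data and a `LinearRegression` model chosen only to keep the example fast:

```python
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data (made up) with one missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

template = Pipeline(steps=[("imputer", SimpleImputer()), ("model", LinearRegression())])

# clone() copies the hyperparameters but none of the fitted state
fitted_copy = clone(template).fit(X, y)

copy_is_fitted = hasattr(fitted_copy.named_steps["model"], "coef_")
template_is_fitted = hasattr(template.named_steps["model"], "coef_")
print(copy_is_fitted, template_is_fitted)
```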

In [20]:
# Function for comparing pipelines
def score_pipeline(sklearn_pipeline):
    
    # Get train/test split for scoring
    X_train, X_valid, y_train, y_valid = train_test_split(
        train[features], train['SalePrice'], 
        random_state = RANDOM_SEED
    )
        
    # Clones the input pipeline (creates new instance)
    pipeline = clone(sklearn_pipeline)
    pipeline.fit(X_train, y_train)
    valid_preds = pipeline.predict(X_valid)
    
    return round(mean_absolute_error(y_valid, valid_preds), 2)

## 4.2 Basic Pipeline

The pipeline from the notes does the following:

1. Imputes missing numerical values with a constant
2. Imputes categorical missing values with the most frequent value
3. One-Hot encodes the categorical variables

In [21]:
# Basic pipeline from the notes, as is.
def basic_pipeline():
    
    # 1. Preprocessing for numerical data
    numerical_transformer = SimpleImputer(strategy='constant')

    # 2. Preprocessing for categorical data
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder())
    ])

    # 3. Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical),
            ('cat', categorical_transformer, categorical)
        ])
    
    # 4. The model, which is the same as in the scoring_function
    model = RandomForestRegressor(random_state = RANDOM_SEED)
    
    return Pipeline(steps=[('preprocessor', preprocessor),('model', model)])

In [22]:
pipeline = basic_pipeline()
print("Basic Pipeline:", score_pipeline(pipeline), ' MAE\n')

Basic Pipeline: 17032.58  MAE



## 4.3 Improved Pipeline

This is a tweaked version of the pipeline from the exercise that performs slightly better. The differences from the notes are as follows:

1. We use median imputation for the numerical data 
2. We use a constant value for the categorical imputer

In [23]:
# Slightly better pipeline
def improved_pipeline():
    
    # 1. Preprocessing for numerical data
    numerical_transformer = SimpleImputer(strategy='median')

    # 2. Preprocessing for categorical data
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value = 'None')),
        ('onehot', OneHotEncoder())
    ])

    # 3. Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical),
            ('cat', categorical_transformer, categorical)
        ])
    
    # 4. The model, which is the same as in the scoring_function
    model = RandomForestRegressor(random_state = RANDOM_SEED)
    
    return Pipeline(steps=[('preprocessor', preprocessor),('model', model)])

In [24]:
pipeline = improved_pipeline()
print("Improved Pipeline:", score_pipeline(pipeline), ' MAE\n')

Improved Pipeline: 16939.56  MAE



# Lesson 5: Cross Validation

Note: The [Cross Validation](https://www.kaggle.com/alexisbcook/cross-validation) examples use only numerical data.

## 5.1 Scoring

The following function performs k-fold cross-validation using a random forest model with a specified number of trees.
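
As a reminder of what k-fold splitting does, the sketch below runs scikit-learn's `KFold` on six toy samples and counts how often each sample lands in a validation fold. When `cross_val_score` is given an integer `cv` for a regressor, it uses this kind of unshuffled `KFold` split.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # six toy samples

kfold = KFold(n_splits=3)
validation_counts = np.zeros(len(X), dtype=int)
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X)):
    # Every sample is either in the training part or the validation part of a fold
    validation_counts[valid_idx] += 1
    print(f"Fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")

print(validation_counts.tolist())
```

Each sample is used for validation exactly once, so the averaged score reflects the whole training set rather than a single hold-out split.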

In [25]:
# Testing a RandomForestRegressor for various parameters using cross-validation
def get_cross_val_score(num_folds, num_trees):
    
    # Get data
    X, y = train[numerical].copy(), train['SalePrice'].copy()
    
    # Default imputer (np.nan and mean value)
    pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators = num_trees, random_state = RANDOM_SEED))
    ])
    
    # multiply by -1 so that it matches scores from previous sections
    scores = -1 * cross_val_score(
        pipeline, X, y, 
        cv=num_folds, 
        scoring='neg_mean_absolute_error'
    )
    return round(scores.mean(), 2)

## 5.2 Testing Number of Estimators

We test various values for the number of estimators in a random forest model using 3-fold cross-validation:

In [26]:
# Test 50 through 400 estimators (in steps of 50)
def test_estimators():
    num_folds = 3
    for num_trees in range(50,450,50):
        print(
            f'n_estimators: {num_trees}\t {num_folds}-fold MAE:', 
            get_cross_val_score(num_folds, num_trees)
        )

In [27]:
test_estimators()

n_estimators: 50	 3-fold MAE: 18353.84
n_estimators: 100	 3-fold MAE: 18395.22
n_estimators: 150	 3-fold MAE: 18288.73
n_estimators: 200	 3-fold MAE: 18248.35
n_estimators: 250	 3-fold MAE: 18255.27
n_estimators: 300	 3-fold MAE: 18275.24
n_estimators: 350	 3-fold MAE: 18270.29
n_estimators: 400	 3-fold MAE: 18270.2


# Lesson 6: XGBoost

Note: The examples in the [XGBoost](https://www.kaggle.com/alexisbcook/xgboost) notes discard high-cardinality categorical data. We write our own cross-validation function so that we can use `early_stopping_rounds`, which `cross_val_score` does not support directly because each fold needs its own validation set passed as `eval_set`.
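
The idea behind early stopping can be sketched in plain Python. This is only an illustration of the stopping rule under simplified assumptions, not XGBoost's actual implementation; the function name and the error values are hypothetical.

```python
def best_round_with_early_stopping(valid_errors, patience):
    """Return (best_round, rounds_run) for a sequence of per-round validation errors."""
    best_round, best_error = 0, float("inf")
    for current_round, error in enumerate(valid_errors):
        if error < best_error:
            # New best validation error: remember the round
            best_round, best_error = current_round, error
        elif current_round - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop training
            return best_round, current_round + 1
    return best_round, len(valid_errors)

# Made-up validation errors: improvement stalls after round 2
errors = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 6.0]
result = best_round_with_early_stopping(errors, patience=3)
print(result)
```

With a patience of 3, training stops before the late improvement in the final round is ever seen, which is the trade-off a small `early_stopping_rounds` value makes.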

## 6.1 XGBoost Scoring Function

In [28]:
def score_xgboost(preprocessing, xgboost_model):
    
    # Drop high cardinality categorical variables
    X = train[features].drop(high_cardinality, axis=1)
    y = train['SalePrice'].copy()
    
    # Data structure for storing scores and times
    scores = np.zeros(NUM_FOLDS)
    times = np.zeros(NUM_FOLDS)
    
    # K-fold cross-validation
    kfold = KFold(n_splits = NUM_FOLDS, shuffle = True, random_state = RANDOM_SEED)
    for fold, (train_idx, valid_idx) in enumerate(kfold.split(X)):
        
        # Training and Validation Sets
        X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
        
        # Preprocessing
        X_train, X_valid = preprocessing(X_train, X_valid)
        
        # Create model
        start = time.time()
        model = clone(xgboost_model)
        model.fit(
            X_train, y_train, 
            early_stopping_rounds=5,
            eval_set=[(X_valid, y_valid)], 
            verbose=False
        )
        
        # validation predictions
        valid_preds = np.ravel(model.predict(X_valid))
        scores[fold] = mean_absolute_error(y_valid, valid_preds)
        end = time.time()
        print(f'Fold {fold} MAE: {round(scores[fold], 5)} in {round(end-start,2)}s.')
        time.sleep(0.5)
    return round(scores.mean(), 2)

## 6.2 XGBoost Preprocessing

In [29]:
def preprocessing(X_train, X_valid):
    
    X_train, X_valid = impute_numerical(X_train, X_valid, drop = False)
    X_train, X_valid = impute_categorical(X_train, X_valid)
    X_train, X_valid = one_hot_encoding(X_train, X_valid, verbose = False)
    
    return X_train, X_valid

In [30]:
# Baseline XGBoost model with default settings
model = XGBRegressor(random_state = RANDOM_SEED)
print("\nMAE for XGBoost (Baseline):", score_xgboost(preprocessing, model), end = '\n\n')
    
# Improved hyperparameters
model2 = XGBRegressor(
    random_state = RANDOM_SEED,
    n_estimators = 500,
    learning_rate = 0.05
)
print("\nMAE for XGBoost (Better):", score_xgboost(preprocessing, model2), end = '\n\n')
    
# Worse hyperparameters
model3 = XGBRegressor(
    random_state = RANDOM_SEED, 
    n_estimators = 750, 
    learning_rate = 0.001
)
print("\nMAE for XGBoost (Worse):", score_xgboost(preprocessing, model3), end = '\n\n')


Fold 0 MAE: 15867.29803 in 0.11s.
Fold 1 MAE: 19328.60869 in 0.1s.
Fold 2 MAE: 16352.65493 in 0.1s.
Fold 3 MAE: 20900.76323 in 0.09s.
Fold 4 MAE: 17106.46529 in 0.09s.
Fold 5 MAE: 19747.92421 in 0.25s.
Fold 6 MAE: 15857.24856 in 0.11s.
Fold 7 MAE: 15745.21089 in 0.61s.

MAE for XGBoost (Baseline): 17613.27

Fold 0 MAE: 14331.46734 in 0.53s.
Fold 1 MAE: 18595.55697 in 0.51s.
Fold 2 MAE: 14125.38232 in 0.62s.
Fold 3 MAE: 21215.01712 in 0.3s.
Fold 4 MAE: 16395.80076 in 0.61s.
Fold 5 MAE: 18218.53782 in 0.59s.
Fold 6 MAE: 15229.12968 in 0.47s.
Fold 7 MAE: 13862.34032 in 0.44s.

MAE for XGBoost (Better): 16496.65

Fold 0 MAE: 87020.86198 in 2.82s.
Fold 1 MAE: 85916.52645 in 3.07s.
Fold 2 MAE: 86571.39045 in 2.97s.
Fold 3 MAE: 85719.34347 in 3.28s.
Fold 4 MAE: 91725.70112 in 2.75s.
Fold 5 MAE: 92822.97081 in 3.77s.
Fold 6 MAE: 85756.31864 in 2.8s.
Fold 7 MAE: 80693.72452 in 2.54s.

MAE for XGBoost (Worse): 87028.35



# Competition Submission

In previous versions of this notebook, I used knowledge about the columns to encode certain categorical columns in a way that preserves their natural ordering. I decided to keep things simple and use more naive feature engineering in order to keep with the spirit of the course. For the final submission, I will do the following:

1. Impute missing numerical data with the median
2. Impute missing categorical data with a placeholder
3. Encode all remaining categorical data with an `OrdinalEncoder` (high-cardinality columns are dropped first)

## 1. Preprocessing


In [31]:
# We use the ordinal encoder from the category encoders library
from category_encoders import OrdinalEncoder

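# Preprocessing function that also transforms the test data
def preprocessing(X_train, X_valid, X_test):
    
    # Work on copies so the caller's frames (slices of X) are not modified
    X_train, X_valid, X_test = X_train.copy(), X_valid.copy(), X_test.copy()
    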
    # Impute NAs for numerical cols
    columns = [x for x in X_train.columns if x in numerical]
    imputer = SimpleImputer(strategy = 'median')
    X_train[columns] = imputer.fit_transform(X_train[columns])
    X_valid[columns] = imputer.transform(X_valid[columns])
    X_test[columns] = imputer.transform(X_test[columns])
    
    # Impute NAs for categorical cols
    columns = [x for x in X_train.columns if x in categorical]
    imputer = SimpleImputer(strategy = 'constant', fill_value = 'None')
    X_train[columns] = imputer.fit_transform(X_train[columns])
    X_valid[columns] = imputer.transform(X_valid[columns])
    X_test[columns] = imputer.transform(X_test[columns])
    
    # Ordinally encode categorical columns
    encoder = OrdinalEncoder(cols = columns)
    X_train = encoder.fit_transform(X_train)
    X_valid = encoder.transform(X_valid)
    X_test = encoder.transform(X_test)
    
    return X_train, X_valid, X_test

## 2. Generate Submission

In [32]:
def generate_submission(submit = False):
    
    # Drop high cardinality categorical variables
    X = train[features].drop(high_cardinality, axis=1)
    y = train['SalePrice'].copy()
    
    # Data structure for storing scores and times
    test_preds = np.zeros((test.shape[0],))
    scores = np.zeros(NUM_FOLDS)
    times = np.zeros(NUM_FOLDS)
    
    # K-fold cross-validation
    kfold = KFold(n_splits = NUM_FOLDS, shuffle = True, random_state = RANDOM_SEED)
    for fold, (train_idx, valid_idx) in enumerate(kfold.split(X)):
        
        # Training and Validation Sets
        X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
        X_test = test[features].drop(high_cardinality, axis=1)
        
        # Preprocessing
        X_train, X_valid, X_test = preprocessing(X_train, X_valid, X_test)
        
        # Create model
        start = time.time()
        model = XGBRegressor(    
            random_state = RANDOM_SEED,
            n_estimators = 1000,
            learning_rate = 0.05
        )
        model.fit(
            X_train, y_train, 
            early_stopping_rounds = 10,
            eval_set=[(X_valid, y_valid)], 
            verbose=False
        )
        
        # validation predictions
        valid_preds = np.ravel(model.predict(X_valid))
        test_preds += model.predict(X_test) / NUM_FOLDS
        scores[fold] = mean_absolute_error(y_valid, valid_preds)
        end = time.time()
        print(f'Fold {fold} MAE: {round(scores[fold], 5)} in {round(end-start,2)}s.')
        time.sleep(0.5)
    
    if submit:
        # Use the Id column from the test file; the frame's index is just 0..n-1
        output = pd.DataFrame({'Id': test['Id'],'SalePrice': test_preds})
        output.to_csv(preprocessing.__name__ + ' submission.csv', index=False)
        print('\nCreated Submission.\n')

In [33]:
# change to submit = True to generate output
generate_submission(SUBMIT)

Fold 0 MAE: 16180.02754 in 0.83s.
Fold 1 MAE: 17572.5708 in 0.64s.
Fold 2 MAE: 15113.51981 in 0.8s.
Fold 3 MAE: 19262.53168 in 0.4s.
Fold 4 MAE: 15566.355 in 0.8s.
Fold 5 MAE: 17495.09937 in 0.99s.
Fold 6 MAE: 14208.65084 in 1.12s.
Fold 7 MAE: 13445.90363 in 0.69s.

Created Submission.



I hope you found this useful. I plan to make a similar notebook for the [Feature Engineering](https://www.kaggle.com/learn/feature-engineering) course, which will build on this notebook.