This is where the magic happens:

# MODELLING

In this notebook I'll work on a model to predict what a tourist will spend when vacationing in Tanzania.
The evaluation metric for the model is **Mean Absolute Error**.
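
For reference, MAE is the average of the absolute differences between actual and predicted values. The sketch below computes it in plain Python on made-up numbers (not taken from the Tanzania data), just to show the arithmetic. `mean_absolute_error_plain` is a hypothetical helper name; the notebook itself relies on `sklearn.metrics.mean_absolute_error`.

```python
def mean_absolute_error_plain(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# made-up spending values, for illustration only
actual = [100.0, 250.0, 400.0]
predicted = [120.0, 200.0, 430.0]

mae_example = mean_absolute_error_plain(actual, predicted)
print("MAE: {:.2f}".format(mae_example))
```

Because MAE is expressed in the same units as `total_cost`, it can be read directly as the typical size of a prediction error.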


To do:
- add comments
- feature selection/feature engineering (subregions)
- model
- again if necessary: feature selection/feature engineering
- outlier handling (Isolation Forest?)
- hyperparameter tuning
- interpretation/visualization

In [1]:
# import some packages that I'll need

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score, cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# set color scheme
cpal = ["#f94144","#f3722c","#f8961e","#f9844a","#f9c74f","#90be6d","#43aa8b","#4d908e","#577590","#277da1"]

# seaborn theme
sns.set()

# display floats with two decimal places
pd.options.display.float_format = "{:.2f}".format

# set random seed
RSEED = 42

In [2]:
# load data
TZA = pd.read_csv('data/Train.csv')

In [3]:
# train test split
train, test = train_test_split(TZA, test_size = 0.3, random_state = RSEED)

## Preprocessing

I am going to preprocess the data now. I'll do it separately for train and test data, carefully avoiding data leakage.
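
As a minimal sketch of what avoiding leakage means for a fitted step such as scaling: the statistics are learned from the training rows only and then reused unchanged on the test rows. The example uses plain Python and made-up numbers. `fit_robust` and `transform_robust` are hypothetical helper names, and the median/interquartile-range arithmetic is only a simplified stand-in for what a robust scaler does.

```python
from statistics import median, quantiles


def fit_robust(values):
    """Learn center (median) and scale (interquartile range) from data."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


def transform_robust(values, center, scale):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(v - center) / scale for v in values]


# made-up numeric feature, for illustration only
train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [3.0, 10.0]

center, scale = fit_robust(train_values)  # statistics come from train only
train_scaled = transform_robust(train_values, center, scale)
test_scaled = transform_robust(test_values, center, scale)  # reuse train statistics
print(center, scale, test_scaled)
```

Fitting on the combined data instead would let the test value `10.0` shift the median and the interquartile range, which is exactly the kind of leakage to avoid.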

### Missing values and minor adjustments

In [4]:
# function to handle missing data and to make some minor adjustments on the dataset

def adjustments(df):
    # work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    # fill NaN total_male/total_female with 0
    df['total_male'] = df['total_male'].fillna(0)
    df['total_female'] = df['total_female'].fillna(0)
    # add a column group_size based on total_male/total_female
    df['group_size'] = df['total_female'] + df['total_male']
    # set travel_with to "Alone" for solo travellers (group_size of 1);
    # note that this also overwrites non-missing values in those rows
    df.loc[df.group_size == 1, 'travel_with'] = 'Alone'
    # fill remaining NaN travel_with with missing
    df['travel_with'] = df['travel_with'].fillna('missing')
    # fill NaN most_impressing with "No comments"
    df['most_impressing'] = df['most_impressing'].fillna('No comments')
    # add a column total_nights based on night_zanzibar/night_mainland
    df['total_nights'] = df['night_zanzibar'] + df['night_mainland']
    # delete rows if group_size is zero
    df = df[df.group_size > 0]
    # delete rows if total_nights is zero
    df = df[df.total_nights > 0]
    # drop the ID column, plus night_mainland and total_male, which are
    # redundant given total_nights and group_size (to avoid multicollinearity)
    df = df.drop(['ID', 'night_mainland', 'total_male'], axis=1)
    return df

In [5]:
# apply function on train data
train = adjustments(train)
# apply function on test data
test = adjustments(test)

In [6]:
# separate target variable

X_train = train.drop(['total_cost'], axis=1)
y_train = train['total_cost']

X_test = test.drop(['total_cost'], axis=1)
y_test = test['total_cost']

### Build Pipelines

In [7]:
cat_features = list(X_train.columns[X_train.dtypes==object])
cat_pipeline = Pipeline([
    ('1hot', OneHotEncoder(handle_unknown= 'ignore', drop = 'first'))
])

In [8]:
num_features = list(X_train.columns[X_train.dtypes!=object])
num_pipeline = Pipeline([
    ('rob_scaler', RobustScaler())
])

In [9]:
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

### Baseline Model

In [10]:
# build pipeline
pipe_linreg_bl = Pipeline([
    ('preprocessor', preprocessor),
    ('linreg', LinearRegression())
])

In [11]:
# cross validate
y_train_predicted_bl = cross_val_predict(pipe_linreg_bl, X_train, y_train, cv=5)

In [12]:
# print MAE of Baseline Model
print("Mean Absolute Error Baseline Model: {:.2f}".format(mean_absolute_error(y_train, y_train_predicted_bl)))

Mean Absolute Error Baseline Model: 6124770.14
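
To make the cross-validated score easier to read, here is a rough sketch of how out-of-fold predictions come about: each fold is predicted by a model that never saw it. A trivial "always predict the training median" model stands in for the pipeline, the folds are assumed to be contiguous and unshuffled, the data are made up, and `out_of_fold_median` is a hypothetical helper, not part of scikit-learn.

```python
from statistics import median


def out_of_fold_median(y, n_folds):
    """Predict each contiguous fold with the median of the remaining folds."""
    n = len(y)
    # spread samples over folds; the first n % n_folds folds get one extra
    sizes = [n // n_folds + (1 if i < n % n_folds else 0) for i in range(n_folds)]
    predictions = []
    start = 0
    for size in sizes:
        stop = start + size
        rest = y[:start] + y[stop:]  # "training" part for this fold
        predictions.extend([median(rest)] * size)  # predict the held-out part
        start = stop
    return predictions


# made-up target values, for illustration only
y_toy = [10.0, 20.0, 30.0, 40.0, 1000.0, 50.0]

oof = out_of_fold_median(y_toy, n_folds=3)
mae_oof = sum(abs(t - p) for t, p in zip(y_toy, oof)) / len(y_toy)
print(oof, round(mae_oof, 2))
```

In this toy data the single large value accounts for most of the error, which is one reason outlier handling is on the to-do list.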
