# Aaron Spaulding Palmer Penguin Model

The goal is to predict the species of a penguin from its body measurements, island, and sex.

# Overview
## Import data
* Open data
* Clean
    * Drop rows with missing values ('na')
    * Drop rows where 'sex' is '.'
* One-hot encode (binarize) the target variable 'species'
* One-hot encode the categorical predictor variables
    * 'sex'
    * 'island'
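
To make the one-hot encoding step listed above concrete, here is a minimal plain-Python sketch. The helper `one_hot` is hypothetical and only mimics the idea; the notebook itself uses `OneHotEncoder` and `MultiLabelBinarizer` from sklearn. Sorting the categories alphabetically is an assumption made for a stable column order.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Illustrative island values for four penguins
categories, rows = one_hot(['Torgersen', 'Biscoe', 'Dream', 'Biscoe'])
print(categories)  # ['Biscoe', 'Dream', 'Torgersen']
print(rows)        # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

Each row contains exactly one 1, in the column of the category that the original value belonged to.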

## Build the model
My final model is an ensemble of three gradient boosting machines (GBMs) where each GBM is assigned one species label. The final species label is then assigned by picking the model that predicts the highest probability.

To simplify this I use the 'MultiOutputClassifier' from sklearn.
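
As a minimal sketch of the selection rule described above, the final label can be taken as the species whose one-vs-rest model reports the highest probability. The function `pick_species` and the example probabilities are hypothetical and are not part of the notebook's code.

```python
SPECIES = ['Adelie', 'Chinstrap', 'Gentoo']

def pick_species(probabilities):
    """Return the species whose binary model gives the highest probability.

    `probabilities` holds one positive-class probability per species,
    in the same order as SPECIES.
    """
    best_index = max(range(len(SPECIES)), key=lambda i: probabilities[i])
    return SPECIES[best_index]

# Hypothetical outputs of the three per-species models for one penguin
example = [0.10, 0.75, 0.30]
print(pick_species(example))  # Chinstrap
```

Note that calling `predict` on a `MultiOutputClassifier` returns a separate 0/1 decision for each species, so applying this rule in practice would likely mean working from `predict_proba` instead.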

## Validate the model
### Method
This model is validated using a 10-fold cross-validation (CV).
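
As a rough illustration of what k-fold CV does, the sketch below splits sample indices into contiguous, unshuffled folds so that every sample is used for testing exactly once. The helper `k_fold_indices` is hypothetical and written in plain Python; the notebook itself relies on `cross_validate` from sklearn, and the sample count of 25 is arbitrary.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=25, n_folds=10))
for train, test in folds[:2]:
    print(len(train), test)
```

The model is fitted once per fold on the training indices and scored on the held-out test indices; the reported result is the mean of the per-fold scores.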
### Metric
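Since we are doing multi-species classification, we need a suitable metric. I use the micro-averaged F1 score, which pools true positives, false positives, and false negatives across all species before computing F1. This is the same metric used in the Cornell bird detection competition.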
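
To show how this metric is computed on one-hot labels, here is a small plain-Python sketch. The function `micro_f1` and the toy labels are hypothetical; the notebook uses `f1_score(..., average='micro')` from sklearn.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for 0/1 label matrices given as lists of rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Toy example: columns are Adelie, Chinstrap, Gentoo
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 0]]
print(micro_f1(y_true, y_pred))  # 4/7, about 0.571
```

In the toy example the third penguin gets the wrong species (one false positive and one false negative) and the fourth gets no species at all (one false negative), so the score is 2·2 / (2·2 + 1 + 2) = 4/7.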
### Why not a train-test split?
The 10-fold CV was chosen because a single train-test split can make a model with weak predictive performance appear strong. To highlight this, I evaluate my model on one split and show how it appears to reach a perfect score. This can happen because the chosen test set may not be representative of the entire data set.


# Setup

Import some libraries

In [None]:
import os

import numpy as np
import pandas as pd

from sklearn.experimental import enable_halving_search_cv

from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn.pipeline import Pipeline

from sklearn.multioutput import MultiOutputClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import HalvingGridSearchCV

from sklearn.ensemble import BaggingClassifier


import xgboost as xgb

In [None]:
file_location = r'../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv'

predictor_columns = ['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex']
target_column = ['species']

## Import data

In [None]:
# Open and clean
data = pd.read_csv(file_location)
data = data.dropna()
data = data[data.sex != '.']

In [None]:
raw_X = np.array(data[predictor_columns])
raw_Y = np.array(data[target_column])

In [None]:
# One-hot encode target variable
binarizer = MultiLabelBinarizer(classes=['Adelie', 'Chinstrap', 'Gentoo'])
encoded_Y = binarizer.fit_transform(raw_Y)

In [None]:
# One-hot encode the categorical columns
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(raw_X[:, [0, 5]])

In [None]:
# Combine the data
new_X_categorical_columns = encoder.transform(raw_X[:, [0, 5]]).toarray()
old_X_value_columns = raw_X[:, 1:5].astype(float)
encoded_X = np.concatenate((new_X_categorical_columns, old_X_value_columns), axis=1)

# Model Design
This is my final model. I explained the details in the overview section.

In [None]:
base_model = xgb.XGBClassifier(use_label_encoder=False,
                               eval_metric='logloss',
                               eta=0.3,
                               max_depth=3,
                               subsample=0.65,
                               grow_policy='lossguide',
                               max_leaves=1,
                               booster='dart',
                               normalize_type='forest',
                               rate_drop=0.001)
# Prefix each hyperparameter so the pipeline can route it to the wrapped XGBClassifier
parameters = base_model.get_params()
params = {'classify__estimator__' + parameter: value for (parameter, value) in parameters.items()}

In [None]:
# Set up our pipeline of just one step: the multi-output classifier
classifier = MultiOutputClassifier(xgb.XGBClassifier(), n_jobs=1)
model = Pipeline([('classify', classifier)])
# Copy the hyperparameters from base_model onto the wrapped XGBClassifier
_ = model.set_params(**params)

# Validation
10-fold CV!

In [None]:
f1_micro = make_scorer(f1_score, average='micro')

cv_results = cross_validate(model, encoded_X, encoded_Y, scoring=f1_micro, cv=10, n_jobs=4)
print(f'Mean micro-F1 score: {np.mean(cv_results["test_score"])}')

# Why not use a train-test split?
This is a simple demo to show how a train-test split can be deceiving.

I fit the same model on the training portion of a single train-test split and score it on the test portion. If this were our only validation method, we might be led to believe that this model is perfect. However, that conclusion would be incorrect.

In [None]:
np.random.seed(seed=8)

X_train, X_test, Y_train, Y_test = train_test_split(encoded_X, encoded_Y)

model.fit(X_train, Y_train)

predictions = model.predict(X_test)
score = f1_score(Y_test, predictions, average='micro')
print(f'Train-test split micro-F1 score: {score}')
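
To get a feel for why a single split can look perfect, the sketch below computes the chance that a random test set happens to contain none of the samples a model would misclassify. This is a simplified, hypothetical calculation: it assumes the set of "hard" samples is fixed regardless of which samples end up in the training set, and the numbers (300 penguins, 75 in the test set, 3 hard samples) are made up for illustration.

```python
from math import comb

def chance_of_perfect_split(n_samples, n_test, n_hard):
    """Probability that a random test set of size n_test contains none of
    the n_hard samples the model would misclassify."""
    return comb(n_samples - n_hard, n_test) / comb(n_samples, n_test)

# Hypothetical numbers: 300 penguins, a 75-penguin test set,
# and 3 penguins the model would get wrong
print(round(chance_of_perfect_split(300, 75, 3), 2))  # 0.42
```

Under these assumptions, roughly four splits in ten would report a perfect score even though the model is not perfect, which is why averaging over all ten folds gives a more trustworthy estimate.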