## Introduction

The objective of this classification task is to predict the health outcomes of horses based on their historical medical data. There are three possible outcomes: "lived," "died," and "euthanized." In this notebook, I use AutoML via FLAML to tune the hyperparameters of multiple supervised learning models simultaneously and pick the best model. The preprocessing and feature engineering steps, including feature selection, are taken from [this](https://www.kaggle.com/code/syerramilli/ps3e22-eda-catboost-baseline?scriptVersionId=143346482) notebook.

In [None]:
# issue with ray version 2.5: https://github.com/microsoft/FLAML/issues/1132
!pip install FLAML "ray[tune]<2.5.0"

In [None]:
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from flaml import AutoML
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import f1_score

from numbers import Number 
from pathlib import Path
from typing import Optional, Dict

plt.style.use('ggplot')

In [None]:
path = Path('/kaggle/input/playground-series-s3e22')
train = pd.read_csv(path/'train.csv',index_col=['id'])
test = pd.read_csv(path/'test.csv',index_col=['id'])

del train['hospital_number']
del test['hospital_number']

train.head()

## Preprocessing (not including missing value analysis)

As mentioned earlier, I will be using the preprocessing and cleaning steps from the CatBoost notebook. The steps are:

1. Dropping the `lesion_2`, `lesion_3` and `age` features: In each case, the most common value occurs in more than 90% of the observations. It is unlikely that any model can learn the effects of these features on the response.
2. Decoding the site and type from `lesion_1` and then dropping `lesion_1`: Refer to the description in the [original dataset](https://www.kaggle.com/datasets/yasserh/horse-survival-dataset) for the details.
3. Replacing erroneous categories in a couple of features with `pd.NA`
4. For some of the categorical features, mapping a few categories with very few observations onto other categories
5. Ordinal encoding columns that are either binary valued or have some inherent order
6. Converting the `dtype` of the remaining categorical columns to `pd.Categorical`
7. Creating a new numerical feature `log_pulseSq_total_protein` which is given by
$$\texttt{log_pulseSq_total_protein} = \log\left(\frac{\texttt{pulse}^2}{\texttt{total_protein}}\right).$$
This feature turns out to be useful for predicting the "died" outcomes.
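
The logarithm of a quotient splits into a difference of logarithms, which is the form used in the `preprocess` function further down. Here is a quick numerical check of that identity. The values are made up for illustration and are not rows from the competition data.

```python
import numpy as np

# illustrative values only, not taken from the dataset
pulse = np.array([40.0, 88.0, 120.0])
total_protein = np.array([6.5, 8.1, 57.0])

# direct form: log(pulse^2 / total_protein)
direct = np.log(pulse**2 / total_protein)

# expanded form: 2*log(pulse) - log(total_protein)
expanded = 2 * np.log(pulse) - np.log(total_protein)

# both forms agree up to floating-point error
assert np.allclose(direct, expanded)
```

The expanded form avoids squaring `pulse` before taking the logarithm, but the two are mathematically the same feature.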


In [None]:
def map_lesion_site(value:str) -> str:
    '''
    Return the site of the lesion given its code
    '''
    if value[:2] == "11" and len(value) == 5:
        return "all_intestinal"
    elif value[0] == "1":
        return "gastric"
    elif value[0] == "2":
        return "sm_intestine"
    elif value[0] == "3":
        return "lg_colon"
    elif value[0] == "4":
        return "lg_colon_and_cecum"
    elif value[0] == "5":
        return "cecum"
    elif value[0] == "6":
        return "transverse_colon"
    elif value[0] == "7":
        return "rectum_colon"
    elif value[0] == "8":
        return "uterus"
    elif value[0] == "9":
        return "bladder"
    elif value[0] == "0":
        return "none"
    else:
        return "ERROR"
    
def map_lesion_type(value:str) -> str:
    '''
    Returns the type of lesion given its code
    '''
    if value == '0':
        return "none"
    
    value2 = value[2] if len(value)==5 else value[1]
    
    if value2 == '1':
        return "simple"
    elif value2 == '2':
        return 'strangulation'
    elif value2 == '3':
        return 'inflammation'
    elif value2 == '4':
        return 'other'
    
    return 'ERROR'
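
To make the positional logic of the two functions above concrete, the sketch below extracts only the digits they inspect. `split_lesion_code` is a hypothetical helper that is not part of the original notebook, and the example codes are chosen for illustration.

```python
from typing import Optional, Tuple

def split_lesion_code(code: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return the (site, type) digits that the
    mapping functions above look at."""
    if code == "0":
        # no lesion recorded: there is no type digit to read
        return ("0", None)
    # "11" is the only two-digit site, and it is only recognised in five-digit codes
    site = "11" if (len(code) == 5 and code[:2] == "11") else code[0]
    # the type digit is read at index 2 for five-digit codes, otherwise at index 1
    lesion_type = code[2] if len(code) == 5 else code[1]
    return (site, lesion_type)

examples = ["2208", "11300", "31110", "0"]
decoded = {code: split_lesion_code(code) for code in examples}
```

Under the mappings above, `"2208"` corresponds to site `"2"` (`sm_intestine`) and type `"2"` (`strangulation`), while `"11300"` corresponds to site `"11"` (`all_intestinal`) and type `"3"` (`inflammation`).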

In [None]:
categorical_columns = [
    'mucous_membrane', 'abdomen','rectal_exam_feces', 
    'lesion_site', 'lesion_type' 
]

def preprocess(df:pd.DataFrame) -> pd.DataFrame:
    # drop lesion_2, lesion_3, age
    df = df.drop(['age','lesion_2','lesion_3'],axis=1)
    
    # parsing the lesion sites and types
    # TODO: parse subtypes
    df['lesion_site'] = df['lesion_1'].astype(str).apply(map_lesion_site)
    df['lesion_type'] = df['lesion_1'].astype(str).apply(map_lesion_type)
    del df['lesion_1']    
    
    
    # cleaning some of the categorical features
    df['peristalsis'] = df['peristalsis'].replace('distend_small',pd.NA)
    df['rectal_exam_feces'] = df['rectal_exam_feces'].replace('serosanguious',pd.NA)
    
    # merging some of the categories
    df['capillary_refill_time'] = df['capillary_refill_time'].replace('3','more_3_sec')
    df['pain'] = df['pain'].replace('slight','alert')
    
    
    # encoding some of the categorical levels as ordinal
    ordinal_and_binary_dict = {
        'surgery': ['no','yes'], 
        'temp_of_extremities': ['cold','cool', 'normal', 'warm'], 
        'peripheral_pulse': ['absent','reduced', 'normal','increased'], 
        'pain':['alert', 'depressed', 'mild_pain', 'moderate', 'severe_pain', 'extreme_pain'],
        'capillary_refill_time': ['less_3_sec', 'more_3_sec'], 
        'peristalsis': ['absent', 'hypomotile', 'normal', 'hypermotile'], 
        'abdominal_distention': ['none', 'slight', 'moderate', 'severe'], 
        'nasogastric_tube': ['none', 'slight', 'significant'], 
        'nasogastric_reflux': ['none','slight','less_1_liter', 'more_1_liter'], 
        'abdomo_appearance': ['serosanguious', 'cloudy', 'clear'], 
        'surgical_lesion': ['no', 'yes'], 
        'cp_data': ['no', 'yes']
    }
    
    for column, levels in ordinal_and_binary_dict.items():
        df[column] = df[column].replace({
            level:i for i,level in enumerate(levels)
        })
    
    # converting the dtypes for the remaining columns to pd.Categorical 
    for column in categorical_columns:
        df[column] = df[column].astype('category')
        
    
    # feature engineering
    df['log_pulseSq_total_protein'] = -np.log(df['total_protein']) + 2*np.log(df['pulse'])
        
    return df

In [None]:
# preprocess
train = preprocess(train)
test = preprocess(test)

## Handling missing values

In the cell below, I compute the fraction of missing values in each column. (Note: Columns with no missing values are excluded).

In [None]:
def filter_greater_than(series:pd.Series,threshold:Number) -> pd.Series:
    '''
    Returns series elements greater than threshold. This function can be
    used with the .pipe method
    '''
    return series[series>threshold]

def get_perc_missing(df:pd.DataFrame) -> pd.Series:
    return (
        (df.isnull().sum()/df.shape[0]*100)
        .sort_values(ascending=False)
        .pipe(filter_greater_than,threshold=0)
        .round(3)
    )

perc_missing = get_perc_missing(train)
perc_missing

The same columns have missing entries in the test set too, and the percentage of missing values in the test set is roughly the same. So, we will need a concrete imputation strategy.

In [None]:
# get the percentage of missing entries within each
# column of the test set
get_perc_missing(test)

### Imputation strategy

1. For `abdomen` and `rectal_exam_feces`, we add a new category called `"missing"` for the missing entries.
2. For the remaining columns, we impute the missing values with the mode computed on the training set.
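
The two rules can be sketched on toy data as follows. The frames and values are made up, and only the column names mirror the notebook. The sketch uses `cat.add_categories` to register the new level, which is an alternative to the cast through `object` used in the next cell, not the notebook's own approach.

```python
import pandas as pd

# toy frames with made-up values; column names mirror the notebook's
toy_train = pd.DataFrame({
    "abdomen": pd.Categorical(["firm", None, "normal", "firm"]),
    "pain": [1.0, None, 1.0, 3.0],
})
toy_test = pd.DataFrame({
    "abdomen": pd.Categorical([None, "normal"]),
    "pain": [None, 2.0],
})

for frame in (toy_train, toy_test):
    # register the new level first: fillna on a categorical column
    # rejects values that are not already categories
    frame["abdomen"] = (
        frame["abdomen"].cat.add_categories("missing").fillna("missing")
    )

# the mode comes from the training frame only and is reused for the test frame
pain_mode = toy_train["pain"].mode().iloc[0]
toy_train["pain"] = toy_train["pain"].fillna(pain_mode)
toy_test["pain"] = toy_test["pain"].fillna(pain_mode)
```

Reusing the training-set mode for the test set keeps the imputation from depending on the test data.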

In [None]:
# for abdomen and rectal_exam_feces, add a new category called missing
for column in ['abdomen','rectal_exam_feces']:
    train[column] = train[column].astype('object').fillna('missing').astype('category')
    test[column] = test[column].astype('object').fillna('missing').astype('category')
    
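# for the remaining columns with missing values, impute with the training-set mode
# (exclude the two columns by name rather than by their position in perc_missing)
for column in perc_missing.index.drop(['abdomen','rectal_exam_feces'], errors='ignore'):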
    mode_col = train[column].mode().iloc[0]
    train[column] = train[column].fillna(mode_col)
    test[column] = test[column].fillna(mode_col)

## Feature selection

In the CatBoost baseline notebook ([here](https://www.kaggle.com/code/syerramilli/ps3e22-eda-catboost-baseline#Model-with-fewer-features)), I found that selecting a subset of features slightly improved modeling performance. I will be using the same subset here.

In [None]:
reduced_features = [
    'pain', 'total_protein', 'surgery', 'packed_cell_volume', 'lesion_type', 'abdomo_protein', 
    'lesion_site', 'mucous_membrane', 'nasogastric_reflux_ph', 'rectal_exam_feces', 
    'log_pulseSq_total_protein', 'abdomo_appearance', 'temp_of_extremities', 'respiratory_rate'
]


X = train[reduced_features]
X_test = test[reduced_features]

## AutoML via FLAML

FLAML tunes both the type of estimator (e.g., XGBoost, random forest, etc.) and the hyperparameters of each estimator simultaneously.

**Note**: In `automl_settings`, I pass `"ensemble": True`. This means that the final model will be a stacked ensemble of the best models found for each estimator type.
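
For readers unfamiliar with stacking, here is a minimal sketch of the general idea using scikit-learn's `StackingClassifier` on synthetic data. It is not a reproduction of what FLAML builds internally: the synthetic data, the two base learners, and the logistic-regression final estimator are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression

# synthetic three-class data standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# base learners; in the AutoML setting these would be the tuned models
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
    ("extra_tree", ExtraTreesClassifier(n_estimators=25, random_state=0)),
]

# the final estimator is fit on cross-validated predictions of the base learners
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_demo, y_demo)
demo_predictions = stack.predict(X_demo[:5])
```

The final estimator learns how much weight to give each base learner's predicted probabilities, which is why a stack can outperform any single member.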

In [None]:
automl = AutoML()
automl_settings = {
    "time_budget": 1500,  # total running time in seconds (25 minutes)
    "metric": 'micro_f1', 
    "task": 'classification',  # task type
    "estimator_list":['lgbm', 'rf','xgboost', 'extra_tree', 'xgb_limitdepth'],
    "log_file_name": 'health_outcomes_.log',
    "log_training_metric": True,  # whether to log training metric
    "keep_search_state": True, # needed if you want to keep the cross validation information
    "eval_method": "cv",
    "split_type": RepeatedStratifiedKFold(n_splits=10, n_repeats=4, random_state=1),
    "ensemble":True,
}


with warnings.catch_warnings():
    # skips deprecation warnings from xgboost
    warnings.simplefilter("ignore")
    automl.fit(X, train['outcome'], **automl_settings)
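
The `split_type` above means each candidate configuration is scored on 10 x 4 = 40 train/validation splits, which costs time but gives a more stable estimate of the metric. The sketch below shows what the splitter produces on made-up labels. The class counts are chosen to divide evenly across folds and do not reflect the real label distribution.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# made-up labels with an imbalanced three-class distribution
y_toy = np.array(["lived"] * 60 + ["died"] * 30 + ["euthanized"] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for splitting

splitter = RepeatedStratifiedKFold(n_splits=10, n_repeats=4, random_state=1)
folds = list(splitter.split(X_toy, y_toy))

n_folds = len(folds)  # 10 splits per repeat, 4 repeats

# per-fold class counts in the validation part, in the order lived/died/euthanized
val_counts = [
    tuple(int(np.sum(y_toy[val_idx] == label))
          for label in ("lived", "died", "euthanized"))
    for _, val_idx in folds
]
```

Stratification keeps the class proportions of each validation fold close to those of the full training set, which matters when one outcome is much rarer than the others.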

Here are the best CV micro F1 scores for each estimator type.

In [None]:
# best CV micro F1 per estimator (the stored loss is 1 - micro F1)
(1-pd.Series(automl.best_loss_per_estimator)).sort_values(ascending=False).round(4)
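
For a single-label multiclass problem such as this one, the micro-averaged F1 score coincides with plain accuracy: every misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The check below uses made-up labels and predictions.

```python
import numpy as np

labels = ["lived", "died", "euthanized"]
# made-up ground truth and predictions for illustration
y_true = np.array(["lived", "lived", "died", "euthanized", "died", "lived"])
y_pred = np.array(["lived", "died", "died", "lived", "died", "lived"])

# pool true positives, false positives and false negatives over all classes
tp = sum(int(np.sum((y_pred == c) & (y_true == c))) for c in labels)
fp = sum(int(np.sum((y_pred == c) & (y_true != c))) for c in labels)
fn = sum(int(np.sum((y_pred != c) & (y_true == c))) for c in labels)

micro_f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = float(np.mean(y_true == y_pred))
loss = 1 - micro_f1  # the form in which the score is stored as a loss
```

Here four of the six predictions are correct, so both `micro_f1` and `accuracy` equal 2/3.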

To get the corresponding configuration for each estimator, use the `.best_config_per_estimator` attribute.

In [None]:
automl.best_config_per_estimator

As mentioned earlier, the final model will be a stacked ensemble of the best models found for each estimator type. To disable this and simply select the best-performing model, set `"ensemble": False` in `automl_settings`.

In [None]:
automl.model

In [None]:
# save model
import pickle
with open('automl_health_outcomes.pkl', 'wb') as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)
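
A saved model can be restored later with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for the fitted `AutoML` object. The file name and the stand-in contents are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# stand-in object; in the notebook this would be the fitted AutoML instance
fitted_stand_in = {"description": "placeholder for a fitted model"}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"  # hypothetical file name

    # write the object to disk
    with open(model_path, "wb") as f:
        pickle.dump(fitted_stand_in, f, pickle.HIGHEST_PROTOCOL)

    # read it back
    with open(model_path, "rb") as f:
        restored = pickle.load(f)
```

Unpickling a real model needs the same libraries to be importable in the loading environment, and matching versions are the safest choice.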

## Test predictions

In [None]:
submission = pd.DataFrame({
    'id':test.index.values,
    'outcome':automl.predict(X_test).ravel()
})
submission.to_csv('submission.csv',index=False)
submission['outcome'].value_counts()/submission.shape[0]