<a href="https://www.kaggle.com/code/yaaangzhou/playground-s3-e22-eda-modeling?scriptVersionId=142802848" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Created by Yang Zhou**

**[PLAYGROUND S-3,E-22] 📊EDA**

**12 Sep 2023**

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">Predict Health Outcomes of Horses</center>
<p><center style="color:#949494; font-family: consolas; font-size: 20px;">Playground Series - Season 3, Episode 22</center></p>

***

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">Insights and Tricks</center>

+ Note that some columns in the test dataset are imbalanced.

+ The column `hospital_number` should be treated as a categorical variable because its values identify different hospitals rather than measure a quantity.

+ In the `pain` column, the test and training data contain different labels: `moderate` appears in the test data but not in the training data. I handle this by merging the test and training data before one-hot encoding.
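
The merge-then-encode trick from the last point can be sketched on a toy example. The frames `toy_train` and `toy_test` and their values are made up for illustration; only the `pain` and `outcome` column names and the `moderate` label come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are illustrative only).
toy_train = pd.DataFrame({"pain": ["mild_pain", "depressed", "mild_pain"],
                          "outcome": ["lived", "died", "lived"]})
toy_test = pd.DataFrame({"pain": ["moderate", "mild_pain"]})

# Stack both frames so get_dummies sees every category, then encode once.
combined = pd.concat([toy_train, toy_test], ignore_index=True)
encoded = pd.get_dummies(combined, columns=["pain"])

# Split back: the test rows are the ones without a target value.
enc_train = encoded[encoded["outcome"].notna()]
enc_test = encoded[encoded["outcome"].isna()].drop(columns="outcome")

# Both parts now share the same dummy columns, including pain_moderate.
print(sorted(c for c in enc_test.columns if c.startswith("pain_")))
```

Encoding each frame separately would instead give the test set a `pain_moderate` column that the training set lacks, and a model fitted on the training columns could not score the test rows.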

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">Version Detail</center>

| Version | Description | Public Score |
|---------|-------------|-----------------|
| Version 2 | Add ML models |  |
| Version 1 | AutoGluon Baseline | 0.79878 |

In [None]:
!pip install autogluon

# 0. Imports

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import math
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from collections import Counter

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Preprocessing
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler, PowerTransformer, QuantileTransformer, OrdinalEncoder, LabelEncoder
from sklearn.impute import SimpleImputer

# Model Selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import KFold
import autogluon as ag

# Models
from sklearn.ensemble import HistGradientBoostingClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Metrics
from sklearn.metrics import mean_absolute_error 
from sklearn.metrics import mean_squared_error 
from sklearn.metrics import mean_squared_log_error 
from sklearn.metrics import r2_score 
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.metrics import auc

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Adjusting plot style

rc = {
    "axes.facecolor": "#F8F8F8",
    "figure.facecolor": "#F8F8F8",
    "axes.edgecolor": "#000000",
    "grid.color": "#EBEBE7" + "30",
    "font.family": "serif",
    "axes.labelcolor": "#000000",
    "xtick.color": "#000000",
    "ytick.color": "#000000",
    "grid.alpha": 0.4
}

sns.set(rc=rc)
palette = ['#302c36', '#037d97', '#E4591E', '#C09741',
           '#EC5B6D', '#90A6B1', '#6ca957', '#D8E3E2']

from colorama import Style, Fore
blk = Style.BRIGHT + Fore.BLACK
mgt = Style.BRIGHT + Fore.MAGENTA
red = Style.BRIGHT + Fore.RED
blu = Style.BRIGHT + Fore.BLUE
res = Style.RESET_ALL

# 1. Load Data

In [None]:
train = pd.read_csv('/kaggle/input/playground-series-s3e22/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s3e22/test.csv')
# origin = pd.read_csv('/kaggle/input/horse-survival-dataset/horse.csv')
sample_submission = pd.read_csv('/kaggle/input/playground-series-s3e22/sample_submission.csv')

# Drop column id

train.drop('id',axis=1,inplace=True)
test.drop('id',axis=1,inplace=True)

total = pd.concat([train, test], ignore_index=True)
total = total.drop_duplicates()
total

print('The shape of the train data:', train.shape)
print('The shape of the test data:', test.shape)
# print('The shape of the origin data:', origin.shape)
print('The shape of the total data:', total.shape)

In [None]:
train.head()

# 2. EDA


In [None]:
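target = 'outcome'

# Columns with more than 10 distinct values are treated as numerical
num_var = [column for column in train.columns if train[column].nunique() > 10]

# Binary and low-cardinality columns are treated as categorical.
# Note: a column with exactly 10 distinct values lands in neither num_var nor cat_var.
bin_var = [column for column in train.columns if train[column].nunique() == 2]
cat_var = [column for column in train.columns if train[column].nunique() < 10]
cat_var.remove(target)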

In [None]:
train.describe().T\
    .style.bar(subset=['mean'], color=px.colors.qualitative.G10[2])\
    .background_gradient(subset=['std'], cmap='Blues')\
    .background_gradient(subset=['50%'], cmap='BuGn')

**`hospital_number` should be treated as a categorical variable; I will handle this in feature engineering.**
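
One possible way to do that conversion is sketched below on a toy frame. The notebook itself does not perform this step, and the identifier values here are made up; the only thing assumed from the data is the column name `hospital_number`.

```python
import pandas as pd

# Toy frame with made-up identifier values.
toy = pd.DataFrame({"hospital_number": [530001, 533836, 530001, 529812]})

# Cast the identifier to a pandas categorical so it is no longer treated as a number.
toy["hospital_number"] = toy["hospital_number"].astype("category")

print(toy["hospital_number"].dtype)
print(toy["hospital_number"].nunique())
```

With many distinct identifiers, one-hot encoding this column would add many sparse columns, so whether to encode it, group rare values, or drop it is a separate design decision.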

In [None]:
def summary(df):
    # Named summary_df to avoid shadowing the built-in sum()
    summary_df = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary_df['missing#'] = df.isna().sum()
    summary_df['missing%'] = (df.isna().sum())/len(df)
    summary_df['uniques'] = df.nunique().values
    summary_df['count'] = df.count().values
    #summary_df['skew'] = df.skew().values
    return summary_df

summary(train).style.background_gradient(cmap='Blues')

**The `rectal_exam_feces` and `abdomen` columns have more missing values than the other columns.**

**First, I want to look at the distribution of the categorical features, including the target.**

In [None]:
columns_cat = [column for column in train.columns if train[column].nunique() < 10]

def plot_count(df,columns,n_cols):
    '''
    Draw a grid of count plots, one per column.
    df: DataFrame to plot (train or test)
    columns: list of categorical column names
    n_cols: number of columns in the subplot grid
    '''
    n_rows = (len(columns) - 1) // n_cols + 1
    fig, ax = plt.subplots(n_rows, n_cols, figsize=(17, 4 * n_rows))
    ax = ax.flatten()
    
    for i, column in enumerate(columns):
        sns.countplot(data=df, x=column, ax=ax[i])

        # Titles
        ax[i].set_title(f'{column} Counts', fontsize=18)
        ax[i].set_xlabel(None, fontsize=16)
        ax[i].set_ylabel(None, fontsize=16)
        ax[i].tick_params(axis='x', rotation=10)

        for p in ax[i].patches:
            value = int(p.get_height())
            ax[i].annotate(f'{value:.0f}', (p.get_x() + p.get_width() / 2, p.get_height()),
                           ha='center', va='bottom', fontsize=9)

        # Add headroom on every axis so the bar annotations are not clipped
        ylim_top = ax[i].get_ylim()[1]
        ax[i].set_ylim(top=ylim_top * 1.1)
    for i in range(len(columns), len(ax)):
        ax[i].axis('off')

    # fig.suptitle(plotname, fontsize=25, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
plot_count(train,columns_cat,3)

In [None]:
columns_cat = [column for column in train.columns if train[column].nunique() < 10 and column != target]
plot_count(test,columns_cat,3)

**Some features are imbalanced, for example:**
1. `age` consists almost entirely of adults.
2. `peripheral_pulse` has very few `absent` and `increased` values.
3. It looks like `lesion_2` and `lesion_3` carry little useful information.
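
The imbalances read off the count plots can also be put into numbers. The helper below is a hypothetical sketch (the name `category_shares` and the toy data are mine, not part of this notebook) that lists the share of each category in train and test side by side.

```python
import pandas as pd

def category_shares(train_df, test_df, column):
    """Return the share of each category in train and test side by side."""
    shares = pd.concat(
        [train_df[column].value_counts(normalize=True).rename("train"),
         test_df[column].value_counts(normalize=True).rename("test")],
        axis=1,
    ).fillna(0.0)  # a category missing from one frame gets a share of 0
    return shares.sort_values("train", ascending=False)

# Toy data with made-up proportions.
toy_train = pd.DataFrame({"age": ["adult"] * 9 + ["young"]})
toy_test = pd.DataFrame({"age": ["adult"] * 4 + ["young"]})

shares = category_shares(toy_train, toy_test, "age")
print(shares)
```

Calling it as `category_shares(train, test, 'pain')` on the real frames would also surface labels that exist in only one of the two sets, since their share in the other set shows up as 0.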

**Now let me have a look at numerical features.**

In [None]:
def plot_pair(df_train,num_var,target,plotname):
    '''
    Function to make a pair plot.
    df_train: DataFrame to plot
    num_var: list of numeric variables
    target: target variable, used as hue
    plotname: figure title
    '''
    g = sns.pairplot(data=df_train, x_vars=num_var, y_vars=num_var, hue=target, corner=True)
    g._legend.set_bbox_to_anchor((0.8, 0.7))
    g._legend.set_title(target)
    g._legend.loc = 'upper center'
    g._legend.get_title().set_fontsize(14)
    for item in g._legend.get_texts():
        item.set_fontsize(14)

    plt.suptitle(plotname, ha='center', fontweight='bold', fontsize=25, y=0.98)
    plt.show()

plot_pair(train,num_var,target,plotname = 'Scatter Matrix with Target')

In [None]:
df = pd.concat([train[num_var].assign(Source='Train'),
                test[num_var].assign(Source='Test')],
               axis=0, ignore_index=True)

fig, axes = plt.subplots(len(num_var), 3, figsize=(16, len(num_var) * 4.2),
                         gridspec_kw={'hspace': 0.35, 'wspace': 0.3, 'width_ratios': [0.80, 0.20, 0.20]})

for i, col in enumerate(num_var):
    # Left panel: train vs. test density
    ax = axes[i, 0]
    sns.kdeplot(data=df[[col, 'Source']], x=col, hue='Source', ax=ax, linewidth=2.1)
    ax.set_title(f"\n{col}", fontsize=9, fontweight='bold')
    ax.grid(visible=True, which='both', linestyle='--', color='lightgrey', linewidth=0.75)
    ax.set(xlabel='', ylabel='')

    # Middle panel: box plot of the training data
    ax = axes[i, 1]
    sns.boxplot(data=df.loc[df.Source == 'Train', [col]], y=col, width=0.25, saturation=0.90,
                linewidth=0.90, fliersize=2.25, color='#037d97', ax=ax)
    ax.set(xlabel='', ylabel='')
    ax.set_title("Train", fontsize=9, fontweight='bold')

    # Right panel: box plot of the test data
    ax = axes[i, 2]
    sns.boxplot(data=df.loc[df.Source == 'Test', [col]], y=col, width=0.25, fliersize=2.25,
                saturation=0.6, linewidth=0.90, color='#E4591E', ax=ax)
    ax.set(xlabel='', ylabel='')
    ax.set_title("Test", fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()


**`lesion_3` does not seem to carry useful information, so I will drop it in feature engineering.**

**Now, let's look at the distribution of numerical features in the training set.**

In [None]:
plt.figure(figsize=(14, len(num_var) * 2.5))

for idx, column in enumerate(num_var):
    plt.subplot(len(num_var), 2, idx*2+1)
    sns.histplot(x=column, hue="outcome", data=train, bins=30, kde=True)
    plt.title(f"{column} Distribution for outcome")
    plt.ylim(0, train[column].value_counts().max() + 10)
    
plt.tight_layout()
plt.show()

In [None]:
corr_matrix = train[num_var].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(15, 12))
sns.heatmap(corr_matrix, mask=mask, annot=False, cmap='Blues', fmt='.2f', linewidths=1, square=True, annot_kws={"size": 9} )
plt.title('Correlation Matrix', fontsize=15)
plt.show()

# 3. Feature Selection

+ **During feature selection, I can perform the following tests:**
    + **For categorical variables, a chi-square test will be performed to observe their relationship with the target.**
    + **We can also use SFS and RFECV for automatic feature selection.**
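
A minimal sketch of the chi-square test mentioned above, using `scipy.stats.chi2_contingency` on a contingency table. The helper name `chi2_pvalue` and the toy data are assumptions for illustration; the notebook does not run this test itself.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_pvalue(df, feature, target):
    """p-value of a chi-square independence test between two categorical columns."""
    table = pd.crosstab(df[feature], df[target])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value

# Made-up data in which the outcome clearly depends on the feature.
toy = pd.DataFrame({
    "surgery": ["yes"] * 40 + ["no"] * 40,
    "outcome": ["lived"] * 30 + ["died"] * 10 + ["lived"] * 10 + ["died"] * 30,
})

p = chi2_pvalue(toy, "surgery", "outcome")
print(f"p-value: {p:.4g}")
```

A small p-value suggests the feature and the target are not independent. With many sparse categories the expected cell counts get small and the test becomes less reliable, so the result is best read as a rough screening signal.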

**You can find a complete and detailed tutorial in this [notebook](https://www.kaggle.com/code/alvinleenh/ps3e21-6-basic-feature-selection-techniques), written by [DR. ALVINLEENH](https://www.kaggle.com/alvinleenh).**
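
For the automatic route, recursive feature elimination with cross-validation could look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the choice of `LogisticRegression` as the estimator is my assumption, not something this notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 8 features, of which only 3 are informative.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=42)

# Drop one feature per step and keep the subset with the best CV accuracy.
selector = RFECV(estimator=LogisticRegression(max_iter=1000), step=1,
                 cv=StratifiedKFold(n_splits=5), scoring="accuracy")
selector.fit(X_toy, y_toy)

print("Number of selected features:", selector.n_features_)
print("Selection mask:", selector.support_)
```

On the real data the same pattern would apply to the encoded training frame, with `selector.support_` used to filter its columns.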

## Preprocessing

In [None]:
# Mapping target to numbers
train[target] = train[target].map({'died':0,'euthanized':1,'lived':2})
total[target] = total[target].map({'died':0,'euthanized':1,'lived':2})

In [None]:
total = pd.get_dummies(total, columns=['surgery',
                                       'age',
                                       'temp_of_extremities',
                                       'peripheral_pulse',
                                       'mucous_membrane',
                                       'capillary_refill_time',
                                       'pain',
                                       'peristalsis',
                                       'abdominal_distention',
                                       'nasogastric_tube',
                                       'nasogastric_reflux',
                                       'rectal_exam_feces',
                                       'abdomen',
                                       'abdomo_appearance',
                                       'surgical_lesion',
                                       'cp_data'])

In [None]:
df_train = total.loc[0:train.index[-1]]
df_test = total[total[target].isna()]

In [None]:
def features_engineering(df):
    # Drop useless cols

    # df_encoded = df.copy()
    # df_encoded.drop(['lesion_3','lesion_2'],axis = 1, inplace = True)
    
    # StandardScaler for numeric features
    # sc = StandardScaler()
    # for var in num_var:
        # df[var] = sc.fit_transform(df[var].values.reshape(-1,1))
    
    return df

# train = features_engineering(train)
# test = features_engineering(test)

# 4. Modeling

In [None]:
content_xgb_cv_scores, content_xgb_preds = list(), list()
content_lgbm_cv_scores, content_lgbm_preds = list(), list()
content_rf_cv_scores, content_rf_preds = list(), list()
content_ens_cv_scores, content_ens_preds = list(), list()

kf = KFold(n_splits=5, random_state=42, shuffle=True)

X = df_train.drop(target,axis=1)
Y = df_train[target]

for i, (train_ix, test_ix) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_ix], X.iloc[test_ix]
    Y_train, Y_test = Y.iloc[train_ix], Y.iloc[test_ix]
    
    print('---------------------------------------------------------------')
    
    ## RandomForestClassifier
    rf_content = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, Y_train)
    rf_pred = rf_content.predict(X_test)   
    accuracy = accuracy_score(Y_test, rf_pred)  
    print('Fold', i+1, '==> RandomForestClassifier oof Accuracy score is ==>', accuracy)
    content_rf_cv_scores.append(accuracy)
    
    ## Pred
    rf_pred_test = rf_content.predict_proba(df_test.drop(target,axis=1))
    content_rf_preds.append(rf_pred_test)
    
    ## XGBClassifier
    xgb_content = XGBClassifier(n_estimators=100, random_state=42).fit(X_train, Y_train)
    xgb_pred = xgb_content.predict(X_test)   
    accuracy_xgb = accuracy_score(Y_test, xgb_pred)  
    print('Fold', i+1, '==> XGBoost oof Accuracy score is ==>', accuracy_xgb)
    content_xgb_cv_scores.append(accuracy_xgb)
    
    ## Pred
    xgb_pred_test = xgb_content.predict_proba(df_test.drop(target,axis=1))
    content_xgb_preds.append(xgb_pred_test)
    
    ## LightGBM
    lgbm_content = LGBMClassifier(n_estimators=100, random_state=42).fit(X_train, Y_train)
    lgbm_pred = lgbm_content.predict(X_test)   
    accuracy_lgbm = accuracy_score(Y_test, lgbm_pred)  
    print('Fold', i+1, '==> LightGBM oof Accuracy score is ==>', accuracy_lgbm)
    content_lgbm_cv_scores.append(accuracy_lgbm)
    
    ## Pred
    lgbm_pred_test = lgbm_content.predict_proba(df_test.drop(target,axis=1))
    content_lgbm_preds.append(lgbm_pred_test)
    
    ## Ensemble Model
    voting_classifier = VotingClassifier(estimators=[
        ('xgb', xgb_content),
        ('lgbm', lgbm_content),
        ('rf', rf_content)
    ], voting='hard')
    voting_classifier.fit(X_train, Y_train)
    ensemble_pred = voting_classifier.predict(X_test)
    accuracy_ens = accuracy_score(Y_test, ensemble_pred)
    
    print('Fold', i+1, '==> Ensemble Model oof Accuracy score is ==>', accuracy_ens)
    content_ens_cv_scores.append(accuracy_ens)

print('---------------------------------------------------------------')
print('Average Accuracy of XGBoost model is:', np.mean(content_xgb_cv_scores))
print('Average Accuracy of LGBM model is:', np.mean(content_lgbm_cv_scores))
print('Average Accuracy of RF model is:', np.mean(content_rf_cv_scores))
print('Average Accuracy of Ensemble Model is:', np.mean(content_ens_cv_scores))
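
The loop above collects per-fold test probabilities in `content_rf_preds`, `content_xgb_preds` and `content_lgbm_preds`, but they are not used afterwards. One way they could be combined is to average them across folds and take the most likely class. The sketch below uses a made-up stand-in list, `fold_probas`, and assumes every fold saw all three classes so that the probability columns line up as 0, 1, 2.

```python
import numpy as np

# Stand-in for e.g. content_rf_preds: one (n_test_rows, n_classes) array per fold.
fold_probas = [
    np.array([[0.6, 0.1, 0.3], [0.2, 0.2, 0.6]]),
    np.array([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6]]),
]

# Average the class probabilities over the folds, then pick the top class per row.
mean_proba = np.mean(fold_probas, axis=0)
class_ids = mean_proba.argmax(axis=1)

# Same integer-to-label mapping as used for the target in this notebook.
label_map = {0: 'died', 1: 'euthanized', 2: 'lived'}
labels = [label_map[i] for i in class_ids]
print(labels)  # ['died', 'lived']
```

Compared with predicting from the model of the last fold only, this uses every fold's model, at the cost of keeping all probability arrays in memory.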

In [None]:
# Simple Voting Classifier
# Note: voting_classifier is the model fitted in the last CV fold above

ens_preds = voting_classifier.predict(df_test.drop(target,axis=1))


ens_submission = pd.DataFrame({'id': sample_submission['id'], 'outcome': ens_preds})
ens_submission['outcome'] = ens_submission['outcome'].map({0:'died',1:'euthanized',2:'lived'})
ens_submission.to_csv('ens_submission.csv',index=False)

In [None]:
# Single LightGBM model trained on the full training set
lgb_md = LGBMClassifier(objective='multiclass', metric='auc_mu', feature_pre_filter=False,
                        num_leaves=248, min_child_samples=20, num_iterations=50,
                        early_stopping_round=None).fit(X, Y)
lgb_preds = lgb_md.predict(df_test.drop(target,axis=1))

lgb_submission = pd.DataFrame({'id': sample_submission['id'], 'outcome': lgb_preds})
lgb_submission['outcome'] = lgb_submission['outcome'].map({0:'died',1:'euthanized',2:'lived'})
# Written to its own file so that ens_submission.csv is not overwritten
lgb_submission.to_csv('lgb_submission.csv',index=False)

# 5. Baseline with AutoGluon

I started with a baseline model built with AutoGluon, an automated machine learning framework (Version 1 in the table above).

In [None]:
from autogluon.tabular import TabularDataset, TabularPredictor

#train_data = TabularDataset('/kaggle/input/playground-series-s3e22/train.csv')
#test_data = TabularDataset('/kaggle/input/playground-series-s3e22/test.csv')

predictor = TabularPredictor(label='outcome').fit(df_train)
preds = predictor.predict(df_test.drop(target,axis=1))

In [None]:
preds = preds.map({0:'died',1:'euthanized',2:'lived'})

In [None]:
auto_submission = pd.DataFrame({'id': sample_submission['id'], 'outcome': preds})
auto_submission.to_csv('auto_submission.csv',index=False)

In [None]:
auto_submission