**Achintya Gupta**

*5th Mar 2020*

The notebook structure:
            1. Introduction
            2. Imports
            3. Path and Initialisations
            4. Reading the Data
            5. Helper Function Class.
               Function definitions:
                5.1 calcLogLossError
                5.2 createProbabHeatmap
            6. Data Processing Class.
               Function definitions:
                6.1 preprocess_dataset
                6.2 feature_engineering
                6.3 process_dataset
                6.4 impute
                6.5 train_cv_split
            7. Preparing the Data
            8. Exploratory Data Analysis
            9. Modelling
                9.1 Base Modelling
                9.2 Feature Combination Selection
                9.3 Hyper Parameter Tuning
                9.4 Model Interpretation


# Introduction

Can you predict whether an animal leaving a shelter home was **Adopted** by someone, **Returned_to_owner**, or **Transferred** someplace else or to someone else, or whether, sadly, it **Died** or was **Euthanised** while in the shelter?

<img src="https://images.unsplash.com/photo-1450778869180-41d0601e046e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2650&q=80" width="600px">


***The Objective***, as stated above, is simple: "Predict the **OutcomeType** of an animal leaving the shelter home".

****

- You are given a dataset with the following features:

    - **Name** : Name of the animal, if it had one.
    - **DateTime** : Timestamp at which it was brought in to the shelter home.
    - **OutcomeType** : Outcome of the animal, 5 classes to ***predict***.
    - **OutcomeSubtype** : More granular detail about the type of outcome for that specific animal.
    - **AnimalType** : Type of animal, Dog or Cat.
    - **SexuponOutcome** : Sex of the animal when it leaves the shelter home. (Don’t know why it's called ‘SexuponOutcome’ and not just 'Gender'? It kinda implies there might be a difference between 'SexuponOutcome' and 'SexuponIntake', given that they troubled themselves to name it this way.)
    - **AgeuponOutcome** : Age of the animal when it left the shelter home.
    - **Breed** : Breed of the animal.
    - **Color** : Color of the animal.

# Imports

In [None]:
# General
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import random, re, itertools
from IPython.display import Math

#Visualisations
import matplotlib.pyplot as plt
import seaborn as sns

# Algo
from sklearn.ensemble import RandomForestClassifier

# Path and Initialisations

In [None]:
TRAIN_PATH = '/kaggle/input/shelter-animal-outcomes/train.csv.gz'
TEST_PATH = '/kaggle/input/shelter-animal-outcomes/test.csv.gz'
SAMPLESUBMISSION_PATH = '/kaggle/input/shelter-animal-outcomes/sample_submission.csv.gz'


TARGET_VARIABLE = 'OutcomeType'

# Read Data

In [None]:
train_data = pd.read_csv(TRAIN_PATH, index_col=0)
test_data = pd.read_csv(TEST_PATH, index_col=0)
samplesubmission_data = pd.read_csv(SAMPLESUBMISSION_PATH, index_col=0)

print('Shape of Train Data : ', train_data.shape)
print('Shape of Test Data : ', test_data.shape)


# Helper Function Class

First, let's create a Helper Function class, which has just two functions, but ones used frequently throughout this notebook:

1. ***calcLogLossError*** : Calculates the log loss of all the predicted probabilities. Each row of predicted probabilities is first rescaled to sum to 1, and every probability is then clipped to

    *max(min(x,1-1e-15),1e-15)* to prevent the logarithm from blowing up.

$logloss = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y_{ij}\log(p_{ij})$

where $N$ is the total number of animals, $M$ is the total number of classes to be predicted, $y_{ij}$ is 1 if animal $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that animal $i$ belongs to class $j$.


2. ***createProbabHeatmap*** : Creates a confusion matrix combined with scatter plots of the predicted probabilities and their mean. Inspired by [this post](https://andraszsom.wordpress.com/2016/07/27/kaggle-for-the-paws/).
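
To make the log loss formula concrete, here is a minimal NumPy sketch of the same computation on three made-up predictions. The function name `multiclass_log_loss` and the toy arrays are assumptions for illustration only; the notebook itself uses `calcLogLossError` on a DataFrame.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """y_true: integer class indices, shape (N,). proba: array of shape (N, M)."""
    proba = np.asarray(proba, dtype=float)
    # Renormalise each row to sum to 1, then clip away exact 0s and 1s
    proba = proba / proba.sum(axis=1, keepdims=True)
    proba = np.clip(proba, eps, 1 - eps)
    # y_ij is 1 only for the true class, so the double sum reduces to
    # picking log(p) of the true class in every row
    picked = proba[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(picked))

# Hypothetical predictions for three animals and three classes
y_true = np.array([0, 1, 2])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
score = multiclass_log_loss(y_true, proba)

# A confident but wrong prediction stays finite thanks to the clipping
worst_case = multiclass_log_loss(np.array([0]), np.array([[0.0, 1.0]]))
```

Without the clipping step, `worst_case` would be infinite, which is exactly the blow-up the helper guards against.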



In [None]:
class HelperFunctions:
    def __init__(self):
        print('Initialising the HelperFunctions class..')
        
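    def calcLogLossError(self, metrics):
        # Work on a copy so the caller's DataFrame is left untouched
        metrics = metrics.copy()

        # The first five columns hold the class probabilities:
        # renormalise each row to sum to 1, then clip away exact 0s and 1s
        proba = metrics.iloc[:, :5]
        proba = proba.div(proba.sum(axis=1), axis=0).clip(1e-15, 1 - 1e-15)

        # One-hot encode the actual classes and average the negative log-likelihood
        actuals = pd.get_dummies(metrics['Actuals']).reindex(columns=proba.columns, fill_value=0)
        loglossScore = -(np.log(proba) * actuals).sum().sum() / metrics.shape[0]
        return loglossScore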
               
    def createProbabHeatmap(self, cv_metrics):
        
        cv_metrics.sort_values('Actuals', inplace=True)
        plt.rcParams['figure.dpi'] = 180
        plt.rcParams['figure.figsize'] = (18,18)
        fig, axes =plt.subplots(5,5, sharex=False, sharey=True)
        fig.text(0.5, -0.05, 'Index', ha='center', fontsize=25)
        fig.text(-0.05, 0.5, 'Predicted Probabilities', va='center', rotation='vertical', fontsize=25)
        plt.tight_layout()

        for idx, eachClass in enumerate(cv_metrics.Actuals.unique()):

            probMatrix = cv_metrics[cv_metrics.Actuals == eachClass].reset_index().drop('AnimalID', axis=1)
            temp2 = {k:0 for k in cv_metrics.Actuals.unique()}
            predictionCounts = probMatrix.Predictions.value_counts().to_dict()
            for k in predictionCounts.keys():
                temp2[k] = predictionCounts[k]
            predictionCounts = temp2

            for idx1, eachPlotAX in enumerate(axes[idx]):

                sns.scatterplot(x = probMatrix.iloc[:,idx1].index, 
                                y = probMatrix.iloc[:,idx1].values, 
                                ax = eachPlotAX,
                                hue = probMatrix.iloc[:,idx1].values,
                                s = 20,
                                legend=False,
                                palette = 'plasma')
                eachPlotAX.hlines(np.mean(probMatrix.iloc[:,idx1].values),
                                  0, probMatrix.iloc[:,idx1].index[-1]*1.1, color='maroon', linewidth=3)
                
                if idx == idx1 : 
                    eachPlotAX.set_facecolor('darkgray')
                    eachPlotAX.text(0.5,0.5, "{0}/{1}".format(predictionCounts[probMatrix.iloc[:,idx1].name],
                                                           probMatrix.shape[0]), size=25,
                                ha="center", color='black', transform=eachPlotAX.transAxes)
                else:
                    eachPlotAX.set_facecolor('lightblue')
                    eachPlotAX.text(0.5,0.5, "{0}/{1}".format(predictionCounts[probMatrix.iloc[:,idx1].name],
                                                           probMatrix.shape[0]), size=15,
                                ha="center", color='black', transform=eachPlotAX.transAxes)
                    
                eachPlotAX.grid()
                eachPlotAX.linewidth =10

                if idx1 == 0 : eachPlotAX.set_ylabel(eachClass)
                if idx == len(axes)-1 : eachPlotAX.set_xlabel(cv_metrics.columns[:-1][idx1])

        plt.close()
        return fig

helper_functions_handler = HelperFunctions()

# Data Processing Class

Now, let's get to **THE** business! Processing the data.



1. ***preprocess_dataset*** : Preprocessing the dataset,
    - 1.1) The column ***'OutcomeSubtype'*** is dropped from the training dataset.
    - 1.2) The column ***'DateTime'*** is converted to a pandas datetime object. Since the dataset contains a datetime element, it is a good idea to sort the dataset on this column so that the last rows correspond to the most recent data, which in a way replicates the data a model would get in the real world.
    - 1.3) The column ***'Name'***, holding the name of the pet, is only useful to the extent of *"did the pet have a name?"*. Assuming NaN in this column means that the pet did not have a name, the values are converted to "HAS_NAME" and "NO_NAME".
    - 1.4) The column ***'AgeuponOutcome'*** is the age of the pet, expressed in units such as years, months and days, so it is converted to *'AgeuponOutcomeDays'*.
    
2. ***feature_engineering*** : Engineering new features. Each engineered variable tries to answer the question listed next to it:
    - 2.1) ***Calendar Variables*** : From the *DateTime* column, **Month**, **Day**, **WeekOfYear**, **DayOfWeek** and **DayType**, which represents the part of the day in which the pet was delivered to the shelter home.
    - 2.2) ***AnimalType_Breed*** : Combining two categorical columns: which type of animal of which breed? Does that combination affect our target variable?
    - 2.3) ***AnimalType_Color*** : Combining two categorical columns: which type of animal of which color? Does that combination affect our target variable?
    - 2.4) ***AnimalType_Breed_Color*** : Combining three categorical columns: which type of animal of which color and which breed? Does that combination affect our target variable?
    - 2.5) ***AnimalType_SexuponOutcome*** : Combining two categorical columns: which type of animal of which sex/gender? Does that combination affect our target variable?
    - 2.6) ***AnimalType_Name*** : Combining two categorical columns: which type of animal, and did it have a name? Does that combination affect our target variable?
    - 2.7) ***AnimalType_DayType*** : Combining two categorical columns: which type of animal, brought in at which time of the day? Does that combination affect our target variable?
    
    - 2.8) ***AgeGroup*** : Bucketing the age of the pet into 50 equal-width buckets. The idea, or the *hope*, is that the age group might have a relationship with the target variable.
    
3. ***process_dataset*** : A function which accepts the engineered data and a transformation mapping that assigns each column one of these types:
    - **NORM_MAX** : Normalising a numerical column by the maximum of that column.
    - **NORM_MAXMIN** : Normalising a numerical column by the range (maximum minus minimum) of that column.
    - **NORM_SUM** : Normalising a numerical column by the sum of that column.
    - **NORM_CUSTOM** : Normalising a numerical column by a custom value.
    - **ORIGNAL** : Leave the column as it is.
    - **OHE** : One Hot Encoding.
    - **OHE_DROP** : One Hot Encoding, dropping the first level because it is redundant.
    - **LE_O** : Label Encoding of a categorical column that is ordinal in nature, i.e. has ordered categories.
    - **LE_N** : Label Encoding of a categorical column that is nominal in nature, i.e. has no order within the categories.
    - **LE_O_NORM** : Label Encoding of an ordinal categorical column, followed by normalising with the max (for Neural Nets).
    - **LE_N_NORM** : Label Encoding of a nominal categorical column, followed by normalising with the max (for Neural Nets).
    - **DROP** : Dropping the column.
4. ***impute*** : For now, just filling the missing values with -1, in the hope that missingness itself might have a relationship with the target variable.
5. ***train_cv_split*** : Since the dataset has a time component, the rows are not shuffled: the first 80% of the date-sorted data forms the training set and the last 20% forms the cross-validation set.
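
The difference between the ordinal (LE_O) and nominal (LE_N) label encodings listed above can be sketched with `pd.Categorical`, which is what `process_dataset` relies on. The toy series and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical column with one value ('Dawn') that is not in the ordered list
day_type = pd.Series(['Night', 'Morning', 'Noon', 'Morning', 'Dawn'])

# LE_O: the category order is supplied explicitly, so codes follow that order
ordinal = pd.Categorical(day_type, categories=['Morning', 'Noon', 'Night'])
ordinal_codes = list(ordinal.codes)

# LE_N: no order supplied, so pandas sorts the observed values itself
nominal = pd.Categorical(day_type)
nominal_codes = list(nominal.codes)
nominal_categories = list(nominal.categories)
```

Values missing from the supplied categories come out as -1. Because LE_N derives its categories from whatever values are present, encoding two datasets in separate calls can assign different codes to the same label, which is worth checking when reusing this pattern.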


In [None]:
train_data.head()

In [None]:
train_data.isna().sum()

In [None]:
class DataProcessing:
    def __init__(self):
        print('Initialising Data Processing Class..')

    def preprocess_dataset(self, df, mode='train', sort_bydate=True):
        
        # Drop OutcomeSubtype
        if mode == 'train':
            df.drop(['OutcomeSubtype'], axis=1, inplace=True)

        # Parse DateTime and, optionally, sort chronologically
        df['DateTime'] = pd.to_datetime(df['DateTime'])
        
        if sort_bydate:
            df.sort_values('DateTime', inplace=True)
        
        # Collapse Name into a has-name / no-name flag (NaN is taken to mean "no name")
        df['Name'] = np.where(df['Name'].isna(), 'NO_NAME', 'HAS_NAME')


        # Transform AgeuponOutcome TO AgeuponOutcomeDays
        def transformAgeuponOutcome(x):
            x = str(x)
            if x == 'nan':
                return np.nan
            elif 'year' in x:
                return int(x.split(' ')[0])*365
            elif 'month' in x:
                return int(x.split(' ')[0])*30
            elif 'week' in x:
                return int(x.split(' ')[0])*7
            elif 'day' in x:
                return int(x.split(' ')[0])*1
            # Unrecognised unit: treat the age as missing
            return np.nan

        df['AgeuponOutcome'] = df.AgeuponOutcome.apply(transformAgeuponOutcome)
        df.rename(columns={'AgeuponOutcome':'AgeuponOutcomeDays'},inplace=True)

        return df
    
    def feature_engineering(self, df):
        
        # Calendar Variables
        def generateHourOfTheDay(x):
            if x < 12:
                return 'Morning'
            elif x>=12 and x < 17:
                return 'Noon'
            elif x>=17 and x < 20:
                return 'Eevening'
            elif x >= 20:
                return 'Night'
            
        # Simplify Breed..
        df.Breed = df.Breed.apply(lambda x : re.sub(' Mix', '', x))
        df.Breed = df.Breed.apply(lambda x : x.split('/')[0])
        df.Color = df.Color.apply(lambda x : x.split('/')[0])
        
        df['Month'] = df['DateTime'].dt.month
        df['Day'] = df['DateTime'].dt.day
        # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
        df['WeekOfYear'] = df['DateTime'].dt.isocalendar().week.astype(int)
        df['DayOfWeek'] = df['DateTime'].dt.dayofweek
        df['DayType'] = df['DateTime'].dt.hour.apply(generateHourOfTheDay)
        df.drop(['DateTime'], axis=1, inplace=True)
        
        df['AnimalType_Breed'] = df.apply(lambda x : x.AnimalType+'_'+x.Breed, axis=1)
        df['AnimalType_Color'] = df.apply(lambda x : x.AnimalType+'_'+x.Color, axis=1)
        df['AnimalType_Breed_Color'] = df.apply(lambda x : x.AnimalType_Breed+'_'+x.Color, axis=1)
        df['AnimalType_SexuponOutcome'] = df.apply(lambda x : np.nan if pd.isna(x.SexuponOutcome) else x.AnimalType+'_'+x.SexuponOutcome, 
                                                   axis=1)
        df['AnimalType_Name'] = df.apply(lambda x : x.AnimalType+'_'+x.Name, axis=1)
        df['AnimalType_DayType'] = df.apply(lambda x : x.AnimalType+'_'+x.DayType, axis=1)
        
        df['AgeGroup'] = pd.cut(df.AgeuponOutcomeDays.fillna(0), bins=50)

        return df

    def process_dataset(self, df, transformDict, mode='train', verbose=True):
        """Available Types :
          'NORM_MAX','NORM_MAXMIN', 'NORM_SUM','NORM_CUSTOM', 'ORIGNAL', 'OHE',
          'OHE_DROP', 'LE_O', 'LE_N', 'LE_O_NORM', 'LE_N_NORM', 'DROP'"""
        
        catMappings = {}
        print('Transforming Variables..')
        
        for eachKey, eachValue in transformDict.items():
            
            
            if eachValue['Type'] == 'NORM_MAX':
                df[eachKey] = df[eachKey]/df[eachKey].max()

            elif eachValue['Type'] == 'NORM_MAXMIN':
                df[eachKey] = df[eachKey]/(df[eachKey].max()-df[eachKey].min())

            elif eachValue['Type'] == 'NORM_SUM':
                df[eachKey] = df[eachKey]/df[eachKey].sum()
            
            elif eachValue['Type'] == 'NORM_CUSTOM':
                # 'Params' holds the custom value to divide by
                df[eachKey] = df[eachKey]/eachValue['Params']

            elif eachValue['Type'] == 'ORIGNAL':
                df[eachKey] = df[eachKey]

            elif eachValue['Type'] == 'OHE':
                # Plain one-hot encoding keeps every level (OHE_DROP is the variant that drops the first)
                _dummies = pd.get_dummies(df[[eachKey]].copy(),
                           prefix=[eachKey], 
                           prefix_sep = '_',
                           columns = [eachKey], 
                           drop_first=False)

                df[_dummies.columns] = _dummies

            elif eachValue['Type'] == 'OHE_DROP':
                _dummies = pd.get_dummies(df[[eachKey]].copy(),
                           prefix=[eachKey], 
                           prefix_sep = '_',
                           columns = [eachKey], 
                           drop_first=True)

                df[_dummies.columns] = _dummies
                df.drop(eachKey, axis=1, inplace=True)

            elif eachValue['Type'] == 'LE_O':
                _param = eachValue['Params']
                _CM = pd.Categorical(df[eachKey], categories=_param)
                df[eachKey] = _CM.codes
                catMappings[eachKey] = _CM
            
            elif eachValue['Type'] == 'LE_N':
                _CM = pd.Categorical(df[eachKey])
                df[eachKey] = _CM.codes
                catMappings[eachKey] = _CM

            elif eachValue['Type'] == 'LE_O_NORM':
                _param = eachValue['Params']
                _CM = pd.Categorical(df[eachKey], categories=_param)
                df[eachKey] = _CM.codes
                df[eachKey] = df[eachKey]/df[eachKey].max()
                catMappings[eachKey] = _CM
            
            elif eachValue['Type'] == 'LE_N_NORM':
                _CM = pd.Categorical(df[eachKey])
                df[eachKey] = _CM.codes
                df[eachKey] = df[eachKey]/df[eachKey].max()
                catMappings[eachKey] = _CM

            elif eachValue['Type'] == 'DROP':
                df.drop(eachKey, axis=1, inplace=True)

        if mode == 'train':
            _CM = pd.Categorical(df[TARGET_VARIABLE])
            df[TARGET_VARIABLE] = _CM.codes
            catMappings[TARGET_VARIABLE] = _CM
            self.trainCatMappings = catMappings
            
        elif mode == 'test':
            self.testCatMappings = catMappings

        if verbose : print('Shape of the {0} dataset After Processing : {1}\n'.format(mode, df.shape))

        return df    

    def impute(self, df):
        
        df.fillna(-1, inplace=True)
        
        return df
        
    def train_cv_split(self, df, ratio=0.2):
        
        upto = int(df.shape[0]*ratio)
        trainDF = df[:-upto]
        cvDF = df[-upto:]
        
        return trainDF, cvDF
    
dataProcessingHandler = DataProcessing()

# Processing the data

Processing the data for the Train and Test sets.

**Pre Processing** > **Feature Engineering** > **Process Data** > **Impute** 

But first, we need to prepare the transformation dictionary for all the columns, including the feature engineered ones, which will be used for processing the dataset. It might be tedious at first, but when the feature set is massive and we need to try various versions of the transformations, this comes in quite handy.
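
The reason a transformation dictionary pays off can be sketched with a trimmed-down dispatcher: changing how a column is treated means editing one dictionary entry rather than the processing code. The function `apply_transforms`, the toy frame and its column names are hypothetical stand-ins, and only three of the types are covered.

```python
import pandas as pd

def apply_transforms(df, transform_dict):
    """Hypothetical, trimmed-down stand-in for process_dataset."""
    df = df.copy()
    for col, spec in transform_dict.items():
        if spec['Type'] == 'NORM_MAX':
            df[col] = df[col] / df[col].max()
        elif spec['Type'] == 'LE_O':
            # Ordered label encoding: codes follow the order given in 'Params'
            df[col] = pd.Categorical(df[col], categories=spec['Params']).codes
        elif spec['Type'] == 'DROP':
            df = df.drop(columns=col)
    return df

toy = pd.DataFrame({'Age': [10, 20, 40],
                    'Size': ['S', 'L', 'M'],
                    'Note': ['a', 'b', 'c']})

spec_v1 = {'Age': {'Type': 'NORM_MAX', 'Params': None},
           'Size': {'Type': 'LE_O', 'Params': ['S', 'M', 'L']},
           'Note': {'Type': 'DROP', 'Params': None}}

out = apply_transforms(toy, spec_v1)
```

Trying a different treatment for `Size`, say dropping it, only requires swapping its entry in the dictionary.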

In [None]:
transformDict = {'Name': {'Type': 'OHE_DROP',
                          'Params': None},
                 'AnimalType': {'Type': 'LE_N',
                                'Params': None},
                 'SexuponOutcome': {'Type': 'OHE_DROP',
                                    'Params': None},
                 'AgeuponOutcomeDays': {'Type': 'ORIGNAL',
                                        'Params': None},
                 'Breed': {'Type': 'LE_N',
                           'Params': None},
                 'Color': {'Type': 'LE_N',
                           'Params': None},
                 'Month': {'Type': 'ORIGNAL',
                           'Params': None},
                 'Day': {'Type': 'ORIGNAL',
                         'Params': None},
                 'WeekOfYear': {'Type': 'ORIGNAL',
                         'Params': None},
                 'DayOfWeek': {'Type': 'ORIGNAL',
                         'Params': None},
                 'DayType': {'Type': 'LE_O',
                             'Params': ['Morning',  'Noon', 'Eevening', 'Night']},
                 'AnimalType_Breed': {'Type': 'LE_N',
                                      'Params': None},
                 'AnimalType_Color': {'Type': 'LE_N',
                                      'Params': None},
                 'AnimalType_Breed_Color': {'Type': 'LE_N',
                                             'Params': None},
                 'AnimalType_SexuponOutcome': {'Type': 'LE_N',
                                               'Params': None},
                 'AnimalType_Name': {'Type': 'LE_N',
                                     'Params': None},
                 'AnimalType_DayType': {'Type': 'LE_N',
                                        'Params': None},
                 'AgeGroup': {'Type': 'LE_N',
                                        'Params': None}}

In [None]:
# Train Data
_mode = "train"
COMPLETE_TRAINING_DATA = dataProcessingHandler.preprocess_dataset(train_data.copy(), 
                                                                   mode=_mode, 
                                                                   sort_bydate= True)
COMPLETE_TRAINING_DATA = dataProcessingHandler.feature_engineering(COMPLETE_TRAINING_DATA)
COMPLETE_TRAINING_DATA = dataProcessingHandler.process_dataset(COMPLETE_TRAINING_DATA,
                                                                transformDict=transformDict, mode=_mode)
COMPLETE_TRAINING_DATA = dataProcessingHandler.impute(COMPLETE_TRAINING_DATA)


# Test Data
_mode = "test"
TESTING_DATA = dataProcessingHandler.preprocess_dataset(test_data.copy(), 
                                                         mode=_mode)
TESTING_DATA = dataProcessingHandler.feature_engineering(TESTING_DATA)
TESTING_DATA = dataProcessingHandler.process_dataset(TESTING_DATA, 
                                                      transformDict=transformDict, 
                                                      mode=_mode)
TESTING_DATA = dataProcessingHandler.impute(TESTING_DATA)

In [None]:
TRAINING_DATA, CV_DATA = dataProcessingHandler.train_cv_split(COMPLETE_TRAINING_DATA.copy())

print('Shape of the Training Set : ', TRAINING_DATA.shape)
print('Shape of the Cross Validation Set : ', CV_DATA.shape)
print('Shape of the Testing Set : ', TESTING_DATA.shape)
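
Because the rows were sorted by date before the split, the cross-validation set is simply the most recent slice of the data. A minimal sketch of that idea on a made-up frame follows; the name `chronological_split` and the toy dates are assumptions, not part of the notebook.

```python
import pandas as pd

def chronological_split(df, ratio=0.2):
    """Return (train, cv) where cv is the last `ratio` share of the rows."""
    n_cv = int(len(df) * ratio)
    cut = len(df) - n_cv   # slicing on `cut` also behaves when n_cv is 0
    return df.iloc[:cut], df.iloc[cut:]

# Hypothetical date-sorted data: ten consecutive days
toy = pd.DataFrame({'DateTime': pd.date_range('2014-01-01', periods=10, freq='D'),
                    'y': range(10)}).sort_values('DateTime')

train, cv = chronological_split(toy)
```

Every row in `cv` is later than every row in `train`, so the validation score reflects predicting on data the model has not seen in time.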

In [None]:
TRAINING_DATA_X = TRAINING_DATA.drop(TARGET_VARIABLE, axis=1).copy()
TRAINING_DATA_y = TRAINING_DATA[TARGET_VARIABLE]

CV_DATA_X = CV_DATA.drop(TARGET_VARIABLE, axis=1).copy()
CV_DATA_y = CV_DATA[TARGET_VARIABLE]

TESTING_DATA_X = TESTING_DATA.copy()

In [None]:
targetCatMapping = {idx:k for idx, k in enumerate(dataProcessingHandler.trainCatMappings[TARGET_VARIABLE].categories)}


# EDA

Just exploring...

In [None]:
trainDataEDA = train_data.copy()
trainDataEDA = dataProcessingHandler.preprocess_dataset(trainDataEDA, 
                                                        mode='train', 
                                                        sort_bydate= True)
trainDataEDA = dataProcessingHandler.feature_engineering(trainDataEDA)

In [None]:
# Analysing features w.r.t target
analysis_Cols = {'AnimalType': None,
                 'SexuponOutcome': None,
                 'AgeGroup': None,
                 'DayOfWeek': None,
                 'AnimalType_DayType': None,
                 'AnimalType_Name': None}

for eachCol in [k for k in analysis_Cols.keys()]:
    
    fig, axes = plt.subplots(1,2, figsize =(25,9))
    
    temp = trainDataEDA.groupby([TARGET_VARIABLE, eachCol]).size().unstack().fillna(0)
    
    
    t1 = temp.T/temp.sum(axis=1)
    t1.T.plot(kind='bar', stacked=True, ax = axes[0])
    t2 = temp/temp.sum(axis=0)
    t2.T.plot(kind='bar', stacked=True, ax = axes[1])
    
    
    axes[0].legend(loc='center left', bbox_to_anchor=(-0.25, 0.5))
    axes[1].legend(loc='center left', bbox_to_anchor=(1, 0.5))
    
    plt.suptitle('Analysing {0} v/s {1}'.format(eachCol, TARGET_VARIABLE), fontsize=20)
    analysis_Cols[eachCol] = fig
    plt.close()
    

In [None]:
analysis_Cols['AnimalType']

Analysing **AnimalType** with respect to the target column, it's evident that:

- A higher percentage of dogs than cats are adopted, while the relationship is the opposite for Transfer.
- The majority of Return_to_owner outcomes are for dogs.
- Cats have a higher probability of dying than dogs.
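
The two panels in each figure normalise the same count table in two directions: the left one by outcome totals, the right one by feature-level totals. A minimal sketch of both normalisations with `pd.crosstab` on made-up rows (the toy data is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical outcomes for three dogs and three cats
toy = pd.DataFrame({'AnimalType': ['Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Cat'],
                    'OutcomeType': ['Adoption', 'Adoption', 'Transfer',
                                    'Adoption', 'Transfer', 'Transfer']})

# Share of each animal type within an outcome (each row sums to 1)
within_outcome = pd.crosstab(toy['OutcomeType'], toy['AnimalType'], normalize='index')

# Share of each outcome within an animal type (each column sums to 1)
within_animal = pd.crosstab(toy['OutcomeType'], toy['AnimalType'], normalize='columns')
```

Statements such as "a higher percentage of dogs are adopted" are read from the second table, whereas "most returns to owner are dogs" is read from the first.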

In [None]:
analysis_Cols['SexuponOutcome']

For 'SexuponOutcome', there is not much to be derived, but:

- Interestingly, **Adoption** clearly has a much bigger percentage in the cases of **Neutered Male** and **Spayed Female**, so One Hot Encoding this feature might not be a bad idea, as that would make it easier for the model to capture the effect.
- Animals in the **Unknown** category seldom go back to the owner.
- **Adoption** of **Intact Female** is rather low.

In [None]:
analysis_Cols['AgeGroup']

For the feature engineered 'AgeGroup':

- A very explicit, increasing pattern emerges for **Transfer** as we move towards the older age groups, and likewise for **Euthanasia**.
- In a similar fashion, the rate of **Adoption** and **Transfer** of a pet clearly decreases as we move up the age groups.

I am starting to have a feeling that the **Adoption** and **Transfer** classes might not be that separable.

In [None]:
analysis_Cols['DayOfWeek']

Not much to derive from here..

In [None]:
analysis_Cols['AnimalType_DayType']

Ok, a little bit weird...

- Of all the target classes, **Adoption** is proportionately higher than the other classes for both **Cats** and **Dogs** when the hour is late, either in the evening or at night.

In [None]:
analysis_Cols['AnimalType_Name']

Ok, my two cents...
- **Cats** and **Dogs** that have a name have a higher chance of **Adoption**, and those that don't have a higher chance of **Transfer**.
- **Dogs** that had a name were mostly returned to their owners.

# Modelling

5th March - Let's take Random Forest for now. It is a general-purpose machine learning algorithm which, in principle, does a decent job of predicting in either regression or classification. It builds lots of [Decision Trees](https://en.wikipedia.org/wiki/Decision_tree) on bootstrap samples of the rows and random subsets of the features, and generalises the relationships in the dataset decently.
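
As a quick orientation before fitting on the shelter data, here is a minimal sketch of the three outputs used later in this notebook: the out-of-bag score, the class probabilities and the feature importances. The synthetic data from `make_classification` and all parameter values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 3-class problem with 8 numeric features
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, max_depth=4, oob_score=True,
                             random_state=0)
clf.fit(X, y)

oob = clf.oob_score_                 # accuracy on rows left out of each bootstrap sample
proba = clf.predict_proba(X[:5])     # one probability per class, each row sums to 1
importances = clf.feature_importances_
```

The out-of-bag score comes for free from the bootstrap sampling, which is why `oob_score=True` is a cheap sanity check alongside the held-out log loss.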

## Base Modelling

In [None]:
rfClf = RandomForestClassifier(oob_score=True, n_estimators=500, max_depth = 6, min_samples_leaf = 2)
_=rfClf.fit(TRAINING_DATA_X, TRAINING_DATA_y)

In [None]:
ypred = rfClf.predict(CV_DATA_X)
ypredProba = rfClf.predict_proba(CV_DATA_X)

CV_METRICS = pd.DataFrame(ypredProba, index=CV_DATA_X.index)
CV_METRICS.columns = CV_METRICS.columns.map(targetCatMapping)
# Clip the probabilities away from exact 0 and 1 (the result has to be assigned back)
CV_METRICS = CV_METRICS.clip(1e-15, 1-1e-15)
CV_METRICS['Actuals'] = CV_DATA_y.map(targetCatMapping)
CV_METRICS['Predictions'] = pd.Series(ypred,
                                      index= CV_DATA_X.index).map(targetCatMapping)


loglossscore = helper_functions_handler.calcLogLossError(CV_METRICS.copy())

fiDF = pd.DataFrame(dict(zip(TRAINING_DATA_X.columns, rfClf.feature_importances_)).items(),columns=['Features', 'Feature_Importance'])
fiDF.set_index('Features', inplace=True)
fiDF.sort_values('Feature_Importance',ascending=False, inplace=True)
plt.rcParams['figure.figsize'] = (25,7)
fiDF.plot(kind='bar')
plt.hlines(fiDF.mean().values[0]*0.8,-20,200,color='r')
_=plt.title('OOB Score : {0}, LogLossError : {1}'.format(rfClf.oob_score_, loglossscore), fontsize=20)

In [None]:
helper_functions_handler.createProbabHeatmap(CV_METRICS.copy())


Ahaaa! As suspected, the **Adoption** and **Transfer** categories are not that separable within the dataset, so much so that the forest classifies animals from other classes as them as well. And it is doing a perfectly miserable job of classifying **Deaths**.

## Feature Combination Selection

The code block below comes in rather handy when we are dealing with a humongous feature set (which is not the case here), but I will use it anyway to figure out which combination of features might be the best to work with.

For a layman, this is how the procedure reads:

***“Within the feature space of, let's say, 100 features, randomly pick *n {preditorInjection}* number of features - *m {combinations_no}* number of times and see which works better. But don't let the same combination of features through to the model if it has been tried earlier.”***

*NOTE : I am running this block for only 6 to 9 features per combination (`range(6,10)`), as the Kaggle kernel keeps crashing with only 20 combinations.. but you can increase it if you want to!*
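
The idea described above can be sketched independently of any model: draw random subsets, skip the ones already tried, and keep the best score. Everything here (`random_subset_search`, `toy_score`, the feature names) is hypothetical; in the notebook the score is the cross-validation log loss of a random forest.

```python
import random

def random_subset_search(features, subset_size, n_draws, score_fn, seed=0):
    rng = random.Random(seed)
    tried = set()
    best_score, best_subset = float('inf'), None
    for _ in range(n_draws):
        # Sorting makes the same combination compare equal regardless of draw order
        subset = tuple(sorted(rng.sample(features, subset_size)))
        if subset in tried:       # skip combinations already evaluated
            continue
        tried.add(subset)
        score = score_fn(subset)
        if score < best_score:    # lower is better, as with log loss
            best_score, best_subset = score, subset
    return best_score, best_subset, tried

# Hypothetical scorer that pretends features 'a' and 'b' are the useful ones
def toy_score(subset):
    return 1.0 - 0.3 * ('a' in subset) - 0.3 * ('b' in subset)

best_score, best_subset, tried = random_subset_search(list('abcdef'), 3, 30, toy_score)
```

Storing the tried combinations in a set keeps the duplicate check cheap even when the number of draws grows.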

In [None]:
combinations_no = 20
bestFeatures = []
forced_columns = []
model = RandomForestClassifier(oob_score=True, n_estimators=50, max_depth = 6, min_samples_leaf = 2)

algoModel = 'RandomForestClassifier'

Now, let's initiate our Feature Combination Selection process..

In [None]:
loglossscore = 99999
triedCombinations = []
allFeatures = [k for k in TRAINING_DATA_X.columns]

for preditorInjection in range(6,10):

    pbar = tqdm(range(combinations_no))
    
    for eachR_Pick in pbar:
        
        availableCols = [k for k in allFeatures if k not in forced_columns+[TARGET_VARIABLE]]
        randomsubset_cols = []
        
        if availableCols != []:
            randomsubset_cols = random.sample(availableCols, preditorInjection)
            randomsubset_cols.sort()
        
        if tuple(randomsubset_cols) not in triedCombinations:
            triedCombinations += [tuple(randomsubset_cols)]
            
            selectedCols = forced_columns+randomsubset_cols+[TARGET_VARIABLE]

            _modellingData = COMPLETE_TRAINING_DATA[selectedCols]
            _TRAIN_DATA, _CV_DATA = dataProcessingHandler.train_cv_split(_modellingData, ratio=0.2)
            
            _modellingData_XTrain = _TRAIN_DATA.drop(TARGET_VARIABLE, axis=1).copy()
            _modellingData_yTrain = _TRAIN_DATA[TARGET_VARIABLE].copy()

            _modellingData_XCV = _CV_DATA.drop(TARGET_VARIABLE, axis=1).copy()
            _modellingData_YCV = _CV_DATA[TARGET_VARIABLE].copy()

            _modellingData_XTest = TESTING_DATA_X.copy()

            model.fit(_modellingData_XTrain, _modellingData_yTrain)
            _yPred = model.predict(_modellingData_XCV)
            _yPredProba = model.predict_proba(_modellingData_XCV)
            
            
            CV_METRICS = pd.DataFrame(_yPredProba, index=CV_DATA_X.index)
            CV_METRICS.columns = CV_METRICS.columns.map(targetCatMapping)
            # Clip the probabilities away from exact 0 and 1 (the result has to be assigned back)
            CV_METRICS = CV_METRICS.clip(1e-15, 1-1e-15)
            CV_METRICS['Actuals'] = CV_DATA_y.map(targetCatMapping)
            CV_METRICS['Predictions'] = pd.Series(_yPred, index= CV_DATA_X.index).map(targetCatMapping)

            _loglossscore = helper_functions_handler.calcLogLossError(CV_METRICS)
            pbar.set_description('Injection Level : {0} @ {1}'.format(preditorInjection, round(loglossscore,4)))

            if _loglossscore < loglossscore:
                loglossscore = _loglossscore
                
                bestFeatures.append( (loglossscore, _modellingData_XTrain.columns.tolist()))

        else:
            pass

## Hyper Parameter Tuning

Once done selecting the features, 

- Choose a grid of parameters that you want to tune for a particular model.
- Create a pandas Dataframe out of it.
- Shuffle the dataframe. And run!
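
As an alternative sketch of the same three steps, scikit-learn's `ParameterGrid` can enumerate the combinations directly as dictionaries, which are then shuffled. This swaps the pandas DataFrame for a plain list; the grid values mirror the ones used in this notebook and the seed is an arbitrary assumption.

```python
import random
from sklearn.model_selection import ParameterGrid

grid = {'max_depth': [5, 6, 7],
        'min_samples_leaf': [4, 5],
        'n_estimators': list(range(500, 1000, 50))}

# One dict per combination, ready to be passed to model.set_params(**combo)
combos = list(ParameterGrid(grid))

# Shuffle so that an interrupted run has still sampled the whole grid evenly
random.Random(0).shuffle(combos)
```

Because the values stay plain Python integers, no casting is needed before handing a combination to the model.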

In [None]:
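# Best combination found by the random search above
_features = bestFeatures[-1][1]
# Overridden here with a fixed list of features
_features = ['AgeGroup',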
             'AgeuponOutcomeDays',
             'AnimalType_Color',
             'AnimalType_Name',
             'AnimalType',
             'AnimalType_Breed',
             'DayType',
             'SexuponOutcome_Unknown',
             'SexuponOutcome_Neutered Male',
             'SexuponOutcome_Spayed Female']

In [None]:
# Change the parameters according to the model being used...
_max_depth = np.arange(5, 8, 1).astype(int)
_min_samples_leaf = np.arange(4, 6, 1).astype(int)
_n_estimators = np.arange(500, 1000, 50).astype(int)


print('All Hyper Param Combinations : ',len(_max_depth)*len(_min_samples_leaf)*len(_n_estimators))

_hyperp = [_max_depth, _min_samples_leaf, _n_estimators]
hyperCombinations = pd.DataFrame(list(itertools.product(*_hyperp)))
hyperCombinations.columns = ['max_depth', 'min_samples_leaf', 'n_estimators']

hyperCombinations.max_depth = hyperCombinations.max_depth.astype(int)
hyperCombinations.n_estimators = hyperCombinations.n_estimators.astype(int)
hyperCombinations = hyperCombinations.sample(frac=1)
hyperCombinations.sample(3)

In [None]:
loglossscore = 99999
hyperidx = 0
bestHParams= []

pbar = tqdm(hyperCombinations.iterrows())
for idx, eachHyper in pbar:
    
    model = RandomForestClassifier(oob_score=True, max_features='log2', n_jobs=-1)
    # Cast every value to a plain Python int (iterrows yields NumPy integers).
    hparams = {name: int(value) for name, value in eachHyper.to_dict().items()}
    model.set_params(**hparams)
    
    _=model.fit(TRAINING_DATA_X[_features], TRAINING_DATA_y)
    ypred = model.predict(CV_DATA_X[_features])
    ypredProba = model.predict_proba(CV_DATA_X[_features])
    CV_METRICS = pd.DataFrame(ypredProba, index=CV_DATA_X.index)
    CV_METRICS.columns = CV_METRICS.columns.map(targetCatMapping)
    # Clip probabilities away from 0 and 1; the result must be assigned back to take effect.
    CV_METRICS = CV_METRICS.clip(lower=1e-15, upper=1-1e-15)
    CV_METRICS['Actuals'] = CV_DATA_y.map(targetCatMapping)
    CV_METRICS['Predictions'] = pd.Series(ypred,
                                          index= CV_DATA_X.index).map(targetCatMapping)
    _loglossscore = helper_functions_handler.calcLogLossError(CV_METRICS.copy())
    pbar.set_description('HpIdx : {0} @ {1}'.format(hyperidx, round(loglossscore,4)))
    if _loglossscore < loglossscore:
        loglossscore = _loglossscore
        bestHParams.append([loglossscore, hparams])
        hyperidx = idx

## Final Model

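With the chosen model, the selected features and the selected hyperparameters, build the final model. It is first fitted on the training split and scored on the CV split, then refitted on the complete training data to produce the submission file.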

In [None]:
bestHParams

In [None]:
model = RandomForestClassifier(oob_score=True, max_features='log2', n_jobs=-1)
model.set_params(**bestHParams[-1][1])

_=model.fit(TRAINING_DATA_X[_features], TRAINING_DATA_y)
ypred = model.predict(CV_DATA_X[_features])
ypredProba = model.predict_proba(CV_DATA_X[_features])
CV_METRICS = pd.DataFrame(ypredProba, index=CV_DATA_X.index)
CV_METRICS.columns = CV_METRICS.columns.map(targetCatMapping)
# Clip probabilities away from 0 and 1; the result must be assigned back to take effect.
CV_METRICS = CV_METRICS.clip(lower=1e-15, upper=1-1e-15)
CV_METRICS['Actuals'] = CV_DATA_y.map(targetCatMapping)
CV_METRICS['Predictions'] = pd.Series(ypred, index= CV_DATA_X.index).map(targetCatMapping)
_loglossscore = helper_functions_handler.calcLogLossError(CV_METRICS.copy())
print(_loglossscore)
helper_functions_handler.createProbabHeatmap(CV_METRICS.copy())

In [None]:
model = RandomForestClassifier(oob_score=True, max_features='log2', n_jobs=-1)
model.set_params(**bestHParams[-1][1])

_=model.fit(COMPLETE_TRAINING_DATA[_features], COMPLETE_TRAINING_DATA[TARGET_VARIABLE])
ypredProba = model.predict_proba(TESTING_DATA_X[_features])

submissionDF = pd.DataFrame(ypredProba)
submissionDF.columns = submissionDF.columns.map(targetCatMapping)
submissionDF.index = TESTING_DATA_X.index
submissionDF.index.name = TESTING_DATA_X.index.name
submissionDF.to_csv('submissionDF.csv')

# Model Interpretability

Dissect your model to understand, and to help a layperson understand, what it makes of a particular feature and how it works.

<img src="https://c4.wallpaperflare.com/wallpaper/485/436/987/warhammer-40k-techpriest-vitruvian-man-wallpaper-preview.jpg" width="400px">


The following technique is used:

1.) Partial Dependence Plots; more about them [here](https://medium.com/@ag.ds.bubble/model-interpretability-a4244d82ffb2?source=friends_link&sk=38aa71fb0d8faf36878ff274a8707ad7)
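
To make the idea concrete, here is a minimal sketch of what a partial dependence curve computes: for each grid value, overwrite the feature in every row, predict, and average. This is not how `pdpbox` is implemented. The `partial_dependence` helper, the toy model and the feature names `age_group` and `is_neutered` are all hypothetical, chosen only for illustration.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, set `feature` to that value in every row
    and average the model's predictions over all rows."""
    curve = []
    for value in grid:
        # {**row, feature: value} builds a modified copy; the original rows are untouched.
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


def toy_adoption_probability(row):
    """Hypothetical model: neutered animals start higher, older age groups score lower."""
    base = 0.8 if row['is_neutered'] else 0.4
    return max(0.0, base - 0.1 * row['age_group'])


# Made-up sample of rows, standing in for a sample of the training data.
rows = [
    {'age_group': 0, 'is_neutered': 1},
    {'age_group': 2, 'is_neutered': 0},
    {'age_group': 3, 'is_neutered': 1},
    {'age_group': 1, 'is_neutered': 0},
]
grid = [0, 1, 2, 3]

curve = partial_dependence(toy_adoption_probability, rows, 'age_group', grid)
for value, mean_pred in zip(grid, curve):
    print(value, round(mean_pred, 3))
```

Plotting `curve` against `grid` gives the averaged line of a partial dependence plot. Plotting the per-row predictions instead of their mean gives the individual lines that `plot_lines=True` draws.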

In [None]:
from pdpbox import pdp

XTrain = COMPLETE_TRAINING_DATA[_features].copy()

pdp_StoreType = pdp.pdp_isolate(model=model, 
                                dataset=XTrain.sample(100), 
                                num_grid_points = 100,
                                model_features=XTrain.columns.tolist(), 
                                feature='AgeGroup')

fig, axes = pdp.pdp_plot(pdp_StoreType, 'AgeGroup', plot_lines=True)

In [None]:
pdp_StoreType = pdp.pdp_isolate(model=model, 
                                dataset=XTrain.sample(100), 
                                num_grid_points = 100,
                                model_features=XTrain.columns.tolist(), 
                                feature='AgeuponOutcomeDays')

fig, axes = pdp.pdp_plot(pdp_StoreType, 'AgeuponOutcomeDays', plot_lines=True)

# Random Testing Space