### 1 - Shelter Animals part 3
In [part 2](https://github.com/yscyang1/ExploringDataScience/blob/master/6-ShelterAnimals2.ipynb) of my shelter animal competition analysis, we saw that I had probably maxed out what random forests could do given the minimal amount of pre-processing I did.  That isn't to say that random forests are now useless for this data set; rather, it is time to start digging deeper into the data itself.  

First, let's import the usual libraries and read in the training set from the feather file.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
train_df = pd.read_feather("../input/shelter/train_df")

### 2 - Finding the Most Important Features
Scikit-Learn's random forests have a handy attribute that tells you the importance of each feature, which is crucial for data analysis and for understanding how your random forest (or any model) is making predictions.  

To find out which features were most important, first let's use a subset of the training data to speed things up.

In [None]:
def get_subset(df, train_percent=.6, validate_percent=.2, copy=True, seed=None):
    # Work on a copy unless the caller opts out (previously copy=False left df_copy undefined)
    df_copy = df.copy() if copy else df
    length = len(df_copy.index)
    # Shuffle row positions rather than index labels, since iloc below is positional
    perm = np.random.RandomState(seed).permutation(length)
    train_end = int(train_percent * length)
    validate_end = int(validate_percent * length) + train_end
    train = df_copy.iloc[perm[:train_end]]
    validate = df_copy.iloc[perm[train_end:validate_end]]
    test = df_copy.iloc[perm[validate_end:]]

    return train, validate, test

In [None]:
train_speed, val_speed, test_speed = get_subset(train_df)
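
As a quick sanity check on a 60/20/20 split like the one above, the sketch below splits a toy frame by shuffled row positions and confirms that the three pieces are disjoint and together cover every row. The helper name `split_positions` and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_positions(n_rows, train_percent=0.6, validate_percent=0.2, seed=None):
    """Return shuffled row positions for the train, validation and test pieces."""
    perm = np.random.RandomState(seed).permutation(n_rows)
    train_end = int(train_percent * n_rows)
    validate_end = train_end + int(validate_percent * n_rows)
    return perm[:train_end], perm[train_end:validate_end], perm[validate_end:]

# Toy frame with a non-default index, so row positions and index labels differ.
toy_df = pd.DataFrame({"Age": range(10)}, index=range(100, 110))
train_pos, val_pos, test_pos = split_positions(len(toy_df), seed=0)

toy_train = toy_df.iloc[train_pos]
toy_val = toy_df.iloc[val_pos]
toy_test = toy_df.iloc[test_pos]

# Sizes of the three pieces, and every index label that ended up in one of them.
split_sizes = (len(toy_train), len(toy_val), len(toy_test))
all_labels = sorted(
    toy_train.index.tolist() + toy_val.index.tolist() + toy_test.index.tolist()
)
print(split_sizes)
```

Because the pieces keep their original index labels, any later step that combines them with freshly built columns needs to reset or align the index first.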

In [None]:
X_train_speed = train_speed.drop(['Outcome1', 'Outcome2'], axis = 1)

In [None]:
y_train_speed = train_speed['Outcome1']

In [None]:
X_val_speed = val_speed.drop(['Outcome1', 'Outcome2'], axis = 1)
y_val_speed = val_speed['Outcome1']

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
def print_score(model, X_t, y_t, X_v, y_v, oob = False):
    print('Training Score: {}'.format(model.score(X_t, y_t)))
    print('Validation Score: {}'.format(model.score(X_v, y_v)))
    if oob:
        if hasattr(model, 'oob_score_'):
            print("OOB Score:{}".format(model.oob_score_))

Next, create a random forest.  I've chosen to use the hyperparameters that yielded the best predictions from my last post.  

In [None]:
rf_speed = RandomForestClassifier(n_estimators=60, min_samples_leaf=7, max_features=0.3, min_samples_split= 20, bootstrap=False, n_jobs=-1)
rf_speed.fit(X_train_speed, y_train_speed)
print_score(rf_speed, X_train_speed, y_train_speed, X_val_speed, y_val_speed)
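
A note on the `oob` branch of `print_score`: scikit-learn only sets `oob_score_` when a forest is built with both `bootstrap=True` and `oob_score=True`, so the forest above (which uses `bootstrap=False`) never reports an OOB score. The sketch below illustrates the difference on synthetic data; the variable names and the data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only: the label depends on the first column.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

# bootstrap=True and oob_score=True are both needed for an out-of-bag estimate.
rf_oob = RandomForestClassifier(
    n_estimators=60, bootstrap=True, oob_score=True, random_state=0
)
rf_oob.fit(X_toy, y_toy)

# With bootstrap=False every tree sees every row, so no rows are left out of bag.
rf_no_oob = RandomForestClassifier(n_estimators=60, bootstrap=False, random_state=0)
rf_no_oob.fit(X_toy, y_toy)

has_oob = hasattr(rf_oob, "oob_score_")
has_no_oob = hasattr(rf_no_oob, "oob_score_")
print(has_oob, has_no_oob)
```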

Finally, we can use the random forest's `feature_importances_` attribute to reveal the top features to pay attention to.  I've written a function to display feature importance in a dataframe in descending order.

In [None]:
def get_feat_imp(model, df):
    tmp = pd.DataFrame({'Feature':  np.array(df.columns), 'Importance': np.array(model.feature_importances_)})
    return tmp.sort_values(by = ['Importance'], ascending = False)

In [None]:
if_df = get_feat_imp(rf_speed, X_train_speed)
if_df.head()

The importance starts off at ~0.25, but very quickly drops off.  A bar graph emphasizes this point.

In [None]:
if_df.plot('Feature', 'Importance', kind = 'barh',legend=False, figsize=(10,6))

Based on this bar graph, if I remove everything with importance < 0.01 (basically everything of lower importance than Datemonth), then the accuracy and feature importance of the random forest model shouldn't change much.

In [None]:
train_if, val_if, _ = get_subset(train_df, seed = 55)

In [None]:
X_train_if = train_if[if_df[if_df['Importance']>0.01]['Feature'].values]
X_val_if = val_if[if_df[if_df['Importance']>0.01]['Feature'].values]
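
To make the thresholding step concrete, here is a minimal sketch on a made-up importance table. Scikit-learn's impurity-based importances are normalized to sum to 1, so the summed importance of the kept features can be read as the share of total importance that survives the cut. The feature names, values, and threshold below are hypothetical.

```python
import pandas as pd

# Hypothetical importance values, for illustration only.
toy_importance = pd.DataFrame({
    "Feature": ["A", "B", "C", "D", "E"],
    "Importance": [0.50, 0.30, 0.15, 0.04, 0.01],
}).sort_values(by="Importance", ascending=False)

threshold = 0.05
kept = toy_importance[toy_importance["Importance"] > threshold]

# Names of the surviving features and the share of total importance they carry.
kept_features = kept["Feature"].tolist()
kept_share = kept["Importance"].sum()

print(kept_features)
print(round(kept_share, 2))
```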

In [None]:
y_train_if = train_if['Outcome1']
y_val_if = val_if['Outcome1']

In [None]:
rf_if = RandomForestClassifier(n_estimators=60, min_samples_leaf=7, max_features=0.3, min_samples_split= 20, bootstrap=False, n_jobs=-1)
rf_if.fit(X_train_if, y_train_if)
print_score(rf_if, X_train_if, y_train_if, X_val_if, y_val_if)

In [None]:
rf_speed_fi = get_feat_imp(rf_if, X_train_if)
print(rf_speed_fi.head())
rf_speed_fi.plot('Feature', 'Importance', kind = 'barh',legend=False, figsize=(10,6))

Indeed, neither the training and validation accuracies nor the bar plot change significantly.  

The values spit out from the feature importance function give us each feature's importance relative to the other features, but what does each feature mean for the accuracy of our model?  One way to test this is to shuffle the values of a feature and then run the data through the same random forest model.  I've written a function, shuffle_col, that takes in a dataframe and the column name to be shuffled, and spits out a new dataframe with the shuffled column.  The original dataframe will remain the same.

In [None]:
from sklearn.utils import shuffle

In [None]:
def shuffle_col(df, col_name):
    df_copy = df.copy()
    # Reset index of copy because index from get_subset wasn't reset, causes nan problems later
    df_copy.reset_index(inplace=True, drop=True)
    df_new = df_copy.drop(col_name, axis=1)
    shuf = shuffle(df[col_name])
    shuf.reset_index(inplace=True, drop=True)
    df_new[col_name] = shuf
    return df_new

First, I'll try shuffling the Sex column, since that was the most important feature.  I expect shuffling this column to cause the biggest drop in accuracy, with columns further down the list (age, datehour, name, animal, etc.) having smaller and smaller impacts until the validation score is barely affected.

In [None]:
shuf_df = shuffle_col(X_train_if, 'Sex')
rf_if.fit(shuf_df, y_train_if)
print_score(rf_if, shuf_df, y_train_if, X_val_if, y_val_if)

Wow, that validation score dropped dramatically, by a good 30%.  Shuffling the age column also causes a pretty big accuracy drop (~20%), and as predicted, shuffling the subsequent columns has a smaller effect on the accuracy of the validation set.

In [None]:
shuf_df = shuffle_col(X_train_if, 'Age')
rf_if.fit(shuf_df, y_train_if)
print_score(rf_if, shuf_df, y_train_if, X_val_if, y_val_if)

In [None]:
shuf_df = shuffle_col(X_train_if, 'Datehour')
rf_if.fit(shuf_df, y_train_if)
print_score(rf_if, shuf_df, y_train_if, X_val_if, y_val_if)

In [None]:
shuf_df = shuffle_col(X_train_if, 'Name')
rf_if.fit(shuf_df, y_train_if)
print_score(rf_if, shuf_df, y_train_if, X_val_if, y_val_if)

In [None]:
shuf_df = shuffle_col(X_train_if, 'Breed')
rf_if.fit(shuf_df, y_train_if)
print_score(rf_if, shuf_df, y_train_if, X_val_if, y_val_if)
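
The experiments above refit the forest on shuffled training data. A related check, usually called permutation importance, keeps the fitted model fixed and shuffles one column of the held-out data at a time instead. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; it is a swapped-in technique shown for comparison, not the method used in this notebook, and all names and data in it are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data for illustration: only "signal" determines the label.
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y_toy = (X_toy["signal"] > 0).astype(int)

# Fit on the first 300 rows and hold out the last 100.
X_fit, y_fit = X_toy.iloc[:300], y_toy.iloc[:300]
X_held, y_held = X_toy.iloc[300:], y_toy.iloc[300:]

rf_toy = RandomForestClassifier(n_estimators=60, random_state=0)
rf_toy.fit(X_fit, y_fit)

# Shuffle each held-out column n_repeats times and record the mean drop in accuracy.
result = permutation_importance(rf_toy, X_held, y_held, n_repeats=10, random_state=0)
perm_df = pd.DataFrame({
    "Feature": X_held.columns,
    "MeanDrop": result.importances_mean,
})
print(perm_df.sort_values(by="MeanDrop", ascending=False))
```

Since the model is never refit, this variant measures how much the already-trained forest relies on each column, which is a slightly different question from how well a forest can learn without it.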

#### 2.1 - A Deeper Dive into Important Features
What if there is a specific category in one of the features that is important, rather than the feature as a whole?  For example, when I look at the five most common values in Name, almost 5000 animals are labeled as 5048 (you guessed it, it's a null value).  Plotting the outcomes of the unnamed animals against the three most popular names shows different outcome distributions: the unnamed are far more likely to be transferred, whereas the named are more likely to be adopted or returned.  In this case, the better random forest split is probably whether the name is equal to 5048, rather than greater or less than 5048.  

In [None]:
train_speed['Name'].value_counts(ascending = False)[:5]

In [None]:
fig, ((axis1, axis2), (axis3, axis4)) = plt.subplots(2,2,figsize=(10,7))
order = ['Transfer', 'Adoption', 'Return_to_owner', 'Euthanasia', 'Died']

sns.countplot(x = 'Name', hue = 'Outcome1', data = train_speed[train_speed['Name']==5048], hue_order= order, ax = axis1)
sns.countplot(x = 'Name', hue = 'Outcome1', data = train_speed[train_speed['Name']==540], hue_order=order, ax = axis2)
sns.countplot(x = 'Name', hue = 'Outcome1', data = train_speed[train_speed['Name']==4542], hue_order=order, ax = axis3)
sns.countplot(x = 'Name', hue = 'Outcome1', data = train_speed[train_speed['Name']==1305], hue_order=order, ax = axis4)

axis2.get_legend().remove()
axis3.get_legend().remove()
axis4.get_legend().remove()

Something Jeremy suggests is to one-hot encode categorical data that has a small number of categories.  For example, Sex has only 5 categories, so it is a good candidate, but Name has more than 4k distinct values, so it wouldn't be practical to one-hot encode this feature.  

In [None]:
print('Number of categories in Sex: {}'.format(train_if['Sex'].nunique()))
print('Categories in Sex: {}'.format(train_if['Sex'].unique()))
print('Number of categories in Name: {}'.format(train_if['Name'].nunique()))

I've written a function called oneHotEncode, which takes a dataframe and a category cutoff, max_cat, and returns only the newly created one-hot columns.  For example, the column Sex has categories 0, 1, 3, 4, 5.  If I set max_cat to 6, then the function will encode Sex (it has fewer than 6 categories) and create new columns called Sex_0, Sex_1, Sex_3, etc.  But it will not do so for the column Name, because Name has more than 6 categories.

In [None]:
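def oneHotEncode(df, max_cat):
    # Collect dummy columns for every column with fewer than max_cat distinct values
    encoded = []
    for col in df.columns:
        if df[col].nunique() < max_cat:
            encoded.append(pd.get_dummies(df[col], prefix=col))
    # Return only the new dummy columns; the caller concatenates them onto the original frame
    return pd.concat(encoded, axis=1) if encoded else pd.DataFrame(index=df.index)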

In [None]:
tmp = oneHotEncode(train_if[['Name', 'Animal', 'Sex', 'Age', 'Breed', 'Color']], 6)

In [None]:
train_if2 = pd.concat([train_if, tmp], axis = 1)
train_if2.drop(['Sex', 'Animal'], axis = 1, inplace = True)

In [None]:
tmp = oneHotEncode(val_if[['Name', 'Animal', 'Sex', 'Age', 'Breed', 'Color']], 6)
val_if2 = pd.concat([val_if, tmp], axis = 1)
val_if2.drop(['Animal', 'Sex'], axis = 1, inplace = True)
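
Encoding the training and validation frames separately yields the same columns only when both contain the same categories. As a hedged sketch of one way to guard against a mismatch, the validation dummies can be reindexed to the training columns. The toy frames below are made up for illustration, and `dtype=int` is used only to keep the filled zeros and the dummies in the same type.

```python
import pandas as pd

# Toy frames for illustration: category 3 never appears in the validation rows.
toy_train = pd.DataFrame({"Sex": [0, 1, 3, 1]})
toy_val = pd.DataFrame({"Sex": [1, 0, 0]})

train_dummies = pd.get_dummies(toy_train["Sex"], prefix="Sex", dtype=int)
val_dummies = pd.get_dummies(toy_val["Sex"], prefix="Sex", dtype=int)

# Give the validation frame exactly the training columns, filling absent ones with 0.
val_aligned = val_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(val_dummies.columns))
print(list(val_aligned.columns))
```

Without this step, a forest fitted on the training columns would be asked to score a validation frame with a different set of features.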

Next, I recreate my training and validation sets to find the most important features.  

In [None]:
X_train_if2 = train_if2.drop(['Outcome1', 'Outcome2'], axis = 1)
y_train_if2 = train_if2['Outcome1']
X_val_if2 = val_if2.drop(['Outcome1', 'Outcome2'], axis = 1)
y_val_if2 = val_if2['Outcome1']

In [None]:
rf_if2 = RandomForestClassifier(n_estimators=60, min_samples_leaf=7, max_features=0.3, min_samples_split= 20, bootstrap=False, n_jobs=-1)
rf_if2.fit(X_train_if2, y_train_if2)
print_score(rf_if2, X_train_if2, y_train_if2, X_val_if2, y_val_if2)

We see that both training and validation score decreased by a little, so it seems one hot encoding didn't help.  

As for the top five most important features, age, datehour, and name are still in there, but two of the sex categories also reached the top five.  Note that the magnitudes in the feature importance chart matter too.  Before one-hot encoding, sex and age had importances of 0.256 and 0.244 respectively.  After one-hot encoding, sex is split into categories, and those in the top five each have an importance of less than 0.1, while age has a slightly smaller importance of 0.233.  As it stands, there is really only one staggeringly important feature instead of two, potentially accounting for the decrease in training and validation set scores.

I've also tried encoding for nameless animals, and while this encoding made it to the top five important features, it only had an importance of 0.087388, and the training and validation scores improved by about 0.005.  
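
One simple way to encode nameless animals is a single indicator column, sketched below on a toy frame. This is an illustration under assumptions, not necessarily the encoding used for the result above: the column name `Name_missing` is hypothetical, and 5048 is the integer code that marks a missing name in this data set.

```python
import pandas as pd

# Toy frame for illustration; 5048 is the code for a missing name.
toy_df = pd.DataFrame({"Name": [5048, 540, 5048, 1305, 4542]})

NO_NAME_CODE = 5048

# 1 when the animal has no name, 0 otherwise.
toy_df["Name_missing"] = (toy_df["Name"] == NO_NAME_CODE).astype(int)

print(toy_df)
```

An indicator like this lets a tree isolate the unnamed animals with one split, instead of needing two threshold splits around 5048.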

In [None]:
rf_speed_fi2 = get_feat_imp(rf_if2, X_train_if2)
print(rf_speed_fi2.head())
rf_speed_fi2.plot('Feature', 'Importance', kind = 'barh',legend=False, figsize=(10,6))

In the case that one hot encoding does increase the validation score, it is still important to check if the one hot encoding increases the score on the test dataset, since it doesn't always improve the model.  

### 3 - Submitting to Kaggle
After removing the unimportant features and using one hot encoding, I submitted to the Kaggle leaderboard again, and got a score of 0.77713, placing me at 550.  This is two places lower than my last attempt.  It was a good exercise to take a deep dive into feature importance, but unfortunately it didn't seem to help in this case, as I didn't see any telltale signs such as strong correlation between features, redundant variables, etc.  