### 1 - Shelter Animal Outcomes
The explosion of social media came with, in my opinion, a pleasant surprise: more pictures of cats and dogs! But as enjoyable as it is to look at pictures of animals being happy in their furever home, what about their counterparts still in the animal shelter? How many of them will find their happy ending, get shipped to a different facility for a second chance, or, even worse, be euthanized? Kaggle has an old [competition](https://www.kaggle.com/c/shelter-animal-outcomes) that uses data from the [Austin Animal Center](http://www.austintexas.gov/department/aac), in which Kagglers were tasked with predicting the outcome for each animal.

The training and test data sets both cover the date range from 10/01/2013 to 2/21/2016.  
Most of the features are self-explanatory, but the key column is OutcomeType, which records whether the animal was adopted, died, was euthanized, was returned to its owner, or was transferred to another shelter. The features include name, animal type, sex, age, breed, and color.


First, let's import all the libraries and the training and test datasets.

#### 1.1 - Importing Libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#### 1.2 - Load the Training Set

In [None]:
train_df = pd.read_csv('../input/train.csv', parse_dates=['DateTime'])

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.describe()

#### 1.3 - Load the Test Set

In [None]:
test_df = pd.read_csv('../input/test.csv', parse_dates=['DateTime'])

In [None]:
test_df.info()

In [None]:
test_df.describe(include = 'all')

Next, I'd like to rename the columns to make them shorter, and delete the 'AnimalID' column in the training set since that won't tell us anything useful. The test set's 'ID' column is set aside for the submission file before it is dropped as well.

In [None]:
train_df.rename(columns = {'OutcomeType': 'Outcome1', 'OutcomeSubtype': 'Outcome2', 'AnimalType': 'Animal', 'SexuponOutcome': 'Sex', 'AgeuponOutcome': 'Age'}, inplace=True)
test_df.rename(columns = {'AnimalType': 'Animal', 'SexuponOutcome': 'Sex', 'AgeuponOutcome': 'Age'}, inplace=True)

In [None]:
train_df.drop('AnimalID', axis = 1, inplace = True)
test_ID = test_df['ID']
test_df.drop('ID', axis = 1, inplace = True)

### 2 - Initial Exploration
I've discovered fast.ai with Jeremy Howard, and his course is really turning my world upside down. Previous MOOCs, Kaggle datasets, and textbooks I've encountered seemed to emphasize data exploration before touching a model. Jeremy, however, advocates looking at the data just enough for minimal preprocessing before dumping it into a model.


#### 2.1 - Missing Data
First, I just want to see how much missing data there is, if any.  

In the training set, about half of the secondary outcomes are missing, which is to be expected. About 30% of the names and less than 1% of the sex and age values are missing as well.

In the test set, a similar percentage of names and ages is missing.

In [None]:
train_df.isnull().sum()/train_df.shape[0]

In [None]:
test_df.isnull().sum()/test_df.shape[0]

#### 2.2 - Checking Outcomes

It's best to have balanced outcomes, but seeing as this is a real data set, that isn't the case. Luckily, the numbers are in the animals' favor, with most of the animals being adopted or transferred.

In [None]:
sns.countplot(x = 'Outcome1', data = train_df)
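
The count plot shows the imbalance visually. To put numbers on it, the class shares can be read off with value_counts(normalize=True). The sketch below uses a small made-up series ('outcomes' is a hypothetical stand-in for train_df['Outcome1'], and its values are not the real proportions).

```python
import pandas as pd

# Hypothetical toy outcomes, NOT the real distribution in the shelter data
outcomes = pd.Series(['Adoption'] * 4 + ['Transfer'] * 3
                     + ['Return_to_owner'] * 2 + ['Euthanasia'])

# Fraction of rows per class, sorted from most to least common
shares = outcomes.value_counts(normalize=True)
print(shares)
```

On the real data, the same call on train_df['Outcome1'] would give the share of each outcome.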

The Outcome2 data would be interesting to look at, but it is outside the scope of this post: the test set doesn't include it, so it can't be used as a feature for predicting the primary outcome. The column will be dropped later.

### 3 - Feature Engineering

#### 3.1 - Converting to Numbers and Categorical Variables
If I plug the data into the random forest model now, the error "ValueError: could not convert string to float: 'Brown Tabby/White'" pops up. So first, I must make sure the data is properly formatted. This means turning everything into numbers or categorical variables.

In this case, the only numeric feature I can see is age. Categorical features include ID, name, outcomes, animal, sex, breed, and color. The 'DateTime' column has a datetime type, which was set when I loaded the data set.

##### 3.1.1 - Age
Just like humans' ages, animals' ages are described in days, weeks, months, and years. Since the smallest unit is days, I will convert everything to days. The only age that seems out of place is '0 years'. There are around 20 of those, so they shouldn't impact results too much. For now, I will convert them to the equivalent of 6 months in days.

In [None]:
train_df['Age'].unique()

In [None]:
def convert_age(col):
    try:
        num = col.split()[0]
        unit = col.split()[1]

        if unit == 'year' or unit == 'years':
            if num == '0':
                return 365/2
            return int(num) * 365
        if unit == 'month' or unit == 'months':
            return int(num) * 30
        if unit == 'week' or unit == 'weeks':
            return int(num) * 7
        if unit == 'day' or unit == 'days':
            return int(num)
    except AttributeError: 
        pass

In [None]:
train_df['Age'] = train_df['Age'].apply(convert_age)
test_df['Age'] = test_df['Age'].apply(convert_age)
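
As a cross-check on the conversion rules above, here is a vectorized sketch of the same idea using pandas string methods instead of a row-by-row apply. The names 'age_to_days' and 'UNIT_DAYS' are hypothetical, and it assumes every non-null age looks like '<number> <unit>', as in the output of unique() above.

```python
import pandas as pd

# Same approximations as convert_age: 30-day months, 365-day years
UNIT_DAYS = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

def age_to_days(ages):
    """Convert strings such as '2 years' or '3 weeks' to a number of days."""
    # Split into the number and the singular unit (a trailing 's' is dropped)
    parts = ages.str.extract(r'(?P<num>\d+)\s+(?P<unit>[a-z]+?)s?$')
    num = pd.to_numeric(parts['num'])
    days = (num * parts['unit'].map(UNIT_DAYS)).astype(float)
    # '0 years' is treated as half a year, as in convert_age
    zero_years = (num == 0) & (parts['unit'] == 'year')
    return days.mask(zero_years, 365 / 2)

ages = pd.Series(['2 years', '1 year', '3 weeks', '5 days',
                  '10 months', '0 years', None])
days = age_to_days(ages)
print(days)
```

Missing ages stay missing (NaN), so the mean imputation in the next step would still apply.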

There is a small percentage of null ages in both data sets. For now, I will impute the mean age of each set.

In [None]:
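# Assign the result back: calling fillna(inplace=True) on a selected column is unreliable in recent pandas
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].mean())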

#### 3.2 - Dates
DateTime gets its own section because dates can be very telling. A date not only gives us the year, month, day, hour, and minute, but also lets us figure out whether it was a weekend or a holiday, whether a sports game occurred that day, etc. Luckily, pandas recognizes the importance of dates and times, and has implemented some handy functions to help you figure this out.

To access this information, first we define 'date' as the 'DateTime' series. Then add '.dt.' to see which attributes are available. For example, date.dt.dayofweek gives the day of the week as an integer, with Monday = 0.

In [None]:
date = train_df['DateTime']
date.dt.dayofweek.head()
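
To make the '.dt' accessor concrete, here is a small sketch on two made-up timestamps (the first and last days of the data set's date range; the times of day are invented). It also derives a weekend flag, which is not an attribute of '.dt' itself. The names 'dates', 'dow', and 'is_weekend' are hypothetical.

```python
import pandas as pd

# Two illustrative timestamps; the times of day are made up
dates = pd.Series(pd.to_datetime(['2013-10-01 09:31', '2016-02-21 19:17']))

dow = dates.dt.dayofweek      # Monday = 0 ... Sunday = 6
is_weekend = dow >= 5         # Saturday (5) or Sunday (6)

print(pd.DataFrame({'date': dates, 'dayofweek': dow, 'is_weekend': is_weekend}))
```

October 1, 2013 was a Tuesday (1) and February 21, 2016 was a Sunday (6), so only the second row is flagged as a weekend.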

Much of the following function was taken from Jeremy, with some adjustments to fit this data set better.  Essentially, the function outputs the information in a new column, labeled as 'Date' + the attribute, such as 'Year' or 'Month'.  The original datetime column is then deleted.  

In [None]:
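def convert_date(df, col):
    """Expand a datetime column into 'Date<attribute>' columns, then drop the original column."""
    fld = df[col]
    targ_pre = 'Date'
    for n in ('day', 'dayofweek', 'dayofyear', 'hour', 'is_month_start', 'is_month_end', 'is_quarter_start', 'is_quarter_end', 
             'is_year_end', 'is_year_start', 'month', 'quarter', 'year'):
        df[targ_pre + n] = getattr(fld.dt, n)
    # 'week' and 'weekofyear' were duplicates of each other and were removed from .dt in pandas 2.0;
    # isocalendar().week (pandas 1.1+) is the replacement. astype(int) assumes there are no missing dates.
    df[targ_pre + 'week'] = fld.dt.isocalendar().week.astype(int)
    df.drop(col, axis = 1, inplace = True)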

In [None]:
convert_date(train_df, 'DateTime')
convert_date(test_df, 'DateTime')

In [None]:
test_df.columns

#### 3.3 - Creating Categorical Variables
Lastly, how do we deal with the strings? As noted before, strings raise errors when passed into a model. There are a few ways to handle this, including pandas' pd.Categorical() function. Jeremy explains that you must be careful to keep the mapping consistent between the training and test sets, and the fastai library has a categorical mapping function for that purpose. However, I believe sklearn's LabelEncoder() function is suitable. At the very least, there is a way to check if the mappings are the same.
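
For comparison, here is a minimal sketch of the pd.Categorical() route with a mapping shared by both sets. It is not what this post uses, and the toy 'train_color' and 'test_color' series are hypothetical. The idea is to build one list of categories from the union of both sets and pass it to each call, so a given string always gets the same integer code.

```python
import pandas as pd

# Hypothetical toy columns standing in for a train/test pair
train_color = pd.Series(['Black', 'White', 'Black'])
test_color = pd.Series(['White', 'Brown', None])

# One shared, sorted category list built from both sets
categories = sorted(set(train_color.dropna()) | set(test_color.dropna()))

# .codes gives the integer position of each value in 'categories'
train_codes = pd.Categorical(train_color, categories=categories).codes
test_codes = pd.Categorical(test_color, categories=categories).codes

print(categories, train_codes, test_codes)
```

With this approach missing values do not need a placeholder string: pd.Categorical assigns them the code -1.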

LabelEncoder() cannot handle null values. One way to deal with this is to replace null values with a placeholder string.

In [None]:
train_df['Name'] = train_df['Name'].fillna('NaN')
train_df['Sex'] = train_df['Sex'].fillna('NaN')

In [None]:
test_df['Name'] = test_df['Name'].fillna('NaN')

Now is also a good time to split the training set into X and y datasets.  

In [None]:
y = train_df['Outcome1']
X = train_df.drop(['Outcome1', 'Outcome2'], axis = 1)

Finally, we can apply label encoding.

In [None]:
from sklearn import preprocessing
labelEnc = preprocessing.LabelEncoder()

In [None]:
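for col in test_df.columns.values:
    if test_df[col].dtypes == 'object':
        # Fit on the combined train and test values so both sets share one mapping
        # (Series.append was removed in pandas 2.0, so pd.concat is used instead)
        tmp = pd.concat([X[col], test_df[col]])
        labelEnc.fit(tmp.values)
        X[col]=labelEnc.transform(X[col])
        test_df[col]=labelEnc.transform(test_df[col])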

After label encoding, everything should be a number.  Remember that boolean values have numerical values of 0 and 1.  

In [None]:
X.head()

A note: with the way I've written the code, the encoder is refit, and its labels are overwritten, every time a column passes through the loop. In the future, I would like to preserve the encoding for each column. A possible solution is to use a list comprehension as seen [here](https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn/31939145#31939145).
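
One way to keep the encodings, sketched here as an assumption rather than as the approach from the linked answer, is to store a separate LabelEncoder per column in a dictionary. The helper name 'encode_columns' and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_columns(train, test, cols):
    """Label-encode cols in place and return the fitted encoder for each column."""
    encoders = {}
    for col in cols:
        # Fit on train and test together so both share one mapping
        enc = LabelEncoder().fit(pd.concat([train[col], test[col]]).to_numpy())
        train[col] = enc.transform(train[col])
        test[col] = enc.transform(test[col])
        encoders[col] = enc
    return encoders

# Hypothetical toy data
train_toy = pd.DataFrame({'Animal': ['Dog', 'Cat', 'Dog'],
                          'Color': ['Black', 'White', 'Black']})
test_toy = pd.DataFrame({'Animal': ['Cat', 'Cat'],
                         'Color': ['Brown', 'White']})

encoders = encode_columns(train_toy, test_toy, ['Animal', 'Color'])

# The stored encoder can map the integers back to the original strings
original_colors = encoders['Color'].inverse_transform(train_toy['Color'])
print(original_colors)
```

Keeping the encoders around makes it possible to translate feature values back into readable labels later, for example when inspecting the model.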

#### 3.4 - Saving the File

I'm finally ready to pass the data into a model. Let's save the file first. Saving to feather format means the data is saved to disk in the same format it is stored in RAM, making it quick to both save and read in the data.

In [None]:
import os
os.makedirs('tmp', exist_ok = True)
X.to_feather('tmp/X')
test_df.to_feather('tmp/test_df')

To reload the data, use the following code:

In [None]:
X = pd.read_feather('tmp/X')
test_df = pd.read_feather('tmp/test_df')

### 4 - Fitting the Model with Random Forest
Why random forest? Jeremy goes through it much more eloquently in his video, but basically it is a great model to start with because it doesn't assume anything about the shape of the relationship (e.g. linear vs. nonlinear), and it works well with both numerical and categorical data. In addition, if the random forest model does poorly, then it's a sign that there might be problems within the data.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)
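
The model above is fit on all of the training data, so there is no local estimate of how well it will do. A minimal sketch of one way to get such an estimate is to hold out part of the data and compute the multi-class log loss on it. The example below runs on synthetic data from make_classification, not the shelter data, and all the '_demo' names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for X and y
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# Hold out 25% of the rows, keeping the class proportions similar
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_tr, y_tr)

# Log loss is computed from predicted probabilities, not hard labels
proba_val = rf_demo.predict_proba(X_val)
val_loss = log_loss(y_val, proba_val, labels=rf_demo.classes_)
print(val_loss)
```

With the real data, X and y would replace the synthetic arrays, and the final model could still be refit on everything before predicting on the test set.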

### 5 - Getting Predictions and Submitting
The last thing to do is apply the random forest classifier to the test data and create a submission file. Scikit-learn makes this very easy. The prediction for each animal ID can be obtained from the predict() function, and the predictions come back as a one-dimensional array of 11456 entries, one per test animal.


In [None]:
pred = rf.predict(test_df)

In [None]:
pred.shape

However, the submission file wants the probability of each outcome, as seen below. To get the probability for each outcome, I used the predict_proba() function, and the order of the outcomes is found from ".classes_".

In [None]:
sub_df = pd.read_csv('../input/sample_submission.csv')
sub_df.head()

In [None]:
pred_prob = rf.predict_proba(test_df)

In [None]:
rf.classes_

I can now create a dataframe and save it as a csv file for submission.  

In [None]:
# predict_proba returns its columns in the order of rf.classes_, so reuse that order for the headers
submit_df = pd.DataFrame(pred_prob, columns = rf.classes_)
submit_df.insert(0, 'ID', test_ID)

In [None]:
submit_df.head()

In [None]:
submit_df.to_csv('submission.csv', index = False)

#### 5.1 - Final Score
Using random forest and minimal data wrangling, I got a score of 0.82546 (multi-class log loss, so lower is better), which places 718th out of 1604 submissions. Obviously there is room for improvement, but for a first try this is pretty good.

Off the top of my head, one thing that might improve the score is to identify holidays, or whether an outcome occurred within 2-3 days before or after a holiday. The reasoning is that holidays are often a stressful time for animals, with new people coming over, being left alone, being scared by fireworks, etc.
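
A rough sketch of how such a holiday feature could be built is below. It is untested on the real data and rests on some assumptions: it uses pandas' built-in USFederalHolidayCalendar, which only covers US federal holidays on their observed dates, the helper name 'near_holiday' is hypothetical, and it would have to run on the original 'DateTime' column before convert_date() drops it.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def near_holiday(dates, window_days=3):
    """Flag dates that fall within window_days of a US federal holiday."""
    days = dates.dt.normalize()               # drop the time of day
    pad = pd.Timedelta(days=window_days)
    holidays = USFederalHolidayCalendar().holidays(
        start=days.min() - pad, end=days.max() + pad)
    if len(holidays) == 0:
        return pd.Series(False, index=dates.index)
    # Absolute gap between every date and every holiday, then the smallest per date
    gaps = np.abs(days.to_numpy()[:, None] - holidays.to_numpy()[None, :])
    nearest = gaps.min(axis=1)
    return pd.Series(nearest <= np.timedelta64(window_days, 'D'),
                     index=dates.index)

# Made-up timestamps around Christmas 2014
sample = pd.Series(pd.to_datetime(['2014-12-23 10:00', '2014-12-25 15:30',
                                   '2015-01-10 08:00']))
flags = near_holiday(sample)
print(flags)
```

The first two dates fall within three days of December 25, while January 10 is nine days after New Year's Day, so only the first two are flagged.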