# Using Machine Learning to Predict Animal Outcomes At the Austin Animal Shelter
As a part of Austin's **Open Data Initative**, the Austin Animal Shelter releases the data they collect on animal intakes and outcomes, using these data sheets we hope to predict animals outcomes with a reasonable degree of accuracy.

In [1]:
# module imports
import pandas
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [2]:
ls -l

total 53948
-rw-r--r-- 1 randall randall 31441583 Sep 29 09:11 aac_intakes_outcomes.csv
-rw-r--r-- 1 randall randall  8801200 Nov 25 12:53 aac_shelter_cat_outcome_eng.csv
-rw-r--r-- 1 randall randall 10933886 Nov 25 12:53 aac_shelter_outcomes_modified.csv
-rw-r--r-- 1 randall randall  3548207 Nov 25 12:53 aac_shelter_outcomes.ods
-rw-r--r-- 1 randall randall    24948 Dec  3 12:49 notebook.ipynb
-rw-r--r-- 1 randall randall   105007 Dec  5 15:28 Team13_Project_Presentation.odp
-rw-r--r-- 1 randall randall   342036 Dec  5 15:30 Team13_Project_Presentation.pdf
-rw-r--r-- 1 randall randall    27444 Dec  5 23:02 Untitled.ipynb


Ooh, **aac_intakes_outcomes.csv** looks interesting, let's explore

In [3]:
shelter_outcomes = 'aac_intakes_outcomes.csv'

df = pd.read_csv(shelter_outcomes)
print("Shape:", df.shape)
df.head()

Shape: (79672, 41)


Unnamed: 0,age_upon_outcome,animal_id_outcome,date_of_birth,outcome_subtype,outcome_type,sex_upon_outcome,age_upon_outcome_(days),age_upon_outcome_(years),age_upon_outcome_age_group,outcome_datetime,...,age_upon_intake_age_group,intake_datetime,intake_month,intake_year,intake_monthyear,intake_weekday,intake_hour,intake_number,time_in_shelter,time_in_shelter_days
0,10 years,A006100,2007-07-09 00:00:00,,Return to Owner,Neutered Male,3650,10.0,"(7.5, 10.0]",2017-12-07 14:07:00,...,"(7.5, 10.0]",2017-12-07 00:00:00,12,2017,2017-12,Thursday,14,1.0,0 days 14:07:00.000000000,0.588194
1,7 years,A006100,2007-07-09 00:00:00,,Return to Owner,Neutered Male,2555,7.0,"(5.0, 7.5]",2014-12-20 16:35:00,...,"(5.0, 7.5]",2014-12-19 10:21:00,12,2014,2014-12,Friday,10,2.0,1 days 06:14:00.000000000,1.259722
2,6 years,A006100,2007-07-09 00:00:00,,Return to Owner,Neutered Male,2190,6.0,"(5.0, 7.5]",2014-03-08 17:10:00,...,"(5.0, 7.5]",2014-03-07 14:26:00,3,2014,2014-03,Friday,14,3.0,1 days 02:44:00.000000000,1.113889
3,10 years,A047759,2004-04-02 00:00:00,Partner,Transfer,Neutered Male,3650,10.0,"(7.5, 10.0]",2014-04-07 15:12:00,...,"(7.5, 10.0]",2014-04-02 15:55:00,4,2014,2014-04,Wednesday,15,1.0,4 days 23:17:00.000000000,4.970139
4,16 years,A134067,1997-10-16 00:00:00,,Return to Owner,Neutered Male,5840,16.0,"(15.0, 17.5]",2013-11-16 11:54:00,...,"(15.0, 17.5]",2013-11-16 09:02:00,11,2013,2013-11,Saturday,9,1.0,0 days 02:52:00.000000000,0.119444


Cool let's see what features we have left.

In [4]:
for column in df.columns:
    print(column)

age_upon_outcome
animal_id_outcome
date_of_birth
outcome_subtype
outcome_type
sex_upon_outcome
age_upon_outcome_(days)
age_upon_outcome_(years)
age_upon_outcome_age_group
outcome_datetime
outcome_month
outcome_year
outcome_monthyear
outcome_weekday
outcome_hour
outcome_number
dob_year
dob_month
dob_monthyear
age_upon_intake
animal_id_intake
animal_type
breed
color
found_location
intake_condition
intake_type
sex_upon_intake
count
age_upon_intake_(days)
age_upon_intake_(years)
age_upon_intake_age_group
intake_datetime
intake_month
intake_year
intake_monthyear
intake_weekday
intake_hour
intake_number
time_in_shelter
time_in_shelter_days


Now that we've taken a quick look at what the data looks like, let's trim it down a lil' bit, because not all of these are going to be necessary to help predict things.

In [5]:
try:
    df.drop(columns=['age_upon_outcome', 'animal_id_outcome', 'date_of_birth', 'outcome_subtype',\
                     'age_upon_outcome_(years)', 'age_upon_outcome_age_group', 'outcome_datetime',\
                     'outcome_year', 'outcome_number', 'dob_year', 'dob_month', 'dob_monthyear',\
                     'age_upon_intake', 'count', 'intake_datetime', 'intake_month', 'intake_year', \
                     'intake_monthyear', 'intake_weekday', 'intake_hour', 'intake_number',\
                     'time_in_shelter', 'age_upon_intake_(days)', 'age_upon_intake_(years)',\
                     'age_upon_intake_age_group','animal_id_intake'], inplace=True)
except KeyError:
    print("Columns already dropped, are you attempting to bamboozle me?")

df.dtypes

outcome_type                object
sex_upon_outcome            object
age_upon_outcome_(days)      int64
outcome_month                int64
outcome_monthyear           object
outcome_weekday             object
outcome_hour                 int64
animal_type                 object
breed                       object
color                       object
found_location              object
intake_condition            object
intake_type                 object
sex_upon_intake             object
time_in_shelter_days       float64
dtype: object

This feels like a good amount of features to train our model on. Next, let's do a little preprocessing on our data to make it trainable. 

First, we'll encode our categorical variables to let the model understand them, and drop empty values.

In [6]:
from sklearn.preprocessing import LabelEncoder

# highlighting categorical features
labelFeatures = ['sex_upon_outcome', 'outcome_weekday', 'animal_type', \
                 'breed', 'color', 'found_location', 'intake_condition', 'sex_upon_intake']

# dropping empty values
df.dropna(inplace = True, subset=['outcome_type', 'sex_upon_outcome', 'outcome_weekday', 'animal_type', \
                 'breed', 'color', 'found_location', 'intake_condition', 'sex_upon_intake'])

le = LabelEncoder()

# fitting categorical features
for f in labelFeatures:
    print("Fitting:", f)
    le.fit(df[f])
    df[f] = le.transform(df[f])
    
# fitting y value
le.fit(df.outcome_type)
df['outcome_type'] = le.transform(df.outcome_type)

# throwing the numeric values onto labelFeatures so we can add it all into X and Y
labelFeatures.append('age_upon_outcome_(days)')
labelFeatures.append('outcome_month')
labelFeatures.append('outcome_hour')
labelFeatures.append('time_in_shelter_days')

# and now let there be X and Y
X = df[labelFeatures]
y = df.outcome_type

# thank Zeus there's only train_test_split, and that there's not also a test_train_split, 
# I would've been trying to figure out that one for forever
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1, test_size=0.3, stratify=y)

Fitting: sex_upon_outcome
Fitting: outcome_weekday
Fitting: animal_type
Fitting: breed
Fitting: color
Fitting: found_location
Fitting: intake_condition
Fitting: sex_upon_intake


In [7]:
# we set random_state=1 so Zeus may arrange the data juuuuuust right
randoForest = RandomForestClassifier(random_state=1)

# now let's fit the model and see what happens
randoForest.fit(train_X, train_y)

predictions = randoForest.predict(val_X)
print("Validation MAE for Random Forest Model: {}".format(mean_absolute_error(predictions, val_y)))



Validation MAE for Random Forest Model: 1.2766224528222938


In [8]:
from sklearn.metrics import classification_report

print(classification_report(val_y, predictions))

              precision    recall  f1-score   support

           0       0.78      0.89      0.83     10078
           1       0.55      0.10      0.17       207
           2       0.75      0.33      0.46        91
           3       0.82      0.69      0.75      1873
           4       0.00      0.00      0.00        14
           5       0.00      0.00      0.00         5
           6       0.75      0.71      0.73      4437
           7       0.00      0.00      0.00        54
           8       0.78      0.72      0.75      7140

    accuracy                           0.78     23899
   macro avg       0.49      0.38      0.41     23899
weighted avg       0.77      0.78      0.77     23899



  'precision', 'predicted', average, warn_for)


Alright, our first prediction!

Our accuracy is 78%, not bad at all!

In [9]:
from sklearn.metrics import accuracy_score

# a list for our forests
rfcList = []
predictionList = []

# making, like, a lot of forests
# update: my computer only has memory to do about 50 of these before it just absolutely freezes
for i in range(1, 101, 5):
    rfcList.append(RandomForestClassifier(random_state=1, n_estimators=i))

i=0
# now we gon train a lot of forests
for forest in rfcList:
    forest.fit(train_X, train_y)
    predictionList.append(forest.predict(val_X))
    print("i=", i, "\tn_estimators: ", (i+1)*5, "\tAccuracy: ", end='')
    #print(mean_absolute_error(predictionList[i], val_y))
    print(accuracy_score(val_y, predictionList[i]))
    i+=1

i= 0 	n_estimators:  5 	Accuracy: 0.6814929494957948
i= 1 	n_estimators:  10 	Accuracy: 0.766852169546843
i= 2 	n_estimators:  15 	Accuracy: 0.7782333988869827
i= 3 	n_estimators:  20 	Accuracy: 0.785555880999205
i= 4 	n_estimators:  25 	Accuracy: 0.7870203774216494
i= 5 	n_estimators:  30 	Accuracy: 0.7864764216075987
i= 6 	n_estimators:  35 	Accuracy: 0.7873551194610653
i= 7 	n_estimators:  40 	Accuracy: 0.7901585840411732
i= 8 	n_estimators:  45 	Accuracy: 0.7890706724130717
i= 9 	n_estimators:  50 	Accuracy: 0.7901167412862463
i= 10 	n_estimators:  55 	Accuracy: 0.7904514833256622
i= 11 	n_estimators:  60 	Accuracy: 0.7911628101594209
i= 12 	n_estimators:  65 	Accuracy: 0.790953596384786
i= 13 	n_estimators:  70 	Accuracy: 0.7917067659734717
i= 14 	n_estimators:  75 	Accuracy: 0.7919159797481066
i= 15 	n_estimators:  80 	Accuracy: 0.7919996652579606
i= 16 	n_estimators:  85 	Accuracy: 0.7930038913762082
i= 17 	n_estimators:  90 	Accuracy: 0.7932131051508431
i= 18 	n_estimators:  95

Interesting. The accuracy improves steadily, up to a point of around 79%%, where after that, adding more n_estimators starts providing diminishing returns, if any benefit at all when compared to the previous iteration. Having around 50 seems to be the best blend of performance and accuracy in this scenario.

Let's see if we can gain any insights on how this model predicts it's results, and look at feature importances (and maybe also a classification report for n_estimators = 50)

In [12]:
print(classification_report(val_y, predictionList[9]))

for i in range(len(labelFeatures)-1):
    print(labelFeatures[i],"importance:\n\t", rfcList[9].feature_importances_[i])

              precision    recall  f1-score   support

           0       0.79      0.90      0.84     10078
           1       0.85      0.08      0.15       207
           2       0.85      0.32      0.46        91
           3       0.86      0.69      0.77      1873
           4       0.00      0.00      0.00        14
           5       0.00      0.00      0.00         5
           6       0.77      0.73      0.75      4437
           7       0.00      0.00      0.00        54
           8       0.78      0.74      0.76      7140

    accuracy                           0.79     23899
   macro avg       0.55      0.38      0.41     23899
weighted avg       0.79      0.79      0.78     23899

sex_upon_outcome importance:
	 0.10709370637109218
outcome_weekday importance:
	 0.04330829056402263
animal_type importance:
	 0.056957546317882184
breed importance:
	 0.0668402458619814
color importance:
	 0.06114152229370092
found_location importance:
	 0.08950647326773599
intake_condition im

The most important features seem to be **age_upon_outcome**, **sex_upon_outcome**, 