First, load the data into a pandas DataFrame:

In [65]:
import pandas as pd

ShelterDF = pd.read_csv('../input/aac_shelter_outcomes.csv')
ShelterDF.head(20)

The basic question we are trying to answer is: what are the major factors in an animal's outcome?  
The intended audience/client is the animal shelter and anyone involved with animal welfare.

A few ideas for what factors will best predict an animal's outcome:
Age, Animal Type, Breed, Color, Name, Outcome Subtype, and Sex

Pull up the basic info about this DataFrame to see how many blank or "null" entries we are dealing with:

In [66]:
ShelterDF.info()

Another way to look at it:

In [67]:
print("Total number of Null Values for Each Column:")
ShelterDF.isnull().sum()

So it looks like we have 8 missing values for "age_upon_outcome", 23886 for "name", 42293 for "outcome_subtype", 12 for "outcome_type", and 2 for "sex_upon_outcome".
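Raw null counts are easier to judge next to the column size, so here is a minimal sketch of turning them into fractions. The `toy` frame and its values are made up for illustration; on the real data the same expression would be applied to `ShelterDF`.

```python
import pandas as pd

# Toy stand-in for ShelterDF; the values are invented for illustration only.
toy = pd.DataFrame({
    "name": ["Rex", None, "Milo", None],
    "outcome_type": ["Adoption", "Transfer", None, "Adoption"],
})

# isnull() returns a boolean frame, so the column mean is the fraction of nulls.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

A column that is mostly null is a different problem from one that is missing only a handful of rows, and the fraction makes that visible at a glance.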

Before we deal with the nulls, let's drop a few columns we know we're not going to use:

In [68]:
ShelterDropDF = ShelterDF[['age_upon_outcome', 'animal_type', 'breed', 'color', 'name', 'outcome_subtype', 'outcome_type', 'sex_upon_outcome']]
ShelterDropDF.head()

We won't be using animal_id, date_of_birth (since we already have the age at outcome), datetime, or monthyear (I'm not sure what these last two columns represent, and no description was provided in Kaggle's metadata).

Now let's have another look at the nulls for our updated DataFrame:

In [69]:
print("Total number of Null Values for Each Column:")
ShelterDropDF.isnull().sum()

Let's drop all null rows for outcome_type:

In [70]:
ShelterDropDF = ShelterDropDF.dropna(subset=['outcome_type']) 
ShelterDropDF.isnull().sum()

Now, let's run a correlation chart to see if we can find some insights through our EDA:

In [71]:
import seaborn as sns

cm = sns.light_palette("green", as_cmap=True)
s = ShelterDropDF.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1).style.background_gradient(cmap=cm)
s

It looks like animal_type shows the strongest positive correlation with outcome_type.  The rest seem to be close to no correlation.

The problem with this method, however, is that it replaces the categorical variables with integer codes, which imply a rank (such as 1, 2, 3) when there is no justification for one:

In [72]:
RankExampleDF = ShelterDropDF.apply(lambda x : pd.factorize(x)[0])
RankExampleDF.head()
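One way around the rank problem is a measure built for categorical pairs. The sketch below uses Cramér's V, which is my own substitution and not something the analysis above ran. The helper name `cramers_v` and the `toy` frame are hypothetical, and this is the plain version without bias correction.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V: 0 means no association, 1 means perfect association."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: (row total * column total) / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

# Toy data (invented): animal_type fully determines outcome_type,
# while sex is spread evenly across animal types.
toy = pd.DataFrame({
    "animal_type": ["Dog"] * 4 + ["Cat"] * 4,
    "outcome_type": ["Adoption"] * 4 + ["Transfer"] * 4,
    "sex": ["Male", "Female"] * 4,
})

v_outcome = cramers_v(toy["animal_type"], toy["outcome_type"])
v_sex = cramers_v(toy["animal_type"], toy["sex"])
print(v_outcome, v_sex)
```

Unlike Pearson on factorized codes, the result does not depend on the order in which the categories happen to be numbered.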

Next, let's plot some graphs to visualize our data and see how each of our feature variables compares to our target variable, outcome_type:

In [73]:
print("Outcome Type Variables:")
ShelterDropDF.outcome_type.unique()

In [74]:
Adopted = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Adoption'].groupby('animal_type').size()
Transferred = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Transfer'].groupby('animal_type').size()
Euthanized = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Euthanasia'].groupby('animal_type').size()
Returned = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Return to Owner'].groupby('animal_type').size()
Died = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Died'].groupby('animal_type').size()
Disposed = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Disposal'].groupby('animal_type').size()
Relocated = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Relocate'].groupby('animal_type').size()
Missing = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Missing'].groupby('animal_type').size()
Rto_Adopt = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Rto-Adopt'].groupby('animal_type').size()

data = pd.concat([Adopted, Transferred, Euthanized, Returned, Died, Disposed, Relocated, Missing, Rto_Adopt], axis=1)
data.columns = ['Adopted', 'Transferred', 'Euthanized', 'Returned', 'Died', 'Disposed', 'Relocated', 'Missing', 'Rto_Adopt']
data.plot.bar(title='Animal Type Vs. Outcome Type')

It looks like Dogs were most likely to be adopted, followed by Cats.  Dogs were also by far the most likely to be returned to their original owner.  Cats appeared to be the most likely to be transferred, followed closely by Dogs.  "Other" animals were most likely to be euthanized.
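The chart above is built from raw counts, so the larger groups dominate it. As a rough sketch, `pd.crosstab` can produce the same table in one call, and `normalize="index"` turns each row into shares so the animal types can be compared directly. The `toy` frame is invented for illustration; the real call would use `ShelterDropDF`.

```python
import pandas as pd

# Toy stand-in for ShelterDropDF; the rows are invented for illustration.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Dog", "Cat", "Cat", "Other", "Dog"],
    "outcome_type": ["Adoption", "Return to Owner", "Transfer",
                     "Adoption", "Euthanasia", "Adoption"],
})

# Counts of each outcome per animal type (same shape as the concat above).
counts = pd.crosstab(toy["animal_type"], toy["outcome_type"])

# Row-normalized version: each animal type's outcomes sum to 1.
shares = pd.crosstab(toy["animal_type"], toy["outcome_type"], normalize="index")

print(counts)
print(shares.round(2))
```

Calling `shares.plot.bar(stacked=True)` on the result would give a chart where "most likely" can be read off per animal type rather than per raw count.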

In [75]:
Adopted = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Adoption'].groupby('sex_upon_outcome').size()
Transferred = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Transfer'].groupby('sex_upon_outcome').size()
Euthanized = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Euthanasia'].groupby('sex_upon_outcome').size()
Returned = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Return to Owner'].groupby('sex_upon_outcome').size()
Died = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Died'].groupby('sex_upon_outcome').size()
Disposed = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Disposal'].groupby('sex_upon_outcome').size()
Relocated = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Relocate'].groupby('sex_upon_outcome').size()
Missing = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Missing'].groupby('sex_upon_outcome').size()
Rto_Adopt = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Rto-Adopt'].groupby('sex_upon_outcome').size()

data = pd.concat([Adopted, Transferred, Euthanized, Returned, Died, Disposed, Relocated, Missing, Rto_Adopt], axis=1)
data.columns = ['Adopted', 'Transferred', 'Euthanized', 'Returned', 'Died', 'Disposed', 'Relocated', 'Missing', 'Rto_Adopt']
data.plot.bar(title='Animal Sex Vs. Outcome Type')

It looks like fixed (spayed or neutered) animals were far more likely to be adopted than any of the others.  Animals of unknown sex were most likely to be euthanized.  This may be because most of the Unknowns belonged to the "Other" category, which contains uncommon pets whose sex is difficult to determine and which were unlikely to be adopted anyway.  

In [76]:
Adopted = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Adoption'].groupby('outcome_subtype').size()
Transferred = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Transfer'].groupby('outcome_subtype').size()
Euthanized = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Euthanasia'].groupby('outcome_subtype').size()
Returned = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Return to Owner'].groupby('outcome_subtype').size()
Died = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Died'].groupby('outcome_subtype').size()
Disposed = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Disposal'].groupby('outcome_subtype').size()
Relocated = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Relocate'].groupby('outcome_subtype').size()
Missing = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Missing'].groupby('outcome_subtype').size()
Rto_Adopt = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Rto-Adopt'].groupby('outcome_subtype').size()

data = pd.concat([Adopted, Transferred, Euthanized, Returned, Died, Disposed, Relocated, Missing, Rto_Adopt], axis=1)
data.columns = ['Adopted', 'Transferred', 'Euthanized', 'Returned', 'Died', 'Disposed', 'Relocated', 'Missing', 'Rto_Adopt']
data.plot.bar(title='Outcome Subtype Vs. Outcome Type')

Based on this data, we find that animals with the Partner subtype were most likely to be transferred, while Foster animals were most likely to be adopted.  Animals with a Rabies Risk or Suffering subtype were most likely to be euthanized.

In [77]:
print("Breed Variables w/ Number of those Variables in DataFrame:")
ShelterDropDF.breed.value_counts()

In [78]:
print("Color Variables w/ Number of those Variables in DataFrame:")
ShelterDropDF.color.value_counts()

The rest of the columns have too many distinct values to show in a graph, so we just won't visualize them for now.
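If we did want to chart a high-cardinality column later, one option is to keep the most frequent values and lump everything else together. This is only a sketch of that idea: the helper `lump_rare`, the `top_n` cutoff, and the sample colors are all hypothetical.

```python
import pandas as pd

def lump_rare(series: pd.Series, top_n: int, other_label: str = "Other") -> pd.Series:
    """Keep the top_n most frequent values and replace the rest with other_label.

    Note: missing values are also replaced with other_label.
    """
    top_values = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top_values), other_label)

# Invented sample values standing in for ShelterDropDF.color.
colors = pd.Series(["Black", "Black", "White", "Black", "Tan", "White", "Blue Merle"])
lumped = lump_rare(colors, top_n=2)
print(lumped.value_counts())
```

The cutoff is a judgment call, and a small `top_n` hides detail, so it is worth checking how many rows end up in the lumped bucket.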

I think that we're ready to go into machine learning now:  

So my idea is to create multiple models.  Some will contain only one feature variable, and some will contain multiple feature variables.  

Here are the feature variables of each model I will create:
1) (all feature variables)
2) animal_type
3) breed
4) color 
5) name
6) outcome_subtype
7) sex_upon_outcome

Then, I will combine the best performing single feature variable models to create an optimized multiple variable model, which will hopefully work the best.

On second thought, there are just too many distinct values for breed, name, and color.  I think that I'll drop them.

In [79]:
# Keep only the low-cardinality columns; this drops breed, color, and name (and age_upon_outcome):
ShelterDropDF = ShelterDropDF[['animal_type', 'outcome_subtype', 'outcome_type', 'sex_upon_outcome']]
# Create the different DataFrames I'll be building my models from:
MLAll = ShelterDropDF.dropna()
MLAnimal = ShelterDropDF[['animal_type', 'outcome_type']].dropna()
# I only drop null rows per model, so each single-feature model keeps as many rows as possible:
MLSubtype = ShelterDropDF[['outcome_subtype', 'outcome_type']].dropna()
MLSex = ShelterDropDF[['sex_upon_outcome', 'outcome_type']].dropna()

# Model 1 (All Feature Variables):

First, we are going to create a Decision Tree Classifier:

In [146]:
df = pd.get_dummies(MLAll[['animal_type', 'outcome_subtype', 'sex_upon_outcome']])
X = df
y = MLAll.outcome_type

In [147]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

clf = DecisionTreeClassifier(max_depth=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)

clf.fit(X_train, y_train)

In [148]:
predictions = clf.predict(X_test)

accuracy_score(y_test, predictions)

In [149]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
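An accuracy number is hard to read on its own when one outcome is much more common than the others. As a hedged sketch, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` gives a majority-class baseline to compare against. The `toy` data and the `*_toy` names below are invented; in the notebook the same baseline would be fit on `X_train` and `y_train`.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data (invented): 70% of rows are 'Adoption'.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Cat"] * 10,
    "outcome_type": ["Adoption"] * 14 + ["Transfer"] * 6,
})
X_toy = pd.get_dummies(toy[["animal_type"]])
y_toy = toy["outcome_type"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

# The baseline ignores the features and always predicts the most common class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))
print(baseline_accuracy)
```

A model is only adding information if it beats this number by a meaningful margin.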

Now, let's try a Random Forest Classifier:

In [150]:
from sklearn.ensemble import RandomForestClassifier

rfclf = RandomForestClassifier()
rfclf.fit(X_train, y_train)

In [151]:
rfclf_predictions = rfclf.predict(X_test)

accuracy_score(y_test, rfclf_predictions)

In [152]:
rfclf_predictions = rfclf.predict(X_test)

print(classification_report(y_test, rfclf_predictions))
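The encode, split, fit, and score steps are the same for every model in this notebook, so they could be wrapped in one function. This is a sketch only: `evaluate_features` is a hypothetical helper, the `toy` frame is invented, and the `random_state=0` on the classifiers is my addition for reproducibility (the cells here leave it unset).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate_features(frame, feature_cols, target_col="outcome_type"):
    """Return held-out accuracy for a decision tree and a random forest."""
    # Drop nulls only for the columns this model actually uses.
    data = frame[feature_cols + [target_col]].dropna()
    X = pd.get_dummies(data[feature_cols])
    y = data[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    models = {
        "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
        "random_forest": RandomForestClassifier(random_state=0),
    }
    return {
        name: accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
        for name, model in models.items()
    }

# Toy data (invented): animal_type fully determines the outcome.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Cat"] * 20,
    "outcome_type": ["Adoption", "Transfer"] * 20,
})
scores = evaluate_features(toy, ["animal_type"])
print(scores)
```

With the real data, a call such as `evaluate_features(ShelterDropDF, ['sex_upon_outcome'])` would cover one single-feature model, which keeps the comparisons consistent across feature sets.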

# Model 2 (Animal Type Feature Variable):

In [153]:
df = pd.get_dummies(MLAnimal[['animal_type']])
X = df
y = MLAnimal.outcome_type

clf = DecisionTreeClassifier(max_depth=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)

clf.fit(X_train, y_train)

In [154]:
predictions = clf.predict(X_test)

accuracy_score(y_test, predictions)

In [155]:
print(classification_report(y_test, predictions))

In [156]:
rfclf = RandomForestClassifier()
rfclf.fit(X_train, y_train)

In [157]:
rfclf_predictions = rfclf.predict(X_test)

accuracy_score(y_test, rfclf_predictions)

In [158]:
rfclf_predictions = rfclf.predict(X_test)

print(classification_report(y_test, rfclf_predictions))

# Model 3 (Outcome Subtype Feature Variable):

In [159]:
df = pd.get_dummies(MLSubtype[['outcome_subtype']])
X = df
y = MLSubtype.outcome_type

clf = DecisionTreeClassifier(max_depth=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)

clf.fit(X_train, y_train)

In [163]:
predictions = clf.predict(X_test)

accuracy_score(y_test, predictions)

In [164]:
print(classification_report(y_test, predictions))

In [165]:
rfclf = RandomForestClassifier()
rfclf.fit(X_train, y_train)

In [166]:
rfclf_predictions = rfclf.predict(X_test)

accuracy_score(y_test, rfclf_predictions)

In [167]:
rfclf_predictions = rfclf.predict(X_test)

print(classification_report(y_test, rfclf_predictions))

# Model 4 (Animal Sex Feature Variable):

In [168]:
df = pd.get_dummies(MLSex[['sex_upon_outcome']])
X = df
y = MLSex.outcome_type

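# Quick check: fit and score on the full data.
# This is training accuracy, not a held-out estimate.
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, y)
y_hat = clf.predict(X)
accuracy_score(y, y_hat)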

In [169]:
clf = DecisionTreeClassifier(max_depth=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)

clf.fit(X_train, y_train)

In [170]:
predictions = clf.predict(X_test)

accuracy_score(y_test, predictions)

In [171]:
print(classification_report(y_test, predictions))

In [172]:
rfclf = RandomForestClassifier()
rfclf.fit(X_train, y_train)

In [173]:
rfclf_predictions = rfclf.predict(X_test)

accuracy_score(y_test, rfclf_predictions)

In [174]:
print(classification_report(y_test, rfclf_predictions))

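It looks like outcome_subtype performs the best and is the most reliable feature variable for predicting outcome_type, according to the machine learning models. Keep in mind, though, that outcome_subtype appears to be recorded as part of the outcome itself (Partner goes with transfers, Suffering with euthanasia), so it may not be available when trying to predict an outcome ahead of time.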