## Shelter Animal Outcomes
#### MIDS W207 Final Project
#### Clay Miller, Roseanna Hopper, Yubo Zhang

### Introduction

Approximately 6.5 to 7.6 million companion animals enter the animal shelters across the U.S. each year. Each year, approximately 1.5 million shelter animals are euthanized (670,000 dogs and 860,000 cats). The number of dogs and cats euthanized in U.S. shelters annually has declined from approximately 2.6 million in 2011. This decline can be partially explained by an increase in the percentage of animals adopted and an increase in the number of stray animals successfully returned to their owners.

For this exploration, we are using a dataset of intake information including breed, color, sex, and age from the [Austin Animal Center](https://www.kaggle.com/c/shelter-animal-outcomes), to develop a model that can be used for shelters to predict the outcome for each animal.  We are hoping that by using this model, the shelter can provide a little bit of extra help for animals that have a low adoption rate. In addition, we are hoping this dataset can help us to provide some key findings (for example, if age and gender would impact the adoption rate for dogs, if neutered or spayed cats are more likely to be adopted), understand which factors impact the chance of adoption, and identify differences between dog and cat adoption trends.  

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import tensorflow 
import keras
import itertools

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction import FeatureHasher, DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from bokeh.charts import Bar, output_file, show, output_notebook
from keras.models import Sequential
from keras.layers import Dense, Activation
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from IPython.core.display import Image, display
from sklearn.externals.six import StringIO
from IPython.display import Image  
from sklearn import tree

output_notebook()


%matplotlib inline

### Raw Data

Each case in our raw training data is an individual animal, and the features include its characteristics (including name, animal type, age, gender, spay/neuter status, breed, coloring, and final outcome). Apart from age, all of the fields are categorical variables, so we’ll be creating binary variables for the majority of the fields (and their intersections). There are few missing values, and some cases where a field is marked as “Unknown” (appears in Name, Breed, and Gender). Overall, we are trying to predict the animal’s “final outcome”, of which there are five possibilities.


In [24]:
# Load the data
data = pd.read_csv('../data/test.csv')
breeds = pd.read_csv('../data/breeds.csv')
breeds['Breed'] = breeds['Breed'].str.strip()
top_breed_list = []
for b in breeds['Breed']:
    top_breed_list.append(b.strip())

data['Female'] = 'Female' in data['SexuponOutcome']
data['AgeuponOutcome'].fillna('', inplace = True)

#Create a continuous variable for age, making sure
#that all listed ages are on the same scale (months)
def ageConvert(age):
    regexyear = '(\d+) year'
    regexmnth = '(\d+) month'
    regexwk = '(\d+) week'
    regexday = '(\d+) day'
    if re.match(regexyear, age):
        const = int(re.match(regexyear, age).groups()[0])
        return const*52
    elif re.match(regexmnth, age):
        const = int(re.match(regexmnth, age).groups()[0])
        return const*4.5 # a month is roughly 4.5 weeks
    elif re.match(regexwk, age):
        return int(re.match(regexwk, age).groups()[0])
    elif re.match(regexday, age):
        const = int(re.match(regexday, age).groups()[0])
        return const/7 #7 days in a week
    else:
        return None
    
data['ConvertedAge']=data['AgeuponOutcome'].apply(ageConvert)


#Separate SexuponOutcome into Male/Female and into Intact/Spayed-Neutered
def female(i):
    i = str(i)
    if i.find('Female') >= 0: return 'Female'
    if i.find('Unknown') >= 0: return 'Unknown'
    return 'Male'
data['Female'] = data.SexuponOutcome.apply(female)

def intact(i):
    i = str(i)
    if i.find('Intact') >= 0: return 'Intact'
    if i.find('Unknown') >= 0: return 'Unknown'
    return 'Spayed/Neutered'
data['Intact'] = data.SexuponOutcome.apply(intact)

def mixed_breed(i):
    i = str(i)
    if i.find('Mix') >= 0: return 'Mixed Breed'
    if i.find('/') >= 0: return 'Known Breed Combo'
    return 'Nonmixed'
data['MixedBreed'] = data.Breed.apply(mixed_breed)

def top_breed(i):
    i = str(i)
    if any(word in i for word in top_breed_list):
        return int(1)
    else:
        return int(0)
data['TopBreed'] = data.Breed.apply(top_breed)

def breed_rank(i):
    i = str(i)
    ranks = []
    for word in top_breed_list:
        if word in i:
            ranks.append(int(breeds.loc[breeds['Breed'] == word]['2007']))
    if len(ranks) > 0:
        return np.mean(ranks)
    else:
        return 51.0
data['BreedRank'] = data.Breed.apply(breed_rank)

def pit_bull(i):
    i = str(i)
    if i.find("Pit Bull") >=0: return int(1)
    else: return int(0)
data['PitBull'] = data.Breed.apply(pit_bull)

def black_cat(i):
    i = str(i)
    if i == "Black": return int(1)
    else: return int(0)
data['BlackCat'] = data.Color.apply(black_cat)

def naming(i):
    if pd.isnull(i): return 'Unnamed'
    return 'Named'
data['Named'] = data.Name.apply(naming)

#Change all breed and color strings so that they are ordered consistently
#E.G. all "brown/black" and "black/brown" should become "black, brown"
def reorder(i):
    i = str(i)
    if i.find(" ") >= 0: i = i.replace(" ", "-")
    if i.find("/") >= 0: i = i.replace("/", " ")
    i = i.split()
    i = sorted(i)
    i = ' '.join(i)
    return i

data['OrderedColor'] = data.Color.apply(reorder)
data['OrderedBreed'] = data.Breed.apply(reorder)


In [25]:
#Importing this because multiple deprecation warnings cluttering the output
import warnings
warnings.filterwarnings('ignore')

continuous = ['ConvertedAge', 'BreedRank']
discrete = [
    'AnimalType',
    'Female',
    'Intact',
    'MixedBreed',
    'Named',
    'TopBreed',
    'PitBull',
    'BlackCat'
]
target = 'OutcomeType'

#For those missing an age, fill with the median age by animal type
data["ConvertedAge"] = data.groupby("AnimalType").transform(lambda x: x.fillna(x.median()))
data[continuous].describe().T



Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ConvertedAge,11456.0,5728.5,3307.206676,1.0,2864.75,5728.5,8592.25,11456.0
BreedRank,11456.0,36.202732,20.205699,1.0,12.0,51.0,51.0,51.0


In [27]:
#Turn categorical variables into binaries
data2 = pd.concat([data[continuous], pd.get_dummies(data[discrete])], axis=1)



In [30]:
discrete = ['AnimalType_Cat', 'AnimalType_Dog', 'Female_Female', 'Female_Male', 'Female_Unknown',
           'Intact_Intact', 'Intact_Spayed/Neutered', 'Intact_Unknown', 'MixedBreed_Known Breed Combo',
           'MixedBreed_Mixed Breed', 'MixedBreed_Nonmixed', 'Named_Named', 'Named_Unnamed']


predictors = continuous + discrete



# Train/test split on the full dataset
X = data2[predictors]


#Normalize the continuous variables
ss = StandardScaler()
ss.fit(X_train[continuous])   # Compute mean and std of training data
X_train[continuous] = ss.transform(X_train[continuous])  # Use that mean and std to normalize columns of training data


In [33]:
X_train.to_csv('testdata_edid.csv')