***Sam Cressman Capstone Project: Shelter Animal Outcomes***

***Help improve outcomes for shelter animals***

***Capstone inspiration:*** [Kaggle](https://www.kaggle.com/c/shelter-animal-outcomes)

[Dataset from the City of Austin open data portal (pulled 6/20/18)](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238)

"Every year, approximately 7.6 million companion animals end up in US shelters. Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. 2.7 million dogs and cats are euthanized in the US every year. <br>

Using a dataset of intake information including breed, color, sex, and age from the Austin Animal Center, we're asking Kagglers to predict the outcome for each animal. <br>

We also believe this dataset can help us understand trends in animal outcomes. These insights could help shelters focus their energy on specific animals who need a little extra help finding a new home. We encourage you to publish your insights on Scripts so they are publicly accessible." <br>

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
animals = pd.read_csv("Austin_Animal_Center_Outcomes.csv")

***Quick EDA***

In [3]:
animals.head();

In [4]:
# All object columns

# animals.info()

In [5]:
# Most null values are in Name and Outcome Subtype

animals.isnull().sum();

In [6]:
animals.shape;

***Column by column***:

***Animal ID***: dropping column (acts as an index)

***Name***: over 25,000 null values: transforming into 1 if the animal has a name, 0 otherwise

***DateTime***: time of the outcome event, stored as object type: splitting into day/month/year through string slicing, then converting into a DateTime object so that Date of Birth can be subtracted from it to create Age at Outcome in years (will dummy each new day/month/year column, will not model using the DateTime column itself)

***MonthYear***: dropping column (exactly the same as DateTime)

***Date of Birth***: splitting into day/month/year through string slicing, then converting into a DateTime object so that it can be subtracted from the outcome time to create Age at Outcome in years (will dummy each new column, will not model using this column)

***Outcome Type***: target, dropping 12 null values, mapping/converting into numerical values

***Outcome Subtype***: approximately half null: will keep column for EDA/visualization purposes and to further examine Outcome Type but will not include in models

***Animal Type***: 5 types: mainly Dog and Cat, but also Bird, Livestock, and Other (Other contains 99 different species) (will dummy column)

***Sex upon Outcome***: Neutered Male, Spayed Female, Intact Male, Intact Female, Unknown (mainly Animal Type Other) (will dummy column)

***Age upon Outcome***: creating a new column, Age at Outcome, by subtracting Date of Birth from the time of outcome to get an age in years (then will drop the original Age upon Outcome column)

***Breed***: running through Natural Language Processing Count Vectorizer to create/combine Breed features (2223 unique breed combinations) and address cleanliness issues (spacing, "/" characters, "Mix" and "mix")

***Color***: running through Natural Language Processing Count Vectorizer to create/combine Color features (538 unique color combinations)

***Animal ID***

In [7]:
# Dropping column

In [8]:
animals.drop(columns=["Animal ID"], axis = 1, inplace=True)

***Name***

In [9]:
# Converting into 1 if name, 0 otherwise due to high number of null values

animals["Name"].isnull().sum();

In [10]:
# We see some names somewhat frequently

animals["Name"].value_counts()[0:10];

In [11]:
animals["has_name"] = animals['Name'].notnull().astype(int)
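The flag above relies on `notnull()` returning booleans that `astype(int)` turns into 1/0. A minimal sketch on made-up data (the `toy` frame and its names are hypothetical, not rows from the Austin dataset):

```python
import pandas as pd

# Hypothetical toy data: two named animals and one without a name
toy = pd.DataFrame({"Name": ["Max", None, "Bella"]})

# notnull() yields True/False; astype(int) converts that to 1/0
toy["has_name"] = toy["Name"].notnull().astype(int)

has_name_values = toy["has_name"].tolist()
```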

In [12]:
# Dropping Name column

animals.drop(columns=["Name"], axis = 1, inplace=True)

***DateTime***

In [13]:
# DateTime is time of outcome: will rename for clarity:

animals = animals.rename(columns={"DateTime": "Outcome Time"})

In [14]:
# Object type: will split out day, month, year

animals["Outcome Day"] = [day[3:5] for day in animals["Outcome Time"]]

In [15]:
animals["Outcome Month"] = [month[0:2] for month in animals["Outcome Time"]]

In [16]:
animals["Outcome Year"] = [year[6:10] for year in animals["Outcome Time"]]

In [17]:
# Converting to DateTime object to help create Age upon Outcome in years

animals["Outcome Time"] = pd.to_datetime(animals["Outcome Time"])
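String slicing works only while every value keeps the same `MM/DD/YYYY` layout. As an alternative sketch (not the approach used in this notebook), the same parts can be read from the parsed datetime through the `.dt` accessor. The sample timestamps and the `format` string below are assumptions about the layout, not values taken from the dataset:

```python
import pandas as pd

# Hypothetical timestamps in an assumed MM/DD/YYYY hh:mm:ss AM/PM layout
toy = pd.DataFrame({"Outcome Time": ["06/20/2018 11:45:00 AM",
                                     "01/05/2017 09:00:00 AM"]})

# Parse once, then pull the date parts from the datetime values
parsed = pd.to_datetime(toy["Outcome Time"], format="%m/%d/%Y %I:%M:%S %p")

toy["Outcome Day"] = parsed.dt.day
toy["Outcome Month"] = parsed.dt.month
toy["Outcome Year"] = parsed.dt.year
```

Note that this yields integers (for example `5`) rather than zero-padded strings (`"05"`), so dummy column names would differ slightly from those produced by slicing.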

***MonthYear***

In [18]:
# Dropping column: exactly the same as DateTime (now Outcome Time)

animals.drop(columns = "MonthYear", axis = 1, inplace = True)

***Date of Birth***

In [19]:
# Object type: will split out day, month, year

animals["Birth Day"] = [date[3:5] for date in animals["Date of Birth"]]

In [20]:
animals["Birth Month"] = [month[0:2] for month in animals["Date of Birth"]]

In [21]:
animals["Birth Year"] = [year[6:10] for year in animals["Date of Birth"]]

In [22]:
# Converting from object to DateTime

animals["Date of Birth"] = pd.to_datetime(animals["Date of Birth"])

***Age Upon Outcome***

In [23]:
# Creating a new column by subtracting Date of Birth from Outcome Time
# to get an age in years (then will drop Age upon Outcome)

animals["Age at Outcome"] = animals["Outcome Time"] - animals["Date of Birth"]

# animals.drop(columns = "Age upon Outcome", axis = 1, inplace = True)

In [24]:
# Converting into string to slice out just the number of days

animals["Age at Outcome"] = animals["Age at Outcome"].astype(str)

In [25]:
# Splitting on space

animals["Age at Outcome"] = animals["Age at Outcome"].str.split(" ")

In [26]:
animals["Age at Outcome"] = [age_days[0] for age_days in animals["Age at Outcome"]]

animals["Age at Outcome"] = animals["Age at Outcome"].astype(float)

In [27]:
# Dividing days by 365 to get float in years value

animals["Age at Outcome"] = animals["Age at Outcome"] / 365
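The string round trip above (timedelta to string, split, first token, float) can also be expressed directly on the timedelta values. A minimal sketch of that alternative, using hypothetical dates rather than rows from the dataset:

```python
import pandas as pd

# Hypothetical outcome and birth dates
toy = pd.DataFrame({
    "Outcome Time": pd.to_datetime(["2018-06-20 11:45:00", "2017-01-05 09:00:00"]),
    "Date of Birth": pd.to_datetime(["2016-06-20", "2016-01-06"]),
})

# Subtracting two datetime columns gives a timedelta column
age = toy["Outcome Time"] - toy["Date of Birth"]

# .dt.days keeps the whole-day component; dividing by 365 gives approximate years
toy["Age at Outcome"] = age.dt.days / 365
```

Dividing by 365 ignores leap days, which matches the calculation in the cells above.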

In [28]:
# Dropping old column

animals.drop(columns = "Age upon Outcome", axis = 1, inplace = True)

***Outcome Type***

In [29]:
# Outcome Type: target for modeling (what we are attempting to predict)
# Will map dictionary values since multi class

animals["Outcome Type"].unique();

In [30]:
animals["Outcome Type"].isnull().sum();

In [31]:
# 12 null values: dropping those rows

animals.dropna(subset = ["Outcome Type"], inplace = True)

In [32]:
outcome_type_dict = {"Adoption": 0, "Return to Owner": 1, "Euthanasia": 2,
                     "Transfer": 3, "Rto-Adopt": 4, "Died": 5, "Disposal": 6,
                     "Missing": 7, "Relocate": 8}

In [33]:
animals["Outcome Type"] = animals["Outcome Type"].map(outcome_type_dict)
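One thing to keep in mind with `Series.map` and a dictionary: any label missing from the dictionary becomes `NaN` without raising an error. A small sketch of how that could be checked, using a shortened dictionary and a made-up label (`"Unlisted Outcome"` is hypothetical):

```python
import pandas as pd

# Shortened, illustrative subset of the mapping
toy_outcome_dict = {"Adoption": 0, "Return to Owner": 1, "Euthanasia": 2}

toy = pd.Series(["Adoption", "Euthanasia", "Unlisted Outcome"])

# Labels missing from the dictionary map to NaN
mapped = toy.map(toy_outcome_dict)

# Listing the labels that were not covered by the dictionary
unmapped_labels = toy[mapped.isnull()].unique().tolist()
```

Running a check like this right after mapping would reveal any outcome type that appears in a newer pull of the data but not in the dictionary.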

***Outcome Subtype***

In [34]:
# Outcome Subtype: approximately half null: will keep column for EDA/visualization purposes 
# and to further examine against Outcome Type but will not include in modeling

animals["Outcome Subtype"].value_counts();

***Animal Type***

In [35]:
# 5 types (mainly Dog and Cat) but also Bird, Livestock, and Other 
# (Other contains 99 different species)(will dummy column)

animals["Animal Type"].unique();

In [36]:
# Mainly dogs and cats

animals["Animal Type"].value_counts();

In [37]:
# 99 unique "Other" breeds

animals[animals["Animal Type"] == "Other"]["Breed"].nunique() 
animals[animals["Animal Type"] == "Other"]["Breed"].unique();

***Sex upon Outcome***

In [38]:
animals["Sex upon Outcome"].unique()

array(['Spayed Female', 'Neutered Male', 'Unknown', 'Intact Female',
       'Intact Male', nan], dtype=object)

In [39]:
# Mainly neutered males and spayed females

animals["Sex upon Outcome"].value_counts();

In [40]:
# One null value, many Unknown values

animals["Sex upon Outcome"].isnull().sum();

In [41]:
# Dropping 1 null value

animals.dropna(subset = ["Sex upon Outcome"], inplace = True)

In [42]:
# 3989 of 7207 Unknown Sex upon Outcome are "Other" Animal Type

mask = (animals["Sex upon Outcome"] == "Unknown") & (animals["Animal Type"] == "Other")
len(animals[mask]);

***Dummy Columns***

In [43]:
animals.columns;

In [44]:
animals = pd.get_dummies(data = animals, columns = ["Animal Type",
                                          "Sex upon Outcome", "Outcome Day", 
                                          "Outcome Month", "Outcome Year", 
                                          "Birth Day", "Birth Month", "Birth Year"])

In [45]:
animals.columns;

In [46]:
animals.shape;
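For reference, a minimal sketch of what `pd.get_dummies` does when given a `columns` list: each listed column is replaced by one indicator column per category, named `<column>_<value>`, while the other columns pass through unchanged. The `toy` frame is hypothetical:

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({"Animal Type": ["Dog", "Cat", "Dog"],
                    "has_name": [1, 0, 1]})

# Only the listed column is expanded; has_name is left as-is
dummied = pd.get_dummies(data=toy, columns=["Animal Type"])

dummy_columns = dummied.columns.tolist()
```

Depending on the pandas version the indicator columns hold booleans or 0/1 integers; `astype(int)` makes them uniform if needed.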

***Breed***

[Count Vectorizer Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [47]:
# Running through Natural Language Processing Count Vectorizer to create/combine Breed features (2223 unique breeds) 
# and address cleanliness issues (spacing, "/" characters, "Mix" and "mix")

In [48]:
animals["Breed"].value_counts();

In [49]:
animals["Breed"].nunique();

In [50]:
# Cleaning

animals["Breed"] = animals["Breed"].str.replace("Mix", "", case = False)  # case-insensitive: covers both "Mix" and "mix"
animals["Breed"] = animals["Breed"].str.replace("/", " ")
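A plain substring replace also removes "mix" when it occurs inside a longer word and leaves stray spaces behind. A slightly stricter variant, sketched here on hypothetical breed strings and not taken from the notebook, uses a word-boundary regex and then collapses whitespace:

```python
import pandas as pd

# Hypothetical breed strings showing the three cleanliness issues
toy = pd.Series(["Labrador Retriever Mix", "Pit Bull/Boxer", "Domestic Shorthair mix"])

cleaned = (
    toy.str.replace(r"\bmix\b", "", case=False, regex=True)  # drop the standalone word "mix"
       .str.replace("/", " ", regex=False)                   # treat "/" as a word separator
       .str.split().str.join(" ")                            # collapse leftover whitespace
)

cleaned_values = cleaned.tolist()
```

Since `CountVectorizer` tokenizes on word boundaries anyway, the extra whitespace step mostly matters for readability of `value_counts()` output.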

In [51]:
# Setting up column/text to Vectorize

corpus = animals["Breed"].tolist()
corpus[0:10];

In [52]:
# Count Vectorizer: creating a new dataframe of just the top 100 2-3 word length features from Breed

cvec = CountVectorizer(analyzer = "word", ngram_range = (2, 3), lowercase = True, max_features = 100)
cvec.fit(corpus);
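To make the effect of `ngram_range = (2, 3)` concrete, here is a small sketch on a three-item toy corpus (hypothetical, not the real Breed column). With a minimum n-gram length of 2, a single-word breed produces no features at all, so its row is all zeros:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: two two-word breeds and one single-word breed
toy_corpus = ["Domestic Shorthair", "Pit Bull", "Beagle"]

toy_cvec = CountVectorizer(analyzer="word", ngram_range=(2, 3), lowercase=True)
counts = toy_cvec.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each n-gram to its column position in the count matrix
features = sorted(toy_cvec.vocabulary_, key=toy_cvec.vocabulary_.get)
```

Here `features` contains only `"domestic shorthair"` and `"pit bull"`. If single-word breeds should be represented, one option would be `ngram_range = (1, 3)`; whether that helps the models is not something this sketch tests.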

In [53]:
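# Reusing the index of animals so the concat below lines up row by row
# (rows were dropped earlier, so the index of animals has gaps)
# get_feature_names_out() requires scikit-learn 1.0+; older versions used get_feature_names()

df_breeds_vec = pd.DataFrame(cvec.transform(corpus).toarray(), columns = cvec.get_feature_names_out(), index = animals.index)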

In [54]:
df_breeds_vec.shape;

In [55]:
df_breeds_vec.head();

In [56]:
df_breeds_vec.columns;

In [57]:
# Concating new Breed features with animals

animals = pd.concat([animals, df_breeds_vec], axis = 1)
animals.columns;
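`pd.concat(..., axis = 1)` aligns rows on index labels, not on position. When one frame has had rows dropped and the other carries a fresh default index, the result can gain extra rows and `NaN` values. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd

# left has lost a row, so its index labels are 0, 2, 3
left = pd.DataFrame({"Breed": ["a", "b", "c", "d"]}).drop(index=1)

# right was built from left's three rows but has default labels 0, 1, 2
right = pd.DataFrame({"feature": [10, 20, 30]})

# Label-based alignment: four rows, with NaN wherever a label is missing on one side
misaligned = pd.concat([left, right], axis=1)

# Giving right the same index as left restores row-by-row alignment
right_aligned = right.copy()
right_aligned.index = left.index
aligned = pd.concat([left, right_aligned], axis=1)
```

Comparing `misaligned.shape` with `aligned.shape` after a concat is a quick sanity check that no rows were added.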

In [59]:
# Below is "hard coded" method of examining the top 30 breeds and manually creating dummy columns

# breeds = ["Domestic Shorthair", "Pit Bull", "Labrador Retriever", "Chihuahua Shorthair", "Chihuahua Longhair",
#          "Domestic Medium Hair", "German Shepherd", "Bat", "Domestic Longhair", "Australian Cattle Dog", "Siamese",
#          "Dachshund", "Boxer", "Border Collie", "Miniature Poodle", "Catahoula", "Australian Shepherd", "Rat Terrier",
#          "Raccoon", "Yorkshire Terrier", "Siberian Husky", "Jack Russell Terrier", "Miniature Schnauzer", "Staffordshire",
#          "Beagle", "Great Pyrenees", "Cairn Terrier", "Pointer", "Miniature Pinscher", "Corgi"]

# len(breeds)

# for breed in breeds:
#     animals[breed] = animals["Breed"].str.contains(breed, case=False).astype(int)
    
# animals.drop(columns=["Breed"], axis = 1, inplace=True)  

***Color***

In [60]:
# Running through Natural Language Processing Count Vectorizer to 
# create/combine Color features (538 unique color combinations)

In [61]:
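# 13 null values at this point: dropping those rows
# (these may be rows introduced by index misalignment in the Breed concat rather than missing colors in the source data)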

animals["Color"].isnull().sum()

animals.dropna(subset = ["Color"], inplace = True)

In [62]:
# 539 unique color combinations

animals["Color"].nunique();

In [63]:
# Cleaning

animals["Color"] = animals["Color"].str.replace("/", " ")

In [64]:
# Colors are repeated in different ways many times (ex: Black White vs. Black vs. Black Brown vs. Brown Black)

animals["Color"].value_counts();
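The ordering issue noted above ("Black White" vs. "White Black") goes away once each color word becomes its own indicator. As an alternative sketch (not the Count Vectorizer approach used in this notebook), `Series.str.get_dummies` with a space separator gives order-independent indicators. The sample values are hypothetical:

```python
import pandas as pd

# Hypothetical color strings after "/" has been replaced by a space
toy = pd.Series(["Black White", "White Black", "Black"])

# One 0/1 column per distinct word, regardless of word order
indicators = toy.str.get_dummies(sep=" ")

indicator_columns = indicators.columns.tolist()
```

The first two rows end up identical, which is the behavior the unigram features from a Count Vectorizer also provide.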

In [65]:
# Setting up column/text to Vectorize

color_corpus = animals["Color"].tolist()
color_corpus[0:10];

In [66]:
# Count Vectorizer: creating a new dataframe of just the top 50 1-2 word length features from Color

color_cvec = CountVectorizer(analyzer = "word", ngram_range = (1, 2), lowercase = True, max_features = 50)
color_cvec.fit(color_corpus);

In [67]:
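# Reusing the index of animals so the concat below lines up row by row
# get_feature_names_out() requires scikit-learn 1.0+; older versions used get_feature_names()

df_color_vec = pd.DataFrame(color_cvec.transform(color_corpus).toarray(), columns = color_cvec.get_feature_names_out(), index = animals.index)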

In [68]:
df_color_vec.columns;

In [69]:
animals = pd.concat([animals, df_color_vec], axis = 1)

In [71]:
# Below is "hard coded" method of examining the top 20 colors and manually creating dummy columns

# 20 colors

# colors = ["black", "white", "brown", "tabby", "tan", "orange", "blue", "tricolor", "calico", "brindle", "tortie",
#           "torbie", "red", "chocolate", "gray", "yellow", "green", "silver", "gold", "cream"]

# for color in colors:
#     animals[color] = animals["Color"].str.contains(color, case=False).astype(int)
    
# animals.drop(columns=["Color"], axis = 1, inplace=True)    