***Sam Cressman Capstone Project: Shelter Animal Outcomes***

***Help improve outcomes for shelter animals***

***Capstone inspiration:*** [Kaggle](https://www.kaggle.com/c/shelter-animal-outcomes)

[Dataset from the City of Austin open data portal (pulled 6/20/18)](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238)

Every year, approximately 7.6 million companion animals end up in US shelters. Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. 2.7 million dogs and cats are euthanized in the US every year. <br>

Using a dataset of intake information including breed, color, sex, and age from the Austin Animal Center, we're asking Kagglers to predict the outcome for each animal. <br>

We also believe this dataset can help us understand trends in animal outcomes. These insights could help shelters focus their energy on specific animals who need a little extra help finding a new home. We encourage you to publish your insights on Scripts so they are publicly accessible. <br>

In [300]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [252]:
animals = pd.read_csv("Austin_Animal_Center_Outcomes.csv")

***Quick EDA***

In [253]:
animals.head();

In [254]:
# All object columns

# animals.info()

In [255]:
# Most null values are in Name and Outcome Subtype

animals.isnull().sum();

***Column by column***:

***Animal ID***: dropping column (acts as an index)

***Name***: Over 25,000 null values: transforming into 1 if the animal has a name, 0 otherwise

***DateTime***: time of outcome event: object type, splitting into day/month/year through string slicing and then converting into a DateTime object so that Date of Birth can be subtracted from it to create Age at Outcome in years (will dummy each new column, will not model using this column)

***MonthYear***: dropping column (exactly the same as DateTime)

***Date of Birth***: splitting into day/month/year through string slicing and then converting into a DateTime object to subtract from the time of outcome in order to create Age at Outcome in years (will dummy each new column, will not model using this column)

***Outcome Type***: target, dropping 12 null values, mapping/converting into numerical values

***Outcome Subtype***: approximately half null: will keep column for EDA/visualization purposes and to further examine Outcome Type but will not include in models

***Animal Type***: 5 types (mainly Dog and Cat) but also Bird, Livestock, and Other (Other contains 99 different species) (will dummy column)

***Sex upon Outcome***: Neutered Male, Spayed Female, Intact Male, Intact Female, Unknown (mainly Animal Type Other) (will dummy column)

***Age upon Outcome***: creating a new column (Age at Outcome) by subtracting Date of Birth from the time of outcome to get an age in years (then will drop Age upon Outcome and dummy new columns)

***Breed***: running through Natural Language Processing Count Vectorizer to create/combine Breed features (2223 unique breeds) and address cleanliness issues (spacing, "/" characters, "Mix" and "mix")

***Color***: TBD (in progress)

***Animal ID***

In [256]:
# Dropping column

In [257]:
animals.drop(columns=["Animal ID"], axis = 1, inplace=True)

***Name***

In [258]:
# Converting into 1 if name, 0 otherwise

animals["Name"].isnull().sum();

In [259]:
# We see some names somewhat frequently

animals["Name"].value_counts()[0:10];

In [260]:
animals["has_name"] = animals['Name'].notnull().astype(int)

In [261]:
# Dropping Name column

animals.drop(columns=["Name"], axis = 1, inplace=True)

***DateTime***

In [262]:
# DateTime is time of outcome: will rename for clarity:

animals = animals.rename(columns={"DateTime": "Outcome Time"})

In [263]:
# Object type: will split out day, month, year

animals["Outcome Day"] = [day[3:5] for day in animals["Outcome Time"]]

# animals["Outcome Day"] = [day.replace("0", "") for day in animals["Outcome Day"] if day[0] == "0"]

In [264]:
animals["Outcome Month"] = [month[0:2] for month in animals["Outcome Time"]]

# animals["Outcome Month"] = [month.replace("0", "") for month in animals["Outcome Month"] if month[0] == "0"]

In [265]:
animals["Outcome Year"] = [year[6:10] for year in animals["Outcome Time"]]

In [266]:
# Converting to DateTime object to create Age upon Outcome in years

animals["Outcome Time"] = pd.to_datetime(animals["Outcome Time"])
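
Once the column is a DateTime object, the day/month/year parts can also be read straight from it rather than sliced out of the raw string. The following is a minimal sketch on two made-up rows; the `sample` frame and the exact timestamp layout (`MM/DD/YYYY HH:MM:SS AM/PM`) are assumptions chosen to match the positions the slicing above relies on, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; the time-of-day layout is an assumption
sample = pd.DataFrame(
    {"Outcome Time": ["07/22/2014 04:04:00 PM", "11/07/2013 11:47:00 AM"]}
)

# Parse with an explicit format so a layout change fails loudly
sample["Outcome Time"] = pd.to_datetime(
    sample["Outcome Time"], format="%m/%d/%Y %I:%M:%S %p"
)

# Read the parts from the parsed timestamps instead of slicing strings
sample["Outcome Day"] = sample["Outcome Time"].dt.day
sample["Outcome Month"] = sample["Outcome Time"].dt.month
sample["Outcome Year"] = sample["Outcome Time"].dt.year
```

One difference to keep in mind: the `.dt` accessor returns integers (7, not "07"), so dummy column names built from these values would lose the leading zero that the sliced strings keep.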

***MonthYear***

In [267]:
# Dropping column: exactly the same as DateTime (now Outcome Time)

animals.drop(columns = "MonthYear", axis = 1, inplace = True)

***Date of Birth***

In [268]:
# Object type: will split out day, month, year

animals["Birth Day"] = [date[3:5] for date in animals["Date of Birth"]]

# animals["Birth Date"] = [date.replace("0", "") for date in animals["Date of Birth"] if date[0] == "0"]

In [269]:
animals["Birth Month"] = [month[0:2] for month in animals["Date of Birth"]]

# animals["Outcome Month"] = [month.replace("0", "") for month in animals["Outcome Month"] if month[0] == "0"]

In [270]:
animals["Birth Year"] = [year[6:10] for year in animals["Date of Birth"]]

In [271]:
# Converting from object to DateTime

# animals["Date of Birth"] = animals["Date of Birth"].apply(lambda x: dt.strptime(x,"%m/%d/%Y"))

animals["Date of Birth"] = pd.to_datetime(animals["Date of Birth"])

***Age Upon Outcome***

In [272]:
# Creating a new column by subtracting Date of Birth from Outcome Time
# to get an age in years (then will drop Age upon Outcome and dummy new columns)

animals["Age at Outcome"] = animals["Outcome Time"] - animals["Date of Birth"]

# animals.drop(columns = "Age upon Outcome", axis = 1, inplace = True)

In [273]:
# Converting into string to slice out just days

animals["Age at Outcome"] = animals["Age at Outcome"].astype(str)

In [274]:
# Splitting on space

animals["Age at Outcome"] = animals["Age at Outcome"].str.split(" ")

In [275]:
animals["Age at Outcome"] = [age_days[0] for age_days in animals["Age at Outcome"]]

animals["Age at Outcome"] = animals["Age at Outcome"].astype(float)

In [276]:
# Dividing days by 365 to get float in years value

animals["Age at Outcome"] = animals["Age at Outcome"] / 365

In [277]:
# Dropping old column

animals.drop(columns = "Age upon Outcome", axis = 1, inplace = True)
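
The string round trip above (cast to string, split on space, cast to float) can be avoided, because subtracting two DateTime columns gives a Timedelta column whose whole-day count is available through `.dt.days`. A minimal sketch with made-up dates, where `sample` and `age` are hypothetical names used only for illustration:

```python
import pandas as pd

# Hypothetical animals with already-parsed datetimes
sample = pd.DataFrame({
    "Outcome Time": pd.to_datetime(["2014-07-22 16:04:00", "2013-11-07 11:47:00"]),
    "Date of Birth": pd.to_datetime(["2004-07-22", "2012-11-06"]),
})

# Subtracting datetimes yields a Timedelta column
age = sample["Outcome Time"] - sample["Date of Birth"]

# .dt.days keeps whole days only; dividing by 365 matches the cells above
sample["Age at Outcome"] = age.dt.days / 365
```

Dividing by 365 ignores leap days, so a ten-year-old animal comes out slightly above 10.0; dividing by 365.25 would be a closer approximation if that matters for the model.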

***Outcome Type***

In [278]:
# Outcome Type: target for modeling (what we are attempting to predict)
# Will map dictionary values since multi class

animals["Outcome Type"].unique()

array(['Adoption', 'Return to Owner', 'Euthanasia', 'Transfer',
       'Rto-Adopt', 'Died', 'Disposal', 'Missing', 'Relocate', nan],
      dtype=object)

In [279]:
animals["Outcome Type"].isnull().sum();

In [280]:
# 12 null values: dropping those rows

animals.dropna(subset = ["Outcome Type"], inplace = True)

In [281]:
outcome_type_dict = {"Adoption": 0, "Return to Owner": 1, "Euthanasia": 2,
                     "Transfer": 3, "Rto-Adopt": 4, "Died": 5, "Disposal": 6,
                     "Missing": 7, "Relocate": 8}

In [282]:
animals["Outcome Type"] = animals["Outcome Type"].map(outcome_type_dict)
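
Since the target is now numeric, it helps to keep a reverse lookup around so model predictions can be turned back into readable labels. The sketch below reuses the dictionary from the cell above; `outcome_type_labels` and the three-row `sample` series are hypothetical names added for illustration.

```python
import pandas as pd

outcome_type_dict = {"Adoption": 0, "Return to Owner": 1, "Euthanasia": 2,
                     "Transfer": 3, "Rto-Adopt": 4, "Died": 5, "Disposal": 6,
                     "Missing": 7, "Relocate": 8}

# Reverse lookup (hypothetical helper): code -> label
outcome_type_labels = {code: label for label, code in outcome_type_dict.items()}

sample = pd.Series(["Adoption", "Transfer", "Died"])
encoded = sample.map(outcome_type_dict)      # labels -> codes
decoded = encoded.map(outcome_type_labels)   # codes -> labels
```

Note that `.map` returns NaN for any label missing from the dictionary, so checking `isnull().sum()` on the mapped column is a cheap way to catch a category that appears in a later data pull.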

***Outcome Subtype***

In [283]:
# Outcome Subtype: approximately half null: will keep column for EDA/visualization purposes 
# and to further examine Outcome Type but will not include in models

animals["Outcome Subtype"].value_counts();

***Animal Type***

In [284]:
# 5 types (mainly Dog and Cat) but also Bird, Livestock, and Other 
# (Other contains 99 different species)(will dummy column)

animals["Animal Type"].unique()

array(['Cat', 'Dog', 'Other', 'Bird', 'Livestock'], dtype=object)

In [285]:
# Mainly dogs and cats

animals["Animal Type"].value_counts()

Dog          47905
Cat          31282
Other         4671
Bird           366
Livestock       10
Name: Animal Type, dtype: int64

In [286]:
# 99 unique "Other" breeds

animals[animals["Animal Type"] == "Other"]["Breed"].nunique() 
animals[animals["Animal Type"] == "Other"]["Breed"].unique();

***Sex upon Outcome***

In [287]:
animals["Sex upon Outcome"].unique()

array(['Spayed Female', 'Neutered Male', 'Unknown', 'Intact Female',
       'Intact Male', nan], dtype=object)

In [288]:
# One null value, many Unknown values

animals["Sex upon Outcome"].isnull().sum()

1

In [289]:
# Dropping 1 null value

animals.dropna(subset = ["Sex upon Outcome"], inplace = True)

In [290]:
animals["Sex upon Outcome"].value_counts()

Neutered Male    29836
Spayed Female    27001
Intact Male      10292
Intact Female     9897
Unknown           7207
Name: Sex upon Outcome, dtype: int64

In [291]:
# 3989 of 7207 Unknown Sex upon Outcome are "Other" Animal Type

mask = (animals["Sex upon Outcome"] == "Unknown") & (animals["Animal Type"] == "Other")
len(animals[mask]);

***Dummy Columns***

In [292]:
animals.columns;

Index(['Outcome Time', 'Date of Birth', 'Outcome Type', 'Outcome Subtype',
       'Animal Type', 'Sex upon Outcome', 'Breed', 'Color', 'has_name',
       'Outcome Day', 'Outcome Month', 'Outcome Year', 'Birth Day',
       'Birth Month', 'Birth Year', 'Age at Outcome'],
      dtype='object')

In [294]:
animals = pd.get_dummies(data = animals, columns = ["Animal Type",
                                          "Sex upon Outcome", "Outcome Day", 
                                          "Outcome Month", "Outcome Year", 
                                          "Birth Day", "Birth Month", "Birth Year"])

In [296]:
animals.columns;
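
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a three-row frame. The `sample` data is made up; only the column names are borrowed from the notebook.

```python
import pandas as pd

# Hypothetical rows using two of the columns dummied above
sample = pd.DataFrame({
    "Animal Type": ["Dog", "Cat", "Dog"],
    "Outcome Month": ["07", "11", "07"],
    "has_name": [1, 0, 1],
})

# Each listed column is replaced by one indicator column per category;
# columns that are not listed (has_name) pass through unchanged
dummied = pd.get_dummies(data=sample, columns=["Animal Type", "Outcome Month"])
print(dummied.columns.tolist())
```

Every category gets its own column here. For linear models, passing `drop_first=True` is a common way to avoid perfectly collinear indicator columns; whether that is needed depends on the model chosen later.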

***Breed***

[Count Vectorizer Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [310]:
# Running through Natural Language Processing Count Vectorizer to create/combine Breed features (2223 unique breeds) 
# and address cleanliness issues (spacing, "/" characters, "Mix" and "mix")

In [308]:
animals["Breed"].nunique()

2223

In [305]:
# Cleaning Breed column

# case = False removes both "Mix" and "mix" in a single pass
animals["Breed"] = animals["Breed"].str.replace("Mix", "", case = False)
animals["Breed"] = animals["Breed"].str.replace("/", " ")

In [306]:
# Setting up column to Count Vectorize

corpus = animals["Breed"].tolist()
corpus[0:10]

['Domestic Shorthair ',
 'Border Terrier ',
 'Raccoon ',
 'Labrador Retriever Jack Russell Terrier',
 'Labrador Retriever ',
 'Domestic Shorthair ',
 'Domestic Longhair ',
 'Domestic Shorthair ',
 'Domestic Shorthair ',
 'Beagle ']

In [323]:
# Count Vectorizer

cvec = CountVectorizer(analyzer = "word", ngram_range = (1, 3), lowercase = True)
cvec.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [324]:
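# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
df_breeds_vec = pd.DataFrame(cvec.transform(corpus).todense(), columns = cvec.get_feature_names())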

In [325]:
df_breeds_vec.shape

(84233, 4059)

In [326]:
df_breeds_vec.head()

Unnamed: 0,abyssinian,affenpinscher,afghan,afghan hound,afghan hound labrador,african,airedale,airedale terrier,airedale terrier irish,airedale terrier labrador,...,yorkshire terrier parson,yorkshire terrier pomeranian,yorkshire terrier rat,yorkshire terrier shih,yorkshire terrier soft,yorkshire terrier standard,yorkshire terrier toy,yorkshire terrier yorkshire,zealand,zealand wht
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
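
The column count (4059) is larger than the number of unique breeds because `ngram_range = (1, 3)` creates a feature for every one-, two-, and three-word sequence. A minimal sketch on three breed strings taken from the corpus preview above shows how this plays out; the variable names `counts` and `vocab` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three cleaned breed strings from the corpus preview
corpus = ["Labrador Retriever Jack Russell Terrier",
          "Labrador Retriever ",
          "Beagle "]

cvec = CountVectorizer(analyzer="word", ngram_range=(1, 3), lowercase=True)
counts = cvec.fit_transform(corpus)  # sparse document-term matrix

# vocabulary_ maps each n-gram to its column index
vocab = sorted(cvec.vocabulary_)
print(counts.shape)
print(vocab)
```

Because "/" was replaced with a space, n-grams also span the boundary between two breeds, which is where features such as "retriever jack" (and, in the full data, "afghan hound labrador") come from. Restricting to `ngram_range = (1, 2)` or setting `min_df` are possible ways to trim such features, if they turn out to be noise.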


In [None]:
# animals[animals["Breed"] == "Bat "]

In [None]:
# breeds = ["Domestic Shorthair", "Pit Bull", "Labrador Retriever", "Chihuahua Shorthair", "Chihuahua Longhair",
#          "Domestic Medium Hair", "German Shepherd", "Bat", "Domestic Longhair", "Australian Cattle Dog", "Siamese",
#          "Dachshund", "Boxer", "Border Collie", "Miniature Poodle", "Catahoula", "Australian Shepherd", "Rat Terrier",
#          "Raccoon", "Yorkshire Terrier", "Siberian Husky", "Jack Russell Terrier", "Miniature Schnauzer", "Staffordshire",
#          "Beagle", "Great Pyrenees", "Cairn Terrier", "Pointer", "Miniature Pinscher", "Corgi"]

# len(breeds)

In [None]:
# for breed in breeds:
#     animals[breed] = animals["Breed"].str.contains(breed, case=False).astype(int)
    
# animals.drop(columns=["Color"], axis = 1, inplace=True)  

In [None]:
# total = animals["Domestic Longhair"].sum()
# print(total)

***Color***

***In Progress***

In [319]:
# animals["Color"] = animals["Color"].str.replace("/", " ")

In [320]:
# animals["Color"].value_counts();

In [321]:
# animals["Color"].unique();

In [322]:
# # 20 colors

# colors = ["black", "white", "brown", "tabby", "tan", "orange", "blue", "tricolor", "calico", "brindle", "tortie",
#           "torbie", "red", "chocolate", "gray", "yellow", "green", "silver", "gold", "cream"]

# for color in colors:
#     animals[color] = animals["Color"].str.contains(color, case=False).astype(int)
    
# animals.drop(columns=["Color"], axis = 1, inplace=True)    