## Final Project Submission

Please fill out:
* Student name: Thomas Brown
* Student pace: Full Time
* Scheduled project review date/time: TBD
* Instructor name: 
* Blog post URL:


## Description and Use-Case:

Data Source: https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes?select=aac_intakes_outcomes.csv

The purpose of this project is to examine data from one of Austin's largest animal shelters and build a machine learning model that predicts the outcome for each cat as it comes in and is processed in the system.  An efficient and intelligent model can help "manage the inventory" of the shelter, getting cats adopted more quickly and making them less likely to be euthanized.  <br><br>
The outcomes are listed below:
- Transfer
- Adoption
- Euthanasia
- Return to Owner
- Died
- Rto-Adopt
- Missing
- Disposal
- Relocate  <br>

We'll explore these outcomes (as well as their subtypes) in the EDA phase. With an effective model, we can return missing cats to their owners more efficiently, get adoptable cats adopted more quickly, and place cats that are likely to go missing under stricter supervision.

# Importing Libraries:

In [26]:
# Standard Data-sci packages:
import pandas as pd
import numpy as np

# Scikit-Learn:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Plots and Graphs:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Other:
import warnings
warnings.filterwarnings('ignore')

# Importing Data:

## Main DF:

In [27]:
# Main Dataframe
dfi = pd.read_csv('aac_intakes_outcomes.csv.zip')
display(dfi.head())
dfi.info()

Unnamed: 0,age_upon_outcome,animal_id_outcome,date_of_birth,outcome_subtype,outcome_type,sex_upon_outcome,age_upon_outcome_(days),age_upon_outcome_(years),age_upon_outcome_age_group,outcome_datetime,...,age_upon_intake_age_group,intake_datetime,intake_month,intake_year,intake_monthyear,intake_weekday,intake_hour,intake_number,time_in_shelter,time_in_shelter_days
0,10 years,A006100,2007-07-09 00:00:00,,Return to Owner,Neutered Male,3650,10.0,"(7.5, 10.0]",2017-12-07 14:07:00,...,"(7.5, 10.0]",2017-12-07 00:00:00,12,2017,2017-12,Thursday,14,1.0,0 days 14:07:00.000000000,0.588194
1,7 years,A006100,2007-07-09 00:00:00,,Return to Owner,Neutered Male,2555,7.0,"(5.0, 7.5]",2014-12-20 16:35:00,...,"(5.0, 7.5]",2014-12-19 10:21:00,12,2014,2014-12,Friday,10,2.0,1 days 06:14:00.000000000,1.259722
2,6 years,A006100,2007-07-09 00:00:00,,Return to Owner,Neutered Male,2190,6.0,"(5.0, 7.5]",2014-03-08 17:10:00,...,"(5.0, 7.5]",2014-03-07 14:26:00,3,2014,2014-03,Friday,14,3.0,1 days 02:44:00.000000000,1.113889
3,10 years,A047759,2004-04-02 00:00:00,Partner,Transfer,Neutered Male,3650,10.0,"(7.5, 10.0]",2014-04-07 15:12:00,...,"(7.5, 10.0]",2014-04-02 15:55:00,4,2014,2014-04,Wednesday,15,1.0,4 days 23:17:00.000000000,4.970139
4,16 years,A134067,1997-10-16 00:00:00,,Return to Owner,Neutered Male,5840,16.0,"(15.0, 17.5]",2013-11-16 11:54:00,...,"(15.0, 17.5]",2013-11-16 09:02:00,11,2013,2013-11,Saturday,9,1.0,0 days 02:52:00.000000000,0.119444


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79672 entries, 0 to 79671
Data columns (total 41 columns):
age_upon_outcome              79672 non-null object
animal_id_outcome             79672 non-null object
date_of_birth                 79672 non-null object
outcome_subtype               36348 non-null object
outcome_type                  79662 non-null object
sex_upon_outcome              79671 non-null object
age_upon_outcome_(days)       79672 non-null int64
age_upon_outcome_(years)      79672 non-null float64
age_upon_outcome_age_group    79672 non-null object
outcome_datetime              79672 non-null object
outcome_month                 79672 non-null int64
outcome_year                  79672 non-null int64
outcome_monthyear             79672 non-null object
outcome_weekday               79672 non-null object
outcome_hour                  79672 non-null int64
outcome_number                79672 non-null float64
dob_year                      79672 non-null int64
dob_month 

In [28]:
dfi.outcome_type.value_counts()

Adoption           33594
Transfer           23799
Return to Owner    14791
Euthanasia          6244
Died                 690
Disposal             304
Rto-Adopt            179
Missing               46
Relocate              15
Name: outcome_type, dtype: int64

In [29]:
dfi.outcome_subtype.value_counts()

Partner                19840
Foster                  5490
SCRP                    3205
Suffering               2549
Rabies Risk             2539
Snr                      752
Aggressive               497
In Kennel                351
Offsite                  350
Medical                  265
In Foster                177
Behavior                 133
At Vet                    71
Enroute                   49
Underage                  28
Court/Investigation       23
In Surgery                17
Possible Theft             9
Barn                       3
Name: outcome_subtype, dtype: int64

## Intake Data:

In [45]:
dfii = pd.read_csv('aac_intakes.csv.zip')
dfii.head(20)

Unnamed: 0,age_upon_intake,animal_id,animal_type,breed,color,datetime,datetime2,found_location,intake_condition,intake_type,name,sex_upon_intake
0,8 years,A706918,Dog,English Springer Spaniel,White/Liver,2015-07-05T12:59:00.000,2015-07-05T12:59:00.000,9409 Bluegrass Dr in Austin (TX),Normal,Stray,Belle,Spayed Female
1,11 months,A724273,Dog,Basenji Mix,Sable/White,2016-04-14T18:43:00.000,2016-04-14T18:43:00.000,2818 Palomino Trail in Austin (TX),Normal,Stray,Runster,Intact Male
2,4 weeks,A665644,Cat,Domestic Shorthair Mix,Calico,2013-10-21T07:59:00.000,2013-10-21T07:59:00.000,Austin (TX),Sick,Stray,,Intact Female
3,4 years,A682524,Dog,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,2014-06-29T10:38:00.000,2014-06-29T10:38:00.000,800 Grove Blvd in Austin (TX),Normal,Stray,Rio,Neutered Male
4,2 years,A743852,Dog,Labrador Retriever Mix,Chocolate,2017-02-18T12:46:00.000,2017-02-18T12:46:00.000,Austin (TX),Normal,Owner Surrender,Odin,Neutered Male
5,2 years,A708452,Dog,Labrador Retriever Mix,Black/White,2015-07-30T14:37:00.000,2015-07-30T14:37:00.000,Austin (TX),Normal,Public Assist,Mumble,Intact Male
6,5 months,A731435,Cat,Domestic Shorthair Mix,Cream Tabby,2016-08-08T17:52:00.000,2016-08-08T17:52:00.000,Austin (TX),Normal,Owner Surrender,*Casey,Neutered Male
7,2 years,A760053,Dog,Chihuahua Shorthair,White/Tan,2017-10-11T15:46:00.000,2017-10-11T15:46:00.000,8800 South First Street in Austin (TX),Normal,Stray,,Intact Male
8,5 months,A707375,Dog,Pit Bull,Brown/White,2015-07-11T18:19:00.000,2015-07-11T18:19:00.000,Galilee Court And Damita Jo Dr in Manor (TX),Normal,Stray,*Candy Cane,Intact Female
9,2 years,A696408,Dog,Chihuahua Shorthair,Tricolor,2015-02-04T12:58:00.000,2015-02-04T12:58:00.000,9705 Thaxton in Austin (TX),Normal,Stray,*Pearl,Intact Female


The intake data has a `name` column that the main DF lacks, so we will need to add it to the main DF.

## Combining the DFs:

In [47]:
dfii['animal_id_intake'] = dfii.animal_id
dfii.head()

Unnamed: 0,age_upon_intake,animal_id,animal_type,breed,color,datetime,datetime2,found_location,intake_condition,intake_type,name,sex_upon_intake,animal_id_intake
0,8 years,A706918,Dog,English Springer Spaniel,White/Liver,2015-07-05T12:59:00.000,2015-07-05T12:59:00.000,9409 Bluegrass Dr in Austin (TX),Normal,Stray,Belle,Spayed Female,A706918
1,11 months,A724273,Dog,Basenji Mix,Sable/White,2016-04-14T18:43:00.000,2016-04-14T18:43:00.000,2818 Palomino Trail in Austin (TX),Normal,Stray,Runster,Intact Male,A724273
2,4 weeks,A665644,Cat,Domestic Shorthair Mix,Calico,2013-10-21T07:59:00.000,2013-10-21T07:59:00.000,Austin (TX),Sick,Stray,,Intact Female,A665644
3,4 years,A682524,Dog,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,2014-06-29T10:38:00.000,2014-06-29T10:38:00.000,800 Grove Blvd in Austin (TX),Normal,Stray,Rio,Neutered Male,A682524
4,2 years,A743852,Dog,Labrador Retriever Mix,Chocolate,2017-02-18T12:46:00.000,2017-02-18T12:46:00.000,Austin (TX),Normal,Owner Surrender,Odin,Neutered Male,A743852


# Cleaning Data:

In [59]:
dfiii = pd.merge(dfi, dfii[['animal_id_intake', 'name']], on='animal_id_intake', how='left')
dfiii.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100130 entries, 0 to 100129
Data columns (total 42 columns):
age_upon_outcome              100130 non-null object
animal_id_outcome             100130 non-null object
date_of_birth                 100130 non-null object
outcome_subtype               39688 non-null object
outcome_type                  100115 non-null object
sex_upon_outcome              100129 non-null object
age_upon_outcome_(days)       100130 non-null int64
age_upon_outcome_(years)      100130 non-null float64
age_upon_outcome_age_group    100130 non-null object
outcome_datetime              100130 non-null object
outcome_month                 100130 non-null int64
outcome_year                  100130 non-null int64
outcome_monthyear             100130 non-null object
outcome_weekday               100130 non-null object
outcome_hour                  100130 non-null int64
outcome_number                100130 non-null float64
dob_year                      100130 non-nul
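
The merged frame has 100,130 rows against 79,672 in `dfi`, which suggests that some `animal_id_intake` values appear more than once in the intake table, so the left merge repeats the matching outcome rows. Below is a minimal sketch of one way to keep the row count stable. It uses small made-up frames in place of `dfi` and `dfii`, and keeping the first name per ID is an assumption rather than something this notebook has settled on.

```python
import pandas as pd

# Toy stand-ins for dfi (outcomes) and dfii (intakes); the values are made up.
outcomes = pd.DataFrame({
    "animal_id_intake": ["A1", "A1", "A2"],
    "outcome_type": ["Return to Owner", "Adoption", "Transfer"],
})
intakes = pd.DataFrame({
    "animal_id_intake": ["A1", "A1", "A2"],
    "name": ["Belle", "Belle", None],
})

# A plain left merge repeats every outcome row once per matching intake row.
naive = pd.merge(outcomes, intakes, on="animal_id_intake", how="left")

# Keeping one name per animal ID first preserves the original row count.
# validate="many_to_one" raises if the right-hand keys are not unique.
names = intakes.drop_duplicates(subset="animal_id_intake", keep="first")
merged = pd.merge(outcomes, names, on="animal_id_intake", how="left",
                  validate="many_to_one")

print(len(outcomes), len(naive), len(merged))
```

On the real data, comparing `len(dfi)` with the length of the merged frame is a quick check that no rows were duplicated.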

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79672 entries, 0 to 79671
Data columns (total 42 columns):
age_upon_outcome              79672 non-null object
animal_id_outcome             79672 non-null object
date_of_birth                 79672 non-null object
outcome_subtype               36348 non-null object
outcome_type                  79662 non-null object
sex_upon_outcome              79671 non-null object
age_upon_outcome_(days)       79672 non-null int64
age_upon_outcome_(years)      79672 non-null float64
age_upon_outcome_age_group    79672 non-null object
outcome_datetime              79672 non-null object
outcome_month                 79672 non-null int64
outcome_year                  79672 non-null int64
outcome_monthyear             79672 non-null object
outcome_weekday               79672 non-null object
outcome_hour                  79672 non-null int64
outcome_number                79672 non-null float64
dob_year                      79672 non-null int64
dob_month 

## Removing Certain Outcomes:

In [19]:
df.outcome_type.value_counts()

Adoption           33594
Transfer           23799
Return to Owner    14791
Euthanasia          6244
Died                 690
Disposal             304
Rto-Adopt            179
Missing               46
Relocate              15
Name: outcome_type, dtype: int64

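As we can see above, the outcome classes are heavily imbalanced: Adoption, Transfer, Return to Owner, and Euthanasia account for nearly all of the records, while each of the remaining classes has fewer than 700 rows. Those rare classes are the candidates for removal.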
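
A minimal sketch of how the rare outcomes could be filtered out, working from the counts printed above. The 1,000-row cutoff (`MIN_ROWS`) is a hypothetical threshold picked only for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "Adoption": 33594, "Transfer": 23799, "Return to Owner": 14791,
    "Euthanasia": 6244, "Died": 690, "Disposal": 304,
    "Rto-Adopt": 179, "Missing": 46, "Relocate": 15,
})

MIN_ROWS = 1000  # hypothetical cutoff, chosen only for illustration
keep = counts[counts >= MIN_ROWS].index.tolist()
share_kept = counts[keep].sum() / counts.sum()

print(keep)
print(f"{share_kept:.1%} of labelled rows kept")

# On the real frame, the filter would look like:
# df = df[df["outcome_type"].isin(keep)]
```

Under this cutoff only the four largest classes remain, so the model would no longer be able to predict outcomes such as Missing. Whether that trade-off is acceptable depends on the use case described at the top.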

In [20]:
df.isna().sum()

age_upon_outcome                  0
animal_id_outcome                 0
date_of_birth                     0
outcome_subtype               43324
outcome_type                     10
sex_upon_outcome                  1
age_upon_outcome_(days)           0
age_upon_outcome_(years)          0
age_upon_outcome_age_group        0
outcome_datetime                  0
outcome_month                     0
outcome_year                      0
outcome_monthyear                 0
outcome_weekday                   0
outcome_hour                      0
outcome_number                    0
dob_year                          0
dob_month                         0
dob_monthyear                     0
age_upon_intake                   0
animal_id_intake                  0
animal_type                       0
breed                             0
color                             0
found_location                    0
intake_condition                  0
intake_type                       0
sex_upon_intake             

## Color:

About 3.5 thousand missing values.  For this column, I'll likely fill the nulls by mapping onto the probabilities of the observed values.  At some point I may instead drop the nulls and see whether the ML model gives better results.
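
One reading of "map onto probabilities" is to fill each null by sampling from the observed values in proportion to how often they occur. Here is a minimal sketch under that assumption. `fill_by_frequency` is a hypothetical helper, and the toy column is made up.

```python
import numpy as np
import pandas as pd

def fill_by_frequency(series, seed=12):
    """Fill nulls by sampling observed values in proportion to their frequency."""
    probs = series.value_counts(normalize=True)  # NaN is excluded by default
    rng = np.random.default_rng(seed)
    mask = series.isna()
    draws = rng.choice(probs.index.to_numpy(), size=mask.sum(), p=probs.to_numpy())
    filled = series.copy()  # leave the input untouched
    filled[mask] = draws
    return filled

colors = pd.Series(["black", "black", "white", None, "black", None])
filled = fill_by_frequency(colors)
print(filled.tolist())
```

This keeps the overall distribution of the column roughly intact, at the cost of assigning values that may be wrong for individual animals.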

In [21]:
df.color.nunique()

529

We'll need to do some grouping as well, probably in a separate Excel sheet, and then match the values afterwards.  529 unique values are too many.
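
A rough sketch of the "group in a lookup table, then match" idea. The groupings and color values below are illustrative only, and the inline dictionary stands in for whatever sheet ends up holding the mapping.

```python
import pandas as pd

# Hypothetical lookup table; in practice this could be maintained in a
# spreadsheet and loaded with pd.read_excel or pd.read_csv.
color_groups = {
    "Calico": "Calico",
    "Cream Tabby": "Tabby",
    "Brown Tabby": "Tabby",
    "Black/White": "Black",
}

colors = pd.Series(["Calico", "Cream Tabby", "Black/White", "Lilac Point"])

# Values absent from the lookup become NaN under .map, so fall back to "Other".
grouped = colors.map(color_groups).fillna("Other")
print(grouped.value_counts())
```

Counting how many rows land in "Other" is a simple way to see which raw values the lookup still needs to cover.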

## Name:

In [22]:
df.name.isna().sum()

AttributeError: 'DataFrame' object has no attribute 'name'

I think the simplest option here is a binary feature: has a name or does not have a name.  I would also be interested in less common or longer names, which might indicate more involved owners.  Who knows!
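
A minimal sketch of both ideas, the has-name flag and a name-length feature, on a few toy names in the style of the intake table. The column names `has_name` and `name_length` are placeholders.

```python
import pandas as pd

# Toy names in the style of the intake table; a null means no name was recorded.
names = pd.Series(["Belle", None, "*Candy Cane", "Rio"])

features = pd.DataFrame({
    # 1 if a name was recorded, 0 otherwise
    "has_name": names.notna().astype(int),
    # Length of the name without the leading "*" seen on some records; 0 if missing
    "name_length": names.fillna("").str.lstrip("*").str.len(),
})
print(features)
```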

## Outcome Subtype:

About 10 thousand missing values here.  Predicting the subtype is an additional classification problem that follows the first, so we can worry about it later.

## Outcome Type:

Only 10 rows are missing an outcome type, so I'll just drop these.

## Breed 2:

Probably not useful for this application.

## Coat Pattern:

I think a probability map will be useful here as well.

In [8]:
df.coat_pattern.value_counts()

tabby       13613
tortie       1547
calico       1494
point        1297
torbie       1035
smoke         156
agouti          6
brindle         4
tricolor        3
Name: coat_pattern, dtype: int64

## Color 2:

I'll likely just use color1 for this model.

# EDA:

## Different Outcome Types:

## Variables Compared Against Outcomes:

# Feature Engineering:

## Multicollinearity:

## Scaling:

## Log Transformations:

## One-Hot-Encoding:

# Train Test Split:

In [9]:
# Putting the final DF back together:

# Importing train test split:

# Splitting the data.  Test size 20%.  Random_state 12
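
A minimal sketch of the split described in the comments above, using placeholder data in place of the final DataFrame. The `stratify=y` argument is an addition that the comments do not mention.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder features and target standing in for the final, encoded DataFrame.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(rng.choice(["Adoption", "Transfer"], size=100), name="outcome_type")

# 20% test size and random_state=12, as in the comments above.
# stratify=y is an addition here: it keeps class proportions similar in both
# splits, which may help given the imbalance in outcome_type.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y
)
print(X_train.shape, X_test.shape)
```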

# Baseline Model:

Probably use a logistic regression classifier here.
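
A rough sketch of what such a baseline could look like. The synthetic data from `make_classification` stands in for the encoded shelter features, and `max_iter=1000` is an arbitrary setting to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic multiclass data standing in for the encoded shelter features.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=4, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y
)

# Default settings apart from max_iter; this is the score later models should beat.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
print(f"Baseline test accuracy: {baseline.score(X_test, y_test):.3f}")
```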

# Improved Models:

We'll build a pipeline for each of the models below.

## K-Nearest-Neighbors:

First, we'll start by building a pipeline for this model.
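
A minimal sketch of such a pipeline, combining the `StandardScaler`, `KNeighborsClassifier`, `Pipeline`, and `GridSearchCV` imports from the top of the notebook. The data is synthetic, and the grid of `n_neighbors` values is a hypothetical starting point.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the encoded shelter features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y
)

# Scaling lives inside the pipeline so it is refit on each CV training fold.
knn_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

param_grid = {"knn__n_neighbors": [3, 5, 11]}  # hypothetical search grid
search = GridSearchCV(knn_pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_, round(search.score(X_test, y_test), 3))
```

KNN is distance-based, so putting the scaler in the same pipeline avoids leaking test-set statistics into the scaling step.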

## Decision Tree:

In [10]:
# Print the tree with max_depth set to None, just for fun
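
One way to print an unrestricted tree is scikit-learn's `export_text`, sketched here on synthetic data with placeholder feature names. The `max_depth=3` passed to `export_text` only limits how much of the tree is printed; the fitted tree itself is grown with `max_depth=None`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for the encoded shelter features.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# max_depth=None lets the tree grow until its leaves are pure.
tree = DecisionTreeClassifier(max_depth=None, random_state=12)
tree.fit(X, y)

# Show only the top levels; deeper branches are summarised as truncated.
report = export_text(tree, feature_names=feature_names, max_depth=3)
print(f"Fitted depth: {tree.get_depth()}, leaves: {tree.get_n_leaves()}")
print(report)
```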

## Random Forest:

In [11]:
# Get feature importance bar chart
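
A minimal sketch of a feature importance bar chart for a random forest. The data and feature names are placeholders, and the forest settings are arbitrary.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the encoded shelter features.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=12)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=100, random_state=12)
forest.fit(X, y)

# Impurity-based importances, sorted so the largest bar ends up at the top.
importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
importances.plot.barh(ax=ax)
ax.set_xlabel("Mean decrease in impurity")
ax.set_title("Random forest feature importances")
fig.tight_layout()
```

Impurity-based importances can favor high-cardinality features, so after one-hot encoding it may be worth cross-checking them against permutation importance.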

## XG Boost:

## SVM:

# Model Comparisons:

## ROC Curves:

## Confusion Matrix:

## F1 vs. Accuracy vs. Recall, etc.

# Conclusion: