In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
df_train = pd.read_csv('../input/train/train.csv')

In [None]:
df_train.head()

The PetFinder competition is a fairly straightforward prediction challenge. The outcome variable we're interested in is *AdoptionSpeed*, a measure of how quickly a pet is adopted from a shelter. It is a categorical variable that takes integer values from 0 to 4, where:

0 - Pet was adopted on the same day as it was listed. 
1 - Pet was adopted between 1 and 7 days (1st week) after being listed. 
2 - Pet was adopted between 8 and 30 days (1st month) after being listed. 
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 
4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

Essentially, this is a multi-class classification problem with ordered classes.
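
To make the class boundaries concrete, here is a minimal sketch of how a waiting time in days would map onto the five labels listed above. The function name `adoption_speed_from_days` and the use of `None` for "not adopted" are assumptions made for illustration; the competition data already ships with the label, so this is not part of the pipeline.

```python
def adoption_speed_from_days(days_to_adoption):
    """Map days between listing and adoption to an AdoptionSpeed class.

    `days_to_adoption` is a non-negative int, or None when the pet was
    not adopted (hypothetical convention for this sketch).
    """
    if days_to_adoption is None or days_to_adoption > 100:
        return 4  # no adoption after 100 days of being listed
    if days_to_adoption == 0:
        return 0  # adopted on the day of listing
    if days_to_adoption <= 7:
        return 1  # 1st week
    if days_to_adoption <= 30:
        return 2  # 1st month
    if days_to_adoption <= 90:
        return 3  # 2nd and 3rd month
    # 91-100 days: the class definitions leave this range unassigned
    raise ValueError("no class is defined for 91-100 days")


labels = [adoption_speed_from_days(d) for d in (0, 3, 30, 45, None)]
print(labels)
```

Because the labels are ordered, predicting 3 for a true 4 is a smaller mistake than predicting 0, which is why an ordinal-aware metric is used later on.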

## Exploratory Data Analysis
To get started, we'll want to do preliminary EDA on the training data. 

In [None]:
df_train.describe()

In [None]:
#Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
sns.set(style="darkgrid")
ax = sns.countplot(x="Gender", data=df_train)

In [None]:
#Type - Type of animal (1 = Dog, 2 = Cat)
#We'll add type of animal to the plot above
ax = sns.countplot(x="Type", hue="Gender", data=df_train)

## Age
From the summary statistics above, we can observe that the mean age of the animals is ~10.5 months, with 75% of them being 12 months old or younger. Most of the listed animals are therefore "puppies" and "kittens". Furthermore, 25% of the animals are 2 months or younger and 50% are 3 months or younger. Anecdotally, as a pet owner myself, I believe that people are typically looking to adopt animals as young as possible, and in the US the earliest that cats and dogs are available for adoption is 8 weeks old, or 2 months. Without even diving further than the summary, the dataset appears to be consistent with this notion: young puppies and kittens are the highest-demand animals!

The oldest animal in our data set is 255 months old. We'll probably want to modify the feature of **Age** a bit to *cap* the value at some maximum. Before we do this, we'll continue our EDA process.
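
As a minimal sketch of what capping could look like, the snippet below clips a toy **Age** column with pandas `Series.clip`. The cap of 120 months and the toy values are assumptions chosen only for illustration; the right threshold would have to come from the EDA.

```python
import pandas as pd

# Toy ages in months, including the 255-month maximum seen in the data
ages = pd.Series([1, 2, 3, 12, 48, 255], name="Age")

AGE_CAP_MONTHS = 120  # hypothetical cap, not a value chosen by the analysis

# Values above the cap are replaced by the cap; others are left unchanged
capped = ages.clip(upper=AGE_CAP_MONTHS)
print(capped.tolist())
```

On the real data the same call would be `df_train['Age'].clip(upper=AGE_CAP_MONTHS)`, applied identically to the test set.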

Below, we've plotted a histogram binned by **Age** for each Gender and Type combination. Each combination has a similar profile, although it appears that cats are even more heavily skewed towards animals 12 months old or younger than dogs are.

In [None]:
g = sns.FacetGrid(df_train, row="Type", col="Gender", margin_titles=True)
bins = np.linspace(0, 36, 12)
g.map(plt.hist, "Age", color="steelblue", bins=bins)

In [None]:
df_train[df_train['Type'] == 1]['Age'].describe()

In [None]:
df_train[df_train['Type'] == 2]['Age'].describe()

Sure enough, on average, cats are about 5.5 months younger than dogs at the time of listing.

## Breed
Besides **Age**, **Type**, and **Gender**, the **Breed** of the animal will also factor in significantly. It's unusual to find a *pure-breed* animal at a shelter, thus, we'll probably find that most of the animals in our dataset are of *mixed-breed* type. Also, we anticipate that breed factors in more heavily for dogs than for cats. Let's take a look and see if these hypotheses hold up.

In [None]:
# Fraction of pets with no secondary breed listed (Breed2 == 0)
(df_train['Breed2'] == 0).mean()

In [None]:
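#ax = sns.countplot(x="Breed1", data=df_train)
# Aggregate numeric columns only; 'mean' is undefined for text columns such as Name
df_train_by_breed = df_train.select_dtypes(include='number').groupby('Breed1').agg(['count', 'mean'])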

In [None]:
df_train_by_breed.sort_values(by=[('Type', 'count')], ascending=False)

In [None]:
ax = sns.countplot(x="AdoptionSpeed", hue="Gender", data=df_train)

# Modeling
There's so much more EDA that we could, and will do, but let's do a little model development to give us a starting point. To get started, we're just going to use the data in the form that it is given. We will go back later and do some feature engineering to improve the model. That said, we've already loaded the training data. Now we want to divide the training set into two smaller groups that we'll call the *training set* and *development set*.
* training set - used to train our model  
* development set - used to evaluate the results from our trained model  

In [None]:
Y = df_train['AdoptionSpeed'].values
X = df_train.drop(['AdoptionSpeed', 'Name', 'Description', 'PetID', 'RescuerID'], axis=1).values

np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]
cutoff = int(0.75*len(X))

train_data, train_labels = X[:cutoff], Y[:cutoff]
test_data, test_labels = X[cutoff:], Y[cutoff:]

Good, our data is now in a format that we can work with. Note that we converted the data from a pandas *DataFrame* to a NumPy *array*. We're going to use Decision Trees for our baseline model.
## Decision Tree Model
The scikit-learn **tree** module (`sklearn.tree`) will be used to develop a basic Decision Tree model.   
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html  
We'll follow these steps to make our predictions:
1. Define Decision Tree model and hyperparameters
2. Train model
3. Make predictions on Development Set
4. Evaluate results

In [None]:
# some decision tree and random forest imports
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import cohen_kappa_score, confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV

In [None]:
dt = DecisionTreeClassifier(criterion="gini", splitter="best", random_state=0, max_depth=5)
dt.fit(train_data, train_labels)
y_preds = dt.predict(test_data)

In [None]:
## Let's define the kappa scoring metric for use in our evaluations
def metric(y1,y2):
    return cohen_kappa_score(y1,y2, weights='quadratic')

# Make scorer for scikit-learn
scorer = make_scorer(metric)

In [None]:
metric(y_preds, test_labels)
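
To show what `weights='quadratic'` measures, here is a minimal pure-Python sketch of quadratic weighted kappa: disagreements are penalized by the squared distance between the two classes, and the observed penalty is compared with the penalty expected if predictions were independent of the labels. The function name `quadratic_weighted_kappa` is hypothetical; the result should agree with `cohen_kappa_score(..., weights='quadratic')`, but the scikit-learn function remains the one used in this notebook.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for integer labels in range(n_classes)."""
    n = len(y_true)

    # Observed confusion matrix: rows are true labels, columns predictions
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    # Marginal label counts for each rater
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if true and predicted labels were independent
            weighted_expected += weight * hist_true[i] * hist_pred[j] / n

    # Undefined (division by zero) when both inputs contain a single class
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])
reversed_extremes = quadratic_weighted_kappa([0, 0, 4, 4], [4, 4, 0, 0])
print(perfect, reversed_extremes)
```

A score of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.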

In [None]:
# Loop over min_samples_split options while holding max_depth fixed at 7
for i in np.arange(5, 100, 5):
    dt = DecisionTreeClassifier(criterion="gini", splitter="best", random_state=0, max_depth=7, min_samples_split=i)
    dt.fit(train_data, train_labels)
    y_preds = dt.predict(test_data)
    print('Cohen kappa score:', metric(y_preds, test_labels), 'Min Samples Split: ', i)

So, our very basic decision tree doesn't do a great job of predicting the class. The best result using a limited set of hyperparameters is a $\kappa$ score of ~0.317, which would currently put our model at 568/713 participants. We can do much, much better than this. Before considering alternative approaches like SVMs and neural nets, we'll restrict our first baseline model to decision trees. Here are some things we'll want to do to improve this model:
* more hyperparameter tuning
* feature engineering
* ensemble methods (bagging, boosting, random forests, etc.)

### Hyperparameter Tuning - Cross-Validation
One of the more useful tools we have at our disposal in scikit-learn is the GridSearchCV class. It performs k-fold cross-validation on our dataset for every combination of the hyperparameter values that we give it.   
https://scikit-learn.org/stable/modules/cross_validation.html   
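
Because every combination is refit once per fold, the cost of a grid search grows multiplicatively with each hyperparameter. The sketch below counts the model fits for a grid with the same shape as the random forest grid used in this notebook and 3-fold cross-validation; the variable names are illustrative only.

```python
import math

# Same shape as the random forest grid searched in this notebook
grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 25, 50, 85],
    'max_features': ['sqrt'],
    'min_samples_leaf': [10, 15, 25],
    'min_samples_split': [10, 15, 25],
    'n_estimators': [150, 200, 215],
}
n_folds = 3

# Number of hyperparameter combinations is the product of the list lengths
n_candidates = math.prod(len(values) for values in grid.values())

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

With 216 candidates and 3 folds this comes to 648 forest fits, plus one final refit on the full training split, so it is worth checking this number before adding more values to the grid.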

In [None]:
dt = RandomForestClassifier()
#param_grid = {'criterion': ['gini', 'entropy']}
             #'max_depth': np.arange(0,15),
             #'min_samples_split': np.arange(10,100,10)}

rand_forest_grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 25, 50, 85],
    'max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'min_samples_leaf': [10, 15, 25],
    'min_samples_split': [10, 15, 25],
    'n_estimators': [150, 200, 215]
}
dt_gridsearch = GridSearchCV(estimator=dt, param_grid = rand_forest_grid, cv = 3, n_jobs = -1,verbose = 1,scoring=scorer)
dt_gridsearch.fit(train_data, train_labels)

In [None]:
print(dt_gridsearch.best_params_)

In [None]:
metric(dt_gridsearch.predict(test_data), test_labels)

### Principal Component Analysis
[reserved]

## Create Submission File

In [None]:
df_sub = pd.read_csv('../input/test/test.csv')

In [None]:
rand_forest_preds = dt_gridsearch.predict(df_sub.drop(['Name', 'Description', 'PetID', 'RescuerID'], axis=1).values)

In [None]:
# Store predictions for Kaggle Submission
submission_df = pd.DataFrame(data={'PetID' : df_sub['PetID'], 
                                   'AdoptionSpeed' : rand_forest_preds})
submission_df.to_csv('submission.csv', index=False)