This notebook gives an overview of the competition, walks through an exploratory data analysis (EDA), and sets up an ensemble of different models to get you started.

![](https://blog.groomit.me/wp-content/uploads/2018/02/petfinder2.jpg)

## PetFinder.my Adoption Prediction

## Table of contents

- [Data Columns](#1)
- [Dependencies](#2)
- [Preparation](#3)
- [Data Description](#4)
- [Visualization](#5)
- [Metric](#6)
- [Data Cleaning](#10)
- [Tree Ensembling](#7)
- [Predictions](#8)
- [Kaggle Submission](#9)

## Data Columns <a id="1"></a>

[Source](https://www.kaggle.com/c/petfinder-adoption-prediction/data)

* PetID - Unique hash ID of pet profile
* AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See the target variable cell in the Visualization section for the class definitions.
* Type - Type of animal (1 = Dog, 2 = Cat)
* Name - Name of pet (Empty if not named)
* Age - Age of pet when listed, in months
* Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
* Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
* Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
* Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
* Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
* Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
* MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
* FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
* Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
* Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
* Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
* Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
* Quantity - Number of pets represented in profile
* Fee - Adoption fee (0 = Free)
* State - State location in Malaysia (Refer to StateLabels dictionary)
* RescuerID - Unique hash ID of rescuer
* VideoAmt - Total uploaded videos for this pet
* PhotoAmt - Total uploaded photos for this pet
* Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
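
The integer codes listed above can be turned back into readable labels with a plain dictionary lookup. The following is a minimal sketch, assuming a small made-up DataFrame (`demo_df`) in place of `train.csv`; the two mappings are copied from the column list, and all names are illustrative.

```python
import pandas as pd

# Code-to-label mappings copied from the column descriptions above
TYPE_LABELS = {1: 'Dog', 2: 'Cat'}
YES_NO_LABELS = {1: 'Yes', 2: 'No', 3: 'Not Sure'}

# Hypothetical rows standing in for train.csv (assumption, not real data)
demo_df = pd.DataFrame({'Type': [1, 2, 2], 'Vaccinated': [1, 3, 2]})

# Replace each integer code with its label; unknown codes would become NaN
decoded_df = demo_df.assign(
    Type=demo_df['Type'].map(TYPE_LABELS),
    Vaccinated=demo_df['Vaccinated'].map(YES_NO_LABELS),
)
print(decoded_df)
```

Decoding is only meant for inspection and plotting; the models below are trained on the original integer codes.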


## Dependencies <a id="2"></a>

In [None]:
# For notebook plotting
%matplotlib inline

# Standard libraries
import os
import json
import numpy as np
import pandas as pd
from pprint import pprint

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb

# Seed for reproducibility
seed = 12345
np.random.seed(seed)

# Info about dataset
print('Files and directories: \n{}\n'.format(os.listdir("../input")))
print('Within the train directory: \n{}\n'.format(os.listdir("../input/train")))
print('Within the test directory: \n{}\n'.format(os.listdir("../input/test")))

## Preparation <a id="3"></a>

In [None]:
# Read in data
KAGGLE_DIR = '../input/'

train_df = pd.read_csv(KAGGLE_DIR + "train/train.csv")
test_df = pd.read_csv(KAGGLE_DIR + "test/test.csv")

## Data Description <a id="4"></a>

In [None]:
# Stats
print('Data Statistics:')
train_df.describe()

In [None]:
# Types
print('Types: ')
train_df.dtypes

In [None]:
# Overview
print('This dataset has {} rows and {} columns'.format(train_df.shape[0], train_df.shape[1]))
print('Example rows:')
train_df.head(3)

## Visualization <a id="5"></a>

In [None]:
# Type distribution
train_df['Type'].value_counts().rename({1:'Dog',
                                        2:'Cat'}).plot(kind='barh',
                                                       figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.title('Type Distribution', fontsize='xx-large')

In [None]:
# Gender distribution
train_df['Gender'].value_counts().rename({1:'Male',
                                          2:'Female',
                                          3:'Mixed (Group of pets)'}).plot(kind='barh', 
                                                                           figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.title('Gender distribution', fontsize='xx-large')

In [None]:
# Age distribution (only pets younger than 50 months are shown)
train_df['Age'][train_df['Age'] < 50].plot(kind='hist', 
                                           bins=100, 
                                           figsize=(15,6))
plt.title('Age distribution', fontsize='xx-large')
plt.xlabel('Age in months')

In [None]:
# Photo amount distribution
train_df['PhotoAmt'].plot(kind='hist', 
                          bins=30, 
                          xticks=list(range(31)), 
                          figsize=(15,6))
plt.title('PhotoAmt distribution', fontsize='xx-large')
plt.xlabel('Photos')

In [None]:
# Target variable (Adoption Speed)
print('The values are determined in the following way:\n\
0 - Pet was adopted on the same day as it was listed.\n\
1 - Pet was adopted between 1 and 7 days (1st week) after being listed.\n\
2 - Pet was adopted between 8 and 30 days (1st month) after being listed.\n\
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.\n\
4 - No adoption after 100 days of being listed.\n\
(There are no pets in this dataset that waited between 90 and 100 days).')

# Plot
train_df['AdoptionSpeed'].value_counts().sort_index(ascending=False).plot(kind='barh', 
                                                                          figsize=(15,6))
plt.title('Adoption Speed (Target Variable)', fontsize='xx-large')
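
The class definitions printed above amount to a simple binning of the number of days a pet waited. As a rough sketch, the hypothetical helper `adoption_speed` below reproduces that binning. It assumes that `None` stands for "never adopted" and that any wait longer than 90 days falls into class 4; neither assumption comes from the competition data itself.

```python
def adoption_speed(days_until_adoption):
    """Map days until adoption to the AdoptionSpeed class (0-4).

    Hypothetical helper: None is assumed to mean the pet was not adopted.
    """
    if days_until_adoption is None:
        return 4  # no adoption
    if days_until_adoption == 0:
        return 0  # same day as listing
    if days_until_adoption <= 7:
        return 1  # 1st week
    if days_until_adoption <= 30:
        return 2  # 1st month
    if days_until_adoption <= 90:
        return 3  # 2nd and 3rd month
    return 4      # assumption: longer waits count as class 4


for days in [0, 5, 30, 31, None]:
    print(days, '->', adoption_speed(days))
```

Because the classes are ordered bins of waiting time, a prediction that is off by one class is a smaller mistake than one that is off by four, which is what the competition metric rewards.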

In [None]:
# Example Description (of Nibble) ^^ 
print('Example Description (of Nibble) ^^ : ')
train_df['Description'][0]

## Metric <a id="6"></a>

The metric used for this competition is called "Quadratic Weighted Kappa".

We can use [scikit-learn's `cohen_kappa_score` function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html) almost straight out of the box to score our predictions; we only need to pass `weights='quadratic'`.

In [None]:
# Metric used for this competition (Quadratic Weighted Kappa, i.e. Cohen's kappa with quadratic weights)
def metric(y1, y2):
    return cohen_kappa_score(y1, y2, weights='quadratic')
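
To make the metric less of a black box, here is a sketch of the quadratic weighted kappa computed by hand in plain Python. The helper name `quadratic_weighted_kappa` is illustrative, and it assumes the labels are the integers 0 to `n_classes - 1`. With all five classes present in the data it should agree with `cohen_kappa_score(..., weights='quadratic')`, but treat it as an explanation rather than a replacement.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for integer labels 0..n_classes-1 (sketch)."""
    n = len(y_true)

    # Observed confusion matrix: rows are true labels, columns are predictions
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    # Marginal label counts, used for the chance-agreement matrix
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Penalty grows with the squared distance between the classes
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    # Note: divides by zero if both inputs are the same constant label
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])
reversed_score = quadratic_weighted_kappa([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
print(perfect, reversed_score)
```

A score of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean the predictions are worse than chance.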

## Data Cleaning <a id="10"></a>

In [None]:
# Clean up DataFrames: separate the target and drop the text and ID columns
# (these columns may be incorporated into the model later)
target = train_df['AdoptionSpeed']
clean_df = train_df.drop(columns=['Name', 'RescuerID', 'Description', 'PetID', 'AdoptionSpeed'])
clean_test = test_df.drop(columns=['Name', 'RescuerID', 'Description', 'PetID'])
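
The dropped text columns could later be brought back as simple numeric features. The following is one possible sketch on a made-up DataFrame (`demo_df`); the feature names `HasName`, `DescriptionLength` and `DescriptionWords` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical rows standing in for train.csv (assumption, not real data)
demo_df = pd.DataFrame({
    'Name': ['Nibble', None, 'No Name Yet'],
    'Description': ['Playful kitten.', None, 'Two puppies need a home'],
})

# Treat missing descriptions as empty strings before measuring them
descriptions = demo_df['Description'].fillna('')

text_features = pd.DataFrame({
    'HasName': demo_df['Name'].notna().astype(int),
    'DescriptionLength': descriptions.str.len(),
    'DescriptionWords': descriptions.str.split().str.len(),
})
print(text_features)
```

Features like these stay numeric, so they could be concatenated to `clean_df` without changing the tree models below.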

In [None]:
# Hold out 15% of the training data to validate the models
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(clean_df, target, test_size=0.15, random_state=1)

## Tree Ensembling <a id="7"></a>

We will combine predictions from four models: a [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), an [Extra Trees Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html), an [AdaBoost Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) and a [Gaussian Naive Bayes Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html). The final predictions are the average of the predictions of all models. [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) is used to find near-optimal parameters for every model except Gaussian Naive Bayes. For now, only the Random Forest is fitted in the cells below; the other models are commented out.

In [None]:
# Create base models
clf = RandomForestClassifier()
clf2 = ExtraTreesClassifier()
clf3 = AdaBoostClassifier()
clf4 = GaussianNB()

# Create parameters to use for Grid Search
rand_forest_grid = {
    'bootstrap': [True],
    'max_depth': [72,75,77],
    'max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and is rejected by recent scikit-learn releases
    'min_samples_leaf': [2, 3, 5],
    'min_samples_split': [2, 3, 5],
    'n_estimators': [175, 200, 500]
}

extra_trees_grid = {
    'bootstrap' : [False, True], 
    'criterion' : ['gini', 'entropy'], 
    'max_depth' : [77, 80, 83, 85], 
    'max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and is rejected by recent scikit-learn releases
    'min_samples_leaf': [5, 10], 
    'min_samples_split': [5, 10],
    'n_estimators': [175, 200, 225]
}

adaboost_grid = {
    'n_estimators' : [200, 225, 250],
    'learning_rate' : [.1, .2 , .3, .4, .5],
    'algorithm' : ['SAMME.R']
}

# Search parameter space
rand_forest_gridsearch = GridSearchCV(estimator = clf, 
                           param_grid = rand_forest_grid, 
                           cv = 3, 
                           n_jobs = -1, 
                           verbose = 1)

extra_trees_gridsearch = GridSearchCV(estimator = clf2, 
                           param_grid = extra_trees_grid, 
                           cv = 3, 
                           n_jobs = -1, 
                           verbose = 1)

adaboost_gridsearch = GridSearchCV(estimator = clf3, 
                           param_grid = adaboost_grid, 
                           cv = 3, 
                           n_jobs = -1, 
                           verbose = 1)
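
The searches above rank parameter settings by the classifier's default score (accuracy), not by the competition metric. One option is to wrap the quadratic kappa in a scorer with `make_scorer` and pass it as `scoring`. This is a sketch on synthetic data; `X_demo`, `y_demo`, `qwk_scorer` and the tiny parameter grid are assumptions made for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Scorer that evaluates each candidate with quadratic weighted kappa
qwk_scorer = make_scorer(cohen_kappa_score, weights='quadratic')

# Synthetic stand-in data: 4 features and an ordinal target with classes 0-4
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)
y_demo = np.digitize(X_demo[:, 0] + 0.1 * rng.randn(200), [0.2, 0.4, 0.6, 0.8])

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [3, 5]},  # deliberately tiny grid
    scoring=qwk_scorer,
    cv=3,
)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_, demo_search.best_score_)
```

With `scoring` set this way, `best_params_` and `best_score_` refer to the mean cross-validated kappa instead of accuracy.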

In [None]:
# Fit the models
rand_forest_gridsearch.fit(x_train, y_train)
#extra_trees_gridsearch.fit(clean_df, target)
#adaboost_gridsearch.fit(clean_df, target)
#clf4.fit(clean_df, target)

In [None]:
# What are the best parameters for each model
print('Random Forest model:\n{}\n'.format(rand_forest_gridsearch.best_params_))
#print('Extra Trees model:\n{}\n'.format(extra_trees_gridsearch.best_params_))
#print('Adaboost model:\n{}\n'.format(adaboost_gridsearch.best_params_))

In [None]:
# Measure of performance 
# Useful for checking overfitting, performance, etc.
print('Random Forest score: ', metric(rand_forest_gridsearch.predict(x_test), y_test))
#print('Extra Trees score: ', metric(extra_trees_gridsearch.predict(clean_df), target))
#print('Adaboost score: ', metric(adaboost_gridsearch.predict(clean_df), target))
#print('GaussianNB score: ', metric(clf4.predict(clean_df), target))

## Predictions <a id="8"></a>

In [None]:
# Get predictions
predictions1 = rand_forest_gridsearch.predict(clean_test)
#predictions2 = extra_trees_gridsearch.predict(clean_test)
#predictions3 = adaboost_gridsearch.predict(clean_test)
#predictions4 = clf4.predict(clean_test)

# Combine predictions
#final_predictions = []
# Get average of predictions
#for pred in zip(predictions1, predictions2, predictions3, predictions4):
#    final_predictions.append(int(round((sum(pred)) / 4, 0)))
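
The commented-out loop above averages the class labels predicted by each model and rounds the result. Below is a vectorized sketch of the same idea with NumPy, using made-up prediction arrays (`demo_preds`) as an assumption. Note that `np.rint`, like Python's built-in `round`, rounds halves to the nearest even integer, so a mean of 2.5 becomes 2.

```python
import numpy as np

# Hypothetical class predictions from four models for four pets
demo_preds = np.array([
    [0, 4, 2, 3],  # model 1
    [1, 4, 2, 2],  # model 2
    [1, 3, 3, 2],  # model 3
    [2, 4, 3, 3],  # model 4
])

# Average over the models (axis 0), then round to the nearest class label
mean_preds = demo_preds.mean(axis=0)
averaged = np.rint(mean_preds).astype(int)
print(mean_preds, averaged)
```

Averaging labels only makes sense because the classes are ordered; for unordered classes, a majority vote or averaged class probabilities would be the usual alternatives.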

In [None]:
# Compare predictions
#prediction_df = pd.DataFrame({'PetID' : test_df['PetID'],
#                              'Random Forest' : predictions1,
#                              'Extra Trees' : predictions2,
#                              'Adaboost' : predictions3,
#                              'GaussianNB' : predictions4,
#})

#print('Predictions for each model: ')
#prediction_df.head()

## Kaggle Submission <a id="9"></a>

In [None]:
# Store predictions for Kaggle Submission
submission_df = pd.DataFrame(data={'PetID' : test_df['PetID'], 
                                   'AdoptionSpeed' : predictions1})
submission_df.to_csv('submission.csv', index=False)

In [None]:
# Check submission
submission_df.head()

In [None]:
# Compare distributions of training set and test set (Adoption Speed)

# Plot 1
plt.figure(figsize=(15,4))
plt.subplot(211)
train_df['AdoptionSpeed'].value_counts().sort_index(ascending=False).plot(kind='barh')
plt.title('Target Variable distribution in training set', fontsize='large')

# Plot 2
plt.subplot(212)
submission_df['AdoptionSpeed'].value_counts().sort_index(ascending=False).plot(kind='barh')
plt.title('Target Variable distribution in predictions')

plt.subplots_adjust(top=2)