Competition link: https://www.kaggle.com/c/petfinder-adoption-prediction <br>
Kernel link: https://www.kaggle.com/olivbau/petfinder-olivbau
# **PetFinder.my Adoption Prediction**

In this competition we will predict the speed at which a pet is adopted, based on the pet’s listing on PetFinder.

### Data:
* PetID - Unique hash ID of pet profile
* AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See the Target section below for more info.
* Type - Type of animal (1 = Dog, 2 = Cat)
* Name - Name of pet (Empty if not named)
* Age - Age of pet when listed, in months
* Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
* Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
* Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
* Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
* Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
* Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
* MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
* FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
* Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
* Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
* Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
* Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
* Quantity - Number of pets represented in profile
* Fee - Adoption fee (0 = Free)
* State - State location in Malaysia (Refer to StateLabels dictionary)
* RescuerID - Unique hash ID of rescuer
* VideoAmt - Total uploaded videos for this pet
* PhotoAmt - Total uploaded photos for this pet
* Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

### Target: Adoption speed

* 0 - Pet was adopted on the same day as it was listed.
* 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
* 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
* 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
* 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days). 
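
To make the class boundaries concrete, here is a minimal sketch of how a number of days between listing and adoption would map onto these classes. The helper `adoption_speed_from_days` is hypothetical (the dataset ships the class directly, not the number of days), and treating anything above 90 days as class 4 is an assumption based on the note that no pet waited between 90 and 100 days.

```python
from typing import Optional


def adoption_speed_from_days(days_to_adoption: Optional[int]) -> int:
    """Map days between listing and adoption to an AdoptionSpeed class.

    Hypothetical helper for illustration only. `None` means the pet
    was never adopted during the observation window (class 4).
    """
    if days_to_adoption is None:
        return 4
    if days_to_adoption < 0:
        raise ValueError("days_to_adoption must be non-negative")
    if days_to_adoption == 0:
        return 0  # adopted on the listing day
    if days_to_adoption <= 7:
        return 1  # first week
    if days_to_adoption <= 30:
        return 2  # first month
    if days_to_adoption <= 90:
        return 3  # second and third month
    return 4      # assumption: anything beyond 90 days counts as "no adoption"


for days in (0, 3, 8, 45, None):
    print(days, "->", adoption_speed_from_days(days))
```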

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd



# Data Overview

## Data Sets

In [None]:
from sklearn.model_selection import train_test_split

train = pd.read_csv('../input/train/train.csv')
test = pd.read_csv('../input/test/test.csv')

temp_train = train.copy()
temp_test = test.copy()
temp_train['dataset_type'] = 'train'
temp_test['dataset_type'] = 'test'
all_data = pd.concat([temp_train, temp_test], sort=False)

In [None]:
train.head()

In [None]:
# Lookup tables that map numeric IDs to human-readable names
labels_breed = pd.read_csv('../input/breed_labels.csv')
labels_color = pd.read_csv('../input/color_labels.csv')
labels_state = pd.read_csv('../input/state_labels.csv')
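
The label files are lookup tables, so one way to use them is to map the numeric IDs in the pet listings to readable names. The sketch below uses small made-up frames rather than the real files; the column names `BreedID` and `BreedName` are assumptions about the layout of `breed_labels.csv`, and the ID values are purely illustrative.

```python
import pandas as pd

# Toy stand-ins for breed_labels.csv and train.csv.
# Column names and values are assumed for illustration; check the real files.
labels_breed_demo = pd.DataFrame({
    "BreedID": [1, 2, 3],
    "BreedName": ["Breed A", "Breed B", "Breed C"],
})
pets_demo = pd.DataFrame({
    "PetID": ["p1", "p2", "p3"],
    "Breed1": [3, 2, 0],  # 0 has no entry in the lookup table
})

# Build an ID -> name Series, then map it onto the ID column.
breed_names = labels_breed_demo.set_index("BreedID")["BreedName"]
pets_demo["Breed1Name"] = pets_demo["Breed1"].map(breed_names).fillna("Unknown")

print(pets_demo)
```

IDs without a matching label come back as missing values from `map`, which is why the sketch fills them with a placeholder.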

## AdoptionSpeed

Let's look at the distribution of AdoptionSpeed in the training set. <br>
The classes look fairly evenly distributed, except for class 0 (same-day adoption), which is much rarer.

In [None]:
sns.countplot(x="AdoptionSpeed", data=train)
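
A count plot is easier to judge alongside the actual class shares. The following sketch computes them with `value_counts(normalize=True)` on a small made-up series; on the real data the same call would be applied to `train['AdoptionSpeed']`.

```python
import pandas as pd

# Made-up target values, for illustration only.
adoption_speed_demo = pd.Series([0, 1, 1, 2, 2, 2, 3, 3, 4, 4], name="AdoptionSpeed")

# Share of each class, ordered by class label rather than by frequency.
class_share = adoption_speed_demo.value_counts(normalize=True).sort_index()

print(class_share.round(2))
```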

## Types

In [None]:
plt.figure(figsize=(10, 6));
sns.countplot(x='dataset_type', data=all_data, hue='Type');
plt.title('Number of cats and dogs in train and test data');
plt.legend(['Dog','Cat'])
plt.show()

## Age

In [None]:
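# Pass the plot type as a keyword: recent pandas versions reject positional arguments here
train['Age'].plot(kind='hist', label='train');
test['Age'].plot(kind='hist', label='test');
plt.xlabel("Age in months")
plt.legend(["train", "test"])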

## Correlation

We can see that deworming, sterilization, and vaccination are correlated with one another.<br>
However, AdoptionSpeed does not appear to be strongly correlated with any single feature.


In [None]:
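# Restrict to numeric columns: recent pandas versions raise on text columns in corr()
sns.heatmap(train.select_dtypes(include='number').corr())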
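
A heatmap can be hard to read for a single column, so it may help to rank features by their absolute correlation with the target. This is a minimal sketch on randomly generated data (the values carry no meaning); the column names are borrowed from the dataset, and `Dewormed` is copied from `Vaccinated` only to produce one obviously correlated pair. Keep in mind that Pearson correlation on categorical codes such as `Breed1` or `State` is of limited use.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Random toy data, for illustration only.
demo = pd.DataFrame({
    "Vaccinated": rng.integers(1, 4, size=n),
    "Age": rng.integers(0, 120, size=n),
    "Name": ["pet"] * n,  # text column, excluded from the correlation
})
demo["Dewormed"] = demo["Vaccinated"]  # perfectly correlated by construction
demo["AdoptionSpeed"] = rng.integers(0, 5, size=n)

# Correlation matrix over numeric columns only.
corr = demo.select_dtypes(include="number").corr()

# Absolute correlation of every feature with the target, strongest first.
target_corr = (
    corr["AdoptionSpeed"]
    .drop("AdoptionSpeed")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```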

## Feature Importance

We can also look at feature importance. <br>
The most important feature seems to be the number of photos of the animal, followed by its age, then its coat color.

In [None]:
from sklearn.ensemble import RandomForestClassifier

X = train.drop(['AdoptionSpeed', 'Description', 'Name', 'RescuerID', 'PetID'], axis=1)
y = train.AdoptionSpeed
test_classifier = test.drop(['Description', 'Name', 'RescuerID', 'PetID'], axis=1)

forest = RandomForestClassifier(max_depth=None, n_estimators = 100)
forest.fit(X,y)
importances = forest.feature_importances_

std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. [%d]: %s (%f)" % (f + 1, indices[f], X.columns.values.tolist()[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
# Label the bars with feature names instead of column indices
plt.xticks(range(X.shape[1]), X.columns[indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()

# Classifier Comparison

In [None]:
# Preparing data
X = train.drop(['AdoptionSpeed', 'Description', 'Name', 'RescuerID', 'PetID'], axis=1)
test_classifier = test.drop(['Description', 'Name', 'RescuerID', 'PetID'], axis=1)
y = train.AdoptionSpeed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
from matplotlib.colors import ListedColormap
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import cohen_kappa_score
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score

classifiers = [
    {"name": "KNN", "clfObject": KNeighborsClassifier(3)},
    #{"name": "Gaussian Process", "clfObject": GaussianProcessClassifier(1.0 * RBF(1.0))},
    #{"name": "Neural Net", "clfObject": MLPClassifier(alpha=1)},
    {"name": "Decision Tree", "clfObject": DecisionTreeClassifier(max_depth=5)},
    {"name": "Random Forest", "clfObject": RandomForestClassifier(max_depth=None, n_estimators = 100)},
    {"name": "AdaBoost", "clfObject": AdaBoostClassifier()},
    {"name": "XGBoost", "clfObject": XGBClassifier()},
    
]

for clf in classifiers:
    clf["clfObject"].fit(X_train, y_train)
    y_pred = clf["clfObject"].predict(X_test)
    clf["score"] = clf["clfObject"].score(X_test,y_test)
    clf["f1Score"] = f1_score(y_test,y_pred, average = "weighted")
    clf["accuracyScore"] = accuracy_score(y_test,y_pred)

    print("%s: Score: %.2f / F1 Score: %.2f / Accuracy Score: %.2f"% (clf["name"], clf["score"], clf["f1Score"], clf["accuracyScore"]))
    
names = [clf['name'] for clf in classifiers]
scores = [clf['score'] for clf in classifiers]
sns.barplot(x=names, y=scores)
plt.show()
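
The comparison above ranks models by accuracy and weighted F1, but the competition itself is scored with quadratic weighted kappa, which penalizes a prediction more the further it is from the true class. The `cohen_kappa_score` function imported above can compute it with `weights="quadratic"`. Here is a small sketch on made-up labels; in the loop above, the same call could be applied to `y_test` and `y_pred`.

```python
from sklearn.metrics import cohen_kappa_score

# Made-up labels, for illustration only.
y_true_demo = [0, 1, 2, 3, 4, 4, 2, 1]
y_pred_demo = [0, 2, 2, 3, 3, 4, 1, 1]

# Quadratic weights make an error of two classes cost four times
# as much as an error of one class.
qwk = cohen_kappa_score(y_true_demo, y_pred_demo, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.3f}")
```

Because the target is ordinal, a model with slightly lower accuracy can still score a higher kappa if its mistakes land on neighbouring classes.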

In [None]:
# Finding the best classifier
best_clf = max(classifiers, key=lambda x:x['score'])

# Prediction
best_clf["clfObject"].fit(X,y)
y_pred = best_clf["clfObject"].predict(test_classifier)

submission = pd.DataFrame({'PetID': test.PetID, 'AdoptionSpeed': y_pred})
submission.to_csv('submission.csv', index=False)