# Adoption Classification

<font color='steelblue'>

<span style="font-family:verdana; font-size:1.6em;">
    <b>PetFinder Prediction</b><br>
    The PetFinder dataset is used to predict how quickly a pet is adopted, based on its listing on PetFinder. <br>
    This notebook classifies whether a pet will be adopted or not.<br><br>
</span>
<span style="font-family:verdana; font-size:1.4em;">
    <b>The notebook covers the following steps:</b><em>
    <ol>
        <li>Load the training data (with labels) and the test data</li>
        <li>Handle the categorical values using a pipeline</li>
        <li>Create a Neural Network and build a model</li>
        <li>Train the model on the training dataset</li>
        <li>Evaluate the accuracy of the model using the test dataset</li>
        <li>Plot the accuracy and loss for the model</li>
    </ol></em>    
</span>

</font>

<font color='steelblue'>

<span style="font-family:verdana; font-size:1.6em;">
    To install pydot, which is needed to plot the model (in an Anaconda terminal):
    <ul>
        <li>pip install pydot</li><br>
        OR  <br><br>
        <li>conda install -c conda-forge pydot</li>
    </ul>
</span>
</font>

## Data Fields
- PetID - Unique hash ID of pet profile
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See the AdoptionSpeed section below for details.
- Type - Type of animal (1 = Dog, 2 = Cat)
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
- Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
- Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.


## AdoptionSpeed

Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:<br>
0 - Pet was adopted on the same day as it was listed.<br>
1 - Pet was adopted between 1 and 7 days (1st week) after being listed.<br>
2 - Pet was adopted between 8 and 30 days (1st month) after being listed.<br>
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.<br>
4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).<br> 
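
The class boundaries above can be written as a small lookup. This is a minimal sketch: `adoption_speed_class` is a hypothetical helper, not part of the dataset tooling, and it assumes the input is the number of days between listing and adoption, with `None` meaning the pet was never adopted. Waits longer than 90 days do not end in an adoption in this dataset, so the sketch maps them to class 4.

```python
from typing import Optional


def adoption_speed_class(days_to_adoption: Optional[int]) -> int:
    """Map days-until-adoption to an AdoptionSpeed class (hypothetical helper)."""
    # Assumption: None means "never adopted"; anything past 90 days is class 4.
    if days_to_adoption is None or days_to_adoption > 90:
        return 4
    if days_to_adoption == 0:
        return 0  # adopted on the listing day
    if days_to_adoption <= 7:
        return 1  # first week
    if days_to_adoption <= 30:
        return 2  # first month
    return 3      # second or third month


example_classes = [adoption_speed_class(d) for d in (0, 5, 21, 60, None)]
print(example_classes)
```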

In [None]:
import numpy as np 
import pandas as pd 
import os
import json
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

from functools import partial

import warnings
warnings.filterwarnings('ignore')

## Import the datasets (train and test)

In [None]:
traindf = pd.read_csv('../datasets/pet-train.csv', encoding = 'utf-8')

In [None]:
traindf.shape

In [None]:
traindf['Type'] = traindf['Type'].map({1: 'Dog', 2: 'Cat'})

In [None]:
traindf.head()

### Load the test dataset

In [None]:
evaldf = pd.read_csv('../datasets/pet-test.csv', encoding = 'utf-8')

In [None]:
evaldf.shape

In [None]:
evaldf['Type'] = evaldf['Type'].map({1: 'Dog', 2: 'Cat'})

In [None]:
evaldf.head()

## Create target variable
<font color='gray'>

<span style="font-family:verdana; font-size:1.2em;">
    <ul> 
<li>Let's simplify this for our tutorial. Here, you will transform this into a binary classification problem, and simply predict whether the pet was adopted, or not</li>
<li>After modifying the label column, 0 will indicate the pet was not adopted, and 1 will indicate it was</li>
    </ul>
</span>
</font>
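
As a quick illustration of the relabeling described above, the sketch below applies it to a toy frame. The `toy` DataFrame and its values are made up for the example; only the rule (class 4 becomes 0, everything else becomes 1) comes from the text.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one row per AdoptionSpeed class, plus an extra 4
toy = pd.DataFrame({"AdoptionSpeed": [0, 1, 2, 3, 4, 4]})

# Class 4 means "not adopted" -> 0; classes 0-3 mean "adopted" -> 1
toy["target"] = np.where(toy["AdoptionSpeed"] == 4, 0, 1)

# The mean of a 0/1 label is the share of adopted pets
adopted_share = toy["target"].mean()
print(toy)
print(f"share adopted: {adopted_share:.2f}")
```

Checking the adopted share on the real training data is a cheap way to see how imbalanced the binary problem is before training.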

In [None]:
traindf['AdoptionSpeed'].value_counts().sort_index().plot(kind = 'barh', 
                                                          color='steelblue')
plt.xlabel('count')
plt.ylabel('AdoptionSpeed class')
plt.title('Adoption speed classes counts')
plt.show()

In [None]:
plt.figure(figsize=(6, 4));
sns.countplot(x='Type', data = traindf);
plt.title('Number of cats and dogs in train data');

In [None]:
todrop = ['Name', 'RescuerID', 'PetID', 'AdoptionSpeed', 'Description' ]

In [None]:
# In the original dataset "4" indicates the pet was not adopted.
traindf['target'] = np.where(traindf['AdoptionSpeed'] == 4, 0, 1)

In [None]:
traindf.drop(columns = todrop, axis = 1, inplace = True)

In [None]:
traindf.head().transpose()

In [None]:
# The AdoptionSpeed column does not exist in the test data
todrop.remove('AdoptionSpeed')

In [None]:
targetCol = traindf.pop('target')

In [None]:
todrop

In [None]:
evaldf.drop(columns = todrop, axis = 1, inplace = True)

In [None]:
evaldf.head().transpose()

## Handle Categorical Data

In [None]:
cat_cols = ['Type', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2',
       'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
       'Sterilized', 'Health', 'State']

In [None]:
# For these categorical columns, convert the numbers to strings.
# The numeric codes are category labels, not quantities, so they are
# one-hot encoded later instead of being scaled as numbers
for col in cat_cols:
    traindf[col] = traindf[col].astype(str)
    evaldf[col] = evaldf[col].astype(str)

In [None]:
traindf.dtypes
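
The string conversion matters because the pipeline in the next section chooses columns by dtype. A minimal sketch on a made-up frame (the `toy` data and `toy_cat_cols` list are assumptions for illustration) shows how the cast moves coded columns out of the numeric group:

```python
import pandas as pd

# Toy frame, invented for illustration: two real quantities and two coded fields
toy = pd.DataFrame({
    "Age": [3, 12],
    "Fee": [0, 100],
    "Gender": [1, 2],
    "State": [41326, 41401],
})

toy_cat_cols = ["Gender", "State"]
toy[toy_cat_cols] = toy[toy_cat_cols].astype(str)

# After the cast, only true quantities are selected as numeric
numeric = toy.select_dtypes(include="number").columns.tolist()
categorical = [c for c in toy.columns if c not in numeric]
print(numeric, categorical)
```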

## Build pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [None]:
# Define the pipeline stages for numeric and categorical columns
numericPipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                              ('scaler', StandardScaler())])
stringPipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', 
                                                       fill_value='missing')),
                             ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
# Create list of numeric and categorical columns
numericCols = traindf.select_dtypes(include=['int64', 'float64']).columns
stringCols = traindf.select_dtypes(include=['object']).columns

In [None]:
numericCols

In [None]:
stringCols

In [None]:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(transformers=[('num', numericPipe, numericCols),
                                               ('cat', stringPipe, stringCols)])

In [None]:
df1 = preprocessor.fit_transform(traindf)

In [None]:
traindf.shape

In [None]:
df1.shape

In [None]:
evaldf1 = preprocessor.transform(evaldf)

In [None]:
evaldf1.shape
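
The preprocessor is fitted on the training data only and then reused on the evaluation data, so the evaluation set can contain categories the encoder never saw. The sketch below rebuilds the same two pipelines on a tiny made-up frame (`toy_train`, `toy_eval`, and their values are assumptions) to show that `handle_unknown='ignore'` encodes an unseen category as all zeros, and that a missing numeric value is replaced by the training mean.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration; "5" never appears in training
toy_train = pd.DataFrame({"Age": [1.0, 3.0, np.nan, 8.0],
                          "Color1": ["1", "2", "1", "7"]})
toy_eval = pd.DataFrame({"Age": [4.0], "Color1": ["5"]})

toy_pre = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                            ("scaler", StandardScaler())]), ["Age"]),
    ("cat", Pipeline(steps=[("imputer", SimpleImputer(strategy="constant",
                                                      fill_value="missing")),
                            ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["Color1"]),
])


def to_dense(matrix):
    # ColumnTransformer may return a sparse matrix or a dense array
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


train_out = to_dense(toy_pre.fit_transform(toy_train))
eval_out = to_dense(toy_pre.transform(toy_eval))

print(train_out.shape)  # one scaled column plus one column per seen color
print(eval_out)         # the one-hot part is all zeros for the unseen color
```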

In [None]:
y = targetCol.values

## Create training and test dataset

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1, y, test_size = 0.25, 
                                                    random_state = 2345)

In [None]:
X_train.shape

In [None]:
X_test.shape
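
With `test_size = 0.25`, a quarter of the rows are held out. The sketch below checks the resulting sizes on made-up data (`toy_X` and `toy_y` are assumptions). It also passes `stratify`, which the notebook does not use; it is shown only as an option for keeping the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 100 rows, 28 negatives and 72 positives
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.array([0] * 28 + [1] * 72)

# stratify is an optional extra here, not part of the notebook's split
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=2345, stratify=toy_y)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```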

## Build neural network

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
# Instantiate the model
model = keras.Sequential([
    layers.Dense(256, activation='relu', input_shape=[X_train.shape[1]]),
    layers.Dense(128, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation = 'sigmoid')
])

In [None]:
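# The output layer already applies a sigmoid, so the loss receives
# probabilities rather than logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=["accuracy"])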
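
`from_logits` tells the loss whether its inputs are raw scores or probabilities. A sigmoid output layer produces probabilities, which corresponds to `from_logits=False`. The NumPy sketch below is not Keras code; the helper names and the sample values are assumptions. It shows that the two forms of binary cross-entropy agree when each gets the input it expects, and that feeding probabilities to the logits form gives a different value.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_from_probs(y_true, p, eps=1e-7):
    # Cross-entropy computed directly on probabilities
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))


def bce_from_logits(y_true, z):
    # Numerically stable cross-entropy computed on raw scores (logits)
    return float(np.mean(np.maximum(z, 0) - z * y_true
                         + np.log1p(np.exp(-np.abs(z)))))


y_true = np.array([1.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, -1.0, 0.5, 0.3])
probs = sigmoid(logits)

loss_from_probs = bce_from_probs(y_true, probs)
loss_from_logits = bce_from_logits(y_true, logits)
loss_mismatch = bce_from_logits(y_true, probs)  # probabilities misread as logits

print(loss_from_probs, loss_from_logits, loss_mismatch)
```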

In [None]:
model.summary()
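
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes a hypothetical input width of 10 features; the real width is `X_train.shape[1]`, which depends on how many one-hot columns the preprocessor produced.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    # Weights (n_inputs * n_units) plus one bias per unit
    return (n_inputs + 1) * n_units


n_features = 10                   # hypothetical input width, for illustration only
layer_units = [256, 128, 32, 1]   # layer sizes used in the model above

counts = []
n_in = n_features
for units in layer_units:
    counts.append(dense_params(n_in, units))
    n_in = units  # this layer's output feeds the next layer

total_params = sum(counts)
print(counts, total_params)
```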

In [None]:
# rankdir='LR' is used to make the graph horizontal.
tf.keras.utils.plot_model(model, show_shapes=True, rankdir="LR")

In [None]:
# Convert the sparse matrices to dense NumPy arrays: validation_split
# does not accept sparse input during training
X_train = X_train.toarray()
X_test = X_test.toarray()

In [None]:
EPOCHS = 50
BATCHES = 128
history = model.fit(X_train, y_train, batch_size = BATCHES, validation_split = 0.20,
                    epochs = EPOCHS, verbose = 2)
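
Keras documents that `validation_split` holds out the last fraction of the samples, taken before any shuffling. The sketch below mirrors that behavior in NumPy to show the resulting sizes; `holdout_split` is a hypothetical helper and the toy arrays are made up, so this illustrates the idea rather than reproducing Keras's implementation.

```python
import math

import numpy as np


def holdout_split(x, y, validation_split):
    # Keep the first part for training, the tail for validation
    split_at = int(math.floor(len(x) * (1.0 - validation_split)))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


# Toy data, invented for illustration
toy_x = np.arange(1000).reshape(500, 2)
toy_y = np.arange(500) % 2

(train_x, train_y), (val_x, val_y) = holdout_split(toy_x, toy_y, 0.20)

# Number of gradient updates per epoch with a batch size of 128
steps_per_epoch = math.ceil(len(train_x) / 128)
print(len(train_x), len(val_x), steps_per_epoch)
```

Because the tail is taken in order, data that is sorted by label or by time should be shuffled before calling `fit` with `validation_split`.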

In [None]:
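# Take the metric names from the training history so they match its keys
metrics_names = [name for name in history.history if not name.startswith('val_')]
metrics_names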

In [None]:
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.title('Training and validation')
    plt.xlabel('Epochs')
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()

In [None]:
for name in metrics_names:
    plot_graphs(history, name)

In [None]:
loss, accuracy = model.evaluate(X_test, y_test)
print("Accuracy", accuracy)

## Predict on the evaluation dataset

In [None]:
# Convert the sparse evaluation matrix to a dense array before predicting
preds = model.predict(evaldf1.toarray())

In [None]:
# The model outputs probabilities; threshold at 0.5 to get class labels
preds = (preds >= 0.5).astype(int).ravel()
preds[:5]
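
A sigmoid output is a probability between 0 and 1, so class labels come from a threshold rather than from an integer cast. The sketch below uses made-up probabilities (`toy_probs` is an assumption) to compare the two: casting truncates every value below 1 to 0, while thresholding at 0.5 separates the classes.

```python
import numpy as np

# Toy predictions, invented for illustration, shaped like model.predict output
toy_probs = np.array([[0.08], [0.61], [0.93], [0.49]])

truncated = toy_probs.astype(int).ravel()             # cast drops the fraction
thresholded = (toy_probs >= 0.5).astype(int).ravel()  # 1 when probability >= 0.5

print(truncated)
print(thresholded)
```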

In [None]:
evaldf['target'] = preds

In [None]:
evaldf.head()

In [None]:
evaldf.loc[evaldf['target'] == 1, 'target'].count()

In [None]:
evaldf.loc[evaldf['target'] == 0, 'target'].count()