### Notebook Name: DSOL Ensemble
#### Description: This notebook features an overview of the competition, EDA (Exploratory Data Analysis) and Ensembling of different models.
#### Version: 1
#### Date Committed: 6 Jan 2018
#### Date Submitted: 6 Jan 2018
#### Score: 
#### Place: 157/345

#### Contributors:
#### Christopher Himmel
#### Rick Torzynski
#### Shreesha Pillangere Ramachandra
#### Muthu
#### Neha Varshney
#### Sandeep
#### Viswanathan Kodumudi Sivakumar
#### Mahesh


![](https://blog.groomit.me/wp-content/uploads/2018/02/petfinder2.jpg)

## PetFinder.my Adoption Prediction

## Table of contents

- [Data Columns](#1)
- [Dependencies](#2)
- [Preparation](#3)
- [Data Description](#4)
- [Visualization](#5)
- [Metric](#6)
- [Data Cleaning](#10)
- [Tree Ensembling](#7)
- [Predictions](#8)
- [Kaggle Submission](#9)

## Data columns <a id="1"></a>

[Source](https://www.kaggle.com/c/petfinder-adoption-prediction/data)

* PetID - Unique hash ID of pet profile
* AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
* Type - Type of animal (1 = Dog, 2 = Cat)
* Name - Name of pet (Empty if not named)
* Age - Age of pet when listed, in months
* Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
* Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
* Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
* Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
* Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
* Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
* MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
* FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
* Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
* Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
* Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
* Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
* Quantity - Number of pets represented in profile
* Fee - Adoption fee (0 = Free)
* State - State location in Malaysia (Refer to StateLabels dictionary)
* RescuerID - Unique hash ID of rescuer
* VideoAmt - Total uploaded videos for this pet
* PhotoAmt - Total uploaded photos for this pet
* Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.


## Dependencies <a id="2"></a>

In [None]:
# For notebook plotting
%matplotlib inline

# Standard libraries
import os
import json
import numpy as np
import pandas as pd
from pprint import pprint

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb
import xgboost as xgb



# Seed for reproducibility
seed = 12345
np.random.seed(seed)

# Info about dataset
print('Files and directories: \n{}\n'.format(os.listdir("../input")))
print('Within the train directory: \n{}\n'.format(os.listdir("../input/train")))
print('Within the test directory: \n{}\n'.format(os.listdir("../input/test")))

## Preparation <a id="3"></a>

In [None]:
# Read in data
KAGGLE_DIR = '../input/'

train_df = pd.read_csv(KAGGLE_DIR + "train/train.csv")
test_df = pd.read_csv(KAGGLE_DIR + "test/test.csv")

## Data Description <a id="4"></a>

In [None]:
# Stats
print('Data Statistics:')
train_df.describe()

In [None]:
# Types
print('Types: ')
train_df.dtypes

In [None]:
# Overview
print('This dataset has {} rows and {} columns'.format(train_df.shape[0], train_df.shape[1]))
print('Example rows:')
train_df.head()

## Visualization <a id="5"></a>

In [None]:
# Type distribution
train_df['Type'].value_counts().rename({1:'Dog',
                                        2:'Cat'}).plot(kind='barh',
                                                       figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.title('Type Distribution', fontsize='xx-large')

In [None]:
# Gender distribution
train_df['Gender'].value_counts().rename({1:'Male',
                                          2:'Female',
                                          3:'Mixed (Group of pets)'}).plot(kind='barh', 
                                                                           figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.title('Gender distribution', fontsize='xx-large')

In [None]:
# Age distribution 
train_df['Age'][train_df['Age'] < 50].plot(kind='hist', 
                                           bins = 100, 
                                           figsize=(15,6), 
                                           title='Age distribution')
plt.title('Age distribution', fontsize='xx-large')
plt.xlabel('Age in months')

In [None]:
# Photo amount distribution
train_df['PhotoAmt'].plot(kind='hist', 
                          bins=30, 
                          xticks=list(range(31)), 
                          figsize=(15,6))
plt.title('PhotoAmt distribution', fontsize='xx-large')
plt.xlabel('Photos')

In [None]:
# Target variable (Adoption Speed)
print('The values are determined in the following way:\n\
0 - Pet was adopted on the same day as it was listed.\n\
1 - Pet was adopted between 1 and 7 days (1st week) after being listed.\n\
2 - Pet was adopted between 8 and 30 days (1st month) after being listed.\n\
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.\n\
4 - No adoption after 100 days of being listed.\n\
(There are no pets in this dataset that waited between 90 and 100 days).')

# Plot
train_df['AdoptionSpeed'].value_counts().sort_index(ascending=False).plot(kind='barh', 
                                                                          figsize=(15,6))
plt.title('Adoption Speed (Target Variable)', fontsize='xx-large')
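
The bucketing printed above can also be written as a small helper. This is only an illustrative sketch: `adoption_speed_from_days` is a hypothetical name, the dataset already ships the label precomputed, and folding the unused 91 to 100 day gap into class 4 is an assumption made here for simplicity.

```python
def adoption_speed_from_days(days_listed):
    """Map days-until-adoption to an AdoptionSpeed class (0-4).

    `days_listed` is None when the pet was never adopted. This input
    convention is hypothetical; the real data only contains the label.
    """
    if days_listed is None or days_listed > 90:
        # Assumption: the source says no pets fall in the 91-100 day gap,
        # so everything past 90 days is treated as "not adopted" here.
        return 4
    if days_listed == 0:
        return 0  # adopted on the day of listing
    if days_listed <= 7:
        return 1  # first week
    if days_listed <= 30:
        return 2  # first month
    return 3      # second and third month (31 to 90 days)
```

Because the classes are ordered, a prediction that is off by one bucket is a smaller mistake than one that is off by four, which is what the competition metric rewards.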

In [None]:
# Example Description (of Nibble) ^^ 
print('Example Description (of Nibble) ^^ : ')
train_df['Description'][0]

## Metric <a id="6"></a>

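The metric used for this competition is the Quadratic Weighted Kappa, which measures agreement between predicted and actual labels and penalizes each disagreement by the squared distance between the two classes.

We can use [scikit-learn's `cohen_kappa_score` function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html) almost out of the box for measuring our predictions by passing `weights='quadratic'`.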

In [None]:
# Metric used for this competition (Quadratic Weighted Kappa aka Quadratic Cohen Kappa Score)
def metric(y1,y2):
    return cohen_kappa_score(y1,y2, weights='quadratic')
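
To make the metric less of a black box, here is a dependency-free sketch that computes the quadratic weighted kappa straight from its definition. The function name `quadratic_weighted_kappa` and the `n_classes` argument are introduced for illustration only; the notebook itself keeps using `cohen_kappa_score`, and the two should agree on the same inputs.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa computed directly from its definition."""
    n = len(y_true)
    # Observed confusion matrix: observed[i][j] counts (true=i, predicted=j).
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    # Marginal histograms of the true and the predicted labels.
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Penalty grows with the squared distance between the classes.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            # Count expected if true and predicted labels were independent.
            expected = hist_true[i] * hist_pred[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected

perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])
reversed_order = quadratic_weighted_kappa([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
```

A score of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. The sketch does not guard against the degenerate case where the expected term is zero.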

## Loading Sentiment Data From json files <a id="10"></a>

In [None]:
import json
from pprint import pprint

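def loadSentimentData(df, path):
    """Add document-level sentiment magnitude and score columns to df."""
    sentiments_mag = []
    sentiments_score = []
    for petId in df['PetID']:
        try:
            with open(path + str(petId) + '.json') as f:
                doc = json.load(f)['documentSentiment']
            mag, score = doc['magnitude'], doc['score']
        except (OSError, ValueError, KeyError):
            # No usable sentiment file for this pet: fall back to neutral values
            mag, score = 0, 0
        sentiments_mag.append(mag)
        sentiments_score.append(score)

    df['Sentiment_mag'] = np.array(sentiments_mag)
    df['Sentiment_score'] = np.array(sentiments_score)
    return df

# Each split is read from its own sentiment directory; the test path assumes
# the competition's '../input/test_sentiment/' folder
train_df = loadSentimentData(train_df, '../input/train_sentiment/')
test_df = loadSentimentData(test_df, '../input/test_sentiment/')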
train_df.head()
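
The loader relies on each sentiment file having a top-level `documentSentiment` object with `magnitude` and `score` fields. The sketch below parses a made-up payload of that shape from a string, so it runs without the competition files; `read_sentiment` and the sample values are hypothetical.

```python
import json

# Hypothetical, trimmed-down payload in the shape the loader expects.
sample = '{"documentSentiment": {"magnitude": 2.4, "score": 0.3}, "language": "en"}'

def read_sentiment(raw_json):
    """Return (magnitude, score), or (0, 0) when the payload is unusable."""
    try:
        doc = json.loads(raw_json)["documentSentiment"]
        return doc["magnitude"], doc["score"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON or missing fields: use neutral defaults.
        return 0, 0

magnitude, score = read_sentiment(sample)
missing = read_sentiment("{}")
```

Falling back to zeros keeps the feature columns complete, at the cost of making "no sentiment file" indistinguishable from "neutral sentiment".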

## Data Cleaning <a id="10"></a>

In [None]:
# Clean up DataFrames
target = train_df['AdoptionSpeed']
clean_x_train = train_df.drop(columns=['Name', 'RescuerID', 'Description', 'PetID', 'AdoptionSpeed'])
clean_x_test = test_df.drop(columns=['Name', 'RescuerID', 'Description', 'PetID'])
target.describe(include = 'all')

## Correlation Matrix <a id="3"></a>

In [None]:
sns.set(style="white")

# Compute the correlation matrix
corr_df = train_df.drop(columns=['Name', 'RescuerID', 'Description', 'PetID', 'AdoptionSpeed'])
corr_df['AdoptionSpeed'] = target
corr = corr_df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent NumPy releases
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

## Feature Importance <a id="3"></a>

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline

model = RandomForestClassifier()
model.fit(clean_x_train, target)

(pd.Series(model.feature_importances_, index=clean_x_train.columns)
   .nlargest(30)
   .plot(kind='barh'))

In [None]:
#check types of features
clean_x_train.dtypes

## Normalization <a id="3"></a>

In [None]:
#Data Normalization if needed
from sklearn import preprocessing
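def normalizeData(df):
    x = df.values  # returns a numpy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    # Keep the original column names and index, which a bare DataFrame(x_scaled) would drop
    return pd.DataFrame(x_scaled, columns=df.columns, index=df.index)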
#Normalize if needed
# clean_x_train = normalizeData(clean_x_train)
# clean_x_test = normalizeData(clean_x_test)
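
One caveat if normalization is ever switched on: calling the helper separately on the training and test frames fits a new scaler each time, so the two sets would be scaled with different ranges. A common alternative is to learn the ranges from the training data only and reuse them. The sketch below shows that idea in plain Python; `fit_min_max` and `apply_min_max` are hypothetical names and the rows are made up.

```python
def fit_min_max(rows):
    """Learn per-column (min, max) from training rows (lists of numbers)."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, bounds):
    """Scale rows with previously learned bounds; constant columns map to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, bounds)
        ])
    return scaled

train_rows = [[1, 10], [3, 20], [5, 30]]
test_rows = [[2, 40]]

bounds = fit_min_max(train_rows)          # learned from training data only
train_scaled = apply_min_max(train_rows, bounds)
test_scaled = apply_min_max(test_rows, bounds)
```

Test values outside the training range land outside [0, 1], which is expected. Tree-based models are largely insensitive to this kind of scaling, which is consistent with the step being optional here.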

In [None]:
print(clean_x_train.shape)
print(target.shape)

## Data Split <a id="3"></a>

In [None]:
# Splitting Data into Training and Validation set(75:25)
from sklearn.model_selection import train_test_split
x_train, x_validation, y_train, y_validation = train_test_split(clean_x_train, target, test_size=0.25, random_state=10)
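
For intuition, a seeded 75:25 split amounts to shuffling the row indices once and cutting the list. The following is a minimal sketch of that idea, not scikit-learn's actual implementation; `shuffle_split` is a hypothetical helper, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random

def shuffle_split(n_rows, test_size=0.25, seed=10):
    """Return (train_idx, validation_idx) after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # same seed, same order
    n_validation = math.ceil(n_rows * test_size)
    return indices[n_validation:], indices[:n_validation]

train_idx, validation_idx = shuffle_split(20)
```

Fixing the seed makes validation scores comparable between runs, because every model is evaluated on the same held-out rows.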

## Tree Ensembling <a id="7"></a>

We will use predictions from a [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), an [Extra Trees Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html), an [AdaBoost Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html), an XGBoost classifier and a LightGBM model. Later we will take the rounded average of the class labels predicted by all models to get the final predictions. [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) is used to get near-optimal parameters for the scikit-learn and XGBoost models, while the number of LightGBM boosting rounds is chosen with `lgb.cv`.
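
The averaging step can be sketched on made-up labels as follows. `average_vote` is a hypothetical helper that mirrors the rounding expression used later in the notebook.

```python
def average_vote(labels):
    """Round the mean of integer class labels to the nearest class."""
    return int(round(sum(labels) / len(labels), 0))

# One list per model, one entry per pet (made-up predictions).
model_a = [0, 3, 4]
model_b = [1, 3, 4]
model_c = [1, 2, 4]
model_d = [2, 3, 3]
model_e = [4, 2, 2]

combined = [average_vote(row)
            for row in zip(model_a, model_b, model_c, model_d, model_e)]
```

Averaging only makes sense because the classes are ordered. Two side effects are worth keeping in mind: a single extreme vote shifts the mean (the first pet ends up in class 2 although no model agrees with another on it), and Python's `round` sends exact halves to the nearest even integer, so a mean of 2.5 becomes 2.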

In [None]:
# Duplicate keys removed; the values kept are the ones Python actually used
# (the last occurrence of each key in the original literal)
params = {'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 5,
    'metric': 'multi_logloss',
    'max_bin': 100,
    'max_depth': 9,
    'num_leaves': 80,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.85,
    'bagging_freq': 17,
    'learning_rate': 0.01,
    'min_split_gain': 0.01,
    'min_child_samples': 150,
    'min_child_weight': 0.1,
}

d_train = lgb.Dataset(x_train, label=y_train)
clean_d_train = lgb.Dataset(clean_x_train, label=target)

### Create base models

In [None]:
random_forest = RandomForestClassifier()
extra_trees = ExtraTreesClassifier()
ada_boost = AdaBoostClassifier()
xg_boost = xgb.XGBClassifier()

lgb_cv = lgb.cv(params, d_train, num_boost_round=10000,
                 nfold=3, shuffle=True, stratified=True, verbose_eval=20, early_stopping_rounds=100)
# The list position of the lowest loss is zero-based, so add 1 to get the number of boosting rounds
nround = int(np.argmin(lgb_cv['multi_logloss-mean'])) + 1
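
The number of boosting rounds is read off the cross-validation history. A small sketch with made-up loss values shows how a zero-based list position relates to a number of rounds; the names `mean_losses`, `best_index` and `best_num_rounds` are illustrative only.

```python
# Made-up mean validation losses, one entry per boosting round.
mean_losses = [1.52, 1.47, 1.44, 1.43, 1.45, 1.46]

best_index = mean_losses.index(min(mean_losses))  # zero-based position
best_num_rounds = best_index + 1                  # rounds needed to reach it
```

Here the lowest loss sits at position 3, which corresponds to training for 4 rounds.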

In [None]:
grid_search = False
if grid_search == True:

    # Create parameters to use for Grid Search
    rand_forest_grid = {
        'bootstrap': [True],
        'max_depth': [77, 80, 83, 85],
        'max_features': ['auto'],
        'min_samples_leaf': [5, 10],
        'min_samples_split': [5, 10],
        'n_estimators': [175, 200, 225]
    }

    extra_trees_grid = {
        'bootstrap' : [False, True], 
        'criterion' : ['gini', 'entropy'], 
        'max_depth' : [77, 80, 83, 85], 
        'max_features': ['auto'], 
        'min_samples_leaf': [5, 10], 
        'min_samples_split': [5, 10],
        'n_estimators': [175, 200, 225]
    }

    adaboost_grid = {
        'n_estimators' : [200, 225, 250],
        'learning_rate' : [.1, .2, .3, .4, .5],
        'algorithm' : ['SAMME.R']
    }

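    # Note: 'max_features' and 'min_samples_leaf' are scikit-learn tree parameter names,
    # not documented XGBoost parameters, so they likely have no effect on this search
    xgboost_grid = {"max_depth": [1,2,3],
                  "max_features" : [1.0, 1.5],
                  "min_samples_leaf" : [3,5,9],
                  "n_estimators": [300, 500],
                  "learning_rate": [0.02,0.05,0.1]}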

    # Search parameter space
    rand_forest_gridsearch = GridSearchCV(estimator = random_forest, 
                               param_grid = rand_forest_grid, 
                               cv = 3, 
                               n_jobs = -1, 
                               verbose = 1)

    extra_trees_gridsearch = GridSearchCV(estimator = extra_trees, 
                               param_grid = extra_trees_grid, 
                               cv = 3, 
                               n_jobs = -1, 
                               verbose = 1)

    adaboost_gridsearch = GridSearchCV(estimator = ada_boost, 
                               param_grid = adaboost_grid, 
                               cv = 3, 
                               n_jobs = -1, 
                               verbose = 1)
    xgboost_gridsearch = GridSearchCV(estimator = xg_boost, 
                              param_grid=xgboost_grid, 
                              cv = 3, 
                              n_jobs = -1, 
                              verbose = 1)
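
An exhaustive grid search fits one model per parameter combination and per fold, so the cost grows multiplicatively with every list in the grid. The sketch below counts the fits for the Random Forest grid above; `count_fits` is a hypothetical helper, and the count excludes the final refit on the full training data.

```python
from itertools import product

rand_forest_grid = {
    'bootstrap': [True],
    'max_depth': [77, 80, 83, 85],
    'max_features': ['auto'],
    'min_samples_leaf': [5, 10],
    'min_samples_split': [5, 10],
    'n_estimators': [175, 200, 225],
}

def count_fits(param_grid, cv):
    """Number of model fits for an exhaustive search with `cv` folds."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates * cv

n_fits = count_fits(rand_forest_grid, cv=3)  # 48 candidates x 3 folds
```

This is why the search is guarded by a flag and the final section only passes single-value grids.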

In [None]:
# Fit the grid_search models
submission=True
if submission==False and grid_search==True:
    rand_forest_gridsearch.fit(x_train, y_train)
    extra_trees_gridsearch.fit(x_train, y_train)
    adaboost_gridsearch.fit(x_train, y_train)
    xgboost_gridsearch.fit(x_train, y_train)
    lightGBM = lgb.train(params, d_train, nround)
    
    # What are the best parameters for each model
    print('Random Forest model:\n{}\n'.format(rand_forest_gridsearch.best_params_))
    print('Extra Trees model:\n{}\n'.format(extra_trees_gridsearch.best_params_))
    print('Adaboost model:\n{}\n'.format(adaboost_gridsearch.best_params_))
    print('XGboost model:\n{}\n'.format(xgboost_gridsearch.best_params_))

    # Get Validation predictions
    predictions_rf = rand_forest_gridsearch.predict(x_validation)
    predictions_et = extra_trees_gridsearch.predict(x_validation)
    predictions_ab = adaboost_gridsearch.predict(x_validation)
    predictions_xgb = xgboost_gridsearch.predict(x_validation)

    y_pred_lgbm = lightGBM.predict(x_validation)
    # Use the same name that the scoring and averaging code below expects
    predictions_lgbm = []
    for pred in y_pred_lgbm:
        predictions_lgbm.append(pred.argmax())

    # Measure of performance 
    # Useful for checking overfitting, performance, etc.
    print('Random Forest score: ', metric(predictions_rf, y_validation))
    print('Extra Trees score: ', metric(predictions_et, y_validation))
    print('Adaboost score: ', metric(predictions_ab, y_validation))
    print('XGBoost score: ', metric(predictions_xgb, y_validation))
    print('LightGBM score: ', metric(predictions_lgbm, y_validation))

    # Combine predictions
    validation_predictions = []
    # Get average of predictions
    for pred in zip(predictions_rf, predictions_et, predictions_ab, predictions_xgb, predictions_lgbm):
        validation_predictions.append(int(round((sum(pred)) / 5, 0)))

    print('Combined Model Validation Kappa Score: ', metric(validation_predictions, y_validation))
    print('Combined Model Validation accuracy Score: ', accuracy_score(validation_predictions, y_validation))

    # Random Forest score:  0.3548276726741826
    # Extra Trees score:  0.3103631633681623
    # Adaboost score:  0.3258035539320042
    # XGBoost score:  0.35960892487423246
    # LightGBM score:  0.3675133483960088

    # Combined Model Validation Kappa Score:  0.3674064960957891
    # Combined Model Validation accuracy Score:  0.4022405974926647

else:
    lightGBM = lgb.train(params, d_train, nround)
    y_pred_lgbm = lightGBM.predict(x_validation)
    predictions_lgbm = []
    for pred in y_pred_lgbm:
        predictions_lgbm.append(pred.argmax())
    print('LightGBM score: ', metric(predictions_lgbm, y_validation))
    
print('done')
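
With the `multiclass` objective, LightGBM returns one row of class probabilities per sample, which is why the code takes the `argmax` of each row. A dependency-free sketch with made-up probabilities follows; `argmax` here is a hypothetical stand-in for the NumPy method used in the cell.

```python
def argmax(values):
    """Index of the largest value; ties resolve to the first occurrence."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up probability rows for the five AdoptionSpeed classes.
probabilities = [
    [0.05, 0.20, 0.30, 0.25, 0.20],
    [0.01, 0.09, 0.20, 0.30, 0.40],
]

labels = [argmax(row) for row in probabilities]
```

On a NumPy array the same result can be obtained in one call with `np.argmax(y_pred_lgbm, axis=1)`.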
    


## Predictions <a id="8"></a>
### Create parameters to use for Grid Search

In [None]:
rand_forest_grid = {
    'bootstrap': [True],
    'max_depth': [77],
    'max_features': ['auto'],
    'min_samples_leaf': [5],
    'min_samples_split': [10],
    'n_estimators': [200]
}

extra_trees_grid = {
    'bootstrap' : [False], 
    'criterion' : ['entropy'], 
    'max_depth' : [83], 
    'max_features': ['auto'], 
    'min_samples_leaf': [5], 
    'min_samples_split': [5],
    'n_estimators': [225]
}

adaboost_grid = {
    'n_estimators' : [200],
    'learning_rate' : [.3],
    'algorithm' : ['SAMME.R']
}

xgboost_grid = {"max_depth": [3],
              "max_features" : [1.0],
              "min_samples_leaf" : [3],
              "n_estimators": [300],
              "learning_rate": [0.1]}

rand_forest_gridsearch = GridSearchCV(estimator = random_forest, 
                           param_grid = rand_forest_grid, 
                           cv = 3, 
                           n_jobs = -1, 
                           verbose = 1)

extra_trees_gridsearch = GridSearchCV(estimator = extra_trees, 
                           param_grid = extra_trees_grid, 
                           cv = 3, 
                           n_jobs = -1, 
                           verbose = 1)

adaboost_gridsearch = GridSearchCV(estimator = ada_boost, 
                           param_grid = adaboost_grid, 
                           cv = 3, 
                           n_jobs = -1, 
                           verbose = 1)
xgboost_gridsearch = GridSearchCV(estimator = xg_boost, 
                          param_grid=xgboost_grid, 
                          cv = 3, 
                          n_jobs = -1, 
                          verbose = 1)

In [None]:
# Final Model
# Fit the models
rand_forest_gridsearch.fit(clean_x_train, target)
extra_trees_gridsearch.fit(clean_x_train, target)
adaboost_gridsearch.fit(clean_x_train, target)
xgboost_gridsearch.fit(clean_x_train, target)
# Reassign lightGBM so the predictions below use the model trained on the full training set
lightGBM = lgb.train(params, clean_d_train, nround)

# Get Final predictions
predictions_rf = rand_forest_gridsearch.predict(clean_x_test)
predictions_et = extra_trees_gridsearch.predict(clean_x_test)
predictions_ab = adaboost_gridsearch.predict(clean_x_test)
predictions_xgb = xgboost_gridsearch.predict(clean_x_test)

y_pred_lgbm = lightGBM.predict(clean_x_test)
predictions_lgbm = []
for pred in y_pred_lgbm:
    predictions_lgbm.append(pred.argmax())

# Combine predictions
final_predictions = []
# Get average of predictions
for pred in zip(predictions_rf, predictions_et, predictions_ab, predictions_xgb, predictions_lgbm):
    final_predictions.append(int(round((sum(pred)) / 5, 0)))

print('done')

In [None]:
# Compare predictions
prediction_df = pd.DataFrame({'PetID' : test_df['PetID'],
                             'Random Forest' : predictions_rf,
                             'Extra Trees' : predictions_et,
                             'Adaboost' : predictions_ab,
                             'XGBoost' : predictions_xgb,
                             'lightGBM' : predictions_lgbm
})

prediction_df.head()

## Kaggle Submission <a id="9"></a>

In [None]:
# Store predictions for Kaggle Submission
submission_df = pd.DataFrame(data={'PetID' : test_df['PetID'], 
                                   'AdoptionSpeed' : final_predictions})
submission_df.to_csv('submission.csv', index=False)
submission_df.shape
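
Before uploading, a few cheap sanity checks can catch a malformed file. The checks below are assumptions about what a sensible submission looks like, not official competition rules, and `check_submission` is a hypothetical helper working on plain lists.

```python
def check_submission(pet_ids, predictions, n_classes=5):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(pet_ids) != len(predictions):
        problems.append('length mismatch')
    if len(set(pet_ids)) != len(pet_ids):
        problems.append('duplicate PetID')
    if any(not (isinstance(p, int) and 0 <= p < n_classes) for p in predictions):
        problems.append('label outside 0..{}'.format(n_classes - 1))
    return problems

ok = check_submission(['a', 'b'], [0, 4])
bad = check_submission(['a', 'a'], [0, 5])
```

With the notebook's objects this could be called as `check_submission(list(test_df['PetID']), final_predictions)`, since `final_predictions` already holds plain Python integers.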

In [None]:
# Check submission
submission_df.head()

In [None]:
# Compare the Adoption Speed distribution of the training set with that of our test-set predictions

# Plot 1
plt.figure(figsize=(15,4))
plt.subplot(211)
train_df['AdoptionSpeed'].value_counts().sort_index(ascending=False).plot(kind='barh')
plt.title('Target Variable distribution in training set', fontsize='large')

# Plot 2
plt.subplot(212)
submission_df['AdoptionSpeed'].value_counts().sort_index(ascending=False).plot(kind='barh')
plt.title('Target Variable distribution in predictions')

plt.subplots_adjust(top=2)
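
The same comparison can be made numerically by looking at class shares instead of bar lengths. The sketch below uses made-up labels; `class_shares` is a hypothetical helper.

```python
from collections import Counter

def class_shares(labels, n_classes=5):
    """Fraction of samples in each class, in class order."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(n_classes)]

train_labels = [0, 1, 1, 2, 2, 2, 3, 3, 4, 4]   # made-up
predicted = [1, 2, 2, 2, 2, 2, 3, 3, 3, 4]      # made-up

train_share = class_shares(train_labels)
pred_share = class_shares(predicted)
```

A large gap between the two lists, for example a class that is never predicted, may indicate that averaging the labels is pulling predictions toward the middle classes.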