# Introduction

The aim of this competition is to predict Pawpularity scores for various photos of cats and dogs. In this notebook, I use the CatBoost algorithm, with hyperparameters tuned by Hyperopt, to create predictions using only the metadata provided.

# Import Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from hyperopt import fmin, hp, tpe, Trials, space_eval, STATUS_OK

# Load Data

In [None]:
# import the training and test sets
train_set = pd.read_csv('../input/petfinder-pawpularity-score/train.csv')
test_set = pd.read_csv('../input/petfinder-pawpularity-score/test.csv')

# EDA

A quick EDA on our data to see if we have any nulls or odd features.

In [None]:
print(train_set.shape)
print(test_set.shape)

In [None]:
train_set.info()

In [None]:
# set the columns we need for training
cols = [col for col in train_set.columns if col not in ['Id', 'Pawpularity']]

In [None]:
# check for any null entries
print(train_set.isnull().sum().sum())

This data is very clean!

In [None]:
# plot a correlation heat map
plt.figure(figsize=(16, 6))
sns.set(font_scale=1.1)
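# exclude 'Id': it is an identifier, not a feature, and corr() raises on non-numeric columns in pandas 2.x
heatmap = sns.heatmap(train_set.drop(columns='Id').corr(), vmin=-1, vmax=1, annot=True, cmap="YlGnBu")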
heatmap.set_title('Correlation Matrix Heatmap', fontdict={'fontsize':22}, pad=14);

In [None]:
# correlation just for pawpularity
plt.figure(figsize=(8, 8))
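# exclude 'Id' before computing correlations (identifier column, not a feature)
paw_corr = train_set.drop(columns='Id').corr()[['Pawpularity']].sort_values(by='Pawpularity', ascending=False)
heatmap = sns.heatmap(paw_corr, vmin=-1, vmax=1, annot=True, cmap="YlGnBu")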
heatmap.set_title('Features Correlating with Pawpularity', fontdict={'fontsize':18}, pad=16);

We can clearly see that the metadata features are only weakly correlated with Pawpularity. Either way, we will use the metadata in this analysis. An alternative would be to use a CNN or a transfer-learning model, which are currently the best-performing approaches according to the leaderboard.

In [None]:
# distribution of pawpularity scores
plt.figure(figsize=(10,6))
ax = train_set['Pawpularity'].hist(bins=100)
ax.set_xlabel('Pawpularity Score')
ax.set_ylabel('Count')
plt.show()

In [None]:
train_set = train_set.loc[train_set['Pawpularity'] > 3]

There are extreme values at 100 and around the 1, 2 and 3 marks. The cell above removes the scores of 3 and below, as they can have a significant impact during training; the spike at 100 is left in the data.
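
The filtering step above can be sketched on a toy frame. The scores below and the `lower`/`upper` cutoffs are made up for illustration, and the two-sided variant is a hypothetical alternative, not something this notebook does.

```python
import pandas as pd

# Toy stand-in for train_set; the scores are invented for illustration.
toy = pd.DataFrame({'Pawpularity': [1, 2, 3, 4, 35, 60, 100, 100]})

# Same filter as the notebook: keep rows with a score above 3.
kept_low = toy.loc[toy['Pawpularity'] > 3]

# Hypothetical variant that also drops the spike at 100.
lower, upper = 3, 100
mask = (toy['Pawpularity'] > lower) & (toy['Pawpularity'] < upper)
kept_both = toy.loc[mask]

print(len(toy), len(kept_low), len(kept_both))
```

Whether dropping the scores at 100 helps is an empirical question; it is worth comparing validation RMSE with and without that filter before committing to it.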

# Hyperparameter Tuning With Hyperopt

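We will use the Hyperopt package to tune our hyperparameters. Hyperopt takes a Bayesian approach (here, the Tree-structured Parzen Estimator): it uses the results of earlier trials to decide which hyperparameters to try next, and so often finds good settings in fewer evaluations than random or grid search.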

In [None]:
# hold out 30% of the training set as a validation set, used for early stopping and for scoring each trial
x_train, x_valid, y_train, y_valid = train_test_split(train_set[cols],
                                                      train_set['Pawpularity'],
                                                      test_size=0.3)

In [None]:
# define the objective function: the best validation RMSE of a CatBoost model trained with the given hyperparameters
def objective(search_space):
    model = CatBoostRegressor(**search_space,
                              loss_function='RMSE',
                              eval_metric='RMSE',
                              early_stopping_rounds=100,
                              random_seed=42)
    
    model.fit(X = x_train, y = y_train, eval_set=(x_valid,y_valid), verbose=False)
    return {'loss': model.get_best_score()['validation']['RMSE'], 'status': STATUS_OK}
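
For reference, the RMSE minimised here is the square root of the mean squared difference between predictions and targets. A minimal NumPy sketch, where `rmse` is a helper name introduced for illustration and the numbers are made up:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up scores: absolute errors are 3, 4 and 0, so the mean squared error is 25/3.
print(rmse([30, 40, 50], [33, 36, 50]))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.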

In [None]:
# define the search space for the hyperparameters
search_space = {'learning_rate': hp.uniform('learning_rate', 0.1, 0.5),
                'iterations': hp.randint('iterations',100,1000),
                'l2_leaf_reg': hp.randint('l2_leaf_reg',1,10),
                'depth': hp.randint('depth',4,10),
                'bootstrap_type' : hp.choice('bootstrap_type', ['Bayesian', 'Bernoulli'])}

In [None]:
# Tree-structured Parzen Estimator (TPE), Hyperopt's Bayesian search algorithm
algorithm=tpe.suggest

In [None]:
# search for best parameters
best_params = fmin(
  fn=objective,
  space=search_space,
  algo=algorithm,
  max_evals=1000)

In [None]:
# map fmin's output (which holds an index for hp.choice parameters) back to the actual hyperparameter values
hyperparams = space_eval(search_space, best_params)

# Model Training

Fit our final model with the best hyperparameters found.

In [None]:
params = {'learning_rate' : hyperparams['learning_rate'],
          'iterations' : hyperparams['iterations'],
          'depth' : hyperparams['depth'],
          'loss_function' : 'RMSE',
          'l2_leaf_reg' : hyperparams['l2_leaf_reg'],
          'eval_metric' : 'RMSE',
          'early_stopping_rounds': 100,
          'bootstrap_type' : hyperparams['bootstrap_type']}

In [None]:
model = CatBoostRegressor(**params, random_seed=42)
model.fit(X = x_train, y = y_train, eval_set=(x_valid,y_valid),verbose=250)

In [None]:
print('best RMSE', model.get_best_score()['validation']['RMSE'])

# Predictions

Create predictions and write them to the submission file.

In [None]:
preds = model.predict(test_set[cols])

In [None]:
submission=pd.DataFrame()
submission['Id'] = test_set['Id']
submission['Pawpularity'] = preds
submission

In [None]:
submission.to_csv('submission.csv', index=False)
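
Before submitting, it can be worth checking that the file has the expected shape. The sketch below runs such checks on toy stand-ins; the `Id` strings and prediction values are invented, and the specific checks are a suggestion rather than a competition requirement.

```python
import pandas as pd

# Toy stand-ins; the Id strings and predictions are invented for illustration.
toy_test = pd.DataFrame({'Id': ['a1', 'b2', 'c3']})
toy_preds = [41.2, 37.8, 45.0]

toy_submission = pd.DataFrame({'Id': toy_test['Id'], 'Pawpularity': toy_preds})

# Checks: expected columns, one row per test Id, no missing predictions.
assert list(toy_submission.columns) == ['Id', 'Pawpularity']
assert len(toy_submission) == len(toy_test)
assert toy_submission['Pawpularity'].notnull().all()

print(toy_submission)
```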