# Pets: Definitive CatBoost parameter tuning
We'll tune the parameters of the [CatBoost](https://catboost.ai/) algorithm using [Pandas](https://pandas.pydata.org/) and [scikit-learn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). I'm a huge fan of pipelines as they ensure that (1) one **never** needs to modify the training data in-place and (2) all preprocessing steps can be parametrized and therefore tuned with, e.g., [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) (see the [example](https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html)).

We'll be using randomized search instead of [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) as randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid ([Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) by J. Bergstra and Y. Bengio, 2012).
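As a rough illustration of that argument, the sketch below compares nine grid trials with nine randomly sampled trials. The parameter names and ranges are only illustrative assumptions, not the ones tuned later in this notebook. The grid tries just three distinct values per parameter, while random sampling tries a new value of every parameter in each trial.

```python
from scipy import stats
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid: 3 x 3 = 9 trials, but only 3 distinct values per parameter.
grid = list(ParameterGrid({
    "learning_rate": [0.01, 0.1, 0.3],
    "l2_leaf_reg": [0.01, 1.0, 10.0],
}))

# Random: also 9 trials, each drawing a fresh value for every parameter.
sampled = list(ParameterSampler(
    {
        "learning_rate": stats.uniform(0.01, 0.3),
        "l2_leaf_reg": stats.reciprocal(1e-2, 1e1),
    },
    n_iter=9,
    random_state=42,
))

grid_learning_rates = {params["learning_rate"] for params in grid}
sampled_learning_rates = {params["learning_rate"] for params in sampled}
print("Distinct learning rates on the grid:", len(grid_learning_rates))
print("Distinct learning rates when sampling:", len(sampled_learning_rates))
```

If only one of the two parameters matters for the score, the random trials explore it three times more densely for the same budget.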

Comments on this notebook are most welcome!

### TODO
- [ ] More tuning

### Define imports and load data

In [3]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import catboost
import tensorflow as tf  # Just for checking if GPU is available :)

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, cohen_kappa_score, accuracy_score
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, StratifiedKFold, RepeatedStratifiedKFold
from sklearn.ensemble import VotingClassifier
import scipy

%matplotlib inline
plt.rc('figure', figsize=(20.0, 10.0))

GPU_AVAILABLE = tf.test.is_gpu_available()
QUADRATIC_WEIGHT_SCORER = make_scorer(cohen_kappa_score, weights='quadratic')
print("GPU available:", GPU_AVAILABLE)
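To get a feeling for what `QUADRATIC_WEIGHT_SCORER` rewards, here is a small sketch with made-up labels (the arrays are hypothetical and not taken from the competition data). With `weights='quadratic'`, a prediction that misses by one class is penalized far less than one that misses by four classes.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array([0, 1, 2, 3, 4, 4])
near_miss = np.array([0, 1, 2, 3, 4, 3])  # last label is off by one class
far_miss = np.array([0, 1, 2, 3, 4, 0])   # last label is off by four classes

kappa_perfect = cohen_kappa_score(y_true, y_true, weights="quadratic")
kappa_near = cohen_kappa_score(y_true, near_miss, weights="quadratic")
kappa_far = cohen_kappa_score(y_true, far_miss, weights="quadratic")

print("Perfect predictions:", kappa_perfect)
print("One near miss:", kappa_near)
print("One far miss:", kappa_far)
```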

In [4]:
INPUT_DIR = "../input"
print(os.listdir(INPUT_DIR))

## Data description (copied from [competition description](https://www.kaggle.com/c/petfinder-adoption-prediction/data))

<i>
In this competition you will predict the speed at which a pet is adopted, based on the pet’s listing on PetFinder. Sometimes a profile represents a group of pets. In this case, the speed of adoption is determined by the speed at which all of the pets are adopted. The data included text, tabular, and image data. See below for details. 
This is a Kernels-only competition. At the end of the competition, test data will be replaced in their entirety with new data of approximately the same size, and your kernels will be rerun on the new data.

### File descriptions
- train.csv - Tabular/text data for the training set
- test.csv - Tabular/text data for the test set
- sample_submission.csv - A sample submission file in the correct format
- breed_labels.csv - Contains Type, and BreedName for each BreedID. Type 1 is dog, 2 is cat.
- color_labels.csv - Contains ColorName for each ColorID
- state_labels.csv - Contains StateName for each StateID

### Data Fields
- PetID - Unique hash ID of pet profile
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
- Type - Type of animal (1 = Dog, 2 = Cat)
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
- Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
- Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
- AdoptionSpeed - Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:
    - 0 - Pet was adopted on the same day as it was listed.
    - 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
    - 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
    - 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
    - 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

### Images

For pets that have photos, they will be named in the format of PetID-ImageNumber.jpg. Image 1 is the profile (default) photo set for the pet. For privacy purposes, faces, phone numbers and emails have been masked.

### Image Metadata
We have run the images through Google's Vision API, providing analysis on Face Annotation, Label Annotation, Text Annotation and Image Properties. You may optionally utilize this supplementary information for your image analysis.

File name format is PetID-ImageNumber.json.

Some properties will not exist in JSON file if not present, i.e. Face Annotation. Text Annotation has been simplified to just 1 entry of the entire text description (instead of the detailed JSON result broken down by individual characters and words). Phone numbers and emails are already anonymized in Text Annotation.

Google Vision API reference: https://cloud.google.com/vision/docs/reference/rest/v1/images/annotate

### Sentiment Data
We have run each pet profile's description through Google's Natural Language API, providing analysis on sentiment and key entities. You may optionally utilize this supplementary information for your pet description analysis. There are some descriptions that the API could not analyze. As such, there are fewer sentiment files than there are rows in the dataset.

File name format is PetID.json.

Google Natural Language API reference: https://cloud.google.com/natural-language/docs/basics

### What will change in the 2nd stage of the competition?
In the second stage of the competition, we will re-run your selected Kernels. The following files will be swapped with new data:

- test.zip including test.csv and sample_submission.csv
- test_images.zip
- test_metadata.zip
- test_sentiment.zip

In stage 2, all data will be replaced with approximately the same amount of different data. The stage 1 test data will not be available when kernels are rerun in stage 2.
</i>

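Read the training and test CSV files into pandas dataframes, attach the sentiment magnitude, score and language from the sentiment JSON files where they exist, and mark the categorical columns as `category` type.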

In [19]:
import json

def read_sentiment_if_exists(directory: str, pet_id: str):
    path = os.path.join(directory, "{}.json".format(pet_id))
    if not os.path.exists(path):
        return None
    with open(path, 'r') as f:
        return json.load(f)
        
def parse_sentiment(sentiment):
    documentSentiment = sentiment['documentSentiment']
    return { 'SentimentMagnitude': documentSentiment['magnitude'],
             'SentimentScore': documentSentiment['score'],
             'SentimentLanguage': sentiment['language'] }

def read_csv_to_pandas(is_train=True):
    path = os.path.join(INPUT_DIR, 'train', 'train.csv') if is_train else os.path.join(INPUT_DIR, 'test', 'test.csv')
    df = pd.read_csv(path)
    
    def add_sentiment_if_exists(row):
        pet_id = row.get('PetID')
        directory = os.path.join(INPUT_DIR, 'train_sentiment') if is_train else os.path.join(INPUT_DIR, 'test_sentiment')
        sentiment_or_none = read_sentiment_if_exists(directory, pet_id)
        if sentiment_or_none is None:
            # No sentiment file for this pet: fill in placeholder values
            missing = pd.Series({ 'SentimentMagnitude': np.nan, 'SentimentScore': np.nan, 'SentimentLanguage': "unknown" })
            return pd.concat([row, missing])  # Series.append was removed in pandas 2.0
        sentiment_series = pd.Series(parse_sentiment(sentiment_or_none))
        return pd.concat([row, sentiment_series])

    df = df.apply(add_sentiment_if_exists, axis=1, result_type='expand')
    df.Name = df.Name.fillna(value="Unknown")
    return df.astype({
        'Type': 'category',
        'Name': 'category',
        'Breed1': 'category',
        'Breed2': 'category',
        'Gender': 'category',
        'Color1': 'category',
        'Color2': 'category',
        'Color3': 'category',
        'MaturitySize': 'category',
        'FurLength': 'category',
        'Vaccinated': 'category',
        'Dewormed': 'category',
        'Sterilized': 'category',
        'Health': 'category',
        'State': 'category',
        'RescuerID': 'category',
        'SentimentLanguage': 'category'
    })

train_df = read_csv_to_pandas(is_train=True)

X_test = read_csv_to_pandas(is_train=False)
train_df.info()

## Preprocessing

### Split training data into training and validation set
We split `train_df` into two sets: `X_train` is used for cross-validation, and `X_val` is held out until the end of the notebook to estimate the generalization error.

In [24]:
from sklearn.model_selection import train_test_split

def to_features_and_labels(df):
    y = df['AdoptionSpeed'].values
    X = df.drop('AdoptionSpeed', axis=1)
    return X, y

X_train_val, y_train_val = to_features_and_labels(train_df)

X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.20, random_state=42,
                                                  stratify=y_train_val)

print("Shape of X_train:", X_train.shape)
print("Shape of X_val:", X_val.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_val:", y_val.shape)

### Define transformers
We'll use [scikit-learn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to define our data preprocessing transforms. Because CatBoost handles categorical features quite well, the only preprocessing we do is to drop some of the columns and impute missing sentiment values. For that, we'll define the custom transformers `DataFrameColumnDropper` and `ColumnImputer`:

In [25]:
class DataFrameColumnDropper(BaseEstimator, TransformerMixin):
    """
    Drop given columns.
    
    Attributes:
        column_names (List[Str]): List of columns to drop
    """
    def __init__(self, column_names):
        self.column_names = column_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.copy().drop(self.column_names, axis=1)
    
class ColumnImputer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name, imputer):
        self.column_name = column_name
        self.imputer = imputer
    def fit(self, X, y=None):
        values = X[[self.column_name]].values
        self.imputer.fit(values)
        return self
    def transform(self, X):
        Y = X.copy()
        Y.loc[:, self.column_name] = self.imputer.transform(Y[[self.column_name]].values)
        return Y

Build the pipeline:

In [26]:
from sklearn.impute import SimpleImputer

def build_preprocessing_pipeline() -> Pipeline:
     return Pipeline([
        ('impute_sentiment_magnitude', ColumnImputer(column_name='SentimentMagnitude',
                                                     imputer=SimpleImputer(strategy='constant', fill_value=0.0))),
        ('impute_sentiment_score', ColumnImputer(column_name='SentimentScore',
                                                     imputer=SimpleImputer(strategy='constant', fill_value=0.0))),
        ('drop_unused_columns', DataFrameColumnDropper(
            column_names=['PetID', 'Description', 'RescuerID'])
        )
     ])

preprocessing_pipeline = build_preprocessing_pipeline()
X_train_preprocessed = preprocessing_pipeline.fit_transform(X_train, y_train)

X_train_preprocessed.head(20)

### Print the columns:

In [27]:
print("Number of features:", len(list(X_train_preprocessed)))
print("")

print("Numerical columns:", list(X_train_preprocessed.select_dtypes(include="number")))
print("")

print("Categorical columns:", list(X_train_preprocessed.select_dtypes(include="category")))

## CatBoost classifier
We define a small helper class called `CatBoostPandasClassifier` for training `CatBoostClassifier` with Pandas dataframes. The helper class correctly passes the categorical columns as `cat_features` to CatBoostClassifier's `fit` method and ensures that the order of the columns in fitting and prediction phases matches.

In [30]:
from catboost import CatBoostClassifier

class CatBoostPandasClassifier:
    """
    Helper class for training `CatBoostClassifier` with Pandas dataframes. The class
    passes the columns of type "category" as `cat_features` to `CatBoostClassifier`'s fit method and ensures
    that the order of the columns in the fitting and prediction phases matches.
    
    Author: Kimmo Sääskilahti
    """
    def __init__(self, *args, **kwargs):
        self.catboost_classifier = CatBoostClassifier(*args, **kwargs)
        self.columns = None
        
    def fit(self, X, y, *args, **kwargs):
        self.columns = list(X)
        cat_columns = list(X.select_dtypes(include='category'))
        cat_features = [X.columns.get_loc(name) for name in cat_columns]  # Indices of categorical features
        return self.catboost_classifier.fit(X.values, y, *args, cat_features=cat_features, **kwargs)
        
    def copy(self, *args, **kwargs):
        returned_classifier = CatBoostPandasClassifier()
        returned_classifier.catboost_classifier = self.catboost_classifier.copy()
        returned_classifier.columns = self.columns
        return returned_classifier
        
    def predict(self, X, *args, **kwargs):
        X_copy = X.loc[:, self.columns].copy()
        return self.catboost_classifier.predict(X_copy.values, *args, **kwargs)
    
    def predict_proba(self, X, *args, **kwargs):
        X_copy = X.loc[:, self.columns].copy()
        return self.catboost_classifier.predict_proba(X_copy.values, *args, **kwargs)
        
    def __getattr__(self, attr):
        """
        Pass all other method calls to self.catboost_classifier.
        """
        return getattr(self.catboost_classifier, attr)

# Example usage
catboost_pandas_clf = CatBoostPandasClassifier(iterations=10,
                                               learning_rate=0.1,
                                               loss_function='MultiClass',
                                               allow_writing_files=False)
# catboost_pandas_clf.fit(X_train_preprocessed, y_train)
cross_val_score(catboost_pandas_clf, X_train_preprocessed, y_train, cv=5, scoring=QUADRATIC_WEIGHT_SCORER)
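The two bookkeeping steps inside the helper can be seen in isolation on a toy dataframe. This is only a sketch with made-up values: it shows how the positional indices of the `category` columns are derived, and how selecting by the stored column names restores the original order when a dataframe arrives with shuffled columns.

```python
import pandas as pd

# Toy frame with two categorical and two numerical columns (values are made up)
toy = pd.DataFrame({
    "Type": pd.Series([1, 2, 1], dtype="category"),
    "Age": [3, 12, 24],
    "Breed1": pd.Series([307, 266, 307], dtype="category"),
    "Fee": [0, 100, 0],
})

# Names of the categorical columns, then their positional indices
cat_columns = list(toy.select_dtypes(include="category"))
cat_features = [toy.columns.get_loc(name) for name in cat_columns]
print(cat_columns, cat_features)

# A frame with shuffled columns is put back into the order seen during fitting
shuffled = toy[["Fee", "Breed1", "Age", "Type"]]
restored = shuffled.loc[:, list(toy)]
print(list(restored))
```

Positional indices are needed because the helper hands `X.values` to CatBoost, which carries no column names.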

## Parameter tuning

Let us first define a few helper functions to help in parameter tuning:

In [29]:
def build_search(pipeline, param_distributions, cv=5, n_iter=10):
    """
    Builder function for RandomizedSearch.
    """
    return RandomizedSearchCV(pipeline,
                              param_distributions=param_distributions, 
                              cv=cv,
                              return_train_score=True,
                              refit='cohen_kappa_quadratic',
                              n_iter=n_iter,
                              n_jobs=None,
                              scoring={
                                    'accuracy': make_scorer(accuracy_score),
                                    'cohen_kappa_quadratic': QUADRATIC_WEIGHT_SCORER
                              },
                              verbose=1,
                              random_state=42)

def pretty_cv_results(cv_results, 
                      sort_by='rank_test_cohen_kappa_quadratic',
                      sort_ascending=True,
                      n_rows=30):
    """
    Return pretty Pandas dataframe from the `cv_results_` attribute of finished parameter search,
    ranking by test performance and only keeping the columns of interest.
    """
    df = pd.DataFrame(cv_results)
    cols_of_interest = [key for key in df.keys() 
                            if key.startswith('param_') 
                                or key.startswith("mean_train")
                                or key.startswith("std_train")
                                or key.startswith("mean_test")
                                or key.startswith("std_test")
                                or key.startswith('mean_fit_time')
                                or key.startswith('rank')]
    return df.loc[:, cols_of_interest].sort_values(by=sort_by, ascending=sort_ascending).head(n_rows)

def run_search(search):
    search.fit(X_train, y_train)
    print('Best score is:', search.best_score_)
    return pretty_cv_results(search.cv_results_)

### Tune CatBoostPandasClassifier

In [None]:
task_type = "GPU" if GPU_AVAILABLE else "CPU"

def build_catboost_pipeline():
    return Pipeline([
        ('preprocessing', preprocessing_pipeline),
        ('classifier', CatBoostPandasClassifier(verbose=0,
                                                loss_function='MultiClass',
                                                allow_writing_files=False,
                                                task_type=task_type))
    ])

param_distributions = {
    'classifier__iterations': [500, 1000],  # Sets the value of `iterations` to the pipeline step `classifier` in parameter search
    'classifier__learning_rate': scipy.stats.uniform(0.01, 0.3),  # uniform(loc, scale): samples from [0.01, 0.31]
    'classifier__max_depth': scipy.stats.randint(3, 10),  # Integers 3..9, upper bound is exclusive
    'classifier__one_hot_max_size': [30],
    'classifier__l2_leaf_reg': scipy.stats.reciprocal(a=1e-2, b=1e1)  # Log-uniform: values between a and b, uniform in log(value)
}

catboost_search = build_search(build_catboost_pipeline(),
                               param_distributions=param_distributions,
                               n_iter=50,
                               cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=1, random_state=42))
catboost_cv_results = run_search(search=catboost_search)
catboost_cv_results
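The ranges that the `scipy.stats` distributions in `param_distributions` actually cover are easy to misread, so here is a quick sanity check that draws samples from each of them. The sample size and seed are arbitrary choices for this sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(42)
n = 10_000

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale]
learning_rates = stats.uniform(0.01, 0.3).rvs(size=n, random_state=rng)
# randint(low, high) excludes the upper bound
max_depths = stats.randint(3, 10).rvs(size=n, random_state=rng)
# reciprocal(a, b) is log-uniform: log10 of the samples is uniform on [-2, 1]
l2_leaf_regs = stats.reciprocal(a=1e-2, b=1e1).rvs(size=n, random_state=rng)

print("learning_rate range:", learning_rates.min(), learning_rates.max())
print("max_depth values:", sorted(set(max_depths.tolist())))
print("median log10(l2_leaf_reg):", np.median(np.log10(l2_leaf_regs)))
```

Roughly a third of the `l2_leaf_reg` samples fall into each decade, which is usually what one wants for a regularization strength.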

Check the parameters of the best estimator:

In [None]:
best_estimator = catboost_search.best_estimator_
print(best_estimator.named_steps['classifier'].get_params())
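The `classifier__` prefix in the searched parameter names follows scikit-learn's `<step>__<parameter>` convention for nested estimators. The sketch below uses a stand-in pipeline with `StandardScaler` and `LogisticRegression` (assumptions for illustration only, not part of this notebook) to show how such names are routed to the right step.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# "<step name>__<parameter name>" sets a parameter on the named step
toy_pipeline.set_params(classifier__C=0.5, scaler__with_mean=False)

print(toy_pipeline.named_steps["classifier"].C)
print(toy_pipeline.get_params()["scaler__with_mean"])
```

`RandomizedSearchCV` calls `set_params` in the same way with each sampled parameter combination.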

Create a list of the estimators along with their scores:

In [None]:
all_estimators = [build_catboost_pipeline().set_params(**params) for params in catboost_search.cv_results_['params']]

scores_and_estimators = list(zip(catboost_search.cv_results_['mean_test_cohen_kappa_quadratic'], all_estimators))

scores_and_estimators.sort(key=lambda x: x[0], reverse=True)

scores_and_estimators[:3]

Create a voting classifier out of the best estimators:

In [None]:
N_TOP_ESTIMATORS = 5

best_estimators = scores_and_estimators[:N_TOP_ESTIMATORS]
voting_classifier = VotingClassifier([(str(score), estimator) for score, estimator in best_estimators], voting='soft')
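With `voting='soft'` and no weights, the ensemble averages the class probabilities of its members and predicts the class with the highest mean probability. The following sketch reproduces that rule with plain NumPy on hypothetical `predict_proba` outputs. The numbers are made up to show a case where the members disagree.

```python
import numpy as np

# Hypothetical predict_proba outputs: 3 models x 2 samples x 3 classes
probas = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]],
    [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],
])

mean_proba = probas.mean(axis=0)              # average over the models
soft_votes = mean_proba.argmax(axis=1)        # class with the highest mean probability
hard_votes_per_model = probas.argmax(axis=2)  # what each model would predict alone

print("Mean probabilities:\n", mean_proba)
print("Soft-voting predictions:", soft_votes)
print("Individual predictions:\n", hard_votes_per_model)
```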

In [None]:
best_estimator = catboost_search.best_estimator_
y_val_pred_best_estimator = best_estimator.predict(X_val)

voting_classifier.fit(X_train, y_train)
y_val_pred_voting_classifier = voting_classifier.predict(X_val)

print("Performance of best estimator on the hold-out set:",
      cohen_kappa_score(y_val, y_val_pred_best_estimator, weights='quadratic'))
print("Performance of voting classifier on the hold-out set:",
      cohen_kappa_score(y_val, y_val_pred_voting_classifier, weights='quadratic'))

Clearly there's a big discrepancy between the cross-validated score (`mean_test_cohen_kappa_quadratic`) and the quadratic weighted kappa on the hold-out set. Any advice on how to bridge the gap (reduce the overfitting to the training set) is welcome :)

### Print feature importances for the best estimator

In [None]:
column_names = best_estimator.named_steps['classifier'].columns
feature_importances = best_estimator.named_steps['classifier'].feature_importances_
print("{} columns, {} feature importances.".format(len(column_names), len(feature_importances)))

features_and_importances = list(zip(column_names, feature_importances))

features_and_importances.sort(key=lambda x: x[1], reverse=True)

for name, feature_importance in features_and_importances:
    print("{} -> {}".format(name, feature_importance))

### Train the final estimator with all data available
Finally, we'll train the voting classifier with all the data available and write our `submission.csv`. Of course, much better performance could be achieved by ensembling and/or blending more diverse model families.

In [None]:
voting_classifier.fit(X_train_val, y_train_val)

Compute predictions for the test set:

In [None]:
def get_predictions(estimator, X):
    predictions = estimator.predict(X).astype(np.int32)
    predictions = np.squeeze(predictions)  # Estimators may return arrays for each prediction
    indices = X.loc[:, 'PetID']
    as_dict = [{'PetID': index, 'AdoptionSpeed': prediction} for index, prediction in zip(indices, predictions)]
    df = pd.DataFrame.from_dict(as_dict)
    df = df.reindex(['PetID', 'AdoptionSpeed'], axis=1)
    return df

# Use the voting classifier that was just refit on all the training data
predictions = get_predictions(voting_classifier, X=X_test)

Write `submission.csv`:

In [None]:
def write_submission(predictions):
    submission_folder = '.'
    dest_file = os.path.join(submission_folder, 'submission.csv')
    predictions.to_csv(dest_file, index=False)
    print("Wrote to {}".format(dest_file))
    
write_submission(predictions)