# Pawpularity Prediction: Comparison of Five Regression Models (Random Forest, AdaBoost, Gradient Boosting, XGBoost, Voting)

## Task Description
1. Accurately determine a pet photo’s appeal and even suggest improvements to give rescue animals a higher chance of loving homes. Analyze raw images and metadata to predict the “Pawpularity” of pet photos.
2. Submissions are scored on the root mean squared error (RMSE).
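
As a quick illustration of the metric, RMSE is the square root of the mean squared difference between true and predicted scores. The sketch below uses made-up numbers (not competition data) and a hypothetical helper named `rmse`. It also shows that predicting the mean for every photo gives an RMSE equal to the standard deviation of the target, which is a handy baseline to beat.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only (not taken from the dataset)
y_true = np.array([30.0, 50.0, 70.0])
y_pred = np.array([40.0, 50.0, 60.0])
print("RMSE:", rmse(y_true, y_pred))

# Constant baseline: predicting the mean everywhere gives RMSE == std of y_true
baseline = np.full_like(y_true, y_true.mean())
print("Baseline RMSE:", rmse(y_true, baseline), "Std:", y_true.std())
```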

## Purpose of this Notebook
This notebook builds simple regression models on the PetFinder Pawpularity dataset (image metadata only) and compares their results. It does not include EDA, feature extraction, cross-validation, or advanced image models. It is meant for beginners.
This is a work in progress.


In [None]:
# import libraries
import numpy as np
import pandas as pd

In [None]:
# constants
DATA_DIR = "../input/petfinder-pawpularity-score/"

## Import Dataset

In [None]:
# read image metadata
train_meta_df = pd.read_csv(DATA_DIR + 'train.csv')
test_meta_df = pd.read_csv(DATA_DIR + 'test.csv')

print("Train Metadata Shape: ", train_meta_df.shape)
print("Test Metadata Shape: ", test_meta_df.shape)

In [None]:
train_meta_df.dtypes

In [None]:
train_meta_df.head()

In [None]:
test_meta_df.head()

In [None]:
# Read sample submission file
submission = pd.read_csv(DATA_DIR + 'sample_submission.csv')
test_meta_sample_y = submission["Pawpularity"]
submission.head()

In [None]:
# Prepare X and y variables for training
X = train_meta_df.drop(columns=["Id","Pawpularity"])
y = train_meta_df["Pawpularity"]
# Copy training data to save training data predictions for comparing different models
train_meta_df_pred = train_meta_df.copy()

In [None]:
# Import library for modeling
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## Building Regression Models
### 1. Random Forest Model

In [None]:
# Import Random Forest regressor from sklearn
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Building Random Forest regressor model
rf_model = RandomForestRegressor(
    n_estimators=300, 
    max_depth=15,
    random_state=100,
    verbose=2)
# Training
rf_model.fit(X, y)

In [None]:
# Random Forest prediction
rf_train_y_pred = rf_model.predict(X)
rf_test_y_pred = rf_model.predict(test_meta_df.drop(columns=["Id"]))

In [None]:
train_meta_df_pred["RFTrainPred"] = rf_train_y_pred
train_meta_df_pred[["Id", "Pawpularity", "RFTrainPred"]]

In [None]:
# Evaluation on training and testing data
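# np.sqrt of the MSE works across scikit-learn versions
# (the squared=False argument is not available in newer releases)
rf_train_RMSE = np.sqrt(mean_squared_error(train_meta_df["Pawpularity"], rf_train_y_pred))
# The "test" RMSE compares against the sample submission values, which are
# not true labels, so it is not a meaningful score
rf_test_RMSE = np.sqrt(mean_squared_error(test_meta_sample_y, rf_test_y_pred))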
print("RF Train RMSE: ", rf_train_RMSE)
print("RF Test RMSE: ", rf_test_RMSE)
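
Because the training RMSE is measured on data the model has already seen, and the test RMSE above is not computed against true labels, a hold-out split usually gives a more honest estimate. The following is a minimal sketch on synthetic data, assuming 12 binary metadata flags and scores between 1 and 100. The names `X_demo`, `y_demo`, and `demo_rf` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the metadata (assumed shape: 12 binary flags)
rng = np.random.default_rng(100)
X_demo = rng.integers(0, 2, size=(500, 12))
y_demo = rng.integers(1, 101, size=500).astype(float)

# Hold out 30% of the rows; the model never sees them during fitting
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100)

demo_rf = RandomForestRegressor(n_estimators=50, max_depth=15, random_state=100)
demo_rf.fit(X_tr, y_tr)

train_rmse = float(np.sqrt(mean_squared_error(y_tr, demo_rf.predict(X_tr))))
val_rmse = float(np.sqrt(mean_squared_error(y_va, demo_rf.predict(X_va))))
print("Train RMSE:", train_rmse)
print("Val RMSE:  ", val_rmse)
```

On random targets like these, the validation RMSE is typically well above the training RMSE, which is the gap a training-only evaluation hides.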

### 2. AdaBoost Model

In [None]:
# Import AdaBoost regressor from sklearn
from sklearn.ensemble import AdaBoostRegressor

In [None]:
# Building AdaBoost regressor model
adab_model = AdaBoostRegressor(
    n_estimators=10,
    learning_rate=0.0001,
    loss='square',
    random_state=100)
# Training
adab_model.fit(X, y)

In [None]:
# AdaBoost Prediction
adab_train_y_pred = adab_model.predict(X)
adab_test_y_pred = adab_model.predict(test_meta_df.drop(columns=["Id"]))

In [None]:
train_meta_df_pred["ADABTrainPred"] = adab_train_y_pred
train_meta_df_pred[["Id", "Pawpularity", "RFTrainPred", "ADABTrainPred"]]

In [None]:
# Evaluation on training and testing data
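# np.sqrt of the MSE works across scikit-learn versions
# (the squared=False argument is not available in newer releases)
adab_train_RMSE = np.sqrt(mean_squared_error(train_meta_df["Pawpularity"], adab_train_y_pred))
# The "test" RMSE compares against the sample submission values, which are
# not true labels, so it is not a meaningful score
adab_test_RMSE = np.sqrt(mean_squared_error(test_meta_sample_y, adab_test_y_pred))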
print("ADAB Train RMSE: ", adab_train_RMSE)
print("ADAB Test RMSE: ", adab_test_RMSE)

### 3. Gradient Boosting Model

In [None]:
# Import GradientBoosting regressor from sklearn
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
# Building GradientBoosting regressor model
gb_model = GradientBoostingRegressor(
    n_estimators=30,
    learning_rate=0.01,
    verbose=2,
    random_state=100)
# Training
gb_model.fit(X, y)

In [None]:
# GradientBoosting Prediction
gb_train_y_pred = gb_model.predict(X)
gb_test_y_pred = gb_model.predict(test_meta_df.drop(columns=["Id"]))

In [None]:
train_meta_df_pred["GBTrainPred"] = gb_train_y_pred
train_meta_df_pred[["Id", "Pawpularity", "RFTrainPred", "ADABTrainPred", "GBTrainPred"]]

In [None]:
# Evaluation on training and testing data
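# np.sqrt of the MSE works across scikit-learn versions
# (the squared=False argument is not available in newer releases)
gb_train_RMSE = np.sqrt(mean_squared_error(train_meta_df["Pawpularity"], gb_train_y_pred))
# The "test" RMSE compares against the sample submission values, which are
# not true labels, so it is not a meaningful score
gb_test_RMSE = np.sqrt(mean_squared_error(test_meta_sample_y, gb_test_y_pred))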
print("GB Train RMSE: ", gb_train_RMSE)
print("GB Test RMSE: ", gb_test_RMSE)
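
With a small learning rate and few trees, a gradient boosting model tends to stay close to the mean of the training target. One way to see how the error evolves is scikit-learn's `staged_predict`, which yields the predictions after each boosting stage. The sketch below runs on synthetic data with an assumed weak signal in the first feature; `X_demo`, `y_demo`, and `demo_gb` are illustrative names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(100)
X_demo = rng.integers(0, 2, size=(400, 12))
# Synthetic target: weak dependence on the first flag plus noise
y_demo = 35.0 + 10.0 * X_demo[:, 0] + rng.normal(0.0, 15.0, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100)

demo_gb = GradientBoostingRegressor(
    n_estimators=30, learning_rate=0.01, random_state=100)
demo_gb.fit(X_tr, y_tr)

# Validation RMSE after each boosting stage (1 tree, 2 trees, ...)
val_rmse_per_stage = [
    float(np.sqrt(np.mean((y_va - stage_pred) ** 2)))
    for stage_pred in demo_gb.staged_predict(X_va)
]
best_stage = int(np.argmin(val_rmse_per_stage)) + 1
print("Stages evaluated:", len(val_rmse_per_stage))
print("Best stage:", best_stage, "RMSE:", val_rmse_per_stage[best_stage - 1])
```

If the best stage is the last one, the model is probably still underfitting, and more trees or a larger learning rate may be worth trying.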

### 4. XGBoost Model

In [None]:
# Import the XGBoost library (a separate package that provides a scikit-learn-compatible regressor)
import xgboost as xgb

In [None]:
# Split training data into training and validation data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=100)

print("Train Metadata Shape: ", X_train.shape)
print("Val Metadata Shape: ", X_val.shape)

In [None]:
# Set XGBoost parameters
xgb_params_dist = {
    "n_estimators": 350,
    "learning_rate": 0.01,
    "verbosity" : 2,
    "subsample": 0.3,
    "seed": 100,
}
# Building XGBoost Regressor Model
xgb_model = xgb.XGBRegressor(**xgb_params_dist)
# Training
xgb_model.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
)

In [None]:
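# XGBoost Prediction
# Note: X contains both the training and validation rows, so the "train"
# predictions below mix rows the model was fit on with held-out rows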
xgb_train_y_pred = xgb_model.predict(X)
xgb_test_y_pred = xgb_model.predict(test_meta_df.drop(columns=["Id"]))

In [None]:
train_meta_df_pred["XGBTrainPred"] = xgb_train_y_pred
train_meta_df_pred[["Id", "Pawpularity", "RFTrainPred", "ADABTrainPred", "GBTrainPred", "XGBTrainPred"]]

In [None]:
# Evaluation on training and testing data
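# np.sqrt of the MSE works across scikit-learn versions
# (the squared=False argument is not available in newer releases)
xgb_train_RMSE = np.sqrt(mean_squared_error(train_meta_df["Pawpularity"], xgb_train_y_pred))
# The "test" RMSE compares against the sample submission values, which are
# not true labels, so it is not a meaningful score
xgb_test_RMSE = np.sqrt(mean_squared_error(test_meta_sample_y, xgb_test_y_pred))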
print("XGB Train RMSE: ", xgb_train_RMSE)
print("XGB Test RMSE: ", xgb_test_RMSE)

### 5. Voting Regressor Model

In [None]:
# Import Voting regressor from sklearn
from sklearn.ensemble import VotingRegressor

In [None]:
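# Building Voting regressor model
# Note: VotingRegressor fits fresh clones of the listed estimators on (X, y),
# so every member (including XGBoost) is refit on the full training data here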
vr_model = VotingRegressor([('rf', rf_model),
                            ('adab', adab_model),
                            ('gb', gb_model),
                            ('xgb', xgb_model)],
                           verbose=True)
# Training
vr_model.fit(X, y)

In [None]:
# Voting prediction
vr_train_y_pred = vr_model.predict(X)
vr_test_y_pred = vr_model.predict(test_meta_df.drop(columns=["Id"]))

In [None]:
train_meta_df_pred["VRTrainPred"] = vr_train_y_pred
train_meta_df_pred[["Id", "Pawpularity", "RFTrainPred", "ADABTrainPred", "GBTrainPred", "XGBTrainPred","VRTrainPred"]]

In [None]:
# Evaluation on training and testing data
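# np.sqrt of the MSE works across scikit-learn versions
# (the squared=False argument is not available in newer releases)
vr_train_RMSE = np.sqrt(mean_squared_error(train_meta_df["Pawpularity"], vr_train_y_pred))
# The "test" RMSE compares against the sample submission values, which are
# not true labels, so it is not a meaningful score
vr_test_RMSE = np.sqrt(mean_squared_error(test_meta_sample_y, vr_test_y_pred))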
print("VR Train RMSE: ", vr_train_RMSE)
print("VR Test RMSE: ", vr_test_RMSE)
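
With default (uniform) weights, a voting regressor predicts the plain average of its members' predictions. The sketch below checks this on synthetic data with two members only. The data and the names `demo_vr`, `member_preds`, and `manual_average` are illustrative.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)

rng = np.random.default_rng(100)
X_demo = rng.integers(0, 2, size=(200, 12))
y_demo = rng.integers(1, 101, size=200).astype(float)

demo_vr = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=20, random_state=100)),
    ("gb", GradientBoostingRegressor(n_estimators=20, random_state=100)),
])
demo_vr.fit(X_demo, y_demo)

# estimators_ holds the fitted clones that the ensemble actually uses
member_preds = np.column_stack(
    [est.predict(X_demo) for est in demo_vr.estimators_])
manual_average = member_preds.mean(axis=1)

# Compare the ensemble output with the hand-computed average
matches = bool(np.allclose(demo_vr.predict(X_demo), manual_average))
print("Voting equals the member average:", matches)
```

Since the ensemble is an average, its error is usually dominated by how different the members are from each other. Members that make nearly identical predictions add little.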

## RMSE Comparison

In [None]:
# The "test" RMSE columns compare against the sample submission values, which
# are not true labels, so only the training columns are meaningful here
rmse_comparison = pd.DataFrame(np.array([[rf_train_RMSE,rf_test_RMSE,
                                          adab_train_RMSE,adab_test_RMSE,
                                          gb_train_RMSE,gb_test_RMSE,
                                          xgb_train_RMSE,xgb_test_RMSE,
                                          vr_train_RMSE,vr_test_RMSE]]), 
                               columns=["RFTrain","RFTest",
                                        "ADABTrain","ADABTest",
                                        "GBTrain","GBTest",
                                        "XGBTrain","XGBTest",
                                        "VRTrain","VRTest"])
rmse_comparison.head()
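
A single wide row can be hard to scan. An alternative layout puts one model per row with train and test columns side by side. The sketch below uses placeholder numbers purely to show the shape. In the notebook, the tuples would hold the `*_train_RMSE` and `*_test_RMSE` variables instead, and `demo_scores` is an illustrative name.

```python
import pandas as pd

# Placeholder values for illustration only (NOT real results)
demo_scores = {
    "RandomForest": (3.0, 30.0),
    "AdaBoost": (5.0, 50.0),
    "GradientBoosting": (4.0, 40.0),
    "XGBoost": (2.0, 20.0),
    "Voting": (1.0, 10.0),
}

# One row per model, one column per data split
comparison = pd.DataFrame.from_dict(
    demo_scores, orient="index", columns=["TrainRMSE", "TestRMSE"])
comparison.index.name = "Model"

# Sort so the lowest training error appears first
comparison = comparison.sort_values("TrainRMSE")
print(comparison)
```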

## Submission

In [None]:
# Submission dataframe
submission['Id'] = test_meta_df["Id"]
# Training RMSE is similar for all regressors.
# Testing RMSE shows variations, but it cannot be trusted because it is
# based on the sample submission data. After evaluating all models
# on the real Kaggle test data, all models have similar
# test RMSE values with only minor differences.
# Any model can be used for the test predictions; the AdaBoost model is used here.
submission['Pawpularity'] = adab_test_y_pred
print("Submission Shape:", submission.shape)
submission.head(8)

In [None]:
# Submission File
submission.to_csv('submission.csv', index=False)
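
Before writing the file, it can help to run a few sanity checks on the submission frame. The sketch below uses a hypothetical helper, `check_submission`, on a toy frame. It assumes the expected columns are `Id` and `Pawpularity` (as in the sample submission) and that valid scores lie between 1 and 100.

```python
import numpy as np
import pandas as pd

def check_submission(sub, expected_ids):
    """Return a cleaned copy of `sub` after basic sanity checks (hypothetical helper)."""
    assert list(sub.columns) == ["Id", "Pawpularity"], "unexpected columns"
    assert sub["Id"].tolist() == list(expected_ids), "Ids do not match the test set"
    assert sub["Pawpularity"].notna().all(), "missing predictions"
    cleaned = sub.copy()
    # Assumption: valid scores lie in [1, 100], so clip anything outside
    cleaned["Pawpularity"] = np.clip(cleaned["Pawpularity"], 1, 100)
    return cleaned

# Toy frame with two out-of-range predictions
demo_sub = pd.DataFrame({"Id": ["a", "b", "c"],
                         "Pawpularity": [0.5, 42.0, 120.0]})
cleaned_sub = check_submission(demo_sub, ["a", "b", "c"])
print(cleaned_sub)
```

In the notebook, the equivalent call would pass `submission` and `test_meta_df["Id"]`.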

Thank you for going through this notebook. I will try to improve the results in upcoming notebooks using cross-validation, feature extraction, and advanced models. Stay tuned!