# Description

PetFinder.my is Malaysia’s leading animal welfare platform, featuring over 180,000 animals with 54,000 happily adopted. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.

Currently, PetFinder.my uses a basic Cuteness Meter to rank pet photos. It analyzes picture composition and other factors compared to the performance of thousands of pet profiles. While this basic tool is helpful, it's still in an experimental stage and the algorithm could be improved.

### Features

* Focus - Pet stands out against uncluttered background, not too close / far.
* Eyes - Both eyes are facing front or near-front, with at least 1 eye / pupil decently clear.
* Face - Decently clear face, facing front or near-front.
* Near - Single pet taking up significant portion of photo (roughly over 50% of photo width or height).
* Action - Pet in the middle of an action (e.g., jumping).
* Accessory - Accompanying physical or digital accessory / prop (e.g., toy, digital sticker), excluding collar and leash.
* Group - More than 1 pet in the photo.
* Collage - Digitally-retouched photo (e.g., with digital photo frame, combination of multiple photos).
* Human - Human in the photo.
* Occlusion - Specific undesirable objects blocking part of the pet (e.g., human, cage or fence). Note that not all blocking objects are considered occlusion.
* Info - Custom-added text or labels (e.g., pet name, description).
* Blur - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0.

### Goal

Our goal is to predict the Pawpularity value from the provided metadata. We first use EDA to check whether the metadata carries enough signal to make this feasible.

# Step 1. Importing libraries

In [None]:
import numpy as np
import pandas as pd
import os
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_squared_error

# Step 2. EDA

In [None]:
df = pd.read_csv('../input/petfinder-pawpularity-score/train.csv')

In [None]:
print('DataFrame shape:', df.shape)
df.head()

In [None]:
df.describe()

As we can see, the dataset looks clean and contains no obviously anomalous values.
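
The visual check of `df.describe()` can be backed by a few explicit assertions. The sketch below is a minimal example on a tiny hand-made table, not on the real `train.csv`; the helper name `check_metadata`, the toy values, and the assumed target range of 1 to 100 are illustrative assumptions.

```python
import pandas as pd

def check_metadata(frame, target="Pawpularity", id_col="Id"):
    """Return a dict of basic sanity checks for a Pawpularity-style table."""
    features = [c for c in frame.columns if c not in (id_col, target)]
    return {
        # Total number of missing cells in the whole table.
        "missing_values": int(frame.isna().sum().sum()),
        # Feature columns that contain anything other than 0 or 1.
        "non_binary_features": [c for c in features
                                if not frame[c].isin([0, 1]).all()],
        # Assumption: the target is expected to lie in [1, 100].
        "target_in_range": bool(frame[target].between(1, 100).all()),
    }

# Tiny hand-made stand-in for train.csv (values are illustrative only).
toy = pd.DataFrame({
    "Id": ["a", "b", "c", "d"],
    "Eyes": [1, 1, 0, 1],
    "Face": [1, 0, 1, 2],   # the 2 is a deliberately invalid flag
    "Pawpularity": [35, 100, 1, 60],
})
report = check_metadata(toy)
print(report)
```

On the notebook's data the same call would be `check_metadata(df)`; an empty `non_binary_features` list and zero missing values would support the statement above.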

In [None]:
sns.set(rc={'figure.figsize':(15,15), "lines.linewidth": 2.5})
sns.set_style("white")
f, axes = plt.subplots(3, 3)
sns.boxplot(data=df, x='Eyes', y='Pawpularity', ax=axes[0, 0])
sns.boxplot(data=df, x='Face', y='Pawpularity', ax=axes[0, 1])
sns.boxplot(data=df, x='Near', y='Pawpularity', ax=axes[0, 2])
sns.boxplot(data=df, x='Action', y='Pawpularity', ax=axes[1, 0])
sns.boxplot(data=df, x='Group', y='Pawpularity', ax=axes[1, 1])  # was a duplicate of 'Face'
sns.boxplot(data=df, x='Accessory', y='Pawpularity', ax=axes[1, 2])
sns.boxplot(data=df, x='Collage', y='Pawpularity', ax=axes[2, 0])
sns.boxplot(data=df, x='Human', y='Pawpularity', ax=axes[2, 1])
sns.boxplot(data=df, x='Occlusion', y='Pawpularity', ax=axes[2, 2])
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.show()

After plotting a box-and-whisker diagram for each feature, we can conclude that a reliable forecast will be difficult: in every case, the distribution of Pawpularity is almost the same whether or not the feature is present.
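
The impression from the box plots can also be put into numbers. The sketch below computes, for each 0/1 feature, the mean Pawpularity of rows with the flag set minus the mean of rows without it. The helper name `mean_gap_by_flag` and the toy table are assumptions for illustration; the numbers say nothing about the real data.

```python
import pandas as pd

def mean_gap_by_flag(frame, features, target="Pawpularity"):
    """For each 0/1 feature: mean target where flag == 1 minus mean where flag == 0."""
    gaps = {}
    for col in features:
        means = frame.groupby(col)[target].mean()
        # .get returns NaN when one of the two groups is absent.
        gaps[col] = means.get(1, float("nan")) - means.get(0, float("nan"))
    # Largest absolute gap first.
    return pd.Series(gaps).sort_values(key=abs, ascending=False)

# Toy stand-in data (illustrative values only).
toy = pd.DataFrame({
    "Eyes": [1, 1, 0, 0],
    "Blur": [0, 1, 0, 1],
    "Pawpularity": [40, 30, 38, 28],
})
gaps = mean_gap_by_flag(toy, ["Eyes", "Blur"])
print(gaps)
```

With the notebook's frame, `mean_gap_by_flag(df, df.columns[1:-1])` would give one gap per metadata column; gaps that are small relative to the spread of Pawpularity would be consistent with the conclusion drawn from the plots.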

In [None]:
sns.set(rc={'figure.figsize':(10,5), "lines.linewidth": 2.5})
# histplot with kde=True replaces the deprecated distplot.
sns.histplot(df["Pawpularity"], kde=True)

As you can see, the histogram spikes at both ends of the range. I am inclined to discard (or normalize) these values, because such extremes often do not reflect an objective picture of the situation.
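
Before discarding the boundary values, it helps to know how much data that removes. The sketch below computes the dropped fraction for the thresholds used in the filtering cell that follows (scores of 3 or less, and scores of 100); the helper name `boundary_share` and the toy scores are assumptions for illustration.

```python
import numpy as np

def boundary_share(scores, low=3, high=100):
    """Fraction of scores that the filter `low < score < high` would discard."""
    scores = np.asarray(scores)
    dropped = (scores <= low) | (scores >= high)
    return dropped.mean()

# Toy scores (illustrative only): four of the eight sit at the borders.
toy_scores = [1, 2, 25, 38, 40, 57, 100, 100]
share = boundary_share(toy_scores)
print(f"{share:.1%} of the toy scores would be dropped")
```

On the real data the call would be `boundary_share(df["Pawpularity"])`. Note that the test set may still contain such extreme scores, so a model trained without them cannot learn to predict them.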

In [None]:
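# Keep only scores strictly between 3 and 100 (drops the spikes at the borders).
df = df.loc[(df["Pawpularity"] < 100) & (df["Pawpularity"] > 3)]
X = df.iloc[:, 1:-1]  # features: every column except Id (first) and Pawpularity (last)
y = df.iloc[:, -1]    # target: Pawpularity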

# Step 3. Model

Let's split the data into training and test sets in an 80/20 ratio.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We'll use an XGBoost regressor with parameters found beforehand with GridSearchCV.

In [None]:
model = XGBRegressor(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    n_jobs=4,  # replaces the deprecated nthread
    scale_pos_weight=1,
    random_state=42,  # replaces the deprecated seed
)
model.fit(X_train, y_train)

In [None]:
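# random_state only takes effect with shuffle=True (recent scikit-learn raises an error otherwise).
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
results = cross_val_score(model, X_train, y_train, cv=kfold)  # default scoring for regressors: R^2
print(results.mean(), results.std())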

In [None]:
y_test_pred = model.predict(X_test)
# Take the square root explicitly: this value is RMSE, not MSE.
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
rmse

For comparison, we'll use a simpler algorithm (DecisionTreeRegressor).

In [None]:
model_dtrgr = DecisionTreeRegressor()
model_dtrgr.fit(X_train, y_train)

In [None]:
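kfold_dtrgr = KFold(n_splits=10, shuffle=True, random_state=42)
# Score the decision tree itself, not the XGBoost model.
results_dtrgr = cross_val_score(model_dtrgr, X_train, y_train, cv=kfold_dtrgr)
print(results_dtrgr.mean(), results_dtrgr.std())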

In [None]:
y_test_pred_dtrgr = model_dtrgr.predict(X_test)
# Take the square root explicitly: this value is RMSE, not MSE.
rmse_dtrgr = np.sqrt(mean_squared_error(y_test, y_test_pred_dtrgr))
rmse_dtrgr

As we can see, the more complex method did not improve RMSE significantly. This suggests that the metadata alone is not enough: we need a dataset with other features, or we need to apply computer vision to the images themselves.
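
One way to put both RMSE values in context is to compare them with a constant baseline that predicts the training mean for every row. The sketch below shows the idea on toy numbers; the helper names `rmse` and `mean_baseline_rmse` and the toy values are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mean_baseline_rmse(y_train, y_test):
    """RMSE obtained by predicting the training mean for every test row."""
    constant = float(np.mean(y_train))
    return rmse(y_test, np.full(len(y_test), constant))

# Toy targets (illustrative only): the training mean is 40.
toy_train = [30, 40, 50]
toy_test = [37, 43]
baseline = mean_baseline_rmse(toy_train, toy_test)
print(baseline)
```

In the notebook the call would be `mean_baseline_rmse(y_train, y_test)`. If neither model beats this number by a clear margin, the metadata adds little beyond the average score.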

Let's create the output file:

In [None]:
df_test = pd.read_csv('../input/petfinder-pawpularity-score/test.csv')

In [None]:
# Build the frame column by column so Pawpularity stays numeric instead of being cast to str.
output = pd.DataFrame({'Id': df_test['Id'], 'Pawpularity': model.predict(df_test.iloc[:, 1:])})
output.to_csv('submission.csv', encoding='utf-8', index=False)

In [None]:
output