# Pet Popularity Prediction Need

Thousands of animals are euthanized in shelters each day, many of them perfectly healthy.
One way to increase an animal's chance of adoption is to improve the quality of its photo.

This project goes a step further: it looks not only at the quality of a photo but also at a range of other variables, and uses them to train machine learning models that predict the popularity scores of future photos.


# Data Description

The dataset contains 9,912 pet photos from PetFinder.my. The photos show cats and
dogs in a variety of poses and backgrounds. The dataset also contains manually labeled
metadata for each photo, with flags such as whether the pet is in proper focus, is in the middle of an action, or takes up
a significant portion of the photo. Each flag has a value of 1 for yes and 0 for no.

A "Pawpularity" score is also given with each photo in the training set.
This score reflects how much user engagement each photo received, and it is the
value this project predicts.

## Photo Metadata
Each pet photo is labeled with the value of 1 (Yes) or 0 (No) for each of the following features:

Subject Focus - Pet stands out against uncluttered background, not too close / far.

Eyes - Both eyes are facing front or near-front, with at least 1 eye / pupil decently clear.

Face - Decently clear face, facing front or near-front.

Near - Single pet taking up significant portion of photo (roughly over 50% of photo width or height).

Action - Pet in the middle of an action (e.g., jumping).

Accessory - Accompanying physical or digital accessory / prop (e.g. toy, digital sticker), excluding collar and leash.

Group - More than 1 pet in the photo.

Collage - Digitally retouched photo (e.g. with digital photo frame, combination of multiple photos).

Human - Human in the photo.

Occlusion - Specific undesirable objects blocking part of the pet (e.g. human, cage or fence). Note that not all blocking objects are considered occlusion.

Info - Custom-added text or labels (e.g. pet name, description).

Blur - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0.
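
The rule stated in the Blur entry can be checked once the metadata is loaded. The sketch below is only an illustration: it uses a few hypothetical rows stored as plain dictionaries rather than the real CSV, and the helper names `find_non_binary` and `find_blur_eye_conflicts` are made up for this example.

```python
# Hypothetical metadata rows; the real values come from the competition CSV.
rows = [
    {"Id": "a", "Eyes": 1, "Blur": 0},
    {"Id": "b", "Eyes": 0, "Blur": 1},
    {"Id": "c", "Eyes": 1, "Blur": 1},  # violates the Blur -> Eyes == 0 rule
]

BINARY_COLUMNS = ("Eyes", "Blur")


def find_non_binary(rows, columns=BINARY_COLUMNS):
    """Return Ids of rows holding a value other than 0 or 1 in any flag column."""
    return [r["Id"] for r in rows if any(r[c] not in (0, 1) for c in columns)]


def find_blur_eye_conflicts(rows):
    """Return Ids of rows marked blurry whose Eyes flag is not 0."""
    return [r["Id"] for r in rows if r["Blur"] == 1 and r["Eyes"] != 0]


print(find_non_binary(rows))          # []
print(find_blur_eye_conflicts(rows))  # ['c']
```

An empty result from both checks on the real data would confirm that the flags follow the description above.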


In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import math
from tqdm.notebook import tqdm
import imageio
import torch
import matplotlib.patches as patches
import os 
import matplotlib.image as img
import warnings

warnings.filterwarnings('ignore')

#load sklearn models and metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn import svm
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

In [None]:
train = pd.read_csv('../input/train-data/save_data.csv')

In [None]:
#!git clone https://github.com/ultralytics/yolov5
# Downloads YOLOv5 (requires internet access)

# Print out top/low scoring pictures

In [None]:
train_image = '../input/petfinder-pawpularity-score/train'
# Ids of the six highest-scoring photos
top_images = train.sort_values(by='Pawpularity', ascending=False)['Id'][:6]

fig = plt.figure(figsize=(30, 30))
fig.suptitle('Top 6 Pawpularity Score', fontsize=80)
for i, image_id in enumerate(top_images):
    image = img.imread(os.path.join(train_image, image_id + '.jpg'))
    fig.add_subplot(2, 3, i + 1)
    plt.imshow(image)

plt.show()



In [None]:
# Ids of the six lowest-scoring photos
worst_images = train.sort_values(by='Pawpularity', ascending=True)['Id'][:6]

fig = plt.figure(figsize=(30, 30))
fig.suptitle('Bottom 6 Pawpularity Score', fontsize=80)
for i, image_id in enumerate(worst_images):
    image = img.imread(os.path.join(train_image, image_id + '.jpg'))
    fig.add_subplot(2, 3, i + 1)
    plt.imshow(image)

plt.show()

## Quick glance
A look at the top- and bottom-scoring pictures shows that some of the lower-scoring pictures tend to be blurrier, with the animal further in the background and more "clutter" in the frame. For many of the other pictures, however, it is hard to discern any meaningful difference between the top- and bottom-scoring groups.

# Load YOLOv5 model
YOLOv5 is applied to extract additional features from the given pictures.

Note: the code below was run once and its results were saved. The YOLOv5 code was then commented out to shorten the running time when making later changes or adjustments.

In [None]:

#!cp -R '../input/torch-hub/torch/root/.cache/torch' '/root/.cache/torch'

#!cp -R '../input/torch-hub/ultralytics/root/.config/Ultralytics' '/root/.config/Ultralytics'
# yolov5x6_model = torch.hub.load('ultralytics/yolov5', 'yolov5x6')


In [None]:
# #Find our image file and append that to our training dataset

# def get_image_file_path(image_id):
#     return f'../input/petfinder-pawpularity-score/train/{image_id}.jpg'


# train['file_path'] = train['Id'].apply(get_image_file_path)

In [None]:
# widths = []
# heights = []
# ratios = []
# for file_path in (train['file_path']):
#     image = imageio.imread(file_path)
#     h, w, _ = image.shape
#     heights.append(h)
#     widths.append(w)
#     ratios.append(w / h)

In [None]:
# # Get Image Info
# def get_image_info(file_path, plot=False):
#     # Read Image
#     image = imageio.imread(file_path)
#     h, w, c = image.shape
    
#     if plot: # Debug Plots
#         fig, ax = plt.subplots(1, 2, figsize=(8,8))
#         ax[0].set_title('Pets detected in Image', size=16)
#         ax[0].imshow(image)
        
#     # Get YOLOV5 results using Test Time Augmentation for better result
#     results = yolov5x6_model(image, augment=True)
    
#     # Mask for pixels containing pets, initially all set to zero
#     pet_pixels = np.zeros(shape=[h, w], dtype=np.uint8)
    
#     # Dictionary to Save Image Info
#     h, w, _ = image.shape
#     image_info = { 
#         'n_pets': 0, # Number of pets in the image
#         'labels': [], # Label assigned to found objects
#         'thresholds': [], # confidence score
#     }
    
#     # Save found pets to draw bounding boxes
#     pets_found = []
    
#     # Save info for each pet (each detection row is x1, y1, x2, y2, confidence, class index)
#     for x1, y1, x2, y2, confidence, label in results.xyxy[0].cpu().detach().numpy():
#         label = results.names[int(label)]
#         if label in ['dog', 'cat']:
#             image_info['n_pets'] += 1
#             image_info['labels'].append(label)
#             image_info['thresholds'].append(confidence)

            
#             # Set pixels containing pets to 1
#             pet_pixels[int(y1):int(y2), int(x1):int(x2)] = 1
            
#             # Add found pet
#             pets_found.append([x1, x2, y1, y2, label])

#     if plot:
#         for x1, x2, y1, y2, label in pets_found:
#             c = 'red' if label == 'dog' else 'blue'
#             rect = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=2, edgecolor=c, facecolor='none')
#             # Add the patch to the Axes
#             ax[0].add_patch(rect)
#             ax[0].text(max(25, (x2+x1)/2), max(25, y1-h*0.02), label, c=c, ha='center', size=14)
                
#     # Add Pet Ratio in Image
#     image_info['pet_ratio'] = pet_pixels.sum() / (h*w)

#     if plot:
#         # Show pet pixels
#         ax[1].set_title('Pixels Containing Pets', size=16)
#         ax[1].imshow(pet_pixels)
#         plt.show()
        
#     return image_info

In [None]:
# # Saves our newly calculated image Info
# IMAGES_INFO = {
#     'n_pets': [],
#     'label': [],
#     'pet_ratio': [],
# }

In [None]:
# #Prints out results of YOLOv5 model
# for file_path in train['file_path'].head(10):
#     get_image_info(file_path, plot=True)

In [None]:
# for idx, file_path in enumerate(tqdm(train['file_path'])):
#     image_info = get_image_info(file_path, plot=False)
#     IMAGES_INFO['n_pets'].append(image_info['n_pets'])
#     IMAGES_INFO['pet_ratio'].append(image_info['pet_ratio'])
    
#     # Not every image is correctly classified
#     labels = image_info['labels']
#     if labels:
#         # Use the first detection's label, whether or not all labels agree
#         # (assumes detections are ordered by descending confidence)
#         IMAGES_INFO['label'].append(labels[0])
#     else: # unknown label, YOLO could not find a pet
#         IMAGES_INFO['label'].append('unknown')

In [None]:
# # Add Image Info to Train dataset
# for k, v in IMAGES_INFO.items():
#     train[k] = v

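The `pet_ratio` feature computed in the commented-out cells above is the fraction of image pixels covered by at least one detected dog or cat box, so overlapping boxes are counted only once. The sketch below restates that idea in plain Python on a tiny hypothetical image. The function name `pet_ratio` and the boxes are made up for the example, and unlike the original it clamps boxes to the image bounds.

```python
def pet_ratio(height, width, boxes):
    """Fraction of the image covered by at least one (x1, y1, x2, y2) box."""
    covered = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Clamp to the image bounds and truncate coordinates to whole pixels
        for y in range(max(0, int(y1)), min(height, int(y2))):
            for x in range(max(0, int(x1)), min(width, int(x2))):
                covered[y][x] = 1
    return sum(map(sum, covered)) / (height * width)


# Two overlapping 4x4 boxes in a 10x10 image: 16 + 16 - 4 shared pixels = 28
print(pet_ratio(10, 10, [(0, 0, 4, 4), (2, 2, 6, 6)]))  # 0.28
```

Marking pixels in a mask, rather than summing box areas, is what keeps the ratio at or below 1 when several pets overlap.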

# Data analysis

In [None]:
plt.figure(figsize= (15, 15))
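# numeric_only=True skips text columns such as Id, which pandas 2.0+ would otherwise reject
sns.heatmap(train.corr(numeric_only=True), annot=True, fmt='.1g')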
plt.title('Correlation Matrix', fontweight='bold', fontsize=20)
plt.show()


In [None]:
train.hist(column='Pawpularity', bins=20)
plt.title("Total Pawpularity Distribution")
plt.xlabel('Pawpularity')
plt.ylabel('Picture Count')

## Note:
Most of the pawpularity points lie in the 20-40 range. Care will need to be taken that any models created do not just purely focus on this area, without taking into account higher and lower range values.
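
One way to make this concern measurable is to compare every model against a constant baseline. A model that always predicts the mean score has an RMSE equal to the population standard deviation of the scores, so a useful model should score below that number. The sketch below uses hypothetical scores, not the real data.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical scores, clustered in the 20-40 range with a few extremes
scores = [22, 28, 30, 31, 35, 38, 40, 5, 100]

mean_score = statistics.fmean(scores)
baseline_rmse = rmse(scores, [mean_score] * len(scores))

# Predicting the mean for every photo gives an RMSE equal to the population std
print(round(baseline_rmse, 3), round(statistics.pstdev(scores), 3))
```

On the real data, the equivalent baseline would be the standard deviation of the `Pawpularity` column in the validation set.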

In [None]:

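axes = train.hist(column='Pawpularity', by='Blur', bins=10)
# Each panel is one Blur value (0 or 1); label the axes with what they actually show
for ax in np.ravel(axes):
    ax.set_xlabel('Pawpularity')
    ax.set_ylabel('Picture Count')
plt.suptitle('Pawpularity Distribution by Blur')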


In [None]:
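axes = train.hist(column='Pawpularity', by='Near', bins=10)
# Each panel is one Near value (0 or 1); label the axes with what they actually show
for ax in np.ravel(axes):
    ax.set_xlabel('Pawpularity')
    ax.set_ylabel('Picture Count')
plt.suptitle('Pawpularity Distribution by Near')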

In [None]:
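axes = train.hist(column='Pawpularity', by='Group', bins=10)
# Each panel is one Group value (0 or 1); label the axes with what they actually show
for ax in np.ravel(axes):
    ax.set_xlabel('Pawpularity')
    ax.set_ylabel('Picture Count')
plt.suptitle('Pawpularity Distribution by Group')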

## Pawpularity metadata values

Comparing the "Pawpularity" score across some of the metadata variables shows no clear difference between photos labeled 0 and photos labeled 1.

This is odd, because one would expect certain variables to correlate strongly with a picture's "Pawpularity" score. For example, the "Blur" variable, which tells whether an image is blurry or in focus, shows no discernible difference in score between the two groups.
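
The visual impression from the histograms can be backed by a number, for example the difference in mean score between the two groups of a flag. The sketch below uses hypothetical rows and a made-up helper name, `mean_gap`. It is an illustration of the comparison, not the analysis that was run here.

```python
from statistics import fmean


def mean_gap(rows, flag, target="Pawpularity"):
    """Mean target for rows with flag == 1 minus mean target for rows with flag == 0."""
    ones = [r[target] for r in rows if r[flag] == 1]
    zeros = [r[target] for r in rows if r[flag] == 0]
    return fmean(ones) - fmean(zeros)


# Hypothetical rows, for illustration only
rows = [
    {"Blur": 0, "Pawpularity": 30},
    {"Blur": 0, "Pawpularity": 40},
    {"Blur": 1, "Pawpularity": 32},
    {"Blur": 1, "Pawpularity": 36},
]
print(mean_gap(rows, "Blur"))  # -1.0
```

With the real DataFrame, `train.groupby('Blur')['Pawpularity'].mean()` gives the two group means directly. A gap that is small compared with the overall spread of the scores would match what the histograms suggest.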

In [None]:
# catc = sum(x == 'cat' for x in IMAGES_INFO['label'])
# dogc = sum(x == 'dog' for x in IMAGES_INFO['label'])

In [None]:
# DOG_MEAN = train.loc[train['label'] == 'dog', 'Pawpularity'].mean()
# CAT_MEAN = train.loc[train['label'] == 'cat', 'Pawpularity'].mean()

In [None]:

# fig = plt.figure()
# ax = fig.add_axes([0,0,1,2])
# langs = ['Dog', 'Cat']
# species = [DOG_MEAN,CAT_MEAN]

# ax.bar(langs,species)
# ax.set_ylabel('Mean',fontsize=20)
# ax.set_xlabel('Species',fontsize=20)
# ax.set_title('Pawpularity Mean by Species',fontsize=20)
# plt.show()

In [None]:
# Map the numeric species codes (0, 1, 2) back to names for plotting
train['label'] = train['label'].replace({0: 'dog', 1: 'cat', 2: 'unknown'})


plt.figure(figsize=(15, 8))
plt.title('Pawpularity Distribution by species', size=24)
train.loc[train['label'] != 'unknown'].groupby('label')['Pawpularity'].plot(kind='hist', 
                                                                            bins=20, alpha=0.50)
plt.legend(prop={'size': 20})


In [None]:
# Restore the numeric species codes (dog = 0, cat = 1, unknown = 2) for modelling
train['label'] = train['label'].replace({'dog': 0, 'cat': 1, 'unknown': 2}).astype(int)

## Filter features by variance

In [None]:
# numeric_only=True skips text columns such as Id and file_path
train.var(numeric_only=True)

In [None]:
# Remove all features with variance below 0.1, plus the non-feature columns Id and file_path
pd.set_option('display.max_columns', None)
train2 = train.drop(columns=['file_path', 'Id', 'Blur', 'Action', 'Subject Focus', 'pet_ratio', 'Info', 'Collage', 'Accessory', 'Face'])
train2.head()

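For a 0/1 flag, the variance is fully determined by the share of ones p: the population variance is p(1 - p). A 0.1 cutoff therefore removes any flag that is set on fewer than roughly 11% or more than roughly 89% of the photos. The sketch below checks this arithmetic with hypothetical values. Note that pandas' `var()` reports the sample variance, which differs from p(1 - p) by a factor of n/(n - 1), a negligible difference for thousands of rows.

```python
import math
from statistics import pvariance


def binary_variance(p):
    """Population variance of a 0/1 feature whose share of ones is p."""
    return p * (1 - p)


def share_at_threshold(threshold):
    """Smaller share of ones at which p * (1 - p) equals the threshold."""
    return (1 - math.sqrt(1 - 4 * threshold)) / 2


flags = [1] * 2 + [0] * 8  # hypothetical flag set on 20% of photos
print(binary_variance(0.2), pvariance(flags))
print(round(share_at_threshold(0.1), 4))  # 0.1127
```

In other words, filtering binary flags by variance is the same as filtering them by how rare or how common they are.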
## Filter features by correlation

In [None]:
# Absolute correlation of each numeric column with the target
train.corr(numeric_only=True)['Pawpularity'].abs()

In [None]:
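# Remove all features whose absolute correlation with Pawpularity is below 0.01.
# Note: this starts again from `train`, so it replaces the variance-filtered train2 above.
train2 = train.drop(columns=['file_path', 'Id', 'Near', 'Action', 'Collage', 'Occlusion', 'Info', 'Human', 'Subject Focus', 'Eyes', 'Face', 'pet_ratio'])
train2.head()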

In [None]:
# TODO: compare univariate feature selection with recursive feature elimination

# Create testing and training sets

In [None]:
X = train2.drop(columns=['Pawpularity'])
y = train2['Pawpularity']

X_train, X_val, y_train, y_val =train_test_split(
    X, y, test_size=0.25, random_state=7)


In [None]:
# Plots predicted values against actual values

import matplotlib.patches as mpatches

def ActualvPredictionsGraph(y_test, y_pred, title):
    """Scatter actual (blue) and predicted (red) Pawpularity against sample position."""
    plt.figure(figsize=(12, 3))
    plt.scatter(range(len(y_test)), y_test, color='blue')
    plt.scatter(range(len(y_pred)), y_pred, color='red')
    plt.xlabel('Index')
    plt.ylabel('Pawpularity')
    plt.title(title, fontdict={'fontsize': 15})
    plt.legend(handles=[mpatches.Patch(color='red', label='prediction'),
                        mpatches.Patch(color='blue', label='actual')])
    plt.show()

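### Grid search algorithm

Grid search tries every combination from a provided list of hyperparameter values and keeps the best-scoring combination for each model.
Each combination is scored with cross-validation, which reduces the risk of choosing values that only fit one particular split of the data. Because no `scoring` argument is passed, `GridSearchCV` uses each estimator's default `score` method: R² for the two regressors and classification accuracy for `GaussianNB`. The resulting `best_score` values are therefore not directly comparable across all three models.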
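
To make the mechanics concrete, the sketch below implements a tiny grid search with k-fold cross-validation in plain Python. It is a toy stand-in and not what `GridSearchCV` does internally: a one-feature k-nearest-neighbour regressor is swapped in for the notebook's models, the data are hypothetical, and RMSE is used as the score so that lower is better.

```python
import math


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cv_rmse(xs, ys, k, n_folds=3):
    """Mean validation RMSE of a k-NN regressor over n_folds folds."""
    fold_scores = []
    for fold in range(n_folds):
        # Every n_folds-th sample is held out for validation in this fold
        val_idx = [i for i in range(len(xs)) if i % n_folds == fold]
        train_idx = [i for i in range(len(xs)) if i % n_folds != fold]
        tx = [xs[i] for i in train_idx]
        ty = [ys[i] for i in train_idx]
        sq_err = [(ys[i] - knn_predict(tx, ty, xs[i], k)) ** 2 for i in val_idx]
        fold_scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(fold_scores) / n_folds


def grid_search(xs, ys, grid, n_folds=3):
    """Return (best_k, scores), where scores maps each candidate k to its CV RMSE."""
    scores = {k: cv_rmse(xs, ys, k, n_folds) for k in grid}
    return min(scores, key=scores.get), scores


# Hypothetical 1-D feature and target values
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
ys = [30, 32, 29, 35, 38, 36, 41, 40, 44, 47, 45, 50]

best_k, scores = grid_search(xs, ys, grid=[1, 3, 5])
print(best_k, scores)
```

The key point is that every candidate value is judged only on samples it was not fitted on, averaged over several folds.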

In [None]:

# Candidate models and the hyperparameter values to try for each
model_params = {
    'svr': {
        'model': svm.SVR(),
        'params': {
            'C': [1, 5, 50, 100],
            'kernel': ['rbf', 'linear', 'poly']
        }
    },
    'Decision_tree': {
        'model': tree.DecisionTreeRegressor(),
        'params': {
            'max_depth': [2, 3, 5, 10, 100],
            'min_samples_split': [2, 3, 5, 10],
            'min_samples_leaf': [2, 3, 5],
        }
    },
    'naive_bayes_gaussian': {
        # GaussianNB is a classifier, so each distinct score is treated as a class
        'model': GaussianNB(),
        'params': {
            'var_smoothing': np.logspace(0, -9, num=100)
        }
    },
}
scores = []

for model_name, mp in model_params.items():
    # 3-fold cross-validated search; scoring defaults to the estimator's own score method
    clf = GridSearchCV(mp['model'], mp['params'], cv=3, return_train_score=False)
    clf.fit(X_train, y_train)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
results = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])


In [None]:
pd.options.display.max_columns = None
pd.options.display.max_rows = None
results



# Model Results

### Decision Tree Model

In [None]:
model = tree.DecisionTreeRegressor(splitter='best', max_depth=4, min_samples_leaf=5)
tree_model = model.fit(X_train, y_train)
tree_y_pred = tree_model.predict(X_val)


MSE = mean_squared_error(y_val, tree_y_pred)
RMSE = math.sqrt(MSE)
x = RMSE  # decision tree RMSE, kept for the model comparison chart
print("Root Mean Square Error:", RMSE)

# Plot the whole validation set
ActualvPredictionsGraph(y_val, tree_y_pred, "Actual vs. Predicted over all values")

Interestingly, the decision tree model placed most of its predictions at about 40 or about 35. A few predictions fell outside these two values, but their number was negligible.

In [None]:
# Visualization of the tree model

fig = plt.figure(figsize=(20, 5))
# Take the feature names from the training columns so they always match the fitted model
_ = tree.plot_tree(tree_model, filled=True, feature_names=list(X_train.columns))

### Support Vector Machine Model

In [None]:
model = svm.SVR(C=1, kernel='rbf')
svm_model = model.fit(X_train, y_train)
svm_y_pred = svm_model.predict(X_val)


MSE = mean_squared_error(y_val, svm_y_pred)
RMSE = math.sqrt(MSE)
z = RMSE  # SVR RMSE, kept for the model comparison chart
print("Root Mean Square Error:", RMSE)

# Plot the whole validation set
ActualvPredictionsGraph(y_val, svm_y_pred, "Actual vs. Predicted over all values")

As feared, the SVM model predicted mainly in the 20-40 range. Its predictions were slightly more scattered than those of the decision tree model, but not by much.

### GaussianNB Model

In [None]:
model = GaussianNB(var_smoothing=0.43287612810830584)
gaussian_model = model.fit(X_train, y_train)
gaussian_y_pred = gaussian_model.predict(X_val)

MSE = mean_squared_error(y_val, gaussian_y_pred)
RMSE = math.sqrt(MSE)
# GaussianNB RMSE, kept for the model comparison chart
# (this reuses the name y, which held the target column until now)
y = RMSE
print("Root Mean Square Error:", RMSE)

# Plot the whole validation set
ActualvPredictionsGraph(y_val, gaussian_y_pred, "Actual vs. Predicted over all values")

The Naïve Bayes model gave the most interesting results. Without tuning, it had an RMSE of
58.922; after the grid search this improved drastically to 24.60. Unlike the
previous models, which only predicted in the middle range where most points lie, this model also predicted values at the high
extreme of 100. However, despite this greater variability in its predictions, or more likely because of it, this model
had the worst RMSE of all the models created.

### Ensemble Model-Averaging

In [None]:
# Unweighted average of the three models' validation predictions
pred_final = (tree_y_pred + gaussian_y_pred + svm_y_pred) / 3.0


MSE = mean_squared_error(y_val, pred_final)
RMSE = math.sqrt(MSE)
w = RMSE  # ensemble RMSE, kept for the model comparison chart
print("Root Mean Square Error:", RMSE)

# Plot the whole validation set
ActualvPredictionsGraph(y_val, pred_final, "Actual vs. Predicted over all values")

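Averaging helps most when the individual models err in different directions, because their errors then partly cancel. Models that make nearly the same predictions gain little from it. The sketch below shows the extreme case with hypothetical numbers, where two models with opposite errors average out exactly. The helper names are made up for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def average_predictions(*prediction_lists):
    """Element-wise mean of several equal-length prediction lists."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]


# Hypothetical validation targets and per-model predictions
actual = [30, 50, 20]
model_a = [36, 44, 26]
model_b = [24, 56, 14]

ensemble = average_predictions(model_a, model_b)
print(ensemble)                # [30.0, 50.0, 20.0]
print(rmse(actual, model_a))   # 6.0
print(rmse(actual, ensemble))  # 0.0
```

Real models rarely cancel this cleanly, so the ensemble RMSE usually lands between the best and worst individual scores or slightly below the best.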
In [None]:
names = ['SVM.SVR', 'Decision Tree', 'GaussianNB', 'Ensemble']
# z, x, y and w hold the RMSE of the SVR, decision tree, GaussianNB and ensemble models
values = [z, x, y, w]
plt.title('Model RMSE comparison')
plt.ylabel('RMSE Value')
plt.bar(names, values)


# Takeaway

These models show that the given variables, together with the values extracted from the images, are not good predictors of an image's popularity. The models would likely perform poorly on new data that did not center around the 20-40 range.

To fix this issue, different data would be needed. It is quite possible that some other, unused variable correlates more strongly with a picture's popularity.
This could be something that has nothing to do with what is inside the picture itself. For example, one possibly important variable is the time and day the picture was posted. One study found that an Instagram picture was more likely to receive increased engagement when posted at certain times on certain days, so the popularity scores of these pet photos could follow a similar trend.


# Reference
###  [PetFinder EDA + YOLOV5 Obj Detection + TFRecords](https://www.kaggle.com/markwijkhuizen/petfinder-eda-yolov5-obj-detection-tfrecords)