***
# PetFinder.my Pawpularity Predictors
***


### **Introduction**

As described in the competition overview hosted by Kaggle, https://petfinder.my is a pet adoption website that focuses on improving animal welfare. To aid the adoption process, PetFinder.my uses an algorithm with a "cuteness meter" to score the images of animals that are listed for adoption.

The algorithm outputs a "Pawpularity score": an integer in the range [0, 100] that indicates how likely an image is to increase a pet's chances of adoption, based on certain aspects of the image. The main goal is to use the given data to predict the Pawpularity score of new pet photos.

This project aims to test three methods of prediction to achieve this goal:
* Logistic Regression
* Ensemble through Random Forest Bagging
* Convolutional Neural Network

***
## Section 0: Setup
***

### **Installation**

### **Headers**

In [None]:
import os
import random
import gc
import math

# Seed set for reproducibility
random.seed(32)

import pandas as pd
import numpy as np
import glob

import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

from PIL import Image, ImageFilter
import cv2

import scipy
from sklearn import decomposition, linear_model, model_selection, ensemble, tree, preprocessing, svm, metrics
import tensorflow as tf
    



***
## Section 1: Overview
***

### **Reviewing Given Data**

* images/ - Folder containing photos of the form {id}.jpg, where {id} is a unique Pet Profile ID
* train.csv - Metadata for each photo in the training set as well as the target, the photo's Pawpularity score. The metadata consists of attributes based on visual quality and composition, which are manually labelled for each photo in the dataset.

In [None]:
img_dir = r'../input/petfinder-pawpularity-score/train'
sample_img = plt.imread(img_dir + '/' + random.choice(os.listdir(img_dir)))

# Display random image from dataset
plt.axis('off')
plt.imshow(sample_img)

*Figure 1. A sample image taken from the images dataset*


In [None]:
metadata = pd.read_csv('../input/petfinder-pawpularity-score/train.csv')
metadata

*Table 1. Metadata describing attributes of cases in the image dataset. The Id column gives the photo's unique Pet Profile ID, which corresponds to the photo's file name.*

### **Initial Data Analysis**


In [None]:
plt.figure(figsize=(15,10))
sns.histplot(data=metadata, x='Pawpularity', kde=True)
plt.title('Pawpularity Score Distribution')
plt.axvline(metadata['Pawpularity'].mean(), c='red')
plt.show()

*Figure 2. Pawpularity score distribution over image set*

The Pawpularity score distribution on the dataset is positively skewed, with a mean of approximately 38.
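
The skew can also be quantified rather than read off the plot. The following is a minimal sketch that computes the sample skewness (the mean of the cubed z-scores) with NumPy. The helper name `sample_skewness` and the toy scores are illustrative assumptions, not part of the notebook; on the real data the function would be applied to `metadata['Pawpularity']`.

```python
import numpy as np

def sample_skewness(values):
    """Return the (biased) sample skewness: the mean of the cubed z-scores."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Toy scores (made up for illustration): mostly low values with a few at 100
toy_scores = [20, 25, 30, 30, 35, 40, 45, 100, 100]
skew = sample_skewness(toy_scores)
```

A value above zero indicates a longer right tail, which matches a mean that sits to the right of the bulk of the distribution.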

--

The images displayed below are examples of images with extreme Pawpularity scores, as well as a sample with the average Pawpularity score:

In [None]:
# Display n randomly sampled images whose `attribute` column equals `cond`
def display_img(df, attribute, n, cond):
    # squeeze=False always returns a 2-D array of axes, even when n == 1
    fig, axes = plt.subplots(1, n, squeeze=False)
    ids = df.loc[df[attribute] == cond, 'Id'].sample(n)

    for img_id, ax in zip(ids, axes.ravel()):
        ax.set(xticks=[], yticks=[])
        ax.imshow(plt.imread(img_dir + '/' + img_id + '.jpg'))

In [None]:
display_img(metadata, 'Pawpularity', 3, 1)

*Figure 3. Sampled images with a Pawpularity of 1*

--

In [None]:
display_img(metadata, 'Pawpularity', 3, 100)

*Figure 4. Sampled images with a Pawpularity of 100*

--

In [None]:
display_img(metadata, 'Pawpularity', 3, 38)

*Figure 5. Sampled images with the average Pawpularity of 38*

--

Metadata by Pawpularity score:

In [None]:
low = metadata.loc[metadata['Pawpularity'] == 1]
low.head()

*Table 2. Pawpularity = 1*

In [None]:
high = metadata.loc[metadata['Pawpularity'] == 100]
high.head()

*Table 3. Pawpularity = 100*

Looking at the samples in Table 2 and Table 3, images with similar attributes do not necessarily receive the same Pawpularity score. For instance, all samples shown in Table 2 share similar attributes and all have a Pawpularity of 1. Sample 0254f54b148543442373d5aad45b2d1a in Table 3 has attributes similar to those of the samples in Table 2, yet its Pawpularity is 100.

--

The metadata can be represented as a correlation matrix between the different attributes:

In [None]:
plt.figure(figsize=(10, 12))
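# Id is a non-numeric string column, so it is dropped before computing correlations
sns.heatmap(metadata.drop(columns='Id').corr(), square=True)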
plt.title('Metadata Correlation Matrix')
plt.show()

*Figure 6. Correlation Matrix between attributes in metadata*

There are some correlations between a few attributes. The pairs (Eyes, Face) and (Info, Collage) seem to show the highest correlation within the matrix. (Eyes, Face) is an intuitive correlation, since the eyes are a part of the face.
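
To read the strongest pairs off the matrix without relying on the heatmap colors, the off-diagonal entries can be ranked. The sketch below does this with NumPy on synthetic binary attributes; the helper `top_correlated_pairs`, the column names, and the generated data are assumptions for illustration only, constructed so that "Eyes" can be 1 only when "Face" is 1.

```python
import numpy as np

def top_correlated_pairs(data, names, k=2):
    """Return the k column pairs with the largest absolute Pearson correlation."""
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            pairs.append((abs(corr[i, j]), names[i], names[j]))
    pairs.sort(reverse=True)
    return [(a, b, round(float(c), 3)) for c, a, b in pairs[:k]]

# Synthetic binary attributes (hypothetical data, not the competition metadata)
rng = np.random.default_rng(32)
face = rng.integers(0, 2, size=500)
eyes = face * (rng.random(500) < 0.9)    # eyes are visible only if the face is
blur = rng.integers(0, 2, size=500)      # independent of the other two

data = np.column_stack([face, eyes, blur])
top = top_correlated_pairs(data, ['Face', 'Eyes', 'Blur'], k=1)
```

On the real metadata, the same ranking could be applied to the numeric columns of `metadata`.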

### **Validation Scheme & Scoring Metrics**

A 5-fold Cross-Validation scheme is used to test and validate all three models.

The dataset is randomly shuffled and split 80/20 into training and test sets. The training set is then partitioned into five equal folds. In each iteration, one fold is held out as the validation set and the other four are used to fit the model; the fitted model is evaluated on the validation fold and its score is recorded. This is repeated until every fold has served as the validation set exactly once. The recorded scores of all iterations are averaged to obtain the overall score.

The model learned from cross-validation is then tested using the test set originally partitioned from the dataset. The purpose is to analyze the performance of the learned model given completely unseen data.
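
The scheme above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the notebook's implementation: the data are synthetic, and an ordinary least-squares fit (`fit_linear`, `predict_linear`, both hypothetical helpers) stands in for the actual models. Refitting on the full training set before the final test is also an assumption about how "the model learned from cross-validation" is obtained.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fit_linear(X, y):
    """Placeholder model: ordinary least squares with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(32)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 1. Shuffle and split 80/20 into training and hold-out test indices
idx = rng.permutation(len(X))
n_test = len(X) // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

# 2. Partition the training indices into five folds
folds = np.array_split(train_idx, 5)

# 3. Each fold serves as the validation set exactly once
fold_scores = []
for k, val_idx in enumerate(folds):
    fit_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    coef = fit_linear(X[fit_idx], y[fit_idx])
    fold_scores.append(rmse(y[val_idx], predict_linear(coef, X[val_idx])))
cv_score = float(np.mean(fold_scores))

# 4. Refit on the full training set and score once on the hold-out set
final_coef = fit_linear(X[train_idx], y[train_idx])
test_score = rmse(y[test_idx], predict_linear(final_coef, X[test_idx]))
```

Keeping the hold-out indices out of every fold is what makes `test_score` an estimate on unseen data.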

Scoring metrics used:

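Root Mean Squared Error (RMSE), defined as: $$\mathrm{RMSE} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}{(y_i - \hat{y}_i)^2}}$$

where $N$ is the number of samples, $y_i$ is the true Pawpularity score of sample $i$, and $\hat{y}_i$ is the predicted score.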
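
As a small worked example of the formula, consider three true scores and three predictions (values made up for illustration). The errors are -2, 3 and 0, the squared errors are 4, 9 and 0, their mean is 13/3, and the RMSE is the square root of that, roughly 2.08.

```python
import numpy as np

y_true = np.array([30.0, 40.0, 50.0])   # hypothetical true scores
y_pred = np.array([32.0, 37.0, 50.0])   # hypothetical predictions

squared_errors = (y_true - y_pred) ** 2         # [4., 9., 0.]
rmse_value = float(np.sqrt(squared_errors.mean()))
```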


***
## Section 2: Preprocessing
***

Three steps are involved in the preprocessing flow:
1. Preprocess image
2. Dimensionality reduction through PCA
3. Create superset from image data and metadata

### **Preprocessing Images**

A function, `augment`, is defined to preprocess each image. It applies the following transformations:
* Grayscale conversion
* Resize to 64x64 pixels
* Antialiasing (Lanczos resampling)

In [None]:
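# Crop a crop_width x crop_height box from the center of the image
def crop_center(pil_img, crop_width, crop_height):
    img_width, img_height = pil_img.size
    return pil_img.crop(((img_width - crop_width) // 2,
                         (img_height - crop_height) // 2,
                         (img_width + crop_width) // 2,
                         (img_height + crop_height) // 2))

# Convert to grayscale, downscale with Lanczos antialiasing, and center-crop to 64x64
def augment(img):
    out = img.copy().convert('LA')
    # thumbnail() resizes in place and preserves the aspect ratio
    out.thumbnail((64, 64), Image.LANCZOS)

    return crop_center(out, 64, 64)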

The purpose of these transformations is to normalize the dataset before it is used to train the model: every image ends up with the same size and a single intensity channel, which reduces the dimensionality of the input and the cost of training.
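
Because `thumbnail` preserves the aspect ratio, the shorter side of a non-square image ends up below 64 pixels, and the centered crop box then extends past the image border. The sketch below reproduces only the box arithmetic of `crop_center` in plain Python; the helper name `center_crop_box` and the 64x48 thumbnail size are assumptions for illustration. To the best of my understanding, Pillow fills the out-of-bounds area with zeros, which would explain why the output is always 64x64.

```python
def center_crop_box(img_width, img_height, crop_width, crop_height):
    """Return the (left, upper, right, lower) box used for a centered crop."""
    return ((img_width - crop_width) // 2,
            (img_height - crop_height) // 2,
            (img_width + crop_width) // 2,
            (img_height + crop_height) // 2)

# A landscape thumbnail of 64x48 pixels (hypothetical size)
box = center_crop_box(64, 48, 64, 64)   # (0, -8, 64, 56)

# The box is always crop_width x crop_height, even when it leaves the image
box_width = box[2] - box[0]
box_height = box[3] - box[1]
```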

An example from the image dataset is used to illustrate image preprocessing.

In [None]:
plt.axis('off')
sample_img = Image.open(img_dir + '/' + metadata.sample().iloc[0]['Id'] + '.jpg')
plt.imshow(sample_img)


*Figure 7. Raw image example*

In [None]:

plt.axis('off')
plt.imshow(augment(sample_img))

*Figure 8. Transformed image*

Apply transformation to all images in dataset and store pixel values in a dataframe:

In [None]:
# Grab test data
test_image_loc = glob.glob("../input/petfinder-pawpularity-score/test/*.jpg")

test = []
for Id in test_image_loc:
    img = Image.open(Id)
    test.append(list(augment(img).getdata(0)))
    img.close()
test_data = np.array(test)

In [None]:
# Iterate through image ids, preprocess each image, and store its pixel data in a list
images = []
for Id in metadata['Id'].tolist():
    img = Image.open(img_dir + '/' + Id + '.jpg')
    images.append(list(augment(img).getdata(0)))
    img.close()

# Convert the list to a NumPy array
image_data = np.array(images)

In [None]:
image_df = pd.DataFrame(image_data)
image_df
image_df.to_csv('img_intensity.csv', header=False, index=False)

Each row in the dataframe represents an image. Each enumerated column represents a unique pixel position, and each value is the intensity of that pixel. Because the images are converted to grayscale during preprocessing, a single value in the range [0, 255] denotes the light intensity.
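
The mapping between a column index and a pixel position can be made explicit. The following sketch assumes the 64x64 output size used in `augment` and row-by-row pixel order; the array `row` is a stand-in for one dataframe row, not real image data.

```python
import numpy as np

side = 64
# Stand-in for one row of the dataframe: 4096 intensities in [0, 255]
row = np.arange(side * side) % 256

# Pixels are stored row by row, so a row-major reshape restores the grid
grid = row.reshape(side, side)

# Column index of the pixel at image row r, column c
r, c = 10, 3
flat_index = r * side + c
```

Reshaping a real row this way and passing it to `plt.imshow` would display the preprocessed image again.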

Intensity values are standardized to zero mean and unit variance per pixel before processing. The scaler is fitted on the training images only and then applied to the test images:

In [None]:
scaler = preprocessing.StandardScaler()
scaler.fit(image_df)
norm_df = pd.DataFrame(scaler.transform(image_df))
y_norm_df = pd.DataFrame(scaler.transform(test_data))
norm_df
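
What the scaler computes can be sketched directly in NumPy: each column is centered on its mean and divided by its standard deviation, and the statistics learned from the training data are reused for the test data. The arrays below are random stand-ins for pixel intensities and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(32)
train = rng.integers(0, 256, size=(100, 4)).astype(float)   # stand-in training pixels
test = rng.integers(0, 256, size=(10, 4)).astype(float)     # stand-in test pixels

# Per-column statistics from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)    # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std    # reuse the training statistics
```

After scaling, every training column has mean 0 and standard deviation 1; the test columns are only approximately so, because they are scaled with the training statistics.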

### **Dimensionality Reduction**

Principal Component Analysis (PCA) is used to project the pixel features onto a lower-dimensional component space. The number of components k is chosen so that 85% of the variance is retained.

In [None]:
norm_pca = decomposition.PCA(.85)
input_df = pd.DataFrame(norm_pca.fit_transform(norm_df))

The principal components are eigenvectors of the covariance matrix of the dataset. The eigenvectors that capture the most variance, i.e. those with the largest eigenvalues, are kept as the principal components.

In [None]:
norm_pca.components_.shape[0]

The 70 largest components capture 85% of the variance of the original dataset. A change of basis is performed on the normalized dataset using the principal components as the basis.
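
The eigenvector selection and the change of basis can be sketched with NumPy alone. This is a simplified illustration, not scikit-learn's implementation: the helper `pca_fit` is hypothetical, and the three-feature data set is synthetic, with per-feature variances of roughly 9, 4 and 0.25 so that two components are enough to pass the 85% threshold.

```python
import numpy as np

def pca_fit(X, variance_kept=0.85):
    """Return (mean, components, ratios), keeping enough components
    to explain at least `variance_kept` of the total variance."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()
    # Smallest k whose cumulative variance ratio reaches the threshold
    k = int(np.searchsorted(np.cumsum(ratios), variance_kept) + 1)
    return mean, eigvecs[:, :k], ratios[:k]

# Synthetic data (made up for illustration)
rng = np.random.default_rng(32)
X = rng.normal(size=(500, 3)) * np.array([3.0, 2.0, 0.5])

mean, components, ratios = pca_fit(X)

# Change of basis: coordinates of each sample along the kept components
Z = (X - mean) @ components
```

The same `mean` and `components` can be applied to any new sample, which keeps its coordinates comparable with those of the data used for fitting.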

In [None]:
input_df

In [None]:
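# Project the test data onto the components learned from the training data,
# so that train and test features share the same basis and dimensionality
Y = pd.DataFrame(norm_pca.transform(y_norm_df))
Y.to_csv('test_preprocessed.csv', header=False, index=False)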

The columns of the new dataset are the coordinates along the first 70 principal components. The 4096 original dimensions are reduced to 70 at the cost of only 15% of the variance.

### **Combining Dataframes**

The attributes given in the metadata are added to this dataset as extra columns alongside the image components.

In [None]:
temp = metadata.copy()
# Save the test image ids (file names without the .jpg extension)
labels = pd.DataFrame([x.replace('.jpg','') for x in next(os.walk('../input/petfinder-pawpularity-score/test'), (None, None, []))[2]])
labels.to_csv('labels.csv', header=False, index=False)
# Save preprocessed without metadata
input_df.to_csv('preproc_wo_meta.csv', header=False, index=False)

# Save target
target = temp.pop('Pawpularity')
target.to_csv('target.csv', header=False, index=False)

# Save with metadata. join() returns a new dataframe, so the result is assigned back;
# the Id column is dropped because it is an identifier, not a feature
input_df = input_df.join(temp.drop(columns='Id'))
input_df.to_csv('preproc_with_meta.csv', header=False, index=False)

In [None]:
input_df

*Table 4. Dataframe representing images in the image set*

The input space now reflects both the image data and the metadata, so a model can process them together.

In [None]:
# %reset -f

***
## Section 3: Models & Methods
***

In [None]:
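X = np.loadtxt('preproc_wo_meta.csv', delimiter=',')
y = np.loadtxt('target.csv')
labels = pd.read_csv('labels.csv', names=['Id'])
# Competition test features; ndmin=2 keeps a 2-D shape even for a single row
test_X = np.loadtxt('test_preprocessed.csv', delimiter=',', ndmin=2)

# Hold-out split of the labelled data; named val_* so that it does not overwrite test_X
train_X, val_X, train_y, val_y = model_selection.train_test_split(X, y, test_size=0.20, random_state=32)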

### **Logistic Regression**

In [None]:
##TODO
# lr = linear_model.LogisticRegression(solver='lbfgs', max_iter=4000, n_jobs=-1)
# lr

In [None]:
# scores

### **Random Forests**

In [None]:
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
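# 'auto' is no longer accepted by recent scikit-learn releases;
# 1.0 (all features) is its equivalent for regressors
max_features = [1.0, 'sqrt']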
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap }

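rf = ensemble.RandomForestRegressor(random_state=32)
rf_search = model_selection.RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                                               cv=5, scoring='neg_root_mean_squared_error',
                                               random_state=32, n_jobs=-1)
# Fit the search (not the bare estimator) and keep the best model found
rf_search.fit(X, y)
rf = rf_search.best_estimator_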
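
A randomized search does not try every combination in the grid; it draws a fixed number of candidates (10 by default in `RandomizedSearchCV`) and evaluates only those. The sketch below illustrates the sampling step with the standard library. The helper `sample_candidates` and the reduced `toy_grid` are assumptions for illustration, and this simplified version may draw the same combination twice, whereas scikit-learn, to my understanding, avoids duplicates when every parameter is given as a list.

```python
import math
import random

def sample_candidates(grid, n_iter, seed=32):
    """Draw n_iter parameter combinations, one random value per parameter."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()}
            for _ in range(n_iter)]

# Reduced grid (hypothetical values, smaller than the one in the notebook)
toy_grid = {'n_estimators': [200, 400, 600],
            'max_depth': [10, 20, None],
            'bootstrap': [True, False]}

# Size of the full grid that an exhaustive search would have to evaluate
n_total = math.prod(len(values) for values in toy_grid.values())

candidates = sample_candidates(toy_grid, n_iter=10)
```

Each candidate would then be scored with cross-validation, and the combination with the best average score would be kept.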

In [None]:
rf_pred = rf.predict(test_X)

In [None]:
submit = labels.join(pd.DataFrame(rf_pred, columns=['Pawpularity']))
submit.to_csv('submission.csv',index=False)

In [None]:
submit

In [None]:
# math.sqrt(metrics.mean_squared_error(test_y, rf_pred))

In [None]:
# metrics.r2_score(test_y, rf_pred)