# Acknowledgements

A massive thank you to [Alex Teboul](https://www.kaggle.com/alexteboul) for making a fantastic tutorial that helps beginners get involved in this competition. Here is [Part 1](https://www.kaggle.com/alexteboul/tutorial-part-1-eda-for-beginners).

I also found this discussion [thread](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/285140) very interesting, especially the comment by [Chris Deotte](https://www.kaggle.com/cdeotte) about binning.

I used a notebook by [saaries](https://www.kaggle.com/saaries) as inspiration to try transfer learning using EfficientNet.

This [notebook](https://www.kaggle.com/arjunrao2000/beginners-guide-efficientnet-with-keras/comments) by [Arjun Rao](https://www.kaggle.com/arjunrao2000) was very helpful for using EfficientNet practically within my neural network. 

# Libraries

In [None]:
# Core
import os
import pandas as pd
from glob import glob
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
from pathlib import Path
import time
import math

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Data

In [None]:
# Source path
path = '../input/petfinder-pawpularity-score/'

# Read data and save as data frames
train_df = pd.read_csv(path + 'train.csv')
test_df = pd.read_csv(path + 'test.csv')

# Get the image data (the .jpg data) and put it into lists of filenames
train_jpg = glob(path + "train/*.jpg")
test_jpg = glob(path + "test/*.jpg")

# Preview first 5 elements of train_jpg
train_jpg[:5]

In [None]:
# Print dimensions of training data
print('train_df dimensions: ', train_df.shape)

# Print dimensions of test data
print('test_df dimensions: ',test_df.shape)

# Preview training data
train_df.head()

# Exploratory Data Analysis (EDA)

**Plot distribution of Pawpularity Scores**

In [None]:
# Figure size
plt.figure(figsize=(12,4))

# Histogram
sns.histplot(data=train_df, x='Pawpularity', bins=100)

# Aesthetics (axvline adds a vertical line across the axes)
plt.axvline(train_df['Pawpularity'].mean(), c='red', ls='-', lw=3, label='Mean Pawpularity')
plt.axvline(train_df['Pawpularity'].median(),c='blue',ls='-',lw=3, label='Median Pawpularity')
plt.title('Distribution of Pawpularity Scores', fontsize=20)
plt.legend()
plt.xlabel('Pawpularity', fontsize=15)
plt.ylabel('Count', fontsize=15)

**Observations:**
* The distribution is skewed due to the almost 300 entries with a pawpularity score of 100. 
* There is also a small bump of scores close to zero. 
* Apart from this the data seems to roughly follow a gamma distribution. 
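
The shape described above can also be checked numerically. The following is a minimal sketch, assuming a data frame with a `Pawpularity` column; the helper name `summarize_scores` and the toy scores are made up for illustration and are not taken from the competition data.

```python
import pandas as pd


def summarize_scores(df: pd.DataFrame, target: str = "Pawpularity") -> dict:
    """Return basic shape statistics for the target column (hypothetical helper)."""
    scores = df[target]
    return {
        "mean": scores.mean(),
        "median": scores.median(),
        "skew": scores.skew(),                    # > 0 means a longer right tail
        "n_at_max": int((scores == 100).sum()),   # size of the spike at 100
    }


# Toy data for illustration only (not the competition data)
toy_df = pd.DataFrame({"Pawpularity": [20, 25, 30, 30, 35, 40, 100, 100]})
summary = summarize_scores(toy_df)
print(summary)
```

In the notebook itself, `summarize_scores(train_df)` would report how large the spike at 100 is and how far the mean sits from the median.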

**Plot distribution of Pawpularity scores according to feature classes**

In [None]:
# Obtain features
feature_variables = train_df.columns.values.tolist()


# Plot violin plot and distribution plot against pawpularity for each feature (excluding Id and Pawpularity)
for i in feature_variables[1:-1]:
    fig, ax = plt.subplots(1,2, figsize=(12,4))
    sns.violinplot(ax=ax[0], data=train_df, x=i, y='Pawpularity')
    sns.histplot(ax=ax[1], data=train_df, x="Pawpularity", hue=i, kde=True)
    plt.suptitle(i, fontsize=20)
    # plt.show() avoids the non-GUI backend warning that fig.show() raises with the inline backend
    plt.show()

**Observations:**
* The violin plots are almost identical across the classes of each feature. That is, the pawpularity does not depend strongly on the values of these features.
* This means it will be difficult to train any algorithm to predict pawpularity on these features alone; we will likely need to use the images as well.
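
One way to put a number on how similar the violins are is to compare the mean score between the classes of each feature. This is only a sketch: it assumes every metadata feature is a binary 0/1 flag, and the helper name `feature_mean_gaps`, the column names `Eyes` and `Blur`, and the toy values are placeholders chosen for illustration.

```python
import pandas as pd


def feature_mean_gaps(df: pd.DataFrame, target: str = "Pawpularity",
                      exclude: tuple = ("Id",)) -> pd.Series:
    """Mean target where feature == 1 minus mean target where feature == 0 (hypothetical helper)."""
    features = [c for c in df.columns if c != target and c not in exclude]
    gaps = {}
    for col in features:
        class_means = df.groupby(col)[target].mean()
        # A class that never occurs yields NaN for that feature's gap
        gaps[col] = class_means.get(1, float("nan")) - class_means.get(0, float("nan"))
    # Largest absolute gap first
    return pd.Series(gaps).sort_values(key=lambda s: s.abs(), ascending=False)


# Toy data for illustration only (not the competition data)
toy_df = pd.DataFrame({
    "Id": list("abcdef"),
    "Eyes": [1, 1, 1, 0, 0, 0],
    "Blur": [0, 1, 0, 1, 0, 1],
    "Pawpularity": [40, 42, 38, 39, 41, 40],
})
gaps = feature_mean_gaps(toy_df)
print(gaps)
```

Gaps that are small relative to the overall spread of the scores would support the observation that the metadata carries little signal.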

**Explore images**

In [None]:
# Show first 3 images in training set with pawpularity score
for i in range(3):
    
    # Image path
    image_path = train_jpg[i]
    
    # Image Id
    id_stem = Path(image_path).stem
    
    # Use Id to get pawpularity score
    id_stem_series = train_df.loc[train_df['Id'] == id_stem,'Pawpularity']
    pawpularity_by_id = id_stem_series.iloc[0]
    
    # Use plt.imread() to read in image file as an np.array of numbers between 0-255 (3 channels)
    image_array = plt.imread(image_path) 
    
    # Display image using plt.imshow()
    plt.figure(figsize=(8,8))
    plt.imshow(image_array)
    
    # Add title
    title = id_stem +', Pawpularity score:'+ str(pawpularity_by_id)
    plt.title(title)
    
    # Turn off gridlines
    plt.axis('off')
    
    # Show the image
    plt.show()

**Return images with a certain pawpularity score**

In [None]:
def pawpularity_pics(df: pd.DataFrame, num_images: int, desired_pawpularity: int, random_state: int):
    '''The pawpularity_pics() function accepts 4 parameters: df is a dataframe, 
    num_images is the number of images you want displayed, desired_pawpularity 
    is the pawpularity score of pics you want to see, and random_state ensures reproducibility.'''
    
    # Sample df for desired pawpularity score (+/- 1)
    random_sample = df.loc[(df["Pawpularity"]<=(desired_pawpularity+1)) & (df["Pawpularity"]>=(desired_pawpularity-1))].sample(
        num_images, random_state=random_state).reset_index(drop=True)
    
    # Subplot space with 1 row and num_images columns
    plt.subplots(1, num_images, figsize=(14,14))
    
    # Loop over num_images
    for i in range(num_images):
        
        # Image Id
        image_path_stem = random_sample.iloc[i]['Id']
        root = '../input/petfinder-pawpularity-score/train/'
        extension = '.jpg'
        image_path = root + str(image_path_stem) + extension
         
        # Get pawpularity for title
        pawpularity_by_id = random_sample.iloc[i]['Pawpularity']
    
        # Read image using plt.imread()
        image_array = plt.imread(image_path)
        
        # Subplot
        plt.subplot(1, num_images, i+1)
        
        # Title is the pawpularity score
        plt.title(pawpularity_by_id) 
        
        # Turn off gridlines
        plt.axis('off')
        
        # Display image with plt.imshow()
        plt.imshow(image_array)
        
    plt.show()
    plt.close()
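
The sampling step in the function above raises a `ValueError` when fewer than `num_images` rows fall inside the requested score window. A possible guard is sketched below; the helper name `sample_near_score`, its `tolerance` parameter, and the toy scores are assumptions made for illustration, not part of the original function.

```python
import pandas as pd


def sample_near_score(df: pd.DataFrame, desired: int, num_images: int,
                      tolerance: int = 1, random_state: int = 0,
                      target: str = "Pawpularity") -> pd.DataFrame:
    """Sample up to num_images rows whose score is within +/- tolerance of desired."""
    # between() is inclusive on both ends by default
    in_range = df[df[target].between(desired - tolerance, desired + tolerance)]
    # Never request more rows than are available
    n = min(num_images, len(in_range))
    return in_range.sample(n, random_state=random_state).reset_index(drop=True)


# Toy data for illustration only (not the competition data)
toy_df = pd.DataFrame({"Id": list("abcdefg"),
                       "Pawpularity": [9, 10, 11, 12, 50, 99, 100]})
near_10 = sample_near_score(toy_df, desired=10, num_images=4)
near_100 = sample_near_score(toy_df, desired=100, num_images=4)
print(len(near_10), len(near_100))
```

With this guard the caller receives fewer images than requested instead of an error, so the plotting loop would need to iterate over the length of the returned frame.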

**Pawpularity 10 (+/-1)**

In [None]:
pawpularity_pics(train_df, 4, 10, 0)

**Pawpularity 50 (+/-1)**

In [None]:
pawpularity_pics(train_df, 4, 50, 0)

**Pawpularity 100 (+/-1)**

In [None]:
pawpularity_pics(train_df, 4, 100, 0)

**Observations:**
* Visually I have a hard time predicting the correct pawpularity scores from the pictures.
* There could be other factors influencing these scores, e.g. via the website that these were collected on. 
* The subjectivity of this task could make it very difficult for a neural network to perform well. 

# Model using metadata

We don't think a model based on the metadata alone will perform very well, but we will give it a try using a Random Forest.

**Labels and features**

In [None]:
# Labels
y = train_df['Pawpularity']

# Features
X = train_df.drop(['Id','Pawpularity'], axis=1)

**Train-test split**

In [None]:
# Train-test split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size = 0.2, random_state=0)
print('Dimensions: \n X_train: {} \n X_valid: {} \n y_train: {} \n y_valid: {}'.format(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape))

**Random Forest Regressor**

In [None]:
# Random Forest Regressor (random_state fixed so that results are reproducible)
RF = RandomForestRegressor(n_estimators=200, max_depth=4, random_state=0)

# Train the model
start = time.time()
RF.fit(X_train, y_train)
stop = time.time()

# Make predictions
RF_pred = RF.predict(X_valid)

# Print time and RMSE
print(f'Training time: {round((stop - start),3)} seconds')
RF_RMSE = math.sqrt(mean_squared_error(y_valid, RF_pred))
print(f'RF_RMSE: {round(RF_RMSE,3)}')

**Plot predictions**

In [None]:
# Make function to plot predictions
def ActualvPredictionsGraph(y_test, y_pred, title):
    '''Scatter actual (blue) and predicted (red) values against their position in the set.'''
    plt.figure(figsize=(12,3))
    plt.scatter(range(len(y_test)), y_test, color='blue')
    plt.scatter(range(len(y_pred)), y_pred, color='red')
    plt.xlabel('Index')
    plt.ylabel('Pawpularity')
    plt.title(title, fontdict={'fontsize': 15})
    plt.legend(handles=[mpatches.Patch(color='red', label='prediction'), mpatches.Patch(color='blue', label='actual')])
    plt.show()

# Plot RF predictions
ActualvPredictionsGraph(y_valid[0:50], RF_pred[0:50], "First 50 Actual v. Predicted")
ActualvPredictionsGraph(y_valid, RF_pred, "All Actual v. Predicted")

# Plot actual v predicted in histogram form
# (labels are attached to each histogram so the legend colours match the data)
plt.figure(figsize=(12,4))
sns.histplot(RF_pred, color='r', alpha=0.3, stat='probability', kde=True, label='prediction')
sns.histplot(y_valid, color='b', alpha=0.3, stat='probability', kde=True, label='actual')
plt.legend()
plt.title('Actual v Predicted Distribution')
plt.ylim([0.0, 0.2])
plt.show()


**Observations:**
* All RF predictions are very similar regardless of the metadata. (They seem to lie close to the mean of the training set distribution.)
* This is not a surprise because we already saw that the metadata isn't a good predictor of pawpularity. 
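
Since the predictions cluster near the training mean, a useful reference point is the RMSE of a model that always predicts that mean. The sketch below uses only NumPy; the helper names `rmse` and `mean_baseline_rmse` and the toy numbers are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mean_baseline_rmse(y_train, y_valid) -> float:
    """RMSE on y_valid when every prediction is the mean of y_train."""
    constant = float(np.mean(y_train))
    return rmse(y_valid, np.full(len(y_valid), constant))


# Toy numbers for illustration only (not the competition data)
toy_train = [30, 40, 50]
toy_valid = [35, 45]
baseline = mean_baseline_rmse(toy_train, toy_valid)
print(f'Baseline RMSE: {baseline:.3f}')
```

In the notebook, comparing `mean_baseline_rmse(y_train, y_valid)` with `RF_RMSE` would show how much, if anything, the Random Forest gains over a constant prediction.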

**Submit to competition**

In [None]:
# Test set
X_test = test_df.drop(['Id'], axis=1)

# Make predictions
test_df['Pawpularity'] = RF.predict(X_test) 

# Save to csv
submission_df = test_df[['Id','Pawpularity']]
submission_df.to_csv("submission.csv", index=False)
submission_df.head()

**Remarks:**

Other models that could be considered include: Decision Tree Regressor, Decision Tree Classification, Ordinary Least Squares Regression (Linear Regression), Ridge Regression, Bernoulli Naive Bayes Classification, and Gradient Boosting Regression.

These models don't end up performing much better, if at all, than the Random Forest. This is simply because the metadata isn't a good predictor of pawpularity.
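
A comparison like this can be run with `cross_val_score`. The sketch below is not the experiment behind the remark above: it covers only the regression models, adds a `DummyRegressor` as a constant-mean reference (an addition for illustration), and uses randomly generated features and targets in place of the competition data. The helper name `compare_regressors` is hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def compare_regressors(X, y, models: dict, cv: int = 5) -> dict:
    """Return an RMSE per model, computed from the MSE averaged across CV folds."""
    results = {}
    for name, model in models.items():
        neg_mse = cross_val_score(model, X, y, cv=cv,
                                  scoring="neg_mean_squared_error")
        results[name] = float(np.sqrt(-neg_mse.mean()))
    return results


# Random stand-in data: binary features that carry no information about the target
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 5))
y_toy = rng.integers(1, 101, size=200)

candidates = {
    "Mean baseline": DummyRegressor(strategy="mean"),
    "Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

scores = compare_regressors(X_toy, y_toy, candidates)
for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE {value:.3f}")
```

In the notebook, passing `X` and `y` from the metadata section in place of the random arrays would give a like-for-like comparison against the mean baseline.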

# Next steps

Next I will attempt to build models based on the images themselves. See here for my next notebook.