## PET FINDER / EDA + Regression on CatVsDog augmented dataset

The aim of this notebook is to explore the data and exploit the metadata. In particular, we will try to add the type of animal (cat or dog) to the metadata, in order to see whether this information proves valuable in a regression.



In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline 

import cv2

import os

TRAIN_DIR = '../input/petfinder-pawpularity-score/train'
TEST_DIR = '../input/petfinder-pawpularity-score/test'

## **1. First approach to data**

First, let's load the Ids as well as the associated metadata.

In [None]:
train = pd.read_csv("../input/petfinder-pawpularity-score/train.csv")
test = pd.read_csv("../input/petfinder-pawpularity-score/test.csv")
train.head()

Then let's display some photos:

In [None]:
def displaySample(attribute):
    
    f, ax = plt.subplots(1, 2, figsize = (15, 5))

    dfSample0 = train[train[attribute]==0].sample()
    dfSample1 = train[train[attribute]==1].sample()

    imgBGR = cv2.imread(TRAIN_DIR + '/' + dfSample0.iloc[0]['Id'] + '.jpg')
    imgRGB = cv2.cvtColor(imgBGR, cv2.COLOR_BGR2RGB)
    ax[0].imshow(imgRGB)
    ax[0].axis('off')
    ax[0].set_title(attribute+" 0")   

    imgBGR = cv2.imread(TRAIN_DIR + '/' + dfSample1.iloc[0]['Id'] + '.jpg')
    imgRGB = cv2.cvtColor(imgBGR, cv2.COLOR_BGR2RGB)
    ax[1].imshow(imgRGB)
    ax[1].axis('off')
    ax[1].set_title(attribute+" 1")  
        
    plt.show()

In [None]:
#comparison of different "Subject Focus" values
#-------------------------------------------------
displaySample('Subject Focus')

In [None]:
#comparison of different "Eyes" values
#-------------------------------------------------
displaySample('Eyes')

In [None]:
#comparison of different "Face" values
#-------------------------------------------------
displaySample('Face')

In [None]:
#comparison of different "Near" values
#-------------------------------------------------
displaySample('Near')

In [None]:
#comparison of different "Action" values
#-------------------------------------------------
displaySample('Action')

In [None]:
#comparison of different "Accessory" values
#-------------------------------------------------
displaySample('Accessory')

In [None]:
#comparison of different "Group" values
#-------------------------------------------------
displaySample('Group')

In [None]:
#comparison of different "Collage" values
#-------------------------------------------------
displaySample('Collage')

In [None]:
#comparison of different "Human" values
#-------------------------------------------------
displaySample('Human')

In [None]:
#comparison of different "Occlusion" values
#-------------------------------------------------
displaySample('Occlusion')

In [None]:
#comparison of different "Info" values
#-------------------------------------------------
displaySample('Info')

In [None]:
#comparison of different "Blur" values
#-------------------------------------------------
displaySample('Blur')

The dataset contains both cats and dogs. The metadata is fairly complete, but it lacks what is arguably the most discriminating factor: the type of animal (cat or dog). A neural network working on the images would undoubtedly cope without it, but it may be interesting to add this information to our analysis.

## **2. Quick labeling: cat or dog**

We will quickly add the cat-or-dog label to the dataset. This is a classic task, so we will not go into the details of the CNN training process and will go directly to the prediction phase in this notebook. We can nevertheless specify that we use a ResNet50 architecture whose precision has been evaluated at 0.98.

#### Definition of some constants

In [None]:
# Fixed for our Cats & Dogs classes
NUM_CLASSES = 2

# Fixed for Cats & Dogs color images
CHANNELS = 3

IMAGE_RESIZE = 224
RESNET50_POOLING_AVERAGE = 'avg'
DENSE_LAYER_ACTIVATION = 'softmax'
OBJECTIVE_FUNCTION = 'categorical_crossentropy'

# Common accuracy metric for all outputs, but can use different metrics for different output
LOSS_METRICS = ['accuracy']

# Using 1 to easily manage mapping between test_generator & prediction for submission preparation
BATCH_SIZE_TESTING = 1

#### Import of libraries and definition of paths

In [None]:
# Use the public tensorflow.keras API rather than the private tensorflow.python.keras modules
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers

resnet_weights_path = '../input/resnet50/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5'
resnet_bestwgt_path = '../input/catvsdogweights/best.hdf5'

#### Definition of the model architecture, compilation and weight loading

In [None]:
model = Sequential()

# 1st layer: ResNet50 base loaded with the weights from resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
# NOTE that this layer is set below as NOT TRAINABLE, i.e., it is used as is
model.add(ResNet50(include_top = False, pooling = RESNET50_POOLING_AVERAGE, weights = resnet_weights_path))

# 2nd layer as Dense for 2-class classification, i.e., dog or cat using SoftMax activation
model.add(Dense(NUM_CLASSES, activation = DENSE_LAYER_ACTIVATION))

# Say not to train first layer (ResNet) model as it is already trained
model.layers[0].trainable = False

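# Compilation (only needed for training/evaluation; predictions do not use the optimizer)
# `decay` is omitted because recent Keras optimizers no longer accept it
sgd = optimizers.SGD(learning_rate = 0.01, momentum = 0.9, nesterov = True)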
model.compile(optimizer = sgd, loss = OBJECTIVE_FUNCTION, metrics = LOSS_METRICS)

# Weights loading
model.load_weights(resnet_bestwgt_path)
model.summary()

#### Generator preparation

In [None]:
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_size = IMAGE_RESIZE

# preprocessing_function is applied to each image after resizing and augmentation (resize => augment => pre-process)
# For ResNet50, preprocess_input converts images from RGB to BGR and zero-centers each color channel
# with respect to the ImageNet dataset, without scaling (this is not batch normalization)
data_generator = ImageDataGenerator(preprocessing_function=preprocess_input)

test_generator = data_generator.flow_from_directory(
    directory = '../input/petfinder-pawpularity-score',
    target_size = (image_size, image_size),
    batch_size = BATCH_SIZE_TESTING,
    class_mode = None,
    shuffle = False,
    seed = 123
)
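
To make the preprocessing step concrete, here is a minimal NumPy sketch of what the ResNet50 `preprocess_input` does in its default mode: reorder the channels from RGB to BGR and subtract the per-channel ImageNet means, with no scaling. The function name `caffe_style_preprocess` is hypothetical, and the sketch is only an illustration, not the Keras implementation itself.

```python
import numpy as np

# ImageNet per-channel means in BGR order, as used by the Keras "caffe" preprocessing mode
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def caffe_style_preprocess(batch_rgb):
    """Hypothetical helper mimicking ResNet50 preprocessing on a (N, H, W, 3) RGB batch."""
    batch = np.asarray(batch_rgb, dtype=np.float32)
    batch_bgr = batch[..., ::-1]            # RGB -> BGR
    return batch_bgr - IMAGENET_BGR_MEANS   # zero-center each channel, no scaling

# A single pure-red pixel: after the swap, red ends up in the last channel
red_pixel = np.array([[[[255, 0, 0]]]], dtype=np.uint8)
processed = caffe_style_preprocess(red_pixel)
print(processed[0, 0, 0])
```

The output values are centered around zero rather than squeezed into [0, 1], which is what the pretrained ResNet50 weights expect.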


#### Predictions

In [None]:
# Reset before each call to predict
test_generator.reset()
# predict_generator is deprecated; model.predict accepts generators directly
pred = model.predict(test_generator, steps = len(test_generator), verbose = 1)
predicted_class_indices = np.argmax(pred, axis = 1)

#### Formatting labels

In [None]:
results_df = pd.DataFrame(
    {
        'id': pd.Series(test_generator.filenames), 
        'Dog': pd.Series(predicted_class_indices)
    })

results_df[['set', 'Id']] = results_df['id'].str.split('/', expand=True)
results_df = results_df.drop(['id'], axis=1)
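# Remove the literal ".jpg" suffix (rstrip('.jpg') would strip any trailing '.', 'j', 'p' or 'g' characters)
results_df['Id'] = results_df['Id'].str.replace(r'\.jpg$', '', regex=True)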
results_df.head()
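
When stripping the `.jpg` extension, note that `str.rstrip('.jpg')` removes any trailing run of the characters `.`, `j`, `p` and `g` rather than the literal suffix. It happens to be harmless for file names that end in hexadecimal characters, but a short sketch shows the difference. The helper `strip_jpg` and the sample names are hypothetical.

```python
import os

def strip_jpg(filename):
    """Hypothetical helper: drop the extension only, leaving the stem untouched."""
    stem, _ext = os.path.splitext(filename)
    return stem

names = ["0007de18844b0dbbb5e1f607da0606e0.jpg", "pug.jpg", "dog.jpg"]
for name in names:
    # rstrip treats its argument as a set of characters, not as a suffix
    print(name, "->", repr(name.rstrip(".jpg")), "vs", repr(strip_jpg(name)))
```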

#### Let's visualize some predicted labels

In [None]:
f, ax = plt.subplots(3, 3, figsize = (10, 10))

i = 0
for index, row in results_df[results_df['set']=='train'].sample(n=9).iterrows():
    imgBGR = cv2.imread(TRAIN_DIR + '/' + row['Id'] + '.jpg')
    imgRGB = cv2.cvtColor(imgBGR, cv2.COLOR_BGR2RGB)
    
    # a if condition else b
    predicted_class = "Dog" if row['Dog'] else "Cat"

    ax[i//3, i%3].imshow(imgRGB)
    ax[i//3, i%3].axis('off')
    ax[i//3, i%3].set_title("Predicted:{}".format(predicted_class))    
    i += 1
    
plt.show()

We now have to merge the predicted labels with the original metadata.

In [None]:
merged_train = pd.merge(train, results_df, on="Id")
merged_test = pd.merge(test, results_df, on="Id")

merged_train = merged_train.drop(['set'], axis=1)
merged_test = merged_test.drop(['set'], axis=1)

merged_train.head()
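
An inner merge silently drops any `Id` that is missing from one side. As a hedged sketch on toy data (the frames `meta` and `labels` are invented for illustration), `how="left"` with `indicator=True` makes missing labels visible, and `validate="one_to_one"` raises an error if an `Id` is duplicated.

```python
import pandas as pd

# Toy stand-ins for the metadata and the predicted labels
meta = pd.DataFrame({"Id": ["a1", "b2", "c3"], "Eyes": [1, 0, 1]})
labels = pd.DataFrame({"Id": ["a1", "b2"], "Dog": [1, 0]})

# Keep every metadata row and record where each row came from
checked = pd.merge(meta, labels, on="Id", how="left",
                   validate="one_to_one", indicator=True)

# Rows present in the metadata but without a predicted label
missing = checked[checked["_merge"] == "left_only"]
print(f"{len(missing)} row(s) without a predicted label")
```

Applied to `train` and `results_df`, an empty `missing` frame would indicate that every image received a label.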

## **3. Exploratory Data Analysis**

### 3.1 Distributions

In [None]:
statDf = pd.concat([merged_train[['Pawpularity']].describe(), 
           merged_train[merged_train['Dog']==1]['Pawpularity'].describe(),
           merged_train[merged_train['Dog']==0]['Pawpularity'].describe()], 
           axis=1)
statDf.columns = ['All', 'Dog', 'Cat']
statDf

We notice that cats score lower on average than dogs. The standard deviation is also lower for cats, so the two populations do show different score profiles.

In [None]:
import seaborn as sns
ax = sns.boxplot(x="Dog", y="Pawpularity", data=merged_train)

The analysis of dispersion around the median confirms different profiles for cats (left) and dogs (right). Although cats also achieve high scores, these form a longer tail of outliers than for dogs. The cat boxplot is also tighter around the median, showing some homogeneity in the scores.

In [None]:
f, ax = plt.subplots(1, 2, figsize = (15, 5))
ax[0].hist(merged_train[merged_train['Dog']==0]['Pawpularity'], color='blue')
ax[0].set_title('Cat scores distribution')
ax[1].hist(merged_train[merged_train['Dog']==1]['Pawpularity'], color='orange')
ax[1].set_title('Dog scores distribution')
plt.show()

The distributions confirm slightly more spread-out scores for dogs. Nevertheless, the two profiles are similar, with scores concentrated around the middle values.

### 3.2 Correlation between features

In [None]:
corrDF = merged_train.drop(['Id'], axis=1)
sns.heatmap(corrDF.corr());

Little can be learned from this correlation matrix.
Nevertheless, we can note several points:
- an animal whose face is visible usually also has its eyes visible (which seems logical)
- a collage is often accompanied by text
- occlusion and the presence of a human in the image are correlated (which is also logical)
- finally, no variable seems to have an obvious linear correlation with the target popularity score
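
To back the last point with numbers rather than colors, the target column of the correlation matrix can be read directly. This is a sketch on synthetic data (the frame `toy` is made up; in the notebook the same lines would apply to `corrDF`).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the metadata: two binary features and a 1..100 score
toy = pd.DataFrame({
    "Eyes": rng.integers(0, 2, 200),
    "Blur": rng.integers(0, 2, 200),
    "Pawpularity": rng.integers(1, 101, 200),
})

# Correlation of each feature with the target, strongest (in absolute value) first
target_corr = (toy.corr()["Pawpularity"]
               .drop("Pawpularity")
               .sort_values(key=abs, ascending=False))
print(target_corr)
```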

## **4. Regression**

We are now going to try a regression on the metadata with ElasticNet.

In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [None]:
# Let's rescale the target to [0, 1] (min-max normalization)
paw_min, paw_max = merged_train['Pawpularity'].min(), merged_train['Pawpularity'].max()
merged_train['Pawpularity'] = (merged_train['Pawpularity'] - paw_min) / (paw_max - paw_min)
merged_train.head()
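
Because the normalization overwrites the original column, the minimum and maximum have to be remembered in order to map predictions back later. A minimal sketch, with hypothetical helper names `scale_to_unit` and `unscale` and invented scores:

```python
import numpy as np

def scale_to_unit(values):
    """Return values rescaled to [0, 1] together with the (min, max) that were used."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), (lo, hi)

def unscale(scaled, bounds):
    """Invert scale_to_unit using the stored (min, max) bounds."""
    lo, hi = bounds
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

scores = np.array([1, 25, 63, 100])      # invented scores on the original scale
scaled, bounds = scale_to_unit(scores)
restored = unscale(scaled, bounds)
print(scaled, restored)
```

Keeping the bounds next to the scaled values avoids hard-coding the original range when predictions are converted back.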

In [None]:
x = merged_train.drop(['Id', 'Pawpularity'], axis=1)
y = merged_train['Pawpularity']

In [None]:
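# Note: test_size=0.001 holds out only about 0.1% of the rows, so the RMSE below is a rough indication only
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.001)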

In [None]:
reg1 = ElasticNet(alpha=0.7, l1_ratio=0.2)
reg1.fit(x_train, y_train)
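
With a target in [0, 1] and binary features, a penalty of `alpha=0.7` is strong. Under the objective documented for scikit-learn's `ElasticNet` (with an intercept), all coefficients are exactly zero once `alpha` reaches `max|Xc^T yc| / (n * l1_ratio)`, where `Xc` and `yc` are the centered features and target. The sketch below computes that threshold on synthetic data; the helper `elasticnet_alpha_max` is hypothetical. Running it on `x_train` and `y_train` would show whether `reg1` can do more than predict the mean.

```python
import numpy as np

def elasticnet_alpha_max(X, y, l1_ratio):
    """Smallest alpha for which every elastic-net coefficient is zero (intercept fitted)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    return np.abs(X_centered.T @ y_centered).max() / (len(y) * l1_ratio)

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(1000, 13))   # 13 binary features, like the metadata plus 'Dog'
y_demo = rng.uniform(0, 1, size=1000)          # synthetic target already scaled to [0, 1]

alpha_max = elasticnet_alpha_max(X_demo, y_demo, l1_ratio=0.2)
print(f"alpha_max = {alpha_max:.3f}")
```

If the value obtained on the real data is below 0.7, a smaller `alpha` would be needed for the metadata to influence the predictions at all.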

RMSE evaluation (computed on the normalized [0, 1] target)

In [None]:
y_pre = reg1.predict(x_test)
# Select the test features by name so they match the training columns exactly
test_pre = reg1.predict(merged_test[x.columns])
mse = mean_squared_error(y_test,y_pre)
rmse = np.sqrt(mse)
rmse
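
Since the target was rescaled to [0, 1], this RMSE is not directly comparable to scores on the original 1 to 100 scale. The sketch below uses made-up numbers to show the conversion (multiplying by the original range, assumed here to be 99) and a constant-mean baseline for reference.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true_scaled = np.array([0.10, 0.35, 0.40, 0.90])   # invented normalized targets
y_pred_scaled = np.array([0.30, 0.30, 0.45, 0.50])   # invented model predictions

rmse_scaled = rmse(y_true_scaled, y_pred_scaled)
rmse_original = rmse_scaled * (100 - 1)               # assumes the scores span 1..100

# Baseline: always predict the mean of the targets
rmse_baseline = rmse(y_true_scaled, np.full(len(y_true_scaled), y_true_scaled.mean()))
print(rmse_scaled, rmse_original, rmse_baseline)
```

A model whose RMSE is close to the baseline has learned little beyond the average score.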

We finally return the score to its original scale and then create the submission file.

In [None]:
# Invert the min-max scaling, assuming the original scores ranged from 1 to 100
merged_test['Pawpularity'] = (test_pre*(100-1))+1
# Keep only the columns expected in the submission file
submission = merged_test[['Id', 'Pawpularity']]
submission.to_csv('submission.csv', index=False)


## **5. Conclusion**

It is of course always possible to do better, but the metadata does not seem to be sufficiently discriminating to be used on its own in the prediction of the popularity score.