**Exploratory Data Analysis of Metadata**

This dataset includes metadata in addition to the image data. Does the metadata have predictive value for the Pawpularity score? One simple way to investigate is to compare the distributions of Pawpularity conditioned on each value of each metadata feature. In this notebook, I show this simple data analysis.


In [None]:
# imports and load data set
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
datadir ='../input/petfinder-pawpularity-score/'
df_train=pd.read_csv(datadir+'train.csv')

In [None]:
df_train

Each metadata feature is binary. We can first look at the value counts for each metadata feature. If nearly all of the values were 0 or nearly all were 1 for a particular feature, then it would be unlikely that that feature would be helpful.

In [None]:
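# metadata columns: everything except the first (Id) and last (Pawpularity) column
metadata_names=df_train.columns[1:-1]
N=df_train.shape[0]
for m in metadata_names:
    vc=df_train[m].value_counts()
    # .get avoids a KeyError if a feature never takes one of the two values
    n0,n1=vc.get(0,0),vc.get(1,0)
    print("%15s : %5s 0's (%4.1f%%), %5s 1's (%4.1f%%) "%(m,n0,100*n0/N,n1,100*n1/N))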

It looks like both "Subject Focus" and "Action" are nearly always 0.
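Because each feature is binary, the column mean is simply the share of 1's, which gives a compact way to flag heavily imbalanced features. The sketch below uses a tiny synthetic table as a stand-in for the metadata columns; the toy values and the `threshold` cutoff are assumptions chosen for illustration, not properties of the real data.

```python
import pandas as pd

# Synthetic stand-in for the metadata columns (NOT the competition data)
toy = pd.DataFrame({
    "Subject Focus": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "Eyes":          [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
})

# For 0/1 columns, the mean is the fraction of rows equal to 1
frac_ones = toy.mean()

# Share of the rarer value in each column
minority_share = frac_ones.where(frac_ones < 0.5, 1 - frac_ones)

threshold = 0.15  # hypothetical cutoff for "nearly constant"
imbalanced = minority_share[minority_share < threshold].index.tolist()
print(imbalanced)
```

On the real data, `toy` would be replaced by `df_train[metadata_names]`.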

Let us now look at the conditional distributions, plotted as kernel density estimates rather than binned histograms. I also compute the conditional means, and the difference of the means, and put them in the figure title for convenience.

In [None]:
mean0={}
mean1={}
sigma0={}
sigma1={}
diff={}
pawpularity=df_train['Pawpularity']
for m in metadata_names:
    plt.figure()
    # common_norm=False normalizes each conditional density separately,
    # so the rarer value is not flattened by the more common one
    sns.kdeplot(data=df_train,x='Pawpularity',hue=m,common_norm=False)
    # Pawpularity scores split by the value of the feature
    paw0=pawpularity[df_train[m]==0]
    paw1=pawpularity[df_train[m]==1]
    mean0[m]=paw0.mean()
    mean1[m]=paw1.mean()
    sigma0[m]=paw0.std()
    sigma1[m]=paw1.std()
    diff[m]=mean1[m]-mean0[m]
    plt.title('mean0 : %5.3f, mean1 : %5.3f, difference %5.3f'%(mean0[m],
                                                                mean1[m],
                                                                diff[m]))

Sadly, it appears that conditioning on the metadata features does not change the distributions much.
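One way to put a number on "not large" is to scale each difference of means by the pooled standard deviation of the two groups (Cohen's d). The sketch below is self-contained and runs on a small synthetic table in which the feature is independent of the score; the helper name `standardized_diff` and the toy data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def standardized_diff(df, feature, target="Pawpularity"):
    """Mean(target | feature=1) minus mean(target | feature=0),
    divided by the pooled standard deviation (Cohen's d)."""
    g0 = df.loc[df[feature] == 0, target]
    g1 = df.loc[df[feature] == 1, target]
    n0, n1 = len(g0), len(g1)
    # Pooled variance weights each group's sample variance by its degrees of freedom
    pooled_var = ((n0 - 1) * g0.var() + (n1 - 1) * g1.var()) / (n0 + n1 - 2)
    return (g1.mean() - g0.mean()) / np.sqrt(pooled_var)

# Synthetic stand-in for df_train (NOT the competition data):
# the feature is drawn independently of the score
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Blur": rng.integers(0, 2, size=1000),
    "Pawpularity": rng.normal(38, 20, size=1000),
})
d_blur = standardized_diff(toy, "Blur")
print(f"standardized difference for Blur on toy data: {d_blur:.3f}")
```

By the usual rule of thumb, values of |d| around 0.2 are already considered small, so applying the same helper to `df_train` for each metadata feature would give a scale-free reading of the differences shown in the figure titles.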

By sorting the metadata features in descending order of the magnitude of the difference, we can see which features give the largest changes in the mean of the distribution.

In [None]:
sortkey=lambda m: abs(diff[m])
sorted_metadata_names=sorted(list(metadata_names),key=sortkey,reverse=True)

print('%5s %15s %7s'%('rank','feature','diff'))
for n,m in enumerate(sorted_metadata_names):
    print('%5d %15s %7.3f'%(n+1,m,diff[m]))

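This shows that "Blur" is associated with the largest change in the mean of the distribution. The difference is negative, meaning that Blur=1 images have lower Pawpularity on average than Blur=0 images. This makes sense. However, the effect here does seem pretty small, and a difference in conditional means does not by itself establish that blur causes the lower score.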

Based on this, the "Blur", "Accessory" and "Group" metadata features would appear to be the most useful for predicting the Pawpularity score. However, as the differences between the conditional means are small, it seems that including this metadata in the prediction model would have a modest benefit at best. I removed "Subject Focus" from this list as nearly all (97.2%) of the values of "Subject Focus" were 0.
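A direct way to test the "modest benefit" expectation is to compare a linear fit on the metadata against a constant prediction on held-out rows. The sketch below does this with plain NumPy least squares on synthetic data with deliberately weak effects; the helper `holdout_rmse`, the split fraction, and the toy coefficients are all assumptions for illustration, and the printed numbers say nothing about the real competition data.

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def holdout_rmse(df, features, target="Pawpularity", test_frac=0.2, seed=0):
    """Return (baseline_rmse, linear_rmse) on a random hold-out split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    n_test = int(len(df) * test_frac)
    test, train = df.iloc[idx[:n_test]], df.iloc[idx[n_test:]]
    # Baseline: always predict the training mean
    base = rmse(test[target], np.full(n_test, train[target].mean()))
    # Ordinary least squares with an explicit intercept column
    X_train = np.column_stack([np.ones(len(train)), train[features].to_numpy(dtype=float)])
    X_test = np.column_stack([np.ones(len(test)), test[features].to_numpy(dtype=float)])
    coef, *_ = np.linalg.lstsq(X_train, train[target].to_numpy(dtype=float), rcond=None)
    return base, rmse(test[target], X_test @ coef)

# Synthetic stand-in for df_train (NOT the competition data): weak effects, large noise
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Blur": rng.integers(0, 2, size=2000),
    "Group": rng.integers(0, 2, size=2000),
})
toy["Pawpularity"] = 38 - 2 * toy["Blur"] + toy["Group"] + rng.normal(0, 20, size=2000)

base_rmse, lin_rmse = holdout_rmse(toy, ["Blur", "Group"])
print(f"baseline RMSE: {base_rmse:.2f}, linear RMSE: {lin_rmse:.2f}")
```

With the real data, passing `df_train` and `list(metadata_names)` to the same helper would show how much, if at all, the metadata improves on the constant baseline.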