### Brief problem description

PetFinder.my uses a basic Cuteness Meter to rank pet photos. It analyzes picture composition and other factors, and compares them against the performance of thousands of pet profiles. While this basic tool is helpful, it is still at an experimental stage and the algorithm could be improved. Participants need to build an AI model from the provided data to help make the tool better.

**Task** 

The task is to predict engagement with a pet's profile (**Pawpularity**) based on the photograph for that profile.

**Data** 

The dataset for this competition comprises both images and tabular data (hand-labelled metadata for each photo).

The train set contains 9912 pet photos 

The test set contains 8 pet photos
> NOTE: The actual test data comprises about **6800** pet photos similar to the training set photos. 

#### **The goal of this notebook is to:** 

1. Understand the structure of the data (image and tabular).
2. Understand the relation between the image and tabular data.
3. Understand the impact of the image and tabular data on the Pawpularity score.



### Setting up the Notebook 

In [None]:
import numpy as np
import pandas as pd 

import matplotlib.pyplot as plt 
import seaborn as sns 

import os 
import cv2
import random 

import warnings
warnings.filterwarnings("ignore")

In [None]:
from IPython.display import display, HTML  # public import path; IPython.core.display is deprecated for this
display(HTML("<style>div.output_scroll { height: 70em; }</style>"))

In [None]:
sns.set(style= 'darkgrid', 
       color_codes=True,
       font = 'Arial',
       font_scale= 1.5,
       rc={'figure.figsize':(12,8)})

### A. Dataset structure and features provided in the train.csv file

In [None]:
os.listdir("../input/petfinder-pawpularity-score/")

In [None]:
len(os.listdir("../input/petfinder-pawpularity-score/train"))

In [None]:
len(os.listdir("../input/petfinder-pawpularity-score/test"))

In [None]:
train = pd.read_csv('../input/petfinder-pawpularity-score/train.csv')
test = pd.read_csv('../input/petfinder-pawpularity-score/test.csv')
ss = pd.read_csv('../input/petfinder-pawpularity-score/sample_submission.csv')

In [None]:
test.shape

In [None]:
train.shape

In [None]:
ss.shape

In [None]:
train.head()

#### A.1 Features in train and test data .csv files 

Each feature takes the value 1 (Yes) or 0 (No):

1. Focus - Pet stands out against uncluttered background, not too close / far.
2. Eyes - Both eyes are facing front or near-front, with at least 1 eye / pupil decently clear.
3. Face - Decently clear face, facing front or near-front.
4. Near - Single pet taking up significant portion of photo (roughly over 50% of photo width or height).
5. Action - Pet in the middle of an action (e.g., jumping).
6. Accessory - Accompanying physical or digital accessory / prop (i.e. toy, digital sticker), excluding collar and leash.
7. Group - More than 1 pet in the photo.
8. Collage - Digitally-retouched photo (i.e. with digital photo frame, combination of multiple photos).
9. Human - Human in the photo.
10. Occlusion - Specific undesirable objects blocking part of the pet (i.e. human, cage or fence). Note that not all blocking objects are considered occlusion.
11. Info - Custom-added text or labels (i.e. pet name, description).
12. Blur - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0.
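The two rules stated above (every feature is 0 or 1, and a Blur entry always has Eyes set to 0) can be checked directly. The following is a minimal sketch on a toy frame with made-up values; `check_metadata` and `toy` are hypothetical names introduced here, not part of the competition data or the notebook.

```python
import pandas as pd

# Toy stand-in for train.csv (made-up values, NOT real competition data).
toy = pd.DataFrame({
    "Eyes": [1, 0, 1, 0],
    "Face": [1, 0, 1, 1],
    "Blur": [0, 1, 0, 0],
})

def check_metadata(df, feature_cols):
    """Return (all_binary, blur_rule_holds) for the given metadata columns."""
    # Every listed feature should contain only 0 or 1.
    all_binary = bool(df[feature_cols].isin([0, 1]).all().all())
    # Rule from the feature description: Blur == 1 implies Eyes == 0.
    blur_rule_holds = bool((df.loc[df["Blur"] == 1, "Eyes"] == 0).all())
    return all_binary, blur_rule_holds

result = check_metadata(toy, ["Eyes", "Face", "Blur"])
print(result)
```

On the real data, the same call would presumably be `check_metadata(train, train.columns[1:-1])`, assuming the first column is the Id and the last is Pawpularity.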

In [None]:
ss.head()

 
### B. Relation between information given in train.csv and train images 

#### Here we take random train images and show the features whose value is 1 (Yes) for each of them.

In [None]:
 
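_, axs = plt.subplots(2, 2, figsize=(15, 12))

axs = axs.flatten()
col = train.columns.tolist()
feature_cols = col[1:-1]  # the 12 binary metadata columns (skip Id and Pawpularity)

for (idx, row), ax in zip(train.sample(4).iterrows(), axs):
    img = cv2.imread(f'../input/petfinder-pawpularity-score/train/{row["Id"]}.jpg')
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; matplotlib expects RGB
    img = cv2.resize(img, (600, 600))
    other_info = [c for c in feature_cols if row[c] == 1]  # features set to 1 for this photo
    ax.grid(False)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.imshow(img)
    # idx is the row index in train.csv, not the image Id
    ax.set_title(f'Row: {idx}, Pawpularity : {row["Pawpularity"]}, ' + ", ".join(other_info), fontsize=12, fontweight='bold')

plt.show()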


### C. Output/dependent variable, i.e. Pawpularity

In [None]:
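# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(train["Pawpularity"], kde=True)
plt.title("Distribution of Pawpularity")
plt.show()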

 
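The distribution of Pawpularity looks roughly normal (bell-shaped), which suggests the scores are spread across the range rather than piled up at one value. This is a visual impression only; skew and heavy tails are easy to miss in a smoothed plot.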
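A visual read of the shape can be backed up with a few summary statistics. The sketch below uses made-up scores, not the competition data, and `shape_summary` is a hypothetical helper. It compares the mean with the median and reports the sample skewness and excess kurtosis that pandas computes.

```python
import pandas as pd

def shape_summary(scores):
    """Summarize the shape of a score distribution with basic statistics."""
    s = pd.Series(scores, dtype="float64")
    return {
        "mean": s.mean(),
        "median": s.median(),
        "skew": s.skew(),             # > 0 means a longer right tail
        "excess_kurtosis": s.kurt(),  # 0 for a normal distribution
    }

# Made-up scores for illustration only.
toy_scores = [10, 20, 25, 30, 30, 35, 40, 60, 100, 100]
summary = shape_summary(toy_scores)
print(summary)
```

In the notebook this would presumably be called as `shape_summary(train["Pawpularity"])`. A mean well above the median, or a clearly positive skew, would argue against treating the target as normally distributed.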


#### What is the difference between low-Pawpularity and high-Pawpularity images?

In [None]:
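_, axs = plt.subplots(3, 4, figsize=(15, 15))

axs = axs.flatten()
col = train.columns.tolist()

# First 6 panels: very high scores (>= 95); last 6 panels: very low scores (<= 5).
# DataFrame.append was removed in pandas 2.0, so the two samples are joined with pd.concat.
extremes = pd.concat([
    train[train["Pawpularity"] >= 95].sample(6),
    train[train["Pawpularity"] <= 5].sample(6),
])

for (idx, row), ax in zip(extremes.iterrows(), axs):
    img = cv2.imread(f'../input/petfinder-pawpularity-score/train/{row["Id"]}.jpg')
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; matplotlib expects RGB
    img = cv2.resize(img, (600, 600))
    ax.grid(False)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.imshow(img)
    # idx is the row index in train.csv, not the image Id
    ax.set_title(f'Row: {idx}, Pawpularity : {row["Pawpularity"]}', fontsize=12, fontweight='bold')

plt.show()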

 
From the images above, I think it is difficult to say what the main reason for a high Pawpularity score is.

Let's look at the metadata (train.csv) to find the difference between high-Pawpularity and low-Pawpularity images.


In [None]:
high = train[train["Pawpularity"] > 80 ].sample(500)
low = train[train["Pawpularity"] < 20 ].sample(500)

In [None]:
high.shape == low.shape

In [None]:
def plot_counts(df):
    """Return {feature: number of rows where it is 1}, skipping Id and Pawpularity."""
    data = dict()
    for c in df.columns.tolist()[1:-1]:
        data[c] = df[c].sum()
    return data

In [None]:
plot_counts(high)

In [None]:
k = ["High Pawpularity", "Low Pawpularity"]
# One bar chart per group; each chart gets its own figure.
for D, label in zip([plot_counts(high), plot_counts(low)], k):
    plt.figure(figsize=(17, 6))
    plt.bar(range(len(D)), list(D.values()), align='center')
    plt.xticks(range(len(D)), list(D.keys()))
    plt.title("Count of individual features for " + label, fontsize=20, fontweight='bold')
    plt.show()


 
#### The feature counts are nearly the same for both groups; maybe the correlations can give us some insight
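Two bar charts side by side make small gaps hard to judge, so it may help to subtract the per-feature rates directly. This is a minimal sketch on made-up frames; `rate_difference`, `toy_high` and `toy_low` are hypothetical names, and the values are not taken from the competition data.

```python
import pandas as pd

def rate_difference(high_df, low_df, feature_cols):
    """Share of rows with each feature set in high_df minus the share in low_df."""
    # The mean of a 0/1 column is the fraction of rows where the feature is present.
    diff = high_df[feature_cols].mean() - low_df[feature_cols].mean()
    return diff.sort_values(ascending=False)

# Made-up groups for illustration only.
toy_high = pd.DataFrame({"Group": [1, 1, 0, 0], "Blur": [0, 0, 0, 1]})
toy_low = pd.DataFrame({"Group": [0, 1, 0, 0], "Blur": [1, 0, 1, 0]})

diff = rate_difference(toy_high, toy_low, ["Group", "Blur"])
print(diff)
```

With the notebook's samples this would presumably be `rate_difference(high, low, train.columns[1:-1])`. Values near zero for every feature would support the reading that the metadata barely separates the two groups.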


### D. Correlation

In [None]:
plt.figure(figsize= (15, 15))
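# Drop the string Id column first; recent pandas versions raise an error on non-numeric columns in corr()
sns.heatmap(train.drop(columns="Id").corr(), annot=True, fmt='.1g')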
plt.title('Correlation Matrix', fontweight='bold', fontsize=20)
plt.show()


The feature "Group" shows the highest positive correlation with the dependent variable "Pawpularity". Overall, though, the features in the dataset do not seem to tell us much about Pawpularity.
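Reading a single column off a large heatmap is error-prone, so the correlations with the target can also be listed and ranked. The sketch below works on a made-up frame; `rank_by_target_correlation` and `toy` are hypothetical names, and the numbers are not competition results.

```python
import pandas as pd

def rank_by_target_correlation(df, target):
    """Pearson correlation of each numeric column with `target`, strongest first."""
    # Keep numeric columns only, so a string Id column is ignored.
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    # Order by absolute strength but keep the sign in the returned values.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Id": ["a", "b", "c", "d"],
    "Group": [0, 0, 1, 1],
    "Blur": [1, 0, 1, 0],
    "Pawpularity": [10, 20, 30, 40],
})

ranked = rank_by_target_correlation(toy, "Pawpularity")
print(ranked)
```

In the notebook the call would presumably be `rank_by_target_correlation(train, "Pawpularity")`. Since the heatmap uses `fmt='.1g'`, its annotations are rounded to one significant digit, and the ranked list shows the unrounded values.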