## Pawpularity - What is up with all the 100s? 

Here's a notebook for anyone curious about which pictures were given a Pawpularity score of 100.

They're also just nice eye candy :)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from fastai.vision.all import *

In [None]:
another_df = pd.read_csv('../input/petfinder-pawpularity-score/train.csv')
another_df

In [None]:
another_df['path'] = another_df['Id'].apply(lambda x: '../input/petfinder2-cropped-dataset/crop/' + x + '.jpg')
another_df

In [None]:
my_df = pd.read_csv('../input/oof-and-train/train_with_oof.csv')
my_df.head(3)

## OOF vs Pawpularity Viz

Notice the huge difference between the out-of-fold (OOF) predictions and the actual Pawpularity labels.

In [None]:
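# Rescale the OOF predictions to the 0-100 Pawpularity range
# (this presumably means they were stored on a 0-1 scale)
my_df['oof'] = my_df['oof'] * 100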

In [None]:
# Credit for oof vs pawpularity viz: https://www.kaggle.com/joatom/petfinder2021-fastai-train
# Ideally, we want something closer to a diagonal line here (predictions align with labels)
my_df[['oof','Pawpularity']].plot.scatter('oof','Pawpularity')

In [None]:
# The labels show a sudden large spike at exactly 100
# The predictions cluster around the mean; the model rarely commits to extreme scores
my_df[['oof','Pawpularity']].hist()
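
To put a number on the gap seen in the plots, one option is to compute the RMSE between `oof` and `Pawpularity`, both overall and on the 100-score rows only, and compare it with a constant "predict the mean" baseline. This is only a sketch: `toy_df` and its values are made up to stand in for `my_df`, and `rmse` is a helper defined here, not something from the original notebook.

```python
import numpy as np
import pandas as pd

def rmse(pred, target):
    """Root mean squared error between two equally sized arrays."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Hypothetical toy frame standing in for my_df (values are made up)
toy_df = pd.DataFrame({
    'oof':         [35.0, 40.0, 38.0, 45.0],
    'Pawpularity': [20,   40,   30,   100],
})

# Error of the (toy) OOF predictions over all rows
model_rmse = rmse(toy_df['oof'], toy_df['Pawpularity'])

# Baseline: predict the mean label for every row
mean_pred = np.full(len(toy_df), toy_df['Pawpularity'].mean())
baseline_rmse = rmse(mean_pred, toy_df['Pawpularity'])

# Error restricted to the rows labeled 100
top = toy_df[toy_df['Pawpularity'] == 100]
top_rmse = rmse(top['oof'], top['Pawpularity'])

print(model_rmse, baseline_rmse, top_rmse)
```

On the real data, swapping `toy_df` for `my_df` would show how much of the total error comes from the 100-score rows.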

## Miscellaneous stats on the 100 scores

In [None]:
# There are 288 pictures with Pawpularity score of 100
best_score_indexes = my_df.index[my_df['Pawpularity'] == 100].to_list()
len(best_score_indexes)

In [None]:
# best_score_indexes holds index labels, so select with .loc rather than positional .iloc
best_df = my_df.loc[best_score_indexes]
best_df.head(3)
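
A side note on row selection: index labels and row positions only coincide when the frame has a default `RangeIndex`. The sketch below uses a made-up `toy_df` with a non-default index to show that selecting by label (`.loc`) or by boolean mask gives the same rows, while positional `.iloc` with those labels would fail. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a non-default index (e.g. left over from earlier filtering)
toy_df = pd.DataFrame(
    {'Pawpularity': [100, 30, 100, 45]},
    index=[10, 11, 12, 13],
)

# Index labels of the rows with a score of 100
labels = toy_df.index[toy_df['Pawpularity'] == 100].to_list()

# Label-based selection and a plain boolean mask pick the same rows
by_label = toy_df.loc[labels]
by_mask = toy_df[toy_df['Pawpularity'] == 100]

# Positional selection with the same values fails: 10 and 12 are not valid positions
try:
    toy_df.iloc[labels]
    iloc_failed = False
except IndexError:
    iloc_failed = True

print(by_label.equals(by_mask), iloc_failed)
```

The boolean mask is the shortest form and skips the intermediate list of labels entirely.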

In [None]:
# A custom breed column created with yolov5. Credit: https://www.kaggle.com/eduardofv/cat-or-dog-petfinder-pawpularity-competition
# Dogs appear more often than cats among the 100-score pictures (raw counts)
best_df['Breed'].map({0:'Dog', 1:'Cat', 2:'Neither'}).value_counts()
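
Raw counts depend on how many dogs and cats are in the dataset to begin with, so it can help to look at the share of 100 scores within each breed as well. Below is a minimal sketch on a made-up `toy_df`; the `Breed` coding (0 = Dog, 1 = Cat, 2 = Neither) follows the cell above, but the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for my_df (rows are made up)
toy_df = pd.DataFrame({
    'Breed':       [0,   0,   0,  0,  1,   1,  2,  2],
    'Pawpularity': [100, 100, 30, 40, 100, 50, 20, 35],
})

breed = toy_df['Breed'].map({0: 'Dog', 1: 'Cat', 2: 'Neither'})
is_top = toy_df['Pawpularity'] == 100

# Per breed: number of 100 scores, number of pictures, and the share of 100 scores
summary = is_top.groupby(breed).agg(n_top='sum', n_total='count', top_rate='mean')
print(summary)
```

In this toy data dogs have twice as many 100 scores as cats, yet the rate within each breed is the same, which is why the rate is worth checking before drawing conclusions from the counts.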

## Show us the pictures

In [None]:
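# Pictures where the detector found neither a dog nor a cat (Breed == 2)
unknown_breed_indexes = my_df.index[my_df['Breed'] == 2].to_list()
# These are index labels, so select with .loc rather than positional .iloc
test_df = my_df.loc[unknown_breed_indexes]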
test_df

In [None]:
dls = ImageDataLoaders.from_df(another_df,
                              valid_pct=0.2,
                              seed=999,
                              fn_col='path',
                              label_col='Pawpularity',
                              bs=10,
                              item_tfms=Resize(224)
                              )
dls.show_batch()

In [None]:
dls = ImageDataLoaders.from_df(test_df,
                              valid_pct=0.2,
                              seed=999,
                              fn_col='path',
                              label_col='Pawpularity',
                              bs=5,
                              item_tfms=Resize(224),
                              # batch_tfms is the documented from_df argument for batch-level augmentations
                              batch_tfms=setup_aug_tfms([Dihedral(p=1)])
                              )
dls.show_batch()

In [None]:
# Based on the fastai starter implementation: https://www.kaggle.com/tanlikesmath/petfinder-pawpularity-eda-fastai-starter
dls = ImageDataLoaders.from_df(best_df,
                              valid_pct=0.2,
                              seed=999,
                              fn_col='path',
                              label_col='Pawpularity',
                              item_tfms=Resize(224)
                              )

In [None]:
# Rerun this cell to show a new batch of cute pets :3
# My takeaway: close-ups with the pet's eyes clearly visible tend to get better scores;
# better resizing methods might also help the model make better predictions
dls.show_batch()

Please leave a comment with your own takeaways or any feedback. I'm fairly new to making EDA notebooks :)