# PetFinder.my - Pawpularity Contest
## Predict the popularity of shelter pet photos

This is the English version. The Japanese version is here: https://www.kaggle.com/chumajin/petfinder-eda-lgbm-for-starter-version


--------------------------------------------
As the subtitle suggests, the goal of this competition is to predict **Pawpularity**, the popularity (attractiveness) of pet photos.

Good predictions can be used to automatically improve the quality of pet profiles and photos,

so that stray dogs and cats can find "enthusiastic" homes much sooner!

--------------------------------------------

If you find this notebook useful, I would be grateful if you could **upvote** it.

Thank you to everyone who always upvotes!


In [None]:
import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as plt
import cv2

# 1. What to predict? (See sample_submission.csv)

In [None]:
sample = pd.read_csv("../input/petfinder-pawpularity-score/sample_submission.csv")
sample

Predict the Pawpularity of the photo identified by each Id.

Pawpularity is an index of the popularity (attractiveness) of pets.

**How Pawpularity Score Is Derived**

* The Pawpularity Score is derived from each pet profile's page view statistics at the listing pages, using an algorithm that normalizes the traffic data across different pages, platforms (web & mobile) and various metrics.
* Duplicate clicks, crawler bot accesses and sponsored profiles are excluded from the analysis.

# 2. What data do you predict from? (See test.csv)

In [None]:
test = pd.read_csv("../input/petfinder-pawpularity-score/test.csv")
test

* Id - filename
* Subject Focus - Pet stands out against uncluttered background, not too close / far.
* Eyes - Both eyes are facing front or near-front, with at least 1 eye / pupil decently clear.
* Face - Decently clear face, facing front or near-front.
* Near - Single pet taking up significant portion of photo (roughly over 50% of photo width or height).
* Action - Pet in the middle of an action (e.g., jumping).
* Accessory - Accompanying physical or digital accessory / prop (i.e. toy, digital sticker), excluding collar and leash.
* Group - More than 1 pet in the photo.
* Collage - Digitally-retouched photo (i.e. with digital photo frame, combination of multiple photos).
* Human - Human in the photo.
* Occlusion - Specific undesirable objects blocking part of the pet (i.e. human, cage or fence). Note that not all blocking objects are considered occlusion.
* Info - Custom-added text or labels (i.e. pet name, description).
* Blur - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0.
* Pawpularity - an index of the popularity (attractiveness) of pets. This is the target, so it appears in train.csv only.
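
The Blur rule in the list above ("Eyes" is always 0 for Blur entries) is easy to verify on the data. Here is a minimal sketch; `check_blur_implies_no_eyes` is a hypothetical helper name, and the small DataFrame is made-up toy data standing in for the real CSV.

```python
import pandas as pd


def check_blur_implies_no_eyes(df: pd.DataFrame) -> bool:
    """Return True if every row with Blur == 1 has Eyes == 0."""
    blurred = df[df["Blur"] == 1]
    return bool((blurred["Eyes"] == 0).all())


# Toy rows (not real competition data) that follow the documented rule.
toy = pd.DataFrame({"Eyes": [1, 0, 0, 1], "Blur": [0, 1, 0, 0]})
print(check_blur_implies_no_eyes(toy))
```

Once the CSV has been read into a DataFrame, the same function can be called on it directly.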

**[Attention] These labels are not used to derive the Pawpularity score. Details below.**

**Purpose of Photo Metadata**
* We have included optional Photo Metadata, manually labeling each photo for key visual quality and composition parameters.
* These labels are not used for deriving our Pawpularity score, but it may be beneficial for better understanding the content and co-relating them to a photo's attractiveness. Our end goal is to deploy AI solutions that can generate intelligent recommendations (i.e. show a closer frontal pet face, add accessories, increase subject focus, etc) and automatic enhancements (i.e. brightness, contrast) on the photos, so we are hoping to have predictions that are more easily interpretable.
* You may use these labels as you see fit, and optionally build an intermediate / supplementary model to predict the labels from the photos. If your supplementary model is good, we may integrate it into our AI tools as well.
* In our production system, new photos that are dynamically scored will not contain any photo labels. If the Pawpularity prediction model requires photo label scores, we will use an intermediary model to derive such parameters, before feeding them to the final model.

**Therefore, we need to keep in mind that the Pawpularity score is derived from traffic to each pet's profile page (which presumably includes the photo), not from these labels.**

# 3. Take a look at the contents of train.csv

In [None]:
train = pd.read_csv("../input/petfinder-pawpularity-score/train.csv")
train

In [None]:
train_path = "../input/petfinder-pawpularity-score/train"

## Take a look at the first Id

In [None]:
path = os.path.join(train_path,train["Id"].iloc[0]+".jpg")

In [None]:
img = cv2.imread(path)
img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
plt.imshow(img)

So pretty... Pawpularity = 63



# 3.1 Display the English description of each column

In [None]:
train.iloc[0]

In [None]:
explain_dict ={  
"Id":"filename",
"Subject Focus":"Pet stands out against uncluttered background, not too close / far",
"Eyes":"Both eyes are facing front or near-front, with at least 1 eye / pupil decently clear",
"Face":"Decently clear face, facing front or near-front",
"Near":"Single pet taking up significant portion of photo (roughly over 50% of photo width or height)",
"Action":"Pet in the middle of an action (e.g., jumping)",
"Accessory":"Accompanying physical or digital accessory / prop (i.e. toy, digital sticker), excluding collar and leash",
"Group":"More than 1 pet in the photo",
"Collage":"Digitally-retouched photo (i.e. with digital photo frame, combination of multiple photos)",
"Human":"Human in the photo",
"Occlusion":"Specific undesirable objects blocking part of the pet (i.e. human, cage or fence). Note that not all blocking objects are considered occlusion",
"Info":"Custom-added text or labels (i.e. pet name, description)",
"Blur":"Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0",
"Pawpularity":"an index of the popularity (attractiveness) of pets",
}

In [None]:
train_eng = train.copy()

In [None]:
train_eng.columns = train.columns.map(explain_dict)

In [None]:
train_eng.head(3)

The first Id:

In [None]:
tmpdf = train_eng[train_eng.index==0].T

In [None]:
tmpdf

In [None]:
img = cv2.imread(path)
img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
plt.imshow(img)

## Items with value 1 in train.csv

In [None]:
for a in tmpdf[tmpdf[0]==True].index:
    print("・" + a)

## Items with value 0 in train.csv

In [None]:
for a in tmpdf[tmpdf[0]==False].index:
    print("・" + a)

### Put the above into a function ###

First combine the steps so far into a single cell, then turn the row index (0 here) into a function argument.

In [None]:
path = os.path.join(train_path,train["Id"].iloc[0]+".jpg")

plt.figure()

img = cv2.imread(path)
img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
plt.imshow(img)

plt.show()

tmpdf = train_eng[train_eng.index==0].T

print("--------Pawpularity------------")


print(tmpdf[0].iloc[-1])

print("--------item Number 1------------")

for a in tmpdf[tmpdf[0]==True].index:
    print("・ " + a)


print("")



print("--------item Number 0------------")

for a in tmpdf[tmpdf[0]==False].index:
    print("・ " + a)



In [None]:
def showimg(id):
    
    plt.figure()
    path = os.path.join(train_path,train["Id"].iloc[id]+".jpg")

    img = cv2.imread(path)
    img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
    plt.imshow(img)
    
    plt.show()

    tmpdf = train_eng[train_eng.index==id].T
    
    print("--------Pawpularity------------")

    print(tmpdf[id].iloc[-1])

    print("--------item Number 1------------")

    for a in tmpdf[tmpdf[id]==True].index:
        
        if a == "Pawpularity":
            continue
        
        print("・ " + a)


    print("")



    print("--------item Number 0------------")

    for a in tmpdf[tmpdf[id]==False].index:
        
        if a == "Pawpularity":
            continue
        print("・ " + a)



In [None]:
showimg(1)

In [None]:
showimg(2)

## Take a look at Pawpularity = 100

In [None]:
train[train["Pawpularity"]==100]

In [None]:
showimg(19)

In [None]:
showimg(50)


A photo can score 100 even when two animals are shown.

Satisfying a particular item does not by itself mean the photo will get 100 points.
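
One way to check this more systematically is to compare the mean Pawpularity of photos with and without each flag. This is a minimal sketch, assuming a DataFrame with 0/1 flag columns and a `Pawpularity` column; `mean_score_by_flag` is a hypothetical helper and the toy data is made up.

```python
import pandas as pd


def mean_score_by_flag(df: pd.DataFrame, flags: list, target: str = "Pawpularity") -> pd.DataFrame:
    """For each 0/1 flag column, report the mean target where the flag is 0 and where it is 1."""
    rows = {}
    for flag in flags:
        means = df.groupby(flag)[target].mean()
        # .get returns None when a flag value never occurs in the data.
        rows[flag] = {"mean_if_0": means.get(0), "mean_if_1": means.get(1)}
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy data (not real competition data).
toy = pd.DataFrame({
    "Group": [0, 0, 1, 1],
    "Human": [0, 1, 0, 1],
    "Pawpularity": [20, 40, 60, 100],
})
summary = mean_score_by_flag(toy, ["Group", "Human"])
print(summary)
```

With this notebook's data it could be called as `mean_score_by_flag(train, train.columns.to_list()[1:-1])`. A small gap between the two means would suggest that the flag alone says little about the score.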

## There is no Pawpularity 0, so take a look at 1.

In [None]:
train[train["Pawpularity"]==1]

In [None]:
showimg(2442)

In [None]:
showimg(3232)

In [None]:
showimg(4235)

## Let's look at one example for every 10 points of Pawpularity.

In [None]:
tmpdf3 = train.groupby("Pawpularity").head(1).sort_values("Pawpularity").reset_index()
tmpdf3

In [None]:
tmpdf4 = tmpdf3.iloc[::10,:]
tmpdf4

In [None]:
for a in tmpdf4["index"]:
    showimg(a)
    print("")
    print("#################################################")

# 4. Try a submission with LGBM

In [None]:
train

In [None]:
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# 4.1 Kfold

In [None]:
folds = train.copy()
Fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for n, (train_index, val_index) in enumerate(Fold.split(folds, folds["Pawpularity"])):
    folds.loc[val_index, 'fold'] = int(n)
folds['fold'] = folds['fold'].astype(int)
print(folds.groupby(['fold', "Pawpularity"]).size())
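
Stratifying directly on Pawpularity treats every distinct score as its own class, and scikit-learn warns when a class has fewer members than `n_splits`. A commonly used alternative is to stratify on coarse bins of the target instead. The sketch below shows that alternative, which is not what this notebook does; `add_binned_folds`, `n_bins`, and the random toy scores are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def add_binned_folds(df, target="Pawpularity", n_splits=5, n_bins=10, seed=42):
    """Assign a 'fold' column by stratifying on equal-width bins of the target."""
    out = df.reset_index(drop=True).copy()
    # pd.cut with labels=False returns the integer bin index of each row.
    bins = pd.cut(out[target], bins=n_bins, labels=False)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    out["fold"] = -1
    for n, (_, val_index) in enumerate(skf.split(out, bins)):
        out.loc[val_index, "fold"] = n
    return out


# Toy scores in 1..100 (random, not real competition data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Pawpularity": rng.integers(1, 101, size=200)})
toy_folds = add_binned_folds(toy)
print(toy_folds["fold"].value_counts().sort_index())
```

Whether binning gives a better validation score than stratifying on the raw values is something to test, not something this sketch shows.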

# 4.2 Main

In [None]:
import lightgbm as lgb

In [None]:
import random

def fix_seed(seed):
    # random
    random.seed(seed)
    # Numpy
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

SEED = 42
fix_seed(SEED)

In [None]:
features = train.columns.to_list()[1:-1]
features

In [None]:
target = "Pawpularity"

In [None]:
# example of parameters
lgbm_params = {
    'objective': 'rmse', # regression with L2 loss ('rmse' is an alias of 'regression')
    'seed': 42, # random seed: fixing this makes the results reproducible
    'metric': 'rmse', 
    'learning_rate': 0.01,
    'max_bin': 800, # maximum number of bins used to bucket feature values
    'num_leaves': 80, # maximum number of leaves per tree
    "verbose":-1
}

In [None]:
from sklearn.metrics import mean_squared_error

def rmsescore(y_true,y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
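
A useful reference point for the scores below is a constant prediction: predicting the mean of the target for every row gives an RMSE equal to the target's population standard deviation. This is a minimal sketch with made-up numbers; `rmse` here is a NumPy-only stand-in for `rmsescore` above.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, written with NumPy only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up target values (not real competition data).
y = np.array([10.0, 30.0, 50.0, 70.0])
baseline = np.full_like(y, y.mean())  # predict the mean everywhere

print(rmse(y, baseline))  # same value as np.std(y)
print(np.std(y))
```

A model whose validation RMSE is close to the standard deviation of Pawpularity is doing little better than always guessing the mean.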

In [None]:
test

In [None]:
scores = []
allpreds = []

allvaliddf = pd.DataFrame()


for fold in range(5):
    
    fix_seed(SEED) # for repeatability

    p_train = folds[folds["fold"] != fold]
    p_val = folds[folds["fold"] == fold]

    p_train = p_train.reset_index(drop=True)
    p_val = p_val.reset_index(drop=True)

    lgb_train = lgb.Dataset(p_train[features], p_train[target])
    lgb_eval = lgb.Dataset(p_val[features], p_val[target])



    model = lgb.train(lgbm_params, lgb_train, valid_sets=[lgb_eval],
                      num_boost_round=1000,  # maximum number of boosting iterations
                      # LightGBM 4.0 removed the verbose_eval / early_stopping_rounds
                      # arguments of lgb.train, so the equivalent callbacks are used.
                      callbacks=[
                          lgb.log_evaluation(period=50),  # print the validation metric every 50 iterations
                          lgb.early_stopping(stopping_rounds=100),  # stop after 100 rounds without improvement
                      ],
                     )

    import pickle

    model_name = f"LGBMmodel{fold}.bin"

    # saving model (the with-block closes the file handle afterwards)
    with open(model_name, 'wb') as f:
        pickle.dump(model, f)

    # loading model
    with open(model_name, 'rb') as f:
        model = pickle.load(f)

    # predicting validation value
    oof_pred = model.predict(p_val[features])


    scores.append(rmsescore(p_val[target],oof_pred))

    # predicting for test_X
    preds = model.predict(test[features])

    #preds2 = np.where(preds>=0.5,1,0)
    
    allpreds.append(preds)
    
    # out of fold : oof
    p_val["preds"] = oof_pred
    
    allvaliddf = pd.concat([allvaliddf,p_val])

In [None]:
scores

#### mean score

In [None]:
np.mean(scores)

#### out of fold score

In [None]:
rmsescore(allvaliddf[target],allvaliddf["preds"])
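
The two numbers above are generally not identical: the mean score averages the per-fold RMSEs, while the out-of-fold score is a single RMSE over all validation rows pooled together. When the folds are the same size, the pooled value is never smaller than the mean of the fold values, because the square root is concave. A small sketch with made-up residuals:

```python
import numpy as np

# Made-up residuals (y_true - y_pred) for two equally sized folds, for illustration only.
fold_errors = [
    np.array([1.0, -1.0, 1.0, -1.0]),  # fold 0: RMSE = 1
    np.array([3.0, -3.0, 3.0, -3.0]),  # fold 1: RMSE = 3
]

# Mean of the per-fold RMSEs.
fold_rmse = [float(np.sqrt(np.mean(e ** 2))) for e in fold_errors]
mean_of_folds = float(np.mean(fold_rmse))

# Single RMSE over all residuals pooled together (the out-of-fold score).
pooled = np.concatenate(fold_errors)
oof_rmse = float(np.sqrt(np.mean(pooled ** 2)))

print(mean_of_folds)  # 2.0
print(oof_rmse)       # sqrt(5), about 2.236
```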

## 4.3 Create the submission file

In [None]:
allpreds = np.mean(allpreds,axis=0)
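
The training scores seen earlier lie between 1 and 100, so one optional safeguard is to clip the averaged predictions to that range before writing them out. This step is an assumption on my part and is not used in this notebook; the prediction values below are made up.

```python
import numpy as np

# Made-up fold-averaged predictions (illustration only).
preds = np.array([-3.2, 37.5, 101.4])

# Keep predictions inside the score range observed in train.csv (1 to 100).
clipped = np.clip(preds, 1, 100)
print(clipped)
```

For a tree model trained on targets in this range the clip will usually change nothing, so it is a cheap check, not a source of improvement.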

In [None]:
sample["Pawpularity"] = allpreds

In [None]:
sample.to_csv("submission.csv",index=False)

In [None]:
sample


# Thank you for reading this far.
I would be grateful if you could **upvote**!