### Chatting 
This is my first project on Kaggle, working with some friends who are also interested in deep learning.  I want to record the problems, thoughts, and lessons learned from this project in this notebook.  Now that I'm writing it, I realize it takes effort to present a notebook well.  

Anyway, hope you have fun on this project as well!   

# Introduction 

### The goal of the project is to predict the image score "Pawpularity" from the data, which includes the feature catalog (metadata) and the image itself.  Here is what we'll do: 

### 1. Explore the metadata : 
(1) read the csv table   
(2) take a look at the images  
(3) try several ML methods using the table alone

### 2. A two-stage fine tuning regressor using pretrained model (ResNet-18) 
(1) Train ResNet18 to predict the MetaData features from the images 

(2) Transfer the above network and train it as a regressor for Pawpularity 

### 3. Discussion 
(1) Thoughts about our model  
(2) What else can we do 

# 1. Explore the metadata 


In [None]:
# Packages I'm always using 
import numpy as np 
import pandas as pd 
import glob 
import matplotlib.pyplot as plt 
import matplotlib.image as img 
import time

# sklearn tools 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.experimental import enable_hist_gradient_boosting

# function
def rms(y1, y2):
    return np.sqrt(np.sum( (y1-y2)**2)/len(y1))


### Plot style   (you can use seaborn  or some other plotting tool)
def plotstyle():
    plt.xlabel('xlabel',  fontsize = 20)
    plt.ylabel('ylabel',  fontsize = 20)

    plt.xticks(size=18)
    plt.yticks(size=18)
    
    
def axstyle(ax):
    ax.set_xlabel('xlabel',  fontsize = 20)
    ax.set_ylabel('ylabel',  fontsize = 20)

    ax.tick_params(axis='x', labelsize=20)
    ax.tick_params(axis='y', labelsize=20)

### There are ~9900 rows in the training data catalog.  Each row has 12 boolean features (1 or 0) and the score, from 0 to 100. 

### In the testing data we have only the features. 

In [None]:
# Take a look 
path = '/kaggle/input/petfinder-pawpularity-score/'
train_dat = pd.read_csv(path+'train.csv')
test_dat = pd.read_csv(path+'test.csv')

print(f" training data size =  {len(train_dat)}" )

train_dat.head(3) 

In [None]:
test_dat.head(3) 

### Let's take a look at the images 

As you can see, the training data are pictures of dogs (or cats), while the testing data are some kind of noise.  That's not a problem as long as our model works on the training data, which carries the real meaning.   

You can briefly go through a few more images to see how accurately the metadata catalog describes the corresponding images.  I'd say they look fine.   

In [None]:
## Getting the image using matplotlib.image.imread 
t1 = path + 'train/' + train_dat.Id[0] +'*'
f1 = glob.glob(t1)[0]
im1 = img.imread(f1) 

t2 = path + 'test/' + test_dat.Id[0] +'*'
f2 = glob.glob(t2)[0]
im2 = img.imread(f2) 

## plot the figure 
fig = plt.figure( figsize = (15, 6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

ax1.imshow(im1)
ax2.imshow(im2)
title1 = f'score = {train_dat.Pawpularity[0]}'
title2 = f'test image'
ax1.set_title(title1, fontsize=15)
ax2.set_title(title2, fontsize=15)

plt.show()

### Back to the catalog

Let's make a few plots, counts, statistics.  

The first plot is the histogram/distribution of the score (looks like a Poisson distribution, no?).  Most of the scores are concentrated around the median, ~33.  The standard deviation is around 21.  You can see a peak at around 100 points, presumably because the intrinsic distribution spans ($-\infty$, $\infty$) and all negative values are set to 0 and all values above 100 are set to 100, or something like that. 

The next plot separates the training data by a single feature (0 as blue and 1 as red), to see if there is a single dominating feature that can give a rough prediction of the score.  Unfortunately, I would say there is no dominating feature. The features are binary numbers, at the end of the day. 
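
The clipping idea for the peak near 100 can be illustrated with a small simulation. This is only a sketch on synthetic numbers: the latent normal distribution and its mean (38) and standard deviation (21) are assumptions chosen for illustration, not something measured from the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed latent score distribution (illustrative mean and std, not fitted to the data)
latent = rng.normal(loc=38.0, scale=21.0, size=9900)

# Censor the latent scores to the allowed range: below 0 -> 0, above 100 -> 100
clipped = np.clip(latent, 0.0, 100.0)

print("fraction piled up at 100:", np.mean(clipped == 100.0))
print("fraction piled up at 0  :", np.mean(clipped == 0.0))
```

Note that this symmetric toy distribution piles up even more samples at 0 than at 100, so it only shows the mechanism qualitatively.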

In [None]:
# histogram of the score:
score = train_dat.Pawpularity

print(f" std of the score = {np.std(score)} \n median of the score = {np.median(score)}")

plt.figure(figsize = (6, 4))
plt.title(r'Histogram', fontsize=20)
plt.hist(score, bins = np.linspace(0, 100, 20), rwidth=0.7 )
plotstyle()
plt.xlabel('score')
plt.ylabel('Num')
plt.show()


plt.figure(figsize = (6, 4))
plt.title(r'1 vs 0', fontsize=20)

for i, feature in enumerate(train_dat.columns[1:-1]): 
    group0 = train_dat[train_dat[feature] == 0]
    group1 = train_dat[train_dat[feature] == 1]
    
    m0 = np.median(group0.Pawpularity)
    s0 = np.std(group0.Pawpularity)
    m1 = np.median(group1.Pawpularity)
    s1 = np.std(group1.Pawpularity)
    
    plt.plot((i, i), (m0-s0, m0+s0), 'b-', alpha=0.5)
    plt.plot((i+0.3, i+0.3), (m1-s1, m1+s1), 'r-', alpha=0.5)
    
    plt.scatter(i, m0, s=30, color='b', marker='x')
    plt.scatter(i+0.3, m1, s=30, color='r', marker='x')
plotstyle()    
plt.xlabel('features')
plt.ylabel('score')
plt.ylim(0, 100)
plt.show()

### Some ML methods: 

Let's see whether a combination of the catalog features can predict the score well.  We will use several ML tools in sklearn to build our model. 

In [None]:
# Set up the data : 
y = train_dat['Pawpularity']
X = train_dat.drop(['Id','Pawpularity'],axis=1)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1234)


## Random Forest 
#create the Random Forest ensemble
RF_reg = RandomForestRegressor(n_estimators=200, max_depth=8)
#train the model
start = time.time()
RF_reg.fit(x_train, y_train)
stop = time.time()
#predict the response for the test data
y_RF = RF_reg.predict(x_test)
#print the rms
print(f'Training time: {round((stop - start),3)} seconds')
rms_RF = rms(y_test, y_RF)
print(f'RF_reg_RMSE: {round(rms_RF,3)}')


## Boosted Decision Tree
BD_reg = GradientBoostingRegressor( max_depth=10, n_estimators=20, learning_rate=0.01)
BD_reg.fit(x_train, y_train)
y_BDT = BD_reg.predict(x_test)
rms_BDT = rms(y_test, y_BDT)
print(f'BD_reg_RMSE: {rms_BDT:.3f}')

plt.hist(y_train, bins = np.linspace(0, 100, 20), rwidth=0.8, label='Data' , alpha=0.5 )
plt.hist(y_RF, bins = np.linspace(0, 100, 20), rwidth=0.8, label='Random Forest' , alpha=0.7 )
plt.hist(y_BDT, bins = np.linspace(0, 100, 20), rwidth=0.5, label='BDT' , alpha=0.7 )
plt.legend(fontsize=15)
plt.show()

### hmm 

So I actually tried more than two methods, but none of them works well.  As you can see above, both methods predict the mean value as the safest strategy.  Note that these methods are not bad in themselves; the reason we fail to predict the score could be: (1) there is a deeply hidden and complicated rule between the score and the features, (2) the MetaData is irrelevant to the score, (3) the score is irrelevant to the data.  

For (1) and (2), our plan is to get into the image, extract features with some CNN, and hope for the best.   (But if it's that simple, why couldn't BDT make a good prediction? Just saying.)

(3) is the worst case, which basically means we give up. But in the research field I work in (astronomy), data are often messy and confusing.  Finding out that something is irrelevant can be a contribution.  
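
To put a number on "predicting the mean": a model that always returns the mean of the scores has an RMSE equal to the standard deviation of those scores. With the standard deviation of about 21 quoted earlier, that is roughly the value any useful model has to beat. A minimal check on synthetic scores (the random integers below are a stand-in, not the real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the scores (random integers in 1..100), not the real catalog
y = rng.integers(1, 101, size=1000).astype(float)

def rms(y1, y2):
    """Root-mean-square difference between two arrays."""
    return np.sqrt(np.mean((y1 - y2) ** 2))

# A "model" that always predicts the mean score
baseline_pred = np.full_like(y, y.mean())
baseline_rmse = rms(y, baseline_pred)

print(f"RMSE of the mean predictor = {baseline_rmse:.3f}")
print(f"std of the scores          = {np.std(y):.3f}")
```

On a held-out split the two numbers agree only approximately, since the mean comes from the training part.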

I recommend reading this discussion of others' thoughts and ideas about this dataset: 

https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/285140



# Two Step Model 

Next we try to use a pretrained ResNet18 network as a regressor.  
We have tried the following: 

### 1. A two-step fine-tuning model:  First train the network to predict the MetaData (the classifier), then transfer the best-performing model to train on the Pawpularity (the regressor).   We also tried adding a few hidden layers as the decoder. 

The classifier works pretty well in fewer than 10 epochs if we preserve the pretrained weights. However, even though the regressor can successfully predict the Pawpularity of the training data, it returns only the mean value (~38) for the validation data.  At first we assumed this was a result of overtraining. 

### 2. Build a network smaller than ResNet18. 

I assumed the reason for the overtraining is that there are too many parameters in ResNet18 (11M parameters).  So I built a residual net with fewer layers and lower dimensions.  The result is similar to the previous method: it works for the classifier and returns the mean for the validation data.  Only this time, with the same number of epochs, the scatter of the training predictions is larger than with the pretrained ResNet18, and the model still returns something close to the mean on the validation set.  
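
Before the full implementation, here is a stripped-down sketch of the two-step idea: one shared backbone, a classification head for the metadata, and a regression head for the score, with the unused head frozen in each phase. The tiny linear backbone, the 8x8 random "images", and the `set_phase` helper are hypothetical stand-ins for illustration only.

```python
import torch
from torch import nn

# Hypothetical tiny backbone standing in for the pretrained ResNet18
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16), nn.ReLU())
classifier = nn.Linear(16, 12)  # step 1 head: 12 binary metadata features
regressor = nn.Linear(16, 1)    # step 2 head: one Pawpularity score

def set_phase(phase):
    """Unfreeze the head used in this phase and freeze the other one."""
    classifier.requires_grad_(phase == 'classification')
    regressor.requires_grad_(phase == 'regression')

# Random toy batch: 4 images, their metadata and scores
imgs = torch.rand(4, 3, 8, 8)
meta = torch.randint(0, 2, (4, 12)).float()
scores = torch.rand(4, 1) * 100

# Step 1: train backbone + classifier on the metadata
set_phase('classification')
loss_cls = nn.BCEWithLogitsLoss()(classifier(backbone(imgs)), meta)

# Step 2: reuse the same backbone and train the regressor on the score
set_phase('regression')
preds = 100 * torch.sigmoid(regressor(backbone(imgs)))  # squash into (0, 100)
loss_reg = nn.MSELoss()(preds, scores)

print(loss_cls.item(), loss_reg.item())
```

In a real run each step would have its own optimizer loop; only the wiring is shown here.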


### We will put our thoughts at the end of the notebook: 

In [None]:
import os
import numpy as np
import pandas as pd
import torch
import torchvision
import pytorch_lightning as pl
import matplotlib.pyplot as plt

from typing import Optional
from torch.utils.data import Dataset, DataLoader, random_split
from torch import nn, optim
from torchvision.io import read_image
from torchvision.transforms import Compose, ConvertImageDtype, Resize, Normalize
from pytorch_lightning.callbacks import ModelCheckpoint

pd.set_option("display.max_rows", None)
plt.style.use('ggplot')

from functools import partial
from dataclasses import dataclass
from collections import OrderedDict

!pip install torchsummary
from torchsummary import summary

# Dataset & Datamodule

In [None]:
class PetfinderDataset(Dataset):
    """Training/Testing dataset of Petfinder profiles."""
    
    def __init__(self, train=True,img_transform=None, meta_transform=None, score_transform=None):
        """
        Arguments
        ---------
            train (bool): Whether the training dataset or the testing dataset
            img_transform: Transformation of images
            meta_transform: Transformation of metadata
            score_transform: Transformation of pawpularity scores
        
        Note
        ----
        `score_transform` is not supported if `train` is `False`.
        """
        self.dirpath = '../input/petfinder-pawpularity-score'
        self.img_dir = 'train' if train else 'test'
        self.meta = pd.read_csv( os.path.join(self.dirpath, 'train.csv' if train else 'test.csv') )
        self.metacols = self.meta.columns.drop( ['Id', 'Pawpularity'] if train else 'Id')
        self.train = train
        self.img_transform = img_transform
        self.meta_transform = meta_transform
        self.score_transform = score_transform
    
    def __len__(self):
        return len(self.meta.index)
    
    def __getitem__(self, idx):
        """
        Return the image, metadata and score given the index of a sample.
        
        Note
        ----
        If `self.train` is `False`, the returned score will be -1.
        """
        # Obtain image, metadata and score
        ind = self.meta.index[idx]  # index in metadata
        img_path = os.path.join( self.dirpath, self.img_dir, f"{self.meta.loc[ind, 'Id']}.jpg")
        img = read_image(img_path)
        meta = self.meta.loc[ind, self.metacols]
        meta = meta.values.astype(np.float32)  # convert data type
        score = self.meta.loc[ind, 'Pawpularity'] if self.train else -1.0
        score = np.float32(score)  # convert data type
        # Apply transformations
        if self.img_transform is not None:
            img = self.img_transform(img)
        if self.meta_transform is not None:
            meta = self.meta_transform(meta)
        if self.train and self.score_transform is not None:
            score = self.score_transform(score)
        return img, meta, score

In [None]:
class PetfinderDataModule(pl.LightningDataModule):
    """Data module of Petfinder profiles."""
    
    def __init__(self,image_size: int = 224, batch_size: int = 64, num_validation: int = 128):
        """
        Arguments
        ---------
            image_size: Size of square images after transformations
            batch_size: Batch size loading training/validation dataset
            num_validation: Number of observations in validation dataset
        """
        super().__init__()
        self.image_size = image_size
        self.batch_size = batch_size
        self.num_validation = num_validation
    
    def setup(self, stage: Optional[str] = None):
        # Transformations
        transforms = {'img_transform': Compose([
                ConvertImageDtype(torch.float32),
                Resize((self.image_size, self.image_size)),
                Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
                ])}
        # Split training set and validation set
        if stage in (None, 'fit'):
            self.dataset = PetfinderDataset(train=True, **transforms)
            self.trainset, self.valset = random_split(
                self.dataset, [len(self.dataset)-self.num_validation, self.num_validation])
        # Load dataset for prediction
        if stage == 'predict':
            self.predictset = PetfinderDataset(train=False, **transforms)
    
    def train_dataloader(self):
        return DataLoader(self.trainset, batch_size=self.batch_size, shuffle=True)
    
    def val_dataloader(self):
        return DataLoader(self.valset, batch_size=self.batch_size)
    
    def predict_dataloader(self):
        return DataLoader(self.predictset, batch_size=len(self.predictset))
    
    def num_meta(self):
        """
        Return number of features in the metadata.
        
        Note
        ----
        Must be called after running self.setup().
        """
        return len(self.dataset.metacols)
    
    def meta_odds(self):
        """
        Return the odds against features in the metadata.
        
        Note
        ----
        Must be called after running self.setup().
        """
        pos_rate = self.dataset.meta.loc[:, self.dataset.metacols].mean()
        pos_rate = torch.from_numpy(pos_rate.values).float()
        return (1 - pos_rate) / pos_rate
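
A note on `meta_odds`: the ratio (negatives / positives) per feature is the value the PyTorch documentation suggests for the `pos_weight` argument of `nn.BCEWithLogitsLoss` when the classes are imbalanced, which is presumably what this helper is for (the classification cell further down leaves `pos_weight` at its default of ones). The same computation on a toy table with made-up values:

```python
import numpy as np

# Toy metadata table: 4 samples x 3 binary features (made-up values)
meta = np.array([
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 1, 0],
], dtype=np.float32)

pos_rate = meta.mean(axis=0)              # fraction of 1s per feature
odds_against = (1 - pos_rate) / pos_rate  # negatives per positive

print(pos_rate)
print(odds_against)
```

A rare feature (first column, 1 positive in 4) gets a weight of 3, while a common one (third column) gets a weight below 1.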

## Data Augmentation (TODO) 
Yes,  but actually, no

# Model & Training/Validation Step

### Building a network smaller than ResNet18
### Following this nicely written explanation about ResNet 
### https://github.com/FrancescoSaverioZuppichini/ResNet

In [None]:
## Conv2d with automatic padding
class Conv2dAuto(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.padding =  (self.kernel_size[0] // 2, self.kernel_size[1] // 2) # dynamic add padding based on the kernel_size
        
conv3x3 = partial(Conv2dAuto, kernel_size=3, bias=False)    

## Residual Block
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.in_channels, self.out_channels =  in_channels, out_channels
        self.blocks = nn.Identity()
        self.shortcut = nn.Identity()   
    
    def forward(self, x):
        residual = x
        if self.should_apply_shortcut: residual = self.shortcut(x)
        x = self.blocks(x)
        x += residual
        return x
    
    @property
    def should_apply_shortcut(self):
        return self.in_channels != self.out_channels

## Extend the ResidualBlock 
class ResNetResidualBlock(ResidualBlock):
    def __init__(self, in_channels, out_channels, expansion=1, downsampling=1, conv=conv3x3, *args, **kwargs):
        super().__init__(in_channels, out_channels)
        self.expansion, self.downsampling, self.conv = expansion, downsampling, conv
        self.shortcut = nn.Sequential(OrderedDict(
        {
            'conv' : nn.Conv2d(self.in_channels, self.expanded_channels, kernel_size=1,
                      stride=self.downsampling, bias=False),
            'bn' : nn.BatchNorm2d(self.expanded_channels)
            
        })) if self.should_apply_shortcut else None
        
        
    @property
    def expanded_channels(self):
        return self.out_channels * self.expansion
    
    @property
    def should_apply_shortcut(self):
        return self.in_channels != self.expanded_channels


In [None]:
def conv_bn(in_channels, out_channels, conv, *args, **kwargs):
    return nn.Sequential(OrderedDict({'conv': conv(in_channels, out_channels, *args, **kwargs), 
                          'bn': nn.BatchNorm2d(out_channels) }))
## Basic Block
class ResNetBasicBlock(ResNetResidualBlock):
    expansion = 1
    def __init__(self, in_channels, out_channels, activation=nn.ReLU, *args, **kwargs):
        super().__init__(in_channels, out_channels, *args, **kwargs)
        self.blocks = nn.Sequential(
            conv_bn(self.in_channels, self.out_channels, conv=self.conv, bias=False, stride=self.downsampling),
            activation(),
            conv_bn(self.out_channels, self.expanded_channels, conv=self.conv, bias=False),
        )
        
## Bottle Neck        
class ResNetBottleNeckBlock(ResNetResidualBlock):
    expansion = 4
    def __init__(self, in_channels, out_channels, activation=nn.ReLU, *args, **kwargs):
        super().__init__(in_channels, out_channels, expansion=4, *args, **kwargs)
        self.blocks = nn.Sequential(
           conv_bn(self.in_channels, self.out_channels, self.conv, kernel_size=1),
             activation(),
             conv_bn(self.out_channels, self.out_channels, self.conv, kernel_size=3, stride=self.downsampling),
             activation(),
             conv_bn(self.out_channels, self.expanded_channels, self.conv, kernel_size=1),
        )
        
class ResNetLayer(nn.Module):
    def __init__(self, in_channels, out_channels, block=ResNetBasicBlock, n=1, *args, **kwargs):
        super().__init__()
        # 'We perform downsampling directly by convolutional layers that have a stride of 2.'
        downsampling = 2 if in_channels != out_channels else 1
        
        self.blocks = nn.Sequential(
            block(in_channels , out_channels, *args, **kwargs, downsampling=downsampling),
            *[block(out_channels * block.expansion, 
                    out_channels, downsampling=1, *args, **kwargs) for _ in range(n - 1)]
        )

    def forward(self, x):
        x = self.blocks(x)
        return x
    


In [None]:
## Encoder
class ResNetEncoder(nn.Module):
    """
    ResNet encoder composed of stacked layers with an increasing number of features.
    """
    def __init__(self, in_channels=3, blocks_sizes=[64, 128, 256, 512], deepths=[2,2,2,2], 
                 activation=nn.ReLU, block=ResNetBasicBlock, *args,**kwargs):
        super().__init__()
        
        self.blocks_sizes = blocks_sizes
        
        self.gate = nn.Sequential(
            nn.Conv2d(in_channels, self.blocks_sizes[0], kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(self.blocks_sizes[0]),
            activation(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )
        
        self.in_out_block_sizes = list(zip(blocks_sizes, blocks_sizes[1:]))
        self.blocks = nn.ModuleList([ 
            ResNetLayer(blocks_sizes[0], blocks_sizes[0], n=deepths[0], activation=activation, 
                        block=block,  *args, **kwargs),
            *[ResNetLayer(in_channels * block.expansion, 
                          out_channels, n=n, activation=activation, 
                          block=block, *args, **kwargs) 
              for (in_channels, out_channels), n in zip(self.in_out_block_sizes, deepths[1:])]       
        ])
        
        
    def forward(self, x):
        x = self.gate(x)
        for block in self.blocks:
            x = block(x)
        return x

## Decoder
class ResnetDecoder(nn.Module):
    """
    This class represents the tail of ResNet. It performs a global pooling and maps the output to the
    correct class by using a fully connected layer.
    """
    def __init__(self, in_features, n_classes):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d((1, 1))
        self.decoder = nn.Linear(in_features, n_classes)

    def forward(self, x):
        x = self.avg(x)
        x = torch.flatten(x, 1)  #x = x.view(x.size(0), -1)
        x = self.decoder(x)
        return x
    
## Combine everything 
    
class ResNet(nn.Module):
    
    def __init__(self, in_channels, n_classes, *args, **kwargs):
        super().__init__()
        self.out_features = n_classes
        self.encoder = ResNetEncoder(in_channels, *args, **kwargs)
        self.decoder = ResnetDecoder(self.encoder.blocks[-1].blocks[-1].expanded_channels, n_classes)
        
    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

In [None]:
def resnet18(in_channels, n_classes):
    return ResNet(in_channels, n_classes, block=ResNetBasicBlock, deepths=[2, 2, 2, 2])

def resnet_small(in_channels, n_classes):
    return ResNet(in_channels, n_classes, blocks_sizes=[16, 32, 64, 128], block=ResNetBasicBlock, deepths=[2, 2, 2, 2])

My_model = resnet_small(3, 87)
summary(My_model.cuda(), (3, 224, 224))


## PawpularityPredictor

In [None]:
class PawpularityPredictor(pl.LightningModule):
    """Transfer learning model with two-stage finetuning."""
    
    def __init__(
        self,
        backbone: str ='resnet_18',
        training_phase: str = 'regression',
        num_meta: int = 12,
        pos_weight: torch.Tensor = torch.ones(12),
        classification_threshold: float = 0.5
    ):
        """
        Arguments
        ---------
            backbone: Backbone model to fine tune
            training_phase: Indicator of classification or regression
            num_meta: Number of features in metadata
            pos_weight: Weight of positive samples passed to classification loss
            classification_threshold: Threshold for binary classification
        """
        super().__init__()
        if training_phase not in ('classification', 'regression'):
            raise ValueError('phase must be either classification or regression')
        if backbone == 'resnet_18':
            self.backbone = torchvision.models.resnet18(pretrained=True)
            num_feats = self.backbone.fc.in_features
            self.backbone.fc = nn.Identity()  
        elif backbone == 'try':
            self.backbone = My_model
            num_feats = self.backbone.out_features
            
        else:
            raise ValueError('backbone model not supported')
        # classifier
        self.classifier = nn.Linear(num_feats, num_meta)
        
        # regressor 
        self.regressor =nn.Linear(num_feats, 1) 
        
        
        self.lossfn_classification = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        self.lossfn_regression = nn.MSELoss()
        self.classification_threshold = classification_threshold
        self.training_phase = training_phase
        self.freeze_part_by_training_phase()
    
    def freeze_part_by_training_phase(self):
        """Freeze part of model according to the internal training phase."""
        if self.training_phase == 'classification':
            self.classifier.requires_grad_(True)
            self.regressor.requires_grad_(False)  # freeze regressor
        else:
            self.regressor.requires_grad_(True)
            self.classifier.requires_grad_(False)  # freeze classifier
    
    def forward(self, imgs):
        x = self.backbone(imgs)
        x = torch.sigmoid(self.regressor(x))
        x = 100*x
        #return self.regressor(self.backbone(imgs))
        return x
    
    def class_forward(self, imgs):
        x = self.backbone(imgs)
        x = self.classifier(x)
        return x
        
    
    def training_step(self, batch, batch_idx):
        imgs, meta, scores = batch
        if self.training_phase == 'classification':
            logits = self.class_forward(imgs)#self.classifier(self.backbone(imgs))
            loss = self.lossfn_classification(logits, meta)
            self.log('Loss:classification/train', loss)
            preds = torch.sigmoid(logits) > self.classification_threshold
            acc = (preds == meta).float().mean().item()  # batch accuracy
            self.log('Accuracy/train', acc)
        else:
            preds = self.forward(imgs) #self.regressor(self.backbone(imgs))
            loss = self.lossfn_regression(preds, scores.unsqueeze(-1))
            self.log('Loss:regression/train', loss)
            rmse = torch.sqrt(loss)
            self.log('RMSE/train', rmse)
        return loss
    
    def validation_step(self, batch, batch_idx):
        imgs, meta, scores = batch
        if self.training_phase == 'classification':
            logits = self.class_forward(imgs) #self.classifier(self.backbone(imgs))
            loss = self.lossfn_classification(logits, meta)
            self.log('Loss:classification/validation', loss)
            preds = torch.sigmoid(logits) > self.classification_threshold
            acc = (preds == meta).float().mean().item()  # batch accuracy
            self.log('Accuracy/validation', acc)
        else:
            preds = self.forward(imgs) #self.regressor(self.backbone(imgs))
            loss = self.lossfn_regression(preds, scores.unsqueeze(-1))
            self.log('Loss:regression/validation', loss)
            rmse = torch.sqrt(loss)
            self.log('RMSE/validation', rmse)
    
    def configure_optimizers(self):
        if self.training_phase == 'classification':
            optimizer = optim.AdamW(self.parameters(), lr=1e-3)
            return optimizer
        else:
            optimizer = optim.AdamW(self.parameters(), lr=1e-3)
            return optimizer
    
    def predict_step(self, batch, batch_idx, dataloader_idx=0):
        imgs, _, _ = batch
        return self(imgs)

# Training with Two-stage Finetuning

In [None]:
IMAGE_SIZE = 224
BATCH_SIZE = 64
NUM_VALIDATION = 128
CLASSIFICATION_THRESHOLD = 0.5
NUM_EPOCHS_CLASSIFICATION = 10
NUM_EPOCHS_REGRESSION = 20

In [None]:
datamodule = PetfinderDataModule( image_size=IMAGE_SIZE, batch_size=BATCH_SIZE, 
                                 num_validation=NUM_VALIDATION  )
datamodule.setup()

## Finetune Backbone & Classifier

In [None]:
# skip the classifier (if the regressor fails again, we come back to train this, 
# so that I know the network at least works as a classifier)
model_class = PawpularityPredictor( backbone = 'try',  training_phase='classification' )

checkpoint_callback = ModelCheckpoint(
    monitor='Accuracy/validation',
    mode='max',
    filename='classifier-{epoch}-{step}'
)

logger = pl.loggers.CSVLogger('./logs_classification')

trainer = pl.Trainer( gpus=1,
                      max_epochs=NUM_EPOCHS_CLASSIFICATION,
                      callbacks=[checkpoint_callback],
                      logger=logger )



In [None]:
trainer.fit(model_class, datamodule=datamodule)

In [None]:
classifier_path = checkpoint_callback.best_model_path
classifier_path

In [None]:
log = pd.read_csv(os.path.join(logger.log_dir, 'metrics.csv'))
log

## Finetune Backbone & Regressor (**also some Diagnostics**)

In [None]:

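model = PawpularityPredictor.load_from_checkpoint(
    checkpoint_callback.best_model_path,
    backbone='try',  # not stored in the checkpoint, so pass the backbone the classifier was trained with
    training_phase='regression'
)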
'''
model = PawpularityPredictor(
    training_phase='regression'
)
'''
checkpoint_callback = ModelCheckpoint(
    monitor='RMSE/validation',
    mode='min',
    filename='regressor-{epoch}-{step}'
)
logger = pl.loggers.CSVLogger('./logs_regression')
trainer = pl.Trainer(
    gpus=1,
    max_epochs=NUM_EPOCHS_REGRESSION,
    callbacks=[checkpoint_callback],
    logger=logger
)

#trainer.fit(model, datamodule=datamodule)

In [None]:
#checkpoint_callback.best_model_path

In [None]:
#log = pd.read_csv(os.path.join(logger.log_dir, 'metrics.csv'))
#log

In [None]:
def diagnose_predictions(model, datamodule, num_epochs):
    """Output a scatter plot of the actual/predicted Pawpularity scores."""
    # Setup figure
    fig, axes = plt.subplots(ncols=2, figsize=(12,6))
    lims = (-2, 102)
    for ax in axes:
        ax.set_xlabel('Actual Pawpularity Score')
        ax.set_ylabel('Predicted Pawpularity Score')
        ax.set_xlim(*lims)
        ax.set_ylim(*lims)
    axes[0].set_title('Training Samples')
    axes[1].set_title('Validation Set')
    fig.suptitle(f'Regressor Trained after {num_epochs} Epochs', fontsize=16)
    
    # Plot diagonal line
    for ax in axes:
        ax.plot(lims, lims, color='C3')
    
    # Visualize training/validation set
    dataloaders = (
        DataLoader(datamodule.trainset, batch_size=NUM_VALIDATION, shuffle=True),
        DataLoader(datamodule.valset, batch_size=NUM_VALIDATION)
    )
    for ax, dataloader in zip(axes, dataloaders):
        # Plot actual/predicted scores
        imgs, _, scores = next(iter(dataloader))
        model.eval()  # use the BatchNorm running statistics for the diagnostics
        with torch.no_grad():
            # run on whatever device the model lives on, then bring the result back to the CPU
            preds = model(imgs.to(model.device)).cpu()
            
        x = scores.cpu().numpy()
        y = preds.squeeze().cpu().numpy()
        ax.scatter(x, y, c='C1')
        # Add text of RMSE
        rmse = torch.sqrt(model.lossfn_regression(preds, scores.unsqueeze(-1))).item()
        textstr = f'RMSE = {rmse:.2f}'
        props = dict(boxstyle='round', facecolor='C4', alpha=0.5)
        ax.text(0.05, 0.95, textstr, transform=ax.transAxes, verticalalignment='top', bbox=props)
    
    # Output
    os.makedirs('./diagnostics', exist_ok=True)
    fig.savefig(f'./diagnostics/regressor_{num_epochs}_epochs.png')
    plt.show()

In [None]:
# Before training regressor
diagnose_predictions(model, datamodule, 0)

In [None]:
# After NUM epochs
trainer.fit(model, datamodule=datamodule)
diagnose_predictions(model, datamodule, 1 * NUM_EPOCHS_REGRESSION)

In [None]:
# After 2*NUM epochs
#trainer.fit(model, datamodule=datamodule)
#diagnose_predictions(model, datamodule, 2 * NUM_EPOCHS_REGRESSION)

# Inference

In [None]:
preds, = trainer.predict(datamodule=datamodule)

In [None]:
# Output predictions
predictions = pd.DataFrame({
    'Id': datamodule.predictset.meta['Id'],
    'Pawpularity': preds.squeeze().cpu().numpy()
})
predictions.to_csv('./submission.csv', index=False)

In [None]:
predictions

# Discussion 

### About the two-step procedure:  
So we have already seen other ML models fail to predict the Pawpularity using the MetaData alone.  You might also see (or try it as an exercise) that a DNN with the 12 MetaData features as input fails to predict the score.  Basically, I believe that if the input has no statistical significance for the output, any model will fail to predict from that input.  
The two-step model first trains a network to recognize the MetaData, which has no correlation with the Pawpularity.  
However, there are 512 output features in ResNet18. One of our problems is that we have fewer than 10k training samples.  By training to match the 12 MetaData features, we can be more certain that these 512 features capture information about the images, and these 512 features should be helpful for prediction.  Except that they aren't. 
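
The "many features, few samples, no signal" situation can be reproduced with a toy linear model. This is only an analogy on synthetic data, not our network: the features are random numbers drawn independently of the targets, and the sizes (200 samples, 150 features) are arbitrary. The fit looks good on the training split and is worse than simply predicting the mean on the validation split.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_val, n_feat = 200, 200, 150  # few samples, many features (toy sizes)

# Features and targets are drawn independently: the features carry no signal
X_train = rng.normal(size=(n_train, n_feat))
X_val = rng.normal(size=(n_val, n_feat))
y_train = rng.normal(loc=38.0, scale=21.0, size=n_train)
y_val = rng.normal(loc=38.0, scale=21.0, size=n_val)

def add_bias(X):
    """Append a column of ones so the fit includes an intercept."""
    return np.hstack([X, np.ones((len(X), 1))])

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Ordinary least squares on the training split
coef, *_ = np.linalg.lstsq(add_bias(X_train), y_train, rcond=None)

train_rmse = rmse(y_train, add_bias(X_train) @ coef)
val_rmse = rmse(y_val, add_bias(X_val) @ coef)
mean_rmse = rmse(y_val, np.full(n_val, y_train.mean()))

print(f"train RMSE = {train_rmse:.1f}")
print(f"val RMSE   = {val_rmse:.1f}")
print(f"val RMSE of the mean predictor = {mean_rmse:.1f}")
```

Our regressor collapsed to the mean on the validation set instead of doing worse than it, so the toy differs in detail, but the gap between training and validation behaves the same way.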




### Some other discussion: 
As in this discussion (https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/285140), 
some point out that there are similar images with distinct Pawpularity scores, or duplicate images. 
There are notebooks that work on identifying similar images (e.g. https://www.kaggle.com/schulta/petfinder-identify-duplicates-and-share-findings, https://www.kaggle.com/burakbekci/petfinder-finding-duplicates-with-cnn), basically by aligning the features from the Metadata or from the network.   
The similar/duplicate images are not too many (<300 images? That's 3%) and can be removed or treated as noise (see the discussion link for more detail).  We tried removing the duplicate images (keeping the higher score) and training again, but saw no improvement (as expected).  The images with similar features inspire us to view the Pawpularity as a score of **aesthetics**, rather than one based on identity features: how good the photo looks in the thumbnail, whether people appear in the photo, who posted the photo, etc.  
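
A rough sketch of the duplicate search, assuming each image already has a feature vector (for example the network output). The feature vectors, the scores, and the similarity threshold below are all made up for illustration; the linked notebooks use their own methods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up feature vectors standing in for network embeddings of 5 images
feats = rng.normal(size=(5, 16))
feats[3] = feats[0] + 1e-3 * rng.normal(size=16)   # image 3 is a near-copy of image 0
scores = np.array([20.0, 35.0, 50.0, 60.0, 80.0])  # made-up Pawpularity scores

# Cosine similarity between every pair of images
unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
sim = unit @ unit.T

THRESHOLD = 0.99  # hypothetical cut, would need tuning on real embeddings
pairs = [(i, j) for i in range(len(sim)) for j in range(i + 1, len(sim))
         if sim[i, j] > THRESHOLD]

# For each duplicate pair keep the image with the higher score, drop the other
drop = {i if scores[i] < scores[j] else j for i, j in pairs}
keep = [k for k in range(len(feats)) if k not in drop]

print("duplicate pairs:", pairs)
print("kept images    :", keep)
```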

Now I do have concerns about whether we are given all the data necessary to predict this Pawpularity score.  A common question is whether the score depends more on where/who/when the photos were posted, even though the competition explained that these have been normalized.  Personally, I think we do what we can do.  Analyze the data.  

So those are my thoughts about this project.  I had a lot of fun and gained experience.  Digging deeper into aesthetics rating is beyond my goal (but check this out: http://infolab.stanford.edu/~wangz/project/imsearch/Aesthetics/TMM15/lu.pdf).  

Chills, 
