# Trial of Pseudo-labeling by Cassava Model [Train]

## Purpose: Improve the generalization of the model by using a pseudo-labeled dataset

In this notebook, I generate pseudo-labels for the [New Plants Disease Dataset](https://www.kaggle.com/vipoooool/new-plant-diseases-dataset) using an EfficientNet trained on the Cassava leaf disease dataset.  

Adding the pseudo-labeled images enlarges the training set and may improve the model's generalization.  
The approach is motivated by [Noisy Student](https://arxiv.org/abs/1911.04252).

The teacher model's accuracy is about 88%.
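
As a rough sketch of the pseudo-labeling step itself: the teacher's logits are converted to class probabilities (soft labels), and the arg-max gives a hard label. The logit values and the `softmax` helper below are illustrative assumptions, not outputs of the actual teacher model.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical teacher logits for 2 unlabeled images and 5 Cassava classes.
teacher_logits = np.array([
    [2.0, 0.1, 0.1, 0.1, 0.1],
    [0.0, 0.0, 3.0, 0.0, 0.0],
])

soft_labels = softmax(teacher_logits)     # probabilities, each row sums to 1
hard_labels = soft_labels.argmax(axis=1)  # most likely class per image
```

Soft labels keep the teacher's uncertainty, while hard labels discard it; this notebook stores the probabilities.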

This notebook is based on https://www.kaggle.com/yasufuminakama/cassava-resnext50-32x4d-starter-training.

In [None]:
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py

In [None]:
pip install timm

In [None]:
pip install torch_optimizer

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import glob
import os

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

I used a plant disease dataset. It consists of leaf images, which makes it related to the Cassava disease data.  
The code below collects the image paths into a DataFrame.

In [None]:
data_dirs = glob.glob("/kaggle/input/new-plant-diseases-dataset/New Plant Diseases Dataset(Augmented)/New Plant Diseases Dataset(Augmented)/*/*/*.JPG")
plants_df = pd.DataFrame(data_dirs, columns=["image_id"])
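
The glob pattern above ends in `*/*/*.JPG`, so each path has two folder levels above the file. A minimal sketch of pulling those levels into columns follows; reading them as a split folder and a class folder is my assumption about the dataset layout, and `describe_paths` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath

import pandas as pd

def describe_paths(paths):
    """Build a DataFrame with the two parent folder names of every image path."""
    rows = []
    for p in paths:
        parts = PurePosixPath(p).parts
        # Assumed layout: .../<split>/<class_name>/<file>.JPG
        rows.append({"image_id": p, "split": parts[-3], "class_name": parts[-2]})
    return pd.DataFrame(rows, columns=["image_id", "split", "class_name"])

# Made-up paths that follow the assumed layout.
example_paths = [
    "/data/train/Apple___Apple_scab/img_001.JPG",
    "/data/valid/Tomato___healthy/img_002.JPG",
]
example_df = describe_paths(example_paths)
```

Keeping the original class name next to each pseudo-label makes it easier to inspect later which plant classes map to which Cassava class.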

For training, I used a pre-trained model and the RAdam optimizer, so I installed the `timm` and `torch_optimizer` libraries above. The next cell imports PyTorch/XLA.

In [None]:
import torch_xla
import torch_xla.core.xla_model as xm

# Definition of hyperparameters

In [None]:
CFG = {
    'fold_num': 5,
    'seed': 719,
    'model_arch': 'tf_efficientnet_b0_ns',
    'img_size': 512,
    'epochs': 10,
    'train_bs': 64,
    'valid_bs': 32,
    'T_0': 10,
    'lr': 5e-3,
    'min_lr': 1e-6,
    'weight_decay':1e-6,
    'num_workers': 4,
    'accum_iter': 2, # supports gradient accumulation for an effectively larger batch size
    'verbose_step': 1,
    'device': 'cuda',
    'target_size': 5,
    "gradient_accumulation_steps": 1,
    "max_grad_norm": 5,
    "print_freq": 100,
    "label_smoothing": 0.1,
    "t1": 1.0,
    "t2": 1.0,
    "loss": "logloss", # bi_tempered_loss or logloss
    "optimizer": "AdamW", # RAdam or AdamW
    "scheduler": "OneCycleLR",
    "model_average": 3,
    "pre-train": True,
    "use_2019": True,
}

# Run on the XLA (TPU) device; for a GPU run, set device = "cuda" instead
device = xm.xla_device()
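
If the notebook should also run where `torch_xla` is not installed, the device choice could fall back to CUDA or CPU. This is only a sketch: `pick_device` is a hypothetical helper, and it does not handle the case where `torch_xla` is installed but no TPU is attached.

```python
import torch

def pick_device():
    """Return an XLA device when torch_xla is importable, else CUDA, else CPU."""
    try:
        import torch_xla.core.xla_model as xm
        return xm.xla_device()
    except ImportError:
        # torch_xla is not installed, so fall back to a regular PyTorch device.
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device_sketch = pick_device()
```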

# Load dataframes

In [None]:
train = pd.read_csv('../input/cassava-leaf-disease-classification/train.csv')
test = pd.read_csv('../input/cassava-leaf-disease-classification/sample_submission.csv')
label_map = pd.read_json('../input/cassava-leaf-disease-classification/label_num_to_disease_map.json', 
                         orient='index')

In [None]:
DATA_PATH_2019 = '../input/cassava-leaf-disease-merged/'
TRAIN_DIR_2019 = DATA_PATH_2019 + 'train/'

In [None]:
train_merged = pd.read_csv("../input/cassava-leaf-disease-merged/merged.csv")

In [None]:
OUTPUT_DIR = './'
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

TRAIN_PATH = '../input/cassava-leaf-disease-classification/train_images'
TEST_PATH = '../input/cassava-leaf-disease-classification/test_images'

In [None]:
import os
import math
import time
import random
import shutil
from pathlib import Path
from contextlib import contextmanager
from collections import defaultdict, Counter
import matplotlib.pyplot as plt

import scipy as sp
import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

from tqdm.auto import tqdm
from functools import partial

import cv2
from PIL import Image

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam, SGD
import torchvision.models as models
from torch.nn.parameter import Parameter
from torch.utils.data import DataLoader, Dataset
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts, CosineAnnealingLR, ReduceLROnPlateau, OneCycleLR
from skimage import io, transform

import torch_optimizer as optim  # provides RAdam, used when CFG["optimizer"] == "RAdam"
import timm

from albumentations import (
    Compose, OneOf, Normalize, Resize, RandomResizedCrop, RandomCrop, HorizontalFlip, VerticalFlip, 
    RandomBrightness, RandomContrast, RandomBrightnessContrast, Rotate, ShiftScaleRotate, Cutout, 
    IAAAdditiveGaussianNoise, Transpose
    )
from albumentations.pytorch import ToTensorV2
from albumentations import ImageOnlyTransform

import warnings 
warnings.filterwarnings('ignore')

# Dataloader

In [None]:
class TrainDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.file_names = df["image_id"].values
        self.labels = df['label'].values
        self.transform = transform
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        file_name = self.file_names[idx]
        file_path = f'{TRAIN_DIR_2019}/{file_name}'
        image = io.imread(file_path)
        if self.transform:
            augmented = self.transform(image=image)
            image = augmented["image"]
        label = torch.tensor(self.labels[idx]).long()
        return image, label

class TestDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.file_names = df["image_id"].values
        self.transform = transform
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        file_name = self.file_names[idx]
        file_path = f'{TEST_PATH}/{file_name}'
        image = io.imread(file_path)
        if self.transform:
            augmented = self.transform(image=image)
            image = augmented["image"]
        return image
    
class PlantDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.file_names = df["image_id"].values
        self.transform = transform
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        file_name = self.file_names[idx]
        file_path = file_name  # image_id already holds the absolute path returned by glob
        image = io.imread(file_path)
        if self.transform:
            augmented = self.transform(image=image)
            image = augmented["image"]
        return image

In [None]:
from albumentations import (
    HorizontalFlip, VerticalFlip, IAAPerspective, ShiftScaleRotate, CLAHE, RandomRotate90,
    Transpose, ShiftScaleRotate, Blur, OpticalDistortion, GridDistortion, HueSaturationValue,
    IAAAdditiveGaussianNoise, GaussNoise, MotionBlur, MedianBlur, IAAPiecewiseAffine, RandomResizedCrop,
    IAASharpen, IAAEmboss, RandomBrightnessContrast, Flip, OneOf, Compose, Normalize, Cutout, CoarseDropout, ShiftScaleRotate, CenterCrop, Resize, RGBShift
)

In [None]:
def get_transform():
    return Compose([
            RandomResizedCrop(CFG['img_size'], CFG['img_size']),
            Transpose(p=0.5),
            HorizontalFlip(p=0.5),
            VerticalFlip(p=0.5),
            ShiftScaleRotate(p=0.5),
            RGBShift(r_shift_limit=15, g_shift_limit=15, b_shift_limit=15, p=0.5),
            HueSaturationValue(hue_shift_limit=0.2, sat_shift_limit=0.2, val_shift_limit=0.2, p=0.5),
            RandomBrightnessContrast(brightness_limit=(-0.1,0.1), contrast_limit=(-0.1, 0.1), p=0.5),
            Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], max_pixel_value=255.0, p=1.0),
            CoarseDropout(p=0.5),
            Cutout(p=0.5),
            ToTensorV2(),
        ], p=1.)

def val_transform():
    return Compose([
            Resize(CFG['img_size'], CFG['img_size']),
            Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], max_pixel_value=255.0, p=1.0),
            ToTensorV2(),
        ], p=1.)
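
To make the `Normalize` and `ToTensorV2` steps concrete, here is a NumPy sketch of the same arithmetic: scale pixels by `max_pixel_value`, subtract the per-channel mean, divide by the per-channel std, then move channels first. The helper names and the dummy image are illustrative; the real pipeline uses Albumentations.

```python
import numpy as np

# ImageNet statistics, the same values passed to Normalize above.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(image_uint8, max_pixel_value=255.0):
    """Apply (pixel / max_pixel_value - mean) / std per channel on an HWC image."""
    scaled = image_uint8.astype(np.float32) / max_pixel_value
    return (scaled - MEAN) / STD

def to_chw(image_hwc):
    """Reorder height x width x channel to channel x height x width."""
    return np.transpose(image_hwc, (2, 0, 1))

# A 4x4 all-white dummy image stands in for a real leaf photo.
dummy = np.full((4, 4, 3), 255, dtype=np.uint8)
out = to_chw(normalize(dummy))
```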

In [None]:
class EfficientNet(nn.Module):
    def __init__(self, model_name="tf_efficientnet_b0_ns"):
        super(EfficientNet, self).__init__()
        self.model = timm.create_model(model_name, pretrained=False)
        n_features = self.model.classifier.in_features
        self.model.classifier = nn.Linear(n_features, CFG["target_size"])
    
    def forward(self, x):
        return self.model(x)

# Definition of utilities

In [None]:
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (remain %s)' % (asMinutes(s), asMinutes(rs))


def init_logger(log_file=OUTPUT_DIR+'train.log'):
    from logging import getLogger, INFO, FileHandler,  Formatter,  StreamHandler
    logger = getLogger(__name__)
    logger.setLevel(INFO)
    handler1 = StreamHandler()
    handler1.setFormatter(Formatter("%(message)s"))
    handler2 = FileHandler(filename=log_file)
    handler2.setFormatter(Formatter("%(message)s"))
    logger.addHandler(handler1)
    logger.addHandler(handler2)
    return logger

LOGGER = init_logger()


def seed_torch(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

seed_torch(seed=42)

In [None]:
class Train:
    def __init__(self, model, step_per_epoch):
        self.model = model
        if CFG["optimizer"] == "RAdam":
            self.optimizer = optim.RAdam(
                model.parameters(),
                lr= CFG["lr"],
                betas=(0.9, 0.999),
                eps=1e-8,
                weight_decay=0,
            )
        elif CFG["optimizer"] == "AdamW":
            self.optimizer = torch.optim.AdamW(
                model.parameters(), 
                lr=CFG["lr"], 
                betas=(0.9, 0.999), 
                eps=1e-08, 
                weight_decay=0.0, 
                amsgrad=False)
        if CFG["scheduler"] == "OneCycleLR":
            self.scheduler = OneCycleLR(
                self.optimizer, 
                CFG["lr"],
                epochs=CFG["epochs"], 
                steps_per_epoch=step_per_epoch, 
                pct_start=0.3, 
                anneal_strategy='cos', 
                )
        else:
            self.scheduler = None
        self.scaler = torch.cuda.amp.GradScaler()
    
    def train(self, train_loader, epoch):
        batch_time = AverageMeter()
        data_time = AverageMeter()
        losses = AverageMeter()
        scores = AverageMeter()
        
        self.model.train()
        start = end = time.time()
        global_step = 0
        grad_norm = 0.0  # defined up front so logging works before the first optimizer step
        for step, (images, labels) in enumerate(train_loader):
            # measure data loading time
            data_time.update(time.time() - end)
            images = images.to(device)
            if CFG["loss"] == "bi_tempered_loss":
                labels = torch.nn.functional.one_hot(torch.tensor(labels), num_classes=5)
            labels = labels.to(device)
            batch_size = labels.size(0)
            
            with torch.cuda.amp.autocast():
                y_preds = self.model(images)
                if CFG["loss"] == "bi_tempered_loss":
                    loss = bi_tempered_logistic_loss(activations=y_preds, labels=labels, t1=CFG["t1"], t2=CFG["t2"], label_smoothing=CFG["label_smoothing"])
                elif CFG["loss"] == "logloss":
                    loss = F.cross_entropy(y_preds, labels)
                    
                loss = loss.mean()
                
            self.scaler.scale(loss).backward()
            
            if (step + 1) % CFG["accum_iter"] == 0:
                self.scaler.unscale_(self.optimizer)
                grad_norm = torch.nn.utils.clip_grad_norm_(self.model.parameters(), CFG["max_grad_norm"])
                self.scaler.step(self.optimizer)
                self.scaler.update()
                self.optimizer.zero_grad()
            
            losses.update(loss.item(), batch_size)
            if self.scheduler:
                self.scheduler.step()
            
            xm.mark_step()
            
            global_step += 1
            
            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()
            
            if (step + 1) % CFG["print_freq"] == 0 or step == (len(train_loader)-1):
                print('Epoch: [{0}][{1}/{2}] '
                      'Data {data_time.val:.3f} ({data_time.avg:.3f}) '
                      'Elapsed {remain:s} '
                      'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                      'Grad: {grad_norm:.4f}  '
                      'LR: {lr:.6f}  '
                      .format(
                       epoch+1, step, len(train_loader), batch_time=batch_time,
                       data_time=data_time, loss=losses,
                       remain=timeSince(start, float(step+1)/len(train_loader)),
                       grad_norm=grad_norm,
                       lr=self.scheduler.get_last_lr()[0] if self.scheduler is not None else CFG["lr"],
                       ))
        return losses.avg
    
    def validate(self, valid_loader):
        batch_time = AverageMeter()
        data_time = AverageMeter()
        losses = AverageMeter()
        scores = AverageMeter()
        # switch to evaluation mode
        self.model.eval()
        preds = []
        start = end = time.time()
        for step, (images, labels) in enumerate(valid_loader):
            # measure data loading time
            data_time.update(time.time() - end)
            if CFG["loss"] == "bi_tempered_loss":
                labels = torch.nn.functional.one_hot(torch.tensor(labels), num_classes=5)
            images = images.to(device)
            labels = labels.to(device)
            batch_size = labels.size(0)
            # compute loss
            with torch.no_grad():
                y_preds = self.model(images)
                if CFG["loss"] == "bi_tempered_loss":
                    loss = bi_tempered_logistic_loss(activations=y_preds, labels=labels, t1=CFG["t1"], t2=CFG["t2"], label_smoothing=CFG["label_smoothing"])
                elif CFG["loss"] == "logloss":
                    loss = F.cross_entropy(y_preds, labels)
                    
            losses.update(loss.mean().item(), batch_size)
            # record accuracy
            preds.append(y_preds.softmax(1).to('cpu').numpy())
            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()
            if (step + 1) % CFG["print_freq"] == 0 or step == (len(valid_loader)-1):
                print('EVAL: [{0}/{1}] '
                      'Data {data_time.val:.3f} ({data_time.avg:.3f}) '
                      'Elapsed {remain:s} '
                      'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                      .format(
                       step, len(valid_loader), batch_time=batch_time,
                       data_time=data_time, loss=losses,
                       remain=timeSince(start, float(step+1)/len(valid_loader)),
                       ))
                
        predictions = np.concatenate(preds)
        return losses.avg, predictions
    
    def inference(self, states, test_loader):
        self.model.to(device)
        tk0 = tqdm(enumerate(test_loader), total=len(test_loader))
        probs = []
        for i, (images) in tk0:
            if i % 100 == 0:
                print(float(i) / len(tk0) * 100.0)
            images = images.to(device)
            avg_preds = []
            for state in states:
                self.model.load_state_dict(state['model'])
                self.model.eval()
                with torch.no_grad():
                    y_preds = self.model(images)
                avg_preds.append(y_preds.softmax(1).to('cpu').numpy())
            avg_preds = np.mean(avg_preds, axis=0)
            probs.append(avg_preds)
        probs = np.concatenate(probs)
        return probs
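
The `inference` method averages the softmax outputs of every checkpoint in `states`. A small NumPy sketch of that averaging, with made-up probabilities for two checkpoints and three classes:

```python
import numpy as np

# Hypothetical softmax outputs of two checkpoints for the same batch of 2 images.
probs_fold0 = np.array([[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]])
probs_fold1 = np.array([[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]])

# Stack to shape (n_models, batch, n_classes) and average over the model axis.
ensemble = np.mean([probs_fold0, probs_fold1], axis=0)
```

Each averaged row still sums to 1, so the result can be used as a probability vector. With a single checkpoint the average is just that checkpoint's output.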

In [None]:

test_dataset = PlantDataset(plants_df, 
                             transform=val_transform())
test_loader = DataLoader(test_dataset, 
                          batch_size=CFG["train_bs"], 
                          shuffle=False, 
                          num_workers=CFG["num_workers"], pin_memory=True, drop_last=False)

states = [torch.load("/kaggle/input/efficientnet-2019/"+f'{CFG["model_arch"]}_fold{fold}_best.pth', map_location="cpu") for fold in range(1)]

model = EfficientNet(CFG["model_arch"])
model.to(device)
    
trainer = Train(model, len(test_dataset) // CFG["train_bs"])

best_score = 0.
best_loss = np.inf
    
preds = trainer.inference(states, test_loader)
plants_df = pd.concat([plants_df, pd.DataFrame(preds)], axis=1)
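
The concatenation above leaves the probability columns named `0` to `4`. One possible follow-up, sketched here with stand-in data, is to give them explicit names and derive a hard label and a confidence value. The column names, the stand-in frames, and the 0.5 threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `plants_df` and `preds` from the cell above.
paths_df = pd.DataFrame({"image_id": ["a.JPG", "b.JPG", "c.JPG"]})
probs = np.array([
    [0.90, 0.02, 0.03, 0.03, 0.02],
    [0.30, 0.25, 0.20, 0.15, 0.10],
    [0.05, 0.05, 0.80, 0.05, 0.05],
])

# Name the probability columns explicitly instead of 0..4.
prob_cols = [f"prob_{i}" for i in range(probs.shape[1])]
labeled = pd.concat([paths_df, pd.DataFrame(probs, columns=prob_cols)], axis=1)

# Hard label and confidence derived from the soft labels.
labeled["pseudo_label"] = labeled[prob_cols].to_numpy().argmax(axis=1)
labeled["confidence"] = labeled[prob_cols].max(axis=1)

THRESHOLD = 0.5  # illustrative value, not tuned
confident = labeled[labeled["confidence"] >= THRESHOLD].reset_index(drop=True)
```

Filtering by confidence is optional; it trades dataset size for cleaner pseudo-labels.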

# Results

This produces a pseudo-labeled plant disease dataset.  
The predicted probabilities can be used as soft labels for the dataset.  

The results of training with these pseudo-labels are in [this notebook](https://www.kaggle.com/gpiyama2119/cassava-train-with-pseudo-labeled-plants/).
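
For reference, one way a student model could consume soft labels is a cross-entropy computed against probability targets. This is a minimal sketch and not necessarily what the linked notebook does; `soft_cross_entropy` and the tensors are hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against probability targets, averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

# Made-up student logits and a teacher soft label for one image, 5 classes.
student_logits = torch.tensor([[2.0, 0.5, 0.1, 0.1, 0.1]])
soft_target = torch.tensor([[0.7, 0.1, 0.1, 0.05, 0.05]])
loss = soft_cross_entropy(student_logits, soft_target)

# With a one-hot target the same function matches the usual hard-label loss.
one_hot = torch.tensor([[1.0, 0.0, 0.0, 0.0, 0.0]])
hard = torch.tensor([0])
hard_loss = F.cross_entropy(student_logits, hard)
```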

In [None]:
plants_df.to_csv(OUTPUT_DIR+'plants_df.csv', index=False)