## cellular automata evaluation
Original: https://www.kaggle.com/nroman/melanoma-pytorch-starter-efficientnet

#### Versions oc:
* v1: fork from https://www.kaggle.com/octaviomm/melanoma-pytorch-starter-efficientnet
* v2: THISONE
* v3: multiple changes to work only with the files from the TRAIN folder (for training and validation)
* v4: adding WeightedRandomSampler

#### Versions:
* v9: ColorJitter transformation added **[0.896]**
* v10: Changed the dataset to [this one](https://www.kaggle.com/shonenkov/melanoma-merged-external-data-512x512-jpeg) with external data. **[0.894]**
* v11: Switched to [another dataset](https://www.kaggle.com/nroman/melanoma-external-malignant-256/) which I've created by myself. Also switched from StratifiedKFold to GroupKFold **[0.916]**
* v12: Switched to efficientnet-b1 **[0.919]**
* v13: Using meta features: sex and age **[0.918]**
* v14: anatom_site_general_challenge meta feature added as one-hot encoded matrix **[0.923]**
* v16: Fixed OOF - now it contains only data from the original training dataset, without external data. Also switched back to StratifiedKFold. Added DrawHair augmentation. **[0.909]**
* v18: Too many things were changed at the same time. Each experiment should change only one small thing, so it is easy to understand how the change affects the result. That said, I rolled back everything, keeping only the OOF fix, to make sure it works.
* v19: Added 'Hair' augmentation. OOF rework postponed until a better time, since there is a bug in my code for it. **[0.925]**
* v20: Advanced Hair Augmentation technique used. Read more about it here: https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/159176 **[0.923]**
* v21: Microscope augmentation added instead of Cutout. Read more here: https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/159476 **[0.914]**
* v22: Changed the dataset to [this one](https://www.kaggle.com/cdeotte/jpeg-melanoma-256x256) by Chris Deotte. More info [here](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/165526) **[0.900]**
* v23: All the same as v22 but effnet-b0 instead of b1 and more epochs per fold. **[0.895]**
* v24: effnet-b01 and more epochs. **[0.9092]**
* v25: Fixed a mistake in a way of filling preds. See [this comment](https://www.kaggle.com/nroman/melanoma-pytorch-starter-efficientnet/comments?scriptVersionId=39125585#913846). **[0.9016]**
* v26: Fix for another mistake. This time with a way of averaging TTA. See [this comment](https://www.kaggle.com/nroman/melanoma-pytorch-starter-efficientnet/comments#955916) **[0.915]**
* v27: Back to [my dataset](https://www.kaggle.com/nroman/melanoma-external-malignant-256/)

# * CHANGE TEST TRANSFORMS to test_transforms
# * change roc for val loss to save model
# * check the threshold change for imbalanced classification   
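
For the last item, one common way to choose a decision threshold on imbalanced data is to maximise Youden's J (TPR minus FPR) over out-of-fold predictions instead of rounding the sigmoid at 0.5. The sketch below is a dependency-free illustration of that idea, not the method used in this notebook. The function name `best_threshold_youden` and the toy labels and scores are hypothetical, and it assumes both classes are present.

```python
def best_threshold_youden(y_true, y_score):
    """Return (threshold, J) maximising TPR - FPR. Assumes both classes are present."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    best_j, best_thr = -1.0, 0.5
    for thr in sorted(set(y_score)):
        # Predict positive when the score is at or above the candidate threshold
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_j, best_thr = j, thr
    return best_thr, best_j

# Toy example: every score is below 0.5, so rounding would predict all zeros
toy_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.40]
thr, j = best_threshold_youden(toy_labels, toy_scores)
print(thr, round(j, 3))
```

In this toy case a 0.5 cut-off gives zero recall, while the selected threshold recovers all three positives at the cost of one false positive.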

In [None]:
!pip install -q efficientnet_pytorch torchtoolbox

In [None]:
import torch
import torchvision
import torch.nn.functional as F
import torch.nn as nn
import torchtoolbox.transform as transforms
from torch.utils.data import Dataset, DataLoader, Subset
from torch.optim.lr_scheduler import ReduceLROnPlateau
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold
import pandas as pd
import numpy as np
import gc
import os
import cv2
import time
import datetime
import warnings
import random
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
from efficientnet_pytorch import EfficientNet
from torch.utils.data import WeightedRandomSampler
from sklearn.metrics import precision_score, recall_score
from IPython.display import FileLink
%matplotlib inline

In [None]:
warnings.simplefilter('ignore')
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # benchmark mode auto-tunes kernels and can break reproducibility

seed_everything(47)

In [None]:
def make_train_test_df_for_cea_aug2(train_df_orig, df_cea_done, test_set_size = 2000, test_set_1s = 100, add_0s_into_train_df=0, add_1s_into_train_df=0):
    '''make a train_df that contains all the files that cea==completed & target==1
    and a test_df that has all the files that cea!=completed & target==1
    Optionally you can include fewer files'''
    # make a train_df that contains all the files that cea==completed & target==1
    image_name_cea_done = df_cea_done['image_name'].values
    train_df = train_df_orig[train_df_orig['image_name'].isin(image_name_cea_done)]
    # make sure test_df has all the files that cea!=completed & target==1
    test_df_large = train_df_orig[~train_df_orig['image_name'].isin(image_name_cea_done)]
    # divide in target==1 & target == 0
    test_df_only1s = test_df_large[test_df_large['target']==1]
    test_df_only0s = test_df_large[test_df_large['target']==0]
    test_df_only0s = test_df_only0s.reset_index(drop=True)
    test_df_only1s = test_df_only1s.reset_index(drop=True)
    # if add_0s_into_train_df more samples are wanted in train_df (of zeros)
    if add_0s_into_train_df >0:
        random.seed(0)
        rand_ints = random.sample(range(len(test_df_only0s)-1), add_0s_into_train_df)
        extra_0s_for_train_df = test_df_only0s.iloc[rand_ints]
        train_df = train_df.append(extra_0s_for_train_df)
        test_df_only0s =test_df_only0s.drop(rand_ints, axis=0)
    # if add_1s_into_train_df more samples are wanted in train_df (of ones)
    if add_1s_into_train_df >0:
        random.seed(0)
        rand_ints = random.sample(range(len(test_df_only1s)-1), add_1s_into_train_df)
        extra_1s_for_train_df = test_df_only1s.iloc[rand_ints]
        train_df = train_df.append(extra_1s_for_train_df)
        test_df_only1s = test_df_only1s.drop(rand_ints, axis=0)
    # if only a subset of the target == 1 is wanted
    if test_set_1s > 0:
        random.seed(0)
        rand_ints = random.sample(range(len(test_df_only1s)), test_set_1s)
        test_df_only1s = test_df_only1s.iloc[rand_ints]
        
    # if add_1s_into_train_df > 0 or add_0s_into_train_df > 0
    if add_1s_into_train_df > 0 or add_0s_into_train_df > 0:
        test_df_large = pd.DataFrame()
        test_df_large = test_df_large.append(test_df_only1s)
        test_df_large = test_df_large.append(test_df_only0s)
        random.seed(0)
        rand_ints = random.sample(range(len(test_df_large)), len(test_df_large))
        test_df_large.index = rand_ints
        test_df_large = test_df_large.sort_index()  # sorting by the random index shuffles the rows (reindex() with no arguments is a no-op)
        test_df = test_df_large
    else:
        # get a large subset of test_df_large that contain all target 1 from test_df_large
        test_df = test_df_large
        if test_set_size > 0:
            number1s_already_in_test = len(test_df_only1s)
            random.seed(0)
            rand_ints = random.sample(range(len(test_df_only0s)), test_set_size - number1s_already_in_test)
            test_df_only0s_subset = test_df_only0s.iloc[rand_ints]
            test_df = test_df_only1s.append(test_df_only0s_subset)
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)

In [None]:
def transform_features(train_df, test_df):
    # One-hot encoding of anatom_site_general_challenge feature
    concat = pd.concat([train_df['anatom_site_general_challenge'], test_df['anatom_site_general_challenge']], ignore_index=True)
    dummies = pd.get_dummies(concat, dummy_na=True, dtype=np.uint8, prefix='site')
    train_df = pd.concat([train_df, dummies.iloc[:train_df.shape[0]]], axis=1)
    test_df = pd.concat([test_df, dummies.iloc[train_df.shape[0]:].reset_index(drop=True)], axis=1)

    # Sex features
    train_df['sex'] = train_df['sex'].map({'male': 1, 'female': 0})
    test_df['sex'] = test_df['sex'].map({'male': 1, 'female': 0})
    train_df['sex'] = train_df['sex'].fillna(-1)
    test_df['sex'] = test_df['sex'].fillna(-1)

    # Age features
    train_df['age_approx'] /= train_df['age_approx'].max()
    test_df['age_approx'] /= test_df['age_approx'].max()
    train_df['age_approx'] = train_df['age_approx'].fillna(0)
    test_df['age_approx'] = test_df['age_approx'].fillna(0)

    train_df['patient_id'] = train_df['patient_id'].fillna(0)
    return train_df, test_df

In [None]:
def get_synthesize_images_for_only_images_in_fold(train_df, train_df_synthesized, train_idx):
    #For only the images in this train fold get the synthesized ones
    df_train_in_fold = train_df.iloc[train_idx].reset_index(drop=True)
    names_in_train = df_train_in_fold['image_name'].values
    names_unique_synt_all = np.unique(train_df_synthesized['image_name_root'].values)
    names_in_train_that_have_synt = list(set(names_in_train).intersection(names_unique_synt_all))
    # select those images from all the synthesized ones
    train_synt_to_add_fold = train_df_synthesized[train_df_synthesized['image_name_root'].isin(names_in_train_that_have_synt)]
    # append the synthesized images to the current train fold
    df_train_in_fold_synt_added = df_train_in_fold.append(train_synt_to_add_fold)
    return df_train_in_fold_synt_added
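
The point of restricting the synthesized images to the current training fold is to avoid leakage: a synthesized copy of a validation image must not end up in the training data. The sketch below shows the same filtering with plain Python lists and dicts. The function name `synthetic_rows_for_fold` and the image names are hypothetical and only illustrate the idea.

```python
def synthetic_rows_for_fold(train_names, synthetic_rows):
    """Keep only synthetic rows whose root image belongs to the training fold."""
    train_names = set(train_names)
    return [row for row in synthetic_rows if row["image_name_root"] in train_names]

toy_train_fold = ["ISIC_A", "ISIC_B"]  # hypothetical image names in the training fold
toy_synthetic = [
    {"image_name": "ISIC_A_000", "image_name_root": "ISIC_A"},
    {"image_name": "ISIC_A_001", "image_name_root": "ISIC_A"},
    {"image_name": "ISIC_C_000", "image_name_root": "ISIC_C"},  # root sits in the validation fold
]
kept = synthetic_rows_for_fold(toy_train_fold, toy_synthetic)
kept_names = [row["image_name"] for row in kept]
print(kept_names)
```

Only the two copies derived from `ISIC_A` survive; the copy of `ISIC_C` is dropped because its root image is not in the training fold.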

In [None]:
def get_sampler_for_imbalanced_classification(train_df):
    '''https://discuss.pytorch.org/t/how-to-handle-imbalanced-classes/11264/2'''
    target = train_df['target'].values
    # print(f'target train 0/1: {len(np.where(target == 1)[0])}/{len(np.where(target == 0)[0])}')
    class_sample_count = np.array([len(np.where(target == t)[0]) for t in np.unique(target)])
    weight = 1. / class_sample_count
    samples_weight_py = np.array([weight[t] for t in target])
    samples_weight = torch.from_numpy(samples_weight_py)
    samples_weight = samples_weight.double()
    sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
    return sampler, samples_weight_py
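
The weighting used above can be checked by hand. The sketch below is a dependency-free illustration (the helper name `inverse_frequency_weights` is hypothetical and not part of this notebook): each sample gets the weight 1 / (number of samples in its class), so every class carries the same total weight, and a sampler that draws proportionally to these weights should pick both classes roughly equally often.

```python
from collections import Counter

def inverse_frequency_weights(targets):
    """Return one weight per sample: 1 / (number of samples in its class)."""
    counts = Counter(targets)
    return [1.0 / counts[t] for t in targets]

# Toy, heavily imbalanced target vector: 8 benign (0) and 2 malignant (1)
toy_targets = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_weights = inverse_frequency_weights(toy_targets)

# Total weight per class; both sums should be (approximately) equal
weight_per_class = {
    c: sum(w for w, t in zip(toy_weights, toy_targets) if t == c)
    for c in set(toy_targets)
}
print(weight_per_class)
```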

In [None]:
!ls /kaggle/input

In [None]:
# OC just checking shapes and files that completed cea reconstruction
train_df_orig = pd.read_csv('/kaggle/input/jpeg-melanoma-256x256/train.csv')
df_cea_done = pd.read_csv('/kaggle/input/files-cea-done-264/files_cea_done_264.csv')
print(f"train_df_orig = {train_df_orig.shape}, train_df total 1's = {np.sum(train_df_orig['target'].values)}")
print(f'df_cea_done = {df_cea_done.shape}')

In [None]:
# OPTION 2
train_df, test_df = make_train_test_df_for_cea_aug2(train_df_orig, df_cea_done, 
                                                   test_set_size = 2000, test_set_1s = 450,
                                                  add_0s_into_train_df = 24890-len(df_cea_done),#1500-len(df_cea_done)
                                                  add_1s_into_train_df = 110) # 0
np.shape(train_df), np.shape(test_df)
print(f"train_df = {train_df.shape}, train_df total 1's = {np.sum(train_df['target'].values)}, train_df total 0's = {np.sum(train_df['target'].values==0)}")
print(f"test_df = {test_df.shape}, test_df total 1's = {np.sum(test_df['target'].values)}, test_df total 0's = {np.sum(test_df['target'].values==0)}")

In [None]:
# # OPTION 3 (small, only for testing)
# train_df, test_df = make_train_test_df_for_cea_aug2(train_df_orig, df_cea_done, 
#                                                    test_set_size = 2000, test_set_1s = 450,
#                                                   add_0s_into_train_df = 0,#1500-len(df_cea_done)
#                                                   add_1s_into_train_df = 0) # 0
# np.shape(train_df), np.shape(test_df)
# print(f"train_df = {train_df.shape}, train_df total 1's = {np.sum(train_df['target'].values)}, train_df total 0's = {np.sum(train_df['target'].values==0)}")
# print(f"test_df = {test_df.shape}, test_df total 1's = {np.sum(test_df['target'].values)}, test_df total 0's = {np.sum(test_df['target'].values==0)}")

In [None]:
# Transform features (one hot encoding)
train_df, test_df = transform_features(train_df, test_df)
print(f"train_df = {train_df.shape}, train_df total 1's = {np.sum(train_df['target'].values)}, train_df total 0's = {np.sum(train_df['target'].values==0)}")
print(f"test_df = {test_df.shape}, test_df total 1's = {np.sum(test_df['target'].values)}, test_df total 0's = {np.sum(test_df['target'].values==0)}")
print('============')
# make Dataframe with the synthesized images
# path_synthesis_selected = '/kaggle/input/cea-synthesis/cea_synthesis_selected_no_dark/'
path_synthesis_selected = '/kaggle/input/cea-synthesis-threshold0/cea_synthesis_selected_no_dark_threshold0/'
names_synthesis_selected = os.listdir(path_synthesis_selected)
names_synt_selec = np.sort(names_synthesis_selected)
names_synt_selec = [i[:-4] for i in names_synt_selec]
names_synt_selec_root = [i[:-4] for i in names_synt_selec]
df_names_synt_selec = pd.DataFrame((names_synt_selec,names_synt_selec_root)).T
df_names_synt_selec.columns = ['synt_name', 'image_name']
# Add the data of each lesion to the DF of the synthesized images
train_df_synthesized = df_names_synt_selec.merge(train_df, on='image_name')
# change image_name to load the correct name
train_df_synthesized = train_df_synthesized.rename(columns={'image_name':'image_name_root'})
train_df_synthesized = train_df_synthesized.rename(columns={'synt_name':'image_name'})
# add a column to each dataframe to indicate the path where the images are located
train_df['imfolder'] = '/kaggle/input/jpeg-melanoma-256x256/train/'
test_df['imfolder'] = '/kaggle/input/jpeg-melanoma-256x256/train/'
# train_df_synthesized['imfolder'] = '/kaggle/input/cea-synthesis/cea_synthesis_selected_no_dark/'
train_df_synthesized['imfolder'] = '/kaggle/input/cea-synthesis-threshold0/cea_synthesis_selected_no_dark_threshold0/'
print(f"train_df = {train_df.shape}, train_df total 1's = {np.sum(train_df['target'].values)}, train_df total 0's = {np.sum(train_df['target'].values==0)}")
print(f"test_df = {test_df.shape}, test_df total 1's = {np.sum(test_df['target'].values)}, test_df total 0's = {np.sum(test_df['target'].values==0)}")
print(f"train_df_synthesized = {train_df_synthesized.shape}, train_df_synthesized total 1's = {np.sum(train_df_synthesized['target'].values)}, train_df_synthesized total 0's = {np.sum(train_df_synthesized['target'].values==0)}")
# By default we start training from 0
RESUME_TRAINING = False

In [None]:
# ## Test if you can Get the correct samples in train and val per fold

# skf = KFold(n_splits=5, shuffle=True, random_state=47)
# train_idx_all, val_idx_all = [], []
# count_1s_per_train_fold, count_1s_per_val_fold = [], []
# for fold, (train_idx, val_idx) in enumerate(skf.split(X=np.zeros(len(train_df)), y=train_df['target'], groups=train_df['patient_id'].tolist()), 1):
#     train_idx_all.append(train_idx)
#     val_idx_all.append(val_idx)
#     count_1s_per_val_fold.append(np.sum(train_df.iloc[val_idx]['target'].values))
#     count_1s_per_train_fold.append(np.sum(train_df.iloc[train_idx]['target'].values))
#     print(f'fold={fold}, train_idx={np.shape(train_idx)}, train_idx=[{train_idx[0]}-{train_idx[-1]}], val_idx={np.shape(val_idx)}, val_idx[{val_idx[0]}-{val_idx[-1]}]')
# #figure
# fold_samples = np.zeros(len(train_df))
# fold_samples[train_idx_all[2]]=1
# fold_samples_x = np.linspace(1,len(fold_samples),len(fold_samples))
# fig, ax = plt.subplots(2,1,figsize=(34,4))
# ax[0].scatter(fold_samples_x, fold_samples);
# ax[1].scatter(fold_samples_x[10000:10500], fold_samples[10000:10500]);
# print(f'count_1s_per_val_fold = {count_1s_per_val_fold}')
# print(f'count_1s_per_train_fold = {count_1s_per_train_fold}')


# df_train_in_fold_synt_added = get_synthesize_images_for_only_images_in_fold(train_df, train_df_synthesized, train_idx_all[3])
# print(f'df_train_in_fold_synt_added: {df_train_in_fold_synt_added.shape}')

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
# Dataset class adapted to get the path from the DF
class MelanomaDataset(Dataset):
    def __init__(self, df: pd.DataFrame, train: bool = True, transforms = None, meta_features = None):
        """
        Class initialization
        Args:
            df (pd.DataFrame): DataFrame with data description
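            df (pd.DataFrame): must contain 'image_name' and 'imfolder' columns ('imfolder' is the folder with the images; it is not a constructor argument)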
            train (bool): flag of whether a training dataset is being initialized or testing one
            transforms: image transformation method to be applied
            meta_features (list): list of features with meta information, such as sex and age
            
        """
        self.df = df
        self.transforms = transforms
        self.train = train
        self.meta_features = meta_features
        
    def __getitem__(self, index):
        # print(index)
        im_path = os.path.join(self.df.iloc[index]['imfolder'], self.df.iloc[index]['image_name'] + '.jpg')
        # print(im_path)
        x = cv2.imread(im_path)
        meta = np.array(self.df.iloc[index][self.meta_features].values, dtype=np.float32)

        if self.transforms:
            x = self.transforms(x)
            
        if self.train:
            y = self.df.iloc[index]['target']
            return (x, meta), y
        else:
            return (x, meta)
    
    def __len__(self):
        return len(self.df)
    
    
class Net(nn.Module):
    def __init__(self, arch, n_meta_features: int):
        super(Net, self).__init__()
        self.arch = arch
        if 'ResNet' in str(arch.__class__):
            self.arch.fc = nn.Linear(in_features=512, out_features=500, bias=True)
        if 'EfficientNet' in str(arch.__class__):
            self.arch._fc = nn.Linear(in_features=1280, out_features=500, bias=True)
        self.meta = nn.Sequential(nn.Linear(n_meta_features, 500),
                                  nn.BatchNorm1d(500),
                                  nn.ReLU(),
                                  nn.Dropout(p=0.2),
                                  nn.Linear(500, 250),  # FC layer output will have 250 features
                                  nn.BatchNorm1d(250),
                                  nn.ReLU(),
                                  nn.Dropout(p=0.2))
        self.ouput = nn.Linear(500 + 250, 1)
        
    def forward(self, inputs):
        """
        No sigmoid in forward because we are going to use BCEWithLogitsLoss
        Which applies sigmoid for us when calculating a loss
        """
        x, meta = inputs
        cnn_features = self.arch(x)
        meta_features = self.meta(meta)
        features = torch.cat((cnn_features, meta_features), dim=1)
        output = self.ouput(features)
        return output

In [None]:
class Microscope:
    """
    Cutting out the edges around the center circle of the image
    Imitating a picture, taken through the microscope

    Args:
        p (float): probability of applying an augmentation
    """

    def __init__(self, p: float = 0.5):
        self.p = p

    def __call__(self, img):
        """
        Args:
            img (np.ndarray): Image (H x W x C array, as loaded by cv2) to apply transformation to.

        Returns:
            np.ndarray: Image with transformation.
        """
        if random.random() < self.p:
            circle = cv2.circle((np.ones(img.shape) * 255).astype(np.uint8), # image placeholder
                        (img.shape[0]//2, img.shape[1]//2), # center point of circle
                        random.randint(img.shape[0]//2 - 3, img.shape[0]//2 + 15), # radius
                        (0, 0, 0), # color
                        -1)

            mask = circle - 255
            img = np.multiply(img, mask)
        
        return img

    def __repr__(self):
        return f'{self.__class__.__name__}(p={self.p})'
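
The line `mask = circle - 255` relies on unsigned 8-bit wraparound: inside the black disc the placeholder value is 0, and 0 - 255 wraps to 1, while outside it is 255 - 255 = 0. Multiplying the image by this mask keeps the pixels inside the circle and zeroes the rest. The sketch below emulates that arithmetic in plain Python with `% 256`. The function name `microscope_mask` is hypothetical, and the disc is rasterised with a simple distance test, so its edge may differ slightly from what `cv2.circle` draws.

```python
def microscope_mask(height, width, radius):
    """Binary mask: 1 inside the centred circle, 0 outside."""
    cy, cx = height // 2, width // 2
    mask = []
    for row in range(height):
        mask_row = []
        for col in range(width):
            inside = (row - cy) ** 2 + (col - cx) ** 2 <= radius ** 2
            placeholder = 0 if inside else 255  # black disc on a white canvas
            mask_row.append((placeholder - 255) % 256)  # emulates uint8 wraparound
        mask.append(mask_row)
    return mask

toy_mask = microscope_mask(5, 5, 2)
for mask_row in toy_mask:
    print(mask_row)
```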

In [None]:
train_transform = transforms.Compose([
    #AdvancedHairAugmentation(hairs_folder='/kaggle/input/melanoma-hairs'),
    transforms.RandomResizedCrop(size=256, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    Microscope(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225])
])
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225])
])

In [None]:
arch = EfficientNet.from_pretrained('efficientnet-b1');

In [None]:
# check intersection of train and test
bool(set(train_df['image_name'].values) & set(test_df['image_name'].values))

In [None]:
# # Make sure there are no train samples in the test subset and vice versa
# for i in train_df['image_name'].values:
#     assert(i not in test_df['image_name'].values)
# for i in test_df['image_name'].values:
#     assert(i not in train_df['image_name'].values)

In [None]:
# Check target distribution 
print(f'train_df_orig: {np.sum(train_df_orig["target"].values==1)/len(train_df_orig):.4f}, {np.sum(train_df_orig["target"].values==0)}, {np.sum(train_df_orig["target"].values==1)}')
print(f'train_df: {np.sum(train_df["target"].values==1)/len(train_df):.4f}, {np.sum(train_df["target"].values==0)}, {np.sum(train_df["target"].values==1)}')
print(f'test_df: {np.sum(test_df["target"].values==1)/len(test_df):.4f}, {np.sum(test_df["target"].values==0)}, {np.sum(test_df["target"].values==1)}')
fig, ax = plt.subplots(1,3, figsize=(9,2))
ax[0].hist(train_df_orig['target'].values);
ax[1].hist(train_df['target'].values);
ax[2].hist(test_df['target'].values);

In [None]:
meta_features = ['sex', 'age_approx'] + [col for col in train_df.columns if 'site_' in col]
meta_features.remove('anatom_site_general_challenge')
np.asarray(meta_features)

In [None]:
train_df2 = train_df

In [None]:
# OC check datasets and dataloaders
rand_ints = np.random.randint(0,len(train_df2),100)
df=train_df2.iloc[rand_ints].reset_index(drop=True)
sampler, samples_weight_py = get_sampler_for_imbalanced_classification(df)
train_or_test = MelanomaDataset(df, 
#                             imfolder='/kaggle/input/melanoma-external-malignant-256/train/train/', 
#                             imfolder='/kaggle/input/jpeg-melanoma-256x256/train/', 
                            train=True, 
                            transforms=train_transform,
                            meta_features=meta_features)
# WARNING: shuffle must be False when a sampler is passed (DataLoader treats the two options as mutually exclusive)
train_or_test_loader = DataLoader(dataset=train_or_test, batch_size=64, shuffle=False, num_workers=2,
                                  sampler=sampler)
train_or_test_loader_item = next(iter(train_or_test_loader))
print(f'X and Y = {len(train_or_test_loader_item)}')
print(f'X = {len(train_or_test_loader_item[0])} (image) and (meta features)')
print(f'image ={np.shape(train_or_test_loader_item[0][0])}, meta features = { np.shape(train_or_test_loader_item[0][1])}')
print(f'Y len = ({len(train_or_test_loader_item[1])}) ,number 1s: {torch.sum(train_or_test_loader_item[1])} first 5: {train_or_test_loader_item[1][:5]}')

In [None]:
# new test dataset. we get the images from the TRAIN folder
test = MelanomaDataset(df=test_df.reset_index(drop=True),
                       # imfolder='/kaggle/input/melanoma-external-malignant-256/test/test/', 
#                        imfolder='/kaggle/input/jpeg-melanoma-256x256/train/', 
                       train=False,
                       transforms=test_transform,  # For TTA
                       meta_features=meta_features)
test_loader = DataLoader(dataset=test, batch_size=16, shuffle=False, num_workers=2)
test_loader_item = next(iter(test_loader))

### check folds splits

In [None]:
# skf = KFold(n_splits=5, shuffle=True, random_state=47)
# train_idx_all = []
# count_1s_per_val_fold = []
# for fold, (train_idx, val_idx) in enumerate(skf.split(X=np.zeros(len(train_df2)), y=train_df2['target'], groups=train_df2['patient_id'].tolist()), 1):
#     train_idx_all.append(train_idx)
#     count_1s_per_val_fold.append(np.sum(train_df2.iloc[val_idx]['target'].values))
#     print(f'fold={fold}, train_idx={np.shape(train_idx)}, train_idx=[{train_idx[0]}-{train_idx[-1]}], val_idx={np.shape(val_idx)}, val_idx[{val_idx[0]}-{val_idx[-1]}]')
# #figure
# fold_samples = np.zeros(len(train_df2))
# fold_samples[train_idx_all[2]]=1
# fold_samples_x = np.linspace(1,len(fold_samples),len(fold_samples))
# plt.figure(figsize=(34,2))
# plt.scatter(fold_samples_x, fold_samples);
# print(f'count_1s_per_val_fold = {count_1s_per_val_fold}')

* run for one epoch to save the LAST models, 
* then make a new main loop block that loads the previous LAST models and epochs 

In [None]:
RESUME_TRAINING = True

In [None]:
USE_AUGMENTATIONS = True
WEIGHTED_SAMPLER = True
epochs = 10 # orig 12 
BATCH_SIZE_TRAIN = 64 # orig 64
BATCH_SIZE_VAL_TEST = 16
es_patience = 10 # orig 3  # Early Stopping patience - for how many epochs with no improvements to wait
# TTA = 3 # Test Time Augmentation rounds
start = time.time()
skf = KFold(n_splits=5, shuffle=True, random_state=47)

val_acc_all = [ [] for _ in range(skf.n_splits) ]
val_roc_all = [ [] for _ in range(skf.n_splits) ]
val_precision_all = [ [] for _ in range(skf.n_splits) ]
val_recall_all = [ [] for _ in range(skf.n_splits) ]
epoch_loss_all = [ [] for _ in range(skf.n_splits) ]
epoch_loss_val_all = [ [] for _ in range(skf.n_splits) ]

oof = np.zeros((len(train_df2), 1))  # Out Of Fold predictions
oof_all = np.zeros((len(train_df2), skf.n_splits))  # Out Of Fold predictions OC
preds = torch.zeros((len(test), 1), dtype=torch.float32, device=device)  # Predictions for test set
preds_separate = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X=np.zeros(len(train_df2)), y=train_df2['target'], groups=train_df2['patient_id'].tolist()), 1):
    print('=' * 20, 'Fold', fold, '=' * 20)  
    
    model_path = f'model_{fold}.pth'  # Path and filename to save model to
    best_val = 0  # Best validation score within this fold (for val_roc)
    # best_val = 1000  # Best validation score within this fold (for val_loss)
    patience = es_patience  # Current patience counter
    arch = EfficientNet.from_pretrained('efficientnet-b1')
    model = Net(arch=arch, n_meta_features=len(meta_features))  # New model for each fold
    model = model.to(device)
    
    optim = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = ReduceLROnPlateau(optimizer=optim, mode='max', patience=1, verbose=True, factor=0.2) # for val_roc
    # scheduler = ReduceLROnPlateau(optimizer=optim, mode='min', patience=1, verbose=True, factor=0.2) # for val_loss
    criterion = nn.BCEWithLogitsLoss()
    
    if USE_AUGMENTATIONS:
        train_df_aug = get_synthesize_images_for_only_images_in_fold(train_df, train_df_synthesized, train_idx)
    else:
        train_df_aug = train_df2.iloc[train_idx]
    
    if WEIGHTED_SAMPLER:
        # train_df_aug already reflects USE_AUGMENTATIONS, so a single call covers both cases
        sampler, samples_weight_py = get_sampler_for_imbalanced_classification(train_df_aug.reset_index(drop=True))

    train = MelanomaDataset(df=train_df_aug.reset_index(drop=True), 
                            # imfolder='/kaggle/input/melanoma-external-malignant-256/train/train/', 
                            # imfolder='/kaggle/input/jpeg-melanoma-256x256/train/', 
                            train=True, 
                            transforms=train_transform,
                            meta_features=meta_features)
    val = MelanomaDataset(df=train_df2.iloc[val_idx].reset_index(drop=True), 
                            # imfolder='/kaggle/input/melanoma-external-malignant-256/train/train/', 
                            # imfolder='/kaggle/input/jpeg-melanoma-256x256/train/', 
                            train=True, 
                            transforms=test_transform,
                            meta_features=meta_features)
    
    if WEIGHTED_SAMPLER:
        train_loader = DataLoader(dataset=train, batch_size=BATCH_SIZE_TRAIN, num_workers=2, sampler=sampler)
    else:
        train_loader = DataLoader(dataset=train, batch_size=BATCH_SIZE_TRAIN, shuffle=True, num_workers=2)
    val_loader = DataLoader(dataset=val, batch_size=BATCH_SIZE_VAL_TEST, shuffle=False, num_workers=2)
    test_loader = DataLoader(dataset=test, batch_size=BATCH_SIZE_VAL_TEST, shuffle=False, num_workers=2)
    
    for epoch in tqdm(range(epochs)):
        start_time = time.time()
        correct = 0
        epoch_loss = 0
        epoch_loss_val = 0
        model.train()
        
        for x, y in tqdm(train_loader, desc='train'):
            # the batch items are already tensors, so move/cast them instead of re-wrapping with torch.tensor
            x[0] = x[0].to(device, dtype=torch.float32)
            x[1] = x[1].to(device, dtype=torch.float32)
            y = y.to(device, dtype=torch.float32)
            optim.zero_grad()
            z = model(x)
            loss = criterion(z, y.unsqueeze(1))
            loss.backward()
            optim.step()
            # pred_proba_train = torch.sigmoid(z) # added by OMM
            pred = torch.round(torch.sigmoid(z))  # round off sigmoid to obtain predictions
            correct += (pred.cpu() == y.cpu().unsqueeze(1)).sum().item()  # tracking number of correctly predicted samples
            epoch_loss += loss.item()
        train_acc = correct / len(train)  # len(train) includes the synthesized rows; len(train_idx) does not
        
        model.eval()  # switch model to the evaluation mode
        val_preds = torch.zeros((len(val_idx), 1), dtype=torch.float32, device=device)
        with torch.no_grad():  # Do not calculate gradient since we are only predicting
            # Predicting on validation set
            for j, (x_val, y_val) in enumerate(val_loader):
                x_val[0] = x_val[0].to(device, dtype=torch.float32)
                x_val[1] = x_val[1].to(device, dtype=torch.float32)
                y_val = y_val.to(device, dtype=torch.float32)
                z_val = model(x_val)
                loss_val = criterion(z_val, y_val.unsqueeze(1))
                val_pred = torch.sigmoid(z_val)
                
                val_preds[j*val_loader.batch_size:j*val_loader.batch_size + x_val[0].shape[0]] = val_pred
                epoch_loss_val += loss_val.item()
            val_acc = accuracy_score(train_df2.iloc[val_idx]['target'].values, torch.round(val_preds.cpu()))
            val_roc = roc_auc_score(train_df2.iloc[val_idx]['target'].values, val_preds.cpu())
            val_precision = precision_score(train_df2.iloc[val_idx]['target'].values, torch.round(val_preds.cpu()))
            val_recall = recall_score(train_df2.iloc[val_idx]['target'].values, torch.round(val_preds.cpu()))
            
            val_acc_all[fold-1].append(val_acc)
            val_roc_all[fold-1].append(val_roc)
            val_precision_all[fold-1].append(val_precision)
            val_recall_all[fold-1].append(val_recall)
            
            epoch_loss_all[fold-1].append(epoch_loss)
            epoch_loss_val_all[fold-1].append(epoch_loss_val)
            
            print('Epoch {:03}: | Loss: {:.3f} | Train acc: {:.3f} | Val acc: {:.3f} | Val roc_auc: {:.3f} | Training time: {}'.format(
            epoch + 1, epoch_loss, train_acc, val_acc, val_roc, str(datetime.timedelta(seconds=time.time() - start_time))[:7]))
            
            # The scheduler is built with mode='max', so step it on val_roc rather than on the validation loss
            scheduler.step(val_roc)
                
            if val_roc >= best_val: #val_roc >= best_val epoch_loss_val <= best_val
                best_val = val_roc # best_val = val_roc best_val = epoch_loss_val
                patience = es_patience  # Resetting patience since we have a new best validation ROC AUC
                torch.save(model, model_path)  # Saving current best model
            else:
                patience -= 1
                if patience == 0:
                    print('Early stopping. Best Val roc_auc: {:.3f}'.format(best_val))
                    break
    
#     assert(1==2) # continue in this block
    # save model after epoch iterations
#     model_last_epoch_path = f'model_last_epoch_{fold}.pth'
#     torch.save(model, model_last_epoch_path)
    
    # TRAINING FINISHED (FOR THIS FOLD)
    model = torch.load(model_path)  # Loading best model of this fold
    model.eval()  # switch model to the evaluation mode
    val_preds = torch.zeros((len(val_idx), 1), dtype=torch.float32, device=device)
    test_preds = torch.zeros((len(test), 1), dtype=torch.float32, device=device)
    with torch.no_grad():
        # Predicting on validation set once again to obtain data for OOF
        for j, (x_val, y_val) in enumerate(val_loader):
            x_val[0] = x_val[0].to(device=device, dtype=torch.float32)
            x_val[1] = x_val[1].to(device=device, dtype=torch.float32)
            y_val = y_val.to(device=device, dtype=torch.float32)
            z_val = model(x_val)
            val_pred = torch.sigmoid(z_val)
            val_preds[j*val_loader.batch_size:j*val_loader.batch_size + x_val[0].shape[0]] = val_pred
        oof[val_idx] = val_preds.cpu().numpy()
        
        # Predicting on test set
        
        # Not using TTA (new block) 
        for i, x_test in tqdm(enumerate(test_loader), desc='test set', total = len(test_df)//BATCH_SIZE_VAL_TEST + 1):
            x_test[0] = x_test[0].to(device=device, dtype=torch.float32)
            x_test[1] = x_test[1].to(device=device, dtype=torch.float32)
            z_test = model(x_test)
            z_test = torch.sigmoid(z_test)
            test_preds[i*test_loader.batch_size:i*test_loader.batch_size + x_test[0].shape[0]] += z_test
        
        # preds = test_preds  # WARNING: use this line instead when working with a single fold
        preds += test_preds  # accumulate over folds (averaged after the loop) when working with 5 folds
        preds_separate.append(test_preds)
        # TTA predictions
        # tta_preds = torch.zeros((len(test), 1), dtype=torch.float32, device=device)
        # for _ in range(TTA):
            # for i, x_test in tqdm(enumerate(test_loader), desc='test'):
                # x_test[0] = torch.tensor(x_test[0], device=device, dtype=torch.float32)
                # x_test[1] = torch.tensor(x_test[1], device=device, dtype=torch.float32)
                # z_test = model(x_test)
                # z_test = torch.sigmoid(z_test)
                # tta_preds[i*test_loader.batch_size:i*test_loader.batch_size + x_test[0].shape[0]] += z_test
        # preds += tta_preds / TTA

        
preds /= skf.n_splits
stop = time.time()
print(f'epochs {epochs}: {(stop - start)/60:.3f} mins (in {device})')
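
The early-stopping rule used in the loop above (reset `patience` whenever `val_roc` matches or beats `best_val`, otherwise decrement and stop at zero) can be isolated for a quick sanity check. This is a minimal sketch: `epochs_until_stop` is a hypothetical helper that is not part of this notebook, and the starting value `best_val = 0.0` is an assumption.

```python
def epochs_until_stop(val_scores, es_patience):
    """Replay the early-stopping rule on a list of per-epoch validation scores.

    Returns (number of epochs actually run, best score seen).
    Hypothetical helper for illustration; it mirrors the patience logic only.
    """
    best_val = 0.0  # assumed starting value
    patience = es_patience
    epochs_run = 0
    for score in val_scores:
        epochs_run += 1
        if score >= best_val:
            best_val = score
            patience = es_patience  # new best: reset the counter
        else:
            patience -= 1
            if patience == 0:
                break
    return epochs_run, best_val


# Scores improve for three epochs, then stall; the late 0.95 is never reached.
scores = [0.80, 0.85, 0.90, 0.89, 0.88, 0.87, 0.95]
print(epochs_until_stop(scores, es_patience=3))
```

With a patience of 3, the run stops after the third consecutive epoch without improvement, so a later recovery is not seen. A larger `es_patience` trades training time for the chance to catch such late gains.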

In [None]:
!mkdir -p /kaggle/working/classification_results

In [None]:
!mv model_1.pth classification_results/model_1.pth
!mv model_2.pth classification_results/model_2.pth
!mv model_3.pth classification_results/model_3.pth
!mv model_4.pth classification_results/model_4.pth
!mv model_5.pth classification_results/model_5.pth

In [None]:
# val_acc_all, val_roc_all, val_precision_all, val_recall_all
df_val_acc_all = pd.DataFrame(val_acc_all).T
df_val_roc_all = pd.DataFrame(val_roc_all).T
df_val_precision_all = pd.DataFrame(val_precision_all).T
df_val_recall_all = pd.DataFrame(val_recall_all).T
# convert preds_separate from torch to DF
preds_separate_py = [np.squeeze(i.detach().cpu().numpy()) for i in preds_separate]
df_preds_separate_all = pd.DataFrame(preds_separate_py).T
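
Because each fold's test predictions are kept in `preds_separate`, the averaged `preds` can be cross-checked against the row-wise mean of the per-fold columns. A minimal sketch with made-up numbers (the array values and `n_splits = 3` are illustrative assumptions, not results from this notebook):

```python
import numpy as np

n_splits = 3
# One vector of test-set probabilities per fold (toy values).
fold_preds = [
    np.array([0.10, 0.80, 0.40]),
    np.array([0.20, 0.70, 0.50]),
    np.array([0.30, 0.90, 0.30]),
]

# Accumulate and divide once, as the training loop does.
running = np.zeros(3)
for p in fold_preds:
    running += p
running /= n_splits

# Same result from the stacked per-fold matrix (rows = samples, cols = folds).
stacked = np.stack(fold_preds, axis=1)
assert np.allclose(running, stacked.mean(axis=1))
print(running)
```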

In [None]:
# save results to folder
path_class_results = '/kaggle/working/classification_results/'
np.save(f'{path_class_results}epoch_loss_all.npy', np.array(epoch_loss_all, dtype=object))  # folds may differ in length
np.save(f'{path_class_results}epoch_loss_val_all.npy', np.array(epoch_loss_val_all, dtype=object))
np.save(f'{path_class_results}oof.npy',oof)
train_df.to_csv(f'{path_class_results}train_df.csv')
test_df.to_csv(f'{path_class_results}test_df.csv')
np.save(f'{path_class_results}preds.npy',preds.detach().cpu().numpy())
df_val_acc_all.to_csv(f'{path_class_results}df_val_acc_all.csv', index=False)
df_val_roc_all.to_csv(f'{path_class_results}df_val_roc_all.csv', index=False)
df_val_precision_all.to_csv(f'{path_class_results}df_val_precision_all.csv', index=False)
df_val_recall_all.to_csv(f'{path_class_results}df_val_recall_all.csv', index=False)
df_preds_separate_all.to_csv(f'{path_class_results}df_preds_separate_all.csv', index=False)
np.save(f'{path_class_results}WEIGHTED_SAMPLER.npy',WEIGHTED_SAMPLER)
np.save(f'{path_class_results}epoch.npy',[epoch])
np.save(f'{path_class_results}es_patience.npy',[es_patience])
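
With early stopping, folds can run for different numbers of epochs, so `epoch_loss_all` may be a ragged list of lists. A minimal sketch of a save/load round trip for such data, assuming an object array is acceptable (the temporary path and the toy values are illustrative):

```python
import os
import tempfile

import numpy as np

# Per-fold loss histories of different lengths (toy values).
losses = [[0.9, 0.7, 0.6], [0.8, 0.5]]

# dtype=object keeps each fold's history as its own list.
arr = np.array(losses, dtype=object)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'epoch_loss_all.npy')
    np.save(path, arr)
    # Object arrays are pickled, so loading requires allow_pickle=True.
    restored = np.load(path, allow_pickle=True)

print([list(fold) for fold in restored])
```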

In [None]:
!zip -qr classification_results.zip /kaggle/working/classification_results/

In [None]:
FileLink('classification_results.zip')

# END

In [None]:
path_source = '/kaggle/working/classification_results/'
# read files
df_val_acc_all = pd.read_csv(f'{path_source}df_val_acc_all.csv')
df_val_precision_all = pd.read_csv(f'{path_source}df_val_precision_all.csv')
df_val_recall_all = pd.read_csv(f'{path_source}df_val_recall_all.csv')
df_val_roc_all = pd.read_csv(f'{path_source}df_val_roc_all.csv')
epoch_loss_all = np.load(f'{path_source}epoch_loss_all.npy', allow_pickle=True)
epoch_loss_val_all = np.load(f'{path_source}epoch_loss_val_all.npy', allow_pickle=True)
oof = np.load(f'{path_source}oof.npy')
preds = np.load(f'{path_source}preds.npy')
WEIGHTED_SAMPLER = np.load(f'{path_source}WEIGHTED_SAMPLER.npy')
epoch = np.load(f'{path_source}epoch.npy', allow_pickle=True)
# transform dataframes into list of lists
val_acc_all = df_val_acc_all.T.values.tolist()
val_precision_all = df_val_precision_all.T.values.tolist()
val_recall_all = df_val_recall_all.T.values.tolist()
val_roc_all = df_val_roc_all.T.values.tolist()

In [None]:
epoch[0]

In [None]:
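# preds was already averaged over the folds above (preds /= skf.n_splits)
predictions = preds.detach().cpu().numpy() if torch.is_tensor(preds) else np.asarray(preds)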
fig, ax = plt.subplots(1,2, figsize=(12,3))
ax[0].plot(oof[:,0])
ax[0].set_title(f'oof val weighted={WEIGHTED_SAMPLER}')
ax[0].plot(train_df['target'].values * np.max(oof[:,0]), c='y')
ax[1].plot(predictions)
ax[1].plot(test_df['target'].values * np.max(predictions), c='y')
ax[1].set_title(f'test set weighted={WEIGHTED_SAMPLER}')

In [None]:
oof_val_true = train_df2['target'].values # we collect the whole train_df from the 5 folds of val
print('OOF_val ROC: {:.3f}'.format(roc_auc_score(oof_val_true, oof)))
print('OOF_val accuracy: {:.3f}'.format(accuracy_score(oof_val_true, oof.round())))
print('OOF_val precision: {:.5f}'.format(precision_score(oof_val_true, oof.round())))
print('OOF_val recall: {:.5f}'.format(recall_score(oof_val_true, oof.round())))

In [None]:
test_true = test_df['target'].values
print('test ROC: {:.3f}'.format(roc_auc_score(test_true, predictions)))
print('test accuracy: {:.3f}'.format(accuracy_score(test_true, predictions.round())))
print('test precision: {:.5f}'.format(precision_score(test_true, predictions.round())))
print('test recall: {:.5f}'.format(recall_score(test_true, predictions.round())))
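
The cell above mixes a ranking metric (ROC AUC) with thresholded metrics (`.round()` cuts at 0.5). The two react differently to rescaled scores: ROC AUC depends only on the ordering, while accuracy, precision and recall depend on where the scores fall relative to 0.5. A minimal sketch with toy labels and scores follows; `pairwise_auc` and `recall_at_half` are hypothetical stand-ins written for this example, not the scikit-learn functions.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Illustrative helper only.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def recall_at_half(y_true, scores):
    """Recall after thresholding the scores at 0.5."""
    y_true = np.asarray(y_true)
    predicted = np.asarray(scores) > 0.5
    return float(predicted[y_true == 1].mean())


y = [0, 0, 1, 1]
p = np.array([0.1, 0.4, 0.6, 0.9])

print(pairwise_auc(y, p), recall_at_half(y, p))          # original scores
print(pairwise_auc(y, p / 5), recall_at_half(y, p / 5))  # rescaled scores
```

Dividing the scores by a constant leaves the AUC untouched but pushes every score below 0.5, so the thresholded recall collapses. Any rescaling of `predictions` therefore has to be checked before reading the accuracy, precision and recall lines.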

In [None]:
plt.style.use('seaborn-white')  # renamed to 'seaborn-v0_8-white' in matplotlib 3.6; set before creating the figure
fig, ax = plt.subplots(1,4,figsize=(16,4))
titles_metrics = ['acc', 'roc', 'precision', 'recall']
for idx, metrics in enumerate([val_acc_all, val_roc_all, val_precision_all, val_recall_all]):
    for i in metrics:  # one curve per fold
        ax[idx].plot(i)
    ax[idx].set_title(titles_metrics[idx])
    if idx<3:
        ax[idx].set_ylim([.4, 1])
plt.suptitle(f'val metrics epochs:{epochs} weighted={WEIGHTED_SAMPLER}', fontsize=18)

In [None]:
sns.kdeplot(pd.Series(preds.cpu().numpy().reshape(-1,)));

In [None]:
# tta_preds is only defined when the commented-out TTA block in the training loop is enabled
# plt.plot(tta_preds.detach().cpu().numpy())

In [None]:
# Saving OOF predictions so stacking would be easier
pd.Series(oof.reshape(-1,)).to_csv('oof.csv', index=False)

In [None]:
sub = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/sample_submission.csv')
sub['target'] = preds.cpu().numpy().reshape(-1,)
sub.to_csv('submission.csv', index=False)

## Not used:

# Extra

In [None]:
# SEPARATE BLOCKS
# path_synthesis_selected = '/kaggle/input/cea-synthesis/cea_synthesis_selected_no_dark/'
# names_synthesis_selected = os.listdir(path_synthesis_selected)
# len(names_synthesis_selected)
#============
# names_synt_selec = np.sort(names_synthesis_selected)
# names_synt_selec = [i[:-4] for i in names_synt_selec]
# names_synt_selec_root = [i[:-4] for i in names_synt_selec]
# df_names_synt_selec = pd.DataFrame((names_synt_selec,names_synt_selec_root)).T
# df_names_synt_selec.columns = ['synt_name', 'image_name']
# print(df_names_synt_selec.shape)
# df_names_synt_selec.head()

# print(train_df.shape, df_names_synt_selec.shape)
# train_df_synthesized = df_names_synt_selec.merge(train_df, on='image_name')
# # change image_name to load the correct name
# train_df_synthesized = train_df_synthesized.rename(columns={'image_name':'image_name_root'})
# train_df_synthesized = train_df_synthesized.rename(columns={'synt_name':'image_name'})
# print(train_df_synthesized.shape)
# train_df_synthesized

# train_df['imfolder'] = '/kaggle/input/jpeg-melanoma-256x256/train/'
# test_df['imfolder'] = '/kaggle/input/jpeg-melanoma-256x256/train/'
# train_df_synthesized['imfolder'] = '/kaggle/input/cea-synthesis/cea_synthesis_selected_no_dark/'

# train_df_synt_added_BIG = train_df.append(train_df_synthesized, ignore_index=True, sort=False)
# train_df = train_df_synt_added_BIG
# print(train_df.shape)
# train_df.tail()

In [None]:
# # One-hot encoding of anatom_site_general_challenge feature
# concat = pd.concat([train_df['anatom_site_general_challenge'], test_df['anatom_site_general_challenge']], ignore_index=True)
# dummies = pd.get_dummies(concat, dummy_na=True, dtype=np.uint8, prefix='site')
# train_df = pd.concat([train_df, dummies.iloc[:train_df.shape[0]]], axis=1)
# test_df = pd.concat([test_df, dummies.iloc[train_df.shape[0]:].reset_index(drop=True)], axis=1)

# # Sex features
# train_df['sex'] = train_df['sex'].map({'male': 1, 'female': 0})
# test_df['sex'] = test_df['sex'].map({'male': 1, 'female': 0})
# train_df['sex'] = train_df['sex'].fillna(-1)
# test_df['sex'] = test_df['sex'].fillna(-1)

# # Age features
# train_df['age_approx'] /= train_df['age_approx'].max()
# test_df['age_approx'] /= test_df['age_approx'].max()
# train_df['age_approx'] = train_df['age_approx'].fillna(0)
# test_df['age_approx'] = test_df['age_approx'].fillna(0)

# train_df['patient_id'] = train_df['patient_id'].fillna(0)

In [None]:
# ## get_synthesize_images_for_only_images_in_fold():
# # These are the correct samples in the val subset
# INDEX_FOLD = 3
# print(f'train_idx: {train_idx_all[INDEX_FOLD][:5]}, val_idx: {val_idx_all[INDEX_FOLD][:5]}')
# df_val_in_fold = train_df.iloc[val_idx_all[INDEX_FOLD]].reset_index(drop=True)
# print(f'df_val_in_fold: {df_val_in_fold.shape}')
# # Here you should add only files selected by train_idx that also have synt
# df_train_in_fold = train_df.iloc[train_idx_all[INDEX_FOLD]].reset_index(drop=True)
# print(f'df_train_in_fold: {df_train_in_fold.shape}')
# print(f'train_df_synthesized: {train_df_synthesized.shape}')
# #From the images in train this fold get the ones in synt
# names_in_train = df_train_in_fold['image_name'].values
# names_unique_synt_all = np.unique(train_df_synthesized['image_name_root'].values)
# names_in_train_that_have_synt = list(set(names_in_train).intersection(names_unique_synt_all))
# print(f'names_in_train: {len(names_in_train)}')
# print(f'names_unique_synt_all: {len(names_unique_synt_all)}')
# print(f'names_in_train_that_have_synt: {len(names_in_train_that_have_synt)}')
# print('=============')
# # select those images from all the synthesized ones
# print(f'train_df_synthesized: {train_df_synthesized.shape}')
# train_synt_to_add_fold = train_df_synthesized[train_df_synthesized['image_name_root'].isin(names_in_train_that_have_synt)]
# print(f'train_synt_to_add_fold: {train_synt_to_add_fold.shape}')
# print('=============')
# # append the synthesized images to the current train fold
# df_train_in_fold_synt_added = df_train_in_fold.append(train_synt_to_add_fold)
# print(f'df_train_in_fold_synt_added: {df_train_in_fold_synt_added.shape}')
# df_train_in_fold_synt_added.tail(2)

In [None]:
# # OPTION 1
# train_df, test_df = make_train_test_df_for_cea_aug2(train_df_orig, df_cea_done, 
#                                                    test_set_size = 2000, test_set_1s = 450,
#                                                   add_0s_into_train_df = 1524-len(df_cea_done),#1500-len(df_cea_done)
#                                                   add_1s_into_train_df = 0) # 0
# np.shape(train_df), np.shape(test_df)
# print(f"train_df = {train_df.shape}, train_df total 1's = {np.sum(train_df['target'].values)}, train_df total 0's = {np.sum(train_df['target'].values==0)}")
# print(f"test_df = {test_df.shape}, test_df total 1's = {np.sum(test_df['target'].values)}, test_df total 0's = {np.sum(test_df['target'].values==0)}")

In [None]:
# class MelanomaDataset(Dataset):
#     def __init__(self, df: pd.DataFrame, imfolder: str, train: bool = True, transforms = None, meta_features = None):
#         """
#         Class initialization
#         Args:
#             df (pd.DataFrame): DataFrame with data description
#             imfolder (str): folder with images
#             train (bool): flag of whether a training dataset is being initialized or testing one
#             transforms: image transformation method to be applied
#             meta_features (list): list of features with meta information, such as sex and age
            
#         """
#         self.df = df
#         self.imfolder = imfolder
#         self.transforms = transforms
#         self.train = train
#         self.meta_features = meta_features
        
#     def __getitem__(self, index):
#         # print(index)
#         im_path = os.path.join(self.imfolder, self.df.iloc[index]['image_name'] + '.jpg')
#         # print(im_path)
#         x = cv2.imread(im_path)
#         meta = np.array(self.df.iloc[index][self.meta_features].values, dtype=np.float32)

#         if self.transforms:
#             x = self.transforms(x)
            
#         if self.train:
#             y = self.df.iloc[index]['target']
#             return (x, meta), y
#         else:
#             return (x, meta)
    
#     def __len__(self):
#         return len(self.df)
    
    
# class Net(nn.Module):
#     def __init__(self, arch, n_meta_features: int):
#         super(Net, self).__init__()
#         self.arch = arch
#         if 'ResNet' in str(arch.__class__):
#             self.arch.fc = nn.Linear(in_features=512, out_features=500, bias=True)
#         if 'EfficientNet' in str(arch.__class__):
#             self.arch._fc = nn.Linear(in_features=1280, out_features=500, bias=True)
#         self.meta = nn.Sequential(nn.Linear(n_meta_features, 500),
#                                   nn.BatchNorm1d(500),
#                                   nn.ReLU(),
#                                   nn.Dropout(p=0.2),
#                                   nn.Linear(500, 250),  # FC layer output will have 250 features
#                                   nn.BatchNorm1d(250),
#                                   nn.ReLU(),
#                                   nn.Dropout(p=0.2))
#         self.ouput = nn.Linear(500 + 250, 1)
        
#     def forward(self, inputs):
#         """
#         No sigmoid in forward because we are going to use BCEWithLogitsLoss
#         Which applies sigmoid for us when calculating a loss
#         """
#         x, meta = inputs
#         cnn_features = self.arch(x)
#         meta_features = self.meta(meta)
#         features = torch.cat((cnn_features, meta_features), dim=1)
#         output = self.ouput(features)
#         return output

In [None]:
# ORIGINAL DATAFRAME
# for fold, (train_idx, val_idx) in enumerate(skf.split(X=np.zeros(len(train_df)), y=train_df['target'], groups=train_df['patient_id'].tolist()), 1):
#     print(f'fold={fold}, train_idx={np.shape(train_idx)}, train_idx[0]={train_idx[0]}, val_idx={np.shape(val_idx)}, val_idx[0]={val_idx[0]}')
# train_idx

In [None]:
# # SMALLER test dataloader
# n_ints =50
# rand_ints = np.random.randint(0,len(test_df),n_ints)
# test = MelanomaDataset(df=test_df.iloc[rand_ints].reset_index(drop=True),
#                        imfolder='/kaggle/input/jpeg-melanoma-256x256/test/', 
#                        train=False,
#                        transforms=train_transform,  # For TTA
#                        meta_features=meta_features)

In [None]:
def make_train_test_df_for_cea_aug2(train_df_orig, df_cea_done, test_set_size = 2000, test_set_1s = 100, add_0s_into_train_df=0, add_1s_into_train_df=0):
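    '''Split train_df_orig into a train_df with every image listed in df_cea_done
    and a test_df drawn from the remaining images.
    test_set_1s and test_set_size limit the number of target==1 rows and the total test size;
    add_0s_into_train_df / add_1s_into_train_df move extra rows from the test pool into train_df.'''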
    # make a train_df that contains all the files that cea==completed & target==1
    image_name_cea_done = df_cea_done['image_name'].values
    train_df = train_df_orig[train_df_orig['image_name'].isin(image_name_cea_done)]
    # make sure test_df has all the files that cea!=completed & target==1
    test_df_large = train_df_orig[~train_df_orig['image_name'].isin(image_name_cea_done)]
    # divide in target==1 & target == 0
    test_df_only1s = test_df_large[test_df_large['target']==1]
    test_df_only0s = test_df_large[test_df_large['target']==0]
    test_df_only0s = test_df_only0s.reset_index(drop=True)
    test_df_only1s = test_df_only1s.reset_index(drop=True)
    # if add_0s_into_train_df more samples are wanted in train_df (of zeros)
    if add_0s_into_train_df >0:
        random.seed(0)
        rand_ints = random.sample(range(len(test_df_only0s)-1), add_0s_into_train_df)
        extra_0s_for_train_df = test_df_only0s.iloc[rand_ints]
        train_df = pd.concat([train_df, extra_0s_for_train_df])  # DataFrame.append was removed in pandas 2.0
        test_df_only0s =test_df_only0s.drop(rand_ints, axis=0)
    # if add_1s_into_train_df more samples are wanted in train_df (of ones)
    if add_1s_into_train_df >0:
        random.seed(0)
        rand_ints = random.sample(range(len(test_df_only1s)-1), add_1s_into_train_df)
        extra_1s_for_train_df = test_df_only1s.iloc[rand_ints]
        train_df = pd.concat([train_df, extra_1s_for_train_df])  # DataFrame.append was removed in pandas 2.0
        test_df_only1s = test_df_only1s.drop(rand_ints, axis=0)
    # if only a subset of the target == 1 is wanted
    if test_set_1s > 0:
        random.seed(0)
        rand_ints = random.sample(range(len(test_df_only1s)), test_set_1s)
        test_df_only1s = test_df_only1s.iloc[rand_ints]
        
    # if add_1s_into_train_df > 0 or add_0s_into_train_df > 0
    if add_1s_into_train_df > 0 or add_0s_into_train_df > 0:
        test_df_large = pd.concat([test_df_only1s, test_df_only0s])
        random.seed(0)
        rand_ints = random.sample(range(len(test_df_large)), len(test_df_large))
        test_df_large.index = rand_ints
        test_df_large = test_df_large.sort_index()  # sorting by the random index shuffles the rows
        test_df = test_df_large
    else:
        # get a large subset of test_df_large that contain all target 1 from test_df_large
        test_df = test_df_large
        if test_set_size > 0:
            number1s_already_in_test = len(test_df_only1s)
            random.seed(0)
            rand_ints = random.sample(range(len(test_df_only0s)), test_set_size - number1s_already_in_test)
            test_df_only0s_subset = test_df_only0s.iloc[rand_ints]
            test_df = pd.concat([test_df_only1s, test_df_only0s_subset])
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)
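
The core of the split above is: rows whose `image_name` appears in `df_cea_done` go to training, the rest form the test pool, and the pool is subsampled with a fixed seed so that it holds a chosen number of positives. Below is a minimal pure-Python sketch of that idea on toy records. `split_by_names` and its arguments are hypothetical, and the sketch ignores the optional `add_0s_into_train_df` / `add_1s_into_train_df` steps.

```python
import random


def split_by_names(records, done_names, test_size, test_positives, seed=0):
    """Split records into train (name in done_names) and a subsampled test set."""
    done = set(done_names)
    train = [r for r in records if r['image_name'] in done]
    pool = [r for r in records if r['image_name'] not in done]
    pos = [r for r in pool if r['target'] == 1]
    neg = [r for r in pool if r['target'] == 0]

    rng = random.Random(seed)  # local generator, leaves the global state alone
    test_pos = rng.sample(pos, test_positives)
    test_neg = rng.sample(neg, test_size - test_positives)
    return train, test_pos + test_neg


# Toy data: every fourth image is a positive.
records = [{'image_name': f'img_{i}', 'target': int(i % 4 == 0)} for i in range(20)]
train, test = split_by_names(records, done_names=['img_0', 'img_1'],
                             test_size=6, test_positives=2)
print(len(train), len(test), sum(r['target'] for r in test))
```

Using a local `random.Random(seed)` rather than `random.seed(0)` is a design choice of this sketch: it keeps the split reproducible without resetting the global generator that other cells may rely on.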

In [None]:
predictions = preds.detach().cpu().numpy()
predictions /=5
fig, ax = plt.subplots(1,2, figsize=(12,3))
ax[0].plot(oof[:,0])
ax[0].set_title(f'oof val weighted={WEIGHTED_SAMPLER}, ep={epochs}')
ax[1].plot(predictions)
ax[1].set_title(f'test set weighted={WEIGHTED_SAMPLER}, ep={epochs}')

In [None]:
predictions = preds.detach().cpu().numpy()
predictions /=5
fig, ax = plt.subplots(1,2, figsize=(12,3))
ax[0].plot(oof[:,0])
ax[0].set_title(f'oof val weighted={WEIGHTED_SAMPLER}')
ax[1].plot(predictions)
ax[1].set_title(f'test set weighted={WEIGHTED_SAMPLER}')

In [None]:
def make_train_test_df_for_cea_aug(train_df_orig, df_cea_done, samples_subset = 2000, samples_subset_tgt1 = 100):
    '''Split train_df_orig into a train_df with every image listed in df_cea_done
    and a test_df drawn from the remaining images.
    samples_subset_tgt1 and samples_subset limit the number of target==1 rows and the total test size.'''
    # make a train_df that contains all the files that cea==completed & target==1
    image_name_cea_done = df_cea_done['image_name'].values
    train_df = train_df_orig[train_df_orig['image_name'].isin(image_name_cea_done)]
    # make sure test_df has all the files that cea!=completed & target==1
    test_df_large = train_df_orig[~train_df_orig['image_name'].isin(image_name_cea_done)]
    # divide in target==1 & target == 0
    test_df_only1s = test_df_large[test_df_large['target']==1]
    test_df_only0s = test_df_large[test_df_large['target']==0]
    test_df_only0s = test_df_only0s.reset_index(drop=True)
    test_df_only1s = test_df_only1s.reset_index(drop=True)
    # if only a subset of the target == 1 is wanted
    if samples_subset_tgt1 > 0:
        random.seed(0)
        rand_ints = random.sample(range(len(test_df_only1s)), samples_subset_tgt1)
        test_df_only1s = test_df_only1s.iloc[rand_ints]
    # get a large subset of test_df_large that contains all target 1 from test_df_large
    test_df_only0s_subset = test_df_only0s  # keep every target==0 row when no subset size is given
    if samples_subset > 0:
        number1s_already_in_test = len(test_df_only1s)
        random.seed(0)
        rand_ints = random.sample(range(len(test_df_only0s)), samples_subset - number1s_already_in_test)
        test_df_only0s_subset = test_df_only0s.iloc[rand_ints]
    test_df = pd.concat([test_df_only1s, test_df_only0s_subset])  # DataFrame.append was removed in pandas 2.0
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)

In [None]:
# # THIS IS NOW DONE IN make_train_test_df_for_cea_aug()
# # make a train_df that contains all the files that cea==completed & target==1
# image_name_cea_done = df_cea_done['image_name'].values
# train_df = train_df_orig[train_df_orig['image_name'].isin(image_name_cea_done)]
# # make sure test_df has all the files that cea!=completed & target==1
# test_df_large = train_df_orig[~train_df_orig['image_name'].isin(image_name_cea_done)]
# print(f'train_df_orig = {train_df_orig.shape}, train_df 1s = {np.sum(train_df_orig["target"].values)}')
# print(f'train_df = {train_df.shape}, train_df 1s = {np.sum(train_df["target"].values)}')
# print(f'test_df_large = {test_df_large.shape}, test_df_large 1s = {np.sum(test_df_large["target"].values)}')
# # divide in target==1 & target == 0
# test_df_only1s = test_df_large[test_df_large['target']==1]
# test_df_only0s = test_df_large[test_df_large['target']==0]
# test_df_only0s = test_df_only0s.reset_index(drop=True)
# test_df_only1s = test_df_only1s.reset_index(drop=True)
# print(np.shape(test_df_only1s), np.shape(test_df_only0s))
# # if only a subset of the target == 1 is wanted
# n_ints = 100
# if n_ints > 0:
#     random.seed(0)
#     rand_ints = random.sample(range(len(test_df_only1s)), n_ints)
#     test_df_only1s = test_df_only1s.iloc[rand_ints]
# print(np.shape(test_df_only1s))
# # get a large subset of test_df_large that contain all target 1 from test_df_large
# number1s_already_in_test = len(test_df_only1s)
# n_ints = 2000
# random.seed(0)
# rand_ints = random.sample(range(len(test_df_only0s)), n_ints - number1s_already_in_test)
# test_df_only0s_subset = test_df_only0s.iloc[rand_ints]
# print(np.shape(test_df_only0s_subset))
# df_test = test_df_only1s.append(test_df_only0s_subset)
# print(df_test.shape)
# df_test.tail(5)

In [None]:
# # OC check a df with a subset
# train_df2 = train_df[train_df['image_name'].isin(df_cea['image_name'].values)]
# print(train_df2.shape)
# train_df2 = train_df2[train_df2['age_approx'].notna()]
# print(train_df2.shape)
# train_df2.tail()
# # OC add fake target 1s just to see if the code works
# n_ints =20
# rand_ints = np.random.randint(0,len(train_df2),n_ints)
# for i in rand_ints:
#     train_df2['target'].iloc[i] = 1
# print(np.sum(train_df2['target'].values==1)/len(train_df2), np.sum(train_df2['target'].values==1))
# plt.figure(figsize=(3,2))
# plt.hist(train_df2['target'].values);

In [None]:
# # OC just checking shapes
# extra_image_path = os.path.join('/kaggle/input/melanoma-external-malignant-256/train/train', 
#                                 train_df2.iloc[10]['image_name'] + '.jpg')
# new_image_path = os.path.join('/kaggle/input/jpeg-melanoma-256x256/train', 
#                                 train_df2.iloc[10]['image_name'] + '.jpg')
# x1 = cv2.imread(extra_image_path)
# x2 = cv2.imread(new_image_path)
# fig, ax = plt.subplots(1,2)
# ax[0].imshow(x1)
# ax[1].imshow(x2)