If you are Korean, please check out this notebook:
https://www.kaggle.com/vkehfdl1/for-korean-cassava

# **Cassava Competition Baseline**

This competition is an image classification problem: classify images of cassava leaves. There are 5 classes: 4 diseases and 1 healthy class. The given images have a resolution of 600x800, and the test set is hidden.

Please see this EDA for the characteristics of each disease. Thanks to @ihelon (Yaroslav Isaienkov).
https://www.kaggle.com/ihelon/cassava-leaf-disease-exploratory-data-analysis

This baseline is based on the following notebook. Thanks to @frlemarchand (Francois Lemarchand).
https://www.kaggle.com/frlemarchand/efficientnet-aug-tf-keras-for-cassava-diseases

Detailed explanations are given in the code comments.

Feel free to leave questions in the comments below. For accurate information, consult the official documentation of libraries such as TensorFlow, NumPy, pandas, and scikit-learn.
* TensorFlow - https://www.tensorflow.org/api_docs/python/tf
* NumPy - https://numpy.org/doc/1.19/
* pandas - https://pandas.pydata.org/pandas-docs/stable/reference/index.html
* scikit-learn - https://scikit-learn.org/stable/modules/classes.html

# Import Libraries

In [None]:
import numpy as np 
import pandas as pd 
from PIL import Image
import os
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.utils import class_weight
from sklearn.preprocessing import minmax_scale
import random
import cv2
import warnings
warnings.filterwarnings('ignore')

import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Dense, Dropout, Activation, Input, BatchNormalization, GlobalAveragePooling2D
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.experimental import CosineDecay
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers.experimental.preprocessing import RandomCrop,CenterCrop, RandomRotation
from tensorflow.keras.optimizers import Adam

# Data Preprocessing (Image Load)

In [None]:
training_folder = '../input/cassava-leaf-disease-classification/train_images/' # Folder containing the training images
samples_df = pd.read_csv('../input/cassava-leaf-disease-classification/train.csv') # Load the training image file names and their labels
samples_df["filepath"] = training_folder+samples_df["image_id"] # Build the full path of each image (folder name + image name) so images are easy to load
samples_df = samples_df.drop(['image_id'],axis=1) # Drop the image names, which are no longer needed

In [None]:
samples_df = shuffle(samples_df, random_state=42) # Shuffle all rows randomly (fixed seed for reproducibility)
train_size = int(len(samples_df)*0.8) # Use 80% of the samples for training
training_df = samples_df[:train_size] # Training dataset (first 80%)
validation_df = samples_df[train_size:] # Validation dataset (remaining 20%)

In [None]:
batch_size = 8 # Set batch size
image_size = 512 # Set image size
input_shape = (image_size, image_size, 3) # Set image shape (3 channels per pixel because the images are in color)
dropout_rate = 0.4 # Set dropout rate
classes_to_predict = sorted(training_df.label.unique()) # Sorted list of class labels (5 classes in this competition)

In [None]:
"""
We wrap the training and validation data in tf.data.Dataset objects.
A tf.data pipeline loads and preprocesses data on the fly instead of holding every image in memory, and it can overlap data loading with training, which improves performance.
For details, please refer to the link below.
https://www.tensorflow.org/guide/data_performance?hl=en
"""
training_data = tf.data.Dataset.from_tensor_slices((training_df.filepath.values, training_df.label.values))
validation_data = tf.data.Dataset.from_tensor_slices((validation_df.filepath.values, validation_df.label.values))

In [None]:
def load_image_and_label_from_path(image_path, label): # Function : Load an image and convert it to a tensor (similar to an array)
    img = tf.io.read_file(image_path) # Read the file at the image path
    img = tf.image.decode_jpeg(img, channels=3) # Decode the JPEG into a 3-channel tensor
    img = tf.image.random_crop(img, size=[image_size,image_size,3]) # Randomly crop the image to the desired size. Use a central crop instead if you want the middle part of the image rather than a random region.
    return img, label

AUTOTUNE = tf.data.experimental.AUTOTUNE # Let tf.data choose the level of parallelism and the prefetch buffer size at runtime
training_data = training_data.map(load_image_and_label_from_path, num_parallel_calls=AUTOTUNE) # Load train data
validation_data = validation_data.map(load_image_and_label_from_path,num_parallel_calls=AUTOTUNE) # Load validation data
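
The comment in load_image_and_label_from_path mentions cropping the middle of the image instead of a random region. As a rough sketch, the offsets of a centred square window can be computed as below. The helper name center_crop_box is hypothetical, and the 600x800 size from the introduction is assumed here to mean height x width. The resulting values could then be passed to tf.image.crop_to_bounding_box(image, offset_height, offset_width, target_height, target_width).

```python
def center_crop_box(height, width, crop_size):
    """Return (offset_height, offset_width, target_height, target_width)
    for a square crop centred in an image of the given size."""
    if crop_size > height or crop_size > width:
        raise ValueError("crop_size must not exceed the image dimensions")
    offset_height = (height - crop_size) // 2  # rows skipped at the top
    offset_width = (width - crop_size) // 2    # columns skipped on the left
    return offset_height, offset_width, crop_size, crop_size


# Assumed image size: 600 rows x 800 columns; crop size as in image_size above.
box = center_crop_box(600, 800, 512)
print(box)  # (44, 144, 512, 512)
```

A deterministic crop like this is mainly useful for validation and test images, where repeatable results matter more than variety.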

In [None]:
# Shuffle, batch, and prefetch the training and validation data.
training_data_batches = training_data.shuffle(buffer_size=1000).batch(batch_size).prefetch(buffer_size=AUTOTUNE)
validation_data_batches = validation_data.shuffle(buffer_size=1000).batch(batch_size).prefetch(buffer_size=AUTOTUNE)

In [None]:
"""
Create an augmentation layer. When you put the augmentation layer in your model, it transforms images automatically during training.
If you want to apply more augmentations, please refer to this link :  https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing
Also have a look at imgaug and albumentations, two of the most powerful augmentation libraries.
"""
data_augmentation_layers = tf.keras.Sequential(
    [
        layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical"), #Flip images randomly
        layers.experimental.preprocessing.RandomRotation(0.25), #Rotate images randomly (by up to a quarter turn in either direction)
        layers.experimental.preprocessing.RandomZoom((-0.2, 0)), #Zoom in randomly by up to 20% (negative values zoom in)
        
    ]
)


# Build model & train model

In [None]:
"""
In this baseline, we use transfer learning. Transfer learning is a method that loads a large pre-trained model, adds custom layers at the end of it, and trains the result.
We use transfer learning because designing a deep network with many layers on our own is very hard.
We use EfficientNetB0. For details, please refer to this paper : https://arxiv.org/pdf/1905.11946.pdf

Caution!! You must turn on the Internet (press |< at the top right) to download the ImageNet weights.
"""
efficientnet = EfficientNetB0(weights="imagenet", #Download imagenet weights
                              include_top=False, 
                              input_shape=input_shape, 
                              drop_connect_rate=dropout_rate) #Load EfficientNetB0 model
efficientnet.trainable=True # Enable training EfficientNetB0

In [None]:
"""
It's okay to build your own CNN model
"""
model = Sequential() #Build new Sequential model 
model.add(Input(shape=input_shape)) #Set input to image size
model.add(data_augmentation_layers) #Add image augmentation layer
model.add(efficientnet) # Add EfficientNetB0
model.add(layers.GlobalAveragePooling2D()) # Add pooling layer
model.add(layers.Dropout(dropout_rate)) # Add dropout layer for avoiding overfitting
model.add(Dense(len(classes_to_predict), activation="softmax")) #Add last Dense layer. Classes to predict becomes output size
model.summary() #Check model

In [None]:
epochs = 30 #Set epochs
decay_steps = int(round(len(training_df)/batch_size))*epochs
cosine_decay = CosineDecay(initial_learning_rate=1e-4, decay_steps=decay_steps, alpha=0.3) #Use cosine decay : the learning rate decreases at every training step and ends at alpha (30%) of its initial value.
callbacks = [ModelCheckpoint(filepath='mymodel.h5', monitor='val_loss', save_best_only=True), # Save the best model (lowest validation loss) in .h5 format.
            EarlyStopping(monitor='val_loss', patience = 5, verbose=1)] #Stop training when the validation loss has not improved for 5 epochs.

model.compile(loss="sparse_categorical_crossentropy", optimizer=Adam(cosine_decay), metrics=["accuracy"]) #Loss is sparse_categorical_crossentropy and optimizer is Adam. Use accuracy for monitoring model's performance. 
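
To make the schedule above more concrete, here is a minimal sketch of the cosine decay formula in plain Python. The function name cosine_decay_lr and the value decay_steps=1000 are illustrative assumptions, not part of the notebook; the default initial_learning_rate and alpha match the values used above.

```python
import math


def cosine_decay_lr(step, initial_learning_rate=1e-4, decay_steps=1000, alpha=0.3):
    """Learning rate after `step` optimizer steps under cosine decay.

    alpha is the fraction of the initial learning rate that remains
    once decay_steps steps have passed.
    """
    step = min(step, decay_steps)  # the rate stays constant after decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_learning_rate * ((1.0 - alpha) * cosine + alpha)


# Illustrative schedule length; in the notebook it is steps_per_epoch * epochs.
for step in (0, 500, 1000, 2000):
    print(step, cosine_decay_lr(step, decay_steps=1000))
```

With these settings the rate starts at 1e-4, is about 6.5e-5 halfway through, and settles at 3e-5. The schedule is evaluated once per batch, which is why decay_steps is computed from the number of batches per epoch times the number of epochs.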

In [None]:
history = model.fit(training_data_batches, #Train model
                  epochs = epochs, 
                  validation_data=validation_data_batches,
                  callbacks=callbacks)
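
The early stopping rule used in the callbacks can be pictured with a small simulation. This is a simplified model of the idea (stop once the validation loss has failed to improve for `patience` consecutive epochs), not Keras's actual implementation; the function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop,
    or None if training runs through all epochs."""
    best = float("inf")  # lowest validation loss seen so far
    wait = 0             # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 3.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.5]
print(early_stopping_epoch(losses))  # 8
```

In this example training stops at epoch 8, so the better loss at epoch 9 is never reached. A larger patience trades longer training for a lower risk of stopping too early.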

# Predict test data (Submission)

The code below creates the submission. Kaggle reruns the whole notebook when you submit it, so retraining the model inside the submission notebook is very inefficient. You can save a lot of time by splitting training and submission into separate notebooks. The steps are as follows.

1. Find the saved .h5 file under 'output -> /kaggle/working', press the three dots on its right side, and download it.
2. Go to the Kaggle 'Data' tab and press '+New Dataset'.
3. Enter a dataset name and upload your .h5 file. Then press the 'Create' button.
4. Create a new notebook for submission and press 'Add data' at the upper right. Go to 'Your Dataset' and add the dataset you just made by pressing the blue 'Add' button.
5. Load your model using `model = tf.keras.models.load_model('filepath')`.
FYI : You can easily copy the file path by pressing the small button called 'Copy File path'.

In [None]:
"""
model = tf.keras.models.load_model('filepath')
"""

In [None]:
#Load test data, in the same way as the training data.
test_folder = '../input/cassava-leaf-disease-classification/test_images/' 
submission_df = pd.DataFrame(columns=["image_id","label","filepath"]) #Use a list (not a set) so the column order is well defined
submission_df["image_id"] =  os.listdir(test_folder) #Put the image names in the test folder into "image_id"
submission_df["label"] = 0
submission_df["filepath"] = test_folder+submission_df['image_id']
submission_df = submission_df.drop(['image_id'],axis=1)

In [None]:
def load_image(image_path): #Function : Load an image and convert it to a tensor (similar to an array)
    img = tf.io.read_file(image_path) #Read the file at the image path
    img = tf.image.decode_jpeg(img, channels=3) #Decode the JPEG into a 3-channel tensor
    img = tf.image.random_crop(img, size=[image_size,image_size,3]) # Crop the image to the model's input size. The crop is random, so predictions can differ between runs; use a central crop if you want the middle part of the image instead.
    img = np.reshape(img, [-1,image_size,image_size,3]) #Add a batch dimension (3-d tensor to 4-d tensor)
    return img

In [None]:
def test_predict(filepath):
    local_image = load_image(filepath) #load image using 'load_image' function
    predictions = model.predict(local_image) #Predict each class probabilities
    final_prediction = np.argmax(predictions) #Take the index of the class with the highest probability
    return final_prediction #Return prediction result
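
Because load_image takes a random crop, a single prediction depends on which part of the image happens to be cropped. One common remedy is to predict on several crops and average the class probabilities (a simple form of test-time augmentation). The sketch below shows only the averaging step in plain Python; the function name average_predictions and the probability values are made up for illustration. In the notebook, each row would come from a separate model.predict(load_image(filepath)) call.

```python
def average_predictions(probabilities):
    """Average several per-crop probability vectors.

    Returns the mean probability vector and the index of its largest entry.
    """
    n_crops = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_crops for c in range(n_classes)]
    best_class = max(range(n_classes), key=lambda c: mean[c])
    return mean, best_class


# Made-up class probabilities for three random crops of one image (5 classes).
crops = [
    [0.10, 0.20, 0.40, 0.25, 0.05],
    [0.05, 0.15, 0.30, 0.45, 0.05],
    [0.10, 0.10, 0.20, 0.55, 0.05],
]
mean, label = average_predictions(crops)
print(label)  # 3
```

Here the first crop alone would predict class 2, while the average over all three crops predicts class 3.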

In [None]:
def predictions_over_image(filepath):
    predictions = [] #List that stores the predictions
    for path in filepath:   
        predictions.append(test_predict(path)) # Save the prediction for each image
    return predictions #Return prediction values

In [None]:
submission_df["label"] = predictions_over_image(submission_df["filepath"]) # Put predictions to submission DataFrame

In [None]:
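submission_df["image_id"] = submission_df["filepath"].apply(os.path.basename) # Recover the image file names from the file paths
submission_df[["image_id", "label"]].to_csv("submission.csv", index=False) # Make the submission csv file with the image_id and label columns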

# Things you can try for improving score

1. Use larger models, e.g. EfficientNetB4
2. Use heavier augmentation, e.g. CoarseDropout, CutMix, Mixup
3. Use TTA (Test Time Augmentation)
4. Use stratified K-fold cross-validation, because the classes are imbalanced
5. Try a different callback strategy
6. Use another loss function, such as Symmetric Cross Entropy (FYI : https://arxiv.org/abs/1908.06112)
7. Use another optimizer, such as Lookahead or RAdam
8. Use more data: https://www.kaggle.com/tahsin/cassava-leaf-disease-merged
9. Use ensemble methods, e.g. bagging, voting, stacking
10. Design new models
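
As an illustration of the stratified K-fold idea from the list above, here is a minimal pure-Python sketch that assigns each sample to a fold so that every fold keeps roughly the same class proportions. The function name stratified_fold_ids and the label counts are made up for this example. In practice, sklearn.model_selection.StratifiedKFold provides a tested implementation.

```python
import random
from collections import Counter, defaultdict


def stratified_fold_ids(labels, n_splits=5, seed=42):
    """Assign a fold id (0 .. n_splits-1) to every sample so that each
    fold receives a similar share of every class."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    folds = [0] * len(labels)
    for indices in indices_by_class.values():
        rng.shuffle(indices)  # randomize order within each class
        for position, index in enumerate(indices):
            folds[index] = position % n_splits  # deal samples out round-robin
    return folds


# Made-up, imbalanced labels: class 3 is five times as common as class 0.
labels = [3] * 50 + [0] * 10 + [1] * 20 + [2] * 10 + [4] * 10
folds = stratified_fold_ids(labels)
fold0 = Counter(label for label, fold in zip(labels, folds) if fold == 0)
print(sorted(fold0.items()))  # [(0, 2), (1, 4), (2, 2), (3, 10), (4, 2)]
```

Each fold then serves once as the validation set while the other folds are used for training, so rare classes appear in every validation split.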

**If this notebook is helpful, please upvote :) Thank you!!**