# Intro

I want to see how different models perform on this data, so I set up this simple baseline.

See the notebook's version history for each model's training log and metrics.


### Model List (training baselines added for each model):
* InceptionV3 [Version2-TPU](https://www.kaggle.com/tianyu5/tpus-cassava-leaf-disease?scriptVersionId=49382127)
* resnet50v2
* resnet101v2 [Version4-TPU](https://www.kaggle.com/tianyu5/tpus-cassava-leaf-disease?scriptVersionId=49386174)
* resnet152v2
* InceptionResnetV2 [Version6-TPU](https://www.kaggle.com/tianyu5/tpus-cassava-leaf-disease?scriptVersionId=49390571)
* DenseNet121 [Version8-TPU](https://www.kaggle.com/tianyu5/tpus-cassava-leaf-disease?scriptVersionId=49397189)
* Xception [Version9-TPU](https://www.kaggle.com/tianyu5/tpus-cassava-leaf-disease?scriptVersionId=49403249)
* VGG16 [Version14-TPU]
* NASNetLarge (debugged, waiting to be trained; it is fairly slow)
* ResNet50  [Version12](https://www.kaggle.com/tianyu5/tpus-cassava-leaf-disease?scriptVersionId=49442732)
* ResNet101  [Version13](https://www.kaggle.com/tianyu5/tpus-cassava-leaf-disease?scriptVersionId=49445531)

GPU for EfficientNet (Kaggle's TPU image currently ships TF 2.2, which does not support EfficientNet, so these models are trained on GPU)
* EfficientNetB0 [Version11-GPU](https://www.kaggle.com/tianyu5/tpus-cassava-leaf-disease?scriptVersionId=49419318)
* EfficientNetB3 TBD (runs out of memory)


**To switch models, change only these two lines (the model name and the backbone function):**

```
MODEL_NAME = 'ResNet101'
......
base_model, preprocess_layer = resnet101_base()

```
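
If editing two lines by hand becomes error-prone, one option is a lookup table from model name to backbone function, so that only `MODEL_NAME` changes. The sketch below is not part of the original notebook: `MODEL_BUILDERS` and `get_base` are hypothetical names, and the two builders are stubs standing in for the notebook's real `*_base()` functions so that the snippet runs without TensorFlow.

```python
# Stub builders standing in for the notebook's real *_base() functions
# (assumption: the real ones return a (base_model, preprocess_layer) pair).
def resnet101_base():
    return "ResNet101 backbone", "resnet preprocess layer"

def vgg16_base():
    return "VGG16 backbone", "vgg16 preprocess layer"

# Hypothetical registry: model name -> backbone builder.
MODEL_BUILDERS = {
    "ResNet101": resnet101_base,
    "VGG16": vgg16_base,
}

def get_base(model_name):
    """Return (base_model, preprocess_layer) for a registered model name."""
    try:
        builder = MODEL_BUILDERS[model_name]
    except KeyError:
        raise ValueError(f"Unknown model name: {model_name!r}") from None
    return builder()

MODEL_NAME = "ResNet101"
base_model, preprocess_layer = get_base(MODEL_NAME)
print(base_model, "|", preprocess_layer)
```

With a table like this, a typo in the model name fails immediately with a clear error instead of silently training a different backbone than the one recorded in the checkpoint file name.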

### Submission kernel:
* [TPUs + Cassava Leaf Disease[Infer]](https://www.kaggle.com/tianyu5/tpus-cassava-leaf-disease-infer)

### References
Thanks to these kernels for helping me get started:

* [Getting Started: TPUs + Cassava Leaf Disease](https://www.kaggle.com/jessemostipak/getting-started-tpus-cassava-leaf-disease)
* [Tensorflow Resnet50 (train with new tfrecords)](https://www.kaggle.com/wuliaokaola/tensorflow-resnet50-train-with-new-tfrecords)

### Baseline metrics for each model

| model | image size | epoch | train acc | val acc | lb | notes |
|:----|----|----|----|----|----|----|
|InceptionV3|512|25/25|0.9023|0.8504|0.843| |
|ResNet50|512|23/25|0.9301|0.8658|0.851| |
|ResNet101|512|16/25|0.9119|0.8667| |overfits afterwards|
|ResNet152|512|17/25|0.9268|0.8640|-|-|
|ResNet101V2|512|19/25|0.9322|0.8400|0.827|overfits afterwards|
|ResNet152V2|512|23/25|0.9707|0.8339| |-|
|InceptionResnetV2|512|22/25|0.8749|0.8369| |-|
|DenseNet121|512|24/25|0.8916|0.8755|0.8620|curves look fine; not fully converged after 25 epochs|
|EfficientNetB0|300|30/32|0.9030|0.8488| |larger images run out of GPU memory|
|Xception|512|25/25|0.8082|0.8105| |not fully converged after 25 epochs|



# Set up environment

In [None]:
import math, re, os
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from kaggle_datasets import KaggleDatasets
from tensorflow import keras
import tensorflow.keras.backend as K
from functools import partial
from sklearn.model_selection import train_test_split

print("Tensorflow version " + tf.__version__)

# Detect TPU

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Device:', tpu.master())
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:  # raised when no TPU is found; fall back to the default strategy
    strategy = tf.distribute.get_strategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

# Set up variables

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
GCS_PATH = KaggleDatasets().get_gcs_path("cassava-leaf-disease-classification")  # official competition dataset
GCS_PATH_MY = KaggleDatasets().get_gcs_path("cassava-leaf-tfrecord-512")  # self-made TFRecords
BATCH_SIZE = 16 * strategy.num_replicas_in_sync  # TPU=16
IMAGE_SIZE = [512, 512]  # image size stored in the TFRecords
RESIZE_IMAGE_SIZE = [512, 512]  # image size after augmentation/resizing: 512 on TPU, 300 on GPU (larger runs out of memory)
CLASSES = ['0', '1', '2', '3', '4']
EPOCHS = 25

# Load the data


## Decode the data
In the code chunk below we set up a series of functions that convert our images into tensors so that we can use them in our model. Our images use a Red, Green, Blue (RGB) scale with values in the range [0, 255]. Note that `decode_image` only casts the pixels to `float32` and keeps that range; scaling is left to the model-specific `preprocess_input` layer that is added when the model is built.

In [None]:
def decode_image(image):
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.cast(image, tf.float32)
    image = tf.reshape(image, [*IMAGE_SIZE, 3])
    return image
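
Since pixel values stay in [0, 255] after decoding, it may help to see how two common rescalings differ. The sketch below uses plain Python and hypothetical helper names (`scale_unit`, `scale_tf_mode`). The second mirrors the "tf" mode of the Keras `preprocess_input` functions used by, for example, InceptionV3 and Xception, which maps pixels to [-1, 1]. Other backbones such as ResNet50 or VGG16 subtract per-channel ImageNet means instead, so treat this only as an illustration of the idea.

```python
def scale_unit(pixel):
    """Map a pixel value from [0, 255] to [0, 1]."""
    return pixel / 255.0

def scale_tf_mode(pixel):
    """Map a pixel value from [0, 255] to [-1, 1] ("tf" mode scaling)."""
    return pixel / 127.5 - 1.0

# Compare both scalings at the ends and the middle of the pixel range.
for value in (0, 127.5, 255):
    print(value, scale_unit(value), scale_tf_mode(value))
```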

In [None]:
def read_tfrecord(example, labeled):
    tfrecord_format = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.int64)
    } if labeled else {
        "image": tf.io.FixedLenFeature([], tf.string),
        "image_name": tf.io.FixedLenFeature([], tf.string)
    }
    example = tf.io.parse_single_example(example, tfrecord_format)
    image = decode_image(example['image'])
    if labeled:
        label = tf.cast(example['target'], tf.int32)
        return image, label
    idnum = example['image_name']
    return image, idnum

We'll use the following function to load our dataset. A TPU consumes batches very quickly, so the input pipeline can easily become the bottleneck. To avoid that, we read from multiple TFRecord files in parallel and use records as soon as they stream in, rather than waiting to preserve their original order.

In [None]:
def load_dataset(filenames, labeled=True, ordered=False):
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False # disable order, increase speed
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE) # automatically interleaves reads from multiple files
    dataset = dataset.with_options(ignore_order) # uses data as soon as it streams in, rather than in its original order
    dataset = dataset.map(partial(read_tfrecord, labeled=labeled), num_parallel_calls=AUTOTUNE)
    return dataset

## A note on using train_test_split()
While I used `train_test_split()` to create both a `training` and `validation` dataset, consider exploring **[cross validation instead](https://www.kaggle.com/dansbecker/cross-validation)**.

In [None]:
TRAINING_FILENAMES, VALID_FILENAMES = train_test_split(
#     tf.io.gfile.glob(GCS_PATH_NEW + '/*.tfrec'),
    tf.io.gfile.glob(GCS_PATH_MY + '/train_tfrecords/*.tfrec'),
    test_size=0.35, random_state=5
)

TEST_FILENAMES = tf.io.gfile.glob(GCS_PATH + '/test_tfrecords/ld_test*.tfrec')
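
As a rough illustration of the cross-validation alternative mentioned above, the file list can be partitioned into folds so that every TFRecord file serves as validation data exactly once. This is a minimal pure-Python sketch, not the notebook's method: `kfold_splits` is a hypothetical helper and the file names are made up.

```python
def kfold_splits(filenames, n_splits=5):
    """Yield (train_files, valid_files) pairs, one pair per fold."""
    files = sorted(filenames)
    for fold in range(n_splits):
        valid = files[fold::n_splits]          # every n_splits-th file, offset by fold
        valid_set = set(valid)
        train = [f for f in files if f not in valid_set]
        yield train, valid

# Hypothetical file names, used only for illustration.
example_files = [f"train-{i:02d}-1338.tfrec" for i in range(10)]

for fold, (train_files, valid_files) in enumerate(kfold_splits(example_files)):
    print(fold, len(train_files), len(valid_files))
```

Each fold would then be trained with its own model instance, and the validation scores averaged across folds.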

## Adding in augmentations 

In [None]:
def data_augment(image, label):
    # Thanks to the dataset.prefetch(AUTOTUNE) statement in the following function this happens essentially for free on TPU.
    # Data pipeline code is executed on the "CPU" part of the TPU while the TPU itself is computing gradients.
    image = tf.image.random_flip_left_right(image)
    # TODO: add more augmentations
    image = tf.reshape(image, [*IMAGE_SIZE, 3])  # the model's own preprocessing (reshape) layer was removed, so reshape here
    if IMAGE_SIZE != RESIZE_IMAGE_SIZE:
        image = tf.image.resize(image, RESIZE_IMAGE_SIZE)

    return image, label

In [None]:
def data_val_augment(image, label):
    # preprocessing for validation images (no random augmentation)
    image = tf.reshape(image, [*IMAGE_SIZE, 3])  # the model's own preprocessing (reshape) layer was removed, so reshape here
    if IMAGE_SIZE != RESIZE_IMAGE_SIZE:
        image = tf.image.resize(image, RESIZE_IMAGE_SIZE)
    return image, label

## Define data loading methods
The following functions will be used to load our `training`, `validation`, and `test` datasets, as well as print out the number of images in each dataset.

In [None]:
def get_training_dataset():
    dataset = load_dataset(TRAINING_FILENAMES, labeled=True)  
    dataset = dataset.map(data_augment, num_parallel_calls=AUTOTUNE)  
    dataset = dataset.repeat()
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

In [None]:
def get_validation_dataset(ordered=False):
    dataset = load_dataset(VALID_FILENAMES, labeled=True, ordered=ordered) 
    dataset = dataset.map(data_val_augment, num_parallel_calls=AUTOTUNE)  
    dataset = dataset.batch(BATCH_SIZE)
    if strategy.num_replicas_in_sync > 1:  # cache only on TPU (more memory); leave it off on GPU
        dataset = dataset.cache()  
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

In [None]:
def get_test_dataset(ordered=False):
    dataset = load_dataset(TEST_FILENAMES, labeled=False, ordered=ordered)
    dataset = dataset.map(data_val_augment, num_parallel_calls=AUTOTUNE)  
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

In [None]:
def count_data_items(filenames):
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)
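
The count relies on each TFRecord file name carrying its record count between the last hyphen and the extension. A small stand-alone sketch of the same regular expression, using made-up file names and a hypothetical helper `count_items_from_names`:

```python
import re

def count_items_from_names(filenames):
    """Sum the record counts encoded in names like 'prefix-1338.tfrec'."""
    pattern = re.compile(r"-([0-9]*)\.")
    return sum(int(pattern.search(name).group(1)) for name in filenames)

# Hypothetical file names: two files with 1338 records each.
example_names = ["ld_train00-1338.tfrec", "ld_train01-1338.tfrec"]
print(count_items_from_names(example_names))  # prints 2676
```

If a file name does not follow this convention, `pattern.search` returns `None` and the call fails, so the naming scheme is effectively part of the data contract.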

In [None]:
NUM_TRAINING_IMAGES = count_data_items(TRAINING_FILENAMES)
NUM_VALIDATION_IMAGES = count_data_items(VALID_FILENAMES)
NUM_TEST_IMAGES = count_data_items(TEST_FILENAMES)

print('Dataset: {} training images, {} validation images, {} (unlabeled) test images'.format(
    NUM_TRAINING_IMAGES, NUM_VALIDATION_IMAGES, NUM_TEST_IMAGES))

# Brief exploratory data analysis (EDA)
First we'll print out the shapes and labels for a sample of each of our three datasets:

In [None]:
print("Training data shapes:")
for image, label in get_training_dataset().take(3):
    print(image.numpy().shape, label.numpy().shape)
print("Training data label examples:", label.numpy())
print("Validation data shapes:")
for image, label in get_validation_dataset().take(3):
    print(image.numpy().shape, label.numpy().shape)
print("Validation data label examples:", label.numpy())
print("Test data shapes:")
for image, idnum in get_test_dataset().take(3):
    print(image.numpy().shape, idnum.numpy().shape)
print("Test data IDs:", idnum.numpy().astype('U')) # U=unicode string

The following code chunk sets up a series of functions that will print out a grid of images. The grid of images will contain images and their corresponding labels.

In [None]:
# numpy and matplotlib defaults
np.set_printoptions(threshold=15, linewidth=80)

def batch_to_numpy_images_and_labels(data):
    images, labels = data
    numpy_images = images.numpy().astype(int)  # mind the range: floats must be in [0., 1.], integers in [0, 255]
    numpy_labels = labels.numpy()
    if numpy_labels.dtype == object: # binary string in this case, these are image ID strings
        numpy_labels = [None for _ in enumerate(numpy_images)]
    # If no labels, only image IDs, return None for labels (this is the case for test data)
    return numpy_images, numpy_labels

def title_from_label_and_target(label, correct_label):
    if correct_label is None:
        return CLASSES[label], True
    correct = (label == correct_label)
    return "{} [{}{}{}]".format(CLASSES[label], 'OK' if correct else 'NO', u"\u2192" if not correct else '',
                                CLASSES[correct_label] if not correct else ''), correct

def display_one_plant(image, title, subplot, red=False, titlesize=16):
    plt.subplot(*subplot)
    plt.axis('off')
    plt.imshow(image)
    if len(title) > 0:
        plt.title(title, fontsize=int(titlesize) if not red else int(titlesize/1.2), color='red' if red else 'black', fontdict={'verticalalignment':'center'}, pad=int(titlesize/1.5))
    return (subplot[0], subplot[1], subplot[2]+1)

def display_batch_of_images(databatch, predictions=None):
    """This will work with:
    display_batch_of_images(images)
    display_batch_of_images(images, predictions)
    display_batch_of_images((images, labels))
    display_batch_of_images((images, labels), predictions)
    """
    # data
    images, labels = batch_to_numpy_images_and_labels(databatch)
    if labels is None:
        labels = [None for _ in enumerate(images)]
        
    # auto-squaring: this will drop data that does not fit into square or square-ish rectangle
    rows = int(math.sqrt(len(images)))
    cols = len(images)//rows
        
    # size and spacing
    FIGSIZE = 13.0
    SPACING = 0.1
    subplot=(rows,cols,1)
    if rows < cols:
        plt.figure(figsize=(FIGSIZE,FIGSIZE/cols*rows))
    else:
        plt.figure(figsize=(FIGSIZE/rows*cols,FIGSIZE))
    
    # display
    for i, (image, label) in enumerate(zip(images[:rows*cols], labels[:rows*cols])):
        title = '' if label is None else CLASSES[label]
        correct = True
        if predictions is not None:
            title, correct = title_from_label_and_target(predictions[i], label)
        dynamic_titlesize = FIGSIZE*SPACING/max(rows,cols)*40+3 # magic formula tested to work from 1x1 to 10x10 images
        subplot = display_one_plant(image, title, subplot, not correct, titlesize=dynamic_titlesize)
    
    #layout
    plt.tight_layout()
    if label is None and predictions is None:
        plt.subplots_adjust(wspace=0, hspace=0)
    else:
        plt.subplots_adjust(wspace=SPACING, hspace=SPACING)
    plt.show()
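
The "auto-squaring" step in `display_batch_of_images` can be checked in isolation. The sketch below repeats the same arithmetic in a hypothetical helper, `grid_shape`, and also reports how many images are dropped because they do not fit the grid.

```python
import math

def grid_shape(num_images):
    """Return (rows, cols, dropped) for a square-ish image grid."""
    rows = int(math.sqrt(num_images))
    cols = num_images // rows
    dropped = num_images - rows * cols
    return rows, cols, dropped

for n in (1, 10, 16, 20):
    print(n, grid_shape(n))
```

A batch of 20 images, as used in the EDA cells, gives a 4 x 5 grid with nothing dropped, while a batch of 10 gives a 3 x 3 grid and drops one image.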

In [None]:
# load our training dataset for EDA
training_dataset = get_training_dataset()
training_dataset = training_dataset.unbatch().batch(20)
train_batch = iter(training_dataset)

In [None]:
# run this cell again for another randomized set of training images
display_batch_of_images(next(train_batch))

You can also modify the above code to look at your `validation` and `test` data, like this:

In [None]:
# load our validation dataset for EDA
validation_dataset = get_validation_dataset()
validation_dataset = validation_dataset.unbatch().batch(20)
valid_batch = iter(validation_dataset)

In [None]:
# run this cell again for another set of validation images
display_batch_of_images(next(valid_batch))

In [None]:
# load our test dataset for EDA
testing_dataset = get_test_dataset()
testing_dataset = testing_dataset.unbatch().batch(20)
test_batch = iter(testing_dataset)

In [None]:
# we only have one test image
display_batch_of_images(next(test_batch))

# Building the model
## Learning rate schedule

In [None]:
lr_scheduler = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-5, 
    decay_steps=10000, 
    decay_rate=0.9)
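
To get a feel for what this schedule does, the decay can be evaluated by hand. With the default non-staircase setting, `ExponentialDecay` computes `initial_learning_rate * decay_rate ** (step / decay_steps)`. The sketch below reproduces that formula in plain Python; `steps_per_epoch = 108` is an assumed figure for illustration, not a number taken from the training logs.

```python
def exponential_decay(step, initial_lr=1e-5, decay_steps=10000, decay_rate=0.9):
    """Learning rate at a given optimizer step (continuous, non-staircase decay)."""
    return initial_lr * decay_rate ** (step / decay_steps)

steps_per_epoch = 108  # assumption for illustration only
for epoch in (0, 5, 25):
    step = epoch * steps_per_epoch
    print(f"epoch {epoch:2d}  step {step:5d}  lr {exponential_decay(step):.3e}")
```

Under these assumed numbers the rate falls by only about 3% over 25 epochs, so the schedule behaves almost like a constant learning rate; a smaller `decay_steps` would be needed for a noticeably decaying rate.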

## Building our model
In order to ensure that our model is trained on the TPU, we build it using `with strategy.scope()`.    

In [None]:
# model bases
def inceptionv3_base():
    base_model = tf.keras.applications.InceptionV3(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.inception_v3.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def resnet152v2_base():
    base_model = tf.keras.applications.ResNet152V2(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.resnet_v2.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def resnet101v2_base():
    base_model = tf.keras.applications.ResNet101V2(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.resnet_v2.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def resnet50v2_base():
    base_model = tf.keras.applications.ResNet50V2(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.resnet_v2.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def resnet50_base():
    base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.resnet.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def resnet101_base():
    base_model = tf.keras.applications.ResNet101(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.resnet.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def inception_resnet_v2_base():
    base_model = tf.keras.applications.InceptionResNetV2(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.inception_resnet_v2.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

# DenseNet121 does not strictly need a preprocessing layer; one is added to keep the code consistent
def densenet121_base():
    base_model = tf.keras.applications.DenseNet121(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.densenet.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def xception_base():
    base_model = tf.keras.applications.Xception(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.xception.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def vgg16_base():
    base_model = tf.keras.applications.VGG16(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.vgg16.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def nasnet_large_base():
    base_model = tf.keras.applications.NASNetLarge(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.nasnet.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def efficientnet_b3_base():
    base_model = tf.keras.applications.EfficientNetB3(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.efficientnet.preprocess_input, input_shape=[*RESIZE_IMAGE_SIZE, 3])
    return base_model, preprocess_layer

def efficientnet_b0_base():
    base_model = tf.keras.applications.EfficientNetB0(weights='imagenet', include_top=False)
    preprocess_layer = tf.keras.layers.Lambda(tf.keras.applications.efficientnet.preprocess_input, input_shape=[*RESIZE_IMAGE_SIZE, 3])
    return base_model, preprocess_layer

In [None]:
MODEL_NAME = 'VGG16' # InceptionV3 ResNet101V2 ResNet152V2 InceptionResNetV2 DenseNet121 Xception VGG16 ResNet50 ResNet101 NASNetLarge (TPU)
#  EfficientNetB0 EfficientNetB3 (GPUs)
def build_model():
    with strategy.scope():   
        base_model, preprocess_layer = vgg16_base()
        
        model = tf.keras.Sequential([
            preprocess_layer,
            base_model,
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(len(CLASSES), activation='softmax')  
        ])

        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=lr_scheduler, epsilon=0.001),
            loss='sparse_categorical_crossentropy',  
            metrics=['sparse_categorical_accuracy'])
    return model


# Train the model
As our model is training you'll see a printout for each epoch, and can also monitor TPU usage by clicking on the TPU metrics in the toolbar at the top right of your notebook.

In [None]:
# load data
train_dataset = get_training_dataset()
valid_dataset = get_validation_dataset()

In [None]:
STEPS_PER_EPOCH = NUM_TRAINING_IMAGES // BATCH_SIZE
VALID_STEPS = NUM_VALIDATION_IMAGES // BATCH_SIZE

model = build_model()

save_model_callback = tf.keras.callbacks.ModelCheckpoint(
        "%s-best-{epoch:02d}-{val_sparse_categorical_accuracy:.4f}.h5"%(MODEL_NAME), 
        monitor='val_sparse_categorical_accuracy', 
        verbose=0, save_best_only=True,
        save_weights_only=False, mode='max', save_freq='epoch')

history = model.fit(train_dataset, 
                    steps_per_epoch=STEPS_PER_EPOCH, 
                    epochs=EPOCHS,
                    callbacks = [save_model_callback],
                    validation_data=valid_dataset,
                    validation_steps=VALID_STEPS)
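
The step counts above come from integer division, which has a small side effect worth knowing. The sketch below works through the arithmetic with assumed numbers: 8 replicas and hypothetical image counts, none of which are taken from the notebook's output.

```python
replicas = 8                      # assumption: 8 TPU replicas
batch_size = 16 * replicas        # same formula as BATCH_SIZE
num_training_images = 13910       # hypothetical count
num_validation_images = 7487      # hypothetical count

steps_per_epoch = num_training_images // batch_size
valid_steps = num_validation_images // batch_size
# Images in the final partial validation batch are never evaluated.
skipped_validation_images = num_validation_images - valid_steps * batch_size

print(batch_size, steps_per_epoch, valid_steps, skipped_validation_images)
# prints: 128 108 58 63
```

For training this is harmless because the dataset repeats, but for validation the floor division means the last partial batch is skipped, so the reported validation accuracy covers slightly fewer images than the validation set holds.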

In [None]:
# keep only the most recent best checkpoint and delete the earlier ones
best_weights = tf.io.gfile.glob( '*best*.h5')
best_weights.sort()
old_best_weights = best_weights[:-1]
for old_best_weight in old_best_weights:
    tf.io.gfile.remove(old_best_weight)
    
best_weight_path = best_weights[-1]
print(best_weight_path)
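
The cleanup depends on a plain string sort putting the most recent checkpoint last, which works because the epoch number in the file name is zero-padded (`{epoch:02d}`). A small sketch with made-up file names:

```python
# Hypothetical checkpoint names following the "<model>-best-<epoch>-<val acc>.h5" pattern.
checkpoints = [
    "VGG16-best-10-0.8450.h5",
    "VGG16-best-03-0.8012.h5",
    "VGG16-best-07-0.8301.h5",
]

checkpoints.sort()  # lexicographic order equals epoch order thanks to zero padding
old_checkpoints, latest = checkpoints[:-1], checkpoints[-1]

print(latest)           # VGG16-best-10-0.8450.h5
print(old_checkpoints)  # the two earlier checkpoints that would be deleted
```

Because the callback uses `save_best_only=True`, the most recent file is also the best one so far. Without zero padding, epoch 9 would sort after epoch 10 and the wrong file would be kept.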

## Model Summary

In [None]:
model.summary()

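# Save the model for the prediction/submission kernel

* Saving the full model directly from a TPU requires GCS; saving to local disk has problems. See: [Saving to file a model within TPUStrategy #36447](https://github.com/tensorflow/tensorflow/issues/36447)

* Following other kernels, one option is to save only the weights and define the architecture in code at prediction time.

* Saving on TPU: https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/148930#835254

* After looking at other kernels again, it turns out that calling `save` directly just works, which is more elegant. Done. The earlier failures may have been related to how some models were written.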



In [None]:
model.save('%s-512-last.h5'%(MODEL_NAME))

# Evaluating our model
The first chunk of code is provided to show you where the variables in the second chunk of code came from. As you can see, there's a lot of room for improvement in this model, but because we're using TPUs and have a relatively short training time, we're able to iterate on our model fairly rapidly.

In [None]:
# print out variables available to us
print(history.history.keys())

In [None]:
# create learning curves to evaluate model performance
history_frame = pd.DataFrame(history.history)
history_frame.loc[:, ['loss', 'val_loss']].plot()
history_frame.loc[:, ['sparse_categorical_accuracy', 'val_sparse_categorical_accuracy']].plot();
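
Besides plotting, the best epoch can be read straight from the history dictionary. The sketch below uses a made-up `history_dict` in place of `history.history`; the accuracy values are purely illustrative.

```python
# Hypothetical stand-in for history.history (values are illustrative only).
history_dict = {
    "val_sparse_categorical_accuracy": [0.71, 0.80, 0.84, 0.83, 0.86, 0.85],
}

val_acc = history_dict["val_sparse_categorical_accuracy"]
best_index = max(range(len(val_acc)), key=val_acc.__getitem__)
best_epoch = best_index + 1  # Keras reports epochs starting from 1

print(f"best epoch {best_epoch}/{len(val_acc)}, val acc {val_acc[best_index]:.4f}")
# prints: best epoch 5/6, val acc 0.8600
```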

# Making predictions
Now that we've trained our model we can use it to make predictions! 

In [None]:
# this code will convert our test image data to a float32 
def to_float32(image, label):
    return tf.cast(image, tf.float32), label

### Load the trained model

In [None]:
best_model =  tf.keras.models.load_model(best_weight_path)

In [None]:
test_ds = get_test_dataset(ordered=True) 
test_ds = test_ds.map(to_float32)

print('Computing predictions...')
# test_images_ds = testing_dataset
test_images_ds = test_ds.map(lambda image, idnum: image)
probabilities = best_model.predict(test_images_ds)
predictions = np.argmax(probabilities, axis=-1)
print(predictions)

# Creating a submission file
Now that we've trained a model and made predictions we're ready to submit to the competition! You can run the following code below to get your submission file.

In [None]:
print('Generating submission.csv file...')
test_ids_ds = test_ds.map(lambda image, idnum: idnum).unbatch()
test_ids = next(iter(test_ids_ds.batch(NUM_TEST_IMAGES))).numpy().astype('U') # all in one batch
np.savetxt('submission.csv', np.rec.fromarrays([test_ids, predictions]), fmt=['%s', '%d'], delimiter=',', 
           header='image_id,label', comments='')
!head submission.csv
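
For reference, the same two-column file layout can be produced with the standard library's `csv` module. This is only a sketch of the format, with made-up IDs and predictions, written to an in-memory buffer rather than to `submission.csv`.

```python
import csv
import io

test_ids = ["img_a.jpg", "img_b.jpg"]  # hypothetical image IDs
predictions = [3, 1]                   # hypothetical predicted class indices

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["image_id", "label"])        # header expected by the competition
writer.writerows(zip(test_ids, predictions))  # one row per test image

print(buffer.getvalue())
```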

Be aware that because this is a code competition with a hidden test set, internet and TPUs cannot be enabled on your submission notebook. Therefore TPUs will only be available for training models. For a walk-through on how to train on TPUs and run inference/submit on GPUs, see our [TPU Docs](https://www.kaggle.com/docs/tpu#tpu6).