# CS542 - Class Challenge - fine-grained classification of plants:

Our class challenge consists of two tasks built around one image recognition problem: our dataset contains about 1K categories of plants with only about 250,000 images.  There are two parts to this challenge:

1. Image classification. Imagine we have cataloged all the plants we care to identify, and now we just need to create a classifier for them! Use your skills from the supervised learning sections of this course to address this problem.

2. Semi-Supervised/Few-Shot Learning.  Unfortunately, we missed some important plants we want to classify!  We do have some images we think contain these plants, but we only have a few labels.  Our new goal is to develop an AI model that can learn from just these labeled examples.

Each student must submit a model on both tasks.  Students in the top 3 on each task will get 5% extra credit on this assignment.

This notebook is associated with the second task (semi-supervised).


# Dataset
The dataset is available on SCC at "/projectnb2/cs542-bap/classChallenge/data". You can find the Python version of this notebook there as well, or you can run "jupyter nbconvert --to script baselineModel_task2.ipynb" to produce "baselineModel_task2.py". You should then be able to run it on SCC by typing "python baselineModel_task2.py".

Please don't try to change or delete the dataset.

# Evaluation:
You will compete with each other on a dedicated test set. The performance measure is classification accuracy, i.e., a prediction counts as correct only if the true class is your top prediction.
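
As a minimal sketch of this metric (the function name `top1_accuracy` and the toy scores are illustrative assumptions, not part of the challenge code), top-1 accuracy counts a sample as correct only when the highest-scoring class equals the true class:

```python
def top1_accuracy(scores, true_labels):
    """Fraction of samples whose highest-scoring class is the true class."""
    if len(scores) != len(true_labels):
        raise ValueError("scores and true_labels must have the same length")
    correct = 0
    for row, truth in zip(scores, true_labels):
        # index of the largest score = the model's top prediction
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == truth)
    return correct / len(true_labels)


# Toy example: 3 samples, 3 classes (values are made up for illustration).
scores = [
    [0.1, 0.7, 0.2],  # top prediction: class 1
    [0.5, 0.3, 0.2],  # top prediction: class 0
    [0.2, 0.2, 0.6],  # top prediction: class 2
]
true_labels = [1, 2, 2]
print(top1_accuracy(scores, true_labels))  # 2 of 3 correct
```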

# Baseline:
The following code is a baseline that you can use and improve on to build your model for this task.

# Suggestion
One simple suggestion is to take a model pretrained on ImageNet and fine-tune it on this data, using one of the models listed at this [link](https://keras.io/api/applications/).
Also, you should likely train for more than 2 epochs.

## Import TensorFlow and other libraries

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import PIL

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

# Create a dataset

In [None]:

data_dir = '/projectnb2/cs542-bap/class_challenge/'

train_samps = np.loadtxt(os.path.join(data_dir, 'train_held_out_labeled.txt'), dtype='str', delimiter=" ")
val_samps = np.loadtxt(os.path.join(data_dir, 'val_held_out.txt'), dtype='str', delimiter=" ")

train_len = len(train_samps)

val_len = len(val_samps)


samples = np.concatenate((train_samps, val_samps))

unlabeled_samps = np.loadtxt(os.path.join(data_dir, 'train_held_out.txt'), dtype='str')
unlabeled_len = len(unlabeled_samps)

test_ds = tf.data.TextLineDataset(os.path.join(data_dir, 'test_held_out.txt'))

with open(os.path.join(data_dir, 'classes_held_out.txt'), 'r') as f:
    class_names = [c.strip() for c in f.readlines()]

num_classes = len(class_names)

## Write a short function that converts a file path to an (img, label) pair:

In [None]:
def decode_img(img, test=False, crop_size=224):
    img = tf.io.read_file(img)
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_jpeg(img, channels=3)

    return tf.image.resize(img, [crop_size, crop_size])
  
def get_label(label):
    # find the position(s) in class_names that match the label
    matches = tf.where(tf.equal(label, class_names))
    # integer-encode the label as the (first) matching index
    return tf.reduce_min(matches)

def process_path(path, label):
    # convert the class name to its integer index
    label = get_label(label)
    # load the raw data from the file
    img = decode_img(tf.strings.join([data_dir, 'images/', path, '.jpg']))
    return img, label

def process_path_test(file_path):
    # load the raw data from the file
    img = decode_img(tf.strings.join([data_dir, 'images/', file_path, '.jpg']))
    return img, file_path
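
`get_label` above finds a label's integer index by comparing it against every entry of `class_names`. The same mapping can be sketched in plain Python with a dictionary. The class names below are made-up placeholders, and `build_label_index` is a hypothetical helper, not part of the baseline:

```python
def build_label_index(class_names):
    """Map each class name to the index of its first occurrence."""
    index = {}
    for idx, name in enumerate(class_names):
        # setdefault keeps the first index seen, mirroring tf.reduce_min
        index.setdefault(name, idx)
    return index


# Placeholder class names; the real ones come from classes_held_out.txt.
class_names = ["plant_a", "plant_b", "plant_c"]
label_index = build_label_index(class_names)

print(label_index["plant_b"])  # 1
```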

# Finish setting up data

In [None]:
batch_size = 25

AUTOTUNE = tf.data.experimental.AUTOTUNE
test_ds = test_ds.map(process_path_test, num_parallel_calls=AUTOTUNE)

def configure_for_performance(ds):
    ds = ds.cache()
    ds = ds.shuffle(buffer_size=1000)
    ds = ds.batch(batch_size)
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds


def shuffle_train_val(train_perc = 0.2):
    # define the train length
    train_len = int(train_perc*len(samples))
    
    # index the train set and val set by random choice
    # sample without replacement so no training index is drawn twice
    train_idx = np.random.choice(range(len(samples)), train_len, replace=False)
    val_idx = [idx for idx in range(len(samples)) if idx not in train_idx]
    
    # get train_ds and val_ds based on indexes
    train_ds = tf.data.Dataset.from_tensor_slices((samples[train_idx, 0], samples[train_idx, 1]))
    train_ds = train_ds.map(process_path, num_parallel_calls=AUTOTUNE)
    train_ds = configure_for_performance(train_ds)
    val_ds = tf.data.Dataset.from_tensor_slices((samples[val_idx, 0], samples[val_idx, 1]))
    val_ds = val_ds.map(process_path, num_parallel_calls=AUTOTUNE)
    val_ds = configure_for_performance(val_ds)

    return train_ds, val_ds
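
A disjoint train/validation split can be sketched without TensorFlow as follows. `split_indices` and its `seed` parameter are illustrative assumptions. The point is that sampling without replacement gives exactly `train_perc` of the indices to training and the rest to validation, with no overlap:

```python
import random


def split_indices(n_samples, train_perc=0.2, seed=None):
    """Randomly split range(n_samples) into disjoint train/val index lists."""
    rng = random.Random(seed)
    train_len = int(train_perc * n_samples)
    # random.sample draws without replacement, so no index repeats
    train_idx = rng.sample(range(n_samples), train_len)
    train_set = set(train_idx)  # set membership is O(1) per lookup
    val_idx = [i for i in range(n_samples) if i not in train_set]
    return train_idx, val_idx


train_idx, val_idx = split_indices(10, train_perc=0.2, seed=0)
print(len(train_idx), len(val_idx))  # 2 8
```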

## Models

## ResNet50

In [None]:
class ResNet50(tf.keras.Model):

    def __init__(self):
        super(ResNet50, self).__init__()
        self.ResNet50 = keras.applications.ResNet50(
            include_top=False,
            weights='imagenet',
            input_shape=(224, 224, 3)
        )
        
        # freeze every backbone layer except the last two
        for layer in self.ResNet50.layers[:-2]:
            layer.trainable = False
        
        # define layers
        self.pool = layers.GlobalAveragePooling2D()
        self.flatten = layers.Flatten()
        self.fc_1 = layers.Dense(1024)
        self.fc_2 = layers.Dense(units=num_classes)

    def call(self, inputs):
        x = keras.applications.resnet.preprocess_input(inputs)
        x = self.ResNet50(x)
        x = self.flatten(x)
        x = self.fc_1(x)
        output = self.fc_2(x)

        return output

# data augmentation
model = Sequential([
    layers.experimental.preprocessing.RandomFlip(
        mode='horizontal'),
    layers.experimental.preprocessing.RandomZoom(0.2),
    layers.experimental.preprocessing.RandomTranslation(0.2, 0.2),
    ResNet50()
])

# compile the model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.00001),
    loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

## EfficientB0

In [None]:
class EfficientB0(tf.keras.Model):

    def __init__(self):
        super(EfficientB0, self).__init__()
        self.EfficientB0 = keras.applications.EfficientNetB0(
             include_top=False,
             weights='imagenet',
             input_shape=(224, 224, 3), 
             # add stronger regularization
             drop_connect_rate=0.4
        )
        
        # freeze every backbone layer except the top 20
        for layer in self.EfficientB0.layers[:-20]:
            layer.trainable = False
            
        # define layers
        self.pool = layers.GlobalAveragePooling2D()
        self.flatten = layers.Flatten()
        self.fc_1 = layers.Dense(1024)
        self.dropout = layers.Dropout(0.3)
        self.fc_2 = layers.Dense(units=num_classes)

    def call(self, inputs):
        x = self.EfficientB0(inputs)
        x = self.pool(x)
        x = self.fc_1(x)
        x = self.dropout(x)
        output = self.fc_2(x)

        return output

# image augmentation
model = Sequential([
    layers.experimental.preprocessing.RandomFlip(
       mode='horizontal'),
    layers.experimental.preprocessing.RandomZoom(0.2),
    layers.experimental.preprocessing.RandomTranslation(0.2, 0.2),
    EfficientB0()
])

# compile the model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.00001),
    loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

In [None]:
def Add_labels(unlabeled_samps, model, unlabeled_batch):
    # pseudo-label the unlabeled images and return the unlabeled_batch
    # most confident (image_name, class_name) pairs
    unlabeled_ds = tf.data.Dataset.from_tensor_slices(unlabeled_samps)
    unlabeled_ds = unlabeled_ds.map(process_path_test, num_parallel_calls=AUTOTUNE)
    unlabeled_ds = unlabeled_ds.batch(1)
    
    predictions = []  # (image_name, predicted class) pairs
    confidences = []  # top logit of each prediction
    for image, image_name in unlabeled_ds:
        preds = model.predict(image)
        ind = np.argmax(preds)
        predictions.append((str(int(image_name)), class_names[ind]))
        confidences.append(preds[0, ind])
        
    # keep the unlabeled_batch predictions with the highest confidence;
    # reshape keeps the array 2D even when only one prediction exists
    predictions = np.array(predictions).reshape(-1, 2)
    top = np.argpartition(confidences, -unlabeled_batch)[-unlabeled_batch:]
    return predictions[top]
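
The selection step at the end of `Add_labels` keeps only the most confident predictions. A dependency-free sketch of that idea follows. `select_most_confident` and the sample data are hypothetical, and `heapq.nlargest` stands in for the `np.argpartition` call used above:

```python
import heapq


def select_most_confident(predictions, confidences, n):
    """Return the n (name, label) pairs with the highest confidence."""
    if len(predictions) != len(confidences):
        raise ValueError("predictions and confidences must align")
    # indices of the n largest confidences, highest first
    top = heapq.nlargest(n, range(len(confidences)), key=confidences.__getitem__)
    return [predictions[i] for i in top]


# Made-up pseudo-labels and top-class scores for illustration.
predictions = [
    ("101", "plant_a"),
    ("102", "plant_b"),
    ("103", "plant_a"),
    ("104", "plant_c"),
]
confidences = [0.35, 0.92, 0.10, 0.77]

print(select_most_confident(predictions, confidences, 2))
# [('102', 'plant_b'), ('104', 'plant_c')]
```

Because the models here output logits, the baseline ranks samples by their raw top logit. Applying a softmax first would rank by probability instead, which may change the ordering.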

In [None]:
model_list = [None] * 1
    
# the main training loop
for i in range(1):
  
    model = model
    train_ds, val_ds = shuffle_train_val()
    samps = samples
    unlabeled = unlabeled_samps

    print(f"Iteration {i+1}")
    unlabeled_batch = int(0.1 * unlabeled_len)
    
    # keep training in this iteration until all unlabeled data have been pseudo-labeled
    while len(unlabeled) > 0:
        hist = model.fit(train_ds, validation_data=val_ds, epochs=2, shuffle=True)
        improvement = hist.history['val_accuracy'][-1] - hist.history['val_accuracy'][-2]
            
        # once validation accuracy stops improving by more than 0.01, pseudo-label a batch of unlabeled samples
        if improvement <= 0.01:
            preds = Add_labels(unlabeled, model, min(len(unlabeled), unlabeled_batch))
            pred_ds = tf.data.Dataset.from_tensor_slices((preds[:,0], preds[:,1]))
            pred_ds = pred_ds.map(process_path, num_parallel_calls=AUTOTUNE)
            pred_ds = configure_for_performance(pred_ds)
      
            # keep updating the training set and the unlabeled set
            # (concatenate returns a new dataset, so the result must be reassigned)
            train_ds = train_ds.concatenate(pred_ds)
            unlabeled = [j for j in unlabeled if j not in preds[:,0]]
            print(f"number of unlabeled samples remained: {len(unlabeled)}")
           
    # train on all labeled and pseudo-labeled data
    print(f"fine tuning the model (iteration {i+1})")
    model.fit(train_ds,validation_data=val_ds,epochs=20,shuffle=True)
        
    # keep track of the trained models
    model_list[i] = model
        

Iteration 1
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 3788
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 3368
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 2948
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 2528
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 2108
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 1688
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 1268
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 848
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 428
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 8
Epoch 1/2
Epoch 2/2
number of unlabeled samples remained: 0
fine tuning the model (iteration 1)
Epoch 1/20
Epoch 2/20
Epoch
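
The `number of unlabeled samples remained` lines follow directly from the batch arithmetic in the loop: each labeling round removes `min(len(unlabeled), unlabeled_batch)` samples, with `unlabeled_batch = int(0.1 * unlabeled_len)`. The sketch below reproduces that sequence. The pool size of 4208 is not stated in the notebook; it is inferred from the log (3788 remaining after one batch of 420), and `pool_sizes` is a hypothetical helper:

```python
def pool_sizes(total, batch_frac=0.1):
    """Sizes of the unlabeled pool after each pseudo-labeling round."""
    batch = int(batch_frac * total)
    if batch <= 0:
        raise ValueError("batch size must be positive")
    remaining, sizes = total, []
    while remaining > 0:
        # the last round may be smaller than a full batch
        remaining -= min(remaining, batch)
        sizes.append(remaining)
    return sizes


print(pool_sizes(4208))
# [3788, 3368, 2948, 2528, 2108, 1688, 1268, 848, 428, 8, 0]
```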

In [None]:
hist7 = model_list[-1].fit(train_ds,validation_data=val_ds,epochs=10,shuffle=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
np.save('hist7.npy',hist7.history)

In [None]:
hist = np.load('hist1.npy', allow_pickle=True).item()

In [None]:
hist2 = np.load('hist2.npy', allow_pickle=True).item()

# Output submission csv for Kaggle


In [None]:
test_ds = test_ds.batch(1)

with open('submission_task2_semisupervised.csv', 'w') as f:
  f.write('id,predicted\n')
  for image_batch, image_names in test_ds:
    # predict once per batch and reuse the result
    batch_preds = model_list[-1].predict(image_batch)
    for image_name, pred in zip(image_names.numpy(), batch_preds):
      ind = np.argmax(pred)
      line = str(int(image_name)) + ',' + class_names[ind]
      f.write(line + '\n')
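
The submission file is a two-column CSV with the header `id,predicted`. As a sketch of that format only (the image ids, class names, and `write_submission` helper are placeholders, and the file goes to a temporary directory rather than a real project path):

```python
import csv
import os
import tempfile


def write_submission(path, rows):
    """Write (image_id, predicted_class) rows under an 'id,predicted' header."""
    with open(path, "w", newline="") as f:
        # lineterminator="\n" matches the '\n' line endings used in the cell above
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["id", "predicted"])
        writer.writerows(rows)


# Placeholder predictions for illustration.
rows = [(101, "plant_a"), (102, "plant_c")]
out_path = os.path.join(tempfile.mkdtemp(), "submission_example.csv")
write_submission(out_path, rows)

with open(out_path) as f:
    content = f.read()
print(content)
```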

**Note**

An absolute path is recommended here. For example, use "/projectnb2/cs542-bap/[your directory name]/submission_task2_semisupervised.csv" in place of "submission_task2_semisupervised.csv".

In addition, you can request suitable resources by specifying the GPU type, for example "qsub -l gpus=1 -l gpu_type=P100 [your file name].qsub". This helps avoid GPU issues such as running out of memory.