As mentioned in this [discussion](https://www.kaggle.com/c/cassava-disease/discussion/94114), pseudo-labeling is one option for improving model performance. Here I generate a pseudo-label dataframe using the model trained in another [notebook](https://www.kaggle.com/electro/keras-baseline-0-88-quickstart-i-training-saving). The quality of the pseudo-labels depends on the model used; mine reaches a validation accuracy of 0.88, so feel free to replace it with your own model. Thanks to @anonamename for sharing the [unzipped extra images](https://www.kaggle.com/anonamename/cassava-2019-compe-data) from the [2019 Cassava Competition](https://www.kaggle.com/c/cassava-disease).

In [None]:
import numpy as np
import pandas as pd
import os
import tensorflow as tf
import math
from tqdm import tqdm
import matplotlib.pyplot as plt
import cv2

dir_extra = "../input/cassava-2019-compe-data/kaggle_upload/extraimages"

In [None]:
from tensorflow import keras
model = keras.models.load_model('../input/cassava-baseline-weights/best_weights.h5')

In [None]:
test_images = os.listdir(dir_extra)

N = len(test_images)
BATCH_SIZE = 16

pseudo_df = pd.DataFrame(columns=['image_id','label','prob'])

Load the images batch by batch, predict on each batch, and collect the results in the dataframe.

In [None]:
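def predict_on_batch(test_list, dir_extra=dir_extra, target_size=(380,380)):
    """Load the images named in test_list and return them as one array.

    Despite its name, this function only builds the model input;
    the prediction itself happens in the loop below.
    """
    input_batch=[]
    for IMAGE_ID in test_list:
        # The deprecated `grayscale` argument is dropped; color_mode="rgb" already covers it.
        image = tf.keras.preprocessing.image.load_img(os.path.join(dir_extra,IMAGE_ID),
                                                      color_mode="rgb",
                                                      target_size=target_size,
                                                      interpolation="nearest")
        input_arr = keras.preprocessing.image.img_to_array(image)
        input_batch.append(input_arr)
    # Shape: (len(test_list), height, width, 3)
    return np.array(input_batch)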

In [None]:
pseudo_batches = []
for batch_index in tqdm(range(math.ceil(N/BATCH_SIZE))):
    # Slicing past the end of a list is safe, so the last (shorter) batch
    # needs no special case.
    batch_ids = test_images[batch_index*BATCH_SIZE:(batch_index+1)*BATCH_SIZE]
    test_X = predict_on_batch(batch_ids)
    prob = model.predict(test_X)
    predictions = prob.argmax(axis = 1)
    pseudo_batches.append(pd.DataFrame({'image_id':batch_ids,
                    'label':list(predictions), 'prob':list(prob)}))

# DataFrame.append was removed in pandas 2.0; concatenate all batches once instead.
pseudo_df = pd.concat(pseudo_batches, ignore_index=True)
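
The batching arithmetic can be checked in isolation, without a model or images. The snippet below is only a sketch: `iter_batches` is a hypothetical helper that is not part of this notebook, and the list of 35 integers stands in for the image file names.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        # The final slice is simply shorter when len(items) is not a multiple
        # of batch_size; Python slicing never raises for an out-of-range end.
        yield items[start:start + batch_size]

# Stand-in for the list of image file names (illustrative data only).
toy_items = list(range(35))
batches = list(iter_batches(toy_items, 16))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

The number of batches matches `math.ceil(N/BATCH_SIZE)`, which is why the loop above can iterate over that range.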

In [None]:
pseudo_df['label'] = pseudo_df['label'].astype('int64')
pseudo_df

Choose a confidence threshold to select the images you want to use.

In [None]:
def selected(prob, threshold=0.95):
    # 1 if the most likely class reaches the threshold, else 0.
    return int(np.amax(prob) >= threshold)

# Apply to the 'prob' column only instead of rebuilding every row.
pseudo_df['selected'] = pseudo_df['prob'].apply(selected)
selected_df = pseudo_df[pseudo_df['selected']==1]
selected_df
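
Before settling on 0.95, it may help to see how many images survive at several thresholds. The following is a minimal sketch under stated assumptions: `count_selected` is a hypothetical helper, and `toy_df` holds made-up probabilities standing in for `pseudo_df`.

```python
import numpy as np
import pandas as pd

def count_selected(df, thresholds=(0.5, 0.8, 0.9, 0.95, 0.99)):
    """Return how many rows have a max class probability >= each threshold."""
    max_prob = df['prob'].apply(np.max)
    return {t: int((max_prob >= t).sum()) for t in thresholds}

# Toy stand-in for pseudo_df (made-up probabilities, for illustration only).
toy_df = pd.DataFrame({
    'image_id': ['a.jpg', 'b.jpg', 'c.jpg'],
    'prob': [np.array([0.97, 0.01, 0.02]),
             np.array([0.60, 0.30, 0.10]),
             np.array([0.05, 0.85, 0.10])],
})
counts = count_selected(toy_df)
print(counts)
```

A higher threshold keeps fewer images but with more confident labels; calling `count_selected(pseudo_df)` on the real dataframe shows where that trade-off lies for your model.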

Check the quality of the pseudo-labels.

In [None]:
def show_image(image_ids, labels):
    plt.figure(figsize=(16, 12))
    
    for ind, (image_id, label) in enumerate(zip(image_ids, labels)):
        plt.subplot(3, 3, ind + 1)
        image = cv2.imread(os.path.join(dir_extra,image_id))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        plt.imshow(image)
        plt.title(f"Class: {label}", fontsize=12)
        plt.axis("off")
    
    plt.show()
    
image_ids = selected_df[:9]["image_id"].values
labels = selected_df[:9]["label"].values

show_image(image_ids, labels)
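
Besides looking at a few images, the class balance of the selected pseudo-labels is worth a glance, since a confident model may select mostly one class. This is a sketch only: `label_distribution` is a hypothetical helper and `toy_selected` is made-up data standing in for `selected_df`.

```python
import pandas as pd

def label_distribution(df, label_col='label'):
    """Fraction of rows per class label, sorted by label."""
    return df[label_col].value_counts(normalize=True).sort_index()

# Toy stand-in for selected_df (illustrative labels only).
toy_selected = pd.DataFrame({'label': [3, 3, 3, 1]})
dist = label_distribution(toy_selected)
print(dist)
```

If one class dominates, adding all selected images to training could skew the class balance further, so a per-class cap or threshold might be worth considering.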

In [None]:
selected_df.to_csv('pseudo_label_95.csv', index = False)
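
One possible way to use the result is to append the pseudo-labeled rows to your training dataframe. The sketch below assumes the training dataframe has `image_id` and `label` columns; `combine_with_pseudo` and the `source` column are hypothetical names introduced here, and the toy frames are made-up data.

```python
import pandas as pd

def combine_with_pseudo(train_df, pseudo_df):
    """Stack labeled and pseudo-labeled rows, tagging where each row came from."""
    cols = ['image_id', 'label']
    # The 'source' tag lets a data loader pick the right image directory,
    # since the extra images live in a different folder than the training images.
    train_part = train_df[cols].assign(source='train')
    pseudo_part = pseudo_df[cols].assign(source='pseudo')
    return pd.concat([train_part, pseudo_part], ignore_index=True)

# Toy stand-ins for the training dataframe and selected_df (illustrative only).
toy_train = pd.DataFrame({'image_id': ['t1.jpg', 't2.jpg'], 'label': [0, 3]})
toy_pseudo = pd.DataFrame({'image_id': ['p1.jpg'], 'label': [3], 'selected': [1]})
combined = combine_with_pseudo(toy_train, toy_pseudo)
print(combined)
```

Keeping the pseudo-labeled rows out of the validation split is usually a good idea, so that validation accuracy is still measured on true labels only.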

Thanks for your time.