### Introduction

The goal of this kernel is to build a Keras CNN model to classify 12 kinds of plant seedlings.

We will use the V2 Plant Seedlings Dataset. This dataset contains 5,539 images distributed among 12 classes. The images show plant seedlings at different growth stages. Of the 12 classes, 3 are crop seedlings and 9 are weed seedlings.

The images are in different sizes. We will resize all images to 96x96 and use only 250 images from each class. We won't do any image augmentation. 

This kernel will focus on:

- Creating the folder structure that Keras generators need.
- Creating generators to feed the images from the folders into the model.
- Model building and training.
- Assessing the quality of the model by generating a confusion matrix and a classification report.

### Results

In the original run, this simple model reported a validation accuracy greater than 90% and an F1 score of approximately 0.75.
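
A validation accuracy that sits well above the F1 score can arise when accuracy is scored per label entry rather than per image. The sketch below illustrates that effect under that assumption; it is not a measurement from this kernel. The toy probabilities and the helper names `categorical_accuracy` and `per_label_binary_accuracy` are hypothetical.

```python
def categorical_accuracy(y_true, y_prob):
    """Fraction of images whose most probable class matches the one-hot label."""
    hits = 0
    for truth, probs in zip(y_true, y_prob):
        hits += truth.index(1) == probs.index(max(probs))
    return hits / len(y_true)


def per_label_binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of individual label entries matched after thresholding each probability."""
    correct = 0
    total = 0
    for truth, probs in zip(y_true, y_prob):
        for t_k, p_k in zip(truth, probs):
            correct += int(p_k > threshold) == t_k
            total += 1
    return correct / total


# Toy data (made up): two images, four classes, one-hot labels.
y_true = [[1, 0, 0, 0],
          [0, 1, 0, 0]]
# The first image is classified correctly, the second is not.
y_prob = [[0.7, 0.1, 0.1, 0.1],
          [0.6, 0.2, 0.1, 0.1]]

per_image = categorical_accuracy(y_true, y_prob)
per_label = per_label_binary_accuracy(y_true, y_prob)

print('per-image accuracy:', per_image)
print('per-label accuracy:', per_label)
```

Even a misclassified image gets most of its label entries "right", because most entries are zeros that stay below the threshold. With 12 classes this effect is stronger than in the four-class toy case.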

***



In [None]:
from numpy.random import seed
seed(101)
from tensorflow import set_random_seed
set_random_seed(101)

import pandas as pd
import numpy as np

import tensorflow

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import categorical_crossentropy
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint

import os
import cv2

import imageio
import skimage
import skimage.io
import skimage.transform

from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import itertools
import shutil
import matplotlib.pyplot as plt
%matplotlib inline


In [None]:
# Number of samples we will have in each class.
SAMPLE_SIZE = 250

# The images will all be resized to this size.
IMAGE_SIZE = 96

### What folders are available?

The images are grouped into 12 folders by plant species.

In [None]:
os.listdir('../input/nonsegmentedv2')

### How many images are there in each folder?

In [None]:
# get a list of image folders
folder_list = os.listdir('../input/nonsegmentedv2')

total_images = 0

# loop through each folder
for folder in folder_list:
    # set the path to a folder
    path = '../input/nonsegmentedv2/' + str(folder)
    # get a list of images in that folder
    images_list = os.listdir(path)
    # get the length of the list
    num_images = len(images_list)
    
    total_images = total_images + num_images
    # print the result
    print(str(folder) + ':' + ' ' + str(num_images))
    
print('\n')
# print the total number of images available
print('Total Images: ', total_images)
    

### Copy all images into one directory
This will make it easier to work with this data.

In [None]:
# Create a new directory to store all available images
all_images_dir = 'all_images_dir'
os.mkdir(all_images_dir)


In [None]:
# check that the new directory has been created
!ls

In [None]:
# This code copies all images from their separate folders into the same 
# folder called all_images_dir.


folder_list = os.listdir('../input/nonsegmentedv2')

for folder in folder_list:
    
    # create a path to the folder
    path = '../input/nonsegmentedv2/' + str(folder)

    # create a list of all files in the folder
    file_list = os.listdir(path)

    # copy the images to all_images_dir
    for fname in file_list:

        # source path to image
        src = os.path.join(path, fname)
        
        # Change the file name because many images have the same file name.
        # Add the folder name to the existing file name.
        new_fname = str(folder) + '_' + fname
        
        # destination path to image
        dst = os.path.join(all_images_dir, new_fname)
        # copy the image from the source to the destination
        shutil.copyfile(src, dst)



In [None]:
# Check how many images are in all_images_dir.
# Should be 5539.

len(os.listdir('all_images_dir'))

### Create a dataframe containing all the information

In [None]:
# Get a list of all images in the all_images_dir folder.
image_list = os.listdir('all_images_dir')

# Create the dataframe.
df_data = pd.DataFrame(image_list, columns=['image_id'])

df_data.head()

In [None]:

# Each file name has this format:
# Loose Silky-bent_377.png

# This function will extract the class name from the file name of each image.
def extract_target(x):
    # split into a list
    a = x.split('_')
    # the target is the first index in the list
    target = a[0]
    
    return target


# create a new column called 'target'
df_data['target'] = df_data['image_id'].apply(extract_target)

df_data.head()

In [None]:
df_data.shape

### Display a random sample of 4 train images for each class

Here we will see what the images in each class look like. Take note of the similar appearance of Black-grass and Loose Silky-bent. We will see later in the confusion matrix and classification report that the model struggles to separate these two classes.

In [None]:
# source: https://www.kaggle.com/gpreda/honey-bee-subspecies-classification


def draw_category_images(col_name,figure_cols, df, IMAGE_PATH):
    
    """
    Given a column in a dataframe,
    this function takes a sample of each class and displays that
    sample on one row. The sample size is the same as figure_cols which
    is the number of columns in the figure.
    Because this function takes a random sample, each time the function is run it
    displays different images.
    """
    

    categories = (df.groupby([col_name])[col_name].nunique()).index
    f, ax = plt.subplots(nrows=len(categories),ncols=figure_cols, 
                         figsize=(4*figure_cols,4*len(categories))) # adjust size here
    # draw a number of images for each category
    for i, cat in enumerate(categories):
        sample = df[df[col_name]==cat].sample(figure_cols) # figure_cols is also the sample size
        for j in range(0,figure_cols):
            file=IMAGE_PATH + sample.iloc[j]['image_id']
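            # cv2 loads images as BGR; convert to RGB so matplotlib shows true colors
            im=cv2.cvtColor(cv2.imread(file), cv2.COLOR_BGR2RGB)
            ax[i, j].imshow(im, resample=True, cmap='gray')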
            ax[i, j].set_title(cat, fontsize=16)  
    plt.tight_layout()
    plt.show()

In [None]:
IMAGE_PATH = 'all_images_dir/'

draw_category_images('target',4, df_data, IMAGE_PATH)

### Balance the class distribution
We will use 250 images from each class.

In [None]:
# What is the class distribution?

df_data['target'].value_counts()

In [None]:

# Get a list of classes
target_list = os.listdir('../input/nonsegmentedv2')

for target in target_list:

    # Filter out a target and take a random sample
    df = df_data[df_data['target'] == target].sample(SAMPLE_SIZE, random_state=101)
    
    # if it's the first item in the list
    if target == target_list[0]:
        df_sample = df
    else:
        # Concat the dataframes
        df_sample = pd.concat([df_sample, df], axis=0).reset_index(drop=True)


In [None]:
# Display the balanced classes.

df_sample['target'].value_counts()

### Create the train and val sets


In [None]:
# train_test_split

# stratify=y creates a balanced validation set.
y = df_sample['target']

df_train, df_val = train_test_split(df_sample, test_size=0.10, random_state=101, stratify=y)

print(df_train.shape)
print(df_val.shape)

In [None]:
# Train set class distribution

df_train['target'].value_counts()

In [None]:
# Val set class distribution

df_val['target'].value_counts()

### Create a Directory Structure

In [None]:
folder_list = os.listdir('../input/nonsegmentedv2')

folder_list

In [None]:
# Create a new directory
base_dir = 'base_dir'
os.mkdir(base_dir)


#[CREATE FOLDERS INSIDE THE BASE DIRECTORY]

# now we create 2 folders inside 'base_dir':

# train_dir
    # Maize
    # Fat Hen
    # Shepherd’s Purse
    # Common Chickweed
    # Cleavers
    # Charlock
    # Loose Silky-bent
    # Small-flowered Cranesbill
    # Black-grass
    # Scentless Mayweed
    # Sugar beet
    # Common wheat

# val_dir
    # Maize
    # Fat Hen
    # Shepherd’s Purse
    # Common Chickweed
    # Cleavers
    # Charlock
    # Loose Silky-bent
    # Small-flowered Cranesbill
    # Black-grass
    # Scentless Mayweed
    # Sugar beet
    # Common wheat


# create a path to 'base_dir' to which we will join the names of the new folders
# train_dir
train_dir = os.path.join(base_dir, 'train_dir')
os.mkdir(train_dir)

# val_dir
val_dir = os.path.join(base_dir, 'val_dir')
os.mkdir(val_dir)


# [CREATE FOLDERS INSIDE THE TRAIN AND VALIDATION FOLDERS]

# create new folders inside train_dir

for folder in folder_list:
    
    folder = os.path.join(train_dir, str(folder))
    os.mkdir(folder)


# create new folders inside val_dir

for folder in folder_list:
    
    folder = os.path.join(val_dir, str(folder))
    os.mkdir(folder)

In [None]:
# check that the folders have been created

os.listdir('base_dir/train_dir')

### Transfer the images into the folders

In [None]:
# Set the id as the index in df_data
df_data.set_index('image_id', inplace=True)

In [None]:
df_data.head()

In [None]:
# Get a list of train and val images
train_list = list(df_train['image_id'])
val_list = list(df_val['image_id'])

# Transfer the train images

for image in train_list:
    
    # the image_id already includes the file extension, so we use it as the file name
    fname = image
    # get the label for a certain image
    folder = df_data.loc[image,'target']
    
    
    # source path to image
    src = os.path.join(all_images_dir, fname)
    # destination path to image
    dst = os.path.join(train_dir, folder, fname)
    
    # resize the image and save it at the new location
    image = cv2.imread(src)
    image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
    # save the image at the destination
    cv2.imwrite(dst, image)
        
    

# Transfer the val images

for image in val_list:
    
    # the image_id already includes the file extension, so we use it as the file name
    fname = image
    # get the label for a certain image
    folder = df_data.loc[image,'target']
    

    # source path to image
    src = os.path.join(all_images_dir, fname)
    # destination path to image
    dst = os.path.join(val_dir, folder, fname)
    
    # resize the image and save it at the new location
    image = cv2.imread(src)
    image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
    # save the image at the destination
    cv2.imwrite(dst, image)

    

### Check how many train images are in each folder

In [None]:
# get a list of image folders
folder_list = os.listdir('base_dir/train_dir')

total_images = 0

# loop through each folder
for folder in folder_list:
    # set the path to a folder
    path = 'base_dir/train_dir/' + str(folder)
    # get a list of images in that folder
    images_list = os.listdir(path)
    # get the length of the list
    num_images = len(images_list)
    
    total_images = total_images + num_images
    # print the result
    print(str(folder) + ':' + ' ' + str(num_images))
    
print('\n')
# print the total number of images available
print('Total Images: ', total_images)

### Check how many val images are in each folder

In [None]:
# get a list of image folders
folder_list = os.listdir('base_dir/val_dir')

total_images = 0

# loop through each folder
for folder in folder_list:
    # set the path to a folder
    path = 'base_dir/val_dir/' + str(folder)
    # get a list of images in that folder
    images_list = os.listdir(path)
    # get the length of the list
    num_images = len(images_list)
    
    total_images = total_images + num_images
    # print the result
    print(str(folder) + ':' + ' ' + str(num_images))
    
print('\n')
# print the total number of images available
print('Total Images: ', total_images)

In [None]:
# End of Data Preparation
### ================================================================================== ###
# Start of Model Building

### Set Up the Generators

In [None]:
train_path = 'base_dir/train_dir'
valid_path = 'base_dir/val_dir'


num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 10
val_batch_size = 10


# steps must be whole numbers, so convert the result of np.ceil to int
train_steps = int(np.ceil(num_train_samples / train_batch_size))
val_steps = int(np.ceil(num_val_samples / val_batch_size))

In [None]:
datagen = ImageDataGenerator(rescale=1.0/255)

train_gen = datagen.flow_from_directory(train_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=train_batch_size,
                                        class_mode='categorical')

val_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=val_batch_size,
                                        class_mode='categorical')

# Note: shuffle=False causes the test dataset to not be shuffled
test_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=1,
                                        class_mode='categorical',
                                        shuffle=False)

### Create the Model Architecture

In [None]:
# Source: https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-5min-0-8253-lb

kernel_size = (3,3)
pool_size= (2,2)
first_filters = 32
second_filters = 64
third_filters = 128

dropout_conv = 0.3
dropout_dense = 0.3


model = Sequential()
model.add(Conv2D(first_filters, kernel_size, activation = 'relu', 
                 input_shape = (IMAGE_SIZE, IMAGE_SIZE, 3)))
model.add(Conv2D(first_filters, kernel_size, activation = 'relu'))
model.add(Conv2D(first_filters, kernel_size, activation = 'relu'))
model.add(MaxPooling2D(pool_size = pool_size)) 
model.add(Dropout(dropout_conv))

model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(dropout_dense))
model.add(Dense(12, activation = "softmax"))

model.summary()

### Train the Model

In [None]:
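# A 12-class softmax output pairs with categorical_crossentropy; with
# binary_crossentropy, 'accuracy' becomes a per-label score that reads too high.
model.compile(Adam(lr=0.0001), loss='categorical_crossentropy',
              metrics=['accuracy'])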


In [None]:
filepath = "model.h5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, 
                             save_best_only=True, mode='max')

reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.5, patience=3, 
                                   verbose=1, mode='max', min_lr=0.00001)
                              
                              
callbacks_list = [checkpoint, reduce_lr]

history = model.fit_generator(train_gen, steps_per_epoch=train_steps, 
                    validation_data=val_gen,
                    validation_steps=val_steps,
                    epochs=20, verbose=1,
                   callbacks=callbacks_list)

### Evaluate the model using the val set

In [None]:
# get the metric names so we can use evaluate_generator
model.metrics_names

In [None]:
# Print the validation loss and accuracy.

# Here the best epoch will be used.
model.load_weights('model.h5')

val_loss, val_acc = \
model.evaluate_generator(test_gen, 
                        steps=len(df_val))

print('val_loss:', val_loss)
print('val_acc:', val_acc)

### Plot the Training Curves

In [None]:
# display the loss and accuracy curves

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.figure()

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.show()

### Make a prediction on the val set
We need these predictions to print the Confusion Matrix and calculate the F1 score.
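
As a rough illustration of how predicted probabilities turn into a confusion matrix, here is a minimal pure-Python sketch. The probabilities and labels are made-up toy values, and the helper names `argmax` and `build_confusion_matrix` are hypothetical; the cells that follow use NumPy and scikit-learn for the same job.

```python
def argmax(row):
    """Index of the largest value in a list of probabilities."""
    return max(range(len(row)), key=row.__getitem__)


def build_confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


# Toy softmax outputs for four images and three classes.
toy_probs = [[0.8, 0.1, 0.1],
             [0.3, 0.6, 0.1],
             [0.5, 0.4, 0.1],
             [0.1, 0.2, 0.7]]
toy_true = [0, 1, 1, 2]

# The predicted class is the index of the highest probability in each row.
toy_pred = [argmax(row) for row in toy_probs]
toy_cm = build_confusion_matrix(toy_true, toy_pred, num_classes=3)

print(toy_pred)
for row in toy_cm:
    print(row)
```

The third toy image belongs to class 1 but is predicted as class 0, so it lands off the diagonal in row 1, column 0.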

In [None]:
# make a prediction
predictions = model.predict_generator(test_gen, steps=len(df_val), verbose=1)

In [None]:
predictions.shape

In [None]:
# This is how to check what index keras has internally assigned to each class. 
test_gen.class_indices


In [None]:
# Put the predictions into a dataframe.
# The columns need to be ordered to match the output of the previous cell

class_dict = train_gen.class_indices

# Get a list of the dict keys.
cols = list(class_dict.keys())

df_preds = pd.DataFrame(predictions, columns=cols)

df_preds.head()

### Create a Confusion Matrix

In [None]:
# Get the labels of the test images.

test_labels = test_gen.classes

In [None]:
# Source: scikit-learn website
# http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)
    
    # set the size of the figure here
    plt.figure(figsize=(15,10))

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=80) # set x-axis text angle here
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

In [None]:
# argmax returns the index of the max value in a row
cm = confusion_matrix(test_labels, predictions.argmax(axis=1))

In [None]:
# Define the labels of the class indices. These need to match the 
# order shown above.
cm_plot_labels = cols

plot_confusion_matrix(cm, cm_plot_labels, title='Confusion Matrix')


### Create a Classification Report

In [None]:
from sklearn.metrics import classification_report

# Generate a classification report

# Get the true labels
y_true = test_gen.classes

# For this to work we need y_pred as class indices, not as probabilities
y_pred_binary = predictions.argmax(axis=1)

report = classification_report(y_true, y_pred_binary, target_names=cm_plot_labels)

print(report)

**Recall** = Given a class, will the classifier be able to detect it?<br>
**Precision** = Given a class prediction from a classifier, how likely is it to be correct?<br>
**F1 Score** = The harmonic mean of the recall and precision. Essentially, it punishes extreme values.
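
To make these three definitions concrete, the sketch below derives them for one class from a confusion matrix whose rows are true classes and whose columns are predicted classes. The two-class matrix is a made-up toy example, and `class_scores` is a hypothetical helper name.

```python
def class_scores(cm, k):
    """Return (precision, recall, f1) for class k of a confusion matrix."""
    tp = cm[k][k]                           # class k predicted as class k
    fn = sum(cm[k]) - tp                    # class k predicted as something else
    fp = sum(row[k] for row in cm) - tp     # other classes predicted as class k
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy confusion matrix: 10 images of class 0 and 10 images of class 1.
toy_cm = [[8, 2],
          [6, 4]]

precision_0, recall_0, f1_0 = class_scores(toy_cm, 0)
precision_1, recall_1, f1_1 = class_scores(toy_cm, 1)

print('class 0:', round(precision_0, 3), round(recall_0, 3), round(f1_0, 3))
print('class 1:', round(precision_1, 3), round(recall_1, 3), round(f1_1, 3))
```

In this toy case class 1 has a recall of only 0.4, so its F1 score is pulled down to 0.5 even though its precision is higher. That is the sense in which the harmonic mean punishes an extreme value.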


From the confusion matrix and classification report we see that the model is misclassifying many Black-grass images as Loose Silky-bent. This may be because these plant seedlings look similar. Shepherd's Purse is another class that the model is struggling to classify correctly. One possible solution may be to add more Black-grass and Shepherd's Purse images to the training set.

By generating the confusion matrix and F1 score we can see how the model is performing on a class-by-class basis. The accuracy score alone cannot give us these insights into the model's strengths and weaknesses.

### A Weed Detection Web App

I've built a prototype web app using this model. This may be something that farmers could use to detect weed seedlings. The user is able to submit a photo of a seedling and get an instant prediction indicating what kind of seedling it is. The app will probably not generalize very well, but it shows how easy it is for an ordinary person to build an AI product using the technology that's available today.

All the code is available on GitHub. The technology that enables this app to work is new. Therefore, I recommend using the latest version of the Chrome browser. When using Safari, for example, you may see a message indicating that the model is loading but the app may actually be frozen.

Web app:<br>
http://plant.test.woza.work/<br>
GitHub:<br>
https://github.com/vbookshelf/Weed-Detector





### Convert the model from Keras to TensorFlow.js
This conversion needs to be done so that the model can be loaded into the web app.

In [None]:
!pip install tensorflowjs

In [None]:
# Use the command line conversion tool to convert the model

!tensorflowjs_converter --input_format keras model.h5 tfjs_model/model

In [None]:
# Delete all_images_dir and base_dir directory to prevent a Kaggle error.
# Kaggle allows a max of 500 files to be saved.

shutil.rmtree('all_images_dir')
shutil.rmtree('base_dir')

### Conclusion

It may be possible to improve the performance of this model by doing the following:<br>
- using more of the available data
- using a larger image size
- doing image augmentation
- using a pre-trained model
- parameter tuning
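
For the image augmentation item, one possible starting point is to pass augmentation settings to the Keras `ImageDataGenerator` used for training. This is only a sketch: the values below are untuned assumptions, the large rotation range assumes the seedlings are photographed from above so orientation carries no meaning, and the names `AUGMENTATION_KWARGS` and `make_augmented_datagen` are hypothetical.

```python
# Illustrative, untuned augmentation settings (assumptions, not results).
AUGMENTATION_KWARGS = dict(
    rescale=1.0 / 255,        # same pixel scaling as the plain generator
    rotation_range=180,       # random rotations of up to 180 degrees
    width_shift_range=0.1,    # random horizontal shifts (fraction of width)
    height_shift_range=0.1,   # random vertical shifts (fraction of height)
    zoom_range=0.1,           # random zoom in or out by up to 10%
    horizontal_flip=True,
    vertical_flip=True,
    fill_mode='nearest',      # how to fill pixels exposed by a transform
)


def make_augmented_datagen():
    """Build a training-only generator; imported lazily so the settings can be inspected without TensorFlow."""
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    return ImageDataGenerator(**AUGMENTATION_KWARGS)
```

Only the training generator would be built this way. The validation and test generators would keep the rescale-only `datagen`, so that evaluation runs on unmodified images.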

Thank you for reading. Merry Christmas.