**Preface**
1. This notebook contains some basic data exploration for the Cassava Leaf Disease prediction problem.
2. It also includes a trained VGG16 model as an example of transfer learning, using a CNN as a feature extractor.
3. Please go through the dataset and code to understand how to build a CNN for this problem.
4. To make a submission to this competition, download the h5 file of the model and copy the entire code under the title "Prediction and Making the Submission File".
5. Please do provide feedback for improvements, as it will help me and others learn a lot.

In [None]:
import numpy as np
import pandas as pd 
from keras.preprocessing.image import ImageDataGenerator, load_img
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import random
import os
print(os.listdir("../input/cassava-leaf-disease-classification"))

### 1. Problem Definition :

1. Cassava is an important crop and source of nutrition in many African countries. Leaf diseases, however, can reduce farmers' yields if they are not monitored and prevented properly. The current method involves manual inspection and labelling of cassava leaves. In this competition we are tasked with building a model that can detect and classify the disease from leaf images.

2. Each image falls into one of 5 categories: 4 categories correspond to a disease and the fifth corresponds to a healthy leaf.

### 2. Data Peek

In [None]:
## Let us Peek Over Some Data
train = pd.read_csv("../input/cassava-leaf-disease-classification/train.csv")
ss = pd.read_csv("../input/cassava-leaf-disease-classification/sample_submission.csv")

In [None]:
train.head()

In [None]:
train.shape

##### We have nearly 21k images in the training data. train.csv has two columns :
1. image_id : the name of the image file within train_images
2. label : the target we are going to predict

In [None]:
ss.head()

In [None]:
ss.shape

We can clearly see that the sample submission has only 1 row. As specified in the competition data description [here](https://www.kaggle.com/c/cassava-leaf-disease-classification/data), we are not given the full set of test images; it becomes available only when the kernel is actually submitted, as this is a code (kernels-only) competition.

In [None]:
# Lets Map Disease to Their Actual Values 
# The Mapping can be obtained by using label_num_to_disease_map.json
train['label'] = train['label'].map({0:"Cassava Bacterial Blight (CBB)",1:"Cassava Brown Streak Disease (CBSD)" , 
                   2:"Cassava Green Mottle (CGM)" , 3:"Cassava Mosaic Disease (CMD)",4:"Healthy"})
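
The hard-coded dictionary above could instead be read from `label_num_to_disease_map.json`. A minimal sketch, assuming the file holds a flat object whose keys are the label numbers written as strings (JSON has no integer keys); the inline string below is a stand-in for the file contents and mirrors the mapping used in this notebook:

```python
import json

# Stand-in for the contents of label_num_to_disease_map.json (assumed layout).
raw = '''{"0": "Cassava Bacterial Blight (CBB)",
          "1": "Cassava Brown Streak Disease (CBSD)",
          "2": "Cassava Green Mottle (CGM)",
          "3": "Cassava Mosaic Disease (CMD)",
          "4": "Healthy"}'''

# JSON object keys are always strings, so convert them to int before
# using the dict with train['label'].map(...).
label_map_from_json = {int(k): v for k, v in json.loads(raw).items()}
print(label_map_from_json[3])
```

Forgetting the `int(...)` conversion would make `.map` return NaN for every row, because the integer labels in train.csv would not match the string keys.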

In [None]:
train['label'].value_counts().plot.bar()

CMD dominates the train set. CGM, Healthy and CBSD appear almost equally often, and CBB has the fewest occurrences.
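
To put numbers on the imbalance rather than reading it off the bar chart, class shares can be computed directly. A small sketch with a made-up label list (the counts are illustrative only, not the real dataset counts); on the real data, `train['label'].value_counts(normalize=True)` should give the corresponding shares:

```python
from collections import Counter

# Hypothetical labels, for illustration only (not the competition counts).
toy_labels = ["CMD"] * 6 + ["Healthy"] * 2 + ["CGM"] + ["CBB"]

counts = Counter(toy_labels)
total = sum(counts.values())
# most_common() orders classes from most to least frequent.
shares = {label: n / total for label, n in counts.most_common()}
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```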

#### Let's Plot Some Images from the Train Set !

In [None]:
df_cbb = train.loc[train['label'] =="Cassava Bacterial Blight (CBB)"].head(50).reset_index(drop = True)

In [None]:
# Ref : https://www.kaggle.com/parulpandey/melanoma-classification-eda-starter
images = df_cbb['image_id'].values

# Extract 9 random images from it
random_images = [np.random.choice(images) for i in range(9)]
IMAGE_PATH =  "../input/cassava-leaf-disease-classification"
# Location of the image dir
img_dir = IMAGE_PATH+'/train_images'

print('Display Random Images OF CBB ')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(img_dir, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   

In [None]:
df_cmd = train.loc[train['label'] =="Cassava Mosaic Disease (CMD)"].head(50).reset_index(drop = True)
# Ref : https://www.kaggle.com/parulpandey/melanoma-classification-eda-starter
images = df_cmd['image_id'].values

# Extract 9 random images from it
random_images = [np.random.choice(images) for i in range(9)]
IMAGE_PATH =  "../input/cassava-leaf-disease-classification"
# Location of the image dir
img_dir = IMAGE_PATH+'/train_images'

print('Display Random Images OF CMD ')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(img_dir, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   

In [None]:
df_healthy = train.loc[train['label'] =="Healthy"].head(50).reset_index(drop = True)
# Ref : https://www.kaggle.com/parulpandey/melanoma-classification-eda-starter
images = df_healthy['image_id'].values  # use the Healthy subset, not df_cmd

# Extract 9 random images from it
random_images = [np.random.choice(images) for i in range(9)]
IMAGE_PATH =  "../input/cassava-leaf-disease-classification"
# Location of the image dir
img_dir = IMAGE_PATH+'/train_images'

print('Display Random Images OF Healthy ')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(img_dir, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   
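
Note that calling `np.random.choice(images)` nine times samples with replacement, so the same image can appear more than once in a grid. A sketch of drawing nine distinct file names instead, using placeholder names in place of the real `image_id` values:

```python
import random

# Placeholder file names standing in for df_cbb['image_id'].values.
image_ids = [f"img_{i}.jpg" for i in range(50)]

rng = random.Random(0)              # fixed seed so the grid is reproducible
picked = rng.sample(image_ids, 9)   # sampling without replacement
print(picked)
```

With NumPy, `np.random.choice(images, size=9, replace=False)` serves the same purpose.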

#### Observations :
1. There is a wide variety of images in each class: some focus on a single leaf, whereas others show a group of leaves.
2. The resolution of the images and the camera angle at which they were taken vary a lot.
3. Heavy, well-chosen augmentations will be critical so that our model is robust to these transformations.

### 3. Modelling : What We Have Been Waiting For !

1. For modelling I will use VGG16 pretrained on ImageNet as a baseline. I will use VGG16 as a frozen feature extractor, add some Dense layers on top and train only those new layers.

2. Since this is a multiclass classification problem, the final Dense layer has as many units as there are classes (5) with a softmax activation, and the loss is categorical cross-entropy.
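
To make the softmax and cross-entropy pairing concrete, here is a small worked example in plain Python. The scores are made up and the helper functions are illustrative, not the Keras implementation:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def categorical_crossentropy(one_hot, probs):
    # Only the true class contributes, since the other targets are 0.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [0.5, 0.1, 0.2, 2.0, 0.3]   # made-up scores for the 5 classes
probs = softmax(logits)              # non-negative and sums to 1
target = [0, 0, 0, 1, 0]             # one-hot target: class index 3
loss = categorical_crossentropy(target, probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

A model that outputs the uniform distribution over 5 classes has a loss of ln(5), roughly 1.61, which is a useful reference point when reading the training curves.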

In [None]:
# Code Credits : Deep Learning with Python By Francois Chollet
# Code Credits : https://www.kaggle.com/uysimty/keras-cnn-dog-or-cat-classification
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense, Activation, BatchNormalization
from keras import layers
from keras.applications import VGG16

IMAGE_WIDTH = 225
IMAGE_HEIGHT = 225
NUM_CHANNELS = 3
conv_base = VGG16(weights = 'imagenet' , include_top = False , input_shape = (IMAGE_WIDTH , IMAGE_HEIGHT , NUM_CHANNELS))
conv_base.trainable = False # Freeze VGG16 base
model = Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(512 , activation = "relu"))
model.add(layers.Dense(units = 5 , activation = "softmax"))
model.summary()
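
As a cross-check on `model.summary()`, the trainable parameter count of the new head can be worked out by hand. This sketch assumes the standard VGG16 layout (five 2x2 max-pools with stride 2, 512 channels in the last block); only the two Dense layers are trainable because the base is frozen:

```python
def conv_out(size, n_pools=5):
    # Each VGG16 block ends in a 2x2 max-pool with stride 2 (floor division);
    # the 3x3 convolutions use 'same' padding and keep the spatial size.
    for _ in range(n_pools):
        size //= 2
    return size

def dense_params(n_in, n_out):
    return n_in * n_out + n_out      # weights + biases

side = conv_out(225)                 # 225 -> 112 -> 56 -> 28 -> 14 -> 7
flat = side * side * 512             # length of the Flatten output
trainable = dense_params(flat, 512) + dense_params(512, 5)
print(side, flat, trainable)
```

Almost all trainable weights sit in the first Dense layer, because Flatten produces a long vector from the 7x7x512 feature map.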

In [None]:
# Let's import some callbacks
# Callbacks help to limit overfitting and make training easier and more efficient
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

earlystop = EarlyStopping(patience=10)
# Monitor 'val_accuracy': this is the key Keras logs for metrics=['accuracy']
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)
callbacks = [earlystop, learning_rate_reduction]
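
To see what `patience`, `factor` and `min_lr` do, here is a simplified pure-Python imitation of the plateau rule. It is a sketch, not the Keras source: it ignores details such as `min_delta` and `cooldown`, and the accuracy values are made up.

```python
def simulate_reduce_lr(val_accuracies, initial_lr, factor=0.5, patience=3, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("-inf")
    wait = 0
    schedule = []
    for acc in val_accuracies:
        if acc > best:
            best, wait = acc, 0          # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:         # plateau: shrink lr, but not below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule

val_acc = [0.60, 0.65, 0.64, 0.65, 0.63, 0.66]   # hypothetical history
print(simulate_reduce_lr(val_acc, initial_lr=1e-3))
print(simulate_reduce_lr(val_acc, initial_lr=1e-5))
```

The second call matters for this notebook: the optimizer starts at 1e-5, which equals `min_lr`, so the callback has no room to lower the rate. Likewise, `EarlyStopping(patience=10)` cannot trigger within a 5-epoch run.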


### 4. Data Preparation

1. Now comes data preparation, the most critical and time-consuming part of any deep learning project. Since the image file names are given in a DataFrame, we will use the flow_from_dataframe() utility from Keras, but before that let us prepare a validation set.
2. A validation set is important for checking how well the model generalizes.
3. There are many validation methods; you can read about them online.
4. Here I am using a 20% holdout from the train data, stratified on the target. Please note that this is not necessarily the best validation strategy for this use case; it is just to get started.
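
What stratification means can be shown without scikit-learn. The function below is an illustrative stand-in (not how `train_test_split` is implemented) that holds out roughly 20% of each class separately, using made-up labels:

```python
import random
from collections import Counter, defaultdict

def stratified_holdout(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with about test_size of each class held out."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)                        # random pick within the class
        n_val = round(len(idxs) * test_size)
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx

toy = ["CMD"] * 60 + ["Healthy"] * 25 + ["CBB"] * 15   # made-up labels
train_idx, val_idx = stratified_holdout(toy)
print(Counter(toy[i] for i in val_idx))
```

Because each class is split on its own, the rare class keeps the same share in the validation set as in the full data, which a plain random split does not guarantee.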

In [None]:
train_df, validate_df = train_test_split(train, test_size=0.20, random_state=42 , stratify = np.array(train['label']))
train_df = train_df.reset_index(drop=True)
validate_df = validate_df.reset_index(drop=True)

In [None]:
# Distributions of Label in Train
train_df['label'].value_counts().plot.bar()


In [None]:
validate_df['label'].value_counts().plot.bar()


We can see almost the same distribution of targets in the train and validation sets, and both are similar to the distribution of the entire train set. Another experiment could be to split the data with different seeds, build a model on each split and average their predictions.
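
The averaging idea can be sketched as follows, assuming each model returns one row of class probabilities per image. The two "models" here are just made-up probability tables, and `average_predictions` is a hypothetical helper:

```python
def average_predictions(per_model_probs):
    """Average class probabilities element-wise across models."""
    n_models = len(per_model_probs)
    n_rows = len(per_model_probs[0])
    n_classes = len(per_model_probs[0][0])
    return [
        [sum(m[r][c] for m in per_model_probs) / n_models for c in range(n_classes)]
        for r in range(n_rows)
    ]

# Made-up outputs of two models (e.g. trained on splits with different seeds)
# for two images and five classes.
model_a = [[0.1, 0.1, 0.1, 0.6, 0.1], [0.3, 0.2, 0.1, 0.1, 0.3]]
model_b = [[0.2, 0.1, 0.1, 0.5, 0.1], [0.1, 0.1, 0.1, 0.1, 0.6]]
avg = average_predictions([model_a, model_b])
print([[round(p, 2) for p in row] for row in avg])
```

For the second image, model A is undecided between two classes while model B is confident; the average resolves the tie in favour of the confident prediction.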

In [None]:
total_train = train_df.shape[0]
total_validate = validate_df.shape[0]
batch_size= 128

#### Image Data Generator 

Let us define some augmentations, which make the model more robust and effectively increase the variety of the training data:
1. Random rotation of up to 30 degrees
2. Rescaling of pixel values by 1./255 (a normalization step rather than an augmentation)
3. Random zoom with a range of 0.2
4. Random horizontal and vertical flips

These are not the only possible augmentations. We can get more creative and derive many more from inspecting the images. Comment below if you find some useful augmentations.
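
To illustrate what the simplest of these transformations do to the pixel grid, here is a toy version on a tiny nested list. This is only a sketch of the idea; the real work is done by `ImageDataGenerator`, which applies the random transformations on the fly.

```python
# A tiny 2x3 single-channel "image" with 8-bit pixel values (made up).
image = [[0, 128, 255],
         [64, 32, 16]]

def rescale(img, factor=1 / 255):
    return [[px * factor for px in row] for row in img]   # map 0..255 to 0..1

def horizontal_flip(img):
    return [row[::-1] for row in img]   # mirror left-right

def vertical_flip(img):
    return img[::-1]                    # mirror top-bottom

print(horizontal_flip(image))
print(vertical_flip(image))
print(rescale(image)[0])
```

A leaf photographed from above has no canonical "up", which is why both flip directions are plausible for this dataset.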

In [None]:
train_datagen = ImageDataGenerator(
    rotation_range=30,
    rescale=1./255,
    zoom_range=0.2,
    horizontal_flip=True,
    vertical_flip = True
)

train_generator = train_datagen.flow_from_dataframe(
    train_df, 
    directory = "../input/cassava-leaf-disease-classification/train_images/", 
    x_col='image_id',
    y_col='label',
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    class_mode='categorical',
    batch_size=batch_size
)


In [None]:
# Validation Data 
validation_datagen = ImageDataGenerator(rescale=1./255)
validation_generator = validation_datagen.flow_from_dataframe(
    validate_df, 
    directory = "../input/cassava-leaf-disease-classification/train_images/", 
    x_col='image_id',
    y_col='label',
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    class_mode='categorical',
    batch_size=batch_size
)

### 5. Train Model 

We will use accuracy as the metric and the RMSprop optimizer with a learning rate of 1e-5.

Training the model for 5 epochs takes nearly 20 minutes on Kaggle's GPU environment. Training for more epochs might improve performance, but we would need to guard against overfitting, for example by adding Dropout.

In [None]:
from keras import optimizers
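# `learning_rate` replaces the deprecated `lr` argument
model.compile(loss='categorical_crossentropy', optimizer=optimizers.RMSprop(learning_rate=1e-5), metrics=['accuracy'])
# model.fit accepts generators directly; fit_generator is deprecated
history = model.fit(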
    train_generator, 
    epochs=5,
    validation_data=validation_generator,
    validation_steps=total_validate//batch_size,
    steps_per_epoch=total_train//batch_size,
    callbacks=callbacks
)
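
A note on `steps_per_epoch=total_train//batch_size`: floor division counts only full batches, so the last partial batch is left out of each epoch's step count. The arithmetic, with a hypothetical sample count (not the real dataset size) and the batch size used above:

```python
import math

def steps(n_samples, batch_size, keep_partial_batch):
    if keep_partial_batch:
        return math.ceil(n_samples / batch_size)   # last step is a smaller batch
    return n_samples // batch_size                 # full batches only

n_train, batch = 1000, 128   # hypothetical sample count, batch size from above
full_only = steps(n_train, batch, keep_partial_batch=False)
with_partial = steps(n_train, batch, keep_partial_batch=True)
print(full_only, with_partial, n_train - full_only * batch)
```

For training this hardly matters, since the generator shuffles. For prediction it does: every test image needs a prediction, which is why the prediction step uses the ceiling.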

In [None]:
# Save Model For Reproducibility and Inference
model.save("Cassava_VGG16Baseline.h5")


Let's visualize the training and validation loss and accuracy.

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))
# Epoch numbers 1..N for the epochs that were actually run
epochs_ran = np.arange(1, len(history.history['loss']) + 1)

ax1.plot(epochs_ran, history.history['loss'], color='b', label="Training loss")
ax1.plot(epochs_ran, history.history['val_loss'], color='r', label="Validation loss")
ax1.set_xticks(epochs_ran)
ax1.legend(loc='best', shadow=True)

ax2.plot(epochs_ran, history.history['accuracy'], color='b', label="Training accuracy")
ax2.plot(epochs_ran, history.history['val_accuracy'], color='r', label="Validation accuracy")
ax2.set_xticks(epochs_ran)
ax2.legend(loc='best', shadow=True)

plt.tight_layout()
plt.show()

#### Conclusions :
1. We are able to reach a validation accuracy of about 70% with just 5 epochs, training only the Dense layers on top of a frozen VGG16.
2. Other architectures such as ResNet and Inception may offer a significant improvement.
3. We can apply more preprocessing and augmentation to improve the model further.

### 6. Prediction and Making the Submission File

In [None]:
test_filenames = os.listdir("../input/cassava-leaf-disease-classification/test_images")
test_df = pd.DataFrame({
    'image_id': test_filenames
})
nb_samples = test_df.shape[0]

In [None]:
# Only 1 Test Image Available remaining will be available when we submit our kernel
test_gen = ImageDataGenerator(rescale=1./255)
test_generator = test_gen.flow_from_dataframe(
    test_df, 
    "../input/cassava-leaf-disease-classification/test_images", 
    x_col='image_id',
    y_col=None,
    class_mode=None,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=batch_size,
    shuffle=False
)

In [None]:
# Make predictions (steps must be an integer; predict_generator is deprecated)
predict = model.predict(test_generator, steps=int(np.ceil(nb_samples / batch_size)))

In [None]:
predict

We can see that the predictions are the probabilities of each of the 5 classes. We simply pick the index of the maximum probability and finally replace the index with the actual label of the disease.
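
Picking the most probable class per row, shown on made-up probabilities in plain Python; `np.argmax(predict, axis=-1)` does the same thing row by row on the real prediction array:

```python
# Made-up softmax outputs for three images over the five classes.
toy_predict = [
    [0.05, 0.10, 0.05, 0.70, 0.10],
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.15, 0.25, 0.40],
]

def argmax(row):
    # Index of the largest probability; ties resolve to the first index.
    return max(range(len(row)), key=row.__getitem__)

predicted_index = [argmax(row) for row in toy_predict]
print(predicted_index)
```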

In [None]:
test_df['label'] = np.argmax(predict, axis=-1)


In [None]:
test_df.head()

In [None]:
# class_indices map the index to actual category of the disease 
label_map = dict((v,k) for k,v in train_generator.class_indices.items())
test_df['label'] = test_df['label'].replace(label_map)

In [None]:
test_df.head()

In [None]:
# Now let's convert it back into the numeric format required for submission
test_df['label'] = test_df['label'].replace({ "Cassava Bacterial Blight (CBB)": 0, "Cassava Brown Streak Disease (CBSD)": 1 ,"Cassava Green Mottle (CGM)": 2 , "Cassava Mosaic Disease (CMD)":3 ,"Healthy":4})
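
The two `replace` steps go from generator index to disease name and then from name to competition label. The sketch below checks that round trip, under the assumption that the generator numbers classes in sorted order of their names (my understanding of the default `class_indices` behaviour, not something verified in this notebook):

```python
names = {
    0: "Cassava Bacterial Blight (CBB)",
    1: "Cassava Brown Streak Disease (CBSD)",
    2: "Cassava Green Mottle (CGM)",
    3: "Cassava Mosaic Disease (CMD)",
    4: "Healthy",
}

# Assumption: classes are indexed in sorted order of their names.
class_indices = {name: i for i, name in enumerate(sorted(names.values()))}

index_to_name = {i: name for name, i in class_indices.items()}   # like label_map
name_to_label = {name: label for label, name in names.items()}   # competition ids

round_trip = {i: name_to_label[index_to_name[i]] for i in index_to_name}
print(round_trip)
```

Under that assumption the generator index and the competition label coincide for these five names. Going through the names still keeps the code correct if the class order were ever different.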

In [None]:
# Making Final Submission
submission_df = test_df.copy()
submission_df.to_csv("submission.csv" ,index = False)

### Improvement Ideas:
1. The VGG16 model used here is very basic and can be improved.
2. It would be great to try models like ResNet and Inception.
3. Using better preprocessing techniques before feeding images to the neural network.
4. Adding Dropout and other regularization to the neural network.
5. Trying out image segmentation and other techniques (these need more research).
6. Training the model for longer.