# Leaf Data Analysis on CPU

## Objective

Understand ways to find a data set and to prepare a data set for machine learning and training.

## Activities 
**In this section of the training you will**
- Fetch and visually inspect a dataset 
- Create a dataset to address a real life problem
- Preprocess images
- Apply data augmentation techniques
- Address the imbalanced dataset problem
- Organize a dataset into training, validation and testing groups
- Finalize an augmented dataset for training and testing

As you follow this notebook, complete **Activity** sections to finish this workload. 


### 1.1 Find a Data set


### 1.2 Subset of the Data

Although we used the original data set during our initial exploration, we are only providing the subset of the data that is relevant for training. This means that you don't have to download an 11GB dataset and can instead use the data set provided with the notebook. We maintained the original folder structure of the dataset but removed the classes we're not using.

### 1.3 Infected Plant Disease: The Most Common Plant Disease found in the world.





### 1.4 Select and Merge Interested Classes




Click the cell below and then click **Run**.

In [None]:
import os
os.sys.path
import cv2

prcessed_leaf_path = "/hdd/data/leaf_data_set/plantdisease/processed"


### 2.1 Remove Invalid, Corrupt and Non-JPG files


In this section, we remove images that are not in ".jpg" format or that cannot be read by the cv2 module. We use a multiprocessing pool so that the check runs on all of the cores of our machine and finishes quickly.

Click the cell below and then click **Run**.

In [None]:
import os
import shutil
import glob
import vmmr_utils
import matplotlib.pyplot as plt
%matplotlib inline


from multiprocessing import Pool

prcessed_leaf_path = "/hdd/data/leaf_data_set/plantdisease/processed"

prcessed_leaf_path 
#Check Images
if __name__ == '__main__':
    pool = Pool()
    image_list = glob.glob(prcessed_leaf_path + "/*/*")
    pool.map(vmmr_utils.check_image, image_list)
    pool.close()

print('Done.')
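
The `vmmr_utils.check_image` helper is not shown in this notebook. As a rough illustration of the kind of check involved, the sketch below uses only the standard library: it accepts a file only if it has a `.jpg` extension and starts with the JPEG start-of-image bytes. The name `looks_like_jpeg` is hypothetical, and this header test is a lighter substitute for decoding the image with cv2, so it will not catch every corrupt file.

```python
import os

# JPEG files begin with the start-of-image marker FF D8, followed by FF.
JPEG_MAGIC = b"\xff\xd8\xff"


def looks_like_jpeg(path):
    """Return True if path ends in .jpg and starts with a JPEG header.

    Hypothetical helper for illustration; it is not part of vmmr_utils.
    """
    if not path.lower().endswith(".jpg"):
        return False
    try:
        with open(path, "rb") as f:
            return f.read(3) == JPEG_MAGIC
    except OSError:
        # Missing or unreadable files are treated as invalid.
        return False
```

A file that fails this test could then be deleted or moved aside before the dataset is split.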

### 2.2 Distribution of Selected Classes


Now, we can take a look at the class distribution of our problem statement. We're importing PyGal and creating a wrapper for rendering the chart inline, then passing in our data to the charting function.

Click the cell below and then click **Run**.

In [None]:
import pygal 
from IPython.display import display, HTML
#Create function to display interactive plotting
base_html = """
<!DOCTYPE html>
<html>
  <head>
  <script type="text/javascript" src="http://kozea.github.com/pygal.js/javascripts/svg.jquery.js"></script>
  <script type="text/javascript" src="https://kozea.github.io/pygal.js/2.0.x/pygal-tooltips.min.js"></script>
  </head>
  <body>
    <figure>
      {rendered_chart}
    </figure>
  </body>
</html>
"""

def galplot(chart):
    rendered_chart = chart.render(is_unicode=True)
    plot_html = base_html.format(rendered_chart=rendered_chart)
    display(HTML(plot_html))
    
#Compare class distribution
line_chart = pygal.Bar(height=300)
line_chart.title = 'Leaf Class Distribution'
for o in os.listdir(prcessed_leaf_path):
    line_chart.add(o, len(os.listdir(os.path.join(prcessed_leaf_path, o))))
galplot(line_chart)
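
The chart shows the distribution visually. To put a number on the imbalance, the per-class counts can also be collected into a dictionary. The sketch below assumes the same layout as above (one sub-folder per class); the helper names `count_images_per_class` and `imbalance_ratio` are hypothetical.

```python
import os


def count_images_per_class(root):
    """Map each class folder under root to the number of files it contains."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir):
            counts[name] = len(os.listdir(class_dir))
    return counts


def imbalance_ratio(counts):
    """Largest class size divided by the smallest non-empty class size.

    A value of 1.0 means the non-empty classes are perfectly balanced.
    """
    sizes = [n for n in counts.values() if n > 0]
    return max(sizes) / min(sizes)
```

For example, `imbalance_ratio(count_images_per_class(prcessed_leaf_path))` gives a single figure that can be compared before and after augmentation.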

### 2.3 Confirm Folder Structure is Correct

To summarize and confirm our progress, we can print the folder tree structure in **Most_Infected_Leafs** and list a couple of the images in each folder that we used to create the smaller subset.

Click the cell below and then click **Run**.

In [None]:
#Confirm Folder Structure
prcessed_leaf_path = "/hdd/data/leaf_data_set/plantdisease"
for root, dirs, files in os.walk(prcessed_leaf_path):
    #Depth of the current folder relative to the starting folder
    level = root.replace(prcessed_leaf_path, '').count(os.sep)
    print('{0}{1}/'.format('    ' * level, os.path.basename(root)))
    #Show at most two files per folder
    for f in files[:2]:
        print('{0}{1}'.format('    ' * (level + 1), f))
    if level != 0:
        print('{0}{1}'.format('    ' * (level + 1), "..."))



### Create Train, Validation and Test Folders


We need to create training, validation and test folders for data ingestion and we'll use 0.7, 0.1, 0.2 ratio for this purpose.

Click the cell below and then click **Run**.

In [None]:
import os
import shutil
import glob
import vmmr_utils
import math
import re
import sys


#Train and Test Set Variables
train_val_test_ratio = (.7,.1,.2) # 70/10/20 Data Split
test_folder = '/hdd/data/leaf_data_set/plantdisease/test/'
train_folder = '/hdd/data/leaf_data_set/plantdisease/train/'
val_folder = '/hdd/data/leaf_data_set/plantdisease/val/'

file_names = os.listdir('/hdd/data/leaf_data_set/plantdisease/processed')

prcessed_leaf_path = "/hdd/data/leaf_data_set/plantdisease/processed"

#Remove Existing Folders if they exist
for folder in [test_folder, train_folder, val_folder]:
    if os.path.exists(folder) and os.path.isdir(folder):
        shutil.rmtree(folder)

#Remake Category Folders in the Train, Validation and Test Folders
for category in file_names:
    os.makedirs(test_folder + category)
    os.makedirs(train_folder + category)
    os.makedirs(val_folder + category)

#Split Data by Train Ratio and copy files to correct directory
for idx, category in enumerate(file_names):
    file_list = os.listdir(prcessed_leaf_path + '/' + category)
    
    train_ratio = math.floor(len(file_list) * train_val_test_ratio[0])
    val_ratio = math.floor(len(file_list) * train_val_test_ratio[1])
    train_list = file_list[:train_ratio]
    val_list = file_list[train_ratio:train_ratio + val_ratio]
    test_list = file_list[train_ratio + val_ratio:]
    
    #Copy each file into the matching split folder (the originals stay in place)
    src_dir = os.path.join(prcessed_leaf_path, category)
    for file in train_list:
        shutil.copy(os.path.join(src_dir, file), os.path.join(train_folder, category, file))
    print('Copied %s train images to category folder %s' % (len(train_list), category))
    for file in val_list:
        shutil.copy(os.path.join(src_dir, file), os.path.join(val_folder, category, file))
    print('Copied %s validation images to category folder %s' % (len(val_list), category))
    for file in test_list:
        shutil.copy(os.path.join(src_dir, file), os.path.join(test_folder, category, file))
    print('Copied %s test images to category folder %s' % (len(test_list), category))
    
print("Done.")  
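
The cell above slices each class's file list in whatever order `os.listdir` returns it, which is arbitrary but not random. One possible variation, sketched below, is to shuffle the list with a fixed seed before slicing so that the split is both mixed and reproducible. The function name `split_file_list` and the default seed are assumptions made for this illustration.

```python
import math
import random


def split_file_list(file_list, ratios=(0.7, 0.1, 0.2), seed=21):
    """Shuffle file_list reproducibly and split it into train/val/test lists.

    Train and validation sizes are rounded down; the remainder goes to test,
    mirroring the slicing used in the cell above.
    """
    files = sorted(file_list)            # fixed starting order
    random.Random(seed).shuffle(files)   # seeded, so the result is repeatable
    n_train = math.floor(len(files) * ratios[0])
    n_val = math.floor(len(files) * ratios[1])
    train = files[:n_train]
    val = files[n_train:n_train + n_val]
    test = files[n_train + n_val:]
    return train, val, test
```

Each returned list can then be copied into its folder exactly as in the loop above.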

### Sample Augmentation

While looking at our distribution above, we saw that certain classes had significantly fewer images than others. To help mitigate that issue, we're going to augment some of our data set so that the classes are more evenly distributed. Below, we take an example image and show the effects of augmentation within a given range of modification. Then we're going to apply these random augmentations to our data.

Click the cell below and then click **Run**.

In [None]:
import random
import numpy as np
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
import matplotlib.pyplot as plt
%matplotlib inline

#Select a random image and follow the next step
datagen = ImageDataGenerator(rotation_range=45, 
                             width_shift_range=0.2, 
                             height_shift_range=0.2, 
                             zoom_range=0.3, 
                             vertical_flip=True,
                             horizontal_flip=True, 
                             fill_mode="nearest")
#Load example image
file_list = glob.glob("/hdd/data/leaf_data_set/plantdisease/test/*/*")
img_path = random.choice(file_list)
img = load_img(img_path)
car_class = img_path.split("/")[-2]  # class name is the parent folder of the image
plt.imshow(img)
plt.axis("off")
plt.title("Original " + car_class, fontsize=16)

img = img_to_array(img)
img = img.reshape((1,) + img.shape)
#Apply different augmentation techniques
n_augmentations = 4
plt.figure(figsize=(15, 6))    
i = 0
for batch in datagen.flow(img, 
                          batch_size=1, 
                          seed=21):
    
    plt.subplot(2, int(np.ceil(n_augmentations * 1. / 2)), i + 1)
    plt.imshow(array_to_img(batch[0]))
    plt.axis("off")
    plt.suptitle("Augmented " + car_class, fontsize=16)    
    
    i += 1
    if i >= n_augmentations:
        break

### Finalize Augmented Dataset for Training 


By using the augmentation techniques we have learned, we can oversample minority classes in the training set. We are not going to apply these steps to the validation or test sets, to avoid introducing bias into the evaluation data.

**Activity**

Click the cell below and then click **Run**.

In [None]:
#Oversampling Minority Classes in Training Set
def data_augment(data_dir):
    list_of_images = os.listdir(data_dir)
    datagen = ImageDataGenerator(rotation_range=45, 
        horizontal_flip=True, 
        fill_mode="nearest")
    for img_name in list_of_images:
        tmp_img_name = os.path.join(data_dir, img_name)
        img = load_img(tmp_img_name)
        img = img_to_array(img)
        img = img.reshape((1,) + img.shape)

        batch = datagen.flow(img, 
            batch_size=1, 
            seed=21,
            save_to_dir=data_dir, 
            save_prefix=img_name.split(".jpg")[0] + "augmented", 
            save_format="jpg")

        # Pull one batch from the iterator, which writes one augmented image to data_dir
        next(batch)

classes_to_augment = [
        "1.Apple_scab",
        "10.Corn_Gray_leaf_spot",
        "11.Corn_Northern_Leaf_Blight",
        "12.Grape_Black_Measles",
        "13.Grape_Black_rot",
        "14.Grape_healthy",
        "15.Grape_Leaf_blight",
        "16.Orange_Citrus_greening",
        "17.Peach_Bacterial_spot",
        "18.Peach_healthy",
        "19.Pepperbell_Bacterial_spot",
        "2.Apple_Black_rot",
        "20.Pepperbell_healthy",
        "21.Potato_Early_blight",
        "22.Potato_healthy",
        "23.Potato_Late_blight",
        "24.Raspberry_healthy",
        "25.Soybean_healthy",
        "26.Squash_Powdery_mildew",
        "27.Strawberry_healthy",
        "28.Strawberry_Leaf_scorch",
        "29.Tomato_Bacterial_spot",
        "3.Apple_Cedar_rust",
        "30.Tomato_Early_blight",
        "31.Tomato_healthy",
        "32.Tomato_Late_blight",
        "33.Tomato_Leaf_Mold",
        "34.Tomato_mosaic_virus",
        "35.Tomato_Septoria_leaf_spot",
        "36.Tomato_Spider_mites",
        "37.Tomato_Target_Spot",
        "38.Tomato_Yellow_Leaf_Curl_Virus",
        "4.Apple_healthy",
        "5.Blueberry_healthy",
        "6.Cherry_healthy",
        "7.Cherry_Powdery_mildew",
        "8.Corn_Common_rust",
        "9.Corn_healthy"]


for class_names in classes_to_augment:
    print("Currently Augmenting:", class_names)
    data_dir = os.path.join(train_folder, class_names)
    data_augment(data_dir)
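
As written, the loop above produces one augmented copy of each training image in every listed class, so each class roughly doubles and the relative class sizes stay about the same. If the aim is to bring the smaller classes closer to the largest one, one option is to work out first how many extra images each class would need. The sketch below is a hypothetical helper (`augmentation_targets` is not part of the notebook) that does only this bookkeeping; it does not generate any images.

```python
def augmentation_targets(class_counts, target=None):
    """Return how many extra images each class needs to reach target.

    class_counts maps class name -> current number of training images.
    If target is None, the size of the largest class is used.
    """
    if target is None:
        target = max(class_counts.values())
    return {name: max(0, target - n) for name, n in class_counts.items()}


# Made-up counts, for illustration only
example_counts = {"class_a": 100, "class_b": 40, "class_c": 100}
print(augmentation_targets(example_counts))
# {'class_a': 0, 'class_b': 60, 'class_c': 0}
```

The resulting numbers could then drive how many times the augmentation generator is called for each class.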

### Resize Images

![Resize Images](assets/EDA_1-6.png)

Depending on the topology, we need to resize the images to the input size it expects. Since we're going to be using InceptionV3 in the next section, we're going to match the size for that topology, 299x299.

**Activity**

Click the cell below and then click **Run**.

In [None]:
from functools import partial

#Resize Images
if __name__ == '__main__':
    pool = Pool()
    image_list = glob.glob(train_folder + "/*/*")
    func = partial(vmmr_utils.resize_image, size=299)
    pool.map(func, image_list)
    pool.close()

vmmr_utils.display_images(train_folder)

### Look at Distribution of Selected Classes again


Now that we've done some augmentation to the dataset we want to see how the distribution has changed compared to before the augmentation.  In this case we're only going to be looking at the train folder, since we only augmented the train dataset, so the numbers will be slightly lower than the full dataset distribution graph from earlier.  

Click the cell below and then click **Run**.

In [None]:
#Compare class distribution
line_chart = pygal.Bar(height=300)
line_chart.title = 'Infected Leaf Training Class Distribution'
for o in os.listdir(train_folder):
    line_chart.add(o, len(os.listdir(os.path.join(train_folder, o))))
galplot(line_chart)   

## Summary 
**In this section of the training you learned how to**
- Fetch and visually inspect a dataset 
- Create a dataset to address a real life problem
- Preprocess images and apply data augmentation techniques
- Address the imbalanced dataset problem
- Organize a dataset into training, validation and testing groups
- Finalize an augmented dataset for training and testing

You should now understand ways to find a data set and to prepare a data set for machine learning and training.