# <span STYLE="text-decoration:underline">Global Wheat Detection</span>
<div align="center"><img src="http://www.global-wheat.com/wp-content/uploads/2020/04/ILLU_01_EN.jpg" width="800"/></div>
<span STYLE="text-decoration:underline">
The Problem</span> 
<br>
For several years, agricultural research has been using sensors to observe plants at key moments in their development. However, some important plant traits are still measured manually. One example of this is the manual counting of wheat ears from digital images – a long and tedious job. Factors that make it difficult to manually count wheat ears from digital images include the possibility of overlapping ears, variations in appearance according to maturity and genotype, the presence or absence of barbs, head orientation and even wind.  
 
<br>
<span style="text-decoration:underline">The Need</span> 
<br>
There is a need for a robust and accurate computer model that can count wheat ears from digital images. Such a model will benefit phenotyping research and help producers around the world assess ear density, health and maturity more effectively. Some deep learning work has already been done, but it has produced too little data to train a generic model.  
<br>
Refer to [this page](http://www.global-wheat.com/) for more details.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas_profiling import ProfileReport
from pandas_summary import DataFrameSummary
import cv2
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import random
import math
from tqdm import tqdm
import seaborn as sns
from os import listdir
from os.path import isfile, join
%matplotlib inline

path = '../input/global-wheat-detection/'
TRAIN_IMAGES_PATH = path+'train/'

In [None]:
train = pd.read_csv(path+'train.csv')
sub = pd.read_csv(path+'sample_submission.csv')

# Data Pre-processing

## Utility Functions

In [None]:
def plot_images(dataframe, show_bb=True, image_count=16, idxs=None, image_label='image_id', bbox_label='bbox', n=3):
    """Display images in a grid with n images per row.
        Usage:-
            dataframe = The dataframe containing images and bounding boxes grouped into a single list.
            show_bb = Show bounding boxes or not.
            image_count = Number of images to display (rounded up to a multiple of n).
            idxs = Row positions to plot. Pass None or [] to select random images.
            image_label = Name of the column containing image file names in dataframe.
            bbox_label = Name of the column containing the list of all bounding boxes per image in dataframe.
            n = Number of images per row.

        Returns the list of row positions that were plotted.

        [NOTE]: If you want to convert stock train dataframe to desired format use the clean_data method.
    """
    size = len(dataframe)
    # Round image_count up to the next multiple of n so that the grid is full.
    image_count += (-image_count) % n
    # Avoid a mutable default argument: build a fresh list on every call.
    if idxs is None or len(idxs) == 0:
        idxs = [random.randint(0, size - 1) for _ in range(image_count)]

    row_count = image_count // n
    # squeeze=False keeps ax two-dimensional even for a single row or column.
    fig, ax = plt.subplots(row_count, n, figsize=(20, 10), squeeze=False)
    for i in range(image_count):
        input_row = dataframe.iloc[idxs[i]]
        axis = ax[i // n, i % n]
        # cv2 loads images as BGR; convert to RGB before handing them to matplotlib.
        image = cv2.imread(TRAIN_IMAGES_PATH + input_row[image_label])
        axis.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        axis.set_title(input_row[image_label])
        if show_bb:
            # Each bbox is [x, y, width, height].
            for bbox in input_row[bbox_label]:
                rect = patches.Rectangle((bbox[0], bbox[1]), bbox[2], bbox[3], linewidth=2, edgecolor='r', facecolor='none')
                axis.add_patch(rect)
    plt.show()
    return idxs


def enlarge_image(dataframe, idx=-1, show_bb=True, image_label='image_id', bbox_label='bbox'):
    """Display a single enlarged image with or without bounding boxes.
        Usage:-
            dataframe = The dataframe containing images and bounding boxes grouped into a single list.
            idx = The row position of the image to be displayed. -1 (default) means random selection.
            show_bb = Show bounding boxes or not.
            image_label = Name of the column containing image file names in dataframe.
            bbox_label = Name of the column containing the list of all bounding boxes per image in dataframe.

        [NOTE]: If you want to convert stock train dataframe to desired format use the clean_data method.
    """
    fig, ax = plt.subplots(figsize=(15, 15))
    # Random selection draws from labeled rows only. This relies on the
    # unlabeled ('not_specified') rows being placed after the labeled ones.
    size = len(dataframe[dataframe['source'] != 'not_specified'])
    if idx == -1:
        idx = random.randint(0, size - 1)
    input_row = dataframe.iloc[idx]
    # cv2 loads images as BGR; convert to RGB before handing them to matplotlib.
    image = cv2.imread(TRAIN_IMAGES_PATH + input_row[image_label])
    ax.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    ax.set_title(input_row[image_label])
    if show_bb:
        # Each bbox is [x, y, width, height].
        for bbox in input_row[bbox_label]:
            rect = patches.Rectangle((bbox[0], bbox[1]), bbox[2], bbox[3], linewidth=2, edgecolor='r', facecolor='none')
            ax.add_patch(rect)
    plt.show()
    

def parse_bbox(bb):
    """Parse a '[x, y, w, h]' string into a list of floored integers."""
    return [math.floor(float(v)) for v in bb.strip('[]').split(',')]


def clean_data(train):
    """Group the stock train dataframe into one row per image.

    Returns a dataframe with the columns image_id (file name with .jpg),
    bbox (list of [x, y, w, h] boxes) and source.
    """
    rows = []
    # sort=False keeps the images in order of first appearance.
    # Grouping keeps every box of an image (including the first row) and
    # pairs each image with its own source.
    for img, group in tqdm(train.groupby('image_id', sort=False)):
        rows.append({
            'image_id': img + '.jpg',
            'bbox': [parse_bbox(bb) for bb in group['bbox']],
            'source': group['source'].iloc[0],
        })
    train_clean = pd.DataFrame(rows, columns=['image_id', 'bbox', 'source'])
    return train_clean

In [None]:
train.head()

In [None]:
dfs = DataFrameSummary(train)

dfs.columns_stats

In [None]:
train_clean = clean_data(train)


# Add the unlabeled images from the train directory.
dic = {}
onlyfiles = [f for f in listdir(path+'train') if isfile(join(path+'train', f))]
unlabeled = list(set(onlyfiles) - set(train_clean['image_id']))
dic['image_id'] = unlabeled
dic['bbox'] = [[] for i in range(len(unlabeled))]
dic['source'] = ['not_specified' for i in range(len(unlabeled))]
temp_clean = pd.DataFrame(dic)
# ignore_index=True avoids duplicate index labels in the combined dataframe.
train_clean = pd.concat([train_clean, temp_clean], ignore_index=True)

In [None]:
train_clean.head()

# EDA
## About the Data Sources
[ARVALIS - Plant Institute](https://www.english.arvalisinstitutduvegetal.fr/index.html) (ARVALIS - Institut du végétal) is an applied agricultural research organization dedicated to arable crops: cereals, maize, sorghum, potatoes, fodder crops, flax and tobacco. … It considers technological innovation a major tool to enable producers and agri-companies to respond to societal challenges.
<br>
<br>
[ETHZ- ETH Zurich](https://ethz.ch/en.html) trains true experts and prepares its students to carry out their tasks as critical members of their communities, making an important contribution to the sustainable development of science, the economy and society.
<br>
<br>
[INRAE](https://www.inrae.fr/en) is France's new National Research Institute for Agriculture, Food and Environment, created on January 1, 2020. It was formed by the merger of INRA, the National Institute for Agricultural Research, and IRSTEA, the National Research Institute of Science and Technology for the Environment and Agriculture.
<br>
<br>
[Rothamsted Research](https://www.rothamsted.ac.uk/) (RRES, the `rres_1` source) is an agricultural research institution based in Harpenden, United Kingdom, and one of the oldest agricultural research stations in the world.
<br>
<br>
Researchers at the [University of Saskatchewan (USask)](https://www.usask.ca/) played a key role in an international consortium that sequenced the entire genome of durum wheat, the source of semolina for pasta and a food staple for the world's population, as reported in Nature Genetics.

In [None]:
counts = dict(train_clean['source'].value_counts())

fig, ax = plt.subplots(figsize=(8,8));
wedges, texts, autotexts = ax.pie(list(counts.values()), autopct='%1.1f%%',
        shadow=True, startangle=90);
ax.legend(wedges, list(counts.keys()),
          title="Sources",
          loc="center",
          bbox_to_anchor=(1, 0, 0.5, 1));

plt.setp(autotexts, size=15);

ax.set_title("Data Distribution based on Sources");
plt.show();
sns.countplot(x='source', data=train_clean);

In [None]:
profile = ProfileReport(train, title='Report',progress_bar = False);
profile.to_widgets()

## Close Analysis of a Single Image

In [None]:
enlarge_image(train_clean)

## Plotting with and without bounding boxes for same images

In [None]:
idxs = plot_images(train_clean, show_bb = False, image_count = 3);
plot_images(train_clean, show_bb = True,idxs = idxs, image_count = 3);

## Examples of images without any wheat heads

In [None]:
plot_images(train_clean[train_clean['source']=='not_specified'], idxs = [],show_bb = True, image_count = 6);

## Images from [ARVALIS - Plant Institute](https://www.english.arvalisinstitutduvegetal.fr/index.html)-1

In [None]:
idxs = plot_images(train_clean[train_clean['source']=='arvalis_1'], show_bb = False,idxs = [], image_count = 3);
plot_images(train_clean[train_clean['source']=='arvalis_1'], idxs = idxs, show_bb = True, image_count = 3);

## Images from [ARVALIS - Plant Institute](https://www.english.arvalisinstitutduvegetal.fr/index.html)-2

In [None]:
idxs = plot_images(train_clean[train_clean['source']=='arvalis_2'], show_bb = False,idxs = [], image_count = 3);
plot_images(train_clean[train_clean['source']=='arvalis_2'], idxs = idxs, show_bb = True, image_count = 3);

## Images from [ARVALIS - Plant Institute](https://www.english.arvalisinstitutduvegetal.fr/index.html)-3

In [None]:
idxs = plot_images(train_clean[train_clean['source']=='arvalis_3'], show_bb = False,idxs = [], image_count = 3);
plot_images(train_clean[train_clean['source']=='arvalis_3'], idxs = idxs, show_bb = True, image_count = 3);

## Images from [ETH Zurich](https://ethz.ch/en.html)

In [None]:
idxs = plot_images(train_clean[train_clean['source']=='ethz_1'], show_bb = False,idxs = [], image_count = 3);
plot_images(train_clean[train_clean['source']=='ethz_1'], idxs = idxs, show_bb = True, image_count = 3);

## Images from [Rothamsted Research](https://www.rothamsted.ac.uk/)

In [None]:
idxs = plot_images(train_clean[train_clean['source']=='rres_1'], show_bb = False,idxs = [], image_count = 3);
plot_images(train_clean[train_clean['source']=='rres_1'], idxs = idxs, show_bb = True, image_count = 3);

## Images from [University of Saskatchewan](https://www.usask.ca/)

In [None]:
idxs = plot_images(train_clean[train_clean['source']=='usask_1'], show_bb = False,idxs = [], image_count = 3);
plot_images(train_clean[train_clean['source']=='usask_1'], idxs = idxs, show_bb = True, image_count = 3);

## Images from [INRAE](https://www.inrae.fr/en)

In [None]:
idxs = plot_images(train_clean[train_clean['source']=='inrae_1'], show_bb = False,idxs = [], image_count = 3);
plot_images(train_clean[train_clean['source']=='inrae_1'], idxs = idxs, show_bb = True, image_count = 3);

The image analysis above shows that each source has its own characteristic type of wheat image. This probably depends on the region where the data was collected and the time of year of collection.
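
One way to put numbers on these per-source differences is to compare how many heads are annotated per image in each source. The sketch below assumes a dataframe in the `train_clean` layout (one row per image, with a `bbox` list and a `source` column); `boxes_per_source` is a hypothetical helper, and the toy data only stands in for the real dataframe.

```python
import pandas as pd


def boxes_per_source(df, bbox_label='bbox', source_label='source'):
    """Summarise the number of bounding boxes per image for each source."""
    # Each entry of the bbox column is a list of boxes, so len() counts heads.
    counts = df[bbox_label].apply(len)
    return (df.assign(n_boxes=counts)
              .groupby(source_label)['n_boxes']
              .agg(['count', 'mean', 'min', 'max']))


# Toy stand-in for train_clean, for illustration only.
demo = pd.DataFrame({
    'image_id': ['a.jpg', 'b.jpg', 'c.jpg'],
    'bbox': [[[0, 0, 10, 10], [5, 5, 20, 20]], [[1, 1, 5, 5]], []],
    'source': ['arvalis_1', 'arvalis_1', 'not_specified'],
})
summary = boxes_per_source(demo)
print(summary)
```

On the real data this would be called as `boxes_per_source(train_clean)`. Large gaps in the mean count between sources would support the idea that the sources differ in planting density or growth stage.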

# Possible Approaches for Prediction
1. We can directly train an object detection neural network for this problem. However, it would try to find wheat heads in every picture, even in those that contain none, which might result in false positives.
2. Another approach is to create a classification + object detection ensemble model that first classifies whether the image contains any wheat heads. The images classified as containing wheat heads are then passed to the object detector for bounding box detection. This would reduce the problem of false positives but might lead to some false negatives.
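
The second approach can be sketched as a simple gate in front of the detector. This is only an outline under stated assumptions: `has_wheat_prob`, `detect` and `threshold` are hypothetical names, and the two stand-in functions below take the place of real trained models.

```python
from typing import Callable, List

Box = List[float]  # [x, y, w, h] in pixels


def detect_with_gate(image,
                     has_wheat_prob: Callable[[object], float],
                     detect: Callable[[object], List[Box]],
                     threshold: float = 0.5) -> List[Box]:
    """Run the detector only when the classifier believes wheat is present."""
    if has_wheat_prob(image) < threshold:
        # Classifier says "no wheat": skipping detection removes false
        # positives, but any heads actually present become false negatives.
        return []
    return detect(image)


# Stand-in models for illustration only; real models would replace these.
def fake_classifier(image):
    return 0.9 if image == "field.jpg" else 0.1


def fake_detector(image):
    return [[10.0, 20.0, 30.0, 40.0]]


print(detect_with_gate("field.jpg", fake_classifier, fake_detector))
print(detect_with_gate("soil.jpg", fake_classifier, fake_detector))
```

The threshold controls the trade-off described above: lowering it lets more images through to the detector (fewer false negatives), while raising it filters more aggressively (fewer false positives).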

## Some insights about model selection and other tricks:
* So far, EfficientDet seems to outperform other model architectures.
* Augmentation generally helps improve accuracy.
* Cutmix and mixup are especially useful types of augmentation.
* 5-fold training with an ensemble based on **WBF** (weighted boxes fusion) seems to work well.
* Training is very slow, and running it on Kaggle is not a great idea. Use Colab with some tricks instead.
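
To give a feel for what a WBF-style ensemble does with the predictions of several folds, here is a heavily simplified, single-class sketch. It is not the reference WBF implementation: `fuse_boxes` and `iou_xyxy` are hypothetical helpers, boxes are assumed to be in `[x1, y1, x2, y2]` format, and the confidence rescaling by the number of models that full WBF applies is omitted.

```python
import numpy as np


def iou_xyxy(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_boxes(boxes, scores, iou_thr=0.55):
    """Cluster overlapping boxes and average them, weighted by confidence."""
    order = np.argsort(scores)[::-1]  # highest confidence first
    clusters = []  # per cluster: list of (box, score) members
    fused = []     # per cluster: current fused box
    for i in order:
        box = np.asarray(boxes[i], dtype=float)
        score = float(scores[i])
        # Find the fused box that overlaps this one the most, if any.
        best, best_iou = -1, iou_thr
        for c, fused_box in enumerate(fused):
            overlap = iou_xyxy(fused_box, box)
            if overlap > best_iou:
                best, best_iou = c, overlap
        if best == -1:
            clusters.append([(box, score)])
            fused.append(box.copy())
        else:
            clusters[best].append((box, score))
            weights = np.array([s for _, s in clusters[best]])
            members = np.stack([b for b, _ in clusters[best]])
            # Coordinates are the confidence-weighted mean of the members.
            fused[best] = (members * weights[:, None]).sum(axis=0) / weights.sum()
    fused_scores = [float(np.mean([s for _, s in c])) for c in clusters]
    return np.array(fused), np.array(fused_scores)


# Two models agree on one head; a third box is far away.
demo_boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
demo_scores = [0.9, 0.6, 0.8]
fused, fused_scores = fuse_boxes(demo_boxes, demo_scores)
print(fused, fused_scores)
```

Unlike non-maximum suppression, which keeps only the highest-scoring box of a cluster, this averaging uses every overlapping prediction, which is the main reason it tends to suit ensembles.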

### <span style="text-decoration:underline">TODO:</span>
* <del>Image Augmentation</del><br>Cutmix and mixup are good. Other augmentations include cutout, random flip and rotate. Brightness changes and blurring don't seem to help much.
* <del>Training and Inference Code</del><br>Refer to [this notebook](https://www.kaggle.com/yashchoudhary/gwd-fasterrcnn-with-augmentation-train-inference) for <span style="text-decoration:underline; color:red">FasterRCNN Training and Inference Code with Augmentation.</span>
* <del>Final Model Selection</del><br> EfficientDet out-performs others.
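
For the geometric augmentations mentioned in this list, the bounding boxes have to move together with the pixels. Below is a minimal sketch of a horizontal flip, assuming boxes in the `[x, y, w, h]` pixel format used by `train.csv`; `hflip_with_boxes` is a hypothetical helper and not code from the linked notebook.

```python
import numpy as np


def hflip_with_boxes(image, boxes):
    """Flip an HxWxC image left-right and adjust its [x, y, w, h] boxes."""
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    # The left edge of a flipped box is the mirrored right edge of the original.
    new_boxes = [[width - (x + w), y, w, h] for x, y, w, h in boxes]
    return flipped, new_boxes


# Tiny synthetic image with a single marked pixel in the top-left corner.
demo_image = np.zeros((4, 10, 3), dtype=np.uint8)
demo_image[0, 0] = 255
flipped_image, flipped_boxes = hflip_with_boxes(demo_image, [[1, 0, 3, 2]])
print(flipped_boxes)
```

Libraries such as Albumentations handle this bookkeeping automatically, but writing it out once makes it easier to check that boxes and images stay aligned after augmentation.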


I will be updating this notebook with more analysis methodologies.
If you like my work please <span style="text-decoration:underline;color:red">UPVOTE</span>. It really motivates me to create better notebooks.