# 1. Introduction


In this competition, you’ll detect wheat heads from outdoor images of wheat plants, including wheat datasets from around the globe. Using worldwide data, you will focus on a generalized solution to estimate the number and size of wheat heads. To better gauge the performance for unseen genotypes, environments, and observational conditions, the training dataset covers multiple regions. You will use more than 3,000 images from Europe (France, UK, Switzerland) and North America (Canada). The test data includes about 1,000 images from Australia, Japan, and China. 



In [None]:
# Webinar on improving wheat head counting through the Global Wheat Challenge

from IPython.display import YouTubeVideo
YouTubeVideo('Wr44me5eyWY', width=600, height=400)


**This kernel is a work in progress, and I will keep updating it as the competition progresses.**

**<span style="color:Red">If you find this kernel useful, please consider upvoting it. It motivates me to write more quality content.</span>**


## 1.2 Let's learn more by answering a few questions

### Q1. Why are we solving this problem?

For several years, agricultural research has been using sensors to observe plants at key moments in their development. However, some important plant traits are still measured manually. One example of this is the manual counting of wheat ears from digital images – a long and tedious job. Factors that make it difficult to manually count wheat ears from digital images include the possibility of overlapping ears, variations in appearance according to maturity and genotype, the presence or absence of barbs, head orientation and even wind.  

Automating the task is not easy either: accurate wheat head detection in outdoor field images is visually challenging. Dense wheat plants often overlap, and the wind can blur the photographs. Both make it difficult to identify single heads. Additionally, appearances vary due to maturity, color, genotype, and head orientation. Finally, because wheat is grown worldwide, different varieties, planting densities, patterns, and field conditions must be considered. Models developed for wheat phenotyping need to generalize between different growing environments. Current detection methods involve one- and two-stage detectors (YOLOv3 and Faster R-CNN), but even when trained with a large dataset, a bias toward the training region remains.



### Q2. What does the dataset look like?

The dataset contains the following four files and folders:

* `train.csv` - the training data
* `sample_submission.csv` - a sample submission file in the correct format
* `train.zip` - training images
* `test.zip` - test images

> **Note**: Most of the test set images are hidden. A small subset of test images has been included for your use in writing code.



### Q3. What are the columns in `train.csv`?

* `image_id` - the unique image ID
* `width` - the width of the image
* `height` - the height of the image
* `bbox` - a bounding box, formatted as a Python-style list of [xmin, ymin, width, height]
* `source` - the source of the data
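
Because `bbox` is stored as a string, it has to be parsed before any arithmetic. Below is a minimal sketch, assuming the values look like `"[834.0, 222.0, 56.0, 36.0]"`. The helper names `parse_bbox` and `to_corners` are illustrative and not part of the dataset.

```python
from ast import literal_eval


def parse_bbox(bbox_str):
    """Parse a '[xmin, ymin, width, height]' string into a tuple of floats."""
    xmin, ymin, width, height = literal_eval(bbox_str)
    return float(xmin), float(ymin), float(width), float(height)


def to_corners(bbox):
    """Convert (xmin, ymin, width, height) to (xmin, ymin, xmax, ymax)."""
    xmin, ymin, width, height = bbox
    return xmin, ymin, xmin + width, ymin + height


box = parse_bbox("[834.0, 222.0, 56.0, 36.0]")  # made-up example value
corners = to_corners(box)
area = box[2] * box[3]  # box area in pixels, a rough proxy for head size
```

On the full table, `train['bbox'].apply(parse_bbox)` would apply the same parsing row by row.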

### Q4. What am I predicting?

You are attempting to predict bounding boxes around each wheat head in images that have them. If there are no wheat heads, you must predict no bounding boxes.
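
As a sketch of what a prediction could look like in practice: to my understanding, the submission file has an `image_id` column and a `PredictionString` column, with each box written as `confidence xmin ymin width height` separated by spaces. Treat that layout as an assumption and check it against the output of `submission.head()`. The helper name `format_prediction_string` is hypothetical.

```python
def format_prediction_string(boxes, scores):
    """Join boxes and scores into one space-separated string.

    boxes: list of (xmin, ymin, width, height); scores: list of confidences.
    An empty list gives an empty string, i.e. no predicted boxes.
    Assumed layout per box: "confidence xmin ymin width height".
    """
    parts = []
    for (xmin, ymin, width, height), score in zip(boxes, scores):
        parts.append(f"{score:.4f} {xmin:.0f} {ymin:.0f} {width:.0f} {height:.0f}")
    return " ".join(parts)


one_box = format_prediction_string([(10, 20, 50, 60)], [0.9])
no_box = format_prediction_string([], [])  # image without wheat heads
```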

### Q5. How was the dataset prepared?

The [Global Wheat Head Dataset](http://www.global-wheat.com/2020-challenge/) is led by nine research institutes from seven countries: the University of Tokyo, Institut national de recherche pour l’agriculture, l’alimentation et l’environnement, Arvalis, ETHZ, University of Saskatchewan, University of Queensland, Nanjing Agricultural University, and Rothamsted Research. These institutions are joined by many in their pursuit of accurate wheat head detection, including the Global Institute for Food Security, DigitAg, Kubota, and Hiphen.



### Q6. What is mAP (the metric used for evaluation)?

This competition is evaluated on the **mean average precision** at different intersection over union (IoU) thresholds.

**mAP (mean average precision)** is the mean of the average precision (AP) values. In some contexts, AP is computed for each class and then averaged over the classes to give mAP. In other contexts, the two terms mean the same thing: under the COCO context, for example, there is no difference between AP and mAP.


![](https://i.stack.imgur.com/JlHnn.jpg)

> Important note: if there are no ground truth objects at all for a given image, ANY number of predictions (false positives) will result in the image receiving a score of zero, and being included in the mean average precision.
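
To make the metric concrete, here is a minimal sketch of IoU for boxes in the `[xmin, ymin, width, height]` format, plus a per-image score. It assumes the scheme I understand the competition's evaluation page to describe: IoU thresholds from 0.5 to 0.75 in steps of 0.05, precision computed as TP / (TP + FP + FN) at each threshold, and predictions processed in descending order of confidence. The function names are illustrative, and this is not the official scoring code.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, width, height)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    inter_w = min(ax1 + aw, bx1 + bw) - max(ax1, bx1)
    inter_h = min(ay1 + ah, by1 + bh) - max(ay1, by1)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union


def precision_at(threshold, truths, preds):
    """TP / (TP + FP + FN) for one image; preds must be sorted by confidence."""
    unmatched = list(truths)
    tp = 0
    for pred in preds:
        # pick the best still-unmatched ground-truth box for this prediction
        best = max(unmatched, key=lambda t: iou(t, pred), default=None)
        if best is not None and iou(best, pred) > threshold:
            unmatched.remove(best)
            tp += 1
    fp = len(preds) - tp
    fn = len(unmatched)
    total = tp + fp + fn
    # no truths and no predictions: returned as 0.0 here, an arbitrary choice
    return tp / total if total else 0.0


def image_score(truths, preds, thresholds=(0.5, 0.55, 0.6, 0.65, 0.7, 0.75)):
    """Average the per-threshold precision for one image."""
    return sum(precision_at(t, truths, preds) for t in thresholds) / len(thresholds)


truths = [(0, 0, 10, 10), (20, 20, 10, 10)]
preds = [(0, 0, 10, 10), (50, 50, 10, 10)]  # one exact hit, one miss
score = image_score(truths, preds)  # TP=1, FP=1, FN=1 at every threshold
```

With an empty `truths` list and at least one prediction, `image_score` returns zero, which matches the note above. The final leaderboard value would then be the mean of the per-image scores.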




Please visit the following links to learn more about mAP:
* https://www.kaggle.com/c/global-wheat-detection/overview/evaluation
* https://kharshit.github.io/blog/2019/09/20/evaluation-metrics-for-object-detection-and-segmentation
* https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52
* https://datascience.stackexchange.com/questions/25119/how-to-calculate-map-for-detection-task-for-the-pascal-voc-challenge

# 2. Getting Data

## 2.1 Loading Libraries

In [None]:
import os
from ast import literal_eval

import numpy as np  # fundamental package for scientific computing with Python
import pandas as pd  # data structures and data analysis
import matplotlib.pyplot as plt  # for plotting
import seaborn as sns  # for making plots with seaborn
from PIL import Image, ImageDraw
from IPython.display import display  # explicit import, so display() also works outside notebooks

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

color = sns.color_palette()
init_notebook_mode(connected=True)  # a single call is enough to enable inline plotly output

## 2.2 Reading data

In [None]:
BASE_PATH = '../input/global-wheat-detection'
TRAIN_DIR = f'{BASE_PATH}/train'
TEST_DIR = f'{BASE_PATH}/test'

train = pd.read_csv(f'{BASE_PATH}/train.csv')
submission = pd.read_csv(f'{BASE_PATH}/sample_submission.csv')

In [None]:
print('Size of train data', train.shape)
print('Size of submission file', submission.shape)


## 2.3 Peek at Dataset

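* `train.csv` has 147793 rows. Each row is one bounding box, not one image, so the number of images is much smaller (the introduction cites more than 3,000).
* We need to predict bounding boxes around each wheat head in images that have them.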

## 2.4 Table overview

### train data

In [None]:
# display head of train data
display(train.head())

### Number of unique images in the train dataset

In [None]:
print(f"Number of unique images in train data is {train['image_id'].nunique()}")
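
Since `train.csv` has one row per bounding box, grouping by `image_id` gives the number of annotated heads per image. Here is a small sketch on a toy frame with the same column names (the values are made up for illustration). On the real data you would use `train` in place of `toy`.

```python
import pandas as pd

# toy stand-in for train.csv: one row per bounding box
toy = pd.DataFrame({
    "image_id": ["a", "a", "a", "b", "c", "c"],
    "source": ["ethz_1", "ethz_1", "ethz_1", "arvalis_1", "arvalis_1", "arvalis_1"],
})

# number of rows (boxes) for each image
boxes_per_image = toy.groupby("image_id").size()

# count, mean, std, min, quartiles and max of boxes per image
summary = boxes_per_image.describe()
```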

In [None]:
# summary statistics for the numeric columns
display(train.describe())

### submission file

In [None]:
display(submission.head())

## 2.5 Check for missing values




In [None]:
# checking missing data
total = train.isnull().sum().sort_values(ascending = False)
percent = (train.isnull().sum()/train.isnull().count()*100).sort_values(ascending = False)
missing_train_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_train_data.head()
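
A null check on `train.csv` cannot reveal images that have no rows at all, which is possible because an image without wheat heads has no bounding boxes to list. One way to look for them is to compare the `.jpg` files in the training folder with the annotated IDs. This is only a sketch: the helper name is illustrative, and whether such images exist in this dataset is something the check itself would show.

```python
import os


def images_without_boxes(image_dir, annotated_ids):
    """Return sorted IDs of .jpg files in image_dir that never appear in annotated_ids."""
    on_disk = {
        os.path.splitext(name)[0]
        for name in os.listdir(image_dir)
        if name.lower().endswith(".jpg")
    }
    return sorted(on_disk - set(annotated_ids))
```

In this notebook the call would be `images_without_boxes(TRAIN_DIR, train['image_id'])`.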

# 3. Exploratory Data Analysis (EDA)

## 3.1 Checking the data `source` distribution


In [None]:
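def plot_count(df, feature, title='', size=2):
    """Bar plot of the value counts of `feature`, labelled with percentages."""
    fig, ax = plt.subplots(1, 1, figsize=(4*size, 3*size))
    total = float(len(df))
    # pass the column by keyword: newer seaborn versions treat the first
    # positional argument as `data`, not as the variable to count
    sns.countplot(x=feature, data=df, order=df[feature].value_counts().index,
                  palette='Set2', ax=ax)
    ax.set_title(title)
    for p in ax.patches:
        height = p.get_height()
        # write the percentage just above each bar
        ax.text(p.get_x() + p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(100*height/total),
                ha="center")
    plt.show()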

In [None]:
plot_count(df=train, feature='source', title='Data source count and percentage', size=3)

**Inference**

* `ethz_1` and `arvalis_1` are the two major data sources, contributing around 65% of all rows (bounding boxes) in the train data.
* The dataset is not balanced across sources.
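
Given this imbalance, and given that the test images come from different regions than the training images, one option is to validate on a source the model never saw during training. Below is a minimal sketch, assuming a mapping from image ID to source has already been built (for example from `train[['image_id', 'source']].drop_duplicates()`). The function name and the toy image IDs are hypothetical.

```python
from collections import defaultdict


def leave_one_source_out(image_sources):
    """Yield (held_out_source, train_ids, valid_ids) once per source.

    image_sources: dict mapping image_id -> source.
    """
    by_source = defaultdict(list)
    for image_id, source in image_sources.items():
        by_source[source].append(image_id)
    for held_out in sorted(by_source):
        # validation set: every image from the held-out source
        valid_ids = sorted(by_source[held_out])
        # training set: every image from all the other sources
        train_ids = sorted(
            image_id
            for source, ids in by_source.items() if source != held_out
            for image_id in ids
        )
        yield held_out, train_ids, valid_ids


toy = {"img1": "ethz_1", "img2": "ethz_1", "img3": "arvalis_1"}
folds = list(leave_one_source_out(toy))
```

Splitting at the image level, rather than the row level, also keeps all boxes of one image on the same side of the split.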


## 3.2 Visualizing images with bounding boxes

### 3.2.1 train images

In [None]:
def display_images(images):
    """Show up to 15 training images in a 5x3 grid with their bounding boxes drawn."""
    fig, ax = plt.subplots(5, 3, figsize=(18, 22))
    for i, image_id in enumerate(images):
        image_path = os.path.join(TRAIN_DIR, f'{image_id}.jpg')
        image = Image.open(image_path)

        # all rows (one per box) for this image
        rows = train[train['image_id'] == image_id]
        # bboxes are stored as strings in [xmin, ymin, width, height] format
        bboxes = [literal_eval(box) for box in rows['bbox']]
        # draw rectangles on the image; PIL expects [xmin, ymin, xmax, ymax]
        draw = ImageDraw.Draw(image)
        for bbox in bboxes:
            draw.rectangle([bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]], width=3)

        ax[i//3, i%3].imshow(image)
        image.close()
        ax[i//3, i%3].axis('off')

        source = rows['source'].values[0]
        ax[i//3, i%3].set_title(f"image_id: {image_id}\nSource: {source}")

    plt.show()

In [None]:
images = train.sample(n=15, random_state=42)['image_id'].values
display_images(images)

### Let's take a closer look

In [None]:
def display_images_large(images): 
    f, ax = plt.subplots(5,2, figsize=(20, 50))
    for i, image_id in enumerate(images):
        image_path = os.path.join(TRAIN_DIR, f'{image_id}.jpg')
        image = Image.open(image_path)        
        bboxes = [literal_eval(box) for box in train[train['image_id'] == image_id]['bbox']]
        # draw rectangles on image
        draw = ImageDraw.Draw(image)
        for bbox in bboxes:    
            draw.rectangle([bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]], width=3)
            
        ax[i//2, i%2].imshow(image) 
        ax[i//2, i%2].axis('off')
        source = train[train['image_id'] == image_id]['source'].values[0]
        ax[i//2, i%2].set_title(f"image_id: {image_id}\nSource: {source}")

    plt.show() 

In [None]:
images = train.sample(n=10, random_state=42)['image_id'].values
display_images_large(images)

### 3.2.2 Visualizing test images



In [None]:
def display_test_images(images): 
    f, ax = plt.subplots(5,2, figsize=(20, 50))
    for i, image_id in enumerate(images):
        image_path = os.path.join(TEST_DIR, f'{image_id}.jpg')
        image = Image.open(image_path)        
            
        ax[i//2, i%2].imshow(image) 
        ax[i//2, i%2].axis('off')
        ax[i//2, i%2].set_title(f"image_id: {image_id}")

    plt.show()

In [None]:
# test images come without bounding boxes, because predicting them is the task
test_images = submission.image_id.values
display_test_images(test_images)

## References:

* image visualization help taken from https://www.kaggle.com/devvindan/wheat-detection-eda
* http://www.global-wheat.com/2020-challenge/
