<img src="https://cdn11.img.sputnik.by/images/102461/23/1024612300.jpg" width="400" height="400">

In [None]:
import pandas as pd
import numpy as np
import os
from PIL import Image, ImageDraw
from ast import literal_eval
import matplotlib.pyplot as plt

In [None]:
!ls ../input/global-wheat-detection

## Loading the data

In [None]:
root_path = "../input/global-wheat-detection/"
train_folder = os.path.join(root_path, "train")
test_folder = os.path.join(root_path, "test")
train_csv_path = os.path.join(root_path, "train.csv")
sample_submission = os.path.join(root_path, "sample_submission.csv")

In [None]:
df = pd.read_csv(train_csv_path)

In [None]:
df.head()

In [None]:
df.shape[0]

## Some basic statistics

All of the annotated images have resolution 1024 x 1024

In [None]:
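# every row must report a 1024 x 1024 image; .all() avoids an ambiguous chained array comparison
(df['width'] == 1024).all() and (df['height'] == 1024).all()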

In [None]:
def get_bbox_area(bbox):
    bbox = literal_eval(bbox)
    return bbox[2] * bbox[3]

In [None]:
df['bbox_area'] = df['bbox'].apply(get_bbox_area)
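
Since `get_bbox_area` calls `literal_eval` on every row, any further box statistic would have to parse the strings again. One possible alternative, shown here as a minimal sketch on a made-up toy frame, is to parse once and store the four numbers as columns. The column names `xmin`, `ymin`, `box_width` and `box_height` are assumptions chosen to avoid clashing with the existing `width` / `height` image columns.

```python
from ast import literal_eval
import pandas as pd

# Toy frame with made-up values; the real `df` comes from train.csv.
toy = pd.DataFrame({
    "image_id": ["a", "a", "b"],
    "bbox": [
        "[10.0, 20.0, 30.0, 40.0]",
        "[0.0, 0.0, 5.0, 5.0]",
        "[100.0, 50.0, 20.0, 10.0]",
    ],
})

# Parse every string once and spread the four numbers into named columns.
parsed = pd.DataFrame(
    toy["bbox"].apply(literal_eval).tolist(),
    columns=["xmin", "ymin", "box_width", "box_height"],
    index=toy.index,
)
toy = toy.join(parsed)

# Area is now a plain vectorized product.
toy["bbox_area"] = toy["box_width"] * toy["box_height"]
print(toy[["image_id", "bbox_area"]])
```
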

In [None]:
# histogram of the areas themselves (value_counts() would plot how often each distinct area occurs)
df['bbox_area'].hist(bins=50)

As the organizers say, there are many bounding boxes per image, and not all images contain wheat heads / bounding boxes.

In [None]:
unique_images = df['image_id'].unique()

In [None]:
num_total = len(os.listdir(train_folder))
num_annotated = len(unique_images)

print(f"There are {num_annotated} annotated images and {num_total - num_annotated} images without annotations.")
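
The count above only says how many images lack annotations, not which ones. A small sketch of how their ids could be listed with a set difference follows. The helper name `find_unannotated` and the example file names are hypothetical, and it assumes every file in the folder is named `<image_id>.jpg`.

```python
import os

def find_unannotated(filenames, annotated_ids):
    """Return sorted image ids that have a file but no row in the annotations."""
    # Strip the extension so file names are comparable to image ids.
    file_ids = {os.path.splitext(name)[0] for name in filenames}
    return sorted(file_ids - set(annotated_ids))

# Made-up file names and ids, for illustration only.
example_files = ["a1.jpg", "b2.jpg", "c3.jpg"]
example_annotated = ["a1", "c3"]
print(find_unannotated(example_files, example_annotated))
```

In this notebook the call would presumably be `find_unannotated(os.listdir(train_folder), unique_images)`.
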

Let's see all the unique sources of data:

In [None]:
sources = df['source'].unique()
print(f"There are {len(sources)} sources of data: {sources}")

In [None]:
df['source'].value_counts()

Let's look at how many bounding boxes we have for each image:

In [None]:
plt.hist(df['image_id'].value_counts(), bins=30)
plt.show()

The maximum number of bounding boxes per image is 116, whereas the minimum among annotated images is 1.
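
The histogram and the min / max above only cover images that appear in the CSV. As a rough sketch, the per-image counts can be extended with zeros for images that have no boxes by reindexing over the full list of ids. The toy data and the `all_ids` list below are made up; in the notebook that list would have to be built from the files in the train folder.

```python
import pandas as pd

# Made-up annotations: image "a" has 3 boxes, "b" has 1, "c" has none.
toy = pd.DataFrame({"image_id": ["a", "a", "a", "b"]})
all_ids = ["a", "b", "c"]

# Count boxes per image, then add a zero for every id missing from the CSV.
boxes_per_image = toy["image_id"].value_counts().reindex(all_ids, fill_value=0)

print(boxes_per_image.min(), boxes_per_image.max(), boxes_per_image.mean())
```
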

## Visualizing images

In [None]:
def show_images(images, num=5):
    # sample without replacement so the same image is not shown twice
    images_to_show = np.random.choice(images, min(num, len(images)), replace=False)

    for image_id in images_to_show:

        image_path = os.path.join(train_folder, image_id + ".jpg")
        image = Image.open(image_path)

        # get all bboxes for given image in [xmin, ymin, width, height]
        bboxes = [literal_eval(box) for box in df[df['image_id'] == image_id]['bbox']]

        # visualize them
        draw = ImageDraw.Draw(image)
        for bbox in bboxes:    
            draw.rectangle([bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]], width=3)

        plt.figure(figsize = (15,15))
        plt.imshow(image)
        plt.show()

In [None]:
show_images(unique_images)

Look at photos by their source:

In [None]:
for source in sources:
    print(f"Showing images for {source}:")
    show_images(df[df['source'] == source]['image_id'].unique(), num = 3)

What can we tell from the visualizations:

* there are plenty of overlapping bounding boxes
* all photos seem to be taken vertically
* plants can be rotated differently; there is no single orientation. This means that flip and rotation augmentations should probably help
* colors of wheat heads are quite different and seem to depend a little bit on the source
* wheat heads themselves are seen from very different viewing angles relative to the observer
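
If flip augmentations are used, the boxes have to be mirrored together with the image. Below is a minimal sketch of that coordinate transform for boxes in the `[xmin, ymin, width, height]` format used in this dataset. The function names `hflip_boxes` and `vflip_boxes` and the example box are hypothetical, not part of any library.

```python
import numpy as np

def hflip_boxes(boxes, image_width):
    """Mirror [xmin, ymin, width, height] boxes left-to-right."""
    boxes = np.asarray(boxes, dtype=float).copy()
    # The new left edge is the mirrored position of the old right edge.
    boxes[:, 0] = image_width - boxes[:, 0] - boxes[:, 2]
    return boxes

def vflip_boxes(boxes, image_height):
    """Mirror [xmin, ymin, width, height] boxes top-to-bottom."""
    boxes = np.asarray(boxes, dtype=float).copy()
    # The new top edge is the mirrored position of the old bottom edge.
    boxes[:, 1] = image_height - boxes[:, 1] - boxes[:, 3]
    return boxes

# Made-up box on a 1024 x 1024 image.
example = [[100, 200, 50, 80]]
print(hflip_boxes(example, 1024))
print(vflip_boxes(example, 1024))
```

The image itself would be flipped with the matching operation, for example `np.fliplr` / `np.flipud` on the pixel array. Widths and heights are unchanged by a flip; only the top-left corner moves.
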