# 3. Basic exploratory analysis
### Airbus Ship Detection Challenge - A quick overview for computer vision noobs



Hi, and welcome! This is the third kernel of the series `Airbus Ship Detection Challenge - A quick overview for computer vision noobs.` In this short kernel we will briefly review the data. First, we will count ship/no-ship images and plot the ships-per-image distribution; second, we will show the strong imbalance between the total number of ship and no-ship pixels.



The full series consists of the following notebooks:
1. [Loading and visualizing the images](https://www.kaggle.com/julian3833/1-loading-and-visualizing-the-images)
2. [Understanding and plotting rle bounding boxes](https://www.kaggle.com/julian3833/2-understanding-and-plotting-rle-bounding-boxes) 
3. *[Basic exploratory analysis](https://www.kaggle.com/julian3833/3-basic-exploratory-analysis)*
4. [Exploring public models](https://www.kaggle.com/julian3833/4-exploring-models-shared-by-the-community)
5. [1.0 submission: submitting the test file](https://www.kaggle.com/julian3833/5-1-0-submission-submitting-the-test-file)

This is an ongoing project, so expect more notebooks to be added to the series soon. Actually, we are currently working on the following ones:
* Understanding and exploiting the data leak
* A quick overview of image segmentation domain
* Jumping into Pytorch
* Understanding U-net
* Proposing a simple improvement to U-net model

## 1. 10,000 foot view 

The dataset consists of 3 csvs and 2 image sets: 
* The `train/` and `test/` sets of images, with 104,070 and 88,500 images respectively
   - Refer to the [first kernel](https://www.kaggle.com/julian3833/1-loading-and-visualizing-images) of the series to display these images
* The `sample_submission.csv` is an example submission, showing the format of a solution. It has exactly 88,486 images to process (14 images from `test/` should be excluded from the submission, as stated on the Challenge's [Data](https://www.kaggle.com/c/airbus-ship-detection/data) tab)
* The `train_ship_segmentations.csv` contains the run-length encoded bounding boxes for the ships in the `train/` directory, while the `test_ship_segmentations.csv` contains the bounding boxes for the `test/` directory. 
   - These `dfs` have two columns: `ImageId` and `EncodedPixels`
   - An image with more than one ship will have n rows in these csvs, one for each ship
  
  
 <span style='color:blue'>Why is there a `test_ship_segmentations.csv` at all? Isn't that csv like... the solution? Well, it actually is. There was a data leak in the dataset and the organizers decided to make the segmentations for the `test/` images public. We are currently working on a notebook explaining the situation, and we [shared another one](https://www.kaggle.com/julian3833/5-1-0-submission-submitting-the-test-file) that creates a 1.00 submission from this test set.</span>

In [None]:
ls ../input/

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt

print(f"Images in train/: {len(os.listdir('../input/train/'))}")
print(f"Images in test/ :  {len(os.listdir('../input/test/'))}")

print()

n_submit_images = pd.read_csv("../input/sample_submission.csv").shape[0]
print(f"Images for submission: {n_submit_images}")

pd.read_csv("../input/train_ship_segmentations.csv").head()

## 2. A first glance at the csvs: ship vs. no-ship and total ships distribution

We will define two simple functions, `load_df()` and `show_df()`. The first one loads a csv into pandas and creates the fields `HasShip` and `TotalShips` from `EncodedPixels`. The second one displays the number of images with and without ships and the distribution of total ships per image.

If you don't understand the `EncodedPixels`, you can refer to the [previous kernel](https://www.kaggle.com/julian3833/2-understanding-and-plotting-rle-bounding-boxes) of this series, where we explain the `run-length encoding` in detail. 

As you can see, only 22% of the images have at least one ship, and more than 60% of those have only one ship.
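
To make the aggregation concrete, here is a minimal sketch on a hand-made table. The image names and RLE strings are invented for illustration and do not come from the dataset. It applies the same `notnull()` and `groupby` steps described above, so an image with two rows should end up with `TotalShips == 2` and an image with an empty `EncodedPixels` should end up with `HasShip == False`.

```python
import pandas as pd

# Toy table mimicking the csv layout: one row per ship,
# and a single row with a missing EncodedPixels for an image without ships.
# (Image names and RLE strings are made up for this example.)
toy = pd.DataFrame({
    "ImageId": ["a.jpg", "b.jpg", "b.jpg", "c.jpg"],
    "EncodedPixels": [None, "1 3", "10 2", "5 4"],
})

# A row describes a ship only when EncodedPixels is present
toy["HasShip"] = toy["EncodedPixels"].notnull()

# One row per image: whether it has a ship, and how many ships it has
per_image = toy.groupby("ImageId").agg({"HasShip": ["first", "sum"]})
per_image.columns = ["HasShip", "TotalShips"]

print(per_image)
```

Here `b.jpg` appears twice in the toy table, once per ship, which is why summing the boolean `HasShip` column per image gives the ship count.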

In [None]:
def load_df(file="train"):
    """
    Loads a csv, creates the fields `HasShip` and `TotalShips` dropping `EncodedPixels` and setting `ImageId` as index.
    """
    df = pd.read_csv(f"../input/{file}_ship_segmentations.csv")
    df['HasShip'] = df['EncodedPixels'].notnull()
    df = df.groupby("ImageId").agg({'HasShip': ['first', 'sum']}) # counts amount of ships per image, sets ImageId to index
    df.columns = ['HasShip', 'TotalShips']
    return df

def show_df(df):
    """
    Prints and displays the ship/no-ship ratio and the ship count distribution of df
    """
    total = len(df)
    ship = df['HasShip'].sum()
    no_ship = total - ship
    total_ships = int(df['TotalShips'].sum())
        
    print(f"Images: {total} \nShips:  {total_ships}")
    print(f"Images with ships:    {round(ship/total,2)} ({ship})")
    print(f"Images with no ships: {round(no_ship/total,2)} ({no_ship})")
    
    _, axes = plt.subplots(nrows=1, ncols=2, figsize=(30, 8), gridspec_kw = {'width_ratios':[1, 3]})
    
    # Plot ship/no-ship with a bar plot
    ship_ratio = df['HasShip'].value_counts() / total
    ship_ratio = ship_ratio.rename(index={True: 'Ship', False: 'No Ship'})
    ship_ratio.plot.bar(ax=axes[0], color=['red', 'lime'], rot=0, title="Ship/No-ship distribution");
    
    # Plot TotalShips distribution with a bar plot
    total_ships_distribution = df.loc[df['HasShip'], 'TotalShips'].value_counts().sort_index() / ship
    total_ships_distribution.plot(kind='bar', ax=axes[1], rot=0, title="Total ships distribution");

In [None]:
df_train = load_df("train")
df_test = load_df("test")
show_df(pd.concat([df_train, df_test]))  # DataFrame.append was removed in pandas 2.0

The image-level class imbalance gets worse for the `test set` and, as we will see in the next section, it gets even worse when we consider `pixels` instead of `images`.

In [None]:
show_df(df_test)

## 3. Counting pixels: verifying the class imbalance between ship and no-ship pixels

The Challenge of detecting the ships in the images can be thought of as a `classification problem` for pixels, where, for each image, we need to classify 768 $\times$ 768 pixels into one of two classes: `ship` and `no-ship`. This is actually the common approach to the `image segmentation` problem, as we will discuss in further notebooks.

In this notebook we will just present the imbalance of the classes at a `pixel-level` granularity, that is, we will check *how many pixels in the dataset correspond to ships and how many to other stuff (no-ships)*.

A few notes before diving into the code:
* The `total_pixels` is $ 768 \times 768 \times \text{n_imgs} $
* The total amount of `ship_pixels` is encoded in the `EncodedPixels`: it is the sum of the values at the even positions of those strings (the run lengths). 
   - Since we have defined a `rle_to_pixels` function [before](https://www.kaggle.com/julian3833/2-understanding-and-plotting-rle-bounding-boxes), we will just use it and count the amount of pixels after that transformation
* The total amount of `no_ship_pixels` is `total_pixels - ship_pixels`
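
Since the run lengths sit at the even positions of the string (2nd, 4th, ...), the ship pixels can also be counted without expanding the runs. The sketch below uses a made-up RLE string and two hypothetical helpers, `count_rle_pixels` and `expand_rle`, which are not part of the original notebooks; it only checks that both ways of counting agree.

```python
def count_rle_pixels(rle_code):
    """Counts the pixels of an RLE string by summing its run lengths."""
    values = [int(v) for v in rle_code.split()]
    # The string alternates start, length, start, length, ...
    return sum(values[1::2])

def expand_rle(rle_code):
    """Expands an RLE string into the flat list of pixel positions it covers."""
    values = [int(v) for v in rle_code.split()]
    return [position
            for start, length in zip(values[0::2], values[1::2])
            for position in range(start, start + length)]

# Made-up example: a run of 3 pixels starting at 5 and a run of 2 starting at 20
example = "5 3 20 2"
print(count_rle_pixels(example))
print(expand_rle(example))
```

Summing the lengths avoids building a list with one entry per pixel, which may matter when the count is run over every row of the csvs.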


In [None]:
# This function transforms EncodedPixels into a list of pixels
# Check our previous notebook for a detailed explanation:
# https://www.kaggle.com/julian3833/2-understanding-and-plotting-rle-bounding-boxes
def rle_to_pixels(rle_code):
    rle_code = [int(i) for i in rle_code.split()]
    # Even indices hold the run starts, odd indices hold the run lengths
    pixels = [(pixel_position % 768, pixel_position // 768) 
                 for start, length in zip(rle_code[0::2], rle_code[1::2]) 
                 for pixel_position in range(start, start + length)]
    return pixels

def show_pixels_distribution(df):
    """
    Prints the amount of ship and no-ship pixels in the df
    """
    # Total images in the df
    n_images = df['ImageId'].nunique() 
    
    # Total pixels in the df
    total_pixels = n_images * 768 * 768 

    # Keep only rows with RLE boxes, transform them into list of pixels, sum the lengths of those lists
    ship_pixels = df['EncodedPixels'].dropna().apply(rle_to_pixels).str.len().sum() 

    ratio = ship_pixels / total_pixels
    print(f"Ship: {round(ratio, 3)} ({ship_pixels})")
    print(f"No ship: {round(1 - ratio, 3)} ({total_pixels - ship_pixels})")

In [None]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([pd.read_csv("../input/train_ship_segmentations.csv"),
                pd.read_csv("../input/test_ship_segmentations.csv")])
show_pixels_distribution(df)

As you can see above, only about 1‰ of the pixels are `ships`, while 99.9% of the pixels are `no-ships`. 

And, as you can see below, after dropping all the images with no ships the class imbalance is reduced, but it is still very high: 5‰, that is, 0.5% of the pixels are `ships` while 99.5% are `no-ships`.
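
The two figures are consistent with each other. Dropping the no-ship images leaves the number of ship pixels untouched and only shrinks the total pixel count, so the new ratio should be roughly the old ratio divided by the fraction of images that contain ships. The quick check below plugs in the rounded figures quoted in this notebook (1‰ and 22%), so treat the result as an approximation rather than an exact value.

```python
# Rounded figures quoted in the text, not recomputed from the data
ship_pixel_ratio_all = 0.001    # ship pixels / all pixels, every image included
images_with_ships = 0.22        # fraction of images with at least one ship

# Ship pixels stay the same; total pixels shrink to 22% of the original amount
ship_pixel_ratio_ships_only = ship_pixel_ratio_all / images_with_ships

print(f"Ship pixel ratio, images with ships only: {ship_pixel_ratio_ships_only:.4f}")
```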

As we will analyse in detail in the [following notebook](https://www.kaggle.com/julian3833/4-exploring-public-models) of the series, this extreme class imbalance drives design decisions in the public models (in particular, stacking a `ship/no-ship image classifier` for the general problem with a `ship/no-ship image segmentation` model applied only to the 22% of the images with ships).

In [None]:
show_pixels_distribution(df.dropna())

### References
* [Airbus EDA](https://www.kaggle.com/ezietsman/airbus-eda) - a more advanced and very nice exploratory data analysis kernel
* [Fine tuning resnet34 on ship detection](https://www.kaggle.com/iafoss/fine-tuning-resnet34-on-ship-detection) - the kernel where we first read about the pixel class imbalance as a strong problem. We will refer to this *awesome kernel* several times in this notebook series.
* [Class imbalance problem](http://www.chioka.in/class-imbalance-problem/) - a blog post to recap about class imbalance


### What's next?
You can check the [next kernel](https://www.kaggle.com/julian3833/4-exploring-public-models) of the series, where we explore the hottest available public models and present the main ideas and approaches behind them.

