**Introduction**

This notebook is forked from [Siddhartha](https://www.kaggle.com/meaninglesslives/airbus-ship-detection-data-visualization)'s Data Visualization notebook and combined with [Hugo Lapointe](https://www.kaggle.com/hugolapointe/airbus-ship-detection-challenge)'s notebook, which provides nice object-oriented classes. It also includes some code from other great kernels.

**Reading CSV information from the train dataset**

Some images contain no ships, some contain exactly one, and others contain several. The CSV holds one run-length encoded (RLE) record per ship, so an image with several ships appears in multiple rows, while images without ships have a missing `EncodedPixels` value.
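
To make the one-row-per-ship layout concrete, here is a small sketch on a hypothetical toy table (the image names and RLE strings are made up, not taken from the dataset). It counts ships per image, relying on the fact that `count()` skips missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table in the same layout as the CSV: one row per ship,
# and a missing EncodedPixels value for an image without ships.
toy = pd.DataFrame({
    "ImageId": ["a.jpg", "b.jpg", "b.jpg", "c.jpg", "c.jpg", "c.jpg", "d.jpg"],
    "EncodedPixels": [np.nan, "1 3", "10 2", "5 1", "20 4", "30 2", np.nan],
})

# count() ignores missing values, so images without ships get a count of 0.
ships_per_image = toy.groupby("ImageId")["EncodedPixels"].count()
print(ships_per_image.to_dict())
```

Grouping by `ImageId` this way gives one number per image, which is usually the unit of interest when looking at class balance.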

In [None]:
import os
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

INPUT_PATH = '../input'
DATA_PATH = INPUT_PATH
TRAIN_PATH = os.path.join(DATA_PATH, "train_v2")
TEST_PATH = os.path.join(DATA_PATH, "test_v2")
TRAIN_MASKS_PATH = os.path.join(DATA_PATH, "train/masks")

print("Reading CSV file")
df = pd.read_csv(os.path.join(DATA_PATH, 'train_ship_segmentations_v2.csv'))
df.head()

**Analysing images with no ships**

There are 150,000 images without any ship on them. Since the sea is mostly empty, this is representative of the normal situation.

In [None]:
#df = df.reset_index()
withoutships = df[df.EncodedPixels.isna()]
print(withoutships.describe())

**Displaying some examples of images with no ships** 

Let's plot some random images from the training set. The images are 768 x 768 pixels each.

In [None]:
# Helper class to read images
import sys
import random
import cv2
import pdb

from tqdm import tqdm
from PIL import Image
from glob import glob
from tqdm import tqdm_notebook, tnrange

import matplotlib.pyplot as plt
%matplotlib inline

class SatImage(object):       
    def __init__(self):
        self.title = ""
        self.image_id = ""
        self.image = None
        self.width = 0
        self.height = 0
        self.shape = (0, 0)
        self.mask = None
    
    def load_image(self, image_id, image_type):
        '''
        image_id: name of the image to read (including extension)
        image_type: type of image (from Train, Mask, Test)
        Returns SatImage instance
        '''
        matrix = self._get_image_data(image_id, image_type)
        instance = SatImage()
        instance.image_id = image_id
        instance.image = matrix
        instance.height = matrix.shape[0]
        instance.width = matrix.shape[1]
        instance.shape = (instance.height, instance.width)
        return instance

    def _get_filename(self, image_id, image_type):
        check_dir = False
        if "Train" == image_type:
            data_path = TRAIN_PATH
        elif "Mask" in image_type:
            data_path = TRAIN_MASKS_PATH
        elif "Test" in image_type:
            data_path = TEST_PATH
        else:
            raise Exception("Image type '%s' is not recognized" % image_type)

        if check_dir and not os.path.exists(data_path):
            os.makedirs(data_path)

        return os.path.join(data_path, "{}".format(image_id))

    def _get_image_data_opencv(self, image_id, image_type):
        fname = self._get_filename(image_id, image_type)
        img = cv2.imread(fname)
        assert img is not None, "Failed to read image : %s, %s" % (image_id, image_type)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        return img

    def _get_image_data(self, image_id, image_type):
        img = self._get_image_data_opencv(image_id, image_type)
        img = img.astype('uint8')
        return img
    
    def set_title(self, title):
        self.title = title
    
    def add_objects(self, rle_masks):   
        r_mask = np.zeros(self.shape, dtype = np.uint8)
        g_mask = np.zeros(self.shape, dtype = np.uint8)
        b_mask = np.zeros(self.shape, dtype = np.uint8)
        
        for rle_mask in rle_masks:
            if isinstance(rle_mask, str):
                r_mask += 255 * RLE.decode(rle_mask, self.shape)
        
        self.mask = np.dstack((r_mask, g_mask, b_mask)) 
        #np.expand_dims(_mask, -1)

    # https://github.com/ternaus/TernausNet/blob/master/Example.ipynb
    def _mask_overlay(self, image, mask):
        """
        Helper function to visualize mask
        """
        #mask = mask.astype(np.uint8)
        weighted_sum = cv2.addWeighted(mask, 0.75, image, 0.5, 0.)
        #img = image.copy()
        #ind = mask[:, :, 1] > 0    
        #img[ind] = weighted_sum[ind]    
        #return img
        return weighted_sum

    def show(self, with_mask = False):
        plt.axis("off")
        plt.title(self.title)
        
        if with_mask:
            rle_masks = df.loc[df['ImageId'] == self.image_id,'EncodedPixels'].tolist()
            self.add_objects(rle_masks)
            copy = self._mask_overlay(self.image, self.mask)
            plt.imshow(copy)
            
        else:
            plt.imshow(self.image)


In [None]:
def display_SatImages(names, cols = 4, figsize = (5, 5)):
    w, h = figsize
    rows = (len(names) + cols - 1) // cols  # ceiling division, avoids an empty extra row
    plt.figure(figsize = (cols * w, rows * h))

    i = 1    
    for name in names:
        plt.subplot(rows, cols, i)
        satimage = SatImage().load_image(name, "Train")
        satimage.set_title(name)
        satimage.show(True)
        i += 1

    plt.tight_layout()
    plt.show()

sample = df[df.EncodedPixels.isna()].sample(n=16)
imgs = sample['ImageId'].tolist()
display_SatImages(imgs, 4, (5, 5))

**Analysing images with ships**

There are 42 556 images with ships on them. There is a total of 81 723 ships on all images. Some images have more than one ship on them.

In [None]:
# .copy() so that adding columns later does not write into a view of df
withships = df[df.EncodedPixels.notna()].copy()
print(withships.describe())

**Analysing ship frequency distribution**

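The `ship_count` column averages roughly 4, but this average is taken over ship records, so a crowded image is counted once per ship. Dividing the totals above (81 723 ships over 42 556 images) gives about 1.9 ships per image. The distribution is heavily skewed: more than 25% of images have only 1 ship, and the count goes up to 15 ships per image.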

In [None]:
withships['ship_count'] = withships.groupby('ImageId')['ImageId'].transform('count')
print(withships['ship_count'].describe())
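
Because `ship_count` is attached to every ship record, statistics on that column weight each image by its number of ships. The following sketch, on hypothetical toy data, contrasts the per-record mean with the per-image mean to show how far the two can drift apart.

```python
import pandas as pd

# Hypothetical toy data: one image with 1 ship, one image with 3 ships.
toy = pd.DataFrame({
    "ImageId": ["a.jpg", "b.jpg", "b.jpg", "b.jpg"],
    "EncodedPixels": ["1 2", "3 1", "7 2", "12 1"],
})

# Same construction as above: every row carries its image's ship count.
toy["ship_count"] = toy.groupby("ImageId")["ImageId"].transform("count")

# Mean over rows: the 3-ship image contributes three times.
per_record_mean = toy["ship_count"].mean()

# Mean over images: each image contributes once.
per_image_mean = toy.groupby("ImageId").size().mean()

print(per_record_mean, per_image_mean)
```

When the goal is to describe images rather than ships, aggregating to one row per image first (for example with `groupby("ImageId").size()`) is the safer choice.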

In [None]:
import seaborn as sns

sns.set_style("white")
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
#sns.distplot(withships['ship_count'],kde=False)

plt.figure(figsize = (23, 8))
ax = sns.countplot(data = withships, x = "ship_count")
    
for p in ax.patches:
    x = p.get_bbox().get_points()[:,0]
    y = p.get_bbox().get_points()[1,1]
    ax.annotate("{:.2f}%".format(y / len(withships) * 100), (x.mean(), y), ha = "center", va = "bottom")
    
plt.title("Ships Frequency Distribution")
plt.ylabel("Frequency")
plt.xlabel("Ships count")
plt.show()
    

**Displaying some examples of images with ships** 

Let's plot some random images from the training set with the mask overlaid in red on top of them.

In [None]:
# Helper class to decode and encode RLE
# ref: https://www.kaggle.com/paulorzp/run-length-encode-and-decode
class RLE(object):
    @staticmethod
    def encode(image):
        '''
        image: numpy array, 1 - mask, 0 - background
        Returns run-length as string formatted (start length)
        '''
        # RLE runs down the columns first, hence the transpose
        pixels = image.T.flatten()
        pixels = np.concatenate([[0], pixels, [0]])
        runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
        runs[1::2] -= runs[::2]
        return ' '.join(str(x) for x in runs)

    @staticmethod
    def decode(encoding, shape):
        '''
        encoding: run-length as string formatted (start length)
        shape: (height,width) of array to return 
        Returns numpy array, 1 - mask, 0 - background
        '''
        s = encoding.split()
        starts, lengths = [np.asarray(x, dtype = int) for x in (s[0::2], s[1::2])]
        starts -= 1  # RLE positions are 1-indexed
        ends = starts + lengths
        image = np.zeros(shape[0] * shape[1], dtype = np.uint8)
        for lo, hi in zip(starts, ends):
            image[lo:hi] = 1
        # Reshape as (width, height) then transpose to align with the RLE
        # direction; identical to reshape(shape).T for square images
        return image.reshape((shape[1], shape[0])).T
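
As a worked example of the encoding, the sketch below decodes a hand-written RLE string on a 3 x 3 grid and encodes it back. The standalone functions `rle_decode` and `rle_encode` are hypothetical names used for illustration; they assume the same convention as the class above (1-indexed positions, pixels numbered down each column first).

```python
import numpy as np

def rle_decode(encoding, shape):
    """Decode 'start length start length ...' into a (height, width) mask."""
    height, width = shape
    tokens = np.asarray(encoding.split(), dtype=int)
    starts, lengths = tokens[0::2] - 1, tokens[1::2]  # positions are 1-indexed
    flat = np.zeros(height * width, dtype=np.uint8)
    for start, length in zip(starts, lengths):
        flat[start:start + length] = 1
    # Pixels are numbered column by column, so fill as (width, height) and transpose.
    return flat.reshape((width, height)).T

def rle_encode(mask):
    """Encode a binary (height, width) mask with the same convention."""
    pixels = np.concatenate([[0], mask.T.flatten(), [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]  # turn run ends into run lengths
    return " ".join(str(x) for x in runs)

# "2 2 8 1": two pixels starting at position 2, one pixel at position 8.
mask = rle_decode("2 2 8 1", shape=(3, 3))
print(mask)
print(rle_encode(mask))
```

Positions 2 and 3 fall in the first column (rows 1 and 2), and position 8 falls in the third column (row 1), which is why the decoded mask looks "rotated" compared with a row-by-row reading.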

In [None]:
sample = df[df.EncodedPixels.notna()].sample(n=16)
imgs = sample['ImageId'].tolist()
display_SatImages(imgs, 4, (5, 5))
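
The red overlay comes from a weighted blend of the image with a mask stacked into the red channel. The following is a NumPy-only sketch of that blend, assuming the same 0.75 and 0.5 weights passed to `cv2.addWeighted` in `_mask_overlay`; the function name `blend_overlay` and the 2 x 2 toy data are hypothetical, and rounding details may differ slightly from OpenCV.

```python
import numpy as np

def blend_overlay(image, mask, mask_weight=0.75, image_weight=0.5):
    """Weighted sum of mask and image, saturated to the uint8 range."""
    mixed = mask_weight * mask.astype(np.float64) + image_weight * image.astype(np.float64)
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)

# Toy 2 x 2 gray image and a binary ship mask covering the top-left pixel.
image = np.full((2, 2, 3), 100, dtype=np.uint8)
ship = np.array([[1, 0], [0, 0]], dtype=np.uint8)

# Put the mask in the red channel only, as add_objects does.
mask = np.dstack((255 * ship, np.zeros_like(ship), np.zeros_like(ship)))

overlay = blend_overlay(image, mask)
print(overlay[0, 0], overlay[1, 1])
```

Note that the image weight of 0.5 darkens every pixel, including the background, which is why the overlaid pictures look dimmer than the originals.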

**Trying to optimize the imagery**

Let's try to do some pre-processing on the images.

In [None]:
from skimage.filters import gaussian,laplace
from skimage.feature import canny
from skimage.filters import scharr
from skimage import exposure
from skimage.color.adapt_rgb import adapt_rgb, each_channel, hsv_value

In [None]:
@adapt_rgb(hsv_value)
def canny_hsv(image):
    return canny(image)

@adapt_rgb(hsv_value)
def scharr_hsv(image):
    return scharr(image)

In [None]:
# simple features that can be easily extracted and used for training deep networks
# these features may be used along with original image

# initializing random for reproducible results
np.random.seed(13)

train_ids = df.ImageId.values
_train_ids = list(train_ids)

plt.figure(figsize=(30,15))
plt.subplots_adjust(bottom=0.2, top=1.2)  #adjust this to change vertical and horiz. spacings..

def displayImg(row, col, index, title, img):
    plt.subplot(row, col, index)
    plt.imshow(img, cmap='binary')
    plt.title(title)
    plt.axis('off')
    
nProc = 7 # nb of processings    
nImg = 5  #nb of images to process
for i in range(nImg):
    image_id = _train_ids[np.random.randint(0, len(_train_ids))]
    #ax = fig.add_subplot(3,3,i+1)
    img = SatImage().load_image(image_id, 'Train').image
    
    # Original image
    displayImg(nImg, nProc, i*nProc+1, 'Original', img)

    # Smoothing
    displayImg(nImg, nProc, i*nProc+2, 'Smoothed', gaussian(img))
    
    # Contrast stretching
    p2, p98 = np.percentile(img, (2, 98))
    img_rescale = exposure.rescale_intensity(img, in_range=(p2, p98))
    displayImg(nImg, nProc, i*nProc+3, 'Stretched', img_rescale)
    
    # Equalization
    displayImg(nImg, nProc, i*nProc+4, 'Equalization', exposure.equalize_hist(img))

    # Adaptive Equalization
    displayImg(nImg, nProc, i*nProc+5, 'Adaptive', exposure.equalize_adapthist(img))
    
    # Scharr Edge Magnitude
    displayImg(nImg, nProc, i*nProc+6, 'Scharr Edge Magnitude', scharr_hsv(img))

    # Canny features
    displayImg(nImg, nProc, i*nProc+7, 'Canny features', canny_hsv(img))

plt.show()
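
Contrast stretching, as used in the loop above, clips intensities to the 2nd and 98th percentiles and rescales them linearly to the full range. Here is a NumPy-only sketch of that idea on synthetic data; it is not the scikit-image implementation, and the function name `stretch_contrast` is hypothetical.

```python
import numpy as np

def stretch_contrast(img, low_pct=2, high_pct=98):
    """Clip to the given percentiles, then map linearly onto 0..255."""
    lo, hi = np.percentile(img, (low_pct, high_pct))
    if hi <= lo:
        # Flat image: nothing to stretch.
        return img.copy()
    scaled = (np.clip(img.astype(np.float64), lo, hi) - lo) / (hi - lo)
    return np.rint(scaled * 255).astype(np.uint8)

# Synthetic low-contrast image: all values squeezed between 100 and 139.
rng = np.random.default_rng(13)
low_contrast = rng.integers(100, 140, size=(8, 8, 3), dtype=np.uint8)

stretched = stretch_contrast(low_contrast)
print(low_contrast.min(), low_contrast.max(), "->", stretched.min(), stretched.max())
```

Using percentiles rather than the raw minimum and maximum keeps a handful of very bright or very dark pixels, such as sun glint on water, from dominating the rescaling.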