**Shallow clouds play a huge role in determining the Earth's climate. They’re also difficult to understand and to represent in climate models. By classifying different types of cloud organization, researchers at Max Planck hope to improve our physical understanding of these clouds, which in turn will help us build better climate models.**

## A brief history of this competition and its purpose

It all started around two years ago at a workshop where 12 cloud experts came together to discuss shallow clouds over the ocean. These clouds look benign compared to big thunderstorms but, in fact, for the Earth’s climate they play a huge role. The reason is that they reflect a lot of sunlight back into space, thereby cooling our planet, while only contributing marginally to the greenhouse effect. This means that it’s really important to figure out how these clouds will change as our planet warms. Current climate models, however, struggle with that. They do not even agree whether there will be more or less of these shallow clouds.


Part of the reason is that shallow clouds aren’t just the result of the global circulation of the atmosphere. Rather, they have a life of their own and arrange themselves in a variety of patterns. For many of these patterns, the basic mechanisms behind them are poorly understood. This brings us back to our group of scientists. As they were looking through hundreds of satellite images like the ones shown on this page, they noticed that some structures occur more often than others. After some discussion, they agreed on four common patterns and called them Sugar, Flower, Fish and Gravel.

![](https://miro.medium.com/max/1050/1*Wz8Rosw9W0VDorCwcLIxkg.png)

Source: Awesome article by Stephan Rasp https://medium.com/@raspstephan

# Let's dive into the clouds and explore the data

In [None]:
import os
import json
import math
from collections import Counter, defaultdict
from pathlib import Path

import numpy as np # linear algebra
import pandas as pd
import cv2
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from tqdm import tqdm

%matplotlib inline
pd.set_option("display.max_rows", 100)
plt.rcParams["font.size"] = 14

print(os.listdir("../input"))

In [None]:
train_df = pd.read_csv("../input/understanding_cloud_organization/train.csv")
sample_df = pd.read_csv("../input/understanding_cloud_organization/sample_submission.csv")

In [None]:
train_df.head()

In [None]:
print(f'There are {train_df.shape[0]} records in train.csv')

In [None]:
train_df['Image_Label'].apply(lambda x : x.split('_')[1]).value_counts().plot(kind='bar')

We have approximately 5.5k images in the train dataset, and each of them can have up to 4 masks: Fish, Flower, Gravel and Sugar.
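
Since each `Image_Label` value joins a file name and a pattern name with an underscore, it can be convenient to split it into two columns before counting anything. The following is a minimal sketch on a toy frame; the column names `image` and `label`, the file names and the run-length strings are illustrative assumptions, not values from the competition data.

```python
import pandas as pd

# Toy stand-in for train.csv; file names and run-length strings are made up.
toy_df = pd.DataFrame({
    "Image_Label": ["a.jpg_Fish", "a.jpg_Flower", "a.jpg_Gravel", "a.jpg_Sugar",
                    "b.jpg_Fish", "b.jpg_Flower", "b.jpg_Gravel", "b.jpg_Sugar"],
    "EncodedPixels": ["1 3", None, "5 2", None, None, None, None, "2 4"],
})

# Split on the last underscore so a file name containing "_" would stay intact.
toy_df[["image", "label"]] = toy_df["Image_Label"].str.rsplit("_", n=1, expand=True)

# count() ignores missing values, so these are the numbers of non-empty masks.
masks_per_image = toy_df.groupby("image")["EncodedPixels"].count()
masks_per_label = toy_df.groupby("label")["EncodedPixels"].count()
print(masks_per_image)
print(masks_per_label)
```

The same two `groupby` calls should work on `train_df` once the split columns exist, which avoids repeating the `split('_')` lambda in every cell.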

In [None]:
len_train = len(os.listdir("../input/understanding_cloud_organization/train_images"))
len_test = len(os.listdir("../input/understanding_cloud_organization/test_images"))
print(f'There are {len_train} images in train dataset')
print(f'There are {len_test} images in test dataset')

## Total number of empty masks (image-label pairs without a mask)

In [None]:
len(train_df[train_df['EncodedPixels'].isnull()])

## Label-wise breakdown of empty masks

In [None]:
train_df.loc[train_df['EncodedPixels'].isnull(), 'Image_Label'].apply(lambda x: x.split('_')[1]).value_counts().plot(kind="bar")

## Count of labels having mask data

In [None]:
train_df.loc[train_df['EncodedPixels'].notnull(), 'Image_Label'].apply(lambda x: x.split('_')[1]).value_counts()

## Images having multiple masks

In [None]:
train_df.loc[train_df['EncodedPixels'].notnull(), 'Image_Label'].apply(lambda x: x.split('_')[0]).value_counts().value_counts().plot(kind="bar")

We can see that a significant number of images have 2 masks, and a few have all 4.

## Checking image size for train and test

Check the size of each image in the dataset by iterating through all the images (in train and test).

In [None]:
train_size_dict = defaultdict(int)
train_path = Path("../input/understanding_cloud_organization/train_images/")

for img_name in train_path.iterdir():
    img = Image.open(img_name)
    train_size_dict[img.size] += 1

In [None]:
train_size_dict

Now do the same for all images in the test set.

In [None]:
test_size_dict = defaultdict(int)
test_path = Path("../input/understanding_cloud_organization/test_images/")

for img_name in test_path.iterdir():
    img = Image.open(img_name)
    test_size_dict[img.size] += 1

In [None]:
test_size_dict

## All images in the train and test sets are 2100 x 1400 pixels

# Visualizing the masks

In [None]:
palet = [(249, 192, 12), (0, 185, 241), (114, 0, 218), (249,50,12)]

In [None]:
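def name_and_mask(start_idx):
    # Each image occupies 4 consecutive rows in train_df, one per label.
    col = start_idx
    img_names = [str(i).split("_")[0] for i in train_df.iloc[col:col+4, 0].values]
    if not (img_names[0] == img_names[1] == img_names[2] == img_names[3]):
        raise ValueError("Rows do not belong to the same image")

    labels = train_df.iloc[col:col+4, 1]
    mask = np.zeros((1400, 2100, 4), dtype=np.uint8)

    for idx, label in enumerate(labels.values):
        # Empty masks are NaN; pd.notna is more robust than an identity check against np.nan.
        if pd.notna(label):
            mask_label = np.zeros(2100*1400, dtype=np.uint8)
            label = label.split(" ")
            positions = map(int, label[0::2])
            length = map(int, label[1::2])
            for pos, le in zip(positions, length):
                # Start positions are 1-indexed per the competition's RLE format, so shift by one.
                mask_label[pos-1:pos-1+le] = 1
            # Pixels are numbered column by column, hence order='F'.
            mask[:, :, idx] = mask_label.reshape(1400, 2100, order='F')
    return img_names[0], mask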
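
To make the run-length decoding easier to follow, here is a minimal, self-contained sketch on a tiny 4 x 3 mask. The helper name `rle_decode` and the toy string are assumptions for illustration only. The sketch also assumes that start positions are 1-indexed and that pixels are numbered top to bottom, then left to right, which is how I read the competition's description of the format.

```python
import numpy as np

def rle_decode(rle, height, width):
    """Decode a run-length string into a (height, width) binary mask."""
    flat = np.zeros(height * width, dtype=np.uint8)
    # Missing masks (NaN or empty strings) simply yield an all-zero mask.
    if isinstance(rle, str) and rle.strip():
        tokens = list(map(int, rle.split()))
        starts, lengths = tokens[0::2], tokens[1::2]
        for start, length in zip(starts, lengths):
            # Assumption: start positions are 1-indexed, hence the "- 1".
            flat[start - 1:start - 1 + length] = 1
    # Assumption: pixels run down each column first (column-major order).
    return flat.reshape((height, width), order="F")

# "1 3" fills the first three pixels of column 0; "10 2" fills two pixels in column 2.
toy_mask = rle_decode("1 3 10 2", height=4, width=3)
print(toy_mask)
```

Working through a mask this small by hand is a quick way to check the indexing and the reshape order before trusting the result on full 1400 x 2100 images.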

In [None]:
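def show_mask_image(col):
    name, mask = name_and_mask(col)
    # OpenCV loads images as BGR; convert to RGB so matplotlib shows the true colors.
    img = cv2.cvtColor(cv2.imread(str(train_path / name)), cv2.COLOR_BGR2RGB)
    fig, ax = plt.subplots(figsize=(15, 15))

    for ch in range(4):
        # [-2] picks the contour list whether findContours returns 2 or 3 values.
        contours = cv2.findContours(mask[:, :, ch], cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)[-2]
        # Outline every region of this class in its palette color.
        for contour in contours:
            cv2.polylines(img, contour, True, palet[ch], 2)
    ax.set_title(name)
    ax.imshow(img)
    plt.show()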

In [None]:
idx_no_class = []
idx_class_1 = []
idx_class_2 = []
idx_class_3 = []
idx_class_4 = []
idx_class_multi = []
idx_class_triple = []

for col in range(0, len(train_df), 4):
    img_names = [str(i).split("_")[0] for i in train_df.iloc[col:col+4, 0].values]
    if not (img_names[0] == img_names[1] == img_names[2] == img_names[3]):
        raise ValueError
        
    labels = train_df.iloc[col:col+4, 1]
    if labels.isna().all():
        idx_no_class.append(col)
    elif (labels.isna() == [False, True, True, True]).all():
        idx_class_1.append(col)
    elif (labels.isna() == [True, False, True, True]).all():
        idx_class_2.append(col)
    elif (labels.isna() == [True, True, False, True]).all():
        idx_class_3.append(col)
    elif (labels.isna() == [True, True, True, False]).all():
        idx_class_4.append(col)
    elif labels.isna().sum() == 1:
        idx_class_triple.append(col)
    else:
        idx_class_multi.append(col)

# Images with class 1

In [None]:
for idx in idx_class_1[:5]:
    show_mask_image(idx)

# Images with class 2

In [None]:
for idx in idx_class_2[:5]:
    show_mask_image(idx)

# Images with class 3

In [None]:
for idx in idx_class_3[:5]:
    show_mask_image(idx)

# Images with class 4

In [None]:
for idx in idx_class_4[:5]:
    show_mask_image(idx)

# Images with multiple classes

In [None]:
for idx in idx_class_multi[:5]:
    show_mask_image(idx)

# Explanation of the evaluation metric

**Dice Coefficient (F1 Score)**

The Dice coefficient is a statistic used to gauge the similarity of two samples; here, those are the predicted mask and the ground-truth mask.

Simply put, the Dice coefficient is 2 times the area of overlap divided by the total number of pixels in both masks.

![](https://miro.medium.com/max/644/1*yUd5ckecHjWZf6hGrdlwzA.png)
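
In symbols, for a predicted mask X and a ground-truth mask Y this is 2|X ∩ Y| / (|X| + |Y|). The definition can be sketched in a few lines of NumPy. The function name `dice_coefficient` and the toy masks are illustrative assumptions, and returning 1.0 when both masks are empty is a convention I am assuming rather than something stated on this page.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice score between two binary masks of the same shape."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    overlap = np.logical_and(pred, truth).sum()
    return 2.0 * overlap / total

# Toy masks: 3 pixels each, 2 of them shared.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
truth = np.array([[1, 0, 0],
                  [0, 1, 1]])
score = dice_coefficient(pred, truth)
print(round(score, 4))
```

With 2 shared pixels out of 3 + 3, the score is 2 * 2 / 6, or about 0.667. Identical masks give 1.0 and disjoint masks give 0.0.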

Credit for the code in this kernel goes to @GoldFish's kernel at [https://www.kaggle.com/go1dfish/clear-mask-visualization-and-simple-eda](https://www.kaggle.com/go1dfish/clear-mask-visualization-and-simple-eda)

**Please upvote if this is helpful**