### Purpose
I created this notebook as a culmination of the great amount of work done by other Kagglers in this competition. As I was developing it I realized that so much has already been done, yet there is still so much we can explore. If you find this remotely useful, please upvote the original authors before even thinking about upvoting this notebook.

#### Before starting, I want to point out the resources that helped me get familiar with the data and this problem
<br>Ekhtiar Syed's kernel : https://www.kaggle.com/ekhtiar/eda-find-me-in-the-clouds
<br>Aleksandra Deis's kernel: https://www.kaggle.com/aleksandradeis/understanding-clouds-eda
<br>Andrew Lukyanenko's kernel: https://www.kaggle.com/artgor/segmentation-in-pytorch-using-convenient-tools
<br>Research Paper: https://arxiv.org/pdf/1906.01906.pdf

Before diving into the data, I believe it is always important (and at times underrated) to understand how the data was recorded or generated. Let's look at some important aspects from the research paper shared by the competition organizers. <br>
>Based on visual inspection four subjective patterns or organization were defined: Sugar, Flower, Fish and Gravel. On cloud labeling days at two institutes, 67 participants classified more than 30,000 satellite images on
a crowd-sourcing platform


### Descriptions of Cloud Types
Although the classes are subjective, they do exhibit certain characteristics according to the research paper:
<br>Sugar: Fine, Random | Flower: Clustered, Well Separated | Fish: Network-like, Skeletal | Gravel: Arcs, Intermediate Granularity
![](https://imgur.com/QXRU5kf.png)

### Labeling
It's interesting to note that the labeling is not definitive but crowdsourced: <br>
- > Researchers downloaded roughly 10,000 21° longitude by 14° latitude Terra and Aqua MODIS visible images from NASA Worldview 
- > On the web interface, participants are served an image randomly drawn from our library of 10,000 images. 
- > Users were then asked to draw rectangles around regions where one of the four cloud patterns dominates (Fig. 2a). 
- > Participants had the possibility to draw any number of boxes, including none, with the caveat that the box would _cover at least 10% of the image_. 
- > When an image was classified by _four_ different users, it was retired, i.e. removed from the image library. No user was shown the same image twice.
![](https://imgur.com/OdZSn5E.png)


### Agreement and Accuracy
Agreement is highest on Flower, while Fish and Gravel seem to be the most controversial. This surprises me: to my newbie eyes, Sugar and Gravel look like close cousins. It's important to keep in mind that the labelers were researchers who look at clouds on a daily basis, so they presumably did not label on instinct alone. Cognitive biases will still be inherent, though.
The authors also provide transparency on these metrics: <br>
- > the agreement score, used to compare the inter-human agreement, defined as follows: “In which percentage of cases, if one user drew a box of a certain class, did another user also draw a box of the same class, under the condition that the boxes overlap.” The overlap is measured using the Intersection-over-Union (IoU) metric. For above metric an IoU of larger than 0.1 is required
- > The second metric is the pixel accuracy used to compare the machine learning models to the human predictions. Here, for each pixel, the accuracy of one user (or a machine learning prediction) compared to another user is computed for each pattern. Pixels where both users predict no pattern are omitted for this score.
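
The overlap condition in the agreement score can be illustrated with a small sketch. The box format `(x_min, y_min, x_max, y_max)` and the helper names `box_iou` and `boxes_agree` are assumptions made for this example; they do not come from the paper or the competition data.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle (may be empty)
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def boxes_agree(a, b, threshold=0.1):
    """Two same-class boxes count as overlapping if their IoU exceeds the threshold."""
    return box_iou(a, b) > threshold


# Two hypothetical users drawing partly overlapping boxes of the same class
user_1 = (0, 0, 100, 100)
user_2 = (50, 50, 150, 150)
iou_example = box_iou(user_1, user_2)  # 2500 / 17500
print(round(iou_example, 3), boxes_agree(user_1, user_2))
```

With a threshold as low as 0.1, even boxes that share only a corner region can count as agreeing, which is worth remembering when reading the agreement numbers.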

![](https://imgur.com/sfsqeM2.png)

#### Deep Learning models used by authors
1. RetinaNet Object detection with images downscaled to 1050 by 700 pixels
![](https://imgur.com/BOKK0zt.png)
<br>Source: https://arxiv.org/pdf/1708.02002.pdf

2. Semantic segmentation using a U-Net structure with a ResNet50 backbone, images downscaled to 700 by 466 pixels
![](https://imgur.com/F2DtcJe.png)
<br>Source: https://arxiv.org/pdf/1505.04597.pdf
![](https://imgur.com/D9EQGN5.png)


#### Now that we have read the major portions of the paper and have a clear understanding of the background, the methodology, and what has already been tried, let's delve into the data

PS: I blatantly copy code sets from [ekhtiar's](https://www.kaggle.com/ekhtiar/eda-find-me-in-the-clouds) and [aleksandradeis's](https://www.kaggle.com/aleksandradeis/understanding-clouds-eda) kernels so please upvote their work if you find this useful

In [None]:
import os
import pandas as pd
import random
import numpy as np
from matplotlib import pyplot as plt
from glob import glob
from PIL import Image
import imageio
import cv2

In [None]:
data_path = '/kaggle/input/understanding_cloud_organization'
train_csv_path = os.path.join(data_path, 'train.csv')
train_image_path = os.path.join(data_path, 'train_images')

# set paths to train and test image datasets
TRAIN_PATH = '../input/understanding_cloud_organization/train_images/'
TEST_PATH = '../input/understanding_cloud_organization/test_images/'


In [None]:
def load_processdata(loc,**kwargs):
    nomaskvalue = kwargs.get('nomaskvalue',-1)
    
    train_df = pd.read_csv(loc).fillna(nomaskvalue)
    
    # split column
    split_df = train_df["Image_Label"].str.split("_", n = 1, expand = True)
    # add new columns to train_df
    train_df['img'] = split_df[0]
    train_df['lbl'] = split_df[1]
    
    del split_df
    
    # Per-class indicator columns: 1 if this row's label is that class and it has a mask, else 0
    # (used further down for the co-occurrence counts)
    train_df['fish'] = np.where((train_df['lbl'].str.lower()=='fish') & (train_df['EncodedPixels']!=-1),1,0)
    train_df['sugar'] = np.where((train_df['lbl'].str.lower()=='sugar') & (train_df['EncodedPixels']!=-1),1,0)
    train_df['gravel'] = np.where((train_df['lbl'].str.lower()=='gravel') & (train_df['EncodedPixels']!=-1),1,0)
    train_df['flower'] = np.where((train_df['lbl'].str.lower()=='flower') & (train_df['EncodedPixels']!=-1),1,0)
    
    train_df['Label_EncodedPixels'] = train_df.apply(lambda row: (row['lbl'], row['EncodedPixels']), axis = 1)

    return train_df

def get_image_sizes(train = True):
    '''
    Function to get sizes of images from test and train sets.
    INPUT:
        train - indicates whether we are getting sizes of images from train or test set
    '''
    if train:
        path = TRAIN_PATH
    else:
        path = TEST_PATH
        
    widths = []
    heights = []
    
    images = sorted(glob(path + '*.jpg'))
    
    max_im = Image.open(images[0])
    min_im = Image.open(images[0])
        
    for im in range(0, len(images)):
        image = Image.open(images[im])
        width, height = image.size
        
        if len(widths) > 0:
            if width > max(widths):
                max_im = image

            if width < min(widths):
                min_im = image

        widths.append(width)
        heights.append(height)
        
    return widths, heights, max_im, min_im



In [None]:
trdf = load_processdata(train_csv_path)
trdf.head()

In [None]:
# Let's look at some data on cloud type occurrences
typecols = ['fish','sugar','gravel','flower']

co_occ = trdf.groupby('img')[typecols].sum().T.dot(trdf.groupby('img')[typecols].sum())
import seaborn as sns
sns.heatmap(co_occ, cmap = 'YlGnBu', annot=True, fmt="d")

This is interesting: gravel-sugar is the most popular combo (sounds like ice cream, to be honest), but it is not that far ahead of sugar-fish, which is just 14% lower at 1782. Flower is the least co-occurring. Do patterns that look alike tend to occur together? It does make cognitive sense, but let's leave that for now (already hungry). This is just a two-way pairing. Let's use a multivariate approach to see if the absence or presence of certain cloud types tells us about other cloud types.
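
To make the heatmap easier to read, here is a minimal sketch of why multiplying the image-level indicator table by its own transpose gives co-occurrence counts. The three-image table `toy` is made up for illustration and is not taken from `train.csv`.

```python
import pandas as pd

# Hypothetical image-level indicators: one row per image, one column per cloud type
toy = pd.DataFrame(
    {"fish": [1, 0, 1], "sugar": [1, 1, 0], "gravel": [0, 1, 0], "flower": [0, 0, 0]},
    index=["a.jpg", "b.jpg", "c.jpg"],
)

# Entry (i, j) sums, over images, the product of indicators i and j,
# i.e. the number of images that contain both type i and type j
toy_co_occ = toy.T.dot(toy)
print(toy_co_occ)
```

The diagonal holds the number of images containing each type, so off-diagonal counts are best judged relative to it: a frequent class will pair often with everything simply because it appears often.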

In [None]:
trdf.head()

In [None]:
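from sklearn import svm
from sklearn.model_selection import cross_val_score

# For each cloud type, try to predict its indicator from the indicators of other types
for c in typecols:
    print(f'{c}: ')
    yvar=c
    # Note: only the first two of the remaining three types are used as predictors
    xvars = [i for i in typecols if i!=yvar][0:2]
    X = trdf[xvars]
    y = trdf[yvar]
    clf = svm.SVC(kernel='linear', C=1.0)

    # Mean 5-fold cross-validated accuracy, with +/- two standard deviations
    scores = cross_val_score(clf, X, y, cv=5)
    print("    Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    # Refit on all rows to inspect the linear coefficients
    clf.fit(X, y)
    print(f'    {xvars},{list(clf.coef_)}')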

In [None]:
import statsmodels.api as sm

result_list = []
for c in typecols:
    print(f'-------{c}-------')
    for seed in [1]:
        print(f'        Iteration: {seed}')
        # Fit on a random 80% sample of the rows
        trdf_s = trdf.sample(frac=0.8, replace=False, random_state=seed)
        yvar = c
        # Same predictor choice as in the SVM cell: first two of the other types
        xvars = [t for t in typecols if t!=yvar][0:2]
        X = trdf_s[xvars]
        y = trdf_s[yvar]

        # Logistic regression without an intercept (no constant column is passed in X)
        logit = sm.Logit(y, X)
        result = logit.fit()
        result_list.append(result)
        print(result.summary())

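Given the quasi-complete separation and the low coefficients, we don't get much from this exercise; the data does not seem to indicate any clear relationship between the presence of the different cloud types. Keep in mind that `trdf` holds one row per image-label pair, so at most one indicator is 1 in any row, which by itself can produce quasi-complete separation; an image-level table would be needed to test this properly.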
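
One possible way to look at presence relationships at the image level is sketched below. It reshapes a long table (one row per image-label pair) into one row per image and compares a conditional frequency with the overall frequency. The table `toy_rows` and its `has_mask` column are hypothetical stand-ins, not the notebook's actual `trdf`.

```python
import pandas as pd

# Hypothetical long-format table: every image appears once per label
toy_rows = pd.DataFrame({
    "img": ["a.jpg"] * 4 + ["b.jpg"] * 4 + ["c.jpg"] * 4,
    "lbl": ["Fish", "Flower", "Gravel", "Sugar"] * 3,
    "has_mask": [1, 0, 0, 1,   0, 0, 1, 1,   1, 0, 0, 0],
})

# One row per image, one indicator column per cloud type
per_image = toy_rows.pivot(index="img", columns="lbl", values="has_mask")

# Share of images with Fish overall vs. among images that contain Sugar
p_fish = per_image["Fish"].mean()
p_fish_given_sugar = per_image.loc[per_image["Sugar"] == 1, "Fish"].mean()
print(per_image)
print(p_fish, p_fish_given_sugar)
```

If the conditional share differs clearly from the overall share on the real data, that would hint at a relationship worth modelling; on a three-image toy table the numbers are of course only illustrative.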

In [None]:
# Label count freq
trdf.groupby('img')[typecols].sum().sum(axis=1).plot.hist(title='Freq of # labels per image')
plt.show()

trdf.groupby('img')[typecols].sum().sum(axis=0).plot(kind='bar',color='green',title='Occurrence of the Cloud Types')


Let's look at them clouds now

In [None]:
# Function to decode the run length mask
def rle_to_mask(rle_string, height, width):
    '''
    convert RLE(run length encoding) string to numpy array

    Parameters: 
    rle_string (str): string of rle encoded mask
    height (int): height of the mask
    width (int): width of the mask 

    Returns: 
    numpy.array: numpy array of the mask
    '''
    
    rows, cols = height, width
    
    if rle_string == -1:
        return np.zeros((height, width))
    else:
        rle_numbers = [int(num_string) for num_string in rle_string.split(' ')]
        #print(rle_numbers)
        rle_pairs = np.array(rle_numbers).reshape(-1,2)
        #print(rle_pairs)
        img = np.zeros(rows*cols, dtype=np.uint8)
        #print(img)
        for index, length in rle_pairs:
            index -= 1
            img[index:index+length] = 255
        img = img.reshape(cols,rows)
        img = img.T
        return img
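
To see what the decoding does, here is a small round-trip sketch on a 3 by 4 toy mask. It assumes the same convention as the function above (1-indexed start positions, pixels counted column by column). `decode_rle` is a compact stand-in written for this example and `mask_to_rle` is a hypothetical inverse; neither is part of the competition tooling.

```python
import numpy as np


def decode_rle(rle_string, height, width):
    """Decode 'start length start length ...' into a (height, width) 0/1 mask."""
    flat = np.zeros(height * width, dtype=np.uint8)
    numbers = [int(n) for n in rle_string.split()]
    for start, length in zip(numbers[0::2], numbers[1::2]):
        flat[start - 1:start - 1 + length] = 1  # starts are 1-indexed
    # Pixels are numbered down each column, so fill column-wise and transpose
    return flat.reshape(width, height).T


def mask_to_rle(mask):
    """Hypothetical inverse: encode a 2-D 0/1 mask back into an RLE string."""
    flat = np.concatenate([[0], mask.T.flatten(), [0]])  # pad so every run has two edges
    changes = np.where(flat[1:] != flat[:-1])[0] + 1     # 1-indexed positions of run edges
    starts, ends = changes[0::2], changes[1::2]
    return " ".join(f"{s} {e - s}" for s, e in zip(starts, ends))


toy_rle = "2 3 8 2"  # pixels 2-4 and 8-9, counted column by column
toy_mask = decode_rle(toy_rle, height=3, width=4)
print(toy_mask)
print(mask_to_rle(toy_mask))
```

Note how the first run spills from the bottom of the first column into the top of the second one: runs follow the column-wise pixel numbering, not the rectangles that were drawn.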


Surface area of masks: given that the labelers were asked to draw boxes covering at least 10% of the image area, it's going to be interesting to see what the distribution looks like (blatant copy from [ekhtiar's kernel](https://www.kaggle.com/ekhtiar/eda-find-me-in-the-clouds)).
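
As a cross-check for the mask area, the covered fraction can also be read straight from the RLE string without decoding it, since every second number is a run length. This is only a sketch: `rle_area_fraction` is a hypothetical helper, and it assumes the runs do not overlap and that missing masks are stored as a non-string placeholder such as `-1`.

```python
def rle_area_fraction(rle_string, height=1400, width=2100):
    """Fraction of the image covered by a mask, computed from RLE run lengths."""
    if not isinstance(rle_string, str):
        return 0.0  # missing mask (e.g. the -1 placeholder)
    numbers = [int(n) for n in rle_string.split()]
    run_lengths = numbers[1::2]  # numbers alternate: start, length, start, length, ...
    return sum(run_lengths) / (height * width)


# A single box at the 10% minimum would cover 0.1 * 1400 * 2100 = 294000 pixels
print(rle_area_fraction("1 294000"))
print(rle_area_fraction(-1))
```

This avoids allocating a full 1400 by 2100 array per row, which may matter when the function is applied to every row of the training table.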

In [None]:
# we will use the following function to decode our mask to binary and count the sum of the pixels for our mask.
def get_binary_mask_sum(encoded_mask):
    mask_decoded = rle_to_mask(encoded_mask, width=2100, height=1400)
    binary_mask = (mask_decoded > 0.0).astype(int)
    return binary_mask.sum()

# calculate sum of the pixels for the mask per cloud formation
trdf['mask_pixel_sum'] = trdf.apply(lambda x: get_binary_mask_sum(x['EncodedPixels']), axis=1)



In [None]:
# Fraction of the 2100 x 1400 image covered by each mask (a value in [0, 1], not a percentage)
trdf['mask_pixel_perc'] = trdf['mask_pixel_sum']/(2100*1400)
trdf.head()

In [None]:
trdf.groupby('lbl')['mask_pixel_perc'].describe()

In [None]:
trdf.loc[trdf.mask_pixel_perc>0,:].groupby('lbl')['mask_pixel_perc'].describe()

In [None]:
g = sns.FacetGrid(trdf, col="lbl")
g.map(plt.hist, "mask_pixel_perc")

In [None]:
g = sns.FacetGrid(trdf.loc[trdf.mask_pixel_perc>0,:], col="lbl")
g.map(plt.hist, "mask_pixel_perc")

This makes sense: the peaks are around 20% of the image area, and the standard deviations are comparable across classes.

In [None]:
for i in range(0,5):
    # cv2.imread returns BGR; convert to RGB so matplotlib shows the true colours
    img = cv2.cvtColor(cv2.imread(os.path.join(train_image_path, trdf['img'][i])), cv2.COLOR_BGR2RGB)
    mask_decoded = rle_to_mask(trdf['Label_EncodedPixels'][i][1], img.shape[0], img.shape[1])
    fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(20,10))
    ax[0].imshow(img)
    ax[1].imshow(mask_decoded)

Looks about right

So far we have tried to look at: <br>
1. Whether there is any relationship between the presence and/or absence of cloud types in an image
2. How the bounding-box area for these cloud types is distributed
3. Commonly occurring pairs of cloud types

## *WIP*
I was always looking at clouds and finding shapes, so I am curious what a pre-trained ResNet50 finds in these clouds.

In [None]:
from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input, decode_predictions
import numpy as np

model = ResNet50(weights='imagenet')
top_preds=[]

In [None]:
def get_mask_cloud(img_path, img_id, label, mask):
    img = cv2.imread(os.path.join(img_path, img_id), 0)
    mask_decoded = rle_to_mask(mask, img.shape[0], img.shape[1])
    mask_decoded = (mask_decoded > 0.0).astype(int)
    img = np.multiply(img, mask_decoded)
    return img

In [None]:
# top_preds = []
# for i in range(0,1):#trdf.shape[0]):
#     img_path = os.path.join(train_image_path, trdf['img'][i])
#     img = image.load_img(img_path, target_size=(224, 224))
    
#     img_print = cv2.imread(os.path.join(train_image_path, trdf['img'][i]))
#     mask_decoded = get_mask_cloud(img_print, trdf['lbl'][i], trdf['EncodedPixels'][i])
#     #img = get_mask_cloud(train_image_path, sample['ImageId'], sample['Label'],sample['EncodedPixels'])
#     print(type(mask_decoded),type(image.img_to_array(img)))
#     print(mask_decoded)
#     print("+"*50)
#     print(image.img_to_array(img))
    
#     x = image.img_to_array(img)
#     x = np.expand_dims(x, axis=0)
#     x = preprocess_input(x)

#     preds = model.predict(x)
#     # decode the results into a list of tuples (class, description, probability)
#     # (one such list for each sample in the batch)
#     #print('Predicted:', decode_predictions(preds, top=2)[0])
#     top_preds.extend([decode_predictions(preds, top=2)[0]])

# #trdf['ResNet_toppreds'] = top_preds

# #trdf.head(50)