# Clouds - A Topological Approach

Recently I've had a strong interest in topological approaches to data analysis (typically abbreviated TDA). For those unfamiliar, the key idea in topology is to throw away coordinate systems and analyze systems and structures based solely on _connectedness_. For phenomena with complicated, intricate structure (e.g. clouds), TDA is good at extracting global qualities. For things like facial recognition, one is better off using conventional CNN techniques, which rely heavily on layers of feature detection.

Since this competition is about recognizing a complicated pattern, it seems appropriate to use TDA. An example process might use TDA to extract local features, then use a CNN to wrap these up into a segmentation map. In this kernel, I mostly investigate topological features qualitatively to see whether TDA is a viable option. I attempt a rough segmentation at the end as a sanity check, but it is too slow to be practical.

For more information, see I. Obayashi and Y. Hiraoka, “Persistence Diagrams with Linear Machine Learning Models.”

## Setup

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.figsize'] = [15,15]

The first step is to load the training data. I do a couple of transformations to clean it up and make it easier to work with.

In [None]:
def read_train():
    df = pd.read_csv('../input/understanding_cloud_organization/train.csv')
    df = df.dropna()
    df[['fname','label']] = df.Image_Label.str.split('_',expand=True)
    df.EncodedPixels = df.EncodedPixels.str.split().apply(np.array,dtype=int)
    df = df.set_index(['fname','label']).EncodedPixels
    df = df.loc[~df.index.duplicated(keep='first')]
    df = df.unstack('label')
    return df

In [None]:
df = read_train()
df.head()

The next step is to define an easy function to load images. I use PIL (like everybody else) for this.

In [None]:
def load_image(fname,test=False):
    from PIL import Image
    path = '../input/understanding_cloud_organization/{}_images/{}'.format('train' if not test else 'test', fname)
    return np.asarray(Image.open(path))

In [None]:
fname = df.index[0]
img = load_image(fname)
plt.imshow(img);

Wow, this has extremely intricate structure! To transform it into something we can apply persistent homology to, we need to apply the following steps:

1. Shrink the image (optional, but speeds it up a lot).
2. Threshold the image. This effectively separates the clouds from the ocean.
3. Apply the exact Euclidean distance transform to the thresholded image and its inverse, then subtract the two.

In [None]:
def transform_image(x,thresh,rescale,blur):
    import numpy as np
    from scipy.ndimage import zoom,gaussian_filter,distance_transform_edt
    # normalize between 0 and 1
    img = np.asarray(x,dtype=float)/255

    # take average of RGB channels
    img = img.mean(axis=2)

    # Shrink image
    img = zoom(img,rescale)

    # Remove artifacts from shrinking
    img = gaussian_filter(img,blur)

    # threshold image
    img = (img < thresh).astype(int)

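    # Signed distance: positive outside clouds (distance to the nearest cloud pixel),
    # negative inside clouds (distance to the cloud edge, offset by one pixel)
    img = distance_transform_edt(img) - (distance_transform_edt(1-img)-1).clip(0,None)
    # Convert distances back to original-image pixel units
    img = img/rescale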

    return img

In [None]:
timg = transform_image(img,thresh=0.5,rescale=0.2,blur=0.1)
plt.imshow(timg,cmap='coolwarm')
plt.clim([-50,50])
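To see what the signed distance step does in isolation, here is a minimal sketch on a toy mask. The 7x7 array and the names `mask`, `outside`, `inside`, and `signed` are made up for illustration; only the `distance_transform_edt` expression mirrors the function above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Toy thresholded image: 1 = "ocean" (below threshold), 0 = "cloud".
# A 3x3 cloud sits in the middle of a 7x7 ocean (made-up data).
mask = np.ones((7, 7), dtype=int)
mask[2:5, 2:5] = 0

# For each ocean pixel: distance to the nearest cloud pixel.
outside = distance_transform_edt(mask)
# For each cloud pixel: distance to the nearest ocean pixel, minus one,
# so that pixels on the cloud edge sit at zero.
inside = (distance_transform_edt(1 - mask) - 1).clip(0, None)

signed = outside - inside
print(signed.round(2))
```

The cloud center comes out negative, the cloud edge is zero, and values grow with distance into the ocean. This is why the lowest values, and hence the earliest births in the filtration, sit inside clouds.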

The goal of the transformation is to highlight the intricate structure of the cloud formations by creating a filtration. Specifically, I use a lower-star filtration, in which connected components are born inside clouds (where the transformed values are lowest). Furthermore, since I'm using the Ripser library, I only compute the 0-dimensional homology groups (connected components). Looking at the picture above, I can visibly pick out a cycle, so the 1-dimensional groups may also be of value.<sup>1</sup> To visualize what the persistence code is doing, let's look at what happens when we threshold the transformed image at different values (i.e. distances).

<sup>1</sup>As a side note, it looks like it would be pretty minor to make that change to the Ripser python code.

In [None]:
plt.imshow(np.block([[timg < t for t in ts] for ts in np.linspace(-20,20,16).reshape(4,4)]), cmap='gray');
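The appear-and-merge behavior in the grid of thresholded images can also be checked numerically. The sketch below counts connected components of sublevel sets on a tiny made-up array (`field` is hypothetical). It uses `scipy.ndimage.label` with its default 4-connectivity, which may differ from the neighborhood Ripser uses, so treat it as an illustration only.

```python
import numpy as np
from scipy.ndimage import label

# Toy field with two basins (values 0 and 1) separated by a ridge of height 2.
field = np.array([
    [3, 3, 3, 3, 3],
    [3, 0, 2, 1, 3],
    [3, 3, 3, 3, 3],
])

thresholds = [0.5, 1.5, 2.5, 3.5]
counts = []
for t in thresholds:
    sublevel = field < t              # pixels below the threshold
    _, n_components = label(sublevel)  # count connected pieces
    counts.append(n_components)

print(dict(zip(thresholds, counts)))
```

One component appears first, a second appears at the next threshold, and the two merge once the threshold passes the ridge between them.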

Ok, so this makes sense. As the threshold increases, the sublevel sets appear, merge, and then fill up the image. The final step is to create a persistence image.

In [None]:
from ripser import lower_star_img as lower_star

def pers_image(dgm,res,rng,spread):
    from scipy.ndimage import gaussian_filter,zoom
    
    # birth and death
    idx = np.where(~np.isinf(dgm).any(axis=1))
    b,d = dgm[idx].T
    
    # compute histogram at 3x resolution
    img = np.histogram2d(b,d,bins=res*3,range=[[-rng,rng],[-rng,rng]])[0]
    
    # apply blurring to histogram (relative to total shape)
    img = gaussian_filter(img, np.array(img.shape)*spread)
    
    # decimate histogram (hopefully the blurring prevents aliasing)
    img = img[::3,::3]
    
    return img

In [None]:
dgm = lower_star(timg)
pim = pers_image(dgm,res=500,rng=20,spread=0.02)
plt.imshow(pim.T,origin='lower',cmap='jet');

Huh, looks pretty concentrated in the middle. Roughly, these correspond to small connected components (maybe a pixel or two wide). They seem to be fairly uniformly separated and die at about the same time.
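To make "birth" and "death" concrete, here is a minimal sketch of 0-dimensional sublevel-set persistence on a 1D signal, written with a small union-find. It is not Ripser's implementation, and `persistence_0d` and `signal` are hypothetical names. It only illustrates the elder rule: when two components merge, the one born later dies.

```python
def persistence_0d(values):
    """0-dimensional sublevel-set persistence of a 1D sequence.

    Returns sorted (birth, death) pairs. The oldest component never
    dies and is reported with death = inf.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent = {}  # union-find forest over the indices added so far
    birth = {}   # root index -> birth value of its component
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in order:
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if j not in parent:
                continue
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            # Elder rule: the component born later dies at this value.
            young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
            if birth[young] < values[i]:  # skip zero-persistence pairs
                pairs.append((birth[young], values[i]))
            parent[young] = old

    roots = {find(i) for i in parent}
    pairs += [(birth[r], float("inf")) for r in roots]
    return sorted(pairs)


signal = [0, 3, 1, 4, 2, 5]
print(persistence_0d(signal))  # [(0, inf), (1, 3), (2, 4)]
```

The point with infinite death corresponds to the rows that `pers_image` filters out before building the histogram.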

Finally, we need some utility functions to mask the images and convert the persistence images to vectors.

In [None]:
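def apply_mask(img,mask):
    # mask is a run-length encoding: alternating (start, length) values
    # starts are one-indexed in the competition data, hence the -1
    px1 = mask[0::2] - 1
    px2 = px1 + mask[1::2]
    # pixels are numbered column by column, so build the mask transposed
    mask = np.zeros(img.shape[:2][::-1],dtype=int)
    # mark run starts with +1 and run ends with -1, then integrate
    mask.flat[px1] = 1
    mask.flat[px2[px2<len(mask.flat)]] = -1
    mask.flat = np.cumsum(mask.flat)
    mask = mask.T
    return img*(mask>0)[:,:,None]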

def training_vectors(df,thresh,rescale,blur,res,spread,rng):
    from tqdm import tqdm_notebook
    todo = [(label,col,fname,mask) for label,col in df.items() for fname,mask in col.dropna().items()]
    vecs,labels = [],[]
    for label,col,fname,mask in tqdm_notebook(todo):
        img = load_image(fname)
        img = apply_mask(img,mask)
        img = transform_image(img,thresh,rescale,blur)
        dgm = lower_star(img)
        img = pers_image(dgm,res,rng,spread)
        vecs += [img]
        labels += [label]
    return np.array(vecs),np.array(labels)

In [None]:
fname2 = df.index[3]
img2 = apply_mask(load_image(fname2),df.Flower.iloc[3])
plt.imshow(img2); plt.show();
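The masks are stored as run-length encodings, which is what `apply_mask` unpacks. As a reference, here is a minimal, slower but more explicit sketch of the decoding on a 3x3 toy example. It assumes one-indexed starts and column-major pixel order (top to bottom, then left to right); `rle_decode` is a hypothetical helper, not part of the code above.

```python
import numpy as np

def rle_decode(rle, height, width):
    """Decode alternating (start, length) values into a boolean mask.

    Assumes one-indexed starts and column-major pixel numbering.
    """
    rle = np.asarray(rle, dtype=int)
    starts, lengths = rle[0::2] - 1, rle[1::2]
    flat = np.zeros(height * width, dtype=bool)
    for s, n in zip(starts, lengths):
        flat[s:s + n] = True
    # Column-major order: reshape as (width, height), then transpose.
    return flat.reshape(width, height).T

# Runs: pixels 2-3 (first column, rows 2 and 3) and pixel 6 (second column, row 3).
mask = rle_decode([2, 2, 6, 1], height=3, width=3)
print(mask.astype(int))
```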

## Training

Now we can do some regular old-fashioned machine learning. The goal right now is just to classify the masks. It is especially useful if the algorithm has a coefficient or weight vector since we can visualize it directly as a persistence image (logistic regression, SVM, etc). Since this is a kernel I chose to scale things down as much as possible, probably at the cost of accuracy.

In [None]:
X,y = training_vectors(df.sample(1000),thresh=0.5,rescale=0.2,blur=0.1,res=20,rng=20,spread=0.02)

In [None]:
from sklearn.linear_model import LogisticRegression
from scipy.ndimage import gaussian_filter
lg = LogisticRegression(C=1e-1,solver='lbfgs',multi_class='multinomial',max_iter=300)
Xi = gaussian_filter(X,(0,1,1)).reshape(-1,20**2)
idx = np.arange(len(Xi))
np.random.shuffle(idx)
i1,i2 = idx[:-300],idx[-300:]
lg.fit(Xi[i1],y[i1])
lg.score(Xi[i2],y[i2])

Let's look at some of the coefficient vectors.

In [None]:
from scipy.ndimage import zoom
fig,axes = plt.subplots(2,2,figsize=[15,15])
for ci,ax in zip(lg.coef_,axes.flat):
    ax.imshow(zoom(ci.reshape(20,20).T,30),origin='lower',cmap='RdBu');

In [None]:
from sklearn.svm import LinearSVC

svm = LinearSVC(max_iter=500,C=1e-1)
svm.fit(Xi[i1],y[i1])
svm.score(Xi[i2],y[i2])

In [None]:
fig,axes = plt.subplots(2,2,figsize=[15,15])
for ci,ax in zip(svm.coef_,axes.flat):
    ax.imshow(zoom(ci.reshape(20,20).T,30),origin='lower',cmap='RdBu');

## Segmentation

For a mostly topological method, this performs pretty well! My theory is that the approach can be used to generate topological features that can be used by a CNN. For now, let's see if we can do segmentation with this method.

In [None]:
def segment(img,size,stride,thresh,rescale,blur,res,rng,spread):
    from tqdm import tqdm_notebook
    from itertools import product
    trans = transform_image(img,thresh,rescale,blur)
    preds = np.empty(np.array(img.shape[:2])//stride+1,dtype='U10')
    for i,j in tqdm_notebook(list(product(range(0,img.shape[0],stride),range(0,img.shape[1],stride)))):
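        # window bounds in the rescaled image, clipped at 0 so negative indices don't wrap around
        i1,i2,j1,j2 = ((np.array([[i],[j]])+[-size/2,size/2])*rescale).clip(0,None).astype(int).flat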
        small_img = trans[i1:i2,j1:j2]
        dgm = lower_star(small_img)
        pim = pers_image(dgm,res,rng,spread)
        preds[i//stride,j//stride] = svm.predict(pim.reshape(1,-1))[0]
    return preds

In [None]:
si = 121
simg = load_image(df.index[si])
seg = segment(simg,size=500,stride=50,thresh=0.5,rescale=0.2,blur=0.1,res=20,rng=20,spread=0.02)

fig,axes = plt.subplots(4,2,figsize=[15,20])
dice = 0
for ax,label in zip(axes,'Flower Fish Gravel Sugar'.split()):
    segl = seg == label
    try:
        ax[0].imshow(apply_mask(simg,df[label].iloc[si]));
        segY = apply_mask(np.ones_like(simg,dtype=int),df[label].iloc[si])[:,:,0] == 1
    except TypeError:
        # no mask for this label (NaN entry)
        segY = np.zeros(simg.shape[:2],dtype=bool)
        ax[0].imshow(np.zeros_like(simg))
    segX = (zoom(segl.astype(float),50)[:simg.shape[0],:simg.shape[1]] > 0.5) == 1
    if (segX|segY).sum() == 0:
        dice += 0.25
    else:
        # Dice = 2|X∩Y| / (|X|+|Y|), averaged over the four labels
        dice += (2*(segX&segY).sum()/(segX.sum()+segY.sum()))/4
    ax[1].imshow(segX,cmap='Blues');
    ax[0].set_title(label)
print(dice)
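The printed number is meant to be a mean Dice coefficient over the four labels. For reference, here is a minimal sketch of the standard definition for a single pair of masks, with two empty masks counted as a perfect match. `dice_score` is a hypothetical helper and the arrays are made up.

```python
import numpy as np

def dice_score(pred, truth):
    """Dice coefficient 2|X∩Y| / (|X| + |Y|) for two boolean masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: counted as a perfect match
    return float(2 * (pred & truth).sum() / denom)

a = np.array([[1, 1, 0], [0, 0, 0]], dtype=bool)
b = np.array([[0, 1, 1], [0, 0, 0]], dtype=bool)
print(dice_score(a, b))  # 0.5
```

Note that the denominator is the sum of the two mask sizes, not the size of their union, so the score stays between 0 and 1.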

## Conclusion

The segmentation is so-so, but this is likely due to all the rough approximations I made to keep the computation time manageable for playing around. My goal is to refine the approach in another notebook, so stay tuned. At the very least, this has been a learning experience for me in TDA approaches. Feel free to leave comments, especially if you notice any problems or errors in my code.