# Satellite image classification



This notebook prepares the .lst files that describe our data. This is the first step towards creating the .rec files, which will be used as input for our CNN.

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
%matplotlib inline

## Training data preparation

In [13]:
DATA_DIR = '../data/cloud-classif/'
IMG_DIR = os.path.join(DATA_DIR, 'jpg/')

In [14]:
df = pd.read_csv(os.path.join(DATA_DIR, 'labels_index.csv'))
df.head()

Unnamed: 0,image_name,tags
0,train_0,haze primary
1,train_1,agriculture clear primary water
2,train_2,clear primary
3,train_3,clear primary
4,train_4,agriculture clear habitation primary road


Load the labels and keep only those relevant to us: the ones describing atmospheric conditions.

In [5]:
# Multi class version
# cloud_labels = {'clear':0, 'haze':1, 'partly_cloudy':2, 'cloudy':3}

# Version for binary classification
cloud_labels = {'clear':0, 'haze':1, 'partly_cloudy':1, 'cloudy':1}

In [16]:
df['tags'] = df['tags'].apply(lambda x: x.split(' '))
df['tags'] = [[label for label in sublist if label in cloud_labels.keys()] for sublist in df['tags']]

Not every image carries exactly one cloud-related label (in this dataset, one image has none), so keep only the images that do.

In [17]:
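# .copy() gives an independent frame, so reassigning 'tags' below does not trigger SettingWithCopyWarning
df = df[df['tags'].str.len() == 1].copy()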
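
To see how many images the filter drops, it can help to count the cloud-related tags per image first. The sketch below runs the same steps on a toy DataFrame; the three rows are made up for illustration and do not come from labels_index.csv.

```python
import pandas as pd

cloud_labels = {'clear': 0, 'haze': 1, 'partly_cloudy': 1, 'cloudy': 1}

# Hypothetical rows standing in for labels_index.csv.
toy = pd.DataFrame({
    'image_name': ['a', 'b', 'c'],
    'tags': ['haze primary', 'agriculture water', 'clear cloudy road'],
})

# Same two steps as above: split on spaces, keep only cloud-related tags.
toy['tags'] = toy['tags'].apply(
    lambda tags: [t for t in tags.split(' ') if t in cloud_labels])

# Maps "number of cloud tags on an image" to "number of such images".
tag_counts = toy['tags'].str.len().value_counts().sort_index()
print(tag_counts)

# Only images with exactly one cloud tag survive the filter.
kept = toy[toy['tags'].str.len() == 1].copy()
print(kept)
```

Any count other than 1 in `tag_counts` corresponds to images that the filter removes.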

Convert the labels to numeric values (required by im2rec).

In [18]:
df['tags'] = [cloud_labels[sublist[0]] for sublist in df['tags']]

In [19]:
df['image_name'] = df['image_name'] + '.jpg'
df.head()

Unnamed: 0,image_name,tags
0,train_0.jpg,1
1,train_1.jpg,0
2,train_2.jpg,0
3,train_3.jpg,0
4,train_4.jpg,0


Split the data into training and validation sets (75/25), plus a small sample used only for convenience when debugging. The test set will come from Sentinel-2 images.

In [20]:
from sklearn.model_selection import train_test_split
train, valid = train_test_split(df, test_size=0.25)
sample = df.sample(n=10)
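
The split above is random and changes on every run. If a reproducible split that preserves the class proportions is wanted, one possible variant is sketched below. The `stratify` and `random_state` arguments are additions that the original cell does not use, the seed value is arbitrary, and the toy DataFrame is a hypothetical stand-in for the labelled data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labelled DataFrame: 70 clear, 30 cloudy.
toy = pd.DataFrame({
    'image_name': [f'train_{i}.jpg' for i in range(100)],
    'tags': [0] * 70 + [1] * 30,
})

# stratify keeps the clear/cloudy ratio similar in both subsets;
# random_state makes the split repeatable (42 is an arbitrary choice).
train_s, valid_s = train_test_split(
    toy, test_size=0.25, stratify=toy['tags'], random_state=42)

# Fraction of cloudy images in each subset.
print(train_s['tags'].mean(), valid_s['tags'].mean())
```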

Function to generate the .lst file, which we will then use to prepare the .rec file with im2rec. It uses reset_index so that records are numbered 0, 1, ..., n-1 instead of keeping their index from the original df.

In [10]:
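def write_lst(df, filename):
    """Write df as a tab-separated .lst file: index, label, image path."""
    # drop=True discards the old index; the fresh 0..n-1 index becomes the first column
    df.reset_index(drop=True).to_csv(filename, sep='\t', columns=['tags', 'image_name'], header=False)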
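
As an illustration of the resulting file layout, the sketch below applies the same `to_csv` call to a toy DataFrame with a non-contiguous index and reads the lines back. The file name and rows are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame whose index (7, 3) mimics rows left over after a split.
toy = pd.DataFrame(
    {'image_name': ['train_7.jpg', 'train_3.jpg'], 'tags': [1, 0]},
    index=[7, 3])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.lst')
    # Same call as in write_lst: renumber rows, then write index, label, path.
    toy.reset_index(drop=True).to_csv(
        path, sep='\t', columns=['tags', 'image_name'], header=False)
    with open(path) as f:
        lines = f.read().splitlines()

for line in lines:
    print(repr(line))
```

Each line holds three tab-separated fields: the new row number, the numeric label, and the image path.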

In [22]:
write_lst(train, os.path.join(DATA_DIR, 'train/clouds-binary.lst'))
write_lst(valid, os.path.join(DATA_DIR, 'valid/clouds-binary.lst'))
write_lst(sample, os.path.join(DATA_DIR, 'sample/clouds-binary.lst'))

Finally, create the .rec files using the im2rec.py utility. From the directory containing the data (cloud-classif in our case):

```$ python ../../util/im2rec.py train/clouds-binary.lst jpg/ --exts .jpg```
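
A missing image is easier to catch before im2rec.py runs. The sketch below lists the paths in a .lst file that do not exist under the image root, assuming im2rec.py resolves each listed path relative to the root directory given as its second argument. `missing_images` is a hypothetical helper, not part of im2rec.py.

```python
import csv
import os


def missing_images(lst_path, root):
    """Return the relative paths listed in lst_path that are absent under root."""
    missing = []
    with open(lst_path, newline='') as f:
        for row in csv.reader(f, delimiter='\t'):
            if not row:
                continue  # skip blank lines
            rel_path = row[-1]  # the last column holds the image path
            if not os.path.isfile(os.path.join(root, rel_path)):
                missing.append(rel_path)
    return missing
```

For the training list this could be called as `missing_images('train/clouds-binary.lst', 'jpg/')` from the data directory; an empty list means every image was found.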


## Test data preparation

In [2]:
import glob
import os

In [3]:
DATA_DIR = '../data/cloud-classif/'
IMG_DIR = os.path.join(DATA_DIR, 's2test/')
EXT = '*.bmp'

Find all images in the directory

In [8]:
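def list_images(basedir, imgdir, ext):
    """List images matching ext in basedir/imgdir, as paths relative to basedir."""
    pattern = os.path.join(basedir, imgdir) + ext  # e.g. .../s2test/clear/*.bmp
    # Keep only 'imgdir/filename', so the paths in the .lst are relative to the image root
    return [os.path.join(imgdir, os.path.basename(x)) for x in glob.glob(pattern)]
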
clear = pd.DataFrame({'image_name':list_images(IMG_DIR, 'clear/', EXT),
                      'tags':cloud_labels['clear']})
haze  = pd.DataFrame({'image_name':list_images(IMG_DIR, 'haze/',  EXT),
                      'tags':cloud_labels['haze']})
partly = pd.DataFrame({'image_name':list_images(IMG_DIR, 'partly_cloudy/', EXT),
                       'tags':cloud_labels['partly_cloudy']})
cloudy = pd.DataFrame({'image_name':list_images(IMG_DIR, 'cloudy/', EXT),
                       'tags':cloud_labels['cloudy']})

In [9]:
partly.head()

Unnamed: 0,image_name,tags
0,partly_cloudy/539070_6816538_541630_6819098_sa...,1
1,partly_cloudy/536510_6829338_539070_6831898_sa...,1
2,partly_cloudy/539070_6826778_541630_6829338_sa...,1
3,partly_cloudy/536510_6824218_539070_6826778_sa...,1
4,partly_cloudy/559550_6829338_562110_6831898_sa...,1


In [8]:
df = pd.concat([clear, haze, partly, cloudy])
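
Note that `pd.concat` keeps each frame's own index, so the combined frame contains repeated index values. This is one more reason why renumbering the rows before writing the .lst file matters. A small sketch with made-up file names:

```python
import pandas as pd

# Hypothetical per-class frames, built the same way as above.
a = pd.DataFrame({'image_name': ['clear/x.bmp', 'clear/y.bmp'], 'tags': 0})
b = pd.DataFrame({'image_name': ['haze/z.bmp'], 'tags': 1})

combined = pd.concat([a, b])
print(combined.index.tolist())    # index values repeat across the two frames

# Renumbering gives each record a unique index.
renumbered = combined.reset_index(drop=True)
print(renumbered.index.tolist())
```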

In [11]:
write_lst(df, os.path.join(DATA_DIR, 'test/s2test-binary.lst'))

Again, create the .rec files using the im2rec.py utility. From the directory containing the data (cloud-classif in our case):

```$ python ../../util/im2rec.py test/s2test-binary.lst s2test/ --exts .bmp```