# Prepare Sentinel-2 RGB chips for Macro-localization Deep Learning Model Build

This notebook prepares Sentinel-2 RGB image chips for training a classifier and stores them on AWS S3. This version processes the data to fit a multiclass model with three classes: cement, steel, and landcover.

## Import Libraries

In [None]:
import os
import shutil
from pathlib import Path

import boto3
import numpy as np
import pandas as pd

import rasterio
import sklearn.model_selection

## Download .tar Files From S3 Bucket and Extract Contents

In [None]:
CURRENT_DIRECTORY = os.getcwd()
AWS_SOURCE_PATH = 'S2-RGB-macro-localization-model-build4'

TARGET_PATH = '/scratch/ALD_S2_RGB_chips_v4p1_train4'

IMG_DIRS = (
    ('ALD_S2_RGB_landcover_chips_v4p1_2020_train4', 'landcover'),
    ('ALD_S2_RGB_cement_chips_v4p1_2020_train4', 'cement'),
    ('ALD_S2_RGB_steel_chips_v4p1_2020_train4', 'steel')
)

!mkdir -p {TARGET_PATH}

In [None]:
s3 = boto3.resource('s3')
bucket = s3.Bucket('sfi-shared-assets')

for source_file, _ in IMG_DIRS:
    bucket.download_file(str(Path(AWS_SOURCE_PATH, source_file+'.tar')), 
                         str(Path(TARGET_PATH, source_file + '.tar')))

In [None]:
for source_file, _ in IMG_DIRS:
    !cd {TARGET_PATH} && tar xf {str(Path(source_file + '.tar'))}

## Partition the Data Using Stratified Random Sampling

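To help address limited sample sizes (particularly for steel plant imagery), we partition the data using stratified random sampling, so the training and validation subsets keep the same class proportions as the full dataset.

* This step assigns each chip to a subset and defines the PNG filenames to place in the train/ and validate/ folders.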
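
To illustrate what stratification guarantees, here is a minimal standard-library sketch. It is not scikit-learn's implementation: the `stratified_split` helper and the label counts are hypothetical, chosen only to show that a rare class such as steel is held out at the same rate as the common ones.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx), holding out about `test_size` of each class."""
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one sample per class so validation covers every class.
        n_val = max(1, round(len(indices) * test_size))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical, deliberately imbalanced labels (steel is the rare class).
labels = ['landcover'] * 100 + ['cement'] * 40 + ['steel'] * 10
train_idx, val_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print(train_counts)
print(val_counts)
```

With these counts, each class contributes 20% of its chips to validation, whereas a plain random split could leave very few steel chips in one of the subsets.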

In [None]:
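# List every extracted GeoTIFF chip.
image_list = ! find {TARGET_PATH} | grep tif$
# The class name is the fifth underscore-separated token from the end of each path.
class_assignments = [f.split('_')[-5] for f in image_list]

# Only the first split is used, so a single stratified 80/20 split suffices.
splitter = sklearn.model_selection.StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(class_assignments, class_assignments))
train_set = set(train_idx)  # set lookup avoids scanning the index array for every chip
subset_assignments = ['train' if i in train_set else 'validate' for i in range(len(image_list))]

# Swap only the file extension; str.replace would also alter 'tif' elsewhere in the name.
output_pngs = [Path(f).with_suffix('.png').name for f in image_list]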

In [None]:
for image_class in np.unique(class_assignments):
    for subset in np.unique(subset_assignments):
        !mkdir -p {TARGET_PATH}/{subset}/{image_class}
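
The same directory layout can also be created without shelling out. The sketch below uses `pathlib` and a temporary directory as a stand-in for `TARGET_PATH`; the class and subset names are taken from this notebook, but the temporary location is an assumption so the example does not touch /scratch.

```python
import tempfile
from pathlib import Path

# Stand-in for TARGET_PATH (assumption: a throwaway temporary directory).
target = Path(tempfile.mkdtemp())

classes = ['cement', 'landcover', 'steel']
subsets = ['train', 'validate']

for subset in subsets:
    for image_class in classes:
        # parents=True mirrors `mkdir -p`; exist_ok=True makes reruns harmless.
        (target / subset / image_class).mkdir(parents=True, exist_ok=True)

# Collect the created folders as POSIX-style relative paths.
created = sorted(p.relative_to(target).as_posix() for p in target.glob('*/*'))
print(created)
```

The result is one folder per subset, each containing one folder per class, which is the layout the conversion step writes into.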

## Convert GeoTIFF to PNG

Fastai appears to require image formats other than TIFF, so each GeoTIFF chip is converted to PNG.

In [None]:
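def convert_image(tif_filename, png_filename):
    """Copy all bands of a GeoTIFF chip into a PNG file."""
    with rasterio.open(tif_filename) as infile:
        # Reuse the source metadata, changing only the output driver.
        profile = infile.profile
        profile['driver'] = 'PNG'

        # Read every band as a (bands, rows, cols) array.
        raster = infile.read()

        with rasterio.open(png_filename, 'w', **profile) as dst:
            dst.write(raster)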

Convert each image only if its corresponding target file does not already exist.

In [None]:
for image_file, class_assignment, subset_assignment, png_file in zip(
        image_list, class_assignments, subset_assignments, output_pngs):
    png_path = Path(TARGET_PATH, subset_assignment, class_assignment, png_file)
    # Skip chips that were already converted on a previous run.
    if not png_path.exists():
        convert_image(image_file, png_path)

## Write Out Record of Training/Validation Chips

In [None]:
train_record_pdf = pd.DataFrame({"file": output_pngs,
                                 "class": class_assignments,
                                 "subset": subset_assignments})

In [None]:
train_record_pdf.to_csv("../../resources/macro-loc-model-build4/"+TARGET_PATH.split('/')[-1]+"_record.csv",
                       index=False)
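
As a quick sanity check on a record of this shape, the chips can be tallied per subset and class. The sketch below reads a small in-memory CSV with the same three columns (`file`, `class`, `subset`); the rows are hypothetical placeholders, not real chips from this dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical record with the same columns as the CSV written above.
record_csv = """file,class,subset
chip_a.png,steel,train
chip_b.png,steel,train
chip_c.png,steel,validate
chip_d.png,cement,train
chip_e.png,landcover,validate
"""

rows = list(csv.DictReader(io.StringIO(record_csv)))

# Count chips for each (subset, class) pair.
counts = Counter((row['subset'], row['class']) for row in rows)
for (subset, image_class), n in sorted(counts.items()):
    print(subset, image_class, n)
```

A class that is missing from either subset would show up here as an absent row, which is worth catching before training.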

## Tar Files and Upload to S3 Bucket

In [None]:
for source_file, _ in IMG_DIRS:
    shutil.rmtree(TARGET_PATH+'/'+source_file)
    os.remove(TARGET_PATH+'/'+source_file+'.tar')

In [None]:
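# Archive the chip directory relative to its parent; the .tar lands in the current working directory.
target_name = Path(TARGET_PATH).name
unix_code = f'tar -C {Path(TARGET_PATH).parent} -cvf {target_name}.tar {target_name}'
os.system(unix_code)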
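
An alternative that avoids building a shell command is Python's standard `tarfile` module. The sketch below is an illustration under assumptions: the directory name `chips_example` and the placeholder file are hypothetical, and the archive is written next to the directory rather than to the current working directory.

```python
import tarfile
import tempfile
from pathlib import Path

# Hypothetical stand-ins for /scratch and the chip directory.
parent = Path(tempfile.mkdtemp())
chip_dir = parent / 'chips_example'
(chip_dir / 'train' / 'steel').mkdir(parents=True)
(chip_dir / 'train' / 'steel' / 'chip_0.png').write_bytes(b'placeholder')

archive = parent / (chip_dir.name + '.tar')
with tarfile.open(archive, 'w') as tar:
    # arcname stores paths relative to the parent, like `tar -C <parent>`.
    tar.add(chip_dir, arcname=chip_dir.name)

# List the archive members to confirm the relative layout.
with tarfile.open(archive) as tar:
    names = tar.getnames()
print(names)
```

Unlike `os.system`, which only returns an exit status, `tarfile` raises an exception if the archive cannot be written, so a failed step does not pass silently.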

In [None]:
bucket.upload_file(TARGET_PATH.split('/')[-1]+'.tar', 
                   AWS_SOURCE_PATH+'/'+TARGET_PATH.split('/')[-1]+'.tar')

## Clean up Temporary Files

In [None]:
shutil.rmtree(TARGET_PATH)
os.remove(TARGET_PATH.split('/')[-1]+'.tar')