# Sentinel-2 Image Processing 

This notebook generates the training dataset from Sentinel-2 imagery and labeled settlement masks.

### About Informal Settlement Dataset
The Informal Settlement Dataset was received from iMMAP on March 5, 2020. It contains ground-validated locations of informal migrant settlements in northern Colombia. From these locations, we generated ground-truth polygons of the informal settlements through visual interpretation. This notebook converts the resulting vector files (GeoPackage, `.gpkg`) to raster masks.

### About Sentinel-2 Imagery

SENTINEL-2 is a wide-swath, high-resolution, multi-spectral imaging mission, supporting Copernicus Land Monitoring studies, including the monitoring of vegetation, soil and water cover, as well as observation of inland waterways and coastal areas ([Source](https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi/overview)). 

**Note**:
- For 2016 and 2017, we obtained Level-1C (top-of-atmosphere reflectance) Sentinel-2 imagery.
- For 2018 to 2020, we obtained Level-2A (bottom-of-atmosphere reflectance) Sentinel-2 imagery.

## Imports and Setup

In [1]:
import os
import operator
from tqdm import tqdm
import pandas as pd
import numpy as np
pd.set_option('use_inf_as_na', True)

import geopandas as gpd
import rasterio as rio

import sys
sys.path.insert(0, '../utils')
import geoutils

import logging
import warnings
logging.getLogger().setLevel(logging.ERROR)
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

## File Locations

In [2]:
data_dir = "../data/"
pos_mask_dir = data_dir + 'pos_masks/'
neg_mask_dir = data_dir + 'neg_masks/'
sentinel_dir = data_dir + 'sentinel2/'

# Create the working directories if they do not exist yet.
for directory in (data_dir, sentinel_dir, pos_mask_dir, neg_mask_dir):
    os.makedirs(directory, exist_ok=True)

areas = ['maicao', 'riohacha', 'uribia']

## Download Files from GCS

In [3]:
!gsutil -q -m cp gs://immap-gee/DEFLATED_gee_*.tif {sentinel_dir}
!gsutil -q -m cp gs://immap-gee/CROPPED_gee_*.tif {sentinel_dir}
!gsutil -q -m cp gs://immap-masks/informal_settlement_masks/*.gpkg {pos_mask_dir}
!gsutil -q -m cp gs://immap-masks/negative_sample_masks/*.gpkg {neg_mask_dir}
!gsutil -q -m cp gs://immap-masks/admin_boundaries/admin_bounds.gpkg {data_dir} 
print('Operations completed.')

Operations completed.


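## Generate TIFF Files for Indices
The following cell writes the spectral index rasters for each area. It uses `area_dict`, which is defined in the Area Filepath Dictionary cell below, so run that cell first.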

In [3]:
for area in areas:
    geoutils.write_indices(area_dict, area)
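
`geoutils.write_indices` is not shown in this notebook. As a rough illustration of what an index band is, the sketch below computes two of the indices that appear later as dataset columns (`savi`, `mndwi`) from reflectance arrays, using their standard definitions. The function names, the band roles (B8 as NIR, B4 as red, B3 as green, B11 as SWIR), the soil factor of 0.5, and the sample values are assumptions for illustration, not necessarily what `geoutils` does.

```python
import numpy as np


def savi(nir, red, soil_factor=0.5):
    """Soil-Adjusted Vegetation Index: (NIR - Red) / (NIR + Red + L) * (1 + L)."""
    nir = np.asarray(nir, dtype="float64")
    red = np.asarray(red, dtype="float64")
    return (nir - red) / (nir + red + soil_factor) * (1.0 + soil_factor)


def mndwi(green, swir):
    """Modified NDWI: (Green - SWIR) / (Green + SWIR), NaN where the sum is zero."""
    green = np.asarray(green, dtype="float64")
    swir = np.asarray(swir, dtype="float64")
    total = green + swir
    out = np.full(total.shape, np.nan)
    # Only divide where the denominator is non-zero; other pixels stay NaN.
    np.divide(green - swir, total, out=out, where=total != 0)
    return out


# Hypothetical reflectance values for a 2 x 2 patch (not taken from the imagery).
b3_green = np.array([[0.14, 0.15], [0.00, 0.16]])
b4_red = np.array([[0.16, 0.18], [0.00, 0.17]])
b8_nir = np.array([[0.25, 0.27], [0.00, 0.26]])
b11_swir = np.array([[0.30, 0.31], [0.00, 0.29]])

savi_band = savi(b8_nir, b4_red)
mndwi_band = mndwi(b3_green, b11_swir)
```

Leaving zero-denominator pixels as NaN keeps them distinguishable from valid index values of 0.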

## Area Filepath Dictionary
The following cell returns a dictionary containing the image filepaths for each area.

In [5]:
area_dict = geoutils.get_filepaths(areas, sentinel_dir, pos_mask_dir, neg_mask_dir)
print("Image filepaths for Maicao:")
area_dict['maicao']

Image filepaths for Maicao:


{'pos_mask_gpkg': '../data/pos_masks/maicao_mask.gpkg',
 'neg_mask_gpkg': '../data/neg_masks/maicao-samples.gpkg',
 'images': ['../data/sentinel2/DEFLATED_gee_maicao_2016.tif',
  '../data/sentinel2/DEFLATED_gee_maicao_2017.tif',
  '../data/sentinel2/DEFLATED_gee_maicao_2018.tif',
  '../data/sentinel2/DEFLATED_gee_maicao_2019.tif',
  '../data/sentinel2/DEFLATED_gee_maicao_2020.tif'],
 'images_cropped': ['../data/sentinel2/CROPPED_gee_maicao_2016.tif',
  '../data/sentinel2/CROPPED_gee_maicao_2017.tif',
  '../data/sentinel2/CROPPED_gee_maicao_2018.tif',
  '../data/sentinel2/CROPPED_gee_maicao_2019.tif',
  '../data/sentinel2/CROPPED_gee_maicao_2020.tif'],
 'indices_cropped': ['../data/sentinel2/CROPPED_INDICES_gee_maicao_2016.tif',
  '../data/sentinel2/CROPPED_INDICES_gee_maicao_2017.tif',
  '../data/sentinel2/CROPPED_INDICES_gee_maicao_2018.tif',
  '../data/sentinel2/CROPPED_INDICES_gee_maicao_2019.tif',
  '../data/sentinel2/CROPPED_INDICES_gee_maicao_2020.tif']}
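
The implementation of `geoutils.get_filepaths` is not shown. A minimal sketch of how such a dictionary could be assembled from the file naming patterns visible in the output above is given below; `build_filepaths` and `_find` are hypothetical names, and the demo runs against empty placeholder files in a temporary directory rather than the real imagery.

```python
import glob
import os
import tempfile


def _find(sentinel_dir, prefix, area):
    """Return sorted paths matching <prefix>_gee_<area>_<year>.tif."""
    pattern = os.path.join(sentinel_dir, "{}_gee_{}_*.tif".format(prefix, area))
    return sorted(glob.glob(pattern))  # sorted so the years come out in order


def build_filepaths(areas, sentinel_dir, pos_mask_dir, neg_mask_dir):
    """Collect mask and image paths for each area."""
    area_dict = {}
    for area in areas:
        area_dict[area] = {
            "pos_mask_gpkg": os.path.join(pos_mask_dir, "{}_mask.gpkg".format(area)),
            "neg_mask_gpkg": os.path.join(neg_mask_dir, "{}-samples.gpkg".format(area)),
            "images": _find(sentinel_dir, "DEFLATED", area),
            "images_cropped": _find(sentinel_dir, "CROPPED", area),
            "indices_cropped": _find(sentinel_dir, "CROPPED_INDICES", area),
        }
    return area_dict


# Demo with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in [
        "DEFLATED_gee_maicao_2017.tif",
        "DEFLATED_gee_maicao_2016.tif",
        "CROPPED_gee_maicao_2016.tif",
        "CROPPED_INDICES_gee_maicao_2016.tif",
    ]:
        open(os.path.join(tmp, name), "w").close()
    demo = build_filepaths(["maicao"], tmp, "pos_masks", "neg_masks")
    image_names = [os.path.basename(p) for p in demo["maicao"]["images"]]
```

Because the `CROPPED` pattern requires `_gee_` directly after the prefix, it does not pick up the `CROPPED_INDICES` files.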

## Generate Target Raster Masks
The following cells rasterize the vector GPKG files into TIFF masks for both the positive samples (new informal settlements) and the negative samples (formal settlements and unoccupied land).

### Positive Labels: Informal Settlements

In [4]:
area_dict = geoutils.get_pos_raster_mask(area_dict)
for area in areas:
    print("Raster filepath for {}: {}".format(area, area_dict[area]['pos_mask_tiff']))

Raster filepath for maicao: ../data/pos_masks/maicao_mask.tiff
Raster filepath for riohacha: ../data/pos_masks/riohacha_mask.tiff
Raster filepath for uribia: ../data/pos_masks/uribia_mask.tiff


### Negative Labels: Formal Settlements and Unoccupied Land

In [5]:
area_dict, target_dict = geoutils.get_neg_raster_mask(area_dict)
print("Target value codes: {}".format(target_dict))
for area in areas:
    print("Raster filepath for {}: {}".format(area, area_dict[area]['neg_mask_tiff']))

Target value codes: {'formal settlement': 2, 'unoccupied land': 3, 'informal settlement': 1}
Raster filepath for maicao: ../data/neg_masks/maicao-samples.tiff
Raster filepath for riohacha: ../data/neg_masks/riohacha-samples.tiff
Raster filepath for uribia: ../data/neg_masks/uribia-samples.tiff


## Generate Training Set

In [8]:
data, area_code = geoutils.generate_training_data(area_dict)
print('Area code: {}'.format(area_code))
print('Data dimensions: {}'.format(data.shape))
data.head(3)

Reading maicao...
100%|██████████| 5/5 [01:26<00:00, 17.34s/it]

Reading riohacha...
100%|██████████| 5/5 [02:27<00:00, 29.43s/it]

Reading uribia...
100%|██████████| 5/5 [00:18<00:00,  3.60s/it]

Area code: {'maicao': 0, 'riohacha': 1, 'uribia': 2}
Data dimensions: (74436996, 112)


|  | B1_2016 | B2_2016 | B3_2016 | B4_2016 | B5_2016 | B6_2016 | B7_2016 | B8_2016 | B9_2016 | B10_2016 | ... | savi_2020 | mndwi_2020 | ui_2020 | nbi_2020 | brba_2020 | nbai_2020 | mbi_2020 | baei_2020 | target | area |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11131 | 0.1492 | 0.1331 | 0.1373 | 0.1587 | 0.1735 | 0.2247 | 0.2654 | 0.2486 | 0.3075 | 0.0377 | ... | 0.189059 | -0.535041 | 0.132327 | 0.202685 | 0.397793 | -0.735885 | -0.043378 | 0.940504 | 0 | 0 |
| 16695 | 0.1492 | 0.1359 | 0.1543 | 0.1761 | 0.1799 | 0.2381 | 0.2838 | 0.2657 | 0.3186 | 0.0377 | ... | 0.173231 | -0.486839 | 0.113795 | 0.215935 | 0.454976 | -0.71221 | -0.04349 | 0.96 | 0 | 0 |
| 16696 | 0.1492 | 0.1395 | 0.1514 | 0.1644 | 0.1799 | 0.2381 | 0.2838 | 0.2652 | 0.3186 | 0.0377 | ... | 0.197578 | -0.508728 | 0.113795 | 0.198205 | 0.417619 | -0.726255 | -0.049508 | 0.945952 | 0 | 0 |
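
`geoutils.generate_training_data` is not shown. The column names in the output (`B1_2016`, ..., `baei_2020`) suggest that each row is one pixel and each column is one band or index for one year. One plausible way to build such a table, sketched below with a hypothetical `stack_to_frame` helper and tiny synthetic arrays, is to flatten each year's raster stack and concatenate the years side by side.

```python
import numpy as np
import pandas as pd


def stack_to_frame(stack, band_names, year):
    """Flatten a (bands, height, width) array into one row per pixel."""
    n_bands, height, width = stack.shape
    # Collapse the two spatial axes, then transpose so each row is a pixel.
    pixels = stack.reshape(n_bands, height * width).T
    columns = ["{}_{}".format(name, year) for name in band_names]
    return pd.DataFrame(pixels, columns=columns)


# Hypothetical 2-band, 2 x 3 rasters for two years.
stack_2016 = np.arange(12, dtype="float64").reshape(2, 2, 3)
stack_2017 = stack_2016 + 100

frames = [
    stack_to_frame(stack_2016, ["B1", "B2"], 2016),
    stack_to_frame(stack_2017, ["B1", "B2"], 2017),
]
# Years go side by side; rows stay aligned because every frame uses the same pixel order.
table = pd.concat(frames, axis=1)
```

In this layout the row index is the flattened pixel position (`row * width + column`), so it can be used to map a row back to its location in the raster.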


## Save and Upload Final Dataset

In [9]:
output_file = data_dir + '20200326_dataset.csv'
# Drop rows with target 0, i.e. pixels that carry none of the label codes (1, 2, 3).
data = data[data['target'] != 0]
data.to_csv(output_file, index=False)
print('Data dimensions: {}'.format(data.shape))
data.head(3)

Data dimensions: (334524, 112)


|  | B1_2016 | B2_2016 | B3_2016 | B4_2016 | B5_2016 | B6_2016 | B7_2016 | B8_2016 | B9_2016 | B10_2016 | ... | savi_2020 | mndwi_2020 | ui_2020 | nbi_2020 | brba_2020 | nbai_2020 | mbi_2020 | baei_2020 | target | area |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1133983 | 0.1597 | 0.13735 | 0.1531 | 0.187 | 0.209 | 0.2632 | 0.30515 | 0.26965 | 0.3327 | 0.0411 | ... | 0.187614 | -0.509745 | 0.105128 | 0.239614 | 0.449106 | -0.718433 | -0.042537 | 0.901237 | 3 | 0 |
| 1133984 | 0.1597 | 0.13905 | 0.1454 | 0.17845 | 0.209 | 0.2632 | 0.30515 | 0.26395 | 0.3327 | 0.0411 | ... | 0.177058 | -0.507485 | 0.105128 | 0.247826 | 0.464498 | -0.716955 | -0.03976 | 0.91149 | 3 | 0 |
| 1133985 | 0.16675 | 0.14875 | 0.1589 | 0.18605 | 0.2258 | 0.27945 | 0.3207 | 0.28085 | 0.3452 | 0.0416 | ... | 0.179191 | -0.524371 | 0.073259 | 0.262348 | 0.446475 | -0.722188 | -0.033995 | 0.875915 | 3 | 0 |
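
The filtering step above keeps only pixels with a non-zero `target`. As a small illustration on a made-up table, the sketch below applies the same filter and then maps the codes back to class names using the target value codes printed earlier; treating 0 as "no label" is an assumption based on those codes, and the pixel values are hypothetical.

```python
import pandas as pd

# Codes as printed by the negative-label cell; 0 is assumed to mean "no label".
target_dict = {"formal settlement": 2, "unoccupied land": 3, "informal settlement": 1}
code_to_name = {code: name for name, code in target_dict.items()}

# Hypothetical miniature pixel table.
pixels = pd.DataFrame(
    {
        "B1_2016": [0.15, 0.16, 0.17, 0.18, 0.19],
        "target": [0, 3, 1, 0, 2],
    }
)

# Same filter as in the notebook: drop pixels with target 0.
labeled = pixels[pixels["target"] != 0]

# Count how many labeled pixels fall into each named class.
class_counts = labeled["target"].map(code_to_name).value_counts()
```

Checking `class_counts` before saving is a quick way to see how balanced the three classes are.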


In [10]:
!gsutil -m cp {output_file} gs://immap-training/

Copying file://../data/20200326_dataset.csv [Content-Type=text/csv]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

- [1/1 files][319.1 MiB/319.1 MiB] 100% Done                                    
Operation completed over 1 objects/319.1 MiB.                                    
