# Summary of this notebook

This notebook is adapted from the [repository](https://github.com/sustainlab-group/africa_poverty) of Yeh et al. (2020).  In it, we download from Google Earth Engine the satellite images that correspond to the latitude and longitude coordinates in the DHS (and LSMS) surveys.  To limit occlusion by clouds, each downloaded image is a composite of 3 years' worth of satellite images.  For each pixel in the image, the composite takes the median of the values that pixel attains over the 3-year period.
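
As a rough illustration of why a per-pixel median suppresses clouds, here is a minimal pure-Python sketch on toy data. The function name `median_composite` and the toy pixel values are assumptions made for this example; the actual pipeline runs on Earth Engine (`imgcol.median()`) and also masks pixels not flagged as clear before taking the median.

```python
from statistics import median

def median_composite(images):
    """Per-pixel median over a list of equally sized 2-D images (lists of rows)."""
    height, width = len(images[0]), len(images[0][0])
    return [
        [median(img[r][c] for img in images) for c in range(width)]
        for r in range(height)
    ]

# Three toy 2x2 "images" of the same location taken at different times.
# The value 255 stands in for a bright cloud covering that pixel.
images = [
    [[10, 20], [30, 255]],
    [[12, 255], [28, 40]],
    [[11, 22], [255, 42]],
]

composite = median_composite(images)
print(composite)  # [[11, 22], [30, 42]]
```

Each pixel is cloudy in at most one of the three toy images, so the median never picks the cloud value.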

In [None]:
#If using Google Colab and Google Drive, run the following commands

#from google.colab import drive
#drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#REPLACE THIS COMMAND WITH THE APPROPRIATE PATH TO THE "code" FOLDER ON YOUR GOOGLE DRIVE
# %cd ./drive/MyDrive/poverty_project/group_project/code

/content/drive/MyDrive/poverty_project/africa_poverty-master/download


# Steps (Borrowed from Yeh et al.)

## Pre-requisites
Register your Google account for Earth Engine at [https://code.earthengine.google.com](https://code.earthengine.google.com). Approval may take a couple of days. Without registration, the `ee.Initialize()` command below will raise an error.

## Instructions

This notebook exports Landsat satellite image composites of DHS and LSMS clusters from Google Earth Engine.

The images are saved in gzipped TFRecord format. By default, this notebook exports images to Google Drive. If you instead prefer to export images to Google Cloud Storage (GCS), change the `EXPORT` constant below to `'gcs'` and set `BUCKET` to the desired GCS bucket name.


|      | Storage  | Expected Export Time
|------|----------|---------------------
| DHS  | ~16.0 GB | ~24h
| LSMS |  ~2.5 GB | ~10h

The exported images take up a significant amount of storage space. Before exporting, make sure you have enough storage space. The images are exported to the following locations, based on the constants `EXPORT` and `BUCKET` defined below:

|      | Google Drive (default) | GCS
|------|:-----------------------|:---
| DHS  | `dhs_tfrecords_raw/`   | `{BUCKET}/dhs_tfrecords_raw/`
| LSMS | `lsms_tfrecords_raw/`  | `{BUCKET}/lsms_tfrecords_raw/`
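
The table above can be read as a simple mapping from the two constants to a destination path. The sketch below spells that mapping out; `export_destination` is a hypothetical helper written for this illustration and is not part of `ee_utils`.

```python
from typing import Optional

def export_destination(export: str, bucket: Optional[str], folder: str) -> str:
    """Returns the destination path implied by the table above.

    Hypothetical helper for illustration only; not part of ee_utils.
    """
    if export == 'drive':
        return f'{folder}/'
    if export == 'gcs':
        if not bucket:
            raise ValueError("BUCKET must be set when EXPORT is 'gcs'")
        return f'{bucket}/{folder}/'
    raise ValueError(f'unknown EXPORT value: {export!r}')

print(export_destination('drive', None, 'dhs_tfrecords_raw'))
print(export_destination('gcs', 'mybucket', 'lsms_tfrecords_raw'))
```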

Once the images have finished exporting, download the exported TFRecord files to the following folders:

- DHS: `data/dhs_tfrecords_raw/`
- LSMS: `data/lsms_tfrecords_raw/`

The folder structure should look as follows:

```
data/
    dhs_tfrecords_raw/
        angola_2011_00.tfrecord.gz
        ...
        zimbabwe_2015_00.tfrecord.gz
    lsms_tfrecords_raw/
        ethiopia_2011_00.tfrecord.gz
        ...
        uganda_2013_00.tfrecord.gz
```
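
The file names above follow the pattern `{country}_{year}_{chunk}.tfrecord.gz`. After downloading, it can be handy to recover the survey and chunk index from each name, for example to check that no chunk is missing. The following is a minimal sketch; `parse_tfrecord_name` is a hypothetical helper, and it assumes the exported files keep exactly the naming shown in the folder listing.

```python
from typing import Tuple

def parse_tfrecord_name(filename: str) -> Tuple[str, int, int]:
    """Splits '<country>_<year>_<chunk>.tfrecord.gz' into its three parts.

    Hypothetical helper for illustration only.
    """
    suffix = '.tfrecord.gz'
    if not filename.endswith(suffix):
        raise ValueError(f'not a gzipped TFRecord file: {filename}')
    stem = filename[:-len(suffix)]
    # Split from the right so that country names containing underscores survive.
    country, year, chunk = stem.rsplit('_', 2)
    return country, int(year), int(chunk)

print(parse_tfrecord_name('angola_2011_00.tfrecord.gz'))  # ('angola', 2011, 0)
```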

## Imports and Constants

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import math
from typing import Any, Dict, Optional, Tuple

import ee
import pandas as pd

import ee_utils

Before using the Earth Engine API, you must perform a one-time authentication that authorizes access to Earth Engine on behalf of the Google account you registered at [https://code.earthengine.google.com](https://code.earthengine.google.com). The authentication process saves a credentials file to `$HOME/.config/earthengine/credentials` for future use.

The following command `ee.Authenticate()` runs the authentication process. Once you successfully authenticate, you may comment out this command because you should not need to authenticate again in the future, unless you delete the credentials file. If you do not authenticate, the subsequent `ee.Initialize()` command below will fail.

For more information, see [https://developers.google.com/earth-engine/python_install-conda.html](https://developers.google.com/earth-engine/python_install-conda.html).
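
Instead of commenting `ee.Authenticate()` in and out by hand, one option is to check whether the credentials file already exists. This is only a sketch: `needs_authentication` is a hypothetical helper, and it assumes the credentials live at the default path mentioned above.

```python
from pathlib import Path

# Default credentials location, as described above.
CREDENTIALS_PATH = Path.home() / '.config' / 'earthengine' / 'credentials'

def needs_authentication(credentials_path: Path = CREDENTIALS_PATH) -> bool:
    """Returns True if no saved Earth Engine credentials file is found.

    Hypothetical helper for illustration only.
    """
    return not credentials_path.is_file()

# Possible usage in the cell below:
# if needs_authentication():
#     ee.Authenticate()
```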

In [None]:
ee.Authenticate()

To authorize access needed by Earth Engine, open the following URL in a web browser and follow the instructions. If the web browser does not start automatically, please manually browse the URL below.

    https://code.earthengine.google.com/client-auth?scopes=https%3A//www.googleapis.com/auth/earthengine%20https%3A//www.googleapis.com/auth/devstorage.full_control&request_id=zcdfjuBJ7BTll76V41R8xQY6kbUoHmc1555y5Rc8x5c&tc=Mg4Bim_D2Zg2y1zKEDvwJoI12_lz-pa8PpEWC-IGSDI&cc=KJPEJD63UQHMzD3KbWtDZnVFSPwHna998hi7GUYcYOY

The authorization workflow will generate a code, which you should paste in the box below.
Enter verification code: 4/1AWtgzh4fT2idGhJaIrnKO-Bbgz3Om_F_JUCHx7X8hx7wwVLVb9PbxpU13p8

Successfully saved authorization token.


In [None]:
ee.Initialize()  # initialize the Earth Engine API

## Constants

In [None]:
# ========== ADAPT THESE PARAMETERS ==========

# To export to Google Drive (the default), keep the next 2 lines uncommented
EXPORT = 'drive'
BUCKET = None

# To export to Google Cloud Storage (GCS), uncomment the next 2 lines
# and set the bucket to the desired bucket name
# EXPORT = 'gcs'
# BUCKET = 'mybucket'


# export location parameters

# The locations you should export to in order to test this notebook for yourself
DHS_EXPORT_FOLDER = '../data/dhs_tfrecords_raw/folder_for_you_to_replicate_our_downloads'
LSMS_EXPORT_FOLDER = '../data/lsms_tfrecords_raw/folder_for_you_to_replicate_our_downloads'

# Our original export folders
# DO NOT USE THESE UNLESS THEY ARE EMPTY.  Trying to download the same data
# twice can cause multiple .csv files containing the same data points.
#DHS_EXPORT_FOLDER = '../data/dhs_tfrecords_raw'
#LSMS_EXPORT_FOLDER = '../data/lsms_tfrecords_raw'


# Data download chunk sizes

#CHUNK_SIZE = None        # use this if there are no memory issues (there will be for some surveys)
CHUNK_SIZE = 50           # set to a small number (<= 50) if Google Earth Engine reports memory errors

In [None]:
# ========== DO NOT MODIFY THESE ==========

# input data paths
DHS_CSV_PATH = '../data/yeh_et_al/dhs_clusters.csv'
LSMS_CSV_PATH = '../data/yeh_et_al/lsms_clusters.csv'

# band names
MS_BANDS = ['BLUE', 'GREEN', 'RED', 'NIR', 'SWIR1', 'SWIR2', 'TEMP1']

# image export parameters
PROJECTION = 'EPSG:3857'  # see https://epsg.io/3857
SCALE = 30                # export resolution: 30m/px
EXPORT_TILE_RADIUS = 127  # image dimension = (2*EXPORT_TILE_RADIUS) + 1 = 255px
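
The arithmetic behind the image export parameters can be checked directly. The sketch below copies `SCALE` and `EXPORT_TILE_RADIUS` from the cell above and computes the tile width in pixels and its nominal side length. The side length is in projected `EPSG:3857` units, so it is only approximately the true ground distance.

```python
SCALE = 30                # meters per pixel, copied from the constants above
EXPORT_TILE_RADIUS = 127  # pixels on each side of the center pixel

tile_px = 2 * EXPORT_TILE_RADIUS + 1  # center pixel plus the radius on both sides
tile_m = tile_px * SCALE              # nominal side length in projected meters

print(f'{tile_px} px per side, about {tile_m / 1000:.2f} km nominal')
```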

## Export Images

In [None]:
def export_images(
        df: pd.DataFrame,
        country: str,
        year: int,
        export_folder: str,
        chunk_size: Optional[int] = None,
        ) -> Dict[Tuple[Any, ...], ee.batch.Task]:
    '''
    Args
    - df: pd.DataFrame, contains columns ['lat', 'lon', 'country', 'year']
    - country: str, together with `year` determines the survey to export
    - year: int, together with `country` determines the survey to export
    - export_folder: str, name of folder for export
    - chunk_size: int, optionally set a limit to the # of images exported per TFRecord file
        - set to a small number (<= 50) if Google Earth Engine reports memory errors

    Returns: dict, maps task name tuple (export_folder, country, year, chunk) to ee.batch.Task
    '''
    subset_df = df[(df['country'] == country) & (df['year'] == year)].reset_index(drop=True)
    if chunk_size is None:
        num_chunks = 1
        # Added to the original code: without chunking, the single chunk must
        # span the whole survey so that the slice computed below is valid
        chunk_size = len(subset_df)
    else:
        num_chunks = int(math.ceil(len(subset_df) / chunk_size))
    tasks = {}



    for i in range(num_chunks):
        chunk_slice = slice(i * chunk_size, (i+1) * chunk_size - 1)  # df.loc[] is inclusive
        fc = ee_utils.df_to_fc(subset_df.loc[chunk_slice, :])
        start_date, end_date = ee_utils.surveyyear_to_range(year)

        # create 3-year Landsat composite image
        roi = fc.geometry()
        imgcol = ee_utils.LandsatSR(roi, start_date=start_date, end_date=end_date).merged
        imgcol = imgcol.map(ee_utils.mask_qaclear).select(MS_BANDS)
        img = imgcol.median()

        # add nightlights, latitude, and longitude bands
        img = ee_utils.add_latlon(img)
        img = img.addBands(ee_utils.composite_nl(year))

        fname = f'{country}_{year}_{i:02d}'
        tasks[(export_folder, country, year, i)] = ee_utils.get_array_patches(
            img=img, scale=SCALE, ksize=EXPORT_TILE_RADIUS,
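            points=fc, export=EXPORT,  # honor the EXPORT constant instead of hard-coding 'drive'
            prefix=export_folder, fname=fname,
            bucket=BUCKET)             # honor the BUCKET constant (None for Google Drive)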
    return tasks
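
The chunking in `export_images` relies on `df.loc[]` slices being inclusive at both ends, which is why the slice stops at `(i+1) * chunk_size - 1`. The sketch below reproduces only the index arithmetic, without pandas or Earth Engine. `chunk_label_ranges` is a hypothetical helper; unlike the function above, it clamps the last chunk to the final row explicitly instead of relying on `.loc` to tolerate an out-of-range end label.

```python
import math
from typing import List, Optional, Tuple

def chunk_label_ranges(num_rows: int,
                       chunk_size: Optional[int] = None) -> List[Tuple[int, int]]:
    """Returns inclusive (first, last) row labels for each chunk.

    Hypothetical helper for illustration only.
    """
    if num_rows == 0:
        return []
    if chunk_size is None:
        chunk_size = num_rows  # a single chunk covering every row
    num_chunks = math.ceil(num_rows / chunk_size)
    return [
        (i * chunk_size, min((i + 1) * chunk_size, num_rows) - 1)
        for i in range(num_chunks)
    ]

print(chunk_label_ranges(120, 50))  # [(0, 49), (50, 99), (100, 119)]
print(chunk_label_ranges(7))        # [(0, 6)]
```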

In [None]:
dhs_df = pd.read_csv(DHS_CSV_PATH, float_precision='high', index_col=False)
dhs_surveys = list(dhs_df.groupby(['country', 'year']).groups.keys())

#If you only want specific DHS data (and no LSMS data), then uncomment the
#line below (specifying which countries/years you want) and then run
#this cell and the next cell, then skip the following two (LSMS) cells,
#and finally run the last cell of the notebook ("ee_utils.wait_on_tasks...")

#dhs_surveys = [('angola', 2011), ('ethiopia', 2010)]

In [None]:
#dhs_df = pd.read_csv(DHS_CSV_PATH, float_precision='high', index_col=False)
#dhs_surveys = list(dhs_df.groupby(['country', 'year']).groups.keys())


tasks = {}

for country, year in dhs_surveys:
    new_tasks = export_images(
        df=dhs_df, country=country, year=year,
        export_folder=DHS_EXPORT_FOLDER, chunk_size=CHUNK_SIZE)
    tasks.update(new_tasks)

In [None]:
lsms_df = pd.read_csv(LSMS_CSV_PATH, float_precision='high', index_col=False)
lsms_surveys = list(lsms_df.groupby(['country', 'year']).groups.keys())

#If you only want specific LSMS data (and no DHS data), then uncomment the
#two lines below (specifying which countries/years you want)
#and then run this cell and then the next two cells

#tasks = {}
#lsms_surveys = [('tanzania', 2012), ('uganda', 2013)]

tasks = {}
lsms_surveys = [('ethiopia', 2015)]

In [None]:
#lsms_df = pd.read_csv(LSMS_CSV_PATH, float_precision='high', index_col=False)
#lsms_surveys = list(lsms_df.groupby(['country', 'year']).groups.keys())

for country, year in lsms_surveys:
    new_tasks = export_images(
        df=lsms_df, country=country, year=year,
        export_folder=LSMS_EXPORT_FOLDER, chunk_size=CHUNK_SIZE)
    tasks.update(new_tasks)

Check on the status of each export task at [https://code.earthengine.google.com/](https://code.earthengine.google.com/), or run the following cell which checks every minute. Once all tasks have completed, download the DHS TFRecord files to `data/dhs_tfrecords_raw/` and LSMS TFRecord files to `data/lsms_tfrecords_raw/`.

In [None]:
ee_utils.wait_on_tasks(tasks, poll_interval=60)

  0%|          | 0/7 [00:00<?, ?it/s]

Task ('../data/lsms_tfrecords_raw', 'ethiopia', 2015, 0) finished in 0 min with state: COMPLETED
Task ('../data/lsms_tfrecords_raw', 'ethiopia', 2015, 1) finished in 1 min with state: COMPLETED
Task ('../data/lsms_tfrecords_raw', 'ethiopia', 2015, 2) finished in 0 min with state: COMPLETED
Task ('../data/lsms_tfrecords_raw', 'ethiopia', 2015, 3) finished in 1 min with state: COMPLETED
Task ('../data/lsms_tfrecords_raw', 'ethiopia', 2015, 4) finished in 0 min with state: COMPLETED
Task ('../data/lsms_tfrecords_raw', 'ethiopia', 2015, 6) finished in 0 min with state: COMPLETED
Task ('../data/lsms_tfrecords_raw', 'ethiopia', 2015, 5) finished in 27 min with state: ('FAILED', 'Execution failed; out of memory.')


Of all the satellite image data we attempted to download, the only export that failed was one of the seven chunks for Ethiopia's 2015 LSMS survey (about one-seventh of that survey's images), which ran out of memory.  We will not be using the LSMS data for the primary training and evaluation of our models, so this will not be a problem.
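
To make the failure rate explicit, the final states printed above can be tallied as follows. This is a sketch only: `summarize_failures` is a hypothetical helper, and the `final_states` dictionary is transcribed by hand from the output above rather than read from the `ee.batch.Task` objects.

```python
from typing import Dict, List, Tuple

TaskKey = Tuple[str, int, int]  # (country, year, chunk)

def summarize_failures(final_states: Dict[TaskKey, str]) -> Tuple[List[TaskKey], float]:
    """Returns the keys of tasks that did not complete and their share of all tasks.

    Hypothetical helper for illustration only.
    """
    if not final_states:
        return [], 0.0
    failed = sorted(k for k, state in final_states.items() if state != 'COMPLETED')
    return failed, len(failed) / len(final_states)

# Final states transcribed from the output above: chunk 5 failed, the rest completed.
final_states = {('ethiopia', 2015, i): 'COMPLETED' for i in range(7)}
final_states[('ethiopia', 2015, 5)] = 'FAILED'

failed, fraction = summarize_failures(final_states)
print(failed, f'{fraction:.2f}')  # [('ethiopia', 2015, 5)] 0.14
```

The failed keys could then be re-exported with a smaller `CHUNK_SIZE`, since the reported cause was an out-of-memory error.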

## What's next?

In the [next notebook](02_extract_images_and_data.ipynb), we extract and export the image data (and non-image data) contained in the `.tfrecord.gz` files we downloaded in this notebook.