# WASH Data Preprocessing

## About

The WASH **indicators** dataset was received in August 2020 from iMMAP Colombia in 2 forms:
1. Urban blocks - aggregated at the polygon level, collected from a national census.
2. Rural points - aggregated at the point level. The point locations are those of water points in the SIASAR dataset (data.siasar.org), and the aggregated statistics describe the communities in which these water points are located.

The **features** considered for modelling come in 4 types:
1. Satellite features - derived from images saved in Google Earth Engine, downloaded via 00_Data_Download.ipynb
2. POI features - raster surfaces of the distance to the nearest point of interest (POI), as extracted from OpenStreetMap
3. Urban area features - calculated from the urban area polygons provided by iMMAP (MGN_Urbano), which mark the urbanized portions of the map. These features include distance from the capital, outskirts, and urban area size.
4. Spatial lag features - derived from the other features by averaging, for each grid cell, the values of its neighboring cells
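
As an illustration of the spatial lag idea, here is a minimal sketch that averages the values of the cells surrounding each grid cell. It is not the project's implementation: the function name `spatial_lag`, the `(row, col)` indexing, and the choice of an 8-cell neighborhood (with an optional 4-cell variant) are all assumptions made for this example.

```python
from statistics import mean


def spatial_lag(values, include_diagonals=True):
    """Return {cell: mean of the neighboring cells' values}.

    `values` maps hypothetical (row, col) grid indices to a feature value.
    A cell with no neighbors present in `values` gets None.
    """
    # All 8 surrounding offsets, excluding the cell itself
    offsets = [
        (dr, dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    ]
    if not include_diagonals:
        # Keep only the 4 cells sharing an edge
        offsets = [(dr, dc) for dr, dc in offsets if dr == 0 or dc == 0]

    lagged = {}
    for row, col in values:
        neighbors = [
            values[(row + dr, col + dc)]
            for dr, dc in offsets
            if (row + dr, col + dc) in values
        ]
        lagged[(row, col)] = mean(neighbors) if neighbors else None
    return lagged


# Toy 2x2 grid of a made-up "population" feature
population = {(0, 0): 10.0, (0, 1): 20.0, (1, 0): 30.0, (1, 1): 40.0}
lag_population = spatial_lag(population)
print(lag_population[(0, 0)])  # mean of the three neighbors 20, 30 and 40
```

The lagged value deliberately excludes the cell's own value, so a column such as `lag_population` carries information about the surroundings rather than repeating `population`.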

This notebook collects key steps taken to produce the training dataset that:
* is aggregated to 1x1 sq km grid cells (from block-level data)
* has POI features as surfaces
* has features for urban area characteristics
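
The aggregation from blocks to 1x1 km grid cells can be sketched as below. This is a simplified illustration, not the logic in `geoutils`: it assumes each block is represented by a centroid in a metric projection (the `x_m` and `y_m` keys are hypothetical), assigns the block to the cell containing that centroid, and takes an unweighted mean of the indicator.

```python
from collections import defaultdict
from math import floor

CELL_SIZE_M = 1000  # 1 km cells, assuming coordinates are in metres


def aggregate_to_grid(blocks, value_key, cell_size=CELL_SIZE_M):
    """Average `value_key` over the blocks whose centroid falls in each cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for block in blocks:
        # Integer cell index from the (hypothetical) projected centroid
        cell = (floor(block["x_m"] / cell_size), floor(block["y_m"] / cell_size))
        sums[cell] += block[value_key]
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Made-up blocks: two centroids fall in cell (0, 0), one in cell (1, 0)
blocks = [
    {"x_m": 120.0, "y_m": 450.0, "perc_hh_no_toilet": 0.10},
    {"x_m": 880.0, "y_m": 30.0, "perc_hh_no_toilet": 0.30},
    {"x_m": 1500.0, "y_m": 200.0, "perc_hh_no_toilet": 0.50},
]
grid = aggregate_to_grid(blocks, "perc_hh_no_toilet")
print(grid)
```

An unweighted mean treats every block equally; weighting by the number of households per block would be a reasonable alternative when block sizes vary widely.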

## Imports and Setup

In [5]:
import pandas as pd
import pathlib
import sys
sys.path.insert(0, '../utils')

import geoutils
import bqutils
from settings import *

## File Locations

In [6]:
dirs = [feats_dir, inds_dir]
for dir_ in dirs:
    # create the directory (and any missing parents); no-op if it already exists
    pathlib.Path(dir_).mkdir(parents=True, exist_ok=True)

## Download Data From GCS

In [None]:
!gsutil cp gs://immap-wash-training/indicators/indicator_labelled_grid*.csv {inds_dir}
!gsutil cp gs://immap-wash-training/features/2018_{area}_*.tif {feats_dir}
!gsutil cp gs://immap-wash-training/features/urban_area_features.csv {feats_dir}
!gsutil cp gs://immap-wash-training/grid/grid_1x1km_wfeatures_lagged.csv {feats_dir}

## Generate Training Data

In [3]:
dfs = []
for urbanity in ['u', 'r']:
    gdf = geoutils.generate_indicator_labelled_grid(for_ = urbanity)
    
    # poi features - processing happens in BQ
    depts = get_depts()
    for poi in pois:
        print(f'Processing {poi}')
        geoutils.generate_poi_features_by_dept(poi)

    gdf = geoutils.generate_satellite_features(gdf)

    df = geoutils.generate_training_data(gdf)
    df['urbanity'] = urbanity
    dfs.append(df)

# ignore_index=True already yields a fresh 0..n-1 index, so no reset_index is needed
train_df = pd.concat(dfs, axis = 0, ignore_index = True)
print('Resulting shape: ' + str(train_df.shape))
print('Urban: ' + str(train_df.query("urbanity == 'u'").shape[0]))
print('Rural: ' + str(train_df.query("urbanity == 'r'").shape[0]))
train_df.head(2)

100%|██████████| 13/13 [00:27<00:00,  2.14s/it]
100%|██████████| 13/13 [00:14<00:00,  1.09s/it]


Resulting shape: (11644, 66)
Urban: 7574
Rural: 4070


| Unnamed: 0 | pixelated_urban_area_id | id | geometry | perc_hh_no_toilet | perc_hh_no_water_supply | perc_hh_no_sewage | d_mc_basur | d_mc_aguac | d_mc_freq_ | d_mc_pare | ... | lag_temperature | lag_nighttime_lights | lag_population | lag_elevation | lag_urban_index | lag_nearest_highway | nighttime_lights_area_mean | x | y | urbanity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 862.0 | 417475 | POLYGON((-75.5123828117681 5.05751500688412, -... | 0.018677 | 0.020431 | 0.030647 | 0.029925 | 0.150449 | 0.793726 | 0.221855 | ... | 14980.00001 | 47.137354 | 75.599635 | 2032.50001 | 31.37501 | 381.252504 | 23.124894 | -75.507891 | 5.062007 | u |
| 1 | 83.0 | 187318 | POLYGON((-76.4376475501431 7.23143798441016, -... | 0.190164 | 0.213115 | 0.209836 | 0.062295 | 0.409836 | 0.760656 | 0.501639 | ... | 15023.68751 | 0.48376 | 1.064785 | 175.37501 | 6.87501 | 727.6364 | 0.808125 | -76.433156 | 7.23593 | u |


## Upload to GCS

In [4]:
train_df.to_csv(data_dir + '20200916_dataset.csv', index = False)
!gsutil cp {data_dir}20200916_dataset.csv gs://immap-wash-training/training/

Copying file://../data/20200826_dataset.csv [Content-Type=text/csv]...
- [1 files][ 11.0 MiB/ 11.0 MiB]                                                
Operation completed over 1 objects/11.0 MiB.                                     
