# Ingest data on Google Earth Engine (WRI bucket)

* Purpose of script: Upload the GeoTIFF files from Google Cloud Storage to the WRI/aqueduct Earth Engine bucket.
* Author: Rutger Hofste
* Kernel used: python36
* Date created: 20170802

## Preparation

* Authorize Earth Engine by running in your terminal: `earthengine authenticate`
* You need write access to the WRI-Aquaduct bucket (yep, a Google employee made a typo) to ingest the data. Rutger can grant write access to this folder.
* You need access to the Google Cloud Storage bucket.

Run in your terminal `gcloud config set project aqueduct30`
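
Before running the cells below, it can help to confirm that the command-line tools used in this notebook are installed. The following is a minimal sketch, not part of the original notebook: the helper name `missing_tools` is hypothetical, and it only checks the `PATH`, so tools that are called by absolute path (as in some cells below) may be reported as missing even though they exist.

```python
import shutil

def missing_tools(tools=("earthengine", "gcloud", "gsutil")):
    """Return the command-line tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
```

An empty result only shows that the executables exist; it does not verify authentication or bucket permissions.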

In [1]:
import subprocess
import datetime
import os
import time
from datetime import timedelta
import re
import pandas as pd

## Settings

In [2]:
GCS_BASE = "gs://aqueduct30_v01/Y2017M08D02_RH_Upload_to_GoogleCS_V01/"

In [3]:
EE_BASE = "projects/WRI-Aquaduct/PCRGlobWB20V05"

## Functions

In [4]:
def splitKey(key):
    # Split a GCS key into its file name, extension, parameter, identifier, year and month
    prefix, extension = key.split(".")
    fileName = prefix.split("/")[-1]
    parameter = fileName[:-12]
    month = fileName[-2:] #can also do this with regular expressions if you like
    year = fileName[-7:-3]
    identifier = fileName[-11:-8]
    outDict = {"fileName":fileName,"extension":extension,"parameter":parameter,"month":month,"year":year,"identifier":identifier}
    return outDict

def splitParameter(parameter):
    #values = parameter.split("_")
    values = re.split("_|-", parameter) #soilmoisture uses a hyphen instead of underscore between the years
    keys = ["geographic_range","temporal_range","indicator","temporal_resolution","units","spatial_resolution","temporal_range_min","temporal_range_max"]
    # ['global', 'historical', 'PDomWN', 'month', 'millionm3', '5min', '1960', '2014']
    outDict = dict(zip(keys, values))
    outDict["parameter"] = parameter
    return outDict
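
The comment in `splitKey` mentions regular expressions as an alternative to fixed slice offsets. A sketch of that variant is shown here. It assumes the file name ends with `I<3 digits>Y<4 digits>M<2 digits>`, which is inferred from the slice offsets above and not confirmed elsewhere; the function name `split_key_regex` and the example key are hypothetical.

```python
import re

# Assumed suffix layout, inferred from the slice offsets in splitKey:
# I<3-digit identifier>Y<4-digit year>M<2-digit month>
KEY_PATTERN = re.compile(
    r"(?P<parameter>.+)I(?P<identifier>\d{3})Y(?P<year>\d{4})M(?P<month>\d{2})$"
)

def split_key_regex(key):
    """Regex variant of splitKey; raises ValueError on unexpected names."""
    # Split on the last dot only, so dots in the path do not break parsing
    path, _, extension = key.rpartition(".")
    file_name = path.split("/")[-1]
    match = KEY_PATTERN.match(file_name)
    if match is None:
        raise ValueError("unexpected file name: %s" % key)
    out = match.groupdict()
    out["fileName"] = file_name
    out["extension"] = extension
    return out

# Hypothetical key, built to follow the assumed naming scheme
example = ("gs://example-bucket/folder/"
           "global_historical_PDomWN_month_millionm3_5min_1960_2014I000Y1960M01.tif")
print(split_key_regex(example))
```

Unlike the slicing version, this fails loudly when a file does not follow the expected pattern, which may be preferable when thousands of files are processed in one run.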


## Script

In [5]:
command = ("earthengine create folder %s") %EE_BASE

In [6]:
print(command)

earthengine create folder projects/WRI-Aquaduct/PCRGlobWB20V05


In [7]:
subprocess.check_output(command,shell=True)

'Asset projects/WRI-Aquaduct/PCRGlobWB20V05 already exists\n'

In [8]:
command = ("/opt/google-cloud-sdk/bin/gsutil ls %s") %(GCS_BASE)

In [9]:
keys = subprocess.check_output(command,shell=True)

In [10]:
keys = keys.decode('UTF-8').splitlines()

Remove the first item from the list. It is the folder itself and does not contain a file name.

In [11]:
keys2 = keys[1:]
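
Dropping the first item relies on the folder entry always being listed first. A slightly more defensive sketch, assuming folder entries end with a slash, filters on the key itself. The helper `only_files` and the example bucket paths are hypothetical.

```python
def only_files(keys):
    """Keep object keys and drop folder placeholders and empty lines."""
    return [key for key in keys if key.strip() and not key.rstrip().endswith("/")]

listing = [
    "gs://example-bucket/folder/",       # hypothetical folder placeholder
    "gs://example-bucket/folder/a.tif",  # hypothetical file
    "gs://example-bucket/folder/b.tif",  # hypothetical file
]
print(only_files(listing))
```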

In [12]:
# Build the DataFrame in one go; DataFrame.append was removed in pandas 2.0
rows = [splitKey(key) for key in keys2]
df = pd.DataFrame(rows, index=range(1, len(rows) + 1))

In [13]:
df.head()

Unnamed: 0,extension,fileName,identifier,month,parameter,year
1,tif,global_historical_PDomWN_month_millionm3_5min_...,0,1,global_historical_PDomWN_month_millionm3_5min_...,1960
2,tif,global_historical_PDomWN_month_millionm3_5min_...,1,2,global_historical_PDomWN_month_millionm3_5min_...,1960
3,tif,global_historical_PDomWN_month_millionm3_5min_...,2,3,global_historical_PDomWN_month_millionm3_5min_...,1960
4,tif,global_historical_PDomWN_month_millionm3_5min_...,3,4,global_historical_PDomWN_month_millionm3_5min_...,1960
5,tif,global_historical_PDomWN_month_millionm3_5min_...,4,5,global_historical_PDomWN_month_millionm3_5min_...,1960


In [14]:
df.tail()

Unnamed: 0,extension,fileName,identifier,month,parameter,year
9286,tif,global_historical_soilmoisture_month_meter_5mi...,679,8,global_historical_soilmoisture_month_meter_5mi...,2014
9287,tif,global_historical_soilmoisture_month_meter_5mi...,680,9,global_historical_soilmoisture_month_meter_5mi...,2014
9288,tif,global_historical_soilmoisture_month_meter_5mi...,681,10,global_historical_soilmoisture_month_meter_5mi...,2014
9289,tif,global_historical_soilmoisture_month_meter_5mi...,682,11,global_historical_soilmoisture_month_meter_5mi...,2014
9290,tif,global_historical_soilmoisture_month_meter_5mi...,683,12,global_historical_soilmoisture_month_meter_5mi...,2014


In [15]:
df.shape

(9290, 6)

In [16]:
parameters = df.parameter.unique()

In [44]:
print(parameters)

[u'global_historical_PDomWN_month_millionm3_5min_1960_2014'
 u'global_historical_PDomWN_year_millionm3_5min_1960_2014'
 u'global_historical_PDomWW_month_millionm3_5min_1960_2014'
 u'global_historical_PDomWW_year_millionm3_5min_1960_2014'
 u'global_historical_PIndWN_month_millionm3_5min_1960_2014'
 u'global_historical_PIndWN_year_millionm3_5min_1960_2014'
 u'global_historical_PIndWW_month_millionm3_5min_1960_2014'
 u'global_historical_PIndWW_year_millionm3_5min_1960_2014'
 u'global_historical_PIrrWN_month_millionm3_5min_1960_2014'
 u'global_historical_PIrrWN_year_millionm3_5min_1960_2014'
 u'global_historical_PIrrWW_month_millionm3_5min_1960_2014'
 u'global_historical_PIrrWW_year_millionm3_5min_1960_2014'
 u'global_historical_PLivWN_month_millionm3_5min_1960_2014'
 u'global_historical_PLivWN_year_millionm3_5min_1960_2014'
 u'global_historical_PLivWW_month_millionm3_5min_1960_2014'
 u'global_historical_PLivWW_year_millionm3_5min_1960_2014'
 u'global_historical_aqbasinwaterstress_month_di

We will store the GeoTIFF images of each NetCDF4 file in an imageCollection. Each imageCollection will have the same name and content as the original NetCDF4 file.


In [45]:
for parameter in parameters:
    eeLocation = EE_BASE + "/" + parameter
    command = ("earthengine create collection %s") %eeLocation
    # Only needed the first time you run this script; comment out the next line on later runs
    subprocess.check_output(command,shell=True)
    print(command)
    

earthengine create collection projects/WRI-Aquaduct/PCRGlobWB20V05/global_historical_PDomWN_month_millionm3_5min_1960_2014
earthengine create collection projects/WRI-Aquaduct/PCRGlobWB20V05/global_historical_PDomWN_year_millionm3_5min_1960_2014
earthengine create collection projects/WRI-Aquaduct/PCRGlobWB20V05/global_historical_PDomWW_month_millionm3_5min_1960_2014
earthengine create collection projects/WRI-Aquaduct/PCRGlobWB20V05/global_historical_PDomWW_year_millionm3_5min_1960_2014
earthengine create collection projects/WRI-Aquaduct/PCRGlobWB20V05/global_historical_PIndWN_month_millionm3_5min_1960_2014
earthengine create collection projects/WRI-Aquaduct/PCRGlobWB20V05/global_historical_PIndWN_year_millionm3_5min_1960_2014
earthengine create collection projects/WRI-Aquaduct/PCRGlobWB20V05/global_historical_PIndWW_month_millionm3_5min_1960_2014
earthengine create collection projects/WRI-Aquaduct/PCRGlobWB20V05/global_historical_PIndWW_year_millionm3_5min_1960_2014
earthengine create c

Now that the folder and collections have been created we can start ingesting the data. It is crucial to store the relevant metadata with the images. 

In [19]:
# One row of metadata per parameter; DataFrame.append was removed in pandas 2.0
rows_parameter = [splitParameter(parameter) for parameter in parameters]
df_parameter = pd.DataFrame(rows_parameter, index=range(1, len(rows_parameter) + 1))
    

In [20]:
df_parameter.head()

Unnamed: 0,geographic_range,indicator,parameter,spatial_resolution,temporal_range,temporal_range_max,temporal_range_min,temporal_resolution,units
1,global,PDomWN,global_historical_PDomWN_month_millionm3_5min_...,5min,historical,2014,1960,month,millionm3
2,global,PDomWN,global_historical_PDomWN_year_millionm3_5min_1...,5min,historical,2014,1960,year,millionm3
3,global,PDomWW,global_historical_PDomWW_month_millionm3_5min_...,5min,historical,2014,1960,month,millionm3
4,global,PDomWW,global_historical_PDomWW_year_millionm3_5min_1...,5min,historical,2014,1960,year,millionm3
5,global,PIndWN,global_historical_PIndWN_month_millionm3_5min_...,5min,historical,2014,1960,month,millionm3


In [21]:
df_parameter.tail()

Unnamed: 0,geographic_range,indicator,parameter,spatial_resolution,temporal_range,temporal_range_max,temporal_range_min,temporal_resolution,units
21,global,riverdischarge,global_historical_riverdischarge_month_m3secon...,5min,historical,2014,1960,month,m3second
22,global,riverdischarge,global_historical_riverdischarge_year_m3second...,5min,historical,2014,1960,year,m3second
23,global,runoff,global_historical_runoff_month_mmonth_5min_195...,5min,historical,2014,1958,month,mmonth
24,global,runoff,global_historical_runoff_year_myear_5min_1958_...,5min,historical,2014,1958,year,myear
25,global,soilmoisture,global_historical_soilmoisture_month_meter_5mi...,5min,historical,2014,1958,month,meter


In [22]:
df_parameter.shape

(25, 9)

In [23]:
df_complete = df.merge(df_parameter,how='left',left_on='parameter',right_on='parameter')

Add the NoData value, `ingested_by` and `exportdescription` columns.

In [24]:
df_complete["nodata"] = -9999
df_complete["ingested_by"] ="RutgerHofste"
df_complete["exportdescription"] = df_complete["indicator"] + "_" + df_complete["temporal_resolution"]+"Y"+df_complete["year"]+"M"+df_complete["month"]
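
As a worked example of the `exportdescription` concatenation, the same string can be built for a single row in plain Python. The helper name `export_description` is hypothetical; the input values are taken from the first row of the table shown in this notebook.

```python
def export_description(indicator, temporal_resolution, year, month):
    """Mirror of the string concatenation used for exportdescription."""
    return indicator + "_" + temporal_resolution + "Y" + year + "M" + month

print(export_description("PDomWN", "month", "1960", "01"))
# PDomWN_monthY1960M01
```

Note that `year` and `month` must be strings, as they are in `df_complete`, otherwise the concatenation raises a `TypeError`.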

In [25]:
df_complete.head()

Unnamed: 0,extension,fileName,identifier,month,parameter,year,geographic_range,indicator,spatial_resolution,temporal_range,temporal_range_max,temporal_range_min,temporal_resolution,units,nodata,ingested_by,exportdescription
0,tif,global_historical_PDomWN_month_millionm3_5min_...,0,1,global_historical_PDomWN_month_millionm3_5min_...,1960,global,PDomWN,5min,historical,2014,1960,month,millionm3,-9999,RutgerHofste,PDomWN_monthY1960M01
1,tif,global_historical_PDomWN_month_millionm3_5min_...,1,2,global_historical_PDomWN_month_millionm3_5min_...,1960,global,PDomWN,5min,historical,2014,1960,month,millionm3,-9999,RutgerHofste,PDomWN_monthY1960M02
2,tif,global_historical_PDomWN_month_millionm3_5min_...,2,3,global_historical_PDomWN_month_millionm3_5min_...,1960,global,PDomWN,5min,historical,2014,1960,month,millionm3,-9999,RutgerHofste,PDomWN_monthY1960M03
3,tif,global_historical_PDomWN_month_millionm3_5min_...,3,4,global_historical_PDomWN_month_millionm3_5min_...,1960,global,PDomWN,5min,historical,2014,1960,month,millionm3,-9999,RutgerHofste,PDomWN_monthY1960M04
4,tif,global_historical_PDomWN_month_millionm3_5min_...,4,5,global_historical_PDomWN_month_millionm3_5min_...,1960,global,PDomWN,5min,historical,2014,1960,month,millionm3,-9999,RutgerHofste,PDomWN_monthY1960M05


In [26]:
df_complete.tail()

Unnamed: 0,extension,fileName,identifier,month,parameter,year,geographic_range,indicator,spatial_resolution,temporal_range,temporal_range_max,temporal_range_min,temporal_resolution,units,nodata,ingested_by,exportdescription
9285,tif,global_historical_soilmoisture_month_meter_5mi...,679,8,global_historical_soilmoisture_month_meter_5mi...,2014,global,soilmoisture,5min,historical,2014,1958,month,meter,-9999,RutgerHofste,soilmoisture_monthY2014M08
9286,tif,global_historical_soilmoisture_month_meter_5mi...,680,9,global_historical_soilmoisture_month_meter_5mi...,2014,global,soilmoisture,5min,historical,2014,1958,month,meter,-9999,RutgerHofste,soilmoisture_monthY2014M09
9287,tif,global_historical_soilmoisture_month_meter_5mi...,681,10,global_historical_soilmoisture_month_meter_5mi...,2014,global,soilmoisture,5min,historical,2014,1958,month,meter,-9999,RutgerHofste,soilmoisture_monthY2014M10
9288,tif,global_historical_soilmoisture_month_meter_5mi...,682,11,global_historical_soilmoisture_month_meter_5mi...,2014,global,soilmoisture,5min,historical,2014,1958,month,meter,-9999,RutgerHofste,soilmoisture_monthY2014M11
9289,tif,global_historical_soilmoisture_month_meter_5mi...,683,12,global_historical_soilmoisture_month_meter_5mi...,2014,global,soilmoisture,5min,historical,2014,1958,month,meter,-9999,RutgerHofste,soilmoisture_monthY2014M12


In [27]:
list(df_complete.columns.values)

['extension',
 'fileName',
 'identifier',
 'month',
 'parameter',
 'year',
 'geographic_range',
 'indicator',
 'spatial_resolution',
 'temporal_range',
 'temporal_range_max',
 'temporal_range_min',
 'temporal_resolution',
 'units',
 'nodata',
 'ingested_by',
 'exportdescription']

In [49]:
def uploadEE(index,row):
    target = EE_BASE +"/"+ row.parameter + "/" + row.fileName
    source = GCS_BASE + row.fileName + "." + row.extension
    metadata = "--nodata_value=%s --time_start %s-%s-01 -p extension=%s -p filename=%s -p identifier=%s -p year=%s -p geographic_range=%s -p indicator=%s -p spatial_resolution=%s -p temporal_range=%s -p temporal_range_max=%s -p temporal_range_min=%s -p temporal_resolution=%s -p units=%s -p ingested_by=%s -p exportdescription=%s" %(row.nodata,row.year,row.month,row.extension,row.fileName,row.identifier,row.year,row.geographic_range,row.indicator,row.spatial_resolution,row.temporal_range,row.temporal_range_max,row.temporal_range_min, row.temporal_resolution, row.units, row.ingested_by, row.exportdescription)
    command = "/opt/anaconda3/bin/earthengine upload image --asset_id %s %s %s" % (target, source,metadata)
    try:
        response = subprocess.check_output(command, shell=True)
        outDict = {"command":command,"response":response,"error":0}
    except subprocess.CalledProcessError as e:
        # earthengine ran but returned a non-zero exit status; keep its output
        outDict = {"command":command,"response":e.output,"error":1}
        print("error")
    except Exception:
        # the command could not be started at all
        outDict = {"command":command,"response":-9999,"error":2}
        print("error")
    return pd.DataFrame(outDict,index=[index])
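
The upload command above is assembled as one shell string, so a metadata value containing a space or a shell metacharacter would break it. One possible alternative, sketched here under assumptions, is to build an argument list that can be passed to `subprocess` without `shell=True`. The helper `build_upload_args` and the example values are hypothetical; only flags that already appear in `uploadEE` are used.

```python
def build_upload_args(target, source, nodata, time_start, properties,
                      executable="earthengine"):
    """Return the upload command as an argument list (no shell needed)."""
    args = [executable, "upload", "image", "--asset_id", target, source,
            "--nodata_value=%s" % nodata, "--time_start", time_start]
    # One -p flag per metadata property, in a stable (sorted) order
    for key in sorted(properties):
        args += ["-p", "%s=%s" % (key, properties[key])]
    return args

# Hypothetical values for illustration only
args = build_upload_args(
    target="projects/example/collection/image",
    source="gs://example-bucket/folder/image.tif",
    nodata=-9999,
    time_start="1960-01-01",
    properties={"units": "millionm3", "year": "1960"},
)
print(args)
```

The resulting list could then be passed as `subprocess.check_output(args)`, which sidesteps quoting issues because each value is handed to the program as a separate argument.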



In [50]:
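results = []
start_time = time.time()
last_index = max(len(df_complete) - 1, 1)  # highest row index, instead of hard-coding 9289
for index, row in df_complete.iterrows():
    elapsed_time = time.time() - start_time
    print(index, "%.2f" % (100.0 * index / last_index), "elapsed: ", str(timedelta(seconds=elapsed_time)))
    results.append(uploadEE(index, row))
# DataFrame.append was removed in pandas 2.0; concatenate once after the loop
df_errors = pd.concat(results) if results else pd.DataFrame()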
    
    

(7865, '84.67', 'elapsed: ', '0:00:00.000748')
(7866, '84.68', 'elapsed: ', '0:00:01.226725')
(7867, '84.69', 'elapsed: ', '0:00:02.408142')
(7868, '84.70', 'elapsed: ', '0:00:03.553224')
(7869, '84.71', 'elapsed: ', '0:00:04.712462')
(7870, '84.72', 'elapsed: ', '0:00:06.039094')
(7871, '84.73', 'elapsed: ', '0:00:07.271191')
(7872, '84.75', 'elapsed: ', '0:00:08.488724')
(7873, '84.76', 'elapsed: ', '0:00:09.850453')
(7874, '84.77', 'elapsed: ', '0:00:11.030724')
(7875, '84.78', 'elapsed: ', '0:00:12.282369')
(7876, '84.79', 'elapsed: ', '0:00:13.491524')
(7877, '84.80', 'elapsed: ', '0:00:14.710352')
(7878, '84.81', 'elapsed: ', '0:00:15.876576')
(7879, '84.82', 'elapsed: ', '0:00:16.971030')
(7880, '84.83', 'elapsed: ', '0:00:18.123972')
(7881, '84.84', 'elapsed: ', '0:00:19.278200')
(7882, '84.85', 'elapsed: ', '0:00:20.373206')
(7883, '84.86', 'elapsed: ', '0:00:21.496790')
(7884, '84.87', 'elapsed: ', '0:00:22.725454')
(7885, '84.89', 'elapsed: ', '0:00:23.958020')
(7886, '84.90

(8040, '86.55', 'elapsed: ', '0:03:42.910707')
(8041, '86.56', 'elapsed: ', '0:03:44.578165')
(8042, '86.58', 'elapsed: ', '0:03:45.742500')
(8043, '86.59', 'elapsed: ', '0:03:47.473977')
(8044, '86.60', 'elapsed: ', '0:03:48.800691')
(8045, '86.61', 'elapsed: ', '0:03:50.024122')
(8046, '86.62', 'elapsed: ', '0:03:51.279055')
(8047, '86.63', 'elapsed: ', '0:03:52.467749')
(8048, '86.64', 'elapsed: ', '0:03:53.651665')
(8049, '86.65', 'elapsed: ', '0:03:54.803924')
(8050, '86.66', 'elapsed: ', '0:03:55.967812')
(8051, '86.67', 'elapsed: ', '0:03:57.053636')
(8052, '86.68', 'elapsed: ', '0:03:58.200147')
(8053, '86.69', 'elapsed: ', '0:03:59.352497')
(8054, '86.70', 'elapsed: ', '0:04:00.578032')
(8055, '86.72', 'elapsed: ', '0:04:01.627469')
(8056, '86.73', 'elapsed: ', '0:04:02.815356')
(8057, '86.74', 'elapsed: ', '0:04:03.986653')
(8058, '86.75', 'elapsed: ', '0:04:05.068195')
(8059, '86.76', 'elapsed: ', '0:04:06.173674')
(8060, '86.77', 'elapsed: ', '0:04:07.402407')
(8061, '86.78

(8215, '88.44', 'elapsed: ', '0:07:14.199116')
(8216, '88.45', 'elapsed: ', '0:07:15.426898')
(8217, '88.46', 'elapsed: ', '0:07:16.513280')
(8218, '88.47', 'elapsed: ', '0:07:17.731178')
(8219, '88.48', 'elapsed: ', '0:07:18.833041')
(8220, '88.49', 'elapsed: ', '0:07:20.051843')
(8221, '88.50', 'elapsed: ', '0:07:21.220278')
(8222, '88.51', 'elapsed: ', '0:07:22.456850')
(8223, '88.52', 'elapsed: ', '0:07:23.916259')
(8224, '88.53', 'elapsed: ', '0:07:25.032545')
(8225, '88.55', 'elapsed: ', '0:07:26.199614')
(8226, '88.56', 'elapsed: ', '0:07:27.572869')
(8227, '88.57', 'elapsed: ', '0:07:28.618074')
(8228, '88.58', 'elapsed: ', '0:07:29.941960')
(8229, '88.59', 'elapsed: ', '0:07:31.109509')
(8230, '88.60', 'elapsed: ', '0:07:32.157829')
(8231, '88.61', 'elapsed: ', '0:07:33.358122')
(8232, '88.62', 'elapsed: ', '0:07:34.508319')
(8233, '88.63', 'elapsed: ', '0:07:35.803548')
(8234, '88.64', 'elapsed: ', '0:07:36.968571')
(8235, '88.65', 'elapsed: ', '0:07:38.055106')
(8236, '88.66

(8390, '90.32', 'elapsed: ', '0:10:50.279300')
(8391, '90.33', 'elapsed: ', '0:10:51.444587')
(8392, '90.34', 'elapsed: ', '0:10:52.539823')
(8393, '90.35', 'elapsed: ', '0:10:53.700254')
(8394, '90.36', 'elapsed: ', '0:10:54.818859')
(8395, '90.38', 'elapsed: ', '0:10:55.938493')
(8396, '90.39', 'elapsed: ', '0:10:57.085381')
(8397, '90.40', 'elapsed: ', '0:10:58.220092')
(8398, '90.41', 'elapsed: ', '0:10:59.357316')
(8399, '90.42', 'elapsed: ', '0:11:00.532734')
(8400, '90.43', 'elapsed: ', '0:11:01.716657')
(8401, '90.44', 'elapsed: ', '0:11:02.820814')
(8402, '90.45', 'elapsed: ', '0:11:03.915910')
(8403, '90.46', 'elapsed: ', '0:11:04.995364')
(8404, '90.47', 'elapsed: ', '0:11:06.132141')
(8405, '90.48', 'elapsed: ', '0:11:07.217473')
(8406, '90.49', 'elapsed: ', '0:11:08.391865')
(8407, '90.50', 'elapsed: ', '0:11:09.594324')
(8408, '90.52', 'elapsed: ', '0:11:10.671404')
(8409, '90.53', 'elapsed: ', '0:11:11.845820')
(8410, '90.54', 'elapsed: ', '0:11:12.926020')
(8411, '90.55

(8565, '92.21', 'elapsed: ', '0:14:13.911243')
(8566, '92.22', 'elapsed: ', '0:14:15.039883')
(8567, '92.23', 'elapsed: ', '0:14:16.202593')
(8568, '92.24', 'elapsed: ', '0:14:17.326885')
(8569, '92.25', 'elapsed: ', '0:14:18.420250')
(8570, '92.26', 'elapsed: ', '0:14:19.599722')
(8571, '92.27', 'elapsed: ', '0:14:20.734391')
(8572, '92.28', 'elapsed: ', '0:14:21.811129')
(8573, '92.29', 'elapsed: ', '0:14:22.939696')
(8574, '92.30', 'elapsed: ', '0:14:24.106666')
(8575, '92.31', 'elapsed: ', '0:14:25.193790')
(8576, '92.32', 'elapsed: ', '0:14:26.410023')
(8577, '92.34', 'elapsed: ', '0:14:27.518206')
(8578, '92.35', 'elapsed: ', '0:14:28.632309')
(8579, '92.36', 'elapsed: ', '0:14:29.849474')
(8580, '92.37', 'elapsed: ', '0:14:30.987731')
(8581, '92.38', 'elapsed: ', '0:14:32.032112')
(8582, '92.39', 'elapsed: ', '0:14:33.196655')
(8583, '92.40', 'elapsed: ', '0:14:34.337722')
(8584, '92.41', 'elapsed: ', '0:14:35.714556')
(8585, '92.42', 'elapsed: ', '0:14:36.834702')
(8586, '92.43

In [33]:
!mkdir -p /volumes/data/temp


In [51]:
df_errors.to_csv("/volumes/data/temp/df_errors.csv")

In [52]:
!aws s3 cp  /volumes/data/temp/df_errors.csv s3://wri-projects/Aqueduct30/temp/df_errors.csv

upload: ../../../../data/temp/df_errors.csv to s3://wri-projects/Aqueduct30/temp/df_errors.csv


Retry the uploads that returned an error.

In [41]:
df_retry = df_errors.loc[df_errors['error'] != 0]

In [43]:
for index, row in df_retry.iterrows():
    response = subprocess.check_output(row.command, shell=True)
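
The retry loop above stops at the first command that fails again, because `subprocess.check_output` raises on a non-zero exit status. A minimal sketch of a more forgiving retry is given below. The helper `retry_command`, its default number of attempts and its waiting time are assumptions for illustration, not part of the original notebook.

```python
import subprocess
import time

def retry_command(command, attempts=3, wait_seconds=5, run=subprocess.check_output):
    """Run a shell command, retrying on a non-zero exit status.

    `run` is injectable so the helper can be exercised without real commands.
    """
    for attempt in range(1, attempts + 1):
        try:
            return run(command, shell=True)
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(wait_seconds * attempt)  # wait a little longer each time
```

It could be used as `retry_command(row.command)` inside the loop, wrapped in a `try`/`except subprocess.CalledProcessError` block so that one persistent failure does not abort the remaining retries.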
    