In [1]:
""" This notebook will download the data from S3 to the EC2 instance 
-------------------------------------------------------------------------------
In this notebook we copy the data for the first couple of steps from WRI's
Amazon S3 bucket. The data is large (**40GB**), so this is a good excuse to
drink a coffee. The per-file output in Jupyter is suppressed, so you will only
see a result after a file has been downloaded. You can also run this command in
your terminal and follow the progress per file.

The script will rename and copy certain files to create a coherent dataset.

Requires the AWS CLI to be configured.


Author: Rutger Hofste
Date: 20170731
Kernel: python36
Docker: rutgerhofste/gisdocker:ubuntu16.04

Args:

    SCRIPT_NAME (string) : Script name
    INPUT_VERSION (integer) : input version, see readme and output number
                              of previous script. 
    


Returns:

Result:
    Unzipped, renamed and restructured files in the EC2 output folder.


"""

# Input Parameters

SCRIPT_NAME = "Y2017M07D31_RH_download_PCRGlobWB_data_V02"
INPUT_VERSION = 2

# Output Parameters


In [2]:
import time, datetime, sys
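# Use time.gmtime() so the printed timestamp really is UTC, independent of the server's timezone
dateString = time.strftime("Y%YM%mD%d", time.gmtime())
timeString = time.strftime("UTC %H:%M", time.gmtime())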
start = datetime.datetime.now()
print(dateString,timeString)
sys.version

Y2018M03D29 UTC 12:04


'3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) \n[GCC 7.2.0]'

In [3]:
# Imports
import os

In [4]:
# ETL
s3_input_path = "s3://wri-projects/Aqueduct30/processData/Y2017M07D31_RH_copy_S3raw_s3process_V{:02.0f}/output/".format(INPUT_VERSION)
ec2_output_path = "/volumes/data/{}/output/".format(SCRIPT_NAME)


In [5]:
!rm -r {ec2_output_path}

rm: cannot remove '/volumes/data/Y2017M07D31_RH_download_PCRGlobWB_data_V02/output/': No such file or directory


In [6]:
!mkdir -p {ec2_output_path}

In [7]:
!aws s3 cp {s3_input_path} {ec2_output_path} --recursive

download: s3://wri-projects/Aqueduct30/processData/Y2017M07D31_RH_copy_S3raw_s3process_V02/output/global_droughtseveritystandardisedsoilmoisture_5min_1960-2014.asc to ../../../../data/Y2017M07D31_RH_download_PCRGlobWB_data_V02/output/global_droughtseveritystandardisedsoilmoisture_5min_1960-2014.asc
download: s3://wri-projects/Aqueduct30/processData/Y2017M07D31_RH_copy_S3raw_s3process_V02/output/global_droughtseveritystandardisedstreamflow_5min_1960-2014.asc to ../../../../data/Y2017M07D31_RH_download_PCRGlobWB_data_V02/output/global_droughtseveritystandardisedstreamflow_5min_1960-2014.asc
download: s3://wri-projects/Aqueduct30/processData/Y2017M07D31_RH_copy_S3raw_s3process_V02/output/global_environmentalflows_5min_1960-2014.asc to ../../../../data/Y2017M07D31_RH_download_PCRGlobWB_data_V02/output/global_environmentalflows_5min_1960-2014.asc
download: s3://wri-projects/Aqueduct30/processData/Y2017M07D31_RH_copy_S3raw_s3process_V02/output/global_historical_PDomUse_year_millionm3_5min_19

download: s3://wri-projects/Aqueduct30/processData/Y2017M07D31_RH_copy_S3raw_s3process_V02/output/global_historical_riverdischarge_month_m3second_5min_1960_2014.nc4 to ../../../../data/Y2017M07D31_RH_download_PCRGlobWB_data_V02/output/global_historical_riverdischarge_month_m3second_5min_1960_2014.nc4
download: s3://wri-projects/Aqueduct30/processData/Y2017M07D31_RH_copy_S3raw_s3process_V02/output/global_historical_soilmoisture_month_meter_5min_1958-2014.nc4 to ../../../../data/Y2017M07D31_RH_download_PCRGlobWB_data_V02/output/global_historical_soilmoisture_month_meter_5min_1958-2014.nc4
download: s3://wri-projects/Aqueduct30/processData/Y2017M07D31_RH_copy_S3raw_s3process_V02/output/totalRunoff_monthTot_output.zip to ../../../../data/Y2017M07D31_RH_download_PCRGlobWB_data_V02/output/totalRunoff_monthTot_output.zip
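
The recursive copy above could also be started from plain Python instead of the `!` shell escape. The following is a minimal sketch, not the method used in this notebook: the function names `build_s3_copy_command` and `run_s3_copy` are hypothetical, and it assumes the AWS CLI is installed and configured. `--recursive` and `--only-show-errors` are existing options of `aws s3 cp`.

```python
import subprocess


def build_s3_copy_command(s3_path, local_path, quiet=False):
    """Build the argument list for a recursive `aws s3 cp` (hypothetical helper)."""
    command = ["aws", "s3", "cp", s3_path, local_path, "--recursive"]
    if quiet:
        # Suppress the per-file "download:" lines and show errors only.
        command.append("--only-show-errors")
    return command


def run_s3_copy(s3_path, local_path, quiet=False):
    """Run the copy; raises CalledProcessError if the CLI exits with a non-zero code."""
    subprocess.run(build_s3_copy_command(s3_path, local_path, quiet), check=True)
```

Calling `run_s3_copy(s3_input_path, ec2_output_path)` would then correspond to the cell above, with the added benefit that a failed download stops the notebook.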


In [8]:
# Count the downloaded files (32 in my case)
!find {ec2_output_path} -type f | wc -l

32
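
The same count can be obtained without the shell. This is a small sketch with a hypothetical helper name, `count_files`; it is roughly equivalent to `find -type f | wc -l`, although `os.walk` also lists symbolic links to files.

```python
import os


def count_files(folder):
    """Count the files below `folder`, including those in subfolders."""
    total = 0
    for _root, _dirs, filenames in os.walk(folder):
        total += len(filenames)
    return total
```

In this notebook the call would be `count_files(ec2_output_path)`.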


In [9]:
# As you can see, there are some zipped files. Unzipping.
# Unzipping the file results in a 24GB file, which is significant. Therefore this step will take quite some time.

!unzip {ec2_output_path}totalRunoff_monthTot_output.zip -d {ec2_output_path}

Archive:  /volumes/data/Y2017M07D31_RH_download_PCRGlobWB_data_V02/output/totalRunoff_monthTot_output.zip
  inflating: /volumes/data/Y2017M07D31_RH_download_PCRGlobWB_data_V02/output/totalRunoff_monthTot_output.nc  
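
As an alternative to the `unzip` command, the archive could be extracted with Python's standard `zipfile` module. This is only a sketch; the helper name `unzip_to_folder` is an assumption and is not part of the original workflow.

```python
import os
import zipfile


def unzip_to_folder(zip_path, target_folder):
    """Extract all members of `zip_path` into `target_folder` and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_folder)
        return archive.namelist()
```

For the runoff archive this would be called as `unzip_to_folder(os.path.join(ec2_output_path, "totalRunoff_monthTot_output.zip"), ec2_output_path)`.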


The total number of files can change if the raw data changes; this run counted 32 files after the download.
In the data that Yoshi provided, livestock data is only available for consumption (WN). However, in an email he specified that for livestock the withdrawal (WW) equals the consumption (100% consumption). Therefore we copy the WN livestock files to WW, which makes looping over WN and WW easier.

In [10]:
!cp {ec2_output_path}/global_historical_PLivWN_month_millionm3_5min_1960_2014.nc4 {ec2_output_path}/global_historical_PLivWW_month_millionm3_5min_1960_2014.nc4

In [11]:
!cp {ec2_output_path}/global_historical_PLivWN_year_millionm3_5min_1960_2014.nc4 {ec2_output_path}/global_historical_PLivWW_year_millionm3_5min_1960_2014.nc4
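
The two copy commands above differ only in the temporal resolution, so they could be expressed as one loop. A minimal sketch, assuming the filename pattern shown in the cells above; the function name `copy_livestock_wn_to_ww` is hypothetical.

```python
import os
import shutil

# Filename pattern taken from the two copy cells above.
LIVESTOCK_PATTERN = "global_historical_PLiv{}_{}_millionm3_5min_1960_2014.nc4"


def copy_livestock_wn_to_ww(folder, temporal_resolutions=("month", "year")):
    """Copy each livestock WN file to a WW file with the same content."""
    copied = []
    for resolution in temporal_resolutions:
        source = os.path.join(folder, LIVESTOCK_PATTERN.format("WN", resolution))
        target = os.path.join(folder, LIVESTOCK_PATTERN.format("WW", resolution))
        shutil.copyfile(source, target)
        copied.append(target)
    return copied
```

`copy_livestock_wn_to_ww(ec2_output_path)` would create the same two WW files as the `cp` commands.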

In [12]:
!ls -lah {ec2_output_path}

total 68G
drwxr-xr-x 2 root root 4.0K Mar 29 12:32 .
drwxr-xr-x 3 root root 4.0K Mar 29 12:04 ..
-rw-r--r-- 1 root root  57M Mar 29 11:53 global_droughtseveritystandardisedsoilmoisture_5min_1960-2014.asc
-rw-r--r-- 1 root root  55M Mar 29 11:53 global_droughtseveritystandardisedstreamflow_5min_1960-2014.asc
-rw-r--r-- 1 root root  56M Mar 29 11:53 global_environmentalflows_5min_1960-2014.asc
-rw-r--r-- 1 root root 3.2G Mar 29 11:49 global_historical_PDomUse_month_millionm3_5min_1960_2014.nc4
-rw-r--r-- 1 root root 270M Mar 29 11:49 global_historical_PDomUse_year_millionm3_5min_1960_2014.nc4
-rw-r--r-- 1 root root 3.2G Mar 29 11:49 global_historical_PDomWW_month_millionm3_5min_1960_2014.nc4
-rw-r--r-- 1 root root 271M Mar 29 11:49 global_historical_PDomWW_year_millionm3_5min_1960_2014.nc4
-rw-r--r-- 1 root root 1.7G Mar 29 11:49 global_historical_PIndUse_month_millionm3_5min_1960_2014.nc4
-rw-r--r-- 1 root root 156M Mar 29 11:49 global_historical_PIndUse_year_millionm3_5min_1

In [13]:
files = os.listdir(ec2_output_path)
print("Number of files: " + str(len(files)))

Number of files: 35


Copy PLivWN to PLivWW because livestock withdrawal equals livestock consumption (see Yoshi's email). This will solve some looping issues in later steps. It copies 4GB of data, so it takes a while.

Some files that WRI received from Utrecht refer to water "Use" instead of WN (net). Renaming the relevant files.

In [14]:
!mv {ec2_output_path}/global_historical_PDomUse_month_millionm3_5min_1960_2014.nc4 {ec2_output_path}/global_historical_PDomWN_month_millionm3_5min_1960_2014.nc4
!mv {ec2_output_path}/global_historical_PDomUse_year_millionm3_5min_1960_2014.nc4 {ec2_output_path}/global_historical_PDomWN_year_millionm3_5min_1960_2014.nc4

!mv {ec2_output_path}/global_historical_PIndUse_month_millionm3_5min_1960_2014.nc4 {ec2_output_path}/global_historical_PIndWN_month_millionm3_5min_1960_2014.nc4
!mv {ec2_output_path}/global_historical_PIndUse_year_millionm3_5min_1960_2014.nc4 {ec2_output_path}/global_historical_PIndWN_year_millionm3_5min_1960_2014.nc4
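
The four renames above follow one rule: for the domestic (PDom) and industrial (PInd) sectors, "Use" becomes "WN". A sketch of that rule in Python, with the hypothetical helper name `rename_use_to_wn`:

```python
import os


def rename_use_to_wn(folder, sectors=("PDom", "PInd")):
    """Rename e.g. *_PDomUse_* to *_PDomWN_*; returns (old, new) filename pairs."""
    renamed = []
    for filename in sorted(os.listdir(folder)):
        for sector in sectors:
            old_tag = "_{}Use_".format(sector)
            if old_tag in filename:
                new_name = filename.replace(old_tag, "_{}WN_".format(sector))
                os.rename(os.path.join(folder, filename),
                          os.path.join(folder, new_name))
                renamed.append((filename, new_name))
    return renamed
```

Files without a "Use" tag, such as the existing WW files, are left untouched.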


As you can see, the filename structure of the runoff files is different. Using Panoply to inspect the units, we rename the files accordingly. 

New name for annual:  

global_historical_runoff_year_myear_5min_1958_2014.nc

New name for monthly:  

global_historical_runoff_month_mmonth_5min_1958_2014.nc


In [15]:
!mv {ec2_output_path}/totalRunoff_annuaTot_output.nc {ec2_output_path}/global_historical_runoff_year_myear_5min_1958_2014.nc

In [16]:
!mv {ec2_output_path}/totalRunoff_monthTot_output.nc {ec2_output_path}/global_historical_runoff_month_mmonth_5min_1958_2014.nc
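
After the renames, the NetCDF filenames share an eight-part, underscore-separated structure, which makes looping in later steps easier. The sketch below splits such a filename into its parts. The field names in `FileName` are assumptions chosen for illustration and are not defined anywhere in this notebook; the `.asc` files use a different pattern and are not handled.

```python
import os
from collections import namedtuple

# Field names are hypothetical labels for the eight underscore-separated parts.
FileName = namedtuple("FileName", [
    "extent", "scenario", "indicator", "temporal_resolution",
    "unit", "spatial_resolution", "year_min", "year_max", "extension",
])


def parse_filename(filename):
    """Split a filename like global_historical_runoff_month_mmonth_5min_1958_2014.nc."""
    stem, extension = os.path.splitext(filename)
    parts = stem.split("_")
    if len(parts) != 8:
        raise ValueError("Unexpected filename structure: {}".format(filename))
    return FileName(parts[0], parts[1], parts[2], parts[3], parts[4], parts[5],
                    int(parts[6]), int(parts[7]), extension)


parsed = parse_filename("global_historical_runoff_month_mmonth_5min_1958_2014.nc")
print(parsed.indicator)  # prints: runoff
```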

Final folder structure

In [17]:
number_of_files =  len(os.listdir(ec2_output_path))

In [18]:
assert number_of_files == 35, ("Number of files is different than previous run. {} instead of 35".format(number_of_files))

In [19]:
end = datetime.datetime.now()
elapsed = end - start
print(elapsed)

0:27:46.549347
