### Spatially Join FAO Names to hydrobasins level 6

* Purpose of script: spatially join the FAO Names hydrobasins to the official HydroBasins level 6 polygons
* Author: Rutger Hofste
* Kernel used: python35
* Date created: 20170825

In [1]:
S3_INPUT_PATH_FAO = "s3://wri-projects/Aqueduct30/processData/Y2017M08D23_RH_Buffer_FAONames_V01/output/"
S3_INPUT_PATH_HYBAS = "s3://wri-projects/Aqueduct30/processData/Y2017M08D02_RH_Merge_HydroBasins_V01/output/"
S3_OUTPUT_PATH = "s3://wri-projects/Aqueduct30/processData/Y2017M08D25_RH_spatial_join_FAONames_V01/output/"
EC2_INPUT_PATH = "/volumes/data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/"
EC2_OUTPUT_PATH = "/volumes/data/Y2017M08D25_RH_spatial_join_FAONames_V01/output/"
INPUT_FILE_NAME_FAO = "hydrobasins_fao_fiona_merged_buffered_v01.shp"
INPUT_FILE_NAME_HYBAS = "hybas_lev06_v1c_merged_fiona_V01.shp"
OUTPUT_FILE_NAME = "hybas_lev06_v1c_merged_fiona_withFAO_V02.csv"

In [2]:
!rm -r {EC2_INPUT_PATH}
!rm -r {EC2_OUTPUT_PATH}

!mkdir -p {EC2_INPUT_PATH}
!mkdir -p {EC2_OUTPUT_PATH}

rm: cannot remove '/volumes/data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/': No such file or directory
rm: cannot remove '/volumes/data/Y2017M08D25_RH_spatial_join_FAONames_V01/output/': No such file or directory
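
The `rm` errors above appear because the directories do not exist yet on a first run. A minimal sketch of the same reset step in Python, using only the standard library; the helper name `reset_dir` and the temporary demo location are hypothetical and not part of the original notebook:

```python
import os
import shutil
import tempfile


def reset_dir(path):
    """Delete path if it exists, then recreate it as an empty directory."""
    # ignore_errors=True turns a missing directory into a no-op instead of an error
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)
    return path


# Demonstration on a throwaway location rather than the real EC2 paths
demo_root = tempfile.mkdtemp()
demo_input = os.path.join(demo_root, "input")

reset_dir(demo_input)  # first run: the directory did not exist yet
open(os.path.join(demo_input, "stale.txt"), "w").close()
reset_dir(demo_input)  # second run: old contents are removed

remaining = os.listdir(demo_input)
```

In the notebook this would be called as `reset_dir(EC2_INPUT_PATH)` and `reset_dir(EC2_OUTPUT_PATH)`, which avoids the error messages without changing the result.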


In [3]:
!aws s3 cp {S3_INPUT_PATH_FAO} {EC2_INPUT_PATH} --recursive 

download: s3://wri-projects/Aqueduct30/processData/Y2017M08D23_RH_Buffer_FAONames_V01/output/hydrobasins_fao_fiona_merged_buffered_v01.cpg to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hydrobasins_fao_fiona_merged_buffered_v01.cpg
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D23_RH_Buffer_FAONames_V01/output/hydrobasins_fao_fiona_merged_buffered_v01.prj to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hydrobasins_fao_fiona_merged_buffered_v01.prj
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D23_RH_Buffer_FAONames_V01/output/hydrobasins_fao_fiona_merged_buffered_v01.shx to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hydrobasins_fao_fiona_merged_buffered_v01.shx
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D23_RH_Buffer_FAONames_V01/output/hydrobasins_fao_fiona_merged_buffered_v01.dbf to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hydrobasins_fao_fiona_merged_buffered_

In [4]:
!aws s3 cp {S3_INPUT_PATH_HYBAS} {EC2_INPUT_PATH} --recursive --exclude "*.tif"

download: s3://wri-projects/Aqueduct30/processData/Y2017M08D02_RH_Merge_HydroBasins_V01/output/hybas_lev06_v1c_merged_fiona_V01.cpg to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hybas_lev06_v1c_merged_fiona_V01.cpg
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D02_RH_Merge_HydroBasins_V01/output/hybas_lev00_v1c_merged_fiona_V01.cpg to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hybas_lev00_v1c_merged_fiona_V01.cpg
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D02_RH_Merge_HydroBasins_V01/output/hybas_lev00_v1c_merged_fiona_V01.prj to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hybas_lev00_v1c_merged_fiona_V01.prj
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D02_RH_Merge_HydroBasins_V01/output/hybas_lev06_v1c_merged_fiona_V01.prj to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hybas_lev06_v1c_merged_fiona_V01.prj
download: s3://wri-projects/Aqueduct30/processData/Y2017

In [5]:
import os
if 'GDAL_DATA' not in os.environ:
    os.environ['GDAL_DATA'] = r'/usr/share/gdal/2.1'
from osgeo import gdal,ogr,osr
# Sanity check: this should evaluate to True. GDAL needs GDAL_DATA to obtain the spatial reference.
'GDAL_DATA' in os.environ
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import time
%matplotlib notebook


In [6]:
gdfFAO = gpd.read_file(os.path.join(EC2_INPUT_PATH,INPUT_FILE_NAME_FAO))

In [7]:
list(gdfFAO)

['area',
 'LEGEND',
 'MAJ_AREA',
 'MAJ_BAS',
 'MAJ_NAME',
 'SUB_AREA',
 'SUB_BAS',
 'SUB_NAME',
 'TO_BAS',
 'index1',
 'geometry']

In [8]:
gdfHybas = gpd.read_file(os.path.join(EC2_INPUT_PATH,INPUT_FILE_NAME_HYBAS))

In [9]:
list(gdfHybas)

['HYBAS_ID',
 'NEXT_DOWN',
 'NEXT_SINK',
 'MAIN_BAS',
 'DIST_SINK',
 'DIST_MAIN',
 'SUB_AREA',
 'UP_AREA',
 'PFAF_ID',
 'ENDO',
 'COAST',
 'ORDER',
 'SORT',
 'geometry']

In [10]:
gdfHybas.dtypes

HYBAS_ID       int64
NEXT_DOWN      int64
NEXT_SINK      int64
MAIN_BAS       int64
DIST_SINK    float64
DIST_MAIN    float64
SUB_AREA     float64
UP_AREA      float64
PFAF_ID        int64
ENDO           int64
COAST          int64
ORDER          int64
SORT           int64
geometry      object
dtype: object

In [11]:
gdfFAO.dtypes

area        float64
LEGEND        int64
MAJ_AREA      int64
MAJ_BAS       int64
MAJ_NAME     object
SUB_AREA      int64
SUB_BAS       int64
SUB_NAME     object
TO_BAS        int64
index1        int64
geometry     object
dtype: object

In [12]:
gdfFAO['index1_copy'] = gdfFAO['index1']

In [13]:
gdfFAO = gdfFAO.set_index('index1')

In [14]:
gdfFAO.index.name

'index1'

A spatial join was performed on the data. However, the FAO polygons were stored as single polygons rather than multi-polygons, and the data lacked a unique identifier. We therefore build an identifier from the combination of MAJ_BAS and SUB_BAS. The maximum length of MAJ_BAS is 4 digits and that of SUB_BAS is 6 digits (e.g. 279252). The identifier is stored as a string with the format MAJ_BAS_xxxx_SUB_BAS_xxxxxxx, with MAJ_BAS zero-padded to 4 digits and SUB_BAS zero-padded to 7 digits.
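
The identifier format can be sketched in plain Python. The helper name `make_fao_id` is hypothetical; its format string is equivalent, for non-negative codes, to the one applied in the next cell. The pairing of 6001 with 279252 is only an illustration of the longest SUB_BAS value and is not taken from the data.

```python
def make_fao_id(maj_bas, sub_bas):
    """Build a fixed-width identifier from the two FAO basin codes."""
    # %04d pads MAJ_BAS to 4 digits, %07d pads SUB_BAS to 7 digits
    return "MAJ_BAS_%04d_SUB_BAS_%07d" % (maj_bas, sub_bas)


# A 4-digit SUB_BAS is padded with three leading zeros
example_short = make_fao_id(6001, 1001)

# A 6-digit SUB_BAS (the longest observed) still fits, with one leading zero
example_long = make_fao_id(6001, 279252)
```

Because both codes are zero-padded to a fixed width, every identifier has the same length and sorts consistently as a string.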

In [15]:
gdfFAO['FAOid'] = gdfFAO.apply(lambda x:'MAJ_BAS_%0.4d_SUB_BAS_%0.7d' % (x['MAJ_BAS'],x['SUB_BAS']),axis=1)

In [16]:
gdfFAO.index.name

'index1'

In [17]:
dfFAO = gdfFAO.drop('geometry',axis=1)

In [18]:
dfFAO.head()

| index1 | area | LEGEND | MAJ_AREA | MAJ_BAS | MAJ_NAME | SUB_AREA | SUB_BAS | SUB_NAME | TO_BAS | index1_copy | FAOid |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.550863 | 1 | 318639 | 6001 | Black Sea, South Coast | 24573 | 1001 | Bursa / Balikesir | -999 | 0 | MAJ_BAS_6001_SUB_BAS_0001001 |
| 1 | 0.811355 | 1 | 318639 | 6001 | Black Sea, South Coast | 7803 | 1002 | Kocaeli | -999 | 1 | MAJ_BAS_6001_SUB_BAS_0001002 |
| 2 | 6.58024 | 1 | 318639 | 6001 | Black Sea, South Coast | 63081 | 1003 | Sakarya River | -999 | 2 | MAJ_BAS_6001_SUB_BAS_0001003 |
| 3 | 3.172792 | 1 | 318639 | 6001 | Black Sea, South Coast | 29866 | 1004 | Duzce / Bolu / Zonguldak / Karabuk | -999 | 3 | MAJ_BAS_6001_SUB_BAS_0001004 |
| 4 | 8.138718 | 1 | 318639 | 6001 | Black Sea, South Coast | 77771 | 1005 | Kizilirmak River | -999 | 4 | MAJ_BAS_6001_SUB_BAS_0001005 |


In [19]:
gdfFAO['FAOid_copy'] = gdfFAO['FAOid']

In [20]:
gdfFAO.index.name

'index1'

In [21]:
list(gdfFAO)

['area',
 'LEGEND',
 'MAJ_AREA',
 'MAJ_BAS',
 'MAJ_NAME',
 'SUB_AREA',
 'SUB_BAS',
 'SUB_NAME',
 'TO_BAS',
 'geometry',
 'index1_copy',
 'FAOid',
 'FAOid_copy']

In [22]:
gdfFAO = gdfFAO.dissolve(by='FAOid')

In [23]:
list(gdfFAO)

['geometry',
 'area',
 'LEGEND',
 'MAJ_AREA',
 'MAJ_BAS',
 'MAJ_NAME',
 'SUB_AREA',
 'SUB_BAS',
 'SUB_NAME',
 'TO_BAS',
 'index1_copy',
 'FAOid_copy']

In [24]:
dfFAO = gdfFAO.drop('geometry',axis=1)

In [25]:
dfFAO.head()

| FAOid | area | LEGEND | MAJ_AREA | MAJ_BAS | MAJ_NAME | SUB_AREA | SUB_BAS | SUB_NAME | TO_BAS | index1_copy | FAOid_copy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAJ_BAS_1001_SUB_BAS_0001001 | 0.865953 | 1 | 701385 | 1001 | Gulf of Mexico, North Atlantic Coast | 8689 | 1001 | Upper Roanoke | 1005 | 20147 | MAJ_BAS_1001_SUB_BAS_0001001 |
| MAJ_BAS_1001_SUB_BAS_0001002 | 0.150201 | 1 | 701385 | 1001 | Gulf of Mexico, North Atlantic Coast | 1540 | 1002 | Banister | 1004 | 20148 | MAJ_BAS_1001_SUB_BAS_0001002 |
| MAJ_BAS_1001_SUB_BAS_0001003 | 0.435553 | 1 | 701385 | 1001 | Gulf of Mexico, North Atlantic Coast | 4403 | 1003 | Upper Dan | 1004 | 20149 | MAJ_BAS_1001_SUB_BAS_0001003 |
| MAJ_BAS_1001_SUB_BAS_0001004 | 0.414318 | 1 | 701385 | 1001 | Gulf of Mexico, North Atlantic Coast | 4204 | 1004 | Lower Dan | 1005 | 20150 | MAJ_BAS_1001_SUB_BAS_0001004 |
| MAJ_BAS_1001_SUB_BAS_0001005 | 0.634339 | 1 | 701385 | 1001 | Gulf of Mexico, North Atlantic Coast | 6496 | 1005 | Lower Roanoke | -999 | 20151 | MAJ_BAS_1001_SUB_BAS_0001005 |


In [26]:
# After dissolve the index holds FAOid strings, so slice by position with .iloc rather than by integer label with .loc
gdfFAOTest = gdfFAO.iloc[100:200]
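
Integer label slicing with `.loc` fails once the index holds strings, as it does after `dissolve(by='FAOid')`. A small sketch of the difference on a toy frame, assuming pandas is installed; the frame `toy` is hypothetical and only reuses a few names from the table above:

```python
import pandas as pd

# Toy frame indexed by identifier strings, like gdfFAO after dissolve(by='FAOid')
toy = pd.DataFrame(
    {"SUB_NAME": ["Upper Roanoke", "Banister", "Upper Dan", "Lower Dan"]},
    index=[
        "MAJ_BAS_1001_SUB_BAS_0001001",
        "MAJ_BAS_1001_SUB_BAS_0001002",
        "MAJ_BAS_1001_SUB_BAS_0001003",
        "MAJ_BAS_1001_SUB_BAS_0001004",
    ],
)
toy.index.name = "FAOid"

# Positional slice: second and third rows, regardless of the index labels
by_position = toy.iloc[1:3]

# Integer label slice on a string index is rejected by pandas
try:
    toy.loc[1:3]
    label_slice_failed = False
except (TypeError, KeyError):
    label_slice_failed = True
```

Note that `.iloc[100:200]` excludes the end position, whereas a label slice with `.loc` includes the end label.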

In [None]:
validGeom = gdfFAO.geometry.is_valid

In [None]:
gdfFAO.crs = {'init': u'epsg:4326'}

In [None]:
gdfFAO = gdfFAO.set_index('index1_copy')

In [None]:
gdfJoined = gpd.sjoin(gdfHybas, gdfFAO ,how="left", op='intersects')

In [None]:
list(gdfJoined)

In [None]:
gdfJoined.shape

In [None]:
series = gdfJoined.groupby('PFAF_ID')['SUB_NAME'].apply(list)
series2 = gdfJoined.groupby('PFAF_ID')['MAJ_NAME'].apply(list)
series3 = gdfJoined.groupby('PFAF_ID')['FAOid_copy'].apply(list)

In [None]:
df_new1 = series.to_frame()
df_new2 = series2.to_frame()
df_new3 = series3.to_frame()

In [None]:
df_new1.head()

In [None]:
df_out = df_new1.merge(right = df_new2, how = "outer", left_index = True, right_index = True )

In [None]:
df_out = df_out.merge(right = df_new3, how = "outer", left_index = True, right_index = True )
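
The group-by and merge cells above collect, for each PFAF_ID, every FAO name and identifier matched by the spatial join. A compact sketch of the same aggregation on made-up rows, assuming pandas is installed; the values and the names `joined` and `per_basin` are illustrative and not taken from the data:

```python
import pandas as pd

# Stand-in for gdfJoined: one row per (HydroBasin, intersecting FAO basin) pair
joined = pd.DataFrame({
    "PFAF_ID": [111011, 111011, 111012],
    "SUB_NAME": ["Basin A", "Basin B", "Basin C"],
    "MAJ_NAME": ["Major X", "Major X", "Major Y"],
    "FAOid_copy": ["id_a", "id_b", "id_c"],
})

columns = ["SUB_NAME", "MAJ_NAME", "FAOid_copy"]
grouped = joined.groupby("PFAF_ID")

# One list-valued Series per column, aligned on PFAF_ID and placed side by side
per_basin = pd.concat([grouped[c].apply(list) for c in columns], axis=1)
```

Because all three series share the same PFAF_ID index, concatenating them column-wise gives the same table as the two outer merges on the index.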

In [None]:
df_out.dtypes

In [None]:
df_out.to_csv(os.path.join(EC2_OUTPUT_PATH,OUTPUT_FILE_NAME),encoding="UTF-8")

In [None]:
!aws s3 cp {EC2_OUTPUT_PATH} {S3_OUTPUT_PATH} --recursive