### Spatially Join FAO Names to hydrobasins level 6

* Purpose of script: Spatially join the FAO names (from the buffered FAO hydrobasins) to the official HydroBasins level 6 polygons
* Author: Rutger Hofste
* Kernel used: python35
* Date created: 20170825

In [1]:
S3_INPUT_PATH_FAO = "s3://wri-projects/Aqueduct30/processData/Y2017M08D23_RH_Buffer_FAONames_V01/output/"
S3_INPUT_PATH_HYBAS = "s3://wri-projects/Aqueduct30/processData/Y2017M08D02_RH_Merge_HydroBasins_V01/output/"
S3_OUTPUT_PATH = "s3://wri-projects/Aqueduct30/processData/Y2017M08D25_RH_spatial_join_FAONames_V01/output/"
EC2_INPUT_PATH = "/volumes/data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/"
EC2_OUTPUT_PATH = "/volumes/data/Y2017M08D25_RH_spatial_join_FAONames_V01/output/"
INPUT_FILE_NAME_FAO = "hydrobasins_fao_fiona_merged_buffered_v01.shp"
INPUT_FILE_NAME_HYBAS = "hybas_lev06_v1c_merged_fiona_V01.shp"
OUTPUT_FILE_NAME = "hybas_lev06_v1c_merged_fiona_withFAO_V01.csv"

In [2]:
# -f keeps the cell from failing on a first run, when the folders do not exist yet
!rm -rf {EC2_INPUT_PATH}
!rm -rf {EC2_OUTPUT_PATH}

!mkdir -p {EC2_INPUT_PATH}
!mkdir -p {EC2_OUTPUT_PATH}

In [7]:
!aws s3 cp {S3_INPUT_PATH_FAO} {EC2_INPUT_PATH} --recursive 

download: s3://wri-projects/Aqueduct30/processData/Y2017M08D23_RH_Buffer_FAONames_V01/output/hydrobasins_fao_fiona_merged_buffered_v01.cpg to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hydrobasins_fao_fiona_merged_buffered_v01.cpg
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D23_RH_Buffer_FAONames_V01/output/hydrobasins_fao_fiona_merged_buffered_v01.prj to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hydrobasins_fao_fiona_merged_buffered_v01.prj
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D23_RH_Buffer_FAONames_V01/output/hydrobasins_fao_fiona_merged_buffered_v01.shx to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hydrobasins_fao_fiona_merged_buffered_v01.shx
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D23_RH_Buffer_FAONames_V01/output/hydrobasins_fao_fiona_merged_buffered_v01.dbf to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hydrobasins_fao_fiona_merged_buffered_

In [4]:
!aws s3 cp {S3_INPUT_PATH_HYBAS} {EC2_INPUT_PATH} --recursive --exclude "*.tif"

download: s3://wri-projects/Aqueduct30/processData/Y2017M08D02_RH_Merge_HydroBasins_V01/output/hybas_lev00_v1c_merged_fiona_V01.cpg to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hybas_lev00_v1c_merged_fiona_V01.cpg
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D02_RH_Merge_HydroBasins_V01/output/hybas_lev06_v1c_merged_fiona_V01.cpg to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hybas_lev06_v1c_merged_fiona_V01.cpg
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D02_RH_Merge_HydroBasins_V01/output/hybas_lev00_v1c_merged_fiona_V01.prj to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hybas_lev00_v1c_merged_fiona_V01.prj
download: s3://wri-projects/Aqueduct30/processData/Y2017M08D02_RH_Merge_HydroBasins_V01/output/hybas_lev06_v1c_merged_fiona_V01.prj to ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/input/hybas_lev06_v1c_merged_fiona_V01.prj
download: s3://wri-projects/Aqueduct30/processData/Y2017

In [5]:
import os
import time

# GDAL needs the GDAL_DATA variable to obtain the spatial reference.
# Set a default location if the variable is missing.
if 'GDAL_DATA' not in os.environ:
    os.environ['GDAL_DATA'] = r'/usr/share/gdal/2.1'
from osgeo import gdal, ogr, osr

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
%matplotlib notebook


In [8]:
gdfFAO = gpd.read_file(os.path.join(EC2_INPUT_PATH,INPUT_FILE_NAME_FAO))

In [9]:
list(gdfFAO)

['LEGEND',
 'MAJ_AREA',
 'MAJ_BAS',
 'MAJ_NAME',
 'SUB_AREA',
 'SUB_BAS',
 'SUB_NAME',
 'TO_BAS',
 'area',
 'geometry',
 'index1']

In [10]:
gdfHybas = gpd.read_file(os.path.join(EC2_INPUT_PATH,INPUT_FILE_NAME_HYBAS))

In [11]:
list(gdfHybas)

['COAST',
 'DIST_MAIN',
 'DIST_SINK',
 'ENDO',
 'HYBAS_ID',
 'MAIN_BAS',
 'NEXT_DOWN',
 'NEXT_SINK',
 'ORDER',
 'PFAF_ID',
 'SORT',
 'SUB_AREA',
 'UP_AREA',
 'geometry']

In [12]:
gdfHybas.dtypes

COAST          int64
DIST_MAIN    float64
DIST_SINK    float64
ENDO           int64
HYBAS_ID       int64
MAIN_BAS       int64
NEXT_DOWN      int64
NEXT_SINK      int64
ORDER          int64
PFAF_ID        int64
SORT           int64
SUB_AREA     float64
UP_AREA      float64
geometry      object
dtype: object

In [13]:
gdfFAO.dtypes

LEGEND        int64
MAJ_AREA      int64
MAJ_BAS       int64
MAJ_NAME     object
SUB_AREA      int64
SUB_BAS       int64
SUB_NAME     object
TO_BAS        int64
area        float64
geometry     object
index1        int64
dtype: object

In [14]:
gdfFAO['index1_copy'] = gdfFAO['index1']

In [15]:
gdfFAO = gdfFAO.set_index('index1')

In [16]:
gdfFAO.index.name

'index1'

A spatial join was performed on the data. However, the FAO basins are stored as polygons and not as multi-polygons, and the data lack a unique identifier. The identifier is therefore built from the combination of MAJ_BAS and SUB_BAS. MAJ_BAS has at most 4 digits and SUB_BAS at most 6 (e.g. 279252). We will store the identifier as a string with the format MAJ_BAS_xxxx_SUB_BASE_xxxxxxx, with MAJ_BAS zero-padded to 4 digits and SUB_BAS zero-padded to 7 digits.
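The identifier format can be tried on single values before applying it to the whole table. The snippet below is a minimal sketch, not part of the original notebook: `make_fao_id` and `fits_padding` are hypothetical helper names, and the `%04d` / `%07d` padding is assumed to be equivalent to the `%0.4d` / `%0.7d` used in the notebook for non-negative integer codes.

```python
def make_fao_id(maj_bas, sub_bas):
    """Build the composite identifier from zero-padded integer codes.

    Hypothetical helper; assumes both codes are non-negative integers.
    """
    return "MAJ_BAS_%04d_SUB_BASE_%07d" % (maj_bas, sub_bas)


def fits_padding(maj_bas, sub_bas):
    """Check that the codes do not exceed the padded widths (4 and 7 digits)."""
    return len(str(maj_bas)) <= 4 and len(str(sub_bas)) <= 7


# Values taken from the first row of the FAO table
example_id = make_fao_id(6001, 1001)
print(example_id)
```

If `fits_padding` returned `False` for any row, identifiers would no longer have a fixed width, which is worth checking before relying on string sorting or slicing.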

In [17]:
gdfFAO['FAOid'] = gdfFAO.apply(lambda x:'MAJ_BAS_%0.4d_SUB_BASE_%0.7d' % (x['MAJ_BAS'],x['SUB_BAS']),axis=1)

In [18]:
gdfFAO.index.name

'index1'

In [20]:
dfFAO = gdfFAO.drop('geometry', axis=1)

In [21]:
dfFAO.head()

index1,LEGEND,MAJ_AREA,MAJ_BAS,MAJ_NAME,SUB_AREA,SUB_BAS,SUB_NAME,TO_BAS,area,index1_copy,FAOid
0,1,318639,6001,"Black Sea, South Coast",24573,1001,Bursa / Balikesir,-999,2.550863,0,MAJ_BAS_6001_SUB_BASE_0001001
1,1,318639,6001,"Black Sea, South Coast",7803,1002,Kocaeli,-999,0.811355,1,MAJ_BAS_6001_SUB_BASE_0001002
2,1,318639,6001,"Black Sea, South Coast",63081,1003,Sakarya River,-999,6.58024,2,MAJ_BAS_6001_SUB_BASE_0001003
3,1,318639,6001,"Black Sea, South Coast",29866,1004,Duzce / Bolu / Zonguldak / Karabuk,-999,3.172792,3,MAJ_BAS_6001_SUB_BASE_0001004
4,1,318639,6001,"Black Sea, South Coast",77771,1005,Kizilirmak River,-999,8.138718,4,MAJ_BAS_6001_SUB_BASE_0001005


In [22]:
gdfFAO['FAOid_copy'] = gdfFAO['FAOid']

In [25]:
gdfFAO.index.name

'FAOid'

In [23]:
list(gdfFAO)

['LEGEND',
 'MAJ_AREA',
 'MAJ_BAS',
 'MAJ_NAME',
 'SUB_AREA',
 'SUB_BAS',
 'SUB_NAME',
 'TO_BAS',
 'area',
 'geometry',
 'index1_copy',
 'FAOid',
 'FAOid_copy']

In [24]:
gdfFAO = gdfFAO.dissolve(by='FAOid')

In [None]:
list(gdfFAO)

In [None]:
dfFAO = gdfFAO.drop('geometry', axis=1)

In [None]:
dfFAO.head()

In [None]:
# After the dissolve the index holds FAOid strings, so select rows by position
gdfFAOTest = gdfFAO.iloc[100:200]

In [None]:
validGeom = gdfFAO.geometry.is_valid

In [27]:
gdfFAO.crs = {'init': u'epsg:4326'}

In [29]:
gdfFAO = gdfFAO.set_index('index1_copy')

In [30]:
gdfJoined = gpd.sjoin(gdfHybas, gdfFAO ,how="left", op='intersects')

In [31]:
list(gdfJoined)

['COAST',
 'DIST_MAIN',
 'DIST_SINK',
 'ENDO',
 'HYBAS_ID',
 'MAIN_BAS',
 'NEXT_DOWN',
 'NEXT_SINK',
 'ORDER',
 'PFAF_ID',
 'SORT',
 'SUB_AREA_left',
 'UP_AREA',
 'geometry',
 'index_right',
 'LEGEND',
 'MAJ_AREA',
 'MAJ_BAS',
 'MAJ_NAME',
 'SUB_AREA_right',
 'SUB_BAS',
 'SUB_NAME',
 'TO_BAS',
 'area',
 'FAOid_copy']

In [33]:
gdfJoined.shape

(24078, 25)

In [37]:
series = gdfJoined.groupby('PFAF_ID')['SUB_NAME'].apply(list)
series2 = gdfJoined.groupby('PFAF_ID')['MAJ_NAME'].apply(list)
series3 = gdfJoined.groupby('PFAF_ID')['FAOid_copy'].apply(list)

In [38]:
df_new1 = series.to_frame()
df_new2 = series2.to_frame()
df_new3 = series3.to_frame()

In [36]:
df_new1.head()

PFAF_ID,SUB_NAME
111011,[Wadi El Naqa]
111012,[Egyptian east coast]
111013,[Egyptian east coast]
111014,[Egyptian east coast]
111015,[Egyptian east coast]
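Each PFAF_ID ends up with a list because an `intersects` join can match one HydroBasins polygon to more than one FAO polygon. The following plain-Python sketch mirrors what `groupby('PFAF_ID')[...].apply(list)` does; the rows and names in it are made up for illustration and do not come from the data.

```python
from collections import defaultdict

# Hypothetical joined rows as (PFAF_ID, SUB_NAME); basin 2 matches two FAO polygons
joined_rows = [
    (1, "sub-basin A"),
    (2, "sub-basin A"),
    (2, "sub-basin B"),
]

# Collect all matched names per basin, preserving row order
names_per_basin = defaultdict(list)
for pfaf_id, sub_name in joined_rows:
    names_per_basin[pfaf_id].append(sub_name)

print(dict(names_per_basin))
```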


In [45]:
df_out = df_new1.merge(right = df_new2, how = "outer", left_index = True, right_index = True )

In [46]:
df_out = df_out.merge(right = df_new3, how = "outer", left_index = True, right_index = True )

In [47]:
df_out.dtypes

SUB_NAME      object
MAJ_NAME      object
FAOid_copy    object
dtype: object

In [48]:
df_out.to_csv(os.path.join(EC2_OUTPUT_PATH,OUTPUT_FILE_NAME),encoding="UTF-8")
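Because the columns hold Python lists, the CSV stores each cell as the text representation of a list (for example `['Wadi El Naqa']`). A downstream script has to parse those strings back into lists. The sketch below shows one way to do that with the standard library; `parse_list_cell` is a hypothetical helper, and the two sample rows only imitate the layout of the output file.

```python
import ast
import csv
import io


def parse_list_cell(cell):
    """Turn a stringified list such as "['a', 'b']" back into a Python list.

    Cells that are not valid Python literals (e.g. "[nan]") are returned unchanged.
    """
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell
    return value if isinstance(value, list) else [value]


# Build a small in-memory CSV that imitates the layout of the output file
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["PFAF_ID", "SUB_NAME"])
writer.writerow([111011, str(["Wadi El Naqa"])])
writer.writerow([111012, str(["Egyptian east coast"])])
sample.seek(0)

# Read it back and restore the lists
parsed = {int(row["PFAF_ID"]): parse_list_cell(row["SUB_NAME"])
          for row in csv.DictReader(sample)}
print(parsed)
```

Basins without a match in the left join would likely show up as `[nan]`, which `ast.literal_eval` cannot parse; the helper leaves such cells as raw text so they can be handled explicitly.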

In [49]:
!aws s3 cp {EC2_OUTPUT_PATH} {S3_OUTPUT_PATH} --recursive

upload: ../../../../data/Y2017M08D25_RH_spatial_join_FAONames_V01/output/hybas_lev06_v1c_merged_fiona_withFAO_V01.csv to s3://wri-projects/Aqueduct30/processData/Y2017M08D25_RH_spatial_join_FAONames_V01/output/hybas_lev06_v1c_merged_fiona_withFAO_V01.csv
