# GNIS Names into the Spatial Feature Registry

#### This code is in progress. It registers the GNIS names into the Spatial Feature Registry (SFR, within the Data Distilleries GC2 instance) using the workflow below. All source data are retained unaltered, three registration fields are added (_id, reg_date, reg_source), and the data are exported to a GeoJSON file. The GeoJSON file is then uploaded to ScienceBase to document the final data as represented in the SFR. Data are currently uploaded to the SFR through a manual process, with plans to automate this step in the future.

#### General workflow involves:
     1: Retrieve data from source (GNIS National File, https://geonames.usgs.gov)
     2: Create GeoDataFrame from CSV
     3: Define variables needed throughout the process
     4: Create new ScienceBase item to describe the registration process
     5: Build and export a GeoJSON representation of the data.  This step adds two registration fields that document the registration: reg_source (points to the new SB item) and a registered uuid (_id).
     6: Upload the GeoJSON file to the new ScienceBase item to document what was registered into the SFR, along with additional information about when and how registration occurred.  This process will likely change as we introduce a more systematic way of tracking provenance.  During this step the user also uploads the data to GC2 (SFR schema).  Currently this is done manually through the UI.
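
The registration fields named in the workflow can be sketched as follows. This is a minimal illustration, not the `sfr_load_utils` implementation: the function name `add_registration_fields` is hypothetical, and the use of a random UUID string for `_id` and an ISO date string for `reg_date` are assumptions.

```python
import uuid
from datetime import date


def add_registration_fields(properties, source_uri, reg_date=None):
    """Return a copy of a feature's properties with registration fields added.

    Hypothetical helper: the field names follow the notebook text, but the
    value formats (random UUID string, ISO date) are assumptions.
    """
    registered = dict(properties)  # copy, so the source record stays unaltered
    registered['_id'] = str(uuid.uuid4())
    registered['reg_date'] = reg_date or date.today().isoformat()
    registered['reg_source'] = source_uri
    return registered


source = {'FEATURE_ID': 399, 'FEATURE_NAME': 'Agua Sal Creek'}
registered = add_registration_fields(
    source,
    'https://www.sciencebase.gov/catalog/item/5b367691e4b040769c1754e7',
    reg_date='2018-06-29')
```

Copying the properties before adding fields keeps the source record untouched, in line with the goal of retaining the source data unaltered.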
     

Code by: Daniel Wieferich (USGS)

Date: 20180510

In [1]:
#Import Needed Packages
import pandas as pd
import fiona
from shapely.geometry import Point
import geopandas as gpd
import urllib.request as ur
import subprocess
import geojson
from sfr_load_utils import *
import numpy as np

#### Step 1: Retrieve data from source


In [2]:
#Retrieve dataset from the GNIS download site
#GNIS National File, documented at https://geonames.usgs.gov

file_name = 'NationalFile'

#Define url of zipped text file download
downloadUrl = 'https://geonames.usgs.gov/docs/stategaz/NationalFile.zip'
#Download National File to local directory
ur.urlretrieve(downloadUrl, file_name+'.zip')
#Unzip file in working directory (requires 7-Zip installed at this Windows path)
subprocess.call(r'"C:\Program Files\7-Zip\7z.exe" x ' + file_name + '.zip')

0
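
The unzip step above depends on a Windows installation of 7-Zip. As a portable alternative, the sketch below uses the standard-library `zipfile` module. The function name `extract_zip` is hypothetical and is not part of `sfr_load_utils`.

```python
import zipfile


def extract_zip(zip_path, dest_dir='.'):
    """Extract every member of zip_path into dest_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()

# Usage, assuming NationalFile.zip was already downloaded to the working directory:
# extracted = extract_zip('NationalFile.zip')
```

Returning the member names is convenient here because the extracted text file carries a date suffix that the zip file name does not.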

#### Step 2: Import CSV into GeoDataFrame

In [3]:
# Read downloaded pipe-delimited file into a pandas dataframe.  The file contains non-ASCII characters, so the encoding is set explicitly (UTF-8 here)
df_gnis = pd.read_csv(file_name+'_20180601.txt', sep='|', encoding='utf-8')

In [4]:
df_gnis.head()

Unnamed: 0,FEATURE_ID,FEATURE_NAME,FEATURE_CLASS,STATE_ALPHA,STATE_NUMERIC,COUNTY_NAME,COUNTY_NUMERIC,PRIMARY_LAT_DMS,PRIM_LONG_DMS,PRIM_LAT_DEC,PRIM_LONG_DEC,SOURCE_LAT_DMS,SOURCE_LONG_DMS,SOURCE_LAT_DEC,SOURCE_LONG_DEC,ELEV_IN_M,ELEV_IN_FT,MAP_NAME,DATE_CREATED,DATE_EDITED
0,0,Bruceville Cemetery,Cemetery,CO,8,Boulder,13.0,400058N,1051226W,40.016099,-105.207171,,,,,1598.0,5243.0,Niwot,05/04/2015,10/22/2015
1,399,Agua Sal Creek,Stream,AZ,4,Apache,1.0,362740N,1092842W,36.461112,-109.478439,362053N,1090915W,36.348058,-109.154266,1645.0,5397.0,Fire Dance Mesa,02/08/1980,
2,400,Agua Sal Wash,Valley,AZ,4,Apache,1.0,363246N,1093103W,36.546112,-109.517607,362740N,1092842W,36.461112,-109.478439,1597.0,5239.0,Little Round Rock,02/08/1980,
3,401,Aguaje Draw,Valley,AZ,4,Apache,1.0,343417N,1091313W,34.571428,-109.22037,344308N,1085826W,34.7188,-108.9739,1750.0,5741.0,Kearn Lake,02/08/1980,01/14/2008
4,402,Arlington State Wildlife Area,Park,AZ,4,Maricopa,13.0,331455N,1124625W,33.248655,-112.773504,,,,,231.0,758.0,Spring Mountain,02/08/1980,


In [5]:
df_gnis.shape

(2277132, 20)

In [6]:
#Make sure no missing geometry: drop rows where either coordinate is missing, then compare the shape to the original
df_gnis_test = df_gnis.dropna(axis=0, subset=['PRIM_LAT_DEC','PRIM_LONG_DEC'], how='any')
df_gnis_test.shape

(2277132, 20)
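
The check above relies on pandas `dropna`. As a dependency-free sketch of the same idea, the snippet below scans pipe-delimited rows and reports those whose primary coordinates are blank or cannot be parsed as numbers. The sample rows are adapted from the `head()` output, with the coordinates of the second row blanked purely for illustration; the function name `rows_missing_coords` is hypothetical.

```python
import csv
import io

# Small sample in the same pipe-delimited layout (subset of columns);
# the blank coordinates on FEATURE_ID 400 are invented for this example
SAMPLE = (
    "FEATURE_ID|FEATURE_NAME|PRIM_LAT_DEC|PRIM_LONG_DEC\n"
    "399|Agua Sal Creek|36.461112|-109.478439\n"
    "400|Agua Sal Wash||\n"
    "401|Aguaje Draw|34.571428|-109.22037\n"
)


def rows_missing_coords(lines, lat_field='PRIM_LAT_DEC', lon_field='PRIM_LONG_DEC'):
    """Return FEATURE_IDs of rows whose latitude or longitude cannot be parsed."""
    missing = []
    for row in csv.DictReader(lines, delimiter='|'):
        try:
            float(row[lat_field])
            float(row[lon_field])
        except (TypeError, ValueError):
            # blank strings raise ValueError; absent fields (None) raise TypeError
            missing.append(row['FEATURE_ID'])
    return missing


missing_ids = rows_missing_coords(io.StringIO(SAMPLE))
```

An empty result would correspond to the unchanged shape reported by the pandas check.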

In [7]:
df_gnis = df_gnis.replace(np.nan, '', regex=True)
#Point expects (x, y), i.e. (longitude, latitude)
geometry = [Point(xy) for xy in zip(df_gnis.PRIM_LONG_DEC, df_gnis.PRIM_LAT_DEC)]
gdf_gnis = gpd.GeoDataFrame(df_gnis, crs={'init':'epsg:4269'}, geometry=geometry)

In [8]:
gdf_gnis.shape

(2277132, 21)
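
Shapely's `Point` takes coordinates in (x, y) order, which for geographic data means (longitude, latitude), and GeoJSON positions use the same longitude-first order. The sketch below builds a GeoJSON Point geometry from a latitude and longitude and uses range checks to catch some swapped inputs. The function name `point_geometry` is hypothetical and is not part of `sfr_load_utils`.

```python
def point_geometry(lat, lon):
    """Build a GeoJSON Point geometry; GeoJSON positions are [longitude, latitude]."""
    if not -90 <= lat <= 90:
        raise ValueError('latitude out of range: %r' % lat)
    if not -180 <= lon <= 180:
        raise ValueError('longitude out of range: %r' % lon)
    return {'type': 'Point', 'coordinates': [lon, lat]}


# Bruceville Cemetery, using the decimal coordinates from the head() output
geom = point_geometry(lat=40.016099, lon=-105.207171)
```

The range check only catches a swap when the longitude magnitude exceeds 90, so it is a partial safeguard rather than a full validation.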

In [14]:
epsg = {'code':'4269'}
expected_geom_type = 'Point'
source_uri = 'https://www.sciencebase.gov/catalog/item/5af6219be4b0da30c1b5faad'
outfile_name = 'gnis_20180601'

In [15]:
collection = df_to_geojson(gdf_gnis, epsg, source_uri, expected_geom_type)

In [16]:
file = export_geojson(outfile_name, collection)
outfile_zip = zip_geojson(outfile_name)

#### Step 3: Define Variables

In [17]:
#User Defined Variables
epsg = {'code':'4269'}
expected_geom_type = 'Point'
source_sbitem = 'https://geonames.usgs.gov'
list_tags = ['gnis names','BIS Spatial Feature Registry']
date = '2018-06-29'
data_name = 'Spatial Features from GNIS Names'

#### Step 4: Create SB Item to describe SFR Registration 

In [18]:
#Build SB Item to house SFR GeoJSON File, including description of item.  
#This step outputs source_uri (uri to the new sb item that describes the data) to be included as registration information.

#Turns list of tags into json format accepted by SB
sb_tags = build_sb_tags(list_tags)
#Create SB session and log in
sb = sb_login()   
#Creates JSON needed to build and describe new SB item
item_info = {'title': 'Spatial Feature Registration Files for GNIS Names (National File 2018-06-29) Database', 'parentId': '55fafaf5e4b05d6c4e501b81', 'summary': 'GNIS names data registered into the spatial feature registry. Source data are documented at https://geonames.usgs.gov', 'tags': [{'type': 'Subject', 'name': 'gnis'}, {'type': 'Subject', 'name': 'BIS Spatial Feature Registry'}], 'dates': [{'type': 'creation', 'dateString': '2018-06-29', 'label': 'Creation'},{'type': 'Release Date', 'dateString': '2018-06-01', 'label': 'Release Date'}], 'purpose': 'These spatial data were ingested into the Spatial Feature Registry (SFR) data system within the Biological Information System.', 'webLinks': [{'type': 'webLink', 'typeLabel': 'Web Link', 'uri': 'https://geonames.usgs.gov/docs/stategaz/NationalFile.zip', 'rel': 'related', 'title': 'source data download'}]}
print (item_info)
#Builds new SB item
new_item = build_new_sfr_sbitem(sb,item_info)
#URI of new SB item.  This is inserted into GEOJSON so we have a direct connection in SFR to documentation... this step may not
#be needed as we build prov capabilities.


username: dwieferich@usgs.gov
········
{'title': 'Spatial Feature Registration Files for GNIS Names (National File 2018-06-29) Database', 'parentId': '55fafaf5e4b05d6c4e501b81', 'summary': 'GNIS names data registered into the spatial feature registry. Source data are documented at https://geonames.usgs.gov', 'tags': [{'type': 'Subject', 'name': 'gnis'}, {'type': 'Subject', 'name': 'BIS Spatial Feature Registry'}], 'dates': [{'type': 'creation', 'dateString': '2018-06-29', 'label': 'Creation'}, {'type': 'Release Date', 'dateString': '2018-06-01', 'label': 'Release Date'}], 'purpose': 'These spatial data were ingested into the Spatial Feature Registry (SFR) data system within the Biological Information System.', 'webLinks': [{'type': 'webLink', 'typeLabel': 'Web Link', 'uri': 'https://geonames.usgs.gov/docs/stategaz/NationalFile.zip', 'rel': 'related', 'title': 'source data download'}]}
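
The item description above is written out by hand, while values such as the tags and dates are also defined as variables in Step 3. As a sketch of how the two could be tied together, the helpers below assemble a dictionary with the same keys from a few inputs. Both function names are hypothetical; the real `build_sb_tags` lives in `sfr_load_utils`, is not shown here, and may behave differently. The `purpose` and `webLinks` entries are omitted for brevity.

```python
def build_subject_tags(tag_names):
    """Turn a list of tag names into the list-of-dicts form seen in item_info."""
    return [{'type': 'Subject', 'name': name} for name in tag_names]


def build_item_info(title, parent_id, summary, tag_names, creation_date, release_date):
    """Assemble an item description with the same keys as the dict printed above."""
    return {
        'title': title,
        'parentId': parent_id,
        'summary': summary,
        'tags': build_subject_tags(tag_names),
        'dates': [
            {'type': 'creation', 'dateString': creation_date, 'label': 'Creation'},
            {'type': 'Release Date', 'dateString': release_date, 'label': 'Release Date'},
        ],
    }


item_info_sketch = build_item_info(
    title='Spatial Feature Registration Files for GNIS Names',
    parent_id='55fafaf5e4b05d6c4e501b81',
    summary='GNIS names data registered into the spatial feature registry.',
    tag_names=['gnis', 'BIS Spatial Feature Registry'],
    creation_date='2018-06-29',
    release_date='2018-06-01',
)
```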


In [19]:
source_uri = str(new_item['link']['url'])
print (source_uri)

https://www.sciencebase.gov/catalog/item/5b367691e4b040769c1754e7


#### Step 5: Build and export GeoJSON representation of data.  Add registration id and source_uri (newly created SB item). Verify that the correct number of features were included in the GeoJSON dataset.

In [20]:
collection = df_to_geojson(gdf_gnis, epsg, source_uri, expected_geom_type)
print (verify_correct_count(collection, gdf_gnis))

#export_geojson(outfile_name, collection)
#Add file to SB Item

Correct number of features
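
`verify_correct_count` comes from `sfr_load_utils` and its source is not shown in this notebook. A minimal sketch of such a check, assuming it simply compares the number of features in the collection with the number of rows in the source table, might look like this. The function name `feature_count_matches` is hypothetical.

```python
def feature_count_matches(collection, expected_count):
    """Return True when the collection holds exactly expected_count features."""
    return len(collection['features']) == expected_count


# Two-feature collection built from rows shown in the head() output
sample_collection = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'FEATURE_ID': 399},
         'geometry': {'type': 'Point', 'coordinates': [-109.478439, 36.461112]}},
        {'type': 'Feature', 'properties': {'FEATURE_ID': 400},
         'geometry': {'type': 'Point', 'coordinates': [-109.517607, 36.546112]}},
    ],
}
```

With a GeoDataFrame, the expected count would be `len(gdf_gnis)`.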


In [21]:
file = export_geojson(outfile_name, collection)
outfile_zip = zip_geojson(outfile_name)
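
`export_geojson` and `zip_geojson` are also imported from `sfr_load_utils`. The sketch below shows one way the same two steps could be done with the standard library, assuming the output is a UTF-8 `.geojson` file placed in a zip of the same base name. The function name `write_geojson_zip` and its `out_dir` parameter are hypothetical.

```python
import json
import os
import zipfile


def write_geojson_zip(outfile_name, collection, out_dir='.'):
    """Write collection to <outfile_name>.geojson and zip it; return both paths."""
    geojson_path = os.path.join(out_dir, outfile_name + '.geojson')
    zip_path = os.path.join(out_dir, outfile_name + '.zip')
    # ensure_ascii=False keeps non-ASCII place names readable; the file is UTF-8
    with open(geojson_path, 'w', encoding='utf-8') as handle:
        json.dump(collection, handle, ensure_ascii=False)
    with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname stores the file without its directory prefix
        archive.write(geojson_path, arcname=os.path.basename(geojson_path))
    return geojson_path, zip_path
```

Writing the file as UTF-8 matches the encoding that Step 6 asks for when uploading to GC2.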

#### Step 6: Upload the GeoJSON file to the ScienceBase item, and also upload it to GC2 using the UI (make sure to specify UTF-8 encoding and the Point geometry type).

In [22]:
sb.upload_file_to_item(new_item, outfile_zip)

{'body': 'GNIS Names data registered into the spatial feature registry. Source data are documented at&nbsp;<a href="https://geonames.usgs.gov%27%2C/" style="box-sizing: border-box; color: rgb(35, 82, 124); text-decoration-line: underline; outline: -webkit-focus-ring-color auto 5px; outline-offset: -2px; font-family: &quot;Helvetica Neue&quot;, Helvetica, Arial, sans-serif; font-size: 14px;">https://geonames.usgs.gov\'</a>.',
 'dates': [{'dateString': '2018-06-29',
   'label': 'Creation',
   'type': 'creation'},
  {'dateString': '2018-06-01',
   'label': 'Release Date',
   'type': 'Release Date'}],
 'distributionLinks': [{'files': [{'contentType': 'application/zip',
     'name': 'gnis_20180601.zip',
     'size': 195750841,
     'title': None}],
   'name': 'SpatialFeatureR.zip',
   'rel': 'alternate',
   'title': 'Download Attached Files',
   'type': 'downloadLink',
   'typeLabel': 'Download Link',
   'uri': 'https://www.sciencebase.gov/catalog/file/get/5b367691e4b040769c1754e7'}],
 'fil

In [None]:
#Currently the new SB item needs to have some additional information uploaded.  The UI can be used for this for now but in the future we will want to build as much as we can into this process.

In [1]:
#Import Needed Packages
import pandas as pd
import fiona
from shapely.geometry import Point
import geopandas as gpd
import urllib.request as ur
import subprocess
import geojson
from sfr_load_utils import *
import numpy as np

In [None]:
#Count features per state (run after gdf_gnis is loaded in the next cell)
gdf_gnis.groupby(by='STATE_ALPHA')['FEATURE_ID'].count()

In [2]:
gdf_gnis = gpd.read_file('gnis_20180601.geojson')

In [74]:
gdf_gnis.shape

(2277132, 23)

In [72]:
#df2 = gdf_gnis[(gdf_gnis['FEATURE_ID'] < 800000)]
#df2 = gdf_gnis[(gdf_gnis['FEATURE_ID'] >= 800000) & (gdf_gnis['FEATURE_ID'] < 1100000)]
#df2 = gdf_gnis[(gdf_gnis['FEATURE_ID'] >= 1100000) & (gdf_gnis['FEATURE_ID'] < 1400000)]
#df2 = gdf_gnis[(gdf_gnis['FEATURE_ID'] >= 1400000) & (gdf_gnis['FEATURE_ID'] < 1700000)]
#df2 = gdf_gnis[(gdf_gnis['FEATURE_ID'] >= 1700000) & (gdf_gnis['FEATURE_ID'] < 2000000)]
#df2 = gdf_gnis[(gdf_gnis['FEATURE_ID'] >= 2000000) & (gdf_gnis['FEATURE_ID'] < 2300000)]
df2 = gdf_gnis[(gdf_gnis['FEATURE_ID'] >= 2300000)]
#rename returns a new frame, which avoids the pandas SettingWithCopyWarning raised by an in-place rename on a slice
df2 = df2.rename(columns={'_id': 'identifier'})


In [73]:
df2.shape

(317777, 23)
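
The export above is split into chunks by commenting and uncommenting `FEATURE_ID` range filters. As a sketch of doing the same split in one pass, the snippet below assigns each record to a range using the boundaries from those filters. It works on plain dictionaries rather than a GeoDataFrame, and the names `EDGES`, `chunk_index`, and `split_by_feature_id` are hypothetical.

```python
from bisect import bisect_right

# Range boundaries taken from the commented-out filters above
EDGES = [800000, 1100000, 1400000, 1700000, 2000000, 2300000]


def chunk_index(feature_id, edges=EDGES):
    """Return the index of the FEATURE_ID range that feature_id falls into.

    Index 0 is feature_id < edges[0]; index len(edges) is feature_id >= edges[-1].
    """
    return bisect_right(edges, feature_id)


def split_by_feature_id(records, edges=EDGES):
    """Group records (dicts with a FEATURE_ID key) into len(edges) + 1 chunks."""
    chunks = [[] for _ in range(len(edges) + 1)]
    for record in records:
        chunks[chunk_index(record['FEATURE_ID'], edges)].append(record)
    return chunks


records = [{'FEATURE_ID': i} for i in (399, 799999, 800000, 2299999, 2300000)]
chunks = split_by_feature_id(records)
```

`bisect_right` places a value equal to a boundary in the upper range, which matches the `>=` lower bounds used in the filters.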

In [75]:
epsg = {'code':'4269'}
expected_geom_type = 'Point'
source_uri = 'https://www.sciencebase.gov/catalog/item/5af6219be4b0da30c1b5faad'
outfile_name = 'gnis/gnis_20180601'

In [76]:
collection = df_2_geojson(df2, epsg, expected_geom_type)


In [77]:
file = export_geojson(outfile_name, collection)


In [7]:
def df_2_geojson(df, epsg, expected_geom_type):
    """Build a GeoJSON-style FeatureCollection dict from a GeoDataFrame.

    Every non-geometry column becomes a feature property.
    """
    #Create basic JSON structure
    collection = {'type':'FeatureCollection', 'crs': {'type': 'epsg', 'properties': epsg }, 'features':[]}

    #List field names in df (all columns except geometry); these populate the properties section
    field_names = [name for name in df.columns if name != 'geometry']

    #For each row in dataframe, populate geometry and properties
    for row in df.itertuples():
        #fix_geom_type is not defined in this notebook; presumably provided by sfr_load_utils
        geom = fix_geom_type(row, expected_geom_type)

        feature = {'type':'Feature',
                   'properties':{},
                   'geometry': geom}
        for field in field_names:
            feature['properties'][field] = getattr(row, field)

        collection['features'].append(feature)

    return collection
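
`DataFrame.itertuples` returns namedtuples, and pandas documents that column names starting with an underscore are renamed to positional names. That is a likely reason the `_id` column is renamed to `identifier` before calling `df_2_geojson`, since `getattr(row, '_id')` would otherwise fail. The standard-library sketch below reproduces the renaming with `collections.namedtuple(..., rename=True)`; that pandas builds its row tuples this way is stated here as an assumption about its internals.

```python
from collections import namedtuple

# rename=True replaces invalid field names (such as ones starting with an
# underscore) with positional names of the form _<index>
Row = namedtuple('Pandas', ['Index', 'FEATURE_ID', '_id'], rename=True)
row = Row(0, 399, 'abc-123')

print(Row._fields)          # field names after renaming
print(hasattr(row, '_id'))  # whether the original name is still reachable
```

Renaming the column to a name without a leading underscore keeps it addressable by name inside the loop.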