# Data Ingestion of USGS Groundwater Data

Add the raw data to the project and manually create metadata for the upstream source.

Each dataset gets its own research object so that it can be tracked individually, without building one very large, interconnected research object for the whole project.

## Dataset Descriptions

Use the following attributes and entities as needed:

- Attributes
    - name
    - publisher {"@id": "#local_id"} or {"@id": "url"}
    - creator {"@id": "#person"}
    - license
    - datePublished
    - keywords (ex. "streets, aggregated features, map")
    - mainEntity Output Dataset 
- Data Entities
    - File
        - source (File(s))
        - description
        - contentSize (MB, KB, or B)
        - encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)
        - sha256
    - Dataset
        - source (File(s))
        - description
        - contentSize (MB, KB, or B)
        - encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)
    - ComputationalWorkflow
        - @type (pre-defined with built-in models) ["File", "SoftwareSourceCode", "ComputationalWorkflow"]
        - source (File)
        - author
        - programmingLanguage
        - input [{"@id": "#id1"}, {"@id": "#id2"}]
        - output [{"@id": "#id3"}]
- Contextual Entities
    - Publisher
        - @type Organization
        - name
        - url
    - Person
        - name
    - programmingLanguage
        - @type ["ComputerLanguage", "SoftwareApplication"]
        - name
        - version
        - url
    - encodingFormat
        - @id PRONOM url
        - @type Website
        - name
    - input
        - @type FormalParameter
        - name
        - defaultValue {"@id": "#ReferenceToUpstreamDataset"}
        - encodingFormat
        - valueRequired True
    - output
        - @type FormalParameter
        - name
        - defaultValue {"@id": "#ReferenceToDownstreamDataset"}
        - encodingFormat
        - valueRequired True
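
As a rough illustration of the `input` and `output` contextual entities listed above, the sketch below builds the JSON-LD dictionary for one `FormalParameter`. The helper name `formal_parameter` and the identifiers `#sites-param` and `#raw-sites` are hypothetical, and writing `encodingFormat` as an `@id` reference is an assumption; only the property names come from the list.

```python
import json


def formal_parameter(param_id, name, dataset_id, encoding_format_id):
    """Build a FormalParameter entity as a plain JSON-LD dictionary."""
    return {
        "@id": param_id,
        "@type": "FormalParameter",
        "name": name,
        # defaultValue points at the dataset the parameter is bound to
        "defaultValue": {"@id": dataset_id},
        # assumption: the format is referenced by its PRONOM URL
        "encodingFormat": {"@id": encoding_format_id},
        "valueRequired": True,
    }


example_param = formal_parameter(
    "#sites-param",  # hypothetical local identifier
    "sites",
    "#raw-sites",  # hypothetical upstream dataset identifier
    "https://www.nationalarchives.gov.uk/PRONOM/x-fmt/271",
)
print(json.dumps(example_param, indent=2))
```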

In [1]:
import os
import requests
import pandas as pd
from io import BytesIO

# Get Data

In [2]:
# Groundwater Sites in Philadelphia County to Pull Data for
# 23 sites total
list_of_site_numbers = [
    "400132075031001",
    "400001075040301",
    "400217075142101",
    "395611075091301",
    "395353075151501",
    "395416075150301",
    "395459075140501",
    "395705075135901",
    "400211075093701",
    "400327075152201",
    "400424075104901",
    "400512075033401",
    "400311075101301",
    "400055075122501",
    "400038075094601",
    "400016075102801",
    "395849075134201",
    "395656075104401",
    "395408075104001",
    "395341075102101",
    "400644074590801",
    "400308074592201",
    "400516075033201",
]

In [3]:
print(
    len(list_of_site_numbers),
    len(list_of_site_numbers) == len(set(list_of_site_numbers))
)

23 True


In [4]:
resp = requests.get("https://waterservices.usgs.gov/nwis/site/?format=rdb&sites=401029075161601&siteOutput=expanded&siteStatus=all")

In [5]:
resp.status_code

200

In [6]:
resp.text

'#\n#\n# US Geological Survey\n# retrieved: 2024-04-22 19:00:47 -04:00\t(caas01)\n#\n# The Site File stores location and general information about groundwater,\n# surface water, and meteorological sites\n# for sites in USA.\n#\n# File-format description:  http://help.waterdata.usgs.gov/faq/about-tab-delimited-output\n# Automated-retrieval info: http://waterservices.usgs.gov/rest/Site-Service.html\n#\n# Contact:   gs-w_support_nwisweb@usgs.gov\n#\n# The following selected fields are included in this output:\n#\n#  agency_cd       -- Agency\n#  site_no         -- Site identification number\n#  station_nm      -- Site name\n#  site_tp_cd      -- Site type\n#  lat_va          -- DMS latitude\n#  long_va         -- DMS longitude\n#  dec_lat_va      -- Decimal latitude\n#  dec_long_va     -- Decimal longitude\n#  coord_meth_cd   -- Latitude-longitude method\n#  coord_acy_cd    -- Latitude-longitude accuracy\n#  coord_datum_cd  -- Latitude-longitude datum\n#  dec_coord_datum_cd -- Decimal Lat

In [7]:
len_header = 0
for line in resp.text.split("\n"):
    # startswith() also handles empty lines, where line[0] would raise IndexError
    if line.startswith("#"):
        len_header += 1
    else:
        break
len_header

59

In [8]:
buffer = BytesIO()
buffer.write(bytes(resp.text, "utf-8"))

3155

In [9]:
buffer.seek(0)

0

In [10]:
df = pd.read_table(buffer, header=len_header)

In [11]:
df

Unnamed: 0,agency_cd,site_no,station_nm,site_tp_cd,lat_va,long_va,dec_lat_va,dec_long_va,coord_meth_cd,coord_acy_cd,...,local_time_fg,reliability_cd,gw_file_cd,nat_aqfr_cd,aqfr_cd,aqfr_type_cd,well_depth_va,hole_depth_va,depth_src_cd,project_no
0,5s,15s,50s,7s,16s,16s,16s,16s,1s,1s,...,1s,1s,30s,10s,8s,1s,8s,8s,1s,12s
1,USGS,401029075161601,MG 2240,GW,401029,0751616,40.1747222,-75.2711111,G,S,...,Y,C,YY Y,N300ERLMZC,231LCKG,,,,,
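
The header counting and buffer steps above can also be folded into one parsing helper. The sketch below is one possible alternative, not the approach used in this notebook: it lets pandas skip the `#` lines through `comment="#"` and then drops the field-width row (`5s`, `15s`, ...). The function name `read_rdb` and the sample text are made up for illustration, and the approach assumes no data value contains a `#`, since pandas would cut the line at that character.

```python
from io import StringIO

import pandas as pd


def read_rdb(text):
    """Parse USGS RDB text into a DataFrame of strings."""
    df = pd.read_csv(
        StringIO(text),
        sep="\t",
        comment="#",  # skip the commented header block
        dtype=str,  # keep values with leading zeros, such as "0751616", intact
    )
    # The first row after the column names holds field widths (5s, 15s, ...)
    return df.iloc[1:].reset_index(drop=True)


# Hand-written sample standing in for resp.text
sample = (
    "# US Geological Survey\n"
    "#\n"
    "agency_cd\tsite_no\tstation_nm\n"
    "5s\t15s\t50s\n"
    "USGS\t401029075161601\tMG 2240\n"
)
sites = read_rdb(sample)
print(sites)
```

Reading every column as a string keeps the site numbers from being converted to integers; numeric columns can be cast afterwards where needed.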


## Get all site metadata information

In [12]:
# exist_ok=True makes this a no-op when the directory already exists
os.makedirs("../../data/extract/USGSGroundwaterSiteMetadata", exist_ok=True)

In [13]:
base_url = "https://waterservices.usgs.gov/nwis/site/?format=rdb&sites={site_num}&siteOutput=expanded&siteStatus=all"

In [14]:
failures = []
for site_num in list_of_site_numbers:
    url = base_url.format(site_num=site_num)
    resp = requests.get(url)
    if resp.status_code != 200:
        failures.append(
            {
                "site_num": site_num,
                "url": url,
                "response_status": resp.status_code,
                "response": resp,
            }
        )
    else:
        with open(f"../../data/extract/USGSGroundwaterSiteMetadata/{site_num}.rdb", "w") as f:
            f.write(resp.text)

len(failures)

0
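
Because the same download loop is used again for the observations, it could be pulled into a helper. The following is a minimal sketch under a few assumptions: the name `download_sites` and its `fetch` parameter are hypothetical, and `fetch` is any callable that takes a URL and returns a `(status_code, text)` pair, which keeps the helper independent of the HTTP library.

```python
from pathlib import Path


def download_sites(site_numbers, url_template, out_dir, fetch):
    """Save one .rdb file per site and return a list of failures.

    `fetch` is any callable taking a URL and returning (status_code, text).
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    failures = []
    for site_num in site_numbers:
        url = url_template.format(site_num=site_num)
        status, text = fetch(url)
        if status != 200:
            # Record the failure and move on to the next site
            failures.append(
                {"site_num": site_num, "url": url, "response_status": status}
            )
            continue
        (out_path / f"{site_num}.rdb").write_text(text, encoding="utf-8")
    return failures
```

With `requests`, `fetch` could be a small wrapper that calls `requests.get(url, timeout=30)` and returns `(resp.status_code, resp.text)`; the timeout value is only a suggestion.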

## Get all site records

In [15]:
# exist_ok=True makes this a no-op when the directory already exists
os.makedirs("../../data/extract/USGSGroundwaterObservations", exist_ok=True)

In [16]:
base_url = "https://nwis.waterdata.usgs.gov/nwis/gwlevels?site_no={site_num}&agency_cd=USGS&format=rdb"

In [17]:
failures = []
for site_num in list_of_site_numbers:
    url = base_url.format(site_num=site_num)
    resp = requests.get(url)
    if resp.status_code != 200:
        failures.append(
            {
                "site_num": site_num,
                "url": url,
                "response_status": resp.status_code,
                "response": resp,
            }
        )
    else:
        with open(f"../../data/extract/USGSGroundwaterObservations/{site_num}.rdb", "w") as f:
            f.write(resp.text)

len(failures)

0

# Make Two Crates

In [18]:
import os
from datetime import date
from rocrate.rocrate import ROCrate
from rocrate.model import (
    Person,
    File,
    Dataset,
    ComputationalWorkflow,
    ContextEntity,
)

In [19]:
def get_file_size(path):
    """Return the size of the file at `path` as a string in MB, KB, or B."""
    size = os.stat(path).st_size
    if size >= 1024 * 1024:
        return f"{round(size / (1024 * 1024), 2)}MB"
    elif size >= 1024:
        return f"{round(size / 1024, 2)}KB"
    else:
        return f"{size}B"
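
The File attribute list at the top also names a `sha256` property, which this notebook does not compute. A companion helper might look like the sketch below; the name `get_sha256` and the chunk size are assumptions, while `hashlib.sha256` is the standard-library implementation.

```python
import hashlib


def get_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are not loaded into memory at once
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result could then be passed next to `contentSize` in a file's `properties`, for example as `"sha256": get_sha256(path)`.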

## First Crate - Site Metadata

In [20]:
crate = ROCrate()

### Main Attributes

- name
- publisher {"@id": "#local_id"} or {"@id": "url"}
- creator {"@id": "#person"}
- license
- datePublished
- keywords (ex. "streets, aggregated features, map")
- mainEntity Output Dataset

Contextual

- Publisher
    - @type Organization
    - name
    - url

In [21]:
name = "USGS Well Groundwater Data Site Metadata"
short_name = "USGSGroundwaterSiteMetadata"

In [22]:
usgs = crate.add(
    ContextEntity(
        crate,
        "U.S. Geological Survey",
        properties={
            "@type": "Organization",
            "name": "U.S. Geological Survey",
            "url": "https://www.usgs.gov/"
        },
    )
)

In [23]:
us_public_domain = crate.add(
    ContextEntity(
        crate,
        "U.S. Public Domain",
        properties={
            "@type": "License",
            "name": "U.S. Public Domain",
            "url": "http://www.usa.gov/publicdomain/label/1.0/",
        },
    )
)

In [24]:
date_published = date(2024, 4, 13).isoformat()

In [25]:
crate.name = name
crate.publisher = usgs
crate.creator = usgs
crate.license = us_public_domain
crate.datePublished = date_published
crate.keywords = ["USGS", "National Water Information System", "Groundwater", "Ingested Data", "Raw"]

### Dataset

- source (File(s))
- description
- contentSize (MB, KB, or B)
- encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)

Contextual 

- encodingFormat
    - @id PRONOM url
    - @type Website
    - name

In [26]:
encoding_format = crate.add(
    ContextEntity(
        crate,
        "https://www.nationalarchives.gov.uk/PRONOM/x-fmt/271",
        properties={
            "@type": "Website",
            "name": "dBASE Database",
        },
    )
)

In [27]:
# Dataset Files - Metadata
raw_files = []
for site_num in list_of_site_numbers:
    raw_file = crate.add(
        File(
            crate,
            source=f"file:///data/extract/{short_name}/{site_num}.rdb",
            properties={
                "contentSize": get_file_size(f"../../data/extract/{short_name}/{site_num}.rdb"),
                "encodingFormat": [encoding_format.properties()["name"], {"@id": encoding_format.id}],
            }
        )
    )
    raw_files.append(raw_file)

In [28]:
raw_dataset = crate.add(
    Dataset(
        crate,
        source=f"file:///data/extract/{short_name}",
        properties={
            "description": "Groundwater Well Site Metadata for Philadelphia County, Pennsylvania",
            "hasParts": [
                {"@id": raw_file.id}
                for raw_file in raw_files
            ]
        }
    )
)

In [29]:
# Main Entry
crate.mainEntity = raw_dataset

In [30]:
crate.mainEntity.source

'file:///data/extract/USGSGroundwaterSiteMetadata'

### Store Research Object

In [31]:
crate.write(f"../../metastore/{short_name}/")

### Confirm Usage of Crate

In [32]:
read_crate = ROCrate(f"../../metastore/{short_name}/")

In [33]:
for e in read_crate.get_entities():
    print(e.id, e.type)

./ Dataset
ro-crate-metadata.json CreativeWork
file:///data/extract/USGSGroundwaterSiteMetadata/400132075031001.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/400001075040301.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/400217075142101.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/395611075091301.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/395353075151501.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/395416075150301.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/395459075140501.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/395705075135901.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/400211075093701.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/400327075152201.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/400424075104901.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/400512075033401.rdb File
file:///data/extract/USGSGroundwaterSiteMetadata/400311075101301.

In [34]:
main_id = read_crate.mainEntity.id
main_id

'file:///data/extract/USGSGroundwaterSiteMetadata/'

In [35]:
file_id = read_crate.get(main_id).properties().get("hasParts")[0]["@id"]
file_id

'file:///data/extract/USGSGroundwaterSiteMetadata/400132075031001.rdb'

In [36]:
read_crate.get(file_id).source

'file:///data/extract/USGSGroundwaterSiteMetadata/400132075031001.rdb'
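
The spot checks above look at one file. A broader check could walk the whole `@graph` of the written `ro-crate-metadata.json` and report any part reference without a matching entity. The sketch below works on the plain JSON structure rather than the `rocrate` API; the function name `find_dangling_parts` and the example graph are hypothetical. The `key` parameter defaults to the `hasParts` spelling used in this notebook; note that the schema.org property is spelled `hasPart`.

```python
def find_dangling_parts(metadata, key="hasParts"):
    """Return referenced @id values that have no entity in the @graph."""
    graph = metadata["@graph"]
    known_ids = {entity["@id"] for entity in graph}
    dangling = []
    for entity in graph:
        parts = entity.get(key, [])
        if isinstance(parts, dict):  # a single reference is not wrapped in a list
            parts = [parts]
        for part in parts:
            if part["@id"] not in known_ids:
                dangling.append(part["@id"])
    return dangling


# Tiny hand-written graph standing in for a loaded ro-crate-metadata.json
example = {
    "@graph": [
        {
            "@id": "data/",
            "@type": "Dataset",
            "hasParts": [{"@id": "data/a.rdb"}, {"@id": "data/b.rdb"}],
        },
        {"@id": "data/a.rdb", "@type": "File"},
    ]
}
print(find_dangling_parts(example))
```

For a stored crate, `metadata` would come from `json.load` on the `ro-crate-metadata.json` file inside the metastore directory; an empty result means every listed part resolves.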

## Second Crate - Site Observations

In [37]:
crate = ROCrate()

### Main Attributes

- name
- publisher {"@id": "#local_id"} or {"@id": "url"}
- creator {"@id": "#person"}
- license
- datePublished
- keywords (ex. "streets, aggregated features, map")
- mainEntity Output Dataset

Contextual

- Publisher
    - @type Organization
    - name
    - url

In [38]:
name = "USGS Well Groundwater Data Site Observations"
short_name = "USGSGroundwaterObservations"

In [39]:
usgs = crate.add(
    ContextEntity(
        crate,
        "U.S. Geological Survey",
        properties={
            "@type": "Organization",
            "name": "U.S. Geological Survey",
            "url": "https://www.usgs.gov/"
        },
    )
)

In [40]:
us_public_domain = crate.add(
    ContextEntity(
        crate,
        "U.S. Public Domain",
        properties={
            "@type": "License",
            "name": "U.S. Public Domain",
            "url": "http://www.usa.gov/publicdomain/label/1.0/",
        },
    )
)

In [41]:
date_published = date(2024, 4, 13).isoformat()

In [42]:
crate.name = name
crate.publisher = usgs
crate.creator = usgs
crate.license = us_public_domain
crate.datePublished = date_published
crate.keywords = ["USGS", "National Water Information System", "Groundwater", "Ingested Data", "Raw"]

### Dataset

- source (File(s))
- description
- contentSize (MB, KB, or B)
- encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)

Contextual 

- encodingFormat
    - @id PRONOM url
    - @type Website
    - name

In [43]:
encoding_format = crate.add(
    ContextEntity(
        crate,
        "https://www.nationalarchives.gov.uk/PRONOM/x-fmt/271",
        properties={
            "@type": "Website",
            "name": "dBASE Database",
        },
    )
)

In [44]:
# Dataset Files - Metadata
raw_files = []
for site_num in list_of_site_numbers:
    raw_file = crate.add(
        File(
            crate,
            source=f"file:///data/extract/{short_name}/{site_num}.rdb",
            properties={
                "contentSize": get_file_size(f"../../data/extract/{short_name}/{site_num}.rdb"),
                "encodingFormat": [encoding_format.properties()["name"], {"@id": encoding_format.id}],
            }
        )
    )
    raw_files.append(raw_file)

In [45]:
raw_dataset = crate.add(
    Dataset(
        crate,
        source=f"file:///data/extract/{short_name}",
        properties={
            "description": "Groundwater Well Measurements for Philadelphia County, Pennsylvania",
            "hasParts": [
                {"@id": raw_file.id}
                for raw_file in raw_files
            ]
        }
    )
)

In [46]:
# Main Entry
crate.mainEntity = raw_dataset

In [47]:
crate.mainEntity.source

'file:///data/extract/USGSGroundwaterObservations'

### Store Research Object

In [48]:
crate.write(f"../../metastore/{short_name}/")

### Confirm Usage of Crate

In [49]:
read_crate = ROCrate(f"../../metastore/{short_name}/")

In [50]:
for e in read_crate.get_entities():
    print(e.id, e.type)

./ Dataset
ro-crate-metadata.json CreativeWork
file:///data/extract/USGSGroundwaterObservations/400132075031001.rdb File
file:///data/extract/USGSGroundwaterObservations/400001075040301.rdb File
file:///data/extract/USGSGroundwaterObservations/400217075142101.rdb File
file:///data/extract/USGSGroundwaterObservations/395611075091301.rdb File
file:///data/extract/USGSGroundwaterObservations/395353075151501.rdb File
file:///data/extract/USGSGroundwaterObservations/395416075150301.rdb File
file:///data/extract/USGSGroundwaterObservations/395459075140501.rdb File
file:///data/extract/USGSGroundwaterObservations/395705075135901.rdb File
file:///data/extract/USGSGroundwaterObservations/400211075093701.rdb File
file:///data/extract/USGSGroundwaterObservations/400327075152201.rdb File
file:///data/extract/USGSGroundwaterObservations/400424075104901.rdb File
file:///data/extract/USGSGroundwaterObservations/400512075033401.rdb File
file:///data/extract/USGSGroundwaterObservations/400311075101301.

In [51]:
main_id = read_crate.mainEntity.id
main_id

'file:///data/extract/USGSGroundwaterObservations/'

In [52]:
file_id = read_crate.get(main_id).properties().get("hasParts")[0]["@id"]
file_id

'file:///data/extract/USGSGroundwaterObservations/400132075031001.rdb'

In [53]:
read_crate.get(file_id).source

'file:///data/extract/USGSGroundwaterObservations/400132075031001.rdb'