This notebook explores a new technique for working through the USGS "/science/" pages. These are the web pages created and maintained by USGS Science Centers describing their work and activities. They represent the best (and almost only) online, public documentation of USGS projects. They sometimes trace back to internal project management processes.

The notebook has two parts:
* New(ish) function for scraping project web pages (there is still no API available for any other type of access to the material)
* Exploratory process using an OpenAI large language model to read the textual descripive information from the page and return linkable concepts

The basic prompt explored here ends up doing a more thorough job with a much faster response time than any other methods I've experimented with in the past. I run through two examples, one in a more ecosystems-related field and the other in a geoscience field. The basic concept of operations with this is to use AI to identify important things we want to know about from our texts and then link those to defined concepts in the GeoKB. We won't link everything a given pass with an LLM identifies, but the confirmed linkages we make will theoretically prove more meaningful and usable.

One interesting dynamic I'm exploring here is how to ask LLMs designed for chat completion and conversation to return information in a structures suitable for further processing and digestion with software codes. The interplay between human-language questioning and discourse and encoding of key aspects of the conversation into "data" is kind of fascinating.

In [128]:
import os
import requests
from bs4 import BeautifulSoup
import openai
import dateutil.parser
import json
from wbmaker import WikibaseConnection

In [151]:
def scrape_project(url):
    usgs_base_url = "https://www.usgs.gov"
    if "science" not in [i.lower() for i in url.split("/")]:
        return

    r = requests.get(url)
    if r.status_code != 200:
        return

    soup = BeautifulSoup(r.content, 'html.parser')
    project = {}

    title_section = soup.find('h1')
    title_section_parts = [i.strip() for i in title_section.text.split('\n') if len(i.strip()) > 0]
    project["label"] = title_section_parts[0]
    project["status"] = title_section_parts[1]

    by_line_link = soup.find('span', {'class': 'by-line'}).find('a')
    project["org_owner"] = by_line_link.text.strip()
    project["org_owner_link"] = f"{usgs_base_url}{by_line_link['href']}"

    project["article_date"] = dateutil.parser.parse(soup.find('span', {'class': 'date'}).text.strip()).isoformat()

    project["contacts"] = []

    contact_section = soup.find('div', {'class': 'paragraph--type--contacts'})
    for c in contact_section.find_all('div', {'class': 'field-contacts'}):
        contact = {}
        name_section = c.find('h4')
        contact["name"] = name_section.find('span').text.strip()
        contact["link"] = f"{usgs_base_url}{name_section.find('a')['href']}"
        contact["email"] = c.find('div', {'class': 'field-email'}).text.strip()
        project["contacts"].append(contact)

    project["keywords"] = []
    explore_section = soup.find('div', {'class': 'c-usgs-explore-science'})
    if explore_section:
        for item in explore_section.find('ul').find_all('li'):
            project["keywords"].append({
                "term": item.text.strip(),
                "link": f"{usgs_base_url}{item.find('a')['href']}"
            })

    project["partners"] = []
    partner_section = soup.find('div', {'id': 'partners'})
    if partner_section:
        for item in partner_section.find_all('a'):
            project["partners"].append({
                "partner_name": item.text.strip(),
                "partner_link": item['href']
            })

    overview_section = soup.find('div', {'id': 'overview'})
    project["overview_texts"] = []
    for p in overview_section.find_all('p'):
        p_text = p.text.strip()
        if not p_text.startswith('Return to') and len(p_text) > 50:
            project["overview_texts"].append(p.text.strip())

    return project

prompts = {
    "concepts": "Read the description of a scientific project provided after this sentence and return the following in objects and lists of a JSON string: concise summary of less than 250 characters, list of study objectives, named geographic places that can be found on a map, other types of places described in words, organization names and descriptions of their role, scientific disciplines, scientific methods and techniques being used, types of scientific data used or produced, types of scientific models used or produced, geologic formations discussed, rock types identified, geologic time periods described, mineral and/or energy commodities identified, mineral species mentioned, biological species mentioned.",
}

def llm_text(long_text, prompt="summary", model="gpt-3.5-turbo"):
    prompt_str = prompts[prompt]
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a very helpful assistant."
            },
            {
                "role": "user",
                "content": f"{prompt_str}\n\n {long_text}"
            },
        ]
    )

    if response and "choices" in response:
        return json.loads(response["choices"][0]["message"]["content"])
    
    return

def llm_definitions(terms, context, model="gpt-3.5-turbo"):
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a very helpful assistant."
            },
            {
                "role": "user",
                "content": f"For the following list of terms you provided in a previous response about {context}, please provide concise definitions for each term and a link to an online source of definition (if available) in a list of JSON objects: {', '.join(terms)}"
            },
        ]
    )

    if response and "choices" in response:
        return response["choices"][0]["message"]["content"]
    
    return

# Bio/Eco Example

In [126]:
sample_project = scrape_project("https://www.usgs.gov/programs/climate-research-and-development-program/science/impacts-coastal-and-watershed-changes")
sample_project

{'label': 'Impacts of coastal and watershed changes on upper estuaries: causes and implications of wetland ecosystem transitions along the US Atlantic and Gulf Coasts',
 'status': 'Active',
 'org_owner': 'Climate Research and Development Program',
 'org_owner_link': 'https://www.usgs.gov/programs/climate-research-and-development-program',
 'article_date': '2020-02-26T00:00:00',
 'contacts': [{'name': 'Ken Krauss, Ph.D.',
   'link': 'https://www.usgs.gov/staff-profiles/ken-krauss',
   'email': 'kraussk@usgs.gov'},
  {'name': 'Gregory Noe',
   'link': 'https://www.usgs.gov/staff-profiles/gregory-noe',
   'email': 'gnoe@usgs.gov'},
  {'name': 'Camille LaFosse Stagg, Ph.D.',
   'link': 'https://www.usgs.gov/staff-profiles/camille-lafosse-stagg',
   'email': 'staggc@usgs.gov'},
  {'name': 'Hongqing Wang, Ph.D.',
   'link': 'https://www.usgs.gov/staff-profiles/hongqing-wang',
   'email': 'wangh@usgs.gov'},
  {'name': 'Eric J Ward, Ph.D.',
   'link': 'https://www.usgs.gov/staff-profiles/eric-

In [127]:
long_text = " ".join(sample_project['overview_texts'])
print(len(long_text))

7026


In [131]:
extracted_concepts = llm_text(
    long_text=long_text,
    prompt="concepts"
)

In [132]:
extracted_concepts

{'concise_summary': "This project studies the conversion of tidal freshwater forested wetlands to 'Ghost Forests' caused by tidal extension and human influences, and its impact on the upper estuarine wetlands' resiliency and function, using an integrated approach to research and monitoring and scientific methods in different coastal areas. The results guide future decisions on ecosystem restoration, carbon, water quality, coastal resilience, wildlife and fisheries.",
 'geographic_places': ['Atlantic Coast',
  'Gulf Coast',
  'southeastern United States',
  'international coastal areas'],
 'other_places': ['watersheds',
  'estuaries',
  'oligohaline marsh',
  'nontidal floodplains'],
 'organizations': [{'name': 'National Science Foundation (NSF)',
   'description': 'funding agency that supports the project through the Coastal SEES (Science, Engineering, and Education for Sustainability) Program'},
  {'name': 'University of Maryland (UMD)',
   'description': 'lead institution of research

# Geo Example

In [134]:
geo_project_url = "https://www.usgs.gov/centers/gmeg/science/concealed-rare-earth-element-ree-terranes-southern-basin-and-range-geologic"
geo_project_scrape = scrape_project("https://www.usgs.gov/programs/climate-research-and-development-program/science/impacts-coastal-and-watershed-changes")
geo_project_scrape

{'label': 'Impacts of coastal and watershed changes on upper estuaries: causes and implications of wetland ecosystem transitions along the US Atlantic and Gulf Coasts',
 'status': 'Active',
 'org_owner': 'Climate Research and Development Program',
 'org_owner_link': 'https://www.usgs.gov/programs/climate-research-and-development-program',
 'article_date': '2020-02-26T00:00:00',
 'contacts': [{'name': 'Ken Krauss, Ph.D.',
   'link': 'https://www.usgs.gov/staff-profiles/ken-krauss',
   'email': 'kraussk@usgs.gov'},
  {'name': 'Gregory Noe',
   'link': 'https://www.usgs.gov/staff-profiles/gregory-noe',
   'email': 'gnoe@usgs.gov'},
  {'name': 'Camille LaFosse Stagg, Ph.D.',
   'link': 'https://www.usgs.gov/staff-profiles/camille-lafosse-stagg',
   'email': 'staggc@usgs.gov'},
  {'name': 'Hongqing Wang, Ph.D.',
   'link': 'https://www.usgs.gov/staff-profiles/hongqing-wang',
   'email': 'wangh@usgs.gov'},
  {'name': 'Eric J Ward, Ph.D.',
   'link': 'https://www.usgs.gov/staff-profiles/eric-

In [135]:
geo_project_long_text = " ".join(geo_project_scrape['overview_texts'])
print(len(geo_project_long_text))

7026


In [137]:
geo_project_extracted_concepts = llm_text(
    long_text=geo_project_long_text,
    prompt="concepts"
)

In [138]:
geo_project_extracted_concepts

{'concise_summary': 'This project investigates the impacts of tidal extension on estuaries and wetlands due to rising sea levels and human activities, and aims to model the processes to guide ecosystem restoration and future decisions in coastal areas along the Atlantic Coast, Gulf Coast, and internationally.',
 'study_objectives': ['To understand and describe the effects of tidal extension in freshwater wetlands and nontidal floodplains, and changes to sedimentation, nutrient uptake and release, and habitat balance',
  'To quantify ecosystem functions, sediment and nutrient processes, and carbon sequestration and their variability in different coastal environments undergoing TFFW conversion and marsh transition',
  'To develop advanced ecological theory of coastal wetland change by analyzing extensive field, laboratory, modeling, and remote sensing data over 15 years and using dynamic process models to predict ecosystem change in new coastal areas'],
 'geographic_places': {'named_geog

# Following up on Definitions

One key in this is to determine whether the AI model we are consulting and what it has been trained on agree with our own definition of concepts discovered. We can follow up by asking for definitions and reference sources on the concepts identified, work through those, add definition and linkages to our own knowledgebase, and iteratively improve both our ability to use what the models identify and feedback improvements to the models through training and fine-tuning.

In [152]:
method_definitions = llm_definitions(
    terms=geo_project_extracted_concepts['scientific_methods_and_techniques'],
    context="scientific methods and techniques"
)

In [161]:
json.loads(method_definitions.split("\n\n")[1].split('```')[1].replace('\n', ''))

[{'term': 'Field data collection',
  'definition': 'The process of collecting information and data about a physical environment or phenomenon in its natural setting using various methods such as observation, measurement, and surveying.',
  'source': 'https://en.wikipedia.org/wiki/Field_research'},
 {'term': 'Vegetation survey',
  'definition': 'A methodology used to determine the type, abundance, and distribution of the plant species and communities in a particular area, which can provide information about biodiversity, ecosystem health, and natural resource management.',
  'source': 'https://www.fs.fed.us/biology/resources/pdfs/veg_protocol.pdf'},
 {'term': 'Growth record measurement',
  'definition': 'A method of quantifying and analyzing the growth patterns of living organisms, such as trees, using various techniques including dendrochronology (tree-ring analysis) and stem increment measurements, which can provide information about climate, environmental conditions, and past events.