This notebook walks through an experimental process for pulling the Greater Sage Grouse Bibliography out of its [web app](https://apps.usgs.gov/gsgbib/index.php) and into a data structure that we can process further. Pulling straight from the backend database would probably be a better route, but scraping is a decent enough way to get started. Once the data is in a usable backend, we can run other processes on the citations to assemble more information.

In [1]:
import requests
from bis2 import dd
from datetime import datetime
from bis import rrl
import json
import hashlib
from bs4 import BeautifulSoup

In [2]:
bis = dd.getDB("bis")
collection_rrl = bis["RRL"]
collection_rrlAnnotations = bis["RRL Annotations"]

I initially tried to use BeautifulSoup to process the full set of citation details at https://apps.usgs.gov/gsgbib/5654e232e4b071e7ea53d6e1.php?page=5. However, the HTML output from the PHP app is messy enough that this proved too awkward to deal with quickly. I instead went to the simple list at https://apps.usgs.gov/gsgbib/5654e232e4b071e7ea53d6e1.php?page=0 and worked from there. It ended up being quicker to use a desktop web scraping tool I've worked with before (Outwit Hub) to grab the table and links as JSON, which I dumped here locally as a starting point.

In [3]:
# Load the scraped table; the context manager closes the file once it has been read.
with open("gsgBibScrape.json") as f:
    sgBib = json.load(f)
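
Since the rest of the notebook relies on each scraped record carrying `Citation` and `Url` keys (the names used in the processing loop), a quick sanity check on the loaded list can catch a bad export early. This is only a minimal sketch: the `check_records` helper and the sample records are made up for illustration and are not part of the Outwit Hub export.

```python
def check_records(records):
    """Return the indexes of records missing a non-empty 'Citation' or 'Url' value."""
    bad = []
    for index, record in enumerate(records):
        # Both keys are needed later: 'Citation' for the ID hash, 'Url' for the annotation page.
        if not record.get("Citation") or not record.get("Url"):
            bad.append(index)
    return bad


# Hypothetical sample records standing in for the contents of gsgBibScrape.json.
sample = [
    {"Citation": "Example, A. 2016. A made-up citation.", "Url": "https://example.org/1"},
    {"Citation": "", "Url": "https://example.org/2"},
    {"Url": "https://example.org/3"},
]
print(check_records(sample))  # → [1, 2]
```

Running the same check against `sgBib` before the main loop would show which rows, if any, need to be fixed or skipped.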

The individual pages for each citation's set of annotations are clear and easy to process, so this next step runs through them and assembles usable data structures. I haven't done much processing on these yet because we're still working out what the annotation model should look like. Because the ID we put in the RRL collection is the MD5 hash of the citation string, the citation string alone is enough to tie each annotation back to that ID.
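
To make the ID scheme concrete, here is a small sketch of how a citation string maps to an ID and to the `rrl:` target used on annotations. The helper names `citation_id` and `annotation_target` and the sample citation are hypothetical; only the MD5-of-citation-string idea and the `rrl:` prefix come from this notebook.

```python
import hashlib


def citation_id(citation_text):
    """Hex MD5 digest of the UTF-8 encoded citation string."""
    return hashlib.md5(citation_text.encode()).hexdigest()


def annotation_target(citation_text):
    """Target string in the 'rrl:<id>' form used for annotation stubs."""
    return "rrl:" + citation_id(citation_text)


# Made-up citation text, for illustration only.
text = "Example, A. 2016. A made-up citation."
print(citation_id(text))
print(annotation_target(text))
# Any change to the string, even trailing whitespace, yields a different ID.
print(citation_id(text) == citation_id(text + " "))  # → False
```

One consequence worth keeping in mind is that the link only holds while the citation string stays byte-for-byte identical, so normalizing whitespace before hashing may be worth considering if the source text is ever re-scraped.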

In [4]:
gsgBibURL = "https://apps.usgs.gov/gsgbib/5654e232e4b071e7ea53d6e1.php?page=0"

for citation in sgBib:
    # The MD5 hash of the citation string is the key shared by both collections.
    citationID = hashlib.md5(citation["Citation"].encode()).hexdigest()
    annotationStub = {
        "target": "rrl:" + citationID,
        "datetime": datetime.utcnow().isoformat(),
        "source": citation["Url"],
    }

    thisAnnotations = []
    # Reset per citation so a page without a link never reuses the previous one.
    citationLink = None

    annotationSoup = BeautifulSoup(requests.get(citation["Url"]).text, "lxml")
    annotationSection = annotationSoup.find("div", {"class": "ss-form-container"})
    # The second "ss-form" div holds the annotation paragraphs and the citation link.
    forms = annotationSection.find_all("div", {"class": "ss-form"}) if annotationSection is not None else []
    detailSection = forms[1] if len(forms) > 1 else None

    if detailSection is not None:
        for section in detailSection.find_all("p"):
            sectionHeader = section.find("strong")
            if sectionHeader is None:
                continue
            sectionLabel = sectionHeader.get_text()
            # Keep the text after "Label: "; partition yields "" rather than failing if it is absent.
            sectionContent = section.get_text().partition(sectionLabel + ": ")[2]
            if sectionLabel == "Topics":
                sectionContent = sectionContent.split(", ")
            thisAnnotations.append({**annotationStub, "type": sectionLabel, "body": sectionContent})

        # Keep the last link that opens in a new window, as the original loop did.
        for link in detailSection.find_all("a", {"target": "_new"}):
            citationLink = link.get("href")

    print(rrl.ResearchReferenceLibrary.register_citation(collection_rrl, citation["Citation"], gsgBibURL, citationLink))
    # Skip the insert when nothing was parsed (assuming a pymongo collection, insert_many rejects an empty list).
    if thisAnnotations:
        collection_rrlAnnotations.insert_many(thisAnnotations)


{'status': 'ok', '_id': '58219f9b42767011a13d7fd742cd5800', 'message': 'New citation registered.'}
{'status': 'ok', '_id': 'da534ed6dbf0b10c8fab673c9d576aee', 'message': 'New citation registered.'}
{'status': 'ok', '_id': '4648b1d47f0b7da63b6839479e152504', 'message': 'New citation registered.'}
{'status': 'ok', '_id': 'c29e9bf477f5b55dc9bc59739ca8aa7c', 'message': 'New citation registered.'}
{'status': 'ok', '_id': '9afcef0619428007646d518786cd2d52', 'message': 'New citation registered.'}
{'status': 'ok', '_id': 'be80ee3df5a138c3ecaf9a70a63bc5b0', 'message': 'New citation registered.'}
{'status': 'ok', '_id': '66bde082d6ac0f62a59a4f54082121d9', 'message': 'New citation registered.'}
{'status': 'ok', '_id': 'ed31b6b46f3fd65acb61d3dbee761d0b', 'message': 'New citation registered.'}
{'status': 'ok', '_id': '7fd3f6dc2fac17a5c1077efee4df5b3d', 'message': 'New citation registered.'}
{'status': 'ok', '_id': '2695c0bc613bbcbfa894e681886bde43', 'message': 'New citation registered.'}
{'status':

{'status': 'ok', '_id': '8a591b8ea59116cd2f8e6a5a769ee154', 'message': 'New citation registered.'}
{'status': 'ok', '_id': 'd14df7012762f659392000d851aacf70', 'message': 'New citation registered.'}
{'status': 'ok', '_id': 'b5ea81afe672fdabba6bef71b4ff52d1', 'message': 'New citation registered.'}
{'status': 'ok', '_id': '120918f43def2f617ada12af465fe607', 'message': 'New citation registered.'}
{'status': 'ok', '_id': '0f5341981a932b57217cba5f976af3d4', 'message': 'New citation registered.'}
{'status': 'ok', '_id': '7a1e3b6faf0ee58f6c742b36b5cb3755', 'message': 'New citation registered.'}
{'status': 'ok', '_id': 'e816e6ed12fb93fe388046de600ed1e0', 'message': 'New citation registered.'}
{'status': 'ok', '_id': 'f6c4acb7b7ce8033740833ce92a52bc3', 'message': 'New citation registered.'}
{'status': 'ok', '_id': 'd2322e27d44bd774ee412bae0fe2dea6', 'message': 'New citation registered.'}
{'status': 'ok', '_id': '7221368555a9a809691b442c12be20b6', 'message': 'New citation registered.'}
{'status':

{'status': 'ok', '_id': '922b6e538673111b0a259a1b20acf95b', 'message': 'New citation registered.'}
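
The per-paragraph step in the loop above can be tried out without hitting the web app. The sketch below works on plain strings instead of BeautifulSoup elements, and it uses `str.partition` to pull out the text after the label, which is a swapped-in choice for this example. The `build_annotation` helper, the stub values, and the "Summary" label are hypothetical; only the "Topics" label and the `type`/`body` keys come from the notebook.

```python
def build_annotation(stub, label, paragraph_text):
    """Merge the shared stub with one labelled section of an annotation page."""
    # Keep everything after "Label: "; the body is "" if the separator is absent.
    body = paragraph_text.partition(label + ": ")[2]
    if label == "Topics":
        # Topics arrive as a comma-separated string and are stored as a list.
        body = body.split(", ")
    return {**stub, "type": label, "body": body}


# Hypothetical stub and paragraph text; real values come from the scraped pages.
stub = {"target": "rrl:0000", "source": "https://example.org/1"}
topics = build_annotation(stub, "Topics", "Topics: habitat, fire, grazing")
summary = build_annotation(stub, "Summary", "Summary: A made-up summary.")
print(topics["body"])   # → ['habitat', 'fire', 'grazing']
print(summary["body"])  # → A made-up summary.
```

Each annotation is a fresh dictionary built from the stub, so every record carries its own copy of the target and source and can be inserted independently.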
