In [None]:
from pymongo import MongoClient

from scraping import Reg_API, CommentParser

# Setup basic MongoDB collection
client = MongoClient()
db = client.regulationsgov
comments_collection = db.comments

# Reg_API

Singleton utility class for making API calls to [regulations.gov](https://www.regulations.gov). API calls are constructed by chaining method calls, e.g. `api.endpoint("/comments").search("climate").get()`.

### Constructor

The constructor takes 2 optional parameters:

- `page_size: int = 20` The number of results per page for paged results
- `apikey: str = EMMAS_APIKEY` The API key sent with each request to authenticate it

### API Calls

Each call should start with one of the following methods:

- `endpoint(endpoint: str)` Start an API call at the given `endpoint`, as listed in the [API documentation](https://open.gsa.gov/api/regulationsgov/). The endpoint should start with a "/", e.g. "/comments".
- `url(url: str)` Start an API call at the given `url`. The URL should include the API base.

The following 2 methods can then be chained any number of times to narrow the search and/or select a different page:

- `search(search_term: str)` Only include results containing the `search_term`.
- `page(page: int)` Get page number `page` of the results

Finally, each API call should be finished with the `get` method as explained below.

- `get(get_json: bool = True)` Get the result of the constructed API call. If `get_json` is `False`, returns the full response object as returned by `requests.get`; otherwise returns only the JSON content.
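
The chaining works because every method except the final one returns the object itself. The block below is a minimal sketch of that pattern, not the actual `Reg_API` source: `ChainedCall`, `prepare`, and `API_BASE` are hypothetical names, and the base URL and query parameter names (`filter[searchTerm]`, `page[number]`, `page[size]`, `api_key`) are assumptions based on the public v4 API documentation. Instead of sending a request, `prepare` returns the URL and parameters that a real client would pass to `requests.get`.

```python
API_BASE = "https://api.regulations.gov/v4"  # assumed API base


class ChainedCall:
    """Hypothetical stand-in that illustrates method chaining; not Reg_API itself."""

    def __init__(self, page_size=20, apikey="DEMO_KEY"):  # placeholder key
        self._defaults = {"page[size]": page_size, "api_key": apikey}
        self._url = None
        self._params = {}

    def endpoint(self, endpoint):
        # Starting a call resets any parameters left over from a previous one.
        self._url = API_BASE + endpoint
        self._params = dict(self._defaults)
        return self  # returning self is what makes chaining possible

    def search(self, search_term):
        self._params["filter[searchTerm]"] = search_term
        return self

    def page(self, page):
        self._params["page[number]"] = page
        return self

    def prepare(self):
        # A real client would call requests.get(url, params=params) here.
        return self._url, dict(self._params)


url, params = ChainedCall(page_size=5).endpoint("/comments").search("climate").page(2).prepare()
print(url)
print(params)
```

Calling `search` or `page` again on the same chain overwrites the earlier value in this sketch; whether `Reg_API` behaves the same way is not stated here.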

# CommentParser

A singleton utility class used to get further data using pre-set API calls, extract data using pre-set field paths, and return a standardized dictionary containing the collected data.

### Constructor

The constructor takes a single required parameter

- `api: Reg_API` A `Reg_API` instance used to make additional API calls when needed 

### Comment Parsing

There are a number of private methods used to define API calls and paths, but only the method `get_comment_data` should generally be called externally.

This method takes a single required parameter, `comment`, which should be the data for a single comment. This data can be obtained by calling the "/comments" endpoint, accessing the "data" field of the returned dictionary, and selecting one element of the array, e.g. `api.endpoint("/comments").get()["data"][0]`.

Using the passed data, the method performs several tasks to get and structure data related to the comment, and returns the collected info in a dictionary with the following keys:

- `_id`: The comment ID, used to uniquely identify this comment within a MongoDB collection
- `comment_response`: The dictionary passed to the method, contains basic info about the comment
- `comment_info`: More detailed info about the comment obtained through an API call
- `attachments`: Info about attachments obtained through an API call
- `comment_text`: A custom dictionary containing the actual text of the comment. Has 2 keys for different sources of text
    - `plaintext`: The text in the "comments" field of the `comment_info` dictionary
    - `attachments`: A list of the text in the attachments. Each entry is a dictionary where keys are filetypes and values are the text of the file, for example `{"pdf": "This is the text of a PDF Comment"}`.
      > Note: Currently this *only* handles PDF files, which appear to be the only filetype in practice, even though the returned JSON has a field formatted like `"fileFormats": [urls]`, which suggests the possibility of multiple file formats.
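
As an illustration of how the `comment_text` structure might be consumed, the sketch below joins the plaintext and every attachment text into one string. `all_text` is a hypothetical helper, not part of `CommentParser`, and the sample dictionary is hand-written in the shape described above (the ID and texts are made up, and the other keys are omitted).

```python
def all_text(comment_data):
    """Join the plaintext and every attachment text into one string."""
    comment_text = comment_data["comment_text"]
    parts = [comment_text.get("plaintext") or ""]
    for attachment in comment_text.get("attachments", []):
        # Each attachment maps a filetype (e.g. "pdf") to its extracted text.
        parts.extend(attachment.values())
    # Skip empty or missing pieces so they do not leave blank gaps.
    return "\n\n".join(part for part in parts if part)


# Hand-written sample; not real data from regulations.gov.
sample = {
    "_id": "EXAMPLE-0001",
    "comment_text": {
        "plaintext": "See attached file.",
        "attachments": [{"pdf": "This is the text of a PDF Comment"}],
    },
}
print(all_text(sample))
```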


In [None]:
page_size = 20
api = Reg_API(page_size)
parser = CommentParser(api)

pageNum = 1
while True: 
    comments = api.endpoint("/comments").search("climate").page(pageNum).get()
    
    for i, comment in enumerate(comments["data"]):
        comment_data = parser.get_comment_data(comment)
        # Use the actual number of results, since the last page may be shorter than page_size
        print(f"(pg {pageNum}) {i+1}/{len(comments['data'])}: ", comment_data['_id'], end="")
        print(" "*50, end="\r") # Clear line

        ## Uncomment to store in MongoDB
        #comments_collection.insert_one(comment_data) 

    if pageNum > 1:  ### Comment out/delete this check to run for ALL pages
        break        # - Currently stops after the 2nd page

    # Stop once the API reports that there are no more pages
    if not comments["meta"]["hasNextPage"]:
        break
    pageNum += 1
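
The paging logic in the loop above can also be separated from the per-comment work. The following is a minimal sketch under that idea: `iter_comments` is a hypothetical helper (not part of `scraping`) that takes any function returning one page of results and yields comments until `hasNextPage` is false or an optional page limit is reached. It is exercised here with fake pages so that it runs without network access.

```python
def iter_comments(fetch_page, max_pages=None):
    """Yield (page_number, comment) pairs until there is no next page."""
    page_num = 1
    while True:
        response = fetch_page(page_num)
        for comment in response["data"]:
            yield page_num, comment
        if not response["meta"]["hasNextPage"]:
            break
        if max_pages is not None and page_num >= max_pages:
            break
        page_num += 1


# Fake responses in the same shape the loop above relies on.
fake_pages = {
    1: {"data": ["a", "b"], "meta": {"hasNextPage": True}},
    2: {"data": ["c"], "meta": {"hasNextPage": False}},
}
results = list(iter_comments(fake_pages.__getitem__))
print(results)
```

With the real objects, the fetch function could be `lambda n: api.endpoint("/comments").search("climate").page(n).get()`, and `max_pages=2` would reproduce the two-page limit used above.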