# Parse Paragraphs from JSON Files

This notebook reads the 2020-08-29 release of the CORD-19 dataset, downloaded from [AllenAI's CORD-19 Historical Releases Page](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html), expanded, and manually copied to S3 with the following layout:

    .
    ├── pdf_json
    ├── pmc_json
    └── metadata.csv

The `metadata.csv` file is a master list of papers, only some of which have full-text JSON files in the `pdf_json` and `pmc_json` sub-folders. We parse the JSON files, extract the paragraphs (and title sentences), and write them to Parquet files for downstream processing.

## Initialize Dask cluster

In [1]:
from dask_saturn.core import describe_sizes

describe_sizes()

{'medium': 'Medium - 2 cores - 4 GB RAM',
 'large': 'Large - 2 cores - 16 GB RAM',
 'xlarge': 'XLarge - 4 cores - 32 GB RAM',
 '2xlarge': '2XLarge - 8 cores - 64 GB RAM',
 '4xlarge': '4XLarge - 16 cores - 128 GB RAM',
 '8xlarge': '8XLarge - 32 cores - 256 GB RAM',
 '12xlarge': '12XLarge - 48 cores - 384 GB RAM',
 '16xlarge': '16XLarge - 64 cores - 512 GB RAM',
 'g4dnxlarge': 'T4-XLarge - 4 cores - 16 GB RAM - 1 GPU',
 'g4dn4xlarge': 'T4-4XLarge - 16 cores - 64 GB RAM - 1 GPU',
 'g4dn8xlarge': 'T4-8XLarge - 32 cores - 128 GB RAM - 1 GPU',
 'p32xlarge': 'V100-2XLarge - 8 cores - 61 GB RAM - 1 GPU',
 'p38xlarge': 'V100-8XLarge - 32 cores - 244 GB RAM - 4 GPU',
 'p316xlarge': 'V100-16XLarge - 64 cores - 488 GB RAM - 8 GPU'}

In [2]:
from dask.distributed import Client, wait
from dask_saturn import SaturnCluster
import time

n_workers = 10
cluster = SaturnCluster(n_workers=n_workers, scheduler_size='2xlarge', worker_size='4xlarge', nthreads=16)
client = Client(cluster)
cluster

[2020-09-08 20:52:45] INFO - dask-saturn | Starting cluster. Status: stopped
[2020-09-08 20:52:50] INFO - dask-saturn | Starting cluster. Status: starting
[2020-09-08 20:53:02] INFO - dask-saturn | Cluster is ready



If you initialized your cluster in this notebook, it may take a few minutes for all workers to become available. Run the cell below to block until all of them are ready.

> **Pro tip**: Create and/or start your cluster from the "Dask" page in Saturn if you want to get a head start!

In [3]:
while len(client.scheduler_info()['workers']) < n_workers:
    print('Waiting for workers, got', len(client.scheduler_info()['workers']))
    time.sleep(30)
print('Done!')

Waiting for workers, got 0
Waiting for workers, got 1
Done!
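
The polling loop above waits indefinitely if a worker never starts. A minimal sketch of the same idea with a timeout follows; `wait_for` and its parameters are hypothetical names introduced for illustration, and the predicate stands in for the `len(client.scheduler_info()['workers'])` check used in the notebook.

```python
import time


def wait_for(predicate, timeout=600.0, interval=30.0, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Stand-in for a cluster whose workers come up a few at a time.
worker_counts = iter([0, 1, 3])
ready = wait_for(lambda: next(worker_counts) >= 3, timeout=5.0, interval=0.01)
print(ready)
```

In the notebook this could be called as `wait_for(lambda: len(client.scheduler_info()['workers']) >= n_workers)`. Dask's `Client.wait_for_workers` method offers similar blocking behavior, if the installed version provides it.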


In [4]:
import boto3
import dask.dataframe as dd
import json
import pandas as pd

from dask.distributed import Client, progress, get_worker

## Constants

In [5]:
BUCKET_NAME = "saturn-elsevierinc"
FOLDER_NAME = "cord19"

DATA_FOLDER = "/".join(["s3:/", BUCKET_NAME, FOLDER_NAME])

METADATA_FILE = "/".join(["s3:/", BUCKET_NAME, FOLDER_NAME, "metadata.csv"])

PARAGRAPH_FOLDER = "/".join(["s3:/", BUCKET_NAME, "cord19-paras-pq"])

## Read metadata file

The `metadata.csv` file lists every paper in the dataset. Some papers have no full text because of paywall restrictions, but `metadata.csv` still provides their title and abstract. For papers with full text, the path to the parsed file is recorded in the `pdf_json_files` and/or `pmc_json_files` columns.

Of the records in this dataframe, 140,317 have neither file path populated (meaning their full text is not in the dataset), 25,250 have only the `pdf_json_files` column populated, and 4,598 have only the `pmc_json_files` column populated. Our strategy (see cell 7) is to use `pdf_json_files` when available, otherwise `pmc_json_files`, and to fall back to the title and abstract from `metadata.csv` when neither is available.
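
The path selection rule above can be sketched without Dask or S3. `choose_json_path` and `is_missing` are hypothetical helpers, not part of the notebook; they treat `None` and NaN as missing and apply the same `document_parses` to `cord19` prefix rewrite that cell 7 uses to map metadata paths onto the S3 folder. The file names in the example are made up.

```python
import math


def is_missing(value):
    """True for None or a float NaN, which is how pandas marks empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def choose_json_path(pdf_path, pmc_path):
    """Prefer the PDF parse, then the PMC parse; return None if neither exists."""
    for candidate in (pdf_path, pmc_path):
        if not is_missing(candidate):
            # Metadata paths start with "document_parses"; the S3 copy lives under "cord19".
            return candidate.replace("document_parses", "cord19")
    return None


# Made-up paths with the same shape as the metadata columns.
print(choose_json_path("document_parses/pdf_json/abc.json",
                       "document_parses/pmc_json/PMC1.xml.json"))
print(choose_json_path(float("nan"), "document_parses/pmc_json/PMC1.xml.json"))
print(choose_json_path(None, None))
```

A `None` result signals that no full-text file is referenced, leaving it to the caller to decide what to do with the record.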

In [6]:
metadata_df = dd.read_csv(METADATA_FILE, dtype=str)
metadata_df = metadata_df[["cord_uid", "title", "abstract", 
                           "pdf_json_files", "pmc_json_files"]]

# :TODO: comment for real run
# metadata_df = metadata_df.sample(frac=0.0005)

metadata_df.head()

| | cord_uid | title | abstract | pdf_json_files | pmc_json_files |
|---|---|---|---|---|---|
| 0 | ug7v899j | Clinical features of culture-proven Mycoplasma... | OBJECTIVE: This retrospective chart review des... | document_parses/pdf_json/d1aafb70c066a2068b027... | document_parses/pmc_json/PMC35282.xml.json |
| 1 | 02tnwd4m | Nitric oxide: a pro-inflammatory mediator in l... | Inflammatory diseases of the respiratory tract... | document_parses/pdf_json/6b0567729c2143a66d737... | document_parses/pmc_json/PMC59543.xml.json |
| 2 | ejv2xln0 | Surfactant protein-D and pulmonary host defense | Surfactant protein-D (SP-D) participates in th... | document_parses/pdf_json/06ced00a5fc04215949aa... | document_parses/pmc_json/PMC59549.xml.json |
| 3 | 2b73a28n | Role of endothelin-1 in lung disease | Endothelin-1 (ET-1) is a 21 amino acid peptide... | document_parses/pdf_json/348055649b6b8cf2b9a37... | document_parses/pmc_json/PMC59574.xml.json |
| 4 | 9785vg6d | Gene expression in epithelial cells in respons... | Respiratory syncytial virus (RSV) and pneumoni... | document_parses/pdf_json/5f48792a5fa08bed9f560... | document_parses/pmc_json/PMC59580.xml.json |


## Processing

For each record, we read the referenced file from S3, parse it into a JSON dictionary, and extract the text blocks we are interested in: the title, the abstract (multiple paragraphs), and the body (multiple paragraphs). Each block gets a paragraph ID (`pid`) that encodes its position: `T` for the title, `A0`, `A1`, ... for abstract paragraphs, and `B0`, `B1`, ... for body paragraphs. The resulting list of (`pid`, `ptext`) tuples is returned by the `parse_paragraphs` function below.
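
As an illustration of the paragraph numbering, the following sketch applies the same extraction to an in-memory dictionary instead of a file on S3. The `extract_paragraphs` name and the sample document are made up for this example; only the `metadata.title`, `abstract`, and `body_text` keys come from the notebook's parsing code.

```python
def extract_paragraphs(fdict):
    """Return (pid, ptext) tuples: T for title, A<i> for abstract, B<i> for body."""
    paragraphs = [("T", fdict["metadata"]["title"])]
    paragraphs.extend(("A{:d}".format(i), block["text"])
                      for i, block in enumerate(fdict["abstract"]))
    paragraphs.extend(("B{:d}".format(i), block["text"])
                      for i, block in enumerate(fdict["body_text"]))
    return paragraphs


# Made-up document with the same shape as the keys read in the notebook.
sample = {
    "metadata": {"title": "A sample title"},
    "abstract": [{"text": "First abstract paragraph."}],
    "body_text": [{"text": "Intro paragraph."}, {"text": "Methods paragraph."}],
}

for pid, ptext in extract_paragraphs(sample):
    print(pid, ptext)
```

For this sample the paragraph IDs come out as `T`, `A0`, `B0`, `B1`, with abstract and body paragraphs numbered independently from zero.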

We then explode the `paragraphs` column, and separate out the `pid` and `ptext` columns, then write them out into a set of Parquet files for further processing.
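
To make the explode step concrete, here is a plain-Python analogue that does not use Dask: each record holding a list of (`pid`, `ptext`) tuples becomes one output row per tuple. The `explode_paragraphs` helper and the sample records are illustrative assumptions; the notebook itself uses the dataframe `explode` method.

```python
def explode_paragraphs(records):
    """Flatten (cord_uid, paragraphs) pairs into one dict per paragraph.

    Records with an empty paragraph list produce no rows, similar to the
    explode-then-dropna sequence in the notebook.
    """
    rows = []
    for cord_uid, paragraphs in records:
        for pid, ptext in paragraphs:
            rows.append({"cord_uid": cord_uid, "pid": pid, "ptext": ptext})
    return rows


# Made-up records; real cord_uid values come from metadata.csv.
records = [
    ("uid-1", [("T", "Title one"), ("B0", "Body text.")]),
    ("uid-2", []),
]
for row in explode_paragraphs(records):
    print(row)
```

The second record contributes nothing to the output, which mirrors how records without any extracted paragraphs disappear from the final Parquet files.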

In [7]:
def read_fully(filepath, s3, bucket_name):
    """Read an S3 object and return its contents as a UTF-8 string."""
    obj = s3.Object(bucket_name, filepath)
    return obj.get()['Body'].read().decode('utf-8')


def parse_paragraphs(rows, bucket_name):
    """Return a list of (pid, ptext) tuples for one metadata record."""
    # Cache one boto3 S3 resource per Dask worker instead of creating one per row.
    worker = get_worker()
    try:
        s3 = worker.s3
    except AttributeError:
        s3 = boto3.resource('s3')
        worker.s3 = s3
    paragraphs = []
    filepath = None
    try:
        # Prefer the PDF parse, then the PMC parse.
        if pd.notnull(rows.pdf_json_files):
            filepath = rows.pdf_json_files
        elif pd.notnull(rows.pmc_json_files):
            filepath = rows.pmc_json_files
        if filepath is not None:
            # Metadata paths start with "document_parses"; the S3 copy lives under "cord19".
            abs_filepath = filepath.replace("document_parses", "cord19")
            fdict = json.loads(read_fully(abs_filepath, s3, bucket_name))
            paragraphs.append(("T", fdict["metadata"]["title"]))
            paragraphs.extend([("A{:d}".format(i), x["text"])
                for i, x in enumerate(fdict["abstract"])])
            paragraphs.extend([("B{:d}".format(i), x["text"])
                for i, x in enumerate(fdict["body_text"])])
        else:
            # No full text: fall back to the title and abstract from metadata.csv.
            paragraphs.append(("T", rows["title"]))
            for i, abs_para_text in enumerate(rows["abstract"].split('\n')):
                paragraphs.append(("A{:d}".format(i), abs_para_text))
    except Exception:
        # Best effort: on any read or parse error, keep whatever was collected so far.
        pass
    return paragraphs


In [8]:
paragraph_df = metadata_df.copy()

In [9]:
paragraph_df = paragraph_df.repartition(npartitions=100)

In [10]:
# Parse each record into a list of (pid, ptext) tuples.
paragraph_df["paragraphs"] = paragraph_df.apply(
    lambda rows: parse_paragraphs(rows, BUCKET_NAME),
    meta=("paragraphs", "object"), axis=1)
paragraph_df = paragraph_df.drop(columns=["title", "abstract",
                                          "pdf_json_files", "pmc_json_files"])
# One row per paragraph; records that yielded no paragraphs become NaN and are dropped.
paragraph_df = paragraph_df.explode("paragraphs")
paragraph_df = paragraph_df.dropna()
# Split each (pid, ptext) tuple into its own columns.
paragraph_df["pid"] = paragraph_df.apply(lambda rows: rows.paragraphs[0],
                                         meta=("pid", "object"), axis=1)
paragraph_df["ptext"] = paragraph_df.apply(lambda rows: rows.paragraphs[1],
                                           meta=("ptext", "object"), axis=1)
paragraph_df = paragraph_df.drop(columns=["paragraphs"])


In [11]:
import s3fs

fs = s3fs.S3FileSystem()
if fs.exists(PARAGRAPH_FOLDER):
    fs.rm(PARAGRAPH_FOLDER, recursive=True)

In [12]:
PARAGRAPH_FOLDER

's3://saturn-elsevierinc/cord19-paras-pq'

In [18]:
%%time
paragraph_df.to_parquet(PARAGRAPH_FOLDER, engine="pyarrow", compression="snappy")

CPU times: user 769 ms, sys: 31.1 ms, total: 800 ms
Wall time: 3min 56s


## Verify Result

In [19]:
PARAGRAPH_FOLDER

's3://saturn-elsevierinc/cord19-paras-pq'

In [20]:
fs.ls(PARAGRAPH_FOLDER)

[]

In [16]:
fs.du(PARAGRAPH_FOLDER) / 1e6

0.0

In [21]:
paragraph_df = dd.read_parquet(PARAGRAPH_FOLDER, engine="pyarrow")
paragraph_df.head()

OSError: Passed non-file path: saturn-elsevierinc/cord19-paras-pq

In [None]:
len(paragraph_df)

In [22]:
# Do this when you're done using the cluster
cluster.close()

distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
concurrent.futures._base.CancelledError


In [None]:
# client.close()