# Parse Paragraphs from JSON Files

This notebook reads the 2020-08-29 release of the CORD-19 dataset, downloaded from [AllenAI's CORD-19 Historical Releases Page](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html), expanded, and manually copied to S3 with the following layout:

    .
    ├── pdf_json
    ├── pmc_json
    └── metadata.csv

The `metadata.csv` file is the master list of papers; full-text JSON parses for some of them are available in the `pdf_json` and `pmc_json` sub-folders. We parse the JSON files, extract the title and paragraphs, and write them out to a set of Parquet files for downstream processing.

In [1]:
import boto3
import dask.dataframe as dd
import json
import pandas as pd

from dask.distributed import Client, progress, get_worker

## Constants

In [2]:
BUCKET_NAME = "saturn-elsevierinc"
FOLDER_NAME = "cord19"

DATA_FOLDER = "/".join(["s3:/", BUCKET_NAME, FOLDER_NAME])

METADATA_FILE = "/".join(["s3:/", BUCKET_NAME, FOLDER_NAME, "metadata.csv"])

PARAGRAPH_FOLDER = "/".join(["s3:/", BUCKET_NAME, "cord19-paras-pq-sm"])
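
The constants above build S3 URLs by joining path segments with `/`. Starting the list with `s3:/` (a single slash) is what produces the `s3://` scheme once the separator is added. A minimal sketch, using a hypothetical helper `s3_url` and a placeholder bucket name that are not part of the notebook:

```python
def s3_url(*parts):
    """Join path segments into an s3:// URL (hypothetical helper, not in the notebook)."""
    # "s3:/" followed by the "/" separator yields the "s3://" prefix.
    return "/".join(["s3:/", *parts])


# "example-bucket" is a placeholder, not the bucket used in this notebook.
example_url = s3_url("example-bucket", "cord19", "metadata.csv")
print(example_url)
```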

## Read metadata file

The `metadata.csv` file lists every paper in the release. Some papers have no full text because of paywall issues, but `metadata.csv` still carries their title and abstract. For papers with full text, the paths to the parsed files are given in the `pdf_json_files` and `pmc_json_files` columns.

Of the records in this dataframe, 140,317 have neither file path populated (meaning no full text exists in the dataset), 25,250 have only the `pdf_json_files` column populated, and 4,598 have only the `pmc_json_files` column populated. Our strategy (implemented in `parse_paragraphs`, cell 6) is to use `pdf_json_files` when available, else use `pmc_json_files`, and fall back to the title and abstract from `metadata.csv` when neither is available.
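
The preference order described above (PDF parse first, then PMC parse) can be tried on a small frame. This is only a sketch: the rows are made-up stand-ins for `metadata.csv`, and `full_text_source` is a hypothetical helper introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for metadata.csv; paths and IDs are illustrative, not real records.
toy_metadata = pd.DataFrame({
    "cord_uid": ["a", "b", "c", "d"],
    "pdf_json_files": ["document_parses/pdf_json/x.json",
                       "document_parses/pdf_json/y.json", None, None],
    "pmc_json_files": ["document_parses/pmc_json/x.xml.json", None,
                       "document_parses/pmc_json/z.xml.json", None],
})


def full_text_source(row):
    """Label which full-text column (if any) would be used for a record."""
    if pd.notnull(row["pdf_json_files"]):
        return "pdf"
    if pd.notnull(row["pmc_json_files"]):
        return "pmc"
    return "none"


toy_metadata["source"] = toy_metadata.apply(full_text_source, axis=1)
source_counts = toy_metadata["source"].value_counts().to_dict()
print(source_counts)
```

Running the same labelling over the real metadata frame would be one way to re-derive the per-column counts quoted above.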

In [3]:
metadata_df = dd.read_csv(METADATA_FILE, dtype=str)
metadata_df = metadata_df[["cord_uid", "title", "abstract", 
                           "pdf_json_files", "pmc_json_files"]]

# :TODO: comment out the sampling below for the real run
metadata_df = metadata_df.sample(frac=0.0005)

metadata_df.head()

Unnamed: 0,cord_uid,title,abstract,pdf_json_files,pmc_json_files
6498,sz7qmi8q,Interstitielle Lungenerkrankungen,"Interstitial pneumonia is a rare disease, posi...",document_parses/pdf_json/cd0e34984d3ba62e544e3...,document_parses/pmc_json/PMC7101537.xml.json
33851,4amnl029,Covid-19: Lack of test and trace data are frus...,,,
32210,5moean7z,COVID-19 in Patient with Sarcoidosis Receiving...,"Because of in vitro studies, hydroxychloroquin...",,
22958,do0dumkk,We're more negative after five nights of less ...,,,
20225,yctuuh7w,Paediatrics in the Tropics,,,document_parses/pmc_json/PMC7150102.xml.json


## Processing

In [4]:
client = Client(processes=False, n_workers=2, threads_per_worker=1)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 34997 instead
  http_address["port"], self.http_server.port


In [5]:
# :TODO: revisit for full dataset
metadata_df = metadata_df.repartition(npartitions=20)

For each record, we read the referenced file from the S3 filesystem, parse it into a JSON dictionary, and extract the text blocks we are interested in, namely the title, abstract (multiple paragraphs) and body (multiple paragraphs). We also assign each paragraph an ID that encodes its section and sequence number: `T` for the title, `A0`, `A1`, ... for abstract paragraphs, and `B0`, `B1`, ... for body paragraphs. The resulting list of (`pid`, `ptext`) tuples is returned by the `parse_paragraphs` function below.
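
The extraction step can be illustrated without S3 access. The sketch below assumes a heavily trimmed, made-up document with only the keys the parser reads (`metadata.title`, `abstract`, `body_text`); `extract_paragraphs` is a hypothetical helper, and its tolerance for a missing `abstract` or `body_text` key is an assumption of this sketch rather than behaviour of the notebook's function.

```python
import json

# Made-up document containing only the fields the parser reads.
sample_json = """
{
  "metadata": {"title": "A toy title"},
  "abstract": [{"text": "First abstract paragraph."}],
  "body_text": [{"text": "Body one."}, {"text": "Body two."}]
}
"""


def extract_paragraphs(doc):
    """Return (pid, ptext) tuples: T for title, A<i> for abstract, B<i> for body."""
    paragraphs = [("T", doc["metadata"]["title"])]
    paragraphs.extend(("A{:d}".format(i), p["text"])
                      for i, p in enumerate(doc.get("abstract", [])))
    paragraphs.extend(("B{:d}".format(i), p["text"])
                      for i, p in enumerate(doc.get("body_text", [])))
    return paragraphs


toy_paragraphs = extract_paragraphs(json.loads(sample_json))
for pid, ptext in toy_paragraphs:
    print(pid, ptext)
```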

We then explode the `paragraphs` column, and separate out the `pid` and `ptext` columns, then write them out into a set of Parquet files for further processing.
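
The explode-and-split step can be sketched in plain pandas on a toy frame (the Dask version in the notebook follows the same idea). The data here is made up, and the record with an empty paragraph list shows why a `dropna` is needed after exploding.

```python
import pandas as pd

# Toy frame: one record with two paragraphs, one record where parsing found nothing.
toy_df = pd.DataFrame({
    "cord_uid": ["u1", "u2"],
    "paragraphs": [[("T", "Title one"), ("A0", "Abstract one")], []],
})

# One row per (pid, ptext) tuple; an empty list explodes to NaN and is dropped.
exploded = toy_df.explode("paragraphs").dropna()

# Split each tuple into its two components.
exploded["pid"] = exploded["paragraphs"].map(lambda t: t[0])
exploded["ptext"] = exploded["paragraphs"].map(lambda t: t[1])
exploded = exploded.drop(columns=["paragraphs"]).reset_index(drop=True)
print(exploded)
```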

In [6]:
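def read_fully(filepath, s3, bucket_name):
    """Read an S3 object and return its contents decoded as UTF-8."""
    obj = s3.Object(bucket_name, filepath)
    return obj.get()['Body'].read().decode('utf-8')


def parse_paragraphs(row, bucket_name):
    """Return a list of (pid, ptext) tuples for one metadata row."""
    # Cache one boto3 S3 resource per Dask worker instead of creating one per row.
    worker = get_worker()
    try:
        s3 = worker.s3
    except AttributeError:
        s3 = boto3.resource('s3')
        worker.s3 = s3
    paragraphs = []
    filepath = None
    try:
        # Prefer the PDF parse, then the PMC parse.
        if pd.notnull(row.pdf_json_files):
            filepath = row.pdf_json_files
        elif pd.notnull(row.pmc_json_files):
            filepath = row.pmc_json_files
        if filepath is not None:
            # Paths in metadata.csv start with "document_parses"; on S3 the
            # files live under the "cord19" folder instead.
            abs_filepath = filepath.replace("document_parses", "cord19")
            fdict = json.loads(read_fully(abs_filepath, s3, bucket_name))
            paragraphs.append(("T", fdict["metadata"]["title"]))
            paragraphs.extend([("A{:d}".format(i), x["text"])
                for i, x in enumerate(fdict["abstract"])])
            paragraphs.extend([("B{:d}".format(i), x["text"])
                for i, x in enumerate(fdict["body_text"])])
        else:
            # No full text: fall back to the title and abstract from metadata.csv.
            paragraphs.append(("T", row["title"]))
            for i, abs_para_text in enumerate(row["abstract"].split('\n')):
                paragraphs.append(("A{:d}".format(i), abs_para_text))
    except Exception:
        # Best effort: on any failure (missing file, missing key, null abstract)
        # keep whatever was collected so far.
        pass
    return paragraphs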


In [7]:
paragraph_df = metadata_df.copy()
# meta is given as a (name, dtype) tuple describing each apply's output Series
paragraph_df["paragraphs"] = paragraph_df.apply(
    lambda row: parse_paragraphs(row, BUCKET_NAME),
    meta=("paragraphs", "object"), axis=1)
paragraph_df = paragraph_df.drop(columns=["title", "abstract",
                                          "pdf_json_files", "pmc_json_files"])
paragraph_df = paragraph_df.explode("paragraphs")
paragraph_df = paragraph_df.dropna()
paragraph_df["pid"] = paragraph_df.apply(lambda row: row.paragraphs[0],
                                         meta=("pid", "str"), axis=1)
paragraph_df["ptext"] = paragraph_df.apply(lambda row: row.paragraphs[1],
                                           meta=("ptext", "str"), axis=1)
paragraph_df = paragraph_df.drop(columns=["paragraphs"])

In [8]:
import s3fs

fs = s3fs.S3FileSystem()
if fs.exists(PARAGRAPH_FOLDER):
    fs.rm(PARAGRAPH_FOLDER, recursive=True)

In [9]:
PARAGRAPH_FOLDER

's3://saturn-elsevierinc/cord19-paras-pq-sm'

In [10]:
%%time
paragraph_df.to_parquet(PARAGRAPH_FOLDER, engine="pyarrow", compression="snappy")

# paragraph_df.persist()
# progress(paragraph_df)
# results = paragraph_df.compute()

CPU times: user 5.84 s, sys: 2.38 s, total: 8.22 s
Wall time: 9.4 s


## Verify Result

In [11]:
paragraph_df = dd.read_parquet(PARAGRAPH_FOLDER, engine="pyarrow")
paragraph_df.head(npartitions=10)

Unnamed: 0,cord_uid,pid,ptext
6498,sz7qmi8q,T,
6498,sz7qmi8q,A0,Schwer punkt: Lun gen-und Pleura pa tho lo gie...
6498,sz7qmi8q,A1,In traal veolä re Ak ku mu la ti on von SP-A I...
6498,sz7qmi8q,A2,In traal veolä re Ak ku mu la ti on von pro SP...
6498,sz7qmi8q,B0,Die hi sto pa tho lo gi sche Un ter su chung v...


In [12]:
len(paragraph_df)

1215