# PySpark DataFrame Preprocessing for CORD-19

SQL is a useful tool for querying data. [Apache Spark](https://spark.apache.org/) is a framework for distributed, map-reduce style workloads that exposes a SQL interface through the `pyspark.sql` module. The data provided by CORD-19 is semi-structured and contains many nested fields that can be tricky to work with.

This notebook contains starter code for pre-processing the raw JSON documents into a structured and strongly-typed [Spark DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html) that can be queried using Spark SQL. I'll provide a cell that can be used as the starting point for exploration into the dataset. I'll also provide a few example queries for interacting with nested data.


### Handy References

* [`pyspark.sql` module documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html)
* [Databricks `select` documentation on lateral view](https://docs.databricks.com/spark/latest/spark-sql/language-manual/select.html#lateral-view)
* [Spark Data Types reference](https://spark.apache.org/docs/latest/sql-reference.html)


### Notes on the environment

To begin, make sure the notebook has access to the internet. If you are running any Spark code locally, I suggest setting the `SPARK_HOME` environment variable so that it points to the `pyspark` directory in your local Python site-packages.

```bash
# in bash or zsh on MacOS or Linux
export SPARK_HOME=$(python -c "import pyspark; print(pyspark.__path__[0])")

# in powershell on Windows
$env:SPARK_HOME = $(python -c "import pyspark; print(pyspark.__path__[0])")
```
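
If you'd rather set the variable from inside the notebook, the same lookup can be done in Python before the `SparkSession` is created. This is a minimal sketch using only the standard library; `find_pyspark_home` is a hypothetical helper name, and it returns `None` when `pyspark` is not installed.

```python
import importlib.util
import os


def find_pyspark_home():
    """Return the directory of the installed pyspark package, or None."""
    spec = importlib.util.find_spec("pyspark")
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


pyspark_home = find_pyspark_home()
if pyspark_home is not None:
    # setdefault leaves any SPARK_HOME that is already configured untouched
    os.environ.setdefault("SPARK_HOME", pyspark_home)
```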

# Starter Code

Spark can be installed via the Python package manager, `pip`.

In [None]:
! pip install pyspark

In [None]:
from pyspark.sql.functions import lit
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    MapType,
    StringType,
    StructField,
    StructType,
)


def generate_cord19_schema():
    """Generate a Spark schema based on the semi-textual description of CORD-19 Dataset.

    This captures most of the structure from the crawled documents, and has been
    tested with the 2020-03-13 dump provided by the CORD-19 Kaggle competition.
    The schema is available at [1], and is also provided in a copy of the
    challenge dataset.

    One improvement that could be made to the original schema is to write it as
    JSON schema, which could be used to validate the structure of the dumps. I
    also noticed that the schema incorrectly nests fields that appear after the
    `metadata` section e.g. `abstract`.
    
    [1] https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-03-13/json_schema.txt
    """

    # shared by `metadata.authors` and `bib_entries.[].authors`
    author_fields = [
        StructField("first", StringType()),
        StructField("middle", ArrayType(StringType())),
        StructField("last", StringType()),
        StructField("suffix", StringType()),
    ]

    authors_schema = ArrayType(
        StructType(
            author_fields
            + [
                # This field is not well-specified in the source. If the struct
                # below is too strict, declare it as `StringType()` instead to
                # read the raw JSON string.
                StructField(
                    "affiliation",
                    StructType(
                        [
                            StructField("laboratory", StringType()),
                            StructField("institution", StringType()),
                            StructField(
                                "location",
                                StructType(
                                    [
                                        StructField("settlement", StringType()),
                                        StructField("country", StringType()),
                                    ]
                                ),
                            ),
                        ]
                    ),
                ),
                StructField("email", StringType()),
            ]
        )
    )

    # used in `section_schema` for citations, references, and equations
    spans_schema = ArrayType(
        StructType(
            [
                # character indices of inline citations
                StructField("start", IntegerType()),
                StructField("end", IntegerType()),
                StructField("text", StringType()),
                StructField("ref_id", StringType()),
            ]
        )
    )

    # A section of the paper, which includes the abstract, body, and back matter.
    section_schema = ArrayType(
        StructType(
            [
                StructField("text", StringType()),
                StructField("cite_spans", spans_schema),
                StructField("ref_spans", spans_schema),
                # Equations don't appear in the abstract, but the field is
                # included here for consistency
                StructField("eq_spans", spans_schema),
                StructField("section", StringType()),
            ]
        )
    )

    bib_schema = MapType(
        StringType(),
        StructType(
            [
                StructField("ref_id", StringType()),
                StructField("title", StringType()),
                StructField("authors", ArrayType(StructType(author_fields))),
                StructField("year", IntegerType()),
                StructField("venue", StringType()),
                StructField("volume", StringType()),
                StructField("issn", StringType()),
                StructField("pages", StringType()),
                StructField(
                    "other_ids",
                    StructType([StructField("DOI", ArrayType(StringType()))]),
                ),
            ]
        ),
        True,
    )

    # Can be one of table or figure captions
    ref_schema = MapType(
        StringType(),
        StructType(
            [
                StructField("text", StringType()),
                # Likely equation spans, not included in source schema, but
                # appears in JSON
                StructField("latex", StringType()),
                StructField("type", StringType()),
            ]
        ),
    )

    return StructType(
        [
            StructField("paper_id", StringType()),
            StructField(
                "metadata",
                StructType(
                    [
                        StructField("title", StringType()),
                        StructField("authors", authors_schema),
                    ]
                ),
                True,
            ),
            StructField("abstract", section_schema),
            StructField("body_text", section_schema),
            StructField("bib_entries", bib_schema),
            StructField("ref_entries", ref_schema),
            StructField("back_matter", section_schema),
        ]
    )


def extract_dataframe_kaggle(spark):
    """Extract a structured DataFrame from the semi-structured document dump.

    It should be fairly straightforward to modify this once there are new
    documents available. The date of availability (`crawled_date`) and `source`
    are added as metadata columns.
    """
    base = "/kaggle/input/CORD-19-research-challenge"
    crawled_date = "2020-03-13"
    sources = [
        "noncomm_use_subset",
        "comm_use_subset",
        "biorxiv_medrxiv",
        "pmc_custom_license",
    ]

    dataframe = None
    for source in sources:
        path = f"{base}/{crawled_date}/{source}/{source}"
        df = (
            spark.read.json(path, schema=generate_cord19_schema(), multiLine=True)
            .withColumn("crawled_date", lit(crawled_date))
            .withColumn("source", lit(source))
        )
        if dataframe is None:
            dataframe = df
        else:
            dataframe = dataframe.union(df)
    return dataframe
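
The docstring of `generate_cord19_schema` suggests validating the dumps against a schema. Short of a full JSON schema, a quick sanity check is to confirm that each raw document has the top-level fields that the Spark schema expects. The sketch below uses only the standard library; `missing_top_level_fields` and the sample document are hypothetical, and the field list is copied from the schema above.

```python
import json

# top-level fields, in the order used by `generate_cord19_schema`
EXPECTED_FIELDS = [
    "paper_id",
    "metadata",
    "abstract",
    "body_text",
    "bib_entries",
    "ref_entries",
    "back_matter",
]


def missing_top_level_fields(raw_json):
    """Return the expected top-level fields that are absent from a document."""
    document = json.loads(raw_json)
    return [field for field in EXPECTED_FIELDS if field not in document]


# a made-up, truncated document for illustration only
sample = json.dumps({"paper_id": "abc123", "metadata": {"title": "", "authors": []}})
print(missing_top_level_fields(sample))
```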


# Example Usage and Exploration

Now that we've defined the helper functions, let's start to take a look at the data. First we define a new `SparkSession`, which will create a session or reuse an existing one. [By default](https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts), a standalone worker will utilize all cores and `total_memory - 1GB` of memory; when running in local mode, memory is instead limited by `spark.driver.memory`.

## Extracting the Data

Take note of the schema, which is heavily nested and repeated.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = extract_dataframe_kaggle(spark)
df.printSchema()

df.createOrReplaceTempView("cord19")

The last line registers the DataFrame as a temporary view so that we can query it with SQL. Caching the DataFrame (for example with `df.cache()`) can also help significantly if there is enough memory available.

## DataFrame API vs Spark SQL

These APIs are interchangeable, since both go through the same query planner, which figures out the best way to execute the query. A declarative API is helpful for exploring the nested data before you dump it in the flattened form that suits your application.

I will show both the Spark DataFrame interface, which can be used programmatically, and the SQL interface, which can be adapted for use on BigQuery.

### Group By: How many papers are there in each source?

One example of a source is `biorxiv_medrxiv`.

In [None]:
print("Using the Spark DataFrame interface...")
df.groupBy("source").agg(F.countDistinct("paper_id")).show()

print("Using the Spark SQL interface...")
query = """
SELECT
    source,
    COUNT(DISTINCT paper_id)
FROM
    cord19
GROUP BY
    source
"""
spark.sql(query).show()

### Flatten: Who has written the most papers?

Here, let's take a look at our first nested field. Each paper can have many authors.

The `COLUMN.*` notation will extract all the columns from a struct into the scope of the `SELECT` clause.

In [None]:
authors = df.select("paper_id", F.explode("metadata.authors").alias("author")).select("paper_id", "author.*")
authors.select("first", "middle", "last", "email").where("email <> ''").show(n=5)
authors.printSchema()
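
For intuition, here is what `explode` followed by `author.*` does, written as a plain-Python sketch rather than Spark code. The papers are made-up toy records, and `explode_authors` is a hypothetical helper: each author in the nested list becomes its own row, and the author's fields are promoted to top-level columns next to `paper_id`.

```python
papers = [
    {
        "paper_id": "p1",
        "authors": [
            {"first": "Ada", "last": "Lovelace"},
            {"first": "Alan", "last": "Turing"},
        ],
    },
    {"paper_id": "p2", "authors": [{"first": "Ada", "last": "Lovelace"}]},
    # like `explode`, a paper with no authors produces no rows at all
    {"paper_id": "p3", "authors": []},
]


def explode_authors(records):
    """Emit one flat row per (paper, author) pair."""
    rows = []
    for record in records:
        for author in record["authors"]:
            # equivalent of selecting `paper_id` and `author.*`
            rows.append({"paper_id": record["paper_id"], **author})
    return rows


for row in explode_authors(papers):
    print(row)
```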

Now count the number of distinct papers for each author.

In [None]:
(
    authors.groupBy("first", "middle", "last")
    .agg(F.countDistinct("paper_id").alias("n_papers"))
    .orderBy(F.desc("n_papers"))
).show(n=5)

It looks like the [German virologist Christian Drosten](https://en.wikipedia.org/wiki/Christian_Drosten) has quite a bit to say on the matter. There also seem to be a few data quality issues, since there are quite a few papers authored by "â€ ", which looks like a mis-encoded dagger (†) footnote mark.

We can also express nearly the same query in Spark-flavored SQL; note that this version groups by `first` and `last` only. The `LATERAL VIEW` clause will be used throughout with this DataFrame for unnesting.

In [None]:
query = """
WITH authors AS (
    SELECT
        paper_id,
        author.*
    FROM
        cord19
    LATERAL VIEW
        explode(metadata.authors) AS author
)
SELECT
    first,
    last,
    COUNT(DISTINCT paper_id) as n_papers
FROM
    authors
GROUP BY
    first,
    last
ORDER BY
    n_papers DESC
"""

spark.sql(query).show(n=5)

### Array Aggregate: Generating full abstracts

One last useful trick for handling nested fields is the family of [`array` aggregate functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array). We'll take a look at `pyspark.sql.functions.array_join` for generating full abstracts.

The first way involves exploding the DataFrame with the array position, and then concatenating all of the rows belonging to a particular paper. This can be translated directly into SQL. The second way involves the use of User Defined Functions, which operate on the data one row at a time.

In [None]:
# based on https://stackoverflow.com/a/50668635
from pyspark.sql import Window

abstract = (
    df.select("paper_id", F.posexplode("abstract").alias("pos", "value"))
    .select("paper_id", "pos", "value.text")
    .withColumn("ordered_text", F.collect_list("text").over(Window.partitionBy("paper_id").orderBy("pos")))
    .groupBy("paper_id")
    .agg(F.max("ordered_text").alias("sentences"))
    .select("paper_id", F.array_join("sentences", " ").alias("abstract"))
    # raw string, so that Python passes the backslash through to the regex
    .withColumn("words", F.size(F.split("abstract", r"\s+")))
)

abstract.show(n=5)
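
The `collect_list` over an ordered window followed by `max` can look odd at first. With an ordering in the window, each row sees the list of everything up to and including its own position, so a paper ends up with a series of growing prefixes; the longest prefix is the full, ordered list. The plain-Python sketch below mimics this on toy data. It assumes that Spark orders arrays the way Python orders lists (element by element, with a prefix sorting before its extension).

```python
# (pos, text) pairs for one paper, deliberately out of order
paragraphs = [(2, "conclusion"), (0, "background"), (1, "methods")]

# emulate `collect_list(text) OVER (PARTITION BY paper_id ORDER BY pos)`:
# every row gets the texts up to and including its own position
running = []
prefixes = []
for _, text in sorted(paragraphs):
    running.append(text)
    prefixes.append(list(running))

# emulate `max(ordered_text)`: lists compare element by element, and a
# prefix sorts before any list that extends it, so the max is the full list
sentences = max(prefixes)
abstract_text = " ".join(sentences)
print(abstract_text)
```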

If you're curious how this is represented under the hood, you can take a look at the query plan. The performance is not too bad in return for the ergonomics.

In [None]:
abstract.explain()

Now for the SQL analogue. This may be a bit convoluted, but if you're following along, you should be ready for any sort of data processing.

In [None]:
query = """
WITH abstract AS (
    SELECT
        paper_id,
        pos,
        value.text as text
    FROM
        cord19
    LATERAL VIEW
        posexplode(abstract) AS pos, value
),
collected AS (
    SELECT
        paper_id,
        collect_list(text) OVER (PARTITION BY paper_id ORDER BY pos) as sentences
    FROM
        abstract
),
sentences AS (
    SELECT
        paper_id,
        max(sentences) as sentences
    FROM
        collected
    GROUP BY
        paper_id
)
SELECT
    paper_id,
    array_join(sentences, " ") as abstract,
    -- make sure the regex is being escaped properly
    size(split(array_join(sentences, " "), "\\\\s+")) as words
FROM
    sentences
"""

spark.sql(query).show(n=5)
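
The escaping in the `split` pattern is easy to get wrong because the pattern passes through two layers: the Python string literal, and then the Spark SQL string literal, which by default also treats a backslash as an escape character. The sketch below traces both layers in plain Python; `sql_unescape` is a hypothetical, heavily simplified stand-in for Spark's parser that only handles an escaped backslash.

```python
import re


def sql_unescape(literal):
    """Simplified stand-in for SQL string parsing: collapse doubled backslashes."""
    return literal.replace("\\\\", "\\")


# what Python hands to Spark when the query text contains a doubled backslash
in_query = r"\\s+"
assert in_query == "\\\\s+"  # the same text without a raw string

# what the regex engine receives after the SQL literal is parsed
pattern = sql_unescape(in_query)
print(pattern)

# the resulting pattern splits on runs of whitespace
print(re.split(pattern, "severe  acute\trespiratory"))
```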

Finally, we can use a User Defined Function written in Python. This is versatile, and is similar to `pandas.DataFrame.apply`.

In [None]:
@F.udf("string")
def join_abstract(rows) -> str:
    # guard against a null array or null text, which would otherwise raise
    return " ".join([row.text for row in rows or [] if row.text])

(
    df.select("paper_id", join_abstract("abstract").alias("abstract"))
    .where("abstract <> ''")
    # mix and match SQL using `pyspark.sql.functions.expr` or `DataFrame.selectExpr`
    .withColumn("words", F.expr("size(split(abstract, '\\\\s+'))"))
).show(n=5)

It can also be registered to use in SQL.

In [None]:
spark.udf.register("join_abstract", join_abstract)

query = """
SELECT
    paper_id,
    join_abstract(abstract) as abstract,
    size(split(join_abstract(abstract), '\\\\s+')) as words
FROM
    cord19
WHERE
    size(abstract) > 1
"""

spark.sql(query).show(n=5)

# What next?

Take this notebook, cut all of the extra cells out, and begin processing your data for text mining. Spark has an excellent [feature-extraction](https://spark.apache.org/docs/latest/ml-features) library that can be used to transform data in all sorts of ways. For example, the extracted abstracts from above can be tokenized and turned into weighted term-frequency vectors for similarity searches.
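
To make the last idea concrete, here is a minimal plain-Python sketch of weighted term-frequency vectors and cosine similarity. It is not Spark's feature-extraction API; the three toy abstracts, the whitespace tokenizer, and the smoothed IDF formula `log((n + 1) / (df + 1))` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

abstracts = {
    "a": "coronavirus transmission in hospitals",
    "b": "coronavirus transmission in schools",
    "c": "school closures and other interventions",
}

# term frequencies per document, using a naive whitespace tokenizer
term_counts = {key: Counter(text.split()) for key, text in abstracts.items()}

# document frequency: in how many abstracts does each term appear?
doc_freq = Counter(term for counts in term_counts.values() for term in counts)
n_docs = len(abstracts)


def tfidf(counts):
    """Weight each term count by a smoothed inverse document frequency."""
    return {
        term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
        for term, count in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = {key: tfidf(counts) for key, counts in term_counts.items()}
print(cosine(vectors["a"], vectors["b"]), cosine(vectors["a"], vectors["c"]))
```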

Hopefully I'll be able to follow up with some interesting analysis involving recommendation systems. I'm particularly interested in the literature around non-pharmaceutical interventions, and I hope to curate a sensible list of the approaches that people around the world have taken to combat COVID-19.

If you need help with Spark or SQL as it relates to this notebook, feel free to reach out.