Querying text annotations at scale with Spark
An experiment using Elsevier Labs' Annotation Query library to query annotations of PubMed articles. The code is written in Scala and leverages Spark for processing.
Use build.sbt to install dependencies.
You also need to compile (mvn package) and add the JARs for the following two libraries:
Running the code
ParsePubtatorXML: This app reads the list of PubMed article IDs of interest (stored in ./data/keys), queries Pubtator for each article, and stores the XML response in the ./data/xml folder. For each XML file, it then extracts the text string, the original document markup, and the Pubtator annotations:
./data/str contains the string content of the document, stripped of any annotation (all annotation offsets reference this text)
./data/pubtator contains the Pubtator annotations, including Gene, Disease, Chemical, Mutation, Species and CellLine
./data/om contains the original markup of the document, including Document, Title and Abstract.
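To illustrate what it means for annotation offsets to reference the stripped text, here is a minimal sketch in plain Scala. The `Annotation` case class, the example sentence, and the end-exclusive offset convention are all assumptions made for illustration; they are not taken from the project's code or from the Pubtator format.

```scala
// Hypothetical record type for illustration only; not the project's actual schema.
final case class Annotation(annotType: String, start: Int, end: Int)

object OffsetDemo {
  // Return the span of `text` covered by an annotation.
  // Assumes `start` is inclusive and `end` is exclusive.
  def covered(text: String, a: Annotation): String =
    text.substring(a.start, a.end)

  def main(args: Array[String]): Unit = {
    // Stand-in for the content of a file under ./data/str
    val text = "BRCA1 mutations are linked to breast cancer."

    // Stand-off annotations: they carry only a type and offsets into `text`
    val annotations = Seq(
      Annotation("Gene", 0, 5),
      Annotation("Disease", 30, 43)
    )

    annotations.foreach(a => println(s"${a.annotType}: ${covered(text, a)}"))
  }
}
```

Because the annotations live apart from the text, several annotation sets (original markup, Pubtator, Stanford CoreNLP) can point into the same string without interfering with each other.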
AnnotateSCNLP: This app uses Stanford CoreNLP to annotate the sentences contained in each article text. The annotations are then stored in
BuildParquet: This app stores each annotation set (om, pubtator and scnlp) in a Parquet file, in the format specified by Annotation Query.
Query: This app runs several scenarios that query the annotations using logical relations, such as "give me all annotations of genes and cell lines that co-occur in the same sentence".
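The co-occurrence scenario can be sketched with ordinary Scala collections. This is only an illustration of the containment logic, not the Annotation Query API or the project's Spark code: the `Annot` case class, the `containedIn` helper, and the "sentence" type name are hypothetical.

```scala
// Hypothetical annotation record; offsets are start-inclusive, end-exclusive (an assumption).
final case class Annot(annotType: String, start: Int, end: Int)

object CoOccurrenceDemo {
  // True when `inner` lies entirely within the span of `outer`.
  def containedIn(inner: Annot, outer: Annot): Boolean =
    inner.start >= outer.start && inner.end <= outer.end

  // Pair every gene with every cell line found in the same sentence.
  def genesWithCellLines(annots: Seq[Annot]): Seq[(Annot, Annot)] = {
    val sentences = annots.filter(_.annotType == "sentence")
    val genes     = annots.filter(_.annotType == "Gene")
    val cellLines = annots.filter(_.annotType == "CellLine")
    for {
      s <- sentences
      g <- genes if containedIn(g, s)
      c <- cellLines if containedIn(c, s)
    } yield (g, c)
  }

  def main(args: Array[String]): Unit = {
    val annots = Seq(
      Annot("sentence", 0, 40),
      Annot("sentence", 41, 80),
      Annot("Gene", 5, 10),      // first sentence
      Annot("CellLine", 20, 25), // first sentence
      Annot("Gene", 50, 55)      // second sentence, no cell line present
    )
    genesWithCellLines(annots).foreach(println)
  }
}
```

With the sample data, only the gene and cell line in the first sentence are paired; the gene in the second sentence has no cell line to co-occur with. On a real corpus the same containment test would be expressed as a join over the Parquet-backed annotation sets rather than in-memory sequences.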