CORD-19: COVID-19 Open Research Dataset
"Today, researchers and leaders from the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) at the National Institutes of Health released the COVID-19 Open Research Dataset (CORD-19) of scholarly literature about COVID-19, SARS-CoV-2, and the Coronavirus group.
Requested by The White House Office of Science and Technology Policy, the dataset represents the most extensive machine-readable Coronavirus literature collection available for data and text mining to date, with over 29,000 articles, more than 13,000 of which have full text.
Now, The White House joins these institutions in issuing a call to action to the Nation’s artificial intelligence experts to develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19."
Downloading the Data
The CORD-19 dataset is a large collection of machine-readable publications dating from 1870 up to the present (within a few weeks). It spans a wide variety of topics, including studies on aggravated respiratory conditions, international epidemics such as Ebola and SARS, and the latest publications on COVID-19. The overall goal is to take this dense web of information and turn it into a comprehensive picture.
This code attempts to do so by creating a set of topics for a given time slice (denser spans of time are further divided into sub-topics), finding the most similar documents in each topic, and creating extractive summaries for each topic. These are included in the wiki for the repository.
The full extractive summaries for all topics found are included in the wiki, along with an abstractive summary authored by me.
NLP and Visualization Python Packages
The setup shell file installs all Python packages necessary for running the code. Below I highlight a few key packages that form the backbone of the code, as well as useful packages for visualization.
- spaCy and SciSpacy are used to perform tokenization, recognize parts of speech, match patterns and phrases, and remove stop words.
- RAKE, or the Rapid Automatic Keyword Extraction algorithm, is used as a fairly general and fast keyword extraction tool in the first stage of the algorithm (Layer 1) to quickly extract keywords from the title and abstract.
- Fuzzy string matching uses the Levenshtein distance between sequences of tokens to decide whether they are synonyms. Examples include "reading frame" and "open reading frame", or "gastroenteritis" and "transmissible gastroenteritis virus". My convention is to always keep the longer N-gram (the longer sequence of words).
- NMF, or Non-Negative Matrix Factorization, is an unsupervised learning algorithm similar to Latent Dirichlet Allocation but converges more rapidly by using a more advanced minimization technique. In this process, a document-term matrix is constructed with the weights of various terms from a set of documents. This matrix is factored into a term-feature and a feature-document matrix. The features are derived from the contents of the documents, and the feature-document matrix describes data clusters of related documents. A detailed description can be found here: Fast Local Algorithms for Large Scale Nonnegative Matrix and Tensor Factorizations
- The document features are represented using Term Frequency-Inverse Document Frequency (TF-IDF) matrices. The matrix is fit and transformed to create document clusters using NMF. Rare words (with a small document frequency) that are not frequent across all documents are emphasized with a larger TF-IDF score. Cosine similarity between TF-IDF vectors is used to create summaries for each topic. CountVectorizer is similar but consists only of a matrix of token counts; these vectors are used to create a histogram of word counts. (A minimal sketch of the TF-IDF + NMF step appears after this list.)
- PyTextRank is used for a more advanced but slower recognition of key phrases that contain the topic words. Applying a cut on the rank score removes more general phrases that would blow up the size of the text summaries.
- t-distributed Stochastic Neighbor Embedding (t-SNE) is a tool for visualizing the topics by projecting the higher-dimensional document space onto a 2D plane. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. In effect this clusters points that are similar and pushes dissimilar clusters apart. This algorithm is used to visualize topic clusters. A key tunable parameter in the algorithm is the perplexity, which in general should scale with the number of features.
- celluloid provides GIF animations from matplotlib. I use it to step through t-SNE plots that scan over different numbers of topics, which visualizes how many topics might be necessary for a given time slice.
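As a minimal sketch of how the TF-IDF document-term matrix and NMF combine to produce topics, the scikit-learn snippet below builds topics from a few toy documents; the documents and parameter choices are illustrative and are not taken from the actual pipeline.

```python
# Minimal sketch: TF-IDF features + NMF topics (toy data, illustrative parameters)
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "bat coronavirus spike protein receptor binding",
    "public health measures and quarantine during the outbreak",
    "antiviral drug treatment of infected patients",
]

# Document-term matrix weighted by TF-IDF (rarer terms get larger weights)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Factor into document-topic (W) and topic-term (H) matrices
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)   # document-topic weights
H = nmf.components_        # topic-term weights

terms = vectorizer.get_feature_names_out()
for k, row in enumerate(H):
    top_terms = [terms[i] for i in row.argsort()[::-1][:3]]
    print(f"topic {k}: {top_terms}")
```

The rows of H hold the term weights per topic, which is where top topic keywords come from, while W gives each document's weight in every topic.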
CORD Crusher Main Executable
CORDCrusher.py is the main executable file that calls the set of functions in the NLP pipeline that create keywords, categorize documents, and summarize the text. Each step is selected with the -m option, which runs a given method.
The first step is to choose a time range of publications to consider; the metadata file is then truncated to contain only publications in the given time frame (in the example below it is 2019 2021). PATHtoMETAData is the full path to the CORD-19 data folder that contains the metadata.csv file. The argument -Y accepts a range of years, so -Y 2002 2012 2019 2021 would produce three time slices [2002, 2012], [2012, 2019], [2019, 2021] (the last range is limited to today's date) and creates output CSV files named TimeSlicePapersXtoY.csv for each range [X, Y].
python CORDCrusher.py -m TimeSlice --path PATHtoMETAData -Y 2019 2021
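Conceptually, the TimeSlice step boils down to filtering metadata.csv by publication year. The pandas sketch below is only an approximation: the publish_time column name comes from the public CORD-19 metadata schema, and the exact filtering logic in CORDCrusher.py may differ.

```python
# Rough sketch of the TimeSlice idea (publish_time comes from the CORD-19 metadata schema)
import pandas as pd

metadata = pd.read_csv("PATHtoMETAData/metadata.csv", low_memory=False)

# publish_time is a date-like string; coerce to datetime and keep only the year
years = pd.to_datetime(metadata["publish_time"], errors="coerce").dt.year

lo, hi = 2019, 2021
time_slice = metadata[(years >= lo) & (years <= hi)]
time_slice.to_csv(f"TimeSlicePapers{lo}to{hi}.csv", index=False)
```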
The next step is to extract RAKE keywords from the title and abstract in the time slice CSV files. The options are similar to the above but include a number N corresponding to the maximum number of keywords to keep (N=20 by default). This step also fills in a column of tags that flag papers based on spaCy match patterns.
python CORDCrusher.py -m RAKE -Y 2019 2021 --NRaked N
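To illustrate the RAKE step, the sketch below uses the rake_nltk package to pull ranked keyword phrases from a title and abstract; the package choice and example text are assumptions and not necessarily what CORDCrusher.py uses internally.

```python
# Sketch of RAKE keyword extraction on a title + abstract (rake_nltk used for illustration)
from rake_nltk import Rake

title = "Transmission dynamics of SARS-CoV-2 in a seafood wholesale market"
abstract = "We study infection control measures and the role of the market in early spread."

r = Rake()  # defaults to NLTK English stop words and punctuation
r.extract_keywords_from_text(title + ". " + abstract)

N = 20  # keep at most N ranked keywords, mirroring --NRaked
keywords = r.get_ranked_phrases()[:N]
print(keywords)
```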
The step for creating topics runs the NMF topic-building algorithm, stores the top 10 topic keywords, stores text lines that match topic words, and stores titles that match the raked keywords used to build the topic. To reduce the size of the files for the matched lines (and thus prevent memory use from blowing up downstream), we keep lines where the pyTextRank score is larger than 0.02 (by default) or a larger value such as 0.04, so that the matched lines can be ranked by TF-IDF scores later. The number of topics to create is defined by setting Ntopics.
The Era is a predefined set of time slices, each of which corresponds to a summary in the wiki; the available options are:
Topics can have a specific focus by requiring a tag (based on spaCy Match patterns). The example focus below is public health: it looks for papers that have both the COVID19 tag and the PublicHealth tag, as well as other tags that correspond to infection control measures.
python CORDCrusher.py -m CreateTopics --Era COVID19andSPublicHealth --topics Ntopics -o COVID19andSPublicHealth --pyTextRank 0.04 -Y 2019 2021 --path PATHtoMETAData
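The tag-based focus can be pictured with spaCy's Matcher; the tag names and token patterns below are purely illustrative and are not the patterns shipped with this repository.

```python
# Illustrative spaCy Matcher patterns for tagging papers (tag names/patterns are examples only)
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Each tag is a named set of token patterns
matcher.add("PublicHealth", [[{"LOWER": "public"}, {"LOWER": "health"}]])
matcher.add("InfectionControl", [[{"LOWER": "infection"}, {"LOWER": "control"}],
                                 [{"LOWER": "quarantine"}]])

doc = nlp("Public health and quarantine measures for infection control during the outbreak.")
tags = {nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc)}
print(tags)  # {'PublicHealth', 'InfectionControl'}
```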
The above step creates a set of text files that serve as input for this step, which ranks the documents (from the NMF embeddings), the matched text lines per document, and the matched sentences according to their TF-IDF scores. This is the main output for creating the summaries based on the most typical papers, paragraphs, and sentences.
python CORDCrusher.py -m CreateSummaries --Era COVID19andSPublicHealth --topics Ntopics -o COVID19andSPublicHealth
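A rough sketch of the ranking idea, scoring candidate lines by the cosine similarity of their TF-IDF vectors against a topic keyword string, is shown below; the scoring target and the toy sentences are assumptions, not the actual ranking code.

```python
# Sketch: rank candidate lines for a topic by TF-IDF cosine similarity (toy data)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topic_keywords = "public health quarantine infection control measures"
candidates = [
    "Quarantine and contact tracing were the main public health measures.",
    "The spike protein binds the ACE2 receptor.",
    "Infection control in hospitals reduced transmission.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform([topic_keywords] + candidates)

# Similarity of each candidate (rows 1..n) to the topic keyword vector (row 0)
scores = cosine_similarity(X[0], X[1:]).ravel()
for score, line in sorted(zip(scores, candidates), reverse=True)[:2]:  # top NRanked lines
    print(f"{score:.2f}  {line}")
```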
The last step is to write out the summaries to a markdown file. Matched lines and title names are reverse matched to find the URL link to the full paper. NRanked specifies how many of the highest TF-IDF items to keep (the default is 5). The output markdown file contains the following information for each of the Ntopics: a table of the top-ranked documents along with their TF-IDF scores, a list of paragraphs for the highest-scoring matched text lines per paper, and finally the highest-scoring matched sentences. The last argument --TopicLabels gives the labels for each topic; if these are unknown you can specify numbers (e.g. for five topics, --TopicLabels 0 1 2 3 4).
python CORDCrusher.py -m WriteSummaries --Era COVID19andSPublicHealth --topics Ntopics --topRanked NRanked -o COVID19andSPublicHealth --TopicLabels Strings Separated by Spaces
Full Example for output:
python CORDCrusher.py -m WriteSummaries --Era COVID19andZoo -o COVID19andZoo --topRanked 5 --topics 5 --TopicLabels "seafood wholesale market" "Phylogenetic analyses" "Porcine epidemic diarreah virus" "ACE2 receptor similarity to bat/pangolin" "antiviral drugs"
This produces the summary (aside from the part written by me after reading it) here
Supporting Code for Investigations
Using RAKE for keyword extraction can be tricky when the keywords are turned into N-grams (sequences of words), since the result may contain redundant or incomplete N-grams like "world health", "health organization", and "world health organization". The code in BuildNGrams.py builds a dictionary so that N-grams within a fuzzy matching distance are combined; in the above case the counts for "world health" and "health organization" would be added to "world health organization".
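A minimal sketch of this merging idea is shown below, using rapidfuzz's partial ratio as the fuzzy match and folding shorter N-grams into the longest surviving one; the threshold and the library choice are assumptions rather than the exact logic in BuildNGrams.py.

```python
# Sketch: fold counts of overlapping N-grams into the longest variant (threshold is illustrative)
from rapidfuzz import fuzz

counts = {"world health": 40, "health organization": 35, "world health organization": 120}

merged = {}
# Visit N-grams longest first so shorter fragments are absorbed by the longest match
for ngram in sorted(counts, key=len, reverse=True):
    target = next((kept for kept in merged if fuzz.partial_ratio(ngram, kept) > 90), None)
    if target is not None:
        merged[target] += counts[ngram]
    else:
        merged[ngram] = counts[ngram]

print(merged)  # {'world health organization': 195}
```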
The before/after of the NGram output is plotted as a histogram of counts by calling:
python CORDCrusher.py -m RakedKeywords -Y 2019 2021
This produces a histogram, as seen below, that can be used to eyeball key topics (like public health) and make sure the topic word is included in the NMF topics, to find stop words (like "sars cov") that could be removed due to their high frequency, and to check how well the N-grams are cleaned up (as in the above example).
The key step for creating summaries is choosing the number of topics and the minimum pyTextRank score to cut out.
Choosing the number of topics is based on looking at a 2D projection of the input keywords. I use t-distributed stochastic neighbor embedding (t-SNE) to project from word space onto a 2D plane of topics. This method is well suited to NMF topic analysis. It first learns a probability distribution over pairs of higher-dimensional word phrases, then creates a probability distribution in the lower-dimensional plane that minimizes the Kullback–Leibler divergence between the two. The result is a 2D map of points with clumps that correspond to a given topic.
It is useful to see how the t-SNE scatter plot changes for different numbers of topics, and at which point including more topics no longer results in tighter clusters. The plot below is generated with the following command:
python CORDCrusher.py -m TopicScan --TopicScan 2 4 8 10 15 20 25 30 --Era SARS2002to2005 -o TestScan
The TopicScan method runs NMF several times for a set of topic counts (e.g. 2 4 8 10 15 20 25 30) and a given Era (the example is for the SARS outbreak). The GIF shows an animation of the t-SNE projection for each number of topics. Up to 10 topics are needed in order to see some coherence, and with 30 topics the clusters are tight and more isolated from one another.
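A rough sketch of what TopicScan does, rebuilding NMF topics for several topic counts, projecting each result with t-SNE, and capturing the frames with celluloid, is shown below; the random stand-in data and parameters are illustrative only.

```python
# Sketch: animate t-SNE projections over several topic counts with celluloid (stand-in data)
import numpy as np
import matplotlib.pyplot as plt
from celluloid import Camera
from sklearn.decomposition import NMF
from sklearn.manifold import TSNE

X = np.random.default_rng(0).random((200, 50))  # stand-in for a non-negative TF-IDF matrix

fig = plt.figure()
camera = Camera(fig)
for n_topics in (2, 4, 8, 10):
    W = NMF(n_components=n_topics, init="nndsvd", random_state=0).fit_transform(X)
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(W)
    plt.scatter(emb[:, 0], emb[:, 1], c=W.argmax(axis=1), cmap="tab10", s=10)
    camera.snap()

camera.animate().save("topic_scan.gif", writer="pillow")
```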
The minimum pyTextRank score is investigated by looking at the abstracts of all documents with matched topic words. Comparing all phrases from pyTextRank against the phrases with matched topic words shows that low pyTextRank scores cover a range where the topic words are not very dominant, while high scores (rarer phrases that are highly relevant) are where the topic words are important in the abstract.
An example command is as follows:
python CORDCrusher.py -m PhraseRanking --pyTextRank 0.0 --Era SARS2005to2012PublicHealth --topics 20 -Y 2005 2012
The above method creates a scatter plot of points for the two cases: all phrases from abstracts with at least one mention of a topic word, and only the phrases containing a matched topic word. The lines show the probability density functions for the two cases based on kernel density estimation. The lines cross at 0.17, where the probability of unmatched phrases equals that of phrases containing topic words. Requiring phrases with matched topic words to have a rank score greater than 0.17 removes less important phrases that are often not very relevant in the abstract and can have a large matching frequency across documents.
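The crossing point can be estimated roughly as below with scipy's gaussian_kde; the synthetic score distributions here only stand in for the real pyTextRank scores, so the resulting cutoff is illustrative.

```python
# Sketch: locate where the two KDE curves of pyTextRank scores cross (synthetic scores)
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
unmatched = rng.beta(2, 20, 2000)  # phrases without topic words, skewed toward low scores
matched = rng.beta(4, 12, 500)     # phrases containing topic words, shifted higher

kde_unmatched = gaussian_kde(unmatched)
kde_matched = gaussian_kde(matched)

grid = np.linspace(0, 1, 1000)
diff = kde_matched(grid) - kde_unmatched(grid)

# Last point where the matched-phrase density overtakes the unmatched one
flips = np.where((diff[:-1] < 0) & (diff[1:] >= 0))[0]
cutoff = grid[flips[-1]] if flips.size else None
print("suggested rank-score cutoff:", cutoff)
```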