
tchanda90/covid19-textmining

COVID-19 Text Mining

Purpose

The CORD-19 dataset is a vast collection of literature on the novel coronavirus. We can apply text and data mining approaches to find answers to questions in the literature in support of the ongoing COVID-19 response efforts worldwide.

What do we know about COVID-19 risk factors?
  • Smoking, pre-existing pulmonary disease
  • Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities
  • Neonates and pregnant women
  • Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.
  • Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
  • Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
  • Susceptibility of populations
  • Public health mitigation measures that could be effective for control

Method

First, documents on COVID-19 are retrieved with a BM25 search engine. Then, to answer the questions above, two methods are used to find sentences in the retrieved papers that address each topic.
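The retrieval step can be sketched as follows. This is a minimal, standard-library illustration of Okapi BM25 scoring, not the repository's actual search engine: the whitespace tokenizer, the `bm25_scores` helper, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant (the "+ 1" inside the log).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, for illustration only.
docs = [
    "covid-19 incubation period estimated from case reports",
    "influenza vaccine trial in adults",
    "covid-19 transmission in households",
]
scores = bm25_scores("covid-19 incubation", docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here `ranked` lists document indices from best to worst match. A document that shares no term with the query scores zero, so it can be dropped before the sentence-level methods run.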

Method 1:

  1. Create TF-IDF vectors for all sentences from all papers
  2. For a particular Search Query, get the TF-IDF vector.
  3. Find the sentences from the papers with the highest Cosine Similarity to the Search Query.
  • Pros: Fast and accurate.
  • Cons: Not able to capture semantic relationships between words.
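The three steps of Method 1 can be sketched as below. This is a minimal standard-library version under stated assumptions, not the repository's code: the smoothed IDF formula, the whitespace tokenizer, the helper names (`fit_idf`, `tfidf_vector`, `cosine`, `most_similar`), and the example sentences are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def fit_idf(sentences):
    """Step 1 (part): learn a smoothed IDF weight for every term in the corpus."""
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(tokenize(s)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(text, idf):
    """Steps 1 and 2: sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query, sentences):
    """Step 3: index of the sentence with the highest cosine similarity."""
    idf = fit_idf(sentences)
    q = tfidf_vector(query, idf)
    sims = [cosine(q, tfidf_vector(s, idf)) for s in sentences]
    best = max(range(len(sentences)), key=lambda i: sims[i])
    return best, sims


# Hypothetical sentences, for illustration only.
sentences = [
    "the median incubation period was five days",
    "smoking was associated with severe disease",
    "pregnant women showed mild symptoms",
]
best, sims = most_similar("incubation period", sentences)
```

The sketch also makes the stated limitation visible: a sentence that shares no exact term with the query gets a similarity of zero, even if it uses a synonym.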

Method 2:

  1. Train Word Embeddings (Word2Vec) on the papers' texts.
  2. For a particular Search Query, get the embedded Word Vectors.
  3. Find the sentences from the papers with the lowest Word Mover's Distance to the Search Query.
  • Pros: Able to capture semantic relationships between words.
  • Cons: Distance calculations are slow.
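The idea behind Method 2 can be illustrated with a small sketch. Two substitutions are made here and should be read as assumptions: the `EMBEDDINGS` table holds hand-made 2-D vectors standing in for trained Word2Vec vectors, and `relaxed_wmd` computes the relaxed Word Mover's Distance (a cheaper lower bound in which each word simply moves to its nearest counterpart) rather than the exact distance, which needs an optimal-transport solver. Libraries such as gensim expose an exact version as `KeyedVectors.wmdistance`; whether the repository uses it is not stated.

```python
import math
from collections import Counter

# Hypothetical 2-D embeddings standing in for trained Word2Vec vectors.
EMBEDDINGS = {
    "incubation": (1.0, 0.0),
    "latency": (0.9, 0.1),
    "period": (0.8, 0.3),
    "time": (0.7, 0.4),
    "smoking": (0.0, 1.0),
    "risk": (0.1, 0.9),
}


def nbow(tokens):
    """Normalized bag-of-words weights over in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def one_sided(src, dst):
    # Each source word sends all of its weight to the closest target word.
    return sum(
        w * min(math.dist(EMBEDDINGS[t], EMBEDDINGS[u]) for u in dst)
        for t, w in src.items()
    )


def relaxed_wmd(a, b):
    """Relaxed WMD: a lower bound on the exact Word Mover's Distance."""
    da, db = nbow(a), nbow(b)
    if not da or not db:
        return float("inf")  # nothing to compare
    return max(one_sided(da, db), one_sided(db, da))


query = ["incubation", "period"]
sentences = [["latency", "time"], ["smoking", "risk"]]
distances = [relaxed_wmd(query, s) for s in sentences]
best = min(range(len(sentences)), key=lambda i: distances[i])
```

Neither sentence shares a word with the query, yet the first one is ranked closer because its words lie near the query words in embedding space. The nested nearest-neighbor search over every query and sentence pair also hints at why distance calculations are slow on a large corpus.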

Results

Question: Incubation Period - TF-IDF

Question: Incubation Period - WMD


Question: Co-morbidities - TF-IDF

Question: Co-morbidities - WMD


Question: High Risk Group - TF-IDF

Question: High Risk Group - WMD


Question: Reproductive Number - TF-IDF

Question: Reproductive Number - WMD


Question: Pregnant Women - TF-IDF

Question: Pregnant Women - WMD


Question: Neonates of Mothers with Covid-19 - TF-IDF

Question: Neonates of Mothers with Covid-19 - WMD