ryanpmeyer/RapidReviews-code-review

COVIDScholar: Rapid Reviews

About this Repo

Our contributions to the COVIDScholar Rapid Reviews project live in the full (private) COVIDScholar repositories. This repo contains only the code we wrote, so that our contributions are clear to the graders. As a result, it cannot simply be downloaded and run like other projects from this class. The front-end interface can be cloned and run locally from https://github.com/COVID-19-Text-Mining/RapidReview-frontend, but the backend and model code cannot. Because we worked on an existing project, and our mentor has integrated our code into COVIDScholar's existing systems, we hope the graders will understand this limitation.

Summary

In response to the unprecedented pandemic, the scientific community has been releasing research literature at a remarkable speed and scale. The sheer volume of new research has made it difficult for researchers to read and process these papers in a timely fashion.

To facilitate the release of COVID-related research papers, we used NLP to build an ML model that rapidly identifies potential peer reviewers based on their relevance to a paper's subject. This would aid researchers in synthesizing the information spread across thousands of emergent research articles, patents, and clinical trials into actionable insights and new knowledge.

In this project, we first performed exploratory data analysis and data cleaning on the existing COVIDScholar corpus and then trained the model on the word embeddings.

After the model was trained, we went through several rounds of testing it, soliciting feedback from the Rapid Reviews team, and retraining it with more data scraped from databases.

Similarly, multiple iterations of the front-end user interface were designed, deployed, user-tested, and refined. Our final product is a web interface currently being hosted and used by COVIDScholar!

Rapid Reviews Model

Our model is built into the existing COVIDScholar web application. The databases store the entire corpus of papers, as well as all of the document embeddings our model produced after training. The backend server uses these embeddings to calculate scores and sends the client interface all of the information it needs, including related papers, suggested peer reviewers, and topic classifications.

Word2vec represents words as dense vectors that, unlike one-hot encodings, preserve the relationships between word meanings in context. Doc2Vec extends word2vec to produce a representation vector for an entire document rather than for a single word. Our model uses Doc2Vec to embed the abstracts of papers, and these embeddings can be used to compute relevance scores for potential reviewers or similarity scores to other papers.
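To illustrate why dense embeddings are preferred over one-hot encoding, here is a minimal sketch in plain Python. The vocabulary and the dense vectors are hand-picked toy values (hypothetical, not learned by word2vec); the point is only that one-hot vectors cannot express that two words are related, while dense vectors can.

```python
def one_hot(word, vocabulary):
    """Return a one-hot vector: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocabulary]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical three-word vocabulary, for illustration only.
vocabulary = ["virus", "pathogen", "banana"]

# With one-hot encoding, any two distinct words have a dot product of zero,
# so "virus" looks exactly as unrelated to "pathogen" as it does to "banana".
one_hot_related = dot(one_hot("virus", vocabulary), one_hot("pathogen", vocabulary))
one_hot_unrelated = dot(one_hot("virus", vocabulary), one_hot("banana", vocabulary))

# Hand-picked toy dense vectors (assumed values, NOT real word2vec output).
dense = {
    "virus": [0.9, 0.1],
    "pathogen": [0.8, 0.2],
    "banana": [0.1, 0.9],
}

# With dense vectors, related words can score higher than unrelated ones.
dense_related = dot(dense["virus"], dense["pathogen"])
dense_unrelated = dot(dense["virus"], dense["banana"])

print(one_hot_related, one_hot_unrelated)
print(dense_related > dense_unrelated)
```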

To implement our reviewer suggestion model, we first cleaned the abstract strings and then embedded them by training an unsupervised Doc2Vec model on our entire corpus of research papers. These embedded abstracts are stored as compact, fixed-length vectors, which saves memory and time when computing any heuristics over them. We can compute the cosine similarity of two embedded abstracts to get a similarity score for the papers themselves. Furthermore, we trained an NER model to identify copyright notices, using 800 abstracts that were extracted and manually labeled in Prodigy.
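The cosine similarity step can be sketched as follows. This is a minimal standard-library version, not the project's actual code; the three-dimensional vectors are made-up stand-ins for real Doc2Vec embeddings, which would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embedded abstracts (toy values).
abstract_a = [0.2, -0.5, 0.7]
abstract_b = [0.4, -1.0, 1.4]    # same direction as abstract_a, twice the length
abstract_c = [-0.2, 0.5, -0.7]   # opposite direction to abstract_a

print(cosine_similarity(abstract_a, abstract_b))
print(cosine_similarity(abstract_a, abstract_c))
```

Because cosine similarity depends only on direction, `abstract_a` and `abstract_b` score as (numerically close to) identical even though their magnitudes differ.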

In addition to the corpus of COVIDScholar papers, we used papers from PubMed to train our model further. PubMed is another database of academic research papers with existing categorical field tags, so we were able to use a subset of its data to teach our model to classify the subject matter of an abstract.
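One simple way to classify an abstract from labeled embeddings is a nearest-centroid rule, sketched below. This is an assumption made for illustration and not necessarily the classifier used in the project; the field tags and vectors are hypothetical toy values rather than real PubMed data.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def classify(vector, centroids):
    """Return the label whose centroid is closest (Euclidean) to the vector."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Hypothetical labeled embeddings: field tag -> embedded abstracts (toy values).
labeled_embeddings = {
    "virology": [[0.9, 0.1], [0.8, 0.3]],
    "epidemiology": [[0.1, 0.9], [0.2, 0.8]],
}

# "Training" here is just computing one mean vector per field tag.
centroids = {
    label: centroid(vectors) for label, vectors in labeled_embeddings.items()
}

# Classify a new (hypothetical) embedded abstract.
predicted = classify([0.7, 0.2], centroids)
print(predicted)
```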

The model allows us to query the N papers in our corpus most similar to any given abstract string. For reviewer suggestions, we calculate the cosine similarity between the paper being reviewed and every other paper in the corpus, then weight authors' contributions to the most similar papers to compute a reviewer relevance score. In the backend of the COVIDScholar web application, our embeddings are searched with Vespa.ai, whose nearest-neighbor search limits the candidates to similar papers before author relevance scores are calculated.
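The query-then-score flow described above can be sketched in a few lines. The weighting scheme here (each author's score is the sum of the similarities of their papers among the top N) is an assumption chosen for simplicity, since the exact weighting is not spelled out; the corpus, paper IDs, and author names are hypothetical, and a brute-force scan stands in for the Vespa.ai nearest-neighbor search.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def most_similar(query, corpus, n):
    """Return the n (paper_id, similarity) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query, paper["vector"]), paper_id)
        for paper_id, paper in corpus.items()
    ]
    scored.sort(reverse=True)
    return [(paper_id, similarity) for similarity, paper_id in scored[:n]]


def reviewer_scores(query, corpus, n):
    """Rank authors by summed similarity of their papers among the top n."""
    scores = defaultdict(float)
    for paper_id, similarity in most_similar(query, corpus, n):
        for author in corpus[paper_id]["authors"]:
            scores[author] += similarity  # assumed weighting: plain sum
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus: paper ID -> embedding and author list.
corpus = {
    "p1": {"vector": [1.0, 0.0], "authors": ["A. Author", "B. Author"]},
    "p2": {"vector": [0.9, 0.1], "authors": ["A. Author"]},
    "p3": {"vector": [0.0, 1.0], "authors": ["C. Author"]},
}

query_embedding = [1.0, 0.05]  # embedding of the abstract under review
top_papers = most_similar(query_embedding, corpus, n=2)
ranked_reviewers = reviewer_scores(query_embedding, corpus, n=2)
print(top_papers)
print(ranked_reviewers)
```

Restricting the scoring to the top N papers mirrors the role of the nearest-neighbor search: authors of dissimilar papers never enter the ranking at all.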

User Interface

The front-end of this tool was built with React.js and was bootstrapped with Create React App.

The user interface displays the uploaded abstract and a list of suggested peer reviewers generated by the model. For each reviewer, it also lists information such as their past relevant papers and the strength of their relevance to the current abstract, so that users can easily compare the suggested reviewers.

To run the front-end, go to https://github.com/COVID-19-Text-Mining/RapidReview-frontend and download the code locally. If you do not already have it, install Node.js from the official website https://nodejs.org/en, which also provides NPM (Node Package Manager).

After installing Node.js and the project's dependencies (npm install), you can run:

npm start

Runs the app in the development mode.
Open http://localhost:3000 to view it in the browser.

The page will reload if you make edits.
You will also see any lint errors in the console.

npm test

Launches the test runner in the interactive watch mode.
See the section about running tests for more information.

About

code review repo for IEOR 135 team 21 Rapid Reviews
