MetaMining: Mining of Literature Data for Meta-analysis

CS410 Course Project at UIUC

Chaochao Zhou (single member / project leader)
Email: cz76@illinois.edu

Presentation Recording

Paper

This work has been documented in a paper, which can be accessed online:

Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text
Chaochao Zhou, Bo Yang
arXiv preprint, 2022
[arxiv]

Implementation

All code and files are provided in the "/Demo" folder of this repo:

Code

A Colab notebook, "test.ipynb", is provided as a demo. Run the code cells in this notebook in order.

Files

Trained RNN model: current_model.h5
A testing dataset (current_data) including:
- Index-word dictionary: index_word.pkl
- Text (token sequence): x_test.pkl
- Entity label: y_test.pkl
- Numeral label: nl_test.pkl
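
As a rough sketch of how these files might be read outside the notebook, the snippet below loads the pickled test data and maps a token sequence back to words. It assumes that `index_word.pkl` holds a dict from integer index to word, that `x_test.pkl` holds a sequence of integer token sequences, and that index 0 is padding; none of this is confirmed here, and `load_test_data` and `decode` are hypothetical helper names.

```python
import pickle
from pathlib import Path


def load_test_data(data_dir):
    """Load the four pickled test files from data_dir into a dict."""
    names = ["index_word", "x_test", "y_test", "nl_test"]
    data = {}
    for name in names:
        with open(Path(data_dir) / f"{name}.pkl", "rb") as f:
            data[name] = pickle.load(f)
    return data


def decode(token_ids, index_word, pad_index=0):
    """Map integer token ids back to words, skipping padding (assumed to be 0)."""
    return [index_word[int(i)] for i in token_ids if int(i) != pad_index]
```

If the model was saved with Keras, which the `.h5` extension suggests but does not guarantee, it could likely be loaded with `tensorflow.keras.models.load_model("current_model.h5")`.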

Project Overview

Objective

A single study may include only a small sample and therefore lack statistical power. Meta-analysis addresses this problem by combining the results of multiple scientific studies. However, meta-analysis requires researchers to extract and collect a large amount of data by reviewing numerous publications, which is tedious and time-consuming. It is therefore desirable to develop a pipeline that automatically extracts quantitative data from text.

Given a text, the objective of this work is to extract the entities related to each numeral (N), namely its metric (M) and unit (U).
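
To make the target output concrete, here is a minimal sketch of one extracted record. The sentence and the `Measurement` class are illustrative assumptions, not taken from the project's dataset or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One structured record: a numeral with its related metric and unit."""
    numeral: str  # N, kept as the raw text token
    metric: str   # M
    unit: str     # U


# Illustrative sentence (not from the project's dataset).
sentence = "The mean body weight of the patients was 72.5 kg ."
record = Measurement(numeral="72.5", metric="mean body weight", unit="kg")
```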

Challenges

Multiple numerals may co-exist in a single sentence, so the relations corresponding to each numeral need to be found.
The number of words used to describe an entity (metric or unit) varies across sentences.
A numeral may be associated with multiple metrics hierarchically.
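
The first challenge can be illustrated with a small sketch: when a sentence contains several numerals, one option is to build one sample per numeral, each with a binary mask marking the numeral of interest, so that a model can be queried for one numeral at a time. This is an assumption about how the problem could be framed, not a description of the project's actual encoding; `is_numeral` and `per_numeral_samples` are hypothetical helpers.

```python
import re


def is_numeral(token):
    """Return True for simple integer or decimal tokens such as '12' or '3.5'."""
    return re.fullmatch(r"\d+(\.\d+)?", token) is not None


def per_numeral_samples(tokens):
    """Yield (tokens, mask) pairs, one per numeral; mask is 1 at the target numeral."""
    for pos, token in enumerate(tokens):
        if is_numeral(token):
            mask = [0] * len(tokens)
            mask[pos] = 1
            yield tokens, mask


# Illustrative sentence with two numerals (not from the project's dataset).
tokens = "The mean age was 45 years and the mean height was 170 cm".split()
samples = list(per_numeral_samples(tokens))
```

Each element of `samples` pairs the same token list with a different mask, so the two numerals in the sentence are handled as separate extraction problems.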

Contribution

The main work proposed and completed to solve the mining problem includes:

Collected texts and performed text pre-processing
Created a dataset with text annotation of entities and relations
Developed a recurrent neural network (RNN) and tested its performance
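
As a sketch of how per-token predictions from a tagger such as the RNN could be turned into structured output, the function below groups consecutive tokens with the same label into spans, which accommodates entities of varying length. The label names ("M", "N", "U", "O") and the `labels_to_spans` helper are assumptions for illustration; the project's actual label scheme may differ.

```python
from itertools import groupby


def labels_to_spans(tokens, labels, outside="O"):
    """Group consecutive tokens sharing a label into (label, text) spans."""
    spans = []
    pairs = zip(tokens, labels)
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        if label != outside:
            spans.append((label, " ".join(tok for tok, _ in group)))
    return spans


# Illustrative tokens and labels (not from the project's dataset).
tokens = ["mean", "body", "weight", "was", "72.5", "kg"]
labels = ["M", "M", "M", "O", "N", "U"]
spans = labels_to_spans(tokens, labels)
```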

Test

Some randomly sampled predictions from the test set (which was not used for training) are shown below:

Summary

The results show that the pipeline can automatically extract structured data from general text without special templates or patterns. Important measures within the mined structured data can be further filtered and collected in combination with text retrieval. Expanding the dataset and investigating other machine learning models are expected to improve performance. We are working on improving this pipeline; stay tuned for a newer version to be released in the near future!
