
zcc861007/CourseProject

 
 


MetaMining: Mining of Literature Data for Meta-analysis

CS410 Course Project at UIUC

Chaochao Zhou (single member / project leader)
Email: cz76@illinois.edu

Presentation Recording

Watch the video

Paper

This work has been documented in a paper, which can be accessed online:

Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text
Chaochao Zhou, Bo Yang
Arxiv preprint 2022
[arxiv]

Implementation

All code and files are provided in the "/Demo" folder of this repo:

Code

A Colab notebook, "test.ipynb", is provided as a demo. Run the code cells in this notebook in order.

Files

  • Trained RNN model: current_model.h5
  • A testing dataset (current_data) including:
    • Index-word dictionary: index_word.pkl
    • Text (token sequence): x_test.pkl
    • Entity label: y_test.pkl
    • Numeral label: nl_test.pkl
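
As a minimal sketch of how these test files might be read outside the notebook, the snippet below loads the four pickles and maps a token sequence back to words. It assumes that index_word.pkl holds a dict from integer index to word and that index 0 is padding; neither is confirmed here, and `load_test_data` and `decode_tokens` are hypothetical helper names. The trained model current_model.h5 is an HDF5 file, presumably saved with Keras, and loading it is left out to keep the sketch free of dependencies.

```python
import pickle
from pathlib import Path

# File names as listed above; the directory that holds them is an assumption.
FILES = {
    "index_word": "index_word.pkl",
    "x_test": "x_test.pkl",
    "y_test": "y_test.pkl",
    "nl_test": "nl_test.pkl",
}


def load_test_data(data_dir):
    """Load every pickle listed in FILES from data_dir into a dict."""
    data = {}
    for key, name in FILES.items():
        with open(Path(data_dir) / name, "rb") as f:
            data[key] = pickle.load(f)
    return data


def decode_tokens(token_ids, index_word, pad_id=0):
    """Map token indices back to words, skipping the assumed padding index."""
    return [index_word[int(i)] for i in token_ids if int(i) != pad_id]
```

With the files in place, `data = load_test_data(<folder>)` followed by `decode_tokens(data["x_test"][0], data["index_word"])` would print the first test sentence as a list of words, provided the assumptions above hold. Only unpickle files from a source you trust.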

Project Overview

Objective

A single study may include only a small sample, so its analyses may lack statistical power. Meta-analysis addresses this problem by combining the results of multiple scientific studies. However, meta-analysis requires researchers to extract and collect a large amount of data by reviewing numerous publications, which is tedious and time-consuming. It is therefore desirable to develop a pipeline that automatically extracts quantitative data from text.

Given a text, the objective of this work is to extract the entities related to each numeral (N), namely its metric (M) and unit (U). For example:

image
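
To make the target output concrete, one record per numeral could be represented as in the sketch below. This is an illustration only, not the repository's data format: the `Extraction` class, the `build_extraction` helper, the label codes ("M", "U", "O"), and the example sentence are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    """One structured record: a numeral with its metric and unit."""
    numeral: str
    metric: str
    unit: str


def build_extraction(tokens, labels, numeral_index):
    """Collect the metric and unit words for the numeral at numeral_index.

    labels holds one hypothetical code per token:
    "M" (metric), "U" (unit), or "O" (other).
    """
    metric = " ".join(t for t, lab in zip(tokens, labels) if lab == "M")
    unit = " ".join(t for t, lab in zip(tokens, labels) if lab == "U")
    return Extraction(tokens[numeral_index], metric, unit)


# Hypothetical sentence and labels, written by hand for illustration.
tokens = ["the", "mean", "body", "weight", "was", "70", "kg"]
labels = ["O", "M", "M", "M", "O", "O", "U"]
record = build_extraction(tokens, labels, numeral_index=5)
print(record)
# Extraction(numeral='70', metric='mean body weight', unit='kg')
```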

Challenges

  • Multiple numerals may co-exist in a single sentence, so the relations corresponding to each numeral need to be found.
  • The number of words used to describe an entity (metric or unit) varies from sentence to sentence.
  • A numeral may be associated with multiple metrics hierarchically.
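
The first challenge suggests treating each numeral as its own query against the same sentence. One way this could be sketched, assuming numerals can be found with a simple pattern and that each query is marked by a per-token 0/1 indicator, is shown below. The `numeral_queries` helper and the indicator encoding are hypothetical and are not a confirmed description of how nl_test.pkl is built.

```python
import re

# Matches plain integers and decimals such as "170" or "65.5" (an assumed,
# deliberately simple definition of a numeral).
NUMERAL = re.compile(r"[+-]?\d+(?:\.\d+)?")


def numeral_queries(tokens):
    """Yield (numeral_index, indicator) pairs, one per numeral token.

    indicator is a 0/1 list as long as tokens, with a single 1 at the
    position of the numeral being queried.
    """
    for i, tok in enumerate(tokens):
        if NUMERAL.fullmatch(tok):
            indicator = [0] * len(tokens)
            indicator[i] = 1
            yield i, indicator


# Hypothetical sentence containing two numerals.
tokens = ["height", "was", "170", "cm", "and", "weight", "was", "65.5", "kg"]
queries = list(numeral_queries(tokens))
```

Here `queries` holds two entries, one for "170" and one for "65.5", so the same sentence would be presented once per numeral and the entities found for each can be kept apart.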

Contribution

The main work completed to address this mining problem includes:

  • Collected texts and performed text pre-processing
  • Created a dataset with text annotation of entities and relations
  • Developed a recurrent neural network (RNN) and tested its performance

Test

Some randomly sampled predictions from the test set (which was not used for training) are shown below:

image

image

image

Summary

The results show that the pipeline can automatically extract structured data from general text without special templates or patterns. Important measures within the mined structured data can be further filtered and collected in combination with text retrieval implementations. Better performance is expected from expanding the dataset and investigating other machine learning models. We are working on improving this pipeline, and a newer version will be released in the near future.
