pubmed dataset prep (PMC)

From XML files to folds ready for training

This repo extracts data from PubMed XML files, processes it, and prepares folds for running your cross-validation experiments.

The following steps take place:

Data Extraction (found under extraction folder)

extract the required fields of the xml (title, abstract, body)
retrieve the text of the given fields
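
As a rough illustration of the extraction step, here is a minimal Python sketch that pulls the title, abstract, and body text out of a single article. It is not the code in the extraction folder. The tag names (article-title, abstract, body) assume the JATS layout used by PMC articles, and extract_fields and SAMPLE are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET


def extract_fields(xml_text):
    """Return title, abstract and body text of one article as a dict."""
    root = ET.fromstring(xml_text)

    def text_of(path):
        # find() returns the first match in document order, or None
        node = root.find(path)
        if node is None:
            return ""
        # itertext() walks nested tags (<p>, <italic>, ...) and yields their text.
        # Joining with a space keeps paragraphs apart, at the cost of an extra
        # space around inline markup; split()/join collapses repeated whitespace.
        return " ".join(" ".join(node.itertext()).split())

    return {
        "title": text_of(".//article-title"),
        "abstract": text_of(".//abstract"),
        "body": text_of(".//body"),
    }


# Tiny made-up article, only for demonstration
SAMPLE = """<article>
  <front><article-meta>
    <title-group><article-title>A <italic>small</italic> study</article-title></title-group>
    <abstract><p>We test one thing.</p></abstract>
  </article-meta></front>
  <body><sec><title>Intro</title><p>First paragraph.</p></sec></body>
</article>"""

fields = extract_fields(SAMPLE)
print(fields["title"])
```

Other fields could be added by extending the dictionary with further paths, which matches the modular intent described in this README.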

Data Processing (found under processing folder)

Text segmentation
POS addition (optional)
Text shuffling
Text split
Folds creation
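
The shuffling, splitting, and fold-creation steps can be sketched as follows. This is a minimal illustration and not the code in the processing folder: the fold count, the seed, and the names shuffle_and_fold and train_test_pairs are assumptions, and the repo's actual split scheme may differ.

```python
import random


def shuffle_and_fold(sentences, k=5, seed=0):
    """Shuffle the sentences and deal them into k folds of near-equal size."""
    items = list(sentences)
    # A seeded Random instance keeps the shuffle reproducible across runs
    random.Random(seed).shuffle(items)
    # Fold i receives items i, i+k, i+2k, ...
    return [items[i::k] for i in range(k)]


def train_test_pairs(folds):
    """Yield (train, test) pairs, holding out each fold in turn."""
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


sentences = [f"sentence {i}" for i in range(10)]
folds = shuffle_and_fold(sentences, k=5)
for train, test in train_test_pairs(folds):
    print(len(train), len(test))
```

Dealing items round-robin after the shuffle keeps fold sizes within one item of each other even when the number of sentences is not a multiple of k.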

Required prerequisites:

ruby
python 3
SciSpacy
Spacy
SciSpacy's medical tagger, installed with: pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.0/en_core_sci_md-0.2.0.tar.gz
ftfy, which fixes broken text for you, installed with: pip install ftfy
slurm (recommended, since the pipeline was built for parallel processing under slurm)
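
Before running the pipeline, it may help to check that the prerequisites above are available. The sketch below is one possible check and is not part of the repo: the helper names are hypothetical, the Python module names are the usual import names of the listed packages, and sbatch is assumed as the slurm command to look for.

```python
import importlib.util
import shutil

# Import names of the Python prerequisites listed above
REQUIRED_MODULES = ["spacy", "scispacy", "ftfy", "en_core_sci_md"]
# Command-line tools; sbatch is only needed when running under slurm
REQUIRED_COMMANDS = ["ruby", "sbatch"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found on the import path."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_commands(names=REQUIRED_COMMANDS):
    """Return the commands that are not found on PATH."""
    return [n for n in names if shutil.which(n) is None]


print("missing modules:", missing_modules())
print("missing commands:", missing_commands())
```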

This repo is modular and can be modified to address different needs: other field types can be extracted from the XML, POS tags can be added, and if new XML files are added only a single variable needs to be changed.
