This repo extracts data from PubMed XML files, processes it, and prepares folds so you can run your cross-validation experiment.
The following steps take place:
- Extract the required fields from the XML (title, abstract, body)
- Retrieve the text of the given fields
- Text segmentation
- POS addition (optional)
- Text shuffling
- Text split
- Folds creation
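
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the flat XML layout, the function names (`extract_fields`, `segment`, `make_folds`), the regex-based sentence splitter (a stand-in for the SciSpacy segmentation), and the exact way folds are built are all assumptions made for the example.

```python
import random
import re
import xml.etree.ElementTree as ET
from typing import Dict, List

FIELDS = ["title", "abstract", "body"]  # fields named in the step list


def extract_fields(xml_string: str, fields: List[str] = FIELDS) -> Dict[str, str]:
    """Collect the text of each requested field (hypothetical XML layout)."""
    root = ET.fromstring(xml_string)
    out = {}
    for name in fields:
        node = root.find(f".//{name}")
        # itertext() also gathers text from nested child elements
        out[name] = " ".join("".join(node.itertext()).split()) if node is not None else ""
    return out


def segment(text: str) -> List[str]:
    """Naive sentence splitter; a stand-in for the SciSpacy segmentation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def make_folds(items: List[str], k: int, seed: int = 0) -> List[Dict[str, List[str]]]:
    """Shuffle the items, split them into k parts, and build k train/test folds."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the folds are reproducible
    parts = [shuffled[i::k] for i in range(k)]
    folds = []
    for i in range(k):
        train = [x for j, part in enumerate(parts) if j != i for x in part]
        folds.append({"train": train, "test": parts[i]})
    return folds


sample = (
    "<article><title>A study.</title>"
    "<abstract>First point. Second point.</abstract>"
    "<body>Third point. Fourth point!</body></article>"
)
fields = extract_fields(sample)
sentences = [s for name in FIELDS for s in segment(fields[name])]
folds = make_folds(sentences, k=5)
```

Each fold holds out one part as test data and trains on the rest, so every sentence is used for testing exactly once across the `k` folds.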

Requirements:
- Ruby
- Python 3
- SciSpacy
- spaCy
- SciSpacy's medical tagger, installed with:
  pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.0/en_core_sci_md-0.2.0.tar.gz
- ftfy ("fixes text for you"), installed with:
  pip install ftfy
- Slurm (recommended, as the pipeline was built for parallel processing suited to Slurm)
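
Once the requirements above are installed, segmentation with optional POS tags might look like the sketch below. It assumes the `en_core_sci_md` model from the pip command above; the helper names `load_tagger` and `sentences_with_pos` are hypothetical and not taken from this repo.

```python
from typing import List, Tuple


def load_tagger(model_name: str = "en_core_sci_md"):
    """Load the SciSpacy model; needs spaCy and the model package installed."""
    import spacy  # imported lazily so the helper below works without spaCy

    return spacy.load(model_name)


def sentences_with_pos(nlp, text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (token text, POS tag) pairs per sentence."""
    doc = nlp(text)
    # doc.sents yields sentence spans; tok.pos_ is the coarse POS tag
    # (tok.tag_ holds the fine-grained tag if that is preferred)
    return [[(tok.text, tok.pos_) for tok in sent] for sent in doc.sents]


# Example usage (requires the model to be installed):
# nlp = load_tagger()
# for sent in sentences_with_pos(nlp, "The cells were stained. Results were recorded."):
#     print(sent)
```

Passing `nlp` in as an argument means the model is loaded once and reused across files, which matters when many documents are processed in parallel.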
This repo is modular and can be modified to address different needs: other field types can be extracted from the XML, POS tags can be added, and if new XML files are added only a single variable needs to be changed.