Baseline model for the MedSecId paper

This repository contains a basic baseline model for identifying sections in clinical text, from the paper A New Public Corpus for Clinical Section Identification: MedSecId. Its purpose is to provide a means to reproduce the results in the paper. If you want to include this work in your own projects, use the mimicsid package described in the Inclusion in Your Projects section, which was designed as an off-the-shelf package that can be installed with pip.

Reproducing the Results

Python 3.9.9 was used with the requirements in src/requirements.txt and src/requirements-mednlp.txt.
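
Since the results were produced with a specific interpreter version, it can help to confirm which Python is active before building the environment. The following is a minimal sketch and not part of the repository: the helper name matches_tested_version is hypothetical, and treating a matching major and minor version as "close enough" is an assumption of this sketch rather than a statement from the authors.

```python
import sys

# Version reported in this README as the one used to produce the results.
TESTED_VERSION = (3, 9, 9)


def matches_tested_version(version=None, strict=False):
    """Return True if `version` matches the Python version used for the paper.

    With strict=False (an assumption of this sketch), only the major and
    minor components are compared; strict=True also compares the patch level.
    """
    if version is None:
        # Default to the interpreter that is running this code.
        version = tuple(sys.version_info[:3])
    if strict:
        return tuple(version[:3]) == TESTED_VERSION
    return tuple(version[:2]) == TESTED_VERSION[:2]
```

Calling matches_tested_version() with no arguments checks the running interpreter; pass strict=True to require exactly 3.9.9.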

To train and test the models, use the run.sh script as follows:

1. Copy the MIMIC-III NOTEEVENTS.csv file to the corpus directory.
2. Download the annotation set and uncompress it:
   1. pushd corpus
   2. wget https://zenodo.org/record/7150451/files/section-id-annotations.zip
   3. unzip section-id-annotations.zip
   4. popd
3. Remove the repo results: rm -r results
4. Create the Python environment (in pyvirenv): ./run.sh pyenv
5. Install all Python libraries and models: ./run.sh pydep
6. Create the features as mini-batches (takes a while): ./run.sh batch
7. Train and test the models (takes a while): ./run.sh traintest
8. Create the metrics used in the paper: ./run.sh paperresults
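
The steps above can be sketched as a small driver that first checks the corpus inputs and then calls run.sh once per step. This is an illustration only, not a script shipped with the repository: the function names are hypothetical, the last step is assumed to be invoked as ./run.sh paperresults like the others, and the check assumes the downloaded zip file is still in the corpus directory after unzipping.

```python
import subprocess
from pathlib import Path

# ./run.sh subcommands in the order listed above. "paperresults" is assumed
# to be invoked through run.sh like the other steps.
STEPS = ("pyenv", "pydep", "batch", "traintest", "paperresults")


def missing_inputs(repo_root):
    """Return the expected corpus files that are not present yet."""
    corpus = Path(repo_root) / "corpus"
    required = [
        corpus / "NOTEEVENTS.csv",              # copied in by hand from MIMIC-III
        corpus / "section-id-annotations.zip",  # fetched by the wget step
    ]
    return [path for path in required if not path.is_file()]


def build_commands(steps=STEPS):
    """Turn each step name into the argument list for one run.sh call."""
    return [["./run.sh", step] for step in steps]


def run_pipeline(repo_root, dry_run=True):
    """Check the inputs, then run (or only print) each step in order."""
    missing = missing_inputs(repo_root)
    if missing:
        names = ", ".join(str(path) for path in missing)
        raise FileNotFoundError(f"missing corpus inputs: {names}")
    commands = build_commands()
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, cwd=repo_root, check=True)
    return commands
```

Called from the repository root as run_pipeline(Path(".")), this prints the five commands without executing them; pass dry_run=False to run them. The sketch deliberately does not delete the existing results directory, so that step still has to be done by hand.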

When these steps finish, there should be a results directory containing:

results/stats: the corpus statistics
results/perf: the summary of the results and labels of the best model
results/model: the models and model-specific results
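
A quick way to confirm that a run produced the layout listed above is to check for the three subdirectories. This is a minimal sketch with a hypothetical helper name (check_results); it only tests that the directories exist and says nothing about whether their contents are complete.

```python
from pathlib import Path

# Subdirectories this README says should exist under results/ after a run.
EXPECTED_SUBDIRS = {
    "stats": "corpus statistics",
    "perf": "summary of the results and labels of the best model",
    "model": "models and model-specific results",
}


def check_results(results_dir):
    """Map each expected subdirectory name to whether it exists."""
    results_dir = Path(results_dir)
    return {name: (results_dir / name).is_dir() for name in EXPECTED_SUBDIRS}
```

For example, check_results("results") returns a dictionary keyed by stats, perf and model, where any False value points at an output that was not created.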

Inclusion in Your Projects

The purpose of this repository is to reproduce the results in the paper. If you want to use the annotations and/or the pretrained model, please refer to the mimicsid repository.

Data Analysis

The medical concept (CUI) plot given in the paper, along with others, is available as an interactive 3D plot here.

Citation

If you use this project in your research, please use the following BibTeX entry:

@inproceedings{landes-etal-2022-new,
    title = "A New Public Corpus for Clinical Section Identification: {M}ed{S}ec{I}d",
    author = "Landes, Paul  and
      Patel, Kunal  and
      Huang, Sean S.  and
      Webb, Adam  and
      Di Eugenio, Barbara  and
      Caragea, Cornelia",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.326",
    pages = "3709--3721"
}

Please also cite the Zensols Framework:

@article{Landes_DiEugenio_Caragea_2021,
  title={DeepZensols: Deep Natural Language Processing Framework},
  url={http://arxiv.org/abs/2109.03383},
  note={arXiv: 2109.03383},
  journal={arXiv:2109.03383 [cs]},
  author={Landes, Paul and Di Eugenio, Barbara and Caragea, Cornelia},
  year={2021},
  month={Sep}
}

License

MIT License
