medical-nlp
Dataset compiled for Natural Language Processing using a corpus of medical transcriptions and custom-generated clinical stop words and vocabulary.
Usage
Clone the repository or download the files for use in medical-text Natural Language Processing (NLP) experiments.
data
- mtsamples.csv. Compiled from Kaggle's medical transcriptions dataset by Tara Boyle, scraped from Transcribed Medical Transcription Sample Reports and Examples. See the Kaggle repository.
- clinical-stopwords.txt. Compiled from Dr. Kavita Ganesan's clinical-concepts repository. See the Discovering Related Clinical Concepts Using Large Amounts of Clinical Notes paper.
- vocab.txt. Generated vocabulary text file for Natural Language Processing (NLP) using the Systematized Nomenclature of Medicine International (SNMI) data. See how to Generate your own vocab file.
- X.csv. Fully processed dataset obtained from running the Data Modelling notebook. Dataset simplified to 4 classes.
- classes.txt. Text file describing the dataset's classes: Surgery, Medical Records, Internal Medicine and Other.
- train.csv. Training data subset. Contains 90% of the X.csv processed file.
- test.csv. Test data subset. Contains 10% of the X.csv processed file.
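As a rough illustration of how these files might be used together, the sketch below loads a term list, strips stop words from a piece of text, and reproduces a 90/10 split with the standard library only. It rests on assumptions not stated in this README: that clinical-stopwords.txt and vocab.txt hold one term per line (with lines starting with `#` treated as comments), that the CSV files have a header row, and that a seeded random shuffle is an acceptable stand-in for however train.csv and test.csv were actually produced. All function names are hypothetical helpers, not part of this repository.

```python
import csv
import random
from pathlib import Path


def load_terms(path):
    """Read a term list, ASSUMING one term per line.

    Blank lines and lines starting with '#' are skipped (an assumption
    about the file layout, not something this README documents).
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {
        line.strip().lower()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


def remove_stopwords(text, stopwords):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)


def read_rows(path):
    """Load a CSV file as a list of dicts keyed by its header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def split_rows(rows, test_fraction=0.1, seed=0):
    """Shuffle a copy of rows and return (train, test) lists.

    A plain seeded shuffle; the split actually used for train.csv and
    test.csv may differ (for example, it may be stratified by class).
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

For example, `load_terms("clinical-stopwords.txt")` would return a set that can be passed to `remove_stopwords`, and `split_rows(read_rows("X.csv"))` would return two lists holding roughly 90% and 10% of the rows. Adjust the paths to wherever the files live in your checkout, and inspect the CSV header before relying on any particular column name.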