Utilizing Untagged Medical Literature for Diagnoses

This project aims to establish and evaluate a methodology for computationally processing medical literature and drawing diagnostic conclusions from it. We intend to build the project around a symptom-disease paradigm, employing NLP techniques to process large quantities of text and infer a disease diagnosis from a given set of symptoms.

ABOUT

We intend to use untagged, unstructured literature, extract information from it, and evaluate our findings. We use word embeddings, specifically Word2Vec, to vectorize the text.

Related Work

We referred to the following literature when considering the methodologies we could adopt to approach the problem:

Nye B, Jessy Li J, Patel R, et al. A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature. Proc Conf Assoc Comput Linguist Meet. 2018;2018:197-207.

METHODOLOGY

We used the Word2Vec implementation provided by the Gensim library for training. Using a subsection of the literature, we trained several Word2Vec models to create word embeddings and evaluated the resulting vectors by hand, checking similarities between different words to assess which model produced the most accurate embeddings. Training multiple models allowed us to tune the hyper-parameters, which could then be applied to the entire dataset.
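The tuning loop described above might look roughly like the sketch below. It assumes the Gensim 4.x API (`vector_size`, `wv.key_to_index`); the helper names, the grid values, and the probe word are hypothetical and are not taken from the project's notebook.

```python
from itertools import product


def hyperparameter_grid(vector_sizes=(100, 200), windows=(5, 10), min_counts=(2, 5)):
    """Return one keyword-argument dict per hyper-parameter combination."""
    return [
        {"vector_size": v, "window": w, "min_count": m}
        for v, w, m in product(vector_sizes, windows, min_counts)
    ]


def train_candidates(sentences, grid, probe_word, topn=5):
    """Train one Word2Vec model per configuration and collect the nearest
    neighbours of `probe_word` so they can be compared by hand.

    `sentences` is a list of token lists drawn from a subsection of the corpus.
    """
    # Imported here so the grid helper stays usable without Gensim installed.
    from gensim.models import Word2Vec

    results = []
    for params in grid:
        # Fixed seed and a single worker keep runs comparable (assumed settings).
        model = Word2Vec(sentences=sentences, workers=1, seed=42, **params)
        # A probe word below min_count is absent from this model's vocabulary.
        if probe_word in model.wv.key_to_index:
            neighbours = model.wv.most_similar(probe_word, topn=topn)
        else:
            neighbours = []
        results.append((params, neighbours))
    return results
```

Calling `train_candidates(sentences, hyperparameter_grid(), "fever")` would return one `(params, neighbours)` pair per configuration; the neighbour lists are what a reviewer would inspect manually before choosing the settings for the full dataset.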

During our work we noticed the need for some normalization, such as mapping the different references to COVID-19 (such as Coronavirus) to a single string, ‘covid19’. Our findings improved once we re-trained the model on the cleaned data.
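A minimal sketch of this kind of cleaning step, using only the standard library, is shown below. The alias list is an assumption made for illustration: the text above names only "Coronavirus" as an example, and the project's actual list may differ.

```python
import re

# Hypothetical alias list; only "Coronavirus" is named in the text above.
COVID_ALIASES = re.compile(
    r"\b(?:covid[\s\-]?19|coronavirus|sars[\s\-]?cov[\s\-]?2|2019[\s\-]?ncov)\b",
    flags=re.IGNORECASE,
)


def normalize_covid_mentions(text):
    """Replace every known alias with the single token 'covid19'."""
    return COVID_ALIASES.sub("covid19", text)
```

Because the pattern is anchored on word boundaries, a plural such as "coronaviruses" is left untouched, which keeps mentions of the wider virus family separate from the disease token.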

However, this incurred a dependency in the dataset: whenever a symptom was associated with the disease, the occurrence of the word ‘positive’ or something similar was essential. Additionally, a single word is often insufficient to describe a symptom. Saying that a patient has a high temperature, rather than just a temperature, gives much more value to the input. Including descriptors such as these could prove useful in our tool.

Our next goal was therefore to create phrase embeddings rather than word embeddings. Our first approach was to create bi-grams from our untagged corpus and use them as input to Word2Vec.
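One simple way to build such bi-grams is to join adjacent token pairs that occur often enough into a single token before training. The sketch below uses a plain frequency threshold and a hypothetical helper name; it is not necessarily the method used in this project, and Gensim's `Phrases` model offers a scored alternative.

```python
from collections import Counter


def merge_frequent_bigrams(sentences, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times into one token."""
    pair_counts = Counter(
        pair for tokens in sentences for pair in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_sentences = []
    for tokens in sentences:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(tokens[i] + "_" + tokens[i + 1])
                i += 2  # consume both tokens of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_sentences.append(merged)
    return merged_sentences
```

The merged sentences can be passed to Word2Vec in place of the original token lists, so that a phrase such as `high_temperature` receives its own vector.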
