One month project for an MLHC submission of domain-injected word embeddings

AWE-CM

Augmented Word Embeddings with a Clinical Metathesaurus

Writeup

https://www.overleaf.com/11188799yhjcmxphgpjm#/42196701/

Sample Data

A small sample corpus, derived from public discharge summaries, is included to demonstrate how to build and evaluate the vectors.

Usage

Pretrained word embeddings from Google-News (baseline)

NOTE: This code requires Python 2.

Build word2vec and word2vecf tools

cd resources/word2vec ; make ; cd ../../

cd resources/word2vecf ; make ; cd ../../

Download the UMLS tables and place them in code/word2vecf/umls/umls_tables/ (see the listing below). The code looks for them in this directory automatically.

wboag@gray:/scratch/wboag/awecm$ ls code/word2vecf/umls/umls_tables/
.gitignore   LRABR        MRCONSO.RRF  MRREL.RRF    MRSTY.RRF
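Before running the CUI-based steps, it can help to confirm that the tables are in place. The following is a minimal sketch and not part of the repository: the helper name `missing_umls_tables` is hypothetical, and the list of required files is simply copied from the directory listing above.

```python
import os

# File names copied from the directory listing above (assumed to be the full set).
REQUIRED_TABLES = ["LRABR", "MRCONSO.RRF", "MRREL.RRF", "MRSTY.RRF"]


def missing_umls_tables(umls_dir):
    """Return the required table names that are absent from umls_dir."""
    return [name for name in REQUIRED_TABLES
            if not os.path.isfile(os.path.join(umls_dir, name))]
```

Calling `missing_umls_tables("code/word2vecf/umls/umls_tables")` from the repository root should return an empty list once all four tables are installed.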

Build corpus (starting from the sample data above)

python code/corpus/build_fake_corpus.py > data/txt/fake_discharge.txt

Build official word2vec vectors

resources/word2vec/word2vec -train data/txt/fake_discharge.txt -output data/vectors/w2v_fake_discharge.vec -size 300 -window 8 -sample 1e-4 -hs 0 -negative 8 -threads 12 -iter 5 -min-count 5 -alpha 0.025 -binary 0 -cbow 0
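Because the command above uses -binary 0, the output is a plain-text file: a header line with the vocabulary size and dimension, then one line per word. The sketch below shows one way to read such a file for inspection. The function name `load_text_vectors` is hypothetical and is not taken from the repository.

```python
import io


def load_text_vectors(path):
    """Load vectors written with -binary 0 into a dict of word -> list of floats."""
    vectors = {}
    with io.open(path, encoding="utf-8") as f:
        # Header line: "<vocab_size> <dimension>"
        vocab_size, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

For example, `load_text_vectors("data/vectors/w2v_fake_discharge.vec")` should return 300-dimensional vectors. With -min-count 5, words seen fewer than five times in the corpus will not appear in the file.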

Build word vectors with word2vecf (in theory, equivalent to the official word2vec)

python code/word2vecf/generate_context_windows.py data/txt/fake_discharge.txt 8 data/contexts/fake_discharge_word.contexts data/vocabs/fake_discharge_word.w data/vocabs/fake_discharge_word.c --word

resources/word2vecf/word2vecf -train data/contexts/fake_discharge_word.contexts -output data/vectors/w2vf_fake_discharge_word.vec -size 300 -sample 0 -hs 0 -negative 8 -threads 12 -iters 5 -alpha 0.025 -binary 0 -wvocab data/vocabs/fake_discharge_word.w -cvocab data/vocabs/fake_discharge_word.c
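word2vecf trains on explicit word-context pairs, one `word context` pair per line, rather than on raw text. The sketch below illustrates what a symmetric window of 8 could produce for the --word setting. It is an assumption about what generate_context_windows.py does, not a copy of it, and the function names are hypothetical. The counters are assumed to correspond to the `token count` entries in the -wvocab and -cvocab files.

```python
from collections import Counter


def context_pairs(tokens, window):
    """Yield (word, context) for every context within `window` positions."""
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield word, tokens[j]


def pairs_and_counts(lines, window=8):
    """Collect pairs plus word and context frequencies over whitespace-tokenized lines."""
    pairs = []
    word_counts, context_counts = Counter(), Counter()
    for line in lines:
        for word, ctx in context_pairs(line.split(), window):
            pairs.append((word, ctx))
            word_counts[word] += 1
            context_counts[ctx] += 1
    return pairs, word_counts, context_counts
```

Note that the official word2vec tool shrinks the window by a random amount at each position, which this fixed-window sketch does not do. That is one reason the two sets of vectors may match only in theory.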

Build CUI-enhanced AWE-CM vectors with word2vecf (CUI only; requires UMLS tables)

python code/word2vecf/generate_context_windows.py data/txt/fake_discharge.txt 8 data/contexts/fake_discharge_cui.contexts data/vocabs/fake_discharge_cui.w data/vocabs/fake_discharge_cui.c --cui

resources/word2vecf/word2vecf -train data/contexts/fake_discharge_cui.contexts -output data/vectors/w2vf_fake_discharge_cui.vec -size 300 -sample 0 -hs 0 -negative 8 -threads 12 -iters 5 -alpha 0.025 -binary 0 -wvocab data/vocabs/fake_discharge_cui.w -cvocab data/vocabs/fake_discharge_cui.c
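One plausible reading of the --cui option is that each word is also paired with the UMLS concept identifiers (CUIs) it names, so that concepts act as extra contexts. The sketch below illustrates that idea under stated assumptions: it matches single lowercased tokens against the term strings in MRCONSO.RRF (pipe-delimited, with the CUI in the first field, the language in the second, and the term string in the fifteenth). How the repository actually matches text to concepts is not documented here, and the function names are hypothetical.

```python
def load_term_to_cuis(mrconso_lines, language="ENG"):
    """Map each lowercased term string to the set of CUIs it names."""
    term_to_cuis = {}
    for line in mrconso_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 15:
            continue  # not a full MRCONSO row
        cui, lat, term = fields[0], fields[1], fields[14]
        if lat != language:
            continue
        term_to_cuis.setdefault(term.lower(), set()).add(cui)
    return term_to_cuis


def cui_pairs(tokens, term_to_cuis):
    """Yield (word, CUI) pairs for single tokens that match a UMLS term (assumed scheme)."""
    for word in tokens:
        for cui in sorted(term_to_cuis.get(word.lower(), ())):
            yield word, cui
```

Pairs produced this way could be appended to the ordinary word-context pairs before training, since word2vecf treats any token in the context column as a context.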

Build CUI_REL-enhanced AWE-CM vectors with word2vecf (CUI + related CUIs; requires UMLS tables)

python code/word2vecf/generate_context_windows.py data/txt/fake_discharge.txt 8 data/contexts/fake_discharge_cui_rel.contexts data/vocabs/fake_discharge_cui_rel.w data/vocabs/fake_discharge_cui_rel.c --cui_rel

resources/word2vecf/word2vecf -train data/contexts/fake_discharge_cui_rel.contexts -output data/vectors/w2vf_fake_discharge_cui_rel.vec -size 300 -sample 0 -hs 0 -negative 8 -threads 12 -iters 5 -alpha 0.025 -binary 0 -wvocab data/vocabs/fake_discharge_cui_rel.w -cvocab data/vocabs/fake_discharge_cui_rel.c
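The --cui_rel option presumably extends the CUI contexts with concepts related to them. A minimal sketch of that expansion is shown below, assuming relations are read from MRREL.RRF (pipe-delimited, with the first concept in the first field and the second concept in the fifth). Which relation types the repository keeps is not stated, so this sketch keeps all of them. The function names are hypothetical.

```python
def load_related_cuis(mrrel_lines):
    """Map each CUI to the set of CUIs it is related to in MRREL rows."""
    related = {}
    for line in mrrel_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 5:
            continue  # not a full MRREL row
        cui1, cui2 = fields[0], fields[4]
        if cui1 != cui2:  # ignore relations within one concept
            related.setdefault(cui1, set()).add(cui2)
    return related


def expand_with_related(cuis, related):
    """Return the input CUIs plus every CUI directly related to one of them."""
    expanded = set(cuis)
    for cui in cuis:
        expanded |= related.get(cui, set())
    return expanded
```

Only direct (one-hop) relations are followed here. Following longer chains would grow the context set quickly and is left out of the sketch.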

Evaluating word vectors with SRS (correlation with experts)

python code/eval/srs/srs_eval.py data/vectors/w2vf_fake_discharge_word.vec
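The idea behind this evaluation can be sketched as follows: score each term pair by the cosine similarity of its vectors, then correlate those scores with the expert ratings. The sketch below assumes Spearman rank correlation and a list of `(term1, term2, expert_score)` tuples. Neither detail is confirmed by srs_eval.py, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """Average ranks (1-based), so ties share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation (0.0 if either input has no variance)."""
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, rated_pairs):
    """Return (correlation, number of pairs used); out-of-vocabulary pairs are skipped."""
    model, expert = [], []
    for t1, t2, score in rated_pairs:
        if t1 in vectors and t2 in vectors:
            model.append(cosine(vectors[t1], vectors[t2]))
            expert.append(score)
    return spearman(model, expert), len(model)
```

Reporting the number of pairs actually used matters on a small corpus, because many rated terms may be missing from the vocabulary and the correlation is then computed on only a fraction of the benchmark.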