One-month project for an MLHC submission on domain-injected word embeddings


Augmented Word Embeddings with a Clinical Metathesaurus


Sample Data

A small text sample was created from public discharge summaries to demonstrate how to build and evaluate the vectors.


Pretrained word embeddings from Google-News (baseline)

NOTE: This code requires Python 2.

Build word2vec and word2vecf tools

cd resources/word2vec ; make ; cd ../../

cd resources/word2vecf ; make ; cd ../../

Download the UMLS tables and place them in code/word2vecf/umls/umls_tables/. The code automatically checks this directory.

wboag@gray:/scratch/wboag/awecm$ ls code/word2vecf/umls/umls_tables/

Build corpus (starting from the sample data above)

python code/corpus/ > data/txt/fake_discharge.txt

Build official word2vec vectors

resources/word2vec/word2vec -train data/txt/fake_discharge.txt -output data/vectors/w2v_fake_discharge.vec -size 300 -window 8 -sample 1e-4 -hs 0 -negative 8 -threads 12 -iter 5 -min-count 5 -alpha 0.025 -binary 0 -cbow 0
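With -binary 0, word2vec writes the vectors as plain text: a header line giving the vocabulary size and dimensionality, followed by one line per word with its components. A minimal sketch for reading such a file is shown below. The function name load_text_vectors is hypothetical and not part of this repository, and the sketch is written to run under both Python 2 and Python 3.

```python
import io


def load_text_vectors(path):
    """Read a word2vec text-format file (-binary 0).

    Returns (vectors, dim), where vectors maps each word to a list of floats.
    Hypothetical helper for illustration; not part of the repository.
    """
    vectors = {}
    with io.open(path, encoding="utf-8") as f:
        # Header line: "<vocab_size> <dim>"
        header = f.readline().split()
        dim = int(header[1])
        for line in f:
            # Each line: word followed by dim space-separated floats.
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim
```

Called on data/vectors/w2v_fake_discharge.vec, this would return a dict whose values each hold 300 floats, given the -size 300 setting above.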

Build word vectors with word2vecf (in theory, equivalent to official word2vec)

python code/word2vecf/ data/txt/fake_discharge.txt 8 data/contexts/fake_discharge_word.contexts data/vocabs/fake_discharge_word.w data/vocabs/fake_discharge_word.c --word

resources/word2vecf/word2vecf -train data/contexts/fake_discharge_word.contexts -output data/vectors/w2vf_fake_discharge_word.vec -size 300 -sample 0 -hs 0 -negative 8 -threads 12 -iters 5 -alpha 0.025 -binary 0 -wvocab data/vocabs/fake_discharge_word.w -cvocab data/vocabs/fake_discharge_word.c
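word2vecf trains on a file of explicit word-context pairs, one "word context" pair per line. The sketch below shows one way such pairs could be generated from a symmetric window; it is an illustration under assumptions, not the repository's extraction script. The function names window_pairs and write_contexts are hypothetical, the window is fixed (official word2vec randomly shrinks the window for each token), and no frequency filtering is applied.

```python
def window_pairs(tokens, window):
    """Yield (word, context) for every context within `window` positions."""
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield word, tokens[j]


def write_contexts(lines, window, out):
    """Write one 'word context' pair per line to the file-like object `out`."""
    for line in lines:
        for word, ctx in window_pairs(line.split(), window):
            out.write(u"%s %s\n" % (word, ctx))
```

With window set to 8, as in the commands above, each token would be paired with up to 16 neighbours from the same line of text.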

Build CUI-enhanced AWE-CM vectors with word2vecf (CUI only; requires UMLS tables)

python code/word2vecf/ data/txt/fake_discharge.txt 8 data/contexts/fake_discharge_cui.contexts data/vocabs/fake_discharge_cui.w data/vocabs/fake_discharge_cui.c --cui

resources/word2vecf/word2vecf -train data/contexts/fake_discharge_cui.contexts -output data/vectors/w2vf_fake_discharge_cui.vec -size 300 -sample 0 -hs 0 -negative 8 -threads 12 -iters 5 -alpha 0.025 -binary 0 -wvocab data/vocabs/fake_discharge_cui.w -cvocab data/vocabs/fake_discharge_cui.c

Build CUI_REL-enhanced AWE-CM vectors with word2vecf (CUI + related CUIs; requires UMLS tables)

python code/word2vecf/ data/txt/fake_discharge.txt 8 data/contexts/fake_discharge_cui_rel.contexts data/vocabs/fake_discharge_cui_rel.w data/vocabs/fake_discharge_cui_rel.c --cui_rel

resources/word2vecf/word2vecf -train data/contexts/fake_discharge_cui_rel.contexts -output data/vectors/w2vf_fake_discharge_cui_rel.vec -size 300 -sample 0 -hs 0 -negative 8 -threads 12 -iters 5 -alpha 0.025 -binary 0 -wvocab data/vocabs/fake_discharge_cui_rel.w -cvocab data/vocabs/fake_discharge_cui_rel.c

Evaluating word vectors with SRS (correlation with experts)

python code/eval/srs/ data/vectors/w2vf_fake_discharge_word.vec
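To illustrate what "correlation with experts" can mean, the sketch below scores each word pair by the cosine similarity of its vectors and correlates those scores with expert ratings. It assumes Spearman rank correlation and skips pairs containing out-of-vocabulary words; the repository's evaluation script may differ on both points. All function names and the rated_pairs format are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, rated_pairs):
    """rated_pairs: list of (word1, word2, expert_score).

    Returns (correlation, number of pairs actually scored).
    """
    model, gold = [], []
    for w1, w2, score in rated_pairs:
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            model.append(cosine(vectors[w1], vectors[w2]))
            gold.append(score)
    return spearman(model, gold), len(gold)
```

Reporting the number of scored pairs alongside the correlation matters on a small corpus, where many rated words may be missing from the vocabulary.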