## Explore clinical embeddings

This example explores how to convert text into numerical vectors using clinical embeddings, and how to look for matches between terms.

The embeddings are from the [Weissman lab](https://github.com/weissman-lab/clinical_embeddings).

I downloaded the 100-dimension model trained on open access case reports only from [https://github.com/weissman-lab/clinical_embeddings?tab=readme-ov-file](https://github.com/weissman-lab/clinical_embeddings?tab=readme-ov-file). I saved it via the file manager on the UCLH remote desktop, unzipped the tar.gz to a tar, extracted the tar to folders, and then uploaded the folders into the Jupyter environment.
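The two manual steps (tar.gz to tar, then tar to folders) could also be done in one call from Python. This is a minimal sketch using the standard library's `tarfile` module. The function name `extract_archive` and the archive and destination paths in the usage comment are hypothetical, not taken from the repository:

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Decompress and unpack a .tar.gz archive into dest_dir in one step."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" reads the gzip layer and the tar layer together
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only extract archives that come from a source you trust
        tar.extractall(dest)
    # Return the names of the top-level entries that were unpacked
    return sorted(p.name for p in dest.iterdir())


# Hypothetical usage; substitute the real archive name and target folder:
# extract_archive("W2V_100.tar.gz", "data_store/clinical_embeddings")
```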

In [1]:
# Reload functions every time
%load_ext autoreload 
%autoreload 2

In [2]:
# Load libraries
import sys
import os
from pathlib import Path


# Import the variables that have been set in the init.py file in the root directory
# These include a constant called PROJECT_ROOT, which stores the absolute path to that directory
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
import init
PROJECT_ROOT = os.getenv("PROJECT_ROOT")

# Add the src folder to sys path, so that the application knows to look there for libraries
sys.path.append(str(Path(PROJECT_ROOT) / 'src'))

In [4]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.5 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-7.0.1-py3-none-any.whl (60 kB)
Collecting wrapt
  Downloading wrapt-1.16.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (80 kB)
Installing collected packages: wrapt, smart-open, gensim
Successfully installed gensim-4.3.2 smart-open-7.0.1 wrapt-1.16.0


In [5]:
from gensim.models import FastText, Word2Vec, KeyedVectors # KeyedVectors are used to load the GloVe models


In [7]:
embeddings_path = Path(PROJECT_ROOT) / 'data_store/clinical_embeddings/W2V_100'
os.listdir(str(embeddings_path))


['w2v_OA_CR_100d.bin.wv.vectors.npy',
 'w2v_OA_CR_100d.bin',
 'w2v_OA_CR_100d.bin.trainables.syn1neg.npy']

In [10]:
# Load the model (the .npy files listed above must stay in the same folder as the .bin file)
# model = Word2Vec.load(str(embeddings_path / 'w2v_oa_all_100d.bin')) # for Open Access All Manuscripts
model = Word2Vec.load(str(embeddings_path / 'w2v_OA_CR_100d.bin')) # for Case Reports only



In [11]:
# Return 100-dimensional vector representations of each word
model.wv.get_vector('diabetes')
model.wv.get_vector('cardiac_arrest')
model.wv.get_vector('lymphangioleiomyomatosis')

# Try out cosine similarity (the notebook displays only the value of the last expression)
model.wv.similarity('copd', 'chronic_obstructive_pulmonary_disease')
model.wv.similarity('myocardial_infarction', 'heart_attack')
model.wv.similarity('lymphangioleiomyomatosis', 'lam')

0.75374293
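`similarity` returns the cosine similarity of the two word vectors: their dot product divided by the product of their lengths. Below is a minimal sketch in plain Python with made-up 3-dimensional vectors (the real vectors have 100 dimensions). The function name `cosine_similarity` is illustrative only:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, not taken from the model
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of vector length.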

In [12]:
print(model.wv.similarity('vodka', 'drinking'))
print(model.wv.similarity('vodka', 'myocardial_infarction'))


0.8092604
0.2948834


In [32]:
model.wv.similarity('myocardial_infarction', 'heart_attack')


0.7921261

In [33]:
model.wv.similarity('copd', 'chronic_obstructive_pulmonary_disease')


0.86180943

In [55]:
model.wv.similarity('discharge', 'tta')


0.12271345

## Load notes from Jon project

In [47]:
import pandas as pd
import numpy as np
notes = pd.read_csv('/home/jovyan/work/zella/zbeds/explore/discharges/data-raw/discharges_jon_2023-02-17.csv')

In [43]:
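def get_valid_vectors(df, model):
    """Return the unique words in df['note'] that have a vector in the model."""
    valid_vectors = set()
    for text in df['note']:
        # Whitespace tokenisation; case and punctuation are left untouched
        unique_words = set(text.split())
        for word in unique_words:
            try:
                model.wv.get_vector(word)
                valid_vectors.add(word)
            except KeyError:
                # The word is not in the model's vocabulary
                continue
    return list(valid_vectors)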
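`text.split()` leaves capital letters and attached punctuation in place, so a token such as `Home,` would not match a vocabulary entry `home`. The sketch below lowercases the text and strips punctuation first. It assumes the vocabulary is lowercase with underscores joining multi-word terms, as the terms queried earlier in this notebook suggest. `tokenise` is a hypothetical helper, not part of gensim:

```python
import re


def tokenise(text):
    """Lowercase the text and keep runs of letters, digits and underscores."""
    return re.findall(r"[a-z0-9_]+", text.lower())


# Made-up sentence for illustration
print(tokenise("Discharged home, via Taxi."))
```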

In [45]:
found_words = get_valid_vectors(notes, model)


In [56]:
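# np.array stores each (word, score) pair as strings, so the scores are cast back to float for sorting
similarities = np.array([(word, model.wv.similarity('home', word)) for word in found_words])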
similarities[np.argsort(similarities[:, 1].astype(float))[::-1][:10]]

array([['home', '1.0'],
       ['apartment', '0.7666387'],
       ['supermarket', '0.7521133'],
       ['taxi', '0.744518'],
       ['hired', '0.7438201'],
       ['attendance', '0.7428458'],
       ['hotel', '0.74082637'],
       ['rehab', '0.7370461'],
       ['midwife', '0.73629725'],
       ['daycare', '0.73576784']], dtype='<U32')
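The `dtype='<U32'` in the output shows that the scores are held as strings. An alternative sketch keeps them as floats and ranks them with `heapq.nlargest`. The names `top_matches` and `toy_scores` are hypothetical, and the invented toy scores stand in for `model.wv.similarity` so that the block runs without the model:

```python
import heapq


def top_matches(query, words, similarity, n=10):
    """Return the n (word, score) pairs most similar to query, best first."""
    scored = ((word, float(similarity(query, word))) for word in words)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Toy stand-in for model.wv.similarity; the scores are invented
toy_scores = {"apartment": 0.9, "taxi": 0.7, "flatus": 0.1}
best = top_matches("home", toy_scores, lambda q, w: toy_scores[w], n=2)
print(best)
```

With the real model, the equivalent call would be `top_matches('home', found_words, model.wv.similarity)`.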

In [58]:
similarities = np.array([(word, model.wv.similarity('discharge', word)) for word in found_words])
similarities[np.argsort(similarities[:, 1].astype(float))[::-1][:10]]


array([['discharge', '1.0'],
       ['admission', '0.6673033'],
       ['discharging', '0.6502145'],
       ['admittance', '0.6492708'],
       ['flatus', '0.64832884'],
       ['foul', '0.6476266'],
       ['defecating', '0.6340597'],
       ['oozing', '0.6292507'],
       ['complaints', '0.6285596'],
       ['urinated', '0.6284532']], dtype='<U32')
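The cells above compare single words. To represent a whole note as one vector, one common approach (an assumption here, not something this notebook does) is to average the vectors of the words found in the vocabulary. Below is a minimal sketch with a plain dictionary of invented 2-dimensional vectors standing in for `model.wv`. `mean_vector` is a hypothetical helper:

```python
def mean_vector(words, vectors):
    """Average the vectors of the words present in `vectors`; None if none are."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return None
    size = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(size)]


# Invented 2-dimensional vectors; the real model uses 100 dimensions
toy_vectors = {"discharge": [1.0, 0.0], "home": [0.0, 1.0]}
note_vector = mean_vector(["discharge", "home", "unknownword"], toy_vectors)
print(note_vector)
```

Words without a vector are skipped, in the same way that `get_valid_vectors` skips them.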