# CUI2Vec Embeddings

CUI2Vec is an embedding strategy for multi-modal medical data that produces embeddings for 108,477 medical concepts. The model was trained on an insurance claims database of 60 million members, 20 million clinical notes, and 1.7 million full-text biomedical journal articles. Although the embeddings file covers several kinds of billing codes, only ICD-9 codes are relevant to our project, so we keep only the embeddings whose names begin with `IDX`, as seen in the code below.

### Source

CUI2Vec Paper: https://psb.stanford.edu/psb-online/proceedings/psb20/Beam.pdf

Downloadable embeddings file: https://github.com/clinicalml/embeddings/blob/master/claims_codes_hs_300.txt.gz

### Data

The data is stored in a `txt` file. Each line contains three key elements. 

1. The first element is the name of the code being embedded. For example, an ICD-9 code may look like `IDX_401.1`. Other types of codes look different, such as starting with 'N' followed by numerals. For this project we only care about ICD-9 codes, so we filter for rows where the first token begins with `IDX`. The rest of the token specifies the actual code, which in this case is `401.1`. The first three characters matter most because they are usually enough to describe the general diagnosis; anything after the period only adds specificity. In practice, we see both the shortened version (`401`) and the extended version (`4011`) used, and the period is omitted. To account for this variable usage in our training dataset, we save embeddings under both versions of the code, the shortened and the extended.

2. After the code name, the second element is a series of 300 floating-point numbers that make up the actual embedding. We read these numbers into a numpy array.

3. The final element in the line is a line-terminating character `\n`.
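
The line layout described above can be illustrated with a minimal sketch. The helper name `parse_embedding_line` and the three-value example line are hypothetical (real lines carry 300 values), and the sketch assumes whitespace-separated tokens. It calls `str.split()` with no arguments so that the trailing newline is discarded whether or not a space precedes it.

```python
import numpy as np

def parse_embedding_line(line):
    """Split one line of the embeddings file into (code_name, vector)."""
    tokens = line.split()  # splits on any whitespace and drops the trailing '\n'
    code_name = tokens[0]
    # Every token after the code name is one component of the embedding.
    vector = np.array([float(t) for t in tokens[1:]])
    return code_name, vector

# Hypothetical shortened line for illustration only.
example_line = "IDX_401.1 0.12 -0.5 0.33 \n"
code_name, vector = parse_embedding_line(example_line)
print(code_name, vector.shape)
```

On a real line from the file, `vector.shape` would be expected to be `(300,)`.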

Our approach for loading the CUI2Vec embeddings follows three simple steps:
1. Open the file containing embeddings and filter for rows that begin with `IDX`.
2. Parse each code into a shortened and an extended version, if applicable, and store the embedding in a numpy array. In the code below, the ICD-9 code is first mapped to its ICD-10 equivalent with `convertICD9toICD10`, and rows with no match are skipped.
3. Save the embedding in a dictionary under the extended code. If the shortened code hasn't been used before, also save the embedding under the shortened code.
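
The keying scheme in these steps can be sketched on a toy example. This is an illustration under assumptions, not the notebook's actual parsing cell: `icd9_key_variants` is a hypothetical helper, the `IDX_` prefix is taken from the comment in the parsing code, the vectors are made up, and the sketch keys by the ICD-9 string directly to keep the example small.

```python
def icd9_key_variants(raw_name, prefix='IDX_'):
    """Return (shortened, extended) keys for a name such as 'IDX_401.1'.

    `extended` is None when the code has nothing beyond its first three characters.
    """
    code = raw_name[len(prefix):].replace('.', '')  # 'IDX_401.1' -> '4011'
    shortened = code[:3]                            # '401'
    extended = code if len(code) > 3 else None
    return shortened, extended

# Made-up two-dimensional vectors, for illustration only.
toy_rows = [('IDX_401.1', [0.1, 0.2]), ('IDX_401.9', [0.3, 0.4]), ('IDX_250', [0.5, 0.6])]

toy_embeddings = {}
for raw_name, vector in toy_rows:
    shortened, extended = icd9_key_variants(raw_name)
    # Keep the first vector seen for a shortened code; later rows do not overwrite it.
    toy_embeddings.setdefault(shortened, vector)
    if extended is not None:
        toy_embeddings[extended] = vector

print(sorted(toy_embeddings))
```

One consequence of this design is that the shortened key inherits the vector of whichever extended code appears first in the file, so file order affects the result.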

In [2]:
import numpy as np
import pickle
import os

### Unzip data:
Before moving on, be sure to run the following in a terminal:

`unzip "Data/BIODS220_ICD_Dx_10_9_v7 - icd_dx_10_9_v7.zip"`

`gzip -d Data/claims_codes_hs_300.txt.gz`
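
If command-line tools are not available, the `.gz` archive can also be decompressed from Python with the standard library. The following is a minimal sketch: `gunzip_file` is a hypothetical helper, and the paths in the commented call are assumptions based on the layout used in this notebook.

```python
import gzip
import shutil

def gunzip_file(src_path, dst_path):
    """Decompress a .gz file to dst_path, streaming so the file is never fully in memory."""
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)

# Example call (paths are assumptions; adjust to where the download was saved):
# gunzip_file('Data/claims_codes_hs_300.txt.gz', 'Data/claims_codes_hs_300.txt')
```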
        

In [5]:
print(os.listdir('.'))

['Confusion_matrix_model_1.png', 'model_w2v.ipynb', 'embeddings.ipynb', 'README.md', '.DS_Store', '__pycache__', 'CUI2Vec_embedding.pickle', 'embeddings_word2vec-google-news-300.pickle', 'add_labels.py', '.ipynb_checkpoints', 'data_X_data_Y.pkl', 'CUI2Vec_embeddings.ipynb', '.gitignore', 'B220_SAA_v1.csv', 'plot_confusion_matrix.py', 'LICENSE', 'project_model.py', 'Biomed-Data-Science-NLP-Project', 'CUI2Vec_embeddings', 'patient_dataframe.pkl', '.git', 'BIODS220_ICD_Dx_10_9_v7 - icd_dx_10_9_v7.csv', 'ICD_Label_Cleaned_Oct_25.csv', 'model_CUI2vec.ipynb']


In [21]:
# !gzip -d ./Data/claims_codes_hs_300.txt.gz

In [6]:
ICD_code_dict_file_path = 'CUI2Vec_embeddings/Data/BIODS220_ICD_Dx_10_9_v7 - icd_dx_10_9_v7.csv'
embeddings_file_path = 'CUI2Vec_embeddings/Data/claims_codes_hs_300.txt' # Download file from link above and change file_path as necessary

In [7]:
ICD_codes = {}  # Maps each ICD-9 code to its ICD-10 code
with open(ICD_code_dict_file_path) as f:
    for row in f.readlines():
        tokens = row.split(',')
        ICD_codes[tokens[1]] = tokens[0]
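
Splitting on `','` can misalign columns if any field in the mapping file contains a quoted comma. As a hedged alternative, the sketch below reads the same two columns with the standard `csv` module. The column order (ICD-10 first, ICD-9 second) is inferred from the cell above, `load_icd9_to_icd10` is a hypothetical helper, and the toy rows use made-up placeholder values rather than real codes.

```python
import csv
import io

def load_icd9_to_icd10(file_obj):
    """Build an ICD-9 -> ICD-10 dict from CSV rows shaped like (icd_10, icd_9, ...)."""
    mapping = {}
    for row in csv.reader(file_obj):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        icd_10, icd_9 = row[0].strip(), row[1].strip()
        mapping[icd_9] = icd_10
    return mapping

# Toy in-memory file; 'X_A' and 'X_B' are placeholders, not real ICD-10 codes.
toy_csv = io.StringIO('X_A,4010,"Description, with a comma"\nX_B,2509,Another description\n')
toy_mapping = load_icd9_to_icd10(toy_csv)
print(toy_mapping)
```

With a real file, the helper would be called as `load_icd9_to_icd10(open(ICD_code_dict_file_path, newline=''))`. As in the cell above, a header row, if present, is loaded like any other row.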

In [8]:
def convertICD9toICD10(icd_9_code):
    """Return the ICD-10 code for an ICD-9 code, or None if no mapping is found."""
    icd_9_code = icd_9_code.replace('.', '')
    # Try the exact code first, then variants ending in '0' or '9',
    # in the same order as the original nested lookups.
    candidates = (
        icd_9_code,
        icd_9_code + '0',
        icd_9_code[:-1] + '0',
        icd_9_code + '9',
        icd_9_code[:-1] + '9',
    )
    for candidate in candidates:
        if candidate in ICD_codes:
            return ICD_codes[candidate]
    return None

In [9]:
# Parse embeddings
embeddings = {}

with open(embeddings_file_path) as f:
    for row in f.readlines():
        tokens = row.split(' ')
        
        # Step 1: Filter for ICD-9 embeddings
        if 'IDX' == tokens[0][:3]: 
            
            # Step 2: Parse input for ICD-9 code and embedding
            embedding = np.array(tokens[1:-1])  
            icd_9_code = tokens[0][4:] # removes IDX_ prefix i.e. IDX_401.1 becomes 401.1
            icd_10_code = convertICD9toICD10(icd_9_code)  
            
            if icd_10_code is None:
                continue
            
            # Step 3: Save embedding to dictionary
            embeddings[icd_10_code] = embedding.astype('float')

            # Save shortened version if it's not already saved
            shortened_code = icd_10_code[:3]
            if shortened_code not in embeddings:
                embeddings[shortened_code] = embedding.astype('float')

In [10]:
# Save embeddings using pickle
filename = 'CUI2Vec_embedding.pickle'
with open(filename, 'wb') as file:
    pickle.dump(embeddings, file, protocol=pickle.HIGHEST_PROTOCOL)

In [11]:
len(embeddings)

10683

In [3]:
filename = 'CUI2Vec_embedding.pickle'
with open(filename, 'rb') as file:
    embeddings = pickle.load(file)

In [16]:
embeddings['E785'].shape


(300,)
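
Because the dictionary holds both full and three-character keys, a lookup at training time can fall back from the full code to its shortened form. The sketch below is one possible way to do that, not part of the original notebook: `get_embedding` is a hypothetical helper, and the toy dictionary uses placeholder vectors rather than the real embeddings.

```python
import numpy as np

def get_embedding(embeddings, code):
    """Look up a code, falling back to its first three characters; None if neither exists."""
    key = code.replace('.', '').upper()  # 'e78.5' -> 'E785'
    if key in embeddings:
        return embeddings[key]
    if key[:3] in embeddings:
        return embeddings[key[:3]]
    return None

# Placeholder vectors for illustration only.
toy_embeddings = {'E78': np.zeros(300), 'E785': np.ones(300)}

full_match = get_embedding(toy_embeddings, 'E78.5')  # found under 'E785'
fallback = get_embedding(toy_embeddings, 'E78.9')    # falls back to 'E78'
missing = get_embedding(toy_embeddings, 'Z99')       # no entry at either level
print(full_match.shape, fallback.shape, missing)
```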