# CUI2Vec Embeddings

CUI2Vec is an embedding strategy for multi-modal data that creates embeddings of 108,477 medical concepts. The model was trained on insurance claims for 60 million members, 20 million clinical notes, and 1.7 million full-text biomedical journal articles. The embeddings file covers several kinds of billing codes, but only ICD-9 codes are relevant to our project, so we keep only the entries whose label begins with `IDX`, as seen in the code below.

### Source

CUI2Vec Paper: https://psb.stanford.edu/psb-online/proceedings/psb20/Beam.pdf

Downloadable embeddings file: https://github.com/clinicalml/embeddings/blob/master/claims_codes_hs_300.txt.gz

### Data

The data is stored in a `txt` file. Each line contains three space-separated elements.

1. The first element is the name of the code being embedded. For example, an ICD-9 code looks like `IDX_401.1`. Other types of codes look different, such as starting with 'N' followed by numerals. For this project we only need ICD-9 codes, so we filter for rows where the first token begins with `IDX`. The rest of the token specifies the actual code, which in this case is `401.1`. The first three characters are important because they are usually enough to describe the general diagnosis; anything after the period only adds specificity. In practice, we see both the shortened version (`401`) and the extended version (`4011`) used; note that the period is dropped. To account for this variable usage in our training dataset, we save embeddings under both versions of the ICD code, shortened and extended.

2. After the ICD-9 code, the second element is a series of 300 floating-point numbers corresponding to the actual embedding. We read these numbers into a numpy array.

3. The final element in the line is a line-terminating character `\n`.
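
To make the line layout concrete, here is a minimal sketch that parses a single line into a code name and a vector. The sample line is made up and shortened to four values (real lines carry 300), and `parse_line` is a hypothetical helper name used only for illustration.

```python
import numpy as np

def parse_line(line):
    """Split one line into (code_name, embedding vector)."""
    # split() with no argument drops the trailing newline and any
    # stray whitespace, leaving only the code name and the numbers.
    tokens = line.split()
    name = tokens[0]
    vector = np.array(tokens[1:], dtype=float)
    return name, vector

# Hypothetical line, shortened to 4 values for illustration.
sample = "IDX_401.1 0.12 -0.05 0.33 0.08 \n"
name, vector = parse_line(sample)
```

Passing `dtype=float` matters here: without it, numpy would keep the tokens as strings.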

### Approach

Our approach for loading the CUI2Vec embeddings follows three simple steps:
1. Open the file containing embeddings and filter for rows that begin with `IDX`.
2. Parse each ICD-9 code into a shortened and, if applicable, an extended version. Store the embedding in a numpy array.
3. Save the embedding in a dictionary. If the shortened code hasn't been used before, save the embedding under the shortened code. If the extended version exists, also save the embedding under the extended code.
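
As a small illustration of step 2, the sketch below derives the dictionary keys for a dotted ICD-9 code. The helper name `code_keys` is hypothetical and is not used elsewhere in this notebook; it only mirrors the key rules described above.

```python
def code_keys(icd_code):
    """Return (extended, shortened) dictionary keys for a dotted ICD-9 code.

    `extended` is None when the code has nothing after its first
    three characters.
    """
    shortened = icd_code[:3]
    if len(icd_code) == 3:
        return None, shortened
    # Drop the period: 401.1 becomes 4011
    return icd_code.replace('.', ''), shortened

keys_full = code_keys("401.1")   # extended and shortened keys
keys_short = code_keys("401")    # shortened key only
```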

In [1]:
import numpy as np
import pickle
import os

In [2]:
file_path = 'Data/claims_codes_hs_300.txt'  # Download the file from the link above, decompress it, and change file_path as necessary
embeddings = {}

with open(file_path) as f:
    for row in f:
        # split() with no argument also discards the trailing newline
        tokens = row.split()
        if not tokens:
            continue  # skip blank lines

        # Step 1: Filter for ICD-9 embeddings
        if tokens[0].startswith('IDX'):

            # Step 2: Parse the ICD-9 code and store the embedding as a float array
            icd_code = tokens[0][4:]  # i.e. IDX_401.1 becomes 401.1
            embedding = np.array(tokens[1:], dtype=float)

            # Step 3: Save embedding to dictionary
            if len(icd_code) == 3:
                embeddings[icd_code] = embedding
            else:
                extended_code = icd_code.replace('.', '')  # i.e. 401.1 becomes 4011
                embeddings[extended_code] = embedding

                # Save shortened version if it's not already saved
                shortened_code = icd_code[:3]
                if shortened_code not in embeddings:
                    embeddings[shortened_code] = embedding

In [3]:
# Save embeddings using pickle
filename = 'CUI2Vec_embedding.pickle'
if not os.path.exists(filename):
    with open(filename, 'wb') as file:
        pickle.dump(embeddings, file, protocol=pickle.HIGHEST_PROTOCOL)
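
Once saved, the dictionary can be restored with `pickle.load` and queried by code. The sketch below is one possible lookup, assuming codes in the training data may arrive with or without a period; the `lookup` helper and the tiny stand-in dictionary are hypothetical, and the round trip is done in memory so the example runs without the real file.

```python
import pickle
import numpy as np

def lookup(embeddings, code):
    """Return the embedding for an ICD-9 code, or None if unknown.

    Tries the code as given (period removed), then falls back to
    its first three characters.
    """
    key = code.replace('.', '')
    if key in embeddings:
        return embeddings[key]
    return embeddings.get(key[:3])

# Tiny stand-in dictionary; real vectors have 300 values.
demo = {"401": np.zeros(3), "4011": np.ones(3)}

# In-memory round trip, equivalent to dumping to and loading from a file.
restored = pickle.loads(pickle.dumps(demo, protocol=pickle.HIGHEST_PROTOCOL))

exact = lookup(restored, "401.1")     # found under the extended key
fallback = lookup(restored, "4019")   # falls back to the shortened key
missing = lookup(restored, "999")     # not present at all
```

Falling back to the three-character prefix is a design choice for this sketch: it trades specificity for coverage when an extended code has no embedding of its own.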