# Data Preprocessing

### Imports: 
1. VO_RXN - All VO concepts with a RxNorm annotation
2. RX_DF_ALL - All *vaccine related* concepts within RxNorm
3. VO_DF_FULL - All VO concepts under Vaccine and Vaccine Component subgroups (includes both with and without RxNorm annotations)
4. VO_DF_APPLY - All VO concepts that do not contain RxNorm annotation

In [1]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import torch
import pickle

In [2]:
VO_RXN = pd.read_csv('VO_RXN.csv')
RX_DF = pd.read_csv('2024_02_05_RXN_Concepts.csv', names=['RXN', 'RX_STR','TTY'])
RX_DF_ALL = pd.read_csv('RXN_Concepts.csv', names=['RXN', 'RX_STR','TTY'])
RX_DF_ALL = RX_DF_ALL.iloc[1:]
RX_DF = RX_DF[~RX_DF['TTY'].isin(['PSN','SY'])]

In [3]:
VO_RXN.columns = ['ID', 'VO_STR', 'RXN']

In [4]:
VO_DF_FULL = pd.read_csv('VO_DF_FULL.csv')

In [5]:
VO_DF_FULL = VO_DF_FULL[['ID','Label','RXN']]
VO_DF_FULL.columns = ['ID', 'VO_STR', 'RXN']
# VO_DF_FULL

In [84]:
VO_RXN.dtypes

ID           object
VO_STR       object
RXN           int64
PAIRED        int64
IDENTICAL     int64
dtype: object

In [86]:
RX_DF_ALL

Unnamed: 0,RXN,RX_STR,TTY,RX_EMB
1,7288,Neisseria meningitidis,IN,"[0.057902053, 0.026175115, -0.06539729, 0.0367..."
2,8080,pertussis vaccine,IN,"[-0.0029333595, 0.09751388, -0.10179457, -0.04..."
3,29501,meningococcal group A polysaccharide,IN,"[0.05271473, -0.068272494, -0.0526025, 0.02013..."
4,29503,meningococcal group C polysaccharide,IN,"[0.02626946, -0.05765597, -0.053973895, 0.0266..."
5,50937,Haemophilus influenzae type b,IN,"[-0.013994071, 0.011716439, -0.09422101, 0.035..."
...,...,...,...,...
2302,2630660,0.5 ML influenza A virus A/Darwin/9/2021 (H3N2...,SBD,"[0.04740048, 0.0060114963, -0.00946633, -0.030..."
2303,2630661,influenza A virus A/Darwin/9/2021 (H3N2) antig...,SBD,"[0.033620417, 0.012553314, -0.011871517, -0.03..."
2304,2630662,influenza A virus A/Darwin/9/2021 (H3N2) antig...,SCD,"[-0.02931, 0.03337675, 0.008451684, 0.00467629..."
2305,2630663,influenza A virus (H1N1) antigen / influenza A...,SBDF,"[-0.025375172, 0.0071961344, -0.021845385, -0...."


### Preprocessing: 
The following data preprocessing steps will be conducted:

1. Generate embeddings for VO concepts in VO_DF_FULL
2. Generate embeddings for RxNorm concepts in RX_DF_ALL
3. Convert embeddings dataframe to a lookup dictionary


**NOTE:** There are some RxNorm concepts that have been remapped and such cannot be retrieved. These will be excluded.

**ToDo:** Update remapped/obsolete concepts to their most recent concept

In [6]:
# Check if GPU is available
if torch.cuda.is_available():
    # Set the device to GPU
    device = torch.device('cuda:3')
    print('Using GPU:', torch.cuda.get_device_name(device))
else:
    device = torch.device('cpu')
    print('GPU not available, using CPU.')

Using GPU: NVIDIA A100-SXM4-80GB


In [7]:
st_model = SentenceTransformer('tavakolih/all-MiniLM-L6-v2-pubmed-full', device = device)

In [8]:
#Sentence Embedding and preprocessing
VO_DF_FULL['VO_EMB'] = VO_DF_FULL['VO_STR'].map(lambda x: st_model.encode(x,device=device))
RX_DF_ALL['RX_EMB'] = RX_DF_ALL['RX_STR'].map(lambda x: st_model.encode(x,device=device))

In [9]:
#Create lookup dictionaries for VO_EMB and RX_EMB
VO_LOOKUP = dict(zip(VO_DF_FULL['ID'], VO_DF_FULL['VO_EMB']))
RX_LOOKUP = dict(zip(RX_DF_ALL['RXN'], RX_DF_ALL['RX_EMB']))

In [10]:
#Create lookup dictionaries for VO and RX Labels
VO_LOOKUP_STR = dict(zip(VO_DF_FULL['ID'], VO_DF_FULL['VO_STR']))
RX_LOOKUP_STR = dict(zip(RX_DF_ALL['RXN'], RX_DF_ALL['RX_STR']))

In [11]:
VO_DF_APPLY = VO_DF_FULL[VO_DF_FULL['RXN'].isna()]

In [12]:
# VO_DF_APPLY['VO_EMB'] = VO_DF_APPLY['VO_STR'].map(lambda x: st_model.encode(x,device=device))
# VO_DF_APPLY
# VO_DF_FULL
# RX_DF_ALL

### Generation of unmapped pairs: 
    To generate unmapped, for each mapped concept pair (A,B), let unmapped pairs be (A,X) where X is all of Rxnorm which is not B.
    
1. VO_RXN_LIST - List of existing VO-Rx pairs
2. VO_LIST - Paired VO concepts
3. VO_APPLY_LIST - Unpaired VO concepts
4. RXN_LIST - All RxNorom concepts related to vaccines

Generation of unmapped pairs will be done as using *product* function of itertools package: 

1. NO_VO_RXN_LIST - Artificial unmapped pairs using paired VO concepts and RxNorm concepts that are NOT in known pairs (VO_RXN_LIST) - **TEST**
2. VO_RXN_APPLY_LIST - Possible apply set of VO concepts that do not have RxNorm annotations. Could contain possible mappings/pairings - **APPLY**
3. Converting lists in 1. and 2. above into dataframes (NO_VO_RXN_DF and VO_RXN_APPLY_DF)

In [13]:
VO_RXN_LIST = list(zip(VO_RXN['ID'], VO_RXN['RXN']))
VO_LIST = list(VO_RXN['ID'])
VO_APPLY_LIST = list(VO_DF_APPLY['ID'])
# RXN_LIST = list(VO_RXN['RXN'])
RXN_LIST = list(RX_DF_ALL['RXN'].astype(int))

In [14]:
from itertools import product
NO_VO_RXN_LIST = list(product(VO_LIST, RXN_LIST))
VO_RXN_APPLY_LIST = list(product(VO_APPLY_LIST, RXN_LIST))
print(len(NO_VO_RXN_LIST))
NO_VO_RXN_LIST = [x for x in NO_VO_RXN_LIST if x not in VO_RXN_LIST]

2596556


In [15]:
len(NO_VO_RXN_LIST)

2595458

In [16]:
output_dir = 'outputs/'

In [17]:
import datetime
def cdt():
    return datetime.datetime.now().strftime("%Y_%m_%d")

In [18]:
# # Pickle NO_VO_RXN_LIST and save with date
# import pickle
# with open(output_dir+'NO_VO_RXN_LIST_'+cdt()+'.pkl', 'wb') as f:
#     pickle.dump(NO_VO_RXN_LIST, f)

In [19]:
#Converting NO_VO_RXN_LIST (list of lists) to a dataframe
NO_VO_RXN_DF = pd.DataFrame(NO_VO_RXN_LIST, columns=['ID', 'RXN'])

In [20]:
#Converting VO_RXN_APPLY_LIST (list of lists) to a dataframe
VO_RXN_APPLY_DF = pd.DataFrame(VO_RXN_APPLY_LIST, columns=['ID', 'RXN'])

In [21]:
# Convert VO_RXN to a dictionary
# VO_RXN_DICT = dict(zip(VO_RXN['ID'], VO_RXN['VO_STR']))

In [22]:
# Convert RX_DF to a dictionary
# RX_DICT = dict(zip(RX_DF['RXN'], RX_DF['RX_STR']))

In [23]:
# # Map values for ID from VO_RXN and RXN from RX_DF
# NO_VO_RXN_DF['VO_STR'] = NO_VO_RXN_DF['ID'].map(VO_RXN_DICT)
# NO_VO_RXN_DF['RX_STR'] = NO_VO_RXN_DF['RXN'].map(RX_DICT)

# # Drop null
# NO_VO_RXN_DF = NO_VO_RXN_DF.dropna()
# NO_VO_RXN_DF

In [24]:
# VO_RXN_DF = pd.merge(VO_RXN, RX_DF, on='RXN')

In [25]:
# VO_RXN_DF

In [26]:
# # Mapping encodings from VO_RXN_DF to NO_VO_RXN_DF
# NO_VO_RXN_DF['VO_EMB'] = NO_VO_RXN_DF['ID'].map(VO_LOOKUP)
# NO_VO_RXN_DF['RX_EMB'] = NO_VO_RXN_DF['RXN'].map(RX_LOOKUP)

In [27]:
# # Mapping encodings for apply DF
# VO_RXN_APPLY_DF['VO_EMB'] = VO_RXN_APPLY_DF['ID'].map(VO_APPLY_LOOKUP)
# VO_RXN_APPLY_DF['RX_EMB'] = VO_RXN_APPLY_DF['RXN'].map(RX_LOOKUP)
# VO_RXN_APPLY_DF

In [28]:
# r = RX_LOOKUP[798232]
# v = VO_LOOKUP['VO_0003346']
# np.array_equal(r,v)

In [81]:
# Function to check if embeddings are identical
def check_identical(row):
    vo_emb = VO_LOOKUP.get(row['ID'])
    rx_emb = RX_LOOKUP.get(row['RXN'].astype(str))

    if vo_emb is not None and rx_emb is not None and np.array_equal(vo_emb, rx_emb):
        return 1
    else:
        return 0

In [78]:
VO_RXN[VO_RXN['ID']=='VO_0000063']['RXN']

477    1659730
Name: RXN, dtype: int64

In [80]:
NO_VO_RXN_DF[NO_VO_RXN_DF['ID']=='VO_0000063']['RXN']

1099498       7288
1099499       8080
1099500      29501
1099501      29503
1099502      50937
            ...   
1101798    2630660
1101799    2630661
1101800    2630662
1101801    2630663
1101802    2630664
Name: RXN, Length: 2305, dtype: int64

In [82]:
VO_RXN = VO_RXN.loc[VO_RXN['RXN'].astype(str).isin(RX_LOOKUP.keys())]

In [83]:
# VO_RXN_DF['EMB'] = VO_RXN_DF.apply(lambda row: np.concatenate((row['VO_EMB'], row['RX_EMB'])), axis=1)
VO_RXN.loc[:,'PAIRED'] = 1
VO_RXN.loc[:,'IDENTICAL'] = VO_RXN.apply(check_identical, axis=1)
# VO_RXN_DF['C_PAIR'] = VO_RXN_DF[['ID', 'RXN']].values.astype(str).tolist()

AttributeError: 'int' object has no attribute 'astype'

In [32]:
# VO_RXN_DF['EMB'] = VO_RXN_DF.apply(lambda row: np.concatenate((row['VO_EMB'], row['RX_EMB'])), axis=1)
# VO_RXN_DF['PAIRED'] = 1
# VO_RXN_DF['IDENTICAL'] = VO_RXN_DF.apply(lambda row: np.where(np.all(row['VO_EMB']==row['RX_EMB']),1,0),axis=1)
# # VO_RXN_DF['C_PAIR'] = VO_RXN_DF[['ID', 'RXN']].values.astype(str).tolist()

In [33]:
VO_RXN[['PAIRED','IDENTICAL']].astype(str).apply(pd.Series.value_counts)

Unnamed: 0,PAIRED,IDENTICAL
0,,1098.0
1,1098.0,


In [34]:
# Removing identical concept pairs from VO_RXN_DF [TRAIN]
VO_RXN_DF= VO_RXN[VO_RXN['IDENTICAL']==0]

In [35]:
# [x for x in VO_RXN_DF['RXN'].astype(str) if x not in RX_LOOKUP.keys()] #Obsolete concepts : Todo
VO_RXN_DF = VO_RXN_DF[VO_RXN_DF['RXN'].astype(str).isin(RX_LOOKUP.keys())]

In [36]:
# NO_VO_RXN_DF = NO_VO_RXN_DF.dropna()
# NO_VO_RXN_DF['EMB'] = NO_VO_RXN_DF.apply(lambda row: np.concatenate((row['VO_EMB'], row['RX_EMB'])), axis=1)
NO_VO_RXN_DF['PAIRED'] = 0
NO_VO_RXN_DF['IDENTICAL'] = NO_VO_RXN_DF.apply(check_identical, axis=1)
NO_VO_RXN_DF = NO_VO_RXN_DF[NO_VO_RXN_DF['IDENTICAL']==0]

In [37]:
# NO_VO_RXN_DF = NO_VO_RXN_DF.dropna()
# NO_VO_RXN_DF['EMB'] = NO_VO_RXN_DF.apply(lambda row: np.concatenate((row['VO_EMB'], row['RX_EMB'])), axis=1)
# NO_VO_RXN_DF['PAIRED'] = 0
# NO_VO_RXN_DF['IDENTICAL'] = NO_VO_RXN_DF.apply(lambda row: np.where(np.all(row['VO_EMB']==row['RX_EMB']),1,0),axis=1)

In [38]:
# pickle VO_RXN_DF and NO_VO_RXN_DF
with open(output_dir+'VO_RXN_DF_'+cdt()+'.pkl', 'wb') as f:
    pickle.dump(VO_RXN_DF, f)

In [39]:
with open(output_dir+'NO_VO_RXN_DF_'+cdt()+'.pkl', 'wb') as f:
    pickle.dump(NO_VO_RXN_DF, f)

In [40]:
with open(output_dir+'VO_RXN_DF_'+cdt()+'.pkl', 'rb') as f:
  VO_RXN_DF = pickle.load(f)

In [41]:
with open(output_dir+'NO_VO_RXN_DF_'+cdt()+'.pkl', 'rb') as f:
  NO_VO_RXN_DF = pickle.load(f)

In [42]:
VO_RXN_DF = VO_RXN_DF[['ID','RXN','PAIRED']].reset_index(drop=True)
NO_VO_RXN_DF = NO_VO_RXN_DF[['ID','RXN','PAIRED']].reset_index(drop=True)

In [43]:
# VO_RXN_DF = VO_RXN_DF[['ID','RXN','EMB','PAIRED','IDENTICAL']].reset_index()
# NO_VO_RXN_DF = NO_VO_RXN_DF[['ID','RXN','EMB','PAIRED','IDENTICAL']].reset_index()

In [44]:
# FULL_VO_RXN_DF = pd.concat([VO_RXN_DF, NO_VO_RXN_DF], ignore_index=True)

In [45]:
# FULL_VO_RXN_DF.index

In [46]:
# FULL_VO_RXN_DF

In [47]:
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torch.nn import functional as F
import torch.optim as optim
from sklearn.model_selection import train_test_split

In [48]:
train_data, test_data_1 = train_test_split(VO_RXN_DF, test_size=0.3, random_state=42)

In [49]:
# train_data, test_data_1 = train_test_split(VO_RXN_DF, test_size=0.3, random_state=42, stratify=VO_RXN_DF['IDENTICAL'])

In [50]:
# train_data['IDENTICAL'].value_counts()

In [51]:
# test_data_1['IDENTICAL'].value_counts()

In [52]:
# test_data_2 = NO_VO_RXN_DF.groupby('ID').apply(lambda x: x.sample(n=200, random_state=42)).reset_index(drop=True)

In [53]:
test_data = pd.concat([test_data_1, NO_VO_RXN_DF], ignore_index=True)

In [54]:
class CustomDataset(Dataset):
    def __init__(self, dataframe, vo_lookup, rx_lookup, device, transform=None):
        self.dataframe = dataframe
        self.vo_lookup = vo_lookup
        self.rx_lookup = rx_lookup
        self.device = device
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        sample = self.dataframe.iloc[idx]
        vo_emb = self.vo_lookup.get(sample['ID'])
        rx_emb = self.rx_lookup.get(sample['RXN'].astype(str))     
        concept_1 , concept_2 = sample['ID'], sample['RXN'].astype(str)
        # print(concept_1, concept_2)
        # print(len(vo_emb) , len(rx_emb))
        # features = sample['EMB']
        features = torch.tensor(np.concatenate([vo_emb, rx_emb])).reshape(1, 768).to(self.device)
        label = sample['PAIRED']
        index = idx
        features = torch.tensor(features).reshape(1,768).to(device)
        label = torch.tensor(label).to(device)

        if self.transform:
            features = self.transform(features)

        return features, label, index, concept_1, concept_2

In [55]:
class CustomDataset_OLD(Dataset):
    def __init__(self, dataframe,device, transform=None):
        self.dataframe = dataframe
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        sample = self.dataframe.iloc[idx]
        features = sample['EMB']
        label = sample['PAIRED']
        index = idx
        concept_1 , concept_2 = sample['ID'], sample['RXN'].astype(str)
        features = torch.tensor(features).reshape(1,768).to(device)
        label = torch.tensor(label).to(device)
        # concept_1 = torch.tensor(concept_1)
        # concept_2 = torch.tensor(concept_2)

        if self.transform:
            features = self.transform(features)

        return features, label, index, concept_1, concept_2

In [56]:
test_data.dtypes

ID        object
RXN        int64
PAIRED     int64
dtype: object

In [57]:
# Assuming you have a DataFrame 'train_df' and 'test_df' containing your data
train_dataset = CustomDataset(train_data, VO_LOOKUP, RX_LOOKUP, device)
test_dataset = CustomDataset(test_data, VO_LOOKUP, RX_LOOKUP, device)

# Create DataLoader instances for batching and shuffling
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [58]:
class Autoencoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden_size, input_size),
            nn.ReLU()
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded

In [59]:
ae_model = Autoencoder(input_size=768, hidden_size=256).to(device)

In [60]:
def train_autoencoder(model, train_loader, num_epochs, learning_rate):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loss_list = []

    for epoch in range(num_epochs):
        running_loss = 0.0
        for data in train_loader:
            inputs, _, _, _, _ = data

            optimizer.zero_grad()
            encoded, outputs = model(inputs)
            loss = criterion(outputs, inputs)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            loss_list.append(loss.item())

        # Print the average loss for each epoch
        if epoch % 10 == 9:
          print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss / len(train_loader)}')

    threshold = np.percentile(loss_list, 95)
    print(f'Threshold: {threshold}')
    return threshold, loss_list

In [61]:
threshold, loss_list = train_autoencoder(ae_model, train_loader, num_epochs=100, learning_rate=0.001)

  features = torch.tensor(features).reshape(1,768).to(device)


Epoch [10/100], Loss: 0.0017504478552533935
Epoch [20/100], Loss: 0.0016711698845028877
Epoch [30/100], Loss: 0.0016489127883687615
Epoch [40/100], Loss: 0.0016377352779575933
Epoch [50/100], Loss: 0.001632049330510199
Epoch [60/100], Loss: 0.0016280689160339534
Epoch [70/100], Loss: 0.0016256730596069247
Epoch [80/100], Loss: 0.00162417169000643
Epoch [90/100], Loss: 0.0016225478029809892
Epoch [100/100], Loss: 0.001620050936859722
Threshold: 0.001840218965662643


In [62]:
np.mean(loss_list)

0.0016693101827210436

In [63]:
def evaluate_autoencoder_gold(model, test_loader,threshold):
    model.eval()
    correct = 0
    total_samples = 0
    misidentified = []
    mis_loss = []
    pred = []
    lab = []

    with torch.no_grad():
        criterion = nn.MSELoss(reduction='none')
        for data in test_loader:
            inputs, label, idx, c1, c2 = data
            # print(inputs.shape)
            # print(c_pair)
            label = label.cpu()
            encoded, outputs = model(inputs)
            loss = criterion(outputs, inputs)
            loss_ind = torch.mean(loss, dim=(1, 2)).cpu()
            predicted = torch.from_numpy(np.where(loss_ind > threshold, 0, 1))
            misidentified.extend(idx[predicted!=label].tolist())
            mis_loss.extend(loss_ind[predicted!=label].tolist())
            pred.extend(predicted[predicted!=label].tolist())
            lab.extend(label[predicted!=label].tolist())
            # print(predicted)
            # print(label)
            correct += (predicted == label).sum().item()
            total_samples += label.size(0)
        accuracy = correct / total_samples
        print(f'Accuracy: {accuracy:.4f}')

    return correct, total_samples, misidentified, mis_loss , pred, lab

In [64]:
correct, total_samples, misidentified, miss_loss, pred, lab = evaluate_autoencoder_gold(ae_model, test_loader,threshold=threshold)

  features = torch.tensor(features).reshape(1,768).to(device)


Accuracy: 0.2623


In [65]:
correct, total_samples

(680754, 2595788)

In [66]:
test_data.reset_index().iloc[misidentified]['PAIRED'].value_counts()

PAIRED
0    1915015
1         19
Name: count, dtype: int64

In [67]:
test_data.reset_index().iloc[misidentified]

Unnamed: 0,index,ID,RXN,PAIRED
23,23,VO_0000063,1659730,1
35,35,VO_0004847,1593139,1
45,45,VO_0003081,1593136,1
46,46,VO_0004814,832682,1
61,61,VO_0015051,1184336,1
...,...,...,...,...
2594610,2594610,VO_0004213,1601408,0
2594611,2594611,VO_0004213,1601409,0
2594710,2594710,VO_0004213,1657698,0
2595215,2595215,VO_0004213,2050057,0


Check if in the test set there are correctly predicted "Identical" pairs. Also check the training set for this proportion.
Todo: Add negative samples to both training and testing. Stratify on the type of mapping (identical or not)
Neg >>> Pos

In [68]:
# save model
torch.save(ae_model.state_dict(), output_dir+'model_'+cdt()+'.pt')

In [69]:
missed_dataset = test_data.reset_index().iloc[misidentified]

In [70]:
missed_dataset['LOSS'] = miss_loss
missed_dataset

Unnamed: 0,index,ID,RXN,PAIRED,LOSS
23,23,VO_0000063,1659730,1,0.001983
35,35,VO_0004847,1593139,1,0.001869
45,45,VO_0003081,1593136,1,0.002085
46,46,VO_0004814,832682,1,0.001889
61,61,VO_0015051,1184336,1,0.001911
...,...,...,...,...,...
2594610,2594610,VO_0004213,1601408,0,0.001797
2594611,2594611,VO_0004213,1601409,0,0.001805
2594710,2594710,VO_0004213,1657698,0,0.001835
2595215,2595215,VO_0004213,2050057,0,0.001830


In [71]:
missed_dataset['PAIRED'].value_counts()

PAIRED
0    1915015
1         19
Name: count, dtype: int64

In [72]:
test_data['PAIRED'].value_counts()

PAIRED
0    2595458
1        330
Name: count, dtype: int64

In [73]:
train_data['PAIRED'].value_counts()

PAIRED
1    768
Name: count, dtype: int64

In [74]:
missed_dataset['VO_STR'] = missed_dataset['ID'].map(VO_LOOKUP_STR)
missed_dataset['RX_STR'] = missed_dataset['RXN'].astype(str).map(RX_LOOKUP_STR)

In [75]:
missed_dataset

Unnamed: 0,index,ID,RXN,PAIRED,LOSS,VO_STR,RX_STR
23,23,VO_0000063,1659730,1,0.001983,Imovax Rabies (USA),Imovax
35,35,VO_0004847,1593139,1,0.001869,Trumenba Injectable Product,Trumenba Injectable Product
45,45,VO_0003081,1593136,1,0.002085,Trumenba,Trumenba
46,46,VO_0004814,832682,1,0.001889,Biothrax 0.5 ML Injectable Suspension,Bacillus anthracis strain V770-NP1-R antigens ...
61,61,VO_0015051,1184336,1,0.001911,M-M-R II Injectable Product,M-M-R II Injectable Product
...,...,...,...,...,...,...,...
2594610,2594610,VO_0004213,1601408,0,0.001797,BACILLUS ANTHRACIS STRAIN V770-NP1-R ANTIGENS ...,meningococcal group B vaccine 0.1 MG/ML / Neis...
2594611,2594611,VO_0004213,1601409,0,0.001805,BACILLUS ANTHRACIS STRAIN V770-NP1-R ANTIGENS ...,meningococcal group B vaccine 0.1 MG/ML / Neis...
2594710,2594710,VO_0004213,1657698,0,0.001835,BACILLUS ANTHRACIS STRAIN V770-NP1-R ANTIGENS ...,Haemophilus influenzae b (Ross strain) capsula...
2595215,2595215,VO_0004213,2050057,0,0.001830,BACILLUS ANTHRACIS STRAIN V770-NP1-R ANTIGENS ...,0.5 ML hepatitis B surface antigen vaccine 0.0...


In [76]:
missed_dataset[['ID', 'VO_STR','RXN','RX_STR','PAIRED','IDENTICAL','LOSS']].to_csv(output_dir+'MISSED_PREDS_'+cdt()+'.csv')

KeyError: "['IDENTICAL'] not in index"

### Notes:  
1. Remove identical values from the training set [X]
2. Incorporating unmapped pairs for testing [X]
3. Incorporating unmapped pairs for training + testing 
4. Train on similar but not related unmapped pairs (adacel vs infanrix) 
5. Testing needs to incorporate all vaccine RXNORM concepts (not just the ones present in VO) [X]
6. Testing needs to incorporate all vaccine VO concepts
7. 

In [106]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [108]:
save_path = 'models/llama2-7b-hf/'

In [109]:
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

('models/llama2-7b-hf/tokenizer_config.json',
 'models/llama2-7b-hf/special_tokens_map.json',
 'models/llama2-7b-hf/tokenizer.model',
 'models/llama2-7b-hf/added_tokens.json',
 'models/llama2-7b-hf/tokenizer.json')

In [None]:
jupyter notebook \
    --NotebookApp.allow_origin='https://colab.research.google.com' \
    --port=1234 \
    --NotebookApp.port_retries=0

In [None]:
http://127.0.0.1:1234/tree?token=e2cd224c7db6603aa429853c6fd3f79ba512c56759e51651