Data:
- https://s-baker.net/resource/hoc/

WSD:
1. Database ideas
  - https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-223
  - https://lhncbc.nlm.nih.gov/ii/areas/WSD/collaboration.html#WSD_choices
2. Methods
  - https://academic.oup.com/bioinformatics/article/26/22/2889/228423
  - https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3079-8
  - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5792196/
  - https://pubmed.ncbi.nlm.nih.gov/30811548/
  - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7787358/
  - https://towardsdatascience.com/lstm-by-example-using-tensorflow-feb0c1968537

To do: 
1. Prepare the Hallmarks of Cancer db
2. Find all unique words in abstracts
  - Filter data?
  - Tokenization?
3. Apply Jimeno-Yepes et al. methods for db creation
  - Should I build my own db specific to our documents or just use a pre-existing WSD corpus?
  1. Search metathesaurus to find words with more than 1 meaning
  2. Remove interpretations (CUI) that do not have MeSH indexing
  3. Search for citations that use the CUI of the word in the abstract
4. Filter data 
  - Eliminate under-used CUIs
  - Eliminate ambiguous uses (citations that use two CUIs for one word)
5. Use data set to train WSD
6. How do we implement this with our data?
  - Replace ambiguous words with non-ambiguous words
  - Factor into encoding similar to positional encoding methods in transformers
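
Step 4 of the list above (filtering) could be sketched roughly as follows. This is a minimal plain-Python sketch, assuming each training instance is a `(pmid, word, cui)` tuple; the tuple layout, the `filter_instances` name, the `min_examples` threshold, and the placeholder CUI strings are all illustrative assumptions rather than part of the Jimeno-Yepes et al. method.

```python
from collections import Counter, defaultdict

def filter_instances(instances, min_examples=2):
    """Drop ambiguous citations and under-used CUIs.

    instances: list of (pmid, word, cui) tuples (hypothetical layout).
    """
    # Collect the set of CUIs each citation uses for each word
    senses = defaultdict(set)
    for pmid, word, cui in instances:
        senses[(pmid, word)].add(cui)

    # Keep only citations that use exactly one CUI for the word
    unambiguous = [
        (pmid, word, cui)
        for pmid, word, cui in instances
        if len(senses[(pmid, word)]) == 1
    ]

    # Count the remaining examples per CUI and drop the rare ones
    counts = Counter(cui for _, _, cui in unambiguous)
    return [row for row in unambiguous if counts[row[2]] >= min_examples]


# Placeholder CUIs, not real Metathesaurus identifiers
sample = [
    (1, "cold", "CUI_A"),
    (2, "cold", "CUI_A"),
    (3, "cold", "CUI_B"),   # only one unambiguous example of CUI_B
    (4, "cold", "CUI_A"),
    (4, "cold", "CUI_B"),   # citation 4 uses two CUIs for the same word
]
kept = filter_instances(sample, min_examples=2)
print(kept)
```

Filtering the ambiguous citations first means they do not count towards the per-CUI totals, which is one possible ordering of the two sub-steps.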


In [None]:
# !git clone https://ChristopheUAB@github.com/ChristopheUAB/WSD-Proj

# git clone https://username@github.com/username/repo_name

In [44]:
'''
Read and export MSD data to tsv
'''

from google.colab import files
import os
import re
import pandas as pd
from io import StringIO

class MSD_Load:
  def UploadMSDFiles():
    uploaded = files.upload()

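  def FilePaths(ParentPath):
    # Collect the .arff files in ParentPath, skipping re-uploaded copies
    # such as 'AA_pmids_tagged (1).arff' that the regex in ExtractDat cannot parse
    Paths = []
    with os.scandir(ParentPath) as entries:
        for entry in entries:
          if entry.name.endswith('_pmids_tagged.arff'):
            Paths.append(ParentPath + entry.name)
    return Paths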
  
  def ExtractDat(Paths):
    FinalData = ''
    CUIDic = {}
    for entry in Paths:
      match = re.search(r'/content/(.*)_pmids_tagged.arff',entry).group(1)
      with open(entry, 'r', encoding = "ISO-8859-1") as f:
        data = f.read()
        
        # Extract CUI, attributes, and data
        CUIData = re.search(r'@RELATION (.*)\n', data).group(1).split('_')
        Attributes = re.findall(r'@ATTRIBUTE (.*)\n', data)
        FullData = re.search(r'@DATA\n(.*)', data, re.DOTALL).group(1)

        # Match CUI and M#s
        Classes = [i for i in Attributes if 'class' in i]
        Index = Attributes.index(Classes[0])
        MVals = re.search(r'class {(.*)}', Classes[0])
        MValsList = MVals.group(1).split(', ')

        # Replace M#s with correct CUIs in FullData
        c = 0
        for CUI in CUIData:
          FullData = FullData.replace(MValsList[c] + '\n',CUI + '\n')
          c += 1

        # Insert ambiguous at the start of all lines
        FullData = FullData.split('\n')
        del FullData[-1]
        c = 0
        for line in FullData:
          FullData[c] = match + ',' + line
          c += 1
        FullData = '\n'.join(FullData)

        # Append to data list
        FinalData += FullData + '\n'

        CUIDic[match] = CUIData

        f.close()

    # Join data
    Attributes[Index] = 'CUI'
    AttributesStr = 'ambiguous word,' + ",".join(Attributes) + '\n'
    FinalData = AttributesStr + FinalData
    return FinalData, CUIDic

  def String2DF(String):
    # Just use splits
    String = String.split('\n')

    # Change relevant commas to tabs
    FinalString = ''
    c = 0
    for element in String:
      ElementList = element.split('"')
      if len(ElementList)-3 != 0:
        FinalString += ElementList[0].replace(',', '\t') + '\n'
      else:
        ElementList[0] = ElementList[0].replace(',', '\t')
        ElementList[2] = ElementList[2].replace(',', '\t')
        FinalString += '"'.join(ElementList) + '\n'
      c += 1

    # Convert tab separated values to a df
    df = pd.read_csv(StringIO(FinalString), sep = '\t')
    return df
  
  # df.to_csv('MSDFile.tsv',sep='\t',encoding="ISO-8859-1")
  def WriteTSV(df, FileName):
    df.to_csv(FileName,sep='\t',encoding="ISO-8859-1")
    return

  def ReadTSV(Path):
    df = pd.read_csv(Path, sep = '\t',encoding="ISO-8859-1")
    return df

MSD_Load.UploadMSDFiles()
paths = MSD_Load.FilePaths('/content/')
[Data, CUIDic] = MSD_Load.ExtractDat(paths)
df = MSD_Load.String2DF(Data)
# MSD_Load.WriteTSV(df, 'MSDFile.tsv')
# df = MSD_Load.ReadTSV('/content/MSDFile.tsv')

Saving AA_pmids_tagged.arff to AA_pmids_tagged (1).arff
Saving ADA_pmids_tagged.arff to ADA_pmids_tagged (1).arff
Saving ADH_pmids_tagged.arff to ADH_pmids_tagged (1).arff
Saving ADP_pmids_tagged.arff to ADP_pmids_tagged (1).arff
Saving Adrenal_pmids_tagged.arff to Adrenal_pmids_tagged (1).arff
Saving Ala_pmids_tagged.arff to Ala_pmids_tagged (1).arff
Saving ALS_pmids_tagged.arff to ALS_pmids_tagged (1).arff
Saving ANA_pmids_tagged.arff to ANA_pmids_tagged (1).arff
Saving Arteriovenous Anastomoses_pmids_tagged.arff to Arteriovenous Anastomoses_pmids_tagged (1).arff
Saving Astragalus_pmids_tagged.arff to Astragalus_pmids_tagged (1).arff
Saving BAT_pmids_tagged.arff to BAT_pmids_tagged (1).arff
Saving B-Cell Leukemia_pmids_tagged.arff to B-Cell Leukemia_pmids_tagged (1).arff
Saving benchmark_mesh.txt to benchmark_mesh (1).txt
Saving BLM_pmids_tagged.arff to BLM_pmids_tagged (1).arff
Saving Borrelia_pmids_tagged.arff to Borrelia_pmids_tagged (1).arff
Saving BPD_pmids_tagged.arff to BPD_pm

AttributeError: ignored
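
The label mapping performed in `ExtractDat` can be illustrated on a tiny in-memory example. This is a sketch that assumes, as the loader above does, that `@RELATION` lists the CUIs separated by underscores and that the class attribute lists the matching `M1, M2, ...` labels in the same order. The sample text, PMIDs, and CUI values are made up.

```python
import re

# Made-up ARFF content in the layout the loader above expects
sample_arff = (
    "@RELATION C0000001_C0000002\n"
    "@ATTRIBUTE PMID integer\n"
    "@ATTRIBUTE citation string\n"
    "@ATTRIBUTE class {M1, M2}\n"
    "@DATA\n"
    '111,"first <e>AA</e> text",M1\n'
    '222,"second <e>AA</e> text",M2\n'
)

# CUIs come from the relation name, M# labels from the class attribute
cuis = re.search(r'@RELATION (.*)\n', sample_arff).group(1).split('_')
m_vals = re.search(r'class \{(.*)\}', sample_arff).group(1).split(', ')
m_to_cui = dict(zip(m_vals, cuis))

# Swap the trailing M# label of each data row for its CUI
rows = re.search(r'@DATA\n(.*)', sample_arff, re.DOTALL).group(1).splitlines()
relabelled = []
for row in rows:
    body, label = row.rsplit(',', 1)
    relabelled.append(body + ',' + m_to_cui[label])

print(m_to_cui)
print(relabelled)
```

Splitting on the last comma touches only the label field, so an `M1` that happens to appear inside the citation text is left alone.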

In [13]:
# Many to one examples:
# https://towardsdatascience.com/sentiment-analysis-using-lstm-step-by-step-50d074f09948
# https://towardsdatascience.com/reading-between-the-layers-lstm-network-7956ad192e58

# LSTM Coding tut
# https://towardsdatascience.com/lstm-by-example-using-tensorflow-feb0c1968537
# https://www.deeplearningbook.org/contents/rnn.html
# http://colah.github.io/posts/2015-08-Understanding-LSTMs/

# WSD Papers
# https://aclanthology.org/C18-1030/
# df = MSD_Load.String2DF(Data)
# # Clean up data
# #   Tokenize? Punct?
# #   Standardize lengths
# # convert to numbers using frequency
# # Generate embeddings
# df.head()
# print(type(df['citation string'][0]))
# print(df['citation string'][0])

# for row in df['citation string']:
#   for sub_str in ['-1']:
#     if sub_str in row:
#       print(row)

# !pip install glove_python

# from glove import Corpus, Glove

# #Creating a corpus object
# corpus = Corpus() 

# #Training the corpus to generate the co-occurrence matrix which is used in GloVe
# corpus.fit(Corpus, window=10)

# glove = Glove(no_components=5, learning_rate=0.05) 
# glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
# glove.add_dictionary(corpus.dictionary)
# glove.save('glove.model')

from gensim.models import Word2Vec

sg_model = Word2Vec(sentences=Corpus, size=256, window=5, workers=4, sg=1)

labels = []
tokens = []
for word in sg_model.wv.vocab:
    tokens.append(sg_model.wv[word])
    labels.append(word)

In [31]:
# for word in sg_model.wv.vocab:
#   print(word)
#   print(sg_model[word])
#   break
# print(sg_model.wv.vocab[0])
print(sg_model['1161'])

  


KeyError: ignored
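
The `KeyError` above is probably caused by vocabulary pruning: gensim's `Word2Vec` discards tokens that occur fewer than `min_count` times, and `min_count` defaults to 5. The sketch below imitates that pruning with a plain `Counter` rather than calling gensim; `build_vocab` and the toy corpus are illustrative only.

```python
from collections import Counter

def build_vocab(corpus, min_count=5):
    """Return the tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# '1161' appears once, the other tokens five times each
toy_corpus = [["acid", "binding", "1161"]] + [["acid", "binding"]] * 4

print("1161" in build_vocab(toy_corpus))               # rare token is pruned
print("1161" in build_vocab(toy_corpus, min_count=1))  # kept when min_count=1
```

Passing `min_count=1` to `Word2Vec` should keep every token, at the cost of noisy vectors for rare words.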

In [None]:
# import nltk

# word_data = "It's originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# nltk_tokens = nltk.word_tokenize(word_data)
# print (nltk_tokens)
import nltk
import string
# nltk.download('punkt')
nltk.download("stopwords")
from nltk.corpus import stopwords


from nltk.tokenize import word_tokenize
s = df['citation string'][0]
# print(dir(s))
df.head()
# s = word_tokenize(s.lower())
print(s)
print(string.punctuation)
print(stopwords.words('english'))
test = df
# test['citation string'][0] = word_tokenize(test['citation string'][0].lower())
test.head()
# print(test['citation string'][0])

In [2]:
'''
Data processing
'''
# Convert everything to lower
# remove punctuation
#   Remove all but <>
#   Replace tag with $ and then remove <>
# Tokenize the string
# Remove stop words
# Return data?
#   truncating or removing data?

import re
import nltk
import string
nltk.download('punkt')
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

class Data_Processing:

  # Basic tokenization
  # Note: consider adding start and end sentence tokens? Doesn't seem relevant to learning word context
  def BasicProc(df):
    corpus = []
    labels = []
    for index, row in df.iterrows():
      # Remove punctuation and replace target word with $
      punctuation = string.punctuation.replace('<', '')
      punctuation = punctuation.replace('>', '')
      row['citation string'] = row['citation string'].translate(str.maketrans('','',punctuation))
      subString = re.search(r'<e>(.*)<e>',row['citation string']).group(1)
      row['citation string'] = row['citation string'].replace('<e>' + subString + '<e>', '$')
      row['citation string'] = row['citation string'].translate(str.maketrans('','','<>'))

      # Lower, tokenize, and remove stop words
      row['citation string'] = word_tokenize(row['citation string'].lower())
      stopWords = set(stopwords.words('english'))  # set for fast membership tests
      filteredSentence = [word for word in row['citation string'] if word not in stopWords]
      row['citation string'] = filteredSentence

      # Collect the processed tokens and their label
      corpus.append(row['citation string'])
      labels.append(row['CUI'])
    return corpus, labels
  
  def DataSnip(Corpus, Window):
    WindowedCorpus = []
    for entry in Corpus:
      
      # Add Window # of blanks to start and end to guarantee correct lengths
      Blanks = [''] * Window   # '' is the padding token that Embedding.RevFreq maps to 0
      entry = Blanks + entry + Blanks

      # Find relevant segment
      index = 0
      for token in entry:
        if token == '$':
          break
        else:
          index += 1

      WindowedCorpus.append(entry[index-Window:Window+index+1])
    return WindowedCorpus

WindowSize = 10
[Corpus, Labels] = Data_Processing.BasicProc(df)
WindowedCorpus = Data_Processing.DataSnip(Corpus, WindowSize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
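
To make the windowing in `DataSnip` concrete, here is a small standalone sketch of the same idea. It assumes the target word has already been replaced by `$`; the `window_around` name and the `pad` argument are illustrative, not part of the class above.

```python
def window_around(tokens, window, target='$', pad=''):
    """Return `window` tokens on each side of the first `target`, padded."""
    padded = [pad] * window + list(tokens) + [pad] * window
    centre = padded.index(target)   # raises ValueError if the target is missing
    return padded[centre - window:centre + window + 1]

tokens = ['low', '$', 'intake', 'associated', 'risk']
snippet = window_around(tokens, window=2)
print(snippet)  # ['', 'low', '$', 'intake', 'associated']
```

Every snippet has length `2 * window + 1`, which is the fixed sequence length the LSTM input expects.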


In [45]:
'''
Embeddings
'''

import numpy as np
from gensim.models import Word2Vec

class Embedding():

  def RevFreq(Corpus):
    tokenKey = {}
    for element in Corpus:
      for token in element:
        if token in tokenKey:
          tokenKey[token] += 1
        else:
          tokenKey[token] = 1

    sortedResCounts = sorted(tokenKey.items(), key=lambda x:x[1], reverse=True)  # Tuples
    Count = 1
    Zeros = ['', '$']   # The $ tag and empty spaces are special cases not relevant to the surrounding context
    sortedRes = {}
    for key in sortedResCounts:
      if key[0] in Zeros:
        sortedRes[key[0]] = [0]
      else:
        sortedRes[key[0]] = [Count]
        Count +=1

    return sortedRes

  # def Word2Vec(Corpus):
  #   sg_model = Word2Vec(sentences=Corpus, size=256, window=5, workers=4, sg=1)

  #   # Words like 1161 are probably skipped because Word2Vec drops tokens seen fewer than min_count (default 5) times
  #   sg_model_dict = {}
  #   for word in sg_model.wv.vocab:
  #     sg_model_dict[word] = sg_model[word]

  #   return sg_model_dict

  # def Glove():
    # Glove
    # word2vec
    # ?

  def OneHotEncoding(Labels):
    CompressedLabels = [*set(Labels)]
    Dim = len(CompressedLabels)
    OneHotMat = np.identity(Dim)
    LabelDict = {}
    for i in range(0,Dim):
      Blank = []
      for element in OneHotMat[i]:
        Blank.append([element])

      LabelDict[CompressedLabels[i]] = Blank

    return LabelDict
  
  def DictConvertion(List, Dict, Forward = True):
    if Forward is True:
      ConvertedList = []
      for element in List:
        ConvertedList.append(Dict[element])
    else:
      ConvertedList = []
      for element in List:
        ConvertedList.append( list(Dict.keys())[list(Dict.values()).index(element)])

    return ConvertedList

# Keys
TokenKey = Embedding.RevFreq(WindowedCorpus)
# TokenKey = Embedding.Word2Vec(Corpus)
OneHotLabels = Embedding.OneHotEncoding(Labels)

# Convert tokens and labels using keys
ConvertedLabels = Embedding.DictConvertion(Labels, OneHotLabels)
ConvertedCorpus = []
for element in WindowedCorpus:
  ConvertedCorpus.append(Embedding.DictConvertion(element, TokenKey))

Streaming output truncated to the last 5000 lines.
standard
diet
modulate
ion
currents
emerging
evidences
point
similar
effects
alphalinolenic
acid
$
vegetable
n3
pufa
much
less
known
effects
specific
cardiac
ion
tail
first
mode
fatty
acid
binding
sitedirected
mutagenesis
active
site
$
ala215
typically
conserved
gly
rlox
revealed
substitution
gly
retained
9r
acidenriched
oil
ace
inhibitor
concerning
decrease
blood
pressure
shralphalinolenic
acid
$
reported
exhibit
antihypertensive
effect
angiotensinconverting
enzyme
inhibitor
acei
also
antihypertensive
examined
influence
17betaestradiol
e2
capacity
shsy5y
cells
supplemented
alphalinolenic
acid
$
produce
eicosapentaenoic
acid
epa
docosapentaenoic
acid
dpa
docosahexaenoic
acid
dha
sunflower
differ
greatly
content
n3
fatty
acids
specifically
alphalinolenic
acid
$
low
ala
intake
associated
risk
fatal
chd
sudden
cardiac
death
administrationthe
present
study
attempted
clarify
antihypertensive
effect
mechanism
alphalinolenic
aci

In [None]:
'''
Basic LSTM
'''
import numpy as np
import keras
from nltk.tokenize import RegexpTokenizer
from keras.models import Sequential, load_model
from keras.layers import LSTM
from keras.layers import Bidirectional
from keras.layers import Dropout
from keras.layers import Dense, Activation
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import pickle
import heapq

WindowSizeTot = WindowSize*2 +1
ConvertedCorpus = np.array(ConvertedCorpus)# .reshape((1,WindowSize,len(ConvertedCorpus)))
ConvertedLabels = np.array(ConvertedLabels)

import tensorflow as tf

ConvertedCorpus = tf.cast(ConvertedCorpus, tf.float32)
ConvertedLabels = tf.cast(ConvertedLabels, tf.float32)

# model = Sequential()
# model.add(LSTM(128, input_shape=(WindowSize, 1)))
# # model.add(Dense(len(TokenKey)))
# model.add(Activation(activation='softmax'))

# model.summary()

# optimizer = RMSprop(lr=0.01)
# model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
# history = model.fit(x=ConvertedCorpus, y=ConvertedLabels, validation_split=0.05, batch_size=128, epochs=2, shuffle=True).history

model = keras.Sequential(
    [
        keras.Input(shape=(WindowSizeTot, 1)),
        # keras.layers.LSTM(128),
        Bidirectional(LSTM(128)),
        keras.layers.Dense(410,activation="softmax"),
        # Dropout(.2, input_shape=(2,)),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)

history = model.fit(x=ConvertedCorpus, y=ConvertedLabels, validation_split=0.1, batch_size=128, epochs=100, shuffle=True).history

model.save('keras_next_word_model.h5')
pickle.dump(history, open("history.p", "wb"))
model = load_model('keras_next_word_model.h5')
history = pickle.load(open("history.p", "rb"))

plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100

KeyboardInterrupt: ignored

In [None]:
'''
Data visualization
'''
import numpy as np

# test = {}
# test['test1'] = 0
# print(test)

# print(len(TokenKey))
# input_shape=(WindowSize, len(TokenKey))
# print(input_shape)
# y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape((5,2))
# print(y)

# X = np.zeros((len(WindowedCorpus), WindowSize, len(TokenKey)), dtype=bool)

# print(np.zeros((5,4,11)))

Test = [1, 2, [3, 4]]
print(Test)
print(Test[2][1])
print(history.keys())

[1, 2, [3, 4]]
4
dict_keys(['loss', 'val_loss'])


In [None]:
'''
Models
'''

# LSTM
# BiLSTM



In [None]:
'''
Train WSD
'''
# Use a seed to make results reproducible
# What approach do I use?
# Test various algos
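
The seeding note above could be sketched as follows. This minimal example uses only Python's `random` module with a fixed seed; `seeded_split` and its parameters are illustrative. NumPy and TensorFlow keep separate random states, so `np.random.seed` and `tf.random.set_seed` would also need to be called before training for a fully repeatable run.

```python
import random

def seeded_split(items, test_fraction=0.1, seed=0):
    """Shuffle with a private, seeded generator and split into train/test."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_a, test_a = seeded_split(range(20), test_fraction=0.2, seed=42)
train_b, test_b = seeded_split(range(20), test_fraction=0.2, seed=42)
print(test_a == test_b)  # the same seed gives the same split
```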

In [None]:
'''
Test WSD
'''
# Use k-fold validation
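
A k-fold split could be sketched in plain Python as below. `k_fold_indices` is a hypothetical helper written for illustration; in practice scikit-learn's `KFold` offers the same behaviour. The sketch assumes the data has already been shuffled, since the folds are contiguous.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(test)
```

Each sample lands in exactly one test fold, so averaging the per-fold scores uses every labelled instance once for evaluation.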

In [None]:
'''
Implement WSD
'''
# Search through the data in the cancer corpus for ambiguous words
# Use the WSD to get the CUI code for that ambiguous word
# Replace that ambiguous instance with the CUI code to pass the document on to the document classifier
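
The replacement step described above might look like the sketch below. It assumes a callable `predict_cui(word, context, position)` that wraps the trained WSD model; that interface, the `replace_ambiguous` name, and the placeholder CUI are hypothetical.

```python
import re

def replace_ambiguous(text, ambiguous_words, predict_cui):
    """Replace each ambiguous word in `text` with the CUI chosen by `predict_cui`."""
    if not ambiguous_words:
        return text
    # Longest terms first so multi-word terms win over their substrings
    ordered = sorted(ambiguous_words, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, ordered)) + r')\b')

    def substitute(match):
        # Hand the word, the full document, and the match offset to the model
        return predict_cui(match.group(1), text, match.start())

    return pattern.sub(substitute, text)

# Stand-in for the trained WSD model: always returns the same made-up CUI
def dummy_predictor(word, context, position):
    return 'C0000001'

doc = 'Low ALA intake was associated with risk.'
result = replace_ambiguous(doc, {'ALA'}, dummy_predictor)
print(result)  # Low C0000001 intake was associated with risk.
```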

In [None]:
# https://gweissman.github.io/post/using-metamap-with-python-to-access-the-umls-metathesaurus-a-quick-start-guide/
# https://documentation.uts.nlm.nih.gov/rest/search/

