<b><h1>MetaMap Evaluation Tool V1 - Part 1</h1></b>
This tool helps evaluate and compare different MetaMap behavioural options.
Part 1 contains the code for reading the text and building the phrase and word2vec models.

<h2>Requirements</h2>
<ol>
<li>Create the conda environment with all the required libraries.</li>
<li>Install PubMed EDirect (https://www.ncbi.nlm.nih.gov/books/NBK179288/).</li>
<li>Install MetaMap locally on Linux (https://metamap.nlm.nih.gov/Docs/README.html).</li>
<li>Build the pubmed_parser package from the source code provided in the repository, as we made some changes to the standard version.</li>
<li>Paper text from PubMed.</li>
</ol>

<h3>Step 1: Getting the text and building the word2vec model</h3>
Get the paper abstracts, preprocess them, form phrases, and train the word2vec model.

In [2]:
#1a - reading paper abstracts


import pubmed_parser as pp

dicts_out = pp.parse_medline_xml('/home/aindani/CDSS/vectorization/processed_main_script_out_1_5_2021.xml',
                                 year_info_only=False,
                                 nlm_category=False,
                                 author_list=False,
                                 reference_list=False)

In [3]:
#1b preprocessing

import spacy
import unidecode
from word2number import w2n
from pycontractions import Contractions
import gensim.downloader as api
import json

import logging  # set up logging to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

nlp = spacy.load('en_core_web_sm')

# Choose model accordingly for contractions function
model = api.load("glove-twitter-25")
# model = api.load("glove-twitter-100")
# model = api.load("word2vec-google-news-300")

cont = Contractions(kv_model=model)
cont.load_models()

# exclude words from spacy stopwords list
deselect_stop_words = ['no', 'not']
for w in deselect_stop_words:
    nlp.vocab[w].is_stop = False
    


def remove_whitespace(text):
    """remove extra whitespaces from text"""
    text = text.strip()
    return " ".join(text.split())


def remove_accented_chars(text):
    """remove accented characters from text, e.g. café --> cafe"""
    text = unidecode.unidecode(text)
    return text


def expand_contractions(text):
    """expand shortened words, e.g. don't to do not"""
    text = list(cont.expand_texts([text], precise=True))[0]
    return text


def text_preprocessing(text, accented_chars=True, contractions=True, 
                       convert_num=True, extra_whitespace=True, 
                       lemmatization=True, lowercase=True, punctuations=True, 
                       remove_num=True, special_chars=True, 
                       stop_words=True):
    """preprocess text with default option set to true for all steps"""
   
    if extra_whitespace == True: #remove extra whitespaces
        text = remove_whitespace(text)
    if accented_chars == True: #remove accented characters
        text = remove_accented_chars(text)
    if contractions == True: #expand contractions
        text = expand_contractions(text)
    if lowercase == True: #convert all characters to lowercase
        text = text.lower()

    doc = nlp(text) #tokenise text
    #print(doc)
    clean_text = []
    
    #remove_unwanted_dict={'objective'}
    
    for token in doc:
        #print(token)
        flag = True
        edit = token.text
        '''
        #remove unwanted like objective, methods
        if token.text in remove_unwanted_dict:
            flag = False
        '''
        
        # remove stop words
        if stop_words == True and token.is_stop and token.pos_ != 'NUM': 
            flag = False
        # remove punctuations
        if punctuations == True and token.pos_ == 'PUNCT' and flag == True: 
            flag = False
        # remove special characters
        if special_chars == True and token.pos_ == 'SYM' and flag == True: 
            flag = False
        # remove numbers
        if remove_num == True and (token.pos_ == 'NUM' or token.text.isnumeric()) \
        and flag == True:
            flag = False
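        # convert number words to digits (only reached when remove_num is False);
        # keep the original token text if word2number cannot parse it
        if convert_num == True and token.pos_ == 'NUM' and flag == True:
            try:
                edit = str(w2n.word_to_num(token.text))  # str so every token has the same type
            except ValueError:
                pass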
        # convert tokens to base form
        elif lemmatization == True and token.lemma_ != "-PRON-" and flag == True:
            edit = token.lemma_
        # append tokens edited and not removed to list 
        if edit != "" and flag == True:
            clean_text.append(edit)        
    return clean_text


INFO - 11:58:48: loading projection weights from /home/aindani/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO - 11:59:08: KeyedVectors lifecycle event {'msg': 'loaded (1193514, 25) matrix of type float32 from /home/aindani/gensim-data/glove-twitter-25/glove-twitter-25.gz', 'binary': False, 'encoding': 'utf8', 'datetime': '2021-07-09T11:59:08.702764', 'gensim': '4.0.1', 'python': '3.9.0 (default, Nov 15 2020, 14:28:56) \n[GCC 7.3.0]', 'platform': 'Linux-4.18.0-193.6.3.el8_2.x86_64-x86_64-with-glibc2.28', 'event': 'load_word2vec_format'}


In [4]:
#1c function to form phrases
from gensim.models.phrases import Phrases, Phraser
def build_phrases(sentences):
    phrases = Phrases(sentences,
                      min_count=3,
                      threshold=0.5,
                      scoring='npmi',
                      progress_per=1000)
    return Phraser(phrases)
# TODO: handle titles separately and give them more weight
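
With scoring='npmi', gensim scores each candidate bigram by normalized pointwise mutual information, which ranges from -1 to 1, and keeps bigrams that score above the threshold (0.5 here). The sketch below recomputes that score with the standard library on a toy corpus. The function name npmi and the toy sentences are illustrative assumptions, and the sketch ignores gensim's min_count filtering and connector-word handling.

```python
from collections import Counter
from math import log


def npmi(sentences, word_a, word_b):
    """NPMI of the bigram (word_a, word_b): ln(p_ab / (p_a * p_b)) / -ln(p_ab)."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    total = sum(unigrams.values())  # corpus size in tokens
    if bigrams[(word_a, word_b)] == 0:
        return float("-inf")  # bigram never observed
    p_a = unigrams[word_a] / total
    p_b = unigrams[word_b] / total
    p_ab = bigrams[(word_a, word_b)] / total
    return log(p_ab / (p_a * p_b)) / -log(p_ab)


# Toy tokenized "abstracts" (hypothetical data)
toy = [
    ["blood", "pressure", "high"],
    ["blood", "pressure", "low"],
    ["patient", "blood", "pressure"],
    ["patient", "high", "risk"],
]

print(npmi(toy, "blood", "pressure"))
print(npmi(toy, "patient", "high"))
```

On this toy corpus, "blood pressure" scores 1.0 because the two words always occur together, while "patient high" scores about 0.44 and would not be joined at threshold=0.5.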

In [5]:
#looping over all the abstracts to do text preprocessing
processed_abstracts=[]
for doc in dicts_out:
    clean_text=text_preprocessing(doc['abstract'])
    processed_abstracts.append(clean_text)
    

INFO - 11:59:30: Removed 38 and 38 OOV words from document 1 and 2 (respectively).
INFO - 11:59:30: adding document #0 to Dictionary(0 unique tokens: [])
INFO - 11:59:30: built Dictionary(83 unique tokens: ['a', 'advice', 'along', 'an', 'and']...) from 2 documents (total 303 corpus positions)
INFO - 11:59:30: Dictionary lifecycle event {'msg': "built Dictionary(83 unique tokens: ['a', 'advice', 'along', 'an', 'and']...) from 2 documents (total 303 corpus positions)", 'datetime': '2021-07-09T11:59:30.546742', 'gensim': '4.0.1', 'python': '3.9.0 (default, Nov 15 2020, 14:28:56) \n[GCC 7.3.0]', 'platform': 'Linux-4.18.0-193.6.3.el8_2.x86_64-x86_64-with-glibc2.28', 'event': 'created'}
INFO - 11:59:31: Removed 38 and 38 OOV words from document 1 and 2 (respectively).
INFO - 11:59:31: adding document #0 to Dictionary(0 unique tokens: [])
INFO - 11:59:31: built Dictionary(83 unique tokens: ['a', 'advice', 'along', 'an', 'and']...) from 2 documents (total 303 corpus positions)
INFO - 11:59:31:
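
Records without abstract text contribute nothing to the phrase or word2vec models. A minimal sketch of skipping them before preprocessing, assuming each record is a dict that may carry an 'abstract' key as in the loop above. The helper name iter_abstracts and the sample records are hypothetical.

```python
def iter_abstracts(records):
    """Yield stripped abstract strings, skipping records whose abstract is missing or blank."""
    for record in records:
        abstract = (record.get("abstract") or "").strip()
        if abstract:
            yield abstract


# Hypothetical records standing in for the parser output
sample_records = [
    {"pmid": "1", "abstract": "Aspirin reduces fever."},
    {"pmid": "2", "abstract": ""},
    {"pmid": "3", "abstract": None},
    {"pmid": "4"},
    {"pmid": "5", "abstract": "  Statins lower cholesterol.  "},
]

abstracts = list(iter_abstracts(sample_records))
print(len(abstracts))
```

In the notebook, iterating over iter_abstracts(dicts_out) and passing each string to text_preprocessing would take the place of the direct doc['abstract'] lookup.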

In [6]:
ans=build_phrases(processed_abstracts[:3200])

INFO - 12:00:38: collecting all words and their counts
INFO - 12:00:38: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 12:00:39: PROGRESS: at sentence #1000, processed 148403 words and 104220 word types
INFO - 12:00:39: PROGRESS: at sentence #2000, processed 279048 words and 175557 word types
INFO - 12:00:39: PROGRESS: at sentence #3000, processed 390948 words and 232127 word types
INFO - 12:00:39: collected 241080 token types (unigram + bigrams) from a corpus of 409831 words and 3200 sentences
INFO - 12:00:39: merged Phrases<241080 vocab, min_count=3, threshold=0.5, max_vocab_size=40000000>
INFO - 12:00:39: Phrases lifecycle event {'msg': 'built Phrases<241080 vocab, min_count=3, threshold=0.5, max_vocab_size=40000000> in 0.42s', 'datetime': '2021-07-09T12:00:39.328074', 'gensim': '4.0.1', 'python': '3.9.0 (default, Nov 15 2020, 14:28:56) \n[GCC 7.3.0]', 'platform': 'Linux-4.18.0-193.6.3.el8_2.x86_64-x86_64-with-glibc2.28', 'event': 'created'}
INFO - 12:00:39: exp

<h3>Step 2: Training the word2vec model</h3>

In [7]:
import pickle

all_abs_tokens_phrases=[]
for abstract_tokens in processed_abstracts[:3200]:
    all_abs_tokens_phrases.append(ans[abstract_tokens])


#saving for future reference
with open("all_abs_tokens_phrases.txt", "wb") as fp:   #Pickling
    pickle.dump(all_abs_tokens_phrases, fp)
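
The pickled token lists can be read back in a later session with pickle.load. The following is a small round-trip sketch; the temporary directory and the toy token lists are assumptions for illustration, whereas the notebook itself writes all_abs_tokens_phrases.txt to the working directory.

```python
import os
import pickle
import tempfile

# Toy stand-in for all_abs_tokens_phrases (hypothetical data)
toy_tokens = [["blood_pressure", "high"], ["heart_failure", "risk"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "all_abs_tokens_phrases.txt")
    with open(path, "wb") as fp:  # binary mode, despite the .txt extension
        pickle.dump(toy_tokens, fp)
    with open(path, "rb") as fp:  # read back in binary mode as well
        restored = pickle.load(fp)

print(restored == toy_tokens)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.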

In [8]:
import gensim 

model = gensim.models.Word2Vec(all_abs_tokens_phrases,
        vector_size=150,
        window=5,
        min_count=0,
        workers=10,
        epochs=10,
        sg=1)

INFO - 12:03:42: collecting all words and their counts
INFO - 12:03:42: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 12:03:42: collected 16502 word types from a corpus of 380879 raw words and 3200 sentences
INFO - 12:03:42: Creating a fresh vocabulary
INFO - 12:03:42: Word2Vec lifecycle event {'msg': 'effective_min_count=0 retains 16502 unique words (100.0%% of original 16502, drops 0)', 'datetime': '2021-07-09T12:03:42.800970', 'gensim': '4.0.1', 'python': '3.9.0 (default, Nov 15 2020, 14:28:56) \n[GCC 7.3.0]', 'platform': 'Linux-4.18.0-193.6.3.el8_2.x86_64-x86_64-with-glibc2.28', 'event': 'prepare_vocab'}
INFO - 12:03:42: Word2Vec lifecycle event {'msg': 'effective_min_count=0 leaves 380879 word corpus (100.0%% of original 380879, drops 0)', 'datetime': '2021-07-09T12:03:42.801400', 'gensim': '4.0.1', 'python': '3.9.0 (default, Nov 15 2020, 14:28:56) \n[GCC 7.3.0]', 'platform': 'Linux-4.18.0-193.6.3.el8_2.x86_64-x86_64-with-glibc2.28', 'event': 'prepare_vo

<h3>Step 3: Saving the models for further use</h3>

In [9]:
model.save('corpus_word2vec.model')
#frozen_phrases_model = ans.freeze()
ans.save("my_phrase_model.pkl")

INFO - 12:04:43: Word2Vec lifecycle event {'fname_or_handle': '/home/aindani/CDSS/Metamap_eval_final/corpus_word2vec.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-07-09T12:04:43.019245', 'gensim': '4.0.1', 'python': '3.9.0 (default, Nov 15 2020, 14:28:56) \n[GCC 7.3.0]', 'platform': 'Linux-4.18.0-193.6.3.el8_2.x86_64-x86_64-with-glibc2.28', 'event': 'saving'}
INFO - 12:04:43: not storing attribute cum_table
INFO - 12:04:43: saved /home/aindani/CDSS/Metamap_eval_final/corpus_word2vec.model
INFO - 12:04:43: FrozenPhrases lifecycle event {'fname_or_handle': '/home/aindani/CDSS/Metamap_eval_final/my_phrase_model.pkl', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-07-09T12:04:43.055553', 'gensim': '4.0.1', 'python': '3.9.0 (default, Nov 15 2020, 14:28:56) \n[GCC 7.3.0]', 'platform': 'Linux-4.18.0-193.6.3.el8_2.x86_64-x86_64-with-glibc2.28', 'event': 'saving'}
INFO - 12:04:43: saved /home/aindani/CDSS/Met

<h3>Step 1: Get the gold standard</h3>
The gold standard can also be stored in a file. In our case everything is stored in PostgreSQL, so we retrieve the data from there.
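
If the gold standard lives in a file rather than in PostgreSQL, it could be loaded along these lines. This is a sketch only: the CSV layout, the column names pmid and cui, and the placeholder identifiers are hypothetical and are not the schema used in this project.

```python
import csv
import io
from collections import defaultdict


def load_gold_standard(handle):
    """Read rows with 'pmid' and 'cui' columns into {pmid: set of concept ids}."""
    gold = defaultdict(set)
    for row in csv.DictReader(handle):
        gold[row["pmid"]].add(row["cui"])
    return dict(gold)


# In-memory stand-in for a real file; in practice pass open("gold.csv", newline="").
# The identifiers below are placeholders, not real annotations.
sample = io.StringIO("pmid,cui\n111,C0000001\n111,C0000002\n222,C0000003\n")
gold = load_gold_standard(sample)
print(gold)
```

Grouping concept identifiers per paper as sets makes it straightforward to compare them later against the concepts MetaMap returns for the same paper.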