## D1 - Gather as much medical text as possible and learn embeddings from those texts.

## Get/Scrape the documents

There are many ways to scrape a website and collect *.pdf files. One that most sites cannot block is web browser automation with a tool like Selenium, which lets you script the browsing tasks. Since PubMed has an API, though, it is easier to use that.

To find and collect the documents we use Entrez, which returns the IDs (PMIDs) matching our search: free full-text papers in English. Once we have the PMIDs, we use a library to automate the lookup and download of those papers.
If we don't want to use this library, we need to automate the download of the PDF files for each publisher domain ourselves.

In [21]:
from Bio import Entrez

In [31]:
def search(query):
    Entrez.email = 'set.tobur@gmail.com'  # Entrez requires an email address to discourage bots
    handle = Entrez.esearch(db='pubmed', 
                            sort='relevance', 
                            retmax='20000',
                            retmode='xml', 
                            term=query)
    results = Entrez.read(handle)
    return results

In [32]:
def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'set.tobur@gmail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results
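
The search above asks for up to 20000 IDs, and `fetch_details` sends them all in a single request. A minimal sketch of splitting the list into batches first, assuming a batch size of 200 (a value chosen for illustration, not a documented limit); `batched` is a helper name introduced here and is not part of Biopython:

```python
def batched(ids, size=200):
    """Yield consecutive slices of `ids` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Example with made-up IDs; in the notebook each batch would be passed to fetch_details().
fake_ids = [str(n) for n in range(1, 6)]
for batch in batched(fake_ids, size=2):
    print(",".join(batch))
```

Each batch can then be fetched and its `PubmedArticle` entries appended to one list.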

In [33]:
results = search('"english"[Language] AND free full text[sb]')
id_list = results['IdList']
papers = fetch_details(id_list)
for i, paper in enumerate(papers['PubmedArticle']):
    print("%d) %s" % (i+1, paper['MedlineCitation']['Article']['ArticleTitle']))
# Pretty print the first paper in full to observe its structure
#import json
#print(json.dumps(papers['PubmedArticle'][0], indent=2, separators=(',', ':')))

1) Update: Plant Cortical Microtubule Arrays.
2) Arabidopsis VPS38 is required for vacuolar trafficking but dispensable for autophagy.
3) Precision genome editing using synthesis-dependent repair of Cas9-induced DNA breaks.
4) Robust zero resistance in a superconducting high-entropy alloy at pressures up to 190 GPa.
5) Vasopressin excites interneurons to suppress hippocampal network activity across a broad span of brain maturity at birth.
6) Osmosensing by the bacterial PhoQ/PhoP two-component system.
7) On the role of the corpus callosum in interhemispheric functional connectivity in humans.
8) Inequality in nature and society.
9) Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States.
10) Oral anticoagulants for prevention of stroke in atrial fibrillation: systematic review, network meta-analysis, and cost effectiveness analysis.
11) Marriage and risk of dementia: systematic review and meta-analysis of observational stu

In [34]:
# Now we can pass the PMIDs to a library that downloads the papers
lista_PMIDS = ""
for i, paper in enumerate(papers['PubmedArticle']):
    lista_PMIDS += str(paper['MedlineCitation']['PMID']+',')

In [35]:
len(lista_PMIDS)

900
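
Note that `len(lista_PMIDS)` counts the characters in the string, not the number of papers. A sketch that keeps the IDs in a list, so they can be counted and joined only when needed; the `records` value below is a made-up stand-in for `papers['PubmedArticle']`, and `collect_pmids` is a name introduced here:

```python
def collect_pmids(records):
    """Return the PMID of every record as a list of strings."""
    return [str(record['MedlineCitation']['PMID']) for record in records]


# Hypothetical records that mimic the nesting used in the notebook.
records = [
    {'MedlineCitation': {'PMID': '111'}},
    {'MedlineCitation': {'PMID': '222'}},
]
pmids = collect_pmids(records)
print(len(pmids))        # number of papers
print(','.join(pmids))   # comma-separated string without a trailing comma
```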

#### Once we have all the PMIDs we can call the script and try to download the papers

The install instructions are at https://github.com/billgreenwald/Pubmed-Batch-Download, but a manual installation worked better for me.

 Install ruby & ruby-dev from apt-get or pacman and then run in a terminal:
 - gem install mechanize camping socksify


In [57]:
import subprocess
import os

cmd = 'ruby Pubmed-Batch-Download/pubmedid2pdf.rb %s' % (lista_PMIDS)
run = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = run.communicate()
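
A sketch of the same call built as an argument list, which avoids `shell=True` and any shell quoting problems. It assumes the script accepts the comma-separated PMIDs as its single argument, as in the cell above; `build_command` and `download` are names introduced here:

```python
import subprocess


def build_command(pmids, script='Pubmed-Batch-Download/pubmedid2pdf.rb'):
    """Build the argument list for the Ruby downloader."""
    return ['ruby', script, ','.join(pmids)]


def download(pmids):
    """Run the downloader and return (exit code, stdout, stderr) as text."""
    done = subprocess.run(build_command(pmids), capture_output=True, text=True)
    return done.returncode, done.stdout, done.stderr


# Only the command is printed here; download() would actually run the script.
print(build_command(['111', '222']))
```

Checking the returned exit code and stderr makes failed downloads visible instead of silently ignored.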

## Parse

Once we have all the documents, we need to convert the PDF files to text files for training the Word2Vec model.

In [38]:
import pdfparser.poppler as pdf
import sys
import os
from tqdm import tqdm

In [39]:
import re
import unicodedata
def remove_accents(text, method='unicode'):
    """
    http://textacy.readthedocs.io/en/latest/_modules/textacy/preprocess.html
        if 'unicode', remove accented char for any unicode symbol with a 
        direct ASCII equivalent; 
        if 'ascii', remove accented char for any unicode symbol

        NB: the 'ascii' method is notably faster than 'unicode', but less good

    """

    if method == 'unicode':
        return ''.join(c for c in unicodedata.normalize('NFKD', text)
                       if not unicodedata.combining(c))
    elif method == 'ascii':
        return unicodedata.normalize('NFKD', text).encode('ascii', errors='ignore').decode('ascii')
    else:
        msg = '`method` must be either "unicode" or "ascii", not {}'.format(method)
        raise ValueError(msg)
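
To make the difference between the two methods concrete, here is a small self-contained sketch that repeats the two branches as separate functions (`strip_unicode` and `strip_ascii` are names introduced here) and runs them on two sample words:

```python
import unicodedata


def strip_unicode(text):
    """Drop combining marks after NFKD decomposition (the 'unicode' branch)."""
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if not unicodedata.combining(c))


def strip_ascii(text):
    """Drop every character that cannot be encoded as ASCII (the 'ascii' branch)."""
    return unicodedata.normalize('NFKD', text).encode('ascii', errors='ignore').decode('ascii')


print(strip_unicode('café'), strip_ascii('café'))      # both give 'cafe'
print(strip_unicode('straße'), strip_ascii('straße'))  # 'ß' survives only in the first
```

The 'unicode' branch keeps characters that have no decomposition, while the 'ascii' branch silently deletes them.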
    

In [40]:
def remove_superscript(text):
    m = re.search('[a-z][a-z][0-9] ', text)
    if m:        
        text = text[:m.start()+2] + text[m.start() + 3:]
    return text
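
`remove_superscript` uses `re.search`, so it removes only the first footnote digit in each line. A sketch of a variant that removes every occurrence with a single substitution, assuming the same pattern (one digit after two lowercase letters and before a space); `remove_all_superscripts` is a name introduced here:

```python
import re

# One digit that follows two lowercase letters and is followed by a space.
SUPERSCRIPT = re.compile(r'(?<=[a-z]{2})[0-9](?= )')


def remove_all_superscripts(text):
    """Delete every footnote-style digit glued to the end of a word."""
    return SUPERSCRIPT.sub('', text)


print(remove_all_superscripts('patients1 were treated2 well'))
```

Like the original, this also strips digits that belong to the word (for example in "co2 "), so it is only a heuristic.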

In [41]:
def load_file(file):
    doc = pdf.Document(str.encode(file))
    
    phrases = []
    #print('No of pages', doc.no_of_pages)
    for page in doc:
        for flow in page:
            for box in flow:
                box.bbox.as_tuple()                
                for line in box:
                    text_cleaned = remove_accents(line.text)
                    text_cleaned = remove_superscript(text_cleaned)
                    phrases.append(text_cleaned)
                    #phrases.append(line.text.encode('UTF-8'))
    return phrases
#load_file('../data/external/medicalPapersBatch1_011017/es_1678-4464-csp-33-s3-e00104917.pdf')

In [44]:
import queue
#save pdf files
def save_documents(pathInput, pathOutput='', where=0):
    if where == 0: # save to a data structure
        all_docs = []
        for file in os.listdir(pathInput):
            all_docs.append(load_file(os.path.join(pathInput, file)))
        return all_docs
    else: #save to file in disk
        q = queue.Queue()
        s = ""
        for file in tqdm(os.listdir(pathInput)):
            fileToSave = open(os.path.join(pathOutput, file), "w")
            values = load_file(os.path.join(pathInput, file))
            with fileToSave as myfile:
                for line in values:
                    if re.match('(.)*-$',line):                        
                        #we create a FIFO queue for join lines with dashes
                        q.put(line[:-1])
                    else:
                        #while there are not dashes we join the lines
                        while not q.empty():
                            s += q.get()                                                        
                        #print('{}{}\n'.format(s,line))
                        myfile.write('{}{}\n'.format(s,line))
                        s = ""  # reset the joined prefix before the next line

In [46]:
all_docs = save_documents('data/pdf/','data/processed/', 1)

100%|██████████| 53/53 [00:19<00:00,  2.69it/s]
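
The dash-joining step inside `save_documents` can be isolated as a small pure function, which makes it easier to test. This is a sketch of the same idea rather than the notebook's code: `join_hyphenated` is a name introduced here, and unlike the cell above it flushes a pending fragment at the end instead of carrying it over to the next file:

```python
def join_hyphenated(lines):
    """Merge lines that end with a dash into the line that follows them."""
    joined = []
    pending = ""
    for line in lines:
        if line.endswith("-"):
            # Word split across lines: keep the fragment without the dash.
            pending += line[:-1]
        else:
            joined.append(pending + line)
            pending = ""
    if pending:
        # The text ended on a dash; keep what we have.
        joined.append(pending)
    return joined


print(join_hyphenated(["treat-", "ment of pa-", "tients", "ok"]))
```

Genuine hyphenated compounds that happen to break at the hyphen are merged too, which is a known limitation of this heuristic.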


Of course, all of this is only a proof of concept. In real life you would need to collect many more documents and parse them much more carefully to build a clean corpus; right now, for example, the text contains many tokens that mix numbers and letters.
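
One simple way to deal with the mixed tokens is to keep only purely alphabetic ones. A minimal sketch, assuming the text is already tokenized; `keep_alphabetic` is a name introduced here and the token list is made up:

```python
def keep_alphabetic(tokens):
    """Keep tokens made only of letters; numbers and letter-digit mixes are dropped."""
    return [tok for tok in tokens if tok.isalpha()]


print(keep_alphabetic(['stroke', 'risk3', '2017', 'dementia', 'p53']))
```

This is a blunt filter: it also removes legitimate biomedical names such as "p53", so a real pipeline would probably need a whitelist or a more careful rule.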

## Training the model

In [2]:
import gensim
from nltk.corpus import stopwords
import nltk
import string
from tqdm import tqdm
from gensim.models.keyedvectors import KeyedVectors

In [51]:
class MySentences(object):
    
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        punctuations = list(string.punctuation)
        stop = stopwords.words('spanish') + stopwords.words('english') + punctuations
        stop += ['et', '1', '2', 'j', '3']        
        for fname in tqdm(os.listdir(self.dirname)):
            for line in open(os.path.join(self.dirname, fname), "rb"):
                sentence = line.decode('utf-8').lower()
                parts = nltk.word_tokenize(sentence)
                important_words = [item for item in parts if item not in stop]
                yield important_words            

In [52]:
sentences = MySentences('data/processed/')

In [53]:
model = gensim.models.Word2Vec(sentences, sg = 1, min_count=4, size = 300)

100%|██████████| 53/53 [00:21<00:00,  2.46it/s]
100%|██████████| 53/53 [00:22<00:00,  2.38it/s]
100%|██████████| 53/53 [00:22<00:00,  2.39it/s]
100%|██████████| 53/53 [00:23<00:00,  2.24it/s]
100%|██████████| 53/53 [00:26<00:00,  2.03it/s]
100%|██████████| 53/53 [00:23<00:00,  2.26it/s]


With model.save('models/w2v_v2.bin') we can save the model to retrain it later or to use it in production.

Since it is very hard to get a clean corpus and train models, I decided to use a pretrained model, trained on PubMed and PMC, from http://bio.nlplab.org/.

In [54]:
# Vectors loaded with load_word2vec_format() cannot be trained any further
model = KeyedVectors.load_word2vec_format('models/PubMed-and-PMC-w2v.bin', binary=True)

## Evaluate

For this we will use the paper "How to evaluate word embeddings? On importance of data efficiency and simple supervised tasks" and its code, which can be found at https://github.com/kudkudak/word-embeddings-benchmarks

UMNSRS
: The University of Minnesota Semantic Relatedness Set (UMNSRS) was developed by Pakhomov et al. (2010), and consists of 725 clinical term pairs whose semantic similarity and relatedness were determined independently by four medical residents from the University of Minnesota Medical School. The similarity and relatedness of each term pair was annotated on a continuous scale by having the resident touch a bar on a touch-sensitive computer screen to indicate the degree of similarity or relatedness. The Intraclass Correlation Coefficient (ICC) for the reference standard tagged for similarity was 0.47, and 0.50 for relatedness. Therefore, as suggested by Pakhomov and colleagues, we use a subset of the ratings consisting of 401 pairs for the similarity set and 430 pairs for the relatedness set, each of which has an ICC of 0.73.
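
Similarity benchmarks such as UMNSRS are typically scored by correlating the model's cosine similarity for each term pair with the human rating. A minimal sketch in plain Python, assuming Spearman rank correlation as the score; the vectors and ratings below are made up for illustration and are not UMNSRS data:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical 2-d vectors and human ratings, for illustration only.
vectors = {'heart': [1.0, 0.0], 'cardiac': [0.9, 0.1],
           'kidney': [0.6, 0.8], 'table': [0.0, 1.0]}
pairs = [('heart', 'cardiac', 9.0), ('heart', 'kidney', 5.0), ('heart', 'table', 1.0)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]
print(round(spearman(model_scores, human_scores), 3))
```

Because only the ranking matters, a model scores well when it orders the pairs the way the annotators did, regardless of the absolute similarity values.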

In [None]:
import subprocess
import os

cmd = 'scripts/word-embeddings-benchmarks/scripts/evaluate_on_all.py --file ~/jupyterNotebooks/mlnn2017/P3\ -\ Exploring\ specific\ domains\ with\ embeddings/models/PubMed-and-PMC-w2v.bin --format word2vec_bin'
run = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = run.communicate()

|                                | AP    | Battig | ESSLI_2b | ESSLI_2c | MEN   | MTurk | SimLex999 | WS353 | WS353S | UMNSRS_similarity | UMNSRS_relatedness |
|--------------------------------|-------|--------|----------|----------|-------|-------|-----------|-------|--------|-------------------|--------------------|
| PubMed-and-PMC-w2v             | 0.554 | 0.308  | 0.725    | 0.488    | 0.545 | 0.446 | 0.261     | 0.416 | 0.475  | 0.541             | 0.4864             |
| SBW-vector-300-min5+medics     | 0.281 | 0.246  | 0.55     | 0.4      | 0.314 | 0.232 | 0.145     | 0.232 | 0.303  | 0.059             | 0.033              |
| SBW-vectors-300-min5           | 0.325 | 0.239  | 0.425    | 0.377    | 0.32  | 0.262 | 0.128     | 0.280 | 0.332  | 0.053             | 0.027              |
| LexVec which="commoncrawl-W+C" | 0.639 | 0.431  | 0.725    | 0.644    | 0.809 | 0.712 | 0.419     | 0.647 | 0.756  |                   |                    |
| GloVe dim=300 corpus=wiki-6B   | 0.622 | 0.451  | 0.750    | 0.578    | 0.737 | 0.633 | 0.371     | 0.522 | 0.653  | 0.261             | 0.243              |

## D2 - Explore whether a combination with general domain corpora improves the results.

For this we could scrape Wikipedia with BeautifulSoup4 or Selenium. That would be easier than working with the PDF files, since HTML is a text format and gives us all the information we need about the text. But since the BioNLP group already provides a version of the previous model trained with a Wikipedia dump as well, we are going to use that.

If we had a model and wanted to retrain it with a bigger corpus or, as in this case, with a corpus from another domain, we could do it like this:

In [None]:
# We load the simple model
simple_model = KeyedVectors.load_word2vec_format('models/PubMed-and-PMC-w2v.bin', binary=True)

In [None]:
# We set the directory of the new files for the retraining
dirpath='data/processed/'
corpusReader = nltk.corpus.PlaintextCorpusReader(dirpath, r'.*\.txt')
frases = len(corpusReader.sents())
print("The number of sentences =", frases)

In [None]:
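sentences = MySentences(dirpath)
# train() is only available on a full Word2Vec model (for example one restored with
# gensim.models.Word2Vec.load); vectors read with load_word2vec_format() cannot be retrained.
model.train(sentences, total_examples=frases, epochs=model.iter)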

In [None]:
# If we want, we can save the model retrained on the new corpus
#model.wv.save_word2vec_format('../models/model_w2v_sbw_withMedical.bin', binary=True)

In [None]:
# Now we can load the combined model and check its accuracy
combined_model = KeyedVectors.load_word2vec_format('models/wikipedia-pubmed-and-PMC-w2v.bin', binary=True)

|                                | AP    | Battig | ESSLI_2b | ESSLI_2c | MEN   | MTurk | SimLex999 | WS353 | WS353S | UMNSRS_similarity | UMNSRS_relatedness |
|--------------------------------|-------|--------|----------|----------|-------|-------|-----------|-------|--------|-------------------|--------------------|
| PubMed-and-PMC-w2v             | 0.554 | 0.308  | 0.725    | 0.488    | 0.545 | 0.446 | 0.261     | 0.416 | 0.475  | 0.541             | 0.4864             |
| Wikipedia+PubMed-and-PMC-w2v   | 0.567 | 0.396  | 0.725    | 0.511    | 0.517 | 0.572 | 0.278     | 0.484 | 0.526  | 0.543             | 0.4860             |
| SBW-vector-300-min5+medics     | 0.281 | 0.246  | 0.55     | 0.4      | 0.314 | 0.232 | 0.145     | 0.232 | 0.303  | 0.059             | 0.033              |
| SBW-vectors-300-min5           | 0.325 | 0.239  | 0.425    | 0.377    | 0.32  | 0.262 | 0.128     | 0.280 | 0.332  | 0.053             | 0.027              |
| LexVec which="commoncrawl-W+C" | 0.639 | 0.431  | 0.725    | 0.644    | 0.809 | 0.712 | 0.419     | 0.647 | 0.756  |                   |                    |
| GloVe dim=300 corpus=wiki-6B   | 0.622 | 0.451  | 0.750    | 0.578    | 0.737 | 0.633 | 0.371     | 0.522 | 0.653  | 0.261             | 0.243              |

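As the table shows, the combined model beats the PubMed-only model on most benchmarks (AP, Battig, ESSLI_2c, MTurk, SimLex999, WS353 and WS353S), ties on ESSLI_2b, is essentially unchanged on the two UMNSRS medical sets, and is slightly worse on MEN. So it seems reasonable to train models on a mix of the corpora we expect to see in our application.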
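
The comparison can be checked row by row with a few lines of Python. The scores below are copied from the first two rows of the table; `compare` is a helper name introduced here:

```python
BENCHMARKS = ['AP', 'Battig', 'ESSLI_2b', 'ESSLI_2c', 'MEN', 'MTurk', 'SimLex999',
              'WS353', 'WS353S', 'UMNSRS_similarity', 'UMNSRS_relatedness']
PUBMED = [0.554, 0.308, 0.725, 0.488, 0.545, 0.446, 0.261, 0.416, 0.475, 0.541, 0.4864]
COMBINED = [0.567, 0.396, 0.725, 0.511, 0.517, 0.572, 0.278, 0.484, 0.526, 0.543, 0.4860]


def compare(names, before, after):
    """Split benchmark names into better, same and worse after the change."""
    better = [n for n, b, a in zip(names, before, after) if a > b]
    same = [n for n, b, a in zip(names, before, after) if a == b]
    worse = [n for n, b, a in zip(names, before, after) if a < b]
    return better, same, worse


better, same, worse = compare(BENCHMARKS, PUBMED, COMBINED)
print('better:', better)
print('same:  ', same)
print('worse: ', worse)
```

Some of the differences are very small (for example 0.4864 against 0.4860 on UMNSRS_relatedness), so they should not be over-interpreted.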

## D3: Explore whether the combination with random-walks over knowledge-bases improves the results.

## Extra - Visualization of Embeddings using a dimensionality reduction algorithm.

In [1]:
from gensim.models.keyedvectors import KeyedVectors

In [1]:
combined_model = KeyedVectors.load_word2vec_format('models/wikipedia-pubmed-and-PMC-w2v.bin', binary=True)

In [2]:
# Limit the embeddings to 2500 vectors for t-SNE, then load the smaller model
combined_model.wv.save_word2vec_format('models/smaller_combined_model.w2v', total_vec=2500)

In [2]:
combined_model = KeyedVectors.load_word2vec_format('models/smaller_combined_model.w2v')

In [3]:
import bokeh.plotting as bp
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook

output_notebook()
plot_tfidf = bp.figure(plot_width=700, plot_height=600, title="A map of word vectors",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

In [4]:
from sklearn.manifold import TSNE
word_vectors = [combined_model[w] for w in combined_model.wv.vocab.keys()]
tsne_model = TSNE(n_components=2, init='pca', verbose=1, random_state=0)
tsne_w2v = tsne_model.fit_transform(word_vectors)

[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 2500
[t-SNE] Computed conditional probabilities for sample 2000 / 2500
[t-SNE] Computed conditional probabilities for sample 2500 / 2500
[t-SNE] Mean sigma: 0.888116
[t-SNE] KL divergence after 100 iterations with early exaggeration: 1.315982
[t-SNE] Error after 300 iterations: 1.315982


In [5]:
tsne_df = pd.DataFrame(tsne_w2v, columns=['x', 'y'])
tsne_df['words'] = list(combined_model.wv.vocab.keys())

In [6]:
plot_tfidf.scatter(x='x', y='y', source=tsne_df)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"word": "@words"}
show(plot_tfidf)