# Text Similarity (LDA)

Running the `abstract_mining.py` script from the repo should produce a structured CSV containing the PubMed ID and the abstract for the specified number of abstracts.

With this CSV, we perform some standard cleaning and tokenizing operations and train an LDA model.

Then we compute the Jensen-Shannon distance to find the articles most similar to an input text.

## Data Loading

We load the data and remove duplicate abstracts.
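
As a small illustration of what `duplicated(..., keep='first')` does, the sketch below uses a made-up three-row frame. The IDs, text and topic labels are hypothetical and not taken from `merged.csv`.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 share the same abstract text
toy = pd.DataFrame({
    "pubmed_id": [1, 2, 3],
    "abstract": ["same text", "other text", "same text"],
    "topic": ["A", "A", "B"],
})

# True for every repeat of an abstract after its first occurrence
is_repeat = toy.duplicated(subset=["abstract"], keep="first")

# Keep the first occurrence only and renumber the index
toy_unique = toy[~is_repeat].reset_index(drop=True)
print(toy_unique["pubmed_id"].tolist())
```

One consequence worth keeping in mind: an abstract filed under two topics keeps only the topic of its first occurrence.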

In [1]:
# Load basic libraries
import pandas as pd
import numpy as np

# Load raw dataset
raw = pd.read_csv('data_mining/merged.csv')

# Keep only the first occurrence of each abstract
mask_duplicate = raw.duplicated(subset=["abstract"], keep='first')
df = raw[~mask_duplicate].reset_index(drop=True)

# Optionally, balance the number of abstracts per topic
def equilibrate_topics(df, max_per_topic):
    # Keep at most max_per_topic rows per topic, preserving the original row order
    equilibrated = df.groupby("topic", sort=False).head(max_per_topic)
    return equilibrated.reset_index(drop=True)
equilibrated = equilibrate_topics(df, 1000)

In [2]:
# Let's check the result
df

Unnamed: 0,pubmed_id,abstract,topic
0,7544188,"""Rebound"" phenomenon of hepatitis C viremia af...",Gene therapy
1,7543681,Probing the transmembrane topology of cyclic n...,Gene therapy
2,7543632,Autolymphocyte therapy. III. Effective adjuvan...,Gene therapy
3,7543577,"Syntheses, calcium channel agonist-antagonist ...",Gene therapy
4,7543499,Unexpected dystonia while changing from clozap...,Gene therapy
...,...,...,...
29276,7107327,The cellular specificity of lectin binding in ...,Post-translational modification
29277,7107225,Effect of tauroursodeoxycholic acid on patient...,Post-translational modification
29278,7104049,The effect of various modifiers on rat microso...,Post-translational modification
29279,7104039,Inhibition of methemoglobin and metmyoglobin r...,Post-translational modification


Check the number of abstracts per topic in each dataframe.

In [3]:
def get_abstracts_per_topic(df):
    for topic in df["topic"].unique():
        print(df[df["topic"]==topic].shape[0], " unique abstracts for: ", topic)

print("df size: ", df.shape)
get_abstracts_per_topic(df)
    
print("\nequilibrated size: ", equilibrated.shape)
get_abstracts_per_topic(equilibrated)

df size:  (29281, 3)
4994  unique abstracts for:  Gene therapy
4303  unique abstracts for:  Immunology
4341  unique abstracts for:  Genome engineering
3537  unique abstracts for:  Regulatory element
2770  unique abstracts for:  Sequence
2669  unique abstracts for:  Transfection
3866  unique abstracts for:  Epigenetics
2801  unique abstracts for:  Post-translational modification

equilibrated size:  (8000, 3)
1000  unique abstracts for:  Gene therapy
1000  unique abstracts for:  Immunology
1000  unique abstracts for:  Genome engineering
1000  unique abstracts for:  Regulatory element
1000  unique abstracts for:  Sequence
1000  unique abstracts for:  Transfection
1000  unique abstracts for:  Epigenetics
1000  unique abstracts for:  Post-translational modification


## Text Cleaning

We keep the full dataset of 29k+ abstracts. Since its size is moderate, we can add two columns to the dataframe: "clean", holding the cleaned string, and "tokens", holding the list of tokens.

We define our cleaning function with the following steps:
1. Lower-case transformation
2. Special character removal (newlines and digits)
3. Punctuation removal
4. Tokenizing
5. Stop word removal
6. Lemmatization
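
To make the effect of the first steps concrete, here is a dependency-free sketch. It is not the notebook's function: the stop-word set, the sample sentence and the `keep_hyphen_split` switch are hypothetical, tokenization is a plain whitespace split, and lemmatization is left out. It mainly shows that deleting punctuation outright merges hyphenated words, which is why tokens such as `agonistantagonist` appear in the cleaned output further down.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK and stop_words)
TOY_STOP_WORDS = {"the", "of", "and", "in", "a"}

def toy_clean(text, keep_hyphen_split=False):
    """Steps 1 to 5 only; lemmatization is omitted to avoid the NLTK dependency."""
    text = text.lower()                      # 1. lower case
    text = re.sub(r"\d+", "", text)          # 2. drop digits
    if keep_hyphen_split:
        # replace punctuation by spaces so "a-b" becomes two tokens
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    else:
        # delete punctuation outright, so "a-b" collapses into "ab"
        table = str.maketrans("", "", string.punctuation)
    text = text.translate(table)             # 3. punctuation
    tokens = text.split()                    # 4. whitespace tokenization
    # 5. drop stop words and single-letter tokens
    return [t for t in tokens if t not in TOY_STOP_WORDS and len(t) > 1]

sample = "Syntheses of 2 calcium channel agonist-antagonist modulators."
print(toy_clean(sample))
print(toy_clean(sample, keep_hyphen_split=True))
```

Whether merged or split tokens are preferable depends on the vocabulary you want the topic model to see; the notebook's function deletes punctuation.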

In [4]:
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import stop_words
import unidecode
import string
import re

def preprocessing(texte, return_str=False):
    tex = []
   
    # lower case and transliterate to ASCII
    texte = unidecode.unidecode(texte.lower())
   
    # remove special characters
    texte = re.sub(r'\n', ' ', texte)
    texte = re.sub(r'\d+', '', texte)
    
    # remove punctuation
    texte = texte.translate(str.maketrans('', '', string.punctuation))
    
    # remove whitespaces
    texte = texte.strip()
        
    # tokenization
    tokens = word_tokenize(texte)
        
    # define stop words (union of the stop_words and NLTK English lists)
    sw = set(stop_words.get_stop_words('en')) | set(stopwords.words('english'))
    
    # remove stop words and filter out single-letter tokens
    tokens = [i for i in tokens if i not in sw and len(i) > 1]
    
    # lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    
    if return_str:
        return (" ").join(tokens)
    
    return tokens

We apply our cleaning function. This step may take a few minutes.

In [5]:
# Tokenize once, then derive the cleaned string from the tokens
tokens = df["abstract"].apply(preprocessing)
df["clean"] = tokens.str.join(" ")
df["tokens"] = tokens

In [6]:
# Optionally save to CSV
# df.to_csv("clean.csv", index=False)

In [7]:
# Let's check the result
df

Unnamed: 0,pubmed_id,abstract,topic,clean,tokens
0,7544188,"""Rebound"" phenomenon of hepatitis C viremia af...",Gene therapy,rebound phenomenon hepatitis viremia interfero...,"[rebound, phenomenon, hepatitis, viremia, inte..."
1,7543681,Probing the transmembrane topology of cyclic n...,Gene therapy,probing transmembrane topology cyclic nucleoti...,"[probing, transmembrane, topology, cyclic, nuc..."
2,7543632,Autolymphocyte therapy. III. Effective adjuvan...,Gene therapy,autolymphocyte therapy iii effective adjuvant ...,"[autolymphocyte, therapy, iii, effective, adju..."
3,7543577,"Syntheses, calcium channel agonist-antagonist ...",Gene therapy,synthesis calcium channel agonistantagonist mo...,"[synthesis, calcium, channel, agonistantagonis..."
4,7543499,Unexpected dystonia while changing from clozap...,Gene therapy,unexpected dystonia changing clozapine risperi...,"[unexpected, dystonia, changing, clozapine, ri..."
...,...,...,...,...,...
29276,7107327,The cellular specificity of lectin binding in ...,Post-translational modification,cellular specificity lectin binding kidney lig...,"[cellular, specificity, lectin, binding, kidne..."
29277,7107225,Effect of tauroursodeoxycholic acid on patient...,Post-translational modification,effect tauroursodeoxycholic acid patient ileal...,"[effect, tauroursodeoxycholic, acid, patient, ..."
29278,7104049,The effect of various modifiers on rat microso...,Post-translational modification,effect various modifier rat microsomal peroxid...,"[effect, various, modifier, rat, microsomal, p..."
29279,7104039,Inhibition of methemoglobin and metmyoglobin r...,Post-translational modification,inhibition methemoglobin metmyoglobin reductio...,"[inhibition, methemoglobin, metmyoglobin, redu..."


Keep a small fraction of the dataset for testing.
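
The split in the next cell is drawn without a seed, so the train and test sizes change from run to run. A minimal sketch of a repeatable variant, assuming NumPy's `default_rng`; the helper name `split_mask` and the seed value are hypothetical:

```python
import numpy as np

def split_mask(n_rows, train_fraction=0.99, seed=0):
    """Return a boolean mask that is True for training rows."""
    rng = np.random.default_rng(seed)  # seeded generator, so the split is repeatable
    return rng.random(n_rows) < train_fraction

mask_a = split_mask(1000)
mask_b = split_mask(1000)

# Same seed, same mask; the train/test counts still only approximate 99% / 1%
print(mask_a.sum(), (~mask_a).sum())
print(bool((mask_a == mask_b).all()))
```

Such a mask could then be applied exactly like `mask` below, with `df[mask]` for training and `df[~mask]` for testing.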

In [8]:
# create a boolean mask (True for about 99% of the rows)
mask = np.random.rand(len(df)) < 0.99
# Apply mask
train_df = df[mask].reset_index(drop=True)
test_df = df[~mask].reset_index(drop=True)

print("Dataframe sizes:")
print("df:", len(df), 
      "\ntrain_df:", len(train_df),
      "\ntest_df:", len(test_df))

Dataframe sizes:
df: 29281 
train_df: 28980 
test_df: 301


## LDA

In [9]:
from time import time
from gensim import models, corpora, similarities
from gensim.corpora.mmcorpus import MmCorpus

In [10]:
def train_lda(data, num_topics = 8, chunksize = 300):
    """
    This function trains the LDA model.
    We set parameters such as the number of topics and the chunk size used by Hoffman's online method.
    We also make 2 passes over the data: since this is a small dataset, we want the distributions to stabilize.
    """
    tokens = data["tokens"]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(doc) for doc in tokens]
    t1 = time()
    # low alpha: each document is only represented by a small number of topics, and vice versa
    # low eta: each topic is only represented by a small number of words, and vice versa
    lda = models.LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                   alpha=1e-3, eta=0.5, chunksize=chunksize, minimum_probability=0.0, passes=2)
    t2 = time()
    print("Time to train LDA model on", len(data), "articles:", (t2-t1)/60, "min")
    return dictionary, corpus, lda
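
The comment about `alpha` in the function above can be illustrated without gensim. The sketch below samples document-topic vectors from a symmetric Dirichlet prior with NumPy. The two concentration values are illustrative assumptions, not the ones passed to `LdaModel`; it only shows the qualitative effect of a low versus a high concentration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics = 8

# Illustrative concentration values only; they are not the ones passed to LdaModel
low_alpha = rng.dirichlet([0.05] * num_topics, size=2000)
high_alpha = rng.dirichlet([5.0] * num_topics, size=2000)

# Average weight of the single most probable topic per simulated document
print("low alpha :", low_alpha.max(axis=1).mean())
print("high alpha:", high_alpha.max(axis=1).mean())
```

With a low concentration most of the mass sits on one topic per document, while a high concentration spreads it across topics.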

In [11]:
dictionary,corpus,lda = train_lda(train_df, num_topics = 8, chunksize = 500)

Time to train LDA model on 28980 articles: 2.3416532397270204 min


Let's check the corpus and dictionary sizes, then optionally save the model, corpus and dictionary.

In [12]:
print("corpus size:", len(corpus))
print("dictionary size:", len(dictionary))

corpus size: 28980
dictionary size: 78133


In [13]:
# Uncomment to save:
# lda.save('model/lda.model')
# MmCorpus.save_corpus("model/corpus.mm", corpus)
# dictionary.save_as_text("model/dictionary.txt")

## Test model with Text Similarity 

First, we define our helper functions to compute the Jensen-Shannon distance.
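
The Jensen-Shannon distance between two distributions is the square root of the average of their two Kullback-Leibler divergences to the midpoint distribution. As a minimal sketch on two made-up topic distributions, the helper below (the name `js_distance` is hypothetical) follows that definition and is compared against SciPy's `jensenshannon`:

```python
import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon

def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                                   # midpoint (mixture) distribution
    divergence = 0.5 * (entropy(p, m) + entropy(q, m))  # mean KL divergence to the mixture
    return np.sqrt(divergence)

# Two made-up topic distributions over three topics
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]

print(js_distance(p, p))    # identical distributions
print(js_distance(p, q))
print(jensenshannon(p, q))  # SciPy's built-in, for comparison
```

With the natural logarithm the distance lies between 0 (identical distributions) and sqrt(ln 2), roughly 0.83.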

In [14]:
# Needed libraries
from scipy.stats import entropy
from indra.literature.pubmed_client import get_metadata_for_ids

# Functions
def jensen_shannon(query, matrix):
    """
    This function computes the Jensen-Shannon distance
    between the input query (an LDA topic distribution for a document)
    and every topic distribution in the corpus.
    It returns an array of length M, where M is the number of documents in the corpus
    """
    # p is the query and q the corpus; both are transposed so that each column is one distribution
    p = query[None,:].T # take transpose
    print(p.shape)
    q = matrix.T # transpose matrix
    m = 0.5*(p + q)
    print(m.shape)
    pp = np.repeat(query[None,:].T, repeats=matrix.shape[0], axis=1)
    return np.sqrt(0.5*(entropy(pp,m) + entropy(q,m)))

def get_most_similar_documents(query,matrix,k=10):
    """
    This function uses the Jensen-Shannon distance above
    and returns the indices of the k smallest distances
    """
    sims = jensen_shannon(query,matrix) # array of Jensen-Shannon distances
    return sims.argsort()[:k] # positional indices of the k smallest Jensen-Shannon distances

We take a few texts from the test dataframe.

In [15]:
# print some examples from test_df
number = 3
test_strings = []
# sample without replacement so the same abstract is not picked twice
rand_idx = np.random.choice(len(test_df), size=number, replace=False)
for n in rand_idx:
    abstract = test_df.loc[n, "abstract"]
    print("\nTopic:", test_df.loc[n, "topic"])
    test_strings.append(abstract)
    print(abstract, "...")


Topic: Gene therapy
[Morphological and experimental studies of the autonomic ganglia of the head].  ...

Topic: Post-translational modification
[Pathoanatomical and pathohistological studies of cyclophosphamide-induced organ changes in calves with special reference to lymphatic organs].  ...

Topic: Regulatory element
Stamps commemorating medicine. "The seeing eye dog".  ...


To compute the JS distance, we first need the following for the test text and the corpus:
1. A bag of words for the new text
2. The topic distribution of the new document
3. The topic distributions of all documents in the training corpus
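
gensim reports a document's topics as a list of `(topic_id, probability)` pairs. Reading only the second element of each pair, as the next cell does, relies on every topic being returned in order, which is what training with `minimum_probability=0.0` is meant to ensure. If some topics were filtered out, the positions would shift. A minimal sketch of a more defensive conversion, where the helper name `to_dense` and the sample pairs are hypothetical:

```python
import numpy as np

def to_dense(topic_pairs, num_topics):
    """Turn [(topic_id, probability), ...] into a fixed-length vector."""
    dense = np.zeros(num_topics)
    for topic_id, prob in topic_pairs:
        dense[topic_id] = prob  # place each probability at its own topic index
    return dense

# Hypothetical output for one document where only 3 of 8 topics were reported
pairs = [(0, 0.6), (3, 0.3), (7, 0.1)]
vec = to_dense(pairs, num_topics=8)
print(vec)
```

Topics missing from the list stay at zero, so every document vector has the same length and ordering.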

In [16]:
# Let's run the text similarity on the 1st example from above
string_to_test = test_strings[0]

# New bag of words
new_bow = dictionary.doc2bow(preprocessing(string_to_test, return_str=False))
print("Length of new bow:", len(new_bow))

# New document distribution
new_doc_distribution = np.array([tup[1] for tup in lda.get_document_topics(bow=new_bow)])
print("Size of new document distribution:", new_doc_distribution.shape)

# Topic distributions of the training documents
doc_topic_dist = np.array([[tup[1] for tup in lst] for lst in lda[corpus]])
print("Size of document-topic matrix:", doc_topic_dist.shape)

Length of new bow: 6
Size of new document distribution: (8,)
Size of document-topic matrix: (28980, 8)


We can now compute the JS distances and get the most similar articles from our training dataset.
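
Before running this on the real matrix, the same computation can be checked on a toy corpus. The sketch below is a self-contained variant of the query-versus-corpus distance followed by the `argsort` ranking. The function name `js_to_corpus` and all numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import entropy

def js_to_corpus(query, matrix):
    """JS distance from one distribution (K,) to each row of matrix (M, K)."""
    p = np.tile(query, (matrix.shape[0], 1)).T  # (K, M): query repeated per document
    q = matrix.T                                # (K, M): one column per document
    m = 0.5 * (p + q)
    # entropy() reduces along axis 0, giving one KL value per column
    return np.sqrt(0.5 * (entropy(p, m) + entropy(q, m)))

# Made-up corpus of four documents over three topics
toy_matrix = np.array([
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
    [0.70, 0.20, 0.10],
])
toy_query = np.array([0.75, 0.15, 0.10])

distances = js_to_corpus(toy_query, toy_matrix)
top2 = distances.argsort()[:2]  # indices of the two closest documents
print(top2)
```

The two rows dominated by the first topic come out closest to the query, which is the behaviour the ranking relies on.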

In [17]:
# Compute JS distance and get most similar IDs
most_sim_ids = get_most_similar_documents(new_doc_distribution,doc_topic_dist)

# Get the list of corresponding most similar pubmed IDs
most_sim_pm_ids = list(train_df.iloc[list(most_sim_ids),0].values)
print("Pubmed's most similar IDs", most_sim_pm_ids, "\n")

# Retrieve metadata by using pubmed API (indra module)
most_sim_metadata = get_metadata_for_ids(most_sim_pm_ids)

# Print the retrieved data for comparison
for top, val in zip(most_sim_pm_ids, most_sim_metadata.values()):
    print(top, "\n", val["title"], "\n")

(8, 1)
(8, 28980)
Pubmed's most similar IDs [7477075, 7429050, 7307327, 7443036, 7241856, 7385144, 7460314, 7408280, 7284106, 7524781] 

7477075 
 Dehydroepiandrosterone sulphate (DHEAS) concentrations and amyotrophic lateral sclerosis. 

7429050 
 C.H.A. brief to Hall Review. 

7307327 
 Comparative effectiveness of topically applied non-steroid anti-inflammatory agents on guinea-pig skin. 

7443036 
 Role of catecholaminergic mechanisms of the brain in the fixation of temporary links. 

7241856 
 [Tubulo-glomerular feedback in the denervated rat kidney (author's transl)]. 

7385144 
 [Haloperidol-induced akathisia in the state of violence]. 

7460314 
 [Sexual dimorphism and synaptic plasticity in the neuroendocrine hypothalamus (author's transl)]. 

7408280 
 Altered taste thresholds in gastro-intestinal cancer. 

7284106 
 Vagal effects on heart rate in rabbit, A preliminary report. 

7524781 
 Degeneration activity: a transient effect following sympathectomy for hyperhidrosis. 

