## LDA visualization of gensim citations

### A visualization of citations of "Software Framework for Topic Modelling with Large Corpora" represented as LDA probability distribution towards the selected number of topics.

### Note the configurable parameters to play with:

#### Preprocessing:

* Text filter: Whether to filter out some part of texts, or not
* Filter parameters (for filtering of HTML content) based on: 
    * minimal length of a valid text sentence, 
    * minimal length of a valid text line
* Text preprocess method - currently gensim's preprocess_text() - works the best only for English

#### LDA:

* Number of topics
* Number of passes over the input corpus
* Many others, so far left on defaults: https://radimrehurek.com/gensim/models/ldamodel.html

#### Tf-Idf:

* Corpus representation: 1. BoW counts of words, or 2. Tf-Idf weights
* Representative term set filtering: to consider only a given percentile of top-important terms according to tf-idf weights for each doc

#### Visualization:

* Docs representation: currently over a distance matrix
    * Distance method for distance matrix: correlation, cosine, euclidean, ... choose from https://docs.scipy.org/doc/scipy/reference/spatial.distance.html
* Dimensionality reduction: currently MDS from distance matrix

#### Topic representation in visualization:

* Number of most important words for topic
* For each word from a set of each topic, maximum number of occurrences of this word in other clusters representations
    * Can clarify the meaning of the topic in contrast to other inferred topics


## Tool functions

In [1]:
import os
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string
from gensim.models import TfidfModel

import numpy as np

### Data load:

1. using filter for short sentences (to get rid of HTML tags content) + gensim preprocess_text()
2. plaintext + gensim preprocess_text()


In [2]:
html_content_dir = "data/fulltexts_html"
pdf_content_dir = "data/fulltexts_pdf"

In [3]:
def filter_sentences_from_text(text_lines, min_line_len=20, min_sen_len=15):
    fulltext = " ".join([line for line in list(text_lines) if len([l for l in line.split(" ") 
                                                                   if len(l) > 0]) > min_line_len])
    sens = filter(lambda sen: len(sen) >= min_sen_len, fulltext.split("."))
    return ". ".join(sens)

In [4]:
import os


def get_texts_from_dir(texts_dir, filter_sen=False):
    txt_files = os.listdir(texts_dir)
    txt_files = [os.path.join(texts_dir, txt) for txt in txt_files]
    texts = dict()
    for txt_f in list(filter(lambda path: path.endswith(".txt"), txt_files)):
        try:
            if filter_sen:
                # custom filtering based on sentences length:
                text = filter_sentences_from_text(open(txt_f, "r").readlines())
            else:
                # no filtering:
                text = open(txt_f, "r").read()
            text = open(txt_f, "r").read()
            texts[os.path.basename(txt_f)] = preprocess_string(text)
        except UnicodeDecodeError:
            print("Utf-8 decode error on %s" % txt_f)
            continue
    return texts


In [5]:
def read_texts_no_preproc(texts_dir):
    txt_files = os.listdir(texts_dir)
    txt_files = [os.path.join(texts_dir, txt) for txt in txt_files]
    for txt_f in list(filter(lambda path: path.endswith(".txt"), txt_files)):
        yield open(txt_f, "r").readlines()

### 1: using filter for short sentences

In [6]:
#: comparison of a text before and after filtering:
texts = read_texts_no_preproc(html_content_dir)
text = list(texts)[42]
text[:20]

 '    • NCBI\n',
 '    • Skip to main content\n',
 '    • Skip to navigation\n',
 '    • Resources\n',
 '    • How To\n',
 '    • About NCBI Accesskeys\n',
 '\n',
 'PMC\n',
 'US National Library of Medicine \n',
 'National Institutes of Health \n',
 'Search database\n',
 'Search term\n',
 '\n',
 'Search\n',
 '    • Advanced \n',
 '    • Journal list \n',
 '    • Help \n',
 '    • Journal List\n',
 '    • HHS Author Manuscripts\n']

In [40]:
# same one using filtering:
text_f = filter_sentences_from_text(text)
text_f.split("\n")[:10]

['*Address correspondence to: Harriet de Wit, Department of Psychiatry and Behavioral Neuroscience, MC 3077, University of Chicago, 5841 S. , Chicago, IL, 60637 USA, ude. ogacihcu@wedh',
 ' ±3,4-methylenedioxymethamphetamine (MDMA) is widely believed to increase sociability.  The drug alters speech production and fluency, and may influence speech content.  Here, we investigated the effect of MDMA on speech content, which may reveal how this drug affects social interactions. ',
 ' 35 healthy volunteers with prior MDMA experience completed this two-session, within-subjects, double-blind study during which they received 1. 5 mg/kg oral MDMA and placebo.  Participants completed a 5-min standardized talking task during which they discussed a close personal relationship (e. , a friend or family member) with a research assistant.  The conversations were analyzed for selected content categories (e. , words pertaining to affect, social interaction, and cognition), using both a standard dictiona

In [8]:
htmls_preproc = get_texts_from_dir(html_content_dir, filter_sen=True)
list(htmls_preproc.keys())[-10:]

['978-3-319-69835-9_29.txt',
 'IzpisGradiva.txt',
 '1709.txt',
 'S0957417418302938.txt',
 '1504.txt',
 '7840632.txt',
 '8417270.txt',
 '69963.txt',
 '978-3-319-91947-8_4.txt',
 '8025903.txt']

### 2: without text filter

In [9]:
pdfs_preproc = get_texts_from_dir(pdf_content_dir, filter_sen=False)
list(pdfs_preproc.keys())[-10:]

['1-s2.0-S0363811116300212-main.txt',
 '1-s2.0-S002002551830094X-main.txt',
 'Bhuiyan and Al Hasan - 2016 - Waiting to be sold Prediction of time-dependent h.txt',
 'Do not blame it on the algorithm an empirical assessment of multiple recommender systems and their impact on content diversity.txt',
 'Jebbara and Cimiano - 2017 - Aspect-Based Relational Sentiment Analysis Using a.txt',
 'ÁLVARO - 2016 - Analysis of the Formality of Text and its Impact o.txt',
 'Khandpur et al. - 2017 - Crowdsourcing cybersecurity Cyber attack detectio.txt',
 'Chardin et al. - 2013 - Query rewriting for rule mining in databases.txt',
 "Rahman and Finin - 2017 - Deep Understanding of a Document's Structure.txt",
 'Yang and Hsu - 2016 - Hdpauthor A new hybrid author-topic model using l.txt']

In [10]:
# merge with htmls
merged_texts_preproc = {**htmls_preproc, **pdfs_preproc}

In [12]:
texts = [t for t in merged_texts_preproc.values() if len(t) > 0]
texts_links = merged_texts_preproc.keys()
len(texts)

1259

### tf-idf integration

Two approaches:

1. use idf weights instead of BoW frequencies
2. filter out given percentile of least important words from Docs' representation
3. combined

In [13]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = TfidfModel(corpus)

### 1. use idf weights for top-given percentile of terms

In [14]:
def terms_for_doc(doc_id):
    return tfidf[corpus[doc_id]]

In [15]:
def top_terms_idf_for_doc(doc_id, percentile):
    doc_terms_ordered = sorted(terms_for_doc(doc_id), key=lambda term: term[1],  reverse=True)
    return [term[0] for term in doc_terms_ordered[:int(len(doc_terms_ordered)*percentile)]]

In [16]:
def name_terms_in_tuples(doc_corpus):
    return [(dictionary.get(tup[0]), tup[1]) for tup in doc_corpus]

In [17]:
def top_terms_idf_for_doc(doc_id, percentile):
    doc_terms_ordered = sorted(terms_for_doc(doc_id), key=lambda term: term[1], reverse=True)
    return [(doc_terms_ordered[i][0], doc_terms_ordered[i][1]) 
            for i in range(int(len(doc_terms_ordered)*percentile))]

In [18]:
def term_tfidf_for_doc(doc_idx):
    return [(tfidf_term_tuple[0], tfidf_term_tuple[1]) 
            for tfidf_term_tuple in tfidf_corpus[doc_idx]]

In [19]:
tfidf_corpus1 = [top_terms_idf_for_doc(doc_i, 0.18) for doc_i in range(len(texts))]

In [20]:
tfidf_corpus1[0][:10]

[(96, 0.5256214468617546),
 (61, 0.47458063764105685),
 (62, 0.4447236782404861),
 (40, 0.42353982842035914),
 (101, 0.11551086229646161),
 (133, 0.09556753579304629),
 (80, 0.09465985755388896),
 (68, 0.08628738866201034),
 (39, 0.08038232292672068),
 (83, 0.06928599370798487)]

### 2. use term frequencies for top-given percentile of terms

In [21]:
def top_terms_freqs_for_doc(doc_id, percentile):
    doc_terms_idf = dict(terms_for_doc(doc_id))
    doc_terms_count = corpus[doc_id]
    doc_terms_count_ordered = sorted(doc_terms_count, 
                                     key=lambda term_count: doc_terms_idf[term_count[0]], reverse=True)
    return [(doc_terms_count_ordered[i][0], doc_terms_count_ordered[i][1])
            for i in range(int(len(doc_terms_count_ordered)*percentile))]

In [22]:
tfidf_corpus2 = [top_terms_freqs_for_doc(doc_i, 0.1) for doc_i in range(len(texts))]
tfidf_corpus2[0][:10]

[(96, 11),
 (61, 11),
 (62, 11),
 (40, 11),
 (101, 3),
 (133, 2),
 (80, 7),
 (68, 2),
 (39, 3),
 (83, 7)]

## LDA computation

Train LDA on a given type of corpus

In [23]:
def lda_from_texts(given_corpus, num_topics, passes):
    lda = LdaModel(given_corpus, num_topics=num_topics, alpha='auto', eval_every=5, passes=passes)
    return lda

In [24]:
def topic_distro_for_text(text):
    return lda.get_document_topics(dictionary.doc2bow(text), minimum_probability=0)

In [25]:
def terms_for_topic(topic_id, top_terms=10):
    topic_top_terms = lda.get_topic_terms(topic_id, topn=top_terms)
    return [dictionary.get(term[0]) for term in topic_top_terms]

In [26]:
lda_num_topics = 30
num_passes = 10

# TODO: choose between corpus, tfidf_corpus1, tfidf_corpus2:
lda = lda_from_texts(tfidf_corpus2, num_topics=lda_num_topics, passes=num_passes)

## Documents representation

Each doc is represented as a probability distribution towards LDA topics 

In [27]:
topic_distros = np.array([topic_distro_for_text(text) for text in texts[:-1]])
topic_distros = topic_distros[:, :, 1]

In [28]:
topic_distros.shape

(1258, 30)

# Visualization projections: approaches

## 1. Get a projection of docs so that the docs are close to their major topic

Doc distance to a topic must be proportional to it's probability of belonging to it

Does not consider relative distance of the documents, which seems to me as having no interpretation value - if it does, it is again equal to the second approach

... thus is not implemented

Yet, other projection methods surely deserve a consideration



## 2. Get a projection of topic according to their relative similarity

Relative similarity is a correlation of documents' belonging to it

Topics centers are documents with one-hot distribution of probabilities.

In [29]:
base_topic_docs_distros = np.identity(lda_num_topics)

In [30]:
topic_distros = np.append(topic_distros, base_topic_docs_distros) \
                  .reshape((len(topic_distros)+lda_num_topics, lda_num_topics))

In [31]:
topic_distros.shape

(1288, 30)

In [32]:
# distance matrix by selected metric
from scipy.spatial.distance import pdist, squareform

dists = squareform(pdist(topic_distros, metric="correlation"))
dists

array([[0.        , 0.887894  , 0.2849529 , ..., 1.08339084, 0.92996449,
        1.08346365],
       [0.887894  , 0.        , 0.76575193, ..., 1.0739843 , 0.54060026,
        1.02831219],
       [0.2849529 , 0.76575193, 0.        , ..., 1.07090122, 0.87243759,
        1.0711996 ],
       ...,
       [1.08339084, 1.0739843 , 1.07090122, ..., 0.        , 1.03448276,
        1.03448276],
       [0.92996449, 0.54060026, 0.87243759, ..., 1.03448276, 0.        ,
        1.03448276],
       [1.08346365, 1.02831219, 1.0711996 , ..., 1.03448276, 1.03448276,
        0.        ]])

In [33]:
dists.shape

(1288, 1288)

In [34]:
# projection to 2D using MDS

from sklearn import manifold

adist = dists

amax = np.amax(adist)
adist /= amax

mds = manifold.MDS(n_components=2, dissimilarity="precomputed")
results = mds.fit(adist)

coords = results.embedding_
results

MDS(dissimilarity='precomputed', eps=0.001, max_iter=300, metric=True,
  n_components=2, n_init=4, n_jobs=1, random_state=None, verbose=0)

In [35]:
# matplotlib visualization:

# import matplotlib.pyplot as plt
# from matplotlib.pyplot import figure


# plt.subplots_adjust(bottom = 0.1)
# plt.scatter(
#     coords[:, 0], coords[:, 1], marker = 'o'
#     )
# for label, x, y in zip([""]*len(coords), coords[:, 0], coords[:, 1]):
#     plt.annotate(
#         label,
#         xy = (x, y), xytext = (-20, 20),
#         textcoords = 'offset points', ha = 'right', va = 'bottom',
#         bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
#         arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
    
# # figure(num=None, figsize=(10, 10), dpi=80, facecolor='w', edgecolor='k')
# plt.figsize = (100, 100)
# plt.grid()
# plt.show()

In [36]:
# Selective labeling of topics by only the unique words among topics

word_occurrence_bound = int(lda.num_topics / 10)
top_terms_per_topic = 20

from functools import reduce

all_topics_w = [terms_for_topic(i, top_terms=top_terms_per_topic) for i in range(lda_num_topics)]
all_words = reduce(lambda x, y: set(x) | set(y), all_topics_w)
intersect_words = list(filter(lambda w: sum([w in t_words for t_words in all_topics_w]) > word_occurrence_bound, 
                              all_words))
unique_topics_w = [[w for w in t_words if w not in intersect_words] for t_words in all_topics_w]

In [37]:
import plotly
plotly.tools.set_credentials_file(username='stmichal', api_key='OXox9Rf8jzEHqUsNPqwn')

In [38]:
import plotly.plotly as py
import plotly.graph_objs as go

docs_len = len(texts)

# Documents trace
trace_docs = go.Scatter(
    x = coords[:docs_len, 0],
    y = coords[:docs_len, 1],
    mode = 'markers',
    marker = dict(color = 'rgba(0, 0, 255, .5)', size = 5),
    text = list(texts_links)
)
# Bases (topics documents) trace
trace_bases = go.Scatter(
    x = coords[docs_len:, 0],
    y = coords[docs_len:, 1],
    mode = 'markers',
    marker = dict(color = 'rgba(255, 0, 122, .2)', size = 60),
    text = ["T %s: %s" % (i, unique_topics_w[i]) for i in range(lda_num_topics)]
)

data = [trace_docs, trace_bases]

# label = 'MDS over LDA %s topics. tfidf for top 0.5 terms as frequencies - TODO: check' % lda.num_topics

label = 'TODO: fill appropriately: MDS over LDA X topics. tfidf:counts for top X terms as frequencies.'


# Plot and embed in ipython notebook!
layout = dict(title=label,  
              font=dict(size=12),
              showlegend=True,
              width=1000,
              height=1000,
              margin=dict(l=40, r=40, b=85, t=100),
              hovermode='closest',
              plot_bgcolor='rgb(256,256,256)'          
              )
py.iplot(dict(data=data, layout=layout), filename=label)