## LDA visualization of gensim citations

### A visualization of the papers citing "Software Framework for Topic Modelling with Large Corpora", each represented as an LDA probability distribution over the selected number of topics.

### Note the configurable parameters to play with:

#### Preprocessing:

* Text filter: whether to filter out parts of the texts
* Filter parameters (for filtering HTML content):
    * minimal length of a valid text sentence (in characters),
    * minimal length of a valid text line (in words)
* Text preprocessing method: currently gensim's preprocess_string(), which works best for English

#### LDA:

* Number of topics
* Number of passes over the input corpus
* Many others, so far left on defaults: https://radimrehurek.com/gensim/models/ldamodel.html

#### Tf-Idf:

* Corpus representation: either 1. BoW word counts, or 2. Tf-Idf weights
* Representative term set filtering: for each doc, keep only a given top percentile of terms, ranked by tf-idf weight

#### Visualization:

* Docs representation: currently over a distance matrix
    * Distance method for distance matrix: correlation, cosine, euclidean, ... choose from https://docs.scipy.org/doc/scipy/reference/spatial.distance.html
* Dimensionality reduction: currently MDS from distance matrix

#### Topic representation in visualization:

* Number of most important words for topic
* Maximum number of topic word sets in which a word may occur; words exceeding this bound are dropped from the topic labels
    * Can clarify the meaning of the topic in contrast to other inferred topics
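
The word-occurrence bound can be sketched on toy data as follows. The topic word lists and the names `distinctive_words` and `max_topics_per_word` are made up for illustration; this is a minimal sketch of the idea, not the notebook's own cell.

```python
from collections import Counter

def distinctive_words(topic_words, max_topics_per_word, max_output_words=10):
    """Drop words that occur in more than `max_topics_per_word` topic lists."""
    # count in how many topic lists each word appears (each list counts once)
    topic_counts = Counter(w for words in topic_words for w in set(words))
    return [
        [w for w in words if topic_counts[w] <= max_topics_per_word][:max_output_words]
        for words in topic_words
    ]

# hypothetical top words of three topics
toy_topics = [
    ["model", "topic", "corpus", "word"],
    ["model", "network", "layer", "word"],
    ["model", "query", "search", "ranking"],
]
labels = distinctive_words(toy_topics, max_topics_per_word=1)
print(labels)
```

With a bound of 1, "model" and "word" are shared by several topics and disappear, leaving only the words that set each topic apart.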


## Tool functions

In [1]:
import os
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string
from gensim.models import TfidfModel

import numpy as np

### Data load:

1. using filter for short sentences (to get rid of HTML tags content) + gensim preprocess_string()
2. plaintext + gensim preprocess_string()


In [2]:
html_content_dir = "data/fulltexts_html"
pdf_content_dir = "data/fulltexts_pdf"

In [3]:
def filter_sentences_from_text(text_lines, min_line_len=20, min_sen_len=15):
    fulltext = " ".join([line for line in list(text_lines) if len([l for l in line.split(" ") 
                                                                   if len(l) > 0]) > min_line_len])
    sens = filter(lambda sen: len(sen) >= min_sen_len, fulltext.split("."))
    return ". ".join(sens)
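
A small usage sketch of the same filtering idea on made-up lines. The function is restated under a different, hypothetical name (`keep_long_lines_and_sentences`) so that the snippet runs on its own, and the thresholds are lowered to fit the toy input: a line is kept when it has more than `min_line_len` words, a sentence when it has at least `min_sen_len` characters.

```python
def keep_long_lines_and_sentences(text_lines, min_line_len=20, min_sen_len=15):
    # keep only lines with more than min_line_len space-separated tokens
    long_lines = [
        line for line in text_lines
        if len([tok for tok in line.split(" ") if len(tok) > 0]) > min_line_len
    ]
    fulltext = " ".join(long_lines)
    # keep only sentences with at least min_sen_len characters
    sentences = [sen for sen in fulltext.split(".") if len(sen) >= min_sen_len]
    return ". ".join(sentences)

toy_lines = [
    "    • Login \n",   # navigation residue: too few words
    "Browse\n",          # navigation residue: too few words
    "Topic models describe documents as mixtures of topics. "
    "They are fitted to word counts. Ok.\n",
]
filtered = keep_long_lines_and_sentences(toy_lines, min_line_len=5, min_sen_len=10)
print(filtered)
```

The short menu lines and the three-character sentence are dropped, while the two full sentences survive.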

In [4]:
import os

from gensim.parsing.preprocessing import (
    preprocess_string,
    remove_stopwords,
    strip_multiple_whitespaces,
    strip_numeric,
    strip_punctuation,
    strip_short,
    strip_tags,
)    
import nltk


custom_filters = [
    lambda x: x.lower(),
    strip_tags,
    strip_punctuation,
    strip_multiple_whitespaces,
    strip_numeric,
    remove_stopwords,
    strip_short,
]

nltk.download('words')
english_words = set(word.lower() for word in nltk.corpus.words.words())

def get_texts_from_dir(texts_dir, filter_sen=False):
    txt_files = os.listdir(texts_dir)
    txt_files = [os.path.join(texts_dir, txt) for txt in txt_files]
    texts = dict()
    for txt_f in list(filter(lambda path: path.endswith(".txt"), txt_files)):
        try:
            with open(txt_f, "r") as txt_file:
                if filter_sen:
                    # custom filtering based on line and sentence length:
                    text = filter_sentences_from_text(txt_file.readlines())
                else:
                    # no filtering:
                    text = txt_file.read()
            # tokenize and keep only tokens found in the English word list
            texts[os.path.basename(txt_f)] = [
                word for word in preprocess_string(text, custom_filters)
                if word in english_words
            ]
        except UnicodeDecodeError:
            print("Utf-8 decode error on %s" % txt_f)
            continue
    return texts

[nltk_data] Downloading package words to /home/michal/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [5]:
def read_texts_no_preproc(texts_dir):
    txt_files = os.listdir(texts_dir)
    txt_files = [os.path.join(texts_dir, txt) for txt in txt_files]
    for txt_f in list(filter(lambda path: path.endswith(".txt"), txt_files)):
        yield open(txt_f, "r").readlines()

### 1: using filter for short sentences

In [6]:
#: comparison of a text before and after filtering:
texts = read_texts_no_preproc(html_content_dir)
text = list(texts)[42]
text[:20]

['\ufeffToggle navigation\xa0\xa0\xa0IDEALS \n',
 '    • Login \n',
 '    • \n',
 '    • Search IDEALSThis Collection \n',
 '    • query \n',
 '      Advanced Search \n',
 '    • \n',
 ' \n',
 '\n',
 '\n',
 ' \n',
 'Entity-relation search: context pattern driven relation ranking\n',
 'Welcome to the IDEALS Repository\n',
 '\n',
 'JavaScript is disabled for your browser. Some features of this site may not work without it.\n',
 '\n',
 '\n',
 '\n',
 'Browse\n',
 'IDEALS\n']

In [7]:
# same one using filtering:
text_f = filter_sentences_from_text(text)
text_f.split("\n")[:10]

['A traditional page link-based search system is not adequate for users intending to query data efficiently.  For instance, emergent phenomena reveal that some entity-based search engines, such as EntityRank, directly return answers (target entities) to users instead of web pages.  Most of the time, however, compared to searching for interested entities, users more often focus on relationships among entities.  To our knowledge, there is only one web search system that automatically extracts relations from massive unstructured corpora.  This system is referred to as OpenIE, which indeed brings us one step closer to an entity relation-based system.  Nevertheless, its system extracts only direct relations between a pair of entities and ranks simply by occurrence frequency.  The monotone pattern extraction, adopted in their relation phrase extraction model, provides high quality entity relations but also fail to return many potential true relations in the corpus, which has been explained i

In [8]:
htmls_preproc = get_texts_from_dir(html_content_dir, filter_sen=True)
list(htmls_preproc.keys())[-10:]

['S187704281631655X.txt',
 '3800a765-abs.txt',
 '978-3-319-93034-3_10.txt',
 '5679915.txt',
 'hal-01480773.txt',
 'ALFNLP.txt',
 '1711.txt',
 '1803.txt',
 '978-3-319-22183-0_27.txt',
 '1801.txt']

### 2: without text filter

In [9]:
pdfs_preproc = get_texts_from_dir(pdf_content_dir, filter_sen=False)
list(pdfs_preproc.keys())[-10:]

['Zhang et al. - 2017 - Targeted Advertising Based on Browsing History.txt',
 'Xu et al. - 2017 - AnswerBot automated generation of answer summary .txt',
 'Ribón - A Framework for Semantic Similarity Measures to en.txt',
 'Graus et al. - 2013 - yourHistory–Semantic linking for a personalized ti.txt',
 'Ali and LaPaugh - 2013 - Enabling Author-Centric Ranking of Web Content..txt',
 'Gordeev - 2016 - Automatic detection of verbal aggression for Russi.txt',
 'Ver Steeg and Galstyan - 2013 - Information-theoretic measures of influence based .txt',
 'Epasto et al. - 2017 - Bicriteria distributed submodular maximization in .txt',
 'Cao et al. - 2018 - Searching for Truth in a Database of Statistics.txt',
 'Catizone et al. - 2012 - LIE Leadership, Influence and Expertise..txt']

In [10]:
# start from the PDF texts only; the HTML texts are merged in later
merged_texts_preproc = {**pdfs_preproc}

In [11]:
texts = [t for t in merged_texts_preproc.values() if len(t) > 0]
texts_links = merged_texts_preproc.keys()
len(texts)

754

### tf-idf integration

Three approaches:

1. use tf-idf weights instead of BoW frequencies
2. filter out a given percentile of the least important words from each doc's representation
3. combined
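
A rough plain-Python sketch of the options listed above, on a toy corpus. The weighting here is the raw count times log2(N/df) without normalisation, which is a simplification assumed for brevity (gensim's `TfidfModel` also normalises each document vector by default); the names `tfidf_weights`, `top_terms` and `top_fraction` are hypothetical.

```python
import math
from collections import Counter

docs = [
    ["topic", "model", "corpus", "topic"],
    ["neural", "model", "layer"],
    ["search", "query", "model", "ranking"],
]

n_docs = len(docs)
# document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf_weights(doc):
    # term count times inverse document frequency (log base 2)
    counts = Counter(doc)
    return {term: count * math.log2(n_docs / df[term]) for term, count in counts.items()}

def top_terms(doc, top_fraction, use_counts=False):
    # rank terms by tf-idf weight and keep the top fraction of them
    weights = tfidf_weights(doc)
    counts = Counter(doc)
    ranked = sorted(weights, key=weights.get, reverse=True)
    kept = ranked[:int(len(ranked) * top_fraction)]
    return [(term, counts[term] if use_counts else weights[term]) for term in kept]

print(top_terms(docs[0], 0.5))                   # filtering combined with tf-idf weights
print(top_terms(docs[0], 0.5, use_counts=True))  # filtering, raw counts kept
```

A term such as "model", which occurs in every document, gets weight zero and is the first to be filtered out.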

In [12]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = TfidfModel(corpus)

### 1. use tf-idf weights for a given top percentile of terms

In [13]:
def terms_for_doc(doc_id):
    return tfidf[corpus[doc_id]]

In [14]:
def top_terms_idf_for_doc(doc_id, percentile):
    doc_terms_ordered = sorted(terms_for_doc(doc_id), key=lambda term: term[1],  reverse=True)
    return [term[0] for term in doc_terms_ordered[:int(len(doc_terms_ordered)*percentile)]]

In [15]:
def name_terms_in_tuples(doc_corpus):
    return [(dictionary.get(tup[0]), tup[1]) for tup in doc_corpus]

In [16]:
def top_terms_idf_for_doc(doc_id, percentile):
    doc_terms_ordered = sorted(terms_for_doc(doc_id), key=lambda term: term[1], reverse=True)
    return [(doc_terms_ordered[i][0], doc_terms_ordered[i][1]) 
            for i in range(int(len(doc_terms_ordered)*percentile))]

In [17]:
def term_tfidf_for_doc(doc_idx):
    return [(tfidf_term_tuple[0], tfidf_term_tuple[1]) 
            for tfidf_term_tuple in tfidf_corpus[doc_idx]]

In [18]:
tfidf_corpus1 = [top_terms_idf_for_doc(doc_i, 0.18) for doc_i in range(len(texts))]

In [19]:
tfidf_corpus1[0][:10]

[(395, 0.8100748966009056),
 (167, 0.3988720504597434),
 (514, 0.2500924209574005),
 (134, 0.13777225950655736),
 (529, 0.10336837201530495),
 (124, 0.09178279631047316),
 (604, 0.07934668648674739),
 (125, 0.07249263373208426),
 (49, 0.06575902668688782),
 (538, 0.06336493100830388)]

In [22]:
dictionary = Dictionary(texts)
tfidf_model = TfidfModel(dictionary=dictionary)
tf_corpus = [dictionary.doc2bow(text) for text in texts]
tfidf_corpus = tfidf_model[corpus]
tfidf_corpus

<gensim.interfaces.TransformedCorpus at 0x7f5892983780>

In [24]:
tf_corpus[0]

[(0, 1),
 (1, 1),
 (2, 4),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 1),
 (13, 8),
 (14, 4),
 (15, 1),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 2),
 (20, 1),
 (21, 1),
 (22, 1),
 (23, 2),
 (24, 1),
 (25, 1),
 (26, 1),
 (27, 1),
 (28, 6),
 (29, 1),
 (30, 1),
 (31, 3),
 (32, 1),
 (33, 3),
 (34, 1),
 (35, 1),
 (36, 2),
 (37, 1),
 (38, 3),
 (39, 1),
 (40, 4),
 (41, 4),
 (42, 1),
 (43, 3),
 (44, 1),
 (45, 2),
 (46, 3),
 (47, 1),
 (48, 1),
 (49, 12),
 (50, 1),
 (51, 3),
 (52, 18),
 (53, 4),
 (54, 6),
 (55, 1),
 (56, 1),
 (57, 1),
 (58, 1),
 (59, 1),
 (60, 1),
 (61, 1),
 (62, 1),
 (63, 1),
 (64, 3),
 (65, 2),
 (66, 1),
 (67, 1),
 (68, 2),
 (69, 1),
 (70, 2),
 (71, 1),
 (72, 1),
 (73, 1),
 (74, 4),
 (75, 1),
 (76, 1),
 (77, 1),
 (78, 2),
 (79, 1),
 (80, 2),
 (81, 2),
 (82, 1),
 (83, 2),
 (84, 1),
 (85, 1),
 (86, 1),
 (87, 1),
 (88, 1),
 (89, 2),
 (90, 1),
 (91, 1),
 (92, 3),
 (93, 1),
 (94, 3),
 (95, 1),
 (96, 1),
 (97, 1),
 (98, 4),
 (99, 1),
 (100, 4

### 2. use term frequencies for a given top percentile of terms

In [25]:
def top_terms_freqs_for_doc(doc_id, percentile):
    doc_terms_idf = dict(terms_for_doc(doc_id))
    doc_terms_count = corpus[doc_id]
    doc_terms_count_ordered = sorted(doc_terms_count, 
                                     key=lambda term_count: doc_terms_idf[term_count[0]], reverse=True)
    return [(doc_terms_count_ordered[i][0], doc_terms_count_ordered[i][1])
            for i in range(int(len(doc_terms_count_ordered)*percentile))]

In [26]:
tfidf_corpus2 = [top_terms_freqs_for_doc(doc_i, 0.1) for doc_i in range(len(texts))]
tfidf_corpus2[0][:10]

[(395, 284),
 (167, 114),
 (514, 92),
 (134, 48),
 (529, 136),
 (124, 26),
 (604, 54),
 (125, 19),
 (49, 12),
 (538, 32)]

## LDA computation

Train LDA on a given type of corpus

In [27]:
def lda_from_texts(given_corpus, num_topics, passes):
    lda = LdaModel(given_corpus, num_topics=num_topics, alpha='auto', eval_every=5, passes=passes)
    return lda

In [28]:
def topic_distro_for_text(text):
    return lda.get_document_topics(dictionary.doc2bow(text), minimum_probability=0)

In [29]:
def terms_for_topic(topic_id, top_terms=10):
    topic_top_terms = lda.get_topic_terms(topic_id, topn=top_terms)
    return [dictionary.get(term[0]) for term in topic_top_terms]

In [30]:
lda_num_topics = 11
num_passes = 10

# TODO: choose between corpus, tfidf_corpus1, tfidf_corpus2:
lda = lda_from_texts(tfidf_corpus2, num_topics=lda_num_topics, passes=num_passes)

## Documents representation

Each doc is represented as a probability distribution over the LDA topics
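
As a small illustration, assuming a document's topics arrive as `(topic_id, probability)` pairs (the format gensim's `get_document_topics` returns), the pairs can be turned into a fixed-length vector like this. The helper `to_dense` and the values are invented for the example, not taken from the trained model.

```python
def to_dense(topic_pairs, num_topics):
    # start from all-zero probabilities and fill in the topics that are present
    vec = [0.0] * num_topics
    for topic_id, prob in topic_pairs:
        vec[topic_id] = prob
    return vec

pairs = [(0, 0.7), (3, 0.3)]   # hypothetical sparse topic distribution
dense = to_dense(pairs, num_topics=4)
print(dense)
```

Filling by topic id keeps the columns aligned even if some topics are missing from the returned pairs.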

In [44]:
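# add HTML texts to a set of all texts
merged_texts_preproc = {**pdfs_preproc, **htmls_preproc}
# drop empty documents, keeping file names and token lists aligned
non_empty_texts = {name: t for name, t in merged_texts_preproc.items() if len(t) > 0}
texts = list(non_empty_texts.values())
texts_links = list(non_empty_texts.keys())
len(texts)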

1257

In [45]:
topic_distros = np.array([topic_distro_for_text(text) for text in texts[:-1]])
topic_distros = topic_distros[:, :, 1]

In [46]:
topic_distros.shape

(1256, 11)

# Visualization projections: approaches

## 1. Get a projection of docs so that the docs are close to their major topic

A doc's distance to a topic should reflect its probability of belonging to that topic: the higher the probability, the closer the doc.

This approach does not consider the relative distances between documents, which in my view leaves the projection without interpretive value. If it did consider them, it would be equivalent to the second approach

... and thus it is not implemented.

Other projection methods still deserve consideration.



## 2. Get a projection of topics according to their relative similarity

Relative similarity is measured by the correlation of the documents' topic-membership probabilities.

Topic centers are represented as documents with a one-hot probability distribution.

In [47]:
base_topic_docs_distros = np.identity(lda_num_topics)

In [48]:
# stack the one-hot topic rows below the document rows
topic_distros = np.vstack([topic_distros, base_topic_docs_distros])

In [49]:
topic_distros.shape

(1267, 11)

In [50]:
# distance matrix by selected metric
from scipy.spatial.distance import pdist, squareform

dists = squareform(pdist(topic_distros, metric="correlation"))
dists

array([[0.        , 1.18377963, 1.08430097, ..., 0.65466782, 1.12858401,
        1.16681353],
       [1.18377963, 0.        , 0.276056  , ..., 1.00711261, 0.03171644,
        1.04157268],
       [1.08430097, 0.276056  , 0.        , ..., 1.17678818, 0.26014944,
        0.59088021],
       ...,
       [0.65466782, 1.00711261, 1.17678818, ..., 0.        , 1.1       ,
        1.1       ],
       [1.12858401, 0.03171644, 0.26014944, ..., 1.1       , 0.        ,
        1.1       ],
       [1.16681353, 1.04157268, 0.59088021, ..., 1.1       , 1.1       ,
        0.        ]])
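
The repeated 1.1 entries in the lower-right corner of the matrix above are the distances between pairs of one-hot topic rows. Correlation distance is one minus the Pearson correlation, and for two distinct one-hot vectors of length k the correlation is -1/(k-1), so with 11 topics the distance is 1 + 1/10 = 1.1. A plain-Python check (the helper `correlation_distance` is written here to mirror the definition; it is not SciPy's implementation):

```python
import math

def correlation_distance(u, v):
    # one minus the Pearson correlation of the two vectors
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cu = [x - mean_u for x in u]
    cv = [y - mean_v for y in v]
    dot = sum(a * b for a, b in zip(cu, cv))
    norm = math.sqrt(sum(a * a for a in cu)) * math.sqrt(sum(b * b for b in cv))
    return 1.0 - dot / norm

k = 11  # number of topics used in the notebook
e0 = [1.0 if i == 0 else 0.0 for i in range(k)]  # one-hot row of topic 0
e1 = [1.0 if i == 1 else 0.0 for i in range(k)]  # one-hot row of topic 1
d = correlation_distance(e0, e1)
print(round(d, 6))  # 1.1
```

All topic centers are therefore equally far from each other under this metric; their layout in the projection is determined by the documents between them.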

In [51]:
dists.shape

(1267, 1267)

In [52]:
# projection to 2D using MDS

from sklearn import manifold

adist = dists

amax = np.amax(adist)
adist /= amax

mds = manifold.MDS(n_components=2, dissimilarity="precomputed")
results = mds.fit(adist)

coords = results.embedding_
results

MDS(dissimilarity='precomputed', eps=0.001, max_iter=300, metric=True,
  n_components=2, n_init=4, n_jobs=None, random_state=None, verbose=0)

In [53]:
# matplotlib visualization:

# import matplotlib.pyplot as plt
# from matplotlib.pyplot import figure


# plt.subplots_adjust(bottom = 0.1)
# plt.scatter(
#     coords[:, 0], coords[:, 1], marker = 'o'
#     )
# for label, x, y in zip([""]*len(coords), coords[:, 0], coords[:, 1]):
#     plt.annotate(
#         label,
#         xy = (x, y), xytext = (-20, 20),
#         textcoords = 'offset points', ha = 'right', va = 'bottom',
#         bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
#         arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
    
# # figure(num=None, figsize=(10, 10), dpi=80, facecolor='w', edgecolor='k')
# plt.figsize = (100, 100)
# plt.grid()
# plt.show()

In [54]:
# Selective labeling of topics by only the unique words among topics

word_occurrence_bound = int(lda.num_topics / 2)
top_terms_per_topic = 30
max_output_words = 10

from functools import reduce

all_topics_w = [terms_for_topic(i, top_terms=top_terms_per_topic) for i in range(lda_num_topics)]
all_words = reduce(lambda x, y: set(x) | set(y), all_topics_w)
intersect_words = list(filter(lambda w: sum([w in t_words for t_words in all_topics_w]) > word_occurrence_bound, 
                              all_words))
unique_topics_w = [[w for w in t_words if w not in intersect_words] for t_words in all_topics_w]
# truncate the filtered word lists (one list per topic, so indices still match topic ids)
unique_topics_w = [twords[:max_output_words] for twords in unique_topics_w]

In [55]:
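import os
import plotly
# credentials are read from the environment instead of being hard-coded;
# the variable names PLOTLY_USERNAME and PLOTLY_API_KEY are assumed here
plotly.tools.set_credentials_file(username=os.environ['PLOTLY_USERNAME'], api_key=os.environ['PLOTLY_API_KEY'])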

In [56]:
import plotly.plotly as py
import plotly.graph_objs as go

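# document rows come first in coords; the last lda_num_topics rows are the topic centers
docs_len = len(coords) - lda_num_topics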

# Documents trace
trace_docs = go.Scatter(
    x = coords[:docs_len, 0],
    y = coords[:docs_len, 1],
    mode = 'markers',
    marker = dict(color = 'rgba(0, 0, 255, .5)', size = 5),
    text = list(texts_links)
)
# Bases (topics documents) trace
trace_bases = go.Scatter(
    x = coords[docs_len:, 0],
    y = coords[docs_len:, 1],
    mode = 'markers',
    marker = dict(color = 'rgba(255, 0, 122, .2)', size = 60),
    text = ["T %s: %s" % (i, unique_topics_w[i]) for i in range(lda_num_topics)]
)

data = [trace_docs, trace_bases]

# label = 'MDS over LDA %s topics. tfidf for top 0.5 terms as frequencies - TODO: check' % lda.num_topics

label = 'TODO: fill appropriately: MDS over LDA X topics. tfidf:counts for top X terms as frequencies.'


# Plot and embed in ipython notebook!
layout = dict(title=label,  
              font=dict(size=12),
              showlegend=True,
              width=1000,
              height=1000,
              margin=dict(l=40, r=40, b=85, t=100),
              hovermode='closest',
              plot_bgcolor='rgb(255,255,255)'
              )
py.iplot(dict(data=data, layout=layout), filename=label) 

High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~stefanik.m/0 or inside your plot.ly account where it is named 'TODO: fill appropriately: MDS over LDA X topics. tfidf:counts for top X terms as frequencies.'
