# Interactive Topic Analysis

The HTML format of SherlockML reports lets you interact with the results of your analysis. To demonstrate this feature, we will model the topics appearing in a set of over 1000 Associated Press news articles. For this notebook to run properly, the package `pyLDAvis` must be installed on your server. You can install it by running

>```pip install pyldavis ```

in a terminal. Alternatively, you can apply the pre-defined environment to your server from the servers tab.

The mathematical model we use to extract a set of unknown topics from a set of documents is called "Latent Dirichlet Allocation". Roughly speaking, it assumes that
* Each topic is characterized by a distribution over some vocabulary.
* Each document is generated as follows:
    * randomly select the number of words in the document;
    * randomly select a distribution of topics for this document;
    * generate each word in the document by first choosing a topic according to the chosen topic distribution, and then choosing the word according to the vocabulary distribution that the topic defines.
    
When we *fit* the model to a set of documents, we aim to recover the topic-word probabilities, as well as the document-topic probabilities. You can read more about LDA here: https://www.seas.harvard.edu/courses/cs281/papers/blei-ng-jordan-2003.pdf.
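The generative process described above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the fitting procedure used later in this notebook: the vocabulary, the topic-word probabilities, the Dirichlet parameter `alpha` and the document-length range are all made-up assumptions. The per-document topic distribution is drawn from a Dirichlet distribution, here obtained by normalising Gamma variates.

```python
import random


def sample_dirichlet(rng, alpha):
    """Draw one sample from a Dirichlet distribution by normalising Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, topic_word_probs, vocab, alpha, min_len=5, max_len=10):
    """Generate one toy document following the three steps listed above."""
    # 1. Randomly select the number of words (uniform range: an assumption).
    n_words = rng.randint(min_len, max_len)
    # 2. Randomly select a distribution of topics for this document.
    topic_probs = sample_dirichlet(rng, alpha)
    # 3. For each word, pick a topic, then pick a word from that topic.
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(topic_word_probs)), weights=topic_probs)[0]
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return words, topic_probs


# Hypothetical two-topic model over a four-word vocabulary.
gen_vocab = ["election", "senate", "market", "stock"]
gen_topic_word_probs = [
    [0.5, 0.5, 0.0, 0.0],  # a "politics" topic
    [0.0, 0.0, 0.5, 0.5],  # a "finance" topic
]
gen_rng = random.Random(0)
gen_words, gen_topic_probs = generate_document(
    gen_rng, gen_topic_word_probs, gen_vocab, alpha=[0.5, 0.5]
)
print(gen_words, gen_topic_probs)
```

Fitting the model goes in the opposite direction: given only documents such as `gen_words`, it estimates quantities playing the role of `gen_topic_word_probs` and `gen_topic_probs`.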

In [1]:
#Necessary imports
import nltk
import numpy as np
import os
import pandas as pd
import pyLDAvis
import re
import string
import warnings
import zipfile

from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.decomposition import LatentDirichletAllocation as LDA

## Text preprocessing and model fitting


In what follows, we preprocess the text files. This involves tokenizing the documents (i.e. splitting the raw text into a list of words), removing common stopwords (such as "the" or "that", which carry little information), lemmatizing words (i.e. reducing them to their dictionary form: for example, "navigates" is reduced to "navigate") and keeping only nouns. There are many ways to go about text preprocessing; this is only a "quick and dirty" one. For this task, we'll use the `nltk` library (Natural Language Toolkit), which contains a set of very useful tools for text processing. More guidelines on how to preprocess text data can be found here: https://de.dariah.eu/tatom/preprocessing.html.

In [2]:
nltk.download('stopwords', quiet=True);
nltk.download('wordnet', quiet=True);
nltk.download('averaged_perceptron_tagger', quiet=True);
nltk.download('punkt', quiet=True);

In [3]:
#Read in the data
docs = []
with zipfile.ZipFile('data/ap.zip', 'r') as zip_file:
    for file_name in zip_file.namelist():
        with zip_file.open(file_name) as f:
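            for line in f:
                #Lines are read as bytes: decode them rather than calling str(),
                #which would keep the b'...' wrapper (UTF-8 is assumed here)
                docs.append(line.decode('utf-8', errors='replace'))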

In [4]:
#Define a function that converts a Part Of Speech tag
#(aka POS tag - see https://en.wikipedia.org/wiki/Part-of-speech_tagging)
#to a format understandable by the Wordnet Lemmatizer.

def get_wordnet_pos(treebank_tag):
    """Convert a POS tag into a format understandable by the Wordnet lemmatizer."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        #Fall back to noun, the lemmatizer's own default part of speech
        return wordnet.NOUN

In [5]:
#Preprocessing function to apply to each document
lemmatiser = WordNetLemmatizer()
punct = set(string.punctuation)
stop_words = set(stopwords.words('english'))

def process(doc):
    """Tokenize, lowercase and POS-tag a document, then return its lemmatized singular nouns."""
    words = [w.lower() for w in word_tokenize(doc)]
    return [lemmatiser.lemmatize(word, pos=get_wordnet_pos(tag))
            for word, tag in pos_tag(words)
            # Keep singular common nouns ('NN') of 3+ characters that are
            # neither stopwords nor contain punctuation.
            if tag == 'NN' and len(word) >= 3 and word not in stop_words
            and not any(char in punct for char in word)]


In [6]:
#Process the documents
docs = [process(x) for x in docs]

In [7]:
#Create the vocabulary (sorted, so that word indices are the same on every run)
vocab = sorted(set(word for words in docs for word in words))

In [8]:
#Create a dictionary relating a word to its index in vocab
word_to_index = {word: i for i, word in enumerate(vocab)}

In [9]:
#Create the doc_word_freq matrix
doc_word_freq = []
for words in docs:
    word_freq = np.zeros(len(vocab))
    for w in words:
        word_freq[word_to_index[w]] += 1
    doc_word_freq.append(word_freq)
doc_word_freq = np.array(doc_word_freq)
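To make the shape of this matrix concrete, here is a minimal sketch of the same construction on two made-up documents. It uses plain Python lists instead of NumPy to stay dependency-free, and the `toy_` names are hypothetical. Each row is a document, each column a vocabulary word, and each entry the number of times that word occurs in that document.

```python
# Two hypothetical, already-preprocessed documents.
toy_docs = [["market", "stock", "market"], ["senate", "election"]]

# Vocabulary and word -> column index lookup, as in the cells above.
toy_vocab = sorted({w for words in toy_docs for w in words})
toy_word_to_index = {w: i for i, w in enumerate(toy_vocab)}

# One row of counts per document, one column per vocabulary word.
toy_counts = [[0] * len(toy_vocab) for _ in toy_docs]
for row, words in zip(toy_counts, toy_docs):
    for w in words:
        row[toy_word_to_index[w]] += 1

print(toy_vocab)
for row in toy_counts:
    print(row)
```

The row sums give the document lengths and the column sums give the corpus-wide word frequencies, which are the quantities the visualisation needs later on.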

Now that we are done preparing the data, let us fit the model. This might take a while.

In [10]:
lda = LDA(n_components=10,  # called `n_topics` in older scikit-learn releases
          max_iter=100, 
          learning_method='online', 
          learning_offset=10.,
          random_state=0,
          n_jobs = 1,
          evaluate_every=1,
          verbose=0).fit(doc_word_freq)
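Before turning to the interactive view, a quick sanity check is to list the highest-weighted words of each topic. The helper below is a hypothetical sketch, not part of scikit-learn: `top_words` and the toy weights are made up for illustration, and the function only assumes a matrix with one row of word weights per topic, which is the layout of `lda.components_`.

```python
def top_words(topic_word_weights, vocab, n=3):
    """Return the n highest-weighted vocabulary words for each topic."""
    result = []
    for weights in topic_word_weights:
        # Rank word indices by decreasing weight within this topic.
        ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n]])
    return result


# Hypothetical weights for two topics over a four-word vocabulary.
demo_vocab = ["election", "market", "senate", "stock"]
demo_weights = [
    [9.0, 0.5, 7.0, 0.1],
    [0.2, 8.0, 0.3, 6.0],
]
demo_top = top_words(demo_weights, demo_vocab, n=2)
print(demo_top)
```

With the fitted model, the equivalent call would be `top_words(lda.components_, vocab, n=10)`.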

## Visualisation

You will find below an interactive visualisation of the topics discovered by the LDA algorithm. On the left-hand side, the topics are laid out so that similar topics are closer to each other. The relative size of each topic indicates its relative prevalence amongst all the documents. When you click on a topic, a list of terms for this topic is displayed on the right-hand side, in decreasing order of *relevance*.

The *relevance* of a term for a given topic is a measure of how useful this term is in identifying the topic. It depends on two quantities, both taken on a log scale: the probability of the word given the topic, and the ratio of that probability to the overall probability of the word in the whole corpus. The value of the $\lambda$ parameter (top right) defines how much weight each quantity receives when computing the relevance. When $\lambda = 1$, only the probability of the word given the topic is taken into account, while when $\lambda = 0$ only the ratio to the overall probability is taken into account. Typically, a value of $\lambda = 0.6$ yields good results. More details on this can be found in this paper: https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf.
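Following the definition in the paper linked above, the relevance of word $w$ for topic $t$ is $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$. The sketch below evaluates it for two made-up terms to show how $\lambda$ changes the ranking. The function name and the probabilities are hypothetical, chosen only for illustration.

```python
import math


def relevance(p_word_given_topic, p_word, lam=0.6):
    """Relevance of a term for a topic, as a weighted sum of two log terms."""
    log_prob = math.log(p_word_given_topic)
    log_ratio = math.log(p_word_given_topic / p_word)
    return lam * log_prob + (1 - lam) * log_ratio


# Hypothetical terms: (probability within the topic, probability in the corpus).
terms = {
    "year": (0.05, 0.04),     # frequent in the topic, but frequent everywhere
    "senate": (0.02, 0.002),  # rarer in the topic, but very specific to it
}

for lam in (1.0, 0.6, 0.0):
    ranking = sorted(terms, key=lambda w: relevance(*terms[w], lam=lam), reverse=True)
    print(lam, ranking)
```

With $\lambda = 1$ the globally common term ranks first, while with $\lambda = 0$ the topic-specific term does; intermediate values trade off the two.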

When you hover over a specific term, the conditional distribution over topics given this term is shown on the left-hand side, through the sizes of the topic circles.

In [11]:
topic_word_dists =  lda.components_/np.sum(lda.components_, axis=1, keepdims=True)
doc_word_dists = doc_word_freq/np.sum(doc_word_freq, axis=1, keepdims=True)
doc_topic_dists = lda.transform(doc_word_freq)
doc_lengths = np.sum(doc_word_freq, axis=1)
word_freqs = np.sum(doc_word_freq, axis=0)

In [12]:
#The pyLDAvis package uses deprecated pandas functionality; let's silence the warnings.
warnings.filterwarnings("ignore")

In [13]:
#Reuse doc_topic_dists, which was computed from the raw word counts the model was fitted on
vis = pyLDAvis.prepare(topic_word_dists, doc_topic_dists, doc_lengths, vocab, word_freqs, R=8)

In [14]:
pyLDAvis.display(vis)

## Publish an interactive SherlockML report

The HTML format of SherlockML reports allows you to create interactive presentations of your data. Why don't you give it a try? Click on the "Publish" button on the top right of this page after running this notebook. Then, click on the "Report" tab on the left-hand side panel. You should be able to view your report after a few seconds of processing time.