# Topic detection with an LDA model on the Ubuntu Dialog Corpus
Topic detection is the task of determining which topics are relevant to a given document.
Before we talk about LDA, I would like to give some basic background on topic models.

## Topic Model
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.

## LDA (Latent Dirichlet Allocation)
LDA (Latent Dirichlet Allocation) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

### LDA is a generative model
LDA is a generative model. It assumes an article is generated from a mix of hidden topics, and each topic has a set of words that can be drawn from it.

Of course, LDA can also be used for discriminative tasks. But let's first look at the generative process.

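So we have two distributions:
1. the topic distribution p(topic|θ)
2. the word distribution given a topic, p(word|topic).

Notice that θ is the per-document topic mixture. It is itself drawn from a prior p(θ), which in LDA is a Dirichlet distribution with its own hyper-parameter.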

Then the process for generating an article is:
```
Choose parameter θ ~ p(θ);
For each of the N words:
    Choose a topic ~ p(topic|θ);
    Choose a word ~ p(word|topic);
```
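The generative process can be sketched in a few lines of Python. This is only a toy illustration using the standard library: the two topics, their word probabilities, and the `alpha` value are made up, and p(θ) is assumed to be a symmetric Dirichlet, sampled here by normalizing gamma draws.

```python
import random

# Hypothetical topics: each maps words to p(word|topic).
TOPICS = {
    "dogs": {"dog": 0.5, "bone": 0.3, "the": 0.2},
    "cats": {"cat": 0.5, "meow": 0.3, "the": 0.2},
}


def sample_theta(n_topics, alpha, rng):
    """Draw a topic mixture theta from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha=0.5, seed=0):
    """Generate n_words words following the process described above."""
    rng = random.Random(seed)
    names = list(TOPICS)
    theta = sample_theta(len(names), alpha, rng)      # theta ~ p(theta)
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=theta)[0]  # topic ~ p(topic|theta)
        dist = TOPICS[topic]
        # word ~ p(word|topic)
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return theta, words


theta, words = generate_document(10)
print(theta, words)
```

Changing `seed` gives a different mixture θ and therefore a different balance of dog words and cat words.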

To classify the topic of an article, we choose the topic that is most likely to have generated it. This is similar to maximum a posteriori estimation.
Also notice that **word order doesn't matter.** The LDA model therefore fails to capture semantic features that depend on word order. But it is still powerful, especially for topic detection, where word order is not a necessary feature.
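A quick way to see what "order doesn't matter" means is to count words with `collections.Counter`. The two sentences here are made up; they contain the same words in a different order and end up with the same representation.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding the order of the words."""
    return Counter(text.lower().split())


a = bag_of_words("install python3 with apt")
b = bag_of_words("apt install with python3")
print(a == b)  # True: same words, different order
```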

### Discriminative Task
When we talk about topic detection, it's a discriminative task. The way an LDA model does this is to infer the most probable topics for a document; Gibbs sampling is one common way to approximate this. It's similar to maximum a posteriori estimation.

A more detailed explanation can be found [here](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

### Unsupervised model

LDA is an unsupervised model, in that it doesn't require any labels. In simple words, LDA clusters a collection of documents.

As a result, the topics in the model don't have specific "names". Instead, each topic is characterized by a vector of word probabilities.

In our task, we want to determine the relevant topics for a given document. A naive way is to pick the high-probability words from the topic word vectors as keywords.

### Another problem to notice

Another problem is that, given a dialog, we might suggest a keyword that does not appear in it! This might look weird.

But on the other hand, this is a good thing, as the model considers the hidden semantic structure. For example, take a simple dialog: "Hey, I failed to use `sudo apt install python3`". The keyword python3 can be easily detected, but a human being with some Linux knowledge should also be able to tell that this is an Ubuntu-like distribution. The LDA model might be able to do this too.

Here is a possible situation:

User input: Hey, I failed to use `sudo apt install python3`

Response: your problem might be related to these topics: `python3, ubuntu, apt, install`

The word `ubuntu` looks reasonable even though it doesn't appear in the dialog.
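The naive keyword selection can be sketched as follows: score every word by the sum over topics of p(topic|document) × p(word|topic) and keep the best ones. All numbers are invented for illustration (they do not come from the trained model), and `suggest_keywords` is a hypothetical helper, not part of gensim.

```python
from collections import defaultdict

# Hypothetical p(word|topic) for two made-up topics.
TOPIC_WORDS = {
    0: {"apt": 0.35, "install": 0.30, "ubuntu": 0.20, "sudo": 0.15},
    1: {"python3": 0.50, "script": 0.30, "pip": 0.20},
}


def suggest_keywords(doc_topics, topic_words, k=4):
    """Rank words by sum over topics of p(topic|doc) * p(word|topic)."""
    scores = defaultdict(float)
    for topic_id, topic_prob in doc_topics:
        for word, word_prob in topic_words[topic_id].items():
            scores[word] += topic_prob * word_prob
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up topic mixture for "Hey, I failed to use sudo apt install python3".
doc_topics = [(0, 0.7), (1, 0.3)]
keywords = suggest_keywords(doc_topics, TOPIC_WORDS)
print(keywords)
```

Because the score depends only on the topics, a word such as `ubuntu` can be suggested even though the dialog never mentions it.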

## Corpus: Ubuntu dialog (dataset)

The corpus we use here is the [Ubuntu dialog corpus](http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/), a multi-turn dialog dataset introduced in [this paper](https://arxiv.org/abs/1506.08909).

Notice that version 2.0 of the corpus is already available, but we still use 1.0 here.

We should download the [raw data](http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/ubuntu_dialogs.tgz), as the first step in data science is to deal with dirty data.

## The libs you need in this tutorial
```
flashtext
gensim
nltk
stop_words
```
All of these can be easily installed using `pip`.

## Data preprocessing

Combine all dialogs into one text file, one dialog per line, keeping only the conversation text and removing punctuation and stop words.


## Notes on preprocessing
### Stop words
The stop words come from a Python lib called stop_words.

### flashtext
flashtext is a powerful Python lib for replacing keywords. We might want to use it to remove stop words and punctuation.

However, filtering out stop words with regex can be time-consuming, because the runtime grows linearly with the size of the keyword set you want to filter out. A good tool for this is [flashtext](https://github.com/vi3k6i5/flashtext), a word-matching library based on a prefix tree. Its runtime remains O(n), where n is the length of the sentence, while regex might go up to O(mn), where m is the number of stop words. In other words, flashtext's performance is insensitive to the size of the keyword set. It reduced my processing time significantly.
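To make the difference concrete, here is a small standard-library comparison. It does not use flashtext itself: the set lookup is a stand-in that shares the key property (the cost per token does not depend on how many stop words there are), while the regex version builds one big alternation over all stop words. The stop word set and the sentence are made up.

```python
import re

STOPWORDS = {"i", "to", "the", "a", "is"}  # hypothetical tiny stop word set


def remove_with_regex(sentence, stopwords):
    """Blank out stop words with one alternation pattern, then split."""
    alternation = "|".join(map(re.escape, sorted(stopwords)))
    pattern = re.compile(r"\b(?:%s)\b" % alternation)
    return pattern.sub(" ", sentence).split()


def remove_with_set(sentence, stopwords):
    """Split first, then drop tokens with a constant-time set lookup."""
    return [w for w in sentence.split() if w not in stopwords]


sentence = "i failed to install the python3 package"
print(remove_with_regex(sentence, STOPWORDS))
print(remove_with_set(sentence, STOPWORDS))
```

Both functions return the same tokens on this input; they differ in how the work scales as the stop word list grows.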

### Stemming
Stemming means reducing similar word forms to one. For example, **install** and **installs** are both reduced to the same stem.

This is extremely important, because when building the model it is natural to reduce all words to their root form, which gives a better vector space.

Here I use the stemmer from [NLTK](http://www.nltk.org/).
I found that the stemmer sometimes trims a word at the wrong position. But that is fine for constructing the word-vector space.

### Tokenizing
Basically you can write a simple regex to do this, or even string.split() will do. But to gain better performance, I use the tokenizer from NLTK too. The regex I use for the tokenizer is **r'\w\S+\w'**, which keeps tokens that start and end with a word character. This filters out surrounding punctuation but keeps tokens like **don't** or **pre-installed**, where punctuation appears inside a word.
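To see what the pattern keeps and drops, here is a small check that uses the standard library's `re.findall` as a stand-in for NLTK's `RegexpTokenizer` (assumed to behave the same way for this pattern). The sample sentence is made up. One side effect worth noting: the pattern needs at least three characters, so one- and two-letter tokens are dropped as well.

```python
import re

# Same pattern as the tokenizer: word char, one or more non-spaces, word char.
TOKEN_PATTERN = re.compile(r"\w\S+\w")


def tokenize(text):
    """Return all lowercase substrings matching the tokenizer pattern."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("I don't have it pre-installed, ok?")
print(tokens)  # ["don't", 'have', 'pre-installed']
```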

In [14]:
from glob import glob
import codecs
import nltk
from nltk import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from stop_words import get_stop_words
from flashtext import KeywordProcessor
import pickle
import re
import os
from multiprocessing import Pool
import functools
import string

files = glob('./dialogs/**/*.tsv')


# create English stop words list
en_stop = set(get_stop_words('en'))
# create the set of punctuation characters to filter
punc = set(string.punctuation)
    
    

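def processing(i, file_list, stopword, punc):
    """Tokenize, filter and stem every dialog in file_list; write one dialog per line."""
    tk = RegexpTokenizer(r'\w\S+\w')
    p_stemmer = PorterStemmer()

    with codecs.open('whole_dialogs_stem_%d' % i, 'w', 'utf-8') as out:
        for fi in file_list:
            with codecs.open(fi, 'r', 'utf-8') as f:
                # the utterance is the last tab-separated field of each line;
                # the initial [] keeps reduce from failing on an empty file
                words = functools.reduce(lambda x, y: x + y,
                                         map(tk.tokenize, [line.strip().split('\t')[-1].lower() for
                                                           line in f]), [])
                # drop stop words and punctuation, then stem what is left
                words = [p_stemmer.stem(
                    w) for w in words if w not in stopword and w not in punc]
                out.write(' '.join(words) + '\n')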


### You might want to use multiple processes for this, otherwise it will take a long time.

### Also, I would like to recommend a handy tool called `tqdm`, which shows a progress bar. It's helpful when you're running a long task.

In [None]:
def div_list(l, n):
    # split l into n consecutive chunks; the last chunk takes the remainder
    length = len(l)
    t = length // n
    starts = [t * i for i in range(0, n)]
    ran = range(0, n - 1)
    result = [l[starts[i]:starts[i + 1]] for i in ran]
    result.append(l[starts[n - 1]:len(l)])
    return result

process_num = 8
p = Pool(process_num)
div_files = div_list(files, process_num)
# processing() takes (i, file_list, stopword, punc)
results = [p.apply_async(processing, args=(i, div_files[i], en_stop, punc))
           for i in range(process_num)]
p.close()
p.join()
# apply_async hides worker exceptions unless we ask for the results
for r in results:
    r.get()

# merge the per-process outputs into a single file
for output in glob('./whole_dialogs_stem_*'):
    os.system('cat %s >> whole_dialogs_stem' % output)
    os.system('rm %s' % output)

It takes about 20 minutes to run on my laptop. For convenience, I write the whole processed corpus to a single file.

## Introduction to gensim
[Gensim](https://radimrehurek.com/gensim/index.html) is a powerful topic-modelling lib for Python. It provides a rich set of models that can be used off the shelf. What's more, it provides useful tools for constructing a corpus with a word-vector representation.

### First we load the data and build a dictionary from the corpus. The data fed into the dictionary should be a list of documents, where each document is a list of tokens.

In [39]:
from gensim import corpora, models
with open('whole_dialogs_stem', 'r', encoding='utf-8') as f:
    raw_corpus = [line.split() for line in f]
dictionary = corpora.Dictionary(raw_corpus)

### Then you might want to filter out rare words, as well as words that appear too frequently. These words can significantly slow down training and also hurt the model.

In [40]:
dictionary.filter_extremes(no_below=3, no_above=0.2, keep_n=100000)
dictionary.save('ubuntu.dict')
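To make the two thresholds concrete, here is a plain-Python sketch of the filtering rule as I understand it: `no_below` is an absolute number of documents and `no_above` is a fraction of the corpus. This is not gensim's implementation, it ignores `keep_n`, and the toy documents are made up.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below, no_above):
    """Keep tokens that appear in at least `no_below` documents
    and in at most a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, n in doc_freq.items() if no_below <= n <= max_docs}


docs = [
    ["ubuntu", "grub", "boot"],
    ["ubuntu", "samba", "share"],
    ["ubuntu", "grub", "livecd"],
    ["ubuntu", "printer", "share"],
]
kept = filter_extremes_sketch(docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here `ubuntu` is dropped because it appears in every document, and the words that appear only once are dropped as too rare.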

### Then we convert the corpus into bag-of-words vectors

In [41]:
corpus = [dictionary.doc2bow(doc) for doc in raw_corpus]

### Here is what a document looks like. Notice that the word order doesn't matter anymore, only the frequency distribution.

In [42]:
print(corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 2), (15, 1), (16, 1), (17, 1)]


### And let's train an LDA model
We can use `LdaMulticore` to take advantage of the multiple processors in your machine. Training needs a corpus and a dictionary for lookup, and more importantly you have to specify `num_topics`, a hyper-parameter that has to be tuned. A good idea is to train the model repeatedly with different values until the similarity among topics is minimized. But this might take a long time.
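One simple way to quantify "similarity among topics" is the average Jaccard overlap between the top words of each pair of topics; lower means more distinct topics. The sketch below is only an assumption about how such a score could be computed, not the procedure used in this tutorial, and the short word lists are illustrative.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def mean_topic_overlap(topics):
    """Average pairwise Jaccard similarity between topics' top-word lists."""
    pairs = list(combinations(topics, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


topics = [
    ["file", "directori", "folder", "copi"],
    ["gnome", "desktop", "kde", "xubuntu"],
    ["boot", "grub", "window", "live"],
    ["share", "window", "samba", "printer"],
]
print(mean_topic_overlap(topics))
```

Computing this score for models trained with several `num_topics` values gives one rough way to compare them.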

You can also search for this parameter by plotting perplexity against `num_topics`.
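Perplexity is the exponential of the negative average per-word log-likelihood on held-out text; lower is better. A minimal sketch of that formula follows, with made-up probabilities. In gensim, my understanding is that the model's `log_perplexity` method reports a per-word bound from which a comparable number can be derived, but treat that as an assumption and check the documentation.

```python
import math


def perplexity(word_log_probs):
    """exp(-(1/N) * sum of log p(word)) over N held-out words."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


# A model that spreads probability evenly over 50 words has perplexity 50.
uniform = [math.log(1 / 50)] * 200
print(round(perplexity(uniform), 6))

# A model that is more confident about the observed words scores lower.
confident = [math.log(0.2)] * 200
print(round(perplexity(confident), 6))
```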

For tutorial purposes, I just use `num_topics = 50` here.

In [43]:
model = models.LdaMulticore(corpus, id2word=dictionary, workers=3,
                            num_topics=50)

Be sure to save the model. Also keep the dictionary we saved before; it will be needed at inference time as a lookup table.

In [44]:
model.save('ubuntu.lda')

Let's look at the word distributions of the topics.

In [45]:
model.print_topics()

[(44,
  '0.096*"file" + 0.043*"directori" + 0.041*"folder" + 0.028*"copi" + 0.025*"permiss" + 0.025*"home" + 0.020*"chang" + 0.018*"sudo" + 0.014*"save" + 0.014*"creat"'),
 (36,
  '0.074*"gnome" + 0.041*"desktop" + 0.040*"kde" + 0.020*"want" + 0.018*"distro" + 0.017*"xubuntu" + 0.014*"differ" + 0.013*"uniti" + 0.013*"app" + 0.012*"xfce"'),
 (40,
  '0.017*"well" + 0.015*"thing" + 0.014*"think" + 0.013*"realli" + 0.013*"time" + 0.012*"yeah" + 0.012*"one" + 0.012*"good" + 0.011*"now" + 0.010*"lol"'),
 (33,
  '0.113*"question" + 0.090*"window" + 0.086*"linux" + 0.077*"ask" + 0.036*"answer" + 0.016*"want" + 0.012*"hello" + 0.011*"one" + 0.011*"quick" + 0.010*"guy"'),
 (32,
  '0.082*"script" + 0.032*"python" + 0.028*"firewal" + 0.025*"bash" + 0.023*"rule" + 0.022*"iptabl" + 0.019*"block" + 0.019*"write" + 0.017*"echo" + 0.016*"firestart"'),
 (34,
  '0.049*"bug" + 0.043*"free" + 0.024*"report" + 0.017*"ipod" + 0.014*"linux" + 0.013*"good" + 0.012*"thank" + 0.012*"buy" + 0.011*"softwar" + 0.01

### Topic detection with LDA model
Finally, let's see how the model performs inference on a new document.

As this is unsupervised learning, there are no labels for evaluation. Generally people use perplexity and some other metrics to measure how well the model performs.

But here I will just use the model to infer some keywords for a given document.

#### To make sure the model works on new data, we should apply the same preprocessing to the document.

In [15]:
from gensim import corpora, models

tk = RegexpTokenizer(r'\w\S+\w')
p_stemmer = PorterStemmer()
dictionary = corpora.Dictionary.load('ubuntu.dict')
model = models.LdaMulticore.load('ubuntu.lda')

def preprocess(fi, stopword, punc, tk, stemmer, dictionary):
    with codecs.open(fi, 'r', 'utf-8') as f:
        words = functools.reduce(lambda x, y: x + y,
                                 map(tk.tokenize, [line.strip().split('\t')[-1].lower() for
                                                   line in f]))
        words = [stemmer.stem(
            w) for w in words if w not in stopword and w not in punc]
        return dictionary.doc2bow(words)
    
def inference(doc, model, k=3):
    # get the top k topics, sorted by probability
    topics = sorted(model[doc], key=lambda x: x[1], reverse=True)[:k]
    return topics
    

Let's try a test file

In [46]:
test_file = './test/100002.tsv'
print('original dialog:')
with open(test_file,'r') as f:
    for line in f:
        print(line)
doc = preprocess(test_file, en_stop, punc, tk, p_stemmer, dictionary)
topics = inference(doc, model)

original dialog:
2008-09-03T06:57:00.000Z	brettley		if i installed another linux opperating system next to ubuntu, would it take the GRUB i have now and modify it?

2008-09-03T06:59:00.000Z	wols_	brettley	it would install anotehr bootloader

2008-09-03T07:00:00.000Z	brettley	wols_	like what?



Here are the top 3 topics, with their probabilities.

In [47]:
topics

[(12, 0.7095178708078098),
 (11, 0.12502156021403896),
 (35, 0.08712723564481913)]

And let's see what keywords we can find in the topics.

In [48]:
for t in topics:
    LDA_keywords = sorted(model.get_topic_terms(t[0]), key=lambda x:x[1], reverse=True)
    LDA_keywords = dict((dictionary[i[0]],i[1]) for i in LDA_keywords)
    print('-'*50)
    print('topic %d, prob: %f'%(t[0],t[1]))
    print(LDA_keywords)
                        

--------------------------------------------------
topic 12, prob: 0.709518
{'boot': 0.12062512636333775, 'grub': 0.052740154034447956, 'window': 0.03675319373613759, 'live': 0.03537946195971994, 'livecd': 0.01556302143835399, 'will': 0.01335677179709427, 'bio': 0.011869683427816443, 'now': 0.01146358867856986, 'option': 0.011268242272930003, 'want': 0.010371099908555202}
--------------------------------------------------
topic 11, prob: 0.125022
{'share': 0.09041290024881912, 'window': 0.05304941127631381, 'samba': 0.0511428194971934, 'printer': 0.039550153515780714, 'mac': 0.03575682170535402, 'print': 0.03329300179646669, 'xmm': 0.022617994471666004, 'vim': 0.017858619265137876, 'network': 0.017451469127535375, 'linux': 0.017051113616151126}
--------------------------------------------------
topic 35, prob: 0.087127
{'send': 0.05996920976719629, 'raid': 0.043209108945681196, 'mail': 0.04018007224764992, 'email': 0.02659783011425958, 'map': 0.013692063313678489, 'softwar': 0.01252943

### Experiment result
I've tried out some test cases; the results seem reasonable but not as good as I expected.
Feel free to try some test cases with the model yourself.


I've tried training the model with different parameters, but couldn't find a very good setting. Actually, LDA has always been somewhat controversial, mainly due to two weaknesses:
* the parameters are hard to tune and the tuning is somewhat empirical (you can use a more sophisticated parameter search, but it is very time-consuming and not realistic in production)
* the topics themselves are generated by clustering, so they are sometimes hard for human beings to interpret.

The reason I wanted to try LDA is that when I first looked at the [official example](https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation), it gave reasonable results. But it seems that the model doesn't work very well with the Ubuntu corpus.

I found a [good paper](https://arxiv.org/abs/1608.08176) discussing the problems of the LDA model.


### Some discussion

1. As mentioned above, the experiment doesn't produce very good results. But when choosing a complicated machine learning model, we always take the risk that the experiment might not give a perfect outcome. And even if we can tune the model for a specific dataset by spending a long time on the parameters, that doesn't make much sense in production.

2. As an unsupervised learning model (clustering), it's actually not intuitive to use it for a classification task, because we don't even have a label for each topic the model generates. What the topics look like is totally determined by the algorithm itself. Even if it makes sense from a mathematical perspective, it might not be a good model in the real world.

3. Another reason why LDA doesn't perform well on the Ubuntu dialog corpus is (I guess) that the dialogs are quite short, with only a few words that can represent the features of a topic. I have a project using LDA for topic classification of newspaper articles, where it makes much more sense than on the Ubuntu corpus. I doubt whether LDA is a good model for such a corpus. But the outcome looks somewhat reasonable, so it's a meaningful experiment.

4. After all, I hope I have given you some ideas about topic modeling, as well as some preprocessing approaches and the way we use them for topic detection.