
### Part I

- Topic modeling and LDA in a nutshell

### Part II

- Load input data.
- Pre-process that data.
- Transform documents into bag-of-words vectors.
- Train LDA model.
- Visualize LDA model using pyLDAvis.


# What is topic modeling?

___
A type of statistical model for discovering the abstract "topics" that occur in a collection of documents. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
> [Topic Modeling - Intro & Implementation](https://www.kaggle.com/akashram/topic-modeling-intro-implementation)
___

- Topic modeling is a form of unsupervised learning that identifies hidden relationships in data.

- Being unsupervised, topic modeling doesn’t need labeled data. It can be applied directly to a set of text documents to extract information.

- Topic modeling works in an exploratory manner, looking for the themes (or topics) that lie within a set of text data.

- There is no prior knowledge about the themes required in order for topic modeling to work.

- It discovers topics using a probabilistic framework to infer the themes within the data based on the words observed in the documents.

- Topic modeling is a versatile way of making sense of an unstructured collection of text documents.

- It can be used to automate the process of sifting through large volumes of text data and help to organize and understand it.

- Once key topics are discovered, text documents can be grouped for further analysis, to identify trends, for instance, or as a form of classification.

See: https://highdemandskills.com/topic-modeling-intuitive/


### LDA: Latent Dirichlet Allocation

Source: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/



Suppose you have the following set of sentences:

    I like to eat broccoli and bananas.
    I ate a banana and spinach smoothie for breakfast.
    Chinchillas and kittens are cute.
    My sister adopted a kitten yesterday.
    Look at this cute hamster munching on a piece of broccoli.

What is latent Dirichlet allocation? It’s a way of automatically discovering the topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like:

    Sentences 1 and 2: 100% Topic A
    Sentences 3 and 4: 100% Topic B
    Sentence 5: 60% Topic A, 40% Topic B
    Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
    Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

### LDA model

- LDA topic modeling discovers topics that are hidden (latent) in a set of text documents.

- It does this by inferring possible topics based on the words in the documents. It uses a generative probabilistic model and Dirichlet distributions to achieve this.

- The inference in LDA is based on a Bayesian framework. This allows the model to infer topics based on observed data (words) through the use of conditional probabilities.

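- A generative probabilistic model describes a random process that could have produced the observed data. In LDA, each document is assumed to draw its own mixture of topics, and each word is generated by first picking a topic from that mixture and then picking a word from that topic. Fitting the model means finding the topics and mixtures that make the observed documents most plausible. This goes beyond mere description: by learning how the observed data could be generated, a generative model learns the essential features that characterize the data.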
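The generative story behind LDA can be sketched in a few lines of standard-library Python. This is a toy illustration, not how gensim trains a model: the two topics and their word probabilities are made-up numbers loosely based on the example sentences above, and `sample_dirichlet` and `generate_document` are hypothetical helper names. The Dirichlet draw is obtained by normalizing Gamma samples.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Draw this document's topic proportions from a symmetric Dirichlet prior.
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two hand-written toy topics (assumed numbers, for illustration only).
topics = [
    {"broccoli": 0.4, "banana": 0.3, "breakfast": 0.2, "munching": 0.1},
    {"chinchilla": 0.25, "kitten": 0.25, "cute": 0.3, "hamster": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(topics, alpha=0.5, length=8, rng=rng)
print([round(p, 2) for p in theta])  # topic proportions of the generated document
print(words)                         # the generated "document"
```

Training an LDA model runs this story in reverse: only the words are observed, and the topic proportions and word distributions are inferred.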


# Data

In [None]:
import pandas as pd

In [None]:
df=pd.read_csv('../input/hackernews-umbrella-topics/hn.csv')

In [None]:
df.head()

In [None]:
df.keyw.unique()

In [None]:
df.groupby('keyw').count()['title'].sort_values(ascending=False)

In [None]:
df_priv=df[(df['keyw']=='privacy') & ~df['text'].isna()]

In [None]:
docs=df_priv['text'].tolist()

In [None]:
print(len(docs))

In [None]:
print(docs[0])

# Pre-process and vectorize the documents

In [None]:
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# The stop word list must be downloaded once: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

# Remove stop words.
docs = [[token for token in doc if token not in stop_words] for doc in docs]

In [None]:
docs[0][:10]
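To see what each filter in the cell above does, here is a small standard-library sketch on a made-up sentence. It assumes that `re.findall(r"\w+", ...)` behaves like `RegexpTokenizer(r'\w+')` for this input, and it uses a tiny hypothetical stop word set in place of NLTK's much longer list.

```python
import re

# Tiny stand-in for NLTK's English stop word list (assumption: the real list is longer).
toy_stop_words = {"i", "a", "the", "in", "my", "and"}

text = "In 2021 I bought a Pi4 and the GDPR rules changed my setup"

tokens = re.findall(r"\w+", text.lower())           # lowercase, then split into word tokens
tokens = [t for t in tokens if not t.isnumeric()]   # drops '2021' but keeps 'pi4'
tokens = [t for t in tokens if len(t) > 1]          # drops single-character tokens
tokens = [t for t in tokens if t not in toy_stop_words]

print(tokens)
# ['bought', 'pi4', 'gdpr', 'rules', 'changed', 'setup']
```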

In [None]:
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

# The WordNet data must be downloaded once: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

In [None]:
docs[0][:10]

In [None]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

In [None]:
docs[1][-30:]

In [None]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

In [None]:
dictionary
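The document-frequency filter applied by `filter_extremes` above can be approximated in plain Python. This is a simplified sketch on toy data: `filter_by_document_frequency` is a hypothetical helper, and it ignores the additional `keep_n` cap on vocabulary size that gensim applies.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most a fraction `no_above` of docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["privacy", "data", "browser"],
    ["privacy", "data", "tracking"],
    ["privacy", "cookie", "tracking"],
    ["privacy", "data", "vpn"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
# ['data', 'tracking']
```

Here "privacy" is dropped because it appears in every document, and "browser", "cookie" and "vpn" are dropped because they each appear in only one.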

# Bag-of-words

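A bag-of-words representation keeps only how often each word occurs in a document and discards word order. For example, the tokens

`"John","likes","to","watch","movies","Mary","likes","movies","too"`

become

`BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}`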
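The counting step can be sketched with the standard library. The second half mimics the sparse `(token_id, count)` pairs that `doc2bow` returns; the id assignment used here (order of first appearance) is an assumption for illustration, since gensim's `Dictionary` assigns its own ids.

```python
from collections import Counter

tokens = ["John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"]

# Bag-of-words: word order is discarded, only the counts are kept.
bow = Counter(tokens)
print(dict(bow))

# Sparse representation as (token_id, count) pairs.
# Ids follow order of first appearance here, which is an illustrative assumption.
token2id = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
sparse_bow = sorted((token2id[tok], n) for tok, n in bow.items())
print(sparse_bow)
# [(0, 1), (1, 2), (2, 1), (3, 1), (4, 2), (5, 1), (6, 1)]
```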

In [None]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [None]:
corpus[0][:10]

In [None]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

# Training

- How many topics?
    - There is no fixed rule: experiment with several values of `num_topics` and compare the coherence scores of the resulting models.
- `chunksize` controls how many documents are processed at a time in the training algorithm. Increasing `chunksize` speeds up training, as long as a chunk of documents fits easily into memory. It can, however, also influence the quality of the model.
- `passes` controls how often the model is trained on the entire corpus; another word for passes is "epochs". `iterations` is somewhat technical, but essentially it controls how often a particular loop is repeated over each document. It is important to set both `passes` and `iterations` high enough for the model to converge.
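As a rough way to reason about how `chunksize` and `passes` interact, the sketch below counts chunk-level updates. It assumes the model is updated once per chunk, which is my understanding of gensim's default single-core behavior; `count_model_updates` is a hypothetical helper and the corpus sizes are made up.

```python
import math

def count_model_updates(num_docs, chunksize, passes):
    """Rough count of chunk-level updates, assuming one model update per chunk."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    return chunks_per_pass * passes

# Hypothetical corpus sizes, with the chunksize and passes values used in this notebook.
for num_docs in (1_500, 2_000, 10_000):
    print(num_docs, count_model_updates(num_docs, chunksize=2000, passes=20))
# 1500 20
# 2000 20
# 10000 100
```

When the corpus is smaller than `chunksize`, each pass amounts to a single batch update, so the number of passes largely determines how much training the model receives.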

In [None]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

In [None]:
top_topics = model.top_topics(corpus)  # Optionally pass topn=20 to set the number of words per topic.

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

In [None]:
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(model, corpus, dictionary)

# Hyperparameters

- There is no universally "best" choice.
- `alpha` controls the prior distribution over topic weights in each document, while `eta` controls the prior distribution over word weights in each topic. In gensim, both default to a symmetric `1 / num_topics` prior; the training code above sets both to `'auto'` instead, which lets gensim learn an asymmetric prior from the corpus.
- `alpha` and `eta` can be thought of as smoothing parameters when computing how much each document "likes" a topic (in the case of `alpha`) or how much each topic "likes" a word (in the case of `eta`).
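To get a feel for what the prior does, the sketch below draws topic mixtures from symmetric Dirichlet priors with different concentration values and reports how dominant the largest topic is on average. It is a standard-library illustration, not gensim code; `sample_dirichlet` and `mean_largest_weight` are hypothetical helpers, and the concentration values are arbitrary.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_weight(concentration, num_topics=10, trials=2000, seed=0):
    """Average size of the largest component over many symmetric Dirichlet draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(sample_dirichlet([concentration] * num_topics, rng))
    return total / trials

# Small values favor documents dominated by one topic; large values favor even mixtures.
for concentration in (0.1, 1.0, 10.0):
    print(concentration, round(mean_largest_weight(concentration), 3))
```

With a small concentration, most of a document's weight lands on a single topic; with a large one, the weight is spread almost evenly, which is the "smoothing" effect described above.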

# Coherence 
https://rare-technologies.com/what-is-topic-coherence/
https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
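As a rough intuition for what a coherence score measures, here is a standard-library sketch of one common formulation of UMass coherence: for each pair of top words in a topic, it checks how often the lower-ranked word co-occurs in documents with the higher-ranked one. This is a simplified illustration on toy data; `umass_coherence` is a hypothetical helper, and gensim's own implementation may aggregate and normalize differently, so the numbers are not comparable with `top_topics` output.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs, eps=1.0):
    """UMass-style coherence of one topic; `top_words` are ordered by descending weight."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # j < i, so w_j ranks higher
        w_j, w_i = top_words[j], top_words[i]
        # Assumes every top word occurs in at least one document.
        score += math.log((doc_count(w_i, w_j) + eps) / doc_count(w_j))
    return score

toy_docs = [
    ["privacy", "data", "tracking"],
    ["privacy", "data", "cookie"],
    ["vpn", "browser", "privacy"],
    ["vpn", "browser", "tracking"],
]

coherent = umass_coherence(["privacy", "data", "cookie"], toy_docs)
mixed = umass_coherence(["data", "vpn", "cookie"], toy_docs)
print(round(coherent, 3), round(mixed, 3))
```

Scores closer to zero indicate that a topic's top words tend to appear in the same documents; the mixed word list scores lower because "data" and "vpn" never co-occur.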

### Source:
https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#data