# Topic Modeling

Topic modeling is the process of automatically identifying topics present in a text corpus by uncovering patterns in word occurrences in those texts.

## Latent Dirichlet Allocation (LDA)

LDA is a Bayesian (probabilistic) model that treats each document as a mixture over a finite set of topics, and each topic as a probability distribution over the words in the vocabulary (the words appearing in the texts).

For example:

Document1 = x% Topic1 + y% Topic2 + z% Topic3

Topic1 = a% Word1 + b% Word2 + c% Word3 + ...

The number of topics is finite and is chosen by the programmer, while the word probabilities of each topic are learned from the data. After training, you can list as many words per topic as you like, each with its assigned probability; the lower a word's probability, the less significant it is for that topic. For a given document, the topic with the highest probability is called the **dominant topic**.
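The mixture idea can be made concrete with a small simulation. The following is a minimal sketch, assuming a hypothetical five-word vocabulary and three topics; the names `vocab`, `topic_word` and `doc_topic` are illustrative and do not come from the notebook. It draws one document-topic mixture and one word distribution per topic, generates a short document from them, and reads off the dominant topic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 topics over a 5-word vocabulary
vocab = ["car", "battery", "election", "vote", "match"]
num_topics = 3

# Each topic is a probability distribution over the whole vocabulary
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=num_topics)

# Each document is a probability distribution over the topics
doc_topic = rng.dirichlet(alpha=np.full(num_topics, 0.5))

# Generate a 10-word document: pick a topic for each position,
# then pick a word from that topic's distribution
topic_per_word = rng.choice(num_topics, size=10, p=doc_topic)
words = [vocab[rng.choice(len(vocab), p=topic_word[t])] for t in topic_per_word]

# The dominant topic is the one with the largest share in the document mixture
dominant_topic = int(np.argmax(doc_topic))

print("Document-topic mixture:", np.round(doc_topic, 3))
print("Dominant topic:", dominant_topic)
print("Generated words:", words)
```

Training LDA works in the opposite direction: it starts from the observed words and estimates the two sets of distributions.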

**How to determine the optimum number of topics**

Using the topic coherence value, we can determine whether the top words of a topic support the overall topic cluster, that is, whether they tend to occur together in the texts.

Usually, increasing the number of topics for a set of documents increases the coherence value up to a point. The number of topics at which the curve reaches its highest value before flattening out (similar to the elbow method for finding the optimum K in K-Means clustering) should ideally be chosen as the number of topics.
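One way to turn this rule of thumb into code is to pick the smallest number of topics whose coherence is close to the best value seen. This is only a sketch: the scores in `coherence_by_k` are made-up illustrative numbers, and both the helper `pick_num_topics` and its `tolerance` parameter are assumptions rather than part of any library.

```python
# Hypothetical coherence scores keyed by number of topics (illustrative values, not results)
coherence_by_k = {4: 0.31, 6: 0.38, 8: 0.42, 10: 0.45, 12: 0.452}

def pick_num_topics(scores, tolerance=0.01):
    """Return the smallest number of topics whose score is within `tolerance` of the best score."""
    best = max(scores.values())
    candidates = [k for k, score in scores.items() if best - score <= tolerance]
    return min(candidates)

chosen_k = pick_num_topics(coherence_by_k)
print(chosen_k)
```

Preferring the smallest acceptable value keeps the model simpler once the curve has flattened out.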

# Implementation

## Imports

In [None]:
import re
import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import warnings
warnings.filterwarnings("ignore")

from nltk.corpus import stopwords
STOPWORDS = stopwords.words('english')

import contractions
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('../Datasets/internet_news_data/articles_data.csv')
df.head()

In [None]:
df.info()

In [None]:
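# Drop rows without a description and reset the index so that label 0 is the first remaining row
df = df.dropna(subset=['description']).reset_index(drop=True)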

In [None]:
df.info()

In [None]:
df.description[0]

## Basic Text Cleaning

In [None]:
def preprocess_text(x):
    # Expand contractions first, while the apostrophes are still present
    expanded_text = contractions.fix(x).lower()
    # Keep only letters, digits and whitespace
    cleaned_text = re.sub(r'[^a-z\d\s]+', '', expanded_text)
    # Split on any whitespace, drop stopwords and single-character tokens, then lemmatize
    word_list = [wnl.lemmatize(each_word) for each_word in cleaned_text.split()
                 if each_word not in STOPWORDS and len(each_word) > 1]
    return word_list

In [None]:
preprocess_text(df.description[0])

In [None]:
df['description_cleaned'] = df['description'].apply(preprocess_text)

## Creating Dictionary & Corpus for Topic Modeling
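Before using gensim, it may help to see what a bag-of-words corpus looks like. The sketch below imitates the idea in plain Python with hypothetical toy documents; `token2id` and `to_bow` are illustrative names, and the way ids are assigned here is an assumption that need not match the ids gensim's `Dictionary` produces. Each document becomes a list of `(token_id, count)` pairs, which is the same shape `doc2bow` returns.

```python
from collections import Counter

# Hypothetical tokenized documents standing in for the cleaned descriptions
docs = [["tesla", "recall", "car", "tesla"], ["election", "vote", "car"]]

# Assign an integer id to each distinct token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(bow_corpus)
```

Word order is discarded in this representation; only how often each token occurs in a document is kept.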

In [None]:
list_all_descriptions = list(df['description_cleaned']) # Format - List of list of tokens

In [None]:
list_all_descriptions[0]

In [None]:
id2word = corpora.Dictionary(list_all_descriptions)

In [None]:
corpus = [id2word.doc2bow(each_description) for each_description in list_all_descriptions]

In [None]:
corpus[0] # format - (word_id, word_frequency).

In [None]:
id2word[20] # tesla appears twice

## LDA Model

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=10,
                                            random_state=42,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

In [None]:
lda_model.print_topics()

### Compute Coherence Score
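To give a feel for what a coherence score rewards, here is a simplified sketch of a UMass-style score based on document co-occurrence counts. It is not the `c_v` measure used in this notebook, which is more involved; the function name `umass_style_coherence`, the smoothing parameter `eps` and the toy documents are all assumptions made for illustration.

```python
import math
from itertools import combinations

def umass_style_coherence(top_words, documents, eps=1.0):
    """Average log conditional co-occurrence over pairs of a topic's top words.

    Assumes top_words is ordered from most to least probable and that
    every word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for earlier, later in combinations(top_words, 2):
        # How often the lower-ranked word appears alongside the higher-ranked one
        scores.append(math.log((doc_freq(earlier, later) + eps) / doc_freq(earlier)))
    return sum(scores) / len(scores)

# Hypothetical toy documents
toy_docs = [["car", "battery", "tesla"], ["car", "tesla"],
            ["election", "vote"], ["vote", "car"]]

coherent = umass_style_coherence(["car", "tesla"], toy_docs)
incoherent = umass_style_coherence(["car", "election"], toy_docs)
print(coherent, incoherent)
```

Words that often share a document score higher than words that never do, which is the intuition behind using coherence to compare topic models.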

In [None]:
coherence_model_lda = CoherenceModel(model=lda_model,
                                     texts=list_all_descriptions,
                                     dictionary=id2word,
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: {}'.format(coherence_lda))

### Obtain the optimum number of topics

In [None]:
def get_optimum_num_topics(min_topics, max_topics, steps=2, graph=True):
    list_models = []
    list_coherence_score = []
    for i in range(min_topics, max_topics + 1, steps):
        print("Number of Topics: {}".format(i))
        lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                    id2word=id2word,
                                                    num_topics=i,
                                                    random_state=42,
                                                    chunksize=100,
                                                    passes=10,
                                                    alpha='auto',
                                                    per_word_topics=True)
        list_models.append(lda_model)
        coherence_model_lda = CoherenceModel(model=lda_model,
                                             texts=list_all_descriptions,
                                             dictionary=id2word,
                                             coherence='c_v')
        coherence_lda = coherence_model_lda.get_coherence()
        list_coherence_score.append(coherence_lda)
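    if graph:
        sns.lineplot(x=list(range(min_topics, max_topics + 1, steps)), y=list_coherence_score)
        plt.show()
    # Return the trained models and their scores so they can be reused without retraining
    return list_models, list_coherence_score

In [None]:
list_models, list_coherence_score = get_optimum_num_topics(5, 10)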

### Retrain model

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=7,
                                            random_state=42,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

In [None]:
lda_model.print_topics(num_words=5)

In [None]:
def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=list_all_descriptions):
    # Collect one row per document (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        # With per_word_topics=True each result is a tuple; element 0 holds (topic_id, probability) pairs
        topic_num, prop_topic = max(doc_topics[0], key=lambda x: x[1])
        # Get the keywords of the dominant topic
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add the tokenized text to the end of the output
    contents = pd.Series(texts, name='Text')
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df

In [None]:
df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=list_all_descriptions)

In [None]:
df_topic_sents_keywords

In [None]:
df_topic_sents_keywords['Dominant_Topic'].value_counts()