# Intro
In natural language processing, **Latent Dirichlet Allocation** (LDA) is a generative statistical model that allows us to divide a collection of texts into N subgroups, where each subgroup is characterized by a set of X keywords, and this set of keywords is associated with a topic. A topic is a probability distribution over words and a text is a probability distribution over topics; in LDA, both distributions are assumed to be drawn from Dirichlet priors.

Let's say we have three texts:<br>
"Dogs like playing."<br>
"Cats like milk."<br>
"Cats and dogs like eating and playing. I love dogs. They are adorable." <br>

The results from an LDA model could be the following:<br>

Text1: 100% Topic1 + 0% Topic2<br>
Text2: 100% Topic2 + 0% Topic1<br>
Text3: 70% Topic1 + 30% Topic2<br>

Here each topic is represented by a set of words, ordered from most to least relevant:<br>

Topic1: 30% dog, 30% playing, 20% like, 10% adorable, 10% love<br>
Topic2: 50% cat, 30% milk, 20% like<br>

# When to apply LDA topic modeling

LDA topic modeling is useful when we have a collection of documents and wish to understand what the collection/archive contains without necessarily reading every document.
   - If we are working with a small number of documents (or even a single document), word frequency counts (or TF-IDF) might be sufficient to get an idea of what the text is about. <br>
   - However, if we have a large number of documents, then topic modeling might be a good approach.
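
For the small-collection case, a minimal sketch of word counts and a plain TF-IDF ranking on the three example texts could look like this. The lower-cased, punctuation-free strings and the unsmoothed `tf * log(N / df)` formula are assumptions made for illustration; real libraries use various smoothed variants.

```python
from collections import Counter
import math

# The three example texts, lower-cased and without punctuation (assumed preprocessing)
docs = [
    "dogs like playing",
    "cats like milk",
    "cats and dogs like eating and playing i love dogs they are adorable",
]
tokenized = [doc.split() for doc in docs]

# Raw term frequencies per document
term_counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(word, counts, n_docs=len(docs)):
    """Plain tf * log(N / df) score of a word in one document."""
    tf = counts[word] / sum(counts.values())
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf


for i, counts in enumerate(term_counts, start=1):
    ranked = sorted(counts, key=lambda w: tf_idf(w, counts), reverse=True)
    print(f"Text{i}:", ranked[:3])
```

A word such as "like", which occurs in every text, gets a score of zero, which is why TF-IDF is often preferred over raw counts for spotting what is specific to a document.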
   
# Theory: How LDA model works and what's behind it
In the latent Dirichlet allocation (LDA) model, each document is characterized by a mixture of topics, and the topic proportions are assumed to follow a Dirichlet distribution.


In LDA probabilistic topic modeling:
- a collection of documents (texts) $D$ is given
- each document $d$ from the collection is a sequence of words $W_{d} = (w_{1}, ..., w_{n_{d}})$ from the dictionary $W$, where $n_{d}$ is the length of document $d$
- each document may be related to one or several topics
- the order of documents in the collection is not important
- the order of words in a document is neglected; each document is treated as a "bag of words"
- the document collection is considered as a set of "document-word" pairs $(d,w), d \in D, w \in W_{d}$
- each topic $t\in T$, where $T$ is the set of topics, is described by a discrete distribution $p(w|t)$ over the words $w\in W$, drawn from a Dirichlet prior; in other words, there are topic vectors $\phi_{t} = (p(w|t):w \in W)$
- each document $d\in D$ is described by a discrete distribution $p(t|d)$ over the topics $t\in T$, also drawn from a Dirichlet prior; in other words, there are document vectors $\theta_{d} = (p(t|d):t \in T)$
<br>
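
The assumptions listed above can be read as a recipe for generating a document. The sketch below follows that recipe with the standard library only; the vocabulary, the prior values `alpha` and `beta`, and the helper names are illustrative assumptions, not part of this notebook's pipeline.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(n_words, vocab, phi, alpha, rng):
    """Generate one bag of words following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # p(t|d) for this document
    words = []
    for _ in range(n_words):
        t = rng.choices(range(len(phi)), weights=theta)[0]  # pick a topic
        w = rng.choices(vocab, weights=phi[t])[0]           # pick a word from that topic
        words.append(w)
    return theta, words


rng = random.Random(0)
vocab = ["dog", "playing", "like", "cat", "milk"]
n_topics = 2
beta = [0.5] * len(vocab)   # assumed Dirichlet prior over words
alpha = [0.5] * n_topics    # assumed Dirichlet prior over topics

# One word distribution p(w|t) per topic
phi = [sample_dirichlet(beta, rng) for _ in range(n_topics)]

theta, words = generate_document(8, vocab, phi, alpha, rng)
print(theta, words)
```

Training an LDA model goes in the opposite direction: given only the words, it estimates the `phi` and `theta` that most plausibly produced them.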

Assuming that, given the topic, a word does not depend on the document, the probability of a "document-word" pair occurring can be written as:
$$
p(d,w)=\sum\limits_{t\in T}p(t)p(w|t)p(d|t)
$$

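To build a topic model means to find the matrices $\Phi = ||p(w|t)||$ and $\Theta = ||p(t|d)||$ given the collection $D$.

By Bayes' rule, $p(t)p(d|t) = p(d)p(t|d)$, so the probability of a word in a document is $p(w|d) = \sum\limits_{t\in T}p(w|t)p(t|d)$, which depends only on $\Phi$ and $\Theta$. To find a solution, we need to solve an optimization problem, i.e. to maximize the log-likelihood:
$$
\sum\limits_{d\in D}\sum\limits_{w\in d}n_{dw}\log p(w|d)\to\max\limits_{\Phi,\Theta},
$$
where $n_{dw}$ is the frequency of word $w$ in document $d$.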
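
To make the objective concrete, here is a small sketch that evaluates $p(w|d)$ and the log-likelihood for the first two toy texts from the introduction. The dictionaries `phi`, `theta` and `counts` simply restate that example; the lemmatized word counts are an assumption.

```python
import math

# Toy topic-word probabilities p(w|t), taken from the introduction
phi = {
    "Topic1": {"dog": 0.3, "playing": 0.3, "like": 0.2, "adorable": 0.1, "love": 0.1},
    "Topic2": {"cat": 0.5, "milk": 0.3, "like": 0.2},
}
# Toy document-topic probabilities p(t|d)
theta = {
    "Text1": {"Topic1": 1.0, "Topic2": 0.0},
    "Text2": {"Topic1": 0.0, "Topic2": 1.0},
}
# Word counts n_dw after (assumed) lemmatization
counts = {
    "Text1": {"dog": 1, "like": 1, "playing": 1},
    "Text2": {"cat": 1, "like": 1, "milk": 1},
}


def p_word_given_doc(word, doc):
    """p(w|d) = sum over topics of p(w|t) * p(t|d)."""
    return sum(phi[t].get(word, 0.0) * theta[doc][t] for t in phi)


def log_likelihood():
    """Sum of n_dw * log p(w|d) over all document-word pairs."""
    return sum(
        n * math.log(p_word_given_doc(w, d))
        for d, doc_counts in counts.items()
        for w, n in doc_counts.items()
    )


print(p_word_given_doc("dog", "Text1"))
print(log_likelihood())
```

A topic model is fitted by searching for the `phi` and `theta` that make this quantity as large as possible.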

# The goals:

This notebook aims to investigate the capabilities of LDA topic modeling techniques on Russian texts. The main goals are:
- to test LDA and Mallet LDA on Russian texts
- to find out whether filtering out the most common and the rarest words improves model performance
- to find the most frequent topics (e.g. the first 20) and their sets of keywords for the given collection of documents

In [None]:
# Installing the package for lemmatization (Russian language)
!pip install pymorphy2


In [None]:
import os
import re
import numpy as np
import pandas as pd
from pprint import pprint
import json
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim  
import matplotlib.pyplot as plt
from matplotlib import rc
from matplotlib import rcParams
%matplotlib inline
import nltk
import pymorphy2
from nltk import word_tokenize
from nltk.corpus import stopwords
stopwordsrus = set(stopwords.words('russian'))
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
# Checking up Kaggle directories, loading the dataset
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

rus_data = pd.read_csv("/kaggle/input/corpus-of-russian-news-articles-from-lenta/lenta-ru-news.csv")





In [None]:
# Let's have a look at the data:
rus_data

In [None]:

rus_data['topic'].unique()


In [None]:
# Let's shorten the dataset and exclude some of the topics: 

excluded_topics = ['Библиотека', 'Бывший СССР', '69-я параллель']
rusdata = rus_data.loc[~rus_data['topic'].isin(excluded_topics), 'text'].reset_index(drop=True)

In [None]:
# These Russian news articles we will use for LDA analysis:
rusdata

In [None]:
# Loading stopwords and combining them:
def read_stopwords(path):
    """Read one stopword per line from a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as readfile:
        return [line.rstrip() for line in readfile]

stops1 = read_stopwords('/kaggle/input/russianstopwords/RussianStopWords.txt')
stops2 = read_stopwords('/kaggle/input/stopwords/stopwords.txt')

# A set removes duplicates and keeps the membership test in process() fast
stopwordsru = set(stops1) | set(stops2) | stopwordsrus

In [None]:
# Tokenizing and removing stopwords:
def process(text):
    return list(t.lower() for t in word_tokenize(text) if t.isalpha() and t.lower() not in stopwordsru)

In [None]:
# Let's take only first 10000 articles for our analysis:
data = [process(t) for t in rusdata[:10000]]

In [None]:
# Lemmatizer for the Russian language:
morph = pymorphy2.MorphAnalyzer()
def lemmatizer(texts):
    # morph.parse returns candidate analyses sorted by score; keep the most likely one
    return [[morph.parse(word)[0] for word in text] for text in texts]

In [None]:
morph_data = lemmatizer(data)

In [None]:
# We need only the lemma of each word, without the additional morphological information, so let's extract it:
def extract_lemma(texts):
    return [[word.normal_form for word in text] for text in texts]

In [None]:
# This is our lemmatized data ready to be used further:
data_norm = extract_lemma(morph_data)

In [None]:
# Let's build the bigram and trigram models using gensim
bigram = gensim.models.Phrases(data_norm, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_norm], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

In [None]:
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

data_words_trigrams = make_trigrams(data_norm)

In [None]:
# We can see that bigrams and trigrams are built successfully
print(data_words_trigrams[0])
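
To give an intuition for what `min_count` and `threshold` control, here is a rough sketch of a collocation score in the spirit of Mikolov et al.: a pair of words scores highly when it occurs together more often than the individual word counts would suggest. The function name and the exact formula are assumptions for illustration; gensim's default scorer has a similar shape, but its details may differ.

```python
from collections import Counter


def phrase_scores(texts, min_count=1):
    """Score every adjacent word pair: (n_ab - min_count) / (n_a * n_b) * vocabulary size."""
    unigrams = Counter(tok for doc in texts for tok in doc)
    bigrams = Counter(pair for doc in texts for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), n_ab in bigrams.items()
    }


toy_texts = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["big", "city"],
]
scores = phrase_scores(toy_texts, min_count=1)
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, score)
```

Pairs whose score exceeds the threshold would be merged into a single token, so a higher threshold yields fewer phrases.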

In [None]:
# Let's create a dictionary of all unique words in the dataset using corpora from gensim
dictionary = corpora.Dictionary(data_words_trigrams)
# Now let's create the corpus, where for each text we count the occurrences of every dictionary word
corpus = [dictionary.doc2bow(doc) for doc in data_words_trigrams]
# We will also try to filter out unimportant words by their tf-idf score, so let's create the tf-idf model here too
tfidf = gensim.models.TfidfModel(corpus, id2word = dictionary)
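
For readers unfamiliar with the bag-of-words format, the following pure-Python sketch mimics what the dictionary and `doc2bow` step produce: each document becomes a list of `(word_id, count)` pairs. The helper names are hypothetical, and the ids gensim assigns may differ from the ones generated here.

```python
from collections import Counter


def build_token2id(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in texts:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_texts = [["cat", "milk", "cat"], ["dog", "milk"]]
token2id = build_token2id(toy_texts)
toy_corpus = [to_bow(doc, token2id) for doc in toy_texts]
print(token2id)
print(toy_corpus)
```

Word order is lost in this representation, which matches the "bag of words" assumption from the theory section.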

In [None]:
# A word from our dictionary:
dictionary[8]

In [None]:
# Let's find the max and min tf-idf scores over the whole corpus:
all_scores = [value for doc in corpus for _, value in tfidf[doc]]
tf_max = round(max(all_scores), 4)
tf_min = round(min(all_scores), 4)
print(tf_max, tf_min)
# Evenly spaced grid of values between the min and max score
tfidf_range = [round(num, 3) for num in np.arange(tf_min, tf_max, 0.005).tolist()]
# The cut-offs are the 5th and 95th percentiles of this grid, i.e. they mark 5% and 95%
# of the score range (not percentiles of the observed tf-idf values):
print(np.percentile(tfidf_range, 95), np.percentile(tfidf_range, 5))


In [None]:
# Let's select the low and high tf-idf thresholds and filter out the words outside them.
# Thus, we filter out words that are very common across documents (low tf-idf)
# and words with a very high tf-idf, which are typically rare.

low_value = np.percentile(tfidf_range, 5)
high_value = np.percentile(tfidf_range, 95)

filtered_corpus = []
for bow in corpus:
    # ids of the words whose tf-idf score falls outside [low_value, high_value]
    filter_ids = {word_id for word_id, value in tfidf[bow] if value < low_value or value > high_value}
    # keep the original counts of the remaining words
    new_bow = [(word_id, count) for word_id, count in bow if word_id not in filter_ids]
    filtered_corpus.append(new_bow)
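
Since `tfidf_range` is an evenly spaced grid between the smallest and the largest score, its percentiles are approximately positions in the value range rather than percentiles of the observed scores. When the scores are skewed, the two choices can keep very different sets of words. The sketch below compares them on a handful of hypothetical scores; the `percentile` helper and the numbers are assumptions for illustration only.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


# Hypothetical tf-idf scores pooled over all documents (skewed toward small values)
scores = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.06, 0.08, 0.90]

# Range-based cut-offs: positions between the minimum and maximum score
lo, hi = min(scores), max(scores)
range_low = lo + 0.05 * (hi - lo)
range_high = lo + 0.95 * (hi - lo)

# Empirical cut-offs: percentiles of the observed scores themselves
emp_low = percentile(scores, 5)
emp_high = percentile(scores, 95)

kept_by_range = [s for s in scores if range_low <= s <= range_high]
kept_by_percentile = [s for s in scores if emp_low <= s <= emp_high]
print(len(kept_by_range), len(kept_by_percentile))
```

Trying empirical percentiles of the pooled tf-idf values would be one way to continue the filtering experiment; whether it helps on this dataset is an open question.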

In [None]:
# Now we have prepared the data. We have two types of corpus:
# the original one without tf-idf filtering, called "corpus", and the one with tf-idf filtering, called "filtered_corpus"

# First, let's build the LDA model from Gensim without tf-idf filtering.
# Since the number of texts is quite large (10000), let's choose the number of topics equal to 80.
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=80, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [None]:
# Show Topics
pprint(lda_model.show_topics(formatted=False))

In [None]:
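# Compute the per-word likelihood bound
# log_perplexity returns a per-word log-likelihood bound, not the perplexity itself:
# perplexity = 2 ** (-bound), so a higher (less negative) bound means a better fit.
print('\nPer-word likelihood bound: ', lda_model.log_perplexity(corpus))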
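
The value returned by `log_perplexity` is a per-word likelihood bound, and gensim derives its perplexity estimate from it as `2 ** (-bound)`. A tiny sketch of the conversion, with hypothetical bound values that are not results from this notebook:

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity value."""
    return 2 ** (-per_word_bound)


# Hypothetical bounds for two models
bound_a, bound_b = -8.5, -9.2

print(perplexity_from_bound(bound_a))
print(perplexity_from_bound(bound_b))
```

The model with the higher (less negative) bound has the lower perplexity, so when comparing models on the same corpus, lower perplexity or equivalently a higher bound is better.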


In [None]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_words_trigrams, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
# Now let's build basic Gensim LDA model with tf-idf filtering:
lda_model = gensim.models.ldamodel.LdaModel(corpus=filtered_corpus,
                                           id2word=dictionary,
                                           num_topics=80, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [None]:
# Show Topics
pprint(lda_model.show_topics(formatted=False))

In [None]:
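# Compute the per-word likelihood bound
# log_perplexity returns a per-word log-likelihood bound, not the perplexity itself:
# perplexity = 2 ** (-bound), so a higher (less negative) bound means a better fit.
print('\nPer-word likelihood bound: ', lda_model.log_perplexity(filtered_corpus))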


In [None]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_words_trigrams, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# Mallet LDA Model

In [None]:
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip

In [None]:
!unzip mallet-2.0.8.zip

In [None]:

mallet_path = '/kaggle/working/mallet-2.0.8/bin/mallet'

In [None]:
# Let's run the Mallet LDA model with 145 topics.
# Note: gensim.models.wrappers was removed in gensim 4.0, so this cell needs gensim 3.x.
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=145, id2word=dictionary, random_seed=0)

In [None]:
# Show Topics
pprint(ldamallet.show_topics(formatted=False))

In [None]:
# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_words_trigrams, dictionary=dictionary, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

In [None]:
def format_topics_sentences(ldamodel, corpus, texts):
    """Collect the two most dominant topics (number, contribution, keywords) of every document."""
    columns = ['Dominant_Topic1', 'Dominant_Topic2',
               '%Topic_Contribution1', '%Topic_Contribution2',
               'Topic_Keywords1', 'Topic_Keywords2']
    keyword_cache = {}  # topic number -> comma-separated keywords
    rows = []

    for doc_topics in ldamodel[corpus]:
        # Sort the topics by their contribution to the document and keep the top two
        top_two = sorted(doc_topics, key=lambda x: x[1], reverse=True)[:2]
        row = {}
        for rank, (topic_num, topic_contrib) in enumerate(top_two, start=1):
            if topic_num not in keyword_cache:
                wp = ldamodel.show_topic(topic_num)
                keyword_cache[topic_num] = ", ".join(word for word, prop in wp)
            row[f'Dominant_Topic{rank}'] = int(topic_num)
            row[f'%Topic_Contribution{rank}'] = round(topic_contrib, 4)
            row[f'Topic_Keywords{rank}'] = keyword_cache[topic_num]
        rows.append(row)

    sent_topics_df = pd.DataFrame(rows, columns=columns)

    # Add the original text as the last column of the output
    contents = pd.Series(texts).reset_index(drop=True)
    return pd.concat([sent_topics_df, contents], axis=1)


df_topic_sents_keywords = format_topics_sentences(ldamodel=ldamallet, corpus=corpus, texts=rusdata[:10000])

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index(drop = True)


# Show
df_dominant_topic.head(10)
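
For a single document, finding the dominant topics boils down to taking the largest entries of its topic distribution. A minimal sketch with a hypothetical distribution (the numbers are made up, and `top_topics` is an illustrative helper, not a gensim function):

```python
import heapq


def top_topics(doc_topics, n=2):
    """Return the n (topic, contribution) pairs with the largest contribution."""
    return heapq.nlargest(n, doc_topics, key=lambda pair: pair[1])


# Hypothetical topic distribution of one document: (topic number, contribution)
doc_topics = [(0, 0.05), (1, 0.60), (2, 0.10), (3, 0.25)]

first, second = top_topics(doc_topics)
print("Dominant topic:", first)
print("Second topic:", second)
```

Looking at the second entry as well is useful when the dominant contribution is far from 100%, as for the mixed example text in the introduction.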

In [None]:
# Here we can select a text and see its first and second most dominant topics (keywords of those topics)
index = 0
df_dominant_topic['text'][index]

In [None]:
df_dominant_topic['Topic_Keywords1'][index]+'// '+df_dominant_topic['Topic_Keywords2'][index]

# LDA Mallet with high and low tf-idf filtered out

In [None]:
# Now let's build Mallet LDA with the tf-idf-filtered corpus:

ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=filtered_corpus, num_topics=145, id2word=dictionary, random_seed=0)

In [None]:
# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_words_trigrams, dictionary=dictionary, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

In [None]:
# Show Topics
pprint(ldamallet.show_topics(formatted=False))


In [None]:
df_topic_sents_keywords = format_topics_sentences(ldamodel=ldamallet, corpus=filtered_corpus, texts=rusdata[:10000])

df_dominant_topic_filtered_idfs = df_topic_sents_keywords.reset_index(drop = True)

# Show
df_dominant_topic_filtered_idfs.head(10)

In [None]:
# To see an example of how the model worked, we can pick a text by index and look at the keywords
# of its most dominant topic (the one with the highest contribution), as well as of the second one
index = 2
df_dominant_topic_filtered_idfs['text'][index]

In [None]:
# Two most dominant topics (their keywords)
df_dominant_topic_filtered_idfs['Topic_Keywords1'][index]+'// '+df_dominant_topic_filtered_idfs['Topic_Keywords2'][index]

In [None]:
# Now let's plot the 25 most frequent topics from our 10000 texts, together with the most frequent second topic for each of them
# Let's organize the data for the plot

In [None]:
# Counting texts for each topic number
counts = []
n_topics = 145
for x in range(0, n_topics):
    z = df_dominant_topic['Dominant_Topic1'][df_dominant_topic['Dominant_Topic1'] == x].count()
    counts.append([x,z])

In [None]:
# Sorting by number of texts
sorted_counts = sorted(counts, key=lambda x: int(x[1]), reverse=True) 


In [None]:
# Selecting most popular n-themes
n_themes = 25
most_popular = [sorted_counts[x][0] for x in range(n_themes)] 
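
As a side note, the same "count, sort, take the top n" pattern can be written compactly with `collections.Counter`. The sketch below uses a short hypothetical list of dominant topic numbers instead of the real DataFrame column, and the `demo_` names are illustrative only.

```python
from collections import Counter

# Hypothetical dominant topic number of each document
demo_dominant = [3, 1, 3, 2, 3, 1, 0]

# Count how many documents have each topic as their dominant one
demo_counts = Counter(demo_dominant)

# The two most frequent dominant topics, most frequent first
demo_most_popular = [topic for topic, _ in demo_counts.most_common(2)]
print(demo_counts)
print(demo_most_popular)
```

Unlike the loop above, this variant only lists topics that are dominant in at least one document.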


In [None]:
# Selecting keywords for most popular n-themes
theme_keywords = [df_dominant_topic['Topic_Keywords1'][df_dominant_topic['Dominant_Topic1']==x].unique().tolist() for x in most_popular]

In [None]:
# Count number of texts for second topics
topic_counts = []
n_topics = 145
for x in range(0, n_topics):
    for y in range(0, n_topics):
        z = df_dominant_topic['Dominant_Topic1'][(df_dominant_topic['Dominant_Topic1'] == x) & (df_dominant_topic['Dominant_Topic2'] == y)].count()
        topic_counts.append([x,y,z])

In [None]:
# Sorting 
two_themes_count = sorted(topic_counts, key=lambda x: int(x[2]), reverse=True) 
# Selecting second topics for first popular n-topics
most_frequent = [two_themes_count[ind] for ind in range(len(two_themes_count)) if two_themes_count[ind][0] in most_popular]
most_frequent_sorted = sorted(most_frequent, key=lambda x: int(x[0]), reverse=True) 

In [None]:
# Selecting most frequent second topics that follow those most frequent first topics
second_topic = []
for x in most_popular:
    for y in range(len(most_frequent_sorted)):
        if most_frequent_sorted[y][0] == x:
            second_topic.append(most_frequent_sorted[y:y+1][0])
            break
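
The idea of these cells, finding for every dominant topic the second topic that accompanies it most often, can also be sketched with a counter per first topic. The topic pairs below are hypothetical, and the `demo_` names are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical (dominant topic, second topic) pair of each document
demo_pairs = [(3, 1), (3, 1), (3, 2), (1, 0), (1, 0), (1, 3), (2, 3)]

# For every dominant topic, count how often each second topic follows it
demo_pair_counts = defaultdict(Counter)
for first, second in demo_pairs:
    demo_pair_counts[first][second] += 1

# Most frequent second topic for each dominant topic
demo_best_second = {
    first: seconds.most_common(1)[0][0]
    for first, seconds in demo_pair_counts.items()
}
print(demo_best_second)
```

This avoids looping over all 145 x 145 topic combinations, most of which never occur.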

In [None]:
# Getting array of second topic numbers
second_topic_num = [second_topic[x][1] for x in range(n_themes)]

In [None]:
# Corresponding keywords for second topics
theme_keywords_second_topic = [df_dominant_topic['Topic_Keywords2'][df_dominant_topic['Dominant_Topic2']==x].unique().tolist() for x in second_topic_num]

In [None]:
rcParams['font.size'] = 9
rcParams['axes.titlesize'] = 14
plt.rc('xtick', labelsize=14)
plt.rc('ytick', labelsize=14)
plt.rc('axes', labelsize=14)
rcParams['figure.dpi']= 600

theme = [sorted_counts[x][0] for x in range(n_themes)]
text_count = [sorted_counts[x][1] for x in range(n_themes)]
y_pos = np.arange(len(theme))


fig, ax = plt.subplots(figsize=(16, 20)) # 16 14

plot = ax.barh(y_pos, text_count, align='center')

plt.xlim(0, 500)

ax.set_yticks(y_pos)
ax.set_yticklabels(theme)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('N of texts')
ax.set_ylabel('Topic number')
ax.set_title('25 most frequent topics')

def keyword_label(prefix, keywords):
    """Format up to ten comma-separated keywords as a three-line annotation."""
    words = keywords.split(', ')
    lines = [', '.join(words[:4]), ', '.join(words[4:7]), ', '.join(words[7:10])]
    return prefix + ': ' + '\n'.join(lines)

y_offset = 0.3
for ind, bar in enumerate(plot):
    yloc = bar.get_y() + bar.get_height() / 2
    width = int(bar.get_width())
    # Keywords of the dominant topic (T1), anchored 100 units left of the bar end
    ax.annotate(keyword_label('T1', theme_keywords[ind][0]), xy=(width - 100, yloc + y_offset), fontsize=10)
    # Keywords of the most frequent second topic (T2), anchored 30 units right of the bar end
    ax.annotate(keyword_label('T2', theme_keywords_second_topic[ind][0]), xy=(width + 30, yloc + y_offset), fontsize=10)



plt.show()

- Each document can be described by a distribution over topics [T1 + T2 + T3 + ... + T145], and each topic can be described by a distribution over words. T1 is the topic with the highest contribution, so it is considered the prevalent topic of the text.
<br>
- However, it makes sense to also look at the second or even third topic in order to get a broader overview and a better idea of what the text is about.
<br>
- The graph shows the most frequent dominant topics (T1), together with the second topic (T2) that most frequently accompanies each of them.
- In this test, filtering words by their tf-idf score (the rarest and the most common words) did not improve the model. However, more investigation is needed.
- More investigation into optimizing the number of topics is required.
- Further word filtering might be required to improve the quality of the model.