# Topic Analysis
Based on [Gensim Topic Modeling](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)

## Finding the dominant topic in each sentence

One of the practical applications of topic modeling is to determine what topic a given document is about.

In [6]:
import os
import sys
import matplotlib.pyplot as plt
import gensim
import pyLDAvis
import pyLDAvis.gensim
import pandas as pd

from gensim import corpora
from gensim import models
from gensim.models.coherencemodel import CoherenceModel

print('Python Version: %s' % (sys.version))
%matplotlib inline

Python Version: 2.7.15 | packaged by conda-forge | (default, Feb 28 2019, 04:00:11) 
[GCC 7.3.0]


In [14]:
dictionary = corpora.Dictionary.load('documents.dict')
corpus = corpora.MmCorpus('documents.mm')
lda_model = models.LdaModel.load('lda_model')
ldamallet = models.wrappers.LdaMallet.load('ldamallet')
optimal_model = models.wrappers.LdaMallet.load('optimal_model')

print(dictionary)
print(corpus)
print(lda_model)
print(ldamallet)

Dictionary(7714 unique tokens: [u'francesco', u'csuci', u'univesidad', u'sation', u'efimenko']...)
MmCorpus(4 documents, 7714 features, 10760 non-zero entries)
LdaModel(num_terms=7714, num_topics=20, decay=0.5, chunksize=100)
<gensim.models.wrappers.ldamallet.LdaMallet object at 0x7f221b9f8290>


In [3]:
import pickle
#with open('documents', 'wb') as f: #save
#    pickle.dump(mylist, f)

with open('documents', 'rb') as f: #load
    documents = pickle.load(f)

To determine this, we look for the topic number with the highest percentage contribution in each document.
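
As a minimal sketch of that selection step, assume a made-up topic distribution for one document in the `(topic_id, proportion)` form that gensim models return. The values in `doc_topics` are hypothetical and not taken from the models loaded above.

```python
# Hypothetical topic distribution for a single document: (topic_id, proportion) pairs
doc_topics = [(0, 0.05), (2, 0.59), (3, 0.21), (6, 0.15)]

# The dominant topic is the pair with the largest proportion
dominant_topic, contribution = max(doc_topics, key=lambda pair: pair[1])

print('Dominant topic: %d (contribution %.2f)' % (dominant_topic, contribution))
```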

The `format_topics_sentences()` function below nicely aggregates this information in a presentable table.

In [13]:
def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=documents):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get the dominant topic, its contribution and its keywords for each document
    for row in ldamodel[corpus]:
        if not row:
            continue  # no topic assignment for this document
        # The dominant topic is the (topic_num, prop_topic) pair with the largest proportion
        topic_num, prop_topic = max(row, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]), ignore_index=True)
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df

In [16]:
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=documents)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

| | Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|---|
| 0 | 0 | 2.0 | 0.5922 | institut, teach, learn, univers, european, stu... | [trend, learn, teach, european, higher, educ, ... |
| 1 | 1 | 3.0 | 0.4185 | student, http, higher, develop, base, research... | [horizon, report, higher, educ, edit, interest... |
| 2 | 2 | 6.0 | 0.5883 | innov, educ, oecd, figur, chang, level, grade,... | [educ, research, innov, measur, innov, educ, n... |
| 3 | 3 | 0.0 | 0.5455 | para, aprendizagem, ncia, universidad, tecnolo... | [panorama, tecnol, gico, nmc, universidad, bra... |


## Find the most representative document for each topic

Sometimes the topic keywords alone are not enough to make sense of what a topic is about. To help interpret a topic, you can find the document to which that topic contributes the most and infer the topic by reading that document.

In [17]:
# Keep the single most representative document under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet,
                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)],
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf_mallet.head()

| | Topic_Num | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|
| 0 | 0.0 | 0.5455 | para, aprendizagem, ncia, universidad, tecnolo... | [panorama, tecnol, gico, nmc, universidad, bra... |
| 1 | 2.0 | 0.5922 | institut, teach, learn, univers, european, stu... | [trend, learn, teach, european, higher, educ, ... |
| 2 | 3.0 | 0.4185 | student, http, higher, develop, base, research... | [horizon, report, higher, educ, edit, interest... |
| 3 | 6.0 | 0.5883 | innov, educ, oecd, figur, chang, level, grade,... | [educ, research, innov, measur, innov, educ, n... |
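
The same selection can be written without a loop. The sketch below is one possible alternative, not the approach used above. It runs on a hypothetical `toy` frame that only borrows the column names of `df_topic_sents_keywords`, and picks for each topic the row with the largest `Perc_Contribution`.

```python
import pandas as pd

# Hypothetical toy data; only the column names match df_topic_sents_keywords
toy = pd.DataFrame({
    'Dominant_Topic': [0, 0, 2, 2, 2],
    'Perc_Contribution': [0.40, 0.55, 0.30, 0.62, 0.45],
    'Text': ['doc a', 'doc b', 'doc c', 'doc d', 'doc e'],
})

# For each topic, idxmax returns the row label of the largest contribution
best_rows = toy.groupby('Dominant_Topic')['Perc_Contribution'].idxmax()

# Select those rows to get one representative document per topic
representative = toy.loc[best_rows].reset_index(drop=True)
print(representative)
```

With only four documents and four distinct dominant topics, as in this notebook, every document ends up as the representative of its own topic. The selection only becomes informative on a larger corpus.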


## Topic distribution across documents
Finally, we want to understand the volume and distribution of topics in order to judge how widely each topic was discussed. The table below shows that information.

In [28]:
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = topic_counts/topic_counts.sum()

# Topic Number and Keyword
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]

# Concatenate Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Change Column names
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Show
df_dominant_topics

| | Dominant_Topic | Topic_Keywords | Num_Documents | Perc_Documents |
|---|---|---|---|---|
| 0.0 | 2.0 | institut, teach, learn, univers, european, stu... | 1.0 | 0.25 |
| 1.0 | 3.0 | student, http, higher, develop, base, research... | NaN | NaN |
| 2.0 | 6.0 | innov, educ, oecd, figur, chang, level, grade,... | 1.0 | 0.25 |
| 3.0 | 0.0 | para, aprendizagem, ncia, universidad, tecnolo... | 1.0 | 0.25 |
| 6.0 | NaN | NaN | 1.0 | 0.25 |
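
The table above has empty cells and an extra row labelled 6.0 because `pd.concat(..., axis=1)` aligns its inputs on index labels. `topic_num_keywords` is indexed by document row (0 to 3), while `topic_counts` and `topic_contribution` are indexed by topic number (0, 2, 3, 6). As a result, the counts on each row belong to the topic whose number equals the row label, not to the topic listed in that row. The sketch below shows one possible way to index everything by topic number before combining. It uses a hypothetical `toy` frame with placeholder keywords, so its output does not reproduce the notebook's.

```python
import pandas as pd

# Hypothetical toy data: one dominant topic per document, placeholder keywords
toy = pd.DataFrame({
    'Dominant_Topic': [2, 3, 6, 0],
    'Topic_Keywords': ['kw for topic 2', 'kw for topic 3',
                       'kw for topic 6', 'kw for topic 0'],
})

# Number and share of documents per topic, both indexed by topic number
topic_counts = toy['Dominant_Topic'].value_counts()
topic_share = topic_counts / topic_counts.sum()

# One keyword string per topic, also indexed by topic number
keywords = (toy.drop_duplicates('Dominant_Topic')
               .set_index('Dominant_Topic')['Topic_Keywords'])

# All three Series share the same index, so they align row by row
summary = pd.DataFrame({
    'Topic_Keywords': keywords,
    'Num_Documents': topic_counts,
    'Perc_Documents': topic_share,
}).sort_index()
summary.index.name = 'Dominant_Topic'
print(summary)
```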
