# Topic Modeling Using Latent Dirichlet Allocation

##### - Topic modeling is the process of identifying patterns in text data that correspond to topics.
##### - It is used for exploratory analysis of document collections.
##### - It is an unsupervised learning technique, so no labeled data is needed.
##### - The discovered topics give a compact summary of the data.

##### - Latent Dirichlet Allocation (LDA) is a topic modeling technique.
##### - It assumes that a given piece of text is a combination of multiple topics.
##### - Example: a single document may mix topics such as data visualization and finance.
##### - A topic is a probability distribution over a fixed vocabulary of words.
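
The two ideas above (a topic is a distribution over words, and a document is a mixture of topics) can be made concrete with a small sketch. The vocabulary, topic names, and all probabilities below are made up for illustration and are not taken from the data used in this notebook.

```python
# Hypothetical toy vocabulary; every number below is invented for illustration.
vocabulary = ["chart", "plot", "stock", "bond"]

# Each topic is a probability distribution over the same fixed vocabulary.
topics = {
    "visualization": [0.5, 0.4, 0.05, 0.05],
    "finance": [0.05, 0.05, 0.5, 0.4],
}

# A document is a mixture of topics: here 70% visualization, 30% finance.
doc_topic_weights = {"visualization": 0.7, "finance": 0.3}


def word_probabilities(topics, doc_topic_weights):
    """Combine the topic distributions using the document's topic weights."""
    vocab_size = len(next(iter(topics.values())))
    probs = [0.0] * vocab_size
    for name, weight in doc_topic_weights.items():
        for i, p in enumerate(topics[name]):
            probs[i] += weight * p
    return probs


doc_word_probs = word_probabilities(topics, doc_topic_weights)
for word, p in zip(vocabulary, doc_word_probs):
    print(f"{word}: {p:.3f}")
```

The resulting word probabilities still sum to 1, because each topic sums to 1 and the topic weights sum to 1. LDA works in the opposite direction: it starts from the observed words and estimates both the topics and the per-document weights.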

In [2]:
# Import libraries
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora

In [3]:
# Load input data
def load_data(input_file):
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            # Strip only the trailing newline so a final line without one keeps its last character
            data.append(line.rstrip('\n'))

    return data

In [4]:
def process(input_text):
    # Create a regular expression tokenizer
    tokenizer = RegexpTokenizer(r'\w+')

    # Create a Snowball stemmer
    stemmer = SnowballStemmer('english')

    # Get the English stop words as a set for fast membership tests
    stop_words = set(stopwords.words('english'))

    # Tokenize the lowercased input string
    tokens = tokenizer.tokenize(input_text.lower())

    # Remove stop words
    tokens = [x for x in tokens if x not in stop_words]

    # Perform stemming on the tokenized words
    tokens_stemmed = [stemmer.stem(x) for x in tokens]

    return tokens_stemmed
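
To see what the first two preprocessing steps do without needing the NLTK data files, here is a rough standard-library stand-in. It is only a sketch: `simple_process` and `TINY_STOP_WORDS` are hypothetical names, the stop-word set is a tiny hand-written sample rather than NLTK's list, and stemming is left out entirely.

```python
import re

# Hypothetical, hand-written stop-word sample; NLTK's English list is much longer.
TINY_STOP_WORDS = {"the", "is", "a", "of", "and"}


def simple_process(input_text):
    """Lowercase, split on runs of word characters, and drop stop words."""
    # r'\w+' is the same pattern passed to RegexpTokenizer above
    tokens = re.findall(r"\w+", input_text.lower())
    return [tok for tok in tokens if tok not in TINY_STOP_WORDS]


print(simple_process("The model IS learning topics!"))
```

Punctuation disappears because it never matches `\w+`, and lowercasing first means "The" and "IS" are caught by the lowercase stop-word set.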

In [5]:
# Load the documents, one per line of the input file
data = load_data('data.txt')

In [6]:
# Tokenize, remove stop words from, and stem each document
tokens = [process(x) for x in data]

In [7]:
# Build a dictionary that maps each unique token to an integer id
dict_tokens = corpora.Dictionary(tokens)

In [8]:
# Convert each document into a bag of words: a list of (token id, count) pairs
doc_term_mat = [dict_tokens.doc2bow(doc_tokens) for doc_tokens in tokens]
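
The bag-of-words representation can be sketched in plain Python. This is not gensim's implementation, and the ids gensim assigns may differ; the documents and the helper name `to_bow` below are made up for illustration.

```python
from collections import Counter

# Hypothetical, already-stemmed documents (invented for illustration).
docs = [["topic", "model", "topic"], ["model", "data"]]

# Assign each distinct token an integer id, in order of first appearance.
token_to_id = {}
for doc in docs:
    for tok in doc:
        token_to_id.setdefault(tok, len(token_to_id))


def to_bow(doc, token_to_id):
    """Return sorted (token id, count) pairs for one document."""
    counts = Counter(token_to_id[tok] for tok in doc)
    return sorted(counts.items())


bows = [to_bow(doc, token_to_id) for doc in docs]
print(bows)
```

Word order is discarded; only how often each token occurs in a document is kept, which is the input LDA works from.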

In [9]:
# Number of topics to extract
num_topics = 2

In [10]:
# Train the LDA model on the bag-of-words corpus
ldmodels = models.ldamodel.LdaModel(doc_term_mat, num_topics=num_topics, id2word=dict_tokens, passes=25)

In [11]:
# Number of top words to show for each topic
num_words = 5

In [13]:
print('\nTop ' + str(num_words) + ' contributing words to each topic:')
for item in ldmodels.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic', item[0])

    # Each topic string looks like: weight*"word" + weight*"word" + ...
    list_of_strings = item[1].split(' + ')
    for text in list_of_strings:
        weight, word = text.split('*')
        # Report each weight as a percentage
        print(word, '==>', str(round(float(weight) * 100, 2)) + '%')


Top 5 contributing words to each topic:

Topic 0
"empir" ==> 3.8%
"mathemat" ==> 2.7%
"expand" ==> 2.7%
"call" ==> 2.7%
"formul" ==> 1.6%

Topic 1
"peopl" ==> 2.0%
"histor" ==> 2.0%
"cultur" ==> 2.0%
"europ" ==> 2.0%
"time" ==> 2.0%
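
The string handling in the printing loop can be tried on its own, without training a model. In the sketch below, `parse_topic` is a hypothetical helper and `sample` is a hand-written string in the `weight*"word" + ...` format, with weights chosen to match three of the Topic 0 values printed above.

```python
def parse_topic(topic_string):
    """Split a 'weight*"word" + weight*"word"' string into (word, weight) pairs."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        # Drop the double quotes that surround each word
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs


# Hand-written sample string, not actual model output
sample = '0.038*"empir" + 0.027*"mathemat" + 0.016*"formul"'
for word, weight in parse_topic(sample):
    print(f"{word} ==> {weight * 100:.1f}%")
```

The words are stems produced by the Snowball stemmer, which is why they read as "empir" and "mathemat" rather than full words. If parsing strings feels fragile, gensim's `show_topics` with `formatted=False` should return the words and weights as pairs directly.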
