### LOAD THE DATASET

The dataset contains details of articles published on worldnews. Each record holds the time and date of the article and its title. It also has a category column, which identifies the publisher of the article, and an over_18 column, which flags whether the article is suitable for everyone or only for adults. For topic modeling we will use only the title.

In [1]:
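import pandas as pd

# Skip malformed rows. `on_bad_lines='skip'` replaces `error_bad_lines=False`,
# which was deprecated in pandas 1.3 and removed in pandas 2.0.
data = pd.read_csv('Eluvio_DS_Challenge.csv', on_bad_lines='skip', encoding='utf-8')
data_text = data['title']
temp = data_text
print(data_text)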

0                         Scores killed in Pakistan clashes
1                          Japan resumes refuelling mission
2                           US presses Egypt on Gaza border
3              Jump-start economy: Give health care to all 
4           Council of Europe bashes EU&UN terror blacklist
                                ...                        
509231     Heil Trump : Donald Trump s  alt-right  white...
509232    There are people speculating that this could b...
509233            Professor receives Arab Researchers Award
509234    Nigel Farage attacks response to Trump ambassa...
509235    Palestinian wielding knife shot dead in West B...
Name: title, Length: 509236, dtype: object


In [2]:
len(data_text)

509236

In [3]:
data_text[:10]

0                    Scores killed in Pakistan clashes
1                     Japan resumes refuelling mission
2                      US presses Egypt on Gaza border
3         Jump-start economy: Give health care to all 
4      Council of Europe bashes EU&UN terror blacklist
5    Hay presto! Farmer unveils the  illegal  mock-...
6    Strikes, Protests and Gridlock at the Poland-U...
7                       The U.N. Mismanagement Program
8            Nicolas Sarkozy threatens to sue Ryanair 
9    US plans for missile shields in Polish town me...
Name: title, dtype: object

#### DATA PREPROCESSING

In any machine learning workflow, data preprocessing is the step that transforms or encodes the raw data into a form the algorithm can parse and interpret easily. Here we convert the text in title into a format the LDA model can work with: we tokenize each title, drop stop words and very short tokens, and then lemmatize and stem what remains, so that each title becomes a list of normalized word stems.
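As a rough illustration of the filtering part of this step, here is a minimal sketch that uses only the standard library. The stop-word set `TOY_STOPWORDS` and the helper `toy_preprocess` are hypothetical stand-ins invented for this example, not the gensim or NLTK versions, and stemming and lemmatization are deliberately left out.

```python
import re

# Hypothetical, tiny stop-word set for illustration only.
TOY_STOPWORDS = {"the", "to", "he", "behind", "in", "on", "of"}

def toy_preprocess(text, min_len=4):
    """Lowercase, split into alphabetic tokens, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

sample_tokens = toy_preprocess("Farmer unveils the illegal mock-Tudor castle")
print(sample_tokens)
```

The real pipeline additionally maps each surviving token to its stem, so that variants such as "unveils" and "unveil" end up as the same feature.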


In [4]:
import gensim

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2021)



In [5]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\priya\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### STEMMING AND LEMMATIZING THE DATA

In [6]:
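# Create the stemmer and lemmatizer once, before the functions that use them.
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(txt):
    # Lemmatize the token as a verb, then reduce it to its stem.
    return stemmer.stem(lemmatizer.lemmatize(txt, pos='v'))

def preprocess(txt):
    # Tokenize and lowercase, then drop stop words and tokens of 3 characters or fewer.
    result = []
    for token in gensim.utils.simple_preprocess(txt):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result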

In [7]:
doc_sample = data_text[5]
#print(doc_sample)

words = []

for word in doc_sample.split(' '):
    words.append(word)
print('Original words')
print(words)

print("tokenized words")
print(preprocess(doc_sample))

Original words
['Hay', 'presto!', 'Farmer', 'unveils', 'the', '', 'illegal', '', 'mock-Tudor', 'castle', 'he', 'tried', 'to', 'hide', 'behind', '40ft', 'hay', 'bales']
tokenized words
['presto', 'farmer', 'unveil', 'illeg', 'mock', 'tudor', 'castl', 'tri', 'hide', 'bale']


In [8]:
processed_txt = data_text.map(preprocess)
#data_text

In [9]:
print(processed_txt[:200])

0                         [score, kill, pakistan, clash]
1                        [japan, resum, refuel, mission]
2                           [press, egypt, gaza, border]
3                   [jump, start, economi, health, care]
4              [council, europ, bash, terror, blacklist]
                             ...                        
195                    [ahmadinejad, invit, iraq, allow]
196           [mass, murder, women, get, virginia, tech]
197    [navi, cruiser, missil, satellit, carri, toxic...
198    [hama, support, protest, prophet, cartoon, int...
199                   [bolivia, accept, explan, scandal]
Name: title, Length: 200, dtype: object


### BAG OF WORDS

Bag of words is a Natural Language Processing technique of text modelling. In technical terms, we can say that it is a method of feature extraction with text data. This approach is a simple and flexible way of extracting features from documents.

A bag of words is a representation of text that describes the occurrence of words within a document. We just keep track of word counts and disregard the grammatical details and the word order. It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
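The idea can be sketched in a few lines of plain Python. The helpers `build_vocabulary` and `to_bow` are hypothetical names for this illustration; they assign ids in order of first appearance, which is an assumption of the sketch and not necessarily how gensim numbers its tokens.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are ignored."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_docs = [["kill", "pakistan", "clash", "kill"], ["clash", "pakistan"]]
toy_vocab = build_vocabulary(toy_docs)
print(to_bow(toy_docs[0], toy_vocab))
```

Because only counts are kept, two documents with the same words in a different order produce the same representation.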

In [10]:
dictionary = gensim.corpora.Dictionary(processed_txt)

In [11]:
count = 0
for k,v in dictionary.iteritems():
    print(k,v)
    count = count + 1
    if count > 20:
        break

0 clash
1 kill
2 pakistan
3 score
4 japan
5 mission
6 refuel
7 resum
8 border
9 egypt
10 gaza
11 press
12 care
13 economi
14 health
15 jump
16 start
17 bash
18 blacklist
19 council
20 europ


In [12]:
dictionary.filter_extremes(no_below = 15,no_above = 0.5, keep_n = 100000)
print(dictionary)

Dictionary(11098 unique tokens: ['clash', 'kill', 'pakistan', 'score', 'japan']...)


In [13]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_txt]

In [14]:
bow_corpus[:10]

[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(4, 1), (5, 1), (6, 1), (7, 1)],
 [(8, 1), (9, 1), (10, 1), (11, 1)],
 [(12, 1), (13, 1), (14, 1), (15, 1), (16, 1)],
 [(17, 1), (18, 1), (19, 1), (20, 1), (21, 1)],
 [(22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)],
 [(8, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)],
 [(35, 1), (36, 1)],
 [(37, 1), (38, 1), (39, 1), (40, 1)],
 [(41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1)]]

In [15]:
bow_doc_1225 = bow_corpus[1225]
for i in range(len(bow_doc_1225)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_1225[i][0], 
                                                     dictionary[bow_doc_1225[i][0]], 
                                                     bow_doc_1225[i][1]))

Word 47 ("video") appears 1 time.
Word 267 ("militari") appears 1 time.
Word 392 ("expos") appears 1 time.
Word 396 ("oper") appears 1 time.
Word 579 ("german") appears 1 time.
Word 1114 ("scientist") appears 1 time.


## TF-IDF

In information retrieval, TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general. TF-IDF is one of the most popular term-weighting schemes today.
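To make the weighting concrete, here is a minimal hand-rolled sketch. It assumes raw counts for term frequency, a base-2 logarithm of (number of documents / document frequency) for the inverse document frequency, and unit-length normalization of each document vector. I believe these match gensim's defaults, but treat that as an assumption; `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * math.log2(n_docs / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

toy_docs = [
    ["score", "kill", "pakistan", "clash"],
    ["kill", "attack", "pakistan"],
    ["japan", "mission"],
]
toy_vectors = tfidf_vectors(toy_docs)
print(toy_vectors[0])
```

In the first toy document, "score" and "clash" occur in only one document and therefore receive a higher weight than "kill" and "pakistan", which occur in two.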

In [16]:
from gensim import corpora
from gensim.models import TfidfModel

tfidf = TfidfModel(bow_corpus)

In [17]:
corpus_tfidf = tfidf[bow_corpus]

In [18]:
from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.51315015603875),
 (1, 0.29743861993839626),
 (2, 0.42429819056762896),
 (3, 0.6842355078535349)]


### LDA MODEL

In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning toolbox and, in a wider sense, to the artificial intelligence toolbox.
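The generative story can be sketched with the standard library alone. This is only an illustration of how LDA assumes a document is produced, not of how gensim fits the model; the topic-word table `TOPIC_WORD`, the prior `alpha`, and the helper names are all hypothetical.

```python
import random

# Hypothetical topic-word distributions (each one sums to 1).
TOPIC_WORD = [
    {"syria": 0.5, "rebel": 0.3, "refuge": 0.2},
    {"bank": 0.4, "trade": 0.4, "crisi": 0.2},
]

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Sample a topic mixture for the document, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]
        vocab = list(topic_word[z].keys())
        probs = list(topic_word[z].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

rng = random.Random(0)
theta, words = generate_document(TOPIC_WORD, [0.5, 0.5], 8, rng)
print(theta, words)
```

Training reverses this process: given only the words, the model estimates the topic-word distributions and each document's topic mixture.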

### Running LDA using Bag of Words


We apply the LDA model to both the bag-of-words corpus and the TF-IDF corpus and cluster the titles into topics. For each topic, we can list the words occurring in that topic and their relative weights. After training the model, I tried it on two different documents to check whether it assigns sensible topics.

In [19]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=15, id2word=dictionary, passes=2, workers=2)

In [20]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.043*"islam" + 0.042*"state" + 0.041*"attack" + 0.025*"pari" + 0.023*"terror" + 0.021*"yemen" + 0.016*"group" + 0.016*"train" + 0.014*"miss" + 0.012*"kurdish"
Topic: 1 
Words: 0.069*"syria" + 0.059*"isi" + 0.039*"syrian" + 0.032*"refuge" + 0.024*"forc" + 0.022*"fight" + 0.020*"aleppo" + 0.019*"iraq" + 0.019*"rebel" + 0.016*"say"
Topic: 2 
Words: 0.021*"putin" + 0.019*"bank" + 0.019*"billion" + 0.019*"european" + 0.016*"crisi" + 0.015*"crash" + 0.013*"europ" + 0.013*"trade" + 0.012*"say" + 0.012*"world"
Topic: 3 
Words: 0.074*"russia" + 0.042*"russian" + 0.037*"iran" + 0.029*"deal" + 0.026*"nuclear" + 0.026*"say" + 0.020*"militari" + 0.020*"missil" + 0.016*"japan" + 0.016*"ukrain"
Topic: 4 
Words: 0.023*"franc" + 0.017*"million" + 0.014*"canada" + 0.013*"world" + 0.013*"record" + 0.013*"year" + 0.012*"flight" + 0.012*"food" + 0.011*"studi" + 0.011*"sanction"
Topic: 5 
Words: 0.026*"right" + 0.023*"court" + 0.022*"human" + 0.020*"say" + 0.016*"nation" + 0.016*"unit" + 0

### Running LDA using TF-IDF

In [21]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=15, id2word=dictionary, passes=2, workers=4)

In [22]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.018*"migrant" + 0.011*"missil" + 0.009*"rocket" + 0.008*"evacu" + 0.007*"trump" + 0.006*"test" + 0.006*"launch" + 0.006*"nuclear" + 0.006*"korea" + 0.005*"mediterranean"
Topic: 1 Word: 0.039*"isi" + 0.018*"islam" + 0.017*"saudi" + 0.016*"kill" + 0.015*"state" + 0.012*"arabia" + 0.012*"iraq" + 0.012*"milit" + 0.010*"attack" + 0.010*"iraqi"
Topic: 2 Word: 0.017*"yemen" + 0.015*"assad" + 0.010*"rebel" + 0.009*"syria" + 0.008*"presid" + 0.007*"syrian" + 0.007*"fifa" + 0.007*"referendum" + 0.006*"colombia" + 0.006*"zealand"
Topic: 3 Word: 0.012*"elect" + 0.009*"vote" + 0.007*"parti" + 0.006*"parliament" + 0.006*"court" + 0.006*"rule" + 0.006*"minist" + 0.005*"pope" + 0.005*"presidenti" + 0.005*"presid"
Topic: 4 Word: 0.018*"kill" + 0.016*"attack" + 0.013*"suicid" + 0.012*"taliban" + 0.012*"afghan" + 0.011*"bomber" + 0.009*"afghanistan" + 0.009*"hostag" + 0.008*"pakistan" + 0.007*"soldier"
Topic: 5 Word: 0.023*"korea" + 0.020*"china" + 0.019*"south" + 0.019*"north" + 0.009*"

###  Classification of the topics
###  Performance evaluation by classifying sample document using LDA Bag of Words model


In [23]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 15)))


Score: 0.4453185200691223	 
Topic: 0.043*"saudi" + 0.023*"minist" + 0.021*"say" + 0.019*"arabia" + 0.014*"parti" + 0.012*"presid" + 0.011*"prime" + 0.010*"leader" + 0.010*"call" + 0.009*"opposit" + 0.009*"coalit" + 0.009*"return" + 0.009*"govern" + 0.008*"parliament" + 0.008*"cameron"

Score: 0.2789072096347809	 
Topic: 0.043*"islam" + 0.042*"state" + 0.041*"attack" + 0.025*"pari" + 0.023*"terror" + 0.021*"yemen" + 0.016*"group" + 0.016*"train" + 0.014*"miss" + 0.012*"kurdish" + 0.012*"philippin" + 0.012*"student" + 0.011*"british" + 0.010*"muslim" + 0.009*"famili"

Score: 0.16134995222091675	 
Topic: 0.028*"presid" + 0.024*"elect" + 0.023*"climat" + 0.021*"chang" + 0.019*"vote" + 0.018*"venezuela" + 0.017*"govern" + 0.012*"declar" + 0.011*"ahead" + 0.010*"referendum" + 0.010*"protest" + 0.010*"critic" + 0.009*"villag" + 0.009*"street" + 0.009*"independ"


### Performance evaluation by classifying sample document using LDA TF-IDF model

In [25]:
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 15)))


Score: 0.676242470741272	 
Topic: 0.012*"elect" + 0.009*"vote" + 0.007*"parti" + 0.006*"parliament" + 0.006*"court" + 0.006*"rule" + 0.006*"minist" + 0.005*"pope" + 0.005*"presidenti" + 0.005*"presid" + 0.005*"say" + 0.005*"right" + 0.005*"visa" + 0.004*"legal" + 0.004*"belgium"

Score: 0.19986557960510254	 
Topic: 0.023*"korea" + 0.020*"china" + 0.019*"south" + 0.019*"north" + 0.009*"drone" + 0.009*"korean" + 0.007*"japan" + 0.006*"say" + 0.006*"militari" + 0.005*"russia" + 0.005*"disput" + 0.005*"drill" + 0.005*"jong" + 0.005*"joint" + 0.004*"nuclear"


### Testing model on unseen document

In [26]:
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.32711878418922424	 Topic: 0.074*"russia" + 0.042*"russian" + 0.037*"iran" + 0.029*"deal" + 0.026*"nuclear"
Score: 0.22679010033607483	 Topic: 0.021*"china" + 0.018*"egypt" + 0.017*"visit" + 0.017*"chines" + 0.014*"obama"
Score: 0.19521427154541016	 Topic: 0.021*"putin" + 0.019*"bank" + 0.019*"billion" + 0.019*"european" + 0.016*"crisi"
Score: 0.1286247819662094	 Topic: 0.026*"right" + 0.023*"court" + 0.022*"human" + 0.020*"say" + 0.016*"nation"
Score: 0.011113831773400307	 Topic: 0.069*"syria" + 0.059*"isi" + 0.039*"syrian" + 0.032*"refuge" + 0.024*"forc"
Score: 0.011113827116787434	 Topic: 0.043*"islam" + 0.042*"state" + 0.041*"attack" + 0.025*"pari" + 0.023*"terror"
Score: 0.011113827116787434	 Topic: 0.023*"franc" + 0.017*"million" + 0.014*"canada" + 0.013*"world" + 0.013*"record"
Score: 0.011113827116787434	 Topic: 0.082*"china" + 0.052*"south" + 0.044*"korea" + 0.041*"north" + 0.018*"mosul"
Score: 0.011113827116787434	 Topic: 0.030*"migrant" + 0.028*"minist" + 0.027*"germ

In [27]:
unseen_document = 'I had a peanut butter sandwich for breakfast. I like to eat almonds, peanuts and walnuts. My neighbor got a little dog yesterday. Cats and dogs are mortal enemies You mustn’t feed peanuts to your dog.'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.2724400460720062	 Topic: 0.076*"turkey" + 0.050*"israel" + 0.034*"isra" + 0.030*"palestinian" + 0.024*"border"
Score: 0.18554823100566864	 Topic: 0.023*"franc" + 0.017*"million" + 0.014*"canada" + 0.013*"world" + 0.013*"record"
Score: 0.15652279555797577	 Topic: 0.026*"right" + 0.023*"court" + 0.022*"human" + 0.020*"say" + 0.016*"nation"
Score: 0.11214568465948105	 Topic: 0.030*"migrant" + 0.028*"minist" + 0.027*"german" + 0.027*"french" + 0.025*"germani"
Score: 0.09104159474372864	 Topic: 0.043*"saudi" + 0.023*"minist" + 0.021*"say" + 0.019*"arabia" + 0.014*"parti"
Score: 0.07522545754909515	 Topic: 0.021*"putin" + 0.019*"bank" + 0.019*"billion" + 0.019*"european" + 0.016*"crisi"
Score: 0.07140255719423294	 Topic: 0.069*"polic" + 0.040*"arrest" + 0.023*"suspect" + 0.023*"turkish" + 0.020*"offic"


## PLOT

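Here we plot an intertopic distance map with pyLDAvis. Each circle represents a topic: circles that sit far apart indicate topics with distinct vocabularies, while overlapping circles indicate topics that share many words, so the map gives a qualitative view of how well the model separates the topics. Selecting a topic shows, for its most relevant words, the overall term frequency in the corpus next to the term frequency the LDA model estimates within that topic.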
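As far as I understand, pyLDAvis derives the distances on this map from the Jensen-Shannon divergence between the topics' word distributions before projecting them to two dimensions. The following minimal sketch computes that divergence for toy distributions; the function names and the example vectors are hypothetical and this is not pyLDAvis's own code.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and zero when p equals q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical word distributions for three topics over a 3-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.6, 0.3, 0.1]
topic_c = [0.1, 0.1, 0.8]

print(js_divergence(topic_a, topic_b), js_divergence(topic_a, topic_c))
```

Topics with similar word distributions, such as `topic_a` and `topic_b`, get a small divergence and would be drawn close together, while `topic_a` and `topic_c` would be drawn further apart.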

In [28]:
import pyLDAvis
import pyLDAvis.gensim_models

# Render the interactive visualization inline in the notebook.
pyLDAvis.enable_notebook()
# Bag-of-words LDA model, with topics laid out using t-SNE.
panel = pyLDAvis.gensim_models.prepare(lda_model, bow_corpus, dictionary, mds='tsne')
panel



In [29]:
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
# Same visualization for the TF-IDF LDA model.
panel = pyLDAvis.gensim_models.prepare(lda_model_tfidf, bow_corpus, dictionary, mds='tsne')
panel

