# Topic Modeling on News Headlines

#### Dataset:

The dataset comes from Kaggle and includes over 1 million news headlines published over a period of 17 years.
https://www.kaggle.com/therohk/million-headlines

#### Problem Statement:

Our goal in this analysis is to find the topics of the news headlines and make predictions on unseen documents using Gensim's LDA model.

### Sample the Data

In [1]:
import pandas as pd

data = pd.read_csv('abcnews-date-text.csv')
data.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [2]:
len(data)

1186018

There are 1,186,018 records, which is too many for this practice analysis. Therefore, we will take a random sample of 10,000 records from the dataset.

In [3]:
documents = data.sample(10000, random_state=42)
len(documents)

10000

In [4]:
documents = documents.reset_index(drop=True)
documents.head()

Unnamed: 0,publish_date,headline_text
0,20090902,extended interview terence higgins speaks with...
1,20100329,council ceo temporarily replaced
2,20160410,adelaide united finally breaks a league premie...
3,20090921,rocca hangs up the boots
4,20071114,syphilis threatens rare marsupial with extinction


### Preprocess the Text

In this step, we are going to split each headline into lowercase tokens, remove stop words, and then use lemmatization and stemming to reduce the inflectional forms of each word to a common base or root.

In [5]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
np.random.seed(42)

In [6]:
# build the stemmer and lemmatizer once instead of on every call
# (WordNetLemmatizer needs the NLTK WordNet data, e.g. nltk.download('wordnet'))
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(token):
    # lemmatize the token as a verb, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

def text_preprocess(doc):
    # min_len and max_len: minimum and maximum token lengths (inclusive)
    processed_words = simple_preprocess(doc, min_len=3, max_len=15)
    # drop stop words first, then lemmatize and stem what is left
    return [lemmatize_stemming(token) for token in processed_words if token not in STOPWORDS]

In [7]:
doc_sample = documents.iloc[551,1]  
print(doc_sample)
print()
print(text_preprocess(doc_sample))

council awaits temple tourism decision

['council', 'await', 'templ', 'tourism', 'decis']
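
For readers without NLTK or Gensim at hand, the tokenizing and stop-word steps can be approximated with the standard library. This is only a rough sketch: `TOY_STOPWORDS` is a tiny hypothetical stop-word list rather than Gensim's `STOPWORDS`, `toy_preprocess` is a made-up name, the regular expression is a simplification of what `simple_preprocess` does, and lemmatization and stemming are left out.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# Gensim's STOPWORDS set is much larger.
TOY_STOPWORDS = {"the", "for", "and", "with", "must", "against"}

def toy_preprocess(text, min_len=3, max_len=15):
    """Lowercase, split into alphabetic tokens, drop short/long tokens and stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        tok for tok in tokens
        if min_len <= len(tok) <= max_len and tok not in TOY_STOPWORDS
    ]

print(toy_preprocess("council awaits temple tourism decision"))
# ['council', 'awaits', 'temple', 'tourism', 'decision']
```

Unlike the real pipeline, this leaves `awaits`, `temple` and `decision` in their full form, because no stemmer is applied.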


In [8]:
processed_docs = documents['headline_text'].apply(text_preprocess)

### Convert Corpus to Vectors 
There are two common ways to turn the corpus into vectors. One is the Bag-of-Words representation, which records how often each word occurs in a document and ignores word order. The other is tf-idf, which rescales the Bag-of-Words counts into weights. First, we will try the Bag-of-Words approach.

#### 1. Bag-of-Words

In [9]:
# create a dictionary - a mapping between words and their integer ids

dictionary = gensim.corpora.Dictionary(processed_docs)  # this dictionary is not the regular Python dictionary

Take a look at the first few entries in the dictionary. Each entry is a pair of a word ID and a word. The word IDs are specific to this corpus.

In [10]:
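# print out the first 12 entries in the dictionary (IDs 0 to 11)
# items() is the portable spelling; iteritems() may be unavailable in newer Gensim releases

for i, (k, v) in enumerate(dictionary.items()):
    print(k, v)
    if i > 10:
        break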

0 extend
1 higgin
2 interview
3 speak
4 terenc
5 ceo
6 council
7 replac
8 temporarili
9 adelaid
10 break
11 drought


Next, we are going to filter the dictionary. The `filter_extremes` method removes tokens from the dictionary based on how many documents they appear in.

- no_below (int, optional) – Keep tokens which are contained in at least no_below documents.
- no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).
- keep_n (int, optional) – Keep only the first keep_n most frequent tokens.
- keep_tokens (iterable of str) – Iterable of tokens that must stay in dictionary after filtering.

In [11]:
dictionary.filter_extremes(no_below=15, no_above=0.6, keep_n=5000)
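
To see what these thresholds do, here is a simplified pure-Python re-implementation of the document-frequency filter on a toy corpus. It is a sketch of the idea described by the parameters above, not Gensim's own code: `toy_filter_extremes` is a hypothetical name, `keep_tokens` is omitted, and the alphabetical tie-breaking is an assumption made only to keep the result deterministic.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the tokens that survive document-frequency filtering."""
    num_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * num_docs  # no_above is a fraction of the corpus size
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n most frequent survivors (ties broken alphabetically here).
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

docs = [
    ["drought", "hit", "farmer"],
    ["drought", "hit", "wheat"],
    ["council", "hit", "plan"],
    ["council", "hit", "drought"],
]
print(toy_filter_extremes(docs, no_below=2, no_above=0.75, keep_n=2))
# ['drought', 'council']
```

Here `hit` is dropped because it appears in all 4 documents (more than 75% of the corpus), while `farmer`, `wheat` and `plan` are dropped because they appear in only one document.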

`doc2bow` converts a document into the bag-of-words (BoW) format, which is a list of (token_id, token_count) tuples.

In [12]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[8]

[(7, 1), (21, 1)]

The output above means that in document 8, token_id 7 appears once and token_id 21 appears once. To find out which word a token_id corresponds to, we look it up in the dictionary.

In [13]:
# let's print out an example: token id, word, and count for document 8

bow_doc_8 = bow_corpus[8]
for token_id, count in bow_doc_8:  # avoid shadowing the built-in id()
    print(token_id, dictionary[token_id], count)

7 drought 1
21 hit 1
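
The mapping and the bag-of-words conversion can also be sketched with plain Python data structures. The helper names `build_token2id` and `toy_doc2bow` are hypothetical, and the order in which IDs are handed out here (first appearance) is an assumption that may differ from Gensim's; the point is only to show the shape of the data.

```python
from collections import Counter

def build_token2id(docs):
    """Assign each new token the next free integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def toy_doc2bow(doc, token2id):
    """Count known tokens and return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["drought", "hit", "wheat"], ["council", "plan", "drought"]]
token2id = build_token2id(docs)
print(token2id)
# {'drought': 0, 'hit': 1, 'wheat': 2, 'council': 3, 'plan': 4}
print(toy_doc2bow(["drought", "drought", "flood", "plan"], token2id))
# [(0, 2), (4, 1)]
```

The token `flood` is not in the mapping, so it is silently ignored, which mirrors what happens to words removed from the dictionary by filtering.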


As we mentioned earlier, Bag-of-Words simply uses word counts to create vectors. Next, we will try the second approach, tf-idf.

#### 2. TF-IDF

In [14]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [15]:
# transform bow_corpus into (token_id, weight) pairs

corpus_tfidf = tfidf[bow_corpus]  # unlike scikit-learn's transform(), Gensim applies a model with []
corpus_tfidf

<gensim.interfaces.TransformedCorpus at 0x1efec18d438>

In [16]:
# print out the same example - document 8 (the second value is now a weight, not a count)

corpus_tfidf_8 = corpus_tfidf[8]
for token_id, weight in corpus_tfidf_8:
    print(token_id, dictionary[token_id], weight)

7 drought 0.7366323814482965
21 hit 0.6762933790906216


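Notice how tf-idf transforms the word counts into weights. Both words had the same count of 1 in the Bag-of-Words representation, but now they have different weights: `drought` is weighted higher than `hit`, which indicates that it appears in fewer documents.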
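
The weighting can be reproduced by hand on a toy corpus. The sketch below assumes the scheme Gensim documents as the `TfidfModel` default (raw count times `log2(N / df)`, followed by L2 normalization per document); `toy_tfidf` is a hypothetical helper and the corpus is made up, so treat it as an illustration rather than a drop-in replacement.

```python
import math

def toy_tfidf(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalize per document."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, cnt * math.log2(num_docs / doc_freq[tid])) for tid, cnt in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Tokens found in every document get weight 0; drop them and guard against norm == 0.
        weighted_corpus.append([(tid, w / norm) for tid, w in weights if norm > 0 and w > 0])
    return weighted_corpus

# Token 0 appears in 3 of 4 documents, token 1 in only 1 of 4, so token 1 is rarer.
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(0, 1), (3, 1)], [(2, 1), (3, 1)]]
for tid, weight in toy_tfidf(toy_corpus)[0]:
    print(tid, round(weight, 4))
```

In the first toy document both tokens have a count of 1, yet the rarer token 1 ends up with the larger weight, and the squared weights sum to 1, just as 0.7366² + 0.6763² ≈ 1 in the output for document 8.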

### Topic Modeling with LDA

#### 1. Bag of Words

In [17]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=20, id2word=dictionary, passes=10, random_state=42)

In [18]:
for idx, topic in lda_model.print_topics():
    print(f'Topic{idx}:')
    print(topic)

Topic0:
0.063*"chang" + 0.052*"flood" + 0.050*"australia" + 0.049*"work" + 0.047*"trial" + 0.042*"elect" + 0.035*"murder" + 0.032*"futur" + 0.029*"campaign" + 0.024*"tour"
Topic1:
0.046*"time" + 0.044*"return" + 0.044*"storm" + 0.041*"rise" + 0.039*"train" + 0.039*"leav" + 0.029*"die" + 0.028*"push" + 0.028*"grant" + 0.028*"parliament"
Topic2:
0.055*"coast" + 0.043*"north" + 0.043*"fight" + 0.042*"gold" + 0.035*"break" + 0.032*"road" + 0.030*"west" + 0.029*"victim" + 0.027*"investig" + 0.025*"east"
Topic3:
0.103*"new" + 0.074*"nsw" + 0.068*"australian" + 0.059*"day" + 0.049*"test" + 0.048*"hospit" + 0.042*"miss" + 0.032*"region" + 0.029*"tax" + 0.022*"chief"
Topic4:
0.056*"end" + 0.039*"queensland" + 0.038*"review" + 0.037*"join" + 0.036*"green" + 0.036*"women" + 0.035*"polici" + 0.031*"discuss" + 0.030*"remain" + 0.027*"blue"
Topic5:
0.094*"council" + 0.043*"union" + 0.030*"reject" + 0.029*"look" + 0.026*"free" + 0.025*"consid" + 0.023*"season" + 0.021*"aussi" + 0.021*"trade" + 0.021*

With 10,000 news headlines, the corpus likely covers more than 20 topics, but for this exercise we limit the model to 20.

#### 2. TF-IDF

In [19]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=20, id2word=dictionary, passes=10,random_state=42)

By default, the `print_topics` method shows at most 20 topics. If the model has more topics and we want to print all of them, we can pass `-1` as the `num_topics` argument.

In [20]:
for idx, topic in lda_model_tfidf.print_topics():
    print(f'Topic{idx}:')
    print(topic)

Topic0:
0.054*"take" + 0.051*"work" + 0.047*"flood" + 0.038*"cost" + 0.038*"elect" + 0.035*"futur" + 0.035*"tour" + 0.026*"award" + 0.024*"tas" + 0.023*"point"
Topic1:
0.043*"return" + 0.041*"storm" + 0.037*"drought" + 0.036*"time" + 0.029*"rat" + 0.026*"leav" + 0.026*"parliament" + 0.025*"beat" + 0.024*"green" + 0.023*"presid"
Topic2:
0.037*"coast" + 0.034*"break" + 0.032*"gold" + 0.027*"road" + 0.026*"north" + 0.025*"investig" + 0.025*"fight" + 0.022*"train" + 0.022*"victim" + 0.021*"war"
Topic3:
0.069*"australian" + 0.051*"miss" + 0.047*"test" + 0.045*"day" + 0.038*"tax" + 0.037*"hospit" + 0.035*"fall" + 0.035*"seek" + 0.028*"afl" + 0.027*"new"
Topic4:
0.051*"end" + 0.039*"review" + 0.036*"join" + 0.031*"women" + 0.031*"discuss" + 0.028*"remain" + 0.027*"olymp" + 0.025*"name" + 0.025*"cricket" + 0.025*"gas"
Topic5:
0.035*"aussi" + 0.034*"want" + 0.032*"free" + 0.029*"hear" + 0.028*"decis" + 0.028*"union" + 0.027*"releas" + 0.027*"trade" + 0.025*"council" + 0.025*"look"
Topic6:
0.039
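
The topic strings printed above follow a `weight*"word"` pattern, so they can be turned back into data with a small regular expression when only the printed text is available. `parse_topic` is a hypothetical helper written for this illustration; when the model object is at hand, Gensim's `show_topic` method returns the (word, weight) pairs directly and is the more natural choice.

```python
import re

def parse_topic(topic_string):
    """Turn '0.063*"chang" + 0.052*"flood"' into [('chang', 0.063), ('flood', 0.052)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic0 = '0.054*"take" + 0.051*"work" + 0.047*"flood" + 0.038*"cost" + 0.038*"elect"'
print(parse_topic(topic0))
# [('take', 0.054), ('work', 0.051), ('flood', 0.047), ('cost', 0.038), ('elect', 0.038)]
```

A term that is cut off before its word, such as a trailing `0.021*`, does not match the pattern and is simply skipped.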

### Model Evaluation
#### 1. Bag of Words Model

In [21]:
# let's use our favorite sample - document 8

documents.iloc[8,1]

'drought hits us wheat forecast'

In [22]:
# sort per the value (second element in the tuple)
for index, score in sorted(lda_model[bow_corpus[8]], key=lambda x: x[1], reverse=True): 
    print(f"{index}, {score}, {lda_model.print_topic(index, 5)}") # only print out the first 5 words in the topic

13, 0.6833238005638123, 0.141*"man" + 0.069*"charg" + 0.059*"attack" + 0.050*"crash" + 0.048*"car"
3, 0.01666717417538166, 0.103*"new" + 0.074*"nsw" + 0.068*"australian" + 0.059*"day" + 0.049*"test"
15, 0.016667170450091362, 0.201*"polic" + 0.049*"set" + 0.041*"feder" + 0.039*"alleg" + 0.038*"defend"
0, 0.016667168587446213, 0.063*"chang" + 0.052*"flood" + 0.050*"australia" + 0.049*"work" + 0.047*"trial"
1, 0.016667168587446213, 0.046*"time" + 0.044*"return" + 0.044*"storm" + 0.041*"rise" + 0.039*"train"
2, 0.016667168587446213, 0.055*"coast" + 0.043*"north" + 0.043*"fight" + 0.042*"gold" + 0.035*"break"
4, 0.016667168587446213, 0.056*"end" + 0.039*"queensland" + 0.038*"review" + 0.037*"join" + 0.036*"green"
5, 0.016667168587446213, 0.094*"council" + 0.043*"union" + 0.030*"reject" + 0.029*"look" + 0.026*"free"
6, 0.016667168587446213, 0.119*"say" + 0.094*"plan" + 0.033*"cut" + 0.033*"research" + 0.026*"state"
7, 0.016667168587446213, 0.074*"face" + 0.049*"lead" + 0.044*"court" + 0.044*

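The Bag-of-Words model assigns document 8 to topic #13 with a score of about 0.68, and every other topic receives the same small score of about 0.017. The top words of topic #13 (`man`, `charg`, `attack`, `crash`, `car`) have little to do with a drought headline. Topic #9, where `farmer` and `weather` appear, is closer to the true headline, but it receives only the small baseline score.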
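
The scores in the output above follow a simple pattern. The sketch below assumes a symmetric prior of `alpha = 1 / num_topics` for every topic (Gensim's documented default when `alpha` is not set) and assumes that inference placed both tokens of the document in a single topic. Under those assumptions each score is roughly `(alpha + tokens_in_topic) / (num_topics * alpha + total_tokens)`. This is an after-the-fact approximation of the numbers, not Gensim's inference code, and `approx_topic_scores` is a hypothetical helper.

```python
def approx_topic_scores(tokens_per_topic, num_topics):
    """Approximate document-topic scores from hard token-to-topic counts.

    tokens_per_topic maps topic id -> number of document tokens assigned to it.
    Assumes a symmetric prior alpha = 1 / num_topics (an assumption of this sketch).
    """
    alpha = 1.0 / num_topics
    total = num_topics * alpha + sum(tokens_per_topic.values())
    return {
        topic: (alpha + tokens_per_topic.get(topic, 0)) / total
        for topic in range(num_topics)
    }

# Document 8 keeps two tokens after filtering ('drought' and 'hit').
scores = approx_topic_scores({13: 2}, num_topics=20)
print(round(scores[13], 4), round(scores[0], 4))
# 0.6833 0.0167
```

The two values match the reported 0.6833 and 0.0167 to four decimal places, which suggests that a two-word headline gives the model very little evidence: one topic takes both words and the rest keep only their prior share.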

#### 2. TF-IDF Model

In [23]:
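# test the same doc on the tf-idf model
# topic words should come from lda_model_tfidf, the model that scored the document;
# the recorded output below was produced with lda_model.print_topic and so lists the Bag-of-Words topic words

for index, score in sorted(lda_model_tfidf[bow_corpus[8]], key=lambda x: x[1], reverse=True):
    print(f"{index}, {score}, {lda_model_tfidf.print_topic(index, 5)}")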

1, 0.6833289265632629, 0.046*"time" + 0.044*"return" + 0.044*"storm" + 0.041*"rise" + 0.039*"train"
12, 0.016666902229189873, 0.103*"interview" + 0.065*"sydney" + 0.049*"take" + 0.038*"indigen" + 0.037*"cost"
0, 0.016666898503899574, 0.063*"chang" + 0.052*"flood" + 0.050*"australia" + 0.049*"work" + 0.047*"trial"
2, 0.016666898503899574, 0.055*"coast" + 0.043*"north" + 0.043*"fight" + 0.042*"gold" + 0.035*"break"
3, 0.016666898503899574, 0.103*"new" + 0.074*"nsw" + 0.068*"australian" + 0.059*"day" + 0.049*"test"
4, 0.016666898503899574, 0.056*"end" + 0.039*"queensland" + 0.038*"review" + 0.037*"join" + 0.036*"green"
5, 0.016666898503899574, 0.094*"council" + 0.043*"union" + 0.030*"reject" + 0.029*"look" + 0.026*"free"
6, 0.016666898503899574, 0.119*"say" + 0.094*"plan" + 0.033*"cut" + 0.033*"research" + 0.026*"state"
7, 0.016666898503899574, 0.074*"face" + 0.049*"lead" + 0.044*"court" + 0.044*"dead" + 0.040*"law"
8, 0.016666898503899574, 0.063*"report" + 0.062*"open" + 0.055*"school" +

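The TF-IDF model assigns document 8 to topic #1 with a score of about 0.68. In the TF-IDF topic listing printed earlier, topic #1 includes `storm` and `drought`, so it fits the headline better than the Bag-of-Words prediction. `farmer` and `weather` also appear in topic #9, as in the Bag-of-Words model.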

### Model Testing 

Time to test our model on unseen documents.

In [24]:
unseen_doc = "US closes 5 military bases in Afghanistan as part of Taliban peace deal"
bow_vector = dictionary.doc2bow(text_preprocess(unseen_doc))
bow_vector

[(74, 1), (182, 1), (274, 1), (548, 1), (753, 1)]

In [25]:
for index, score in sorted(lda_model[bow_vector], key=lambda x: x[1], reverse=True): # sort by 2nd element in the tuple
    print(f"{index}, {score}, {lda_model.print_topic(index, 5)}")

19, 0.6749829649925232, 0.060*"year" + 0.057*"jail" + 0.052*"kill" + 0.050*"help" + 0.041*"appeal"
8, 0.17500627040863037, 0.063*"report" + 0.062*"open" + 0.055*"school" + 0.043*"nation" + 0.029*"close"


Recall the top words of topic #19, which are mostly about violence and crime:

0.050*"jail" + 0.049*"year" + 0.032*"hold" + 0.030*"injur" + 0.028*"teen" + 0.028*"abus" + 0.028*"drug" + 0.027*"strike" + 0.027*"fire" + 0.027*"show"

The model assigns this headline to topic #19 with a score of about 0.67, which is not too bad for a story about military bases and a peace deal.
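
When only the best-matching topic is needed, the sorted loop can be reduced to a single `max` call. The sketch below uses the two `(topic_id, score)` pairs from the output above as a stand-in for `lda_model[bow_vector]`; `best_topic` is a hypothetical helper name.

```python
def best_topic(topic_scores):
    """Return the (topic_id, score) pair with the highest score, or None if the list is empty."""
    return max(topic_scores, key=lambda pair: pair[1], default=None)

# Stand-in for lda_model[bow_vector], copied from the output above.
topic_scores = [(8, 0.17500627040863037), (19, 0.6749829649925232)]
print(best_topic(topic_scores))
# (19, 0.6749829649925232)
```

Returning `None` for an empty list covers the case where every token of an unseen headline was filtered out of the dictionary and the caller passes no scores at all.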