# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [2]:
# TODO: import needed libraries
import pandas as pd
import numpy as np
from nltk import word_tokenize, wordpunct_tokenize, pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer

# English stopword list (requires the NLTK "stopwords" corpus to be downloaded)
stop_words = stopwords.words("english")

Load the data from the file `random_headlines.csv`:

In [5]:
df = pd.read_csv('random_headlines.csv')
print(df.shape)
df.head()

(20000, 2)


   publish_date                          headline_text
0      20120305  ute driver hurt in intersection crash
1      20081128           6yo dies in cycling accident
2      20090325          bumper olive harvest expected
3      20100201     replica replaces northernmost sign
4      20080225           woods targets perfect season


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset:

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [9]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

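def clean_data(quote):
    # Lowercase, then split the headline into tokens
    quote = quote.lower()
    tokens = word_tokenize(quote)
    # Keep purely alphabetic tokens (drops punctuation and tokens containing digits)
    token_punc = [t for t in tokens if t.isalpha()]
    # Remove English stopwords
    token_stop = [t for t in token_punc if t not in stop_words]
    # Reduce each remaining word to its Porter stem
    stemmed_words = [stemmer.stem(w) for w in token_stop]
    return stemmed_words

df["stemmed"] = df["headline_text"].apply(clean_data)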
df["stemmed"].head()

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object
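
One side effect of the `isalpha()` filter is visible in row 1 above: the token `6yo` disappears entirely, because it mixes digits and letters. The sketch below illustrates this on that headline. It uses a plain whitespace split as a stand-in for `word_tokenize`, which is an assumption made only to keep the example free of NLTK downloads.

```python
headline = "6yo dies in cycling accident"

# Whitespace split as a simple stand-in for word_tokenize (assumption for this sketch)
tokens = headline.lower().split()

# Same filter as in clean_data: keep purely alphabetic tokens only
kept = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print(kept)     # ['dies', 'in', 'cycling', 'accident']
print(dropped)  # ['6yo']
```

Whether dropping such tokens is acceptable depends on the task; for headlines, ages and years may carry meaning.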

Now use Gensim to compute a bag-of-words (BOW) representation:

In [22]:
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.matutils import corpus2dense

dictionary = Dictionary(df["stemmed"])

corpus_bow = [dictionary.doc2bow(doc) for doc in df["stemmed"]]
print(len(corpus_bow))
corpus_bow[0:2]

20000


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
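
To make the `(token_id, count)` pairs above less opaque, here is a minimal pure-Python sketch of what a bag-of-words encoding does. The helper `to_bow` and the `token2id` mapping are hypothetical stand-ins, not Gensim's implementation; the id assignment order (alphabetical within each new document) is an assumption chosen so that the result mirrors the output shown above.

```python
from collections import Counter

# The first two stemmed headlines from the output above
docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
]

token2id = {}  # hypothetical stand-in for the Gensim Dictionary


def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(doc)
    # Assumption: unseen tokens get the next free id, alphabetically per document
    for token in sorted(counts):
        token2id.setdefault(token, len(token2id))
    return sorted((token2id[t], c) for t, c in counts.items())


bows = [to_bow(doc) for doc in docs]
print(bows)
# [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
```

Word order is lost in this representation; only which tokens occur, and how often, is kept.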

Compute the TF-IDF using Gensim

In [27]:
tfidf_model = TfidfModel(corpus_bow)
corpus_tfidf = tfidf_model[corpus_bow]
print(len(corpus_tfidf))
corpus_tfidf

20000


<gensim.interfaces.TransformedCorpus at 0x276fdc69370>
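
The transformed corpus is lazy, so printing it shows only the object. To give an idea of the weights it yields, the sketch below computes TF-IDF by hand on a toy corpus. It assumes the weighting scheme that, to my understanding, matches Gensim's defaults (raw term count times `log2(N / df)`, then L2 normalisation of each document); the toy documents and the `tfidf` helper are hypothetical.

```python
import math
from collections import Counter

# Toy stemmed documents (hypothetical, not taken from the dataset)
docs = [
    ["polic", "charg", "man"],
    ["polic", "new", "plan"],
    ["polic", "man"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """TF-IDF weights for one document, normalised to unit L2 length."""
    counts = Counter(doc)
    weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    # Terms that appear in every document get weight 0 and are dropped
    return {t: w / norm for t, w in weights.items() if w > 0}


weights_doc0 = tfidf(docs[0])
print(weights_doc0)
```

In this toy corpus, `polic` appears in every document, so its weight is zero, while the rarer `charg` outweighs `man`.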

Finally, compute the **LSA** (also called LSI) model using Gensim, with a number of topics of your choosing:

In [48]:
# TODO: Compute LSA
from gensim.models import LsiModel

num_topics = 4
num_words = 3

lsi_model = LsiModel(corpus_bow, num_topics=num_topics, id2word=dictionary)

For each topic, show the most significant words.

In [49]:
# TODO: Print the 3 or 4 most significant words of each topic
lsi_model.print_topics(num_topics=num_topics, num_words=num_words)

[(0, '-0.752*"polic" + -0.405*"man" + -0.207*"charg"'),
 (1, '-0.670*"man" + 0.574*"polic" + -0.329*"charg"'),
 (2, '-0.654*"new" + -0.295*"plan" + -0.242*"say"'),
 (3, '0.703*"new" + -0.344*"say" + -0.336*"plan"')]

What do you think about those results?
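
One point that may help when reading these results: LSA is based on a truncated singular value decomposition (SVD), so word weights can be negative, and the overall sign of each topic is arbitrary. The NumPy sketch below illustrates this on a toy term-document matrix (hypothetical counts, not the headline data): flipping the sign of a component leaves the rank-k reconstruction unchanged.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents
terms = ["polic", "man", "charg", "new", "plan"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Truncated SVD: keep the k strongest components ("topics")
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k]

# Flip the sign of the first component in both U and Vt
U_flipped, Vt_flipped = U.copy(), Vt.copy()
U_flipped[:, 0] *= -1
Vt_flipped[0] *= -1
X_k_flipped = (U_flipped[:, :k] * s[:k]) @ Vt_flipped[:k]

# Both factorisations describe exactly the same approximation
print(np.allclose(X_k, X_k_flipped))  # True

# Word weights of the first component (signs depend on the SVD routine)
for term, weight in zip(terms, U[:, 0]):
    print(f"{term}: {weight:+.3f}")
```

This is why a topic such as `-0.752*"polic" + -0.405*"man"` should be read by the magnitude and relative signs of its weights, not by whether they are negative overall.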

Now let's try LDA instead of LSA, again using Gensim:

In [18]:
# TODO: Compute LDA
from gensim.models import LdaModel

lda1 = LdaModel(corpus=corpus_tfidf, num_topics=4, id2word=dictionary, passes=2)

In [19]:
# TODO: print the most frequent words of each topic
lda1.print_topics(num_topics=num_topics, num_words=num_words)

[(0, '0.016*"report" + 0.009*"back" + 0.009*"may"'),
 (1, '0.012*"mine" + 0.011*"polic" + 0.009*"elect"'),
 (2, '0.013*"question" + 0.010*"council" + 0.010*"fund"'),
 (3, '0.012*"sydney" + 0.012*"charg" + 0.011*"australian"')]

How do the LDA topics compare with the LSA ones?

Let's make some visualization of the LDA results using pyLDAvis.

In [54]:
import pyLDAvis
# Note: in pyLDAvis 3.x and later, this submodule is named `pyLDAvis.gensim_models`
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(topic_model=lda1, corpus=corpus_bow, dictionary=dictionary)
vis



Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on) and compare your results with those of others.