# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [1]:
# TODO: import needed libraries
import nltk 
import numpy as np
import pandas as pd

Load the data in the file `random_headlines.csv`

In [2]:
# TODO: load the dataset
df = pd.read_csv('random_headlines.csv')
print(df.shape)
df.head()

(20000, 2)


Unnamed: 0,publish_date,headline_text
0,20120305,ute driver hurt in intersection crash
1,20081128,6yo dies in cycling accident
2,20090325,bumper olive harvest expected
3,20100201,replica replaces northernmost sign
4,20080225,woods targets perfect season


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [3]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
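
Beyond `df.info()`, two quick checks are often useful for headlines: how long they are and which years they cover. The sketch below rebuilds the five rows from the preview above by hand so that it runs without the CSV file; the numbers it prints are therefore illustrative only, and the column names `date`, `year` and `n_words` are introduced just for this example.

```python
import pandas as pd

# The five rows shown by df.head() above, rebuilt by hand for a self-contained example.
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325, 20100201, 20080225],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
        "replica replaces northernmost sign",
        "woods targets perfect season",
    ],
})

# publish_date is stored as an integer such as 20120305; parse it into a real date.
sample["date"] = pd.to_datetime(sample["publish_date"].astype(str), format="%Y%m%d")
sample["year"] = sample["date"].dt.year

# Number of whitespace-separated words per headline.
sample["n_words"] = sample["headline_text"].str.split().str.len()

print(sample["n_words"].describe())
print(sample["year"].value_counts().sort_index())

# Exact duplicates and missing headlines are worth counting too.
print(sample["headline_text"].duplicated().sum(), sample["headline_text"].isna().sum())
```

On the real data, the same lines can be applied to `df` instead of `sample`.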


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [4]:
# TODO: Preprocess the input data
# Note: nltk.word_tokenize and the stopword list need their NLTK data
# packages (fetched with nltk.download) to be installed.

# lower-case, then tokenize
df['tokens'] = df['headline_text'].str.lower().apply(nltk.word_tokenize)

# keep purely alphabetic tokens: drops punctuation and tokens containing digits (e.g. "6yo")
df['alphanumeric'] = df['tokens'].apply(lambda row: [word for word in row if word.isalpha()])

# remove stopwords (a set makes the membership test fast)
stop = set(nltk.corpus.stopwords.words('english'))
df['stop'] = df['alphanumeric'].apply(lambda row: [word for word in row if word not in stop])

# stemming
stemmer = nltk.PorterStemmer()
df['stemmed'] = df['stop'].apply(lambda row: [stemmer.stem(word) for word in row])

df['stemmed'].head()

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object

In [9]:
!pip install --upgrade --user gensim



Now use Gensim to compute a bag-of-words (BOW) representation.

In [10]:
# TODO: Compute the BOW using Gensim
from gensim.corpora import Dictionary

dictionary = Dictionary(df['stemmed'])
corpus = [dictionary.doc2bow(line) for line in df['stemmed']]
print(len(corpus))
corpus[0:2]

20000


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
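
To make the output above easier to read, here is a minimal, dependency-free sketch of what a bag-of-words encoding does: every distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. The helper `to_bow` and the `token2id` mapping are hypothetical names for this illustration; Gensim's `Dictionary` does more (for example, filtering and document-frequency bookkeeping), and its id assignment may differ in detail on other inputs.

```python
from collections import Counter

# The first two stemmed headlines from the preprocessing step.
docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
]

token2id = {}  # grows as new tokens are seen


def to_bow(doc):
    """Return the document as sorted (token_id, count) pairs."""
    counts = Counter(doc)
    # Unseen tokens receive the next free id, in sorted order within the document.
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    return sorted((token2id[token], count) for token, count in counts.items())


bow = [to_bow(doc) for doc in docs]
print(bow)
# [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
```

Every count is 1 here because no stem repeats within a headline, which is typical for texts this short.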

Compute the TF-IDF using Gensim

In [11]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel

tfidf_model = TfidfModel(corpus)
tf_idf = tfidf_model[corpus]
print(len(tf_idf))

20000
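
The weighting itself can be sketched by hand. The example below assumes the scheme that, to my understanding, `TfidfModel` uses by default: raw term count times `log2(N / document_frequency)`, followed by scaling each document vector to unit length. The toy corpus and its token ids are made up for the illustration, and the helper `tfidf` is not part of Gensim.

```python
import math

# Toy BOW corpus of (token_id, count) pairs; the ids are made up for this example.
toy_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(0, 1), (1, 1), (3, 1)],
]

n_docs = len(toy_corpus)

# Document frequency: in how many documents does each token appear?
doc_freq = {}
for doc in toy_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

# Inverse document frequency with a base-2 logarithm (assumed default).
idf = {token_id: math.log2(n_docs / n) for token_id, n in doc_freq.items()}


def tfidf(doc):
    """Weight one BOW document and scale it to unit Euclidean length."""
    weights = [(token_id, count * idf[token_id]) for token_id, count in doc]
    # A token present in every document has idf 0 and carries no information.
    weights = [(token_id, w) for token_id, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(token_id, w / norm) for token_id, w in weights] if norm else []


weighted = [tfidf(doc) for doc in toy_corpus]
for doc in weighted:
    print([(token_id, round(w, 3)) for token_id, w in doc])
```

Token 0 occurs in all three toy documents, so it drops out entirely, while the rarer token 3 ends up with the largest weight in the last document.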


Finally, compute the **LSA** (also called LSI) using Gensim, for a number of topics of your choice.

In [12]:
# TODO: Compute LSA
from gensim.models import LsiModel

lsi = LsiModel(corpus = corpus, num_topics = 4, id2word = dictionary)

For each topic, show the most significant words.

In [13]:
# TODO: Print the 3 or 4 most significant words of each topic
lsi.print_topics(num_words = 3)

[(0, '-0.752*"polic" + -0.404*"man" + -0.207*"charg"'),
 (1, '0.669*"man" + -0.575*"polic" + 0.328*"charg"'),
 (2, '-0.655*"new" + -0.296*"plan" + 0.242*"man"'),
 (3, '0.701*"new" + -0.350*"say" + -0.336*"plan"')]

What do you think about those results?

Stems like "polic", "man" and "charg" dominate topics 0 and 1, which points to a crime and law-enforcement theme (police charging a man).
Stems like "new" and "plan" dominate topics 2 and 3, which points to a second theme around announcements and proposals.
So although four topics were requested, only two distinct themes emerge: each pair of topics reuses the same few words with different signs and weights.
The sign of an LSA weight is not meaningful on its own, so the negative weights in topic 0 do not mean that "polic" is absent from it; what matters is the magnitude of each weight and which words share a sign within a topic.
Note also that the model was fitted on the raw BOW `corpus` rather than on `tf_idf`, so very frequent stems are likely to dominate; fitting on the TF-IDF corpus or changing the number of topics may separate the themes more clearly.
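
The negative weights in the LSA output above can look puzzling. LSA is a truncated singular value decomposition (SVD) of the term-document matrix, and a singular vector is only defined up to its sign. The sketch below illustrates this with NumPy on a tiny hand-made count matrix (the terms and counts are invented for the example, not taken from the dataset): flipping the sign of a topic in both factors leaves the rank-2 approximation unchanged.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
terms = ["polic", "man", "charg", "new", "plan"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Full SVD, then keep the k strongest components (this truncation is LSA).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
X_k = U_k @ np.diag(s_k) @ Vt_k

# Flip the sign of the first topic in both the term and the document factor.
U_flip, Vt_flip = U_k.copy(), Vt_k.copy()
U_flip[:, 0] *= -1
Vt_flip[0, :] *= -1
X_flip = U_flip @ np.diag(s_k) @ Vt_flip

# The approximation is identical, so the overall sign of a topic carries no meaning.
print(np.allclose(X_k, X_flip))

# Rank terms by absolute weight, which is how topics are usually read.
for topic in range(k):
    top = np.argsort(-np.abs(U_k[:, topic]))[:3]
    print(topic, [(terms[i], round(float(U_k[i, topic]), 3)) for i in top])
```

This is why topics are best read by the magnitude of the weights and by which words share a sign, not by whether a single weight is positive or negative.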

Now let's try to use LDA instead of LSA using Gensim

In [14]:
# TODO: Compute LDA
from gensim.models import LdaModel

lda = LdaModel(corpus = corpus, num_topics = 4, id2word = dictionary, random_state = 0, chunksize = 512, passes = 5)

In [15]:
# TODO: print the most probable words of each topic
lda.print_topics(num_words = 3)

[(0, '0.016*"report" + 0.009*"back" + 0.009*"may"'),
 (1, '0.012*"mine" + 0.011*"polic" + 0.009*"elect"'),
 (2, '0.013*"question" + 0.010*"council" + 0.010*"fund"'),
 (3, '0.012*"sydney" + 0.012*"charg" + 0.011*"australian"')]

In [16]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Collecting pandas>=2.0.0 (from pyLDAvis)
  Downloading pandas-2.2.2-cp311-cp311-macosx_11_0_arm64.whl (11.3 MB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Collecting tzdata>=2022.7 (from pandas>=2.0.0->pyLDAvis)
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
Installing collected packages: funcy, tzdata, pandas, pyLDAvis
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.3
    Uninstalling pandas-1.5.3:
      Successfully uninstalled pandas-1.5.3


Now, how does it work with LDA?

Let's make some visualization of the LDA results using pyLDAvis.

In [18]:
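# TODO: show visualization results of the LDA
# Restarting the kernel first is advisable: the install above replaced
# pandas 1.5.3 while it was already loaded in this session.
import pyLDAvis
import pyLDAvis.gensim_models  # pyLDAvis 3.x name for the former pyLDAvis.gensim module

pyLDAvis.enable_notebook()

# n_jobs=1 keeps the computation in a single process; this is a possible
# workaround for the BrokenProcessPool error recorded below.
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary, n_jobs=1)
vis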

  from pandas.core import (

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on) and compare your results with those of others.