# Text Analysis in Python

## Introduction

In this workshop, we will learn two text analysis techniques in Python: sentiment analysis and topic modelling. 
These techniques can help us come up with new insights or answer questions we have about the text.

We will be using two data sets for the text analysis. 
The first is a collection of two thousand movie reviews categorized by Bo Pang and Lillian Lee, which we will download and run sentiment analysis on. The second is ten years (2014 to 2023) of transcripts of PM Lee's National Day Rally speeches, which will be used for topic modelling.

The contents of this workshop have been adapted from: 

* the Natural Language Processing in Python Tutorial by Alice Zhao

https://github.com/adashofdata/nlp-in-python-tutorial

* Sentiment Analysis: First Steps With Python's NLTK Library by Marius Mogyorosi

https://realpython.com/python-nltk-sentiment-analysis/#using-nltks-pre-trained-sentiment-analyzer

Data retrieved from:

* Movie Review Data
(https://www.cs.cornell.edu/people/pabo/movie-review-data/)
* Prime Minister's Office Singapore Newsroom
(https://www.pmo.gov.sg/Newsroom)


## Sentiment Analysis

### Introduction

Sentiment analysis estimates the sentiment of a text: how positive or negative it is. This is useful for sorting comments on a specific topic into positive and negative groups, and it can be applied to many kinds of text (reviews, tweets, feedback).

### Getting the data

First, we install NLTK (Natural Language Toolkit), a Python package for NLP (Natural Language Processing).
It contains text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

In [None]:
!pip install nltk

From NLTK, we will need to download the resources that we are using today, including the movie review data that we are analysing. 

* names: A list of common English names compiled by Mark Kantrowitz
* stopwords: A list of really common words, like articles, pronouns, prepositions, and conjunctions
* movie_reviews: Two thousand movie reviews categorized by Bo Pang and Lillian Lee
* vader_lexicon: A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert

In [53]:
import nltk

#nltk.download()
nltk.download(['vader_lexicon', "movie_reviews", 'stopwords', 'names'])

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\theresa.lee\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\theresa.lee\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\theresa.lee\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\theresa.lee\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


True

### Cleaning the data

Now that we have the movie review data, we need to begin cleaning the data. This will standardise the text and remove text and characters that are not relevant.

First, we create a list of all the words in the text, excluding punctuation marks and numbers.

In [54]:
words = [w for w in nltk.corpus.movie_reviews.words() if w.isalpha()]
words

['plot',
 'two',
 'teen',
 'couples',
 'go',
 'to',
 'a',
 'church',
 'party',
 'drink',
 'and',
 'then',
 'drive',
 'they',
 'get',
 'into',
 'an',
 'accident',
 'one',
 'of',
 'the',
 'guys',
 'dies',
 'but',
 'his',
 'girlfriend',
 'continues',
 'to',
 'see',
 'him',
 'in',
 'her',
 'life',
 'and',
 'has',
 'nightmares',
 'what',
 's',
 'the',
 'deal',
 'watch',
 'the',
 'movie',
 'and',
 'sorta',
 'find',
 'out',
 'critique',
 'a',
 'mind',
 'fuck',
 'movie',
 'for',
 'the',
 'teen',
 'generation',
 'that',
 'touches',
 'on',
 'a',
 'very',
 'cool',
 'idea',
 'but',
 'presents',
 'it',
 'in',
 'a',
 'very',
 'bad',
 'package',
 'which',
 'is',
 'what',
 'makes',
 'this',
 'review',
 'an',
 'even',
 'harder',
 'one',
 'to',
 'write',
 'since',
 'i',
 'generally',
 'applaud',
 'films',
 'which',
 'attempt',
 'to',
 'break',
 'the',
 'mold',
 'mess',
 'with',
 'your',
 'head',
 'and',
 'such',
 'lost',
 'highway',
 'memento',
 'but',
 'there',
 'are',
 'good',
 'and',
 'bad',
 'ways',

We also want to exclude stop words, using the list we downloaded earlier. Each word is lower-cased for the comparison only, so capitalised stop words are removed too, while the words we keep are left unchanged.

In [55]:
stopwords = nltk.corpus.stopwords.words("english")

In [56]:
clean_words = [w for w in words if w.lower() not in stopwords]
clean_words

['plot',
 'two',
 'teen',
 'couples',
 'go',
 'church',
 'party',
 'drink',
 'drive',
 'get',
 'accident',
 'one',
 'guys',
 'dies',
 'girlfriend',
 'continues',
 'see',
 'life',
 'nightmares',
 'deal',
 'watch',
 'movie',
 'sorta',
 'find',
 'critique',
 'mind',
 'fuck',
 'movie',
 'teen',
 'generation',
 'touches',
 'cool',
 'idea',
 'presents',
 'bad',
 'package',
 'makes',
 'review',
 'even',
 'harder',
 'one',
 'write',
 'since',
 'generally',
 'applaud',
 'films',
 'attempt',
 'break',
 'mold',
 'mess',
 'head',
 'lost',
 'highway',
 'memento',
 'good',
 'bad',
 'ways',
 'making',
 'types',
 'films',
 'folks',
 'snag',
 'one',
 'correctly',
 'seem',
 'taken',
 'pretty',
 'neat',
 'concept',
 'executed',
 'terribly',
 'problems',
 'movie',
 'well',
 'main',
 'problem',
 'simply',
 'jumbled',
 'starts',
 'normal',
 'downshifts',
 'fantasy',
 'world',
 'audience',
 'member',
 'idea',
 'going',
 'dreams',
 'characters',
 'coming',
 'back',
 'dead',
 'others',
 'look',
 'like',
 'dead

In [57]:
combined_words = ' '.join(clean_words)
combined_words
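
The two cleaning steps above (a case-insensitive stop word filter, then joining the remaining words into one string) can be sketched without NLTK. This is a minimal illustration, assuming a tiny made-up stop list (`toy_stopwords`) and a hypothetical helper `remove_stopwords`; the real NLTK list is much longer.

```python
# Hypothetical miniature stop list; a set makes membership checks fast.
toy_stopwords = {"the", "a", "and", "to"}

def remove_stopwords(tokens, stop_set):
    """Keep tokens whose lower-cased form is not in stop_set."""
    return [t for t in tokens if t.lower() not in stop_set]

sample = ["The", "plot", "and", "the", "Cast", "go", "to", "a", "party"]
kept = remove_stopwords(sample, toy_stopwords)
joined = " ".join(kept)

print(kept)    # ['plot', 'Cast', 'go', 'party']
print(joined)  # plot Cast go party
```

Note that "Cast" keeps its capital letter: lower-casing happens only inside the comparison.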



### Word frequency

One of the uses of NLTK is finding the frequency distribution of each word in the text.

In [58]:
# Tokenization splits text into smaller pieces called tokens. Tokens can be words or sentences.
def word_frequency(text):
    text = nltk.word_tokenize(text)
    fd = nltk.FreqDist(text)
    return fd

In [59]:
fd = word_frequency(combined_words)
fd.most_common(5)

[('film', 9517),
 ('one', 5852),
 ('movie', 5771),
 ('like', 3690),
 ('even', 2565)]
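
`nltk.FreqDist` is a subclass of `collections.Counter`, so the counting step can be sketched with the standard library alone. The example below assumes a made-up sentence and uses `str.split()` as a stand-in for `nltk.word_tokenize`.

```python
from collections import Counter

# Toy text; str.split() stands in for nltk.word_tokenize here.
toy_text = "film one film movie one film"
toy_counts = Counter(toy_text.split())

print(toy_counts.most_common(2))  # [('film', 3), ('one', 2)]
print(toy_counts["romance"])      # 0, because missing words count as zero
```

Looking up a word that never occurs returns 0 rather than raising an error, which is worth remembering when a lookup unexpectedly comes back empty.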

In [60]:
fd.tabulate(5)

 film   one movie  like  even 
 9517  5852  5771  3690  2565 


In [61]:
fd["romance"]

184

In [62]:
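# Count lower-cased words. Iterate over clean_words rather than fd:
# iterating over fd yields each distinct word once, so every count would be 1.
lower_fd = nltk.FreqDist([w.lower() for w in clean_words])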

In [63]:
# All keys are lower case, so a capitalised lookup finds nothing
lower_fd["America"]

0

### Scoring sentiment with VADER

NLTK already has a built-in, pretrained sentiment analyzer called VADER (Valence Aware Dictionary and sEntiment Reasoner).
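VADER returns four scores for the input text. `neg`, `neu` and `pos` are the proportions of the text rated negative, neutral and positive. `compound` is an overall score, where -1 is very negative, +1 is very positive and 0 is neutral.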

In [64]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")

{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}
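
The `compound` value is the one usually used as the overall score. A common convention, described in the VADER project's README, treats a compound score of 0.05 or more as positive, -0.05 or less as negative, and anything in between as neutral. The helper below is a hypothetical illustration of that convention and is not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores dictionary copied from the output above.
example_scores = {"neg": 0.0, "neu": 0.295, "pos": 0.705, "compound": 0.8012}
example_label = label_from_compound(example_scores["compound"])
print(example_label)  # positive
```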

In [65]:
nltk.corpus.movie_reviews.words()

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

As the movie reviews are from a pre-existing data set, they have already been sorted into 'negative' and 'positive' categories, with unique IDs associated with each review.

In [66]:
positive_review_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_review_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_review_ids + negative_review_ids

Now, we can define a function that will tell us if a review is positive by splitting the review into individual sentences, scoring each sentence, and checking whether the mean compound score of the review is above zero.

In [67]:
from statistics import mean

# Return True if the mean compound score of all sentences in the review is positive.
def is_positive(review_id: str) -> bool:
    text = nltk.corpus.movie_reviews.raw(review_id)
    scores = [sia.polarity_scores(sentence)["compound"]
        for sentence in nltk.sent_tokenize(text)]
    return mean(scores) > 0
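
The decision rule inside `is_positive` depends only on the list of per-sentence compound scores. The sketch below isolates that rule with made-up scores and adds a guard for an empty list, since `statistics.mean` raises `StatisticsError` when given no data. The name `is_positive_from_scores` is hypothetical.

```python
from statistics import mean

def is_positive_from_scores(scores):
    """Return True when the mean sentence score is above zero."""
    if not scores:  # statistics.mean raises StatisticsError on an empty list
        return False
    return mean(scores) > 0

# Made-up compound scores for the sentences of three reviews.
print(is_positive_from_scores([0.6, -0.2, 0.1]))  # True
print(is_positive_from_scores([-0.4, 0.1]))       # False
print(is_positive_from_scores([]))                # False
```

The rest of the workshop keeps using `is_positive` as defined above.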

Using this function, we can rate all the reviews and see how accurate VADER is at identifying positive reviews.

In [68]:
from random import shuffle

shuffle(all_review_ids)
correct = 0
for review_id in all_review_ids:
    if is_positive(review_id):
        if review_id in positive_review_ids:
            correct += 1
    else:
        if review_id in negative_review_ids:
            correct += 1

print(f"{correct / len(all_review_ids):.2%} correct")

64.00% correct
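
A single accuracy figure does not show which class is being misjudged. As a rough sketch, the hypothetical helper below reports accuracy separately for each true label, using made-up labels rather than results from the movie review corpus.

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Return {label: share of items with that true label predicted correctly}."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels for six reviews.
truth = ["pos", "pos", "pos", "neg", "neg", "neg"]
preds = ["pos", "pos", "pos", "pos", "pos", "neg"]
breakdown = per_class_accuracy(truth, preds)
print(breakdown)
```

In this toy case every positive review is found but only one in three negative reviews is, which is the pattern to look for if a scorer tends to rate most text as positive.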


### Positive and negative words

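We can now build lists of the words used in positive and negative reviews, based on the predefined categories in the data. Stop words, names, non-alphabetic tokens and nouns (part-of-speech tags starting with "NN") are filtered out, and the frequency distribution function is then used to count the remaining words in each list.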

In [69]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

def skip_unwanted(pos_tuple):
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    if tag.startswith("NN"):
        return False
    return True

positive_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

In [70]:
positive_fd = nltk.FreqDist(positive_words)
negative_fd = nltk.FreqDist(negative_words)

common_set = set(positive_fd).intersection(negative_fd)

for word in common_set:
    del positive_fd[word]
    del negative_fd[word]

top_100_positive = {word for word, count in positive_fd.most_common(100)}
top_100_negative = {word for word, count in negative_fd.most_common(100)}
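
The cell above deletes every word that occurs in both distributions, so only words exclusive to one class remain. A minimal sketch of the same steps with `collections.Counter` and made-up word lists:

```python
from collections import Counter

# Made-up word lists standing in for positive_words and negative_words.
toy_pos = Counter(["great", "great", "fun", "plot", "moving"])
toy_neg = Counter(["dull", "plot", "dull", "fun", "boring"])

# Words seen in both classes are dropped from both counters.
shared = set(toy_pos) & set(toy_neg)
for word in shared:
    del toy_pos[word]
    del toy_neg[word]

top_pos = {word for word, count in toy_pos.most_common(2)}
top_neg = {word for word, count in toy_neg.most_common(2)}
print(sorted(shared), sorted(top_pos), sorted(top_neg))
```

This rule is strict: a word that appears hundreds of times in one class and once in the other is still removed from both.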

### Training a classifier

Now we can define a function that returns three features for a review: the mean compound score, the mean positive score, and the number of words in the review that are also in the top 100 positive words.

In [89]:
def extract_features(text):
    features = dict()
    wordcount = 0
    compound_scores = list()
    positive_scores = list()

    for sentence in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sentence):
            if word.lower() in top_100_positive:
                wordcount += 1
        scores = sia.polarity_scores(sentence)  # score each sentence only once
        compound_scores.append(scores["compound"])
        positive_scores.append(scores["pos"])

    # Adding 1 to the final compound score to always have positive numbers
    # since some classifiers you'll use later don't work with negative numbers.
    features["mean_compound"] = mean(compound_scores) + 1
    features["mean_positive"] = mean(positive_scores)
    features["wordcount"] = wordcount

    return features

Next, we build a list of (features, label) pairs, one for each review to be analysed.

In [90]:
features = [
    (extract_features(nltk.corpus.movie_reviews.raw(review)), "pos")
    for review in nltk.corpus.movie_reviews.fileids(categories=["pos"])
]
features.extend([
    (extract_features(nltk.corpus.movie_reviews.raw(review)), "neg")
    for review in nltk.corpus.movie_reviews.fileids(categories=["neg"])
])

We will now train the classifier based on the features we have just defined.

In [91]:
# Use 1/4 of the set for training
train_count = len(features) // 4
shuffle(features)
classifier = nltk.NaiveBayesClassifier.train(features[:train_count])
classifier.show_most_informative_features(10)

nltk.classify.accuracy(classifier, features[train_count:])

Most Informative Features
               wordcount = 3                 pos : neg    =      9.6 : 1.0
               wordcount = 2                 pos : neg    =      4.3 : 1.0
               wordcount = 4                 pos : neg    =      3.2 : 1.0
               wordcount = 1                 pos : neg    =      1.9 : 1.0
               wordcount = 0                 neg : pos    =      1.8 : 1.0


0.6566666666666666
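
Because the shuffle above is unseeded, the reported accuracy changes from run to run. One way to make the split repeatable is to shuffle with a seeded random generator. The sketch below is an assumption for illustration (the helper `split_features` and the toy data are hypothetical) and keeps the same one-quarter training share.

```python
import random

def split_features(items, train_fraction=0.25, seed=0):
    """Shuffle a copy of items with a fixed seed and split it into train/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-ins for the (feature_dict, label) pairs built earlier.
toy_features = [({"wordcount": i}, "pos" if i % 2 else "neg") for i in range(8)]
train_set, test_set = split_features(toy_features)
print(len(train_set), len(test_set))  # 2 6
```

Calling `split_features` again with the same seed returns the same split, and the original list is left untouched.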

## Topic Modelling

### Introduction

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

In this notebook, we will be covering the steps on how to do **Latent Dirichlet Allocation (LDA)**, which is one of many topic modeling techniques. It was specifically designed for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.

Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic make sense. If they don't make sense, you can try changing up the number of topics, the terms in the document-term matrix, model parameters, or even try a different model.
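
To make the first input concrete: a document-term matrix has one row per document and one column per vocabulary term, and each cell holds how often that term occurs in that document. A minimal standard-library sketch with two made-up documents:

```python
from collections import Counter

# Two made-up documents; real input would be the cleaned transcripts.
toy_docs = {"doc_a": "housing flats housing", "doc_b": "climate sea housing"}

counts = {name: Counter(text.split()) for name, text in toy_docs.items()}
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary term.
dtm = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
print(vocabulary)  # ['climate', 'flats', 'housing', 'sea']
print(dtm)         # {'doc_a': [0, 1, 2, 0], 'doc_b': [1, 0, 1, 1]}
```

Word order is discarded in this representation; only the counts are kept.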

### Getting the data

First, we load the National Day Rally speech transcripts from the folder "transcripts" on our computers and put them in a dictionary keyed by year.

In [1]:
# pickle: serialise Python objects to save data for later (e.g. load a saved list in a different notebook)
import pickle

# Load pickled files
data = {}

# Years of National Day Rally speeches
year = ['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023']

# Open each transcript file from the "transcripts" directory and unpickle it
for c in year:
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [2]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023'])
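
The loading loop expects each file to hold a pickled Python object (here, a list of paragraphs), even though the file names end in `.txt`. The round trip can be sketched as follows, assuming a temporary folder and a made-up two-paragraph transcript:

```python
import pickle
import tempfile
from pathlib import Path

# Made-up paragraphs standing in for one saved transcript.
paragraphs = ["My fellow Singaporeans, good evening.", "Thank you."]

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "2019.txt"
    with open(path, "wb") as file:  # pickle needs binary mode
        pickle.dump(paragraphs, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == paragraphs)  # True
```

Only unpickle files from a source you trust, because loading a pickle can run arbitrary code.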

In [3]:
# More checks
data['2019'][:5]

['\nMy fellow Singaporeans, good evening.',
 'This year, we commemorate our Bicentennial, two hundred years of Singapore’s modern history.',
 'We began in January at the Singapore River. Since then, many community groups, businesses, schools and even individuals have marked the Bicentennial in their own ways, and reflected on their own histories. I recently attended the Eurasian Festival. This marked 100 years of the Eurasian Association, and showcased our unique Eurasian history and culture. Earlier in April, five Indian dance groups held a combined performance, the Natya Yatra, at the Esplanade. They were celebrating 100 years of Indian classical dance in Singapore. There have been many events in the heartlands too. In Teck Ghee, we organised “Happily ever after – then and now” to celebrate the diverse and changing wedding customs in Singapore. Some 90 couples, old and young, took the occasion to renew their wedding vows. It was a meaningful and joyous event. For Jalan Kayu, former r

### Cleaning the data

Common data cleaning steps on all text:

* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (\n)
* Tokenize text
* Remove stop words

In [4]:
# Let's take a look at our data again
next(iter(data.keys()))

'2014'

In [5]:
# Notice that our dictionary is currently in key: year, value: list of text format
next(iter(data.values()))

['Explore by topics',
 'PM Lee Hsien Loong delivered his National Day Rally speech on 17 August 2014 at the Institute of Technical Education College Central. He spoke in Malay\xa0and Chinese, followed by English. Here is the transcript of the English speech in full. For the video with sign language interpretation, please scroll down to the bottom of the page.',
 '- - - - - - - - - - - - -',
 '“Believing in Singapore, Pioneering Our Future”',
 'INTRODUCTION',
 'Singapore has come a long way. It is the work of generations, each standing on the shoulders of the one which came before and it started with one special generation – the Pioneer Generation (PG). And one outstanding member of the Pioneer Generation was Encik Yusof Ishak, our first President. Encik Yusof showed that in Singapore, you can rise to the top if you work hard. He stood for enduring values that underpin Singapore’s success – meritocracy, multiracialism, modernisation. He was a President for all Singaporeans. So, to mark 

In [6]:
# We are going to change this to key: year, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [7]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

Pandas is a Python library for data analysis. A dataframe is a pandas object and is basically a table.

In [8]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
2014,Explore by topics PM Lee Hsien Loong delivered his National Day Rally speech on 17 August 2014 at the Institute of Technical Education College Cen...
2015,Explore by topics PM Lee delivered his 2015 National Day Rally speech on 23 August 2015 at the Institute of Technical Education College Central. H...
2016,"Explore by topics PM Lee spoke in Malay and Chinese, followed by English. The English speech is in two parts. This is Part 1 of the speech. Please..."
2017,"Explore by topics Good evening again. My fellow Singaporeans, we have had an eventful year. We have been busy guarding against terrorism and stren..."
2018,"Good evening again. My fellow Singaporeans. We have had a busy year, both at home and internationally Two months ago, we hosted the first ever mee..."
2019,"\nMy fellow Singaporeans, good evening. This year, we commemorate our Bicentennial, two hundred years of Singapore’s modern history. We began in J..."
2020,"My fellow Singaporeans, Every year, rain or shine, Singaporeans come together on the 9th of August for the National Day Parade, to celebrate the m..."
2021,"My fellow Singaporeans Good evening again My last National Day Rally was two years ago. Since then, COVID-19 has changed our world. Globally, it h..."
2022,My fellow Singaporeans Good evening. We have come a long way in our fight against COVID-19. We are now learning to live with the virus. With each ...
2023,"My fellow Singaporeans, good evening. We are all relieved that COVID is behind us. Life as we knew it has resumed. COVID-19 was the most challengi..."


In [9]:
# Let's take a look at the transcript for 2023
data_df.transcript.loc['2023']



In [10]:
# Apply a first round of text cleaning techniques
# re is Python's regular expression module. A regular expression is a sequence of characters that describes a text pattern to match.
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings (r'...') stop Python from treating \[ , \w and \d as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)
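
To see what each substitution does, the sketch below applies the same steps one at a time to a short made-up sentence. The patterns are written as raw strings; the sample text and the `step` variable names are assumptions for illustration.

```python
import re
import string

sample = "On 17 August 2014, PM Lee [applause] spoke about COVID-19!"

step1 = sample.lower()
step2 = re.sub(r"\[.*?\]", "", step1)   # drop bracketed notes
step3 = re.sub("[%s]" % re.escape(string.punctuation), "", step2)
step4 = re.sub(r"\w*\d\w*", "", step3)  # drop tokens containing digits

print(step4.split())  # ['on', 'august', 'pm', 'lee', 'spoke', 'about']
```

Removing punctuation first turns "covid-19" into "covid19", which is then deleted as a token containing digits. This is why COVID-19 disappears entirely from the cleaned transcripts.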

In [11]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
2014,explore by topics pm lee hsien loong delivered his national day rally speech on august at the institute of technical education college central h...
2015,explore by topics pm lee delivered his national day rally speech on august at the institute of technical education college central he spoke in ...
2016,explore by topics pm lee spoke in malay and chinese followed by english the english speech is in two parts this is part of the speech please clic...
2017,explore by topics good evening again my fellow singaporeans we have had an eventful year we have been busy guarding against terrorism and strength...
2018,good evening again my fellow singaporeans we have had a busy year both at home and internationally two months ago we hosted the first ever meeting...
2019,\nmy fellow singaporeans good evening this year we commemorate our bicentennial two hundred years of singapore’s modern history we began in januar...
2020,my fellow singaporeans every year rain or shine singaporeans come together on the of august for the national day parade to celebrate the making o...
2021,my fellow singaporeans good evening again my last national day rally was two years ago since then has changed our world globally it has taken mil...
2022,my fellow singaporeans good evening we have come a long way in our fight against we are now learning to live with the virus with each infection w...
2023,my fellow singaporeans good evening we are all relieved that covid is behind us life as we knew it has resumed was the most challenging ordeal fo...


In [12]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = lambda x: clean_text_round2(x)

In [13]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
2014,explore by topics pm lee hsien loong delivered his national day rally speech on august at the institute of technical education college central h...
2015,explore by topics pm lee delivered his national day rally speech on august at the institute of technical education college central he spoke in ...
2016,explore by topics pm lee spoke in malay and chinese followed by english the english speech is in two parts this is part of the speech please clic...
2017,explore by topics good evening again my fellow singaporeans we have had an eventful year we have been busy guarding against terrorism and strength...
2018,good evening again my fellow singaporeans we have had a busy year both at home and internationally two months ago we hosted the first ever meeting...
2019,my fellow singaporeans good evening this year we commemorate our bicentennial two hundred years of singapores modern history we began in january a...
2020,my fellow singaporeans every year rain or shine singaporeans come together on the of august for the national day parade to celebrate the making o...
2021,my fellow singaporeans good evening again my last national day rally was two years ago since then has changed our world globally it has taken mil...
2022,my fellow singaporeans good evening we have come a long way in our fight against we are now learning to live with the virus with each infection w...
2023,my fellow singaporeans good evening we are all relieved that covid is behind us life as we knew it has resumed was the most challenging ordeal fo...


### Organising the data

In [16]:
# Let's take a look at our dataframe
data_df

Unnamed: 0,transcript
2014,Explore by topics PM Lee Hsien Loong delivered his National Day Rally speech on 17 August 2014 at the Institute of Technical Education College Cen...
2015,Explore by topics PM Lee delivered his 2015 National Day Rally speech on 23 August 2015 at the Institute of Technical Education College Central. H...
2016,"Explore by topics PM Lee spoke in Malay and Chinese, followed by English. The English speech is in two parts. This is Part 1 of the speech. Please..."
2017,"Explore by topics Good evening again. My fellow Singaporeans, we have had an eventful year. We have been busy guarding against terrorism and stren..."
2018,"Good evening again. My fellow Singaporeans. We have had a busy year, both at home and internationally Two months ago, we hosted the first ever mee..."
2019,"\nMy fellow Singaporeans, good evening. This year, we commemorate our Bicentennial, two hundred years of Singapore’s modern history. We began in J..."
2020,"My fellow Singaporeans, Every year, rain or shine, Singaporeans come together on the 9th of August for the National Day Parade, to celebrate the m..."
2021,"My fellow Singaporeans Good evening again My last National Day Rally was two years ago. Since then, COVID-19 has changed our world. Globally, it h..."
2022,My fellow Singaporeans Good evening. We have come a long way in our fight against COVID-19. We are now learning to live with the virus. With each ...
2023,"My fellow Singaporeans, good evening. We are all relieved that COVID is behind us. Life as we knew it has resumed. COVID-19 was the most challengi..."


In [17]:
# Let's add a name for each speech
names = ['National Day 2014','National Day 2015', 'National Day 2016', 
         'National Day 2017', 'National Day 2018', 'National Day 2019', 
        'National Day 2020','National Day 2021', 'National Day 2022',
        'National Day 2023']

data_df['names'] = names
data_df

Unnamed: 0,transcript,names
2014,Explore by topics PM Lee Hsien Loong delivered his National Day Rally speech on 17 August 2014 at the Institute of Technical Education College Cen...,National Day 2014
2015,Explore by topics PM Lee delivered his 2015 National Day Rally speech on 23 August 2015 at the Institute of Technical Education College Central. H...,National Day 2015
2016,"Explore by topics PM Lee spoke in Malay and Chinese, followed by English. The English speech is in two parts. This is Part 1 of the speech. Please...",National Day 2016
2017,"Explore by topics Good evening again. My fellow Singaporeans, we have had an eventful year. We have been busy guarding against terrorism and stren...",National Day 2017
2018,"Good evening again. My fellow Singaporeans. We have had a busy year, both at home and internationally Two months ago, we hosted the first ever mee...",National Day 2018
2019,"\nMy fellow Singaporeans, good evening. This year, we commemorate our Bicentennial, two hundred years of Singapore’s modern history. We began in J...",National Day 2019
2020,"My fellow Singaporeans, Every year, rain or shine, Singaporeans come together on the 9th of August for the National Day Parade, to celebrate the m...",National Day 2020
2021,"My fellow Singaporeans Good evening again My last National Day Rally was two years ago. Since then, COVID-19 has changed our world. Globally, it h...",National Day 2021
2022,My fellow Singaporeans Good evening. We have come a long way in our fight against COVID-19. We are now learning to live with the virus. With each ...,National Day 2022
2023,"My fellow Singaporeans, good evening. We are all relieved that COVID is behind us. Life as we knew it has resumed. COVID-19 was the most challengi...",National Day 2023


In [18]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

For many of the techniques used in text analysis, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.

In [19]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
# scikit-learn: Python library for machine learning
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,aa,aac,aacs,aarman,abang,abbott,abc,abdul,abiding,abilities,...,新谣song,有山有水,李大傻讲古,梁文福,洪健华,百年树人,立国一代,行行出状元,请起立,顾名思义
2014,0,0,0,0,0,0,1,0,1,1,...,1,0,0,1,1,0,0,1,1,1
2015,1,0,0,0,0,1,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2016,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2017,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2018,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2019,0,0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
2020,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2021,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2022,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2023,0,8,7,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [20]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [21]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)

### Removing stop words

In [22]:
# Read in the document-term matrix

data = pd.read_pickle('dtm.pkl')
data = data.transpose()
data.head()

Unnamed: 0,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
aa,0,1,0,0,0,0,0,0,0,0
aac,0,0,0,0,0,0,0,0,0,8
aacs,0,0,0,0,0,0,0,0,0,7
aarman,0,0,0,0,0,0,0,1,0,0
abang,0,0,1,0,0,0,0,0,0,0


In [23]:
# Find the top 30 words said in each speech
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c]= list(zip(top.index, top.values))

top_dict

{'2014': [('mr', 58),
  ('singapore', 57),
  ('cpf', 51),
  ('year', 35),
  ('years', 35),
  ('think', 34),
  ('just', 33),
  ('jurong', 32),
  ('people', 31),
  ('need', 31),
  ('pm', 30),
  ('said', 28),
  ('want', 28),
  ('work', 24),
  ('good', 24),
  ('gardens', 24),
  ('like', 23),
  ('government', 23),
  ('singaporeans', 22),
  ('going', 22),
  ('home', 22),
  ('old', 22),
  ('new', 21),
  ('lake', 21),
  ('retirement', 20),
  ('help', 20),
  ('money', 20),
  ('month', 20),
  ('flat', 20),
  ('keppel', 20)],
 '2015': [('singapore', 84),
  ('years', 59),
  ('people', 48),
  ('mr', 39),
  ('help', 32),
  ('lee', 28),
  ('national', 26),
  ('day', 25),
  ('time', 25),
  ('year', 25),
  ('just', 24),
  ('singaporeans', 23),
  ('children', 23),
  ('think', 23),
  ('hdb', 23),
  ('did', 22),
  ('make', 22),
  ('work', 21),
  ('flat', 21),
  ('good', 21),
  ('government', 20),
  ('sit', 20),
  ('future', 18),
  ('team', 18),
  ('like', 17),
  ('flats', 17),
  ('old', 17),
  ('home', 16

In [24]:
# Print the top 14 words said in each speech
for year, top_words in top_dict.items():
    print(year)
    print(', '.join([word for word, count in top_words[0:14]]))
    print('---')

2014
mr, singapore, cpf, year, years, think, just, jurong, people, need, pm, said, want, work
---
2015
singapore, years, people, mr, help, lee, national, day, time, year, just, singaporeans, children, think
---
2016
singapore, president, people, china, new, time, need, make, like, good, work, world, just, years
---
2017
years, diabetes, good, like, people, make, preschool, children, just, rice, day, teachers, singapore, need
---
2018
years, flats, flat, singapore, hip, old, hdb, new, government, singaporeans, just, like, housing, good
---
2019
singapore, years, like, year, city, climate, new, sea, change, government, preschool, good, just, need
---
2020
national, day, singaporeans, singapore, parade, years, message, hold, year, look, crisis, bay, parades, like
---
2021
singapore, singaporeans, workers, work, time, people, like, racial, national, pass, different, government, tudung, years
---
2022
singapore, singaporeans, people, just, like, countries, society, make, world, national, ta

In [25]:
# Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each year
words = []
for year in data.columns:
    top = [word for (word, count) in top_dict[year]]
    for t in top:
        words.append(t)
        
words

['mr',
 'singapore',
 'cpf',
 'year',
 'years',
 'think',
 'just',
 'jurong',
 'people',
 'need',
 'pm',
 'said',
 'want',
 'work',
 'good',
 'gardens',
 'like',
 'government',
 'singaporeans',
 'going',
 'home',
 'old',
 'new',
 'lake',
 'retirement',
 'help',
 'money',
 'month',
 'flat',
 'keppel',
 'singapore',
 'years',
 'people',
 'mr',
 'help',
 'lee',
 'national',
 'day',
 'time',
 'year',
 'just',
 'singaporeans',
 'children',
 'think',
 'hdb',
 'did',
 'make',
 'work',
 'flat',
 'good',
 'government',
 'sit',
 'future',
 'team',
 'like',
 'flats',
 'old',
 'home',
 'special',
 'said',
 'singapore',
 'president',
 'people',
 'china',
 'new',
 'time',
 'need',
 'make',
 'like',
 'good',
 'work',
 'world',
 'just',
 'years',
 'doing',
 'companies',
 'want',
 'sea',
 'countries',
 'did',
 'different',
 'business',
 'singaporeans',
 'sure',
 'minority',
 'south',
 'race',
 'day',
 'jobs',
 'things',
 'years',
 'diabetes',
 'good',
 'like',
 'people',
 'make',
 'preschool',
 'childr

In [26]:
# Let's aggregate this list and identify the most common words along with how many speeches they occur in
Counter(words).most_common()

[('singapore', 10),
 ('years', 10),
 ('like', 10),
 ('just', 9),
 ('singaporeans', 9),
 ('people', 8),
 ('new', 8),
 ('work', 7),
 ('good', 7),
 ('government', 7),
 ('make', 7),
 ('year', 6),
 ('day', 6),
 ('time', 6),
 ('need', 5),
 ('want', 4),
 ('old', 4),
 ('help', 4),
 ('flat', 4),
 ('national', 4),
 ('world', 4),
 ('hdb', 3),
 ('did', 3),
 ('future', 3),
 ('flats', 3),
 ('different', 3),
 ('better', 3),
 ('let', 3),
 ('mr', 2),
 ('think', 2),
 ('said', 2),
 ('home', 2),
 ('lee', 2),
 ('children', 2),
 ('companies', 2),
 ('sea', 2),
 ('countries', 2),
 ('race', 2),
 ('jobs', 2),
 ('things', 2),
 ('preschool', 2),
 ('start', 2),
 ('housing', 2),
 ('young', 2),
 ('build', 2),
 ('workers', 2),
 ('land', 2),
 ('today', 2),
 ('look', 2),
 ('come', 2),
 ('society', 2),
 ('cpf', 1),
 ('jurong', 1),
 ('pm', 1),
 ('gardens', 1),
 ('going', 1),
 ('lake', 1),
 ('retirement', 1),
 ('money', 1),
 ('month', 1),
 ('keppel', 1),
 ('sit', 1),
 ('team', 1),
 ('special', 1),
 ('president', 1),
 ('ch

In [27]:
# If more than half of the speeches have it as a top word, add it to the list of extra stop words
add_stop_words = [word for word, count in Counter(words).most_common() if count > 5]
add_stop_words

['singapore',
 'years',
 'like',
 'just',
 'singaporeans',
 'people',
 'new',
 'work',
 'good',
 'government',
 'make',
 'year',
 'day',
 'time']
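
The thresholding step above can be checked on a small example. The sketch below uses a hypothetical `toy_top_words` dictionary (three invented documents, not the workshop data) and flags any word that appears in the top-word list of more than half of them.

```python
from collections import Counter

# Hypothetical top-word lists for three documents (illustrative only)
toy_top_words = {
    "doc_a": ["singapore", "years", "cpf"],
    "doc_b": ["singapore", "years", "china"],
    "doc_c": ["singapore", "flats", "hdb"],
}

# Flatten the lists; each word appears at most once per document,
# so its count equals the number of documents that list it
all_words = [word for top in toy_top_words.values() for word in top]
doc_counts = Counter(all_words)

# Keep words listed by more than half of the documents
threshold = len(toy_top_words) / 2
common_words = [word for word, count in doc_counts.most_common() if count > threshold]
print(common_words)
```

Deriving the threshold from the number of documents, rather than hard-coding `5`, keeps the rule valid if speeches are added or removed.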

In [28]:
# Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
data_clean = pd.read_pickle('data_clean.pkl')

# Add new stop words
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate document-term matrix
cv = CountVectorizer(stop_words=list(stop_words))
data_cv = cv.fit_transform(data_clean.transcript)
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_stop.index = data_clean.index

# Pickle it for later use
import pickle
with open("cv_stop.pkl", "wb") as f:
    pickle.dump(cv, f)
data_stop.to_pickle("dtm_stop.pkl")

## Topic Modelling

### Attempt #1 (all text)

In [31]:
# Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop.pkl')
data

Unnamed: 0,aa,aac,aacs,aarman,abang,abbott,abc,abdul,abiding,abilities,...,新谣song,有山有水,李大傻讲古,梁文福,洪健华,百年树人,立国一代,行行出状元,请起立,顾名思义
2014,0,0,0,0,0,0,1,0,1,1,...,1,0,0,1,1,0,0,1,1,1
2015,1,0,0,0,0,1,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2016,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2017,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2018,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2019,0,0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
2020,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2021,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2022,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2023,0,8,7,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [32]:
# Import the necessary modules for LDA with gensim
# gensim is a Python toolkit specifically for topic modelling
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [33]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
aa,0,1,0,0,0,0,0,0,0,0
aac,0,0,0,0,0,0,0,0,0,8
aacs,0,0,0,0,0,0,0,0,0,7
aarman,0,0,0,0,0,0,0,1,0,0
abang,0,0,1,0,0,0,0,0,0,0


In [34]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [35]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
with open("cv_stop.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
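
To make the two gensim inputs more concrete, here is a pure-Python sketch of roughly what they look like. It does not call gensim. The names `toy_vocabulary`, `toy_tdm` and `toy_corpus` and all the counts are invented for illustration. Each document ends up as a list of `(term_id, count)` pairs, and the inverted vocabulary maps each id back to its term.

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_ (term -> column index)
toy_vocabulary = {"cpf": 0, "flats": 1, "china": 2}

# Invert it to the id -> term mapping that gensim expects
toy_id2word = {idx: term for term, idx in toy_vocabulary.items()}

# Hypothetical term-document matrix: rows are terms, columns are documents
toy_tdm = [
    [3, 0],  # "cpf"
    [1, 4],  # "flats"
    [0, 2],  # "china"
]

# Build one bag-of-words per document (column), keeping only non-zero counts
n_docs = len(toy_tdm[0])
toy_corpus = [
    [(term_id, row[doc]) for term_id, row in enumerate(toy_tdm) if row[doc] > 0]
    for doc in range(n_docs)
]

# Translate the ids back into terms to check the mapping
for doc in toy_corpus:
    print([(toy_id2word[term_id], count) for term_id, count in doc])
```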

Let's start with 2 topics, see whether the result makes sense, and increase the number of topics from there.

In [36]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

[(0,
  '0.005*"need" + 0.004*"want" + 0.004*"old" + 0.003*"mr" + 0.003*"think" + 0.003*"help" + 0.003*"cpf" + 0.003*"flat" + 0.003*"better" + 0.003*"president"'),
 (1,
  '0.005*"national" + 0.004*"help" + 0.004*"flats" + 0.003*"hdb" + 0.003*"mr" + 0.003*"seniors" + 0.003*"world" + 0.003*"workers" + 0.003*"lee" + 0.003*"society"')]

In [37]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.006*"national" + 0.004*"help" + 0.004*"mr" + 0.004*"children" + 0.003*"lee" + 0.003*"workers" + 0.003*"different" + 0.003*"future" + 0.003*"need" + 0.003*"world"'),
 (1,
  '0.007*"flats" + 0.005*"hdb" + 0.004*"flat" + 0.004*"seniors" + 0.004*"help" + 0.003*"housing" + 0.003*"world" + 0.003*"want" + 0.003*"old" + 0.003*"let"'),
 (2,
  '0.006*"mr" + 0.005*"need" + 0.005*"president" + 0.004*"cpf" + 0.004*"want" + 0.004*"china" + 0.004*"think" + 0.003*"pm" + 0.003*"said" + 0.003*"did"')]

In [38]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.004*"national" + 0.004*"president" + 0.004*"mr" + 0.004*"help" + 0.004*"did" + 0.003*"need" + 0.003*"world" + 0.003*"china" + 0.003*"different" + 0.003*"workers"'),
 (1,
  '0.009*"flats" + 0.008*"flat" + 0.006*"old" + 0.006*"hip" + 0.005*"hdb" + 0.004*"housing" + 0.003*"healthcare" + 0.003*"generation" + 0.003*"want" + 0.003*"better"'),
 (2,
  '0.005*"need" + 0.004*"national" + 0.004*"mr" + 0.004*"want" + 0.004*"cpf" + 0.003*"way" + 0.003*"build" + 0.003*"let" + 0.003*"think" + 0.003*"help"'),
 (3,
  '0.011*"seniors" + 0.011*"flats" + 0.007*"hdb" + 0.005*"projects" + 0.005*"help" + 0.004*"housing" + 0.004*"plus" + 0.004*"estates" + 0.004*"prices" + 0.003*"today"')]
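
Comparing topics across runs is easier when the `print_topics()` strings are turned into plain word lists. The helper below, `parse_topic`, is a hypothetical convenience function rather than part of gensim. It is a minimal sketch that assumes the `weight*"word"` format shown in the output above.

```python
import re

def parse_topic(topic_string):
    """Turn a print_topics() string into a list of (word, weight) tuples."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Shortened version of one topic string from the output above
example = '0.009*"flats" + 0.008*"flat" + 0.006*"old" + 0.006*"hip" + 0.005*"hdb"'
top_words = [word for word, weight in parse_topic(example)]
print(top_words)
```

LDA starts from a random initialisation, so the topics will usually differ between runs. If reproducible output matters, `LdaModel` accepts a `random_state` argument that can be set to a fixed integer.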

### Attempt #2 (nouns only)

One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.). Check out the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

In [40]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)
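
The filter relies on Penn Treebank tags sharing a prefix within a part of speech (for example `NN`, `NNS` and `NNP` are all nouns). The sketch below shows the same prefix test on hand-written `(token, tag)` pairs, so it runs without downloading any NLTK resources. The tags are assigned by hand for illustration rather than produced by the tagger, and `keep_by_prefix` is a hypothetical helper.

```python
# Hand-written (token, tag) pairs in Penn Treebank style; no NLTK download needed
tagged = [
    ("seniors", "NNS"),
    ("need", "VBP"),
    ("affordable", "JJ"),
    ("housing", "NN"),
    ("quickly", "RB"),
]

def keep_by_prefix(tagged_tokens, prefixes):
    """Keep tokens whose tag starts with any of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

nouns_only = keep_by_prefix(tagged, ["NN"])
nouns_and_adjectives = keep_by_prefix(tagged, ["NN", "JJ"])
print(nouns_only)
print(nouns_and_adjectives)
```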

In [11]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,transcript
2014,explore by topics pm lee hsien loong delivered...
2015,explore by topics pm lee delivered his nation...
2016,explore by topics pm lee spoke in malay and ch...
2017,explore by topics good evening again my fellow...
2018,good evening again my fellow singaporeans we h...
2019,my fellow singaporeans good evening this year ...
2020,my fellow singaporeans every year rain or shin...
2021,my fellow singaporeans good evening again my l...
2022,my fellow singaporeans good evening we have co...
2023,my fellow singaporeans good evening we are all...


In [12]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))
data_nouns

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\theresa.lee\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\theresa.lee\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Unnamed: 0,transcript
2014,topics hsien loong day august institute educat...
2015,topics lee day august institute education coll...
2016,topics spoke malay speech parts part speech pl...
2017,topics evening singaporeans year terrorism har...
2018,evening singaporeans year home months presiden...
2019,singaporeans evening year years singapores his...
2020,singaporeans year rain shine singaporeans augu...
2021,singaporeans evening day rally years world glo...
2022,singaporeans evening way fight virus infection...
2023,singaporeans evening covid life ordeal nation ...


In [13]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
# (this hard-coded list leaves out 'year', 'day' and 'time' from the list computed earlier)
add_stop_words = ['singapore','years','like','just','singaporeans','people','new','work','good','government','make']
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=list(stop_words))
data_cvn = cvn.fit_transform(data_nouns.transcript)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn

Unnamed: 0,aac,aacs,abbott,abc,abdul,abilities,ability,abstract,abu,academics,...,zumba,严众莲,十年树木,新年快乐,有山有水,李大傻讲古,梁文福,洪健华,百年树人,立国一代
2014,0,0,0,1,0,1,2,0,4,0,...,0,1,0,0,0,0,1,1,0,0
2015,0,0,1,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,1,0
2016,0,0,0,0,0,0,2,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2017,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2018,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2019,0,0,0,0,0,1,0,1,0,1,...,0,0,0,0,0,1,0,0,0,0
2020,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2021,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2022,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2023,7,5,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [14]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [15]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.010*"flats" + 0.008*"time" + 0.008*"year" + 0.007*"day" + 0.005*"housing" + 0.005*"mr" + 0.005*"seniors" + 0.005*"hdb" + 0.005*"cpf" + 0.005*"home"'),
 (1,
  '0.007*"day" + 0.007*"time" + 0.006*"year" + 0.006*"world" + 0.006*"president" + 0.005*"way" + 0.004*"countries" + 0.004*"sea" + 0.004*"children" + 0.004*"preschool"')]

In [16]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.011*"flats" + 0.011*"seniors" + 0.008*"day" + 0.008*"time" + 0.007*"workers" + 0.005*"projects" + 0.005*"world" + 0.005*"hdb" + 0.005*"year" + 0.005*"housing"'),
 (1,
  '0.009*"year" + 0.006*"time" + 0.006*"flats" + 0.006*"cpf" + 0.006*"day" + 0.006*"way" + 0.005*"world" + 0.005*"generation" + 0.004*"life" + 0.004*"housing"'),
 (2,
  '0.009*"time" + 0.008*"day" + 0.007*"president" + 0.006*"children" + 0.006*"year" + 0.006*"world" + 0.004*"mr" + 0.004*"china" + 0.004*"jobs" + 0.004*"future"')]

In [17]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.010*"flats" + 0.009*"year" + 0.009*"seniors" + 0.008*"cpf" + 0.007*"mr" + 0.007*"time" + 0.007*"day" + 0.006*"home" + 0.006*"housing" + 0.006*"hdb"'),
 (1,
  '0.011*"time" + 0.011*"president" + 0.007*"companies" + 0.007*"workers" + 0.006*"world" + 0.006*"race" + 0.006*"day" + 0.006*"china" + 0.006*"countries" + 0.005*"jobs"'),
 (2,
  '0.011*"day" + 0.004*"year" + 0.004*"parade" + 0.004*"crisis" + 0.003*"world" + 0.003*"nation" + 0.003*"bay" + 0.003*"employers" + 0.003*"workers" + 0.002*"jobs"'),
 (3,
  '0.008*"year" + 0.007*"day" + 0.006*"flats" + 0.006*"world" + 0.006*"time" + 0.006*"children" + 0.005*"way" + 0.005*"life" + 0.005*"preschool" + 0.004*"school"')]

### Attempt #3 (Nouns and Adjectives)

In [41]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    selected = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(selected)

In [42]:
# Apply the nouns_adj function to the transcripts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,transcript
2014,topics pm lee hsien loong national day august institute technical education college central malay chinese english transcript english speech full v...
2015,topics pm lee national day august institute technical education college central malay chinese transcript english speech full video sign language i...
2016,topics pm lee spoke malay chinese english speech parts part speech please click part pm lee ill delivery english speech completeness record transc...
2017,topics good evening fellow singaporeans eventful year busy terrorism racial harmony friends other countries big small last year economy skills bui...
2018,good evening fellow singaporeans busy year home months first president dprk chairman mr donald trump mr kim jong un singapore host discussions big...
2019,fellow singaporeans good evening year bicentennial years singapores modern history january singapore river many community groups businesses school...
2020,fellow singaporeans year rain shine singaporeans august national day making nation commitment i ndp first parades part padang rain contingents ste...
2021,fellow singaporeans good evening last national day rally years world globally millions lives many more countless jobs businesses time virus differ...
2022,fellow singaporeans good evening long way fight virus infection impact latest omicron wave many other countries wave cases roller coaster cases ho...
2023,fellow singaporeans good evening covid life challenging ordeal nation independence proud many other countries pandemic stronger resilient united t...


In [43]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=list(stop_words), max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,aac,aacs,abang,abbott,abc,abdul,abilities,ability,abstract,abu,...,新年快乐,新谣song,有山有水,李大傻讲古,梁文福,洪健华,百年树人,立国一代,行行出状元,请起立
2014,0,0,0,0,1,0,1,2,0,12,...,0,1,0,0,1,1,0,0,1,1
2015,0,0,0,1,1,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
2016,0,0,1,0,0,0,0,2,0,0,...,1,0,0,0,0,0,0,0,0,0
2017,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2018,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2019,0,0,0,0,0,0,1,0,1,0,...,0,0,0,1,0,0,0,0,0,0
2020,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2021,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2022,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2023,7,6,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
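
The effect of `max_df=.8` can be reproduced by hand. When `max_df` is a float, `CountVectorizer` ignores terms that appear in more than that proportion of the documents. The sketch below applies the same rule to a hypothetical `toy_docs` list of five invented documents, using only the standard library.

```python
# Hypothetical tokenised documents (illustrative only)
toy_docs = [
    ["day", "flats", "cpf"],
    ["day", "flats", "china"],
    ["day", "climate", "sea"],
    ["day", "flats", "seniors"],
    ["day", "race", "religion"],
]

max_df = 0.8
n_docs = len(toy_docs)

# Document frequency: number of documents that contain each term at least once
doc_freq = {}
for doc in toy_docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Keep terms whose document frequency does not exceed max_df * n_docs
kept = sorted(term for term, df in doc_freq.items() if df <= max_df * n_docs)
dropped = sorted(term for term, df in doc_freq.items() if df > max_df * n_docs)
print(dropped)
```

With ten speeches and `max_df=.8`, this rule removes any term that occurs in nine or all ten transcripts.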


In [44]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [45]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.005*"cpf" + 0.003*"jurong" + 0.003*"change" + 0.003*"retirement" + 0.003*"school" + 0.003*"companies" + 0.003*"sea" + 0.002*"age" + 0.002*"education" + 0.002*"money"'),
 (1,
  '0.009*"flats" + 0.006*"flat" + 0.005*"president" + 0.005*"seniors" + 0.004*"housing" + 0.003*"china" + 0.003*"projects" + 0.002*"companies" + 0.002*"parents" + 0.002*"diabetes"')]

In [23]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.010*"flats" + 0.008*"flat" + 0.006*"cpf" + 0.005*"seniors" + 0.005*"housing" + 0.003*"projects" + 0.003*"age" + 0.003*"jurong" + 0.003*"retirement" + 0.003*"education"'),
 (1,
  '0.013*"president" + 0.008*"china" + 0.005*"companies" + 0.004*"sea" + 0.004*"south" + 0.004*"minority" + 0.004*"race" + 0.003*"drivers" + 0.003*"taxi" + 0.003*"malaysia"'),
 (2,
  '0.004*"change" + 0.003*"sea" + 0.003*"companies" + 0.003*"school" + 0.003*"racial" + 0.003*"law" + 0.003*"talent" + 0.003*"climate" + 0.003*"preschool" + 0.003*"port"')]

In [51]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.005*"flat" + 0.004*"flats" + 0.004*"sit" + 0.003*"babies" + 0.003*"housing" + 0.003*"parents" + 0.003*"problems" + 0.003*"spirit" + 0.003*"households" + 0.003*"parade"'),
 (1,
  '0.008*"cpf" + 0.004*"jurong" + 0.004*"retirement" + 0.004*"change" + 0.003*"school" + 0.003*"companies" + 0.003*"age" + 0.003*"education" + 0.003*"sea" + 0.003*"money"'),
 (2,
  '0.015*"flats" + 0.009*"flat" + 0.008*"seniors" + 0.006*"housing" + 0.006*"projects" + 0.004*"diabetes" + 0.004*"hip" + 0.004*"preschool" + 0.004*"health" + 0.003*"healthcare"'),
 (3,
  '0.010*"president" + 0.006*"china" + 0.004*"companies" + 0.004*"sea" + 0.003*"south" + 0.003*"race" + 0.003*"talent" + 0.003*"international" + 0.003*"law" + 0.003*"minority"')]

### Identify Topics in Each Document

Of the nine topic models we looked at, the four-topic model built on nouns and adjectives made the most sense. Let's rerun it here with more passes to get more fine-tuned topics.

In [46]:
# Our final LDA model (for now)
#ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=300)
ldana.print_topics()

[(0,
  '0.011*"seniors" + 0.010*"flats" + 0.006*"projects" + 0.006*"diabetes" + 0.005*"preschool" + 0.004*"health" + 0.004*"rice" + 0.004*"housing" + 0.004*"teachers" + 0.004*"social"'),
 (1,
  '0.009*"flat" + 0.007*"flats" + 0.007*"cpf" + 0.004*"housing" + 0.004*"jurong" + 0.004*"hip" + 0.003*"lease" + 0.003*"scheme" + 0.003*"gardens" + 0.003*"residents"'),
 (2,
  '0.009*"president" + 0.007*"sea" + 0.006*"china" + 0.004*"companies" + 0.004*"change" + 0.004*"preschool" + 0.003*"climate" + 0.003*"south" + 0.003*"minority" + 0.003*"school"'),
 (3,
  '0.005*"racial" + 0.005*"companies" + 0.004*"race" + 0.004*"flat" + 0.003*"pass" + 0.003*"tudung" + 0.003*"flats" + 0.003*"religion" + 0.003*"foreign" + 0.003*"religious"')]

* Topic 0: Ageing population/health
* Topic 1: Housing
* Topic 2: International
* Topic 3: Race/religion

In [47]:
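# Let's take a look at which topic dominates each transcript
# (a transcript can be assigned more than one topic, so pick the most probable one)
corpus_transformed = ldana[corpusna]
dominant_topics = [max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed]
list(zip(dominant_topics, data_dtmna.index))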

[(1, '2014'),
 (3, '2015'),
 (2, '2016'),
 (0, '2017'),
 (1, '2018'),
 (2, '2019'),
 (0, '2020'),
 (3, '2021'),
 (1, '2022'),
 (0, '2023')]

* Topic 0: Ageing population/health [2017, 2020, 2023]
* Topic 1: Housing [2014, 2018, 2022]
* Topic 2: International [2016, 2019]
* Topic 3: Race/religion [2015, 2021]
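
The grouping in the list above can be produced programmatically rather than by hand. This sketch copies the `(topic, year)` pairs from the output shown earlier and uses the topic labels interpreted above. The variable names are illustrative.

```python
from collections import defaultdict

# Dominant topic per year, copied from the output above
assignments = [(1, '2014'), (3, '2015'), (2, '2016'), (0, '2017'), (1, '2018'),
               (2, '2019'), (0, '2020'), (3, '2021'), (1, '2022'), (0, '2023')]

# Topic labels as interpreted above
topic_labels = {0: 'Ageing population/health', 1: 'Housing',
                2: 'International', 3: 'Race/religion'}

# Collect the years that belong to each topic
years_by_topic = defaultdict(list)
for topic_id, year in assignments:
    years_by_topic[topic_id].append(year)

for topic_id in sorted(years_by_topic):
    print(f"Topic {topic_id}: {topic_labels[topic_id]} {years_by_topic[topic_id]}")
```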