# Text Data Analysis

In [None]:
import json
import pandas as pd
import spacy
from tqdm import tqdm
import nltk
import flair
import gensim
import umap
import numpy as np
import plotly.express as px
import transformers

## 1. Hansard Data

In this section, we explore Hansard data, which consists of speeches and debates made in Singapore's Parliament Chamber and provides a record of parliamentary business and proceedings in a Sitting. Data from Hansard has already been scraped for you, focusing specifically on the Committee of Supply (or Budget) debates for the 15th Parliament (from 2021 to present). We will use this as an opportunity to explore sentiment analysis and topic modelling.

### 1.1 Importing the data and doing simple processing

In [None]:
# Read in the Hansard data
hansard_df = pd.read_csv("Hansard_15th_Parl_COS.csv")

In [None]:
print(hansard_df.shape)
hansard_df.head()

Let's try to enrich this dataset with some useful variables. 

<span style="background-color: #FFFF00">**Exercise:** Create two new columns for this dataset: 
* `Sitting Year` (int): Year in which the speech was given
* `Speech Length` (int): Number of words in the speech </span>

In [None]:
hansard_df.head()

In [None]:
hansard_df['Speech Length'].plot.hist()

### 1.2 Sentiment Analysis

Let's start with applying some sentiment analysis. While most Parliamentary speeches are likely to be quite mild in terms of sentiment, we might be able to identify some more impassioned speeches. Before you proceed, make sure you have both the `spacy` and `spacytextblob` libraries installed.

In [None]:
# Run these commands to download textblob's additional corpora and spaCy's small English model
!python -m textblob.download_corpora
!python -m spacy download en_core_web_sm

In [None]:
from spacytextblob.spacytextblob import SpacyTextBlob

# Initialise the NLP pipeline and add the spacytextblob step to the pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

# Add our texts
texts = hansard_df['Speech']

# This will take about 20-30 seconds to run
sentiment_results = []
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "ner", "parser", "attribute_ruler", "lemmatizer"]):
    sentiment_results.append({
        'Polarity': doc._.blob.polarity,
        'Subjectivity': doc._.blob.subjectivity, 
    })

In [None]:
# Now we append it to our original dataset
sentiment_results_df = pd.DataFrame(sentiment_results)
hansard_df = pd.concat([hansard_df, sentiment_results_df], axis = 1)

Let's do some simple data analysis to understand the distributions of the polarity and subjectivity scores. Before you run the cells below, think carefully about what you expect from the data.

In [None]:
# Polarity scores range from [-1 to 1], with -1 indicating very negative and 1 indicating very positive
hansard_df['Polarity'].plot.hist()

In [None]:
# Subjectivity scores range from [0 to 1], with 0 indicating very objective and 1 indicating very subjective
hansard_df['Subjectivity'].plot.hist()

Unsurprisingly, most texts are neutral and objective. But this may be skewed by how often the Chairman speaks. Let's filter those rows out and look at the distributions again.

In [None]:
hansard_df_cleaned = hansard_df[hansard_df['Speaker'] != "The Chairman"].reset_index(drop = True)

In [None]:
# Polarity scores range from [-1 to 1], with -1 indicating very negative and 1 indicating very positive
hansard_df_cleaned['Polarity'].plot.hist()

In [None]:
# Subjectivity scores range from [0 to 1], with 0 indicating very objective and 1 indicating very subjective
hansard_df_cleaned['Subjectivity'].plot.hist()

Let's find the most positive speech and the most negative speech! Share your thoughts about why you think these two speeches received the most extreme scores.

In [None]:
# idxmin/idxmax return index labels, which is what .loc expects
hansard_df_cleaned.loc[hansard_df_cleaned['Polarity'].idxmin()]

In [None]:
hansard_df_cleaned.loc[hansard_df_cleaned['Polarity'].idxmax()]

Both of these speeches seem a bit short, which might explain their extreme polarity scores. Let's plot a scatter plot to highlight the relationship between speech length and polarity.

In [None]:
hansard_df_cleaned.plot.scatter(x = 'Speech Length', y = 'Polarity')

<span style="background-color: #FFFF00">**Class Discussion:** Given that `textblob` is a dictionary-based approach to sentiment analysis, can you think of why longer speeches tend to have less extreme values for positive/negative sentiment?</span>
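
One way to probe the question above is a small simulation. The sketch below assumes a heavily simplified dictionary-based scorer that just averages the polarity of the words it finds in its lexicon; the real `textblob` analyser also handles things like intensifiers and negation, and the mini-lexicon and the `simulate_spread` helper here are hypothetical. It compares how widely the average score varies for "speeches" with few versus many sentiment-bearing words.

```python
import random
import statistics

# Hypothetical mini-lexicon; real sentiment lexicons are far larger.
LEXICON = {"excellent": 1.0, "good": 0.7, "poor": -0.4, "terrible": -1.0}


def average_polarity(words):
    """Mean polarity of the words found in the lexicon (0.0 if none match)."""
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def simulate_spread(n_sentiment_words, n_speeches=500, seed=0):
    """Standard deviation of the average polarity across simulated speeches."""
    rng = random.Random(seed)
    vocab = list(LEXICON)
    polarities = [
        average_polarity(rng.choices(vocab, k=n_sentiment_words))
        for _ in range(n_speeches)
    ]
    return statistics.pstdev(polarities)


short_spread = simulate_spread(3)    # speeches with only 3 sentiment words
long_spread = simulate_spread(100)   # speeches with 100 sentiment words
print(f"Spread for short speeches: {short_spread:.3f}")
print(f"Spread for long speeches:  {long_spread:.3f}")
```

Under this assumption, the more sentiment words a speech contains, the more the positive and negative ones cancel out, so the average is pulled towards the middle of the scale.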

Now let's try a different approach: using an embedding-based classifier instead! We will use a small embedding-based classifier that has already been finetuned to save time.

In [None]:
from flair.nn import Classifier
from flair.data import Sentence
tagger = Classifier.load('./flair_sentiment.pt')

In [None]:
# Let's try it out with a random speech
sentence = Sentence(hansard_df_cleaned['Speech'][2])
tagger.predict(sentence)
print(sentence)

In [None]:
sentiment_scores = []

# This should take around 3-5 minutes
for text in tqdm(hansard_df_cleaned['Speech'].tolist()):
    sentence = Sentence(text)
    tagger.predict(sentence)

    # The score is the confidence in the predicted label, so for NEGATIVE
    # predictions we take the complement (1 - score) to get a "positivity" score
    if sentence.labels[0].value == 'NEGATIVE':
        sentiment_scores.append(1 - sentence.labels[0].score)
    else:
        sentiment_scores.append(sentence.labels[0].score)
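
The conversion in the loop above can be isolated into a small helper, which makes the logic easier to check by hand. This is only a sketch: `positive_probability` is a hypothetical name, and it assumes the classifier emits exactly two labels, `'POSITIVE'` and `'NEGATIVE'`, each with a confidence between 0 and 1.

```python
def positive_probability(label, confidence):
    """Map a (label, confidence) prediction onto one 0-1 'how positive' scale."""
    if label == "NEGATIVE":
        # 90% sure it is negative means only 10% positive
        return 1.0 - confidence
    return confidence


print(positive_probability("POSITIVE", 0.75))
print(positive_probability("NEGATIVE", 0.75))
```

With this mapping, a confidently negative speech lands near 0, a confidently positive one near 1, and uncertain predictions of either label land near 0.5.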

In [None]:
hansard_df_cleaned['Sentiment'] = sentiment_scores

In [None]:
hansard_df_cleaned['Sentiment'].plot.hist()

<span style="background-color: #FFFF00">**Class Discussion:** What do you notice about this chart that is different from the `textblob` model results? Why do you think there is such a big difference?</span>

In [None]:
hansard_df_cleaned.plot.scatter(x = 'Speech Length', y = 'Sentiment')

### 1.3 Topic modelling

Since Parliamentary debates tend to be quite topic-focused, topic modelling would be a good option for us to better understand the ongoing debates and to get a sense of the priority areas for discussion in Parliament.

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

# Download necessary NLTK data
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
stop_words.update(['also', 'mr', 'chairman', 'beg', 'move'])
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenization
    words = word_tokenize(text.lower())
    
    # Remove punctuation and non-alphabetic tokens
    words = [word for word in words if word.isalpha()]
    
    # Stopword removal and lemmatization
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    
    return ' '.join(words)

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Step 1: Preprocessing (with stopword removal and lemmatization)
texts_preprocessed = [preprocess(text) for text in hansard_df_cleaned['Speech']]

# Step 2: Vectorizing the text data
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(texts_preprocessed)

# Step 3: Applying LDA for Topic Modeling
lda = LatentDirichletAllocation(n_components = 10, random_state = 2024)
lda.fit(dtm)

# Step 4: Extracting and Displaying Topics
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(", ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10
tf_feature_names = vectorizer.get_feature_names_out()
display_topics(lda, tf_feature_names, no_top_words)
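
The slice `[:-no_top_words - 1:-1]` inside `display_topics` is compact but easy to misread. The sketch below reproduces it on a toy example, using a plain-Python stand-in for `argsort` (the weights and vocabulary are made up for illustration): sorting gives indices in ascending order of weight, and the slice then walks backwards from the end to pick the `n` largest.

```python
# Made-up word weights for a single topic, for illustration only
weights = [0.1, 0.7, 0.05, 0.4, 0.2]
vocab = ["tax", "school", "road", "teacher", "bus"]

# Plain-Python equivalent of argsort: indices ordered by ascending weight
ascending = sorted(range(len(weights)), key=weights.__getitem__)

n = 3
# Start at the last element, step backwards, stop before position -(n + 1)
top_indices = ascending[:-n - 1:-1]
top_words = [vocab[i] for i in top_indices]

print(top_indices)
print(top_words)
```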

The topics look fairly sensible, but is there a more concrete way to assess the quality of this topic model? We can look at the **coherence score** for this task.

In [None]:
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary

# Step 1: Create a Gensim Dictionary and Corpus
texts_tokenized = [text.split() for text in texts_preprocessed]
dictionary = Dictionary(texts_tokenized)
corpus = [dictionary.doc2bow(text) for text in texts_tokenized]

# Step 2: Get the topics from the LDA model
lda_topics = lda.components_
lda_topics_words = [[vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-10 - 1:-1]] for topic in lda_topics]

# Step 3: Calculate Coherence Score
coherence_model_lda = CoherenceModel(topics = lda_topics_words, 
                                     texts = texts_tokenized, 
                                     dictionary = dictionary, 
                                     coherence = 'c_v')

coherence_lda = coherence_model_lda.get_coherence()
print(f'Coherence Score for LDA Model: {coherence_lda}')
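
To build some intuition for what a coherence score rewards, here is a deliberately simplified sketch. It is not Gensim's `c_v` measure, which is more elaborate (it works over sliding windows rather than whole documents, among other things). Instead, the hypothetical `npmi_coherence` function below averages normalised pointwise mutual information (NPMI) over pairs of topic words, counting a pair as co-occurring when both words appear in the same toy document.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over word pairs, using document-level co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents containing all the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never co-occur: lowest possible NPMI
        elif p12 == 1:
            scores.append(1.0)   # always co-occur: limiting value of NPMI
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy tokenised corpus, for illustration only
toy_docs = [
    ["school", "teacher", "student"],
    ["school", "teacher", "budget"],
    ["bus", "train", "fare"],
    ["bus", "train", "road"],
]

coherent = npmi_coherence(["school", "teacher"], toy_docs)
mixed = npmi_coherence(["school", "bus"], toy_docs)
print(f"Coherent topic: {coherent:.2f} | Mixed topic: {mixed:.2f}")
```

Words that tend to appear in the same documents score close to 1, while words that never appear together score -1, so a topic whose top words "belong together" in the corpus gets a higher score.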

Now let's try varying some of the parameters to see which gets us the optimal coherence score. We'll start by adjusting the number of topics.

In [None]:
n_topics_list = [3, 5, 10, 15, 20, 25]
coherence_scores = []

# It should take around 15-30 seconds for each iteration
for n_topics in tqdm(n_topics_list):
        
    lda = LatentDirichletAllocation(n_components = n_topics, random_state = 2024)
    lda.fit(dtm)
    lda_topics = lda.components_
    lda_topics_words = [[vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-10 - 1:-1]] for topic in lda_topics]    
    coherence_model_lda = CoherenceModel(topics = lda_topics_words, 
                                         texts = texts_tokenized, 
                                         dictionary = dictionary, 
                                         coherence = 'c_v')    
    coherence_lda = coherence_model_lda.get_coherence()
    print(f"Number of topics: {n_topics} | Coherence Score: {coherence_lda}")
    coherence_scores.append(coherence_lda)

In [None]:
import matplotlib.pyplot as plt
plt.plot(n_topics_list, coherence_scores)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
# Pass the label as a list; a bare string would be split into one label per character
plt.legend(["coherence_values"], loc='best')
plt.show()
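
Rather than reading the best setting off the chart, the number of topics with the highest coherence can also be picked programmatically. The sketch below uses made-up scores purely for illustration (they are not results from this notebook); in practice you would use the `coherence_scores` list computed above.

```python
n_topics_list = [3, 5, 10, 15, 20, 25]

# Made-up scores for illustration only, NOT results from this notebook
example_scores = [0.41, 0.44, 0.47, 0.46, 0.48, 0.50]

# Position of the highest score, then the matching number of topics
best_index = max(range(len(example_scores)), key=example_scores.__getitem__)
best_n_topics = n_topics_list[best_index]

print(f"Best number of topics: {best_n_topics}")
```

Keep in mind that the highest coherence score is only a guide: it is still worth reading the resulting topics to check that they make sense.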

In [None]:
lda = LatentDirichletAllocation(n_components = 25, random_state = 2024)
lda.fit(dtm)
no_top_words = 10
tf_feature_names = vectorizer.get_feature_names_out()
display_topics(lda, tf_feature_names, no_top_words)

In [None]:
# Step 1: Get the topic distribution for each document
lda_topic_distributions = lda.transform(dtm)

# Step 2: Identify the dominant topic for each document
dominant_topics = np.argmax(lda_topic_distributions, axis = 1)

# Step 3: Apply UMAP to reduce to 2 dimensions
umap_model = umap.UMAP(n_components = 2, random_state = 2024)
lda_2d = umap_model.fit_transform(lda_topic_distributions)

# Step 4: Prepare data for Plotly
df = pd.DataFrame({
    'UMAP1': lda_2d[:, 0],
    'UMAP2': lda_2d[:, 1],
    'Dominant Topic': dominant_topics,
    'Text': hansard_df_cleaned['Speech'].str.slice(0,1000).tolist()  
})

# Step 5: Create custom hover template to control text width
# custom_data is passed as a list of columns, so 'Text' is the first entry of customdata
hover_template = '%{customdata[0]}'

# Limiting the line width by adding line breaks after a specific number of characters (e.g., 50)
df['Text'] = df['Text'].apply(lambda x: '<br>'.join([x[i:i+50] for i in range(0, len(x), 50)]))

# Step 6: Create an interactive plot with Plotly
fig = px.scatter(
    df, x='UMAP1', y='UMAP2',
    color='Dominant Topic',
    custom_data=['Text'],  # Use custom data for hover template
    title='Interactive UMAP Projection of LDA Topic Distributions',
    color_continuous_scale=px.colors.qualitative.Set1
)

# Customize hover template to use our custom text formatting
fig.update_traces(
    hovertemplate=hover_template,
    marker=dict(size=8, opacity=0.7)
)

# Customize layout with specific dimensions
fig.update_layout(
    width = 1200,
    height = 800,
    legend_title_text='Dominant Topic',
    legend = dict(
        itemsizing='constant'
    )
)

# Show plot
fig.show()
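
The hover text above is broken every 50 characters, which can split words in half. As a possible alternative, the sketch below uses Python's built-in `textwrap` module to break lines at word boundaries instead; `wrap_for_hover` is a hypothetical helper name and the sample sentence is made up.

```python
import textwrap


def wrap_for_hover(text, width=50):
    """Wrap text at word boundaries and join the lines with HTML line breaks."""
    return "<br>".join(textwrap.wrap(text, width=width))


snippet = "Mr Chairman, I beg to move that the total sum to be allocated be reduced by $100."
wrapped = wrap_for_hover(snippet, width=30)
print(wrapped)
```

It could be applied to the `Text` column in the same way as the lambda in the cell above, for example with `df['Text'].apply(wrap_for_hover)`.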

Now we try another topic modelling approach: using embeddings with BERTopic. By default, BERTopic clusters the document embeddings with HDBSCAN, a hierarchical density-based clustering algorithm, so we don't have to set the number of topics as a hyperparameter.

In [None]:
from bertopic import BERTopic

# Step 1: Initialize BERTopic
topic_model = BERTopic()

# Step 2: Fit the model to your data
topics, probabilities = topic_model.fit_transform(hansard_df_cleaned['Speech'].tolist())

# Step 3: View the topics
topics_overview = topic_model.get_topic_info()

In [None]:
topics_overview
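
In the overview table, BERTopic reserves topic `-1` for documents that it could not assign to any cluster. The sketch below shows one way to compute the share of such outliers from a list of topic assignments like the `topics` list returned by `fit_transform`; the `outlier_share` helper and the toy assignments are hypothetical.

```python
from collections import Counter


def outlier_share(topic_assignments):
    """Fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topic_assignments)
    return counts[-1] / len(topic_assignments)


# Toy assignments for illustration only
toy_topics = [0, 1, -1, 0, 2, -1, 1, 0]
share = outlier_share(toy_topics)
print(f"Outliers: {share:.0%} of documents")
```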

BERTopic says we have 74 topics, which sounds like a lot compared to what we had previously! Unfortunately, it also seems like 604 speeches (or about 17% of the data) are considered "outliers". Let's use some of the data visualisation tools to get a visual appreciation of the topics.

In [None]:
topic_model.visualize_topics()

It seems like some of these clusters are 

In [None]:
topic_model.visualize_documents(hansard_df_cleaned['Speech'].tolist())

<span style="background-color: #FFFF00">**Class Discussion:** What are your observations about the quality of the topics identified here, versus the topics identified by the LDA model? Are there significant differences, and if so, in what ways?</span>

## 2. NUS SMS Data

In this section we explore the NUS SMS corpus that was released [here](https://github.com/kite1988/nus-sms-corpus), mainly to demonstrate the challenges of analysing Singlish data and how conventional NLP techniques may fail.

### 2.1 Importing the data and doing simple processing

In [None]:
with open("smsCorpus_en_2015.03.09_all.json", 'r') as file:
    sms_corpus = json.load(file)

In [None]:
# Check how many messages there are in this corpus
len(sms_corpus['smsCorpus']['message'])

In [None]:
# Examine the first message
sms_corpus['smsCorpus']['message'][0]

Now we loop over the corpus to extract the ID and text of every SMS into a DataFrame.

In [None]:
sms_corpus_list = []
for message in sms_corpus['smsCorpus']['message']:
    sms_corpus_list.append({
        'ID': message['@id'],
        'Text': message['text']['$']
    })
sms_corpus_df = pd.DataFrame(sms_corpus_list)
sms_corpus_df['Text'] = sms_corpus_df['Text'].astype('str')

In [None]:
sms_corpus_df

<span style="background-color: #FFFF00">**Exercise:** Create two new columns for this dataset:  </span>
* `Word Count` (int): How many words are in the text
* `Polarity` (float): How positive or negative the text is (using `textblob` and `spacy`)

In [None]:
sms_corpus_df['Word Count'].plot.hist()

In [None]:
sms_corpus_df['Polarity'].plot.hist()

<span style="background-color: #FFFF00">**Class Discussion:** Before you ran these plots, what were you expecting? Now after having seen these plots, what are your thoughts? Is this what you had expected, and why?</span>

### 2.2 Topic modelling

We now apply topic modelling to the SMS data to highlight the challenges posed by short texts, on top of the difficulties with Singlish.

In [None]:
# Step 1: Preprocessing (with stopword removal and lemmatization)
texts_preprocessed = [preprocess(text) for text in sms_corpus_df['Text'].astype('str')]

# Step 2: Vectorizing the text data
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(texts_preprocessed)

# Step 3: Applying LDA for Topic Modeling
lda = LatentDirichletAllocation(n_components = 10, random_state = 2024)
lda.fit(dtm)

# Step 4: Extracting and Displaying Topics (reusing display_topics defined in Section 1.3)

no_top_words = 10
tf_feature_names = vectorizer.get_feature_names_out()
display_topics(lda, tf_feature_names, no_top_words)

The topics here look quite bad, but this is unsurprising given how short SMSes are. Topic modelling tends to underperform in these cases. We check this by computing the coherence score as well.

In [None]:
# Step 1: Create a Gensim Dictionary and Corpus
texts_tokenized = [text.split() for text in texts_preprocessed]
dictionary = Dictionary(texts_tokenized)
corpus = [dictionary.doc2bow(text) for text in texts_tokenized]

# Step 2: Get the topics from the LDA model
lda_topics = lda.components_
lda_topics_words = [[vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-10 - 1:-1]] for topic in lda_topics]

# Step 3: Calculate Coherence Score
coherence_model_lda = CoherenceModel(topics = lda_topics_words, 
                                     texts = texts_tokenized, 
                                     dictionary = dictionary, 
                                     coherence = 'c_v')

coherence_lda = coherence_model_lda.get_coherence()
print(f'Coherence Score for LDA Model: {coherence_lda}')

One problem with Singlish is the difficulty in tokenising it correctly. Let's take a look by applying BERT's tokeniser to some of the Singlish texts here.

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

In [None]:
print(sms_corpus_df['Text'][0])
tokenizer.tokenize(sms_corpus_df['Text'][0])

In [None]:
print(sms_corpus_df['Text'][36])
tokenizer.tokenize(sms_corpus_df['Text'][36])
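
To see why Singlish words get fragmented, it helps to look at how WordPiece-style tokenisers split a word: they greedily match the longest piece found in their vocabulary, mark continuation pieces with `##`, and fall back to an unknown token if nothing matches. The sketch below is a minimal illustration of that idea, not BERT's actual implementation; the `wordpiece_tokenize` function and the tiny vocabulary are hypothetical, whereas the real tokeniser has a vocabulary of roughly 30,000 pieces and first lowercases the text and splits on punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary for illustration only
toy_vocab = {"going", "go", "home", "wa", "##lau", "##la", "##u", "lo", "##r"}

for word in ["going", "lor", "walau", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Common English words survive as a single piece, while words absent from the vocabulary are broken into fragments that carry little meaning on their own, which is one reason models pretrained on standard English struggle with Singlish.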