# Text Data Analysis

In [None]:
import json
import pandas as pd
import spacy
from tqdm import tqdm
import nltk
import flair
import gensim
import umap
import numpy as np
import plotly.express as px
import transformers

## 1. Hansard Data

In this section, we explore Hansard data, which consists of speeches and debates made in Singapore's Parliament Chamber and provides a record of parliamentary business and proceedings in a Sitting. Data from Hansard has already been scraped for you, focusing specifically on the Committee of Supply (or Budget) debates for the 14th Parliament (from 2021 to present). We will use this as an opportunity to explore sentiment analysis and topic modelling.

### 1.1 Importing the data and doing simple processing

In [None]:
# Read in the Hansard data
hansard_df = pd.read_csv("Hansard_15th_Parl_COS.csv")

In [None]:
print(hansard_df.shape)
hansard_df.head()

Let's try to enrich this dataset with some useful variables. 

<span style="background-color: #FFFF00; color: #000000">**Exercise:** Create two new columns for this dataset: 
* `Sitting Year` (int): Year in which the speech was given
* `Speech Length` (int): Number of words in the speech </span>

In [None]:
# Your code here


In [None]:
hansard_df.head()

In [None]:
hansard_df['Speech Length'].plot.hist()

### 1.2 Sentiment Analysis

Let's start with applying some sentiment analysis. While most Parliamentary speeches are likely to be quite mild in terms of sentiment, we might be able to identify some more impassioned speeches. Before you proceed, make sure you have both the `spacy` and `spacytextblob` libraries installed.

In [None]:
# Download TextBlob's additional corpora and spaCy's small English model
!python -m textblob.download_corpora
!python -m spacy download en_core_web_sm

In [None]:
from spacytextblob.spacytextblob import SpacyTextBlob

# Initialise the NLP pipeline and add the spacytextblob step to the pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

# Add our texts
texts = hansard_df['Speech']

# This will take about 20-30 seconds to run
sentiment_results = []
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "ner", "parser", "attribute_ruler", "lemmatizer"]):
    sentiment_results.append({
        'Polarity': doc._.blob.polarity,
        'Subjectivity': doc._.blob.subjectivity, 
    })

In [None]:
# Now we append it to our original dataset
sentiment_results_df = pd.DataFrame(sentiment_results)
hansard_df = pd.concat([hansard_df, sentiment_results_df], axis = 1)

Let's do some simple data analysis to understand the distributions of the polarity and subjectivity scores. Before you run the cells below, think carefully about what you expect from the data.

In [None]:
# Polarity scores range from [-1 to 1], with -1 indicating very negative and 1 indicating very positive
hansard_df['Polarity'].plot.hist()

In [None]:
# Subjectivity scores range from [0 to 1], with 0 indicating very objective and 1 indicating very subjective
hansard_df['Subjectivity'].plot.hist()

Unsurprisingly, most texts are neutral and objective. But this may be skewed by how often the Chairman speaks. Let's filter those speeches out and look at the distributions again.

In [None]:
hansard_df_cleaned = hansard_df[hansard_df['Speaker'] != "The Chairman"].reset_index(drop = True).copy()

In [None]:
# Polarity scores range from [-1 to 1], with -1 indicating very negative and 1 indicating very positive
hansard_df_cleaned['Polarity'].plot.hist()

In [None]:
# Subjectivity scores range from [0 to 1], with 0 indicating very objective and 1 indicating very subjective
hansard_df_cleaned['Subjectivity'].plot.hist()

Let's find the most positive speech and the most negative speech! Share your thoughts about the results.

In [None]:
hansard_df_cleaned.loc[hansard_df_cleaned['Polarity'].idxmin()]

In [None]:
hansard_df_cleaned.loc[hansard_df_cleaned['Polarity'].idxmax()]

Both of these speeches seem a bit short, which might explain their extreme polarity scores. Let's plot a scatter plot to highlight the relationship between speech length and polarity.

In [None]:
hansard_df_cleaned.plot.scatter(x = 'Speech Length', y = 'Polarity')

<span style="background-color: #FFFF00; color: #000000">**Class Discussion:** Given that `textblob` is a dictionary-based approach to sentiment analysis, can you think of why longer speeches tend to have less extreme values for positive/negative sentiment?</span>
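
As a starting point for this discussion, here is a minimal sketch of how a dictionary-based scorer behaves. It assumes the scorer averages the scores of the words it recognises; `TOY_LEXICON` and `toy_polarity` are made up for illustration and are not TextBlob's actual lexicon or code.

```python
# Hypothetical word scores for illustration only (not TextBlob's real lexicon)
TOY_LEXICON = {"good": 0.7, "excellent": 1.0, "bad": -0.7, "terrible": -1.0, "sorry": -0.5}

def toy_polarity(text):
    """Average the scores of the words found in the lexicon (0.0 if none match)."""
    scores = [TOY_LEXICON[word] for word in text.lower().split() if word in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

short_text = "sorry"
long_text = "sorry the plan is good but the delay was bad and the outcome excellent"

# A single matched word decides the score of the short text,
# while the long text mixes positive and negative words.
print(toy_polarity(short_text))
print(toy_polarity(long_text))
```

Try adding more sentences to `long_text` and watch how the average moves.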

Now let's try a different approach: using an embedding-based classifier instead! To save time, we will use a small classifier that has already been fine-tuned.

In [None]:
from flair.nn import Classifier
from flair.data import Sentence
tagger = Classifier.load('./flair_sentiment.pt')

In [None]:
# Let's try it out with a random speech
sentence = Sentence(hansard_df_cleaned['Speech'][2])
tagger.predict(sentence)
print(sentence)

In [None]:
sentiment_scores = []

# This should take around 3-5 minutes
for text in tqdm(hansard_df_cleaned['Speech'].tolist()):
    sentence = Sentence(text)
    tagger.predict(sentence)

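    # Flair's score is the confidence in the predicted label, so for NEGATIVE
    # predictions we take 1 - score to put every speech on one 0 (negative) to 1 (positive) scale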
    if sentence.labels[0].value == 'NEGATIVE':
        sentiment_scores.append(1 - sentence.labels[0].score)
    else:
        sentiment_scores.append(sentence.labels[0].score)

In [None]:
hansard_df_cleaned['Sentiment'] = sentiment_scores

In [None]:
hansard_df_cleaned['Sentiment'].plot.hist()

<span style="background-color: #FFFF00; color: #000000">**Class Discussion:** What do you notice about this chart that is different from the `textblob` model results? Why do you think there is such a big difference?</span>

In [None]:
hansard_df_cleaned.plot.scatter(x = 'Speech Length', y = 'Sentiment')

In [None]:
hansard_df_cleaned.loc[hansard_df_cleaned['Sentiment'].idxmin()]['Speech']

### 1.3 Topic modelling

Since Parliamentary debates tend to be quite topic-focused, topic modelling would be a good option for us to better understand the ongoing debates and to get a sense of the priority areas for discussion in Parliament.

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

# Download necessary NLTK data
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
stop_words.update(['also', 'mr', 'chairman', 'beg', 'move'])
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenization
    words = word_tokenize(text.lower())
    
    # Remove punctuation and non-alphabetic tokens
    words = [word for word in words if word.isalpha()]
    
    # Stopword removal and lemmatization
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    
    return ' '.join(words)

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Step 1: Preprocessing (with stopword removal and lemmatization)
texts_preprocessed = [preprocess(text) for text in hansard_df_cleaned['Speech']]

# Step 2: Vectorizing the text data
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(texts_preprocessed)

# Step 3: Applying LDA for Topic Modeling
lda = LatentDirichletAllocation(n_components = 10, random_state = 2024)
lda.fit(dtm)

# Step 4: Extracting and Displaying Topics
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(", ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10
tf_feature_names = vectorizer.get_feature_names_out()
display_topics(lda, tf_feature_names, no_top_words)
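
The slice `[:-no_top_words - 1:-1]` inside `display_topics` can look cryptic. The sketch below reproduces the same idiom on a plain Python list so each step is visible; the weights and vocabulary are made-up values, and `sorted(range(...))` stands in for NumPy's `argsort`.

```python
# Made-up topic-word weights and vocabulary for illustration
weights = [0.1, 0.5, 0.05, 0.3, 0.05]
vocab = ["budget", "school", "road", "teacher", "bus"]

# Indices ordered from smallest to largest weight (what argsort returns)
order = sorted(range(len(weights)), key=lambda i: weights[i])

# Walk backwards from the end and stop after n items:
# this yields the indices of the n largest weights, largest first
n = 2
top_indices = order[:-n - 1:-1]
top_words = [vocab[i] for i in top_indices]

print(order)
print(top_indices)
print(top_words)
```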

The topics look fairly sensible, but is there a more concrete way to assess the quality of this topic model? We can look at the **coherence score** for this task.

In [None]:
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary

# Step 1: Create a Gensim Dictionary and Corpus
texts_tokenized = [text.split() for text in texts_preprocessed]
dictionary = Dictionary(texts_tokenized)
corpus = [dictionary.doc2bow(text) for text in texts_tokenized]

# Step 2: Get the topics from the LDA model
lda_topics = lda.components_
lda_topics_words = [[vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-10 - 1:-1]] for topic in lda_topics]

# Step 3: Calculate Coherence Score
coherence_model_lda = CoherenceModel(topics = lda_topics_words, 
                                     texts = texts_tokenized, 
                                     dictionary = dictionary, 
                                     coherence = 'c_v')

coherence_lda = coherence_model_lda.get_coherence()
print(f'Coherence Score for LDA Model: {coherence_lda}')
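
To build intuition for what a coherence score rewards, here is a deliberately simplified stand-in. It is **not** the `c_v` measure that gensim computes above (which is more involved); it only averages, over every pair of top words, the share of documents containing both words out of the documents containing either. The function name `toy_coherence` and the tiny corpus are assumptions made for illustration.

```python
from itertools import combinations

def toy_coherence(topic_words, docs):
    """Average pairwise document co-occurrence (Jaccard) of a topic's top words."""
    doc_sets = [set(doc) for doc in docs]
    pair_scores = []
    for w1, w2 in combinations(topic_words, 2):
        both = sum(1 for d in doc_sets if w1 in d and w2 in d)
        either = sum(1 for d in doc_sets if w1 in d or w2 in d)
        pair_scores.append(both / either if either else 0.0)
    return sum(pair_scores) / len(pair_scores) if pair_scores else 0.0

# Made-up tokenised documents
docs = [
    ["school", "teacher", "student"],
    ["school", "teacher", "budget"],
    ["bus", "train", "fare"],
    ["bus", "train", "road"],
]

coherent_topic = toy_coherence(["school", "teacher"], docs)  # words always co-occur
mixed_topic = toy_coherence(["school", "bus"], docs)         # words never co-occur
print(coherent_topic, mixed_topic)
```

A topic whose top words rarely share a document scores low, which is the behaviour we want from any coherence measure.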

Now let's try varying some of the parameters to see which gets us the optimal coherence score. We'll start by adjusting the number of topics.

In [None]:
n_topics_list = [3, 5, 10, 15, 20, 25]
coherence_scores = []

# It should take around 15-30 seconds for each iteration
for n_topics in tqdm(n_topics_list):
        
    lda = LatentDirichletAllocation(n_components = n_topics, random_state = 2024)
    lda.fit(dtm)
    lda_topics = lda.components_
    lda_topics_words = [[vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-10 - 1:-1]] for topic in lda_topics]    
    coherence_model_lda = CoherenceModel(topics = lda_topics_words, 
                                         texts = texts_tokenized, 
                                         dictionary = dictionary, 
                                         coherence = 'c_v')    
    coherence_lda = coherence_model_lda.get_coherence()
    print(f"Number of topics: {n_topics} | Coherence Score: {coherence_lda}")
    coherence_scores.append(coherence_lda)

In [None]:
import matplotlib.pyplot as plt
plt.plot(n_topics_list, coherence_scores)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # pass a list; a bare string is split into characters
plt.show()

In [None]:
lda = LatentDirichletAllocation(n_components = 20, random_state = 2024)
lda.fit(dtm)
no_top_words = 10
tf_feature_names = vectorizer.get_feature_names_out()
display_topics(lda, tf_feature_names, no_top_words)

In [None]:
# Step 1: Get the topic distribution for each document
lda_topic_distributions = lda.transform(dtm)

# Step 2: Identify the dominant topic for each document
dominant_topics = np.argmax(lda_topic_distributions, axis = 1)

# Step 3: Apply UMAP to reduce to 2 dimensions
umap_model = umap.UMAP(n_components = 2, random_state = 2024)
lda_2d = umap_model.fit_transform(lda_topic_distributions)

# Step 4: Prepare data for Plotly
df = pd.DataFrame({
    'UMAP1': lda_2d[:, 0],
    'UMAP2': lda_2d[:, 1],
    'Dominant Topic': dominant_topics,
    'Text': hansard_df_cleaned['Speech'].str.slice(0,1000).tolist()  
})

# Step 5: Create custom hover template to control text width
hover_template = '<br>'.join(['%{customdata}'])

# Limiting the line width by adding line breaks after a specific number of characters (e.g., 50)
df['Text'] = df['Text'].apply(lambda x: '<br>'.join([x[i:i+50] for i in range(0, len(x), 50)]))

# Step 6: Create an interactive plot with Plotly
fig = px.scatter(
    df, x='UMAP1', y='UMAP2',
    color='Dominant Topic',
    custom_data=['Text'],  # Use custom data for hover template
    title='Interactive UMAP Projection of LDA Topic Distributions',
    color_continuous_scale=px.colors.qualitative.Set1
)

# Customize hover template to use our custom text formatting
fig.update_traces(
    hovertemplate=hover_template,
    marker=dict(size=8, opacity=0.7)
)

# Customize layout with specific dimensions
fig.update_layout(
    width = 1200,
    height = 800,
    legend_title_text='Dominant Topic',
    legend = dict(
        itemsizing='constant'
    )
)

# Show plot
fig.show()

Now let's try another topic modelling approach: using embeddings with BERTopic. By default, BERTopic clusters document embeddings with HDBSCAN, a hierarchical density-based clustering algorithm, so we don't have to set the number of topics as a hyperparameter.

In [None]:
from bertopic import BERTopic

# Step 1: Initialize BERTopic
topic_model = BERTopic()

# Step 2: Fit the model to your data
topics, probabilities = topic_model.fit_transform(hansard_df_cleaned['Speech'].tolist())

# Step 3: View the topics
topics_overview = topic_model.get_topic_info()

In [None]:
topics_overview

BERTopic says we have 74 topics, which sounds like a lot of topics compared to what we had previously! Unfortunately it also seems like 698 (or about 20% of the data) are considered as "outliers". Let's use some of the data visualisation tools to get a visual appreciation of the topics.

In [None]:
topic_model.visualize_topics()

It seems like some of these clusters overlap a lot. What if we looked at the documents and topics?

In [None]:
topic_model.visualize_documents(hansard_df_cleaned['Speech'].tolist())

<span style="background-color: #FFFF00; color: #000000">**Class Discussion:** What are your observations about the quality of the topics identified here, versus the topics identified by the LDA model? Are there significant differences, and if so, in what ways?</span>

## 2. NUS SMS Data

In this section we explore the NUS SMS corpus that was released [here](https://github.com/kite1988/nus-sms-corpus), mainly to demonstrate the challenges of analysing Singlish data and how conventional NLP techniques may fail.

### 2.1 Importing the data and doing simple processing

In [None]:
with open("smsCorpus_en_2015.03.09_all.json", 'r') as file:
    sms_corpus = json.load(file)

In [None]:
# Check how many messages there are in this corpus
len(sms_corpus['smsCorpus']['message'])

In [None]:
# Examine the first message
sms_corpus['smsCorpus']['message'][0]

Now we loop through the messages to extract all the SMSes.

In [None]:
sms_corpus_list = []
for message in sms_corpus['smsCorpus']['message']:
    sms_corpus_list.append({
        'ID': message['@id'],
        'Text': message['text']['$']
    })
sms_corpus_df = pd.DataFrame(sms_corpus_list)
sms_corpus_df['Text'] = sms_corpus_df['Text'].astype('str')
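
To make the nested structure easier to see, here is a small sketch with two made-up records that follow the same shape as the extraction loop above (`smsCorpus` → `message` → `@id` and `text` → `$`). The record values are assumptions for illustration; the second one shows why casting the text to `str` helps when a message consists only of digits.

```python
import json

# Two hypothetical records mimicking the corpus layout
sample_json = """
{"smsCorpus": {"message": [
    {"@id": 1, "text": {"$": "Bugis oso near wat..."}},
    {"@id": 2, "text": {"$": 12345}}
]}}
"""

sample_corpus = json.loads(sample_json)

rows = []
for message in sample_corpus["smsCorpus"]["message"]:
    rows.append({
        "ID": message["@id"],
        "Text": str(message["text"]["$"]),  # numeric-only messages are parsed as numbers
    })

print(rows)
```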

In [None]:
sms_corpus_df

<span style="background-color: #FFFF00; color: #000000">**Exercise:** Create two new columns for this dataset: 
* `Word Count` (int): Number of words in the text
* `Polarity` (float): How positive or negative the text is (using `textblob` and `spacy`) </span>

In [None]:
# Your code here


In [None]:
sms_corpus_df['Word Count'].plot.hist()

In [None]:
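# Assumes `sentiment_results` now holds one polarity score per SMS, recomputed in the
# exercise cell above (not the Hansard results from Section 1.2, which differ in length)
sms_corpus_df['Polarity'] = sentiment_results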
sms_corpus_df['Polarity'].plot.hist()

<span style="background-color: #FFFF00; color: #000000">**Class Discussion:** Before you ran these plots, what were you expecting? Now after having seen these plots, what are your thoughts? Is this what you had expected, and why?</span>

### 2.2 Topic modelling

Next, we apply topic modelling to the SMS corpus to highlight the challenges of working with short texts, on top of the difficulties with Singlish.

In [None]:
# Step 1: Preprocessing (with stopword removal and lemmatization)
texts_preprocessed = [preprocess(text) for text in sms_corpus_df['Text'].astype('str')]

# Step 2: Vectorizing the text data
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(texts_preprocessed)

# Step 3: Applying LDA for Topic Modeling
lda = LatentDirichletAllocation(n_components = 10, random_state = 2024)
lda.fit(dtm)

no_top_words = 10
tf_feature_names = vectorizer.get_feature_names_out()
display_topics(lda, tf_feature_names, no_top_words)

The topics here look quite bad, but this is unsurprising given how short SMSes are. Topic modelling tends to underperform in these cases. We check this by computing the coherence score as well.

In [None]:
# Step 1: Create a Gensim Dictionary and Corpus
texts_tokenized = [text.split() for text in texts_preprocessed]
dictionary = Dictionary(texts_tokenized)
corpus = [dictionary.doc2bow(text) for text in texts_tokenized]

# Step 2: Get the topics from the LDA model
lda_topics = lda.components_
lda_topics_words = [[vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-10 - 1:-1]] for topic in lda_topics]

# Step 3: Calculate Coherence Score
coherence_model_lda = CoherenceModel(topics = lda_topics_words, 
                                     texts = texts_tokenized, 
                                     dictionary = dictionary, 
                                     coherence = 'c_v')

coherence_lda = coherence_model_lda.get_coherence()
print(f'Coherence Score for LDA Model: {coherence_lda}')

One problem with Singlish is the difficulty in tokenising it correctly. Let's take a look by applying BERT's tokeniser to some of the Singlish texts here.

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

In [None]:
print(sms_corpus_df['Text'][0])
tokenizer.tokenize(sms_corpus_df['Text'][0])

In [None]:
print(sms_corpus_df['Text'][36])
tokenizer.tokenize(sms_corpus_df['Text'][36])
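
To see why unfamiliar Singlish spellings break into fragments, here is a toy tokeniser in the WordPiece style that BERT uses: it repeatedly takes the longest vocabulary entry that matches the start of the remaining word, marking continuation pieces with `##`. The function `toy_wordpiece` and the tiny `TOY_VOCAB` are made up for illustration; BERT's real vocabulary has roughly 30,000 entries.

```python
# A tiny hypothetical vocabulary: "also" is a whole word, "oso" and "wat" are not
TOY_VOCAB = {"also", "near", "o", "##so", "w", "##at"}

def toy_wordpiece(word, vocab):
    """Greedy longest-match-first subword split; returns ['[UNK]'] if no split exists."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

for word in ["also", "oso", "wat", "lah"]:
    print(word, toy_wordpiece(word, TOY_VOCAB))
```

Standard English words survive intact, while their Singlish spellings are cut into pieces that carry little meaning on their own.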

## 3. Introduction to Large Language Models

In this section, we will use an LLM, specifically Google's **Gemini** model (free tier), to summarise speeches and to repeat the topic classification and sentiment analysis tasks from earlier, then compare the results.

### 3.1 Setup

To use Google's Gemini API, you will need to:
1. Go to [Google AI Studio](https://aistudio.google.com) and sign in with your Google account
2. Generate an API key
3. Create a `.env` file in this directory with the line: `GEMINI_API_KEY=your_key_here`

In [None]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

In [None]:
from google import genai

# Create the client with the GEMINI_API_KEY loaded from the .env file
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

# We'll use a Gemini Flash model, which is fast and available on the free tier
MODEL_ID = "gemini-3-flash-preview"

In [None]:
# Quick test — make sure the API is working
response = client.models.generate_content(model=MODEL_ID, contents="Say hello in one sentence.")
print(response.text)

### 3.2 Summarising Parliamentary Speeches

One of the most immediately useful capabilities of LLMs is summarisation. Let's take a long parliamentary speech and ask Gemini to summarise it. Compare how easy this is versus building a custom extractive or abstractive summariser.

In [None]:
# Pick the longest speech in the dataset
long_speech_idx = hansard_df_cleaned['Speech Length'].idxmax()
long_speech = hansard_df_cleaned.loc[long_speech_idx]

print(f"Speaker: {long_speech['Speaker']}")
print(f"Title: {long_speech['Title']}")
print(f"Word count: {long_speech['Speech Length']}")
print(f"\nFirst 500 characters:\n{long_speech['Speech'][:500]}...")

In [None]:
prompt = f"""Summarise the following parliamentary speech in 3-5 bullet points. 
Focus on the key policy issues raised and any questions asked.

Speech:
{long_speech['Speech']}"""

response = client.models.generate_content(model=MODEL_ID, contents=prompt)
print(response.text)

### 3.3 Zero-Shot Topic Classification

Earlier, we used LDA to assign topics to speeches. This required preprocessing, vectorisation, and hyperparameter tuning. With an LLM, we can simply *describe* the topics and ask the model to classify — no training required. This is called **zero-shot classification**.

In [None]:
# Define topic labels based on the 20-topic LDA results from Section 1.3
topic_labels = [
    "Parliamentary procedure / general debate",
    "Community and youth programmes",
    "Arts, sports, and heritage",
    "Neighbourhood and resident issues",
    "Foreign affairs and defence (ASEAN)",
    "Public governance and women's issues",
    "Transport and public roads",
    "Education and schools",
    "HDB housing and rental",
    "Digital services and smart nation",
    "Government procurement and tax",
    "Business, enterprise, and economy",
    "Food, hawkers, and climate/environment",
    "Healthcare and family support",
    "Workers, jobs, and wages",
    "Drugs, prisons, and rehabilitation",
    "Vehicles and driving",
    "SAF, scams, and security",
    "Legal and judicial matters",
    "Electric vehicles and banking",
]

topic_list_str = "\n".join([f"{i}: {label}" for i, label in enumerate(topic_labels)])
print(topic_list_str)

In [None]:
import time

# Sample 3 speeches with reasonable length for classification
sample_df = hansard_df_cleaned[hansard_df_cleaned['Speech Length'] > 100].sample(3, random_state=2024).reset_index(drop=True)

llm_topics = []
for i, row in sample_df.iterrows():
    prompt = f"""Classify the following parliamentary speech into exactly ONE of these topics. 
Respond with ONLY the topic number (0-19), nothing else.

Topics:
{topic_list_str}

Speech:
{row['Speech'][:2000]}"""
    
    response = client.models.generate_content(model=MODEL_ID, contents=prompt)
    llm_topic = response.text.strip()
    llm_topics.append(llm_topic)
    print(f"Speech {i}: LLM says topic {llm_topic} ({topic_labels[int(llm_topic)] if llm_topic.isdigit() else 'INVALID'})")
    time.sleep(5)  # Be polite to the free API
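
A fixed pause works for three speeches, but free-tier rate limits can still cause the occasional failed request on longer runs. One common pattern is to retry with a growing delay. The helper below is a minimal sketch: `call_with_retry` is a hypothetical name, it catches any `Exception` because the exact error class depends on the SDK version, and the `flaky` function only simulates a rate-limited API.

```python
import time

def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Simulated API call that fails twice before succeeding
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

delays = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
print(result, delays)
```

In the notebook, the wrapped call might look like `call_with_retry(lambda: client.models.generate_content(model=MODEL_ID, contents=prompt))`.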

In [None]:
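# Compare with LDA's topic assignments for the same speeches.
# Note: `vectorizer` and `lda` were refitted on the SMS corpus in Section 2.2,
# so we first rebuild the 20-topic Hansard model from Section 1.3 (same random_state).
hansard_texts = [preprocess(text) for text in hansard_df_cleaned['Speech']]
hansard_vectorizer = CountVectorizer()
hansard_dtm = hansard_vectorizer.fit_transform(hansard_texts)
hansard_lda = LatentDirichletAllocation(n_components = 20, random_state = 2024).fit(hansard_dtm)

# Push the sampled speeches through the same preprocessing and vectoriser
sample_texts_preprocessed = [preprocess(text) for text in sample_df['Speech']]
sample_dtm = hansard_vectorizer.transform(sample_texts_preprocessed)
sample_lda_topics = np.argmax(hansard_lda.transform(sample_dtm), axis=1)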

comparison = pd.DataFrame({
    'Speech (first 80 chars)': sample_df['Speech'].str[:80],
    'LDA Topic': [f"{t} ({topic_labels[t]})" for t in sample_lda_topics],
    'LLM Topic': [f"{t} ({topic_labels[int(t)] if t.isdigit() else 'INVALID'})" for t in llm_topics],
})
comparison
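
LLMs do not always follow the "respond with ONLY the topic number" instruction: a reply might read `Topic 12.` or contain a number outside the valid range, and the `isdigit()` check above only covers the first case partially. A slightly more defensive parser could look like the sketch below; `parse_topic` is a hypothetical helper, not part of any library.

```python
import re

def parse_topic(raw_response, n_topics):
    """Return the first integer in the response if it is a valid topic index, else None."""
    match = re.search(r"\d+", raw_response)
    if match is None:
        return None
    topic_idx = int(match.group())
    return topic_idx if 0 <= topic_idx < n_topics else None

print(parse_topic("7", 20))
print(parse_topic("Topic 12.", 20))
print(parse_topic("25", 20))
print(parse_topic("I am not sure", 20))
```

Returning `None` for anything invalid keeps the bad responses visible instead of raising an `IndexError` when looking up `topic_labels`.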

<span style="background-color: #FFFF00; color: #000000">**Exercise:** Try modifying the prompt above — for example, ask the model to also provide a one-sentence justification for its choice. How does adding instructions to the prompt change the output? Does the classification quality improve or worsen?</span>

In [None]:
# Your code here


### 3.4 Sentiment Analysis with an LLM

A key advantage of LLMs over traditional sentiment tools is that they can **explain their reasoning**. TextBlob gives you a number; Flair gives you a label and a score. An LLM can tell you *why* it thinks a speech is positive or negative — which is far more useful for policy analysis.

In [None]:
# Let's analyse sentiment for the same speeches we compared earlier
# Recall: "Do you mind repeating? I am sorry." was rated most negative by TextBlob
test_speeches = [
    ("TextBlob most negative", hansard_df_cleaned.loc[hansard_df_cleaned['Polarity'].idxmin(), 'Speech']),
    ("TextBlob most positive", hansard_df_cleaned.loc[hansard_df_cleaned['Polarity'].idxmax(), 'Speech']),
    ("Flair most negative", hansard_df_cleaned.loc[hansard_df_cleaned['Sentiment'].idxmin(), 'Speech'][:2000]),
]

for label, speech in test_speeches:
    prompt = f"""Analyse the sentiment of this parliamentary speech. 
Provide:
1. A sentiment score from -1.0 (very negative) to 1.0 (very positive)
2. A one-sentence explanation of your reasoning

Format your response exactly as:
Score: [number]
Reason: [explanation]

Speech:
{speech}"""
    
    response = client.models.generate_content(model=MODEL_ID, contents=prompt)
    print(f"=== {label} ===")
    print(f"Speech: {speech[:100]}...")
    print(response.text)
    print()
    time.sleep(1)
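
Because the prompt asks for a fixed `Score:` / `Reason:` format, the reply can be turned back into structured data and stored alongside the TextBlob and Flair scores. The sketch below assumes the model followed the format; `parse_sentiment` is a hypothetical helper, and the sample reply is made up rather than real model output.

```python
import re

def parse_sentiment(response_text):
    """Extract (score, reason) from a 'Score: ... / Reason: ...' reply; None for missing parts."""
    score_match = re.search(r"Score:\s*(-?\d+(?:\.\d+)?)", response_text)
    reason_match = re.search(r"Reason:\s*(.+)", response_text)

    score = float(score_match.group(1)) if score_match else None
    if score is not None:
        score = max(-1.0, min(1.0, score))  # clamp to the range requested in the prompt
    reason = reason_match.group(1).strip() if reason_match else None
    return score, reason

# Made-up reply in the requested format
sample_reply = "Score: -0.2\nReason: The speaker makes a polite request for clarification."
score, reason = parse_sentiment(sample_reply)
print(score)
print(reason)
```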

<span style="background-color: #FFFF00; color: #000000">**Class Discussion:** Compare the LLM's sentiment analysis with TextBlob and Flair's results. Notice how the LLM correctly identifies that "Do you mind repeating? I am sorry." is a neutral/polite request, not a negative statement. What does this tell us about the limitations of dictionary-based and embedding-based approaches for domain-specific text like parliamentary speeches?</span>

### 3.5 Handling Singlish with LLMs

In Section 2, we saw that BERT's tokeniser struggles with Singlish — splitting words like "Bugis" and "oso" into meaningless subword tokens. The LDA topic model also produced poor results on SMS data. LLMs trained on diverse internet text (including forums, social media, and chat) tend to handle colloquial language much better. Let's test this.

In [None]:
# Pick some Singlish-heavy SMS messages
singlish_samples = [
    sms_corpus_df['Text'][0],   # "Bugis oso near wat..."
    sms_corpus_df['Text'][3],   # "Den only weekdays got special price... Haiz..."
    sms_corpus_df['Text'][36],  # "ll go yan jiu too... We can skip ard oso..."
]

for sms in singlish_samples:
    prompt = f"""This is a Singlish SMS message from Singapore. Please:
1. Translate it to standard English
2. Rate the sentiment from -1.0 (very negative) to 1.0 (very positive)

Format your response as:
Translation: [standard English version]
Sentiment: [score]

SMS: {sms}"""
    
    response = client.models.generate_content(model=MODEL_ID, contents=prompt)
    print(f"Original: {sms}")
    print(response.text)
    print()
    time.sleep(1)

In [None]:
# Compare with TextBlob's polarity on the same messages
for sms in singlish_samples:
    doc = nlp(sms)
    print(f"SMS: {sms}")
    print(f"TextBlob Polarity: {doc._.blob.polarity:.3f}")
    print()

<span style="background-color: #FFFF00; color: #000000">**Class Discussion:** How well does the LLM handle Singlish compared to TextBlob and BERT's tokeniser? What are the implications for NLP work in Singapore's multilingual context? Think about scenarios in the public sector where you might encounter Singlish text (e.g. social media feedback, community forums, helpline transcripts).</span>