# Is the SONA worth listening to?

The State of the Nation Address (SONA) is an annual event in the Philippines in which the President reports on the current state of the country and lays out plans for the future. It is highly anticipated and is usually broadcast live on television and radio. The SONA is also very long, however, and can last for several hours, which has led some people to question whether it is worth listening to. In this notebook, we use data from previous SONAs to examine whether it is.



In [1]:
import pandas as pd

df = pd.read_csv("../data/sona-dataset.csv")
df.head()

| Unnamed: 0 | president | date | title | link | venue | session | speech | total_words |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Manuel L. Quezon | November 25, 1935 | Message to the First Assembly on National Defense | http://www.officialgazette.gov.ph/1935/11/25/m... | Legislative Building, Manila | First National Assembly, First Session | Mr. Speaker, gentlemen of the National Assemb... | 4341 |
| 1 | Manuel L. Quezon | June 16, 1936 | On the Country’s Conditions and Problems | http://www.officialgazette.gov.ph/1936/06/16/m... | Legislative Building, Manila | First National Assembly, First Session | Mr. Speaker, Gentlemen of the National Assemb... | 7250 |
| 2 | Manuel L. Quezon | October 18, 1937 | Improvement of Philippine Conditions, Philippi... | http://www.officialgazette.gov.ph/1937/10/18/m... | Legislative Building, Manila | First National Assembly, Second Session | Mr. Speaker, Gentlemen of the National Assemb... | 5774 |
| 3 | Manuel L. Quezon | January 24, 1938 | Revision of the System of Taxation | http://www.officialgazette.gov.ph/1938/01/24/m... | Legislative Building, Manila | First National Assembly, Third Session | Gentlemen of the National Assembly: The state... | 3212 |
| 4 | Manuel L. Quezon | January 24, 1939 | The State of the Nation and Important Economic... | http://www.officialgazette.gov.ph/1939/01/24/m... | Legislative Building, Manila | Second National Assembly, First Session | Gentlemen of the National Assembly: I take pl... | 4826 |


### Split each speech into sentences

In [2]:
from functools import partial

import spacy
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)
nlp = spacy.load("en_core_web_md")

def split_sentences(nlp, text):
    return [sent.text for sent in nlp(text).sents]

func = partial(split_sentences, nlp)
df["sentences"] = df["speech"].parallel_apply(func)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


### Save the sentences

In [None]:
import os


def filter_and_sort_sentences(sentences):
    filtered_sentences = [
        " ".join(s.split()) for s in sentences if len(s.split()) > 10
    ]
    return sorted(
        filtered_sentences,
        key=lambda s: len(s),
        reverse=True,
    )


def write_sentences_to_file(sentences, index, directory="../sentences"):
    # `index` is the position of the speech in `df`; it becomes the file name.
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, f"{index:03}.txt")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences))


for i, speech_sentences in enumerate(df["sentences"]):
    filtered_sentences = filter_and_sort_sentences(speech_sentences)
    write_sentences_to_file(filtered_sentences, i)
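
To make the filtering rule concrete, here is a minimal sketch of the same logic applied to a few toy sentences. The sample sentences are made up for illustration and are not taken from the dataset, and the `min_words` parameter is an assumption added here so the threshold is visible.

```python
def filter_and_sort_sentences(sentences, min_words=10):
    """Keep sentences with more than `min_words` words, longest first."""
    kept = [
        " ".join(s.split())  # collapse runs of whitespace
        for s in sentences
        if len(s.split()) > min_words
    ]
    return sorted(kept, key=len, reverse=True)


# Illustrative sentences only (not from the SONA dataset).
sample = [
    "Mr. Speaker, gentlemen of the National Assembly:",
    "We must build more roads, schools, and hospitals in every province of the country.",
    "The   state of the nation is sound, and our finances have never been in a better condition than today.",
]

result = filter_and_sort_sentences(sample)
for sentence in result:
    print(len(sentence), sentence)
```

The short salutation is dropped because it has fewer than eleven words, and the remaining sentences come out longest first with their whitespace normalized.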

## Exploratory Data Analysis

### Speech length distribution

In [None]:
import plotly.express as px

fig = px.histogram(
    df,
    x="total_words",
    color="president",
    nbins=50,
    title="Distribution of SONA Speech Lengths",
)
fig.show()
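
Word counts are easier to relate to the question of listening time once they are converted to minutes. The sketch below does that for the five speeches shown in the dataset preview. The speaking rate of 130 words per minute is an assumption chosen for illustration, not a value from the dataset, and `estimated_minutes` is a hypothetical helper.

```python
# Word counts copied from the first five rows of the dataset preview.
word_counts = {
    "1935-11-25": 4341,
    "1936-06-16": 7250,
    "1937-10-18": 5774,
    "1938-01-24": 3212,
    "1939-01-24": 4826,
}

# Assumption: an average delivery pace of 130 words per minute.
WORDS_PER_MINUTE = 130


def estimated_minutes(total_words, words_per_minute=WORDS_PER_MINUTE):
    """Convert a word count into an approximate speaking time in minutes."""
    return total_words / words_per_minute


for date, words in word_counts.items():
    print(f"{date}: {words} words, about {estimated_minutes(words):.0f} min")

# Date of the longest speech among the five.
longest = max(word_counts, key=word_counts.get)
```

The estimate ignores applause, pauses, and ad-libbed remarks, so actual delivery times may be noticeably longer.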

### Dominant topics over time

In [None]:
import math

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

df["date"] = pd.to_datetime(df["date"])

min_year = df["date"].dt.year.min()
max_year = df["date"].dt.year.max()
period_length = 10
# Half-open periods [start, end); max_year + 1 keeps the final year in a bin
time_periods = [
    (year, year + period_length)
    for year in range(min_year, max_year + 1, period_length)
]

num_periods = len(time_periods)
cols = 3  # Adjust columns if you want a different layout
rows = math.ceil(num_periods / cols)

# squeeze=False keeps `axes` two-dimensional even with a single row
fig, axes = plt.subplots(
    rows, cols, figsize=(15, 10), squeeze=False
)  # Adjust figsize as needed

for i, (start, end) in enumerate(time_periods):
    ax = axes[i // cols, i % cols]

    # Filter speeches from this time period
    mask = (df["date"].dt.year >= start) & (df["date"].dt.year < end)
    speeches_period = df.loc[mask]
    if speeches_period.empty:
        continue  # Nothing to plot; CountVectorizer fails on empty text

    # Combine all speeches from this period into one text
    text = " ".join(speeches_period["speech"])

    # Generate word frequencies
    vectorizer = CountVectorizer(stop_words="english")
    freqs = vectorizer.fit_transform([text])
    word_freq = dict(
        zip(
            vectorizer.get_feature_names_out(),
            freqs.sum(axis=0).A1,
        )
    )

    # Generate word cloud
    wordcloud = WordCloud(
        width=800,
        height=400,
        max_words=100,
        background_color="white",
    ).generate_from_frequencies(word_freq)

    # Draw the word cloud; `end` is exclusive, so label the last year covered
    ax.imshow(wordcloud.to_image(), interpolation="bilinear")
    ax.set_title(f"{start}-{end - 1}")

# Remove axes ticks and labels, including on unused panels
for ax in axes.flat:
    ax.axis("off")

fig.tight_layout()  # Adjust spacing between subplots
plt.show()
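
The periods used for the word clouds are half-open year ranges: a speech belongs to a period when `start <= year < end`. The sketch below isolates that binning so it can be checked on its own. `make_periods` and `period_of` are hypothetical helpers, the stop value `max_year + 1` is an assumption made here so that the final year always lands in a bin, and the year 1945 is just an example boundary.

```python
def make_periods(min_year, max_year, period_length=10):
    """Return half-open (start, end) year ranges covering min_year..max_year."""
    return [
        (start, start + period_length)
        for start in range(min_year, max_year + 1, period_length)
    ]


def period_of(year, periods):
    """Return the (start, end) period containing `year`, or None."""
    for start, end in periods:
        if start <= year < end:
            return (start, end)
    return None


# 1935 is the first year in the dataset preview; 1945 is a made-up last year
# that falls exactly on a period boundary.
periods = make_periods(1935, 1945)
print(periods)
print(period_of(1945, periods))
```

With a stop value of `max_year` instead of `max_year + 1`, a last year that falls exactly on a boundary would not be assigned to any period.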

## Preprocessing

### Subset the data to only include relevant columns

In [None]:
subset_df = df[["president", "date", "speech"]]
subset_df.head()

### Normalize the speeches

In [None]:
pandarallel.initialize(progress_bar=True)
nlp = spacy.load("en_core_web_sm")


def normalize_speech(nlp, text):
    return " ".join(
        [
            token.lemma_.lower()
            for token in nlp(text)
            if not token.is_stop and token.is_alpha
        ]
    )


cleaned_df = subset_df.copy()
cleaned_df["normalized_speech"] = (
    cleaned_df["speech"].parallel_apply(partial(normalize_speech, nlp))
)

In [None]:
cleaned_df.head()
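
To give a feel for what the normalized text looks like, here is a deliberately simplified stand-in that uses only the standard library. It lowercases the text, keeps alphabetic tokens, and drops stop words, but it does not lemmatize, so it is not equivalent to the spaCy pipeline above. The stop-word list and the example sentence are made up for illustration.

```python
import re

# Tiny illustrative stop-word list; spaCy's built-in list is much larger.
STOP_WORDS = {"the", "of", "and", "is", "a", "in", "to", "our"}


def simple_normalize(text, stop_words=STOP_WORDS):
    """Lowercase, keep alphabetic tokens, drop stop words (no lemmatization)."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)


# Illustrative sentence only (not from the SONA dataset).
example = "The state of the Nation is sound, and our finances improved in 1938."
normalized = simple_normalize(example)
print(normalized)
```

Because there is no lemmatization, inflected forms such as "improved" stay as they are, whereas the spaCy version would reduce them to their lemmas.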