Political Discourse Keyword Analysis

This notebook explores word-usage patterns across 20 speeches by current UN Secretary-General António Guterres, using basic NLP and text-analysis techniques.

In [1]:
# Standard library
import os
import re
from collections import Counter

# spaCy for NLP
import spacy

Loading in the speeches

In [2]:
data_path = "../data/raw/"

speeches = []
for file in sorted(os.listdir(data_path)):
    if file.endswith(".txt"):
        with open(os.path.join(data_path, file), "r", encoding="utf-8") as f:
            speeches.append(f.read())

print(f"Loaded {len(speeches)} speeches")

Loaded 20 speeches


Cleaning the text

In [3]:
cleaned_speeches = []

for speech in speeches:
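    text = speech.lower()  # Lowercase the whole speech
    # Drop everything except a-z and whitespace. This also removes digits,
    # apostrophes, hyphens, and accented letters, so "don't" becomes "dont".
    text = re.sub(r"[^a-z\s]", "", text)
    cleaned_speeches.append(text)

# Preview the first 300 characters of the first cleaned speech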
print(cleaned_speeches[0][:300])

as we enter the new year the world stands at a crossroads


chaos and uncertainty surround us 


division violence climate breakdown and systemic violations of international law


a retreat from the very principles that bind us together as a human family 


people everywhere are asking are leaders e
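The cleaning regex deletes every character outside a-z and whitespace, which joins the two halves of hyphenated words and drops accented letters entirely. The sketch below compares that behaviour with a hypothetical variant that substitutes a space instead. `clean_drop`, `clean_space`, and the sample sentence are illustrative names and data, not part of the notebook, and the variant is only one possible alternative.

```python
import re


def clean_drop(text: str) -> str:
    """Same rule as the notebook: delete anything that is not a-z or whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())


def clean_space(text: str) -> str:
    """Hypothetical variant: replace those characters with a space instead."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Collapse the runs of spaces/tabs the substitution leaves behind
    return re.sub(r"[ \t]+", " ", text).strip()


sample = "Well-being isn't optional; it's a right."
print(clean_drop(sample))   # wellbeing isnt optional its a right
print(clean_space(sample))  # well being isn t optional it s a right
print(clean_drop("António"))  # antnio  (the accented letter is dropped)
```

Neither rule is strictly better: deleting keeps contractions as a single token ("isnt"), while substituting a space keeps hyphenated parts separate but leaves stray one-letter fragments that the later `is_alpha` and stop-word filter may or may not remove.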


Tokenization + stopwords

In [4]:
nlp = spacy.load("en_core_web_sm")  # Load spaCy's small English pipeline
stopwords = nlp.Defaults.stop_words  # spaCy's default English stop-word set (e.g. "the", "is", "at")

all_tokens = []

for speech in cleaned_speeches:
    doc = nlp(speech)  # spaCy tokenizes, tags, and parses the speech
    tokens = [
        token.text  # The token's text as a string
        for token in doc
        if token.is_alpha and token.text.lower() not in stopwords  # Keep alphabetic tokens that are not stop words
    ]
    all_tokens.extend(tokens)  # Add the filtered words to the master list

print(f"Total number of content words: {len(all_tokens)}")

Total number of content words: 3438
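Because `all_tokens` pools every speech into one list, it cannot show whether a word is spread across many speeches or concentrated in a few. A minimal sketch of one way to check this, assuming the filtered tokens were also kept as one list per speech (the notebook does not do this; `document_frequency` and the toy data are hypothetical):

```python
from collections import Counter


def document_frequency(tokens_per_speech):
    """Count, for each word, the number of speeches it appears in at least once."""
    df = Counter()
    for tokens in tokens_per_speech:
        df.update(set(tokens))  # set() so repeats within one speech count once
    return df


# Toy stand-in for per-speech token lists (not real speech data)
toy = [
    ["thank", "peace", "peace"],
    ["thank", "climate"],
    ["thank", "peace"],
]
df = document_frequency(toy)
print(df.most_common(2))  # [('thank', 3), ('peace', 2)]
```

In the notebook this would mean appending each `tokens` list to a separate list inside the loop, alongside the existing `all_tokens.extend(tokens)`.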


Analyzing word frequency

In [6]:
word_counts = Counter(all_tokens)  # Count how often each word occurs
word_counts.most_common(20)  # Retrieve the 20 most frequent words

[('people', 31),
 ('united', 30),
 ('nations', 30),
 ('world', 27),
 ('international', 23),
 ('peace', 20),
 ('years', 20),
 ('year', 18),
 ('humanitarian', 18),
 ('human', 17),
 ('global', 16),
 ('today', 16),
 ('thank', 16),
 ('rights', 16),
 ('communities', 16),
 ('excellencies', 15),
 ('women', 15),
 ('support', 14),
 ('help', 14),
 ('progress', 13)]
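Two follow-ups are suggested by the list above: 'united' and 'nations' have identical counts, which hints that they mostly occur as a pair, and raw counts are hard to compare without the corpus size. The sketch below counts adjacent word pairs and converts counts to a rate per 1,000 tokens. `top_bigrams` and `per_thousand` are hypothetical helpers, and the toy list stands in for `all_tokens`. Note that pairs built from a stop-word-filtered list are adjacent only after filtering, and a pooled list would also pair the last word of one speech with the first word of the next.

```python
from collections import Counter


def top_bigrams(tokens, n=10):
    """Return the n most common pairs of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)


def per_thousand(counts, total):
    """Convert raw counts to occurrences per 1,000 tokens."""
    return {word: round(1000 * count / total, 2) for word, count in counts.items()}


# Toy stand-in for all_tokens (not real speech data)
toy_tokens = ["united", "nations", "peace", "united", "nations", "world"]

print(top_bigrams(toy_tokens, n=1))  # [(('united', 'nations'), 2)]
rates = per_thousand(Counter(toy_tokens), len(toy_tokens))
print(rates["united"])  # 333.33
```

Applied to the figures above, 'people' (31 of 3438 content words) works out to roughly 9 occurrences per 1,000 content words.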