# Project 2: Text Analysis of UN Speeches
by Matt Ring

# Setup

1. Load Packages

In [125]:
## Packages
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd
import numpy as np

2. Import Data

In [126]:
## Data

df = pd.read_csv("data/un_gen_debates_text.csv")

In [127]:
# Display the data
df.sample(10)

Unnamed: 0,session,year,country,country_name,speaker,position,text
1525,36,1981,TZA,"Tanzania, United Republic of",Mr. SALIM,,72.\tLet me at the outset sincerely congratula...
6203,64,2009,DNK,Denmark,Carsten Staur,UN Representative,"At this moment in time, \nmajor economic and e..."
4931,57,2002,MCO,Monaco,Albert,Head of State,﻿First of all I wish to thank the President of...
6541,66,2011,AND,Andorra,Gilbert Saboya Sunye,Minister for Foreign Affairs,"First of all, I \nwould like to avail myself o..."
6923,67,2012,YUG,Yugoslavia,Tomislav Nikolić,President,﻿Your presidency of the\nGeneral Assembly at i...
714,31,1976,CPV,Cabo Verde,Mr. Fortes,,"204. Allow me through you Sir, to express to M..."
1826,38,1983,VEN,"Venezuela, Bolivarian Republic of",ZAMBRANO VELASCO,,"﻿\n\n120.\t It is no mere formality, Sir, for ..."
5874,62,2007,LSO,Lesotho,Archibald Lesao Lehohla,Prime Minister,My delegation associates \nitself with the com...
2754,45,1990,CAF,Central African Republic,SOHAHONG-KOMBET,,﻿\nThe current international backdrop to this ...
1546,37,1982,AUT,Austria,Pahr,,81.\tIt is with great pleasure and \nsatisfact...


3. Clean Data

In [128]:
# Remove newline and tab characters
df["text"] = df["text"].replace({"\n":" ",
                                 "\t":" "}, regex = True)

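As a quick check of the cleaning step, the sketch below applies the same dictionary-based `replace` to a small made-up Series (the sample strings are illustrative, not rows from the corpus). With `regex=True` the keys are treated as patterns, so occurrences inside a string are replaced rather than only whole-cell matches.

```python
import pandas as pd

# Toy strings standing in for speech text (illustrative only)
sample = pd.Series(["72.\tLet me at the outset",
                    "At this moment in time, \nmajor economic"])

# Same call as in the cell above: each key is a pattern matched inside the strings
cleaned = sample.replace({"\n": " ", "\t": " "}, regex=True)

print(cleaned.tolist())
```

Each newline or tab becomes a single space, so a sequence such as `" \n"` leaves two consecutive spaces behind; this does not matter for the word counts built later.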
4. Subset Data

This analysis assesses political transitions in the former Soviet Union (FSU) and other socialist states. I chose this focus after noticing that Yugoslavia still appears in the data as late as 2015. I had originally intended to look at political transitions more generally, but that scope was too broad. I have therefore selected all modern states that are or once were Marxist-Leninist. States that only reference socialism in their constitution, such as India, Portugal, and Algeria, are not included.

Some adjustments need to be made and noted here:
1. Germany is not included, because reunification is difficult to handle when each German state gave separate speeches.
2. Former Yugoslav states will be flagged as such. Yugoslavia (YUG) is recorded until 1991, and the same code represents Serbia from 2001 onwards. The other states are Slovenia, North Macedonia, Bosnia and Herzegovina, Croatia, and Montenegro.
3. Former USSR states will be flagged as such. These are Russia, Armenia, Azerbaijan, Belarus, Estonia, Georgia, Kazakhstan, Kyrgyzstan, Latvia, Lithuania, Moldova, Tajikistan, Turkmenistan, Ukraine, and Uzbekistan.
4. Czechoslovakia (CSK) becomes Czechia (CZE) in the data around 1992. These will be treated as one continuous country.

In [129]:
# Label former Yugoslav states
yugo = ["SVN", "MKD", "BIH", "HRV", "MNE", "YUG"]
df["former_yugoslavia"] = np.where(df["country"].isin(yugo), 1, 0)

In [130]:
# Label former USSR states
ussr = ["RUS", "ARM", "AZE", "BLR", "EST", "GEO", "KAZ", "KGZ", "LVA", "LTU", "MDA", "TJK", "TKM", "UKR", "UZB"]
df["former_ussr"] = np.where(df["country"].isin(ussr), 1, 0)

In [131]:
# Reassign all CSK to CZE (Czechia)
df["country"] = np.where(df["country"] == "CSK", "CZE", df["country"])
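The flagging and recoding steps above can be checked on a toy frame. The sketch below is illustrative only: the four rows are made up, the USSR list is shortened, and `.isin(...).astype(int)` is used as an equivalent of `np.where(mask, 1, 0)`.

```python
import pandas as pd

# Toy frame with ISO codes (illustrative rows only)
toy = pd.DataFrame({"country": ["CSK", "HRV", "EST", "CUB"]})

# Membership flags as 0/1 integers
toy["former_yugoslavia"] = toy["country"].isin(
    ["SVN", "MKD", "BIH", "HRV", "MNE", "YUG"]).astype(int)
toy["former_ussr"] = toy["country"].isin(["EST", "LVA", "LTU"]).astype(int)

# Recode CSK to CZE so both codes are treated as one continuous country
toy["country"] = toy["country"].replace({"CSK": "CZE"})

print(toy)
```

Doing the recode before building the subset matters, because the country list used for filtering contains CZE but not CSK.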

In [132]:
# Create a list of former (and present) socialist states
# Caution: "TUV" is the ISO 3166-1 alpha-3 code for Tuvalu, so any Tuvalu rows are picked up by this list
fss = ["RUS", "CHN", "YUG", "POL", "CUB",
       "AFG", "ALB", "AGO", "ARM", "AZE",
       "BLR", "BEN", "BIH", "BGR", "KHM",
       "COG", "CZE", "EST", "ETH", "GRD",
       "GEO", "HUN", "HRV", "KAZ", "KGZ",
       "LVA", "LTU", "MDA", "MKD", "MNG",
       "MOZ", "ROU", "SOM", "MNE", "SVN",
       "TJK", "TKM", "TUV", "UKR", "UZB",
       "VNM", "YEM", "LAO", "PRK"]

Now, we'll classify countries over time as socialist or not based on [this](https://en.wikipedia.org/wiki/List_of_socialist_states#Marxist%E2%80%93Leninist_states_2) list. Any country not on this list will be researched independently. Republics within the USSR will be classified based on information found [here](https://en.wikipedia.org/wiki/Republics_of_the_Soviet_Union).

In [133]:
df_fss = df.loc[df["country"].isin(fss)]
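The cell above only filters by country code. A year-by-year classification could be sketched as below, assuming a lookup of start and end years per country. The `ml_period` dictionary, the codes `AAA`/`BBB`/`CCC`, and all years are hypothetical placeholders, not researched values.

```python
import pandas as pd

# Hypothetical lookup: country code -> (first, last) year counted as Marxist-Leninist.
# Codes and years are placeholders, NOT researched values.
ml_period = {"AAA": (1950, 1990), "BBB": (1975, 2015)}

toy = pd.DataFrame({"country": ["AAA", "AAA", "BBB", "CCC"],
                    "year": [1985, 1995, 2000, 2000]})

def is_socialist(row):
    """Return 1 if the speech year falls inside the country's placeholder period."""
    start, end = ml_period.get(row["country"], (None, None))
    return int(start is not None and start <= row["year"] <= end)

toy["socialist"] = toy.apply(is_socialist, axis=1)
print(toy)
```

A lookup table keeps the researched dates in one place, which makes them easier to audit than conditions scattered across cells.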

Finally, reset the index so that rows are numbered 0 to n-1. This will be useful later, when the topic weights are joined back to the speeches.

In [135]:
df_fss = df_fss.set_index(np.arange(0, len(df_fss)))
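The renumbering above passes a fresh `np.arange` to `set_index`. A minimal sketch on a made-up three-row frame suggests that `reset_index(drop=True)` gives the same result and is the more common idiom:

```python
import numpy as np
import pandas as pd

# Toy subset whose index still carries the original row labels
sub = pd.DataFrame({"country": ["HUN", "YUG", "EST"]}, index=[18, 187, 1088])

a = sub.set_index(np.arange(0, len(sub)))  # approach used in the cell above
b = sub.reset_index(drop=True)             # idiomatic equivalent

print(list(a.index), list(b.index))
```

Without `drop=True`, `reset_index` would keep the old labels as an extra column named `index`.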

In [136]:
df_fss.sample(10)

Unnamed: 0,session,year,country,country_name,speaker,position,text,former_yugoslavia,former_ussr
954,58,2003,BGR,Bulgaria,Simeon de Saxe-Coburg-Gotha,Prime Minister,"﻿Allow me, at the outset, to congratulate you ...",0,0
435,44,1989,AGO,Angola,Mr . VAN DUNEM,,"﻿ Mr. President, today I have the honour of ad...",0,0
867,56,2001,BEN,Benin,Kolawolé A. Idji,Minister for Foreign Affairs,﻿I would first of all like to express my outra...,0,0
760,53,1998,GRD,Grenada,Robert Millette,UN Representative,My delegation congratulates the Secretary-Gene...,0,0
187,33,1978,YUG,Yugoslavia,Vrhovec,,"﻿90. Mr. President, may I first cordially cong...",1,0
91,29,1974,YEM,Yemen,Mr. Tarcici,,My delegation was very much moved by the sad n...,0,0
18,26,1971,HUN,Hungary,Mr. PETER,,"108. The current, twenty -sixth session of th...",0,0
1479,70,2015,COG,Congo,Mr. Jean-Claude Gakosso,Minister for Foreign Affairs,"His Excellency Mr. Denis Sassou Nguesso, Presi...",0,0
1088,61,2006,EST,Estonia,Mr. Sven JÜRGENSON,Minister for Foreign Affairs,I begin by congratulating Ms. Haya Rashed Al-...,0,1
228,35,1980,MOZ,Mozambique,Chissano,,"﻿First of all, I should like to express, on be...",0,0


In [137]:
len(df_fss)

1512

# LDA Model

In [138]:
# Add stopwords related to institutions, such as UN councils, country names, etc.
# Also including words found across topics: peace, new
# ("development" is not in the list below, so it still appears in the topics)

additional_stop_words = ["international", "united", "nations", "nation", "national", "countries", "country", "world", "states",
                         "council", "government", "people", "peoples", "republic", "general", "security",
                         "economic", "social", "assembly", "peace", "new"]

In [139]:
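# Create a vectorizer
# Wrap the combined stop words in list(): recent scikit-learn versions reject a frozenset here
vec = CountVectorizer(stop_words=list(text.ENGLISH_STOP_WORDS.union(additional_stop_words)))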

In [140]:
# Create dtm
X = vec.fit_transform(df_fss["text"])
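To see what the document-term matrix looks like, the sketch below runs the same two steps on two made-up sentences with a shortened custom stop-word list. The sentences, the names `toy_vec` and `toy_X`, and the three extra stop words are illustrative assumptions.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

docs = ["United Nations support nuclear disarmament",
        "Development cooperation support world"]

# Built-in English list plus a few custom words (shortened, illustrative list)
stops = list(text.ENGLISH_STOP_WORDS.union(["united", "nations", "world"]))

toy_vec = CountVectorizer(stop_words=stops)
toy_X = toy_vec.fit_transform(docs)   # sparse matrix: rows = documents, columns = terms

print(toy_X.shape)
print(sorted(toy_vec.vocabulary_))
```

`CountVectorizer` lowercases text by default, which is why the lowercase stop words also remove "United" and "Nations".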

In [141]:
# Create lda
lda = LatentDirichletAllocation(n_components=7)

In [142]:
# Fit lda
doc_topics = lda.fit_transform(X)

In [143]:
print(f"There are {lda.components_.shape[0]} topics and {lda.components_.shape[1]} words")

There are 7 topics and 29617 words
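The shapes involved are easier to see on a tiny corpus. The sketch below fits a two-topic model on four made-up documents. Note that `random_state=0` is an addition that the cells above do not use; it is set here only so that repeated runs give the same topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["nuclear weapons disarmament treaty", "nuclear disarmament weapons",
        "africa development cooperation", "development africa support"]
toy_X = CountVectorizer().fit_transform(docs)

# random_state is an assumption added for reproducibility
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_X)

print(toy_doc_topics.shape)        # (documents, topics)
print(toy_lda.components_.shape)   # (topics, vocabulary terms)
print(toy_doc_topics.sum(axis=1))  # each row is a distribution over topics
```

Because each row of the document-topic matrix sums to 1, the weights can be read as the share of a speech attributed to each topic.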


# Interpreting Topics

In [144]:
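## Get feature names (vocabulary)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
voc = np.array(vec.get_feature_names_out())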

In [145]:
# Set number of top words you want
n_words=10

# Create a function to extract the top words from voc
# np.argsort sorts ascending, so the reversed slice returns the n_words largest weights first
def imp_words(x):
    return [voc[each] for each in np.argsort(x)[:-n_words-1:-1]]
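The slice `[:-n_words-1:-1]` is compact but easy to misread. A minimal sketch with a made-up five-word vocabulary and one made-up row of topic weights shows what it selects:

```python
import numpy as np

toy_voc = np.array(["africa", "nuclear", "peace", "war", "weapons"])
weights = np.array([0.1, 2.5, 0.3, 1.2, 2.0])  # one made-up topic row

n_top = 3
# argsort is ascending; the slice walks backwards from the end and stops
# after n_top items, giving the indices of the largest weights first
top_idx = np.argsort(weights)[:-n_top - 1:-1]

print(toy_voc[top_idx].tolist())  # ['nuclear', 'weapons', 'war']
```

An equivalent and arguably more readable form is `np.argsort(weights)[::-1][:n_top]`.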

In [146]:
# Use imp_words to extract words with the highest weights from our lda model
words_in_topic = ([imp_words(x) for x in lda.components_])

In [147]:
# Examine words
words_in_topic

[['powers',
  'war',
  'somalia',
  'albania',
  'ethiopia',
  'super',
  'albanian',
  'aggression',
  'europe',
  'state'],
 ['nuclear',
  'relations',
  'soviet',
  'disarmament',
  'weapons',
  'operation',
  'military',
  'war',
  'union',
  'political'],
 ['development',
  'human',
  'rights',
  'global',
  'cooperation',
  'efforts',
  'community',
  'european',
  'support',
  'organization'],
 ['africa',
  'development',
  'community',
  'political',
  'efforts',
  'african',
  'afghanistan',
  'south',
  'democratic',
  'session'],
 ['development',
  'cooperation',
  'community',
  'efforts',
  'global',
  'nuclear',
  'organization',
  'important',
  'regional',
  'political'],
 ['struggle',
  'south',
  'independence',
  'africa',
  'support',
  'aggression',
  'african',
  'regime',
  'imperialism',
  'namibia'],
 ['viet',
  'nam',
  'cambodia',
  'kampuchea',
  'vietnamese',
  'cuba',
  'war',
  'foreign',
  'years',
  'khmer']]

New dataframe of topics

In [148]:
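# Name topics (the order must match the rows of lda.components_)
# The model is fit without a random_state, so topic order can change between runs;
# re-check these labels against words_in_topic after every refit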
topics = ["inter_coop", "liberation_wars", "balkans", "mid_east", "dev_nation_support", "nuclear", "africa"]

In [149]:
cols = ["Topic_" + str(each) for each in topics]
docs = [str(each) for each in range(X.shape[0])]

In [150]:
# Create dataframe of document-topic weights, with documents as rows and topics as columns
df_topics = pd.DataFrame(np.round(doc_topics, 2),
                        columns=cols,
                        index=docs)

In [151]:
# Extract most important topics from those values
imp_topic = np.argmax(df_topics.values, axis=1)
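`np.argmax` with `axis=1` returns a column position for each row, not a label. The sketch below, using made-up weights and hypothetical labels, shows how those positions map back to names:

```python
import numpy as np

labels = ["topic_a", "topic_b", "topic_c"]   # hypothetical topic labels
weights = np.array([[0.62, 0.24, 0.14],
                    [0.00, 0.07, 0.93],
                    [0.30, 0.40, 0.30]])

top_idx = np.argmax(weights, axis=1)         # position of the largest weight per row
top_label = [labels[i] for i in top_idx]

print(top_idx.tolist(), top_label)
```

When two topics tie, `np.argmax` returns the first position, which can matter here because the weights are rounded to two decimals before the maximum is taken.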

In [152]:
df_topics

Unnamed: 0,Topic_inter_coop,Topic_liberation_wars,Topic_balkans,Topic_mid_east,Topic_dev_nation_support,Topic_nuclear,Topic_africa
0,0.62,0.24,0.00,0.00,0.00,0.14,0.00
1,0.00,0.93,0.01,0.00,0.00,0.06,0.00
2,0.07,0.29,0.00,0.08,0.00,0.34,0.22
3,0.00,0.11,0.00,0.00,0.00,0.23,0.66
4,0.00,0.13,0.00,0.00,0.00,0.00,0.87
...,...,...,...,...,...,...,...
1507,0.00,0.11,0.87,0.00,0.00,0.00,0.03
1508,0.00,0.04,0.78,0.00,0.00,0.00,0.18
1509,0.00,0.00,0.63,0.00,0.21,0.00,0.15
1510,0.00,0.00,0.19,0.70,0.00,0.00,0.11


In [153]:
df_topics["top_topic"] = imp_topic

In [154]:
df_topics

Unnamed: 0,Topic_inter_coop,Topic_liberation_wars,Topic_balkans,Topic_mid_east,Topic_dev_nation_support,Topic_nuclear,Topic_africa,top_topic
0,0.62,0.24,0.00,0.00,0.00,0.14,0.00,0
1,0.00,0.93,0.01,0.00,0.00,0.06,0.00,1
2,0.07,0.29,0.00,0.08,0.00,0.34,0.22,5
3,0.00,0.11,0.00,0.00,0.00,0.23,0.66,6
4,0.00,0.13,0.00,0.00,0.00,0.00,0.87,6
...,...,...,...,...,...,...,...,...
1507,0.00,0.11,0.87,0.00,0.00,0.00,0.03,2
1508,0.00,0.04,0.78,0.00,0.00,0.00,0.18,2
1509,0.00,0.00,0.63,0.00,0.21,0.00,0.15,2
1510,0.00,0.00,0.19,0.70,0.00,0.00,0.11,3


In [165]:
# Move the document labels out of the index into an "index" column,
# leaving a default 0..n-1 integer index that lines up with df_fss
df_topics = df_topics.reset_index()
df_topics['index'] = df_topics['index'].astype(str)

Bind topics by column to documents.

In [169]:
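# "level_0" only exists if the reset_index cell above was run more than once;
# errors="ignore" skips any of these helper columns that are missing
df_final = pd.concat([df_fss, df_topics], axis=1).drop(columns=["index", "level_0"], errors="ignore")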
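`pd.concat` with `axis=1` pairs rows by index label, not by position, which is why both frames need the same 0 to n-1 index before binding. A minimal sketch with made-up frames shows what happens otherwise:

```python
import pandas as pd

left = pd.DataFrame({"country": ["HUN", "EST"]})            # integer index 0, 1
right = pd.DataFrame({"w": [0.9, 0.1]}, index=["0", "1"])   # string index "0", "1"

# The labels differ (0 vs "0"), so the result is the union of both indexes, padded with NaN
misaligned = pd.concat([left, right], axis=1)

# With a shared 0..n-1 index the rows pair up as intended
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

Checking that the combined frame has as many rows as the speeches subset is a cheap way to confirm that the two frames lined up.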

In [170]:
df_final.sample(10)

Unnamed: 0,session,year,country,country_name,speaker,position,text,former_yugoslavia,former_ussr,Topic_inter_coop,Topic_liberation_wars,Topic_balkans,Topic_mid_east,Topic_dev_nation_support,Topic_nuclear,Topic_africa,top_topic
791,54,1999,BLR,Belarus,Ural Latypov,Deputy Prime Minister,"Please, Sir, accept my most sincere congratula...",0,1,0.0,0.06,0.14,0.0,0.8,0.0,0.0,4
851,55,2000,POL,Poland,Wladyslaw Bartoszewski,Minister for Foreign Affairs,Allow me first of all to congratulate Mr. Harr...,0,0,0.05,0.05,0.3,0.0,0.59,0.0,0.0,4
139,32,1977,AFG,Afghanistan,ABDULLAH,,﻿212. On behalf of the delegation of the Repu...,0,0,0.0,0.28,0.0,0.59,0.0,0.13,0.0,3
1132,62,2007,EST,Estonia,Toomas Hendrik Ilves,President,I shall speak today on four fundamental topic...,0,1,0.0,0.0,0.96,0.0,0.04,0.0,0.0,2
683,51,1996,HUN,Hungary,László Kovács,Minister for Foreign Affairs,"﻿May I, at the outset, extend the congratulati...",0,0,0.0,0.0,0.78,0.0,0.22,0.0,0.0,2
913,57,2002,BLR,Belarus,Mikhail Khvostov,Minister for Foreign Affairs,"﻿I congratulate you, Sir, on your election to ...",0,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,4
892,56,2001,PRK,"Korea, Democratic People's Republic of",Li Hyong Chol,UN Representative,"﻿I congratulate you, Sir, once again on your e...",0,0,0.0,0.24,0.03,0.57,0.15,0.0,0.0,3
176,33,1978,KHM,Cambodia,Ieng Sary,,﻿90. During the past year the struggle of the ...,0,0,0.01,0.22,0.0,0.0,0.0,0.4,0.37,5
1341,67,2012,AZE,Azerbaijan,Elmar Maharram oglu Mammadyarov,Minister for Foreign Affairs,"﻿At the outset, I would like to congratulate m...",0,1,0.0,0.11,0.73,0.13,0.0,0.02,0.0,2
1344,67,2012,BIH,Bosnia and Herzegovina,Bakir Izetbegović,President,﻿I would like to congratulate President Jeremi...,1,0,0.2,0.0,0.8,0.0,0.0,0.0,0.0,2


# Visualizations

## 1. Defining Each Topic

In [171]:
# In pyLDAvis 3.4 and later this module is named pyLDAvis.lda_model
import pyLDAvis.sklearn
lda_viz = pyLDAvis.sklearn.prepare(lda_model=lda,
                                   dtm=X,
                                   vectorizer=vec,
                                   sort_topics=False)



In [172]:
pyLDAvis.display(lda_viz)


## 2. Bloc Trends/All Countries Shown (by one or all topics)

## 3. Important/Interesting Countries