# Citation

Much of the code and many of the examples are copied or adapted from:

> Blueprints for Text Analytics Using Python by Jens Albrecht, Sidharth Ramachandran, and Christian Winkler (O'Reilly, 2021), ISBN 978-1-492-07408-3.

- https://github.com/blueprints-for-text-analytics-python/blueprints-text
- https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/ch01/First_Insights.ipynb

---

# Setup

In [1]:
%run "../config/notebook_settings.py"

from helpers.utilities import Timer, get_logger
from helpers.text_processing import count_tokens, tf_idf, get_context_from_keyword

def get_project_directory():
    return os.getcwd().replace('/source/executables', '')

print(get_project_directory())

/Users/shanekercheval/repos/nlp-template


In [2]:
with Timer("Loading Data"):
    path = os.path.join(get_project_directory(), 'artifacts/data/processed/un-general-debates-blueprint.pkl')
    df = pd.read_pickle(path)

Started: Loading Data
Finished (1.21 seconds)


---

# Exploratory Data Analysis

This section gives a basic exploration of the dataset: first its summary statistics and non-text columns, then the speech text itself.

## Dataset Summary

In [3]:
hlp.pandas.numeric_summary(df)

| Column | # of Non-Nulls | # of Nulls | % Nulls | # of Zeros | % Zeros | Mean | St Dev. | Coef of Var | Skewness | Kurtosis | Min | 10% | 25% | 50% | 75% | 90% | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| session | 7507 | 0 | 0.0% | 0 | 0.0% | 49.6 | 12.9 | 0.3 | -0.2 | -1.1 | 25 | 31.0 | 39.0 | 51.0 | 61.0 | 67.0 | 70 |
| year | 7507 | 0 | 0.0% | 0 | 0.0% | 1994.6 | 12.9 | 0.0 | -0.2 | -1.1 | 1970 | 1976.0 | 1984.0 | 1996.0 | 2006.0 | 2012.0 | 2015 |
| text_length | 7507 | 0 | 0.0% | 0 | 0.0% | 17967.3 | 7860.0 | 0.4 | 1.1 | 1.8 | 2362 | 9553.8 | 12077.0 | 16424.0 | 22479.5 | 28658.2 | 72041 |
| num_tokens | 7507 | 0 | 0.0% | 0 | 0.0% | 1480.3 | 635.2 | 0.4 | 1.1 | 1.7 | 187 | 793.6 | 1005.5 | 1358.0 | 1848.0 | 2336.4 | 5688 |
| num_bi_grams | 7507 | 0 | 0.0% | 0 | 0.0% | 588.5 | 243.6 | 0.4 | 1.0 | 1.6 | 58 | 321.0 | 408.0 | 544.0 | 726.0 | 912.0 | 2185 |


In [4]:
hlp.pandas.non_numeric_summary(df)

| Column | # of Non-Nulls | # of Nulls | % Nulls | Most Freq. Value | # of Unique | % Unique |
|---|---|---|---|---|---|---|
| country | 7507 | 0 | 0.0% | ALB | 199 | 2.7% |
| country_name | 7507 | 0 | 0.0% | Albania | 199 | 2.7% |
| speaker | 7507 | 0 | 0.0% | `<unknown>` | 5429 | 72.3% |
| position | 7507 | 0 | 0.0% | `<unknown>` | 114 | 1.5% |
| text | 7507 | 0 | 0.0% | `33: May I first convey to our [...]` | 7507 | 100.0% |
| tokens | 7507 | 0 | 0.0% | `['may', 'first', 'convey', 'pr[...]` | 7507 | 100.0% |
| bi_grams | 7507 | 0 | 0.0% | `['first convey', 'albanian del[...]` | 7507 | 100.0% |


---

In [8]:
df['text'].iloc[0][0:1000]

'33: May I first convey to our President the congratulations of the Albanian delegation on his election to the Presidency of the twenty-fifth session of the General Assembly?\n34.\tIn taking up the work on the agenda of the twenty- fifth session of the General Assembly, which is being held on the eve of the twenty-fifth anniversary of the coming into force of the Charter of the United Nations, the peace-loving Member States would have wished to be in a position to present on this occasion some picture of positive and satisfactory activity on the part of the United Nations. The Albanian delegation, for its part, would have taken great pleasure in drawing up such a balance sheet of activities covering a quarter of a century, which is certainly no short period in the life of an international organization. Unfortunately, this is not the situation. Created on the day after victory had been achieved over the Powers of the Rome BerlinTokyo Axis and conceived in the spirit of the principles wh

In [14]:
'|'.join(df['tokens'].iloc[0])[0:1000]

'may|first|convey|president|congratulations|albanian|delegation|election|presidency|twenty-fifth|session|general|assembly|taking|work|agenda|twenty-|fifth|session|general|assembly|held|eve|twenty-fifth|anniversary|coming|force|charter|united|nations|peace-loving|member|states|would|wished|position|present|occasion|picture|positive|satisfactory|activity|part|united|nations|albanian|delegation|part|would|taken|great|pleasure|drawing|balance|sheet|activities|covering|quarter|century|certainly|short|period|life|international|organization|unfortunately|situation|created|day|victory|achieved|powers|rome|berlintokyo|axis|conceived|spirit|principles|predominated|war|antifascist|coalition|organization|awakened|whole|progressive|humanity|hope|would|serve|important|factor|creating|better|international|conditions|order|favor|cause|freedom|peace|world|security|activities|number|events|occurred|world|arena|period|disappointed|hopes|peoples|united|nations|far|contributed|required|fundamental|provisio

In [12]:
'|'.join(df['bi_grams'].iloc[0])[0:1000]

'first convey|albanian delegation|twenty-fifth session|general assembly|twenty- fifth|fifth session|general assembly|twenty-fifth anniversary|united nations|peace-loving member|member states|states would|satisfactory activity|united nations|albanian delegation|part would|taken great|great pleasure|balance sheet|activities covering|short period|international organization|organization unfortunately|situation created|rome berlintokyo|berlintokyo axis|antifascist coalition|organization awakened|progressive humanity|would serve|important factor|creating better|better international|international conditions|freedom peace|world security|world arena|period disappointed|united nations|nations far|fundamental provisions|international peace|liberation struggle|imperialist powers|united states|america foremost|foremost among|path diametrically|diametrically opposed|instrument favoring|pillage oppression|peace-loving peoples|united nations|committing aggression|many parts|frequently helped|direction

## Explore Non-Text Columns

Look for idiosyncrasies in individual columns, e.g. the same speaker represented in multiple ways.

In [None]:
df[df['speaker'].str.contains('Bush')]['speaker'].value_counts()

---

## Explore Text Column

### Top Words Used

In [None]:
count_tokens(df['tokens']).head(20)

---

### Distribution of Text Length

In [None]:
ax = df['text_length'].plot(kind='box', vert=False, figsize=(10, 1))
ax.set_title("Distribution of Text Length")
ax.set_xlabel("# of Characters")
ax.set_yticklabels([])
ax;

In [None]:
ax = df['text_length'].plot(kind='hist', bins=60, figsize=(10, 2))
ax.set_title("Distribution of Text Length")
ax.set_xlabel("# of Characters")
ax;

In [None]:
import seaborn as sns
sns.displot(df['text_length'], bins=60, kde=True, height=3, aspect=3);

In [None]:
where = df['country'].isin(['USA', 'FRA', 'GBR', 'CHN', 'RUS'])
g = sns.catplot(data=df[where], x="country", y="text_length", kind='box')
g.fig.set_size_inches(6, 3)
g.fig.set_dpi(100)
g = sns.catplot(data=df[where], x="country", y="text_length", kind='violin')
g.fig.set_size_inches(6, 3)
g.fig.set_dpi(100)

In [None]:
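# Each (year, country) pair appears only once, so the row count per year
# equals the number of countries that spoke in that year.
assert not df[['year', 'country']].duplicated().any()
df.groupby('year').size().plot(title="Number of Countries");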

In [None]:
df.groupby('year').agg({'text_length': 'mean'}) \
  .plot(title="Avg. Speech Length", ylim=(0,30000));

### Word Frequency

In [None]:
counts_df = count_tokens(df['tokens'])

In [None]:
counts_df.head()
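
The `count_tokens` helper is imported from `helpers.text_processing` and its implementation is not shown in this notebook. As a rough illustration only, the sketch below shows one way corpus-wide token counts could be computed with the standard library. The function name `count_tokens_sketch` and the `min_frequency` parameter are hypothetical, and the real helper returns a DataFrame (with a `frequency` column, judging by its use below) rather than a list of tuples.

```python
from collections import Counter


def count_tokens_sketch(token_lists, min_frequency=1):
    """Count tokens across all documents (hypothetical stand-in for count_tokens).

    token_lists: iterable of lists of tokens, one list per document.
    Returns (token, frequency) pairs sorted by descending frequency.
    """
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Drop rare tokens, then order from most to least frequent.
    return [
        (token, frequency)
        for token, frequency in counter.most_common()
        if frequency >= min_frequency
    ]


example_counts = count_tokens_sketch(
    [["united", "nations", "peace"], ["united", "nations"], ["united"]]
)
print(example_counts)
```

With real data, something like `count_tokens_sketch(df['tokens'])` should work, since iterating over the column yields one token list per speech.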

In [None]:
def plot_wordcloud(frequency_dict):
    """Draw a word cloud from a {token: weight} dictionary."""
    wc = wordcloud.WordCloud(background_color='white',
        #colormap='RdYlGn',
        colormap='tab20b',
        width=round(hlp.plot.STANDARD_WIDTH*100),
        height=round(hlp.plot.STANDARD_HEIGHT*100),
        max_words = 200, max_font_size=150,
        random_state=42
    )
    wc.generate_from_frequencies(frequency_dict)

    fig, ax = plt.subplots(figsize=(hlp.plot.STANDARD_WIDTH, hlp.plot.STANDARD_HEIGHT))
    ax.imshow(wc, interpolation='bilinear')
    #plt.title("XXX")
    plt.axis('off')

In [None]:
plot_wordcloud(counts_df.to_dict()['frequency']);

### TF-IDF

In [None]:
tf_idf_df = tf_idf(
    df=df,
    tokens_column='tokens',
    segment_columns = None,
    min_frequency_corpus=20,
    min_frequency_document=20,
)
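
The `tf_idf` helper is also imported from `helpers.text_processing`, so its exact formula and the meaning of its `min_frequency_*` parameters are not visible here. The sketch below is one plausible corpus-level TF-IDF under stated assumptions: term frequency is summed over all documents, inverse document frequency is `log(N / df)`, and a small constant offset (`idf_offset`, an assumption) keeps tokens that occur in every document from scoring zero. The name `tf_idf_sketch` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_sketch(documents, idf_offset=0.1):
    """Corpus-level TF-IDF (hypothetical stand-in for the tf_idf helper).

    documents: iterable of token lists, one per document.
    Returns a dict mapping token -> tf-idf score.
    """
    documents = list(documents)
    num_documents = len(documents)
    term_frequency = Counter()      # total occurrences across the corpus
    document_frequency = Counter()  # number of documents containing the token
    for tokens in documents:
        term_frequency.update(tokens)
        document_frequency.update(set(tokens))
    return {
        token: count * (math.log(num_documents / document_frequency[token]) + idf_offset)
        for token, count in term_frequency.items()
    }


scores = tf_idf_sketch([["nations", "peace", "nations"], ["nations", "climate"]])
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

In this toy corpus `nations` appears in both documents, so its IDF is only the offset and its score stays low despite being the most frequent token. Terms concentrated in fewer documents are weighted up.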

In [None]:
ax = tf_idf_df.head(30)[['tf-idf']].plot(kind='barh', width=0.99)
ax.invert_yaxis();

In [None]:
plot_wordcloud(tf_idf_df.to_dict()['tf-idf']);

#### Per Year - 1970

In [None]:
tf_idf_per_year = tf_idf(
    df=df,
    tokens_column='tokens',
    segment_columns = 'year',
    min_frequency_corpus=10,
    min_frequency_document=10,
)

In [None]:
# Ordinals that only name the twenty-fifth (1970) session; excluded as uninformative.
stop_words = ['twenty-fifth', 'twenty-five', 'twenty', 'fifth']
tokens_to_show = tf_idf_per_year.query('year == 1970').reset_index()
tokens_to_show = tokens_to_show[~tokens_to_show.token.isin(stop_words)]

In [None]:
ax = tokens_to_show.head(30).set_index('token')[['tf-idf']].plot(kind='barh', width=0.99)
ax.invert_yaxis();

In [None]:
tokens_to_show = tokens_to_show[['token', 'tf-idf']].set_index('token')
tokens_to_show = tokens_to_show.to_dict()['tf-idf']

In [None]:
plot_wordcloud(tokens_to_show);

#### Per Year - 2015

In [None]:
# 'seventieth' only names the 2015 session; excluded as uninformative.
stop_words = ['seventieth']
tokens_to_show = tf_idf_per_year.query('year == 2015').reset_index()
tokens_to_show = tokens_to_show[~tokens_to_show.token.isin(stop_words)]

In [None]:
ax = tokens_to_show.head(30).set_index('token')[['tf-idf']].plot(kind='barh', width=0.99)
ax.invert_yaxis();

In [None]:
tokens_to_show = tokens_to_show[['token', 'tf-idf']].set_index('token')
tokens_to_show = tokens_to_show.to_dict()['tf-idf']

In [None]:
plot_wordcloud(tokens_to_show);

### Keywords in Context

In [None]:
contexts = get_context_from_keyword(
    documents=df[df['year'] == 2015]['text'],
    window_width=50,
    keyword='sdgs', random_seed=42
)
for x in contexts:
    print(x)

In [None]:
contexts = get_context_from_keyword(
    documents=df[df['year'] == 2015]['text'],
    window_width=50,
    keyword='sids', random_seed=42
)
for x in contexts:
    print(x)

In [None]:
contexts = get_context_from_keyword(
    documents=df[df['year'] == 2015]['text'],
    window_width=50,
    keyword='pv', random_seed=42
)
for x in contexts:
    print(x)
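
The cells above rely on `get_context_from_keyword`, whose implementation lives in `helpers.text_processing` and is not shown. A minimal keyword-in-context sketch, assuming the helper finds whole-word, case-insensitive matches and returns `window_width` characters on either side, could look like the following. The function name, the `num_samples` parameter, and the `|keyword|` output format are all assumptions made for illustration.

```python
import random
import re


def keyword_contexts_sketch(documents, keyword, window_width=50,
                            num_samples=10, random_seed=42):
    """Return text snippets surrounding each occurrence of `keyword`.

    Hypothetical stand-in for get_context_from_keyword.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
    contexts = []
    for text in documents:
        for match in pattern.finditer(text):
            # Clamp the window to the bounds of the document.
            start = max(match.start() - window_width, 0)
            end = min(match.end() + window_width, len(text))
            left = text[start:match.start()]
            right = text[match.end():end]
            contexts.append(f"{left}|{match.group()}|{right}".replace("\n", " "))
    # Sample reproducibly when there are more matches than requested.
    if len(contexts) > num_samples:
        contexts = random.Random(random_seed).sample(contexts, num_samples)
    return contexts


example_contexts = keyword_contexts_sketch(
    ["The SDGs are ambitious.", "No match here."], keyword="sdgs", window_width=4
)
print(example_contexts)
```

Looking up short tokens such as `sdgs`, `sids`, or `pv` this way helps decide whether a high-scoring term is a meaningful acronym or a transcription artifact.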

---