# Word2Vec for Rhetorical Analysis

### By generating word embeddings from a domain-specific corpus, one might be able to glean insights about internal rhetorical features. In order to demonstrate this concept, I have utilized a corpus of U.S. embassy tweets to identify which countries are associated with particular words within American diplomatic rhetoric.

In [1]:
import ast
import pandas as pd
import re
import numpy as np
from tqdm import tqdm

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
df_us = pd.read_csv('/content/drive/MyDrive/NLP_stuff/all_usa_tweets.csv')

  exec(code_obj, self.user_global_ns, self.user_ns)


# Generate Custom Word Embeddings with Gensim's Word2Vec tools

In [None]:
import gensim 
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

window = 7

languages = ['en','fr','es','ru']
stopword_lang = ['english','french','spanish','russian']
def w2v(df, language, stopword_lang): 
    df = df[df['lang']==language]
    sentences = list()
    lines = df['full_text'].values.tolist()
    ## remove urls
    lines = [re.sub(r"http\S+", "", x) for x in lines]
    for line in tqdm(lines):
        tokens = word_tokenize(line)
        tokens = [word.lower() for word in tokens]
        table = str.maketrans('','',string.punctuation)
        stripped = [w.translate(table) for w in tokens]
        words = [word for word in stripped if word.isalpha()]
        stop_words = set(stopwords.words(stopword_lang))
        words = [w for w in words if not w in stop_words]
        sentences.append(words)
    EMBEDDING_DIM = 100
    model = gensim.models.Word2Vec(
            sentences = sentences,
            size = EMBEDDING_DIM,
            window = window,
            min_count = 1)
    words = list(model.wv.vocab)
    print("Test Vocabulary size: %d" % len(words))
    return model

In [None]:
american_embassies_w2v_english = w2v(df_us, 'en', 'english')

100%|██████████| 311537/311537 [03:11<00:00, 1625.81it/s]


Test Vocabulary size: 152103


In [3]:
#american_embassies_w2v_english.save("/content/drive/MyDrive/NLP_stuff/us_en_w2v.model")

In [4]:
from gensim.models import Word2Vec
american_embassies_w2v_english = Word2Vec.load("/content/drive/MyDrive/NLP_stuff/us_en_w2v.model")

In [5]:
def word_association_with_countries(word,model,n=10):
    countries = ['america','usa','afghanistan', 'albania', 'algeria', 'angola', 'argentina', 'armenia', 'australia', 'austria', 'azerbaijan', 'bahamas', 'bahrain', 'bangladesh', 'barbados', 'belarus', 'belgium', 'belize', 'benin', 'bolivia', 'bosnia', 'botswana', 'brazil', 'brunei', 'bulgaria', 'burkina', 'burundi', 'cabo', 'cambodia', 'cameroon', 'canada', 'chad', 'chile', 'china', 'colombia', 'comoros', 'congo', 'costa', 'côte', 'croatia', 'cuba', 'cyprus', 'czechia', 'denmark', 'djibouti', 'dominican', 'ecuador', 'egypt', 'elsalvador', 'equatorialguinea', 'eritrea', 'estonia', 'eswatini', 'ethiopia', 'fiji', 'finland', 'france', 'gabon', 'georgia', 'germany', 'ghana', 'greece', 'guatemala', 'guineabissau', 'guinea', 'guyana', 'haiti', 'holysee','vatican','honduras', 'hungary', 'iceland', 'india', 'indonesia', 'iran', 'iraq', 'ireland', 'israel', 'italy', 'jamaica', 'japan', 'jordan', 'kazakhstan', 'kenya', 'northkorea','southkorea', 'kuwait', 'kyrgyzstan', 'lao', 'latvia', 'lebanon', 'lesotho', 'liberia', 'libya', 'lithuania', 'luxembourg', 'madagascar', 'malawi', 'malaysia', 'mali', 'malta', 'mauritania', 'mauritius', 'mexico', 'micronesia', 'moldova', 'mongolia', 'montenegro', 'morocco', 'mozambique', 'myanmar', 'nepal', 'netherlands', 'newzealand', 'nicaragua', 'niger', 'nigeria', 'norway', 'oman', 'pakistan', 'palau', 'palestine', 'panama', 'papua', 'paraguay', 'peru', 'philippines', 'poland', 'portugal', 'qatar', 'romania', 'russia', 'rwanda', 'samoa', 'saudi', 'saudiarabia','senegal', 'serbia', 'seychelles', 'sierra', 'singapore', 'slovakia', 'slovenia', 'somalia', 'spain', 'srilanka', 'sudan', 'suriname', 'sweden', 'switzerland', 'syrian', 'tajikistan', 'tanzania', 'thailand', 'timor', 'togo', 'trinidad', 'tunisia', 'turkey', 'turkmenistan', 'uganda', 'ukraine', 'uae', 'unitedarabemirates', 'unitedstates', 'uruguay', 'uzbekistan', 'venezuela', 'viet', 'yemen', 'zambia', 'zimbabwe']
    country_dict = {}
    for country in countries:
        country_dict[country] = model.wv.similarity(word,country)
    top_dict = {k: v for k, v in sorted(country_dict.items(), key=lambda item: -item[1])}
    return list(top_dict.keys())[:10]

In [6]:
word_association_with_countries('conflict',american_embassies_w2v_english)

['yemen',
 'libya',
 'syrian',
 'afghanistan',
 'sudan',
 'iraq',
 'somalia',
 'ukraine',
 'northkorea',
 'russia']

In [7]:
word_association_with_countries('bad',american_embassies_w2v_english)

['iran',
 'china',
 'russia',
 'northkorea',
 'venezuela',
 'cuba',
 'nicaragua',
 'ukraine',
 'syrian',
 'belarus']

In [8]:
word_association_with_countries('friendship',american_embassies_w2v_english)

['lithuania',
 'latvia',
 'greece',
 'slovakia',
 'luxembourg',
 'iceland',
 'estonia',
 'slovenia',
 'denmark',
 'australia']

# Analysis

### The countries most associated with the word 'conflict' do indeed reflect the countries that American rhetoric most associates with conflict. Even 'bad' produces similar results with the top five (Iran, China, Russia, North Korea, and Venezuala) easily deemed the most salient challengers to the American-led geopolitical order. These results are interesting, but perhaps a better approach would be to search for contextual outliers.

## Mahalanobis Distance for Contextual Outliers

### This approach allows us to identify which terms are 'rhetorical outliers'. In other words, which country is discussed in the 'most different' context relative to other countries.

In [9]:

def mahalanobis(x=None, data=None, cov=None):
    """Compute the Mahalanobis Distance between each row of x and the data  
    x    : vector or matrix of data with, say, p columns.
    data : ndarray of the distribution from which Mahalanobis distance of each observation of x is to be computed.
    cov  : covariance matrix (p x p) of the distribution. If None, will be computed from data.
    """
    x_minus_mu = x - np.mean(data)
    if not cov:
        cov = np.cov(data.values.T)
    inv_covmat = sp.linalg.inv(cov)
    left_term = np.dot(x_minus_mu, inv_covmat)
    mahal = np.dot(left_term, x_minus_mu.T)
    return mahal.diagonal()

import scipy as sp
countries = ['afghanistan', 'albania', 'algeria', 'angola', 'argentina', 'armenia', 'australia', 'austria', 'azerbaijan', 'bahamas', 'bahrain', 'bangladesh', 'barbados', 'belarus', 'belgium', 'belize', 'benin', 'bolivia', 'bosnia', 'botswana', 'brazil', 'brunei', 'bulgaria', 'burkina', 'burundi', 'cabo', 'cambodia', 'cameroon', 'canada', 'chad', 'chile', 'china', 'colombia', 'comoros', 'congo', 'costa', 'côte', 'croatia', 'cuba', 'cyprus', 'czechia', 'denmark', 'djibouti', 'dominican', 'ecuador', 'egypt', 'elsalvador', 'equatorialguinea', 'eritrea', 'estonia', 'eswatini', 'ethiopia', 'fiji', 'finland', 'france', 'gabon', 'georgia', 'germany', 'ghana', 'greece', 'guatemala', 'guineabissau', 'guinea', 'guyana', 'haiti','vatican','honduras', 'hungary', 'iceland', 'india', 'indonesia', 'iran', 'iraq', 'ireland', 'israel', 'italy', 'jamaica', 'japan', 'jordan', 'kazakhstan', 'kenya', 'northkorea','southkorea', 'kuwait', 'kyrgyzstan', 'lao', 'latvia', 'lebanon', 'lesotho', 'liberia', 'libya', 'lithuania', 'luxembourg', 'madagascar', 'malawi', 'malaysia', 'mali', 'malta', 'mauritania', 'mauritius', 'mexico', 'moldova', 'mongolia', 'montenegro', 'morocco', 'mozambique', 'myanmar', 'nepal', 'netherlands', 'newzealand', 'nicaragua', 'niger', 'nigeria', 'norway', 'oman', 'pakistan', 'palau', 'palestine', 'panama', 'paraguay', 'peru', 'philippines', 'poland', 'portugal', 'qatar', 'romania', 'russia', 'rwanda', 'saudi', 'saudiarabia','senegal', 'serbia', 'seychelles', 'sierra', 'singapore', 'slovakia', 'slovenia', 'somalia', 'spain', 'srilanka', 'sudan', 'suriname', 'sweden', 'switzerland', 'syria', 'tajikistan', 'tanzania', 'thailand', 'togo', 'trinidad', 'tunisia', 'turkey', 'turkmenistan', 'uganda', 'ukraine', 'uae', 'emirates', 'unitedstates', 'uruguay', 'uzbekistan', 'venezuela', 'vietnam', 'yemen', 'zambia', 'zimbabwe']
    
def get_mahalanobis_outlier_rank(model, category_list):
    matrix = []
    for item in category_list:
        vector = model.wv.get_vector(item)
        matrix.append(vector)
    matrix = np.vstack(matrix)
    vector_df = pd.DataFrame(matrix)
    vector_df.index = category_list
    columns = list(range(100))
    df_x = vector_df[columns]
    vector_df['mahalanobis'] = mahalanobis(x=df_x, data=vector_df[columns])
    return pd.DataFrame(vector_df['mahalanobis'].sort_values(ascending=False))

In [10]:
rhetorical_outlier_df = get_mahalanobis_outlier_rank(american_embassies_w2v_english,countries)
rhetorical_outlier_df.head(15)

Unnamed: 0,mahalanobis
russia,158.042167
iran,157.218849
china,156.899061
venezuela,155.322582
israel,154.203543
syria,149.254358
ukraine,148.713391
afghanistan,148.081061
belarus,147.238824
sweden,146.593714


## Analysis

### Indeed, we see that America's geopolitical adversaries are discussed in a strikingly different context than most countries. Allies such as Israel and Sweden do appear in the list, but given this approach is mathematical as opposed to normative, we can assume that Sweden and Israel are discussed in atypical contexts.

In [11]:
def word_association_df(maha_df, exclude_words):
  country_list = []
  word_associates = []

  for word in list(maha_df.index):
    b = american_embassies_w2v_english.wv.most_similar([word],topn=30)
    country_list.append(word)
    word_associates.append([x[0] for x in b if x not in exclude_words])
  df = pd.DataFrame()
  df['country'] = country_list
  df['word_associates'] = word_associates
  return df


In [12]:
word_association_df = word_association_df(rhetorical_outlier_df, countries)

In [13]:
word_association_df.head(10)

Unnamed: 0,country,word_associates
0,russia,"[inf, russian, inftreaty, crimea, fielding, br..."
1,iran,"[iranian, behavior, destructive, nuclear, prox..."
2,china,"[beijing, ccp, prc, chinese, streetball, commu..."
3,venezuela,"[venezuelan, estamosunidosve, venezuelans, mad..."
4,israel,"[uae, golan, accords, gaza, normalizing, gulf,..."
5,syria,"[yemen, iraq, tigray, idlib, deescalation, hou..."
6,ukraine,"[territorial, unitedforukraine, crimea, georgi..."
7,afghanistan,"[iraq, afghan, lebanon, taliban, resolutesuppo..."
8,belarus,"[belarusian, referendum, ukraine, ukrainian, o..."
9,sweden,"[usunvie, archelsinki, usembtallinn, usainuk, ..."


In [14]:
american_embassies_w2v_english.wv.most_similar(['russia'],topn=10)

[('inf', 0.789565920829773),
 ('russian', 0.7857728004455566),
 ('inftreaty', 0.7853613495826721),
 ('crimea', 0.7645756006240845),
 ('fielding', 0.7507433891296387),
 ('breach', 0.7346606254577637),
 ('russians', 0.7270146608352661),
 ('provocative', 0.7266160249710083),
 ('kremlin', 0.7258703112602234),
 ('violating', 0.7258059978485107)]

## Visualizations For Word Associations in Corpora

In [21]:
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
import plotly.graph_objects as go
from plotly.offline import plot
import networkx as nx
import numpy as np

def word2vec_network(model, word_list, threshhold=0.5):
    words, vectors = [], []
    for item in word_list:
        try:
            vectors.append(model.wv.get_vector(item))
            words.append(item)
        except:
            print(f'Word {item} not found in vocab.')
    sims = cosine_similarity(vectors, vectors)       
    for i in range(len(vectors)):
        for j in range(len(vectors)):
            if i<=j:
                sims[i, j] = False
    indices = np.argwhere(sims > threshhold)

    G = nx.Graph()

    for index in indices:
        G.add_edge(words[index[0]], words[index[1]], weight=sims[index[0],
                                                                 index[1]])

    weight_values = nx.get_edge_attributes(G,'weight')
    positions = nx.spring_layout(G)
    nx.set_node_attributes(G,name='position',values=positions)
    searches = []
    edge_x = []
    edge_y = []
    weights = []
    ave_x, ave_y = [], []
    for edge in G.edges():
        x0, y0 = G.nodes[edge[0]]['position']
        x1, y1 = G.nodes[edge[1]]['position']
        edge_x.append(x0)
        edge_x.append(x1)
        edge_x.append(None)
        edge_y.append(y0)
        edge_y.append(y1)
        edge_y.append(None)
        ave_x.append(np.mean([x0, x1]))
        ave_y.append(np.mean([y0, y1]))
        weights.append(f'{edge[0]}, {edge[1]}: {weight_values[(edge[0], edge[1])]}')
    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        opacity=0.3,
        line=dict(width=2, color='White'),
        hoverinfo=None,
        mode='lines')
    edge_trace.text = weights
    node_x = []
    node_y = []
    sizes = []
    for node in G.nodes():
        x, y = G.nodes[node]['position']
        node_x.append(x)
        node_y.append(y)
        if node in searches:
            sizes.append(50)
        else:
            sizes.append(15)
    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers+text',
        hoverinfo='text',
        textposition="top center",
        marker=dict(
            showscale=False,
            line=dict(color='White'),
            colorscale='RdBu',
            reversescale=False,
            color=[],
            opacity=0.9,
            size=sizes,
            colorbar=dict(
                thickness=15,
                title='Node Connections',
                xanchor='left',
                titleside='right'
            ),
            line_width=2
        )
    )
    invisible_similarity_trace = go.Scatter(
        x=ave_x, y=ave_y,
        mode='markers',
        hoverinfo='text',
        marker=dict(
            color=[],
            opacity=0,
        )
    )
    invisible_similarity_trace.text=weights
    
    node_adjacencies = []
    node_text = []
    for node, adjacencies in enumerate(G.adjacency()):
        node_adjacencies.append(len(adjacencies[1]))
        node_text.append(adjacencies[0])
    node_trace.marker.color = node_adjacencies
    node_trace.text = node_text
    fig = go.Figure(
        data=[edge_trace, node_trace, invisible_similarity_trace],
        layout=go.Layout(
            title=None,
            template='plotly_dark',
            titlefont_size=20,
            showlegend=False,
            coloraxis=None,
            hovermode='closest',
            margin=dict(b=20,l=20,r=20,t=40),
            annotations=[
                dict(
                    text='Word Associations',
                    showarrow=False,
                    xref="paper", yref="paper",
                    x=0.005, y=-0.002 ) 
            ],
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
        )
    )
    #fig.update_coloraxes(showscale=False)
    #fig.update_layout(showscale=False, showlegend=False)

    return fig.show()

In [16]:
#word_list = list(american_embassies_w2v_english.wv.index2entity[:10])
countries = ['afghanistan', 'albania', 'algeria', 'angola', 'argentina', 'armenia', 'australia', 'austria', 'azerbaijan', 'bahamas', 'bahrain', 'bangladesh', 'barbados', 'belarus', 'belgium', 'belize', 'benin', 'bolivia', 'bosnia', 'botswana', 'brazil', 'brunei', 'bulgaria', 'burkina', 'burundi', 'cabo', 'cambodia', 'cameroon', 'canada', 'chad', 'chile', 'china', 'colombia', 'comoros', 'congo', 'costa', 'côte', 'croatia', 'cuba', 'cyprus', 'czechia', 'denmark', 'djibouti', 'dominican', 'ecuador', 'egypt', 'elsalvador', 'equatorialguinea', 'eritrea', 'estonia', 'eswatini', 'ethiopia', 'fiji', 'finland', 'france', 'gabon', 'georgia', 'germany', 'ghana', 'greece', 'guatemala', 'guineabissau', 'guinea', 'guyana', 'haiti', 'holysee','vatican','honduras', 'hungary', 'iceland', 'india', 'indonesia', 'iran', 'iraq', 'ireland', 'israel', 'italy', 'jamaica', 'japan', 'jordan', 'kazakhstan', 'kenya', 'northkorea','southkorea', 'kuwait', 'kyrgyzstan', 'lao', 'latvia', 'lebanon', 'lesotho', 'liberia', 'libya', 'lithuania', 'luxembourg', 'madagascar', 'malawi', 'malaysia', 'mali', 'malta', 'mauritania', 'mauritius', 'mexico', 'micronesia', 'moldova', 'mongolia', 'montenegro', 'morocco', 'mozambique', 'myanmar', 'nepal', 'netherlands', 'newzealand', 'nicaragua', 'niger', 'nigeria', 'norway', 'oman', 'pakistan', 'palau', 'palestine', 'panama', 'paraguay', 'peru', 'philippines', 'poland', 'portugal', 'qatar', 'romania', 'russia', 'rwanda', 'samoa', 'saudi', 'saudiarabia','senegal', 'serbia', 'seychelles', 'sierra', 'singapore', 'slovakia', 'slovenia', 'somalia', 'spain', 'srilanka', 'sudan', 'suriname', 'sweden', 'switzerland', 'syrian', 'tajikistan', 'tanzania', 'thailand', 'timor', 'togo', 'trinidad', 'tunisia', 'turkey', 'turkmenistan', 'uganda', 'ukraine', 'uae', 'unitedstates', 'uruguay', 'uzbekistan', 'venezuela', 'vietnam', 'yemen', 'zambia', 'zimbabwe']
word_associations = [word[0] for word in american_embassies_w2v_english.wv.most_similar(['russia'],topn=20)]

word2vec_network(american_embassies_w2v_english, countries, threshhold=0.4)

### The image above shows the rhetorical outliers in the U.S. diplomatic tweet corpus. It's evident from this visualization that some countries are discussed in atypical contexts. 

In [17]:
def word2vec_word_association_network(word, model):
    word = word.lower()
    word_associations = [word[0] for word in model.wv.most_similar([word],topn=40)]
    return word2vec_network(model, word_associations, threshhold=0.7)

In [19]:
word2vec_word_association_network('china', american_embassies_w2v_english)

In [20]:
word2vec_word_association_network('russia', american_embassies_w2v_english)

# Conclusion

### Indeed, while observing the vocabulary that is associated with country names like 'Russia' or 'Iran', one sees a variety of words indicating negative sentiment: breach, provocative, violating, destructive, etc. As such, it appears that this word-embedding approach to rhetorical analysis is useful for high-volume data such as entire corpora. Other approaches include creating embeddings for entire tweet streams via models like BERT.