# Convergence as Entropy Minimization Across Lexico-Semantic Choice

### 1 A Theory of Diachronic Language Shift within Groups

Linguistic alignment and convergence both describe the tendency of group members to settle on similar ways of discussing a topic.

"Similar ways" can be expressed as a reduction in the entropy between utterances made by group members. As group members A and B come to sound more alike, more of the semantic content of A's utterance can be recovered from the lexical items in B's utterance, because A's message becomes easier to predict just by listening to B. In other words, because they use similar language to say the same thing, the predictability of one utterance given the other increases, and entropy (i.e., how unpredictable one utterance is given observations of the other) decreases.

Similarity in the precise lexico-semantic meaning of two words can be measured using contextual word embeddings. Models like BERT provide contextually informed word embeddings.

The following is a validation on toy data that there is indeed a difference in entropy between messages written by members of the same group and messages written by members of a different group. We validate this on a subset of 40 post titles discussing COVID-19, evenly sampled from subreddits "r/covidiots" and "r/conspiracy". No pre-processing was performed on the data.

#### 1.1 Generating initial data

In [None]:
from data.corpus import *

#### 1.2 Word vector representations

$$ E_{xi} = wv(i \in x) $$

In [None]:
# Shell out to the vectorization script (IPython shell escape)
!python3 ./data/extremismVecsv5.py

#### 1.3 Probability based on word vectors

Imagine that an interlocutor is playing a kind of language reconstruction game. The interlocutor is given a single utterance from an individual, broken up into tokens. The interlocutor is then given a set of utterances, also broken up into tokens, taken from a number of members of some group. The interlocutor is asked to take the group's tokens and reconstruct an utterance that means something similar to the utterance observed from the individual. This process can be repeated for the same original utterance using tokens from several different groups. In this scenario, reconstructed utterances that are more similar in meaning to the original utterance will have lower entropy. Reconstructed utterances that are either less similar or less intelligible will have higher entropy.

We operationalize this language game by calculating entropy for utterances using BERT word vectors (Devlin et al., 2019) to represent each token. This allows us to capture similarity between tokens that are semantically similar but are not a 1:1 mapping of the same word. Let $E_{xi}$ be the set of BERT word vectors for each token $w_i$ in a sentence $x$.

$$E_{xi} = BERT(w_i \in x)$$

The probability that two words are semantically similar to one another based on their word vectors is a function of their location in vector space (Devlin et al., 2019; Mikolov et al., 2013; Pennington et al., 2014). If a word vector is a point in space, words that are more semantically related to one another will be closer to one another. We use cosine similarity (CoS) to calculate the proximity between word vectors. The probability of two word vectors meaning the same thing can then be thought of in the following way: if word vectors place semantically similar words closer together in space, the probability that a token $i$ from a sentence $x$ is semantically similar to a token $j$ from a sentence $y$ can be thought of colloquially as how likely you are to hit $xi$ if you were to throw a dart at $yj$. We quantify this intuition about probability and vector space in equation 1 using a Gaussian distribution with a location parameter $\mu=1$ and a scale parameter $\sigma$, such that as the CoS value for the comparison of two word vectors approaches 1 we have maximum confidence that the two words mean the same thing.

$$P(E_{xi} | E_{yj}) = P_{\mathcal{N}}\left( CoS(E_{xi},E_{yj}) \bigg|  \mu=1, \sigma \right)$$

Think of $\sigma$ like the accuracy of the dart thrower, where lower $\sigma$ values equate to the dart thrower only hitting a word/token $xi$ if it is very close to $yj$ in word vector space.
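
The effect of $\sigma$ in equation 1 can be sketched numerically. This is a minimal illustration rather than the notebook's own code: it uses NumPy in place of `torch.distributions.Normal`, the function name `gaussian_log_density` is hypothetical, and the cosine values are made up. Note that the result is a (log) density, not a probability mass, so it can exceed 0 near CoS = 1.

```python
import numpy as np

def gaussian_log_density(cos, mu=1.0, sigma=0.3):
    """Log density of cosine-similarity values under N(mu, sigma)."""
    cos = np.asarray(cos, dtype=float)
    return -0.5 * ((cos - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

# Illustrative cosine similarities: identical, close, and distant tokens
cos = np.array([1.0, 0.9, 0.5])

tight = gaussian_log_density(cos, sigma=0.1)  # an accurate dart thrower
loose = gaussian_log_density(cos, sigma=0.3)  # a less accurate dart thrower

# How much the log density falls between CoS = 1.0 and CoS = 0.5
drop_tight = tight[0] - tight[2]
drop_loose = loose[0] - loose[2]
print(drop_tight, drop_loose)
```

With the smaller $\sigma$ the density falls off much faster as CoS moves away from 1, which is the "accurate dart thrower" behaviour described above.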

However, we almost never have a reason to compare a single vector $xi$ from one sentence to a single vector $yj$ from another sentence/distribution. Instead, it’s better to ask how likely a vector $xi$ is, conditioned on what we know about the total distribution $y$, whose tokens are indexed by $j$ ($j \in y$). A priori, one way of posing this question is by asking “when we compare $xi$ to the entirety of the distribution $y$, which token $j \in y$ returns the maximum likelihood for $xi$, and what is the probability of $xi$ conditioned on that token?” We thus rewrite equation 1 as follows:


$$P(E_{xi} | E_{y}) = P_{\mathcal{N}} \left( \max_{j} \left(CoS(E_{xi},E_{yj}) \right) \bigg|  \mu=1, \sigma \right)$$
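
The maximum over $j$ can be sketched as a row-wise maximum of a cosine-similarity matrix. This is a hedged illustration under stated assumptions: random vectors stand in for BERT embeddings, NumPy stands in for torch, and `max_cosine_per_token` is a hypothetical helper name.

```python
import numpy as np

def max_cosine_per_token(Ex, Ey):
    """For each row (token) of Ex, return its highest cosine similarity to any row of Ey."""
    Ex = Ex / np.linalg.norm(Ex, axis=-1, keepdims=True)
    Ey = Ey / np.linalg.norm(Ey, axis=-1, keepdims=True)
    cos = Ex @ Ey.T            # shape: (tokens in x, tokens in y)
    return cos.max(axis=-1)    # best-matching token j for each token i

rng = np.random.default_rng(0)
Ex = rng.normal(size=(4, 8))                        # 4 stand-in token vectors for x
Ey = np.vstack([Ex[:1], rng.normal(size=(5, 8))])   # y shares its first token with x

best = max_cosine_per_token(Ex, Ey)
print(best)  # the first entry is ~1 because that token also occurs in y
```

The resulting vector of best matches is what gets passed to the Gaussian density in place of a single pairwise CoS value.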

In [4]:
import torch
import pandas as pd
import numpy as np

s_col = 'subreddit_name'

def vec_from_string(dfi):
    return torch.FloatTensor(
        [
            [
                float(i) for i in vec.replace('[', '').replace(']','').split(', ')
            ] for vec in dfi['vecs'].values
        ]
    )

df = pd.read_table("/Volumes/ROY/comp_ling/datasci/intergroupEntropy/data/senators458/vecs_538senators-tweets.tsv")
df['party'] = df['party'].replace({'I': 'D'})
print(list(df))
print(df['party'].value_counts())

Eu = vec_from_string(df)
Eu.shape

['__id', 'created_at', 'url', 'replies', 'retweets', 'favorites', 'user', 'bioguide_id', 'party', 'state', 'tokens', 'vecs']
D    12339
R    11124
Name: party, dtype: int64


torch.Size([23463, 1536])

Because of the number of comparisons involved, I had to break the algorithm up. Below we calculate the cosine similarity (CoS) of each vector with every other vector in the dataset.

In [5]:
from shared.mod.mutual_information.RQA import *

H = hRQA()
H.streamCOS(Eu,Eu)
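
One way to break such a computation up is to process the rows in chunks. The sketch below is an assumption about the general idea only: it is not the implementation of `hRQA.streamCOS`, the name `chunked_cosine` and the `chunk_size` parameter are hypothetical, and random vectors stand in for the embeddings.

```python
import numpy as np

def chunked_cosine(Ex, Ey, chunk_size=1024):
    """Cosine similarity of every row of Ex with every row of Ey, computed chunk by chunk."""
    Ex = Ex / np.linalg.norm(Ex, axis=-1, keepdims=True)
    Ey = Ey / np.linalg.norm(Ey, axis=-1, keepdims=True)
    blocks = []
    for start in range(0, Ex.shape[0], chunk_size):
        # Only chunk_size rows are multiplied against Ey at a time
        blocks.append(Ex[start:start + chunk_size] @ Ey.T)
    return np.concatenate(blocks, axis=0)

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 6))
M_chunked = chunked_cosine(E, E, chunk_size=3)

# Reference: the same matrix computed in one shot
En = E / np.linalg.norm(E, axis=-1, keepdims=True)
M_full = En @ En.T
```

Chunking bounds the size of each intermediate product; the output in this sketch is still the full similarity matrix.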

We then save the result to make later calculations faster.

In [7]:
torch.save(
    {
        'M':H.M,
        'df': df[[i for i in list(df) if i not in ['vecs','vec']]],
    },
    "/Volumes/ROY/comp_ling/datasci/intergroupEntropy/data/senators458/gun-control-ckpt.pt")

#### 1.4 Entropy across sentences using probability based on their component word vectors

Meanwhile, the probability that an individual's message $x$ exhibits convergence with a known, historical individual or member $h$ of a group $H$ ($h \in H$), given their message history, depends on (1) how much of $x$'s history, per token, is recoverable from $h$'s history, which is expressed by (2) the amount of information relayed by the lexical item $hj$ most closely related to each token $xi$.

$$H( x ; y ) = -\sum_i P(E_{xi}|E_{y}) \log P(E_{xi}|E_{y})$$
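
The sum over tokens can be sketched as follows. This is a minimal illustration, assuming $\sigma = 0.3$ and made-up best-match cosine values; `utterance_entropy` is a hypothetical name, and NumPy stands in for torch.

```python
import numpy as np

def utterance_entropy(best_cos, mu=1.0, sigma=0.3):
    """-sum_i P_i * log P_i, where P_i is the Gaussian density of each token's best CoS."""
    best_cos = np.asarray(best_cos, dtype=float)
    log_p = -0.5 * ((best_cos - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return float(-(np.exp(log_p) * log_p).sum())

# Illustrative best-match similarities for a three-token utterance
near = utterance_entropy([0.95, 0.90, 0.92])  # tokens closely matched in y
far = utterance_entropy([0.60, 0.55, 0.65])   # tokens only loosely matched in y
print(near, far)
```

Because the Gaussian density can exceed 1, individual terms (and the total) can be negative, so the values are best read comparatively: in this example the closely matched utterance yields the lower value.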

In the diachronic case, a message $x$ can only be similar to messages in $h$ that were written prior to $x$. Thus, let $\tau_{x}$ be the time of message $x$, and let $E_{hj}$ contain only tokens pulled from messages occurring prior to $\tau_x$--i.e.

$$E_{hj} = \bigg\lbrace wv(j \in h)\delta_{\tau_j < \tau_x},\ \ldots,\ wv(n \in h)\delta_{\tau_n < \tau_x}  \bigg\rbrace$$
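
The indicator $\delta_{\tau_j < \tau_x}$ amounts to masking out history tokens that do not precede $x$ before taking the maximum. A small sketch, with made-up similarities and timestamps and a hypothetical helper `prior_tokens_mask`:

```python
import numpy as np

def prior_tokens_mask(history_times, tau_x):
    """True for history tokens whose message time is strictly before tau_x."""
    return np.asarray(history_times) < tau_x

# Rows: tokens of message x; columns: tokens from h's message history
cos = np.array([[0.20, 0.90, 0.99],
                [0.70, 0.30, 0.95]])
history_times = np.array([1.0, 2.0, 5.0])  # time of the message each history token came from
tau_x = 3.0                                # time of message x

mask = prior_tokens_mask(history_times, tau_x)
# Best match per token of x, restricted to tokens written before x.
# (If no history token precedes x, the maximum is undefined and needs separate handling.)
best_prior = cos[:, mask].max(axis=-1)
print(best_prior)
```

The third history token has the highest similarity to both tokens of $x$, but it is excluded because it was written after $x$.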

First, I loaded the data from the checkpoint described above.

In [1]:
import torch
import pandas as pd
import numpy as np

s_col = 'party'
ckpt = torch.load("/Volumes/ROY/comp_ling/datasci/intergroupEntropy/data/senators458/gun-control-ckpt.pt")

M, df = ckpt['M'], ckpt['df']
ids = df['__id'].values
df['party'] = df['party'].replace({'I': 'D'})
print(df[s_col].value_counts(),'\n\n')
del ckpt

D    12339
R    11124
Name: party, dtype: int64 




In [2]:
print(list(df))
for subreddit in df[s_col].unique():
    print('{} \t {}'.format(subreddit, len(df['__id'].loc[df[s_col].isin([subreddit])].unique())))

['__id', 'created_at', 'url', 'replies', 'retweets', 'favorites', 'user', 'bioguide_id', 'party', 'state', 'tokens']
D 	 334
R 	 289


In [3]:
comm = np.array([df[s_col].loc[df['__id'].isin([idx])].unique()[0] for idx in np.unique(ids)])

#Establish parameters for Gaussian
d = torch.distributions.Normal(1,.3,validate_args=False)

#### 1.5 Assessing entropic differences between groups

We use Monte Carlo sampling for two conditions: within group and outside of group.

We break down our hypotheses as follows:

- (H1) There is statistically significant, lower entropy between any sampled set of comments posted within a subreddit A comprised of like-minded individuals compared to any sampled set of comments posted to a subreddit B in opposition to A.
- (H0) There is no statistically significant difference in entropy between two diametrically opposed subreddits.

To test this, picture the following "game". Pretend that for each comment in our dataset you randomly sample $N$ comments from the same subreddit A and $N$ comments from the opposing subreddit B. We then calculate the entropy between our comment and each comment in the random sample from A, as well as the entropy between our comment and each comment in the random sample from B. We then take the mean entropy between our original comment and the comments from A, and between our original comment and the comments from B, separately for comparison. We repeat this process a number of times. Hypothesis H1 is validated if in greater than 5\% ($\alpha$) of cases there is lower mean entropy for our initial comment and sample A than there is for our initial comment and sample B. This test is significant if the distribution of means with A and the distribution of means with B are significantly different. Both of these conditions must be true in order to validate H1.
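
The sampling game can be sketched on synthetic data. Everything here is an assumption made for illustration: the pairwise entropy matrix `H` is simulated (lower within group by construction) rather than computed from embeddings, and `sample_mean_entropies` is a hypothetical helper, not the procedure used in the cells below.

```python
import numpy as np

def sample_mean_entropies(H, groups, i, n, rng):
    """Mean entropy between item i and n sampled items from its own group vs. the other group."""
    not_self = np.arange(len(groups)) != i
    same = np.flatnonzero(not_self & (groups == groups[i]))
    other = np.flatnonzero(groups != groups[i])
    a = rng.choice(same, size=n, replace=False)   # sample from the same group (A)
    b = rng.choice(other, size=n, replace=False)  # sample from the opposing group (B)
    return H[i, a].mean(), H[i, b].mean()

rng = np.random.default_rng(0)
groups = np.array(["A"] * 30 + ["B"] * 30)

# Simulated pairwise entropies: centred on 4.0 within group and 5.0 across groups
same_group = groups[:, None] == groups[None, :]
H = np.where(same_group, 4.0, 5.0) + rng.normal(0, 0.5, size=(60, 60))

# Repeat the game for 200 randomly chosen comments
results = np.array([sample_mean_entropies(H, groups, i, n=10, rng=rng)
                    for i in rng.integers(0, 60, size=200)])
share_lower_within = (results[:, 0] < results[:, 1]).mean()
print(share_lower_within)
```

The two columns of `results` are the within-group and out-group means that would then be compared, both by the share of cases with lower within-group entropy and by a significance test on the two distributions.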

### A Permutation test from sampled message histories

In [4]:
# (0) negative surprisal
# Hxy = d.log_prob(M)

# (1) monte carlo procedure
N_permutations, xsize, ysize = 1000, 20, 20

sample_sets = np.array([[j for j in np.unique(ids) if j != i] for i in np.unique(ids)])
comm_status = comm[sample_sets]
uids = np.array([df['user'].loc[df['__id'].isin([idx])].unique()[0] for idx in np.unique(ids)])
uidsset = uids[sample_sets]


ML, MR = [],[]
for permutation in range(N_permutations):

    xml = np.random.choice(np.unique(ids)[comm=='D'], size=(xsize,), replace=False)
    xmr = np.random.choice(np.unique(ids)[comm=='R'], size=(xsize,), replace=False)

    idxml = df['__id'].isin(xml).values
    idxmr = df['__id'].isin(xmr).values

    mxml = torch.FloatTensor([df['__id'].values[idxml]==idx for idx in xml])
    mxmr = torch.FloatTensor([df['__id'].values[idxmr]==idx for idx in xmr])

    ysamples = {idx:
        (
        np.random.choice(sample_sets[idx][(comm_status[idx] == 'D') & (uidsset[idx] != uids[idx])], size=(ysize,), replace=False),
        np.random.choice(sample_sets[idx][(comm_status[idx] == 'R') & (uidsset[idx] != uids[idx])], size=(ysize,), replace=False)
        ) for idx in xml.tolist()+xmr.tolist()}

    ysamples ={k: (df['__id'].isin(v[0]).values, df['__id'].isin(v[1]).values) for k,v in ysamples.items()}

    ml, mr = [],[]
    for x in xml:
        xi = M[ids==x]
        ml += [torch.cat([xi[:,ysamples[x][0]].max(dim=-1).values.unsqueeze(-1),
                          xi[:,ysamples[x][1]].max(dim=-1).values.unsqueeze(-1)], dim=-1)]

    for x in xmr:
        xi = M[ids == x]
        mr += [torch.cat([xi[:, ysamples[x][0]].max(dim=-1).values.unsqueeze(-1),
                          xi[:, ysamples[x][1]].max(dim=-1).values.unsqueeze(-1)], dim=-1)]

    ml, mr = torch.cat(ml, dim=0), torch.cat(mr, dim=0)
    ml, mr = d.log_prob(ml), d.log_prob(mr)
    ml, mr = -(torch.exp(ml) * ml), -(torch.exp(mr) * mr)
    ml, mr = (mxml.unsqueeze(-1) * ml).sum(dim=1), (mxmr.unsqueeze(-1) * mr).sum(dim=1)

    ML += [ml]
    MR += [mr]

    if ((permutation+1) % 100) == 0:
        print('permutation {}/{}'.format(permutation+1, N_permutations))

ML, MR = torch.cat(ML,dim=0), torch.cat(MR,dim=0)

# Printing means and medians as a rough check of normality
# (ML[:,0].mean(), ML[:,0].median()), (ML[:,1].mean(), ML[:,1].median()), (MR[:,0].mean(), MR[:,0].median()), (MR[:,1].mean(), MR[:,1].median())


permutation 100/1000
permutation 200/1000
permutation 300/1000
permutation 400/1000
permutation 500/1000
permutation 600/1000
permutation 700/1000
permutation 800/1000
permutation 900/1000
permutation 1000/1000


In [5]:
def CI(x,z=1.96):
    mu = x.mean()
    side = z * (x.std()/np.sqrt(x.shape[0]))
    return mu-side, mu, mu+side

CI(ML[:,0]), CI(ML[:,1]), CI(MR[:,0]), CI(MR[:,1])

((tensor(3.9768), tensor(4.0181), tensor(4.0593)),
 (tensor(5.3929), tensor(5.4332), tensor(5.4735)),
 (tensor(5.9604), tensor(6.0090), tensor(6.0575)),
 (tensor(5.5860), tensor(5.6366), tensor(5.6872)))

And testing the statistical significance of these distributions . . .

In [6]:
from scipy.stats import ttest_ind as ttest

ttest(ML[:,0].numpy(), ML[:,1].numpy()),ttest(MR[:,1].numpy(), MR[:,0].numpy())

(Ttest_indResult(statistic=-48.08948160400127, pvalue=0.0),
 Ttest_indResult(statistic=-10.406199276034744, pvalue=2.502825813965209e-25))

### 2 Conclusions

What's provided here is a mock-up example with real data (albeit a fraction of it) validating that within-group communication exhibits lower entropy than intergroup communication. The methods proposed here can be used in a variety of applications, ranging from quantifying convergence within groups without prior definition of lexical dictionaries to measuring, per person and message by message, convergence with normative group communication styles over time.

#### References

Adams, A., Miles, J., Dunbar, N. E., & Giles, H. (2018). Communication accommodation in text messages: Exploring liking, power, and sex as predictors of textisms. The Journal of Social Psychology, 158(4), 474–490. https://doi.org/10.1080/00224545.2017.1421895

Dale, R., Duran, N. D., & Coco, M. (2018). Dynamic Natural Language Processing with Recurrence Quantification Analysis. ArXiv:1803.07136 [Cs]. http://arxiv.org/abs/1803.07136

de Vries, W., van Cranenburgh, A., & Nissim, M. (2020). What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models. Findings of the Association for Computational Linguistics: EMNLP 2020, 4339–4350

Palomares, N., Giles, H., Soliz, J., & Gallois, C. (2016). Intergroup Accommodation, Social Categories, and Identities. In H. Giles (Ed.), Communication Accommodation Theory (p. 232).

Rosen, Z. (2022). A BERT’s eye view: A “big-data” framework for assessing language convergence and accommodation in large, many-to-many settings. Journal of Language and Social Psychology, 0261927X2210811.