# Convergence as Entropy Minimization Across Lexico-Semantic Choice

### 1 A Theory of Diachronic Language Shift within Groups

Linguistic alignment and convergence both describe the tendency for members of a group to settle on similar means of discussing a topic.

Similar means can be expressed as a minimization of the entropy between utterances made by group members. As group members A and B sound more similar to one another, more of member A's semantic content can be recovered from the lexical items in member B's utterance, because member A's message becomes easier to predict just by listening to member B. In other words, because they use similar language to say the same thing, the predictability of one utterance given the other increases, and thus entropy (i.e. how unpredictable one utterance remains after observing the other) decreases.
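To make the link between predictability and entropy concrete, here is a small illustrative sketch in plain Python. The two distributions are made-up numbers, not drawn from the data: one stands for a listener who can guess member A's wording well after hearing member B, the other for a listener who cannot. The helper name `shannon_entropy` is an assumption introduced for this example.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical guesses over four candidate words for member A's next token.
predictable = [0.85, 0.05, 0.05, 0.05]    # B's wording makes A easy to guess
unpredictable = [0.25, 0.25, 0.25, 0.25]  # B's wording tells us nothing about A

print(shannon_entropy(predictable))
print(shannon_entropy(unpredictable))
```

The peaked distribution yields the lower entropy, which is the sense in which convergence is treated as entropy minimization here.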

Similarity in the precise lexico-semantic meaning of two words can be measured using contextual word embeddings. Models like BERT provide contextually informed word embeddings.

The following validates, on toy data, that entropy does differ between messages written by members of the same group and messages written by members of different groups. We validate this on a subset of 40 post titles discussing COVID-19, evenly sampled from subreddits "r/covidiots" and "r/conspiracy". No pre-processing was performed on the data.

#### 1.1 Generating initial data

In [None]:
from convergenceEntropy.data.covidreddit.datacall import *

#### 1.2 Word vector representations

$$ E_{xi} = wv(i \in x) $$

In [None]:
'python3 ./data/extremismVecsv5.py'

#### 1.3 Probability based on word vectors

The message histories of two individuals are assumed to be alike when the scores computed over those histories indicate high similarity. Let $x$ be the content of a new message by any one individual, and $h$ be the history of messages by another individual. Assume that we know little about the author of $x$, while the message history of the individual with history $h$ is known. We calculate the similarity between threads of messages by quantifying the recurrence (see Dale et al. 2018) between individual message histories:

$$ P(E_{xi}|E_{hj}) = P_{\mathcal{N}}\left( CoS(E_{xi},E_{hj}) \bigg| \mu=1, \sigma[=.3] \right) $$
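As a minimal sketch of the formula above, the following plain-Python example computes the cosine similarity of two vectors and evaluates it under a Gaussian with $\mu=1$ and $\sigma=.3$. The three-dimensional vectors and the helper names `cosine` and `normal_log_density` are assumptions made for illustration; real contextual embeddings are far longer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normal_log_density(value, mu=1.0, sigma=0.3):
    """Log of the Gaussian density at `value` for mean `mu` and std `sigma`."""
    return -((value - mu) ** 2) / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))

# Hypothetical toy "embeddings" for one token of x and two tokens of h.
e_x = [0.9, 0.1, 0.2]
e_h_close = [0.8, 0.2, 0.1]   # points in nearly the same direction as e_x
e_h_far = [-0.1, 0.9, 0.3]    # points in a very different direction

for e_h in (e_h_close, e_h_far):
    cos = cosine(e_x, e_h)
    print(round(cos, 3), round(normal_log_density(cos), 3))
```

Because the Gaussian is centred on 1, a cosine similarity near 1 receives the highest score. Note that this is a density rather than a probability mass, so its value can exceed 1 close to the mean.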

In [None]:
import torch
import pandas as pd
import numpy as np

s_col = 'subreddit_name'

def vec_from_string(dfi):
    # Parse each stringified vector in the 'vecs' column (e.g. "[0.1, 0.2]") into a row of floats.
    return torch.FloatTensor(
        [
            [
                float(i) for i in vec.replace('[', '').replace(']','').split(', ')
            ] for vec in dfi['vecs'].values
        ]
    )

df = pd.read_table("/Volumes/V'GER/comp_ling/DataScideProjects/convergenceEntropy/data/senators458/vecs_538senators-tweets.tsv")
print(list(df))
print(df['party'].value_counts())

Eu = vec_from_string(df)
Eu.shape

Because of how many comparisons there are to make, I had to break the algorithm up. Below, we calculate the cosine similarity (CoS) of each vector with every other vector in the dataset.
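The internals of `hRQA.streamCOS` are not shown in this notebook, so the following is only a sketch of one way to break an all-pairs cosine computation into chunks. It uses plain Python lists in place of torch tensors, and the names `normalize`, `chunked_cosine`, and `chunk_size` are hypothetical.

```python
import math

def normalize(rows):
    """Scale each vector to unit length so dot products equal cosine similarity."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(a * a for a in row))
        out.append([a / norm for a in row])
    return out

def chunked_cosine(X, Y, chunk_size=2):
    """Yield (start_row, block) pairs, where each block holds the cosine
    similarities of up to `chunk_size` rows of X against every row of Y."""
    Xn, Yn = normalize(X), normalize(Y)
    for start in range(0, len(Xn), chunk_size):
        block = [[sum(a * b for a, b in zip(x, y)) for y in Yn]
                 for x in Xn[start:start + chunk_size]]
        yield start, block

# Toy vectors; the full similarity matrix is assembled block by block.
vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
M_toy = []
for _, block in chunked_cosine(vectors, vectors):
    M_toy.extend(block)

print(M_toy)
```

Processing a few rows at a time keeps only one block of comparisons in memory at once, which is the usual motivation for streaming a computation of this size.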

In [None]:
from shared.mod.mutual_information.RQA import *

H = hRQA()
H.streamCOS(Eu,Eu)

From there we save that to make later calculations faster.

In [None]:
torch.save(
    {
        'M':H.M,
        'df': df[[i for i in list(df) if i not in ['vecs','vec']]],
    },
    "/Volumes/V'GER/comp_ling/DataScideProjects/convergenceEntropy/data/senators458/gun-control-ckpt.pt")

#### 1.4 Entropy across sentences using probability based on their component word vectors

Meanwhile, the probability that an individual's message $x$ exhibits convergence with a known individual or member of a group $h \in H$ is based on (1) how much of $x$, per token, is recoverable from $h$'s message history, which is expressed by (2) the amount of information relayed by the lexical item $hj$ most closely related to each token $xi$.

$$ I( x ; h ) = \sum_i \max\limits_{j \in h} \left( \log P(E_{xi}|E_{hj}) \right) $$
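As a rough illustration of the sum of maxima in the formula above, the sketch below scores a message $x$ against a history $h$ starting from a small matrix of cosine similarities. The matrices and the helper names `log_p` and `recoverability` are hypothetical, and $\mu=1$, $\sigma=.3$ follow the Gaussian defined earlier.

```python
import math

def log_p(cos, mu=1.0, sigma=0.3):
    """Gaussian log-density of a cosine similarity (mean 1, std 0.3)."""
    return -((cos - mu) ** 2) / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))

def recoverability(cos_xh):
    """I(x; h): for each token i of x (one row), keep the best-scoring token j
    of h (max over the row), then sum those log-probabilities over i."""
    return sum(max(log_p(c) for c in row) for row in cos_xh)

# Hypothetical cosine similarities: 2 tokens in x (rows) by 3 tokens in h (columns).
cos_similar = [[0.95, 0.20, 0.10],
               [0.30, 0.90, 0.40]]
cos_dissimilar = [[0.35, 0.20, 0.10],
                  [0.30, 0.25, 0.40]]

print(recoverability(cos_similar), recoverability(cos_dissimilar))
```

Only the closest match in $h$ contributes for each token of $x$, so a history that contains one near-synonym per token scores as well as one that repeats it many times.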

In the diachronic case, a message $x$ can only be similar to messages in $h$ that were written prior to $x$. Thus, let $\tau_{x}$ be the time of message $x$, and let $E_{hj}$ contain only tokens pulled from messages occurring prior to $\tau_x$--i.e.

$$E_{hj} = \bigg\lbrace wv(j \in h)\,\delta_{\tau_j < \tau_x},\ \ldots,\ wv(n \in h)\,\delta_{\tau_n < \tau_x} \bigg\rbrace$$
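The time restriction can be pictured as a simple filter over the history. The sketch below assumes each history token is stored alongside the timestamp of the message it came from; the `(token, timestamp)` pairs and the function name `prior_history` are made up for illustration.

```python
def prior_history(history, tau_x):
    """Keep only history tokens whose message time strictly precedes tau_x."""
    return [(token, tau) for token, tau in history if tau < tau_x]

# Hypothetical (token, timestamp) pairs from one individual's message history.
history = [("mask", 1), ("mandate", 2), ("hoax", 5)]

print(prior_history(history, tau_x=3))  # → [('mask', 1), ('mandate', 2)]
```

In the matrix setting of this notebook, the same idea would amount to masking out columns of the similarity matrix whose tokens were written at or after $\tau_x$.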

First, I loaded the data from the checkpoint described above.

In [None]:
import torch
import pandas as pd
import numpy as np

s_col = 'party'
ckpt = torch.load("/Volumes/V'GER/comp_ling/DataScideProjects/convergenceEntropy/data/senators458/gun-control-ckpt.pt")

M, df = ckpt['M'], ckpt['df']
ids = df['__id'].values
df['party'] = df['party'].replace({'I': 'D'})
del ckpt

In [None]:
print(list(df))
for subreddit in df[s_col].unique():
    print('{} \t {}'.format(subreddit, len(df['__id'].loc[df[s_col].isin([subreddit])].unique())))

Then, step by step, we build the entropy matrices, starting with the conditional probability of one vector given the location of another vector under a Gaussian distribution . . .

In [None]:
comm = np.array([df[s_col].loc[df['__id'].isin([idx])].unique()[0] for idx in np.unique(ids)])

#Establish parameters for Gaussian
d = torch.distributions.Normal(1,.3,validate_args=False)

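def entropy(x,y,M=M,d=d):
    # x, y: boolean masks over tokens selecting the two messages to compare
    hxy = M[x][:,y].max(dim=-1).values  # best cosine match in y for each token of x
    hxy = d.log_prob(hxy)               # Gaussian log-density of that best match
    return -(torch.exp(hxy) * hxy).sum()  # -sum(p * log p) over the tokens of x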

In [None]:
# Hxy = [[d.log_prob(M[ids==i][:,ids==j].max(dim=-1).values) for j in np.unique(ids)] for i in np.unique(ids)]
# Hxy = [[-(torch.exp(cell) * cell).sum() for cell in row] for row in Hxy]

Hxy = [[entropy(ids==i,ids==j) for j in np.unique(ids)] for i in np.unique(ids)]
Hxy = torch.FloatTensor(Hxy)

# Saving the subreddit names for later
comm = np.array([df[s_col].loc[df['__id'].isin([idx])].unique()[0] for idx in np.unique(ids)])

#### Checking model "accuracy"
An important question is how "accurate" the model is. Clearly, group-level phenomena are being captured, but how well does the model do at simply matching individuals together based on their group affiliation? Because this is an otherwise unsupervised model, one way to test that it performs as expected is to ask whether, for each comment, the nearest-neighbor comment (i.e. the comment that has the lowest entropy with the original comment) is from the same group.

1. With $\sigma=.05$, in .917 of cases the nearest neighbor is from the same group.
2. With $\sigma=.1$, in .923 of cases the nearest neighbor is from the same group.

In [None]:
# For each message, rank all other messages by entropy (lowest first), dropping the message itself, and keep the 5 nearest.
y_ = torch.LongTensor([[idx for idx in row if idx != i] for i,row in enumerate(Hxy.argsort(dim=-1))])[:,:5]
(comm == comm[y_[:,0].numpy()]).mean()
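The nearest-neighbor check can be followed by hand on a tiny example. The sketch below uses a made-up 4×4 entropy matrix and group labels; `nearest_neighbor_agreement`, `H_toy`, and `labels` are hypothetical names, and plain lists stand in for the tensors used in the notebook.

```python
def nearest_neighbor_agreement(H, labels):
    """Fraction of items whose lowest-entropy neighbor (excluding the item
    itself) carries the same group label."""
    hits = 0
    for i, row in enumerate(H):
        candidates = [j for j in range(len(row)) if j != i]
        nearest = min(candidates, key=lambda j: row[j])
        hits += labels[nearest] == labels[i]
    return hits / len(H)

# Toy entropy matrix: rows 0-1 belong to group D, rows 2-3 to group R.
H_toy = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.3],
    [0.8, 0.1, 0.3, 0.0],  # nearest neighbor is item 1, from the other group
]
labels = ["D", "D", "R", "R"]

print(nearest_neighbor_agreement(H_toy, labels))  # → 0.75
```

Three of the four items have a same-group nearest neighbor, giving 0.75; the proportions reported above are the same kind of quantity computed over the full dataset.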

#### 1.5 Assessing entropic differences between groups

Monte Carlo sampling for two conditions: within group and outside of group.

We break down our hypotheses as follows:

- (H1) There is statistically significant, lower entropy between any sampled set of comments posted within a subreddit A comprised of like-minded individuals compared to any sampled set of comments posted to a subreddit B in opposition to A.
- (H0) There is no statistically significant difference in entropy between two diametrically opposed subreddits.

To test this, picture the following "game". For each comment in our dataset, we randomly sample $N$ comments from the same subreddit A and $N$ comments from the opposing subreddit B. We then calculate the entropy between our comment and each comment in the sample from A, as well as the entropy between our comment and each comment in the sample from B. We then take the mean entropy between our original comment and the comments from A, and between our original comment and the comments from B, separately for comparison. We repeat this process a number of times. Hypothesis H1 is validated if in greater than 5\% ($\alpha$) of cases there is lower mean entropy for our initial comment and sample A than there is for our initial comment and sample B. This test is significant if the distribution of means with A and the distribution of means with B are significantly different. Both of these conditions must be true in order to validate H1.
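The game described above can be sketched on synthetic data. This is a simplified illustration rather than the procedure used in the notebook: the entropy matrix is randomly generated so that within-group values are lower by construction, and the names `sampling_game`, `H_toy`, `n_samples`, and `n_rounds` are assumptions introduced for the example.

```python
import random
import statistics

def sampling_game(H, labels, n_samples, n_rounds, seed=0):
    """For each round and each item, average its entropy against `n_samples`
    random same-group items and `n_samples` random other-group items."""
    rng = random.Random(seed)
    within, between = [], []
    for _ in range(n_rounds):
        for i, label in enumerate(labels):
            same = [j for j, l in enumerate(labels) if l == label and j != i]
            other = [j for j, l in enumerate(labels) if l != label]
            within.append(statistics.mean([H[i][j] for j in rng.sample(same, n_samples)]))
            between.append(statistics.mean([H[i][j] for j in rng.sample(other, n_samples)]))
    return within, between

# Synthetic entropy matrix: lower values inside a group, higher across groups.
labels = ["A"] * 4 + ["B"] * 4
rng = random.Random(1)
H_toy = [[rng.uniform(0.1, 0.4) if labels[i] == labels[j] else rng.uniform(0.5, 0.9)
          for j in range(8)] for i in range(8)]

within, between = sampling_game(H_toy, labels, n_samples=2, n_rounds=50)
share_lower = sum(w < b for w, b in zip(within, between)) / len(within)
print(statistics.mean(within), statistics.mean(between), share_lower)
```

The two lists of means play the role of the "means with A" and "means with B" distributions, and `share_lower` is the proportion of cases in which the within-group mean is the lower of the two.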

#### A permutation test from sampled message histories

In [None]:
# (0) negative surprisal
# Hxy = d.log_prob(M)

# (1) monte carlo procedure
N_permutations, xsize, ysize = 100, 10, 20

sample_sets = np.array([[j for j in np.unique(ids) if j != i] for i in np.unique(ids)])
comm_status = comm[sample_sets]
uids = np.array([df['user'].loc[df['__id'].isin([idx])].unique()[0] for idx in np.unique(ids)])
uidsset = uids[sample_sets]


ML, MR = [],[]
for permutation in range(N_permutations):

    xml = np.random.choice(np.unique(ids)[comm=='D'], size=(xsize,), replace=False)
    xmr = np.random.choice(np.unique(ids)[comm=='R'], size=(xsize,), replace=False)

    idxml = df['__id'].isin(xml).values
    idxmr = df['__id'].isin(xmr).values

    mxml = torch.FloatTensor([df['__id'].values[idxml]==idx for idx in xml])
    mxmr = torch.FloatTensor([df['__id'].values[idxmr]==idx for idx in xmr])

    ysamples = {idx:
        (
        np.random.choice(sample_sets[idx][(comm_status[idx] == 'D') & (uidsset[idx] != uids[idx])], size=(ysize,), replace=False),
        np.random.choice(sample_sets[idx][(comm_status[idx] == 'R') & (uidsset[idx] != uids[idx])], size=(ysize,), replace=False)
        ) for idx in xml.tolist()+xmr.tolist()}

    ysamples ={k: (df['__id'].isin(v[0]).values, df['__id'].isin(v[1]).values) for k,v in ysamples.items()}

    ml, mr = [],[]
    for x in xml:
        xi = M[ids==x]
        ml += [torch.cat([xi[:,ysamples[x][0]].max(dim=-1).values.unsqueeze(-1),
                          xi[:,ysamples[x][1]].max(dim=-1).values.unsqueeze(-1)], dim=-1)]

    for x in xmr:
        xi = M[ids == x]
        mr += [torch.cat([xi[:, ysamples[x][0]].max(dim=-1).values.unsqueeze(-1),
                          xi[:, ysamples[x][1]].max(dim=-1).values.unsqueeze(-1)], dim=-1)]

    ml, mr = torch.cat(ml, dim=0), torch.cat(mr, dim=0)
    ml, mr = d.log_prob(ml), d.log_prob(mr)
    ml, mr = -(torch.exp(ml) * ml), -(torch.exp(mr) * mr)
    ml, mr = (mxml.unsqueeze(-1) * ml).sum(dim=1), (mxmr.unsqueeze(-1) * mr).sum(dim=1)

    ML += [ml]
    MR += [mr]

ML, MR = torch.cat(ML,dim=0), torch.cat(MR,dim=0)

# Printing means and medians as a rough check of normality
(ML[:,0].mean(), ML[:,0].median()), (ML[:,1].mean(), ML[:,1].median()), (MR[:,0].mean(), MR[:,0].median()), (MR[:,1].mean(), MR[:,1].median())

And testing the statistical significance of these distributions . . .

In [None]:
from scipy.stats import ttest_ind as ttest

ttest(ML[:,0].numpy(), ML[:,1].numpy()),ttest(MR[:,0].numpy(), MR[:,1].numpy())
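For readers unfamiliar with the test, `ttest_ind` with its default settings (`equal_var=True`) is Student's two-sample t-test with a pooled variance. The sketch below computes only the t statistic by hand in plain Python, on made-up samples, to show what is being compared; `pooled_t_statistic` is a hypothetical helper and does not compute the p-value.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal-variance assumption)."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / n_a + 1 / n_b))

# Made-up entropy means for the within-group and between-group conditions.
within_means = [1.0, 2.0, 3.0]
between_means = [2.0, 3.0, 4.0]

print(pooled_t_statistic(within_means, between_means))
```

A negative statistic indicates that the first sample has the lower mean, which in this setting corresponds to lower within-group entropy.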

### 2 Conclusions

What's provided here is a mock-up example, using a small fraction of real data, validating that within-group communication exhibits lower entropy than intergroup communication. The methods proposed here can be used in a variety of applications, ranging from quantifying convergence within groups without prior definition of lexical dictionaries to measuring, per person, convergence with normative group communication styles over time (message by message).

#### References

Adams, A., Miles, J., Dunbar, N. E., & Giles, H. (2018). Communication accommodation in text messages: Exploring liking, power, and sex as predictors of textisms. The Journal of Social Psychology, 158(4), 474–490. https://doi.org/10.1080/00224545.2017.1421895

Dale, R., Duran, N. D., & Coco, M. (2018). Dynamic Natural Language Processing with Recurrence Quantification Analysis. ArXiv:1803.07136 [Cs]. http://arxiv.org/abs/1803.07136

de Vries, W., van Cranenburgh, A., & Nissim, M. (2020). What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models. Findings of the Association for Computational Linguistics: EMNLP 2020, 4339–4350.

Palomares, N., Giles, H., Soliz, J., & Gallois, C. (2016). Intergroup Accommodation, Social Categories, and Identities. In H. Giles (Ed.), Communication Accommodation Theory (p. 232).

Rosen, Z. (2022). A BERT’s eye view: A “big-data” framework for assessing language convergence and accommodation in large, many-to-many settings. Journal of Language and Social Psychology, 0261927X2210811.