# Convergence as Entropy Minimization Across Lexico-Semantic Choice

### 1 A Theory of Diachronic Language Shift within Groups

Linguistic alignment and convergence both describe the tendency of group members to settle on similar ways of discussing a topic.

Similar means of expression can be formalized as a reduction in the entropy between utterances made by group members. As group members A and B come to sound more alike, more of member A's semantic content can be recovered from the lexical items in member B's utterance, because A's message becomes easier to predict just by listening to B. In other words, because they use similar language to say the same thing, the predictability of one utterance given the other increases, and entropy (i.e., how unpredictable one utterance remains after observing the other) decreases.
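As a rough illustration of the link between predictability and entropy, the sketch below computes Shannon entropy for two hypothetical distributions over the word member A might use after hearing member B. The distributions and the helper name `shannon_entropy` are assumptions made for illustration and are not part of the analysis in this notebook.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical distributions over four candidate words for member A,
# conditioned on what member B just said.
unaligned = [0.25, 0.25, 0.25, 0.25]  # B's wording says nothing about A's choice
aligned = [0.85, 0.05, 0.05, 0.05]    # B's wording makes A's choice easy to guess

h_unaligned = shannon_entropy(unaligned)
h_aligned = shannon_entropy(aligned)
print(h_unaligned, h_aligned)
```

The more concentrated distribution yields the lower entropy, which is the sense in which convergence is treated as entropy minimization here.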

Similarity in the precise lexico-semantic meaning of two words can be measured using contextual word embeddings, such as those produced by models like BERT.

The following validates, on toy data, that there is indeed a difference in entropy between messages written by members of the same group and messages written by members of different groups. We validate this on a subset of 40 post titles discussing COVID-19, evenly sampled from subreddits "r/covidiots" and "r/conspiracy". No pre-processing was performed on the data.

#### 1.1 Generating initial data

In [None]:
from convergenceEntropy.data.covidreddit.datacall import *

#### 1.2 Word vector representations

$$ E_{xi} = wv(i \in x) $$

In [None]:
'python3 ./data/extremismVecsv4.py'

#### 1.3 Probability based on word vectors

Message histories for two individuals are assumed to be the same when scores for their message histories indicate high similarity. Let $x$ be the content of a new message by any one individual, and $h$ be the history of messages by another individual. Assume that for $x$, we are unsure of the individual's purchasing history, while we know the history of purchases made by the individual with history $h$. We calculate the similarity between threads of messages by quantifying the recurrence (see Dale et al. 2018) between individual message histories:

$$ P(E_{xi}|E_{hj}) = P_{\mathcal{N}}\left( CoS(E_{xi},E_{hj}) \,\bigg|\, \mu=1, \sigma=0.3 \right) $$
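To make the formula concrete, here is a minimal sketch that scores the cosine similarity of two vectors under a normal density with $\mu=1$ and $\sigma=0.3$. It uses only the standard library so that it runs on its own; the three-dimensional vectors and the helper names `cosine_similarity` and `gaussian_log_density` are hypothetical stand-ins, while the notebook itself uses `torch.nn.CosineSimilarity` and `torch.distributions.Normal`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gaussian_log_density(value, mu=1.0, sigma=0.3):
    """Log of the normal density evaluated at `value`."""
    z = (value - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

# Toy "embeddings" (hypothetical values, not real BERT vectors).
e_x = [0.9, 0.1, 0.3]
e_near = [0.8, 0.2, 0.3]    # points in nearly the same direction as e_x
e_far = [-0.2, 0.9, -0.4]   # points in a very different direction

logp_near = gaussian_log_density(cosine_similarity(e_x, e_near))
logp_far = gaussian_log_density(cosine_similarity(e_x, e_far))
print(logp_near, logp_far)
```

Because this is a density rather than a probability mass, its value can exceed 1: with $\sigma=0.3$ the peak at a cosine similarity of 1 is about 1.33, so the log-density there is slightly positive.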

In [1]:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np

cos = nn.CosineSimilarity(dim=-1)
s_col = 'subreddit_name'

def vec_from_string(dfi):
    return torch.FloatTensor(
        [
            [
                float(i) for i in vec.replace('[', '').replace(']','').split(', ')
            ] for vec in dfi['vecs'].values
        ]
    )

groups = ['menslib', 'feminism', 'mensrights']

df = pd.read_table("/Volumes/V'GER/comp_ling/DataScideProjects/convergenceEntropy/data/redditany/three_groups/woman_post_title/vecs_feminism-mensrights-menslib-women_post_title.tsv")
print(list(df))
print(df['subreddit_name'].value_counts())
df = df.loc[df[s_col].isin(groups) & df['pattern_found'].values]
df.index=range(len(df))
df['--id'] = df['__id'].values

update_id_dic = {i:idx for idx,i in enumerate(np.unique(df['__id'].values))}
df['__id'] = df['__id'].replace(update_id_dic)

ids = np.unique(df['__id'].values)
comm = np.array([df[s_col].loc[df['__id'].isin([idx])].unique()[0] for idx in ids])


['__id', 'subreddit_name', 'post_id', 'post_title', 'index', 'pattern_found', 'tokens', 'vecs']
mensrights    67294
feminism      16073
menslib        5222
Name: subreddit_name, dtype: int64


In [2]:
print(list(df))
for subreddit in df['subreddit_name'].unique():
    print('{} \t {}'.format(subreddit, len(df['__id'].loc[df['subreddit_name'].isin([subreddit])].unique())))

['__id', 'subreddit_name', 'post_id', 'post_title', 'index', 'pattern_found', 'tokens', 'vecs', '--id']
feminism 	 169
mensrights 	 897
menslib 	 33


In [3]:
Eu = vec_from_string(df)
del df['vecs']
Eu.shape

torch.Size([26555, 1536])

In [4]:
#Establish parameters for Gaussian
d = torch.distributions.Normal(1,.3,validate_args=False)

#### 1.4 Entropy across sentences using probability based on their component word vectors

Meanwhile, the probability that an individual's message $x$ exhibits convergence with a known individual/member $h$ of group $H$ ($h \in H$) depends on (1) how much of $x$, per token, is recoverable from $h$'s message history, which is expressed by (2) the amount of information relayed by the most closely related lexical item $hj$ for each token $xi$.

$$ I( x ; h ) = \sum_i \max\limits_{j \in h} \left( \log P(E_{xi}|E_{hj}) \right) $$
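The sum-of-maxima above can be sketched as follows. This is a minimal standard-library illustration on made-up two-dimensional vectors, assuming the same Gaussian scoring of cosine similarity ($\mu=1$, $\sigma=0.3$); the function name `recoverability` is hypothetical, and the notebook's own cells work on batched tensors instead.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def gaussian_log_density(value, mu=1.0, sigma=0.3):
    """Log of the normal density evaluated at `value`."""
    z = (value - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

def recoverability(x_vecs, h_vecs, mu=1.0, sigma=0.3):
    """Sum over tokens of x of the best log-density reached by any token of h."""
    total = 0.0
    for e_xi in x_vecs:
        total += max(
            gaussian_log_density(cosine_similarity(e_xi, e_hj), mu, sigma)
            for e_hj in h_vecs
        )
    return total

# Hypothetical token vectors for a message x and two candidate histories.
x = [[1.0, 0.0], [0.0, 1.0]]
h_similar = [[0.9, 0.1], [0.1, 0.9]]
h_different = [[-1.0, 0.1], [0.1, -1.0]]

score_similar = recoverability(x, h_similar)
score_different = recoverability(x, h_different)
print(score_similar, score_different)
```

Since cosine similarity never exceeds $\mu=1$, the log-density is increasing in the similarity, so taking the maximum similarity first and scoring it afterwards selects the same token $hj$ as maximizing the log-density directly.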

In the diachronic case, a message $x$ can only be similar to messages in $h$ that were written prior to $x$. Thus, let $\tau_{x}$ be the time of message $x$, and let $E_{hj}$ contain only tokens pulled from messages occurring prior to $\tau_x$, i.e.

$$E_{hj} = \bigg\lbrace wv(j \in h)\delta_{\tau_j < \tau_x}, \ldots, wv(n \in h)\delta_{\tau_n < \tau_x}  \bigg\rbrace$$

The entropy matrices are then built step by step, starting with the conditional probability of one vector given the location of another under a Gaussian distribution.
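The temporal restriction on $h$ can be sketched as a simple filter. The `HistoryToken` container, its field names, and the timestamps below are hypothetical; the data used in this notebook does not necessarily store time in this form. The sketch also drops ineligible tokens outright, whereas the formula expresses the same restriction with an indicator $\delta$.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HistoryToken:
    vector: List[float]  # word vector wv(j) for token j in h
    timestamp: float     # tau_j, time of the message that contains token j

def history_before(history: List[HistoryToken], tau_x: float) -> List[List[float]]:
    """Keep only vectors from messages written strictly before tau_x."""
    return [tok.vector for tok in history if tok.timestamp < tau_x]

history = [
    HistoryToken([0.1, 0.9], timestamp=1.0),
    HistoryToken([0.7, 0.3], timestamp=2.0),
    HistoryToken([0.5, 0.5], timestamp=3.0),
]

# A message written at time 2.0 may only be compared with earlier tokens.
eligible = history_before(history, tau_x=2.0)
print(eligible)
```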

#### 1.5 Assessing entropic differences between groups

We use Monte Carlo sampling under two conditions: within group and outside of group.

We break down our hypotheses as follows:

- (H1) There is statistically significant, lower entropy between any sampled set of comments posted within a subreddit A comprised of like-minded individuals compared to any sampled set of comments posted to a subreddit B in opposition to A.
- (H0) There is no statistically significant difference in entropy between two diametrically opposed subreddits.

To test this, picture the following "game". For each comment in our dataset, we randomly sample $N$ comments from the same subreddit A and $N$ comments from the opposing subreddit B. We then calculate the entropy between our comment and each comment in the random sample from A, as well as the entropy between our comment and each comment in the random sample from B. We then take the mean entropy between our original comment and the comments from A, and between it and the comments from B, separately for comparison. We repeat this process a number of times. Hypothesis H1 is validated if in greater than 5% ($\alpha$) of cases there is lower mean entropy for our initial comment and sample A than there is for our initial comment and sample B. This test is significant if the distribution of means with A and the distribution of means with B are significantly different. Both of these conditions must be true in order to validate H1.
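The game can be sketched in a few lines. Everything here is a toy stand-in: comments are single numbers, `toy_entropy` (absolute distance) plays the role of the entropy measure, and the function name `compare_within_vs_between` and the sample sizes are assumptions chosen for illustration. The actual procedure on word vectors is in the cells that follow.

```python
import random
import statistics

def toy_entropy(a, b):
    # Hypothetical stand-in: distance on a line plays the role of entropy.
    return abs(a - b)

def compare_within_vs_between(group_a, group_b, entropy_fn, n_samples, n_trials, seed=0):
    """For each trial, draw a target comment from A and return its mean entropy
    against a random sample from A (excluding itself) and a random sample from B."""
    rng = random.Random(seed)
    within, between = [], []
    for _ in range(n_trials):
        target_idx = rng.randrange(len(group_a))
        target = group_a[target_idx]
        pool_a = [c for i, c in enumerate(group_a) if i != target_idx]
        sample_a = rng.sample(pool_a, n_samples)
        sample_b = rng.sample(group_b, n_samples)
        within.append(statistics.mean(entropy_fn(target, c) for c in sample_a))
        between.append(statistics.mean(entropy_fn(target, c) for c in sample_b))
    return within, between

# Two toy "subreddits" whose comments cluster around different values.
data_rng = random.Random(1)
group_a = [data_rng.gauss(0.0, 1.0) for _ in range(50)]
group_b = [data_rng.gauss(3.0, 1.0) for _ in range(50)]

within, between = compare_within_vs_between(
    group_a, group_b, toy_entropy, n_samples=10, n_trials=200
)
share_lower = sum(w < b for w, b in zip(within, between)) / len(within)
print(statistics.mean(within), statistics.mean(between), share_lower)
```

Here `share_lower` is the fraction of trials in which the within-group mean is lower, and the two lists `within` and `between` are the distributions that a significance test would then compare.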

### A Permutation test from sampled message histories

In [5]:
# (1) monte carlo procedure
N_permutations, xsize, ysize = 1000, 20, 20

sample_sets = np.array([[j for j in np.unique(ids) if j != i] for i in np.unique(ids)])
comm_status = comm[sample_sets]

ids = df['__id'].values

M = []
for permutation in range(N_permutations):

    #dictionary of sampled comments of shape group x sample_size
    xm = {group: np.random.choice(np.unique(ids)[comm==group], size=(xsize,), replace=False) for group in groups}

    #dictionary of indices for each sample per group of shape group x total_number_of_vecs
    idxm = {group: df['__id'].isin(xm[group]).values for group in groups}

    #dictionary of masks for each sample group for summation of shape group x sample_size x total_number_of_vecs
    mxm = {group: torch.FloatTensor([df['__id'].values[idxm[group]]==idx for idx in xm[group]]) for group in groups}

    #dictionary of y-axis samples for each sampled conversation of shape sample_conversation x group x y_axis_sample_size
    ysamples = {idx:
                    { group:
                          np.random.choice(sample_sets[idx][(comm_status[idx] == group)], size=(ysize,), replace=False)
                      for group in groups }
        for idx in sum([xm[group].tolist() for group in groups],[])}

    key = list(ysamples.keys())[0]

    #dictionary of y-axis sample indices for each sampled conversation of shape sample_conversation x group x total_number_of_vecs
    ysamples = { k:
                    { group:
                          df['__id'].isin(v[group]).values
                      for group in groups }
                for k,v in ysamples.items() }

    m = []
    for group in groups:
        #calculating response per each group of shape idxm[group] x ysamples[group,sample_conversation,groupY] and taking the max for each.
        r = torch.cat([
            torch.cat([
                cos(
                    Eu[ids==x].unsqueeze(1),
                    Eu[ysamples[x][groupy]]).max(dim=-1).values.unsqueeze(-1)
                for groupy in groups], dim=-1)
            for x in xm[group]],dim=0)

        r = d.log_prob(r)
        r = -(torch.exp(r) * r)
        r = (mxm[group].unsqueeze(-1) * r).sum(dim=1)
        m += [r.unsqueeze(0)]

    m = torch.cat(m, dim=0)
    M += [m.unsqueeze(0)]

    if ((permutation+1) % 100) == 0:
        print('permutation {}/{}'.format(permutation+1, N_permutations))

# M is a matrix of shape trials x groups x sample_size x group_history_sampled_from
M  = torch.cat(M,dim=0)

permutation 100/1000
permutation 200/1000
permutation 300/1000
permutation 400/1000
permutation 500/1000
permutation 600/1000
permutation 700/1000
permutation 800/1000
permutation 900/1000
permutation 1000/1000


In [6]:
#Printing medians and means
for i, group in enumerate(groups):
    for j,comparison in enumerate(groups):
        res = M[:,i,:,j].reshape(-1)
        print('{}:{} \t :: {} {}'.format(group,comparison,res.mean(), res.median()))
    print('=======][=======')

menslib:menslib 	 :: 7.644375801086426 6.024878978729248
menslib:feminism 	 :: 8.336524963378906 6.501287460327148
menslib:mensrights 	 :: 8.308388710021973 6.459111213684082
feminism:menslib 	 :: 6.81982946395874 5.764913082122803
feminism:feminism 	 :: 6.699769496917725 5.675271987915039
feminism:mensrights 	 :: 6.957864284515381 5.885248184204102
mensrights:menslib 	 :: 5.994281768798828 5.034763336181641
mensrights:feminism 	 :: 6.121181488037109 5.172330856323242
mensrights:mensrights 	 :: 6.010438919067383 5.070476531982422


And testing the statistical significance of these results . . .

In [10]:
from scipy.stats import ttest_ind as ttest

for i, group in enumerate(groups):
    for j,comparison in enumerate(groups):
        if group != comparison:
            sample1 = M[:,i,:,i].reshape(-1).numpy()
            sample2 = M[:,i,:,j].reshape(-1).numpy()
            print('(ttest) {}:{} \t :: {}'.format(group,comparison,ttest(sample1,sample2)))
    print('=======][=======')

(ttest) menslib:feminism 	 :: Ttest_indResult(statistic=-13.99452990635845, pvalue=2.1429933278855224e-44)
(ttest) menslib:mensrights 	 :: Ttest_indResult(statistic=-13.43874805915357, pvalue=4.402248370908595e-41)
(ttest) feminism:menslib 	 :: Ttest_indResult(statistic=-2.79346007833938, pvalue=0.005217225055746762)
(ttest) feminism:mensrights 	 :: Ttest_indResult(statistic=-5.945564642468476, pvalue=2.777857150978993e-09)
(ttest) mensrights:menslib 	 :: Ttest_indResult(statistic=0.3466492028738662, pvalue=0.7288567002223381)
(ttest) mensrights:feminism 	 :: Ttest_indResult(statistic=-2.356036138364458, pvalue=0.018475917743386597)
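
By default, `scipy.stats.ttest_ind` assumes that both samples have equal variances. Whether that holds for these samples is not checked here, and the gaps between the means and medians printed earlier suggest skewed distributions. As one hedged alternative, Welch's statistic drops the equal-variance assumption. The sketch below computes it with the standard library on made-up numbers; `welch_t` is a hypothetical helper, and in SciPy the same test is available through `ttest_ind(sample1, sample2, equal_var=False)`.

```python
import math
import statistics

def welch_t(sample1, sample2):
    """Welch's t statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    # Sample variances divided by sample sizes (squared standard errors).
    se1 = statistics.variance(sample1) / n1
    se2 = statistics.variance(sample2) / n2
    t_stat = (m1 - m2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t_stat, dof

# Made-up values standing in for two sets of entropy scores.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(a, b)
print(t_stat, dof)
```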


And testing whether within-group entropy differs significantly from one group to another:

In [11]:
for i, group in enumerate(groups):
    for j,comparison in enumerate(groups):
        if group != comparison:
            sample1 = M[:,i,:,i].reshape(-1).numpy()
            sample2 = M[:,j,:,j].reshape(-1).numpy()
            print('(ttest) {}:{} \t :: {}'.format(group,comparison,ttest(sample1,sample2)))
    print('=======][=======')

(ttest) menslib:feminism 	 :: Ttest_indResult(statistic=20.842168259051917, pvalue=5.821040102171811e-96)
(ttest) menslib:mensrights 	 :: Ttest_indResult(statistic=34.54592774181085, pvalue=1.028820763743441e-257)
(ttest) feminism:menslib 	 :: Ttest_indResult(statistic=-20.842168259051917, pvalue=5.821040102171811e-96)
(ttest) feminism:mensrights 	 :: Ttest_indResult(statistic=15.3979712992317, pvalue=2.4036252421619027e-53)
(ttest) mensrights:menslib 	 :: Ttest_indResult(statistic=-34.54592774181085, pvalue=1.028820763743441e-257)
(ttest) mensrights:feminism 	 :: Ttest_indResult(statistic=-15.3979712992317, pvalue=2.4036252421619027e-53)


In [9]:
torch.save({'M': M}, "/Volumes/V'GER/comp_ling/DataScideProjects/convergenceEntropy/data/redditany/three_groups/woman_post_title/sampled-ckpt.pt")

### 2 Conclusions

What's provided here is a mock-up example with real (albeit fractional) data validating that within-group communication exhibits lower entropy than intergroup communication. The methods proposed here can be used in a variety of applications, ranging from quantifying convergence within groups without prior definition of lexical dictionaries to measuring, per person and message by message, convergence with normative group communication styles over time.

#### References

Adams, A., Miles, J., Dunbar, N. E., & Giles, H. (2018). Communication accommodation in text messages: Exploring liking, power, and sex as predictors of textisms. The Journal of Social Psychology, 158(4), 474–490. https://doi.org/10.1080/00224545.2017.1421895

Dale, R., Duran, N. D., & Coco, M. (2018). Dynamic Natural Language Processing with Recurrence Quantification Analysis. ArXiv:1803.07136 [Cs]. http://arxiv.org/abs/1803.07136

de Vries, W., van Cranenburgh, A., & Nissim, M. (2020). What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models. Findings of the Association for Computational Linguistics: EMNLP 2020, 4339–4350.

Palomares, N., Giles, H., Soliz, J., & Gallois, C. (2016). Intergroup Accommodation, Social Categories, and Identities. In H. Giles (Ed.), Communication Accommodation Theory (p. 232).

Rosen, Z. (2022). A BERT’s eye view: A “big-data” framework for assessing language convergence and accommodation in large, many-to-many settings. Journal of Language and Social Psychology, 0261927X2210811.

### Previous Sampling method

In [None]:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np

cos = nn.CosineSimilarity(dim=-1)
s_col = 'subreddit_name'

def vec_from_string(dfi):
    return torch.FloatTensor(
        [
            [
                float(i) for i in vec.replace('[', '').replace(']','').split(', ')
            ] for vec in dfi['vecs'].values
        ]
    )

groups = ['feminism', 'mensrights']

df = pd.read_table("/Volumes/V'GER/comp_ling/DataScideProjects/convergenceEntropy/data/redditany/three_groups/woman_post_title/vecs_feminism-mensrights-menslib-women_post_title.tsv")
print(list(df))
print(df['subreddit_name'].value_counts())
df = df.loc[df[s_col].isin(groups) & df['pattern_found'].values]
df.index=range(len(df))
df['--id'] = df['__id'].values

update_id_dic = {i:idx for idx,i in enumerate(np.unique(df['__id'].values))}
df['__id'] = df['__id'].replace(update_id_dic)

ids = np.unique(df['__id'].values)
comm = np.array([df[s_col].loc[df['__id'].isin([idx])].unique()[0] for idx in ids])

print(list(df))
for subreddit in df['subreddit_name'].unique():
    print('{} \t {}'.format(subreddit, len(df['__id'].loc[df['subreddit_name'].isin([subreddit])].unique())))

Eu = vec_from_string(df)
del df['vecs']
Eu.shape

#Establish parameters for Gaussian
d = torch.distributions.Normal(1, .3, validate_args=False)

# (0) defining groups



# (1) monte carlo procedure
N_permutations, xsize, ysize = 100, 10, 20

sample_sets = np.array([[j for j in np.unique(ids) if j != i] for i in np.unique(ids)])
comm_status = comm[sample_sets]

ids = df['__id'].values

ML, MR = [],[]
for permutation in range(N_permutations):

    xml = np.random.choice(np.unique(ids)[comm==groups[0]], size=(xsize,), replace=False)
    xmr = np.random.choice(np.unique(ids)[comm==groups[1]], size=(xsize,), replace=False)

    idxml = df['__id'].isin(xml).values
    idxmr = df['__id'].isin(xmr).values

    mxml = torch.FloatTensor([df['__id'].values[idxml]==idx for idx in xml])
    mxmr = torch.FloatTensor([df['__id'].values[idxmr]==idx for idx in xmr])

    ysamples = {idx:
        (
        np.random.choice(sample_sets[idx][(comm_status[idx] == groups[0]) ], size=(ysize,), replace=False),
        np.random.choice(sample_sets[idx][(comm_status[idx] == groups[1])], size=(ysize,), replace=False)
        ) for idx in xml.tolist()+xmr.tolist()}

    ysamples ={k: (df['__id'].isin(v[0]).values, df['__id'].isin(v[1]).values) for k,v in ysamples.items()}

    ml, mr = [],[]
    for x in xml:
        xil = cos(
            Eu[ids==x].unsqueeze(1),
            Eu[ysamples[x][0]])
        xir = cos(
            Eu[ids==x].unsqueeze(1),
            Eu[ysamples[x][1]])

        ml += [torch.cat([xil.max(dim=-1).values.unsqueeze(-1),
                          xir.max(dim=-1).values.unsqueeze(-1)], dim=-1)]

    for x in xmr:
        xil = cos(Eu[ids==x].unsqueeze(1), Eu[ysamples[x][0]])
        xir = cos(Eu[ids==x].unsqueeze(1), Eu[ysamples[x][1]])

        mr += [torch.cat([xil.max(dim=-1).values.unsqueeze(-1),
                          xir.max(dim=-1).values.unsqueeze(-1)], dim=-1)]

    ml, mr = torch.cat(ml, dim=0), torch.cat(mr, dim=0)
    ml, mr = d.log_prob(ml), d.log_prob(mr)
    ml, mr = -(torch.exp(ml) * ml), -(torch.exp(mr) * mr)
    ml, mr = (mxml.unsqueeze(-1) * ml).sum(dim=1), (mxmr.unsqueeze(-1) * mr).sum(dim=1)

    ML += [ml]
    MR += [mr]

    print('permutation {}/{}'.format(permutation+1, N_permutations))

ML, MR = torch.cat(ML,dim=0), torch.cat(MR,dim=0)

# Printing means and medians as a rough check of normality
(ML[:,0].mean(), ML[:,0].median()), (ML[:,1].mean(), ML[:,1].median()), (MR[:,0].mean(), MR[:,0].median()), (MR[:,1].mean(), MR[:,1].median())