# Convergence as Entropy Minimization Across Lexico-Semantic Choice

### 1. Communication Accommodation Theory

Lexical alignment and convergence both describe the tendency for members of a group to settle on similar means of discussing a topic.

Similar means can be expressed as a reduction in the entropy between utterances made by group members. As group members A and B come to sound more alike, more of member A's semantic content can be recovered from the lexical items in member B's utterance, because A's message can be better predicted just by listening to B. In other words, because they are using similar language to say the same thing, the predictability of one utterance given the other increases, and thus entropy (i.e. how unpredictable one utterance remains after observing the other) decreases.
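As a minimal illustration of the link between predictability and entropy, the sketch below computes Shannon entropy for two made-up predictive distributions. The distributions and the names `aligned` and `unaligned` are hypothetical and are not taken from the data analysed here.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; zero terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical distributions over which of four words speaker A uses,
# given what speaker B has just said.
aligned = [0.85, 0.05, 0.05, 0.05]    # B's wording makes A's wording easy to guess
unaligned = [0.25, 0.25, 0.25, 0.25]  # B's wording tells us nothing about A's

print(shannon_entropy(aligned))
print(shannon_entropy(unaligned))
```

The more predictable case yields the lower entropy, which is the sense in which convergence is treated as entropy minimization here.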

Similarity in the precise lexico-semantic meaning of two words can be measured using contextual word embeddings. Models like BERT (and its twin RoBERTa) provide contextually informed word embeddings.

The following is a quick analysis attempting to recover Social Identity attributes for speakers based on imputed convergence with another group.

Specifically: recovering political party affiliation from tweets discussing immigrants.

#### 1.1 Generating initial data

In [None]:
from intergroupEntropy.data.redditany.redo_collection_corpus import *

#### 1.2 Word vector representations

$$ E_{xi} = wv(i \in x) $$

In [None]:
'ssh -'
'tmux attach-session -s BERT'
'python3 ./reddit_vecs.py'

#### 1.3 Probability based on word vectors

Based on our understanding of how to test conceptual similarities between individuals' utterances and groups', and to get a grip on our BERT-based method, let's imagine that an interlocutor is playing a kind of language reconstruction game that is directly related to the question posed in our description of the Bernoulli process. The interlocutor is given a single utterance $x$ from an individual, broken up into tokens $xi$. The interlocutor is then given a set of tokens $yj$ drawn from several utterances made by members of some group, $y$. The interlocutor is then asked to take the group's tokens $yj$ and reconstruct an utterance that is as similar in meaning as possible to the sentence $x$. If they can reconstruct a sentence from the group's tokens $yj$ that means something similar to the utterance $x$, this would effectively answer the following question:
"for each token ($xi$) in the sentence $x$, tell me if someone in the sample $y$ used the same word, or a synonym for it, in the same way that it was used in $x$."
Furthermore, in this scenario, reconstructed utterances that are more similar in meaning to the original utterance will have lower entropy. Reconstructed utterances that are either less similar or less intelligible will have higher entropy.

We start our language game by first converting all of the tokens in both $x$ and $y$ to BERT word vectors (Devlin et al. 2019). This allows us to capture similarity between tokens that are semantically similar but are not a 1:1 mapping of the same word. Let $E_{xi}$ be the set of BERT word vectors for each token $i$ in a sentence $x$ and $E_{yj}$ be the set of BERT word vectors for each token $j$ in a sample $y$ of utterances from a group. The equation below shows the process of converting tokens $i \in x$ to word vectors. Tokens $j \in y$ are converted to word vectors via the same process.

$$E_{xi} = BERT(i \in x)$$

The utility of word vector models is that they represent the meaning of words spatially and, surprisingly, accurately. Even in early, contextually uninformed models, words that are semantically similar to one another cluster closer together in word vector space (GloVe: Pennington et al. 2014; Word2Vec: Mikolov et al. 2013; BERT: Devlin et al. 2019). In contextually aware models like BERT and subsequent transformer models, tokens that share a word sense cluster separately from tokens with other word senses. This allows us to make fine-grained distinctions between the different meanings of polysemous words, like the many meanings of "bank", but it also allows us to capture subtle community/group-specific differences in word usage, like differences in the use of the word "slay" (Devlin et al. 2019). In layman's terms, if a word vector represents the meaning of a word as a point in space, words that are more semantically related to one another will be closer to one another. And if those word vectors are generated by a contextually aware model, the closer *the word senses* of two words are to one another, the closer the two word vectors will be in vector space. A popular way to measure the proximity of two word vectors is Cosine Error (CoE), where a CoE value of 0 indicates that the two word vectors point in the same direction in high dimensional space, and a value of 2 means that they are maximally divergent.

Think of finding similar words in word vector space like a game of darts: when comparing a word vector $E_{xi}$ to another word vector $E_{yj}$, a CoE value closer to 0 indicates that if you threw a dart at $E_{xi}$ and missed, you would be more likely to accidentally hit $E_{yj}$.
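A minimal sketch of the CoE measurement follows, assuming CoE is defined as one minus cosine similarity (which matches the 0 to 2 range described above). The toy three-dimensional vectors stand in for real BERT embeddings, and the function name `cosine_error` is an illustrative choice rather than the project's own code.

```python
import math

def cosine_error(u, v):
    """Cosine Error: 1 - cosine similarity, for non-zero vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    coe = 1.0 - dot / (norm_u * norm_v)
    # Clamp to [0, 2] to absorb floating-point rounding at the extremes
    return min(2.0, max(0.0, coe))

# Toy vectors standing in for contextual embeddings (hypothetical values)
e_x = [1.0, 2.0, 0.5]
same_direction = [2.0, 4.0, 1.0]   # a scaled copy of e_x
orthogonal = [2.0, -1.0, 0.0]      # dot product with e_x is zero
opposite = [-1.0, -2.0, -0.5]      # e_x flipped

for name, vec in [("same", same_direction), ("orthogonal", orthogonal), ("opposite", opposite)]:
    print(name, cosine_error(e_x, vec))
```

Because CoE depends only on direction, a scaled copy of a vector is treated as identical to the original.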

Please note however that \textit{proximity} in vector space is different from a probability, and CoE values are just a scaled measurement of proximity, not the probability that two vectors are the same or similar. An additional step is needed to render CoE values as probabilities that can be used as part of a statistical framework. To convert CoE to a probability, we leverage a half-Gaussian distribution, continuous on an interval of 0 to infinity, with two parameters: (1) a location parameter $\mu=0.0$ such that as the CoE value for the comparison of two word vectors approaches 0 we have maximum confidence that the two words mean the same thing, and (2) a scale parameter $\sigma$ that sets a penalty weight for CoE values farther away from 0.
$$P(E_{xi} | E_{yj}) = P_{\mathcal{N}_{[0,\infty)}}\left( CoE(E_{xi},E_{yj}) \bigg|  \mu=0, \sigma \right)$$

Think of $\sigma$ like the accuracy of the dart thrower in our previous example: lower $\sigma$ values correspond to a more accurate thrower, who only hits the token $yj$ when it is very close to $xi$ in word vector space.
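The conversion from CoE to a probability can be sketched as below. This is one possible reading, not necessarily the one used in the implementation: it assumes the probability is the upper-tail (survival) probability of a half-Gaussian with location 0 and scale $\sigma$, so that a CoE of 0 maps to 1. The density of the same distribution would be another reading. The function name and the $\sigma$ values are hypothetical.

```python
import math

def half_gaussian_prob(coe, sigma):
    """Upper-tail probability of a half-Gaussian (location 0, scale sigma) at coe.

    Assumption: P is read as the survival function, which equals
    erfc(coe / (sigma * sqrt(2))) and is 1 when coe == 0.
    """
    return math.erfc(coe / (sigma * math.sqrt(2.0)))

# The same CoE value scored under a strict and a lenient sigma (illustrative values)
coe = 0.3
for sigma in (0.1, 0.3, 0.6):
    print(sigma, half_gaussian_prob(coe, sigma))
```

Smaller values of `sigma` penalise the same CoE more heavily, which corresponds to the more accurate dart thrower in the analogy.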

However, we almost never have a reason to compare any one vector $xi$ from a sentence to every single vector $yj$ from another sentence/distribution. After all, the question we're trying to answer, as described in the previous section, is "for each token ($xi$) in the sentence $x$, I want you to tell me if someone in the sample $y$ used the same word, or a synonym for it, in the same way that it was used in $x$." Based on this, it's better to ask instead how likely it is that a vector $xi$ might show up in any sample $y$ from the cumulative utterances of the group $Y$, conditioned on what we know about the composition of the sample $y$. To do this, we take the probability for the token $xi$ from the sentence $x$ and the token $yj$ from $y$ that has the lowest CoE with $xi$. This effectively replicates the hypothetical study participant in the example given at the top of this section selecting the word that most closely means the same thing as one of the words ($xi$) from the sentence $x$ and trying to use it to create a new utterance that closely matches $x$ in meaning. Furthermore, if nothing in the distribution $y$ is semantically similar to $xi$, or embedded in a context similar to that of $xi$ in $x$, then the minimum CoE value will be high (indicating that the token $xi$ has nothing approximating a similar term or usage in $y$). We thus rewrite the last equation as follows:

$$P(E_{xi} | E_{y}) = P_{\mathcal{N}_{[0,\infty)}} \left( \min_{j} \left(CoE(E_{xi},E_{yj}) \right) \bigg|  \mu=0, \sigma \right)$$
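The nearest-token step can be sketched as follows. As before, this assumes CoE is one minus cosine similarity and reads the half-Gaussian probability as an upper-tail probability. The helper names, the toy vectors, and the value of `sigma` are all hypothetical.

```python
import math

def cosine_error(u, v):
    """1 - cosine similarity for non-zero vectors, clamped to [0, 2]."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return min(2.0, max(0.0, 1.0 - dot / norms))

def half_gaussian_prob(coe, sigma):
    """Assumed reading of P: half-Gaussian upper-tail probability at coe."""
    return math.erfc(coe / (sigma * math.sqrt(2.0)))

def token_prob(e_xi, E_y, sigma):
    """P(E_xi | E_y): score only the group token with the lowest CoE to e_xi."""
    min_coe = min(cosine_error(e_xi, e_yj) for e_yj in E_y)
    return half_gaussian_prob(min_coe, sigma)

e_xi = [1.0, 2.0, 0.5]
# One group sample contains a near-identical usage, the other does not
E_y_close = [[0.0, 1.0, 0.0], [2.0, 4.0, 1.0], [-1.0, 0.0, 0.0]]
E_y_far = [[-1.0, -2.0, -0.5], [2.0, -1.0, 0.0]]

print(token_prob(e_xi, E_y_close, sigma=0.2))
print(token_prob(e_xi, E_y_far, sigma=0.2))
```

Only the best match in the sample matters, so adding unrelated tokens to `E_y` never lowers the score for `e_xi`.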

#### 1.4 Entropy across sentences using probability based on their component word vectors

Meanwhile, the probability that an individual's message $x$ exhibits convergence with the messaging habits of a group can be calculated by finding the entropy for $x$ and an imputed sample from the group, $y \in \lbrace Y | Y_g \rbrace$.

$$H( x ; y ) = -\sum_i P(E_{xi}|E_{y}) \log P(E_{xi}|E_{y})$$
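A minimal sketch of this sum, given per-token probabilities that have already been computed. The function name, the `eps` guard against `log(0)`, and the example probabilities are illustrative assumptions, not values from the corpus.

```python
import math

def sentence_entropy(token_probs, eps=1e-12):
    """H(x; y) = -sum_i p_i * log(p_i) over the tokens of sentence x."""
    return -sum(p * math.log(max(p, eps)) for p in token_probs)

# Hypothetical per-token probabilities P(E_xi | E_y) for one sentence,
# scored against two different group samples
well_matched = [0.95, 0.90, 0.99]
poorly_matched = [0.50, 0.40, 0.60]

print(sentence_entropy(well_matched))
print(sentence_entropy(poorly_matched))
```

In this toy case the sentence whose tokens are well matched in the group sample receives the lower entropy.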

### 2. Implementation

In [None]:
'ssh -'
'tmux attach-session -s BERT'
'python3 ./indH.py'

### 3. Assessment

First, I loaded the data from the checkpoint described above.

In [1]:
import torch
import pandas as pd
import numpy as np
from SIS.methods.reddit_feminism.stitch_data import get_stitched_data

data_path = "/Users/zacharyrosen/airlock/d/convergence/feminism-menslib-mensrights/women/summaries/menslib/2/posteriors-MensLib.pt"
ckpt = torch.load(data_path)

total_H = ckpt['M']
_ids = ckpt['labels']

# Reorder axes to (sentence, group, trial); see the shape printed below
total_H = total_H.transpose(0,1).transpose(1,2)

groups = ['Feminism', 'MensRights', 'MensLib']

In [2]:
total_H.shape

torch.Size([463, 3, 200])

#### 3.1 Assessing entropic differences per each sentence in the corpus

In [3]:
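from scipy.stats import ttest_ind as ttest

# For every sentence i, run an independent two-sample t-test comparing the
# entropies reconstructed from MensLib samples (index 2 in `groups`) against
# the entropies reconstructed from each group j, ignoring NaN trials.
# A negative statistic means the MensLib-based mean entropy is the lower one.
pvalue, statistic = [], []
for i in range(total_H.shape[0]):
    r = [
        ttest(
            total_H[i, 2][~total_H[i, 2].isnan()],
            total_H[i, j][~total_H[i, j].isnan()]
        ) for j in range(total_H.shape[1])]
    pvalue += [np.array([ri.pvalue for ri in r])]
    statistic += [np.array([ri.statistic for ri in r])]

# Stack into arrays of shape (n_sentences, n_groups)
pvalue, statistic = np.array(pvalue), np.array(statistic)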

In [4]:
minima = [
    torch.cat(
        [
            total_H[i,j][~total_H[i,j].isnan()].mean(axis=-1).view(1,-1) for j in range(len(groups))
        ],
        dim=-1
    ) for i in range(total_H.shape[0])
]
minima = torch.cat(minima, dim=0).argmin(dim=-1)

pct_data, confusion_data, means_data = [], [], []

for i, g in enumerate(groups):

    p_res = (pvalue[:,i] < .025)
    mu_res = (statistic[:,i] < 0)
    res =  p_res & mu_res

    pct_data += [res.mean(axis=0)]
    confusion_data += [(p_res & (minima==i).numpy()).sum(axis=0)]
    means_data += [mu_res.mean(axis=0)]

results = pd.DataFrame()
results['cond'] = groups
results['results'] = np.array(pct_data)

mean_results = pd.DataFrame()
mean_results['cond'] = groups
mean_results['results'] = np.array(means_data)

confusion = pd.DataFrame()
confusion['cond'] = groups
confusion['results'] = np.array(confusion_data)

To interpret the results: each row is a comparison group, and a higher score indicates that more r/MensLib examples passed the test, i.e. had lower entropy when reconstructed from r/MensLib samples than when reconstructed from samples of the group in that row.

In [5]:
results

Unnamed: 0,cond,results
0,Feminism,0.965443
1,MensRights,0.969762
2,MensLib,0.0


In [6]:
mean_results

Unnamed: 0,cond,results
0,Feminism,0.984881
1,MensRights,0.978402
2,MensLib,0.0


In [7]:
confusion

Unnamed: 0,cond,results
0,Feminism,6
1,MensRights,3
2,MensLib,0


#### 3.2 By entire comment

To calculate the significance for an entire comment we summed the entropy for all the sentences that comprised the comment for each trial number in the data. We then repeated the same testing procedure as performed for the sentence level analysis.

In [8]:
comment_H = [total_H[_ids['commentId'].isin([c]).values].sum(axis=0).unsqueeze(0) for c in _ids['commentId'].unique()]
comment_H = torch.cat(comment_H,dim=0)

In [9]:
comment_H.shape

torch.Size([103, 3, 200])

In [10]:
from scipy.stats import ttest_ind as ttest

pvalue, statistic = [], []
for i in range(comment_H.shape[0]):
    r = [
        ttest(
            comment_H[i, 2][~comment_H[i, 2].isnan()],
            comment_H[i, j][~comment_H[i, j].isnan()]
        ) for j in range(comment_H.shape[1])]
    pvalue += [np.array([ri.pvalue for ri in r])]
    statistic += [np.array([ri.statistic for ri in r])]

pvalue, statistic = np.array(pvalue), np.array(statistic)

In [11]:
minima = [
    torch.cat(
        [
            comment_H[i,j][~comment_H[i,j].isnan()].mean(axis=-1).view(1,-1) for j in range(len(groups))
        ],
        dim=-1
    ) for i in range(comment_H.shape[0])
]
minima = torch.cat(minima, dim=0).argmin(dim=-1)
pct_data, confusion_data, means_data = [], [], []

for i, g in enumerate(groups):

    p_res = (pvalue[:,i] < .025)
    mu_res = (statistic[:,i] < 0)
    res =  p_res & mu_res

    pct_data += [res.mean(axis=0)]
    confusion_data += [(p_res & (minima==i).numpy()).sum(axis=0)]
    means_data += [mu_res.mean(axis=0)]

results = pd.DataFrame()
results['cond'] = groups
results['results'] = np.array(pct_data)

mean_results = pd.DataFrame()
mean_results['cond'] = groups
mean_results['results'] = np.array(means_data)

confusion = pd.DataFrame()
confusion['cond'] = groups
confusion['results'] = np.array(confusion_data)

In [12]:
results

Unnamed: 0,cond,results
0,Feminism,0.970874
1,MensRights,0.980583
2,MensLib,0.0


In [13]:
mean_results

Unnamed: 0,cond,results
0,Feminism,0.980583
1,MensRights,0.990291
2,MensLib,0.0


In [14]:
confusion

Unnamed: 0,cond,results
0,Feminism,2
1,MensRights,0
2,MensLib,0


#### 3.3 Analysis of Texts in Confusion Matrix

In [15]:
__ids = _ids.drop_duplicates(subset=['commentId']).copy()
__ids.index = range(len(__ids))

texts = pd.read_table("/Volumes/ROY/comp_ling/datasci/intergroupEntropy/data/redditany/corpus_with_author_data.tsv", lineterminator='\n')
texts = texts.loc[~texts['body'].isna()]

First, we look at the sentences from comments confused as MensLib -> Feminism:

In [16]:
confused_indexes = __ids['commentId'].loc[(pvalue[:,0] < .025) & (minima == 0).numpy()]

texts[['author','subId','commentId','body']].loc[texts['commentId'].isin(confused_indexes)]

Unnamed: 0,author,subId,commentId,body
25159,pcapdata,une854,i894prn,i agree
25160,pcapdata,une854,i894prn,there’s a meme phrase criticizing that approa...
25238,ladybadcrumble,une854,i88mz31,oh seriously
25239,ladybadcrumble,une854,i88mz31,that rules
25240,ladybadcrumble,une854,i88mz31,i reported someone for that once and never hea...


In [17]:
confused_indexes = __ids['commentId'].loc[(pvalue[:,1] < .025) & (minima == 1).numpy()]

texts[['author','subId','commentId','body']].loc[texts['commentId'].isin(confused_indexes)]

Unnamed: 0,author,subId,commentId,body


#### 3.4 Comparison of MensLib from Feminism vs. MensLib from MensRights

In [18]:
pvalue, statistic = [], []
for i in range(comment_H.shape[0]):
    r = [
        ttest(
            comment_H[i, 0][~comment_H[i, 0].isnan()],
            comment_H[i, 1][~comment_H[i, 1].isnan()]
        )
    ]
    pvalue += [np.array([ri.pvalue for ri in r])]
    statistic += [np.array([ri.statistic for ri in r])]

pvalue, statistic = np.array(pvalue), np.array(statistic)

In [19]:
results = pd.DataFrame(np.array(['Feminism', 'MensRights']).reshape(-1,1), columns=['group'])

results['% similarity'] = [((pvalue < .025) & (statistic < 0)).mean(), ((pvalue < .025) & (statistic > 0)).mean()]

results['% inconclusive'] = [((pvalue > .025) & (statistic < 0)).mean(), ((pvalue > .025) & (statistic > 0)).mean()]

In [20]:
results

Unnamed: 0,group,% similarity,% inconclusive
0,Feminism,0.679612,0.116505
1,MensRights,0.126214,0.07767


### 4. Conclusions

The results are interesting. I break them down below by comparison with each of the other subreddits.

[_**r/Feminism**_] How much can you learn about comments from r/MensLib by reading comments made to r/Feminism, compared to reading comments from the same subreddit? It turns out, not nearly as much as one might initially think. At the sentence level, 96.5\% of sentences from r/MensLib have significantly lower entropy when compared to other comments from r/MensLib than when compared to comments from r/Feminism. In 98.5\% of cases there is, at a minimum, a lower mean entropy when comparing sentences from r/MensLib to other comments from r/MensLib as opposed to comments from r/Feminism. This discrepancy is a little wider when looking at comments as opposed to just sentences: 97.1\% of comments from r/MensLib have significantly lower entropy when compared to other comments from r/MensLib than when compared to comments from r/Feminism. In 98.1\% of cases there is, at a minimum, a lower mean entropy when comparing comments from r/MensLib to other comments from r/MensLib as opposed to comments from r/Feminism.

[_**r/MensRights**_] How much can you learn about comments from r/MensLib by reading comments made to r/MensRights, compared to reading comments from the same subreddit? Much like when compared to r/Feminism, very little. At the sentence level, 97.0\% of sentences from r/MensLib have significantly lower entropy when compared to other comments from r/MensLib than when compared to comments from r/MensRights. In 97.8\% of cases there is, at a minimum, a lower mean entropy when comparing sentences from r/MensLib to other comments from r/MensLib as opposed to comments from r/MensRights. This discrepancy is a little wider when looking at comments as opposed to just sentences: 98.1\% of comments from r/MensLib have significantly lower entropy when compared to other comments from r/MensLib than when compared to comments from r/MensRights. In 99.0\% of cases there is, at a minimum, a lower mean entropy when comparing comments from r/MensLib to other comments from r/MensLib as opposed to comments from r/MensRights.

[_**Confusion matrix**_] We now ask how often a message from r/MensLib might be confused for one from either r/Feminism or r/MensRights (i.e. $p<.025$ and the mean entropy from the outgroup is lower than the mean entropy from the ingroup).

At the sentence level, in 6 cases utterances made by members of r/MensLib have statistically significant, greater similarity with content generated in r/Feminism and in 3 cases utterances made by members of r/MensLib have statistically significant, greater similarity with content generated in r/MensRights.

At the comment level, in 2 cases utterances made by members of r/MensLib have statistically significant, greater similarity with content generated in r/Feminism. No instances in which an entire comment had statistically significant, greater similarity with content generated in r/MensRights were found.

[_**Similarity to r/Feminism vs. r/MensRights**_] We compared the regeneration of comments from r/MensLib using samples from the outgroup r/Feminism (outgroup 1) to using samples from the outgroup r/MensRights (outgroup 2). In 68.0\% of cases, entropy for reconstructing messages from r/MensLib from samples pulled from outgroup 1 is significantly lower than when reconstructing the same messages from samples pulled from outgroup 2. A further 11.7\% of cases lean the same way but are inconclusive (statistical significance was not reached). In 12.6\% of cases, entropy for reconstructing messages from r/MensLib from samples pulled from outgroup 2 is significantly lower than when reconstructing the same messages from samples pulled from outgroup 1. A further 7.8\% of cases lean the same way but are inconclusive (statistical significance was not reached). We conclude that whilst messages from r/MensLib show signs of high intragroup convergence, there is still greater similarity overall with other Feminist groups than with other groups from the manosphere.
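As a rough cross-check on these proportions, they can be converted back into comment counts. The sketch assumes each proportion is a mean over all 103 comments (the first dimension of `comment_H` reported earlier); the dictionary keys are labels chosen here for illustration.

```python
n_comments = 103  # first dimension of comment_H, as reported above

# Proportions copied from the results table in section 3.4
proportions = {
    "closer to Feminism (significant)": 0.679612,
    "closer to Feminism (inconclusive)": 0.116505,
    "closer to MensRights (significant)": 0.126214,
    "closer to MensRights (inconclusive)": 0.07767,
}

# Recover whole-number counts by rounding proportion * n
counts = {label: round(p * n_comments) for label, p in proportions.items()}

for label, count in counts.items():
    print(f"{label}: {count}")
print("total:", sum(counts.values()))
```

Under that assumption the four categories account for every comment, so no comment falls outside the comparison.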


Bear in mind that there are, at the comment level, no cases in which one could confuse a comment from r/MensLib with a comment from r/MensRights. There are, however, two instances in which one could confuse a comment from r/MensLib with a comment from r/Feminism outright.




#### References

Adams, A., Miles, J., Dunbar, N. E., & Giles, H. (2018). Communication accommodation in text messages: Exploring liking, power, and sex as predictors of textisms. The Journal of Social Psychology, 158(4), 474–490. https://doi.org/10.1080/00224545.2017.1421895

Dale, R., Duran, N. D., & Coco, M. (2018). Dynamic Natural Language Processing with Recurrence Quantification Analysis. ArXiv:1803.07136 [Cs]. http://arxiv.org/abs/1803.07136

de Vries, W., van Cranenburgh, A., & Nissim, M. (2020). What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models. Findings of the Association for Computational Linguistics: EMNLP 2020, 4339–4350

Palomares, N., Giles, H., Soliz, J., & Gallois, C. (2016). Intergroup Accommodation, Social Categories, and Identities. In H. Giles (Ed.), Communication Accommodation Theory (p. 232).

Rosen, Z. (2022). A BERT’s eye view: A “big-data” framework for assessing language convergence and accommodation in large, many-to-many settings. Journal of Language and Social Psychology, 0261927X2210811.