# Convergence as Entropy Minimization Across Lexico-Semantic Choice

### 1. Communication Accommodation Theory

Lexical alignment and convergence both describe the tendency for members of a group to settle on similar means of discussing a topic.

Similar means can be expressed as a reduction in the entropy between utterances made by group members. As group members A and B sound more similar to one another, you can recover more of member A's semantic content from the lexical items in member B's utterance, because you can better predict member A's message just by listening to member B. In other words, because they are using similar language to say the same thing, the predictability of one utterance given the other increases, and thus entropy (i.e., how unpredictable one utterance is given observations of the other) decreases.
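As a rough illustration of the link between predictability and entropy, the sketch below compares two made-up distributions over which of four words speaker A uses after we have heard speaker B. The distributions and variable names are hypothetical and are not taken from the data analyzed later.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical distributions over A's word choice, given what B said.
unaligned = [0.25, 0.25, 0.25, 0.25]  # B's wording tells us nothing about A's
aligned = [0.85, 0.05, 0.05, 0.05]    # B's wording makes A's choice predictable

h_unaligned = shannon_entropy(unaligned)
h_aligned = shannon_entropy(aligned)

print(f"unaligned: {h_unaligned:.3f} nats, aligned: {h_aligned:.3f} nats")
```

The aligned case yields the lower entropy, which is the sense in which convergence is treated as entropy minimization here.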

Similarity in the precise lexico-semantic meaning of two words can be measured using contextual word embeddings. Models like BERT (and its twin RoBERTa) provide contextually informed word embeddings.

The following is a quick analysis attempting to recover Social Identity attributes for speakers based on imputed convergence with another group.

Specifically: recovering political party affiliation from tweets discussing immigrants.

#### 1.1 Generating initial data

In [None]:
from intergroupEntropy.data.redditany.redo_collection_corpus import *

#### 1.2 Word vector representations

$$ E_{xi} = wv(i \in x) $$

In [None]:
'ssh -'
'tmux attach-session -s BERT'
'python3 ./reddit_vecs.py'

#### 1.3 Probability based on word vectors

Based on our understanding of how to test conceptual similarities between individuals' utterances and groups', and to get a grip on our BERT-based method, let's imagine that an interlocutor is playing a kind of language reconstruction game that is directly related to the question posed in our description of the Bernoulli process. The interlocutor is given a single utterance from an individual, $x$, broken up into tokens $xi$. The interlocutor is then given a set of tokens $yj$ drawn from several utterances made by a number of members of some group. The interlocutor is then asked to take the group's tokens $yj$ and reconstruct an utterance whose meaning is as similar as possible to that of the sentence $x$. If they can reconstruct a sentence from the group's tokens $yj$ that means something similar to the utterance $x$, this would effectively answer the following question:
"for each token ($xi$) in the sentence $x$, tell me if someone in the sample $y$ used the same word, or a synonym for it, in the same way that it was used in $x$."
In this scenario, reconstructed utterances that are more similar in meaning to the original utterance will have lower entropy. Reconstructed utterances that are either less similar or less intelligible will have higher entropy.

We start our language game by first converting all of the tokens in both $x$ and $y$ to BERT word vectors (Devlin et al. 2019). This allows us to capture similarity between tokens that are semantically similar but are not a 1:1 mapping of the same word. Let $E_{xi}$ be the set of BERT word vectors for each token $i$ in a sentence $x$ and $E_{yj}$ be the set of BERT word vectors for each token $j$ in a sample $y$ of utterances from a group. The equation below shows the process of converting tokens $i \in x$ to word vectors. Tokens $j \in y$ are converted to word vectors via the same process.

$$E_{xi} = BERT(i \in x)$$

The utility of word vector models is that they represent the meaning of words spatially and, surprisingly, accurately. Even in early, contextually uninformed models, words that are semantically similar to one another cluster closer together in word vector space (GloVe: Pennington et al. 2014; Word2Vec: Mikolov et al. 2013; BERT: Devlin et al. 2019). In contextually aware models like BERT and later transformer models, words that have similar word senses cluster separately from other word senses. This allows us to make fine-grained distinctions between the different meanings of polysemous words, like the many meanings of "bank", but it also allows us to capture subtle community- or group-specific differences in word usage, like differences in the use of the word "slay" (Devlin et al. 2019). In layman's terms, if a word vector represents the meaning of a word as a point in space, words that are more semantically related to one another will be closer to one another. And if those word vectors are generated by a contextually aware model, the closer *the word senses* of two words are to one another, the closer the two word vectors will be in vector space. A popular way to measure the proximity of two word vectors is Cosine Error (CoE), where a CoE value of 0 indicates that the vectors for two words effectively lie on top of one another (they point in the same direction in high-dimensional space), and 2 means that they are maximally divergent.

Think of finding similar words in word vector space like a game of darts, where CoE values that are closer to 0 when comparing a word vector $E_{xi}$ to another word vector $E_{yj}$ indicate that if you threw a dart at $E_{xi}$ you are more likely to accidentally hit the word vector $E_{yj}$ if you miss.
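To make the 0-to-2 range concrete, here is a minimal sketch of CoE, assuming it is defined as one minus cosine similarity (this definition is consistent with the range described above, but the exact formula is an assumption). The function name `cosine_error` and the toy vectors are illustrative only.

```python
import numpy as np

def cosine_error(u, v):
    """Cosine error, assumed here to be 1 - cosine similarity (range 0 to 2)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])

same_direction = cosine_error(a, 2 * a)             # parallel vectors
orthogonal = cosine_error([1.0, 0.0], [0.0, 1.0])   # unrelated directions
opposite = cosine_error(a, -a)                      # maximally divergent

print(same_direction, orthogonal, opposite)
```

Note that CoE depends only on direction, so two vectors of different lengths that point the same way still score 0.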

Please note, however, that *proximity* in vector space is different from a probability, and CoE values are just a scaled measurement of proximity, not the probability that two vectors are the same or similar. An additional step is needed to render CoE values as probabilities that can be used as part of a statistical framework. To convert CoE to a probability, we leverage a half-Gaussian distribution, continuous on an interval of 0 to infinity, with two parameters: (1) a location parameter $\mu=0.0$ such that as the CoE value for the comparison of two word vectors approaches 0 we have maximum confidence that the two words mean the same thing, and (2) a scale parameter $\sigma$ that sets a penalty weight for CoE values farther away from 0.
$$P(E_{xi} | E_{yj}) = P_{\mathcal{N}_{[0,\infty)}}\left( CoE(E_{xi},E_{yj}) \bigg|  \mu=0, \sigma \right)$$

Think of $\sigma$ like the accuracy of the dart thrower in our previous example, where lower $\sigma$ values equate to the dart thrower only hitting a word/token $xi$ if it is very close to $yj$ in word vector space.
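The effect of $\sigma$ can be sketched as follows. The text does not say whether $P$ denotes the raw half-Gaussian density or a normalized value, so this example shows the density and, as an explicit assumption, a version rescaled by its value at 0 so that the result lies in (0, 1]. Both function names are hypothetical.

```python
import math

def half_gaussian_density(coe, sigma):
    """Density of a half-Gaussian with location 0 and scale sigma, for coe >= 0."""
    return math.sqrt(2.0 / math.pi) / sigma * math.exp(-coe**2 / (2.0 * sigma**2))

def half_gaussian_score(coe, sigma):
    """ASSUMPTION: density divided by its value at 0, so the score lies in (0, 1]."""
    return math.exp(-coe**2 / (2.0 * sigma**2))

# A strict "dart thrower" (small sigma) penalizes the same CoE far more heavily.
strict = half_gaussian_score(0.5, sigma=0.1)
lenient = half_gaussian_score(0.5, sigma=0.5)

print(f"strict: {strict:.6f}, lenient: {lenient:.6f}")
```

The rescaled form is convenient when the values are later fed into an entropy term, since the raw density can exceed 1 for small $\sigma$.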

However, we almost never have a reason to compare a single vector $xi$ from a sentence to every single vector $yj$ from another sentence or distribution. After all, the question we are trying to answer, as described at the start of this section, is "for each token ($xi$) in the sentence $x$, I want you to tell me if someone in the sample $y$ used the same word, or a synonym for it, in the same way that it was used in $x$." Based on this, it is better to ask instead how likely a vector $xi$ is to show up in any sample $y$ from the cumulative utterances of the group $Y$, conditioned on what we know about the composition of the sample $y$. To do this, we take the probability for a token $xi$ from the sentence $x$ and the token $yj$ from $y$ that has the lowest CoE with $xi$. This effectively replicates the hypothetical participant from the start of this section selecting the word that most closely matches the meaning of one of the words ($xi$) in the sentence $x$ and trying to use it to create a new utterance that closely matches $x$ in meaning. Furthermore, if nothing in the distribution $y$ is semantically similar to $xi$, or embedded in a context similar to that of $xi$ in $x$, then the minimum CoE value will be high (indicating that the token $xi$ has nothing approximating a similar term or usage in $y$). We thus rewrite the previous equation as follows:

$$P(E_{xi} | E_{y}) = P_{\mathcal{N}_{[0,\infty)}} \left( \min_{j} \left(CoE(E_{xi},E_{yj}) \right) \bigg|  \mu=0, \sigma \right)$$
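The minimum over group tokens can be sketched with toy vectors as below. This is not the notebook's implementation: it assumes CoE is one minus cosine similarity, assumes the half-Gaussian value is rescaled to lie in (0, 1], and uses a made-up default `sigma=0.3`. The function names are hypothetical.

```python
import numpy as np

def pairwise_cosine_error(E_x, E_y):
    """CoE between every row of E_x (tokens of x) and every row of E_y (tokens of y)."""
    x_unit = E_x / np.linalg.norm(E_x, axis=1, keepdims=True)
    y_unit = E_y / np.linalg.norm(E_y, axis=1, keepdims=True)
    return 1.0 - x_unit @ y_unit.T  # shape: (tokens in x, tokens in y)

def token_probabilities(E_x, E_y, sigma=0.3):
    """For each token in x, score its closest match in y (smallest CoE)."""
    min_coe = pairwise_cosine_error(E_x, E_y).min(axis=1)
    # ASSUMPTION: half-Gaussian density rescaled by its value at 0.
    return np.exp(-min_coe**2 / (2.0 * sigma**2))

# Toy 2-d "embeddings": two tokens in x, three tokens in the group sample y.
E_x = np.array([[1.0, 0.0], [0.0, 1.0]])
E_y = np.array([[2.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])

min_coe = pairwise_cosine_error(E_x, E_y).min(axis=1)
p = token_probabilities(E_x, E_y)
print(min_coe, p)
```

The first token of `x` has an exact directional match in `y`, so its minimum CoE is 0; the second token's best match is only partial, so its score is lower.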

#### 1.4 Entropy across sentences using probability based on their component word vectors

The extent to which an individual's message $x$ converges with the messaging habits of a group can then be quantified by computing the entropy for $x$ and an imputed sample from the group, $y \in \lbrace Y | Y_g \rbrace$.

$$H( x ; y ) = -\sum_i P(E_{xi}|E_{y}) \log P(E_{xi}|E_{y})$$
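The sum above can be sketched directly from per-token probabilities. The probability values below are invented for illustration, and the clipping constant `eps` is an assumption added only to avoid taking the logarithm of 0.

```python
import numpy as np

def sentence_entropy(token_probs, eps=1e-12):
    """H(x; y) = -sum_i p_i * log(p_i), over the tokens of sentence x."""
    p = np.clip(np.asarray(token_probs, dtype=float), eps, 1.0)
    return float(-(p * np.log(p)).sum())

# Hypothetical per-token probabilities P(E_xi | E_y) for a three-token sentence.
well_matched = sentence_entropy([0.99, 0.98, 0.97])    # close matches in the group sample
poorly_matched = sentence_entropy([0.6, 0.5, 0.4])     # weaker matches

print(f"well matched: {well_matched:.4f}, poorly matched: {poorly_matched:.4f}")
```

One property worth keeping in mind: each term $-p \log p$ peaks at $p = 1/e$ and shrinks again as $p$ approaches 0, so the comparison above holds for probabilities in the moderate-to-high range shown here.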

### 2. Implementation

In [None]:
'ssh -'
'tmux attach-session -s BERT'
'python3 ./indH.py'

### 3. Assessment

First, I loaded the data from the checkpoint described above.

In [18]:
import torch
import pandas as pd
import numpy as np
from SIS.methods.reddit_feminism.stitch_data import get_stitched_data

# data_path = "/Users/zacharyrosen/Desktop/airlock/d/convergence/feminism-menslib-mensrights/women/summaries/posteriors-Feminism.pt"
# ckpt = torch.load(data_path)

data_path = "/Users/zacharyrosen/airlock/d/convergence/feminism-menslib-mensrights/women/summaries/feminism/"
ckpt = get_stitched_data(data_path)

total_H = ckpt['M']
_ids = ckpt['labels']

total_H = total_H.transpose(0,1).transpose(1,2)

groups = ['Feminism', 'MensRights', 'MensLib']

In [19]:
total_H.shape

torch.Size([935, 3, 200])

#### 3.1 Assessing entropic differences per each sentence in the corpus

In [20]:
from scipy.stats import ttest_ind as ttest

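pvalue, statistic = [], []
for i in range(total_H.shape[0]):
    # Compare the trials for group 0 (r/Feminism) against the trials for each
    # group j, dropping NaN trials before running the two-sample t-test.
    r = [
        ttest(
            total_H[i, 0][~total_H[i, 0].isnan()],
            total_H[i, j][~total_H[i, j].isnan()]
        ) for j in range(total_H.shape[1])]
    pvalue += [np.array([ri.pvalue for ri in r])]
    statistic += [np.array([ri.statistic for ri in r])]

# Shape (n_sentences, n_groups): one p-value and t-statistic per sentence and group.
pvalue, statistic = np.array(pvalue), np.array(statistic)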

In [21]:
minima = [
    torch.cat(
        [
            total_H[i,j][~total_H[i,j].isnan()].mean(axis=-1).view(1,-1) for j in range(len(groups))
        ],
        dim=-1
    ) for i in range(total_H.shape[0])
]
minima = torch.cat(minima, dim=0).argmin(dim=-1)

pct_data, confusion_data, means_data = [], [], []

for i, g in enumerate(groups):

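    p_res = (pvalue[:,i] < .025)    # difference from the r/Feminism baseline is significant
    mu_res = (statistic[:,i] < 0)   # baseline (r/Feminism) entropy is lower on average
    res =  p_res & mu_res           # both conditions hold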

    pct_data += [res.mean(axis=0)]
    confusion_data += [(p_res & (minima==i).numpy()).sum(axis=0)]
    means_data += [mu_res.mean(axis=0)]

results = pd.DataFrame()
results['cond'] = groups
results['results'] = np.array(pct_data)

mean_results = pd.DataFrame()
mean_results['cond'] = groups
mean_results['results'] = np.array(means_data)

confusion = pd.DataFrame()
confusion['cond'] = groups
confusion['results'] = np.array(confusion_data)

To interpret the results: each row names the group whose comments were used for reconstruction, and a higher score means that a larger share of r/Feminism examples passed the test when their reconstruction from r/Feminism comments was compared with their reconstruction from that group's comments.
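The pass/fail rule used in the cells above can be illustrated for a single sentence on synthetic data. The entropy values here are randomly generated and the means and spreads are made up; only the decision rule (negative t-statistic and p-value below .025) mirrors the code above.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic entropies for ONE sentence over 200 trials (values are invented).
in_group = rng.normal(loc=1.0, scale=0.2, size=200)   # reconstructed from its own group
out_group = rng.normal(loc=1.5, scale=0.2, size=200)  # reconstructed from another group

res = ttest_ind(in_group, out_group)

lower_mean = res.statistic < 0     # in-group entropy is lower on average
significant = res.pvalue < 0.025   # same threshold as in the cells above
passed = bool(lower_mean and significant)

print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3g}, passed = {passed}")
```

In the real data, NaN trials are dropped before the test; the synthetic arrays here contain none, so that step is omitted.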

In [22]:
results

| | cond | results |
|---|---|---|
| 0 | Feminism | 0.0 |
| 1 | MensRights | 0.704813 |
| 2 | MensLib | 0.293048 |


In [23]:
mean_results

| | cond | results |
|---|---|---|
| 0 | Feminism | 0.0 |
| 1 | MensRights | 0.850267 |
| 2 | MensLib | 0.354011 |


In [24]:
confusion

| | cond | results |
|---|---|---|
| 0 | Feminism | 0 |
| 1 | MensRights | 28 |
| 2 | MensLib | 539 |


#### 3.2 By entire comment

To calculate the significance for an entire comment we summed the entropy for all the sentences that comprised the comment for each trial number in the data. We then repeated the same testing procedure as performed for the sentence level analysis.

In [25]:
comment_H = [total_H[_ids['commentId'].isin([c]).values].sum(axis=0).unsqueeze(0) for c in _ids['commentId'].unique()]
comment_H = torch.cat(comment_H,dim=0)

In [26]:
comment_H.shape

torch.Size([299, 3, 200])
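The aggregation step above, which collapses 935 sentences into 299 comments, can be illustrated on toy data. This sketch uses NumPy rather than the torch and pandas objects in the notebook, and the array sizes and comment IDs are invented.

```python
import numpy as np

# Toy stand-ins: 5 sentences, 3 groups, 4 trials (the real tensor is 935 x 3 x 200).
sentence_H = np.arange(5 * 3 * 4, dtype=float).reshape(5, 3, 4)

# Hypothetical commentId for each sentence; sentences 0-1 share a comment, as do 3-4.
comment_ids = np.array(["c1", "c1", "c2", "c3", "c3"])

# Keep comment IDs in order of first appearance, then sum the sentences of each comment.
unique_ids = list(dict.fromkeys(comment_ids))
comment_H = np.stack(
    [sentence_H[comment_ids == c].sum(axis=0) for c in unique_ids]
)

print(comment_H.shape)
```

Each comment ends up with one entropy value per group and trial, so the same t-test procedure can be reused unchanged at the comment level.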

In [27]:
from scipy.stats import ttest_ind as ttest

pvalue, statistic = [], []
for i in range(comment_H.shape[0]):
    r = [
        ttest(
            comment_H[i, 0][~comment_H[i, 0].isnan()],
            comment_H[i, j][~comment_H[i, j].isnan()]
        ) for j in range(comment_H.shape[1])]
    pvalue += [np.array([ri.pvalue for ri in r])]
    statistic += [np.array([ri.statistic for ri in r])]

pvalue, statistic = np.array(pvalue), np.array(statistic)

In [28]:
minima = [
    torch.cat(
        [
            comment_H[i,j][~comment_H[i,j].isnan()].mean(axis=-1).view(1,-1) for j in range(len(groups))
        ],
        dim=-1
    ) for i in range(comment_H.shape[0])
]
minima = torch.cat(minima, dim=0).argmin(dim=-1)

pct_data, confusion_data, means_data = [], [], []

for i, g in enumerate(groups):

    p_res = (pvalue[:,i] < .025)
    mu_res = (statistic[:,i] < 0)
    res =  p_res & mu_res

    pct_data += [res.mean(axis=0)]
    confusion_data += [(p_res & (minima==i).numpy()).sum(axis=0)]
    means_data += [mu_res.mean(axis=0)]

results = pd.DataFrame()
results['cond'] = groups
results['results'] = np.array(pct_data)

mean_results = pd.DataFrame()
mean_results['cond'] = groups
mean_results['results'] = np.array(means_data)

confusion = pd.DataFrame()
confusion['cond'] = groups
confusion['results'] = np.array(confusion_data)

In [29]:
results

| | cond | results |
|---|---|---|
| 0 | Feminism | 0.0 |
| 1 | MensRights | 0.80602 |
| 2 | MensLib | 0.434783 |


In [30]:
mean_results

| | cond | results |
|---|---|---|
| 0 | Feminism | 0.0 |
| 1 | MensRights | 0.869565 |
| 2 | MensLib | 0.468227 |


In [31]:
confusion

| | cond | results |
|---|---|---|
| 0 | Feminism | 0 |
| 1 | MensRights | 17 |
| 2 | MensLib | 140 |


#### 3.3 Analysis of Texts in Confusion Matrix

In [32]:
__ids = _ids.drop_duplicates(subset=['commentId']).copy()
__ids.index = range(len(__ids))

texts = pd.read_table("/Volumes/ROY/comp_ling/datasci/intergroupEntropy/data/redditany/corpus_with_author_data.tsv", lineterminator='\n')
texts = texts.loc[~texts['body'].isna()]

First, we look at the confused sentences for Feminism -> MensRights

In [33]:
confused_indexes = __ids['commentId'].loc[(pvalue[:,1] < .025) & (minima == 1).numpy()]

texts[['author','subId','commentId','body']].loc[texts['commentId'].isin(confused_indexes)]

| | author | subId | commentId | body |
|---|---|---|---|---|
| 212 | \_db\_ | uschn9 | i94hpx3 | patriarchy dies hard |
| 566 | RoswalienMath | ulj8z3 | i7ya9zb | \*might |
| 690 | dusktrail | ulj8z3 | i7won7h | what do you think |
| 1152 | shimmerangels | ur0v5s | i8wgknu | no words |
| 1197 | nona01 | ur0v5s | i8x5nnt | a bit extreme isn't it |
| 1198 | nona01 | ur0v5s | i8x5nnt | edit: /s |
| 1205 | stayutofwomnbusiness | ur0v5s | i8w3vqm | its all men |
| 1285 | bookluvr83 | ukgnfg | i7ph8st | ya'll quada |
| 1334 | Vanilla3K | upgcx5 | i8kwai0 | the fucking justice system i swear |
| 1352 | rougewitch | upgcx5 | i8ohncz | men protecting men |


Then we look at the confused sentences for Feminism -> MensLib

In [34]:
confused_indexes = __ids['commentId'].loc[(pvalue[:,2] < .025) & (minima == 2).numpy()]

texts[['author','subId','commentId','body']].loc[texts['commentId'].isin(confused_indexes)]

| | author | subId | commentId | body |
|---|---|---|---|---|
| 205 | notime4urnonsense | uschn9 | i92jukq | and men have the audacity to claim that marria... |
| 207 | mcguinty42 | uschn9 | i92xudu | that is such an visual way to protest |
| 208 | mcguinty42 | uschn9 | i92xudu | i reckon a lot of movements should take notes ... |
| 454 | groundphoenixhogday | ulj8z3 | i7wutrt | that is a great analogy |
| 478 | cyborgaudi | ulj8z3 | i7znfsz | in my experience most of them would say yes to... |
| ... | ... | ... | ... | ... |
| 2751 | | ukbdez | i7rx10x | surely "people who menstruate" is a far more ... |
| 2753 | Canvas718 | ukbdez | i7vwkpj | trans men are men who may — or may not — have ... |
| 2754 | Canvas718 | ukbdez | i7vwkpj | in any case if you support trans people then... |
| 2755 | Canvas718 | ukbdez | i7vwkpj | i just couldn’t tell from the original post |


### 4. Conclusions

The results are interesting. I'll break them up by comparison to each other subreddit here.

[_**r/MensRights**_] There is significant evidence that you cannot learn much about the content of a post made to r/Feminism from reading a post in r/MensRights. At a sentence/sub-utterance level, 69.7% of examples from r/Feminism have statistically significant higher entropy when trying to recover their meaning from comments posted to r/MensRights when compared to trying to recover the same meaning from comments posted to r/Feminism. 83.3% of examples have higher mean entropy in this condition, regardless of significance. This discrepancy is wider at the comment level, where 83.6% of examples from r/Feminism have statistically significant higher entropy when trying to recover their meaning from comments posted to r/MensRights and 91.3% of examples have higher mean entropy in this condition, regardless of significance.

[_**r/MensLib**_] While there are clear differences, it seems possible that you can learn a decent amount about the content of a post made to r/Feminism from reading a post in r/MensLib. At a sentence/sub-utterance level, 27.0% of examples from r/Feminism have statistically significant higher entropy when trying to recover their meaning from comments posted to r/MensLib when compared to trying to recover the same meaning from comments posted to r/Feminism. 36.5% of examples have higher mean entropy in this condition, regardless of significance. This discrepancy is not much wider at the comment level, where 34.8% of examples from r/Feminism have statistically significant higher entropy when trying to recover their meaning from comments posted to r/MensLib and 41.5% of examples have higher mean entropy in this condition, regardless of significance.

There are two interesting observations that the data lay bare. First, it is clear that there is a rhetorical difference between the firmly Feminist subreddit r/Feminism and the anti-Feminist r/MensRights with respect to how the two groups talk about women. While this is not surprising, it validates that our framework is capable of identifying real-world discursive differences between groups.

Interesting, too, is the difference in the quantitative outcomes between our two levels of analysis. That there is a greater difference in entropy at the comment level (comments tend to be comprised of multiple sentences/sub-utterances) would indicate that much of the difference in how online groups conceptualize aspects of a given topic is more observable in larger discursive units.

Put another way, the probability that any one claim or statement in an utterance is sampled from a group-level communicative norm is not zero, but there is still a good bit of variation. The probability that the conceptual makeup of a larger discursive unit offered up by an individual is sampled from some set of group norms, however, is significantly higher. In other words, each sentence is more likely to inject new energy into the system, but people still converge overall on the concepts used by their in-group.

The difference in outcomes at varying levels of analysis isn't just an intriguing observation: it can have implications for the detection of hate speech and other harmful content on the internet. Many if not most hate-speech and/or misinformation classifiers focus on detecting harmful content at the level of a single claim or sentence. Our results indicate that differences between potentially hateful comments and more acceptable forms of discourse might show up more strongly when taking a step back and looking at a larger proportion of the content produced by an individual. How do the myriad things a person has said come together to convey their view of the world? As the old adage goes, "context is key", and there is a real possibility that some important context is lost when the focus is on too small a discursive unit.


#### References

Adams, A., Miles, J., Dunbar, N. E., & Giles, H. (2018). Communication accommodation in text messages: Exploring liking, power, and sex as predictors of textisms. The Journal of Social Psychology, 158(4), 474–490. https://doi.org/10.1080/00224545.2017.1421895

Dale, R., Duran, N. D., & Coco, M. (2018). Dynamic Natural Language Processing with Recurrence Quantification Analysis. ArXiv:1803.07136 [Cs]. http://arxiv.org/abs/1803.07136

de Vries, W., van Cranenburgh, A., & Nissim, M. (2020). What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models. Findings of the Association for Computational Linguistics: EMNLP 2020, 4339–4350

Palomares, N., Giles, H., Soliz, J., & Gallois, C. (2016). Intergroup Accommodation, Social Categories, and Identities. In H. Giles (Ed.), Communication Accommodation Theory (p. 232).

Rosen, Z. (2022). A BERT’s eye view: A “big-data” framework for assessing language convergence and accommodation in large, many-to-many settings. Journal of Language and Social Psychology, 0261927X2210811.