# Import code

In [None]:
import numpy as np
import pandas as pd

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Objectives and previous work

We study social semantic variation over time on Reddit between 2006 and 2020.
- We identify words that show short-term semantic change.
- We use social network information to study the social dynamics between communities over time.
- We investigate the interaction between social and temporal dynamics in semantic change.

Previous work:

- Earlier studies have examined the emergence of __formal neologisms__ (NeoCrawler, Logoscope, Neoveille) and the diffusion of formal lexical innovations in web corpora (NeoCrawler) and in social networks (Goel et al. 2016).
- Distributional semantics makes it possible to study the emergence and diffusion of __semantic innovation and change__ (Firth 1957; Harris 1954).
  - Most previous work has focused on long-term semantic change (Hamilton et al. 2016).
  - More recent work has also investigated short-term meaning change (Del Tredici et al. 2019).
- Previous work has shown that words can show __semantic variation between communities__. For example, there is variation in the semantic representations of words between groups of younger vs. older speakers and female- vs. male-dominated communities (Gonen et al. 2020).
- Little research has investigated how semantic variation and change interact with the social dynamics between communities over time (Hofmann et al. 2021).


We go beyond previous work and study the interaction between social and temporal dynamics of semantic variation.

- To what degree can the overall semantic change observed be traced back to social variation between communities? How important is it to incorporate this information when analysing semantic change? The answer may bear on how important balanced corpora are for investigating semantic change.
- To what extent does social distance between communities correlate with differences in their semantic representations? Do politically close communities have more similar semantic representations?
- How does social semantic variation unfold over time? Does the semantic variation between communities increase over time if communities are becoming more socially distant (semantic & social polarization)? (Hofmann et al. 2021)

# Data

We study a large Reddit dataset (Baumgartner et al. 2020). This dataset covers the period from 2006 to 2020.

We limit ourselves to the political sphere of Reddit. We do this to focus on a set of communities which can be expected to show substantial differences with regard to social distances between groups of speakers.

We extract comments for political subreddits from different parts of the political landscape (left, liberal, right) based on 'userleansbot' data. We additionally analyse comments from apolitical subreddits (e.g. _r/AskReddit_) as a control group.

We retrieve all comments for this set of communities and we create subcorpora for all combinations of subreddits and yearly time slices.
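
The slicing into subreddit/year subcorpora could be sketched as follows. This is a minimal illustration with pandas, not the original pipeline: the column names `subreddit`, `created_utc` and `body` are assumed to follow the Pushshift comment dumps, and the three toy comments are invented.

```python
import pandas as pd

# Toy comments; the column names are an assumption based on the Pushshift dumps.
comments = pd.DataFrame({
    "subreddit": ["askaconservative", "askaconservative", "askademocrat"],
    "created_utc": [1275000000, 1306000000, 1275000000],  # Unix timestamps (seconds)
    "body": ["first comment", "second comment", "third comment"],
})

# Derive the yearly time slice from the timestamp.
comments["year"] = pd.to_datetime(comments["created_utc"], unit="s", utc=True).dt.year

# One subcorpus (list of comment texts) per subreddit/year combination.
subcorpora = {
    key: group["body"].tolist()
    for key, group in comments.groupby(["subreddit", "year"])
}
```

Each key of `subcorpora` is a `(subreddit, year)` tuple, so one embedding model can later be trained per entry.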

# Method

## Detecting candidates for semantic change

We determine __candidates for semantic change__ across all communities in our dataset.

We generate one word2vec model on the full dataset for 2010 and one for 2020, and we determine the words that show the highest degree of semantic change, measured as the cosine distance between a word's 2010 and 2020 vectors.

|       | word       |     dist |
|-------|------------|----------|
|  3368 | theonion   | 1.073174 |
| 14398 | trump      | 1.064995 |
| 24331 | pronouns   | 1.063726 |
|  5941 | quarantine | 1.026087 |
| 20712 | autonomy   | 1.000578 |
|  2834 | chat       | 0.993065 |
| 19200 | ccp        | 0.990134 |
|  3474 | guardian   | 0.988905 |
| 23239 | lockdown   | 0.987755 |
| 16618 | 230        | 0.987645 |
| 22663 | pronoun    | 0.979626 |
|  6978 | cancel     | 0.975683 |
| 20920 | absolved   | 0.967182 |
|  1758 | sub        | 0.964382 |
|  6053 | messaging  | 0.961209 |
|  2473 | bot        | 0.961136 |
|  1231 | https      | 0.961055 |
|  1689 | pardon     | 0.959310 |
| 21526 | kink       | 0.957009 |
| 17816 | frontline  | 0.950385 |
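
A ranking like the one above could be computed roughly as in the following sketch. It assumes that the two models have already been mapped into a shared vector space and are available as plain word-to-vector dictionaries; the helper names `cosine_distance` and `rank_by_change` and the toy vectors are hypothetical. Since cosine distance is 1 minus cosine similarity, values above 1 (as for the top entries in the table) correspond to a negative cosine similarity.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity; ranges from 0 to 2."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def rank_by_change(vecs_a, vecs_b):
    """Rank the shared vocabulary of two comparable models by cosine distance."""
    shared = vecs_a.keys() & vecs_b.keys()
    dists = {w: cosine_distance(vecs_a[w], vecs_b[w]) for w in shared}
    return sorted(dists.items(), key=lambda item: item[1], reverse=True)

# Toy 2-dimensional vectors standing in for the 2010 and 2020 models.
vecs_2010 = {"trump": np.array([1.0, 0.0]), "chat": np.array([0.0, 1.0])}
vecs_2020 = {"trump": np.array([0.0, 1.0]), "chat": np.array([0.0, 2.0])}

ranking = rank_by_change(vecs_2010, vecs_2020)
```

In the toy data, `trump` rotates by 90 degrees and therefore ranks first, while `chat` only changes in length and receives a distance of zero.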

## Generating models of social semantic variation

We then generate embedding models for each combination of subreddit/year.

These models offer semantic representations that are sensitive to both temporal and social variation.

We align these models using Procrustes Alignment (Hamilton et al. 2016).

| community        | year |
|------------------|------|
| askaconservative | 2010 |
| askaconservative | 2011 |
| askaconservative | 2012 |
| askaconservative | 2013 |
| askaconservative | 2014 |
| askaconservative | 2015 |
| askaconservative | 2016 |
| askaconservative | 2017 |
| askaconservative | 2018 |
| askaconservative | 2019 |
| askaconservative | 2020 |
| askademocrat     | 2010 |
| askademocrat     | 2011 |
| askademocrat     | 2012 |
| askademocrat     | 2013 |
| askademocrat     | 2014 |
| askademocrat     | 2015 |
| askademocrat     | 2016 |
| askademocrat     | 2017 |
| askademocrat     | 2018 |
| askademocrat     | 2019 |
| askademocrat     | 2020 |
| ...              | ...  |
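
The alignment step mentioned above could look roughly like this. It is a minimal NumPy sketch of orthogonal Procrustes alignment, assuming both embedding matrices contain the same words in the same row order; the function name `procrustes_align` and the toy matrices are illustrative and not taken from the original pipeline, which may add further preprocessing.

```python
import numpy as np

def procrustes_align(base, other):
    """Rotate `other` onto `base` using the orthogonal matrix R that
    minimises the Frobenius norm of (other @ R - base)."""
    u, _, vt = np.linalg.svd(other.T @ base)
    rotation = u @ vt
    return other @ rotation

# Toy example: `other` is `base` rotated around the third axis.
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 3))  # e.g. 5 words, 3 dimensions
theta = 0.5
rotation_true = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
other = base @ rotation_true

aligned = procrustes_align(base, other)
```

Because the rotation is orthogonal, distances and angles within each model are preserved; only the orientation of the space changes, which is what makes vectors from different subreddit/year models comparable.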

## Analysing social and temporal semantic variation

The aligned models allow us to study both the temporal and the social dimension of variation and to investigate interactions between the two.

We organise the data as follows to feed it into a statistical model. Each row is one observation: a comparison between the semantic representation of a word in one community in a given year (yearly bins) and its representation in a second community in a given year.

For each word in our sample of candidates for semantic change, we measure the semantic distance between its representations in each pair of community/year combinations, based on cosine similarity.
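
How such rows might be assembled is sketched below. The vectors, the camp labels and the choice to enumerate all ordered pairs (including self-comparisons) are assumptions made for illustration; in particular, `dist_temp` is taken here to be the absolute difference in years.

```python
from itertools import product

import numpy as np
import pandas as pd

# Hypothetical aligned vectors for one word, keyed by (community, year).
vectors = {
    ("askaconservative", 2010): np.array([1.0, 0.0]),
    ("askaconservative", 2011): np.array([1.0, 1.0]),
    ("askademocrat", 2010): np.array([0.0, 1.0]),
}
# Hypothetical camp labels; the study derives them from 'userleansbot' data.
camps = {"askaconservative": "right", "askademocrat": "left"}

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rows = []
for (com_a, year_a), (com_b, year_b) in product(vectors, repeat=2):
    rows.append({
        "lex": "trump",
        "community_a": com_a, "year_a": year_a,
        "community_b": com_b, "year_b": year_b,
        "dist_sem": cosine_distance(vectors[com_a, year_a], vectors[com_b, year_b]),
        "dist_soc": int(camps[com_a] != camps[com_b]),  # 0 = same camp, 1 = different camp
        "dist_temp": abs(year_a - year_b),              # distance in years
    })

pairs = pd.DataFrame(rows)
```

Whether to keep both orderings of a pair or only unordered pairs is a design choice; `itertools.combinations_with_replacement` would yield the latter.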

In [None]:
dists = pd.read_csv('data/distances.csv')

In [None]:
dists

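| Unnamed: 0 | lex      | community_a      | year_a | community_b      | year_b | dist_sem | dist_soc | dist_temp |
|------------|----------|------------------|--------|------------------|--------|----------|----------|-----------|
| 0          | trump    | askaconservative | 2010   | askaconservative | 2010   | 0.0      | 0        | 0         |
| 1          | trump    | askaconservative | 2010   | askaconservative | 2011   | 0.1      | 0        | 1         |
| 2          | trump    | askaconservative | 2010   | askaconservative | 2012   | 0.2      | 0        | 2         |
| 3          | trump    | askaconservative | 2010   | askaconservative | 2013   | 0.3      | 0        | 3         |
| 4          | trump    | askaconservative | 2010   | askaconservative | 2014   | 0.4      | 0        | 4         |
| ...        | ...      | ...              | ...    | ...              | ...    | ...      | ...      | ...       |
| 61         | pandemic | askaconservative | 2010   | askademocrat     | 2016   | 1.4      | 1        | 6         |
| 62         | pandemic | askaconservative | 2010   | askademocrat     | 2017   | 1.6      | 1        | 7         |
| 63         | pandemic | askaconservative | 2010   | askademocrat     | 2018   | 1.8      | 1        | 8         |
| 64         | pandemic | askaconservative | 2010   | askademocrat     | 2019   | 2.0      | 1        | 9         |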


## Statistical model

We use a mixed effects model to capture the role of temporal and social distance between semantic representations.

The statistical model is structured as follows:

-   dependent variable: **semantic** distance: `dist_sem` > cosine distance (1 − cosine similarity)
-   **fixed** effects
    -   **temporal** distance: `dist_temp` > years
    -   **social** distance
        -   political camps: 'left', 'right', 'liberal', 'neutral' (['userleansbot' data](userleansbot.csv))
            -   same camp > `0`
            -   different camp > `1`
        -   user overlap: to be calculated (continuous)
    -   **interaction** between `dist_temp` and `dist_soc`
-   **random** effects:
    -   **lexeme**: `lex` > random intercept
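
Written out as an equation, the structure listed above corresponds to a random-intercept model of the following form. The notation is an assumption made for illustration: $i$ indexes observations, $j$ indexes lexemes, and the normality of $u_j$ and $\varepsilon_{ij}$ is the standard assumption for linear mixed models rather than something stated in the source.

$$
\text{dist\_sem}_{ij} = \beta_0 + \beta_1\,\text{dist\_soc}_{ij} + \beta_2\,\text{dist\_temp}_{ij} + \beta_3\,(\text{dist\_soc}_{ij} \times \text{dist\_temp}_{ij}) + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Here $\beta_1$ is the effect of belonging to different camps at zero temporal distance, $\beta_2$ is the change per year within the same camp, and $\beta_3$ is the additional change per year when the two communities belong to different camps.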


In [None]:
md = smf.mixedlm(
    "dist_sem ~ dist_soc * dist_temp",  # main effects plus their interaction
    dists,
    groups="lex",  # random intercept per lexeme
    # re_formula="~dist_temp",  # optional: random slope for dist_temp per lexeme
)

In [None]:
mdf = md.fit()

In [None]:
mdf.summary()

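|                  |         |                     |          |
|------------------|---------|---------------------|----------|
| Model:           | MixedLM | Dependent Variable: | dist_sem |
| No. Observations:| 66      | Method:             | REML     |
| No. Groups:      | 2       | Scale:              | 0.0307   |
| Min. group size: | 33      | Log-Likelihood:     | 9.3463   |
| Max. group size: | 33      | Converged:          | Yes      |
| Mean group size: | 33.0    |                     |          |

|                    | Coef. | Std.Err. | z      | P>\|z\| | [0.025 | 0.975] |
|--------------------|-------|----------|--------|---------|--------|--------|
| Intercept          | 0.000 | 0.109    | 0.000  | 1.000   | -0.214 | 0.214  |
| dist_soc           | 0.250 | 0.086    | 2.923  | 0.003   | 0.082  | 0.418  |
| dist_temp          | 0.100 | 0.008    | 11.980 | 0.000   | 0.084  | 0.116  |
| dist_soc:dist_temp | 0.150 | 0.014    | 10.375 | 0.000   | 0.122  | 0.178  |
| lex Var            | 0.019 | 0.163    |        |         |        |        |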
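
To read the interaction term, it can help to plug the fixed-effect estimates from the summary output above into the model by hand. The following sketch does this for a temporal distance of five years; the helper `predict_dist_sem` is hypothetical and ignores the random intercept for `lex`, so it gives population-level predictions only.

```python
# Fixed-effect estimates copied from the summary output above.
coef = {
    "Intercept": 0.000,
    "dist_soc": 0.250,
    "dist_temp": 0.100,
    "dist_soc:dist_temp": 0.150,
}

def predict_dist_sem(dist_soc, dist_temp):
    """Population-level prediction (random intercept for `lex` set to zero)."""
    return (coef["Intercept"]
            + coef["dist_soc"] * dist_soc
            + coef["dist_temp"] * dist_temp
            + coef["dist_soc:dist_temp"] * dist_soc * dist_temp)

same_camp = predict_dist_sem(dist_soc=0, dist_temp=5)  # 0.100 * 5
diff_camp = predict_dist_sem(dist_soc=1, dist_temp=5)  # 0.250 + 0.100 * 5 + 0.150 * 5
```

Under these estimates, the predicted semantic distance grows by 0.100 per year within a camp and by 0.100 + 0.150 = 0.250 per year across camps.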


# References

Baumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. ‘The Pushshift Reddit Dataset’. In Proceedings of the International AAAI Conference on Web and Social Media, 14:830–39.

Del Tredici, Marco, Raquel Fernández, and Gemma Boleda. 2019. ‘Short-Term Meaning Shift: A Distributional Exploration’. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2069–75. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1210.

Firth, John R. 1957. A Synopsis of Linguistic Theory, 1930-1955. Studies in Linguistic Analysis. Special Volume of the Philological Society. Oxford: Basil Blackwell.

Goel, Rahul, Sandeep Soni, Naman Goyal, John Paparrizos, Hanna Wallach, Fernando Diaz, and Jacob Eisenstein. 2016. ‘The Social Dynamics of Language Change in Online Networks’. In Social Informatics, edited by Emma Spiro and Yong-Yeol Ahn, 41–57. Cham: Springer International Publishing.

Gonen, Hila, Ganesh Jawahar, Djamé Seddah, and Yoav Goldberg. 2020. ‘Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora’. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 538–55. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.51.

Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. ‘Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change’. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 1489–1501. Berlin, Germany: Association for Computational Linguistics. http://www.aclweb.org/anthology/P16-1141.

Harris, Zellig S. 1954. ‘Distributional Structure’. Word-Journal of The International Linguistic Association 10 (2–3): 146–162.

Hofmann, Valentin, Janet B. Pierrehumbert, and Hinrich Schütze. 2021. ‘Modeling Ideological Agenda Setting and Framing in Polarized Online Groups with Graph Neural Networks and Structured Sparsity’. ArXiv:2104.08829 [Cs], April. http://arxiv.org/abs/2104.08829.