# NeoCov

> Semantic change and socio-semantic variation. The case of Covid-related neologisms on Reddit.

## Description


This repository contains the code for the paper _Semantic change and socio-semantic variation. The case of Covid-related neologisms on Reddit_. This paper has been submitted and is currently under (anonymous) review for the journal _Linguistics Vanguard_.

You can clone the repository and install the code as a Python package named `neocov` by running `pip install .` within the cloned directory. This will automatically install all dependencies. As always, it is recommended to install this package in a virtual environment (e.g. using `conda`). 

The Reddit data used for this paper are too large to be made available here, so some parts of the code cannot be executed without access to these datasets. The full datasets of Reddit comments and the models trained on these comments can be requested via email once the anonymous review process is finished. With these datasets and models, our results can be reproduced.

This notebook provides the full pipeline used to process the Reddit comments, train the models, and produce the results presented in our paper. More detailed information is available in the module notebooks and on the documentation website at https://wuqui.github.io/neocov/.

The code used for the tables and figures contained in the paper can be found directly via the following links:

| Reference | Link                                            |
|-----------|-------------------------------------------------|
| Table 2   | [semantic neologisms](#semantic-neologisms)     |
| Figure 1  | [Covid-related communities](#covid-communities) |
| Figure 2  | [Semantic axes](#sem-axis)                      |
| Figure 3  | [Semantic maps for _vaccines_](#sem-maps)       |



## Imports 

In [None]:
# all_data

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from neocov.read_data import *
from neocov.preproc import *
from neocov.type_emb import *
from neocov.communities import *

In [None]:
from pathlib import Path
import pandas as pd
pd.set_option('display.max_rows', 100)
import altair as alt
from altair_saver import save
from gensim.models import Word2Vec
import pickle

## Variables

In [None]:
DATA_DIR = '../data/'
COMMENTS_DIAC_DIR = f'{DATA_DIR}comments/by_date/'
COMMENTS_DIR_SUBR = f'{DATA_DIR}comments/subr/'
OUT_DIR = '../out/'

## Detecting semantic change

### Read comments

In [None]:
YEAR = '2020'

In [None]:
comment_paths_year = get_comments_paths_year(COMMENTS_DIAC_DIR, YEAR)

In [None]:
comments = read_comm_csvs(comment_paths_year)

In [None]:
comments

Unnamed: 0,author,body,created_utc,id,subreddit
0,Broncos57,Oh okay thank you so much for the reply! I rea...,2020-04-14 21:20:57,fnf0nqd,boston
1,tresclow,Es tan deprimente ver cuando esta clase de est...,2020-04-14 21:20:57,fnf0noq,chile
2,Hicklebear,This comment is Codex approved.,2020-04-14 21:20:57,fnf0nor,Grimdank
3,[deleted],[removed],2020-04-14 21:20:57,fnf0nos,acturnips
4,ilovedog5,Am I the only person who thinks this whole thi...,2020-04-14 21:20:57,fnf0not,UnresolvedMysteries
...,...,...,...,...,...
9599965,Driedrain,Invoice sent!,2020-08-19 21:59:59,g25e6cy,hardwareswap
9599966,PresOfTheLesbianClub,Yes. Fixed that. Thank you!,2020-08-19 21:59:59,g25e6cz,vanderpumprules
9599967,originalasteele,"This is incredible!! Oh, how I miss Midna",2020-08-19 21:59:59,g25e6d1,zelda
9599968,sunbeam2z,I boosted you. I don't need a boost.,2020-08-19 21:59:59,g25e6ci,Earnin


### Preprocessing

In [None]:
comments_clean = clean_comments(comments)

conv_to_lowerc       (9599970, 5) 0:00:08.393617      
rm_punct             (9599970, 5) 0:00:57.095767      
tokenize             (9599970, 5) 0:04:38.659268      
rem_short_comments   (5125011, 5) 0:01:27.863816      
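
The log above names the four steps applied by `clean_comments`. As a rough illustration only, the following sketch mimics these steps on a plain list of strings. The regular expression, the whitespace tokenizer, and the `min_tokens` threshold are assumptions made for this example and may differ from what `neocov.preproc` actually does.

```python
import re

def clean_comments_sketch(bodies, min_tokens=10):
    """Hypothetical stand-in for the four logged steps; not the neocov implementation."""
    docs = []
    for body in bodies:
        text = body.lower()                    # conv_to_lowerc
        text = re.sub(r"[^\w\s]", " ", text)   # rm_punct (assumed pattern)
        tokens = text.split()                  # tokenize (assumed: whitespace split)
        if len(tokens) >= min_tokens:          # rem_short_comments (assumed threshold)
            docs.append(tokens)
    return docs

# Two comment bodies taken from the table above; the first is too short and is dropped.
example = clean_comments_sketch(
    ["Invoice sent!", "I boosted you. I don't need a boost."],
    min_tokens=5,
)
```

Dropping short comments explains why the number of rows shrinks in the last logged step.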


Dataset of comments after pre-processing:

In [None]:
comments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5125011 entries, 0 to 9599969
Data columns (total 5 columns):
 #   Column       Dtype         
---  ------       -----         
 0   author       string        
 1   body         object        
 2   created_utc  datetime64[ns]
 3   id           string        
 4   subreddit    string        
dtypes: datetime64[ns](1), object(1), string(3)
memory usage: 234.6+ MB


In [None]:
docs = comments_clean['body'].to_list()

Saving the cleaned comments to disk:

In [None]:
with open(f'{OUT_DIR}docs_clean/diac_{YEAR}.pickle', 'wb') as fp:
    pickle.dump(docs, fp)

Loading the cleaned comments from disk:

In [None]:
with open(f'{OUT_DIR}docs_clean/diac_{YEAR}.pickle', 'rb') as fp:
    docs = pickle.load(fp)

### Train models

#### Create corpus

In [None]:
corpus = Corpus(docs)

#### Train model

In [None]:
%%time
model = train_model(corpus, EPOCHS=20)

In [None]:
len(model.wv.key_to_index)

#### Save model

In [None]:
model.save(f'{OUT_DIR}models/{YEAR}_ep-20.model')

### Load models

In [None]:
model_2019 = Word2Vec.load(f'{OUT_DIR}models/2019.model')

In [None]:
model_2020 = Word2Vec.load(f'{OUT_DIR}models/2020.model')

### Align models

In [None]:
model_2019_vocab = len(model_2019.wv.key_to_index)
model_2020_vocab = len(model_2020.wv.key_to_index)

In [None]:
smart_procrustes_align_gensim(model_2019, model_2020)

190756 190756
190756 190756


<gensim.models.word2vec.Word2Vec at 0x1738f1030>

In [None]:
assert len(model_2019.wv.key_to_index) == len(model_2020.wv.key_to_index)
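
The core of the alignment step can be illustrated with orthogonal Procrustes on two toy matrices. This is a minimal sketch and not the code of `smart_procrustes_align_gensim`: it assumes that both matrices have already been restricted to the shared vocabulary and that row `i` refers to the same word in both. All names in the block are hypothetical.

```python
import numpy as np

def procrustes_align_sketch(base, other):
    """Rotate `other` onto `base`; rows must refer to the same words in the same order."""
    # Length-normalise rows so that only directions matter.
    base = base / np.linalg.norm(base, axis=1, keepdims=True)
    other = other / np.linalg.norm(other, axis=1, keepdims=True)
    # Orthogonal Procrustes: the orthogonal matrix R minimising ||other @ R - base||_F
    # is U @ Vt, where U, S, Vt is the SVD of other.T @ base.
    u, _, vt = np.linalg.svd(other.T @ base)
    rotation = u @ vt
    return base, other @ rotation

rng = np.random.default_rng(0)
base_vecs = rng.normal(size=(50, 5))
# Build a second toy space that is an exact orthogonal transformation of the first.
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
other_vecs = base_vecs @ q
base_norm, other_aligned = procrustes_align_sketch(base_vecs, other_vecs)
```

In this toy setting the second space is recovered exactly; with real models, the residual differences after rotation are what the distance measurements pick up.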

Overview of vocabulary sizes: both models before Procrustes alignment, and their shared vocabulary (intersection) after alignment:

In [None]:
models_vocab = pd.DataFrame(
    columns=['Model', 'Words'],
    data=[
        ['2019', model_2019_vocab],
        ['2020', model_2020_vocab],
        ['intersection', len(model_2019.wv.key_to_index)]
    ],
)

models_vocab

Unnamed: 0,Model,Words
0,2019,252564
1,2020,277707
2,intersection,190756


In [None]:
models_vocab.to_csv(f'{OUT_DIR}models_vocab.csv', index=False)

### Measure distances

We measure the semantic distance (approximated by cosine distance) between the 2019 and the 2020 model for every word in the aligned vocabulary.

In [None]:
distances = measure_distances(model_2019, model_2020)
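
For reference, the cosine distance between two vectors is 1 minus their cosine similarity, so it ranges from 0 (same direction) to 2 (opposite direction). The sketch below uses made-up two-dimensional vectors. How `measure_distances` computes its `dist_sem` column internally is not shown in this notebook, so this only illustrates the general idea.

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors standing in for one word's embeddings in the 2019 and 2020 models.
vec_2019 = [1.0, 0.0]
vec_2020 = [0.0, 1.0]
dist_orthogonal = cosine_distance(vec_2019, vec_2020)
dist_identical = cosine_distance(vec_2019, vec_2019)
```

Because the two models are aligned beforehand, the vectors of the same word can be compared directly across years.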

<a id='semantic-neologisms'></a>

The 20 words with the highest semantic distance between 2019 and 2020. This output is presented in Table 2 in the paper.

In [None]:
blacklist_lex = load_blacklist_lex()

k = 20
freq_min = 100

sem_change_cands = (distances
    .query('freq_1 > @freq_min and freq_2 > @freq_min')
    .query('lex.str.isalpha() == True')
    .query('lex.str.len() > 3')
    .query('lex not in @blacklist_lex')
    .nlargest(k, 'dist_sem')
    .reset_index(drop=True)
)

sem_change_cands

Unnamed: 0,lex,dist_sem,freq_1,freq_2
0,lockdowns,1.016951,940,990
1,maskless,0.996101,118,127
2,sunsetting,0.996084,111,119
3,childe,0.980564,209,221
4,megalodon,0.975273,751,792
5,newf,0.962381,107,115
6,corona,0.926739,3553,3684
7,filtrate,0.918609,102,110
8,chaz,0.899856,190,202
9,klee,0.888728,161,173


Output semantic neologisms for inclusion in the paper.

In [None]:
sem_change_cands_out = (sem_change_cands
    .nlargest(100, 'dist_sem')
    .assign(index_1 = lambda df: df.index + 1)
    .assign(dist_sem = lambda df: df['dist_sem'].round(2))
    .assign(dist_sem = lambda df: df['dist_sem'].apply('{:.2f}'.format))
    .rename({'index_1': '', 'lex': 'Lexeme', 'dist_sem': 'SemDist'}, axis=1)
)

In [None]:
sem_change_cands_out.to_csv(
        f'{OUT_DIR}sem_change_cands.csv',
        columns=['', 'Lexeme', 'SemDist'],
        index=False
    )

### Inspect neighbourhood

A closer inspection of the semantic neighbours of the term _distancing_ in 2019 and 2020. Due to space limitations, these results had to be excluded from the paper.

In [None]:
LEX_NBS = 'distancing'

In [None]:
nbs_model_1, nbs_model_2 = get_nearest_neighbours_models(
    lex=LEX_NBS, 
    freq_min=25,
    model_1=model_2019, 
    model_2=model_2020,
    k=10
)

display(
    nbs_model_1,
    nbs_model_2
)

Unnamed: 0,Model,Word,SemDist,Freq
0,1,distanced,0.22,309
1,1,extricate,0.27,32
2,1,detaching,0.34,61
3,1,disassociate,0.34,93
4,1,offing,0.36,48
5,1,recuse,0.38,50
6,1,recused,0.4,29
7,1,isolating,0.42,685
8,1,detach,0.44,245
9,1,distract,0.45,1553


Unnamed: 0,Model,Word,SemDist,Freq
50601,2,distanced,0.46,326
50602,2,isolation,0.46,2037
50603,2,gatherings,0.47,921
50604,2,distance,0.48,11355
50605,2,lockdowns,0.5,990
50606,2,quarantine,0.53,5225
50607,2,masks,0.53,8997
50608,2,quarantining,0.53,279
50609,2,quarantines,0.53,160
50610,2,lockdown,0.53,4642


In [None]:
(nbs_model_1.filter(['Word', 'SemDist'])
	.to_csv(f'{OUT_DIR}nbs_{LEX_NBS}_2019.csv', float_format='%.2f', index=False))

(nbs_model_2.filter(['Word', 'SemDist'])
	.to_csv(f'{OUT_DIR}nbs_{LEX_NBS}_2020.csv', float_format='%.2f', index=False))

## Social semantic variation

### Covid-related communities

In this section, we identify the communities that are most actively engaged in Covid-related discourse.

#### Read comments

In [None]:
comments_dir_path = Path(f'{DATA_DIR}comments/lexeme/')
comments_paths = list(comments_dir_path.glob('Covid*.csv'))

In [None]:
%%time
comments = read_comm_csvs(comments_paths)
comments

CPU times: user 36.6 s, sys: 11.4 s, total: 48 s
Wall time: 59 s


Unnamed: 0,author,body,created_utc,id,subreddit
0,Gloob_Patrol,I assume you work too so he's feeling like he ...,2020-09-08 18:53:06,g4guhl5,LongDistance
1,amtrusc,"Strep swab and culture negative, I’m sure? Cou...",2020-09-08 18:53:08,g4guhsm,tonsilstones
2,Ephuntz,&gt;Good point. My apologies. It's just becomi...,2020-09-08 18:53:09,g4guhua,Winnipeg
3,cstransfer,Have you noticed an increase of people going e...,2020-09-08 18:53:09,g4guhu4,financialindependence
4,IlliniWhoDat,"I haven't. I have seen it online, but haven't...",2020-09-08 18:53:13,g4gui6o,KoreanBeauty
...,...,...,...,...,...
3800760,willw,Last group pre COVID!,2020-07-01 21:59:48,fwmqfbj,jawsurgery
3800761,Daikataro,"If everyone is infected with COVID, new cases ...",2020-07-01 21:59:49,fwmqff2,politics
3800762,StabYourBloodIntoMe,&gt; If the mortality rate is actually decreas...,2020-07-01 21:59:50,fwmqfib,dataisbeautiful
3800763,Shorse_rider,I was a freelancer until covid and earned more...,2020-07-01 21:59:55,fwmqfuw,AskWomen


#### Get subreddit counts

In [None]:
subr_counts = get_subr_counts(comments)
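
Conceptually, this step counts how many Covid-related comments each subreddit contributes. A minimal sketch on a plain list of subreddit names follows. `top_subreddits` is a hypothetical helper, and `get_subr_counts` may be implemented differently, for example directly on the `subreddit` column of the DataFrame.

```python
from collections import Counter

def top_subreddits(subreddits, k=15):
    """Return the k most frequent subreddit names with their comment counts."""
    return Counter(subreddits).most_common(k)

# Toy input: one entry per comment, as in the `subreddit` column above.
sample = ["politics", "Winnipeg", "politics", "AskWomen", "politics", "Winnipeg"]
top_two = top_subreddits(sample, k=2)
```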

<a id='covid-communities'></a>

Top 15 communities that are most actively engaged in Covid-related discourse.

In [None]:
subr_counts_plt = plot_subr_counts(subr_counts, k=15)
subr_counts_plt

In [None]:
subr_counts_plt.save(f'{OUT_DIR}subr_counts.png', scale_factor=2.0)

### Train models

In this section, we train community-specific embedding models.

In [None]:
SUBR = 'Coronavirus'

In [None]:
fpaths = get_comments_paths_subr(COMMENTS_DIR_SUBR, SUBR)
comments = read_comm_csvs(fpaths)

In [None]:
%%time
comments_clean = clean_comments(comments)

conv_to_lowerc       (4121144, 5) 0:00:08.279838      
rm_punct             (4121144, 5) 0:00:31.917256      
tokenize             (4121144, 5) 0:07:40.929735      
rem_short_comments   (2927221, 5) 0:01:04.440039      
CPU times: user 1min 21s, sys: 3min 17s, total: 4min 38s
Wall time: 10min 42s


In [None]:
docs = comments_clean['body']
docs = docs.to_list()

In [None]:
with open(f'{OUT_DIR}docs_clean/{SUBR}.pickle', 'wb') as fp:
    pickle.dump(docs, fp)

Load pre-processed comments from disk.

In [None]:
with open(f'{OUT_DIR}docs_clean/{SUBR}.pickle', 'rb') as fp:
    docs = pickle.load(fp)

In [None]:
f'{len(docs):,}'

'2,927,221'

In [None]:
corpus = Corpus(docs)

In [None]:
%%time
model = train_model(corpus, EPOCHS=20)

CPU times: user 21min 15s, sys: 10.7 s, total: 21min 26s
Wall time: 4min 44s


Print vocabulary size.

In [None]:
f'{len(model.wv.key_to_index):,}'

'38,558'

In [None]:
model.save(f'{OUT_DIR}models/{SUBR}.model')

### Load models

In [None]:
model_names = ['Coronavirus', 'conspiracy']

In [None]:
models = []
for name in model_names:
	model = make_model_dict(name)
	model['model'] = Word2Vec.load(model['path'])
	models.append(model)

### Align models

In [None]:
for model in models:
	model['vocab'] = len(model['model'].wv.key_to_index)

In [None]:
smart_procrustes_align_gensim(models[0]['model'], models[1]['model'])

67181 67181
67181 67181


<gensim.models.word2vec.Word2Vec at 0x172c82530>

In [None]:
assert len(models[0]['model'].wv.key_to_index) == len(models[1]['model'].wv.key_to_index)

In [None]:
models_vocab = (pd.DataFrame(models)
	.filter(['name', 'vocab'])
	.rename({'name': 'Model', 'vocab': 'Words'}, axis=1)
)

models_vocab

Unnamed: 0,Model,Words
0,Coronavirus,94816
1,conspiracy,112599


In [None]:
# models_vocab.to_csv(f'../out/vocabs/vocab_{models[0]["name"]}--{models[1]["name"]}.csv', index=False)

### Semantic neighbourhoods

In [None]:
distances = measure_distances(models[0]['model'], models[1]['model'])

#### Words that differ the most between both communities

Due to space limitations, the following results had to be excluded from the paper.

In [None]:
blacklist_lex = load_blacklist_lex()

k = 20
freq_min = 100

sem_change_cands = (distances
    .query('freq_1 > @freq_min and freq_2 > @freq_min')
    .query('lex.str.isalpha() == True')
    .query('lex.str.len() > 3')
    .query('lex not in @blacklist_lex')
    .nlargest(k, 'dist_sem')
    .reset_index(drop=True)
)

sem_change_cands

Unnamed: 0,lex,dist_sem,freq_1,freq_2
0,soliciting,1.003643,2233,2474
1,nimrod,0.974182,103,144
2,incivility,0.958038,16347,15690
3,globes,0.955581,140,193
4,submitter,0.952117,261,352
5,acorn,0.950665,105,148
6,subsequently,0.946088,12174,11956
7,resubmit,0.937007,1763,1927
8,mouthy,0.934621,129,178
9,narc,0.93071,224,305


In [None]:
sem_change_cands_out = (sem_change_cands
    .nlargest(100, 'dist_sem')
    .assign(index_1 = lambda df: df.index + 1)
    .assign(dist_sem = lambda df: df['dist_sem'].round(2))
    .assign(dist_sem = lambda df: df['dist_sem'].apply('{:.2f}'.format))
    .rename({'index_1': '', 'lex': 'Lexeme', 'dist_sem': 'SemDist'}, axis=1)
    .filter(['Lexeme', 'SemDist'])
)
sem_change_cands_out


Unnamed: 0,Lexeme,SemDist
0,soliciting,1.0
1,nimrod,0.97
2,incivility,0.96
3,globes,0.96
4,submitter,0.95
5,acorn,0.95
6,subsequently,0.95
7,resubmit,0.94
8,mouthy,0.93
9,narc,0.93


In [None]:
sem_change_cands_out.to_csv(
        f'{OUT_DIR}sem_var_soc_cands.csv',
        index=False
    )

#### Nearest neighbours for target lexemes in both communities

In [None]:
LEX_NBS = 'vaccines'

In [None]:
nbs_model_1, nbs_model_2 = get_nearest_neighbours_models(
    lex=LEX_NBS, 
    freq_min=100,
    model_1=models[0]['model'], 
    model_2=models[1]['model'],
    k=10
)

display(
    nbs_model_1,
    nbs_model_2
)

Unnamed: 0,Model,Word,SemDist,Freq,vec
0,1,vaccine,0.25,109094,"[1.6486416, -0.77526903, -0.25844133, -2.18959..."
1,1,vaccinations,0.31,3305,"[1.6486416, -0.77526903, -0.25844133, -2.18959..."
2,1,antivirals,0.38,357,"[1.6486416, -0.77526903, -0.25844133, -2.18959..."
3,1,treatments,0.4,4737,"[1.6486416, -0.77526903, -0.25844133, -2.18959..."
4,1,drugs,0.41,13655,"[1.6486416, -0.77526903, -0.25844133, -2.18959..."
5,1,therapies,0.42,425,"[1.6486416, -0.77526903, -0.25844133, -2.18959..."
6,1,doses,0.43,6558,"[1.6486416, -0.77526903, -0.25844133, -2.18959..."
7,1,trials,0.45,10095,"[1.6486416, -0.77526903, -0.25844133, -2.18959..."
8,1,strains,0.46,3875,"[1.6486416, -0.77526903, -0.25844133, -2.18959..."
9,1,therapeutics,0.46,385,"[1.6486416, -0.77526903, -0.25844133, -2.18959..."


Unnamed: 0,Model,Word,SemDist,Freq,vec
21542,2,vaccinations,0.2,3624,"[0.34145182, -1.5721449, -0.045296144, -1.6733..."
21543,2,vaccine,0.23,112485,"[0.34145182, -1.5721449, -0.045296144, -1.6733..."
21544,2,vaccination,0.36,7780,"[0.34145182, -1.5721449, -0.045296144, -1.6733..."
21545,2,treatments,0.38,4874,"[0.34145182, -1.5721449, -0.045296144, -1.6733..."
21546,2,medications,0.4,1614,"[0.34145182, -1.5721449, -0.045296144, -1.6733..."
21547,2,vax,0.42,3208,"[0.34145182, -1.5721449, -0.045296144, -1.6733..."
21548,2,injections,0.43,795,"[0.34145182, -1.5721449, -0.045296144, -1.6733..."
21549,2,adjuvants,0.43,199,"[0.34145182, -1.5721449, -0.045296144, -1.6733..."
21550,2,medicines,0.43,1208,"[0.34145182, -1.5721449, -0.045296144, -1.6733..."
21551,2,viruses,0.45,17105,"[0.34145182, -1.5721449, -0.045296144, -1.6733..."


#### Biggest discrepancies in nearest neighbours for target lexemes

In [None]:
lex = 'vaccines'
topn = 15

nbs_model_1, nbs_model_2 = get_nearest_neighbours_models(
    lex=lex, 
    freq_min=100,
    model_1=models[0]['model'], 
    model_2=models[1]['model'],
    k=100_000
)

nbs_diffs = pd.merge(
    nbs_model_1, nbs_model_2, 
    on='Word',
    suffixes = ('_1', '_2')
)

nbs_diffs = nbs_diffs\
    .assign(sim_diff = abs(nbs_diffs['SemDist_1'] - nbs_diffs['SemDist_2']))\
    .sort_values('sim_diff', ascending=True)\
    .reset_index(drop=True)\
    .query('Word.str.len() >= 4')

subr_1_nbs = nbs_diffs\
    .query('SemDist_1 < SemDist_2')\
    .nlargest(topn, 'sim_diff')

subr_2_nbs = nbs_diffs\
    .query('SemDist_2 < SemDist_1')\
    .nlargest(topn, 'sim_diff')

display(subr_1_nbs, subr_2_nbs)

Unnamed: 0,Model_1,Word,SemDist_1,Freq_1,vec_1,Model_2,SemDist_2,Freq_2,vec_2,sim_diff
21540,1,candidates,0.48,4842,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.82,4925,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.34
21539,1,dyson,0.77,114,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,1.1,158,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.33
21536,1,parallel,0.8,943,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,1.09,1095,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.29
21530,1,lamp,0.86,224,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,1.14,305,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.28
21531,1,underworld,0.86,115,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,1.14,159,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.28
21526,1,oxford,0.64,4128,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.91,4378,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.27
21519,1,slices,0.81,113,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,1.07,157,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.26
21509,1,fade,0.87,490,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,1.12,611,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.25
21504,1,sputnik,0.68,279,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.93,376,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.25
21507,1,approved,0.64,7276,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.89,7443,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.25


Unnamed: 0,Model_1,Word,SemDist_1,Freq_1,vec_1,Model_2,SemDist_2,Freq_2,vec_2,sim_diff
21541,1,gmos,0.85,130,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.49,179,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.36
21537,1,mandated,1.09,2456,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.79,2732,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.3
21538,1,disrespecting,1.25,231,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.95,314,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.3
21534,1,neuralink,0.98,210,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.69,285,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.29
21535,1,vaxx,0.86,633,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.57,771,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.29
21532,1,poisons,0.84,171,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.56,234,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.28
21524,1,preventable,1.02,1550,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.75,1726,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.27
21525,1,mandating,1.08,840,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.81,998,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.27
21527,1,sugar,0.99,2478,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.72,2747,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.27
21528,1,leukemia,0.9,140,"[1.6486416, -0.77526903, -0.25844133, -2.18959...",2,0.63,193,"[0.34145182, -1.5721449, -0.045296144, -1.6733...",0.27


### Maps of social semantic variation

<a id='sem-maps'></a>

The following section contains the plots for Figure 3.

In [None]:
lex = 'vaccines'

In [None]:
nbs_vecs = pd.concat([get_nbs_vecs(lex, model, k=750) for model in models])

#### Common neighbours

In [None]:
#data
nbs_vecs = dim_red_nbs_vecs(nbs_vecs, perplexity=0)



In [None]:
#data
nbs_sim = (nbs_vecs
	.groupby('subreddit')
	.apply(lambda df: df.nlargest(10, 'sim'))
	.reset_index(drop=True)
)

In [None]:
#data
chart_sims = (alt.Chart(nbs_sim).mark_text().encode(
		x='x_tsne:Q',
		y='y_tsne:Q',
		text='lex',
		color='subreddit:N'
	))

chart_sims

In [None]:
chart_sims.save(f'{OUT_DIR}map-sem-space_{lex}_sims.pdf')
chart_sims.save(f'{OUT_DIR}map-sem-space_{lex}_sims.html')

#### Differences in neighbours

In [None]:
nbs_vecs = dim_red_nbs_vecs(nbs_vecs, perplexity=70)



In [None]:
nbs_diff = nbs_vecs.drop_duplicates(subset='lex', keep=False)
nbs_diff = (nbs_diff
	.groupby('subreddit')
	.apply(lambda df: df.nlargest(20, 'sim'))
	.reset_index(drop=True)
)

In [None]:
chart_diffs = (alt.Chart(nbs_diff).mark_text().encode(
		x='x_tsne:Q',
		y='y_tsne:Q',
		text='lex:N',
		color='subreddit:N',
		# column='subr_nb:N',
	)).interactive()


chart_diffs

In [None]:
chart_diffs.save(f'{OUT_DIR}map-sem-space_{lex}_diffs.pdf')
chart_diffs.save(f'{OUT_DIR}map-sem-space_{lex}_diffs.html')

### Dimensions of social semantic variation

<a id='sem-axis'></a>

The following section presents the plots for Figure 2.

In [None]:
lexs = [ 'corona', 'rona', 'moderna', 'sars', 'spreader', 'maskless', 'distancing', 'quarantines', 'pandemic', 'science', 'research', 'masks', 'lockdowns', 'vaccines' ]

#### _good_ vs _bad_

In [None]:
pole_words = ['good', 'bad']

In [None]:
proj_sims = get_axis_sims(lexs, models, pole_words, k=10)

In [None]:
proj_sims = aggregate_proj_sims(proj_sims)

In [None]:
proj_sims_chart = plot_sem_axis(proj_sims, models)
proj_sims_chart

In [None]:
# proj_sims_chart.save(f'../out/proj-emb_{models[0]["name"]}--{models[1]["name"]}___{pole_words[0]}--{pole_words[1]}.pdf')
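
The idea of projecting words onto a semantic axis can be sketched as follows: the axis is the difference between the two pole vectors, and a word's score is its cosine similarity with that axis. This is a simplified illustration with toy vectors and hypothetical names. The `get_axis_sims` helper used above takes an additional `k` parameter, whose role is not reproduced here.

```python
import numpy as np

def axis_score(word_vec, pole_pos_vec, pole_neg_vec):
    """Cosine similarity between a word vector and the axis (positive pole - negative pole)."""
    axis = pole_pos_vec - pole_neg_vec
    return float(word_vec @ axis / (np.linalg.norm(word_vec) * np.linalg.norm(axis)))

# Toy two-dimensional embeddings; real model vectors have many more dimensions.
good = np.array([1.0, 0.0])
bad = np.array([-1.0, 0.0])
word_a = np.array([2.0, 1.0])    # leans towards the "good" pole
word_b = np.array([-1.0, 1.0])   # leans towards the "bad" pole

score_a = axis_score(word_a, good, bad)
score_b = axis_score(word_b, good, bad)
```

Positive scores indicate proximity to the first pole word and negative scores proximity to the second. Computing the score separately in each community model allows the same word to be compared across communities.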

#### _objective_ vs _subjective_

In [None]:
pole_words = ['objective', 'subjective']

In [None]:
proj_sims = get_axis_sims(lexs, models, pole_words, k=10)

In [None]:
proj_sims = aggregate_proj_sims(proj_sims)

In [None]:
proj_sims_chart = plot_sem_axis(proj_sims, models)
proj_sims_chart

In [None]:
# proj_sims_chart.save(f'../out/proj-emb_{models[0]["name"]}--{models[1]["name"]}___{pole_words[0]}--{pole_words[1]}.pdf')