## Supplementary code for paper submission: 'Tracing Semantic Variation in Slang'.

This notebook contains the supplementary data pre-processing code for 'Tracing Semantic Variation in Slang'. Since we cannot publicly release all entries from Green's Dictionary of Slang (GDoS) due to copyright terms, this notebook illustrates how we pre-process raw data obtained from https://greensdictofslang.com/ and convert it into a format that can be used to reproduce our experimental results.

You will need the following non-standard Python packages, all of which can be installed with *pip install*:

- numpy
- bs4
- tqdm

In [52]:
import bs4
import pickle
import re
import glob
import numpy as np
from tqdm import trange
import os

In [41]:
from util import GSD_Definition, GSD_Word
from process import process_GSD

For illustration, we include the raw HTML dumps of 3 dictionary entries for the slang word *beast*. Each file is named after the hash tag that the original dictionary assigns to the entry. The original entries can be found on the following webpages:

https://greensdictofslang.com/entry/23sqfua

https://greensdictofslang.com/entry/xzzdtua

https://greensdictofslang.com/entry/3e7vqxq

We first crawl our directory for these hash tags:

In [42]:
word_hash = [s[:-5] for s in glob.glob('htmls/*.html')]
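
The slice `s[:-5]` relies on every file name ending in exactly `.html`. A minimal sketch of the same step using `os.path.splitext`, which strips whatever extension is present; the helper name `strip_html_ext` is hypothetical and not part of the code package:

```python
import os

def strip_html_ext(paths):
    """Return each path without its file extension, keeping the directory part."""
    return [os.path.splitext(p)[0] for p in paths]

# Illustrative paths only; in the notebook these come from glob.glob('htmls/*.html').
example_paths = ['htmls/23sqfua.html', 'htmls/xzzdtua.html', 'htmls/3e7vqxq.html']
print(strip_html_ext(example_paths))
```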

In [43]:
word_hash

['htmls/24ojdci',
 'htmls/6czjomy',
 'htmls/6a3cggi',
 'htmls/2zmnjxq',
 'htmls/wfcjxrq',
 'htmls/hdttlwa',
 'htmls/zhgmdzi',
 'htmls/t7uysrq',
 'htmls/y2o3bcy',
 'htmls/g6fxnxq',
 'htmls/fn5m24y',
 'htmls/myxdjhy',
 'htmls/v4lczky',
 'htmls/ira2pka',
 'htmls/yssly4i',
 'htmls/zy5223y',
 'htmls/7gojxvq',
 'htmls/iz7ydca',
 'htmls/hemsaei',
 'htmls/zwxta4a',
 'htmls/rhav3zq',
 'htmls/2yr2zty',
 'htmls/kksk6gq',
 'htmls/5l3z3fi',
 'htmls/hbwrnwa',
 'htmls/gcsboaa',
 'htmls/djzwhhq',
 'htmls/xsctena',
 'htmls/plqlbna',
 'htmls/nnylxzq',
 'htmls/hl2mndq',
 'htmls/tiv74la',
 'htmls/suw5vkq',
 'htmls/67l5twa',
 'htmls/neyom4i',
 'htmls/3rvitbi',
 'htmls/qrwpa4y',
 'htmls/r6jct7y',
 'htmls/ftaxary',
 'htmls/posylma',
 'htmls/bkt5xeq',
 'htmls/mmqse7a',
 'htmls/fqcoooa',
 'htmls/preo3ri',
 'htmls/crhkufi',
 'htmls/benztoa',
 'htmls/pmrdxvq',
 'htmls/vwwaywa',
 'htmls/it4ow6y',
 'htmls/iksvt7a',
 'htmls/zp2b3dq',
 'htmls/de6ffgy',
 'htmls/krtlkha',
 'htmls/payjqrq',
 'htmls/duerf7y',
 'htmls/ue

The following pre-processing function takes a list of hash tags and processes the corresponding HTML files, generating one pickle file per word entry. Note that we do not collapse homonyms (i.e., the same word form with multiple word entries) until the actual experiment.

In [36]:
process_GSD(word_hash, input_dir = "", output_dir = "")

  6%|▌         | 88/1505 [00:04<01:15, 18.68it/s, d_count=0, w_count=0]


KeyboardInterrupt: 
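
As noted before the call to `process_GSD`, homonyms stay as separate entries at this stage. A rough sketch of what collapsing them later could look like, grouping entries by word form; the tuple layout and the helper `group_homonyms` are assumptions made for illustration, not the procedure used in the experiments:

```python
from collections import defaultdict

def group_homonyms(entries):
    """Group (word, pos, homonym_index) tuples by word form."""
    grouped = defaultdict(list)
    for word, pos, homonym in entries:
        grouped[word].append((pos, homonym))
    return dict(grouped)

# Made-up entries: one word form that appears as two separate dictionary entries.
entries = [('beast', 'n.', 1), ('beast', 'n.', 2), ('downer', 'n.', 1)]
print(group_homonyms(entries))
```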

This should generate 3 pickle files, which we now load for further pre-processing.

In [62]:
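# Load every pickled entry, closing each file handle after reading.
data = []
for f in os.listdir('pickles'):
    with open(os.path.join('pickles', f), 'rb') as fh:
        data.append(pickle.load(fh))
print(len(data))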

1407


The following code filters the reference entries according to the set of regions that we are interested in (in our case, US and UK). It also tries to automatically extract valid example usage sentences from the reference entries.

In [63]:
regions = ['[US]', '[UK]']
#regions = ['[US]', '[UK]', '[Aus]']
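
The region filter can be sketched on plain data. The block below assumes each stamp is a `(year, region)` tuple, which is consistent with how the filtering code in this notebook reads `s[1]` as the region tag; the helper `filter_stamps` and the sample stamps are hypothetical:

```python
def filter_stamps(stamps, regions):
    """Keep only the stamps whose region tag is in `regions`."""
    wanted = set(regions)
    return [s for s in stamps if s[1] in wanted]

regions = ['[US]', '[UK]']
# Made-up stamps for illustration; the '[Aus]' stamp is dropped by the filter.
stamps = [(1859, '[US]'), (1839, '[UK]'), (1900, '[Aus]')]
print(filter_stamps(stamps, regions))
```

A definition whose stamps are all filtered out would have no supporting references left and can be discarded.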

In [68]:
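# Raw string, so Python does not warn about invalid escapes; the matched character set is unchanged.
punctuations = r'''!'"#$%&()*+,\-./:;<=>?@\[\]^_`{|}~'''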

re_punc = re.compile(r"["+punctuations+r"]+")
re_space = re.compile(r" +")

re_extract_quote = re.compile(r"[1-9/]+:")
re_extract_quote_all = re.compile(r"[1-9/]+:.*$")

def proc_quote_sent(sent):
    return re_extract_quote.sub(' ', re_extract_quote_all.findall(sent)[0]).strip()

def validate_quote_sent(word, sent):
    tokens = [s.lower() for s in re_space.sub(' ', re_punc.sub('', sent)).split(' ')]
    return word.lower() in tokens

data_proc = []

for i in trange(len(data)):
    w = data[i]
    if w.is_abbr():
        continue
    d_list = []
    for d in w.definitions:
        stamps = d.stamps
        region_set = set([s[1] for s in stamps])
        if any(r in region_set for r in regions):
            # Keep only the stamps whose region is one of the regions of interest.
            new_stamps = [s for s in stamps if s[1] in regions]
            new_def = GSD_Definition(d.def_sent)
            new_def.stamps = new_stamps
            new_def.contexts = {key:value for key, value in d.contexts.items() if key in new_stamps}
            d_list.append(new_def)
    if len(d_list) > 0:
        new_word = GSD_Word(w.word.replace("\\xe2\\x80\\x99", "'").replace("\\xe2\\x80\\x98", "'"), w.pos, w.homonym)
        new_word.definitions = d_list
        data_proc.append(new_word)

print(len(data_proc))

100%|██████████| 1407/1407 [00:00<00:00, 3072.84it/s]

1360
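
To see what `validate_quote_sent` accepts, here is a small self-contained sketch of the same idea: strip punctuation, lowercase, and check that the headword appears as a whole token. It uses `string.punctuation` instead of the hand-written character list, so it approximates the cell above rather than copying it; the helper name `contains_headword` and the last example sentence are made up:

```python
import re
import string

# Character class matching runs of ASCII punctuation.
RE_PUNC = re.compile('[' + re.escape(string.punctuation) + ']+')

def contains_headword(word, sent):
    """True if `word` is a whitespace-delimited token of `sent`, ignoring case and punctuation."""
    tokens = RE_PUNC.sub('', sent).lower().split()
    return word.lower() in tokens

print(contains_headword('downer', 'Downer a sixpence.'))          # True
print(contains_headword('stump', 'I stump you to go fishing..'))  # True
print(contains_headword('cracker', 'A box of crackers.'))         # False
```

Because the check is an exact token match, inflected forms such as plurals are rejected, which keeps the extracted sentences conservative.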





Here's what the data looks like after pre-processing:

In [69]:
for d in data_proc:
    print(d)

[WORD]
downer
[POS]
n.
[HOMONYM]
1
[DEFINITIONS]
a nickel, five cents
1859 - [US]
1881 - [US]
a sixpence
1839 - [UK] - Downer a sixpence.
1857 - [UK] - Sixpence, downer, also sprat.
1860 - [UK]
1873 - [UK]
1885 - [UK] - Two more names for a sixpence are a downer and a tanner [F&H].

[WORD]
stump
[POS]
v.
[HOMONYM]
1
[DEFINITIONS]
to challenge usu. to a fight, to dare
1766 - [UK]
1844 - [US] - I was a darned old chucklehead to stump you to strike me.
1905 - [US] - stump, v. To challenge.
1912 - [US] - stump, v. To dare. I stump you to jump off that stack. Sometimes to invite. I stump you to go fishing..
1948 - [US] - Ill stump you to jump down [DA].
to put up a fight
1856 - [US] - Dont let him stump you, give him one on his nigger head!!

[WORD]
one
[POS]
phr.
[DEFINITIONS]
goodbye, 'see you later
2001 - [US]
2002 - [US]

[WORD]
cracker
[POS]
n.
[HOMONYM]
9
[DEFINITIONS]
an attractive young woman; usu. as ; occas. a man
1970 - [UK] - I definitely had to have a piece of cracker.
1977 - [

We now save the pre-processed data to be used for experiments. See the notebook *Trace.ipynb* in the code package for how this can be used to reproduce results in our paper.

In [70]:
np.save('GSD_Data.npy', data_proc)
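
Because `data_proc` is a list of Python objects, `np.save` stores it as a pickled object array, and reading it back requires `allow_pickle=True`. A minimal round-trip sketch with plain dictionaries standing in for `GSD_Word` objects (the stand-in records and the temporary path are illustrative only):

```python
import os
import tempfile
import numpy as np

# Stand-in records; the notebook saves GSD_Word objects instead.
records = [{'word': 'downer', 'pos': 'n.'}, {'word': 'stump', 'pos': 'v.'}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.npy')
    np.save(path, records)                     # stored as an object array via pickle
    loaded = np.load(path, allow_pickle=True)  # raises ValueError without allow_pickle=True

print(loaded.dtype, len(loaded))
print(loaded[0]['word'])
```

When loading the real file, the `util` module must be importable so that pickle can reconstruct the `GSD_Word` and `GSD_Definition` objects.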