# Urban Dictionary (UD) Dataset for Slang NLP 

This notebook contains a simple Python interface for the Urban Dictionary (UD) dataset accompanying the paper 'A Computational Framework for Slang Generation'.

The dataset is a curated subset of the open-source Urban Dictionary dataset made available via Kaggle:

https://www.kaggle.com/therohk/urban-dictionary-words-dataset

Please see our paper for details regarding how the entries were selected and processed.

Each entry of the dataset contains the following:

1. Word
2. Sense definition sentence
3. Data partition (train, dev, or test)
4. Sample non-contrastive fastText embedding (300 dimensions)
5. Sample contrastive fastText embedding (300 dimensions)
6. Sample non-contrastive SBERT embedding (768 dimensions)
7. Sample contrastive SBERT embedding (768 dimensions)
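
Because each entry is accessed by position, it can help to give the seven fields names. The following is a minimal sketch, not part of the released interface: the `UDEntry` type and its field names are hypothetical, chosen to follow the order of the list above, and the embeddings are zero-filled placeholders rather than real data.

```python
from collections import namedtuple

import numpy as np

# Hypothetical record type; field names are illustrative and simply follow
# the order of the seven fields listed above.
UDEntry = namedtuple(
    "UDEntry",
    [
        "word",
        "definition",
        "partition",
        "fasttext",
        "fasttext_contrastive",
        "sbert",
        "sbert_contrastive",
    ],
)

# Synthetic stand-in for one dataset row (zeros instead of real embeddings).
raw_entry = [
    "ambo",
    "liverpool slang for ambulance",
    "train",
    np.zeros(300),
    np.zeros(300),
    np.zeros(768),
    np.zeros(768),
]

entry = UDEntry(*raw_entry)
print(entry.word, entry.partition, entry.sbert.shape)
```

A named tuple still supports positional indexing, so `entry[0]` and `entry.word` refer to the same value.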


In [1]:
import numpy as np
from collections import namedtuple

In [2]:
UD_data = np.load('UD_Dataset.npy', allow_pickle=True)
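
The `allow_pickle=True` argument is presumably needed because each entry mixes strings with NumPy arrays, which NumPy stores as a pickled object array. By default `np.load` refuses to unpickle such files. The sketch below shows this behaviour on a small synthetic array written to a temporary file; the file name `toy.npy` and its contents are made up for illustration. Since unpickling can execute arbitrary code, this flag should only be used with files from a trusted source.

```python
import os
import tempfile

import numpy as np

# Synthetic object array mixing strings and vectors (not the real dataset).
toy = np.empty(2, dtype=object)
toy[0] = ("ambo", np.zeros(300))
toy[1] = ("word_b", np.zeros(300))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.npy")
    np.save(path, toy)

    # Loading an object array without allow_pickle=True raises ValueError.
    try:
        np.load(path)
        default_load_failed = False
    except ValueError:
        default_load_failed = True

    loaded = np.load(path, allow_pickle=True)

print(default_load_failed, loaded[0][0], loaded[0][1].shape)
```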

Each entry in the loaded array contains one example:

In [3]:
example = UD_data[20]

The first element contains the word:

In [4]:
example[0]

'ambo'

The second element contains the associated definition sentence:

In [5]:
example[1]

'liverpool slang for ambulance'

The third element indicates whether a data entry was used for training, validation, or testing:

In [6]:
example[2]

'train'
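
The partition label makes it straightforward to split the data. The following is a rough sketch on synthetic rows; only the first row comes from this notebook, and the label strings `'dev'` and `'test'` are assumed from the field list rather than confirmed against the file. With the real data, `toy_data` would be replaced by `UD_data`.

```python
from collections import Counter

# Synthetic rows in (word, definition, partition) order; embeddings omitted.
# Only the first row is taken from the notebook, the rest are placeholders.
toy_data = [
    ("ambo", "liverpool slang for ambulance", "train"),
    ("word_b", "placeholder definition b", "dev"),
    ("word_c", "placeholder definition c", "test"),
    ("word_d", "placeholder definition d", "train"),
]

# Count entries per partition (index 2 holds the partition label).
partition_counts = Counter(entry[2] for entry in toy_data)

# Keep only the training entries.
train_entries = [entry for entry in toy_data if entry[2] == "train"]

print(partition_counts)
print([entry[0] for entry in train_entries])
```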

The 4th and 6th elements (indices 3 and 5) contain non-contrastive sense embeddings based on fastText and SBERT, respectively:

In [7]:
example[3].shape

(300,)

In [8]:
example[5].shape

(768,)

The 5th and 7th elements (indices 4 and 6) contain trained contrastive sense embeddings based on fastText and SBERT, respectively:

In [9]:
example[4].shape

(300,)

In [10]:
example[6].shape

(768,)
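
Since every embedding of a given type has the same length, the vectors can be stacked into a matrix and compared. The sketch below uses cosine similarity, one common choice for comparing embeddings and not necessarily the measure used in the paper. The vectors are random stand-ins for the contrastive SBERT embeddings at index 6, and `cosine_similarity` is a helper defined here for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)

# Random stand-ins for the 768-dimensional embeddings of three entries.
toy_embeddings = [rng.standard_normal(768) for _ in range(3)]

# Stack into an (n_entries, 768) matrix.
matrix = np.stack(toy_embeddings)

# Normalise each row to unit length; the matrix product then gives all
# pairwise cosine similarities at once.
normed = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
pairwise = normed @ normed.T

print(matrix.shape, pairwise.shape)
```

With the real data, the list could be built as `[entry[6] for entry in UD_data]`, assuming every entry follows the layout shown above.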