# Creating a Toy Dataset for Translingual Semantic Induction

In this notebook, we create a toy dataset of about 100 words with 20-dimensional embeddings, taken as a subset of the 200k-word, 300-dimensional English-Italian dataset. It will serve as a tight-loop test bed for the synset-alignment idea.

We'll try to pick an interesting subset in a non-random manner, so that the various components can be sanity-checked (even if they don't necessarily improve results). This means that enough English words should have Italian correspondences, and that enough nontrivial parts of the WordNet graph (including polysemy) should be included.

In [13]:
import numpy as np

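def read(file, maxlines=10000, dim=20, threshold=0, vocabulary=None, dtype='float'):
    """Read embeddings from a file opened in binary mode.

    Scans the first `maxlines` entries and keeps the first `dim` dimensions of each
    vector. If `vocabulary` is given, only words found in it are kept.
    `threshold` is currently unused.
    """
    header = file.readline().decode("utf-8").split(' ')  # header line; not used below
    words = []
    # preallocate when every line is kept; otherwise collect the matching rows in a list
    matrix = np.empty((maxlines, dim), dtype=dtype) if vocabulary is None else []
    for i in range(maxlines):
        word, vec = file.readline().decode("utf-8").split(' ', 1)
        if vocabulary is None:
            words.append(word)
            matrix[i] = np.fromstring(vec, sep=' ', count=dim, dtype=dtype)
        elif word in vocabulary:
            words.append(word)
            matrix.append(np.fromstring(vec, sep=' ', count=dim, dtype=dtype))
    return (words, matrix) if vocabulary is None else (words, np.array(matrix, dtype=dtype))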

In [16]:
with open('data/embeddings/en.emb.txt', 'rb') as enembs:
    testing = read(enembs, maxlines=100, dim=5, vocabulary=['in','the','house'])

In [17]:
testing

(['the', 'in'], array([[ 0.025749, -0.006358, -0.001263,  0.092149,  0.084755],
        [ 0.024942,  0.008304, -0.033839,  0.078987,  0.135708]]))

Good.

Here's what we'll do now. We'll take the first 1,000 English vectors and count how many of those words have an Italian equivalent in the dictionaries. If the result is not close enough to 100, we'll iterate.

In [18]:
with open('data/embeddings/en.emb.txt', 'rb') as enembs:
    k_words, k_vecs = read(enembs, maxlines=1000, dim=20)

In [19]:
k_words[20:25]

['be', ':', "'s", 'are', 'at']

In [20]:
en_it_dict = {}
for dictfile in ['data/dictionaries/en-it.train.txt', 'data/dictionaries/en-it.test.txt']:
    with open(dictfile) as dfile:
        for l in dfile:
            en, it = l.strip().split()
            en_it_dict[en] = it

In [22]:
en_it_dict['house']

'casa'
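
The loop above keeps a single translation per English word: if a word appears on several lines, the last one wins. Should the dictionary files list several translations for the same word (an assumption, not checked here), a variant that keeps all of them could look like the sketch below. The helper name `load_dictionary` and the sample lines are hypothetical.

```python
from collections import defaultdict
import io

def load_dictionary(lines):
    """Collect every translation seen for each source word, in file order."""
    translations = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        src, tgt = parts
        if tgt not in translations[src]:  # avoid repeating a translation
            translations[src].append(tgt)
    return dict(translations)

# Hypothetical dictionary content, not taken from the real files.
sample = io.StringIO("house casa\nhouse abitazione\ntime tempo\n")
toy_dict = load_dictionary(sample)
print(toy_dict["house"])
```

With this structure, "has a translation" is still just `w in toy_dict`, so the coverage counts computed in this notebook would not change.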

In [23]:
len([w for w in k_words if w in en_it_dict])

766

Terrific! Let's get a random subsample, hoping for a similar percentage of translatable words in it.

In [31]:
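np.random.seed(90210)  # donna martin on my mind; seed() has to be called, not assigned to
# choice() samples with replacement by default, so fewer than 100 distinct indices may come back
rand_idcs = np.random.choice(range(1000), size=100)
d_words = [w for i,w in enumerate(k_words) if i in rand_idcs]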
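
`np.random.choice` draws with replacement unless told otherwise, so a request for 100 indices can return fewer than 100 distinct ones. A minimal sketch of a reproducible draw without repeats, assuming NumPy's `Generator` API is acceptable here; the `vocab` list is a hypothetical stand-in for `k_words`.

```python
import numpy as np

# Generator-based API: the seed is passed when the generator is created.
rng = np.random.default_rng(90210)

# replace=False guarantees 100 distinct indices out of 1000.
idcs = rng.choice(1000, size=100, replace=False)
idx_set = set(idcs.tolist())  # set membership is O(1) per lookup

# Stand-in vocabulary (hypothetical), used here in place of k_words.
vocab = [f"w{i}" for i in range(1000)]
subset = [w for i, w in enumerate(vocab) if i in idx_set]
print(len(subset))
```

The subset keeps the original vocabulary order, as the list comprehension above does.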

In [33]:
d_words[40:45]

['yet', 'return', 'among', 'million', 'story']

In [34]:
len([w for w in d_words if w in en_it_dict])

73

Awesome. Let's fish for multisense words.

In [40]:
import pickle
from scipy.sparse import csr_matrix
from collections import Counter

word_in_synset_table_idcs = {}
with open('data/synsets/v3a_wordlist.txt') as wordfile:
    for i in range(1000):
        w = wordfile.readline().strip()
        if w in d_words:
            word_in_synset_table_idcs[w] = i

In [41]:
len(word_in_synset_table_idcs)

94

In [56]:
backup_d_words = d_words
d_words = list(word_in_synset_table_idcs.keys())

In [57]:
with open('data/synsets/v3a_pairings.pkl', 'rb') as pairings_file:
    full_synset_pairing = pickle.load(pairings_file)

In [58]:
# words correspond to columns
full_syn_columns = full_synset_pairing[:, np.array(list(word_in_synset_table_idcs.values()))]

In [60]:
populated_rows = [r for r in full_syn_columns[:] if r.nnz > 0]

In [61]:
len(populated_rows)

432

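This isn't good: the big dataset has 60K synsets for 200K words (about 0.3 synsets per word), whereas here we have 432 synsets for 94 words (about 4.6 per word). We want to keep the ratio sensible, so let's truncate the alignment arbitrarily by considering only every fifth synset row.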

In [121]:
sampled_populated_rows = [i*5 for i,r in enumerate(full_syn_columns[::5]) if r.nnz > 0]

In [122]:
len(sampled_populated_rows)

82

In [123]:
partial_pairing = full_syn_columns[np.array(sampled_populated_rows)]

In [124]:
partial_pairing

<82x94 sparse matrix of type '<class 'numpy.int32'>'
	with 84 stored elements in Compressed Sparse Row format>
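
The same "every fifth row, non-empty only" selection can also be written without a Python loop over sparse rows. This is a sketch on a small hypothetical matrix, assuming rows are synsets and columns are words as above; it is not meant to replace the cells in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 12-synset x 4-word incidence matrix (rows: synsets, columns: words).
rows = [0, 2, 5, 5, 7, 9]
cols = [1, 0, 2, 3, 1, 0]
pairing = csr_matrix((np.ones(6, dtype=np.int32), (rows, cols)), shape=(12, 4))

step = 5
candidate_rows = np.arange(0, pairing.shape[0], step)  # rows 0, 5, 10
nnz_per_row = pairing.getnnz(axis=1)                   # stored entries per row
kept_rows = candidate_rows[nnz_per_row[candidate_rows] > 0]
partial = pairing[kept_rows]
print(kept_rows.tolist(), partial.shape)
```

Row 10 is a candidate but has no stored entries, so it is dropped, which mirrors the `r.nnz > 0` filter.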

In [128]:
# words per synset
x_degs = partial_pairing.sum(axis=1).flatten().tolist()[0]
xdeg_counts = Counter(x_degs)
xdeg_counts.most_common()

[(1, 80), (2, 2)]

In [129]:
# synsets per word
y_degs = partial_pairing.sum(axis=0).flatten().tolist()[0]
ydeg_counts = Counter(y_degs)
ydeg_counts.most_common()

[(0, 42), (1, 33), (2, 10), (3, 6), (4, 2), (5, 1)]

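OK. So not much polysemy to go around (42 of the 94 words have no synset at all and 33 have exactly one), but 19 words belong to more than one synset, so there is some degree of multisense ambiguity. We can live with that.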
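
To make the two degree counts concrete, here is a small sketch on a hypothetical 3-synset by 4-word matrix. Row sums give words per synset, and column sums give synsets per word.

```python
import numpy as np
from collections import Counter
from scipy.sparse import csr_matrix

# Hypothetical incidence matrix: rows are synsets, columns are words.
dense = np.array([[1, 1, 0, 0],
                  [0, 1, 0, 0],
                  [0, 1, 1, 0]], dtype=np.int32)
pairing = csr_matrix(dense)

# Summing a sparse matrix returns a 2-D result, so flatten it to a plain list.
words_per_synset = np.asarray(pairing.sum(axis=1)).ravel().tolist()
synsets_per_word = np.asarray(pairing.sum(axis=0)).ravel().tolist()

polysemous = sum(d > 1 for d in synsets_per_word)   # words in several synsets
unattached = sum(d == 0 for d in synsets_per_word)  # words in no synset
print(Counter(words_per_synset).most_common())
print(Counter(synsets_per_word).most_common())
print(polysemous, unattached)
```

In this toy matrix the second word is the only polysemous one, and the fourth word belongs to no synset.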

## Write to files
Let's start with the semantic side. I don't think we need the synsets.

In [130]:
with open('data/synsets/toy_wordlist.txt', 'w', encoding='utf8') as wordlist_file:
    for w in d_words:
        wordlist_file.write(f'{w}\n')

with open('data/synsets/toy_pairings.pkl', 'wb') as graph_file:
    pickle.dump(partial_pairing, graph_file)

Dictionaries can stay the same, since we're not doing seeded alignment anyway (for now), and if we ever do, it will start from a seed of only 25 words.

Now for English embeddings:

In [133]:
# look vectors up by word, so that d_vecs is guaranteed to follow the order of d_words
vec_by_word = dict(zip(k_words, k_vecs))
d_vecs = [vec_by_word[w] for w in d_words]
print(len(d_vecs))  # should be 94
with open('data/embeddings/en.toy.txt', 'w', encoding='utf8') as en_embs_file:
    en_embs_file.write(f'{len(d_words)} {len(d_vecs[0])}\n')
    for w,v in zip(d_words, d_vecs):
        en_embs_file.write(w + ' ' + ' '.join([f'{vd:.6f}' for vd in v]) + '\n')

94
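
A quick way to sanity-check a file in this format is to read it back and compare it with what was written. The sketch below does the round trip in memory; the helper names `write_embeddings` and `load_embeddings` and the sample vectors are hypothetical.

```python
import io
import numpy as np

def write_embeddings(handle, words, vecs):
    """Write a '<count> <dim>' header, then one 'word v1 ... vd' line per word."""
    handle.write(f"{len(words)} {len(vecs[0])}\n")
    for w, v in zip(words, vecs):
        handle.write(w + " " + " ".join(f"{x:.6f}" for x in v) + "\n")

def load_embeddings(handle):
    """Parse the format above back into a word list and a 2-D array."""
    count, dim = (int(x) for x in handle.readline().split())
    words, rows = [], []
    for line in handle:
        word, rest = line.rstrip("\n").split(" ", 1)
        words.append(word)
        rows.append([float(x) for x in rest.split()])
    matrix = np.array(rows)
    assert matrix.shape == (count, dim), "header does not match contents"
    return words, matrix

# Hypothetical words and vectors, not taken from the real embeddings.
toy_words = ["yet", "return"]
toy_vecs = np.array([[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]])

buffer = io.StringIO()
write_embeddings(buffer, toy_words, toy_vecs)
buffer.seek(0)
loaded_words, loaded_vecs = load_embeddings(buffer)
print(loaded_words, np.allclose(loaded_vecs, toy_vecs, atol=1e-6))
```

The shape check against the header would catch a mismatch between the number of words and the number of vectors written.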


Time for Italian. First we need to find the actual words for the ones we have, then randomly select a bunch of others.

In [157]:
it_d_words = [en_it_dict[w] for w in d_words if w in en_it_dict]
len(it_d_words)

73

In [158]:
with open('data/embeddings/it.emb.txt', 'rb') as itembs:
    it_words, it_vecs = read(itembs, maxlines=52000)
len([w for w in it_words if w in it_d_words])

70

In [159]:
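np.random.seed(5500)  # seed() has to be called, not assigned to
# choice() samples with replacement by default, so add_words may contain repeats
add_words = np.random.choice([w for w in it_words[:1000] if w not in it_d_words], size=24)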

In [163]:
it_d_words = [w for w in it_d_words if w in it_words] + add_words.tolist()

In [168]:
it_d_ordered_words_vecs = [(w, it_vecs[i]) for i,w in enumerate(it_words) if w in it_d_words]
print(len(it_d_ordered_words_vecs))  # at most 94 (70 translations + 24 extras); fewer if the extras repeat
with open('data/embeddings/it.toy.txt', 'w', encoding='utf8') as it_embs_file:
    it_embs_file.write(f'{len(it_d_ordered_words_vecs)} {len(it_d_ordered_words_vecs[0][1])}\n')
    for w,v in it_d_ordered_words_vecs:
        it_embs_file.write(w + ' ' + ' '.join([f'{vd:.6f}' for vd in v]) + '\n')

93


OK, I think we're done?