Mistake when constructing new_w_map #24

CN-AlbertWu96 · 2019-04-02T02:14:19Z

In encode_folder.py, filter_words narrows down the word mapping loaded from a pre-trained embedding file such as glove.100.pk, keeping only the embeddings of words that appear in the documents (train & test). Words in the documents can contain capital letters, but the pre-trained embedding file contains only lowercase words, so any word that appears only in capitalized form is ignored. For example, if the training set contains "Japan" but not "japan", the vocabulary never matches the "japan" entry in glove.100.pk, and its embedding is filtered out.
We should change word = line[0] to word = line[0].lower()

def filter_words(w_map, emb_array, ck_filenames):
    vocab = set()
    for filename in ck_filenames:
        for line in open(filename, 'r'):
            if not (line.isspace() or (len(line) > 10 and line[0:10] == '-DOCSTART-')):
                line = line.rstrip('\n').split()
                assert len(line) >= 3, 'wrong ck file format'
                word = line[0]
                vocab.add(word)
    new_w_map = {}
    new_emb_array = []
    # obtain the embedding of words appear in both wmap and vocab
    for (word, idx) in w_map.items():
        if word in vocab or word in ['<unk>', '<s>', '< >', '<\n>']:
            assert word not in new_w_map, "%s appears twice in ebd file"%word
            new_w_map[word] = len(new_emb_array)
            new_emb_array.append(emb_array[idx])
    print('filtered %d --> %d' % (len(emb_array), len(new_emb_array)))
    return new_w_map, new_emb_array

LiyuanLucasLiu · 2019-04-02T03:11:43Z

Nice catch! Better to add both word and word.lower() :-) #25
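A minimal sketch of the "add both" idea follows. It is an illustration under stated assumptions, not necessarily what #25 changed. The helper names collect_vocab and filter_embeddings and the sample data are hypothetical; the skip logic for blank and -DOCSTART- lines only approximates the check in filter_words.

```python
def collect_vocab(lines):
    """Collect each token and its lowercased form from CoNLL-style lines.

    Hypothetical helper: adding both forms lets a lowercase-only embedding
    file still match tokens that appear capitalized in the corpus.
    """
    vocab = set()
    for line in lines:
        # Skip blank lines and document separators (approximates filter_words).
        if line.isspace() or line.startswith('-DOCSTART-'):
            continue
        fields = line.rstrip('\n').split()
        if not fields:
            continue
        word = fields[0]
        vocab.add(word)          # original casing, e.g. "Japan"
        vocab.add(word.lower())  # lowercased form, e.g. "japan"
    return vocab


def filter_embeddings(w_map, emb_array, vocab, special=('<unk>', '<s>')):
    """Keep only embeddings whose word is in vocab or is a special token."""
    new_w_map, new_emb_array = {}, []
    for word, idx in w_map.items():
        if word in vocab or word in special:
            new_w_map[word] = len(new_emb_array)
            new_emb_array.append(emb_array[idx])
    return new_w_map, new_emb_array


# Made-up sample data: the corpus has "Japan", the embedding file has "japan".
sample_lines = [
    "-DOCSTART- -X- O O\n",
    "\n",
    "Japan NNP I-NP I-LOC\n",
    "wins VBZ I-VP O\n",
]
sample_w_map = {'<unk>': 0, 'japan': 1, 'wins': 2, 'france': 3}
sample_emb = [[0.0], [1.0], [2.0], [3.0]]

sample_vocab = collect_vocab(sample_lines)
new_w_map, new_emb = filter_embeddings(sample_w_map, sample_emb, sample_vocab)
```

With only word = line[0], the "japan" row would be dropped in this sample; with both forms in the vocabulary it is kept, and exact-case matches still work for embedding files that are case-sensitive.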

LiyuanLucasLiu mentioned this issue Apr 2, 2019

Update encode_folder.py #25

Merged

LiyuanLucasLiu closed this as completed Apr 2, 2019
