Mistake when constructing new_w_map #24

Closed
CN-AlbertWu96 opened this issue Apr 2, 2019 · 1 comment

@CN-AlbertWu96

In encode_folder.py, filter_words narrows the word mapping loaded from a pre-trained embedding file such as glove.100.pk, keeping only the embeddings of words that appear in the documents (train & test). Words in the documents can contain capital letters, but the words in the pre-trained embedding file are all lowercase, so any word that appears only in capitalized form is dropped. For example, if the training set contains "Japan" but not "japan", the embedding of "japan" in glove.100.pk is filtered out and can no longer be looked up.
We should change word = line[0] to word = line[0].lower()

def filter_words(w_map, emb_array, ck_filenames):
    vocab = set()
    for filename in ck_filenames:
        for line in open(filename, 'r'):
            if not (line.isspace() or (len(line) > 10 and line[0:10] == '-DOCSTART-')):
                line = line.rstrip('\n').split()
                assert len(line) >= 3, 'wrong ck file format'
                word = line[0]
                vocab.add(word)
    new_w_map = {}
    new_emb_array = []
    # obtain the embedding of words appear in both wmap and vocab
    for (word, idx) in w_map.items():
        if word in vocab or word in ['<unk>', '<s>', '< >', '<\n>']:
            assert word not in new_w_map, "%s appears twice in ebd file"%word
            new_w_map[word] = len(new_emb_array)
            new_emb_array.append(emb_array[idx])
    print('filtered %d --> %d' % (len(emb_array), len(new_emb_array)))
    return new_w_map, new_emb_array
@LiyuanLucasLiu
Collaborator

Nice catch! Better to add both word and word.lower() :-) #25
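
A minimal sketch of the suggestion above, not the actual change made in #25. It assumes the same ck file format as the quoted function (at least three whitespace-separated fields per token line). The helper names collect_vocab and filter_embeddings are hypothetical and are split out only so the example can run on in-memory data.

```python
SPECIAL_TOKENS = {'<unk>', '<s>', '< >', '<\n>'}


def collect_vocab(lines):
    """Collect every token together with its lowercase form.

    Hypothetical helper: `lines` is any iterable of ck-format lines.
    """
    vocab = set()
    for line in lines:
        # Skip blank lines and document separators, as in the original loop.
        if not line.strip() or line.startswith('-DOCSTART-'):
            continue
        fields = line.rstrip('\n').split()
        assert len(fields) >= 3, 'wrong ck file format'
        word = fields[0]
        vocab.add(word)          # surface form, e.g. "Japan"
        vocab.add(word.lower())  # lowercase form, e.g. "japan"
    return vocab


def filter_embeddings(w_map, emb_array, vocab):
    """Keep only embeddings whose word is in vocab or is a special token."""
    new_w_map = {}
    new_emb_array = []
    for word, idx in w_map.items():
        if word in vocab or word in SPECIAL_TOKENS:
            new_w_map[word] = len(new_emb_array)
            new_emb_array.append(emb_array[idx])
    return new_w_map, new_emb_array


# Toy data standing in for the ck files and the pre-trained embedding.
lines = ['-DOCSTART- -X- -X- O\n', '\n', 'Japan NNP I-NP I-LOC\n']
w_map = {'<unk>': 0, 'japan': 1, 'france': 2}
emb_array = [[0.0], [0.1], [0.2]]

vocab = collect_vocab(lines)
new_w_map, new_emb_array = filter_embeddings(w_map, emb_array, vocab)
print(new_w_map)  # 'japan' is kept even though the corpus only has "Japan"
```

Adding both forms keeps any cased entries an embedding file might contain while still matching lowercase-only files such as glove.100.pk.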
