In encode_folder.py, filter_words narrows the word mapping loaded from a pre-trained embedding file, such as glove.100.pk, down to the words that appear in the documents (train & test). The words in the documents can contain capital letters, but the pre-trained embedding file contains only lowercase words, so any word that has a capital letter is ignored. For example, if the training set contains "Japan" but not "japan", the embedding of "japan" in glove.100.pk is filtered out and we cannot use it.
We should change word = line[0] to word = line[0].lower().
    def filter_words(w_map, emb_array, ck_filenames):
        vocab = set()
        for filename in ck_filenames:
            for line in open(filename, 'r'):
                if not (line.isspace() or (len(line) > 10 and line[0:10] == '-DOCSTART-')):
                    line = line.rstrip('\n').split()
                    assert len(line) >= 3, 'wrong ck file format'
                    word = line[0]
                    vocab.add(word)
        new_w_map = {}
        new_emb_array = []
        # obtain the embedding of words appear in both wmap and vocab
        for (word, idx) in w_map.items():
            if word in vocab or word in ['<unk>', '<s>', '< >', '<\n>']:
                assert word not in new_w_map, "%s appears twice in ebd file" % word
                new_w_map[word] = len(new_emb_array)
                new_emb_array.append(emb_array[idx])
        print('filtered %d --> %d' % (len(emb_array), len(new_emb_array)))
        return new_w_map, new_emb_array
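To illustrate the effect of the proposed change, here is a minimal, self-contained sketch. It is not the repository's code: the helpers build_vocab and filter_embeddings, the lowercase flag, and the toy data are all hypothetical, and the sketch reads lines from a list instead of from the ck files. It compares the filtered mapping with and without lowercasing the document words.

```python
def build_vocab(lines, lowercase=True):
    """Collect the first column of CoNLL-style lines (hypothetical helper)."""
    vocab = set()
    for line in lines:
        # Skip blank lines and document separators, as filter_words does.
        if line.isspace() or line.startswith('-DOCSTART-'):
            continue
        fields = line.rstrip('\n').split()
        assert len(fields) >= 3, 'wrong ck file format'
        word = fields[0]
        # Proposed change: lowercase before the lookup so that "Japan"
        # can match the lowercase key "japan" in the embedding map.
        vocab.add(word.lower() if lowercase else word)
    return vocab


def filter_embeddings(w_map, emb_array, vocab,
                      special=('<unk>', '<s>', '< >', '<\n>')):
    """Keep only embeddings whose word is in vocab or is a special token."""
    new_w_map, new_emb_array = {}, []
    for word, idx in w_map.items():
        if word in vocab or word in special:
            new_w_map[word] = len(new_emb_array)
            new_emb_array.append(emb_array[idx])
    return new_w_map, new_emb_array


# Toy data (assumed for illustration): the document has "Japan",
# while the embedding map only has the lowercase "japan".
lines = ['-DOCSTART- -X- O O\n', '\n',
         'Japan NNP B-NP B-LOC\n', 'wins VBZ B-VP O\n']
w_map = {'<unk>': 0, 'japan': 1, 'wins': 2, 'china': 3}
emb_array = [[0.0], [0.1], [0.2], [0.3]]

cased = filter_embeddings(w_map, emb_array, build_vocab(lines, lowercase=False))
lowered = filter_embeddings(w_map, emb_array, build_vocab(lines, lowercase=True))

print(sorted(cased[0]))    # "japan" is dropped without lowercasing
print(sorted(lowered[0]))  # "japan" is kept with lowercasing
```

Without lowercasing, the entry for "japan" is filtered out even though "Japan" occurs in the data. With lowercasing it survives, which is the behavior this issue asks for.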