Fix bug in load_from_files_multilabel()
When loading documents from a folder, the order in which they initially
appear in the ``labels_path`` file wasn't kept because a dictionary was
used to construct the (document name, labels) pairs. This issue
sometimes caused the pytest test at line 252 of the "test_util.py"
file to fail.
sergioburdisso committed May 13, 2020
1 parent 313a9a4 commit 7ce1844
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion pyss3/util.py
@@ -1669,15 +1669,17 @@ def load_from_files_multilabel(docs_path, labels_path, sep_label=None, sep_doc='
     doc_labels_raw = [re.split(sep_label, l.rstrip())
                       for l in flabels.read().split('\n')]
     doc_labels = {}
+    doc_names = []
 
     for doc_name, label in doc_labels_raw:
         if doc_name not in doc_labels:
             doc_labels[doc_name] = [label]
+            doc_names.append(doc_name)
         else:
             doc_labels[doc_name].append(label)
         cat_info[label] += 1
 
-    for doc_name in tqdm(doc_labels, desc="Loading documents"):
+    for doc_name in tqdm(doc_names, desc="Loading documents"):
         file_name = doc_name + ".txt" if '.' not in doc_name else doc_name
         with open(path.join(docs_path, file_name), "r", encoding=ENCODING) as fdoc:
             x_data.append(fdoc.read())
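
The pattern this change relies on can be sketched in isolation. The snippet below is a minimal illustration, not the library's code: the helper name `group_labels_in_order` and the sample pairs are hypothetical, and file reading is left out. It records each document name in a separate list the first time it is seen, so the iteration order no longer depends on how the dictionary orders its keys. Python only guarantees insertion-ordered dictionaries from version 3.7 onward, which is presumably why the original failure was intermittent across environments.

```python
from collections import defaultdict


def group_labels_in_order(pairs):
    """Group labels by document name, keeping first-appearance order.

    `pairs` is an iterable of (doc_name, label) tuples, standing in for
    the parsed lines of the labels file (hypothetical input).
    """
    doc_labels = {}              # doc_name -> list of labels
    doc_names = []               # doc names in order of first appearance
    cat_info = defaultdict(int)  # label -> number of occurrences

    for doc_name, label in pairs:
        if doc_name not in doc_labels:
            doc_labels[doc_name] = [label]
            doc_names.append(doc_name)  # remember the order explicitly
        else:
            doc_labels[doc_name].append(label)
        cat_info[label] += 1

    return doc_names, doc_labels, dict(cat_info)


# Hypothetical sample: "doc2" appears before "doc1" in the labels file.
pairs = [("doc2", "sports"), ("doc1", "politics"), ("doc2", "health")]
doc_names, doc_labels, cat_info = group_labels_in_order(pairs)
```

Iterating over `doc_names` rather than over `doc_labels` keeps the loaded documents aligned with the order of the labels file on any interpreter.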