Hard to understand error when MARC-ja dataset is not downloaded correctly #7

Open
shunk031 opened this issue Jul 28, 2023 · 0 comments · May be fixed by #8

Comments

Owner

shunk031 commented Jul 28, 2023

The following error occurred when I ran lm-evaluation-harness (jp-stable/JGLUE) and the MARC-ja dataset did not download correctly. The root cause turned out to be a failed download caused by poor network conditions.

Selected Tasks: ['jsquad-1.1-0.3', 'jcommonsenseqa-1.1-0.3', 'jnli-1.1-0.3', 'marc_ja-1.1-0.3']
Using device 'cuda'
You are using config.init_device='cpu', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.
/home/shunk031/lm-evaluation-harness/lm_eval/tasks/ja/jsquad.py:75: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate
  self.jasquad_metric = datasets.load_metric(jasquad.__file__)
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 8501.97it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 2054.69it/s]
Traceback (most recent call last):
  File "/home/shunk031/lm-evaluation-harness/main.py", line 122, in <module>
    main()
  File "/home/shunk031/lm-evaluation-harness/main.py", line 91, in main
    results = evaluator.simple_evaluate(
  File "/home/shunk031/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/home/shunk031/lm-evaluation-harness/lm_eval/evaluator.py", line 82, in simple_evaluate
    task_dict = lm_eval.tasks.get_task_dict(tasks)
  File "/home/shunk031/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 373, in get_task_dict
    task_name_dict = {
  File "/home/shunk031/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 374, in <dictcomp>
    task_name: get_task(task_name)()
  File "/home/shunk031/lm-evaluation-harness/lm_eval/base.py", line 430, in __init__
    self.download(data_dir, cache_dir, download_mode)
  File "/home/shunk031/lm-evaluation-harness/lm_eval/base.py", line 459, in download
    self.dataset = datasets.load_dataset(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/load.py", line 2133, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1027, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/shunk031--JGLUE/eed55a4f1c560114b29786d11eed4fc793f35c3b2aa9efdf5352c0bd85016b36/JGLUE.py", line 535, in _split_generators
    return self.__split_generators_marc_ja(dl_manager)
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/shunk031--JGLUE/eed55a4f1c560114b29786d11eed4fc793f35c3b2aa9efdf5352c0bd85016b36/JGLUE.py", line 503, in __split_generators_marc_ja
    split_dfs = preprocess_for_marc_ja(
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/shunk031--JGLUE/eed55a4f1c560114b29786d11eed4fc793f35c3b2aa9efdf5352c0bd85016b36/JGLUE.py", line 405, in preprocess_for_marc_ja
    df = df[["review_body", "star_rating", "review_id"]]
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5877, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5938, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['review_body', 'star_rating', 'review_id'], dtype='object')] are in the [columns]"

This error alone is not enough to tell that the data was not loaded correctly. A more informative error is needed, for example one that displays the contents of the data frame.
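
As a rough illustration of what a clearer error could look like, here is a minimal sketch of a column check that could run just before the column selection in `preprocess_for_marc_ja`. The names `DatasetDownloadError` and `check_required_columns` are hypothetical and not part of JGLUE.py, and the wording of the message is only a suggestion. The helper deliberately takes plain column names and a preview string, so it does not depend on pandas itself.

```python
from typing import Iterable, Sequence


class DatasetDownloadError(ValueError):
    """Hypothetical error type for data that looks incomplete or corrupted."""


def check_required_columns(
    actual_columns: Iterable[str],
    required_columns: Sequence[str],
    preview: str = "",
) -> None:
    """Raise a descriptive error if any required column is missing."""
    actual = list(actual_columns)
    missing = [name for name in required_columns if name not in actual]
    if not missing:
        return

    message = (
        f"Missing expected columns {missing}; found columns {actual}. "
        "The downloaded file may be incomplete or corrupted; "
        "consider deleting the cached file and downloading it again."
    )
    if preview:
        # Show what was actually loaded so the failure can be diagnosed.
        message += f"\nFirst rows of the loaded data:\n{preview}"
    raise DatasetDownloadError(message)


# Intended use with pandas, just before the column selection (assumption):
#   check_required_columns(
#       df.columns,
#       ["review_body", "star_rating", "review_id"],
#       preview=df.head().to_string(),
#   )
```

With a check like this, a broken download would fail with a message that names the missing columns and shows the first rows of the data frame, rather than the bare `KeyError` from pandas.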

Related: #9.
