ValueError: BuilderConfig 'rte' not found. Available: ['default'] #24

Closed
Synnai opened this issue Mar 31, 2024 · 2 comments
Comments

@Synnai
Synnai commented Mar 31, 2024

I had to download the GLUE dataset manually from the GLUE-baselines git repo, and I placed it in the directory that cache_dir points to in utils/load_config.py. However, when I ran python train_plms_glue.py --language_model_name roberta-base --dataset_name cola --multitask_training --auxiliary_dataset_name rte --learning_rate 1e-5 --num_runs 5, an error occurred.

The error log is as follows:

INFO:root:********** Run starts. **********
INFO:root:configuration is Namespace(dataset_name='rte_cola', auxiliary_dataset_name='rte', language_model_name='roberta-base', multitask_training=True, batch_size=16, num_epochs=10, learning_rate=1e-05, gpu=0, num_runs=5, device='cuda:0', target_dataset_name='cola', save_model_dir='./save_models/rte_cola/roberta-base_lr1e-05')
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /roberta-base/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 157614.76it/s]
Traceback (most recent call last):
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/train_plms_glue.py", line 93, in <module>
    glue_data_loader.load_multitask_datasets(dataset_names=dataset_names, train_split_ratio_for_val=0.1, max_seq_length=128)
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/utils/glue_data_loader.py", line 110, in load_multitask_datasets
    multiple_datasets = [self.load_dataset(dataset_name=dataset_name, train_split_ratio_for_val=train_split_ratio_for_val,
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/utils/glue_data_loader.py", line 110, in <listcomp>
    multiple_datasets = [self.load_dataset(dataset_name=dataset_name, train_split_ratio_for_val=train_split_ratio_for_val,
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/utils/glue_data_loader.py", line 76, in load_dataset
    dataset = load_dataset(path=os.path.join(cache_dir, "glue"), name=dataset_name)
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/builder.py", line 592, in _create_builder_config
    raise ValueError(

Similarly, when I ran python train_plms_glue.py --language_model_name roberta-base --dataset_name cola --learning_rate 1e-5 --num_runs 5, another error occurred:

INFO:root:********** Run starts. **********
INFO:root:configuration is Namespace(dataset_name='cola', auxiliary_dataset_name='cola', language_model_name='roberta-base', multitask_training=False, batch_size=16, num_epochs=10, learning_rate=1e-05, gpu=0, num_runs=5, device='cuda:0', save_model_dir='./save_models/cola/roberta-base_lr1e-05')
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /roberta-base/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 157614.76it/s]
Traceback (most recent call last):
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/train_plms_glue.py", line 154, in <module>
    train_dataset, val_dataset, test_dataset, num_labels = glue_data_loader.load_dataset(dataset_name=args.dataset_name,
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/utils/glue_data_loader.py", line 76, in load_dataset
    dataset = load_dataset(path=os.path.join(cache_dir, "glue"), name=dataset_name)
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/builder.py", line 592, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'cola' not found. Available: ['default']

Could you please give me any advice on how to fix it?

@yule-BUAA
Owner

Hi,

For the GLUE dataset, we first download the dataset to the cache and then load the cached dataset with this line. I think your issue occurs because the copy you downloaded manually is not in the format that this loading call expects.

To fix this, you just have to uncomment this line and comment out the previous line, which will automatically download the GLUE dataset to the cache_dir.
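
As a rough illustration of the two loading modes described above, here is a minimal sketch. The local-path form mirrors the call shown in the traceback; the Hub form (path="glue" together with cache_dir) is an assumption about what the commented-out line looks like, and the helper names glue_load_kwargs and load_glue are hypothetical, not part of the repository. Depending on the installed datasets version, the Hub identifier may need to be adjusted.

```python
import os


def glue_load_kwargs(dataset_name, cache_dir, download=True):
    """Build keyword arguments for datasets.load_dataset (hypothetical helper).

    download=True  -> fetch GLUE from the Hugging Face Hub and cache it in cache_dir
                      (assumed form of the line to uncomment).
    download=False -> read a copy already stored under cache_dir/glue
                      (the call that appears in the traceback).
    """
    if download:
        return {"path": "glue", "name": dataset_name, "cache_dir": cache_dir}
    return {"path": os.path.join(cache_dir, "glue"), "name": dataset_name}


def load_glue(dataset_name, cache_dir, download=True):
    # Imported lazily so the helper above can be inspected without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(**glue_load_kwargs(dataset_name, cache_dir, download))
```

Calling load_glue("cola", cache_dir) once with download=True should populate the cache; whether later runs can switch back to the local-path form depends on how the repository stores the dataset on disk.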

Hope this helps! Feel free to ask if there are any further questions.

@Synnai
Author

Synnai commented Apr 1, 2024

It works. Thanks a lot!

@Synnai Synnai closed this as completed Apr 1, 2024