Multi30K dataset link is broken #1756

xuzhao9 · 2022-06-01T21:26:36Z

The link to Multi30K dataset at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz is broken:

text/torchtext/datasets/multi30k.py

Line 16 in 73bf4fa

"train": r"http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz",

parmeet · 2022-06-02T02:29:35Z

Yupp, looks like the server is down :(.

muskbing · 2022-06-08T00:38:56Z

> Yupp, looks like the server is down :(.

Any solutions?

2401ch · 2022-06-09T06:38:18Z

I'm hitting the same problem. Is there any solution?

2401ch · 2022-06-09T06:46:41Z

Nayef211 · 2022-06-09T06:56:36Z

@chenghan1995 @muskbing unfortunately, we're not responsible for hosting the datasets. I'd recommend waiting for their server to come back up or reaching out directly to the organization that hosts the dataset. In this case this would be the University of Sheffield.

cc @parmeet I wonder if you know of a way to get in contact with the team that hosts this dataset?

muskbing · 2022-06-09T07:05:25Z

> @chenghan1995 @muskbing unfortunately, we're not responsible for hosting the datasets. I'd recommend waiting for their server to come back up or reaching out directly to the organization that hosts the dataset. In this case this would be the University of Sheffield.
>
> cc @parmeet I wonder if you know of a way to get in contact with the team that hosts this dataset?

I've sent an email to the owner of the dataset, listed at
https://www.statmt.org/wmt16/multimodal-task.html#task1

The email address is: lspecia@gmail.com

But there has been no response.

I wonder whether anyone here has the data file?

neychev · 2022-06-15T16:03:43Z

Found a local copy of the dataset and uploaded it to github (it's rather small). For now it is available via this link: https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k

Just in case: all rights belong to the original authors of the dataset; this is only a temporary copy for convenience.

muskbing · 2022-06-16T08:17:39Z

> Found a local copy of the dataset and uploaded it to github (it's rather small). For now it is available via this link: https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k
>
> Just in case: all rights belong to the original authors of the dataset; this is only a temporary copy for convenience.

Thanks bro, you're really awesome

neychev · 2022-06-16T10:06:06Z

Please refer to the next answer for an updated example.

~~Example code to make it work (tested on Colab):~~

!pip install torchdata
!mkdir -p ~/.torchtext/cache/Multi30k
!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz
!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz
!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz

from torchtext.datasets import Multi30k
train_iter = Multi30k(split="train")

rrmina · 2022-06-20T02:46:44Z

> Found a local copy of the dataset and uploaded it to github (it's rather small). For now it is available via this link: https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k
>
> Just in case: all rights belong to the original authors of the dataset; this is only a temporary copy for convenience.

Thank you! This worked for train and valid but not for test :( .

The test file that torchtext (and torchdata) tries to download is from http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz

Does anyone have the mmt16_task1_test.tar.gz file for the meantime? Thanks in advance!!!

P.S. I was able to work around the 'test' issue by making another tar.gz from the contents of mmt_task1_test2016.tar.gz and changing the 'test' hash in the torchtext sources, but I assume that other users may not be able to do this.
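A minimal sketch of the repackaging workaround from the P.S., for anyone who wants to try it. It assumes the hosted archive contains files named test2016.<lang> while torchtext looks for test.<lang> (this is what later comments in the thread report). The function name and its parameters are made up for illustration and are not part of torchtext.

```python
import io
import tarfile


def repack_with_renamed_members(src_path, dst_path, old_prefix, new_prefix):
    """Copy a .tar.gz archive, renaming files whose base name starts with old_prefix."""
    with tarfile.open(src_path, "r:gz") as src, tarfile.open(dst_path, "w:gz") as dst:
        for member in src.getmembers():
            if not member.isfile():
                continue  # skip directories and special entries
            data = src.extractfile(member).read()
            head, _, base = member.name.rpartition("/")
            if base.startswith(old_prefix + "."):
                # e.g. "test2016.de" becomes "test.de"
                base = new_prefix + base[len(old_prefix):]
            info = tarfile.TarInfo(name=f"{head}/{base}" if head else base)
            info.size = len(data)
            dst.addfile(info, io.BytesIO(data))
```

The repacked archive has a different hash from the original, so the 'test' hash in the torchtext sources still has to be changed to match it.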

Nayef211 · 2022-06-22T22:16:30Z

Example code to make it work (tested on Colab):

!pip install torchdata

!mkdir -p ~/.torchtext/cache/Multi30k

!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz
!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz
!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt_task1_test2016.tar.gz


# now everything works as intended
from torchtext.datasets import Multi30k
train_iter = Multi30k(split="train")

Just wanted to mention another approach to get Multi30k working with the data you are hosting @neychev. Rather than downloading the data directly using wget, we can programmatically modify the URLs that each split of the dataset is downloaded from, as follows:

from torchtext.datasets import multi30k, Multi30k

# Update URLs to point to data stored by user
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt_task1_test2016.tar.gz"

# Update hash since there is a discrepancy between user hosted test split and that of the test split in the original dataset 
multi30k.MD5["test"] = "876a95a689a2a20b243666951149fd42d9bfd57cbbf8cd2c79d3465451564dd2"

dp = Multi30k(split='train')

As @rrmina mentioned earlier, this approach still doesn't work with the test split. If I try to print the contents of the test split, I don't get any outputs. @neychev do you happen to know what the discrepancy is for mmt16_task1_test.tar.gz between the original test split and the one you host?

As a next step, I also plan to update our Multi30k dataset implementation so we can rely on the data stored in https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k until the dataset in the original server is restored. This way we don't need to rely on any of the above hacks to get this dataset working. 😄
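One way to investigate an empty test split like the one described above is to list what an archive actually contains and compare the names with what the loader expects. This is only a sketch using the standard library; the helper name list_members is made up, and the path in the comment assumes the cache location used earlier in this thread.

```python
import tarfile


def list_members(archive_path):
    """Return the names of the regular files stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Example call (path is an assumption based on the cache directory used above):
# import os
# print(list_members(os.path.expanduser(
#     "~/.torchtext/cache/Multi30k/mmt_task1_test2016.tar.gz")))
```

If the names printed do not match the file names the dataset code filters for, the split will come back empty even though the download and hash check succeed.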

neychev · 2022-06-23T14:40:05Z

Thanks, @Nayef211, @rrmina !

No idea what exactly is wrong with the data; the files above came from ~/.torchtext/cache/Multi30k on the machine of one of my students.

I tried simply renaming the archive (to match the name in the torchtext docs) and the files in it, and changing the MD5 to the correct one, and it seems to work.

Including the approach suggested by @Nayef211, which is way more elegant, the final algorithm should be the following:

from torchtext.datasets import multi30k, Multi30k

# Update URLs to point to data stored by user
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

# Update hash since there is a discrepancy between user hosted test split and that of the test split in the original dataset 
multi30k.MD5["test"] = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"

data_train = Multi30k(split='train')
data_val = Multi30k(split='valid')
data_test = Multi30k(split='test')

Test data has 1000 sentences, which seems correct.

Nayef211 · 2023-07-27T14:10:17Z

Reopening because the servers hosting the dataset seem to be down again. #2194 changes the links to https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/... with a TODO comment to follow up once the servers are back up.

Temporarily disable the T5 tutorial to fix the issue with the dataset that can't be downloaded because the website is down. More info: pytorch/text#1756

013292 · 2023-08-02T03:38:06Z

In addition to commenting out the previous URL, you also need to change the MD5 in torchtext/datasets/multi30k.py.

# URL = {
#     'train': r'http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz',
#     'valid': r'http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz',
#     'test': r'http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz',
# }
# 
# MD5 = {
#     'train': '20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e',
#     'valid': 'a7aa20e9ebd5ba5adce7909498b94410996040857154dab029851af3a866da8c',
#     'test': '0681be16a532912288a91ddd573594fbdd57c0fbb81486eff7c55247e35326c2',
# }

URL = {
    "train": r"https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz",
    "valid": r"https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz",
    "test": r"https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz",
}

MD5 = {
    "train": "20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e",
    "valid": "a7aa20e9ebd5ba5adce7909498b94410996040857154dab029851af3a866da8c",
    "test": "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36",
}
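For anyone who needs to work out the hash of a replacement archive: the values in the dictionaries above are 64 hexadecimal characters long, which is the length of a SHA-256 digest (an MD5 digest would be 32), so the sketch below assumes SHA-256 despite the variable name. The helper name sha256_of_file is made up for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be compared with the entry for the corresponding split before editing the file, which helps tell a hash mismatch apart from a download problem.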

Pein2017 · 2024-01-11T08:54:43Z

Thanks for the instructions. I had to manually extract the mmt16_task1_test.tar.gz file, as it wasn't automatically handled by datasets.Multi30k for some reason. The mmt16 archive contains multiple files, not just the expected test.en and test.de.
Might be worth a note to save others some time!

erno123 · 2024-01-12T20:09:03Z

A simple, general solution was suggested by @Nayef211, a contributor, on 23 June 2022 here: #1756 (comment)

> Rather than downloading the data directly using wget, we can programmatically modify the URLs that each split of the dataset is downloaded from, as follows:

from torchtext.datasets import multi30k, Multi30k

# Update URLs to point to data stored by user
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt_task1_test2016.tar.gz"

# Update hash since there is a discrepancy between user hosted test split and that of the test split in the original dataset 
multi30k.MD5["test"] = "876a95a689a2a20b243666951149fd42d9bfd57cbbf8cd2c79d3465451564dd2"

dp = Multi30k(split='train')

The important point here is that the URL and hash of the wrong mmt16_task1_test.tar.gz file are replaced by those of the correct mmt_task1_test2016.tar.gz file.

But that was somehow forgotten. I figured out the problem and the solution on my own yesterday, and then found this suggested fix today :-(.

@Nayef211 or other contributors, could you implement it?

d1math · 2024-01-21T11:48:12Z

> Thanks for the instructions. I had to manually extract the mmt16_task1_test.tar.gz file, as it wasn't automatically handled by datasets.Multi30k for some reason. The mmt16 archive contains multiple files, not just the expected test.en and test.de. Might be worth a note to save others some time!

It wasn't automatically extracted because the mmt16_task1_test.tar.gz archive contains Apple metadata files (._test.de, ._test.en, and ._test.fr) that match the filter and get extracted instead. It would be good to fix the archive file, but in the meantime this patch for _filter_fn helps it pick the correct file from the archive:

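import torchtext.datasets.multi30k
# Require a "/" right before the file name so that "._test.de" no longer matches.
def _filter_fn(split, language_pair, i, x):
    return f"/{torchtext.datasets.multi30k._PREFIX[split]}.{language_pair[i]}" in x[0]
torchtext.datasets.multi30k._filter_fn = _filter_fn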
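To illustrate why the leading "/" matters, here is a small standalone sketch that needs no torchtext. It assumes the unpatched filter is a plain substring test on the member path, which is what the patch above implies; the function names and the example paths are made up.

```python
def loose_match(prefix, lang, path):
    """Plain substring test: also matches names like '._test.de'."""
    return f"{prefix}.{lang}" in path


def strict_match(prefix, lang, path):
    """Require a path separator right before the file name, as in the patch."""
    return f"/{prefix}.{lang}" in path


members = [
    "/cache/mmt16_task1_test.tar.gz/._test.de",  # hypothetical Apple metadata entry
    "/cache/mmt16_task1_test.tar.gz/test.de",    # hypothetical real data entry
]

loose = [m for m in members if loose_match("test", "de", m)]
strict = [m for m in members if strict_match("test", "de", m)]
print(len(loose), len(strict))
```

With the loose test both entries match, so the metadata file can be picked up first; the strict test keeps only the real test.de entry.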

fool2fish · 2024-02-07T07:48:03Z

These URLs work again:

multi30k.URL["train"] = "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz"
multi30k.URL["valid"] = "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz"
multi30k.URL["test"] = "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz"

If the script still doesn't work, you can paste the URLs into a browser to download the files manually and save them to the directory {ROOT}/datasets/Multi30k/, where {ROOT} is the root parameter of torchtext.datasets.Multi30k(root={ROOT}, ...).
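The manual step can also be scripted. The sketch below only assumes the {ROOT}/datasets/Multi30k/ layout described above; both helper names are made up, and fetch_to_root performs a real download when called, so it will fail whenever the server is unreachable.

```python
import os
import urllib.request
from urllib.parse import urlparse


def manual_download_target(root, url):
    """Return the path under {root}/datasets/Multi30k/ where the file should be saved."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(root, "datasets", "Multi30k", filename)


def fetch_to_root(root, url):
    """Download url into the expected directory, creating it if needed."""
    target = manual_download_target(root, url)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    urllib.request.urlretrieve(url, target)
    return target
```

Calling fetch_to_root once per split with the three URLs above places the archives where the loader is described as looking for them.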

k2393937499 · 2024-05-16T02:00:18Z

The files in mmt_task1_test2016.tar.gz are test2016.de and test2016.en, so if I use the URL
"test": r"https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt_task1_test2016.tar.gz",
I have to change _PREFIX from

_PREFIX = {
    "train": "train",
    "valid": "val",
    "test": "test",
}

to

_PREFIX = {
    "train": "train",
    "valid": "val",
    "test": "test2016",
}

I'm new to torch2.x, it's a strange bug LOL.

samdow mentioned this issue Jun 7, 2022

Multi30k can't be downloaded the destination domain can't be reached pytorch/pytorch#79013

Open

Nayef211 mentioned this issue Jul 6, 2022

Fix Multi30k dataset urls #1816

Merged

Nayef211 closed this as completed in #1816 Jul 6, 2022

Nayef211 mentioned this issue Dec 7, 2022

update multi30k test dataset hash #2003

Merged

Nayef211 mentioned this issue Mar 6, 2023

build_vocab_from_iterator does not work in notebook pytorch/tutorials#1938

Closed

Nayef211 mentioned this issue Jul 27, 2023

Update links to multi30k dataset since original servers are down #2194

Merged

Nayef211 reopened this Jul 27, 2023

svekars pushed a commit to pytorch/tutorials that referenced this issue Jul 27, 2023

Temporarily disable the T5 tutorial

a0bea18

Temporarily disable the T5 tutorial to fix the issue with the dataset that can't be downloaded because the website is down. More info: pytorch/text#1756

svekars mentioned this issue Jul 27, 2023

Temporarily disable the T5 tutorial pytorch/tutorials#2511

Merged

svekars pushed a commit to pytorch/tutorials that referenced this issue Jul 27, 2023

Temporarily disable the T5 tutorial (#2511)

c32ad93

Temporarily disable the T5 tutorial to fix the issue with the dataset that can't be downloaded because the website is down. More info: pytorch/text#1756

foxy6624 mentioned this issue Feb 22, 2024

UTF-8 error with testing set of torchtext.datasets.Multi30k(language_pair=("de", "en")). #2221

Open
