This repository contains data readers and examples for the different datasets provided by the Shifts Project.
The Shifts Dataset contains curated and labeled examples of real, 'in-the-wild' distributional shifts across five large-scale tasks. Specifically, it contains data for the white matter multiple sclerosis lesion segmentation and vessel power estimation tasks currently used in Shifts Challenge 2022, as well as data for the tabular weather prediction, machine translation, and vehicle motion prediction tasks used in Shifts Challenge 2021. Dataset shift is ubiquitous in all of these tasks and modalities.
The dataset, assessment metrics and benchmark results are detailed in our associated papers:
- Shifts 2.0: Extending The Dataset of Real Distributional Shifts (2022)
- Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks (2021)
If you use the Shifts datasets in your work, please cite our papers using the following BibTeX:
@misc{https://doi.org/10.48550/arxiv.2206.15407,
author = {Malinin, Andrey and Athanasopoulos, Andreas and Barakovic, Muhamed and Cuadra, Meritxell Bach and Gales, Mark J. F. and Granziera, Cristina and Graziani, Mara and Kartashev, Nikolay and Kyriakopoulos, Konstantinos and Lu, Po-Jui and Molchanova, Nataliia and Nikitakis, Antonis and Raina, Vatsal and La Rosa, Francesco and Sivena, Eli and Tsarsitalidis, Vasileios and Tsompopoulou, Efi and Volf, Elena},
title = {Shifts 2.0: Extending The Dataset of Real Distributional Shifts},
publisher = {arXiv},
year = {2022},
doi = {10.48550/ARXIV.2206.15407},
url = {https://arxiv.org/abs/2206.15407}
}
@article{shifts2021,
author = {Malinin, Andrey and Band, Neil and Ganshin, Alexander and Chesnokov, German and Gal, Yarin and Gales, Mark J. F. and Noskov, Alexey and Ploskonosov, Andrey and Prokhorenkova, Liudmila and Provilkov, Ivan and Raina, Vatsal and Raina, Vyas and Roginskiy, Denis and Shmatova, Mariya and Tigar, Panos and Yangel, Boris},
title = {Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks},
journal = {arXiv preprint arXiv:2107.07455},
year = {2021},
}
If you have any questions about the Shifts Dataset, the paper or the benchmarks, please contact am969@yandex-team.ru.
The Shifts datasets are released under different licenses.
The data is distributed under the CC BY NC SA 4.0 license and can be downloaded after signing the OFSEP data usage agreement.
The data are released under the CC BY NC SA 4.0 license. By downloading the data, you accept and agree to the terms of the CC BY NC SA 4.0 license.
The vessel power estimation dataset consists of measurements sampled every minute from sensors on-board a merchant ship over a span of 4 years, cleaned and augmented with weather data from a third-party provider. We also provide a synthetic benchmark dataset that contains the same splits and input features as in the real data, but the target power labels are replaced with predictions of an analytical physics-based vessel model.
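The relationship between the real and synthetic vessel power data can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Record` fields, the sample values, and `toy_physics_model` (a cubic speed-power rule with invented coefficients) are hypothetical and do not reflect the actual dataset schema or the analytical vessel model used by the authors. The sketch only shows the idea that input features are kept unchanged while the target is replaced by a model prediction.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Record:
    """One per-minute measurement. Field names are hypothetical."""
    speed_knots: float    # input feature: vessel speed
    wind_speed_ms: float  # input feature: weather data joined to the sensor data
    power_kw: float       # target: propulsion power


def toy_physics_model(r: Record) -> float:
    """Toy stand-in for an analytical vessel model (coefficients are invented):
    power grows with the cube of speed, plus a small wind penalty."""
    return 2.0 * r.speed_knots ** 3 + 5.0 * r.wind_speed_ms


def make_synthetic(real: List[Record],
                   model: Callable[[Record], float]) -> List[Record]:
    """Keep every input feature and replace only the target with the model output."""
    return [replace(r, power_kw=model(r)) for r in real]


# Two made-up rows standing in for one split of the real data.
real_split = [
    Record(speed_knots=10.0, wind_speed_ms=4.0, power_kw=2150.0),
    Record(speed_knots=12.0, wind_speed_ms=6.0, power_kw=3600.0),
]
synthetic_split = make_synthetic(real_split, toy_physics_model)

for real_row, synth_row in zip(real_split, synthetic_split):
    # Inputs are identical across the two versions; only the target differs.
    print(real_row.power_kw, "->", synth_row.power_kw)
```

Because the synthetic split has the same rows and input features as the real one, a model can be evaluated on both without changing its input pipeline.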
The Shifts Weather Prediction Dataset is released under the CC BY NC SA 4.0 license. This dataset was constructed by combining features from publicly available weather prediction services and models. Specifically, we combined data from NOAA/NWS servers, data generated by the WRF model from NCAR/UCAR, and data from the Meteorological Service of Canada. Ground station readings were taken from [NOAA](https://www.weather.gov/disclaimer). The data was cleaned and the features were standardized.
The Shifts Machine Translation Dataset is released under a mixed license.
GlobalVoices evaluation data is released under CC BY NC SA 4.0.
The English source data was taken from GlobalVoices (originally licensed under CC BY 3.0), and the target Russian translations were provided by Yandex in-house professional translators.
The source-side text for the Reddit development and evaluation datasets exists under the terms of the Reddit API. The target-side Russian sentences were obtained by Yandex via in-house professional translators and are released under CC BY NC SA 4.0. We highlight that the development set source sentences are the same ones used in the MTNT dataset.
The Shifts SDC Motion Prediction Dataset is released under the CC BY NC SA 4.0 license.
By downloading the Shifts Dataset, you automatically agree to the licenses described above.
Preprocessed data can be downloaded from Zenodo. The baseline models can be downloaded from this link. All the code for the baseline models and uncertainty measures is provided in the "shifts/mswml/" folder.
For the synthetic data, the canonical partitions of the training and development data can be downloaded from Zenodo.
The synthetic evaluation and generalization sets and the canonical partitions of the real data will be gradually made available based on the timeline of the Shifts Challenge 2022.
The canonical partition of the training, development and evaluation data can be downloaded here. The full dataset can be downloaded here. Baseline models can be downloaded here.
The weather and motion prediction data should be as simple to load as the development data.
The training data for this task is the WMT'20 En-Ru dataset, which can be downloaded here. The development data can be downloaded here and the evaluation data can be downloaded here. All data is automatically downloaded via the scripts provided here. Baseline models can be downloaded here.
A description of how to process the evaluation data for the translation track is provided here.
The canonical partition of the training and development data can be downloaded here. The canonical partition of the evaluation data can be downloaded here. The full, unpartitioned dataset is available here. Baseline models can be downloaded here.