Memory-Mapped IndexedDataset implementation #589

davidecaroselli · 2019-03-20T12:55:17Z

Following discussion in #574:

- Implemented MMapIndexedDataset and MMapIndexedDatasetBuilder, compatible with IndexedDataset/IndexedDatasetBuilder
- Updated scripts/read_binarized.py to support the new MMapIndexedDataset
- Replaced the '--raw-text' and '--lazy-load' options with '--dataset-impl', and moved the option definition from the custom task args to the higher-level options.add_dataset_args() (more appropriate)
- Also implemented utility functions in indexed_dataset: make_dataset(), dataset_exists()
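
For readers unfamiliar with the idea, here is a minimal sketch of a memory-mapped indexed dataset. It is not the code in this PR: `TinyMMapBuilder`, `TinyMMapDataset`, and the file layout (a flat `.bin` of tokens plus a `.idx.npy` of item lengths) are hypothetical names and choices made for illustration, with NumPy's `np.memmap` standing in for the real on-disk format.

```python
import os
import tempfile

import numpy as np


class TinyMMapBuilder:
    """Hypothetical builder: appends token arrays to one flat binary file."""

    def __init__(self, prefix, dtype=np.int64):
        self._prefix = prefix
        self._dtype = dtype
        self._data = open(prefix + ".bin", "wb")
        self._sizes = []

    def add_item(self, tokens):
        arr = np.asarray(tokens, dtype=self._dtype)
        self._data.write(arr.tobytes(order="C"))
        self._sizes.append(arr.size)

    def finalize(self):
        self._data.close()
        # The index stores only item lengths; offsets are derived on load.
        np.save(self._prefix + ".idx.npy", np.asarray(self._sizes, dtype=np.int64))


class TinyMMapDataset:
    """Hypothetical reader: each item is a view into the memory-mapped file."""

    def __init__(self, prefix, dtype=np.int64):
        self.sizes = np.load(prefix + ".idx.npy")
        # Start offset (in elements) of each item: exclusive prefix sum of sizes.
        self._offsets = np.concatenate(([0], np.cumsum(self.sizes)[:-1]))
        # mode="r" maps the file read-only; pages are loaded on first access.
        self._data = np.memmap(prefix + ".bin", dtype=dtype, mode="r")

    def __len__(self):
        return len(self.sizes)

    def __getitem__(self, i):
        start = self._offsets[i]
        return self._data[start:start + self.sizes[i]]


# Usage: write three "sentences", then read one back without loading the file.
prefix = os.path.join(tempfile.mkdtemp(), "train")
builder = TinyMMapBuilder(prefix)
for sentence in ([4, 8, 15], [16, 23], [42]):
    builder.add_item(sentence)
builder.finalize()

dataset = TinyMMapDataset(prefix)
print(len(dataset), dataset[1].tolist())
```

Because items are slices of the mapping, only the pages that are actually touched need to be resident in RAM, which is the motivation for this kind of dataset.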

facebook-github-bot

@myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

myleott

Looks great! Can you please also update the docs for --lazy-load: https://github.com/pytorch/fairseq/blob/master/docs/getting_started.rst

myleott · 2019-03-29T04:56:50Z

fairseq/tasks/language_modeling.py

-        parser.add_argument('--lazy-load', action='store_true',
-                            help='load the dataset lazily')
-        parser.add_argument('--raw-text', default=False, action='store_true',
-                            help='load raw text dataset')


Let's provide backward compatibility for these. Add back these flags, change the description to indicate [deprecated] and add logic here: https://github.com/pytorch/fairseq/blob/master/fairseq/options.py#L111-L115.

Something like:

    if getattr(args, 'raw_text', False):
        utils.deprecation_warning('--raw-text is deprecated, please use --dataset-impl=raw')
        args.dataset_impl = 'raw'
    elif getattr(args, 'lazy_load', False):
        utils.deprecation_warning('--lazy-load is deprecated, please use --dataset-impl=lazy')
        args.dataset_impl = 'lazy'

davidecaroselli · May 5, 2019

I still have to include these changes; I will work on this asap!
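
The backward-compatibility suggestion above can be tried outside fairseq with a small self-contained sketch. Everything here is an assumption for illustration: `build_parser` and `resolve_dataset_impl` are hypothetical helpers, the `--dataset-impl` choices and default are illustrative, and the standard-library `warnings` module stands in for `utils.deprecation_warning`.

```python
import argparse
import warnings


def build_parser():
    parser = argparse.ArgumentParser()
    # Illustrative choices and default, not necessarily fairseq's.
    parser.add_argument("--dataset-impl", default="mmap",
                        choices=["raw", "lazy", "mmap"],
                        help="dataset implementation to use")
    # Old flags are kept so existing command lines still parse.
    parser.add_argument("--raw-text", action="store_true",
                        help="[deprecated] use --dataset-impl=raw")
    parser.add_argument("--lazy-load", action="store_true",
                        help="[deprecated] use --dataset-impl=lazy")
    return parser


def resolve_dataset_impl(args):
    """Map the deprecated flags onto args.dataset_impl, warning the user."""
    if getattr(args, "raw_text", False):
        warnings.warn("--raw-text is deprecated, please use --dataset-impl=raw",
                      DeprecationWarning)
        args.dataset_impl = "raw"
    elif getattr(args, "lazy_load", False):
        warnings.warn("--lazy-load is deprecated, please use --dataset-impl=lazy",
                      DeprecationWarning)
        args.dataset_impl = "lazy"
    return args


# Usage: an old-style command line still works and emits one warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    args = resolve_dataset_impl(build_parser().parse_args(["--lazy-load"]))
print(args.dataset_impl, len(caught))
```

Using `getattr(..., False)` rather than direct attribute access keeps the check safe for argument namespaces that never defined the old flags.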

Add backward compatibility and fix tests

facebook-github-bot · 2019-05-07T16:01:39Z

@myleott merged this pull request in a1c997b.

ebetica · 2020-01-08T23:15:30Z

fairseq/data/indexed_dataset.py

@@ -262,3 +293,169 @@ def finalize(self, index_file):
        write_longs(index, self.data_offsets)
        write_longs(index, self.sizes)
        index.close()
+
+
+def _warmup_mmap_file(path):


Hey @davidecaroselli, I noticed in your implementation, you require a full read of the dataset at the beginning of training to warm up the mmap file. Is this actually useful? Can I get away with not warming up the mmap file?

davidecaroselli · Jan 9, 2020

Hi! Yes, that is super useful because the warmup ensures the best performance during actual training. However, it is not mandatory, meaning that everything works even without it. So if, for some strange reason, you want to avoid it, just comment out the call!
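
To make the exchange above concrete: one common way to warm up a file before memory-mapping it is to read it once sequentially, so that the operating system's page cache typically already holds its pages when training starts accessing items at random. The sketch below assumes that a plain sequential read is sufficient; `warmup_file` and its `chunk_size` parameter are hypothetical, and this is not claimed to be what `_warmup_mmap_file` does.

```python
import os
import tempfile


def warmup_file(path, chunk_size=100 * 1024 * 1024):
    """Read `path` front to back, discarding the data; return bytes read."""
    total = 0
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    return total


# Usage: create a 1 MiB file, warm it up, then clean up.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(b"\0" * (1024 * 1024))
warmed = warmup_file(tmp.name, chunk_size=64 * 1024)
os.remove(tmp.name)
print(warmed)
```

Skipping this step does not change results. It only moves the cost of the first disk reads from startup into the early training steps, which matches the reply that the warmup is optional.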

Summary:
Following discussion in facebookresearch/fairseq#574:
- Implemented MMapIndexedDataset and MMapIndexedDatasetBuilder compatible with IndexedDataset/IndexedDatasetBuilder
- Update scripts/read_binarized.py to support new MMapIndexedDataset
- Option '--raw-text' and '--lazy-load' replaced with '--dataset-impl' and moved the option definition custom task args to more high-level options.add_dataset_args() (more appropriate)
- Implemented also utils functions in indexed_dataset: make_dataset(), dataset_exists()

Pull Request resolved: facebookresearch/fairseq#589

Differential Revision: D14597128

Pulled By: myleott

fbshipit-source-id: 4e92d99920cbaa52cfe5a0f1f5d9ae5c92d4268e

Davide Caroselli added 9 commits March 19, 2019 18:39

Implemented MMapIndexedDataset and MMapIndexedDatasetBuilder compatible with IndexedDataset/IndexedDatasetBuilder

8f53569

MMapIndexedDataset extends torch.utils.data.Dataset

0eb509e

Further reduction of MMapIndexedDataset .idx file

ec5693f

Update scripts/read_binarized.py to support new MMapIndexedDataset

4b0e81e

fairseq-preprocess now supports '--dataset-impl' option

8a260d4

Option '--raw-text' and '--lazy-load' replaced with '--dataset-impl' and moved the option definition custom task args to more high-level options.add_dataset_args() (more appropriate). Implemented also utils functions in indexed_dataset: make_dataset(), dataset_exists()

5bff335

MMapIndexedDataset now exposes mandatory 'sizes' property

ef2edcc

MMapIndexedDataset default dtype is int64: faster data extraction but larger file on disk

4907112

MMapIndexedDataset implements a trivial warmup procedure

e7538d4

facebook-github-bot added the CLA Signed label Mar 20, 2019

davidecaroselli changed the title ~~Features/mmap dataset~~ Memory-Mapped IndexedDataset implementation Mar 20, 2019

facebook-github-bot reviewed Mar 25, 2019

myleott suggested changes Mar 29, 2019

New version of MMapDataset with measured RAM saving from 5 to 10x

34cc5c3

facebook-github-bot reviewed May 4, 2019

Davide Caroselli added 2 commits May 6, 2019 17:40

Made MMapIndexedDataset picklable

2c947c2

Much faster memorymap file warmup

15f0d10

facebook-github-bot reviewed May 6, 2019

Merge branch 'master' into features/mmap_dataset

998b6d1

facebook-github-bot reviewed May 6, 2019

myleott and others added 3 commits May 6, 2019 20:25

Merge branch 'master' into features/mmap_dataset

85ec62f

Add backward compatibility and fix tests

91101c8

Merge pull request #2 from myleott/features/mmap_dataset

da87e36

Add backward compatibility and fix tests

facebook-github-bot reviewed May 7, 2019

facebook-github-bot closed this in a1c997b May 7, 2019

facebook-github-bot added the Merged label May 7, 2019

ebetica reviewed Jan 8, 2020

davidecaroselli commented Mar 20, 2019

facebook-github-bot left a comment

myleott left a comment

myleott Mar 29, 2019

davidecaroselli May 5, 2019

facebook-github-bot left a comment

facebook-github-bot left a comment

facebook-github-bot left a comment

facebook-github-bot left a comment

facebook-github-bot commented May 7, 2019

ebetica Jan 8, 2020

davidecaroselli Jan 9, 2020
