Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++ Frontend] Implement DataLoader #11918

Closed
wants to merge 4 commits into from

Conversation

goldsborough
Copy link
Contributor

@goldsborough goldsborough commented Sep 21, 2018

This PR implements a DataLoader API for the C++ frontend.

The components present in this API largely match the Python API. It consists of:

  • Datasets: Conceptually a function from a set of indices to a batch of examples;
  • Transforms: A functional transformation of a dataset. A Map<D, T> for Dataset D and transform T is itself a dataset;
  • Samplers: Specify a strategy for generating indices for a new batch;
  • A DataLoader, with the ability to automatically parallelize fetching of samples across multiple worker threads;

Note that collation functions fall naturally out of the Map<Dataset, Transform> abstraction.

Things that are missing right now that maybe should be added:

  • Memory pinning for CUDA tensors

The API was designed to be generalizable to almost any kind of dataset, transform or sampling strategy, while providing a convenient API out of the box. To achieve this, it is quite heavily templatized on various possible input types.

There are many parts to this PR! Right now, I would like feedback on:

  • Your impression of the general usability of the API;
  • Your impression of which parts seem too complex or overthought;
  • The implementation of the parallelization aspects of the DataLoader. I've followed the Python implementation in some matters, but also differ in others. I think my implementation is a little cleaner and decouples components slightly better than the Python dataloader.

I haven't added too many comments yet, as this is fresh out of the oven. Let me know if anything is unclear from the code itself.

There also aren't any tests yet. I will write a comprehensive test suite once we agree on the API and implementation.

@apaszke @ezyang @teng-li @pietern

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goldsborough has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@goldsborough goldsborough force-pushed the cpp-data-loader branch 2 times, most recently from 7d5e865 to d87492e Compare September 23, 2018 17:39
@pietern
Copy link
Contributor

pietern commented Sep 24, 2018

cc @shen-pan @aartibasant @aazzolini

DataLoader(
DatasetType dataset,
DataLoaderOptions options,
SamplerType sampler)

This comment was marked as off-topic.

This comment was marked as off-topic.

template <
typename DatasetType,
typename SamplerType = samplers::RandomSampler<>>
class DataLoader {

This comment was marked as off-topic.


template <
typename DatasetType,
typename SamplerType = samplers::RandomSampler<>,

This comment was marked as off-topic.

This comment was marked as off-topic.

namespace datasets {
class TensorDataset : public Dataset<TensorDataset, TensorExample> {
public:
explicit TensorDataset(std::vector<Tensor> tensors)

This comment was marked as off-topic.

This comment was marked as off-topic.

class DataShuttle {
public:
using JobType = Job;
using ResultType = Result;

This comment was marked as off-topic.

This comment was marked as off-topic.

using typename BatchDataset<S, B>::BatchType;
using ExampleType = E;

virtual ExampleType index(size_t index) = 0;

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

Tensor apply_batch(std::vector<Tensor> tensors) override {
return torch::stack(tensors);
}
};

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

OutputBatchType output_batch;
torch::detail::reserve_capacity(output_batch, input_batch.size());
for (auto&& input : input_batch) {
output_batch.insert(output_batch.end(), apply(std::move(input)));

This comment was marked as off-topic.

return torch::from_blob(buffer.data(), buffer.size(), torch::kByte)
.reshape({count, 1, kImageRows, kImageColumns})
.to(torch::kFloat32)
.div(255);

This comment was marked as off-topic.

This comment was marked as off-topic.


std::vector<char> buffer(count);
labels.read(buffer.data(), count);
return torch::from_blob(buffer.data(), count, torch::kByte).to(torch::kInt64);

This comment was marked as off-topic.

@goldsborough goldsborough force-pushed the cpp-data-loader branch 2 times, most recently from 78b49ed to 4c702b9 Compare October 5, 2018 15:59
Copy link
Contributor

@apaszke apaszke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. I have some comments and suggestions with varying levels of priority, so it would be great if you could take a look at them and see if you want to address those.

enforce_ordering(options.enforce_ordering_) {}

size_t batch_size;
bool drop_last;

This comment was marked as off-topic.

This comment was marked as off-topic.

/// (`max_jobs` => `2 * workers`). In the spirit of properly using the C++ type
/// system, `DataLoaderOptions` allows only setting values. To access values,
/// you must create a `FullDataLoaderOptions` from a `DataLoaderOptions`
/// instance, which will do any necessary coalescing.

This comment was marked as off-topic.

This comment was marked as off-topic.

/// standard algorithms like `std::copy(dataloader.begin(), dataloader.end(),
/// output_iterator)` are supported too.
Iterator<Batch> begin() {
reset();

This comment was marked as off-topic.

This comment was marked as off-topic.


/// Joins the `DataLoader`'s worker threads and drains internal queues.
void join() {
if (joined_) {

This comment was marked as off-topic.

This comment was marked as off-topic.


SourceDataset dataset;
AppliedTransform transform;
};

This comment was marked as off-topic.

/// Fetches the next batch.
void next() override {
// If we didn't get the very first batch yet, get it now.
lazy_initialize();

This comment was marked as off-topic.

This comment was marked as off-topic.


/// A `Sampler` is an object that yields indices with which to index into a
/// dataset.
class Sampler {

This comment was marked as off-topic.

This comment was marked as off-topic.


void reset() override {
index_ = 0;
}

This comment was marked as off-topic.

///
/// auto dataset = datasets::MNIST("path/to/mnist")
/// .map(transforms::Collate<Example<>>([](std::vector<Example<>> e) {
/// return std::move(e.front());

This comment was marked as off-topic.

/// A `Collation` for `Example<Tensor, Tensor>` types that stacks all data
/// tensors into one tensor, and all target (label) tensors into one tensor.
template <>
struct Stack<Example<>> : public Collation<Example<>> {

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.


/// Resets the `Sampler`'s internal state.
/// Typically called before a new epoch.
virtual void reset() = 0;

This comment was marked as off-topic.

@goldsborough goldsborough force-pushed the cpp-data-loader branch 4 times, most recently from 0c7fe9c to d6f81b9 Compare October 18, 2018 14:32
Copy link
Collaborator

@thiagocrepaldi thiagocrepaldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I try to compile this PR, I get the following error:

FAILED: caffe2/torch/CMakeFiles/torch.dir/csrc/api/src/data/datasets/mnist.cpp.o 
/usr/bin/c++  -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMAGMA_V2 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DUSE_CUDA -DUSE_GCC_ATOMICS=1 -D_FILE_OFFSET_BITS=64 -D_THP_CORE -Dtorch_EXPORTS -I../aten/src -I. -I../ -I../third_party/protobuf/src -I../cmake/../third_party/benchmark/include -Icaffe2/contrib/aten -I../third_party/onnx -Ithird_party/onnx -I../torch/csrc/api -I../torch/csrc/api/include -I../torch/../aten/src/TH -Icaffe2/torch/../aten/src/TH -I../torch/../aten/src -Icaffe2/torch/../aten/src -Iaten/src -I../torch/../aten/src/THC -Icaffe2/torch/../aten/src/THC -Icaffe2/torch/../aten/src/ATen -I../torch/csrc -I../c10/.. -isystem third_party/gloo -isystem ../cmake/../third_party/gloo -isystem /home/thiagofc/PyTorch/Miniconda3-4.5.4-Linux-x86_64/envs/PyTorch_dev_3.6/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem ../cmake/../third_party/eigen -isystem /home/thiagofc/PyTorch/Miniconda3-4.5.4-Linux-x86_64/envs/PyTorch_dev_3.6/include/python3.6m -isystem /home/thiagofc/PyTorch/Miniconda3-4.5.4-Linux-x86_64/envs/PyTorch_dev_3.6/lib/python3.6/site-packages/numpy/core/include -isystem ../cmake/../third_party/pybind11/include -isystem ../cmake/../third_party/cub -isystem /usr/local/cuda-9.0/include --std=c++11  -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-unused-but-set-variable -Wno-maybe-uninitialized -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3  -fPIC   -DCUDA_HAS_FP16=1 -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -Wno-maybe-uninitialized -std=gnu++11 -MD -MT caffe2/torch/CMakeFiles/torch.dir/csrc/api/src/data/datasets/mnist.cpp.o -MF caffe2/torch/CMakeFiles/torch.dir/csrc/api/src/data/datasets/mnist.cpp.o.d -o caffe2/torch/CMakeFiles/torch.dir/csrc/api/src/data/datasets/mnist.cpp.o -c ../torch/csrc/api/src/data/datasets/mnist.cpp
../torch/csrc/api/src/data/datasets/mnist.cpp:6:24: fatal error: ATen/Error.h: No such file or directory
compilation terminated.

Aten/Error.h seems to be removed from master and this PR too:

$ find . -name Error.h
./third_party/ComputeLibrary/arm_compute/core/Error.h
./third_party/ComputeLibrary/arm_compute/graph/Error.h

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goldsborough is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@wumo
Copy link

wumo commented Oct 27, 2018

When using the updated DataLoader to run the cpp mnist example. I got the following error

terminate called after throwing an instance of 'c10::Error'
  what():  input.ndimension() == 4 ASSERT FAILED at /pytorch/torch/csrc/api/src/nn/modules/conv.cpp:96, please report a bug to PyTorch. (forward at /pytorch/torch/csrc/api/src/nn/modules/conv.cpp:96)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f5401a94821 in /home/wumo/CLionProjects/learn-torch/lib/libtorch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f5401a940ea in /home/wumo/CLionProjects/learn-torch/lib/libtorch/lib/libc10.so)
frame #2: torch::nn::Conv2dImpl::forward(at::Tensor) + 0x22c (0x7f5430c2684c in /home/wumo/CLionProjects/learn-torch/lib/libtorch/lib/libtorch.so.1)
frame #3: /home/wumo/CLionProjects/learn-torch/bin/learn_torch() [0x410d33]
frame #4: /home/wumo/CLionProjects/learn-torch/bin/learn_torch() [0x415d88]
frame #5: /home/wumo/CLionProjects/learn-torch/bin/learn_torch() [0x409a08]
frame #6: __libc_start_main + 0xf0 (0x7f540113a830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: /home/wumo/CLionProjects/learn-torch/bin/learn_torch() [0x4082e9]

@goldsborough
Copy link
Contributor Author

@wumo This should have been fixed in #13060. Do you have the latest release?

@goldsborough goldsborough deleted the cpp-data-loader branch October 29, 2018 18:03
@ezyang ezyang added the merged label Jun 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants