
Pytorch example with DataLoader adapter, using MNIST data #50

Merged
merged 7 commits on Aug 28, 2018

Conversation

@forbearer:

This code includes an MNIST dataset generator, a pytorch training example that uses the resulting dataset, and a simple README.md.

As can be seen from main.py, there are a few limitations that come to light which could help us improve petastorm:

  • Batch shuffling
  • Support for custom transforms
  • Total data size (or some semblance of it?)

Running pytorch/examples/mnist/main.py (in a Docker container) with the default 10 epochs yielded the following outcome (I just show the test output for the middle 8 epochs):

...
Train Epoch: 1 [59520/60000 (99%)]	Loss: 0.505042

Test set: Average loss: 0.2056, Accuracy: 9395/10000 (94%)

...
Test set: Average loss: 0.1337, Accuracy: 9596/10000 (96%)
Test set: Average loss: 0.1033, Accuracy: 9684/10000 (97%)
Test set: Average loss: 0.0919, Accuracy: 9710/10000 (97%)
Test set: Average loss: 0.0760, Accuracy: 9770/10000 (98%)
Test set: Average loss: 0.0689, Accuracy: 9797/10000 (98%)
Test set: Average loss: 0.0623, Accuracy: 9803/10000 (98%)
Test set: Average loss: 0.0632, Accuracy: 9791/10000 (98%)
Test set: Average loss: 0.0541, Accuracy: 9818/10000 (98%)

...
Train Epoch: 10 [59520/60000 (99%)]	Loss: 0.040862

Test set: Average loss: 0.0505, Accuracy: 9845/10000 (98%)

real	3m3.021s
user	20m4.680s
sys	0m22.228s

With the petastormed variant, the training accuracy looks on par, with somewhat better runtime. I'll show just the test output:

Test set: Average loss: 0.2035, Accuracy: 9385/10000 (94%)
Test set: Average loss: 0.1326, Accuracy: 9591/10000 (96%)
Test set: Average loss: 0.1040, Accuracy: 9675/10000 (97%)
Test set: Average loss: 0.0887, Accuracy: 9705/10000 (97%)
Test set: Average loss: 0.0761, Accuracy: 9752/10000 (98%)
Test set: Average loss: 0.0715, Accuracy: 9774/10000 (98%)
Test set: Average loss: 0.0627, Accuracy: 9797/10000 (98%)
Test set: Average loss: 0.0606, Accuracy: 9810/10000 (98%)
Test set: Average loss: 0.0582, Accuracy: 9824/10000 (98%)
Test set: Average loss: 0.0548, Accuracy: 9828/10000 (98%)

real	2m35.852s
user	2m33.508s
sys	0m6.576s

@forbearer (Author):

My build is failing due to my test program name being non-unique: pytest-dev/pytest#2887

I'll do a simple rename and update this pull request.

@selitvin (Collaborator) left a comment:

Can we also cover mnist training triggered from a test to make sure the example does not go stale?


###
# Adapted to petastorm dataset using original contents from
# pytorch/examples/mnist/main.py .
Collaborator:

Maybe point to the correct git repo: https://github.com/pytorch/examples

Author:

👍

transforms.Normalize((0.1307,), (0.3081,))
])

class BatchMaker(object):
Collaborator:

This object is an equivalent of torch.utils.data.DataLoader. It should become part of the petastorm library: just as tf_utils.tf_tensors() is an adaptor to the tensorflow world, this class should become an adaptor for the pytorch world. Suggest we also name it some sort of Loader. Maybe just petastorm.pytorch.DataLoader?
Can we provide similar features to torch.utils.data.DataLoader? Our Reader implements a bunch of these, so it's ok since we will initialize our Loader with a reader. Other features we need to implement (if not right now, then later). For the mnist example we would need batch_size, shuffle (we can wait for the new Reader flow, which will deliver proper shuffling), collate_fn, and a total_size property (we should be able to sum all rowgroup counts to get the number).

As the ideal end result, a user would simply need to switch instantiation of torch.utils.data.DataLoader to petastorm.pytorch.DataLoader, tweak some parameters, and be done with the migration.
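A minimal sketch of what such an adaptor could look like, assuming the reader is any iterable of named-tuple rows. The class name, constructor signature, and default collate below are all hypothetical illustrations, not petastorm API:

```python
from collections import namedtuple

class DataLoader(object):
    """Hypothetical petastorm.pytorch.DataLoader sketch: wraps a reader (any
    iterable of schema rows) and yields collated batches, mirroring a small
    subset of the torch.utils.data.DataLoader interface."""

    def __init__(self, reader, batch_size=1, collate_fn=None):
        self.reader = reader
        self.batch_size = batch_size
        # Default collate transposes a list of rows into per-field columns.
        self.collate_fn = collate_fn or (lambda rows: tuple(zip(*rows)))

    def __iter__(self):
        batch = []
        for row in self.reader:
            batch.append(row)
            if len(batch) == self.batch_size:
                yield self.collate_fn(batch)
                batch = []
        if batch:
            # Flush the trailing partial batch.
            yield self.collate_fn(batch)

# Illustration with a stand-in reader of (image, digit) rows:
Row = namedtuple('Row', ['image', 'digit'])
batches = list(DataLoader([Row(i, i % 10) for i in range(5)], batch_size=2))
```

With something like this in place, migrating the mnist example would mostly be a matter of swapping the DataLoader import and dropping the torchvision dataset wiring.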

Author:

Ah, so it seems that I shouldn't have split #53 into its own pull request, but should have folded those improvements into here in one go. My initial thought had been to get an initial example going, then refine it.

I'm happy to bring #53 in first, and get us as close as possible to a simple switch.

I like petastorm.pytorch.DataLoader...seems we should make equivalent packages for tf :-)

def __iter__(self):
    batch = []
    for mnist in self.reader:
        batch.append((_image_transform(mnist.image), mnist.digit))
Collaborator:

Can we make this transform automatic as well using information from Unischema?
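One way schema metadata could drive the transform automatically, sketched with stand-in field descriptors. Field, default_transform, and transform_row are hypothetical names, not petastorm API; the rule shown (uint8 fields with spatial dimensions get image-style scaling, everything else passes through with its declared dtype) is just an illustrative heuristic:

```python
from collections import namedtuple

import numpy as np

# Stand-in for petastorm's UnischemaField: just the parts the transform needs.
Field = namedtuple('Field', ['name', 'numpy_dtype', 'shape'])

def default_transform(field):
    """Pick a conversion for a field based on its schema metadata."""
    if field.numpy_dtype == np.uint8 and len(field.shape) >= 2:
        # uint8 with spatial dims: treat as an image, scale into [0, 1].
        return lambda value: np.asarray(value, dtype=np.float32) / 255.0
    # Everything else: coerce to the declared dtype.
    return lambda value: np.asarray(value, dtype=field.numpy_dtype)

def transform_row(schema_fields, row):
    """Apply schema-derived transforms to each named field of a row dict."""
    return {f.name: default_transform(f)(row[f.name]) for f in schema_fields}
```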

@@ -0,0 +1,2 @@
torch
Collaborator:

If we are adding pytorch support to the petastorm library, the dependencies can go as extras into setup.py.

Author:

I did try that yesterday, but adding torch brings in a half-GB package, which could make travis CI flaky. In my case, the torch download stalls for long enough that travis kills it.

What I was hoping for was to not need a dependency on torch. Maybe I ought to aim to do that via this example.

Author:

Actually, the generate function relies on torchvision datasets, which is really convenient.

Then, main.py requires torch for running training, so it would seem to make sense to fold it into our setup.py as extras. I'd like to give that another try and see if the travis run fares better this time around. (Or, there may be a travis option to wait longer for the package download to succeed??)


MnistSchema = Unischema('MnistSchema', [
UnischemaField('idx', np.int_, (), ScalarCodec(IntegerType()), False),
UnischemaField('digit', np.int_, (), ScalarCodec(IntegerType()), False),
UnischemaField('image', np.uint8, (28, 28, 1), NdarrayCodec(), False),
Collaborator:

Do we need the third dimension? For monochromatic image it is usually omitted (i.e. (28, 28)).

Author:

So, here I think I gave in to my incomplete understanding of the training example... the training setup requires 4 dimensions, an array of 3-dimensional image matrices... so it appeared easier to just add this 3rd dim. Or maybe I can store 2 dims, but just reshape to 3 dims in the example!
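For reference, adding the missing axes at read time is a one-liner with numpy (shapes below assume the standard 28x28 MNIST images; the variable names are illustrative):

```python
import numpy as np

# If the schema stores monochrome images as (28, 28), the example can add
# the channel axis (and batch axis) that torch's conv layers expect at
# read time, instead of baking a trailing dimension of 1 into the dataset.
image = np.zeros((28, 28), dtype=np.uint8)   # as stored in the dataset
chw = image[np.newaxis, ...]                 # (1, 28, 28): add channel axis
nchw = chw[np.newaxis, ...]                  # (1, 1, 28, 28): a batch of one
```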

This creates both a `train` and `test` petastorm datasets in `/home/${USER}/dev/datasets/mnist`:

```bash
python generate_petastorm_mnist.py -d ~/dev/data/mnist -o file:///home/${USER}/dev/datasets/mnist
Collaborator:

It's not clear from the README what the -d switch is. Maybe we default it somewhere to a temp directory and omit it from the README altogether?

Author:

Good point. Yeah, or I can opt to not download, so then the directory isn't needed. Removing -d does seem nice.

PYTHONPATH=${PETASTORM_PATH}
```

== Generating a Petastorm Dataset from MNIST Data ==
Collaborator:

Does == render correctly as section markup? I think I had to use # when I was writing the main README.md.

Author:

Oops, good catch!

@@ -0,0 +1,43 @@
== Setup ==
```bash
pip install -r requirements.txt
Collaborator:

Not sure I like the duplicate dependency specification: here and in setup.py.

Collaborator:

Yeah this can likely be changed to something like pip install -e .[torch]
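A sketch of how that could look in setup.py. The package names and extras keys here are illustrative assumptions, not the project's actual setup.py contents:

```python
# Illustrative extras_require for setup.py: `pip install -e .[torch]`
# would then pull in the pytorch example's dependencies, removing the
# duplicate list in requirements.txt.
EXTRAS_REQUIRE = {
    'tf': ['tensorflow'],
    'torch': ['torch', 'torchvision'],
}

# Passed through in setup.py as:
#   setup(..., extras_require=EXTRAS_REQUIRE)
```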

help='Directory to where the MNIST data will be downloaded; default to repository base.')
parser.add_argument('-o', '--output-url', type=str, required=True,
help='hdfs://... or file:/// url where the parquet dataset will be written to.')
parser.add_argument('-m', '--master', type=str, required=False, default=None,
Collaborator:

We should probably have local[*] as the default, to make the script run out of the box with minimal parameters?
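A sketch of that default with argparse, mirroring the flag in the excerpt above (local[*] tells Spark to run locally with as many worker threads as cores):

```python
import argparse

# Defaulting --master to local[*] lets the generator run with no Spark
# cluster configured at all.
parser = argparse.ArgumentParser()
parser.add_argument('-m', '--master', type=str, required=False, default='local[*]',
                    help='Spark master; defaults to local[*] to run out of the box.')
args = parser.parse_args([])
```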

session_builder = SparkSession \
.builder \
.appName('MNIST Dataset Creation') \
.config('spark.executor.memory', '1g') \
Collaborator:

Can we go with the default memory sizes and reduce the bloat?

Author:

K!


base_dir = os.path.abspath(os.path.join(os.path.dirname(sys.argv[0]), '..', '..'))
parser.add_argument('-d', '--download-dir', type=str, required=False, default=base_dir,
help='Directory to where the MNIST data will be downloaded; default to repository base.')
Collaborator:

I think a saner default would actually be just the working directory. That way it's easy to find where the data was downloaded to.

for dset, data in mnist_data.items():
dset_output_url = '{}/{}'.format(output_url, dset)
print('output: {}'.format(dset_output_url))
with materialize_dataset(spark, dset_output_url, MnistSchema, ROWGROUP_SIZE_MB):
Collaborator:

using the default row group size is probably fine for the example

Author:

ah, ok!

spark.createDataFrame(sql_rows, MnistSchema.as_spark_schema()) \
.coalesce(parquet_files_count) \
.write \
.mode('overwrite') \
Collaborator:

I don't think we should use overwrite in the examples, as it's probably not something we want people to copy.

Author:

👍

@forbearer forbearer force-pushed the pytorch_example branch 2 times, most recently from cc9b8c8 to a7e6ee4 Compare August 16, 2018 20:18
.travis.yml Outdated
@@ -21,7 +21,7 @@ python:
install:
# This will use requirements from setup.py and install them in travis's virtual environment
# [tf] chooses to depend on cpu version of tensorflow (alternatively, could do [tf_gpu])
- pip install -e .[tf,pyarrow,opencv]
Author:

These two are actually no longer valid references with the recent change in setup.py

.travis.yml Outdated
@@ -21,7 +21,7 @@ python:
install:
# This will use requirements from setup.py and install them in travis's virtual environment
# [tf] chooses to depend on cpu version of tensorflow (alternatively, could do [tf_gpu])
- pip install -e .[tf,pyarrow,opencv]
- travis_wait pip install -e .[tf,tv]
Author:

I hope this change works out with travis! Waiting on my run....

@forbearer forbearer force-pushed the pytorch_example branch 7 times, most recently from 807db1f to 3567258 Compare August 20, 2018 17:51
@CLAassistant:
CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Owen Cheng seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@forbearer forbearer force-pushed the pytorch_example branch 2 times, most recently from 4281e48 to fcc1bbb Compare August 20, 2018 19:00
@forbearer (Author):

@selitvin About DataLoader and num_epochs, you are right that it's not in the constructor. I had num_workers in my memory and that got mixed up when I was talking to you. My bad. This does simplify things. :-)

@forbearer forbearer changed the title Pytorch example, plus minor comment fix-ups Pytorch example with DataLoader adapter, using MNIST data Aug 20, 2018
@forbearer forbearer force-pushed the pytorch_example branch 4 times, most recently from 7371a9b to c59f0c8 Compare August 21, 2018 13:47
@forbearer forbearer force-pushed the pytorch_example branch 5 times, most recently from f9d2a17 to 7b817d1 Compare August 23, 2018 20:38
@forbearer (Author):

Well, the experiment to upgrade to xenial and bionic did not yield a positive result.

So I started wondering whether the fix for the dlopen static TLS problem had actually been incorporated into those two Ubuntu releases.

From https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1673956, the bug was analyzed and fixed in https://sourceware.org/bugzilla/show_bug.cgi?id=17620.

According to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=793641, the above fix is in glibc-2.22.

According to ubuntu, xenial http://changelogs.ubuntu.com/changelogs/pool/main/g/glibc/glibc_2.23-0ubuntu10/changelog has glibc-2.22

Well, as I still encounter the same dlopen static TLS error in both xenial and bionic, why might that be?

@forbearer forbearer force-pushed the pytorch_example branch 3 times, most recently from c36eb1d to bd192d5 Compare August 25, 2018 02:06
@forbearer (Author):

LOL: https://travis-ci.com/uber/petastorm/jobs/141963848

I was so stoked that the build passed, but discovered upon closer examination that the problem persists: seg fault during collect.

At this point, here is what I have tried and what I know:

  • So long as test_pytorch_utils.py is in the pytest path, collect fails, even when using -k to exclude it
  • Different versions of Ubuntu did not resolve the issue (tried: trusty, xenial, and bionic), despite the fact that glibc-2.22 (which is in xenial onward) contains the static TLS fix (by increasing DTV surplus slots from 14 to 32, or some slightly larger integer)
  • Taking Clearwater's advice, I tried installing libgomp1 and preceding pytest with export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libgomp.so.1", but that did not resolve either!
  • Using these commands (adapted from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=793689#24), I believe the culprit may be that the torch-0.4.1 we are pip installing is statically linked against a specific libgomp, and therefore the override does not work!
for lib in $(ldd $(find /usr -name _C.so |& grep -v denied) | grep "=>" | sed 's/.*=> \([/a-z0-9\._+-]*\) .*/\1/g' | sort); do echo "readelf -d -W $lib"; readelf -d -W "$lib"; done | egrep 'readelf|STATIC_TLS' | grep -B1 TLS

readelf -d -W /lib/x86_64-linux-gnu/libc.so.6
 0x000000000000001e (FLAGS)              STATIC_TLS
--
readelf -d -W /lib/x86_64-linux-gnu/libm.so.6
 0x000000000000001e (FLAGS)              STATIC_TLS
readelf -d -W /lib/x86_64-linux-gnu/libpthread.so.0
 0x000000000000001e (FLAGS)              STATIC_TLS
readelf -d -W /lib/x86_64-linux-gnu/librt.so.1
 0x000000000000001e (FLAGS)              STATIC_TLS
--
readelf -d -W /usr/local/lib/python2.7/dist-packages/torch/lib/libgomp-7bcb08ae.so.1
 0x000000000000001e (FLAGS)              STATIC_TLS

Below, I do LD_PRELOAD first to show that _C.so still pulls torch's libgomp:

$ export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libgomp.so.1"; for lib in $(ldd $(find /usr -name _C.so |& grep -v denied) | grep "=>" | sed 's/.*=> \([/a-z0-9\._+-]*\) .*/\1/g' | sort); do echo "readelf -d -W $lib"; readelf -d -W "$lib"; done | egrep 'readelf|STATIC_TLS' | grep -B1 TLS; unset LD_PRELOAD
readelf: Error: 'libnvToolsExt-3965bdd0.so.1': No such file
readelf: Error: '=>': No such file
readelf: Error: '(0x00007ff8e5dd3000)': No such file
readelf -d -W /lib/x86_64-linux-gnu/libc.so.6
 0x000000000000001e (FLAGS)              STATIC_TLS
--
readelf -d -W /lib/x86_64-linux-gnu/libm.so.6
 0x000000000000001e (FLAGS)              STATIC_TLS
readelf -d -W /lib/x86_64-linux-gnu/libpthread.so.0
 0x000000000000001e (FLAGS)              STATIC_TLS
readelf -d -W /lib/x86_64-linux-gnu/librt.so.1
 0x000000000000001e (FLAGS)              STATIC_TLS
--
readelf -d -W /usr/local/lib/python2.7/dist-packages/torch/lib/libgomp-7bcb08ae.so.1
 0x000000000000001e (FLAGS)              STATIC_TLS

Last ditch effort: I'm going to try pip installing the CPU torch, followed by building from source using a Dockerfile. If neither of these efforts works, then I'm going to split the pytorch tests from the rest of the pytest suite in order to make progress.

This PR has sunk quite a bit of time. :-\

@@ -0,0 +1,53 @@
FROM ubuntu:14.04.5
Author:

This Dockerfile is not meant to be committed (unless we want to for some reason).

I'm just trying to track what I've done.

@forbearer forbearer force-pushed the pytorch_example branch 3 times, most recently from 5e17547 to e8cfdae Compare August 27, 2018 19:45
@@ -12,24 +12,29 @@
# See the License for the specific language governing permissions and
# limitations under the License.

dist: trusty
dist: xenial
Collaborator:

Do we want to land xenial? I guess it's ok; I just want to make sure this is a deliberate change.

Author (@forbearer, Aug 28, 2018):

Yes, this is a deliberate change. I think we should proceed with xenial to be ahead of (or on par with) the game.

@@ -0,0 +1,42 @@
FROM pytorch:latest
Collaborator:

Should we move this out of mnist? It seems a bit confusing. Maybe also add a comment here (and a legal header) about what this file is doing.

Author (@forbearer, Aug 28, 2018):

Ah, yeah, that's actually a good point.

If there will be a tf example using MNIST, perhaps I should simply make a pytorch subdirectory for the pytorch example portion. Then that would make things much clearer?

Author (@forbearer, Aug 28, 2018):

On second thought, mnist generation leverages pytorch as well, so I don't think I'll gain much by moving the main into a pytorch subdirectory.

But, I'll move this Dockerfile within, and then add a legal header and comment about functionality.


# Instantiate each petastorm Reader with a single thread, shuffle enabled, and appropriate epoch setting
for epoch in range(1, loop_epochs + 1):
with DataLoader(Reader('{}/train'.format(args.dataset_url), reader_pool=ThreadPool(1),
Collaborator:

Should we just rely on the default reader_pool, to make the code a little clearer? Same for the shuffle options.

Author:

Both arguments, reader_pool and shuffle_options, are specified in this code. Or do you mean give each a descriptive name and then supply that? E.g.,

single_thread = ThreadPool(1)
shuffle_enabled = ShuffleOptions()
...
with DataLoader(Reader(..., reader_pool=single_thread, shuffle_options=shuffle_enabled, ...)?

Author:

OK, I think I see what you are getting at. Removing these two arguments would work just fine. :-)

Owen Cheng added 6 commits August 27, 2018 23:21
…xtra requirement, but avoid requiring it for testing.
* Added to Reader support for data length
* Defined new petastorm.pytorch.DataLoader class, which allows custom collate and transform functions
* Made pytorch example much more on-par with original pytorch MNIST example code
* Simplified and revamped unit test to pytest
* Simplified pip requirement, upgrade pip, and make install output quiet
* Simplified generate CLI args
* Fixed example README
* Added pytorch example to repo README
* Addressed pylint issues
* pass in Reader instance to DataLoader adapter
* pin pip version
* fix class reference in comment
* early import of torch in conftest to prevent dlopen error for Python < 3.0
* skip of test_read_mnist_dataset and test_full_pytorch_example when Python < 3.0
* separate out test_generate_mnist_dataset, which can run in both Python versions
@forbearer forbearer force-pushed the pytorch_example branch 4 times, most recently from 4c98b2f to 915d71c Compare August 28, 2018 13:36
…hooting step for pytorch issue; upgrade to xenial; simplified pytorch Reader args.
@forbearer forbearer merged commit 053addf into master Aug 28, 2018
@forbearer forbearer deleted the pytorch_example branch August 28, 2018 15:49