
Pytorch example with DataLoader adapter, using MNIST data #50

Merged
merged 7 commits on Aug 28, 2018

Conversation

@forbearer:

This code includes an MNIST dataset generator, a pytorch training example that uses the resulting dataset, and a simple README.md.

As can be seen from main.py, there are a few limitations that come to light which could help us improve petastorm:

  • Batch shuffling
  • Support for custom transforms
  • Total data size (or some semblance of it?)

Running pytorch/examples/mnist/main.py (in a Docker container) with the default 10 epochs yielded the following outcome (I just show the test output for the middle 8 epochs):

...
Train Epoch: 1 [59520/60000 (99%)]	Loss: 0.505042

Test set: Average loss: 0.2056, Accuracy: 9395/10000 (94%)

...
Test set: Average loss: 0.1337, Accuracy: 9596/10000 (96%)
Test set: Average loss: 0.1033, Accuracy: 9684/10000 (97%)
Test set: Average loss: 0.0919, Accuracy: 9710/10000 (97%)
Test set: Average loss: 0.0760, Accuracy: 9770/10000 (98%)
Test set: Average loss: 0.0689, Accuracy: 9797/10000 (98%)
Test set: Average loss: 0.0623, Accuracy: 9803/10000 (98%)
Test set: Average loss: 0.0632, Accuracy: 9791/10000 (98%)
Test set: Average loss: 0.0541, Accuracy: 9818/10000 (98%)

...
Train Epoch: 10 [59520/60000 (99%)]	Loss: 0.040862

Test set: Average loss: 0.0505, Accuracy: 9845/10000 (98%)

real	3m3.021s
user	20m4.680s
sys	0m22.228s

With the petastormed variant, the training accuracy looks on par, with somewhat better runtime. I'll show just the test output:

Test set: Average loss: 0.2035, Accuracy: 9385/10000 (94%)
Test set: Average loss: 0.1326, Accuracy: 9591/10000 (96%)
Test set: Average loss: 0.1040, Accuracy: 9675/10000 (97%)
Test set: Average loss: 0.0887, Accuracy: 9705/10000 (97%)
Test set: Average loss: 0.0761, Accuracy: 9752/10000 (98%)
Test set: Average loss: 0.0715, Accuracy: 9774/10000 (98%)
Test set: Average loss: 0.0627, Accuracy: 9797/10000 (98%)
Test set: Average loss: 0.0606, Accuracy: 9810/10000 (98%)
Test set: Average loss: 0.0582, Accuracy: 9824/10000 (98%)
Test set: Average loss: 0.0548, Accuracy: 9828/10000 (98%)

real	2m35.852s
user	2m33.508s
sys	0m6.576s

@forbearer (Author):

My build is failing due to my test program name being non-unique: pytest-dev/pytest#2887

I'll do a simple rename and update this pull request.

@selitvin (Collaborator) left a comment:

Can we also cover mnist training triggered from a test to make sure the example does not go stale?


###
# Adapted to petastorm dataset using original contents from
# pytorch/examples/mnist/main.py .
Collaborator:

Maybe point to the correct git repo: https://github.com/pytorch/examples

Author:

👍

transforms.Normalize((0.1307,), (0.3081,))
])

class BatchMaker(object):
Collaborator:

This object is an equivalent of torch.utils.data.DataLoader. It should become part of the petastorm library: just as tf_utils.tf_tensors() is an adaptor to the tensorflow world, this class should become an adaptor for the pytorch world. Suggest we also name it some sort of Loader. Maybe just petastorm.pytorch.DataLoader?
Can we provide similar features to torch.utils.data.DataLoader? Our Reader implements a bunch of these, so it's ok since we will initialize our Loader with a reader. Other features we need to implement (if not right now, then later). For the mnist example we would need batch_size, shuffle (we can wait for the new Reader flow, which will deliver proper shuffling), collate_fn, and a total_size property (we should be able to sum all rowgroup counts to get the number).

As the ideal end result, a user would simply need to switch instantiation of torch.utils.data.DataLoader to petastorm.pytorch.DataLoader, tweak some parameters, and be done with the migration.
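A minimal sketch of what such an adaptor could look like, assuming the reader is any iterable of named-tuple rows. The class name, constructor signature, and default collate below are all hypothetical illustrations, not petastorm API:

```python
from collections import namedtuple

class DataLoader(object):
    """Hypothetical petastorm.pytorch.DataLoader sketch: wraps a reader (any
    iterable of schema rows) and yields collated batches, mirroring a small
    subset of the torch.utils.data.DataLoader interface."""

    def __init__(self, reader, batch_size=1, collate_fn=None):
        self.reader = reader
        self.batch_size = batch_size
        # Default collate transposes a list of rows into per-field columns.
        self.collate_fn = collate_fn or (lambda rows: tuple(zip(*rows)))

    def __iter__(self):
        batch = []
        for row in self.reader:
            batch.append(row)
            if len(batch) == self.batch_size:
                yield self.collate_fn(batch)
                batch = []
        if batch:
            # Flush the trailing partial batch.
            yield self.collate_fn(batch)

# Illustration with a stand-in reader of (image, digit) rows:
Row = namedtuple('Row', ['image', 'digit'])
batches = list(DataLoader([Row(i, i % 10) for i in range(5)], batch_size=2))
```

With something like this in place, migrating the mnist example would mostly be a matter of swapping the DataLoader import and dropping the torchvision dataset wiring.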

Author:

Ah, so it seems that I shouldn't have split #53 into its own pull request, but should have folded those improvements into here in one go. My initial thought had been to get an initial example going, then refine it.

I'm happy to bring #53 in first, and get us as close as possible to a simple switch.

I like petastorm.pytorch.DataLoader...seems we should make equivalent packages for tf :-)

def __iter__(self):
    batch = []
    for mnist in self.reader:
        batch.append((_image_transform(mnist.image), mnist.digit))
Collaborator:

Can we make this transform automatic as well using information from Unischema?
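One way schema metadata could drive the transform automatically, sketched with stand-in field descriptors. Field, default_transform, and transform_row are hypothetical names, not petastorm API; the rule shown (uint8 fields with spatial dimensions get image-style scaling, everything else passes through with its declared dtype) is just an illustrative heuristic:

```python
from collections import namedtuple

import numpy as np

# Stand-in for petastorm's UnischemaField: just the parts the transform needs.
Field = namedtuple('Field', ['name', 'numpy_dtype', 'shape'])

def default_transform(field):
    """Pick a conversion for a field based on its schema metadata."""
    if field.numpy_dtype == np.uint8 and len(field.shape) >= 2:
        # uint8 with spatial dims: treat as an image, scale into [0, 1].
        return lambda value: np.asarray(value, dtype=np.float32) / 255.0
    # Everything else: coerce to the declared dtype.
    return lambda value: np.asarray(value, dtype=field.numpy_dtype)

def transform_row(schema_fields, row):
    """Apply schema-derived transforms to each named field of a row dict."""
    return {f.name: default_transform(f)(row[f.name]) for f in schema_fields}
```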

@@ -0,0 +1,2 @@
torch
Collaborator:

If we are adding pytorch support to the petastorm library, the dependencies can go as extras into setup.py.

Author:

I did try that yesterday, but adding torch brings in a half-GB package, which could make travis CI flaky. In my case, the torch download stalls for long enough that travis kills it.

What I was hoping for was to not need a dependency on torch. Maybe I ought to aim to do that via this example.

Author:

Actually, the generate function relies on torchvision datasets, which is really convenient.

Then, main.py requires torch for running training, so it would seem to make sense to fold it into our setup.py as extras. I'd like to give that another try and see if the travis run fares better this time around. (Or, there may be a travis option to wait longer for the package download to succeed??)


MnistSchema = Unischema('MnistSchema', [
UnischemaField('idx', np.int_, (), ScalarCodec(IntegerType()), False),
UnischemaField('digit', np.int_, (), ScalarCodec(IntegerType()), False),
UnischemaField('image', np.uint8, (28, 28, 1), NdarrayCodec(), False),
Collaborator:

Do we need the third dimension? For monochromatic image it is usually omitted (i.e. (28, 28)).

Author:

So, here I think I gave in to my incomplete understanding of the training example... the training setup requires 4 dimensions, an array of 3-dimensional image matrices... so it appeared easier to just add this 3rd dim. Or maybe I can store 2 dims, but just reshape to 3 dims in the example!
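For reference, adding the missing axes at read time is a one-liner with numpy (shapes below assume the standard 28x28 MNIST images; the variable names are illustrative):

```python
import numpy as np

# If the schema stores monochrome images as (28, 28), the example can add
# the channel axis (and batch axis) that torch's conv layers expect at
# read time, instead of baking a trailing dimension of 1 into the dataset.
image = np.zeros((28, 28), dtype=np.uint8)   # as stored in the dataset
chw = image[np.newaxis, ...]                 # (1, 28, 28): add channel axis
nchw = chw[np.newaxis, ...]                  # (1, 1, 28, 28): a batch of one
```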

This creates both a `train` and `test` petastorm datasets in `/home/${USER}/dev/datasets/mnist`:

```bash
python generate_petastorm_mnist.py -d ~/dev/data/mnist -o file:///home/${USER}/dev/datasets/mnist
Collaborator:

It's not clear from the README what the -d switch is. Maybe we default it somewhere to a temp directory and omit it from the README altogether?

Author:

Good point. Yeah, or I can opt to not download, so then the directory isn't needed. Removing -d does seem nice.

PYTHONPATH=${PETASTORM_PATH}
```

== Generating a Petastorm Dataset from MNIST Data ==
Collaborator:

Does == render correctly as section markup? I think I had to use # when I was writing the main README.md.

Author:

Oops, good catch!

@@ -0,0 +1,43 @@
== Setup ==
```bash
pip install -r requirements.txt
Collaborator:

Not sure I like the duplicate dependency specification: here and in setup.py.

Collaborator:

Yeah this can likely be changed to something like pip install -e .[torch]
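A sketch of how that could look in setup.py. The package names and extras keys here are illustrative assumptions, not the project's actual setup.py contents:

```python
# Illustrative extras_require for setup.py: `pip install -e .[torch]`
# would then pull in the pytorch example's dependencies, removing the
# duplicate list in requirements.txt.
EXTRAS_REQUIRE = {
    'tf': ['tensorflow'],
    'torch': ['torch', 'torchvision'],
}

# Passed through in setup.py as:
#   setup(..., extras_require=EXTRAS_REQUIRE)
```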

help='Directory to where the MNIST data will be downloaded; default to repository base.')
parser.add_argument('-o', '--output-url', type=str, required=True,
help='hdfs://... or file:/// url where the parquet dataset will be written to.')
parser.add_argument('-m', '--master', type=str, required=False, default=None,
Collaborator:

We should probably have local[*] as the default, to make the script run out of the box with minimal parameters?
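A sketch of that default with argparse, mirroring the flag in the excerpt above (local[*] tells Spark to run locally with as many worker threads as cores):

```python
import argparse

# Defaulting --master to local[*] lets the generator run with no Spark
# cluster configured at all.
parser = argparse.ArgumentParser()
parser.add_argument('-m', '--master', type=str, required=False, default='local[*]',
                    help='Spark master; defaults to local[*] to run out of the box.')
args = parser.parse_args([])
```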

session_builder = SparkSession \
.builder \
.appName('MNIST Dataset Creation') \
.config('spark.executor.memory', '1g') \
Collaborator:

Can we go with the default memory sizes and reduce the bloat?

Author:

K!


base_dir = os.path.abspath(os.path.join(os.path.dirname(sys.argv[0]), '..', '..'))
parser.add_argument('-d', '--download-dir', type=str, required=False, default=base_dir,
help='Directory to where the MNIST data will be downloaded; default to repository base.')
Collaborator:

I think a saner default would actually be just the working directory. That way it's easy to find where the data was downloaded to.

for dset, data in mnist_data.items():
dset_output_url = '{}/{}'.format(output_url, dset)
print('output: {}'.format(dset_output_url))
with materialize_dataset(spark, dset_output_url, MnistSchema, ROWGROUP_SIZE_MB):
Collaborator:

using the default row group size is probably fine for the example

Author:

ah, ok!

spark.createDataFrame(sql_rows, MnistSchema.as_spark_schema()) \
.coalesce(parquet_files_count) \
.write \
.mode('overwrite') \
Collaborator:

I don't think we should use overwrite in the examples, as it's probably not something we want people to copy.

Author:

👍

@forbearer forbearer force-pushed the pytorch_example branch 2 times, most recently from cc9b8c8 to a7e6ee4 Compare August 16, 2018 20:18
.travis.yml Outdated
@@ -21,7 +21,7 @@ python:
install:
# This will use requirements from setup.py and install them in travis's virtual environment
# [tf] chooses to depend on cpu version of tensorflow (alternatively, could do [tf_gpu])
- pip install -e .[tf,pyarrow,opencv]
Author:

These two are actually no longer valid references with the recent change in setup.py

.travis.yml Outdated
@@ -21,7 +21,7 @@ python:
install:
# This will use requirements from setup.py and install them in travis's virtual environment
# [tf] chooses to depend on cpu version of tensorflow (alternatively, could do [tf_gpu])
- pip install -e .[tf,pyarrow,opencv]
- travis_wait pip install -e .[tf,tv]
Author:

I hope this change works out with travis! Waiting on my run....

@forbearer forbearer force-pushed the pytorch_example branch 7 times, most recently from 807db1f to 3567258 Compare August 20, 2018 17:51
@CLAassistant:
CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Owen Cheng seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@forbearer forbearer force-pushed the pytorch_example branch 2 times, most recently from 4281e48 to fcc1bbb Compare August 20, 2018 19:00
@forbearer (Author):

@selitvin About DataLoader and num_epochs, you are right that it's not in the constructor. I had num_workers in my memory and that got mixed up when I was talking to you. My bad. This does simplify things. :-)

@forbearer forbearer changed the title Pytorch example, plus minor comment fix-ups Pytorch example with DataLoader adapter, using MNIST data Aug 20, 2018
@forbearer forbearer force-pushed the pytorch_example branch 4 times, most recently from 7371a9b to c59f0c8 Compare August 21, 2018 13:47
@forbearer forbearer force-pushed the pytorch_example branch 5 times, most recently from f9d2a17 to 7b817d1 Compare August 23, 2018 20:38
@forbearer (Author):

Well, the experiment to upgrade to xenial and bionic did not yield a positive result.

So I started wondering whether the fix for the dlopen static TLS problem had actually been incorporated into those two Ubuntu releases.

From https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1673956, the bug was analyzed and fixed in https://sourceware.org/bugzilla/show_bug.cgi?id=17620.

According to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=793641, the above fix is in glibc-2.22.

According to ubuntu, xenial http://changelogs.ubuntu.com/changelogs/pool/main/g/glibc/glibc_2.23-0ubuntu10/changelog has glibc-2.22

Well, as I still encounter the same dlopen static TLS error in both xenial and bionic, why might that be?

@forbearer forbearer force-pushed the pytorch_example branch 3 times, most recently from c36eb1d to bd192d5 Compare August 25, 2018 02:06
@forbearer (Author):

LOL: https://travis-ci.com/uber/petastorm/jobs/141963848

I was so stoked that the build passed, but discovered upon closer examination that the problem persists: seg fault during collect.

At this point, here is what I have tried and what I know:

  • So long as test_pytorch_utils.py is in the pytest path, collect fails, even when using -k to exclude it
  • Different versions of Ubuntu did not resolve the issue (tried: trusty, xenial, and bionic), despite the fact that glibc-2.22 (which is in xenial onward) contains the static TLS fix (by increasing DTV surplus slots from 14 to 32, or some slightly larger integer)
  • Taking Clearwater's advice, I tried installing libgomp1 and preceding pytest with export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libgomp.so.1", but that did not resolve either!
  • Using these commands (adapted from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=793689#24), I believe the culprit may be that the torch-0.4.1 we are pip installing is statically linked against a specific libgomp, and therefore the override does not work!
for lib in $(ldd $(find /usr -name _C.so |& grep -v denied) | grep "=>" | sed 's/.*=> \([/a-z0-9\._+-]*\) .*/\1/g' | sort); do echo "readelf -d -W $lib"; readelf -d -W "$lib"; done | egrep 'readelf|STATIC_TLS' | grep -B1 TLS

readelf -d -W /lib/x86_64-linux-gnu/libc.so.6
 0x000000000000001e (FLAGS)              STATIC_TLS
--
readelf -d -W /lib/x86_64-linux-gnu/libm.so.6
 0x000000000000001e (FLAGS)              STATIC_TLS
readelf -d -W /lib/x86_64-linux-gnu/libpthread.so.0
 0x000000000000001e (FLAGS)              STATIC_TLS
readelf -d -W /lib/x86_64-linux-gnu/librt.so.1
 0x000000000000001e (FLAGS)              STATIC_TLS
--
readelf -d -W /usr/local/lib/python2.7/dist-packages/torch/lib/libgomp-7bcb08ae.so.1
 0x000000000000001e (FLAGS)              STATIC_TLS

Below, I do LD_PRELOAD first to show that _C.so still pulls torch's libgomp:

$ export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libgomp.so.1"; for lib in $(ldd $(find /usr -name _C.so |& grep -v denied) | grep "=>" | sed 's/.*=> \([/a-z0-9\._+-]*\) .*/\1/g' | sort); do echo "readelf -d -W $lib"; readelf -d -W "$lib"; done | egrep 'readelf|STATIC_TLS' | grep -B1 TLS; unset LD_PRELOAD
readelf: Error: 'libnvToolsExt-3965bdd0.so.1': No such file
readelf: Error: '=>': No such file
readelf: Error: '(0x00007ff8e5dd3000)': No such file
readelf -d -W /lib/x86_64-linux-gnu/libc.so.6
 0x000000000000001e (FLAGS)              STATIC_TLS
--
readelf -d -W /lib/x86_64-linux-gnu/libm.so.6
 0x000000000000001e (FLAGS)              STATIC_TLS
readelf -d -W /lib/x86_64-linux-gnu/libpthread.so.0
 0x000000000000001e (FLAGS)              STATIC_TLS
readelf -d -W /lib/x86_64-linux-gnu/librt.so.1
 0x000000000000001e (FLAGS)              STATIC_TLS
--
readelf -d -W /usr/local/lib/python2.7/dist-packages/torch/lib/libgomp-7bcb08ae.so.1
 0x000000000000001e (FLAGS)              STATIC_TLS

Last ditch effort: I'm going to try pip installing the CPU torch, followed by building from source using a Dockerfile. If neither of these efforts works, then I'm going to split the pytorch tests from the rest of the pytest suite in order to make progress.

This PR has sunk quite a bit of time. :-\

@@ -0,0 +1,53 @@
FROM ubuntu:14.04.5
Author:

This Dockerfile is not meant to be committed (unless we want to for some reason).

I'm just trying to track what I've done.

@forbearer forbearer force-pushed the pytorch_example branch 3 times, most recently from 5e17547 to e8cfdae Compare August 27, 2018 19:45
@@ -12,24 +12,29 @@
# See the License for the specific language governing permissions and
# limitations under the License.

dist: trusty
dist: xenial
Collaborator:

Do we want to land xenial? I guess it's ok; I just want to make sure this is a deliberate change.

Author (@forbearer, Aug 28, 2018):

Yes, this is a deliberate change. I think we should proceed with xenial to be ahead of (or on par with) the game.

@@ -0,0 +1,42 @@
FROM pytorch:latest
Collaborator:

Should we move this out of mnist? It seems a bit confusing. Maybe also add a comment here (and a legal header) about what this file is doing.

Author (@forbearer, Aug 28, 2018):

Ah, yeah, that's actually a good point.

If there will be a tf example using MNIST, perhaps I should simply make a pytorch subdirectory for the pytorch example portion. Then that would make things much clearer?

Author (@forbearer, Aug 28, 2018):

On second thought, mnist generation leverages pytorch as well, so I don't think I'll gain much by moving the main into a pytorch subdirectory.

But, I'll move this Dockerfile within, and then add a legal header and comment about functionality.


# Instantiate each petastorm Reader with a single thread, shuffle enabled, and appropriate epoch setting
for epoch in range(1, loop_epochs + 1):
with DataLoader(Reader('{}/train'.format(args.dataset_url), reader_pool=ThreadPool(1),
Collaborator:

Should we just rely on the default reader_pool, to make the code a little clearer? Same for the shuffle options.

Author:

Both arguments, reader_pool and shuffle_options, are specified in this code. Or do you mean give each a descriptive name and then supply that? E.g.,

single_thread = ThreadPool(1)
shuffle_enabled = ShuffleOptions()
...
with DataLoader(Reader(..., reader_pool=single_thread, shuffle_options=shuffle_enabled, ...)?

Author:

OK, I think I see what you are getting at. Removing these two arguments would work just fine. :-)

Owen Cheng added 6 commits August 27, 2018 23:21
…xtra requirement, but avoid requiring it for testing.
* Added to Reader support for data length
* Defined new petastorm.pytorch.DataLoader class, which allows custom collate and transform functions
* Made pytorch example much more on-par with original pytorch MNIST example code
* Simplified and revamped unit test to pytest
* Simplified pip requirement, upgrade pip, and make install output quiet
* Simplified generate CLI args
* Fixed example README
* Added pytorch example to repo README
* Addressed pylint issues
* pass in Reader instance to DataLoader adapter
* pin pip version
* fix class reference in comment
* early import of torch in conftest to prevent dlopen error for Python < 3.0
* skip of test_read_mnist_dataset and test_full_pytorch_example when Python < 3.0
* separate out test_generate_mnist_dataset, which can run in both Python versions
@forbearer forbearer force-pushed the pytorch_example branch 4 times, most recently from 4c98b2f to 915d71c Compare August 28, 2018 13:36
…hooting step for pytorch issue; upgrade to xenial; simplified pytorch Reader args.
@forbearer forbearer merged commit 053addf into master Aug 28, 2018
@forbearer forbearer deleted the pytorch_example branch August 28, 2018 15:49