Pytorch example with DataLoader adapter, using MNIST data #50
Conversation
My build is failing due to my test program name being non-unique (pytest-dev/pytest#2887). I'll do a simple rename and update this pull request.
Force-pushed from b00e65b to 0605819.
Can we also cover mnist training triggered from a test to make sure the example does not go stale?
examples/mnist/main.py
Outdated
###
# Adapted to petastorm dataset using original contents from
# pytorch/examples/mnist/main.py .
Maybe point to the correct git repo: https://github.com/pytorch/examples
👍
examples/mnist/main.py
Outdated
    transforms.Normalize((0.1307,), (0.3081,))
])


class BatchMaker(object):
This object is an equivalent of `torch.utils.data.DataLoader`. It should become part of the petastorm library: just as `tf_utils.tf_tensors()` is an adaptor to the tensorflow world, this class should become an adaptor for the pytorch world. Suggest we also name it as some sort of Loader, maybe just `petastorm.pytorch.DataLoader`?

Can we provide features similar to `torch.utils.data.DataLoader`? Our `Reader` implements a bunch of these, so it's ok since we will initialize our Loader with a reader. Other features we need to implement (if not right now, then later). For the mnist example we would need `batch_size`, `shuffle` (we can wait for the new Reader flow, which will deliver proper shuffling), `collate_fn`, and a `total_size` property (we should be able to sum all rowgroup counts to get the number).

As the ideal end result, a user would simply need to switch the instantiation of `torch.utils.data.DataLoader` to `petastorm.pytorch.DataLoader`, tweak some parameters, and be done with the migration.
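A rough sketch of what such an adaptor might look like (the class and parameter names are illustrative only, and the default collate assumes MNIST-style rows exposing `image` and `digit` fields):

```python
import torch


class DataLoader(object):
    """Illustrative adaptor: batches rows from a petastorm Reader behind a
    torch.utils.data.DataLoader-like interface."""

    def __init__(self, reader, batch_size=1, collate_fn=None):
        self._reader = reader
        self._batch_size = batch_size
        self._collate_fn = collate_fn or self._default_collate

    @staticmethod
    def _default_collate(rows):
        # Stack the per-row images and labels into batch tensors.
        images = torch.stack([torch.as_tensor(row.image) for row in rows])
        labels = torch.as_tensor([row.digit for row in rows])
        return images, labels

    def __iter__(self):
        batch = []
        for row in self._reader:
            batch.append(row)
            if len(batch) == self._batch_size:
                yield self._collate_fn(batch)
                batch = []
        if batch:
            # Flush the trailing partial batch.
            yield self._collate_fn(batch)
```

A `total_size` property could then be computed by summing the rowgroup counts, once the `Reader` exposes them.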
Ah, so it seems that I shouldn't have split #53 into its own pull request, but folded those improvements in here in one go. My initial thought had been to get an initial example going, then refine it.

I'm happy to bring #53 in first, and get us as close as possible to a simple switch.

I like `petastorm.pytorch.DataLoader`... seems we should make equivalent packages for `tf` :-)
examples/mnist/main.py
Outdated
def __iter__(self):
    batch = []
    for mnist in self.reader:
        batch.append((_image_transform(mnist.image), mnist.digit))
Can we make this transform automatic as well, using information from `Unischema`?
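One possible direction, assuming `Unischema` exposes a `fields` mapping whose entries carry the declared `numpy_dtype` and `shape` (as the schema definition in this PR suggests); the helper name is hypothetical:

```python
import numpy as np
import torch


def row_to_tensors(row, schema):
    # Coerce each decoded field to its declared dtype/shape from the
    # Unischema, then hand it to torch; this would replace the
    # hand-written _image_transform for simple cases.
    tensors = {}
    for name, field in schema.fields.items():
        value = np.asarray(getattr(row, name), dtype=field.numpy_dtype)
        tensors[name] = torch.as_tensor(value.reshape(field.shape))
    return tensors
```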
examples/mnist/requirements.txt
Outdated
@@ -0,0 +1,2 @@
torch
If we are adding pytorch support to the petastorm library, the dependencies can go as extras into `setup.py`.
I did try that yesterday, but adding `torch` brings in a half-GB package, which could make travis CI flaky. In my case, the torch download stalls for long enough for travis to kill it.

What I was hoping for was to avoid a dependency on torch altogether. Maybe I ought to aim for that via this example.
Actually, the generate function relies on torchvision datasets, which is really convenient. And `main.py` requires `torch` for running training, so it would seem to make sense to fold both into our `setup.py` as extras. I'd like to give that another try and see if the travis run fares better this time around. (Or, there may be a travis option to wait longer for the package download to succeed?)
examples/mnist/schema.py
Outdated
MnistSchema = Unischema('MnistSchema', [
    UnischemaField('idx', np.int_, (), ScalarCodec(IntegerType()), False),
    UnischemaField('digit', np.int_, (), ScalarCodec(IntegerType()), False),
    UnischemaField('image', np.uint8, (28, 28, 1), NdarrayCodec(), False),
Do we need the third dimension? For a monochromatic image it is usually omitted (i.e. `(28, 28)`).
So, here I think I gave in to my incomplete understanding of the training example... the training setup requires 4 dimensions, an array of 3-dimensional image matrices... so it appeared easier to just add this 3rd dim. Oh, maybe I can store 2 dims and just reshape to 3 dims in the example!
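For illustration, the reshape route could look like this (a sketch, with shapes taken from the schema above):

```python
import numpy as np

# Store the monochrome image as (28, 28) in the dataset...
image_2d = np.zeros((28, 28), dtype=np.uint8)

# ...and add the third (channel) dimension at load time; batching then
# supplies the fourth dimension the training loop expects.
image_3d = image_2d.reshape(28, 28, 1)
```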
examples/mnist/README.md
Outdated
This creates both `train` and `test` petastorm datasets in `/home/${USER}/dev/datasets/mnist`:

```bash
python generate_petastorm_mnist.py -d ~/dev/data/mnist -o file:///home/${USER}/dev/datasets/mnist
```
It's not clear from the readme what the `-d` switch is. Maybe we default it somewhere to a temp directory and omit it from the README altogether?
Good point, yeah; or I can opt not to download, so the directory isn't needed. Removing `-d` does seem nice.
examples/mnist/README.md
Outdated
PYTHONPATH=${PETASTORM_PATH}
```

== Generating a Petastorm Dataset from MNIST Data ==
Does `==` render correctly as section markup? I think I had to use `#` when I was writing the main `README.md`.
Oops, good catch!
examples/mnist/README.md
Outdated
@@ -0,0 +1,43 @@
== Setup ==
```bash
pip install -r requirements.txt
Not sure I like the duplicate dependency specification: here and in the `setup.py`.
Yeah, this can likely be changed to something like `pip install -e .[torch]`.
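For instance, the extras could be declared in `setup.py` roughly like this (the package lists are illustrative, not final pins):

```python
from setuptools import setup

setup(
    name='petastorm',
    # ... other setup() arguments elided ...
    extras_require={
        'tf': ['tensorflow'],
        'torch': ['torch', 'torchvision'],
    },
)
```

after which `pip install -e .[torch]` pulls in the pytorch dependencies without a separate `requirements.txt`.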
                    help='Directory to where the MNIST data will be downloaded; default to repository base.')
parser.add_argument('-o', '--output-url', type=str, required=True,
                    help='hdfs://... or file:/// url where the parquet dataset will be written to.')
parser.add_argument('-m', '--master', type=str, required=False, default=None,
We should probably have `local[*]` as a default, to make the script run out of the box with minimal parameters?
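A sketch of the suggested default:

```python
parser.add_argument('-m', '--master', type=str, required=False, default='local[*]',
                    help='Spark master; defaults to a local run using all available cores.')
```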
session_builder = SparkSession \
    .builder \
    .appName('MNIST Dataset Creation') \
    .config('spark.executor.memory', '1g') \
Can we go with the default memory sizes and reduce the bloat?
K!
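With the default memory settings, the builder could shrink to something like this (master selection elided):

```python
from pyspark.sql import SparkSession

session_builder = SparkSession.builder.appName('MNIST Dataset Creation')
spark = session_builder.getOrCreate()
```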
base_dir = os.path.abspath(os.path.join(os.path.dirname(sys.argv[0]), '..', '..'))
parser.add_argument('-d', '--download-dir', type=str, required=False, default=base_dir,
                    help='Directory to where the MNIST data will be downloaded; default to repository base.')
I think a saner default would actually be just the working dir. That way it's easy to find where the data was downloaded to.
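That suggestion could look like this (a sketch of the changed argument):

```python
import os

parser.add_argument('-d', '--download-dir', type=str, required=False, default=os.getcwd(),
                    help='Directory to which the MNIST data will be downloaded; '
                         'defaults to the current working directory.')
```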
for dset, data in mnist_data.items():
    dset_output_url = '{}/{}'.format(output_url, dset)
    print('output: {}'.format(dset_output_url))
    with materialize_dataset(spark, dset_output_url, MnistSchema, ROWGROUP_SIZE_MB):
using the default row group size is probably fine for the example
ah, ok!
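Assuming the row group size argument of `materialize_dataset` is optional (an assumption here, not something this diff shows), the call would reduce to:

```python
from petastorm.etl.dataset_metadata import materialize_dataset

with materialize_dataset(spark, dset_output_url, MnistSchema):
    # ... create and write the dataframe as before ...
    pass
```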
spark.createDataFrame(sql_rows, MnistSchema.as_spark_schema()) \
    .coalesce(parquet_files_count) \
    .write \
    .mode('overwrite') \
I don't think we should use overwrite in the examples, as it's probably not something we want people to copy.
👍
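Dropping the mode leaves Spark's default `errorifexists` behavior, which fails loudly instead of silently clobbering an existing dataset (the trailing `.parquet(...)` call is assumed from context):

```python
spark.createDataFrame(sql_rows, MnistSchema.as_spark_schema()) \
    .coalesce(parquet_files_count) \
    .write \
    .parquet(dset_output_url)
```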
Force-pushed from cc9b8c8 to a7e6ee4.
.travis.yml
Outdated
@@ -21,7 +21,7 @@ python:
install:
  # This will use requirements from setup.py and install them in travis's virtual environment
  # [tf] chooses to depend on cpu version of tensorflow (alternatively, could do [tf_gpu])
  - pip install -e .[tf,pyarrow,opencv]
These two are actually no longer valid references with the recent change in setup.py
.travis.yml
Outdated
@@ -21,7 +21,7 @@ python:
install:
  # This will use requirements from setup.py and install them in travis's virtual environment
  # [tf] chooses to depend on cpu version of tensorflow (alternatively, could do [tf_gpu])
  - pip install -e .[tf,pyarrow,opencv]
  - travis_wait pip install -e .[tf,tv]
I hope this change works out with travis! Waiting on my run...
Force-pushed from 807db1f to 3567258.
Owen Cheng seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. Have you already signed the CLA, but the status is still pending? Let us recheck it.
Force-pushed from 4281e48 to fcc1bbb.
@selitvin About DataLoader and num_epochs, you are right that it's not in the constructor. I had […]
Force-pushed from 7371a9b to c59f0c8.
Force-pushed from f9d2a17 to 7b817d1.
Well, the experiment to upgrade to xenial (and bionic) did not resolve the problem. So I started wondering whether the fix for the dlopen static TLS problem was actually incorporated into those two Ubuntu releases.

From https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1673956, the bug was analyzed and fixed in https://sourceware.org/bugzilla/show_bug.cgi?id=17620.

According to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=793641, the above fix is in […]

According to ubuntu, xenial (http://changelogs.ubuntu.com/changelogs/pool/main/g/glibc/glibc_2.23-0ubuntu10/changelog) has glibc 2.23.

Well, as I still encounter the same dlopen static TLS error in both xenial and bionic, why might that be?
Force-pushed from c36eb1d to bd192d5.
LOL: https://travis-ci.com/uber/petastorm/jobs/141963848. I was so stoked that the build passed, but discovered upon closer examination that the problem persists: a seg fault during collect. At this point, here is what I have tried and what I know: […]

Below, I do LD_PRELOAD first to show that […]

Last ditch effort: I'm going to try pip installing the CPU torch, followed by building from source using […]. This PR has sunk quite a bit of time. :-\
Force-pushed from bd192d5 to 40e9acf.
examples/mnist/Dockerfile
Outdated
@@ -0,0 +1,53 @@
FROM ubuntu:14.04.5
This Dockerfile is not meant to be committed (unless we want to for some reason).
I'm just trying to track what I've done.
Force-pushed from 5e17547 to e8cfdae.
@@ -12,24 +12,29 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-dist: trusty
+dist: xenial
Do we want to land `xenial`? I guess it's ok; just want to make sure this is a deliberate change.
Yes, this is a deliberate change. I think we should proceed with xenial to be ahead of (or on par with) the game.
examples/mnist/Dockerfile
Outdated
@@ -0,0 +1,42 @@
FROM pytorch:latest
Should we move this out of mnist? Seems a bit confusing. Maybe also add a comment here (and a legal header) explaining what this file does.
Ah, yeah, that's actually a good point.
If there will be a tf example using MNIST, perhaps I should simply make a pytorch subdirectory for the pytorch example portion. Then that would make things much clearer?
On second thought, actually, mnist generation leverages pytorch as well, so I don't think I'll gain much to move the main into a pytorch subdirectory.
But, I'll move this Dockerfile within, and then add a legal header and comment about functionality.
examples/mnist/main.py
Outdated
# Instantiate each petastorm Reader with a single thread, shuffle enabled, and appropriate epoch setting
for epoch in range(1, loop_epochs + 1):
    with DataLoader(Reader('{}/train'.format(args.dataset_url), reader_pool=ThreadPool(1),
Should we specify `reader_pool`, maybe just the default, to make the code a little clearer? Same for shuffle options.
Both arguments, `reader_pool` and `shuffle_options`, are specified in this code. Or do you mean: give each a name and then supply that? E.g.,

single_thread = ThreadPool(1)
shuffle_enabled = ShuffleOptions()
...
with DataLoader(Reader(..., reader_pool=single_thread, shuffle_options=shuffle_enabled, ...))?
OK, I think I see what you are getting at. Removing these two arguments would work just fine. :-)
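The loop could then simplify to something like this (a sketch; `train`, `model`, `device`, and `optimizer` come from the surrounding `main.py`, and `petastorm.pytorch` is the package name proposed above):

```python
from petastorm.pytorch import DataLoader
from petastorm.reader import Reader

for epoch in range(1, loop_epochs + 1):
    # Rely on the Reader's default reader pool and shuffle behavior.
    with DataLoader(Reader('{}/train'.format(args.dataset_url))) as train_loader:
        train(model, device, train_loader, optimizer, epoch)
```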
…xtra requirement, but avoid requiring it for testing.
* Added support for data length to Reader
* Defined new petastorm.pytorch.DataLoader class, which allows custom collate and transform functions
* Made pytorch example much more on-par with the original pytorch MNIST example code
* Simplified and revamped the unit test to pytest
* Simplified pip requirements, upgraded pip, and made install output quiet
* Simplified generate CLI args
* Fixed example README
* Added pytorch example to repo README
* Addressed pylint issues
* pass in Reader instance to DataLoader adapter
* pin pip version
* fix class reference in comment
* early import of torch in conftest to prevent dlopen error for Python < 3.0
* skip test_read_mnist_dataset and test_full_pytorch_example when Python < 3.0
* separate out test_generate_mnist_dataset, which can run in both Python versions
Force-pushed from 4c98b2f to 915d71c.
…hooting step for pytorch issue; upgrade to xenial; simplified pytorch Reader args.
Force-pushed from 915d71c to 2a45f47.
This code includes an MNIST dataset generator, a pytorch training example that uses the resulting dataset, and a simple `README.md`.

As can be seen from `main.py`, there are a few limitations that come to light which could help us improve petastorm: […]

Running `pytorch/examples/mnist/main.py` (in a Docker container) with the default 10 epochs yielded the following outcome (I show just the test output for the middle 8 epochs): […]

With the petastormed variant, the training accuracy looks on-par, with a somewhat better runtime. I'll show just the test output: […]