Skip to content
This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

Utils rewrite #10

Merged
merged 8 commits into from
Feb 6, 2018
Merged

Utils rewrite #10

merged 8 commits into from
Feb 6, 2018

Conversation

ludwigschubert
Copy link
Contributor

@colah @znah this is a rewrite of unified io.

It doesn't introduce a lot of new functionality, but it does disentangle our prior abstractions a little:

If that looks obvious—good. :D

Does come with tests, but thorough testing is tough as we'll need to try the writing and reading functionality in various environments like /bigstore, /cns, on which I can't run automated tests.

I'd love to have a discussion about naming at this point. I went with what felt obvious, but I'm in no way stuck on these names. I also want to discuss what we should expose to users by default. I have put suggested functions into the __init__.py file.

Let me know your thoughts! :-)

@ludwigschubert ludwigschubert added the help wanted Extra attention is needed label Feb 2, 2018
Copy link
Contributor

@colah colah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great start! There's a lot of things I like, such as the general separation of load/read and save/write.

That said, I also have some questions and concerns. I've eft these as comments in line.

from lucid.util.array_to_image import _infer_domain_from_array, _normalize_array_and_convert_to_image


_last_html_output = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of all these _last_*_output variables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an idea @znah introduced (with an explanatory comment that I must have removed) to allow testing these methods without needing to mock IPython.display. I may alternatively just attempt to mock out that dependency during testing.

pass


def _image_url(image, fmt='png', mode="data"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit weird to be passing PIL.Image objects around. From the perspective of lucid, they're this really weird object, that we're only invoking as an intermediary to convert NumPy arrays to image formats. I'd rather have them be entirely encapsulated, lest the abstraction leak.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, though if we want resizing it seems like a better option than staying in numpy and adding all of scipy for just resizing. I'll take a stab at this with numpy arrays as "canonicalized" images. At the end, though, we will need either an easily serializable format or already serialized data (see your comment about not wanting to pass around serialized imaeg data either) to pass around if we want to avoid duplication. An image needs to be serialized in some way for both display and saving/writing.

_last_html_output = html_str
IPythonDisplay(IPythonHTML(html_str))

def _display_data(image_data, format):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find this function name descriptive: We're displaying an image, given data. Relatedly, does it make sense for image data to be something we pass around like this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More broadly, it seems like there's a lot of abstraction in this code that's coming from the theory that it's better to use IPythonImage over data urls when possible. Do we know that's actually the case? If so, are the gains worth the abstraction overhead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like your last point. I don't know! I will look into it. It just seemed that the data URL was an additional roundtrip, but I will check if IPython.display.Image doesn't do the same thing…

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will quietly go into a corner and concede your point—IPython.display.Image just converts to data URLs as well.

So data-urls it is from now on; that simplifies things in show.

rank, shape and dtype. rank 4 tensors will be displayed as image grids, rank 2
and 3 tensors as images.
"""
if isinstance(thing, np.ndarray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be valid for thing to be a list of NumPy arrays. This is important for two reasons:

(1) Often each image gets generated separately and added to a list, so one would have to deliberately convert them to a NumPy array.
(2) Sometimes the images are of different sizes (eg. activation magnitudes at different layers). This has traditionally been a big pain point, that the images abstraction nicely solves.

In the long run, we may also want to support grids of images and other such things...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, totally! I'll build that in. In general, this is supposed to define the interface and make it reasonably easy and predictable to extend it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in e3b6c8f.

IPythonDisplay(IPythonImage(data=image_data, format=format))

except ImportError:
logging.warn('IPython is not present, HTML output from lucid.util.show and '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this accounting for the IPython libraries not being installed, or us not being in an IPython environment?

If the former, is there any reason for the IPython libraries to not be installed? (They'll be part of our package requirements, presumably?)

If the latter, I don't think the import test accomplishes this. I also think IPython's libraries may be smart enough to handle the issue for us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear you. At the heart of the issue is a blurring of our tensorflow/mathematical utilities and some helpers for using the above interactively. I will think more about this; one way may be to decouple optvis and things like show entirely. (But, of course, we still want them to be nice and easy to use in a notebook environment without much configuration!)


low, high = np.min(array), np.max(array)
if low < domain[0] or high > domain[1]:
message = "Clipping domain from (~{:.2f}, ~{:.2f}) to (~{:.2f}, ~{:.2f})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These messages are really helpful for catching errors, but can also get annoying. Maybe we should consider having a way to turn off "verbose"? Either as a function flag or global. Maybe that's sufficiently handled by logging -- I don't have any experience with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, logging has a switch for that! We may want to set the threshold higher by default ourselves? Otherwise the standard way to do this in python is:

import logging
logging.getLogger().setLevel(logging.INFO)

We could also use a custom logger? I'll look into this more and propose a way for these messages to be hidden by default but easily enabled. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adressed in e3b6c8f.

Logging can now be configured at the module level & default logging level is set to WARN.

Default log level is now scoped to our module and can be set like this:

import logging
logging.getLogger('lucid').setLevel(logging.INFO)

I included some documentation in the lucid module's __init__ file, but we may want to move this explanation to a readme or FAQ in the future.
(Though it seems to follow Python best practices, I didn't know about the one-logger-per module either.)

from lucid.util.read import reading


def load_npy(handle):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird to me for these to take a handle. I kind of liked that, in the old version, they could be used to override inferring the type from the extension. But on reflection, I never used it.

If we intend for these to take handles, maybe make them private functions (eg. _load_npy). The present version seems potentially confusing to users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also seems a bit inconsistent with save_npy(object, url) below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right; these should be private. Savers take handles because more often than not that's what the libraries expect. It's not perfect as this breaks the serialize/write abstraction a little bit.

I believe this can be solved by making those functions private, and potentially adding back load_npy(path) convenience methods should we need them. I believe our goal should be to never need to explicitly call these, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making private mostly resolves this. Maybe also change their names so that they don't look like duals to save_*() functions? _save_to_handle_npy() is a bit clunky, but meh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, just want to check that you're aware of cstringio. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 04cc1cc by making loaders private for now. If we ever need them to be public as a fallback I will make sure they follow the same interface as save and take a path/url directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I am aware of cStringIO. Unfortunately it's a Python2-only thing
(towards the bottom of the list).

return result
else:
message = "Unknown extension '{}', supports {}."
raise RuntimeError(message.format(ext, loaders))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"supported" and loaders.keys() maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was laziness on my part. .keys() is one of those expressions that's not compatible across python versions, but I will simply use list(…). (Source, and generally handy cheat sheet for compatibility)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 04cc1cc.



@contextmanager
def reading(path, mode=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that reading() only supports gfile right now. Does that mean that load() only supports what gfile does at present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, but merely WIP. reading/read will have feature parity before I suggest this be merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4a91554.

def _supports_make_dirs(path):
"""Whether this path implies a storage system that supports and requires
intermediate directories to be created explicitly."""
return not path.startswith("/bigstore")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, GFS and bigstore are the same thing.

@colah
Copy link
Contributor

colah commented Feb 2, 2018

Where is the line between utils and misc? Should gradient_override be in utils?

@ludwigschubert
Copy link
Contributor Author

/misc vs /utils.

I wanted to start separating infrastructure/glue code from, you know, "actual math code". I'm not set on the naming, but I felt that utils for utility methods made sense, and that misc would be a place for things like your convenience wrapper for dimensionality reduction, etc.

Ludwig Schubert added 3 commits February 2, 2018 17:01
(Adds invariance to extension CAPITALIZATION and tries opening unknown extensions as images.)
"""
rank = len(shape)

if rank == 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PIL.Image.fromarray() implements this logic already (except the case ndim=3, depth=1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, thanks for letting me know! The single image with non-squeezed depth = 1 case tripped me up in a notebook some timer ago, but I'll see if squeezing such dimensions allows us to get rid of this method altogether!

raise ValueError("Unsupported mode '%s'", mode)


def _data_from_image(image, fmt='png', quality=95):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_encode_image()?

@ludwigschubert
Copy link
Contributor Author

ludwigschubert commented Feb 5, 2018

From sync with @colah:

Tasks concerning logging and interactive usage:

  • Logging needs may change per function call; how can users specify that? (module-wide log levels may not be precise enough)
  • "log" images—same or different system? custom log handler? etc.
  • image output may be surprising, why not display stats about image on hover etc.

Additional considerations

  • PIL has max image size; can we think of a way around this?
  • "data-url" as a save target; you save and get back a URL somehow?
  • less wide interface for turning an arbitrary array into data
  • can load be made to deal with protobufs?
  • can save be made to deal with protobufs?

…g by avoiding it, more structure for tests
@ludwigschubert
Copy link
Contributor Author

Moving the remaining ToDos here to issues.

# Conflicts:
#	notebooks/local_development.ipynb
@ludwigschubert ludwigschubert merged commit 0429933 into master Feb 6, 2018
@ludwigschubert ludwigschubert deleted the utils-rewrite branch February 6, 2018 03:11
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants