Abstractify methods for finding data files saved on disk #32

Closed
spencerahill opened this issue Nov 6, 2015 · 4 comments

@spencerahill
Owner

(Below is copied from @spencerkclark's comment on #31.)

While currently within Calc the two methods accomplish the same tasks, in an abstract sense the current ..._one_dir and ..._gfdl read-in methods are actually quite different:

..._one_dir requires the user to map every variable to a file name explicitly. If the variable does not appear in this map, aospy will not even attempt to look for it.
..._gfdl is an implicit system. The mapping is coded into the method itself, which looks for the files within the post-processing file structure. The user does not have to specify in advance which variables are in which files, and thus aospy is allowed to look for variables that may not exist.

I would argue that the explicit read-in method is the most general way of doing things. With enough information, one could automatically generate an explicit file map from an implicit generator. In addition, there is nothing that says you couldn't relax the current single directory constraint and just map each variable (within a particular time frequency) to a full file path.

To continue to support implicit read-in methods (for very structured output data, like ..._gfdl) you could require that a user create some object that implements an interface (call it FileMapGenerator?) whose methods generate a map to files for a particular variable when given intvl_in, the variable name, data_in_dur, etc. as arguments. The source code for these objects could be stored in a user's aospy_user directory.
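To make the interface idea concrete, here is a minimal sketch of what a FileMapGenerator could look like. Only the name FileMapGenerator and the arguments intvl_in and data_in_dur come from the comment above; the `generate` method name, the `StructuredDirGenerator` class, and the directory layout are all made up for illustration and are not aospy's actual API.

```python
from abc import ABC, abstractmethod
import os


class FileMapGenerator(ABC):
    """Interface: build an explicit {variable name: [file paths]} map."""

    @abstractmethod
    def generate(self, var_name, intvl_in, data_in_dur):
        """Return a dict mapping var_name to a list of file paths."""


class StructuredDirGenerator(FileMapGenerator):
    """Hypothetical generator for an invented, highly structured layout."""

    def __init__(self, root_dir, component):
        self.root_dir = root_dir
        self.component = component

    def generate(self, var_name, intvl_in, data_in_dur):
        # Assumed layout (illustrative only):
        #   <root_dir>/<component>/<intvl_in>/<data_in_dur>yr/<var_name>.nc
        path = os.path.join(
            self.root_dir,
            self.component,
            intvl_in,
            "{}yr".format(data_in_dur),
            var_name + ".nc",
        )
        # The path is built, not checked: the file may not exist on disk.
        return {var_name: [path]}
```

Because every generator returns the same dict structure, code downstream never needs to know which directory convention produced the paths.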

Within a Run object one could then have a single argument for the file read-in method. The user could pass either the explicit dictionary mapping or an object that implements the FileMapGenerator interface. Within Calc, when reading in the files, the logic could be as simple as: "if an explicit map is provided, use the map; if not, pass the variable name, intvl_in, etc. as arguments to the generator, which returns an explicit map for just that variable." Using an interface would ensure that the generated explicit file map always has the same structure (so that it could be used seamlessly within Calc).
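The quoted logic could be sketched roughly as follows. The function name `resolve_files`, the `generate` method, and the `_DemoGenerator` stand-in are hypothetical names chosen for this example; the only assumption about the generator is that it returns a dict of the same shape as the explicit map.

```python
def resolve_files(read_in_method, var_name, intvl_in, data_in_dur):
    """Return an explicit {var_name: [paths]} map for a single variable."""
    if isinstance(read_in_method, dict):
        # Explicit map: a variable missing from the map is never searched for.
        if var_name not in read_in_method:
            raise KeyError("no file listed for variable {!r}".format(var_name))
        files = read_in_method[var_name]
        # Accept a single file name or a sequence of them.
        if isinstance(files, str):
            files = [files]
        return {var_name: list(files)}
    # Implicit method: delegate to the generator object (assumed interface).
    return read_in_method.generate(var_name, intvl_in, data_in_dur)


class _DemoGenerator:
    """Stand-in generator with an invented naming scheme."""

    def generate(self, var_name, intvl_in, data_in_dur):
        name = "{}.{}.{}yr.nc".format(var_name, intvl_in, data_in_dur)
        return {var_name: [name]}


explicit = resolve_files({"temp": "atmos.nc"}, "temp", "monthly", 20)
implicit = resolve_files(_DemoGenerator(), "temp", "monthly", 20)
```

Both calls return a dict with the same structure, so the code that opens the files does not care which path was taken.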

@spencerahill
Owner Author

I really like this. I agree, the implicit mapping should be specified by the user, since it no doubt varies so much among people (even within the lab; cf. #28).

Going one step further, there could be a FileMap class, which for starters could just be a wrapper around the dict built-in (or maybe OrderedDict). So Run (or Model or Proj; it should be possible to specify this at any level) just calls e.g. FileMap(read_in_method), and FileMap.__init__ supports read_in_method being either a dict or a FileMapGenerator. In the latter case, FileMapGenerator then just executes whatever method(s) it needs to build the FileMap. Or maybe it should be FileMapGenerator that accepts the dict, so that FileMap only needs to handle the FileMapGenerator interface and nothing else?

That way Calc doesn't need any logic at all -- it will always receive a FileMap object that explicitly lists where the files it needs are.
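A rough sketch of the first variant (FileMap.__init__ accepting either a dict or a generator) might look like this. The `var_names` argument, the `generate` method, and the `OneFilePerVar` stand-in are assumptions added for illustration; the thread does not say how FileMap would learn which variables to resolve.

```python
from collections.abc import Mapping


class FileMap(Mapping):
    """Read-only, explicit {variable name: [file paths]} mapping."""

    def __init__(self, read_in_method, var_names=(), **gen_kwargs):
        if isinstance(read_in_method, Mapping):
            # Explicit map: copy it, normalizing single names to lists.
            self._files = {
                name: [files] if isinstance(files, str) else list(files)
                for name, files in read_in_method.items()
            }
        else:
            # Generator object: build the explicit map up front.
            self._files = {}
            for name in var_names:
                self._files.update(read_in_method.generate(name, **gen_kwargs))

    def __getitem__(self, var_name):
        return self._files[var_name]

    def __iter__(self):
        return iter(self._files)

    def __len__(self):
        return len(self._files)


class OneFilePerVar:
    """Stand-in generator (hypothetical): one file per variable."""

    def __init__(self, directory):
        self.directory = directory

    def generate(self, var_name, intvl_in):
        path = "{}/{}.{}.nc".format(self.directory, intvl_in, var_name)
        return {var_name: [path]}


explicit = FileMap({"temp": "atmos.nc"})
generated = FileMap(OneFilePerVar("/data"), var_names=["temp", "precip"],
                    intvl_in="monthly")
```

With either input, a caller such as Calc would only ever do a plain lookup like `generated["precip"]`, which is the point made above about Calc needing no branching logic.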

@spencerkclark
Collaborator

That's even better -- something along these lines would remove a lot of distracting code (~100 lines) from calc.py.

@spencerahill
Owner Author

From @spencerkclark in #90:

Tracing all the way back to a main-like script, what is the minimum set of parameters needed to identify a given file set for any DataLoader? How should we specify those parameters when submitting a computation? In some ways this traces back to your comment on _generate_file_set above.

This is a key outstanding design question regarding DataLoader, which was introduced in #90 to address this Issue.

@spencerahill
Owner Author

Closed by #90
