Add config filter parameter to simulate_factors() #14

Merged · 8 commits · May 5, 2017
Changes from 6 commits
2 changes: 1 addition & 1 deletion desmod/config.py
@@ -271,7 +271,7 @@ def factorial_config(base_config, factors, special_key=None):
    for keys, values_list in factors:
        unrolled_factors.append([(keys, values) for values in values_list])

-    for keys_values_lists in product(*unrolled_factors):
+    for idx, keys_values_lists in enumerate(product(*unrolled_factors)):

Collaborator:
Remove enumerate() to completely unwind the changes in this function.

        config = deepcopy(base_config)
        special = []
        if special_key:
32 changes: 29 additions & 3 deletions desmod/simulation.py
@@ -5,6 +5,12 @@
from multiprocessing import cpu_count, Process, Queue
from pprint import pprint
from threading import Thread
try:
    from future_builtins import filter

Collaborator:
We use six for Python 2/3 compatibility. To get filter, we can unconditionally:

from six.moves import filter

except ImportError:
    # Assume Python 3, which already has filter built in
    pass

import json
import os
import random
Expand All @@ -21,6 +27,7 @@
                             consume_multi_progress)
from desmod.timescale import parse_time, scale_time
from desmod.tracer import TraceManager
from desmod.workspacesync.s3 import S3Sync


class SimEnvironment(simpy.Environment):
@@ -124,11 +131,12 @@ def schedule(self, delay=0):
class _Workspace(object):
    """Context manager for workspace directory management."""
    def __init__(self, config):
        self.config = config

Collaborator:
I maintain that it is important to fully extract all configuration from config here in _Workspace.__init__(). Doing so would obviate the need to capture config with the self.config instance attribute. It would also demand that we instead instantiate an S3Sync instance. I.e.:

def __init__(self, config):
    self.s3_sync = S3Sync(config)
    ...
def sync(self):
    if self.s3_sync.is_enabled:
        ...
        self.s3_sync.sync(artifacts)

We also want S3Sync.__init__() to follow suit and extract all needed config as instance attributes such that the config dict is not retained as an instance attribute and is thus not referenced in S3Sync.sync().
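
A fuller sketch of that shape (hypothetical code: the single-argument S3Sync(config) constructor and the is_enabled attribute do not exist in this diff):

class S3Sync(object):
    def __init__(self, config):
        # Extract and validate everything up front; keep no reference
        # to the config dict itself.
        self.is_enabled = config.setdefault('sim.sync.s3.enable', False)
        self.bucket = config.get('sim.sync.s3.bucket')
        self.prefix = config.setdefault('sim.sync.s3.prefix', '')
        self.workspace = config['sim.workspace']

    def sync(self, artifacts):
        # Upload artifacts using only the extracted settings.
        ...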

        self.workspace = config.setdefault('meta.sim.workspace',
                                           config.setdefault('sim.workspace',
                                                             os.curdir))
        self.overwrite = config.setdefault('sim.workspace.overwrite', False)
-        self.prev_dir = os.getcwd()
+        self.orig_dir = os.getcwd()

    def __enter__(self):
        if os.path.relpath(self.workspace) != os.curdir:
@@ -140,7 +148,20 @@ def __enter__(self):
            os.chdir(self.workspace)

    def __exit__(self, *exc):
-        os.chdir(self.prev_dir)
+        os.chdir(self.orig_dir)
        try:
            os.chdir(self.config['sim.workspace'])
            self.sync()
        finally:
            os.chdir(self.orig_dir)

Collaborator:
When __exit__() is called, we have a strong expectation that the current directory will be meta.sim.workspace. In multi-simulation cases, that would be something like ws/42. The new code here changes to the sim.workspace directory prior to calling sync(); i.e. just ws, not ws/42 where this simulation's artifacts reside.

I question whether it is right for sync() to operate outside of the individual simulation's workspace.

It seems like this paradigm of changing to the multi-sim session level directory (sim.workspace) can only work correctly if the new config_filter is used to limit the multi-simulation session to exactly one simulation. Otherwise, each of many simulations would call sync() such that they would all try to sync ws/0, ws/1, ..., ws/N.

To put a fine point on it, what I'm suggesting is that the control-flow here should be:

def __exit__(...):
    self.sync()
    os.chdir(self.prev_dir)

Sorry I didn't pick up on this in my first review.


    def sync(self):
        if self.config.setdefault('sim.sync.s3.enable', False):
            artifacts = []
            for root, _, files in os.walk(os.curdir, topdown=False):
                for filename in files:
                    artifacts.append(os.path.join(root, filename))

Collaborator:
It is a little curious that _Workspace has responsibility for walking the artifacts rather than S3Sync taking that on itself. Is the idea that, if we had multiple synchronization backends, we could reuse the same artifacts list?

Contributor (author):
I originally had this logic in workspacesync/s3.py and moved it into _Workspace for exactly the reason you ask about. Having the sync backends use the same artifacts input is nice for consistency, code reuse, and computational cycles. That said, I'm solving a problem that doesn't exist yet, so I'm happy to move it back to s3.py for now if you think that is better.

Collaborator:
<shrug> I think it's fine as-is.

            S3Sync(self.config, artifacts).sync()


def simulate(config, top_type, env_type=SimEnvironment, reraise=True,
@@ -205,7 +226,7 @@ def simulate(config, top_type, env_type=SimEnvironment, reraise=True,


def simulate_factors(base_config, factors, top_type,
-                     env_type=SimEnvironment, jobs=None):
+                     env_type=SimEnvironment, jobs=None, config_filter=None):
"""Run multi-factor simulations in separate processes.

The `factors` are used to compose specialized config dictionaries for the
@@ -220,15 +241,20 @@
    :param top_type: The model's top-level Component subclass.
    :param env_type: :class:`SimEnvironment` subclass.
    :param int jobs: User-specified number of concurrent processes.
    :param function config_filter:
        A function that takes a config dict and returns a bool; configs for
        which it returns False are filtered out of the simulation set.
    :returns: Sequence of result dictionaries for each simulation.

    """
    configs = list(factorial_config(base_config, factors, 'meta.sim.special'))
    ws = base_config.setdefault('sim.workspace', os.curdir)
    overwrite = base_config.setdefault('sim.workspace.overwrite', False)

    for index, config in enumerate(configs):
        config['meta.sim.index'] = index
        config['meta.sim.workspace'] = os.path.join(ws, str(index))
    if config_filter is not None:
        configs[:] = filter(config_filter, configs)
    if overwrite and os.path.relpath(ws) != os.curdir and os.path.isdir(ws):
        shutil.rmtree(ws)
    return simulate_many(configs, top_type, env_type, jobs)
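
For illustration, a usage sketch of the new parameter (MyTop is a hypothetical Component subclass): run only the even-seeded subset of a factorial sweep:

results = simulate_factors(
    base_config,
    [(['sim.seed'], [[1], [2], [3], [4]])],
    MyTop,
    config_filter=lambda cfg: cfg['sim.seed'] % 2 == 0)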
Empty file.
50 changes: 50 additions & 0 deletions desmod/workspacesync/s3.py
@@ -0,0 +1,50 @@
"""Synchronization of workspace artifacts to Amazon S3 cloud storage."""
import os


class S3Sync(object):

Collaborator:
It seems like this could have been implemented as a regular function instead of as a class with a single method. I.e.

sync_to_s3(...)

versus:

S3Sync(...).sync()

I don't see it as imperative that we change this. It is not a user-facing API and we could thus revisit later if desired.

Contributor (author):
The main reason I went with this approach was to have a common set of methods for multiple backend synchronizers, which could register themselves and perhaps one day be called in a loop like so:

for cls in sync_registry:
    cls(config, artifacts).sync()

That said, that's a lot of hand waving, and we only have one sync backend now, so I probably shouldn't be solving problems that don't exist yet.
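
The registration half of that idea could look something like this (purely hypothetical; no sync_registry exists in this PR):

sync_registry = []

def register_sync(cls):
    # Class decorator that records a workspace-sync backend class.
    sync_registry.append(cls)
    return cls

@register_sync
class S3Sync(object):
    ...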

Collaborator:
Even without using classes, each synchronization backend module could register its sync() function. Lots of ways to skin this. As long as these synchronization backends are all driven by configuration and not by user-facing API, then we retain lots of license to revisit this detail.

I do think there is at least one compelling case for having a class: extract and check the configuration before simulation.
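
For example, a sketch of fail-fast extraction in S3Sync.__init__() (hypothetical; the current code defers all config key lookups to sync() time):

def __init__(self, config, artifacts):
    # A missing bucket raises KeyError here, before any simulation runs,
    # instead of hours later when sync() is first called.
    self.bucket = config['sim.sync.s3.bucket']
    self.prefix = config.setdefault('sim.sync.s3.prefix', '')
    self.artifacts = artifacts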

"""S3 workspace synchronization.

The :class:`S3Sync` class implements a sync method that synchronizes the
workspace artifacts to an S3 bucket with a configured key prefix. The
s3 destination is of the format: /{prefix}/{workspace name}/{artifact path}

:param dict config: A fully-initialized configuration dictionary.
:param list artifacts: A list of artifact paths relative to the workspace
directory. E.g., ["./0/results.yaml"].

"""

    MAX_THREADS = 10

    def __init__(self, config, artifacts):
        self.client = None
        self.config = config
        self.workspace = config['sim.workspace']
        self.artifacts = artifacts

    def _upload_artifact(self, artifact):
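        # Artifact paths come from os.walk(os.curdir) and look like
        # './0/results.yaml'; the [2:] below strips the leading './'.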
        dest = os.path.join(
            self.config['sim.sync.s3.prefix'],
            os.path.split(self.workspace)[1],
            artifact[2:])
        self.client.upload_file(
            artifact, self.config['sim.sync.s3.bucket'], dest)

    def sync(self):
        """Concurrently upload the artifacts to S3."""

Collaborator:
A potential problem with sync() is that it does not sync--it only uploads new files.

I assume client.upload_file() will happily overwrite files without error.

What is missing is any concept of removing files that exist in S3, but are not in the list of artifacts to synchronize. We would ostensibly need to interrogate S3 with list_objects() to figure out which extra files (objects) need to be deleted.

However, doing that alone would leave extra per-simulation directories unaccounted for. E.g. using the same workspace, if I first run a multi-factor set of 10 simulations and then run again with only 5 simulations, even if we synchronized per-simulation directories 0..4 correctly, S3 would still contain bogus/stale directories 5..9.
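
A sketch of the missing deletion pass (hypothetical _delete_stale_objects() helper, not part of this PR; assumes the same bucket/prefix config keys and boto3 client):

def _delete_stale_objects(self, uploaded_keys):
    # List every object under this workspace's key prefix and delete
    # any that was not part of the just-uploaded artifact set.
    bucket = self.config['sim.sync.s3.bucket']
    prefix = os.path.join(self.config['sim.sync.s3.prefix'],
                          os.path.split(self.workspace)[1])
    paginator = self.client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            if obj['Key'] not in uploaded_keys:
                self.client.delete_object(Bucket=bucket, Key=obj['Key'])

Even with such a pass, the stale per-simulation directories described above (5..9) would only disappear if the listing prefix spanned the whole multi-sim workspace rather than a single simulation's directory.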

        from concurrent.futures import ThreadPoolExecutor
        import boto3

        self.config.setdefault('sim.sync.s3.prefix', '')
        self.client = boto3.client('s3')

        if len(self.artifacts) == 0:
            return

        futures = []
        with ThreadPoolExecutor(max_workers=self.MAX_THREADS) as executor:
            for artifact in self.artifacts:
                futures.append(
                    executor.submit(self._upload_artifact, artifact))
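        # result() re-raises any exception raised by an upload worker.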
        [future.result() for future in futures]
2 changes: 2 additions & 0 deletions requirements.txt
@@ -1,3 +1,5 @@
boto3 >= 1.4.2
futures >= 3.1.1; python_version < '3.0'
pyvcd >= 0.1.1
PyYAML >= 3.11
simpy >= 3.0.8
17 changes: 17 additions & 0 deletions tests/test_simulation.py
@@ -38,6 +38,7 @@ def config():
        'sim.result.file': 'result.yaml',
        'sim.workspace': 'workspace',
        'sim.workspace.overwrite': False,
        'sim.sync.s3.enable': False,
        'sim.timescale': '1 us',
        'sim.seed': 1234,
        'sim.duration': '1 us',
@@ -196,6 +197,22 @@ def test_simulate_factors(config):
                         result['config']['sim.result.file']))


def test_simulate_factors_only_factor(config):
    FACTOR_NUM = 2
    cfg_filter_fn = lambda cfg: cfg['meta.sim.index'] == FACTOR_NUM
    factors = [(['sim.seed'], [[1], [2], [3]])]
    results = simulate_factors(
        config, factors, TopTest, config_filter=cfg_filter_fn)
    assert len(results) == 1
    for result in results:
        assert result['sim.exception'] is None
        assert result['config']['meta.sim.workspace'] == os.path.join(
            config['sim.workspace'], str(FACTOR_NUM))
        assert os.path.exists(
            os.path.join(result['config']['meta.sim.workspace'],
                         result['config']['sim.result.file']))


def test_simulate_factors_progress(config, capfd):
    config['sim.progress.enable'] = True
    config['sim.duration'] = '10 us'
1 change: 1 addition & 0 deletions tests/test_tracer.py
@@ -37,6 +37,7 @@ def config():
        'sim.vcd.start_time': '',
        'sim.vcd.stop_time': '',
        'sim.workspace': 'workspace',
        'sim.sync.s3.enable': False,

Collaborator:
Seems like since sim.sync.s3.enable defaults to False, we should not have had to add this here. Is it really needed?

        'test.raise': False,
    }
