New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quandl wiki loader #1173

Merged
merged 2 commits into from May 3, 2016

Conversation

Projects
None yet
4 participants
@llllllllll
Member

llllllllll commented Apr 28, 2016

No description provided.

Options:
-f, --algofile FILENAME The file that contains the algorithm to run.
-t, --algotext TEXT The algorithm script to run.
-D, --define TEXT Define define a name to be bound in the

This comment has been minimized.

@richafrank

richafrank Apr 28, 2016

Member

typo "define" twice

@@ -1,26 +0,0 @@
#!/usr/bin/env python

This comment has been minimized.

@richafrank

richafrank Apr 28, 2016

Member

Should we keep run_algo for a while, but deprecated?

This comment has been minimized.

@richafrank

richafrank Apr 28, 2016

Member

It could be a passthrough to the new entry point.

This comment has been minimized.

@llllllllll

llllllllll Apr 28, 2016

Member

I think that for the 1.0 release it would be okay to just rename this, I imagine most people type this out each time they use it so it won't break much

)
__all__ = ['loader', 'load_from_yahoo', 'load_bars_from_yahoo',
'load_prices_from_csv', 'load_prices_from_csv_folder']

This comment has been minimized.

@richafrank

richafrank Apr 28, 2016

Member

Should we be deprecating these before removing them?

This comment has been minimized.

@llllllllll

llllllllll Apr 28, 2016

Member

I could do that

@llllllllll llllllllll force-pushed the quandl-wiki-loader branch 2 times, most recently from d3b4e15 to c41abc1 Apr 28, 2016

See the docstring of ``isclose`` for more information about ``atol`` and
``rtol``.
"""
if equal_nan and a is nan and b is nan:

This comment has been minimized.

@richafrank

richafrank Apr 28, 2016

Member

Do we only care about numpy.nan?

In [1]: float('nan') is float('nan')
Out[1]: False

In [2]: from numpy import nan

In [3]: float('nan') is nan
Out[3]: False

This comment has been minimized.

@llllllllll

llllllllll Apr 28, 2016

Member

I will use np.isnan

def hidden(path):
"""Check if a path is hidden.

This comment has been minimized.

@richafrank

richafrank Apr 28, 2016

Member

Does this need to be a cross-platform notion of hidden, e.g. using windows file attributes, or just looking at the name?

This comment has been minimized.

@llllllllll

llllllllll Apr 28, 2016

Member

this is mainly to filter out the .cache dir we write. Until we really care about it being hidden I think this is fine

def maybe_show_progress(it, show_progress, **kwargs):
"""Optionally show a progress bar for the given iterator.
class Timedelta(_DatetimeParam):

This comment has been minimized.

@richafrank

richafrank Apr 28, 2016

Member

Where is this used?

This comment has been minimized.

@llllllllll

llllllllll Apr 28, 2016

Member

I don't think I ended up using it, We might want to add an --older-than flag to clean which would want this

@llllllllll llllllllll force-pushed the quandl-wiki-loader branch from c41abc1 to c37c5dc Apr 28, 2016

env = TradingEnvironment(asset_db_path=str(
bundle_data.asset_finder.engine.url,
)[len('sqlite:///'):])

This comment has been minimized.

@ssanderson

ssanderson Apr 28, 2016

Member

This feels a little brittle. Could we make this a regex split?

start=start,
end=end,
env=env,
**{

This comment has been minimized.

@ssanderson

ssanderson Apr 28, 2016

Member

TIL that this is valid syntax.

@llllllllll llllllllll force-pushed the quandl-wiki-loader branch 3 times, most recently from 0b933ff to 21cf1c4 Apr 28, 2016

-b, --bundle BUNDLE-NAME The data bundle to use for the simulation.
[default: quandl]
--bundle-timestamp TIMESTAMP The date to lookup data on or before.
[default: 2016-04-28 20:14:38.252921+00:00]

This comment has been minimized.

@richafrank

richafrank Apr 29, 2016

Member

Is this date actually hard-coded?

This comment has been minimized.

@llllllllll

llllllllll Apr 29, 2016

Member

no, that is pd.Timestamp.utcnow()(, I wonder if we can make this look nicer

This comment has been minimized.

@llllllllll

llllllllll Apr 29, 2016

Member

I could set show_default to false and just say that it is the current time

$ python -m zipline clean <bundle> --after <date>
# keep everything in the range of [before, after] and delete the rest
$ python -m zipline clean <bundle> --before <date> --after <after>

This comment has been minimized.

@richafrank

richafrank Apr 29, 2016

Member

Will it keep everything in the range (,after] ∪ [before,) if I specify a before that follows the after?

This comment has been minimized.

@llllllllll

llllllllll Apr 29, 2016

Member

right now that would delete everything:

    def should_clean(name):
        dt = pd.Timestamp(name)

        return (
            (
                (before is not None and dt < before) or
                (after is not None and dt > after)
            ) and
            not in_last_n(dt)
        )
ingestion that is less than or equal to the ``bundle-date``. This is how we can
run backtests with older data. The reason that this uses a less than or equal to
relationship is that we can specify the date that we ran an old backtest and get
the same data that would have been available to us on that date. the

This comment has been minimized.

@richafrank

richafrank Apr 29, 2016

Member

Should capitalize "The" at the end of this line.

pricing data, splits, cash dividends, and asset metadata. This is the bundle
that ``run`` will use by default if no other bundle is specified. To ingest this
data bundle we recommend creating an account on quandl.com to get an API key to
be able to make more API requests. Once we have an API key we may run:

This comment has been minimized.

@richafrank

richafrank Apr 29, 2016

Member

"more API requests" - more than what?

This comment has been minimized.

@llllllllll

llllllllll Apr 29, 2016

Member

does "more per day" work here?

This comment has been minimized.

@richafrank

richafrank Apr 29, 2016

Member

That makes more sense to me.

On Fri, Apr 29, 2016 at 4:12 PM, Joe Jevnik notifications@github.com
wrote:

In docs/source/bundles.rst
#1173 (comment):

+bundle-date defaults to the current day to use the most recent data.
+
+Default Data Bundles
+~~~~~~~~~~~~~~~~~~~~
+
+.. quandl-data-bundle:
+
+Quandl WIKI Bundle
+`````````````````
+
+By default zipline comes with thequandl``data bundle which uses quandl's +WIKI dataset https://www.quandl.com/data/WIKI`
. This bundle includes daily
+pricing data, splits, cash dividends, and asset metadata. This is the bundle
+that `run` will use by default if no other bundle is specified. To ingest this
+data bundle we recommend creating an account on quandl.com to get an API key to
+be able to make more API requests. Once we have an API key we may run:

does "more per day" work here?


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
https://github.com/quantopian/zipline/pull/1173/files/6a4ec86f631ac2d62f74c536baaf6da6596a12a0#r61637980

@llllllllll llllllllll force-pushed the quandl-wiki-loader branch from 6a4ec86 to 168e907 Apr 29, 2016

.. code-block:: python
from zipline.bundles import register, yahoo_equities

This comment has been minimized.

@ssanderson

ssanderson Apr 29, 2016

Member

unused import?

This comment has been minimized.

@llllllllll

llllllllll Apr 29, 2016

Member

this should be yahoo_equities(equities) in the call to register

ENH: make BcolzMinuteBarWriter.write take iterable
Updates the BcolzMinuteBarWriter.write api to allow users to pass their
data as a stream instead of requiring that they loop over their data
externally. This matches the API presented by BcolzDailyBarWriter.
.. note::
This is not the total number of allowed failures, just the number of allowed

This comment has been minimized.

@ssanderson

ssanderson Apr 29, 2016

Member

It's unclear what the word "this" refers to here. There are several other usages of the word "This" in this paragraph that should probably be replaced with explicit nouns. (e.g. "this bundle includes" vs "the quandl bundle includes.).

@llllllllll llllllllll force-pushed the quandl-wiki-loader branch from 168e907 to 7600d61 Apr 29, 2016

```````````
``environ`` is a mapping representing the environment variables to use. This is
where any custom arguments needed for the ingestion should be passed, for

This comment has been minimized.

@richafrank

richafrank Apr 29, 2016

Member

Should we say that "this is where any custom arguments need for the ingestion WILL be passed"? Or are we asking the bundle author to do something here?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

This is where the bundle author should find any arguments that it needs. For example, quandl needs to know what the auth key is so it uses the env. Do you have a suggestion on how to word this?

exchange and a few other columns. To write data, invoke
:meth:`~zipline.assets.AssetDBWriter.write`. This is passed dataframes for the
various pieces of metadata, more information about the format of the data exists
in the docs for write.

This comment has been minimized.

@richafrank

richafrank Apr 29, 2016

Member

"To write data, invoke write with dataframes for the various pieces of metadata. More information about the format..."

``cache`` is an instance of :class:`~zipline.utils.cache.dataframe_cache`. This
object is a mapping from strings to dataframes. This object is provided in case
an ingestion crashes part way through. The idea is that as the ingest function

This comment has been minimized.

@richafrank

richafrank Apr 29, 2016

Member

typo: extra "as"

ip.register_magic_function(utils.parse_cell_magic, "line_cell", "zipline")
del ip
def load_ipython_extension(ipython):

This comment has been minimized.

@richafrank

richafrank Apr 29, 2016

Member

What calls this?

This comment has been minimized.

@llllllllll

llllllllll Apr 29, 2016

Member

%load_ext

def asset_db_path(bundle_name, timestr, environ=None):
return pth.data_path(
[bundle_name, timestr, 'assets-%d.sqlite' % ASSET_DB_VERSION],

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

How do these locations compare to the current locations?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

There is no current location for these files on disk that I know of.

This comment has been minimized.

@richafrank

richafrank May 3, 2016

Member

Is that because the default was in-memory?

This comment has been minimized.

@llllllllll
)
class UnknownBundle(click.ClickException, LookupError):

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

Nice, I didn't know about LookupError.

def commit(self):
"""Sync the temporary directory to the final path.
"""
copytree(self.name, self._final_path)

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

Might be worth noting in the docstring that this isn't truly transactional. If the commit fails, we don't roll back to the original state.

daily_bars_dir.name,
bundle.calendar,
)
daily_bar_writer.write(())

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

Why do we write here?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

to ensure that the ctables exist when we try to create the adjustment reader which will open the tables.

This comment has been minimized.

@richafrank

richafrank May 3, 2016

Member

Where does that happen? Could we have a comment here?

after_last,
dt,
[asset],
)[0].T

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

Just pattern matching here, but why do some of these changes include a transposition, but not others?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

some places we are doing a ravel and taking the first element anyways.

out[i, where] = values[where]
# first slice down to len(where) because we might not have
# written data for all the minutes requested
out[:len(where), i][where] = values[where]

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

Were these changes for consistency with other writers/readers?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

There are two changes here. The first is that we are transposing the output to be more consistent with the other readers. The second is that we are adding some extra slicing to make sure the indicies are aligned when write. This was causing warnings.

@@ -0,0 +1,223 @@
"""

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

Are there changes here besides the git move from zipline/data/paths.py?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

I made the defaults for os.environ more standard. They used to be None and then bind to os.environ.copy().

@@ -365,6 +364,7 @@ def iterator(iterator=iterator, assets=set(assets)):
full_table.attrs['last_row'] = last_row
full_table.attrs['calendar_offset'] = calendar_offset
full_table.attrs['calendar'] = calendar.asi8.tolist()
full_table.flush()

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

Why do we need to flush here?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

I wanted to ensure that this was flushed to make it easier to reason about the state of the files on disk when debugging.

This comment has been minimized.

@richafrank

richafrank May 3, 2016

Member

Any meaningful performance hit?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

this should normally get flushed soon anyways so it should be fine.

if not f.endswith('.py') or f == '__init__.py':
continue
modname = f[:-len('.py')]
globals()[modname] = import_module('.' + modname, package=__name__)

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

What are we doing here that differs from the built-in import machinery?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

Do you mean __import__? If so, __import__('a.b.c') returns a, not a.b.c. I wanted to just get the last module.

This comment has been minimized.

@richafrank

richafrank May 3, 2016

Member

Sorry, I meant, why are we walking the file system looking for submodules to import and add as attributes of this module, when import module.submodule should do that already?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

ah, I wanted to be able to use dir(zipline.examples) to find the examples in the tests and as a user

Should the directory be cleaned up if an exception is raised in the
context manager.
serialize : {'msgpack', 'pickle'}, optional
How should the data be serialized.

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

Could we include the defaults in the docstring for these optional parameters?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

numpy does not suggest including the defaults because you can see that if you inspect the function signature. The only time that the default should be included is if it is lazy, like None which creates an empty dict or something.

This comment has been minimized.

@richafrank

richafrank May 3, 2016

Member

Oh, good to know.

lock=None,
clean_on_failure=True,
serialization='msgpack'):
self.path = path if path is not None else mkdtemp()

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

I was surprised to see that we create a temp dir even if you don't enter the cache context.

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

this is standard for context managers like this. open will open the file even if you don't enter the context.

This comment has been minimized.

@richafrank

richafrank May 3, 2016

Member

That's fair.

try:
return self.deserialize(self._keypath(key))
except UnboundLocalError:
# This is how pandas fails if the file doesn't exist! #pandas

This comment has been minimized.

@richafrank
def __setitem__(self, key, value):
with self.lock:
try:
del self[key]

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

Why do we remove it first?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

I originally was using castra for this, I will just remove this.

raise
# without `strict` we should just log the failure
warnings.warn(
'Failed to load extension: %r\n%s' % (extensions, e),

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

extensions -> ext ?

return perf
def _load_extensions(default, extensions, strict, environ):

This comment has been minimized.

@richafrank

richafrank May 2, 2016

Member

Should cli call this function instead of duplicating it?

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

yeah, I pulled this out so they could share. Looks like I missed a critical step ;)

self._final_path = final_path
def commit(self):
"""Sync the temporary file to the final path.

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

Same note as below.

@@ -33,18 +30,12 @@ def handle_data(context, data):
context.has_ordered = True
if __name__ == '__main__':
def _test_args():
"""Extra arguments to use when testing this example.

This comment has been minimized.

@llllllllll

llllllllll May 3, 2016

Member

does this docstring seem like enough?

This comment has been minimized.

@richafrank

richafrank May 3, 2016

Member

What do you think of "Extra arguments to use when zipline automated tests run this example."?

@llllllllll llllllllll force-pushed the quandl-wiki-loader branch 8 times, most recently from 52004f7 to 59c8e37 May 3, 2016

ENH: Updates the cli, data bundles and extensions.
Adds the data bundle concept which makes it easy for users to register
loading functions to build out minute and daily data along with an
assets db and adjustments db. By default we have provided a `quandl`
bundle which pulls from the public domain WIKI dataset. Users may
register new bundles by decorating an ingest function with
`zipline.data.bundles.register(<name>)`. This also provides a
`yahoo_equities` function for creating an ingestion function that will
load a static set of assets from yahoo.

The cli is now structured as a couple of subcommands and has been
changed to `python -m zipline`. The old behavior of `run_algo.py` has
been moved to the `run` subcommand. This is almost entirely the same
except that it now takes the name of the data bundle to use, defaulting
to `quandl`.

The next subcommand is `ingest` which takes the name of
a data bundle to ingest. This will run the loading machinery and write
the data to a specified location that `run` can find.

There is also a `clean` subcommand which deletes the data that was
written with `ingest`.

Extensions have also been added to zipline. This is an experimental
feature where users can provide an extra set of python files to run at
the start of the process. These can be used to configure aspects of
zipline. Right now the only thing that is supported in an extension file
is the registration of a new data bundle.

@llllllllll llllllllll merged commit f3e436a into master May 3, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@llllllllll llllllllll deleted the quandl-wiki-loader branch May 3, 2016

@coveralls

This comment has been minimized.

coveralls commented Jun 5, 2018

Coverage Status

Coverage remained the same at ?% when pulling 59c8e37 on quandl-wiki-loader into 21bc598 on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment