In [120]:
from pathlib import Path
import sys; sys.path.insert(0, str(Path('src').absolute()))
import os
cwd = os.getcwd()

import ast
import inspect

from IPython.display import Markdown as md

def flink(title: str, name: str):
    split = name.split('.', maxsplit=1)
    if len(split) == 1:
        modname = split[0]
        fname = None
    else:
        [modname, fname] = split
    module = globals()[modname]
    
    file = Path(module.__file__).relative_to(cwd)
    
    if fname is not None:
        func = module
        for p in fname.split('.'):
            func = getattr(func, p)
        _, number = inspect.getsourcelines(func)
        numbers = f'#L{number}'
    else:
        numbers = ''
    return f'[{title}]({file}{numbers})'
    
dmd = lambda x: display(md(x))

import cachew
import cachew.tests.test_cachew as tests

In [109]:
dmd(f'<!--THIS FILE IS AUTOGENERATED BY README.ipynb. Use generate-readme to update it.-->')

<!--THIS FILE IS AUTOGENERATED BY README.ipynb. Use generate-readme to update it.-->

[![CircleCI](https://circleci.com/gh/karlicoss/cachew.svg?style=svg)](https://circleci.com/gh/karlicoss/cachew)

# Cachew: quick NamedTuple/dataclass cache
TLDR: cachew can persistently cache any sequence (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator)) over [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple) or [dataclasses](https://docs.python.org/3/library/dataclasses.html) into an sqlite database on your disk.
Database schema is automatically inferred from type annotations ([PEP 526](https://www.python.org/dev/peps/pep-0526)).

It works in a similar manner to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache): caching your data is just a matter of decorating it.

The difference from `functools.lru_cache` is that data is preserved between program runs.

## Motivation

I often find myself processing big chunks of data, computing some aggregates on it or extracting only the bits I'm interested in. While I try to use the REPL as much as I can, some things are still fragile, and often you just have to rerun the whole thing during development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases.

The conventional way of dealing with this is to serialize the results along with some sort of hash (e.g. md5) of the input files,
compare the hash on the next run, and return the cached data if nothing changed.

Simple as it sounds, it is pretty tedious to do every time you need to memoize some data; it contaminates your code with routine and distracts you from your main task.
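
To make the tedium concrete, here is a minimal sketch of that hand-rolled approach using only the standard library. The names `file_md5` and `cached_parse` and the JSON cache layout are made up for illustration; they are not part of cachew.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List


def file_md5(path: Path) -> str:
    """Return the md5 hex digest of the file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def cached_parse(source: Path, cache: Path, parse: Callable[[Path], List[dict]]) -> List[dict]:
    """Hand-rolled cache: reuse stored results unless the input file changed."""
    digest = file_md5(source)
    if cache.exists():
        stored = json.loads(cache.read_text())
        if stored['hash'] == digest:
            return stored['data']  # input unchanged, skip the slow parse
    data = parse(source)  # input is new or changed, do the slow work
    cache.write_text(json.dumps({'hash': digest, 'data': data}))
    return data
```

Every function you want to cache needs its own variation of this boilerplate, which is the routine a decorator can hide.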


# Example
Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from a Wikipedia archive.
Parsing it (the `extract_links` function) takes hours; however, the archive is presumably not updated very frequently.


With this library you can cache the parsed results with a single `@cachew` decorator.

In [112]:
doc = inspect.getdoc(cachew.cachew)
doc = doc.split('Usage example:')[-1].lstrip()
dmd(f"""```python
{doc}
```""")

```python
>>> from typing import NamedTuple, Iterator
>>> class Link(NamedTuple):
...     url : str
...     text: str
...
>>> @cachew
... def extract_links(archive: str) -> Iterator[Link]:
...     for i in range(5):
...         import time; time.sleep(1) # simulate slow IO
...         yield Link(url=f'http://link{i}.org', text=f'text {i}')
...
>>> list(extract_links(archive='wikipedia_20190830.zip')) # that would take about 5 seconds on first run
[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]

>>> from timeit import Timer
>>> res = Timer(lambda: list(extract_links(archive='wikipedia_20190830.zip'))).timeit(number=1) # second run is cached, so should take less time
>>> print(f"took {int(res)} seconds to query cached items")
took 0 seconds to query cached items
```

In [108]:
[composite] = [x
 for x in ast.walk(ast.parse(inspect.getsource(cachew))) 
 if isinstance(x, ast.FunctionDef) and x.name == 'composite_hash'
]

link = f'{Path(cachew.__file__).relative_to(cwd)}#L{composite.lineno}'

dmd(f'''
# How it works
Basically, your data objects get {flink('flattened out', 'cachew.NTBinder.to_row')}
and python types are mapped {flink('onto sqlite types and back', 'cachew.NTBinder.iter_columns')}.

When the function is called, `cachew` [computes the hash]({link}) of your function's arguments
and compares it against the previously stored hash value.
    
If they match, it deserializes and yields whatever is stored in the cache database. If the hashes don't match, the original data provider is called and the new data is stored along with the new hash.
''')


# How it works
Basically, your data objects get [flattened out](src/cachew/__init__.py:272)
and python types are mapped [onto sqlite types and back](src/cachew/__init__.py:324).

When the function is called, `cachew` [computes the hash](src/cachew/__init__.py:544) of your function's arguments
and compares it against the previously stored hash value.
    
If they match, it deserializes and yields whatever is stored in the cache database. If the hashes don't match, the original data provider is called and the new data is stored along with the new hash.
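
The flow above can be sketched with the standard library alone. This is a simplified illustration, not cachew's actual implementation: the `Link` fields, the table names, the md5-of-`repr` hash, and the `cached_links` helper are all assumptions made for the example.

```python
import hashlib
import sqlite3
from typing import Callable, Iterable, Iterator, NamedTuple


class Link(NamedTuple):
    url: str
    title: str


def cached_links(db: sqlite3.Connection,
                 provider: Callable[[str], Iterable[Link]],
                 archive: str) -> Iterator[Link]:
    """Yield cached links if the argument hash matches, otherwise recompute and store them."""
    db.execute('CREATE TABLE IF NOT EXISTS hash (value TEXT)')
    db.execute('CREATE TABLE IF NOT EXISTS links (url TEXT, title TEXT)')
    new_hash = hashlib.md5(repr(archive).encode('utf8')).hexdigest()
    row = db.execute('SELECT value FROM hash').fetchone()
    if row is not None and row[0] == new_hash:
        # hash matches: turn the stored rows back into Link objects
        for url, title in db.execute('SELECT url, title FROM links ORDER BY rowid'):
            yield Link(url=url, title=title)
        return
    # hash mismatch: drop stale data, call the provider and store fresh rows
    db.execute('DELETE FROM hash')
    db.execute('DELETE FROM links')
    for link in provider(archive):
        db.execute('INSERT INTO links VALUES (?, ?)', tuple(link))  # a NamedTuple flattens to a row
        yield link
    # the hash is written last, so a partially consumed run is not treated as cached
    db.execute('INSERT INTO hash VALUES (?)', (new_hash,))
    db.commit()
```

Calling `list(cached_links(db, provider, 'x.zip'))` twice with the same connection should invoke `provider` only once; changing the argument invalidates the stored rows.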


In [124]:
dmd('# Features')
types = [f'`{c.__name__}`' for c in cachew.PRIMITIVES.keys()]
dmd(f"""
* supports primitive types: {', '.join(types)}
* supports {flink('Optional', 'tests.test_optional')}
* supports {flink('nested datatypes', 'tests.test_nested')}
* supports return type inference: {flink('1', 'tests.test_return_type_inference')}, {flink('2', 'tests.test_return_type_mismatch')}
* detects {flink('datatype schema changes', 'tests.test_schema_change')} and discards old data automatically            
""")
# * custom hash function TODO example with mtime?

# Features


* supports primitive types: `str`, `int`, `float`, `bool`, `datetime`, `date`
* supports [Optional](src/cachew/tests/test_cachew.py:325)
* supports [nested datatypes](src/cachew/tests/test_cachew.py:241)
* supports return type inference: [1](src/cachew/tests/test_cachew.py:185), [2](src/cachew/tests/test_cachew.py:199)
* detects [datatype schema changes](src/cachew/tests/test_cachew.py:271) and discards old data automatically            
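
As a rough illustration of how a schema can be inferred from type annotations, including `Optional` and nested types, here is a standalone sketch. The `SQLITE_TYPES` mapping, the `columns` helper, and the column naming scheme are assumptions for the example; cachew's real `PRIMITIVES` table and binder may differ.

```python
from datetime import date, datetime
from typing import NamedTuple, Optional, Union, get_args, get_origin, get_type_hints

# Assumed mapping for illustration only.
SQLITE_TYPES = {str: 'TEXT', int: 'INTEGER', float: 'REAL', bool: 'INTEGER',
                datetime: 'TEXT', date: 'TEXT'}


def columns(cls, prefix=''):
    """Return (name, sqlite type, nullable) triples for a NamedTuple class, flattening nested ones."""
    result = []
    for name, tp in get_type_hints(cls).items():
        nullable = False
        if get_origin(tp) is Union and type(None) in get_args(tp):
            # Optional[X] is Union[X, None]: unwrap it and mark the column nullable
            [tp] = [a for a in get_args(tp) if a is not type(None)]
            nullable = True
        if tp in SQLITE_TYPES:
            result.append((prefix + name, SQLITE_TYPES[tp], nullable))
        elif hasattr(tp, '_fields'):
            # nested NamedTuple: flatten its fields, prefixing them with the parent field name
            result.extend(columns(tp, prefix=f'{prefix}{name}_'))
        else:
            raise TypeError(f'unsupported type for {name}: {tp}')
    return result


class Point(NamedTuple):
    x: float
    y: float


class Place(NamedTuple):
    name: str
    visited: Optional[datetime]
    at: Point
```

`columns(Place)` yields one entry per flattened field, with the nested `Point` contributing `at_x` and `at_y`. A change in this derived column list is one simple way a schema change could be detected.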


In [134]:
dmd(f"""
# Using
See {flink('docstring', 'cachew.cachew')} for up-to-date documentation on parameters and return types. 
You can also use {flink('extensive unit tests', 'tests')} as a reference.
    
Some highlights:
    
* `cache_path` can be a filename, or you can specify a callable {flink('returning a path', 'tests.test_callable_cache_path')} that depends on the function's arguments.
  
  It's not required to specify the path (by default the cache is created in `/tmp`), but it is recommended.
    
* `hashf` by default just hashes all the arguments; you can also specify a custom callable.
    
   For instance, it can be used to {flink('discard the cache', 'tests.test_custom_hash')} if the input file was modified.
    
* `cls` is deduced from return type annotations by default, but can be specified if you don't control the code you want to cache.    
""")


# Using
See [docstring](src/cachew/__init__.py:462) for up-to-date documentation on parameters and return types. 
You can also use [extensive unit tests](src/cachew/tests/test_cachew.py) as a reference.
    
Some highlights:
    
* `cache_path` can be a filename, or you can specify a callable [returning a path](src/cachew/tests/test_cachew.py:221) that depends on the function's arguments.
  
  It's not required to specify the path (by default the cache is created in `/tmp`), but it is recommended.
    
* `hashf` by default just hashes all the arguments; you can also specify a custom callable.
    
   For instance, it can be used to [discard the cache](src/cachew/tests/test_cachew.py:51) if the input file was modified.
    
* `cls` is deduced from return type annotations by default, but can be specified if you don't control the code you want to cache.    
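
As a sketch of what a custom hash based on file modification could look like, the hypothetical helper below derives a string from the path and its modification time. The name `mtime_hash` is made up, and the exact signature cachew expects for `hashf` is not shown here, so check the docstring before wiring something like this in.

```python
from pathlib import Path


def mtime_hash(path) -> str:
    """Hypothetical custom hash: the value changes whenever the file's modification time changes."""
    p = Path(path)
    # combine the path with the mtime so different files never share a hash
    return f'{p}.{p.stat().st_mtime_ns}'
```

Compared to hashing the file contents, this avoids reading a large archive on every call, at the cost of invalidating the cache when the file is merely touched.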


# Installing
Package is available on [pypi](https://pypi.org/project/cachew/).

    pip install cachew
    
## Developing
I'm using [tox](tox.ini) to run tests, and [circleci](.circleci/config.yml) for CI.

# Implementation

* why tuples and dataclasses?
  
  Tuples are natural in Python for quickly grouping together return results.
  `NamedTuple` and `dataclass` specifically provide a very straightforward and self-documenting way to represent a bit of data in Python.
  Their very compact syntax makes them extremely convenient even for one-off means of communicating between a couple of functions.
   
  If you want to find out more about why you should use more dataclasses in your code, I suggest these links:
  [What are data classes?](https://stackoverflow.com/questions/47955263/what-are-data-classes-and-how-are-they-different-from-common-classes), [basic data classes](https://realpython.com/python-data-classes/#basic-data-classes).
   
    
* why not [pickle](https://docs.python.org/3/library/pickle.html)?

  Pickling is a bit heavyweight for a plain data class. There are many reports of pickle being slower than even JSON, and it's also a security risk. Lastly, it can only be loaded via Python.

* why `sqlite` database for storage?

  It's pretty efficient, and a sequence of namedtuples maps onto database rows in a very straightforward manner.

* why not `pandas.DataFrame`?

  DataFrames are great and can be serialised to csv or pickled.
  They are good to have as one of the ways you can interface with your data, but they are hardly convenient to reason about abstractly due to their dynamic nature.
  They also can't be nested.
  
* why not [ORM](https://en.wikipedia.org/wiki/Object-relational_mapping)?
  
  ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. It's also somewhat overkill for such a specific purpose.

  * E.g. [SQLAlchemy](https://docs.sqlalchemy.org/en/13/orm/tutorial.html#declare-a-mapping) requires you to use custom sqlalchemy-specific types and inherit from a base class.
    Also, it doesn't support nested types.

* why not [marshmallow](https://marshmallow.readthedocs.io/en/3.0/nesting.html)?
  
  Marshmallow is a common way to map data into a db-friendly format, but it requires an explicit schema, which is overhead when you already have it in the form of type annotations. I've looked at existing projects that utilise type annotations, but didn't find them covering all I wanted:
  
  * https://marshmallow-annotations.readthedocs.io/en/latest/ext/namedtuple.html#namedtuple-type-api
  * https://pypi.org/project/marshmallow-dataclass