Automate data flows for machine learning

OpenFlow

OpenFlow is a Python library that lets you manage the data flows into your application. It uses pandas.DataFrame as its primary data structure.

You can find a basic introduction to OpenFlow in the first part of this blog post.

Usage

pip install openflow

Example

To use OpenFlow, you need to define a fetch() function. This function fetches the data from the source of your choice (a CSV file, a database, and so on). In this example, the source is a CSV file containing a list of movies.

The fetch() function has to return a pandas.DataFrame instance.

from datetime import date

import pandas as pd
from openflow import DataSource

url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv'
fetch = lambda _: pd.read_csv(url)

movies_datasource = DataSource(fetch)
print(movies_datasource.get_dataframe())
#       year       imdb     ...     period code decade code
# 0     2013  tt1711425     ...             1.0         1.0
# 1     2012  tt1343727     ...             1.0         1.0
# ...    ...        ...     ...             ...         ...
# 1792  1971  tt0067992     ...             NaN         NaN
# 1793  1970  tt0065466     ...             NaN         NaN

movies_datasource.add_output('percentage_of_max_budget', lambda df: df['budget'] / df['budget'].max())
movies_datasource.add_output('age', lambda df: date.today().year - df['year'])

# you can reuse previously defined outputs
movies_datasource.add_output('cat_age', lambda df: (df['age'] / 7).astype(int))

# we force the computation because `get_dataframe()` was already called once before
print(movies_datasource.get_dataframe(force_computation=True))

#       year       imdb     ...     percentage_of_max_budget  age  cat_age
# 0     2013  tt1711425     ...                     0.030588    5        0
# 1     2012  tt1343727     ...                     0.105882    6        0
# ...    ...        ...     ...                          ...  ...      ...
# 1792  1971  tt0067992     ...                     0.007059   47        6
# 1793  1970  tt0065466     ...                     0.002353   48        6

Three new outputs have been added to the original DataSource.
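
For readers who want to see what these three outputs compute, here is a rough plain-pandas equivalent. It is only a sketch and not OpenFlow's implementation: the toy DataFrame, the `with_outputs` helper and the fixed `REFERENCE_YEAR` are made up for illustration, and `DataFrame.assign` stands in for `add_output`.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the movies CSV.
toy = pd.DataFrame({"year": [2013, 1971], "budget": [50, 100]})

# Fixed on purpose so the result does not depend on today's date
# (the README example uses date.today().year instead).
REFERENCE_YEAR = 2018


def with_outputs(df):
    """Return a copy of df with three derived columns, computed in order."""
    return df.assign(
        percentage_of_max_budget=lambda d: d["budget"] / d["budget"].max(),
        age=lambda d: REFERENCE_YEAR - d["year"],
        # A later column can read one defined just above it.
        cat_age=lambda d: (d["age"] / 7).astype(int),
    )


result = with_outputs(toy)
print(result)
```

The original DataFrame is left untouched; the derived columns only exist on the returned copy.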

More complex examples of fetch() functions can be found here. They show how to use Postgres and Mongo as a DataSource. Feel free to write your own.
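
As a starting point for writing your own, here is a minimal sketch of a database-backed fetch() function. It uses an in-memory SQLite database from the standard library as a stand-in for Postgres. The `fetch_movies` name, the table layout and the rows are hypothetical. The single ignored parameter mirrors the `lambda _:` used in the example above; this README does not say what OpenFlow passes there.

```python
import sqlite3

import pandas as pd


def fetch_movies(_):
    """Hypothetical fetch(): load rows from a database into a DataFrame."""
    # An in-memory database keeps the sketch self-contained; a real
    # fetch() would connect to an existing database instead.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE movies (title TEXT, year INTEGER, budget INTEGER)"
        )
        # Made-up rows, for illustration only.
        conn.executemany(
            "INSERT INTO movies VALUES (?, ?, ?)",
            [("Movie A", 2013, 100), ("Movie B", 1971, 50)],
        )
        # read_sql_query returns a pandas.DataFrame, which is what
        # fetch() is required to return.
        return pd.read_sql_query(
            "SELECT title, year, budget FROM movies ORDER BY year", conn
        )
    finally:
        conn.close()


df = fetch_movies(None)
print(df)
```

A function like this could then be passed to DataSource(fetch_movies), in the same way as the lambda in the example above.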