Streaming API for pandas applied to big datasets


README


pandas_streaming processes big files with pandas when they are too big to hold in memory but too small for parallelization to bring a significant gain. The module replicates a subset of the pandas API and implements other functionalities for machine learning.

from pandas_streaming.df import StreamingDataFrame
sdf = StreamingDataFrame.read_csv("filename", sep="\t", encoding="utf-8")

for df in sdf:
    # process this chunk of data
    # df is a dataframe
    print(df)
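
To illustrate why iterating over chunks helps, here is a minimal sketch that computes a column mean without ever holding the whole table in memory. It relies on plain pandas (read_csv with chunksize) and not on pandas_streaming itself, so it shows the idea rather than how the module is implemented. The data, the column names and the chunk size are assumptions made up for the example.

```python
import io
import pandas

# Hypothetical tab-separated data standing in for a large file on disk.
data = io.StringIO("x\ty\n1\t10\n2\t20\n3\t30\n4\t40\n5\t50\n")

total = 0.0
count = 0
# With chunksize set, read_csv returns an iterator of dataframes
# instead of loading everything at once.
for chunk in pandas.read_csv(data, sep="\t", chunksize=2):
    total += chunk["y"].sum()
    count += len(chunk)

mean_y = total / count
print(mean_y)
```

Only running totals are kept between iterations, so memory use depends on the chunk size and not on the file size.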

The module can also stream an existing dataframe.

import pandas
from pandas_streaming.df import StreamingDataFrame

df = pandas.DataFrame([dict(cf=0, cint=0, cstr="0"),
                       dict(cf=1, cint=1, cstr="1"),
                       dict(cf=3, cint=3, cstr="3")])

sdf = StreamingDataFrame.read_df(df)

for chunk in sdf:
    # process this chunk of data
    # chunk is a dataframe; the name avoids shadowing df above
    print(chunk)
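
Conceptually, streaming an existing dataframe amounts to walking over it in consecutive row slices. The sketch below shows that idea with plain pandas. The helper iter_chunks and the chunk size are hypothetical names chosen for illustration; they are not part of pandas_streaming and do not describe how read_df works internally.

```python
import pandas


def iter_chunks(df, chunksize):
    """Yield consecutive row slices of df (hypothetical helper)."""
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize]


df = pandas.DataFrame([dict(cf=0, cint=0, cstr="0"),
                       dict(cf=1, cint=1, cstr="1"),
                       dict(cf=3, cint=3, cstr="3")])

# Three rows split into chunks of at most two rows each.
sizes = [len(chunk) for chunk in iter_chunks(df, chunksize=2)]
print(sizes)
```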

Links: