Out-of-Core DataFrames for Python, visualize and explore big tabular data at a billion rows per second.


Join the chat at https://gitter.im/maartenbreddels/vaex

Vaex uses several sites:

Vaex is open source software. If you need support, contact us at https://vaex.io

What is Vaex?

Vaex is a Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count and standard deviation on an N-dimensional grid for more than a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy and lazy computations for best performance (no memory wasted).

Why vaex

  • Performance: Works with huge tabular data, processes more than a billion rows/second.
  • Lazy / Virtual columns: compute on the fly, without wasting RAM.
  • Memory efficient: no memory copies when doing filtering/selections/subsets.
  • Visualization: directly supported, a one-liner is often enough.
  • User friendly API: You will only need to deal with a Dataset object, and tab completion + docstring will help you out: ds.mean<tab>, feels very similar to Pandas.
  • Lean: separated into multiple packages
    • vaex-core: Dataset and core algorithms, takes numpy arrays as input columns.
    • vaex-hdf5: Provides memory mapped numpy arrays to a Dataset.
    • vaex-arrow: Arrow support for cross language data sharing.
    • vaex-viz: Visualization based on matplotlib.
    • vaex-jupyter: Interactive visualization based on Jupyter widgets / ipywidgets, bqplot, ipyvolume and ipyleaflet.
    • vaex-astro: Astronomy related transformations and FITS file support.
    • vaex-server: Provides a server to access a dataset remotely.
    • vaex-distributed: (Proof of concept) combines multiple servers / a cluster into a single dataset for distributed computations.
    • vaex-qt: Program written using Qt GUI.
    • vaex: meta package that installs all of the above.
    • vaex-ml: Machine learning with automatic pipelines.
  • Jupyter integration: vaex-jupyter will give you interactive visualization and selection in the Jupyter notebook and Jupyter lab.

Installation

Using conda:

  • conda install -c conda-forge vaex

Using pip:

  • pip install vaex

Or read the detailed instructions

Getting started

We assume you have installed vaex and are running a Jupyter notebook server. We start by importing vaex and asking it for an example dataset.

import vaex
ds = vaex.example()  # open the example dataset provided with vaex

Instead, you can download some larger datasets, or read in your own CSV file.

ds  # will pretty print a table
#       x             y             z             vx           vy           vz           E                L                   Lz                   FeH
0       -0.777470767  2.10626292    1.93743467    53.276722    288.386047   -95.2649078  -121238.171875   831.0799560546875   -336.426513671875    -2.309227609164518
1       3.77427316    2.23387194    3.76209331    252.810791   -69.9498444  -56.3121033  -100819.9140625  1435.1839599609375  -828.7567749023438   -1.788735491591229
2       1.3757627     -6.3283844    2.63250017    96.276474    226.440201   -34.7527161  -100559.9609375  1039.2989501953125  920.802490234375     -0.7618109022478798
3       -7.06737804   1.31737781    -6.10543537   204.968842   -205.679016  -58.9777031  -70174.8515625   2441.724853515625   1183.5899658203125   -1.5208778422936413
4       0.243441463   -0.822781682  -0.206593871  -311.742371  -238.41217   186.824127   -144138.75       374.8164367675781   -314.5353088378906   -2.655341358427361
...     ...           ...           ...           ...          ...          ...          ...              ...                 ...                  ...
329995  3.76883793    4.66251659    -4.42904139   107.432999   -2.13771296  17.5130272   -119687.3203125  746.8833618164062   -508.96484375        -1.6499842518381402
329996  9.17409325    -8.87091351   -8.61707687   32.0         108.089264   179.060638   -68933.8046875   2395.633056640625   1275.490234375       -1.4336036247720836
329997  -1.14041007   -8.4957695    2.25749826    8.46711349   -38.2765236  -127.541473  -112580.359375   1182.436279296875   115.58557891845703   -1.9306227597361942
329998  -14.2985935   -5.51750422   -8.65472317   110.221558   -31.3925591  86.2726822   -74862.90625     1324.5926513671875  1057.017333984375    -1.225019818838568
329999  10.5450506    -8.86106777   -4.65835428   -2.10541415  -27.6108856  3.80799961   -95361.765625    351.0955505371094   -309.81439208984375  -2.5689636894079477

Using square brackets [], we can easily filter or get different views on the dataset.

ds_negative = ds[ds.x < 0]  # easily filter your dataset, without making a copy
ds_negative[:5][['x', 'y']]  # take the first five rows, and only the 'x' and 'y' column (no memory copy!)
#  x          y
0  -0.777471  2.10626
1  -7.06738   1.31738
2  -5.17174   7.82915
3  -15.9539   5.77126
4  -12.3995   13.9182

When dealing with huge datasets, say a billion rows (10^9), computations with the data can waste memory, up to 8 GB for a new column. Instead, vaex uses lazy computation: only a representation of the computation is stored, and the computation is done on the fly when needed. Even so, you can use many of the numpy functions as if it were a normal array.
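
As a quick check of the 8 GB figure, the arithmetic can be written out as below. This is only a back-of-the-envelope sketch: it assumes each value is stored as a 64-bit float (8 bytes), and the variable names are made up for illustration.

```python
# Back-of-the-envelope memory cost of materializing one new column.
# Assumption: every value is a 64-bit float, i.e. 8 bytes.
n_rows = 10**9
bytes_per_value = 8

column_bytes = n_rows * bytes_per_value
column_gb = column_bytes / 10**9  # decimal gigabytes

print(f"{column_gb:.0f} GB per materialized column")  # prints "8 GB per materialized column"
```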

import numpy as np
# creates an expression (nothing is computed)
r = np.sqrt(ds.x**2 + ds.y**2 + ds.z**2)
r  # for convenience, we print out some values
<vaex.expression.Expression(expressions='sqrt((((x ** 2) + (y ** 2)) + (z ** 2)))')> instance at 0x11bcc4780 values=[2.9655450396553587, 5.77829281049018, 6.99079603950256, 9.431842752707537, 0.8825613121347967 ... (total 330000 values) ... 7.453831761514681, 15.398412491068198, 8.864250273925633, 17.601047186042507, 14.540181524970293]

These expressions can be added to the dataset, creating what we call a virtual column. These virtual columns are similar to normal columns, except they do not waste memory.

ds['r'] = r  # add a (virtual) column that will be computed on the fly
ds.mean(ds.x), ds.mean(ds.r)  # calculate statistics on normal and virtual columns
(-0.06713149126400597, 9.407082338299773)
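
To give an intuition for why a virtual column costs almost no memory, here is a toy sketch in plain Python. It is not how vaex is implemented; the LazyColumn class, its methods and the chunk size are hypothetical names chosen for this illustration. The idea is that only the recipe is stored, and values are produced chunk by chunk while a statistic is accumulated.

```python
import math


class LazyColumn:
    """Hypothetical stand-in for a virtual column: stores a recipe, not values."""

    def __init__(self, func, *sources):
        self.func = func        # how to compute one value from one row
        self.sources = sources  # the stored columns the recipe reads from

    def chunks(self, chunk_size=2):
        # Evaluate chunk by chunk; only one chunk of results exists at a time.
        n = len(self.sources[0])
        for start in range(0, n, chunk_size):
            rows = zip(*(s[start:start + chunk_size] for s in self.sources))
            yield [self.func(*row) for row in rows]

    def mean(self, chunk_size=2):
        # Accumulate a running sum and count instead of building the full column.
        total, count = 0.0, 0
        for chunk in self.chunks(chunk_size):
            total += sum(chunk)
            count += len(chunk)
        return total / count


x = [3.0, 0.0, 6.0, 0.0]
y = [4.0, 5.0, 8.0, 0.0]
r = LazyColumn(lambda a, b: math.sqrt(a**2 + b**2), x, y)  # nothing is computed yet
print(r.mean())  # prints 5.0
```

The full list of r values is never held in memory at once; the mean only needs the running total and count.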

One of the core features of vaex is its ability to calculate statistics on a regular (N-dimensional) grid. The dimensions of the grid are specified by the binby argument (analogous to SQL's GROUP BY), together with the shape and limits arguments.

ds.mean(ds.r, binby=ds.x, shape=32, limits=[-10, 10]) # create statistics on a regular grid (1d)
array([15.01058183, 14.43693006, 13.72923338, 12.90294499, 11.86615103,
       11.03563695, 10.12162553,  9.2969267 ,  8.58250973,  7.86602644,
        7.19568442,  6.55738773,  6.01942499,  5.51462457,  5.15798991,
        4.8274218 ,  4.7346551 ,  5.1343761 ,  5.46017944,  6.02199777,
        6.54132124,  7.27025256,  7.99780777,  8.55188217,  9.30286584,
        9.97067561, 10.81633293, 11.60615795, 12.33813552, 13.10488982,
       13.86868565, 14.60577266])
ds.mean(ds.r, binby=[ds.x, ds.y], shape=32, limits=[-10, 10]) # or 2d
ds.count(ds.r, binby=[ds.x, ds.y], shape=32, limits=[-10, 10]) # or 2d counts/histogram
array([[22., 33., 37., ..., 58., 38., 45.],
       [37., 36., 47., ..., 52., 36., 53.],
       [34., 42., 47., ..., 59., 44., 56.],
       ...,
       [73., 73., 84., ..., 41., 40., 37.],
       [53., 58., 63., ..., 34., 35., 28.],
       [51., 32., 46., ..., 47., 33., 36.]])
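
To make the meaning of binby, shape and limits concrete, the following plain-Python sketch computes a mean on a regular 1d grid. It only illustrates the idea and is not vaex's implementation; the binned_mean function is hypothetical, and its handling of rows outside the limits (ignored) and of empty bins (NaN) are choices made for this sketch.

```python
def binned_mean(x, values, shape, limits):
    """Mean of `values` per bin of `x` on a regular 1d grid (illustrative sketch)."""
    lo, hi = limits
    width = (hi - lo) / shape
    sums = [0.0] * shape
    counts = [0] * shape
    for xi, vi in zip(x, values):
        if lo <= xi < hi:  # rows outside the limits are ignored in this sketch
            i = min(int((xi - lo) / width), shape - 1)  # index of the bin holding xi
            sums[i] += vi
            counts[i] += 1
    # An empty bin has no mean; this sketch reports NaN for it.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


x = [-9.0, -8.5, 0.5, 1.0, 9.5, 42.0]
r = [10.0, 12.0, 1.0, 3.0, 7.0, 99.0]
result = binned_mean(x, r, shape=4, limits=[-10, 10])
print(result)  # prints [11.0, nan, 2.0, 7.0]
```

With shape=4 and limits=[-10, 10] each bin is 5 units wide, so the first two rows fall into the first bin, the next two into the third, and the row with x = 42.0 is dropped.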

These one- and two-dimensional grids can be visualized using any plotting library, such as matplotlib, but the setup can be tedious. For convenience, we can use plot1d or plot, or see the list of plotting commands.

Continue

Continue the tutorial or check the examples

If you like vaex, please let us know by giving us a star on GitHub.

Regards,

The vaex.io team
