<style>
pre {
 white-space: pre-wrap !important;
}
.table-striped > tbody > tr:nth-of-type(odd) {
    background-color: #f9f9f9;
}
.table-striped > tbody > tr:nth-of-type(even) {
    background-color: white;
}
.table-striped td, .table-striped th, .table-striped tr {
    border: 1px solid black;
    border-collapse: collapse;
    margin: 1em 2em;
}
.rendered_html td, .rendered_html th {
    text-align: left;
    vertical-align: middle;
    padding: 4px;
}
</style>

# I/O Kung-Fu: get your data in and out of [Vaex](https://github.com/vaexio/vaex)

## Data input

Every project starts with reading in some data. Vaex supports several data sources:
 - Binary file formats:
     - [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format#HDF5)
     - [Apache Arrow](https://arrow.apache.org/)
     - [Apache Parquet](https://parquet.apache.org/)
 - Text-based file formats:
     - [CSV](https://en.wikipedia.org/wiki/Comma-separated_values)
     - [ASCII](https://en.wikipedia.org/wiki/Text_file)
     - [JSON](https://www.json.org/json-en.html)
 - In-memory data representations:
     - [pandas](https://pandas.pydata.org/) DataFrames and everything that pandas can read
     - [Apache Arrow](https://arrow.apache.org/) Tables
     - [numpy](https://numpy.org/) arrays
     - Python dictionaries
     
The following examples show best practices for getting your data into Vaex.

### Binary file formats

If your data is already in one of the supported binary file formats (HDF5, Apache Arrow, Apache Parquet), opening it with Vaex is rather simple:

```
import vaex 

df_1 = vaex.open('./my_data/my_file_1.hdf5')
df_2 = vaex.open('./my_data/my_file_1.arrow')
df_3 = vaex.open('./my_data/my_file_1.parquet')
```

Opening such data is instantaneous regardless of the file size on disk: Vaex will just memory-map the data instead of reading it into memory. This is the optimal way of working with datasets that are larger than the available RAM.
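To illustrate the idea of memory-mapping in isolation, here is a minimal sketch that uses Python's standard `mmap` module rather than Vaex itself. The file name `column.bin` and the raw float64 layout are assumptions made for this illustration only; they are not how Vaex stores data.

```
import mmap
import os
import tempfile
from array import array

# Write one million float64 values to a temporary binary file (illustrative data).
n_values = 1_000_000
path = os.path.join(tempfile.mkdtemp(), 'column.bin')
with open(path, 'wb') as f:
    array('d', range(n_values)).tofile(f)

# Map the file into the process address space; the file is not read in bulk here.
with open(path, 'rb') as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# View the mapped bytes as float64 values without copying them.
column = memoryview(mapped).cast('d')

# The operating system loads pages on demand as elements are accessed.
first_value = column[0]
last_value = column[n_values - 1]
print(first_value, last_value)
```

Creating the mapping takes roughly the same time whether the file holds a thousand values or a billion, which is why opening a memory-mappable file does not depend on its size.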

If your data is contained within multiple files, one can open them all simultaneously like this:

```
df = vaex.open('./my_data/my_file*.hdf5')
# alternatively
df = vaex.open_many(['./my_data/my_file_1.hdf5', './my_data/my_file_2.hdf5', './my_data/my_file_3.hdf5'])
```
The result will be a single DataFrame object containing all of the data coming from all files.

The datasets one is trying to open do not necessarily have to be local. With Vaex you can open an HDF5 file straight from Amazon's S3:

```
df = vaex.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')
```

In this case the data will be lazily downloaded and cached on your local machine. "Lazily downloaded" means that Vaex will only download the portions of the data you actually need. For example, imagine a file hosted on S3 that has 100 columns and 1 billion rows. Getting a preview of the DataFrame via `print(df)` will download only the first and last 5 rows of that data. If we then proceed to make calculations or plots with only 5 columns, only the data from those columns will be downloaded and cached on our local machine. This approach saves both bandwidth and local storage.


### Text-based file formats

Datasets are still commonly stored in text-based file formats such as CSV. Since text-based file formats are not memory-mappable, they have to be read into memory. If the contents of a CSV file fit into the available RAM, one can simply do:

```
df = vaex.from_csv('./my_data/my_file.csv')
# or alternatively 
df = vaex.read_csv('./my_data/my_file.csv')  # `vaex.read_csv` is an alias to `vaex.from_csv`
```

Vaex uses pandas for reading CSV files in the background, so one can pass any arguments to `vaex.from_csv` or `vaex.read_csv` as one would pass to `pandas.read_csv`, for example to specify separators, column names and column types. In addition, if you pass the `convert=True` argument to `vaex.from_csv`, the data will be automatically converted to an HDF5 file behind the scenes, freeing RAM and allowing you to work with your data in a memory-efficient, out-of-core manner.
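As a small sketch of the kind of arguments meant here, the snippet below calls `pandas.read_csv` directly on an in-memory buffer. The column names and the semicolon-separated data are made up for illustration; per the paragraph above, the same keyword arguments would be passed to `vaex.from_csv` together with a file path.

```
import io
import pandas as pd

# Hypothetical semicolon-separated data without a header row.
raw = io.StringIO("1;dog;0.5\n2;cat;1.5\n3;mouse;2.5\n")

pandas_df = pd.read_csv(
    raw,
    sep=';',                                     # custom separator
    names=['id', 'animal', 'weight'],            # column names, since the data has no header
    dtype={'id': 'int32', 'weight': 'float32'},  # explicit column types
)
print(pandas_df.dtypes)
```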

If the CSV file is so large that it cannot fit into RAM all at once, one can convert the data to HDF5 simply by:

```
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
```

When the above line is executed, Vaex will read the CSV in chunks, and convert each chunk to a temporary HDF5 file on disk. All temporary files are then concatenated into a single HDF5 file, and the temporary files are deleted. The size of the individual chunks to be read can be specified via the `chunk_size` argument.
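The chunk, write, concatenate and clean-up steps described above can be sketched with the standard library alone. This is not Vaex's implementation: plain binary files of float64 values stand in for HDF5, and the single-column CSV, the file names and the tiny `chunk_size` are assumptions chosen to keep the example short.

```
import csv
import itertools
import os
import shutil
import tempfile
from array import array

work_dir = tempfile.mkdtemp()
csv_path = os.path.join(work_dir, 'my_big_file.csv')

# Create a small stand-in for a large CSV file with a single numeric column.
with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['x'])
    writer.writerows([i] for i in range(10))

chunk_size = 4  # rows per chunk; plays the same role as Vaex's `chunk_size`
chunk_paths = []

# Step 1: read the CSV chunk by chunk and write each chunk to a temporary file.
with open(csv_path, newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    while True:
        rows = list(itertools.islice(reader, chunk_size))
        if not rows:
            break
        chunk_path = os.path.join(work_dir, f'chunk_{len(chunk_paths)}.bin')
        with open(chunk_path, 'wb') as out:
            array('d', (float(row[0]) for row in rows)).tofile(out)
        chunk_paths.append(chunk_path)

# Step 2: concatenate the temporary files into one output file, deleting each afterwards.
final_path = os.path.join(work_dir, 'my_big_file.bin')
with open(final_path, 'wb') as out:
    for chunk_path in chunk_paths:
        with open(chunk_path, 'rb') as chunk_file:
            shutil.copyfileobj(chunk_file, out)
        os.remove(chunk_path)

n_chunks = len(chunk_paths)
print(n_chunks, os.path.getsize(final_path))
```

At no point does the sketch hold more than `chunk_size` rows in memory, which is the property that makes this strategy suitable for files larger than RAM.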

It is also common for data to be stored in JSON files. To read such data in Vaex one can do:

```
df = vaex.from_json('./my_data/my_file.json')
```

This is a convenience method that simply wraps `pandas.read_json`, so the same arguments and file reading strategy apply.
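For instance, newline-delimited JSON is read in pandas with `lines=True`. The sketch below uses `pandas.read_json` on a made-up in-memory buffer; given the wrapping described above, the same argument should carry over to `vaex.from_json` with a file path.

```
import io
import pandas as pd

# Hypothetical newline-delimited JSON: one record per line.
raw = io.StringIO(
    '{"a": 1, "b": "dog"}\n'
    '{"a": 2, "b": "cat"}\n'
    '{"a": 3, "b": "mouse"}\n'
)

# `lines=True` tells pandas to parse each line as a separate JSON object.
pandas_df = pd.read_json(raw, lines=True)
print(pandas_df)
```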

### In-memory data representations

One can construct a Vaex DataFrame from a variety of in-memory data representations. One of the most useful such operations is converting a pandas DataFrame into a Vaex DataFrame:

```
df = vaex.from_pandas(pandas_df, copy_index=True)
```
The `copy_index` argument specifies whether the index column of a pandas DataFrame should be imported into the Vaex DataFrame. Converting a pandas into a Vaex DataFrame is particularly useful since pandas can read data from a large variety of file formats. For instance, if we want to work with a [SAS](https://www.sas.com/en_us/home.html) file, we can read it into Vaex via pandas like so:

```
import vaex
import pandas as pd

pandas_df = pd.read_sas('./my_data/my_file.xport')
df = vaex.from_pandas(pandas_df)
```

One can read in an [arrow table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html) as a DataFrame in a similar manner:

```
df = vaex.from_arrow_table(pa_table)
```

Constructing a Vaex DataFrame from numpy arrays can be done like this:

```
import vaex
import numpy as np

x = np.array([1, 2, 3])
y = np.array(['dog', 'cat', 'mouse'])

df = vaex.from_arrays(x=x, y=y)
```

Constructing a DataFrame from a Python dict is also straightforward:

```
d = {'a': [1, 2, 3], 'b': ['dog', 'cat', 'mouse']}
df = vaex.from_dict(d)
```

## Data export



```
df = vaex.example()
df.to_arrays()
```

[array([ 0, 23, 32, ..., 14, 18,  4], dtype=uint8),
 array([ 1.2318684 , -0.16370061, -2.120256  , ...,  0.36885077,
        -0.11259264, 20.79622   ], dtype=float32),
 array([-0.39692867,  3.6542213 ,  3.3260527 , ..., 13.029609  ,
         1.4529126 , -3.3313878 ], dtype=float32),
 array([-0.59805775, -0.25490645,  1.7078403 , ..., -3.6339347 ,
         2.1689527 , 12.188416  ], dtype=float32),
 array([ 301.15527 , -195.00023 ,  -48.63423 , ...,  -53.677147,
         179.30865 ,   42.690002], dtype=float32),
 array([ 174.05948 ,  170.47217 ,  171.6473  , ..., -145.15771 ,
         205.7971  ,   69.204796], dtype=float32),
 array([ 27.427546 , 142.53023  ,  -2.0794373, ...,  76.7091   ,
        -68.75873  ,  29.542751 ], dtype=float32),
 array([-149431.4 , -124247.95, -138500.55, ...,  -84912.26, -133498.47,
         -65519.33], dtype=float32),
 array([ 407.38898,  890.24115,  372.2411 , ...,  817.1376 ,  724.00024,
        1843.0747 ], dtype=float32),
 array([ 333.95554,  684.6676 , 