# Apache Arrow

Apache Arrow was created to define a set of standards for data representation and interchange between languages and systems. The goals are to avoid the cost of serializing and deserializing data, and to avoid reinventing the wheel in each of those systems and languages.

### The initial problem:

Each system or language has its own format definitions, its own implementations of common algorithms, and so on. In heterogeneous environments we often have to move data from one system or language to another to accommodate our workflows. That means copying and converting the data between them, which is quite costly.

![image info](./diagrams/without_arrow.png)

With the Arrow Columnar format Specification:

![image info](./diagrams/with_arrow.png)

Beyond that initial vision, Arrow has grown into a multi-language collection of libraries for solving systems problems related to in-memory analytical data processing. This includes topics such as:

- Zero-copy shared memory and RPC-based data movement
- Reading and writing file formats (like CSV, Apache ORC, and Apache Parquet)
- In-memory analytics and query processing
 

## Language implementations and PyArrow

The Apache Arrow Columnar format has been implemented in a multitude of languages:

- C++
- Python
- C#
- Java
- Go
- R
- Ruby
- Rust
- MATLAB
- JavaScript
- Julia

The implementation for Python is called PyArrow and can be found on PyPI [here](https://pypi.org/project/pyarrow/).

PyArrow provides a Python API for functionality provided by the Arrow C++ libraries, along with tools for Arrow integration and interoperability with pandas, NumPy, and other software in the Python ecosystem.

It is written in Python, Cython and C++.

## NanoArrow

The Arrow libraries keep growing in functionality. NanoArrow was created for situations where linking to a full Arrow implementation is difficult or impossible.

The [NanoArrow library](https://github.com/apache/arrow-nanoarrow) is a set of helper functions to interpret and generate [Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html) and [Arrow C Stream Interface](https://arrow.apache.org/docs/format/CStreamInterface.html) structures. The library is in active development.

The NanoArrow Python bindings are intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures in Python, without a dependency on the larger PyArrow package.

# Arrow Columnar Format


## Why does Arrow use a Columnar in-memory format?

Data can be represented in memory, or stored, using either a row-based format or a column-based format.

![image info](./diagrams/table.svg)

### Row format

Traditionally, to read the table above into memory you would use some kind of structure representing the following rows:

![image info](./diagrams/row_format.svg)


That means that all the information for a row is kept together in memory. This is great for transactional databases ([OLTP](https://en.wikipedia.org/wiki/Online_transaction_processing)) and for workloads that need all the data of a row on every access.

### Columnar format

If we have a much bigger table and only want, for example, the average cost per transaction, a row format forces us to skip over all the data that is irrelevant to that computation, which is costly. That is why a columnar format matters for both storage and in-memory representation.

![image info](./diagrams/column_format.svg)

Modern analytical processing databases ([OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing)) typically use a columnar format to make computations and analysis over the data easier.

A columnar format keeps the data organised by column instead of by row. This makes analytical operations such as filtering, grouping and aggregation much more efficient. The CPU can maintain [memory locality](https://en.wikipedia.org/wiki/Locality_of_reference#Types%20of%20locality) and needs fewer memory jumps to process the data. Keeping the data contiguous in memory also enables vectorization of the computations. Most modern CPUs provide single instruction, multiple data ([SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data)) instructions, which apply one operation to several values of a vector at once.
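The difference in access pattern can be sketched in plain Python using only the standard library. This is not how Arrow is implemented, and pure Python gets no SIMD benefit. It only shows that an aggregation over a row layout has to walk every record, while a column layout hands over one contiguous buffer of values. The sample table and the function names are hypothetical.

```python
from array import array

# Row-oriented layout: one tuple per record, with the fields interleaved.
# Fields are (transaction_id, country, cost); the values are made up.
rows = [
    (1, "ES", 10.0),
    (2, "ES", 20.5),
    (3, "FR", 4.5),
    (4, "FR", 5.0),
]

# Column-oriented layout: one sequence per field.
ids, countries, costs = zip(*rows)
columns = {
    "transaction_id": array("q", ids),  # contiguous 8-byte integers
    "country": list(countries),
    "cost": array("d", costs),          # contiguous 8-byte floats
}


def average_cost_rows(rows):
    # Visits every record and picks out the single field that is needed.
    return sum(row[2] for row in rows) / len(rows)


def average_cost_columns(columns):
    # Reads only the "cost" buffer; the other columns are never touched.
    cost = columns["cost"]
    return sum(cost) / len(cost)


print(average_cost_rows(rows), average_cost_columns(columns))
```

Both functions return the same result. What changes is how much unrelated data has to be stepped over to get it.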

Compression is another area where a columnar representation has a clear advantage. Values in the same column share a data type and are often similar to each other, and keeping them next to each other allows compression techniques and algorithms to achieve better compression ratios.
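Run-length encoding is one simple technique that benefits from this. It is used here purely as an illustration, not as a description of what any particular Arrow component does. In the sketch below, the sample data and the helper `run_length_encode` are hypothetical. A low-cardinality column collapses into a few runs, while the same values interleaved with the other fields of each row do not collapse at all.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical rows of (transaction_id, country, cost).
rows = [
    (1, "ES", 10.0),
    (2, "ES", 20.5),
    (3, "ES", 4.5),
    (4, "FR", 8.0),
    (5, "FR", 7.0),
]

# Columnar view: all the country values sit next to each other.
country_column = [country for _, country, _ in rows]
encoded_column = run_length_encode(country_column)

# Row view: flattening the rows interleaves ids, countries and costs,
# so no two neighbouring values are equal and no run is longer than 1.
flattened_rows = [field for row in rows for field in row]
encoded_rows = run_length_encode(flattened_rows)

print(encoded_column)     # [('ES', 3), ('FR', 2)]
print(len(encoded_rows))  # 15
```

Five country values shrink to two runs in the columnar view, whereas the row view keeps all fifteen values as separate runs.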
