# PyArrow in Pandas

1. What is Arrow? What is PyArrow?
2. How can I use Arrow/PyArrow today?
3. Using PyArrow as a backend instead of NumPy

# What is Arrow?

Arrow is an open-source project from the Apache Software Foundation. The idea is to have a set of columnar, in-memory data structures for data analysis that work across platforms and across programming languages. If you're implementing a language, database, or tool that needs to do data analysis, don't reinvent the wheel: just use Apache Arrow.

And if your system uses Apache Arrow, it can probably share data with other systems that also use Arrow.

# Well... what's wrong with NumPy?

I've long described Pandas as the automatic transmission and NumPy as the manual transmission. Pandas has long used NumPy for storage, and that's mostly great:

- Storage is in C, so it's small and fast
- Much less memory usage than Python objects
- Vectorized operations
- We know it's rock solid
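To put rough numbers on the memory and vectorization points, here is a small sketch comparing a Python list of integers with a NumPy `int64` array. The variable names are just for illustration, and the exact size of the list depends on your Python build.

```python
import sys

import numpy as np

n = 1_000
py_ints = list(range(1_000, 1_000 + n))
arr = np.array(py_ints, dtype=np.int64)

# A list stores pointers to separate int objects, so count both.
list_bytes = sys.getsizeof(py_ints) + sum(sys.getsizeof(i) for i in py_ints)

# The array stores the raw 8-byte values back to back.
array_bytes = arr.nbytes

print(list_bytes, array_bytes)

# Vectorized: one expression applies to every element, with the loop running in C.
doubled = arr * 2
print(doubled[:3])
```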

But...

- It was never designed for data storage/retrieval
- It was never designed for tabular data, like we use in Pandas
- It was never really meant for serious string operations
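The string limitation is easy to see in a small sketch. NumPy's classic Unicode dtype is fixed-width, so every element reserves as much room as the longest string in the array. The sample words here are arbitrary.

```python
import numpy as np

words = np.array(["a", "hello", "hi"])

print(words.dtype)     # a fixed-width Unicode dtype sized for the longest string
print(words.itemsize)  # bytes reserved per element, whether long or short
print(words.nbytes)    # total bytes for the whole array

# Assigning a longer string silently truncates it to the fixed width.
words[0] = "a much longer string"
print(words[0])
```

Behavior like this is a big part of why Pandas has traditionally stored strings as generic Python objects rather than as native NumPy strings.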

At some point, Arrow will replace NumPy as the backend for Pandas. Arrow support is currently experimental, but as of this writing the documentation says that PyArrow (the Python bindings for Arrow) will be a required dependency of Pandas 3.0, whenever it comes out.

In [1]:
import sys

# With int64 in Pandas/NumPy, each integer takes up 64 bits (8 bytes).
# How much space does a Python integer take?
x = 1000

sys.getsizeof(x)  # size of the Python int object, in bytes

28

In [2]:
x = x ** 100  # 1000 ** 100, a 301-digit number

In [3]:
sys.getsizeof(x)

160
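The flip side of that flexibility is worth spelling out. A Python integer grows to fit its value, while an `int64` always occupies 8 bytes and has a hard upper limit. A minimal sketch, assuming NumPy is installed:

```python
import sys

import numpy as np

small = 1000
big = small ** 100

# Python ints are objects that grow with the magnitude of the value.
print(sys.getsizeof(small), sys.getsizeof(big))

# An int64 is always 8 bytes, regardless of the value stored.
print(np.dtype(np.int64).itemsize)

# The price of the fixed width is a maximum representable value.
int64_max = np.iinfo(np.int64).max
print(big > int64_max)
```

So the compact storage is a trade: you give up arbitrary-size integers in exchange for small, predictable memory use.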