# PyArrow in Pandas

1. What is Arrow? What is PyArrow?
2. How can I use Arrow/PyArrow today?
3. Using PyArrow as a backend instead of NumPy

# What is Arrow?

Arrow is an open-source project from the Apache Software Foundation. The idea is to have a standard set of in-memory, column-oriented data structures for data analysis that work across platforms and across programming languages. If you're implementing a language that needs to do data analysis, or a system, database, or tool for data analysis, then don't reinvent the wheel -- just use Apache Arrow.

And if your system uses Apache Arrow, then it can probably share data with other systems that use Apache Arrow.

# Well... what's wrong with NumPy?

I've long described Pandas as an automatic transmission and NumPy as the manual transmission. Pandas has long used NumPy for storage, and that's mostly great:

- Storage is in C, so it's small and fast
- Much less memory usage than Python objects
- Vectorized operations
- We know it's rock solid

But...

- It was never designed for data storage/retrieval
- It was never designed for tabular data, like we use in Pandas
- It was never really meant for serious string operations
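
To see why strings are awkward in NumPy, here's a small sketch (the variable name `words` is mine). NumPy's native string dtype gives every element the same fixed width, so short strings are padded out to the length of the longest one.

```python
import numpy as np

# NumPy stores strings in fixed-width slots, sized for the longest string
words = np.array(['hello', 'out', 'there', 'everyone'])

print(words.dtype)      # a Unicode dtype with room for 8 characters
print(words.itemsize)   # bytes per element: 8 characters * 4 bytes each = 32
print(words.nbytes)     # total for all 4 elements: 4 * 32 = 128
```

One very long string would force every slot to be that wide, which is part of why Pandas has traditionally stored strings as Python objects instead.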

At some point, Arrow will likely replace NumPy as the backend for Pandas. The Arrow backend is currently experimental, but as of this writing, the documentation says that PyArrow (the Python bindings for Arrow) will be required in order to install Pandas 3.0, whenever it comes out.

In [1]:
# if we're using int64 in Pandas/NumPy, then an integer takes up 64 bits or 8 bytes
# how much space does a Python integer take?

x = 1000
import sys

sys.getsizeof(x)  # how many bytes will this be?

28

In [2]:
x = x ** 100

In [3]:
sys.getsizeof(x)

160
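
To put those numbers in context, here's a rough comparison of 1,000 integers stored in a Python list versus a NumPy int64 array. This is only a sketch: the names `py_ints` and `np_ints` are mine, and the exact size of the list will vary with your Python version.

```python
import sys
import numpy as np

n = 1_000

# a Python list holds pointers to separate int objects,
# so we add up the list itself plus every int it refers to
py_ints = list(range(n))
list_bytes = sys.getsizeof(py_ints) + sum(sys.getsizeof(i) for i in py_ints)

# a NumPy array holds the raw 64-bit values, side by side
np_ints = np.arange(n, dtype=np.int64)

print(f'Python list: {list_bytes:,} bytes')
print(f'NumPy array: {np_ints.nbytes:,} bytes')   # 1,000 * 8 = 8,000 bytes
```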

In [4]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [5]:
np.random.seed(0)
df = DataFrame(np.random.randint(0, 1000, [4,5]),
               index=list('abcd'),
               columns=list('vwxyz'))
df
               

     v    w    x    y    z
a  684  559  629  192  835
b  763  707  359    9  723
c  277  754  804  599   70
d  472  600  396  314  705


In [6]:
# how much memory does this data frame use?
# I use df.info(), which both shows me info about my columns + rows, and gives me a memory summary

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   v       4 non-null      int64
 1   w       4 non-null      int64
 2   x       4 non-null      int64
 3   y       4 non-null      int64
 4   z       4 non-null      int64
dtypes: int64(5)
memory usage: 192.0+ bytes


In [7]:
4 * 5 * 8

160
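
So the data itself accounts for 160 of the 192 bytes that df.info() reported. To see where the rest goes, we can ask for a per-column breakdown with `df.memory_usage()`, which includes an entry for the index. In this sketch I rebuild the data frame under the name `df_mem` and force `dtype=np.int64` so that the arithmetic holds on every platform; both of those are my additions. The size reported for the index depends on your Pandas version, so I don't show it here.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df_mem = pd.DataFrame(np.random.randint(0, 1000, [4, 5], dtype=np.int64),
                      index=list('abcd'),
                      columns=list('vwxyz'))

usage = df_mem.memory_usage()    # one entry per column, plus one for the index
print(usage)

# just the five int64 columns: 5 columns * 4 rows * 8 bytes
print(usage[list('vwxyz')].sum())
```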

In [8]:
# because the row and column names are strings, Pandas does *not* store the text itself in NumPy!
# rather, each name is a Python string object, and the index (dtype 'object') just holds references to those objects

df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [9]:
df.columns

Index(['v', 'w', 'x', 'y', 'z'], dtype='object')

In [10]:
# by passing memory_usage='deep', we say: Go and find the memory usage for each string
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   v       4 non-null      int64
 1   w       4 non-null      int64
 2   x       4 non-null      int64
 3   y       4 non-null      int64
 4   z       4 non-null      int64
dtypes: int64(5)
memory usage: 360.0 bytes


In [11]:
# let's add a column

df['s'] = ['hello', 'out', 'there', 'everyone']

In [12]:
df

     v    w    x    y    z         s
a  684  559  629  192  835     hello
b  763  707  359    9  723       out
c  277  754  804  599   70     there
d  472  600  396  314  705  everyone


In [13]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v       4 non-null      int64 
 1   w       4 non-null      int64 
 2   x       4 non-null      int64 
 3   y       4 non-null      int64 
 4   z       4 non-null      int64 
 5   s       4 non-null      object
dtypes: int64(5), object(1)
memory usage: 577.0 bytes


In [15]:
df.info()  # without memory_usage='deep', the strings themselves aren't counted, so the total is too low


<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v       4 non-null      int64 
 1   w       4 non-null      int64 
 2   x       4 non-null      int64 
 3   y       4 non-null      int64 
 4   z       4 non-null      int64 
 5   s       4 non-null      object
dtypes: int64(5), object(1)
memory usage: 224.0+ bytes


In [16]:
# let's assign a new string to one value

df.loc['a', 's'] = 'abcdefghij' * 10_000_000

In [17]:
df.info()  # without 'deep' -- how much memory is reported?

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v       4 non-null      int64 
 1   w       4 non-null      int64 
 2   x       4 non-null      int64 
 3   y       4 non-null      int64 
 4   z       4 non-null      int64 
 5   s       4 non-null      object
dtypes: int64(5), object(1)
memory usage: 396.0+ bytes


In [18]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v       4 non-null      int64 
 1   w       4 non-null      int64 
 2   x       4 non-null      int64 
 3   y       4 non-null      int64 
 4   z       4 non-null      int64 
 5   s       4 non-null      object
dtypes: int64(5), object(1)
memory usage: 95.4 MB
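
The gap between those two reports comes entirely from the object column. Here's a smaller sketch of the same effect on a single series, using `Series.memory_usage` with and without `deep=True`. The name `s_demo` and the shorter long string are my own choices, to keep the example light.

```python
import pandas as pd

# dtype=object forces the classic "pointers to Python strings" storage
s_demo = pd.Series(['hello', 'out', 'there', 'abcdefghij' * 1_000],
                   dtype=object)

shallow = s_demo.memory_usage(index=False)            # the pointers only
deep = s_demo.memory_usage(index=False, deep=True)    # pointers + the strings they refer to

print(f'shallow: {shallow:,} bytes')
print(f'deep:    {deep:,} bytes')
```

The shallow number stays the same no matter how long the strings get, which is why it can be so misleading.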


# Using PyArrow today

We often read CSV files from disk into Pandas. A big CSV file can take a long time to load. That's because, by default, Pandas uses a relatively slow engine for reading data.

We can replace it, with almost zero effort, with PyArrow's CSV-loading facility, by passing engine='pyarrow' to pd.read_csv.

In [19]:
filename = '/Users/reuven/Courses/Current/Data/nyc_taxi_2019-01.csv'

In [20]:
!ls -lh $filename

-rw-r--r-- 1 reuven staff 656M Jun  4  2021 /Users/reuven/Courses/Current/Data/nyc_taxi_2019-01.csv


In [22]:
# how long does it take to load this data into Pandas?

import time

start_time = time.perf_counter()   # perf_counter is meant for measuring elapsed time
df = pd.read_csv(filename)
end_time = time.perf_counter()

print(f'Total is {end_time - start_time}')

Total is 11.839046955108643


In [23]:
# Let's use PyArrow's CSV parser instead

import time

start_time = time.perf_counter()   # perf_counter is meant for measuring elapsed time
df = pd.read_csv(filename, engine='pyarrow')
end_time = time.perf_counter()

print(f'Total is {end_time - start_time}')

Total is 1.0558300018310547


In [24]:
df.dtypes

VendorID                         int64
tpep_pickup_datetime     datetime64[s]
tpep_dropoff_datetime    datetime64[s]
passenger_count                  int64
trip_distance                  float64
RatecodeID                       int64
store_and_fwd_flag              object
PULocationID                     int64
DOLocationID                     int64
payment_type                     int64
fare_amount                    float64
extra                          float64
mta_tax                        float64
tip_amount                     float64
tolls_amount                   float64
improvement_surcharge          float64
total_amount                   float64
congestion_surcharge           float64
dtype: object

In [26]:
type(df.values)

numpy.ndarray

# CSV is slow (because it's text)

PyArrow can read and write two binary formats that we can use for storing and retrieving data. Both are typically much faster than CSV:

- Feather -- Arrow's own file format; faster reads/writes, but light or no compression (so you have larger files)
- Parquet -- a separate Apache project that Arrow supports; slower reads/writes, but heavy compression (so you have smaller files)