# Agenda

1. Pandas, NumPy, and PyArrow
2. Arrow -- what it is
3. PyArrow for file loading/saving
4. PyArrow as a back-end data type
5. Transitioning to PyArrow

# Pandas, NumPy, and PyArrow

NumPy has been around for about 20 years. It gives us the speed and efficiency of C data types with a Python API, which means we can use Python for numeric computing.

NumPy defines the "NumPy array" (`ndarray`), which contains a number of values, all of the same type. The values are stored as C data types, which means that (in contrast with normal Python programming) we have to specify not only the type of the values, but also their size in bits.

In [1]:
import numpy as np

In [2]:
a = np.array([10, 20, 30, 40, 50])
a

array([10, 20, 30, 40, 50])

In [3]:
type(a)

numpy.ndarray

In [4]:
a.dtype

dtype('int64')

In [5]:
64 * 5   # bits of data: 5 values, 64 bits each

320

In [6]:
import sys

mylist = [10, 20, 30, 40, 50]
# Size in bytes of the list object itself (header plus pointers), not the int objects
sys.getsizeof(mylist)

104

In [7]:
a = np.array([10, 20, 30, 40, 50], dtype='int8')
a

array([10, 20, 30, 40, 50], dtype=int8)
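
To make the size difference concrete, here is a small sketch comparing the memory used by the same five values under two dtypes. It relies on the `itemsize` and `nbytes` attributes of `ndarray`; the variable names are made up for illustration.

```python
import numpy as np

values = [10, 20, 30, 40, 50]

# Ask for the widths explicitly rather than relying on the platform default
wide = np.array(values, dtype='int64')
narrow = np.array(values, dtype='int8')

# itemsize is bytes per element; nbytes is the total for the data buffer
print(wide.itemsize, wide.nbytes)      # 8 bytes each, 40 bytes total
print(narrow.itemsize, narrow.nbytes)  # 1 byte each, 5 bytes total
```

The trade-off is range: `int8` only holds values from -128 through 127, so the smaller dtype is safe only when we know the data fits.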

NumPy is great! But it's also a bit low level. About 15 years ago, Wes McKinney invented Pandas, a wrapper around NumPy that gives us a lot of convenience methods. NumPy handles n-dimensional arrays; Pandas has only two main data structures: the Series (1D) and the DataFrame (2D).

Pandas gives us lots of methods for:

- Loading data from different formats
- Storing data in different formats
- Analyzing our data in many ways
- Cleaning our data
- Visualizing our data

When we store data in our Pandas series or data frame, it's being stored in a NumPy array behind the scenes:

- A series is actually a wrapper around a single NumPy array
- Each of a data frame's columns is a NumPy array, but there are very fast transformations that allow us to retrieve by row, as well as by column.
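
The relationship described above can be poked at directly. This is a minimal sketch with made-up data; it assumes the traditional NumPy-backed defaults for numeric columns.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1.5, 2.5, 3.5]})

# to_numpy() hands back the values as a NumPy array
s_values = s.to_numpy()
print(type(s_values), s_values.dtype)

# Each column has its own dtype, just as a separate NumPy array would
print(df.dtypes)

# Retrieval by column and by row both give us back a Series
col = df['b']
row = df.iloc[0]
print(col)
print(row)
```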

For many years, this has worked just fine... mostly. Where are the problems?

- Strings -- currently, Pandas stores text as Python strings, with pointers from NumPy arrays into Python memory
- There's no compression or general categorization/normalization of textual (or other) data
- Much of the analysis we want to do is columnar, and NumPy just isn't that fast at working in columns, even when we think of a data frame as containing multiple series (i.e., NumPy arrays)
- NumPy is great for Python, but it's not interoperable with other systems.
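
The string problem is easy to see in a sketch. To keep it independent of any particular Pandas version, this builds the `object` array directly in NumPy rather than through a series; the sample strings are made up.

```python
import numpy as np

words = np.array(['a', 'a much longer string', 'b'], dtype=object)

# Every slot is the same size no matter how long the string is,
# because the array holds a pointer to a Python object, not the text
print(words.dtype, words.itemsize)

# The elements are ordinary Python str objects living outside the array
print(type(words[0]))
```

Because the text lives in separate Python objects, NumPy can't apply its fast, C-level operations to it the way it can with numbers.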

Wes and others said: Let's create a new data structure that'll work in Pandas and also in lots of other systems and languages. That became Apache Arrow, an open-source project. It just released version 20 a month or so ago.

It supports many data structures (arrays, tables, dictionaries, etc.), along with copy-on-write, compression, reduction of duplicated values, and nullable types. Python supports it through the PyArrow bindings, and Arrow is also available from R, Apache Spark, and other languages/systems.



# PyArrow and Pandas

For a long time, you could convert a PyArrow Table to a Pandas data frame, and vice versa. They were interoperable, but needed translation.

A few years ago, several Pandas core developers decided it would be worth starting the journey of moving Pandas from NumPy to PyArrow as a back end.

The idea is that Pandas will continue to work as before, with all of its methods and functionality, but the back end will be PyArrow, rather than NumPy.

In a few years, we can expect that PyArrow will be the default back-end storage. I don't think that NumPy is going away, either in general or in Pandas. But Pandas 3.0 will require PyArrow to be installed.

# Loading/saving files

Already today, in a non-experimental way, you can use PyArrow in your Pandas code to read from CSV files (and write to them, too).

If you use `pd.read_csv`, then Pandas needs to do several things:

- Load the file into memory, often in multiple chunks, if it's too big
- Evaluate what dtype you're going to want for each column (if you don't specify), choosing between `int64`, `float64`, and `object` (i.e., string)
- Create the data frame based on that
- Pandas, of course, uses a single core, with a single process and a single thread
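
The dtype-guessing step can be seen on a tiny in-memory CSV. This is a sketch with made-up data, not the taxi file; it uses `io.StringIO` so that no file is needed.

```python
import io
import pandas as pd

csv_text = "passengers,fare,vendor\n1,7.5,A\n2,12.0,B\n4,30.25,A\n"

# No dtypes given, so Pandas has to infer one per column
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Specifying dtypes up front skips the guessing for those columns
explicit = pd.read_csv(io.StringIO(csv_text),
                       dtype={'passengers': 'int8', 'fare': 'float32'})
print(explicit.dtypes)
```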

PyArrow uses the Arrow engine, which reads the file using multiple cores and multiple threads, making it far faster.

In [8]:
filename = '/Users/reuven/Courses/Current/Data/nyc_taxi_2019-01.csv'

!ls -lh $filename

-rw-r--r-- 1 reuven staff 656M Jul 29  2021 /Users/reuven/Courses/Current/Data/nyc_taxi_2019-01.csv


In [12]:
import time
import pandas as pd
from pandas import Series, DataFrame

In [13]:
start_time = time.perf_counter()
df = pd.read_csv(filename)
end_time = time.perf_counter()

total_time = end_time - start_time
print(f'Total time: {total_time:0.2f}')

Total time: 6.42


In [None]:
start_time = time.perf_counter()
df = pd.read_csv(filename, engine='pyarrow')
end_time = time.perf_counter()

total_time = end_time - start_time
print(f'Total time: {total_time:0.2f}')