# 10+ Minutes to pyArrow and Arrow DataFusion 

This is a work in progress; only the basics are in place so far.

# pyArrow Basics

In [1]:
import pyarrow as pa

## Arrays and Tables

In [2]:
days = pa.array([1,12,17,23,28], type=pa.int8())

In [3]:
months = pa.array([1, 3, 5, 7, 1], type=pa.int8())

In [4]:
years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())

In [5]:
birthdays_table = pa.table([days, months, years],
                           names=["days", "months", "years"])

In [6]:
birthdays_table

pyarrow.Table
days: int8
months: int8
years: int16
----
days: [[1,12,17,23,28]]
months: [[1,3,5,7,1]]
years: [[1990,2000,1995,2000,1995]]

## Saving and Loading Tables

In [7]:
import pyarrow.parquet as pq

In [8]:
pq.write_table(birthdays_table, './data/birthdays.parquet')

In [9]:
reloaded_birthdays = pq.read_table('./data/birthdays.parquet')

In [10]:
reloaded_birthdays

pyarrow.Table
days: int8
months: int8
years: int16
----
days: [[1,12,17,23,28]]
months: [[1,3,5,7,1]]
years: [[1990,2000,1995,2000,1995]]

## Performing Computations

Here's a [list of available compute functions](https://arrow.apache.org/docs/python/compute.html#compute) for our reference.

In [11]:
import pyarrow.compute as pc

In [12]:
pc.value_counts(birthdays_table["years"])

<pyarrow.lib.StructArray object at 0x0000023C64E7E7A0>
-- is_valid: all not null
-- child 0 type: int16
  [
    1990,
    2000,
    1995
  ]
-- child 1 type: int64
  [
    1,
    2,
    2
  ]

## Working with Large Data

In [13]:
import pyarrow.dataset as ds

Arrow also provides the `pyarrow.dataset` API for working with large data; it handles partitioning the data into smaller chunks for you.

In [15]:
ds.write_dataset(birthdays_table, "./data/pyArrow-large-data/1", format="parquet",
                 partitioning=ds.partitioning(
                    pa.schema([birthdays_table.schema.field("years")])
                ))

Loading the partitioned dataset back automatically detects the chunks.

In [16]:
# Read from the same directory that write_dataset wrote to, so that the
# first path segment below the root is the "years" partition value.
birthdays_dataset = ds.dataset("./data/pyArrow-large-data/1", format="parquet", partitioning=["years"])

In [17]:
birthdays_dataset

<pyarrow._dataset.FileSystemDataset at 0x23c65122440>

In [18]:
birthdays_dataset.files

['./data/pyArrow-large-data/1/1990/part-0.parquet',
 './data/pyArrow-large-data/1/1995/part-0.parquet',
 './data/pyArrow-large-data/1/2000/part-0.parquet']

Arrow loads chunks of data lazily, only when you iterate over them.

In [19]:
import datetime

current_year = datetime.datetime.now(datetime.timezone.utc).year

In [20]:
for table_chunk in birthdays_dataset.to_batches():
    print("AGES", pc.subtract(current_year, table_chunk["years"]))

AGES [
  33
]
AGES [
  28,
  28
]
AGES [
  23,
  23
]


In [21]:
# more to come, including DataFusion basics