# TorchArrow in 10 minutes

TorchArrow is a Python DataFrame library built on the Apache Arrow columnar memory format and leveraging the Velox vectorized engine for loading, filtering, mapping, joining, aggregating, and otherwise manipulating tabular data on CPUs.

TorchArrow allows mostly zero-copy interop with NumPy, Pandas, PyArrow, cuDF and, of course, PyTorch.
In fact, it is the integration with PyTorch that triggered the development of TorchArrow.
So TorchArrow understands Tensors natively.  

(Remark. In case the following looks familiar: portions of this tutorial were gratefully borrowed and adapted from the 10 Minutes to Pandas (and cuDF) tutorial.)



In [1]:

import pandas as pd
import numpy as np
import pyarrow as pa

The TorchArrow library consists of two parts:

  * *DTypes* define *Schema*, *Fields*, primitive and composite *Types*.

  * *Dataframes* are sequences of named and typed *columns* of the same length.

Let's get started...

In [2]:
import torcharrow as ta

## Constructing data: Columns

### From Pandas to TorchArrow
To start, let's create a Pandas series and a TorchArrow column and compare them:

In [3]:
pd.Series([1,2,None,4])

0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

In Pandas each Series has an index, here depicted as the first column. Note also that the inferred type is float and not int, since in Pandas None implicitly promotes an int list to a float series.
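
The promotion can be checked directly in Pandas. The snippet below is a small sketch using only Pandas itself; the nullable `"Int64"` extension dtype is shown for comparison and is a Pandas feature, not part of TorchArrow.

```python
import pandas as pd

# Default inference: the None forces promotion from int to float64.
promoted = pd.Series([1, 2, None, 4])
print(promoted.dtype)  # float64

# Opting into Pandas' nullable extension dtype keeps the values integral.
nullable = pd.Series([1, 2, None, 4], dtype="Int64")
print(nullable.dtype)            # Int64
print(nullable.isna().tolist())  # [False, False, True, False]
```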

TorchArrow has a much more precise type system:

In [4]:
s = ta.Column([1,2,None,4])
s

  data    validity
------  ----------
     1           1
     2           1
     0           0
     4           1
dtype: Int64(nullable=True), count: 4, null_count: 1

TorchArrow infers that the type is `Int64(nullable=True)`, which requires the vector to be represented internally by two arrays: its data and a validity bit mask (the mask is shown only if `null_count > 0`).
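
To make the two-array idea concrete, here is a plain-Python sketch. The helper `split_nullable` is hypothetical and only mimics the printed layout above; it is not TorchArrow's actual implementation.

```python
from typing import List, Optional, Tuple


def split_nullable(values: List[Optional[int]]) -> Tuple[List[int], List[int]]:
    """Split values into a dense data list and a 0/1 validity list."""
    # Null slots get a placeholder 0 in the data list.
    data = [0 if v is None else v for v in values]
    # 1 marks a valid slot, 0 marks a null slot.
    validity = [0 if v is None else 1 for v in values]
    return data, validity


data, validity = split_nullable([1, 2, None, 4])
print(data)      # [1, 2, 0, 4]
print(validity)  # [1, 1, 0, 1]
```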


Of course we can always get more information from a column (its type, length, etc.):

In [5]:
(s.dtype, s.size)

(Int64(nullable=True), 4)

TorchArrow supports almost all Arrow types, including arbitrarily nested structs, maps, lists, and fixed-size lists. Here is a column of lists of strings.

In [6]:
sf = ta.Column([ ["hello", "world"], ["how", "are", "you"] ], ta.List_(ta.string))
sf

data                     offsets
---------------------  ---------
['hello', 'world']             0
['how', 'are', 'you']          2
dtype: List_(string), count: 2, null_count: 0
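
The `offsets` shown next to each row can be read as the start position of that row in one flat buffer of elements. The following plain-Python sketch illustrates that reading; `flatten_with_offsets` and `get_row` are hypothetical helpers, and the real internal layout may differ.

```python
def flatten_with_offsets(rows):
    """Store nested lists as one flat list plus the start offset of each row."""
    flat, offsets = [], []
    for row in rows:
        offsets.append(len(flat))  # the row starts where the flat list currently ends
        flat.extend(row)
    return flat, offsets


def get_row(flat, offsets, i):
    """Recover row i by slicing between its offset and the next one."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(flat)
    return flat[offsets[i]:end]


flat, offsets = flatten_with_offsets([["hello", "world"], ["how", "are", "you"]])
print(offsets)                    # [0, 2]
print(get_row(flat, offsets, 1))  # ['how', 'are', 'you']
```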

### Builders

Columns are append-only. Use the usual `append` and `extend` functions to grow them.

In [7]:
sf.append(["I", "am", "fine"])
sf

data                     offsets
---------------------  ---------
['hello', 'world']             0
['how', 'are', 'you']          2
['I', 'am', 'fine']            5
dtype: List_(string), count: 3, null_count: 0

## Constructing data: Dataframes

Now let's focus on Dataframes. A Dataframe is just a set of named and strongly typed columns of equal length:

In [8]:
df = ta.DataFrame({'a': list(range(7)),
                     'b': list(reversed(range(7))),
                     'c': list(range(7))
                    })
df

  a    b    c
---  ---  ---
  0    6    0
  1    5    1
  2    4    2
  3    3    3
  4    2    4
  5    1    5
  6    0    6
dtype: Schema([Field(a, int64), Field(b, int64), Field(c, int64)]), count: 7, null_count: 0

Dataframes (and columns of struct types) are updatable. That is, we can add or update any column as long as the result obeys the dataframe invariant: columns are strongly typed and have equal length.


In [9]:
df['d'] = ta.Column(list(range(99, 99+7)))
df

  a    b    c    d
---  ---  ---  ---
  0    6    0   99
  1    5    1  100
  2    4    2  101
  3    3    3  102
  4    2    4  103
  5    1    5  104
  6    0    6  105
dtype: Schema([Field(a, int64), Field(b, int64), Field(c, int64), Field(d, int64)]), count: 7, null_count: 0
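
The equal-length part of the invariant can be sketched in plain Python with a dictionary of lists. The helper `set_column` is hypothetical and only illustrates the check; how TorchArrow reports a violation is not shown in this tutorial.

```python
def set_column(columns, name, values):
    """Add or replace a column, rejecting values of the wrong length."""
    values = list(values)
    for other_name, other in columns.items():
        if other_name != name and len(other) != len(values):
            raise ValueError(
                f"column {name!r} has length {len(values)}, expected {len(other)}"
            )
    columns[name] = values


table = {"a": list(range(7)), "b": list(reversed(range(7)))}
set_column(table, "d", range(99, 99 + 7))  # accepted: length 7 matches

try:
    set_column(table, "e", [1, 2, 3])      # rejected: length 3 does not match
except ValueError as err:
    print(err)
```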

## Interop

Take a Pandas dataframe and move it zero copy (if possible) to TorchArrow.

In [10]:
# TODO
# pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})
# gdf = ta.DataFrame.from_pandas(pdf)
# gdf

And bring it back to Pandas

In [11]:
# gdf.to_pandas()

The same works for Arrow (not shown here).

## Viewing (sorted) data

Take the top n rows

In [12]:
df.head(2)

  a    b    c    d
---  ---  ---  ---
  0    6    0   99
  1    5    1  100
dtype: Schema([Field(a, int64), Field(b, int64), Field(c, int64), Field(d, int64)]), count: 2, null_count: 0

Sort values

In [13]:
# TODO df.sort(by='b')

## Selection
Projecting a single column

In [14]:
df['a']

  data
------
     0
     1
     2
     3
     4
     5
     6
dtype: int64, count: 7, null_count: 0

Selection by row position. Note that the operation currently returns a row of values as a tuple. Should it return a column instead?

In [15]:
df[1]

(1, 5, 1, 100)

Selecting a slice preserves the dataframe's type.


In [16]:
df.slice(2,3)

  a    b    c    d
---  ---  ---  ---
  2    4    2  101
dtype: Schema([Field(a, int64), Field(b, int64), Field(c, int64), Field(d, int64)]), count: 1, null_count: 0

Selection by condition is written with a boolean condition.

In [17]:
df[df['a'] > 4]

  a    b    c    d
---  ---  ---  ---
  5    1    5  104
  6    0    6  105
dtype: Schema([Field(a, int64), Field(b, int64), Field(c, int64), Field(d, int64)]), count: 2, null_count: 0
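
Conceptually, selection by condition happens in two steps: the comparison produces a boolean mask, and the mask then picks rows from every column. A plain-Python sketch of that idea, using lists in place of TorchArrow columns:

```python
columns = {
    "a": list(range(7)),
    "b": list(reversed(range(7))),
}

# Step 1: the condition yields one boolean per row.
mask = [a > 4 for a in columns["a"]]

# Step 2: the mask is applied to every column alike.
selected = {
    name: [value for value, keep in zip(col, mask) if keep]
    for name, col in columns.items()
}
print(mask)      # [False, False, False, False, False, True, True]
print(selected)  # {'a': [5, 6], 'b': [1, 0]}
```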

Selection by methods like isin

In [39]:
df[df['a'].isin([5])]

  a    b    c    d
---  ---  ---  ---
  5    1    5  104
dtype: Schema([Field(a, int64), Field(b, int64), Field(c, int64), Field(d, int64)]), count: 1, null_count: 0

## Missing data
Missing data can be filled in via the `fillna` method.

In [19]:
t = s.fillna(999)
t

  data
------
     1
     2
   999
     4
dtype: Int64(nullable=True), count: 4, null_count: 0

## Numerical columns and descriptive statistics
Just use the usual statistics operations on columns.


In [20]:
(t.min(), t.max(), t.sum(), t.mean())

(1, 999, 1006, 251.5)
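
The numbers above can be cross-checked with the standard library alone. This sketch fills the null with 999 by hand and recomputes the same four statistics:

```python
import statistics

raw = [1, 2, None, 4]

# Equivalent of filling nulls with 999.
filled = [999 if v is None else v for v in raw]

summary = (min(filled), max(filled), sum(filled), statistics.mean(filled))
print(summary)  # (1, 999, 1006, 251.5)
```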

## String methods
TorchArrow provides all of Python's string processing methods, lifted to work over columns.

In [21]:
s = ta.Column(['what a wonderful world!', 'really?'])
s.capitalize()

data                       offsets
-----------------------  ---------
What a wonderful world!          0
Really?                         23
dtype: string, count: 2, null_count: 0
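
"Lifting" here means applying a scalar string method to every element. A plain-Python sketch of the idea follows; the helper `lift` is hypothetical, and passing nulls through unchanged is an assumption.

```python
def lift(method_name, column):
    """Apply the named str method to each element, passing None through."""
    return [None if s is None else getattr(s, method_name)() for s in column]


strings = ["what a wonderful world!", "really?", None]
print(lift("capitalize", strings))  # ['What a wonderful world!', 'Really?', None]
print(lift("upper", strings))       # ['WHAT A WONDERFUL WORLD!', 'REALLY?', None]
```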

## Functional tools: Filter, map, flatmap and reduce

Use `filter`, `map`, `flatmap`, and `reduce` to call a unary user-defined function (UDF) on each element of a column or on each row of a dataframe. TorchArrow represents a row as a tuple.

In [22]:
def pred(tup) -> bool:
    return tup[0] > tup[1]

df.filter(pred)

  a    b    c    d
---  ---  ---  ---
  4    2    4  103
  5    1    5  104
  6    0    6  105
dtype: Schema([Field(a, int64), Field(b, int64), Field(c, int64), Field(d, int64)]), count: 3, null_count: 0
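
Since each row is handed to the UDF as a tuple, the same filter can be reproduced in plain Python by zipping the columns into rows. This is only a sketch of the semantics, not of how TorchArrow evaluates the filter:

```python
a = list(range(7))
b = list(reversed(range(7)))
c = list(range(7))
d = list(range(99, 99 + 7))

# One tuple per row, fields in column order (a, b, c, d).
rows = list(zip(a, b, c, d))


def pred(tup) -> bool:
    return tup[0] > tup[1]  # keep rows where a > b


kept = [row for row in rows if pred(row)]
print(kept)  # [(4, 2, 4, 103), (5, 1, 5, 104), (6, 0, 6, 105)]
```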

If `map` returns the same type as its input, just pass the function.

In [23]:
def add_ten(num):
    return num + 10

df['a'].map(add_ten)

  data
------
    10
    11
    12
    13
    14
    15
    16
dtype: int64, count: 7, null_count: 0

Note that all operations working on columns and dataframes ignore null values. So applying `add_ten` to the values of our original column `s` returns:


In [24]:
ta.Column([1,2,None,4]).map(add_ten)

  data    validity
------  ----------
    11           1
    12           1
     0           0
    14           1
dtype: Int64(nullable=True), count: 4, null_count: 1
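
The null-skipping behavior can be sketched in plain Python: the UDF is never called on a null, and the null is carried over to the result. `map_skip_nulls` is a hypothetical helper for illustration only.

```python
def add_ten(num):
    return num + 10


def map_skip_nulls(func, column):
    """Apply func to valid values only; nulls stay null."""
    return [None if v is None else func(v) for v in column]


print(map_skip_nulls(add_ten, [1, 2, None, 4]))  # [11, 12, None, 14]
```

Without the guard, `add_ten(None)` would raise a `TypeError`, which is why skipping nulls is a convenient default for UDFs.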

If a function's argument type differs from its return type, then the return type must be specified as the last argument.

In [25]:
def concat(words):
    return ' '.join(words)

sf.map(concat, ta.string)

data           offsets
-----------  ---------
hello world          0
how are you         11
I am fine           22
dtype: string, count: 3, null_count: 0

`flatmap` combines `filter` with `map`. For instance, let's duplicate all rows that start with 'I' and drop all others.

In [26]:
def selfish(words):
    if len(words)>=1 and words[0] == "I": 
        return [words, words]
    else:
        return []

sf.flatmap(selfish)

data                   offsets
-------------------  ---------
['I', 'am', 'fine']          0
['I', 'am', 'fine']          3
dtype: List_(string), count: 2, null_count: 0
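
In plain Python, the same behavior amounts to concatenating the lists returned by the UDF: an empty list drops the row, a longer list multiplies it. A sketch with a hypothetical `flatmap` helper:

```python
from itertools import chain


def flatmap(func, column):
    """Call func on each element and concatenate the returned lists."""
    return list(chain.from_iterable(func(v) for v in column))


def selfish(words):
    if len(words) >= 1 and words[0] == "I":
        return [words, words]  # emit the row twice
    return []                  # emit nothing, i.e. drop the row


column = [["hello", "world"], ["how", "are", "you"], ["I", "am", "fine"]]
print(flatmap(selfish, column))  # [['I', 'am', 'fine'], ['I', 'am', 'fine']]
```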

Finally, `reduce` works exactly as in Python. To compute the product, simply use the operator `mul`.

In [27]:
import operator
t.reduce(operator.mul)

7992
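
For comparison, this is the standard-library counterpart applied to the same values (the column `t` holds 1, 2, 999, 4 after the fill):

```python
import functools
import operator

values = [1, 2, 999, 4]

# Folds left to right: ((1 * 2) * 999) * 4
product = functools.reduce(operator.mul, values)
print(product)  # 7992
```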

## Relational tools: Join and Group-by
 
Performing SQL-style joins. Note that the dataframe order is not maintained. (Not yet implemented.)

In [28]:
# df_a = ta.DataFrame()
# df_a['key'] = ['a', 'b', 'c', 'd', 'e']
# df_a['vals_a'] = [float(i + 10) for i in range(5)]

# df_b = ta.DataFrame()
# df_b['key'] = ['a', 'c', 'e']
# df_b['vals_b'] = [float(i+100) for i in range(3)]

# merged = df_a.merge(df_b, on=['key'], how='left')
# merged
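
Until the join is available, the intended left-join semantics can be sketched in plain Python on the same data. This assumes keys are unique on the right side and represents unmatched rows with `None`; neither detail is confirmed for the eventual TorchArrow API.

```python
keys_a = ["a", "b", "c", "d", "e"]
vals_a = [float(i + 10) for i in range(5)]

keys_b = ["a", "c", "e"]
vals_b = [float(i + 100) for i in range(3)]

# Index the right side by key (assumes unique keys).
lookup_b = dict(zip(keys_b, vals_b))

# Left join: keep every left row, fill missing right values with None.
merged = [(k, va, lookup_b.get(k)) for k, va in zip(keys_a, vals_a)]
for row in merged:
    print(row)
```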

### Grouping

Like Pandas, TorchArrow supports the split-apply-combine groupby paradigm. (Not yet implemented.)
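
As a reminder of what split-apply-combine means, here is a plain-Python sketch on made-up data. The column names and values are illustrative only and say nothing about the eventual TorchArrow API.

```python
from collections import defaultdict

keys = ["x", "y", "x", "y", "x"]
vals = [1, 2, 3, 4, 5]

# Split: collect the values belonging to each key.
groups = defaultdict(list)
for k, v in zip(keys, vals):
    groups[k].append(v)

# Apply and combine: aggregate each group into one result row.
sums = {k: sum(vs) for k, vs in groups.items()}
print(sums)  # {'x': 9, 'y': 6}
```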

### Transpose



In [29]:
sample = ta.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
sample

  a    b
---  ---
  1    4
  2    5
  3    6
dtype: Schema([Field(a, int64), Field(b, int64)]), count: 3, null_count: 0

In [30]:
# sample.transpose() -TODO

## More on User defined functions

Above we covered the most basic usage of a unary UDF. Let's look into more esoteric features here:

**Multiparameter UDFs.** Functions that take more than one argument but not the complete row declare which columns are passed in.   

In [31]:
# df.map(operator.add, incols= ['a','b']) -- TODO

**Multireturn UDFs.** Functions that return more than one column can be specified by returning a tuple; providing the return type is mandatory.


In [32]:
# df.map(divmod,  incols= ['a','b'], dtypes = [int64, int64]) -- TODO
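
Both UDF shapes described above can be sketched in plain Python on the small `sample` data (a = 1, 2, 3 and b = 4, 5, 6). This only illustrates the intended semantics; the `incols` and `dtypes` parameters are still marked TODO in the tutorial.

```python
import operator

a = [1, 2, 3]
b = [4, 5, 6]

# Multi-parameter UDF: the function receives one value per selected column.
sums = [operator.add(x, y) for x, y in zip(a, b)]
print(sums)  # [5, 7, 9]

# Multi-return UDF: each call returns a tuple, which is unzipped
# into one list per output column.
pairs = [divmod(x, y) for x, y in zip(a, b)]
quotients, remainders = (list(col) for col in zip(*pairs))
print(quotients)   # [0, 0, 0]
print(remainders)  # [1, 2, 3]
```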

**Functions with state.** UDFs sometimes need additional precomputed state. We capture the state in an object and use a method as a delegate:
 

In [33]:
def fib(n):
    if n == 0:
        return 0
    elif n == 1 or n == 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)
    
class State:
    def __init__(self, x):
        self.state = fib(x) 
    def add_fib(self, x):
        return self.state+x

m = State(10)
ta.Column([1,2,3]).map(m.add_fib)

  data
------
    56
    57
    58
dtype: int64, count: 3, null_count: 0

## Vectorized user defined functions and transforms

Vectorized functions leak TorchArrow representation boundaries! So read the following with the big caveat that it can change quickly!

Vectorized functions get *n* strongly typed vectors as input and return *m* vectors as output. Validity handling is optional. The following assumes that all data is valid!

In [34]:

def conditional_add(x, y, out):
    for i, (a, e) in enumerate(zip(x, y)):
        if a > 0:
            out[i] = a + e
        else:
            out[i] = a

This code is well suited to vectorization via Numba. Leveraging Numba will only require adding some custom attributes. (TODO)
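
Independently of Numba, the calling convention can be tried with plain lists: the caller preallocates the output and the function fills it in place. The input values below are made up for illustration.

```python
def conditional_add(x, y, out):
    for i, (a, e) in enumerate(zip(x, y)):
        if a > 0:
            out[i] = a + e
        else:
            out[i] = a


x = [1, -2, 3, 0]
y = [10, 20, 30, 40]
out = [0] * len(x)  # the output buffer is allocated by the caller

conditional_add(x, y, out)
print(out)  # [11, -2, 33, 0]
```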

Vectorized functions can be applied using `transform`. We pass a list of data columns and return a typed list of data columns. 

In [35]:
# df = ta.transform(conditional_add, incols= ['a','b'], dtypes = [int64]) -- TODO
# df.head()

If you want to pass the underlying validity map in and/or out as well, you have to provide it in `incols` and the output dtypes, respectively. The input and output names are called name.data and name.validity, respectively. The dtype for a validity map is called nullable. So for the following transform, we pass all data and validity masks and return a validity vector as well.

In [36]:
# ta.transform(conditional_add_with_mask, incols = ['a.data','a.mask', 'b.data', 'b.mask'], dtype = [int64, nullable]) -- TODO

Assuming that nulls are handled as bitarrays of 64 bytes each, and that the return must be null if row a's value is > 0, then we can define it like so.
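
As background for working with packed validity data, the sketch below shows one common way to store validity flags as bits inside integer words. The 64-bit word size, the bit order, and the helpers `pack_validity` and `is_valid` are assumptions of this sketch, not TorchArrow's documented layout.

```python
WORD_BITS = 64  # assumed word size for this sketch


def pack_validity(flags):
    """Pack a sequence of 0/1 flags into a list of 64-bit integer words."""
    words = [0] * ((len(flags) + WORD_BITS - 1) // WORD_BITS)
    for i, flag in enumerate(flags):
        if flag:
            # Row i lives in word i // 64, at bit position i % 64.
            words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words


def is_valid(words, i):
    """Return True if the bit for row i is set."""
    return bool((words[i // WORD_BITS] >> (i % WORD_BITS)) & 1)


words = pack_validity([1, 1, 0, 1])
print(words)               # [11], i.e. binary 1011
print(is_valid(words, 2))  # False
print(is_valid(words, 3))  # True
```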

In [37]:
"End of tutorial"

'End of tutorial'