# Torcharrow: State handling -- Configs, Sessions, Multi-targeting and Tracing


Torcharrow has no global mutable state. It does have constant global state, and session state that is threaded implicitly through a pipeline. This enables not only configuration management but also multi-device targeting and tracing. This short doc explains the concepts and their use.

## Configs
Configs are just dictionaries wrapped in a class. They can be given in code or via a JSON file. The following config



In [1]:
import torcharrow as T
cfg = T.Config({'device': 'test', 'tracing': False, 'types_to_trace':[]})

defines the default target `device` to be `'test'`, sets `tracing` to false, and provides an empty list for `types_to_trace`. Many more settings can be given, but these three play a role in multi-device targeting and tracing; see below.
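
To make the "dictionary wrapped in a class" idea concrete, here is a minimal sketch in plain Python. It does not use Torcharrow: `SketchConfig` and its `from_json_file` helper are hypothetical names invented for illustration, and the real `T.Config` may load JSON differently.

```python
import json
import os
import tempfile


class SketchConfig:
    """Hypothetical stand-in for a config: a read-only wrapper around a dict."""

    def __init__(self, bindings):
        # Copy so later changes to the caller's dict do not leak into the config.
        self._bindings = dict(bindings)

    @classmethod
    def from_json_file(cls, path):
        # Hypothetical helper: read the same keys from a JSON file.
        with open(path) as fh:
            return cls(json.load(fh))

    def __getitem__(self, key):
        return self._bindings[key]


settings = {"device": "test", "tracing": False, "types_to_trace": []}

# Write the three settings to a temporary JSON file so they can be loaded back.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump(settings, fh)
    path = fh.name

cfg_from_code = SketchConfig(settings)
cfg_from_file = SketchConfig.from_json_file(path)
os.remove(path)
```

Both objects answer the same lookups, e.g. `cfg_from_file["device"]`, regardless of where the settings came from.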

## Sessions
 
Configs are passed to a session. A session maintains all state that is relevant for the execution of a single pipeline.


In [2]:
session = T.Session(cfg)

## Column and Dataframe Factories

Columns and dataframes are created with respect to a session. They inherit the session's default device (accessible under the property `to`). A session also guarantees unique object identifiers (here called `id`). We will see later that sessions also keep `trace` state.

In [3]:
c = session.Column([1,2,3])
d = session.DataFrame({'a': [1,2,3], 'b' : ['a','b','c']})
f"Column c: {list(c)}, its id: {c.id}, its device: {c.to} || DataFrame d: {list(d)}, its id: {d.id}, its device: {d.to})"

"Column c: [1, 2, 3], its id: c0, its device: test || DataFrame d: [(1, 'a'), (2, 'b'), (3, 'c')], its id: c3, its device: test)"

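How a session can hand out unique ids and a default device is sketched below in plain Python. This is an illustrative model, not Torcharrow's implementation: `SketchSession`, `SketchColumn` and `next_id` are hypothetical names, and the `c<N>` id format is only borrowed from the output above.

```python
import itertools


class SketchColumn:
    """Hypothetical column that remembers its session, device and id."""

    def __init__(self, session, data, to, id):
        self.session = session
        self.data = list(data)
        self.to = to
        self.id = id


class SketchSession:
    """Hypothetical session: owns the config and a per-session id counter."""

    def __init__(self, config):
        self.config = config
        self._counter = itertools.count()

    def next_id(self):
        # Ids are unique within one session because the counter only grows.
        return f"c{next(self._counter)}"

    def Column(self, data, to=None):
        # Without an explicit `to`, the column inherits the config's device.
        device = self.config["device"] if to is None else to
        return SketchColumn(self, data, device, self.next_id())


session_a = SketchSession({"device": "test"})
col_1 = session_a.Column([1, 2, 3])
col_2 = session_a.Column([4, 5], to="cpu")

# A second session starts its own counter, so state is never shared globally.
session_b = SketchSession({"device": "cpu"})
col_3 = session_b.Column([6])
```

Because the counter lives in the session rather than in a module-level variable, two sessions can run side by side without interfering with each other.
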
## Default config and session

Most programs don't have to worry about configs and sessions. They can either use the predefined `Session.default` or ignore even that and use the public constructors `Column` and `Frame`, which implicitly pick up the default session. So non-power users of Torcharrow can be completely unaware of configs, sessions, multi-device targeting, tracing, etc.


In [4]:
d = T.Column(['abc',None])
d.to

'test'
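
The pattern of public constructors falling back to a predefined session can be sketched as follows. This is a plain-Python illustration under assumptions: `MiniSession` is a hypothetical class, and columns are modeled as simple dictionaries rather than Torcharrow objects.

```python
class MiniSession:
    """Hypothetical stand-in for a session that only remembers a default device."""

    default = None  # filled in below, once the class exists

    def __init__(self, device):
        self.device = device

    def Column(self, data, to=None):
        # Fall back to the session's default device when `to` is not given.
        return {"data": list(data), "to": self.device if to is None else to}


# One predefined session that everything falls back to.
MiniSession.default = MiniSession(device="test")


def Column(data, to=None):
    # Public constructor: takes no session argument and uses the default one.
    return MiniSession.default.Column(data, to=to)


implicit = Column(["abc", None])
explicit = MiniSession.default.Column(["abc", None])
```

Both calls produce the same result; the module-level `Column` merely hides the session from users who do not need it.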

## Multi-device targeting

Torcharrow supports multi-device targeting, i.e., columns and dataframes can reside in different memory (which we also call a device). Currently we support 3 configurations:

- test, which means columns and dataframes are backed by NumPy,
- cpu, which means columns and dataframes are backed by Velox,
- gpu, which means columns and dataframes are backed by CuPy (i.e. GPU memory).

The user controls the assignment in 3 ways:

- the default assignment is done via the config's `device` parameter. The current device default is `test`. 
- the `to` parameter of the `Column` or `(Data)Frame` factory method. If `to` is None, the data is allocated at the default device; otherwise it is created at the specified device.
- the `move_to` instance method call defined on the base class `AbstractColumn`. The method moves the column/frame to the designated device. 

Torcharrow requires that
- all columns of a dataframe created on a particular device are created on that same device.
- applying an operation to a column or dataframe results in a column or dataframe on the same device.
- if an operation requires several columns/frames as input, all of them are on the same device.

Let's see this in practice. First we create a dataframe and inspect the `to` device of the dataframe and its columns.


In [5]:
e = T.Frame({'a': [1.0, None], 'b': ['a', 'c']})
f = e['a'] > 12
(e.to, e['a'].to, e['b'].to, f.to ) 

('test', 'test', 'test', 'test')

Alternatively we could have created a column/frame on a particular device:

In [6]:
g = T.Column([1.0, None], to = 'cpu')
g.to

'cpu'

To add `e['a']` to `g` we have to bring both columns to the same device. Let's say it is `cpu`. Then the addition returns a new column on `cpu`.

In [7]:
h = e['a'].move_to('cpu') + g
h.to

'cpu'

The system raises a TypeError if two columns to add reside on different devices.

In [8]:
x = T.Column([1], to = 'cpu') 
y = T.Column([1], to = 'test')
try:
    z = x+y
except TypeError as err:  # `err`, so the dataframe named `e` is not clobbered
    print(f"error: {err}")


error: self and other must have same device.
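
The device rules from this section can be modeled in a few lines of plain Python. The sketch below is an assumption-laden toy, not Torcharrow code: `DeviceColumn` is a hypothetical class, and it assumes that `move_to` returns a new column rather than changing the original in place.

```python
class DeviceColumn:
    """Hypothetical column that carries a device tag alongside its data."""

    def __init__(self, data, to="test"):
        self.data = list(data)
        self.to = to

    def move_to(self, device):
        # Assumption: moving returns a copy on the requested device.
        return DeviceColumn(self.data, to=device)

    def __add__(self, other):
        # Binary operations refuse to mix devices.
        if self.to != other.to:
            raise TypeError("self and other must have same device")
        # The result stays on the device of its inputs.
        summed = [a + b for a, b in zip(self.data, other.data)]
        return DeviceColumn(summed, to=self.to)


x = DeviceColumn([1, 2], to="cpu")
y = DeviceColumn([10, 20], to="test")

try:
    x + y
    mixed_error = None
except TypeError as err:
    mixed_error = str(err)

# Moving `y` first makes the addition legal; the result lives on `cpu`.
z = x + y.move_to("cpu")
```

Checking the device at the start of every binary operation keeps the rule local: no operation ever has to decide on which device a mixed result should live.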


## Tracing


Torcharrow programs are executed eagerly -- that is, every expression is evaluated bottom up and statements are executed one after another. While this is fast and allows developers to debug programs easily, it doesn't allow inspecting the executed code for analysis, optimization or platform retargeting.

To get the best of both worlds, fast execution and ease of analysis, Torcharrow introduces tracing. To create a Torcharrow trace, author a new config in which you set `tracing` to True and provide the types of the classes that you want to trace. For Torcharrow, the types to trace should always include `AbstractColumn` and `GroupedDataFrame`.

In [9]:
types = [T.Session, T.AbstractColumn, T.GroupedDataFrame]
cfg2 = T.Config({'device': 'test', 'tracing': True, 'types_to_trace': types})


Next we run the program unchanged. To see what happens, we print out the resulting dataframe; each column has its own object id.

In [10]:
from torcharrow import me

In [11]:
ts = T.Session(cfg2)
d0 = ts.DataFrame(dtype=T.Struct([T.Field(i, T.int64) for i in ['a', 'b', 'c']]))
d1 = d0.select('*', e=me['a'] + me['b'])
str(d1)

"self._fromdata({'a':Column([], id = c0), 'b':Column([], id = c1), 'c':Column([], id = c2), 'e':Column([], id = c4), id = c5})"

A faithful trace should have captured this execution and be able to replay it with the same results. Let's see whether that's the case:

The generated `trace` is accessible via the session object (here `ts`). The trace has two components:
-  `statements` returns a list of assignments where each
   - right-hand side is an operation of the types to trace
   - left-hand side is named after the object id that is created by the right-hand side
- `result` returns the name of the variable that was last assigned.

In [12]:
d1_result = ts.trace.result()
d1_stms = ts.trace.statements()
(d1_result, d1_stms)

('c5',
 ["c3 = Session.DataFrame(s0, dtype=Struct([Field('a', int64), Field('b', int64), Field('c', int64)]))",
  "c5 = DataFrame.select(c3, '*', e=me.__getitem__('a').__add__(me.__getitem__('b')))"])

The right-hand side of each statement is a fully resolved and type-checked expression in normal form; see e.g. the assignment to `c5`. Arguments to all expressions are Python values or references to variables introduced earlier.

What can we do with such a trace? We can
 * analyze it for type correctness or for privacy flows
 * optimize and rewrite it
 * capture it, ship it to another machine and re-execute with or without data. 
 
Here we just replay the trace using Python's `exec` and `eval` (TODO: Use fully qualified names everywhere so that the below import can be dropped).

In [13]:

from torcharrow import Session, Struct, Field, int64, DataFrame, NumericalColumn, me
# execute the statements
s0 = Session(cfg)
for stm in d1_stms:
    exec(stm)
#eval the result
str(eval(d1_result))

"self._fromdata({'a':Column([], id = c0), 'b':Column([], id = c1), 'c':Column([], id = c2), 'e':Column([], id = c4), id = c5})"

We see that `d1` and `eval(d1_result)` are structurally exactly the same, including their object ids. Thus the trace preserved 100% of the original semantics. 
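
The record-then-replay idea can also be sketched without Torcharrow. The toy tracer below is a simplified illustration, not the library's mechanism: `TraceSketch`, `Traced`, `make_list` and `scale` are hypothetical names, and it traces plain functions instead of the methods of the types listed in `types_to_trace`.

```python
import itertools


class Traced:
    """Result of a traced call: the eager value plus its name in the trace."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


class TraceSketch:
    """Hypothetical tracer: runs calls eagerly and records them as statements."""

    def __init__(self):
        self._statements = []
        self._counter = itertools.count()

    def trace(self, fn):
        def wrapper(*args):
            # Eager execution: compute the real value right away.
            value = fn(*[a.value if isinstance(a, Traced) else a for a in args])
            # Arguments are rendered as earlier variable names or Python literals.
            rendered = ", ".join(
                a.name if isinstance(a, Traced) else repr(a) for a in args
            )
            name = f"v{next(self._counter)}"
            self._statements.append(f"{name} = {fn.__name__}({rendered})")
            return Traced(name, value)

        return wrapper

    def statements(self):
        return list(self._statements)

    def result(self):
        # Name of the variable that was assigned last.
        return self._statements[-1].split(" = ", 1)[0]


def make_list(n):
    return list(range(n))


def scale(xs, k):
    return [x * k for x in xs]


recorder = TraceSketch()
traced_make_list = recorder.trace(make_list)
traced_scale = recorder.trace(scale)

# The original run: executed eagerly, recorded on the side.
eager = traced_scale(traced_make_list(3), 10)

# The replay: run the recorded statements against the untraced functions.
namespace = {"make_list": make_list, "scale": scale}
for stm in recorder.statements():
    exec(stm, namespace)
replayed = eval(recorder.result(), namespace)
```

As in the Torcharrow trace, every statement refers only to literals and to variables assigned by earlier statements, which is what makes the list replayable with `exec` in order.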