# Data Object

To feed observational data to the causal discovery algorithms in our API, the raw data (one or more NumPy arrays and, optionally, a list of variable names) is used to instantiate a CausalAI data object. Note that any data transformation must be applied to the NumPy array before instantiating a data object. Time series data must be wrapped in $\texttt{TimeSeriesData}$, and tabular data in $\texttt{TabularData}$.

In [3]:
import numpy as np
import math
import matplotlib
from matplotlib import pyplot as plt   
import csv
import pandas as pd

## Time Series Data

Let's begin by importing the modules:

In [1]:
from causalai.data.time_series import TimeSeriesData
from causalai.data.transforms.time_series import StandardizeTransform, DifferenceTransform

We will now create a random NumPy array, define a data object using our time series data class, and look at its important attributes and methods. Suppose our time series has length 100 and contains 2 variables.

In [4]:
data_array = np.random.random((100, 2))

data_obj = TimeSeriesData(data_array)
print(f'This time series object has length {data_obj.length}')
print(f'This time series object has dimensions {data_obj.dim}')
print(f'This time series object has variables with names {data_obj.var_names}')

This time series object has length [100]
This time series object has dimensions 2
This time series object has variables with names [0, 1]


There are a few things to notice:
1. We assume that both variables are sampled at the same temporal rate (i.e., the same temporal resolution). We currently do not support time series in which different variables have different temporal resolutions.
2. Since we did not define any variable names, the variables are named by their index values by default.
3. The data object's length is returned as a list. We discuss this below under Multi-Data Object.

Alternatively, we can define variable names by passing them to the data object constructor as follows:

In [5]:
data_array = np.random.random((100, 2))
var_names = ['A', 'B']

data_obj = TimeSeriesData(data_array, var_names=var_names)
print(f'This time series object has length {data_obj.length}')
print(f'This time series object has dimensions {data_obj.dim}')
print(f'This time series object has variables with names {data_obj.var_names}')

This time series object has length [100]
This time series object has dimensions 2
This time series object has variables with names ['A', 'B']


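Finally, the data array can be retrieved from the data_arrays attribute. It holds one array per time series, so a single array is unpacked with a trailing comma: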

In [6]:
data_array_ret, = data_obj.data_arrays

print('\nRetrieving data array from the data object and making sure they are exactly the same:')
assert (data_array_ret==data_array).all()
print(data_array.shape)
print(data_array_ret.shape)


Retrieving data array from the data object and making sure they are exactly the same:
(100, 2)
(100, 2)


### Multi-Data Object

In the time series case, we may have multiple disjoint time series for the same dataset. For instance, the first time series covers January to March, and the second covers July to September. In this case, concatenating the two time series would be incorrect, since it would treat the end of March and the start of July as consecutive time steps.

To support such use cases in our library, one can pass multiple NumPy arrays to the data object constructor as follows:

In [7]:
data_array1 = np.random.random((100, 2))
data_array2 = np.random.random((24, 2))
var_names = ['A', 'B']

data_obj = TimeSeriesData(data_array1, data_array2, var_names=var_names)
print(f'This time series object has length {data_obj.length}')
print(f'This time series object has dimensions {data_obj.dim}')
print(f'This time series object has variables with names {data_obj.var_names}')

print('\nRetrieving data array from the data object and making sure they are exactly the same:')
data_array1_ret,data_array2_ret = data_obj.data_arrays
assert (data_array1_ret==data_array1).all()
assert (data_array2_ret==data_array2).all()
print(data_array1.shape, data_array2.shape)
print(data_array1_ret.shape, data_array2_ret.shape)

This time series object has length [100, 24]
This time series object has dimensions 2
This time series object has variables with names ['A', 'B']

Retrieving data array from the data object and making sure they are exactly the same:
(100, 2) (24, 2)
(100, 2) (24, 2)


It should now be apparent that the data object length is returned as a list so that one can retrieve the length of each individual time series.

As a side note, all arrays must have the same number of dimensions (variables); otherwise the object constructor will throw an error.
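
The two points above (per-series lengths and a shared number of variables) can be sketched in plain NumPy. The helper `summarize_segments` below is hypothetical and not part of CausalAI; it only illustrates the kind of check involved, and why windowing each series separately differs from windowing a concatenated array.

```python
import numpy as np

def summarize_segments(*arrays, max_lag=0):
    """Hypothetical helper (not part of CausalAI): validate disjoint segments.

    Returns the per-segment lengths, the shared number of variables, and the
    number of usable time steps when each segment is windowed separately.
    """
    if not arrays:
        raise ValueError("at least one array is required")
    n_vars = {a.shape[1] for a in arrays}
    if len(n_vars) != 1:
        raise ValueError(f"arrays disagree on the number of variables: {sorted(n_vars)}")
    lengths = [a.shape[0] for a in arrays]
    # A time step is usable only if max_lag earlier steps exist in the same segment.
    usable = sum(max(n - max_lag, 0) for n in lengths)
    return lengths, n_vars.pop(), usable

seg1 = np.random.random((100, 2))
seg2 = np.random.random((24, 2))

lengths, dim, usable = summarize_segments(seg1, seg2, max_lag=3)

# Windowing the concatenated array instead would also count windows that
# straddle the gap between the two segments.
concatenated = np.concatenate([seg1, seg2]).shape[0] - 3
spurious = concatenated - usable
print(lengths, dim, usable, spurious)
```

With a maximum lag of 3, the concatenated array yields three extra windows, each mixing observations from the end of the first series with the start of the second.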

### Data Object Methods

We list 2 data object methods that may be useful for users. They are:
1. var_name2index: This method takes a variable name as input and returns the index of that variable.
2. extract_array: This method extracts the arrays corresponding to the node names X, Y, Z, which are provided as inputs. X and Y are individual nodes, and Z is the set of nodes to be used as the conditioning set. More explanation below.

First we show below the usage of var_name2index:

In [8]:
print(f"The index of variable B is {data_obj.var_name2index('B')}")

The index of variable B is 1
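
A common reason to look up an index is to slice the corresponding column out of the raw array. The sketch below mimics that lookup with a plain Python list; `name_to_index` is a hypothetical stand-in, not the library's implementation.

```python
import numpy as np

var_names = ['A', 'B']
data_array = np.random.random((100, 2))

def name_to_index(names, name):
    """Hypothetical stand-in for a name-to-index lookup (not the library code)."""
    if name not in names:
        raise ValueError(f"unknown variable name: {name!r}")
    return names.index(name)

index_of_b = name_to_index(var_names, 'B')
column_b = data_array[:, index_of_b]   # all observations of variable B
print(index_of_b, column_b.shape)
```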


To understand the purpose of the extract_array method, note that a typical operation in causal discovery is the conditional independence (CI) test: conditioned on some set of variables Z, we test whether two variables X and Y are independent.

To perform these CI tests, a convenient approach is to list the variables X, Y and the set Z by name and relative time index, and then define a function which returns all the instances of the corresponding variable values. For instance, in the example below, we are interested in performing a CI test between variables X=(B,t) and Y=(A,t-2) conditioned on the variable set Z=[(A, t-1), (B, t-2)], over all the values of t in the given time series dataset. We follow these naming conventions:
1. X is the variable B at the current time t. Since it is always t, we drop the time index and simply pass the variable name string.
2. Y is the variable A at time step t-2, i.e., 2 steps before X. We drop the character t and specify this choice as (A,-2).
3. Each time-indexed variable inside the list Z follows the same naming convention as specified above for Y.

In [9]:
data_array = np.random.random((5, 2))
var_names = ['A', 'B']
data_obj = TimeSeriesData(data_array, var_names=var_names)

X = 'B'
Y = ('A', -2)
Z = [('A', -1), ('B', -2)]

x,y,z = data_obj.extract_array(X,Y,Z, max_lag=3)


To understand the outputs x, y, z above, we print below the time series and these outputs, with each element labeled with its respective variable name and time index.

In [10]:

data_array = data_obj.data_arrays[0]
T=data_array.shape[0]
print('data_array = [')
for i in range(data_array.shape[0]):
    print(f'[A(t-{T-i-1}): {data_array[i][0]:.2f}, B(t-{T-i-1}): {data_array[i][1]:.2f}],')
print(']')



T=x.shape[0]
print(f'\nX = {X}\nx = [')
for i in range(x.shape[0]):
    print(f'[{X}(t-{T-i-1}): {x[i]:.2f}],')
print(']')

print(f'\nY = {Y}\ny = [')
for i in range(x.shape[0]):
    print(f'[{Y[0]}(t-{T-i-1-Y[1]}): {y[i]:.2f}],')
print(']')

print(f'\nZ = {Z}\nz = [')
for i in range(x.shape[0]):
    print(f'[{Z[0][0]}(t-{T-i-1-Z[0][1]}): {z[i][0]:.2f}, {Z[1][0]}(t-{T-i-1-Z[1][1]}): {z[i][1]:.2f}],')
print(']')

data_array = [
[A(t-4): 0.68, B(t-4): 0.26],
[A(t-3): 0.40, B(t-3): 0.08],
[A(t-2): 0.48, B(t-2): 0.49],
[A(t-1): 0.23, B(t-1): 0.31],
[A(t-0): 0.78, B(t-0): 0.06],
]

X = B
x = [
[B(t-1): 0.31],
[B(t-0): 0.06],
]

Y = ('A', -2)
y = [
[A(t-3): 0.40],
[A(t-2): 0.48],
]

Z = [('A', -1), ('B', -2)]
z = [
[A(t-2): 0.48, B(t-3): 0.08],
[A(t-1): 0.23, B(t-2): 0.49],
]


Notice that x, y and z have the same number of rows, and for any given row index, their values correspond to the variable names and relative time indices specified. These arrays can now be used to perform CI tests. Our causal discovery models use this method internally, but it can also be used directly if needed.
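
To illustrate how arrays shaped like x, y, z feed a CI test, here is a minimal sketch of one simple statistic, the partial correlation. It is an assumption chosen for illustration and not necessarily the test used inside the library. The function `partial_correlation` and the synthetic data are hypothetical; x and y are taken to be 1-D arrays and z a 2-D array with one column per conditioning variable.

```python
import numpy as np

def partial_correlation(x, y, z):
    """Correlation between x and y after linearly regressing out z.

    x, y: 1-D arrays of length n; z: 2-D array of shape (n, k).
    One simple CI statistic, shown for illustration only.
    """
    design = np.column_stack([np.ones(len(x)), z])  # intercept + conditioning set
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

# Synthetic example: x and y are related only through z.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=(n, 1))
x = 2.0 * z[:, 0] + rng.normal(size=n)
y = -1.5 * z[:, 0] + rng.normal(size=n)

raw = np.corrcoef(x, y)[0, 1]
partial = partial_correlation(x, y, z)
print(f"raw correlation:     {raw:.3f}")
print(f"partial correlation: {partial:.3f}")
```

The raw correlation is strongly negative because both variables depend on z, while the partial correlation is close to zero, which is the pattern expected when X and Y are independent given Z under a linear model.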

On a final note, if the specified list Z contains nodes whose relative lag exceeds max_lag, they will be ignored. For instance, if Z contains ('A', -4) and max_lag=3, then this node will be removed from Z before the z array is computed.
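
The lag alignment and the max_lag filtering described above can be sketched in plain NumPy. The function `extract_lagged` is a hypothetical re-implementation written for illustration; it is not the library's code, and returning `None` for an empty conditioning set is an assumption of this sketch. The input values are the rounded numbers from the printed example above.

```python
import numpy as np

def extract_lagged(data, var_names, X, Y, Z, max_lag):
    """Hypothetical NumPy sketch of lagged extraction (illustration only).

    data: array of shape (T, n_vars); X: variable name at time t;
    Y: (name, lag) with lag <= 0; Z: list of (name, lag) pairs.
    """
    col = {name: i for i, name in enumerate(var_names)}
    T = data.shape[0]

    def lagged(name, lag):
        # Rows max_lag..T-1 are the "current" time steps t; shift back by |lag|.
        return data[max_lag + lag: T + lag, col[name]]

    # Drop conditioning nodes that reach further back than max_lag.
    Z = [(name, lag) for name, lag in Z if -lag <= max_lag]
    x = lagged(X, 0)
    y = lagged(*Y)
    z = np.column_stack([lagged(name, lag) for name, lag in Z]) if Z else None
    return x, y, z

data = np.array([
    [0.68, 0.26],
    [0.40, 0.08],
    [0.48, 0.49],
    [0.23, 0.31],
    [0.78, 0.06],
])

# ('A', -4) exceeds max_lag=3, so it is dropped from the conditioning set.
x, y, z = extract_lagged(data, ['A', 'B'], 'B', ('A', -2),
                         [('A', -1), ('B', -2), ('A', -4)], max_lag=3)
print(x)
print(y)
print(z)
```

Each output has T - max_lag = 2 rows, and the values reproduce the x, y, z rows printed earlier, with z keeping only the two columns whose lags fit within max_lag.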

## Tabular Data

The tabular data object behaves similarly to the time series object. The modules for the tabular case are as follows:

In [14]:
from causalai.data.tabular import TabularData
from causalai.data.transforms.tabular import StandardizeTransform

## Data Pre-processing

In [11]:
data_array = np.random.random((100, 2))

StandardizeTransform_ = StandardizeTransform()
StandardizeTransform_.fit(data_array)

data_train_trans = StandardizeTransform_.transform(data_array)


print(f'Dimension-wise mean of the original data array: {data_array.mean(0)}')
print(f'Dimension-wise mean of the transformed data array: {data_train_trans.mean(0)}.'\
      f'\nNotice that this is close to 0.')

print(f'\nDimension-wise standard deviation of the original data array: {data_array.std(0)}')
print(f'Dimension-wise standard deviation of the transformed data array: {data_train_trans.std(0)}.'\
      f' \nNotice that this is close to 1.')



Dimension-wise mean of the original data array: [0.53437735 0.54876852]
Dimension-wise mean of the transformed data array: [2.13717932e-16 1.03805853e-16].
Notice that this is close to 0.

Dimension-wise standard deviation of the original data array: [0.28210117 0.27132903]
Dimension-wise standard deviation of the transformed data array: [0.99999937 0.99999932]. 
Notice that this is close to 1.


The StandardizeTransform class automatically ignores NaNs in the array:

In [12]:
data_array = np.random.random((10, 2))
data_array[:2,0] = math.nan

StandardizeTransform_ = StandardizeTransform()
StandardizeTransform_.fit(data_array)

data_train_trans = StandardizeTransform_.transform(data_array)

print(f'Original Array: ')
print(data_array)

print(f'\nTransformed Array: ')
print(data_train_trans)

print('\nBelow we print the mean and standard deviation of the 0th column after ignoring the 1st 2 elements:')

print(f'\nDimension-wise mean of the original data array: {data_array[2:,0].mean(0)}')
print(f'Dimension-wise mean of the transformed data array: {data_train_trans[2:,0].mean(0)}.'\
      f'\nNotice that this is close to 0.')

print(f'\nDimension-wise standard deviation of the original data array: {data_array[2:,0].std(0)}')
print(f'Dimension-wise standard deviation of the transformed data array: {data_train_trans[2:,0].std(0)}.'\
      f' \nNotice that this is close to 1.')


Original Array: 
[[       nan 0.32367363]
 [       nan 0.45758606]
 [0.93602592 0.94666669]
 [0.10544216 0.56649038]
 [0.5082098  0.97646683]
 [0.02717121 0.09088905]
 [0.22236171 0.90320235]
 [0.3967986  0.88517427]
 [0.89512793 0.60708946]
 [0.64137796 0.54478578]]

Transformed Array: 
[[        nan -1.09328675]
 [        nan -0.61566548]
 [ 1.46249195  1.12872312]
 [-1.12498759 -0.22723985]
 [ 0.12973597  1.23501033]
 [-1.36882152 -1.92355227]
 [-0.7607535   0.9737002 ]
 [-0.21733824  0.90940004]
 [ 1.33508431 -0.08243639]
 [ 0.54458863 -0.30465295]]

Below we print the mean and standard deviation of the 0th column after ignoring the 1st 2 elements:

Dimension-wise mean of the original data array: 0.46656441039094176
Dimension-wise mean of the transformed data array: -2.7755575615628914e-17.
Notice that this is close to 0.

Dimension-wise standard deviation of the original data array: 0.32100093752872744
Dimension-wise standard deviation of the transformed data array: 0.999999514759

On a final note, the causal discovery algorithms automatically handle NaN instances internally as well.
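
For intuition, NaN-aware standardization like the one shown above can be sketched in plain NumPy with `np.nanmean` and `np.nanstd`. The function `nan_standardize` is hypothetical and is not the CausalAI implementation, whose details (for example, any numerical-stability constant) may differ.

```python
import numpy as np

def nan_standardize(data):
    """Standardize each column using statistics computed over non-NaN entries.

    Illustrative sketch only; not the CausalAI implementation.
    """
    mean = np.nanmean(data, axis=0)
    std = np.nanstd(data, axis=0)
    return (data - mean) / std

rng = np.random.default_rng(0)
data = rng.random((10, 2))
data[:2, 0] = np.nan

transformed = nan_standardize(data)
print(np.isnan(transformed[:2, 0]).all())   # NaN entries stay NaN
print(np.nanmean(transformed, axis=0))      # close to 0
print(np.nanstd(transformed, axis=0))       # close to 1
```

The missing entries are left in place rather than imputed, so downstream code still sees which observations were unavailable.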