# Pandas Series and DataFrame testing

Pandas provides the Series object, a one-dimensional labelled array. With a `MultiIndex`, a Series can hold data indexed along several dimensions. Several Series can be combined as the columns of a DataFrame object, which acts like a table over this multi-dimensional index.

Currently, the goal is to upgrade from a two-dimensional DataFrame of (dates, samples) to an optional three-dimensional (dates, samples, groups) model. Only some of the Series may carry this extra dimension, and some may contain data for only a subset of the groups.

The goal is therefore to find a way to combine these Series into a DataFrame safely, without losing data.
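To make the two shapes concrete, here is a minimal sketch of a two-level and a three-level Series. The dates, sample count, group labels and series names are made up for illustration and are not taken from the real data.

```python
import pandas as pd

times = pd.date_range('2020-01-01', '2020-02-01', freq='MS')  # two month starts

# Two-dimensional index: (time, samples)
idx_2d = pd.MultiIndex.from_product([times, range(2)], names=['time', 'samples'])
s_2d = pd.Series(range(len(idx_2d)), index=idx_2d, name='NO_GROUPS')

# Three-dimensional index: (time, samples, group)
idx_3d = pd.MultiIndex.from_product([times, range(2), ['A', 'B']],
                                    names=['time', 'samples', 'group'])
s_3d = pd.Series(range(len(idx_3d)), index=idx_3d, name='WITH_GROUPS')

# The number of index levels tells the two shapes apart.
print(s_2d.index.nlevels, s_3d.index.nlevels)
print('group' in s_2d.index.names, 'group' in s_3d.index.names)
```

Checking `'group' in s.index.names` is the same test the notebook uses later to decide whether a series carries the extra dimension.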

In [1]:
import os
import sys

def to_base_cwd():
    os.chdir(os.path.join(os.getcwd(), '../..'))

os.getcwd()

'/home/james/eam-core-provenance/docs/ipynb'

In [2]:
#to_base_cwd()
os.getcwd()

'/home/james/eam-core-provenance/docs/ipynb'

In [3]:
import pandas as pd 
import pint
import numpy as np
import pickle

## Creating Series

First, the series data must be loaded or generated.

- When loading series, the data comes from the calculation traces pickled by `store_dataframe()` in `util.py`.

- When generating a series, we can introduce data that models the expected edge cases: series without a group (country) dimension, or with only a subset of the groups.
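As a rough illustration of the loading path, the sketch below pickles a small MultiIndex Series and reads it back with `pd.read_pickle`. It does not reproduce what `store_dataframe()` does; the file name, values and temporary directory are assumptions made only for this example.

```python
import os
import tempfile

import pandas as pd

index = pd.MultiIndex.from_product(
    [pd.date_range('2020-01-01', periods=2, freq='MS'), range(2)],
    names=['time', 'samples'])
original = pd.Series([0.1, 0.2, 0.3, 0.4], index=index, name='Example')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Example.pickle')  # hypothetical file name
    original.to_pickle(path)          # write the series, index included
    restored = pd.read_pickle(path)   # same call the notebook uses to load traces

# The index levels and the series name survive the round trip.
print(restored.equals(original))
print(list(restored.index.names), restored.name)
```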

In [4]:
def load_series_from_files():
    # Names of the pickled calculation traces stored under pickle_data/
    names = ['Datacentres', 'Cellular', 'Fixed Line', 'CDN',
             'Modem Router', 'End User Device']
    return [pd.read_pickle(f'pickle_data/{name}.pickle') for name in names]

def generate_series_list():
    times = pd.date_range('2020-01-01', '2020-3-01', freq='MS')
    sample_size = 3
    groups_full = ['A', 'B', 'C']
    
    series_length = len(times) * sample_size * len(groups_full)
    index_names_full = ['time', 'samples', 'group']

    
    iterables = [times, range(sample_size), groups_full]
    df_multi_index = pd.MultiIndex.from_product(iterables, names=index_names_full)
    s1 = pd.Series(data=range(series_length), index=df_multi_index, name='FULL_A')
    
    iterables = [times, range(sample_size), groups_full[:2]]
    df_multi_index = pd.MultiIndex.from_product(iterables, names=index_names_full)
    s2 = pd.Series(data=range(int(series_length * 2/3)), index=df_multi_index, name='TWO_COUNTRIES')
    
    iterables = [times, range(sample_size)]
    df_multi_index = pd.MultiIndex.from_product(iterables, names=index_names_full[:2])
    s3 = pd.Series(data=range(int(series_length / 3)), index=df_multi_index, name='NO_GROUPS_A')  
    
    iterables = [times, range(sample_size), groups_full[:1]]
    df_multi_index = pd.MultiIndex.from_product(iterables, names=index_names_full)
    s4 = pd.Series(data=range(int(series_length * 1/3)), index=df_multi_index, name='ONE_COUNTRY')
    
    iterables = [times, range(sample_size)]
    df_multi_index = pd.MultiIndex.from_product(iterables, names=index_names_full[:2])
    s5 = pd.Series(data=range(int(series_length / 3)), index=df_multi_index, name='NO_GROUPS_B')
    
    return [s1,s2,s3,s4,s5]

def get_generated_series_indexing():
    times = pd.date_range('2020-01-01', '2020-3-01', freq='MS')
    sample_size = 3
    groups = ['A', 'B', 'C']
    
    iterables = [times, range(sample_size), groups]
    
    return iterables, times

## Interacting with Series

Here, series are loaded as a list.

When series are combined one by one into an empty DataFrame, each series must have the same or lower dimensionality than the ones added before it; otherwise an error is thrown.

So it's fine if the series are ordered by decreasing dimensionality, but sorting them is time-consuming and there ought to be a better way.

The solution is to generate the dataframe with a defined multi-index of the maximum possible dimensions and groups. This way, every series is a non-strict subset of the DataFrame, and errors are avoided.
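A minimal sketch of this idea follows. The helper `build_full_index` is hypothetical (it is not part of `eam_core`) and the example data is made up. It collects every value seen in each named level and builds the product index from them; series are then assigned into a DataFrame that already has that index. The example only uses series that share all three levels.

```python
import pandas as pd

def build_full_index(series_list, level_names):
    """Hypothetical helper: product index over every value seen in each level."""
    level_values = []
    for name in level_names:
        # Unique values of this level from each series that has the level
        parts = [s.index.get_level_values(name).unique()
                 for s in series_list if name in s.index.names]
        combined = parts[0]
        for part in parts[1:]:
            combined = combined.union(part)
        level_values.append(combined)
    return pd.MultiIndex.from_product(level_values, names=level_names)

times = pd.date_range('2020-01-01', '2020-02-01', freq='MS')
names = ['time', 'samples', 'group']
full = pd.Series(
    range(8),
    index=pd.MultiIndex.from_product([times, range(2), ['A', 'B']], names=names),
    name='FULL')
partial = pd.Series(
    range(4),
    index=pd.MultiIndex.from_product([times, range(2), ['A']], names=names),
    name='ONE_GROUP')

# The partial series is deliberately added first: each column is aligned
# to the predefined index, so insertion order does not matter here.
df = pd.DataFrame(index=build_full_index([partial, full], names))
for s in [partial, full]:
    df[s.name] = s

print(df)
```

Because the index is fixed before any column is added, each assignment is aligned to the same set of rows, and rows a series does not cover are left as `NaN`.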

### How series are merged

It is worth noting _how_ pandas performs the merge.

- One case is merging a series with the same dimensionality but only a subset of the values in one index level, for example one or two groups (countries) out of a possible three. Here, the missing data is filled in with `NaN` (and, presumably as a result, the other values are interpreted as floats). TODO: check what effect this has on Pint value arrays.

- The other case is when a dimension is missing entirely, for example if a series without group data is merged. In this case, the data is duplicated across the new dimension.
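The two cases can be reproduced on made-up data, as in the sketch below. The first case relies on ordinary column assignment. For the second case the sketch broadcasts the series explicitly with `reindex` before assigning it, which is one way to obtain the duplicated values described above; it is an assumption for illustration, not a statement about what pandas does internally.

```python
import pandas as pd

times = pd.date_range('2020-01-01', '2020-02-01', freq='MS')
names = ['time', 'samples', 'group']
full_index = pd.MultiIndex.from_product([times, range(2), ['A', 'B']], names=names)
df = pd.DataFrame(index=full_index)

# Case 1: same levels, but only group 'A' is present.
idx_a = pd.MultiIndex.from_product([times, range(2), ['A']], names=names)
one_group = pd.Series(range(4), index=idx_a, name='ONE_GROUP')
df['ONE_GROUP'] = one_group  # rows for group 'B' have no match and become NaN

# Case 2: no 'group' level at all. Look up each (time, samples) pair of the
# full index in the smaller series, then relabel the result with the full index.
idx_2d = pd.MultiIndex.from_product([times, range(2)], names=names[:2])
no_groups = pd.Series(range(4), index=idx_2d, name='NO_GROUPS')
broadcast = no_groups.reindex(full_index.droplevel('group'))
broadcast.index = full_index
df['NO_GROUPS'] = broadcast

# Integer input becomes float once NaN has to be represented.
print(one_group.dtype, df['ONE_GROUP'].dtype)
print(df)
```

Comparing the two printed dtypes is a quick way to confirm the float conversion mentioned in the first bullet before testing the same thing with Pint arrays.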

In [5]:
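# Choose one data source. The pickled traces are used here, so the
# generated list is overwritten by the second assignment.
series = generate_series_list()
series = load_series_from_files()
len(series)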

6

In [6]:
for s in series:
    #print(s)
    continue

In [7]:
#cdn = series[3]
#print(cdn.index.get_level_values('group').values)
#cdn

In [8]:
df_blank = pd.DataFrame()
for i in range(len(series)):
    print(series[i].shape)
    #df_blank[str(i)] = series[len(series)-i-1]
    #df_blank[str(i)] = series[i]
df_blank

(1,)
(2,)
(2,)
(2,)
(2,)
(2,)


In [13]:
times = pd.date_range('2020-01-01', '2020-01-01', freq='MS')
sample_size = 1
groups = ['A', 'B']
iterables = [times, range(sample_size), groups]
index_names = ['time', 'samples', 'group']

#don't call this if using pickle data
#iterables, times = get_generated_series_indexing()

df_multi_index = pd.MultiIndex.from_product(iterables, names=index_names)
df_index = pd.DataFrame(index=df_multi_index)

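for i in range(len(series)):
    print(series[i].shape)
    if 'group' in series[i].index.names:
        print(set(series[i].index.get_level_values('group').values))
    #df_index[str(i)] = series[len(series)-i-1]
    # The fallback name was left incomplete in the original cell; using the
    # positional label str(i) is an assumption based on the commented-out line above.
    if series[i].name is None:
        series[i].name = str(i)
    df_index[series[i].name] = series[i]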
    
# this demonstrates that different shaped series can be merged into a dataframe
df_index

(1,)
(2,)
{'A', 'B'}
(2,)
{'A', 'B'}
(2,)
{'A', 'B'}
(2,)
{'A', 'B'}
(2,)
{'A', 'B'}


| time | samples | group | None |
|---|---|---|---|
| 2020-01-01 | 0 | A | 0.03 |
| 2020-01-01 | 0 | B | 0.06 |


In [10]:
from eam_core import util
data, metadata = util.h5load('pickle_data/result_data_carbon.hdf5')
data.head(5)

[root][INFO] Configured logging from /home/james/eam-core-provenance/src/eam_core/logconf.yml (log_configuration.py:30)
[numexpr.utils][INFO] NumExpr defaulting to 8 threads. (utils.py:157)


| time | samples | group | CDN (megametric_ton) | Internet Network (megametric_ton) | Laptop (megametric_ton) |
|---|---|---|---|---|---|
| 2019-01-01 | 0 | A | 2.5e-11 | 2.5e-11 | 2e-07 |
| 2019-01-01 | 0 | B | 2.5e-11 | 2.5e-11 | 2e-07 |
| 2019-02-01 | 0 | A | 2.5e-11 | 2.5e-11 | 2e-07 |
| 2019-02-01 | 0 | B | 2.5e-11 | 2.5e-11 | 2e-07 |
| 2019-03-01 | 0 | A | 2.5e-11 | 2.5e-11 | 2e-07 |
