# How to access the data
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/remifan/LabPtPTm2/HEAD?filepath=examples%2Fbasics.ipynb)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/remifan/LabPtPTm2/blob/master/examples/basics.ipynb)

(Binder is better for the interactive content of this notebook)

In [1]:
import labptptm2



## Convenience API

### Data selection

A typical data selection looks like this:

In [2]:
dat_grp, sup_grp = labptptm2.select(1, 0, 4, 2)

The 4 input arguments of `select` identify a collected data file:
- arg#1: int, random source sequence identifier, either 1 or 2
- arg#2: int, launched power in dBm, a member of [-5, -4, -3, -2, -1, 0, 1, 2, 3]
- arg#3: int, channel index, a member of [1, 2, 3, 4, 5, 6, 7]
- arg#4: int, index of the scope capture under the same link configuration, a member of [1, 2, 3]

`select` returns 2 objects: a `list` of data groups and a `list` of supplementary data groups.

Each data group contains received samples (resampled to 2 samples/symbol), synchronized sent symbols, and attributes. Each supplementary data group contains auxiliary information such as the estimated frequency offset and chromatic dispersion.
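The argument ranges listed above can be checked before a specification is passed to `select`. The following is a minimal sketch; `check_spec` is a hypothetical helper, not part of `labptptm2`, and it only handles single integer values, not the list form.

```python
# Allowed values, copied from the argument list above.
SRC_IDS = (1, 2)
POWERS_DBM = tuple(range(-5, 4))   # -5 ... 3
CHANNELS = tuple(range(1, 8))      # 1 ... 7
CAPTURES = (1, 2, 3)


def check_spec(src, lp_dbm, ch, cap):
    """Hypothetical helper: raise ValueError if a specification is out of range."""
    allowed = {
        "src": (src, SRC_IDS),
        "lp_dbm": (lp_dbm, POWERS_DBM),
        "ch": (ch, CHANNELS),
        "cap": (cap, CAPTURES),
    }
    for name, (value, choices) in allowed.items():
        if value not in choices:
            raise ValueError(f"{name}={value!r} is not one of {list(choices)}")
    return src, lp_dbm, ch, cap


spec = check_spec(1, 0, 4, 2)   # the specification used in the example above
print(spec)
```

The validated tuple can then be unpacked into the call, for example `labptptm2.select(*spec)`.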

Multi-selection is supported by passing lists in place of single values:
``` python
dat_grp, sup_grp = labptptm2.select(1, [0, 1], [4, 7], 2)
```

Since the `select(1, 0, 4, 2)` call above passed only a single specification, the returned list has only 1 group:

In [3]:
dat_grp[0].info

0,1
Name,/1125km_SSMF/src1/0dBm_ch4_2
Type,zarr.hierarchy.Group
Read-only,False
Store type,zarr.storage.ConsolidatedMetadataStore
Chunk store type,zarr.storage.FSStore
No. members,2
No. arrays,2
No. groups,0
Arrays,"recv, sent"


The data group contains 2 arrays named 'recv' and 'sent'.
We can inspect an array's shape, data type and storage details:

In [4]:
dat_grp[0]['recv'].info  # you don't have to understand all of this information

0,1
Name,/1125km_SSMF/src1/0dBm_ch4_2/recv
Type,zarr.core.Array
Data type,complex64
Shape,"(4500000, 2)"
Chunk shape,"(140625, 1)"
Order,C
Read-only,False
Compressor,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)"
Store type,zarr.storage.ConsolidatedMetadataStore
Chunk store type,zarr.storage.FSStore


The attributes form a `dict` that contains information about the experiment:

In [5]:
dict(dat_grp[0].attrs)

{'baudrate': 36000000000.0,
 'channelindex': 4,
 'channels': 7,
 'distance': 1125000.0,
 'lpdbm': 0.0,
 'lpw': 0.001,
 'modformat': '16QAM',
 'polmux': 1,
 'samplerate': 72000000000.0,
 'spans': 15,
 'srcid': 'src1'}
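A few derived quantities follow directly from these attributes. The sketch below copies the relevant values from the output above and assumes that `distance` is given in metres (consistent with the `1125km_SSMF` group name) and that `lpw` is the launched power in watts.

```python
# Attribute values copied from the output above.
attrs = {
    'baudrate': 36000000000.0,
    'distance': 1125000.0,
    'lpdbm': 0.0,
    'lpw': 0.001,
    'samplerate': 72000000000.0,
    'spans': 15,
}
n_samples = 4500000  # first dimension of the 'recv' shape shown earlier

samples_per_symbol = attrs['samplerate'] / attrs['baudrate']
span_length_km = attrs['distance'] / attrs['spans'] / 1e3   # assumes distance is in metres
power_w = 1e-3 * 10 ** (attrs['lpdbm'] / 10)                # dBm -> W
capture_duration_us = n_samples / attrs['samplerate'] * 1e6

print(samples_per_symbol, span_length_km, power_w, capture_duration_us)
```

The sample rate is twice the baud rate, which matches the 2 samples/symbol resampling mentioned earlier, and the converted power agrees with the stored `lpw` value.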

Similarly, you can call `info` on the supplementary data group:

In [6]:
sup_grp[0].info

0,1
Name,/supdata/src1/0dBm_ch4_2
Type,zarr.hierarchy.Group
Read-only,False
Store type,zarr.storage.ConsolidatedMetadataStore
Chunk store type,zarr.storage.FSStore
No. members,1
No. arrays,1
No. groups,0
Arrays,nfo


In [7]:
dict(sup_grp[0].attrs)

{'cd': 18.451}

### Data downloading

So far, no data has actually been downloaded. Slicing an array triggers the download:

In [8]:
%time dat_grp[0]['recv'][:100000]  # download the first 100K samples

CPU times: user 19.2 ms, sys: 9.88 ms, total: 29.1 ms
Wall time: 559 ms


array([[-0.00305913-0.00367398j,  0.02140145-0.07081664j],
       [ 0.01202397-0.00803816j, -0.0013208 -0.05736233j],
       [ 0.02396413-0.01456006j, -0.01560492-0.04063822j],
       ...,
       [-0.01968213-0.00502888j, -0.02807888-0.0379972j ],
       [-0.01955158-0.01066033j, -0.02267638-0.02884825j],
       [ 0.00173824-0.01677127j,  0.00122746-0.00630978j]],
      dtype=complex64)

In [9]:
%time dat_grp[0]['recv'][:]  # download all 4.5M samples

CPU times: user 583 ms, sys: 140 ms, total: 722 ms
Wall time: 6.09 s


array([[-0.00305913-0.00367398j,  0.02140145-0.07081664j],
       [ 0.01202397-0.00803816j, -0.0013208 -0.05736233j],
       [ 0.02396413-0.01456006j, -0.01560492-0.04063822j],
       ...,
       [-0.03961598+0.02778501j,  0.01610195+0.02478129j],
       [-0.02261471+0.03245351j,  0.02470688+0.01263798j],
       [ 0.00126897+0.00130212j,  0.02451321-0.00247592j]],
      dtype=complex64)

Downloading the full waveform takes longer because the amount of data downloaded is determined by the slicing window. This feature is handy when running a small demo in an ad-hoc environment.
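To see why the slicing window matters, the number of chunks that a slice overlaps can be counted from the shape and chunk shape reported by `.info`. This is a rough sketch that assumes the store transfers whole chunks; the sizes are uncompressed, so the bytes actually transferred should be smaller because the chunks are Blosc-compressed.

```python
import math

shape = (4500000, 2)      # 'recv' shape from the info output
chunks = (140625, 1)      # chunk shape from the info output
itemsize = 8              # bytes per complex64 value


def chunks_touched(start, stop):
    """Count chunks overlapping rows [start, stop) across all columns."""
    first = start // chunks[0]
    last = (stop - 1) // chunks[0]
    row_chunks = last - first + 1
    col_chunks = math.ceil(shape[1] / chunks[1])
    return row_chunks * col_chunks


chunk_bytes = chunks[0] * chunks[1] * itemsize   # uncompressed size of one chunk
for stop in (100000, shape[0]):
    n = chunks_touched(0, stop)
    print(f"[:{stop}] touches {n} chunks, about {n * chunk_bytes / 1e6:.2f} MB uncompressed")
```

Under that assumption, the first 100K rows fall entirely within the first row of chunks, while the full slice touches every chunk of the array.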

### Local cache

We often reuse the same data many times, for example when re-running the code above after restarting the notebook's kernel:

In [10]:
%time dat_grp[0]['recv'][:]  # read all 4.5M samples again

CPU times: user 29.6 ms, sys: 12.6 ms, total: 42.2 ms
Wall time: 41.3 ms


array([[-0.00305913-0.00367398j,  0.02140145-0.07081664j],
       [ 0.01202397-0.00803816j, -0.0013208 -0.05736233j],
       [ 0.02396413-0.01456006j, -0.01560492-0.04063822j],
       ...,
       [-0.03961598+0.02778501j,  0.01610195+0.02478129j],
       [-0.02261471+0.03245351j,  0.02470688+0.01263798j],
       [ 0.00126897+0.00130212j,  0.02451321-0.00247592j]],
      dtype=complex64)

The same code that took about 6 seconds on the first call now finishes in about 41 ms. This is because the data is cached locally on the first call, so subsequent calls only load it from the local cache.

In [11]:
labptptm2.config.cache_storage  # this is the cache location for this Notebook environment

'/tmp/labptptm2'

In [12]:
import os
os.listdir(labptptm2.config.cache_storage)     # these cached file names are not human-readable :)

['36fd9c9c29800e34c4cae3b54bd4da1b6ea5cc9bf9b57089b1a0f70f2fc6c2b7',
 '114c3d3bff37e2dd1d72d8d0117b95aca9bba8ce0f97a199dd312d6d2f5a6dea',
 'efb86f1d9315e4ba6ae03f0b77258a5e57bba55f6d29984bf01f37413dd23f04',
 'e4aab9d1a790a24f8f09c8cfd32808524935323f9b8d6803924cd967e84b69e1',
 'ec9b6779458b1f654c432b31eaf4d9a005ad757fc33156089c50c761b725ace1',
 '7d52445d9a3aaafe9a5b9e14aca8a607161c016cce44fd68e786c3cc8dedc154',
 '0e33ed2c0af9fdef3d55c6e74691200624cee954c75503aa8acb6447aabb5e4e',
 '1d8044e82911465621787855e18a9f3754da017a6a98cc0d611d18ae8eb1d832',
 'a75295fbef9ade05f886d72b5aea8af8acbef7acd19edb5222c192af2966f8d3',
 '05ea1dbe1e1b2c00f4782c4bd67aacce7a57757ea6bc00d942ca4fce8006203c',
 '03df6c3cc7a7fc50f688f8b50dec5f13e1e52865834b5ce947402f2be6c0a1f4',
 '3423b50457a5062cfb874f3437c239ef7a08c536952ea49fa8259bb0d6d970b3',
 '7819c3620e9be6cfdfbacbcae8285dd0caf48e1c7c4c960cbbb65e3c11605b37',
 '3f973d3c8dd6927a8bb9a75013969b209585e62ec885138cc0710e81221dfc60',
 '688e65e86bf75bcf19e8c272ed90d828

The default cache location is the OS temporary folder, which may be cleared when the OS restarts. You can set `labptptm2.config.cache_storage` to another path to enable a persistent data cache:
``` python
labptptm2.config.cache_storage = os.getcwd()             # current working directory
labptptm2.config.cache_storage = 'D:/data'               # Windows
labptptm2.config.cache_storage = '/home/spongebob/data'  # *nix
labptptm2.config.cache_storage = '/Users/spongebob/data' # MacOS
```
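If the chosen location may not exist yet, it can be created first. The sketch below uses only the standard library; `make_cache_dir` and the folder name `labptptm2_cache` are hypothetical, and whether the library creates a missing directory on its own is not stated here.

```python
from pathlib import Path


def make_cache_dir(base, name="labptptm2_cache"):
    """Create (if needed) and return a cache directory under `base` as a string."""
    cache_dir = Path(base).expanduser() / name
    cache_dir.mkdir(parents=True, exist_ok=True)
    return str(cache_dir)


# Example (not executed here), following the assignments shown above:
#   labptptm2.config.cache_storage = make_cache_dir(Path.home())
```

Using `pathlib` keeps the same code working on Windows, Linux and macOS without hardcoding a platform-specific path.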

## Configuration

There are a few options to customize:

In [13]:
labptptm2.config

store: None
remote: s3://optcommpubdataqrfan/labptptm2_zarr
cache_storage: /tmp/labptptm2

- `store`: target store from which data is queried if not locally cached; `None` means the remote store is used
- `remote`: address of the remote store, which is not supposed to be changed
- `cache_storage`: location of the local cache; change this to another path to enable a persistent cache

In the Local cache example above, to persist the cache location without hardcoding `labptptm2.config.cache_storage = xxx`, you may use a configuration file in YAML format. You can dump the default settings to create one first.

In [14]:
labptptm2.config.dump()  # dump to current working directory by default

In [15]:
with open('labptptm2.yaml', 'r') as f:
    print(f.read())

cache_storage: /tmp/labptptm2
remote: s3://optcommpubdataqrfan/labptptm2_zarr
store: null



Now you can update this config file, and `labptptm2.config` will automatically load its content if the file is found on initial import.
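As an illustration, an updated config file could be rendered as follows. The keys and the `remote` value are copied from the dumped file above; the `cache_storage` path is a hypothetical example, and `render_config` is an illustrative helper rather than part of `labptptm2`.

```python
# Keys copied from the dumped file; only cache_storage is changed.
settings = {
    "cache_storage": "/home/spongebob/data",   # hypothetical persistent location
    "remote": "s3://optcommpubdataqrfan/labptptm2_zarr",
    "store": "null",                           # YAML null, as in the dumped file
}


def render_config(settings):
    """Render flat key/value settings as simple YAML lines."""
    return "".join(f"{key}: {value}\n" for key, value in settings.items())


text = render_config(settings)
print(text)
# To apply it, save `text` as labptptm2.yaml in place of the dumped file.
```

Editing the dumped file by hand achieves the same result; the sketch only shows which line changes.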

## Remote storage and direct access
Now let's take a look at the remote storage

In [16]:
root = labptptm2.open_group()

root.info

0,1
Name,/
Type,zarr.hierarchy.Group
Read-only,False
Store type,zarr.storage.ConsolidatedMetadataStore
Chunk store type,zarr.storage.FSStore
No. members,3
No. arrays,0
No. groups,3
Groups,"1125km_SSMF, source, supdata"


There are 3 groups. Let's see the info of the '1125km_SSMF' group:

In [17]:
root['1125km_SSMF'].info

0,1
Name,/1125km_SSMF
Type,zarr.hierarchy.Group
Read-only,False
Store type,zarr.storage.ConsolidatedMetadataStore
Chunk store type,zarr.storage.FSStore
No. members,2
No. arrays,0
No. groups,2
Groups,"src1, src2"


By repeatedly entering deeper groups, you eventually reach the array metadata we have seen above:

In [18]:
root['1125km_SSMF/src1/-1dBm_ch1_1/recv'].info

0,1
Name,/1125km_SSMF/src1/-1dBm_ch1_1/recv
Type,zarr.core.Array
Data type,complex64
Shape,"(4500000, 2)"
Chunk shape,"(140625, 1)"
Order,C
Read-only,False
Compressor,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)"
Store type,zarr.storage.ConsolidatedMetadataStore
Chunk store type,zarr.storage.FSStore


Such `root[{path string}]` access shows that the data is organized in a directory-like structure of so-called 'Groups'. (DON'T use the Windows-style `\` separator.)
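The two paths shown so far (`/1125km_SSMF/src1/0dBm_ch4_2` and `1125km_SSMF/src1/-1dBm_ch1_1/recv`) suggest a naming pattern that mirrors the four `select` arguments. The sketch below builds such a path string; `group_path` is a hypothetical helper, and the pattern is inferred from those two examples only.

```python
import posixpath


def group_path(src, lp_dbm, ch, cap, array=None, link="1125km_SSMF"):
    """Hypothetical helper: build a store path from a select-style specification."""
    parts = [link, f"src{src}", f"{lp_dbm}dBm_ch{ch}_{cap}"]
    if array is not None:
        parts.append(array)          # 'recv' or 'sent'
    return posixpath.join(*parts)    # always joins with '/', never '\'


print(group_path(1, 0, 4, 2))
print(group_path(1, -1, 1, 1, array="recv"))
```

`posixpath.join` is used instead of `os.path.join` so that the separator stays `/` on Windows as well.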

You can interactively navigate it! (needs Kernel, [not working in Colab](https://github.com/QuantStack/ipytree/issues/32#issuecomment-647098636))

In [19]:
root.tree() # click + to expand that directory and - to fold it, no heavy data download

Tree(nodes=(Node(disabled=True, name='/', nodes=(Node(disabled=True, name='1125km_SSMF', nodes=(Node(disabled=…

In [20]:
root['1125km_SSMF'].tree() # sub-groups also have tree function

Tree(nodes=(Node(disabled=True, name='1125km_SSMF', nodes=(Node(disabled=True, name='src1', nodes=(Node(disabl…

Still, data is only downloaded when it gets sliced

In [21]:
data = root['1125km_SSMF/src1/-1dBm_ch1_1/recv'][:10000] # slicer [:] would download the whole file

The convenience data API introduced at the beginning is an extra layer built on top of such direct access.

## Clone the entire remote store

Although we suggest using the remote store with a local cache layer for on-demand data queries, we also provide a single function to clone the entire data store conveniently.

The entire data store is around 27 GB, so clone it only if you have good bandwidth.
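To judge whether a full clone is practical, a rough transfer time can be estimated from the store size. This is an idealized sketch that assumes 1 GB = 10^9 bytes and a fully used link; the bandwidth values are arbitrary examples, and real transfers will be slower.

```python
def clone_time_minutes(size_gb, bandwidth_mbit_s):
    """Idealized transfer time: size in GB (10**9 bytes), link speed in Mbit/s."""
    bits = size_gb * 1e9 * 8
    return bits / (bandwidth_mbit_s * 1e6) / 60


for mbit_s in (50, 100, 1000):
    print(f"{mbit_s:>5} Mbit/s: about {clone_time_minutes(27, mbit_s):.0f} min")
```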

``` python
import sys

labptptm2.clone_store('./labptptm2', log=sys.stdout)  # copy the entire store to ./labptptm2 !

# once the local store is ready, you need to update labptptm2.config.store

labptptm2.config.store = './labptptm2'

# it is suggested to use configuration file
```

Finally, no more downloading or caching is needed: from now on, data is loaded directly from the local store.