# Using Iguazio Frames Library for High-Performance Data Access 
Iguazio `v3io_frames` is a streaming-oriented, multi-model (generic) data API that allows high-speed data loading and storing.<br>
Frames currently supports the Iguazio key/value, time-series, and streaming data models (called backends); additional backends will be added.

For a detailed description of the Frames API, see https://github.com/v3io/frames

To use frames, first create a `client` and provide it with the session and credential details. The client is then used for five basic operations:
```
   create  - create a new key/value or time-series table or stream 
   delete  - delete the table or stream
   read    - read data from the backend (as a pandas DataFrame or a DataFrame iterator)
   write   - write one or more DataFrames into the backend
   execute - execute a command on the backend, each backend may support multiple commands 
```   

Contents:
- [Working with key/value and SQL data](#kv)
- [Working with Time-series data](#tsdb)
- [Working with Streams](#stream)

The following sections describe how to use frames. For more help and details, use the built-in documentation, e.g. run `client.read?`


In [1]:
import pandas as pd
import v3io_frames as v3f
import os
client = v3f.Client('framesd:8081', container='users')

<a id='kv'></a>
## Working with key/value and SQL data

### Load data from Amazon S3

In [2]:
# read S3 file into a data frame and show its data & metadata
tablename = os.path.join(os.getenv('V3IO_USERNAME'), 'examples/bank')
df = pd.read_csv('https://s3.amazonaws.com/iguazio-sample-data/bank.csv', sep=';')
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


### Write data frames into the database using a single command
Data is streamed into the database via fast NoSQL APIs; note that the backend is `kv`.<br>
The input data can be a single DataFrame or a DataFrame iterator (for streaming).

In [3]:
out = client.write('kv', tablename, df)

### Read from the Database with DB side SQL
Offload data filtering, grouping, joins, etc. to a scale-out, high-speed DB engine.<br>
Because the table path includes the `V3IO_USERNAME` environment variable, the string for the `from` clause must be built in code first.<br>
The `from` convention is `select * from v3io.<data container>."path"`

In [4]:
table_path = 'v3io.users."' + os.getenv('V3IO_USERNAME') + '/examples/bank"'
%sql select * from $table_path where balance > 10000

Done.


loan,education,previous,housing,poutcome,duration,marital,default,balance,month,contact,campaign,y,job,day,age,pdays
no,secondary,0,yes,unknown,249,married,no,19317,aug,cellular,1,yes,retired,4,68,-1
no,secondary,0,no,unknown,219,married,no,26452,jul,telephone,2,no,retired,15,75,-1
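
The `from` convention described above can be wrapped in a small helper. This is a minimal sketch: `sql_table_path` is a hypothetical name (not part of frames), `iguazio` is a placeholder user name, and the function only formats the string.

```python
def sql_table_path(container, path):
    """Build a table reference of the form v3io.<container>."<path>".

    Hypothetical helper for illustration; it only formats the string.
    """
    # Strip surrounding slashes so 'user/examples/bank' and
    # '/user/examples/bank/' produce the same reference
    return 'v3io.{}."{}"'.format(container, path.strip('/'))


# 'iguazio' stands in for the value of the V3IO_USERNAME variable
table_path = sql_table_path('users', 'iguazio/examples/bank')
print(table_path)
```

The resulting string can then be passed to the `%sql` magic through `$table_path`, as in the cell above.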


### Read the data through frames API
The frames API returns a DataFrame or a DataFrame iterator (a stream).<br>

In [5]:
df = client.read(backend='kv', table=tablename, filter="balance>20000")
df.head(8)

__name,housing,contact,education,loan,campaign,pdays,poutcome,default,balance,duration,previous,job,marital,month,day,age,y
75,no,telephone,secondary,no,2.0,-1.0,unknown,no,26452.0,219.0,0.0,retired,married,jul,15.0,75.0,no


### Read the data as a stream iterator
To use an iterator and allow concurrent data movement and processing, add `iterator=True`. You then need to iterate over the returned value or use `concat`.
Iterators work with all backends (not just stream). They also allow streaming when passed to write functions that accept iterators as input.

In [6]:
dfs = client.read(backend='kv', table=tablename, filter="balance>20000", iterator=True)
for df in dfs:
    print(df.head())

        balance  campaign  marital default loan    contact   y   age  \
__name                                                                 
75      26452.0       2.0  married      no   no  telephone  no  75.0   

        duration  previous   day housing  pdays  education      job poutcome  \
__name                                                                         
75         219.0       0.0  15.0      no   -1.0  secondary  retired  unknown   

       month  
__name        
75       jul  
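
The two ways of consuming an iterator (looping over it, or assembling it with `concat`) can be tried locally without a frames server. In this sketch, `frame_chunks` is a hypothetical generator that stands in for the value returned by `client.read(..., iterator=True)`, and the sample balances are made up.

```python
import pandas as pd


def frame_chunks(df, chunk_size):
    """Yield consecutive slices of df, standing in for a frames iterator."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


# Made-up sample data, not read from the database
source = pd.DataFrame({'balance': [26452, 19317, 1787, 4789, 1350]})

# Option 1: process each chunk as it arrives (only one chunk in memory at a time)
running_total = 0
for chunk in frame_chunks(source, chunk_size=2):
    running_total += int(chunk['balance'].sum())

# Option 2: assemble all chunks into a single DataFrame
combined = pd.concat(list(frame_chunks(source, chunk_size=2)))

print(running_total, len(combined))
```

Note that an iterator can be consumed only once, so the second option calls `frame_chunks` again instead of reusing the first generator.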


### Batch updates with expression
In many cases we want to update specific column values or update a column using an expression (e.g. counter = counter + x).<br>
The key/value backend can run an expression against each of the rows (specified in the index) and use the DataFrame columns as parameters.<br>
Columns are specified using `{}`, e.g. specifying `expression="packets=packets+{pkt};bytes=bytes+{bytes};last_update={mytime}"` will add the data in the `pkt` and `bytes` columns of the input DataFrame to the `packets` and `bytes` columns in the row, and set the `last_update` field to `mytime`. The rows are selected based on the input DataFrame index.

In [7]:
# example: create a new column that reflects the delta between the stored `balance` column and the one provided in df (should result in 0 since df didn't change)
out = client.write('kv', tablename, df, expression='balance_delta=balance-{balance}')
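
To get a feel for how the `{}` placeholders relate to DataFrame columns, the substitution can be mimicked locally. This is a conceptual sketch only: frames performs the substitution itself when `expression=` is passed to `client.write`, and its exact value formatting may differ. `render_expression`, the row keys, and the sample values are all hypothetical.

```python
def render_expression(template, row):
    """Substitute {column} placeholders in template with values from row.

    Illustration only; this is not how frames implements expressions.
    """
    return template.format(**row)


template = 'packets=packets+{pkt};bytes=bytes+{bytes};last_update={mytime}'

# Each dict stands in for one input DataFrame row, keyed by column name;
# the outer keys play the role of the DataFrame index (the row keys)
rows = {
    'host-1': {'pkt': 3, 'bytes': 1500, 'mytime': 1552564800},
    'host-2': {'pkt': 7, 'bytes': 4200, 'mytime': 1552564860},
}

for key, row in rows.items():
    print(key, '->', render_expression(template, row))
```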

### Making a single row update using execute command
The use of `condition` is optional; it allows implementing safe/conditional transactions.

In [8]:
client.execute('kv',tablename,'update', args={'key':'44', 'expression': 'age=44', 'condition':'balance>0'})
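
Since `condition` is optional, a thin wrapper can keep call sites short. The sketch below assumes only the `client.execute` call shape shown in the cell above; `update_row` is a hypothetical name and not part of frames.

```python
def update_row(client, table, key, expression, condition=None):
    """Run a single-row `update` command on the kv backend.

    Hypothetical wrapper: it only assembles the `args` dict used in the
    cell above and leaves out `condition` when none is given.
    """
    args = {'key': key, 'expression': expression}
    if condition is not None:
        args['condition'] = condition
    return client.execute('kv', table, 'update', args=args)
```

With the client from this notebook, the cell above would become `update_row(client, tablename, '44', 'age=44', condition='balance>0')`.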

### Delete the table
Note: in kv (NoSQL) tables there is no need to create a table before using it.

In [9]:
client.delete('kv',table=tablename)

<a id='tsdb'></a>
## Working with time-series data

Note that the tsdb table example will be created under the root of the "users" container

In [10]:
# create a time series table, rate specifies the typical ingestion rate (e.g. one sample per minute)
client.create(backend='tsdb', table='tsdb_tab',attrs={'rate':'1/m'})

In [11]:
# create sample time-series data
import numpy as np
from datetime import datetime, timedelta
end = datetime.now().replace(minute=0, second=0, microsecond=0)
rng = pd.date_range(end=end, periods=60, freq='300s', tz='EST')
df = pd.DataFrame(np.random.randn(len(rng), 3), index=rng, columns=['cpu','mem','disk'])
df = df.cumsum()
print(df.info(), df.head())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 60 entries, 2019-03-14 07:05:00-05:00 to 2019-03-14 12:00:00-05:00
Freq: 300S
Data columns (total 3 columns):
cpu     60 non-null float64
mem     60 non-null float64
disk    60 non-null float64
dtypes: float64(3)
memory usage: 1.9 KB
None                                 cpu       mem      disk
2019-03-14 07:05:00-05:00  0.300817 -0.136579  1.400577
2019-03-14 07:10:00-05:00  0.340741 -0.485285  2.723041
2019-03-14 07:15:00-05:00 -2.029409  0.285942  2.076960
2019-03-14 07:20:00-05:00 -1.977885  0.096732  4.749289
2019-03-14 07:25:00-05:00 -1.567182  0.174292  6.428432


### Write to the time-series DB
The time-series DB has a time-based index and additional sub-indexes called labels.<br>
Labels can be specified in two ways:
* Using the `labels` parameter, which adds the specified labels to each row/sample<br>
* Using a multi-index: all non-time index columns are automatically converted to labels

If your DataFrame doesn't contain a multi-index and you wish to use specific columns as time-series labels, convert the columns to indexes using:<br>
```python
    df.index.name='time'                              # in case the index column is un-named 
    df.reset_index(level=0, inplace=True)    
    df = df.set_index(['time','symbol','exchange'])   # e.g. convert the specified columns to indexes 
```

Note: you can use both (multi-index and labels) together; the resulting labels are the combination of both.
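
The index conversion above can be checked on a small local DataFrame before writing anything to the TSDB. The sample values below (`AAA`, `X1`, prices) are made up for illustration; only the pandas calls matter.

```python
import pandas as pd

# Hypothetical sample data: two label columns and one metric column
df = pd.DataFrame(
    {
        'symbol': ['AAA', 'BBB', 'AAA'],
        'exchange': ['X1', 'X1', 'X2'],
        'price': [10.5, 20.25, 10.75],
    },
    index=pd.to_datetime(['2019-03-14 07:05', '2019-03-14 07:05', '2019-03-14 07:10']),
)

df.index.name = 'time'                              # name the time index
df = df.reset_index()                               # move 'time' back into a column
df = df.set_index(['time', 'symbol', 'exchange'])   # time first, then the label columns

print(df.index.names)    # index levels: time plus the two label columns
print(list(df.columns))  # only the metric column remains
```

After the conversion, `symbol` and `exchange` are index levels, so they would be treated as labels, and `price` is the only remaining metric column.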

In [12]:
client.write(backend='tsdb', table = 'tsdb_tab',dfs=df, labels={'node':'11'})

### Read from the time-series DB


In [13]:
# Read Time-Series aggregates from the DB (returned as a data stream, use concat to assemble the frames)
tsdf = client.read(backend='tsdb', query='select avg(*),max(*),min(*) from tsdb_tab', step='60m', start="now-7d", end='now',multi_index=True)
tsdf.head()

time,node,avg(mem),max(mem),min(mem),avg(disk),max(disk),min(disk),avg(cpu),max(cpu),min(cpu)
2019-03-14 11:58:02,11,0.082436,1.478036,-1.102929,4.200892,6.428432,1.400577,-1.216931,0.340741,-2.139816


### Delete the table

In [14]:
client.delete('tsdb','tsdb_tab')

<a id='stream'></a>
## Working with streams
The Iguazio platform supports streams with an AWS Kinesis-like API. They can be accessed from the notebook as if they were a structured stream (currently the structure is assumed to be serialized as `JSON`). Streams can be read in bulk or, preferably, using an iterator.<br>
A stream must be created first; users can specify the number of shards and the retention period when creating it.

In [15]:
strm = os.path.join(os.getenv('V3IO_USERNAME'), 'examples/somestream')
client.create(backend='stream', table=strm,attrs={'retention_hours':48,'shards':1})

In [16]:
# write data into a stream
def gendf():
    end = datetime.now().replace(minute=0, second=0, microsecond=0)
    rng = pd.date_range(end=end, periods=60, freq='300s', tz='Israel')
    df = pd.DataFrame(np.random.randn(len(rng), 3), index=rng, columns=['cpu', 'mem', 'disk'])
    return df

client.write('stream', strm, gendf())

### Reading from a stream
The stream read operation needs to specify the seek method and its parameters (each seek method may have different parameters), as listed below:
```
   earliest   - start from the earliest point in the stream (no params)
   latest     - start from the latest, i.e. show only new records
   time       - start from a point in time, specify the start param e.g. start='now-1d'
   sequence   - start from a specific sequence number, specify the sequence param e.g. sequence=45
```
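
Because each seek method expects a different parameter, it can help to validate the combination before calling `client.read`. The following is a minimal sketch based only on the list above; `SEEK_PARAMS` and `seek_kwargs` are hypothetical names, and the helper does not call frames itself.

```python
# Parameter each seek method requires, per the list above (None = no parameter)
SEEK_PARAMS = {
    'earliest': None,
    'latest': None,
    'time': 'start',
    'sequence': 'sequence',
}


def seek_kwargs(seek, **params):
    """Return keyword arguments for a stream read, validating the seek method.

    Hypothetical helper; it only checks that the parameter a seek method
    needs is present.
    """
    if seek not in SEEK_PARAMS:
        raise ValueError('unknown seek method: {!r}'.format(seek))
    required = SEEK_PARAMS[seek]
    if required is not None and required not in params:
        raise ValueError('seek={!r} requires the {!r} parameter'.format(seek, required))
    return dict(seek=seek, **params)


print(seek_kwargs('time', start='now-1d'))
print(seek_kwargs('sequence', sequence=45))
```

The result can be unpacked into the read call, e.g. `client.read('stream', strm, shard_id='0', iterator=True, **seek_kwargs('time', start='now-1d'))`.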



In [17]:
dfs = client.read('stream', strm,seek='earliest', shard_id='0', iterator=True)
for df in dfs:
    print(df.head(4))

                 cpu      disk               index-0       mem  \
seq_number                                                       
1           0.011781 -0.172618  2019-03-14T05:05:00Z  0.535711   
2           1.580387 -0.470061  2019-03-14T05:10:00Z -0.003681   
3           0.400377 -1.969355  2019-03-14T05:15:00Z  0.363214   
4          -0.213012 -0.806401  2019-03-14T05:20:00Z  0.250864   

                             stream_time  
seq_number                                
1          2019-03-14 12:58:08.343475133  
2          2019-03-14 12:58:08.343475133  
3          2019-03-14 12:58:08.343475133  
4          2019-03-14 12:58:08.343475133  


### Push a single record update to a stream
In some cases it is more convenient to just push a buffer into a stream; for that, use the execute `put` command.<br>
`put` accepts the `data` arg and two optional parameters (`clientinfo` for some extra info and `partition` if you want to specify the shard ID).

In [18]:
client.execute('stream', strm, 'put', args={'data': 'abcd', 'clientinfo': '123'})
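
When the record is structured rather than a plain string, the `args` dict for `put` can be assembled with a helper. This is a sketch under assumptions: `put_args` is a hypothetical name, and serializing the record as JSON is an assumption based on the earlier note that stream data is expected to be JSON.

```python
import json


def put_args(record, clientinfo=None, partition=None):
    """Build the `args` dict for the stream `put` command.

    Hypothetical helper. JSON serialization of `record` is an assumption,
    not something the `put` command is documented here to require.
    """
    args = {'data': json.dumps(record)}
    # Both optional parameters are left out unless explicitly given
    if clientinfo is not None:
        args['clientinfo'] = clientinfo
    if partition is not None:
        args['partition'] = partition
    return args


args = put_args({'cpu': 0.5, 'mem': 0.25}, clientinfo='123')
print(args)
```

The dict would then be passed as `client.execute('stream', strm, 'put', args=args)`.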

### Delete the stream

In [19]:
client.delete('stream',strm)