## Introduction
DataSource is a helper class for monotonous, repetitive tasks on financial time series data. Quite often we need to apply a set of common functions, such as shifting data or calculating rolling statistics and percent changes, to each security in our database using the pandas **groupby** function. DataSource reduces each of these to a one-liner instead of several lines of pandas code; in fact, it's just a fancy wrapper around pandas groupby with some extra goodies.

So instead of 
```python
shift = lambda x: x.shift(1)
dataframe['lag_one'] = dataframe.groupby('symbol')['close'].transform(shift)
```
it would be
```python
# Wrap the dataframe once, then call the helper on the DataSource object
ds = DataSource(dataframe)
ds.add_lag(on='close', period=1, col_name='lag_one')
```

The only requirement is that the dataframe must have **symbol and timestamp columns**. 

**If they are part of the index, reset them to ordinary columns first.**
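
Resetting the index can be sketched in plain pandas as follows. The small frame here is purely illustrative and the name `indexed` is hypothetical; only `reset_index` is the point.

```python
import pandas as pd

# Illustrative frame whose symbol and timestamp live in a MultiIndex
indexed = pd.DataFrame(
    {"close": [631.8, 623.9, 641.8, 608.0]},
    index=pd.MultiIndex.from_product(
        [pd.to_datetime(["2018-09-03", "2018-09-04"]), ["AXISBANK", "RBLBANK"]],
        names=["timestamp", "symbol"],
    ),
)

# reset_index moves both index levels back into ordinary columns
flat = indexed.reset_index()
print(flat.columns.tolist())
```

The resulting `flat` frame has `timestamp` and `symbol` as regular columns and can then be passed to `DataSource`.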

In [29]:
import pandas as pd
import sys
sys.path.append('../../')
from fastbt.datasource import DataSource

## Initialize DataSource class with a dataframe

In [32]:
df = pd.read_csv('data/bank.csv', parse_dates=['timestamp'])
ds = DataSource(df)

## Use ds.data to get the dataframe back
ds.data.head()

Unnamed: 0,timestamp,symbol,series,open,high,low,close,last,prevclose,tottrdqty,tottrdval,totaltrades,isin
0,2018-09-03,AXISBANK,EQ,655.45,655.45,629.6,631.8,630.7,649.25,6494484,4148248000.0,88364,INE238A01034
1,2018-09-03,RBLBANK,EQ,630.1,643.5,622.0,623.9,624.4,627.25,1532438,970892900.0,28755,INE976G01028
2,2018-09-03,INDUSINDBK,EQ,1906.0,1918.85,1884.7,1897.0,1895.0,1906.6,937981,1782402000.0,85683,INE095A01012
3,2018-09-03,KOTAKBANK,EQ,1292.0,1295.0,1265.0,1269.1,1266.25,1287.25,2076747,2639495000.0,59594,INE237A01028
4,2018-09-03,BANKBARODA,EQ,153.95,156.5,150.9,151.7,151.3,152.95,16081701,2484263000.0,72339,INE028A01039


In [44]:
# If your dataframe uses different names for the symbol and timestamp
# columns, pass them as parameters during initialization
ds = DataSource(df, symbol='symbol', timestamp='timestamp')

DataSource adds a column for each function you call and returns the dataframe with the column added. The existing functions are:

```
 add_lag
 add_pct_change
 add_rolling
 add_formula
 add_indicator 
```

All functions have a col_name argument. Except for ``add_formula``, column names are generated automatically for all functions. 

**All column names, even those specified as arguments, are converted into lower case to make them case-insensitive.**
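
In plain pandas terms, this kind of normalisation amounts to something like the sketch below. It is an illustration on a made-up frame named `raw`, not a copy of what DataSource does internally.

```python
import pandas as pd

# Made-up frame with mixed-case column names
raw = pd.DataFrame(
    {"Symbol": ["AXISBANK"], "Timestamp": ["2018-09-03"], "Close": [631.8]}
)

# Lower-case every column name so lookups no longer depend on case
raw.columns = raw.columns.str.lower()
print(raw.columns.tolist())
```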

The following arguments are common to the functions:
 * ``col_name`` - name of the column to be added to the dataframe. Mandatory for ``add_formula``; generated automatically for the others
 * ``period`` - time period for the function
 * ``lag`` - number of periods by which the result is to be lagged. Not applicable to ``add_lag`` and ``add_formula``
 * ``on`` - column on which the function is computed for each symbol. Not applicable to ``add_formula`` and ``add_indicator``

Let's see a few examples.

## ``add_lag``

Adds a column containing the chosen column lagged by the given number of periods.

In [45]:
## Add a one day lag to data
ds.add_lag();
print(ds.data.info()) # Column lag_close_1 added automatically
ds.data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 14 columns):
timestamp      48 non-null datetime64[ns]
symbol         48 non-null object
series         48 non-null object
open           48 non-null float64
high           48 non-null float64
low            48 non-null float64
close          48 non-null float64
last           48 non-null float64
prevclose      48 non-null float64
tottrdqty      48 non-null int64
tottrdval      48 non-null float64
totaltrades    48 non-null int64
isin           48 non-null object
lag_close_1    36 non-null float64
dtypes: datetime64[ns](1), float64(8), int64(2), object(3)
memory usage: 5.3+ KB
None


Unnamed: 0,timestamp,symbol,series,open,high,low,close,last,prevclose,tottrdqty,tottrdval,totaltrades,isin,lag_close_1
0,2018-09-03,AXISBANK,EQ,655.45,655.45,629.6,631.8,630.7,649.25,6494484,4148248000.0,88364,INE238A01034,
1,2018-09-03,RBLBANK,EQ,630.1,643.5,622.0,623.9,624.4,627.25,1532438,970892900.0,28755,INE976G01028,
2,2018-09-03,INDUSINDBK,EQ,1906.0,1918.85,1884.7,1897.0,1895.0,1906.6,937981,1782402000.0,85683,INE095A01012,
3,2018-09-03,KOTAKBANK,EQ,1292.0,1295.0,1265.0,1269.1,1266.25,1287.25,2076747,2639495000.0,59594,INE237A01028,
4,2018-09-03,BANKBARODA,EQ,153.95,156.5,150.9,151.7,151.3,152.95,16081701,2484263000.0,72339,INE028A01039,


In [46]:
# Add a 2 day lag
ds.add_lag(period=2)

# Add a 3 day lag with a custom column name on the open price
ds.add_lag(period=3, col_name='three_day_lag', on='open').dropna()

Unnamed: 0,timestamp,symbol,series,open,high,low,close,last,prevclose,tottrdqty,tottrdval,totaltrades,isin,lag_close_1,lag_close_2,three_day_lag
36,2018-09-06,FEDERALBNK,EQ,78.5,78.5,76.7,77.35,77.3,77.75,6731974,521808800.0,30898,INE171A01029,77.75,77.0,81.5
37,2018-09-06,HDFCBANK,EQ,2049.0,2059.0,2032.6,2052.2,2050.15,2045.85,2600603,5316363000.0,125530,INE040A01026,2045.85,2051.8,2069.4
38,2018-09-06,ICICIBANK,EQ,330.0,331.25,325.5,328.65,327.95,329.65,11322771,3724325000.0,81857,INE090A01021,329.65,328.5,343.6
39,2018-09-06,IDFCBANK,EQ,45.15,45.4,44.45,44.9,44.7,44.85,12803268,573560700.0,23485,INE092T01019,44.85,44.95,47.95
40,2018-09-06,AXISBANK,EQ,638.9,643.3,624.6,638.2,637.5,637.65,10026621,6344121000.0,122891,INE238A01034,637.65,641.8,655.45
41,2018-09-06,RBLBANK,EQ,598.0,600.0,578.65,591.55,591.6,593.55,1766611,1039492000.0,45182,INE976G01028,593.55,608.0,630.1
42,2018-09-06,SBIN,EQ,298.0,299.85,294.5,296.45,294.9,296.55,18001336,5352603000.0,111206,INE062A01020,296.55,296.4,312.5
43,2018-09-06,KOTAKBANK,EQ,1242.0,1264.6,1238.15,1260.9,1258.55,1238.15,2947633,3706721000.0,66281,INE237A01028,1238.15,1257.6,1292.0
44,2018-09-06,INDUSINDBK,EQ,1870.0,1885.0,1851.0,1880.0,1879.0,1854.85,1003340,1879392000.0,54166,INE095A01012,1854.85,1855.6,1906.0
45,2018-09-06,BANKBARODA,EQ,145.9,147.1,144.15,146.1,145.7,144.55,10475666,1525306000.0,40096,INE028A01039,144.55,145.55,153.95


You can run
```python
ds.data.info()
```
at the end of each cell to see the columns added so far.

## ``add_pct_change``
Adds a percentage change column.

In [None]:
## Add a 2 day percentage change on close price
ds.add_pct_change(on='close', period=2)

Calculate the 2-day percentage change in the close price and lag it by one day.
This is especially useful if you want to know the 2-day returns on the morning of the 3rd day.

In [55]:
ds.add_pct_change(on='close', period=2, lag=1)
ds.data.head()


Unnamed: 0,timestamp,symbol,series,open,high,low,close,last,prevclose,tottrdqty,tottrdval,totaltrades,isin,lag_close_1,lag_close_2,three_day_lag,chg_close_2
0,2018-09-03,AXISBANK,EQ,655.45,655.45,629.6,631.8,630.7,649.25,6494484,4148248000.0,88364,INE238A01034,,,,
1,2018-09-03,RBLBANK,EQ,630.1,643.5,622.0,623.9,624.4,627.25,1532438,970892900.0,28755,INE976G01028,,,,
2,2018-09-03,INDUSINDBK,EQ,1906.0,1918.85,1884.7,1897.0,1895.0,1906.6,937981,1782402000.0,85683,INE095A01012,,,,
3,2018-09-03,KOTAKBANK,EQ,1292.0,1295.0,1265.0,1269.1,1266.25,1287.25,2076747,2639495000.0,59594,INE237A01028,,,,
4,2018-09-03,BANKBARODA,EQ,153.95,156.5,150.9,151.7,151.3,152.95,16081701,2484263000.0,72339,INE028A01039,,,,
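
The lagged percent change described above can be sketched in plain pandas as below. This is a minimal illustration on made-up prices for two hypothetical symbols; the helper `lagged_pct_change` is not part of DataSource, and the assumption is that the lag is applied after the percent change, within each symbol.

```python
import pandas as pd

# Made-up prices for two hypothetical symbols, five days each
prices = pd.DataFrame({
    "timestamp": list(pd.date_range("2018-09-03", periods=5)) * 2,
    "symbol": ["AAA"] * 5 + ["BBB"] * 5,
    "close": [100.0, 110.0, 121.0, 133.1, 146.41,
              50.0, 45.0, 40.5, 36.45, 32.805],
}).sort_values(["symbol", "timestamp"])


def lagged_pct_change(series, period=2, lag=1):
    """Percent change over `period` rows, pushed forward by `lag` rows."""
    return series.pct_change(period).shift(lag)


# Apply the helper separately to each symbol's close prices
prices["chg_close_2"] = (
    prices.groupby("symbol")["close"].transform(lagged_pct_change)
)
print(prices)
```

With a period of 2 and a lag of 1, the first three rows of each symbol are empty, which is why short samples leave only a few non-null values.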


You can also pass keyword arguments to the percent change function. These arguments are passed on to the underlying pandas function when the results are computed.
So let's compute the percent change with NA data backfilled:

In [61]:
ds.add_pct_change(on='close', period=2, lag=1, fill_method='bfill');

## ``add_rolling``
Adds a rolling statistic column.

In [63]:
ds.add_rolling(window=3, on='close', function='mean');
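
For comparison, a per-symbol rolling mean written directly in pandas might look like the sketch below. The frame `quotes` and its values are made up, and the column name simply mirrors the `rol_close_mean` column that appears in this notebook's output.

```python
import pandas as pd

# Made-up close prices for two hypothetical symbols
quotes = pd.DataFrame({
    "symbol": ["AAA"] * 4 + ["BBB"] * 4,
    "close": [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0],
})

# 3-row rolling mean, computed separately for each symbol
quotes["rol_close_mean"] = (
    quotes.groupby("symbol")["close"]
    .transform(lambda s: s.rolling(window=3).mean())
)
print(quotes)
```

The first two rows of each symbol stay empty because a 3-row window is not yet full there.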

## ``add_formula``
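Adds a column computed from a formula.

The formula should be a string, and ``col_name`` is mandatory.

In [None]:
# Illustrative call: the formula and column name are examples only, and
# passing the formula as the first positional argument is an assumption
ds.add_formula('(open + close) / 2', col_name='avg_price')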
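
One way a string formula can be turned into a column in plain pandas is `DataFrame.eval`. The sketch below uses a made-up frame `bars`; whether DataSource evaluates formulas this way internally is an assumption, not something this notebook confirms.

```python
import pandas as pd

# Made-up open and close prices
bars = pd.DataFrame({"open": [655.45, 630.1], "close": [631.8, 623.9]})

# Evaluate the formula string against the column names
bars["avg_price"] = bars.eval("(open + close) / 2")
print(bars)
```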

## ``add_indicator``

Adds a technical indicator column.

In [68]:
ds.add_indicator('RSI', 2);
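
To give a feel for what an RSI column contains, here is a rough pandas sketch using Wilder-style exponential smoothing, applied per symbol. The function `rsi` and the frame `ticks` are hypothetical, and the smoothing is seeded differently from common indicator libraries, so the numbers will not necessarily match what `add_indicator` produces.

```python
import pandas as pd


def rsi(close, period=2):
    """Relative Strength Index with Wilder-style exponential smoothing."""
    delta = close.diff()
    gains = delta.clip(lower=0)        # positive moves, zero otherwise
    losses = (-delta).clip(lower=0)    # size of negative moves, zero otherwise
    avg_gain = gains.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = losses.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)


# Made-up prices: one steadily rising and one steadily falling symbol
ticks = pd.DataFrame({
    "symbol": ["AAA"] * 5 + ["BBB"] * 5,
    "close": [10.0, 11.0, 12.0, 13.0, 14.0, 20.0, 19.0, 18.0, 17.0, 16.0],
})

# Compute the indicator separately for each symbol
ticks["rsi_2"] = ticks.groupby("symbol")["close"].transform(lambda s: rsi(s, period=2))
print(ticks)
```

A series that only rises ends up at the top of the 0 to 100 range and one that only falls ends up at the bottom, while the first `period` rows of each symbol are empty.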

## A few shortcuts

Adding columns in bulk.

In [71]:
# Add 2, 3 and 5 day returns
for i in [2, 3, 5]:
    ds.add_pct_change(on='close', period=i, col_name='ret' + str(i))
ds.data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 23 columns):
timestamp         48 non-null datetime64[ns]
symbol            48 non-null object
series            48 non-null object
open              48 non-null float64
high              48 non-null float64
low               48 non-null float64
close             48 non-null float64
last              48 non-null float64
prevclose         48 non-null float64
tottrdqty         48 non-null int64
tottrdval         48 non-null float64
totaltrades       48 non-null int64
isin              48 non-null object
lag_close_1       36 non-null float64
lag_close_2       24 non-null float64
three_day_lag     12 non-null float64
chg_close_2       12 non-null float64
rol_close_mean    24 non-null float64
rsi_3             12 non-null float64
rsi_2             24 non-null float64
ret2              24 non-null float64
ret3              12 non-null float64
ret5              0 non-null float64
dtypes: datetime64[ns](1

## ``batch_process``
Batch process