### Step 2: merge all .csv files into a one-year dataframe

Loop through the lob_caps directory and combine all CAPS files into one time-sorted dataframe. These files hold sampled bid and ask capitalization, along with the respective bid and ask volumes.

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
!pip3 install matplotlib
!pip3 install altair

Collecting altair
  Obtaining dependency information for altair from https://files.pythonhosted.org/packages/f2/b4/02a0221bd1da91f6e6acdf0525528db24b4b326a670a9048da474dfe0667/altair-5.1.1-py3-none-any.whl.metadata
  Downloading altair-5.1.1-py3-none-any.whl.metadata (8.6 kB)
Downloading altair-5.1.1-py3-none-any.whl (520 kB)
Installing collected packages: altair
Successfully installed altair-5.1.1


In [3]:
import altair as alt
import pandas as pd
import os
import numpy as np

In [4]:
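# Note: grep -q fails on a directory, so find most likely prints nothing and mv is left with a single argument (see the usage message below)
!mv $(find . -type d -name "lob_caps" -exec grep -q MATCH {} \; -print0 | xargs -0 echo) backup_match/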

usage: mv [-f | -i | -n] [-hv] source target
       mv [-f | -i | -n] [-v] source ... directory


In [5]:
# Directory-walk pattern modelled on https://stackoverflow.com/a/21232849
def getCAPSByDateAndType(file_type):
    """Return the paths of all files under ./lob_caps whose name contains file_type."""
    ret = []
    for root, dirs, files in os.walk("./lob_caps"):
        for filename in files:
            # on a Mac, remove stray "._*" files first: find . -name ._\* -delete
            if file_type in filename:
                ret.append(os.path.join(root, filename))
    return ret

csvFileList = getCAPSByDateAndType("CAPS")  # paths of every CAPS csv
li = []                                     # one dataframe per csv, merged below
for csv_path in csvFileList:
    df = pd.read_csv(csv_path, index_col=None, header=0)
    li.append(df)

# pd.concat raises "ValueError: No objects to concatenate" when no CAPS files were found
capsFrame = pd.concat(li, axis=0, ignore_index=True)  # end frame contains all data
# sort_values returns a new frame, so the result has to be assigned back
capsFrame = capsFrame.sort_values(by=['time'], ascending=True).reset_index(drop=True)
print("rows in new df: ", capsFrame.shape[0])
start = capsFrame["time"].min()
end = capsFrame["time"].max()
print("start: ", start, " end: ", end)
print(capsFrame.columns)

ValueError: No objects to concatenate

In [None]:
capsFrame

## schema for capitalization data

Loads the csv files, as acquired from Coinbase.

In [None]:
capsFrame.head(2) # shows the basic data collected via Coinbase; these are aggregated values, sampled several times a minute

### imputation

In [None]:
# impute missing values with the last non-null value (forward fill)
cols_to_fill = ['bc', 'ac', 'tbv', 'tav', 'mp', 'minBid']
capsFrame[cols_to_fill] = capsFrame[cols_to_fill].ffill()
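
As a minimal sketch of what forward filling does, the toy frame below uses made-up values (not Coinbase data) and only two of the column names from above. A gap is filled from the last non-null value, while a gap at the very start has nothing to copy from and stays empty.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "bc": [1.0, np.nan, np.nan, 4.0],   # gap in the middle
    "mp": [np.nan, 10.0, np.nan, 12.0], # gap at the start and in the middle
})

# Each NaN takes the last non-null value above it in the same column
filled = toy.ffill()
print(filled)
```

If leading gaps matter for the later percent-change step, they would need separate handling, for example dropping those rows.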


## Discover precursor and surge episodes

The goal of the data prep is to discover periods of continuous, positive momentum. These are **surges**.

The periods preceding surges are, for the sake of the experiment, **precursors**. They are detected as periods of discontinuous positive momentum, or negative momentum. 

A ten-row window is used to calculate positive or negative momentum: the percent **change** of each row is measured against the row ten rows earlier.
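
To make the ten-row lookback concrete, here is a small sketch on a made-up price series (the values are illustrative, not taken from the dataset). It shows that the first ten rows have no earlier row to compare against, and that row 10 is measured against row 0.

```python
import pandas as pd

# Hypothetical mid-prices 100, 101, ..., 111 (illustration only)
mp = pd.Series([100.0 + i for i in range(12)])

lookback_period = 10  # in rows
change = mp.pct_change(periods=lookback_period)

# Rows 0..9 have no row ten steps back, so their change is NaN
n_missing = int(change.isna().sum())

# Row 10 is compared with row 0: (110 - 100) / 100
row10 = change.iloc[10]
print(n_missing, row10)
```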

## regularization of critical features
Get percent change as the basis for comprehending the limit order book (LOB).

Create new columns that depict the momentum of each row versus the row ten rows earlier, in terms of price, capitalization, and volume.

In [None]:
# Calculate the percent change of each feature over a fixed lookback period.
# Consider changing this approach: it compares only the two endpoint rows
# and doesn't check the values in between.

caps_df = capsFrame  # alias, not a copy: the new columns are added to capsFrame too
lookback_period = 10 # in rows
caps_df['change'] = caps_df['mp'].pct_change(periods=lookback_period)
caps_df['bc_change'] = caps_df['bc'].pct_change(periods=lookback_period)
caps_df['ac_change'] = caps_df['ac'].pct_change(periods=lookback_period)
caps_df['tav_change'] = caps_df['tav'].pct_change(periods=lookback_period)
caps_df['tbv_change'] = caps_df['tbv'].pct_change(periods=lookback_period)
## key components: bc_change, ac_change, tav_change, tbv_change, change
print(caps_df.shape[0], caps_df.columns)

###  establish benchmarks for percent change

The mean of change is the average ten-row percent change across LOB samples. It is used to determine whether the change over a window is significant or not.

In [None]:
# mean change over the whole period; this value changes with the window size
meanChange = round(caps_df['change'].mean(),8)
meanChange

## data mining: sequence discovery
Distinguish precursors from surges, and prepare the data as this sequence:

precursor -> surge

Prepare to cluster every precursor by the surge that follows it. Do not assume causality, only precedence.

Use the mean change as the threshold that separates precursors from surges: surges are periods whose momentum is at or above the threshold.

This step defines the data schema for the remainder of the process, where key statistics are defined for precursors and surges.
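
The labelling rule can be sketched on a toy frame. The values, the two-row window (chosen for brevity instead of ten) and the names toy, block_id and block_mean are all made up for illustration, and the grouping is done with a vectorized groupby rather than a loop, so treat it as one possible shape of the idea rather than the notebook's own code.

```python
import numpy as np
import pandas as pd

# Made-up per-row percent changes; the leading NaNs mimic the lookback gap
toy = pd.DataFrame({
    "time": range(8),
    "change": [np.nan, np.nan, 0.02, 0.04, -0.01, 0.00, 0.05, 0.03],
})
window = 2                        # rows per unit (ten in the notebook)
threshold = toy["change"].mean()  # mean change, NaNs skipped

# Assign every row to a fixed-size unit: 0, 0, 1, 1, 2, 2, ...
block_id = np.arange(len(toy)) // window
block_mean = toy.groupby(block_id)["change"].mean()

# A NaN mean fails the comparison, so that unit is labelled a precursor
labels = np.where(block_mean >= threshold, "surge", "precursor")
print(labels.tolist())
```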

In [None]:
# Split the frame into units of 10 rows and label each unit by comparing
# its mean percent change with the threshold.
### key components: bc_change, ac_change, tav_change, tbv_change, change
threshold = meanChange
window = 10  # rows per unit
surges = []
precursors = []
for i in range(0, len(caps_df), window):
    block = caps_df.iloc[i:i + window]
    first = block.iloc[0]                  # the first row represents the unit
    block_change = block['change'].mean()
    # a NaN mean (e.g. the first unit) fails the comparison and becomes a precursor
    if block_change >= threshold:
        surges.append({'time': first['time'],
                       's_MP': first['mp'],
                       'change': block_change,
                       'type': 'surge'})  #['bc', 'ac', 'tbv', 'tav', 'time', 'mp', 'minBid', 'change']
    else:
        precursors.append({'time': first['time'],
                           'p_MP': first['mp'],
                           'change': block_change,
                           'type': 'precursor',
                           'precursor_buy_cap_pct_change': first['bc_change'],
                           'precursor_ask_cap_pct_change': first['ac_change'],
                           'precursor_bid_vol_pct_change': first['tbv_change'],
                           'precursor_ask_vol_pct_change': first['tav_change']
                           })

In [None]:
#for item in surges[:2]:
    #print(item)

In [None]:
#for item in precursors:
    #print(item)

## preprocess: merge precursors and surges into time series

A dataframe of sequences, **sequence_df**, is created by concatenating both buckets and sorting by time. This creates a time series of surge and precursor periods, as defined by:

* 10-row window percent change values
* contiguity: precursors and surges sit next to each other in time, so each run has a length, or duration of momentum.
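
A small sketch of the concatenation on made-up rows (the frames surges_toy and precursors_toy are hypothetical). Because the two buckets carry different price columns, each row ends up with an empty value in the column that belongs to the other bucket, which is why a later fill step is needed.

```python
import pandas as pd

# Made-up rows for illustration only
surges_toy = pd.DataFrame([{"time": 2, "s_MP": 101.0, "type": "surge"}])
precursors_toy = pd.DataFrame([
    {"time": 1, "p_MP": 100.0, "type": "precursor"},
    {"time": 3, "p_MP": 100.5, "type": "precursor"},
])

# Stack both buckets, then order them on the shared time column
sequence_toy = (
    pd.concat([surges_toy, precursors_toy], ignore_index=True)
      .sort_values(by="time")
      .reset_index(drop=True)
)
print(sequence_toy)
```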

In [None]:
surges_df = pd.DataFrame(surges)
precursors_df = pd.DataFrame(precursors)
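# reset the index so every row has a unique label after the two buckets are combined
sequence_df = pd.concat([surges_df, precursors_df]).sort_values(by=['time'], ascending=True).reset_index(drop=True)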

In [None]:
sequence_df.index

### view the aligned, continuous time series of precursors and surges

View the final abstraction: sets of precursor periods next to surges, in a linear time series. Precursors precede surges on this timeline.

In [None]:
# for index, row in sequence_df.iterrows():
#     print(row['surge'], row['precursor'])
sequence_df['type'].head(40)

In [None]:
# sequence_df.head(45)

## visualize proof of algorithmic accuracy

This chart plots the price time series, with the precursor and surge areas marked, as evidence of our algorithmic accuracy.

In [None]:
subset = sequence_df[:4999]  # stays under Altair's default limit of 5000 rows
line = alt.Chart(subset).mark_line(color='green').encode(
    x='time',
    y='s_MP'
)

s_bar = alt.Chart(subset).mark_bar().encode(
    x='time',
    y='s_MP',
    color='type:N'
)

p_bar = alt.Chart(subset).mark_bar().encode(
    x='time',
    y='p_MP',
    color='type:N'
)

title = 'Data Mining Accuracy, Surge vs Precursor Sequence'
subtitle = 'Precursors are contiguous periods where percentage rate of growth is less than threshold'
# properties() returns a new chart, so the title is set in the same call and the result is kept
chart = (s_bar + p_bar + line).properties(
    width=600,
    height=500,
    title=alt.TitleParams(text=[title, subtitle], baseline='bottom', orient='top', anchor='start', fontSize=14)
)
chart.interactive()

In [None]:
sequence_df.columns

### data mining 2: information gain, create new features

Perform information gain on grouped precursors and surges:

Define the **sum change**, or total change per continuous episode (precursor or surge).

Define the **length** of each episode.

Define the height of the surge: how high did the continuous positive momentum reach?

Define the size (area) of the surge as length times sum change, stored as **surge_area**.

Create one line that describes a precursor or surge and its related order book statistics.
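
The episode features can be sketched on a toy frame with made-up labels and changes. A new group id starts whenever the type differs from the previous row, and the per-episode statistics are then broadcast back to every row with transform.

```python
import pandas as pd

# Made-up sequence of labelled units (illustration only)
toy = pd.DataFrame({
    "type":   ["precursor", "precursor", "surge", "surge", "surge", "precursor"],
    "change": [-0.01, 0.00, 0.02, 0.03, 0.01, -0.02],
})

# The id increases by one each time the label changes, so runs share an id
toy["group"] = (toy["type"] != toy["type"].shift(1)).cumsum()

# Per-episode statistics, repeated on every row of the episode
toy["length"] = toy.groupby("group")["change"].transform("count")
toy["sum_change"] = toy.groupby("group")["change"].transform("sum")
toy["area"] = toy["length"] * toy["sum_change"]
print(toy)
```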

In [None]:
sequence_df['group'] = (sequence_df['type'] != sequence_df['type'].shift(1)).cumsum()
columns_to_transform = [
    'precursor_buy_cap_pct_change',
    'precursor_ask_cap_pct_change',
    'precursor_bid_vol_pct_change',
    'precursor_ask_vol_pct_change'
]

for col in columns_to_transform:
    sequence_df[col] = sequence_df.groupby('group')[col].transform(lambda x: x.sum() if not x.isna().all() else np.nan)

In [None]:
#### imputation

In [None]:
# impute missing values with the last non-null value
# (s_MP is empty on precursor rows and p_MP on surge rows after the concat)
fill_cols = [
    's_MP',
    'p_MP',
    'precursor_buy_cap_pct_change',
    'precursor_ask_cap_pct_change',
    'precursor_bid_vol_pct_change',
    'precursor_ask_vol_pct_change'
]
sequence_df[fill_cols] = sequence_df[fill_cols].ffill()

In [None]:
#sequence_df['group'] = (sequence_df['type'] != sequence_df['type'].shift(1)).cumsum()

In [None]:
sequence_df['length'] = sequence_df.groupby(['type', 'group'])['group'].transform('count')

print(sequence_df.shape[0])
sequence_df['sum_change'] = sequence_df.groupby(['type', 'group'])['change'].transform('sum')

sequence_df['max_surge_mp'] = sequence_df.groupby(['type', 'group'])['s_MP'].transform('max')
sequence_df['min_surge_mp'] = sequence_df.groupby(['type', 'group'])['s_MP'].transform('min')

sequence_df['max_precursor_mp'] = sequence_df.groupby(['type', 'group'])['p_MP'].transform('max')
sequence_df['min_precursor_mp'] = sequence_df.groupby(['type', 'group'])['p_MP'].transform('min')

# area of an episode: length times summed change
sequence_df['area'] = sequence_df['length'] * sequence_df['sum_change']

# surge_area is set on surge rows only; precursor rows stay NaN
is_surge = sequence_df['type'] == 'surge'
sequence_df.loc[is_surge, 'surge_area'] = sequence_df.loc[is_surge, 'area']

# percentage by which max_surge_mp exceeds max_precursor_mp
sequence_df['surge_targets_met_pct'] = (
    (sequence_df['max_surge_mp'] - sequence_df['max_precursor_mp'])
    / sequence_df['max_precursor_mp'] * 100
)

In [None]:
# surge_targets_met_pct, created in the previous cell, is the percentage by which
# max_surge_mp exceeds max_precursor_mp. It is built on a dataframe with the attributes
# ['group', 'time', 's_MP', 'p_MP', 'change', 'type', 'length', 'sum_change',
#  'max_surge_mp', 'min_surge_mp', 'max_precursor_mp', 'min_precursor_mp',
#  'area', 'surge_area', 'precursor_buy_cap_pct_change',
#  'precursor_ask_cap_pct_change', 'precursor_bid_vol_pct_change',
#  'precursor_ask_vol_pct_change'], grouped by type and group.

print(sequence_df.columns)
print(sequence_df.shape[0])

In [None]:
sequence_df.head(30)

## data mining 3: form final sequences by statistical weight

Group by the unique episode identifier (group), keeping the first row of each episode.

In [None]:
unique_df = sequence_df.groupby('group').first().reset_index()
# print(unique_df)

In [None]:
unique_df.head(20)

#### Merge even and odd Rows to form the final sequences

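Precursor and surge episodes alternate in unique_df. Once the frame starts with a precursor, the even positions (0, 2, 4, ...) hold precursors and the odd positions hold surges. **When you merge them side by side, you form a sequence of precursor and surge.**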

Each row will contain a continuous **precursor->surge** sequence.
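
A sketch of the pairing on a made-up frame that already starts with a precursor. The p_ and s_ prefixes are an assumption added here so the two halves keep distinct column names; they are not part of the notebook's schema.

```python
import pandas as pd

# Made-up episodes, one row per group, alternating precursor and surge
unique_toy = pd.DataFrame({
    "group": [1, 2, 3, 4],
    "type": ["precursor", "surge", "precursor", "surge"],
    "sum_change": [-0.01, 0.06, -0.02, 0.04],
})

# Even positions are precursors, odd positions are the surges that follow
precursor_rows = unique_toy.iloc[::2].reset_index(drop=True).add_prefix("p_")
surge_rows = unique_toy.iloc[1::2].reset_index(drop=True).add_prefix("s_")

# Side-by-side concat: each row is one precursor -> surge sequence
pairs = pd.concat([precursor_rows, surge_rows], axis=1)
print(pairs)
```

Resetting the index on both halves is what lets the concat line them up row by row; without it the original positions 0, 2 and 1, 3 would not align.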

In [None]:
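# the sequence needs to start with a precursor, so drop a leading surge if there is one
if unique_df.iloc[0]['type'] == 'surge':
    unique_df = unique_df.iloc[1:]
even_df = unique_df.iloc[::2].reset_index(drop=True)   # precursors
odd_df = unique_df.iloc[1::2].reset_index(drop=True)   # the surge that follows each precursor

merged_df = pd.concat([even_df, odd_df], axis=1)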

# print(merged_df)

In [None]:
merged_df[:10]

In [None]:
nan_cols = merged_df.dropna(axis=1, how='all')
nan_cols.head()

In [None]:
nan_cols.columns

### Write to CSV: step one, pipeline
Label to use is surge_targets_met_pct

In [None]:
nan_cols.to_csv('pipeline1.csv', index=False)