# OHLCV with Technical Indicators 
> Inspired by this informative [work](https://www.kaggle.com/takemi/ohlc-charts-candlestick-charts), I wrote this notebook to show a basic way to build OHLCV data and derive **technical indicators** from it, which might be used as new features. I hope it helps!

### Package - `finta`
#### 1. Download package files
> In this work, I use a package named `finta` to quickly compute the **technical indicators**. To use the package in the notebook, I first download it and attach the files to the notebook as data. The steps to download the package files are as follows:

In [None]:
# !pip download finta -d ./finta/

# import os
# from zipfile import ZipFile

# dirName = "./finta/"
# zipName = "finta.zip"

# # Create a ZipFile Object
# with ZipFile(zipName, 'w') as zipObj:
#     # Iterate over all the files in directory
#     for folderName, subfolders, filenames in os.walk(dirName):
#         for filename in filenames:
#             if (filename != zipName):
#                 # create complete filepath of file in directory
#                 filePath = os.path.join(folderName, filename)
#                 # Add file to zip
#                 zipObj.write(filePath)

#### 2. Install the package
> After adding the data and refreshing, you can use the following command to install the package. Then, we can play around with it.

In [None]:
!pip install finta --no-index --find-links=file:///kaggle/input/fin-ta/finta

In [None]:
# Import packages 
import os 
import glob
import gc 
import yaml
import math
import warnings
from tqdm import tqdm
from functools import reduce

import pandas as pd
from finta import TA
from numba import jit  
import numpy as np 
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from joblib import Parallel, delayed

# Configuration 
warnings.simplefilter('ignore')
pd.set_option('display.max_columns', 300)

In [None]:
# Variable definitions
DATA_PATH = "../input/optiver-realized-volatility-prediction"
__DATA_DIRS__ = ['book_train.parquet', 
                 'trade_train.parquet', 
                 'book_test.parquet', 
                 'trade_test.parquet']
FILE_LIST_MAP = {
    'book_train': glob.glob(os.path.join(DATA_PATH, "book_train.parquet/*")),
    'trade_train': glob.glob(os.path.join(DATA_PATH, "trade_train.parquet/*")), 
    'book_test': glob.glob(os.path.join(DATA_PATH, "book_test.parquet/*")), 
    'trade_test': glob.glob(os.path.join(DATA_PATH, "trade_test.parquet/*"))
}
ORDER_PRICE = ['bid_price1', 'bid_price2', 'ask_price1', 'ask_price2']
ORDER_VOLUME = ['bid_size1', 'bid_size2', 'ask_size1', 'ask_size2']
TI_NAMES = [func for func in dir(TA) if callable(getattr(TA, func)) and not func.startswith("_")]

In [None]:
# Utility functions 
def get_wap(df_book):
    '''Compute estimated price series.
    
    Parameters:
        df_book: pd.DataFrame, raw information of book data
    
    Return: 
        wap: pd.DataFrame, estimated price series with time identifiers
    '''
    df_book_ = df_book.copy()
    df_book_['wap1'] = ((df_book_['bid_price1']*df_book_['ask_size1'] + 
                        df_book_['ask_price1']*df_book_['bid_size1']) /
                        (df_book_['bid_size1'] + df_book_['ask_size1']))
    wap = df_book_.loc[:, ['time_id', 'seconds_in_bucket', 'wap1']]
    return wap

def get_ohlcv(prices, volumes, scale=10):
    '''Return OHLCV of stock price based on the Kline scale (sec).
    
    Parameters:
        prices: pd.Series, the estimated price series
        volumes: pd.Series, the trading volume series
        scale: int, the scale of the Kline (sec), default=10
        
    Return:
        OHLCV: pd.DataFrame, four prices and trading volumes of the stock based on the Kline scale 
    '''
    if 600 % scale != 0:
        raise ValueError("Choose a scale that divides 600 seconds evenly...")
    
    # Collect one row per Kline window (DataFrame.append was removed in pandas 2.0)
    sticks = []
    for i in range(0, 600, scale):
        p_window = prices.iloc[i:i+scale]
        v_window = volumes.iloc[i:i+scale]
        sticks.append({
            'open': p_window.iloc[0],
            'high': np.max(p_window),
            'low': np.min(p_window),
            'close': p_window.iloc[-1],
            'volume': np.sum(v_window),
        })
    OHLCV = pd.DataFrame(sticks, columns=['open', 'high', 'low', 'close', 'volume'])
    # Label each window by its end second
    OHLCV.insert(0, column='sec', value=[scale*(i+1) for i in range(0, 600//scale)])
    
    return OHLCV

def ffill(df):
    '''Expand order book data to one row per second, forward filling within each time_id
    and then backward filling any seconds before the first quote.
    '''
    df_ = df.set_index(['time_id', 'seconds_in_bucket'])
    full_index = pd.MultiIndex.from_product(
        [df_.index.levels[0], np.arange(0, 600)],
        names=['time_id', 'seconds_in_bucket'])
    df_ = df_.reindex(full_index)   # Missing seconds become NaN rows
    # Fill per time_id so values never leak from one bucket into the next
    df_ = df_.groupby(level='time_id').ffill()
    df_ = df_.groupby(level='time_id').bfill()
    df_.reset_index(inplace=True)
    
    return df_
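
> As a quick sanity check of the WAP formula used in `get_wap`, here is a minimal sketch on a made-up two-row order book (the numbers are invented for illustration, not taken from the competition data). Each price is weighted by the size on the opposite side, so a heavier bid side pulls the WAP toward the ask price.

```python
import pandas as pd

# Toy order book: values are invented for illustration only
toy_book = pd.DataFrame({
    'bid_price1': [0.99, 1.00],
    'ask_price1': [1.01, 1.02],
    'bid_size1':  [100, 300],
    'ask_size1':  [100, 100],
})

# Same formula as get_wap: each price is weighted by the opposite side's size
toy_book['wap1'] = (
    (toy_book['bid_price1'] * toy_book['ask_size1']
     + toy_book['ask_price1'] * toy_book['bid_size1'])
    / (toy_book['bid_size1'] + toy_book['ask_size1'])
)

# Row 0 is balanced, so the WAP is the mid price (1.00);
# row 1 has more bid size, so the WAP (1.015) sits closer to the ask (1.02)
print(toy_book['wap1'].round(4).tolist())
```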

### Simple Illustration of Candlestick Chart
> The following is a simple illustration of OHLCV data using the `Candlestick` trace provided by `plotly`.

In [None]:
df_order = pd.read_parquet(os.path.join(DATA_PATH, "book_train.parquet/stock_id=0/"))
df_trade = pd.read_parquet(os.path.join(DATA_PATH, "trade_train.parquet/stock_id=0/"))
df_order = df_order[df_order['time_id'] == 5]
df_order = ffill(df_order)
df_trade = df_trade[df_trade['time_id'] == 5]
df_order.head()
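
> The order book only has rows for seconds where something changed, which is why the gaps are filled before computing prices. Below is a minimal sketch of that idea on an invented 5-second bucket (the real buckets span 600 seconds). It is a standalone illustration of forward filling followed by backward filling within a `time_id`, not a copy of the helper above.

```python
import numpy as np
import pandas as pd

# Sparse toy book: only seconds 1 and 3 are observed (values are made up)
sparse = pd.DataFrame({
    'time_id': [5, 5],
    'seconds_in_bucket': [1, 3],
    'bid_price1': [0.99, 1.00],
}).set_index(['time_id', 'seconds_in_bucket'])

# Target index: every second of the (shortened) bucket
full_index = pd.MultiIndex.from_product(
    [[5], np.arange(5)], names=['time_id', 'seconds_in_bucket'])

dense = sparse.reindex(full_index)                # missing seconds become NaN
dense = dense.groupby(level='time_id').ffill()    # carry the last quote forward
dense = dense.groupby(level='time_id').bfill()    # fill seconds before the first quote
print(dense['bid_price1'].tolist())
```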

In [None]:
wap = get_wap(df_order)
df = wap.merge(df_trade.loc[:, ['seconds_in_bucket', 'size']], on=['seconds_in_bucket'], how='outer')
df.fillna(0, inplace=True)
ohlcv = get_ohlcv(df['wap1'], df['size'], scale=10)

fig = make_subplots(rows=2, cols=1, shared_xaxes=True)
fig.add_trace(go.Candlestick(
    x=ohlcv['sec'],
    open=ohlcv['open'], 
    high=ohlcv['high'],
    low=ohlcv['low'], 
    close=ohlcv['close']),
    row=1, col=1
)
fig.add_trace(go.Bar(
    x=ohlcv['sec'], 
    y=ohlcv['volume']),
    row=2, col=1
)

fig.update(layout_xaxis_rangeslider_visible=False)
fig.show()

### TI DataFrame Generation 
> Next, we generate a complete DataFrame containing **technical indicators** for all (`stock_id`, `time_id`) pairs.

In [None]:
def get_dataset(datatype, n_jobs, scale):
    '''Return processed dataset after running parallel dataset generation.
    
    Parameters:
        datatype: str, dataset type, the choices are as follows:
            {'train', 'test'}
        n_jobs: int, number of parallel workers (-1 uses all available cores)
        scale: int, the scale of the Kline (sec)
    
    Return:
        df_proc: pd.DataFrame, dataset containing derived technical indicators
    '''
    df_proc = Parallel(n_jobs=n_jobs)(
        delayed(gen_dataset)(book_file, 
                             trade_file,  
                             scale) 
        for book_file, trade_file in tqdm(
            zip(sorted(FILE_LIST_MAP[f'book_{datatype}']), sorted(FILE_LIST_MAP[f'trade_{datatype}']))
        )
    )    

    df_proc = pd.concat(df_proc, ignore_index=True)

    return df_proc

def gen_dataset(book_file, trade_file, scale):
    '''Generate dataset for one stock.
    
    Parameters:
        book_file: str, file path of the book data 
        trade_file: str, file path of the trade data
        scale: int, the scale of the Kline (sec)
    '''
    assert book_file.split('=')[1] == trade_file.split('=')[1]
    stock_id = book_file.split('=')[1]
    df_book = pd.read_parquet(book_file)   # Order book dataframe for a single stock
    df_book = ffill(df_book)
    df_trade = pd.read_parquet(trade_file)   # Trade dataframe for a single stock 

    df = get_wap(df_book)
    df = df.merge(df_trade.loc[:, ['time_id', 'seconds_in_bucket', 'size']], on=['time_id', 'seconds_in_bucket'], how='outer')
    df.fillna(0, inplace=True)
    
    del df_book, df_trade
    _ = gc.collect()
    
    tis = get_tis(df, scale)
    stats = {col: ['mean', 'median', 'min', 'max', 'std'] for col in tis.columns if col != 'time_id'}
    df_stats = cal_stats(tis, stats)
    df_stats['row_id'] = df_stats['time_id'].apply(lambda t: f'{stock_id}-{t}')
    df_stats['stock_id'] = int(stock_id)
    print(f"stock{stock_id} completes~")
    
    return df_stats

def get_tis(df, scale):
    '''Compute all technical indicators provided by finta.
    
    Parameters:
        df: pd.DataFrame, book data merged with trade data
        scale: int, the scale of the Kline (sec) 
    '''
    tis = pd.DataFrame()
    for time_id, gp in df.groupby('time_id'):
        ohlcv_ = get_ohlcv(gp['wap1'], gp['size'], scale=scale)
        ohlcv_.set_index(pd.DatetimeIndex(ohlcv_['sec']), inplace=True)
        ohlcv_.drop('sec', axis=1, inplace=True)
        for ti_name, ti in TIS.items():
            result = ti(ohlcv_)
            if isinstance(result, pd.Series):
                result.name = ti_name
            else:
                result.columns = [f'{ti_name}_{col}' for col in result.columns]
            ohlcv_ = ohlcv_.merge(result, right_index=True, left_index=True)
        ohlcv_['time_id'] = int(time_id)
        tis = pd.concat([tis, ohlcv_], ignore_index=True)
    return tis 

def cal_stats(df, ft_stats):
    '''Calculate specified stats for given dataframe.
    
    Parameters:
        df: pd.DataFrame, dataframe containing raw features
        ft_stats: dict[str, list], mapping relationship between features and the stats (e.g., mean, median, min, max, std)
        
    Return:
        df_stats: pd.DataFrame, dataframe containing derived stats
    '''
    df_ = df.groupby(by=['time_id'])   # Group samples based on time_id
    df_stats = df_.agg(ft_stats)   # Aggregate each feature with its requested stats
    df_stats.columns = ['_'.join(sub_str) for sub_str in df_stats.columns]
    df_stats.reset_index(inplace=True)
    
    return df_stats
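
> To make the column handling in `cal_stats` concrete, here is a small sketch on an invented table. Passing a dict of `{column: [stats]}` to `agg` produces two-level column names such as `('close', 'mean')`, which are then joined with an underscore.

```python
import pandas as pd

# Toy indicator table: two time_ids and one made-up feature column
toy = pd.DataFrame({
    'time_id': [5, 5, 11, 11],
    'close':   [1.0, 3.0, 2.0, 2.0],
})

# A dict of {column: [stats]} yields two-level columns like ('close', 'mean')
agg = toy.groupby('time_id').agg({'close': ['mean', 'max']})

# Join the two levels into flat names, as cal_stats does
agg.columns = ['_'.join(pair) for pair in agg.columns]
agg = agg.reset_index()
print(agg.columns.tolist())  # ['time_id', 'close_mean', 'close_max']
```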

### Technical Indicator Computation
> Finally, we can use the built-in functions to compute the **technical indicators**. 

In [None]:
# To use every technical indicator except the ones that raise errors, uncomment:
# TIS = {ti: getattr(TA, ti) for ti in TI_NAMES if ti not in ['LWMA', 'VIDYA', 'ALMA', 'MAMA', 'SWI', 'TMF']}
# For a quick demo, compute only the Accumulation/Distribution Line (ADL)
TIS = {'ADL': getattr(TA, 'ADL')}
TIS
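
> For intuition about what a single indicator computes, below is a minimal sketch of the Accumulation/Distribution Line written directly in pandas. It follows the textbook definition (money flow multiplier times volume, accumulated over bars). I am assuming, not verifying, that `TA.ADL` matches this definition, and the OHLCV values are invented.

```python
import pandas as pd

# Toy OHLCV bars: values are made up for illustration
toy_ohlcv = pd.DataFrame({
    'open':   [1.00, 1.02, 1.01],
    'high':   [1.04, 1.04, 1.03],
    'low':    [1.00, 1.00, 0.99],
    'close':  [1.03, 1.01, 1.01],
    'volume': [100.0, 200.0, 100.0],
})

# Money flow multiplier: +1 when close == high, -1 when close == low
spread = toy_ohlcv['high'] - toy_ohlcv['low']
mfm = ((toy_ohlcv['close'] - toy_ohlcv['low'])
       - (toy_ohlcv['high'] - toy_ohlcv['close'])) / spread

# Money flow volume, accumulated bar by bar
adl = (mfm * toy_ohlcv['volume']).cumsum()
print(adl.round(2).tolist())

# Note: a flat bar (high == low) makes spread zero and yields NaN or inf here,
# which can happen when the WAP does not move within a short Kline window.
```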

In [None]:
# Compute the selected indicator(s) on the training set with all cores and 10-second Klines
df_ti = get_dataset('train', -1, 10)

### Issue and Summary
> The main problem I ran into is that the computation takes too long. There are many ways to speed it up, but I haven't had time to optimize the process. Still, I hope this notebook gives you some inspiration and helps you use these **technical indicators** to get better performance!
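
> As one possible starting point for a speed-up, here is a hedged sketch that builds the OHLCV bars for every `time_id` in a single `groupby` instead of a Python loop over windows. The function name `get_ohlcv_fast` is hypothetical and not part of the notebook above. It assumes the input has one row per second (0 to 599) for each `time_id`, sorted by `seconds_in_bucket`, as produced after `ffill` and the merge with trades. I have not benchmarked it against the loop version.

```python
import numpy as np
import pandas as pd

def get_ohlcv_fast(df, scale=10):
    """Hypothetical vectorized alternative to calling get_ohlcv per time_id.

    Assumes df has columns time_id, seconds_in_bucket, wap1 and size,
    sorted by seconds_in_bucket within each time_id.
    """
    if 600 % scale != 0:
        raise ValueError("scale must divide 600 evenly")
    # Bar index within the bucket: seconds 0..scale-1 map to bar 0, and so on
    bar = (df['seconds_in_bucket'] // scale).rename('bar')
    ohlcv = (
        df.groupby(['time_id', bar], sort=True)
          .agg(open=('wap1', 'first'),
               high=('wap1', 'max'),
               low=('wap1', 'min'),
               close=('wap1', 'last'),
               volume=('size', 'sum'))
          .reset_index()
    )
    # Label each bar by its end second, like the 'sec' column of get_ohlcv
    ohlcv['sec'] = (ohlcv['bar'] + 1) * scale
    return ohlcv.drop(columns='bar')

# Toy input: two time_ids, one row per second, made-up prices and sizes
toy = pd.DataFrame({
    'time_id': np.repeat([5, 11], 600),
    'seconds_in_bucket': np.tile(np.arange(600), 2),
    'wap1': np.tile(np.linspace(1.0, 1.1, 600), 2),
    'size': 1,
})
bars = get_ohlcv_fast(toy, scale=10)
print(bars.shape)  # (120, 7): 60 bars for each of the 2 time_ids
```

> The per-`time_id` indicator loop would still be needed afterwards, since each indicator has to be computed on one bucket's bars at a time.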