In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# G-Research 01: Data Reading, Formatting and Grouping by Time

In this notebook, I would like to share some useful functionality for the G-Research project.
- Data formatting, and keeping the attribute names in a named tuple so that they can be used anywhere in the project in a straightforward way. This approach lets you change the name of an attribute in one place without affecting the rest of the code.
- Grouping by time. This is a very important feature. The basic goal is to get the trading data set with all its attributes, but on a different time scale.

## General Python Libraries

In [None]:
from datetime import datetime

## My Code Exported from Scripts

I always keep large chunks of code outside of the notebook, in separate scripts. That way, I can reuse the code anywhere and anytime, and the notebooks stay shorter and more readable.

In [None]:
from g_research_data import A
from g_research_transformations import DFAttributesTypeTransformer
from g_research_transformations import ManyToDatetimeIndexTransformer
from g_research_transformations import GroupDFBlocksByTimeTransformer

## Reading the Data

First of all, I read the data from the file.

> **NOTE:** In this notebook, I am using just a sample of the data, which can of course be replaced by the whole data set. To use all the data, just delete the sampling cell marked by the comments below.

In [None]:
data = pd.read_csv('../input/g-research-crypto-forecasting/train.csv')
print(data.shape)
data.head()

> **TO USE THE WHOLE DATA SET JUST DELETE THE FOLLOWING CELL**

In [None]:
# TAKING JUST A SAMPLE ###########################################################
data = data.loc[0:10000,]
print(data.shape)
data.head()
# END(TAKING JUST A SAMPLE) ######################################################
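
If only a sample is needed, it may be cheaper to limit the rows at read time instead of loading the full file and slicing it afterwards. The following is a minimal sketch of that idea using the `nrows` argument of `pd.read_csv`; the in-memory CSV is made-up data standing in for *train.csv*, not the real file.

```python
import io

import pandas as pd

# Made-up stand-in for train.csv (hypothetical values, only two columns).
csv_text = "timestamp,Asset_ID\n" + "\n".join(
    f"{1514764860 + 60 * i},{i % 3}" for i in range(100)
)

# nrows stops parsing after the requested number of data rows,
# so the rest of the file is never loaded into memory.
sample = pd.read_csv(io.StringIO(csv_text), nrows=10)
print(sample.shape)
```

With the real file, passing `nrows=10001` to the `pd.read_csv` call above should give the same rows as `data.loc[0:10000,]`, because label slicing with `.loc` includes the end point.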

## Formatting the Data

Next, I do the following:
- I created a NamedTuple of attribute names and types. It can be found in the *g_research_data* file.
- I format the timestamp to match my future use.
- I created a transformer for properly setting the attribute types. At this point, I am using *DFAttributesTypeTransformer* from the file *g_research_transformations*.

> NOTE: In the file *g_research_transformations*, I am using the same interface for all the transformation classes. That is why you can see a parent class and then its child classes.

In [None]:
# Example usage of variable A
print(A.timestamp)
print(A.timestamp.name)
print(A.timestamp.type)
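
The contents of *g_research_data* are not shown in this notebook, so the following is only a sketch of how such a nested named tuple could be organized. The class names `Attribute` and `Attributes`, the variable `A_sketch`, and the dtypes are assumptions for illustration; only the column names come from *train.csv*.

```python
from typing import NamedTuple


class Attribute(NamedTuple):
    """One column of the data set: its name in the CSV and its target dtype."""
    name: str
    type: str


class Attributes(NamedTuple):
    """All columns in one place (only a few are listed in this sketch)."""
    timestamp: Attribute
    asset_id: Attribute
    open: Attribute
    close: Attribute


# Hypothetical stand-in for the real `A`; the dtypes are assumptions.
A_sketch = Attributes(
    timestamp=Attribute("timestamp", "datetime64[ns]"),
    asset_id=Attribute("Asset_ID", "int64"),
    open=Attribute("Open", "float64"),
    close=Attribute("Close", "float64"),
)

print(A_sketch.open.name)  # column name used in the data frame
print(A_sketch.open.type)  # dtype the column should be cast to

# Each element is a (name, type) pair, so dict() yields a name -> type mapping.
print(dict(A_sketch))
```

With this layout, renaming a column means editing a single line in the definition, and every `A_sketch.open.name` lookup picks up the change.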

In [None]:
# Convert Unix timestamps (seconds) to datetimes; vectorized and independent of the local time zone
data[A.timestamp.name] = pd.to_datetime(data[A.timestamp.name], unit="s")

type_transformer = DFAttributesTypeTransformer()
df = type_transformer.fit_predict(data, dict(A))

df.head()
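
The source of *DFAttributesTypeTransformer* lives in the script file, so here is only a rough sketch of what a type transformer with a shared `fit_predict` interface might look like. The names `BaseTransformer` and `TypeTransformerSketch` are hypothetical, and the real class may work differently.

```python
import pandas as pd


class BaseTransformer:
    """Hypothetical parent class: every transformer exposes fit_predict."""

    def fit_predict(self, df, *args):
        raise NotImplementedError


class TypeTransformerSketch(BaseTransformer):
    """Casts each listed column to the requested dtype."""

    def fit_predict(self, df, types):
        # Ignore entries for columns that are not present in this frame.
        known = {col: dtype for col, dtype in types.items() if col in df.columns}
        # astype accepts a column -> dtype mapping and returns a new frame,
        # so the input data frame is left untouched.
        return df.astype(known)


# Made-up input where every value is still a string.
raw = pd.DataFrame({"Asset_ID": ["0", "1"], "Open": ["13.5", "8.25"]})
typed = TypeTransformerSketch().fit_predict(
    raw, {"Asset_ID": "int64", "Open": "float64"}
)
print(typed.dtypes)
```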

## Taking Only One Asset

I take just one asset to demonstrate the functionality.

In [None]:
take_asset = 0
df_asset = df.loc[df[A.asset_id.name] == take_asset,]
print(df_asset.shape)
df_asset.head(16)

## Grouping Rows Based on Time

The prediction time frame (window) is very important in time series. I created a transformer class that groups rows by time window and creates attributes relevant to cryptocurrency trading.

The transformer *GroupDFBlocksByTimeTransformer* from the file *g_research_transformations* is used as follows:
- The first parameter is a data frame with the date-time attribute in datetime index format and another column to be transformed.
- The second parameter is the name of the date-time attribute.
- The third parameter is the window length, as described [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.floor.html). Examples:
    - "15min", "75min", ... for minutes,
    - "1H", ... for hours,
    - "1D", ... for days.
- The fourth parameter is the name of the grouping function. Currently available are: "mean", "max", "min", "first", "last".
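
To make the four parameters more concrete, here is a small sketch of how such a grouping could be done with plain pandas. It is not the code of *GroupDFBlocksByTimeTransformer*; the function name `group_by_time` is hypothetical, and returning the window start as a regular column is an assumption. The upper-case result column (for example "FIRST") follows the naming used later in this notebook.

```python
import pandas as pd


def group_by_time(df, time_column, window_length, grouping_function):
    """Group the non-time column of `df` into windows of `window_length`."""
    allowed = {"mean", "max", "min", "first", "last"}
    if grouping_function not in allowed:
        raise ValueError(f"Unsupported grouping function: {grouping_function}")

    # Round every timestamp down to the start of its window.
    window_start = df[time_column].dt.floor(window_length)

    # The single remaining column holds the values to aggregate.
    value_column = [col for col in df.columns if col != time_column][0]
    grouped = df.groupby(window_start)[value_column].agg(grouping_function)

    # Name the result after the function, e.g. "FIRST" or "MAX".
    return grouped.rename(grouping_function.upper()).reset_index()


# Made-up minute data: 20 rows starting at midnight.
demo = pd.DataFrame({
    "timestamp": pd.date_range("2018-01-01", periods=20, freq="min"),
    "Open": range(20),
})
print(group_by_time(demo, "timestamp", "10min", "first"))
```

The 20 one-minute rows collapse into two 10-minute rows, each keeping the first value of its window.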

Let us have a look at examples useful for our use case.

In [None]:
grouping_transformer = GroupDFBlocksByTimeTransformer()

In [None]:
window_length = "10min"

In the following cell, I create a grouped *open* column. It is the first value in each interval.

You can check the values against the output of the first cell in the section *Taking Only One Asset*.

In [None]:
grouping_function = "first"
attribute_from = A.open.name

df_grouped = grouping_transformer.fit_predict(
    df_asset[[A.timestamp.name, attribute_from]], 
    A.timestamp.name, 
    window_length, 
    grouping_function
)
df_grouped.head()

In the following cell, I create a grouped *high* column. It is the maximum value in each interval.

You can check the values against the output of the first cell in the section *Taking Only One Asset*.

In [None]:
grouping_function = "max"
attribute_from = A.high.name

df_grouped = grouping_transformer.fit_predict(
    df_asset[[A.timestamp.name, attribute_from]], 
    A.timestamp.name, 
    window_length, 
    grouping_function
)
df_grouped.head()

## Recreating the Data Frame

Let us use the approach above to recreate all the OHLC attributes (Open, High, Low, Close) for a window length of 10 minutes.

In [None]:
window_length = "10min"

# Open
grouping_function = "first"
attribute_from = A.open.name
df_final = grouping_transformer.fit_predict(
    df_asset[[A.timestamp.name, attribute_from]], 
    A.timestamp.name, 
    window_length, 
    grouping_function
)

# Rename the grouped column ("FIRST") back to the source attribute name
df_final.rename(columns={grouping_function.upper(): attribute_from}, inplace=True)

# High
grouping_function = "max"
attribute_from = A.high.name
df_pom = grouping_transformer.fit_predict(
    df_asset[[A.timestamp.name, attribute_from]], 
    A.timestamp.name, 
    window_length, 
    grouping_function
)
df_final[attribute_from] = df_pom[grouping_function.upper()]

# Low
grouping_function = "min"
attribute_from = A.low.name
df_pom = grouping_transformer.fit_predict(
    df_asset[[A.timestamp.name, attribute_from]], 
    A.timestamp.name, 
    window_length, 
    grouping_function
)
df_final[attribute_from] = df_pom[grouping_function.upper()]

# Close
grouping_function = "last"
attribute_from = A.close.name
df_pom = grouping_transformer.fit_predict(
    df_asset[[A.timestamp.name, attribute_from]], 
    A.timestamp.name, 
    window_length, 
    grouping_function
)
df_final[attribute_from] = df_pom[grouping_function.upper()]

df_final.head()
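
For comparison, the same OHLC regrouping can be sketched with the built-in pandas `resample`, which is a different technique from the transformer used above. The data below is made up, and the column names are assumed to match *train.csv*.

```python
import pandas as pd

# Made-up minute bars for a single asset (hypothetical values).
minute_bars = pd.DataFrame({
    "timestamp": pd.date_range("2018-01-01", periods=20, freq="min"),
    "Open": [float(i) for i in range(20)],
    "High": [i + 2.0 for i in range(20)],
    "Low": [i - 2.0 for i in range(20)],
    "Close": [i + 1.0 for i in range(20)],
})

# One aggregation per column: first open, highest high, lowest low, last close.
ohlc = (
    minute_bars
    .set_index("timestamp")
    .resample("10min")
    .agg({"Open": "first", "High": "max", "Low": "min", "Close": "last"})
)
print(ohlc)
```

One difference to keep in mind: `resample` also emits a row for every empty window inside the covered time range (filled with NaN), which may or may not be what you want when the minute data has gaps.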

I hope this short tutorial shows you how to effectively change the window length of the data set!