In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

In [None]:
df_train = pd.read_csv('/kaggle/input/g-research-crypto-forecasting/train.csv')
df_train.head()

In [None]:
df_asset_details = pd.read_csv('/kaggle/input/g-research-crypto-forecasting/asset_details.csv')
df_asset_details

# <center>DATA FEATURES</center> 

We can see the different features included in the dataset. Specifically, the features included per asset are the following:
*   **timestamp**: All timestamps are Unix timestamps in seconds (the number of seconds elapsed since 1970-01-01 00:00:00.000 UTC). Timestamps in this dataset are multiples of 60, indicating minute-by-minute data.
*   **Asset_ID**: The asset ID corresponding to one of the cryptocurrencies (e.g. `Asset_ID = 1` for Bitcoin). The mapping from `Asset_ID` to crypto asset is contained in `asset_details.csv`.
*   **Count**: Total number of trades in the time interval (last minute).
*   **Open**: Opening price of the time interval (in USD).
*   **High**: Highest price reached during time interval (in USD).
*   **Low**: Lowest price reached during time interval (in USD).
*   **Close**: Closing price of the time interval (in USD).
*   **Volume**: Quantity of asset bought or sold, displayed in base currency USD.
*   **VWAP**: The average price of the asset over the time interval, weighted by volume. VWAP is an aggregated form of trade data.
*   **Target**: Residual log-returns for the asset over a 15 minute horizon. 

The first two columns define the time and asset indexes for this data row. The 7 middle columns (`Count` through `VWAP`) are feature columns with the trading data for this asset and minute in time. The last column is the prediction target, which we will get to later in more detail.
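To make the VWAP definition concrete, here is a minimal sketch on made-up trades. The `trades` frame and its `price` and `size` columns are hypothetical and not part of the competition data, which only ships per-minute aggregates. The sketch assumes VWAP is the sum of price times traded quantity divided by the total traded quantity within each minute.

```python
import pandas as pd

# Hypothetical individual trades (illustrative values only)
trades = pd.DataFrame({
    "timestamp": [1514764860, 1514764860, 1514764860, 1514764920, 1514764920],
    "price": [100.0, 102.0, 101.0, 103.0, 105.0],
    "size": [1.0, 3.0, 1.0, 2.0, 2.0],
})

# Volume-weighted average price per minute: sum(price * size) / sum(size)
trades["notional"] = trades["price"] * trades["size"]
per_minute = trades.groupby("timestamp")[["notional", "size"]].sum()
vwap = per_minute["notional"] / per_minute["size"]
print(vwap)
```

Trades with a larger size pull the average toward their price, which is why VWAP can differ from the simple mean of the prices in the same minute.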

We also view the asset information, including the list of all assets, the `Asset_ID` to asset mapping, and the weight of each asset used to weigh their relative importance in the evaluation metric.

In [None]:
btc = df_train[df_train["Asset_ID"]==1].set_index("timestamp") # Asset_ID = 1 for Bitcoin
btc_mini = btc.iloc[-200:] # Select recent data rows

In [None]:
import plotly.graph_objects as go

fig = go.Figure(data=[go.Candlestick(x=btc_mini.index, open=btc_mini['Open'], high=btc_mini['High'], low=btc_mini['Low'], close=btc_mini['Close'])])
fig.show()

## Preprocessing

In [None]:
btc.info(verbose=True)

In [None]:
btc.isna().sum()

In [None]:
# Check the time range of the Bitcoin data,
# using the conversion from timestamp to `datetime`
beg_btc = btc.index[0].astype('datetime64[s]')
end_btc = btc.index[-1].astype('datetime64[s]')

print('BTC data goes from ', beg_btc, 'to ', end_btc)

**`.reindex()` description**

`DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)`

`method`: {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}. Method to use for filling holes in the reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.

*   None (default): don't fill gaps.
*   *pad / ffill: propagate the last valid observation forward to the next valid one.*
*   backfill / bfill: use the next valid observation to fill the gap.
*   nearest: use the nearest valid observations to fill the gap.
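As a small illustration of the `pad` option, the sketch below reindexes a toy frame onto a complete 60-second grid. The `toy` frame and its values are made up for the example.

```python
import pandas as pd

# Toy minute bars with the 120-second row missing (values are made up)
toy = pd.DataFrame({"Close": [10.0, 11.0, 13.0]}, index=[0, 60, 180])
toy.index.name = "timestamp"

# Build the complete 60-second grid and forward-fill the gap
full_grid = range(toy.index[0], toy.index[-1] + 60, 60)
filled = toy.reindex(full_grid, method="pad")
print(filled)
```

The row at 120 seconds is created and takes the last known value (the one from 60 seconds), so the filled frame has one row per minute.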

In [None]:
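# Index by timestamp so that reindexing operates on the 60-second time grid
btc = df_train[df_train["Asset_ID"]==1].set_index("timestamp")
# Fill missing minutes by propagating the last valid row forward
btc = btc.reindex(range(btc.index[0],btc.index[-1]+60,60),method='pad')
# Restore 'timestamp' as a regular column and convert it to datetime
btc = btc.rename_axis("timestamp").reset_index()
btc['timestamp'] = pd.to_datetime(btc['timestamp'], unit='s')
btc_index = btc  # same object; later cells use both names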

# <center>BAR PLOT</center> 

Here I make a bar plot of the monthly mean volume from 2018 to 2021. Slicing the index with `'2018':` selects everything from 2018 onward; since the dataset ends in 2021, this covers 2018 through 2021.

Each bar represents one month. There is a huge spike in April 2020; otherwise, no monthly seasonality is visible.

In [None]:
import matplotlib.dates as mdates
btc_month = btc_index.resample("M", on='timestamp').mean()

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d %H'))
#ax.set_xticklabels(btc_hour['timestamp'], rotation=90)
ax.bar(btc_month['2018':].index, btc_month.loc['2018':, "Volume"], width=25, align='center')
fig.autofmt_xdate()

# <center>BOX PLOT</center> 

One way to find seasonality is by using a set of boxplots. Here I am going to make boxplots for each month. I will use ‘Open’, ‘Close’, ‘High’ and ‘Low’ data to make this plot.

The dataset has no market data between 2021-09-22 and 2022-01-01, so we cut out the records from 2018-01-01 to 2018-09-20 to make sure every month has the same amount of data.

In [None]:
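# Keep exactly three years of data: 2018-09-21 to 2021-09-21
mask = (btc['timestamp'] > '2018-09-21') & (btc['timestamp'] <= '2021-09-21')
# Copy so that adding columns later does not trigger SettingWithCopyWarning
btc_1821 = btc.loc[mask].copy()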

In [None]:
import seaborn as sns
btc_1821['month'] = btc_1821['timestamp'].dt.strftime('%b')
#start, end = '2018-01-01 00', '2021-07-01 00'
fig, axes = plt.subplots(4, 1, figsize=(10, 16), sharex=True)
for name, ax in zip(['Open', 'Close', 'High', 'Low'], axes):
    sns.boxplot(data=btc_1821,x='month', y=name, ax=ax)
    ax.set_ylabel("")
    ax.set_title(name)
    if ax != axes[-1]:
        ax.set_xlabel('')

**The box plots show that prices are usually lowest in Sep, Oct, Nov and Dec, and highest in Feb, Mar and Apr.**

The ‘Volume’ data in the original dataset is too busy to plot directly. Resampling fixes this to a large extent: instead of plotting the minute-level data, we plot the monthly average. I will reuse the `btc_month` dataset already prepared for the bar plot above.

In [None]:
btc_month['Volume'].plot(figsize=(8, 6))

# <center>RESAMPLE TO SMOOTH OUT THE SPIKES</center> 

In the ‘Volume’ data we are working on right now, we can observe some big spikes here and there. Spikes like these are not helpful for data analysis or for modeling. To smooth them out, resampling to a lower frequency and taking rolling averages are both very helpful.
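The two smoothing approaches can be compared on a small synthetic series. This is only a sketch: the `volume` series below is made up, with a single artificial spike, and the 7-day window is an arbitrary choice.

```python
import pandas as pd

# Synthetic daily volume with one artificial spike (not real market data)
days = pd.date_range("2021-01-01", periods=28, freq="D")
volume = pd.Series(100.0, index=days)
volume.iloc[10] = 1000.0

# Resampling lowers the frequency: one averaged value per week
weekly_mean = volume.resample("W").mean()
# A rolling mean keeps the daily frequency but averages over the last 7 days
rolling_mean = volume.rolling(window=7).mean()

print(weekly_mean.max(), rolling_mean.max())
```

Both results stay far below the raw spike of 1000 because the spike is averaged with its six neighbours. Resampling returns fewer points, while the rolling mean returns one value per day (with NaN for the first six days, where the window is incomplete).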

In [None]:
btc_week = btc_index.resample("W", on='timestamp').mean()
btc_day = btc_index.resample("D", on='timestamp').mean()

Plot the daily and weekly data in the same figure.

In [None]:
start, end = '2021-01', '2021-08'
fig, ax = plt.subplots()
ax.plot(btc_day.loc[start:end, 'Volume'], marker='.', linestyle='-', linewidth = 0.5, label='Daily', color='black')
ax.plot(btc_week.loc[start:end, 'Volume'], marker='o', markersize=8, linestyle='-', label='Weekly', color='coral')
ax.set_ylabel("Volume")  # the plotted column is 'Volume'
ax.legend()
fig.autofmt_xdate()

# <center>PLOT THE CHANGE</center> 

### Method 1

In [None]:
btc_date = btc_index.set_index("timestamp")

The `shift()` function shifts the data forward or backward by a given number of periods. If no argument is given, it shifts by one period, which here means one row, i.e. one minute, because the data is minute-by-minute. Each row then holds the previous minute's value. In financial data like this, it is helpful to see the previous and the current observation side by side. Below, each close is divided by the previous close and the resulting ratio is plotted:

In [None]:
btc_date['Change'] = btc_date.Close.div(btc_date.Close.shift())
btc_date['Change'].plot(figsize=(20, 8), fontsize = 16)

In the code above, `.div()` performs element-wise division: `df.div(6)` divides each element of `df` by 6. Here the argument is `btc_date.Close.shift()`, so each close is divided by the previous close. A value above 1 means the price rose relative to the previous minute, and a value below 1 means it fell. The first row of `Change` is NaN because `shift()` leaves no previous value for it.
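A toy example makes the behaviour of `shift()` and `div()` easier to follow. The prices below are made up for illustration.

```python
import pandas as pd

close = pd.Series([100.0, 110.0, 99.0], name="Close")  # made-up prices

previous = close.shift()       # each row now holds the prior row's value
change = close.div(previous)   # same as close / previous

print(change)
# The first entry is NaN because there is no prior row to divide by
```

The same ratio can also be obtained as `close.pct_change() + 1`, which may read more naturally when the goal is a return series.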

We can simply take a specific period and plot to have a clearer look. This is the plot of 2020 only.

In [None]:
# Use .loc for partial-string row selection; btc_date['2020'] fails in pandas 2.x
btc_date.loc['2020', 'Change'].plot(figsize=(10, 6))

### Method 2

Another transformation is the expanding window, which accumulates over all observations seen so far. For example, if you apply an expanding sum to the ‘High’ column, the first element stays the same, the second element becomes the sum of the first and second elements, the third becomes the sum of the first, second and third elements, and so on. You can use aggregate functions like mean, median, standard deviation, etc. on it too.
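A minimal sketch of the expanding window on made-up numbers (the `high` series is hypothetical):

```python
import pandas as pd

high = pd.Series([2.0, 4.0, 9.0], name="High")  # made-up values

# Each result uses all observations from the start up to the current row
print(high.expanding().sum().tolist())   # running total: [2.0, 6.0, 15.0]
print(high.expanding().mean().tolist())  # running average: [2.0, 3.0, 5.0]
```

Unlike a rolling window, the expanding window never drops old observations, so later values change more and more slowly as the history grows.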

In [None]:
fig, ax = plt.subplots()
ax = btc_date.High.plot(label='High')
ax = btc_date.High.expanding().mean().plot(label='High expanding mean')
ax = btc_date.High.expanding().std().plot(label='High expanding std')
ax.legend()

# <center>DECOMPOSITION</center> 

Decomposition will show the observations and these three elements in the same plot:
*   Trend: Consistent upward or downward slope of a time series.
*   Seasonality: Clear periodic pattern of a time series
*   Noise (residuals): The irregular variation left over after the trend and seasonality are removed.

Original observations = Trend + Seasonality + Residuals
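To show where the three components come from, here is a simplified, pandas-only sketch of a classical additive decomposition on a synthetic series. It is an illustration under stated assumptions (a known period of 12, a centered moving average for the trend, per-position averages for the seasonality), not the statsmodels implementation, and all variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic "monthly" series: linear trend plus a 12-step cycle, no noise
period = 12
t = np.arange(48)
observed = pd.Series(0.5 * t + 10.0 * np.sin(2 * np.pi * t / period))

# Trend: centered moving average over one full period
# (half weights at both ends so an even period stays centered)
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend_values = np.convolve(observed.to_numpy(), weights, mode="valid")
trend = pd.Series(np.nan, index=observed.index)
trend.iloc[period // 2 : -(period // 2)] = trend_values

# Seasonality: average detrended value at each position in the cycle, centered on zero
detrended = observed - trend
position = pd.Series(t % period, index=observed.index)
seasonal_means = detrended.groupby(position).mean()
seasonal_means -= seasonal_means.mean()
seasonal = position.map(seasonal_means)

# Residuals: whatever is left after removing trend and seasonality
resid = observed - trend - seasonal
print(resid.dropna().abs().max())
```

Because the synthetic series contains no noise, the residuals are numerically zero wherever the trend is defined. The first and last six points have no trend estimate, since the centered window does not fit there.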

In [None]:
from pylab import rcParams
import statsmodels.api as sm
rcParams['figure.figsize'] = 11, 9
decomposition = sm.tsa.seasonal_decompose(btc_month['Volume'], model='additive')
fig = decomposition.plot()
plt.show()