# EDA - predict value from other assets

In this notebook, we describe the results of predicting the movement of one asset from the values of the other assets.

Specifically, we predict the value of Mean for one asset (e.g. asset ID = 0) from the values of Mean for the remaining assets (asset ID = 1 to 13) at the same timestamp.

The purpose of this notebook is to obtain insights for score improvement from the discrepancies between the predicted and true values.
For example, if the discrepancy is large, it can be inferred that a large change has occurred only in that asset.
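
The reasoning above can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the shared `factor`, the three "other" assets, the injected jump at `jump_index`, and the use of `np.linalg.lstsq` (ordinary least squares with an intercept) in place of the `LinearRegression` used later in this notebook. When only the target asset moves, the residual of the cross-asset fit should peak at that point.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: three "other" assets and one target, all driven by a shared factor.
factor = np.cumsum(rng.normal(size=n))
others = np.column_stack(
    [beta * factor + rng.normal(scale=0.1, size=n) for beta in (1.0, 0.5, 2.0)]
)
target = 1.5 * factor + rng.normal(scale=0.1, size=n)

# Inject a move that happens only in the target asset.
jump_index = 250
target[jump_index] += 5.0

# Ordinary least squares with an intercept column.
design = np.column_stack([others, np.ones(n)])
coef, *_ = np.linalg.lstsq(design, target, rcond=None)
residual = target - design @ coef

# The largest discrepancy is expected where the asset-specific move was injected.
largest = int(np.argmax(np.abs(residual)))
print(largest, residual[largest])
```

On real price data the residual is of course noisier, so this only shows the mechanism, not the size of the effect.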

## import

In [None]:
import pandas as pd
import numpy as np
import time
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
sns.set("talk")

from sklearn.linear_model import LinearRegression


TRAIN_CSV = '../input/g-research-crypto-forecasting/train.csv'

ASSET_DETAILS_CSV = '../input/g-research-crypto-forecasting/asset_details.csv'
OUTPUT_DIR = '../model/'

## Read train data
- read csv data
- merge asset details

In [None]:
train = pd.read_csv(TRAIN_CSV)
train.dropna(subset=["Target"], inplace=True)
    
asset_details = pd.read_csv(ASSET_DETAILS_CSV).sort_values("Asset_ID")
print(train.shape)
print(asset_details.shape)
display(train.head())
display(asset_details)

In [None]:
# merge asset details on Asset_ID
def add_asset_details(train, asset_details):
    return train.merge(
        asset_details,
        how = "left", on = "Asset_ID"
    )

train = add_asset_details(train, asset_details)

# calculate the mean of Open, High, Low and Close for each row
train["Mean"] = train[['Open', 'High', 'Low', 'Close']].mean(axis=1)

## set time for train



In [None]:
#window for train
train_start = "21/08/2020"
train_end = "21/08/2021"

def set_time_train(train, train_start, train_end):
    totimestamp = lambda s: np.int32(time.mktime(datetime.datetime.strptime(s, "%d/%m/%Y").timetuple()))
    train_window = [totimestamp(train_start), totimestamp(train_end)]
    train = train.query("@train_window[0] < timestamp < @train_window[1]")
    return train

train = set_time_train(train, train_start, train_end)
print(train.shape)
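
Note that `time.mktime` interprets the parsed date in the local time zone of the machine, so the window boundaries can shift by a few hours between environments. A minimal sketch of a time-zone-independent alternative, assuming the dates are meant as UTC midnight (the helper name `to_utc_timestamp` is hypothetical):

```python
import calendar
import datetime


def to_utc_timestamp(date_str, fmt="%d/%m/%Y"):
    """Convert a date string to Unix seconds, treating it as UTC midnight."""
    parsed = datetime.datetime.strptime(date_str, fmt)
    # calendar.timegm is the UTC counterpart of time.mktime
    return calendar.timegm(parsed.timetuple())


window = [to_utc_timestamp("21/08/2020"), to_utc_timestamp("21/08/2021")]
print(window)
```

The resulting `window` could be used in the same `query` call as `train_window` above.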

## get pivot table
- index: timestamp
- columns: Asset_Name
- values: Mean

In [None]:
# pivot to a table indexed by timestamp, with one column per asset
df_pivot = train.pivot_table(index="timestamp", columns="Asset_Name", values="Mean")
# forward-fill gaps with each asset's last observed value
df_pivot = df_pivot.ffill()

In [None]:
df_pivot.head()

### check nan

In [None]:
df_pivot.isnull().sum()
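
The NaN check matters because a forward fill can only propagate earlier values: a gap at the very start of a column stays NaN. A small sketch on a hypothetical toy table (the asset names `A` and `B` and all values are made up for illustration):

```python
import pandas as pd

# Toy long-format data: asset A has no row at timestamps 60 and 180.
toy = pd.DataFrame({
    "timestamp": [60, 120, 120, 180, 240, 240],
    "Asset_Name": ["B", "A", "B", "B", "A", "B"],
    "Mean": [10.0, 2.0, 11.0, 12.0, 4.0, 13.0],
})

toy_pivot = toy.pivot_table(index="timestamp", columns="Asset_Name", values="Mean")
filled = toy_pivot.ffill()

# The gap at 180 is filled from 120, but the leading gap at 60 remains NaN.
print(filled)
print(filled.isnull().sum())
```

If leading NaNs do remain in the real table, those rows would need to be dropped or filled another way before fitting.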

## predict (in case of Bitcoin)

### linear regression by other assets

In [None]:
target = "Bitcoin"
feats = df_pivot.drop(target, axis=1).columns

In [None]:
#linear regression
lr = LinearRegression()
lr.fit(df_pivot[feats], df_pivot[target])

In [None]:
def get_df_pred(lr, feats, target, df_pivot):
    # build a dataframe of predictions, true values and their ratio
    df_pred = pd.DataFrame()
    df_pred["pred"] = lr.predict(df_pivot[feats])
    df_pred["true"] = df_pivot[target].values
    df_pred["pred/true"] = df_pred["pred"] / df_pred["true"]
    # convert the Unix timestamp (seconds) to datetime for plotting
    df_pred["time"] = df_pivot.index
    df_pred["time"] = df_pred["time"].apply(datetime.datetime.fromtimestamp)
    return df_pred

df_pred = get_df_pred(lr, feats, target, df_pivot)
df_pred.head()

### check prediction error

In [None]:
fig, ax = plt.subplots(1,1,figsize=(20,5))

def plot_pred(df_pred, target, ax):
    ax.plot(df_pred["time"], df_pred["true"], label="true")
    ax.plot(df_pred["time"], df_pred["pred"], label="pred")
    ax.set_title(target)
    ax.legend()
    
plot_pred(df_pred, target, ax)

- The predictions generally track the true values, but there are periods when the errors are large. Note that the model is evaluated on the same data it was fitted on.
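
To make "the errors are large" concrete, one option is to flag timestamps where the relative error exceeds a threshold. This is only a sketch: the helper `flag_large_errors` and the 5% threshold are assumptions, and the demo frame is made-up data with the same `pred` and `true` columns as `df_pred`.

```python
import pandas as pd


def flag_large_errors(df_pred, threshold=0.05):
    """Add the absolute relative error and a flag for rows above the threshold."""
    rel_err = (df_pred["pred"] / df_pred["true"] - 1.0).abs()
    return df_pred.assign(rel_err=rel_err, large=rel_err > threshold)


# Made-up values: only the last row is off by more than 5%.
demo = pd.DataFrame({"pred": [100.0, 102.0, 90.0], "true": [100.0, 100.0, 100.0]})
flagged = flag_large_errors(demo)
print(flagged)
```

Applied to the real `df_pred`, `flagged[flagged["large"]]` would list the periods worth inspecting.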

## predict (in case of all assets)

In [None]:
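# one subplot per asset, so that zip() below does not silently drop the last column
fig, ax = plt.subplots(len(df_pivot.columns),1,figsize=(20,40), sharex=True)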

lr_list = []
for ax_i, asset_i in zip(ax, df_pivot.columns):
    print(asset_i)
    target = asset_i
    feats = df_pivot.drop(asset_i, axis=1).columns
    
    lr = LinearRegression()
    lr.fit(df_pivot[feats], df_pivot[target])
    lr_list.append(lr)
    
    df_pred = get_df_pred(lr, feats, target, df_pivot)
    plot_pred(df_pred, target, ax_i)

Point:
- The error tends to be larger in periods where the price changes more.
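
This observation could be checked numerically by correlating the absolute relative error with a rolling measure of price change. The sketch below is hedged: `error_vs_change`, the window length, and the use of the rolling standard deviation of returns are all assumptions, and the synthetic demo deliberately builds in larger errors during the volatile half, so it shows only how the check works.

```python
import numpy as np
import pandas as pd


def error_vs_change(df_pred, window=60):
    """Pearson correlation between |pred/true - 1| and rolling volatility of `true`."""
    rel_err = (df_pred["pred"] / df_pred["true"] - 1.0).abs()
    change = df_pred["true"].pct_change().rolling(window).std()
    # Series.corr skips the NaN rows at the start of the rolling window
    return rel_err.corr(change)


# Synthetic demo: the second half is ten times more volatile, and so are the errors.
rng = np.random.default_rng(0)
n = 400
scale = np.where(np.arange(n) < n // 2, 0.001, 0.01)
true = 100.0 * np.exp(np.cumsum(rng.normal(size=n) * scale))
pred = true * (1.0 + rng.normal(size=n) * scale)
demo = pd.DataFrame({"pred": pred, "true": true})

corr = error_vs_change(demo, window=20)
print(corr)
```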

## check correlation of prediction error and target
Check whether the prediction error is correlated with Target (the log return to be predicted), taking Bitcoin as an example.

In [None]:
target_pivot = train.pivot_table(index="timestamp", columns="Asset_Name", values="Target")
target_pivot.head()

In [None]:
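# check with a figure
target = "Bitcoin"
feats = df_pivot.drop(target, axis=1).columns
# pick the model fitted for this target (lr_list follows the column order of df_pivot)
df_pred = get_df_pred(lr_list[list(df_pivot.columns).index(target)], feats, target, df_pivot)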

fig, ax = plt.subplots(1,1,figsize=(20,5))

plot_pred(df_pred, target, ax)
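# overlay Target on a secondary y-axis, aligned to the timestamps of df_pivot
ax2 = ax.twinx()
ax2.plot(df_pred["time"], target_pivot[target].reindex(df_pivot.index).values, color="black", alpha=0.3)
ax2.grid(False)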

- Unfortunately, I don't see any significant correlation in this plot.
- Whether the prediction error is useful as a feature will be verified in the future.
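
A visual check can miss a weak relationship, so a numerical correlation may be a useful complement. The following is a minimal sketch, not the method used in this notebook: the helper `error_target_corr`, the choice of log(pred/true) as the error, and the demo values are all assumptions. The rank correlation is computed as the Pearson correlation of ranks, which is equivalent to Spearman's correlation.

```python
import numpy as np
import pandas as pd


def error_target_corr(df_pred, target_values):
    """Return (Pearson, rank) correlation between log(pred/true) and the target."""
    log_error = np.log(df_pred["pred"].to_numpy() / df_pred["true"].to_numpy())
    both = pd.DataFrame({
        "err": log_error,
        "target": np.asarray(target_values, dtype=float),
    }).dropna()  # Target can be missing for some timestamps
    pearson = both["err"].corr(both["target"])
    rank_corr = both["err"].rank().corr(both["target"].rank())
    return pearson, rank_corr


# Made-up values in which the error and the target move together.
demo = pd.DataFrame({"pred": [101.0, 99.0, 102.0, 98.0], "true": [100.0] * 4})
pearson, rank_corr = error_target_corr(demo, [0.01, -0.01, 0.02, -0.02])
print(pearson, rank_corr)
```

With the real data, the call would look like `error_target_corr(df_pred, target_pivot[target].reindex(df_pivot.index))`.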