# G-Research Crypto Forecasting - Cleaning, EDA and Prediction

## Reference
This kernel would not have been possible without studying the following notebooks (reference list): <br>
**https://www.kaggle.com/odins0n/g-research-plots-eda** <br>
**https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition** <br>
**https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-crypto-forecasting/notebook**


Many thanks to these authors for what I learned about visualization with plotly, time-series analysis and the crypto market.

## Column Description

The dataset includes the following features for each asset:

- timestamp: All timestamps are Unix timestamps in seconds (the number of seconds elapsed since 1970-01-01 00:00:00.000 UTC). Timestamps in this dataset are multiples of 60, indicating minute-by-minute data.
- Asset_ID: The asset ID corresponding to one of the cryptocurrencies (e.g. Asset_ID = 1 for Bitcoin). The mapping from Asset_ID to crypto asset is contained in asset_details.csv.
- Count: Total number of trades in the time interval (last minute).
- Open: Opening price of the time interval (in USD).
- High: Highest price reached during time interval (in USD).
- Low: Lowest price reached during time interval (in USD).
- Close: Closing price of the time interval (in USD).
- Volume: Quantity of asset bought or sold, displayed in base currency USD.
- VWAP: The average price of the asset over the time interval, weighted by volume. VWAP is an aggregated form of trade data.
- Target: Residual log-returns for the asset over a 15 minute horizon.
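
To make the `timestamp` and `Target` descriptions above more concrete, here is a minimal sketch on made-up data. It converts Unix seconds to datetimes and computes a plain 15-minute forward log return from `Close`. The helper `forward_log_return` and the toy prices are assumptions for illustration only. The competition's `Target` is a *residual* log return, so it also removes a market-wide component that this sketch does not attempt to reproduce.

```python
import numpy as np
import pandas as pd

def forward_log_return(close: pd.Series, horizon: int = 15) -> pd.Series:
    """Log return from row t to row t + horizon (NaN where the future is unknown).

    Hypothetical helper: assumes one row per minute with no gaps.
    """
    return np.log(close.shift(-horizon) / close)

# Toy minute-by-minute data: 20 rows, timestamps are multiples of 60 seconds.
toy = pd.DataFrame({
    "timestamp": np.arange(0, 60 * 20, 60),
    "Close": np.linspace(100.0, 119.0, 20),  # made-up prices 100, 101, ..., 119
})
toy["Real_Time"] = pd.to_datetime(toy["timestamp"], unit="s")
toy["fwd_ret_15"] = forward_log_return(toy["Close"])
print(toy.head())
```

The last 15 rows have no value because their 15-minute future lies outside the data, which is one possible reason a forward-looking column contains missing values.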

# Basic data handling and inspection

In [None]:
# import all of the important libraries in this kernel
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import time
import datetime
from plotly.offline import init_notebook_mode, iplot
import plotly.express as px
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import lightgbm as lgb
cmap = sns.color_palette()

In [None]:
# import the data to use in this kernel
df = pd.read_csv('../input/g-research-crypto-forecasting/train.csv')
asset_details = pd.read_csv('../input/g-research-crypto-forecasting/asset_details.csv')
df_test = pd.read_csv('../input/g-research-crypto-forecasting/example_test.csv')
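
Because `train.csv` is large, it may help to load only the columns you need and to store `Asset_ID` in a small integer type. The sketch below shows the idea on a tiny in-memory CSV with made-up values. The choice of columns and of `int8` is an assumption (it relies on the asset IDs being small integers), not a requirement of the dataset.

```python
import io
import pandas as pd

# Toy CSV standing in for train.csv; every value here is made up.
csv_text = """timestamp,Asset_ID,Count,Open,High,Low,Close,Volume,VWAP,Target
1600000020,1,10.0,100.0,101.0,99.0,100.5,2.0,100.2,0.001
1600000020,0,5.0,8.0,8.1,7.9,8.05,70.0,8.02,-0.002
"""

usecols = ["timestamp", "Asset_ID", "Close", "Volume"]  # assumed subset of interest
toy = pd.read_csv(
    io.StringIO(csv_text),
    usecols=usecols,             # skip columns that are not needed
    dtype={"Asset_ID": "int8"},  # small integer type for the asset identifier
)
print(toy.dtypes)
print(toy.memory_usage(deep=True).sum(), "bytes")
```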

In [None]:
# casually checking the data
df.head(13)

In [None]:
asset_details

## Checking the missing value

In [None]:
df.isnull().sum()

The missing values are in the VWAP and Target columns.
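
To see whether the missing values are concentrated in particular assets, the null counts can be broken down by `Asset_ID`. This is a minimal sketch on made-up data; the toy frame and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing VWAP / Target values.
toy = pd.DataFrame({
    "Asset_ID": [0, 0, 0, 1, 1, 1],
    "VWAP":     [1.0, np.nan, 1.2, 10.0, 10.1, 10.2],
    "Target":   [0.01, 0.02, np.nan, np.nan, 0.0, 0.03],
})

# Boolean null mask, summed within each asset, gives missing counts per asset.
missing_by_asset = toy[["VWAP", "Target"]].isnull().groupby(toy["Asset_ID"]).sum()
print(missing_by_asset)
```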

In [None]:
# Sort asset_details by Weight, highest first.
asset_details = asset_details.sort_values('Weight',ascending=False)
asset_details

In [None]:
asset_names_dict = {row["Asset_Name"]:row["Asset_ID"] for ind, row in asset_details.iterrows()}
asset_names_dict

Add the "Asset_Name" column from asset_details to df.

In [None]:
def add_asset_name(stdata, join):
    return stdata.merge(
        join, how="left",on="Asset_ID"
    )

df = add_asset_name(df,asset_details)
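
A left merge like the one above silently duplicates rows if the lookup table contains the same `Asset_ID` twice. One way to guard against that, sketched here on toy data, is the `validate` argument of `DataFrame.merge`. The frames below are made up, and "Toycoin" with ID 99 is a hypothetical asset.

```python
import pandas as pd

# Toy price rows and a toy lookup table (values are illustrative only).
prices = pd.DataFrame({"Asset_ID": [1, 1, 99], "Close": [100.0, 101.0, 10.0]})
details = pd.DataFrame({
    "Asset_ID": [1, 99],
    "Asset_Name": ["Bitcoin", "Toycoin"],
    "Weight": [2.0, 1.0],
})

# validate="many_to_one" raises pandas.errors.MergeError if Asset_ID
# is not unique in the right-hand table.
merged = prices.merge(details, how="left", on="Asset_ID", validate="many_to_one")
print(merged)
```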

Create the "Real_Time" column by converting the "timestamp" column to datetimes.

In [None]:
df['Real_Time'] = pd.to_datetime(df['timestamp'],unit='s')

Check that the dataset now looks the way we want.

In [None]:
df.head(10)

# Exploratory Data Analysis

Check the percentage of rows that each currency contributes to the dataset.

In [None]:
(df['Asset_Name'].value_counts()/df.shape[0])*100

## Visualization

### Percentage of every type of coins in the dataframe

In [None]:
countpie = df['Asset_Name'].value_counts()

fig = {
  "data": [
    {
      "values": countpie.values,
      "labels": countpie.index,
      "domain": {"x": [0, .5]},
      "name": "Currency types",
      "hoverinfo":"label+percent+name",
      "hole": .7,
      "type": "pie"
    },],
  "layout": {
        "title":"Pie chart of all the Currency types ratio",
    }
}
iplot(fig)

In [None]:
# This is how I would normally write the plotly code, but the dataset is so large
# that the plot takes too long to render. It is easier to write but slow to run.

#px.histogram(df, x="Asset_Name", color="Asset_Name")

In [None]:
# A faster way to build the histogram, by Sanskar Hasija: count the rows per asset first
asset_count= []
for i in range(14):
    count = (df["Asset_ID"]==i).sum()
    asset_count.append(count)

In [None]:
# The output is basically the same as the code above, but it runs much faster
fig = px.histogram(x = asset_details.sort_values("Asset_ID")["Asset_Name"],
                   y = asset_count,
                   color = asset_details.sort_values("Asset_ID")["Asset_Name"])
fig.update_xaxes(title="Currency types")
fig.update_yaxes(title = "Number of Rows")
fig.show()
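
The per-asset row counts computed in the loop above can also be obtained in a single pass with `value_counts`. The sketch below uses toy data and checks that both approaches agree; the `reindex` step is there so that an asset with no rows still gets a count of zero.

```python
import pandas as pd

# Toy stand-in for df["Asset_ID"]; asset 2 has no rows on purpose.
toy = pd.DataFrame({"Asset_ID": [0, 1, 1, 3, 3, 3]})
n_assets = 4

# One pass over the column, ordered by asset ID, zero for absent assets.
asset_count = toy["Asset_ID"].value_counts().reindex(range(n_assets), fill_value=0)

# Loop version, for comparison with the approach used in the notebook.
loop_count = [(toy["Asset_ID"] == i).sum() for i in range(n_assets)]

print(asset_count.tolist(), loop_count)
```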

### Volume

In [None]:
volumesum = df.groupby(['Asset_ID'])['Volume'].sum()
volumesum

In [None]:
fig = px.histogram(x = asset_details.sort_values("Asset_ID")["Asset_Name"],
                   y = volumesum, 
                   color = asset_details.sort_values("Asset_ID")["Asset_Name"])
fig.update_xaxes(title="Currency types")
fig.update_yaxes(title = "Sum of the volumes")
fig.update_layout(showlegend = True,
    title = {
        'text': 'Quantity of asset bought or sold based on USD',
        'y':0.95,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

It is a bit surprising that the largest volumes belong to Tron, Dogecoin, Stellar and Cardano rather than Bitcoin or Ethereum. I might not be understanding the **Volume** variable correctly.
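
One possible explanation, which is an assumption and not something confirmed in this notebook, is that `Volume` counts units of the asset rather than dollars. Low-priced coins would then naturally show very large volumes. A rough way to test this reading is to approximate the dollar volume as `Volume * VWAP` and compare the two rankings. The sketch below uses made-up numbers, and `Dollar_Volume` is a hypothetical column name.

```python
import pandas as pd

# Toy rows: a high-priced coin traded in small quantities and a
# low-priced coin traded in large quantities (all numbers made up).
toy = pd.DataFrame({
    "Asset_Name": ["Bitcoin", "Bitcoin", "Dogecoin", "Dogecoin"],
    "Volume":     [2.0, 3.0, 100000.0, 200000.0],
    "VWAP":       [40000.0, 41000.0, 0.05, 0.06],
})

# Approximate traded value per minute, assuming Volume is in asset units.
toy["Dollar_Volume"] = toy["Volume"] * toy["VWAP"]
summary = toy.groupby("Asset_Name")[["Volume", "Dollar_Volume"]].sum()
print(summary)
```

In this toy example Dogecoin leads by unit volume while Bitcoin leads by approximate dollar volume, which is the pattern one would expect on the real data if the assumption holds.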

In [None]:
assetindex = asset_details.sort_values("Asset_ID")["Asset_Name"].values

In [None]:
assetindex

Plot the open price against the close price of each asset over the whole dataset to check for signs of manipulation or outliers.

In [None]:
plt.figure(figsize=(40,80))
gs = gridspec.GridSpec(7, 2)
for i in range(14):
    ax = plt.subplot(gs[i])
    ax = sns.scatterplot(x='Close',y='Open',data=df[df['Asset_ID'] == i],color=cmap[i%10])
    ax.set_xlabel('')
    ax.set_title('Scatter plot of currency name: ' + assetindex[i] +' in USD')
plt.show()




Most assets show no outliers, or fewer than 1% of points that sit slightly out of place.
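
The visual impression from the scatter plots can be backed by a number. A minimal sketch, assuming an "outlier" is a minute where the close differs from the open by more than some relative threshold: the helper `outlier_fraction` and the 5% threshold are hypothetical choices, and the data is made up.

```python
import pandas as pd

def outlier_fraction(frame: pd.DataFrame, threshold: float = 0.05) -> pd.Series:
    """Share of rows per asset where |Close / Open - 1| exceeds the threshold."""
    rel_move = (frame["Close"] / frame["Open"] - 1).abs()
    return (rel_move > threshold).groupby(frame["Asset_ID"]).mean()

# Toy data: asset 0 has one large one-minute jump out of four rows.
toy = pd.DataFrame({
    "Asset_ID": [0, 0, 0, 0, 1, 1],
    "Open":     [100.0, 100.0, 100.0, 100.0, 10.0, 10.0],
    "Close":    [100.5, 99.8, 120.0, 100.1, 10.01, 9.99],
})
frac = outlier_fraction(toy)
print(frac)
```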

### Correlation between the currencies

I want to know how the coins correlate with each other. As a first look, plot the closing price of each coin over time.

In [None]:
f = plt.figure(figsize=(15,30))

for ind, coin in enumerate(list(assetindex)):
    coin_df = df[df["Asset_ID"]==asset_names_dict[coin]].set_index("Real_Time")
    # one subplot per coin: closing price over time
    ax = f.add_subplot(7,2,ind+1)
    plt.plot(coin_df['Close'], label=coin, color=cmap[ind%10])
    plt.legend()
    plt.xlabel('Time')
    plt.ylabel(coin)
    plt.title(coin)

plt.tight_layout()
plt.show()

### Correlation Map

In [None]:
all_assets_df = pd.DataFrame([])
for ind, coin in enumerate(list(assetindex)):
    coin_df = df[df["Asset_ID"]==asset_names_dict[coin]].set_index("Real_Time")
    # fill missing values
    close_values = coin_df["Close"].fillna(0)
    close_values.name = coin
    all_assets_df = all_assets_df.join(close_values, how="outer")


corrmat = all_assets_df.corr()
fig, ax = plt.subplots(figsize=(14, 14))
sns.heatmap(corrmat, vmax=1., square=True, cmap="rocket_r")
plt.title("Cryptocurrency correlation map on actual price values", fontsize=15)
plt.show()
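
The map above correlates raw price levels. Prices that trend together over long periods can look strongly correlated even when their minute-to-minute moves are not, so a common complement is to correlate log returns instead. The sketch below does this on simulated prices; the coin names, the random data and the shared "market" component are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
idx = pd.date_range("2021-01-01", periods=n, freq="min")

# Simulated minute returns: a shared market move plus coin-specific noise.
market = rng.normal(0, 0.001, n)
prices = pd.DataFrame({
    "CoinA": 100 * np.exp(np.cumsum(market + rng.normal(0, 0.0005, n))),
    "CoinB": 10 * np.exp(np.cumsum(market + rng.normal(0, 0.0005, n))),
}, index=idx)

# Log returns: difference of log prices between consecutive minutes.
log_returns = np.log(prices).diff().dropna()
returns_corr = log_returns.corr()
print(returns_corr)
```

With the real data, `prices` would correspond to the wide `all_assets_df` table of closing prices, and `returns_corr` could be passed to `sns.heatmap` in the same way as `corrmat`.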


### Candlestick Charts

**Key Takeaways**

- Traders often use candlestick charts to anticipate possible price movements based on past patterns.
- Candlesticks are useful when trading because they show four price points (open, close, high and low) for each period the trader specifies.
- Trading is often dictated by emotion, which can be read in candlestick charts.

Check out for more information and reference: https://www.investopedia.com/trading/candlestick-charting-what-is-it/

In [None]:
btctemp = df[df['Asset_Name']=='Bitcoin'].set_index("Real_Time")
btctemp = btctemp.iloc[-2000:,] # keep only the latest 2000 rows
btctemp

In [None]:
fig = go.Figure(data=[go.Candlestick(x=btctemp.index, open=btctemp['Open'], high=btctemp['High'], low=btctemp['Low'], close=btctemp['Close'])])
fig.update_xaxes(title_text = 'Time',
                             rangeslider_visible = True)

fig.update_layout(
     title = {
        'text': ' Candlestick Chart: Bitcoin',
        'y':0.90,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.update_yaxes(title_text = 'Price in USD', ticksuffix = '$')




fig.show()

I want to analyze other coins as well, so I should create a function to avoid rewriting the same code every time.

In [None]:
def crypto_df(AssetName,fdata=df):
    currencydf = fdata[fdata['Asset_Name']== AssetName].set_index("Real_Time")
    currencydf = currencydf.iloc[-2000:,] # keep only the latest 2000 rows
    return(currencydf)

In [None]:
ethtemp = crypto_df('Ethereum')

Create a function to plot the latest 2000 rows of data as a candlestick chart.

In [None]:
def latestcandle(coindata,coinname):  
        fig = go.Figure(data=[go.Candlestick(x=coindata.index, open=coindata['Open'], high=coindata['High'], low=coindata['Low'], close=coindata['Close'])])
        fig.update_xaxes(title_text = 'Time',
                                rangeslider_visible = True)

        fig.update_layout(
        title = {
                'text': ' Candlestick Chart: {:}'.format(coinname),
                'y':0.90,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

        fig.update_yaxes(title_text = 'Price in USD', ticksuffix = '$')

        fig.show()

In [None]:
latestcandle(ethtemp,'Ethereum')

### The easiest function to use!

We already have one function that builds the latest-2000-rows dataset for a coin and another that plots the candlesticks, so why not combine them for further analysis?

Remember: if you find yourself writing the same lines of code many times, try turning them into a function.

In [None]:
def latestcandle(coinname,fdata=df):  
        
        currencydf = fdata[fdata['Asset_Name']== coinname].set_index("Real_Time")
        currencydf = currencydf.iloc[-2000:,] # keep only the latest 2000 rows
        
        fig = go.Figure(data=[go.Candlestick(x=currencydf.index, open=currencydf['Open'], high=currencydf['High'], low=currencydf['Low'], close=currencydf['Close'])])
        fig.update_xaxes(title_text = 'Time',
                                rangeslider_visible = True)

        fig.update_layout(
        title = {
                'text': ' Candlestick Chart: {:}'.format(coinname),
                'y':0.90,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

        fig.update_yaxes(title_text = 'Price in USD', ticksuffix = '$')

        fig.show()

In [None]:
latestcandle('Ethereum Classic') # just pass the name of a coin to get its candlestick plot

In [None]:
latestcandle('Litecoin')

In [None]:
latestcandle('Dogecoin')
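
The charts above draw one candle per minute for the latest 2000 rows. To look at a longer window with fewer candles, the minute bars can be aggregated into coarser bars before plotting. This is a minimal sketch on made-up data; `resample_ohlc` is a hypothetical helper, and it assumes the frame is indexed by a datetime column such as "Real_Time".

```python
import numpy as np
import pandas as pd

def resample_ohlc(minute_bars: pd.DataFrame, rule: str = "60min") -> pd.DataFrame:
    """Aggregate minute bars into coarser OHLCV bars (hypothetical helper)."""
    agg = {
        "Open": "first",   # first open in the window
        "High": "max",     # highest high in the window
        "Low": "min",      # lowest low in the window
        "Close": "last",   # last close in the window
        "Volume": "sum",   # total volume in the window
    }
    return minute_bars.resample(rule).agg(agg).dropna()

# Toy data: 120 one-minute bars with steadily rising prices.
idx = pd.date_range("2021-01-01 00:00", periods=120, freq="min")
base = np.arange(120.0)
toy = pd.DataFrame({
    "Open": base, "High": base + 1, "Low": base - 1,
    "Close": base + 0.5, "Volume": np.ones(120),
}, index=idx)

hourly = resample_ohlc(toy, "60min")
print(hourly)
```

The resulting frame has the same column names as the minute data, so it could be passed to `go.Candlestick` in the same way as `currencydf`.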

# Summary

This concludes the exploratory data analysis. I will cover the LightGBM machine learning part later, which is very interesting to me because I have not used this library before.

If you enjoyed this kernel, please upvote, and feel free to comment so I can learn from my mistakes and improve. Thanks!