# **Speed up data loading!**

<code>train.csv</code> is a big file that takes some time to load. I created this notebook to produce a Feather version of the file, which is much faster to load. You don't need to run this notebook. If you want, you can just use its output as your input and speed up the data loading **from about one minute to about one second!**

To load the Feather file, add this notebook's output as a data source and open the file with <code>pd.read_feather('../input/speed-up-data-loading/train.feather')</code>.

I changed the data types of the DataFrame columns (I took most of the code from [this](https://www.kaggle.com/toomuchsauce/g-crypto-interactive-dashboard-indicators) notebook) to reduce the size of the dataset. If you want to keep the original data types, you can skip the call to the <code>reduce_memory</code> function.
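To give a feel for what the downcasting buys and what it costs, here is a minimal sketch on a toy DataFrame. The values are made up for illustration, and only the column names mirror the real data. It compares memory usage before and after the cast and measures the rounding error introduced by going from float64 to float32.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror train.csv.
toy = pd.DataFrame({
    "Asset_ID": np.array([0, 1, 2, 3], dtype=np.int64),
    "Close": np.array([13850.176, 738.5075, 2376.58, 0.2], dtype=np.float64),
})

# Bytes used by the two columns (index excluded): 4 * 8 + 4 * 8 = 64.
before = toy.memory_usage(index=False).sum()

small = toy.copy()
small["Asset_ID"] = small["Asset_ID"].astype(np.int8)   # 1 byte per value
small["Close"] = small["Close"].astype(np.float32)      # 4 bytes per value

# Bytes after the cast: 4 * 1 + 4 * 4 = 20.
after = small.memory_usage(index=False).sum()

# Largest absolute rounding error introduced by the float32 cast.
max_abs_error = (toy["Close"] - small["Close"].astype(np.float64)).abs().max()

print(before, after, max_abs_error)
```

float32 keeps roughly seven significant digits, so the error stays small for price-like values, but it is not zero. If you need exact float64 values, keep the original data types.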

At the end of the notebook I fill the gaps so that every asset has a row for every existing timestamp. Missing values are filled with 0, and the <code>Count</code> column is not kept in this version. This is convenient for me; anyone not interested in this format can ignore the <code>transformed_train.feather</code> file and just use <code>train.feather</code>.
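As an illustration of what "filling the gaps" means, here is a small sketch on made-up data. It uses <code>MultiIndex.from_product</code> with <code>reindex</code>, which is a different technique from the cross merge used in this notebook but should produce the same kind of complete grid. The toy values and the variable names are assumptions for the example only.

```python
import pandas as pd

# Toy data (hypothetical values): asset 1 has no row at timestamp 120.
toy = pd.DataFrame({
    "timestamp": [60, 60, 120],
    "Asset_ID": [0, 1, 0],
    "Close": [10.0, 20.0, 11.0],
})

# Every (timestamp, Asset_ID) combination that should exist.
full_index = pd.MultiIndex.from_product(
    [sorted(toy["timestamp"].unique()), sorted(toy["Asset_ID"].unique())],
    names=["timestamp", "Asset_ID"],
)

# Reindex onto the full grid; combinations missing from the data get 0.
filled = (
    toy.set_index(["timestamp", "Asset_ID"])
       .reindex(full_index, fill_value=0)
       .reset_index()
)

print(filled)
```

The result has one row per timestamp and asset, so it can be reshaped or sliced by position. Keep in mind that a filled 0 cannot be told apart from a real 0 afterwards.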

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime

In [None]:
start_time = datetime.timestamp(datetime.now())
crypto_df = pd.read_csv("../input/g-research-crypto-forecasting/train.csv")
print("load in",datetime.timestamp(datetime.now())-start_time,"seconds")

In [None]:
crypto_df.head(10)

In [None]:
# Adapted from https://www.kaggle.com/toomuchsauce/g-crypto-interactive-dashboard-indicators

def reduce_memory(df):
    """Downcast columns in place to smaller dtypes and report the memory saved."""
    before = df.memory_usage().sum()

    # Downcast float64 columns to float32 when their value range fits.
    # float32 keeps roughly 7 significant digits, so some precision is lost.
    for col in df.columns:
        if df[col].dtype == 'float64':
            c_min = df[col].min()
            c_max = df[col].max()
            if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)

    # Cast the ID, count and timestamp columns to narrower integer types.
    df['Asset_ID'] = df['Asset_ID'].astype('int8')
    df['Count'] = df['Count'].astype('int32')
    df['timestamp'] = df['timestamp'].astype('uint32')

    after = df.memory_usage().sum()

    print('Memory taken before transformation : ', before)
    print('Memory taken after transformation : ', after)
    print('Memory taken reduced by : ', (before - after) * 100 / before, '%')

    return df

crypto_df = reduce_memory(crypto_df)

In [None]:
crypto_df.to_feather('train.feather')

In [None]:
#del crypto_df
start_time = datetime.timestamp(datetime.now())
crypto_df = pd.read_feather("train.feather")
print("load in",datetime.timestamp(datetime.now())-start_time,"seconds")

In [None]:
crypto_df.head(10)

## **Transform data to fill gaps**

In [None]:
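# Build the full (timestamp, Asset_ID) grid, then left-join the original rows onto it.
df1 = pd.DataFrame({'timestamp': crypto_df.timestamp.unique()}).sort_values(['timestamp'])
df2 = pd.DataFrame({'Asset_ID': crypto_df.Asset_ID.unique()}).sort_values(['Asset_ID'])
df_x = df1.merge(df2, how='cross').set_index(['timestamp', 'Asset_ID'], drop=True)  # how='cross' needs pandas >= 1.2

value_cols = ['Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'Target']
indexed_df = crypto_df.set_index(['timestamp', 'Asset_ID'], drop=True).sort_index()[value_cols]

# fillna(0) replaces both the newly created gaps and any NaN already in the data.
crypto_df = df_x.join(indexed_df, how='left').fillna(0).reset_index()
del df1, df2, df_x, indexed_df
crypto_df.to_feather('transformed_train.feather')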

In [None]:
crypto_df.head(10)