## UBIQUANT SCALING PARQUET

### <span style="color:blue">Scaling Method</span>

- **time_id** $\rightarrow$ min-max scaling ( $\frac{X-\min}{\max-\min}$ )
- **f_0 ~ f_299** $\rightarrow$ standard scaling ( $\frac{X-\text{mean}}{\text{std}}$ )
- **investment_id** $\rightarrow$ mean target encoding (Version 2)
- **target** $\rightarrow$ keep the original
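
The two scaling formulas above can be illustrated on a toy frame. This is a minimal sketch: the values are made up, and `toy`, `time_scaled`, and `f_scaled` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "time_id": [0, 1, 2, 4],
    "f_0": [1.0, 2.0, 3.0, 4.0],
})

# Min-max scaling maps time_id onto the interval [0, 1].
t = toy["time_id"]
time_scaled = (t - t.min()) / (t.max() - t.min())

# Standard scaling centers f_0 at 0 with unit standard deviation.
# pandas computes the sample standard deviation (ddof=1) by default.
f = toy["f_0"]
f_scaled = (f - f.mean()) / f.std()

print(time_scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```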

### <span style="color:blue">Description</span>

- **scaled_df.parquet** $\rightarrow$ full dataframe after scaling
- **scaled_train.parquet** $\rightarrow$ 80% of the dataframe after scaling (training split)
- **scaled_test.parquet** $\rightarrow$ 20% of the dataframe after scaling (test split)
- **df_describe.parquet** $\rightarrow$ summary statistics of the full raw dataframe (output of `describe()`)
- **remain_cols.parquet** $\rightarrow$ time_id (scaled), investment_id (mean target encoded), and target for all rows

### <span style="color:blue">How to Use</span>

- **Add data** (in the top right corner of the Kaggle notebook) > **Notebook Output File** > search with the keyword below
- keyword: ubiquant-scaling-parquet
- location: /kaggle/input/ubiquant-scaling-parquet
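
Once the dataset is attached, the files can be read from the location above. The sketch below assumes the file names listed in the Description section; `parquet_path` and `load` are hypothetical helpers, not part of the notebook, and reading requires a parquet engine such as pyarrow.

```python
from pathlib import Path

import pandas as pd

# Mount point given in the list above.
DATA_DIR = Path("/kaggle/input/ubiquant-scaling-parquet")

# File stems taken from the Description section.
FILES = ["scaled_df", "scaled_train", "scaled_test", "df_describe", "remain_cols"]


def parquet_path(name, root=DATA_DIR):
    """Hypothetical helper: build the path of one published parquet file."""
    if name not in FILES:
        raise ValueError(f"unknown file: {name}")
    return root / f"{name}.parquet"


def load(name, root=DATA_DIR):
    """Hypothetical helper: read one of the files into a DataFrame."""
    return pd.read_parquet(parquet_path(name, root))


# Inside a Kaggle notebook with the dataset attached:
# train_df = load("scaled_train")
# test_df = load("scaled_test")
print(parquet_path("scaled_train"))
```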

## IMPORT LIBRARIES

In [None]:
import numpy as np
import pandas as pd
import os
import math
import gc

## LOADING DATA
- parquet url: https://www.kaggle.com/robikscube/ubiquant-parquet?select=train_low_mem.parquet
- location: /kaggle/input/ubiquant-parquet

In [None]:
%%time
# load data and shuffle with random state 2022
train = [pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet').sample(frac=1, random_state=2022)]
print(train[0].shape)
print(f'number of row_id: {train[0].row_id.nunique()}')
print(f'number of time_id: {train[0].time_id.nunique()}')
print(f'number of investment_id: {train[0].investment_id.nunique()}')

## DATAFRAME DESCRIBE
- contains summary statistics of the raw data
- used later in the scaling step

In [None]:
dfs = []

dfs.append(train[0].describe())
dfs[0].to_parquet('df_describe.parquet', engine='pyarrow')
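
For reference, `describe()` returns one row per statistic and one column per numeric column, so each statistic can be looked up by label. A small sketch on made-up data (`toy` and `desc` are illustrative names):

```python
import pandas as pd

# Toy frame standing in for the raw training data.
toy = pd.DataFrame({"time_id": [0, 1, 2, 3], "f_0": [2.0, 4.0, 6.0, 8.0]})

desc = toy.describe()

# The index holds the statistic names used later by the scaling functions.
print(list(desc.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# A single statistic for a single column is a label-based lookup.
f0_mean = desc.loc["mean", "f_0"]      # 5.0
time_max = desc.loc["max", "time_id"]  # 3.0
```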

## MEAN TARGET ENCODING
- reference: https://casa-de-feel.tistory.com/22

In [None]:
mean_target_encoded = train[0].groupby('investment_id')['target'].mean()
mean_target_encoded.to_csv('mean_target_encoded.csv')
train[0]['investment_id'] = train[0]['investment_id'].map(mean_target_encoded)
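
The encoding step can be traced on a toy example. The fallback to the global mean for unseen IDs is an assumption added here for illustration; the notebook itself only maps IDs that are present in the training data.

```python
import pandas as pd

# Toy data: two investments with a handful of targets each (made-up values).
toy = pd.DataFrame({
    "investment_id": [1, 1, 2, 2, 2],
    "target": [0.5, 1.5, -1.0, 0.0, 1.0],
})

# Mean target per investment_id: id 1 -> 1.0, id 2 -> 0.0.
encoding = toy.groupby("investment_id")["target"].mean()

# Replace each ID by the mean target of its group.
toy["investment_id_enc"] = toy["investment_id"].map(encoding)

# Assumption (not in the notebook): IDs missing from the mapping become NaN
# after map(), so they are filled with the global target mean here.
global_mean = toy["target"].mean()
new_ids = pd.Series([1, 3])
new_enc = new_ids.map(encoding).fillna(global_mean)

print(toy["investment_id_enc"].tolist())  # [1.0, 1.0, 0.0, 0.0, 0.0]
```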

## STANDARD SCALING
- downcast dataframe: https://www.kaggle.com/ljjblackpig/3-steps-to-reduce-memory-size-for-the-dataset

In [None]:
def downcast_df(df):
    """Downcast every float64 column to the smallest float dtype that fits (float32 at minimum)."""
    float_cols = list(df.select_dtypes(include=["float64"]).columns)

    if float_cols:
        for col in float_cols:
            df[col] = pd.to_numeric(df[col], downcast="float")
    else:
        print("no columns to downcast")

    return df


def standard_scale(df, cols, desc):
    """Standardize `cols` with the mean and std rows of `desc` (a describe() frame)."""
    df = df[cols]
    desc = desc[cols]
    df = (df - desc.loc['mean']) / desc.loc['std']
    return downcast_df(df)


def min_max_scale(df, cols, desc):
    """Rescale `cols` to [0, 1] with the min and max rows of `desc` (a describe() frame)."""
    df = df[cols]
    desc = desc[cols]
    df = (df - desc.loc['min']) / (desc.loc['max'] - desc.loc['min'])
    return downcast_df(df)
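
The sketch below restates the two scaling functions in compact form (without the downcast step) and checks them on made-up data. It also shows that keeping the `describe()` frame makes the transformation reversible, which is one possible use of `df_describe.parquet`. The names `toy`, `z`, `t`, and `restored` are illustrative.

```python
import numpy as np
import pandas as pd


def standard_scale(df, cols, desc):
    """(X - mean) / std, with statistics looked up in a describe() frame."""
    return (df[cols] - desc.loc["mean", cols]) / desc.loc["std", cols]


def min_max_scale(df, cols, desc):
    """(X - min) / (max - min), with statistics looked up in a describe() frame."""
    low, high = desc.loc["min", cols], desc.loc["max", cols]
    return (df[cols] - low) / (high - low)


toy = pd.DataFrame({
    "time_id": [0, 5, 10],
    "f_0": [1.0, 2.0, 3.0],
    "f_1": [10.0, 20.0, 60.0],
})
desc = toy.describe()
feature_cols = ["f_0", "f_1"]

z = standard_scale(toy, feature_cols, desc)  # each column: mean 0, sample std 1
t = min_max_scale(toy, ["time_id"], desc)    # time_id mapped onto [0, 1]

# Undo the standard scaling with the same statistics.
restored = z * desc.loc["std", feature_cols] + desc.loc["mean", feature_cols]

print(t["time_id"].tolist())  # [0.0, 0.5, 1.0]
```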

In [None]:
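# columns that are not standard-scaled; investment_id is already mean target encoded
remain_cols = ['time_id', 'investment_id', 'target']
dfs.append(train[0][remain_cols].astype('float32'))
# min-max scale time_id with the raw statistics stored in dfs[0]
dfs[1]['time_id'] = min_max_scale(dfs[1], ['time_id'], dfs[0])['time_id']
dfs[1].to_parquet('remain_cols.parquet', engine='pyarrow')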

In [None]:
# feature columns, scaled in 7 chunks
cols = [v for v in train[0].columns if v.startswith('f_')]
n = math.ceil(len(cols) / 7)

scaled_dfs = []
for i in range(7):
    temp_col = cols[i*n:(i+1)*n]
    scaled_dfs.append(standard_scale(train[0], temp_col, dfs[0]))
    print('*', end='')
# drop the raw dataframe and free its memory
train.pop()
gc.collect()
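
The column chunking used above can be sketched in isolation. `split_columns` is a hypothetical helper written for this illustration; it slices a list into a fixed number of consecutive chunks using a ceiling division for the chunk size.

```python
import math


def split_columns(cols, n_chunks):
    """Split `cols` into `n_chunks` consecutive slices of near-equal size."""
    size = math.ceil(len(cols) / n_chunks)
    return [cols[i * size:(i + 1) * size] for i in range(n_chunks)]


# 300 feature names, matching the number of features in the dataset.
cols = [f"f_{i}" for i in range(300)]
chunks = split_columns(cols, 7)

print([len(c) for c in chunks])  # [43, 43, 43, 43, 43, 43, 42]
```

Every column lands in exactly one chunk, so concatenating the scaled chunks along the column axis restores the original column order.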

## SAVE SCALED DATAFRAME AS PARQUET

In [None]:
temp_dfs = []

# concatenate the remaining columns and the scaled chunks in three groups
temp_dfs.append(pd.concat(dfs[1:2]+scaled_dfs[:2], axis=1))
temp_dfs.append(pd.concat(scaled_dfs[2:4], axis=1))
temp_dfs.append(pd.concat(scaled_dfs[4:], axis=1))

# drop the references to the per-chunk frames
scaled_dfs.clear()

In [None]:
# join the three groups into the final dataframe
scaled_df = pd.concat(temp_dfs, axis=1)

# drop the references to the intermediate frames
temp_dfs.clear()

scaled_df.to_parquet('scaled_df.parquet', engine='pyarrow')

## TRAIN TEST SPLIT
- train 80% : test 20%
- the rows were already shuffled when the data was loaded, so a positional split is enough

In [None]:
# the first 20% of the shuffled rows form the test set, the rest the training set
n_test = int(len(scaled_df)*0.2)
scaled_df.iloc[:n_test].to_parquet('scaled_test.parquet', engine='pyarrow')
scaled_df.iloc[n_test:].to_parquet('scaled_train.parquet', engine='pyarrow')
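
The shuffle-then-slice split can be checked on a small frame. This sketch uses made-up data and the same random state as the loading step; `toy`, `test_part`, and `train_part` are illustrative names.

```python
import pandas as pd

# Shuffle a toy frame the same way the raw data was shuffled at load time.
toy = pd.DataFrame({"x": range(10)}).sample(frac=1, random_state=2022)

# Positional split: first 20% of the shuffled rows for test, the rest for train.
n_test = int(len(toy) * 0.2)
test_part = toy.iloc[:n_test]
train_part = toy.iloc[n_test:]

print(len(train_part), len(test_part))  # 8 2
```

Because the split is positional, the two parts never share a row, and together they cover the whole frame.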