# Project Description 

In this competition, **I built a model that predicts short-term volatility for hundreds of stocks across different sectors**. Optiver provided hundreds of millions of rows of highly granular financial data, which are used to forecast volatility over 10-minute periods. The model is evaluated against real market data collected during the three-month evaluation period that follows the training period (the evaluation period ends on January 10, 2022).

# Data

#### Train Data - available to us
The model is built on the train data, which includes order book snapshots and executed trades at one-second resolution. Each order book snapshot provides the top two levels of the bid and ask sides. The trade data covers trades that were actually executed.

#### Test Data - not available to us
I could make 5 submissions each day to score my model against the test data. The model is evaluated on RMSPE (Root Mean Square Percentage Error). My best submission scored 20.78%.
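
For reference, RMSPE is the square root of the mean squared relative error. Writing $y_i$ for the realized volatility of row $i$, $\hat{y}_i$ for the prediction, and $n$ for the number of rows (symbols introduced here only for illustration), it reads:

$$
\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}
$$

Because each error is divided by the true value, a miss on a low-volatility row costs more than the same absolute miss on a high-volatility row.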

#### Evaluation Data - not available to us
This data is collected after the training period is over. The goal of the competition was to achieve the lowest RMSPE on the evaluation data.


# Raw Data
#### Order Book, Executed Trades, and Target


In [None]:
import pandas as pd 

path = '/kaggle/input/optiver-realized-volatility-prediction/'
file_path_bk = path + ('book_train.parquet/stock_id=0')
file_path_tr = path + ('trade_train.parquet/stock_id=0')

book_example = pd.read_parquet(file_path_bk)
trade_example = pd.read_parquet(file_path_tr)
target_example = pd.read_csv (path + 'train.csv')

In [None]:
# this is only for stock_id == 0 
book_example.head()

In [None]:
# this is only for stock_id == 0 
trade_example.head()

In [None]:
# The model forecasts 'target': the realized volatility over the next 10-minute window
target_example.head()

In [None]:
# There are 112 different stocks 
target_example['stock_id'].unique()

# Features and Model  
#### Features
From the raw data, I compute several features that help predict future volatility. From the book data, I derive the WAP (weighted average price), log returns, spreads, and total volume. From the trade data, I derive the VWAP (volume-weighted average price), log returns, and price times size. I then aggregate these features per time_id with several functions (mean, standard deviation, realized volatility, and others), both over the full bucket and over its later part (rows whose seconds_in_bucket exceeds a threshold). All of these steps are done in the process_bk and process_tr functions.
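
The book-side steps above can be sketched on a toy order book. This is a minimal illustration, not the full `process_bk` pipeline: the prices and sizes are made up, only level 1 is used, and only one feature is aggregated.

```python
import numpy as np
import pandas as pd

# Toy order book: two time_ids, level-1 quotes only (values are made up)
book = pd.DataFrame({
    'time_id':           [5, 5, 5, 11, 11, 11],
    'seconds_in_bucket': [0, 1, 2, 0, 1, 2],
    'bid_price1': [1.000, 1.001, 1.000, 0.999, 0.998, 1.000],
    'ask_price1': [1.002, 1.003, 1.002, 1.001, 1.000, 1.002],
    'bid_size1':  [100, 200, 100, 50, 50, 150],
    'ask_size1':  [100, 100, 300, 50, 150, 50],
})

# WAP: each side's price weighted by the size on the opposite side
book['wap1'] = (
    book['bid_price1'] * book['ask_size1'] + book['ask_price1'] * book['bid_size1']
) / (book['bid_size1'] + book['ask_size1'])

# Log return of the WAP, computed inside each time_id so that no return
# spans two buckets (the first row of every bucket is NaN)
book['log_wap1'] = book.groupby('time_id')['wap1'].transform(lambda s: np.log(s).diff())

def realized_volatility(series):
    # square root of the sum of squared log returns (Series.sum skips NaN)
    return np.sqrt((series ** 2).sum())

# One row per time_id, one column per aggregation function
features = book.groupby('time_id')['log_wap1'].agg([realized_volatility, 'mean', 'std'])
print(features)
```

Filtering `book` on `seconds_in_bucket` before the final `groupby` would give the same aggregates restricted to the later part of each bucket.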

#### Clustering
1. With K-Means, I clustered the stocks based on the correlation of their targets. I then selected several features and computed their mean within each cluster for every time_id. These steps are done in the clustering and adding_clustering_feature functions.
2. I selected several features, grouped them by time_id, and applied several aggregation functions. This step is done in the adding_more_feature function.
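
The clustering idea in step 1 can be sketched on synthetic data. Everything below is an assumption made for illustration: six fake stocks driven by two random factors, two clusters instead of the five used in this notebook, and the same synthetic table doubling as the feature that gets averaged.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_time_ids = 200

# Synthetic targets: stocks 0-2 follow one common factor, stocks 3-5 another
factor_a = rng.normal(size=n_time_ids)
factor_b = rng.normal(size=n_time_ids)
targets = pd.DataFrame(
    {stock_id: (factor_a if stock_id < 3 else factor_b) + 0.1 * rng.normal(size=n_time_ids)
     for stock_id in range(6)}
)
targets.index.name = 'time_id'
targets.columns.name = 'stock_id'

# Each stock is described by its row of the stock-by-stock correlation matrix
corr = targets.corr()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(corr.values)

# Stock ids grouped by cluster label
cluster_list = [corr.index[kmeans.labels_ == n].tolist() for n in range(2)]
print(cluster_list)

# Cluster-level feature: mean across the cluster's stocks for every time_id
# (the synthetic table stands in for a real feature column here)
cluster_mean = pd.DataFrame(
    {f'cluster_{n}': targets[stocks].mean(axis=1) for n, stocks in enumerate(cluster_list)}
)
print(cluster_mean.head())
```

Stocks that move together end up in the same cluster, so the cluster mean gives each row a view of how its group behaved in the same time_id.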

#### Model
I chose LightGBM (Light Gradient Boosting Machine) because it trains very fast and performs well. In some cases, LightGBM produces better accuracy than XGBoost and other boosting algorithms.
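
The training loop later in this notebook pairs the `rmse` objective with sample weights of `1 / target**2`. A minimal sketch of why that pairing lines up with the competition metric: squared errors weighted this way, averaged over rows, equal the squared RMSPE. The arrays are made-up numbers.

```python
import numpy as np

y_true = np.array([0.002, 0.004, 0.010, 0.001])
y_pred = np.array([0.0025, 0.0035, 0.009, 0.0012])

def rmspe(y_true, y_pred):
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

# A squared error weighted by 1 / y_true**2 is a squared percentage error
weights = 1 / np.square(y_true)
weighted_mse = np.mean(weights * np.square(y_true - y_pred))

print(rmspe(y_true, y_pred), np.sqrt(weighted_mse))
```

Both printed values agree, so minimizing the weighted squared error on the training rows is the same as minimizing their RMSPE.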



#### Challenges
* Time IDs are not necessarily sequential
* We are only given the stock_id, not the real ticker
* All of the prices have already been normalized
* Prices are normalized separately within each time_id, so the same stock_id showing the same normalized price in two different time_ids may have had different un-normalized prices
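
One consequence of the per-time_id normalization is worth spelling out: log returns, and therefore realized volatility, do not depend on the scale of the prices, so return-based features stay comparable even though raw price levels are not. A minimal sketch with made-up prices; dividing by the first price of the bucket is only an assumed normalization for illustration.

```python
import numpy as np

# Hypothetical un-normalized prices within one bucket
raw_prices = np.array([52.10, 52.13, 52.08, 52.11])
# Assumed normalization: rescale so the bucket starts at 1.0
normalized = raw_prices / raw_prices[0]

def realized_volatility(prices):
    # log(c * p_t) - log(c * p_{t-1}) = log(p_t) - log(p_{t-1}) for any constant c > 0
    log_returns = np.diff(np.log(prices))
    return np.sqrt(np.sum(log_returns ** 2))

print(realized_volatility(raw_prices), realized_volatility(normalized))
```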

In [None]:
import numpy as np 
import pandas as pd 
import os
import glob
from sklearn.model_selection import KFold
from sklearn.cluster import KMeans
import lightgbm as lgb
from joblib import Parallel, delayed
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
path = '/kaggle/input/optiver-realized-volatility-prediction/'
file_path_bk = path + ('book_train.parquet/stock_id=0')
file_path_tr = path + ('trade_train.parquet/stock_id=0')

In [None]:
def wap (df_sec, n = 1, cross = True):
    # weighted average price at book level n
    # cross = True : each side's price is weighted by the opposite side's size
    bid_p, ask_p = df_sec[f'bid_price{n}'], df_sec[f'ask_price{n}']
    bid_s, ask_s = df_sec[f'bid_size{n}'], df_sec[f'ask_size{n}']
    if cross : return (bid_p * ask_s + ask_p * bid_s) / (bid_s + ask_s)
    else : return (ask_p * ask_s + bid_p * bid_s) / (bid_s + ask_s)
    
def log_return(series):
    return np.log(series).diff()

def count_unique(series):
    return len(np.unique(series))

def realized_volatility(series):
    return np.sqrt(np.sum(series**2))

def high_low_spread(series):
    return np.max(series) - np.min(series)
def high_low_7525_spread(series):
    return np.percentile(series,75) - np.percentile(series,25)

def ser_diff(series):
    return series.diff()

def sum_above_mean (series) :
    return np.sum (series > np.mean(series))
def sum_below_mean (series) :
    return np.sum (series < np.mean(series))

def sum_above_zero (series) :
    return np.sum ( series > 0 )
def sum_below_zero (series) :
    return np.sum ( series < 0 )

def std_above_zero (series) :
    return np.std ( series > 0 )
def std_below_zero (series) :
    return np.std ( series < 0 )

def sum_sqrt (series) :
    return np.sqrt(1/np.sum(series))

def vwap (price_size ,size) :
    return np.cumsum (price_size) / np.cumsum (size)

def mean_diff (series) :
    return np.mean (series.diff())

In [None]:
def process_bk (file_path) :
    df_sec = pd.read_parquet (file_path)
    
    # adding feature 
    df_sec['wap1'] = wap(df_sec)
    df_sec['wap2'] = wap(df_sec,2)
    df_sec['wap1diff'] = df_sec['wap1'].diff()
    # transform returns one value per row, aligned to df_sec's index
    df_sec['log_wap1'] = df_sec.groupby(['time_id'])['wap1'].transform(log_return)
    df_sec['log_wap2'] = df_sec.groupby(['time_id'])['wap2'].transform(log_return)
    df_sec['wap_spread'] = abs(df_sec['wap1'] - df_sec['wap2'])
    df_sec ['bid_spread'] = df_sec ['bid_price1'] - df_sec ['bid_price2']
    df_sec ['ask_spread'] = df_sec ['ask_price2'] - df_sec ['ask_price1']
    df_sec ['price_spread1'] = (df_sec ['ask_price1'] - df_sec ['bid_price1'])/ (df_sec ['ask_price1'] + df_sec ['bid_price1'])
    df_sec ['price_spread2'] = (df_sec ['ask_price2'] - df_sec ['bid_price2'])/ (df_sec ['ask_price2'] + df_sec ['bid_price2'])
    df_sec ['volume_bid'] = df_sec ['bid_size1'] + df_sec ['bid_size2']
    df_sec ['volume_ask'] = df_sec ['ask_size1'] + df_sec ['ask_size2']
    df_sec ['volume_total'] = df_sec ['bid_size1'] + df_sec ['ask_size1'] + df_sec ['bid_size2'] + df_sec ['ask_size2']
    df_sec ['volume_spread'] = abs(df_sec ['bid_size1'] + df_sec ['bid_size2'] - df_sec ['ask_size1'] - df_sec ['ask_size2'])
    df_sec ['volume_total_1'] = df_sec ['bid_size1'] + df_sec ['ask_size1']
    df_sec ['volume_spread_1'] = (df_sec ['bid_size1'] - df_sec ['ask_size1']) / (df_sec ['bid_size1'] + df_sec ['ask_size1'])
    df_sec ['bid_ask_spread'] = abs(df_sec ['bid_spread'] - df_sec ['ask_spread'])
    df_sec['wap3'] = wap(df_sec,n=1,cross = False)
    df_sec['wap4'] = wap(df_sec,n=2,cross = False)
    df_sec['log_wap3'] = df_sec.groupby(['time_id'])['wap3'].transform(log_return)
    df_sec['log_wap4'] = df_sec.groupby(['time_id'])['wap4'].transform(log_return)
    
    # aggregation functions per time_id: full bucket (dict1) and tail windows (dict2, dict3)
    feature_dict1 = {
                     'wap1diff' : [np.mean],
                     'log_wap1' : [realized_volatility, np.mean, np.std],
                     'log_wap2' : [realized_volatility, np.mean, np.std],
                     'wap_spread' : [np.mean, np.std],
                     'bid_spread' : [np.mean, np.std],
                     'ask_spread' : [np.mean, np.std],
                     'price_spread1': [np.mean, np.std],
                     'price_spread2': [np.mean, np.std],
                     'volume_bid': [np.mean],
                     'volume_ask': [np.mean],
                     'volume_total': [np.mean],
                     'volume_spread': [np.mean],
                     'volume_total_1': [np.mean],
                     'volume_spread_1': [np.mean, np.std],
                     'bid_ask_spread' : [np.mean],
                     'log_wap3' :[realized_volatility],
                     'log_wap4' : [realized_volatility]
                    }
    feature_dict2 = {'log_wap1' : [realized_volatility, np.mean, np.std],
                   'log_wap2' : [realized_volatility, np.mean, np.std],
                    'price_spread1': [np.mean, np.std],
                    'volume_spread': [np.mean],
                    'volume_total_1': [np.mean],
                     'log_wap3' :[realized_volatility],
                     'log_wap4' : [realized_volatility]
                    }
    feature_dict3 = {'log_wap1' : [realized_volatility],
                   'log_wap2' : [realized_volatility]
                    }
    
    df = df_sec.groupby (['time_id']).agg(feature_dict1).reset_index()
    df.columns = ['_'.join(col) for col in df.columns]
    df = df.add_prefix('bk_')
    df['row_id'] = df['bk_time_id_'].apply(lambda x: file_path[ file_path.find('=') + 1: ] + f'-{x}')
    for window in [200,400] :
        df_temp = df_sec[df_sec['seconds_in_bucket'] > window].groupby (['time_id']).agg(feature_dict2).reset_index()
        df_temp.columns = ['_'.join(col) for col in df_temp.columns]
        df_temp = df_temp.add_prefix('bk_')
        df_temp = df_temp.add_suffix('_' + str(window))
        df_temp['row_id'] = df_temp['bk_time_id__' + str(window)].apply(lambda x: file_path[ file_path.find('=') + 1: ] + f'-{x}')
        df = df.merge (df_temp, how = 'left', on = 'row_id')
    for window in [100,300,500] :
        df_temp = df_sec[df_sec['seconds_in_bucket'] > window].groupby (['time_id']).agg(feature_dict3).reset_index()
        df_temp.columns = ['_'.join(col) for col in df_temp.columns]
        df_temp = df_temp.add_prefix('bk_')
        df_temp = df_temp.add_suffix('_' + str(window))
        df_temp['row_id'] = df_temp['bk_time_id__' + str(window)].apply(lambda x: file_path[ file_path.find('=') + 1: ] + f'-{x}')
        df = df.merge (df_temp, how = 'left', on = 'row_id')
    df.drop(['bk_time_id_','bk_time_id__100','bk_time_id__200','bk_time_id__300','bk_time_id__400','bk_time_id__500' ], axis = 1, inplace =True)
    return df

In [None]:
def process_tr (file_path) :
    df_sec = pd.read_parquet (file_path)
    
    # transform returns one value per row, aligned to df_sec's index
    df_sec['log_return'] = df_sec.groupby(['time_id'])['price'].transform(log_return)
    df_sec['log_return2'] = df_sec.groupby(['time_id'])['log_return'].transform(ser_diff)
    df_sec['price_size'] = df_sec['price'] * df_sec['size']
    df_sec['vwap'] = df_sec.groupby(['time_id'])['price_size'].cumsum() / df_sec.groupby(['time_id'])['size'].cumsum()
    df_sec['vwap_log_return'] = df_sec.groupby(['time_id'])['vwap'].transform(log_return)
    
    feature_dict1 = {'seconds_in_bucket':[count_unique],
                     'price' :[high_low_spread],
                     'size':[np.sum, np.std, high_low_7525_spread, sum_sqrt],
                     'order_count':[np.sum, np.std,sum_sqrt],
                     'log_return':[realized_volatility, 
                                   high_low_spread,sum_above_zero,sum_below_zero,
                                   std_above_zero,std_below_zero],
                     'log_return2':[np.mean],
                     'price_size' : [np.mean],
                     'vwap_log_return' : [realized_volatility, np.std]
                    }
    feature_dict2 = {'log_return':[realized_volatility, np.mean, high_low_spread],
                     'log_return2':[np.std, np.mean],
                     'vwap' : [np.std],
                     'vwap_log_return' : [realized_volatility, high_low_spread]
                    }
    feature_dict3 = {'log_return':[realized_volatility],
                     'vwap_log_return' : [realized_volatility]
                    }
    df = df_sec.groupby (['time_id']).agg(feature_dict1).reset_index()
    df.columns = ['_'.join(col) for col in df.columns]
    df = df.add_prefix('tr_')
    df['row_id'] = df['tr_time_id_'].apply(lambda x: file_path[ file_path.find('=') + 1: ] + f'-{x}')
    for window in [300,450] :
        df_temp = df_sec[df_sec['seconds_in_bucket'] > window].groupby (['time_id']).agg(feature_dict2).reset_index()
        df_temp.columns = ['_'.join(col) for col in df_temp.columns]
        df_temp = df_temp.add_prefix('tr_')
        df_temp = df_temp.add_suffix('_' + str(window))
        df_temp['row_id'] = df_temp['tr_time_id__' + str(window)].apply(lambda x: file_path[ file_path.find('=') + 1: ] + f'-{x}')
        df = df.merge (df_temp, how = 'left', on = 'row_id')
    for window in [200,400,500] :
        df_temp = df_sec[df_sec['seconds_in_bucket'] > window].groupby (['time_id']).agg(feature_dict3).reset_index()
        df_temp.columns = ['_'.join(col) for col in df_temp.columns]
        df_temp = df_temp.add_prefix('tr_')
        df_temp = df_temp.add_suffix('_' + str(window))
        df_temp['row_id'] = df_temp['tr_time_id__' + str(window)].apply(lambda x: file_path[ file_path.find('=') + 1: ] + f'-{x}')
        df = df.merge (df_temp, how = 'left', on = 'row_id')
    df.drop(['tr_time_id_', 'tr_time_id__200','tr_time_id__300','tr_time_id__400', 'tr_time_id__450','tr_time_id__500' ], axis = 1, inplace =True)
    return df

In [None]:
def process_all (stock_id_list, is_train = True) :
    path = '/kaggle/input/optiver-realized-volatility-prediction/'
    
    def process (stock_id) :
        # build the book and trade features for one stock and join them on row_id
        if is_train :
            path_bk = path + f'book_train.parquet/stock_id={stock_id}' 
            path_tr = path + f'trade_train.parquet/stock_id={stock_id}' 
        else :
            path_bk = path + f'book_test.parquet/stock_id={stock_id}' 
            path_tr = path + f'trade_test.parquet/stock_id={stock_id}'
            
        return pd.merge (process_bk (path_bk), process_tr (path_tr), how = 'left', on = 'row_id')
    
    # one DataFrame per stock, computed in parallel, then stacked into a single frame
    dfs = Parallel(n_jobs=-1, verbose=1)(
        delayed(process)(stock_id) for stock_id in stock_id_list
        )

    df =  pd.concat(dfs,ignore_index = True)
    
    if is_train : 
        target = pd.read_csv (path + 'train.csv')
        target.insert (2,'row_id',target['stock_id'].astype (str) + '-' + target['time_id'].astype(str))
        df = target.merge (df, how = 'left', on = 'row_id')
    
    else : 
        test = pd.read_csv (path + 'test.csv')
        df = test.merge (df, how = 'left', on = 'row_id')
    
    return df 

In [None]:
def adding_more_feature (df) : # do this after process all
    cols = ['bk_log_wap1_realized_volatility', 'bk_log_wap2_realized_volatility',
            'bk_log_wap1_realized_volatility_500','bk_log_wap2_realized_volatility_500',
            'tr_log_return_realized_volatility','tr_vwap_log_return_realized_volatility',
            'tr_log_return_realized_volatility_300','tr_vwap_log_return_realized_volatility_300',
            'tr_log_return_realized_volatility_400','tr_vwap_log_return_realized_volatility_400',
            'tr_log_return_realized_volatility_500','tr_vwap_log_return_realized_volatility_500'
           ]
    categories = ['time_id']

    for cat in categories :
        df_temp = df.groupby([cat])[cols].agg([np.nanmean, np.nanstd, high_low_spread, high_low_7525_spread]).reset_index()
        df_temp.columns = ['_'.join(col) for col in df_temp.columns]
        df_temp = df_temp.add_prefix(cat + '_')
        df = pd.merge (df,df_temp,how = 'left', left_on = cat, right_on = (cat + '_')*2 )
        df.drop([(cat + '_')*2],axis = 1,inplace = True)
    
    return df

In [None]:
def clustering ( n_c = 7) : # n_c is the number of clusters
    path = '/kaggle/input/optiver-realized-volatility-prediction/'
    df_cluster = pd.read_csv (path +'train.csv')

    # one row per time_id, one column per stock_id
    df_cluster = df_cluster.pivot(index='time_id', columns='stock_id', values='target')

    # cluster the stocks on their rows of the target correlation matrix
    corr = df_cluster.corr()
    kmeans = KMeans(n_clusters=n_c, random_state=0).fit(corr.values)

    # list of stock_ids for each cluster label
    cluster_list = [ ]
    for n in range(n_c):
        cluster_list.append ( corr.index[kmeans.labels_ == n].tolist() )
    return cluster_list

In [None]:
def adding_clustering_feature (df, clustering_list , cluster_ids = None):
    
    # by default, use every cluster
    if not cluster_ids : cluster_ids = [i for i in range (len (clustering_list))]
    df_cluster = pd.DataFrame()
    
    for i,cl in enumerate (clustering_list):
        if i not in cluster_ids : continue 
        df_ini = pd.DataFrame({'time_id' : df['time_id'].unique()})
        df_temp = df.loc[df['stock_id'].isin(cl)]
        df_temp = df_temp.groupby(['time_id']).agg(np.nanmean).reset_index()
        df_temp = pd.merge(df_ini, df_temp, how = 'left', on = 'time_id')
        df_temp.insert (2,'cluster_id', 'cluster_' + str(i))
        df_temp.drop(['stock_id'], axis = 1, inplace =True)
        df_cluster = pd.concat ([df_cluster,df_temp])
    
    df_cluster.reset_index( drop = True, inplace = True)
    df_cluster = df_cluster.pivot (index='time_id', columns='cluster_id')
    df_cluster.columns = ["_".join(x) for x in df_cluster.columns]
    df_cluster.reset_index (inplace = True)
    
    def cluster_selection (cl) :
        # keep time_id plus the per-cluster means of the features listed below
        patterns = [
            'bk_wap1diff_mean_cluster_',
            'bk_log_wap1_realized_volatility_cluster_',
            'bk_log_wap2_std_cluster_',
            'bk_wap_spread_mean_cluster_',
            'bk_bid_spread_std_cluster_',
            'bk_ask_spread_std_cluster_',
            'bk_price_spread1_mean_cluster_',
            'bk_price_spread2_mean_cluster_',
            'bk_volume_total_mean_cluster_',
            'bk_volume_spread_mean_cluster_',
            'bk_volume_spread_1_mean_cluster_',
            
            'tr_seconds_in_bucket_count_unique_cluster_',
            'tr_price_high_low_spread_cluster_',
            'tr_size_sum_cluster_',
            'tr_size_sum_sqrt_cluster_',
            'tr_order_count_sum_cluster_',
            'tr_order_count_sum_sqrt_cluster_',
            'tr_log_return_realized_volatility_cluster_',
            'tr_log_return2_mean_cluster_',
            'tr_price_size_mean_cluster_',
            'tr_vwap_log_return_realized_volatility_cluster_'
        ]
        # plain substring match; `Index | Index` is no longer a set union in pandas 2.x
        keep = [c for c in cl.columns if c == 'time_id' or any(p in c for p in patterns)]
        return cl[keep]
    
    df_cluster = cluster_selection(df_cluster)
    
    df = pd.merge (df,df_cluster, how = 'left', on = 'time_id')
    
    return df

In [None]:
# features 

train = pd.read_csv (path + 'train.csv')
test = pd.read_csv (path + 'test.csv')
stock_id_list_train = train.stock_id.unique()
stock_id_list_test = test.stock_id.unique()

df = process_all (stock_id_list_train)
df_test = process_all (stock_id_list_test, is_train = False)

#df = process_all ([0])
#df_test = process_all ([0], is_train = False)

df = adding_more_feature(df)
df_test = adding_more_feature(df_test)

clustering_list = clustering(5)

df = adding_clustering_feature(df, clustering_list)
df_test = adding_clustering_feature(df_test, clustering_list)

In [None]:
# Model
def rmspe(y_true, y_pred):
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

def feval_rmspe(y_pred, lgb_train):
    y_true = lgb_train.get_label()
    return 'RMSPE', rmspe(y_true, y_pred), False

def train_and_evaluate_lgb(train, test, params):
    
    features = [col for col in train.columns if col not in {"time_id", "target", "row_id"}]
    y = train['target']
    
    oof_predictions = np.zeros(train.shape[0])
    test_predictions = np.zeros(test.shape[0])
    
    
    kfold = KFold(n_splits = 5, random_state = 2021, shuffle = True)
    
    for fold, (trn_ind, val_ind) in enumerate(kfold.split(train)):
        print(f'Training fold {fold + 1}')
        x_train, x_val = train.iloc[trn_ind], train.iloc[val_ind]
        y_train, y_val = y.iloc[trn_ind], y.iloc[val_ind]
        
        train_weights = 1 / np.square(y_train)
        val_weights = 1 / np.square(y_val)
        train_dataset = lgb.Dataset(x_train[features], y_train, weight = train_weights)
        val_dataset = lgb.Dataset(x_val[features], y_val, weight = val_weights)
        
        model = lgb.train(params = params,
                          train_set = train_dataset, 
                          valid_sets = [train_dataset, val_dataset], 
                          verbose_eval = 250,
                          feval = feval_rmspe)
        
        oof_predictions[val_ind] = model.predict(x_val[features])
        test_predictions += model.predict(test[features]) / 5
        
    rmspe_score = rmspe(y, oof_predictions)
    print(f'Our out-of-fold RMSPE is {rmspe_score}')
    lgb.plot_importance(model,max_num_features=20)
    
    return test_predictions, oof_predictions, model

In [None]:
seed1=2021
params1 = {
    'objective': 'rmse',
    'boosting_type': 'gbdt',
    'max_depth': -1,
    'max_bin':100,
    'min_data_in_leaf':500,
    'learning_rate': 0.05,
    'subsample': 0.72,
    'subsample_freq': 4,
    'feature_fraction': 0.5,
    'lambda_l1': 0.5,
    'lambda_l2': 1.0,
    'categorical_column':[0],
    'seed':seed1,
    'feature_fraction_seed': seed1,
    'bagging_seed': seed1,
    'drop_seed': seed1,
    'data_random_seed': seed1,
    'n_jobs':-1,
    'verbose': -1,
    'num_boost_round':1400,
    'early_stopping_rounds' : 30

}

seed2=1111
params2 = {
    'objective': 'rmse',
    'boosting_type': 'gbdt',
    'max_depth': -1,
    'max_bin':100,
    'min_data_in_leaf':500,
    'learning_rate': 0.05,
    'subsample': 0.72,
    'subsample_freq': 4,
    'feature_fraction': 0.5,
    'lambda_l1': 0.5,
    'lambda_l2': 1.0,
    'categorical_column':[0],
    'seed':seed2,
    'feature_fraction_seed': seed2,
    'bagging_seed': seed2,
    'drop_seed': seed2,
    'data_random_seed': seed2,
    'n_jobs':-1,
    'verbose': -1,
    'num_boost_round':1200,
    'early_stopping_rounds' : 50
}

In [None]:
# Training 
predictions_lgb1, oof_predictions1, model1= train_and_evaluate_lgb(df, df_test,params1)
predictions_lgb2, oof_predictions2, model2= train_and_evaluate_lgb(df, df_test,params2)

In [None]:
def feature_importance (x,model):
    num = 20 
    fig_size = (40, 20)
    feature_imp = pd.DataFrame({'Feature':x.columns,'Value':model.feature_importance()})
    plt.figure(figsize=fig_size)
    sns.set(font_scale = 5)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", 
                                                        ascending=False)[0:num])
    plt.show ()
    return feature_imp

In [None]:
fi1 = feature_importance(df.drop(columns=['target','row_id','time_id'],axis = 1), model1)
fi2 = feature_importance(df.drop(columns=['target','row_id','time_id'],axis = 1), model2)

In [None]:
# Blend the two models: try weights from 0.30 to 0.70 for model 1 and keep the one with the lowest out-of-fold RMSPE
preds = []
for i in range (30,71) :
    i = i/100
    series = (oof_predictions1 * i) + (oof_predictions2 * (1-i))
    preds.append (rmspe (df['target'],series) )

percentage = (30 + preds.index (min(preds))) / 100

In [None]:
# Prediction 
prediction = (predictions_lgb1*percentage) + (predictions_lgb2*(1-percentage) )
submission = pd.concat ([df_test['row_id'],pd.Series (prediction)], axis = 1)
submission.columns = ['row_id','target']
submission

In [None]:
submission.to_csv('submission.csv',index = False)