# **Jane Street Market Prediction using XGBoost Algorithm with GPU 🚀⚡**

Hey everyone, this is the very first competition I'm taking part in. I want to thank Jane Street and Kaggle for making this competition available to all of us.

Thanks everyone, and good luck!

**- By Nakshatra Singh**

# **1. Loading necessary libraries and dependencies**

All imports are listed below for easy reference. Make sure you have selected the **`gpu`** accelerator for this notebook. Click on `+Add data` (by toggling the sidebar) and add the [Jane Street Market dataset](https://www.kaggle.com/c/jane-street-market-prediction/data).

Now you are all set up to run this notebook. 🤗

In [None]:
import cudf 
import numpy as np
import pandas as pd

#@ Plotly imports
import plotly.io as pio
#@ Using graph objects instead of Plotly Express, since they allow more customization
import plotly.graph_objs as go
#@ iplot renders figures inline in the notebook
from plotly.offline import iplot
#@ ggplot2 theme for plotly
pio.templates.default = "ggplot2"

#@ Importing environment
import janestreet
#@ Initialize the environment
env = janestreet.make_env() 
#@ An iterator which loops over the test
iter_test = env.iter_test() 

#@ Classifier import
import xgboost as xgb

#@ Clean progress bar
from tqdm.notebook import tqdm

#@ Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code
from numba import njit

# **2. Reading the data**

I'll be using the [cuDF library](https://github.com/rapidsai/cudf) by RAPIDS AI. cuDF provides a pandas-like API that will be familiar to data engineers and data scientists, so they can use it to accelerate their workflows without going into the details of CUDA programming. In this notebook it is only used to parse the CSV files on the GPU; each frame is converted to pandas right after loading.

In [None]:
print("Reading dataset using CUDA dataframes ...", end='')
#@ Parsing the training dataset by using RAPIDSAI cudf library
train_cudf = cudf.read_csv('/kaggle/input/jane-street-market-prediction/train.csv')
#@ Converting to a pandas dataframe 
train_data = train_cudf.to_pandas()
#@ Deleting training variable to save memory
del train_cudf

#@ Parsing the meta-dataset by using RAPIDSAI cudf library
meta_cudf = cudf.read_csv('/kaggle/input/jane-street-market-prediction/features.csv')
#@ Converting to a pandas dataframe 
meta_data = meta_cudf.to_pandas()
#@ Deleting meta-data variables to save memory
del meta_cudf

#@ Parsing sample predictions
sample_prediction_df = pd.read_csv('../input/jane-street-market-prediction/example_sample_submission.csv') 

print('Finished.\n')

#@ Printing out training and meta-data shapes
print(f'Train shape: {train_data.shape}')
print(f'Features meta shape: {meta_data.shape}')
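If you want to run the loading step somewhere cuDF is not available, one option is a small fallback wrapper. This is only a sketch: `read_csv_to_pandas` is a helper name I made up for illustration, and the toy CSV stands in for the competition files.

```python
import os
import tempfile

import pandas as pd

try:
    import cudf  # GPU dataframe library, only present on RAPIDS-enabled images
except Exception:  # not installed, or no usable GPU
    cudf = None


def read_csv_to_pandas(path):
    """Hypothetical helper: parse with cuDF when available, else with pandas."""
    if cudf is not None:
        return cudf.read_csv(path).to_pandas()
    return pd.read_csv(path)


# Tiny stand-in file so the sketch runs without the competition data.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")
    with open(csv_path, "w") as handle:
        handle.write("date,weight,feature_0\n0,0.0,1.5\n1,2.0,\n")
    toy = read_csv_to_pandas(csv_path)

print(toy.shape)
```

Either way the result is a regular pandas dataframe, so the rest of the notebook would not need to change.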

# **3. Preprocessing the data**

First, I'll keep only the rows with weight > 0 (you can read the [data description](https://www.kaggle.com/c/jane-street-market-prediction/data) for more details). Then I'll fill the null feature values by forward-filling from the previous row (falling back to 0 where no earlier value exists), and finally create the binary `action` target from the `resp` column.



In [None]:
print('Preprocessing data...', end='')

#@ Storing columns with the word feature included
features = [c for c in train_data.columns if 'feature' in c]

#@ Trades with weight=0 are not considered for scoring evaluation
train_data = train_data[train_data['weight'] > 0].reset_index(drop = True)

#@ Forward-filling NaNs, then filling any remaining NaNs with 0
train_data[features] = train_data[features].ffill().fillna(0)

#@ Binary target: action = 1 when resp > 0, else 0
train_data['action'] = (train_data['resp'].values > 0).astype(int)

print('Finished.')
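To see what these steps do, here is a minimal sketch on a made-up four-row frame. The values and the `toy` and `feature_cols` names are invented for illustration; only the column naming follows the competition data.

```python
import numpy as np
import pandas as pd

# Invented toy data: one zero-weight row and a few missing feature values.
toy = pd.DataFrame({
    "weight":    [0.0, 1.0, 2.0, 0.5],
    "resp":      [0.01, -0.02, 0.03, 0.00],
    "feature_0": [1.0, np.nan, 3.0, np.nan],
    "feature_1": [np.nan, np.nan, 5.0, 6.0],
})
feature_cols = [c for c in toy.columns if "feature" in c]

# 1. Drop the zero-weight rows.
toy = toy[toy["weight"] > 0].reset_index(drop=True)
# 2. Forward-fill, then use 0 where there is no earlier value to copy.
toy[feature_cols] = toy[feature_cols].ffill().fillna(0)
# 3. Binary target: 1 when resp is strictly positive.
toy["action"] = (toy["resp"] > 0).astype(int)

print(toy)
```

Note that the zero-weight row is dropped before the forward fill, so its feature values are never carried into later rows.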

# **4. Data Visualization**

We'll now plot the target column distribution using [plotly](https://plotly.com/python/).



In [None]:
x = train_data['action'].value_counts().index
y = train_data['action'].value_counts().values

trace = go.Bar(x=x,
               y=y,
               marker=dict(color=y,
                           colorscale='sunsetdark'))
    
data = [trace]
layout = go.Layout(showlegend=False,
                   title='<b>Is the target balanced or Not?</b>',
                   xaxis=dict(title='<b>Action</b>'),
                   yaxis=dict(title='<b>Count</b>'))

fig = go.Figure(data=data, layout=layout)
iplot(fig)  

#@ Deleting unnecessary variables to save memory
del(x, y)  
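If you prefer a number to a chart, the same balance check can be done with `value_counts(normalize=True)`. A quick sketch on an invented `action` series (not the real target):

```python
import pandas as pd

# Invented labels standing in for train_data['action'].
action = pd.Series([1, 0, 1, 1, 0, 1, 0, 1], name="action")

# Share of each class, indexed by class label.
share = action.value_counts(normalize=True).sort_index()
positive_share = float(share.loc[1])

print(share)
```

A share close to 0.5 for each class would suggest the target is roughly balanced.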

# **5. XGBoost Classifier** 

Let's start by using XGBoost as our first boosting classifier to build our machine learning model. You can also optimize your hyperparameters with scikit-learn's **search and cross-validation techniques** (such as [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html), and many more).

[XGBoost - Official Documentation](https://xgboost.readthedocs.io/en/latest/)
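Any hyperparameter search needs some data held out for scoring. Since the rows carry a `date` column, one simple option is to validate on the latest dates. The sketch below is an assumption on my part, not what this notebook does: `date_holdout_masks` is a made-up helper and the cutoff of 400 is arbitrary.

```python
import numpy as np


def date_holdout_masks(dates, first_train_date, first_valid_date):
    """Hypothetical helper: boolean masks for a date-based train/validation split."""
    dates = np.asarray(dates)
    train_mask = (dates >= first_train_date) & (dates < first_valid_date)
    valid_mask = dates >= first_valid_date
    return train_mask, valid_mask


# Invented dates standing in for train_data['date'].
dates = np.array([79, 80, 81, 200, 399, 400, 450, 499])
train_mask, valid_mask = date_holdout_masks(
    dates, first_train_date=81, first_valid_date=400)

print(dates[train_mask], dates[valid_mask])
```

The masks can then index the feature and target arrays, so each candidate parameter set is scored on dates the model has not seen.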

In [None]:
#@ XGBoost Classifier with GPU support 
print('Creating XGBclassifier...\n', end='')

#@ Setting up hyper-parameters in a pretty formatted way
parameters = {'max_depth': 8,
              'learning_rate': 0.015,
              'random_state': 42,
              'tree_method': 'gpu_hist',
              'min_child_weight': 0.30,
              'subsample': 0.46,
              'colsample_bytree': 0.99,
              'eval_metric': 'auc',
              'gamma': 9.8,
              'objective': 'binary:logistic'}

#@ Setting up training variables (only rows with date > 80 are used)
X_train = train_data.loc[train_data['date'] > 80, features].values
y_train = train_data.loc[train_data['date'] > 80, 'action'].values

#@ Loading numpy arrays into DMatrix
d_train = xgb.DMatrix(X_train, y_train)
#@ Training the booster with the hyper-parameters above for 1175 boosting rounds
%time clf = xgb.train(parameters, d_train, 1175)   

print('Finished training the classifier.') 

# **6. Submitting**

In [None]:
#@ Utility function for submission, compiled with njit
@njit
def fast_fillna(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array

#@ Calling the function once triggers numba's JIT compilation
train_data.loc[0, features[1:]] = fast_fillna(train_data.loc[0, features[1:]].values, 0)

In [None]:
tmp = np.zeros(len(features))
for (test_df, sample_prediction_df) in tqdm(iter_test):
    if test_df['weight'].item() > 0:
        X_test = test_df.loc[:, features].values
        X_test[0, :] = fast_fillna(X_test[0, :], tmp)
        tmp = X_test[0, :]
        #@ Converting the numpy array to DMatrix
        d_test = xgb.DMatrix(X_test)
        #@ Predicting probabilities and thresholding at 0.5 to get the action
        y_preds = clf.predict(d_test)
        sample_prediction_df.action = np.where(y_preds >= 0.5, 1, 0).astype(int)
        
    else:
        sample_prediction_df.action = 0

    env.predict(sample_prediction_df) 
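The loop above fills missing test features with the values from the last scored row, starting from zeros. Here is a plain NumPy sketch of that carry-forward idea, without numba and on invented rows. `fill_from_previous` is a hypothetical stand-in for `fast_fillna`.

```python
import numpy as np


def fill_from_previous(row, previous):
    """Replace NaNs in `row` with the values at the same positions in `previous`."""
    return np.where(np.isnan(row), previous, row)


# Invented test rows arriving one at a time.
rows = np.array([[1.0, np.nan, 3.0],
                 [np.nan, 5.0, np.nan],
                 [7.0, np.nan, np.nan]])

previous = np.zeros(rows.shape[1])  # nothing seen yet, so fall back to 0
filled = []
for row in rows:
    row = fill_from_previous(row, previous)
    previous = row  # the filled row becomes the fallback for the next one
    filled.append(row)
filled = np.vstack(filled)

print(filled)
```

Each feature keeps its most recent known value, and a feature that has never been observed stays at 0.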

## If you liked this notebook, please make sure to upvote this kernel ⬆️. 💬 Connect? Let’s get social: http://myurls.co/nakshatrasinghh.