# Import The Libraries Needed

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#import datatable as dt

# Import The data
**Load the data**

Reading the file with pandas takes a long time. The operation is faster with Datatable, but trying it here runs into a memory problem: the notebook tries to allocate more memory than is available.

In [None]:
%%time
data_train = pd.read_csv("../input/jane-street-market-prediction/train.csv")

With pandas, the read takes approximately 1min 40s.
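When memory rather than time is the bottleneck, one common option is to ask pandas for narrower dtypes while parsing. The sketch below uses a tiny in-memory CSV as a stand-in for train.csv; the column names and values are made up for illustration, and whether float32 precision is acceptable for the real data is an assumption to check.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for train.csv (hypothetical values; the real file is far larger).
csv_text = "weight,feature_0,feature_1\n0.5,1.25,-0.75\n0.0,2.5,0.125\n"

# Default parsing stores every floating-point column as float64.
df64 = pd.read_csv(io.StringIO(csv_text))

# Requesting float32 up front halves the storage of those columns.
float_cols = ["weight", "feature_0", "feature_1"]
df32 = pd.read_csv(io.StringIO(csv_text), dtype={c: np.float32 for c in float_cols})

bytes64 = df64.memory_usage(index=False).sum()
bytes32 = df32.memory_usage(index=False).sum()
print("float64 bytes:", bytes64, "| float32 bytes:", bytes32)
```

The same `dtype` mapping can be passed to the `pd.read_csv` call above, provided the listed columns really are numeric.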

Method: Datatable

![](https://i.ibb.co/V9S7jRH/0-w7dsj-AY9-CKNY7ow-L.png)

Datatable (heavily inspired by R's data.table) can read large datasets fairly quickly and is often faster than pandas. It is specifically meant for processing tabular datasets, with an emphasis on speed and support for large data.

Documentation: https://datatable.readthedocs.io/en/latest/index.html

install datatable

**Read The Data**

In [None]:
#%%time

#data = dt.fread("../input/jane-street-market-prediction/train.csv")

#print("Train size:", data.shape)

It contains 2390491 rows and 138 columns.

In [None]:
data_train.head()

In [None]:
#column names
data_train.columns

The types of columns:

In [None]:
# column types
data_train.dtypes

# Checking the missing data 

In [None]:
missing_values_count = data_train.isnull().sum()
missing_values_count

In [None]:
# np.prod replaces np.product, which was removed in NumPy 2.0
total_cells_data = np.prod(data_train.shape)
total_missing_data = missing_values_count.sum()
print ("The percentage of missing data = ",(total_missing_data/total_cells_data) * 100, "%")
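The single overall percentage hides which columns are affected. A per-column view can be sketched as follows; the toy frame and its values are invented for illustration and only mimic the layout of the training data.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (hypothetical values).
toy = pd.DataFrame({
    "feature_0": [1.0, np.nan, 3.0, np.nan],
    "feature_1": [0.5, 0.25, np.nan, 1.0],
    "weight": [1.0, 2.0, 0.0, 4.0],
})

# The mean of a boolean mask is the fraction of missing cells in each column.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to `data_train`, the same expression ranks the columns by how much of their data is missing.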

**Why replace with -999?**  >  [Reference](https://stats.stackexchange.com/questions/225175/why-do-some-people-use-999-or-9999-to-replace-missing-values/225179)

Since the values in our data lie far from -999, we can use it as the fill value:

In [None]:
data_train = data_train.fillna(-999)
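The claim that the data stays clear of -999 can be checked before filling. This is a minimal sketch on a made-up frame; on the real data the check would run over the feature columns of `data_train` prior to the `fillna` call.

```python
import numpy as np
import pandas as pd

SENTINEL = -999

# Toy features with gaps (hypothetical values).
toy = pd.DataFrame({
    "feature_0": [0.3, np.nan, -2.1],
    "feature_1": [np.nan, 4.7, 1.2],
})

# min() skips NaN by default, so this is the smallest observed value.
smallest_observed = toy.min().min()
sentinel_is_safe = smallest_observed > SENTINEL

# fillna returns a new frame; the original keeps its NaN values.
filled = toy.fillna(SENTINEL)
print("smallest observed value:", smallest_observed, "| sentinel safe:", sentinel_is_safe)
```

If the check fails, a sentinel further from the observed range would be needed so that filled cells cannot be confused with real values.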

In [None]:
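# Drop zero-weight rows; .copy() avoids a SettingWithCopyWarning on the assignment below
data_train = data_train[data_train['weight'] != 0].copy()
# action = 1 when weight * resp is positive, otherwise 0
data_train['action'] = ((data_train['weight'].values * data_train['resp'].values) > 0).astype('int')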
data_train['action']

In [None]:
data_train

# Splitting The Train Data

This dataset contains an anonymized set of features, feature_{0...129}, representing real stock market data. Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it. Each trade has an associated weight and resp, which together represent a return on the trade.
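How weight and resp turn into the action label can be illustrated on a few made-up rows. The rule below (trade when weight * resp is positive) mirrors the one this notebook applies to the training data; the values are hypothetical.

```python
import pandas as pd

# Four hypothetical trading opportunities.
toy = pd.DataFrame({
    "weight": [0.0, 1.5, 2.0, 0.7],
    "resp": [0.01, 0.02, -0.03, 0.0],
})

# Treat weight * resp as the return of the trade, as this notebook does.
toy["weighted_return"] = toy["weight"] * toy["resp"]

# action = 1 (trade) only when that return is strictly positive.
toy["action"] = (toy["weighted_return"] > 0).astype(int)
print(toy)
```

A zero weight or a non-positive resp both lead to action 0, so only the second row is labelled as a trade.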

In [None]:
X_train = data_train.loc[:, data_train.columns.str.contains('feature')]
y_train = data_train.loc[:, 'action']

# Make Predictions with XGBoost Model

**What is XGBoost**?

XGBoost stands for eXtreme Gradient Boosting.

The name xgboost, though, actually refers to the engineering goal of pushing the limit of computational resources for boosted tree algorithms, which is the reason why many people use xgboost.


It is an implementation of gradient boosting machines created by Tianqi Chen, now with contributions from many developers. It belongs to a broader collection of tools under the umbrella of the Distributed Machine Learning Community or DMLC who are also the creators of the popular mxnet deep learning library.

Tianqi Chen provides a brief and interesting back story on the creation of XGBoost in the post Story and Lessons Behind the Evolution of XGBoost.

XGBoost is a software library that you can download and install on your machine, then access from a variety of interfaces. Specifically, XGBoost supports the following main interfaces:

* Command Line Interface (CLI).
* C++ (the language in which the library is written).
* Python interface as well as a model in scikit-learn.
* R interface as well as a model in the caret package.
* Julia.
* Java and JVM languages like Scala and platforms like Hadoop.

**XGBoost Features**:

The library is laser-focused on computational speed and model performance, so there are few frills. Nevertheless, it does offer a number of advanced features.

[Reference: read more](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/)

In [None]:
import xgboost as xgb
from xgboost import XGBClassifier

In [None]:
# Define the model; missing=-999 tells XGBoost to treat the fill value as missing
model = xgb.XGBClassifier(
    n_estimators=480,
    max_depth=10,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.7,
    missing=-999,
    random_state=2020,
    # Requires a GPU; XGBoost 2.0+ prefers tree_method='hist' with device='cuda'
    tree_method='gpu_hist'
)

# Train the XGBoost Model

In [None]:
%%time
model.fit(X_train, y_train)

In [None]:
sample_prediction_df = pd.read_csv("../input/jane-street-market-prediction/example_sample_submission.csv")
data_test = pd.read_csv("../input/jane-street-market-prediction/example_test.csv")

In [None]:
data_test.isnull().sum()

# Make The Environment

In [None]:
import janestreet
# initialize the environment
env = janestreet.make_env()
# an iterator that loops over the test set
iter_test = env.iter_test()

The **iter_test** function is:

A generator that loops through the test set and yields a (test_df, sample_prediction_df) pair at each step. Once you call env.predict to submit your action prediction, you can continue on to the next step.
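The predict-then-continue contract can be tried offline with a small mock. `MockEnv` below is a hypothetical stand-in written for illustration, not the real janestreet module, and its batch size and error behaviour are assumptions; the placeholder rule stands in for a trained model.

```python
import pandas as pd


class MockEnv:
    """Hypothetical stand-in for the competition environment (not the real API)."""

    def __init__(self, test_data, batch_size=1):
        self._test_data = test_data
        self._batch_size = batch_size
        self._awaiting_prediction = False
        self.submitted = []  # prediction frames received so far

    def iter_test(self):
        for start in range(0, len(self._test_data), self._batch_size):
            test_df = self._test_data.iloc[start:start + self._batch_size]
            sample_prediction_df = pd.DataFrame({"action": 0}, index=test_df.index)
            self._awaiting_prediction = True
            yield test_df, sample_prediction_df
            # Execution resumes here when the caller requests the next batch.
            if self._awaiting_prediction:
                raise RuntimeError("call predict() before requesting the next batch")

    def predict(self, prediction_df):
        self.submitted.append(prediction_df.copy())
        self._awaiting_prediction = False


toy_test = pd.DataFrame({"feature_0": [0.2, -0.4, 1.1]})
mock_env = MockEnv(toy_test)

for test_df, sample_prediction_df in mock_env.iter_test():
    # Placeholder rule instead of a trained model: trade when feature_0 is positive.
    sample_prediction_df["action"] = (test_df["feature_0"] > 0).astype(int).values
    mock_env.predict(sample_prediction_df)

all_predictions = pd.concat(mock_env.submitted)
print(all_predictions)
```

Skipping the `predict` call inside the loop makes the mock raise on the next iteration, which is the ordering constraint the description refers to.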

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    # Select the feature columns from the test data
    X_test = test_df.loc[:, test_df.columns.str.contains('feature')]
    # Replace missing values with -999; fillna returns a new DataFrame, so assign it back
    X_test = X_test.fillna(-999)
    # Predict using X_test
    y_preds = model.predict(X_test)
    # Store the predictions in sample_prediction_df
    sample_prediction_df.action = y_preds
    env.predict(sample_prediction_df)