# Jane Street: Effect of Delays in Time When Using XGBoost?

**Goal:** 

I wanted to test two questions. First, does training on only part of the Jane Street data affect AUC? Second, how does the time gap between the training period and the prediction period affect my XGBoost model's ability to predict? To test this, I split the training data into four quarters by date. Preprocessing consisted of filling NaNs with conditional means based on day, applying a Yeo-Johnson transformation to fix skew, and removing outliers by keeping data between the 5th and 97th quantiles. The AUC remains low.
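
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration on a synthetic frame, not the code that produced `features.csv`: the column name `feature_1`, the synthetic data, and the choice of scikit-learn's `PowerTransformer` for the Yeo-Johnson step are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in for the real data (hypothetical column names).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "date": np.repeat([0, 1, 2], 50),
    "feature_1": rng.lognormal(size=150),  # right-skewed on purpose
})
demo.loc[[3, 60, 120], "feature_1"] = np.nan

# 1) Fill NaNs with the mean of the same feature on the same day.
day_mean = demo.groupby("date")["feature_1"].transform("mean")
demo["feature_1"] = demo["feature_1"].fillna(day_mean)

# 2) Yeo-Johnson transformation to reduce skew.
pt = PowerTransformer(method="yeo-johnson")
demo["feature_1_yj"] = pt.fit_transform(demo[["feature_1"]]).ravel()

# 3) Keep only rows between the 5th and 97th quantiles.
lo, hi = demo["feature_1_yj"].quantile([0.05, 0.97])
trimmed = demo[demo["feature_1_yj"].between(lo, hi)]
print(len(demo), "rows before trimming,", len(trimmed), "after")
```

With many feature columns, step 3 would presumably be applied per column, which can remove a large share of rows.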

**Findings:** 

1) Time lag did not appear to affect the predictive ability of the model.

2) Training on part of the data had a similar AUC to training on all of the data.

**Next Steps:**

1) Make the model usable for one test observation at a time.

2) Perform cross-validation on the model.

3) Find a GPU and tune parameters.
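
For the cross-validation step, one possible approach is forward-chaining validation with `TimeSeriesSplit`, so that every fold trains on earlier rows and validates on later ones. The sketch below is an assumption about how this could look, not part of the original analysis. The helper `time_series_auc` is hypothetical, and `LogisticRegression` on synthetic data stands in for the XGBoost classifier so the example stays small; any scikit-learn compatible classifier with `predict_proba` could be passed instead.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit


def time_series_auc(estimator, X, y, n_splits=4):
    """Return the validation AUC of each forward-chaining fold.

    Rows of X and y are assumed to be sorted by time already.
    """
    scores = []
    for train_idx, val_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        # Fit a fresh copy on the earlier rows only.
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        # Score the later rows with positive-class probabilities.
        proba = model.predict_proba(X[val_idx])[:, 1]
        scores.append(roc_auc_score(y[val_idx], proba))
    return scores


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = (X_demo[:, 0] + rng.normal(size=400) > 0).astype(int)

fold_scores = time_series_auc(LogisticRegression(), X_demo, y_demo)
print(fold_scores)
```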

**Helpful Links:**

1) https://www.kaggle.com/dstuerzer/optimization-of-xgboost

2) https://www.datacamp.com/community/tutorials/xgboost-in-python

3) https://www.kaggle.com/saxinou/imbalanced-data-xgboost-tunning



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.metrics import roc_auc_score, roc_curve
import xgboost as xgb

In [None]:
features = pd.read_csv('../input/jane-street-yeo-data/features.csv')

In [None]:
outcomes = pd.read_csv('../input/jane-street-yeo-data/outcomes.csv')

In [None]:
def reduce_memory_usage(df):
    
    start_memory = df.memory_usage().sum() / 1024**2
    print(f"Memory usage of dataframe is {start_memory} MB")
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()
            
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    pass
        else:
            df[col] = df[col].astype('category')
    
    end_memory = df.memory_usage().sum() / 1024**2
    print(f"Memory usage of dataframe after reduction {end_memory} MB")
    print(f"Reduced by {100 * (start_memory - end_memory) / start_memory} % ")
    return df

# https://www.kaggle.com/sbunzini/reduce-memory-usage-by-75
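
One caveat about the helper above: it downcasts a float column to `float16` whenever the column's range fits, and the range check says nothing about precision. The short sketch below uses made-up values to show the rounding error this introduces; whether that error matters for these features is not something tested here.

```python
import numpy as np

# Approximate number of reliable decimal digits for each float type.
print("float16 digits:", np.finfo(np.float16).precision)
print("float32 digits:", np.finfo(np.float32).precision)

# Round-trip a few illustrative values through float16.
values = np.array([0.123456, 12.3456, 1234.56], dtype=np.float64)
as_half = values.astype(np.float16).astype(np.float64)
rel_error = np.abs(as_half - values) / np.abs(values)
print("relative error after float16 round-trip:", rel_error)
```

If the lost precision is a concern, one option is to stop the float branch at `float32`.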

In [None]:
reduce_memory_usage(features)

In [None]:
reduce_memory_usage(outcomes)

In [None]:
print(features.info())

In [None]:
###Create Real Score for Scoring
outcomes.loc[outcomes['resp'] >= 0,'real_score'] = int(1)
outcomes.loc[outcomes['resp']< 0,'real_score'] = int(0)

In [None]:
#Include date in features
features['date']=outcomes['date']
features['date'].isnull().sum()


In [None]:
features[['date']].describe()

In [None]:
#Split Features Into Quarters
First_25_Percent_features=features[features['date']<=203]
Second_25_Percent_features=features[(features['date']>203)&(features['date']<309)]
Third_25_Percent_features=features[(features['date']>=309)&(features['date']<409)]
Fourth_25_Percent_features=features[features['date']>=409]

In [None]:

#Split Outcome Into Quarters
First_25_Percent_outcomes=outcomes[outcomes['date']<=203]
Second_25_Percent_outcomes=outcomes[(outcomes['date']>203)&(outcomes['date']<309)]
Third_25_Percent_outcomes=outcomes[(outcomes['date']>=309)&(outcomes['date']<409)]
Fourth_25_Percent_outcomes=outcomes[outcomes['date']>=409]
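
The cutoffs 203, 309 and 409 used above are hard-coded, presumably read off the quartiles reported by `describe()`. As an alternative, the boundaries could be derived from the data. The sketch below runs on a synthetic `date` column and is only an assumption about how that might be done; the `quarter` column is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one row per trade, identified by its day number.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"date": np.sort(rng.integers(0, 500, size=1000))})

# Quartile cut points of `date`, weighted by the number of rows per day.
cuts = demo["date"].quantile([0.25, 0.5, 0.75]).to_numpy()

# Label each row 0..3. A date equal to a cut point goes to the earlier
# quarter, so all rows of one day always land in the same quarter.
demo["quarter"] = np.searchsorted(cuts, demo["date"].to_numpy(), side="left")

quarters = [part for _, part in demo.groupby("quarter")]
print([len(q) for q in quarters])
```

Applying the same labels to both `features` and `outcomes` would keep the two frames aligned without repeating the boundary conditions.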

In [None]:
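# Note: the CSV carries a saved index as its first column, and 'date' was appended as the last column.
# Drop both so that only feature columns are passed to the model.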

X_Q1=np.array(First_25_Percent_features.iloc[::,1:-1])
y_Q1=np.array(First_25_Percent_outcomes['action'])
X_Q2=np.array(Second_25_Percent_features.iloc[::,1:-1])
y_Q2=np.array(Second_25_Percent_outcomes['action'])
X_Q3=np.array(Third_25_Percent_features.iloc[::,1:-1])
y_Q3=np.array(Third_25_Percent_outcomes['action'])
X_Q4=np.array(Fourth_25_Percent_features.iloc[::,1:-1])
y_Q4=np.array(Fourth_25_Percent_outcomes['action'])

In [None]:
# Quarterly Model

# XGBoost Model

## Parameter Meanings

#### learning_rate:

Step-size shrinkage applied to each new tree to reduce overfitting. **Common Starting Points:** 0.1

#### max_depth:
Controls the depth of the trees. Shallower trees = reduced complexity = more underfitting. **Common Starting Points:** 4-6

#### subsample:
Fraction of training samples used per tree. Lower value = more underfitting. **Common Starting Points:** 0.8

#### colsample_bytree:
Fraction of features used per tree. Higher value = more overfitting. **Common Starting Points:** 0.5-0.9

#### n_estimators/num_boost_round:
Number of trees to build. **Common Starting Points:** 500-1000

#### objective: Loss Functions

- reg:linear: predict continuous values (renamed reg:squarederror in newer XGBoost releases)
- reg:logistic: logistic regression
- binary:logistic: probability-based binary classification

#### scale_pos_weight:
Used for class imbalance to adjust the weight of the positive class. A typical value is **#neg / #pos**.
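
As a small illustration of the #neg / #pos ratio, the sketch below computes it from a label array. The helper name `neg_pos_ratio` and the toy labels are hypothetical; the value 1.22 used for the model later in this notebook was not necessarily produced this way.

```python
import numpy as np


def neg_pos_ratio(y):
    """Return (#negative samples) / (#positive samples) for 0/1 labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples, ratio is undefined")
    return n_neg / n_pos


y_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0])  # 5 negatives, 3 positives
ratio = neg_pos_ratio(y_demo)
print(ratio)
```

The result could then be passed as `scale_pos_weight=neg_pos_ratio(y_Q1)` when building the classifier.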

#### min_child_weight:
Minimum sum of instance weights required in a child node, which is roughly a minimum number of samples when all samples have weight 1. A small value means the algorithm will create new leaves even when only a few samples are left to distinguish, which adds complexity and can increase overfitting. **Common Starting Points:** 1

### Regularizers:

#### gamma: 
Minimum loss reduction required to split a node in tree-based learners. Higher gamma = fewer splits. **Common Starting Points:** 0-0.2

#### reg_alpha: 
L1 regularizer of leaf weights. 


#### reg_lambda: 
L2 regularizer of leaf weights. 
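
The starting points listed above can be collected into a search grid. The sketch below only enumerates the combinations with scikit-learn's `ParameterGrid`; the specific values are picked from the ranges above as an illustration, not tuned results.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values taken from the "Common Starting Points" above.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [4, 5, 6],
    "subsample": [0.8],
    "colsample_bytree": [0.5, 0.7, 0.9],
    "gamma": [0.0, 0.1, 0.2],
}

candidates = list(ParameterGrid(param_grid))
print(len(candidates), "combinations; first one:", candidates[0])
```

The same dictionary could be handed to `GridSearchCV` (imported at the top of this notebook) together with a `TimeSeriesSplit` splitter. With 1000 trees per fit, a grid of this size is likely to need the GPU mentioned in the next steps.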




In [None]:
#xg_clas= xgb.XGBClassifier(objective= 'binary:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
#                max_depth = 3, gamma=.1, subsample=0.8, alpha = 10, scale_pos_weight = 1.22, n_estimators=1000, verbosity=2, tree_method ='gpu_hist')
xg_clas= xgb.XGBClassifier(objective= 'binary:logistic', colsample_bytree = 0.3, learning_rate = 0.05,
               max_depth = 2, gamma=.1, subsample=0.8, min_child_weight=4, alpha = 10, scale_pos_weight = 1.22, n_estimators=1000, tree_method ='gpu_hist', verbosity=2)

In [None]:
#data_dmatrix = xgb.DMatrix(data=X,label=y)
#model.fit(X_train, Y_train, eval_metric="rmse", eval_set=[(X_train, Y_train), (X_cv, Y_cv)], verbose=True, early_stopping_rounds = 10)

In [None]:
xg_clas.fit(X_Q1, y_Q1)

In [None]:
# AUC is computed from the predicted probability of the positive class;
# hard 0/1 labels from predict() would discard the ranking information.

# Performance on Q1 (the training quarter, so this score is in-sample)
q1_y_pred = xg_clas.predict_proba(X_Q1)[:, 1]
auc = roc_auc_score(y_Q1, q1_y_pred)
print("Q1 AUC Train Performance : ", auc)

# Performance on Q2
q2_y_pred = xg_clas.predict_proba(X_Q2)[:, 1]
auc = roc_auc_score(y_Q2, q2_y_pred)
print("Q2 AUC  Test Performance : ", auc)

# Performance on Q3
q3_y_pred = xg_clas.predict_proba(X_Q3)[:, 1]
auc = roc_auc_score(y_Q3, q3_y_pred)
print("Q3 AUC  Test Performance : ", auc)

# Performance on Q4
q4_y_pred = xg_clas.predict_proba(X_Q4)[:, 1]
auc = roc_auc_score(y_Q4, q4_y_pred)
print("Q4 AUC  Test Performance : ", auc)
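
AUC measures how well the model ranks positives above negatives, so it is normally computed from predicted probabilities (for example `predict_proba(X)[:, 1]`) rather than from thresholded 0/1 labels. The toy example below, with made-up scores, shows how thresholding can lower the reported AUC even when the ranking is perfect.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
# Hypothetical predicted probabilities: every positive outranks every negative.
scores = np.array([0.40, 0.45, 0.48, 0.60, 0.20, 0.55])
# Hard labels at a 0.5 threshold, as predict() would return.
labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
```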

In [None]:
from sklearn.metrics import confusion_matrix
#q2_real_y=np.array(Second_25_Percent_outcomes['real_score'])

tn, fp, fn, tp = confusion_matrix(y_Q2, xg_clas.predict(X_Q2)).ravel()
# Q2 Error rate : 
err_rate = (fp + fn) / (tp + tn + fn + fp)
print("Error rate  : ", err_rate)
# Q2 Accuracy : 
acc_ = (tp + tn) / (tp + tn + fn + fp)
print("Accuracy  : ", acc_)
# Q2 Sensitivity : 
sens_ = tp / (tp + fn)
print("Sensitivity  : ", sens_)
# Q2 Specificity : 
sp_ = tn / (tn + fp)
print("Specificity  : ", sp_)
# Q2 False positive rate (FPR)
FPR = fp / (tn + fp)
print("False positive rate  : ", FPR)

In [None]:
# Set the figure size before plotting so that it applies to this figure.
plt.rcParams['figure.figsize'] = [10, 10]
xgb.plot_importance(xg_clas, max_num_features=20)
plt.show()