## <a>Loading Packages and Data</a>

In [None]:
import numpy as np 
import pandas as pd
import os, gc
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb

sns.set_style('whitegrid')
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
PATH = '../input/jane-street-market-prediction/'

train = pd.read_csv(PATH + 'train.csv')
print(train.shape)

In [None]:
train.head(10)

Here we have:

1. A date column, which gives the day of the trade, and ts_id, which gives a time ordering.
2. 130 anonymized features: feature_{0...129}.
3. weight and resp, which together represent the return on the trade.
4. resp_{1,2,3,4} values, which represent returns over different time horizons.
5. No target column: 'action' is not present in the train set.

In the data section of the competition, it is mentioned that **"Trades with weight = 0 were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation."** 

So, we can remove the rows where weight=0.

Since the target variable 'action' is not present in the train set, we create it by setting action=1 where resp>0 and action=0 otherwise. This way we pass on the trades with a negative return (resp<0), which would otherwise decrease the utility score.

To better understand the utility score, see this simple and detailed explanation: https://www.kaggle.com/renataghisloti/understanding-the-utility-score-function
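
As a rough sketch of why passing on negative-return trades helps, the function below follows the utility formula as I understand it from the competition's evaluation page: for each date i, p_i is the sum of weight * resp * action, t = (sum of p_i / sqrt(sum of p_i^2)) * sqrt(250 / number of dates), and the utility is min(max(t, 0), 6) * sum of p_i. The helper name `utility_score` and the toy numbers are made up for illustration and are not part of the competition API.

```python
import numpy as np
import pandas as pd

def utility_score(date, weight, resp, action):
    """Hypothetical helper: competition-style utility for equal-length arrays."""
    per_trade = np.asarray(weight) * np.asarray(resp) * np.asarray(action)
    df = pd.DataFrame({'date': np.asarray(date), 'p': per_trade})
    p_i = df.groupby('date')['p'].sum().to_numpy()   # daily profit p_i
    denom = np.sqrt(np.sum(p_i ** 2))
    if denom == 0:                                   # no trades taken at all
        return 0.0
    t = p_i.sum() / denom * np.sqrt(250 / len(p_i))  # annualized Sharpe-like ratio
    return float(min(max(t, 0.0), 6.0) * p_i.sum())

# Toy data (made-up values): two dates, two trades each.
toy = pd.DataFrame({
    'date':   [0, 0, 1, 1],
    'weight': [1.0, 2.0, 1.0, 1.0],
    'resp':   [0.01, -0.02, 0.03, -0.01],
})
perfect = (toy['resp'] > 0).astype(int)   # take only positive-return trades
take_all = np.ones(len(toy), dtype=int)   # take every trade

u_perfect = utility_score(toy['date'], toy['weight'], toy['resp'], perfect)
u_all = utility_score(toy['date'], toy['weight'], toy['resp'], take_all)
print(u_perfect, u_all)
```

In this toy case, taking every trade makes the total return negative, so the clipped t is 0 and the utility collapses to zero, while skipping the resp<0 trades keeps it positive.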

In [None]:
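# Drop zero-weight trades; .copy() avoids a SettingWithCopyWarning on the next line
train = train[train['weight'] != 0].copy()
train['action'] = (train['resp'] > 0).astype(int)
# Recent seaborn versions expect the variable as a keyword argument
sns.countplot(x=train['action'])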
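
The count plot can be backed up with exact class proportions. The snippet below is a small sketch on a made-up frame (`toy` stands in for the filtered `train`); on the real data the same `value_counts` call would be applied to `train['action']`.

```python
import pandas as pd

# Toy stand-in for the filtered training frame (values are made up).
toy = pd.DataFrame({'resp': [0.01, -0.02, 0.03, -0.01, 0.002, -0.004]})
toy['action'] = (toy['resp'] > 0).astype(int)

# Fraction of rows in each class; values near 0.5 indicate a balanced target.
proportions = toy['action'].value_counts(normalize=True).sort_index()
print(proportions)
```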

The dataset is balanced with respect to the target 'action'. Now let's analyse the features.

In [None]:
train.describe()

Here:

1. The trade weight is never negative; since the zero-weight rows were dropped above, the remaining weights are strictly positive.
2. resp varies from -0.54 to 0.44, so its absolute value never exceeds 1.
3. The mean and std values are low for many features, and NaNs are present.




In [None]:
FEATURES = [x for x in train.columns if 'feature' in x]
len(FEATURES)

In [None]:
missing_values = pd.DataFrame()
missing_values['column'] = FEATURES
missing_values['num_missing'] = [train[i].isna().sum() for i in FEATURES]

missing_values.T

Many columns have a high number of missing values. 'feature_0' is of type float64 but contains only the values 1 and -1. Let's see if there are any other such columns.

In [None]:
unique_vals = pd.DataFrame()
unique_vals['column'] = FEATURES
unique_vals['num_unique'] = [train[i].nunique() for i in FEATURES]  # number of distinct non-NaN values

unique_vals.T
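
Instead of scanning the transposed table by eye, low-cardinality columns can be listed directly. This is a minimal sketch on a made-up frame; the threshold of 2 distinct values and the name `binary_like` are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): feature_0 is binary, the others are continuous.
toy = pd.DataFrame({
    'feature_0': [1.0, -1.0, 1.0, -1.0],
    'feature_1': [0.12, np.nan, -0.5, 0.33],
    'feature_2': [2.1, 0.4, 0.4, -1.7],
})

n_unique = toy.nunique()                         # NaNs are ignored by default
binary_like = n_unique[n_unique <= 2].index.tolist()
print(binary_like)
```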

All other feature columns have continuous values. Now, let's compare the distribution of each feature for the two target values.

In [None]:
fig, ax = plt.subplots(10, 10, figsize=(20,22))
ax = ax.flatten()

# feature_0 is binary, so plot features 1..100; kdeplot replaces the deprecated distplot(hist=False)
for k, i in enumerate(FEATURES[1:101]):
    sns.kdeplot(x=train.loc[train['action'] == 0, i], label='0', ax=ax[k])
    sns.kdeplot(x=train.loc[train['action'] == 1, i], label='1', ax=ax[k])

In [None]:
fig, ax = plt.subplots(6, 5, figsize=(20,22))
ax = ax.flatten()

# Remaining features (101 onwards); kdeplot replaces the deprecated distplot(hist=False)
for k, i in enumerate(FEATURES[101:]):
    sns.kdeplot(x=train.loc[train['action'] == 0, i], label='0', ax=ax[k])
    sns.kdeplot(x=train.loc[train['action'] == 1, i], label='1', ax=ax[k])

For most features, the distributions are very similar for the two target values. For a few features, such as 91, 94, 103 and 115, the distributions differ. We can use this in feature selection.
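
One way to turn this visual comparison into a number is a two-sample Kolmogorov-Smirnov statistic per feature, i.e. the largest gap between the two empirical CDFs. The sketch below is not what the plots above compute; it is a hand-rolled statistic on synthetic data, and the names `ks_statistic`, `feature_a` and `feature_b` are hypothetical.

```python
import numpy as np
import pandas as pd

def ks_statistic(a, b):
    """Two-sample KS statistic: max absolute gap between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]          # ignore missing values
    grid = np.concatenate([a, b])                     # evaluate at every observed point
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Synthetic data: feature_b is shifted when action == 1, feature_a is not.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({'action': rng.integers(0, 2, size=n)})
toy['feature_a'] = rng.normal(size=n)
toy['feature_b'] = rng.normal(size=n) + 1.0 * toy['action']

scores = {
    col: ks_statistic(toy.loc[toy['action'] == 0, col],
                      toy.loc[toy['action'] == 1, col])
    for col in ['feature_a', 'feature_b']
}
ranked = pd.Series(scores).sort_values(ascending=False)
print(ranked)
```

Features with a larger statistic separate the two classes more clearly and are candidates to keep during feature selection.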

Now let's check the correlations between features.

In [None]:
# Build a new list so that FEATURES itself is not mutated by appending 'resp'
p = FEATURES + ['resp']
len(p)

In [None]:
x = train[p].corr()
x

Next, we drop one column from every pair of columns whose absolute correlation coefficient is greater than 0.95.

In [None]:
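x = x.abs()
# Keep only the strict upper triangle so each pair is counted once
# (np.bool was removed from NumPy; the builtin bool works everywhere)
upper = x.where(np.triu(np.ones(x.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
print(to_drop)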

In [None]:
# Positional axis arguments are no longer accepted by recent pandas versions
train.drop(columns=to_drop, inplace=True)
train
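
To see how the upper-triangle filter behaves, here is a small self-contained sketch on synthetic data, where `f1` is constructed as a near-copy of `f0`. The column names and noise level are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    'f0': base,
    'f1': base + rng.normal(scale=0.01, size=500),  # near-duplicate of f0
    'f2': rng.normal(size=500),                     # independent of both
})

corr = toy.corr().abs()
# Strict upper triangle: each pair appears once, in the column of the later feature.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Because only the upper triangle is inspected, the later column of each highly correlated pair is dropped and the earlier one is kept.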

## <a>Model training</a>

In [None]:
FEATURES = [x for x in train.columns if 'feature' in x]

X = train[FEATURES]
y = train['action']
print(X.shape, y.shape)

In [None]:
model = lgb.LGBMRegressor()
cv = KFold(shuffle=True, n_splits=5, random_state=108)
params = {
    'n_estimators':[500]
#     'learning_rate':[0.1, 0.001, 0.5],
#     'subsample':[1, 0.9],
#     'feature_fraction':[1, 0.9]
}

clf = GridSearchCV(
    estimator=model, 
    scoring='neg_mean_squared_error',
    cv = cv,
    param_grid=params, 
    verbose=10
)

In [None]:
clf.fit(X, y)

In [None]:
feature_imp = pd.DataFrame()
feature_imp['imp'] = clf.best_estimator_.feature_importances_
feature_imp['column'] = X.columns

feature_imp = feature_imp.sort_values(by='imp', ascending=False)

plt.figure(figsize=(15,20))
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x=feature_imp.imp[:30], y=feature_imp.column[:30])

In [None]:
import janestreet
env = janestreet.make_env() 
iter_test = env.iter_test() 

In [None]:
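for test_df, sample_prediction_df in iter_test:
    wt = test_df.iloc[0].weight
    if wt == 0:
        # Zero-weight trades do not affect the score, so skip them
        sample_prediction_df.action = 0
    else:
        # The regressor outputs a score in roughly [0, 1]; threshold it at 0.5
        sample_prediction_df.action = (clf.predict(test_df.loc[:, FEATURES]) > 0.5).astype(int)
    env.predict(sample_prediction_df)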