## COMPREHENSIVE DATA EXPLORATION WITH PYTHON
[Crislânio Macêdo](https://medium.com/sapere-aude-tech) - Feb 21, 2021

----------


## Market Prediction - Starter Data Exploration 📈

- <a href='#1'>1. Introduction: Jane Street Market Prediction</a>
    - <a href='#1-1'>1.1. Data</a>
    - <a href='#1-2'>1.2. Evaluation Metric</a>
- <a href='#2'>2. Imports</a>
- <a href='#3'>3. Read in Data</a>
- <a href='#4'>4. Glimpse of Data</a>
- <a href='#5'>5. Reducing Memory Size</a>
- <a href='#6'>6. Exploratory Data Analysis</a>
    - <a href='#6-1'>6.1. Examine the Distribution of the Target Column</a>
    - <a href='#6-2'>6.2. Examine Missing Values</a>
    - <a href='#6-3'>6.3. Column Types</a>
- <a href='#7'>7. Plotting</a>
- <a href='#8'>8. Modeling</a>




    

# <a id='1'>1. Introduction: Jane Street Market Prediction</a>



> About the Host

Jane Street has spent decades developing their own trading models and machine learning solutions to identify profitable opportunities and quickly decide whether to execute trades. These models help Jane Street trade thousands of financial products each day across 200 trading venues around the world. Admittedly, this challenge far oversimplifies the depth of the quantitative problems Jane Streeters work on daily, and Jane Street is happy with the performance of its existing trading model for this particular question. However, there's nothing like a good puzzle, and this challenge will hopefully serve as a fun introduction to a type of data science problem that a Jane Streeter might tackle on a daily basis. Jane Street looks forward to seeing the new and creative approaches the Kaggle community will take to solve this trading challenge.


## <a id='1-1'>1.1 Data</a>


Files
- train.csv - the training set, contains historical data and returns
- example_test.csv - a mock test set which represents the structure of the unseen test set. You will not be directly using the test set or sample submission in this competition, as the time-series API will get/set the test set and predictions.
- example_sample_submission.csv - a mock sample submission file in the correct format
- features.csv - metadata pertaining to the anonymized features


## <a id='1-2'>1.2 Evaluation Metric</a>

This competition is evaluated on a utility score. Each row in the test set represents a trading opportunity for which you will be predicting an action value, 1 to make the trade and 0 to pass on it. Each trade j has an associated weight and resp, which represents a return.

For each date i, we define:
$p_i = \sum_j (weight_{ij} * resp_{ij} * action_{ij})$,

$t = \frac{\sum p_i}{\sqrt{\sum p_i^2}} * \sqrt{\frac{250}{|i|}}$,

where $|i|$ is the number of unique dates in the test set. The utility is then defined as:

$u = \min(\max(t, 0), 6) \sum p_i$.
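
To make the metric concrete, here is a minimal sketch of the utility score as defined above. The function name `utility_score` and the toy arrays are made up for illustration; this is not the official scoring code, and returning 0 when no trade produces a return is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

def utility_score(date, weight, resp, action):
    """Compute the utility u from per-trade arrays of equal length."""
    df = pd.DataFrame({
        "date": date,
        "p": np.asarray(weight) * np.asarray(resp) * np.asarray(action),
    })
    # p_i: sum of weight * resp * action over all trades j on date i
    p = df.groupby("date")["p"].sum().to_numpy()
    n_dates = len(p)  # |i|, the number of unique dates
    sum_p = p.sum()
    sum_p2 = (p ** 2).sum()
    if sum_p2 == 0:
        # Assumption of this sketch: score 0 when every p_i is zero
        return 0.0
    t = sum_p / np.sqrt(sum_p2) * np.sqrt(250 / n_dates)
    return float(min(max(t, 0), 6) * sum_p)

# Toy data (made up): two dates with two trading opportunities each
date = [0, 0, 1, 1]
weight = [1.0, 2.0, 1.0, 1.0]
resp = [0.5, -0.1, 0.2, 0.3]
action = [1, 0, 1, 1]
u = utility_score(date, weight, resp, action)
print(u)
```

Because `t` is clipped to the range [0, 6], a set of actions with a negative total return scores 0 rather than a negative utility.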

# <a id='2'>2. Imports </a>
<a href='#1'>Top</a>

> We are using a typical data science stack: `numpy`, `pandas`, `sklearn`, and `matplotlib`, plus `seaborn` and `plotly` for plotting and `lightgbm` for modeling.

In [None]:
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler # ver 19

import janestreet
import warnings
warnings.filterwarnings(action='ignore')

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gc

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import matplotlib.patches as patches
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.tools as tls
from plotly.subplots import make_subplots

import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
from IPython.display import HTML, Image
# Use the full option name: the bare 'max_columns' pattern is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 150)

import os, random, math, psutil, pickle

# <a id='3'>3. Read in Data </a>
<a href='#1'>Top</a>

In [None]:
print(os.listdir("../input/jane-street-market-prediction/"))


In [None]:
%%time
root = '../input/jane-street-market-prediction/'

train = pd.read_csv(root+'train.csv')
features = pd.read_csv(root+'features.csv')
example_test = pd.read_csv(root+'example_test.csv')
sample_prediction_df = pd.read_csv(root+'example_sample_submission.csv')

# <a id='4'>4. Glimpse of Data</a>
<a href='#1'>Top</a>

In [None]:
print('Size of train data', train.shape)
print('Size of features data', features.shape)
print('Size of example_test data', example_test.shape)

# <a id='5'>5. Reducing Memory Size</a>
<a href='#1'>Top</a>


<p><font size="3" color="blue" style="Comic Sans MS;">
After running this code, carefully check the output for each column: downcasting (especially to `float16`) can lose precision.
</font></p>

In [None]:
## Function to reduce the DF size
def reduce_mem_usage(df, verbose=True):
    """Downcast each numeric column to the smallest dtype that covers its value range."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2  # size in MB before downcasting
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Integers: pick the narrowest type whose (inclusive) range holds min and max
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # otherwise the column stays int64
            else:
                # Floats: float16 saves the most memory but keeps only about 3 significant digits
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                # otherwise the column keeps its current float type
    end_mem = df.memory_usage().sum() / 1024**2  # size in MB after downcasting
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df


<p><font size="3" color="blue" style="Comic Sans MS;">
Reducing memory
</font></p>

In [None]:
train = reduce_mem_usage(train)
features = reduce_mem_usage(features)
example_test = reduce_mem_usage(example_test)



<html>
<body>
<p><font size="5" color="Red">🔓 MEMORY USAGE AFTER COMPLETION:</font></p>
<p>Mem. usage decreased to  : <b> 631.49 Mb (74.9% reduction)</b></p>
<p>Mem. usage decreased to  : <b>  0.00 Mb (0.0% reduction)</b></p>
<p>Mem. usage decreased to  : <b> 3.83 Mb (75.2% reduction)</b></p>

</body>
</html>
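
The memory savings come at the cost of precision, which is why the output should be checked per column. As a minimal sketch (the values below are made up and are not taken from `train.csv`), this shows how much a `float64` value can shift when stored as `float16`:

```python
import numpy as np
import pandas as pd

# Made-up values chosen to sit inside the float16 range
original = pd.Series([0.1234567, 1234.567, 40000.25], dtype="float64")
downcast = original.astype(np.float16)

# Cast back to float64 so the error is computed at full precision
after = downcast.astype("float64")
abs_error = (original - after).abs()
report = pd.DataFrame({"original": original, "after_float16": after, "abs_error": abs_error})
print(report)

# Largest finite float16 value and its approximate number of decimal digits
print(np.finfo(np.float16).max, np.finfo(np.float16).precision)
```

If errors of this size matter for a column, casting it to `float32` instead is a reasonable compromise.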





In [None]:
# Binarize each return column: 1 if the weighted return is positive, else 0
# (rows with weight 0 therefore always get 0)
for col in ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']:
    train[col] = ((train[col] * train['weight']) > 0).astype(int)


train data

In [None]:
train.head()

features

In [None]:
features.head()

example_test

In [None]:
example_test.head()

# <a id='6'>6. Exploratory Data Analysis</a>
<a href='#1'>Top</a>


<p><font size="3" color="blue" style="Comic Sans MS;">
Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. 
</font></p>

## <a id='6-2'>6.2 Examine Missing Values</a>


<p><font size="3" color="blue" style="Comic Sans MS;">
Next we can look at the number and percentage of missing values in each column. 

</font></p>



### checking missing data for train

In [None]:
total = train.isnull().sum().sort_values(ascending = False)
percent = (train.isnull().sum()/train.isnull().count()*100).sort_values(ascending = False)
missing__train_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing__train_data.head(10)

### checking missing data for features

In [None]:
total = features.isnull().sum().sort_values(ascending = False)
percent = (features.isnull().sum()/features.isnull().count()*100).sort_values(ascending = False)
missing_features_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_features_data.head(10)
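
Since the same count-and-percentage computation is repeated for each DataFrame, it can be wrapped in a small helper. The sketch below is one possible way to do that; `missing_values_table` and the toy frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def missing_values_table(df):
    """Return count and percentage of missing values per column, largest first."""
    total = df.isnull().sum()
    percent = 100 * total / len(df)
    table = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    # Keep only the columns that actually have gaps
    return table[table["Total"] > 0].sort_values("Total", ascending=False)

# Toy frame (made up) standing in for train or features
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 2.0, 3.0, 4.0],
})
table = missing_values_table(toy)
print(table)
```

In the notebook this would be called as `missing_values_table(train).head(10)`.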

## <a id='6-3'>6.3 Column Types</a>


Let's look at the number of columns of each data type. Integer and floating-point columns (`int64` and `float64` by default, or narrower types such as `int8` and `float16` after the memory reduction above) are numeric variables ([which can be either discrete or continuous](https://stats.stackexchange.com/questions/206/what-is-the-difference-between-discrete-data-and-continuous-data)). `object` columns contain strings and are [categorical features](http://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/what-are-categorical-discrete-and-continuous-variables/).

In [None]:
# Number of each type of column
train.dtypes.value_counts()

In [None]:
# Number of each type of column
features.dtypes.value_counts()
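
Beyond counting dtypes, it is often handy to pull out the column names of one kind. A minimal sketch on a made-up frame (the column names and dtypes here are illustrative assumptions, not the real schema):

```python
import pandas as pd

# Toy frame (made up) with a mix of downcast numeric and boolean columns
toy = pd.DataFrame({
    "date": pd.Series([0, 1, 2], dtype="int16"),
    "weight": pd.Series([0.0, 1.5, 2.0], dtype="float32"),
    "resp": pd.Series([0.1, -0.2, 0.3], dtype="float32"),
    "tag": [True, False, True],
})

# How many columns there are of each dtype
counts = toy.dtypes.value_counts()
print(counts)

# Column names grouped by kind; boolean columns are not counted as numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
bool_cols = toy.select_dtypes(include="bool").columns.tolist()
print(numeric_cols, bool_cols)
```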

# <a id='7'>7. Plotting</a>
<a href='#1'>Top</a>

In [None]:
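# Plot the six columns at positions 1 to 6 of train, one histogram each
plot_cols = train.columns[1:7]
traces = [
    # x holds the distinct values of each column, not the raw rows
    go.Histogram(
        x = train[col].value_counts().index, 
        name=col
    ) for col in plot_cols
]

# One subplot title per plotted column
fig = make_subplots(rows=2, cols=3, subplot_titles=list(plot_cols))
for i, trace in enumerate(traces):
    # add_trace replaces the deprecated append_trace
    fig.add_trace(
        trace, 
        row=(i // 3) + 1, 
        col=(i % 3) + 1
    )

fig.update_layout(
    template="plotly_white"
)

fig.show()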

# <a id='8'>8. Modeling</a>
<a href='#1'>Top</a>

In [None]:
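# Note: this rebinds `features` (until now the features.csv DataFrame) to a list of column names
features = ['feature_{}'.format(i) for i in range(0,130)]
resp_cols = ['resp_1', 'resp_2', 'resp_3', 'resp']
# Keep only the rows with date > 85
train_df = train[train['date']>85]
train_data = train_df[features]
# One binary target column per entry in resp_cols
train_target = np.stack([(train_df[c] > 0).astype('int') for c in resp_cols]).T

# Fit the scaler on the training features; the same fitted scaler must be applied at prediction time
rb = RobustScaler().fit(train_data)
train_data = pd.DataFrame(rb.transform(train_data), columns=train_data.columns)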

In [None]:
lgb_models = []
lgb_params = {
    'n_jobs':-1,
    'num_leaves':400,
    'learning_rate':0.15,
    'n_estimators':1000,
    'objective':'binary',
    'subsample':0.75,
    'colsample_bytree':0.5,
    'metric':'auc',
    'max_bin':500
}

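# Train one classifier per target column, each on its own stratified 80/20 split
for i in range(train_target.shape[1]):
    x_tr,x_val,y_tr,y_val = train_test_split(train_data ,train_target[:,i],test_size=0.2, stratify=train_target[:,i], random_state=i)
    lgb_clf = LGBMClassifier(**lgb_params)
    # Note: LightGBM 4.0+ removed `early_stopping_rounds` and `verbose` from fit(); use callbacks there instead
    lgb_clf.fit(x_tr, y_tr, eval_set=[(x_tr, y_tr),(x_val,y_val)], eval_metric='auc', early_stopping_rounds=100, verbose=50)
    lgb_models.append(lgb_clf)
# Mean AUC on the held-out 20% of each split (a single hold-out per target, not k-fold CV)
print('Average validation AUC:',np.mean([model.best_score_['valid_1']['auc'] for model in lgb_models]))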

### Submission

In [None]:
th = 0.5
env = janestreet.make_env()

In [None]:
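for (test_df, pred_df) in env.iter_test():
    if test_df['weight'].item() > 0:
        # Apply the same RobustScaler that was fitted on the training features
        x_tt = pd.DataFrame(rb.transform(test_df[features]), columns=features)
        # Median of the per-model probabilities, taken across models
        pred = np.median([model.predict_proba(x_tt)[:,1] for model in lgb_models], axis=0)
        pred_df.action = np.where(pred >= th, 1, 0).astype(int)
    else:
        # Zero-weight rows do not contribute to the score, so skip the trade
        pred_df.action = 0
    env.predict(pred_df)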
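
The ensembling rule used in the loop (median of the per-model probabilities, then a cut at `th`) can be illustrated in isolation. This is a minimal sketch on made-up probabilities for several rows at once, where `axis=0` takes the median across models; the numbers are not model outputs.

```python
import numpy as np

th = 0.5  # same threshold as in the notebook

# Made-up probabilities: 4 models (rows) x 3 trading opportunities (columns)
probs = np.array([
    [0.62, 0.40, 0.51],
    [0.58, 0.45, 0.44],
    [0.70, 0.55, 0.47],
    [0.30, 0.35, 0.52],
])

# Median across models gives one probability per opportunity
median_prob = np.median(probs, axis=0)

# Trade (1) when the median probability reaches the threshold, otherwise pass (0)
action = np.where(median_prob >= th, 1, 0).astype(int)
print(median_prob, action)
```

The median is less sensitive than the mean to a single model that disagrees strongly, as the fourth model does for the first opportunity here.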

# End Notebook