# Content
1. [Introduction](#1)
2. [Dataset Description](#2)
3. [Feature Descriptions](#3)
4. [Exploratory Data Analysis](#5)
    * [Features feature_{0...129}](#61) 


<a id="1"></a>

<a id="3"></a>
<a id="4"></a>
<a id="5"></a>
<a id="61"></a>
# 1-Introduction

## “Buy low, sell high.” It sounds so easy….

In reality, trading for profit has always been a difficult problem to solve, even more so in today’s fast-moving and complex financial markets. Electronic trading allows for thousands of transactions to occur within a fraction of a second, resulting in nearly unlimited opportunities to potentially find and take advantage of price differences in real time.

<a id="2"></a>
# 2-Dataset Descriptions

* train.csv - the training set, contains historical data and returns. 

* example_test.csv - a mock test set which represents the structure of the unseen test set. You will not be directly using the test set or sample submission in this competition, as the time-series API will get/set the test set and predictions.
* example_sample_submission.csv - a mock sample submission file in the correct format


### features.csv

This file contains metadata pertaining to the anonymized features. The purpose of features.csv is to show the relationships between the anonymized features: tag_0 through tag_28 are anonymized shared components/concepts used in feature derivation. For example, if the value for (feature_i, tag_j) is True, then tag_j is used to derive feature_i.

Let's assume we only have the following three features, defined as:

* feature_0: volatility of this stock in the past 30 days
* feature_1: volume of this stock in the past 30 days
* feature_2: volume of this stock in the past 10 days

After anonymization, the information about how the features are connected is lost. To recover some of it, we create two tags with the following definitions:

* tag_0: some metric on the past 30 days
* tag_1: volume of this stock

Then features.csv should look something like:

| | tag_0 | tag_1 |
|---|---|---|
| feature_0 | True | False |
| feature_1 | True | True |
| feature_2 | False | True |
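
The three-feature example above can be sketched in pandas. This is only an illustration on toy data: the column names `feature`, `tag_0` and `tag_1` are assumptions made for this sketch, and the real file has 29 tag columns.

```python
import pandas as pd

# Toy stand-in for features.csv, covering only the three hypothetical features.
tags = pd.DataFrame(
    {
        "feature": ["feature_0", "feature_1", "feature_2"],
        "tag_0": [True, True, False],   # some metric on the past 30 days
        "tag_1": [False, True, True],   # volume of this stock
    }
).set_index("feature")

# For each tag, list the features derived from it.
features_by_tag = {
    tag: tags.index[tags[tag].to_numpy()].tolist() for tag in tags.columns
}

# Count how many tags each pair of features has in common.
as_int = tags.astype(int)
shared_tags = as_int @ as_int.T

print(features_by_tag)
print(shared_tags)
```

Features that share many tags are likely to be built from the same underlying quantity, which is one way to group the anonymized columns before modelling.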

<a id="3"></a>
# 3-Feature Descriptions

### **feature_{0...129}**

 -- These anonymized columns represent real stock market data. Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it.
 
 
 ### weight & resp
 
-- Each trade has an associated weight and resp, which together represent a return on the trade. resp is how much we would gain from the trade (it can be negative), and it is multiplied by the trade's weight. The columns resp_1 through resp_4 are returns over different time horizons. They exist only in train.csv and are correlated with resp but not identical to it (see the data description). They are provided in case you want alternative objectives to regularize model training.


 ### date

 -- The date column is an integer which represents the day of the trade.
 
 ### ts_id
 -- ts_id represents a time ordering. In addition to the anonymized feature values, you are provided with metadata about the features in features.csv.
 
 
 ### Action: 
 
 
 We're not trying to predict action as such; we are deciding whether to perform the trade (action = 1) or not (action = 0).

The simplest way to do that is probably to predict the return of the trade (resp) and perform the trade if the prediction is positive (action = resp > 0). You could take the weight into account as well by multiplying the return by the weight.

Also, we're not predicting which trades will take place; we're picking the trades we believe will give the highest score. The score is not just total return: it also takes volatility into account.

However, resp (the returns) is not available in the test set. So we have to find some way of predicting which trades are "good trades", either directly or by predicting resp (and perhaps even its probability distribution), and then optimise for the utility score.
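
The thresholding rule described above can be sketched on a few made-up trades. This is a minimal illustration, not the competition's scoring code: `pred_resp` is a hypothetical model output, the numbers are invented, and the volatility adjustment in the utility score is left out.

```python
import pandas as pd

# Toy trades; `pred_resp` is a hypothetical prediction, not a real column.
trades = pd.DataFrame(
    {
        "date": [0, 0, 0, 1, 1],
        "weight": [1.0, 0.0, 2.0, 0.5, 1.5],
        "resp": [0.02, -0.01, -0.03, 0.04, 0.01],
        "pred_resp": [0.01, 0.02, -0.02, 0.03, -0.01],
    }
)

# Take the trade only when the predicted return is positive.
trades["action"] = (trades["pred_resp"] > 0).astype(int)

# Realised weighted return of each trade that was actually taken.
trades["pnl"] = trades["weight"] * trades["resp"] * trades["action"]

# Total weighted return per day.
daily_pnl = trades.groupby("date")["pnl"].sum()
print(daily_pnl)
```

The second trade shows why weight matters: its prediction is wrong, but with weight 0 it costs nothing.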


<a id="4"></a>
# Exploratory Data Analysis

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Import the data 

* To speed up loading the data, I have used the datatable library and then converted the result into a pandas DataFrame.
* It can read large datasets fairly quickly and is often faster than pandas. It is meant specifically for processing tabular datasets, with an emphasis on speed and support for large data.


In [None]:
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl

import datatable as dt

train = dt.fread("../input/jane-street-market-prediction/train.csv").to_pandas() #, max_nrows=30000000
features = pd.read_csv("../input/jane-street-market-prediction/features.csv")
example_test = pd.read_csv("../input/jane-street-market-prediction/example_test.csv", nrows=10**3 )

In [None]:
print(f'train shape: {train.shape}')
print(f'features shape: {features.shape}')

In [None]:
train.sample(3)

In [None]:
features.sample(3)

In [None]:
example_test.sample(3)

In [None]:
train.describe()

# The columns and number of null values in the train set

In [None]:
print("No. of columns containing null values")
print(len(train.columns[train.isna().any()]))

print("No. of columns not containing null values")
print(len(train.columns[train.notna().all()]))

print("The columns containing null values")
print(train.columns[train.isna().any()])

# The columns and number of null values in the features set

In [None]:
print("No. of columns containing null values")
print(len(features.columns[features.isna().any()]))

print("No. of columns not containing null values")
print(len(features.columns[features.notna().all()]))
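
Counting the affected columns does not show how much data is missing in each one. A minimal sketch of a per-column missing fraction follows, run on a small synthetic frame (`demo` and its column names are made up for illustration); the same two lines should apply to `train`.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for `train`.
demo = pd.DataFrame(
    {
        "feature_a": [1.0, np.nan, 3.0, np.nan],
        "feature_b": [1.0, 2.0, 3.0, 4.0],
        "feature_c": [np.nan, 2.0, 3.0, 4.0],
    }
)

# Fraction of missing values per column, largest first.
missing_fraction = demo.isna().mean().sort_values(ascending=False)

# Keep only the columns that actually have missing values.
missing_fraction = missing_fraction[missing_fraction > 0]
print(missing_fraction)
```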

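# Correlation coefficients
A correlation coefficient measures how strong the relationship between two variables is. `DataFrame.corr()` computes the Pearson coefficient by default, which ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship).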

In [None]:
plt.figure(figsize=(18,18))
data_corr = train.corr()
sns.heatmap(data_corr, cmap='coolwarm')

In [None]:
# Styler.set_precision was removed in newer pandas; a format string works across versions
data_corr.style.background_gradient(cmap='coolwarm', axis=None).format("{:.2f}")
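
A heatmap of roughly 130 columns is hard to read cell by cell, so it can help to list the strongest pairs explicitly. The sketch below does this on synthetic data (the frame `demo` and its columns `a`, `b`, `c` are invented for illustration); replacing `demo.corr()` with `data_corr` should give the same ranking for the real features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame(
    {
        "a": x,
        "b": x + 0.1 * rng.normal(size=500),  # nearly a copy of "a"
        "c": rng.normal(size=500),            # independent noise
    }
)

corr = demo.corr()  # Pearson correlation by default

# Keep only the upper triangle (k=1 excludes the diagonal) so that
# each pair appears once and self-correlations are dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute correlation, strongest first.
pairs = pairs.iloc[np.argsort(-pairs.abs().to_numpy())]
print(pairs)
```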


# Features feature_{0...129}


# Distribution of features
There are 500 days of data in train.csv. Let us take a look at date = 0 (the first day).

In [None]:
date = 0
n_features = 130

# feature_0 is left out here; its cumulative value is plotted separately further down
cols = [f'feature_{i}' for i in range(1, n_features)]
hist = px.histogram(
    train[train["date"] == date], 
    x=cols, 
    animation_frame='variable', 
    range_y=[0, 700], 
    range_x=[-8, 8]
)

hist.show()

In [None]:
date = 0
n_features = 130
cols = [f'feature_{i}' for i in range(1, n_features)]
# With no y given, each feature value is plotted against its row index
scatter = px.scatter(
    train[train["date"] == date], 
    x=cols, 
    animation_frame='variable', 
    range_y=[-500, 6500], 
    range_x=[-10, 10]
)
scatter.layout.showlegend = False
scatter.show()

* There are a total of 500 days of data in train.csv (i.e. about two years of trading data). Let us take a look at the cumulative value of feature_0 over time.

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
feature_0 = train['feature_0'].cumsum()
feature_0.plot(ax=ax, lw=3)
ax.set_xlabel("Trade", fontsize=18)
ax.set_ylabel("feature_0 (cumulative)", fontsize=18);

### Now look at the return (resp) over time. There are a total of 500 days of data.

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
balance = train['resp'].cumsum()
balance.plot(ax=ax, lw=3)
ax.set_xlabel("Trade", fontsize=18)
ax.set_ylabel("Cumulative return", fontsize=18);

Now, let's look at the cumulative return for resp and for each of the other resp columns, which cover different time horizons.

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
# resp plus the four other horizons; each column name becomes its legend label
for col in ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']:
    train[col].cumsum().plot(ax=ax, lw=3)
ax.set_xlabel("Trade", fontsize=18)
ax.set_title("Cumulative return of resp and time horizons 1, 2, 3, and 4 (500 days)", fontsize=18)
ax.legend(loc="upper left");

Insight from the plot: 
"The longer the Time Horizon, the more aggressive, or riskier portfolio, an investor can build. The shorter the Time Horizon, the more conservative, or less risky, the investor may want to adopt."

#### What is an Investment Time Horizon?
An Investment Time Horizon, or just Time Horizon, is the period of time one expects to hold an investment until they need the money back. Time horizons are largely dictated by investment goals and strategies. For example, saving for a down payment on a house, maybe 2 years, would be considered a short-term time horizon, while saving for college a medium-term time horizon, and investing for retirement a long-term time horizon.