# Jane Street Market Prediction, Reading Dataset

### In this notebook we use Dask to read a large dataset quickly by parallelizing the work across CPU cores. To learn more, read the following:<br>
#### 1) Understanding the Dask framework [here](https://towardsdatascience.com/dask-a-guide-to-process-large-datasets-using-parallelization-c5554889abdb)<br>2) Official docs [here](https://docs.dask.org/en/latest/).

![Dask](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/07/dask-feat-768x432.jpg)

In [None]:
# !pip install "dask[complete]"
# !python -m pip install dask distributed --upgrade

import os
import dask.dataframe as dd
from dask.distributed import Client, progress

import matplotlib.pyplot as plt

import pyarrow

import warnings
warnings.filterwarnings("ignore")

In [None]:
# reading the paths of all the files present in the dataset
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Reading data

In [None]:
# setting the paths to variables to access when required
TRAINING_PATH = "/kaggle/input/jane-street-market-prediction/train.csv"
FEATURES_PATH = "/kaggle/input/jane-street-market-prediction/features.csv"
TEST_PATH = "/kaggle/input/jane-street-market-prediction/example_test.csv"
SAMPLE_SUB_PATH = "/kaggle/input/jane-street-market-prediction/example_sample_submission.csv"

In [None]:
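%%time
# dd.read_csv is lazy: it samples each file to infer dtypes and builds a task graph,
# so this cell times graph construction, not a full load of the data
train_df = dd.read_csv(TRAINING_PATH)
features_df = dd.read_csv(FEATURES_PATH)
test_df = dd.read_csv(TEST_PATH)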

In [None]:
train_df.head()

In [None]:
test_df.head()

## Output data as a Parquet file

In [None]:
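# start a local Dask client; memory_limit applies per worker, so lower it if
# n_workers * memory_limit exceeds the RAM available on your machine
client = Client(n_workers=4, memory_limit='16GB')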
client

In [None]:
train_df.to_parquet('./jane_street_market_output.parquet')

In [None]:
%%time
data = dd.read_parquet('./jane_street_market_output.parquet', engine='pyarrow')

## Usage Example

In [None]:
train_df["date"].value_counts().compute().sort_index()

In [None]:
weights_date = train_df.groupby(train_df.date).weight.mean().compute()
weights_date
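
A grouped mean like the one above parallelizes well because it can be split into per-partition sums and counts that are combined at the end. The plain-Python sketch below illustrates that idea only; it is not Dask's actual implementation, and the partitions and function names are hypothetical.

```python
from collections import defaultdict


def partial_sum_count(rows):
    """Per-partition step: accumulate [sum, count] of weight for each date."""
    acc = defaultdict(lambda: [0.0, 0])
    for date, weight in rows:
        acc[date][0] += weight
        acc[date][1] += 1
    return acc


def combine_means(partials):
    """Final step: merge the partial sums and counts, then divide."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for date, (s, c) in part.items():
            total[date][0] += s
            total[date][1] += c
    return {date: s / c for date, (s, c) in sorted(total.items())}


# Two hypothetical partitions of (date, weight) rows.
partitions = [
    [(0, 1.0), (0, 3.0), (1, 2.0)],
    [(1, 4.0), (2, 0.0), (2, 10.0)],
]
mean_by_date = combine_means(partial_sum_count(p) for p in partitions)
print(mean_by_date)  # {0: 2.0, 1: 3.0, 2: 5.0}
```

Each call to `partial_sum_count` touches one partition only, so the calls are independent and could run on separate workers.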

In [None]:
plt.figure(figsize=(16,12))
plt.bar(weights_date.index, weights_date.values, align='center', alpha=0.8)
plt.xlabel('date')
plt.ylabel('mean weight')

#### Notice how quickly Dask sets up the dataframe: `dd.read_csv` only builds a lazy task graph, and the data is read in parallel when a result is computed :)<br>For a fuller understanding of the data, see the notebook [here](https://www.kaggle.com/blurredmachine/jane-street-market-eda-viz-prediction)

### Hardware I use to work on large data:

<table style="width:100%">
  <tr>
    <th>HP Z8 G4 Tower - 1125W PSU</th>
    <th>HP ZBook Studio - G7 Mobile Workstation</th>
  </tr>
  <tr>
    <td>6234 3.3 GHz (8 Core each) i9 Processors x 2</td>
    <td>6234 3.3 GHz (8 Core) i9 Processor x 1</td>
  </tr>
  <tr>
    <td>NVIDIA Quadro RTX 8000 x 1</td>
    <td>NVIDIA Quadro RTX 5000 x 1</td>
  </tr>
  <tr>
    <td>96GB DDR4 RAM 2933</td>
    <td>32GB DDR4 RAM 2933</td>
  </tr>
  <tr>
    <td>2 TB NVMe M.2 SSD</td>
    <td>2 TB NVMe M.2 SSD</td>
  </tr>
  <tr>
    <td><img src= "https://ssl-product-images.www8-hp.com/digmedialib/prodimg/lowres/c05724976.png?imdensity=1&imwidth=1024" width=200px></td>
    <td><img src="https://www8.hp.com/content/dam/sites/worldwide/personal-computers/commercial/workstations/zbook-studio/images/color-accuracy-image-desktop.png" width=200px></td>
  </tr>
</table>


### Thanks :)