# Ubiquant Market Prediction
Twitch Stream EDA.

1. This notebook was created during a live coding session on Twitch. Follow for past and future broadcasts [here](https://www.twitch.tv/medallionstallion_).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle
import gc

plt.style.use("ggplot")
color_pal = plt.rcParams["axes.prop_cycle"].by_key()["color"]
color_cycle = cycle(plt.rcParams["axes.prop_cycle"].by_key()["color"])

# The Data
Note that the training data is roughly 18.55 GB in size. This is too large to load directly into memory in a Kaggle notebook.

Some things to note when exploring the entire dataset on a local machine:
- There are 3579 unique `investment_id`s
- There are 1211 unique `time_id`s - we are told these are not equally spaced and could be different in the test set.
- The feature columns are mostly normalized, with a mean close to 0 and a standard deviation of ~1.
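
The normalization claim can be checked on whichever feature columns fit in memory. Below is a minimal sketch that uses a small synthetic frame as a stand-in for the real data; the `toy` frame and its values are made up for illustration and are not taken from the competition files.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few feature columns (assumption: not the real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.standard_normal((10_000, 3)), columns=["f_1", "f_2", "f_3"])

# Per-column mean and standard deviation.
# Values near 0 and 1 suggest the columns are already normalized.
summary = toy.agg(["mean", "std"]).T
print(summary)
```

On the real data, the same `agg` call could be run on `train[['f_1', 'f_2', 'f_3']]` once it is loaded.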


In [None]:
!ls -GFlash ../input/ubiquant-market-prediction/

# Reading the Parquet Version
Reading CSVs can be slow. Instead, read from the Parquet version here:
- https://www.kaggle.com/robikscube/ubiquant-parquet

In [None]:
train = pd.read_parquet('../input/ubiquant-parquet/train.parquet',
               columns=['time_id','investment_id','target','f_1','f_2','f_3'])
test = pd.read_parquet('../input/ubiquant-parquet/example_test.parquet')
ss = pd.read_parquet('../input/ubiquant-parquet/example_sample_submission.parquet')

In [None]:
unique_time_ids = train['time_id'].nunique()
unique_inv_ids = train['investment_id'].nunique()

print(f'There are {unique_inv_ids} unique investment ids and {unique_time_ids} unique time ids')

## Read in a single investment_id

In [None]:
example = pd.read_parquet('../input/ubiquant-parquet/investment_ids/1.parquet')
example.head()

# Train Data Fields

tl;dr - we have time series data but don't know the exact time periods being provided. We also have investment_ids that are not unique: each one appears at many time_ids. Everything is anonymized, so it's not easy to create features.

- `row_id` - A unique identifier for the row.
- `time_id` - The ID code for the time the data was gathered. The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.
- `investment_id` - The ID code for an investment. Not all investments have data in all time IDs.
- `target` - The target.
- `features` - [f_0:f_299] - Anonymized features generated from market data.
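
Since not every investment has data at every time ID, it can help to count coverage in both directions. Here is a minimal sketch on a toy frame; the values are made up, and only the column names `time_id` and `investment_id` come from the data description.

```python
import pandas as pd

# Toy data (assumption): investment 2 has no row at time_id 1.
toy = pd.DataFrame({
    "time_id":       [0, 0, 0, 1, 1, 2, 2, 2],
    "investment_id": [0, 1, 2, 0, 1, 0, 1, 2],
})

# Number of distinct investments observed at each time_id.
per_time = toy.groupby("time_id")["investment_id"].nunique()

# Number of distinct time_ids in which each investment appears.
per_inv = toy.groupby("investment_id")["time_id"].nunique()

print(per_time.to_dict())
print(per_inv.to_dict())
```

The same two `groupby` calls should work on the `train` frame loaded earlier, since both columns are included there.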

# Example Target for some investment ids

In [None]:
for investment_id in range(5):
    d = train.query('investment_id == @investment_id')
    d.set_index('time_id')['target'] \
        .plot(figsize=(15, 5),
              title=f'Investment_id {investment_id}',
              color=next(color_cycle),
              style='.-')
    plt.show()

# Example of Features for a Single Investment ID
- We are only looking at 3 of the features.

In [None]:
example_id = train.query('investment_id == 529')
sns.pairplot(example_id,
             vars=['f_1','f_2','f_3','target'],
             hue='time_id')

# Make Some Dummy Predictions

In [None]:
# Take the last 50 known targets for each investment_id and use their mean as the prediction
inv_pred_dict = train.groupby('investment_id') \
    .tail(50).groupby('investment_id')['target'].mean().to_dict()
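
To see what the `groupby` / `tail` / `mean` chain does, here is a minimal sketch on toy data. The frame, its values, and the `n_last` name are made up for illustration. The explicit `sort_values` is an added assumption: the chain above relies on rows already being ordered by `time_id` within each investment.

```python
import pandas as pd

# Toy data (assumption): two investments with a handful of targets each.
toy = pd.DataFrame({
    "time_id":       [0, 1, 2, 3, 0, 1],
    "investment_id": [7, 7, 7, 7, 9, 9],
    "target":        [1.0, 2.0, 3.0, 4.0, 10.0, 20.0],
})

n_last = 2  # the notebook uses 50; a small value keeps the toy example readable

pred = (
    toy.sort_values("time_id")           # make "last" mean "latest time_id"
    .groupby("investment_id")
    .tail(n_last)                        # keep the last n_last rows per investment
    .groupby("investment_id")["target"]
    .mean()                              # average those rows per investment
    .to_dict()
)
print(pred)
```

Investments with fewer than `n_last` rows simply contribute all the rows they have.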

# How to Make Predictions

In [None]:
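import ubiquant

env = ubiquant.make_env()    # initialize the competition environment
iter_test = env.iter_test()  # iterator that yields the test set in batches
for (test_df, spdf) in iter_test:
    # Look up the per-investment mean computed above and use it as the prediction
    spdf['target'] = test_df['investment_id'].map(inv_pred_dict)
    env.predict(spdf)  # submit the predictions for this batch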
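
One thing to watch for with `.map`: an `investment_id` that is not a key in the dictionary maps to `NaN`. Here is a minimal sketch on toy data. The dictionary values, the test frame, and the fallback of `0.0` are all assumptions made for illustration, not part of the competition API.

```python
import pandas as pd

# Toy per-investment predictions (assumption).
inv_pred_dict = {7: 3.5, 9: 15.0}

# Toy test batch; investment 42 never appeared in the toy training data.
test_df = pd.DataFrame({"investment_id": [7, 9, 42]})

# Unseen ids come back as NaN from map.
raw = test_df["investment_id"].map(inv_pred_dict)

# Fall back to a constant for unseen ids (the value 0.0 is an assumption).
filled = raw.fillna(0.0)
print(filled.tolist())
```

Another possible fallback is the overall mean of the training target.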