# Ubiquant Market Prediction
Twitch Stream EDA.

1. This notebook was created during a live coding session on Twitch. Follow for past and future broadcasts [here](https://www.twitch.tv/medallionstallion_).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle
import gc

plt.style.use("ggplot")
color_pal = plt.rcParams["axes.prop_cycle"].by_key()["color"]
color_cycle = cycle(plt.rcParams["axes.prop_cycle"].by_key()["color"])

# The Data
Note that the training data is roughly 18.55 GB in size. This is too large to load into memory directly in a Kaggle notebook.

Some things to note when exploring the entire dataset on a local machine:
- There are 3579 unique `investment_id`s
- There are 1211 unique `time_id`s - we are told these are not equally spaced and could be different in the test set.
- The feature columns are mostly normalized, with a mean close to 0 and a standard deviation of ~1.
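
The normalization noted in the last bullet can be checked with a quick per-column summary. The sketch below runs on a small synthetic frame standing in for the real data; the synthetic values and the helper name `summarize_features` are assumptions for illustration. On a local machine you would pass the real training frame instead.

```python
import numpy as np
import pandas as pd


def summarize_features(df, feature_cols):
    """Return the mean and standard deviation of each feature column."""
    return df[feature_cols].agg(["mean", "std"]).T


# Synthetic stand-in for the training data (NOT the real competition data).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.standard_normal((10_000, 3)), columns=["f_1", "f_2", "f_3"])

summary = summarize_features(toy, ["f_1", "f_2", "f_3"])
print(summary)
```

Columns whose mean is far from 0 or whose std is far from 1 would be the exceptions hinted at by "mostly".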


In [None]:
!ls -GFlash ../input/ubiquant-market-prediction/

# Reading the Parquet Version
Reading CSVs can be slow. Instead, read from the parquet version here:
- https://www.kaggle.com/robikscube/ubiquant-parquet

In [None]:
train = pd.read_parquet('../input/ubiquant-parquet/train.parquet',
               columns=['time_id','investment_id','target','f_1','f_2','f_3'])
test = pd.read_parquet('../input/ubiquant-parquet/example_test.parquet')
ss = pd.read_parquet('../input/ubiquant-parquet/example_sample_submission.parquet')

In [None]:
'''Getting an idea of how many observations, assets and time steps'''

obs = train.shape[0]
print(f"Number of observations: {obs}")

In [None]:
unique_time_ids = train['time_id'].nunique()
unique_inv_ids = train['investment_id'].nunique()

print(f'There are {unique_inv_ids} unique investment ids and {unique_time_ids} unique time ids')

print(f"Number of assets: {unique_inv_ids} (range from {train.investment_id.min()} to {train.investment_id.max()})")

## Read in a single investment_id

In [None]:
example = pd.read_parquet('../input/ubiquant-parquet/investment_ids/1.parquet')
example.head()

# Train Data Fields

tl;dr - we have time series data but don't know the exact time periods being provided. Each investment_id appears in many rows (one per time_id), so it is not unique per row. Everything is anonymized, so it's not easy to create features.

- `row_id` - A unique identifier for the row.
- `time_id` - The ID code for the time the data was gathered. The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.
- `investment_id` - The ID code for an investment. Not all investments have data in all time IDs.
- `target` - The target.
- `features` - [f_0:f_299] - Anonymized features generated from market data.

# Example of Features for a Single Investment ID
- We are only looking at 3 of the features.

In [None]:
example_id = train.query('investment_id == 529')
sns.pairplot(example_id,
             vars=['f_1','f_2','f_3','target'],
            hue='time_id')

# Target analysis

In [None]:
'''The target: investment return rate (IRR)'''
plt.figure(figsize = (12,5))
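# distplot is deprecated in recent seaborn; histplot with kde=True is the replacement
ax = sns.histplot(train['target'], bins=1000, kde=True, stat="density")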
plt.xlim(-3,3)
plt.xlabel("Histogram of the IRR values", size=18)
plt.show();
gc.collect();

Within the plotted range (-3 to 3), the target values look roughly normal, with no obvious outliers or long tails. We should not have any problems working with it.


Assuming a uniform investment (all investments have the same weight), the overall investment is at a loss.
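
One way to check this claim is to average the target across all investments at each `time_id` (equal weights) and accumulate the result over time, treating the target as an additive return as the cumulative-sum plots in this notebook do. This is only a sketch: the toy frame is made up, and `equal_weight_cumulative` is a hypothetical helper name.

```python
import pandas as pd


def equal_weight_cumulative(df):
    """Cumulative sum of the equal-weighted mean target per time_id."""
    mean_by_time = df.groupby("time_id")["target"].mean().sort_index()
    return mean_by_time.cumsum()


# Toy data (made up): two investments over three time steps.
toy = pd.DataFrame({
    "time_id": [0, 0, 1, 1, 2, 2],
    "investment_id": [1, 2, 1, 2, 1, 2],
    "target": [0.5, -0.5, -1.0, 0.0, 0.25, 0.25],
})

cumulative = equal_weight_cumulative(toy)
print(cumulative)
```

A negative final value of `cumulative` on the real `train` frame would support the statement above.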

In [None]:
plt.figure(figsize=(20,20))

for i in range(5):
    plt.subplot(5,1,i+1)
    cumReturn = train.loc[train['investment_id']==i,'target'].cumsum()
    time_id = train.loc[train['investment_id']==i,'time_id']
    plt.plot(time_id, cumReturn, color='green', lw=2);
    plt.ylabel (f'investment_id {i}', fontsize=18);
    plt.title(f'investment_id {i}  time dependency', size=18)

plt.xlabel ('Time_id', fontsize=18)

del cumReturn, time_id
gc.collect();

In [None]:
for investment_id in range(5):
    d = train.query('investment_id == @investment_id')
    d.set_index('time_id')['target'] \
        .plot(figsize=(15, 5),
              title=f'Investment_id {investment_id}',
              color=next(color_cycle),
              style='.-')
    plt.show()

In [None]:
selection = train.groupby("investment_id").time_id.max()
outlier_inv_ids = selection[selection != 1219].index.values

plt.figure(figsize=(20,5))
for n in range(10):
    plt.plot(train[train.investment_id == outlier_inv_ids[n]].time_id,
               train[train.investment_id == outlier_inv_ids[n]].target.cumsum(), '.')
    plt.xlim([0,1220])
    plt.title("Return/target cumsum for outlier investments")
    plt.xlabel("time_id")
    plt.ylabel("cumsum return");

Notice that some of the data is missing: the long, smooth straight segments in the line plots connect points across stretches with no observations.

- We can clearly see that some investments miss parts of their timeseries or end earlier.
- Looking back into the competition description, we find: "The ID code for an investment. Not all investment have data in all time IDs."
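
To put a number on these gaps, one option is to compare how many `time_id`s each investment actually has with how many lie between its first and last observation. The sketch below assumes `time_id`s are consecutive integers; the helper name `gaps_per_investment` and the toy data are illustrative assumptions.

```python
import pandas as pd


def gaps_per_investment(df):
    """Count time_ids missing between each investment's first and last observation.

    Assumes time_ids are consecutive integers, so the expected number of
    observations for an investment is (last - first + 1).
    """
    stats = df.groupby("investment_id")["time_id"].agg(["min", "max", "nunique"])
    expected = stats["max"] - stats["min"] + 1
    return (expected - stats["nunique"]).rename("missing_time_ids")


# Toy data (made up): investment 1 is complete, investment 2 has holes.
toy = pd.DataFrame({
    "investment_id": [1, 1, 1, 2, 2, 2],
    "time_id": [0, 1, 2, 0, 3, 5],
})

gaps = gaps_per_investment(toy)
print(gaps)
```

If some `time_id`s are absent from the whole dataset, this count slightly overstates the per-investment gaps.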

In [None]:
time_ids = train.time_id.unique()
print('timestamps in our data', time_ids)
print('the total number of timestamps in our data = ', len(time_ids))

# Build a set once so each membership test is O(1)
present = set(time_ids)
missing_time_ids = [t for t in range(1220) if t not in present]

print('Missing time_ids: ', missing_time_ids)

time_id: The ID code for the time the data was gathered. The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.

Yes, the IDs are in order from 0 to 1219. With 1211 unique time_ids out of 1220 possible values, that leaves 9 missing time_ids.

For example, one time_id might correspond to 1st Jan 2:00 IST, the next to 4th Jan 12:00 IST, the one after to 5th Jan 16:00 IST, and so on.

Clearly the number of data points (rows) in each time_id is not constant.

The time_ids listed above are not present. I don't think this should be an issue, since we don't have a constant gap between consecutive time_ids anyway.

In [None]:
obs_by_asset = train.groupby(['investment_id'])['target'].count()

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
obs_by_asset.plot.hist(bins=60)
plt.title("observations by asset distribution")
plt.show()

Observations are not evenly distributed across assets: some assets are observed much more frequently than others. A good CV and modelling strategy should take this into account (stratify if you are working with subsamples).
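
A minimal sketch of such a stratified subsample, assuming the goal is to keep the same fraction of rows for every asset. The helper name `stratified_subsample` and the toy frame are illustrative; `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd


def stratified_subsample(df, frac, seed=0):
    """Sample the same fraction of rows from every investment_id."""
    return df.groupby("investment_id").sample(frac=frac, random_state=seed)


# Toy data (made up): one frequently observed asset and one rarely observed asset.
toy = pd.DataFrame({
    "investment_id": [1] * 10 + [2] * 4,
    "target": range(14),
})

sub = stratified_subsample(toy, frac=0.5)
counts = sub["investment_id"].value_counts()
print(counts.sort_index())
```

Compared with a plain `df.sample(frac=...)`, this keeps rarely observed assets from dropping out of the subsample by chance.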

In [None]:
mean_target = train.groupby(['investment_id'])['target'].mean()
mean_mean_target = np.mean(mean_target)

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
mean_target.plot.hist(bins=60)
plt.title("mean target distribution")
plt.show()

print(f"Mean of mean target: {mean_mean_target: 0.5f}")

The mean target by asset shows a bell-shaped distribution. Beware that there are outliers, though: some assets have a quite negative average target (around -0.4) and some a quite positive one (around +0.8). Overall, the average of the per-asset mean targets is slightly negative (-0.0231).

In [None]:
sts_target = train.groupby(['investment_id'])['target'].std()
mean_std_target = np.mean(sts_target)

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
sts_target.plot.hist(bins=60)
plt.title("standard deviation of target distribution")
plt.show()

print(f"Mean of std target: {mean_std_target: 0.5f}")

The standard deviation (std) of the target by asset also presents some interesting patterns. First of all, the distribution is skewed to the right, with some assets having a higher std (up to 2.5). On the other hand, there are also a few assets with a std of almost zero.

In [None]:
ax = sns.jointplot(x=obs_by_asset, y=mean_target, kind="reg", 
                   height=8, joint_kws={'line_kws':{'color':'blue'}})
ax.ax_joint.set_xlabel('observations')
ax.ax_joint.set_ylabel('mean target')
plt.show()

By jointly plotting the number of observations by asset and the mean target value by asset, we may notice that the mean target decreases slightly as the number of observations increases. The dispersion of values tends to grow with fewer observations, so we re-plot the scatterplot, this time using the standard deviation.

In [None]:
ax = sns.jointplot(x=obs_by_asset.values, y=sts_target, kind="reg", 
                   height=8, joint_kws={'line_kws':{'color':'blue'}})
ax.ax_joint.set_xlabel('observations')
ax.ax_joint.set_ylabel('std target')
plt.show()

The new scatterplot reveals that fewer observations imply much more uncertainty in the mean target.
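
This has a simple statistical reading. If an asset's targets are treated as $n$ independent draws with standard deviation $\sigma$ (an idealizing assumption, since targets within an asset need not be independent), the standard error of that asset's mean target shrinks with the square root of the number of observations:

$$
\mathrm{SE}(\bar{y}) = \frac{\sigma}{\sqrt{n}}
$$

Under that assumption and at equal $\sigma$, the mean target of an asset with 100 observations is about $\sqrt{1000/100} \approx 3.2$ times noisier than that of an asset with 1000 observations.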

Strategy: in training you need to control for this effect by making the number of observations explicit (for example, as a feature), because it is predictive of the uncertainty of the predictions. In the test phase, when you are working with an asset that you don't know about, you need to impute an average number of observations, thus expecting an average dispersion of predictions for that asset.
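
One possible reading of this strategy, sketched on toy data: count each asset's training observations, attach the count as a column, and fall back to the average count for assets that never appeared in training. The column name `n_obs` and the helper `add_observation_count` are hypothetical, not part of the original notebook.

```python
import pandas as pd


def add_observation_count(train_df, test_df):
    """Attach each asset's training observation count as a feature.

    Assets that appear only in the test frame get the average count.
    """
    counts = train_df.groupby("investment_id").size()
    train_out = train_df.assign(n_obs=train_df["investment_id"].map(counts))
    test_out = test_df.assign(
        n_obs=test_df["investment_id"].map(counts).fillna(counts.mean())
    )
    return train_out, test_out


# Toy data (made up): asset 1 has 4 rows, asset 2 has 2 rows.
toy_train = pd.DataFrame({"investment_id": [1, 1, 1, 1, 2, 2]})
toy_test = pd.DataFrame({"investment_id": [1, 2, 99]})  # 99 is unseen in training

train_out, test_out = add_observation_count(toy_train, toy_test)
print(test_out)
```

Here the unseen asset 99 receives the mean of the known counts, which matches the "impute an average number of observations" idea above.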

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
train.groupby('time_id')['investment_id'].nunique().plot()
plt.title("number of unique assets by time")
plt.show()

In [None]:
num_investments_per_time_id = train.groupby("time_id").investment_id.nunique()

plt.figure(figsize=(20,5))
plt.plot(num_investments_per_time_id.index, num_investments_per_time_id.values, 'o')

Having seen that investments with fewer observations seem riskier, we now notice that the number of assets present at each time step varies considerably and oscillates strongly. By the end of the available time, the number of assets has grown by one third. The number of investments per time_id varies especially around time_id 400.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
train.groupby('time_id')['investment_id'].nunique().plot()
plt.title("number of unique assets by time")
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
train.groupby('time_id')['target'].mean().plot()
plt.title("average target by time")
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
train.groupby('time_id')['target'].std().plot()
plt.title("standard deviation of target by time")
plt.show()

In [None]:
r = np.corrcoef(train.groupby('time_id')['investment_id'].nunique(), train.groupby('time_id')['target'].mean())[0][1]
print(f"Correlation of number of assets by target: {r:0.3f}")

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,5))
ax[0].plot(train[train.investment_id==4].target.cumsum())
ax[1].plot(train[train.investment_id==4].f_3.cumsum())

If we plot the number of assets by time alongside the average target by time, it becomes evident that when there are fewer assets, the target oscillates more, with predominantly higher targets. In fact, the correlation between the number of assets and the target is negative.

### Feature interactions
We will do the analysis on a smaller random 1% sample of the dataset to speed up the process.

In [None]:
data_types_dict = {
    'time_id': 'int32',
    'investment_id': 'int16',
    "target": 'float16',
}

features = [f'f_{i}' for i in range(300)]

for f in features:
    data_types_dict[f] = 'float16'
    
target = 'target'

train_df = pd.read_csv('/kaggle/input/ubiquant-market-prediction/train.csv', 
                       usecols = list(data_types_dict.keys()),
                       dtype=data_types_dict, 
                       index_col = 0)

sample_df = train_df.sample(frac = 0.01)
sample_df

In [None]:
correlation = sample_df[[target] + features].corr()

In [None]:
correlation['target'].iloc[1:].hist(bins = 20, figsize = (20,10))

In [None]:
sns.clustermap(correlation, figsize=(20, 20))

There are definitely some clusters of highly correlated features that can be later analyzed together.
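
As a starting point for that later analysis, the strongly correlated pairs can be pulled out of a correlation matrix directly. The sketch below uses made-up features and a hypothetical helper `correlated_pairs`; the 0.8 threshold is an arbitrary assumption, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd


def correlated_pairs(corr, threshold=0.8):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=np.abs, ascending=False)


# Synthetic features (made up): f_b is nearly a copy of f_a, f_c is independent.
rng = np.random.default_rng(0)
a = rng.standard_normal(1_000)
toy = pd.DataFrame({
    "f_a": a,
    "f_b": a + 0.1 * rng.standard_normal(1_000),
    "f_c": rng.standard_normal(1_000),
})

pairs = correlated_pairs(toy.corr(), threshold=0.8)
print(pairs)
```

On the real data, the same helper could be applied to the `correlation` frame after dropping the `target` row and column.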