# Ubiquant market prediction - EDA

<img src="https://cdn.pixabay.com/photo/2017/08/30/07/56/clock-2696234_1280.jpg" width="800px">

# Table of contents

- [Motivation](#motivation)
- [Loading packages and data](#packages)
- [Initial explorations (to be continued)](#initial_eda)
- [Feature generation (to be continued)](#new_features)
    - [Operations to perform](#operations)
    - [The feature selector](#selector)
    - [Model structure](#model)
    - [Go, honey, go!](#search)

# Motivation <a class="anchor" id="motivation"></a>

Our goal is to predict a target (its exact definition is not disclosed, but it is related to the return rate) that should help traders make trading decisions. To solve this task we are given:

* 300 anonymized features
* different investments. The set of investments is not fixed: the test data can contain investment ids that never appear in train [(look here)](https://www.kaggle.com/c/ubiquant-market-prediction/discussion/301693).
* time_ids per investment
* (and row ids)

The features in the training set were derived using real historic data. Furthermore, the description says:

**the final private leaderboard will be determined using data gathered after the training period closes, which means that the public and private leaderboards will have zero overlap**

What exactly does this mean? Does the test data follow directly after the training period in time, or was it gathered later, after some gap? For me this is not really clear...

# Loading packages and data <a class="anchor" id="packages"></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

from kaggle_secrets import UserSecretsClient

import wandb
# always wanted to try it out - now it's time to do so! :-) 

import seaborn as sns
sns.set_style("whitegrid")
#sns.set()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
use_wandb=False
if use_wandb:
    user_secrets = UserSecretsClient()
    secret_value_wb = user_secrets.get_secret("wandb")
    ! wandb login $secret_value_wb
    wandb.init(project="ubiquant", name="starter")

In [None]:
train = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')
train.head()

In [None]:
test = pd.read_parquet('../input/ubiquant-parquet/example_test.parquet')
test.head()

# Exploratory Analysis <a class="anchor" id="exploration"></a>

# Initial explorations <a class="anchor" id="initial_eda"></a>

How many samples do we have?

In [None]:
train.shape[0]

How many investments are present?

In [None]:
train.investment_id.nunique()

What's the maximum time_id per investment id?

In [None]:
train.groupby("investment_id").time_id.max().value_counts()

That's interesting! Most of the investment ids have a max time_id of 1219. But there are a few investments whose series end earlier.

In [None]:
train.groupby("investment_id").time_id.max().value_counts().index.max()

In [None]:
train.groupby("investment_id").time_id.max().value_counts().index.min()

No investment has a maximum time_id > 1219 or < 62. We should take a look at a few examples. By the way - how many outlier investments do we have?

In [None]:
selection = train.groupby("investment_id").time_id.max()
outlier_inv_ids = selection[selection != 1219].index.values
len(outlier_inv_ids)

In [None]:
plt.figure(figsize=(20,10))
# Plot at most 50 series; slicing avoids an IndexError if there are fewer outliers.
for inv_id in outlier_inv_ids[:50]:
    series = train[train.investment_id == inv_id]
    plt.plot(series.time_id, series.target.cumsum(), '.', alpha=0.5)
# Axis settings only need to be applied once, after the loop.
plt.xlim([0,1220])
plt.title("Return/target cumsum for outlier investments")
plt.xlabel("time_id")
plt.ylabel("cumsum return");

### Insights

* We can clearly see that some investments are missing parts of their timeseries or end earlier.
* Looking back into the competition description, we find: "The ID code for an investment. **Not all investment have data in all time IDs**."
* A lot of investment ids have a **break** close to time_id 400 (roughly **380**) and/or stop at roughly **1050**.
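
The breaks described above can also be located programmatically instead of by eye. The following is only a minimal sketch on toy data: the helper `find_time_gaps` and its `min_gap` parameter are my own hypothetical names, and I assume that a "gap" simply means a jump of more than one `time_id` between two consecutive rows of the same investment.

```python
import pandas as pd


def find_time_gaps(df, min_gap=2):
    """Return one row per jump of at least `min_gap` time_ids within an investment."""
    ordered = df.sort_values(["investment_id", "time_id"])
    # Step between consecutive observations of the same investment (NaN for the first row).
    step = ordered.groupby("investment_id").time_id.diff()
    is_gap = step >= min_gap
    gaps = ordered.loc[is_gap, ["investment_id", "time_id"]].copy()
    # Number of time_ids that are missing between the two observations.
    gaps["gap_length"] = (step[is_gap] - 1).astype(int)
    # First missing time_id of the gap.
    gaps["gap_start"] = gaps["time_id"] - gaps["gap_length"]
    return gaps.rename(columns={"time_id": "resumes_at"}).reset_index(drop=True)


# Toy data: investment 1 has no rows for time_id 2, 3 and 4.
toy = pd.DataFrame({
    "investment_id": [1, 1, 1, 1, 2, 2, 2],
    "time_id":       [0, 1, 5, 6, 0, 1, 2],
})
gaps = find_time_gaps(toy)
print(gaps)
```

On the real data, something like `find_time_gaps(train).gap_start.value_counts()` should show whether the breaks really cluster around 380.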

In [None]:
selection = train.groupby("investment_id").time_id.max()
not_outlier_inv_ids = selection[selection == 1219].index.values
len(not_outlier_inv_ids)

In [None]:
plt.figure(figsize=(20,10))
# Plot at most 50 series; slicing avoids an IndexError if there are fewer ids.
for inv_id in not_outlier_inv_ids[:50]:
    series = train[train.investment_id == inv_id]
    plt.plot(series.time_id, series.target.cumsum(), '.', alpha=0.5)
# Axis settings only need to be applied once, after the loop.
plt.xlim([0,1220])
plt.title("Return/target cumsum for not-outlier investments")
plt.xlabel("time_id")
plt.ylabel("cumsum return");

### How many investments do we have per time id?

In [None]:
num_investments_per_time_id = train.groupby("time_id").investment_id.nunique()

plt.figure(figsize=(20,5))
plt.plot(num_investments_per_time_id.index, num_investments_per_time_id.values, 'o', color="black")
plt.xlabel("time_id")
plt.ylabel("count")
plt.title("Number of unique investment_ids given time_id");

### Insights

* The number of unique investment ids seems to be roughly constant in the beginning but has an increasing trend after the crazy id 400. 
* We can see that the number of investments given the time id varies especially around the id 400.

What does that mean...? The description says...

>  The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.

Hmm... what?!

Does it mean given a single investment the time id is in order but the time between ids can vary (also for this single investment)? If so, do we have different "temporal spaces" per investment? Then time_id 12 for example would be a different time for each investment? Or does it mean that this id 12 is the same real time for all investment ids? 
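
Independent of how the time_ids map to real time, one thing we can check directly is whether the sequence of time_ids itself has holes, i.e. ids for which no investment at all has a row. This is a small sketch on toy data; `time_id_coverage` is a hypothetical helper name, and I assume time_ids are consecutive integers when nothing is missing.

```python
import numpy as np
import pandas as pd


def time_id_coverage(df):
    """Count investments per time_id and list time_ids absent from the full range."""
    counts = df.groupby("time_id").investment_id.nunique()
    full_range = np.arange(counts.index.min(), counts.index.max() + 1)
    # time_ids inside the observed range that do not occur in any row.
    missing = np.setdiff1d(full_range, counts.index.values)
    return counts, missing


# Toy data: no row at all for time_id 2.
toy = pd.DataFrame({
    "time_id":       [0, 0, 1, 3, 3, 3],
    "investment_id": [1, 2, 1, 1, 2, 3],
})
counts, missing = time_id_coverage(toy)
print(counts.to_dict())
print(missing)
```

If `missing` comes back non-empty for `train`, that would at least hint that the time axis was thinned out for everyone at once rather than per investment.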

### How is the target distributed?

In [None]:
fig, ax = plt.subplots(2,2,figsize=(20,10))
# sns.distplot is deprecated; histplot with a KDE overlay gives the same kind of picture.
sns.histplot(train.target.values, color="red", ax=ax[0,0], kde=True, stat="density", bins=50)
ax[0,0].set_title("Target distribution")
# Pick the first investment that has at least one extreme target value (> 8).
n = 0
selected_id = train[train.target > 8].investment_id.values[n]
selection = train[train.investment_id == selected_id][["time_id", "target"]]
ax[0,1].set_title("Target distribution of investment id {}".format(selected_id))
sns.histplot(selection.target.values, ax=ax[0,1], color="seagreen", kde=True, stat="density")
ax[1,0].plot(selection.time_id.values, selection.target.values)
ax[1,0].set_title("Target timeseries of investment id {}".format(selected_id))
ax[1,1].plot(selection.time_id.values, selection.target.cumsum().values)
ax[1,1].set_title("Target timeseries cumsum of investment id {}".format(selected_id));

### Insights

* Looks like a Student's t-distribution (bell-shaped like a normal, but with heavy tails).
* Browsing through different outlier series, the heavy tails correspond to steep changes within a single time step. The cumsum shows that this is not necessarily strange behaviour: strong changes can also build up over a small number of time_id steps.
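
The "looks like a t-distribution" impression can be backed up with two numbers: the excess kurtosis (0 for a normal distribution, positive for heavy tails) and the degrees of freedom of a fitted t-distribution (small values mean heavy tails). The sketch below runs on synthetic samples rather than the competition target; `tail_summary` is a hypothetical helper, and the maximum-likelihood fit via `scipy.stats.t.fit` is just one of several ways to do this.

```python
import numpy as np
from scipy import stats


def tail_summary(values):
    """Excess kurtosis plus the parameters of a maximum-likelihood t-distribution fit."""
    values = np.asarray(values, dtype=float)
    excess_kurtosis = stats.kurtosis(values)  # Fisher definition: 0 for a normal
    df_t, loc_t, scale_t = stats.t.fit(values)
    return {"excess_kurtosis": excess_kurtosis, "t_df": df_t,
            "t_loc": loc_t, "t_scale": scale_t}


rng = np.random.default_rng(0)
heavy = rng.standard_t(5, size=5000)   # synthetic heavy-tailed sample
light = rng.normal(size=5000)          # synthetic normal sample for comparison

heavy_summary = tail_summary(heavy)
light_kurtosis = stats.kurtosis(light)
print(heavy_summary)
print(light_kurtosis)
```

For the real data one could call `tail_summary(train.target.values)`, possibly on a random subsample, as the fit can take a while on millions of rows.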

# How are features (or the target) distributed over time?

In [None]:
train.loc[:, "f0_diff"] = train.groupby("investment_id").f_0.diff()
selection = train.loc[:, ["time_id", "investment_id", "f_0", "f0_diff"]].dropna()

In [None]:
import matplotlib.animation as animation

# Render animations as interactive JavaScript widgets inside the notebook.
plt.rcParams["animation.html"] = "jshtml"


class ShowFeatureOverTimeID:

    def __init__(self, df, feature, bins=70):
        self.df = df
        self.feature = feature
        # Animate only over time_ids that actually occur in the data.
        self.time_ids = np.sort(df.time_id.unique())
        # Fixed bin edges over the global range, so that every frame
        # updates the same set of bars (initial plot and updates must match).
        self.bin_edges = np.linspace(df[feature].min(), df[feature].max(), bins + 1)

        data = self.df[self.df.time_id == self.time_ids[0]][self.feature]
        self.fig, self.ax = plt.subplots()
        _, _, self.bar_container = self.ax.hist(data, bins=self.bin_edges, lw=0.1, fc="blue", alpha=0.6)
        self.ax.set_xlim([self.bin_edges[0], self.bin_edges[-1]])
        self.ax.set_title("How does the feature distribution change over time_id?")
        self.ax.set_ylabel("count")
        self.ax.set_xlabel(feature)

    def show(self):
        self.ani = animation.FuncAnimation(self.fig, self.prepare_animation(self.bar_container),
                                           frames=len(self.time_ids),
                                           blit=False,
                                           repeat=False)
        #Writer = animation.writers['ffmpeg']
        #writer = Writer(fps=15, metadata=dict(artist='Me'), bitrate=1800)
        #self.ani.save('example.mp4', writer=writer)
        plt.close()
        return self.ani

    def prepare_animation(self, bar_container):
        def animate(i):
            # Frame i shows the i-th available time_id.
            data = self.df[self.df.time_id == self.time_ids[i]][self.feature]
            n, _ = np.histogram(data, self.bin_edges)
            for count, rect in zip(n, bar_container.patches):
                rect.set_height(count)
            return bar_container.patches
        return animate

In [None]:
selection

In [None]:
ani = ShowFeatureOverTimeID(selection, "f0_diff")
ani.show()

In [None]:
plt.figure(figsize=(20,5))
plt.plot(selection.groupby("time_id")["f0_diff"].mean())
plt.title("How does the mean change of f_0 behave over time?")
plt.xlabel("time_id")
plt.ylabel("f0_diff mean over all investment ids");

# Model structure <a class="anchor" id="model"></a>

Ok, this problem has some difficulties:

* We have investment_ids that can be present in train & test or only in train or only in test. 
* We can have some kind of measurement gaps where values are missing.
* Timeseries can have different lengths in total. 
* The time intervals can vary.
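
To make the gap problem a bit more concrete for lag features: a `groupby(...).shift(1)` returns the previous *observation* of an investment, not the value at the previous time_id. The sketch below uses toy data; `add_lag_features` and the extra `time_since_prev` column are hypothetical names of mine, added only to make visible how old a lagged value actually is.

```python
import pandas as pd


def add_lag_features(df, feature_cols, fill_value=0.0):
    """Add <feature>_shift1 columns per investment plus the distance to the previous row."""
    out = df.sort_values(["investment_id", "time_id"]).copy()
    grouped = out.groupby("investment_id")
    for col in feature_cols:
        # Previous observation of the same investment; first rows get fill_value.
        out[col + "_shift1"] = grouped[col].shift(1).fillna(fill_value)
    # Number of time_ids between a row and the previous row of the same investment.
    out["time_since_prev"] = grouped.time_id.diff().fillna(0).astype(int)
    return out


# Toy data: investment 7 has a gap between time_id 1 and 4.
toy = pd.DataFrame({
    "time_id":       [0, 1, 4, 0, 1],
    "investment_id": [7, 7, 7, 9, 9],
    "f_0":           [1.0, 2.0, 3.0, 10.0, 20.0],
})
lagged = add_lag_features(toy, ["f_0"])
print(lagged)
```

For investment 7 the row at time_id 4 receives the value from time_id 1, so the "lag 1" feature there is really three steps old. Whether that matters for the model is something to test, not something this sketch can answer.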



Before adding more and more content about feature engineering, I'd like to build a submission loop that can also use features from the previous time step.

In [None]:
train.head()

In [None]:
selection = None

In [None]:
selection = ["time_id", "investment_id"]
feature_names = train.drop(["row_id", "time_id", "investment_id", "target"], axis=1).columns.values
N = np.random.choice(np.arange(0, len(feature_names)), replace=False, size=100)
my_features = ["f_{}".format(n) for n in N]
selection.extend(my_features)

In [None]:
len(selection)

In [None]:
X = train[selection].copy()
Y = train.target
for f in my_features:
    X[f + "_shift1"] = X.groupby("investment_id")[f].shift(1).fillna(0)

    #X[f + "_diff"] = X[f] - X[f+"_shift1"]
    #X[f + "_diff"] = X[f+"_diff"].fillna(0)

In [None]:
X.shape

In [None]:
features = ["f_{}_shift1".format(n) for n in N]
features.extend(["f_{}".format(n) for n in N])
print(len(features))

In [None]:
V = 0.9
split_idx = int(V * X.shape[0])  # first 90% of the rows for training, the rest for validation
x_train, x_dev = X[:split_idx][features], X[split_idx:][features]
y_train, y_dev = Y[:split_idx], Y[split_idx:]

In [None]:
lm = Ridge()
lm.fit(x_train.values, y_train.values)
# Predict on plain arrays as well, to stay consistent with how the model was fitted.
y_pred = lm.predict(x_dev.values)
pearsonr(y_dev, y_pred)
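
One caveat about the score above: it is a single Pearson correlation pooled over all validation rows. As far as I understand the competition's evaluation page, submissions are scored by the mean of the Pearson correlation computed separately for each time_id, and the two numbers can differ a lot. The following is a sketch on toy data; `mean_pearson_per_time_id` is a hypothetical helper and not part of any library.

```python
import numpy as np
import pandas as pd


def mean_pearson_per_time_id(time_ids, y_true, y_pred):
    """Average the Pearson correlation of y_true and y_pred over the time_ids."""
    frame = pd.DataFrame({
        "time_id": np.asarray(time_ids),
        "y_true": np.asarray(y_true, dtype=float),
        "y_pred": np.asarray(y_pred, dtype=float),
    })
    scores = [g["y_true"].corr(g["y_pred"]) for _, g in frame.groupby("time_id")]
    # nanmean skips time_ids where the correlation is undefined (e.g. constant predictions).
    return float(np.nanmean(scores))


# Toy data: perfect ranking at time_id 0, perfectly reversed ranking at time_id 1.
toy_time = [0, 0, 0, 1, 1, 1]
toy_true = [1, 2, 3, 11, 12, 13]
toy_pred = [1, 2, 3, 13, 12, 11]

per_time_score = mean_pearson_per_time_id(toy_time, toy_true, toy_pred)
pooled_score = float(np.corrcoef(toy_true, toy_pred)[0, 1])
print(per_time_score, pooled_score)
```

In the toy example the pooled correlation is high only because the level differs between the two time_ids, while the per-time_id mean shows that the ranking within each time_id is no better than chance on average. For the validation split above, the matching time_ids would be `X.time_id[int(V*X.shape[0]):]`.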

In [None]:
import ubiquant
env = ubiquant.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test set and sample submission

# Seed the "previous step" with the last time_id of the training data.
previous_test_df = train[train.time_id == train.time_id.max()]
for (test_df, sample_prediction_df) in iter_test:

    # Look up previous values by investment_id. Assigning a Series directly would
    # align on the row index, which differs between batches and yields NaNs.
    previous_values = previous_test_df.drop_duplicates("investment_id").set_index("investment_id")
    for f in my_features:
        # Investments that were not present in the previous step get 0.
        test_df[f+"_shift1"] = test_df.investment_id.map(previous_values[f]).fillna(0).values

        #test_df.loc[:, f+"_diff"] = test_df[f] - test_df[f+"_shift1"]
        #test_df.loc[:, f+"_diff"] = test_df.loc[:, f+"_diff"].fillna(0)

    pred = lm.predict(test_df[features].values)
    sample_prediction_df['target'] = pred  # make your predictions here
    env.predict(sample_prediction_df)   # register your predictions
    previous_test_df = test_df.copy()

The loop seems to work. Next topic - automatic feature generation and a few more experiments on the feature per time_id distributions.