In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.cluster import AgglomerativeClustering
!pip install hdbscan
import umap, hdbscan
import pickle
from scipy.stats import pearsonr
from typing import Tuple
import seaborn as sns


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Overview of the data

In [None]:
train    = pd.read_pickle('/kaggle/input/low-memory-pickle/train.pkl')

features = [col for col in train.columns if col.startswith('f_')]
target   = 'target'

In [None]:
train.info()

In [None]:
train.head()

In [None]:
train.isnull().sum().sum()

In [None]:
# How many time_ids are there?
len(train.index.unique())

In [None]:
# How many investment_ids are there? 
len(train.investment_id.unique())

Here we have a panel of 3579 different `investment_id`s across 1211 time periods, jointly indexing a target variable (presumably some sort of forward return) and 300 features. We don't know what `time_id` corresponds to, nor what any of the features are. 

# Exploration of the target variable

Let's take a look at the overall distribution of the target variable:

In [None]:
(train[target]
 .plot(kind='hist', 
       bins = 1000, 
       figsize = (20,10), 
       title='Distribution of target variable', 
       ylabel='Frequency'))

The general picture is pretty normal, if skewed somewhat to the right.

One thing that immediately jumps out is a discontinuity at 0. Let's first look into whether that's a data quality issue or not:

In [None]:
# Graph a narrow bandwidth around 0 just to confirm
(train[target]
 .groupby(pd.cut(train[target], bins=np.arange(-0.4,0.4,0.02)))
 .count()
 .plot(rot=45, 
       title='Does the target variable bunch around 0?', 
       ylabel='Frequency', 
       xlabel='Target (binned)',
       xticks=np.arange(18,22),
       grid=True,
       figsize=(20, 10)))

There's a clump of data at 0 -- 0s are about 10% more frequent than values in the neighborhood. 

Is this a problem with the data? Maybe not: I could see a world in which financial data is truly discontinuous at 0, e.g. if our target is a forward return and no one trades the underlying asset between two periods, then its price won't change and the return will be zero. If so, then there's no need to worry about this. I'd be curious, though, to see if the frequency of 0s varies by `time_id` or `investment_id`, which might reflect artifacts in the data that we'd want to cut out of our training set. 

Let's plot the distribution of the share of zeros, first by `investment_id`, then by time_id:

In [None]:
# Flag targets in a narrow band around zero (|target| <= 0.02) as "zeros"
train['temp'] = train[target].between(-0.02, 0.02)



In [None]:
(train
 .groupby('investment_id')['temp']
 .mean()
 .plot(kind='hist',
       bins=100,
       figsize=(20, 10)))

The left-hand side of this plot shows nothing out of the ordinary. Let's see what those outliers are:

In [None]:

(train
 .groupby('investment_id')['temp']
 .aggregate(['mean','count'])
 .sort_values('mean', ascending=False)[:30])


As you can see, the outliers are investments that only show up a handful of times in the data. This doesn't tell us much, though: it's consistent even with the null hypothesis that the true fraction of zeros per `investment_id` is constant (i.e. it could just reflect sampling variation). That being said, it seems sensible to throw out the sparsest `investment_id`s, because they have basically no variation in the target variable and are unlikely to show up in the evaluation data. 
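To make the sampling-variation point concrete, here is a minimal simulation. It assumes every investment shares the same true near-zero probability (`p_true`, a made-up value) and only varies how many rows each investment has; none of these numbers come from the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: every investment has the same true near-zero share
p_true = 0.05
n_investments = 2000

def simulated_shares(n_obs: int) -> np.ndarray:
    """Observed near-zero share for investments with n_obs rows each."""
    return rng.binomial(n_obs, p_true, size=n_investments) / n_obs

def standard_error(n_obs: int) -> float:
    """Binomial standard error of a share estimated from n_obs rows."""
    return float(np.sqrt(p_true * (1 - p_true) / n_obs))

sparse = simulated_shares(5)     # investments seen only a handful of times
dense = simulated_shares(1000)   # investments seen in most periods

print(f"sparse: max share {sparse.max():.2f}, s.e. {standard_error(5):.3f}")
print(f"dense:  max share {dense.max():.2f}, s.e. {standard_error(1000):.3f}")
```

Even with an identical underlying rate, the sparse group produces far more extreme shares, which is the pattern the table above would show under the null.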

Does the fraction of zeros vary across time?

In [None]:
(train
 .groupby('time_id')['temp']
 .mean()
 .plot(kind='hist',
       bins=100,
       figsize=(20, 10)))

Again, nothing really jumps out as an obvious data problem. Let's see where the outliers are.

Below is a list of time_ids in descending order of fraction of zeros:

In [None]:
(train
 .groupby('time_id')['temp']
 .aggregate(['mean','count'])
 .sort_values('mean', ascending=False)[:30])

There's an issue with time_id==367 -- notice how the mean discontinuously jumps? However, quantitatively this isn't enough to drive the +10,000 excess zeros we see in the distribution of the target variable. I conclude that the zeros are valid data that we should train on, but also that there might be something strange going on with the time_ids, especially around time_id==367. 

In [None]:
del train['temp']

Lastly, let's see how the target variable was standardized. I'd guess that it's something like z-scores across `investment_id` within `time_id`, or maybe vice versa. Let's see:

In [None]:
(train
 .groupby('investment_id')['target']
 .aggregate(['mean', 'std'])
 .plot(kind='hist', 
       bins=50, 
       figsize = (20,10), 
       title='Distribution of target\'s mean and standard deviation within investment_id across time_id'))

In [None]:
(train
 .groupby('time_id')['target']
 .aggregate(['mean', 'std'])
 .plot(kind='hist', 
       bins=50, 
       figsize = (20,10), 
       title='Distribution of target\'s mean and standard deviation within time_id across investment_id'))

Generally it looks like it's standardized within-time_id, although not perfectly -- perhaps they enforce standard rolling means and variances over some time. Regardless I think it's pretty clear that we don't need to do much or any additional work cleaning the target variable, at least for now.
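For reference, this is what the two diagnostics look like when a target really is z-scored within `time_id`. It's a sketch on a synthetic panel; the `raw` column, the drift and the scale are all invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy panel: 200 time_ids x 50 investments, with raw returns
# whose level and scale change from period to period
n_times, n_investments = 200, 50
panel = pd.DataFrame({
    'time_id': np.repeat(np.arange(n_times), n_investments),
    'investment_id': np.tile(np.arange(n_investments), n_times),
})
drift = np.repeat(rng.normal(0, 1, n_times), n_investments)
scale = np.repeat(rng.uniform(0.5, 2.0, n_times), n_investments)
panel['raw'] = drift + scale * rng.normal(size=len(panel))

# Z-score across investments within each time_id
grouped = panel.groupby('time_id')['raw']
panel['target'] = (panel['raw'] - grouped.transform('mean')) / grouped.transform('std')

by_time = panel.groupby('time_id')['target'].agg(['mean', 'std'])
by_investment = panel.groupby('investment_id')['target'].agg(['mean', 'std'])

print(by_time.describe().loc[['min', 'max']])
print(by_investment.describe().loc[['min', 'max']])
```

Under exact within-period standardization the per-`time_id` mean and standard deviation collapse to 0 and 1, while the per-investment statistics still scatter, so any spread in the per-`time_id` histogram is a sign that the standardization was done some other way.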

# Exploration of the time dimension

We already know that the panel isn't balanced, so let's look into that. Below we have descriptive stats of the count of time_ids across `investment_id`s -- clearly there is a wide range of time periods covered in each investment.

In [None]:
(train
 .groupby('investment_id')['target']
 .count()
 .describe())

Let's get a sense for what these investments actually look like by picking 10 randomly and plotting the target for each of them over time:

In [None]:
import matplotlib.pyplot as plt
for k in np.random.choice(train['investment_id'].unique(), 10, replace=False):
    d = train[train['investment_id']==k]
    d['target'].plot(figsize=(20, 5),
                     title=f'investment_id {k}',
                     style='.-',
                     yticks=np.arange(-6, 9, 2),
                     xticks=np.arange(0, 1300, 200),
                     ylabel='Target')
    plt.show()

These series look mostly like noise, although presumably with a degree of mean reversion. The main thing that stands out to me is the missing data -- there are gaps in the time series of varying lengths, sometimes long, sometimes short. 

Below is a quick way to visualize time coverage in the aggregate: a scatter plot with `investment_id` on the y-axis and `time_id` on the x-axis, with each dot reflecting coverage for that (x,y) pair. 

In [None]:
train['time_id'] = train.index
(train
 .plot
 .scatter(y='investment_id', 
          x='time_id', 
          figsize = (20, 20), 
          s=0.01))

The vertical white streak around `time_id`==370 is interesting -- note that this is where we saw a preponderance of 0s. I wonder if it corresponds to the market crash in early March 2020. If so, then the market dynamics before and after that vertical band are presumably quite different, and we should try training a version of the model with that early period excluded. 

The table below shows that no `time_id`s are observed between 367 and 373:

In [None]:
(train
 .groupby(by=train.index)
 .count()['target'][350:375])

One way to infer whether or not that period around `time_id`==370 is the Covid market crash (or at least some sort of unique financial event) would be to look for substantially heightened volatility around that time, because we know that the target isn't perfectly standardized within-`time_id`. Plotting the cross-sectional standard deviation in the target variable over time, we find exactly that:

In [None]:
(train
 .groupby(by=train.index)['target']
 .std()
 .plot(figsize=(20,10),
       title="Standard deviation of the target variable across investments over time"))

There's unusual volatility around this period of missing data -- clearly we should run versions of our models in which we've excluded data from this time period (and potentially from before it as well) from the training set, as it's unlikely that whatever was going on then is a good guide to the evaluation period in spring 2022. 

We saw earlier that the series for some of these investments don't span the entire observation window. The white horizontal streaks are missing `time_id`s -- they're all concentrated on the left-hand side of the plot, whereas the right-hand side is pretty blue. I'm thinking that we should try cutting some of the investments that we only observe for a sliver of time or that go missing well before the most recent `time_id`, because they're unlikely to correlate well with future data. Let's plot the min/max `time_id` by `investment_id` to get a sense for the prevalence of this particular issue:

In [None]:
temp = (train
        .groupby('investment_id')['time_id']
        .aggregate(['min', 'max'])
        .sort_values(['max', 'min']))

temp['y'] = np.arange(len(temp))/len(temp)*100

ax1 = (temp
      .plot
      .scatter(y='y',
               x='max',
               figsize = (20, 10), 
               s=0.3, 
               c='blue', 
               title='Last time_id observed (blue) versus first time_id observed (red), by investment_id', 
               xlabel='time_id',
               ylabel='Percentile of investments'))

(temp
 .plot
 .scatter(y='y',
          x='min', 
          s=0.3,  
          c='red', 
          ax=ax1, 
          xlabel='time_id',
          ylabel='Percentile of investments'))

Some observations:
1. ~60% of the investments have full time coverage (although this says nothing of how sparse they are)
2. Relatively few (~2%) of the investments do not have any data extending to the most recent time period -- among these, virtually all end after `time_id`==800 (i.e. it's not like they end at the very beginning of the observation window)
3. A solid chunk (~20%) of the investments only begin to have coverage halfway through the observation window

I think the principled way to deal with this is simply to drop the few investments that go missing before the end of the observation window because they're unlikely to show up in the evaluation set.

Let's look at the autocorrelations of the target for a subset of investments:

In [None]:
sample_tickers = np.random.choice(train['investment_id'].unique(), 10, replace=False)
(train
 .query('investment_id in @sample_tickers')
 .set_index(['time_id', 'investment_id'])['target']
 .astype('float64')
 .unstack()
 .rolling(90)
 .apply(lambda x: x.autocorr(), raw=False)
 .plot(alpha=0.4,
       title='Target autocorrelation for 10 random investments (rolling)',
       figsize=(20,10))
 .legend(loc='lower center',
         ncol=len(sample_tickers)));
plt.axhline(0);

There's no clear pattern in the autocorrelations -- perhaps surprisingly it's more common for them to be positive than negative (I would have expected mean-reversion to be dominant here). 

However, I now think that time-series analysis is a no-go in this problem because of this line from the competition data page: 

> The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.

This precludes the use of lags as predictors -- if the people running the competition reduce the gap between successive time periods, then any lags in our models will be totally misspecified. So I think our best bet is to train a model that predicts purely based on cross-sectional variation. 

With that being said, let's look into the cross-sections: 

Drop investments that disappear before the end of the sample period and sparse investments:

In [None]:
train['time_max']   = (train
                       .groupby('investment_id')['time_id']
                       .transform('max'))
train['time_count'] = (train
                       .groupby('investment_id')['time_id']
                       .transform('count'))

In [None]:
train = train[(train['time_max']>1200) & (train['time_count']>10)]

In [None]:
del train['time_id']
del train['time_max']
del train['time_count']
del temp

# What do the features look like?

Let's get a sense for how the features relate to the target and to each other:

At a high level the features look very clean -- presumably they've already been standardized to mean zero and standard deviation 1 in some manner.



In [None]:
# Speed things up by taking a 10% sample
sample = train.sample(frac=.10)

In [None]:
# Need to cast to float64 because the standard deviation in aggregate() seems broken for float16 data
(train[features]
 .astype('float64')
 .aggregate(['mean', 'std'])
 .T
 .plot(kind='hist', 
       bins=50, 
       figsize = (20,10), 
       title='Distribution of unconditional mean and standard deviation across features')
)


In [None]:
temp = (train[features]
 .astype('float64')
 .aggregate(['mean', 'std'])
 .T)

In [None]:
temp.sort_values(by='mean')

In [None]:
train[list(temp.sort_values(by='mean').iloc[:10].index)]

In [None]:
train['f_170'].plot(kind='hist', bins=200, figsize=(20,10))

In [None]:
train[list(temp.sort_values(by='mean', ascending=False).iloc[:10].index)]

In [None]:
train['f_246'].astype('float64')

In [None]:
train[list(temp.sort_values(by='std').iloc[:10].index)]

Let's sort the descriptive stats by the standard deviation:

In [None]:
temp.sort_values(by='std')

The outlier is f_124:

In [None]:
train['f_124'].plot(kind='hist', bins=200, figsize=(20,10))

In [None]:
train['f_124'].astype('float64').describe()

In [None]:
train['f_124'].corr(train['target'])

This data is bad. We can try dropping it from the training set -- let's see whether it's the only feature that looks like this. 

In [None]:
train['f_170'].astype('float64').describe()

In [None]:
train['f_182'].astype('float64').describe()

The rest of these look broadly fine -- the lower standard deviation seems to be a normal outcome of combining a categorical variable (the 0s/1s/-1s) with a continuous variable. We might want to try splitting these features into their categorical and continuous parts to make things a bit easier on our models. 
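A minimal sketch of that split, assuming that values exactly equal to -1, 0 or 1 are category codes and everything else is a continuous reading. The helper name `split_mixed_feature` and the `f_demo` column are hypothetical, and I haven't verified that the real features follow this convention.

```python
import numpy as np
import pandas as pd

def split_mixed_feature(values: pd.Series, levels=(-1.0, 0.0, 1.0)) -> pd.DataFrame:
    """Split a feature into a categorical part and a continuous part.

    Assumption: values exactly equal to one of `levels` are category codes;
    everything else is a genuine continuous reading.
    """
    values = values.astype('float64')
    is_level = values.isin(list(levels))
    return pd.DataFrame({
        # Category code where the value is a level, NaN otherwise
        f'{values.name}_cat': values.where(is_level),
        # Continuous reading where the value is not a level, NaN otherwise
        f'{values.name}_cont': values.where(~is_level),
    })

example = pd.Series([0.0, 1.0, -1.0, 0.37, -2.15, 0.0], name='f_demo')
parts = split_mixed_feature(example)
print(parts)
```

Tree models like LGBM handle the NaNs natively; for a neural net the NaNs would need filling plus an indicator column.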

Let's look at the full correlation matrix. Here's the distribution of feature correlations with the target:

In [None]:
correlation = sample[[target] + features].corr()
(correlation[target]
 .iloc[1:]
 .plot(kind='hist',
       bins = 50, 
       figsize = (20,10),
       title='Distribution of feature correlations with target'))

There's very little (linear) signal in any of these individual features.

Let's see if there are any correlated clusters of the features that we could use to reduce the dimensionality of the feature space or derive new features from:

In [None]:
sns.clustermap(correlation, figsize=(10,10))

At a high level, I would say that about half of the data belongs to 2-3 fairly tight clusters of features, while the other half are more or less mutually orthogonal. I'll make some hierarchical clusters because that's what's visualized here, and then try some other clustering methods at a later date. 
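One way to turn a correlation matrix into flat hierarchical clusters is sketched below. It uses average linkage on the distance 1 - |correlation|, which is my own choice (so that strongly negatively correlated features land together) and not necessarily the linkage the heatmap above is drawn with; the toy columns are invented.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def correlation_clusters(corr: pd.DataFrame, n_clusters: int) -> dict:
    """Group columns by average-linkage clustering on 1 - |correlation|."""
    distance = np.clip(1.0 - corr.abs().to_numpy(), 0.0, None)
    distance = (distance + distance.T) / 2   # enforce exact symmetry
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    labels = fcluster(linkage(condensed, method='average'),
                      t=n_clusters, criterion='maxclust')
    return {int(label): corr.columns[labels == label].to_list()
            for label in np.unique(labels)}

# Hypothetical toy features driven by two independent latent factors
rng = np.random.default_rng(0)
factor_a, factor_b = rng.normal(size=(2, 500))
toy = pd.DataFrame({
    'a1': factor_a + 0.1 * rng.normal(size=500),
    'a2': factor_a + 0.1 * rng.normal(size=500),
    'b1': factor_b + 0.1 * rng.normal(size=500),
    'b2': -factor_b + 0.1 * rng.normal(size=500),
})
groups = correlation_clusters(toy.corr(), n_clusters=2)
print(groups)
```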

#### A note on using `investment_id` for feature engineering: 

I don't think we can directly use `investment_id` in the model. According to the competition description, there's no guarantee that any of the `investment_id`s in the training set will be present in the evaluation set -- that's not ideal, because you would think there would be some important information contained in the interplay between movements in different investments (e.g. when investment X and Y are both up in time $t$, then investment Z will often be down in time $t+1$). 

One way to exploit some of the information contained in the `investment_id`s without overfitting to the particular set of them contained in the training data would be to train a clustering algorithm on the set of investments that seem most likely to show up in the evaluation set (e.g. the investments that always show up in the training set) and then assign any new investments we encounter to one of those clusters before computing new features within- or cross-clusters. This procedure would have a lot of assumptions baked into it, however, about the validity of the clustering, so for now I'll avoid it, instead focusing on clustering like groups of features. 

Below I write a function that uses UMAP to non-linearly project the feature space into two dimensions before clustering these two-dimensional representations with HDBSCAN, plotting the resulting clusters. 

In [None]:
def cluster(data, plot=True, tag=''):
    reduced_dim_data = (umap
                        .UMAP(n_neighbors=5,
                              min_dist=0.15,
                              n_components=2,
                              random_state=42)
                        .fit_transform(data))
    clusters = (hdbscan
                .HDBSCAN(min_samples=15,
                         min_cluster_size=35)
                .fit_predict(reduced_dim_data))
    if plot:
        clustered = (clusters >= 0)
        
        fig = plt.figure(figsize=(15,15))
        ax  = plt.gca()

        ax.scatter(
            reduced_dim_data[clustered, 0],
            reduced_dim_data[clustered, 1],
            c=clusters[clustered]
        )

        ax.scatter(
            reduced_dim_data[~clustered, 0],
            reduced_dim_data[~clustered, 1],
            c='grey',
            alpha=0.5
        )

        ax.set_title(f'Two-dimensional UMAP projections of features, clustered with HDBSCAN \n Unclustered features in grey \n {tag}');
    
    # Return a dictionary keyed by cluster label, holding the list of column names in each cluster.
    # .loc[[x]] always returns a Series, even when a label has a single member.
    return {x: pd.Series(data.index, index=clusters).loc[[x]].to_list() for x in set(clusters)}

Now let's run this clustering algorithm on two samples:

1. Full training data
2. Training data with `time_id`<400 excluded (corresponding to that large temporal break in the data)

In [None]:
clusters_full_sample = cluster(train[features].T, tag='Sample: full time period')

In [None]:
clusters_late_sample = cluster(train.loc[400:, features].T, tag='Sample: late time period')

Let's look at how much overlap there is in the cluster assignments between the two samples:

In [None]:
len(set(clusters_late_sample[1]).intersection(clusters_full_sample[0]))/len(clusters_late_sample[1])

In [None]:
len(set(clusters_late_sample[0]).intersection(clusters_full_sample[1]))/len(clusters_late_sample[0])

As you can see, the clustering algorithm (which I've tuned somewhat) produces very similar clusters when run on either the full sample or the late sample (dropping the data from before the big time break), which is good evidence of its robustness to time. In the future, I'll experiment with reducing the min_cluster_size parameter to see if there's any extra signal in additional clusters, but for now I'll err on the side of caution with larger/fewer clusters. In the next notebook I'll engineer some features based on the cross-sectional distributions within- and cross-clusters. 
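Because HDBSCAN's labels are arbitrary, the same group of features can be labelled 0 in one run and 1 in the other, so comparing clusterings by label requires matching them first. Here is a small sketch that does the matching automatically with Jaccard overlap; `best_overlap` and the toy dictionaries are hypothetical and only mimic the shape of the dictionaries returned by `cluster`.

```python
def best_overlap(clusters_a: dict, clusters_b: dict) -> dict:
    """For each cluster in A, find the cluster in B with the highest Jaccard overlap."""
    result = {}
    for label_a, members_a in clusters_a.items():
        set_a = set(members_a)
        scores = {}
        for label_b, members_b in clusters_b.items():
            union = set_a | set(members_b)
            # Jaccard index: shared members divided by all members
            scores[label_b] = len(set_a & set(members_b)) / len(union) if union else 0.0
        best = max(scores, key=scores.get)
        result[label_a] = (best, scores[best])
    return result

# Toy cluster dictionaries: label -> list of feature names
late = {0: ['f_1', 'f_2', 'f_3'], 1: ['f_4', 'f_5'], -1: ['f_6']}
full = {0: ['f_4', 'f_5', 'f_6'], 1: ['f_1', 'f_2', 'f_3'], -1: []}

matches = best_overlap(late, full)
for label, (match, score) in matches.items():
    print(f'late cluster {label} -> full cluster {match} (Jaccard {score:.2f})')
```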

In [None]:
# Save our feature clusters
with open('feature_clusters.pickle', 'wb') as handle:
    pickle.dump(clusters_late_sample, handle, protocol=pickle.HIGHEST_PROTOCOL)


# Cross-validation strategy

It's easy to rule out a couple of simple CV strategies:

1. Splitting the data randomly into folds will train the model on data collected *after* the holdout set, which we definitely don't want
2. Splitting the data by `investment_id` may introduce leakage because some of the anonymous features are calculated as aggregates of the values of other investments

This means that we must split by time, training our model on data from times $j<t_i$ and evaluating performance at time $t_i$ for $i=1,...,k$, for $k$ folds.

The competition data page tells us to "Expect to see roughly one million rows in the test set." I'd estimate the number of `time_id`s to be predicted as $1{,}000{,}000 / 3{,}579 \approx 280$, i.e. roughly 300, where 3,579 is the number of `investment_id`s in the training set. Therefore I'll make my holdout sets $t_i$ of size ~300. 
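A sketch of how such folds could be generated. The holdout size follows the estimate above, but the number of folds, the `step` between windows and the function name are assumptions. Five non-overlapping windows of 300 would need 1,500 `time_id`s, more than the 1,211 available, so in this sketch consecutive validation windows overlap.

```python
import numpy as np

def time_based_folds(time_ids, n_folds=5, holdout_size=300, step=150):
    """Yield (train_ids, valid_ids) pairs of time_id arrays.

    Each validation window covers `holdout_size` consecutive time_ids and
    training uses only time_ids strictly before the window. Windows start
    `step` time_ids apart, so they overlap when step < holdout_size.
    """
    unique_ids = np.sort(np.unique(time_ids))
    for fold in range(n_folds):
        # The last fold's window ends at the most recent time_id
        end = len(unique_ids) - (n_folds - 1 - fold) * step
        start = end - holdout_size
        if start <= 0:
            raise ValueError('not enough time_ids before the first validation window')
        yield unique_ids[:start], unique_ids[start:end]

folds = list(time_based_folds(np.arange(1211)))
for train_ids, valid_ids in folds:
    print(f'train: {train_ids[0]}-{train_ids[-1]}, valid: {valid_ids[0]}-{valid_ids[-1]}')
```

Rows are then selected with something like `train.index.isin(train_ids)`, since `time_id` is the index of the training frame.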

# Iterating on models

Below I plot the importance of the various features after running two 895-estimator LGBM models (with hyperparameters tuned locally according to the above 5-fold, `time_id`-based cross-validation strategy):
1. In the first I've thrown in the cluster-derived features in addition to the 300 original features
2. The second has only added features derived purely from cross-sectional distributional statistics taken over the 300 original features (i.e. no clustering involved) 

You can see that clustering doesn't help at all, and essentially none of the other smorgasbord of derived features help our model, so I will not pursue this kind of thing any further -- you'd expect that if the approach worked *at all*, then we'd see *some* improvement...

In [None]:
version = 1
with open(f'../input/version{version}/results{version}.pkl', 'rb') as handle:
    scores, models, importance = pickle.load(handle)
(importance
 .groupby('feature')['importance']
 .aggregate(['mean','std'])
 .sort_values(by='mean')
 .plot(kind='barh',
       y='mean',
       xerr='std',
       grid=True,
       figsize=(20,80),
       title='Feature importance (gain): mean and standard deviation across 5 folds\n' 
       'Model includes features derived within- and across-clusters'))

In [None]:
version = 2
with open(f'../input/version{version}/results{version}.pkl', 'rb') as handle:
    scores, models, importance = pickle.load(handle)
(importance
 .groupby('feature')['importance']
 .aggregate(['mean','std'])
 .sort_values(by='mean')
 .plot(kind='barh',
       y='mean',
       xerr='std',
       grid=True,
       figsize=(20,80),
       title='Feature importance (gain): mean and standard deviation across 5 folds\n' 
       'Model includes features derived from cross-sectional distr. of original features'))

# Submissions

Notes:
* Used correlation as our early stopping rule (this is the metric Kaggle will score submissions with)
* Used MSE as the loss function -- practically this seems to learn better than the correlation, which has a stranger-shaped gradient function
* Found a wide range of plausible hyperparameters with a grid search, then fine-tuned with a Bayesian search
* Fine-tuned and trained all models on a Google Cloud virtual machine, then uploaded final weights to Kaggle
* Clustering algorithms all failed to produce any gains
* All models perform worse when trimming the earlier time periods
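For the first two notes, here is a sketch of a correlation score that could serve as the early-stopping metric. It assumes the score is the Pearson correlation computed within each `time_id` and then averaged over periods; the notes above only say "correlation", so treat the per-period averaging as my reading rather than a confirmed detail. The function name and toy arrays are hypothetical.

```python
import numpy as np
import pandas as pd

def mean_pearson_by_time(y_true, y_pred, time_ids) -> float:
    """Average, over time_ids, of the Pearson correlation between target and prediction."""
    frame = pd.DataFrame({
        'y': np.asarray(y_true, dtype='float64'),
        'p': np.asarray(y_pred, dtype='float64'),
        't': np.asarray(time_ids),
    })
    per_period = [group['y'].corr(group['p']) for _, group in frame.groupby('t')]
    return float(np.nanmean(per_period))

# Toy check: perfect agreement in period 0, perfect disagreement in period 1
y = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
t = np.array([0, 0, 0, 1, 1, 1])
p = np.array([3.0, 5.0, 7.0, -1.0, -2.0, -3.0])

score = mean_pearson_by_time(y, p, t)
rescaled_score = mean_pearson_by_time(y, 10 * p + 4, t)
mse = float(np.mean((y - p) ** 2))
print(score, rescaled_score, mse)
```

The score is unchanged when the predictions are shifted or rescaled, whereas MSE is not, which is one reason the two can disagree about which iteration is best.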

#### Version 1: LGBM (CV score: 0.1397, test score: 0.1356)

Fine-tuned LGBM trained only on the original features (no clustering)

#### Version 2: DNN (CV score: 0.1472, test score: 0.1302)

Still fine-tuning this -- clearly I'm overfitting somewhat. 

#### Version 3: 50/50 ensemble of LGBM and DNN (test score: 0.1377)
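A sketch of one way to form a 50/50 blend. Standardizing each model's predictions before averaging is an assumption on my part, not necessarily what was done for this submission. It matters because a correlation score ignores scale, so a plain average of predictions on different scales is dominated by whichever model has the larger spread. The prediction vectors below are synthetic.

```python
import numpy as np

def zscore(x):
    """Standardize a vector to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype='float64')
    return (x - x.mean()) / x.std()

def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction vectors after standardizing each."""
    return weight_a * zscore(pred_a) + (1 - weight_a) * zscore(pred_b)

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical predictions: equally good models on very different scales
rng = np.random.default_rng(0)
signal = rng.normal(size=5000)
pred_lgbm = signal + rng.normal(size=5000)
pred_dnn = 100 * (signal + rng.normal(size=5000))

naive = 0.5 * pred_lgbm + 0.5 * pred_dnn
blended = blend(pred_lgbm, pred_dnn)
print('naive average :', round(corr(naive, signal), 3))
print('z-scored blend:', round(corr(blended, signal), 3))
```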

# Discussion

I feel pretty good about the score of 0.138 -- I'd note that the leaderboard is almost entirely scores of 0.15, but I've looked at the code and these models are all way overfitted to the leaderboard data because they use `time_id` and `investment_id` extensively, e.g. directly using either as a feature or including lags or clusters of investments. I'm not *sure*, but my belief is that these models will be way off during the evaluation phase due to the increased time-frequency of observations already discussed as well as the shuffled set of `investment_id`s. 

There are only two ways I think my score could be substantively improved while avoiding any of the above issues:
1. Find a better way to cluster the features
2. Use `investment_id` in some conservative way

(1) might work if I can infer the "speed" of the individual features, e.g. whether they're 30-day rolling averages versus intraday volatilities, and then include relevant statistics as features in the LGBM/DNN ensemble -- this might allow for a little more separation between long-term and short-term signals. I really doubt this is going to do much, though -- surely my previous attempt at clustering features would have picked up on this somewhat, but in practice it contributed nothing. 

(2) might allow me to cluster some `investment_id`s that are sure to show up, so that I'm no longer throwing away signal in the correlations between investments -- e.g. right now, I don't have anything that says "when some tech stocks go up, then the others are likely to go up as well." 

More on (2), which seems most promising (if somewhat risky):

I'm going to remake the plot from earlier in this notebook that shows the coverage of `time_id` by `investment_id`, but this time I'll plot also the share of the `time_id`s that are present in between the first and last ones. I want to see how common it is for an investment to virtually always be in the data, because then (2) could be based solely on those, which I feel comfortable doing.

You can see straightaway in the plot below that very, very few values of `investment_id` have full time-coverage. So there won't be a simple "core" set of investments that we can use as the bedrock for a clustering strategy. 

I think the most principled approach would be something like this:
1. Pull out the, say, top 1/2 of `investment_id` by time coverage
2. Cluster those somehow
3. Re-run LGBM/DNN models with statistics derived from those clusters, where we've first imputed clusters to "new" `investment_id`s either randomly or by some rule based on the observed values of their features
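The three steps above could be sketched as follows. KMeans on each investment's average feature vector stands in for "cluster those somehow", and nearest-centroid assignment stands in for the imputation rule; both are placeholders I picked for illustration, as are the function names and the toy panel.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def fit_investment_clusters(panel, feature_cols, n_clusters=3, top_share=0.5, seed=0):
    """Cluster the best-covered investments on their average feature profile."""
    counts = panel.groupby('investment_id').size()
    # Step 1: keep the investments with the most observed time periods
    n_core = max(n_clusters, int(len(counts) * top_share))
    core_ids = counts.sort_values(ascending=False).index[:n_core]
    # Step 2: cluster the core investments
    core = panel[panel['investment_id'].isin(core_ids)]
    profiles = core.groupby('investment_id')[feature_cols].mean()
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(profiles.to_numpy())

def assign_clusters(model, panel, feature_cols):
    """Step 3: impute a cluster to every investment from its observed features."""
    profiles = panel.groupby('investment_id')[feature_cols].mean()
    return pd.Series(model.predict(profiles.to_numpy()), index=profiles.index, name='cluster')

# Hypothetical toy panel: 30 investments drawn from 3 well-separated groups
rng = np.random.default_rng(0)
rows = []
for inv in range(30):
    centre = [-5.0, 0.0, 5.0][inv % 3]
    feats = rng.normal(centre, 0.5, size=(20 + inv, 2))   # coverage grows with inv
    rows.append(pd.DataFrame({'investment_id': inv, 'f_0': feats[:, 0], 'f_1': feats[:, 1]}))
panel = pd.concat(rows, ignore_index=True)

model = fit_investment_clusters(panel, ['f_0', 'f_1'])
labels = assign_clusters(model, panel, ['f_0', 'f_1'])
print(labels.value_counts())
```

Here the 15 sparsest investments never enter the fit but still receive a cluster, which is the behaviour we'd need for unseen `investment_id`s in the evaluation set.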

# To-do

1. Fine-tune the neural net further
2. Implement some of the above clustering techniques
3. Train an autoencoder to derive new features
4. Reduce prediction variance by training final models on three different seeds and averaging the results
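Item 4 can be sketched generically. `noisy_model` below is a made-up stand-in for "train a model with this seed and predict", so the numbers only illustrate the variance reduction, not the behaviour of the real LGBM or DNN.

```python
import numpy as np

def average_over_seeds(fit_predict, seeds=(0, 1, 2)):
    """Average the predictions returned by fit_predict(seed) over several seeds."""
    return np.mean([fit_predict(seed) for seed in seeds], axis=0)

signal = np.linspace(-1, 1, 10_000)

def noisy_model(seed):
    # Hypothetical stand-in: the true signal plus seed-dependent noise
    return signal + np.random.default_rng(seed).normal(scale=1.0, size=signal.size)

single = noisy_model(0)
averaged = average_over_seeds(noisy_model)

print('single-seed MSE:', round(float(np.mean((single - signal) ** 2)), 3))
print('3-seed MSE     :', round(float(np.mean((averaged - signal) ** 2)), 3))
```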

In [None]:

temp = (train
        .reset_index()
        .groupby('investment_id')['time_id']
        .aggregate(['min', 'max', 'count'])
        .sort_values(['max', 'min', 'count']))

temp['y']     = np.arange(len(temp))/len(temp)*100
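# 'share' is expressed in time_id units: the time_id an investment would reach
# if its observed periods ran contiguously from its first time_id
temp['share'] = temp['min'] + temp['count']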

ax1 = (temp
      .plot
      .scatter(y='y',
               x='max',
               figsize = (20, 10), 
               s=0.3, 
               c='blue', 
               title='Last time_id observed (blue) versus first time_id observed (red), along with share of time_id coverage (green), by investment_id', 
               xlabel='time_id',
               ylabel='Percentile of investments'))

ax2 = (temp
      .plot
      .scatter(y='y',
               x='share',
               figsize = (20, 10), 
               s=0.3, 
               c='green',
               ax=ax1,
               xlabel='time_id',
               ylabel='Percentile of investments'))

(temp
 .plot
 .scatter(y='y',
          x='min', 
          s=0.3,  
          c='red', 
          ax=ax2, 
          xlabel='time_id',
          ylabel='Percentile of investments'))

# Submission notebook

[www.kaggle.com/danielreuter/ubiquant-model-testing](https://www.kaggle.com/danielreuter/ubiquant-model-testing)