This notebook analyses the data and looks for insights using a nonlinear embedding technique called UMAP.
It is a relatively new embedding technique that is usually fast and effective at embedding high-dimensional data.

## Other Feature Exploration / Feature engineering for Ubiquant:

- [Complete Feature Exploration](https://www.kaggle.com/lucasmorin/complete-feature-exploration)
- [Weird pattern in unique values](https://www.kaggle.com/lucasmorin/weird-patterns-in-unique-values-across-time-ids/)
- [Time x Strategy EDA](https://www.kaggle.com/lucasmorin/time-x-strategy-eda)  
- [UMAP Data Analysis & Applications](https://www.kaggle.com/lucasmorin/umap-data-analysis-applications)   
- [LB probing Notebook](https://www.kaggle.com/lucasmorin/don-t-mind-me-just-probing-the-lb)
- On-Line Feature Engineering (in progress)

# Base imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import warnings
import matplotlib as mpl
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

warnings.filterwarnings("ignore")

DEBUG = True

# Install UMAP

In [None]:
%%capture
!pip install umap-learn[plot]
!pip install yfinance

import umap
import umap.plot

Using @slawekbiel's Feather dataset: https://www.kaggle.com/slawekbiel/ubiquant-trainfeather-32-bit

# Load Data

In [None]:
%%time
train_data = pd.read_feather('../input/ubiquant-trainfeather-32-bit/train32.feather')

if DEBUG:
    train_data = train_data.sample(n=100000)

# Weird patterns in unique values
 
We have weird patterns in unique values: https://www.kaggle.com/lucasmorin/weird-patterns-in-unique-values-across-time-ids

I have asked the host about this pattern in the Q&A: https://www.kaggle.com/c/ubiquant-market-prediction/discussion/301693

In [None]:
plt.plot(np.log(train_data[['time_id','investment_id']].groupby(['time_id']).count()))
plt.plot(np.log(train_data[['time_id','f_170']].groupby(['time_id']).nunique()))
plt.show()

We can build a feature to indicate if we have this weird pattern or not.

In [None]:
weird_pattern_ind = (train_data[['time_id','f_170']].groupby('time_id').nunique()==1)
train_data['f_301'] = train_data['time_id'].map(pd.Series(weird_pattern_ind.values.flatten(),index=weird_pattern_ind.index))
plt.plot(weird_pattern_ind)

Seems like a good idea to use this indicator as a feature.

It might be useful for more advanced things like stratification (making sure each fold has enough of both patterns).
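
A minimal sketch of that stratification idea, assuming we split on `time_id` and stratify on the pattern indicator. The data below is synthetic, and `pattern_by_time`, `fold_of_time` and `share_per_fold` are hypothetical names used only for illustration. Note that a shuffled split ignores the time ordering, which may or may not be acceptable for validation here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the per-time_id indicator (synthetic values, not competition data).
rng = np.random.default_rng(0)
time_ids = np.arange(200)
pattern_by_time = pd.Series(rng.random(200) < 0.3, index=time_ids, name='f_301')

# Split on time_id (not on rows) so a time_id never straddles two folds,
# stratifying on the indicator so each fold sees both patterns.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_of_time = pd.Series(-1, index=time_ids, name='fold')
for fold, (_, valid_idx) in enumerate(skf.split(time_ids, pattern_by_time.astype(int))):
    fold_of_time.iloc[valid_idx] = fold

# Share of "weird pattern" time_ids per fold; should be close to the global share.
share_per_fold = pattern_by_time.groupby(fold_of_time).mean()
print(share_per_fold)
```

On the real data, the fold could then be attached to rows with something like `train_data['time_id'].map(fold_of_time)`.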

# All data

In [None]:
id_data = train_data[['time_id','investment_id','target','f_301']].copy()
feature_names = [c for c in train_data.columns if 'f_' in c]

scaler = RobustScaler()
train_data = scaler.fit_transform(train_data[feature_names])

In [None]:
%%time

# `metric` sets the distance in feature space; `target_metric` only applies when a target `y` is passed
emb = umap.UMAP(n_neighbors=60, min_dist=0.1, metric='euclidean', init='spectral', 
                low_memory=False, verbose=True, spread=0.5, local_connectivity=1, 
                repulsion_strength=1, negative_sample_rate=5).fit_transform(train_data)

# Color by target

In [None]:
plt.figure(figsize=(16, 9))
plt.scatter(emb[:, 0], emb[:, 1], s=3, c=id_data['target'], edgecolors='none', cmap='jet', vmin=-2,vmax=2);
cb = plt.colorbar(label='target')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

There seems to be some pattern, but one needs to clip the values to notice it (target clipped to [-2, 2]).
The pattern appears similar to what we can see when studying targets (see: https://www.kaggle.com/marketneutral/ubiquant-target-eda-pca-magic)
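
To illustrate why the clipping matters, here is a small sketch on a synthetic heavy-tailed target (a Student-t sample, which is only an assumption about the shape of the real target). The helper `central_color_span` is hypothetical: it measures how much of the colour range is used by the central 98% of the values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical heavy-tailed target, standing in for `id_data['target']`.
target_demo = rng.standard_t(df=3, size=100_000)

def central_color_span(values, vmin, vmax, q=0.01):
    """Fraction of the [vmin, vmax] colour range covered by the central (1 - 2q) mass."""
    lo, hi = np.quantile(values, [q, 1 - q])
    lo, hi = np.clip([lo, hi], vmin, vmax)
    return (hi - lo) / (vmax - vmin)

# Default scaling: the colour range spans min..max, so the bulk is squeezed into a narrow band.
span_default = central_color_span(target_demo, target_demo.min(), target_demo.max())
# Clipped scaling (vmin=-2, vmax=2): the bulk fills the whole colour range.
span_clipped = central_color_span(target_demo, -2, 2)
print(f'default: {span_default:.2f}, clipped: {span_clipped:.2f}')
```

With a few extreme values, most points share almost the same colour unless `vmin` and `vmax` are set.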

# Color by time_id

In [None]:
plt.figure(figsize=(16, 9))

plt.scatter(emb[:, 0], emb[:, 1], s=3, c=id_data['time_id'].astype('int'), edgecolors='none', cmap='jet');
cb = plt.colorbar(label='time_id')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

There are some clear outliers (in deep blue). It may be worth removing them.
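
Before dropping anything, one could check which time ids the far-away points come from. The sketch below is a rough heuristic on synthetic data: `emb_demo` and `time_id_demo` are hypothetical stand-ins for `emb` and `id_data['time_id']`, and the threshold factor is arbitrary. Distances between well-separated groups in a UMAP embedding are not strictly meaningful, so this should only be read as a quick flagging tool.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (not competition data): a dense blob plus a small
# far-away group that carries early time_ids.
emb_demo = np.vstack([rng.normal(0, 1, size=(1000, 2)),
                      rng.normal(12, 0.5, size=(20, 2))])
time_id_demo = np.concatenate([rng.integers(100, 1200, size=1000),
                               rng.integers(0, 50, size=20)])

# Distance of each point to the coordinate-wise median of the embedding.
center = np.median(emb_demo, axis=0)
dist = np.linalg.norm(emb_demo - center, axis=1)

# Robust threshold: median distance + 5 * MAD (the factor 5 is an arbitrary choice).
mad = np.median(np.abs(dist - np.median(dist)))
is_outlier = dist > np.median(dist) + 5 * mad

# Which time_ids do the flagged points come from?
summary = pd.Series(time_id_demo[is_outlier]).describe()
print(is_outlier.sum(), 'flagged points')
print(summary[['min', 'max']])
```

If the flagged points concentrate on a narrow range of time ids, removing that range is easier to justify than removing individual rows.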

# Color by investment id

In [None]:
plt.figure(figsize=(16, 9))

plt.scatter(emb[:, 0], emb[:, 1], s=3, c=id_data['investment_id'].astype('int'), edgecolors='none', cmap='jet');
cb = plt.colorbar(label='investment_id')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

Nothing obvious to me. To be updated if a clearer investment id clustering shows up.

# Color by pattern in unique value

In [None]:
plt.figure(figsize=(16, 9))

plt.scatter(emb[:, 0], emb[:, 1], s=3, c=id_data['f_301'].astype('int'), edgecolors='none', cmap='jet');
cb = plt.colorbar(label='pattern_ind')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

The weird pattern in missing values affects 30% of the features and, on average, 45% of the values in those features (see: ). It seems normal that this pattern appears clustered in UMAP.

# Embedding for each kind of pattern in unique values

In [None]:
train1 = train_data[id_data['f_301']]
train0 = train_data[~id_data['f_301']]


# `metric` sets the distance in feature space; `target_metric` only applies when a target `y` is passed
emb0 = umap.UMAP(n_neighbors=60, min_dist=0.1, metric='euclidean', init='spectral', 
                low_memory=False, verbose=True, spread=0.5, local_connectivity=1, 
                repulsion_strength=1, negative_sample_rate=5).fit_transform(train0)

In [None]:
# with all data

plt.figure(figsize=(16, 9))
# keep the target as float: casting to int would collapse the colour scale to a few integer levels
plt.scatter(emb0[:, 0], emb0[:, 1], s=3, c=id_data[~id_data['f_301']].target, edgecolors='none', cmap='jet', vmin=-2,vmax=2);
cb = plt.colorbar(label='target')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

Oddly symmetric? Is there a relationship to volatility?

In [None]:
# `metric` sets the distance in feature space; `target_metric` only applies when a target `y` is passed
emb1 = umap.UMAP(n_neighbors=60, min_dist=0.1, metric='euclidean', init='spectral', 
                low_memory=False, verbose=True, spread=0.5, local_connectivity=1, 
                repulsion_strength=1, negative_sample_rate=5).fit_transform(train1)

plt.figure(figsize=(16, 9))
# squared target as a rough volatility proxy (kept as float so the colour scale stays continuous)
plt.scatter(emb0[:, 0], emb0[:, 1], s=3, c=id_data[~id_data['f_301']].target**2, edgecolors='none', cmap='jet', vmin=0,vmax=2);
cb = plt.colorbar(label='target squared')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

In [None]:
# with unique values

plt.figure(figsize=(16, 9))
# keep the target as float: casting to int would collapse the colour scale to a few integer levels
plt.scatter(emb1[:, 0], emb1[:, 1], s=3, c=id_data[id_data['f_301']].target, edgecolors='none', cmap='jet', vmin=-2,vmax=2);
cb = plt.colorbar(label='target')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

# Time id embedding for a given investment id

Investment id 2140, see notebook: https://www.kaggle.com/lucasmorin/complete-feature-exploration-strategy-2140

In [None]:
%%time

train_data = pd.read_feather('../input/ubiquant-trainfeather-32-bit/train32.feather')

weird_pattern_ind = (train_data[['time_id','f_170']].groupby('time_id').nunique()==1)

train_data['f_301'] = train_data['time_id'].map(pd.Series(weird_pattern_ind.values.flatten(),index=weird_pattern_ind.index))


id_data = train_data[['time_id','investment_id','target','f_301']].copy()
feature_names = [c for c in train_data.columns if 'f_' in c]

investment_id_ref = 2140
train_data_iid = train_data[id_data['investment_id']==investment_id_ref].copy()

scaler = RobustScaler()

# embed the feature columns only, so that time_id, investment_id and target do not leak into the embedding
emb_iid = umap.UMAP(n_neighbors=60, min_dist=0.1, metric='euclidean', init='spectral', 
                low_memory=False, verbose=True, spread=0.5, local_connectivity=1, 
                repulsion_strength=1, negative_sample_rate=5).fit_transform(scaler.fit_transform(train_data_iid[feature_names]))

plt.figure(figsize=(10, 8))
plt.scatter(emb_iid[:, 0], emb_iid[:, 1], s=3, c=id_data[id_data['investment_id']==investment_id_ref].time_id.astype('int'), edgecolors='none', cmap='jet');
cb = plt.colorbar(label='time_id')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(emb_iid[:, 0], emb_iid[:, 1], s=3, c=id_data[id_data['investment_id']==investment_id_ref].f_301.astype('int'), edgecolors='none', cmap='jet');
cb = plt.colorbar(label='pattern ind')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(emb_iid[:, 0], emb_iid[:, 1], s=3, c=id_data[id_data['investment_id']==investment_id_ref].target, edgecolors='none', cmap='jet',vmin=-2,vmax=2);
cb = plt.colorbar(label='target')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

# Investment id embedding for a given time id

Studying a single time id, specifically time id 1214 (see: https://www.kaggle.com/lucasmorin/complete-feature-exploration-time-1214)

In [None]:
%%time

train_data = pd.read_feather('../input/ubiquant-trainfeather-32-bit/train32.feather')

weird_pattern_ind = (train_data[['time_id','f_170']].groupby('time_id').nunique()==1)

train_data['f_301'] = train_data['time_id'].map(pd.Series(weird_pattern_ind.values.flatten(),index=weird_pattern_ind.index))


id_data = train_data[['time_id','investment_id','target','f_301']].copy()
feature_names = [c for c in train_data.columns if 'f_' in c]

time_id_ref = 1214
train_data_tid = train_data[id_data['time_id']==time_id_ref].copy()

scaler = RobustScaler()

# embed the feature columns only, so that time_id, investment_id and target do not leak into the embedding
emb_iid = umap.UMAP(n_neighbors=60, min_dist=0.1, metric='euclidean', init='spectral', 
                low_memory=False, verbose=True, spread=0.5, local_connectivity=1, 
                repulsion_strength=1, negative_sample_rate=5).fit_transform(scaler.fit_transform(train_data_tid[feature_names]))

plt.figure(figsize=(10, 8))
plt.scatter(emb_iid[:, 0], emb_iid[:, 1], s=3, c=id_data[id_data['time_id']==time_id_ref].investment_id.astype('int'), edgecolors='none', cmap='jet');
cb = plt.colorbar(label='investment_id')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

One very weird pattern: one big cluster surrounded by small clusters is nothing unusual, but there is also a BIG LINE cluster... what to do with that?

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(emb_iid[:, 0], emb_iid[:, 1], s=3, c=id_data[id_data['time_id']==time_id_ref].f_301.astype('int'), edgecolors='none', cmap='jet');
cb = plt.colorbar(label='pattern ind')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(emb_iid[:, 0], emb_iid[:, 1], s=3, c=id_data[id_data['time_id']==time_id_ref].target, edgecolors='none', cmap='jet',vmin=-2,vmax=2);
cb = plt.colorbar(label='target')
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP row embeddings');

# Embeddings over time

As we saw, the embedding might change over time. Let's check different time ids in a loop.

In [None]:
%%time

train_data = pd.read_feather('../input/ubiquant-trainfeather-32-bit/train32.feather')
weird_pattern_ind = (train_data[['time_id','f_170']].groupby('time_id').nunique()==1)
train_data['f_301'] = train_data['time_id'].map(pd.Series(weird_pattern_ind.values.flatten(),index=weird_pattern_ind.index))
id_data = train_data[['time_id','investment_id','target','f_301']].copy()
feature_names = [c for c in train_data.columns if 'f_' in c]

for i in range(11):
    time_ref = i*120
    
    print('time_ref: '+str(time_ref))
    
    # reuse the per-time_id indicator computed above instead of re-running the groupby at each iteration
    print('pattern: ' + str(weird_pattern_ind.loc[time_ref, 'f_170']))
    
    train_data_tid = train_data[id_data['time_id']==time_ref].copy()

    scaler = RobustScaler()
    # embed the feature columns only, so that time_id, investment_id and target do not leak into the embedding
    emb_tid = umap.UMAP(n_neighbors=60, min_dist=0.1, metric='euclidean', init='spectral', 
                    low_memory=False, verbose=True, spread=0.5, local_connectivity=1, 
                    repulsion_strength=1, negative_sample_rate=5).fit_transform(scaler.fit_transform(train_data_tid[feature_names]))

    plt.figure(figsize=(10, 8))
    plt.scatter(emb_tid[:, 0], emb_tid[:, 1], s=3, c=id_data[id_data['time_id']==time_ref].investment_id.astype('int'), edgecolors='none', cmap='jet');
    cb = plt.colorbar(label='investment_id')
    cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
    plt.title('UMAP row embeddings');
    plt.show()

    plt.figure(figsize=(10, 8))
    # keep the target as float: casting to int would collapse the colour scale to a few integer levels
    plt.scatter(emb_tid[:, 0], emb_tid[:, 1], s=3, c=id_data[id_data['time_id']==time_ref].target, edgecolors='none', cmap='jet',vmin=-1,vmax=1);
    cb = plt.colorbar(label='target')
    cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
    plt.title('UMAP row embeddings');
    plt.show()

    plt.figure(figsize=(10, 8))
    plt.scatter(emb_tid[:, 0], emb_tid[:, 1], s=3, c=id_data[id_data['time_id']==time_ref].f_301.astype('int'), edgecolors='none', cmap='jet');
    cb = plt.colorbar(label='pattern ind')
    cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
    plt.title('UMAP row embeddings');
    plt.show()

There is definitely a pattern linked to unique values. When we have all the values, we observe some clusters outside of the main cluster. When we don't, we seem to have a main cluster and a longer embedding.