# Labeling: Tail Sets

Tail set labels are a classification labeling technique introduced in the following paper: [Huerta, R., Corbacho, F. and
Elkan, C., 2013. Nonlinear support vector machines can systematically identify stocks with high and low future returns.
Algorithmic Finance, 2(1), pp.45-58.](https://content.iospress.com/download/algorithmic-finance/af016?id=algorithmic-finance%2Faf016)

A tail set is defined to be a group of assets whose volatility-adjusted price change is in the highest or lowest
quantile, for example the highest or lowest 5%.

A classification model is then fit using these labels to determine which stocks to buy and sell, for a long / short
portfolio.

The tail sets supply the y variable: assets in the upper tail form the positive class (1) and assets in the lower tail
form the negative class (-1) of the training data. The original paper investigates three metrics on which the tail sets
can be built:

1. Real returns
2. Residual alpha after regression on the sector index
3. Volatility-adjusted returns

For our particular implementation, we have focused on the volatility-adjusted returns.
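
To make the volatility-adjusted variant concrete, here is a minimal sketch in plain pandas. It is not the mlfinlab implementation: the helper name `tail_sets_sketch`, the rolling standard deviation used as the volatility estimate, the 10% cutoff, and the synthetic prices are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def tail_sets_sketch(prices: pd.DataFrame, window: int = 5, quantile: float = 0.1) -> pd.DataFrame:
    """Hypothetical helper: label each asset 1 / -1 / 0 per date (not the mlfinlab API)."""
    rets = prices.pct_change()
    # Assumption: volatility is the rolling standard deviation of daily returns.
    vol = rets.rolling(window).std()
    vol_adj = (rets / vol).dropna(how="all")

    # Cross-sectional thresholds: one upper and one lower cutoff per date.
    upper = vol_adj.quantile(1 - quantile, axis=1)
    lower = vol_adj.quantile(quantile, axis=1)

    labels = pd.DataFrame(0, index=vol_adj.index, columns=vol_adj.columns)
    labels[vol_adj.ge(upper, axis=0)] = 1   # upper tail -> positive class
    labels[vol_adj.le(lower, axis=0)] = -1  # lower tail -> negative class
    return labels


# Synthetic prices for 20 made-up assets (random walks), purely for demonstration.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=30)
log_steps = rng.normal(0, 0.01, size=(30, 20))
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(log_steps, axis=0)),
    index=dates,
    columns=[f"A{i}" for i in range(20)],
)

labels = tail_sets_sketch(prices)
print(labels.head())
```

The thresholds are computed row by row, so an asset is compared only with the other assets on the same date. Dates without a full volatility window are dropped, which mirrors how the labels in the notebook only start after the look-back period has elapsed.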

In [15]:
import numpy as np
import pandas as pd

from mlfinlab.labeling import TailSetLabels

In [16]:
# Import price data
data = pd.read_csv('../Sample-Data/stock_prices.csv', index_col='Date', parse_dates=True)
data.head()

Date,EEM,EWG,TIP,EWJ,EFA,IEF,EWQ,EWU,XLB,XLE,...,XLU,EPP,FXI,VGK,VPL,SPY,TLT,BND,CSJ,DIA
2008-01-02,49.273335,35.389999,106.639999,52.919998,78.220001,87.629997,37.939999,47.759998,41.299999,79.5,...,42.09,51.173328,55.98333,74.529999,67.309998,144.929993,94.379997,77.360001,101.400002,130.630005
2008-01-03,49.716667,35.290001,107.0,53.119999,78.349998,87.809998,37.919998,48.060001,42.049999,80.440002,...,42.029999,51.293331,55.599998,74.800003,67.5,144.860001,94.25,77.459999,101.519997,130.740005
2008-01-04,48.223331,34.599998,106.970001,51.759998,76.57,88.040001,36.990002,46.919998,40.779999,77.5,...,42.349998,49.849998,54.536671,72.980003,65.769997,141.309998,94.269997,77.550003,101.650002,128.169998
2008-01-07,48.576668,34.630001,106.949997,51.439999,76.650002,88.199997,37.259998,47.060001,40.220001,77.199997,...,43.23,50.416672,56.116669,72.949997,65.650002,141.190002,94.68,77.57,101.720001,128.059998
2008-01-08,48.200001,34.389999,107.029999,51.32,76.220001,88.389999,36.970001,46.400002,39.599998,75.849998,...,43.240002,49.566669,55.326672,72.400002,65.360001,138.910004,94.57,77.650002,101.739998,125.849998


In [7]:
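# Create tail set labels.
# pos_set / neg_set: per-date lists of asset names in the positive / negative tail set.
# matrix_set: dates x assets matrix with 1 (positive), -1 (negative), or 0 (not in a tail set).
labels = TailSetLabels(data, window=180, mean_abs_dev=True)
pos_set, neg_set, matrix_set = labels.get_tail_sets()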

In [9]:
pos_set.head()

Date
2008-09-18    [EEM, EFA, FXI]
2008-09-19    [EEM, EWQ, CSJ]
2008-09-22    [TIP, LQD, EPP]
2008-09-23    [IEF, XLK, BND]
2008-09-24    [EWJ, EPP, CSJ]
dtype: object

In [10]:
neg_set.head()

Date
2008-09-18    [TIP, BND, CSJ]
2008-09-19    [IEF, XLK, TLT]
2008-09-22    [EEM, XLF, CSJ]
2008-09-23    [TIP, VGK, SPY]
2008-09-24    [TIP, XLB, LQD]
dtype: object

In [11]:
matrix_set.head()

Date,EEM,EWG,TIP,EWJ,EFA,IEF,EWQ,EWU,XLB,XLE,...,XLU,EPP,FXI,VGK,VPL,SPY,TLT,BND,CSJ,DIA
2008-09-18,1,0,-1,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,-1,-1,0
2008-09-19,1,0,0,0,0,-1,1,0,0,0,...,0,0,0,0,0,0,-1,0,1,0
2008-09-22,-1,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,-1,0
2008-09-23,0,0,-1,0,0,1,0,0,0,0,...,0,0,0,-1,0,-1,0,1,0,0
2008-09-24,0,0,-1,1,0,0,0,0,-1,0,...,0,1,0,0,0,0,0,0,1,0


In [18]:
labels.vol_adj_rets.dropna().head()

Date,EEM,EWG,TIP,EWJ,EFA,IEF,EWQ,EWU,XLB,XLE,...,XLU,EPP,FXI,VGK,VPL,SPY,TLT,BND,CSJ,DIA
2008-09-18,4.508371,3.15523,-3.111303,3.58753,3.848416,-2.785101,2.938655,3.345493,-0.02065,1.453407,...,3.251648,3.474827,5.034605,3.431751,3.472486,2.568232,-2.335827,-4.461541,-5.097691,2.696179
2008-09-19,6.105422,5.61192,1.188319,4.135636,5.57562,-4.072046,5.794786,4.375463,4.014838,3.036212,...,2.033955,3.187363,4.07429,5.525651,4.38008,2.844334,-4.859581,3.678729,21.788846,2.861168
2008-09-22,-3.401323,-2.110389,0.805733,-2.696409,-2.764349,-0.48561,-2.454268,-1.944645,-0.926679,-0.852665,...,-2.2076,0.248532,-2.401272,-1.306426,-1.494577,-1.939615,-0.016243,-0.789467,-13.15219,-2.390453
2008-09-23,-1.583086,-1.319794,-2.360897,-0.709549,-1.125958,0.0,-1.373446,-1.278954,-1.679985,-1.603999,...,-1.519144,-0.750959,-0.896277,-2.227828,-0.797525,-1.926089,-0.588874,0.947756,-0.059872,-1.581367
2008-09-24,0.574284,-0.168215,-1.131977,1.061057,0.231426,0.59836,-0.357022,-0.084261,-1.416672,0.142575,...,0.803514,0.901233,0.469463,-0.227092,0.575725,0.270343,0.673304,-0.039926,0.956826,-0.161457


## How to use these labels in practice?

The code above returns, for each date, the names of the assets that should receive a positive or a negative label.
It's important to note that the model you would develop is a many-to-one model: it has many x variables and only one
y variable. The model is a binary classifier.
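
One way to get from a label matrix to a training set for such a classifier is sketched below. The toy matrix mimics the layout of `matrix_set`, while the feature names `feature_1` and `feature_2` and their values are placeholders invented for this example. Real features would come from your own research.

```python
import pandas as pd

# Toy label matrix in the same layout as `matrix_set` (dates x assets, values 1 / 0 / -1).
matrix = pd.DataFrame(
    {"EEM": [1, 1, -1], "TIP": [-1, 0, 1], "SPY": [0, 0, 0]},
    index=pd.to_datetime(["2008-09-18", "2008-09-19", "2008-09-22"]),
)
matrix.index.name = "Date"

# Long format: one row per (date, asset) pair, keeping only assets in a tail set.
long = matrix.reset_index().melt(id_vars="Date", var_name="asset", value_name="label")
long = long[long["label"] != 0].set_index(["Date", "asset"]).sort_index()
y = long["label"]

# Hypothetical features for every (date, asset) pair; the values are placeholders.
all_pairs = pd.MultiIndex.from_product([matrix.index, matrix.columns], names=["Date", "asset"])
features = pd.DataFrame(
    {
        "feature_1": [float(i) for i in range(len(all_pairs))],
        "feature_2": [0.5 * i for i in range(len(all_pairs))],
    },
    index=all_pairs,
)

# Align the features with the labeled observations: many x columns, one y column.
X = features.reindex(y.index)
print(X.join(y))
```

Observations labeled 0 are dropped before training, so the classifier only sees the two tail sets. At prediction time the model can still score every asset, including those that were never in a tail set.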

The model is trained on the training data and then used to score every security in the test data on a given day.
Example: on December 1st, 2019, the strategy needs to rebalance its positions. We score all 100 securities in our
tradable universe and rank the outputs from highest to lowest. We form a long / short portfolio by going long the top 10
stocks and short the bottom 10 (equally weighted), then hold the positions until the next rebalance date.
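
The rebalancing step can be sketched as follows. The scores here are random numbers standing in for classifier outputs, the ticker names are made up, and `long_short_weights` is a hypothetical helper. The weighting convention (longs sum to 1, shorts sum to -1) is also an assumption, since the text above only specifies equal weighting.

```python
import numpy as np
import pandas as pd


def long_short_weights(scores: pd.Series, n_long: int = 10, n_short: int = 10) -> pd.Series:
    """Hypothetical helper: equal-weight the top and bottom names by score."""
    ranked = scores.sort_values(ascending=False)
    weights = pd.Series(0.0, index=scores.index)
    weights.loc[ranked.index[:n_long]] = 1.0 / n_long      # long the highest scores
    weights.loc[ranked.index[-n_short:]] = -1.0 / n_short  # short the lowest scores
    return weights


# Stand-in for model outputs on the rebalance date: one score per security.
rng = np.random.default_rng(42)
scores = pd.Series(rng.normal(size=100), index=[f"STOCK_{i:03d}" for i in range(100)])

weights = long_short_weights(scores)
print(weights[weights != 0].sort_values(ascending=False))
```

In practice the scores would be the classifier's decision values or predicted probabilities for the positive class, computed from features available on the rebalance date.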