# Labeling: Tail Sets

## Abstract

Tail set labels are a classification labeling technique introduced in the following paper: [Huerta, R., Corbacho, F. and
Elkan, C., 2013. Nonlinear support vector machines can systematically identify stocks with high and low future returns.
Algorithmic Finance, 2(1), pp.45-58.](https://content.iospress.com/download/algorithmic-finance/af016?id=algorithmic-finance%2Faf016)

A tail set is defined to be a group of assets whose volatility-adjusted price change is in the highest or lowest
quantile, for example the highest or lowest 5%.

A classification model is then fit using these labels to determine which stocks to buy and sell, for a long / short
portfolio.

## How it works

We label the y variable using the tail set labeling technique, which makes up the positive and negative (1, -1) classes
of the training data. The original paper investigates the performance of 3 types of metrics on which the tail sets are
built:

1. Real returns
2. Residual alpha after regression on the sector index
3. Volatility-adjusted returns

For our particular implementation, we have focused on the volatility-adjusted returns.

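An input DataFrame of prices is converted to returns, which can then be volatility-adjusted. The formula for the volatility-adjusted return is:

$$r(t - t', t) = \frac{R(t-t',t)}{vol(t)}$$

Here $R(t - t', t)$ is the return of the asset from time $t - t'$ to time $t$, and $vol(t)$ is the volatility estimate at time $t$.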

We provide two volatility estimators: the exponential moving average of the mean absolute returns, and the traditional standard deviation. The paper suggests a 180-day window.
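The two volatility estimators can be sketched with plain pandas as follows. This is a minimal illustration, not the library's implementation: the function name `vol_adjusted_returns`, the `method` values, the use of `window` as the EMA span, and the one-period return horizon are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def vol_adjusted_returns(prices, window=180, method="mean_abs_dev"):
    """Hypothetical helper: one-period returns divided by a volatility estimate."""
    # Simple one-period returns; the first row is NaN.
    rets = prices / prices.shift(1) - 1

    if method == "mean_abs_dev":
        # Exponential moving average of the absolute returns
        # (assumption: the window is used as the EMA span).
        vol = rets.abs().ewm(span=window, min_periods=window).mean()
    elif method == "stdev":
        # Traditional rolling standard deviation.
        vol = rets.rolling(window).std()
    else:
        raise ValueError(f"Unknown method: {method}")

    return rets / vol


# Synthetic prices for three made-up tickers, with a short window for illustration.
rng = np.random.default_rng(0)
log_steps = rng.normal(0, 0.01, size=(30, 3))
prices = pd.DataFrame(100 * np.exp(np.cumsum(log_steps, axis=0)),
                      columns=["AAA", "BBB", "CCC"])

adj_mad = vol_adjusted_returns(prices, window=5, method="mean_abs_dev")
adj_std = vol_adjusted_returns(prices, window=5, method="stdev")
```

With either estimator, the first `window` rows are `NaN` because the volatility estimate is not yet available, which is why the example output further down only starts after a warm-up period.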

The volatility-adjusted return of each stock is assigned to a quantile relative to the other returns in the same row, i.e. at the same timestamp. The top and bottom quantiles are then labeled as the positive and negative class, respectively.
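The row-wise quantile assignment can be illustrated with `pd.qcut`. The sketch below is an assumption about how such a step might look, not the library's code: `label_row` is a hypothetical helper, and ranking with `method="first"` is a choice made here to break ties so that the bins have equal size.

```python
import pandas as pd


def label_row(row, n_bins):
    """Hypothetical helper: label one timestamp's returns as 1, -1 or 0."""
    labels = pd.Series(0, index=row.index, dtype=int)
    valid = row.dropna()
    if len(valid) < n_bins:
        # Not enough observations to form the requested number of bins.
        return labels

    # Rank first so ties are broken deterministically, then cut into equal bins.
    bins = pd.qcut(valid.rank(method="first"), q=n_bins, labels=False)
    labels.loc[bins[bins == 0].index] = -1          # bottom quantile
    labels.loc[bins[bins == n_bins - 1].index] = 1  # top quantile
    return labels


# Made-up volatility-adjusted returns for ten assets at one timestamp.
row = pd.Series([0.3, -1.2, 2.5, 0.0, 1.1, -0.4, 0.8, -2.0, 1.9, -0.1],
                index=list("ABCDEFGHIJ"))
labels = label_row(row, n_bins=5)

# The same helper applied to every row of a DataFrame.
frame = pd.DataFrame([row, -row])
label_matrix = frame.apply(label_row, axis=1, n_bins=5)
```

With ten assets and five bins, each tail holds two assets: here `C` and `I` are labeled 1, `B` and `H` are labeled -1, and the rest are 0.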

## How to use these labels in practice?

The tail set labeling code returns, for each timestamp, the names of the assets which should be labeled with a positive or
negative label. It's important to note that the model you would develop is a many-to-one model, in that it has many
x variables and only one y variable. The model is a binary classifier.
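One way to turn a per-day label table into the single y variable of a binary classifier is sketched below. The toy matrix, the asset names, and the choice to drop the unlabeled (0) entries are assumptions for illustration; the features would need to be aligned to the same (date, asset) index.

```python
import pandas as pd

# Toy label matrix (rows: dates, columns: assets, values: 1, -1 or 0).
# The tickers and values are made up.
label_matrix = pd.DataFrame(
    {"AAA": [1, 0, -1], "BBB": [0, 0, 1], "CCC": [-1, 1, 0]},
    index=pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-08"]).rename("Date"),
)

# Long format: one row per (date, asset) pair.
long_labels = label_matrix.reset_index().melt(
    id_vars="Date", var_name="asset", value_name="label"
)

# Keep only tail-set members so that y is binary (1 or -1).
y = (
    long_labels[long_labels["label"] != 0]
    .set_index(["Date", "asset"])["label"]
    .sort_index()
)
```

Each remaining row of `y` is one training sample; the middle-of-the-distribution assets are simply left out of training in this sketch.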

The model is trained on the training data and then used to score every security in the test data (on a given day).
For example, suppose the strategy needs to rebalance its positions on December 1st, 2019. We score all 100 securities in
our tradable universe and rank the outputs from highest to lowest. We form a long / short portfolio by going long the top
10 stocks and short the bottom 10 (equally weighted), then hold the positions until the next rebalance date.
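The rebalancing step described above can be sketched as follows. The helper `long_short_weights`, the random scores, and the security names are hypothetical, and scaling each side to sum to 1 and -1 is one possible reading of "equally weighted".

```python
import numpy as np
import pandas as pd


def long_short_weights(scores, n_long=10, n_short=10):
    """Hypothetical helper: equal-weight long/short weights from model scores."""
    if n_long + n_short > len(scores):
        raise ValueError("Long and short legs would overlap.")

    # Rank the scores from highest to lowest.
    ranked = scores.sort_values(ascending=False)

    weights = pd.Series(0.0, index=scores.index)
    weights.loc[ranked.index[:n_long]] = 1.0 / n_long      # long the top names
    weights.loc[ranked.index[-n_short:]] = -1.0 / n_short  # short the bottom names
    return weights


# Made-up classifier scores for a 100-security universe on the rebalance date.
rng = np.random.default_rng(42)
scores = pd.Series(rng.normal(size=100),
                   index=[f"SEC{i:03d}" for i in range(100)])

weights = long_short_weights(scores, n_long=10, n_short=10)
```

The resulting weights are held unchanged until the next rebalance date, at which point the universe is scored and ranked again.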

---
## Examples of use

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import yfinance as yf

from mlfinlab.labeling import TailSetLabels

In [2]:
# Load price data for 20 stocks
tickers = "AAPL MSFT COST PFE SYY F GE BABA AMD CCL ZM FB WFC JPM NVDA CVX TWTR ACI GPS KO"

data = yf.download(tickers, start="2019-01-20", end="2020-05-25", group_by="ticker")
data = data.loc[:, (slice(None), 'Adj Close')]
data.columns = data.columns.droplevel(1)
data.head()

[*********************100%***********************]  20 of 20 completed


Date,ACI,AMD,ZM,FB,AAPL,COST,BABA,NVDA,GPS,TWTR,GE,PFE,JPM,CCL,CVX,F,SYY,KO,MSFT,WFC
2019-01-22,5100.0,19.76,,147.570007,150.266403,209.413116,152.149994,148.035126,22.838694,32.25,8.272988,39.946537,98.963676,51.750248,105.162872,7.837517,60.661575,45.439438,103.568062,46.484818
2019-01-23,5100.0,19.799999,,144.300003,150.87413,209.107468,152.029999,148.552521,23.241623,30.969999,8.33986,39.842587,98.71373,51.512947,104.273567,7.689988,60.884239,45.96315,104.577469,46.727222
2019-01-24,5100.0,20.85,,145.830002,149.678253,207.352509,155.860001,157.060287,23.122576,31.610001,8.387626,38.699097,98.7714,52.224846,106.258125,7.929724,60.535717,45.410866,104.077667,46.596699
2019-01-25,5100.0,21.93,,149.009995,154.638153,206.129929,159.210007,159.358887,23.516348,32.900002,8.750644,38.406132,99.396294,52.689953,105.986641,8.169458,60.041988,45.106155,105.028282,46.736546
2019-01-28,5100.0,20.18,,147.470001,153.207047,207.806046,158.919998,137.328262,23.598763,33.130001,8.530922,37.357147,99.867378,53.53474,105.003731,7.985046,60.284012,44.91571,102.980049,46.447533


In [3]:
# Create tail set labels with mean absolute deviation as the volatility adjustment.
labels = TailSetLabels(data, n_bins=10, vol_adj='mean_abs_dev', window=180)
pos_set, neg_set, matrix_set = labels.get_tail_sets()

In [4]:
# Get the positive set: the assets in the top 10% of returns for each day.
pos_set.head()

Date
2020-01-06      [ZM, GPS]
2020-01-07     [ZM, TWTR]
2020-01-08    [SYY, MSFT]
2020-01-09     [COST, KO]
2020-01-10     [GPS, PFE]
dtype: object

In [5]:
# Get the negative set: the assets in the lowest 10% of returns for each day.
neg_set.head()

Date
2020-01-06    [CCL, WFC]
2020-01-07    [JPM, CVX]
2020-01-08     [GE, CVX]
2020-01-09    [GPS, PFE]
2020-01-10     [GE, JPM]
dtype: object

In [6]:
# All labels for each day: 1 (positive set), -1 (negative set), or 0 (neither).
matrix_set.head()

Date,ACI,AMD,ZM,FB,AAPL,COST,BABA,NVDA,GPS,TWTR,GE,PFE,JPM,CCL,CVX,F,SYY,KO,MSFT,WFC
2020-01-06,0,0,1,0,0,0,0,0,1,0,0,0,0,-1,0,0,0,0,0,-1
2020-01-07,0,0,1,0,0,0,0,0,0,1,0,0,-1,0,-1,0,0,0,0,0
2020-01-08,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,-1,0,1,0,1,0
2020-01-09,0,0,0,0,0,1,0,0,-1,0,0,-1,0,0,0,0,0,1,0,0
2020-01-10,0,0,0,0,0,0,0,0,1,0,-1,1,-1,0,0,0,0,0,0,0


In [7]:
# Inspect the volatility-adjusted returns that the labels are based on.
labels.vol_adj_rets.dropna().head()

Date,ACI,AMD,ZM,FB,AAPL,COST,BABA,NVDA,GPS,TWTR,GE,PFE,JPM,CCL,CVX,F,SYY,KO,MSFT,WFC
2020-01-06,0.0,-0.227461,1.907438,1.680486,0.745082,0.039689,-0.121334,0.266647,2.396394,0.259407,0.905581,-0.163835,-0.09861,-2.423721,-0.42372,-0.510353,-0.217696,-0.056495,0.311705,-0.650362
2020-01-07,0.0,-0.153854,0.959547,0.196681,-0.445724,-0.23071,0.336044,0.769067,-0.021452,1.892687,-0.48104,-0.430076,-2.095482,0.246498,-1.592236,0.917607,-1.346047,-1.188172,-1.10447,-0.901826
2020-01-08,0.0,-0.467444,0.391693,0.91888,1.499143,1.652935,0.126608,0.121107,0.115619,1.048754,-0.595934,1.022932,0.950251,0.379514,-1.41571,0.0,1.894668,0.286338,1.883782,0.33165
2020-01-09,0.0,1.255253,0.04249,1.2896,1.950353,2.271257,1.276731,0.708613,-1.860596,0.348897,-0.165231,-0.56311,0.448849,0.723627,-0.201126,0.103904,0.196094,2.743687,1.471037,-0.18838
2020-01-10,0.0,-0.879143,0.286838,-0.100989,0.211654,-1.041752,0.686095,0.349014,1.082667,-0.907804,-1.331385,1.952353,-1.230105,-0.5782,-1.136807,-0.105099,0.570522,0.524137,-0.55265,-0.486054


---
## Conclusion

This notebook presents the tail sets labeling method. This method is useful for identifying the stocks with the most extreme volatility-adjusted returns within a group on a given day. The user chooses the number of quantiles, and the top and bottom quantiles are labeled as the positive and negative tail sets, respectively. These labels can be used as training data for classification. A strategy can then be adopted of going long the predicted positive tail set and short the predicted negative one.

## References

1. Huerta, R., Corbacho, F. and Elkan, C., 2013. Nonlinear support vector machines can systematically identify stocks with high and low future returns. Algorithmic Finance, 2(1), pp.45-58.