# Loading external data with yfinance, exploration with PCA / T-SNE / UMAP

In this notebook I load and explore external data.

It was initially built for the Optiver Volatility Forecasting Competition, following [The Leak](https://www.kaggle.com/c/optiver-realized-volatility-prediction/discussion/256725) discussion and reusing most of the embedding (PCA / t-SNE / UMAP) code from https://www.kaggle.com/stassl/exploring-time-id-relationships. The goal was to download and visualize embeddings based on historical daily prices of the top 100 S&P 500 stocks. My contributions were: installing and using yfinance, collecting and plotting data from a list of tickers, and building a Garman-Klass volatility estimator from OHLC data. Discussion with @stassl brought numerous updates: loading tickers from Wikipedia, downloading data in bulk, and downloading data at 1h intervals. The idea was to discuss a possible use of time embeddings. I didn't figure out how to use them myself, but they were at the core of some of the current top results.

Regarding the G-Research competition, I just realized that most cryptocurrencies are on Yahoo Finance nowadays, so I show how to use yfinance to download relevant crypto data. This data might be useful to:
 - extend the training data
 - use additional data for designing CV
 - use additional data for neutralisation

Feel free to upvote the discussion and the notebooks if you find the content interesting.

# Packages installation and import

In [None]:
%%capture
!pip install umap-learn[plot]
!pip install yfinance

In [None]:
import yfinance as yf
import glob
import umap
import umap.plot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

from joblib import Parallel, delayed
from sklearn.manifold import TSNE
from sklearn.preprocessing import minmax_scale
from sklearn.decomposition import PCA

%config InlineBackend.figure_format = 'retina'

# Stock info 

In [None]:
SPY = yf.Ticker("SPY")
SPY.info

# Stock Values - Example

In [None]:
import datetime

# get historical market data
SPY_histo = SPY.history(start="2017-01-01", end="2021-07-31")

SPY_histo.index

plt.figure(figsize=(10,10))
plt.plot(SPY_histo.index, SPY_histo['Close'])
plt.xlabel("date")
plt.ylabel("$ price")
plt.title("Stock Price")

# Aggregate price

Loading data in bulk.

In [None]:
tickers = pd.read_html('https://en.wikipedia.org/wiki/S%26P_100')[2].Symbol

In [None]:
df_prices_all = yf.download(tickers.to_list(), start='2020-01-01', interval='1h')

In [None]:
df_prices_all.head()
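
With a list of tickers, the downloaded frame has two column levels (price field, then ticker), which is why `df_prices_all.Open` in the next cell yields one column per ticker. The sketch below reproduces that layout with synthetic numbers instead of a real download, assuming the default column-grouped layout; the tickers `AAA` and `BBB` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a multi-ticker download:
# first column level = price field, second level = ticker (both made up here).
fields = ["Open", "High", "Low", "Close"]
toy_tickers = ["AAA", "BBB"]
columns = pd.MultiIndex.from_product([fields, toy_tickers])
index = pd.date_range("2020-01-02", periods=3, freq="D")

data = np.arange(1.0, 1.0 + len(index) * len(columns)).reshape(len(index), len(columns))
toy_prices = pd.DataFrame(data, index=index, columns=columns)

# Selecting a top-level field returns one column per ticker.
toy_open = toy_prices["Open"]
print(toy_open.columns.tolist())
```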

In [None]:
o = df_prices_all.Open
h = df_prices_all.High
l = df_prices_all.Low
c = df_prices_all.Close

# Garman-Klass variance estimator (one value per bar and ticker) from OHLC data
vol = 1/2 * np.square(np.log(h/l)) - (2*np.log(2)-1)*np.square(np.log(c/o))
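
To make the estimator easier to check, here is a minimal sketch that wraps the same formula in a function and applies it to two hand-made bars. The function name `garman_klass_variance` and the toy prices are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def garman_klass_variance(o, h, l, c):
    """Per-bar Garman-Klass variance estimate from open/high/low/close prices."""
    log_hl = np.log(h / l)  # high-low range term
    log_co = np.log(c / o)  # open-to-close return term
    return 0.5 * log_hl**2 - (2 * np.log(2) - 1) * log_co**2

# Two made-up bars: one with a real range, one completely flat.
toy = pd.DataFrame({
    "Open": [100.0, 101.0],
    "High": [102.0, 101.0],
    "Low": [99.0, 101.0],
    "Close": [101.0, 101.0],
})
gk = garman_klass_variance(toy.Open, toy.High, toy.Low, toy.Close)
print(gk)
```

A flat bar (open = high = low = close) gives exactly zero, which matters later because a log-scaled colour map cannot display zeros.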

In [None]:
df_prices = df_prices_all['Adj Close']

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

colnames = df_prices.columns
dates = df_prices.index

df_prices = pd.DataFrame(scaler.fit_transform(df_prices),index = dates, columns=colnames)

df_prices = df_prices.fillna(df_prices.mean())
df_prices = df_prices.dropna(axis=1, how='any')
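
The three cleaning steps above (scale each ticker to [0, 1], fill gaps with the column mean, drop tickers that are still empty) can be checked on a toy frame. This sketch uses a pandas-only min-max formula as a stand-in for `MinMaxScaler`; the columns `A` and `B` are made up, with `B` containing no data at all.

```python
import numpy as np
import pandas as pd

# Column A has one gap; column B has no data (e.g. a ticker that failed to download).
toy = pd.DataFrame({"A": [10.0, np.nan, 20.0, 30.0], "B": [np.nan] * 4})

# Min-max scaling per column; NaNs are ignored by min()/max() and stay NaN.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Fill remaining gaps with the column mean; an all-NaN column has a NaN mean, so it stays empty.
filled = scaled.fillna(scaled.mean())

# Drop the columns that are still incomplete.
clean = filled.dropna(axis=1, how="any")
print(clean)
```

Only fully empty columns are removed by the last step, because every column with at least one value has already been filled.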

# market volatility

In [None]:
df_target = (vol/vol.mean()).mean(axis=1)

In [None]:
df_target 

In [None]:
plt.plot(df_target)
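
Dividing each ticker's volatility by its own mean before averaging across tickers keeps high-volatility stocks from dominating the market-level series. A small sketch with made-up numbers, where ticker `B` is ten times more volatile than `A` but follows the same pattern:

```python
import pandas as pd

# Made-up per-ticker volatility: B is 10x larger than A at every time step.
toy_vol = pd.DataFrame({"A": [1.0, 3.0], "B": [10.0, 30.0]})

# Normalise each ticker by its own average, then average across tickers per time step.
toy_target = (toy_vol / toy_vol.mean()).mean(axis=1)
print(toy_target)
```

After normalisation both tickers contribute equally, so the result only reflects how volatile each time step is relative to the ticker's usual level.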

# PCA

In [None]:
%%time
emb = PCA(n_components=2).fit_transform(df_prices)
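
In this setup each row is a time step and each column a ticker, so the embedding has one 2-D point per time step. The sketch below illustrates this on synthetic random-walk prices (the data and column names are made up) and also prints how much variance the two components keep, which is a quick sanity check before reading anything into the scatter plot.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic random-walk "prices": 200 time steps (rows) x 5 made-up tickers (columns).
toy = pd.DataFrame(rng.normal(size=(200, 5)).cumsum(axis=0), columns=list("ABCDE"))

pca = PCA(n_components=2)
toy_emb = pca.fit_transform(toy)

# One 2-D point per time step; the ratios show the share of variance kept by each axis.
print(toy_emb.shape, pca.explained_variance_ratio_)
```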

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(emb[:, 0], emb[:, 1], s=3, c=df_target, edgecolors='none', cmap='jet', norm=mpl.colors.LogNorm());
cb = plt.colorbar(label='realized volatility', format=mpl.ticker.ScalarFormatter(),
                  ticks=mpl.ticker.LogLocator(10))
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('PCA time_id embeddings');

# TSNE

In [None]:
%%time
emb = TSNE(n_components=2, perplexity=40, learning_rate=50, verbose=1, init='pca', n_iter=2000,
           early_exaggeration=12).fit_transform(df_prices)

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(emb[:, 0], emb[:, 1], s=3, c=df_target, edgecolors='none', cmap='jet', norm=mpl.colors.LogNorm());
cb = plt.colorbar(label='realized volatility', format=mpl.ticker.ScalarFormatter(),
                  ticks=mpl.ticker.LogLocator(10))
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('TSNE time_id embeddings');

# UMAP

In [None]:
%%time
emb = umap.UMAP(n_neighbors=60, min_dist=0.1, target_metric='euclidean', init='spectral', 
                low_memory=False, verbose=True, spread=0.5, local_connectivity=1, 
                repulsion_strength=1, negative_sample_rate=5).fit_transform(df_prices)

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(emb[:, 0], emb[:, 1], s=3, c=df_target, edgecolors='none', cmap='jet', norm=mpl.colors.LogNorm());
cb = plt.colorbar(label='realized volatility', format=mpl.ticker.ScalarFormatter(),
                  ticks=mpl.ticker.LogLocator(10))
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP time_id embeddings');

# High volatility days

In [None]:
plt.figure(figsize=(10, 8))
# Flag the periods in the top 10% of market volatility (1) versus the rest (0).
# A log-scaled colour norm cannot represent 0, so use a plain linear colour scale.
is_high_vol = (df_target > df_target.quantile(0.9)).astype(int)
plt.scatter(emb[:, 0], emb[:, 1], s=3, c=is_high_vol, edgecolors='none', cmap='jet');
cb = plt.colorbar(label='volatility above 90th percentile (1 = yes)', ticks=[0, 1])
plt.title('UMAP time_id embeddings');

# Plotting times

In [None]:
import matplotlib.dates as mdates

plt.figure(figsize=(10, 8))
plt.scatter(emb[:, 0], emb[:, 1], s=3, c=[mdates.date2num(i) for i in df_prices.index], edgecolors='none', cmap='jet');

cb = plt.colorbar(label='date', format=mdates.AutoDateFormatter(mdates.MonthLocator(interval=6)))
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())

plt.title('UMAP time_id embeddings');

# Optiver - Stock 31 
Stock 31 has been problematic since the beginning of the Optiver competition. De-anonymizing the stock prices suggested that it is GE.

In [None]:
GE = yf.Ticker("GE")
GE.info

Among the returned fields, we can see:
    
'lastSplitDate': 1627862400,
'lastSplitFactor': '1:8',

In [None]:
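# The value is a Unix timestamp in seconds; convert it in UTC so the date does not depend on the local timezone
date = datetime.datetime.fromtimestamp(1627862400, tz=datetime.timezone.utc)
print("last split date =", date.strftime("%B %d, %Y"))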

GE's stock split took place between the training and test sets, which is an additional problem.

In [None]:
# get historical market data
GE_histo = GE.history(start="2017-01-01", end="2021-08-31")

GE_histo.index

plt.figure(figsize=(10,10))
plt.plot(GE_histo.index, GE_histo['Close'])
plt.xlabel("date")
plt.ylabel("$ price")
plt.title("Stock Price")

In [None]:
GE.calendar

In [None]:
GE.actions
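
To isolate splits from dividends in a table like this, one can filter on the split column. The sketch below uses a synthetic table rather than live data: the column names `Dividends` and `Stock Splits` follow what yfinance reports, the dividend figures are made up, and a 1:8 reverse split is assumed to appear as the ratio 0.125.

```python
import pandas as pd

# Synthetic actions table; only the split date and factor mirror the GE fields shown earlier.
toy_actions = pd.DataFrame(
    {"Dividends": [0.01, 0.0, 0.01], "Stock Splits": [0.0, 0.125, 0.0]},
    index=pd.to_datetime(["2021-06-25", "2021-08-02", "2021-09-24"]),
)

# Keep only the rows where a split actually happened.
toy_splits = toy_actions.loc[toy_actions["Stock Splits"] != 0, "Stock Splits"]
print(toy_splits)
```

Running the same filter over every ticker's actions would be one way to approach the "other stocks" question.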

We can see the last split, which wrecked the leaderboard (LB).
Next step: check other stocks?

# G-research CryptoCurrencies

It appears that cryptocurrencies are now on Yahoo Finance. What a time to be alive.

Missing for now:

 - Binance Coin
 - EOS.IO
 - Ethereum Classic


In [None]:
list_crypto = ['BCH-USD',
'BTC-USD',
'ETH-USD',
'LTC-USD',
'XMR-USD',
'TRX-USD',
'XLM-USD',
'ADA-USD',
'MIOTA-USD',
'MKR-USD',
'DOGE-USD']

# get crypto infos

In [None]:
BTC = yf.Ticker("BTC-USD")
BTC.info

# get prices in bulk

In [None]:
df_prices_crypto = yf.download(list_crypto, start="2021-11-04", end="2021-11-06", interval='1m')

In [None]:
df_prices_crypto

# standardised prices

In [None]:
plt.figure(figsize=(10,10))
plt.plot(df_prices_crypto.index, df_prices_crypto['Close']/df_prices_crypto['Close'].iloc[0])
plt.xlabel("date")
plt.ylabel("price relative to first observation")
plt.title("Standardised crypto prices")
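
Dividing by the first row only works if every coin has a price at the first timestamp; with 1-minute data some columns may start with a gap. A hedged variant is to divide by each column's first valid value instead. The coins `A` and `B` and their prices below are made up.

```python
import numpy as np
import pandas as pd

# Coin A has no quote at the first timestamp.
toy = pd.DataFrame({"A": [np.nan, 2.0, 4.0], "B": [10.0, 20.0, 30.0]})

# Back-fill, then take the first row: this is each column's first valid price.
base = toy.bfill().iloc[0]

# Prices relative to the first valid observation of each coin.
toy_rel = toy / base
print(toy_rel)
```

Each series then starts at 1 at its first available timestamp rather than being entirely NaN.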

# log returns

In [None]:
plt.figure(figsize=(10,10))
plt.plot(df_prices_crypto.index, np.log(df_prices_crypto['Close']/df_prices_crypto['Close'].shift()))
plt.xlabel("date")
plt.ylabel("log return")
plt.title("Crypto log returns")
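
As a quick check of the log-return formula, here is a toy series that grows by 10% at each step (the prices are made up). The first return is undefined because there is no previous price, and log returns add up over time.

```python
import numpy as np
import pandas as pd

toy_close = pd.Series([100.0, 110.0, 121.0])

# Log return between consecutive observations; the first entry is NaN.
toy_logret = np.log(toy_close / toy_close.shift())
print(toy_logret)

# The sum of the log returns equals the log of the total price ratio.
total = toy_logret.sum()
print(total, np.log(121.0 / 100.0))
```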

# crypto exploration with UMAP

In [None]:
scaler = MinMaxScaler()

df_prices_crypto = df_prices_crypto['Adj Close']

colnames = df_prices_crypto.columns
dates = df_prices_crypto.index

df_prices_crypto = pd.DataFrame(scaler.fit_transform(df_prices_crypto),index = dates, columns=colnames)

df_prices_crypto = df_prices_crypto.fillna(df_prices_crypto.mean())
df_prices_crypto = df_prices_crypto.dropna(axis=1, how='any')

emb = umap.UMAP(n_neighbors=60, min_dist=0.1, target_metric='euclidean', init='spectral', 
                low_memory=False, verbose=True, spread=0.5, local_connectivity=1, 
                repulsion_strength=1, negative_sample_rate=5).fit_transform(df_prices_crypto)

plt.figure(figsize=(10, 8))
plt.scatter(emb[:, 0], emb[:, 1], s=3, c=df_prices_crypto.mean(axis=1), edgecolors='none', cmap='jet', norm=mpl.colors.LogNorm());
cb = plt.colorbar(label='avg standardized price', format=mpl.ticker.ScalarFormatter(),
                  ticks=mpl.ticker.LogLocator(10))
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP time_id embeddings');

In [None]:
plt.figure(figsize=(10, 8))
# Colour by timestamp: convert the dates to matplotlib date numbers (a log norm is not meaningful for dates)
plt.scatter(emb[:, 0], emb[:, 1], s=3, c=mdates.date2num(dates), edgecolors='none', cmap='jet');
cb = plt.colorbar(label='time', format=mdates.DateFormatter('%m-%d %H:%M'))
plt.title('UMAP time_id embeddings');

In [None]:
scaler = MinMaxScaler()

df_ret_crypto = np.log(df_prices_crypto/df_prices_crypto.shift())
df_ret_crypto = df_ret_crypto[~np.isinf(df_ret_crypto).any(axis=1)]
df_ret_crypto = df_ret_crypto.fillna(df_ret_crypto.mean(skipna=True))

colnames = df_ret_crypto.columns
dates = df_ret_crypto.index

df_ret_crypto = pd.DataFrame(scaler.fit_transform(df_ret_crypto),index = dates, columns=colnames)

emb = umap.UMAP(n_neighbors=60, min_dist=0.1, target_metric='euclidean', init='spectral', 
                low_memory=False, verbose=True, spread=0.5, local_connectivity=1, 
                repulsion_strength=1, negative_sample_rate=5).fit_transform(df_ret_crypto)

plt.figure(figsize=(10, 8))
plt.scatter(emb[:, 0], emb[:, 1], s=3, c=df_ret_crypto.mean(axis=1), edgecolors='none', cmap='jet', norm=mpl.colors.LogNorm());
cb = plt.colorbar(label='avg standardized returns', format=mpl.ticker.ScalarFormatter(),
                  ticks=mpl.ticker.LogLocator(10))
cb.ax.yaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
plt.title('UMAP time_id embeddings');
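
The infinity filter in this cell is needed because the returns are computed on min-max scaled prices, which contain an exact zero at each column's minimum. A ratio with a zero in it produces a log return of plus or minus infinity. The sketch below shows the effect and the same filter on made-up scaled values.

```python
import numpy as np
import pandas as pd

# Made-up scaled prices; column A hits its minimum (0.0) at the second time step.
toy = pd.DataFrame({"A": [0.5, 0.0, 1.0, 0.8], "B": [0.2, 0.4, 0.4, 0.6]})

# log(0 / 0.5) = -inf and log(1 / 0) = +inf; silence the NumPy warnings for the demo.
with np.errstate(divide="ignore"):
    toy_ret = np.log(toy / toy.shift())

# Drop every row that contains an infinite value, then fill the leading NaN row with column means.
toy_clean = toy_ret[~np.isinf(toy_ret).any(axis=1)]
toy_clean = toy_clean.fillna(toy_clean.mean(skipna=True))
print(toy_clean)
```

Two rows are lost for a single zero: the step into the minimum and the step out of it.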

In [None]:
plt.figure(figsize=(10, 8))
# Colour by timestamp: convert the dates to matplotlib date numbers (a log norm is not meaningful for dates)
plt.scatter(emb[:, 0], emb[:, 1], s=3, c=mdates.date2num(dates), edgecolors='none', cmap='jet');
cb = plt.colorbar(label='time', format=mdates.DateFormatter('%m-%d %H:%M'))
plt.title('UMAP time_id embeddings');