**Locating Novel Digital Commodities Within a Cluster-Driven Model for Global Commodities**

With the recent surge of institutional investment in digital commodities, i.e. cryptocurrencies, US and other regulatory commissions effectively classify such assets as commodities. Yet because these risk assets are typically priced in tandem with equities, and contrasted against US Treasury instruments, little scholarship has analyzed cryptocurrencies and digital assets as commodities in their own right, alongside sugar, timber, oil products or grains.

To justify treating Bitcoin as a commodity rather than a risk asset, one must see it as a necessary input to cross-border money exchange, e-commerce, or oil purchasing. Those who analyze cryptocurrency as a holding, using conventional valuation methods, typically find the exercise wanting, because valuation looks for underlying, fundamental value. The use case for Bitcoin and other digital commodities also leaves the analyst wondering whether they are investing in Ponzi goods: Bitcoin is used to purchase hotel rooms and, at times, yachts or pizza slices, but it remains a held good, like gold.

**Why Cluster Commodities, to Study Bitcoin (or Hogs)?**

When digital commodities are analyzed alongside oats, gold, E-mini futures and other classical commodities, the covariance of their prices against a pool of commodities can be tracked. Placing digital commodities within pools of other commodities that trade daily allows another category of analysis to emerge, in which traders simply shift from one commodity to another as economic winds change, or as opportunities justify a change of trading venue, e.g. a trend shift toward energy and away from equity, as we have seen since the start of the hot war in Ukraine.

**Using Cluster Matrices to Study Covariant, Affine Price Behaviors between Bitcoin and Other Commodity Flows**

This study samples the recent price behavior of 37 commodities, then traces their covariant, linear behavior in matrix form. Affine groups, i.e. groups of common movers, are established and presented interactively in a visual format.

I discuss the data pipeline, the data transformations needed to create this affine matrix, and the technical tools that facilitate them.

**Overview of Data Science Techniques**

The pipeline includes downloading data, introducing processing efficiencies, model building and cross validation, and cluster expression. I outline my steps as I take them, to arrive at a pricing matrix which affords the advantages listed below.

The experiment was adapted from scikit-learn's own documentation, where the techniques were applied to the US stock market. My rendition makes several departures while retaining the advantages of Varoquaux's pipeline.[1]

1. The data ingest is fast, efficient, updateable and portable. Anyone may use this code to build a working model of US-traded commodities, and add any symbols I have missed.
2. Data represent public, recently settled trades.
3. Local CPU resources are used, so that notebook memory is used efficiently and local Linux resources are leveraged.
4. Data remains available to the analyst in perpetuity, or it may be rebuilt using updated daily trade series.
5. Data is built as a time series in the OHLC format, which records the Open, High, Low and Close prices.
6. Clustering is aimed toward predictive use, and clusters may reach whatever size is needed to group affine, covariant items.
7. Every commodity under consideration is measured for covariance against every other, to locate products that trade in the same linear way.
8. Sparse Inverse Covariance is the technique used to identify relationships between every item in the matrix, and thus expose clusters of products that trade similarly. The result is a list of connected items, each trading conditionally upon the others. Thus the list is a usable, probable list of items which trade in the same way over a week of US business.
9. An edge model exposes the borders for classification, and locates clusters at its discretion. Thus, no supervised limits are imposed on cluster formation.
10. Hyperparameters are determined via a search with a predetermined number of folds, where each subset is used to locate model parameters, which are averaged at the close of the run.
11. Given the large volume of collinear features, a cross-validation technique is used to 'lasso' model features.
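
The idea behind item 8 can be illustrated with a small sketch. It uses a made-up three-item chain (A linked to B, B linked to C) rather than market data, and the matrix values are assumptions chosen only for illustration. It shows that the covariance matrix can be dense while its inverse (the precision matrix) is sparse: a zero in the precision matrix means two items are conditionally independent given the rest.

```python
import numpy as np

# Hypothetical precision (inverse covariance) matrix for a chain A - B - C.
# The zero at [0, 2] encodes "A and C are unrelated once B is accounted for".
precision = np.array([
    [ 2.0, -1.0,  0.0],
    [-1.0,  2.0, -1.0],
    [ 0.0, -1.0,  2.0],
])

# The matching covariance matrix is dense: A and C still co-vary, through B.
covariance = np.linalg.inv(precision)

# Partial correlations are read off the precision matrix:
# rho_ij = -P_ij / sqrt(P_ii * P_jj)
d = np.sqrt(np.diag(precision))
partial_corr = -precision / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

print("covariance:\n", covariance)
print("partial correlations:\n", partial_corr)
```

A sparse estimator such as the Graphical Lasso pushes weak partial correlations to exactly zero, which is what leaves a short list of direct connections for each item.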

**Building the Data Science Environment for Linux and Python**

Use the following commands to install packages into your underlying Linux environment. They do not need to be commented out, but they must be run again each time a new kernel boots in your notebook.

In [1]:
!pip install yfinance
!pip install vega_datasets

Collecting yfinance
  Downloading yfinance-0.1.86-py2.py3-none-any.whl (29 kB)
Collecting lxml>=4.5.1
  Downloading lxml-4.9.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.9 MB)
     |████████████████████████████████| 6.9 MB 62.0 MB/s            
Collecting multitasking>=0.0.7
  Downloading multitasking-0.0.11-py3-none-any.whl (8.5 kB)
Installing collected packages: multitasking, lxml, yfinance
  Attempting uninstall: lxml
    Found existing installation: lxml 4.4.1
    Uninstalling lxml-4.4.1:
      Successfully uninstalled lxml-4.4.1
Successfully installed lxml-4.9.1 multitasking-0.0.11 yfinance-0.1.86
Collecting vega_datasets
  Downloading vega_datasets-0.9.0-py3-none-any.whl (210 kB)
     |████████████████████████████████| 210 kB 43.4 MB/s            
Installing collected packages: vega-datasets
Successfully installed vega-datasets-0.9.0


**Data Ingest from Public Markets**

The free, common Yahoo Finance API is used to download data for all the commodities you wish to study. This data will be stored persistently next to your notebook in common environments such as Binder.

Please note that if you deploy this notebook in Google Colab, the 37+ downloaded files will be erased between uses, but they can be rebuilt easily each time you run this notebook.

The data you download becomes permanently usable, and the ingest request below can be customized in order to grab more, or less data and at different intervals.[2]

I have included several exceptions to the download and renaming technique, in order to tolerate commodities with differing ticker symbols. 

In [2]:
import yfinance as yf

def ifs(symbol):
    """Download 7 days of 1-minute bars for one symbol and save them as CSV."""
    # Map the local key (e.g. 'clf') to a Yahoo futures ticker (e.g. 'CL=F').
    # 'gff' and 'zff' are special-cased: the generic rule inserts '=' before
    # the first 'F', which would yield 'G=FF' and 'Z=FF'.
    if symbol == 'gff':
        symbol = 'GFF'
        ticker = "GF=F"
    elif symbol == 'zff':
        symbol = 'ZFF'
        ticker = "ZF=F"
    else:
        symbol = symbol.upper()
        ticker = symbol.replace("F", "=F", 1)
    print(ticker)
    data = yf.download(
        tickers = ticker,
        period = "7d",
        interval = "1m",
        group_by = 'ticker',
        auto_adjust = True,
        prepost = True,
        threads = True,
        proxy = None
    )
    # The file is named after the upper-cased symbol, with no extension.
    data.to_csv(symbol)
#!ls  # list the downloaded files (Jupyter only)
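
The special cases exist because inserting `=` before the first `F` breaks any key whose root itself contains an `F`. One way to avoid the exceptions, sketched here under the assumption that every key is a futures root followed by a trailing `f`, is to rewrite only the last character. The helper name `to_yahoo_ticker` is hypothetical and is not part of yfinance.

```python
def to_yahoo_ticker(key: str) -> str:
    """Turn a local key such as 'clf' into a Yahoo futures ticker such as 'CL=F'.

    Hypothetical helper: assumes the key is a futures root plus a trailing 'f'.
    """
    symbol = key.upper()
    if len(symbol) < 2 or not symbol.endswith("F"):
        raise ValueError(f"expected a key ending in 'f', got {key!r}")
    # Replace only the final 'F', so roots containing 'F' (GF, ZF) stay intact.
    return symbol[:-1] + "=F"


for key in ("clf", "gff", "zff", "btcf"):
    print(key, "->", to_yahoo_ticker(key))
```

Because only the final character is rewritten, `gff` and `zff` need no separate branch.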

**Trigger Data Downloads**

The following code customizes the commodities under investigation. In order to compare every commodity's price history against the rest in your matrix, each data capture is truncated to the length of the smallest data set.

The volatility of every price tick is calculated as the close price minus the open price.
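
The truncation and the close-minus-open step can be sketched on toy numbers. The three series below are invented and deliberately unequal in length; the symbol names are placeholders, not real tickers.

```python
import numpy as np

# Hypothetical open and close prices for three instruments of unequal length.
opens = {
    "aaa": [10.0, 10.5, 10.2, 10.8],
    "bbb": [20.0, 19.5, 19.9],
    "ccc": [5.0, 5.1, 5.3, 5.2, 5.4],
}
closes = {
    "aaa": [10.5, 10.2, 10.8, 11.0],
    "bbb": [19.5, 19.9, 20.4],
    "ccc": [5.1, 5.3, 5.2, 5.4, 5.5],
}

# Every series is cut to the length of the shortest one.
n = min(len(series) for series in opens.values())
symbols = sorted(opens)

# One row per instrument, one column per tick.
open_prices = np.vstack([opens[s][:n] for s in symbols])
close_prices = np.vstack([closes[s][:n] for s in symbols])

# Per-tick variation: close minus open.
variation = close_prices - open_prices
print(variation)
```

Stacking one row per instrument gives a rectangular array, which is what allows every series to be compared against every other.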

In [3]:
# Read the CSV captured for each commodity, truncate every series to a
# common length, then build arrays of open and close prices.
import numpy as np
import pandas as pd

symbol_dict = {"clf":"crude oil", "esf":"E-Mini S&P 500","btcf":"Bitcoin","bzf":"Brent Crude Oil", "ccf":"Cocoa","ctf":"Cotton","gcf":"Gold",
           "gff":"Feeder Cattle", "hef":"Lean Hogs","hgf":"Copper","hof":"Heating Oil","kcf":"Coffee","kef":"KC HRW Wheat",
           "lbsf":"Lumber","lef":"Live Cattle","mgcf":"Micro Gold","ngf":"Natural Gas","nqf":"Nasdaq 100","ojf":"Orange Juice","paf":"Palladium","plf":"Chicago Ethanol (Platts)",
            "rbf":"RBOB Gasoline","rtyf":"E-mini Russell 2000","sbf":"Sugar #11","sif":"Silver","silf":"Micro Silver","ymf":"Mini Dow Jones Indus","zbf":"U.S. Treasury Bond Futures",
            "zcf":"Corn","zff":"Five-Year US Treasury Note","zlf":"Soybean Oil Futures","zmf":"Soybean Meal","znf":"10-Year T-Note","zof":"Oat Futures","zrf":"Rough Rice",
            "zsf":"Soybean","ztf":"2-Year T-Note"}

sym, names = np.array(sorted(symbol_dict.items())).T

for i in sym:    # build one CSV per symbol; the files appear next to your notebook
    ifs(i)

# Find the length of the shortest capture.
lens = []
for symbol in sym:
    t = pd.read_csv(symbol.upper())
    lens.append(t.shape[0])
mm = np.amin(lens) - 1
print("min length of data: ", mm)

# Reload each capture, keeping rows 0..mm so every series has the same length.
quotes = []
for symbol in sym:
    t = pd.read_csv(symbol.upper())
    t = t.truncate(after=mm)
    quotes.append(t)
close_prices = np.vstack([q["Close"] for q in quotes])
open_prices = np.vstack([q["Open"] for q in quotes])

# Per-tick variation: close minus open, one row per commodity.
volatility = close_prices - open_prices
      

BTC=F
[*********************100%***********************]  1 of 1 completed
BZ=F
[*********************100%***********************]  1 of 1 completed
CC=F
[*********************100%***********************]  1 of 1 completed
CL=F
[*********************100%***********************]  1 of 1 completed
CT=F
[*********************100%***********************]  1 of 1 completed
ES=F
[*********************100%***********************]  1 of 1 completed
GC=F
[*********************100%***********************]  1 of 1 completed
GF=F
[*********************100%***********************]  1 of 1 completed
HE=F
[*********************100%***********************]  1 of 1 completed
HG=F
[*********************100%***********************]  1 of 1 completed
HO=F
[*********************100%***********************]  1 of 1 completed
KC=F
[*********************100%***********************]  1 of 1 completed
KE=F
[*********************100%***********************]  1 of 1 completed
LBS=F
[*********************100%*****

**Data Format**

After downloading this store of data, open the file browser in your project, where you will see a large number of new files.

Open any one of them to see the rows of new data.


**Cross Validate for Optimal Parameters: the Lasso**

Varoquaux's pipeline involves the steps in the following two cells.

Clusters are built from an estimator called the edge model. The volatility of every tick is fed into the edge model, in order to establish each commodity's covariance with every other.

The advantage of the Graphical Lasso model is that a cross-validated regularization strength (alpha) is located, then applied when estimating the covariance used to cluster each commodity. Thus, every commodity is identified with other commodities which move in tandem over seven days. I print the grid of candidate alpha values below, and visualize it.

Depending upon the markets when you run this study, more intensive clustering may take place at either end of the spectrum. This exposes the covariance between different groups, while exposing outlier clusters. 

**Using the Interactive Graph**

Feel free to move your mouse into the graph, then scroll. This zooms in and out, and allows you to hover over data points. Each point maps to one of the alpha values under investigation.




In [4]:
from sklearn import covariance
import altair as alt

# Grid of candidate regularization strengths (alpha) for cross validation.
alphas = np.logspace(-1.5, 1, num=15)
edge_model = covariance.GraphicalLassoCV(alphas=alphas)

# Rows become ticks, columns become commodities; standardize each column.
X = volatility.copy().T
X /= X.std(axis=0)
fitted = edge_model.fit(X)

# Print the candidate alpha grid and collect it for plotting.
rows = []
print(type(fitted.alphas))
for i in range(len(fitted.alphas)):
    print(fitted.alphas[i])
    rows.append({"idx": i, "alpha": fitted.alphas[i]})

dd = pd.DataFrame(rows)
alt.Chart(dd).mark_point(filled=True, size=100).encode(
    y=alt.Y('idx'),
    x=alt.X('alpha'),tooltip=['alpha'],).properties(
        width=800,
        height=400,
        title="Candidate Alpha Values for the Graphical Lasso Model"
    ).interactive()

<class 'numpy.ndarray'>
0.03162277660168379
0.047705826961439296
0.07196856730011521
0.10857111194022041
0.16378937069540642
0.2470911227985605
0.372759372031494
0.5623413251903491
0.8483428982440722
1.279802213997954
1.9306977288832505
2.9126326549087382
4.39397056076079
6.628703161826448
10.0


**Defining Cluster Membership, by Covariant Affinity**

Clusters of covariant, affine-moving commodities are established. The groups are then passed into a dataframe so that the buckets of symbols become visible.
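
The step from per-symbol labels to buckets of symbols can be sketched with a boolean mask. The symbols and labels below are hypothetical stand-ins for what a clustering routine might return: one integer label per symbol, in the same order as the symbols.

```python
import numpy as np

# Hypothetical symbols and cluster labels (one label per symbol, same order).
symbols = np.array(["aaa", "bbb", "ccc", "ddd", "eee"])
labels = np.array([0, 1, 0, 2, 1])

# labels == i is a boolean mask that selects the members of cluster i.
clusters = {
    int(i) + 1: symbols[labels == i].tolist()
    for i in np.unique(labels)
}

for cluster_id, members in clusters.items():
    print(f"Cluster {cluster_id}: {', '.join(members)}")
```

Cluster ids are shifted by one for display only, so the first cluster reads as Cluster 1.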

In [5]:
from sklearn import cluster

# Label each symbol, by index, with a cluster id.
_, labels = cluster.affinity_propagation(edge_model.covariance_, random_state=0)
n_labels = labels.max()                             # highest cluster id
# print("names: ",names,"  symbols: ",sym)
rows = []
for i in range(n_labels + 1):
    members = np.array(sym)[labels == i]            # symbols in cluster i
    member_names = np.array(names)[labels == i]     # their readable names
    print(f"Cluster {i + 1}: {', '.join(members)}")
    rows.append({"cluster": i + 1, "names": member_names,
                 "size": len(members), "symbols": members})

# Build the frame in one step from the collected rows.
gdf = pd.DataFrame(rows, columns=["cluster", "names", "size", "symbols"])
gdf.head(15)


Cluster 1: clf, ctf, esf, gcf, hgf, nqf, ymf, zff
Cluster 2: gff, kcf, rbf
Cluster 3: hef, lef, zbf, zrf
Cluster 4: bzf, hof, ngf, zmf
Cluster 5: kef, sbf, zof
Cluster 6: lbsf, ojf, silf, zlf, znf
Cluster 7: ccf, mgcf
Cluster 8: paf, rtyf, sif
Cluster 9: zsf
Cluster 10: btcf, plf, zcf, ztf


| | cluster | names | size | symbols |
|---|---|---|---|---|
| 0 | 1 | [crude oil, Cotton, E-Mini S&P 500, Gold, Copp... | 8 | [clf, ctf, esf, gcf, hgf, nqf, ymf, zff] |
| 1 | 2 | [Feeder Cattle, Coffee, RBOB Gasoline] | 3 | [gff, kcf, rbf] |
| 2 | 3 | [Lean Hogs, Live Cattle, U.S. Treasury Bond Fu... | 4 | [hef, lef, zbf, zrf] |
| 3 | 4 | [Brent Crude Oil, Heating Oil, Natural Gas, So... | 4 | [bzf, hof, ngf, zmf] |
| 4 | 5 | [KC HRW Wheat, Sugar #11, Oat Futures] | 3 | [kef, sbf, zof] |
| 5 | 6 | [Lumber, Orange Juice, Micro Silver, Soybean O... | 5 | [lbsf, ojf, silf, zlf, znf] |
| 6 | 7 | [Cocoa, Micro Gold] | 2 | [ccf, mgcf] |
| 7 | 8 | [Palladium, E-mini Russell 2000, Silver] | 3 | [paf, rtyf, sif] |
| 8 | 9 | [Soybean] | 1 | [zsf] |
| 9 | 10 | [Bitcoin, Chicago Ethanol (Platts), Corn, 2-Ye... | 4 | [btcf, plf, zcf, ztf] |


**Visualizing Cluster and Affine Commodities, by Volatility**

The interactive graphic requires the user to hover over each dot in the scatter chart. The larger a commodity cluster, the higher its dot sits on the chart, where the user can study the members whose prices move in covariant fashion.

I have experimented with laying the text of the commodity group over the dots, but I find that the table above is most helpful in identifying markets which move in tandem, with similar price graphs. Also, as groups expand and contract, overlaying text on the chart below may prevent certain clusters from appearing. I prefer spacing them out, so as not to congest the chart.

The user is free to study where his or her chosen commodity sits in relation to other globally relevant commodities.

In [6]:
for i in gdf['cluster']:
    print("cluster ",i)
    d = gdf[gdf['cluster'].eq(i)]
    for j in d.names:
        print(j, ", ")

cluster  1
['crude oil' 'Cotton' 'E-Mini S&P 500' 'Gold' 'Copper' 'Nasdaq 100'
 'Mini Dow Jones Indus' 'Five-Year US Treasury Note'] , 
cluster  2
['Feeder Cattle' 'Coffee' 'RBOB Gasoline'] , 
cluster  3
['Lean Hogs' 'Live Cattle' 'U.S. Treasury Bond Futures' 'Rough Rice'] , 
cluster  4
['Brent Crude Oil' 'Heating Oil' 'Natural Gas' 'Soybean Meal'] , 
cluster  5
['KC HRW Wheat' 'Sugar #11' 'Oat Futures'] , 
cluster  6
['Lumber' 'Orange Juice' 'Micro Silver' 'Soybean Oil Futures'
 '10-Year T-Note'] , 
cluster  7
['Cocoa' 'Micro Gold'] , 
cluster  8
['Palladium' 'E-mini Russell 2000' 'Silver'] , 
cluster  9
['Soybean'] , 
cluster  10
['Bitcoin' 'Chicago Ethanol (Platts)' 'Corn' '2-Year T-Note'] , 


In [7]:
import altair as alt

def runCluster():
    """Scatter chart: one dot per cluster, positioned and sized by member count."""
    chart = alt.Chart(gdf).mark_circle(size=60).encode(
        x= alt.X('cluster:N'),
        y= alt.Y('size:Q'),
        color='size:Q',
        tooltip=['names'],
        size=alt.Size('size:Q')
    ).properties(
        width=800,
        height=400,
        title="37 Global Commodities, Clustered by Affine Covariance"
    ).interactive()
    return chart
runCluster()


**References**

1. Gael Varoquaux. Visualizing the Stock Market Structure. Scikit-Learn documentation pages, https://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html
2. Ran Aroussi. YFinance API documents. https://github.com/ranaroussi/yfinance
3. The Altair Charting Toolkit. https://altair-viz.github.io/index.html