# Crypto Clustering Analysis

This code combines financial Python programming with unsupervised learning to cluster cryptocurrencies based on performance in different periods, using K-Means and Principal component Analysis (PCA).

Data contains returns (price change) data of cryptocurrencies in several periods, in a CSV file.

The steps of the analysis are broken out into the following sections:

* Import the Data
* Prepare the Data
* Find the Best Value for `k` Using the Original Data
* Cluster Cryptocurrencies with K-means Using the Original Data
* Optimize Clusters with Principal Component Analysis
* Find the Best Value for `k` Using the PCA Data
* Cluster the Cryptocurrencies with K-means Using the PCA Data
* Visualize and Compare the Results

In [1]:
! pip install hvplot --quiet

In [2]:
! pip install yellowbrick --quiet

In [3]:
# Import required libraries and dependencies
import pandas as pd
import hvplot.pandas
from pathlib import Path
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
import plotly.express as px

# Additional tool to apply the elbow rule
from yellowbrick.cluster import KElbowVisualizer

import holoviews as hv
hv.extension('bokeh', 'matplotlib')

### Import the Data

This section imports the data into a new DataFrame. It follows these steps:

1. Read  the “crypto_market_data.csv” file from the Resources folder into a DataFrame, and set the cryptocurrency name as the index.

2. Generate the summary statistics, and use HvPlot to visualize the data.


In [4]:
# Load the data into a Pandas DataFrame
df_market_data = pd.read_csv("crypto_market_data.csv",index_col="coin_id")

# Display sample data
df_market_data.head()

Unnamed: 0_level_0,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
coin_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
bitcoin,1.08388,7.60278,6.57509,7.67258,-3.25185,83.5184,37.51761
ethereum,0.22392,10.38134,4.80849,0.13169,-12.8889,186.77418,101.96023
tether,-0.21173,0.04935,0.0064,-0.04237,0.28037,-0.00542,0.01954
ripple,-0.37819,-0.60926,2.24984,0.23455,-17.55245,39.53888,-16.60193
bitcoin-cash,2.90585,17.09717,14.75334,15.74903,-13.71793,21.66042,14.49384


In [5]:
# Generate summary statistics
df_market_data.describe()

Unnamed: 0,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
count,41.0,41.0,41.0,41.0,41.0,41.0,41.0
mean,-0.269686,4.497147,0.185787,1.545693,-0.094119,236.537432,347.667956
std,2.694793,6.375218,8.376939,26.344218,47.365803,435.225304,1247.842884
min,-13.52786,-6.09456,-18.1589,-34.70548,-44.82248,-0.3921,-17.56753
25%,-0.60897,0.04726,-5.02662,-10.43847,-25.90799,21.66042,0.40617
50%,-0.06341,3.29641,0.10974,-0.04237,-7.54455,83.9052,69.69195
75%,0.61209,7.60278,5.51074,4.57813,0.65726,216.17761,168.37251
max,4.84033,20.69459,24.23919,140.7957,223.06437,2227.92782,7852.0897


In [6]:
# Plot of the data in the DataFrame
df_market_data.hvplot.line(
    width=1200,
    height=600,
    rot=90,
    title='Cryptocurrency Market Data: Returns for Several Holding Periods (%)',
    ylabel= 'Return (%)',
    xlabel='Cryptocurrency Name',
    value_label='Return in the Period (%)'
)

### Prepare the Data

This section prepares the data before running the K-Means algorithm. It follows these steps:

1. Use od `StandardScaler` module from scikit-learn to normalize the CSV file data. 

2. Create a DataFrame that contains the scaled data. The column `coin_id` from the original DataFrame will be set as index.


In [8]:
# Use the `StandardScaler()` module from scikit-learn to normalize the data from the CSV file
scaled_data = StandardScaler().fit_transform(df_market_data)

In [9]:
# Create a DataFrame with the scaled data
df_market_data_scaled = pd.DataFrame(
    scaled_data,
    columns=df_market_data.columns
)

# Copy the crypto names from the original data
df_market_data_scaled["coin_id"] = df_market_data.index

# Set the coinid column as index
df_market_data_scaled = df_market_data_scaled.set_index("coin_id")

# Display sample data
df_market_data_scaled.head()

Unnamed: 0_level_0,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
coin_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
bitcoin,0.508529,0.493193,0.7722,0.23546,-0.067495,-0.355953,-0.251637
ethereum,0.185446,0.934445,0.558692,-0.054341,-0.273483,-0.115759,-0.199352
tether,0.021774,-0.706337,-0.02168,-0.06103,0.008005,-0.550247,-0.282061
ripple,-0.040764,-0.810928,0.249458,-0.050388,-0.373164,-0.458259,-0.295546
bitcoin-cash,1.193036,2.000959,1.76061,0.545842,-0.291203,-0.499848,-0.270317


In [10]:
# Plot the scaled data
df_market_data_scaled.hvplot.line(
    width=1100,
    height=400,
    rot=90,
    title='Cryptocurrency Market Data Scaled: Standarized Returns for Several Holding Periods (%)',
    ylabel= 'Standarized Return (%)',
    xlabel='Cryptocurrency Name',
    value_label='Standarized Return in the Period (%)'
)

### Find the Best Value for k Using the Original Data

In this section, we will use the elbow method to find the best value for `k`.

1. Will use a range from 1 to 11. 

2. We will plot a line chart with all the inertia values computed with the different values of `k` to visually identify the optimal value for `k`.

In [12]:
# Create a list with the number of k-values to try
# Use a range from 1 to 11
k = range(1,12)
list(k)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [13]:
# Create an empy list to store the inertia values
inertia = []

In [14]:
# use a loop to compute the inertia with each possible value of k

# "Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, 
# squaring this distance, and summing these squares across one cluster."- Codecademy Inertia is equivalent to the sum of the cluster intra variances.

# Inside the loop, we:
# 1. create a KMeans model using the loop counter for the n_clusters
# 2. fit the model to the scaled data
# 3. append the model.inertia_ to the inertia list


for i in k:
    model = KMeans(n_clusters=i)
    model.fit(df_market_data_scaled)
    inertia.append(model.inertia_)
    
inertia

  super()._check_params_vs_input(X, default_n_init=10)


AttributeError: 'NoneType' object has no attribute 'split'

In [74]:
# Create a dictionary with the data to plot the Elbow curve
dic_inertia = {'k':k, 'inertia':inertia}

# Create a DataFrame with the data to plot the Elbow curve
inertia_df = pd.DataFrame(dic_inertia)

inertia_df

Unnamed: 0,k,inertia


To find the best value it is helpful to quantify the decrease in inertia with each increment of *k*.

In [75]:
#   Quantities that help in applying elbow rule.
# Rate of decrease of the inertia
# Rate of intertia for k, with the respect to only one cluster (k=1)
inertia_df['inertia_fraction_with_respect_to_one_cluster'] = inertia_df['inertia']/inertia_df.iloc[0,1]

# Rate of decrease from one level to the next in the initial inertia (k=1). This is: inertia_rate(k)-inertia_rate(k-1)
inertia_df['inertia_rate_of_decrease'] = inertia_df['inertia_fraction_with_respect_to_one_cluster'] - inertia_df['inertia_fraction_with_respect_to_one_cluster'].shift()

inertia_df['inertia_rate_of_decrease_1'] = inertia_df['inertia_fraction_with_respect_to_one_cluster'].diff()
inertia_df


IndexError: index 0 is out of bounds for axis 0 with size 0

Based on the inertia rate of decrease, we can see that with k=4 we decrease 15% of inertia, but after that decrease stagnates. It looks like k=4 can be a good candidate. 

Let's check this visually now.

In [17]:
# Visual application of the elbow rule
# We plot a line chart with all the inertia values computed with 
# the different values of k to visually identify the optimal value for k.
original_inertia_plot=inertia_df.hvplot(
    x      = 'k',
    y      = 'inertia',
    title  = 'Standarized Performance Data: Inertia versus Number of Clusters',
    xlabel = 'Number of Clusters (k)',
    ylabel = 'Inertia (units)'
)
original_inertia_plot

NameError: name 'inertia_df' is not defined

The visual verify the previous assesment of k=4 as an optimal number of clusters.

As an additional exersice, we will also apply another metric technique called Calinski Harabarz, which will verify also our results, and can be helpful in cases the visual are not so clear.

In [18]:
! pip install PyQt5 



In [76]:
OMP_NUM_THREADS=1

In [19]:
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt

In [20]:
# Additional tool to select k applying Calinski Harabasz metric
model_used = KMeans()
visualizer = KElbowVisualizer(
    model_used, 
    k=(2,11), 
    metric='calinski_harabasz', 
    timings=False
)

visualizer.fit(df_market_data_scaled)        
visualizer.show();
plt.show()

  super()._check_params_vs_input(X, default_n_init=10)


AttributeError: 'NoneType' object has no attribute 'split'

#### Answer to our question: What is the best value for k?

Based on the elbow rule, k=4 is the best choice. This, because:
* with k=4 we get a significant inertia reduction from 123 in k=3 to 79 in k=4. After that, reductions do not improve in more than 16 inertia units.  
* Graphically, the elbow on k=4 is also notizable, because after k=4 the inertia start decreasing in a close to linear fashion.
* Finally, the KelbowVisualizer tool corroborates the qualitative prior analysis, with k=4 as the best choice.

### Cluster Cryptocurrencies with K-means Using the Original Data

In this section, we will use the K-Means algorithm with the best value for `k` found in the previous section (k=4) to cluster the cryptocurrencies according to the price changes of cryptocurrencies provided. For these purpose, we will:

1. Initialize the K-Means model with four clusters using the best value for `k`. 

2. Fit the K-Means model using the original (scaled) data.

3. Predict the clusters to group the cryptocurrencies using the original data. View the resulting array of cluster values.

4. Add a new column to the DataFrame with the original data to store the predicted clusters.

5. Create a scatter plot to display 2-week returns with 1Y-returns for each cryptocurrency.

In [21]:
# Initialize the K-Means model using the best value for k
model = KMeans(n_clusters=4)

In [77]:
# Fit the K-Means model using the scaled data
model.fit(df_market_data_scaled)

  super()._check_params_vs_input(X, default_n_init=10)


AttributeError: 'NoneType' object has no attribute 'split'

In [23]:
# Predict the clusters to group the cryptocurrencies using the scaled data
clusters = model.predict(df_market_data_scaled)

# View the resulting array of cluster values.
clusters

AttributeError: 'KMeans' object has no attribute 'cluster_centers_'

In [24]:
# Add a new column to the DataFrame with the predicted clusters
df_market_data_scaled['cluster_original'] = clusters

# Display sample data
df_market_data_scaled.head()

NameError: name 'clusters' is not defined

In [25]:
df_market_data_scaled.describe()

Unnamed: 0,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
count,41.0,41.0,41.0,41.0,41.0,41.0,41.0
mean,0.0,1.895503e-16,2.707861e-17,2.978647e-17,-5.4157220000000004e-18,-1.326852e-16,4.197185e-17
std,1.012423,1.012423,1.012423,1.012423,1.012423,1.012423,1.012423
min,-4.981042,-1.682027,-2.217108,-1.393153,-0.9560487,-0.5511464,-0.2963296
25%,-0.127467,-0.7066688,-0.6299628,-0.460558,-0.5517599,-0.4998478,-0.2817468
50%,0.077497,-0.1906843,-0.009190922,-0.06103015,-0.1592496,-0.3550537,-0.2255326
75%,0.33128,0.4931931,0.6435649,0.1165382,0.01606038,-0.0473611,-0.1454693
max,1.919812,2.572251,2.907054,5.351455,4.769913,4.63238,6.088625


In [26]:
# Scatter plot using hvPlot
# Color points mark the clusters
# crypto name are in the hover
plot_original_clusters = df_market_data_scaled.hvplot.scatter(
    x     = "price_change_percentage_14d", 
    y     = "price_change_percentage_1y", 
    by    = 'cluster_original',
    title = "Cryptocurrencies Standarized Returns. K-Mean Clusters with k=4.",
    hover_cols = 'coin_id'
)
plot_original_clusters

DataError: Supplied data does not contain specified dimensions, the following dimensions were not found: ['cluster_original']

PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html

### Optimize Clusters with Principal Component Analysis

In this section, we will perform a principal component analysis (PCA) and reduce all the features to only three principal components. For this purpose we will:

1. Create a PCA model instance and use the PCA model to reduce to three principal components. 

2. Retrieve the explained variance to determine how much information can be attributed to each principal component.

3. Determine the explained variance of the three principal components

4. Create a new DataFrame with the PCA data. 

In [27]:
# Create a PCA model instance and set `n_components=3`.
pca = PCA(n_components = 3)

# coming back to the original data
df_market_data_scaled_no_clusters = df_market_data_scaled.drop(columns='cluster_original')

KeyError: "['cluster_original'] not found in axis"

In [28]:
df_market_data_scaled_no_clusters.head()

NameError: name 'df_market_data_scaled_no_clusters' is not defined

In [29]:
# Use the PCA model with `fit_transform` to reduce to three principal components.
market_data_pca = pca.fit_transform(df_market_data_scaled_no_clusters)

# View the first five rows of the DataFrame. 
market_data_pca[0:5]

NameError: name 'df_market_data_scaled_no_clusters' is not defined

In [30]:
# Retrieve the explained variance to determine how much information can be attributed to each principal component.
pca.explained_variance_ratio_

AttributeError: 'PCA' object has no attribute 'explained_variance_ratio_'

In [31]:
# Total explained variance with 3 components
pca.explained_variance_ratio_.sum()

AttributeError: 'PCA' object has no attribute 'explained_variance_ratio_'

### Variance explained by the 3 principal components: 90%
We can see that 89.50% is the total variance explained by the three first principal components. That means that if we drop the other 4 components, we would be just losing a 10% of the variance in the data.

In [32]:
df_market_data_scaled_no_clusters.head()

NameError: name 'df_market_data_scaled_no_clusters' is not defined

In [33]:
pca.components_

AttributeError: 'PCA' object has no attribute 'components_'

### Principal components representation:
Observations regarding the more important variables in the components based on the coeficients of the linear combinations (pca.components_):

>1) The first component gives higher importance to the longer term returns (200 days and 1 year)

>2) The second component gives higher importance to the middle term returns (30 days and 60 days)

>3) The third component gives more importance to short term returns (7 days)

In [34]:
# Creating a new DataFrame with the PCA data
df_market_data_pca = pd.DataFrame(market_data_pca, columns=['PC1','PC2','PC3'])

# Copy the crypto names from the original data
df_market_data_pca['coin_id'] = df_market_data_scaled.index

# Set the coin_id column as index
df_market_data_pca = df_market_data_pca.set_index('coin_id')

# Display sample data
df_market_data_pca.head()

NameError: name 'market_data_pca' is not defined

### Find the Best Value for k Using the PCA Data

In this section, we will use the elbow method to find the best value for `k` using the PCA data, with the same steps we used with the original scaled data.

In [35]:
# Create a list with the number of k-values to try
# We use a range from 1 to 11
k = list(range(1,12))
k

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [36]:
# Create an empy list to store the inertia values
inertia_pca = []

In [37]:
# Create a for loop to compute the inertia with each possible value of k
# Inside the loop:
# 1. Create a KMeans model using the loop counter for the n_clusters
# 2. Fit the model to the data using `df_market_data_pca`
# 3. Append the model.inertia_ to the inertia list
for i in k:
    model_pca = KMeans(n_clusters = i)
    model_pca.fit(df_market_data_pca)
    inertia_pca.append(model_pca.inertia_)

inertia_pca

NameError: name 'df_market_data_pca' is not defined

In [38]:
# Create a dictionary with the data to plot the Elbow curve
inertia_pca = {'k':k, 'inertia_pca':inertia_pca}

# Create a DataFrame with the data to plot the Elbow curve
inertia_pca_df = pd.DataFrame(inertia_pca)

inertia_pca_df = inertia_pca_df.set_index('k')
inertia_pca_df

ValueError: All arrays must be of the same length

In [39]:
# Quantities that help in applying elbow rule.
# Rate of decrease of the inertia
# Rate of inertia for k, with the respect to only one cluster (k=1)
inertia_pca_df['inertia_rate_with_respect_to_one_cluster'] = inertia_pca_df['inertia_pca']/inertia_pca_df.iloc[0,0]

# Rate of decrease from one level to the next in the initial inertia (k=1). This is: inertia_rate(k)-inertia_rate(k-1)
inertia_pca_df['inertia_rate_of_decrease'] = inertia_pca_df['inertia_rate_with_respect_to_one_cluster'] - inertia_pca_df['inertia_rate_with_respect_to_one_cluster'].shift()
inertia_pca_df

NameError: name 'inertia_pca_df' is not defined

In [40]:
# Plot a line chart with all the inertia values computed with 
# the different values of k to visually identify the optimal value for k.
inertia_pca_plot = inertia_pca_df['inertia_pca'].hvplot(
    title  = 'Principal Components of the Standarized Performance: Inertia v/s Number of Clusters',
    xlabel = 'Number of Clusters (k)',
    ylabel = 'Inertia (units)'     
)

inertia_pca_plot

NameError: name 'inertia_pca_df' is not defined

We can see that again the best value for the number of clusters k is 4. It does not differ from the original data analysis. However, what does differ is the Inertia itself at that point, which is 79 with k=4 with the standarized data, and only 50 using 3 PCA. Definetily a significant improvement. Not even with k=6 we got a level of 50 for the inertia with the standarized regular data. 
  
  The reasons to choose again k=4 are similar than the ones to choose k=4 for the scalar original data:
* with k=4 we get a significant inertia reduction from 94 in k=3 to 50 in k=4. After that, reductions do not improve in more than 11 inertia units.  
* Graphically, the elbow on k=4 is also notizable, because after k=4 the inertia start decreasing in a close to linear fashion.
* Finally, the KelbowVisualizer tool corroborates the qualitative prior analysis, with k=4 as the best choice, as can be seen in plot below.

In [42]:
# Additional tool to select k applying Calinski Harabasz metric
model_2 = KMeans()
visualizer = KElbowVisualizer(
    model_2, 
    k = (2,12), 
    metric = 'calinski_harabasz', 
    timings = False
)

visualizer.fit(df_market_data_pca)        
visualizer.show();

NameError: name 'df_market_data_pca' is not defined

In [43]:
# Additional tool to select k applying inertia elbow, but k selection provided by the software instead of visually determine it ourselves
model_2 = KMeans()
visualizer = KElbowVisualizer(
    model_2, 
    k=(2,12)
)

visualizer.fit(df_market_data_pca)        
visualizer.show();

NameError: name 'df_market_data_pca' is not defined

### Cluster Cryptocurrencies with K-means Using the PCA Data

In this section, we will use the PCA data and the K-Means algorithm with the best value for `k` found in the previous section to cluster the cryptocurrencies according to the principal components.

1. Initialize the K-Means model with four clusters using the best value for `k`. 

2. Fit the K-Means model using the PCA data.

3. Predict the clusters to group the cryptocurrencies using the PCA data. View the resulting array of cluster values.

4. Add a new column to the DataFrame with the PCA data to store the predicted clusters.


In [44]:
# Initialize the K-Means model using the best value for k
k = 4
model_pca = KMeans(n_clusters=k, random_state=1)

In [45]:
# Fit the K-Means model using the PCA data
model_pca.fit(df_market_data_pca)

NameError: name 'df_market_data_pca' is not defined

In [46]:
# Predict the clusters to group the cryptocurrencies using the PCA data
cluster = model_pca.predict(df_market_data_pca)

# View the resulting array of cluster values.
cluster[0:5]

NameError: name 'df_market_data_pca' is not defined

In [47]:
# Add a new column to the DataFrame with the predicted clusters
df_market_data_pca['cluster_pca'] = cluster

# Display sample data
df_market_data_pca.head()

NameError: name 'cluster' is not defined

In [48]:
df_market_data_pca.describe()

NameError: name 'df_market_data_pca' is not defined

In [49]:
# From the original DataFrame, add the `price_change_percentage_1y` and `price_change_percentage_14d columns`.
df_market_data_pca['cluster_original']            = df_market_data_scaled['cluster_original']
df_market_data_pca['price_change_percentage_1y']  = df_market_data_scaled['price_change_percentage_1y']
df_market_data_pca['price_change_percentage_60d'] = df_market_data_scaled['price_change_percentage_60d']
df_market_data_pca['price_change_percentage_7d']  = df_market_data_scaled['price_change_percentage_7d']
df_market_data_pca['price_change_percentage_24h'] = df_market_data_scaled['price_change_percentage_24h']

df_market_data_pca

KeyError: 'cluster_original'

## COLOR OF CLUSTERS
We will arrange colors for the clusters, so in case the clusters exactly match, then the colors will also be the same (eventhough the label itself may differ)

In [50]:
# We will color the clusters accordingly to the sort in the dataframe

# sort of cluster names in cluster_pca
color_pca = list(df_market_data_pca['cluster_pca'].drop_duplicates().reset_index()['cluster_pca'])
color_pca

NameError: name 'df_market_data_pca' is not defined

In [51]:
#sort of cluster names in cluster_original, which is the one with standarized data
color_original = list(df_market_data_pca['cluster_original'].drop_duplicates().reset_index()['cluster_original'])
print(color_pca)
print(color_original)

NameError: name 'df_market_data_pca' is not defined

In [52]:
# Defining the colors in the pallete, and initializing the color list for each group of clusters
k = 4

# colors_pallette = ['green', 'blue', 'red','orange']
colors_pallette = ['orange', 'red', 'blue', 'green']

# Initialization of colors with NaN
colors_pca, colors_original = [np.nan] * k, [np.nan] *k
print(colors_pca)
print(colors_original)

[nan, nan, nan, nan]
[nan, nan, nan, nan]


In [78]:
# Defining the list of colors for clusters_pca
c = 0
for i in color_pca:
    colors_pca[i] = colors_pallette[c]
    c += 1
colors_pca

NameError: name 'color_pca' is not defined

In [54]:
# Defining the list of colors for clusters_original
c = 0
for i in color_original:
    colors_original[i] = colors_pallette[c]
    c += 1
colors_original,color_original

NameError: name 'color_original' is not defined

## Plotting the Principal Components

In [55]:
# Plotting the PCAs
color=hv.Cycle(colors_pca)
pca_longer_term_plot=df_market_data_pca.hvplot.scatter(
    x  = 'PC1',
    y  = 'PC2',
    by = 'cluster_pca',
    hover_cols='coin_id',
    title  = "Principal Components K-Means Clusters and PC1 versus PC2",
    color  =  color,
    xlabel = 'First Principal Component \n Returns (%) (PC1)',
    ylabel = 'Second Principal Component \n Returns (%) (PC2)',
    width  = 600,
    height = 500
).opts(legend_position='top')

pca_longer_term_plot

NameError: name 'df_market_data_pca' is not defined

In [56]:
# Plotting the PCAs
color = hv.Cycle(colors_pca)
pca_shorter_term_plot=df_market_data_pca.hvplot.scatter(
    x  = 'PC2',
    y  = 'PC3',
    by = 'cluster_pca',
    hover_cols='coin_id',
    title  = 'Principal Components K-Means Clusters and PC2 versus PC3',
    xlabel = 'Second Principal Component \n Returns (%) (PC2)',
    ylabel = 'Third  Principal Component \n Returns (%) (PC3)',
    color  = color,
    width  = 600,
    height = 500
).opts(legend_position='top')
pca_shorter_term_plot

NameError: name 'df_market_data_pca' is not defined

In [57]:
# Plotting the PCAs
color = hv.Cycle(colors_pca)
pca_mix_term_plot = df_market_data_pca.hvplot.scatter(
    x  = 'PC1',
    y  = 'PC3',
    by = 'cluster_pca',
    hover_cols = 'coin_id',
    title  = "Principal Components K-Means Clusters and PC1 versus PC3",
    color  =  color,
    xlabel = 'First Principal Component \n Returns (%) (PC1)',
    ylabel = 'Second Principal Component \n Returns (%) (PC3)',
    width  = 600,
    height = 500
).opts(legend_position='top')

pca_mix_term_plot

NameError: name 'df_market_data_pca' is not defined

In [58]:
# Three dimentional visualization of clusters with the three principal components
df = df_market_data_pca.reset_index()
fig = px.scatter_3d(df, x='PC1', y='PC2', z='PC3' #, size='gdpPercap', 
                   ,color='cluster_pca'
                   ,hover_data=['coin_id'],
                   title='Principal Components - KMean Clusters'
)
fig.update_layout(scene_zaxis_type="log")
fig.show()

NameError: name 'df_market_data_pca' is not defined

In [59]:
# 3D View of Crypto Clusters on a Mix of 3 Terms Standarized Returns Data
df = df_market_data_scaled.reset_index()
fig = px.scatter_3d(
                   df
                   ,x="price_change_percentage_1y"
                   ,z="price_change_percentage_7d"
                   ,y='price_change_percentage_200d' 
                   ,color='cluster_original'
                  ,hover_data=['coin_id'],
                   labels={"price_change_percentage_1y":"1Y Return (%)",
                           "price_change_percentage_200d" :"200D Return (%)",
                          "price_change_percentage_7d":"1W Return (%)"},
                   title='View of Crypto Clusters on a Mix of 3 Terms Standarized Returns Data'
   
   
)
fig.update_layout(scene_zaxis_type="log")
fig.show()


ValueError: Value of 'color' is not the name of a column in 'data_frame'. Expected one of ['coin_id', 'price_change_percentage_24h', 'price_change_percentage_7d', 'price_change_percentage_14d', 'price_change_percentage_30d', 'price_change_percentage_60d', 'price_change_percentage_200d', 'price_change_percentage_1y'] but received: cluster_original

In [60]:
# Very short term
# Scatter plot using hvPlot by setting 
# `x="price_change_percentage_24h"` and `y="price_change_percentage_7d"`. 
# Graph points are colored with the labels found using K-Means and 
# the crypto name is in the hover to identify 
# the cryptocurrency represented by each data point.
color = hv.Cycle(colors_original)
very_short_term_scaled_data_plot = df_market_data_scaled.hvplot.scatter(
    x  = "price_change_percentage_24h",
    y  = "price_change_percentage_7d",
    by = 'cluster_original',
    hover_cols = 'coin_id',
    title = 'Standarized Returns & K-Means Clusters - Very Short Term Performance',
    color = color,
    xlabel = 'Standarize 24h Return (%)',
    ylabel = 'Standarize 7d Return (%)',
    width  = 600,
    height = 500
).opts(legend_position='top')

very_short_term_scaled_data_plot

DataError: Supplied data does not contain specified dimensions, the following dimensions were not found: ['cluster_original']

PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html

In [61]:
# Short term
# Scatter plot using hvPlot by setting 
# `x="price_change_percentage_7d"` and `y="price_change_percentage_14d"`. 
color = hv.Cycle(colors_original)
short_term_scaled_data_plot = df_market_data_scaled.hvplot.scatter(
    x  = "price_change_percentage_7d",
    y  = "price_change_percentage_14d",
    by = 'cluster_original',
    hover_cols='coin_id',
    title  = 'Standarized Returns & K-Means Clusters - Short Term Performance',
    color  =  color,
    xlabel = 'Standarize 7 Days Return (%)',
    ylabel = 'Standarize 14 Days Return (%)',
    width  = 600,
    height = 500
).opts(legend_position='top')
short_term_scaled_data_plot

DataError: Supplied data does not contain specified dimensions, the following dimensions were not found: ['cluster_original']

PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html

In [62]:
# Middle Term
# Scatter plot using hvPlot by setting 
# `x="price_change_percentage_30d"` and `y="price_change_percentage_60d"`. 
color = hv.Cycle(colors_original)
middle_term_scaled_data_plot = df_market_data_scaled.hvplot.scatter(
    x  = "price_change_percentage_30d",
    y  = "price_change_percentage_60d",
    by = 'cluster_original',
    hover_cols='coin_id',
    title  = 'Standarized Returns & K-Means Clusters - Middle Term Performance',
    color  = color,
    xlabel = 'Standarize 30 Days Return (%)',
    ylabel = 'Standarize 60 Days Return (%)',
    width  = 600,
    height = 500
).opts(legend_position='top')
middle_term_scaled_data_plot

DataError: Supplied data does not contain specified dimensions, the following dimensions were not found: ['cluster_original']

PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html

In [63]:
# Longer term
# Scatter plot using hvPlot by setting 
# `x="price_change_percentage_200d"` and `y="price_change_percentage_1Yd"`. 
color = hv.Cycle(colors_original)
longer_term_scaled_data_plot = df_market_data_scaled.hvplot.scatter(
    x  = "price_change_percentage_200d",
    y  = "price_change_percentage_1y",
    by = 'cluster_original',
    hover_cols='coin_id',
    title  = 'Standarized Returns & K-Means Clusters - Longer Term Performance',
    color  =  color,
    xlabel = 'Standarize 200 Days Return (%)',
    ylabel = 'Standarize 1 Year Return (%)',
    width  = 600,
    height = 500
).opts(legend_position='top')

longer_term_scaled_data_plot

DataError: Supplied data does not contain specified dimensions, the following dimensions were not found: ['cluster_original']

PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html

In [64]:
# Mix Short/Longer terms
# Scatter plot using hvPlot by setting 
# `x="price_change_percentage_14d"` and `y="price_change_percentage_1Yd"`. 
color = hv.Cycle(colors_original)
mix_short_longer_term_scaled_data_plot = df_market_data_scaled.hvplot.scatter(
    y  = "price_change_percentage_14d",
    x  = "price_change_percentage_1y",
    by = 'cluster_original',
    hover_cols = 'coin_id',
    title  = 'Standarized Returns & K-Means Clusters - Mix Term Performance',
    color  =  color,
    ylabel = 'Standarize 14 Days Return (%)',
    xlabel = 'Standarize 1 Year Return (%)',
    width  = 600,
    height = 500
).opts(legend_position='top')

mix_short_longer_term_scaled_data_plot

DataError: Supplied data does not contain specified dimensions, the following dimensions were not found: ['cluster_original']

PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html

In [65]:
df_market_data_scaled.hvplot.scatter?

In [66]:
# Mix Middle/Longer terms
# Scatter plot using hvPlot by setting 
# `x="price_change_percentage_60d"` and `y="price_change_percentage_1Yd"`. 
color=hv.Cycle(colors_original)
mix_middle_longer_term_scaled_data_plot=df_market_data_scaled.hvplot.scatter(
    y  = "price_change_percentage_60d",
    x  = "price_change_percentage_1y",
    by = 'cluster_original',
    hover_cols = 'coin_id',
    title  = 'Standarized Returns & K-Means Clusters - Mix Term Performance',
    color  =  color,
    ylabel = 'Standarize 60 Days Return (%)',
    xlabel = 'Standarize 1 Year Return (%)',
    width  = 600,
    height = 500
).opts(legend_position='top')

mix_middle_longer_term_scaled_data_plot

DataError: Supplied data does not contain specified dimensions, the following dimensions were not found: ['cluster_original']

PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html

# Cryptocurrency K-Mean Clusters Based on Performance in Several Holding Period
## Constrast Results from Standarized Seven Period Returns Data versus Three First Return Principal Components. Both with k=4.

In this section, we visually analyze the cluster analysis results by contrasting the outcome with and without using the optimization techniques. For these, we:

1. contrast the Elbow Curve that you created to find the best value for `k` with the original and the PCA data.

2. contrast the cryptocurrencies clusters using the original and the PCA data.

In [67]:
# 1. Composite plot to contrast the Elbow curves
original_inertia_plot + inertia_pca_plot

NameError: name 'original_inertia_plot' is not defined

In [68]:
# 2. Composite plot to contrast the clusters - COLORS DISPLAY SAME CLUSTERs, but label may differ.
# 2.a First, comparative view of clusters by scatter plotting short term returns (7d & 24 hrs) for scaled data, and principal components 2&3, which
# are directed toward short/middle term returns.
very_short_term_scaled_data_plot + pca_shorter_term_plot

NameError: name 'very_short_term_scaled_data_plot' is not defined

In [69]:
# 2.b. Secondly, comparative view of clusters by scatter plotting long term returns (200d & 1Y) for scaled data, and principal components 1&2, which
# are directed toward longer term returns.
mix_middle_longer_term_scaled_data_plot+pca_longer_term_plot

NameError: name 'mix_middle_longer_term_scaled_data_plot' is not defined

In [70]:
# Mix terms performance
mix_short_longer_term_scaled_data_plot + pca_mix_term_plot

NameError: name 'mix_short_longer_term_scaled_data_plot' is not defined

## Impacts of using fewer features to cluster the data using K-Means

### First, at the level of inertia
* Using less amount of features reduces the amount of inertia. This happens because the reduction of dimensionality implies a reduction in the variance of the clustered data. In the analyzed case, we reduced the variance in 10% by using three components, which implied a reduction in the inertia in a similar amount (11% reduction from 287 to 256). This kind of reduction could have involved a reduction in optimal number of clusters. However in this case, it didn't. 
* PCA achieves a higher reduction rate of the initial inertia. That means, the components are more efficient in setting up the data for clustering. In the case of PCA, the reduction was from 256 to 50 units of inertia, which is a reduction in 80% of the initial value; whereas the reduction with standarized data was from 287 to 79 units, wich is only a 72% reduction rate. 

### Second, in terms of the clusters
* The original clusters and the pca clusters exactly match. That supports the use of dimentionality reduction. We dramatically reduced dimentions from seven to three, and still we got the same results. An impact of using fewer features is the benefit in reducing the resources needed to manage large amounts of data, without compromising optimal results.
    
* Another notizable benefit in this particular case, is about the interpretation of the principal components. It is not easy at the beggining, but after making sense of the representation involved in the components, it allows us to visualize the data, and understand the clusters with less graphs. For example, in this case, we have realized that:

>a. principal component one reflects mainly returns of longer terms, such as 1Y and 200 days returns; 

>b. principal component two reflects mostly middle term returns (60 days and 30 days returns); and

>c. principal component three describes mostly short term returns such as 7 and 14 days. 
 
    Then, we can identify the cluster representation (here by returns I mean standarized returns):
>* **red** represent cryptocurrencies with the worse performance of the set. Negative or small return in all periods (short, middle and long term). For example, vechain, ontology.
>* **orange** performs a little better than the red. Those are cryptos with moderate positive or slighly negative in the middle and short term returns (ie. bitcoin, chainlink, bitcoin-cash)
>* **blue** performs really well in the long term, but may have large drops in shorter terms (ie. Ethlend)
>* **green** performs well in the middle term side, without the large drops that a blue crypto could have. (ie. celcius-degree-token)
    
   Disclaimer: I referred to 1 year as long term because it is the longest in the data. usually 1 year is short term. In this case, I have use short term for 2 weeks or less, middle term for 30-60 days, and long term for 200-365 days.


In [71]:
# Table with principal components, and contrast of the two clusters, as well as some longer and short term returns.
df_market_data_pca_clusters_view = df_market_data_pca.sort_values(by='cluster_pca' ).reset_index()

NameError: name 'df_market_data_pca' is not defined

In [72]:
df_market_data_pca_clusters_view

NameError: name 'df_market_data_pca_clusters_view' is not defined

In [73]:
# Apply color to the output to visualize the clusters on the table

def format_color_groups(df):
    colors = colors_pca.copy()
    x = df_market_data_pca_clusters_view.copy()
    factors = list(x['cluster_pca'].unique())
    i = 0
    for factor in factors:
        style = f'color: {colors[i]}'
        x.loc[x['cluster_pca'] == factor, :] = style
        i = i+1
    return x

df_market_data_pca_clusters_view.style.apply(format_color_groups, axis=None)


NameError: name 'df_market_data_pca_clusters_view' is not defined