# Cryptocurrency Euclidean Distance Semantic Analysis

### Authors
|    Student Name                 |    Student Number  |
|---------------------------------|--------------------|
| Raj Sandhu                      | 101111960          |
| Akaash Kapoor                   | 101112895          |
| Ali Alvi                        | 101114940          |
| Hassan Jallad                   | 101109334          |
| Areeb Ul Haq                    | 101115337          |
| Ahmad Abuoudeh                  | 101072636          |



# Libraries to Import

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

# Read In Processed Coin Datasheet

In [2]:
parent_folder = os.path.dirname(os.path.dirname(os.getcwd())) #Parent folder of the repo
data_folder = "data"
model_folder = "models"
processed_folder = "processed"
model_data_file_path = os.path.join(parent_folder, model_folder) #path to models folder
processed_data_file_path = os.path.join(parent_folder, data_folder, processed_folder)#path to processed data folder.
coin_df = pd.read_csv(os.path.join(processed_data_file_path, "coin-info.csv")) #pass the path directly so pandas opens and closes the file itself
coin_df.head() #test output of the processed data file. 

| Unnamed: 0 | Name            | Volatility | Description                                        |
|------------|-----------------|------------|----------------------------------------------------|
| 0          | iota            | 0.388529   | IOTA (IOTA or MIOTA) is a cryptocurrency token...  |
| 1          | anchor-protocol | 1.155277   | Anchor Protocol is a yield stable and attracti...  |
| 2          | compound        | 155.017778 | COMP is an ERC-20 token built on the Ethereum ...  |
| 3          | bitcoin-sv      | 64.927187  | Bitcoin SV is a cryptocurrency that was create...  |
| 4          | drep            | 0.48517    | DREPis committed to building a performance-ori...  |


# Create and Display the Similarity Matrix

In [3]:
#Standardize the volatility column (zero mean, unit variance) before computing distances.
stan_scaler = StandardScaler()

#StandardScaler expects a 2-D array, so reshape the column to shape (n_coins, 1).
stan_data = np.array(coin_df["Volatility"]).reshape(-1,1)

stan_matrix = stan_scaler.fit_transform(stan_data)

#Using the euclidean_distances function from sklearn, build the pairwise distance matrix of standardized volatilities.
#Smaller values mean two coins have more similar volatility.
coin_similarity_matrix = pd.DataFrame(euclidean_distances(stan_matrix), columns = coin_df["Name"], index= coin_df["Name"])
#Display the pairwise distance matrix
coin_similarity_matrix

| Name | iota | anchor-protocol | compound | bitcoin-sv | drep | moonbeam | usd-coin | chainlink | basic-attention-token | bittorrent | ... | gala | bitcoin-gold | render-token | unfoldu-group-coin-(new) | maker | nexus-mutual | juno | okb | avalanche | compound-usd-coin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iota | 0.000000 | 0.000468 | 0.094384 | 0.039394 | 0.000059 | 0.001550 | 2.368410e-04 | 0.004297 | 0.000071 | 2.371535e-04 | ... | 0.000132 | 0.014342 | 0.000775 | 2.362516e-04 | 0.477463 | 0.017720 | 0.005296 | 0.003586 | 0.020100 | 0.000237 |
| anchor-protocol | 0.000468 | 0.000000 | 0.093916 | 0.038926 | 0.000409 | 0.001082 | 7.048557e-04 | 0.003829 | 0.000539 | 7.051683e-04 | ... | 0.000600 | 0.013874 | 0.000307 | 7.042663e-04 | 0.476995 | 0.017252 | 0.004828 | 0.003118 | 0.019632 | 0.000705 |
| compound | 0.094384 | 0.093916 | 0.000000 | 0.054990 | 0.094325 | 0.092834 | 9.462082e-02 | 0.090087 | 0.094455 | 9.462113e-02 | ... | 0.094516 | 0.080042 | 0.093609 | 9.462023e-02 | 0.383079 | 0.076663 | 0.089088 | 0.090798 | 0.074284 | 0.094621 |
| bitcoin-sv | 0.039394 | 0.038926 | 0.054990 | 0.000000 | 0.039335 | 0.037844 | 3.963052e-02 | 0.035097 | 0.039465 | 3.963083e-02 | ... | 0.039526 | 0.025052 | 0.038619 | 3.962993e-02 | 0.438069 | 0.021673 | 0.034098 | 0.035808 | 0.019294 | 0.039630 |
| drep | 0.000059 | 0.000409 | 0.094325 | 0.039335 | 0.000000 | 0.001491 | 2.958295e-04 | 0.004238 | 0.000130 | 2.961421e-04 | ... | 0.000191 | 0.014283 | 0.000716 | 2.952401e-04 | 0.477404 | 0.017661 | 0.005237 | 0.003527 | 0.020041 | 0.000296 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| nexus-mutual | 0.017720 | 0.017252 | 0.076663 | 0.021673 | 0.017661 | 0.016170 | 1.795732e-02 | 0.013423 | 0.017791 | 1.795763e-02 | ... | 0.017853 | 0.003379 | 0.016946 | 1.795673e-02 | 0.459743 | 0.000000 | 0.012425 | 0.014135 | 0.002379 | 0.017957 |
| juno | 0.005296 | 0.004828 | 0.089088 | 0.034098 | 0.005237 | 0.003746 | 5.532731e-03 | 0.000999 | 0.005367 | 5.533044e-03 | ... | 0.005428 | 0.009046 | 0.004521 | 5.532142e-03 | 0.472167 | 0.012425 | 0.000000 | 0.001710 | 0.014804 | 0.005533 |
| okb | 0.003586 | 0.003118 | 0.090798 | 0.035808 | 0.003527 | 0.002036 | 3.822801e-03 | 0.000711 | 0.003657 | 3.823114e-03 | ... | 0.003718 | 0.010756 | 0.002811 | 3.822212e-03 | 0.473877 | 0.014135 | 0.001710 | 0.000000 | 0.016514 | 0.003823 |
| avalanche | 0.020100 | 0.019632 | 0.074284 | 0.019294 | 0.020041 | 0.018550 | 2.033654e-02 | 0.015803 | 0.020171 | 2.033685e-02 | ... | 0.020232 | 0.005758 | 0.019325 | 2.033595e-02 | 0.457363 | 0.002379 | 0.014804 | 0.016514 | 0.000000 | 0.020336 |


The volatility values were standardized before computing the Euclidean distance matrix so that volatility does not carry a disproportionate weight among the parameters of the recommender system. The raw volatility data also has a very high variance: some coins cost far more than others (e.g., Bitcoin and Ethereum), which contributes to a higher volatility value. Standardizing rescales volatility to zero mean and unit variance, which reduces this bias. Note that the entries of the matrix are distances, so a smaller value means two coins are more similar in volatility.
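To make the effect of standardization concrete, the following minimal sketch uses a handful of made-up volatility values (they are assumptions for illustration, not values from `coin-info.csv`). It suggests that, with a single feature, standardizing divides every pairwise distance by the same standard deviation, so the ordering of neighbours stays the same and the main benefit appears once volatility is combined with other parameters.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

# Hypothetical volatility values, chosen only for illustration.
raw = np.array([0.4, 1.2, 155.0, 65.0]).reshape(-1, 1)

# Standardize to zero mean and unit variance, as in the notebook.
scaled = StandardScaler().fit_transform(raw)

raw_dist = euclidean_distances(raw)
scaled_dist = euclidean_distances(scaled)

# StandardScaler divides by the population standard deviation (ddof=0),
# which is also NumPy's default for std().
sigma = raw.std()

# With one feature, each standardized distance is the raw distance / sigma.
ratio_ok = np.allclose(scaled_dist, raw_dist / sigma)
print(ratio_ok)
```

Because the scaling factor is shared by all pairs, the small magnitudes in the matrix above are consistent with a large standard deviation in the raw volatilities.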

# Sanity Check For Similarity Matrix

In [4]:
coin_volatility_1 = stan_matrix.flat[0] #Extract volatility of iota coin
coin_volatility_2 = stan_matrix.flat[1] #Extract volatility of anchor-protocol coin
coin_name_1 = coin_df["Name"][0] #Extract iota coin name
coin_name_2 = coin_df["Name"][1] #Extract anchor-protocol name

#Store both volatilities into a 2-D Numpy Array.
volatility_array_1 = np.array([[coin_volatility_1]])
volatility_array_2 = np.array([[coin_volatility_2]])

#Calculate the euclidean distance between the two coins 
euclidean_array = euclidean_distances(volatility_array_1, volatility_array_2).reshape(-1,1)

#Run a sanity test: for a single feature, the euclidean distance must equal the absolute difference.
assert np.allclose(euclidean_array, abs(volatility_array_2 - volatility_array_1))
print("Coins " + coin_name_1 + " and " + coin_name_2 + " pass the unit test. They have a euclidean distance of: " + np.array2string(euclidean_array))


Coins iota and anchor-protocol pass the unit test. They have a euclidean distance of: [[0.00046801]]
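The check above covers a single pair of coins. As a hedged extension, the sketch below tests the whole matrix at once on a small hypothetical array `z` that stands in for `stan_matrix`: it compares the result of `euclidean_distances` with the closed form |z_i - z_j| and checks the basic properties of a distance matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical standardized volatilities standing in for stan_matrix.
z = np.array([-0.5, -0.4, 1.6, 0.3, -1.0]).reshape(-1, 1)

dist = euclidean_distances(z)

# Closed form for one feature: |z_i - z_j| for every pair, via broadcasting
# a (n, 1) column against its (1, n) transpose.
expected = np.abs(z - z.T)

assert np.allclose(dist, expected)      # matches the closed form
assert np.allclose(dist, dist.T)        # symmetric
assert np.allclose(np.diag(dist), 0.0)  # zero distance from a coin to itself
assert (dist >= 0).all()                # distances are non-negative
checks_passed = True
```

In the notebook, the same assertions could be run with `stan_matrix` in place of `z`.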


# Save Similarity Matrix as a CSV File

In [5]:
#Store the similarity matrix in the models folder.
#Passing the path directly lets pandas open and close the file itself.
coin_similarity_matrix.to_csv(os.path.join(model_data_file_path, "coin-similarity-matrix-euclidean-distance.csv"))
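One way the saved matrix might be consumed by a recommender is to look up the coins closest to a given coin. The sketch below is an assumption about that downstream use, not the authors' implementation: `closest_coins` is a hypothetical helper, and the coin names and distances are placeholders.

```python
import pandas as pd

# Hypothetical 4x4 distance matrix; in the notebook this role would be
# played by coin_similarity_matrix. Names and values are placeholders.
names = ["coin-a", "coin-b", "coin-c", "coin-d"]
matrix = pd.DataFrame(
    [[0.0, 0.1, 0.9, 0.4],
     [0.1, 0.0, 0.8, 0.3],
     [0.9, 0.8, 0.0, 0.5],
     [0.4, 0.3, 0.5, 0.0]],
    index=names, columns=names)

def closest_coins(dist_matrix, coin, k=3):
    """Return the k coins with the smallest distance to `coin`, excluding itself."""
    # Select the coin's row, drop its zero self-distance, keep the k smallest.
    return dist_matrix.loc[coin].drop(coin).nsmallest(k)

nearest = closest_coins(matrix, "coin-a", k=2)
print(nearest)
```

Since the matrix stores distances, the helper sorts ascending; a similarity score would be sorted the other way.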