# Cryptocurrency Word Mover's Distance Semantic Analysis
### Authors


|    Student Name                 |    Student Number  |
|---------------------------------|--------------------|
| Raj Sandhu                      | 101111960          |
| Akaash Kapoor                   | 101112895          |
| Ali Alvi                        | 101114940          |
| Hassan Jallad                   | 101109334          |
| Areeb Ul Haq                    | 101115337          |
| Ahmad Abuoudeh                  | 101072636          |

## Libraries to Import

In [1]:
import pandas as pd
import gensim.downloader as api

## Read In Processed Coin Dataset

In [2]:
coin_df = pd.read_csv("coin-info.csv")  # Read in the processed dataframe generated in phase 2.
coin_df.head()  # Show the first 5 rows of the dataframe to check that it loaded correctly.

,Name,Volatility,Description
0,iota,0.388529,IOTA (IOTA or MIOTA) is a cryptocurrency token...
1,anchor-protocol,1.155277,Anchor Protocol is a yield stable and attracti...
2,compound,155.017778,COMP is an ERC-20 token built on the Ethereum ...
3,bitcoin-sv,64.927187,Bitcoin SV is a cryptocurrency that was create...
4,drep,0.48517,DREPis committed to building a performance-ori...


## Load In Pretrained Word Embedding Model

In [3]:
# Load the pretrained word2vec embeddings (Google News, 300 dimensions).
# They are used to compute the word mover's distance between pairs of coin descriptions.
model = api.load("word2vec-google-news-300")
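
Word mover's distance can only use tokens that have an embedding in the loaded model. As far as I know, gensim drops tokens it does not recognise and returns `inf` when a document ends up empty, so it can be worth checking how much of each description is actually covered. The sketch below is a minimal illustration under assumptions: `vocabulary_coverage`, the toy vocabulary, and the toy sentence are all hypothetical and not part of the original notebook. With the model loaded above, `model.key_to_index` (gensim 4.x) would likely play the role of `vocab`.

```python
def vocabulary_coverage(tokens, vocab):
    """Return (fraction of tokens found in vocab, list of tokens not found)."""
    if not tokens:
        return 0.0, []
    missing = [t for t in tokens if t not in vocab]
    return 1 - len(missing) / len(tokens), missing


# Toy stand-in for the embedding vocabulary (assumption, for illustration only).
toy_vocab = {"is", "a", "token", "built", "on", "the", "blockchain"}

# Made-up sentence, not one of the real coin descriptions.
tokens = "XYZcoin is a token built on the blockchain".split()

coverage, missing = vocabulary_coverage(tokens, toy_vocab)
print(f"coverage={coverage:.3f}, missing={missing}")
```

Descriptions with low coverage are the ones whose distances deserve a closer look, since only a small part of their text contributes to the result.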



## Generate Similarity Matrix for the Word Mover's Distance Metric

In [7]:
# Tokenise each description once. wmdistance expects a list of string tokens;
# a raw string would be treated as a sequence of characters.
coin_names = coin_df.iloc[:, 0]
coin_tokens = coin_df.iloc[:, -1].str.split()
# Compute the word mover's distance for every pair of descriptions and store the results in a matrix.
# Despite the variable name, the entries are distances: lower values mean more similar descriptions.
coin_similarity_matrix = pd.DataFrame(
    [[model.wmdistance(p1, p2) for p2 in coin_tokens] for p1 in coin_tokens],
    columns=coin_names,
    index=coin_names,
)
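
The nested comprehension above evaluates the distance for every ordered pair, including each description against itself. The matrix displayed further down is symmetric with a zero diagonal, so roughly half of that work could be skipped. The following is a minimal sketch of that idea, not the notebook's method: `pairwise_symmetric` is a hypothetical helper, and `toy_distance` (a Jaccard distance on token sets) is only a cheap stand-in for `model.wmdistance` so the example runs without the embedding model. It assumes the distance is symmetric and that every item is at distance 0 from itself.

```python
from itertools import combinations


def pairwise_symmetric(items, distance):
    """Build an n x n matrix, calling `distance` once per unordered pair."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal assumed to be 0
    for i, j in combinations(range(n), 2):
        d = distance(items[i], items[j])
        matrix[i][j] = d
        matrix[j][i] = d  # mirror the value instead of recomputing it
    return matrix


def toy_distance(a, b):
    """Jaccard distance on token sets; a stand-in for model.wmdistance."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)


docs = [text.split() for text in ["a b c", "a b d", "x y z"]]
matrix = pairwise_symmetric(docs, toy_distance)
for row in matrix:
    print(row)
```

With the real data, the returned list of lists could be wrapped as `pd.DataFrame(matrix, index=names, columns=names)`, where `names` is the column of coin names.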

## Display Similarity Matrix and Check Validity

In [5]:
coin_similarity_matrix #Display computed similarity matrix.

Name,iota,anchor-protocol,compound,bitcoin-sv,drep,moonbeam,usd-coin,chainlink,basic-attention-token,bittorrent,...,gala,bitcoin-gold,render-token,unfoldu-group-coin-(new),maker,nexus-mutual,juno,okb,avalanche,compound-usd-coin
iota,0.000000,3.170357,2.171160,2.229937,2.735465,2.258385,2.208431,2.103930,2.225857,2.376386,...,2.772715,2.334604,2.705839,2.391945,2.102176,2.190762,2.375256,2.178784,2.516597,2.509003
anchor-protocol,3.170357,0.000000,3.326736,3.277865,3.407600,3.268028,3.369509,3.340737,3.381832,3.322350,...,3.450060,3.292741,3.360755,3.208156,3.191857,3.253310,3.346427,3.152493,3.380975,3.372036
compound,2.171160,3.326736,0.000000,2.281674,2.891091,2.400574,2.188473,2.137912,2.133458,2.439778,...,2.925447,2.358951,2.770554,2.393285,1.800100,2.407950,2.410279,2.302774,2.489660,2.248290
bitcoin-sv,2.229937,3.277865,2.281674,0.000000,2.843226,2.393085,2.277789,2.277939,2.292242,2.510614,...,2.917513,2.098307,2.752338,2.477368,2.233188,2.465231,2.507217,2.308496,2.574286,2.623275
drep,2.735465,3.407600,2.891091,2.843226,0.000000,2.810384,2.868590,2.770253,2.926254,2.970365,...,2.998832,2.827450,3.008179,2.844016,2.881863,2.844083,2.840785,2.764615,2.836954,3.020076
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
nexus-mutual,2.190762,3.253310,2.407950,2.465231,2.844083,2.458540,2.456764,2.374419,2.428557,2.453446,...,2.867152,2.513954,2.761209,2.371646,2.399128,0.000000,2.518541,2.237765,2.583840,2.703363
juno,2.375256,3.346427,2.410279,2.507217,2.840785,2.363450,2.412082,2.239042,2.504455,2.621340,...,3.086009,2.510507,2.909125,2.545965,2.367397,2.518541,0.000000,2.388843,2.540172,2.738081
okb,2.178784,3.152493,2.302774,2.308496,2.764615,2.014560,2.286681,2.299715,2.352729,2.253653,...,2.818819,2.235299,2.675767,2.259754,2.232096,2.237765,2.388843,0.000000,2.387466,2.515510
avalanche,2.516597,3.380975,2.489660,2.574286,2.836954,2.317926,2.612435,2.414112,2.605800,2.601057,...,3.019100,2.538815,2.715792,2.550356,2.554725,2.583840,2.540172,2.387466,0.000000,2.863573


In [8]:
#Obtain information of first two coins to perform a sanity check of calculations performed.
coin_desc_1 = coin_df["Description"][0]
coin_desc_2 = coin_df["Description"][1]
coin_name_1 = coin_df["Name"][0]
coin_name_2 = coin_df["Name"][1]

# Look up the stored distance for the first two coins (row 0, column 1).
coin_similarity_1 = coin_similarity_matrix.iloc[0, 1]
# Recompute the distance directly and verify that it matches the stored value.
assert model.wmdistance(coin_desc_1.split(), coin_desc_2.split()) == coin_similarity_1, f"Coins {coin_name_1} and {coin_name_2} fail unit test. Computed word mover's distances do not match."
print(f"Coins {coin_name_1} and {coin_name_2} pass the unit test. They have a word mover's distance of: {coin_similarity_1}")

Coins iota and anchor-protocol pass the unit test. They have a word mover's distance of: 3.170357186667697


The cell above performs a sanity check on the word mover's distance calculations. The distance between the first two coins is recomputed directly and is also retrieved from the similarity matrix. The assert statement then compares the two values for equality. If they match, a success message is printed; otherwise an error is raised with the provided message. Note that this check covers a single entry of the matrix: it confirms that the value was stored at the expected row and column, not that every pair is correct.
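
A complementary check, sketched here under assumptions, looks at structural properties of the whole matrix rather than one entry: it should be square, carry the same labels on rows and columns, be symmetric, have a zero diagonal, and contain only finite values. `check_distance_matrix` is a hypothetical helper, and the comparison uses a small tolerance rather than exact equality. The toy frame reuses the three values shown for iota, anchor-protocol, and compound in the output above.

```python
import numpy as np
import pandas as pd


def check_distance_matrix(df, tol=1e-9):
    """Return a dict of basic structural checks for a labelled distance matrix."""
    values = df.to_numpy(dtype=float)
    square = values.ndim == 2 and values.shape[0] == values.shape[1]
    return {
        "square": square,
        "labels_match": list(df.index) == list(df.columns),
        # The transpose and diagonal checks only make sense for a square matrix.
        "symmetric": bool(square and np.allclose(values, values.T, atol=tol)),
        "zero_diagonal": bool(square and np.allclose(np.diag(values), 0.0, atol=tol)),
        "finite": bool(np.isfinite(values).all()),
    }


names = ["iota", "anchor-protocol", "compound"]
toy = pd.DataFrame(
    [[0.0, 3.170357, 2.171160],
     [3.170357, 0.0, 3.326736],
     [2.171160, 3.326736, 0.0]],
    index=names,
    columns=names,
)
report = check_distance_matrix(toy)
print(report)
```

On the real data this would be called as `check_distance_matrix(coin_similarity_matrix)`. A non-finite entry would suggest that at least one description had no tokens the embedding model recognises.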

## Download Similarity Matrix as a CSV File

In [9]:
coin_similarity_matrix.to_csv("coin-similarity-matrix-description.csv") #exports similarity matrix to a csv file.

Once the cell above has run, the CSV file appears in the Files section on the left side of the screen. Right-click it, download it, and save it in the models folder of the repo.
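
To use the saved file later, it can be read back with the first column as the index, which should restore the coin names on both axes. The sketch below is illustrative only: it reads from an in-memory string holding three rows taken from the output above instead of the real file, and `closest_coins` is a hypothetical helper. For the actual file, `pd.read_csv("coin-similarity-matrix-description.csv", index_col=0)` would be the equivalent call.

```python
import io

import pandas as pd

# Small excerpt in the same layout that to_csv produces (values from the output above).
csv_text = """Name,iota,anchor-protocol,compound
iota,0.0,3.170357,2.171160
anchor-protocol,3.170357,0.0,3.326736
compound,2.171160,3.326736,0.0
"""

# index_col=0 turns the first column back into the row labels.
matrix = pd.read_csv(io.StringIO(csv_text), index_col=0)


def closest_coins(matrix, name, k=1):
    """Return the k coins with the smallest distance to `name`, excluding itself."""
    return matrix.loc[name].drop(name).nsmallest(k)


print(closest_coins(matrix, "iota"))
print(closest_coins(matrix, "anchor-protocol", k=2))
```

Because the entries are distances, the smallest values identify the most similar descriptions.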