# Semi-supervised learning with DeepWalk on a Bitcoin transaction graph

We use DeepWalk to obtain a low-dimensional embedding of the graph, then apply clustering and/or semi-supervised learning to that embedding.

In [None]:
!pip install deepwalk
!pip install pyclustertend

In [None]:
## libraries 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import subprocess
import shlex
import deepwalk
import networkx as nx
import matplotlib.pyplot as plt
from networkx.readwrite.edgelist import read_edgelist
from networkx.readwrite.edgelist import write_edgelist

from sklearn.preprocessing import normalize
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

## DeepWalk usage on a transaction graph

In [None]:
!deepwalk --format edgelist --input /kaggle/input/bitcoin-to-edge-list/bitcoin_edge.edgelist --workers 10 --number-walks 20 --representation-size 128 --walk-length 30 --window-size 5 --output /kaggle/working/graph_bitcoin.embeddings
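
For intuition, the flags above control truncated random walks: `--number-walks` walks of `--walk-length` nodes are started from each node, and the resulting sequences are fed to a skip-gram model with a context of `--window-size`. Below is a minimal, pure-Python sketch of the walk-generation step only. The toy graph and the helper name `random_walks` are illustrative assumptions and are not part of the deepwalk package.

```python
import random

def random_walks(adjacency, number_walks, walk_length, seed=0):
    """Generate truncated random walks starting from every node."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    walks = []
    for _ in range(number_walks):
        rng.shuffle(nodes)  # visit start nodes in a new random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Toy undirected graph as an adjacency list (hypothetical data)
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(toy_graph, number_walks=2, walk_length=5)
print(len(walks))  # one walk per node per pass
print(walks[0])
```

Each walk plays the role of a sentence, and each node the role of a word, when the skip-gram model learns the embedding.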

In [None]:
_embeddings = np.loadtxt('/kaggle/input/bitcoin-graph-embedding/graph_bitcoin.embeddings', skiprows = 1)
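
One caveat worth checking: assuming the file follows the word2vec text format that deepwalk writes (a header line, then one row per node starting with the node id), `np.loadtxt(..., skiprows=1)` keeps the node id as the first column, and the rows are not necessarily in the same order as `df_classes`. The sketch below separates ids from vectors and sorts by id. `load_embeddings_sorted` is a hypothetical helper, and whether sorted node ids actually line up with the rows of `df_classes` depends on how the edge list was built.

```python
import io
import numpy as np

def load_embeddings_sorted(source):
    """Load a word2vec-style text file; return (node_ids, vectors) sorted by node id."""
    raw = np.loadtxt(source, skiprows=1, ndmin=2)  # skip the "count dim" header
    node_ids = raw[:, 0].astype(int)               # first column: node id
    vectors = raw[:, 1:]                           # remaining columns: embedding
    order = np.argsort(node_ids)
    return node_ids[order], vectors[order]

# Tiny hypothetical file: 3 nodes, 2 dimensions, rows out of order
toy_file = io.StringIO("3 2\n2 0.5 0.6\n0 0.1 0.2\n1 0.3 0.4\n")
node_ids, vectors = load_embeddings_sorted(toy_file)
print(node_ids)
print(vectors.shape)
```

If the id column is left in place, it also ends up in the feature matrix passed to the classifier, so it is worth inspecting the first column of `_embeddings` before going further.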

In [None]:
df_classes = pd.read_csv('/kaggle/input/elliptic-data-set/elliptic_bitcoin_dataset/elliptic_bitcoin_dataset/elliptic_txs_classes.csv')

In [None]:
df_classes['class'].value_counts()

The goal is to run an ML algorithm that predicts which class each unknown observation belongs to:
- 2% of transactions are known to be illicit
- 21% are known to be licit
- 77% are of unknown type
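
The shares listed above can be recomputed from the class column with `value_counts(normalize=True)`. Here is a minimal sketch on a hypothetical 100-row stand-in for `df_classes['class']`, assuming the Elliptic encoding in which `'1'` is illicit and `'2'` is licit:

```python
import pandas as pd

# Hypothetical miniature of the class column: '1' = illicit, '2' = licit
toy_classes = pd.Series(['unknown'] * 77 + ['2'] * 21 + ['1'] * 2, name='class')

# Relative frequency of each class, expressed as a percentage
shares = toy_classes.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

On the real data the same call would be `df_classes['class'].value_counts(normalize=True)`.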

## Analyse the embedding

We would like to analyse the embedding to learn how the data is distributed.

In [None]:
_embeddings

In [None]:
_embeddings.shape

In [None]:
# -1 marks unlabeled samples for scikit-learn; .loc avoids chained assignment
df_classes.loc[df_classes['class'] == 'unknown', 'class'] = -1

In [None]:
y = df_classes['class'].to_numpy()
y = y.astype(int)

## Use semi-supervised learning to classify the unlabeled observations

### LabelSpreading

LabelSpreading is a semi-supervised algorithm that classifies unlabeled observations by spreading the labels of the classified ones over a similarity graph of the observations (here, a k-nearest-neighbour graph built on the embeddings).
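
To make the mechanism concrete, here is a small sketch on made-up data: two well-separated groups of points, with a single labeled point in each group and every other point marked `-1`. The data, the cluster layout and the choice of `n_neighbors=7` are assumptions for illustration only and say nothing about how the model behaves on the Bitcoin embedding.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Two toy clusters of 10 points each, laid out on a line and far apart
cluster_a = np.column_stack([np.arange(10) * 0.1, np.zeros(10)])
cluster_b = np.column_stack([10.0 + np.arange(10) * 0.1, np.zeros(10)])
X = np.vstack([cluster_a, cluster_b])
true_labels = np.array([1] * 10 + [2] * 10)

# Keep one label in the middle of each cluster; -1 marks unlabeled points
y = np.full(20, -1)
y[5], y[15] = 1, 2

model = LabelSpreading(kernel='knn', n_neighbors=7)
model.fit(X, y)

# transduction_ holds the label inferred for every training point
accuracy = (model.transduction_ == true_labels).mean()
print(model.transduction_)
print(accuracy)
```

Labels only travel along the neighbour graph, so an unlabeled point that cannot reach any labeled point through its neighbours receives no information.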

In [None]:
model = LabelSpreading(kernel = 'knn', max_iter = 10000, tol = 0.3, n_jobs = -1)

In [None]:
normalized_X = normalize(_embeddings)

In [None]:
model.fit(normalized_X,y)

In [None]:
result_label_propagation = model.predict_proba(normalized_X)

In [None]:
# Column 1 follows model.classes_[1], i.e. class 2 (licit)
plt.hist(result_label_propagation[:,1], color = 'blue', edgecolor = 'black')
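
The columns of `predict_proba` follow the order of `model.classes_`, so it can help to look up the column by class value and to restrict the analysis to rows that were unlabeled. The sketch below uses made-up numbers standing in for `model.classes_`, the predicted probabilities and `y`. The 0.5 threshold is an arbitrary assumption, not a recommendation.

```python
import numpy as np

# Hypothetical stand-ins for model.classes_, model.predict_proba(...) and y
classes = np.array([1, 2])                       # 1 = illicit, 2 = licit
proba = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.1, 0.9]])
y = np.array([1, -1, -1, -1])                    # -1 marks unlabeled rows

illicit_col = int(np.where(classes == 1)[0][0])  # column holding P(class == 1)
unlabeled = y == -1                              # boolean mask of unlabeled rows
p_illicit = proba[unlabeled, illicit_col]

flagged = int((p_illicit > 0.5).sum())           # unlabeled rows leaning illicit
print(p_illicit, flagged)
```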
