### Load the saved files

In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/final year/ml/data/blockchaincom_data.csv')

## Feature Engineering

As described in the README paper, several features should give the machine learning model a meaningful representation of the problem domain:

* size
* fee
* Time of transaction
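
The transaction time appears to be stored as a Unix timestamp in seconds (the time column). A minimal sketch of how it could be turned into calendar features, assuming UTC; the column names hour and day_of_week are hypothetical and not part of the original dataset:

```python
import pandas as pd

# Toy frame standing in for the real data; both timestamps are copied
# from the `time` column of the DataFrame shown in this notebook.
toy = pd.DataFrame({"time": [1277800165, 1385735317]})

# Interpret the integers as seconds since the Unix epoch (UTC assumed).
ts = pd.to_datetime(toy["time"], unit="s", utc=True)

# Hypothetical derived features: hour of day (0-23) and day of week (Monday=0).
toy["hour"] = ts.dt.hour
toy["day_of_week"] = ts.dt.dayofweek
print(toy)
```

Whether such features help the model is an open question; the sketch only shows the conversion.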



In [None]:
# Fee rate in satoshis per byte (fee divided by transaction size)
df['sat_per_byte'] = df['fee'] / df['size']
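
As a quick sanity check of the fee-rate calculation, the sketch below recomputes it on three (fee, size) pairs copied from the DataFrame displayed later in this notebook. The guard against a zero size is an added assumption; no such row is known to exist in this dataset.

```python
import numpy as np
import pandas as pd

# Three (fee, size) pairs copied from the DataFrame shown in this notebook.
toy = pd.DataFrame({"fee": [0, 10000, 1000000], "size": [159, 225, 439]})

# Fee rate in satoshis per byte. A zero size is mapped to NaN instead of inf
# (defensive assumption, not something observed in the data).
toy["sat_per_byte"] = toy["fee"] / toy["size"].replace(0, np.nan)
print(toy["sat_per_byte"].round(6).tolist())
```

The rounded values should agree with the sat_per_byte column in the displayed output for the same rows.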

There is a discrepancy between two of the data sources.
The outdegree value for tx aff5627c3efd229dbf380d3f50a5600e45e6c0e154288814685dc90926145d23 in the original dataset (Kaggle) is 1, meaning the transaction should have a single output. However, the vout_sz value for the same row, acquired from blockchain.com, reports 2 outputs.

Inspecting the outputs confirms this: the vout lists 2 outputs, yet the outdegree is 1.

The same thing occurs with tx 25e3d5bf2334290b31c55129d100c2bce167a93985a35d6b8ebf44b6d04632ef. The Kaggle dataset reports 1 output, while the output string in vout clearly shows 2.

After [cross-referencing with other block explorers](https://blockexplorer.one/bitcoin/mainnet/tx/25e3d5bf2334290b31c55129d100c2bce167a93985a35d6b8ebf44b6d04632ef), it seems the indegree and outdegree from the Kaggle dataset cannot be trusted, so I will instead use the vin_sz and vout_sz from the blockchain.com API data.
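
The two transactions above were found by inspection. A minimal sketch of how the disagreement could be measured across the whole dataset, assuming outdegree and vout_sz are both still present (that is, before the columns are dropped). The toy rows and hashes are illustrative, not copied from the data:

```python
import pandas as pd

# Toy rows standing in for the merged data: Kaggle `outdegree` next to
# blockchain.com `vout_sz`. Values and hashes are made up for illustration.
toy = pd.DataFrame({
    "tx_hash": ["tx_a", "tx_b", "tx_c"],
    "outdegree": [1, 1, 2],
    "vout_sz": [2, 1, 2],
})

# Rows where the two sources disagree on the number of outputs.
mismatch = toy[toy["outdegree"] != toy["vout_sz"]]
print(f"{len(mismatch)} of {len(toy)} rows disagree")
print(mismatch[["tx_hash", "outdegree", "vout_sz"]])
```

The same comparison could be run for indegree against vin_sz to see how widespread the problem is.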

In [None]:
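# Drop identifiers, raw input/output fields, and the untrusted Kaggle degree columns.
# The result is assigned back to df because the cells below inspect and save df.
df = df.drop(columns=[
    'tx_hash',
    'indegree',
    'outdegree',
    'ver',
    'relayed_by',
    'lock_time',
    'tx_index',
    'double_spend',
    'block_index',
    'block_height',
    'inputs',
    'out',
    'weight',
])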

In [None]:
df

Unnamed: 0,in_btc,out_btc,total_btc,mean_in_btc,mean_out_btc,in_malicious,out_malicious,is_malicious,out_and_tx_malicious,all_malicious,vin_sz,vout_sz,size,weight,fee,time,sat_per_byte
0,50.000000,50.000000,100.000000,50.000000,50.000000,0,0,0,0,0,1,1,159,636,0,1277800165,0.000000
1,50.000000,50.000000,100.000000,50.000000,50.000000,0,0,0,0,0,1,1,224,896,0,1289311198,0.000000
2,2.700000,2.700000,5.400000,2.700000,1.350000,0,0,0,0,0,1,2,258,1032,0,1271613446,0.000000
3,0.000000,50.000000,50.000000,0.000000,50.000000,0,0,0,0,0,1,1,134,536,0,1264625653,0.000000
4,0.000000,50.000000,50.000000,0.000000,50.000000,0,0,0,0,0,1,1,134,536,0,1262161248,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11320,23.224965,23.224865,46.449829,23.224965,11.612432,1,0,0,0,1,1,2,225,900,10000,1385735317,44.444444
11321,65.000000,64.990000,129.990000,32.500000,32.495000,0,1,0,1,1,2,2,439,1756,1000000,1387540345,2277.904328
11322,2592.000000,2431.567900,5023.567900,2592.000000,2431.567900,0,1,0,1,1,1,2,257,1028,1000000,1387540588,3891.050584
11323,100.000000,99.990000,199.990000,100.000000,49.995000,0,1,0,1,1,1,2,258,1032,1000000,1387540345,3875.968992


In [None]:
# check for any nulls or nans
df.isnull().sum()

in_btc                  0
out_btc                 0
total_btc               0
mean_in_btc             0
mean_out_btc            0
in_malicious            0
out_malicious           0
is_malicious            0
out_and_tx_malicious    0
all_malicious           0
vin_sz                  0
vout_sz                 0
size                    0
fee                     0
time                    0
sat_per_byte            0
dtype: int64

In [None]:
df.isna().sum()

in_btc                  0
out_btc                 0
total_btc               0
mean_in_btc             0
mean_out_btc            0
in_malicious            0
out_malicious           0
is_malicious            0
out_and_tx_malicious    0
all_malicious           0
vin_sz                  0
vout_sz                 0
size                    0
fee                     0
time                    0
sat_per_byte            0
dtype: int64
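
Both checks above report only missing values, and isnull() is an alias of isna(), so they return the same result. Neither flags infinite values, which a division such as fee / size would produce if size were ever zero. A small sketch of an additional check, using a toy frame with illustrative values:

```python
import numpy as np
import pandas as pd

# Illustrative values only: one NaN and one inf planted on purpose.
toy = pd.DataFrame({
    "fee": [0.0, 1000.0, np.nan],
    "sat_per_byte": [0.0, np.inf, 4.0],
})

# Missing values per column (what isna() / isnull() report).
n_missing = toy.isna().sum()

# Infinite values per column, which isna() does not count.
n_inf = np.isinf(toy.select_dtypes(include="number")).sum()

print(n_missing.to_dict())
print(n_inf.to_dict())
```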

## Write to file

In [None]:
df.to_csv('/content/drive/MyDrive/final year/ml/data/final_experimental_dataset.csv', index=False)