# Task
We use a real-world Ethereum transactions dataset that is publicly available [8] for our investigation. The dataset consists of six attributes: ‘from’, ‘to’, ‘amount’, ‘timestamp’, ‘fromIsPhi’, and ‘toIsPhi’. The ‘from’ and ‘to’ attributes of a transaction give the addresses of its sender and receiver, respectively. ‘timestamp’ records the date and time at which the block containing the transaction was mined, and ‘amount’ records the amount of Ether transferred. The dataset is labeled: ‘fromIsPhi’ and ‘toIsPhi’ together label the accounts involved in a transaction as phishing or non-phishing. Specifically, ‘fromIsPhi’ is 1 if the sender address is a phishing account and 0 otherwise; likewise, ‘toIsPhi’ is 1 if the receiver address is a phishing account and 0 otherwise. This yields four types of transactions: (0 − 0), (0 − 1), (1 − 0), and (1 − 1).

## Data loading

### Subtask:
Load the Ethereum transaction dataset.


**Reasoning**:
Load the dataset and display the first few rows along with its shape to inspect the data.



In [10]:
import pandas as pd

try:
    df = pd.read_csv(
        'final_data_4.csv',
        # Keys must match the CSV header exactly (lowercase 'fromIsPhi' / 'toIsPhi')
        dtype={'from': str, 'to': str, 'amount': float, 'timestamp': int, 'fromIsPhi': int, 'toIsPhi': int},
    )
    display(df.head())
    print(df.shape)
except FileNotFoundError:
    print("Error: 'final_data_4.csv' not found.")
except pd.errors.ParserError:
    print("Error: Could not parse the CSV file. Please check the file format.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Unnamed: 0.2,Unnamed: 0.1,from,to,amount,timestamp,fromIsPhi,toIsPhi,date,Unnamed: 0
0,2562268,0x75e7f640bf6968b6f32c47a3cd82c3c2c9dcae68,0x2b6e4b95a82f4bfed7e63e0df5d16eb0701f3c30,0.99,1491661658,0,0,2017-04-08 00:00:00,
1,2562267,0x75e7f640bf6968b6f32c47a3cd82c3c2c9dcae68,0x2b6e4b95a82f4bfed7e63e0df5d16eb0701f3c30,0.99,1491661658,0,0,2017-04-08 00:00:00,
2,352022,0x22b84d5ffea8b801c0422afe752377a64aa738c2,0x22e8804d93089d1a1ddeefa818769645518de52d,11.80434,1491613812,0,0,2017-04-08 00:00:00,
3,352021,0x22b84d5ffea8b801c0422afe752377a64aa738c2,0x7842385d4725d1fb0a0d0b8d6df7d29e35f1e2c4,99.036351,1491612905,0,0,2017-04-08 00:00:00,
4,352020,0x22b84d5ffea8b801c0422afe752377a64aa738c2,0x43641b5a17613ce476b8b933347c7b650d6907eb,19.46,1491612434,0,0,2017-04-08 00:00:00,


(119999, 9)


**Reasoning**:
Examine data types, missing values, and distributions; calculate unique node counts.
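
The inspection step described above could look roughly like the following. This is a minimal sketch on a hypothetical toy frame (`toy`) that only borrows the column names of the loaded dataset; the helper name `inspect_transactions` is an assumption made for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the loaded dataset
toy = pd.DataFrame({
    'from': ['0xa', '0xa', '0xb', '0xc'],
    'to': ['0xb', '0xc', '0xc', '0xa'],
    'amount': [0.5, 1.0, None, 2.0],
    'timestamp': [10, 20, 30, 40],
    'fromIsPhi': [0, 0, 1, 0],
    'toIsPhi': [1, 0, 0, 0],
})

def inspect_transactions(frame):
    """Summarise dtypes, missing values, label counts, and unique node count."""
    # A node is any address that appears as sender or receiver
    addresses = pd.concat([frame['from'], frame['to']])
    return {
        'dtypes': frame.dtypes.astype(str).to_dict(),
        'missing': frame.isna().sum().to_dict(),
        'from_label_counts': frame['fromIsPhi'].value_counts().to_dict(),
        'to_label_counts': frame['toIsPhi'].value_counts().to_dict(),
        'unique_nodes': int(addresses.nunique()),
    }

summary = inspect_transactions(toy)
print(summary)
```

Calling `inspect_transactions(df)` on the real frame would give the same kind of summary for the full dataset.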



## Data preparation

### Subtask:
Prepare the data for graph creation.


**Reasoning**:
Drop unnecessary columns, convert address columns to numerical IDs, handle the 'amount' column, and convert the 'timestamp' column to integer type.



In [11]:
# Drop unnecessary columns
df = df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'date'])

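# Create mappings for 'from' and 'to' addresses.
# Note: the two mappings are built independently, so from_id and to_id live in
# separate ID spaces and the same address can receive a different number in each.
from_mapping = {address: idx for idx, address in enumerate(df['from'].unique())}
to_mapping = {address: idx for idx, address in enumerate(df['to'].unique())}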

# Apply mappings to create numerical IDs
df['from_id'] = df['from'].map(from_mapping)
df['to_id'] = df['to'].map(to_mapping)

# Handle 'amount' column
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df['amount'] = df['amount'].fillna(df['amount'].median())

# Convert 'timestamp' to integer
df['timestamp'] = df['timestamp'].astype(int)

display(df.head())

Unnamed: 0,from,to,amount,timestamp,fromIsPhi,toIsPhi,from_id,to_id
0,0x75e7f640bf6968b6f32c47a3cd82c3c2c9dcae68,0x2b6e4b95a82f4bfed7e63e0df5d16eb0701f3c30,0.99,1491661658,0,0,0,0
1,0x75e7f640bf6968b6f32c47a3cd82c3c2c9dcae68,0x2b6e4b95a82f4bfed7e63e0df5d16eb0701f3c30,0.99,1491661658,0,0,0,0
2,0x22b84d5ffea8b801c0422afe752377a64aa738c2,0x22e8804d93089d1a1ddeefa818769645518de52d,11.80434,1491613812,0,0,1,1
3,0x22b84d5ffea8b801c0422afe752377a64aa738c2,0x7842385d4725d1fb0a0d0b8d6df7d29e35f1e2c4,99.036351,1491612905,0,0,1,2
4,0x22b84d5ffea8b801c0422afe752377a64aa738c2,0x43641b5a17613ce476b8b933347c7b650d6907eb,19.46,1491612434,0,0,1,3
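
In the output above, the first row has `from_id` 0 and `to_id` 0 for two different addresses, because senders and receivers are numbered separately. If a single ID per address is wanted (for example, to use the IDs as graph node indices), one option is to number all addresses together. The sketch below assumes a hypothetical toy frame and uses `pd.factorize`; the name `address_to_id` is illustrative only.

```python
import pandas as pd

# Hypothetical toy transactions; only the address columns matter here
toy = pd.DataFrame({
    'from': ['0xa', '0xa', '0xb'],
    'to': ['0xb', '0xc', '0xa'],
})

# Number every address once, over senders followed by receivers
all_addresses = pd.concat([toy['from'], toy['to']], ignore_index=True)
codes, uniques = pd.factorize(all_addresses)
address_to_id = {address: idx for idx, address in enumerate(uniques)}

# Both columns now share one ID space
toy['from_id'] = toy['from'].map(address_to_id)
toy['to_id'] = toy['to'].map(address_to_id)
print(toy)
```

With a shared mapping, `0xa` gets the same ID whether it appears as sender or receiver.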


## Summary:


### Data Analysis Key Findings
* The Ethereum transaction dataset contains 119,999 transactions.
* The 'fromIsPhi' column shows an imbalance, with 96,194 non-phishing senders (0) and 23,805 phishing senders (1).
* The 'toIsPhi' column also displays an imbalance, with 83,544 non-phishing receivers (0) and 36,455 phishing receivers (1).





Create a dictionary that stores one unified label for each unique address.

The dictionary is built by iterating over every transaction and recording its sender and receiver addresses. An address is labeled 1 (phishing) if it is flagged as phishing in any transaction, whether as sender or receiver, and 0 otherwise.

In [12]:
# Create a dictionary to store the unified labels for each unique address
unified_labels = {}

# Iterate through the DataFrame to determine the unified label for each address
for index, row in df.iterrows():
    from_address = row['from']
    to_address = row['to']
    from_label = row['fromIsPhi']
    to_label = row['toIsPhi']

    # If the address is already in the dictionary, update the label if it was previously non-phishing and is now phishing
    if from_address in unified_labels:
        if unified_labels[from_address] == 0 and from_label == 1:
            unified_labels[from_address] = 1
    else:
        unified_labels[from_address] = from_label

    if to_address in unified_labels:
        if unified_labels[to_address] == 0 and to_label == 1:
            unified_labels[to_address] = 1
    else:
        unified_labels[to_address] = to_label

# Now, create a new dataframe with the unique addresses and their unified labels
unique_addresses = list(unified_labels.keys())
labels = list(unified_labels.values())

address_labels_df = pd.DataFrame({'address': unique_addresses, 'label': labels})

# Display the first few rows of the new dataframe and the count of each label
display(address_labels_df.head())
print("\nUnified Label Distribution:")
print(address_labels_df['label'].value_counts())

Unnamed: 0,address,label
0,0x75e7f640bf6968b6f32c47a3cd82c3c2c9dcae68,0
1,0x2b6e4b95a82f4bfed7e63e0df5d16eb0701f3c30,0
2,0x22b84d5ffea8b801c0422afe752377a64aa738c2,0
3,0x22e8804d93089d1a1ddeefa818769645518de52d,0
4,0x7842385d4725d1fb0a0d0b8d6df7d29e35f1e2c4,0



Unified Label Distribution:
label
0    43177
1     1165
Name: count, dtype: int64
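
The row-by-row loop above is easy to follow but can be slow on large frames. As a possible alternative, the same rule (an address is phishing if any of its appearances is flagged) can be expressed as a group-wise maximum. This is a sketch on a hypothetical toy frame, not the notebook's own method; the row order of the result may differ from the loop's insertion order, although the labels should agree.

```python
import pandas as pd

# Hypothetical toy transactions: '0xb' is flagged as sender but not as receiver
toy = pd.DataFrame({
    'from': ['0xa', '0xb', '0xa'],
    'to': ['0xb', '0xc', '0xc'],
    'fromIsPhi': [0, 1, 0],
    'toIsPhi': [0, 0, 0],
})

# Stack sender and receiver observations into one long (address, label) table
senders = toy[['from', 'fromIsPhi']].rename(
    columns={'from': 'address', 'fromIsPhi': 'label'})
receivers = toy[['to', 'toIsPhi']].rename(
    columns={'to': 'address', 'toIsPhi': 'label'})
long_form = pd.concat([senders, receivers], ignore_index=True)

# max() over 0/1 labels yields 1 if the address was ever flagged as phishing
address_labels = (
    long_form.groupby('address', sort=False)['label'].max().reset_index()
)
print(address_labels)
```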


Create a new graph using the unique addresses as nodes and the unified labels as node attributes. Then, add the transactions as edges with their attributes.
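
One point to keep in mind for this step: `nx.DiGraph` stores at most one edge per ordered (sender, receiver) pair, so repeated transactions between the same two addresses overwrite each other's attributes and the edge count can be lower than the transaction count. The sketch below, on hypothetical toy transactions, contrasts this with `nx.MultiDiGraph`, which keeps every transaction as its own edge. Whether parallel edges matter depends on the downstream analysis.

```python
import networkx as nx

# Hypothetical toy transactions: (sender, receiver, amount, timestamp)
transactions = [
    ('0xa', '0xb', 1.0, 100),
    ('0xa', '0xb', 2.5, 200),  # second transfer between the same pair
    ('0xb', '0xc', 0.3, 300),
]

simple = nx.DiGraph()
multi = nx.MultiDiGraph()
for sender, receiver, amount, timestamp in transactions:
    # In a DiGraph, a repeated (sender, receiver) pair updates the existing edge
    simple.add_edge(sender, receiver, amount=amount, timestamp=timestamp)
    # In a MultiDiGraph, each call adds a new parallel edge
    multi.add_edge(sender, receiver, amount=amount, timestamp=timestamp)

print(simple.number_of_edges())  # 2
print(multi.number_of_edges())   # 3
print(simple['0xa']['0xb'])      # {'amount': 2.5, 'timestamp': 200}
```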

In [13]:
import networkx as nx

# Create a new directed graph
new_graph = nx.DiGraph()

# Add nodes with unified labels
for index, row in address_labels_df.iterrows():
    address = row['address']
    label = row['label']
    # Use the address itself as the node ID
    new_graph.add_node(address, label=label)

# Add edges from the original transaction data
for index, row in df.iterrows():
    from_address = row['from']
    to_address = row['to']
    amount = row['amount']
    timestamp = row['timestamp']

    # Add (or overwrite) the edge, using the raw addresses as node identifiers.
    # Every address in df was added as a node above, so this check is only a safeguard.
    if from_address in new_graph and to_address in new_graph:
        new_graph.add_edge(from_address, to_address, amount=amount, timestamp=timestamp)

# Check the number of nodes and edges in the new graph
num_nodes_new = new_graph.number_of_nodes()
num_edges_new = new_graph.number_of_edges()
print(f"Number of nodes in the new graph: {num_nodes_new}")
print(f"Number of edges in the new graph: {num_edges_new}")

# Verify that nodes have the 'label' attribute and edges have attributes
print("\nExample nodes and edges in the new graph:")
for i, (node, attributes) in enumerate(new_graph.nodes(data=True)):
    if 'label' in attributes:
        print(f"Node {node}: {attributes}")
        if i > 5: # Print only a few examples
            break

for i, (u, v, attributes) in enumerate(new_graph.edges(data=True)):
    if 'amount' in attributes and 'timestamp' in attributes:
        print(f"Edge from {u} to {v}: {attributes}")
        if i > 5: # Print only a few examples
            break

Number of nodes in the new graph: 44342
Number of edges in the new graph: 55608

Example nodes and edges in the new graph:
Node 0x75e7f640bf6968b6f32c47a3cd82c3c2c9dcae68: {'label': 0}
Node 0x2b6e4b95a82f4bfed7e63e0df5d16eb0701f3c30: {'label': 0}
Node 0x22b84d5ffea8b801c0422afe752377a64aa738c2: {'label': 0}
Node 0x22e8804d93089d1a1ddeefa818769645518de52d: {'label': 0}
Node 0x7842385d4725d1fb0a0d0b8d6df7d29e35f1e2c4: {'label': 0}
Node 0x43641b5a17613ce476b8b933347c7b650d6907eb: {'label': 0}
Node 0x51836a753e344257b361519e948ffcaf5fb8d521: {'label': 0}
Edge from 0x75e7f640bf6968b6f32c47a3cd82c3c2c9dcae68 to 0x2b6e4b95a82f4bfed7e63e0df5d16eb0701f3c30: {'amount': 0.99, 'timestamp': 1491661658}
Edge from 0x75e7f640bf6968b6f32c47a3cd82c3c2c9dcae68 to 0xa89f7a78c9c54e3027a318ea3cd152ef18674f39: {'amount': 0.128919, 'timestamp': 1491614486}
Edge from 0x75e7f640bf6968b6f32c47a3cd82c3c2c9dcae68 to 0x5b2f9f6d5b245b8d7534fd1210fe45d53edb554f: {'amount': 1.0, 'timestamp': 1491616476}
Edge from 0x75