# Bitcoin Transaction Fraud Detection: Feature Engineering

This notebook extracts and analyzes graph-based features (degree, PageRank, HITS scores, and clustering coefficients) for the Bitcoin transaction dataset. These features describe how each transaction is connected to the rest of the network, which the per-transaction features alone do not capture and which may help fraud detection.

## Table of Contents
1. [Setup](#Setup)
2. [Loading Processed Data](#Loading-Processed-Data)
3. [Basic Graph Features](#Basic-Graph-Features)
4. [Centrality Measures](#Centrality-Measures)
5. [Clustering Coefficients](#Clustering-Coefficients)
6. [Temporal Features](#Temporal-Features)
7. [Combining Features](#Combining-Features)
8. [Feature Importance Analysis](#Feature-Importance-Analysis)
9. [Saving Engineered Features](#Saving-Engineered-Features)

## Setup

Let's import the necessary libraries and configure the environment.

In [None]:
import os
import numpy as np
import pandas as pd
import networkx as nx
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import mutual_info_classif
import logging
import time
from tqdm.notebook import tqdm

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Create directories
os.makedirs('data/processed', exist_ok=True)

## Loading Processed Data

First, let's load the preprocessed data from the previous notebook.
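
One detail of the loading cell is worth checking on a toy input first: string class labels are encoded by enumerating `unique()`, which follows order of first appearance, while later cells treat `class == 1` as fraud. The sketch below mimics that encoding in plain Python. The label strings `'licit'` and `'illicit'` and the helper name are hypothetical stand-ins, not values taken from the dataset.

```python
def encode_by_first_appearance(values):
    """Assign 0, 1, 2, ... to labels in the order they first appear."""
    mapping = {}
    for value in values:
        # setdefault only assigns a new code the first time a label is seen
        mapping.setdefault(value, len(mapping))
    return mapping

# Hypothetical label strings; the real ones depend on the classes file
map_a = encode_by_first_appearance(["licit", "illicit", "licit"])
map_b = encode_by_first_appearance(["illicit", "licit", "licit"])

print(map_a)  # here 'illicit' is encoded as 1
print(map_b)  # here 'illicit' is encoded as 0
```

If the fraud class does not end up as 1, passing an explicit dictionary to `map` makes the encoding independent of row order.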

In [None]:
def load_data(classes_path, edgelist_path, features_path):
    """
    Load blockchain dataset from CSV files with the special format.
    
    Parameters:
    -----------
    classes_path : str
        Path to the classes CSV file with txId and class columns
    edgelist_path : str
        Path to the edgelist CSV file with txId1 and txId2 columns
    features_path : str
        Path to the features CSV file
        
    Returns:
    --------
    df_nodes : pandas.DataFrame
        DataFrame containing node data with txId, class, and features
    df_edges : pandas.DataFrame
        DataFrame containing edge data
    """
    logger.info(f"Loading data from {classes_path}, {edgelist_path}, and {features_path}")
    
    # Load classes data
    df_classes = pd.read_csv(classes_path)
    logger.info(f"Loaded {len(df_classes)} transactions with class information")
    
    # Load edge data
    df_edges = pd.read_csv(edgelist_path)
    logger.info(f"Loaded {len(df_edges)} edges")
    
    # Load features data - this has a non-standard format
    try:
        # Read the features file
        df_features = pd.read_csv(features_path)
        
        # First column should be txId
        df_features = df_features.rename(columns={df_features.columns[0]: 'txId'})
        
        # Second column is a constant (1) - drop it
        df_features = df_features.drop(columns=[df_features.columns[1]])
        
        # Rename remaining columns to feature_0, feature_1, etc.
        feature_cols = [col for col in df_features.columns if col != 'txId']
        feature_rename = {col: f'feature_{i}' for i, col in enumerate(feature_cols)}
        df_features = df_features.rename(columns=feature_rename)
        
        logger.info(f"Loaded {len(df_features.columns)-1} features for {len(df_features)} transactions")
    except Exception as e:
        logger.error(f"Error loading features: {str(e)}")
        raise
    
    # Merge classes and features based on txId
    df_nodes = pd.merge(df_classes, df_features, on='txId', how='inner')
    logger.info(f"Combined data has {len(df_nodes)} transactions with both class and feature information")
    
    return df_nodes, df_edges

def create_node_mapping(df_nodes):
    """
    Create a mapping between transaction IDs and indices for graph construction.
    
    Parameters:
    -----------
    df_nodes : pandas.DataFrame
        DataFrame containing node data
        
    Returns:
    --------
    id2idx : dict
        Dictionary mapping transaction IDs to indices
    """
    # Create mapping of txId to index
    id2idx = {tx_id: i for i, tx_id in enumerate(df_nodes['txId'])}
    logger.info(f"Created mapping for {len(id2idx)} transactions")
    return id2idx

# Try to load processed data from previous notebook
processed_features_path = 'data/processed/features.npy'
processed_labels_path = 'data/processed/labels.npy'

if os.path.exists(processed_features_path) and os.path.exists(processed_labels_path):
    logger.info("Loading processed data from previous notebook")
    features = np.load(processed_features_path)
    labels = np.load(processed_labels_path)
    logger.info(f"Loaded {features.shape[0]} samples with {features.shape[1]} features")
    
    # We also need to load the raw data for graph analysis
    classes_path = 'data/raw/classes.csv'
    edgelist_path = 'data/raw/edgelist.csv'
    features_path = 'data/raw/Features.csv'
    
    if all(os.path.exists(p) for p in [classes_path, edgelist_path, features_path]):
        df_nodes, df_edges = load_data(classes_path, edgelist_path, features_path)
        
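        # Filter out unknown classes (copy so the assignment below does not
        # trigger a SettingWithCopyWarning)
        unknown_mask = df_nodes['class'] == 'unknown'
        if unknown_mask.sum() > 0:
            df_nodes = df_nodes[~unknown_mask].copy()
        
        # Convert class labels to numeric if needed
        if df_nodes['class'].dtype == 'object':
            unique_classes = df_nodes['class'].unique()
            class_map = {cls: i for i, cls in enumerate(unique_classes)}
            # Codes follow order of first appearance; later cells treat
            # class == 1 as fraud, so check this mapping before relying on it
            logger.info(f"Class label mapping: {class_map}")
            df_nodes['class'] = df_nodes['class'].map(class_map)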
        
        # Create node mapping
        id2idx = create_node_mapping(df_nodes)
        
        logger.info("Raw data loaded for graph analysis")
    else:
        logger.warning("Raw data files not found. Please run Data Preparation notebook first.")
else:
    logger.warning("Processed data not found. Please run Data Preparation notebook first.")

## Basic Graph Features

Let's create a graph from the transaction data and compute basic graph metrics for each node.
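
Before building the full graph, here is a minimal plain-Python sketch of two ideas the next cell relies on: edges are kept only when both endpoints are known nodes, and weakly connected components ignore edge direction. The toy node names and the helper `weakly_connected_components` are made up for illustration; the notebook itself uses NetworkX for this.

```python
from collections import defaultdict

def weakly_connected_components(nodes, edges):
    """Group nodes into components, ignoring edge direction."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        # Same filter as in create_graph: drop edges touching unknown nodes
        if src in nodes and dst in nodes:
            neighbors[src].add(dst)
            neighbors[dst].add(src)

    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbors[node] - seen)
        components.append(component)
    return components

toy_nodes = {"a", "b", "c", "d", "e"}
# "x" is not a known node, so the edge ("e", "x") is dropped
toy_edges = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "x")]

toy_components = weakly_connected_components(toy_nodes, toy_edges)
largest = max(len(c) for c in toy_components)
print(len(toy_components), largest)
```

Because unlabeled transactions were filtered out earlier, dropping their edges can split the graph into more components than the raw edge list would suggest.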

In [None]:
def create_graph(df_nodes, df_edges):
    """
    Create a directed graph from transaction data.
    
    Parameters:
    -----------
    df_nodes : pandas.DataFrame
        DataFrame containing node data
    df_edges : pandas.DataFrame
        DataFrame containing edge data
        
    Returns:
    --------
    G : networkx.DiGraph
        Directed graph of transactions
    """
    logger.info("Creating transaction graph")
    start_time = time.time()
    
    # Create directed graph
    G = nx.DiGraph()
    
    # Add nodes with attributes
    for _, row in df_nodes.iterrows():
        node_id = row['txId']
        is_fraud = row['class'] == 1
        G.add_node(node_id, fraud=is_fraud)
    
    # Add edges
    for _, row in df_edges.iterrows():
        source_id, target_id = row['txId1'], row['txId2']
        if source_id in G and target_id in G:
            G.add_edge(source_id, target_id)
    
    logger.info(f"Created graph with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges in {time.time()-start_time:.2f} seconds")
    
    return G

# Create graph
G = create_graph(df_nodes, df_edges)

# Print graph statistics
print("Graph statistics:")
print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")
print(f"Is directed: {nx.is_directed(G)}")
print(f"Is weakly connected: {nx.is_weakly_connected(G)}")

# Get number of connected components
weakly_connected_components = list(nx.weakly_connected_components(G))
print(f"Number of weakly connected components: {len(weakly_connected_components)}")

# Get size of largest component
largest_component_size = max(len(c) for c in weakly_connected_components)
print(f"Size of largest weakly connected component: {largest_component_size} nodes ({largest_component_size/G.number_of_nodes():.2%} of all nodes)")

Let's visualize the degree distribution of the network to better understand its structure.

In [None]:
# Get in-degree and out-degree distributions
in_degrees = [d for n, d in G.in_degree()]
out_degrees = [d for n, d in G.out_degree()]

# Create figure
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot in-degree distribution
axes[0].hist(in_degrees, bins=50, alpha=0.7, color='blue')
axes[0].set_title('In-Degree Distribution')
axes[0].set_xlabel('In-Degree')
axes[0].set_ylabel('Frequency')
axes[0].set_yscale('log')
axes[0].grid(True, alpha=0.3)

# Plot out-degree distribution
axes[1].hist(out_degrees, bins=50, alpha=0.7, color='green')
axes[1].set_title('Out-Degree Distribution')
axes[1].set_xlabel('Out-Degree')
axes[1].set_ylabel('Frequency')
axes[1].set_yscale('log')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print degree statistics
print("In-degree statistics:")
print(f"Min: {min(in_degrees)}")
print(f"Max: {max(in_degrees)}")
print(f"Mean: {np.mean(in_degrees):.2f}")
print(f"Median: {np.median(in_degrees)}")

print("\nOut-degree statistics:")
print(f"Min: {min(out_degrees)}")
print(f"Max: {max(out_degrees)}")
print(f"Mean: {np.mean(out_degrees):.2f}")
print(f"Median: {np.median(out_degrees)}")

Let's compare the degree distributions between fraudulent and legitimate transactions to see if there are any patterns.
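
The comparison boils down to grouping node degrees by class and summarizing each group. A small sketch on a made-up edge list, with hypothetical node names and labels, shows the bookkeeping without NetworkX:

```python
from collections import Counter
from statistics import mean

# Toy directed edges (source, target) and hypothetical fraud labels
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
toy_fraud = {"a": True, "b": False, "c": False, "d": True}

out_deg = Counter(src for src, _ in toy_edges)
in_deg = Counter(dst for _, dst in toy_edges)

def mean_degree_by_class(degree, labels):
    """Average a degree mapping separately for fraud and non-fraud nodes."""
    groups = {True: [], False: []}
    for node, is_fraud in labels.items():
        # Nodes missing from the counter have degree 0
        groups[is_fraud].append(degree.get(node, 0))
    return {cls: mean(vals) for cls, vals in groups.items()}

toy_in_by_class = mean_degree_by_class(in_deg, toy_fraud)
toy_out_by_class = mean_degree_by_class(out_deg, toy_fraud)
print(toy_in_by_class, toy_out_by_class)
```

Means are sensitive to a few high-degree nodes, so on heavy-tailed degree distributions it may be worth looking at medians as well.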

In [None]:
# Separate nodes by class
fraudulent_nodes = [node for node in G.nodes() if G.nodes[node].get('fraud', False)]
legitimate_nodes = [node for node in G.nodes() if not G.nodes[node].get('fraud', False)]

print(f"Number of fraudulent nodes: {len(fraudulent_nodes)}")
print(f"Number of legitimate nodes: {len(legitimate_nodes)}")

# Get in-degree and out-degree for each class
fraud_in_degrees = [G.in_degree(node) for node in fraudulent_nodes]
fraud_out_degrees = [G.out_degree(node) for node in fraudulent_nodes]

legit_in_degrees = [G.in_degree(node) for node in legitimate_nodes]
legit_out_degrees = [G.out_degree(node) for node in legitimate_nodes]

# Create figure
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot in-degree distribution by class
axes[0].hist(legit_in_degrees, bins=30, alpha=0.5, color='blue', label='Legitimate')
axes[0].hist(fraud_in_degrees, bins=30, alpha=0.5, color='red', label='Fraudulent')
axes[0].set_title('In-Degree Distribution by Class')
axes[0].set_xlabel('In-Degree')
axes[0].set_ylabel('Frequency')
axes[0].set_yscale('log')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot out-degree distribution by class
axes[1].hist(legit_out_degrees, bins=30, alpha=0.5, color='blue', label='Legitimate')
axes[1].hist(fraud_out_degrees, bins=30, alpha=0.5, color='red', label='Fraudulent')
axes[1].set_title('Out-Degree Distribution by Class')
axes[1].set_xlabel('Out-Degree')
axes[1].set_ylabel('Frequency')
axes[1].set_yscale('log')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print degree statistics by class
print("Fraudulent transactions:")
print(f"Average in-degree: {np.mean(fraud_in_degrees):.2f}")
print(f"Average out-degree: {np.mean(fraud_out_degrees):.2f}")

print("\nLegitimate transactions:")
print(f"Average in-degree: {np.mean(legit_in_degrees):.2f}")
print(f"Average out-degree: {np.mean(legit_out_degrees):.2f}")

## Centrality Measures

Now, let's compute centrality measures for each node. These measures can help identify important nodes in the network and might be useful for fraud detection.
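
To make PageRank less of a black box, the sketch below runs a plain-Python power iteration on a four-node toy graph. It assumes uniform teleportation and that nodes without outgoing edges spread their rank evenly over all nodes. The function name and toy graph are hypothetical, and this is an illustration of the idea rather than the NetworkX implementation used in the next cell.

```python
def pagerank_sketch(nodes, edges, alpha=0.85, max_iter=100, tol=1e-6):
    """Power-iteration PageRank with uniform teleport and dangling handling."""
    nodes = list(nodes)
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing edges is shared by everyone
        dangling_mass = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - alpha) / n + alpha * dangling_mass / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = alpha * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        err = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if err < n * tol:
            return rank
    raise RuntimeError("power iteration did not converge")

# a and b both point to c, and c points to d (d has no outgoing edges)
toy_rank = pagerank_sketch(["a", "b", "c", "d"],
                           [("a", "c"), ("b", "c"), ("c", "d")])
print(sorted(toy_rank, key=toy_rank.get, reverse=True))
```

Rank flows along edge direction, so nodes downstream of many transactions score highest, which is a different signal from raw in-degree.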

In [None]:
def compute_fast_graph_features(df_nodes, G):
    """
    Compute optimized graph features that are fast to calculate.
    
    Parameters:
    -----------
    df_nodes : pandas.DataFrame
        Dataframe containing node data
    G : networkx.DiGraph
        NetworkX graph of transactions
        
    Returns:
    --------
    graph_features : numpy.ndarray
        Array of graph features for each node
    """
    logger.info("Computing optimized graph features (fast version)")
    start_time = time.time()
    
    # Calculate degrees (fast)
    logger.info("Computing degree centralities...")
    in_degree = dict(G.in_degree())
    out_degree = dict(G.out_degree())
    total_degree = {node: in_degree.get(node, 0) + out_degree.get(node, 0) for node in G.nodes()}
    
    logger.info(f"Calculated degree centralities in {time.time()-start_time:.2f} seconds")
    
    # PageRank (reasonably fast)
    logger.info("Computing PageRank centrality...")
    pagerank_start = time.time()
    try:
        pagerank = nx.pagerank(G, alpha=0.85, max_iter=100)
        logger.info(f"Calculated PageRank in {time.time()-pagerank_start:.2f} seconds")
    except nx.PowerIterationFailedConvergence:
        logger.warning("PageRank failed to converge, retrying with a looser tolerance")
        pagerank = nx.pagerank(G, alpha=0.85, max_iter=50, tol=1e-3)
        logger.info(f"Calculated simplified PageRank in {time.time()-pagerank_start:.2f} seconds")
    
    # HITS algorithm (good for directed networks)
    logger.info("Computing HITS centrality...")
    hits_start = time.time()
    try:
        hubs, authorities = nx.hits(G, max_iter=100)
        logger.info(f"Calculated HITS in {time.time()-hits_start:.2f} seconds")
    except nx.PowerIterationFailedConvergence:
        logger.warning("HITS failed to converge, retrying with a looser tolerance")
        hubs, authorities = nx.hits(G, max_iter=50, tol=1e-3)
        logger.info(f"Calculated simplified HITS in {time.time()-hits_start:.2f} seconds")
    
    # Create feature matrix with 6 features (in-degree, out-degree, total degree, pagerank, hubs, authorities)
    logger.info("Creating feature matrix...")
    graph_features = np.zeros((len(df_nodes), 6))
    
    logger.info("Filling feature matrix...")
    for i, node_id in tqdm(enumerate(df_nodes['txId']), desc="Processing nodes", total=len(df_nodes)):
        # In-degree
        graph_features[i, 0] = in_degree.get(node_id, 0)
        # Out-degree
        graph_features[i, 1] = out_degree.get(node_id, 0)
        # Total degree
        graph_features[i, 2] = total_degree.get(node_id, 0)
        # PageRank
        graph_features[i, 3] = pagerank.get(node_id, 0)
        # Hub score
        graph_features[i, 4] = hubs.get(node_id, 0)
        # Authority score
        graph_features[i, 5] = authorities.get(node_id, 0)
    
    logger.info(f"Generated graph features with shape {graph_features.shape} in {time.time()-start_time:.2f} seconds")
    
    return graph_features

# Compute centrality measures
centrality_features = compute_fast_graph_features(df_nodes, G)

# Create a DataFrame with the centrality features
centrality_df = pd.DataFrame(
    centrality_features,
    columns=['in_degree', 'out_degree', 'total_degree', 'pagerank', 'hub_score', 'authority_score']
)

# Add transaction ID and class for reference
centrality_df['txId'] = df_nodes['txId'].values
centrality_df['class'] = df_nodes['class'].values

# Display statistics of centrality features
print("Centrality features statistics:")
centrality_df.describe()

Let's visualize the distribution of each centrality measure for fraudulent and legitimate transactions.

In [None]:
# Create a figure with 3 rows and 2 columns
fig, axes = plt.subplots(3, 2, figsize=(16, 18))
axes = axes.flatten()

centrality_measures = ['in_degree', 'out_degree', 'total_degree', 'pagerank', 'hub_score', 'authority_score']

for i, measure in enumerate(centrality_measures):
    # Filter values for each class
    fraud_values = centrality_df[centrality_df['class'] == 1][measure]
    legit_values = centrality_df[centrality_df['class'] == 0][measure]
    
    # Plot distributions
    if measure in ['in_degree', 'out_degree', 'total_degree']:  # Discrete values
        axes[i].hist(legit_values, bins=30, alpha=0.5, color='blue', label='Legitimate')
        axes[i].hist(fraud_values, bins=30, alpha=0.5, color='red', label='Fraudulent')
    else:  # Continuous values
        sns.kdeplot(legit_values, ax=axes[i], color='blue', label='Legitimate', fill=True, alpha=0.3)
        sns.kdeplot(fraud_values, ax=axes[i], color='red', label='Fraudulent', fill=True, alpha=0.3)
    
    axes[i].set_title(f'{measure.replace("_", " ").title()} Distribution by Class')
    axes[i].set_xlabel(measure.replace("_", " ").title())
    axes[i].set_ylabel('Density')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)
    
    # Degree panels are histograms of counts, so use a log scale and a count label
    if measure in ['in_degree', 'out_degree', 'total_degree']:
        axes[i].set_yscale('log')
        axes[i].set_ylabel('Frequency')
    
plt.tight_layout()
plt.show()

# Calculate and print the average values for each class
print("Average centrality measures by class:")
print(centrality_df.groupby('class')[centrality_measures].mean())

Let's compute the correlation between centrality measures and the target class to see which measures are most predictive of fraud.
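
With a 0/1 target, the Pearson correlation has a direct reading: it is the gap between the two class means, scaled by the feature's spread and the class balance (the point-biserial form). The sketch below checks that equivalence on made-up numbers; the helper names and toy values are illustrative only.

```python
from math import sqrt
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation using population statistics."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def point_biserial(xs, labels):
    """Class-mean gap, scaled by overall spread and class balance."""
    ones = [x for x, y in zip(xs, labels) if y == 1]
    zeros = [x for x, y in zip(xs, labels) if y == 0]
    p = len(ones) / len(xs)
    return (mean(ones) - mean(zeros)) / pstdev(xs) * sqrt(p * (1 - p))

# Toy degree values and hypothetical 0/1 class labels
toy_degree = [1, 2, 3, 4, 8, 9]
toy_class = [0, 0, 0, 0, 1, 1]

r = pearson(toy_degree, toy_class)
r_pb = point_biserial(toy_degree, toy_class)
print(r, r_pb)  # the two values agree up to rounding
```

This is why the per-class averages printed earlier and the correlations in the next cell tend to tell the same story, and why a strongly imbalanced class split shrinks the correlation even when the class means differ.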

In [None]:
# Calculate correlations with the target
correlations = []
for measure in centrality_measures:
    corr = centrality_df[measure].corr(centrality_df['class'])
    correlations.append((measure, corr))

# Sort by absolute correlation
correlations.sort(key=lambda x: abs(x[1]), reverse=True)

# Display correlations
print("Correlations with the target class:")
for measure, corr in correlations:
    print(f"{measure}: {corr:.4f}")

# Create a bar chart of correlations
plt.figure(figsize=(12, 6))
measures, corrs = zip(*correlations)
plt.bar(measures, corrs, color=['blue' if c > 0 else 'red' for c in corrs])
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.title('Correlation of Centrality Measures with Fraud')
plt.xlabel('Centrality Measure')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45)
plt.grid(True, axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## Clustering Coefficients

Let's compute clustering coefficients, which measure the degree to which nodes in a graph tend to cluster together.
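
For an undirected graph, the local clustering coefficient of a node with k neighbors is the number of edges among those neighbors divided by k(k-1)/2. A plain-Python sketch on a toy graph (a triangle with one extra node attached) makes the definition concrete; the function name and toy graph are hypothetical, and the notebook itself relies on `nx.clustering`.

```python
from itertools import combinations

def local_clustering(edges):
    """Local clustering coefficient per node, treating edges as undirected."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops do not count towards triangles
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    coefficients = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            coefficients[node] = 0.0  # no neighbor pairs to connect
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        coefficients[node] = 2.0 * links / (k * (k - 1))
    return coefficients

# Triangle a-b-c plus a pendant node d attached to c
toy_cc = local_clustering([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
print(toy_cc)
```

Note that converting the directed transaction graph to undirected merges a pair of reciprocal edges into a single edge, so direction plays no role in this feature.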

In [None]:
def compute_clustering_coefficients(df_nodes, G):
    """
    Compute clustering coefficients for nodes in an efficient way.
    
    Parameters:
    -----------
    df_nodes : pandas.DataFrame
        Dataframe containing node data
    G : networkx.Graph
        NetworkX graph
        
    Returns:
    --------
    clustering_features : numpy.ndarray
        Array of clustering coefficients for each node
    """
    logger.info("Computing clustering coefficients...")
    start_time = time.time()
    
    # Convert to undirected for clustering coefficient calculation
    G_undirected = G.to_undirected()
    
    # Initialize features
    clustering_features = np.zeros((len(df_nodes), 1))
    
    # Calculate clustering coefficients for all nodes at once (more efficient)
    clustering_dict = nx.clustering(G_undirected)
    
    # Fill feature array
    for i, node_id in tqdm(enumerate(df_nodes['txId']), desc="Processing clustering", total=len(df_nodes)):
        clustering_features[i, 0] = clustering_dict.get(node_id, 0)
    
    logger.info(f"Generated clustering features in {time.time()-start_time:.2f} seconds")
    
    return clustering_features

# Compute clustering coefficients
try:
    clustering_features = compute_clustering_coefficients(df_nodes, G)
    
    # Add clustering coefficient to the DataFrame
    # Flatten the (n, 1) array so it is assigned as a single 1-D column
    centrality_df['clustering_coefficient'] = clustering_features.ravel()
    
    # Display statistics of clustering coefficients
    print("Clustering coefficient statistics:")
    print(centrality_df['clustering_coefficient'].describe())
    
    # Visualize the distribution of clustering coefficients by class
    plt.figure(figsize=(12, 6))
    
    # Filter values for each class
    fraud_values = centrality_df[centrality_df['class'] == 1]['clustering_coefficient']
    legit_values = centrality_df[centrality_df['class'] == 0]['clustering_coefficient']
    
    # Plot distributions
    sns.kdeplot(legit_values, color='blue', label='Legitimate', fill=True, alpha=0.3)
    sns.kdeplot(fraud_values, color='red', label='Fraudulent', fill=True, alpha=0.3)
    
    plt.title('Clustering Coefficient Distribution by Class')
    plt.xlabel('Clustering Coefficient')
    plt.ylabel('Density')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Calculate and print the average clustering coefficient for each class
    print("Average clustering coefficient by class:")
    print(centrality_df.groupby('class')['clustering_coefficient'].mean())
    
    # Calculate correlation with the target
    clustering_corr = centrality_df['clustering_coefficient'].corr(centrality_df['class'])
    print(f"\nCorrelation with the target class: {clustering_corr:.4f}")
    
except Exception as e:
    logger.warning(f"Error computing clustering coefficients: {str(e)}. Skipping.")
    clustering_features = None

## Temporal Features

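If the dataset contains columns whose names include `time`, we can extract them as temporal features that might be useful for fraud detection. Otherwise the function below returns `None` and the temporal step is skipped.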
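
The detection step is a case-insensitive substring match on column names. The sketch below runs the same match on two made-up column lists: one shaped like the output of `load_data` (where every feature column is renamed to `feature_<i>`), and one with hypothetical time-like names.

```python
def find_time_columns(columns):
    """Return column names that contain 'time', ignoring case."""
    return [col for col in columns if "time" in col.lower()]

# Shaped like the output of load_data: nothing matches
renamed_columns = ["txId", "class", "feature_0", "feature_1"]

# Hypothetical names; note that 'lifetime_value' also matches the substring
other_columns = ["txId", "Time step", "block_timestamp", "lifetime_value"]

no_match = find_time_columns(renamed_columns)
matches = find_time_columns(other_columns)
print(no_match, matches)
```

So unless the node table carries a column with `time` in its name, no temporal features are extracted here, and a substring match can also pick up columns that are not timestamps.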

In [None]:
def extract_temporal_features(df_nodes):
    """
    Extract temporal features if time-related columns are available.
    
    Parameters:
    -----------
    df_nodes : pandas.DataFrame
        Dataframe containing node data
        
    Returns:
    --------
    temporal_features : numpy.ndarray or None
        Array of temporal features, or None if no temporal data exists
    """
    # Check if time-related columns exist
    time_columns = [col for col in df_nodes.columns if 'time' in col.lower()]
    
    if not time_columns:
        logger.info("No temporal features found in the dataset")
        return None
    
    logger.info(f"Extracting temporal features from columns: {time_columns}")
    start_time = time.time()
    
    # Extract temporal features
    temporal_features = df_nodes[time_columns].values
    
    logger.info(f"Extracted temporal features in {time.time()-start_time:.2f} seconds")
    
    return temporal_features

# Extract temporal features
temporal_features = extract_temporal_features(df_nodes)

if temporal_features is not None:
    # Create a DataFrame with the temporal features
    time_columns = [col for col in df_nodes.columns if 'time' in col.lower()]
    temporal_df = pd.DataFrame(temporal_features, columns=time_columns)
    
    # Add transaction ID and class for reference
    temporal_df['txId'] = df_nodes['txId'].values
    temporal_df['class'] = df_nodes['class'].values
    
    # Display statistics of temporal features
    print("Temporal features statistics:")
    print(temporal_df.describe())
    
    # Calculate correlations with the target
    temporal_corrs = []
    for col in time_columns:
        corr = temporal_df[col].corr(temporal_df['class'])
        temporal_corrs.append((col, corr))
    
    # Sort by absolute correlation
    temporal_corrs.sort(key=lambda x: abs(x[1]), reverse=True)
    
    # Display correlations
    print("\nCorrelations with the target class:")
    for col, corr in temporal_corrs:
        print(f"{col}: {corr:.4f}")

## Combining Features

Now, let's combine the original transaction features with the graph-based features we've computed.
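
Combining means stacking the feature blocks side by side and then standardizing each column to zero mean and unit variance. The plain-Python sketch below mirrors that on a two-row toy input. It assumes the population standard deviation and that a constant column keeps a scale of 1, which matches how `StandardScaler` behaves as far as we know; the helper names and numbers are illustrative.

```python
from math import sqrt

def hstack_rows(*blocks):
    """Concatenate feature blocks row by row (same row count in every block)."""
    return [sum((list(row) for row in rows), []) for rows in zip(*blocks)]

def standardize_columns(matrix):
    """Scale each column to zero mean and unit population variance."""
    n = len(matrix)
    scaled_columns = []
    for col in zip(*matrix):
        mu = sum(col) / n
        sigma = sqrt(sum((x - mu) ** 2 for x in col) / n)
        if sigma == 0:
            sigma = 1.0  # constant column: centered values are all 0
        scaled_columns.append([(x - mu) / sigma for x in col])
    return [list(row) for row in zip(*scaled_columns)]

toy_transaction = [[1.0], [3.0]]           # one original feature
toy_graph = [[10.0, 5.0], [30.0, 5.0]]     # two graph features, one constant

toy_combined = hstack_rows(toy_transaction, toy_graph)
toy_scaled = standardize_columns(toy_combined)
print(toy_scaled)
```

The scaling statistics here come from all rows. If the data is later split into training and test sets, fitting the scaler on the training rows only avoids leaking test statistics into the features.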

In [None]:
def combine_features(transaction_features, graph_features, clustering_features=None, temporal_features=None, normalize=True):
    """
    Combine different feature sets and optionally normalize them.
    
    Parameters:
    -----------
    transaction_features : numpy.ndarray
        Original transaction features
    graph_features : numpy.ndarray
        Graph-based features
    clustering_features : numpy.ndarray or None
        Clustering coefficient features
    temporal_features : numpy.ndarray or None
        Temporal features, if available
    normalize : bool
        Whether to normalize the combined features
        
    Returns:
    --------
    combined_features : numpy.ndarray
        Combined features, standardized if normalize is True
    """
    logger.info("Combining feature sets...")
    start_time = time.time()
    
    feature_list = [transaction_features, graph_features]
    feature_types = ["transaction", "graph"]
    feature_counts = [transaction_features.shape[1], graph_features.shape[1]]
    
    if clustering_features is not None:
        feature_list.append(clustering_features)
        feature_types.append("clustering")
        feature_counts.append(clustering_features.shape[1])
    
    if temporal_features is not None:
        feature_list.append(temporal_features)
        feature_types.append("temporal")
        feature_counts.append(temporal_features.shape[1])
    
    # Combine features
    combined_features = np.hstack(feature_list)
    
    # Create a description of the feature combination
    feature_desc = ", ".join([f"{ftype} ({fcount})" for ftype, fcount in zip(feature_types, feature_counts)])
    logger.info(f"Combined {feature_desc} features")
    
    # Normalize features
    if normalize:
        logger.info("Normalizing combined features...")
        scaler = StandardScaler()
        combined_features = scaler.fit_transform(combined_features)
    
    logger.info(f"Final combined features shape: {combined_features.shape}, completed in {time.time()-start_time:.2f} seconds")
    
    return combined_features

# Get the original transaction features
transaction_features = features

# Combine all features
combined_features = combine_features(
    transaction_features=transaction_features,
    graph_features=centrality_features,
    clustering_features=clustering_features,
    temporal_features=temporal_features
)

# Print the shape of the combined features
print(f"Original transaction features shape: {transaction_features.shape}")
print(f"Graph centrality features shape: {centrality_features.shape}")
if clustering_features is not None:
    print(f"Clustering coefficient features shape: {clustering_features.shape}")
if temporal_features is not None:
    print(f"Temporal features shape: {temporal_features.shape}")
print(f"Combined features shape: {combined_features.shape}")

## Feature Importance Analysis

Let's analyze the importance of the features for fraud detection using correlation and mutual information.
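
The two scores can disagree, which is the reason for computing both. In the toy example below the label depends on the feature in a non-monotonic way, so the correlation is zero while the mutual information is clearly positive. The mutual information here is a simple count-based estimate for discrete values, not the nearest-neighbor estimator that `mutual_info_classif` applies to continuous features; the helper names and data are made up.

```python
from collections import Counter
from math import log, sqrt

def abs_pearson(xs, ys):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / sqrt(vx * vy)

def discrete_mutual_info(xs, ys):
    """Plug-in mutual information (in nats) from joint and marginal counts."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

# The label is 1 only for the middle feature value
toy_feature = [0, 0, 1, 1, 2, 2]
toy_label = [0, 0, 1, 1, 0, 0]

toy_corr = abs_pearson(toy_feature, toy_label)
toy_mi = discrete_mutual_info(toy_feature, toy_label)
print(toy_corr, toy_mi)
```

A feature that ranks low on correlation but high on mutual information is a hint that its relationship with the target is non-linear.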

In [None]:
def get_feature_importance(features, labels, method='correlation'):
    """
    Calculate feature importance using the specified method.
    
    Parameters:
    -----------
    features : numpy.ndarray
        Feature matrix
    labels : numpy.ndarray
        Target labels
    method : str
        Method to calculate importance ('correlation', 'mutual_info', etc.)
        
    Returns:
    --------
    feature_importance : numpy.ndarray
        Array of importance scores for each feature
    """
    logger.info(f"Calculating feature importance using {method}...")
    start_time = time.time()
    
    if method == 'correlation':
        # Calculate absolute correlation with target
        correlations = np.zeros(features.shape[1])
        
        for i in tqdm(range(features.shape[1]), desc="Calculating correlations"):
            # Correlation is undefined (NaN) for constant columns; leave their score at 0
            if np.std(features[:, i]) > 0:
                correlations[i] = abs(np.corrcoef(features[:, i], labels)[0, 1])
        
        logger.info(f"Calculated feature importance in {time.time()-start_time:.2f} seconds")
        return correlations
    
    elif method == 'mutual_info':
        # Calculate mutual information
        logger.info("Computing mutual information...")
        # Fix the seed explicitly so scores do not depend on cell execution order
        importance = mutual_info_classif(features, labels, random_state=42)
        
        logger.info(f"Calculated feature importance in {time.time()-start_time:.2f} seconds")
        return importance
    
    else:
        logger.warning(f"Unknown importance method: {method}, using correlation")
        return get_feature_importance(features, labels, method='correlation')

# Calculate feature importance using correlation
correlation_importance = get_feature_importance(combined_features, labels, method='correlation')

# Calculate feature importance using mutual information
mutual_info_importance = get_feature_importance(combined_features, labels, method='mutual_info')

# Create a DataFrame to hold the importance scores
feature_cols = [f"feature_{i}" for i in range(transaction_features.shape[1])]
graph_cols = ['in_degree', 'out_degree', 'total_degree', 'pagerank', 'hub_score', 'authority_score']

all_cols = feature_cols + graph_cols
if clustering_features is not None:
    all_cols.append('clustering_coefficient')
if temporal_features is not None:
    time_cols = [col for col in df_nodes.columns if 'time' in col.lower()]
    all_cols.extend(time_cols)

importance_df = pd.DataFrame({
    'feature': all_cols,
    'correlation_importance': correlation_importance,
    'mutual_info_importance': mutual_info_importance
})

# Sort by mutual information importance
importance_df = importance_df.sort_values('mutual_info_importance', ascending=False).reset_index(drop=True)

# Display the top 20 most important features
print("Top 20 most important features (by mutual information):")
importance_df.head(20)

Let's visualize the top features by importance.

In [None]:
# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

# Plot top 15 features by correlation importance
corr_top = importance_df.sort_values('correlation_importance', ascending=False).head(15)
sns.barplot(x='correlation_importance', y='feature', data=corr_top, ax=axes[0], palette='viridis')
axes[0].set_title('Top 15 Features by Correlation Importance')
axes[0].set_xlabel('Absolute Correlation')
axes[0].set_ylabel('Feature')
axes[0].grid(True, axis='x', alpha=0.3)

# Plot top 15 features by mutual information importance
mi_top = importance_df.sort_values('mutual_info_importance', ascending=False).head(15)
sns.barplot(x='mutual_info_importance', y='feature', data=mi_top, ax=axes[1], palette='viridis')
axes[1].set_title('Top 15 Features by Mutual Information')
axes[1].set_xlabel('Mutual Information')
axes[1].set_ylabel('Feature')
axes[1].grid(True, axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

Let's check if our graph-based features are among the top features.
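
The check is a rank lookup: sort features by score, take the top k, and keep those that belong to the graph-based set. A tiny sketch with made-up scores (not results from this dataset) shows the logic:

```python
# Hypothetical importance scores, for illustration only
toy_scores = {"feature_0": 0.30, "pagerank": 0.22, "feature_1": 0.10, "in_degree": 0.05}
toy_graph_names = {"pagerank", "in_degree"}

# Rank feature names from highest to lowest score
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)

top_k = 3
graph_in_top = [
    (rank, name)
    for rank, name in enumerate(ranked[:top_k], start=1)
    if name in toy_graph_names
]
print(graph_in_top)
```

In the cell below the ranks refer to the mutual information ordering, because `importance_df` was sorted by that column.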

In [None]:
# Check if the graph features are among the top features
graph_cols_set = set(graph_cols)
if clustering_features is not None:
    graph_cols_set.add('clustering_coefficient')

# Get top 20 features
top_features = importance_df.head(20)['feature'].tolist()
top_graph_features = [f for f in top_features if f in graph_cols_set]

print(f"Number of graph-based features in the top 20: {len(top_graph_features)}")
print("Top graph-based features:")
for feature in top_graph_features:
    rank = importance_df[importance_df['feature'] == feature].index[0] + 1
    corr = importance_df[importance_df['feature'] == feature]['correlation_importance'].values[0]
    mi = importance_df[importance_df['feature'] == feature]['mutual_info_importance'].values[0]
    print(f"  {rank}. {feature} - Correlation: {corr:.4f}, Mutual Info: {mi:.4f}")

## Saving Engineered Features

Finally, let's save the engineered features for use in model training.
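
Since the feature matrix and the feature names are saved to separate files, it helps to confirm after loading that they still line up. The sketch below does a round trip of the names file in a temporary directory; the toy names and the column count are placeholders for `all_cols` and `combined_features.shape[1]`.

```python
import os
import tempfile

toy_names = ["feature_0", "in_degree", "pagerank"]
n_columns = 3  # stands in for combined_features.shape[1]

with tempfile.TemporaryDirectory() as tmp_dir:
    names_path = os.path.join(tmp_dir, "feature_names.txt")

    # Write one name per line, as the saving cell does
    with open(names_path, "w") as f:
        for name in toy_names:
            f.write(f"{name}\n")

    # Read the names back and strip the trailing newlines
    with open(names_path) as f:
        loaded_names = [line.rstrip("\n") for line in f]

names_match = loaded_names == toy_names and len(loaded_names) == n_columns
print(names_match)
```

A similar check at the start of the training notebook would catch a names file left over from an earlier run with a different feature set.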

In [None]:
# Save combined features
output_dir = 'data/processed'
os.makedirs(output_dir, exist_ok=True)

np.save(os.path.join(output_dir, 'combined_features.npy'), combined_features)
np.save(os.path.join(output_dir, 'feature_importance.npy'), mutual_info_importance)

# Save feature names
with open(os.path.join(output_dir, 'feature_names.txt'), 'w') as f:
    for name in all_cols:
        f.write(f"{name}\n")

print("Saved engineered features to data/processed/ directory:")
print(f"  - combined_features.npy: Combined features matrix with shape {combined_features.shape}")
print(f"  - feature_importance.npy: Feature importance scores with shape {mutual_info_importance.shape}")
print(f"  - feature_names.txt: Names of all {len(all_cols)} features")

## Summary

In this notebook, we've engineered graph-based features for the Bitcoin transaction fraud detection project. We've:

1. Created a graph representation of the transaction network and analyzed its structure
2. Computed centrality measures (in-degree, out-degree, total degree, PageRank, HITS hub and authority scores) to identify important nodes
3. Computed clustering coefficients to capture local network structure
4. Extracted temporal features (if available)
5. Combined all features and analyzed their importance for fraud detection
6. Saved the engineered features for use in model training

The graph-based features describe each transaction's position in the network, which may improve fraud detection performance; the importance analysis above shows how they rank against the original transaction features. In the next notebook, we'll implement and train Graph Neural Network models using these features.