# Implementation of Correlation Analysis in Smart Contract Security Data

Welcome to the advanced section of our analysis series, where we focus on implementing correlation analysis for binary data using Python. In this notebook, we'll dive into calculating both the Phi coefficient and the Point-Biserial correlation, visualizing the results, and interpreting what these correlations tell us about the relationships between different risk tags in smart contract data.

## Objective
Our goal is to enhance your ability to perform and understand advanced statistical analyses, preparing you for data-driven decision-making in cybersecurity or any field requiring detailed data insight.

Before you begin, ensure you have a basic understanding of Python programming and familiarity with libraries such as pandas, matplotlib, and seaborn. If you're ready, let's start by setting up our environment and loading the data!


### Step 1: Import libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats
import networkx as nx



# Ensure plots are displayed inline in the notebook
%matplotlib inline

print("Libraries imported successfully!")

### Step 2: Download the dataset

This step downloads the Webacy Smart Contract Risk dataset. If you have your own dataset, upload it to the Colab environment instead.

In [None]:
!gdown 1andAuermOWqVXfhsh_AQ3Db93D3BIqgx

In [None]:
print("Setup complete. Imported pandas, seaborn, and matplotlib. Downloaded Webacy risk dataset.")

### Step 3: Load the data

Downloading the dataset does not load it into our Python environment. For that we use the pandas library, which reads the file into a DataFrame.

In [None]:
# Loading the dataset

df = pd.read_excel('/path/to/data')

# Display the first five rows of the dataframe
df.head()

## Calculate Correlation

To calculate the Phi coefficient, which is suitable for pairs of binary variables, we first need to establish a function that can handle this calculation:

In [None]:
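def phi_coefficient(x, y):
    """Calculate the signed Phi coefficient for two binary variables."""
    # Create a contingency table of observed counts
    contingency_table = pd.crosstab(x, y)
    # Chi-square statistic without Yates' continuity correction
    chi2 = scipy.stats.chi2_contingency(contingency_table, correction=False)[0]
    n = contingency_table.to_numpy().sum()
    # sqrt(chi2 / n) gives only the magnitude of phi
    phi = np.sqrt(chi2 / n)
    # For a 2x2 table, recover the sign from the cross-product difference ad - bc
    if contingency_table.shape == (2, 2):
        (a, b), (c, d) = contingency_table.to_numpy()
        phi = phi * np.sign(a * d - b * c)
    return phi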

# Example calculation between two risk tags
phi = phi_coefficient(df['Is_honeypot'], df['anti_whale_modifiable'])
print(f"Phi Coefficient between 'Is_honeypot' and 'anti_whale_modifiable': {phi}")

A Phi value close to 0 indicates little or no association between the two columns.

**Note:** Phi values range from -1 to 1. A negative value means the two tags are inversely related: when one is present, the other tends to be absent. A positive value means the two tags tend to be present, or absent, together.
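
To make the sign of Phi concrete, here is a minimal sketch that computes it directly from the four cell counts of a 2x2 table, using φ = (ad − bc) / √((a+b)(c+d)(a+c)(b+d)). The helper `signed_phi` and the toy series are made up for illustration and are not part of the Webacy dataset.

```python
import numpy as np
import pandas as pd

def signed_phi(x, y):
    """Signed phi from the cell counts of a 2x2 table (hypothetical helper)."""
    table = pd.crosstab(x, y).to_numpy()
    # a: both tags 0, d: both tags 1, b and c: the tags disagree
    (a, b), (c, d) = table
    denom = np.sqrt(float((a + b) * (c + d) * (a + c) * (b + d)))
    return (a * d - b * c) / denom

# Toy data, made up for illustration only
tag_a = pd.Series([1, 1, 1, 0, 0, 0, 1, 0])
tag_b = pd.Series([1, 1, 0, 0, 0, 1, 1, 0])  # agrees with tag_a in 6 of 8 rows
tag_c = 1 - tag_a                             # exact opposite of tag_a

phi_ab = signed_phi(tag_a, tag_b)
phi_ac = signed_phi(tag_a, tag_c)
print(f"phi(tag_a, tag_b) = {phi_ab:.2f}")  # 0.50
print(f"phi(tag_a, tag_c) = {phi_ac:.2f}")  # -1.00
```

Tags that mostly agree give a positive value, and tags that always disagree give -1. This sketch assumes both series take both values; a constant column makes the denominator zero.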

Let's now define the risk columns of our dataset.

In [None]:
risk_columns = ['Is_closed_source', 'hidden_owner', 'anti_whale_modifiable',
       'Is_anti_whale', 'Is_honeypot', 'buy_tax', 'sell_tax',
       'slippage_modifiable', 'Is_blacklisted', 'can_take_back_ownership',
       'owner_change_balance', 'is_airdrop_scam', 'selfdestruct', 'trust_list',
       'is_whitelisted', 'is_fake_token', 'illegal_unicode', 'exploitation',
       'bad_contract', 'reusing_state_variable', 'encode_packed_collision',
       'encode_packed_parameters', 'centralized_risk_medium',
       'centralized_risk_high', 'centralized_risk_low', 'event_setter',
       'external_dependencies', 'immutable_states',
       'reentrancy_without_eth_transfer', 'incorrect_inheritance_order',
       'shadowing_local', 'events_maths']
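
The Phi coefficient assumes that every column holds exactly two values. Whether that is true for each column listed above is not stated here, so it can be worth checking before the pairwise calculation. The following is a minimal sketch; `non_binary_columns` is a hypothetical helper, and the toy frame stands in for `df`.

```python
import pandas as pd

def non_binary_columns(frame, columns):
    """Return the columns whose non-missing values are not all 0/1 or False/True."""
    bad = []
    for col in columns:
        # tolist() converts to plain Python values, so True == 1 and False == 0
        values = set(frame[col].dropna().tolist())
        if not values <= {0, 1}:
            bad.append(col)
    return bad

# Toy frame standing in for df; the column names are made up
toy = pd.DataFrame({
    "flag_int": [0, 1, 1, 0],
    "flag_bool": [True, False, True, True],
    "tax_pct": [0.0, 2.5, 10.0, 0.0],
})
problem_cols = non_binary_columns(toy, toy.columns)
print(problem_cols)  # ['tax_pct']
```

On the real data the call would look like `non_binary_columns(df, risk_columns)`. Any column it reports would need to be binarized or excluded before Phi is meaningful.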

Now we will calculate the Phi coefficient for every pair of risk columns.

In [None]:
risk_df = df[risk_columns]

# Create a DataFrame to store Phi coefficients
phi_matrix = pd.DataFrame(index=risk_df.columns, columns=risk_df.columns, dtype=float)

# Calculate Phi coefficient for each pair of binary variables
for var1 in risk_df.columns:
    for var2 in risk_df.columns:
        phi_matrix.loc[var1, var2] = phi_coefficient(risk_df[var1], risk_df[var2])

print("Phi coefficients calculated for all pairs of variables:")
phi_matrix
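
For columns coded as 0/1, the Phi coefficient is equal to the Pearson correlation, so `DataFrame.corr()` can serve as a quick cross-check on the loop. The sketch below uses a toy frame with made-up column names in place of `risk_df`.

```python
import numpy as np
import pandas as pd

# Toy 0/1 data standing in for risk_df; the column names are made up
toy = pd.DataFrame({
    "tag_a": [1, 1, 1, 0, 0, 0, 1, 0],
    "tag_b": [1, 1, 0, 0, 0, 1, 1, 0],
    "tag_c": [0, 0, 0, 1, 1, 1, 0, 1],
})

# Pearson correlation of 0/1 columns is the signed phi coefficient
phi_via_corr = toy.astype(float).corr(method="pearson")
print(phi_via_corr.round(2))
```

This route computes the whole matrix in one call and keeps the sign. It returns `NaN` for any column that is constant, which makes such columns easy to spot.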


Although we now have the full correlation matrix, a 32 × 32 table of numbers is difficult to read. One option is to display only the correlations that are strongly positive or strongly negative.

A better option is to visualize the matrix as a heatmap.

In [None]:
# Setting the size of the plot
plt.figure(figsize=(12, 10))

# Creating a heatmap
sns.heatmap(phi_matrix.astype(float), annot=False, fmt=".2f", cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Heatmap of Phi Coefficients Between Risk Tags')
plt.show()


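You can experiment with variations of this heatmap to make the trends easier to see. The version below hides the diagonal and every cell whose absolute value falls below a threshold, and annotates the cells that remain.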

In [None]:
# Set the figure size
plt.figure(figsize=(19, 12))

# Creating a Filtered Heatmap
threshold=0.2 # set threshold

phi_matrix = phi_matrix.astype(float)

# Create mask for low correlations and diagonal
mask = np.abs(phi_matrix) < threshold
mask = mask.to_numpy()
np.fill_diagonal(mask, True)  # Hide diagonal

# Plot heatmap with improved formatting
sns.heatmap(phi_matrix,
            mask=mask,
            cmap='RdBu_r',
            vmin=-1,
            vmax=1,
            center=0,
            annot=True,
            fmt='.1f',
            square=True,
            cbar_kws={'label': 'Phi Coefficient'})

plt.title(f'Correlation Heatmap (|φ| > {threshold})')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


In [None]:
# Top Correlations Table
# Get upper triangle
upper_tri = phi_matrix.where(np.triu(np.ones(phi_matrix.shape), k=1).astype(bool))
stacked_corrs = upper_tri.stack()
strong_corrs = stacked_corrs[np.abs(stacked_corrs) > threshold]
strong_corrs = strong_corrs.sort_values(ascending=False)

print("\nTop Positive Correlations:")
print(strong_corrs[strong_corrs > 0].head(10))
print("\nTop Negative Correlations:")
# Sort ascending so the most negative values come first
print(strong_corrs[strong_corrs < 0].sort_values().head(10))
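
The threshold above filters on the strength of the association only. To also judge whether a pair could plausibly be unrelated, the p-value from the same `scipy.stats.chi2_contingency` call can be reported alongside Phi. This is a minimal sketch on made-up data; `phi_p_value` is a hypothetical helper.

```python
import pandas as pd
import scipy.stats

def phi_p_value(x, y):
    """Chi-square p-value for the association between two binary series."""
    table = pd.crosstab(x, y)
    # Index 0 of the result is the statistic, index 1 is the p-value
    return scipy.stats.chi2_contingency(table, correction=False)[1]

# Toy data, made up for illustration: tag_b copies tag_a except in 4 of 40 rows
tag_a = pd.Series([0, 1] * 20)
tag_b = tag_a.copy()
tag_b.iloc[:4] = 1 - tag_b.iloc[:4]

p = phi_p_value(tag_a, tag_b)
print(f"p-value: {p:.3g}")
```

A small p-value suggests the association is unlikely to be due to chance alone. The chi-square approximation is less reliable when expected cell counts are small, which can happen with rare risk tags.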



You can also draw the strong correlations as a network graph. We recommend reducing the number of features before doing so, because a graph with many nodes quickly becomes hard to read.

In [None]:
# Simplified Network Graph
plt.figure(figsize=(15, 15))
G = nx.Graph()

# Add edges for strong correlations
for i in range(len(phi_matrix.columns)):
    for j in range(i + 1, len(phi_matrix.columns)):
        corr = phi_matrix.iloc[i, j]
        if abs(corr) > threshold:
            G.add_edge(phi_matrix.columns[i],
                      phi_matrix.columns[j],
                      weight=abs(corr))

# Draw network
pos = nx.spring_layout(G, k=1, iterations=50)

# Draw nodes
nx.draw_networkx_nodes(G, pos, node_size=2000, node_color='lightblue', alpha=0.6)

# Draw edges with width proportional to correlation strength
edge_weights = [G[u][v]['weight'] * 5 for u, v in G.edges()]
nx.draw_networkx_edges(G, pos, width=edge_weights, alpha=0.5)

# Add labels
nx.draw_networkx_labels(G, pos, font_size=8, font_weight='bold')

plt.title('Network of Strong Correlations')
plt.axis('off')
plt.tight_layout()
plt.show()
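
One possible way to reduce the number of features before drawing the graph is to keep only the columns that have at least one strong correlation with another column. The sketch below shows the idea on a made-up matrix; `columns_with_strong_links` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def columns_with_strong_links(corr_matrix, threshold):
    """Keep columns whose largest off-diagonal |correlation| exceeds the threshold."""
    abs_corr = corr_matrix.abs().to_numpy(dtype=float, copy=True)
    np.fill_diagonal(abs_corr, 0.0)  # ignore each column's correlation with itself
    strongest = np.nanmax(abs_corr, axis=1)
    return [col for col, s in zip(corr_matrix.columns, strongest) if s > threshold]

# Toy correlation matrix, made up for illustration
names = ["tag_a", "tag_b", "tag_c", "tag_d"]
toy_matrix = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.0],
     [0.6, 1.0, -0.3, 0.05],
     [0.1, -0.3, 1.0, 0.1],
     [0.0, 0.05, 0.1, 1.0]],
    index=names, columns=names,
)
kept = columns_with_strong_links(toy_matrix, threshold=0.2)
reduced = toy_matrix.loc[kept, kept]
print(kept)  # ['tag_a', 'tag_b', 'tag_c']
```

With the real data, `columns_with_strong_links(phi_matrix, threshold)` would give the subset to pass to the graph code, and raising the threshold shrinks the graph further.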