# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Run `!nvidia-smi` to see which GPU you have; uncomment the cell below to do that.  Currently, RAPIDS runs on all available Colab GPU instances.

In [1]:
# !nvidia-smi
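
If you would rather check for a GPU from Python than read the `nvidia-smi` table by eye, a minimal sketch along these lines may help. The helper name `gpu_names` is hypothetical; the sketch assumes `nvidia-smi` is on the PATH whenever a GPU runtime is attached.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # no driver tools on PATH, so this is probably a CPU runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
print(names if names else "No GPU detected; change the runtime type to GPU.")
```

An empty list means the runtime type should be changed before running the install cell.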

# Setup #
This setup script:

1. Checks that the GPU is RAPIDS compatible
1. Installs the **current stable version** of the RAPIDS core libraries using pip:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuxFilter
  1. cuCIM
  1. xgboost

**This will complete in about 5-6 minutes**

If you need the **nightly** releases of RAPIDS, use the [RAPIDS Conda Colab Template notebook](https://colab.research.google.com/drive/1TAAi_szMfWqRfHVfjGSqnGVLr_ztzUM9) and set the nightly parameter option when running the RAPIDS installation cell.


In [1]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 460, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (100/100), done.
remote: Total 460 (delta 131), reused 124 (delta 91), pack-reused 269
Receiving objects: 100% (460/460), 126.19 KiB | 1.00 MiB/s, done.
Resolving deltas: 100% (233/233), done.
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 2.3 MB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0
***********************************************************************
Woo! Your instance has a Tesla T4 GPU!
We will install the latest stable RAPIDS via pip 24.2.*!  Please stand by, should be quick...
***********************************************************************

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cudf-cu12==24.2.*
  Downloading https://pypi.nvidia.

# RAPIDS is now installed on Colab.  
You can copy your code into the cells below, or run them as-is to validate your RAPIDS installation and version.  
# Enjoy!

In [2]:
import cudf
cudf.__version__

'24.02.01'

In [3]:
import cuml
cuml.__version__

'24.02.00'

In [4]:
import cugraph
cugraph.__version__

'24.02.00'

In [5]:
import cuspatial
cuspatial.__version__

'24.02.00'

In [6]:
import cuxfilter
cuxfilter.__version__

'24.02.00'
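
The five version checks above can also be run in one pass. This is a minimal sketch, not part of the RAPIDS install script; the helper name `installed_versions` is hypothetical, and any import failure is simply reported as "not installed".

```python
import importlib

# Libraries checked one by one in the cells above.
LIBRARIES = ["cudf", "cuml", "cugraph", "cuspatial", "cuxfilter"]


def installed_versions(names):
    """Map each module name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:  # treat any import failure as "not available"
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in installed_versions(LIBRARIES).items():
    print(f"{name:10s} {version or 'not installed'}")
```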

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib

In [7]:
!pip install graphistry
!pip install torch_geometric

Collecting graphistry
  Downloading graphistry-0.33.2-py3-none-any.whl (244 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 244.9/244.9 kB 5.0 MB/s eta 0:00:00
Collecting palettable>=3.0 (from graphistry)
  Downloading palettable-3.3.3-py2.py3-none-any.whl (332 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 332.3/332.3 kB 14.8 MB/s eta 0:00:00
Collecting squarify (from graphistry)
  Downloading squarify-0.4.3-py3-none-any.whl (4.3 kB)
Installing collected packages: squarify, palettable, graphistry
Successfully installed graphistry-0.33.2 palettable-3.3.3 squarify-0.4.3
Collecting torch_geometric
  Downloading torch_geometric-2.5.0-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 8.3 MB/s eta 0:00:00
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.5.0


In [8]:
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn.functional as F
import torch.optim as optim
from networkx.algorithms import community
from sklearn.cluster import DBSCAN, KMeans
from sklearn.neighbors import NearestNeighbors
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, Node2Vec
from torch_geometric.utils import degree, to_networkx
from torch_geometric.utils.convert import from_networkx

sns.set()

In [9]:
anomalies_full_df_1 = pd.read_csv('final_upd_3.csv')
anomalies_full_df_1.drop('Unnamed: 0', axis =1, inplace = True)

In [10]:
anomalies_full_df = pd.read_csv('final_upd_2.csv')
anomalies_full_df.drop('Unnamed: 0', axis =1, inplace = True)
# Take an explicit copy so later column assignments do not hit pandas' SettingWithCopyWarning
df_test = anomalies_full_df[['cust_id','betweenness_centrality','degree','eigenvector_centrality','pagerank']].copy()

In [11]:
# Rank 1 is the highest value of each centrality measure; ties share the average rank
for column in ['betweenness_centrality', 'degree', 'eigenvector_centrality', 'pagerank']:
    df_test[f'{column}_rank'] = df_test[column].rank(method='average', ascending=False)
df_test.head(1)

Unnamed: 0,cust_id,betweenness_centrality,degree,eigenvector_centrality,pagerank,betweenness_centrality_rank,degree_rank,eigenvector_centrality_rank,pagerank_rank
0,CUST82758793,0.000166,17.0,6e-06,1.9e-05,8545.0,9554.5,9684.0,6312.0
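
Half-integer ranks such as `9554.5` in the output above come from `method='average'`: tied values share the mean of the positions they occupy. A toy example (the data here is made up for illustration) shows the behaviour.

```python
import pandas as pd

# Toy degrees with a tie between customers B and C.
toy = pd.DataFrame({"cust_id": ["A", "B", "C", "D"], "degree": [5, 3, 3, 1]})

# ascending=False gives rank 1 to the largest value; method="average" gives
# tied rows the mean of the positions they span (here positions 2 and 3).
toy["degree_rank"] = toy["degree"].rank(method="average", ascending=False)
print(toy)
```

B and C both receive rank 2.5, and D receives rank 4.0.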


In [12]:
df_test_sorted = df_test.sort_values(by=['degree_rank', 'betweenness_centrality_rank', 'eigenvector_centrality_rank', 'pagerank_rank'], ascending=True)
# High betweenness centrality nodes
high_betweenness_nodes = df_test[df_test['betweenness_centrality'] > df_test['betweenness_centrality'].quantile(0.9)]
# Normalizing centrality measures
for col in ['betweenness_centrality', 'degree', 'eigenvector_centrality', 'pagerank']:
    df_test[f'{col}_norm'] = (df_test[col] - df_test[col].min()) / (df_test[col].max() - df_test[col].min())

# Creating a composite score
df_test['composite_brokerage_score'] = df_test[['betweenness_centrality_norm', 'degree_norm', 'eigenvector_centrality_norm', 'pagerank_norm']].mean(axis=1)

# Identifying top brokerage nodes based on the composite score
top_brokerage_nodes = df_test.sort_values(by='composite_brokerage_score', ascending=False)
top_brokerage_nodes

Unnamed: 0,cust_id,betweenness_centrality,degree,eigenvector_centrality,pagerank,betweenness_centrality_rank,degree_rank,eigenvector_centrality_rank,pagerank_rank,betweenness_centrality_norm,degree_norm,eigenvector_centrality_norm,pagerank_norm,composite_brokerage_score
188082,CUST36405209,0.000167,20.0,5.102206e-01,0.000142,8455.0,4778.5,1.0,1.0,0.178731,0.512821,1.000000,1.000000,0.672888
169376,CUST45945736,0.000788,39.0,2.043833e-05,0.000046,11.0,1.0,4361.0,222.0,0.843867,1.000000,0.000040,0.323601,0.541877
16289,CUST71866300,0.000416,24.0,1.478026e-01,0.000113,634.0,1229.0,5.0,3.0,0.445364,0.615385,0.289684,0.791945,0.535594
57313,CUST48479576,0.000887,36.0,7.827581e-06,0.000034,5.0,2.5,8417.0,979.0,0.949636,0.923077,0.000015,0.241124,0.528463
62775,CUST27817706,0.000568,28.0,1.579812e-03,0.000094,132.0,197.5,86.0,6.0,0.608570,0.717949,0.003096,0.660032,0.497412
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16993,CUST66490945,0.000000,0.0,-1.694066e-21,0.000000,166750.0,181435.5,181237.0,181435.5,0.000000,0.000000,0.000000,0.000000,0.000000
16992,CUST55679921,0.000000,0.0,-1.694066e-21,0.000000,166750.0,181435.5,181237.0,181435.5,0.000000,0.000000,0.000000,0.000000,0.000000
121060,CUST27414005,0.000000,0.0,-1.694066e-21,0.000000,166750.0,181435.5,181237.0,181435.5,0.000000,0.000000,0.000000,0.000000,0.000000
121050,CUST59006248,0.000000,0.0,-1.694066e-21,0.000000,166750.0,181435.5,181237.0,181435.5,0.000000,0.000000,0.000000,0.000000,0.000000
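
The composite score is the unweighted mean of the four min-max normalized measures, so a node can rank first while being modest on one measure, as the top row above shows. The sketch below reproduces the calculation on toy data; the `min_max` helper and its handling of a constant column are assumptions added for illustration, since the notebook's inline formula would divide by zero in that case.

```python
import pandas as pd

toy = pd.DataFrame(
    {"degree": [0.0, 10.0, 20.0], "pagerank": [0.1, 0.1, 0.1]},
    index=["A", "B", "C"],
)


def min_max(series):
    """Scale a Series to [0, 1]; a constant column maps to 0.0 instead of NaN."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span


normalized = toy.apply(min_max)       # column-wise normalization
composite = normalized.mean(axis=1)   # equal weight for every measure
print(composite)
```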


In [13]:
top_1_percent_cutoff = top_brokerage_nodes['composite_brokerage_score'].quantile(0.99)  # Adjust quantile for different cutoffs
highly_influential_nodes = top_brokerage_nodes[top_brokerage_nodes['composite_brokerage_score'] >= top_1_percent_cutoff]
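
To see what the 0.99 quantile cutoff keeps, here is a small sketch on made-up scores. With pandas' default linear interpolation the cutoff falls between the two largest values, so only scores at or above it survive the filter.

```python
import pandas as pd

scores = pd.Series(range(1, 101), dtype=float)  # toy scores 1.0 .. 100.0
cutoff = scores.quantile(0.99)  # interpolated between the 99th and 100th values
top = scores[scores >= cutoff]
print(cutoff, len(top))
```

Lowering the quantile (for example to 0.95) widens the set of nodes treated as highly influential.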

In [14]:
highly_influential_nodes.head(1)

Unnamed: 0,cust_id,betweenness_centrality,degree,eigenvector_centrality,pagerank,betweenness_centrality_rank,degree_rank,eigenvector_centrality_rank,pagerank_rank,betweenness_centrality_norm,degree_norm,eigenvector_centrality_norm,pagerank_norm,composite_brokerage_score
188082,CUST36405209,0.000167,20.0,0.510221,0.000142,8455.0,4778.5,1.0,1.0,0.178731,0.512821,1.0,1.0,0.672888


In [15]:
anomalies_df = anomalies_full_df_1[anomalies_full_df_1.anomaly == -1]
df_merged = anomalies_df.merge(highly_influential_nodes[['cust_id','composite_brokerage_score']], how = 'inner', on = 'cust_id')
influential_customers = df_merged.cust_id.tolist()

In [16]:
# Load the provided CSV files
transactions_combined = pd.read_csv('transactions_combined_up.csv').drop('Unnamed: 0', axis =1)
transactions_combined.head(1)

Unnamed: 0,cust_id_sender,cust_id_receiver,amount,count
0,CUST10000513,CUST13934055,360.0,1


In [17]:
transactions_grouped = transactions_combined.copy()

In [18]:
filtered_transactions = transactions_combined[
    transactions_combined['cust_id_sender'].isin(influential_customers) |
    transactions_combined['cust_id_receiver'].isin(influential_customers)
]
# Now group by sender and receiver, and sum the amounts
transactions_grouped = filtered_transactions.copy()

In [19]:
all_cust_ids = pd.concat([transactions_grouped['cust_id_sender'], transactions_grouped['cust_id_receiver']]).unique()
cust_id_to_index = {cust_id: i for i, cust_id in enumerate(all_cust_ids)}
transactions_grouped['sender_idx'] = transactions_grouped['cust_id_sender'].map(cust_id_to_index)
transactions_grouped['receiver_idx'] = transactions_grouped['cust_id_receiver'].map(cust_id_to_index)
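
The dictionary-based mapping above assigns one integer per customer across both columns. As an alternative sketch (not what the notebook uses), `pd.factorize` on the stacked columns yields the same kind of shared id space without building a Python dict; the toy frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"cust_id_sender": ["A", "B", "A"], "cust_id_receiver": ["B", "C", "C"]}
)

# Factorize senders and receivers together so both columns share one id space.
stacked = pd.concat(
    [toy["cust_id_sender"], toy["cust_id_receiver"]], ignore_index=True
)
codes, uniques = pd.factorize(stacked)

n = len(toy)
toy["sender_idx"] = codes[:n]    # first half of the codes belongs to senders
toy["receiver_idx"] = codes[n:]  # second half belongs to receivers
print(toy)
print(list(uniques))             # position in this list is the integer id
```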

In [20]:
transactions_combined_cudf = cudf.from_pandas(transactions_grouped)

In [21]:
transactions_combined_cudf.head(1)
print(len(transactions_combined_cudf))

43797


# PageRank Plot #

In [22]:
tf = transactions_combined_cudf[['sender_idx','receiver_idx','amount']]

In [23]:
tf.head(1)

Unnamed: 0,sender_idx,receiver_idx,amount
58,0,2158,1924.0


In [26]:
import graphistry

# Register with your Graphistry account (replace the placeholders with your own username and password)
graphistry.register(api=3, username='-------', password='--------')
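
One way to keep credentials out of the notebook is to read them from the environment and prompt only when they are missing. This is a sketch under assumptions: the helper `graphistry_credentials` and the variable names `GRAPHISTRY_USERNAME` and `GRAPHISTRY_PASSWORD` are hypothetical choices, not names Graphistry defines.

```python
import os
from getpass import getpass


def graphistry_credentials(env=os.environ):
    """Read credentials from the environment, prompting only if they are missing."""
    # The variable names below are hypothetical; use whatever your setup provides.
    username = env.get("GRAPHISTRY_USERNAME") or input("Graphistry username: ")
    password = env.get("GRAPHISTRY_PASSWORD") or getpass("Graphistry password: ")
    return {"api": 3, "username": username, "password": password}


# Usage (requires a Graphistry account):
# graphistry.register(**graphistry_credentials())
```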

In [27]:
g = graphistry.edges(tf, tf.columns[0], tf.columns[1])
g._edges.sample(5)

Unnamed: 0,sender_idx,receiver_idx,amount
181127,6388,31803,104.5
3339,89,23856,643.0
558947,22964,1588,4.0
105098,3646,28364,2670.0
65971,2282,19530,2796.0


In [28]:
g2 = g.compute_cugraph('pagerank')
g2._nodes.sample(5)



Unnamed: 0,id,pagerank
17442,25970,2.7e-05
34392,33032,2.1e-05
28730,22765,1.2e-05
11178,522,1.2e-05
1851,7387,1.2e-05
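
For intuition about the `pagerank` column, the sketch below runs a plain power iteration on a tiny directed graph. It is an unweighted, pure-Python illustration with assumed defaults (damping 0.85, 50 iterations), not cuGraph's implementation.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a list of directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


scores = pagerank([(0, 1), (1, 2), (2, 0), (3, 0)])
print(scores)
```

Node 3 has no incoming edges, so it ends up with the lowest score; the scores sum to 1.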


In [29]:
g2.encode_point_color('pagerank', ['blue', 'yellow', 'red'], as_continuous=True).plot()

In [30]:
g3 = g2.layout_cugraph('force_atlas2')
g3._nodes.sample(5)



Unnamed: 0,id,pagerank,x,y
28695,23528,1.2e-05,-3700.764648,2702.368652
19101,17549,1.2e-05,-10516.560547,2689.90332
16959,23839,2.1e-05,8470.984375,-9310.243164
21497,23609,1.2e-05,-3324.848633,3758.442383
34330,32315,2.1e-05,1608.307739,-758.456116


In [31]:
g3b = g2.layout_cugraph('force_atlas2', params={'lin_log_mode': True})
g3b._nodes.sample(5)



Unnamed: 0,id,pagerank,x,y
8908,588,2.5e-05,15986.702148,1647.134399
32818,33426,2e-05,-34927.90625,-63427.253906
30104,21035,1.2e-05,109013.148438,44291.578125
16867,25059,2.8e-05,128760.382812,-17683.785156
22775,20488,1.2e-05,12111.095703,-132496.203125


In [32]:
g3.plot()

# Plot Direct and Indirect Graphs Using Graphistry #

In [33]:
G_cugraph = cugraph.Graph(directed=True)
G_cugraph.from_cudf_edgelist(
    transactions_combined_cudf,
    source='sender_idx',
    destination='receiver_idx',
    edge_attr=['amount'],  # Use 'amount' as the edge weight ('count' is not included)
    renumber=False
)



In [34]:
influential_indices = [cust_id_to_index[cust_id] for cust_id in influential_customers if cust_id in cust_id_to_index]
influential_indices_cudf = cudf.Series(influential_indices)

In [35]:
# Create a DataFrame of all nodes in the graph
all_nodes = cudf.Series(all_cust_ids).to_frame(name='cust_id')
all_nodes['index'] = all_nodes['cust_id'].map(cust_id_to_index)

# Initialize all nodes as 'grey'
all_nodes['color'] = 'grey'

# Mark influential nodes as 'red'
all_nodes.loc[all_nodes['index'].isin(influential_indices_cudf), 'color'] = 'red'

In [36]:
# Bind the edges from your transactions DataFrame
plotter = graphistry.edges(transactions_combined_cudf, 'sender_idx', 'receiver_idx')

# all_nodes is a cuDF DataFrame with 'cust_id', 'index', and 'color' columns;
# convert it to pandas before handing it to Graphistry
all_nodes_pd = all_nodes.to_pandas()

# Bind the nodes, using the 'index' column as the node ID
plotter = plotter.nodes(all_nodes_pd, 'index')

# Specify how to encode node colors: directly use the 'color' column from all_nodes DataFrame
plotter = plotter.bind(point_color='color')

# Plot
plotter.plot()



# Direct and Indirect Graphs with Similarity #

In [37]:
import ast
indirect_edges = pd.read_csv('subG_indirect_edges.csv')
indirect_edges['amount'] = indirect_edges['amount'].apply(lambda x: ast.literal_eval(x)['weight'])
direct_edges = pd.read_csv('subG_direct_edges.csv')
direct_edges['amount'] = direct_edges['amount'].apply(lambda x: ast.literal_eval(x)['weight'])
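
The `amount` column in these files holds a stringified dict rather than a number, which is why each cell is parsed before use. A minimal sketch of that step follows; the sample string is a hypothetical cell value shaped like the ones the lambda above expects.

```python
import ast

# Hypothetical cell value: an edge-attribute dict that was saved as text.
raw = "{'weight': 360.0}"

# literal_eval only accepts Python literals, so unlike eval() it cannot run code.
parsed = ast.literal_eval(raw)
amount = parsed["weight"]
print(amount)
```

`json.loads` would reject this string because it uses single quotes, which is one reason to reach for `ast.literal_eval` here.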

In [38]:
len(direct_edges)

12202

In [39]:
len(indirect_edges)

75151

In [41]:
direct_nodes = pd.concat([direct_edges['cust_id_sender'], direct_edges['cust_id_receiver']]).unique()
len(direct_nodes)

9340

In [42]:
indirect_nodes = pd.concat([indirect_edges['cust_id_sender'], indirect_edges['cust_id_receiver']]).unique()
len(indirect_nodes)

39802

In [116]:
# Start with every edge flagged as not direct
indirect_edges['is_direct'] = False
# Flag edges whose (sender, receiver) pair also appears in direct_edges
direct_edge_tuples = set(zip(direct_edges['cust_id_sender'], direct_edges['cust_id_receiver']))
indirect_edges['is_direct'] = indirect_edges.apply(lambda x: (x['cust_id_sender'], x['cust_id_receiver']) in direct_edge_tuples, axis=1)
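
The row-wise `apply` above can be slow on tens of thousands of edges. One vectorized alternative, shown here on made-up frames as a sketch rather than a drop-in replacement, is a left merge with `indicator=True`.

```python
import pandas as pd

direct = pd.DataFrame({"cust_id_sender": ["A"], "cust_id_receiver": ["B"]})
indirect = pd.DataFrame(
    {"cust_id_sender": ["A", "B", "C"], "cust_id_receiver": ["B", "C", "A"]}
)

keys = ["cust_id_sender", "cust_id_receiver"]
# Left-join against the de-duplicated direct pairs; the _merge column records
# whether each row found a match ("both") or not ("left_only").
flagged = indirect.merge(
    direct[keys].drop_duplicates(), on=keys, how="left", indicator=True
)
indirect["is_direct"] = (flagged["_merge"] == "both").to_numpy()
print(indirect)
```

De-duplicating the direct pairs first keeps the merge from multiplying rows, so the flags line up with the original edge order.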

In [118]:
# Bind edges to the Graphistry plotter
plotter = graphistry.edges(indirect_edges).bind(source="cust_id_sender", destination="cust_id_receiver")

# Optional: Encode edge colors based on the 'is_direct' flag
plotter = plotter.encode_edge_color("is_direct", ["blue", "red"], as_categorical=True)

# Plot the graph
plotter.plot()

In [121]:
import pandas as pd
import graphistry

# Step 1: Preparing Node Data
nodes_df = pd.concat([direct_edges['cust_id_sender'], direct_edges['cust_id_receiver'],
                      indirect_edges['cust_id_sender'], indirect_edges['cust_id_receiver']]).unique()
nodes_df = pd.DataFrame(nodes_df, columns=['node_id'])
nodes_df['type'] = 'indirect'  # default to indirect
direct_nodes = pd.concat([direct_edges['cust_id_sender'], direct_edges['cust_id_receiver']]).unique()
nodes_df.loc[nodes_df['node_id'].isin(direct_nodes), 'type'] = 'direct'
# Label influential customers last so the 'direct' label does not overwrite them
nodes_df.loc[nodes_df['node_id'].isin(influential_customers), 'type'] = 'influential'

# Assign colors
nodes_df['color'] = nodes_df['type'].map({'influential': 'red', 'direct': 'blue', 'indirect': 'grey'})

# Step 2: Preparing Edge Data
direct_edges['edge_type'] = 'direct'
indirect_edges['edge_type'] = 'indirect'
edges_df = pd.concat([direct_edges, indirect_edges])

# Assign colors
edges_df['color'] = edges_df['edge_type'].map({'direct': 'black', 'indirect': 'green'})

# Step 3: Visualization
plotter = graphistry.bind(source="cust_id_sender", destination="cust_id_receiver")
plotter = plotter.edges(edges_df).bind(edge_color='color')
plotter = plotter.nodes(nodes_df).bind(node='node_id', point_color='color')
plotter.plot()

In [125]:
import pandas as pd
import graphistry

# Assuming direct_edges, indirect_edges, and influential_customers are defined

# Preparing Node Data
nodes = pd.concat([direct_edges[['cust_id_sender', 'cust_id_receiver']],
                   indirect_edges[['cust_id_sender', 'cust_id_receiver']]]).melt(value_name="node_id").drop("variable", axis=1).drop_duplicates()
nodes['type'] = 'indirect'  # Default type
nodes.loc[nodes['node_id'].isin(direct_edges['cust_id_sender']) | nodes['node_id'].isin(direct_edges['cust_id_receiver']), 'type'] = 'direct'
nodes.loc[nodes['node_id'].isin(influential_customers), 'type'] = 'influential'

# Preparing Edge Data
direct_edges['edge_type'] = 'direct'
indirect_edges['edge_type'] = 'indirect'
edges = pd.concat([direct_edges, indirect_edges])

# Visualization with Graphistry
plotter = graphistry.edges(edges).bind(source="cust_id_sender", destination="cust_id_receiver").encode_edge_color("edge_type", as_categorical=True)
plotter = plotter.nodes(nodes).bind(node='node_id').encode_point_color("type", as_categorical=True)

plotter.plot()


ERROR:graphistry.arrow_uploader:Error: <Response [400]>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/graphistry/arrow_uploader.py", line 364, in refresh
    raise Exception(out.text)
Exception: {"non_field_errors":["Token has expired."]}

# Clustering #

In [44]:
G = cugraph.Graph()
indirect_edges_cudf = cudf.from_pandas(indirect_edges)
G.from_cudf_edgelist(indirect_edges_cudf, source='cust_id_sender', destination='cust_id_receiver', edge_attr='amount', renumber=True)



In [45]:
df = cugraph.ecg(G)

In [46]:
df.dtypes

partition     int32
vertex       object
dtype: object

In [47]:
# How many partitions were found
part_ids = df["partition"].unique()

In [48]:
print(f"{len(part_ids)} partitions detected")

61 partitions detected


In [135]:
# Print the members of each cluster.
# Group once on the host instead of scanning every row for every partition.
df_pd = df.to_pandas()
for p, members in df_pd.groupby('partition')['vertex']:
    print(f"Partition {p}:")
    print(members.tolist())

Partition 0:
['CUST50320430', 'EXTERNAL521500', 'CUST18759342', 'CUST22130077', 'CUST78243209', 'CUST91058366', 'EXTERNAL141277', 'EXTERNAL787374', 'CUST97826406', 'CUST23840839', 'CUST69676656', 'CUST94735170', 'EXTERNAL215129', 'CUST93566135', 'CUST91555446', 'CUST23982477', 'CUST53090425', 'CUST20501993', 'EXTERNAL845883', 'EXTERNAL714983', 'EXTERNAL717279', 'EXTERNAL321702', 'CUST30149940', 'CUST88980746', 'CUST79826019', 'CUST66540231', 'CUST66018653', 'CUST95792354', 'CUST75111872', 'CUST17256949', 'CUST63100206', 'EXTERNAL576995', 'CUST33731207', 'CUST70438392', 'CUST78853892', 'CUST97993246', 'CUST41514875', 'CUST66182718', 'CUST98709153', 'CUST56838438', 'EXTERNAL467336', 'CUST62567916', 'CUST55648837', 'EXTERNAL969844', 'CUST43910611', 'CUST60454913', 'CUST70276245', 'CUST88448178', 'EXTERNAL148698', 'EXTERNAL437070', 'CUST45404105', 'CUST89491853', 'EXTERNAL875761', 'EXTERNAL671528', 'EXTERNAL133209', 'CUST48462945', 'EXTERNAL682343', 'CUST73853382', 'CUST55931422', 'CUST947

In [49]:
# Flag each vertex that appears in the influential_customers list
df['is_influential'] = df['vertex'].isin(influential_customers).astype(int)

summary_df = df.groupby('partition').agg({
    'vertex': 'count',  # Count of total customers in each partition
    'is_influential': 'sum'  # Sum of influential customers in each partition
}).rename(columns={'vertex': 'total_customers', 'is_influential': 'influential_customers'})

summary_df['percentage_influential'] = (summary_df['influential_customers'] / summary_df['total_customers']) * 100

summary_df.reset_index(inplace=True)

In [50]:
summary_df.sort_values(by='influential_customers', ascending=False)

Unnamed: 0,partition,total_customers,influential_customers,percentage_influential
1,39,2157,79,3.662494
15,8,2130,77,3.615023
2,7,2083,70,3.360538
53,11,1624,69,4.248768
55,15,1658,63,3.799759
...,...,...,...,...
23,9,3,0,0.000000
31,31,18,0,0.000000
50,21,3,0,0.000000
51,40,6,0,0.000000


In [150]:
summary_df[summary_df['percentage_influential'] > 3]

Unnamed: 0,partition,total_customers,influential_customers,percentage_influential
0,14,1452,51,3.512397
1,39,2157,79,3.662494
2,7,2083,70,3.360538
7,4,551,20,3.629764
8,12,604,23,3.807947
11,50,729,23,3.155007
12,37,696,21,3.017241
15,8,2130,77,3.615023
19,1,22,2,9.090909
20,13,292,9,3.082192


In [153]:
vids = df.query("partition == 1")
v = cudf.Series(vids['vertex'])
print(len(v))

22




In [154]:
subG = cugraph.subgraph(G, v)

