# 3. Transactions Graph

In [1]:
import pandas as pd
import numpy as np
import dgl
import torch as th
from sklearn.preprocessing import StandardScaler

# Import Data

In [2]:
# import new data after feature engineering
wires = pd.read_csv('wires.csv').iloc[:, 1:]
emts = pd.read_csv('emts.csv').iloc[:, 1:]
cash = pd.read_csv('cash.csv').iloc[:, 1:]
cust_info = pd.read_csv('cust_info.csv').iloc[:, 1:]
detail_cust = pd.read_csv('detailed_cust_info.csv').iloc[:, 1:]
ext_info = pd.read_csv('external_info.csv').iloc[:, 1:]

In [3]:
# convert the amount in cash df to type float
cash['amount'] = cash['amount'].astype(float)

# Defining Transaction Relationships

We will build a directed heterogeneous graph to model the relationships between wire, emt and cash transactions.
- Find transactions that **share a user between the transactions**
    - e.g. each transaction has a sender and a receiver
    - the shared-user pattern we focus on (we ignore pairs that only share the same sender or the same receiver):
        - **receiver of transaction A = sender of transaction B**
- Among those linked transactions, find the pairs that **have a similar transaction amount**
    - Set a threshold for what counts as similar. This will vary depending on the transactions
        - If a user receives money and then sends money, we **assume that amount received >= amount sent**
        - If amount received >= amount sent --> we will **create an edge between these two transactions**

Note: There will be no edge features for this graph, only node features.
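
The two rules above (shared user, then amount comparison) can be sketched on toy data. This is only an illustration: the values and the names `incoming`, `outgoing`, `pairs` and `toy_edges` are hypothetical and do not come from the notebook's CSV files.

```python
import pandas as pd

# Toy data (hypothetical values, not from the notebook's CSV files).
incoming = pd.DataFrame({
    "trxn_id": ["A1", "A2"],
    "rec_global_id": [10, 20],
    "amount": [500.0, 100.0],
})
outgoing = pd.DataFrame({
    "trxn_id": ["B1", "B2", "B3"],
    "sender_global_id": [10, 10, 20],
    "amount": [450.0, 900.0, 100.0],
})

# Rule 1: the receiver of transaction A must be the sender of transaction B.
pairs = incoming.merge(
    outgoing,
    left_on="rec_global_id",
    right_on="sender_global_id",
    suffixes=("_in", "_out"),
)

# Rule 2: keep only pairs where amount received >= amount sent.
toy_edges = pairs.loc[
    pairs["amount_in"] >= pairs["amount_out"], ["trxn_id_in", "trxn_id_out"]
]
edge_list = list(toy_edges.itertuples(index=False, name=None))
print(edge_list)
```

Here customer 10 receives 500 and sends 450 and 900, so only the 450 transfer becomes an edge; customer 20 receives and sends exactly 100, which also qualifies.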

### Graph Nodes and Edges

For the heterogeneous directed transactions graph:

Nodes:
- wire transactions
- emt transactions
- cash transactions
    
Edges
- Sender/receiver: transactions need to share at least 1 person
    - receiver of transaction A = sender of transaction B
- Transaction amount: amount received >= amount sent

### **Canonical Edge Types**:</br>

Note: **edge abbreviations** are in the format of **source-shares-destination**
- For example: wire-share-wire = 'wsw'

**Between `wire`/`emt` transactions**</br>
- ('wire', 'wsw', 'wire')</br>
- ('wire', 'wse', 'emt')</br>
- ('emt', 'esw', 'wire')</br>
- ('emt', 'ese', 'emt')</br>


**Between `wire`/`emt` --> `Cash Withdrawal` transactions**</br>
- ('wire', 'wsc', 'cash') --> receive money through transfer then withdraw the money as cash to send elsewhere</br>
- ('emt', 'esc', 'cash')</br>

**Between `Cash Deposits` --> `wire`/`emt` transactions**
- ('cash', 'csw', 'wire') --> deposit cash from someone then transfer the money elsewhere</br>
- ('cash', 'cse', 'emt')</br>

**Between `Cash withdrawals` --> `Cash deposits`**
- ('cash', 'csc', 'cash') --> match on amounts with withdrawal --> deposit relationship for different people</br>

Send and receive relations summary:
- receive wire/emt --> send wire/emt
- receive wire/emt --> withdraw cash
- deposit cash --> send wire/emt
- withdraw cash --> deposit cash
# Get node features for wires

In [4]:
wires.head()

Unnamed: 0,wire value,trxn_id,sender_Age,sender_Tenure,sender_label,receiver_Age,receiver_Tenure,receiver_label,sender_global_id,rec_global_id,...,sender_Gender_Missing,sender_Gender_female,sender_Gender_male,sender_Gender_other,receiver_Gender_Missing,receiver_Gender_female,receiver_Gender_male,receiver_Gender_other,sender_Occupation_num,rec_Occupation_num
0,1267.0,LWCS42954834,34.0,0.0,0.0,45.0,21.0,0.0,46393,118403,...,0,1,0,0,0,0,1,0,8,53
1,8591.0,NTTG55749308,38.0,5.0,1.0,-1.0,-1.0,-1.0,101584,183844,...,0,1,0,0,1,0,0,0,91,250
2,1480.5,IXVD84599097,-1.0,-1.0,-1.0,56.0,21.0,0.0,184551,104548,...,1,0,0,0,0,1,0,0,250,51
3,1587.0,SLBV29462341,22.0,3.0,0.0,-1.0,-1.0,-1.0,71803,162608,...,0,0,1,0,1,0,0,0,217,250
4,1546.0,ERLU26785367,36.0,0.0,1.0,61.0,13.0,0.0,117672,9197,...,0,0,1,0,0,1,0,0,19,164


In [5]:
wire1 = wires[['trxn_id', 'sender_global_id', 'rec_global_id', 'wire value']].rename(columns={'wire value':'amount'})
wire1

Unnamed: 0,trxn_id,sender_global_id,rec_global_id,amount
0,LWCS42954834,46393,118403,1267.0
1,NTTG55749308,101584,183844,8591.0
2,IXVD84599097,184551,104548,1480.5
3,SLBV29462341,71803,162608,1587.0
4,ERLU26785367,117672,9197,1546.0
...,...,...,...,...
48287,LRRP66624765,142262,107141,6059.5
48288,KVQK50168638,112004,155051,5067.0
48289,IUIP17370739,69903,167767,18874.0
48290,ZHVK78574815,29953,96357,4084.0


# Get node features for emts

In [6]:
emts.head()

Unnamed: 0,emt value,trxn_id,sender_Age,sender_Tenure,sender_label,receiver_Age,receiver_Tenure,receiver_label,sender_global_id,rec_global_id,...,sender_Gender_female,sender_Gender_male,sender_Gender_other,receiver_Gender_Missing,receiver_Gender_female,receiver_Gender_male,receiver_Gender_other,sender_Occupation_num,rec_Occupation_num,has_msg
0,1170.5,RAUG63886259,-1.0,-1.0,-1.0,37.0,6.0,0.0,166925,69450,...,0,0,0,0,0,1,0,250,111,0
1,46.0,WPXP45854083,34.0,8.0,0.0,-1.0,-1.0,-1.0,68825,155054,...,0,1,0,1,0,0,0,110,250,0
2,480.0,TRNT55099512,69.0,14.0,0.0,22.0,4.0,0.0,24521,6745,...,0,0,1,0,0,1,0,188,52,0
3,735.0,YSNV62579819,44.0,14.0,1.0,49.0,10.0,0.0,77547,23015,...,1,0,0,0,1,0,0,8,62,0
4,540.0,MZYI28216959,-1.0,-1.0,-1.0,48.0,3.0,0.0,156121,78536,...,0,0,0,0,1,0,0,250,120,0


In [7]:
emt1 = emts[['trxn_id', 'sender_global_id', 'rec_global_id', 'emt value']].rename(columns={'emt value':'amount'})
emt1

Unnamed: 0,trxn_id,sender_global_id,rec_global_id,amount
0,RAUG63886259,166925,69450,1170.5
1,WPXP45854083,68825,155054,46.0
2,TRNT55099512,24521,6745,480.0
3,YSNV62579819,77547,23015,735.0
4,MZYI28216959,156121,78536,540.0
...,...,...,...,...
318895,XAOG83079223,62944,91980,682.0
318896,USHN74907347,156186,91130,119.0
318897,VXES44436032,3063,82331,208.0
318898,LTUK21435620,155038,112810,150.0


# Get node features for cash 

In [8]:
cash1 = cash[['trxn_id', 'Global_id', 'amount', 'type_deposit']]
cash1

Unnamed: 0,trxn_id,Global_id,amount,type_deposit
0,BFMG48785876,96095,4800.0,1
1,FUKV94845036,102552,7420.0,0
2,NUZO58830551,45375,5595.0,1
3,ZOSP34629709,39239,1600.0,0
4,HQUJ43887606,80360,1055.0,0
...,...,...,...,...
90224,JOQU43611104,85640,1140.0,1
90225,LTBH81014009,105558,6870.0,1
90226,GGHM25093698,42954,8740.0,0
90227,CNXP31340871,23527,5750.0,0


# Get COO Tensors for Edges

### Create a map for all transactions

In [9]:
# Get list of transaction ids from the three transaction tables
ids = []
w_ids = list(wires['trxn_id'].values)
e_ids = list(emts['trxn_id'].values)
c_ids = list(cash['trxn_id'].values)

# append the ids to the ids list
ids += w_ids + e_ids + c_ids

In [10]:
# build a positional index for every transaction id
indexes = list(range(len(ids)))

In [11]:
# create a dictionary mapping each transaction id to its index
t_map = {trxn_id: index for trxn_id, index in zip(ids, indexes)}
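
A minimal sketch of the same id-to-index mapping, using `enumerate` and a duplicate check. The names `ids_example`, `t_map_example` and `has_duplicates` are introduced here for illustration; the three ids are sample rows from the wires table above.

```python
# Sample transaction ids standing in for the notebook's `ids` list.
ids_example = ["LWCS42954834", "NTTG55749308", "IXVD84599097"]

# enumerate yields (position, id) pairs, so no separate index list is needed.
t_map_example = {trxn_id: index for index, trxn_id in enumerate(ids_example)}

# A dict keeps one entry per key, so a duplicated id would shrink the map.
has_duplicates = len(t_map_example) != len(ids_example)
print(t_map_example, has_duplicates)
```

If `has_duplicates` were ever true on the real data, two transactions would share one node index, so the check is a cheap safeguard before building edges.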

In [12]:
checking = pd.DataFrame({'trxn_id':ids, 'index':indexes})
checking

Unnamed: 0,trxn_id,index
0,LWCS42954834,0
1,NTTG55749308,1
2,IXVD84599097,2
3,SLBV29462341,3
4,ERLU26785367,4
...,...,...
457416,JOQU43611104,457416
457417,LTBH81014009,457417
457418,GGHM25093698,457418
457419,CNXP31340871,457419


In [13]:
# export the transaction map
checking.to_csv('trxn_map.csv')

#### Create Function to get edges and map indexes

We weight each edge with an exponential function of the absolute difference between the two amounts: `weight = 1 - exp(-|amt1 - amt2| / 100)`.
The weight is 0 when the two amounts are equal and moves towards 1 as the difference between them increases, so in this notebook a larger difference is treated as a stronger relationship. The constant 100 controls how quickly the weight saturates: a difference of 100 already gives a weight of about 0.63.

In [14]:
def calc_weightother(amt1, amt2):
    return 1 - np.exp(-abs(amt1 - amt2) / 100)
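
To make the behaviour of the weight function concrete, here is a small worked sketch. `calc_weight_example` and its `scale` parameter are names introduced for this illustration; the expression is the same as in `calc_weightother`.

```python
import numpy as np

def calc_weight_example(amt1, amt2, scale=100.0):
    # Same expression as calc_weightother; `scale` names the constant 100.
    return 1 - np.exp(-np.abs(amt1 - amt2) / scale)

w_equal = calc_weight_example(500.0, 500.0)   # difference of 0
w_100 = calc_weight_example(500.0, 400.0)     # difference of 100
w_1000 = calc_weight_example(1500.0, 500.0)   # difference of 1000
print(w_equal, w_100, w_1000)
```

Equal amounts give a weight of 0, a difference of 100 gives roughly 0.632, and a difference of 1000 is already indistinguishable from 1 at the precision shown in the edge tables.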

In [15]:
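# create a function to track the number of times a customer is listed in the COO format
def count_participation(m, cash):
    """
        Count how many times each customer id appears in the merged dataframe.

        Parameters:
            - m (DataFrame): merged transactions with a 'common' customer column
            - cash (Boolean): True if one side of the merge is a cash transaction

        Returns:
            - DataFrame with one row per customer id and its count
    """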
    
    # Get the columns that have global_id
    g_cols = list(m.filter(like='global_id').columns)
    
    if cash:
        # concat the columns that have ids into a single column
        ids = pd.concat([m[g_cols[0]], m['common']]).reset_index(drop=True)
    else:
        # concat the columns that have ids into a single column
        ids = pd.concat([m[g_cols[0]], m[g_cols[1]], m['common']]).reset_index(drop=True)

    # Store the customer participation 
    id_counts = ids.value_counts().reset_index()
    
    # Rename the columns
    id_counts = id_counts.rename(columns={'index': 'Global_id'})
    
    return id_counts
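
The column names produced by `value_counts().reset_index()` differ between pandas versions, so the `rename(columns={'index': 'Global_id'})` step depends on the installed version. A sketch of a version-independent way to get the same two columns, using a hypothetical `ids_example` series:

```python
import pandas as pd

# Hypothetical customer ids, some repeated.
ids_example = pd.Series([14234, 39393, 14234, 14234, 39393, 25911])

# Name the index and the count column explicitly instead of relying on defaults.
id_counts = (
    ids_example.value_counts()
    .rename_axis("Global_id")
    .reset_index(name="count")
)
print(id_counts)
```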

In [16]:
# Create a function to get the edges between two transactions
def get_edges(receiver_df, sender_df, withdrawal, deposit):
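    """
        Find edges between transactions that share a customer.

        Parameters:
            - receiver_df (DataFrame): dataframe containing transactions with incoming money
            - sender_df (DataFrame): dataframe containing transactions with outgoing money
            - withdrawal (Boolean): True if sender_df contains cash withdrawals
            - deposit (Boolean): True if receiver_df contains cash deposits

        Returns:
            - coo (DataFrame): Source, Destination and Weight for each edge
            - participation (DataFrame): customer participation counts
    """
    
    # Set cash to false for participation function
    cash = False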
    
    # Rename the sender in sender_df common and rename the receiver in receiver_df common to merge
    if deposit:
        cash = True
        receiver_df = receiver_df.rename(columns={'Global_id': 'common'})
    else:
        receiver_df = receiver_df.rename(columns={'rec_global_id': 'common'})
    
    if withdrawal:
        cash = True
        sender_df = sender_df.rename(columns={'Global_id': 'common'})
    else:
        sender_df = sender_df.rename(columns={'sender_global_id': 'common'})
    
    # Merge the two transaction dataframes on the common sender column
    m = pd.merge(receiver_df, sender_df, on='common')
    
    # Store the participation counts for each customer in this network
    participation = count_participation(m, cash)
    
    # keep pairs where the amount sent (amount_y) is at most the amount received (amount_x)
    filtered = m[m['amount_y'] <= m['amount_x']].rename(columns={'trxn_id_x': 'Source', 'trxn_id_y': 'Destination'})
    
    # Calculate the weight of the edge based on the amount columns
    filtered['Weight'] = calc_weightother(filtered['amount_x'], filtered['amount_y'])
    
    # Get the COO for the transactions
    coo = filtered[['Source', 'Destination', 'Weight']]
    
    return coo, participation

In [17]:
# create a function to map edges with the corresponding index values in the node features dataframe
def map_edges(df):
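    """
        Replace the transaction ids in Source and Destination with their
        integer indexes from t_map, keeping the edge weights.
    """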
    edges_df = pd.DataFrame()
    edges_df['Source'] = df['Source'].map(t_map)
    edges_df['Destination'] = df['Destination'].map(t_map)
    edges_df['Weight'] = df['Weight']
    
    return edges_df
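
One thing worth checking after a `.map` call is whether every id was found: ids missing from the dictionary become `NaN`, which also turns the column into floats. A small sketch with hypothetical ids and a hypothetical `t_map_example`:

```python
import pandas as pd

# Hypothetical map and edge list; "ZZZ" is deliberately absent from the map.
t_map_example = {"AAA": 0, "BBB": 1}
coo_example = pd.DataFrame({"Source": ["AAA", "BBB"], "Destination": ["BBB", "ZZZ"]})

mapped = coo_example["Destination"].map(t_map_example)

# Count the ids that had no entry in the map.
n_unmapped = int(mapped.isna().sum())
print(mapped.dtype, n_unmapped)
```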

In [18]:
# create a function to store the edges dataframe as a tuple with tensors
def get_tensors(edges):
    """
        Convert the Source and Destination columns of an edges dataframe
        into a tuple of tensors (u, v).
    """
    
    # transpose the dataframe to get the format for the tensors
    values = edges.values.transpose()
    
    # convert the values to tensors
    u = th.tensor(values[0].copy())
    v = th.tensor(values[1].copy())
    
    # return a tuple of node_tensors
    return (u, v)
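
The tensors printed further down have `dtype=torch.float64` because `edges.values` mixes the integer endpoint columns with the float `Weight` column, and NumPy upcasts the whole array. A sketch of how the endpoints could be kept as integers, shown with NumPy only; the names `edges_example`, `mixed` and `endpoints` are illustrative.

```python
import numpy as np
import pandas as pd

edges_example = pd.DataFrame({
    "Source": [3, 31370, 4713],
    "Destination": [14754, 5986, 4233],
    "Weight": [0.994538, 1.0, 1.0],
})

# Mixed int/float columns are upcast to float64 when taken together.
mixed = edges_example.values

# Selecting only the endpoint columns keeps them as integers.
endpoints = edges_example[["Source", "Destination"]].to_numpy(dtype=np.int64)
u, v = endpoints[:, 0], endpoints[:, 1]
print(mixed.dtype, u.dtype)
```

Passing `u` and `v` to `th.tensor` would then give integer tensors, which is the form graph libraries typically expect for node ids.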

In [19]:
# create copies of the transaction dataframes to find the edges
w1 = wire1.copy()
w2 = wire1.copy()
e1 = emt1.copy()
e2 = emt1.copy()

In [20]:
# Get cash deposits vs withdrawals
deposits = cash1[cash1['type_deposit'] == 1][['trxn_id', 'Global_id', 'amount']]
withdrawals = cash1[cash1['type_deposit'] == 0][['trxn_id', 'Global_id', 'amount']]

# Find send/receive transaction relations

### Extract COO for ('wire', 'wsw', 'wire')

These edges model the relationship where an individual received money through wire and then sent money through wire for an amount no larger than the amount they received. Source is the transaction id where they received the wire, and destination is the transaction where they sent the equal or smaller amount.

In [21]:
# Get the edges for the edge type
wsw, wsw_part = get_edges(w1, w2, False, False)

# Apply the proper mapping to get the COO 
wsw_edges = map_edges(wsw)
wsw_edges

Unnamed: 0,Source,Destination,Weight
0,3,14754,0.994538
3,31370,5986,1.000000
17,4713,4233,1.000000
18,4713,31656,1.000000
19,4713,41506,1.000000
...,...,...,...
55729,48284,16074,0.996552
55730,48286,23213,0.999976
55731,48286,43184,0.999614
55732,48288,23423,1.000000


In [22]:
# Show graph participation
wsw_part

Unnamed: 0,Global_id,count
0,14234,72
1,39393,71
2,25911,65
3,77559,58
4,61010,54
...,...,...
35255,91280,1
35256,156947,1
35257,74874,1
35258,149466,1


In [23]:
# get the node-tensors for these edges
wsw_ten = get_tensors(wsw_edges)
wsw_ten

(tensor([3.0000e+00, 3.1370e+04, 4.7130e+03,  ..., 4.8286e+04, 4.8288e+04,
         4.8288e+04], dtype=torch.float64),
 tensor([14754.,  5986.,  4233.,  ..., 43184., 23423., 34215.],
        dtype=torch.float64))

### Extract COO for ('wire', 'wse', 'emt')

These edges model the relationship where an individual received money through wire and then sent money through emt for an amount no larger than the amount they received. Source is the transaction id where they received the wire, and destination is the emt transaction where they sent the equal or smaller amount.

In [24]:
# Get the coo format
wse, wse_part = get_edges(w1, e2, False, False)

# Apply the proper mapping to get the COO 
wse_edges = map_edges(wse)
wse_edges

Unnamed: 0,Source,Destination,Weight
0,0,160081,0.999880
1,2,316409,0.999990
2,3,62742,0.999958
5,3,190297,1.000000
6,3,293826,0.999999
...,...,...,...
197500,48291,97945,0.999999
197502,48291,127663,0.999972
197504,48291,156628,1.000000
197505,48291,210456,0.999999


In [25]:
# get the node-tensors for these edges
wse_ten = get_tensors(wse_edges)
wse_ten

(tensor([0.0000e+00, 2.0000e+00, 3.0000e+00,  ..., 4.8291e+04, 4.8291e+04,
         4.8291e+04], dtype=torch.float64),
 tensor([160081., 316409.,  62742.,  ..., 156628., 210456., 301973.],
        dtype=torch.float64))

### Extract COO for ('emt', 'esw', 'wire')

In [26]:
# Get the coo format
esw, esw_part = get_edges(e1, w2, False, False)

# Apply the proper mapping to get the COO 
esw_edges = map_edges(esw)
esw_edges

Unnamed: 0,Source,Destination,Weight
29,153716,5685,0.995831
31,153716,30695,0.994483
66,203448,11931,0.987349
67,203448,44656,0.949963
85,48305,11048,1.000000
...,...,...,...
191550,364999,39965,0.989796
191567,365331,1627,1.000000
191568,365331,25437,0.999561
191581,366085,46647,1.000000


In [27]:
# get the node-tensors for these edges
esw_ten = get_tensors(esw_edges)
esw_ten

(tensor([153716., 153716., 203448.,  ..., 365331., 366085., 366938.],
        dtype=torch.float64),
 tensor([ 5685., 30695., 11931.,  ..., 25437., 46647., 19797.],
        dtype=torch.float64))

### Extract COO for ('emt', 'ese', 'emt')

Due to the large number of emts, we also filter these transaction pairs by amount. We standardize the transaction amounts and keep only the pairs in which either amount is more than 3 standard deviations away from the mean.
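
The standardize-then-threshold idea can be sketched without scikit-learn. The amounts below are hypothetical; the only assumption carried over is that `StandardScaler` standardizes with the population standard deviation (`ddof=0`).

```python
import numpy as np

# Twenty ordinary amounts and one very large amount (hypothetical values).
amounts = np.array([100.0] * 20 + [10_000.0])

# z-score using the population standard deviation, as StandardScaler does.
z = (amounts - amounts.mean()) / amounts.std(ddof=0)

# Keep only amounts more than 3 standard deviations from the mean.
outlier_mask = np.abs(z) > 3
outliers = amounts[outlier_mask]
print(outliers)
```

Only the 10,000 amount survives the filter; the twenty ordinary amounts sit well inside the threshold.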

In [28]:
# Set the receiver df and sender df 
receiver_df = e1.rename(columns={'rec_global_id': 'common'})
sender_df = e2.rename(columns={'sender_global_id': 'common'})

# merge the dataframe on the common customer
m = pd.merge(receiver_df, sender_df, on='common')

# count participation of customers in this network
ese_part = count_participation(m, False)

# standardize the amounts 
emt_scalerx = StandardScaler()
emt_scalery = StandardScaler()

# Select amount column and reshape
amtx = m['amount_x'].values.reshape(-1, 1)
amty = m['amount_y'].values.reshape(-1, 1)

# Fit Transform amounts
m['amt_x_std'] = emt_scalerx.fit_transform(amtx)
m['amt_y_std'] = emt_scalery.fit_transform(amty)

# filter for outliers (more than 3 standard deviations from the mean on either side);
# .copy() avoids pandas' SettingWithCopyWarning when the Weight column is added below
filtered = m[(m['amt_x_std'].abs() > 3) | (m['amt_y_std'].abs() > 3)].copy()

# Calculate the weight of the edge based on the amount columns
# (unlike get_edges, this cell does not require amount_y <= amount_x)
filtered['Weight'] = calc_weightother(filtered['amount_x'], filtered['amount_y'])

# rename
filtered = filtered.rename(columns={'trxn_id_x': 'Source', 'trxn_id_y': 'Destination'})

# Get the COO for the transactions
ese = filtered[['Source', 'Destination', 'Weight']]


In [29]:
# Apply the proper mapping to get the COO 
ese_edges = map_edges(ese)
ese_edges

Unnamed: 0,Source,Destination,Weight
160,48299,236666,1.0
171,51243,236666,1.0
182,197203,236666,1.0
193,204378,236666,1.0
204,249860,236666,1.0
...,...,...,...
1171574,365306,112956,1.0
1171760,366381,147718,1.0
1171795,366518,226318,1.0
1171796,366518,311335,1.0


In [30]:
# get the node-tensors for these edges
ese_ten = get_tensors(ese_edges)
ese_ten

(tensor([ 48299.,  51243., 197203.,  ..., 366518., 366518., 366938.],
        dtype=torch.float64),
 tensor([236666., 236666., 236666.,  ..., 226318., 311335., 141206.],
        dtype=torch.float64))

# Find receive wire/emt --> withdrawal cash transaction relations

These COOs represent relations where an individual receives a wire or emt transaction and also withdraws cash for a similar amount. They can represent the flow of money where someone receives money through wire/emt, then withdraws cash to send elsewhere.

### Extract COO for ('wire', 'wsc', 'cash')

In [31]:
# Get the coo format
wsc, wsc_part = get_edges(w1, withdrawals, True, False)

# Apply the proper mapping to get the COO 
wsc_edges = map_edges(wsc)
wsc_edges

Unnamed: 0,Source,Destination,Weight
5,21617,442668,1.000000
7,32218,442668,1.000000
12,4713,375802,1.000000
13,4713,399260,1.000000
14,4713,441813,1.000000
...,...,...,...
42663,48222,454100,1.000000
42664,48239,391454,1.000000
42669,48281,370047,0.998253
42670,48281,394730,0.954951


In [32]:
# get the node-tensors for these edges
wsc_ten = get_tensors(wsc_edges)
wsc_ten

(tensor([21617., 32218.,  4713.,  ..., 48281., 48281., 48281.],
        dtype=torch.float64),
 tensor([442668., 442668., 375802.,  ..., 370047., 394730., 413278.],
        dtype=torch.float64))

### Extract COO for ('emt', 'esc', 'cash')

In [33]:
# Get the coo format
esc, esc_part = get_edges(e1, withdrawals, True, False)

# Apply the proper mapping to get the COO 
esc_edges = map_edges(esc)
esc_edges

Unnamed: 0,Source,Destination,Weight
311,48345,420339,0.748421
312,53642,420339,1.000000
315,137415,420339,0.527633
329,63587,388163,0.933463
330,63587,391707,0.999942
...,...,...,...
196327,361514,413956,1.000000
196364,364096,431572,0.477954
196371,364648,402940,0.329680
196389,365723,442059,0.134978


In [34]:
# get the node-tensors for these edges
esc_ten = get_tensors(esc_edges)
esc_ten

(tensor([ 48345.,  53642., 137415.,  ..., 364648., 365723., 366518.],
        dtype=torch.float64),
 tensor([420339., 420339., 420339.,  ..., 402940., 442059., 408825.],
        dtype=torch.float64))

# Find deposit --> send wire/emt transaction relations

These COOs represent relations where an individual deposits cash they received from somewhere, then sends a wire/emt transfer of an equal or smaller amount to someone else. They can represent the flow of money where someone receives money as cash and deposits it, then sends money through wire/emt to somewhere else.

### Extract COO for ('cash', 'csw', 'wire')

In [35]:
# Get the coo format
csw, csw_part = get_edges(deposits, w2, False, True)

# Apply the proper mapping to get the COO 
csw_edges = map_edges(csw)
csw_edges

Unnamed: 0,Source,Destination,Weight
0,367194,12489,1.000000
1,367194,40117,1.000000
2,372636,12489,1.000000
3,372636,40117,1.000000
4,380931,12489,1.000000
...,...,...,...
60144,456927,26597,0.999885
60151,457412,16115,1.000000
60153,457412,30858,1.000000
60154,457412,31628,1.000000


In [36]:
# get the node-tensors for these edges
csw_ten = get_tensors(csw_edges)
csw_ten

(tensor([367194., 367194., 372636.,  ..., 457412., 457412., 457412.],
        dtype=torch.float64),
 tensor([12489., 40117., 12489.,  ..., 30858., 31628., 32238.],
        dtype=torch.float64))

### Extract COO for ('cash', 'cse', 'emt')

In [37]:
# Get the coo format
cse, cse_part = get_edges(deposits, e2, False, True)

# Apply the proper mapping to get the COO 
cse_edges = map_edges(cse)
cse_edges

Unnamed: 0,Source,Destination,Weight
0,367192,331423,1.000000
1,367192,365941,1.000000
2,423741,331423,1.000000
3,423741,365941,1.000000
4,435544,331423,1.000000
...,...,...,...
265248,457412,216735,1.000000
265249,457412,265432,1.000000
265250,457412,323085,1.000000
265251,457413,127008,0.999665


In [38]:
# get the node-tensors for these edges
cse_ten = get_tensors(cse_edges)
cse_ten

(tensor([367192., 367192., 423741.,  ..., 457412., 457413., 457413.],
        dtype=torch.float64),
 tensor([331423., 365941., 331423.,  ..., 323085., 127008., 313681.],
        dtype=torch.float64))

# Find cash --> cash transaction relations

This COO represents the flow of cash from individuals who withdraw cash to individuals who deposit cash for the same amount. These edges can then help us detect cash transactions between individuals.

Since there are many cash transactions that share the same amount, we will filter for only those transactions that are unusually large or small. This is measured by standardizing the transaction amount then finding those that are greater than 3 standard deviations from the mean.

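Since we are merging on the amount, the two amounts in each pair are always equal, so we assign a fixed weight of 1 to all of these edges instead of using `calc_weightother` (which returns 0 for equal amounts).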
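
Merging on the amount alone also pairs a customer's own withdrawal with their own deposit. If the intent is to link different people, one possible extra step is to drop rows where both sides have the same `Global_id`. This is a sketch of that optional variant on hypothetical data, not what the notebook's `cash_edges` function does.

```python
import pandas as pd

# Hypothetical withdrawals and deposits; customer 1 appears on both sides.
withdrawals_example = pd.DataFrame({
    "trxn_id": ["W1", "W2"], "Global_id": [1, 2], "amount": [5000.0, 700.0],
})
deposits_example = pd.DataFrame({
    "trxn_id": ["D1", "D2", "D3"], "Global_id": [1, 3, 4],
    "amount": [5000.0, 5000.0, 700.0],
})

# Pair every withdrawal with every deposit of the same amount.
m = pd.merge(withdrawals_example, deposits_example, on="amount")

# Optional extra step (an assumption, not in the notebook): drop same-customer pairs.
different_people = m[m["Global_id_x"] != m["Global_id_y"]]
pairs = list(zip(different_people["trxn_id_x"], different_people["trxn_id_y"]))
print(pairs)
```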

In [39]:
# Create a function to get the edges for cash transactions
def cash_edges(withdrawals, deposits):
    """    
        Parameters:
            - withdrawals (DataFrame): dataframe containing cash withdrawal transactions
            - deposits (DataFrame): dataframe containing cash deposit transactions
    """
    # Merge the two transaction dataframes on the amount column
    m = pd.merge(withdrawals, deposits, on='amount').rename(columns={'trxn_id_x': 'Source', 'trxn_id_y': 'Destination'})
    
    # Get customer participation
    g_cols = list(m.filter(like='Global_id').columns)
    ids = pd.concat([m[g_cols[0]], m[g_cols[1]]]).reset_index(drop=True)
    id_counts = ids.value_counts().reset_index()
    id_counts = id_counts.rename(columns={'index': 'Global_id'})
    
    # Select amount column and reshape
    amts = m['amount'].values.reshape(-1, 1)
    
    # Create a new variable for the standardizing the amount
    cash_scaler = StandardScaler()
    m['std'] = cash_scaler.fit_transform(amts)
    
    # filter for amounts that are greater than 3 std from mean;
    # .copy() avoids pandas' SettingWithCopyWarning when Weight is added
    filtered = m[(m['std'] > 3) | (m['std'] < -3)].copy()
    
    filtered['Weight'] = 1
    
    # Get the COO for the transactions
    coo = filtered[['Source', 'Destination', 'Weight']]
    
    return coo, id_counts

### Extract COO for ('cash', 'csc', 'cash')

In [40]:
# Get the coo format
csc, csc_part = cash_edges(withdrawals, deposits)

# Apply the proper mapping to get the COO 
csc_edges = map_edges(csc)
csc_edges


Unnamed: 0,Source,Destination,Weight
958,367197,381488,1
959,367197,386331,1
960,367197,397930,1
961,367197,432518,1
962,367197,441131,1
...,...,...,...
826098,456596,448750,1
826099,457123,421007,1
826100,457138,369544,1
826101,457138,396487,1


In [41]:
# get the node-tensors for these edges
csc_ten = get_tensors(csc_edges)
csc_ten

(tensor([367197, 367197, 367197,  ..., 457138, 457138, 457399]),
 tensor([381488, 386331, 397930,  ..., 369544, 396487, 450473]))

# Export customer participation dataframes

In [42]:
wsw_part.to_csv('wsw_part.csv')
wse_part.to_csv('wse_part.csv')
wsc_part.to_csv('wsc_part.csv')
ese_part.to_csv('ese_part.csv')
esc_part.to_csv('esc_part.csv')
esw_part.to_csv('esw_part.csv')
csc_part.to_csv('csc_part.csv')
csw_part.to_csv('csw_part.csv')
cse_part.to_csv('cse_part.csv')

# Export Subgraph Edges

In [43]:
wsw_edges.to_csv('wsw_edges.csv')
wse_edges.to_csv('wse_edges.csv')
wsc_edges.to_csv('wsc_edges.csv')
ese_edges.to_csv('ese_edges.csv')
esc_edges.to_csv('esc_edges.csv')
esw_edges.to_csv('esw_edges.csv')
csc_edges.to_csv('csc_edges.csv')
csw_edges.to_csv('csw_edges.csv')
cse_edges.to_csv('cse_edges.csv')

# Construct Graphs

Since we want to find communities of transactions, we combine the edges of every edge type defined above into a single directed, weighted graph. We then apply the Leiden algorithm (a refinement of the Louvain method) to this graph to detect communities of transactions.

In [44]:
#!pip install leidenalg python-igraph

In [45]:
from igraph import Graph

### Concat all edges to form a single graph

In [46]:
# Create a list of edges to get 
edge_dfs = [wsw_edges, wse_edges, wsc_edges,
            ese_edges, esc_edges, esw_edges, 
            csc_edges, csw_edges, cse_edges]

In [47]:
# use a for loop to get all the tuples of edges and a list of corresponding weights
edges = []
weights = []

for df in edge_dfs:
    # Get the node pairs in a tuple
    eds = list(zip(df['Source'], df['Destination']))
    
    # Get the list of corresponding weights
    w = list(df['Weight'])
    
    # Add the edges to the edges list and ws to weights
    edges += eds
    weights += w
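
The same edge and weight lists can also be built by concatenating the edge dataframes once. This is a sketch on two hypothetical dataframes; `edge_dfs_example`, `all_edges`, `edges_example` and `weights_example` are illustrative names.

```python
import pandas as pd

# Two small stand-ins for the per-edge-type dataframes.
edge_dfs_example = [
    pd.DataFrame({"Source": [0, 1], "Destination": [2, 3], "Weight": [0.5, 1.0]}),
    pd.DataFrame({"Source": [4], "Destination": [0], "Weight": [0.25]}),
]

# Stack all edge types into one dataframe with a fresh index.
all_edges = pd.concat(edge_dfs_example, ignore_index=True)

# .tolist() converts NumPy integers to plain Python ints.
edges_example = list(
    zip(all_edges["Source"].tolist(), all_edges["Destination"].tolist())
)
weights_example = all_edges["Weight"].tolist()
print(edges_example, weights_example)
```

Keeping edges and weights in one dataframe until the last step guarantees the two lists stay aligned in length and order.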

In [48]:
# Get the unique nodes in the graph
nodes = list(checking['trxn_id'].values)
len(nodes)

457421

In [49]:
# Create a directed graph
g = Graph(directed=True)
g.add_vertices(nodes)
g.add_edges(edges)
g.es['weight'] = weights

### Check connections

In [50]:
# Print edge source, target, and weight to verify
for edge in g.es[:5]:
    print(f"Edge from {g.vs[edge.source]['name']} to {g.vs[edge.target]['name']} has weight {edge['weight']}")

Edge from SLBV29462341 to FAHV35141379 has weight 0.9945383263123592
Edge from KLPL18292745 to BTAG24626959 has weight 1.0
Edge from BTSF11865310 to TQUJ12547796 has weight 0.9999999999999999
Edge from BTSF11865310 to CFJZ65537629 has weight 0.9999999999774436
Edge from BTSF11865310 to CCXP36889396 has weight 1.0


In [51]:
wsw_edges.head()

Unnamed: 0,Source,Destination,Weight
0,3,14754,0.994538
3,31370,5986,1.0
17,4713,4233,1.0
18,4713,31656,1.0
19,4713,41506,1.0


In [52]:
checking[checking['trxn_id'] == 'SLBV29462341']

Unnamed: 0,trxn_id,index
3,SLBV29462341,3


In [53]:
checking[checking['trxn_id'] == 'FAHV35141379']

Unnamed: 0,trxn_id,index
14754,FAHV35141379,14754


### Apply Leiden Algorithm to Detect Communities

Choice of Leiden over Louvain:
- Can handle **directed graphs**, which is important in this case
- Typically yields more stable results across different runs compared to Louvain.
- Faster in practice than Louvain, especially on networks where the community structure is deeply nested.
- Can produce a hierarchy of partitions that are locally optimally partitioned.

**Choice of RBERVertexPartition**</br>
Model Basis: This partition type implements Reichardt and Bornholdt's Potts model with an Erdős-Rényi null model. The Erdős-Rényi model assumes that all possible edges between nodes have an equal probability of existing, leading to a random network whose only constraint is the overall edge density. It is therefore not adjusted for degree sequences; the degree-aware variant based on the configuration model is `RBConfigurationVertexPartition`. RBERVertexPartition evaluates the observed community structure against this uniformly random backdrop, with the `resolution_parameter` scaling the null-model term (higher values favour smaller communities).

Use Case: Suitable for analyses where the emphasis is on how the observed community divisions stand out against a null model of random connections with the same overall density.

Reference: Chat-GPT, https://leidenalg.readthedocs.io/en/stable/reference.html
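
To give a feel for how the resolution parameter interacts with an Erdős-Rényi null model, here is a toy calculation. It assumes the quality function has the form "internal edges minus gamma times density times the number of node pairs, summed over communities" for an undirected, unweighted graph; the helper `rber_quality` is hypothetical, and the directed, weighted graph used in this notebook is normalised differently by the library.

```python
from math import comb

def rber_quality(n_nodes, n_edges, communities, gamma):
    """Q = sum over communities of (internal edges - gamma * p * C(size, 2)).

    `communities` is a list of (size, internal_edge_count) pairs and
    p = n_edges / C(n_nodes, 2) is the density of the toy graph.
    """
    p = n_edges / comb(n_nodes, 2)
    return sum(m_c - gamma * p * comb(n_c, 2) for n_c, m_c in communities)

# Toy graph: two triangles joined by a single edge (6 nodes, 7 edges).
two_triangles = [(3, 3), (3, 3)]
singletons = [(1, 0)] * 6

q_triangles_g1 = rber_quality(6, 7, two_triangles, gamma=1)
q_triangles_g5 = rber_quality(6, 7, two_triangles, gamma=5)
q_singletons_g5 = rber_quality(6, 7, singletons, gamma=5)
print(q_triangles_g1, q_triangles_g5, q_singletons_g5)
```

In this toy setting the two-triangle partition scores well at gamma = 1 but worse than all-singletons at gamma = 5, which illustrates why a high resolution value tends to produce many small communities.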

In [54]:
import leidenalg

In [55]:
# Find the optimal partition with the Leiden algorithm
partition = leidenalg.find_partition(g, leidenalg.RBERVertexPartition, weights='weight', n_iterations=-1, resolution_parameter=5)

In [56]:
# Get the community labels for each node
community_labels = partition.membership

In [57]:
# create a dataframe of the transaction_ids and community labels
coms_df = pd.DataFrame({'trxn_id': nodes, 'Community Label': community_labels})
coms_df

Unnamed: 0,trxn_id,Community Label
0,LWCS42954834,127
1,NTTG55749308,7254
2,IXVD84599097,276
3,SLBV29462341,68
4,ERLU26785367,15
...,...,...
457416,JOQU43611104,871
457417,LTBH81014009,236
457418,GGHM25093698,213115
457419,CNXP31340871,213116


In [58]:
coms_df['Community Label'].nunique()

213144
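
With this many communities relative to the number of transactions, it can be useful to look at the community size distribution, for example how many communities contain a single transaction. A sketch on a hypothetical five-row stand-in for `coms_df`:

```python
import pandas as pd

# Hypothetical community assignment for five transactions.
coms_example = pd.DataFrame({
    "trxn_id": ["T1", "T2", "T3", "T4", "T5"],
    "Community Label": [0, 0, 0, 1, 2],
})

# Number of transactions in each community.
sizes = coms_example["Community Label"].value_counts()

n_singletons = int((sizes == 1).sum())
largest = int(sizes.max())
print(n_singletons, largest)
```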

# Save Community Labels Dataframe to add to Customer Details

In [59]:
coms_df.to_csv('Community_labels.csv')