# Data Wrangling
You will notice a number of what look like duplicated transactions in the data set. One type of duplicated transaction is a reversed transaction, where a purchase is followed by a reversal. Another example is a multi-swipe, where a vendor accidentally charges a customer's card multiple times within a short time span.
* Can you programmatically identify reversed and multi-swipe transactions?
* What total number of transactions and total dollar amount do you estimate for the reversed transactions? For the multi-swipe transactions? (please consider the first transaction to be "normal" and exclude it from the number of transaction and dollar amount counts)
* Did you find anything interesting about either kind of transaction?

The following sections summarize my findings for each question above. Each subsection begins with a summary of results and is followed by supporting details and the Python scripts used to generate those results.

# Imports

In [3]:
# Data Wrangling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical Analysis
from scipy import stats
from distfit import distfit

# Utils
import os
from IPython.display import display

# Data loading and cleaning functions
from data_load_clean import get_transaction_data, clean_transation_data

# Load and Clean Data

In [5]:
# Load and clean data using the functions in data_load_clean.py
# These scripts were developed during the data cleaning and analysis process
df_raw = get_transaction_data()
df = clean_transation_data(df_raw)

# Conclusions from Identifying Duplicate Transactions
Inspecting the transaction data revealed a set of activities that complicate analysis of customer transactions. I identified chains of non-standard events, including multi-swipes, reversals, and address verifications, that reflect the complexities of "real-world" transactions. Answers to the specific prompts are summarized below:
* Reversed transactions occurred in 20,303 records and totaled \$2,821,792.50.
* Multi-swipe transactions occurred in 7,450 records and totaled \$1,103,132.25, not including the original "normal" transaction.
* During my review of these duplicate transactions, I found that many duplicates existed within chains of non-standard transactions. An example of such a chain is purchase-purchase-reversal. I mapped these event sequences to understand their frequency and potential impact on customers. I found:
    * Non-standard event chains occur in 3.31% of all transaction records (26026 records) and in 1.70% of transaction records if you exclude the original transaction (13405 records).
    * Event chains contain about 1.6 purchase transactions on average and infrequently include address verifications (3.0% of chains).
    * Almost half of all event chains contain a reversal (45.4%). This suggests that the remainder are left unresolved at the time of purchase.
    * 28% of all reversals occur in event chains (5723 reversals of the 20303 total). Reversals also occur with higher frequency in chains than outside of chains (22.0% of chain transactions are reversals vs 1.9% of non-chain transactions). Reversals occurring outside of chains appear to happen well after the initial purchase, suggesting a refund or return as opposed to an in-the-moment series of swipes and events.
    * Address verification chains, while relatively rare, included a reversal almost half of the time (47.1%). Assuming that address verifications follow a standard procedure, it is surprising to see different compositions of these chains of events. Furthermore, address verifications carry no cost, so the reversal may be unnecessary.
    * I also observed transactions where the amount was greater than the available money for the account. These transactions were slightly more frequent in event chains (4.2% of event chain transactions compared to 3.8% of non-chain transactions). They could be worth further investigation and more rigorous statistical treatment in future work.

Questions for subject matter experts:
* I assumed a time window of 10 minutes would reasonably capture multi-swipes. Do we have evidence to suggest a different time window?
* What is customer feedback about transaction reversals and non-standard event chains? Are customers satisfied with Capital One's management of these events?
* What is the purpose of address verification transactions, and what is the proper way for a merchant to process them?
* How are transactions where the amount was greater than the available money for the account handled in the transaction processing system, and how is that reflected in the final transaction data?
    * It is not clear from the transaction data whether those transactions are allowed to result in a purchase. Depending on the answer, it could influence the quantification of non-standard event chains.


Analytic next steps, given more time:
* Compare different-sized time windows for event chains to understand the sensitivity of my multi-swipe detection.
* Statistically test if the frequencies of non-standard events are different in and out of non-standard event chains. I left that assessment at a comparison of frequencies, but could generate hypothesis tests if needed.
* Review the costs associated with reversals, multi-swipes, and non-standard event chains to understand if customers were negatively impacted by these events. For example, if chains were fully refunded or if customers ended up paying for duplicate purchases. If there was a significant negative impact, those unresolved chains could be flagged for review to ensure customers are not being over-charged by merchants.

## Reversed transactions
Review of the descriptive statistics completed in question 1 indicates that the transactionType column contains a flag for "REVERSAL". It is reasonable to assume that those records are reversed transactions. The cells below provide summary statistics on reversals in both raw values and percentages. Example tables are provided to illustrate the data.
* Note that the denominators used in the percentages use all records. This is not quite accurate, because we know some records are duplicated. However, this assessment still provides an estimate of the proportion of data duplicated in this way.

In [6]:
"""
One type of duplicated transaction is a reversed transaction, 
where a purchase is followed by a reversal.
""" 
print("Transaction types (excluding missing): ", df[df['transactionType'].notnull()]['transactionType'].unique())

print("Reversals occur in {:.02%} of transaction records ({} records)"
      .format(df[df['transactionType']=='REVERSAL'].shape[0]/df.shape[0],
              df[df['transactionType']=='REVERSAL'].shape[0]))

print("Reversals account for {:.02%} of the total transaction amount (${:.02f} in reversed transactions)"
      .format(df[df['transactionType']=='REVERSAL']['transactionAmount'].sum()/df['transactionAmount'].sum(),
              df[df['transactionType']=='REVERSAL']['transactionAmount'].sum()))

Transaction types (excluding missing):  ['PURCHASE' 'ADDRESS_VERIFICATION' 'REVERSAL']
Reversals occur in 2.58% of transaction records (20303 records)
Reversals account for 2.62% of the total transaction amount ($2821792.50 in reversed transactions)


In [7]:
"""
Manual inspection of reversals. Assuming each reversal has a previous 
corresponding purchase of the same amount.

Note: Some reversals have no matching original transaction. I assume
this is due to censoring of pre-2016 data. In addition, other reversals
have multiple transactions of the same amount.
These instances will be addressed in the multi-swipe investigation.
 
"""

# Show the first five reversals
for _, r in df[df['transactionType']=='REVERSAL'].head().iterrows():
    with pd.option_context('display.max_columns', 30):
        display(
            df[(df['accountNumber']==r['accountNumber']) &
               (df['transactionDateTime']<=r['transactionDateTime']) &
               (df['transactionAmount']==r['transactionAmount'])]
        )


# Show the last reversal, which has multiple transactions
for _, r in df[df['transactionType']=='REVERSAL'].tail(1).iterrows():
    with pd.option_context('display.max_columns', 30):
        display(
            df[(df['accountNumber']==r['accountNumber']) &
               (df['transactionDateTime']<=r['transactionDateTime']) &
               (df['transactionAmount']==r['transactionAmount'])]
        )

        


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud
38,574788567,574788567,2500,2416.11,2016-05-24 01:35:33,215.13,Convenient Tire,US,US,9,1,auto,2021-10-01,2015-10-13,2015-10-13,206,206,8522,PURCHASE,83.89,True,False,False
39,574788567,574788567,2500,2200.98,2016-05-24 01:38:03,215.13,Convenient Tire,US,US,9,1,auto,2021-10-01,2015-10-13,2015-10-13,206,206,8522,REVERSAL,299.02,True,False,False


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud
73,574788567,574788567,2500,2432.72,2016-10-07 10:23:57,3.87,Eazy Repair,US,US,5,1,auto,2022-10-01,2015-10-13,2015-10-13,579,579,4219,REVERSAL,67.28,True,False,False


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud
100,924729945,924729945,50000,50000.0,2016-10-04 04:20:04,168.57,discount.com,US,US,5,1,online_retail,2024-11-01,2014-07-25,2014-07-25,205,265,9459,PURCHASE,0.0,False,False,False
101,924729945,924729945,50000,49831.43,2016-10-19 14:01:45,168.57,discount.com,US,US,5,1,online_retail,2024-11-01,2014-07-25,2014-07-25,205,265,9459,REVERSAL,168.57,False,False,False


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud
132,984504651,984504651,50000,46818.15,2016-01-16 09:52:30,450.74,Planet Fitness #849960,US,US,5,1,health,2028-08-01,2015-07-27,2015-07-27,640,640,8332,PURCHASE,3181.85,False,False,False
133,984504651,984504651,50000,46367.41,2016-01-16 09:53:15,450.74,Planet Fitness #849960,US,US,5,1,health,2028-08-01,2015-07-27,2015-07-27,640,640,8332,REVERSAL,3632.59,False,False,False


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud
138,984504651,984504651,50000,46341.05,2016-01-17 08:13:08,81.73,AMC #724446,US,US,9,1,entertainment,2031-12-01,2015-07-27,2015-07-27,640,640,8332,PURCHASE,3658.95,True,False,False
156,984504651,984504651,50000,41909.3,2016-01-25 20:39:15,81.73,AMC #724446,US,US,9,1,entertainment,2031-12-01,2015-07-27,2015-07-27,640,640,8332,REVERSAL,8090.7,True,False,False


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud
786300,732852505,732852505,50000,49882.78,2016-06-16 10:32:11,22.55,Uber,US,US,90,1,rideshare,2027-07-01,2012-08-23,2012-08-23,939,939,3388,PURCHASE,117.22,False,False,False
786301,732852505,732852505,50000,49860.23,2016-06-22 19:07:55,22.55,Uber,US,US,90,1,rideshare,2027-07-01,2012-08-23,2012-08-23,939,939,3388,REVERSAL,139.77,False,False,False
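
The manual inspection above could also be done in bulk. The following is a minimal sketch, not the method used in this notebook: it pairs each reversal with the most recent earlier purchase on the same account, merchant, and amount, and computes the lag between them. The toy data and the `purchaseDateTime` and `reversalLag` column names are hypothetical. Note that a single purchase could be paired with more than one reversal under this approach.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the notebook.
tx = pd.DataFrame({
    "accountNumber": [1, 1, 2, 2, 3],
    "merchantName": ["A", "A", "B", "B", "C"],
    "transactionAmount": [10.0, 10.0, 25.5, 25.5, 7.0],
    "transactionType": ["PURCHASE", "REVERSAL", "PURCHASE", "REVERSAL", "REVERSAL"],
    "transactionDateTime": pd.to_datetime([
        "2016-01-01 10:00:00", "2016-01-01 10:02:00",
        "2016-01-05 09:00:00", "2016-01-20 12:00:00",
        "2016-02-01 08:00:00",
    ]),
}).sort_values("transactionDateTime")  # merge_asof requires sorted keys

keys = ["accountNumber", "merchantName", "transactionAmount"]

reversals = tx[tx["transactionType"] == "REVERSAL"]
purchases = tx.loc[tx["transactionType"] == "PURCHASE",
                   keys + ["transactionDateTime"]].copy()
# Keep a copy of the purchase timestamp so it survives the join
purchases["purchaseDateTime"] = purchases["transactionDateTime"]

# For each reversal, find the latest strictly earlier matching purchase
paired = pd.merge_asof(reversals, purchases,
                       on="transactionDateTime",
                       by=keys,
                       allow_exact_matches=False)
paired["reversalLag"] = paired["transactionDateTime"] - paired["purchaseDateTime"]

print(paired[["accountNumber", "purchaseDateTime", "reversalLag"]])
```

Reversals with no earlier purchase in the data (account 3 in the toy data) come back with a missing `purchaseDateTime`, which is one way to count reversals whose original transaction is absent.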


# Identifying Multi-Swipes
To identify multi-swipes, I look for identical transactions occurring at a given customer-merchant pair within a time window. Reviewing these instances suggests that multi-swipes can be part of more complex chains of events involving multi-swipes, reversals, and address verifications. To fully understand these chains of events, I identify when they occur and map each chain to a unique identifier. This mapping enables the quantification of the frequency of multi-swipes as well as other events. The following cells create that mapping and quantification.

In [9]:
"""
Another example is a multi-swipe, where a vendor accidentally 
charges a customer's card multiple times within a short time span.

I use a series of self-joins, filters, and joins to create the desired mapping
for chains of events that contain multi-swipes.

"""
# Create unique id for each transaction 
df['transactionId'] = df.index.astype(str)

# Order transaction data
df = df.sort_values('transactionDateTime')

# Set columns that will be used for merging later on
keep_cols = ['customerId','merchantName','transactionDateTime','transactionAmount','transactionId']

# Parameterize the time window
minutes_window = 10

# Self-join using merge_asof to identify any preceding transaction within the
# specified time-window that matches the customer, merchant, and transaction amount.
multi_df = pd.merge_asof(df, df[keep_cols],
                         on='transactionDateTime',
                         by=['customerId', 'transactionAmount','merchantName'],
                         suffixes=("","_previous"),
                         tolerance = pd.Timedelta(minutes=minutes_window),
                         allow_exact_matches = False)

#Display sample
display(multi_df[multi_df['transactionId_previous'].notnull()].head())

# The dataframe produced above identifies the previous multi-swipe or reversal
# transaction in the "transactionId_previous" field.
# I am interested in identifying the transaction that started the multi-swipe/reversal event chain.
# The starting transaction is referenced by other records as a previous transaction,
# but has no predecessor of its own. This makes it relatively easy to identify.
# .copy() avoids a SettingWithCopyWarning when the new column is assigned below.
multi_start_df = multi_df[
    (multi_df['transactionId'].isin(multi_df['transactionId_previous'])) &
    (multi_df['transactionId_previous'].isnull())].copy()

multi_start_df['transactionIdEventChainStart'] = multi_start_df['transactionId']

# Re-joining the resulting transactionIdEventChainStart field to the previous records
# identified as multi-swipe establishes which multi-swipe/reversals are associated with each event chain
# No tolerance is used, because a chain of swipes and reversals could theoretically go longer
# than the individual event window.
keep_cols = ['customerId','merchantName','transactionDateTime','transactionAmount','transactionIdEventChainStart']


multi_df = pd.merge_asof(multi_df, 
                         multi_start_df[keep_cols],
                         on='transactionDateTime',
                         by=['customerId', 'transactionAmount','merchantName'],
                         allow_exact_matches = True) #allow exact matches so the start of the chain is identified

# Clean up matches outside of an event chain
multi_df.loc[(multi_df['transactionId_previous'].isnull()) &
             (multi_df['transactionId']!=multi_df['transactionIdEventChainStart']),
             'transactionIdEventChainStart'] = np.nan

# Create indicator variables for future use
# Indicate transaction in event chain
multi_df.loc[:,'inEventChain'] = 0
multi_df.loc[multi_df['transactionIdEventChainStart'].notnull(),
            'inEventChain'] = 1

# Indicate origin of event chain, the first swipe
multi_df.loc[:,'originEventChain'] = 0
multi_df.loc[multi_df['transactionId']==multi_df['transactionIdEventChainStart'],
            'originEventChain'] = 1

# Indicate subsequent activity in event chain
multi_df.loc[:,'subsequentEventChain'] = 0
multi_df.loc[(multi_df['transactionId']!=multi_df['transactionIdEventChainStart']) &
             (multi_df['inEventChain']==1),
            'subsequentEventChain'] = 1

# Display sample chains
chain_count = (multi_df
               .groupby('transactionIdEventChainStart')[['transactionId']]
               .count()
               .sort_values('transactionId'))

# Short chain - picks up a reversal
display(multi_df[multi_df['transactionIdEventChainStart']==chain_count.index[0]])

# Long chain - combination of reversal, multi-swipe, and address verification
display(multi_df[multi_df['transactionIdEventChainStart']==chain_count.index[-1]])


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,...,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous
172,708054411,708054411,20000,19996.68,2016-01-01 01:52:37,3.32,Wayfair.com,US,US,5,...,895,895,9150,PURCHASE,3.32,True,False,False,413236,413235
174,829756717,829756717,10000,9996.3,2016-01-01 01:54:16,3.7,Auntie Anne's #274744,US,US,5,...,844,844,2059,REVERSAL,3.7,True,False,False,334434,334433
333,325587097,325587097,20000,19721.32,2016-01-01 03:50:28,188.23,Shell Repair,US,US,2,...,980,980,6709,REVERSAL,278.68,True,False,False,408581,408580
374,449619624,449619624,15000,14787.32,2016-01-01 04:16:30,100.54,Uber,US,US,5,...,332,332,1458,PURCHASE,212.68,False,False,False,378028,378027
403,409705516,409705516,7500,7276.01,2016-01-01 04:35:38,83.31,Auntie Anne's #324440,US,US,9,...,108,108,9581,PURCHASE,223.99,True,False,False,273197,273196


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,...,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain
640165,740574175,740574175,15000,1579.47,2016-10-27 23:29:02,118.87,ebay.com,US,US,5,...,13420.53,False,False,False,100014,,100014,1,1,0
640167,740574175,740574175,15000,1460.6,2016-10-27 23:30:28,118.87,ebay.com,US,US,5,...,13539.4,False,False,False,100015,100014.0,100014,1,0,1


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,...,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain
449961,225678947,225678947,2500,1581.88,2016-08-03 08:24:05,0.0,Ray's Fries #745083,US,US,9,...,918.12,False,False,False,169363,,169363,1,1,0
449962,225678947,225678947,2500,1581.88,2016-08-03 08:25:10,0.0,Ray's Fries #745083,US,US,9,...,918.12,False,False,False,169364,169363.0,169363,1,0,1
449966,225678947,225678947,2500,1581.88,2016-08-03 08:26:33,0.0,Ray's Fries #745083,US,US,9,...,918.12,False,False,False,169365,169364.0,169363,1,0,1
449968,225678947,225678947,2500,1581.88,2016-08-03 08:26:45,0.0,Ray's Fries #745083,US,US,9,...,918.12,False,False,False,169366,169365.0,169363,1,0,1
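
As a cross-check on the mapping above, the same kind of chain labels can be approximated with a gap-based grouping instead of two `merge_asof` joins. This is a sketch of a different technique, not the code used in this notebook: within each customer, merchant, and amount group, a new chain starts whenever the gap to the previous record exceeds the window, and only groups with more than one record count as chains. The toy data and the `sessionId` name are hypothetical, and records with identical timestamps are treated slightly differently than with `allow_exact_matches = False`.

```python
import pandas as pd

window = pd.Timedelta(minutes=10)
keys = ["customerId", "merchantName", "transactionAmount"]

# Toy data (hypothetical values); column names follow the notebook.
tx = pd.DataFrame({
    "customerId": [1, 1, 1, 1, 2],
    "merchantName": ["A", "A", "A", "A", "B"],
    "transactionAmount": [20.0, 20.0, 20.0, 20.0, 8.0],
    "transactionType": ["PURCHASE", "PURCHASE", "REVERSAL", "PURCHASE", "PURCHASE"],
    "transactionDateTime": pd.to_datetime([
        "2016-01-01 09:00:00", "2016-01-01 09:03:00", "2016-01-01 09:05:00",
        "2016-01-01 12:00:00", "2016-01-01 10:00:00",
    ]),
})

# Sort so that each group is contiguous and ordered in time
tx = tx.sort_values(keys + ["transactionDateTime"]).reset_index(drop=True)

# Time since the previous record in the same group (NaT for the first record)
gap = tx.groupby(keys)["transactionDateTime"].diff()

# A record opens a new session if it is first in its group or the gap is too long
new_session = gap.isna() | (gap > window)
tx["sessionId"] = new_session.cumsum()

# Sessions with more than one record are event chains
session_size = tx.groupby("sessionId")["sessionId"].transform("size")
tx["inEventChain"] = (session_size > 1).astype(int)
tx["originEventChain"] = ((tx["inEventChain"] == 1) & new_session).astype(int)
tx["subsequentEventChain"] = ((tx["inEventChain"] == 1) & ~new_session).astype(int)

print(tx[["transactionType", "sessionId", "inEventChain",
          "originEventChain", "subsequentEventChain"]])
```

In the toy data, the first three records form one chain (purchase, purchase, reversal), while the purchase three hours later and the single purchase by the second customer are left outside any chain.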


In [10]:
# Multi-swipe summary statistics
# Summarizes repeat "PURCHASE" swipes. Excludes the first "normal" swipes.
multi_swipe_mask = ((multi_df['transactionType']=='PURCHASE') &
                    (multi_df['subsequentEventChain']==1))

print('Multi-swipes occur in {:.02%} of transaction records ({} records)'
      .format(multi_df[multi_swipe_mask].shape[0]/multi_df.shape[0],
             multi_df[multi_swipe_mask].shape[0]))

print("Multi-swipes account for {:.02%} of the total transaction amount (${:.02f} in multi-swipe purchases)"
      .format(multi_df[multi_swipe_mask]['transactionAmount'].sum()/multi_df['transactionAmount'].sum(),
             multi_df[multi_swipe_mask]['transactionAmount'].sum()))


Multi-swipes occur in 0.95% of transaction records (7450 records)
Multi-swipes account for 1.02% of the total transaction amount ($1103132.25 in multi-swipe purchases)
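
The counts above depend on the assumed 10-minute window. A simple way to probe that sensitivity is sketched below under simplifying assumptions: it only looks at gaps between consecutive identical purchases, so it ignores reversals and address verifications inside a chain and will not reproduce the chain-based counts exactly. The helper name `count_repeat_purchases` and the toy data are hypothetical.

```python
import pandas as pd

def count_repeat_purchases(tx, minutes):
    """Count purchases preceded by an identical purchase within `minutes`."""
    keys = ["customerId", "merchantName", "transactionAmount"]
    purchases = (tx[tx["transactionType"] == "PURCHASE"]
                 .sort_values("transactionDateTime"))
    # Gap to the previous identical purchase (NaT for the first in each group)
    gap = purchases.groupby(keys)["transactionDateTime"].diff()
    # Comparisons against NaT are False, so first swipes are never counted
    return int((gap <= pd.Timedelta(minutes=minutes)).sum())

# Toy data (hypothetical values); column names follow the notebook.
tx = pd.DataFrame({
    "customerId": [1, 1, 1, 1, 2, 2],
    "merchantName": ["A"] * 4 + ["B"] * 2,
    "transactionAmount": [5.0] * 4 + [9.0] * 2,
    "transactionType": ["PURCHASE"] * 6,
    "transactionDateTime": pd.to_datetime([
        "2016-01-01 10:00", "2016-01-01 10:01",
        "2016-01-01 10:08", "2016-01-01 10:30",
        "2016-01-01 11:00", "2016-01-01 11:04",
    ]),
})

# Repeat-purchase counts for several candidate windows (in minutes)
counts = {m: count_repeat_purchases(tx, m) for m in (2, 5, 10, 30)}
print(counts)
```

If the count keeps rising steadily as the window grows, the window is likely picking up genuine repeat purchases rather than accidental swipes; a plateau would suggest a natural cutoff.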


## Evaluating non-standard transactions
I am also interested in understanding the inter-relationships between non-standard transactions: multi-swipes, reversals, and address verifications. Below, I analyze the compositions of these non-standard event chains and compare them to transactions outside of event chains. 

In [11]:
"""
Using the dataframe developed to evaluate multi-swipes, I generate the summary
statistics for each type of non-standard transaction.
"""
# Fraction of transactions involved in a chain
chain_mask = ((multi_df['inEventChain']==1))

print('Non-standard event chains occur in {:.02%} of transaction records ({} records)'
      .format(multi_df[chain_mask].shape[0]/multi_df.shape[0],
             multi_df[chain_mask].shape[0]))

# Fraction of transactions involved in a chain, excluding the original
sub_chain_mask = ((multi_df['subsequentEventChain']==1))

print('Non-standard event chains occur in {:.02%} of transaction records, excluding the original transaction ({} records)'
      .format(multi_df[sub_chain_mask].shape[0]/multi_df.shape[0],
             multi_df[sub_chain_mask].shape[0]))

# Check the count and frequency of different transaction types
# Compare count and frequency between all transactions, chains, and non-chains.
# All
transaction_comparison = multi_df.groupby('transactionType')[['transactionId']].count()
transaction_comparison.rename(columns={'transactionId':'All Count'}, inplace=True)
transaction_comparison['All Frequency'] = transaction_comparison['All Count']/transaction_comparison['All Count'].sum()

# Chains
transaction_comparison= transaction_comparison.join(multi_df[multi_df['inEventChain']==1]
                                                    .groupby('transactionType')[['transactionId']].count())
transaction_comparison.rename(columns={'transactionId':'Chain Count'}, inplace=True)
transaction_comparison['Chain Frequency'] = transaction_comparison['Chain Count']/transaction_comparison['Chain Count'].sum()

# Non-chains
transaction_comparison= transaction_comparison.join(multi_df[multi_df['inEventChain']!=1]
                                                    .groupby('transactionType')[['transactionId']].count())
transaction_comparison.rename(columns={'transactionId':'Non-Chain Count'}, inplace=True)
transaction_comparison['Non-Chain Frequency'] = transaction_comparison['Non-Chain Count']/transaction_comparison['Non-Chain Count'].sum()

# Show comparison of transaction occurrences
display(transaction_comparison)

# Investigate the composition of individual chains
chain_sum = (multi_df[multi_df['transactionIdEventChainStart'].notnull()]
             .groupby(['transactionIdEventChainStart','transactionType'])[['transactionId']]
             .count()
             .unstack()
             .fillna(0))

chain_sum.columns = ['COUNT_{}'.format(c[1]) for c in chain_sum.columns]

for c in chain_sum.columns:
    chain_sum.loc[:,"FREQ_{}".format(c)] = 0
    chain_sum.loc[chain_sum[c]>0,"FREQ_{}".format(c)] = 1

chain_sum_mean = (chain_sum
                  .mean()
                  .to_frame()
                  .rename(columns={0:'Average within Chains'}))
# Show chain composition
display(chain_sum_mean)

# Show chain composition for address verification only
display((chain_sum[chain_sum['FREQ_COUNT_ADDRESS_VERIFICATION']==1]
         .mean()
         .to_frame()
         .rename(columns={0:'Average within Chains Containing Address Verification'})))

Non-standard event chains occur in 3.31% of transaction records (26026 records)
Non-standard event chains occur in 1.70% of transaction records, excluding the original transaction (13405 records)


Unnamed: 0_level_0,All Count,All Frequency,Chain Count,Chain Frequency,Non-Chain Count,Non-Chain Frequency
transactionType,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ADDRESS_VERIFICATION,20169,0.025671,606,0.023306,19563,0.025752
PURCHASE,745193,0.948487,19673,0.756596,725520,0.955055
REVERSAL,20303,0.025842,5723,0.220098,14580,0.019193


Unnamed: 0,Average within Chains
COUNT_ADDRESS_VERIFICATION,0.048046
COUNT_PURCHASE,1.55974
COUNT_REVERSAL,0.453738
FREQ_COUNT_ADDRESS_VERIFICATION,0.030286
FREQ_COUNT_PURCHASE,0.96908
FREQ_COUNT_REVERSAL,0.453738


Unnamed: 0,Average within Chains Containing Address Verification
COUNT_ADDRESS_VERIFICATION,1.586387
COUNT_PURCHASE,0.0
COUNT_REVERSAL,0.471204
FREQ_COUNT_ADDRESS_VERIFICATION,1.0
FREQ_COUNT_PURCHASE,0.0
FREQ_COUNT_REVERSAL,0.471204
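
The difference in reversal frequency inside and outside chains could be tested formally. Below is a minimal sketch using a chi-square test of independence on the counts from the comparison table above (reversals vs. all other transaction types, in-chain vs. non-chain). It treats records as independent, which is an assumption that does not strictly hold for records within the same chain, so the result should be read as indicative only.

```python
import numpy as np
from scipy import stats

# Counts taken from the transaction comparison table above.
# Rows: in chain, not in chain. Columns: reversal, other types.
counts = np.array([
    [5723, 606 + 19673],       # chain: reversals, address verifications + purchases
    [14580, 19563 + 725520],   # non-chain: reversals, address verifications + purchases
])

# Share of reversals within each row
chain_rate = counts[0, 0] / counts[0].sum()
non_chain_rate = counts[1, 0] / counts[1].sum()

# Chi-square test of independence on the 2x2 table
chi2, p_value, dof, expected = stats.chi2_contingency(counts)

print("Reversal rate in chains: {:.02%}".format(chain_rate))
print("Reversal rate outside chains: {:.02%}".format(non_chain_rate))
print("chi2 = {:.1f}, dof = {}, p-value = {:.3g}".format(chi2, dof, p_value))
```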


In [12]:
"""
Notice above that not all reversals are contained in chains. This suggests 
that many reversals happen after the fact. We confirm this below with a manual inspection.
"""
# Get reversals outside of chains
long_rev_df = multi_df[(multi_df['inEventChain']==0) & (multi_df['transactionType']=='REVERSAL')]

# Illustrate examples of after the fact reversals
for i, r in long_rev_df.sample(5, random_state=42).iterrows():
    with pd.option_context('display.max_columns', 30):
        display(multi_df[(multi_df['accountNumber']==r['accountNumber']) &
                         (multi_df['transactionAmount']==r['transactionAmount']) &
                         (multi_df['merchantName']==r['merchantName'])])

Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain
425095,934499811,934499811,10000,9538.04,2016-07-22 23:19:22,224.81,Rodeway Inn #599107,US,US,2,1,hotels,2024-05-01,2015-03-25,2015-03-25,703,703,2399,PURCHASE,461.96,True,False,False,173573,,,0,0,0
445925,934499811,934499811,10000,10000.0,2016-08-01 12:54:54,224.81,Rodeway Inn #599107,US,US,2,1,hotels,2024-05-01,2015-03-25,2015-03-25,703,703,2399,REVERSAL,0.0,True,False,False,173578,,,0,0,0


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain
607113,133105237,133105237,5000,1747.98,2016-10-13 06:56:38,7.45,Uber,US,US,2,1,rideshare,2033-07-01,2015-12-23,2016-09-08,706,706,6220,PURCHASE,3252.02,False,False,False,716449,,,0,0,0
626884,133105237,133105237,5000,2168.7,2016-10-22 02:06:57,7.45,Uber,US,US,2,1,rideshare,2033-07-01,2015-12-23,2016-09-08,706,706,6220,REVERSAL,2831.3,False,False,False,716479,,,0,0,0


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain
597487,419709514,419709514,5000,1623.9,2016-10-09 00:27:42,0.44,cheapfast.com,US,US,2,1,online_retail,2021-09-01,2015-03-01,2016-09-11,917,917,2178,PURCHASE,3376.1,False,False,False,431228,,,0,0,0
633950,419709514,419709514,5000,2167.45,2016-10-25 05:37:23,0.44,cheapfast.com,US,US,2,1,online_retail,2021-09-01,2015-03-01,2016-09-11,917,917,2178,REVERSAL,2832.55,False,False,False,431506,,,0,0,0


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain
82308,780458682,780458682,15000,14914.26,2016-02-11 07:35:32,528.03,McDonalds #46753,US,US,9,1,fastfood,2022-09-01,2015-08-16,2015-08-16,612,612,8698,PURCHASE,85.74,True,False,False,302647,,,0,0,0
118960,780458682,780458682,15000,14239.57,2016-02-29 04:16:31,528.03,McDonalds #46753,US,US,9,1,fastfood,2022-09-01,2015-08-16,2015-08-16,612,612,8698,REVERSAL,760.43,True,False,False,302651,,,0,0,0


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain
66928,700725639,700725639,50000,14327.59,2016-02-03 15:16:38,11.1,cheapfast.com,US,US,5,1,online_retail,2030-10-01,2015-08-12,2016-01-04,926,926,9522,REVERSAL,35672.41,False,False,False,152132,,,0,0,0


In [13]:
"""
Finally, we also have a hypothesis about transactions that are greater than available credit.
We investigate below how frequently those occur and whether they are more likely to be present
in non-standard chains.
"""

# Create flag
multi_df.loc[:,'overDraw'] = 0
multi_df.loc[multi_df['transactionAmount']>multi_df['availableMoney'],'overDraw'] = 1

# Compare frequencies
print("Transactions greater than available credit occur {:.02%} of the time, with {:.02%} occurring in event chains and {:.02%} occurring out of event chains."
      .format(multi_df['overDraw'].mean(),
             multi_df[multi_df['inEventChain']==1]['overDraw'].mean(),
            multi_df[multi_df['inEventChain']==0]['overDraw'].mean()))

# Illustrate examples of overdraws
for i, r in multi_df[(multi_df['overDraw']==1)].sample(5, random_state=42).iterrows():
    with pd.option_context('display.max_columns', 30):
        display(multi_df[(multi_df['accountNumber']==r['accountNumber']) &
                         (multi_df['transactionDateTime']<(r['transactionDateTime']+pd.Timedelta(minutes=minutes_window*2))) &
                         (multi_df['transactionDateTime']>(r['transactionDateTime']-pd.Timedelta(minutes=minutes_window*2))) &
                         (multi_df['merchantName']==r['merchantName'])])


Transactions greater than available credit occur 3.82% of the time, with 4.17% occurring in event chains and 3.81% occurring out of event chains.


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain,overDraw
725320,352034651,352034651,250,163.74,2016-12-04 10:52:33,165.16,ebay.com,US,US,5,1,online_retail,2022-01-01,2015-12-22,2016-10-10,150,150,5377,PURCHASE,86.26,False,False,False,83619,,,0,0,0,1


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain,overDraw
640188,981447324,981447324,250,65.59,2016-10-27 23:46:39,198.03,AMC #706324,US,US,9,1,entertainment,2020-04-01,2015-06-26,2016-09-26,391,391,0,PURCHASE,184.41,True,False,False,779510,,,0,0,0,1


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain,overDraw
683391,570852613,570852613,250,160.98,2016-11-16 00:56:24,225.22,Universe Massage #219515,US,US,5,1,personal care,2022-02-01,2015-12-13,2015-12-13,675,675,6371,PURCHASE,89.02,True,False,False,560147,,,0,0,0,1


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain,overDraw
208781,618793140,618793140,1000,166.37,2016-04-12 14:02:33,167.11,KFC #992683,US,US,2,1,fastfood,2029-11-01,2015-03-18,2015-03-18,392,392,542,PURCHASE,833.63,True,False,False,665382,,,0,0,0,1


Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud,transactionId,transactionId_previous,transactionIdEventChainStart,inEventChain,originEventChain,subsequentEventChain,overDraw
70730,380680241,380680241,5000,280.46,2016-02-05 13:54:11,483.45,Fresh Flowers,US,US,5,8,online_gifts,2030-09-01,2014-06-21,2016-01-29,869,869,593,PURCHASE,4719.54,False,False,True,243384,,,0,0,0,1
