# Transaction-Level Analysis in Banking

* In this lesson, we will learn how to explore and analyze individual transactions in a synthetic banking dataset.
* Understanding transaction-level data helps banks detect fraud, know their customers, and monitor risk.
* You will build practical Python code to uncover trends, unusual spending patterns, and customer insights.
* This is a critical skill for working in data roles at banks, fintechs, or credit risk teams.

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Beginner Example 1: Create a synthetic banking transactions dataset
np.random.seed(42)
n_transactions = 1000
n_customers = 200
df = pd.DataFrame({
    'transaction_id': range(1, n_transactions + 1),
    'customer_id': np.random.choice([f'CUST_{i:04d}' for i in range(1, n_customers + 1)], n_transactions),
    'amount': np.round(np.random.normal(150, 60, n_transactions), 2),
    'transaction_type': np.random.choice(['Debit', 'Credit'], n_transactions),
    'channel': np.random.choice(['ATM', 'Online', 'Branch', 'POS'], n_transactions),
    'date': pd.date_range(start='2024-01-01', periods=n_transactions, freq='h')
})
print(df.shape)
print(df.head(3))

(1000, 6)
   transaction_id customer_id  amount transaction_type channel  \
0               1   CUST_0103  238.77           Credit     ATM   
1               2   CUST_0180  269.17           Credit     POS   
2               3   CUST_0093   58.62           Credit  Online   

                 date  
0 2024-01-01 00:00:00  
1 2024-01-01 01:00:00  
2 2024-01-01 02:00:00  


In [3]:
# Beginner Example 2: Checking for missing or duplicate transaction_id
print('Missing values:', df.isnull().sum().sum())
duplicates = df['transaction_id'].duplicated().sum()
print('Duplicate transaction_id:', duplicates)

Missing values: 0
Duplicate transaction_id: 0


In [4]:
# Beginner Example 3: Exploring basic statistics for transaction amounts
print(df['amount'].describe())

count    1000.000000
mean      151.702080
std        63.022399
min       -64.090000
25%       111.372500
50%       153.535000
75%       195.162500
max       340.470000
Name: amount, dtype: float64


# Interpreting Early Results

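- Summary statistics can reveal outliers such as very high or negative amounts. Here the minimum is -64.09, because the amounts are drawn from a normal distribution with no lower bound.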
- Missing or duplicate identifiers usually suggest data entry or pipeline problems.
- Comparing transaction counts for each type and channel helps spot suspicious patterns.
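
One way to compare counts across type and channel at the same time is a cross-tabulation. The sketch below uses `pd.crosstab` on a tiny hand-made frame (the rows are made up for illustration and are not the lesson's dataset); only the column names mirror the ones used in this lesson.

```python
import pandas as pd

# Tiny illustrative sample; values are made up for this sketch.
sample = pd.DataFrame({
    'transaction_type': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit'],
    'channel': ['ATM', 'Online', 'ATM', 'POS', 'POS', 'ATM'],
})

# Rows are transaction types, columns are channels, cells are counts.
type_by_channel = pd.crosstab(sample['transaction_type'], sample['channel'])
print(type_by_channel)

# Share of each channel within a transaction type (each row sums to 1).
row_share = pd.crosstab(sample['transaction_type'], sample['channel'], normalize='index')
print(row_share.round(2))
```

A combination that is far more common than its row and column totals suggest, such as one type concentrated in a single channel, is the kind of pattern worth a closer look.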

In [5]:

# Beginner Example 4: Count of transactions by type
print(df['transaction_type'].value_counts())


transaction_type
Credit    503
Debit     497
Name: count, dtype: int64


In [6]:

# Beginner Example 5: Count transactions by channel
print(df['channel'].value_counts())

channel
ATM       266
POS       250
Branch    248
Online    236
Name: count, dtype: int64


In [7]:
# Intermediate Example 1: Find transactions with negative or zero amounts
negatives = df[df['amount'] <= 0]
print('Number of negative or zero transactions:', negatives.shape[0])
print(negatives[['transaction_id', 'amount']].head(2))

Number of negative or zero transactions: 7
     transaction_id  amount
94               95   -4.38
165             166  -64.09


In [8]:
# Intermediate Example 2: Top five transactions by amount
top5 = df.nlargest(5, 'amount')
print(top5[['transaction_id', 'customer_id', 'amount']])

     transaction_id customer_id  amount
92               93   CUST_0111  340.47
658             659   CUST_0009  338.52
521             522   CUST_0085  335.64
535             536   CUST_0096  326.77
240             241   CUST_0098  315.65


In [9]:
# Identify the lowest five transactions by amount
bottom5 = df.nsmallest(5, 'amount')
print(bottom5[['transaction_id', 'customer_id', 'amount']])

     transaction_id customer_id  amount
165             166   CUST_0170  -64.09
819             820   CUST_0165  -53.31
297             298   CUST_0161  -12.75
312             313   CUST_0128   -8.84
231             232   CUST_0068   -4.69


In [10]:
# Find the customer with the highest total transaction amount
customer_totals = df.groupby('customer_id')['amount'].sum()
top_customer = customer_totals.idxmax()
top_amount = customer_totals.max()
print(f'Customer with highest total transaction amount: {top_customer} with amount {top_amount}')

Customer with highest total transaction amount: CUST_0099 with amount 1999.1


In [12]:
# Count transactions per customer and show the five most active customers
count_per_customer = df['customer_id'].value_counts()
print(count_per_customer.head(5))

customer_id
CUST_0190    13
CUST_0099    13
CUST_0113    11
CUST_0161    11
CUST_0090    10
Name: count, dtype: int64


In [13]:
import matplotlib.pyplot as plt
# Intermediate Example 4: Group transactions by date and sum amounts
daily_totals = df.groupby(df['date'].dt.date)['amount'].sum()
print(daily_totals.head())

date
2024-01-01    3771.76
2024-01-02    3993.85
2024-01-03    3777.92
2024-01-04    3946.69
2024-01-05    3489.53
Name: amount, dtype: float64
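
Grouping by `dt.date` only produces rows for days that have at least one transaction. As a hedged alternative, `resample('D')` on a datetime index keeps a `DatetimeIndex` and also emits days with no activity (summing to 0), which can matter when looking for quiet periods. The sample rows below are made up for illustration.

```python
import pandas as pd

# Made-up transactions with a gap on 2024-01-02.
tx = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01 08:00', '2024-01-01 17:30', '2024-01-03 11:00']),
    'amount': [100.0, 50.0, 25.0],
})

# Resample to calendar days; empty days appear with a sum of 0.
daily = tx.set_index('date')['amount'].resample('D').sum()
print(daily)
```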


In [14]:
# Intermediate Example 5: Detect possible suspicious patterns (very high transactions)
threshold = df['amount'].mean() + 3 * df['amount'].std()
suspicious = df[df['amount'] > threshold]
print('Transactions flagged as suspicious:', suspicious.shape[0])
print(suspicious[['transaction_id', 'customer_id', 'amount']].head())

Transactions flagged as suspicious: 0
Empty DataFrame
Columns: [transaction_id, customer_id, amount]
Index: []
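
Nothing is flagged here because the 3-sigma threshold works out to roughly 340.77 (151.70 + 3 × 63.02), just above the largest amount of 340.47. A related weakness of the rule is that extreme values inflate the standard deviation they are judged against. The sketch below contrasts it with a quartile-based (IQR) fence on a small made-up series; the 1.5 multiplier is the conventional choice, not something prescribed by this lesson.

```python
import pandas as pd

# Made-up amounts with one extreme value; not taken from the lesson's dataset.
amounts = pd.Series([100, 120, 130, 140, 150, 160, 170, 180, 200, 900], dtype=float)

# 3-sigma rule: the extreme value inflates the standard deviation it is judged against.
sigma_threshold = amounts.mean() + 3 * amounts.std()
sigma_flags = amounts[amounts > sigma_threshold]

# IQR rule: fences are built from quartiles, which a single extreme value barely moves.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
iqr_flags = amounts[amounts > upper_fence]

print('3-sigma flags:', sigma_flags.tolist())
print('IQR flags:', iqr_flags.tolist())
```

In this toy series the 3-sigma rule misses the 900 value while the IQR fence catches it. Which rule suits real transaction data depends on the shape of the distribution, so treat both as starting points.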


In [15]:
# Intermediate Example 6: Add a day_of_week column for analysis
df['day_of_week'] = df['date'].dt.day_name()
print(df[['date', 'day_of_week']].head(3))

                 date day_of_week
0 2024-01-01 00:00:00      Monday
1 2024-01-01 01:00:00      Monday
2 2024-01-01 02:00:00      Monday


In [16]:
# Advanced Example 1: Top three customers by total spend
total_spend = df.groupby('customer_id')['amount'].sum()
top_customers = total_spend.nlargest(3)
print(top_customers)

customer_id
CUST_0099    1999.10
CUST_0147    1879.19
CUST_0190    1826.46
Name: amount, dtype: float64


In [17]:
# Advanced Example 2: Analyze average debit vs credit amount
means = df.groupby('transaction_type')['amount'].mean()
print(means)

transaction_type
Credit    150.538052
Debit     152.880161
Name: amount, dtype: float64


In [18]:
# Advanced Example 3: Find customers with many small transactions
# (amount < 20; note that this filter also includes the negative amounts)
small_tx = df[df['amount'] < 20]
small_count = small_tx['customer_id'].value_counts()
print(small_count.head())

customer_id
CUST_0161    2
CUST_0172    1
CUST_0088    1
CUST_0050    1
CUST_0200    1
Name: count, dtype: int64


In [19]:
# Advanced Example 4: Time gap between consecutive customer transactions
df_sorted = df.sort_values(['customer_id', 'date'])
df_sorted['prev_date'] = df_sorted.groupby('customer_id')['date'].shift(1)
df_sorted['gap_hours'] = (df_sorted['date'] - df_sorted['prev_date']).dt.total_seconds() / 3600
print(df_sorted[['customer_id', 'date', 'gap_hours']].dropna().head())


    customer_id                date  gap_hours
490   CUST_0001 2024-01-21 10:00:00      353.0
536   CUST_0001 2024-01-23 08:00:00       46.0
709   CUST_0001 2024-01-30 13:00:00      173.0
741   CUST_0001 2024-01-31 21:00:00       32.0
825   CUST_0001 2024-02-04 09:00:00       84.0
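
Short gaps are often the interesting ones in this kind of analysis. A minimal sketch, assuming a hypothetical one-hour cutoff (`max_gap_hours` is an illustrative name and value, not part of the lesson), that flags transactions arriving soon after the same customer's previous one:

```python
import pandas as pd

# Made-up transactions for two customers.
tx = pd.DataFrame({
    'customer_id': ['A', 'A', 'A', 'B', 'B'],
    'date': pd.to_datetime([
        '2024-01-01 09:00', '2024-01-01 09:20', '2024-01-03 12:00',
        '2024-01-01 10:00', '2024-01-02 10:00',
    ]),
})

max_gap_hours = 1.0  # hypothetical cutoff chosen for illustration

tx = tx.sort_values(['customer_id', 'date'])
# diff() within each customer gives the time since that customer's previous transaction.
tx['gap_hours'] = tx.groupby('customer_id')['date'].diff().dt.total_seconds() / 3600
# A customer's first transaction has no predecessor, so its gap is NaN and is not flagged.
tx['rapid_repeat'] = tx['gap_hours'] < max_gap_hours

print(tx[tx['rapid_repeat']])
```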


In [20]:
# Error Handling: What if the 'amount' column is missing?
try:
    print(df['amount'].head())
except KeyError:
    print('Column amount is missing! Please check your data source.')

0    238.77
1    269.17
2     58.62
3     81.85
4    163.56
Name: amount, dtype: float64
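
The `try`/`except` above only fails at the moment the column is accessed. A minimal sketch of checking every expected column up front is shown below; the helper name `check_required_columns` and the list of required columns are assumptions made for illustration.

```python
import pandas as pd

def check_required_columns(frame, required):
    """Return the required column names that are absent from frame."""
    return [col for col in required if col not in frame.columns]

# Made-up frame that lacks the 'amount' and 'date' columns.
incomplete = pd.DataFrame({
    'transaction_id': [1, 2],
    'customer_id': ['CUST_0001', 'CUST_0002'],
})

missing = check_required_columns(
    incomplete, ['transaction_id', 'customer_id', 'amount', 'date']
)
if missing:
    print('Missing columns:', missing)
else:
    print('All required columns are present.')
```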


In [21]:
# Error Handling: Detect if any date values are missing or out of order
if df['date'].isnull().sum() > 0:
    print('There are missing dates!')
elif not df['date'].is_monotonic_increasing:
    print('Dates are not in order!')
else:
    print('Dates are OK!')

Dates are OK!


# Best Practices for Transaction Analysis

- Always check for missing, duplicated, and out-of-range data before analysis.
- Ensure reproducibility by setting random seeds and documenting steps.
- Summarize findings using groupby, value_counts, and visualizations.
- Time ordering is critical for sequence-based analytics such as time gaps and rolling averages, so sort by date before computing them.
- Keep code modular: write small functions for routine checks.


In [22]:
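# Best Practice: Function to summarize key quality checks for a transaction DataFrame
def transaction_qc(df):
    """Print basic data-quality checks; expects 'amount' and 'date' columns."""
    print('Missing:', df.isnull().sum().sum())
    # Counts rows that are exact duplicates across all columns
    print('Duplicates:', df.duplicated().sum())
    print('Negative Amounts:', (df['amount'] < 0).sum())
    # True when dates are not sorted in ascending order
    print('Out-of-order Dates:', not df['date'].is_monotonic_increasing)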

transaction_qc(df)

Missing: 0
Duplicates: 0
Negative Amounts: 7
Out-of-order Dates: False
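
Printing is convenient in a notebook, but returning the results lets the same checks drive assertions in an automated pipeline. The sketch below is one possible variant; the function name `transaction_qc_report` and the dictionary keys are hypothetical, and it checks duplicate `transaction_id` values rather than fully duplicated rows.

```python
import pandas as pd

def transaction_qc_report(frame):
    """Return quality-check results as a dict instead of printing them."""
    return {
        'missing': int(frame.isnull().sum().sum()),
        'duplicate_ids': int(frame['transaction_id'].duplicated().sum()),
        'negative_amounts': int((frame['amount'] < 0).sum()),
        'dates_out_of_order': not frame['date'].is_monotonic_increasing,
    }

# Made-up frame with one problem of each kind.
sample = pd.DataFrame({
    'transaction_id': [1, 2, 2, 4],
    'amount': [50.0, -10.0, 75.0, None],
    'date': pd.to_datetime(['2024-01-01', '2024-01-03', '2024-01-02', '2024-01-04']),
})

report = transaction_qc_report(sample)
print(report)
```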


# End-to-End Example: Find Customers with Sudden Spending Surges

- Let us walk through a practical use case: flagging customers whose recent spending is much higher than average.
- This scenario is common in fraud prevention and credit risk monitoring.
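- We will work customer by customer, comparing each transaction with that customer's rolling average spend.
- The steps are: sort by customer and date, compute a rolling mean over each customer's 10 most recent transactions (requiring at least 5, and including the current one), and flag any amount that is more than twice that mean.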


In [23]:
# End-to-End: Sort, compute rolling mean, and detect surges
df_sorted = df.sort_values(['customer_id', 'date'])
df_sorted['rolling_avg'] = (
    df_sorted.groupby('customer_id')['amount']
    .rolling(window=10, min_periods=5)
    .mean()
    .reset_index(level=0, drop=True)
)
df_sorted['spending_surge'] = df_sorted['amount'] > df_sorted['rolling_avg'] * 2
surge_cases = df_sorted[df_sorted['spending_surge']]
print('Customers with surges:', surge_cases['customer_id'].nunique())
print(surge_cases[['customer_id', 'amount', 'rolling_avg', 'date']].head())

Customers with surges: 4
    customer_id  amount  rolling_avg                date
658   CUST_0009  338.52      129.120 2024-01-28 10:00:00
866   CUST_0030  254.95      116.736 2024-02-06 02:00:00
282   CUST_0129  231.29      114.654 2024-01-12 18:00:00
811   CUST_0197  307.75      146.804 2024-02-03 19:00:00
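
The rolling mean used here includes the current transaction, so a large amount raises the very baseline it is compared against. A sketch of one possible variant shifts the amounts by one step before averaging, so that each transaction is compared only with the customer's earlier ones. The data is made up, and the window of 3 and the factor of 2 are illustrative choices.

```python
import pandas as pd

# Made-up transactions: customer A ends with a jump, customer B stays flat.
tx = pd.DataFrame({
    'customer_id': ['A'] * 5 + ['B'] * 5,
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'amount': [100.0, 110.0, 90.0, 100.0, 350.0, 200.0, 210.0, 190.0, 205.0, 220.0],
})
tx = tx.sort_values(['customer_id', 'date'])

grouped = tx.groupby('customer_id')['amount']

# Baseline that includes the current transaction (as in the cell above).
tx['inclusive_avg'] = grouped.transform(
    lambda s: s.rolling(window=3, min_periods=3).mean()
)
# Baseline built only from earlier transactions: shift(1) drops the current amount.
tx['prior_avg'] = grouped.transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=3).mean()
)

tx['surge_inclusive'] = tx['amount'] > 2 * tx['inclusive_avg']
tx['surge_prior'] = tx['amount'] > 2 * tx['prior_avg']

print(tx[['customer_id', 'amount', 'inclusive_avg', 'prior_avg',
          'surge_inclusive', 'surge_prior']])
```

In this toy data the 350 transaction is flagged only against the prior-only baseline (100), because the inclusive baseline is pulled up to 180 by the 350 itself.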
