# Handling Missing and Dirty Bank Data in Python

* In real-world banking, data issues like missing or incorrect values are common.
* Identifying and fixing dirty data is vital for accurate reporting and risk assessment.
* This lesson covers how to find, clean, and handle missing or dirty data using pandas.
* You will practice hands-on techniques for cleaning up messy transaction and customer data.
* By the end, you will be able to prepare banking datasets for analysis or machine learning.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Core Concepts: Understanding Banking Data Problems

- Banking data includes transactions, customer details, and account records.
- Data can become dirty through missing fields, typographical mistakes, or inconsistent entries.
- Beginners often forget to check for missing data, assume data is always clean, or use the wrong fill values.
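
As a quick way to avoid assuming that data is clean, a per-column quality profile can be computed before any analysis. The following is a minimal sketch; the helper name `profile_quality` and the toy `sample` table are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd


def profile_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing and distinct values for every column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
        "n_unique": frame.nunique(dropna=True),
    })


# Hypothetical toy data: one missing ID, one missing amount, one casing variant
sample = pd.DataFrame({
    "customer_id": ["CUST_0001", None, "CUST_0003", "CUST_0004"],
    "amount": [120.50, 80.00, np.nan, 42.10],
    "channel": ["ATM", "Online", "online", "POS"],
})

report = profile_quality(sample)
print(report)
```

A categorical column with more distinct values than expected (four here, although only three channels are really present) is often the first hint of inconsistent entries.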

In [2]:
# Example 1: Create synthetic banking transactions dataset
np.random.seed(42)
n_transactions = 1000
n_customers = 200
df = pd.DataFrame({
    'transaction_id': range(1, n_transactions + 1),
    'customer_id': np.random.choice([f'CUST_{i:04d}' for i in range(1, n_customers + 1)], n_transactions),
    'amount': np.round(np.random.normal(150, 60, n_transactions), 2),
    'transaction_type': np.random.choice(['Debit', 'Credit'], n_transactions),
    'channel': np.random.choice(['ATM', 'Online', 'Branch', 'POS'], n_transactions),
    'date': pd.date_range(start='2024-01-01', periods=n_transactions, freq='h')
})
print(df.shape)
print(df.head(3))

(1000, 6)
   transaction_id customer_id  amount transaction_type channel  \
0               1   CUST_0103  238.77           Credit     ATM   
1               2   CUST_0180  269.17           Credit     POS   
2               3   CUST_0093   58.62           Credit  Online   

                 date  
0 2024-01-01 00:00:00  
1 2024-01-01 01:00:00  
2 2024-01-01 02:00:00  


In [3]:
# Last 3 records in the dataset
print(df.tail(3))

     transaction_id customer_id  amount transaction_type channel  \
997             998   CUST_0034  150.35            Debit  Online   
998             999   CUST_0111  243.82            Debit  Branch   
999            1000   CUST_0008   74.99            Debit  Branch   

                   date  
997 2024-02-11 13:00:00  
998 2024-02-11 14:00:00  
999 2024-02-11 15:00:00  


In [4]:
# Example 2: Create a simple customers table
customer_ids = [f'CUST_{i:04d}' for i in range(1, 201)]
customers = pd.DataFrame({
    'customer_id': customer_ids,
    'segment': ['Retail'] * 150 + ['Business'] * 50,
    'region': ['Metro'] * 100 + ['Regional'] * 100
})
print(customers.shape)
print(customers.head(3))

(200, 3)
  customer_id segment region
0   CUST_0001  Retail  Metro
1   CUST_0002  Retail  Metro
2   CUST_0003  Retail  Metro


In [5]:
# Example 3: Create a synthetic account types table
np.random.seed(42)
accounts = pd.DataFrame({
    'account_id': [f'ACC_{i:05d}' for i in range(1, 201)],
    'customer_id': customer_ids,
    'account_type': np.random.choice(['Savings', 'Cheque', 'Credit'], size=200),
    'open_date': pd.date_range(start='2015-01-01', periods=200, freq='30D')
})
print(accounts.shape)
print(accounts.head(3))

(200, 4)
  account_id customer_id account_type  open_date
0  ACC_00001   CUST_0001       Credit 2015-01-01
1  ACC_00002   CUST_0002      Savings 2015-01-31
2  ACC_00003   CUST_0003       Credit 2015-03-02


In [6]:
# Beginner: Checking for missing values in a DataFrame
missing = df.isnull().sum()
print(missing)

transaction_id      0
customer_id         0
amount              0
transaction_type    0
channel             0
date                0
dtype: int64
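
A zero count from `isnull()` only covers values pandas already treats as missing. Real extracts sometimes encode gaps as empty strings or text such as `N/A`. The sketch below assumes a hypothetical list of placeholder strings (`PLACEHOLDERS`) and a toy table; the actual placeholders depend on the source system.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where missing values are disguised as text
sample = pd.DataFrame({
    "customer_id": ["CUST_0001", "", "N/A", "CUST_0004"],
    "channel": ["ATM", "Online", "  ", "POS"],
})

# Assumed placeholder strings; adjust to whatever the source system emits
PLACEHOLDERS = ["", "N/A", "NA", "null", "-"]

before = sample.isna().sum()

# Strip surrounding whitespace, then convert placeholders to real NaN
stripped = sample.apply(lambda col: col.str.strip())
normalized = stripped.replace(PLACEHOLDERS, np.nan)

after = normalized.isna().sum()
print("Missing before:", before.to_dict())
print("Missing after: ", after.to_dict())
```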


In [7]:
# Beginner: Making data dirty intentionally to practice cleaning
df.loc[3, 'amount'] = np.nan  # Remove amount from row 4 (index starts at 0)
df.loc[5, 'customer_id'] = None  # Remove customer_id from row 6
df.loc[10, 'channel'] = 'online'  # Introduce a typo for channel
print(df.head(12))

    transaction_id customer_id  amount transaction_type channel  \
0                1   CUST_0103  238.77           Credit     ATM   
1                2   CUST_0180  269.17           Credit     POS   
2                3   CUST_0093   58.62           Credit  Online   
3                4   CUST_0015     NaN            Debit  Branch   
4                5   CUST_0107  163.56           Credit  Online   
5                6        None  200.38            Debit  Online   
6                7   CUST_0189  149.33           Credit  Branch   
7                8   CUST_0021   51.72            Debit     POS   
8                9   CUST_0103  179.79           Credit  Branch   
9               10   CUST_0122  138.34           Credit  Online   
10              11   CUST_0075  137.63            Debit  online   
11              12   CUST_0088   13.32            Debit  Online   

                  date  
0  2024-01-01 00:00:00  
1  2024-01-01 01:00:00  
2  2024-01-01 02:00:00  
3  2024-01-01 03:00:00  
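4  2024-01-01 04:00:00  
5  2024-01-01 05:00:00  
6  2024-01-01 06:00:00  
7  2024-01-01 07:00:00  
8  2024-01-01 08:00:00  
9  2024-01-01 09:00:00  
10 2024-01-01 10:00:00  
11 2024-01-01 11:00:00  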

In [8]:
# Beginner: Identify and count dirty values by category
valid_channels = ['ATM', 'Online', 'Branch', 'POS']
print('Rows with missing amount:', df['amount'].isnull().sum())
print('Rows with missing customer_id:', df['customer_id'].isnull().sum())
# Count channel values outside the expected set; comparing str.lower() to 'online'
# would also count every correctly spelled 'Online' row
print('Rows with obvious typos in channel:', (~df['channel'].isin(valid_channels)).sum())

Rows with missing amount: 1
Rows with missing customer_id: 1
Rows with obvious typos in channel: 1


In [9]:
# Beginner: Fill missing numeric amounts with the mean
mean_amount = df['amount'].mean()  # mean() skips NaN by default
df['amount'] = df['amount'].fillna(mean_amount)  # assign back instead of chained inplace=True
print(df.head(6))

   transaction_id customer_id      amount transaction_type channel  \
0               1   CUST_0103  238.770000           Credit     ATM   
1               2   CUST_0180  269.170000           Credit     POS   
2               3   CUST_0093   58.620000           Credit  Online   
3               4   CUST_0015  151.772002            Debit  Branch   
4               5   CUST_0107  163.560000           Credit  Online   
5               6        None  200.380000            Debit  Online   

                 date  
0 2024-01-01 00:00:00  
1 2024-01-01 01:00:00  
2 2024-01-01 02:00:00  
3 2024-01-01 03:00:00  
4 2024-01-01 04:00:00  
5 2024-01-01 05:00:00  
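
A single global mean can be a poor fill value when amounts differ by channel or when the distribution is skewed. One alternative is a group-wise median, sketched below on hypothetical toy data. The column names `amount_filled` and `amount_was_missing` are illustrative, and whether imputation is appropriate at all for monetary amounts is a business decision rather than a technical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing amount per channel
tx = pd.DataFrame({
    "channel": ["ATM", "ATM", "ATM", "Online", "Online", "Online"],
    "amount": [100.0, np.nan, 300.0, 20.0, 40.0, np.nan],
})

# Median of the non-missing amounts within each channel, aligned to every row
channel_median = tx.groupby("channel")["amount"].transform("median")

# Flag imputed rows first so they can be audited or excluded later
tx["amount_was_missing"] = tx["amount"].isna()
tx["amount_filled"] = tx["amount"].fillna(channel_median)
print(tx)
```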


In [10]:
# Beginner: Fill missing customer_id with a placeholder
df['customer_id'] = df['customer_id'].fillna('UNKNOWN')  # assign back instead of chained inplace=True
print(df.loc[5, ['customer_id']])

customer_id    UNKNOWN
Name: 5, dtype: object


In [11]:
# Intermediate: Standardize text values in 'channel' to title case
# Note: str.title() also rewrites acronyms, so 'ATM' becomes 'Atm' and 'POS' becomes 'Pos'
df['channel'] = df['channel'].str.title()
print(df['channel'].unique())

['Atm' 'Pos' 'Online' 'Branch']
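
The output above shows that `str.title()` fixes `online` but also turns `ATM` and `POS` into `Atm` and `Pos`. If the original spellings should be preserved, one option is to match on a lowercased key and map to canonical labels. This is a sketch only; the `CANONICAL_CHANNELS` dictionary and the sample values (including the unmatched `Kiosk`) are assumptions for illustration.

```python
import pandas as pd

# Assumed canonical spellings, keyed by a lowercased, stripped version
CANONICAL_CHANNELS = {
    "atm": "ATM",
    "online": "Online",
    "branch": "Branch",
    "pos": "POS",
}

# Hypothetical raw values with casing and whitespace problems
raw = pd.Series(["ATM", "online", "bRanch", " POS ", "Kiosk"], name="channel")

key = raw.str.strip().str.lower()
normalized = key.map(CANONICAL_CHANNELS)  # values without a match become NaN

# Surface unmatched values for review instead of silently keeping them
unmatched = raw[normalized.isna()]
print(normalized.tolist())
print("Needs review:", unmatched.tolist())
```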


In [12]:
# Intermediate: Drop rows where a critical identifier is missing
df_before = df.shape[0]
df = df[df['customer_id'] != 'UNKNOWN'].copy()  # copy so later .loc assignments act on an independent frame
df_after = df.shape[0]
print('Rows before:', df_before, 'Rows after dropping missing customer_id:', df_after)

Rows before: 1000 Rows after dropping missing customer_id: 999


In [13]:
# Intermediate: Cap outlier amounts at a domain-rule threshold (1000 here)
# No amount in this synthetic data exceeds 1000, so the selection below is empty
outlier_idx = df[df['amount'] > 1000].index
df.loc[outlier_idx, 'amount'] = 1000
print(df.loc[outlier_idx, ['amount']])

Empty DataFrame
Columns: [amount]
Index: []
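
When no fixed business threshold is available, a data-driven cutoff is sometimes used instead. The sketch below swaps the fixed limit of 1000 for the common interquartile-range (IQR) rule and caps values with `Series.clip`. The sample amounts are hypothetical, and the 1.5 multiplier is a convention, not a banking rule.

```python
import pandas as pd

# Hypothetical amounts with one extreme value
amounts = pd.Series([120.0, 95.0, 150.0, 130.0, 110.0, 5000.0], name="amount")

# Upper fence from the IQR rule: Q3 + 1.5 * (Q3 - Q1)
q1, q3 = amounts.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Flag first, then cap, so the original extremes remain traceable
is_outlier = amounts > upper_fence
capped = amounts.clip(upper=upper_fence)

print("Upper fence:", upper_fence)
print("Flagged rows:", is_outlier.sum())
print(capped.tolist())
```

Keeping the `is_outlier` flag alongside the capped values makes it possible to review large transactions separately, which matters when an extreme amount is genuine rather than an error.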


In [14]:
# Intermediate: Remove obvious duplicate transactions
duplicates = df.duplicated(subset=['customer_id', 'amount', 'date'])
df = df[~duplicates]
print('Remaining rows after removing duplicates:', df.shape[0])

Remaining rows after removing duplicates: 999
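
Before dropping duplicates it can help to look at every row in each duplicate group, since two identical-looking transactions may both be legitimate. A minimal sketch on hypothetical data, using the same key columns as the cell above:

```python
import pandas as pd

# Hypothetical transactions; IDs 1 and 2 share customer, amount, and timestamp
tx = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "customer_id": ["CUST_0001", "CUST_0001", "CUST_0002", "CUST_0001"],
    "amount": [50.0, 50.0, 75.0, 50.0],
    "date": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:00",
        "2024-01-01 09:00", "2024-01-02 09:00",
    ]),
})
key_cols = ["customer_id", "amount", "date"]

# keep=False marks every member of a duplicate group, not only the later copies
suspects = tx[tx.duplicated(subset=key_cols, keep=False)]
print(suspects)

# After review, keep the first occurrence of each group
deduped = tx.drop_duplicates(subset=key_cols, keep="first")
print("Rows kept:", len(deduped))
```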


In [15]:
# Advanced: Merge transactions with customer profiles for richer analysis
merged = pd.merge(df, customers, on='customer_id', how='left')
print(merged.head(3))

   transaction_id customer_id  amount transaction_type channel  \
0               1   CUST_0103  238.77           Credit     Atm   
1               2   CUST_0180  269.17           Credit     Pos   
2               3   CUST_0093   58.62           Credit  Online   

                 date   segment    region  
0 2024-01-01 00:00:00    Retail  Regional  
1 2024-01-01 01:00:00  Business  Regional  
2 2024-01-01 02:00:00    Retail     Metro  


In [16]:
# Advanced: Identify transactions with no matching customer info
missing_cust = merged[merged['segment'].isnull()]
print('Transactions with unknown customer details:')
print(missing_cust)

Transactions with unknown customer details:
Empty DataFrame
Columns: [transaction_id, customer_id, amount, transaction_type, channel, date, segment, region]
Index: []
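
Checking a joined column such as `segment` for nulls works, but it cannot distinguish "no matching customer" from "customer exists with a blank segment". The `indicator` and `validate` parameters of `merge` make both checks explicit. The sketch below uses hypothetical tables, including an ID (`CUST_9999`) that deliberately has no profile.

```python
import pandas as pd

# Hypothetical transactions; CUST_9999 has no customer profile
tx = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "customer_id": ["CUST_0001", "CUST_0002", "CUST_9999"],
})
profiles = pd.DataFrame({
    "customer_id": ["CUST_0001", "CUST_0002"],
    "segment": ["Retail", "Business"],
})

# validate raises MergeError if customer_id is not unique in profiles;
# indicator adds a _merge column recording where each row was found
checked = tx.merge(
    profiles,
    on="customer_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

orphans = checked[checked["_merge"] == "left_only"]
print(orphans[["transaction_id", "customer_id"]])
```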


In [17]:
# Advanced: Impute missing categories with the most frequent value (mode)
mode_channel = merged['channel'].mode()[0]
merged['channel'] = merged['channel'].fillna(mode_channel)  # assign back instead of chained inplace=True
print('Imputed missing channels with:', mode_channel)

Imputed missing channels with: Atm


In [18]:
# Error Handling: Try-except for risky cleaning operations
try:
    merged['amount'] = merged['amount'].astype(float)
    print('Conversion to float successful.')
except (ValueError, TypeError) as e:
    print('Failed to convert amount:', str(e))

Conversion to float successful.
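
A `try`/`except` around `astype(float)` is all-or-nothing: one bad value fails the whole column. When amounts arrive as text, `pd.to_numeric` with `errors="coerce"` converts what it can and marks the rest as `NaN`, so the failures can be inspected. The raw strings below are hypothetical, and removing commas assumes they are thousands separators.

```python
import pandas as pd

# Hypothetical amounts as they might arrive from a CSV export
raw_amounts = pd.Series(["120.50", "1,250.00", "abc", None, "75"], name="amount")

# Assumption: commas are thousands separators and can be removed
cleaned_text = raw_amounts.str.replace(",", "", regex=False)

# Unparseable values become NaN instead of raising
amounts = pd.to_numeric(cleaned_text, errors="coerce")

# Values that were present but could not be parsed
failed = raw_amounts[amounts.isna() & raw_amounts.notna()]
print(amounts.tolist())
print("Could not parse:", failed.tolist())
```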


In [19]:
# Error Handling: Custom function to log and drop invalid data
def drop_invalid(df, col):
    n_invalid = df[col].isnull().sum()
    print(f'Dropping {n_invalid} rows with missing {col}')
    return df[df[col].notnull()]

merged = drop_invalid(merged, 'customer_id')


Dropping 0 rows with missing customer_id


In [20]:
# Best Practice: Write a data cleaning pipeline for repeatability
def clean_transactions(df):
    df = df.copy()
    df['amount'] = df['amount'].fillna(df['amount'].median())
    df['customer_id'] = df['customer_id'].fillna('UNKNOWN')
    df['channel'] = df['channel'].str.title()
    df = df[df['customer_id'] != 'UNKNOWN']
    return df

df_clean = clean_transactions(df)
print(df_clean.head(3))

   transaction_id customer_id  amount transaction_type channel  \
0               1   CUST_0103  238.77           Credit     Atm   
1               2   CUST_0180  269.17           Credit     Pos   
2               3   CUST_0093   58.62           Credit  Online   

                 date  
0 2024-01-01 00:00:00  
1 2024-01-01 01:00:00  
2 2024-01-01 02:00:00  
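
A pipeline is easier to trust when each step is a small function and the result is checked before it is handed on. The sketch below chains steps with `DataFrame.pipe` and ends with a validation step. The function names (`fill_amount`, `drop_missing_customer`, `check_clean`) are hypothetical and are not part of the notebook's `clean_transactions`.

```python
import numpy as np
import pandas as pd


def fill_amount(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing amounts with the median of the observed amounts."""
    return frame.assign(amount=frame["amount"].fillna(frame["amount"].median()))


def drop_missing_customer(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that have no customer identifier."""
    return frame.dropna(subset=["customer_id"])


def check_clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Raise if the cleaned frame still violates basic expectations."""
    if frame["amount"].isna().any():
        raise ValueError("amount still contains missing values")
    if frame["customer_id"].isna().any():
        raise ValueError("customer_id still contains missing values")
    return frame


# Hypothetical dirty input
dirty = pd.DataFrame({
    "customer_id": ["CUST_0001", None, "CUST_0003"],
    "amount": [100.0, 50.0, np.nan],
})

result = dirty.pipe(fill_amount).pipe(drop_missing_customer).pipe(check_clean)
print(result)
```

Because no step modifies its input in place, `dirty` stays unchanged and the pipeline can be rerun safely.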


# End-to-End Example: Clean Dirty Bank Data and Report

- Now let us dirty, clean, and summarize transaction data in one workflow.
- You will see how the pieces come together with real business value.
- The final result will be a quick summary report ready for decision making.


In [21]:
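# Simulate dirty data again for end-to-end test
# Note: clean_transactions() does not check amount ranges, so the -999 set below
# stays in the data and lowers the mean of its channel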
df2 = df.copy()
df2.loc[15, 'amount'] = -999
df2.loc[20, 'customer_id'] = None
df2.loc[30, 'channel'] = 'bRanch'
cleaned = clean_transactions(df2)
summary = cleaned.groupby('channel')['amount'].agg(['count','mean']).reset_index()
print(summary)

  channel  count        mean
0     Atm    266  156.043534
1  Branch    248  147.395734
2  Online    234  150.937393
3     Pos    250  147.687480
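
The summary above still includes the `-999` amount, because `clean_transactions` only fills `NaN` values. If such a value is a "no data" code rather than a real transaction, it can be converted to `NaN` before aggregating. The sketch below assumes `-999` is a sentinel (the `SENTINELS` list is hypothetical) and compares channel means with and without that step on toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; -999 is assumed to be a "no data" code
tx = pd.DataFrame({
    "channel": ["ATM", "ATM", "Online", "Online"],
    "amount": [100.0, -999.0, 40.0, 60.0],
})

SENTINELS = [-999]  # assumed sentinel codes; confirm with the data owner

# Convert sentinel codes to NaN so aggregations skip them
tx["amount_clean"] = tx["amount"].replace(SENTINELS, np.nan)

naive_mean = tx.groupby("channel")["amount"].mean()
clean_mean = tx.groupby("channel")["amount_clean"].mean()
print(pd.DataFrame({"naive_mean": naive_mean, "clean_mean": clean_mean}))
```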


# Recap and Next Steps

- You learned techniques to detect and clean missing or dirty data.
- You practiced on synthetic bank transaction tables.
- You explored outlier handling, merging, and safe error handling.
- You can now build your own cleaning pipelines for messy banking datasets.