# Introduction to SQL for Banking Analysts

* Banks hold their account, customer, and transaction records in tables, and SQL is the standard language for querying them.
* As an analyst, you will need to explore transactions, spot trends, and generate client reports.
* This lesson teaches SQL-style operations (filtering, joining, aggregating) in Python with pandas, applied to realistic banking problems.
* You will start with synthetic banking data and apply practical queries step by step.
* By the end, you will be able to answer business questions about accounts, customers, and transactions.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Example 1: Create synthetic banking transactions dataset
np.random.seed(42)
n_transactions = 1000
n_customers = 200
df = pd.DataFrame({
    'transaction_id': range(1, n_transactions + 1),
    'customer_id': np.random.choice([f'CUST_{i:04d}' for i in range(1, n_customers + 1)], n_transactions),
    'amount': np.round(np.random.normal(150, 60, n_transactions), 2),
    'transaction_type': np.random.choice(['Debit', 'Credit'], n_transactions),
    'channel': np.random.choice(['ATM', 'Online', 'Branch', 'POS'], n_transactions),
    'date': pd.date_range(start='2024-01-01', periods=n_transactions, freq='h')
})
print(df.shape)
print(df.head(3))

(1000, 6)
   transaction_id customer_id  amount transaction_type channel  \
0               1   CUST_0103  238.77           Credit     ATM   
1               2   CUST_0180  269.17           Credit     POS   
2               3   CUST_0093   58.62           Credit  Online   

                 date  
0 2024-01-01 00:00:00  
1 2024-01-01 01:00:00  
2 2024-01-01 02:00:00  


In [3]:
# Example 2: Create a simple customers table
customer_ids = [f'CUST_{i:04d}' for i in range(1, 201)]
customers = pd.DataFrame({
    'customer_id': customer_ids,
    'segment': ['Retail'] * 150 + ['Business'] * 50,
    'region': ['Metro'] * 100 + ['Regional'] * 100
})
print(customers.shape)
print(customers.head(3))

(200, 3)
  customer_id segment region
0   CUST_0001  Retail  Metro
1   CUST_0002  Retail  Metro
2   CUST_0003  Retail  Metro


In [4]:
# Example 3: Create a synthetic account types table
np.random.seed(42)
accounts = pd.DataFrame({
    'account_id': [f'ACC_{i:05d}' for i in range(1, 201)],
    'customer_id': customer_ids,
    'account_type': np.random.choice(['Savings', 'Cheque', 'Credit'], size=200),
    'open_date': pd.date_range(start='2015-01-01', periods=200, freq='30D')
})
print(accounts.shape)
print(accounts.head(3))

(200, 4)
  account_id customer_id account_type  open_date
0  ACC_00001   CUST_0001       Credit 2015-01-01
1  ACC_00002   CUST_0002      Savings 2015-01-31
2  ACC_00003   CUST_0003       Credit 2015-03-02


# Progressive SQL examples in Pandas

- We will learn to use SQL-like operations in Python with pandas.
- This means using `.query()` and `.loc` (SQL `WHERE`), `.groupby()` (SQL `GROUP BY`), and `.merge()` (SQL `JOIN`) to ask business questions.
- We start with simple queries and work up to joins and aggregations.
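
As a rough orientation before the examples, the sketch below pairs a few SQL statements with pandas counterparts. The five-row `txns` table and its values are hypothetical and exist only for this illustration; the SQL in the comments is the intended equivalent, not something the code executes.

```python
import pandas as pd

# Hypothetical five-row table, used only for this illustration.
txns = pd.DataFrame({
    'customer_id': ['C1', 'C1', 'C2', 'C2', 'C3'],
    'transaction_type': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit'],
    'amount': [100.0, 50.0, 30.0, 90.0, 20.0],
})

# SQL: SELECT * FROM txns WHERE transaction_type = 'Debit'
debits = txns.query("transaction_type == 'Debit'")

# SQL: SELECT customer_id, amount FROM txns WHERE amount > 60
large = txns.loc[txns['amount'] > 60, ['customer_id', 'amount']]

# SQL: SELECT customer_id, SUM(amount) AS total
#      FROM txns WHERE transaction_type = 'Debit'
#      GROUP BY customer_id ORDER BY total DESC
debit_totals = (
    debits.groupby('customer_id')['amount']
    .sum()
    .sort_values(ascending=False)
)
print(debit_totals)
```

Chaining `.groupby()`, `.sum()`, and `.sort_values()` mirrors the order in which the SQL clauses are evaluated, which can make it easier to translate between the two.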


In [5]:
# Beginner Example 1: Select all debit transactions
debits_only = df.query("transaction_type == 'Debit'")
print(debits_only.head(3))


   transaction_id customer_id  amount transaction_type channel  \
3               4   CUST_0015   81.85            Debit  Branch   
5               6   CUST_0072  200.38            Debit  Online   
7               8   CUST_0021   51.72            Debit     POS   

                 date  
3 2024-01-01 03:00:00  
5 2024-01-01 05:00:00  
7 2024-01-01 07:00:00  


In [6]:
# Beginner Example 2: Top 5 largest transactions
top_transactions = df.nlargest(5, 'amount')
print(top_transactions[['transaction_id', 'amount', 'customer_id']])

     transaction_id  amount customer_id
92               93  340.47   CUST_0111
658             659  338.52   CUST_0009
521             522  335.64   CUST_0085
535             536  326.77   CUST_0096
240             241  315.65   CUST_0098


In [7]:
# Beginner Example 3: Count of transactions by channel
channel_counts = df['channel'].value_counts()
print(channel_counts)

channel
ATM       266
POS       250
Branch    248
Online    236
Name: count, dtype: int64


In [8]:
# Intermediate Example 1: Monthly sum of debits vs credits
df['month'] = df['date'].dt.to_period('M')
monthly_sums = df.pivot_table(index='month', columns='transaction_type', values='amount', aggfunc='sum', fill_value=0)
print(monthly_sums)

transaction_type    Credit     Debit
month                               
2024-01           58569.65  55923.85
2024-02           17150.99  20057.59


In [9]:
# Intermediate Example 2: Join transactions to customers for region analysis
region_merged = df.merge(customers, on='customer_id', how='left')
print(region_merged[['customer_id', 'region', 'amount']].head(3))

  customer_id    region  amount
0   CUST_0103  Regional  238.77
1   CUST_0180  Regional  269.17
2   CUST_0093     Metro   58.62
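
To connect the `merge` call to the SQL it imitates, here is a minimal sketch that runs a real `LEFT JOIN` in an in-memory SQLite database and compares it with the pandas result. The tiny `txns` and `custs` tables are hypothetical stand-ins for the lesson's data, and SQLite is an assumption chosen only because it ships with Python.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature tables (illustration only).
txns = pd.DataFrame({'customer_id': ['C1', 'C2', 'C3'],
                     'amount': [10.0, 20.0, 30.0]})
custs = pd.DataFrame({'customer_id': ['C1', 'C2'],
                      'region': ['Metro', 'Regional']})

conn = sqlite3.connect(':memory:')
try:
    # Load both frames into SQLite so we can query them with SQL.
    txns.to_sql('txns', conn, index=False)
    custs.to_sql('custs', conn, index=False)
    sql_result = pd.read_sql_query(
        """
        SELECT t.customer_id, c.region, t.amount
        FROM txns AS t
        LEFT JOIN custs AS c ON t.customer_id = c.customer_id
        ORDER BY t.customer_id
        """,
        conn,
    )
finally:
    conn.close()

# The pandas counterpart of the LEFT JOIN above.
pandas_result = txns.merge(custs, on='customer_id', how='left')

print(sql_result)
print(pandas_result[['customer_id', 'region', 'amount']])
```

In both results the customer without a match (`C3`) keeps its transaction row and gets a missing `region`, which is the defining behaviour of a left join.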


In [10]:
# Intermediate Example 3: Average transaction per segment
segment_merged = df.merge(customers, on='customer_id', how='left')
avg_by_segment = segment_merged.groupby('segment')['amount'].mean()
print(avg_by_segment)

segment
Business    147.577040
Retail      153.077093
Name: amount, dtype: float64


In [11]:
# Advanced Example 1: Number of unique customers per channel per month
unique_per_channel = df.groupby(['month', 'channel'])['customer_id'].nunique().unstack(fill_value=0)
print(unique_per_channel)

channel  ATM  Branch  Online  POS
month                            
2024-01  124     116     114  116
2024-02   54      55      49   56


In [12]:
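# Advanced Example 2: Top 3 account types by total credit volume
# Note: 'Credit' is both a transaction type and an account type here; the filter below uses transaction_type.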
merged_accounts = df.merge(accounts, on='customer_id', how='left')
credit_vol = merged_accounts[merged_accounts['transaction_type'] == 'Credit']
top_account_types = credit_vol.groupby('account_type')['amount'].sum().nlargest(3)
print(top_account_types)

account_type
Savings    27934.60
Credit     26075.35
Cheque     21710.69
Name: amount, dtype: float64


In [13]:
# Advanced Example 3: Running balance for a sample account
sample_account = merged_accounts['account_id'].iloc[0]
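# .copy() gives an independent frame, so the column assignments below do not write into a slice of merged_accounts
sample_txns = merged_accounts[merged_accounts['account_id'] == sample_account].sort_values('date').copy()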
sample_txns['signed_amount'] = sample_txns['amount'] * sample_txns['transaction_type'].map({'Debit': -1, 'Credit': 1})
sample_txns['balance'] = sample_txns['signed_amount'].cumsum()
print(sample_txns[['date', 'amount', 'transaction_type', 'balance']].head(10))

                   date  amount transaction_type  balance
0   2024-01-01 00:00:00  238.77           Credit   238.77
8   2024-01-01 08:00:00  179.79           Credit   418.56
140 2024-01-06 20:00:00  109.65           Credit   528.21
188 2024-01-08 20:00:00   48.40           Credit   576.61
455 2024-01-19 23:00:00  206.23           Credit   782.84
859 2024-02-05 19:00:00  145.51            Debit   637.33
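
The same running-balance idea extends from one account to every account at once, which is what a SQL window function would do. The sketch below assumes a hypothetical five-row table with two accounts; the column names follow the lesson, but the values are made up.

```python
import pandas as pd

# Hypothetical transactions for two accounts (illustration only).
txns = pd.DataFrame({
    'account_id': ['A1', 'A2', 'A1', 'A2', 'A1'],
    'date': pd.to_datetime(['2024-01-03', '2024-01-01', '2024-01-01',
                            '2024-01-02', '2024-01-02']),
    'amount': [40.0, 200.0, 100.0, 50.0, 30.0],
    'transaction_type': ['Debit', 'Credit', 'Credit', 'Debit', 'Credit'],
})

# Debits reduce the balance, credits increase it.
sign = txns['transaction_type'].map({'Debit': -1, 'Credit': 1})
txns['signed_amount'] = txns['amount'] * sign

# SQL: SUM(signed_amount) OVER (PARTITION BY account_id ORDER BY date)
txns = txns.sort_values(['account_id', 'date'])
txns['balance'] = txns.groupby('account_id')['signed_amount'].cumsum()
print(txns[['account_id', 'date', 'signed_amount', 'balance']])
```

Sorting before the grouped `cumsum()` plays the role of `ORDER BY` inside the window, and the `groupby` key plays the role of `PARTITION BY`.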


In [14]:
# Error Handling Example: Bad column name in query
try:
    bad_query = df.query("transaction_typ == 'Debit'")
except Exception as e:
    print(f'Error: {e}')

Error: name 'transaction_typ' is not defined
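
The message above does not say which columns are actually available. One possible approach is to check column names before querying; the helper `require_columns` below is hypothetical (it is not part of pandas), and the two-row `txns` frame exists only for this sketch.

```python
import pandas as pd

def require_columns(frame, columns):
    """Raise a ValueError naming any expected columns missing from frame."""
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise ValueError(
            f"Missing columns: {missing}; available: {list(frame.columns)}"
        )

# Hypothetical two-row table (illustration only).
txns = pd.DataFrame({'transaction_type': ['Debit', 'Credit'],
                     'amount': [10.0, 20.0]})

message = ''
try:
    # The misspelled name is caught before the query runs.
    require_columns(txns, ['transaction_typ'])
    result = txns.query("transaction_typ == 'Debit'")
except ValueError as err:
    message = str(err)
    print(f'Error: {message}')
```

Listing the available columns in the message makes a misspelling such as `transaction_typ` easy to spot.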


In [15]:
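# Error Handling Example: Mismatched join keys
# pandas does not raise here: both key columns are strings, so the merge
# succeeds silently and no transaction finds a matching account.
bad_merge = df.merge(accounts, left_on='customer_id', right_on='account_id', how='left')
print(f"Rows with a matched account: {bad_merge['account_id'].notna().sum()} of {len(bad_merge)}")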
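
Because a merge on mismatched keys raises no exception, it helps to make the mismatch visible. The sketch below uses two documented `merge` parameters, `indicator` and `validate`, on hypothetical miniature tables; the table contents are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature tables (illustration only).
txns = pd.DataFrame({'customer_id': ['CUST_0001', 'CUST_0002', 'CUST_0002']})
accts = pd.DataFrame({
    'account_id': ['ACC_00001', 'ACC_00002'],
    'customer_id': ['CUST_0001', 'CUST_0002'],
})

# Wrong keys: no exception, but indicator=True shows that no row found a partner.
wrong = txns.merge(accts, left_on='customer_id', right_on='account_id',
                   how='left', indicator=True)
print(wrong['_merge'].value_counts())

# Right key; validate= raises MergeError if a customer has more than one account row.
right = txns.merge(accts, on='customer_id', how='left',
                   validate='many_to_one', indicator=True)
unmatched = (right['_merge'] == 'left_only').sum()
print(f'Unmatched rows: {unmatched}')
```

Counting `left_only` rows after every left join is a cheap check that catches both wrong key columns and genuinely missing reference data.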

In [16]:
# Best Practice: Always inspect your data before analysis
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   transaction_id    1000 non-null   int64         
 1   customer_id       1000 non-null   object        
 2   amount            1000 non-null   float64       
 3   transaction_type  1000 non-null   object        
 4   channel           1000 non-null   object        
 5   date              1000 non-null   datetime64[ns]
 6   month             1000 non-null   period[M]     
dtypes: datetime64[ns](1), float64(1), int64(1), object(3), period[M](1)
memory usage: 54.8+ KB


       transaction_id      amount                           date
count     1000.000000  1000.00000                           1000
mean       500.500000   151.70208  2024-01-21 19:29:59.999999744
min          1.000000   -64.09000            2024-01-01 00:00:00
25%        250.750000   111.37250            2024-01-11 09:45:00
50%        500.500000   153.53500            2024-01-21 19:30:00
75%        750.250000   195.16250            2024-02-01 05:15:00
max       1000.000000   340.47000            2024-02-11 15:00:00
std        288.819436    63.02240                            NaN


In [17]:
# End-to-end mini-case: Flag customers with monthly debit total > $10,000

monthly_cust_debit = df[df['transaction_type'] == 'Debit'].groupby(['customer_id', 'month'])['amount'].sum().reset_index()
flagged = monthly_cust_debit[monthly_cust_debit['amount'] > 10000]
flagged_customers = flagged.merge(customers, on='customer_id', how='left')
print(flagged_customers[['customer_id', 'month', 'amount', 'segment', 'region']])

Empty DataFrame
Columns: [customer_id, month, amount, segment, region]
Index: []
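
The empty result is expected for this synthetic data: amounts average about $152 and the 1,000 transactions are spread over 200 customers, so no monthly debit total comes close to $10,000. To see the flagging logic return rows, the sketch below wraps it in a hypothetical helper, `flag_high_debit`, and runs it on made-up data in which one customer does cross the threshold.

```python
import pandas as pd

def flag_high_debit(txns, threshold):
    """Return customer-month debit totals strictly above threshold."""
    debits = txns[txns['transaction_type'] == 'Debit']
    totals = debits.groupby(['customer_id', 'month'], as_index=False)['amount'].sum()
    return totals[totals['amount'] > threshold].reset_index(drop=True)

# Hypothetical data (illustration only): C1 has 11,000 in January debits.
demo = pd.DataFrame({
    'customer_id': ['C1', 'C1', 'C2', 'C2'],
    'transaction_type': ['Debit', 'Debit', 'Debit', 'Credit'],
    'amount': [6000.0, 5000.0, 4000.0, 9000.0],
    'date': pd.to_datetime(['2024-01-05', '2024-01-20',
                            '2024-01-10', '2024-01-11']),
})
demo['month'] = demo['date'].dt.to_period('M')

flagged_demo = flag_high_debit(demo, threshold=10_000)
print(flagged_demo)
```

Keeping the threshold as a parameter makes it straightforward to try lower values against the lesson's `df`, where a figure of a few hundred dollars is more likely to return rows.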
