# Loading and Exploring Bank Datasets

* In this lesson, we will learn how to load and explore real banking datasets with Python.
* Bank datasets are crucial for understanding customer behavior, detecting fraud, and managing risk.
* You will build skills to read datasets, understand their structure, spot data quality problems, and prepare the data for analysis.
* We will use practical examples you could face in a real bank data science job.

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# Understanding Banking Data Structures

- Banking datasets are usually tables with rows for transactions, accounts, or customers.
- Columns describe each row, like amount, type, customer ID, or date.
- Data might be synthetic or real, but structure is similar in practice.
- Beginners often misread what a row represents or use the wrong data types (for example, storing dates as strings).
- Missing or invalid values, duplicate rows, and columns with mixed types are common data quality problems.
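
The type pitfalls listed above can be reproduced on a tiny table. This is a minimal sketch: the `raw` frame and its values are made up for illustration and are not part of the lesson's datasets.

```python
import pandas as pd

# Hypothetical raw extract: every column arrives as text, as often happens with CSV exports
raw = pd.DataFrame({
    'customer_id': ['CUST_0001', 'CUST_0002', 'CUST_0003'],
    'amount': ['120.50', 'n/a', '75'],  # mixed: numeric text and a placeholder string
    'date': ['2024-01-03', '2024-01-01', '2024-01-02'],
})

# Before conversion, neither 'amount' nor 'date' is usable for maths or date logic
print(raw.dtypes)

# Convert explicitly; errors='coerce' turns unparseable values into NaN / NaT
clean = raw.copy()
clean['amount'] = pd.to_numeric(clean['amount'], errors='coerce')
clean['date'] = pd.to_datetime(clean['date'], format='%Y-%m-%d', errors='coerce')

print(clean.dtypes)
print('Unparseable amounts:', clean['amount'].isna().sum())
```

Coercing silently hides bad values, so it is worth counting the resulting missing entries, as the last line does, before moving on.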

In [2]:
# Example 1: Create synthetic banking transactions dataset
np.random.seed(42)
n_transactions = 1000
n_customers = 200
df = pd.DataFrame({
    'transaction_id': range(1, n_transactions + 1),
    'customer_id': np.random.choice([f'CUST_{i:04d}' for i in range(1, n_customers + 1)], n_transactions),
    'amount': np.round(np.random.normal(150, 60, n_transactions), 2),
    'transaction_type': np.random.choice(['Debit', 'Credit'], n_transactions),
    'channel': np.random.choice(['ATM', 'Online', 'Branch', 'POS'], n_transactions),
    'date': pd.date_range(start='2024-01-01', periods=n_transactions, freq='h')
})
print(df.shape)
print(df.head(3))

(1000, 6)
   transaction_id customer_id  amount transaction_type channel  \
0               1   CUST_0103  238.77           Credit     ATM   
1               2   CUST_0180  269.17           Credit     POS   
2               3   CUST_0093   58.62           Credit  Online   

                 date  
0 2024-01-01 00:00:00  
1 2024-01-01 01:00:00  
2 2024-01-01 02:00:00  


In [3]:
# Example 2: Create a simple customers table
customer_ids = [f'CUST_{i:04d}' for i in range(1, 201)]
customers = pd.DataFrame({
    'customer_id': customer_ids,
    'segment': ['Retail'] * 150 + ['Business'] * 50,
    'region': ['Metro'] * 100 + ['Regional'] * 100
})
print(customers.shape)
print(customers.head(3))

(200, 3)
  customer_id segment region
0   CUST_0001  Retail  Metro
1   CUST_0002  Retail  Metro
2   CUST_0003  Retail  Metro


In [4]:
# Example 3: Create a synthetic account types table
np.random.seed(42)
accounts = pd.DataFrame({
    'account_id': [f'ACC_{i:05d}' for i in range(1, 201)],
    'customer_id': customer_ids,
    'account_type': np.random.choice(['Savings', 'Cheque', 'Credit'], size=200),
    'open_date': pd.date_range(start='2015-01-01', periods=200, freq='30D')
})
print(accounts.shape)
print(accounts.head(3))

(200, 4)
  account_id customer_id account_type  open_date
0  ACC_00001   CUST_0001       Credit 2015-01-01
1  ACC_00002   CUST_0002      Savings 2015-01-31
2  ACC_00003   CUST_0003       Credit 2015-03-02


In [5]:
# Example 4: Load the German Credit Risk dataset from web
# german.data has 20 attribute columns plus the class label (1 = good, 2 = bad)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'
cols = [f'feature_{i}' for i in range(1, 21)] + ['credit_status']
credit_df = pd.read_csv(url, sep=' ', names=cols)
credit_df['default_flag'] = (credit_df['credit_status'] == 2).astype(int)
print(credit_df.shape)
print(credit_df[['credit_status', 'default_flag']].head(3))

(1000, 22)
   credit_status  default_flag
0              1             0
1              2             1
2              1             0


In [6]:
# How many records are labeled as default
print('Number of default records:', credit_df['default_flag'].sum())

Number of default records: 300


In [7]:
# Example 5: Load a public credit card fraud dataset
cc_url = 'https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv'
fraud_df = pd.read_csv(cc_url)
print(fraud_df.shape)
print(fraud_df.head(3))

(284807, 31)
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   

        V26       V27       V28  Amount  Class  
0 -0.189115  0.133558 -0.021053  149.62      0  
1  0.125895 -0.008983  0.014724    2.69      0  
2 -0.139097 -0.055353 -0.059752  378.66      0  

[3 rows x 31 columns]


In [8]:
# Example 6: Check data types, missing values, and basic stats
print(fraud_df.dtypes)
print(fraud_df.isnull().sum().head())
print(fraud_df.describe().T.head(5))

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object
Time    0
V1      0
V2      0
V3      0
V4      0
dtype: int64
         count          mean           std        min           25%  \
Time  284807.0  9.481386e+04  47488.145955   0.000000  54201.500000   
V1    284807.0  1.175161e-15      1.958696 -56.407510     -0.920373   
V2    284807.0  3.384974e-16      1.651309 -72.715728     -0.598550   
V3    284807.0 -1.379537e-15      1.516255 -48.325589     -0.890365   
V4    2848

In [9]:
# info() prints its own summary (dtypes and non-null counts) and returns None, so it needs no print()
fraud_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [10]:
# List the column names of the fraud dataset
fraud_df.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

In [11]:
# Example 7: Explore unique values in key columns
print(df['channel'].unique())
print(df['transaction_type'].value_counts())

['ATM' 'POS' 'Online' 'Branch']
transaction_type
Credit    503
Debit     497
Name: count, dtype: int64


In [12]:
# Example 8: Filter for large credit transactions at ATMs
atm_credits = df[(df['channel'] == 'ATM') & (df['transaction_type'] == 'Credit') & (df['amount'] > 300)]
print(atm_credits.shape)
print(atm_credits[['transaction_id', 'amount', 'channel', 'transaction_type']].head())

(1, 6)
    transaction_id  amount channel transaction_type
19              20   307.0     ATM           Credit


In [13]:
# Example 9: Merge transactions with customer segments and regions
merged = df.merge(customers, on='customer_id', how='left')
print(merged.head(3)[['transaction_id', 'customer_id', 'segment', 'region']])

   transaction_id customer_id   segment    region
0               1   CUST_0103    Retail  Regional
1               2   CUST_0180  Business  Regional
2               3   CUST_0093    Retail     Metro
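
A left merge keeps every transaction, but it does not tell you whether each one found a customer record or whether the customer table contains repeated IDs. The sketch below shows one way to check both with the `validate` and `indicator` parameters of `merge`. The `txns` and `custs` frames are small stand-ins invented for this example, including the unmatched ID `CUST_9999`.

```python
import pandas as pd

# Small stand-ins for the transactions and customers tables (values are illustrative)
txns = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'customer_id': ['CUST_0001', 'CUST_0002', 'CUST_9999'],  # last ID has no customer record
})
custs = pd.DataFrame({
    'customer_id': ['CUST_0001', 'CUST_0002'],
    'segment': ['Retail', 'Business'],
})

# validate raises MergeError if customer_id is not unique in custs;
# indicator adds a _merge column showing where each row found a match
checked = txns.merge(custs, on='customer_id', how='left',
                     validate='many_to_one', indicator=True)

unmatched = checked[checked['_merge'] == 'left_only']
print(len(checked), 'rows after merge;', len(unmatched), 'without a customer record')
```

If the row count after a left merge is larger than before, the right-hand table has duplicate keys, which is exactly what `validate='many_to_one'` is meant to catch early.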


In [14]:
# Example 10: Group transaction amount statistics by customer segment
grouped_stats = merged.groupby('segment')['amount'].agg(['count', 'mean', 'std', 'min', 'max'])
print(grouped_stats)

          count        mean        std    min     max
segment                                              
Business    250  147.577040  65.536040 -64.09  307.75
Retail      750  153.077093  62.145989  -8.84  340.47
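
Both segment minimums in the table above are negative, because the synthetic amounts are drawn from a normal distribution. Whether a negative amount is valid depends on the dataset's conventions, so a quick range check is worth running before trusting any statistics. A minimal sketch, using a hand-made `sample` frame whose transaction IDs are hypothetical:

```python
import pandas as pd

# Illustrative amounts; the two negative values are the segment minimums shown above
sample = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5],
    'amount': [238.77, -64.09, 58.62, -8.84, 163.56],
})

# Boolean mask of rows that fail a "amount must be non-negative" rule
is_negative = sample['amount'] < 0
print('Negative amounts:', is_negative.sum(), 'of', len(sample))
print(sample[is_negative])
```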


In [15]:
# Example 11: Find customers with only one account type
account_counts = accounts.groupby('customer_id')['account_type'].nunique()
single_account_customers = account_counts[account_counts == 1]
print(single_account_customers.shape)
print(single_account_customers.head())

(200,)
customer_id
CUST_0001    1
CUST_0002    1
CUST_0003    1
CUST_0004    1
CUST_0005    1
Name: account_type, dtype: int64


In [16]:
# Example 12: Parse transaction dates and sort
df['date'] = pd.to_datetime(df['date'])
sorted_txn = df.sort_values('date').reset_index(drop=True)
print(sorted_txn[['transaction_id', 'date']].head(3))

   transaction_id                date
0               1 2024-01-01 00:00:00
1               2 2024-01-01 01:00:00
2               3 2024-01-01 02:00:00


In [17]:
# Example 13: Find possible duplicate transactions
duplicates = df[df.duplicated(['customer_id', 'amount', 'date', 'transaction_type'], keep=False)]
print(duplicates[['transaction_id', 'customer_id', 'amount', 'transaction_type', 'date']].head())

Empty DataFrame
Columns: [transaction_id, customer_id, amount, transaction_type, date]
Index: []
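
The synthetic data has one transaction per hour, so the check above finds nothing. To see what a hit looks like, the sketch below runs the same `duplicated` call on a small hypothetical frame in which two rows share the same customer, amount, type, and timestamp.

```python
import pandas as pd

# Hypothetical transactions; the first two rows share customer, amount, type, and timestamp
txns = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'customer_id': ['CUST_0001', 'CUST_0001', 'CUST_0002'],
    'amount': [50.0, 50.0, 50.0],
    'transaction_type': ['Debit', 'Debit', 'Debit'],
    'date': pd.to_datetime(['2024-01-01 09:00', '2024-01-01 09:00', '2024-01-01 09:00']),
})

key_cols = ['customer_id', 'amount', 'date', 'transaction_type']

# keep=False marks every member of a duplicate group, not only the later copies
dupes = txns[txns.duplicated(key_cols, keep=False)]
print(dupes[['transaction_id'] + key_cols])
```

Rows flagged this way are only candidates: a customer can legitimately make two identical payments, so the result is a list to review rather than rows to drop automatically.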


In [18]:
# Example 14: Handle missing values in banking data
fraud_df_missing = fraud_df.copy()
fraud_df_missing.loc[0, 'Amount'] = np.nan  # Inject a missing value
filled = fraud_df_missing['Amount'].fillna(fraud_df_missing['Amount'].mean())
print(filled.head(3))

0     88.349404
1      2.690000
2    378.660000
Name: Amount, dtype: float64


In [19]:
# Example 15: Catch errors when loading files
try:
    pd.read_csv('non_existent_file.csv')
except FileNotFoundError as e:
    print('File not found. Please check the path:', e)

File not found. Please check the path: [Errno 2] No such file or directory: 'non_existent_file.csv'


In [20]:
# Example 16: Validate column types before analysis
# is_numeric_dtype also handles pandas extension dtypes, which np.issubdtype cannot
if not pd.api.types.is_numeric_dtype(df['amount']):
    print('Amount column is not numeric!')
else:
    print('Amount column is numeric, safe for stats calculations.')


Amount column is numeric, safe for stats calculations.


In [21]:
# Example 17: Always copy DataFrames before modifying
safe_copy = df.copy()
safe_copy['amount'] = safe_copy['amount'] * 1.1  # Simulate a processing step
print(safe_copy['amount'].head(3))

0    262.647
1    296.087
2     64.482
Name: amount, dtype: float64


In [22]:
# Example 18: Use describe(include="all") for broad summary
print(df.describe(include='all').T)

                   count unique        top freq  \
transaction_id    1000.0    NaN        NaN  NaN   
customer_id         1000    198  CUST_0190   13   
amount            1000.0    NaN        NaN  NaN   
transaction_type    1000      2     Credit  503   
channel             1000      4        ATM  266   
date                1000    NaN        NaN  NaN   

                                           mean                  min  \
transaction_id                            500.5                  1.0   
customer_id                                 NaN                  NaN   
amount                                151.70208               -64.09   
transaction_type                            NaN                  NaN   
channel                                     NaN                  NaN   
date              2024-01-21 19:29:59.999999744  2024-01-01 00:00:00   

                                  25%                  50%  \
transaction_id                 250.75                500.5   
customer_id  

In [23]:
# Example 19: End-to-end: Flag risky transactions (amount > 3 std above segment mean)
thresholds = merged.groupby('segment')['amount'].agg(['mean', 'std'])
merged = merged.join(thresholds, on='segment', rsuffix='_stats')
merged['is_risky'] = merged['amount'] > (merged['mean'] + 3 * merged['std'])
print(merged[['transaction_id', 'segment', 'amount', 'mean', 'std', 'is_risky']].head(8))
print('Total risky transactions:', merged['is_risky'].sum())

   transaction_id   segment  amount        mean        std  is_risky
0               1    Retail  238.77  153.077093  62.145989     False
1               2  Business  269.17  147.577040  65.536040     False
2               3    Retail   58.62  153.077093  62.145989     False
3               4    Retail   81.85  153.077093  62.145989     False
4               5    Retail  163.56  153.077093  62.145989     False
5               6    Retail  200.38  153.077093  62.145989     False
6               7  Business  149.33  147.577040  65.536040     False
7               8    Retail   51.72  153.077093  62.145989     False
Total risky transactions: 1
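
The same three-standard-deviation rule can also be written with `groupby(...).transform`, which returns per-segment statistics aligned to the original rows and so avoids adding helper columns to the table. This is an alternative sketch, not the method used in the cell above. The `demo` frame is invented for illustration and includes one deliberately extreme amount.

```python
import pandas as pd

# Illustrative data: 20 rows per segment, with one extreme Retail amount at index 19
demo = pd.DataFrame({
    'segment': ['Retail'] * 20 + ['Business'] * 20,
    'amount': [100.0] * 19 + [1000.0] + [200.0, 210.0] * 10,
})

# transform broadcasts each segment's mean and std back onto its own rows
grp = demo.groupby('segment')['amount']
z_score = (demo['amount'] - grp.transform('mean')) / grp.transform('std')

# Flag amounts more than 3 standard deviations above their segment mean
demo['is_risky'] = z_score > 3
print(demo[demo['is_risky']])
print('Total risky transactions:', demo['is_risky'].sum())
```

Because no `mean` or `std` columns are written to the frame, the cell can be re-run without creating suffixed duplicate columns.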


# Keep Building!

- Ready to try more? Search YouTube for 'Python pandas banking data analysis' for hands-on videos.
- To keep your skills sharp, try building your own synthetic financial dataset and share your work online.


In [24]:
# Example 20: List all columns containing the word 'flag'
flag_cols = [col for col in merged.columns if 'flag' in col]
print(flag_cols)

[]


In [25]:
# Example 21: Extract month from transaction dates and count
df['month'] = df['date'].dt.month
print(df['month'].value_counts().sort_index())

month
1    744
2    256
Name: count, dtype: int64


In [26]:
# Example 22: Save merged banking dataset to CSV file
merged.to_csv('bank_transactions_with_segments.csv', index=False)
print('File saved!')

File saved!
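
CSV files store plain text, so a datetime column does not come back as a datetime unless you ask for it when reloading. The sketch below shows the round trip using an in-memory buffer instead of a file on disk. The `original` frame is a two-row stand-in for the saved dataset.

```python
import io
import pandas as pd

# Two-row stand-in for the saved dataset
original = pd.DataFrame({
    'transaction_id': [1, 2],
    'date': pd.to_datetime(['2024-01-01 00:00:00', '2024-01-01 01:00:00']),
})

# Write to an in-memory text buffer instead of a file
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Reload without and with date parsing
buffer.seek(0)
naive = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['date'])

print('Without parse_dates:', naive['date'].dtype)
print('With parse_dates:   ', parsed['date'].dtype)
```

The same `parse_dates` argument applies when reading a real file path such as the CSV saved above.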
