
# Lab 1: Exploring Customer Behavior and Retention – <br>Insights from a Bank Dataset (Churn Modelling)
January 11, 2026
<br><br>


Hi, my name is **Troy Dela Rosa**, and I'm an aspiring **Data Scientist and Analyst** with a passion for turning raw data into meaningful insights through engaging visualizations.  

As someone with a background in **retail management**, I know that understanding customer behavior is key to improving satisfaction, loyalty, and retention.
<br>
This lab uses a bank dataset to explore patterns in customer demographics, account activity, and product usage—similar to analyzing customer behavior in a retail setting.  
<br>
Join me as we explore the exciting world of data and discover the stories it reveals about customers, their behaviors, and how businesses can make informed decisions to retain them.


## Dataset Description


This dataset contains information about bank customers, including their demographic details, account information, and whether they exited (closed their account). 

Columns include:
- `RowNumber`: Unique row identifier
- `CustomerId`: Customer ID
- `Surname`: Last name of the customer
- `CreditScore`: Credit score of the customer
- `Geography`: Country of the customer
- `Gender`: Male or Female
- `Age`: Age of the customer
- `Tenure`: Number of years the customer has been with the bank
- `Balance`: Account balance
- `NumOfProducts`: Number of bank products the customer uses
- `HasCrCard`: Whether the customer has a credit card (1 = Yes, 0 = No)
- `IsActiveMember`: Whether the customer is an active member (1 = Yes, 0 = No)
- `EstimatedSalary`: Estimated annual salary
- `Exited`: Whether the customer left the bank (1 = Yes, 0 = No)

## Load the Dataset

In [33]:
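# Dataset repository page: https://github.com/selva86/datasets/blob/master/Churn_Modelling.csv
# Note: the file loaded below is the Churn_Modelling_m.csv variant from the same repository.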

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/refs/heads/master/Churn_Modelling_m.csv')
df

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619.0,France,Female,42.0,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608.0,Spain,Female,41.0,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502.0,France,,,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699.0,France,,39.0,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850.0,Spain,Female,43.0,2,,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771.0,France,Male,39.0,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516.0,France,Male,35.0,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709.0,France,Female,36.0,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772.0,Germany,Male,42.0,3,75075.31,2,1,0,92888.52,1


In [35]:
# Call the method with parentheses; a bare `df.head` only returns the bound method
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619.0,France,Female,42.0,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608.0,Spain,Female,41.0,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502.0,France,,,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699.0,France,,39.0,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850.0,Spain,Female,43.0,2,,1,1,1,79084.10,0

- We imported the pandas library and loaded the dataset into a DataFrame called `df`.
- Using `df.head()` we can view the first 5 rows of the dataset to get an initial idea of the data.
- The dataset comes from a public dataset repository on GitHub.
- This dataset appears to contain customer demographics, account details, and whether the customer has exited the bank.


## Explore the Dataset

In [36]:
# Basic info about the dataset
df.info()
df.isna().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      9999 non-null   float64
 4   Geography        10000 non-null  object 
 5   Gender           9986 non-null   object 
 6   Age              9960 non-null   float64
 7   Tenure           10000 non-null  int64  
 8   Balance          9963 non-null   float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  9999 non-null   float64
 13  Exited           10000 non-null  int64  
dtypes: float64(4), int64(7), object(3)
memory usage: 1.1+ MB


RowNumber           0
CustomerId          0
Surname             0
CreditScore         1
Geography           0
Gender             14
Age                40
Tenure              0
Balance            37
NumOfProducts       0
HasCrCard           0
IsActiveMember      0
EstimatedSalary     1
Exited              0
dtype: int64

- `df.info()` shows the number of rows and columns, column names, data types, and non-null counts.
- Five columns have missing values: `CreditScore` (1), `Gender` (14), `Age` (40), `Balance` (37), and `EstimatedSalary` (1).
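
Raw counts are easier to judge next to the share of rows they represent. The sketch below is one possible way to build that view. It uses a small made-up frame called `toy` (an assumption for illustration, not the lab data); on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the lab's `df`
toy = pd.DataFrame({
    "CreditScore": [619.0, 608.0, np.nan, 699.0],
    "Gender": ["Female", None, None, "Male"],
    "Exited": [1, 0, 1, 0],
})

missing_count = toy.isna().sum()        # number of missing values per column
missing_pct = toy.isna().mean() * 100   # share of rows missing, in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Keep only the columns that actually have gaps, largest first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

With the counts reported above, even the largest gap (`Age`, 40 of 10,000 rows) is under half a percent of the data.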


In [27]:
# Check dataset shape
df.shape


(10000, 14)

- `df.shape` shows the dataset has 10,000 rows and 14 columns.
- This tells us how large the dataset is and how much data we have to work with.


In [28]:
# Column names
df.columns


Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

- Lists all column names in the dataset.
- Useful to quickly see what features are available for analysis.


In [29]:
# Viewing basic statistics
df.describe()


Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,9999.0,9960.0,10000.0,9963.0,10000.0,10000.0,10000.0,9999.0,10000.0
mean,5000.5,15690940.0,650.525453,38.916064,5.0128,76432.456152,1.5302,0.7055,0.5151,100085.272737,0.2037
std,2886.89568,71936.19,96.657553,10.487733,2.892174,62411.224323,0.581654,0.45584,0.499797,57511.223654,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,3.0,0.0,1.0,0.0,0.0,50992.93,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97133.92,1.0,1.0,1.0,100187.43,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127638.135,2.0,1.0,1.0,149382.875,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


- `df.describe()` provides a statistical summary of the numeric columns (count, mean, std, min, max, percentiles).
- Helps to identify unusual values, ranges, and data distribution.


## Initial Observations

1. Some customers don’t have all their information recorded: `CreditScore`, `Gender`, `Age`, `Balance`, and `EstimatedSalary` each have at least one missing value.
2. Most customers are from France, Spain, or Germany.
3. There are both men and women, but 14 entries are missing the gender information.
4. Customers’ ages vary a lot, from 18 to 92, with a median of 37.
5. At least a quarter of customers have a zero balance (the 25th percentile of `Balance` is 0), which might mean they are not active or don’t use their accounts much.
6. Customers use between 1 and 4 bank products, such as savings, credit cards, or loans, and at least three quarters use only 1 or 2 (the 75th percentile of `NumOfProducts` is 2).
7. About 71% of customers have a credit card (the mean of `HasCrCard` is 0.7055).
8. About 52% of customers are active members (the mean of `IsActiveMember` is 0.5151).
9. The salary range is very wide, from 11.58 to 199,992.48.
10. About 20% of customers left the bank (the mean of `Exited` is 0.2037). The main goal is to see which customers left (***churned***) and try to understand ***why***.
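
Observations like the ones above can be checked directly rather than read off a summary table. The following is a minimal sketch, assuming a made-up frame called `toy` in place of the real `df`. It measures the share of zero balances and counts customers per number of products.

```python
import pandas as pd

# Hypothetical toy data; in the lab this would be the real `df`
toy = pd.DataFrame({
    "Balance": [0.0, 83807.86, 159660.80, 0.0],
    "NumOfProducts": [1, 1, 3, 2],
})

# Fraction of rows whose balance is exactly zero
zero_balance_share = (toy["Balance"] == 0).mean()

# How many customers hold 1, 2, 3, ... products
product_counts = toy["NumOfProducts"].value_counts().sort_index()

print(f"Zero-balance share: {zero_balance_share:.0%}")
print(product_counts)
```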

## Initial Analysis Questions

Questions We Want to Answer About Customers

1. Are younger or older customers more likely to leave the bank?  
2. Does having more bank products, like credit cards or savings accounts, make people stay longer?  
3. Do people with higher account balances or higher salaries tend to leave less often?  
4. Are active customers less likely to leave than inactive customers?  
5. Does the country someone lives in affect whether they stay or leave the bank?
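
Most of these questions share the same shape: split customers by one feature and compare the share who left in each group. A minimal sketch of that pattern follows. The helper name `churn_rate_by` and the `toy` data are assumptions made for illustration and are not part of the lab.

```python
import pandas as pd

def churn_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of customers with Exited == 1 within each value of `column`."""
    # Exited is coded 0/1, so the group mean equals the churn share
    return frame.groupby(column)["Exited"].mean()

# Hypothetical toy data standing in for the real `df`
toy = pd.DataFrame({
    "IsActiveMember": [1, 1, 0, 0, 0, 1],
    "Geography": ["France", "Spain", "Germany", "Germany", "France", "France"],
    "Exited": [0, 0, 1, 1, 0, 1],
})

active_rates = churn_rate_by(toy, "IsActiveMember")
country_rates = churn_rate_by(toy, "Geography")
print(active_rates)
print(country_rates)
```

For continuous features such as `Age` or `Balance`, the values would first need to be grouped into bands before the same comparison makes sense.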

## Summary of Initial EDA

After looking at the data, here’s what we learned in simple terms:

1. The dataset is about bank customers and whether they stayed with the bank or left.  
2. We have information about their age, gender, country, account balance, number of bank products, credit card ownership, activity level, and estimated salary.  
3. Some information is missing for a few customers, like their credit score or balance. This is normal in real-world data.  
4. We can start forming hypotheses about why customers leave the bank. For example, age, lower balances, or using fewer products might be linked to leaving, but none of these has been tested yet.  
5. There are also opportunities to see if certain features help keep customers, like having multiple products, being an active member, or having a credit card.  
6. Overall, the dataset looks useful to explore why customers leave and what can be done to keep them.  
7. Before doing deeper analysis, we may need to clean the data and handle missing values, but even now we can start asking interesting questions and looking for patterns.  

In short, this dataset helps us understand customer behavior at a bank and can guide decisions to improve customer satisfaction and retention.
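
As a rough illustration of what handling missing values could look like, the sketch below fills numeric gaps with the median and categorical gaps with the most frequent value. This is only one common option and not a step performed in this lab. The `toy` frame is made up, and whether imputation or dropping rows is more appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column
toy = pd.DataFrame({
    "Age": [42.0, 41.0, np.nan, 39.0],
    "Gender": ["Female", "Female", None, "Male"],
})

cleaned = toy.copy()  # leave the original untouched

# Numeric column: fill gaps with the median of the observed values
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# Categorical column: fill gaps with the most frequent value
most_common_gender = cleaned["Gender"].mode().iloc[0]
cleaned["Gender"] = cleaned["Gender"].fillna(most_common_gender)

print(cleaned)
```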

# Lab 2: Exploring Customer Behavior  
## Exploratory Data Analysis of Customer Churn  

*Looking More Closely at a Dataset*

**Course:** COMP 2040  
**Instructor:** Chris Mac  
**Student:** Troy Dela Rosa  
**Student ID:** 0213352  
**Date:** January 17, 2026

## Dataset Overview and Initial Assumptions

I am analyzing the Churn Modelling dataset, which contains 10,000 bank customer records. The goal of this analysis is to explore patterns in customer behavior and better understand factors associated with customer churn.

Initial Assumptions:
<br>These assumptions are based on general expectations and will be explored and refined using the available data.

1. I initially assumed that customers with higher account balances are more likely to stay with the bank because they may be more financially engaged.
2. I assumed that geographic location would be less important than financial factors such as credit score when it comes to customer churn.
3. I assumed that customers who use more bank products, such as savings accounts or credit cards, are more likely to remain with the bank.
4. I assumed that active customers are less likely to leave the bank compared to inactive customers.
5. I assumed that younger customers are more likely to leave the bank than older customers.

## Load the Dataset

In [60]:
import pandas as pd

url = "https://raw.githubusercontent.com/selva86/datasets/refs/heads/master/Churn_Modelling.csv"
df = pd.read_csv(url)

df


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


## Data Completeness Check *(A happy accident)*
“This turned out to be a happy accident, as it highlighted how datasets can change over time and why data should always be revalidated.”

The dataset in Lab 1 contained true missing values: `df.isna().sum()` reported NaNs in `CreditScore`, `Gender`, `Age`, `Balance`, and `EstimatedSalary`. At first I thought the file on GitHub had been updated with the missing values filled in. Comparing the two load cells suggests a simpler explanation: Lab 1 read `Churn_Modelling_m.csv`, while this lab reads `Churn_Modelling.csv` from the same repository, and this second file has no gaps. Either way, this was a good reminder to confirm data quality with code every time a dataset is loaded rather than relying on what you saw earlier or at first glance.

To double-check the data, I ran both `df.info()` and `df.isna().sum()`, and neither showed any true missing (NaN) values in the dataset used for this lab. 

In [69]:
df.info()
df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

With data quality verified, the next step is to explore individual columns in more detail to understand how customer characteristics are distributed.
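
When two loads of what should be the same dataset disagree, putting their missing-value counts and dtypes side by side makes the difference explicit. The sketch below assumes two small made-up frames, `lab1_like` and `lab2_like`, that mimic the Lab 1 and Lab 2 loads. They are illustrative stand-ins, not the real files.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two loads
lab1_like = pd.DataFrame({"Age": [42.0, np.nan, 39.0],
                          "Balance": [0.0, 83807.86, np.nan]})
lab2_like = pd.DataFrame({"Age": [42, 42, 39],
                          "Balance": [0.0, 83807.86, 125510.82]})

# One row per column: missing counts and dtypes from each load
comparison = pd.DataFrame({
    "missing_lab1": lab1_like.isna().sum(),
    "missing_lab2": lab2_like.isna().sum(),
    "dtype_lab1": lab1_like.dtypes.astype(str),
    "dtype_lab2": lab2_like.dtypes.astype(str),
})
print(comparison)
```

A column that is `float64` in one load and `int64` in the other, as `Age` and `CreditScore` are between the two `df.info()` outputs in this document, is itself a hint that one version contains NaN values, since NaN forces an integer column to float.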

## Explore the ***Measures***

In this section, I explore a few key columns in more detail to better understand what the values look like, how much they vary, and what they might reveal about customer behavior. At this stage, the focus is only on observation and interpretation—no cleaning or modeling is performed.

### Measure 1: ***Geography***

In [None]:
# Summary statistics for the 'Geography' column
df['Geography'].describe()

count      10000
unique         3
top       France
freq        5014
Name: Geography, dtype: object

#### ***Interpretation of Geography Stats***

Count: 10,000
<br> Every customer in the dataset has a country listed. Since there are 10,000 rows in total, this confirms there are no missing values for customer location.

Unique: 3
<br> There are only three countries in the dataset: France, Spain, and Germany. That matches what we expected. If this number were higher, it could point to data entry issues or unexpected regions.

Top: France
<br> This is the most common country in the dataset.

Freq: 5,014
<br> This is how many customers are from France, the top value. At 5,014 of 10,000 rows (about 50%), France has roughly as many customers as Spain and Germany combined, so the data is not evenly distributed across countries. This is important to keep in mind when looking at churn patterns later.
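
`describe()` only reports the single most frequent value. To see the count and share for every country, `value_counts()` can be used, as in this small sketch. The `toy` series is made up for illustration; on the lab data the same calls would be made on `df['Geography']`.

```python
import pandas as pd

# Hypothetical toy column standing in for df['Geography']
toy = pd.Series(["France", "France", "Spain", "Germany"], name="Geography")

counts = toy.value_counts()                 # rows per country
shares = toy.value_counts(normalize=True)   # fraction of rows per country

print(counts)
print(shares)
```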

After understanding how customers are distributed across countries, the next step is to examine whether churn behavior differs by geography.

In [None]:
# Average churn rate by Geography
df.groupby('Geography')['Exited'].mean()


Geography
France     0.161548
Germany    0.324432
Spain      0.166734
Name: Exited, dtype: float64

#### ***Interpretation of Geography vs. Churn***
After grouping the data by geography and calculating the average churn rate, a clear regional pattern emerges. Customers in France and Spain have relatively low churn rates (around 16%), while customers in Germany have a much higher churn rate of over 32%.

Because `Exited` is coded as 0 or 1, each mean is the share of customers in that country who left the bank. In other words, if a random customer is selected from Germany, they are about twice as likely to have left the bank compared to a customer from France or Spain.
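
A churn rate is easier to trust when the group size sits next to it. The sketch below shows one way to report customers, churned customers, and churn rate together, and to compute the ratio between two countries. The `toy` data is made up, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: 4 customers per country
toy = pd.DataFrame({
    "Geography": ["France"] * 4 + ["Germany"] * 4,
    "Exited": [0, 0, 0, 1, 0, 0, 1, 1],
})

# Named aggregation: group size, number churned, and churn share
by_country = toy.groupby("Geography")["Exited"].agg(
    customers="count",
    churned="sum",
    churn_rate="mean",
)

# How many times higher is Germany's churn rate than France's?
ratio = by_country.loc["Germany", "churn_rate"] / by_country.loc["France", "churn_rate"]
print(by_country)
print(f"Germany vs France: {ratio:.2f}x")
```

Applied to the rates reported above, the same division gives roughly 2.0 for Germany relative to France (0.324432 / 0.161548) and roughly 1.9 relative to Spain (0.324432 / 0.166734).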

#### ***Why Geography Is Useful for Churn Modeling***
Initially, I assumed that geography would be less important than individual financial factors such as credit score. However, exploring the Geography column showed that the dataset is heavily concentrated in a few regions, with France accounting for just over half of all customers.

When churn was examined by geography, a clear difference emerged. Customers in Germany have a much higher churn rate than those in France and Spain, directly challenging my initial assumption.

Together, the distribution shown by `.describe()` and the churn comparison indicate that geography provides important context for understanding churn. Customer exit behavior appears to vary by region, making geography a key variable to explore further.

After examining regional differences in churn, the next step is to explore whether customer demographics—such as age—also show meaningful differences between customers who stayed and those who left.

### Measure 2: ***Age***
This section looks at the `Age` column to get a sense of the age range and how spread out the values are.

In [53]:
df['Age'].describe()
# This line summarizes the Age column by giving you a quick overview of its main statistics.


count    10000.000000
mean        38.921800
std         10.487806
min         18.000000
25%         32.000000
50%         37.000000
75%         44.000000
max         92.000000
Name: Age, dtype: float64

The mean age (38.9) is slightly higher than the median (37), which suggests a right-skewed distribution: the middle half of customers are between 32 and 44, while a smaller number of older customers pull the average up.
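
The gap between mean and median can be backed up with a skewness value. This is a minimal sketch on a made-up `ages` series (an assumption for illustration). On the lab data the same calls would be made on `df['Age']`.

```python
import pandas as pd

# Hypothetical ages with a couple of older customers in the tail
ages = pd.Series([22, 30, 33, 35, 37, 38, 40, 44, 60, 85], name="Age")

mean_age = ages.mean()
median_age = ages.median()
skewness = ages.skew()   # positive values indicate a longer right tail

print(f"mean={mean_age:.1f}, median={median_age:.1f}, skew={skewness:.2f}")
```

A mean above the median together with a positive skew points to a right tail, which is the pattern described for `Age` above.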

In [54]:
df['Age'].value_counts().sort_index()
# This shows how frequently each age appears in the dataset 
# and helps identify the most common age groups.


Age
18    22
19    27
20    40
21    53
22    84
      ..
83     1
84     2
85     1
88     1
92     2
Name: count, Length: 70, dtype: int64

In addition to using `describe()`, age frequencies were examined using `value_counts()` to confirm the range and distribution of ages. This helped verify the summary statistics and provided a clearer picture of how customer ages are distributed.

In [58]:
df['Age'].isna().sum()


np.int64(0)

During the initial inspection of the dataset, I expected to find missing values in several columns. After verifying this using `isna().sum()`, I found that the Age column is fully populated. <br> **This highlighted the importance of checking assumptions with code rather than relying only on visual inspection.** <br><br> The dataset used in Lab 1 contained missing values, including 40 in `Age`. I first assumed the file on GitHub had been updated, but the load cells point to a different cause: Lab 1 read `Churn_Modelling_m.csv`, whereas Lab 2 reads `Churn_Modelling.csv`. As a result, no missing values were detected in the version of the dataset used for this analysis. This reinforced the need to validate data quality at each stage of analysis.

In [59]:
df.groupby('Exited')['Age'].describe()

Exited,count,mean,std,min,25%,50%,75%,max
0,7963.0,37.408389,10.125363,18.0,31.0,36.0,41.0,92.0
1,2037.0,44.837997,9.761562,18.0,38.0,45.0,51.0,84.0


When comparing age by churn status, customers who exited the bank tend to be older overall.
<br>Both the average and median age are higher for churned customers, suggesting that age may be an important factor to consider when exploring churn behavior.

**Customers who stayed (Exited = 0)**
<br>Average age: ~37.4 years
<br>Median age: 36
<br>The middle 50% of customers fall between 31 and 41
<br>Ages range from 18 to 92

**Customers who left (Exited = 1)**
<br>Average age: ~44.8 years
<br>Median age: 45
<br>The middle 50% of customers fall between 38 and 51
<br>Ages range from 18 to 84

#### ***Interpretation of the Age Column***

The Age column shows a wide range of customer ages, from young adults to older customers. Overall, most customers fall within a middle age range, but there is still noticeable variation across the dataset. When comparing age by churn status, customers who left the bank tend to be older on average than those who stayed. This difference is consistent across the mean, median, and quartile values, indicating that it is not driven by just a few extreme cases. This tells me that churn isn't just a **"young person's game"**: in this dataset, the customers who left are older on average than those who stayed, which runs against my initial assumption that younger customers would be more likely to leave. Whether these older customers are also wealthier has not been checked here.

#### ***Why Age Is Useful for Churn Modeling***

Age is useful for churn modeling because it shows clear differences between customers who stayed and those who left. The fact that churned customers are generally older suggests that customer needs, engagement levels, or expectations may change with age. Since this pattern appears across the overall age distribution, age can serve as a meaningful demographic feature that helps distinguish between retained and churned customers when building a churn model.
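
One way to look past the group averages is to band ages and compare the churn rate per band. The sketch below uses `pd.cut` on a made-up `toy` frame. The bin edges and labels are arbitrary choices for illustration, not values taken from the lab.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `df`
toy = pd.DataFrame({
    "Age": [22, 28, 35, 38, 45, 47, 52, 58, 63, 70],
    "Exited": [0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
})

# Assumed band edges; intervals are right-inclusive, e.g. (17, 30]
bins = [17, 30, 40, 50, 60, 100]
labels = ["18-30", "31-40", "41-50", "51-60", "61+"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Share of customers who left within each age band
band_rates = toy.groupby("AgeBand", observed=True)["Exited"].mean()
print(band_rates)
```

A banded view like this can show whether churn rises steadily with age or peaks in a particular range, which a single mean per group cannot reveal.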

## Reflection

Overall, this dataset is a strong candidate for further analysis. It is well structured, complete, and includes a clear churn indicator along with useful demographic, financial, and behavioral variables. Throughout the exploratory analysis, two initial assumptions were challenged: that geography would matter less than financial factors, and that younger customers would be more likely to leave. The remaining assumptions, including the idea that higher account balances imply greater loyalty, were not examined in this lab and still need to be tested. These findings highlight the value of exploratory analysis and make the dataset well suited for deeper analysis and modeling in the future.