### Business Questions to Answer
1. Who are our most valuable customers? What defines them?
2. Are there distinct customer groups with similar spending behaviors? How can we target them effectively?
3. What demographic factors (e.g., age, gender, income) influence spending habits?
4. What specific actions can MallCo take to improve retention and boost spending?


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### 1. Loading the Dataset

In [5]:
# Load the data
data = pd.read_csv('../data/Mall_Customers.csv')

# Display the first few rows of the data
data.head()

| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |


#### 2. Understand Customer Demographics
● Filter customers with a Spending Score > 80 and calculate their average Annual Income

● Identify the top 10 customers by Spending Score. What do they have in common (e.g., age group, gender)?

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   CustomerID               200 non-null    int64   
 1   Gender                   200 non-null    object  
 2   Age                      200 non-null    int64   
 3   Annual Income (k$)       200 non-null    int64   
 4   Spending Score (1-100)   200 non-null    int64   
 5   Age Group                200 non-null    category
 6   Age Group (Custom)       200 non-null    category
 7   Age Group (Equal Width)  200 non-null    category
dtypes: category(3), int64(4), object(1)
memory usage: 9.1+ KB


In [9]:
data.describe()

| | CustomerID | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 100.5 | 38.85 | 60.56 | 50.2 |
| std | 57.879185 | 13.969007 | 26.264721 | 25.823522 |
| min | 1.0 | 18.0 | 15.0 | 1.0 |
| 25% | 50.75 | 28.75 | 41.5 | 34.75 |
| 50% | 100.5 | 36.0 | 61.5 | 50.0 |
| 75% | 150.25 | 49.0 | 78.0 | 73.0 |
| max | 200.0 | 70.0 | 137.0 | 99.0 |


Check for missing values and duplicate rows.
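The check noted above could be sketched as follows. This is a minimal example, not the notebook's own code: the helper name `quality_report` and the three-row `sample` frame (rows copied from the `data.head()` output) are hypothetical stand-ins so the snippet runs on its own.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Count missing values per column and fully duplicated rows."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Hypothetical stand-in for `data`; rows copied from the data.head() output
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Male", "Male", "Female"],
    "Age": [19, 21, 20],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})

report = quality_report(sample)
print(report)
```

In the notebook this would be called as `quality_report(data)`. The `data.info()` output already reports 200 non-null values in every column, so the missing-value counts on the full dataset should all be zero.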

In [8]:
# Customers with a Spending Score > 80
data[data['Spending Score (1-100)'] > 80]

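| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 1 | 2 | Male | 21 | 15 | 81 |
| 7 | 8 | Female | 23 | 18 | 94 |
| 11 | 12 | Female | 35 | 19 | 99 |
| 19 | 20 | Female | 35 | 23 | 98 |
| 25 | 26 | Male | 29 | 28 | 82 |
| 29 | 30 | Female | 23 | 29 | 87 |
| 33 | 34 | Male | 18 | 33 | 92 |
| 35 | 36 | Female | 21 | 33 | 81 |
| 41 | 42 | Male | 24 | 38 | 92 |
| 123 | 124 | Male | 39 | 69 | 91 |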
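The first task in this section also asks for the average Annual Income of these customers, which the cell above does not compute. A minimal sketch of that step follows. The function name `mean_income_of_high_spenders` is hypothetical, and the five-row `sample` frame (copied from the `data.head()` output) stands in for `data` so the snippet is self-contained.

```python
import pandas as pd

SCORE = "Spending Score (1-100)"
INCOME = "Annual Income (k$)"


def mean_income_of_high_spenders(df: pd.DataFrame, threshold: int = 80) -> float:
    """Average annual income (k$) of customers whose score exceeds `threshold`."""
    high_spenders = df[df[SCORE] > threshold]
    return high_spenders[INCOME].mean()  # NaN if no customer qualifies


# Hypothetical stand-in for `data`; rows copied from the data.head() output
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    INCOME: [15, 15, 16, 16, 17],
    SCORE: [39, 81, 6, 77, 40],
})

avg_income = mean_income_of_high_spenders(sample)
print(avg_income)
```

Only customer 2 exceeds the threshold in this small sample, so the result here is that single income. Calling `mean_income_of_high_spenders(data)` in the notebook would give the figure for the full dataset.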


In [12]:
# top 10 customers by spending score
top_10_by_score = data.sort_values(by='Spending Score (1-100)', ascending=False).head(10)

top_10_by_score

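| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 11 | 12 | Female | 35 | 19 | 99 |
| 19 | 20 | Female | 35 | 23 | 98 |
| 145 | 146 | Male | 28 | 77 | 97 |
| 185 | 186 | Male | 30 | 99 | 97 |
| 127 | 128 | Male | 40 | 71 | 95 |
| 167 | 168 | Female | 33 | 86 | 95 |
| 7 | 8 | Female | 23 | 18 | 94 |
| 141 | 142 | Male | 32 | 75 | 93 |
| 163 | 164 | Female | 31 | 81 | 93 |
| 41 | 42 | Male | 24 | 38 | 92 |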
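Customers 34 and 42 both have a Spending Score of 92 in the Spending Score > 80 output, yet only customer 42 appears in this top 10, so the cutoff falls on a tie. One way to make that visible, sketched here as an alternative rather than the notebook's method, is `DataFrame.nlargest` with `keep="all"`. The five-row `sample` frame below reuses scores from that output and is only a stand-in for `data`.

```python
import pandas as pd

SCORE = "Spending Score (1-100)"

# Stand-in for `data`; IDs and scores copied from the Spending Score > 80 output
sample = pd.DataFrame({
    "CustomerID": [2, 12, 20, 34, 42],
    SCORE: [81, 99, 98, 92, 92],
})

# Default keep="first": exactly n rows, ties at the cutoff broken by row order
top_first = sample.nlargest(3, SCORE)

# keep="all": every row tied with the n-th value is retained, so the result
# can contain more than n rows
top_all = sample.nlargest(3, SCORE, keep="all")

print(top_first)
print(top_all)
```

With `keep="all"`, the result includes both customers scoring 92, which avoids silently dropping one of two equally ranked customers.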


In [26]:
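# Count how often each gender appears among the top 10
top_gender = top_10_by_score.value_counts("Gender")

# Count how often each age appears among the top 10
top_age = top_10_by_score.value_counts("Age")

print(top_gender)
print("\n")
print(top_age)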

Gender
Female    5
Male      5
dtype: int64


Age
35    2
23    1
24    1
28    1
30    1
31    1
32    1
33    1
40    1
dtype: int64
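Counting individual ages gives mostly ones, which makes it hard to see what the top 10 have in common. A range-based summary may be easier to read. The sketch below rebuilds the top-10 rows from the table above so that it runs on its own. The variable names are illustrative, and in the notebook `top_10_by_score` could be used directly.

```python
import pandas as pd

# Top-10 rows copied from the sorted output above
top_10 = pd.DataFrame({
    "CustomerID": [12, 20, 146, 186, 128, 168, 8, 142, 164, 42],
    "Gender": ["Female", "Female", "Male", "Male", "Male",
               "Female", "Female", "Male", "Female", "Male"],
    "Age": [35, 35, 28, 30, 40, 33, 23, 32, 31, 24],
    "Annual Income (k$)": [19, 23, 77, 99, 71, 86, 18, 75, 81, 38],
})

# Age range and mean instead of per-age counts
age_summary = top_10["Age"].agg(["min", "max", "mean"])

# Share of each gender (fractions that sum to 1)
gender_share = top_10["Gender"].value_counts(normalize=True)

# Income spread, to check whether income is a shared trait
income_summary = top_10["Annual Income (k$)"].agg(["min", "max"])

print(age_summary)
print(gender_share)
print(income_summary)
```

For these ten rows the ages span 23 to 40, the genders split evenly, and incomes range from 18 to 99 k$, which suggests that age, rather than gender or income, is the clearer common factor.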


#### 4. Explore Relationships Between Features
● Compute the pairwise correlations between Age, Annual Income, and Spending Score to uncover key drivers of spending

● Filter young adults (18-25) and calculate their average Spending Score. Compare this with older age groups


In [15]:
# Select the numerical columns we want to analyze
numerical_columns = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

# Compute correlations
correlations = data[numerical_columns].corr()

correlations

| | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|
| Age | 1.0 | -0.012398 | -0.327227 |
| Annual Income (k$) | -0.012398 | 1.0 | 0.009903 |
| Spending Score (1-100) | -0.327227 | 0.009903 | 1.0 |
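To read "key drivers" off the matrix, the Spending Score column can be ranked by absolute correlation. The sketch below re-enters the values from the output above so it is self-contained. The name `corr_from_output` is illustrative, and in the notebook the `correlations` frame could be used instead.

```python
import pandas as pd

cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
target = "Spending Score (1-100)"

# Correlation values copied from the output above
corr_from_output = pd.DataFrame(
    [[1.0, -0.012398, -0.327227],
     [-0.012398, 1.0, 0.009903],
     [-0.327227, 0.009903, 1.0]],
    index=cols,
    columns=cols,
)

# Drop the self-correlation, then order by strength regardless of sign
drivers = corr_from_output[target].drop(target)
ranked = drivers.reindex(drivers.abs().sort_values(ascending=False).index)
strongest = ranked.index[0]

print(ranked)
print(strongest)
```

Age comes out first with a weak negative correlation of about -0.33, while Annual Income is close to zero. Pearson correlation only captures linear relationships, so a near-zero value does not rule out a non-linear link between income and spending.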


In [19]:
# Equal-width binning
data['Age Group (Equal Width)'] = pd.cut(data['Age'],
                                        bins=5,  # Number of equal-width bins
                                        include_lowest=True)

# Custom bins
custom_bins = [17, 25, 35, 50, 70]  # Define edges
custom_labels = ['18-25', '26-35', '36-50', '51+']  # Define labels
data['Age Group (Custom)'] = pd.cut(data['Age'],
                                   bins=custom_bins,
                                   labels=custom_labels,
                                   include_lowest=True)

# Analyze spending patterns for each binning approach
print("\nEqual-width binning analysis:")
print(data.groupby('Age Group (Equal Width)')['Spending Score (1-100)'].agg(['count', 'mean', 'std']))

print("\nCustom binning analysis:")
print(data.groupby('Age Group (Custom)')['Spending Score (1-100)'].agg(['count', 'mean', 'std']))



Equal-width binning analysis:
                         count       mean        std
Age Group (Equal Width)                             
(17.947, 28.4]              50  56.780000  23.561721
(28.4, 38.8]                63  61.206349  28.103586
(38.8, 49.2]                42  38.500000  22.126300
(49.2, 59.6]                25  34.720000  18.935241
(59.6, 70.0]                20  43.000000  16.673332

Custom binning analysis:
                    count       mean        std
Age Group (Custom)                             
18-25                  38  54.947368  25.118043
26-35                  60  64.450000  24.699842
36-50                  62  41.709677  24.171781
51+                    40  37.475000  18.768478
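The second task in this section asks for a direct filter on young adults (18-25). Binning covers it, but a boolean mask is a shorter route when only one group is of interest. This is a minimal sketch using `Series.between`, which is inclusive on both ends by default. The five-row `sample` frame (copied from the `data.head()` output) stands in for `data`.

```python
import pandas as pd

SCORE = "Spending Score (1-100)"

# Stand-in for `data`; rows copied from the data.head() output
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Age": [19, 21, 20, 23, 31],
    SCORE: [39, 81, 6, 77, 40],
})

# True for customers aged 18 to 25 inclusive
is_young = sample["Age"].between(18, 25)

young_mean = sample.loc[is_young, SCORE].mean()
older_mean = sample.loc[~is_young, SCORE].mean()

print(f"18-25: {young_mean:.2f}, 26+: {older_mean:.2f}")
```

The numbers from this tiny sample are not representative. On the full dataset, the custom binning output above reports a mean Spending Score of about 54.95 for the 38 customers aged 18-25.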
