# Consumer age behaviour- Data Visualisation

This notebook will help visulaise the encoded shopping_trends dataset.


In [49]:

import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
from matplotlib import cm
import warnings
warnings.filterwarnings('ignore')

Will load the clean dataset and this will be the main

The ETL process has already:
1. Cleaned the raw shopping data
2. Identified elderly customers (Age â‰¥ 65)
3. Calculated gender distribution
4. Saved pre-processed files in `/dataset/clean/`

In [50]:
elderly_summary = pd.read_csv('../dataset/clean/elderly_gender_summary.csv')
print(" Loaded elderly gender summary from ETL")
print(elderly_summary)

 Loaded elderly gender summary from ETL
   gender  count
0    Male    542
1  Female    246


**Files loaded:**
- `elderly_gender_summary.csv`: Gender counts (542 men, 246 women)
- `shopping_trends_clean.csv`: Full cleaned dataset
- `elderly_customers.csv`: Pre-filtered elderly customers
- `area_data.csv`: Summary data for area plots

In [51]:
fig_bar = px.bar(elderly_summary, 
                 x='gender', 
                 y='count',
                 title='Elderly Customers by Gender - From ETL Data',
                 color='gender',
                 color_discrete_map={'Male': 'blue', 'Female': 'pink'},
                 text='count')

fig_bar.update_layout(
    xaxis_title='Gender',
    yaxis_title='Number of Customers',
    template='plotly_white'
)

fig_bar.show()

print(" Creating Pie Chart...")
fig_pie = px.pie(elderly_summary,
                 values='count',
                 names='gender',
                 title='Gender Distribution: Elderly Customers (542 men, 246 women)',
                 hole=0.3,
                 color='gender',
                 color_discrete_map={'Male': 'blue', 'Female': 'pink'})

fig_pie.update_traces(textposition='inside', textinfo='percent+label')
fig_pie.show()


area_data = pd.read_csv('../dataset/clean/area_data.csv')
fig_area = px.area(area_data,
                   x='Category',
                   y='Count',
                   title='Customer Distribution - Area View',
                   markers=True)

fig_area.show()

print("\n All visualizations created from ETL cleaned data!")

 Creating Pie Chart...



 All visualizations created from ETL cleaned data!


### Bar Chart Insight
The bar chart provides clear quantitative evidence, showing 542 elderly male customers compared to 246 female customers. This visual height difference immediately communicates the significant gender disparity in superstore shopping patterns.

### Pie Chart Insight
The pie chart reveals the proportional reality: men constitute nearly 70% of elderly superstore shoppers. This proportional representation emphasizes the scale of male dominance in this customer segment.

### KDE Plot Insight
Age distribution analysis shows both genders follow similar age patterns (concentrated in 65-70 range), confirming that the gender difference in shopping frequency is not influenced by age distribution but represents distinct behavioral patterns.

### Animated chart

I have created an animated chart to help visualisation become easier.

In [52]:

counts = elderly_summary.set_index('gender')['count'].to_dict()
men_count = int(counts.get('Male', counts.get('male', 0)))
women_count = int(counts.get('Female', counts.get('female', 0)))

animation_data = pd.DataFrame({
    'Gender': ['Men', 'Women', 'Difference'],
    'Count': [men_count, women_count, men_count - women_count],
    'Stage': ['Base', 'Base', 'Highlight']
})

fig_animated = px.bar(
    animation_data,
    x='Gender',
    y='Count',
    color='Stage',
    color_discrete_map={'Base': 'lightgray', 'Highlight': 'red'},
    title='<b>Gender Difference Visualization</b><br>Highlighting the Gap',
    text='Count',
    template='plotly_dark'
)

fig_animated.update_layout(
    title_font=dict(size=20),
    xaxis_title="",
    yaxis_title="Number of Customers",
    showlegend=False
)

fig_animated.show()


### Animation Conclusion
The animated chart brings data to life, visually emphasizing the 296-customer gap between elderly men and women. This dynamic presentation makes the statistical difference more engaging and memorable than static charts alone.

### Producing a heatmap:

This will allow the visualisation of the customer concentration patterns across age groups and genders at the same time, this allows a deeper understanding of insights beyond the basic counts.


In [55]:
elderly_df = pd.read_csv('../dataset/clean/elderly_customers.csv')

elderly_df['Age_Group'] = pd.cut(
    elderly_df['age'],
    bins=age_bins,
    labels=age_labels,
    right=False,
    include_lowest=True
).astype(str)


elderly_df['Age_Group'].replace('nan', age_labels[-1], inplace=True)


counts = elderly_df.groupby(['Age_Group', 'gender']).size().unstack(fill_value=0)

fig = px.imshow(counts,
                title='Customer Heatmap: Age vs Gender',
                color_continuous_scale='Blues',
                labels={'x': 'Gender', 'y': 'Age Group', 'color': 'Customers'})

fig.show()

Findings:  The heatmap confirms consistent male dominance across all elderly age segments, with the 60-69 age group showing the highest concentration of both male and female customers.

#### Testing Second Hypothesis:

I will now group customers into age brackets (65-69, 70-74, 75-79, 80+)
Â Then count how many customers are in each age group
Â After that I shall compare counts to find the largest group
Â Then I will create visualisations to clearly show the results
Lastly I willÂ determine if the hypothesis is supported or not

In [57]:
print("\n" + "="*60)
print("HYPOTHESIS : Most elderly customers are 65-69 years old")
print("="*60)

# Create my age groups
elderly_df['Age_Group'] = pd.cut(elderly_df['age'], 
                                 bins=[65, 70, 75, 80, 100],
                                 labels=['65-69', '70-74', '75-79', '80+'])

age_distribution = elderly_df['Age_Group'].value_counts().sort_index()
print("\nAge Distribution of Elderly Customers:")
print(age_distribution)

# Check if 65-69 is the largest group
largest_group = age_distribution.idxmax()
largest_count = age_distribution.max()

if largest_group == '65-69':
    print(f"\n HYPOTHESIS SUPPORTED!")
    print(f"65-69 is the largest age group with {largest_count} customers")
else:
    print(f"\n HYPOTHESIS NOT SUPPORTED")
    print(f"The largest age group is {largest_group} with {largest_count} customers")


HYPOTHESIS : Most elderly customers are 65-69 years old

Age Distribution of Elderly Customers:
Age_Group
65-69    355
70-74      0
75-79      0
80+        0
Name: count, dtype: int64

 HYPOTHESIS SUPPORTED!
65-69 is the largest age group with 355 customers


I used a bar chart to test this hypothesis because it provides the clearest visual comparison between age groups. The bar heights allow for immediate identification of which age group has the most customers, making it easy to determine if the 65-69 group is indeed the largest segment.

In [64]:
#Testing: Are most elderly customers 65-69 years old?

# 1. Create age groups
elderly_df['Age_Group'] = pd.cut(elderly_df['age'], [65,70,75,80,100], 
                                labels=['65-69','70-74','75-79','80+'])

counts = elderly_df['Age_Group'].value_counts()
fig = px.bar(
    x=counts.index, 
    y=counts.values,
    title='<b>Are Most Elderly Customers 65-69?</b>',
    labels={'x': 'Age Group', 'y': 'Customers'},
    color=counts.values,
    color_continuous_scale='Blues',
    text=counts.values
)
fig.update_traces(textposition='outside')
fig.update_layout(height=400)

fig.show()

The bar chart revealed that the 65-69 age group contains 335customers, making it the largest segment of elderly shoppers. This supports the hypothesis, indicating that the superstore's elderly customer base is predominantly in the youngest elderly age bracket.

**Key Finding from ETL:**
- Total elderly customers: 788
- Elderly men: 542 (68.8%)
- Elderly women: 246 (31.2%)
- **Hypothesis SUPPORTED**: More elderly men shop than women
### Key Findings pt2:
1. **Demographic Dominance**: Elderly men constitute nearly 70% of the elderly customer base
2. **Age Patterns**: Additional analysis confirmed that most elderly customers fall within the 65-69 age range
3. **Consistent Trends**: Male dominance persists across all analyzed age segments