# **Consumer age behaviours**



## Hypothesis: More elderly men shop at superstores than elderly women

## Objectives


Test the Hypothesis: Determine whether more elderly men than elderly women shop at the superstore

Data Quality: Clean and prepare the Kaggle shopping trends dataset for analysis

Visualization: Create 3-4 different visualizations to communicate my findings

Statistical Validation: Use basic statistical methods to validate results

Pandas Proficiency: Demonstrate ability to manipulate and analyse data with Pandas

Visualization Skills: Create bar charts, pie charts, and area plots using Matplotlib and Plotly



## Inputs

Dataset: shopping_trends_updated.csv (raw data)

Hypothesis: More elderly men than elderly women shop at the superstore.

## Process
Load data

Clean data (remove duplicates, handle missing values)

Filter for elderly (Age >= 60)

Count by gender

Create visualizations

Test hypothesis
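The steps above can be sketched as a single function. This is a minimal sketch on a small made-up DataFrame, not the notebook's actual code: the function name `count_elderly_by_gender`, the `min_age` parameter and the sample rows are assumptions for illustration.

```python
import pandas as pd


def count_elderly_by_gender(df: pd.DataFrame, min_age: int = 60) -> pd.Series:
    """Clean the frame, keep customers aged `min_age` or older, and count them by gender."""
    # Remove exact duplicate rows and rows missing the two columns we rely on
    cleaned = df.drop_duplicates().dropna(subset=["Age", "Gender"])
    elderly = cleaned[cleaned["Age"] >= min_age]
    return elderly["Gender"].value_counts()


# Made-up rows for illustration only (not taken from the Kaggle dataset);
# the last two rows are exact duplicates of each other
sample = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4, 4],
    "Age": [72, 61, 35, 66, 66],
    "Gender": ["Male", "Female", "Male", "Male", "Male"],
})

counts = count_elderly_by_gender(sample)
print(counts)
```

Keeping the age threshold as a parameter makes it easy to check how sensitive the result is to the definition of "elderly", for example `count_elderly_by_gender(sample, min_age=65)`.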

## Outputs

Clean dataset (no missing data or duplicates)

Statistical results (print to console)

Business insights (documented in README)


Import the Python libraries needed for data manipulation, analysis and visualisation.




In [143]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px


Load the raw dataset and perform an initial inspection to understand its structure and contents.


In [144]:
# The '../' prefix tells the code to step up one directory from the notebook's location
df = pd.read_csv('../dataset/raw/shopping_trends_updated.csv')
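A relative path such as `../dataset/raw/...` is resolved against the current working directory, so the load can fail if the notebook is started from a different folder. The following is a minimal sketch of a more defensive way to build the path with `pathlib`. The helper name `raw_data_path` and its `base_dir` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def raw_data_path(base_dir: Path, filename: str = "shopping_trends_updated.csv") -> Path:
    """Build the path <base_dir>/../dataset/raw/<filename> without touching the disk."""
    return base_dir.parent / "dataset" / "raw" / filename


# Assumes the notebook runs from a folder that sits next to 'dataset/'
csv_path = raw_data_path(Path.cwd())
if not csv_path.exists():
    print(f"Dataset not found at {csv_path}; check the working directory")
```

The resulting `csv_path` can be passed straight to `pd.read_csv(csv_path)`.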

Display the first rows to explore the column headings; this makes it easier to categorise the data.

In [145]:
print(df.head())

   Customer ID  Age Gender Item Purchased  Category  Purchase Amount (USD)  \
0            1   55   Male         Blouse  Clothing                     53   
1            2   19   Male        Sweater  Clothing                     64   
2            3   50   Male          Jeans  Clothing                     73   
3            4   21   Male        Sandals  Footwear                     90   
4            5   45   Male         Blouse  Clothing                     49   

        Location Size      Color  Season  Review Rating Subscription Status  \
0       Kentucky    L       Gray  Winter            3.1                 Yes   
1          Maine    L     Maroon  Winter            3.1                 Yes   
2  Massachusetts    S     Maroon  Spring            3.1                 Yes   
3   Rhode Island    M     Maroon  Spring            3.5                 Yes   
4         Oregon    M  Turquoise  Spring            2.7                 Yes   

   Shipping Type Discount Applied Promo Code Used  Previ

*Checking data shapes*

Clean the dataset by handling missing values, removing duplicates, and creating necessary features for analysis.

In [146]:
print("Initial dataset info:")
print(f"Shape: {df.shape}")
# These calls help check whether the data types are correct and whether any data is missing
df.info()
df.describe()
print(df.isnull())

Initial dataset info:
Shape: (3900, 18)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Customer ID             3900 non-null   int64  
 1   Age                     3900 non-null   int64  
 2   Gender                  3900 non-null   object 
 3   Item Purchased          3900 non-null   object 
 4   Category                3900 non-null   object 
 5   Purchase Amount (USD)   3900 non-null   int64  
 6   Location                3900 non-null   object 
 7   Size                    3900 non-null   object 
 8   Color                   3900 non-null   object 
 9   Season                  3900 non-null   object 
 10  Review Rating           3900 non-null   float64
 11  Subscription Status     3900 non-null   object 
 12  Shipping Type           3900 non-null   object 
 13  Discount Applied        3900 non-null   object 
 14  

In [147]:
#To clean my data, I need to check if there are any missing values.
print("\nMissing Values:")
print(df.isnull().sum())


Missing Values:
Customer ID               0
Age                       0
Gender                    0
Item Purchased            0
Category                  0
Purchase Amount (USD)     0
Location                  0
Size                      0
Color                     0
Season                    0
Review Rating             0
Subscription Status       0
Shipping Type             0
Discount Applied          0
Promo Code Used           0
Previous Purchases        0
Payment Method            0
Frequency of Purchases    0
dtype: int64


In [148]:
#check if there are any duplicates in my dataset
df.duplicated().sum()
#There are no duplicates in my dataset

0

This shows that the dataset contains no duplicate rows and no missing values.
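This dataset needed no cleaning, but the same checks can be wrapped in a reusable step for data that does contain gaps or repeats. The following is a minimal sketch on made-up rows. `clean_frame` is a hypothetical helper, and dropping incomplete rows is only one possible policy (filling values is another).

```python
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with exact duplicate rows and rows containing missing values removed."""
    cleaned = df.drop_duplicates().dropna().reset_index(drop=True)
    # Fail loudly if either problem is still present
    assert cleaned.duplicated().sum() == 0
    assert cleaned.isnull().sum().sum() == 0
    return cleaned


# Made-up rows: one exact duplicate and one missing age
messy = pd.DataFrame({
    "Age": [64, 64, None, 30],
    "Gender": ["Male", "Male", "Female", "Female"],
})
tidy = clean_frame(messy)
print(f"Rows before: {len(messy)}, rows after: {len(tidy)}")
```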

Checking through data types:

In [149]:
# Check whether any data types are incorrect or need changing
df.dtypes

Customer ID                 int64
Age                         int64
Gender                     object
Item Purchased             object
Category                   object
Purchase Amount (USD)       int64
Location                   object
Size                       object
Color                      object
Season                     object
Review Rating             float64
Subscription Status        object
Shipping Type              object
Discount Applied           object
Promo Code Used            object
Previous Purchases          int64
Payment Method             object
Frequency of Purchases     object
dtype: object

Checking through the age column

I will now explore the dataset to understand its distributions, identify its patterns and prepare to test my hypothesis.


In [150]:
#I am now checking the age column
print(df.Age.unique())

[55 19 50 21 45 46 63 27 26 57 53 30 61 65 64 25 52 66 31 56 18 38 54 33
 36 35 29 70 69 67 20 39 42 68 49 59 47 40 41 48 22 24 44 37 58 32 62 51
 28 43 34 23 60]


Checking through the gender column

In [151]:
#I am now checking gender column
print(df.Gender.unique())

['Male' 'Female']


Standardising the column names (lowercase, with underscores instead of spaces) to make them easier to reference in code

In [152]:
# This will make it easier for me to type columns later on.
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
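To make the effect of this step concrete, here is a small sketch that applies the same chain of string methods to three of the dataset's original column names. The variable names `columns` and `normalised` are only for illustration.

```python
import pandas as pd

# Three of the original column names from the dataset
columns = pd.Index(["Customer ID", "Purchase Amount (USD)", "Frequency of Purchases"])

# Same chain as in the notebook: trim whitespace, lowercase, spaces to underscores
normalised = columns.str.strip().str.lower().str.replace(" ", "_")
print(list(normalised))
```

Only spaces are replaced, so parentheses are left in place and the purchase amount column becomes `purchase_amount_(usd)`.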

*Defining Elderly*

In [153]:
# I will now define elderly as 60+
# The column is named 'is_elderly' (with an underscore) so that later cells can find it
df['is_elderly'] = df['age'] >= 60
print(df['is_elderly'].value_counts())

is_elderly
False    3112
True      788
Name: count, dtype: int64


Counting the elderly customers and then calculating what proportion of all customers they represent

In [154]:
# Ensure a consistent column exists for elderly flag (created earlier as 'is_elderly')
if 'is_elderly' not in df.columns:
    df['is_elderly'] = df['age'] >= 60

# Checking the total of elderly customers and their percentage
print(f"\nTotal elderly customers: {df['is_elderly'].sum()}")
print(f"Percentage of elderly customers: {(df['is_elderly'].sum() / len(df) * 100):.2f}%")


Total elderly customers: 788
Percentage of elderly customers: 20.21%


I will now filter for elderly customers only.
I will count them by gender and create a subset for visualisation.


In [155]:
# 1. Flag elderly customers (aged 60 and over)
df['is_elderly'] = df['age'] >= 60

# 2. Keep only the elderly rows
elderly_df = df[df['is_elderly']]

# 3. Count by gender (the column name was lowercased in the cleaning step)
gender_counts = elderly_df['gender'].value_counts()

print("Elderly customers by gender:")
print(gender_counts)
print(f"\nRatio (men/women): {gender_counts['Male']/gender_counts['Female']:.2f}")

# Summary table used by the Matplotlib charts
elderly_summary = pd.DataFrame({
    'Gender': gender_counts.index,
    'Count': gender_counts.values
})


Elderly customers by gender:
gender
Male      542
Female    246
Name: count, dtype: int64

Ratio (men/women): 2.20
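The ratio and the related shares follow directly from the counts printed above. This short worked calculation uses only those counts and the dataset's 3900 rows. The variable names are chosen for illustration.

```python
men, women = 542, 246          # elderly counts printed above
elderly_total = men + women    # 788 elderly customers
all_customers = 3900           # rows in the dataset

ratio = men / women                                   # elderly men per elderly woman
men_share = men / elderly_total * 100                 # % of elderly customers who are men
women_share = women / elderly_total * 100             # % of elderly customers who are women
elderly_share = elderly_total / all_customers * 100   # % of all customers who are elderly

print(f"Men per woman: {ratio:.2f}")
print(f"Men: {men_share:.2f}% of elderly customers")
print(f"Women: {women_share:.2f}% of elderly customers")
print(f"Elderly: {elderly_share:.2f}% of all customers")
```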


Now I am creating visualisations with Matplotlib to show my data


For a bar chart and a pie chart, drawn side by side in one figure


In [156]:
# Create one figure with two axes; ax1 and ax2 must exist before anything is drawn on them
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

colors = ['blue', 'pink']

# Bar chart
ax1.bar(elderly_summary['Gender'], elderly_summary['Count'], color=colors)
ax1.set_title('Elderly Customers by Gender\n(Men vs Women)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Gender')
ax1.set_ylabel('Number of Customers')
ax1.grid(axis='y', alpha=0.3)
# Add count labels on the bars
for i, count in enumerate(elderly_summary['Count']):
    ax1.text(i, count + 5, str(count), ha='center', fontweight='bold')

# Pie chart (autopct prints each slice's percentage)
ax2.pie(elderly_summary['Count'], labels=elderly_summary['Gender'], colors=colors,
        autopct='%1.1f%%', startangle=90)
ax2.set_title('Percentage Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

Now I will use Plotly for data visualisation


Using a bar chart to help visualise the count comparison between elderly men and women.


In [158]:
elderly_summary = elderly_df['gender'].value_counts().reset_index()

# Rename the columns so they are easier to read
elderly_summary.columns = ['gender', 'count']

# Create the bar chart with Plotly
fig = px.bar(elderly_summary, 
             x='gender', 
             y='count', 
             title='Elderly Customers by Gender',
             color='gender',
             template='plotly_dark') 

fig.show()

An area plot gives another view of the same counts. With only two gender categories it shows the size of the gap between the groups rather than a trend over time.


In [159]:
fig3 = px.area(elderly_summary, 
               x='gender', 
               y='count',
               title='Elderly Customers by Gender - Simple Area Plot',
               markers=True,  # Adds dots to the area
               color='gender',  # Different colors for each gender
               template='plotly_dark')

fig3.update_traces(
    line=dict(width=3),
    marker=dict(size=10)
)

fig3.update_layout(
    xaxis_title='Gender',
    yaxis_title='Number of Customers'
)

fig3.show()

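A density plot will help analyse the age distribution for elderly men and women. Plotly Express's `density_contour` is designed for two variables (`x` and `y`), so with age as the only variable a density-normalised histogram is used here as an approximation of a KDE.


In [160]:
# One variable only: overlay a density-normalised histogram per gender
fig_kde = px.histogram(elderly_df,
                       x='age',
                       color='gender',
                       histnorm='probability density',
                       barmode='overlay',
                       opacity=0.6,
                       title='Age Density of Elderly Customers',
                       template='plotly_dark')

fig_kde.update_layout(
    xaxis_title='Age',
    yaxis_title='Density',
    height=500
)

fig_kde.show()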



A pie chart makes the proportional distribution of elderly customers by gender easy to see.


In [161]:
fig_pie = px.pie(elderly_summary, 
                 values='count', 
                 names='gender',
                 title='Final Analysis: Elderly Customer Gender Split',
                 template='plotly_dark')

fig_pie.show()

### Conclusion:

I will now summarise my findings to state whether my hypothesis is supported.


In [164]:
# Use the existing elderly_summary dataframe (columns: 'gender', 'count')
men_count = int(elderly_summary[elderly_summary['gender'] == 'Male']['count'].values[0])
women_count = int(elderly_summary[elderly_summary['gender'] == 'Female']['count'].values[0])

print("=" * 60)
print("HYPOTHESIS TEST RESULTS")
print("=" * 60)
print(f"Elderly Men Count: {men_count}")
print(f"Elderly Women Count: {women_count}")

if men_count > women_count:
    print(f"\n✅ HYPOTHESIS SUPPORTED")
    print(f"More elderly men ({men_count}) shop than elderly women ({women_count})")
    print(f"Difference: {men_count - women_count} more men")
elif men_count < women_count:
    print(f"\n❌ HYPOTHESIS NOT SUPPORTED")
    print(f"More elderly women ({women_count}) shop than elderly men ({men_count})")
    print(f"Difference: {women_count - men_count} more women")
else:
    print(f"\n⚖️  NO DIFFERENCE")
    print(f"Elderly men and women shop in equal numbers: {men_count} each")
print("=" * 60)

HYPOTHESIS TEST RESULTS
Elderly Men Count: 542
Elderly Women Count: 246

✅ HYPOTHESIS SUPPORTED
More elderly men (542) shop than elderly women (246)
Difference: 296 more men
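The comparison above only checks which count is larger. A basic statistical check, in line with the Statistical Validation objective, is an exact one-sided binomial test on the two counts. The sketch below is not part of the original analysis. The helper `binomial_upper_tail` is hypothetical, and the null hypothesis of an equal 50/50 split (`p0 = 0.5`) is an assumption. If the whole customer base is already skewed towards one gender, the overall male share would be a fairer value for `p0`.

```python
import math


def binomial_upper_tail(successes: int, trials: int, p0: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p0), with 0 < p0 < 1."""
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must be strictly between 0 and 1")
    log_p, log_q = math.log(p0), math.log1p(-p0)
    total = 0.0
    for k in range(successes, trials + 1):
        # log of C(trials, k) * p0^k * (1 - p0)^(trials - k), computed in log space
        log_term = (math.lgamma(trials + 1) - math.lgamma(k + 1)
                    - math.lgamma(trials - k + 1)
                    + k * log_p + (trials - k) * log_q)
        total += math.exp(log_term)
    return total


men_count, women_count = 542, 246  # counts from the results above
p_value = binomial_upper_tail(men_count, men_count + women_count)

alpha = 0.05  # conventional significance level, chosen here as an assumption
print(f"One-sided p-value under a 50/50 null: {p_value:.3g}")
print("Reject the 50/50 null" if p_value < alpha else "Cannot reject the 50/50 null")
```

A small p-value here says only that the elderly gender split is unlikely under an even split. It does not by itself show that elderly men are over-represented relative to men in the dataset as a whole.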
