# Python Practice Notebook 2 - Shop Customer Analysis

## Business Task

A mid-sized retail chain has implemented a membership card system to better understand how their customers behave. Every time a customer visits the store, their information gets recorded: demographics, profession, spending patterns, family size, and income level.

The company has collected 2,000 customer profiles, but it lacks the analytics capability to identify high-value customers, understand spending behavior, target promotions effectively, or spot groups at risk of churn.

**Analyze customer traits and behavior to help the business optimize marketing, segmentation, product offerings, and customer retention strategies.**

## Import Libraries & Load the dataset

In [35]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df=pd.read_csv("data/customers_dataset.csv")
df.head()

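|   | CustomerID | Gender | Age | Annual Income ($) | Spending Score (1-100) | Profession | Work Experience | Family Size |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15000 | 39 | Healthcare | 1 | 4 |
| 1 | 2 | Male | 21 | 35000 | 81 | Engineer | 3 | 3 |
| 2 | 3 | Female | 20 | 86000 | 6 | Engineer | 1 | 1 |
| 3 | 4 | Female | 23 | 59000 | 77 | Lawyer | 0 | 2 |
| 4 | 5 | Female | 31 | 38000 | 40 | Entertainment | 2 | 6 |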


## Data Exploration

In [3]:
df.shape

(2000, 8)

In [4]:
# Basic statistics
df.describe()

|   | CustomerID | Age | Annual Income ($) | Spending Score (1-100) | Work Experience | Family Size |
|---|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 1000.5 | 48.96 | 110731.8215 | 50.9625 | 4.1025 | 3.7685 |
| std | 577.494589 | 28.429747 | 45739.536688 | 27.934661 | 3.922204 | 1.970749 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 500.75 | 25.0 | 74572.0 | 28.0 | 1.0 | 2.0 |
| 50% | 1000.5 | 48.0 | 110045.0 | 50.0 | 3.0 | 4.0 |
| 75% | 1500.25 | 73.0 | 149092.75 | 75.0 | 7.0 | 5.0 |
| max | 2000.0 | 99.0 | 189974.0 | 100.0 | 17.0 | 9.0 |


**Why is the minimum value in the Age column 0? 🤔 The dataset contains a few unrealistic values.**
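Before deciding how to clean the data, it can help to count how many rows look implausible. The sketch below uses a small made-up frame, not the real dataset, and the threshold of 16 is an assumption for illustration.

```python
import pandas as pd

# Toy data, invented for illustration (not rows from the real dataset)
toy = pd.DataFrame({
    "Age": [0, 12, 19, 45, 99],
    "Work Experience": [1, 0, 3, 40, 10],
})

# Assumed threshold: 16 as the youngest plausible member age
MIN_AGE = 16

# Boolean masks sum to the number of True values
too_young = (toy["Age"] < MIN_AGE).sum()

# Work experience larger than the age itself can never be valid
impossible_experience = (toy["Work Experience"] > toy["Age"]).sum()

print(f"Rows below age {MIN_AGE}: {too_young}")
print(f"Rows with experience > age: {impossible_experience}")
```

Running the same two checks on `df` would show how many rows a cleaning rule is going to remove before it is applied.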

In [5]:
#Basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              2000 non-null   int64 
 1   Gender                  2000 non-null   object
 2   Age                     2000 non-null   int64 
 3   Annual Income ($)       2000 non-null   int64 
 4   Spending Score (1-100)  2000 non-null   int64 
 5   Profession              1965 non-null   object
 6   Work Experience         2000 non-null   int64 
 7   Family Size             2000 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 125.1+ KB


In [6]:
# check for null values
df.isnull().sum()

CustomerID                 0
Gender                     0
Age                        0
Annual Income ($)          0
Spending Score (1-100)     0
Profession                35
Work Experience            0
Family Size                0
dtype: int64

**The Profession column has 35 null values. We need to handle them. ✂️**

## Data Cleaning

Renaming the column names to shorter names.

In [7]:
#renaming the columns for convenience
df2 = df.rename(columns={"Annual Income ($)":"AnnualIncome",
              "Spending Score (1-100)":"SpendingScore",
              "Work Experience":"WorkExperience",
              "Family Size":"FamilySize"})
df2.head()

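|   | CustomerID | Gender | Age | AnnualIncome | SpendingScore | Profession | WorkExperience | FamilySize |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15000 | 39 | Healthcare | 1 | 4 |
| 1 | 2 | Male | 21 | 35000 | 81 | Engineer | 3 | 3 |
| 2 | 3 | Female | 20 | 86000 | 6 | Engineer | 1 | 1 |
| 3 | 4 | Female | 23 | 59000 | 77 | Lawyer | 0 | 2 |
| 4 | 5 | Female | 31 | 38000 | 40 | Entertainment | 2 | 6 |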


In [8]:
# filling the null values with "Unknown"
df2['Profession'] = df2['Profession'].fillna('Unknown')

# checking for null values
df2.isnull().sum()

CustomerID        0
Gender            0
Age               0
AnnualIncome      0
SpendingScore     0
Profession        0
WorkExperience    0
FamilySize        0
dtype: int64

💡 **35 customers were missing Profession data. Instead of removing these customers (which would reduce the sample size and could distort demographics), I replaced the missing profession values with “Unknown”, a common way to handle missing categories in customer analytics.**
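The effect of this choice can be seen on a tiny made-up frame (the rows below are illustrative only): filling with a placeholder keeps every customer, and the placeholder shows up as its own category in later group-bys.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Profession": ["Engineer", None, "Lawyer", None],
})

missing_before = toy["Profession"].isna().sum()

# Replace missing professions with an explicit placeholder category
toy["Profession"] = toy["Profession"].fillna("Unknown")

print(f"Missing before: {missing_before}")
print(f"Rows kept: {len(toy)}")
print(toy["Profession"].value_counts())
```

Dropping the rows with `dropna(subset=["Profession"])` would instead shrink the frame, which is the trade-off described above.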

In [15]:
# keep customers older than 16 and drop rows with unrealistic work experience
customer_df = df2[(df2["Age"]>16) & (df2["WorkExperience"]<= (df2["Age"]-14))]
customer_df.head()

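|   | CustomerID | Gender | Age | AnnualIncome | SpendingScore | Profession | WorkExperience | FamilySize |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15000 | 39 | Healthcare | 1 | 4 |
| 1 | 2 | Male | 21 | 35000 | 81 | Engineer | 3 | 3 |
| 2 | 3 | Female | 20 | 86000 | 6 | Engineer | 1 | 1 |
| 3 | 4 | Female | 23 | 59000 | 77 | Lawyer | 0 | 2 |
| 4 | 5 | Female | 31 | 38000 | 40 | Entertainment | 2 | 6 |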


💡 **The dataset includes very young customers, even 0-year-olds. So I kept only customers older than 16 (the filter is `Age > 16`, so the youngest remaining customer is 17) and also removed rows where work experience exceeded the maximum possible value (Age − 14).**
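The two conditions can be checked in isolation on a toy frame. The sketch below uses made-up rows and adds `.copy()` after filtering, which the notebook cell does not do. This is a suggestion, since it lets later column assignments work on an independent frame.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Age": [0, 16, 17, 25, 40],
    "WorkExperience": [1, 0, 5, 4, 30],
})

# Same rules as the notebook cell: older than 16,
# and experience no larger than Age - 14
age_ok = toy["Age"] > 16
experience_ok = toy["WorkExperience"] <= toy["Age"] - 14

# .copy() (an addition in this sketch) makes the result independent of `toy`
cleaned = toy[age_ok & experience_ok].copy()

# Adding a column to the copy does not touch the original frame
cleaned["Checked"] = True

print(cleaned)
```

Only the 25-year-old with 4 years of experience survives here: ages 0 and 16 fail the age rule, while the 17-year-old and the 40-year-old fail the experience rule.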

In [16]:
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1633 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   CustomerID      1633 non-null   int64 
 1   Gender          1633 non-null   object
 2   Age             1633 non-null   int64 
 3   AnnualIncome    1633 non-null   int64 
 4   SpendingScore   1633 non-null   int64 
 5   Profession      1633 non-null   object
 6   WorkExperience  1633 non-null   int64 
 7   FamilySize      1633 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 114.8+ KB


**Now the dataset looks clean and simple to work with! 🎉😎**

## Questions And Solutions

### Q1. Which age groups make up most of our customers?

First, let's create bins for the different age groups.

In [32]:
customer_df['Age'].max()

np.int64(99)

In [44]:
# Age bins
age_bins=[16,30,50,65,100]
age_labels = ['16-30', '30-50', '50-65', '65+']

In [45]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
customer_df = customer_df.assign(
    Age_Group=pd.cut(customer_df['Age'],
                     bins=age_bins,
                     labels=age_labels,
                     include_lowest=True))


In [46]:
# count the customers in each group
customer_df["Age_Group"].value_counts().sort_index()

Age_Group
16-30    251
30-50    424
50-65    317
65+      641
Name: count, dtype: int64
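The labels above overlap at their edges ("16-30" and "30-50" both mention 30), so it is worth checking where boundary ages land. By default `pd.cut` uses right-closed intervals, and `include_lowest=True` additionally closes the first interval on the left. A small check with hand-picked boundary ages:

```python
import pandas as pd

# Same bins and labels as in the notebook
age_bins = [16, 30, 50, 65, 100]
age_labels = ["16-30", "30-50", "50-65", "65+"]

# Hand-picked ages that sit on or next to the bin edges
ages = pd.Series([16, 30, 31, 50, 65, 66, 100])

groups = pd.cut(ages, bins=age_bins, labels=age_labels, include_lowest=True)

# Show each age next to the group it was assigned to
for age, group in zip(ages, groups.astype(str)):
    print(age, "->", group)
```

An age of exactly 30 falls into "16-30", 50 into "30-50", and 65 into "50-65", so the "65+" group actually starts at 66.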

❓ **641 customers aged 65+! Is that realistic? 🤔  
Let's inspect the dataset again!**

In [33]:
customer_df['Age'].describe()

count    1633.000000
mean       57.810778
std        23.535415
min        17.000000
25%        37.000000
50%        58.000000
75%        79.000000
max        99.000000
Name: Age, dtype: float64

**The dataset contains a very large number of customers aged 65 and above, and the maximum age is 99. This is not typical for a retail store and suggests the data may be synthetic or not fully realistic.  
To keep the analysis meaningful, we’ll clean the data by keeping ages between 16 and 75. We assume this range reflects the age group most likely to hold store memberships and make regular purchases.**

In [52]:
customer_df=customer_df[customer_df['Age']<=75]

customer_df["Age_Group"].value_counts().sort_index()

Age_Group
16-30    251
30-50    424
50-65    317
65+      178
Name: count, dtype: int64

After removing unrealistic ages and restricting the range to 16–75, the age structure becomes more consistent with real-world retail demographics. Note that the `65+` label now covers only ages 66–75.

**Solution:**
**30–50 is the largest customer group (424 of the 1,170 remaining customers) and should be the primary focus for marketing.**
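To put the counts in perspective, they can be converted to percentage shares. The sketch below rebuilds the counts reported in the output above as a Series rather than reloading the data. On the real frame, `value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output after the age filter
counts = pd.Series(
    {"16-30": 251, "30-50": 424, "50-65": 317, "65+": 178},
    name="count",
)

# Share of each age group in percent, rounded to one decimal
shares = (counts / counts.sum() * 100).round(1)

print(f"Total customers: {counts.sum()}")
print(shares)
print(f"Largest group: {shares.idxmax()}")
```

The 30-50 group accounts for a little over a third of the remaining customers, which supports the conclusion above.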

### Q2. Do men or women spend more?

### Q3. Which income bracket dominates our customer base?

### Q4. Who are the “high-potential but low-spend” customers?

### Q5. Which professions spend the most?

### Q6. Do younger customers spend more?

### Q7. Do highly experienced workers spend less?

### Q8. Do larger families spend more?
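The remaining questions are not yet answered in this notebook. Most of them compare spending across groups, so one possible starting pattern is a group-by summary of `SpendingScore`. The sketch below uses made-up rows, so the numbers it prints say nothing about the real customers.

```python
import pandas as pd

# Toy data, invented for illustration (not results from the dataset)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "SpendingScore": [40, 80, 60, 20, 70],
})

# Mean, median, and group size of the spending score per group
summary = toy.groupby("Gender")["SpendingScore"].agg(["mean", "median", "count"])

print(summary)
```

Swapping `"Gender"` for `"Profession"`, `"FamilySize"`, or a binned column such as `"Age_Group"` would apply the same pattern to Q5, Q8, or Q6. Reporting the group size alongside the mean helps flag groups that are too small to compare.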