In [2]:
# Inspect the Dataset

import pandas as pd

# Load dataset
df = pd.read_csv(r"..\data\raw\marketing_campaign.csv", encoding='ISO-8859-1')

# First look
print("Shape:", df.shape)
print("\nColumns:\n", df.columns)

# Peek at data
print("\nHead:\n", df.head())

# Data types & non-null counts
print("\nInfo:\n")
df.info()

# Summary stats (numerical & categorical)
print("\nNumeric Summary:\n", df.describe())
print("\nCategorical Summary:\n", df.describe(include=['object']))

# Check for duplicates
print("\nDuplicates count:", df.duplicated().sum())

# Cardinality: unique values per column (helps spot IDs, categoricals, and constant columns)
print("\nUnique values per column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()}")

Shape: (2240, 29)

Columns:
 Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='str')

Head:
      ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0 

See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
  print("\nCategorical Summary:\n", df.describe(include=['object']))


Data Inspection Summary

1. General Overview
•	Total records: 2,240 customers
•	Total features: 29 columns
•	Duplicates: None
•	Memory usage: ~507 KB
•	Data types:
o	25 integer
o	1 float (Income)
o	3 string (Education, Marital_Status, Dt_Customer; the last holds dates stored as text)
The dataset is clean in structure with no duplicate rows.
________________________________________
2. Missing Values
•	Income has 24 missing values (2216 non-null out of 2240).
•	All other variables are complete.
With about 1% of its values missing, Income will need imputation (or removal of the 24 affected rows) during data cleaning.
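One possible way to handle the missing Income values is median imputation, sketched below. This is an illustration rather than the cleaning step actually used: the helper name `impute_income`, the `Income_missing` flag column, and the toy values are assumptions introduced here.

```python
import pandas as pd


def impute_income(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with missing Income filled by the column median.

    Hypothetical helper; the median is chosen because it is less
    sensitive to extreme incomes than the mean.
    """
    out = df.copy()
    # Keep a flag so later steps can tell imputed rows from observed ones.
    out["Income_missing"] = out["Income"].isna()
    out["Income"] = out["Income"].fillna(out["Income"].median())
    return out


# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({"Income": [58138.0, None, 71613.0, 26646.0]})
filled = impute_income(toy)
print(filled)
```

The flag column is optional; it simply preserves the information that a value was originally missing.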
________________________________________
3. Customer Demographics
Age (Year_Birth)
•	Range: 1893–1996
•	Mean birth year: 1968
•	Some extreme values (e.g., 1893) imply ages well over 100 and are most likely data-entry errors → treat as outliers.
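A minimal sketch of how implausible birth years could be flagged by deriving an age. The reference year and the 100-year cutoff are assumptions chosen for illustration, not values taken from the dataset, and `flag_implausible_ages` is a hypothetical helper.

```python
import pandas as pd


def flag_implausible_ages(df: pd.DataFrame, reference_year: int,
                          max_age: int = 100) -> pd.DataFrame:
    """Add Age and a boolean Age_implausible column (hypothetical helper)."""
    out = df.copy()
    out["Age"] = reference_year - out["Year_Birth"]
    # Anything above the cutoff is flagged for review rather than dropped.
    out["Age_implausible"] = out["Age"] > max_age
    return out


# Toy birth years; reference_year=2015 is an assumed value for this example.
toy_births = pd.DataFrame({"Year_Birth": [1957, 1893, 1996]})
flagged = flag_implausible_ages(toy_births, reference_year=2015)
print(flagged)
```

Flagging instead of deleting keeps the decision (drop, cap, or correct) explicit in a later cleaning step.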
Education (5 categories)
•	Most common: Graduation (1127 customers, ~50%)
Marital Status (8 categories)
•	Most common: Married (864 customers)
________________________________________
4. Income Distribution
•	Mean: $52,247
•	Median: $51,381
•	Max: $666,666 (extreme outlier)
•	Min: $1,730
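The mean and median are close, so the bulk of the distribution is not heavily skewed; the long right tail is driven by a few extreme values such as $666,666.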
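Extreme incomes like these can be flagged with the common 1.5 × IQR rule, sketched below. The rule and its multiplier are a conventional choice assumed here, not something the summary prescribes; `iqr_outlier_mask` and the toy values are illustrative.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # Missing values compare as False, so they are never flagged.
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy incomes (illustrative) including one extreme value.
toy_income = pd.Series([1730.0, 26646.0, 46344.0, 58138.0, 71613.0, 666666.0])
mask = iqr_outlier_mask(toy_income)
print(toy_income[mask])
```

On the real column, the rows where the mask is True would be candidates for capping or removal.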
________________________________________
5. Household Composition
•	Kidhome: 0–2 children (mean ≈ 0.44)
•	Teenhome: 0–2 teenagers (mean ≈ 0.51)
Both means are around 0.5 on a 0–2 scale, so for each variable a large share of customers report none at home.
________________________________________
6. Spending Behavior (Product Categories)
Average spending:
Category	Mean Spending
Wines	303.9
Meat	166.9
Fish	37.5
Fruits	26.3
Sweets	(lower compared to wine/meat)
Gold	moderate
Customers spend significantly more on Wines and Meat compared to other products.
Spending variables are highly skewed with large max values → they will likely need a log transform or scaling before modeling.
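As one way to tame the skew, a `log1p` transform can be applied to the spending columns before any scaling. This is a sketch under assumptions: the choice of `log1p`, the `_log` suffix, and the toy amounts are introduced here for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

SPEND_COLS = [
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
]


def log_transform_spending(df: pd.DataFrame, cols=SPEND_COLS) -> pd.DataFrame:
    """Add a <col>_log column for each spending column present in df."""
    out = df.copy()
    for col in cols:
        if col in out.columns:
            # log1p(x) = log(1 + x), which is defined for zero spending.
            out[f"{col}_log"] = np.log1p(out[col])
    return out


# Toy amounts (illustrative only).
toy_spend = pd.DataFrame({"MntWines": [0, 11, 426, 1493],
                          "MntFruits": [0, 1, 49, 199]})
transformed = log_transform_spending(toy_spend)
print(transformed)
```

The original columns are kept, so the transformed and raw distributions can be compared side by side.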
________________________________________
7. Purchase Behavior
•	Web purchases: up to 15
•	Store purchases: up to 14
•	Catalog purchases: up to 14
•	Web visits per month: avg 5.3
Customers use multiple channels, making this dataset suitable for segmentation.
________________________________________
8. Campaign Response Behavior
Binary variables (0 = No, 1 = Yes):
•	AcceptedCmp1–5: low acceptance rates (~6–7%)
•	Overall Response rate: 14.9%
This suggests:
•	Campaign acceptance is relatively low.
•	The Response variable can serve as the target for supervised modeling if needed; at 14.9% positive, the classes are imbalanced.
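Because these columns are coded 0/1, the acceptance rate of each is simply its mean. The sketch below shows the calculation on a toy frame; `acceptance_rates` is a hypothetical helper and the toy values are not taken from the dataset.

```python
import pandas as pd


def acceptance_rates(df: pd.DataFrame, cols: list) -> pd.Series:
    """Return the percentage of 1s in each binary column, rounded to 0.1."""
    return df[cols].mean().mul(100).round(1)


# Toy binary flags (illustrative only).
toy_campaigns = pd.DataFrame({
    "AcceptedCmp1": [0, 0, 0, 1],
    "Response": [1, 0, 0, 1],
})
rates = acceptance_rates(toy_campaigns, ["AcceptedCmp1", "Response"])
print(rates)
```

On the real data, the column list would be the five AcceptedCmp columns plus Response.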
________________________________________
9. Constant Columns (No Variance)
•	Z_CostContact = 3 (constant)
•	Z_Revenue = 11 (constant)
These provide no predictive value and should be removed during preprocessing.
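Constant columns can be detected generically instead of being hard-coded, as in the sketch below. The helper name `drop_constant_columns` and the toy frame are assumptions for illustration.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame):
    """Return (reduced frame, list of dropped column names)."""
    # A column with at most one distinct value carries no information.
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant), constant


# Toy frame mirroring the two constant columns noted above.
toy_const = pd.DataFrame({
    "ID": [5524, 2174, 4141],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})
reduced, dropped = drop_constant_columns(toy_const)
print(dropped)
```

Returning the list of dropped names makes the preprocessing step easy to log and audit.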
________________________________________
Key Data Issues Identified
1.	Missing values in Income
2.	Unrealistic birth years (e.g., 1893)
3.	Extreme income outlier (666,666)
4.	Constant columns (Z_CostContact, Z_Revenue)
5.	Skewed spending variables
6.	Dt_Customer stored as string (needs datetime conversion)
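For the last issue, the datetime conversion could look like the sketch below. The day-month-year format string is an assumption, since the truncated output above does not show the Dt_Customer column; check it against the actual values before use. `parse_customer_dates` is a hypothetical helper.

```python
import pandas as pd


def parse_customer_dates(df: pd.DataFrame, fmt: str = "%d-%m-%Y") -> pd.DataFrame:
    """Return a copy with Dt_Customer parsed to datetime.

    fmt is an assumed format; values that do not match become NaT
    instead of raising, so they can be counted and inspected.
    """
    out = df.copy()
    out["Dt_Customer"] = pd.to_datetime(
        out["Dt_Customer"], format=fmt, errors="coerce"
    )
    return out


# Toy strings (illustrative), including one value that cannot be parsed.
toy_dates = pd.DataFrame({"Dt_Customer": ["04-09-2012", "08-03-2014", "not a date"]})
parsed = parse_customer_dates(toy_dates)
print(parsed)
print("Unparseable:", parsed["Dt_Customer"].isna().sum())
```

A non-zero count of unparseable values after conversion would suggest the assumed format is wrong or the column mixes formats.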
