# Phase 2: Data Exploration and Understanding

**Project**: Customer Purchase Behavior Analysis  
**Phase**: 2 - Data Understanding & Initial Profiling  
**Date**: November 7, 2025  

---

## Objective
- Load the e-commerce dataset
- Perform initial data profiling
- Understand data structure, types, and quality
- Identify missing values and outliers
- Generate summary statistics

## 1. Setup and Imports

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Plotting settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print("✅ Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

✅ Libraries imported successfully!
Pandas version: 2.0.3
NumPy version: 1.24.3


## 2. Load Dataset

In [2]:
# Define file path
DATA_PATH = Path('../dataset/E-commerce Customer Behavior - Sheet1.csv')

# Load dataset
print("Loading dataset...")
df = pd.read_csv(DATA_PATH)
print("✅ Dataset loaded successfully!\n")

# Display basic information
print(f"Dataset Shape: {df.shape}")
print(f"Number of Rows: {df.shape[0]:,}")
print(f"Number of Columns: {df.shape[1]}")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

Loading dataset...
✅ Dataset loaded successfully!

Dataset Shape: (350, 11)
Number of Rows: 350
Number of Columns: 11
Memory Usage: 0.10 MB


## 3. Initial Data Inspection

In [3]:
# Display first few rows
print("First 10 rows of the dataset:")
df.head(10)

First 10 rows of the dataset:


| | Customer ID | Gender | Age | City | Membership Type | Total Spend | Items Purchased | Average Rating | Discount Applied | Days Since Last Purchase | Satisfaction Level |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101 | Female | 29 | New York | Gold | 1120.2 | 14 | 4.6 | True | 25 | Satisfied |
| 1 | 102 | Male | 34 | Los Angeles | Silver | 780.5 | 11 | 4.1 | False | 18 | Neutral |
| 2 | 103 | Female | 43 | Chicago | Bronze | 510.75 | 9 | 3.4 | True | 42 | Unsatisfied |
| 3 | 104 | Male | 30 | San Francisco | Gold | 1480.3 | 19 | 4.7 | False | 12 | Satisfied |
| 4 | 105 | Male | 27 | Miami | Silver | 720.4 | 13 | 4.0 | True | 55 | Unsatisfied |
| 5 | 106 | Female | 37 | Houston | Bronze | 440.8 | 8 | 3.1 | False | 22 | Neutral |
| 6 | 107 | Female | 31 | New York | Gold | 1150.6 | 15 | 4.5 | True | 28 | Satisfied |
| 7 | 108 | Male | 35 | Los Angeles | Silver | 800.9 | 12 | 4.2 | False | 14 | Neutral |
| 8 | 109 | Female | 41 | Chicago | Bronze | 495.25 | 10 | 3.6 | True | 40 | Unsatisfied |
| 9 | 110 | Male | 28 | San Francisco | Gold | 1520.1 | 21 | 4.8 | False | 9 | Satisfied |


In [None]:
# Display last few rows
print("Last 5 rows of the dataset:")
df.tail()

In [None]:
# Display random sample
print("Random sample of 5 rows:")
df.sample(5)

## 4. Data Structure Analysis

In [4]:
# Display column names
print("Column Names:")
print(df.columns.tolist())
print(f"\nTotal Columns: {len(df.columns)}")

Column Names:
['Customer ID', 'Gender', 'Age', 'City', 'Membership Type', 'Total Spend', 'Items Purchased', 'Average Rating', 'Discount Applied', 'Days Since Last Purchase', 'Satisfaction Level']

Total Columns: 11


In [5]:
# Display data types
print("Data Types:")
print(df.dtypes)
print("\nData Type Distribution:")
print(df.dtypes.value_counts())

Data Types:
Customer ID                   int64
Gender                       object
Age                           int64
City                         object
Membership Type              object
Total Spend                 float64
Items Purchased               int64
Average Rating              float64
Discount Applied               bool
Days Since Last Purchase      int64
Satisfaction Level           object
dtype: object

Data Type Distribution:
int64      4
object     4
float64    2
bool       1
Name: count, dtype: int64


In [6]:
# Detailed information
print("Detailed Dataset Information:")
df.info()

Detailed Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Customer ID               350 non-null    int64  
 1   Gender                    350 non-null    object 
 2   Age                       350 non-null    int64  
 3   City                      350 non-null    object 
 4   Membership Type           350 non-null    object 
 5   Total Spend               350 non-null    float64
 6   Items Purchased           350 non-null    int64  
 7   Average Rating            350 non-null    float64
 8   Discount Applied          350 non-null    bool   
 9   Days Since Last Purchase  350 non-null    int64  
 10  Satisfaction Level        348 non-null    object 
dtypes: bool(1), float64(2), int64(4), object(4)
memory usage: 27.8+ KB


## 5. Missing Values Analysis

In [7]:
# Check for missing values
print("Missing Values Analysis:")
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': missing.values,
    'Missing_Percentage': missing_pct.values
}).sort_values('Missing_Count', ascending=False)

print(missing_df[missing_df['Missing_Count'] > 0])

if missing_df['Missing_Count'].sum() == 0:
    print("\n✅ No missing values found!")
else:
    print(f"\n⚠️ Total missing values: {missing_df['Missing_Count'].sum():,}")

Missing Values Analysis:
                Column  Missing_Count  Missing_Percentage
10  Satisfaction Level              2                0.57

⚠️ Total missing values: 2
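
The count above says how many values are missing but not which customers they belong to. A minimal sketch of pulling out the affected rows, using a small toy frame (the values are illustrative and not taken from the CSV):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; values are made up for illustration
toy = pd.DataFrame({
    "Customer ID": [101, 102, 103, 104],
    "Membership Type": ["Gold", "Silver", "Bronze", "Gold"],
    "Satisfaction Level": ["Satisfied", np.nan, "Unsatisfied", np.nan],
})

# Keep only rows where at least one column is missing
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(2))
```

Running the same filter on `df` would list the two rows with no `Satisfaction Level`, which helps decide whether to drop or impute them during cleaning.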


In [None]:
# Visualize missing values
if missing_df['Missing_Count'].sum() > 0:
    plt.figure(figsize=(12, 6))
    missing_cols = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=True)
    plt.barh(missing_cols['Column'], missing_cols['Missing_Percentage'])
    plt.xlabel('Missing Percentage (%)')
    plt.title('Missing Values by Column')
    plt.tight_layout()
    plt.show()
else:
    print("No missing values to visualize!")

## 6. Duplicate Records Check

In [8]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates:,}")
print(f"Percentage of duplicates: {(duplicates/len(df)*100):.2f}%")

if duplicates > 0:
    print("\n⚠️ Duplicate rows found! Will need to handle in cleaning phase.")
else:
    print("\n✅ No duplicate rows found!")

Number of duplicate rows: 0
Percentage of duplicates: 0.00%

✅ No duplicate rows found!
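
Note that `df.duplicated()` only flags rows that match in every column. A customer ID repeated with different values would pass this check. Section 9 shows `Customer ID` is 100% unique here, so the sketch below is only a general pattern, shown on toy data with made-up values:

```python
import pandas as pd

# Toy data: customer 102 appears twice with different spend values
toy = pd.DataFrame({
    "Customer ID": [101, 102, 102, 103],
    "Total Spend": [1120.20, 780.50, 795.00, 510.75],
})

# Full-row check: finds nothing, because no two rows are identical
full_row_dupes = toy.duplicated().sum()

# Key-based check: flags the second occurrence of a repeated ID
id_dupes = toy.duplicated(subset=["Customer ID"]).sum()

# keep=False marks every occurrence, which makes the repeated IDs easy to list
repeated_ids = (
    toy.loc[toy.duplicated(subset=["Customer ID"], keep=False), "Customer ID"]
    .unique()
    .tolist()
)

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Duplicate IDs: {id_dupes} (IDs: {repeated_ids})")
```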


## 7. Statistical Summary

In [9]:
# Summary statistics for numerical columns
print("Statistical Summary - Numerical Columns:")
df.describe()

Statistical Summary - Numerical Columns:


| | Customer ID | Age | Total Spend | Items Purchased | Average Rating | Days Since Last Purchase |
|---|---|---|---|---|---|---|
| count | 350.0 | 350.0 | 350.0 | 350.0 | 350.0 | 350.0 |
| mean | 275.5 | 33.6 | 845.38 | 12.6 | 4.02 | 26.59 |
| std | 101.18 | 4.87 | 362.06 | 4.16 | 0.58 | 13.44 |
| min | 101.0 | 26.0 | 410.8 | 7.0 | 3.0 | 9.0 |
| 25% | 188.25 | 30.0 | 502.0 | 9.0 | 3.5 | 15.0 |
| 50% | 275.5 | 32.5 | 775.2 | 12.0 | 4.1 | 23.0 |
| 75% | 362.75 | 37.0 | 1160.6 | 15.0 | 4.5 | 38.0 |
| max | 450.0 | 43.0 | 1520.1 | 21.0 | 4.9 | 63.0 |


In [10]:
# Summary statistics for categorical columns
print("Statistical Summary - Categorical Columns:")
df.describe(include=['object'])

Statistical Summary - Categorical Columns:


| | Gender | City | Membership Type | Satisfaction Level |
|---|---|---|---|---|
| count | 350 | 350 | 350 | 348 |
| unique | 2 | 6 | 3 | 3 |
| top | Female | New York | Gold | Satisfied |
| freq | 175 | 59 | 117 | 125 |
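
The `top` row reports a single value even when several values tie for the highest frequency. The value counts in Section 11 show such ties: `Gender` is split 175/175, and `Gold` and `Silver` both have 117. A minimal sketch of exposing ties with `Series.mode()`, on toy data:

```python
import pandas as pd

# Toy series with a perfect two-way tie, like the Gender column
gender = pd.Series(["Female", "Male", "Female", "Male"], name="Gender")

# mode() returns every value that shares the highest frequency
modes = gender.mode().tolist()
has_tie = len(modes) > 1

print(f"Most frequent value(s): {modes}, tie: {has_tie}")
```

When `has_tie` is true, the `top` value from `describe()` should not be read as the dominant category.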


## 8. Column-by-Column Analysis

In [None]:
# Analyze each column
print("Detailed Column Analysis:\n")
print("="*80)

for col in df.columns:
    print(f"\nColumn: {col}")
    print(f"Data Type: {df[col].dtype}")
    print(f"Non-Null Count: {df[col].count():,} / {len(df):,}")
    print(f"Unique Values: {df[col].nunique():,}")
    
    # For categorical columns, show value counts
    if df[col].dtype == 'object' and df[col].nunique() < 20:
        print(f"\nValue Counts:")
        print(df[col].value_counts().head(10))
    
    # For numerical columns, show basic stats
    elif df[col].dtype in ['int64', 'float64']:
        print(f"Min: {df[col].min()}, Max: {df[col].max()}")
        print(f"Mean: {df[col].mean():.2f}, Median: {df[col].median():.2f}")
        print(f"Std Dev: {df[col].std():.2f}")
    
    print("="*80)

## 9. Unique Values Count

In [11]:
# Create unique values summary
unique_summary = pd.DataFrame({
    'Column': df.columns,
    'Unique_Count': [df[col].nunique() for col in df.columns],
    'Unique_Percentage': [(df[col].nunique()/len(df)*100) for col in df.columns]
}).sort_values('Unique_Count', ascending=False)

print("Unique Values Summary:")
print(unique_summary)

Unique Values Summary:
                      Column  Unique_Count  Unique_Percentage
0                Customer ID           350             100.00
5                Total Spend            76              21.71
9   Days Since Last Purchase            54              15.43
7             Average Rating            20               5.71
2                        Age            16               4.57
6            Items Purchased            15               4.29
3                       City             6               1.71
4            Membership Type             3               0.86
10        Satisfaction Level             3               0.86
1                     Gender             2               0.57
8           Discount Applied             2               0.57


## 10. Numerical Columns Distribution

In [12]:
# Get numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
print(f"Numerical Columns ({len(numerical_cols)}):")
print(numerical_cols)

Numerical Columns (6):
['Customer ID', 'Age', 'Total Spend', 'Items Purchased', 'Average Rating', 'Days Since Last Purchase']
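
`Customer ID` appears in this list because it is stored as `int64`, but it is an identifier, so its mean, distribution, and outlier bounds carry no analytical meaning. One possible way to keep it out of later numeric steps is sketched below. `ID_COLUMNS` is a hypothetical hand-maintained list, not something defined elsewhere in this notebook, and the toy values are illustrative.

```python
import pandas as pd

# Toy frame mirroring a few of the real columns
toy = pd.DataFrame({
    "Customer ID": [101, 102, 103, 104],
    "Age": [29, 34, 43, 30],
    "Total Spend": [1120.20, 780.50, 510.75, 1480.30],
    "Gender": ["Female", "Male", "Female", "Male"],
})

# Hypothetical list of identifier columns, maintained by hand
ID_COLUMNS = ["Customer ID"]

# "number" selects integer and float columns (booleans are not included)
numeric = toy.select_dtypes(include="number").columns.tolist()

# Drop identifiers before computing statistics or plotting
analysis_cols = [col for col in numeric if col not in ID_COLUMNS]
print(analysis_cols)
```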


In [None]:
# Plot distributions of numerical columns
if len(numerical_cols) > 0:
    n_cols = 3
    n_rows = (len(numerical_cols) + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4))
    axes = np.atleast_1d(axes).ravel()  # flat array of axes for one row or a full grid
    
    for idx, col in enumerate(numerical_cols):
        if idx < len(axes):
            df[col].hist(bins=30, ax=axes[idx], edgecolor='black')
            axes[idx].set_title(f'Distribution of {col}')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Frequency')
    
    # Hide unused subplots
    for idx in range(len(numerical_cols), len(axes)):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.show()
else:
    print("No numerical columns to plot!")

## 11. Categorical Columns Distribution

In [13]:
# Get categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical Columns ({len(categorical_cols)}):")
print(categorical_cols)

Categorical Columns (4):
['Gender', 'City', 'Membership Type', 'Satisfaction Level']
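
Selecting only `object` columns leaves out `Discount Applied`, which is `bool`. This is why the summary report counts 6 numerical and 4 categorical columns, 10 of the 11 in the dataset. If the boolean flag should be treated as categorical, one option is to select everything that is not numeric. This is a sketch on toy data rather than a change to the notebook's own logic:

```python
import pandas as pd

# Toy frame with one column of each kind found in the dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Age": [29, 34, 43],
    "Discount Applied": [True, False, True],
})

# Excluding "number" keeps text and boolean columns.
# Caveat: it would also keep datetime columns if the frame had any.
categorical_like = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_like)
```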


In [14]:
# Display value counts for categorical columns (with < 20 unique values)
print("Categorical Column Value Counts:\n")
for col in categorical_cols:
    if df[col].nunique() < 20:
        print(f"\n{col}:")
        print(df[col].value_counts())
        print("-" * 50)

Categorical Column Value Counts:


Gender:
Gender
Female    175
Male      175
Name: count, dtype: int64
--------------------------------------------------

City:
City
New York         59
Los Angeles      59
Chicago          58
San Francisco    58
Miami            58
Houston          58
Name: count, dtype: int64
--------------------------------------------------

Membership Type:
Membership Type
Gold      117
Silver    117
Bronze    116
Name: count, dtype: int64
--------------------------------------------------

Satisfaction Level:
Satisfaction Level
Satisfied      125
Unsatisfied    116
Neutral        107
Name: count, dtype: int64
--------------------------------------------------


## 12. Outlier Detection (Numerical Columns)

In [15]:
# Detect outliers using IQR method
print("Outlier Detection using IQR Method:\n")

outlier_summary = []

for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    
    outlier_summary.append({
        'Column': col,
        'Outlier_Count': len(outliers),
        'Outlier_Percentage': (len(outliers)/len(df)*100),
        'Lower_Bound': lower_bound,
        'Upper_Bound': upper_bound
    })

outlier_df = pd.DataFrame(outlier_summary)
print(outlier_df)

Outlier Detection using IQR Method:

                     Column  Outlier_Count  Outlier_Percentage  Lower_Bound  \
0               Customer ID              0                0.00       -73.50   
1                       Age              0                0.00        19.50   
2               Total Spend              0                0.00      -485.90   
3           Items Purchased              0                0.00         0.00   
4            Average Rating              0                0.00         2.00   
5  Days Since Last Purchase              0                0.00       -19.50   

   Upper_Bound  
0       624.50  
1        47.50  
2      2148.50  
3        24.00  
4         6.00  
5        72.50  
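
The bounds in this table can be checked by hand from the quartiles in Section 7. The sketch below recomputes them for `Age` and `Total Spend`. `iqr_bounds` is a helper name introduced here for illustration only.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) fences for the IQR rule with multiplier k."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles copied from the describe() output in Section 7
age_low, age_high = iqr_bounds(30.0, 37.0)
spend_low, spend_high = iqr_bounds(502.0, 1160.6)

print(f"Age: [{age_low:.2f}, {age_high:.2f}]")
print(f"Total Spend: [{spend_low:.2f}, {spend_high:.2f}]")
```

`Age` ranges from 26 to 43, inside its fences of 19.5 and 47.5, so it has no outliers. The lower fence for `Total Spend` is negative and the minimum spend is 410.8, so only the upper fence can flag a value in that column.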


In [None]:
# Box plots for numerical columns
if len(numerical_cols) > 0:
    n_cols = 3
    n_rows = (len(numerical_cols) + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4))
    axes = np.atleast_1d(axes).ravel()  # flat array of axes for one row or a full grid
    
    for idx, col in enumerate(numerical_cols):
        if idx < len(axes):
            df.boxplot(column=col, ax=axes[idx])
            axes[idx].set_title(f'Box Plot: {col}')
            axes[idx].set_ylabel(col)
    
    # Hide unused subplots
    for idx in range(len(numerical_cols), len(axes)):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.show()

## 13. Summary Report

In [16]:
# Generate comprehensive summary
print("="*80)
print("DATA EXPLORATION SUMMARY REPORT")
print("="*80)

print("\n📊 DATASET OVERVIEW")
print(f"   Total Rows: {len(df):,}")
print(f"   Total Columns: {len(df.columns)}")
print(f"   Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\n🔢 COLUMN TYPES")
print(f"   Numerical Columns: {len(numerical_cols)}")
print(f"   Categorical Columns: {len(categorical_cols)}")

print("\n⚠️ DATA QUALITY")
print(f"   Missing Values: {df.isnull().sum().sum():,}")
print(f"   Duplicate Rows: {df.duplicated().sum():,}")

print("\n📈 UNIQUE VALUES")
print(f"   Columns with all unique values: {(df.nunique() == len(df)).sum()}")
print(f"   Columns with 1 unique value: {(df.nunique() == 1).sum()}")

if len(outlier_df) > 0:
    print("\n🎯 OUTLIERS DETECTED")
    outliers_found = outlier_df[outlier_df['Outlier_Count'] > 0]
    if len(outliers_found) > 0:
        print(f"   Columns with outliers: {len(outliers_found)}")
        print(f"   Total outlier records: {outliers_found['Outlier_Count'].sum():,}")
    else:
        print("   No significant outliers detected")

print("\n" + "="*80)
print("✅ Data exploration complete!")
print("="*80)

DATA EXPLORATION SUMMARY REPORT

📊 DATASET OVERVIEW
   Total Rows: 350
   Total Columns: 11
   Memory Usage: 0.10 MB

🔢 COLUMN TYPES
   Numerical Columns: 6
   Categorical Columns: 4

⚠️ DATA QUALITY
   Missing Values: 2
   Duplicate Rows: 0

📈 UNIQUE VALUES
   Columns with all unique values: 1
   Columns with 1 unique value: 0

🎯 OUTLIERS DETECTED
   No significant outliers detected

✅ Data exploration complete!


## 14. Save Exploration Summary

In [17]:
# Save summary to text file
summary_path = Path('../reports/data_exploration_summary.txt')
summary_path.parent.mkdir(parents=True, exist_ok=True)

with open(summary_path, 'w') as f:
    f.write("DATA EXPLORATION SUMMARY\n")
    f.write("="*80 + "\n\n")
    f.write(f"Dataset Shape: {df.shape}\n")
    f.write(f"Total Rows: {len(df):,}\n")
    f.write(f"Total Columns: {len(df.columns)}\n\n")
    f.write("Columns:\n")
    f.write(str(df.columns.tolist()) + "\n\n")
    f.write("Data Types:\n")
    f.write(str(df.dtypes) + "\n\n")
    f.write("Missing Values:\n")
    f.write(str(missing_df[missing_df['Missing_Count'] > 0]) + "\n\n")
    f.write("Statistical Summary:\n")
    f.write(str(df.describe()) + "\n")

print(f"✅ Summary saved to: {summary_path}")

✅ Summary saved to: ..\reports\data_exploration_summary.txt


## 📝 Next Steps

Based on this exploration, we need to:

1. **Data Cleaning (Notebook 02)**
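   - Handle the 2 missing values in `Satisfaction Level`
   - Re-check for duplicates (none were found in this phase)
   - Re-check outliers (the IQR method flagged none in this phase)
   - Fix data types if needed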

2. **Feature Engineering (Notebook 03)**
   - Create customer-level metrics (CLV, AOV, Frequency)
   - Extract time-based features
   - Build RFM segmentation
   - Encode categorical variables

3. **EDA (Notebook 04)**
   - Detailed univariate, bivariate, multivariate analysis
   - Correlation analysis
   - Customer segmentation visualization
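
As a rough preview of two of these steps, the sketch below labels missing satisfaction values and encodes membership tiers on toy data. Both choices are assumptions made for illustration: the `"Unknown"` label is one of several reasonable treatments, and the tier order Bronze < Silver < Gold is inferred from the names rather than confirmed by the dataset. The `Membership Code` column is also introduced here and does not exist in the source data.

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative and not taken from the CSV
toy = pd.DataFrame({
    "Membership Type": ["Gold", "Silver", "Bronze", "Gold"],
    "Satisfaction Level": ["Satisfied", np.nan, "Unsatisfied", "Satisfied"],
})

# Assumption: mark missing satisfaction explicitly instead of guessing a value
toy["Satisfaction Level"] = toy["Satisfaction Level"].fillna("Unknown")

# Assumption: tiers are ordered Bronze < Silver < Gold
tier_order = ["Bronze", "Silver", "Gold"]
toy["Membership Type"] = pd.Categorical(
    toy["Membership Type"], categories=tier_order, ordered=True
)

# Integer codes follow the category order: Bronze=0, Silver=1, Gold=2
toy["Membership Code"] = toy["Membership Type"].cat.codes
print(toy)
```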

---

**Phase 2 Status**: ✅ Complete  
**Next Phase**: Data Cleaning