Data Understanding Workflow:

1. Executive Context
2. Data Ingestion (Raw Source Validation)
3. Data Footprint Analysis
   ├── Dataset Dimensions (shape)
   └── Schema Overview (info)
4. Data Dictionary & Attribute Semantics [IMPORTANT]
   ├── Column Roles
   └── Cardinality & Category Risk
5. Data Quality Assessment
   ├── Missing Values and Duplicate Records
   ├── Valid Ranges
   └── Type Consistency
6. Analytical Readiness Assessment
   └── describe().T
7. Business Questions Supported
8. Next Steps & Assumptions


---

# **Executive Context** 

---

The purpose of this notebook is to assess data readiness, identify quality issues, and understand the analytical potential of the dataset before any transformation or modeling activities.

### Data Understanding and Profiling

The dataset captures consumer shopping behavior across demographic segments, product categories, and transactional sales attributes, providing a foundation for customer behavior analysis, segmentation, and revenue-driven insights.


In [1]:
#Dependencies
import pandas as pd

---

# **Data Ingestion Validation**

---

In [23]:
df = pd.read_csv("../data/raw/consumer_data.csv")
df.head(10)

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually
5,6,46,Male,Sneakers,Footwear,20,Wyoming,M,White,Summer,2.9,Yes,Standard,Yes,Yes,14,Venmo,Weekly
6,7,63,Male,Shirt,Clothing,85,Montana,M,Gray,Fall,3.2,Yes,Free Shipping,Yes,Yes,49,Cash,Quarterly
7,8,27,Male,Shorts,Clothing,34,Louisiana,L,Charcoal,Winter,3.2,Yes,Free Shipping,Yes,Yes,19,Credit Card,Weekly
8,9,26,Male,Coat,Outerwear,97,West Virginia,L,Silver,Summer,2.6,Yes,Express,Yes,Yes,8,Venmo,Annually
9,10,57,Male,Handbag,Accessories,31,Missouri,M,Pink,Spring,4.8,Yes,2-Day Shipping,Yes,Yes,4,Cash,Quarterly


The dataset is loaded directly from the raw data file, and no transformations are applied at this stage. This preserves the original state of the source for an unbiased assessment and keeps data lineage intact for subsequent validation and analysis.

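Each record in the dataset represents an individual purchase event. The presence of item-specific, payment, shipping, and promotional attributes confirms transaction-level granularity, enabling detailed behavioral and revenue analysis. Note that Customer ID is unique across all 3,900 rows, so each customer appears exactly once; repeat behavior must be read from attributes such as Previous Purchases and Frequency of Purchases rather than from repeated IDs.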
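
One way to confirm the grain of a table is to count rows per identifier. The sketch below is illustrative only: it uses a small hypothetical sample instead of the raw file, and `rows_per_key` is a helper name introduced here, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample rows that reuse the notebook's column names (not real data).
sample = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4],
    "Item Purchased": ["Blouse", "Sweater", "Jeans", "Sandals"],
    "Purchase Amount (USD)": [53, 64, 73, 90],
})


def rows_per_key(frame: pd.DataFrame, key: str) -> pd.Series:
    """Return how many rows share each value of `key`."""
    return frame.groupby(key).size()


counts = rows_per_key(sample, "Customer ID")

# If every key appears once, the table holds one record per customer.
one_row_per_customer = bool((counts == 1).all())
print("Max rows per customer:", int(counts.max()))
print("One row per customer:", one_row_per_customer)
```

Running the same helper against `df` would show whether any customer contributes more than one record, which determines whether customer-level aggregation is needed before analysis.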

---

# **Data Footprint Analysis**

---

In [3]:
df.shape

(3900, 18)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Customer ID             3900 non-null   int64  
 1   Age                     3900 non-null   int64  
 2   Gender                  3900 non-null   object 
 3   Item Purchased          3900 non-null   object 
 4   Category                3900 non-null   object 
 5   Purchase Amount (USD)   3900 non-null   int64  
 6   Location                3900 non-null   object 
 7   Size                    3900 non-null   object 
 8   Color                   3900 non-null   object 
 9   Season                  3900 non-null   object 
 10  Review Rating           3863 non-null   float64
 11  Subscription Status     3900 non-null   object 
 12  Shipping Type           3900 non-null   object 
 13  Discount Applied        3900 non-null   object 
 14  Promo Code Used         3900 non-null   object 

Dataset Dimensions
- The dataset contains 3,900 records and 18 columns at transaction-level granularity.
- Record volume appears sufficient for descriptive analytics and segmentation.

Schema Overview
- The dataset contains a mix of numeric and categorical attributes aligned with consumer behavior analysis.
- Data types are reviewed at a high level and will be validated further during quality checks.

Review Rating shows a small degree of missingness (3,863 of 3,900 values present, about 1% missing), warranting further assessment in the data quality phase.

The dataset structure supports downstream exploratory analysis and feature engineering workflows.

---

# **Data Dictionary, Attribute Semantics and Business Interpretation**

---

In [5]:
# Data Dictionary Creation

#adding columns and their data types
data_dict = pd.DataFrame({
    "Column": df.columns,
    "Data_Type": df.dtypes.astype(str)
}).reset_index(drop=True)

#adding null_% count
data_dict["Null_%"] = (
    df.isna().sum() / len(df) * 100
).round(2).values

#adding unique count
data_dict["Unique_Values"] = df.nunique().values

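#adding column roles
def infer_column_role(col, dtype, unique_vals):
    """Heuristic role assignment from column name, dtype string, and unique count."""
    # Substring match: any column name containing "id" is treated as an identifier
    if "id" in col.lower():
        return "Identifier"
    # Numeric columns with more than 20 distinct values are treated as measures
    elif dtype in ["int64", "float64"] and unique_vals > 20:
        return "Numeric Measure"
    elif dtype == "object":
        return "Categorical"
    # Low-cardinality numeric columns and any other dtype fall through here
    else:
        return "Derived / Other"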

data_dict["Column_Role"] = [
    infer_column_role(col, dtype, uniq)
    for col, dtype, uniq in zip(
        data_dict["Column"],
        data_dict["Data_Type"],
        data_dict["Unique_Values"]
    )
]

#adding business interpretation and risk assessment for the features 
business_rules = {
    "Customer ID": {
        "Description": "Unique identifier assigned to each customer",
        "Business_Usage": "Customer-level tracking, retention, and repeat purchase analysis",
        "Risk_if_Misinterpreted": "Duplicate IDs can inflate customer counts"
    },
    "Age": {
        "Description": "Age of the customer at the time of purchase",
        "Business_Usage": "Demographic segmentation and age-based behavior analysis",
        "Risk_if_Misinterpreted": "Invalid age values distort demographic insights"
    },
    "Gender": {
        "Description": "Gender of the customer",
        "Business_Usage": "Demographic profiling and targeted marketing",
        "Risk_if_Misinterpreted": "Biased or incomplete representation"
    },
    "Item Purchased": {
        "Description": "Specific item purchased in the transaction",
        "Business_Usage": "Product-level demand and basket analysis",
        "Risk_if_Misinterpreted": "Inconsistent naming affects product insights"
    },
    "Category": {
        "Description": "High-level product category classification",
        "Business_Usage": "Category performance and revenue contribution analysis",
        "Risk_if_Misinterpreted": "Misclassification leads to incorrect category insights"
    },
    "Purchase Amount (USD)": {
        "Description": "Monetary value of the transaction in USD",
        "Business_Usage": "Revenue calculation, AOV, and sales trend analysis",
        "Risk_if_Misinterpreted": "Currency or aggregation errors inflate revenue"
    },
    "Location": {
        "Description": "Geographic location of the customer",
        "Business_Usage": "Regional demand, market penetration, and localization analysis",
        "Risk_if_Misinterpreted": "Incorrect regional performance conclusions"
    },
    "Size": {
        "Description": "Size attribute of the purchased item",
        "Business_Usage": "Inventory planning and size preference analysis",
        "Risk_if_Misinterpreted": "Misinterpreted size standards affect demand forecasting"
    },
    "Color": {
        "Description": "Color variant of the purchased item",
        "Business_Usage": "Style preference and assortment optimization",
        "Risk_if_Misinterpreted": "Color grouping inconsistencies distort trends"
    },
    "Season": {
        "Description": "Season during which the purchase occurred",
        "Business_Usage": "Seasonality and demand cycle analysis",
        "Risk_if_Misinterpreted": "Incorrect season tagging misleads trend analysis"
    },
    "Review Rating": {
        "Description": "Customer-provided satisfaction rating",
        "Business_Usage": "Customer experience and satisfaction analysis",
        "Risk_if_Misinterpreted": "Subjective bias or missing ratings skew insights"
    },
    "Subscription Status": {
        "Description": "Indicates whether the customer has an active subscription",
        "Business_Usage": "Retention and subscription impact analysis",
        "Risk_if_Misinterpreted": "Incorrect churn or loyalty assessment"
    },
    "Shipping Type": {
        "Description": "Delivery method selected for the transaction",
        "Business_Usage": "Logistics performance and delivery preference analysis",
        "Risk_if_Misinterpreted": "Misjudging delivery efficiency or cost impact"
    },
    "Discount Applied": {
        "Description": "Indicates whether a discount was applied to the transaction",
        "Business_Usage": "Promotion effectiveness and ROI analysis",
        "Risk_if_Misinterpreted": "Incorrect promotion ROI calculation"
    },
    "Promo Code Used": {
        "Description": "Indicates whether a promotional code was used",
        "Business_Usage": "Campaign effectiveness and offer attribution analysis",
        "Risk_if_Misinterpreted": "Incorrect campaign performance conclusions"
    },
    "Previous Purchases": {
        "Description": "Number of purchases made by the customer prior to this transaction",
        "Business_Usage": "Customer loyalty and lifetime value proxy analysis",
        "Risk_if_Misinterpreted": "Misclassification of new vs repeat customers"
    },
    "Payment Method": {
        "Description": "Payment method used for the transaction",
        "Business_Usage": "Payment preference and risk analysis",
        "Risk_if_Misinterpreted": "Incorrect assessment of payment behavior"
    },
    "Frequency of Purchases": {
        "Description": "Reported frequency of customer purchases",
        "Business_Usage": "Engagement and behavioral segmentation analysis",
        "Risk_if_Misinterpreted": "Misinterpretation of frequency categories"
    }
}

#populating columns based on business rules above
for field in ["Description", "Business_Usage", "Risk_if_Misinterpreted"]:
    data_dict[field] = data_dict["Column"].map(
        lambda col, field=field: business_rules.get(col, {}).get(field)
    )

data_dict


Unnamed: 0,Column,Data_Type,Null_%,Unique_Values,Column_Role,Description,Business_Usage,Risk_if_Misinterpreted
0,Customer ID,int64,0.0,3900,Identifier,Unique identifier assigned to each customer,"Customer-level tracking, retention, and repeat...",Duplicate IDs can inflate customer counts
1,Age,int64,0.0,53,Numeric Measure,Age of the customer at the time of purchase,Demographic segmentation and age-based behavio...,Invalid age values distort demographic insights
2,Gender,object,0.0,2,Categorical,Gender of the customer,Demographic profiling and targeted marketing,Biased or incomplete representation
3,Item Purchased,object,0.0,25,Categorical,Specific item purchased in the transaction,Product-level demand and basket analysis,Inconsistent naming affects product insights
4,Category,object,0.0,4,Categorical,High-level product category classification,Category performance and revenue contribution ...,Misclassification leads to incorrect category ...
5,Purchase Amount (USD),int64,0.0,81,Numeric Measure,Monetary value of the transaction in USD,"Revenue calculation, AOV, and sales trend anal...",Currency or aggregation errors inflate revenue
6,Location,object,0.0,50,Categorical,Geographic location of the customer,"Regional demand, market penetration, and local...",Incorrect regional performance conclusions
7,Size,object,0.0,4,Categorical,Size attribute of the purchased item,Inventory planning and size preference analysis,Misinterpreted size standards affect demand fo...
8,Color,object,0.0,25,Categorical,Color variant of the purchased item,Style preference and assortment optimization,Color grouping inconsistencies distort trends
9,Season,object,0.0,4,Categorical,Season during which the purchase occurred,Seasonality and demand cycle analysis,Incorrect season tagging misleads trend analysis


In [6]:
#Final Data Dictionary View

data_dictionary = data_dict[
    [
        "Column",
        "Data_Type",
        "Column_Role",
        "Description",
        "Business_Usage",
        "Risk_if_Misinterpreted"
    ]
]

pd.set_option('display.max_colwidth', 1000)  #to avoid data truncation

data_dictionary


Unnamed: 0,Column,Data_Type,Column_Role,Description,Business_Usage,Risk_if_Misinterpreted
0,Customer ID,int64,Identifier,Unique identifier assigned to each customer,"Customer-level tracking, retention, and repeat purchase analysis",Duplicate IDs can inflate customer counts
1,Age,int64,Numeric Measure,Age of the customer at the time of purchase,Demographic segmentation and age-based behavior analysis,Invalid age values distort demographic insights
2,Gender,object,Categorical,Gender of the customer,Demographic profiling and targeted marketing,Biased or incomplete representation
3,Item Purchased,object,Categorical,Specific item purchased in the transaction,Product-level demand and basket analysis,Inconsistent naming affects product insights
4,Category,object,Categorical,High-level product category classification,Category performance and revenue contribution analysis,Misclassification leads to incorrect category insights
5,Purchase Amount (USD),int64,Numeric Measure,Monetary value of the transaction in USD,"Revenue calculation, AOV, and sales trend analysis",Currency or aggregation errors inflate revenue
6,Location,object,Categorical,Geographic location of the customer,"Regional demand, market penetration, and localization analysis",Incorrect regional performance conclusions
7,Size,object,Categorical,Size attribute of the purchased item,Inventory planning and size preference analysis,Misinterpreted size standards affect demand forecasting
8,Color,object,Categorical,Color variant of the purchased item,Style preference and assortment optimization,Color grouping inconsistencies distort trends
9,Season,object,Categorical,Season during which the purchase occurred,Seasonality and demand cycle analysis,Incorrect season tagging misleads trend analysis


A structured data dictionary was generated to document the technical schema and business meaning of each attribute within the dataset. Core metadata, including data types, null prevalence, unique value counts, and column roles, was derived programmatically to ensure accuracy and reproducibility.

Business interpretations were then added for each of the 18 attributes, outlining their analytical usage and potential risks if misinterpreted. This approach ensures a clear linkage between raw data fields and downstream business decisions while maintaining auditability and minimizing assumption bias.

**Cardinality & Category Risk Assessment**

Several categorical attributes exhibit moderate to high cardinality, notably *Location* (50 distinct values), *Item Purchased* (25), and *Color* (25). These fields may introduce category explosion during encoding or visualization if not managed appropriately.

Low-cardinality attributes such as *Gender*, *Subscription Status*, and promotion-related flags are well-suited for segmentation and comparative analysis.

High-cardinality fields will require grouping, aggregation, or frequency-based filtering during feature engineering and dashboard design to ensure analytical clarity and model stability.
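
Frequency-based filtering can be sketched as follows. This is a minimal illustration under stated assumptions: the values are hypothetical, and `group_rare_categories`, `min_count`, and the "Other" label are names introduced here rather than anything defined in the notebook.

```python
import pandas as pd

# Hypothetical values; the real Location column holds 50 distinct values.
locations = pd.Series(
    ["Kentucky", "Maine", "Kentucky", "Oregon", "Kentucky", "Maine", "Wyoming"],
    name="Location",
)


def group_rare_categories(
    values: pd.Series, min_count: int, other_label: str = "Other"
) -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = values.value_counts()
    frequent = counts[counts >= min_count].index
    # Keep frequent categories as-is; everything else collapses into one bucket.
    return values.where(values.isin(frequent), other_label)


grouped = group_rare_categories(locations, min_count=2)
print(grouped.value_counts())
```

The threshold is a judgment call: a higher `min_count` gives cleaner charts and fewer encoded columns, at the cost of hiding detail for smaller segments.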


---

# **Data Quality Assessment**
#### This section evaluates data completeness, consistency, and validity to ensure reliability for downstream analysis.
---

In [19]:
print( "Null Count for each column:\n\n", df.isna().sum() )
print()
print( "Duplicate records count: ", df.duplicated().sum() )

Null Count for each column:

 Customer ID                0
Age                        0
Gender                     0
Item Purchased             0
Category                   0
Purchase Amount (USD)      0
Location                   0
Size                       0
Color                      0
Season                     0
Review Rating             37
Subscription Status        0
Shipping Type              0
Discount Applied           0
Promo Code Used            0
Previous Purchases         0
Payment Method             0
Frequency of Purchases     0
dtype: int64

Duplicate records count:  0


In [20]:
# Valid range checks
# Note: between() evaluates NaN as False, so missing values are counted as invalid below
print("Invalid Age count:", df[~df["Age"].between(18, 100)].shape[0])
print("Invalid Purchase Amount count:", df[~df["Purchase Amount (USD)"].between(0, 1000)].shape[0])
print("Invalid Review Rating count:", df[~df["Review Rating"].between(1, 5, inclusive="both")].shape[0])


Invalid Age count: 0
Invalid Purchase Amount count: 0
Invalid Review Rating count: 37


In [21]:
# Logical consistency check
inconsistent_discount = df[
    (df["Discount Applied"] == "Yes") & (df["Promo Code Used"] == "No")
].shape[0]

print("Discount applied but no promo code used:", inconsistent_discount)

Discount applied but no promo code used: 0


Missing Values
- 17 of the 18 attributes are fully populated.
- Review Rating is the only attribute with missing values (37 of 3,900 records, about 1%); these will be addressed using appropriate imputation or exclusion strategies based on analytical context.
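
The two strategies mentioned above, exclusion and imputation, might look like the sketch below. It assumes median imputation purely for illustration (the notebook does not commit to a method), and the ratings and the added column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with gaps; not taken from the real dataset.
ratings = pd.DataFrame({"Review Rating": [3.1, np.nan, 4.8, 2.7, np.nan]})

# Keep a flag so imputed rows stay distinguishable from observed ratings.
ratings["Rating Missing"] = ratings["Review Rating"].isna()

# Option A: exclusion, for analyses that need genuine ratings only.
observed_only = ratings.dropna(subset=["Review Rating"])

# Option B: median imputation, for analyses that need a complete column.
median_rating = ratings["Review Rating"].median()
ratings["Review Rating Imputed"] = ratings["Review Rating"].fillna(median_rating)

print(observed_only.shape[0], "observed rows")
print(ratings)
```

Exclusion suits satisfaction analysis, where an imputed score would be misleading; imputation with a flag suits models that cannot accept missing inputs.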

Duplicate Records
- No duplicate records were identified (duplicate count: 0), supporting accurate aggregation and behavioral analysis.

Valid Range Checks
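- Age and purchase amount fall within the checked bounds (18–100 and 0–1000 respectively), with no invalid records.
- The 37 records flagged for Review Rating correspond to the 37 missing values, since between() evaluates missing values as out of range. Populated ratings range from 2.5 to 5.0 per the summary statistics, so no material anomalies were observed.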
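
A range check that reports missing values separately from genuinely out-of-range values could be written as below. This is a sketch on hypothetical data; `count_out_of_range` is an illustrative helper, not a function from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: one missing value and one value above the valid maximum.
ratings = pd.Series([3.1, np.nan, 4.8, 5.6, 2.7], name="Review Rating")


def count_out_of_range(values: pd.Series, low: float, high: float) -> int:
    """Count non-missing values that fall outside the inclusive range [low, high]."""
    present = values.dropna()
    return int((~present.between(low, high)).sum())


missing = int(ratings.isna().sum())
out_of_range = count_out_of_range(ratings, 1, 5)

# For comparison: without dropna(), missing values are counted as invalid too.
naive = int((~ratings.between(1, 5)).sum())

print("Missing:", missing)
print("Out of range:", out_of_range)
print("Naive invalid count:", naive)
```

Separating the two counts keeps completeness issues and validity issues from being confused in the quality summary.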

Data Type and Logical Consistency
- Observed data types align with expected analytical usage.
- Logical relationships between promotion-related fields were evaluated: no records have a discount applied without a promo code used.
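
The check above covers one direction only (discount without promo code). A cross-tabulation shows every combination of the two flags at once. The sketch below uses hypothetical rows and makes no claim about what the real data contains.

```python
import pandas as pd

# Hypothetical flag values that reuse the notebook's column names.
flags = pd.DataFrame({
    "Discount Applied": ["Yes", "Yes", "No", "No", "Yes"],
    "Promo Code Used": ["Yes", "Yes", "No", "No", "Yes"],
})

# Rows are Discount Applied values, columns are Promo Code Used values.
combos = pd.crosstab(flags["Discount Applied"], flags["Promo Code Used"])
print(combos)

# Count rows where the two flags disagree, in either direction.
mismatches = int((flags["Discount Applied"] != flags["Promo Code Used"]).sum())
print("Rows where the flags differ:", mismatches)
```

If the mismatch count is zero on the full dataset, the two columns carry the same information, and only one of them should be used as a feature or filter to avoid double counting promotion effects.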


---

# **Analytical Readiness Assessment**
#### Summary statistics were reviewed to validate numeric value ranges, distribution behavior, and overall analytical suitability.
---

In [22]:
df.describe().T  #for numerical columns

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Customer ID,3900.0,1950.5,1125.977353,1.0,975.75,1950.5,2925.25,3900.0
Age,3900.0,44.068462,15.207589,18.0,31.0,44.0,57.0,70.0
Purchase Amount (USD),3900.0,59.764359,23.685392,20.0,39.0,60.0,81.0,100.0
Review Rating,3863.0,3.750065,0.716983,2.5,3.1,3.8,4.4,5.0
Previous Purchases,3900.0,25.351538,14.447125,1.0,13.0,25.0,38.0,50.0


All numeric attributes fall within realistic and interpretable bounds, supporting their use in downstream exploratory and descriptive analysis.

- **Customer ID** functions strictly as a sequential identifier with no analytical risk.
- **Age** values range from 18 to 70, with mean and median closely aligned, indicating a balanced and demographically realistic distribution.
- **Purchase Amount (USD)** exhibits a constrained range (20–100) with minimal skew, suggesting standardized pricing or business-imposed limits rather than data anomalies. This boundedness should be considered in revenue and elasticity analyses.
- **Review Rating** spans from 2.5 to 5.0 with a small proportion of missing values (~1%). The moderate variance and subjective nature of the metric indicate that careful treatment will be required in customer experience analysis.
- **Previous Purchases** shows a broad and consistent spread (1–50), supporting its use as a proxy for customer engagement and repeat purchasing behavior.

**Overall, numeric variables demonstrate strong analytical readiness, with minor considerations noted for bounded monetary values and partially missing satisfaction metrics.**
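
Claims about skew can be checked numerically by comparing mean and median and computing a skewness statistic. The sketch below uses hypothetical purchase amounts inside the observed 20–100 range. Reading an absolute skewness below 1 as mild is a rule-of-thumb assumption, not a threshold taken from the notebook.

```python
import pandas as pd

# Hypothetical purchase amounts within the observed 20-100 range.
amounts = pd.Series([20, 39, 53, 60, 64, 73, 81, 100], name="Purchase Amount (USD)")

summary = {
    "mean": amounts.mean(),
    "median": amounts.median(),
    "skew": amounts.skew(),  # sample skewness; values near 0 suggest symmetry
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Applying `df.select_dtypes("number").skew()` to the full dataset would give the same statistic for every numeric column in one call.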


---

# **Business Questions This Data Can Support**

---

Based on the available attributes and transaction-level granularity, the dataset is capable of supporting:

- Customer segmentation and demographic analysis
- Category-wise and seasonal demand assessment
- Promotion and discount effectiveness evaluation
- Payment and shipping preference analysis
- Customer engagement and purchase frequency insights
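
As an illustration of the category-wise assessment listed above, a grouped aggregation might look like this. The rows are hypothetical and the output column names (`total_revenue`, `avg_order_value`, `purchases`) are chosen here for readability.

```python
import pandas as pd

# Hypothetical rows that reuse the notebook's column names.
sample = pd.DataFrame({
    "Category": ["Clothing", "Clothing", "Footwear", "Footwear", "Outerwear"],
    "Season": ["Winter", "Spring", "Spring", "Summer", "Summer"],
    "Purchase Amount (USD)": [53, 73, 90, 20, 97],
})

# Revenue, average order value, and purchase count per category.
category_summary = (
    sample.groupby("Category")["Purchase Amount (USD)"]
    .agg(total_revenue="sum", avg_order_value="mean", purchases="count")
    .sort_values("total_revenue", ascending=False)
)
print(category_summary)
```

Grouping by `["Category", "Season"]` instead would extend the same pattern to seasonal demand.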

---

# **Next Steps & Assumptions**

---

Subsequent steps will focus on:

- Data Cleaning
- Initial Exploratory Analysis (for insights)
- Feature Engineering
- Post Feature Exploratory Analysis
- SQL Business Queries
- Dashboarding in Power BI

Assumptions regarding data completeness, temporal coverage, and representativeness will be validated as part of these stages.

Note: Initial EDA is performed first to understand distributions, quality, and relationships. Feature engineering is then guided by those insights, followed by a validation EDA before modeling.