# E-Commerce EDA — Simplified Notebook

This simplified notebook contains the same sections and outcomes as the previous version,
but uses **easy-to-read code** and **line-by-line comments** so students can follow quickly.

Sections: Load → Clean → Univariate → Bivariate → Multivariate → Segments → Derived metrics → Time-series → Save

Each code cell contains short comments explaining what each variable/line does.

## 0. Imports & plotting settings

**Why:** Load standard libraries and set a clean plotting style.

## 1. Load dataset

**Why:** Confirm the file exists and inspect the first rows to form hypotheses.

**Outcome:** Know dataset shape and column names.
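A minimal sketch of this step, assuming the data lives in a CSV file. The file name `ecommerce_demo.csv`, the helper `load_dataset`, and the columns other than `date` and `purchase` are hypothetical; the tiny file is written only so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(csv_path):
    """Read a CSV into a DataFrame, failing early if the file is missing."""
    csv_path = Path(csv_path)
    if not csv_path.exists():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    return pd.read_csv(csv_path)


# Demo only: write a tiny stand-in CSV (hypothetical columns and values).
demo_dir = Path(tempfile.mkdtemp())
demo_path = demo_dir / "ecommerce_demo.csv"
demo_path.write_text(
    "date,pages_viewed,membership,purchase\n"
    "2024-01-01 10:00,5,Gold,1\n"
    "2024-01-01 14:30,2,Free,0\n"
    "2024-01-02 09:15,7,Free,1\n"
)

df = load_dataset(demo_path)
print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names
print(df.head())            # first rows
```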

## 2. Quick summary & missing values

**Why:** See basic statistics and missing counts to guide cleaning.

**Outcome:** Decide imputation and type fixes.
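As a rough illustration, the summary and missing-value counts could be produced like this. The toy DataFrame and its column names (other than `purchase`) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each of two columns.
df = pd.DataFrame({
    "pages_viewed": [5, 2, np.nan, 7, 3],
    "membership": ["Gold", "Free", None, "Free", "Gold"],
    "purchase": [1, 0, 0, 1, 0],
})

summary = df.describe()                          # count, mean, std, min, quartiles, max (numeric columns)
missing_counts = df.isna().sum()                 # missing values per column
missing_pct = (df.isna().mean() * 100).round(1)  # same, as a percentage of rows

print(df.dtypes)
print(summary)
print(pd.DataFrame({"missing": missing_counts, "percent": missing_pct}))
```

Columns with a non-zero missing count are candidates for imputation; columns whose dtype is `object` but should be numeric or datetime are candidates for type fixes.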

## 3. Clean types and parse dates

**Why:** Convert columns to correct types, especially `date` for time analysis.

**Outcome:** Ready for feature engineering.
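One possible way to do the conversions, sketched on toy data. The timestamp format and the `pages_viewed` and `membership` columns are assumptions; `errors="coerce"` turns unparseable entries into missing values instead of raising.

```python
import pandas as pd

# Toy data where everything arrives as text, including two bad entries.
df = pd.DataFrame({
    "date": ["2024-01-01 10:00", "2024-01-02 14:30", "not a date"],
    "pages_viewed": ["5", "2", "seven"],
    "membership": ["Gold", "Free", "Gold"],
})

# Parse timestamps; values that do not match the format become NaT.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d %H:%M", errors="coerce")

# Convert text to numbers; values that are not numeric become NaN.
df["pages_viewed"] = pd.to_numeric(df["pages_viewed"], errors="coerce")

# Store low-cardinality text as a categorical type.
df["membership"] = df["membership"].astype("category")

print(df.dtypes)
print(df)
```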

## 4. (Optional) Inject a few missing values for demo

**Why:** If the dataset has no missing values, create a few so the imputation step has something to fill.

**Outcome:** A few controlled missing values to practice filling.
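A small sketch of controlled injection, assuming three missing values per column are enough for the demo. The column names and the seed are illustrative choices; a fixed seed keeps the demo reproducible.

```python
import numpy as np
import pandas as pd

# Toy data with no missing values.
df = pd.DataFrame({
    "pages_viewed": np.arange(1.0, 21.0),
    "membership": ["Gold", "Free"] * 10,
})

rng = np.random.default_rng(42)  # fixed seed so the same rows are chosen each run
n_missing = 3                    # how many values to blank out per column

# Only inject if the data is complete, so real gaps are never mixed with fake ones.
if df.isna().sum().sum() == 0:
    for col in ["pages_viewed", "membership"]:
        positions = rng.choice(len(df), size=n_missing, replace=False)
        df.loc[df.index[positions], col] = np.nan

print(df.isna().sum())
```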

## 5. Imputation (simple and explainable)

**Why:** Fill missing values so plots and models run without errors. We use the median for numeric columns (it is less sensitive to outliers than the mean) and `'Unknown'` for categorical columns.

**Outcome:** Cleaned columns ready for analysis.
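The rule above can be sketched as follows. The toy columns are assumptions; the value 100.0 is included to show why the median is a safer fill value than the mean here.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in one numeric and one categorical column.
df = pd.DataFrame({
    "pages_viewed": [1.0, 2.0, np.nan, 4.0, 100.0],
    "membership": ["Gold", None, "Free", "Gold", None],
})

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Numeric columns: fill with the column median (3.0 here, while the mean is 26.75).
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill with an explicit 'Unknown' label.
for col in categorical_cols:
    df[col] = df[col].fillna("Unknown")

print(df)
```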

## 6. Derived features (simple and useful)

**Why:** These features help capture behavior and time patterns. Each line is commented.

**Outcome:** New, interpretable features for EDA and later modeling.
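A hedged sketch of what such features might look like. The names `engagement_score` and `cart_to_wishlist_ratio` are used elsewhere in this notebook, but the formulas below are assumptions made for illustration, as are the input columns `pages_viewed`, `time_on_site`, `cart_items`, and `wishlist_items`.

```python
import pandas as pd

# Toy data (hypothetical columns and values).
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01 10:00", "2024-01-06 21:30", "2024-01-07 08:45"]),
    "pages_viewed": [5, 2, 8],
    "time_on_site": [12.0, 3.0, 20.0],
    "cart_items": [2, 0, 3],
    "wishlist_items": [1, 0, 4],
})

# Time features taken from the parsed timestamp.
df["hour"] = df["date"].dt.hour              # hour of day, 0-23
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5    # Saturday or Sunday

# ASSUMED formula: more pages and more time on site both raise engagement.
df["engagement_score"] = df["pages_viewed"] * df["time_on_site"]

# ASSUMED formula: the +1 in the denominator avoids division by zero.
df["cart_to_wishlist_ratio"] = df["cart_items"] / (df["wishlist_items"] + 1)

print(df)
```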

## 7. Univariate EDA — Numeric features (simple plots)

**Why:** See the shape of numeric features. We annotate mean and median for clarity.

**Outcome:** Students can spot skew and outliers.
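The numbers behind such a plot can be computed directly. This sketch uses a toy `pages_viewed` series (an assumed column) and the common 1.5 × IQR rule for flagging outliers, which is one convention among several.

```python
import pandas as pd

# Toy numeric feature with one extreme value.
pages_viewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], name="pages_viewed")

mean_value = pages_viewed.mean()      # pulled upward by the extreme value
median_value = pages_viewed.median()  # stays near the bulk of the data
skewness = pages_viewed.skew()        # positive means a long right tail

# Flag points outside 1.5 * IQR from the quartiles.
q1 = pages_viewed.quantile(0.25)
q3 = pages_viewed.quantile(0.75)
iqr = q3 - q1
is_outlier = (pages_viewed < q1 - 1.5 * iqr) | (pages_viewed > q3 + 1.5 * iqr)
outliers = pages_viewed[is_outlier]

print(f"mean={mean_value:.2f}, median={median_value:.2f}, skew={skewness:.2f}")
print("outliers:", outliers.tolist())
```

A mean well above the median, as in this toy series, is a quick sign of right skew.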

## 8. Univariate EDA — Categorical features (counts)

**Why:** Identify dominant categories which affect encoding and sampling.

**Outcome:** We list top categories for each column.
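A short sketch of listing the top categories per column. The `membership` and `product` values are toy data, and `top_n` is an illustrative parameter.

```python
import pandas as pd

# Toy categorical data.
df = pd.DataFrame({
    "membership": ["Free", "Free", "Gold", "Free", "Silver", "Gold"],
    "product": ["Shoes", "Books", "Shoes", "Shoes", "Toys", "Books"],
})

top_n = 2            # how many categories to list per column
top_categories = {}  # column name -> counts of its most frequent categories

for col in ["membership", "product"]:
    counts = df[col].value_counts()                                   # sorted, most frequent first
    shares = df[col].value_counts(normalize=True).mul(100).round(1)   # percentage of rows
    top_categories[col] = counts.head(top_n)
    print(f"\nTop {top_n} values for {col}:")
    print(pd.DataFrame({"count": counts, "percent": shares}).head(top_n))
```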

## 9. Target variable `purchase` (balance)

**Why:** Check how balanced the target is; influences modeling choices.

**Outcome:** Observe purchase percentage.
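For a 0/1 target, the purchase percentage is just the mean times 100. A minimal sketch on toy values, assuming `purchase` is coded as 1 for a purchase and 0 otherwise:

```python
import pandas as pd

# Toy target: 2 purchases out of 10 sessions.
purchase = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="purchase")

class_counts = purchase.value_counts()  # rows per class
purchase_pct = purchase.mean() * 100    # share of rows with purchase == 1

print(class_counts)
print(f"Purchase rate: {purchase_pct:.1f}%")
```

A rate far from 50% suggests looking beyond plain accuracy when evaluating models later.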

## 10. Bivariate — Numeric vs purchase (boxplots)

**Why:** Check whether numeric features differ for purchasers vs non-purchasers.

**Outcome:** Compare medians; a clear gap between the two groups' medians, in either direction, suggests the feature may help predict purchase.
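The comparison a boxplot shows visually can also be tabulated. In this sketch the toy columns `pages_viewed` and `time_on_site` are assumptions, chosen so that one feature separates the groups and the other does not.

```python
import pandas as pd

# Toy data: first four rows are non-purchasers, last four are purchasers.
df = pd.DataFrame({
    "pages_viewed": [2, 3, 4, 3, 8, 9, 10, 7],
    "time_on_site": [5.0, 6.0, 5.5, 6.5, 6.0, 5.0, 6.5, 5.5],
    "purchase": [0, 0, 0, 0, 1, 1, 1, 1],
})

numeric_cols = ["pages_viewed", "time_on_site"]

# Median of each numeric feature within each purchase group.
medians = df.groupby("purchase")[numeric_cols].median()

# Purchaser median minus non-purchaser median, per feature.
median_gap = medians.loc[1] - medians.loc[0]

print(medians)
print(median_gap)
```

Here `pages_viewed` shows a large gap while `time_on_site` shows none, so only the former looks promising in this toy data.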

## 11. Bivariate — Categorical vs purchase (conversion rates)

**Why:** Find top converting categories to prioritize targeting.

**Outcome:** List top converters for each categorical column.
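Conversion rate per category is the mean of the 0/1 `purchase` column within each group. A sketch on toy `membership` data, keeping the visit count alongside the rate so small groups are not over-interpreted:

```python
import pandas as pd

# Toy data: one row per visit.
df = pd.DataFrame({
    "membership": ["Free", "Free", "Free", "Free", "Gold", "Gold", "Silver", "Silver"],
    "purchase": [0, 0, 1, 0, 1, 1, 0, 1],
})

conversion = (
    df.groupby("membership")["purchase"]
    .agg(visits="count", purchases="sum", conversion_rate="mean")
    .sort_values("conversion_rate", ascending=False)  # best converters first
)

print(conversion)
```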

## 12. Multivariate — Correlation matrix

**Why:** See which numeric features are correlated. Highly correlated features carry redundant information, so this helps spot multicollinearity before modeling.

**Outcome:** Table and heatmap of correlations.
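A sketch of the table part, plus a simple scan for strongly correlated pairs. The toy columns and the 0.8 threshold are assumptions; the threshold is a common rule of thumb rather than a fixed standard.

```python
import pandas as pd

# Toy data: time_on_site rises almost linearly with pages_viewed.
df = pd.DataFrame({
    "pages_viewed": [1, 2, 3, 4, 5, 6],
    "time_on_site": [2.0, 4.1, 6.2, 7.9, 10.1, 12.0],
    "cart_items": [3, 1, 2, 0, 2, 1],
    "membership": ["Free", "Gold", "Free", "Gold", "Free", "Gold"],
})

# Pearson correlation between numeric columns only.
corr = df.select_dtypes(include="number").corr()

# List each pair once if its absolute correlation exceeds the threshold.
threshold = 0.8
high_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print("Highly correlated pairs:", high_pairs)
```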

## 13. Pairwise scatter (sample)

**Why:** Quick visual check for relationships; sample rows to keep plots readable.

**Outcome:** Spot obvious clusters or trends.

## 14. Segmented analysis — pivot & grouped view

**Why:** Look at combinations like membership × product to prioritize segments.

**Outcome:** Pivot table and a small bar chart.
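A minimal sketch of the pivot and the grouped view on toy data. The `membership` and `product` values are invented for the example.

```python
import pandas as pd

# Toy data: two visits for every membership/product combination.
df = pd.DataFrame({
    "membership": ["Free", "Free", "Free", "Free", "Gold", "Gold", "Gold", "Gold"],
    "product": ["Books", "Books", "Shoes", "Shoes", "Books", "Books", "Shoes", "Shoes"],
    "purchase": [0, 1, 0, 0, 1, 1, 0, 1],
})

# Pivot: rows are membership levels, columns are products, cells are conversion rates.
pivot = pd.pivot_table(
    df, values="purchase", index="membership", columns="product", aggfunc="mean"
)

# Grouped view: the same rates in long form, one row per segment.
segment_rates = df.groupby(["membership", "product"])["purchase"].mean()
best_segment = segment_rates.idxmax()  # (membership, product) with the highest rate

print(pivot)
print("Best segment:", best_segment)
```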

## 15. Derived metrics — check usefulness

**Why:** See whether `engagement_score` and `cart_to_wishlist_ratio` separate purchasers from non-purchasers.

**Outcome:** Boxplots and medians.

## 16. Time-series — daily & hourly patterns

**Why:** Find times/days with higher conversion; helpful for scheduling campaigns.

**Outcome:** Line plots of daily purchase rate and hourly conversion.
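The series behind those line plots could be computed as below. This sketch assumes `date` is already parsed to a datetime type and `purchase` is coded 0/1; the timestamps are toy values.

```python
import pandas as pd

# Toy data: morning and evening visits over two days.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:30", "2024-01-01 20:00", "2024-01-01 20:15",
        "2024-01-02 09:10", "2024-01-02 20:05", "2024-01-02 20:40", "2024-01-02 20:50",
    ]),
    "purchase": [0, 0, 1, 1, 1, 1, 0, 1],
})

# Daily purchase rate: group by calendar day (time of day set to midnight).
daily_rate = df.groupby(df["date"].dt.normalize())["purchase"].mean()

# Hourly conversion: group by hour of day across all days.
hourly_rate = df.groupby(df["date"].dt.hour)["purchase"].mean()
best_hour = hourly_rate.idxmax()

print(daily_rate)
print(hourly_rate)
print("Best hour:", best_hour)
```

Calling `.plot()` on either series gives the corresponding line plot when matplotlib is installed.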

## 17. Summary & next steps

**Summary:**
- The notebook is simplified for student consumption.
- Key signals: engagement, pages viewed, cart behavior, membership.

**Next steps:** Build a preprocessing pipeline, train baseline models, evaluate them with business-driven metrics, and deploy a prediction API.