
# Superstore Business Analytics — CEO/PM Q&A Notebook (All-in-One)

This notebook is designed to **teach by doing**: every section starts with a real **business question**, lays out a **statistical plan**, runs the **code**, and then **interprets the result in plain business language**.

**Topics covered**
- Why Stats Matter in Business
- Descriptive Statistics
- Probability & Distributions
- Sampling & Estimation
- Hypothesis Testing
- Correlation & Simple OLS
- **Time Series** (trend, seasonality, growth, simple forecast)
- Extras: ABC/Pareto for products, outliers & risk tails

> **How to use**: Set `CSV_PATH` below to your Superstore CSV and run section-by-section in class.


In [1]:

# ==== Setup & Imports ====
from pathlib import Path
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose

pd.set_option('display.float_format', lambda x: f'{x:,.3f}')

# EDIT ME: point to your CSV
CSV_PATH = 'data/superstore.csv'  # <-- CHANGE THIS



## Data Loading & Feature Engineering

We clean columns, parse dates, coerce numerics, and add teaching-friendly features.


In [4]:

EXPECTED_COLS = [
    "Row ID","Order ID","Order Date","Ship Date","Ship Mode","Customer ID",
    "Customer Name","Segment","Country","City","State","Postal Code","Region",
    "Product ID","Category","Sub-Category","Product Name","Sales","Quantity",
    "Discount","Profit"
]

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df.columns = (
        df.columns.str.strip().str.lower()
        .str.replace(r'[^0-9a-zA-Z]+', '_', regex=True)
        .str.strip('_')
    )
    return df

def parse_dates(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in ["order_date","ship_date"]:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    return df

def coerce_types(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in ["sales","discount","profit"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    if "quantity" in df.columns:
        df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce").astype("Int64")
    return df

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    if "order_date" in df.columns and df["order_date"].notna().any():
        df["order_year"]  = df["order_date"].dt.year
        df["order_month"] = df["order_date"].dt.month
        df["order_ym"]    = df["order_date"].dt.to_period("M").dt.to_timestamp()
    if {"sales","profit"}.issubset(df.columns):
        with np.errstate(divide='ignore', invalid='ignore'):
            df["margin_rate"] = np.where(df["sales"]!=0, df["profit"]/df["sales"], np.nan)
    if "discount" in df.columns:
        df["discount_bin"] = pd.cut(df["discount"].fillna(0.0),
                                    bins=[-0.01,0.0,0.2,0.4,1.0],
                                    labels=["0","(0,0.2]","(0.2,0.4]","(0.4,1]"])
    return df

def load_superstore(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path, encoding='ISO-8859-1')
    keep = [c for c in EXPECTED_COLS if c in df.columns]
    df = df[keep]
    df = standardize_columns(df)
    df = parse_dates(df)
    df = coerce_types(df)
    df = add_features(df)
    return df

df = load_superstore(CSV_PATH)
df.head()




,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,...,product_name,sales,quantity,discount,profit,order_year,order_month,order_ym,margin_rate,discount_bin
0,1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,Bush Somerset Collection Bookcase,261.96,2,0.0,41.914,2016,11,2016-11-01,0.16,0
1,2,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582,2016,11,2016-11-01,0.3,0
2,3,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.871,2016,6,2016-06-01,0.47,0
3,4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,Bretford CR4500 Series Slim Rectangular Table,957.577,5,0.45,-383.031,2015,10,2015-10-01,-0.4,"(0.4,1]"
4,5,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.516,2015,10,2015-10-01,0.113,"(0,0.2]"
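
To see what the feature-engineering step produces, here is a minimal sketch on a toy frame. The rows are hypothetical values, not taken from the Superstore file; the bin edges and labels are the ones used in `add_features`. It illustrates two details that are easy to miss: the bins are right-closed, and a missing discount is treated as zero.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values, not from the Superstore file)
toy = pd.DataFrame({
    "sales":    [100.0, 50.0, 0.0, 200.0],
    "profit":   [20.0, -10.0, 0.0, 30.0],
    "discount": [0.0, 0.2, 0.45, np.nan],
})

# Same edges and labels as add_features(). Intervals are right-closed, so
# 0.0 lands in (-0.01, 0.0] -> "0" and 0.2 lands in (0.0, 0.2] -> "(0,0.2]".
# A missing discount is filled with 0.0 first, so it also lands in "0".
toy["discount_bin"] = pd.cut(
    toy["discount"].fillna(0.0),
    bins=[-0.01, 0.0, 0.2, 0.4, 1.0],
    labels=["0", "(0,0.2]", "(0.2,0.4]", "(0.4,1]"],
)

# Margin rate is undefined when sales are zero, so that row becomes NaN.
toy["margin_rate"] = np.where(toy["sales"] != 0, toy["profit"] / toy["sales"], np.nan)

print(toy)
```

If treating a missing discount as "no discount" is not appropriate for your data, drop the `fillna(0.0)` and those rows will get a missing bin instead.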



## 1) Descriptive Statistics

**Business Question (PM):** *“Where are we making and losing money — by segment, region, and category?”*  
**Statistical Plan:** Summarize numeric variables; group KPIs by business dimensions; find top/bottom products.


In [5]:

def numeric_summary(df, cols=("sales","profit","discount","quantity","margin_rate")):
    present = [c for c in cols if c in df.columns]
    return df[present].describe(percentiles=[0.25,0.5,0.75]).T

def segment_kpis(df):
    keys = [c for c in ["segment","region","category"] if c in df.columns]
    num_cols = [c for c in ["sales","profit","quantity","discount","margin_rate"] if c in df.columns]
    out = df.groupby(keys)[num_cols].agg(["count","sum","mean","median"])
    out.columns = ["_".join(col).strip("_") for col in out.columns.to_flat_index()]
    return out.sort_values(out.columns[0], ascending=False)

def top_n_products(df, n=10, by="profit"):
    cols = [c for c in ["product_id","product_name","category","sub_category","sales","profit"] if c in df.columns]
    agg = df[cols].groupby(["product_id","product_name","category","sub_category"], dropna=False)[["sales","profit"]].sum()
    return agg.sort_values(by=by, ascending=False).head(n).reset_index()

summ = numeric_summary(df).round(3)
kpis = segment_kpis(df).round(3)
tops = top_n_products(df, n=10, by="profit").round(2)

display(summ)
display(kpis.head(10))
display(tops)


,count,mean,std,min,25%,50%,75%,max
sales,9994.0,229.858,623.245,0.444,17.28,54.49,209.94,22638.48
profit,9994.0,28.657,234.26,-6599.978,1.729,8.666,29.364,8399.976
discount,9994.0,0.156,0.206,0.0,0.0,0.2,0.2,0.8
quantity,9994.0,3.79,2.225,1.0,2.0,3.0,5.0,14.0
margin_rate,9994.0,0.12,0.467,-2.75,0.075,0.27,0.362,0.5


segment,region,category,sales_count,sales_sum,sales_mean,sales_median,profit_count,profit_sum,profit_mean,profit_median,quantity_count,quantity_sum,quantity_mean,quantity_median,discount_count,discount_sum,discount_mean,discount_median,margin_rate_count,margin_rate_sum,margin_rate_mean,margin_rate_median
Consumer,West,Office Supplies,995,110080.94,110.634,31.104,995,25334.368,25.462,9.331,995,3837,3.856,3.0,995,91.2,0.092,0.0,995,282.342,0.284,0.337
Consumer,East,Office Supplies,888,101255.136,114.026,25.632,888,19222.693,21.647,6.725,888,3245,3.654,3.0,888,126.8,0.143,0.0,888,185.085,0.208,0.334
Consumer,Central,Office Supplies,739,93111.479,125.997,23.832,739,2583.143,3.495,4.176,739,2777,3.758,3.0,739,196.0,0.265,0.2,739,-152.641,-0.207,0.262
Corporate,West,Office Supplies,559,77133.856,137.985,30.144,559,18527.701,33.144,8.672,559,2097,3.751,3.0,559,54.3,0.097,0.0,559,162.672,0.291,0.338
Corporate,East,Office Supplies,520,66474.735,127.836,32.04,520,13457.665,25.88,7.826,520,2035,3.913,3.0,520,74.3,0.143,0.0,520,99.613,0.192,0.3
Consumer,South,Office Supplies,505,59504.581,117.831,27.86,505,9190.117,18.198,6.647,505,1899,3.76,3.0,505,82.4,0.163,0.2,505,87.848,0.174,0.31
Corporate,Central,Office Supplies,417,41137.701,98.652,23.52,417,99.295,0.238,5.101,417,1605,3.849,3.0,417,107.3,0.257,0.2,417,-60.474,-0.145,0.312
Consumer,West,Furniture,375,119808.087,319.488,171.288,375,4330.668,11.548,11.156,375,1405,3.747,3.0,375,48.75,0.13,0.15,375,34.697,0.093,0.112
Home Office,West,Office Supplies,343,33638.453,98.071,29.6,343,8747.78,25.504,8.119,343,1301,3.793,3.0,343,31.6,0.092,0.0,343,97.138,0.283,0.338
Corporate,South,Office Supplies,324,45930.17,141.76,31.047,324,8142.66,25.132,6.627,324,1281,3.954,3.0,324,56.4,0.174,0.1,324,48.527,0.15,0.312


,product_id,product_name,category,sub_category,sales,profit
0,TEC-CO-10004722,Canon imageCLASS 2200 Advanced Copier,Technology,Copiers,61599.82,25199.93
1,OFF-BI-10003527,Fellowes PB500 Electric Punch Plastic Comb Bin...,Office Supplies,Binders,27453.38,7753.04
2,TEC-CO-10001449,Hewlett Packard LaserJet 3310 Copier,Technology,Copiers,18839.69,6983.88
3,TEC-CO-10003763,Canon PC1060 Personal Laser Copier,Technology,Copiers,11619.83,4570.93
4,TEC-MA-10001127,HP Designjet T520 Inkjet Large Format Printer ...,Technology,Machines,18374.9,4094.98
5,TEC-MA-10003979,Ativa V4110MDD Micro-Cut Shredder,Technology,Machines,7699.89,3772.95
6,TEC-MA-10001047,"3D Systems Cube Printer, 2nd Generation, Magenta",Technology,Machines,14299.89,3717.97
7,TEC-AC-10002049,Plantronics Savi W720 Multi-Device Wireless He...,Technology,Accessories,9367.29,3696.28
8,OFF-BI-10001120,Ibico EPK-21 Electric Binding System,Office Supplies,Binders,15875.92,3345.28
9,TEC-MA-10000045,Zebra ZM400 Thermal Label Printer,Technology,Machines,6965.7,3343.54



### Interpretation (tell the story)
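- **Overall scale & spread:** Mean sales (229.86) sit far above the median (54.49), and `profit` ranges from -6,599.98 to 8,399.98 with a standard deviation (234.26) roughly eight times its mean (28.66). Both are heavy-tailed, so prefer medians/IQR to avoid outlier bias.  
- **Segment/Region/Category:** Identify the **best** (highest median/sum profit) and **worst** performers. In the rows shown, Central Office Supplies stands out: the highest average discounts (about 0.26) and a negative mean margin rate for both Consumer and Corporate, versus about 0.28 in the West.  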
- **Top products:** Are profits concentrated in a few SKUs? That suggests **Pareto/ABC** focus.

> **Decision hint:** Double down on high-margin segments/categories; investigate chronic loss-makers for repricing or delisting.
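
Why medians and IQR rather than means can be shown with a small sketch. The profit values below are hypothetical, chosen only to mimic a few typical line items plus two extreme ones; they are not drawn from the dataset.

```python
import numpy as np

# Hypothetical line-item profits: mostly small gains plus two extreme values
profits = np.array([5.0, 8.0, 9.0, 10.0, 12.0, 15.0, 20.0, 30.0, -6600.0, 8400.0])

mean = profits.mean()
median = np.median(profits)
q1, q3 = np.percentile(profits, [25, 75])
iqr = q3 - q1

# Drop the single largest and smallest value to see how much they move the mean
trimmed_mean = np.sort(profits)[1:-1].mean()

print(f"mean={mean:.2f}  median={median:.2f}  IQR={iqr:.2f}  trimmed mean={trimmed_mean:.3f}")
```

Two extreme rows are enough to pull the mean far from what a typical line item earns, while the median and IQR barely move. That is the pattern to look for in the `profit` row of the summary table.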



## 2) Probability & Distributions

**Business Question (CEO):** *“What’s the chance we lose money at different discount levels?”*  
**Statistical Plan:** Estimate empirical `P(Profit > 0)` by `discount_bin`; optionally use normal-approx on sales for intuition.


In [6]:

def empirical_probability_profit_positive_by_discount(df):
    if not {"discount_bin","profit"}.issubset(df.columns):
        return pd.DataFrame()
    tmp = df.assign(profit_pos=df["profit"] > 0)
    return tmp.groupby("discount_bin", observed=False)["profit_pos"].mean().to_frame("p_profit>0")

def normal_approx_prob(series: pd.Series, threshold: float, side: str="above"):
    s = series.dropna()
    if len(s) < 2: return np.nan
    mu, sd = s.mean(), s.std(ddof=1)
    if sd <= 0:  # degenerate case: every value equals mu
        return float(mu > threshold) if side == "above" else float(mu < threshold)
    z = (threshold - mu) / sd
    if side == "above": return 1 - stats.norm.cdf(z)
    if side == "below": return stats.norm.cdf(z)
    return 2*(1 - stats.norm.cdf(abs(z)))

p_by_disc = empirical_probability_profit_positive_by_discount(df).round(3)
display(p_by_disc)

if "sales" in df.columns:
    med = df["sales"].dropna().median()
    approx = normal_approx_prob(df["sales"], med, side="above")
    print(f"Normal approx: P(Sales > median) ≈ {approx:.3f}")




discount_bin,p_profit>0
0,0.994
"(0,0.2]",0.857
"(0.2,0.4]",0.07
"(0.4,1]",0.0


Normal approx: P(Sales > median) ≈ 0.611



### Interpretation
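- `P(Profit>0)` falls from 0.994 with no discount to 0.857 for discounts up to 20%, then collapses to 0.07 for (0.2,0.4] and 0.0 above 40%: deep discounts are clearly **riskier**.  
- Use as a **risk dial**: which discount bands keep us safely profitable?  
- The normal approximation gives `P(Sales > median) ≈ 0.611`, although by definition about half of all values lie above the median. The gap reflects the right skew in sales, so treat the demo as intuition only and rely on empirical results for decisions.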

> **Decision hint:** Set discount policies where `P(Profit>0)` stays acceptably high; require approvals for deeper cuts.
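
The mismatch between a normal approximation and the empirical share above the median can be reproduced on synthetic data. This is a sketch under stated assumptions: the "sales" values are simulated from a lognormal distribution (an assumption chosen only because it is right-skewed), not taken from the Superstore file.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed "sales" values (lognormal), not the real data
sales = rng.lognormal(mean=4.0, sigma=1.2, size=10_000)

median = np.median(sales)

# Empirical share strictly above the median: one half for an even-sized sample without ties
empirical = float((sales > median).mean())

# Normal approximation: plug the sample mean and sd into a normal CDF
mu, sd = sales.mean(), sales.std(ddof=1)
normal_approx = float(1 - stats.norm.cdf((median - mu) / sd))

print(f"empirical P(X > median) = {empirical:.3f}")
print(f"normal-approx P(X > median) = {normal_approx:.3f}")
```

Because the mean of a right-skewed variable sits above its median, the normal model places the median below its center and overstates the probability. The further the approximation drifts from 0.5, the less a normal model should be trusted for that variable.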



## 3) Sampling & Estimation

**Business Question (Analyst):** *“Can we estimate mean profit precisely with a small sample?”*  
**Statistical Plan:** Draw a simple random sample; compute a bootstrap **95% CI** for mean profit; calculate sample size for a desired MOE.


In [7]:

from typing import Tuple

def take_sample(df, n=500, random_state=42):
    n = min(n, len(df))
    return df.sample(n=n, random_state=random_state)

def bootstrap_mean_ci(series: pd.Series, n_boot: int = 2000, alpha: float = 0.05, random_state: int = 42) -> Tuple[float,float,float]:
    rng = np.random.default_rng(random_state)
    s = series.dropna().values
    if len(s) == 0: return (np.nan, np.nan, np.nan)
    boot = []
    for _ in range(n_boot):
        sample = rng.choice(s, size=len(s), replace=True)
        boot.append(sample.mean())
    lo, hi = np.percentile(boot, [100*alpha/2, 100*(1-alpha/2)])
    return (float(np.mean(s)), float(lo), float(hi))

def sample_size_mean(sd: float, margin_error: float, z: float = 1.96) -> int:
    if margin_error <= 0 or sd <= 0: return np.nan
    return int(np.ceil((z*sd / margin_error)**2))

sample = take_sample(df, n=500)
mean_profit, lo, hi = bootstrap_mean_ci(sample["profit"])
sd_profit = df["profit"].std()
n_needed = sample_size_mean(sd=sd_profit, margin_error=5.0)

print(f"Bootstrap 95% CI for mean Profit (n={len(sample)}): mean={mean_profit:,.2f}, CI=({lo:,.2f}, {hi:,.2f})")
print(f"Required n for ±$5 MOE (~95%): {n_needed}")


Bootstrap 95% CI for mean Profit (n=500): mean=22.04, CI=(8.89, 35.06)
Required n for ±$5 MOE (~95%): 8433



### Interpretation
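- The **CI** tells us the plausible range for mean profit; if it’s narrow enough, the estimate is **decision-grade**. Here, with n=500, the interval (8.89, 35.06) spans about ±$13 around the sample mean of 22.04, which is too wide for a ±$5 target.  
- The **sample-size** figure helps plan data collection or experiment scale. Reaching ±$5 takes about 8,433 rows, most of the 9,994 in the file, because profit is so dispersed.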

> **Decision hint:** If leadership needs ±$5 precision, ensure samples meet or exceed the computed `n`.
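
The sample-size figure follows from requiring `z * sd / sqrt(n) <= MOE`, which rearranges to `n >= (z * sd / MOE)^2`. A dependency-free sketch of that arithmetic follows; the helper name `sample_size_for_mean` is illustrative, and the standard deviation is the rounded value from the descriptive table (234.26).

```python
import math

def sample_size_for_mean(sd: float, margin_error: float, z: float = 1.96) -> int:
    """Smallest n with z * sd / sqrt(n) <= margin_error (normal approximation)."""
    return math.ceil((z * sd / margin_error) ** 2)

sd_profit = 234.26  # std of profit from the descriptive table (rounded)

for moe in (10.0, 5.0, 2.5):
    n = sample_size_for_mean(sd_profit, moe)
    print(f"MOE ±${moe}: n = {n:,}")
```

Halving the margin of error roughly quadruples the required sample, so precision targets should be negotiated before data collection is planned. The formula ignores the finite-population correction, so it overstates the need when `n` approaches the size of the whole dataset.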



## 4) Hypothesis Testing

**Business Question (CEO):** *“Do deep discounts (>20%) **reduce** average profit?”*  
**Statistical Plan:** Two-sample **t-test** comparing mean profit between `discount <= 0.2` and `> 0.2`. Also test **segments** (two-proportion z), **regions** (ANOVA), and **Category×Region** (chi-square).


In [8]:

def ttest_profit_by_discount(df, split=0.2):
    low  = df.loc[df["discount"] <= split, "profit"].dropna()
    high = df.loc[df["discount"] >  split, "profit"].dropna()
    res = stats.ttest_ind(low, high, equal_var=False)
    return {
        "n_low": int(low.shape[0]), "n_high": int(high.shape[0]),
        "mean_low": float(low.mean() if len(low)>0 else np.nan),
        "mean_high": float(high.mean() if len(high)>0 else np.nan),
        "t_stat": float(res.statistic), "p_value": float(res.pvalue)
    }

def prop_test_profit_positive_by_segment(df):
    tmp = df.assign(pos=df["profit"]>0)
    counts = tmp.groupby("segment")["pos"].agg(["sum","count"]).sort_values("count", ascending=False)
    if counts.shape[0] < 2: return None
    (x1,n1),(x2,n2) = counts.iloc[0].tolist(), counts.iloc[1].tolist()
    p1,p2 = x1/n1, x2/n2
    p_pool = (x1+x2)/(n1+n2)
    se = (p_pool*(1-p_pool)*(1/n1 + 1/n2))**0.5
    z  = (p1 - p2) / se if se>0 else np.nan
    p  = 2*(1 - stats.norm.cdf(abs(z))) if np.isfinite(z) else np.nan
    return {"segments": counts.index[:2].tolist(), "p1": float(p1), "p2": float(p2), "z": float(z), "p_value": float(p)}

def anova_profit_by_region(df):
    groups = [g["profit"].dropna().values for _, g in df.groupby("region")]
    if len(groups) < 2: return None
    f, p = stats.f_oneway(*groups)
    return {"k_groups": len(groups), "F": float(f), "p_value": float(p)}

def chi_square_category_region(df):
    tab = pd.crosstab(df["category"], df["region"])
    chi2 = stats.chi2_contingency(tab)
    return {"chi2": float(chi2[0]), "p_value": float(chi2[1]), "dof": int(chi2[2]), "expected_shape": tab.shape}

res_t = ttest_profit_by_discount(df)
res_p = prop_test_profit_positive_by_segment(df)
res_a = anova_profit_by_region(df)
res_c = chi_square_category_region(df)

display(res_t)
display(res_p)
display(res_a)
display(res_c)


{'n_low': 8601,
 'n_high': 1393,
 'mean_low': 49.0376790722009,
 'mean_high': -97.1830983488873,
 't_stat': 16.1409994038947,
 'p_value': 2.2951802592338373e-54}

{'segments': ['Consumer', 'Corporate'],
 'p1': 0.8000385282219226,
 'p2': 0.8102649006622517,
 'z': -1.1251978133511358,
 'p_value': 0.26050521954191974}

{'k_groups': 4, 'F': 2.6224781547278115, 'p_value': 0.04889160022170713}

{'chi2': 3.6619309383265723,
 'p_value': 0.7223155543986877,
 'dof': 6,
 'expected_shape': (3, 4)}


### Interpretation
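- **T-test (discounts):** Mean profit is 49.04 for discounts ≤ 20% versus -97.18 above 20% (Welch t = 16.14, p ≈ 2.3e-54), so deep discounts are **statistically associated** with lower profit. The test as run is two-sided; the observed difference is in the direction the question asks about.  
- **Proportions (segments):** Consumer (80.0% of lines profitable) and Corporate (81.0%) do not differ significantly (z = -1.13, p = 0.26). A significant `p_value` would have implied different **profitability rates** and a case for tailored strategies; this result does not.  
- **ANOVA (regions):** F = 2.62, p = 0.049 is borderline. At least one region’s mean profit may differ → investigate drivers (shipping, pricing, mix) before acting.  
- **Chi-square (Category×Region):** χ² = 3.66 (dof = 6, p = 0.72) is not significant → the category mix of order lines looks similar across regions, so there is little evidence here for region-specific assortment or marketing.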

> **Decision hint:** Tighten discount controls; run targeted experiments by segment/region to confirm causality before policy changes.
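
Because the business question is directional ("do deep discounts *reduce* profit?"), a one-sided Welch test and an effect size are natural companions to the two-sided test. The sketch below uses simulated profit samples (hypothetical means, spreads, and group sizes, not the real data) and assumes a SciPy version that supports the `alternative` argument of `stats.ttest_ind` (1.6 or later).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical profit samples for the two discount groups (not the real data)
low = rng.normal(loc=50.0, scale=200.0, size=800)    # discount <= 0.2
high = rng.normal(loc=-90.0, scale=300.0, size=150)  # discount >  0.2

# Welch test, two-sided: "are the means different?"
two_sided = stats.ttest_ind(low, high, equal_var=False)

# Welch test, one-sided: "is mean(low) greater than mean(high)?"
one_sided = stats.ttest_ind(low, high, equal_var=False, alternative="greater")

# Cohen's d with a pooled standard deviation: how large is the gap in sd units?
n1, n2 = len(low), len(high)
pooled_sd = np.sqrt(((n1 - 1) * low.var(ddof=1) + (n2 - 1) * high.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (low.mean() - high.mean()) / pooled_sd

print(f"two-sided p = {two_sided.pvalue:.3g}")
print(f"one-sided p = {one_sided.pvalue:.3g}")
print(f"Cohen's d   = {cohens_d:.2f}")
```

With thousands of rows almost any difference becomes "significant", so the effect size is often the more useful number to bring to a pricing discussion.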



## 5) Correlation & Simple OLS

**Business Question (CFO):** *“How do discount, quantity, and sales **relate** to profit? Can we get a quick predictive read?”*  
**Statistical Plan:** Compute Pearson & Spearman correlations; fit a compact OLS model.


In [10]:

def pearson_spearman(df, x="discount", y="profit"):
    s = df[[x,y]].dropna()
    if len(s) < 3: return None
    r_pear, p_pear = stats.pearsonr(s[x], s[y])
    r_spear, p_spear = stats.spearmanr(s[x], s[y])
    return {"pearson_r": float(r_pear), "pearson_p": float(p_pear),
            "spearman_rho": float(r_spear), "spearman_p": float(p_spear), "n": int(len(s))}

def simple_ols(df, y="profit", X=("discount","quantity","sales")):
    cols = [c for c in X if c in df.columns]
    dat = df[[y] + cols].dropna()
    if dat.shape[0] < len(cols) + 3: return None
    Xmat = sm.add_constant(dat[cols])
    model = sm.OLS(dat[y], Xmat).fit()
    return {"nobs": int(model.nobs), "r2": float(model.rsquared),
            "coef": {k: float(v) for k, v in model.params.to_dict().items()},
            "pvalues": {k: float(v) for k, v in model.pvalues.to_dict().items()}}

corrs = pearson_spearman(df, x="discount", y="profit")
display(corrs)



{'pearson_r': -0.21948745637176845,
 'pearson_p': 2.702294436198944e-109,
 'spearman_rho': -0.5433501822306213,
 'spearman_p': 0.0,
 'n': 9994}


### Interpretation
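- **Correlation:** Sign and magnitude tell us direction/strength (watch for nonlinearity & confounders). Here Pearson r = -0.22 while Spearman ρ = -0.54: profit falls fairly consistently as discount rises, but the relationship is not linear and extreme values weaken the Pearson figure.  
- **OLS:** Call `simple_ols(df)` to fit the model (the cell above defines it but only displays the correlations). Coefficients show marginal associations holding other included variables constant. Use for **directional insight**, not causation.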

> **Decision hint:** If discount coefficient is strongly negative and significant, prioritize pricing experiments; if quantity/sales coefficients are positive, consider volume-driving tactics (with margin guardrails).
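
A gap between Pearson and Spearman usually signals a monotone but nonlinear relationship. The sketch below builds such a relationship from simulated data; the exponential form and noise level are assumptions chosen for illustration, not a model of the Superstore data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical discount levels and a profit that falls ever faster as discount grows
discount = rng.uniform(0.0, 0.8, size=2_000)
profit = -np.exp(6 * discount) + rng.normal(0.0, 1.0, size=2_000)

r, _ = stats.pearsonr(discount, profit)       # measures linear association
rho, _ = stats.spearmanr(discount, profit)    # measures monotone (rank) association

print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

Spearman works on ranks, so it captures "more discount, less profit" regardless of curvature, while Pearson is penalized when the points do not lie near a straight line. When the two diverge, a scatter plot or a binned summary is a better next step than a linear coefficient.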



## 6) Time Series (Trend, Seasonality, Growth, Simple Forecast)

**Business Question (CEO):** *“How are sales and profit trending over time? What do we expect next month?”*  
**Statistical Plan:** Aggregate by month; compute moving averages and YoY growth; decompose seasonality; produce a simple naive/MA forecast.


In [11]:

# Monthly aggregates
ts = df.dropna(subset=["order_ym"]).groupby("order_ym")[["sales","profit"]].sum().sort_index()

# 3-month moving averages
ts["sales_ma3"]  = ts["sales"].rolling(3, min_periods=1).mean()
ts["profit_ma3"] = ts["profit"].rolling(3, min_periods=1).mean()

# YoY growth needs a value from 12 months earlier, so require at least 13 months
if len(ts) >= 13:
    ts["sales_yoy"]  = ts["sales"].pct_change(12)
    ts["profit_yoy"] = ts["profit"].pct_change(12)

display(ts.tail(12))

# Seasonal decomposition (additive) if enough points
decomp_summary = {}
if len(ts) >= 24:
    decomp = seasonal_decompose(ts["sales"], model="additive", period=12, two_sided=False, extrapolate_trend='freq')
    decomp_summary = {
        "trend_last": float(decomp.trend.dropna().iloc[-1]) if decomp.trend is not None else np.nan,
        "seasonal_peak": float(decomp.seasonal.groupby(ts.index.month).mean().max()),
        "seasonal_trough": float(decomp.seasonal.groupby(ts.index.month).mean().min()),
    }
decomp_summary


order_ym,sales,profit,sales_ma3,profit_ma3,sales_yoy,profit_yoy
2017-01-01,43971.374,7140.439,73460.794,9679.052,1.371,1.528
2017-02-01,20301.133,1613.872,53757.183,8879.873,-0.117,-0.678
2017-03-01,58872.353,14751.892,41048.287,7835.401,0.138,3.084
2017-04-01,36521.536,933.29,38565.007,5766.351,-0.058,-0.687
2017-05-01,44261.11,6342.583,46551.666,7342.588,-0.223,-0.268
2017-06-01,52981.726,8223.336,44588.124,5166.403,0.313,0.731
2017-07-01,45264.416,6952.621,47502.417,7172.847,0.153,0.568
2017-08-01,63120.888,9040.956,53789.01,8072.304,1.029,3.384
2017-09-01,87866.652,10991.556,65417.319,8995.044,0.197,0.178
2017-10-01,77776.923,9275.275,76254.821,9769.262,0.303,-0.429


{'trend_last': 61650.009775,
 'seasonal_peak': 38254.346665414974,
 'seasonal_trough': -31639.50061003302}

In [12]:

# Simple forecasts: last-value (naive) and MA(3) extrapolation
def naive_forecast(series: pd.Series):
    return float(series.dropna().iloc[-1]) if len(series.dropna()) else np.nan

def ma3_forecast(series: pd.Series):
    s = series.dropna()
    if len(s) == 0: return np.nan
    return float(s.rolling(3, min_periods=1).mean().iloc[-1])

next_sales_naive = naive_forecast(ts["sales"])
next_sales_ma3   = ma3_forecast(ts["sales"])
next_profit_naive = naive_forecast(ts["profit"])
next_profit_ma3   = ma3_forecast(ts["profit"])

print(f"Next-month Sales (naive): {next_sales_naive:,.0f} | MA(3): {next_sales_ma3:,.0f}")
print(f"Next-month Profit (naive): {next_profit_naive:,.0f} | MA(3): {next_profit_ma3:,.0f}")


Next-month Sales (naive): 83,829 | MA(3): 93,351
Next-month Profit (naive): 8,483 | MA(3): 9,150



### Interpretation
- **Trend:** Use MA(3) and decomposition trend to describe direction (up/down/flat).  
- **Seasonality:** Peaks/troughs reveal planning windows (inventory, promos).  
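- **Growth:** YoY increasing? That supports expansion decisions. In the 2017 months shown, sales YoY is positive in 7 of 10 but swings widely (from -22% to +137%), so read it alongside the trend rather than month by month.  
- **Forecast:** Naive/MA give a **baseline**. Neither accounts for the seasonal component (monthly effects from about -31.6k to +38.3k around trend), so refine with ARIMA/Prophet if needed.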

> **Decision hint:** Align promotions to seasonal peaks, ensure capacity before expected upswings, and set targets using the lower baseline as the conservative case and the higher one as the optimistic case (here, naive 83,829 vs MA(3) 93,351 for sales).
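
When seasonality is strong, a seasonal-naive baseline ("same month last year") is a useful third reference point next to the naive and MA(3) forecasts. This is a sketch on a synthetic monthly series: the trend and year-end bump are hypothetical, and the seasonal-naive method is an addition for comparison, not something the notebook computes.

```python
import numpy as np
import pandas as pd

# Hypothetical 36-month sales series: linear trend plus a Nov/Dec peak
idx = pd.date_range("2014-01-01", periods=36, freq="MS")
month = idx.month.to_numpy()
values = 30_000 + 500 * np.arange(36) + np.where(month >= 11, 25_000, 0)
ts_demo = pd.Series(values.astype(float), index=idx)

# Three baselines for the next month (January 2017)
naive = float(ts_demo.iloc[-1])             # last observed month (December)
ma3 = float(ts_demo.iloc[-3:].mean())       # mean of the last three months
seasonal_naive = float(ts_demo.iloc[-12])   # same calendar month one year earlier

# What the generating process would produce for January 2017 (no seasonal bump)
actual_next = 30_000 + 500 * 36

for name, f in [("naive", naive), ("MA(3)", ma3), ("seasonal naive", seasonal_naive)]:
    print(f"{name:>15}: forecast={f:,.0f}  error={f - actual_next:+,.0f}")
```

Right after a seasonal peak, both the naive and MA(3) forecasts carry the peak forward, while the seasonal-naive forecast misses only the year's worth of trend. Comparing all three before setting targets is a cheap sanity check.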



## 7) Extras — ABC/Pareto, Outliers & Risk Tails

**Business Question (Merchandising):** *“Which SKUs drive most of our profit, and where are extreme losses coming from?”*  
**Statistical Plan:** Pareto rank by cumulative profit (ABC); inspect tail losses.


In [13]:

# Pareto on product profit
prod = df.groupby(["product_id","product_name"], dropna=False)["profit"].sum().sort_values(ascending=False).reset_index()
prod["cum_profit"] = prod["profit"].cumsum()
total_profit = prod["profit"].sum()
prod["cum_share"] = prod["cum_profit"] / total_profit if total_profit != 0 else np.nan

def abc_class(share):
    if pd.isna(share): return np.nan
    if share <= 0.80: return "A"
    if share <= 0.95: return "B"
    return "C"

prod["ABC"] = prod["cum_share"].apply(abc_class)
display(prod.head(15))

# Loss tail: worst 1% of line items by profit
cut = np.nanpercentile(df["profit"], 1)
tail_losses = df.loc[df["profit"] <= cut, ["order_id","product_name","category","sub_category","discount","profit"]].sort_values("profit").head(20)
display(tail_losses)


,product_id,product_name,profit,cum_profit,cum_share,ABC
0,TEC-CO-10004722,Canon imageCLASS 2200 Advanced Copier,25199.928,25199.928,0.088,A
1,OFF-BI-10003527,Fellowes PB500 Electric Punch Plastic Comb Bin...,7753.039,32952.967,0.115,A
2,TEC-CO-10001449,Hewlett Packard LaserJet 3310 Copier,6983.884,39936.851,0.139,A
3,TEC-CO-10003763,Canon PC1060 Personal Laser Copier,4570.935,44507.785,0.155,A
4,TEC-MA-10001127,HP Designjet T520 Inkjet Large Format Printer ...,4094.977,48602.762,0.17,A
5,TEC-MA-10003979,Ativa V4110MDD Micro-Cut Shredder,3772.946,52375.708,0.183,A
6,TEC-MA-10001047,"3D Systems Cube Printer, 2nd Generation, Magenta",3717.971,56093.679,0.196,A
7,TEC-AC-10002049,Plantronics Savi W720 Multi-Device Wireless He...,3696.282,59789.961,0.209,A
8,OFF-BI-10001120,Ibico EPK-21 Electric Binding System,3345.282,63135.244,0.22,A
9,TEC-MA-10000045,Zebra ZM400 Thermal Label Printer,3343.536,66478.78,0.232,A


,order_id,product_name,category,sub_category,discount,profit
7772,CA-2016-108196,Cubify CubeX 3D Printer Double Head Print,Technology,Machines,0.7,-6599.978
683,US-2017-168116,Cubify CubeX 3D Printer Triple Head Print,Technology,Machines,0.5,-3839.99
9774,CA-2014-169019,GBC DocuBind P400 Electric Binding System,Office Supplies,Binders,0.8,-3701.893
3011,CA-2017-134845,Lexmark MX611dhe Monochrome Laser Printer,Technology,Machines,0.7,-3399.98
4991,US-2017-122714,Ibico EPK-21 Electric Binding System,Office Supplies,Binders,0.8,-2929.485
3151,CA-2015-147830,Cubify CubeX 3D Printer Double Head Print,Technology,Machines,0.7,-2639.991
5310,CA-2017-131254,Fellowes PB500 Electric Punch Plastic Comb Bin...,Office Supplies,Binders,0.8,-2287.782
9639,CA-2015-116638,Chromcraft Bull-Nose Wood Oval Conference Tabl...,Furniture,Tables,0.4,-1862.312
1199,CA-2016-130946,GBC DocuBind P400 Electric Binding System,Office Supplies,Binders,0.8,-1850.946
2697,CA-2014-145317,Cisco TelePresence System EX90 Videoconferenci...,Technology,Machines,0.5,-1811.078



### Interpretation
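- **ABC:** A small fraction of SKUs drives a large share of profit (A items): the top 10 products alone account for 23.2% of the total, and the Canon imageCLASS 2200 copier for 8.8%. Protect availability and margins there.  
- **Loss tail:** Identify patterns and set **guardrails**. Every one of the worst line items shown carries a discount of 40% or more, and nine of the ten are Machines or Binders. Two of them (Fellowes PB500, Ibico EPK-21) are also top-10 profit products, so the losses come from the discount, not the product.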

> **Decision hint:** Prioritize A-items for inventory/assortment; review pricing/discounts for chronic loss-makers.
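
The ABC rule used above (A up to 80% of cumulative profit, B up to 95%, C beyond) can be checked by hand on a tiny example. The SKU names and profit totals below are hypothetical; `pd.cut` is used here as a compact alternative to the `abc_class` helper.

```python
import pandas as pd

# Hypothetical per-SKU profit totals, sorted from best to worst
profit = pd.Series(
    {"A1": 500.0, "A2": 250.0, "B1": 100.0, "B2": 80.0, "C1": 40.0, "C2": 20.0, "C3": 10.0}
).sort_values(ascending=False)

# Cumulative share of total profit after each SKU
cum_share = profit.cumsum() / profit.sum()

# Same thresholds as abc_class(): <= 0.80 -> A, <= 0.95 -> B, otherwise C
abc = pd.cut(
    cum_share,
    bins=[float("-inf"), 0.80, 0.95, float("inf")],
    labels=["A", "B", "C"],
)

summary = pd.DataFrame({"profit": profit, "cum_share": cum_share.round(2), "ABC": abc})
print(summary)

share_of_skus_in_a = float((abc == "A").mean())
print(f"Share of SKUs classed A: {share_of_skus_in_a:.1%}")
```

In this toy case two of seven SKUs carry 75% of profit. Note that when some products have negative total profit, the cumulative share can rise above 1 before falling back, which is why the top bin is left open-ended.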



## Wrap-Up & Recommendations

- **Descriptives** surfaced where we win/lose.  
- **Probability** quantified risk by discount.  
- **Sampling/CI** gave precision planning.  
- **Hypothesis tests** backed policy choices with evidence.  
- **Correlation/OLS** offered directional drivers.  
- **Time series** informed planning and simple forecasting.  
- **Pareto/Outliers** focused attention where it matters most.

**Next steps for the team**
1. Define discount guardrails based on `P(Profit>0)` and t-test outcomes.  
2. Pilot **segment/region-targeted** strategies to validate causality.  
3. Use the MA/naive forecasts to set next-month targets; refine with ARIMA later.  
4. Maintain an **A/B test calendar** to turn insights into controlled experiments.
