## Grouped Summary Statistics

**Grouped summary statistics** are a powerful way to analyze data by dividing it into categories and computing descriptive statistics for each group. This allows for more insightful, segmented analysis rather than relying solely on overall statistics.

### Purpose of Grouped Statistics

Grouped statistics help answer questions like:

* How do average sales differ across store types?
* What is the maximum unemployment rate for each year?
* Which department generates the highest revenue in each store?

They allow you to identify trends, patterns, and outliers within specific groups of your data.

### How It Works

You group the data based on one or more categorical columns (like store, department, or year), and then apply summary functions (like mean, sum, or median) to numerical columns. This helps break down large datasets into smaller, more manageable pieces of analysis.
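
As a minimal sketch of the split-then-summarize idea described above, the snippet below groups a small made-up table by one categorical column and applies a single summary function. The `toy` DataFrame and its `store` and `revenue` columns are hypothetical and are not part of the sales dataset used later.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "revenue": [100.0, 300.0, 50.0, 150.0, 100.0],
})

# Split the rows by store, then apply one summary function per group
mean_by_store = toy.groupby("store")["revenue"].mean()
print(mean_by_store)
```

Swapping `.mean()` for `.sum()` or `.median()` changes the summary without changing the grouping step.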

### Common Use Cases

* Comparing performance metrics across different categories (e.g., store types, departments)
* Summarizing data per group for reporting or visualization
* Identifying top-performing segments
* Analyzing time-based trends within groups (like yearly averages)

### Things to Remember

* Grouped results are usually indexed by the grouping column(s)
* Multiple summary statistics can be calculated for each group
* Grouping can be hierarchical, such as grouping by both store and department
* Grouped summary statistics provide context-specific insights that raw totals can't
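
The points above can be illustrated with a small sketch, assuming a made-up table with hypothetical `store`, `department`, and `weekly_sales` columns. It groups by two columns, computes two statistics at once, and shows how the grouping columns end up in the index.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2, 2],
    "department": ["a", "a", "b", "a", "b", "b"],
    "weekly_sales": [10.0, 20.0, 5.0, 8.0, 12.0, 4.0],
})

# Two grouping columns give a hierarchical (MultiIndex) result,
# and .agg() with a list computes several statistics per group
stats = toy.groupby(["store", "department"])["weekly_sales"].agg(["sum", "mean"])
print(stats)

# The grouping columns become the index levels
print(list(stats.index.names))

# Select a single group by its (store, department) key
print(stats.loc[(1, "a"), "sum"])

# reset_index() turns the index levels back into ordinary columns
flat = stats.reset_index()
print(flat.columns.tolist())
```

Calling `reset_index()` is often convenient before plotting or exporting, since many tools expect the group labels as regular columns.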

Grouped statistics are essential in exploratory data analysis, business intelligence, and machine learning feature engineering, where they offer a deeper and more structured understanding of your data.

## Preparing Data

In [1]:
import pandas as pd
sales = pd.read_csv("datasets/sales_subset.csv")

## Exercise: What percent of sales occurred at each store type?

Instead of relying on `.groupby()`, you can figure out summary values just by filtering the data and summing up.

Walmart’s dataset marks stores as type **"A"** (supercenters), **"B"** (discount stores), and **"C"** (neighborhood markets). In this task, you’ll calculate the contribution of each store type to the company’s overall sales.

### Instructions

1. Add up all values in `weekly_sales` to get the grand total.
2. Extract rows for type `"A"` and compute their sales total.
3. Do the same for types `"B"` and `"C"`.
4. Put the results for A, B, and C into a list and divide by the overall total to find their proportions.


In [3]:
# Step 1: Overall weekly sales across all stores
all_sales = sales["weekly_sales"].sum()

# Steps 2-3: Sales totals by store type
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

# Step 4: Proportion of sales by store type
proportion_by_type = [
    sales_A / all_sales,
    sales_B / all_sales,
    sales_C / all_sales
]

print(proportion_by_type)

[np.float64(0.9097746968515047), np.float64(0.09022530314849538), np.float64(0.0)]


## Exercise: Calculations with .groupby()

The `.groupby()` method is a powerful way to summarize your data. In this task, you’ll repeat the sales breakdown by store type, but this time with `.groupby()`. You’ll also see how grouping by more than one column (store type and holiday indicator) lets you compare sales patterns more flexibly.

### Instructions

1. Group the data by the column `"type"`, calculate the sum of `"weekly_sales"`, and save it as `totals_by_type`.
2. Find the share of sales for each store type by dividing `totals_by_type` by the overall sum. Store the result as `proportion_by_type`.


In [4]:
# Step 1: Sum of weekly sales for each store type
totals_by_type = sales.groupby("type")["weekly_sales"].sum()

# Step 2: Proportion of sales by type
proportion_by_type = totals_by_type / totals_by_type.sum()

print(proportion_by_type)

type
A    0.909775
B    0.090225
Name: weekly_sales, dtype: float64
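
The filtered version printed `0.0` for type `"C"`, while the grouped result lists only `A` and `B`. That is consistent with this subset containing no type `"C"` rows, because `.groupby()` only returns groups that actually occur in the data. The sketch below reproduces the effect on hypothetical toy data and uses `Series.reindex` to add the missing label back with an explicit zero.

```python
import pandas as pd

# Hypothetical toy data with no type "C" rows
toy = pd.DataFrame({
    "type": ["A", "A", "B"],
    "weekly_sales": [60.0, 30.0, 10.0],
})

totals = toy.groupby("type")["weekly_sales"].sum()
proportions = totals / totals.sum()

# "C" never appears in the data, so it is absent from the grouped result
print("C" in proportions.index)

# reindex adds the missing label back, filled with an explicit zero
proportions_abc = proportions.reindex(["A", "B", "C"], fill_value=0.0)
print(proportions_abc)
```

Whether to show absent groups as zeros or leave them out depends on the report being built; the reindex step is only needed when a fixed set of categories is expected.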



The `.groupby()` method isn’t limited to a single column: you can pass a list of columns to split the data more finely. Here you’ll extend the previous analysis by checking whether weekly sales vary not only by store type but also by whether the week was a holiday.

### Instructions

1. Start by grouping the data by `"type"` and summing `"weekly_sales"`. Save this as `totals_by_type`.
2. Next, group the data by both `"type"` and `"is_holiday"`, summing `"weekly_sales"`, and assign it to `totals_by_type_holiday`.
3. Print out the result to inspect sales across store types for holiday vs. non-holiday weeks.


In [5]:
# Step 1: Total weekly sales by store type
totals_by_type = sales.groupby("type")["weekly_sales"].sum()

# Step 2: Total weekly sales by store type and holiday flag
totals_by_type_holiday = sales.groupby(["type", "is_holiday"])["weekly_sales"].sum()

print(totals_by_type_holiday)

type  is_holiday
A     False         2.336927e+08
      True          2.360181e+04
B     False         2.317678e+07
      True          1.621410e+03
Name: weekly_sales, dtype: float64
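
A two-level result like the one above can be easier to compare side by side once the inner index level is moved into columns. The following is a sketch on hypothetical toy data, not the sales dataset; the column labels `non_holiday` and `holiday` are names chosen here for readability.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "type": ["A", "A", "A", "B", "B"],
    "is_holiday": [False, False, True, False, True],
    "weekly_sales": [100.0, 200.0, 30.0, 50.0, 5.0],
})

totals = toy.groupby(["type", "is_holiday"])["weekly_sales"].sum()

# Move the inner index level (is_holiday) into columns,
# then give the boolean column labels readable names
wide = totals.unstack("is_holiday").rename(
    columns={False: "non_holiday", True: "holiday"}
)
print(wide)

# Share of each store type's sales that fell in holiday weeks
holiday_share = wide["holiday"] / wide.sum(axis=1)
print(holiday_share)
```

The same reshaping would apply to `totals_by_type_holiday`, assuming both holiday values occur for every store type as they do in the output above.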


## Exercise: Multiple grouped summaries

The `.agg()` method is handy when you want to calculate several summary measures at once. It works seamlessly on grouped data too. With it, you can quickly compute descriptive statistics such as minimum, maximum, average, and median values.

### Instructions

1. For each store type, calculate the `min`, `max`, `mean`, and `median` of `weekly_sales`. Save this as `sales_stats`.
2. Do the same for the columns `unemployment` and `fuel_price_usd_per_l`, grouped by store type, and store the result in `unemp_fuel_stats`.
3. Print both results to explore the summary values.

In [6]:
import numpy as np

# Step 1: Weekly sales stats by store type
# (string names avoid the FutureWarning that recent pandas versions raise
# when builtins or NumPy functions are passed to .agg())
sales_stats = sales.groupby("type")["weekly_sales"].agg(["min", "max", "mean", "median"])
print(sales_stats)

# Step 2: Unemployment and fuel price stats by store type
unemp_fuel_stats = (
    sales.groupby("type")[["unemployment", "fuel_price_usd_per_l"]]
    .agg(["min", "max", "mean", "median"])
)
print(unemp_fuel_stats)

         min        max          mean    median
type                                           
A    -1098.0  293966.05  23674.667242  11943.92
B     -798.0  232558.51  25696.678370  13336.08
     unemployment                         fuel_price_usd_per_l
              min    max      mean median                  min       max      mean    median
type
A           3.879  8.992  7.972611  8.067             0.664129  1.107410  0.744619  0.735455
B           7.170  9.765  9.279323  9.199             0.760023  1.107674  0.805858  0.803348
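
When different columns need different statistics, or when flat, descriptive column names are preferable to the two-level header above, pandas also supports named aggregation. The sketch below uses hypothetical toy data; the output names `min_sales`, `median_sales`, and `mean_unemployment` are illustrative choices, not part of the exercise.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "type": ["A", "A", "A", "B", "B"],
    "weekly_sales": [10.0, 20.0, 60.0, 5.0, 15.0],
    "unemployment": [8.0, 7.0, 9.0, 9.5, 9.0],
})

# Named aggregation: output_name=(input_column, function_name)
summary = toy.groupby("type").agg(
    min_sales=("weekly_sales", "min"),
    median_sales=("weekly_sales", "median"),
    mean_unemployment=("unemployment", "mean"),
)
print(summary)
```

Each keyword becomes one flat output column, so the result needs no further renaming before reporting.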


