## Summary Statistics

### What Are Summary Statistics?

Summary statistics are key figures that give us a quick snapshot of a dataset. Instead of looking at every single value, we can use them to understand the **patterns**, **central tendency**, **variability**, and **distribution** of the data.

These statistics help answer questions like:
- What’s the **average** number of homeless individuals per state?
- What’s the **maximum** or **minimum** number of family members reported?
- How **spread out** are the state populations?

### Common Types of Summary Statistics

Here are some of the most frequently used summary metrics in data analysis:

#### 1. **Mean (Average)**
- The sum of the values divided by how many there are. It describes the center of a numerical column, but extreme values can pull it up or down.
- _Example_: The average number of homeless individuals across all states.

#### 2. **Median**
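- The middle value when the data is sorted (the average of the two middle values when the count is even).
- Useful when data is skewed, because extreme values barely affect it.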
- _Example_: Median population per state to understand typical state size without influence from very large or small states.

#### 3. **Maximum and Minimum**
- Highest and lowest values in a column.
- _Example_: Which state has the **maximum** number of homeless individuals? Which one has the **minimum**?

#### 4. **Standard Deviation**
- Measures how far the values typically lie from the mean; a larger value means the data is more spread out.
- _Example_: How consistent is the number of family members across all states?

#### 5. **Count**
- Total number of non-missing (valid) entries.
- _Example_: How many states have data recorded for individual homelessness?
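
As a minimal sketch, the five metrics above map onto pandas Series methods as follows. The column name and values are made up for illustration; one entry is deliberately missing to show that `count()` skips it.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are invented for illustration
toy = pd.DataFrame({"individuals": [120.0, 80.0, None, 400.0, 100.0]})
col = toy["individuals"]

print("mean:  ", col.mean())    # sum of non-missing values / their count
print("median:", col.median())  # middle of the sorted non-missing values
print("max:   ", col.max())
print("min:   ", col.min())
print("std:   ", col.std())     # sample standard deviation (ddof=1 by default)
print("count: ", col.count())   # number of non-missing entries
```

Note that pandas skips missing values by default in all of these methods, so the mean here is computed over four values, not five.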

### Why Use Summary Statistics?

- They give **quick insights** without reading through the raw data.
- They help with **data cleaning** by exposing outliers and missing values.
- They are a standard part of **exploratory data analysis (EDA)** before deeper modeling or visualization.
- They set the foundation for more complex techniques like machine learning.

### Example Use Cases

- You might **compare the average homelessness rates** between different regions of the USA.
- You could **track the range** (difference between max and min) of state populations.
- Summary stats help determine if data is **skewed**, which affects how we interpret results and choose models.
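
The first two use cases could look roughly like this in pandas. This is a sketch on invented numbers: the region names, column names, and values are assumptions, not real figures.

```python
import pandas as pd

# Hypothetical data; regions, column names, and numbers are invented for illustration
df = pd.DataFrame({
    "region": ["West", "West", "South", "South", "Midwest"],
    "state_pop": [39.0, 4.2, 29.0, 5.0, 12.8],    # millions of residents
    "homeless_rate": [4.4, 3.5, 0.9, 0.8, 0.9],   # per 1,000 residents
})

# Compare the average rate between regions
region_means = df.groupby("region")["homeless_rate"].mean()
print(region_means)

# Range = difference between the maximum and the minimum
pop_range = df["state_pop"].max() - df["state_pop"].min()
print("Population range (millions):", pop_range)
```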

Summary statistics are usually the **first step** in data exploration. They provide clarity and direction before diving into deeper analysis or visualization.

## Exercise: Mean and median

Summary statistics help us get a quick overview of our dataset by calculating values like the mean, median, minimum, maximum, and standard deviation. These values give us insight into the structure and distribution of our data — especially useful when dealing with large datasets.

In this task, we’ll get familiar with a DataFrame called `sales`, which contains weekly sales data.

### Instructions:

1. View the first few entries of the dataset to understand its structure.
2. Get a summary of column names, data types, and non-null counts.
3. Calculate the **mean** of the `weekly_sales` column to understand the average sales.
4. Calculate the **median** of the `weekly_sales` column to see the midpoint of the sales distribution.


In [2]:
import pandas as pd

sales = pd.read_csv("datasets/sales_subset.csv")
# Preview the first five rows of the sales DataFrame
print(sales.head())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  


In [4]:
# Display detailed info about the sales DataFrame
# (info() prints its report directly and returns None, so no print() is needed)
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10774 entries, 0 to 10773
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            10774 non-null  int64  
 1   store                 10774 non-null  int64  
 2   type                  10774 non-null  object 
 3   department            10774 non-null  int64  
 4   date                  10774 non-null  object 
 5   weekly_sales          10774 non-null  float64
 6   is_holiday            10774 non-null  bool   
 7   temperature_c         10774 non-null  float64
 8   fuel_price_usd_per_l  10774 non-null  float64
 9   unemployment          10774 non-null  float64
dtypes: bool(1), float64(4), int64(3), object(2)
memory usage: 768.2+ KB


In [5]:
# Calculate and display the average weekly sales
average_sales = sales["weekly_sales"].mean()
print("Mean weekly sales:", average_sales)

Mean weekly sales: 23843.95014850566


In [6]:
# Calculate and display the median weekly sales
median_sales = sales["weekly_sales"].median()
print("Median weekly sales:", median_sales)

Median weekly sales: 12049.064999999999
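
The mean above (about 23,844) is nearly twice the median (about 12,049), which suggests that a minority of very large weekly sales values pulls the mean upward. The sketch below reproduces that effect on a small, invented series; the numbers are hypothetical and only illustrate the mechanism.

```python
import pandas as pd

# Hypothetical weekly sales with one unusually large week
weekly = pd.Series([9_000.0, 10_000.0, 11_000.0, 12_000.0, 13_000.0, 95_000.0])

print("mean:  ", weekly.mean())
print("median:", weekly.median())

# Dropping the large value shifts the mean a lot and the median only slightly
trimmed = weekly[weekly < 50_000]
print("mean without the large week:  ", trimmed.mean())
print("median without the large week:", trimmed.median())
```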


## Exercise: Summarizing dates

### Overview

Date columns in a dataset can also be summarized using specific statistical functions. While calculating something like a **mean** on dates might not be very meaningful, identifying the **earliest** and **latest** dates can be incredibly helpful to understand the **time range** your dataset spans.

In this exercise, we’ll focus on summarizing the date column in the `sales` DataFrame.

The dataset `sales` is already available, and pandas has been imported as `pd`.

### Instructions:

1. Find and print the **latest date** in the dataset.
2. Find and print the **earliest date** in the dataset.

These steps will help you determine the range of time your sales data covers.

### Example

If the latest date is `2020-12-31` and the earliest date is `2018-01-01`, it tells you that the data spans three full years.

> Tip: This is especially useful for validating time-series data or preparing to group data by time intervals like months or years.
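
To turn the earliest and latest dates into an actual time span, the strings can first be parsed with `pd.to_datetime`. A minimal sketch, using the hypothetical dates from the example above rather than the `sales` data:

```python
import pandas as pd

# Hypothetical ISO-formatted date strings (not taken from the sales data)
dates = pd.Series(["2018-01-01", "2019-06-15", "2020-12-31"])

# Parse the strings so min, max, and subtraction are true date operations
parsed = pd.to_datetime(dates)

earliest = parsed.min()
latest = parsed.max()
span = latest - earliest  # a pandas Timedelta

print("Earliest:", earliest.date())
print("Latest:  ", latest.date())
print("Span in days:", span.days)
```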


In [8]:
# Display the most recent date in the sales data
# (the column holds "YYYY-MM-DD" strings, so string order matches date order)
latest_date = sales["date"].max()
print("Latest date in dataset:", latest_date)

# Display the earliest date in the sales data
earliest_date = sales["date"].min()
print("Earliest date in dataset:", earliest_date)

Latest date in dataset: 2012-10-26
Earliest date in dataset: 2010-02-05


## Exercise: Efficient Summaries

In data analysis, we often need to summarize columns beyond just the basic statistics like mean or standard deviation. Sometimes, we need **custom summaries** — for example, to better understand distributions that contain **outliers**.

The `.agg()` method in pandas allows us to:

* Apply **custom functions** to columns.
* Apply **multiple functions** across **multiple columns** in a single, efficient operation.

A useful example is the **interquartile range (IQR)** — the difference between the 75th and 25th percentiles. It's a robust measure of spread that isn't easily affected by extreme values.
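
To illustrate that robustness, the sketch below compares the IQR and the standard deviation on two invented series that differ only in their largest value. The data is hypothetical.

```python
import pandas as pd

def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)

# Hypothetical values; the second series replaces the largest value with an extreme one
clean = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
with_outlier = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 900.0])

print("IQR:", iqr(clean), "->", iqr(with_outlier))
print("std:", clean.std(), "->", with_outlier.std())
```

The IQR is the same for both series because the quartiles sit in the middle of the data, while the standard deviation grows sharply once the extreme value is present.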

### Instructions:

1. Use a custom `iqr()` function to calculate the IQR of the `temperature_c` column using `.agg()`.
2. Expand the aggregation to also include the `fuel_price_usd_per_l` and `unemployment` columns, still using the `iqr()` function.
3. Enhance the summary by applying **both** `iqr` and `median` (from NumPy) to all three columns at once using `.agg()`.

In [9]:
# Define a function to calculate the interquartile range (IQR)
def iqr(series):
    return series.quantile(0.75) - series.quantile(0.25)

# Display the IQR of the 'temperature_c' column
temperature_iqr = sales['temperature_c'].agg(iqr)
print("IQR of temperature (°C):", temperature_iqr)

IQR of temperature (°C): 16.583333333333336


In [11]:
# Define a function to calculate the interquartile range (IQR)
def iqr(series):
    return series.quantile(0.75) - series.quantile(0.25)

# Select relevant columns
selected_columns = ["temperature_c", "fuel_price_usd_per_l", "unemployment"]

# Calculate and display the IQR for the selected columns
iqr_values = sales[selected_columns].agg(iqr)
print("Interquartile Ranges (IQR):")
print(iqr_values)

Interquartile Ranges (IQR):
temperature_c           16.583333
fuel_price_usd_per_l     0.073176
unemployment             0.565000
dtype: float64


In [3]:
# Import NumPy
import numpy as np

# Define custom function to calculate IQR (Interquartile Range)
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Wrap np.median in a custom function
def median_func(column):
    return np.median(column.values)

# Select the numeric columns of interest
columns_to_summarize = ["temperature_c", "fuel_price_usd_per_l", "unemployment"]

# Apply both IQR and median using agg()
summary_stats = sales[columns_to_summarize].agg([iqr, median_func])

# Print the resulting summary statistics
print("Summary Statistics (IQR and Median):")
print(summary_stats)

Summary Statistics (IQR and Median):
             temperature_c  fuel_price_usd_per_l  unemployment
iqr              16.583333              0.073176         0.565
median_func      16.966667              0.743381         8.099


## Exercise: Cumulative Statistics

Cumulative statistics are useful when you want to observe how metrics build up over time. Instead of looking at just a single week's performance, you can track ongoing totals and record-breaking values. In this task, you'll be calculating the **cumulative total sales** and the **highest sales recorded so far** for a department over time.

You'll work with a DataFrame named `sales_1_1`, which contains weekly sales data for **department 1** of **store 1**; the first line of the code cell builds it from `sales`. Your goal is to track how total sales and maximum sales evolve over time.

### Instructions:

1. **Sort** the DataFrame by the `date` column in ascending order to ensure chronological tracking.
2. Calculate the **cumulative sum** of `weekly_sales` and store it in a new column named `cum_weekly_sales`. This shows the total sales so far each week.
3. Calculate the **cumulative maximum** of `weekly_sales` and store it in a column named `cum_max_sales`. This helps track the highest weekly sales achieved so far.
4. Finally, **display** the following columns: `date`, `weekly_sales`, `cum_weekly_sales`, and `cum_max_sales`.

These steps give insight into how sales have built up over time and when new sales records were set.

In [16]:
# Create sales_1_1, which contains weekly sales data for department 1 of store 1
# (.copy() avoids a SettingWithCopyWarning when new columns are added below)
sales_1_1 = sales[(sales["department"] == 1) & (sales["store"] == 1)].copy()

# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values('date')

# Get the cumulative sum of weekly_sales and add it as cum_weekly_sales column
sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum()

# Get the cumulative max of weekly_sales and add it as cum_max_sales column
sales_1_1['cum_max_sales'] = sales_1_1['weekly_sales'].cummax()

# Display the calculated columns
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

          date  weekly_sales  cum_weekly_sales  cum_max_sales
0   2010-02-05      24924.50          24924.50       24924.50
1   2010-03-05      21827.90          46752.40       24924.50
2   2010-04-02      57258.43         104010.83       57258.43
3   2010-05-07      17413.94         121424.77       57258.43
4   2010-06-04      17558.09         138982.86       57258.43
5   2010-07-02      16333.14         155316.00       57258.43
6   2010-08-06      17508.41         172824.41       57258.43
7   2010-09-03      16241.78         189066.19       57258.43
8   2010-10-01      20094.19         209160.38       57258.43
9   2010-11-05      34238.88         243399.26       57258.43
10  2010-12-03      22517.56         265916.82       57258.43
11  2011-01-07      15984.24         281901.06       57258.43
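
One way to pinpoint when new sales records were set is to compare each week's sales with the running maximum. The sketch below re-enters the first five rows of the output above by hand; the `is_record` column name is a hypothetical choice.

```python
import pandas as pd

# First five rows of the output above, re-entered by hand
demo = pd.DataFrame({
    "date": ["2010-02-05", "2010-03-05", "2010-04-02", "2010-05-07", "2010-06-04"],
    "weekly_sales": [24924.50, 21827.90, 57258.43, 17413.94, 17558.09],
})

demo = demo.sort_values("date")
demo["cum_max_sales"] = demo["weekly_sales"].cummax()

# A week sets (or ties) a record when its sales equal the running maximum;
# the first week always counts as a record under this definition
demo["is_record"] = demo["weekly_sales"] == demo["cum_max_sales"]
print(demo[["date", "weekly_sales", "cum_max_sales", "is_record"]])
```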
