# Pandas Statistical Methods Tutorial

## Learning Objectives
In this tutorial, you will learn:
1. **Statistical Attributes** - Calculate mean, median, mode, min, max, std, var, sum, prod, quantile
2. **Grouping with groupby()** - Group data and apply aggregations
3. **Retrieving Extremes** - Use nlargest() and nsmallest() to find top/bottom values

## Dataset Description
We'll work with a sales dataset containing 500 transaction records with the following columns:
- **order_id**: Unique order identifier
- **date**: Order date
- **region**: Geographic region (North, South, East, West, Central)
- **sales_channel**: Sales channel (Online, Store, Phone)
- **customer_segment**: Customer type (Individual, Corporate, Small Business)
- **category**: Product category
- **product**: Product name
- **quantity**: Number of units sold
- **unit_price**: Price per unit (in ₹)
- **total_sales**: Total sales amount (in ₹)
- **discount_percent**: Discount percentage applied
- **shipping_cost**: Shipping cost (in ₹)
- **profit_margin**: Profit margin percentage
- **discount_amount**: Discount amount (in ₹)
- **net_sales**: Sales after discount (in ₹)
- **profit**: Profit amount (in ₹)
- **total_cost**: Total cost including shipping (in ₹)

---

## Step 1: Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("Libraries imported successfully!")

In [None]:
# Load the sales dataset
df = pd.read_csv('../Datasets/sales_data.csv')

print(f"Dataset shape: {df.shape}")
print("\nFirst 5 rows:")
display(df.head())
print("\nDataset info:")
df.info()  # info() prints its report directly and returns None, so it needs no print() wrapper

---
# PART 1: STATISTICAL ATTRIBUTES
---

## 1.1 Measures of Central Tendency

### Mean, Median, and Mode

### .mean() - Calculate the Average

In [None]:
# Calculate mean for a single column
print("Mean of Total Sales:")
mean_sales = df['total_sales'].mean()
print(f"₹{mean_sales:,.2f}")

print("\nMean of Quantity Sold:")
mean_quantity = df['quantity'].mean()
print(f"{mean_quantity:.2f} units")

print("\nMean of Profit:")
mean_profit = df['profit'].mean()
print(f"₹{mean_profit:,.2f}")

In [None]:
# Calculate mean for multiple columns at once
print("Mean of All Numerical Columns:")
print("="*60)
means = df[['quantity', 'total_sales', 'discount_percent', 'profit']].mean()
display(means)

In [None]:
# Calculate mean for all numeric columns in the dataframe
print("Mean of All Numeric Data:")
print("="*60)
display(df.mean(numeric_only=True))

### .median() - Calculate the Middle Value

In [None]:
# Calculate median
print("Median Values:")
print("="*60)
print(f"Median Total Sales: ₹{df['total_sales'].median():,.2f}")
print(f"Median Profit: ₹{df['profit'].median():,.2f}")
print(f"Median Quantity: {df['quantity'].median():.0f} units")

print("\nComparing Mean vs Median (Total Sales):")
print(f"Mean: ₹{df['total_sales'].mean():,.2f}")
print(f"Median: ₹{df['total_sales'].median():,.2f}")
print(f"Difference: ₹{df['total_sales'].mean() - df['total_sales'].median():,.2f}")
print("\n📊 Note: A mean well above the median suggests a right-skewed distribution (a few large orders pull the mean up)")
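
To see why a gap between mean and median hints at skew, here is a small sketch on made-up numbers. The `toy_sales` values are hypothetical and do not come from the sales dataset. One very large value pulls the mean up while the median barely moves, and `Series.skew()` gives a numeric measure of the asymmetry.

```python
import pandas as pd

# Hypothetical values: four typical orders and one very large one
toy_sales = pd.Series([100, 120, 130, 150, 2000])

toy_mean = toy_sales.mean()      # pulled up by the 2000 order
toy_median = toy_sales.median()  # middle value, robust to the large order
toy_skew = toy_sales.skew()      # positive for a right-skewed sample

print(f"Mean:   {toy_mean:.2f}")
print(f"Median: {toy_median:.2f}")
print(f"Skew:   {toy_skew:.2f}")
```

A mean above the median is a hint rather than a proof of right skew, so checking `.skew()` or a histogram is a reasonable next step.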

### .mode() - Calculate the Most Frequent Value

In [None]:
# Calculate mode for categorical columns
print("Most Common Values (Mode):")
print("="*60)
print(f"Most common region: {df['region'].mode()[0]}")
print(f"Most common category: {df['category'].mode()[0]}")
print(f"Most common sales channel: {df['sales_channel'].mode()[0]}")
print(f"Most common customer segment: {df['customer_segment'].mode()[0]}")

In [None]:
# Mode can also work with numerical data
print("Mode for Numerical Columns:")
print("="*60)
print(f"Most common quantity: {df['quantity'].mode()[0]} units")
print(f"Most common discount percent: {df['discount_percent'].mode()[0]}%")
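
The `[0]` in the cells above is needed because `.mode()` returns a Series rather than a single value: when several values tie for most frequent, all of them are returned. A minimal sketch with hypothetical data (the `regions` Series is made up for illustration):

```python
import pandas as pd

# Hypothetical data with a tie: 'North' and 'South' both appear twice
regions = pd.Series(['North', 'South', 'North', 'South', 'East'])

modes = regions.mode()  # a Series with one entry per tied value
print(modes.tolist())
print(f"Number of modes: {len(modes)}")

# value_counts() shows the frequencies behind the mode
print(regions.value_counts())
```

Taking `[0]` silently keeps only the first of the tied values, so it is worth checking `len(modes)` when ties matter.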

---
## 1.2 Measures of Dispersion

### Min, Max, Standard Deviation, and Variance

### .min() and .max() - Find Minimum and Maximum Values

In [None]:
# Find minimum and maximum values
print("Minimum and Maximum Values:")
print("="*60)
print(f"Min Total Sales: ₹{df['total_sales'].min():,.2f}")
print(f"Max Total Sales: ₹{df['total_sales'].max():,.2f}")
print(f"Range: ₹{df['total_sales'].max() - df['total_sales'].min():,.2f}")

print(f"\nMin Quantity: {df['quantity'].min()} units")
print(f"Max Quantity: {df['quantity'].max()} units")

print(f"\nMin Profit: ₹{df['profit'].min():,.2f}")
print(f"Max Profit: ₹{df['profit'].max():,.2f}")

In [None]:
# Find min and max for all numeric columns
print("Minimum Values for All Numeric Columns:")
print("="*60)
display(df.min(numeric_only=True))

print("\nMaximum Values for All Numeric Columns:")
print("="*60)
display(df.max(numeric_only=True))

### .std() - Calculate Standard Deviation

In [None]:
# Calculate standard deviation
print("Standard Deviation (Measure of Spread):")
print("="*60)
print(f"Total Sales - Mean: ₹{df['total_sales'].mean():,.2f}")
print(f"Total Sales - Std Dev: ₹{df['total_sales'].std():,.2f}")
print(f"Coefficient of Variation: {(df['total_sales'].std() / df['total_sales'].mean() * 100):.2f}%")

print(f"\nProfit - Mean: ₹{df['profit'].mean():,.2f}")
print(f"Profit - Std Dev: ₹{df['profit'].std():,.2f}")
print(f"Coefficient of Variation: {(df['profit'].std() / df['profit'].mean() * 100):.2f}%")

print("\n📊 Note: Higher standard deviation means more variability in the data")

### .var() - Calculate Variance

In [None]:
# Calculate variance
print("Variance (Square of Standard Deviation):")
print("="*60)
# Variance is expressed in squared units (₹²), so it is not directly comparable to sales amounts
print(f"Total Sales Variance: {df['total_sales'].var():,.2f} ₹²")
print(f"Total Sales Std Dev: ₹{df['total_sales'].std():,.2f}")
print(f"Verification: Std Dev² = {df['total_sales'].std()**2:,.2f}")

print("\nVariance for Key Metrics:")
print(f"Quantity Variance: {df['quantity'].var():.2f}")
print(f"Profit Variance: {df['profit'].var():,.2f} ₹²")
print(f"Discount Percent Variance: {df['discount_percent'].var():.2f}")
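
One detail worth knowing about `.std()` and `.var()`: pandas computes the sample statistic by default (`ddof=1`, dividing by n - 1), while NumPy's `np.std` defaults to the population statistic (`ddof=0`). The sketch below uses a small made-up Series to show the difference. The values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the population std is a round number
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()              # pandas default: ddof=1 (divide by n - 1)
population_std = values.std(ddof=0)    # divide by n instead
numpy_std = np.std(values.to_numpy())  # NumPy default: ddof=0

print(f"Sample std (pandas default): {sample_std:.4f}")
print(f"Population std (ddof=0):     {population_std:.4f}")
print(f"np.std default:              {numpy_std:.4f}")
```

With 500 rows the two versions are close, but the gap can be noticeable for small groups, for example inside a `groupby`.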

---
## 1.3 Aggregate Functions

### Sum and Product

### .sum() - Calculate Total Sum

In [None]:
# Calculate sum
print("Total Sum Calculations:")
print("="*60)
print(f"Total Revenue: ₹{df['total_sales'].sum():,.2f}")
print(f"Total Profit: ₹{df['profit'].sum():,.2f}")
print(f"Total Units Sold: {df['quantity'].sum():,.0f} units")
print(f"Total Discount Given: ₹{df['discount_amount'].sum():,.2f}")
print(f"Total Shipping Cost: ₹{df['shipping_cost'].sum():,.2f}")

print("\nProfit Margin Analysis:")
# Margin is computed against net sales (revenue after discounts), not gross total_sales
total_net_sales = df['net_sales'].sum()
total_profit = df['profit'].sum()
overall_margin = (total_profit / total_net_sales * 100)
print(f"Overall Profit Margin: {overall_margin:.2f}%")

### .prod() - Calculate Product of All Values

In [None]:
# Calculate product (less commonly used, but useful for compound calculations)
# Example: Calculate product of first 5 quantities
print("Product Calculation Example:")
print("="*60)
sample_quantities = df['quantity'].head(5)
print(f"Quantities: {sample_quantities.values}")
print(f"Product: {sample_quantities.prod()}")

# More practical example: geometric mean
# (only meaningful when every value is positive; a very long series can overflow the product)
print("\nGeometric Mean Calculation (using product):")
first_margins = df['profit_margin'].head(10)
geometric_mean = first_margins.prod() ** (1 / len(first_margins))
print(f"Geometric mean of first 10 profit margins: {geometric_mean:.2f}%")
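
The product-based geometric mean above works for a handful of values, but the product can overflow or underflow on long columns. A common alternative, sketched here on hypothetical margins, is the log form exp(mean(log(x))). Both forms assume strictly positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical profit margins in percent (all strictly positive)
margins = pd.Series([12.0, 18.0, 25.0, 9.0, 30.0])

# Product-based form: n-th root of the product
gm_prod = margins.prod() ** (1 / len(margins))

# Log-based form: exp(mean(log(x))), which avoids a huge intermediate product
gm_log = np.exp(np.log(margins).mean())

print(f"Geometric mean (product): {gm_prod:.4f}")
print(f"Geometric mean (logs):    {gm_log:.4f}")
print(f"Arithmetic mean:          {margins.mean():.4f}")
```

For positive values that are not all equal, the geometric mean is below the arithmetic mean, which makes for a quick sanity check.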

### .quantile() - Calculate Percentiles

In [None]:
# Calculate specific quantiles
print("Quantile Analysis for Total Sales:")
print("="*60)
print(f"25th Percentile (Q1): ₹{df['total_sales'].quantile(0.25):,.2f}")
print(f"50th Percentile (Median): ₹{df['total_sales'].quantile(0.50):,.2f}")
print(f"75th Percentile (Q3): ₹{df['total_sales'].quantile(0.75):,.2f}")
print(f"90th Percentile: ₹{df['total_sales'].quantile(0.90):,.2f}")
print(f"95th Percentile: ₹{df['total_sales'].quantile(0.95):,.2f}")

# Calculate IQR (Interquartile Range)
q1 = df['total_sales'].quantile(0.25)
q3 = df['total_sales'].quantile(0.75)
iqr = q3 - q1
print(f"\nInterquartile Range (IQR): ₹{iqr:,.2f}")
print(f"This means the middle 50% of sales fall within a range of ₹{iqr:,.2f}")

In [None]:
# Calculate multiple quantiles at once
print("Multiple Quantiles for Profit:")
print("="*60)
quantiles = df['profit'].quantile([0.1, 0.25, 0.5, 0.75, 0.9])
display(quantiles)
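
The IQR computed earlier in this section is often used to flag unusual values with the 1.5 × IQR rule (Tukey's fences). The tutorial itself does not use this rule, so the following is only an illustrative sketch on hypothetical order totals.

```python
import pandas as pd

# Hypothetical order totals with one unusually large value
sales = pd.Series([200, 220, 250, 260, 280, 300, 310, 330, 350, 5000])

q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = sales[(sales < lower_fence) | (sales > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Flagged values: {flagged.tolist()}")
```

The factor 1.5 is a convention rather than a law, and skewed sales data may call for a different threshold.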

---
## 1.4 Combined Statistical Summary

### Using .describe() to Get All Statistics at Once

In [None]:
# Get comprehensive statistics for numerical columns
print("Comprehensive Statistical Summary:")
print("="*80)
display(df.describe())

In [None]:
# Describe specific columns
print("Statistics for Key Sales Metrics:")
print("="*80)
display(df[['quantity', 'total_sales', 'profit', 'discount_percent']].describe())

In [None]:
# Describe categorical columns
print("Statistics for Categorical Columns:")
print("="*80)
display(df[['region', 'category', 'sales_channel']].describe())
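
`.describe()` also accepts a `percentiles` argument to change which quantile rows are reported, and `include='all'` to summarise numeric and non-numeric columns in one table. A small sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
toy = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'total_sales': [100.0, 250.0, 175.0, 300.0],
})

# Custom percentiles replace the default 25% / 50% / 75% rows
custom = toy['total_sales'].describe(percentiles=[0.1, 0.5, 0.9])
print(custom)

# include='all' mixes numeric and categorical summaries;
# statistics that do not apply to a column show up as NaN
everything = toy.describe(include='all')
print(everything)
```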

---
# PART 2: GROUPBY OPERATIONS
---

## 2.1 Basic GroupBy Operations

### Grouping by Single Column

In [None]:
# Group by region and calculate mean
print("Average Sales by Region:")
print("="*60)
region_avg = df.groupby('region')['total_sales'].mean()
display(region_avg.sort_values(ascending=False))

In [None]:
# Group by category and calculate sum
print("Total Sales by Category:")
print("="*60)
category_sum = df.groupby('category')['total_sales'].sum().sort_values(ascending=False)
display(category_sum)

In [None]:
# Group by sales channel and calculate count
print("Number of Orders by Sales Channel:")
print("="*60)
channel_count = df.groupby('sales_channel')['order_id'].count()
display(channel_count)

## 2.2 Multiple Aggregations with GroupBy

In [None]:
# Apply multiple aggregation functions to grouped data
print("Multiple Statistics by Region:")
print("="*80)
region_stats = df.groupby('region')['total_sales'].agg(['count', 'sum', 'mean', 'median', 'std', 'min', 'max'])
display(region_stats)

In [None]:
# Multiple columns with multiple aggregations
print("Sales and Profit Analysis by Category:")
print("="*80)
category_analysis = df.groupby('category').agg({
    'total_sales': ['sum', 'mean'],
    'profit': ['sum', 'mean'],
    'quantity': ['sum', 'mean']
})
display(category_analysis)
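
The dictionary form above produces two-level column labels such as `('total_sales', 'sum')`. If flat, self-describing column names are preferred, pandas also supports named aggregation, where each keyword becomes an output column. A minimal sketch on hypothetical data (the output names `sales_sum`, `sales_mean`, and `profit_sum` are arbitrary):

```python
import pandas as pd

# Hypothetical data: two categories with two orders each
toy = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'total_sales': [100.0, 300.0, 50.0, 150.0],
    'profit': [10.0, 30.0, 5.0, 25.0],
})

# keyword = (source column, aggregation function)
summary = toy.groupby('category').agg(
    sales_sum=('total_sales', 'sum'),
    sales_mean=('total_sales', 'mean'),
    profit_sum=('profit', 'sum'),
)
print(summary)
```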

## 2.3 Grouping by Multiple Columns

In [None]:
# Group by region and category
print("Average Sales by Region and Category:")
print("="*80)
region_category = df.groupby(['region', 'category'])['total_sales'].mean().round(2)
display(region_category.head(20))

In [None]:
# Group by region and sales channel with multiple metrics
print("Sales Performance by Region and Channel:")
print("="*80)
region_channel = df.groupby(['region', 'sales_channel']).agg({
    'total_sales': 'sum',
    'profit': 'sum',
    'order_id': 'count'
}).rename(columns={'order_id': 'num_orders'})
display(region_channel.head(15))

In [None]:
# Unstack to create a pivot-like view
print("Sales by Region and Category (Pivot View):")
print("="*80)
pivot_view = df.groupby(['region', 'category'])['total_sales'].sum().unstack(fill_value=0)
display(pivot_view)
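
The same region-by-category table can be built with `pd.pivot_table`, which some readers find easier to scan than `groupby(...).unstack()`. The sketch below, on a hypothetical `toy` frame, checks that both routes give the same numbers.

```python
import pandas as pd

# Hypothetical orders; ('North', 'C') and ('South', 'B') have no rows
toy = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South'],
    'category': ['A', 'B', 'A', 'A', 'C'],
    'total_sales': [100, 200, 50, 70, 30],
})

via_unstack = (
    toy.groupby(['region', 'category'])['total_sales'].sum().unstack(fill_value=0)
)

via_pivot = pd.pivot_table(
    toy, index='region', columns='category',
    values='total_sales', aggfunc='sum', fill_value=0,
)

print(via_unstack)
print(via_pivot)
```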

## 2.4 Advanced GroupBy Techniques

In [None]:
# Calculate percentage contribution by category
print("Category Sales Contribution Analysis:")
print("="*80)
category_sales = df.groupby('category')['total_sales'].sum()
category_pct = (category_sales / category_sales.sum() * 100).round(2)
category_summary = pd.DataFrame({
    'Total Sales': category_sales,
    'Percentage': category_pct
}).sort_values('Total Sales', ascending=False)
display(category_summary)

In [None]:
# Group by customer segment and calculate profitability
print("Customer Segment Profitability:")
print("="*80)
segment_profit = df.groupby('customer_segment').agg({
    'order_id': 'count',
    'total_sales': 'sum',
    'profit': 'sum',
    'discount_amount': 'sum'
}).rename(columns={'order_id': 'num_orders'})

# Calculate profit margin for each segment
segment_profit['profit_margin_%'] = (segment_profit['profit'] / segment_profit['total_sales'] * 100).round(2)
display(segment_profit)

In [None]:
# Using transform to add group statistics back to original dataframe
print("Adding Group Means to Original Data:")
print("="*80)
df['category_avg_sales'] = df.groupby('category')['total_sales'].transform('mean')
df['region_avg_profit'] = df.groupby('region')['profit'].transform('mean')

# Show sample with the new columns
display(df[['category', 'region', 'total_sales', 'category_avg_sales', 'profit', 'region_avg_profit']].head(10))
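
The difference between an aggregation and `.transform()` is the shape of the result: an aggregation returns one row per group, while `transform` returns one value per original row, aligned to the original index. A sketch on hypothetical data, including a typical follow-up calculation:

```python
import pandas as pd

# Hypothetical data: category A averages 200, category B averages 100
toy = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B'],
    'total_sales': [100.0, 300.0, 60.0, 90.0, 150.0],
})

per_group = toy.groupby('category')['total_sales'].mean()           # one row per group
per_row = toy.groupby('category')['total_sales'].transform('mean')  # one value per row

# Because per_row lines up with toy, it can be used directly in arithmetic
toy['vs_category_avg'] = toy['total_sales'] - per_row

print(per_group)
print(toy)
```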

In [None]:
# Filter groups based on conditions
print("High-Performing Categories (Total Sales > ₹50,000):")
print("="*80)
high_performing = df.groupby('category').filter(lambda x: x['total_sales'].sum() > 50000)
print(f"Categories meeting criteria: {high_performing['category'].unique()}")
print(f"Number of orders: {len(high_performing)}")
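
`.filter()` keeps or drops whole groups, returning the original rows of every group that passes the test. The same selection can be written as a boolean mask built with `.transform('sum')`, which avoids calling a Python lambda once per group. A sketch on hypothetical data, reusing the 50,000 threshold from the cell above:

```python
import pandas as pd

# Hypothetical orders: A totals 55,000, B totals 10,000, C totals 60,000
toy = pd.DataFrame({
    'category': ['A', 'A', 'B', 'C'],
    'total_sales': [30000, 25000, 10000, 60000],
})

# Group-level filter: rows of categories whose total exceeds the threshold
kept = toy.groupby('category').filter(lambda g: g['total_sales'].sum() > 50000)

# Equivalent row mask: broadcast each group's total back to its rows
mask = toy.groupby('category')['total_sales'].transform('sum') > 50000
kept_mask = toy[mask]

print(kept)
print(kept_mask)
```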

---
# PART 3: RETRIEVING EXTREMES
---

## 3.1 Using .nlargest() to Find Top Values

In [None]:
# Find top 10 orders by total sales
print("Top 10 Orders by Total Sales:")
print("="*80)
top_10_sales = df.nlargest(10, 'total_sales')[['order_id', 'category', 'region', 'total_sales', 'profit']]
display(top_10_sales)

In [None]:
# Find top 5 orders by profit
print("Top 5 Most Profitable Orders:")
print("="*80)
top_5_profit = df.nlargest(5, 'profit')[['order_id', 'category', 'product', 'profit', 'profit_margin']]
display(top_5_profit)

In [None]:
# Find orders with highest quantities
print("Top 10 Orders by Quantity:")
print("="*80)
top_quantity = df.nlargest(10, 'quantity')[['order_id', 'product', 'quantity', 'unit_price', 'total_sales']]
display(top_quantity)

In [None]:
# Find highest discount percentages
print("Top 10 Orders with Highest Discounts:")
print("="*80)
high_discount = df.nlargest(10, 'discount_percent')[['order_id', 'category', 'total_sales', 'discount_percent', 'discount_amount']]
display(high_discount)

## 3.2 Using .nsmallest() to Find Bottom Values

In [None]:
# Find bottom 10 orders by total sales
print("Bottom 10 Orders by Total Sales:")
print("="*80)
bottom_10_sales = df.nsmallest(10, 'total_sales')[['order_id', 'category', 'region', 'total_sales', 'profit']]
display(bottom_10_sales)

In [None]:
# Find bottom 5 orders by profit
print("Bottom 5 Orders by Profit:")
print("="*80)
bottom_5_profit = df.nsmallest(5, 'profit')[['order_id', 'category', 'product', 'profit', 'profit_margin']]
display(bottom_5_profit)

In [None]:
# Find orders with smallest quantities
print("Bottom 10 Orders by Quantity:")
print("="*80)
small_quantity = df.nsmallest(10, 'quantity')[['order_id', 'product', 'quantity', 'unit_price', 'total_sales']]
display(small_quantity)

In [None]:
# Find lowest discount percentages
print("Bottom 10 Orders with Lowest Discounts:")
print("="*80)
low_discount = df.nsmallest(10, 'discount_percent')[['order_id', 'category', 'total_sales', 'discount_percent', 'discount_amount']]
display(low_discount)
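
When many rows share the same value, as is likely for a 0% discount, `nsmallest` and `nlargest` have to break ties. By default (`keep='first'`) they return exactly `n` rows, and `keep='all'` returns every row tied with the last selected value, which can be more than `n`. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical orders: three of them have a 0% discount
toy = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'discount_percent': [0, 5, 0, 0, 10],
})

first_two = toy.nsmallest(2, 'discount_percent')             # exactly 2 rows
all_ties = toy.nsmallest(2, 'discount_percent', keep='all')  # every row tied at 0

print(first_two)
print(all_ties)
```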

## 3.3 Combining nlargest/nsmallest with GroupBy

In [None]:
# Find top 3 orders in each region
print("Top 3 Orders by Total Sales in Each Region:")
print("="*80)
top_by_region = df.groupby('region', group_keys=False).apply(lambda x: x.nlargest(3, 'total_sales'))
display(top_by_region[['order_id', 'region', 'category', 'total_sales', 'profit']])

In [None]:
# Find top 2 profitable orders in each category
print("Top 2 Most Profitable Orders in Each Category:")
print("="*80)
top_profit_by_category = df.groupby('category', group_keys=False).apply(lambda x: x.nlargest(2, 'profit'))
display(top_profit_by_category[['order_id', 'category', 'product', 'profit', 'total_sales']])

In [None]:
# Find bottom 2 orders by sales in each sales channel
print("Bottom 2 Orders by Sales in Each Channel:")
print("="*80)
bottom_by_channel = df.groupby('sales_channel', group_keys=False).apply(lambda x: x.nsmallest(2, 'total_sales'))
display(bottom_by_channel[['order_id', 'sales_channel', 'category', 'total_sales']])
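
Depending on the pandas version, `groupby(...).apply(...)` on a full DataFrame may emit a deprecation warning about operating on the grouping columns (pandas 2.2 does), and newer releases may exclude the grouping column from the result. One alternative that does not rely on `apply` is to sort first and take the head of each group. This is a swapped-in technique, not the `nlargest` approach used above, sketched here on hypothetical data:

```python
import pandas as pd

# Hypothetical orders in two regions
toy = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6],
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'total_sales': [100, 400, 250, 90, 300, 120],
})

# Sort by sales, keep the first 2 rows of each region, then tidy the order
top2 = (
    toy.sort_values('total_sales', ascending=False)
       .groupby('region')
       .head(2)
       .sort_values(['region', 'total_sales'], ascending=[True, False])
)
print(top2)
```

For bottom-N per group, the same pattern works with `ascending=True` in the first sort.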

## 3.4 Advanced Extreme Value Analysis

In [None]:
# Compare top and bottom performers
print("Performance Comparison: Top 5 vs Bottom 5 Orders")
print("="*80)

top_5 = df.nlargest(5, 'total_sales')
bottom_5 = df.nsmallest(5, 'total_sales')

print("\nTop 5 Orders Statistics:")
print(f"Average Sales: ₹{top_5['total_sales'].mean():,.2f}")
print(f"Average Profit: ₹{top_5['profit'].mean():,.2f}")
print(f"Average Quantity: {top_5['quantity'].mean():.2f}")

print("\nBottom 5 Orders Statistics:")
print(f"Average Sales: ₹{bottom_5['total_sales'].mean():,.2f}")
print(f"Average Profit: ₹{bottom_5['profit'].mean():,.2f}")
print(f"Average Quantity: {bottom_5['quantity'].mean():.2f}")

In [None]:
# Find outliers using nlargest
print("Identifying Sales Outliers (Top 1%):")
print("="*80)

top_1_percent = max(1, int(len(df) * 0.01))  # at least one row, even for small datasets
outliers = df.nlargest(top_1_percent, 'total_sales')

print(f"Number of outlier orders: {len(outliers)}")
print(f"Outlier sales threshold: ₹{outliers['total_sales'].min():,.2f}")
print(f"Average outlier sale: ₹{outliers['total_sales'].mean():,.2f}")
print(f"Total outlier sales: ₹{outliers['total_sales'].sum():,.2f}")
print(f"Percentage of total revenue: {(outliers['total_sales'].sum() / df['total_sales'].sum() * 100):.2f}%")

In [None]:
# Create summary of extremes by category
print("Extreme Values Summary by Category:")
print("="*80)

category_extremes = pd.DataFrame({
    'Max_Sales': df.groupby('category')['total_sales'].max(),
    'Min_Sales': df.groupby('category')['total_sales'].min(),
    'Max_Profit': df.groupby('category')['profit'].max(),
    'Min_Profit': df.groupby('category')['profit'].min(),
    'Sales_Range': df.groupby('category')['total_sales'].max() - df.groupby('category')['total_sales'].min()
}).sort_values('Sales_Range', ascending=False)

display(category_extremes)

---
# PART 4: PRACTICAL EXAMPLES
---

## 4.1 Business Intelligence Dashboard Statistics

In [None]:
# Create a comprehensive business dashboard
print("="*80)
print("SALES PERFORMANCE DASHBOARD")
print("="*80)

print("\n📊 OVERALL METRICS")
print("-" * 60)
print(f"Total Orders: {len(df):,}")
print(f"Total Revenue: ₹{df['total_sales'].sum():,.2f}")
print(f"Total Profit: ₹{df['profit'].sum():,.2f}")
print(f"Average Order Value: ₹{df['total_sales'].mean():,.2f}")
print(f"Overall Profit Margin: {(df['profit'].sum() / df['net_sales'].sum() * 100):.2f}%")

print("\n🏆 TOP PERFORMERS")
print("-" * 60)
print(f"Best Region: {df.groupby('region')['total_sales'].sum().idxmax()}")
print(f"Best Category: {df.groupby('category')['total_sales'].sum().idxmax()}")
print(f"Best Channel: {df.groupby('sales_channel')['total_sales'].sum().idxmax()}")

print("\n📈 SALES DISTRIBUTION")
print("-" * 60)
print(f"Median Sale: ₹{df['total_sales'].median():,.2f}")
print(f"75th Percentile: ₹{df['total_sales'].quantile(0.75):,.2f}")
print(f"90th Percentile: ₹{df['total_sales'].quantile(0.90):,.2f}")
print(f"Largest Single Sale: ₹{df['total_sales'].max():,.2f}")

print("\n💰 PROFITABILITY")
print("-" * 60)
print(f"Average Profit per Order: ₹{df['profit'].mean():,.2f}")
print(f"Highest Profit Order: ₹{df['profit'].max():,.2f}")
print(f"Profit Std Dev: ₹{df['profit'].std():,.2f}")

## 4.2 Regional Performance Analysis

In [None]:
# Comprehensive regional analysis
print("REGIONAL PERFORMANCE ANALYSIS")
print("="*80)

regional_analysis = df.groupby('region').agg({
    'order_id': 'count',
    'total_sales': ['sum', 'mean', 'median'],
    'profit': ['sum', 'mean'],
    'quantity': 'sum',
    'discount_percent': 'mean'
}).round(2)

regional_analysis.columns = ['Orders', 'Total_Sales', 'Avg_Sale', 'Median_Sale', 
                               'Total_Profit', 'Avg_Profit', 'Units_Sold', 'Avg_Discount']

# Add profit margin
regional_analysis['Profit_Margin_%'] = (
    regional_analysis['Total_Profit'] / regional_analysis['Total_Sales'] * 100
).round(2)

# Sort by total sales
regional_analysis = regional_analysis.sort_values('Total_Sales', ascending=False)

display(regional_analysis)
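
Renaming the columns by hand, as above, requires the list to match the order of the aggregation dictionary exactly. A less fragile option is to build flat names from the two-level labels themselves. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical orders in two regions
toy = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'total_sales': [100.0, 300.0, 50.0],
    'profit': [10.0, 30.0, 5.0],
})

stats = toy.groupby('region').agg({
    'total_sales': ['sum', 'mean'],
    'profit': ['sum'],
})

# Each column label is a (column, function) pair; join the parts with '_'
stats.columns = ['_'.join(pair) for pair in stats.columns]
print(stats)
```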

## 4.3 Product Category Deep Dive

In [None]:
# Detailed category analysis with extremes
print("PRODUCT CATEGORY ANALYSIS")
print("="*80)

for category in df['category'].unique():
    cat_data = df[df['category'] == category]
    
    print(f"\n📦 {category.upper()}")
    print("-" * 60)
    print(f"Total Orders: {len(cat_data)}")
    print(f"Total Revenue: ₹{cat_data['total_sales'].sum():,.2f}")
    print(f"Average Sale: ₹{cat_data['total_sales'].mean():,.2f}")
    print(f"Price Range: ₹{cat_data['unit_price'].min():,.2f} - ₹{cat_data['unit_price'].max():,.2f}")
    
    # Top product in this category
    top_product = cat_data.nlargest(1, 'total_sales')
    print(f"Top Sale: {top_product['product'].values[0]} (₹{top_product['total_sales'].values[0]:,.2f})")
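
The loop above is handy for printing a report. When only the top row of each category is needed as a table, `groupby(...).idxmax()` can fetch all of them in one step. A sketch on hypothetical data (the product names are made up):

```python
import pandas as pd

# Hypothetical orders: two products per category
toy = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'product': ['Pen', 'Desk', 'Lamp', 'Chair'],
    'total_sales': [20.0, 900.0, 150.0, 400.0],
})

# idxmax() returns, per category, the row label of the largest total_sales
top_idx = toy.groupby('category')['total_sales'].idxmax()

# Use those labels to pull the full rows
top_rows = toy.loc[top_idx]
print(top_rows)
```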

---
## Summary

### Key Takeaways:

1. **Statistical Attributes**:
   - Use `.mean()`, `.median()`, `.mode()` for central tendency
   - Use `.min()`, `.max()`, `.std()`, `.var()` for dispersion
   - Use `.sum()`, `.prod()`, `.quantile()` for aggregations

2. **GroupBy Operations**:
   - Group data by one or multiple columns
   - Apply multiple aggregation functions
   - Use `.transform()` to add group statistics back to original data
   - Use `.filter()` to filter groups based on conditions

3. **Retrieving Extremes**:
   - Use `.nlargest(n, 'col')` to get top N values
   - Use `.nsmallest(n, 'col')` to get bottom N values
   - Combine with groupby for category-wise extremes
   - Useful for outlier detection and performance analysis

### Best Practices:
- Always check data types before statistical operations
- Use `.describe()` for quick statistical overview
- Combine groupby with multiple aggregations for comprehensive analysis
- Use extremes to identify outliers and exceptional cases
- Consider domain context when interpreting statistics
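
As an illustration of the first best practice, the sketch below shows one way to check and repair a column that was read as text. The `raw` frame and its `'n/a'` entry are hypothetical. `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN, which the statistical methods skip by default.

```python
import pandas as pd

# Hypothetical column read from a CSV as text, with one bad entry
raw = pd.DataFrame({'quantity': ['3', '5', 'n/a', '2']})

was_numeric = pd.api.types.is_numeric_dtype(raw['quantity'])
print(f"Numeric before conversion: {was_numeric}")

# Convert to numbers; entries that cannot be parsed become NaN
raw['quantity'] = pd.to_numeric(raw['quantity'], errors='coerce')

is_numeric = pd.api.types.is_numeric_dtype(raw['quantity'])
n_missing = raw['quantity'].isna().sum()
mean_quantity = raw['quantity'].mean()  # NaN values are skipped by default

print(f"Numeric after conversion: {is_numeric}")
print(f"Missing values: {n_missing}")
print(f"Mean quantity: {mean_quantity:.2f}")
```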