### **Module 6: Grouping and Aggregation**
This module covers splitting data into groups and aggregating each group, a core step in exploratory data analysis and feature engineering.

#### **Topics:**
- **Grouping Data:**
  - Grouping data by one or more columns using `groupby()`.
  
- **Aggregation Functions:**
  - Applying aggregation functions like `sum()`, `mean()`, `count()`, and `max()` on grouped data.

- **Pivot Tables:**
  - Creating pivot tables for summarizing and aggregating data.

#### **Hands-on Lab:**
- Group a dataset by a categorical column (e.g., `gender`) and calculate summary statistics for each group.
- Create a pivot table to summarize the data.

---

## **1. Grouping Data**

### **Real-world Scenario 1: Customer Purchases by City**  
You have a **sales dataset** from multiple cities, and you want to analyze total sales for each city.

### **Example: Grouping by City**

In [1]:
import pandas as pd

# Sample sales data
data = {
    'city': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
    'sales': [250, 300, 200, 150, 400, 100],
    'product_category': ['Electronics', 'Clothing', 'Furniture', 'Clothing', 'Electronics', 'Furniture']
}

df = pd.DataFrame(data)

# Grouping sales data by city
grouped_by_city = df.groupby('city')['sales'].sum()

print(grouped_by_city)

city
Chicago        250
Los Angeles    700
New York       450
Name: sales, dtype: int64
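
`groupby()` also accepts a list of columns, which creates one group per unique combination of values. The sketch below rebuilds the same sample data under a separate name (`sales_df`, chosen here only to avoid overwriting `df`) and groups by both `city` and `product_category`. The `as_index=False` variant is optional and is shown as one way to get a flat table back.

```python
import pandas as pd

# Same sample sales data as above, stored under an illustrative name
sales_df = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
    'sales': [250, 300, 200, 150, 400, 100],
    'product_category': ['Electronics', 'Clothing', 'Furniture', 'Clothing', 'Electronics', 'Furniture']
})

# A list of columns creates one group per (city, product_category) pair;
# the result is a Series with a two-level MultiIndex
sales_by_city_category = sales_df.groupby(['city', 'product_category'])['sales'].sum()
print(sales_by_city_category)

# as_index=False keeps the group keys as ordinary columns instead of an index
flat_summary = sales_df.groupby(['city', 'product_category'], as_index=False)['sales'].sum()
print(flat_summary)
```

The flat form is often easier to merge or export, while the MultiIndex form is convenient for lookups such as `sales_by_city_category.loc[('New York', 'Electronics')]`.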


## **2. Aggregation Functions**

### **Real-world Scenario 2: Product Sales Summary**  
Using the sales dataset from Scenario 1 (the `df` created in cell `In [1]`), you want to calculate the **total sales, average sales, and number of sales for each product category**.

### **Example: Grouping by Product Category with Multiple Aggregations**

In [2]:
# Grouping by product category and applying aggregation functions
grouped_by_category = df.groupby('product_category').agg(
    total_sales=('sales', 'sum'),
    average_sales=('sales', 'mean'),
    sale_count=('sales', 'count')
)

print(grouped_by_category)

                  total_sales  average_sales  sale_count
product_category                                        
Clothing                  450          225.0           2
Electronics               650          325.0           2
Furniture                 300          150.0           2
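
When only one statistic is needed, aggregation methods such as `max()` can be called directly on the grouped column, and `agg()` also accepts a plain list of function names. A minimal sketch, rebuilding the sample sales data under the illustrative name `sales_df`:

```python
import pandas as pd

# Same sample sales data as Scenario 1, stored under an illustrative name
sales_df = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
    'sales': [250, 300, 200, 150, 400, 100],
    'product_category': ['Electronics', 'Clothing', 'Furniture', 'Clothing', 'Electronics', 'Furniture']
})

sales_by_category = sales_df.groupby('product_category')['sales']

# A single aggregation method returns one value per group
largest_sale = sales_by_category.max()
print(largest_sale)

# A list of function names returns one column per function,
# named after the function rather than a custom label
category_stats = sales_by_category.agg(['sum', 'mean', 'count', 'max'])
print(category_stats)
```

The named-aggregation form used in `In [2]` is usually preferable when the output columns need descriptive names; the list form is a quicker option for exploration.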


### **Real-world Scenario 3: Student Exam Scores**  
You have a dataset of **students' exam scores** across different subjects. You want to calculate the **highest, lowest, and average scores for each subject**.

In [3]:
# Sample student exam data
student_data = {
    'subject': ['Math', 'Science', 'Math', 'History', 'Science', 'History'],
    'student': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'score': [85, 90, 78, 92, 88, 80]
}

df = pd.DataFrame(student_data)

# Grouping by subject and aggregating
score_summary = df.groupby('subject').agg(
    max_score=('score', 'max'),
    min_score=('score', 'min'),
    average_score=('score', 'mean')
)

print(score_summary)

         max_score  min_score  average_score
subject                                     
History         92         80           86.0
Math            85         78           81.5
Science         90         88           89.0
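
The summary above reports the top score per subject but not who earned it. One common idiom, sketched here with the same sample data under the illustrative name `scores_df`, uses `idxmax()` to find the row label of each group's maximum and then selects those rows with `.loc`. It assumes the DataFrame index is unique, and when scores tie it keeps only the first occurrence.

```python
import pandas as pd

# Same sample exam data as above, stored under an illustrative name
scores_df = pd.DataFrame({
    'subject': ['Math', 'Science', 'Math', 'History', 'Science', 'History'],
    'student': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'score': [85, 90, 78, 92, 88, 80]
})

# idxmax() returns, for each subject, the index label of the row with the highest score
top_row_labels = scores_df.groupby('subject')['score'].idxmax()

# Selecting those labels recovers the full rows, including the student name
top_rows = scores_df.loc[top_row_labels, ['subject', 'student', 'score']]
print(top_rows)
```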


## **3. Pivot Tables**

### **Real-world Scenario 4: Employee Salary Report**  
You want to create a **pivot table** to analyze the **average salary of employees grouped by department and job role**.

### **Example: Employee Salary Pivot Table**

In [4]:
# Sample employee salary data
employee_data = {
    'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'HR'],
    'job_role': ['Manager', 'Developer', 'Executive', 'Manager', 'Analyst', 'Executive'],
    'salary': [60000, 75000, 50000, 65000, 80000, 55000]
}

df = pd.DataFrame(employee_data)

# Creating a pivot table for average salary
pivot_salary = pd.pivot_table(df, values='salary', index='department', columns='job_role', aggfunc='mean')

print(pivot_salary)

job_role    Analyst  Developer  Executive  Manager
department                                        
HR              NaN        NaN    55000.0  65000.0
IT          80000.0    75000.0        NaN      NaN
Sales           NaN        NaN    50000.0  60000.0
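
The `NaN` cells mark department and job role combinations that have no rows in the data. If a placeholder is preferred, `pd.pivot_table` accepts a `fill_value` argument. The sketch below uses `0` purely as an illustration, with the sample data rebuilt under the illustrative name `employee_df`.

```python
import pandas as pd

# Same sample employee data as above, stored under an illustrative name
employee_df = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'HR'],
    'job_role': ['Manager', 'Developer', 'Executive', 'Manager', 'Analyst', 'Executive'],
    'salary': [60000, 75000, 50000, 65000, 80000, 55000]
})

# fill_value replaces the missing combinations in the resulting table
pivot_filled = pd.pivot_table(
    employee_df,
    values='salary',
    index='department',
    columns='job_role',
    aggfunc='mean',
    fill_value=0
)
print(pivot_filled)
```

Whether to fill is a judgment call: a `0` here could be misread as a zero average salary rather than "no such role in this department", so leaving `NaN` in place may be the safer default for reports.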


### **Real-world Scenario 5: Store Sales Report**  
You want to create a **pivot table** to summarize **total sales for each store by month**.

In [5]:
# Sample store sales data
sales_data = {
    'store': ['Store A', 'Store B', 'Store A', 'Store B', 'Store A', 'Store B'],
    'month': ['January', 'January', 'February', 'February', 'March', 'March'],
    'product_category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'sales': [1000, 1200, 1500, 1300, 2000, 1800]
}

df = pd.DataFrame(sales_data)

# Creating a pivot table for total sales
pivot_sales = pd.pivot_table(df, values='sales', index='store', columns='month', aggfunc='sum')

print(pivot_sales)

month    February  January  March
store                            
Store A      1500     1000   2000
Store B      1300     1200   1800
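
Two follow-ups are worth sketching. The sample data also has a `product_category` column, which can be added as a second index level by passing a list to `index`. The month columns come out in alphabetical order, and `reindex` can restore calendar order. The name `store_df` and the `month_order` list are illustrative choices, not part of the original cell.

```python
import pandas as pd

# Same sample store data as above, stored under an illustrative name
store_df = pd.DataFrame({
    'store': ['Store A', 'Store B', 'Store A', 'Store B', 'Store A', 'Store B'],
    'month': ['January', 'January', 'February', 'February', 'March', 'March'],
    'product_category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'sales': [1000, 1200, 1500, 1300, 2000, 1800]
})

# A list for `index` produces one row per (store, product_category) pair
pivot_by_category = pd.pivot_table(
    store_df,
    values='sales',
    index=['store', 'product_category'],
    columns='month',
    aggfunc='sum'
)

# Month names sort alphabetically by default; reindex the columns into calendar order
month_order = ['January', 'February', 'March']
pivot_by_category = pivot_by_category.reindex(columns=month_order)
print(pivot_by_category)
```

In this small dataset each store sells a single category, so the extra level adds no new rows; with richer data it would split each store's totals by category.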


## **4. Hands-on Lab: Full Example 1**

### **Scenario 6: Grouping and Aggregating Students' Grades**  
You have a dataset of **students’ grades** for multiple subjects and want to:
1. Group the data by **subject** and calculate the **average and highest grade** for each subject.
2. Create a **pivot table** to summarize the data by **student and subject**.

In [6]:
# Student grades dataset
grade_data = {
    'student': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Charlie'],
    'subject': ['Math', 'Math', 'Science', 'Science', 'History', 'History'],
    'grade': [88, 78, 90, 85, 92, 87]
}

df = pd.DataFrame(grade_data)

# Grouping by subject
subject_summary = df.groupby('subject').agg(
    average_grade=('grade', 'mean'),
    highest_grade=('grade', 'max')
)

print("Subject Summary:")
print(subject_summary)

# Pivot table by student and subject
pivot_grades = pd.pivot_table(df, values='grade', index='student', columns='subject', aggfunc='mean')

print("\nPivot Table:")
print(pivot_grades)

Subject Summary:
         average_grade  highest_grade
subject                              
History           89.5             92
Math              83.0             88
Science           87.5             90

Pivot Table:
subject  History  Math  Science
student                        
Alice        NaN  88.0     90.0
Bob         92.0  78.0      NaN
Charlie     87.0   NaN     85.0
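
A pivot table can be thought of as a `groupby()` on two keys followed by a reshape. The sketch below builds the student-by-subject table both ways so the two can be compared. It rebuilds the grades data under the illustrative name `grades_df` and should give the same values for this dataset.

```python
import pandas as pd

# Same sample grades data as above, stored under an illustrative name
grades_df = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Charlie'],
    'subject': ['Math', 'Math', 'Science', 'Science', 'History', 'History'],
    'grade': [88, 78, 90, 85, 92, 87]
})

# Route 1: group by both keys, aggregate, then move `subject` from the index to the columns
via_groupby = grades_df.groupby(['student', 'subject'])['grade'].mean().unstack()

# Route 2: the pivot table from the cell above
via_pivot = pd.pivot_table(grades_df, values='grade', index='student', columns='subject', aggfunc='mean')

print(via_groupby)
print(via_pivot)
```

`pivot_table` is the more readable choice for reports, while the `groupby()` route is handy when the grouped result is needed in long form before reshaping.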


## **5. Hands-on Lab: Full Example 2**

### **Scenario 7: E-Commerce Sales Data**  
You have an **e-commerce dataset** with data on product sales by **region** and **payment method**. You want to:
1. Group sales data by **region** and calculate **total sales and average sales**.
2. Create a **pivot table** to show **total sales for each region, broken down by payment method**.

In [7]:
# E-commerce sales data
ecommerce_data = {
    'region': ['North', 'South', 'North', 'East', 'West', 'East', 'West'],
    'payment_method': ['Credit Card', 'PayPal', 'Credit Card', 'PayPal', 'Credit Card', 'Debit Card', 'PayPal'],
    'sales_amount': [500, 700, 300, 600, 800, 400, 900]
}

df = pd.DataFrame(ecommerce_data)

# Grouping by region
region_summary = df.groupby('region').agg(
    total_sales=('sales_amount', 'sum'),
    average_sales=('sales_amount', 'mean')
)

print("Region Summary:")
print(region_summary)

# Pivot table for sales by region and payment method
pivot_payments = pd.pivot_table(df, values='sales_amount', index='region', columns='payment_method', aggfunc='sum')

print("\nPivot Table by Payment Method:")
print(pivot_payments)

Region Summary:
        total_sales  average_sales
region                            
East           1000          500.0
North           800          400.0
South           700          700.0
West           1700          850.0

Pivot Table by Payment Method:
payment_method  Credit Card  Debit Card  PayPal
region                                         
East                    NaN       400.0   600.0
North                 800.0         NaN     NaN
South                   NaN         NaN   700.0
West                  800.0         NaN   900.0
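
The region summary and the pivot table can be combined into one view by asking `pd.pivot_table` for row and column totals with `margins=True`. A minimal sketch, with the data rebuilt under the illustrative name `ecommerce_df` and the label `'Total'` chosen here for the margin row and column:

```python
import pandas as pd

# Same sample e-commerce data as above, stored under an illustrative name
ecommerce_df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'West', 'East', 'West'],
    'payment_method': ['Credit Card', 'PayPal', 'Credit Card', 'PayPal', 'Credit Card', 'Debit Card', 'PayPal'],
    'sales_amount': [500, 700, 300, 600, 800, 400, 900]
})

# margins=True appends a totals row and column computed with the same aggfunc
pivot_with_totals = pd.pivot_table(
    ecommerce_df,
    values='sales_amount',
    index='region',
    columns='payment_method',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
print(pivot_with_totals)
```

The `Total` column should match the `total_sales` figures in the region summary, which makes it a quick consistency check between the two outputs.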


## **Summary of Examples:**
| **Scenario**               | **Function**    | **Description**                                         |
|----------------------------|----------------|--------------------------------------------------------|
| Customer Purchases by City  | `groupby()`     | Grouping sales data by city to calculate total sales.  |
| Product Sales Summary       | `agg()`         | Summary of sales (sum, mean, count) per product category. |
| Student Exam Scores         | `agg()`         | Highest, lowest, and average exam scores per subject.   |
| Employee Salary Report      | `pivot_table()` | Average salary grouped by department and job role.      |
| Store Sales Report          | `pivot_table()` | Total sales grouped by store and month.                 |
| Students’ Grades            | Full Example    | Summary stats and pivot table for student grades.       |
| E-commerce Sales by Region  | Full Example    | Region summary and pivot table by payment method.       |

These examples cover common **real-world scenarios** for grouping data with `groupby()`, aggregating it with `agg()`, and summarizing it with **pivot tables**.