# Day 8: Advanced Grouping and Aggregation with Pandas

Welcome to Day 8! Today, we're diving deep into one of Pandas' most powerful features: the `groupby` operation. It implements the split-apply-combine pattern that underpins most data analysis: you split your data into groups, apply a function to each group independently, and combine the results into a new data structure.

Let's start by importing the libraries we need: `seaborn`, which we use here only to load a more realistic dataset called 'tips', and `matplotlib`, which we use for the bar chart in Exercise 2.3.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

### Loading a Realistic Dataset

Instead of a small, custom DataFrame, let's use the 'tips' dataset. It records tips given by diners, along with details about the meal.

In [None]:
df = sns.load_dataset("tips")

print("First 5 rows of the Tips dataset:")
print(df.head())

print("\nDataset Information:")
df.info()

---

## Part 1: The GroupBy Object and Basic Aggregations

When you call `.groupby()`, Pandas returns a special `DataFrameGroupBy` object. This object records how the rows are assigned to groups, but it computes no results until you apply an aggregation function (or another group-wise method) to it.
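
To make the deferred behaviour concrete, here is a minimal sketch on a small hand-made DataFrame. The `toy` data and its values are invented for illustration and are not part of the tips dataset.

```python
import pandas as pd

# Toy data invented for this sketch (not the tips dataset).
toy = pd.DataFrame(
    {
        "day": ["Sat", "Sun", "Sat", "Sun", "Sat"],
        "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

grouped = toy.groupby("day")

# The object is a DataFrameGroupBy; no per-group values have been computed yet.
print(type(grouped).__name__)

# Group labels and sizes can be inspected without aggregating any column.
print(grouped.ngroups)
print(grouped.size())

# Applying an aggregation is what produces an actual result.
mean_bill = grouped["total_bill"].mean()
print(mean_bill)
```

The same `grouped` object can be reused for several aggregations, so the grouping only needs to be written once.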

**Exercise 1.1:** Group the DataFrame by the 'day' column and calculate the *average* `total_bill` for each day. 

In [None]:
# Your code here

**Solution 1.1:**

In [None]:
# Solution
avg_bill_by_day = df.groupby("day", observed=True)["total_bill"].mean()
print(avg_bill_by_day)
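
The solutions in this notebook pass `observed=` to `groupby`. That argument only matters when a grouping column has a categorical dtype, which `df.info()` above should report for columns such as `day`. The sketch below uses an invented `toy` DataFrame with an unused category ('Fri') to show the difference between the two settings.

```python
import pandas as pd

# Toy data invented for this sketch; 'Fri' is a declared category with no rows.
toy = pd.DataFrame(
    {
        "day": pd.Categorical(["Sat", "Sun", "Sat"], categories=["Fri", "Sat", "Sun"]),
        "total_bill": [10.0, 20.0, 30.0],
    }
)

# observed=True: only categories that actually occur in the data form groups.
only_seen = toy.groupby("day", observed=True)["total_bill"].mean()
print(only_seen)

# observed=False: every declared category gets a row, even if it is empty.
all_categories = toy.groupby("day", observed=False)["total_bill"].mean()
print(all_categories)
```

With `observed=False`, the empty 'Fri' group shows up with a missing value for the mean, which is worth keeping in mind when reading results that contain unexpected `NaN` rows.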

### Inspecting Groups

Sometimes it's useful to see the data within a specific group. You can use the `.get_group()` method for this.

**Exercise 1.2:** First, create a `DataFrameGroupBy` object by grouping by 'smoker'. Then, use `.get_group('Yes')` to retrieve all the rows for smokers.

In [None]:
# Your code here

**Solution 1.2:**

In [None]:
# Solution
grouped_by_smoker = df.groupby("smoker", observed=False)
smoker_data = grouped_by_smoker.get_group("Yes")
print(smoker_data.head())
# Extra (not required by the exercise): the share of smokers vs. non-smokers
smokers = df["smoker"].value_counts(normalize=True)
print(smokers)


**Exercise 1.3:** Find the highest tip amount for each day of the week.

In [None]:
# Your code here

**Solution 1.3:**

In [None]:
# Solution
max_tip_by_day = df.groupby("day", observed=False)["tip"].max()
print(max_tip_by_day)

---

## Part 2: Advanced Aggregation with `.agg()`

The `.agg()` method is your best friend for advanced aggregation. It allows you to apply multiple functions at once and even apply different functions to different columns.
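
As a rough illustration of the two capabilities just mentioned, the sketch below uses an invented `toy` DataFrame. It shows a list of function names applied to one column and a dict that maps each column to its own function; these are two of several call styles that `.agg()` accepts.

```python
import pandas as pd

# Toy data invented for this sketch (not the tips dataset).
toy = pd.DataFrame(
    {
        "sex": ["F", "M", "F", "M"],
        "total_bill": [10.0, 20.0, 30.0, 40.0],
        "tip": [1.0, 2.0, 3.0, 6.0],
    }
)

# Several functions on one column: the result has one column per function.
bill_stats = toy.groupby("sex")["total_bill"].agg(["min", "max"])
print(bill_stats)

# A different function for each column, given as a column -> function dict.
per_column = toy.groupby("sex").agg({"total_bill": "sum", "tip": "mean"})
print(per_column)
```

The dict form keeps the original column names, so it cannot by itself give two results from the same column distinct names.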

**Exercise 2.1:** For each 'sex', calculate the sum of 'total_bill' and the mean of 'tip' in a single operation.

In [None]:
# Your code here

**Solution 2.1:**

In [None]:
# Solution
agg_results = df.groupby("sex", observed=False).agg(
    total_bill_sum=("total_bill", "sum"), mean_tip=("tip", "mean")
)
print(agg_results)

**Exercise 2.2 (Challenge):** This is a very common and powerful pattern. Group by 'day' and 'time'. For the 'total_bill', calculate the mean and standard deviation. For the 'tip', find the minimum and maximum values. **Bonus:** Rename the columns to be more descriptive (e.g., 'avg_bill', 'std_bill', 'min_tip', 'max_tip').

In [None]:
# Your code here

**Solution 2.2:**

In [None]:
# Solution
detailed_agg = df.groupby(["day", "time"], observed=True).agg(
    avg_bill=("total_bill", "mean"),
    std_bill=("total_bill", "std"),
    min_tip=("tip", "min"),
    max_tip=("tip", "max"),
)
print(detailed_agg)

**Exercise 2.3 (Visualization):** Let's visualize the results from Exercise 1.1. Create a bar chart showing the average total bill for each day.

In [None]:
# Your code here
# Hint: The result of your groupby operation is a Pandas Series, which can be plotted directly.

**Solution 2.3:**

In [None]:
# Solution
plt.bar(x=avg_bill_by_day.index, height=avg_bill_by_day.values, color="skyblue", edgecolor="black")
plt.grid(axis="y", linestyle="--")
plt.title("Average Total Bill by Day")
plt.xlabel("Day")
plt.ylabel("Average total bill ($)")
plt.show()

---

## Part 3: Handling the Index (`as_index=False` and `reset_index`)

By default, `groupby` makes the grouping columns the index of the resulting DataFrame (a `MultiIndex` when you group by more than one column). Often, you want to keep them as regular columns. You can do this with `as_index=False` or by calling `.reset_index()` on the result.
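
The two options can be compared side by side. This is a small sketch on an invented `toy` DataFrame, not the tips data.

```python
import pandas as pd

# Toy data invented for this sketch (not the tips dataset).
toy = pd.DataFrame(
    {
        "time": ["Lunch", "Dinner", "Lunch", "Dinner"],
        "tip": [1.0, 2.0, 3.0, 4.0],
    }
)

# Default behaviour: 'time' becomes the index of the result.
indexed = toy.groupby("time")["tip"].mean()
print(indexed)

# Option 1: ask groupby to keep 'time' as a regular column.
flat_a = toy.groupby("time", as_index=False)["tip"].mean()
print(flat_a)

# Option 2: aggregate as usual, then move the index back into a column.
flat_b = toy.groupby("time")["tip"].mean().reset_index()
print(flat_b)
```

Both flat versions contain a `time` column and a `tip` column with a plain integer index, which is usually the more convenient shape for merging or plotting.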

**Exercise 3.1:** Repeat the aggregation from Exercise 2.2, but this time, keep 'day' and 'time' as regular columns in the final DataFrame by using `as_index=False`.

In [None]:
# Your code here

**Solution 3.1:**

In [None]:
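# Solution
# Cast the categorical grouping columns to plain strings so that only the
# day/time combinations present in the data appear in the result.
# Note: this modifies df in place, so later cells see string columns too.
df["day"] = df["day"].astype(str)
df["time"] = df["time"].astype(str)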
flat_agg = df.groupby(["day", "time"], as_index=False).agg(
    avg_bill=("total_bill", "mean"),
    std_bill=("total_bill", "std"),
    min_tip=("tip", "min"),
    max_tip=("tip", "max"),
)

print(flat_agg)

---

## Part 4: Beyond Aggregation - `transform` and `filter`

Grouping isn't just for summarizing. You can also use it to create new features or filter your data.

### `transform`
`transform` returns a result with the same index as the original DataFrame (one value per original row rather than one per group), making it perfect for creating new columns.
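
The contrast with an ordinary aggregation is easiest to see on a tiny example. The `toy` DataFrame and the column names `max_tip_for_day` and `share_of_day_max` below are invented for this sketch.

```python
import pandas as pd

# Toy data invented for this sketch (not the tips dataset).
toy = pd.DataFrame(
    {
        "day": ["Sat", "Sun", "Sat", "Sun"],
        "tip": [1.0, 2.0, 3.0, 6.0],
    }
)

# An aggregation collapses each group to a single row.
per_group = toy.groupby("day")["tip"].agg("max")
print(per_group)

# transform broadcasts the group value back onto every original row,
# so the result lines up with toy's index and can be assigned directly.
toy["max_tip_for_day"] = toy.groupby("day")["tip"].transform("max")

# A typical follow-up: compare each row with its own group's value.
toy["share_of_day_max"] = toy["tip"] / toy["max_tip_for_day"]
print(toy)
```

Because the transformed Series has one entry per original row, no merge is needed to attach group-level statistics to row-level data.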

**Exercise 4.1:** Create a new column in the original DataFrame called `avg_bill_for_day` that contains the average `total_bill` for the day that each row's transaction occurred on. 

In [None]:
# Your code here

**Solution 4.1:**

In [None]:
# Solution
df["avg_bill_for_day"] = df.groupby("day")["total_bill"].transform("mean")
print(df.head())

### `filter`
`filter` allows you to keep or drop entire groups based on a condition that is evaluated once per group.
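
Here is a minimal sketch of a group-level condition, using an invented `toy` DataFrame and an arbitrary threshold of 2.5 chosen only for illustration.

```python
import pandas as pd

# Toy data invented for this sketch (not the tips dataset).
toy = pd.DataFrame(
    {
        "day": ["Fri", "Sat", "Sat", "Sun", "Sun", "Sun"],
        "tip": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# The function receives each group's sub-DataFrame and returns one boolean.
# Groups for which it returns True are kept in full; the rest are dropped.
kept = toy.groupby("day").filter(lambda g: g["tip"].mean() >= 2.5)
print(kept)
```

Note that the result contains the original rows (with their original index labels), not one row per group: 'Fri' disappears entirely, while every 'Sat' and 'Sun' row survives.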

**Exercise 4.2:** Filter the DataFrame to only include data for days where the *total number of transactions* (i.e., the count of rows for that day) was greater than 50.

In [None]:
# Your code here

**Solution 4.2:**

In [None]:
# Solution
# The lambda function is applied to each group (each day's sub-DataFrame)
# It must return True or False. If True, the group is kept.
filtered_df = df.groupby("day").filter(lambda x: len(x) > 50)

print("Original number of rows:", len(df))
print("Number of rows after filtering:", len(filtered_df))

print("\nRemaining days in the filtered data:")
print(filtered_df["day"].value_counts())

---

### Incredible work!

You've gone far beyond basic `groupby` and explored some of the most practical and powerful data manipulation techniques in Pandas. The ability to `agg`, `transform`, and `filter` will form the backbone of almost any analysis you perform. Tomorrow, we'll look at how to combine different datasets.