# Lesson 3: Data Aggregation

---

# Lesson Introduction

In this lesson, we'll explore **data aggregation**, a powerful tool in data analysis. Aggregation helps you summarize and simplify large sets of data to gain insights quickly. By the end of this lesson, you'll know how to use data aggregation techniques to find specific information about groups in your dataset.

---

## Introduction to Data Aggregation

Data aggregation involves combining, summarizing, or consolidating data points into a single representation. Imagine you have a large set of student test scores. Instead of looking at every individual score, you might want to know the average score for each class. This simplifies your data and helps you see the bigger picture.

### Common Functions in Data Aggregation:
- **Maximum (`max`)**: Finds the highest value in a group.
- **Mean (`mean`)**: Calculates the average value of a group.
- **Sum (`sum`)**: Adds up all the values in a group.
- **Standard Deviation (`std`)**: Measures how spread out the values in a group are around their mean.

### Example:
```python
scores = [89, 76, 92, 54, 88]
print("Maximum score:", max(scores))  # Maximum score: 92
print("Average score:", sum(scores) / len(scores))  # Average score: 79.8
```
Here, `max` gives us the highest score, and calculating the average (mean) summarizes the test scores.
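The other two functions from the list can be sketched the same way. This is a minimal illustration using only the standard library; it assumes the sample standard deviation is the variant wanted, which is what `statistics.stdev` computes. The names `total` and `spread` are introduced here for illustration.

```python
import statistics

scores = [89, 76, 92, 54, 88]

# Total of all scores
total = sum(scores)

# Sample standard deviation: how far scores typically sit from the mean
spread = statistics.stdev(scores)

print("Total score:", total)  # Total score: 399
print("Standard deviation:", round(spread, 2))
```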

---

## Using Aggregation in Pandas

Pandas makes data aggregation simple with the `groupby` and `agg` methods.

- **`groupby`**: Splits the data into groups based on some criteria.
- **`agg`**: Applies one or more aggregation functions to these groups.
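To make the split-then-apply idea concrete, here is a small sketch based on the class test score scenario mentioned earlier. The data and the column names `class_name` and `score` are made up for illustration.

```python
import pandas as pd

# Hypothetical test scores for two classes (illustrative data only)
scores_df = pd.DataFrame({
    'class_name': ['Class 1', 'Class 1', 'Class 2', 'Class 2'],
    'score': [80, 90, 70, 60]
})

# groupby splits the rows into one group per class
for name, group in scores_df.groupby('class_name'):
    print(name, group['score'].tolist())

# agg then reduces each group to a single value per column
class_means = scores_df.groupby('class_name').agg({'score': 'mean'})
print(class_means)
```

Each class ends up as one row in `class_means`, holding the average of that class's scores.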

---

### Defining the Dataset

Let’s start with a sample dataset containing information about products sold in various stores:

```python
import pandas as pd

data = {
    'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C'],
    'product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples'],
    'units_sold': [30, 50, 40, 35, 90],
    'price': [1.20, 0.50, 1.00, 0.50, 1.30]
}

df = pd.DataFrame(data)
print(df)
```

Output:
```
     store  product  units_sold  price
0  Store A   Apples          30   1.20
1  Store A  Bananas          50   0.50
2  Store B   Apples          40   1.00
3  Store B  Bananas          35   0.50
4  Store C   Apples          90   1.30
```

---

### Step-by-Step Code Walkthrough

Let’s find the **maximum units sold** and the **average price** of products by store.

#### 1. Create the Aggregation Dictionary
This maps column names to their aggregation functions:
```python
agg_funcs = {'units_sold': 'max', 'price': 'mean'}
```

#### 2. Group Data by Store and Apply the Aggregation Functions
```python
result = df.groupby('store').agg(agg_funcs)
print(result)
```

#### Output:
```
         units_sold  price
store                      
Store A          50   0.85
Store B          40   0.75
Store C          90   1.30
```

---

### Explanation:

- `groupby('store')`: Groups rows by the `store` column.
- `agg(agg_funcs)`: Applies the specified functions:
  - **`max`** for `units_sold`.
  - **`mean`** for `price`.

For example:
- In **Store A**, the most units sold for any product was 50, and the average product price was $0.85.
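One way to check the Store A row by hand is to filter the DataFrame to that store and apply the same functions directly. The sketch below rebuilds the lesson's dataset so it runs on its own; the names `store_a`, `max_units`, and `avg_price` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C'],
    'product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples'],
    'units_sold': [30, 50, 40, 35, 90],
    'price': [1.20, 0.50, 1.00, 0.50, 1.30]
})

# Keep only the rows for Store A
store_a = df[df['store'] == 'Store A']

# Apply the same functions that agg applies to this group
max_units = store_a['units_sold'].max()
avg_price = store_a['price'].mean()

print(max_units)            # 50
print(round(avg_price, 2))  # 0.85
```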

---

## Lesson Summary

In this lesson, we've covered:
1. What **data aggregation** is and why it's useful.
2. Common aggregation functions like **max** and **mean**.
3. How to use Python's **Pandas library** to aggregate data effectively.

With these concepts, you can group a dataset with `groupby` and summarize each group with `agg`.

---

## Practice Time! 🎉

Now it's your turn to apply these techniques. Try practicing data aggregation on different datasets to solidify your understanding and skills. Good luck! 🚀

---

## Analyzing Store Performance Using Data Aggregation

Let's start! Run the provided code to calculate the total revenue per store and the maximum units sold per store using aggregation functions. Understanding which store performs best in terms of revenue can help businesses make strategic decisions regarding inventory and pricing.

```py
import pandas as pd

data = {
    'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C'],
    'product': ['Oranges', 'Mangoes', 'Oranges', 'Mangoes', 'Oranges'],
    'units_sold': [45, 60, 55, 45, 100],
    'price': [1.10, 1.50, 1.30, 1.60, 1.40]
}

df = pd.DataFrame(data)

# Calculate revenue
df['revenue'] = df['units_sold'] * df['price']

# Aggregate data to find the maximum units sold and total revenue by store
agg_funcs = {'units_sold': 'max', 'revenue': 'sum'}

result = df.groupby('store').agg(agg_funcs)
print(result)
```

Let's run through this example! The provided code calculates the **maximum units sold** and the **total revenue per store** by using aggregation functions. Here's a breakdown of what happens:

### Code Explanation
1. **Create a DataFrame**:
   - A dataset of stores, products, units sold, and prices is converted into a Pandas DataFrame.
2. **Calculate Revenue**:
   - A new column `revenue` is created by multiplying `units_sold` and `price`.
3. **Aggregate Data**:
   - The data is grouped by the `store` column, and two aggregation functions are applied:
     - `max` for `units_sold` to find the maximum number of units sold per store.
     - `sum` for `revenue` to calculate the total revenue per store.

### Expected Output
If you run the code, the output will look like this:

```
         units_sold  revenue
store                        
Store A          60    139.5
Store B          55    143.5
Store C         100    140.0
```

### Insights:
- **Store B** generates the highest total revenue: **$143.50**.
- **Store C** has the highest units sold for a single product: **100 units**.

This information can help businesses identify:
- Which store is leading in revenue generation.
- Which store has the highest sales potential based on units sold.
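The leading store can also be picked out programmatically rather than by reading the table. This sketch assumes the same dataset as the exercise and uses `idxmax`, which returns the index label of the largest value. The names `top_revenue_store` and `top_units_store` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C'],
    'product': ['Oranges', 'Mangoes', 'Oranges', 'Mangoes', 'Oranges'],
    'units_sold': [45, 60, 55, 45, 100],
    'price': [1.10, 1.50, 1.30, 1.60, 1.40]
})
df['revenue'] = df['units_sold'] * df['price']

result = df.groupby('store').agg({'units_sold': 'max', 'revenue': 'sum'})

# idxmax returns the index label (here, the store name) of the largest value
top_revenue_store = result['revenue'].idxmax()
top_units_store = result['units_sold'].idxmax()

print(top_revenue_store)  # Store B
print(top_units_store)    # Store C
```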


## Change Aggregation Function to Max

Great job, Space Explorer! Let's take it a step further.

Modify the code: first, group by the `product` column instead of `store`. Then change the aggregation function for the `price` column from `mean` to `max`. The result shows the highest price recorded for each product type.

You've got this!

```py
import pandas as pd

# Define the dataset about retail stores and product sales
data = {
    'store': ['Store A', 'Store A', 'Store B', 'Store C', 'Store C'],
    'product': ['Chips', 'Soda', 'Chips', 'Chips', 'Soda'],
    'units_sold': [20, 30, 25, 45, 60],
    'price': [1.50, 1.00, 1.75, 1.50, 1.25]
}

df = pd.DataFrame(data)

# Group by store and calculate sum of units sold and average price
agg_funcs = {'units_sold': 'sum', 'price': 'mean'}
result = df.groupby('store').agg(agg_funcs)

print(result)

```

Here’s the updated code where we group by the **`product`** column instead of **`store`** and change the aggregation for the **`price`** column to find its maximum value:

```py
import pandas as pd

# Define the dataset about retail stores and product sales
data = {
    'store': ['Store A', 'Store A', 'Store B', 'Store C', 'Store C'],
    'product': ['Chips', 'Soda', 'Chips', 'Chips', 'Soda'],
    'units_sold': [20, 30, 25, 45, 60],
    'price': [1.50, 1.00, 1.75, 1.50, 1.25]
}

df = pd.DataFrame(data)

# Group by product and calculate the total units sold and maximum price
agg_funcs = {'units_sold': 'sum', 'price': 'max'}
result = df.groupby('product').agg(agg_funcs)

print(result)
```

### Expected Output:
When you run this code, the result will be:

```
         units_sold  price
product                    
Chips           90   1.75
Soda            90   1.25
```

### Explanation of Changes:
1. **Grouping**:
   - The dataset is now grouped by the **`product`** column, meaning each row in the output represents a unique product.
2. **Aggregation Functions**:
   - **`sum`** for `units_sold`: Adds up the total number of units sold for each product.
   - **`max`** for `price`: Finds the highest price recorded for each product.

### Insights:
- **Chips** sold a total of 90 units, and the highest price was **$1.75**.
- **Soda** also sold 90 units, but its highest price was **$1.25**.

This approach identifies the highest price point for each product type.
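If the goal is the full price range rather than only the top price, one option is to select the column and pass a list of functions to `agg`. This is a minimal sketch on the same snack dataset; the variable name `price_range` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['Store A', 'Store A', 'Store B', 'Store C', 'Store C'],
    'product': ['Chips', 'Soda', 'Chips', 'Chips', 'Soda'],
    'units_sold': [20, 30, 25, 45, 60],
    'price': [1.50, 1.00, 1.75, 1.50, 1.25]
})

# Passing a list applies every function to the selected column,
# producing one output column per function
price_range = df.groupby('product')['price'].agg(['min', 'max'])
print(price_range)
```

The output has one row per product, with a `min` column and a `max` column.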

## Aggregating Units Sold Per Store

Fill in the missing parts to group data by store and apply the Standard Deviation aggregation function to the units_sold column. This will help you see the spread of units_sold in each store.

May the stars guide you!
```py
import pandas as pd

data = {
    'store': ['Store A', 'Store A', 'Store A', 'Store A', 'Store B', 'Store B', 'Store B', 'Store B', 'Store C', 'Store C', 'Store C', 'Store C', 'Store C'],
    'product': ['Apples', 'Bananas', 'Oranges', 'Grapes', 'Apples', 'Bananas', 'Oranges', 'Grapes', 'Apples', 'Bananas', 'Oranges', 'Grapes', 'Cherries'],
    'units_sold': [30, 50, 45, 55, 40, 35, 50, 32, 90, 85, 80, 78, 95],
    'price': [1.20, 0.50, 0.30, 1.50, 1.00, 0.50, 0.70, 1.00, 1.30, 1.25, 1.10, 1.15, 2.00]
}

df = pd.DataFrame(data)

# TODO: Add 'std' function for the `units_sold` column to the `agg_funcs` dictionary
agg_funcs = {}

# TODO: Group data by store and apply the standard deviation aggregation function. Print the result
```

Here’s the completed code to group data by **`store`** and calculate the standard deviation for the **`units_sold`** column:

```py
import pandas as pd

data = {
    'store': ['Store A', 'Store A', 'Store A', 'Store A', 'Store B', 'Store B', 'Store B', 'Store B', 'Store C', 'Store C', 'Store C', 'Store C', 'Store C'],
    'product': ['Apples', 'Bananas', 'Oranges', 'Grapes', 'Apples', 'Bananas', 'Oranges', 'Grapes', 'Apples', 'Bananas', 'Oranges', 'Grapes', 'Cherries'],
    'units_sold': [30, 50, 45, 55, 40, 35, 50, 32, 90, 85, 80, 78, 95],
    'price': [1.20, 0.50, 0.30, 1.50, 1.00, 0.50, 0.70, 1.00, 1.30, 1.25, 1.10, 1.15, 2.00]
}

df = pd.DataFrame(data)

# Define the aggregation functions
agg_funcs = {'units_sold': 'std'}

# Group data by store and apply the standard deviation function
result = df.groupby('store').agg(agg_funcs)

print(result)
```

### Expected Output:
When you run the code, the output will show the standard deviation of `units_sold` for each store:

```
         units_sold
store              
Store A   10.801234
Store B    7.889867
Store C    7.021396
```

### Explanation:
1. **Aggregation Function**:
   - **`std`** calculates the standard deviation for the **`units_sold`** column. It measures the spread or variability of sales within each store.

2. **Grouping**:
   - The data is grouped by the **`store`** column, so the result shows one row per store.

3. **Result Analysis**:
   - A higher standard deviation (e.g., Store A: 10.80) indicates more variability in the number of units sold for its products.
   - Lower values (e.g., Store C: 7.02) suggest more consistent sales across products.
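By default, pandas computes the sample standard deviation, which divides by n - 1. The sketch below cross-checks the grouped result for Store A against `statistics.stdev` from the standard library, which uses the same definition. It keeps only the two columns needed, and `store_a_units` and `manual_std` are illustrative names.

```python
import statistics
import pandas as pd

df = pd.DataFrame({
    'store': ['Store A'] * 4 + ['Store B'] * 4 + ['Store C'] * 5,
    'units_sold': [30, 50, 45, 55, 40, 35, 50, 32, 90, 85, 80, 78, 95]
})

# Standard deviation of units_sold per store (sample std by default)
result = df.groupby('store').agg({'units_sold': 'std'})

# Cross-check Store A with the standard library's sample standard deviation
store_a_units = [30, 50, 45, 55]
manual_std = statistics.stdev(store_a_units)

print(round(result.loc['Store A', 'units_sold'], 4))
print(round(manual_std, 4))
```

Both printed values should match, which confirms that the grouped `std` is computed per store on that store's rows only.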


## Aggregate Products by Store

Hey there, Space Voyager! Ready for your next mission? Your task is to aggregate data by store to find the minimum units sold. Complete the missing pieces in the code to accomplish this.

Let's see you shoot for the stars!

```py
import pandas as pd

# Creating the DataFrame
data = {
    'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C'],
    'product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples'],
    'units_sold': [30, 50, 40, 35, 90],
    'price': [1.20, 0.50, 1.00, 0.50, 1.30]
}

# TODO: Convert the data dictionary into a DataFrame
df = ____

# TODO: Group data by 'store' and get the minimum 'units_sold'
result = ____

print(result)
```

Here’s the completed code to find the minimum units sold per store:

```py
import pandas as pd

# Creating the DataFrame
data = {
    'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C'],
    'product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples'],
    'units_sold': [30, 50, 40, 35, 90],
    'price': [1.20, 0.50, 1.00, 0.50, 1.30]
}

# Convert the data dictionary into a DataFrame
df = pd.DataFrame(data)

# Group data by 'store' and get the minimum 'units_sold'
result = df.groupby('store')['units_sold'].min()

print(result)
```

### Expected Output:
When you run the code, the output will show the minimum `units_sold` for each store:

```
store
Store A    30
Store B    35
Store C    90
Name: units_sold, dtype: int64
```

### Explanation:
1. **DataFrame Creation**:
   - The data dictionary is converted into a Pandas DataFrame using `pd.DataFrame(data)`.

2. **Grouping and Aggregating**:
   - `df.groupby('store')` groups the data by the `store` column.
   - `['units_sold'].min()` calculates the minimum value of the `units_sold` column for each group.

3. **Result**:
   - The result provides the smallest number of units sold in each store.
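Selecting the column and calling `.min()` returns a Series, while the dictionary form of `agg` used earlier in the lesson returns a DataFrame. Here is a short sketch comparing the two on the same data; the names `min_series` and `min_frame` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C'],
    'product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples'],
    'units_sold': [30, 50, 40, 35, 90],
    'price': [1.20, 0.50, 1.00, 0.50, 1.30]
})

# Column selection + method call: the result is a Series
min_series = df.groupby('store')['units_sold'].min()

# Dictionary passed to agg: the result is a DataFrame with one column
min_frame = df.groupby('store').agg({'units_sold': 'min'})

print(type(min_series).__name__)  # Series
print(type(min_frame).__name__)   # DataFrame
```

The values are the same in both cases; the choice mostly depends on whether later steps expect a Series or a DataFrame.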

