### **`groupby()` and Aggregation**
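`groupby()` splits a DataFrame into groups by one or more keys, applies a function to each group, and combines the results (the split-apply-combine pattern). It is the main Pandas tool for summarizing, transforming, and filtering data by category.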

#### **1. Basic `groupby()` Syntax**

```python
# Basic grouping by one column
grouped = df.groupby('Category')

# Grouping by multiple columns
grouped = df.groupby(['Region', 'Year'])
```

- Returns a `DataFrameGroupBy` object.
- No values are computed until an aggregation, transformation, or filter is applied to that object (lazy evaluation).
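
As a minimal sketch of the two points above, the snippet below builds a small DataFrame (the `Category` and `Sales` values are hypothetical, made up for illustration) and inspects the grouped object before any aggregation runs.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C'],
    'Sales': [100, 200, 150, 50, 300],
})

grouped = df.groupby('Category')

print(type(grouped).__name__)  # DataFrameGroupBy
print(grouped.ngroups)         # 3
print(grouped.groups)          # mapping of group key -> row labels

# Only this call produces per-group values
totals = grouped['Sales'].sum()
print(totals)
```

Inspecting `ngroups` and `groups` is a cheap way to check that the keys split the data as expected before running a more expensive aggregation.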

#### **2. Common Aggregation Methods**

| Method         | Description                                     |
|----------------|-------------------------------------------------|
| `sum()`        | Sum of values                                  |
| `mean()`       | Average of values                              |
| `median()`     | Median of values                               |
| `min()`        | Minimum value                                  |
| `max()`        | Maximum value                                  |
| `count()`      | Number of non-null observations                |
| `size()`       | Size of each group (including nulls)           |
| `std()`        | Standard deviation                             |
| `var()`        | Variance                                       |

```python
# Example: sum of Sales by Region
df.groupby('Region')['Sales'].sum()

# Multiple aggregations
df.groupby('Region')['Sales'].agg(['sum', 'mean', 'count'])
``` 

#### **3. Using `.agg()` for Custom Aggregations**

```python
# Dictionary mapping columns to functions
df.groupby('Region').agg({
    'Sales': 'sum',
    'Profit': 'mean',
    'Quantity': 'max'
})

# Lambda functions or named functions
df.groupby('Region')['Sales'].agg(lambda x: x.max() - x.min())
``` 

#### **4. Applying Multiple Aggregations to Multiple Columns**

```python
agg_funcs = {
    'Sales': ['sum', 'mean'],
    'Profit': ['mean', 'std'],
    'Quantity': 'count'
}
df.groupby(['Region', 'Category']).agg(agg_funcs)
``` 

- The result has **MultiIndex** columns: the outer level is the source column and the inner level is the aggregation function.
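
MultiIndex columns can be awkward to select from or export. The sketch below, on hypothetical data, shows how to select one column with a tuple, one way to join the two levels into flat names, and named aggregation (`new_name=(column, function)`, available in pandas 0.25 and later), which avoids the MultiIndex in the first place. The output names such as `sales_sum` are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Category': ['A', 'A', 'A', 'B'],
    'Sales': [100.0, 200.0, 300.0, 400.0],
    'Profit': [10.0, 30.0, 50.0, 70.0],
})

multi = df.groupby(['Region', 'Category']).agg({
    'Sales': ['sum', 'mean'],
    'Profit': ['mean'],
})
print(multi.columns.tolist())
# [('Sales', 'sum'), ('Sales', 'mean'), ('Profit', 'mean')]

# Select a single column of the hierarchy with a tuple
sales_sum = multi[('Sales', 'sum')]

# Option 1: join the two levels into flat column names
flat = multi.copy()
flat.columns = ['_'.join(col) for col in flat.columns]

# Option 2: named aggregation gives flat, explicitly named columns
named = df.groupby(['Region', 'Category']).agg(
    sales_sum=('Sales', 'sum'),
    sales_mean=('Sales', 'mean'),
    profit_mean=('Profit', 'mean'),
)
print(named)
```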


#### **5. Transformations with `.transform()`**

- Returns an object with the **same shape** as the original (one value per input row).
- Useful for adding group-level metrics back to the DataFrame.

```python
# Z-score of Sales within each Region
df['Sales_zscore'] = df.groupby('Region')['Sales'].transform(
    lambda x: (x - x.mean()) / x.std()
)
``` 
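
To make the same-shape behavior concrete, the following sketch compares an aggregation with a transformation on hypothetical data and uses the transformed values to compute each row's share of its region total. The column names `Region_total` and `Share_of_region` are chosen for this example only.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'West'],
    'Sales': [100.0, 300.0, 50.0, 150.0, 100.0],
})

# Aggregation: one value per group
per_group = df.groupby('Region')['Sales'].sum()

# Transformation: one value per original row, aligned to df's index
per_row = df.groupby('Region')['Sales'].transform('sum')

print(len(per_group), len(per_row))  # 2 5

# Because per_row aligns with df, it can be assigned directly
df['Region_total'] = per_row
df['Share_of_region'] = df['Sales'] / df['Region_total']
print(df)
```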

#### **6. Filtering Groups with `.filter()`**

- Keeps all rows of every group that satisfies a condition and drops the other groups entirely.

```python
# Keep regions where total Sales > 1e6
df_filtered = df.groupby('Region').filter(
    lambda x: x['Sales'].sum() > 1e6
)
``` 
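
A small sketch of the same idea on hypothetical data, with a lower threshold so the effect is easy to check by hand: groups whose total passes the test keep all of their rows, and the remaining groups disappear.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'North'],
    'Sales': [600, 500, 100, 200, 2000],
})

# Group totals: East = 1100, West = 300, North = 2000
kept = df.groupby('Region').filter(lambda g: g['Sales'].sum() > 1000)

# Rows keep their original order and index labels
print(kept)
```

Note that the result is a plain DataFrame of original rows, not one row per group.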

#### **7. Looping Over Groups**

```python
for name, group in df.groupby('Category'):
    print(f"Group: {name}")
    print(group.head())
``` 

- `name`: group key or tuple of keys
- `group`: subset DataFrame for that group
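
When grouping by several columns, `name` is a tuple with one entry per key. The sketch below, on hypothetical data, collects the keys while looping and then fetches a single group directly with `get_group()`.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Region': ['East', 'East', 'West'],
    'Year': [2023, 2024, 2023],
    'Sales': [100, 200, 300],
})

keys = []
for name, group in df.groupby(['Region', 'Year']):
    keys.append(name)          # name is a (Region, Year) tuple
    print(name, len(group))

# Fetch one group without looping
east_2024 = df.groupby(['Region', 'Year']).get_group(('East', 2024))
print(east_2024)
```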

#### **8. `as_index` Parameter**

- Default `as_index=True` makes group keys the index of the result.
- Use `as_index=False` to keep keys as columns.

```python
# Keys remain as columns
df.groupby('Region', as_index=False)['Sales'].sum()
``` 
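
As a quick check on hypothetical data, the sketch below shows that `as_index=False` and a later `.reset_index()` lead to the same flat layout, with the group key as an ordinary column.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Region': ['East', 'East', 'West'],
    'Sales': [100, 200, 300],
})

# Default: Region becomes the index of the result
indexed = df.groupby('Region')['Sales'].sum()

# Two ways to keep Region as a regular column
flat_a = df.groupby('Region', as_index=False)['Sales'].sum()
flat_b = indexed.reset_index()

print(flat_a)
print(flat_b)
```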

#### **9. `.size()` vs `.count()`**

- `.size()`: counts **all** rows in each group (including NaNs).
- `.count()`: counts **non-null** values for each column.

```python
df.groupby('Region').size()
# vs
df.groupby('Region')['Sales'].count()
``` 
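
The difference only shows up when a group contains missing values. A minimal sketch with hypothetical data that includes `NaN` entries:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing Sales values
df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'West'],
    'Sales': [100.0, np.nan, 50.0, np.nan, np.nan],
})

sizes = df.groupby('Region').size()             # rows per group
counts = df.groupby('Region')['Sales'].count()  # non-null Sales per group

print(sizes.to_dict())   # {'East': 2, 'West': 3}
print(counts.to_dict())  # {'East': 1, 'West': 1}
```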

#### **Tips**

- Use `.agg()` for **flexible** summaries.
- Use `.transform()` to **annotate** the original DataFrame.
- Use `.filter()` to **prune** groups.
- Chain `.reset_index()` after an aggregation (`.groupby().agg().reset_index()`) to turn the group keys back into regular columns.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv(r"D:\New Desktop\LEARNINGS\Pandas Source Files\Flavors.csv")
df

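|   | Flavor                 | Base Flavor | Liked | Flavor Rating | Texture Rating | Total Rating |
|---|------------------------|-------------|-------|---------------|----------------|--------------|
| 0 | Mint Chocolate Chip    | Vanilla     | Yes   | 10.0          | 8.0            | 18.0         |
| 1 | Chocolate              | Chocolate   | Yes   | 8.8           | 7.6            | 16.6         |
| 2 | Vanilla                | Vanilla     | No    | 4.7           | 5.0            | 9.7          |
| 3 | Cookie Dough           | Vanilla     | Yes   | 6.9           | 6.5            | 13.4         |
| 4 | Rocky Road             | Chocolate   | Yes   | 8.2           | 7.0            | 15.2         |
| 5 | Pistachio              | Vanilla     | No    | 2.3           | 3.4            | 5.7          |
| 6 | Cake Batter            | Vanilla     | Yes   | 6.5           | 6.0            | 12.5         |
| 7 | Neapolitan             | Vanilla     | No    | 3.8           | 5.0            | 8.8          |
| 8 | Chocolte Fudge Brownie | Chocolate   | Yes   | 8.2           | 7.1            | 15.3         |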
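
The CSV path above is specific to the author's machine. As a way to follow along without the file, the sketch below rebuilds the same nine rows inline from the output shown above. It assumes a default integer index and that the displayed rows are the full contents of the file; the values, including the spelling of each flavor, are copied as displayed.

```python
import pandas as pd

# Rows copied from the displayed output; assumed to be the whole dataset
df = pd.DataFrame({
    'Flavor': [
        'Mint Chocolate Chip', 'Chocolate', 'Vanilla', 'Cookie Dough',
        'Rocky Road', 'Pistachio', 'Cake Batter', 'Neapolitan',
        'Chocolte Fudge Brownie',
    ],
    'Base Flavor': [
        'Vanilla', 'Chocolate', 'Vanilla', 'Vanilla', 'Chocolate',
        'Vanilla', 'Vanilla', 'Vanilla', 'Chocolate',
    ],
    'Liked': ['Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'Flavor Rating': [10.0, 8.8, 4.7, 6.9, 8.2, 2.3, 6.5, 3.8, 8.2],
    'Texture Rating': [8.0, 7.6, 5.0, 6.5, 7.0, 3.4, 6.0, 5.0, 7.1],
    'Total Rating': [18.0, 16.6, 9.7, 13.4, 15.2, 5.7, 12.5, 8.8, 15.3],
})

print(df.shape)  # (9, 6)
```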


In [19]:
df.groupby('Base Flavor')[['Flavor Rating', 'Texture Rating', 'Total Rating']].mean()


| Base Flavor | Flavor Rating | Texture Rating | Total Rating |
|-------------|---------------|----------------|--------------|
| Chocolate   | 8.4           | 7.233333       | 15.7         |
| Vanilla     | 5.7           | 5.65           | 11.35        |


In [17]:
df.groupby('Base Flavor')[['Flavor']].count()

| Base Flavor | Flavor |
|-------------|--------|
| Chocolate   | 3      |
| Vanilla     | 6      |


In [20]:
df.groupby('Base Flavor').min()

| Base Flavor | Flavor      | Liked | Flavor Rating | Texture Rating | Total Rating |
|-------------|-------------|-------|---------------|----------------|--------------|
| Chocolate   | Chocolate   | Yes   | 8.2           | 7.0            | 15.2         |
| Vanilla     | Cake Batter | No    | 2.3           | 3.4            | 5.7          |


In [21]:
df.groupby('Base Flavor').max()

| Base Flavor | Flavor     | Liked | Flavor Rating | Texture Rating | Total Rating |
|-------------|------------|-------|---------------|----------------|--------------|
| Chocolate   | Rocky Road | Yes   | 8.8           | 7.6            | 16.6         |
| Vanilla     | Vanilla    | Yes   | 10.0          | 8.0            | 18.0         |


In [23]:
df.groupby('Base Flavor')[['Flavor Rating', 'Texture Rating', 'Total Rating']].sum()

| Base Flavor | Flavor Rating | Texture Rating | Total Rating |
|-------------|---------------|----------------|--------------|
| Chocolate   | 25.2          | 21.7           | 47.1         |
| Vanilla     | 34.2          | 33.9           | 68.1         |


In [29]:
# Using Aggregate Function:
df.groupby('Base Flavor').agg({
    'Flavor Rating' : ['count', 'mean', 'median', 'min', 'max'],
    'Texture Rating' : ['mean', 'median', 'min', 'max']
})

| Base Flavor | Flavor Rating (count) | Flavor Rating (mean) | Flavor Rating (median) | Flavor Rating (min) | Flavor Rating (max) | Texture Rating (mean) | Texture Rating (median) | Texture Rating (min) | Texture Rating (max) |
|-------------|-----------------------|----------------------|------------------------|---------------------|---------------------|-----------------------|-------------------------|----------------------|----------------------|
| Chocolate   | 3                     | 8.4                  | 8.2                    | 8.2                 | 8.8                 | 7.233333              | 7.1                     | 7.0                  | 7.6                  |
| Vanilla     | 6                     | 5.7                  | 5.6                    | 2.3                 | 10.0                | 5.65                  | 5.5                     | 3.4                  | 8.0                  |


In [34]:
df.groupby(['Base Flavor', 'Liked']).agg({
    'Flavor' : 'count',
    'Flavor Rating' : ['mean','min','max'],
    'Texture Rating' : ['mean','min','max'],
    'Total Rating' : ['mean','min','max']
})

| Base Flavor | Liked | Flavor (count) | Flavor Rating (mean) | Flavor Rating (min) | Flavor Rating (max) | Texture Rating (mean) | Texture Rating (min) | Texture Rating (max) | Total Rating (mean) | Total Rating (min) | Total Rating (max) |
|-------------|-------|----------------|----------------------|---------------------|---------------------|-----------------------|----------------------|----------------------|---------------------|--------------------|--------------------|
| Chocolate   | Yes   | 3              | 8.4                  | 8.2                 | 8.8                 | 7.233333              | 7.0                  | 7.6                  | 15.7                | 15.2               | 16.6               |
| Vanilla     | No    | 3              | 3.6                  | 2.3                 | 4.7                 | 4.466667              | 3.4                  | 5.0                  | 8.066667            | 5.7                | 9.7                |
| Vanilla     | Yes   | 3              | 7.8                  | 6.5                 | 10.0                | 6.833333              | 6.0                  | 8.0                  | 14.633333           | 12.5               | 18.0               |


#### **`.describe()`**
`.describe()` generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding `NaN` values.

##### **What `.describe()` Returns (For Numeric Columns):**

| Statistic | Description                          |
|-----------|--------------------------------------|
| `count`   | Number of non-null values            |
| `mean`    | Mean of the values                   |
| `std`     | Standard deviation                   |
| `min`     | Minimum value                        |
| `25%`     | 1st quartile (25th percentile)       |
| `50%`     | Median (50th percentile)             |
| `75%`     | 3rd quartile (75th percentile)       |
| `max`     | Maximum value                        |

For non-numeric (`object` or `string`) columns, it returns:
- `count`, `unique`, `top`, `freq`
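
A brief sketch of both cases on hypothetical data: calling `.describe()` on the DataFrame summarizes the numeric column by default, while calling it on a text column returns the four statistics listed above.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'East'],
    'Sales': [100.0, 200.0, 300.0, 400.0],
})

# Numeric summary: only the Sales column is included by default
numeric_summary = df.describe()
print(numeric_summary)

# Text summary: count, unique, top (most frequent value), freq
text_summary = df['Region'].describe()
print(text_summary)
```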

##### **Usage Examples**

##### **1. On Entire DataFrame:**
```python
df.describe()
```

##### **2. On a Specific Column:**
```python
df['Sales'].describe()
```

##### **3. Grouped Summary with `groupby()`:**
```python
df.groupby('Region')['Sales'].describe()
```

##### **4. Transposed View (for better readability):**
```python
df.groupby('Region')['Sales'].describe().T
```

##### **Tips**
- `.describe()` is an excellent first step in exploratory data analysis (EDA).
- Use `.T` to flip rows and columns when viewing grouped descriptions.
- Combine with `.round(2)` to improve readability:
```python
df.describe().round(2)
```

In [38]:
df.groupby('Base Flavor').describe().T

|                | Base Flavor | Chocolate | Vanilla  |
|----------------|-------------|-----------|----------|
| Flavor Rating  | count       | 3.0       | 6.0      |
| Flavor Rating  | mean        | 8.4       | 5.7      |
| Flavor Rating  | std         | 0.34641   | 2.710719 |
| Flavor Rating  | min         | 8.2       | 2.3      |
| Flavor Rating  | 25%         | 8.2       | 4.025    |
| Flavor Rating  | 50%         | 8.2       | 5.6      |
| Flavor Rating  | 75%         | 8.5       | 6.8      |
| Flavor Rating  | max         | 8.8       | 10.0     |
| Texture Rating | count       | 3.0       | 6.0      |
| Texture Rating | mean        | 7.233333  | 5.65     |
