### Exercise 6: Group and Sort

Given a dataset of books, group by `Genre`, and then sort each group by `Rating` in descending order. Display the top book from each genre.

**Dataset**:
```python
data = pd.DataFrame({
    'Book': ['Book A', 'Book B', 'Book C', 'Book D', 'Book E', 'Book F'],
    'Genre': ['Fiction', 'Non-Fiction', 'Fiction', 'Non-Fiction', 'Fiction', 'Fiction'],
    'Rating': [4.5, 4.2, 4.8, 4.7, 4.3, 4.9]
})
```

**What to Do**:
1. Group the data by `Genre`.
2. For each group, sort the books by `Rating` in descending order and display the top book (the one with the highest rating) for each genre.
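
One possible approach, sketched here, sorts the whole table by `Rating` first and then keeps the first row that appears for each `Genre`. The variable name `top_books` is illustrative and not part of the exercise.

```python
import pandas as pd

data = pd.DataFrame({
    'Book': ['Book A', 'Book B', 'Book C', 'Book D', 'Book E', 'Book F'],
    'Genre': ['Fiction', 'Non-Fiction', 'Fiction', 'Non-Fiction', 'Fiction', 'Fiction'],
    'Rating': [4.5, 4.2, 4.8, 4.7, 4.3, 4.9]
})

# Sort all books from highest to lowest rating, then keep the first
# (highest-rated) row encountered for each genre.
top_books = (
    data.sort_values('Rating', ascending=False)
        .groupby('Genre')
        .head(1)
        .reset_index(drop=True)
)

print(top_books)
```

With this dataset, the rows kept are Book F for Fiction and Book D for Non-Fiction.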

---

### Exercise 7: Calculate Multiple Aggregations for Each Group

Given a dataset of students with their `Subject`, `Score`, and `Age`, group by `Subject` and calculate:
1. The mean score.
2. The highest score (`max`).
3. The lowest score (`min`).
4. The standard deviation (`std`) of scores.

**Dataset**:
```python
data = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Subject': ['Math', 'Math', 'Math', 'English', 'English', 'Math', 'English'],
    'Score': [85, 90, 88, 78, 92, 95, 88],
    'Age': [20, 22, 21, 23, 22, 21, 24]
})
```

**What to Do**:
1. Group the data by `Subject`.
2. For each group, calculate the mean, max, min, and standard deviation (`std`) of the `Score`.
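
A minimal sketch of these steps, assuming a single `.agg()` call with a list of function names is acceptable for the exercise. The name `score_stats` is illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Subject': ['Math', 'Math', 'Math', 'English', 'English', 'Math', 'English'],
    'Score': [85, 90, 88, 78, 92, 95, 88],
    'Age': [20, 22, 21, 23, 22, 21, 24]
})

# Passing a list to agg() produces one output column per function.
# Note that pandas' std uses the sample standard deviation (ddof=1) by default.
score_stats = data.groupby('Subject')['Score'].agg(['mean', 'max', 'min', 'std'])

print(score_stats)
```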

---

### Bonus Challenge: Group and Apply a Transformation

Given a dataset of transactions, group by `Product`, and then **normalize** the sales within each group by subtracting the group’s mean and dividing by the group’s standard deviation. This is z-score standardization, a simple feature scaling technique.

**Dataset**:
```python
data = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5, 6],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250, 120, 350]
})
```

**What to Do**:
1. Group the data by `Product`.
2. Apply a transformation to each group to **normalize** the `Sales` column, so that each group's sales values have a mean of 0 and a standard deviation of 1.
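
The normalization could be sketched with `transform()`, which returns values aligned with the original rows. The column name `Sales_Normalized` is an assumption made for illustration, and the standard deviation used is pandas' default sample standard deviation (`ddof=1`).

```python
import pandas as pd

data = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5, 6],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250, 120, 350]
})

grouped_sales = data.groupby('Product')['Sales']

# transform() broadcasts each group's mean and std back to every row,
# so the arithmetic below is applied row by row within each product.
data['Sales_Normalized'] = (
    (data['Sales'] - grouped_sales.transform('mean')) / grouped_sales.transform('std')
)

# Sanity check: each group's normalized values should have
# mean close to 0 and standard deviation close to 1.
check = data.groupby('Product')['Sales_Normalized'].agg(['mean', 'std'])
print(data)
print(check)
```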

---

### Tips for Solving These Exercises:
- Use `groupby()` to group by one or more columns.
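- Apply aggregation methods such as `mean()`, `sum()`, and `count()` to the grouped data.
- Use `.agg()` to apply multiple functions at once.
- Use `.transform()` when the result should keep the shape of the original data, as in the Bonus Challenge.
- To filter groups, use `.filter()` or a custom function with `.apply()`.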

Good luck, and happy coding!

### Exercise 1: Group by a Single Column and Calculate Summary Statistics

Given a dataset of employees with their department, age, and salary, group the data by `Department` and calculate the following:
- The **mean** salary for each department.
- The **minimum** and **maximum** salary for each department.
- The **count** of employees in each department.

**Dataset**:
```python
import pandas as pd

# Create a simple DataFrame
data = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Sales', 'Sales', 'HR'],
    'Age': [25, 30, 35, 40, 28, 22, 29],
    'Salary': [50000, 70000, 55000, 80000, 60000, 65000, 48000]
})
```

**What to Do**:
1. Group the data by the `Department` column.
2. For each department, calculate:
   - The mean salary (`mean`).
   - The minimum and maximum salary (`min` and `max`).
   - The count of employees in each department (`count`).

---

In [33]:
import pandas as pd

# Create a simple DataFrame
data = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Sales', 'Sales', 'HR'],
    'Age': [25, 30, 35, 40, 28, 22, 29],
    'Salary': [50000, 70000, 55000, 80000, 60000, 65000, 48000]
})

data

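|   | Employee | Department | Age | Salary |
|---|----------|------------|-----|--------|
| 0 | Alice | HR | 25 | 50000 |
| 1 | Bob | Engineering | 30 | 70000 |
| 2 | Charlie | HR | 35 | 55000 |
| 3 | David | Engineering | 40 | 80000 |
| 4 | Eve | Sales | 28 | 60000 |
| 5 | Frank | Sales | 22 | 65000 |
| 6 | Grace | HR | 29 | 48000 |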


In [34]:
# The mean salary for each department.
mean_salary = data.groupby('Department')['Salary'].mean()
mean_salary

Department
Engineering    75000.0
HR             51000.0
Sales          62500.0
Name: Salary, dtype: float64

In [35]:
# The minimum and maximum salary (min and max).
min_salary = data.groupby('Department')['Salary'].min()
print(min_salary)
print("--------------------")
max_salary = data.groupby('Department')['Salary'].max()
print(max_salary)

Department
Engineering    70000
HR             48000
Sales          60000
Name: Salary, dtype: int64
--------------------
Department
Engineering    80000
HR             55000
Sales          65000
Name: Salary, dtype: int64


In [36]:
# The count of employees in each department.
employees_count = data.groupby('Department')['Employee'].count()
print(employees_count)

Department
Engineering    2
HR             3
Sales          2
Name: Employee, dtype: int64


In [37]:
# All requested statistics (mean, min, max, count) in a single agg() call.
department_stats = data.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count'])
department_stats

| Department | mean | min | max | count |
|------------|------|-----|-----|-------|
| Engineering | 75000.0 | 70000 | 80000 | 2 |
| HR | 51000.0 | 48000 | 55000 | 3 |
| Sales | 62500.0 | 60000 | 65000 | 2 |


### Exercise 2: Group by Multiple Columns

Given a dataset of sales transactions, group the data by `Product` and `Region`. Then, calculate the total sales for each group.

**Dataset**:
```python
data = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5, 6],
    'Product': ['A', 'B', 'A', 'A', 'B', 'B'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Sales': [100, 200, 150, 250, 300, 350]
})
```

**What to Do**:
1. Group the data by both `Product` and `Region`.
2. For each group, calculate the total sales (`sum`).

---

In [38]:
data = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5, 6],
    'Product': ['A', 'B', 'A', 'A', 'B', 'B'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Sales': [100, 200, 150, 250, 300, 350]
})

data

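|   | Transaction_ID | Product | Region | Sales |
|---|----------------|---------|--------|-------|
| 0 | 1 | A | East | 100 |
| 1 | 2 | B | West | 200 |
| 2 | 3 | A | East | 150 |
| 3 | 4 | A | West | 250 |
| 4 | 5 | B | East | 300 |
| 5 | 6 | B | West | 350 |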


In [39]:
total_sales = data.groupby(['Product', 'Region'])['Sales'].sum()
total_sales

Product  Region
A        East      250
         West      250
B        East      300
         West      550
Name: Sales, dtype: int64
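
The result above is a Series with a two-level index. If a flat table is preferred, one option (a sketch, not a required part of the exercise) is to pass `as_index=False` so the grouping keys stay as ordinary columns. The name `flat_sales` is illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5, 6],
    'Product': ['A', 'B', 'A', 'A', 'B', 'B'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Sales': [100, 200, 150, 250, 300, 350]
})

# as_index=False keeps 'Product' and 'Region' as regular columns
# instead of moving them into a MultiIndex.
flat_sales = data.groupby(['Product', 'Region'], as_index=False)['Sales'].sum()

print(flat_sales)
```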

### Exercise 3: Find the Median for Each Group

In a dataset of student exam scores, group by the `Class` column and calculate the **median** score for each class.

**Dataset**:
```python
data = pd.DataFrame({
    'Student': ['John', 'Sarah', 'Mike', 'Anna', 'James', 'Linda', 'Tom'],
    'Class': ['Math', 'Math', 'English', 'English', 'Math', 'English', 'Math'],
    'Score': [85, 92, 78, 88, 95, 90, 89]
})
```

**What to Do**:
1. Group the data by the `Class` column.
2. For each group, calculate the median score.

---

In [40]:
data = pd.DataFrame({
    'Student': ['John', 'Sarah', 'Mike', 'Anna', 'James', 'Linda', 'Tom'],
    'Class': ['Math', 'Math', 'English', 'English', 'Math', 'English', 'Math'],
    'Score': [85, 92, 78, 88, 95, 90, 89]
})

data

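|   | Student | Class | Score |
|---|---------|-------|-------|
| 0 | John | Math | 85 |
| 1 | Sarah | Math | 92 |
| 2 | Mike | English | 78 |
| 3 | Anna | English | 88 |
| 4 | James | Math | 95 |
| 5 | Linda | English | 90 |
| 6 | Tom | Math | 89 |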


In [41]:
median_score = data.groupby('Class')['Score'].median()
median_score

Class
English    88.0
Math       90.5
Name: Score, dtype: float64

### Exercise 4: Group and Apply a Custom Function

Given a dataset of sales transactions, group by `Product`, then create a custom function that calculates the **coefficient of variation** (standard deviation / mean) for each product's sales. Apply this function to each group.

**Dataset**:
```python
import numpy as np
data = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'Sales': [100, 150, 200, 250, 120, 180, 300, 350]
})
```

**What to Do**:
1. Group the data by the `Product` column.
2. Apply a custom function that calculates the **coefficient of variation** for sales in each group. The formula is:

   \[
       CV = \frac{\text{Standard Deviation}}{\text{Mean}}
   \]

   Use the `agg()` function to apply this.

---

In [43]:
import numpy as np

data = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'Sales': [100, 150, 200, 250, 120, 180, 300, 350]
})

data

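|   | Transaction_ID | Product | Sales |
|---|----------------|---------|-------|
| 0 | 1 | A | 100 |
| 1 | 2 | A | 150 |
| 2 | 3 | B | 200 |
| 3 | 4 | B | 250 |
| 4 | 5 | A | 120 |
| 5 | 6 | A | 180 |
| 6 | 7 | B | 300 |
| 7 | 8 | B | 350 |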


In [51]:
coefficient_of_variation = data.groupby('Product')['Sales'].agg(['std', 'mean'])
coefficient_of_variation['CV'] = coefficient_of_variation['std'] / coefficient_of_variation['mean']
coefficient_of_variation

| Product | std | mean | CV |
|---------|-----|------|----|
| A | 35.0 | 137.5 | 0.254545 |
| B | 64.549722 | 275.0 | 0.234726 |


In [52]:
# Step 1: Define the custom function to calculate coefficient of variation
def coefficient_of_variation(series):
    return series.std() / series.mean()

# Step 2: Group by 'Product' and apply the custom function
cv_per_product = data.groupby('Product')['Sales'].apply(coefficient_of_variation)

# Display the result
print(cv_per_product)

Product
A    0.254545
B    0.234726
Name: Sales, dtype: float64
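
Since the exercise text asks for `agg()`, the same custom function can also be passed to `.agg()` instead of `.apply()`. This is a small sketch of that variant. The names `cv_function` and `cv_with_agg` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'Sales': [100, 150, 200, 250, 120, 180, 300, 350]
})

def cv_function(series):
    """Coefficient of variation: sample standard deviation divided by mean."""
    return series.std() / series.mean()

# agg() calls the function once per group and returns one value per product.
cv_with_agg = data.groupby('Product')['Sales'].agg(cv_function)

print(cv_with_agg)
```

For a function that reduces each group to a single number, `.agg()` and `.apply()` give the same values here.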


### Exercise 5: Group and Filter Data

Given a dataset of employees, group by `Department` and then filter out departments with fewer than 2 employees.

**Dataset**:
```python
data = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Sales'],
    'Salary': [50000, 70000, 55000, 80000, 60000]
})
```

**What to Do**:
1. Group the data by `Department`.
2. Filter out any departments with fewer than 2 employees.

---

In [53]:
data = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Sales'],
    'Salary': [50000, 70000, 55000, 80000, 60000]
})

data

|   | Employee | Department | Salary |
|---|----------|------------|--------|
| 0 | Alice | HR | 50000 |
| 1 | Bob | Engineering | 70000 |
| 2 | Charlie | HR | 55000 |
| 3 | David | Engineering | 80000 |
| 4 | Eve | Sales | 60000 |


In [71]:
employee_counts = data.groupby('Department').count()
filtered_departments = employee_counts[employee_counts['Employee'] >= 2]
filtered_departments

| Department | Employee | Salary |
|------------|----------|--------|
| Engineering | 2 | 2 |
| HR | 2 | 2 |
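
The result above lists per-department counts rather than the employee rows themselves. If the goal is to keep the original rows for departments with at least 2 employees, one option is `groupby().filter()`, sketched here. The name `filtered_rows` is illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Sales'],
    'Salary': [50000, 70000, 55000, 80000, 60000]
})

# filter() keeps every row of a group when the function returns True
# for that group, and drops the whole group otherwise.
filtered_rows = data.groupby('Department').filter(lambda group: len(group) >= 2)

print(filtered_rows)
```

With this dataset, Sales has a single employee (Eve), so only the HR and Engineering rows remain.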
