# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#3: Data Manipulation with Pandas`**
7. **Data Filtering and Selection**
   - Conditional selection
   - Using boolean indexing

8. **Data Sorting and Ranking**
   - Sorting by columns
   - Ranking data

9. **Grouping and Aggregation**
   - GroupBy operations
   - Aggregation functions (sum, mean, count, etc.)

### **`9. Grouping and Aggregation`**

#### **`GroupBy Operations in Pandas`**

#### GroupBy Concept:

The GroupBy operation in Pandas involves splitting a DataFrame into groups based on one or more criteria, applying a function to each group independently, and then combining the results. It is a powerful tool for aggregation and analysis of data subsets.
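
The three steps described above can be spelled out by hand to show what `groupby()` does. This is a minimal sketch on made-up sample data; the names `pieces`, `means`, and `manual_result` are illustrative and not part of Pandas.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Marketing', 'IT', 'Marketing'],
    'Salary': [50000, 60000, 45000, 70000, 55000],
})

# Split: one sub-DataFrame per distinct department
pieces = {dept: sub for dept, sub in df.groupby('Department')}

# Apply: compute a summary value for each piece independently
means = {dept: sub['Salary'].mean() for dept, sub in pieces.items()}

# Combine: collect the per-group results into a single Series
manual_result = pd.Series(means, name='Salary')

# The usual one-line form performs the same three steps internally
groupby_result = df.groupby('Department')['Salary'].mean()

print(manual_result)
print(groupby_result)
```

Both results hold the same per-department averages; in practice the one-line form is the one to use.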

#### Grouping by a Single Column:


```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Laxman', 'Padma', 'Harshita', 'Naina', 'Aanchal'],
        'Department': ['HR', 'IT', 'Marketing', 'IT', 'Marketing'],
        'Salary': [50000, 60000, 45000, 70000, 55000]}

df = pd.DataFrame(data)

# Grouping by 'Department'
grouped_by_department = df.groupby('Department')

# Displaying the GroupBy object
print("GroupBy Object for Department:")
print(grouped_by_department)
```

```plaintext
GroupBy Object for Department:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x147c15750>
```


#### Grouping by Multiple Columns:

```python
# Grouping by 'Department' and 'Name'
grouped_by_department_name = df.groupby(['Department', 'Name'])

# Displaying the GroupBy object with multiple columns
print("\nGroupBy Object for Department and Name:")
print(grouped_by_department_name)
```

```plaintext

GroupBy Object for Department and Name:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x147c16a10>
```


#### Explanation of output:
Printing a GroupBy object, whether it groups by one column or several, shows only the object's type and memory address. For the single-column case ('Department'), the output has this form:

```plaintext
GroupBy Object for Department:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0xXXXXXXXX>
```

The memory address (shown here as `0xXXXXXXXX`) varies from run to run. The output confirms that a GroupBy object was created, but it does not display the grouped data. A GroupBy object is an intermediate step: you typically apply an aggregation or transformation function to it to extract meaningful information.
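
Although printing the object reveals little, its contents can still be inspected. The sketch below rebuilds the sample DataFrame so that it runs on its own and uses `ngroups`, `size()`, `get_group()`, and iteration to look inside the groups; the variable names are illustrative.

```python
import pandas as pd

# Same sample data as above, repeated so the block is self-contained
df = pd.DataFrame({
    'Name': ['Laxman', 'Padma', 'Harshita', 'Naina', 'Aanchal'],
    'Department': ['HR', 'IT', 'Marketing', 'IT', 'Marketing'],
    'Salary': [50000, 60000, 45000, 70000, 55000],
})
grouped = df.groupby('Department')

# Number of distinct groups
n_groups = grouped.ngroups

# Number of rows in each group
sizes = grouped.size()

# All rows that belong to one particular group
it_rows = grouped.get_group('IT')

# Iterating yields (group key, sub-DataFrame) pairs
for dept, sub in grouped:
    print(dept, sub['Name'].tolist())
```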

#### Aggregation with GroupBy:

```python
# Calculating the average salary for each department
avg_salary_by_department = grouped_by_department['Salary'].mean()

# Displaying the result of the aggregation
print("\nAverage Salary by Department:")
print(avg_salary_by_department)
```

```plaintext

Average Salary by Department:
Department
HR           50000.0
IT           65000.0
Marketing    50000.0
Name: Salary, dtype: float64
```


#### Explanation:

- **GroupBy Object:**
  - The `groupby()` method returns a GroupBy object. It is not a DataFrame; it records how the rows of the original DataFrame map to groups.
  - The per-group results are computed only when an aggregation, transformation, or similar method is called on it.

- **Grouping by Single and Multiple Columns:**
  - Grouping can be done based on a single column (`df.groupby('Department')`) or multiple columns (`df.groupby(['Department', 'Name'])`).

- **Aggregation with GroupBy:**
  - After grouping, aggregation functions like `mean()`, `sum()`, `count()`, etc., can be applied to obtain summary statistics for each group.
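
Grouping by several columns becomes more informative once an aggregation is applied. The following sketch assumes a hypothetical `Location` column, which is not part of the sample data above, and shows that the result is indexed by every combination of keys that occurs in the data.

```python
import pandas as pd

# Hypothetical data: 'Location' is an assumed column added for illustration
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'IT', 'Marketing', 'Marketing'],
    'Location': ['Delhi', 'Delhi', 'Pune', 'Pune', 'Delhi', 'Delhi'],
    'Salary': [50000, 60000, 70000, 80000, 45000, 55000],
})

# One group per (Department, Location) combination present in the data
summary = df.groupby(['Department', 'Location'])['Salary'].mean()
print(summary)

# The result has a two-level index; select a group with a tuple of keys
it_pune_avg = summary.loc[('IT', 'Pune')]
print(it_pune_avg)
```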

#### Scenarios for GroupBy:

1. **Department-wise Analysis:**
   - Analyze average salary, total employees, etc., for each department.

2. **Customer Segmentation:**
   - Group customers based on demographics for targeted marketing analysis.

3. **Time Series Analysis:**
   - Group time series data by month or year for temporal analysis.

4. **Product Categories:**
   - Group sales data by product categories to analyze performance.
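
As one illustration of the time series scenario, the sketch below groups daily records by calendar month. The `sales` DataFrame and its `Date` and `Amount` columns are invented for this example.

```python
import pandas as pd

# Hypothetical daily sales records
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03',
                            '2024-02-17', '2024-03-09']),
    'Amount': [120, 80, 200, 50, 90],
})

# Convert each date to its month and use that as the grouping key
month_key = sales['Date'].dt.to_period('M')
monthly_total = sales.groupby(month_key)['Amount'].sum()

print(monthly_total)
```

The grouping key here is a Series derived from a column rather than a column name, which `groupby()` also accepts.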

#### Considerations:

- **Efficiency:**
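  - GroupBy operations on large datasets can be slow and memory-intensive, particularly with many distinct group keys or custom Python functions. Where possible, prefer built-in aggregations such as `sum` or `mean`.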

- **Aggregation Functions:**
  - Choose appropriate aggregation functions based on the insights you seek.

#### Tips:

- **Resetting Index:**
  - After aggregation, the group keys become the index of the result. Use `reset_index()` to turn them back into ordinary columns.

- **Chaining Operations:**
  - Chain multiple operations with GroupBy for comprehensive analysis.
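
Both tips can be combined in a single chained expression. This is a small sketch on the same sample data; the column name `AvgSalary` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Marketing', 'IT', 'Marketing'],
    'Salary': [50000, 60000, 45000, 70000, 55000],
})

# Group, aggregate, move the group key back into a column, then sort
avg_salary = (
    df.groupby('Department')['Salary']
      .mean()
      .reset_index(name='AvgSalary')
      .sort_values('AvgSalary', ascending=False)
      .reset_index(drop=True)
)
print(avg_salary)

# Alternative: as_index=False keeps the key as a column from the start
avg_salary_alt = df.groupby('Department', as_index=False)['Salary'].mean()
print(avg_salary_alt)
```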

GroupBy operations are fundamental in Pandas for analyzing and summarizing data based on specific criteria. Whether you're working with categorical data, time series, or any other type of dataset, mastering GroupBy allows you to extract meaningful insights from your data.


#### **`Aggregation Functions in Pandas and their Application`**


#### Common Aggregation Functions:

Pandas provides various aggregation functions to summarize and analyze data. Here are some commonly used aggregation functions:

1. **Sum:**
   - Calculates the sum of values in a group.

2. **Mean (Average):**
   - Computes the average value in a group.

3. **Median:**
   - Finds the middle value in a group.

4. **Count:**
   - Counts the number of non-null values in a group.

5. **Min and Max:**
   - Identify the minimum and maximum values in a group.

6. **Standard Deviation and Variance:**
   - Measure the dispersion of values in a group.

#### Application with GroupBy:

```python
import pandas as pd

# Sample DataFrame
data = {'Department': ['HR', 'IT', 'Marketing', 'IT', 'Marketing'],
        'Salary': [50000, 60000, 45000, 70000, 55000]}

df = pd.DataFrame(data)

# Grouping by 'Department'
grouped_by_department = df.groupby('Department')

# Applying Aggregation Functions
agg_result = grouped_by_department['Salary'].agg(['sum', 'mean', 'count', 'min', 'max', 'std', 'var'])

# Displaying the Aggregated Result
print("Aggregated Result for Salary by Department:")
print(agg_result)
```

```plaintext
Aggregated Result for Salary by Department:
               sum     mean  count    min    max          std         var
Department                                                               
HR           50000  50000.0      1  50000  50000          NaN         NaN
IT          130000  65000.0      2  60000  70000  7071.067812  50000000.0
Marketing   100000  50000.0      2  45000  55000  7071.067812  50000000.0
```


#### Explanation:

- **GroupBy Object:**
  - The DataFrame is grouped by the 'Department' column using `groupby('Department')`.

- **Aggregation Functions:**
  - The `agg()` function is applied to the 'Salary' column within each group.
  - Multiple aggregation functions (sum, mean, count, min, max, std, var) are used simultaneously.

- **Result Display:**
  - The result is a DataFrame showing aggregated values for each department.

#### Insights Derived:

- **Total Salary Expenditure:**
  - `sum` provides the total salary expenditure for each department.

- **Average Salary:**
  - `mean` gives the average salary within each department.

- **Employee Count:**
  - `count` gives the number of non-null salary values in each department, which equals the number of employees here because no salary is missing.

- **Salary Range:**
  - `min` and `max` reveal the minimum and maximum salaries in each department.

- **Salary Dispersion:**
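  - `std` and `var` quantify the dispersion (standard deviation and variance) of salaries within each department. Pandas computes the sample versions by default (`ddof=1`), so a group with a single value, such as HR, yields `NaN`.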
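
Two details of these functions are easy to overlook: `count` skips missing values while `size` does not, and the default sample standard deviation is undefined for a single observation. The sketch below uses a modified sample in which one salary is deliberately set to `NaN`; that missing value is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data with one salary deliberately missing (illustrative assumption)
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'Marketing', 'Marketing'],
    'Salary': [50000, 60000, np.nan, 45000, 55000],
})
grouped = df.groupby('Department')['Salary']

counts = grouped.count()  # non-null values per group
sizes = grouped.size()    # rows per group, including missing values

sample_std = grouped.std()            # default ddof=1 (sample)
population_std = grouped.std(ddof=0)  # ddof=0 (population)

print(pd.DataFrame({
    'count': counts,
    'size': sizes,
    'sample_std': sample_std,
    'population_std': population_std,
}))
```

If the goal is a head count, `size()` is usually the safer choice when the aggregated column may contain missing values.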

#### Scenarios for Aggregation:

1. **Financial Analysis:**
   - Calculate total revenue, average transaction value, etc.

2. **Population Statistics:**
   - Analyze average age, population size, etc., for different regions.

3. **Sales Performance:**
   - Assess total sales, average sales, and variability in performance.

4. **Quality Control:**
   - Evaluate average quality, defects count, etc., in manufacturing.

#### Tips:

- **Custom Aggregation:**
  - Pass a lambda, or any function that reduces a group to a single value, to `agg()` to define a custom aggregation.

- **Applying Multiple Functions:**
  - Apply multiple aggregation functions simultaneously for a comprehensive analysis.
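
A custom aggregation can be mixed with built-in ones. The sketch below uses the named-aggregation form of `agg()`, where each keyword becomes an output column; the column names `total`, `average`, and `salary_range` are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Marketing', 'IT', 'Marketing'],
    'Salary': [50000, 60000, 45000, 70000, 55000],
})

# Each keyword is an output column: (input column, aggregation)
summary = df.groupby('Department').agg(
    total=('Salary', 'sum'),
    average=('Salary', 'mean'),
    # Custom aggregation: spread between highest and lowest salary
    salary_range=('Salary', lambda s: s.max() - s.min()),
)
print(summary)
```

Built-in names such as `'sum'` are generally faster than an equivalent lambda, so a custom function is best kept for calculations that have no built-in counterpart.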

Aggregation functions are essential tools for summarizing data and extracting key insights. When combined with the GroupBy operation, they allow for efficient analysis and exploration of data within specific groups or categories.
