
# Day 17 - GroupBy Operations for Aggregation
    


### Why Are GroupBy Operations Important?

GroupBy operations are central to data analysis because they follow a split-apply-combine pattern: the data is split into groups that share a key, an aggregate calculation such as a sum, average, or count is applied to each group, and the results are combined into a single table. This is particularly useful when analyzing sales data, customer data, or any dataset where you need to compare totals or distributions of a metric across categories.
    


### Tutorial: Grouping Data and Calculating Statistics

Let's start by understanding how to use the `groupby` function in Pandas to group data and then perform aggregate operations on those groups.
    

In [None]:
!pip install pandas

In [None]:
import pandas as pd

# Example DataFrame
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [250, 150, 200, 300, 400, 350, 100, 450],
    'Profit': [50, 40, 60, 80, 70, 90, 30, 100]
}
df = pd.DataFrame(data)

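# Grouping by 'Region' returns a DataFrameGroupBy object; no aggregation is computed yet
grouped = df.groupby('Region')
# head() returns up to the first 5 rows of each group, kept in their original row order
print(grouped.head())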
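
Printing `head()` does not make the groups themselves visible, so it can help to inspect the grouped object directly. The sketch below rebuilds the same example DataFrame and uses `size()` and `get_group()` to show how many rows each region holds and which rows belong to one region. The variable names `group_sizes` and `north_rows` are illustrative only.

```python
import pandas as pd

# Same example data as in the tutorial cell
df = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [250, 150, 200, 300, 400, 350, 100, 450],
    'Profit': [50, 40, 60, 80, 70, 90, 30, 100]
})
grouped = df.groupby('Region')

# Number of rows in each group (a Series indexed by Region)
group_sizes = grouped.size()
print(group_sizes)

# All rows that belong to a single group
north_rows = grouped.get_group('North')
print(north_rows)
```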
    


### Calculating Aggregate Statistics

Once you've grouped your data, you can calculate aggregate statistics for each group with methods such as `sum()`, `mean()`, and `count()`. Each method returns one row per group.
    

In [None]:

# Calculating the sum of Sales and Profit by Region
aggregated = grouped.sum()
print("Sum of Sales and Profit by Region:")
print(aggregated)
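
If only one metric is needed, the column can be selected before aggregating, so the statistic is computed for that column alone. A minimal sketch on the same example data, where the name `avg_sales` is illustrative:

```python
import pandas as pd

# Same example data as in the tutorial cell
df = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [250, 150, 200, 300, 400, 350, 100, 450],
    'Profit': [50, 40, 60, 80, 70, 90, 30, 100]
})

# Select the 'Sales' column from the grouped object, then average it per region
avg_sales = df.groupby('Region')['Sales'].mean()
print(avg_sales)
```

The result is a Series indexed by region rather than a DataFrame.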
    


### Applying Multiple Aggregations

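You can also apply several aggregation functions at once with the `agg()` method. Passing a dictionary maps each column to the list of functions to apply to it, and the result has two-level column labels such as `('Sales', 'sum')`.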
    

In [None]:

# Applying multiple aggregate functions
multiple_aggregations = grouped.agg({'Sales': ['sum', 'mean'], 'Profit': ['sum', 'mean']})
print("Multiple Aggregations (sum and mean) for Sales and Profit by Region:")
print(multiple_aggregations)
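
When flat column names are easier to work with than two-level labels, pandas also supports named aggregation (available since pandas 0.25). The sketch below applies it to the same example data. The output column names `total_sales` and `avg_profit` are arbitrary choices made for illustration.

```python
import pandas as pd

# Same example data as in the tutorial cell
df = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [250, 150, 200, 300, 400, 350, 100, 450],
    'Profit': [50, 40, 60, 80, 70, 90, 30, 100]
})

# Each keyword names an output column; the tuple is (input column, function)
summary = df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_profit=('Profit', 'mean'),
)
print(summary)
```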
    


## Use Case: Aggregating Sales Data by Region

In this use case, we aggregate sales data by region to see how sales and profits are distributed across regions. This helps identify which regions are performing well and which may need improvement.
    


### Step 1: Preparing the Dataset

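Suppose you have a dataset of sales transactions with columns for the transaction ID, region, sales amount, and profit. The cell below generates 100 synthetic transactions with NumPy; the fixed random seed keeps the values reproducible.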
    

In [None]:

import pandas as pd
import numpy as np

# Sales data
np.random.seed(42)
data = {
    'Transaction ID': range(1, 101),
    'Region': pd.Categorical(np.random.choice(['North', 'South', 'East', 'West'], size=100)),
    'Sales': np.random.randint(100, 500, size=100),
    'Profit': np.random.randint(10, 100, size=100)
}
sales_df = pd.DataFrame(data)

# Display the first few rows of the dataset
print("First few rows of the sales dataset:")
print(sales_df.head())
    


### Step 2: Grouping Sales Data by Region

We will group the sales data by region and calculate the total sales and profit for each region.
    

In [None]:

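# Grouping by Region and calculating the sum of Sales and Profit.
# Only Sales and Profit are selected, since summing 'Transaction ID' is not meaningful.
# observed=True limits the result to categories present in the data ('Region' is categorical).
regional_sales = sales_df.groupby('Region', observed=True)[['Sales', 'Profit']].sum()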

print("Total Sales and Profit by Region:")
print(regional_sales[['Sales','Profit']])
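
To see which regions contribute most, the totals can be sorted and expressed as a share of overall sales. The sketch below regenerates the same synthetic dataset so that it runs on its own. The names `ranked` and `sales_share` are illustrative, and `observed=True` is passed because `Region` is a categorical column.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic sales data with the same seed as in Step 1
np.random.seed(42)
sales_df = pd.DataFrame({
    'Transaction ID': range(1, 101),
    'Region': pd.Categorical(np.random.choice(['North', 'South', 'East', 'West'], size=100)),
    'Sales': np.random.randint(100, 500, size=100),
    'Profit': np.random.randint(10, 100, size=100)
})

# Total sales per region, largest first
total_sales = sales_df.groupby('Region', observed=True)['Sales'].sum()
ranked = total_sales.sort_values(ascending=False)

# Each region's fraction of overall sales
sales_share = ranked / ranked.sum()
print(ranked)
print(sales_share.round(3))
```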
    


### Step 3: Analyzing the Data

Let's take this a step further by calculating additional statistics like the average sales and profit per transaction in each region.
    

In [None]:

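# Grouping by Region and calculating the mean of Sales and Profit.
# Only Sales and Profit are selected, since averaging 'Transaction ID' is not meaningful.
# observed=True limits the result to categories present in the data ('Region' is categorical).
regional_sales_mean = sales_df.groupby('Region', observed=True)[['Sales', 'Profit']].mean()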

print("Average Sales and Profit per Transaction by Region:")
print(regional_sales_mean[['Sales','Profit']])
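
Per-transaction averages do not show how much of each region's revenue ends up as profit. One possible follow-up, added here for illustration rather than taken from the original steps, is a profit margin per region, computed as total profit divided by total sales. The name `profit_margin` is hypothetical, and the dataset is regenerated so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic sales data with the same seed as in Step 1
np.random.seed(42)
sales_df = pd.DataFrame({
    'Transaction ID': range(1, 101),
    'Region': pd.Categorical(np.random.choice(['North', 'South', 'East', 'West'], size=100)),
    'Sales': np.random.randint(100, 500, size=100),
    'Profit': np.random.randint(10, 100, size=100)
})

# Totals per region, then profit as a fraction of sales
totals = sales_df.groupby('Region', observed=True)[['Sales', 'Profit']].sum()
profit_margin = totals['Profit'] / totals['Sales']
print(profit_margin.round(3))
```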
    


### Step 4: Visualizing the Results

To better understand the distribution of sales and profits across regions, you can visualize the aggregated data using a bar plot.
    

In [None]:
!pip install matplotlib

In [None]:

import matplotlib.pyplot as plt

# Plotting the total sales by region
regional_sales['Sales'].plot(kind='bar',
                             title='Total Sales by Region',
                             ylabel='Total Sales',
                             xlabel='Region')
plt.show()

# Plotting the average profit by region
regional_sales_mean['Profit'].plot(kind='bar', 
                                   title='Average Profit per Transaction by Region', 
                                   ylabel='Average Profit', 
                                   xlabel='Region', 
                                   color='orange')
plt.show()
    