<a href="https://colab.research.google.com/github/sarahajbane/colab_workbook_templates/blob/main/Grouping_and_Aggregation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Grouping and Aggregation**
* groupby() allows you to split the dataset into subsets based on some criteria or categories, making it easier to analyze different groups separately.
* Aggregation functions like sum, mean, count, and median allow you to compute statistics on each subset, quickly summarizing large datasets into meaningful insights.

## **Groupby**

The groupby() method lets you group rows of data together and call aggregate functions on each group.

In [None]:
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}

In [None]:
df = pd.DataFrame(data)

In [None]:
df

Unnamed: 0,Company,Person,Sales
0,GOOG,Sam,200
1,GOOG,Charlie,120
2,MSFT,Amy,340
3,MSFT,Vanessa,124
4,FB,Carl,243
5,FB,Sarah,350


Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group by Company. This creates a DataFrameGroupBy object:

In [None]:
df.groupby('Company')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x78763452f8e0>

You can save this object as a new variable:

In [None]:
by_comp = df.groupby("Company")
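The DataFrameGroupBy object only records how the rows are split; nothing is computed until an aggregate is requested. As a minimal sketch (rebuilding the same data so it runs on its own), one way to look inside the object is to iterate over its (key, sub-DataFrame) pairs. The names `group_sizes` and `goog` are illustrative only.

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)
by_comp = df.groupby("Company")

# Iterating yields (group key, sub-DataFrame) pairs, sorted by key by default
group_sizes = {}
for company, rows in by_comp:
    group_sizes[company] = len(rows)
    print(company, list(rows['Person']))

# get_group() pulls out the rows belonging to a single key
goog = by_comp.get_group('GOOG')
print(goog)
```

Each company ends up with its own small DataFrame, which is what the aggregate methods operate on.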

Then call aggregate methods on the object:

In [None]:
# df.groupby("Company")['Sales'].mean()
by_comp['Sales'].mean()


Unnamed: 0_level_0,Sales
Company,Unnamed: 1_level_1
FB,296.5
GOOG,160.0
MSFT,232.0


More examples of aggregate methods:

In [None]:
# df.groupby("Company")['Sales'].std()
by_comp['Sales'].std()

Unnamed: 0_level_0,Sales
Company,Unnamed: 1_level_1
FB,75.660426
GOOG,56.568542
MSFT,152.735065


In [None]:
# df.groupby("Company")['Sales'].min()
by_comp['Sales'].min()

Unnamed: 0_level_0,Sales
Company,Unnamed: 1_level_1
FB,243
GOOG,120
MSFT,124


In [None]:
# df.groupby("Company")['Sales'].max()
by_comp['Sales'].max()

Unnamed: 0_level_0,Sales
Company,Unnamed: 1_level_1
FB,350
GOOG,200
MSFT,340


In [None]:
# df.groupby("Company").count()
by_comp.count()

Unnamed: 0_level_0,Person,Sales
Company,Unnamed: 1_level_1,Unnamed: 2_level_1
FB,2,2
GOOG,2,2
MSFT,2,2


In [None]:
by_comp.describe()

Unnamed: 0_level_0,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
Company,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0
GOOG,2.0,160.0,56.568542,120.0,140.0,160.0,180.0,200.0
MSFT,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0
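The introduction also mentions sum and median, which the cells above do not show. A short sketch, again rebuilding the data so it runs standalone, computes several statistics in one call with `.agg()`. The variable name `summary` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
                   'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
                   'Sales': [200, 120, 340, 124, 243, 350]})

# Passing a list of function names returns one column per statistic
summary = df.groupby('Company')['Sales'].agg(['sum', 'mean', 'median', 'count'])
print(summary)
```

Unlike `describe()`, this returns only the statistics you ask for.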


## **Functions**
 * A function is a block of reusable code that performs a specific task.
 * Functions help organize your code into manageable, logical sections.
 * By defining functions, you can avoid repeating code and improve readability and maintainability.


In [None]:
#syntax
def my_func(param1='default'):
    """
    Docstring goes here.
    """
    print(param1)

In [None]:
my_func

In [None]:
my_func('new param')

new param


In [None]:
my_func(param1='new param')

new param


Explanation:

* The function multiply(a, b) takes two parameters a and b.
* Instead of returning a value, it prints the result of the multiplication a * b inside the function.
* The function is called with 4 and 6 as arguments, and it directly prints the multiplication result.

In [None]:
def multiply(a, b):
    print(f"The result of {a} * {b} is {a * b}")  # The function prints the multiplication result

In [None]:
multiply(4, 6)

The result of 4 * 6 is 24


Explanation:

* The function square(x) takes one parameter x.
* It returns the result of squaring x (i.e., x**2).
* The function is called with the argument 5, which returns the square of 5 (i.e., 25).
* The result is stored in the variable out and printed.

In [None]:
def square(x):
    return x**2

In [None]:
out = square(5)
print(out)

25
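The difference between printing and returning matters once a result is reused. As a small sketch reusing `multiply` and `square` from above (redefined here so the snippet runs alone), a function without a `return` statement hands back `None`, while a returned value can feed further computation. The names `printed` and `returned` are illustrative.

```python
def multiply(a, b):
    print(f"The result of {a} * {b} is {a * b}")  # prints, but returns nothing

def square(x):
    return x**2  # hands the value back to the caller

printed = multiply(4, 6)   # the message appears, but printed is None
returned = square(5)       # returned holds 25

print(printed)
print(returned + 1)        # a returned value can be used in later expressions
```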


Explanation:

* The function add(a, b) takes two parameters a and b.
* It returns the sum of a and b.
* The function is called with 3 and 5 as arguments, and the returned value 8 is stored in the variable result and printed.

In [None]:
def add(a, b):
    return a + b

In [None]:
result = add(3, 5)
print(result)

8


## **Lambda Functions**
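A lambda is an anonymous, single-expression function written inline. As a brief sketch, the lambda below behaves the same as the `square` function defined with `def` earlier (redefined here so the snippet is self-contained). The names `square_lambda`, `names`, and `by_length` are illustrative.

```python
def square(x):
    return x**2

# Same behavior as square(), written as a single expression
square_lambda = lambda x: x**2

print(square(5), square_lambda(5))

# Lambdas are most useful as short, throwaway arguments to other functions
names = ['Charlie', 'Bob', 'Alice']
by_length = sorted(names, key=lambda name: len(name))
print(by_length)
```

The cells below use lambdas in the same throwaway fashion, as arguments to `apply()`.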


In [None]:
import pandas as pd
import numpy as np

In [None]:
# Create a sample dataset
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 28, 22],
    'Salary': [50000, 60000, 75000, 55000, 45000],
    'City': ['New York', 'San Francisco', 'Chicago', 'Boston', 'Miami']
})

In [None]:
df

Unnamed: 0,Name,Age,Salary,City
0,Alice,25,50000,New York
1,Bob,30,60000,San Francisco
2,Charlie,35,75000,Chicago
3,David,28,55000,Boston
4,Eva,22,45000,Miami


In [None]:
#Using apply() with a lambda function to create a new column
df['Salary_After_Tax'] = df['Salary'].apply(lambda x: x * 0.8)

In [None]:
#Using apply() with a lambda function to categorize age
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Adult')
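For simple arithmetic and two-way conditions, the same columns can be produced without `apply()`. The sketch below is an alternative, not what the cells above do: it uses plain column arithmetic and `np.where` on a rebuilt copy of the sample data, then checks that the results match the lambda versions. The `_vec` column names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 28, 22],
    'Salary': [50000, 60000, 75000, 55000, 45000],
})

# Column arithmetic applies to every row at once
df['Salary_After_Tax_vec'] = df['Salary'] * 0.8

# np.where(condition, value_if_true, value_if_false) replaces the inline if/else
df['Age_Group_vec'] = np.where(df['Age'] < 30, 'Young', 'Adult')

# Compare against the apply() + lambda versions
same_tax = np.allclose(df['Salary_After_Tax_vec'],
                       df['Salary'].apply(lambda x: x * 0.8))
same_group = list(df['Age_Group_vec']) == list(
    df['Age'].apply(lambda x: 'Young' if x < 30 else 'Adult'))
print(same_tax, same_group)
```

On large DataFrames the vectorized forms are usually faster, while `apply()` remains the more flexible option for logic that does not reduce to a single expression.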

## **.apply() method**

In [None]:
#Using apply() with a lambda function on multiple columns
df['Name_City'] = df.apply(lambda row: f"{row['Name']} from {row['City']}", axis=1)

In [None]:
#Using apply() with a lambda function to transform a column
df['Name_Length'] = df['Name'].apply(lambda x: len(x))

In [None]:
#Using apply() with a custom function
def salary_category(salary):
    if salary < 55000:
        return 'Low'
    elif salary < 70000:
        return 'Medium'
    else:
        return 'High'

In [None]:
df['Salary_Category'] = df['Salary'].apply(salary_category)

In [None]:
df

Unnamed: 0,Name,Age,Salary,City,Salary_After_Tax,Age_Group,Name_City,Name_Length,Salary_Category
0,Alice,25,50000,New York,40000.0,Young,Alice from New York,5,Low
1,Bob,30,60000,San Francisco,48000.0,Adult,Bob from San Francisco,3,Medium
2,Charlie,35,75000,Chicago,60000.0,Adult,Charlie from Chicago,7,High
3,David,28,55000,Boston,44000.0,Young,David from Boston,5,Medium
4,Eva,22,45000,Miami,36000.0,Young,Eva from Miami,3,Low
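Columns created with `apply()` can feed straight back into `groupby()`. As a sketch tying the two halves of this workbook together (on a rebuilt, reduced copy of the data), the snippet below averages salary per age group. The name `avg_by_group` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 28, 22],
    'Salary': [50000, 60000, 75000, 55000, 45000],
})

# Derive the category with apply(), then group on it
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Adult')
avg_by_group = df.groupby('Age_Group')['Salary'].mean()
print(avg_by_group)
```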


**Thank You!**

**Keep Practicing!**