# 1. Aggregates

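- As in NumPy, pandas provides aggregates such as `sum()`, `min()`, `max()`, etc.
- On a pandas Series, an aggregate returns a single value.
- On a DataFrame, aggregates are computed within each column by default (one result per column); pass `axis=1` to aggregate within each row instead.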

## Simple aggregation in Pandas

### In Series

In [1]:
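import numpy as np
np.random.random(0)  # no-op: returns an empty array; np.random.seed(0) would seed the generator
ser = np.random.rand(10)  # a NumPy array; pd.Series(ser) aggregates the same way
ser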

array([0.15609602, 0.50623094, 0.0541819 , 0.45046744, 0.96698162,
       0.57584475, 0.15352664, 0.04408715, 0.86692653, 0.46955534])

In [2]:
sum(ser)

np.float64(4.243898333485362)

In [3]:
ser.mean()

np.float64(0.4243898333485362)

### In DataFrames

In [4]:
import pandas as pd 
df = pd.DataFrame({'A': np.random.rand(5), 'B': np.random.rand(5)})
df

Unnamed: 0,A,B
0,0.737256,0.465369
1,0.306456,0.017635
2,0.324319,0.98765
3,0.35288,0.521013
4,0.3957,0.669365


In [5]:
df.sum()  # default: one sum per column

A    2.116611
B    2.661032
dtype: float64

In [6]:
df.sum(axis=1)

0    1.202625
1    0.324090
2    1.311969
3    0.873893
4    1.065065
dtype: float64

## Planets dataset

- This is a dataset of exoplanets discovered by astronomers.

In [7]:
# Example dataset

import seaborn as sns
planets = sns.load_dataset('planets')

In [8]:
planets.shape

(1035, 6)

In [9]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [10]:
planets.isnull().sum()

method              0
number              0
orbital_period     43
mass              522
distance          227
year                0
dtype: int64

## describe()

In [11]:
# Drop rows containing NaN, then compute summary statistics

planets.dropna().describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


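- `describe()` is a useful way to understand the overall properties of a dataset.
- In the `year` column, for example, the earliest discovery is from 1989 and the median is 2009, so half of the planets (among rows with no missing values) were discovered in 2009 or later, i.e. around the time of the Kepler mission (launched in 2009).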

**Summary of aggregation functions in pandas**

| **Aggregation**      | **Description**                                   |
|----------------------|---------------------------------------------------|
| `count()`            | Total number of non-null items                    |
| `first()`, `last()`  | First and last item                               |
| `mean()`, `median()` | Mean and median                                   |
| `min()`, `max()`     | Minimum and maximum                               |
| `std()`, `var()`     | Standard deviation and variance                   |
| `mad()`              | Mean absolute deviation (removed in pandas 2.0)   |
| `prod()`             | Product of all items                              |
| `sum()`              | Sum of all items                                  |
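
As a rough illustration of the table above, here is a minimal sketch applying a few of these aggregates to a small Series. The values are made up for the example, and the hand-written mean absolute deviation is only one possible stand-in for `mad()` on newer pandas versions.

```python
import pandas as pd

# Illustrative values only (not taken from the planets dataset)
s = pd.Series([2.0, 4.0, 4.0, None, 10.0])

n_valid = s.count()    # count() skips the missing value -> 4
total = s.sum()        # NaN is skipped by default -> 20.0
mean = s.mean()        # 20.0 / 4 -> 5.0
median = s.median()    # median of [2, 4, 4, 10] -> 4.0
product = s.prod()     # 2 * 4 * 4 * 10 -> 320.0

# mad() is not available in pandas >= 2.0; the same quantity by hand:
mean_abs_dev = (s - s.mean()).abs().mean()   # (3 + 1 + 1 + 5) / 4 -> 2.5
```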


# 2. GroupBy: Split, Apply, Combine

- Plain aggregates operate on complete columns or rows.
- GroupBy applies aggregates conditionally, i.e. separately for each value of some key.
- The name is inspired by the SQL `GROUP BY` command.
- It is better understood in terms of split, apply, combine.

## (a) GroupBy process : Split, Apply, Combine

1. Split: DataFrame is split into groups based on the value of a specified key

2. Apply: A function (usually an aggregate, transformation, or filtering) is applied within the groups.

3. Combine: Results are merged back into an output array.

- The same operation can be done manually: masking -> aggregation -> merging.
- GroupBy automates that manual process of masking, aggregating, and merging.
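
To make the comparison concrete, the sketch below performs the manual masking, aggregation, and merging steps and then the equivalent GroupBy call. The frame `df_demo` and the variable names are illustrative, not part of the notebook.

```python
import pandas as pd

# Illustrative frame (hypothetical name df_demo)
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(1, 7)})

# Manual route: mask each key (split), sum it (apply), collect results (combine)
manual = {}
for k in df_demo['key'].unique():
    mask = df_demo['key'] == k
    manual[k] = df_demo.loc[mask, 'data'].sum()
manual = pd.Series(manual, name='data')

# GroupBy route: the same three steps in one expression
grouped = df_demo.groupby('key')['data'].sum()
```

Both `manual` and `grouped` hold the sums 5, 7 and 9 for keys A, B and C.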

In [12]:
# Create a dataframe

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(1,7)}, columns=['key', 'data'] )
df

Unnamed: 0,key,data
0,A,1
1,B,2
2,C,3
3,A,4
4,B,5
5,C,6


In [13]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x124c97e90>

- The output of `groupby()` is a GroupBy object, not a DataFrame.
- It produces a DataFrame only once an aggregation (or another method) is applied to it. This is lazy evaluation.
- Because no computation happens until it is needed, creating the GroupBy object is cheap.
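
A small sketch of what can be inspected on a GroupBy object before any aggregate is requested. The frame `df_demo` is a hypothetical stand-in mirroring the example above.

```python
import pandas as pd

# Illustrative frame (hypothetical name df_demo)
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(1, 7)})

gb = df_demo.groupby('key')     # no aggregate has been computed yet

n_groups = gb.ngroups           # number of distinct keys
group_a = gb.get_group('A')     # rows whose key is 'A', as a DataFrame
sizes = gb.size()               # number of rows per group, as a Series
```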

In [14]:
df.groupby('key').sum()  # This step applies the aggregate and combines the split.

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,5
B,7
C,9


<img src="/Users/tanukhanuja/data_science_essential_packages/sklearn/groupby.png" style="width:500px;"/>


## (b) GroupBy Object
- A flexible abstraction that handles complex operations internally.
- It behaves like a collection of DataFrames, with each group processed efficiently.

Key GroupBy Operations:
1. Aggregate: Summarizes data, like calculating the sum, mean, etc.
2. Filter: Filters groups based on a condition.
3. Transform: Applies a function to each group and returns a transformed DataFrame.
4. Apply: Lets you apply custom functions to groups.

#### 1. Column indexing

You can index specific columns in a GroupBy object, similar to how you do with DataFrames. This returns a modified GroupBy object.

In [15]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [16]:
planets.groupby('method')  # Grouped df by method

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x124ee6e70>

In [17]:
planets.groupby('method')['orbital_period']  # Access a Series from grouped object

<pandas.core.groupby.generic.SeriesGroupBy object at 0x124ee63f0>

In [18]:
planets.groupby('method')['orbital_period'].median()  # Apply Aggregation function

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

#### 2. Iteration over groups

You can iterate directly over the GroupBy object; each iteration yields a (key, group) pair, where the group is a Series or DataFrame.


In [19]:
for (method, group) in planets.groupby('method'):
    print(method, group.shape)

Astrometry (2, 6)
Eclipse Timing Variations (9, 6)
Imaging (38, 6)
Microlensing (23, 6)
Orbital Brightness Modulation (3, 6)
Pulsar Timing (5, 6)
Pulsation Timing Variations (1, 6)
Radial Velocity (553, 6)
Transit (397, 6)
Transit Timing Variations (4, 6)


This can be useful for manual operations, though the built-in apply method is generally faster for most tasks.

#### 3. Dispatch Methods
Through Python’s class magic, any method that is not explicitly implemented in the GroupBy object is passed through and applied to each group.

For example, you can use the describe() method on the grouped data to get a summary of each group.

In [20]:
planets.groupby('method')['year'].median()

method
Astrometry                       2011.5
Eclipse Timing Variations        2010.0
Imaging                          2009.0
Microlensing                     2010.0
Orbital Brightness Modulation    2011.0
Pulsar Timing                    1994.0
Pulsation Timing Variations      2007.0
Radial Velocity                  2009.0
Transit                          2012.0
Transit Timing Variations        2012.5
Name: year, dtype: float64

In [21]:
planets.groupby('method')['year'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


- This table helps you understand the data better: the vast majority of planets were discovered using the Radial Velocity and Transit methods, while newer methods such as Transit Timing Variations first appear in 2011.

**Key Points:**
- Lazy Evaluation: GroupBy does not compute results until an aggregation or other method is applied.

- Iteration: You can iterate over groups manually, but apply is typically more efficient.

- Dispatch Methods: Any valid DataFrame/Series method can be applied to each group, making GroupBy a powerful and flexible tool for data analysis.

## (c) GroupBy methods

- groupby().aggregate()
- groupby().filter()
- groupby().transform()
- groupby().apply()

In [22]:
# Create an example dataset

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': rng.randint(0, 10, 6)
})
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


**1. Aggregation with groupby**

- Allows flexible aggregation with strings, functions, or lists.

In [23]:
df.groupby('key').aggregate(['min', 'median', 'max'])  # Combining multiple aggregation functions (string names avoid the FutureWarning raised for np.median)


Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


In [24]:
df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})  # Custom aggregation with dictionaries


Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,7
C,2,9
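
Another way to combine aggregates is named aggregation, which also sets the output column names. The sketch below is illustrative: `df_demo` reuses the values shown above, and the output names (`data1_min`, `data2_max`, `data2_range`) are made up for the example.

```python
import pandas as pd

# Same values as the example frame above (hypothetical name df_demo)
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# Each keyword is an output column: (input column, aggregation)
summary = df_demo.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
    data2_range=('data2', lambda s: s.max() - s.min()),
)
```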


**2. filter() with groupby**  
Filters groups based on custom criteria, returning only groups that meet the condition.


In [25]:
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [26]:
df.groupby('key').std()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2.12132,1.414214
B,2.12132,4.949747
C,2.12132,4.242641


In [27]:
df.groupby('key').filter(lambda x: x['data2'].std()>4)

Unnamed: 0,key,data1,data2
1,B,1,0
2,C,2,3
4,B,4,7
5,C,5,9
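
The filter function receives each group as a DataFrame and must return a single boolean. As a further sketch (the frame `df_demo` and the threshold are arbitrary choices for illustration), the example below keeps only the groups whose `data1` total exceeds 6.

```python
import pandas as pd

# Same values as the example frame above (hypothetical name df_demo)
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# data1 totals per group are A: 3, B: 5, C: 7, so only group C passes
kept = df_demo.groupby('key').filter(lambda g: g['data1'].sum() > 6)
```

Rows keep their original index labels, so `kept` contains rows 2 and 5.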


**3. Transform() with groupby**

- Returns a transformed version of the original data.
- The output shape is the same as the input shape.
- Common Example: Centering data by subtracting the group-wise mean.

In [28]:
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [29]:
df.groupby('key').transform(lambda x: x-x.mean())

Unnamed: 0,data1,data2
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0


In [30]:
df.groupby('key').mean()  # This was subtracted in above transform

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,1.5,4.0
B,2.5,3.5
C,3.5,6.0
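
Because `transform()` returns a result aligned with the original rows, it can also be used to fill missing values with a per-group statistic. This is a minimal sketch on made-up data; the column name `value` is an assumption for the example.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in group A
df_demo = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                        'value': [1.0, 10.0, np.nan, 30.0]})

# Group means broadcast back to every row: [1.0, 20.0, 1.0, 20.0]
group_mean = df_demo.groupby('key')['value'].transform('mean')

# Replace only the missing entries with their group's mean
filled = df_demo['value'].fillna(group_mean)
```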


**4. apply() with GroupBy**

- Allows the application of an arbitrary function to group results.
- The function should take a DataFrame and return either a Pandas object (DataFrame, Series) or a scalar.

Example:

- Normalizing the first column by the sum of the second:

In [31]:
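def norm_by_data2(x):
    # x is a DataFrame holding the rows of one group
    x = x.copy()  # work on a copy so the caller's data is not mutated
    x['data1'] = x['data1'] / x['data2'].sum()
    return x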

In [32]:
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [33]:
df.groupby('key').apply(norm_by_data2)


Unnamed: 0_level_0,Unnamed: 1_level_0,key,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,0,A,0.0,5
A,3,A,0.375,3
B,1,B,0.142857,0
B,4,B,0.571429,7
C,2,C,0.166667,3
C,5,C,0.416667,9
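
When the applied function returns a scalar, the combined result is a Series indexed by group key. The sketch below assumes a hypothetical helper `data2_range`; selecting the data columns explicitly is one way to avoid depending on how pandas treats the grouping column inside `apply()`.

```python
import pandas as pd

# Same values as the example frame above (hypothetical name df_demo)
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

def data2_range(group):
    # group is a DataFrame with the selected columns for one key
    return group['data2'].max() - group['data2'].min()

ranges = df_demo.groupby('key')[['data1', 'data2']].apply(data2_range)
```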


## (d) Specifying Grouping Keys

### 1. Using a List/Array/Series:

- Keys can be any series or list that matches the length of the DataFrame.


In [35]:
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [38]:
L = [0,1,0,1,2,0]
df.groupby(L).sum()

Unnamed: 0,key,data1,data2
0,ACC,7,17
1,BA,4,3
2,B,4,7


### 2. A Dictionary or Series mapping index to group:

- A dictionary that maps index values to group keys.

In [41]:
df2 = df.set_index('key')

In [42]:
df2

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,0
C,2,3
A,3,3
B,4,7
C,5,9


In [43]:
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

In [45]:
df2.groupby(mapping).sum()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
consonant,12,19
vowel,3,8
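
A Series can play the same role as the dictionary, provided its index lines up with the frame's index. In this sketch the label Series is derived with `Series.map`; the names `df_demo` and `labels` are illustrative.

```python
import pandas as pd

# Same values as the example frame above, with the default integer index
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

# One label per row, sharing df_demo's index
labels = df_demo['key'].map(mapping)

by_label = df_demo.groupby(labels)[['data1', 'data2']].sum()
```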


### 3. Using a Python Function:

- Any function that takes an index value and outputs the group key.

In [46]:
df2

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,0
C,2,3
A,3,3
B,4,7
C,5,9


In [47]:
df2.groupby(str.lower).mean()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
a,1.5,4.0
b,2.5,3.5
c,3.5,6.0


### 4. Combining Keys:

Multiple keys can be combined for multi-index grouping.


In [49]:
df2

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,0
C,2,3
A,3,3
B,4,7
C,5,9


In [50]:
mapping

{'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

In [51]:
df2.groupby([str.lower, mapping]).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key,key,Unnamed: 2_level_1,Unnamed: 3_level_1
a,vowel,1.5,4.0
b,consonant,2.5,3.5
c,consonant,3.5,6.0


### Grouping Example: Planets Data

Count discovered planets by method and decade.

Steps:

- Build a decade label (e.g. `2000s`) from the 'year' column.
- Group by ['method', decade].
- Sum the 'number' column, unstack the decades into columns, and fill missing values with 0.

In [52]:
planets

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.300000,7.10,77.40,2006
1,Radial Velocity,1,874.774000,2.21,56.95,2008
2,Radial Velocity,1,763.000000,2.60,19.84,2011
3,Radial Velocity,1,326.030000,19.40,110.62,2007
4,Radial Velocity,1,516.220000,10.50,119.47,2009
...,...,...,...,...,...,...
1030,Transit,1,3.941507,,172.00,2006
1031,Transit,1,2.615864,,148.00,2007
1032,Transit,1,3.191524,,174.00,2007
1033,Transit,1,4.125083,,293.00,2008


In [57]:
decade = 10 * (planets['year'] // 10)  # parentheses needed: 10 * year // 10 would just return the year
decade.head()

0    2000
1    2000
2    2010
3    2000
4    2000
Name: year, dtype: int64

In [59]:
decade = decade.astype(str) + 's'
decade.name = 'decade'
decade

0       2000s
1       2000s
2       2010s
3       2000s
4       2000s
        ...  
1030    2000s
1031    2000s
1032    2000s
1033    2000s
1034    2000s
Name: decade, Length: 1035, dtype: object

In [63]:
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

method,1980s,1990s,2000s,2010s
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0
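
The decade computation depends on operator precedence, which is easy to get wrong. The sketch below uses a small made-up sample with the same column names as the planets data (it is not the real dataset) to contrast the two forms and then run the grouping steps.

```python
import pandas as pd

# Made-up sample with the same column names as the planets data
sample = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Radial Velocity', 'Radial Velocity', 'Imaging'],
    'number': [1, 2, 1, 3, 1],
    'year':   [2009, 2012, 1996, 2011, 2008],
})

# * and // have equal precedence and bind left to right,
# so this is (10 * year) // 10, which is just the year again
unchanged = 10 * sample['year'] // 10

# Integer-divide first, then multiply: 2009 -> 2000, 2012 -> 2010
decade = (10 * (sample['year'] // 10)).astype(str) + 's'
decade.name = 'decade'

counts = sample.groupby(['method', decade])['number'].sum().unstack().fillna(0)
```

Here `counts` has one row per method and one column per decade label, with 0 where a method has no planets in that decade.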
