### groupby()
**1.** Used for Reshaping and Aggregation. 

**2.** Returns a **DataFrameGroupBy** object.

**3.** Follows a three-step process:
> 1. Splitting the data into groups.
> 2. Applying a function to each group independently.
> 3. Combining the results into a data structure.
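
The three steps can be spelled out by hand to see what `groupby()` does in one call. This is a minimal sketch on a small made-up frame (the `toy` data and the variable names are assumptions for illustration, not part of the volcano dataset):

```python
import pandas as pd

# Hypothetical toy data, not the volcano dataset used in this notebook
toy = pd.DataFrame({
    "Type": ["Caldera", "Shield", "Caldera", "Shield", "Shield"],
    "Year": [2016, 2011, 2017, 2012, 2018],
})

# 1. Split: iterating a GroupBy object yields (key, sub-DataFrame) pairs
pieces = {key: group for key, group in toy.groupby("Type")}

# 2. Apply: compute something on each piece independently
latest = {key: group["Year"].max() for key, group in pieces.items()}

# 3. Combine: assemble the per-group results into one Series
manual = pd.Series(latest, name="Year").rename_axis("Type")

# The usual one-liner performs all three steps at once
auto = toy.groupby("Type")["Year"].max()

print(manual)
print(auto)
```

Both Series should hold the same latest year per type; the manual version only exists to make the split, apply and combine stages visible.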

**4.** During the `Apply` step, we can do one of the following three things:
> 1. **Aggregation:** `agg()` computes a statistical summary for each group. Takes a single function or a list of functions.

> 2. **Transformation:** `transform()` applies a provided function to a single column (Series) of the grouped DataFrame. Returns a result of the same length as the parent.

> 3. **Filtration:** `filter()` keeps or drops whole groups based on a given criterion or function. Produces a sub-DataFrame.
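
A side-by-side sketch of the three kinds of apply step, using a small made-up frame (the `toy` data and result names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Country": ["Japan", "Japan", "Italy", "Japan", "Italy"],
    "Year": [2011, 2014, 2017, 2016, 2017],
})
grouped = toy.groupby("Country")["Year"]

# Aggregation: one row per group; a list of functions gives one column each
summary = grouped.agg(["min", "max", "count"])

# Transformation: the result has the same length and index as the input
group_size = grouped.transform("count")

# Filtration: keep only the rows of groups that satisfy a condition
frequent = toy.groupby("Country").filter(lambda g: len(g) >= 3)

print(summary)
print(group_size)
print(frequent)
```

Note the shapes: `summary` has one row per country, `group_size` has one value per original row, and `frequent` is a subset of the original rows.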

In [21]:
import pandas as pd
import numpy as np

In [22]:
df = pd.read_csv('Dataset/volcano_data_2010.csv', usecols= ['Name', 'Country', 'Year', 'Location', 'Type'])
df

Unnamed: 0,Year,Name,Location,Country,Type
0,2010,Tungurahua,Ecuador,Ecuador,Stratovolcano
1,2010,Eyjafjallajokull,Iceland-S,Iceland,Stratovolcano
2,2010,Pacaya,Guatemala,Guatemala,Complex volcano
3,2010,Sarigan,Mariana Is-C Pacific,United States,Stratovolcano
4,2010,Karangetang [Api Siau],Sangihe Is-Indonesia,Indonesia,Stratovolcano
...,...,...,...,...,...
58,2018,Kilauea,Hawaiian Is,United States,Shield volcano
59,2018,Kadovar,New Guinea-NE of,Papua New Guinea,Stratovolcano
60,2018,Ijen,Java,Indonesia,Stratovolcano
61,2018,Kilauea,Hawaiian Is,United States,Shield volcano


In [23]:
type(df.groupby('Type'))
# Shows the type of the groupby object

pandas.core.groupby.generic.DataFrameGroupBy

In [24]:
df.groupby('Type').apply(display)

Unnamed: 0,Year,Name,Location,Country,Type
44,2016,Yellowstone,US-Wyoming,United States,Caldera
46,2016,Aso,Kyushu-Japan,Japan,Caldera
51,2017,Campi Flegrei,Italy,Italy,Caldera


Unnamed: 0,Year,Name,Location,Country,Type
2,2010,Pacaya,Guatemala,Guatemala,Complex volcano
32,2014,On-take,Honshu-Japan,Japan,Complex volcano
50,2017,Dieng Volc Complex,Java,Indonesia,Complex volcano


Unnamed: 0,Year,Name,Location,Country,Type
29,2013,Okataina,New Zealand,New Zealand,Lava dome
41,2015,Okataina,New Zealand,New Zealand,Lava dome


Unnamed: 0,Year,Name,Location,Country,Type
10,2011,Kirishima,Kyushu-Japan,Japan,Shield volcano
19,2012,Kilauea,Hawaiian Is,United States,Shield volcano
20,2012,Kilauea,Hawaiian Is,United States,Shield volcano
21,2012,Tolbachik,Kamchatka,Russia,Shield volcano
33,2014,Kilauea,Hawaiian Is,United States,Shield volcano
52,2017,Aoba,Vanuatu-SW Pacific,Vanuatu,Shield volcano
58,2018,Kilauea,Hawaiian Is,United States,Shield volcano
61,2018,Kilauea,Hawaiian Is,United States,Shield volcano
62,2018,Aoba,Vanuatu-SW Pacific,Vanuatu,Shield volcano


Unnamed: 0,Year,Name,Location,Country,Type
0,2010,Tungurahua,Ecuador,Ecuador,Stratovolcano
1,2010,Eyjafjallajokull,Iceland-S,Iceland,Stratovolcano
3,2010,Sarigan,Mariana Is-C Pacific,United States,Stratovolcano
4,2010,Karangetang [Api Siau],Sangihe Is-Indonesia,Indonesia,Stratovolcano
5,2010,Sinabung,Sumatra,Indonesia,Stratovolcano
6,2010,Merapi,Java,Indonesia,Stratovolcano
7,2010,Tungurahua,Ecuador,Ecuador,Stratovolcano
8,2010,Tengger Caldera,Java,Indonesia,Stratovolcano
9,2011,Merapi,Java,Indonesia,Stratovolcano
11,2011,Bulusan,Luzon-Philippines,Philippines,Stratovolcano


Unnamed: 0,Year,Name,Location,Country,Type
16,2011,Katla,Iceland-S,Iceland,Subglacial volcano


In [25]:
# Displaying the GroupBy object only shows its repr; use .ngroups for the number of groups
df.groupby('Type')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022359B13280>

In [26]:
# Size of each group
df.groupby('Type').size()

Type
Caldera                3
Complex volcano        3
Lava dome              2
Shield volcano         9
Stratovolcano         45
Subglacial volcano     1
dtype: int64
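
`size()` reports the number of rows in each group. The number of groups itself is available through the `ngroups` attribute or `len()`. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Type": ["Caldera", "Shield", "Caldera"],
    "Year": [2016, 2011, 2017],
})
grouped = toy.groupby("Type")

# Number of distinct groups, two equivalent ways
n_groups = grouped.ngroups
n_groups_len = len(grouped)

# Rows per group
rows_per_group = grouped.size()

print(n_groups, n_groups_len)
print(rows_per_group)
```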

In [27]:
# All the groups at once
dict_ = df.groupby('Type').groups
dict_

{'Caldera': [44, 46, 51], 'Complex volcano': [2, 32, 50], 'Lava dome': [29, 41], 'Shield volcano': [10, 19, 20, 21, 33, 52, 58, 61, 62], 'Stratovolcano': [0, 1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 22, 23, 24, 25, 26, 27, 28, 30, 31, 34, 35, 36, 37, 38, 39, 40, 42, 43, 45, 47, 48, 49, 53, 54, 55, 56, 57, 59, 60], 'Subglacial volcano': [16]}

In [28]:
for items in dict_.keys():
    print(items)

Caldera
Complex volcano
Lava dome
Shield volcano
Stratovolcano
Subglacial volcano


In [29]:
# Getting a particular group
df.groupby('Type').get_group('Complex volcano')

Unnamed: 0,Year,Name,Location,Country,Type
2,2010,Pacaya,Guatemala,Guatemala,Complex volcano
32,2014,On-take,Honshu-Japan,Japan,Complex volcano
50,2017,Dieng Volc Complex,Java,Indonesia,Complex volcano


In [30]:
# First non-null value of each column per group (the opposite is last())
df.groupby('Type').first() 

Type,Year,Name,Location,Country
Caldera,2016,Yellowstone,US-Wyoming,United States
Complex volcano,2010,Pacaya,Guatemala,Guatemala
Lava dome,2013,Okataina,New Zealand,New Zealand
Shield volcano,2011,Kirishima,Kyushu-Japan,Japan
Stratovolcano,2010,Tungurahua,Ecuador,Ecuador
Subglacial volcano,2011,Katla,Iceland-S,Iceland


### aggregation

In [31]:
df.groupby('Type')['Year'].agg('count') 
# equivalent to df.groupby('Type')['Year'].count()

Type
Caldera                3
Complex volcano        3
Lava dome              2
Shield volcano         9
Stratovolcano         45
Subglacial volcano     1
Name: Year, dtype: int64

Between 2010 and 2018, stratovolcanoes had the highest number of eruptions (45).

In [32]:
df.groupby('Year')['Country'].count()

Year
2010     9
2011    10
2012     3
2013     8
2014     6
2015     6
2016     5
2017     8
2018     8
Name: Country, dtype: int64

In [33]:
# Without selecting a column, every non-grouping column is counted
df.groupby(['Type','Country']).agg('count')

Type,Country,Year,Name,Location
Caldera,Italy,1,1,1
Caldera,Japan,1,1,1
Caldera,United States,1,1,1
Complex volcano,Guatemala,1,1,1
Complex volcano,Indonesia,1,1,1
Complex volcano,Japan,1,1,1
Lava dome,New Zealand,2,2,2
Shield volcano,Japan,1,1,1
Shield volcano,Russia,1,1,1
Shield volcano,United States,5,5,5


In [34]:
# Using a custom function on the per-country counts
# Returns True if the total number of eruptions is 3 or more, else False
def cnt(val):
    return val >= 3

df.groupby('Country')['Location'].agg('count').agg(cnt)

Country
Cape Verde          False
Chile               False
Ecuador              True
Eritrea             False
Guatemala           False
Iceland             False
Indonesia            True
Italy               False
Japan                True
New Zealand         False
Papua New Guinea     True
Peru                False
Philippines          True
Russia              False
United States        True
Vanuatu             False
Name: Location, dtype: bool

In [35]:
# lambda equivalent of the above function
df.groupby('Country')['Location'].agg('count').agg(lambda val: val >= 3)

Country
Cape Verde          False
Chile               False
Ecuador              True
Eritrea             False
Guatemala           False
Iceland             False
Indonesia            True
Italy               False
Japan                True
New Zealand         False
Papua New Guinea     True
Peru                False
Philippines          True
Russia              False
United States        True
Vanuatu             False
Name: Location, dtype: bool

Looking back at the per-year counts, 2011 saw the highest number of eruptions (10) of the nine years from 2010 to 2018.

In [36]:
# We can set a custom column name too (named aggregation)
df.groupby('Country')['Location'].agg(Tot_Occurrence='count')

Country,Tot_Occurrence
Cape Verde,1
Chile,2
Ecuador,3
Eritrea,1
Guatemala,2
Iceland,2
Indonesia,26
Italy,2
Japan,5
New Zealand,2


Indonesia saw the highest number of eruptions (26).
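
Named aggregation also works on the whole grouped DataFrame, where each keyword maps an output column name to a `(column, function)` pair. A minimal sketch on made-up data (the `toy` frame, the volcano labels and the output column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Country": ["Japan", "Japan", "Italy", "Japan", "Italy"],
    "Year": [2011, 2014, 2017, 2016, 2017],
    "Name": ["V1", "V2", "V3", "V4", "V3"],
})

# keyword = (source column, aggregation function)
summary = toy.groupby("Country").agg(
    eruptions=("Name", "count"),
    first_year=("Year", "min"),
    last_year=("Year", "max"),
)

print(summary)
```

This keeps the output columns flat and self-describing, which is often easier to work with than the two-level columns produced by passing a list of functions.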

In [37]:
# Number of eruptions per year, per country
df.groupby('Year')['Country'].value_counts()

Year  Country         
2010  Indonesia           4
      Ecuador             2
      Guatemala           1
      Iceland             1
      United States       1
2011  Indonesia           4
      Chile               1
      Ecuador             1
      Eritrea             1
      Iceland             1
      Japan               1
      Philippines         1
2012  United States       2
      Russia              1
2013  Indonesia           4
      Japan               1
      New Zealand         1
      Peru                1
      Philippines         1
2014  Indonesia           3
      Cape Verde          1
      Japan               1
      United States       1
2015  Indonesia           3
      Chile               1
      New Zealand         1
      Papua New Guinea    1
2016  Indonesia           3
      Japan               1
      United States       1
2017  Indonesia           4
      Italy               2
      Guatemala           1
      Vanuatu             1
2018  Papua New Guinea   

#### transformation (see the transform() notebook)

#### filtration

The function passed to `filter()` must evaluate to True or False for each group.

In [38]:
# Rows of countries whose total number of eruptions is 3 or more
df.groupby('Country').filter(lambda val : len(val)>=3)

Unnamed: 0,Year,Name,Location,Country,Type
0,2010,Tungurahua,Ecuador,Ecuador,Stratovolcano
3,2010,Sarigan,Mariana Is-C Pacific,United States,Stratovolcano
4,2010,Karangetang [Api Siau],Sangihe Is-Indonesia,Indonesia,Stratovolcano
5,2010,Sinabung,Sumatra,Indonesia,Stratovolcano
6,2010,Merapi,Java,Indonesia,Stratovolcano
7,2010,Tungurahua,Ecuador,Ecuador,Stratovolcano
8,2010,Tengger Caldera,Java,Indonesia,Stratovolcano
9,2011,Merapi,Java,Indonesia,Stratovolcano
10,2011,Kirishima,Kyushu-Japan,Japan,Shield volcano
11,2011,Bulusan,Luzon-Philippines,Philippines,Stratovolcano
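
The same selection can be written as a boolean mask built with `transform('size')`, which avoids calling a Python function once per group. A minimal sketch on made-up data (the `toy` frame and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Country": ["Japan", "Japan", "Italy", "Japan", "Italy"],
    "Year": [2011, 2014, 2017, 2016, 2017],
})

# filter(): the lambda receives each group's sub-DataFrame and returns a bool
via_filter = toy.groupby("Country").filter(lambda g: len(g) >= 3)

# Equivalent mask: broadcast each group's size back onto its rows
sizes = toy.groupby("Country")["Year"].transform("size")
via_mask = toy[sizes >= 3]

print(via_filter)
print(via_mask)
```

Both approaches keep the original row order and index; which one reads better is largely a matter of taste, though the mask form tends to scale better when there are many groups.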


In [39]:
# handling missing values
temp_df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'B': [np.nan, 4, np.nan, 5,6,np.nan, 8,2]
})
temp_df

Unnamed: 0,name,B
0,A,
1,A,4.0
2,B,
3,B,5.0
4,B,6.0
5,C,
6,C,8.0
7,C,2.0


In [42]:
# By default, aggregation functions such as mean() skip NaN values within each group.
# (The groupby() argument 'dropna=True' is a separate setting: it drops rows whose group key is NaN.)
temp_df.groupby('name').mean()

name,B
A,4.0
B,5.5
C,5.0
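
Two different kinds of missing data can show up with `groupby()`: NaN in the values being aggregated and NaN in the grouping key. The sketch below separates the two on a made-up frame (the `toy` data and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one NaN value in 'B' and one NaN key in 'name'
toy = pd.DataFrame({
    "name": ["A", "A", np.nan, "B"],
    "B": [np.nan, 4.0, 1.0, 6.0],
})

# NaN in the value column is skipped by mean(): group A averages only 4.0
means = toy.groupby("name")["B"].mean()

# NaN in the key column: that row is dropped by default (dropna=True) ...
default_keys = toy.groupby("name")["B"].sum()

# ... and kept as its own group when dropna=False
all_keys = toy.groupby("name", dropna=False)["B"].sum()

print(means)
print(default_keys)
print(all_keys)
```

With `dropna=False` the result gains an extra NaN-labelled group holding the row whose key was missing.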


In [41]:
# Replacing each NaN value with the mean of its own group
temp_df['B'] = temp_df.groupby('name')['B'].transform(lambda val: val.fillna(val.mean()))
temp_df

Unnamed: 0,name,B
0,A,4.0
1,A,4.0
2,B,5.5
3,B,5.0
4,B,6.0
5,C,5.0
6,C,8.0
7,C,2.0
