## Groupby operations

This operation lets us examine data on a per-category basis.

- In pandas, groupby() returns a 'lazy' groupby object; nothing is computed until an aggregate method is called on it
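
To make the 'lazy' behaviour concrete, here is a minimal sketch. The `toy` DataFrame and its values are made up for illustration and are not taken from mpg.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from mpg.csv
toy = pd.DataFrame({
    "cylinders": [4, 4, 6, 8, 8],
    "mpg": [30.0, 28.0, 20.0, 15.0, 13.0],
})

grouped = toy.groupby("cylinders")    # only a groupby object so far, no statistics yet
print(type(grouped).__name__)         # DataFrameGroupBy

mean_mpg = grouped["mpg"].mean()      # the aggregate call produces the actual result
print(mean_mpg)
```

The groupby object on its own only records how the rows are split; the numbers appear once mean() is called.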

In [1]:
import numpy as np
import pandas as pd

**TYPES OF COLUMNS**
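- Continuous column: numeric values measured on a scale, e.g. mpg or weight
- Categorical column: a limited set of distinct values, e.g. model_year or cylinders; these are the columns to group by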

#### Adding an aggregate method call

To use a grouped object, you need to tell pandas how you want to aggregate the data.

Common Options:

    mean(): Compute mean of groups
    sum(): Compute sum of group values
    size(): Compute group sizes (number of rows per group)
    count(): Compute count of non-null values in each group
    std(): Compute standard deviation of groups
    var(): Compute variance of groups
    sem(): Compute standard error of the mean of groups
    describe(): Generate descriptive statistics
    first(): Compute first of group values
    last(): Compute last of group values
    nth(): Take nth value, or a subset if n is a list
    min(): Compute min of group values
    max(): Compute max of group values
    
Full list in the online documentation: https://pandas.pydata.org/docs/reference/groupby.html
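
size() and count() are easy to confuse, so here is a small sketch of the difference. The toy data, including the missing value, is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing mpg value
toy = pd.DataFrame({
    "cylinders": [4, 4, 4, 8, 8],
    "mpg": [30.0, np.nan, 28.0, 15.0, 13.0],
})

sizes = toy.groupby("cylinders").size()            # rows per group, NaN rows included
counts = toy.groupby("cylinders")["mpg"].count()   # non-null mpg values per group

print(sizes)
print(counts)
```

For the 4-cylinder group, size() reports 3 rows while count() reports 2, because one mpg value is missing.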

In [2]:
df = pd.read_csv('mpg.csv')    #car dataset
df

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
394,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
395,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
396,28.0,4,120.0,79,2625,18.6,82,1,ford ranger


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   name          398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB
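
horsepower shows up as object rather than a numeric dtype, which usually means at least one value could not be parsed as a number. The sketch below assumes the non-numeric entries are placeholder strings such as '?'; that is an assumption about the file, not something shown in the output above. The toy values are made up.

```python
import pandas as pd

# Hypothetical toy column; the '?' placeholder is an assumption about the data
toy = pd.DataFrame({"horsepower": ["130", "165", "?", "150"]})

# errors='coerce' turns anything that cannot be parsed into NaN
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")

print(toy["horsepower"].dtype)         # float64
print(toy["horsepower"].isna().sum())  # 1
```

After a conversion like this, the column would be included in numeric aggregations such as mean(numeric_only=True).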


In [10]:
#model_year is a good categorical column for checking performance over the years

df['model_year'].unique()

array([70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], dtype=int64)

In [11]:
df['model_year'].value_counts()

73    40
78    36
76    34
82    31
75    30
70    29
79    29
80    29
81    29
71    28
72    28
77    28
74    27
Name: model_year, dtype: int64

In [3]:
#using groupby() to check performance over the years

#returns the mean of every numeric column for each model year
#model_year becomes the index
df.groupby('model_year').mean(numeric_only=True)

Unnamed: 0_level_0,mpg,cylinders,displacement,weight,acceleration,origin
model_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
70,17.689655,6.758621,281.413793,3372.793103,12.948276,1.310345
71,21.25,5.571429,209.75,2995.428571,15.142857,1.428571
72,18.714286,5.821429,218.375,3237.714286,15.125,1.535714
73,17.1,6.375,256.875,3419.025,14.3125,1.375
74,22.703704,5.259259,171.740741,2877.925926,16.203704,1.666667
75,20.266667,5.6,205.533333,3176.8,16.05,1.466667
76,21.573529,5.647059,197.794118,3078.735294,15.941176,1.470588
77,23.375,5.464286,191.392857,2997.357143,15.435714,1.571429
78,24.061111,5.361111,177.805556,2861.805556,15.805556,1.611111
79,25.093103,5.827586,206.689655,3055.344828,15.813793,1.275862


In [4]:
df.groupby('model_year').mean(numeric_only=True)['mpg']

model_year
70    17.689655
71    21.250000
72    18.714286
73    17.100000
74    22.703704
75    20.266667
76    21.573529
77    23.375000
78    24.061111
79    25.093103
80    33.696552
81    30.334483
82    31.709677
Name: mpg, dtype: float64
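
The cell above averages every numeric column and then keeps only mpg. A common alternative is to select the column before aggregating, so only that column is computed. A minimal sketch on made-up data (the `toy` values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not taken from mpg.csv
toy = pd.DataFrame({
    "model_year": [70, 70, 71, 71],
    "mpg": [18.0, 15.0, 25.0, 27.0],
    "name": ["car a", "car b", "car c", "car d"],
})

# Aggregate everything, then pick one column (as in the cell above)
after = toy.groupby("model_year").mean(numeric_only=True)["mpg"]

# Pick the column first, then aggregate only that column
before = toy.groupby("model_year")["mpg"].mean()

print(before)
print(after.equals(before))   # both approaches give the same Series
```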

In [5]:
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [23]:
#grouping by number of cylinders
#describe() returns summary statistics for each group

df.groupby('cylinders').describe()

Unnamed: 0_level_0,mpg,mpg,mpg,mpg,mpg,mpg,mpg,mpg,displacement,displacement,...,model_year,model_year,origin,origin,origin,origin,origin,origin,origin,origin
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
cylinders,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
3,4.0,20.55,2.564501,18.0,18.75,20.25,22.05,23.7,4.0,72.5,...,77.75,80.0,4.0,3.0,0.0,3.0,3.0,3.0,3.0,3.0
4,204.0,29.286765,5.710156,18.0,25.0,28.25,33.0,46.6,204.0,109.796569,...,80.0,82.0,204.0,1.985294,0.833285,1.0,1.0,2.0,3.0,3.0
5,3.0,27.366667,8.228204,20.3,22.85,25.4,30.9,36.4,3.0,145.0,...,79.5,80.0,3.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0
6,84.0,19.985714,3.807322,15.0,18.0,19.0,21.0,38.0,84.0,218.142857,...,78.0,82.0,84.0,1.190476,0.548298,1.0,1.0,1.0,1.0,3.0
8,103.0,14.963107,2.836284,9.0,13.0,14.0,16.0,26.6,103.0,345.009709,...,76.0,81.0,103.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


In [29]:
#transposing the above result for readability

df.groupby('cylinders').describe().transpose()

Unnamed: 0,cylinders,3,4,5,6,8
mpg,count,4.0,204.0,3.0,84.0,103.0
mpg,mean,20.55,29.286765,27.366667,19.985714,14.963107
mpg,std,2.564501,5.710156,8.228204,3.807322,2.836284
mpg,min,18.0,18.0,20.3,15.0,9.0
mpg,25%,18.75,25.0,22.85,18.0,13.0
mpg,50%,20.25,28.25,25.4,19.0,14.0
mpg,75%,22.05,33.0,30.9,21.0,16.0
mpg,max,23.7,46.6,36.4,38.0,26.6
displacement,count,4.0,204.0,3.0,84.0,103.0
displacement,mean,72.5,109.796569,145.0,218.142857,345.009709


### Multilevel (hierarchical) indexing

In [27]:
#grouping by two categories

#first group by model year then group by cylinders

df.groupby(['model_year','cylinders']).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
70,4,25.285714,107.0,2292.571429,16.0,2.285714
70,6,20.5,199.0,2710.5,15.5,1.0
70,8,14.111111,367.555556,3940.055556,11.194444,1.0
71,4,27.461538,101.846154,2056.384615,16.961538,1.923077
71,6,18.0,243.375,3171.875,14.75,1.0
71,8,13.428571,371.714286,4537.714286,12.214286,1.0
72,3,19.0,70.0,2330.0,13.5,3.0
72,4,23.428571,111.535714,2382.642857,17.214286,1.928571
72,8,13.615385,344.846154,4228.384615,13.0,1.0
73,3,18.0,70.0,2124.0,13.5,3.0


In [3]:
#grouping by three categories

new_df = df.groupby(['model_year','cylinders','name']).mean(numeric_only=True)
new_df.head(60)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
70,4,audi 100 ls,24.0,107.0,2430.0,14.5,2.0
70,4,bmw 2002,26.0,121.0,2234.0,12.5,2.0
70,4,datsun pl510,27.0,97.0,2130.0,14.5,3.0
70,4,peugeot 504,25.0,110.0,2672.0,17.5,2.0
70,4,saab 99e,25.0,104.0,2375.0,17.5,2.0
70,4,toyota corona mark ii,24.0,113.0,2372.0,15.0,3.0
70,4,volkswagen 1131 deluxe sedan,26.0,97.0,1835.0,20.5,2.0
70,6,amc gremlin,21.0,199.0,2648.0,15.0,1.0
70,6,amc hornet,18.0,199.0,2774.0,15.5,1.0
70,6,ford maverick,21.0,200.0,2587.0,16.0,1.0


- **Indexing and selection**

In [50]:
#the grouped DataFrame above has a three-level index

new_df.index

MultiIndex([(70, 4,                       'audi 100 ls'),
            (70, 4,                          'bmw 2002'),
            (70, 4,                      'datsun pl510'),
            (70, 4,                       'peugeot 504'),
            (70, 4,                          'saab 99e'),
            (70, 4,             'toyota corona mark ii'),
            (70, 4,      'volkswagen 1131 deluxe sedan'),
            (70, 6,                       'amc gremlin'),
            (70, 6,                        'amc hornet'),
            (70, 6,                     'ford maverick'),
            ...
            (82, 4,            'plymouth horizon miser'),
            (82, 4,        'pontiac j2000 se hatchback'),
            (82, 4,                   'pontiac phoenix'),
            (82, 4,                  'toyota celica gt'),
            (82, 4,                    'toyota corolla'),
            (82, 4,               'volkswagen rabbit l'),
            (82, 4,                         'vw pickup')

In [51]:
#index names

new_df.index.names

FrozenList(['model_year', 'cylinders', 'name'])

In [53]:
#index levels
#i.e., the unique values of each of the three index levels

new_df.index.levels

FrozenList([[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], [3, 4, 5, 6, 8], ['amc ambassador brougham', 'amc ambassador dpl', 'amc ambassador sst', 'amc concord', 'amc concord d/l', 'amc concord dl', 'amc concord dl 6', 'amc gremlin', 'amc hornet', 'amc hornet sportabout (sw)', 'amc matador', 'amc matador (sw)', 'amc pacer', 'amc pacer d/l', 'amc rebel sst', 'amc spirit dl', 'audi 100 ls', 'audi 100ls', 'audi 4000', 'audi 5000', 'audi 5000s (diesel)', 'audi fox', 'bmw 2002', 'bmw 320i', 'buick century', 'buick century 350', 'buick century limited', 'buick century luxus (sw)', 'buick century special', 'buick electra 225 custom', 'buick estate wagon (sw)', 'buick lesabre custom', 'buick opel isuzu deluxe', 'buick regal sport coupe (turbo)', 'buick skyhawk', 'buick skylark', 'buick skylark 320', 'buick skylark limited', 'cadillac eldorado', 'cadillac seville', 'capri ii', 'chevroelt chevelle malibu', 'chevrolet bel air', 'chevrolet camaro', 'chevrolet caprice classic', 'chevrolet c

In [54]:
#indexing a multilevel index works from the outermost level inward

#returning data for 1970
new_df.loc[70]

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
cylinders,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
4,audi 100 ls,24.0,107.0,2430.0,14.5,2.0
4,bmw 2002,26.0,121.0,2234.0,12.5,2.0
4,datsun pl510,27.0,97.0,2130.0,14.5,3.0
4,peugeot 504,25.0,110.0,2672.0,17.5,2.0
4,saab 99e,25.0,104.0,2375.0,17.5,2.0
4,toyota corona mark ii,24.0,113.0,2372.0,15.0,3.0
4,volkswagen 1131 deluxe sedan,26.0,97.0,1835.0,20.5,2.0
6,amc gremlin,21.0,199.0,2648.0,15.0,1.0
6,amc hornet,18.0,199.0,2774.0,15.5,1.0
6,ford maverick,21.0,200.0,2587.0,16.0,1.0


In [59]:
#returning 1970 and 1982 data

#pass a list to select several labels on the same level

new_df.loc[[70,82]]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
70,4,audi 100 ls,24.0,107.0,2430.0,14.5,2.0
70,4,bmw 2002,26.0,121.0,2234.0,12.5,2.0
70,4,datsun pl510,27.0,97.0,2130.0,14.5,3.0
70,4,peugeot 504,25.0,110.0,2672.0,17.5,2.0
70,4,saab 99e,25.0,104.0,2375.0,17.5,2.0
70,4,toyota corona mark ii,24.0,113.0,2372.0,15.0,3.0
70,4,volkswagen 1131 deluxe sedan,26.0,97.0,1835.0,20.5,2.0
70,6,amc gremlin,21.0,199.0,2648.0,15.0,1.0
70,6,amc hornet,18.0,199.0,2774.0,15.5,1.0
70,6,ford maverick,21.0,200.0,2587.0,16.0,1.0


In [65]:
#returning 1970 data for 6 cylinders

#pass a tuple to select one label per level, outermost level first

new_df.loc[(70,6)]

Unnamed: 0_level_0,mpg,displacement,weight,acceleration,origin
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
amc gremlin,21.0,199.0,2648.0,15.0,1.0
amc hornet,18.0,199.0,2774.0,15.5,1.0
ford maverick,21.0,200.0,2587.0,16.0,1.0
plymouth duster,22.0,198.0,2833.0,15.5,1.0


**CROSS SECTION**

**(a general way to select from any level of a multilevel index)**

- **method xs()**

In [5]:
#grabbing data for model year 1970

new_df.xs(key = 70 , axis = 0 , level = 'model_year')

#PARAMETERS:
#key is the index label we're looking for
#axis = 0 by default (rows)
#level is the name of the index level we're looking in

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
cylinders,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
4,audi 100 ls,24.0,107.0,2430.0,14.5,2.0
4,bmw 2002,26.0,121.0,2234.0,12.5,2.0
4,datsun pl510,27.0,97.0,2130.0,14.5,3.0
4,peugeot 504,25.0,110.0,2672.0,17.5,2.0
4,saab 99e,25.0,104.0,2375.0,17.5,2.0
4,toyota corona mark ii,24.0,113.0,2372.0,15.0,3.0
4,volkswagen 1131 deluxe sedan,26.0,97.0,1835.0,20.5,2.0
6,amc gremlin,21.0,199.0,2648.0,15.0,1.0
6,amc hornet,18.0,199.0,2774.0,15.5,1.0
6,ford maverick,21.0,200.0,2587.0,16.0,1.0


In [7]:
#grabbing 4 cylinder data for each year

new_df.xs(key = 4 , level = 'cylinders')

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
model_year,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
70,audi 100 ls,24.0,107.0,2430.0,14.5,2.0
70,bmw 2002,26.0,121.0,2234.0,12.5,2.0
70,datsun pl510,27.0,97.0,2130.0,14.5,3.0
70,peugeot 504,25.0,110.0,2672.0,17.5,2.0
70,saab 99e,25.0,104.0,2375.0,17.5,2.0
...,...,...,...,...,...,...
82,pontiac phoenix,27.0,151.0,2735.0,18.0,1.0
82,toyota celica gt,32.0,144.0,2665.0,13.9,3.0
82,toyota corolla,34.0,108.0,2245.0,16.9,3.0
82,volkswagen rabbit l,36.0,105.0,1980.0,15.3,2.0
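
xs() can also fix one label on each of several levels at once by passing a tuple as key and a list as level. A sketch on invented data (`toy` and `toy_grouped` are hypothetical stand-ins for df and new_df):

```python
import pandas as pd

# Hypothetical toy data standing in for the car dataset
toy = pd.DataFrame({
    "model_year": [70, 70, 70, 82],
    "cylinders": [4, 4, 6, 4],
    "name": ["car a", "car b", "car c", "car d"],
    "mpg": [24.0, 26.0, 21.0, 36.0],
})
toy_grouped = toy.groupby(["model_year", "cylinders", "name"]).mean(numeric_only=True)

# One label per listed level: model year 70 and 4 cylinders
subset = toy_grouped.xs(key=(70, 4), level=["model_year", "cylinders"])
print(subset)
```

The two selected levels are dropped from the result, so only name remains in the index.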


In [9]:
#grabbing 6 and 8 cylinder data for each year

#IMPORTANT

#xs() cannot select several labels from one level because key takes a single label per level
#so filter the original DataFrame with isin() instead
six_eight_cyl = df[df['cylinders'].isin([6,8])]
six_eight_cyl

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
365,20.2,6,200.0,88,3060,17.1,81,1,ford granada gl
366,17.6,6,225.0,85,3465,16.6,81,1,chrysler lebaron salon
386,25.0,6,181.0,110,2945,16.4,82,1,buick century limited
387,38.0,6,262.0,85,3015,17.0,82,1,oldsmobile cutlass ciera (diesel)


In [11]:
#performing statistical operation on the filtered data above using groupby()

six_eight_cyl.groupby(['model_year','cylinders']).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
70,6,20.5,199.0,2710.5,15.5,1.0
70,8,14.111111,367.555556,3940.055556,11.194444,1.0
71,6,18.0,243.375,3171.875,14.75,1.0
71,8,13.428571,371.714286,4537.714286,12.214286,1.0
72,8,13.615385,344.846154,4228.384615,13.0,1.0
73,6,19.0,212.25,2917.125,15.6875,1.25
73,8,13.2,365.25,4279.05,12.25,1.0
74,6,17.857143,230.428571,3320.0,16.857143,1.0
74,8,14.2,315.2,4438.4,14.7,1.0
75,6,17.583333,233.75,3398.333333,17.708333,1.0


In [15]:
#performing statistical operations using groupby() and xs()

new_df.xs(key = 80 , level = 'model_year').groupby('cylinders').mean()

Unnamed: 0_level_0,mpg,displacement,weight,acceleration,origin
cylinders,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
3,23.7,70.0,2420.0,12.5,3.0
4,34.612,111.0,2360.08,17.144,2.2
5,36.4,121.0,2950.0,19.9,2.0
6,25.9,196.5,3145.5,15.05,2.0


In [23]:
#swapping index levels
new_df.swaplevel()

#by default, swaps the two innermost levels of the index

#PARAMETERS:
#swaplevel(i , j)
#the index levels at positions i and j are swapped

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,mpg,displacement,weight,acceleration,origin
model_year,name,cylinders,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
70,audi 100 ls,4,24.0,107.0,2430.0,14.5,2.0
70,bmw 2002,4,26.0,121.0,2234.0,12.5,2.0
70,datsun pl510,4,27.0,97.0,2130.0,14.5,3.0
70,peugeot 504,4,25.0,110.0,2672.0,17.5,2.0
70,saab 99e,4,25.0,104.0,2375.0,17.5,2.0
...,...,...,...,...,...,...,...
82,volkswagen rabbit l,4,36.0,105.0,1980.0,15.3,2.0
82,vw pickup,4,44.0,97.0,2130.0,24.6,2.0
82,buick century limited,6,25.0,181.0,2945.0,16.4,1.0
82,ford granada l,6,22.0,232.0,2835.0,14.7,1.0


In [24]:
#swaps the model_year and cylinders index levels
new_df.swaplevel(0,1)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,mpg,displacement,weight,acceleration,origin
cylinders,model_year,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
4,70,audi 100 ls,24.0,107.0,2430.0,14.5,2.0
4,70,bmw 2002,26.0,121.0,2234.0,12.5,2.0
4,70,datsun pl510,27.0,97.0,2130.0,14.5,3.0
4,70,peugeot 504,25.0,110.0,2672.0,17.5,2.0
4,70,saab 99e,25.0,104.0,2375.0,17.5,2.0
4,...,...,...,...,...,...,...
4,82,volkswagen rabbit l,36.0,105.0,1980.0,15.3,2.0
4,82,vw pickup,44.0,97.0,2130.0,24.6,2.0
6,82,buick century limited,25.0,181.0,2945.0,16.4,1.0
6,82,ford granada l,22.0,232.0,2835.0,14.7,1.0


#### SORTING IN MULTILEVEL INDEX

In [25]:
#sorting in descending order by model year

new_df.sort_index(level = 'model_year',ascending = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
82,6,oldsmobile cutlass ciera (diesel),38.0,262.0,3015.0,17.0,1.0
82,6,ford granada l,22.0,232.0,2835.0,14.7,1.0
82,6,buick century limited,25.0,181.0,2945.0,16.4,1.0
82,4,vw pickup,44.0,97.0,2130.0,24.6,2.0
82,4,volkswagen rabbit l,36.0,105.0,1980.0,15.3,2.0
...,...,...,...,...,...,...,...
70,4,saab 99e,25.0,104.0,2375.0,17.5,2.0
70,4,peugeot 504,25.0,110.0,2672.0,17.5,2.0
70,4,datsun pl510,27.0,97.0,2130.0,14.5,3.0
70,4,bmw 2002,26.0,121.0,2234.0,12.5,2.0


In [26]:
#sorting by an inner level is not recommended because it breaks up the grouping of the outer level

new_df.sort_index(level = 'cylinders',ascending = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
81,8,oldsmobile cutlass ls,26.6,350.0,3725.0,19.0,1.0
79,8,oldsmobile cutlass salon brougham,23.9,260.0,3420.0,22.2,1.0
79,8,mercury grand marquis,16.5,351.0,3955.0,13.2,1.0
79,8,ford ltd landau,17.6,302.0,3725.0,13.4,1.0
79,8,ford country squire (sw),15.5,351.0,4054.0,14.3,1.0
...,...,...,...,...,...,...,...
70,4,audi 100 ls,24.0,107.0,2430.0,14.5,2.0
80,3,mazda rx-7 gs,23.7,70.0,2420.0,12.5,3.0
77,3,mazda rx-4,21.5,80.0,2720.0,13.5,3.0
73,3,maxda rx3,18.0,70.0,2124.0,13.5,3.0
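
To sort by an inner level without breaking up the outer grouping, sort_index() accepts a list of levels together with a list of sort directions. A sketch on made-up data (`toy` and `toy_grouped` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the car dataset
toy = pd.DataFrame({
    "model_year": [70, 70, 82, 82],
    "cylinders": [4, 6, 4, 6],
    "name": ["car a", "car b", "car c", "car d"],
    "mpg": [24.0, 21.0, 36.0, 25.0],
})
toy_grouped = toy.groupby(["model_year", "cylinders", "name"]).mean(numeric_only=True)

# model_year descending, cylinders ascending within each year
ordered = toy_grouped.sort_index(
    level=["model_year", "cylinders"],
    ascending=[False, True],
)
print(ordered)
```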


### agg() method to customize aggregate functions for particular columns

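**a more flexible alternative to a single aggregate call (a method such as mean() runs one function on all columns, while agg() can run different functions on different columns)**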

In [33]:
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   name          398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


In [31]:
#standard deviation and mean for all numeric columns

df.drop(['horsepower', 'name'],axis=1).agg(['std','mean'])
#horsepower and name are dropped because they are stored as object (non-numeric) columns

Unnamed: 0,mpg,cylinders,displacement,weight,acceleration,model_year,origin
std,7.815984,1.701004,104.269838,846.841774,2.757689,3.697627,0.802055
mean,23.514573,5.454774,193.425879,2970.424623,15.56809,76.01005,1.572864


In [36]:
#using a dictionary to customize aggregate functions for particular columns
#NaN marks a function that was not requested for that column

df.drop(['horsepower','name'],axis=1).agg({'mpg':['std','mean'],'displacement':'std','weight':['max','mean']})

Unnamed: 0,mpg,displacement,weight
std,7.815984,104.269838,
mean,23.514573,,2970.424623
max,,,5140.0
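
The cells above call agg() on the whole DataFrame. The same dictionary form also works on a groupby object, and named aggregation gives flat, readable column names. A sketch on invented data; `toy` and the output column names `mean_mpg` and `max_weight` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the car dataset
toy = pd.DataFrame({
    "model_year": [70, 70, 82, 82],
    "mpg": [18.0, 16.0, 36.0, 44.0],
    "weight": [3504, 3433, 1980, 2130],
})

# Dictionary form: different functions for different columns, per group
per_year = toy.groupby("model_year").agg({"mpg": ["mean", "std"], "weight": "max"})

# Named aggregation: output_name=(column, function)
named = toy.groupby("model_year").agg(
    mean_mpg=("mpg", "mean"),
    max_weight=("weight", "max"),
)

print(per_year)
print(named)
```

The dictionary form produces two-level column labels such as ('mpg', 'mean'), while named aggregation keeps a single level of column names.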
