# Agenda: Grouping

1. Simple grouping
2. More complex grouping
    - More than one grouping column
    - More than one calculating column
    - More than one aggregation method
3. Pivot tables (2D grouping)
4. Moving rows to columns and back

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
filename = '../data/taxi.csv'

df = pd.read_csv(filename)

In [3]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [6]:
# I want to find out the mean trip_distance where passenger_count is 1

df.loc[  
    df['passenger_count'] == 1   # row selector
    ,
    'trip_distance' # column selector
].mean()

3.0923380047176354

In [7]:
# find out the mean trip_distance where passenger_count is 2

df.loc[  
    df['passenger_count'] == 2   # row selector
    ,
    'trip_distance' # column selector
].mean()

3.3843869002284848

In [8]:
# find out the mean trip_distance where passenger_count is 3

df.loc[  
    df['passenger_count'] == 3   # row selector
    ,
    'trip_distance' # column selector
].mean()

3.3423891625615765

In [9]:
# what we really want is: For every unique value of passenger_count,
# calculate the mean of trip_distance on those rows

# this is the essence of *grouping*

# - we have an aggregation method
# - we want to run it on a numeric column
# - we want to get a separate result for each unique value of a categorical column

# the syntax is:

# .groupby(categorical_column)[numeric_column].agg_method()
df.groupby('passenger_count')['trip_distance'].mean()

passenger_count
0    4.600000
1    3.092338
2    3.384387
3    3.342389
4    3.628901
5    3.182712
6    3.170976
Name: trip_distance, dtype: float64
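
To check the pattern by hand, here is a minimal sketch on a tiny frame. The values are invented for illustration and do not come from `taxi.csv`. It compares the `groupby` result with the one-value-at-a-time `.loc` approach from the earlier cells.

```python
import pandas as pd

# toy data, invented for illustration (not from taxi.csv)
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2, 2, 3],
    'trip_distance':   [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],
})

# one mean per unique value of passenger_count, in a single call
grouped_means = toy.groupby('passenger_count')['trip_distance'].mean()

# the same numbers, computed one passenger_count at a time with .loc
manual_means = {
    n: toy.loc[toy['passenger_count'] == n, 'trip_distance'].mean()
    for n in toy['passenger_count'].unique()
}

print(grouped_means)
print(manual_means)
```

Both approaches should agree. The difference is that `groupby` does not require you to know the unique values in advance.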

In [10]:
# what does this give me?
df.groupby('passenger_count')['trip_distance']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f0e8314e060>
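
The `SeriesGroupBy` object has not calculated anything yet. It only records how the rows are split up, and waits for an aggregation method. The following sketch, again on invented toy data, iterates over such an object to show what is inside, and then reuses it with two different methods.

```python
import pandas as pd

# toy data, invented for illustration (not from taxi.csv)
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2, 2, 3],
    'trip_distance':   [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],
})

g = toy.groupby('passenger_count')['trip_distance']

# iterating yields (group key, Series with that group's values) pairs
group_sizes = {}
for key, values in g:
    group_sizes[key] = len(values)
    print(key, values.tolist())

# the same grouped object can be aggregated in different ways
means = g.mean()
stds = g.std()
print(means)
print(stds)
```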

In [11]:
df.groupby('passenger_count')['trip_distance'].std()   # standard deviation

passenger_count
0    4.666905
1    4.020187
2    4.242826
3    3.822041
4    4.351369
5    3.969468
6    3.759807
Name: trip_distance, dtype: float64

In [13]:
# we can try other categorical columns

df['tip_percentage'] = df['tip_amount'] / df['total_amount']   # let's add a new column indicating the percentage of tip
# note: the result is a fraction rather than a percentage (0.09 means 9%)

In [16]:
# now let's see if the percentage changes per passenger_count

df.groupby('passenger_count')['tip_percentage'].mean()

passenger_count
0    0.183127
1    0.092880
2    0.088309
3    0.087368
4    0.077067
5    0.094349
6    0.086075
Name: tip_percentage, dtype: float64

In [17]:
# do we see different tip percentages according to the manufacturer of the computer in the taxi?
df.groupby('VendorID')['tip_percentage'].mean()

VendorID
1    0.091538
2    0.091680
Name: tip_percentage, dtype: float64

# Why not sort the `groupby` results?

1. If you're going to sort by values immediately afterward, why spend the time/CPU sorting by group key first?
2. Sorting takes time, and on a large data set you might not want to spend it.
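
By default, `groupby` sorts its result by the group keys. Passing `sort=False` skips that step and leaves the groups in order of first appearance. A minimal sketch, on invented toy data, assuming you plan to sort by values afterward anyway:

```python
import pandas as pd

# toy data, invented for illustration (not from taxi.csv)
toy = pd.DataFrame({
    'passenger_count': [3, 1, 2, 1, 2, 2],
    'trip_distance':   [5.0, 1.0, 2.0, 3.0, 4.0, 6.0],
})

# default: the result's index is sorted by group key
sorted_keys = toy.groupby('passenger_count')['trip_distance'].mean()

# sort=False: groups appear in the order they were first seen
unsorted_keys = toy.groupby('passenger_count', sort=False)['trip_distance'].mean()

# sort by the calculated values, which is often what we wanted anyway
by_value = unsorted_keys.sort_values(ascending=False)

print(sorted_keys)
print(unsorted_keys)
print(by_value)
```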

In [18]:
# what if we group by a non-categorical (continuous) column?
# (don't do this! nearly every distinct value becomes its own tiny group)

# for every distinct value of trip_distance
# find the mean total_amount
df.groupby('trip_distance')['total_amount'].mean()

trip_distance
0.00      31.58194
0.01      52.80000
0.02      43.46000
0.03       3.96000
0.04      70.01000
           ...    
34.84    137.59000
35.51    135.13000
37.20    210.14000
60.30    160.05000
64.60     79.96000
Name: total_amount, Length: 1219, dtype: float64
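
One common workaround, which is not part of the notebook above, is to turn the continuous column into a small number of categories before grouping. The sketch below uses `pd.cut` on invented toy data. The bin edges and labels are arbitrary assumptions chosen only for illustration.

```python
import pandas as pd

# toy data, invented for illustration (not from taxi.csv)
toy = pd.DataFrame({
    'trip_distance': [0.5, 1.5, 2.5, 7.0, 12.0, 0.8],
    'total_amount':  [5.0, 9.0, 12.0, 25.0, 40.0, 6.0],
})

# arbitrary bin edges and labels (assumptions, not from the source)
bins = [0, 2, 10, 100]
labels = ['short', 'medium', 'long']

# each distance falls into (0, 2], (2, 10], or (10, 100]
toy['distance_bin'] = pd.cut(toy['trip_distance'], bins=bins, labels=labels)

# now there are only three groups instead of one per distinct distance
binned_means = toy.groupby('distance_bin', observed=True)['total_amount'].mean()
print(binned_means)
```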

In [21]:
!ls ../data/*csv*

../data/2020_sharing_data_outside.csv  ../data/olympic_athlete_events.csv
../data/CPILFESL.csv		       ../data/san+francisco,ca.csv
../data/albany,ny.csv		       ../data/sat-scores.csv
../data/boston,ma.csv		       ../data/skyscrapers.csv
../data/burrito_current.csv	       ../data/springfield,il.csv
../data/celebrity_deaths_2016.csv      ../data/springfield,ma.csv
../data/chicago,il.csv		       ../data/taxi-distance.csv
../data/eu_cpi.csv		       ../data/taxi-passenger-count.csv
../data/eu_gdp.csv		       ../data/taxi.csv
../data/ice-cream.csv		       ../data/titanic3.csv
../data/languages.csv		       ../data/us-median-cpi.csv
../data/los+angeles,ca.csv	       ../data/us-unemployment-rate.csv
../data/miles-traveled.csv	       ../data/us_gdp.csv
../data/new+york,ny.csv		       ../data/winemag-150k-reviews.csv
../data/oecd_locations.csv	       ../data/wti-daily.csv
../data/oecd_tourism.csv


In [22]:
!head ../data/olympic_athlete_events.csv

"ID","Name","Sex","Age","Height","Weight","Team","NOC","Games","Year","Season","City","Sport","Event","Medal"
"1","A Dijiang","M",24,180,80,"China","CHN","1992 Summer",1992,"Summer","Barcelona","Basketball","Basketball Men's Basketball",NA
"2","A Lamusi","M",23,170,60,"China","CHN","2012 Summer",2012,"Summer","London","Judo","Judo Men's Extra-Lightweight",NA
"3","Gunnar Nielsen Aaby","M",24,NA,NA,"Denmark","DEN","1920 Summer",1920,"Summer","Antwerpen","Football","Football Men's Football",NA
"4","Edgar Lindenau Aabye","M",34,NA,NA,"Denmark/Sweden","DEN","1900 Summer",1900,"Summer","Paris","Tug-Of-War","Tug-Of-War Men's Tug-Of-War","Gold"
"5","Christine Jacoba Aaftink","F",21,185,82,"Netherlands","NED","1988 Winter",1988,"Winter","Calgary","Speed Skating","Speed Skating Women's 500 metres",NA
"5","Christine Jacoba Aaftink","F",21,185,82,"Netherlands","NED","1988 Winter",1988,"Winter","Calgary","Speed Skating","Speed Skating Women's 1,000 metres",NA
"5","Christine Jacoba Aaftink","F",25,1

# Exercise: Olympic calculations

1. Load the file `../data/olympic_athlete_events.csv` into a data frame.
2. Find the mean age for people in each sport.
3. Find the mean height for people in each sport, using only the rows where `Year` is after 1960.