In [1]:
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
Vegas = pd.read_csv('vegas.csv')
# Normalize column names: optional dots plus whitespace become '_', then lowercase.
Vegas.columns = Vegas.columns.str.replace(r'\.*\s+', '_', regex=True).str.lower()

# Grouping a DataFrame

Let's turn our attention from grouping a single `Series` to grouping a `DataFrame`.  The basic idea is the same, but there are some extra issues that you need to know about.

When applying `groupby` to a `DataFrame`, you can pass in a `Series`, as before.  More conveniently, you can pass in the name of the column you want to group by.

In [5]:
by_hotel = Vegas.groupby('hotel_name')
by_hotel

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb798481f50>
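To illustrate that passing a column name and passing the `Series` itself define the same groups, here is a minimal sketch. The `toy` frame and its values are made up for illustration and are not taken from the Vegas file.

```python
import pandas as pd

# Hypothetical toy data standing in for the Vegas reviews.
toy = pd.DataFrame({
    'hotel_name': ['A', 'A', 'B', 'B'],
    'score': [5, 3, 4, 2],
})

# Group by the column name ...
by_name = toy.groupby('hotel_name')['score'].mean()

# ... and by the Series itself.
by_series = toy.groupby(toy['hotel_name'])['score'].mean()

# Both approaches should produce the same per-hotel means.
print(by_name.equals(by_series))
```

The column-name form is shorter, which is why the rest of this section uses it.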

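Let's pull out one group to see what it looks like.  The `get_group` method returns the rows that belong to a single key.  Here we select the `score` column for the hotel in the first row and take its mean.

In [8]:
by_hotel.score.get_group(Vegas.hotel_name[0]).mean()

3.2083333333333335

This is the mean score for Circus Circus alone.  Without the column selection and the call to `mean()`, `get_group` would return a `DataFrame` that contains just the rows for Circus Circus.  Let's try applying the `mean()` method to every group at once.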

In [5]:
by_hotel.mean(numeric_only=True)

| hotel_name | nr_reviews | nr_hotel_reviews | helpful_votes | score | nr_rooms | member_years |
|---|---|---|---|---|---|---|
| Bellagio Las Vegas | 27.71 | 10.21 | 24.88 | 4.21 | 3933.0 | 3.42 |
| Caesars Palace | 38.0 | 15.29 | 26.38 | 4.12 | 3348.0 | 4.75 |
| Circus Circus Hotel & Casino Las Vegas | 29.21 | 7.79 | 18.5 | 3.21 | 3773.0 | 3.83 |
| Encore at wynn Las Vegas | 57.17 | 16.33 | 36.75 | 4.54 | 2034.0 | 4.75 |
| Excalibur Hotel & Casino | 45.58 | 22.12 | 31.75 | 3.71 | 3981.0 | 4.5 |
| Hilton Grand Vacations at the Flamingo | 48.42 | 23.04 | 32.46 | 3.96 | 315.0 | 3.96 |
| Hilton Grand Vacations on the Boulevard | 36.92 | 15.54 | 20.83 | 4.17 | 1228.0 | 4.79 |
| Marriott's Grand Chateau | 90.04 | 30.42 | 57.33 | 4.54 | 732.0 | 4.38 |
| Monte Carlo Resort&Casino | 71.83 | 17.46 | 39.25 | 3.29 | 3003.0 | 3.58 |
| Paris Las Vegas | 51.21 | 11.46 | 26.83 | 4.04 | 2916.0 | 3.62 |


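We get a `DataFrame` containing the mean of each numeric variable.  Notice that we lost some columns in the process.  The non-numeric columns are considered nuisance columns and are excluded.  Older versions of pandas drop them automatically; in pandas 2.0 and later, pass `numeric_only=True` to get the same behavior, since `mean()` otherwise raises an error on non-numeric columns.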
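As a small sketch of how non-numeric columns are excluded from the result, the example below uses a made-up `toy` frame (not the Vegas data) and passes `numeric_only=True` explicitly, which works on both older and newer pandas releases.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric column besides the key.
toy = pd.DataFrame({
    'hotel_name': ['A', 'A', 'B'],
    'traveler_type': ['Solo', 'Couples', 'Solo'],  # non-numeric
    'score': [5, 3, 4],
})

# Only numeric columns are averaged; 'traveler_type' is left out.
means = toy.groupby('hotel_name').mean(numeric_only=True)
print(means.columns.tolist())
```

The group key itself moves into the index, so only `score` remains as a column.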

We can also pass in a list of variable names to serve as group keys.

In [6]:
Vegas.groupby(['traveler_type', 'pool']).mean(numeric_only=True)

| traveler_type | pool | nr_reviews | nr_hotel_reviews | helpful_votes | score | nr_rooms | member_years |
|---|---|---|---|---|---|---|---|
| Business | NO | 119.0 | 21.0 | 75.0 | 3.0 | 3773.0 | 3.0 |
| Business | YES | 71.36 | 23.71 | 41.82 | 3.89 | 1940.22 | 4.18 |
| Couples | NO | 20.43 | 6.86 | 18.0 | 2.71 | 3773.0 | 4.71 |
| Couples | YES | 46.54 | 15.06 | 32.58 | 4.29 | 2249.36 | 4.43 |
| Families | NO | 41.62 | 9.5 | 20.0 | 3.38 | 3773.0 | 3.0 |
| Families | YES | 45.38 | 14.7 | 26.67 | 4.07 | 1933.38 | 4.36 |
| Friends | NO | 14.43 | 5.29 | 11.57 | 3.43 | 3773.0 | 3.57 |
| Friends | YES | 42.01 | 15.39 | 31.13 | 4.33 | 2099.25 | 4.39 |
| Solo | NO | 5.0 | 5.0 | 2.0 | 4.0 | 3773.0 | 7.0 |
| Solo | YES | 40.65 | 16.83 | 30.74 | 3.91 | 2370.52 | -73.91 |


Some of these columns make a lot of sense, and some aren't that interesting to us.  What does the mean number of rooms mean for this unusual sample of travelers?  A very common operation, then, is to select a few columns when grouping a `DataFrame`.  Here's how we would do that.

In [7]:
# Selecting several columns requires a list, hence the double brackets.
Vegas.groupby(['traveler_type', 'pool'])[['score', 'member_years']].mean()

| traveler_type | pool | score | member_years |
|---|---|---|---|
| Business | NO | 3.0 | 3.0 |
| Business | YES | 3.89 | 4.18 |
| Couples | NO | 2.71 | 4.71 |
| Couples | YES | 4.29 | 4.43 |
| Families | NO | 3.38 | 3.0 |
| Families | YES | 4.07 | 4.36 |
| Friends | NO | 3.43 | 3.57 |
| Friends | YES | 4.33 | 4.39 |
| Solo | NO | 4.0 | 7.0 |
| Solo | YES | 3.91 | -73.91 |


We have a `groupby` followed by a column selection.  You could write this in the opposite order: first pull out the columns you want, then group.

In [8]:
trav_pool_df = Vegas[['score', 'member_years']].groupby([Vegas['traveler_type'], Vegas['pool']]).mean()
trav_pool_df

| traveler_type | pool | score | member_years |
|---|---|---|---|
| Business | NO | 3.0 | 3.0 |
| Business | YES | 3.89 | 4.18 |
| Couples | NO | 2.71 | 4.71 |
| Couples | YES | 4.29 | 4.43 |
| Families | NO | 3.38 | 3.0 |
| Families | YES | 4.07 | 4.36 |
| Friends | NO | 3.43 | 3.57 |
| Friends | YES | 4.33 | 4.39 |
| Solo | NO | 4.0 | 7.0 |
| Solo | YES | 3.91 | -73.91 |


Under the hood, both expressions are treated the same way.  However, the first one (group, then select columns) is easier to write.
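To check that the two orderings agree, here is a minimal sketch on a made-up `toy` frame. The data, and the names `keys` and `cols`, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data with two group keys and an extra numeric column.
toy = pd.DataFrame({
    'traveler_type': ['Solo', 'Solo', 'Couples', 'Couples'],
    'pool': ['YES', 'NO', 'YES', 'YES'],
    'score': [4, 3, 5, 4],
    'member_years': [2, 7, 4, 6],
    'nr_rooms': [100, 200, 300, 400],
})
keys = ['traveler_type', 'pool']
cols = ['score', 'member_years']

# Group first, then select the columns of interest.
group_then_select = toy.groupby(keys)[cols].mean()

# Select first, then group by the key Series taken from the full frame.
select_then_group = toy[cols].groupby([toy[k] for k in keys]).mean()

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(group_then_select, select_then_group)
```

In the second form the keys must be passed as `Series`, because the selected frame no longer contains the key columns.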

### Grouping with an Index

So far, we've always used a column, passed either as a `Series` or by name, for our group key.  It's also possible to use dictionaries and functions when grouping.

For this demo, the data to group by will need to be in the `Index`.  Let's begin by setting the index to the `hotel_name` column.

In [9]:
Vegas.set_index('hotel_name').head()

| hotel_name | user_country | nr_reviews | nr_hotel_reviews | helpful_votes | score | period_of_stay | traveler_type | pool | gym | tennis_court | spa | casino | free_internet | hotel_stars | nr_rooms | user_continent | member_years | review_month | review_weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Circus Circus Hotel & Casino Las Vegas | USA | 11 | 4 | 13 | 5 | Dec-Feb | Friends | NO | YES | NO | NO | YES | YES | 3 | 3773 | North America | 9 | January | Thursday |
| Circus Circus Hotel & Casino Las Vegas | USA | 119 | 21 | 75 | 3 | Dec-Feb | Business | NO | YES | NO | NO | YES | YES | 3 | 3773 | North America | 3 | January | Friday |
| Circus Circus Hotel & Casino Las Vegas | USA | 36 | 9 | 25 | 5 | Mar-May | Families | NO | YES | NO | NO | YES | YES | 3 | 3773 | North America | 2 | February | Saturday |
| Circus Circus Hotel & Casino Las Vegas | UK | 14 | 7 | 14 | 4 | Mar-May | Friends | NO | YES | NO | NO | YES | YES | 3 | 3773 | Europe | 6 | February | Friday |
| Circus Circus Hotel & Casino Las Vegas | Canada | 5 | 5 | 2 | 4 | Mar-May | Solo | NO | YES | NO | NO | YES | YES | 3 | 3773 | North America | 7 | March | Tuesday |


Now we can define a function that operates on the hotel name and pass this into `groupby`.

In [1]:
def f(name):
    if 'Circus' in name or 'Flamingo' in name:
        return 'preferred'
    else:
        return 'non-preferred'

In [11]:
Vegas.set_index('hotel_name').groupby(f).mean(numeric_only=True)

|  | nr_reviews | nr_hotel_reviews | helpful_votes | score | nr_rooms | member_years |
|---|---|---|---|---|---|---|
| non-preferred | 49.11 | 16.09 | 32.41 | 4.18 | 2212.42 | 0.44 |
| preferred | 38.81 | 15.42 | 25.48 | 3.58 | 2044.0 | 3.9 |


We could have done the same thing with a dictionary that maps hotel names to group keys.
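As a rough sketch of the dictionary approach, the example below builds a mapping from index labels to group keys and compares it with grouping by a function. The `toy` frame, its hotel labels, and the name `mapping` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with hotel names in the index.
toy = pd.DataFrame(
    {'score': [3, 4, 5, 4]},
    index=pd.Index(
        ['Circus Circus', 'Flamingo', 'Bellagio', 'Bellagio'],
        name='hotel_name',
    ),
)

def f(name):
    """Classify a hotel name, mirroring the function used above."""
    if 'Circus' in name or 'Flamingo' in name:
        return 'preferred'
    return 'non-preferred'

# Dictionary from each distinct index label to its group key.
mapping = {name: f(name) for name in toy.index.unique()}

by_func = toy.groupby(f).mean()
by_dict = toy.groupby(mapping).mean()

# Both groupings should yield the same group means.
print(by_func['score'].to_dict())
print(by_dict['score'].to_dict())
```

A dictionary is convenient when the assignment of labels to groups is an explicit lookup table rather than a rule you can compute from the label.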

Finally, it's possible to group by a level of a multi-index.  We could do this, for example, on the traveler-pool `DataFrame` we constructed earlier.

In [12]:
trav_pool_df.groupby(level=1).mean()

| pool | score | member_years |
|---|---|---|
| NO | 3.3 | 4.26 |
| YES | 4.1 | -11.31 |
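A level can be given by position or by name. The sketch below shows both on a small, made-up frame with a two-level index; the values are not from the Vegas data.

```python
import pandas as pd

# Hypothetical frame indexed by (traveler_type, pool).
idx = pd.MultiIndex.from_tuples(
    [('Solo', 'NO'), ('Solo', 'YES'), ('Couples', 'NO'), ('Couples', 'YES')],
    names=['traveler_type', 'pool'],
)
toy = pd.DataFrame({'score': [4.0, 3.0, 2.0, 5.0]}, index=idx)

# Group by the second index level, by position and by name.
by_position = toy.groupby(level=1).mean()
by_name = toy.groupby(level='pool').mean()

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(by_position, by_name)
```

Referring to the level by name tends to be easier to read. Note that when the input already holds group means, as `trav_pool_df` does, this computes an average of averages, so each traveler type counts equally regardless of how many reviews it contains.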
