In [2]:
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
Vegas = pd.read_csv('vegas.csv')
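# Raw string plus regex=True so the pattern is treated as a regular expression in current pandas versions
Vegas.columns = Vegas.columns.str.replace(r'\.*\s+', '_', regex=True).str.lower()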
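
To see what the column clean-up pattern does, here is a small sketch on a few made-up header names. The names in `raw` are hypothetical and only stand in for headers that contain dots and spaces; they are not read from `vegas.csv`.

```python
import pandas as pd

# Hypothetical raw headers, invented for illustration only.
raw = pd.Index(['User country', 'Nr. reviews', 'Hotel name'])

# Replace any run of dots followed by whitespace with a single underscore,
# then lowercase everything.
clean = raw.str.replace(r'\.*\s+', '_', regex=True).str.lower()
print(list(clean))
```

The pattern `\.*\s+` matches zero or more dots followed by at least one whitespace character, so both `'Nr. reviews'` and `'Hotel name'` end up with a single underscore.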

In [4]:
display(Vegas.head())
Vegas.dtypes

,user_country,nr_reviews,nr_hotel_reviews,helpful_votes,score,period_of_stay,traveler_type,pool,gym,tennis_court,spa,casino,free_internet,hotel_name,hotel_stars,nr_rooms,user_continent,member_years,review_month,review_weekday
0,USA,11,4,13,5,Dec-Feb,Friends,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3,3773,North America,9,January,Thursday
1,USA,119,21,75,3,Dec-Feb,Business,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3,3773,North America,3,January,Friday
2,USA,36,9,25,5,Mar-May,Families,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3,3773,North America,2,February,Saturday
3,UK,14,7,14,4,Mar-May,Friends,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3,3773,Europe,6,February,Friday
4,Canada,5,5,2,4,Mar-May,Solo,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3,3773,North America,7,March,Tuesday


user_country        object
nr_reviews           int64
nr_hotel_reviews     int64
helpful_votes        int64
score                int64
period_of_stay      object
traveler_type       object
pool                object
gym                 object
tennis_court        object
spa                 object
casino              object
free_internet       object
hotel_name          object
hotel_stars         object
nr_rooms             int64
user_continent      object
member_years         int64
review_month        object
review_weekday      object
dtype: object

# Data Aggregation

When we say aggregation, we usually mean the following definition:

Aggregation: an operation that transforms an array into a scalar value.

This is what we're doing when we apply `mean()` to a `GroupBy` object.  Let's take a look at a list of other built-in `GroupBy` aggregation methods.

- count
- sum
- mean
- median
- std
- var
- min
- max
- prod
- first
- last

Note that all of these methods ignore missing values.  For example, `count()` returns the number of non-NA values.  
In addition to these, if you've grouped a `Series`, you can use many `Series` methods on the `GroupBy` object.  If you've grouped a `DataFrame`, you can use many `DataFrame` methods on the `GroupBy` object.
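
Here is a minimal sketch of how missing values are skipped, using a small made-up frame (the `toy` data is an assumption for illustration, not part of the Vegas dataset). It contrasts `count()` with `size()`, which counts every row.

```python
import numpy as np
import pandas as pd

# Made-up data: hotel A has one missing score.
toy = pd.DataFrame({'hotel': ['A', 'A', 'A', 'B', 'B'],
                    'score': [5.0, np.nan, 3.0, 4.0, 4.0]})
grouped = toy.groupby('hotel')['score']

counts = grouped.count()  # non-NA values per group
sizes = grouped.size()    # all rows per group, NA included
means = grouped.mean()    # the NaN is skipped, not treated as zero

print(pd.DataFrame({'count': counts, 'size': sizes, 'mean': means}))
```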

Here's an example where we use `quantile()` to find the median score for each hotel.  This works because `quantile()` is a `Series` method.

In [5]:
by_hotel = Vegas.groupby('hotel_name')
display(by_hotel['score'].quantile(.5))

hotel_name
Bellagio Las Vegas                                    4.50
Caesars Palace                                        4.50
Circus Circus Hotel & Casino Las Vegas                3.00
Encore at wynn Las Vegas                              5.00
Excalibur Hotel & Casino                              4.00
Hilton Grand Vacations at the Flamingo                4.00
Hilton Grand Vacations on the Boulevard               4.50
Marriott's Grand Chateau                              5.00
Monte Carlo Resort&Casino                             3.50
Paris Las Vegas                                       4.00
The Cosmopolitan Las Vegas                            5.00
The Cromwell                                          4.50
The Palazzo Resort Hotel Casino                       5.00
The Venetian Las Vegas Hotel                          5.00
The Westin las Vegas Hotel Casino & Spa               4.00
Treasure Island- TI Hotel & Casino                    4.00
Tropicana Las Vegas - A Double Tree by Hilton

In [6]:
by_hotel['nr_reviews'].sum().sort_values(ascending=False)

hotel_name
Marriott's Grand Chateau                               2161
Monte Carlo Resort&Casino                              1724
The Cromwell                                           1717
Wyndham Grand Desert                                   1604
Trump International Hotel Las Vegas                    1560
Tuscany Las Vegas Suites & Casino                      1538
Encore at wynn Las Vegas                               1372
Paris Las Vegas                                        1229
Hilton Grand Vacations at the Flamingo                 1162
The Westin las Vegas Hotel Casino & Spa                1115
Excalibur Hotel & Casino                               1094
The Palazzo Resort Hotel Casino                        1078
Caesars Palace                                          912
Hilton Grand Vacations on the Boulevard                 886
Wynn Las Vegas                                          851
The Cosmopolitan Las Vegas                              811
The Venetian Las Vegas Hotel 

One of the most powerful features of `groupby` is the ability to write your own aggregation function.  Just create a function that takes in a `Series` and returns a scalar value.  You can then pass it into the `aggregate` or `agg` method.

In [7]:
Circus = Vegas[Vegas['hotel_name'] == 'Circus Circus Hotel & Casino Las Vegas']
display(Circus[['hotel_name', 'score']])
display(Circus.score.max())
display(Circus.score.min())

,hotel_name,score
0,Circus Circus Hotel & Casino Las Vegas,5
1,Circus Circus Hotel & Casino Las Vegas,3
2,Circus Circus Hotel & Casino Las Vegas,5
3,Circus Circus Hotel & Casino Las Vegas,4
4,Circus Circus Hotel & Casino Las Vegas,4
5,Circus Circus Hotel & Casino Las Vegas,3
6,Circus Circus Hotel & Casino Las Vegas,4
7,Circus Circus Hotel & Casino Las Vegas,4
8,Circus Circus Hotel & Casino Las Vegas,4
9,Circus Circus Hotel & Casino Las Vegas,3


5

1

In [8]:
def my_range(var):
    # Aggregation: takes a Series and returns a single scalar (max minus min)
    return np.max(var) - np.min(var)

by_hotel['score'].agg(my_range)

hotel_name
Bellagio Las Vegas                                     3
Caesars Palace                                         4
Circus Circus Hotel & Casino Las Vegas                 4
Encore at wynn Las Vegas                               4
Excalibur Hotel & Casino                               3
Hilton Grand Vacations at the Flamingo                 3
Hilton Grand Vacations on the Boulevard                4
Marriott's Grand Chateau                               2
Monte Carlo Resort&Casino                              4
Paris Las Vegas                                        3
The Cosmopolitan Las Vegas                             4
The Cromwell                                           4
The Palazzo Resort Hotel Casino                        2
The Venetian Las Vegas Hotel                           2
The Westin las Vegas Hotel Casino & Spa            

In [9]:
def max_min(var):
    # Returns a (max, min) tuple rather than a scalar, so we call apply() instead of agg()
    return np.max(var), np.min(var)

by_hotel['score'].apply(max_min)

hotel_name
Bellagio Las Vegas                                     (5, 2)
Caesars Palace                                         (5, 1)
Circus Circus Hotel & Casino Las Vegas                 (5, 1)
Encore at wynn Las Vegas                               (5, 1)
Excalibur Hotel & Casino                               (5, 2)
Hilton Grand Vacations at the Flamingo                 (5, 2)
Hilton Grand Vacations on the Boulevard                (5, 1)
Marriott's Grand Chateau                               (5, 3)
Monte Carlo Resort&Casino                              (5, 1)
Paris Las Vegas                                        (5, 2)
The Cosmopolitan Las Vegas                             (5, 1)
The Cromwell                                           (5, 1)
The Palazzo Resort Hotel Casino                        (5, 3)
The Venetian Las Vegas Hotel                           (5, 3)
The Westin las Vegas Hotel Casino & Spa                (5, 2)
Treasure Island- TI Hotel & Casino                     (5, 

It's worth pointing out that group operations can be time-consuming.  They require either moving data around or maintaining an extra layer of references, and a custom Python function has to be called once per group.  The built-in `groupby` methods have been heavily optimized, so you'll get better performance if you find ways to use those.
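
As a sketch of what "using the built-ins" can look like, a range statistic can be written either as a custom function or as the difference of two built-in aggregations. The `toy` frame below is made up for illustration; both versions should give the same numbers, and the second avoids calling Python code once per group.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({'hotel': ['A', 'A', 'B', 'B', 'B'],
                    'score': [5, 2, 4, 1, 3]})
grouped = toy.groupby('hotel')['score']

# Custom aggregation: the lambda runs once for every group.
custom = grouped.agg(lambda s: s.max() - s.min())

# Same statistic built from two optimized built-in methods.
builtin = grouped.max() - grouped.min()

print(custom.equals(builtin))
```

On a dataset this small the difference is not noticeable; it tends to matter when there are many groups.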

You can also use `Series` and `DataFrame` methods that aren't aggregations.  For example, you might want to call `describe()` or `value_counts()` on your `GroupBy` object.

In [10]:
Vegas.score.groupby(Vegas.pool).describe()

pool,count,mean,std,min,25%,50%,75%,max
NO,24.0,3.21,1.1,1.0,2.75,3.0,4.0,5.0
YES,480.0,4.17,0.98,1.0,4.0,4.0,5.0,5.0


Here's another one with `value_counts`.

In [11]:
Vegas.score.groupby(Vegas.pool).value_counts().unstack()

score,1,2,3,4,5
pool,,,,,
NO,2,4,7,9,2
YES,9,26,65,155,225


In [38]:
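# Vegas.spa is a single Series, so its GroupBy has no 'score' column to select
# and this raises a KeyError.  Grouping the DataFrame works instead:
# Vegas.groupby('pool')['score'].agg('mean')
Vegas.spa.groupby(Vegas.pool)['score'].agg('mean')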

KeyError: 'Column not found: score'

In [12]:
Vegas.score.groupby(Vegas.pool).value_counts()

pool  score
NO    4          9
      3          7
      2          4
      1          2
      5          2
YES   5        225
      4        155
      3         65
      2         26
      1          9
Name: score, dtype: int64

These are really examples of a more general split-apply-combine procedure.  We'll discuss them in more detail soon.
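
To make the split-apply-combine idea concrete, here is a rough sketch that performs the three steps by hand on a made-up frame and compares the result with the one-line `groupby` version. The `toy` data and the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({'pool': ['NO', 'YES', 'YES', 'NO', 'YES'],
                    'score': [3, 5, 4, 2, 3]})

# Split: iterate over (key, sub-DataFrame) pairs.
# Apply: take the mean score of each piece.
pieces = {key: group['score'].mean() for key, group in toy.groupby('pool')}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(pieces, name='score')
manual.index.name = 'pool'

# The same computation in one step.
auto = toy.groupby('pool')['score'].mean()

print(manual)
print(auto)
```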

### Aggregating with Multiple Functions

We can use `aggregate` with more than one function at a time.  Here's an example where we pass in a list of functions to a `SeriesGroupBy`.

In [41]:
by_hotel['score'].agg(['mean', my_range, lambda x: x.mean()>4])

hotel_name,mean,my_range,<lambda>
Bellagio Las Vegas,4.21,3,True
Caesars Palace,4.12,4,True
Circus Circus Hotel & Casino Las Vegas,3.21,4,False
Encore at wynn Las Vegas,4.54,4,True
Excalibur Hotel & Casino,3.71,3,False
Hilton Grand Vacations at the Flamingo,3.96,3,False
Hilton Grand Vacations on the Boulevard,4.17,4,True
Marriott's Grand Chateau,4.54,2,True
Monte Carlo Resort&Casino,3.29,4,False
Paris Las Vegas,4.04,3,True


Notice that we can pass in a function object, and also the name of a built-in function, like `mean`.  

We can even pass in lambda functions, though if we do this, they end up in a column named `<lambda>`.  If we want to, we can set any column names we want.  Just pass in a tuple containing a name followed by a function.

In [42]:
by_hotel['score'].agg(['mean', ('range', my_range), ('4+ score', lambda x: x.mean() > 4)])

hotel_name,mean,range,4+ score
Bellagio Las Vegas,4.21,3,True
Caesars Palace,4.12,4,True
Circus Circus Hotel & Casino Las Vegas,3.21,4,False
Encore at wynn Las Vegas,4.54,4,True
Excalibur Hotel & Casino,3.71,3,False
Hilton Grand Vacations at the Flamingo,3.96,3,False
Hilton Grand Vacations on the Boulevard,4.17,4,True
Marriott's Grand Chateau,4.54,2,True
Monte Carlo Resort&Casino,3.29,4,False
Paris Las Vegas,4.04,3,True


We can even apply multiple functions to multiple columns.

In [43]:
by_hotel[['score','member_years']].agg(['mean', my_range])

,score,score,member_years,member_years
,mean,my_range,mean,my_range
hotel_name,,,,
Bellagio Las Vegas,4.21,3,3.42,10
Caesars Palace,4.12,4,4.75,13
Circus Circus Hotel & Casino Las Vegas,3.21,4,3.83,10
Encore at wynn Las Vegas,4.54,4,4.75,9
Excalibur Hotel & Casino,3.71,3,4.5,11
Hilton Grand Vacations at the Flamingo,3.96,3,3.96,10
Hilton Grand Vacations on the Boulevard,4.17,4,4.79,10
Marriott's Grand Chateau,4.54,2,4.38,10
Monte Carlo Resort&Casino,3.29,4,3.58,9
Paris Las Vegas,4.04,3,3.62,9


In [32]:
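# 'spa' and 'tennis_court' hold 'YES'/'NO' strings, so there is nothing numeric
# to average and the result is empty (newer pandas versions may raise a TypeError instead)
Vegas[['spa','tennis_court']].agg('mean')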

Series([], dtype: float64)

Notice that when we apply multiple functions to multiple columns, we get back every combination of a function we pass in with every column.  The result is organized with a hierarchical index, where level 0 comes from the `DataFrame` columns and level 1 comes from the functions we pass in.
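
Here is a short sketch of how you might pull pieces out of such a hierarchical column index. The `toy` frame and the helper `value_range` are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({'hotel': ['A', 'A', 'B', 'B'],
                    'score': [5, 3, 4, 4],
                    'member_years': [1, 7, 2, 6]})

def value_range(s):
    # Hypothetical helper: max minus min of a Series.
    return s.max() - s.min()

result = toy.groupby('hotel')[['score', 'member_years']].agg(['mean', value_range])

# Level 0: select everything computed from one original column.
score_stats = result['score']

# Both levels: select one (column, function) pair as a Series.
score_mean = result[('score', 'mean')]

# Level 1: select one function across all original columns.
ranges = result.xs('value_range', axis=1, level=1)

print(result)
```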

For a wide and varied `DataFrame` like this, we probably wouldn't want to apply the same functions to every variable.  More commonly, we would choose specific functions for each column.  Suppose you want to take the mean of the `score` column, but return the proportion of non-USA entries for the `user_country` column.

We can do that by passing in a dictionary that maps from column names to the functions we want to apply.

When passing a dict into `agg`, we don't have the option of choosing custom column names.  We could switch to the more general `apply` method, which we'll learn about soon.  For now, we'll just fix the column names manually.

In [57]:
func_dict = {'score': 'mean', 'user_country': lambda x : (x != 'USA').mean()}
hotel_df = by_hotel.agg(func_dict)
hotel_df.columns = ['mean_score', 'percent_non_US']
hotel_df

hotel_name,mean_score,percent_non_US
Bellagio Las Vegas,4.21,0.67
Caesars Palace,4.12,0.42
Circus Circus Hotel & Casino Las Vegas,3.21,0.58
Encore at wynn Las Vegas,4.54,0.42
Excalibur Hotel & Casino,3.71,0.75
Hilton Grand Vacations at the Flamingo,3.96,0.5
Hilton Grand Vacations on the Boulevard,4.17,0.58
Marriott's Grand Chateau,4.54,0.67
Monte Carlo Resort&Casino,3.29,0.75
Paris Las Vegas,4.04,0.38
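
As an aside, pandas 0.25 and later also support "named aggregation", where each keyword argument names an output column and maps it to a `(column, function)` pair.  This is a sketch of that alternative on made-up data, not the approach used in the cell above; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({'hotel_name': ['A', 'A', 'B', 'B'],
                    'score': [5, 4, 3, 4],
                    'user_country': ['USA', 'UK', 'USA', 'USA']})

# Each keyword is the output column name; the tuple is (input column, function).
hotel_toy = toy.groupby('hotel_name').agg(
    mean_score=('score', 'mean'),
    percent_non_US=('user_country', lambda x: (x != 'USA').mean()),
)

print(hotel_toy)
```

With this form, the renaming step is not needed because the output names are chosen up front.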


We have the start of a nice hotel-level `DataFrame` here.  Let's fill in some more columns so that it holds basic information about each hotel.  For example, we want to know whether each hotel has a pool.  

Aggregating the pool variable is conceptually easy.  This variable should only have one value for any given hotel.  For example, let's look at the pool variable for Circus Circus.

In [59]:
by_hotel.get_group('Circus Circus Hotel & Casino Las Vegas')['pool']

0     NO
1     NO
2     NO
3     NO
4     NO
5     NO
6     NO
7     NO
8     NO
9     NO
10    NO
11    NO
12    NO
13    NO
14    NO
15    NO
16    NO
17    NO
18    NO
19    NO
20    NO
21    NO
22    NO
23    NO
Name: pool, dtype: object

It's no surprise that all values match.  We need an aggregation function that turns this `Series` into a single value.  A good idea is to use `groupby`'s `first` method, which just returns the first element of the `Series`.

In [63]:
by_hotel['pool'].first()

hotel_name
Bellagio Las Vegas                                     YES
Caesars Palace                                         YES
Circus Circus Hotel & Casino Las Vegas                  NO
Encore at wynn Las Vegas                               YES
Excalibur Hotel & Casino                               YES
Hilton Grand Vacations at the Flamingo                 YES
Hilton Grand Vacations on the Boulevard                YES
Marriott's Grand Chateau                               YES
Monte Carlo Resort&Casino                              YES
Paris Las Vegas                                        YES
The Cosmopolitan Las Vegas                             YES
The Cromwell                                           YES
The Palazzo Resort Hotel Casino                        YES
The Venetian Las Vegas Hotel                           YES
The Westin las Vegas Hotel Casino & Spa                YES
Treasure Island- TI Hotel & Casino                     YES
Tropicana Las Vegas - A Double Tree by Hilton

Before we run with this solution, however, we would probably like to make sure that each hotel really does have a consistent value for this variable.  We can devise a custom aggregation function to check this for us.

In [25]:
def is_unique(x):
    return len(x.unique()) == 1

In [26]:
by_hotel['pool'].agg(is_unique)

hotel_name
Bellagio Las Vegas                                     True
Caesars Palace                                         True
Circus Circus Hotel & Casino Las Vegas                 True
Encore at wynn Las Vegas                               True
Excalibur Hotel & Casino                               True
Hilton Grand Vacations at the Flamingo                 True
Hilton Grand Vacations on the Boulevard                True
Marriott's Grand Chateau                               True
Monte Carlo Resort&Casino                              True
Paris Las Vegas                                        True
The Cosmopolitan Las Vegas                             True
The Cromwell                                           True
The Palazzo Resort Hotel Casino                        True
The Venetian Las Vegas Hotel                           True
The Westin las Vegas Hotel Casino & Spa                True
Treasure Island- TI Hotel & Casino                     True
Tropicana Las Vegas - A Doubl

In [27]:
len(by_hotel)

21

In this case, you can scan down the Series and see that each row is `True`.  However, if the list were really long, we could make sure with the `all` method.

In [28]:
by_hotel['pool'].agg(is_unique).all()

True
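
If the check had come back `False`, we would want to know which groups are inconsistent.  Here is one possible sketch using the built-in `nunique()` on made-up data (the `toy` frame is an assumption for illustration).  Note that `nunique()` skips missing values by default, while `len(x.unique())` counts `NaN` as a value, so the two can differ when data is missing.

```python
import pandas as pd

# Made-up data: hotel B has conflicting pool values.
toy = pd.DataFrame({'hotel_name': ['A', 'A', 'B', 'B', 'B'],
                    'pool': ['YES', 'YES', 'NO', 'YES', 'NO']})

# Number of distinct non-missing values per hotel.
n_values = toy.groupby('hotel_name')['pool'].nunique()

# Hotels with more than one distinct value need a closer look.
inconsistent = n_values[n_values > 1].index.tolist()
print(inconsistent)
```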

Let's apply this strategy to all columns to look for potential errors / coding issues.

In [29]:
by_hotel.agg(is_unique).apply(all, axis=0)

user_country        False
nr_reviews          False
nr_hotel_reviews    False
helpful_votes       False
score               False
period_of_stay      False
traveler_type       False
pool                 True
gym                  True
tennis_court         True
spa                  True
casino               True
free_internet        True
hotel_stars          True
nr_rooms             True
user_continent      False
member_years        False
review_month        False
review_weekday      False
dtype: bool

As we expect, the variables that primarily describe a hotel, like `pool`, `gym`, `tennis_court`, and `spa`, all have one value within a hotel. Let's aggregate them with the `first` method. 

In [30]:
func_dict = {'score': 'mean', 
             'user_country': lambda x : (x != 'USA').mean(),
             'pool': 'first',
             'gym' : 'first',
             'tennis_court' : 'first',
             'spa' : 'first',
             'casino' : 'first',
             'free_internet' : 'first',
             'hotel_stars' : 'first',
             'nr_rooms' : 'first'
            }
hotel_df = by_hotel.agg(func_dict)
hotel_df.columns = ['mean_score', 'percent_non_US'] + list(hotel_df.columns[2:])
hotel_df

hotel_name,mean_score,percent_non_US,pool,gym,tennis_court,spa,casino,free_internet,hotel_stars,nr_rooms
Bellagio Las Vegas,4.21,0.67,YES,YES,NO,YES,YES,YES,5,3933
Caesars Palace,4.12,0.42,YES,YES,NO,YES,YES,YES,5,3348
Circus Circus Hotel & Casino Las Vegas,3.21,0.58,NO,YES,NO,NO,YES,YES,3,3773
Encore at wynn Las Vegas,4.54,0.42,YES,YES,NO,YES,YES,YES,5,2034
Excalibur Hotel & Casino,3.71,0.75,YES,YES,NO,YES,YES,YES,3,3981
Hilton Grand Vacations at the Flamingo,3.96,0.5,YES,YES,NO,NO,NO,YES,3,315
Hilton Grand Vacations on the Boulevard,4.17,0.58,YES,YES,NO,YES,YES,YES,35,1228
Marriott's Grand Chateau,4.54,0.67,YES,YES,NO,NO,YES,YES,35,732
Monte Carlo Resort&Casino,3.29,0.75,YES,YES,NO,YES,YES,NO,4,3003
Paris Las Vegas,4.04,0.38,YES,YES,NO,YES,YES,YES,4,2916


You can imagine extending a table like this and using it for further analysis.  Let's save it for later.

In [31]:
hotel_df.to_csv('hotels.csv')
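
When reading a table like this back in, the index is stored as an ordinary first column, so it helps to name it explicitly.  This sketch uses a made-up frame and an in-memory buffer instead of a real file; with a file on disk you would pass the file name to `read_csv` instead.

```python
import io
import pandas as pd

# Made-up hotel-level table with hotel_name as the index.
toy = pd.DataFrame({'mean_score': [4.5, 3.5], 'pool': ['YES', 'NO']},
                   index=pd.Index(['A', 'B'], name='hotel_name'))

# Write to an in-memory buffer, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col restores hotel_name as the index rather than a regular column.
restored = pd.read_csv(buffer, index_col='hotel_name')
print(restored)
```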
