# Group By

In this chapter we introduce the following query syntax:

* `GROUP BY` for grouping
* `HAVING` for filtering of groups

In [1]:
#| echo: false

import pandas as pd
import pyhive.sqlalchemy_presto

# always show every column
pd.set_option('display.max_columns', None)
# suppress a SQLAlchemy warning
pyhive.sqlalchemy_presto.PrestoDialect.supports_statement_cache = False

%load_ext sql
%config SqlMagic.autocommit = False
%config SqlMagic.displaycon = False
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False

%sql presto://localhost:8080/

In [2]:
#| echo: false
#| output: false

## just a reminder
%sql select * from listings limit 5

Unnamed: 0,listing_id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,reviews_per_month,access_date,country,state,city
0,2595,https://www.airbnb.com/rooms/2595,20220603182654,2022-06-04T00:00:00+00:00,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,https://a0.muscache.com/pictures/f0813a11-40b2...,2845,"New York, United States",Midtown,Manhattan,40.75356,-73.98559,Entire rental unit,Entire home/apt,1,,1 bath,,1.0,"[""Essentials"", ""Bathtub"", ""Extra pillows and b...",225.0,30,1125,30.0,30.0,1125.0,1125.0,30.0,1125.0,,True,0,3,33,308,2022-06-04T00:00:00+00:00,48,0,0,2009-11-21T00:00:00+00:00,2019-11-04T00:00:00+00:00,4.7,4.72,4.62,4.76,4.79,4.86,4.41,,False,0.31,2022-06-03,united-states,ny,new-york-city
1,5121,https://www.airbnb.com/rooms/5121,20220603182654,2022-06-04T00:00:00+00:00,BlissArtsSpace!,One room available for rent in a 2 bedroom apt...,,https://a0.muscache.com/pictures/2090980c-b68e...,7356,,Bedford-Stuyvesant,Brooklyn,40.68535,-73.95512,Private room in rental unit,Private room,2,,,1.0,1.0,"[""Heating"", ""Long term stays allowed"", ""Kitche...",60.0,30,730,30.0,30.0,730.0,730.0,30.0,730.0,,True,30,60,90,365,2022-06-04T00:00:00+00:00,50,0,0,2009-05-28T00:00:00+00:00,2019-12-02T00:00:00+00:00,4.52,4.22,4.09,4.91,4.91,4.47,4.52,,False,0.32,2022-06-03,united-states,ny,new-york-city
2,5136,https://www.airbnb.com/rooms/5136,20220603182654,2022-06-04T00:00:00+00:00,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,https://a0.muscache.com/pictures/miso/Hosting-...,7378,,Sunset Park,Brooklyn,40.66265,-73.99454,Entire rental unit,Entire home/apt,4,,1.5 baths,2.0,2.0,"[""Dryer"", ""Heating"", ""Hair dryer"", ""Carbon mon...",275.0,21,1125,21.0,21.0,1125.0,1125.0,21.0,1125.0,,True,3,3,4,250,2022-06-04T00:00:00+00:00,2,1,0,2014-01-02T00:00:00+00:00,2021-08-08T00:00:00+00:00,5.0,5.0,5.0,5.0,5.0,4.5,5.0,,False,0.02,2022-06-03,united-states,ny,new-york-city
3,5178,https://www.airbnb.com/rooms/5178,20220603182654,2022-06-04T00:00:00+00:00,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",https://a0.muscache.com/pictures/12065/f070997...,8967,"New York, United States",Midtown,Manhattan,40.76457,-73.98317,Private room in rental unit,Private room,2,,1 bath,1.0,1.0,"[""Conditioner"", ""Essentials"", ""Extra pillows a...",68.0,2,14,2.0,2.0,14.0,14.0,2.0,14.0,,True,3,5,9,172,2022-06-04T00:00:00+00:00,536,62,2,2009-05-06T00:00:00+00:00,2022-05-09T00:00:00+00:00,4.23,4.24,3.75,4.66,4.44,4.87,4.39,,False,3.37,2022-06-03,united-states,ny,new-york-city
4,5203,https://www.airbnb.com/rooms/5203,20220603182654,2022-06-03T00:00:00+00:00,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,https://a0.muscache.com/pictures/103776/b37157...,7490,"New York, United States",Upper West Side,Manhattan,40.8038,-73.96751,Private room in rental unit,Private room,1,,1 shared bath,1.0,1.0,"[""Carbon monoxide alarm"", ""Heating"", ""Essentia...",75.0,2,14,2.0,2.0,14.0,14.0,2.0,14.0,,True,0,0,0,0,2022-06-03T00:00:00+00:00,118,0,0,2009-09-07T00:00:00+00:00,2017-07-21T00:00:00+00:00,4.91,4.83,4.82,4.97,4.95,4.94,4.92,,False,0.76,2022-06-03,united-states,ny,new-york-city


## `group by`

In the previous chapter, we looked at this example in which we combine Python and SQL to calculate an aggregate `avg(beds)` for each value of `listings.city`:

In [3]:
cities = %sql select distinct city from listings
cities = cities.city.values
cities

array(['new-york-city', 'amsterdam', 'paris'], dtype=object)

In [4]:
for city in cities:
    x = %sql select '{city}' city, avg(beds) mean_beds from listings where city = '{city}' and accommodates >= 4
    display(x)

Unnamed: 0,city,mean_beds
0,new-york-city,2.713029


Unnamed: 0,city,mean_beds
0,amsterdam,3.152174


Unnamed: 0,city,mean_beds
0,paris,2.664234


In SQL we can use `group by` to streamline exactly this sort of calculation:

In [5]:
%%sql

select
    city,
    avg(beds) mean_beds
from listings
where accommodates >= 4
group by
    city

Unnamed: 0,city,mean_beds
0,amsterdam,3.152174
1,new-york-city,2.713029
2,paris,2.664234
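
As a rough cross-check of that equivalence, here is a minimal sketch that runs both approaches against a tiny in-memory table.  It assumes Python's built-in `sqlite3` module as a stand-in for the chapter's dask-sql setup, and the rows are invented for illustration, not taken from the Airbnb data.

```python
import sqlite3

# Hypothetical toy rows (city, accommodates, beds); not the Airbnb data.
rows = [
    ("amsterdam", 4, 3),
    ("amsterdam", 6, 4),
    ("amsterdam", 2, 1),
    ("paris", 4, 2),
    ("paris", 5, 3),
    ("new-york-city", 4, 2),
    ("new-york-city", 8, 4),
]

con = sqlite3.connect(":memory:")
con.execute("create table listings (city text, accommodates integer, beds integer)")
con.executemany("insert into listings values (?, ?, ?)", rows)

# Approach 1: one query per city, driven from Python.
cities = [r[0] for r in con.execute("select distinct city from listings order by city")]
looped = []
for city in cities:
    (mean_beds,) = con.execute(
        "select avg(beds) from listings where city = ? and accommodates >= 4",
        (city,),
    ).fetchone()
    looped.append((city, mean_beds))

# Approach 2: a single query with group by.
grouped = con.execute(
    """
    select city, avg(beds) mean_beds
    from listings
    where accommodates >= 4
    group by city
    order by city
    """
).fetchall()

print(looped)
print(grouped)
```

The two lists agree here because every city in the toy table has at least one row passing the `where` filter.  A city with no such rows would be missing from the grouped result, whereas the loop would still report it with a null average.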


The default ordering of the result is implementation-dependent; with dask-sql (which builds on Dask, which builds on Pandas) the default appears to be to order by the grouping columns.  In many implementations this is not guaranteed, so if the order matters you'll want to specify it explicitly:

In [6]:
%%sql

select
    city,
    avg(beds) mean_beds
from listings
where accommodates >= 4
group by
    city
order by
    city

Unnamed: 0,city,mean_beds
0,amsterdam,3.152174
1,new-york-city,2.713029
2,paris,2.664234


Other times, we'll be interested in a different ordering:

In [7]:
%%sql

select
    city,
    avg(beds) mean_beds
from listings
where accommodates >= 4
group by
    city
order by
    mean_beds

Unnamed: 0,city,mean_beds
0,paris,2.664234
1,new-york-city,2.713029
2,amsterdam,3.152174


## Grouping on multiple columns

The above examples are clearly superior to separate queries for each city.  But we've still only learned about listings accommodating four or more guests.  This leaves a lot unanswered.  For example, does Amsterdam offer more beds per potential guest?  Or are many-guest listings proportionally more common in Amsterdam?

We can start to address these questions by grouping on multiple columns.  First off, let's find `avg(beds)` as a function of both city and number of guests accommodated:

In [8]:
%%sql

select
    city,
    accommodates,
    count(1) num_listing,
    avg(beds) mean_beds
from listings
where
    4 <= accommodates
    and accommodates <= 10
group by
    city,
    accommodates
order by
    city,
    accommodates desc

Unnamed: 0,city,accommodates,num_listing,mean_beds
0,amsterdam,10,2,5.0
1,amsterdam,9,3,6.666667
2,amsterdam,8,23,6.434783
3,amsterdam,7,27,5.074074
4,amsterdam,6,185,4.295082
5,amsterdam,5,131,3.740458
6,amsterdam,4,1947,2.679732
7,new-york-city,10,190,5.074468
8,new-york-city,9,82,4.829268
9,new-york-city,8,591,4.229983


Most SQL implementations do not support pivot operations[^1], so this is an example where post-processing in Python is not just a crutch but offers real value:

In [9]:
%%sql  city_acc_beds <<

-- save this to a Python variable
-- for further analysis
select
    city,
    accommodates,
    count(1) num_listings,
    avg(beds) mean_beds
from listings
where
    4 <= accommodates
    and accommodates <= 10
group by
    city,
    accommodates
order by
    city,
    accommodates desc

Returning data to local variable city_acc_beds


In [10]:
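# keyword arguments: recent pandas versions no longer accept these positionally
city_acc_beds.pivot(index='city', columns='accommodates', values='mean_beds')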

accommodates,4,5,6,7,8,9,10
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
amsterdam,2.679732,3.740458,4.295082,5.074074,6.434783,6.666667,5.0
new-york-city,2.042316,2.708128,3.203418,3.835079,4.229983,4.829268,5.074468
paris,2.063127,3.09299,3.42136,4.354023,4.621053,5.463918,5.479638


In [11]:
x = city_acc_beds.pivot(index='city', columns='accommodates', values='num_listings')
x = (x.div(x.sum(axis='columns'), axis='rows') * 100).fillna(0)
x.round().astype(int).astype(str) + '%'

accommodates,4,5,6,7,8,9,10
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
amsterdam,84%,6%,8%,1%,1%,0%,0%
new-york-city,56%,15%,17%,4%,6%,1%,2%
paris,66%,10%,16%,2%,4%,0%,1%


It seems that among listings accommodating at least 4 guests, Amsterdam really does offer the most beds per guest on average, even though very high guest counts are relatively rarer there.
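
For comparison, the "manual" SQL pivot mentioned in the footnote can be sketched with one conditional aggregate per output column.  This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module rather than dask-sql, the rows are invented, and the column names `beds_4` and `beds_5` are hypothetical.

```python
import sqlite3

# Hypothetical long-form rows (city, accommodates, beds); not the Airbnb data.
rows = [
    ("amsterdam", 4, 2), ("amsterdam", 4, 4), ("amsterdam", 5, 4),
    ("paris", 4, 2), ("paris", 5, 3), ("paris", 5, 5),
]

con = sqlite3.connect(":memory:")
con.execute("create table listings (city text, accommodates integer, beds integer)")
con.executemany("insert into listings values (?, ?, ?)", rows)

# One conditional aggregate per pivoted column.  The case expression yields
# NULL for non-matching rows, and avg() ignores NULLs.
wide = con.execute(
    """
    select
        city,
        avg(case when accommodates = 4 then beds end) beds_4,
        avg(case when accommodates = 5 then beds end) beds_5
    from listings
    group by city
    order by city
    """
).fetchall()

print(wide)
```

Every pivoted value has to be spelled out as its own column in the query, which is why this gets unwieldy quickly and why pivoting the long-form result in Python is often the more convenient route.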

Anyways, back to SQL!  Let's look at a couple more examples of grouping on multiple columns.  Here are the NYC boroughs ordered by number of listings (Amsterdam and Paris don't appear here because they don't share NYC's concept of "neighbourhood group" in Airbnb's schema):

In [12]:
%%sql

select
    city,
    neighbourhood_group_cleansed,
    count(1) num
from listings
group by
    city,
    neighbourhood_group_cleansed
order by
    num desc

Unnamed: 0,city,neighbourhood_group_cleansed,num
0,new-york-city,Manhattan,15855
1,new-york-city,Brooklyn,13954
2,new-york-city,Queens,5824
3,new-york-city,Bronx,1376
4,new-york-city,Staten Island,401


Here are similar neighborhood-level tables[^2] for Amsterdam and Paris:

In [13]:
%%sql

select
    city,
    neighbourhood_cleansed,
    count(1) num
from listings
where city like 'amsterdam'
group by
    city,
    neighbourhood_cleansed
order by
    num desc
limit 5

Unnamed: 0,city,neighbourhood_cleansed,num
0,amsterdam,De Baarsjes - Oud-West,894
1,amsterdam,Centrum-West,838
2,amsterdam,Centrum-Oost,602
3,amsterdam,De Pijp - Rivierenbuurt,593
4,amsterdam,Westerpark,417


In [14]:
%%sql

select
    city,
    neighbourhood_cleansed,
    count(1) num
from listings
where city like 'paris'
group by
    city,
    neighbourhood_cleansed
order by
    num desc
limit 5

Unnamed: 0,city,neighbourhood_cleansed,num
0,paris,Buttes-Montmartre,5775
1,paris,Popincourt,4669
2,paris,Vaugirard,4230
3,paris,Batignolles-Monceau,4173
4,paris,Passy,4122


[^1]: While [Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-ver16) and [Spark SQL](https://www.databricks.com/blog/2018/11/01/sql-pivot-converting-rows-to-columns.html) support a `PIVOT` clause, in most implementations you are forced to do this "manually" by constructing [absurd queries](https://mode.com/sql-tutorial/sql-pivot-table/).  We might be tempted to construct such queries using Python, and there are valid use cases for that sort of thing.  But often it's sufficient to build a long-form table in SQL and pivot it after the fact in Python.

[^2]: These queries would normally be written with conditions `where city = 'amsterdam'` and `where city = 'paris'`.  We used `like` instead of `=` here to work around an apparent bug involving partition columns read from file paths, like in this case `.../country=___/state=___/city=___/...`.

## Group vs aggregate columns

In a group-by query, all selected columns must either be part of the group specification, or aggregates over columns that _are not_ part of the group specification.  This rule is non-negotiable, and SQL implementations will not read your mind here.

We might know that in our dataset, two columns have a one-to-one relationship with each other.  For example, in our limited import of Airbnb data (not in their complete dataset, of course) each country maps to one state, which maps to one city, which maps to one country:

In [15]:
%%sql

select distinct
    country,
    state,
    city
from listings
order by
    country,
    state,
    city

Unnamed: 0,country,state,city
0,france,ile-de-france,paris
1,the-netherlands,north-holland,amsterdam
2,united-states,ny,new-york-city


Nevertheless, this query raises an error:

In [16]:
%%sql

select
    country,
    state,
    city,
    count(1) as num
from listings
group by
    country
order by
    country

(pyhive.exc.DatabaseError) {'message': "Can not parse the given SQL: From line 3, column 5 to line 3, column 9: Expression 'state' is not being grouped\n\nThe problem is probably somewhere here:\n\n\tselect\n\t    country,\n\t    state,\n\t    ^^^^^\n\t    city,\n\t    count(1) as num\n\tfrom listings\n\tgroup by\n\t    country\n\torder by\n\t    country", 'errorCode': 0, 'errorName': "<class 'dask_sql.utils.ParsingException'>", 'errorType': 'USER_ERROR', 'errorLocation': {'lineNumber': 3, 'columnNumber': 5}}
[SQL: select
    country,
    state,
    city,
    count(1) as num
from listings
group by
    country
order by
    country]
(Background on this error at: https://sqlalche.me/e/14/4xp6)


We can resolve this by grouping on all of those columns:

In [17]:
%%sql

select
    country,
    state,
    city,
    count(1) as num
from listings
group by
    country,
    state,
    city
order by
    country,
    state,
    city

Unnamed: 0,country,state,city,num
0,france,ile-de-france,paris,56739
1,the-netherlands,north-holland,amsterdam,6173
2,united-states,ny,new-york-city,37410


In the wild you might find this problem resolved by instead introducing aggregates on some of the de-facto group columns:

In [18]:
%%sql

select
    country,
    min(state) state,
    min(city) city,
    count(1) as num
from listings
group by
    country
order by
    country

Unnamed: 0,country,state,city,num
0,france,ile-de-france,paris,56739
1,the-netherlands,north-holland,amsterdam,6173
2,united-states,ny,new-york-city,37410


:::{.callout-warning}
It is safer, and almost always better, to include all logical group columns as explicit group columns in your query.
:::
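
To see why the aggregate workaround is less safe, consider what happens when the assumed one-to-one relationship does not actually hold.  The sketch below is a hypothetical illustration using Python's built-in `sqlite3` module and invented rows in which one country has two cities; it is not the chapter's dataset or database.

```python
import sqlite3

# Hypothetical rows (country, city).  France has two cities here, so
# country -> city is NOT one-to-one.
rows = [
    ("france", "paris"), ("france", "paris"), ("france", "lyon"),
    ("the-netherlands", "amsterdam"),
]

con = sqlite3.connect(":memory:")
con.execute("create table listings (country text, city text)")
con.executemany("insert into listings values (?, ?)", rows)

# Aggregating the de-facto group column: one row per country, with all
# three French listings reported under the alphabetically smallest city.
with_min = con.execute(
    "select country, min(city) city, count(1) num from listings "
    "group by country order by country"
).fetchall()

# Explicit group columns: the second French city shows up as its own row.
explicit = con.execute(
    "select country, city, count(1) num from listings "
    "group by country, city order by country, city"
).fetchall()

print(with_min)
print(explicit)
```

The `min(city)` version runs without complaint but misattributes the Paris listings, while the explicit version exposes the unexpected extra group.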

## `having`

In some of the above examples, we use both `group by` and `where`.  In those cases, the `where` filter-condition is applied _before_ the `group by`.  We can't use `where` to filter on the results of aggregates in a group by:

In [19]:
%%sql

select
    neighbourhood_cleansed,
    count(1) num
from listings
where num >= 3000
group by
    neighbourhood_cleansed
order by num desc

(pyhive.exc.DatabaseError) {'message': "Can not parse the given SQL: From line 5, column 7 to line 5, column 9: Column 'num' not found in any table\n\nThe problem is probably somewhere here:\n\n\tselect\n\t    neighbourhood_cleansed,\n\t    count(1) num\n\tfrom listings\n\twhere num >= 3000\n\t      ^^^\n\tgroup by\n\t    neighbourhood_cleansed\n\torder by num desc", 'errorCode': 0, 'errorName': "<class 'dask_sql.utils.ParsingException'>", 'errorType': 'USER_ERROR', 'errorLocation': {'lineNumber': 5, 'columnNumber': 7}}
[SQL: select
    neighbourhood_cleansed,
    count(1) num
from listings
where num >= 3000
group by
    neighbourhood_cleansed
order by num desc]
(Background on this error at: https://sqlalche.me/e/14/4xp6)


For this situation, SQL adds the keyword `having`, which appears after `group by` but before `order by`:

In [20]:
%%sql

select
    neighbourhood_cleansed,
    count(1) num
from listings
group by
    neighbourhood_cleansed
having count(1) >= 3000
order by
    num desc

Unnamed: 0,neighbourhood_cleansed,num
0,Buttes-Montmartre,5775
1,Popincourt,4669
2,Vaugirard,4230
3,Batignolles-Monceau,4173
4,Passy,4122
5,Entrepôt,3611
6,Buttes-Chaumont,3346
7,Ménilmontant,3234
8,Reuilly,3107
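
The two filters can also be combined in one query: `where` removes rows before grouping, and `having` removes groups after aggregation.  Here is a minimal sketch of that order of operations, assuming Python's built-in `sqlite3` module as a stand-in engine and a handful of invented rows (the `neighbourhood` column name and the thresholds are illustrative only).

```python
import sqlite3

# Hypothetical rows (neighbourhood, accommodates); not the Airbnb data.
rows = [
    ("Passy", 2), ("Passy", 4), ("Passy", 6),
    ("Reuilly", 2), ("Reuilly", 4),
    ("Westerpark", 2),
]

con = sqlite3.connect(":memory:")
con.execute("create table listings (neighbourhood text, accommodates integer)")
con.executemany("insert into listings values (?, ?)", rows)

# Group filter only: keep neighbourhoods with at least 2 listings.
having_only = con.execute(
    """
    select neighbourhood, count(1) num
    from listings
    group by neighbourhood
    having count(1) >= 2
    order by num desc
    """
).fetchall()

# Row filter first, then group filter: only listings for 4+ guests are
# counted, so fewer neighbourhoods reach the threshold.
where_and_having = con.execute(
    """
    select neighbourhood, count(1) num
    from listings
    where accommodates >= 4      -- applied to rows, before grouping
    group by neighbourhood
    having count(1) >= 2         -- applied to groups, after aggregation
    order by num desc
    """
).fetchall()

print(having_only)
print(where_and_having)
```

Reuilly passes the `having` threshold on its own, but drops out once the `where` clause has removed its smaller listing before the count is taken.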


Note that, somewhat frustratingly, `having` does not support the use of aliases in the same way as `order by` (at least in this implementation; some databases are more lenient):

In [21]:
%%sql

select
    neighbourhood_cleansed,
    count(1) num
from listings
group by
    neighbourhood_cleansed
having num >= 3000
order by
    num desc

(pyhive.exc.DatabaseError) {'message': "Can not parse the given SQL: From line 7, column 8 to line 7, column 10: Column 'num' not found in any table\n\nThe problem is probably somewhere here:\n\n\tselect\n\t    neighbourhood_cleansed,\n\t    count(1) num\n\tfrom listings\n\tgroup by\n\t    neighbourhood_cleansed\n\thaving num >= 3000\n\t       ^^^\n\torder by\n\t    num desc", 'errorCode': 0, 'errorName': "<class 'dask_sql.utils.ParsingException'>", 'errorType': 'USER_ERROR', 'errorLocation': {'lineNumber': 7, 'columnNumber': 8}}
[SQL: select
    neighbourhood_cleansed,
    count(1) num
from listings
group by
    neighbourhood_cleansed
having num >= 3000
order by
    num desc]
(Background on this error at: https://sqlalche.me/e/14/4xp6)


There is, of course, a perfectly reasonable explanation for that seeming inconsistency... which we will come back to later.

## Exercises

**TODO.**