# SQL grouping and summarizing data

## Preparation

For this section you need the `chinook.db` database file and a working `%sql` magic.  
If you don't have them, please go back to the [previous section](connect_to_database.ipynb) and follow the instructions.  
The following code should not produce any errors:

In [1]:
%load_ext sql
%sql sqlite:///chinook.db

## `GROUP BY` - operations on sets of (multiple) rows

SQL lets you perform aggregation (descriptive statistics) operations on disjoint groups of rows.  
For each input group (all rows that share the same value of the grouping column), a single summary row is generated in the output.  
Here we define groups and illustrate their use with a simple `COUNT` operation. Other aggregations are shown later.

Let's build groups step-by-step.

### A table before grouping

Let's consider some rows of the `tracks` table:

In [2]:
%%sql
SELECT * 
  FROM tracks 
  LIMIT 5

TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
1,For Those About To Rock (We Salute You),1,1,1,"Angus Young, Malcolm Young, Brian Johnson",343719,11170334,0.99
2,Balls to the Wall,2,2,1,,342562,5510424,0.99
3,Fast As a Shark,3,2,1,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman",230619,3990994,0.99
4,Restless and Wild,3,2,1,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman",252051,4331779,0.99
5,Princess of the Dawn,3,2,1,Deaffy & R.A. Smith-Diesel,375418,6290521,0.99


### Simple `GROUP BY`

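Observe that a simple `GROUP BY` on `AlbumId` returns one row for each value of `AlbumId`. The remaining columns are taken from an arbitrary row of each group; SQLite accepts this, while many other databases reject a `SELECT *` combined with `GROUP BY`: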

In [3]:
%%sql
SELECT * 
  FROM tracks 
  GROUP BY AlbumId
  LIMIT 5

TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
1,For Those About To Rock (We Salute You),1,1,1,"Angus Young, Malcolm Young, Brian Johnson",343719,11170334,0.99
2,Balls to the Wall,2,2,1,,342562,5510424,0.99
3,Fast As a Shark,3,2,1,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman",230619,3990994,0.99
15,Go Down,4,1,1,AC/DC,331180,10847611,0.99
23,Walk On Water,5,1,1,"Steven Tyler, Joe Perry, Jack Blades, Tommy Shaw",295680,9719579,0.99
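The "one output row per group" behaviour can also be checked outside the notebook. The sketch below uses Python's standard `sqlite3` module and a tiny in-memory table with made-up rows (it is a stand-in, not the Chinook data); it selects only the grouping column, so the result does not depend on which row SQLite picks for the other columns.

```python
import sqlite3

# Toy stand-in for the `tracks` table (made-up rows, not Chinook data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (TrackId INTEGER, Name TEXT, AlbumId INTEGER)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [(1, "A", 10), (2, "B", 10), (3, "C", 20), (4, "D", 30), (5, "E", 30)],
)

# Grouping collapses the five input rows into one output row per AlbumId.
grouped = conn.execute(
    "SELECT AlbumId FROM tracks GROUP BY AlbumId ORDER BY AlbumId"
).fetchall()

# The groups are exactly the distinct values of the grouping column.
distinct = conn.execute(
    "SELECT DISTINCT AlbumId FROM tracks ORDER BY AlbumId"
).fetchall()

print(grouped)              # [(10,), (20,), (30,)]
print(grouped == distinct)  # True
conn.close()
```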


### `COUNT` - counting rows (suboptimal)

Applying `COUNT(*)` to each `GROUP BY` group returns the number of rows in that group.  
*Note:* The star (`*`) means that whole rows are counted, not the values of a particular column (see below).

In [4]:
%%sql
SELECT COUNT(*)
  FROM tracks
  GROUP BY AlbumId
  LIMIT 5

COUNT(*)
10
1
3
8
15


### `COUNT` - counting rows (better)

The example above does not show which `AlbumId` each count corresponds to.  
A better version selects the `AlbumId` column, renames the count column to `TracksNum`, and sorts the albums by that count in descending order:

In [6]:
%%sql
SELECT AlbumId, COUNT(*) AS TracksNum
  FROM tracks
  GROUP BY AlbumId
  ORDER BY TracksNum DESC
  LIMIT 5

AlbumId,TracksNum
141,57
23,34
73,30
229,26
230,25
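A quick sanity check for any per-group count is that the counts add up to the total number of rows, because the groups are disjoint. The following is a minimal sketch using Python's `sqlite3` module on a made-up in-memory table (the rows are illustrative, not Chinook data).

```python
import sqlite3

# Toy stand-in for the `tracks` table (made-up rows, not Chinook data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (TrackId INTEGER, AlbumId INTEGER)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 10), (4, 20), (5, 30), (6, 30)],
)

# Same query shape as above; AlbumId is added as a tie-breaker for the sort.
per_album = conn.execute(
    """
    SELECT AlbumId, COUNT(*) AS TracksNum
      FROM tracks
      GROUP BY AlbumId
      ORDER BY TracksNum DESC, AlbumId
    """
).fetchall()

# Without GROUP BY, COUNT(*) counts the rows of the whole table.
total = conn.execute("SELECT COUNT(*) FROM tracks").fetchone()[0]

print(per_album)                               # [(10, 3), (30, 2), (20, 1)]
print(sum(n for _, n in per_album) == total)   # True
conn.close()
```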


## `HAVING` - filtering based on group aggregation results

To filter the rows of an aggregated result you need the `HAVING` clause. `WHERE` cannot do this, because it filters individual rows before they are grouped and therefore never sees the results of aggregation.

Consider the following modification of the above example:

In [7]:
%%sql
SELECT AlbumId, COUNT(*) AS TracksNum
  FROM tracks
  GROUP BY AlbumId
  HAVING TracksNum > 30
  ORDER BY TracksNum DESC
  LIMIT 5

AlbumId,TracksNum
141,57
23,34
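The difference between the two clauses shows up when both appear in one query: `WHERE` removes rows before grouping, and `HAVING` removes groups afterwards. The sketch below illustrates this with Python's `sqlite3` module on a made-up in-memory table; the rows and the thresholds (`150` and `2`) are arbitrary illustrative choices.

```python
import sqlite3

# Toy stand-in for the `tracks` table (made-up rows, not Chinook data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracks (TrackId INTEGER, AlbumId INTEGER, Milliseconds INTEGER)"
)
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [(1, 10, 100), (2, 10, 200), (3, 10, 300),
     (4, 20, 400), (5, 20, 50), (6, 30, 500)],
)

# HAVING alone: keep albums with at least two tracks.
having_only = conn.execute(
    """
    SELECT AlbumId, COUNT(*) AS TracksNum
      FROM tracks
      GROUP BY AlbumId
      HAVING TracksNum >= 2
      ORDER BY AlbumId
    """
).fetchall()

# WHERE first drops short tracks; only then are the remaining rows
# grouped, counted, and filtered by HAVING.
where_then_having = conn.execute(
    """
    SELECT AlbumId, COUNT(*) AS TracksNum
      FROM tracks
      WHERE Milliseconds > 150
      GROUP BY AlbumId
      HAVING TracksNum >= 2
      ORDER BY AlbumId
    """
).fetchall()

print(having_only)         # [(10, 3), (20, 2)]
print(where_then_having)   # [(10, 2)]
conn.close()
```

Album 20 disappears from the second result because one of its two tracks is removed by `WHERE`, leaving a group of size one.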


## Aggregation functions

Aggregate functions operate on a set of rows and return a single result.  
Aggregate functions are often used in conjunction with `GROUP BY` and `HAVING` clauses in the `SELECT` statement.  
When `GROUP BY` is not provided, the aggregation of the whole table is performed.

SQLite provides the following aggregate functions:

- `COUNT(*)` – Returns the number of rows.
- `COUNT(col)` – Returns the number of non-`NULL` values in `col`.
- `AVG(col)` – Returns the average of values.
- `MAX(col)` – Returns the maximum of values.
- `MIN(col)` – Returns the minimum of values.
- `SUM(col)` – Returns the sum of values.
- `GROUP_CONCAT(col, sep)` – Returns a string that concatenates all non-`NULL` values of `col`, separated by `sep`.

See examples below.
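As a first illustration, several of the listed functions can be applied to a whole table at once, without `GROUP BY`. The sketch below uses Python's `sqlite3` module and a made-up three-row table (not Chinook data) in which one `Composer` is `NULL`, so that `COUNT(*)` and `COUNT(Composer)` give different answers.

```python
import sqlite3

# Toy stand-in for the `tracks` table (made-up rows, not Chinook data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracks (TrackId INTEGER, Composer TEXT, Milliseconds INTEGER)"
)
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [(1, "X", 100), (2, None, 200), (3, "Y", 300)],  # None is stored as NULL
)

# No GROUP BY: every aggregate is computed over the whole table.
row = conn.execute(
    """
    SELECT COUNT(*),           -- all rows
           COUNT(Composer),    -- rows where Composer is not NULL
           MIN(Milliseconds),
           MAX(Milliseconds),
           SUM(Milliseconds)
      FROM tracks
    """
).fetchone()

print(row)   # (3, 2, 100, 300, 600)
conn.close()
```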

### `AVG` - average of values

The `AVG` function is an aggregate function that calculates the average value of all non-NULL values within a group.

To calculate the average length of all tracks in milliseconds, use the following statement:

In [8]:
%%sql
SELECT AVG(Milliseconds) AS MeanMilliseconds
  FROM tracks

MeanMilliseconds
393599.2121039109


To calculate the average track length separately for every album, add a `GROUP BY AlbumId` clause and select `AlbumId` as well:

In [9]:
%%sql
SELECT AlbumId, AVG(Milliseconds) AS MeanMilliseconds
  FROM tracks
  GROUP BY AlbumId
  LIMIT 5

AlbumId,MeanMilliseconds
1,240041.5
2,342562.0
3,286029.3333333333
4,306657.375
5,294113.93333333335
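Because `AVG` ignores `NULL` values, it can differ from dividing the sum by the number of rows. The following sketch shows the effect with Python's `sqlite3` module on a made-up table that contains one `NULL` length (an assumption made only for illustration; it says nothing about the actual Chinook data).

```python
import sqlite3

# Toy stand-in for the `tracks` table (made-up rows, not Chinook data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (TrackId INTEGER, Milliseconds INTEGER)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?)",
    [(1, 100), (2, None), (3, 300)],  # None is stored as NULL
)

avg, naive = conn.execute(
    """
    SELECT AVG(Milliseconds),                    -- averages the 2 non-NULL values
           SUM(Milliseconds) * 1.0 / COUNT(*)    -- divides by all 3 rows
      FROM tracks
    """
).fetchone()

print(avg)     # 200.0
print(naive)   # about 133.33, because the NULL row is counted in the divisor
conn.close()
```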


### `GROUP_CONCAT` - merging texts of the values

The `GROUP_CONCAT()` function is an aggregate function that concatenates all non-`NULL` values in a column.  
It uses a comma as the separator by default, but you can pass a different separator as the second argument.

For example, let's concatenate all track `Name`s separately for each album:

In [10]:
%%sql
SELECT AlbumId, GROUP_CONCAT(Name, ';') AS TrackNames
  FROM tracks 
  GROUP BY AlbumId
  LIMIT 5

AlbumId,TrackNames
1,For Those About To Rock (We Salute You);Put The Finger On You;Let's Get It Up;Inject The Venom;Snowballed;Evil Walks;C.O.D.;Breaking The Rules;Night Of The Long Knives;Spellbound
2,Balls to the Wall
3,Fast As a Shark;Restless and Wild;Princess of the Dawn
4,Go Down;Dog Eat Dog;Let There Be Rock;Bad Boy Boogie;Problem Child;Overdose;Hell Ain't A Bad Place To Be;Whole Lotta Rosie
5,Walk On Water;Love In An Elevator;Rag Doll;What It Takes;Dude (Looks Like A Lady);Janie's Got A Gun;Cryin';Amazing;Blind Man;Deuces Are Wild;The Other Side;Crazy;Eat The Rich;Angel;Livin' On The Edge
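The default separator and the skipping of `NULL` values can be seen side by side in a small sketch. It uses Python's `sqlite3` module and a made-up in-memory table (not Chinook data). Since the order of the concatenated elements is not guaranteed by default, the check compares the sorted pieces instead of the exact string.

```python
import sqlite3

# Toy stand-in for the `tracks` table (made-up rows, not Chinook data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (TrackId INTEGER, Name TEXT, AlbumId INTEGER)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [(1, "A", 10), (2, None, 10), (3, "B", 10), (4, "C", 20)],
)

rows = conn.execute(
    """
    SELECT AlbumId,
           GROUP_CONCAT(Name)      AS WithComma,      -- default separator ','
           GROUP_CONCAT(Name, ';') AS WithSemicolon   -- custom separator
      FROM tracks
      GROUP BY AlbumId
      ORDER BY AlbumId
    """
).fetchall()

for album_id, with_comma, with_semicolon in rows:
    # The NULL name in album 10 contributes nothing to either string.
    print(album_id, sorted(with_comma.split(",")), sorted(with_semicolon.split(";")))
conn.close()
```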
