# 07 - Aggregating data with GROUP BY and ORDER BY

The GROUP BY clause is an optional clause of the SELECT statement. It groups a selected set of rows into summary rows by the values of one or more columns.

The GROUP BY clause returns one row for each group. For each group, you can apply an aggregate function such as MIN, MAX, SUM, COUNT, or AVG to provide more information about each group.
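
As a minimal sketch of the one-row-per-group behaviour, the snippet below uses Python's built-in `sqlite3` module with an in-memory database rather than the MySQL server used in this notebook. The table contents and country names are hypothetical toy data, not rows from `world.country`.

```python
import sqlite3

# Hypothetical toy data (not the real world.country table).
rows = [
    ("Asia", "A-land", 50),
    ("Asia", "B-land", 150),
    ("Europe", "C-land", 30),
    ("Europe", "D-land", 10),
    ("Europe", "E-land", 20),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (Continent TEXT, Name TEXT, Population INTEGER)")
conn.executemany("INSERT INTO country VALUES (?, ?, ?)", rows)

# Five input rows collapse into one summary row per continent.
summary = conn.execute("""
    SELECT Continent,
           COUNT(*),
           MIN(Population),
           MAX(Population),
           SUM(Population),
           AVG(Population)
    FROM country
    GROUP BY Continent
    ORDER BY Continent
""").fetchall()

for row in summary:
    print(row)

conn.close()
```

Each printed tuple corresponds to one group, with the aggregate functions evaluated only over the rows of that group.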

In [1]:
import pandas as pd
import mysql.connector as sql
import os

In [2]:
connection = sql.connect(
    host = os.environ.get('mysql_host'),
    user = os.environ.get('mysql_user'),
    password = os.environ.get('mysql_password')
)

cursor = connection.cursor()

### 1. Grouping data
Take the country table as an example.

### 1.1 Check the columns first

In [4]:
pd.read_sql_query("""
    SELECT *
    FROM world.country
    LIMIT 3""",
    connection)

Unnamed: 0,Code,Name,Continent,Region,SurfaceArea,IndepYear,Population,LifeExpectancy,GNP,GNPOld,LocalName,GovernmentForm,HeadOfState,Capital,Code2
0,ABW,Aruba,North America,Caribbean,193.0,,103000,78.4,828.0,793.0,Aruba,Nonmetropolitan Territory of The Netherlands,Beatrix,129,AW
1,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1,AF
2,AGO,Angola,Africa,Central Africa,1246700.0,1975.0,12878000,38.3,6648.0,7984.0,Angola,Republic,José Eduardo dos Santos,56,AO


### 1.2 Check unique values
First, let's look at the number of continents. We can use the DISTINCT keyword with the SELECT statement to eliminate duplicate records and fetch only the unique ones.

In [7]:
pd.read_sql_query("""
    SELECT COUNT(DISTINCT Continent)
    FROM world.country""",
    connection)

Unnamed: 0,COUNT(DISTINCT Continent)
0,7


We can also use the GROUP BY clause to list the unique values themselves: it returns one row per continent, so duplicates disappear.

In [8]:
pd.read_sql_query("""
    SELECT Continent
    FROM world.country
    GROUP BY Continent""",
    connection)

Unnamed: 0,Continent
0,North America
1,Asia
2,Africa
3,Europe
4,South America
5,Oceania
6,Antarctica
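
To illustrate that `SELECT DISTINCT` and a bare `GROUP BY` return the same set of unique values, here is a small sketch. It assumes an in-memory SQLite database (via the standard `sqlite3` module) in place of MySQL, and the rows are made-up placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (Continent TEXT, Name TEXT)")
# Hypothetical rows: three continents, some of them repeated.
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [
        ("Asia", "A-land"),
        ("Asia", "B-land"),
        ("Europe", "C-land"),
        ("Africa", "D-land"),
        ("Africa", "E-land"),
    ],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT Continent FROM country"
).fetchall()
grouped_rows = conn.execute(
    "SELECT Continent FROM country GROUP BY Continent"
).fetchall()

# Without ORDER BY the row order is not guaranteed, so compare as sets.
same_values = set(distinct_rows) == set(grouped_rows)
print(len(distinct_rows), len(grouped_rows), same_values)

conn.close()
```

Both queries yield one row per continent. GROUP BY becomes the more useful form as soon as an aggregate function is added to the SELECT list.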


### 1.3 Use aggregate functions on Groups
Aggregating per group gives more detail than aggregating over a whole column.

In [9]:
pd.read_sql_query("""
    SELECT Continent, AVG(Population), AVG(SurfaceArea)
    FROM world.country
    GROUP BY Continent""",
    connection)

Unnamed: 0,Continent,AVG(Population),AVG(SurfaceArea)
0,North America,13053860.0,654445.1
1,Asia,72647560.0,625117.7
2,Africa,13525430.0,521558.2
3,Europe,15871190.0,501068.1
4,South America,24698570.0,1276066.0
5,Oceania,1085755.0,305867.6
6,Antarctica,0.0,2626420.0
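
The difference between a whole-column aggregate and a per-group aggregate can be sketched as follows. This again assumes an in-memory SQLite database with hypothetical toy numbers instead of the MySQL `world.country` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (Continent TEXT, Population INTEGER)")
# Hypothetical populations, chosen so the averages are easy to check by hand.
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("Asia", 50), ("Asia", 150), ("Europe", 30), ("Europe", 10), ("Europe", 20)],
)

# Without GROUP BY, the aggregate covers the whole table: a single row.
overall = conn.execute("SELECT AVG(Population) FROM country").fetchall()

# With GROUP BY, the same aggregate is computed once per continent.
per_group = conn.execute("""
    SELECT Continent, AVG(Population)
    FROM country
    GROUP BY Continent
    ORDER BY Continent
""").fetchall()

print(overall)
print(per_group)

conn.close()
```

The overall average hides the fact that the two groups differ widely, which the grouped query makes visible.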


### 2. Order/Sort Records
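First, let us check the maximum population in each continent, together with an independence year. Because IndepYear is neither aggregated nor listed in GROUP BY, MySQL takes it from an arbitrary row of each group (and rejects the query when the ONLY_FULL_GROUP_BY SQL mode is enabled), so it is not necessarily the independence year of the most populous country.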

In [16]:
pd.read_sql_query("""
    SELECT Continent, IndepYear, MAX(Population)
    FROM world.country
    GROUP BY Continent""",
    connection)

Unnamed: 0,Continent,IndepYear,MAX(Population)
0,North America,,278357000
1,Asia,1919.0,1277558000
2,Africa,1975.0,111506000
3,Europe,1912.0,146934000
4,South America,1816.0,170115000
5,Oceania,,18886000
6,Antarctica,,0


The year column is clearly not sorted. This is a good time to bring up the ORDER BY clause, which goes near the end of a SQL statement (after any WHERE, GROUP BY, and HAVING). We can sort the query results by year.

In [15]:
pd.read_sql_query("""
    SELECT Continent, IndepYear, MAX(Population)
    FROM world.country
    GROUP BY Continent
    ORDER BY IndepYear""",
    connection)

Unnamed: 0,Continent,IndepYear,MAX(Population)
0,North America,,278357000
1,Oceania,,18886000
2,Antarctica,,0
3,South America,1816.0,170115000
4,Europe,1912.0,146934000
5,Asia,1919.0,1277558000
6,Africa,1975.0,111506000


By default, ORDER BY sorts in ascending order (ASC). To sort in descending order, add the DESC keyword after the column name. MySQL places NULL values first in ascending order, which is why the rows without an independence year appear at the top.
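
A short sketch of ascending versus descending order, including where NULL values land. It assumes SQLite (standard `sqlite3` module), which, like MySQL, sorts NULLs first in ascending order. The years are placeholder values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (Continent TEXT, IndepYear INTEGER)")
# Hypothetical rows; None is stored as SQL NULL.
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("Asia", 1919), ("Africa", 1975), ("Oceania", None), ("Europe", 1912)],
)

# ASC is the default, so it could be omitted here.
ascending = conn.execute(
    "SELECT Continent FROM country ORDER BY IndepYear ASC"
).fetchall()

# DESC reverses the order; the NULL row moves to the end.
descending = conn.execute(
    "SELECT Continent FROM country ORDER BY IndepYear DESC"
).fetchall()

print([r[0] for r in ascending])
print([r[0] for r in descending])

conn.close()
```

Other database systems may place NULLs differently, so it is worth checking the documentation of the engine in use.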

### 3. Filter data on groups with the HAVING clause
Sometimes we want to filter records based on a group or an aggregated value. The first instinct might be to use a WHERE clause, but this does not work: WHERE filters individual rows before they are grouped, so it cannot filter aggregations, and it must appear before GROUP BY. For example, placing a WHERE clause after GROUP BY, as in the query below, is rejected with a syntax error (1064).

In [23]:
pd.read_sql_query("""
    SELECT Continent, IndepYear, MAX(Population)
    FROM world.country
    GROUP BY Continent
    WHERE Population > 1000000
    ORDER BY IndepYear""",
    connection)

DatabaseError: Execution failed on sql '
    SELECT Continent, IndepYear, MAX(Population)
    FROM world.country
    GROUP BY Continent
    WHERE Population > 1000000
    ORDER BY IndepYear': 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WHERE Population > 1000000
    ORDER BY IndepYear' at line 4

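In such a case, we can use the HAVING clause to specify a filter condition for a group or an aggregate. Like GROUP BY, HAVING is an optional clause of the SELECT statement, and the two are usually used together: GROUP BY collapses a set of rows into summary rows, and HAVING then filters those groups based on the specified conditions.

It is worth noting that the HAVING clause must come after the GROUP BY clause and before any ORDER BY. Also note that the query below filters on the bare Population column, which is not aggregated. MySQL accepts this only when the ONLY_FULL_GROUP_BY SQL mode is disabled, and the value tested comes from an arbitrary row of each group. A condition on an aggregate, such as SUM(Population), is the more typical use of HAVING.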

In [29]:
pd.read_sql_query("""
    SELECT Continent, Population, SUM(Population), IndepYear
    FROM world.country
    GROUP BY Continent
    HAVING Population > 1000000
    ORDER BY IndepYear""",
    connection)

Unnamed: 0,Continent,Population,SUM(Population),IndepYear
0,South America,37032000,345780000.0,1816
1,Europe,3401200,730074600.0,1912
2,Asia,22720000,3705026000.0,1919
3,Africa,12878000,784475000.0,1975
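
The contrast between WHERE (filters rows before grouping) and HAVING (filters groups after aggregation) can be sketched as below. The example assumes an in-memory SQLite database and hypothetical toy populations. The error raised for an aggregate inside WHERE is SQLite's, and MySQL reports a different error for the same mistake.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (Continent TEXT, Population INTEGER)")
# Hypothetical toy data.
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("Asia", 50), ("Asia", 150), ("Europe", 30), ("Europe", 10), ("Europe", 20)],
)

# WHERE runs before grouping: rows with Population <= 15 are dropped first,
# then the remaining rows are summed per continent.
where_result = conn.execute("""
    SELECT Continent, SUM(Population)
    FROM country
    WHERE Population > 15
    GROUP BY Continent
    ORDER BY Continent
""").fetchall()

# HAVING runs after grouping: whole groups are kept or dropped
# based on the aggregated value.
having_result = conn.execute("""
    SELECT Continent, SUM(Population)
    FROM country
    GROUP BY Continent
    HAVING SUM(Population) > 100
    ORDER BY Continent
""").fetchall()

# An aggregate function inside WHERE is rejected by the database.
try:
    conn.execute(
        "SELECT Continent FROM country "
        "WHERE SUM(Population) > 100 GROUP BY Continent"
    )
    aggregate_in_where_failed = False
except sqlite3.OperationalError:
    aggregate_in_where_failed = True

print(where_result)
print(having_result)
print(aggregate_in_where_failed)

conn.close()
```

The two clauses can also be combined in one statement: WHERE narrows the rows that enter each group, and HAVING then decides which of the resulting groups to keep.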


### Summary
In this notebook, we learned how to use the DISTINCT keyword to eliminate duplicates and return only distinct results.

Next, we learned how to aggregate and sort data using GROUP BY and ORDER BY.

We also saw how aggregate functions such as SUM(), MAX(), MIN(), AVG(), and COUNT() summarize each group.

Finally, we used the HAVING clause to filter on aggregated values, which cannot be done with the WHERE clause.

# References
- [Chonghua Yin notebook](https://github.com/royalosyin/Practice-SQL-with-SQLite-and-Jupyter-Notebook/blob/master/ex07-Aggregating%20data%20with%20GROUP%20BY%20and%20ORDER%20BY.ipynb)