Copyright Jana Schaich Borg/Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

# MySQL Exercise 6: Common Pitfalls of Grouped Queries

     
## 1. Misinterpretations due to Aggregation Mismatches

Begin by loading the SQL library, connecting to the Dognition database, and setting the Dognition database as the default.

In [1]:
%load_ext sql
%sql mysql://studentuser:studentpw@mysqlserver/dognitiondb
%sql USE dognitiondb

0 rows affected.


[]

Imagine that we would like to retrieve, for each breed_type in the Dognition database, the number of unique dog_guids associated with that breed_type and their weight.  Let's try to write a query that reflects that request:

```mySQL
SELECT breed_type, COUNT(DISTINCT dog_guid) AS NumDogs, weight
FROM dogs
GROUP BY breed_type;
```

**Now take a look at the output:**

In [2]:
%%sql
SELECT breed_type, COUNT(DISTINCT dog_guid) AS NumDogs, weight # weight is also aggregated, which is not what we want
FROM dogs
GROUP BY breed_type;

4 rows affected.


breed_type,NumDogs,weight
Cross Breed,5568,0
Mixed Breed/ Other/ I Don't Know,9499,50
Popular Hybrid,1160,70
Pure Breed,18823,50


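Notice that the weight column is not aggregated, even though the other two columns are. We asked for one row per breed_type, but each breed_type contains many dogs with many different weights, so there is no single weight that MySQL could meaningfully report. MySQL cannot return aggregated and non-aggregated values in the same row of output, so it fills the weight column with a single value taken from one row in each group (apparently the first row it encounters). MySQL did not report an error here, so be careful whenever you use GROUP BY.  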
   
This flexibility is very convenient when you know that all the values in a non-aggregated column are the same for the subsets of the data that correspond to the variable by which you are grouping.  In fact, the visualization software Tableau (which is based on the SQL language) recognized how frequently this type of situation arises and came up with a custom solution for its customers.  Tableau incorporated an aggregation-like function called "ATTR" into its interface to let users say "I'm using an aggregation function here because SQL says I have to, but I know that this is a situation where all of the rows in each group will have the same value."  
    
Tableau's approach is helpful because it forces users to acknowledge that a field in a query is supposed to be aggregated, and Tableau's formulas will crash if all the rows in a group do not have the same value.  MySQL doesn't force users to do this.  MySQL trusts users to know what they are doing, and will provide an output even if all the rows in a group do not have the same value.  Unfortunately, this approach can cause havoc if you aren't aware of what you are asking MySQL to do and aren't familiar with your data.
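
MySQL has no ATTR function, but a similar check can be written by hand. The query below is only a sketch, assuming the same dogs table used above; the aliases NumDistinctWeights and WeightIfConstant are made-up names for illustration. It reports how many different weights occur in each breed_type, and returns a weight only when every dog in the group shares it:

```mySQL
SELECT breed_type,
       COUNT(DISTINCT dog_guid) AS NumDogs,
       COUNT(DISTINCT weight) AS NumDistinctWeights, # distinct non-NULL weights in the group
       CASE WHEN MIN(weight) = MAX(weight) THEN MIN(weight) # all non-NULL weights are identical
            ELSE NULL                                       # weights differ (or are all NULL)
       END AS WeightIfConstant
FROM dogs
GROUP BY breed_type;
```

If NumDistinctWeights is greater than 1 for a group, any single weight reported for that group by a plain GROUP BY query should not be trusted.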


Let's see a couple more first-hand examples of this tricky GROUP BY behavior.  Let's assume you want to know the number of each kind of test completed in different months of the year.

You execute the following query:

```mySQL
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name
ORDER BY test_name ASC, Month ASC;
```

**Question 1: What does the Month column represent in this output?  Take a look and see what you think:**

In [2]:
%%sql
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name
ORDER BY test_name ASC, Month ASC
LIMIT 5;

5 rows affected.


test_name,Month,Num_Completed_Tests
1 vs 1 Game,6,255
3 vs 1 Game,5,368
5 vs 1 Game,5,620
Arm Pointing,2,11452
Cover Your Eyes,2,7250


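As the output shows, the Month column is not meaningful here. Each test was completed in many different months, but MySQL reports only one month per test_name, apparently taken from the first row in each group.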

Now try a similar query, but GROUP BY Month instead of test_name:

```mySQL
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY Month
ORDER BY Month ASC, test_name ASC;
```

**Question 2: What does test_name mean in this case?  Try it out:**

In [3]:
%%sql
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY Month
ORDER BY Month ASC, test_name ASC;

12 rows affected.


test_name,Month,Num_Completed_Tests
Delayed Cup Game,1,11068
Yawn Warm-up,2,9122
Yawn Warm-up,3,9572
Physical Reasoning Game,4,7130
Delayed Cup Game,5,21013
Foot Pointing,6,23381
Eye Contact Game,7,15977
Memory versus Smell,8,13382
Yawn Warm-up,9,19853
Yawn Warm-up,10,39237


It looks like in both of these cases, MySQL is likely populating the unaggregated column with the first value it finds in that column within each "group" of rows it is examining.  

So how do we prevent this from happening?

><mark>The only way to be sure how the MySQL database will summarize a set of data in a SELECT clause is to tell it how to do so with an aggregate function.</mark>

I should have written my original request to read:

"I would like to know, for *each breed type* of dog, *the number of* unique Dog_Guids there are in the Dognition database and *the breed_type's average weight*."

The query that would have reflected this sentence would have executed an aggregate function for both Dog_Guids and weight.  The output of these aggregate functions would be unambiguous, and would easily be represented in a single table.  
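
A minimal sketch of a query matching that rewritten request, assuming the same dogs table used above (the alias AvgWeight is just an illustrative name):

```mySQL
SELECT breed_type,
       COUNT(DISTINCT dog_guid) AS NumDogs, # one count per breed_type
       AVG(weight) AS AvgWeight             # one average per breed_type
FROM dogs
GROUP BY breed_type;
```

Every column in the SELECT list is now either the grouping column or the result of an aggregate function, so each output row has exactly one well-defined value per column.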
  
   
## 2. Errors due to Aggregation Mismatches

It is important to note that the issues I described above are the consequence of mismatching aggregate and non-aggregate functions through the GROUP BY clause in MySQL, but other databases manifest the problem in a different way.  Other databases won't allow you to run the queries described above at all.  When you try to do so, you get an error message that sounds something like:

```
Column 'X' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
```

Especially when you are just starting to learn MySQL, these error messages can be confusing and infuriating.  A good discussion of this problem can be found here:

http://weblogs.sqlteam.com/jeffs/archive/2007/07/20/but-why-must-that-column-be-contained-in-an-aggregate.aspx
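
MySQL can also be asked to behave like these stricter databases. The following is a sketch, assuming your MySQL server supports the ONLY_FULL_GROUP_BY SQL mode and that your account is allowed to change settings for its own session. Whether the setting persists between notebook cells depends on how your connection is managed, so treat this as an illustration rather than a guaranteed recipe:

```mySQL
# Replace this session's SQL modes with strict GROUP BY checking.
# Note: this overwrites any other SQL modes that were set for the session.
SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY';

# With the mode enabled, this query should be rejected with an error
# because weight is neither aggregated nor listed in the GROUP BY clause.
SELECT breed_type, COUNT(DISTINCT dog_guid) AS NumDogs, weight
FROM dogs
GROUP BY breed_type;
```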

As a way to prevent these logical mismatches or error messages, you will often hear a rule that "every non-aggregated field that is listed in the SELECT list *must* be listed in the GROUP BY list."  You have just seen that this rule is not true in MySQL, which makes MySQL both more flexible and more tricky to work with.  However, it is a useful rule of thumb for helping you avoid unknown mismatch errors.
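
Applied to the test-count example from earlier, that rule of thumb might look like the sketch below, which assumes the same complete_tests table. Both non-aggregated fields in the SELECT list appear in the GROUP BY list, so each output row describes one test in one month:

```mySQL
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name, Month # one group per test_name/Month combination
ORDER BY test_name ASC, Month ASC;
```

Using the alias Month in the GROUP BY clause works in MySQL; some other databases require you to repeat the full expression MONTH(created_at) there instead.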



## 3. By the way, even if you want to, there is no way to intentionally include aggregation mismatches in a single query


You might want to know the total number of unique User_Guids in the Dognition database, and in addition, the total number of unique User_Guids and average weight associated with each breed type. Given that you want to see the information efficiently to help you make decisions, you would like all of this information in one output.  After all, that would be easy to do in Excel, given that all of this information could easily be summarized in a single worksheet.

To retrieve this information, you try one of the queries described above.  Since you know the rule describing the relationship between fields in the SELECT and GROUP BY clauses, you write:

```mySQL
SELECT COUNT(DISTINCT dog_guid), breed_type, AVG(weight) AS avg_weight
FROM dogs
GROUP BY breed_type;
```
