Copyright Jana Schaich Borg/Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

# MySQL Exercise 5: Summaries of Groups of Data
    
## The GROUP BY clause

The GROUP BY clause comes after the WHERE clause, but before ORDER BY or LIMIT.
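
As a rough illustration of that ordering, the sketch below runs a query containing all four clauses against a small in-memory SQLite database from Python. SQLite is only a stand-in for the course's MySQL server, and the `reviews` rows are made up for the example, so read it as a toy under those assumptions rather than as Dognition data.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server,
# and these review rows are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (test_name TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [
        ("Arm Pointing", 5),
        ("Arm Pointing", 4),
        ("Yawn Game", 2),
        ("Yawn Game", 3),
        ("Eye Contact Game", 1),
        ("Eye Contact Game", None),  # NULL rating, removed by WHERE
    ],
)

clause_order_rows = conn.execute(
    """
    SELECT test_name, AVG(rating) AS avg_rating
    FROM reviews
    WHERE rating IS NOT NULL      -- filter individual rows first
    GROUP BY test_name            -- then collapse the remaining rows into groups
    ORDER BY avg_rating DESC      -- then sort the groups
    LIMIT 2                       -- finally keep the first two sorted groups
    """
).fetchall()

print(clause_order_rows)
conn.close()
```

Swapping any two of the clauses (for example, writing ORDER BY before GROUP BY) makes the statement a syntax error in both SQLite and MySQL.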


In [1]:
%load_ext sql
%sql mysql://studentuser:studentpw@mysqlserver/dognitiondb
%sql USE dognitiondb

0 rows affected.


[]

In [2]:
%%sql
SELECT test_name, AVG(rating) AS AVG_Rating
FROM reviews
GROUP BY test_name; # AVG(rating) is computed within each test_name group; the result has 40 rows, one per group

40 rows affected.


test_name,AVG_Rating
1 vs 1 Game,3.9206
3 vs 1 Game,4.2857
5 vs 1 Game,3.9272
Arm Pointing,4.2153
Cover Your Eyes,2.6741
Delayed Cup Game,3.3514
Different Perspective,2.7647
Expression Game,4.0
Eye Contact Game,2.9372
Eye Contact Warm-up,0.9632


In [3]:
%%sql
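SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests # MONTH() returns the month as a number from 1 to 12
FROM complete_tests
GROUP BY Month; # group by the derived Month column only; test_name is neither grouped nor aggregated, so the value shown for each month is an arbitrary one from that group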

12 rows affected.


test_name,Month,Num_Completed_Tests
Delayed Cup Game,1,11068
Yawn Warm-up,2,9122
Yawn Warm-up,3,9572
Physical Reasoning Game,4,7130
Delayed Cup Game,5,21013
Foot Pointing,6,23381
Eye Contact Game,7,15977
Memory versus Smell,8,13382
Yawn Warm-up,9,19853
Yawn Warm-up,10,39237
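
In the In [3] query, test_name is neither grouped nor aggregated, so each month's row can show only one of the test names that occurred in that month. The sketch below tries to make that visible on a tiny made-up `complete_tests` table. It assumes an in-memory SQLite database in place of MySQL, and because SQLite has no `MONTH()` function it uses `CAST(strftime('%m', ...) AS INTEGER)` as a substitute.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL and the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE complete_tests (test_name TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO complete_tests VALUES (?, ?)",
    [
        ("Yawn Warm-up", "2015-01-03 10:00:00"),
        ("Yawn Warm-up", "2015-01-09 11:30:00"),
        ("Delayed Cup Game", "2015-01-15 09:15:00"),
        ("Delayed Cup Game", "2015-02-01 08:00:00"),
    ],
)

# SQLite substitute for MySQL's MONTH(created_at).
month_expr = "CAST(strftime('%m', created_at) AS INTEGER)"

# Grouping by month only: one row per month, even though January
# contains two different test names.
by_month = conn.execute(
    f"SELECT {month_expr} AS month, COUNT(created_at), COUNT(DISTINCT test_name) "
    "FROM complete_tests GROUP BY month ORDER BY month"
).fetchall()

# Grouping by test name and month: one row per combination.
by_test_and_month = conn.execute(
    f"SELECT test_name, {month_expr} AS month, COUNT(created_at) "
    "FROM complete_tests GROUP BY test_name, month ORDER BY test_name, month"
).fetchall()

print(by_month)
print(by_test_and_month)
conn.close()
```

January holds three completed tests spread over two test names, so a single test_name printed next to the January total would describe only part of that group.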


In [4]:
%%sql
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name, Month # group by multiple columns
LIMIT 15;

15 rows affected.


test_name,Month,Num_Completed_Tests
1 vs 1 Game,1,25
1 vs 1 Game,2,28
1 vs 1 Game,3,22
1 vs 1 Game,4,12
1 vs 1 Game,5,13
1 vs 1 Game,6,18
1 vs 1 Game,7,36
1 vs 1 Game,8,17
1 vs 1 Game,9,28
1 vs 1 Game,10,27


In [5]:
%%sql
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name, MONTH(created_at) # grouping by the expression itself gives the same result as grouping by its alias
LIMIT 15;

15 rows affected.


test_name,Month,Num_Completed_Tests
1 vs 1 Game,1,25
1 vs 1 Game,2,28
1 vs 1 Game,3,22
1 vs 1 Game,4,12
1 vs 1 Game,5,13
1 vs 1 Game,6,18
1 vs 1 Game,7,36
1 vs 1 Game,8,17
1 vs 1 Game,9,28
1 vs 1 Game,10,27


**Question 1: Output a table that calculates the number of distinct female and male dogs in each breed group of the Dogs table, sorted by the total number of dogs in descending order (the sex/breed_group pair with the greatest number of dogs should have 8466 unique Dog_Guids):**

In [6]:
%%sql
SELECT gender, breed_group, COUNT(DISTINCT dog_guid) AS Num_Dogs
FROM dogs
GROUP BY breed_group, gender
ORDER BY Num_Dogs DESC
LIMIT 10;

10 rows affected.


gender,breed_group,Num_Dogs
male,,8466
female,,8367
male,Sporting,2584
female,Sporting,2262
male,Herding,1736
female,Herding,1704
male,Toy,1473
female,Toy,1145
male,Non-Sporting,1098
male,Working,1075


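MySQL also lets you refer to columns in the GROUP BY and ORDER BY clauses by their position in the SELECT list instead of by name. In this SELECT statement:

```mySQL
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
```

test_name would be #1, Month would be #2, and Num_Completed_Tests would be #3. You could then rewrite the earlier query that groups by test_name and Month to read: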

```mySQL
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY 1, 2
ORDER BY 1 ASC, 2 ASC;
```
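
One way to convince yourself that positions and names are interchangeable is to run both forms and compare the results. The sketch below does that on a small made-up table. It assumes an in-memory SQLite database in place of MySQL (SQLite also accepts column positions in GROUP BY and ORDER BY) and uses `strftime` as a substitute for `MONTH()`.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE complete_tests (test_name TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO complete_tests VALUES (?, ?)",
    [
        ("1 vs 1 Game", "2015-01-05 12:00:00"),
        ("1 vs 1 Game", "2015-01-20 12:00:00"),
        ("1 vs 1 Game", "2015-02-02 12:00:00"),
        ("Arm Pointing", "2015-01-11 12:00:00"),
    ],
)

select_list = (
    "SELECT test_name, "
    "CAST(strftime('%m', created_at) AS INTEGER) AS month, "  # substitute for MONTH()
    "COUNT(created_at) AS num_completed_tests "
    "FROM complete_tests "
)

# Same query twice: once with names, once with SELECT-list positions.
by_name = conn.execute(
    select_list + "GROUP BY test_name, month ORDER BY test_name ASC, month ASC"
).fetchall()
by_position = conn.execute(
    select_list + "GROUP BY 1, 2 ORDER BY 1 ASC, 2 ASC"
).fetchall()

print(by_name)
print(by_position == by_name)
conn.close()
```

Positions are shorter to type, but they silently change meaning if the SELECT list is reordered, which is a reason some people prefer names in longer queries.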

**Question 2: Revise the query you wrote in Question 1 so that it uses only numbers in the GROUP BY and ORDER BY fields.**

In [7]:
%%sql
SELECT gender, breed_group, COUNT(DISTINCT dog_guid) AS Num_Dogs
FROM dogs
GROUP BY 1, 2 # 1 is gender, 2 is breed_group
ORDER BY 3 DESC # 3 is Num_Dogs
LIMIT 5;

5 rows affected.


gender,breed_group,Num_Dogs
male,,8466
female,,8367
male,Sporting,2584
female,Sporting,2262
male,Herding,1736


## The HAVING clause

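You can query subsets of aggregated groups using the HAVING clause. WHERE filters individual rows before they are grouped; HAVING filters whole groups afterwards, so the condition that follows HAVING must be computable from a group of rows, typically an aggregate such as COUNT() or AVG().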

```mySQL
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
WHERE MONTH(created_at)=11 OR MONTH(created_at)=12
GROUP BY 1, 2
HAVING COUNT(created_at)>=20
ORDER BY 3 DESC;
```
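
To see the division of labor between WHERE and HAVING, the sketch below runs a query of the same shape on a tiny made-up table, with the threshold lowered to 2 so the toy data can cross it. It assumes an in-memory SQLite database instead of MySQL and uses `strftime` as a substitute for `MONTH()`.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE complete_tests (test_name TEXT, created_at TEXT)")
rows = (
    [("Yawn Warm-up", "2015-11-02 09:00:00")] * 3    # November, 3 tests
    + [("Arm Pointing", "2015-11-05 09:00:00")] * 1  # November, 1 test
    + [("Yawn Warm-up", "2015-12-01 09:00:00")] * 2  # December, 2 tests
    + [("Arm Pointing", "2015-06-10 09:00:00")] * 5  # June, 5 tests
)
conn.executemany("INSERT INTO complete_tests VALUES (?, ?)", rows)

having_rows = conn.execute(
    """
    SELECT test_name,
           CAST(strftime('%m', created_at) AS INTEGER) AS month,
           COUNT(created_at) AS num_completed_tests
    FROM complete_tests
    WHERE CAST(strftime('%m', created_at) AS INTEGER) IN (11, 12)  -- row filter
    GROUP BY 1, 2
    HAVING COUNT(created_at) >= 2                                  -- group filter
    ORDER BY 3 DESC
    """
).fetchall()

print(having_rows)
conn.close()
```

The June group has the most rows but never reaches HAVING because WHERE removes its rows first; the November Arm Pointing group survives WHERE but is dropped by HAVING because it contains only one row.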

**Question 3: Revise the query you wrote in Question 2 so that it (1) excludes the NULL and empty string entries in the breed_group field, and (2) excludes any groups that don't have at least 1,000 distinct Dog_Guids in them. Your result should contain 8 rows. (HINT: sometimes empty strings are registered as non-NULL values. You might want to include the following line somewhere in your query to exclude these values as well):**

```mySQL
breed_group!=""
```
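
The distinction in the hint can be checked on a small example. The sketch below compares an IS NOT NULL filter with a `!= ''` filter on a made-up `dogs` table. It assumes an in-memory SQLite database as a stand-in for MySQL; the comparison behavior shown (any comparison with NULL evaluates to NULL, which WHERE treats as not true) is standard SQL and applies to MySQL as well.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (dog_guid TEXT, breed_group TEXT)")
conn.executemany(
    "INSERT INTO dogs VALUES (?, ?)",
    [
        ("d1", "Toy"),
        ("d2", ""),    # empty string: a real, non-NULL value
        ("d3", None),  # NULL: no value at all
        ("d4", "Herding"),
    ],
)

# IS NOT NULL removes the NULL row but keeps the empty string.
not_null = [
    r[0]
    for r in conn.execute(
        "SELECT dog_guid FROM dogs WHERE breed_group IS NOT NULL ORDER BY dog_guid"
    )
]

# != '' removes the empty string, and the NULL row too, because
# NULL != '' evaluates to NULL rather than true.
not_empty = [
    r[0]
    for r in conn.execute(
        "SELECT dog_guid FROM dogs WHERE breed_group != '' ORDER BY dog_guid"
    )
]

print(not_null)
print(not_empty)
conn.close()
```

Writing both conditions, as the hint suggests, costs nothing and makes the intent explicit to the next reader of the query.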

In [8]:
%%sql
SELECT gender, breed_group, COUNT(DISTINCT dog_guid) AS Num_Dogs
FROM dogs
WHERE breed_group!="" AND breed_group IS NOT NULL AND breed_group<>"None" # <> means the same as !=
GROUP BY 1, 2
HAVING Num_Dogs>=1000 # HAVING comes after GROUP BY and filters whole groups
ORDER BY 3 DESC;

8 rows affected.


gender,breed_group,Num_Dogs
male,Sporting,2584
female,Sporting,2262
male,Herding,1736
female,Herding,1704
male,Toy,1473
female,Toy,1145
male,Non-Sporting,1098
male,Working,1075


## Practice incorporating GROUP BY and HAVING into your own queries.

**Question 4: Write a query that outputs the average number of tests completed and average mean inter-test-interval for every breed type, sorted by the average number of completed tests in descending order (popular hybrid should be the first row in your output).**

In [9]:
%%sql
SELECT breed_type, AVG(total_tests_completed) AS Avg_tests_completed, AVG(mean_iti_days) AS Avg_iti_days
FROM dogs 
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;

4 rows affected.


breed_type,Avg_tests_completed,Avg_iti_days
Popular Hybrid,10.257530120481928,1.9682781756219017
Cross Breed,9.945900537634408,1.994688302855342
Pure Breed,9.871602824737856,2.217604509580722
Mixed Breed/ Other/ I Don't Know,9.54250850170034,2.0993549515344747


**Question 5: Write a query that outputs the average amount of time it took customers to complete each type of test where any individual reaction times over 6000 hours are excluded and only average reaction times that are greater than 0 seconds are included (your output should end up with 67 rows).**


In [10]:
%%sql
SELECT test_name, AVG(TIMESTAMPDIFF(hour, start_time, end_time)) AS Duration
FROM exam_answers
WHERE TIMESTAMPDIFF(hour, start_time, end_time)<=6000 AND TIMESTAMPDIFF(minute, start_time, end_time)>0 # both conditions filter individual rows before grouping
GROUP BY 1
ORDER BY Duration DESC;

67 rows affected.


test_name,Duration
Excitability,806.1134
Attachment,717.7429
Shy/Boldness,716.3277
Sociability,533.1124
Gender,492.2347
Diet,466.6369
Confinement,354.5727
Social-Quiz,347.9484
Activity,342.814
Purina-Only,321.3333


**Question 6: Write a query that outputs the total number of unique User_Guids in each combination of State and ZIP code (postal code) in the United States, sorted first by state name in ascending alphabetical order, and second by total number of unique User_Guids in descending order (your first state should be AE and there should be 5043 rows in total in your output).**

In [11]:
%%sql
SELECT state, zip, COUNT(DISTINCT user_guid) AS Num_users
FROM users
WHERE country='US'
GROUP BY state, zip
ORDER BY state ASC, Num_users DESC
LIMIT 10;

10 rows affected.


state,zip,Num_users
AE,9128,2
AE,9053,1
AE,9107,1
AE,9469,1
AE,9845,1
AK,99709,3
AK,99507,3
AK,99577,2
AK,99501,2
AK,99587,1
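
The Question 6 query sorts on two keys in opposite directions: states ascending, and user counts descending within each state. The sketch below reproduces that pattern on a made-up `users` table. It assumes an in-memory SQLite database in place of MySQL, and every row, including the repeated user "u4" and the non-US row, is invented for the example.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_guid TEXT, state TEXT, zip TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        ("u1", "AK", "99709", "US"),
        ("u2", "AK", "99709", "US"),
        ("u3", "AK", "99501", "US"),
        ("u4", "AE", "9128", "US"),
        ("u4", "AE", "9128", "US"),  # repeated user, counted once
        ("u5", "AE", "9053", "US"),
        ("u6", "AE", "9128", "US"),
        ("u7", "ON", "M5V", "CA"),   # removed by the country filter
    ],
)

sorted_rows = conn.execute(
    """
    SELECT state, zip, COUNT(DISTINCT user_guid) AS num_users
    FROM users
    WHERE country = 'US'
    GROUP BY state, zip
    ORDER BY state ASC, num_users DESC  -- second key only breaks ties in the first
    """
).fetchall()

print(sorted_rows)
conn.close()
```

The descending count never moves a row out of its state: it only decides the order of ZIP codes that share the same state value.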


**Question 7: Write a query that outputs the total number of unique User_Guids in each combination of State and ZIP code in the United States *that have at least 5 users*, sorted first by state name in ascending alphabetical order, and second by total number of unique User_Guids in descending order (your first state/ZIP code combination should be AZ/86303).**

In [12]:
%%sql
SELECT state, zip, COUNT(DISTINCT user_guid) AS Num_users
FROM users
WHERE country='US'
GROUP BY state, zip
HAVING Num_users>=5
ORDER BY state ASC, Num_users DESC
LIMIT 5;

5 rows affected.


state,zip,Num_users
AZ,86303,14
AZ,85718,6
AZ,85254,5
AZ,85260,5
AZ,85711,5
