Copyright Jana Schaich Borg/Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

# MySQL Exercise 4: Summarizing your Data

COUNT works on any type of variable.  SUM and AVG are only appropriate for numerical data.  MIN and MAX are most often used with numerical data, as in this exercise, although MySQL also accepts strings and dates for them.

All aggregate functions require an argument in the parentheses after the function word: a column name (or an expression built from columns) or, for COUNT, a "\*".
    

## 1. The COUNT function

In [1]:
%load_ext sql
%sql mysql://studentuser:studentpw@mysqlserver/dognitiondb
%sql USE dognitiondb

0 rows affected.


[]

In [2]:
%%sql
SELECT COUNT(breed) # count the non-NULL values in the breed column
FROM dogs

1 rows affected.


COUNT(breed)
35050


In [3]:
%%sql
SELECT COUNT(DISTINCT breed) # count the number of distinct breed names in the breed column
  FROM dogs

1 rows affected.


COUNT(DISTINCT breed)
2006


**Question 1: Try combining this query with a WHERE clause to find how many individual dogs completed tests after March 1, 2014 (the answer should be 13,289):**

In [4]:
%%sql
SELECT COUNT(DISTINCT Dog_Guid)
FROM complete_tests
WHERE created_at > '2014-03-01';

1 rows affected.


COUNT(DISTINCT Dog_Guid)
13289


**Question 2: Count the number of rows in the dogs table using COUNT(\*):**

In [5]:
%%sql
SELECT COUNT(*) # When an asterisk is included in a count function, nulls are included in the count
FROM dogs;

1 rows affected.


COUNT(*)
35050


**Question 3: Now count the number of rows in the exclude column of the dogs table:**

In [6]:
%%sql
SELECT COUNT(exclude) # When a column is included in a count function, null values are ignored in the count
FROM dogs;

1 rows affected.


COUNT(exclude)
1025


The query for Question 3 should return a much smaller number than the query for Question 2.  That's because:

><mark> When a column is included in a count function, null values are ignored in the count. When an asterisk is included in a count function, nulls are included in the count.</mark>
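
The rule above can be checked on a tiny example. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module and a hypothetical `toy_dogs` table (neither is part of the Dognition database) so that it runs without a MySQL server. COUNT treats NULL values the same way in both systems.

```python
import sqlite3

# Hypothetical toy table standing in for `dogs`; SQLite is used only so the
# sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_dogs (dog_guid TEXT, exclude INTEGER)")
conn.executemany(
    "INSERT INTO toy_dogs VALUES (?, ?)",
    [("a", None), ("b", 1), ("c", None), ("d", 1), ("e", None)],
)

# COUNT(*) counts rows; COUNT(exclude) counts only non-NULL values of exclude.
count_star, count_exclude = conn.execute(
    "SELECT COUNT(*), COUNT(exclude) FROM toy_dogs"
).fetchone()

print(count_star)     # 5: every row, NULLs included
print(count_exclude)  # 2: the three NULL values are ignored
```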

 
**Question 4: How many distinct dogs have an exclude flag in the dogs table (value will be "1")? (the answer should be 853)**

In [7]:
%%sql
SELECT COUNT(DISTINCT Dog_Guid)
FROM dogs
WHERE exclude=1;

1 rows affected.


COUNT(DISTINCT Dog_Guid)
853


## 2. The SUM Function

ISNULL is a logical function that returns a 1 for every row that has a NULL value in the specified column, and a 0 for everything else.  If we sum up the 1s output by ISNULL(exclude), then, we should get the total number of NULL values in the column.  The expression that does this is `SUM(ISNULL(exclude))`.

It might be tempting to treat SQL like a calculator and leave out the SELECT statement, but you will quickly see that doesn't work.  

><mark>*Every SQL query that extracts data from a database MUST contain a SELECT statement.*</mark>

**Try counting the number of NULL values in the exclude column:**
      
              
     

In [8]:
%%sql
SELECT SUM(ISNULL(exclude)) # note: need to include SELECT statement
FROM dogs

1 rows affected.


SUM(ISNULL(exclude))
34025
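
This result agrees with the two counts from earlier: 35050 rows in total minus 1025 non-NULL values in exclude leaves 34025 NULL values. The sketch below shows the same relationship on a hypothetical `toy_dogs` table, using Python's built-in `sqlite3` module as a stand-in for MySQL. SQLite has no single-argument `ISNULL()` function, so the sketch assumes `exclude IS NULL` as the equivalent 1-or-0 expression.

```python
import sqlite3

# Hypothetical toy table; not part of the Dognition database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_dogs (exclude INTEGER)")
conn.executemany(
    "INSERT INTO toy_dogs VALUES (?)",
    [(None,), (1,), (None,), (1,), (None,)],
)

# `exclude IS NULL` is 1 for NULL rows and 0 otherwise, playing the role of
# MySQL's ISNULL(exclude); summing it counts the NULL values.
null_sum, total_rows, non_null = conn.execute(
    "SELECT SUM(exclude IS NULL), COUNT(*), COUNT(exclude) FROM toy_dogs"
).fetchone()

print(null_sum)               # 3
print(total_rows - non_null)  # 3: the same number, obtained from the two counts
```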


## 3. The AVG, MIN, and MAX Functions

AVG, MIN, and MAX all work very similarly to SUM.

**Question 5: What are the average, minimum, and maximum ratings given to the "Memory versus Pointing" game? (Your answer should be 3.5584, 0, and 9, respectively)**

In [9]:
%%sql
SELECT test_name, 
AVG(rating) AS AVG_Rating, 
MIN(rating) AS MIN_Rating, 
MAX(rating) AS MAX_Rating  # note: the last one doesn't have a comma ,
FROM reviews
WHERE test_name="Memory versus Pointing";

1 rows affected.


test_name,AVG_Rating,MIN_Rating,MAX_Rating
Memory versus Pointing,3.5584,0,9


What if you wanted the average rating for each of the 40 tests in the Reviews table?  One way to do that with the tools you know already is to write 40 separate queries like the ones you wrote above for each test, and then copy or transcribe the results into a separate table in another program like Excel to assemble all the results in one place.  That would be a very tedious and time-consuming exercise.  Fortunately, there is a very simple way to produce the results you want within one query.  That's what we will learn how to do in MySQL Exercise 5.  However, it is important that you feel comfortable with the syntax we have learned thus far before we start taking advantage of that functionality. Practice is the best way to become comfortable!


## Practice incorporating aggregate functions with everything else you've learned so far in your own queries.

**Question 6: How would you query how much time it took to complete each test provided in the exam_answers table, in minutes?  Title the column that represents this data "Duration."**  Note that the exam_answers table has over 2 million rows, so if you don't limit your output, it will take longer than usual to run this query.  (HINT: use the TIMESTAMPDIFF function described at: http://www.w3resource.com/mysql/date-and-time-functions/date-and-time-functions.php.)

In [12]:
%%sql
SELECT TIMESTAMPDIFF(MINUTE,start_time,end_time) AS 'Duration' # differences of less than 1 minute are returned as 0
FROM exam_answers
LIMIT 5;

5 rows affected.


Duration
345139
345139
345139
345138
345138


**Question 7: Include a column for Dog_Guid, start_time, and end_time in your query, and examine the output.  Do you notice anything strange?**  

In [13]:
%%sql
SELECT TIMESTAMPDIFF(MINUTE,start_time,end_time) AS 'Duration', dog_guid, start_time, end_time
FROM exam_answers
LIMIT 10;

10 rows affected.


Duration,dog_guid,start_time,end_time
345139,fd27b272-7144-11e5-ba71-058fbc01cf0b,2013-02-05 03:58:13,2013-10-02 20:18:06
345139,fd27b272-7144-11e5-ba71-058fbc01cf0b,2013-02-05 03:58:31,2013-10-02 20:18:06
345139,fd27b272-7144-11e5-ba71-058fbc01cf0b,2013-02-05 03:59:03,2013-10-02 20:18:06
345138,fd27b272-7144-11e5-ba71-058fbc01cf0b,2013-02-05 03:59:10,2013-10-02 20:18:06
345138,fd27b272-7144-11e5-ba71-058fbc01cf0b,2013-02-05 03:59:22,2013-10-02 20:18:06
345138,fd27b272-7144-11e5-ba71-058fbc01cf0b,2013-02-05 03:59:36,2013-10-02 20:18:06
345138,fd27b272-7144-11e5-ba71-058fbc01cf0b,2013-02-05 03:59:41,2013-10-02 20:18:06
345138,fd27b272-7144-11e5-ba71-058fbc01cf0b,2013-02-05 04:00:00,2013-10-02 20:18:06
345137,fd27b272-7144-11e5-ba71-058fbc01cf0b,2013-02-05 04:00:16,2013-10-02 20:18:06
345137,fd27b272-7144-11e5-ba71-058fbc01cf0b,2013-02-05 04:00:35,2013-10-02 20:18:06


If you explore your output, you will find that some of your calculated durations appear to be "0." In some cases, you will see many entries from the same Dog_Guid with the same start time and end time.  That should be impossible.  These types of entries probably represent tests run by the Dognition team rather than real customer data.  In other cases, though, a "0" is entered in the Duration column even though the start_time and end_time are different.  This is because we instructed the function to output the time difference in minutes; it returns a whole number of minutes, so any time difference shorter than one minute is output as "0."  If you change your function to output the time difference in seconds, the duration in most of these rows will be a non-zero number.
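
The whole-minute behavior can be reproduced outside the database. The sketch below is an illustration in plain Python, not MySQL's implementation: `whole_minutes` is a hypothetical helper that counts complete minutes between two timestamps and only handles the case where the end time is not earlier than the start time. The first two calls use timestamps from the output above; the 45-second gap is made up.

```python
from datetime import datetime


def whole_minutes(start, end):
    """Complete minutes between two 'YYYY-MM-DD HH:MM:SS' strings (assumes end >= start)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # Floor division drops the leftover seconds instead of rounding them.
    return int(delta.total_seconds() // 60)


first_row = whole_minutes("2013-02-05 03:58:13", "2013-10-02 20:18:06")
fourth_row = whole_minutes("2013-02-05 03:59:10", "2013-10-02 20:18:06")
# Hypothetical 45-second gap: shorter than one minute, so the result is 0.
under_a_minute = whole_minutes("2013-02-05 03:58:13", "2013-02-05 03:58:58")

print(first_row)       # 345139, as in the first output row
print(fourth_row)      # 345138, as in the fourth output row
print(under_a_minute)  # 0
```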

**Question 8: What is the average amount of time it took customers to complete all of the tests in the exam_answers table, if you do not exclude any data (the answer will be approximately 587 minutes)?**

In [14]:
%%sql
SELECT AVG(TIMESTAMPDIFF(MINUTE,start_time,end_time)) AS 'Average Duration'
FROM exam_answers;

1 rows affected.


Average Duration
586.9041


**Question 9: What is the average amount of time it took customers to complete the "Treat Warm-Up" test, according to the exam_answers table (about 165 minutes, if no data is excluded)?**

In [15]:
%%sql
SELECT AVG(TIMESTAMPDIFF(MINUTE,start_time,end_time)) 
FROM exam_answers
WHERE test_name="Treat Warm-Up";

1 rows affected.


"AVG(TIMESTAMPDIFF(MINUTE,start_time,end_time))"
164.9176


**Question 10: How many possible test names are there in the exam_answers table?**

In [16]:
%%sql
SELECT COUNT(DISTINCT test_name) 
FROM exam_answers;

1 rows affected.


COUNT(DISTINCT test_name)
67


You should have discovered that the exam_answers table has many more test names than the complete_tests table.  It turns out that this table has information about experimental tests that Dognition has not yet made available to its customers.
   

**Question 11: What are the minimum and maximum values in the Duration column of your query that included the data from the entire table?**

In [17]:
%%sql
SELECT MAX(TIMESTAMPDIFF(MINUTE,start_time,end_time)), MIN(TIMESTAMPDIFF(MINUTE,start_time,end_time))
FROM exam_answers;

1 rows affected.


"MAX(TIMESTAMPDIFF(MINUTE,start_time,end_time))","MIN(TIMESTAMPDIFF(MINUTE,start_time,end_time))"
1036673,-187


The minimum Duration value is *negative*! These entries must be mistakes.

**Question 12: How many of these negative Duration entries are there? (the answer should be 620)**

In [18]:
%%sql
SELECT COUNT(TIMESTAMPDIFF(MINUTE,start_time,end_time)) AS Duration
FROM exam_answers
WHERE TIMESTAMPDIFF(MINUTE,start_time,end_time)<0;

1 rows affected.


Duration
620


**Question 13: How would you query all the columns of all the rows that have negative durations so that you could examine whether they share any features that might give you clues about what caused the entry mistake?**

In [20]:
%%sql
SELECT *
FROM exam_answers
WHERE TIMESTAMPDIFF(MINUTE,start_time,end_time)<0
LIMIT 10;

10 rows affected.


script_detail_id,subcategory_name,test_name,step_type,start_time,end_time,loop_number,dog_guid
60,Empathy,Eye Contact Warm-up,question,2013-02-17 20:35:43,2013-02-17 20:34:43,3,fd3fe18a-7144-11e5-ba71-058fbc01cf0b
558,Sociability,Sociability,question,2013-02-18 04:25:19,2013-02-18 04:24:18,0,fd3fe50e-7144-11e5-ba71-058fbc01cf0b
557,Sociability,Sociability,question,2013-02-18 07:44:09,2013-02-18 07:43:09,0,fd3fe5ea-7144-11e5-ba71-058fbc01cf0b
574,Shy/Boldness,Shy/Boldness,question,2013-02-18 07:46:14,2013-02-18 07:45:13,0,fd3fe5ea-7144-11e5-ba71-058fbc01cf0b
582,Shy/Boldness,Shy/Boldness,question,2013-02-18 07:47:07,2013-02-18 07:46:06,0,fd3fe5ea-7144-11e5-ba71-058fbc01cf0b
600,Sociability,Sociability,question,2013-02-18 07:50:07,2013-02-18 07:49:07,0,fd3fe5ea-7144-11e5-ba71-058fbc01cf0b
293,Memory,Two Cup Warm-up,question,2013-02-18 13:23:25,2013-02-18 13:22:23,2,fd3fbd7c-7144-11e5-ba71-058fbc01cf0b
293,Memory,Two Cup Warm-up,question,2013-02-18 13:23:31,2013-02-18 13:22:28,4,fd3fbd7c-7144-11e5-ba71-058fbc01cf0b
322,Memory,Memory versus Pointing,question,2013-02-18 13:25:15,2013-02-18 13:24:14,1,fd3fbd7c-7144-11e5-ba71-058fbc01cf0b
322,Memory,Memory versus Pointing,question,2013-02-18 13:25:30,2013-02-18 13:24:27,5,fd3fbd7c-7144-11e5-ba71-058fbc01cf0b


**Question 14: What is the average amount of time it took customers to complete all of the tests in the exam_answers table when the negative durations are excluded from your calculation (you should get 11233 minutes)?**

In [22]:
%%sql
SELECT AVG(TIMESTAMPDIFF(MINUTE,start_time,end_time)) AS Avg_Duration # note: the >0 filter below drops 0-minute rows as well as negative ones
FROM exam_answers
WHERE TIMESTAMPDIFF(MINUTE,start_time,end_time)>0;

1 rows affected.


Avg_Duration
11233.0951
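
Note that a `>0` filter removes rows whose duration is 0 minutes as well as rows with negative durations. The jump from about 587 to about 11233 minutes suggests that a large share of the rows have a 0-minute duration, since 620 negative rows alone could not move the average that far. The sketch below illustrates the difference between `>0` and `>=0` on a hypothetical `toy_answers` table with made-up durations, using Python's built-in `sqlite3` module as a stand-in for MySQL.

```python
import sqlite3

# Hypothetical durations in minutes; not taken from exam_answers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_answers (duration INTEGER)")
conn.executemany(
    "INSERT INTO toy_answers VALUES (?)",
    [(-2,), (0,), (0,), (0,), (10,), (30,)],
)

# Excluding only the negative row keeps the three 0-minute rows.
avg_non_negative = conn.execute(
    "SELECT AVG(duration) FROM toy_answers WHERE duration >= 0"
).fetchone()[0]

# Requiring duration > 0 also drops the 0-minute rows.
avg_positive = conn.execute(
    "SELECT AVG(duration) FROM toy_answers WHERE duration > 0"
).fetchone()[0]

print(avg_non_negative)  # 8.0  (40 minutes over 5 rows)
print(avg_positive)      # 20.0 (40 minutes over 2 rows)
```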
