Copyright Jana Schaich Borg/Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

# MySQL Exercise 3: Formatting Selected Data

In [1]:
# loading the SQL library into Jupyter
# connecting to the Dognition database
# setting Dognition as the default database
%load_ext sql
%sql mysql://studentuser:studentpw@mysqlserver/dognitiondb
%sql USE dognitiondb

0 rows affected.


[]


## 1. Use AS to change the titles of the columns in your output

Aliases are strings, so MySQL again accepts either double or single quotation marks around them, although some database systems accept only single quotation marks. It is good practice to avoid SQL keywords in your aliases; if you do need to use an SQL keyword as an alias, enclose it in backticks instead of quotation marks.
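
As a rough illustration of the two quoting styles, here is a minimal Python sketch. It is an assumption-laden stand-in: it uses Python's built-in `sqlite3` module with a one-row, made-up `complete_tests` table rather than the Dognition MySQL server, and the alias `order` is a hypothetical choice picked only because it is an SQL keyword. SQLite accepts backtick-quoted names for MySQL compatibility, but quoting rules can differ between database systems.

```python
import sqlite3

# Hypothetical stand-in data: an in-memory SQLite table, not the real Dognition database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE complete_tests (dog_guid TEXT, created_at TEXT)")
conn.execute("INSERT INTO complete_tests VALUES ('dog-1', '2013-02-05 18:26:54')")

# `order` is an SQL keyword, so that alias is wrapped in backticks;
# "time stamp" contains a space, so that alias is wrapped in quotation marks.
cursor = conn.execute(
    'SELECT dog_guid AS `order`, created_at AS "time stamp" FROM complete_tests'
)

# cursor.description holds one entry per output column; item 0 is the column title.
column_names = [col[0] for col in cursor.description]
print(column_names)
conn.close()
```

The printed list contains the column titles as the database reports them, which is a quick way to confirm that an alias was applied the way you intended.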

In [2]:
%%sql
SELECT dog_guid, created_at AS time_stamp # change "created_at" to "time_stamp" in the output
FROM complete_tests
LIMIT 5;

5 rows affected.


dog_guid,time_stamp
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:26:54
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:31:03
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:32:04
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:32:25
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:32:56


In [3]:
%%sql
SELECT dog_guid, created_at AS "time stamp" # the alias contains a space, so it must be enclosed in quotes
FROM complete_tests
LIMIT 5;

5 rows affected.


dog_guid,time stamp
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:26:54
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:31:03
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:32:04
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:32:25
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:32:56


In [4]:
%%sql
SELECT dog_guid, created_at AS "time stamp"
FROM complete_tests AS tests # also make an alias for a table
LIMIT 5;

5 rows affected.


dog_guid,time stamp
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:26:54
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:31:03
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:32:04
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:32:25
fd27b86c-7144-11e5-ba71-058fbc01cf0b,2013-02-05 18:32:56


## 2. Use DISTINCT to remove duplicate rows

In [6]:
%%sql
SELECT breed # 'Golden Retriever' appears twice in the first five rows
FROM dogs
LIMIT 5; 

5 rows affected.


breed
Labrador Retriever
Shetland Sheepdog
Golden Retriever
Golden Retriever
Shih Tzu


In [5]:
%%sql
SELECT DISTINCT breed # return each distinct value only once
FROM dogs
LIMIT 5;

5 rows affected.


breed
Labrador Retriever
Shetland Sheepdog
Golden Retriever
Shih Tzu
Siberian Husky


<mark> When the DISTINCT clause is used with multiple columns in a SELECT statement, the combination of all the columns together is used to determine the uniqueness of a row in a result set.</mark>  

In [8]:
%%sql
SELECT state, city # 'NC, Raleigh' appears twice in the output below
FROM users
LIMIT 10; 

10 rows affected.


state,city
ND,Grand Forks
MA,Barre
CT,Darien
IL,Winnetka
NC,Raleigh
WA,Auburn
NC,Raleigh
CO,Fort Collins
WA,Seattle
WA,Bainbridge Island


In [7]:
%%sql
SELECT DISTINCT state, city # the duplicate row (same state and same city) has been removed
FROM users
LIMIT 10;

10 rows affected.


state,city
ND,Grand Forks
MA,Barre
CT,Darien
IL,Winnetka
NC,Raleigh
WA,Auburn
CO,Fort Collins
WA,Seattle
WA,Bainbridge Island
WA,Bremerton


When you combine the DISTINCT clause with the LIMIT clause in a statement, MySQL stops searching as soon as it has found the number of *unique* rows specified in the LIMIT clause, not once it has scanned that many rows of the table.
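
The sketch below illustrates that interaction under some loud assumptions: it runs against an in-memory SQLite database through Python's built-in `sqlite3` module, and the five `users` rows are invented for the example. The point it shows carries over to MySQL: LIMIT counts rows after duplicates have been removed. Without an ORDER BY clause, neither system guarantees which rows come back first.

```python
import sqlite3

# Hypothetical stand-in data: a tiny in-memory table, not the real users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (state TEXT, city TEXT)")
sample_rows = [
    ("NC", "Raleigh"),
    ("WA", "Auburn"),
    ("NC", "Raleigh"),       # exact duplicate of the first row
    ("WA", "Seattle"),       # same state as Auburn but a different city, so still unique
    ("CO", "Fort Collins"),
]
conn.executemany("INSERT INTO users VALUES (?, ?)", sample_rows)

# LIMIT alone returns three rows, which may include repeats.
plain = conn.execute("SELECT state, city FROM users LIMIT 3").fetchall()

# DISTINCT is applied first, so LIMIT 3 yields three different (state, city) pairs.
unique = conn.execute("SELECT DISTINCT state, city FROM users LIMIT 3").fetchall()

print(plain)
print(unique)
conn.close()
```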

## 3. Use ORDER BY to sort the output of your query

In [11]:
%%sql
SELECT DISTINCT breed
FROM dogs 
ORDER BY breed # sort the breeds alphabetically; ascending order is the default
LIMIT 5;

5 rows affected.


breed
-American Eskimo Dog Mix
-American Pit Bull Terrier Mix
-Anatolian Shepherd Dog Mix
-Australian Cattle Dog Mix
-Australian Shepherd Mix


(You might notice that some of the breeds start with a hyphen; we'll come back to that later.)

In [12]:
%%sql
SELECT DISTINCT breed
FROM dogs 
ORDER BY breed DESC
LIMIT 5;

5 rows affected.


breed
Yorkshire Terrier-Soft Coated Wheaten Terrier Mix
Yorkshire Terrier-Silky Terrier Mix
Yorkshire Terrier-Shih Tzu Mix
Yorkshire Terrier-Rat Terrier Mix
Yorkshire Terrier-Poodle Mix


In [13]:
%%sql
SELECT DISTINCT user_guid, (median_ITI_minutes * 60) AS median_ITI_sec # create a new derived field called median_ITI_sec
FROM dogs 
ORDER BY median_ITI_sec DESC # sorted by a new derived field in descending order
LIMIT 5;

5 rows affected.


user_guid,median_ITI_sec
ce33102e-7144-11e5-ba71-058fbc01cf0b,56242621.0002
ce71fdde-7144-11e5-ba71-058fbc01cf0b,28501473.0
ce2462fe-7144-11e5-ba71-058fbc01cf0b,23877894.0
ce7472a8-7144-11e5-ba71-058fbc01cf0b,18482487.0
ce3c0f1c-7144-11e5-ba71-058fbc01cf0b,16674519.000000002


Note that the parentheses in that query are optional: MySQL applies the alias to the entire expression median_ITI_minutes * 60 with or without them. They are still worth keeping, because they make it obvious to a reader exactly what is being aliased.

SQL queries also allow you to sort by multiple fields in a specified order, similar to how Excel allows you to include multiple levels in a sort.
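
To make the idea of sort levels concrete, here is a small sketch that assumes an in-memory SQLite database (via Python's built-in `sqlite3` module) and four invented rows; the `u1` to `u4` identifiers are hypothetical. The first field in ORDER BY sets the main order, and the second field only decides the order among rows that tie on the first.

```python
import sqlite3

# Hypothetical stand-in data: four made-up users in two states.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_guid TEXT, state TEXT, membership_type INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [("u1", "NC", 1), ("u2", "AK", 2), ("u3", "NC", 3), ("u4", "AK", 1)],
)

# Level 1: state, ascending. Level 2: membership_type, descending within each state.
rows = conn.execute(
    "SELECT user_guid, state, membership_type FROM users "
    "ORDER BY state ASC, membership_type DESC"
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Both AK rows come before both NC rows, and within each state the higher membership_type is listed first.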

In [15]:
%%sql
SELECT DISTINCT user_guid, state, membership_type
FROM users
WHERE country="US"
ORDER BY state ASC, membership_type DESC # sort by state in ascending order first, then by membership type in descending order
LIMIT 5;

5 rows affected.


user_guid,state,membership_type
ce969298-7144-11e5-ba71-058fbc01cf0b,AE,3
ce221dbe-7144-11e5-ba71-058fbc01cf0b,AE,2
ce70836e-7144-11e5-ba71-058fbc01cf0b,AE,2
ce138312-7144-11e5-ba71-058fbc01cf0b,AE,1
ce7587ba-7144-11e5-ba71-058fbc01cf0b,AE,1


In [16]:
%%sql
SELECT DISTINCT user_guid, state, membership_type
FROM users
WHERE country="US" AND state IS NOT NULL AND membership_type IS NOT NULL # keep only rows where both state and membership_type are not null
ORDER BY state ASC, membership_type ASC
LIMIT 5;

5 rows affected.


user_guid,state,membership_type
ce138312-7144-11e5-ba71-058fbc01cf0b,AE,1
ce7587ba-7144-11e5-ba71-058fbc01cf0b,AE,1
ce76f528-7144-11e5-ba71-058fbc01cf0b,AE,1
ce221dbe-7144-11e5-ba71-058fbc01cf0b,AE,2
ce70836e-7144-11e5-ba71-058fbc01cf0b,AE,2


## 4. Export your query results to a text file 

In [17]:
# put the list of distinct dog breeds into the variable breed_list
breed_list = %sql SELECT DISTINCT breed FROM dogs ORDER BY breed;

2006 rows affected.


In [18]:
# write the contents of breed_list to a csv file
breed_list.csv('breed_list.csv')
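
If you ever need to do the same export outside of the `%sql` magic, the sketch below shows one possible approach using only Python's standard library. It is not how the `.csv()` method above is implemented; it assumes an in-memory SQLite database with three invented rows, and it writes to an in-memory buffer so that nothing is created on disk.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in data: three made-up rows, one of them a repeat.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (breed TEXT)")
conn.executemany(
    "INSERT INTO dogs VALUES (?)",
    [("Shih Tzu",), ("Golden Retriever",), ("Golden Retriever",)],
)
cursor = conn.execute("SELECT DISTINCT breed FROM dogs ORDER BY breed")

# An in-memory buffer keeps the example self-contained; to write a real file,
# use open("breed_list.csv", "w", newline="") in place of io.StringIO().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([col[0] for col in cursor.description])  # header row with the column titles
writer.writerows(cursor)                                  # one csv line per result row

csv_text = buffer.getvalue()
print(csv_text)
conn.close()
```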

## 5. Explore other functions that can help with data cleaning

There are some strange values in the breed list.  Some of the entries in the breed column seem to have a dash included before the name.  This is an example of what real business data sets look like...they are messy!  We will use this as an opportunity to highlight why it is so important to be curious and explore MySQL functions on your own. 

If you needed an accurate list of all the dog breeds in the dogs table, you would have to find some way to "clean up" the breed list you just made.  Let's examine some of the functions that could help you achieve this cleaning using SQL syntax rather than another program or language outside of the database.

I included these links to MySQL functions in an earlier notebook:  
http://dev.mysql.com/doc/refman/5.7/en/func-op-summary-ref.html  
http://www.w3resource.com/mysql/mysql-functions-and-operators.php

In [20]:
%%sql
SELECT DISTINCT breed,
REPLACE(breed,'-','') AS breed_fixed # replace every dash in the breed name with an empty string
FROM dogs
ORDER BY breed_fixed
LIMIT 5;

5 rows affected.


breed,breed_fixed
Affenpinscher,Affenpinscher
Affenpinscher-Afghan Hound Mix,AffenpinscherAfghan Hound Mix
Affenpinscher-Airedale Terrier Mix,AffenpinscherAiredale Terrier Mix
Affenpinscher-Alaskan Malamute Mix,AffenpinscherAlaskan Malamute Mix
Affenpinscher-American English Coonhound Mix,AffenpinscherAmerican English Coonhound Mix


That was helpful, but you'll still notice some issues with the output.

The leading dashes are indeed removed in the breed_fixed column, but the dashes used to separate breeds in entries like 'French Bulldog-Boston Terrier Mix' are now missing as well. So REPLACE isn't the right choice for selectively removing leading dashes.

Perhaps we could try using the TRIM function:

http://www.w3resource.com/mysql/string-functions/mysql-trim-function.php

In [21]:
%%sql
SELECT DISTINCT breed, TRIM(LEADING '-' FROM breed) AS breed_fixed # remove only the dashes at the start of the breed name
FROM dogs
ORDER BY breed_fixed
LIMIT 5;

5 rows affected.


breed,breed_fixed
Affenpinscher,Affenpinscher
Affenpinscher-Afghan Hound Mix,Affenpinscher-Afghan Hound Mix
Affenpinscher-Airedale Terrier Mix,Affenpinscher-Airedale Terrier Mix
Affenpinscher-Alaskan Malamute Mix,Affenpinscher-Alaskan Malamute Mix
Affenpinscher-American English Coonhound Mix,Affenpinscher-American English Coonhound Mix


That certainly gets us a lot closer to the list we might want, but there are still some entries in the breed_fixed column that are conceptual duplicates of each other, because the breed names were not entered consistently. For example, one entry is "Beagle Mix" while another is "Beagle- Mix". These entries are clearly meant to refer to the same breed, but they will be counted as separate breeds as long as their breed names are different.
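
The difference between the two functions, and the problem that remains afterwards, can be seen side by side in the sketch below. Several things here are assumptions: the four breed strings are hand-picked examples, the database is an in-memory SQLite one reached through Python's built-in `sqlite3` module, and SQLite's `ltrim(breed, '-')` stands in for MySQL's `TRIM(LEADING '-' FROM breed)` because SQLite does not support that MySQL syntax.

```python
import sqlite3

# Hypothetical stand-in data: four hand-picked breed strings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (breed TEXT)")
conn.executemany(
    "INSERT INTO dogs VALUES (?)",
    [
        ("-Australian Shepherd Mix",),            # leading dash only
        ("French Bulldog-Boston Terrier Mix",),   # dash separating two breeds
        ("Beagle Mix",),
        ("Beagle- Mix",),                         # inconsistent entry for the same breed
    ],
)

# replace() strips every dash; ltrim(breed, '-') strips only dashes at the start,
# which is the SQLite counterpart of MySQL's TRIM(LEADING '-' FROM breed).
rows = conn.execute(
    "SELECT breed, replace(breed, '-', '') AS replaced, ltrim(breed, '-') AS trimmed "
    "FROM dogs ORDER BY breed"
).fetchall()
for row in rows:
    print(row)

# Trimming leading dashes does not merge 'Beagle Mix' and 'Beagle- Mix'.
distinct_trimmed = len({row[2] for row in rows})
print(distinct_trimmed)
conn.close()
```

All four trimmed names are still different from one another, so the two Beagle entries would still be counted as separate breeds.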

Cleaning up all of the entries in the breed column would take quite a bit of work, so we won't go into more detail about how to do it in this lesson. Instead, use this exercise as a reminder of why it's so important to always look at the details of your data, and as motivation to explore the MySQL functions we won't have time to discuss in the course. If you push yourself to learn new SQL functions and embrace the habit of getting to know your data by exploring its raw values and outputs, you will find that SQL provides very efficient tools for cleaning messy real-world data sets, and you will be far more likely to arrive at the correct conclusions about what your data indicate your company should do.

## Now it's time to practice using AS, DISTINCT, and ORDER BY in your own queries.


**Question 1: How would you get a list of all the subcategories of Dognition tests, in alphabetical order, with no test listed more than once (if you do not limit your output, you should retrieve 16 rows)?**

In [22]:
%%sql
SELECT DISTINCT subcategory_name
FROM complete_tests
ORDER BY subcategory_name
LIMIT 5;

5 rows affected.


subcategory_name
Communication
Cunning
Empathy
Expression Game
Impossible Task


**Question 2: How would you create a text file with a list of all the non-United States countries of Dognition customers with no country listed more than once?**

In [23]:
NonUsCountries = %sql SELECT DISTINCT country FROM users WHERE country != "US";
NonUsCountries.csv('NonUsCountries.csv')

68 rows affected.


**Question 3: How would you find the User ID, Dog ID, and test name of the first 5 tests to ever be completed in the Dognition database?**

In [24]:
%%sql
SELECT user_guid, dog_guid, test_name
FROM complete_tests
ORDER BY created_at 
LIMIT 5;


5 rows affected.


user_guid,dog_guid,test_name
,fd27b86c-7144-11e5-ba71-058fbc01cf0b,Yawn Warm-up
,fd27b86c-7144-11e5-ba71-058fbc01cf0b,Yawn Game
,fd27b86c-7144-11e5-ba71-058fbc01cf0b,Eye Contact Warm-up
,fd27b86c-7144-11e5-ba71-058fbc01cf0b,Eye Contact Game
,fd27b86c-7144-11e5-ba71-058fbc01cf0b,Treat Warm-up


**Question 4: How would you create a text file with a list of all the customers with yearly memberships who live in the state of North Carolina (USA) and joined Dognition after March 1, 2014, sorted so that the most recent member is at the top of the list?**

In [27]:
answer_Q4 = %sql SELECT DISTINCT user_guid, state, created_at FROM users WHERE membership_type=2 AND state='NC' AND country='US' AND created_at > '2014-03-01' ORDER BY created_at DESC;

68 rows affected.


In [28]:
answer_Q4.csv('answer_Q4.csv')

**Question 5: See if you can find an SQL function from the list provided at:**

http://www.w3resource.com/mysql/mysql-functions-and-operators.php

**that would allow you to output all of the distinct breed names in UPPER case.  Create a query that would output a list of these names in upper case, sorted in alphabetical order.**

In [29]:
%%sql
SELECT DISTINCT UPPER(breed)
FROM dogs
ORDER BY breed
LIMIT 5;


5 rows affected.


UPPER(breed)
-AMERICAN ESKIMO DOG MIX
-AMERICAN PIT BULL TERRIER MIX
-ANATOLIAN SHEPHERD DOG MIX
-AUSTRALIAN CATTLE DOG MIX
-AUSTRALIAN SHEPHERD MIX
