# Basic SQL Queries
© Explore Data Science Academy

## Learning Objectives
In this tutorial, we will learn how to:
- Construct SQL queries
    - Reading database tables using the SELECT statement
    - Filtering query results using the WHERE clause
- Connect multiple tables
- Limit query output
- Assign aliases to table and column names
- Comment SQL code

## Outline
- Preparing the SQL environment
    - Loading the database
- Writing SQL queries
    - The SELECT statement
    - The WHERE clause
    - Limiting query output
    - Aliases
    - Comments

## Preparing the SQL environment
Before we start making SQL queries, we first need to prepare the SQL environment. Assuming you have installed the `ipython-sql` Python package (which provides the `sql` extension loaded below), this can be achieved using the following [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html):

In [1]:
%load_ext sql

Now all we need to do to run SQL code in Jupyter notebook cells is to start the cell with `%%sql` (as will be demonstrated shortly).

### Loading the Database

In this train, we will use the **[Chinook database](https://github.com/lerocha/chinook-database)**, a sample database for a digital media company that has tables for artists, albums, media tracks, invoices and customers.
The basic characteristics of Chinook include:

- 11 tables
- A variety of indexes, primary and foreign key constraints
- Over 15,000 rows of data

Here’s an [ER (Entity Relationship) diagram](https://www.lucidchart.com/pages/er-diagrams) of the Chinook database:

![Chinook ERD](https://github.com/Explore-AI/Pictures/blob/master/sqlite-sample-database-color.jpg?raw=true)

_[Image source](https://www.sqlitetutorial.net/sqlite-sample-database/)_

The media-related data was created using real data from an iTunes Library. Customer and employee information was created using fictitious names, addresses that can be located on Google Maps, and other well-formatted data (phone, fax, email, etc.). Sales information was auto-generated using random data for a four-year period.

Let's load this database into the notebook. Make sure you have downloaded the `chinook.db` SQLite file from Athena and stored it in a known location before attempting this step.

In [2]:
%%sql 

sqlite:///chinook.db
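Outside of a notebook, a SQLite file can also be opened with Python's built-in `sqlite3` module. The following is a minimal sketch rather than part of the original train: it uses an in-memory database with two made-up miniature tables as a stand-in for `chinook.db`, and lists the tables it finds.

```python
import sqlite3

# ":memory:" keeps this sketch self-contained; to open the real database,
# pass the path to the downloaded file instead, e.g. sqlite3.connect("chinook.db").
conn = sqlite3.connect(":memory:")

# Hypothetical miniature tables standing in for the real Chinook tables.
conn.execute("CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)")

# sqlite_master is SQLite's built-in catalogue of the objects in a database.
table_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(table_names)
conn.close()
```

Run against the real file, the same catalogue query should list the tables shown in the ER diagram above.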

## Writing SQL Queries
SQL queries generally consist of **statements**, **clauses**, **operations**, and **built-in functions**, and end in a semicolon (i.e. `;`). When executed, a SQL query generates a virtual table containing data from existing database tables, processed according to the query.

In this train we will focus on writing SQL queries for reading and filtering data from a SQL (particularly SQLite) database. The queries that are covered here will be useful in cases where we want to extract insights from information stored in the database or when we want to view a specific subset of the data to use for some other purpose. 

**Note:** For ease of display, we will be using the `LIMIT` SQL keyword to constrain the output of some of our queries. If you want to see the full output of any query, simply delete the line containing this keyword.

For example, a line such as:

```sql
LIMIT 10; -- Remove this line to see the full output
```

can be deleted to see the full query output.

### 1. The SELECT statement

The SELECT statement is used for reading data from one or more tables in the database. Basic SELECT statements generally take on the following format:

```sql
SELECT column name(s)
FROM table name(s)
```

The words **SELECT** and **FROM** here are SQL keywords and, just like in any other programming language, each keyword has a specific function:

- SELECT - For "selecting" or specifying which table field(s) (i.e. columns) or calculations we want returned from the database.
- FROM - For specifying the database location (i.e. tables) in which the "selected" data is stored.


It is good practice to type SQL keywords in capital letters to make queries more readable. 

Let's see some examples.

### 1.1. Reading data in a single column from a table in the database
Let's write a query that returns the first names of all Chinook digital media store customers. This means we need to:
    
    return data in the FirstName column from the customers table (see ER diagram above)
The SQL version of this is:

In [4]:
%%sql 

SELECT FirstName 
FROM customers
LIMIT 10; -- Remove this line to see the full output

 * sqlite:///chinook.db
Done.


FirstName
Luís
Leonie
François
Bjørn
František
Helena
Astrid
Daan
Kara
Eduardo


As expected, a virtual table containing the results of our query is generated.

### 1.2 Reading data in multiple columns from a table in the database
Let's write a query to find out when each chinook employee was hired. Looking at the ER diagram above, we can achieve this by:
    
    returning data in the FirstName, LastName, and HireDate column(s) from the employees table

The SQL query for this is as follows:

In [6]:
%%sql

SELECT FirstName, LastName, HireDate
FROM employees;

 * sqlite:///chinook.db
Done.


FirstName,LastName,HireDate
Andrew,Adams,2002-08-14 00:00:00
Nancy,Edwards,2002-05-01 00:00:00
Jane,Peacock,2002-04-01 00:00:00
Margaret,Park,2003-05-03 00:00:00
Steve,Johnson,2003-10-17 00:00:00
Michael,Mitchell,2003-10-17 00:00:00
Robert,King,2004-01-02 00:00:00
Laura,Callahan,2004-03-04 00:00:00


As you can see, we have specified multiple columns by separating each column name in the list with a comma. The same applies to table names as will be demonstrated shortly.

### 1.3 Reading data from all columns of a table in the database
Let's write a query that returns all chinook employee information. In simple English, our query has to:

    return data stored in all columns from the employees table

In SQL:

In [7]:
%%sql

SELECT *
FROM employees;

 * sqlite:///chinook.db
Done.


EmployeeId,LastName,FirstName,Title,ReportsTo,BirthDate,HireDate,Address,City,State,Country,PostalCode,Phone,Fax,Email
1,Adams,Andrew,General Manager,,1962-02-18 00:00:00,2002-08-14 00:00:00,11120 Jasper Ave NW,Edmonton,AB,Canada,T5K 2N1,+1 (780) 428-9482,+1 (780) 428-3457,andrew@chinookcorp.com
2,Edwards,Nancy,Sales Manager,1.0,1958-12-08 00:00:00,2002-05-01 00:00:00,825 8 Ave SW,Calgary,AB,Canada,T2P 2T3,+1 (403) 262-3443,+1 (403) 262-3322,nancy@chinookcorp.com
3,Peacock,Jane,Sales Support Agent,2.0,1973-08-29 00:00:00,2002-04-01 00:00:00,1111 6 Ave SW,Calgary,AB,Canada,T2P 5M5,+1 (403) 262-3443,+1 (403) 262-6712,jane@chinookcorp.com
4,Park,Margaret,Sales Support Agent,2.0,1947-09-19 00:00:00,2003-05-03 00:00:00,683 10 Street SW,Calgary,AB,Canada,T2P 5G3,+1 (403) 263-4423,+1 (403) 263-4289,margaret@chinookcorp.com
5,Johnson,Steve,Sales Support Agent,2.0,1965-03-03 00:00:00,2003-10-17 00:00:00,7727B 41 Ave,Calgary,AB,Canada,T3B 1Y7,1 (780) 836-9987,1 (780) 836-9543,steve@chinookcorp.com
6,Mitchell,Michael,IT Manager,1.0,1973-07-01 00:00:00,2003-10-17 00:00:00,5827 Bowness Road NW,Calgary,AB,Canada,T3B 0C5,+1 (403) 246-9887,+1 (403) 246-9899,michael@chinookcorp.com
7,King,Robert,IT Staff,6.0,1970-05-29 00:00:00,2004-01-02 00:00:00,590 Columbia Boulevard West,Lethbridge,AB,Canada,T1K 5N8,+1 (403) 456-9986,+1 (403) 456-8485,robert@chinookcorp.com
8,Callahan,Laura,IT Staff,6.0,1968-01-09 00:00:00,2004-03-04 00:00:00,923 7 ST NW,Lethbridge,AB,Canada,T1H 1Y8,+1 (403) 467-3351,+1 (403) 467-8772,laura@chinookcorp.com


The `*` here simply means "all columns". Another way to write this query would have been to list each column name in the employees table individually. However, this approach gets tedious for large database tables. 
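To illustrate that `*` and an explicit column list return the same data, here is a small sketch using Python's built-in `sqlite3` module. The three-column `employees` table is a cut-down, in-memory stand-in (an assumption for illustration), not the real Chinook table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down version of the employees table.
conn.execute("CREATE TABLE employees (EmployeeId INTEGER, LastName TEXT, FirstName TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Adams", "Andrew"), (2, "Edwards", "Nancy")],
)

# SELECT * expands to every column, in the order the table defines them.
star = conn.execute("SELECT * FROM employees")
star_columns = [column[0] for column in star.description]
star_rows = star.fetchall()

# Listing every column by hand gives the same rows.
explicit_rows = conn.execute(
    "SELECT EmployeeId, LastName, FirstName FROM employees"
).fetchall()

print(star_columns)
print(star_rows == explicit_rows)
conn.close()
```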

### 1.4 Reading data in multiple columns from multiple tables in the database
Let's write a query that lists album titles and the corresponding artists. In English:

    return data in the Title column from the albums table and the Name column from the artists table
In SQL:

In [10]:
%%sql

SELECT albums.Title, artists.Name
FROM albums, artists
LIMIT 10; -- Remove this line to see the full output 

 * sqlite:///chinook.db
Done.


Title,Name
For Those About To Rock We Salute You,AC/DC
For Those About To Rock We Salute You,Accept
For Those About To Rock We Salute You,Aerosmith
For Those About To Rock We Salute You,Alanis Morissette
For Those About To Rock We Salute You,Alice In Chains
For Those About To Rock We Salute You,Antônio Carlos Jobim
For Those About To Rock We Salute You,Apocalyptica
For Those About To Rock We Salute You,Audioslave
For Those About To Rock We Salute You,BackBeat
For Those About To Rock We Salute You,Billy Cobham


In the above query we used a dot convention to tell SQL which table each selected column belongs to. This method is particularly useful in cases where the specified tables have columns with the same name. For example, the artists table and the albums table both have an ArtistId field.

However, the query above doesn't seem to have provided what we wanted. If you take a closer look and remove the `LIMIT` keyword, you will notice that every album is paired with every artist in the table, as if each artist had written every album! We will cover why this happens and offer a solution in the next section.
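The size of this unwanted output can be checked on a toy example. The sketch below uses Python's built-in `sqlite3` module with two tiny in-memory tables (stand-ins with illustrative id values, not the real Chinook tables): listing two tables without any condition pairs each of the 3 albums with each of the 2 artists, giving 3 × 2 = 6 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature versions of the artists and albums tables.
conn.execute("CREATE TABLE artists (ArtistId INTEGER, Name TEXT)")
conn.execute("CREATE TABLE albums (AlbumId INTEGER, Title TEXT, ArtistId INTEGER)")
conn.executemany("INSERT INTO artists VALUES (?, ?)", [(1, "AC/DC"), (2, "Accept")])
conn.executemany(
    "INSERT INTO albums VALUES (?, ?, ?)",
    [
        (1, "For Those About To Rock We Salute You", 1),
        (2, "Balls to the Wall", 2),
        (3, "Restless and Wild", 2),
    ],
)

# No condition links the two tables, so every album meets every artist.
pairs = conn.execute("SELECT albums.Title, artists.Name FROM albums, artists").fetchall()
print(len(pairs))
conn.close()
```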

### 2. The WHERE clause

The WHERE clause is an optional element for SQL statements that specifies a condition (i.e. a boolean expression) to be applied on the returned data. It will return only the rows of data for which the boolean expression evaluates to true. Boolean expressions can be created using standard [boolean operators](https://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqlj23075.html):

- `=` - equal to
- `!=` - not equal to (also `<>`)
- `<` - less than
- `<=` - less than or equal to
- `>` - greater than
- `>=` - greater than or equal to 

Multiple boolean expressions can be combined using the keywords `AND`, `OR`, and `NOT`.  

A SELECT statement that has a WHERE clause has the following format:
```sql
SELECT column name(s)
FROM table name(s)
WHERE condition is true
```
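To illustrate how `OR` and `NOT` combine conditions in this format, here is a minimal sketch using Python's built-in `sqlite3` module. The two-column `customers` table is an in-memory stand-in holding a few rows for illustration, not the real Chinook table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down version of the customers table.
conn.execute("CREATE TABLE customers (FirstName TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Leonie", "Germany"), ("Luís", "Brazil"), ("François", "Canada"), ("Bjørn", "Norway")],
)

# OR keeps a row when at least one of the conditions is true.
either = [row[0] for row in conn.execute(
    "SELECT FirstName FROM customers WHERE Country = 'Germany' OR Country = 'Brazil'"
)]

# NOT inverts the bracketed condition, keeping only the remaining rows.
neither = [row[0] for row in conn.execute(
    "SELECT FirstName FROM customers WHERE NOT (Country = 'Germany' OR Country = 'Brazil')"
)]

print(either, neither)
conn.close()
```

The brackets around the `OR` expression make it explicit that `NOT` applies to the whole combined condition rather than to the first comparison only.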
Let's explore the different types of conditions using some examples:

### 2.1. Filtering data using a given condition
Let's write a SQL query that will return all customers who live in Germany. In other words, we need to

    return data in the FirstName and LastName columns of the customers table where the country is equal to Germany

In SQL:

In [11]:
%%sql

SELECT FirstName, LastName
FROM customers
WHERE Country = "Germany";

 * sqlite:///chinook.db
Done.


FirstName,LastName
Leonie,Köhler
Hannah,Schneider
Fynn,Zimmermann
Niklas,Schröder


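The double quotes (i.e. `""`) in this query enclose a string value (i.e. VARCHAR). SQLite accepts double quotes here, but standard SQL reserves them for identifiers such as column names and uses single quotes (e.g. `'Germany'`) for strings. Datatypes will be discussed in detail in later tutorials.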
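The filter also works with the string in single quotes, which is the form standard SQL uses for string values. A minimal sketch with Python's built-in `sqlite3` module follows; the small `customers` table is an in-memory stand-in, and the `?` placeholder is a feature of the `sqlite3` module rather than something this train covers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down version of the customers table.
conn.execute("CREATE TABLE customers (FirstName TEXT, LastName TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Leonie", "Köhler", "Germany"), ("Luís", "Gonçalves", "Brazil")],
)

# String value written in single quotes.
quoted = conn.execute(
    "SELECT FirstName, LastName FROM customers WHERE Country = 'Germany'"
).fetchall()

# Alternative: let sqlite3 pass the value in, so no quoting is needed at all.
bound = conn.execute(
    "SELECT FirstName, LastName FROM customers WHERE Country = ?", ("Germany",)
).fetchall()

print(quoted, bound)
conn.close()
```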

Next, let's write a SQL query that will return all customers who **don't** live in Germany. In English:

    return data in the FirstName, LastName, and Country columns of the customers table where the country is not equal to Germany
In SQL:

In [12]:
%%sql

SELECT FirstName, LastName, Country
FROM customers
WHERE Country != "Germany"
LIMIT 10; -- Remove this line to see the full output 

 * sqlite:///chinook.db
Done.


FirstName,LastName,Country
Luís,Gonçalves,Brazil
François,Tremblay,Canada
Bjørn,Hansen,Norway
František,Wichterlová,Czech Republic
Helena,Holý,Czech Republic
Astrid,Gruber,Austria
Daan,Peeters,Belgium
Kara,Nielsen,Denmark
Eduardo,Martins,Brazil
Alexandre,Rocha,Brazil


### 2.2. Filtering data using multiple conditions
Let's write a query that returns all invoices that show USA purchases with a total greater than 10 dollars. In English:

    return data from all columns in the invoices table where the BillingCountry is equal to USA and the Total is greater than 10
    
In SQL:

In [13]:
%%sql

SELECT *
FROM invoices
WHERE BillingCountry = "USA"
    AND Total > 10; 

 * sqlite:///chinook.db
Done.


InvoiceId,CustomerId,InvoiceDate,BillingAddress,BillingCity,BillingState,BillingCountry,BillingPostalCode,Total
5,23,2009-01-11 00:00:00,69 Salem Street,Boston,MA,USA,2113,13.86
26,19,2009-04-14 00:00:00,1 Infinite Loop,Cupertino,CA,USA,95014,13.86
82,28,2009-12-18 00:00:00,302 S 700 E,Salt Lake City,UT,USA,84102,13.86
103,24,2010-03-21 00:00:00,162 E Superior Street,Chicago,IL,USA,60611,15.86
124,20,2010-06-22 00:00:00,541 Del Medio Avenue,Mountain View,CA,USA,94040-111,13.86
145,16,2010-09-23 00:00:00,1600 Amphitheatre Parkway,Mountain View,CA,USA,94043-1351,13.86
201,25,2011-05-29 00:00:00,319 N. Frances Street,Madison,WI,USA,53703,18.86
222,21,2011-08-30 00:00:00,801 W 4th Street,Reno,NV,USA,89503,13.86
243,17,2011-12-01 00:00:00,1 Microsoft Way,Redmond,WA,USA,98052-8300,13.86
298,17,2012-07-31 00:00:00,1 Microsoft Way,Redmond,WA,USA,98052-8300,10.91


This query joins multiple boolean expressions using the `AND` keyword. One more example:

Write a query that returns all stored information for Sales Support Agents who were hired on or after the 3rd of May 2003 and live in Calgary. In English:

    return data in all columns from the employees table where the hire date is greater than or equal to 2003-05-03 and the Title is equal to Sales Support Agent and the City is equal to Calgary

In SQL:

In [14]:
%%sql

SELECT * 
FROM employees
WHERE HireDate >= "2003-05-03"
    AND Title = "Sales Support Agent"
    AND City = "Calgary";

 * sqlite:///chinook.db
Done.


EmployeeId,LastName,FirstName,Title,ReportsTo,BirthDate,HireDate,Address,City,State,Country,PostalCode,Phone,Fax,Email
4,Park,Margaret,Sales Support Agent,2,1947-09-19 00:00:00,2003-05-03 00:00:00,683 10 Street SW,Calgary,AB,Canada,T2P 5G3,+1 (403) 263-4423,+1 (403) 263-4289,margaret@chinookcorp.com
5,Johnson,Steve,Sales Support Agent,2,1965-03-03 00:00:00,2003-10-17 00:00:00,7727B 41 Ave,Calgary,AB,Canada,T3B 1Y7,1 (780) 836-9987,1 (780) 836-9543,steve@chinookcorp.com
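The date condition works because, as the output above suggests, the hire dates are text in year-month-day order, so comparing them as strings also compares them chronologically. The sketch below illustrates this with Python's built-in `sqlite3` module, assuming the dates are stored as text; the two-column `employees` table is an in-memory stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down employees table with hire dates stored as text.
conn.execute("CREATE TABLE employees (FirstName TEXT, HireDate TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("Jane", "2002-04-01 00:00:00"),
        ("Margaret", "2003-05-03 00:00:00"),
        ("Steve", "2003-10-17 00:00:00"),
    ],
)

# Text comparison, character by character: year first, then month, then day.
hired = [row[0] for row in conn.execute(
    "SELECT FirstName FROM employees WHERE HireDate >= '2003-05-03'"
)]
print(hired)
conn.close()
```

This only holds for formats that put the largest unit first; a day-month-year string such as `03-05-2003` would not sort chronologically.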


### 2.3. Writing queries across multiple tables
Since databases can consist of multiple tables (entities) connected together through relationships (i.e. primary and foreign keys), it will be useful to write queries that span multiple tables. In such cases, we may also need to align data (i.e. records) between tables as follows:

`SELECT table1.field1, table2.field3
FROM table1, table2
WHERE table1.field1_id = table2.field1_id;`

The WHERE clause in the above sample query is what really connects the two tables: it makes sure that records in one table correspond to records in the other table by matching on a field the two tables have in common. Without the WHERE clause, every row of one table is paired with every row of the other (a Cartesian product), as we saw in the query in section 1.4.

Example time! Let's try the query in 1.4 again (but with the WHERE clause this time):

Let's write a query that lists album titles and the corresponding artists. In English:

    return data in the Title column from the albums table and the Name column from the artists table where the ArtistId in the artists table is the same as the ArtistId in the albums table.
    
In SQL:

In [15]:
%%sql

SELECT albums.Title, artists.Name
FROM albums, artists
WHERE artists.ArtistId = albums.ArtistId
LIMIT 10; -- Remove this line to see the full output 

 * sqlite:///chinook.db
Done.


Title,Name
For Those About To Rock We Salute You,AC/DC
Balls to the Wall,Accept
Restless and Wild,Accept
Let There Be Rock,AC/DC
Big Ones,Aerosmith
Jagged Little Pill,Alanis Morissette
Facelift,Alice In Chains
Warner 25 Anos,Antônio Carlos Jobim
Plays Metallica By Four Cellos,Apocalyptica
Audioslave,Audioslave


Unlike before, the returned data is aligned perfectly between both tables. We were able to get all albums and the corresponding artists (INCLUDING 9 METALLICA ALBUMS!). Naturally, some artists will have written more than one album. 
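One consequence of connecting tables this way is that a row only appears when it has a match in the other table. The following sketch uses Python's built-in `sqlite3` module with tiny in-memory stand-in tables; the third artist is made up for illustration and deliberately has no album.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature versions of the artists and albums tables.
conn.execute("CREATE TABLE artists (ArtistId INTEGER, Name TEXT)")
conn.execute("CREATE TABLE albums (AlbumId INTEGER, Title TEXT, ArtistId INTEGER)")
conn.executemany(
    "INSERT INTO artists VALUES (?, ?)",
    [(1, "AC/DC"), (2, "Accept"), (3, "Made-up Artist Without Albums")],
)
conn.executemany(
    "INSERT INTO albums VALUES (?, ?, ?)",
    [
        (1, "For Those About To Rock We Salute You", 1),
        (2, "Balls to the Wall", 2),
        (3, "Restless and Wild", 2),
    ],
)

# Only album/artist pairs that share an ArtistId survive the WHERE clause.
matched = conn.execute(
    "SELECT albums.Title, artists.Name "
    "FROM albums, artists "
    "WHERE artists.ArtistId = albums.ArtistId"
).fetchall()

print(len(matched))
print(sorted({name for _, name in matched}))
conn.close()
```

Here the result has one row per album (3 rows), "Accept" appears twice, and the artist without an album does not appear at all.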

---
**Test yourself:** write a query that returns the first name, last name, and invoice total of customers who spent more than 15 dollars.

_Hint: You will need to connect the invoices table and the customers table_

In [16]:
%%sql

your SQL query here



---

### 3. Limiting query output
As we've already seen in this train, a query can return a lot of rows. SQL queries take longer to execute on large databases and when the result contains many rows, and queries that output lots of rows also use more RAM. The `LIMIT` keyword can be used **at the end of a query** to limit the query output in such cases.

Usage: `LIMIT N`, where `N` is the maximum number of rows that should be displayed. For example:

Write a query that displays information for the first 5 tracks in the database. In English:

    return data in all columns from the tracks table and limit the number of rows to 5

In [17]:
%%sql

SELECT *
FROM tracks
LIMIT 5;

 * sqlite:///chinook.db
Done.


TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
1,For Those About To Rock (We Salute You),1,1,1,"Angus Young, Malcolm Young, Brian Johnson",343719,11170334,0.99
2,Balls to the Wall,2,2,1,,342562,5510424,0.99
3,Fast As a Shark,3,2,1,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman",230619,3990994,0.99
4,Restless and Wild,3,2,1,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman",252051,4331779,0.99
5,Princess of the Dawn,3,2,1,Deaffy & R.A. Smith-Diesel,375418,6290521,0.99
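`LIMIT N` sets an upper bound rather than an exact row count. The sketch below shows this with Python's built-in `sqlite3` module and a three-row in-memory stand-in for the tracks table (a cut-down table assumed for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down version of the tracks table with only three rows.
conn.execute("CREATE TABLE tracks (TrackId INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?)",
    [
        (1, "For Those About To Rock (We Salute You)"),
        (2, "Balls to the Wall"),
        (3, "Fast As a Shark"),
    ],
)

# Fewer rows requested than exist: the output is cut off at 2 rows.
first_two = conn.execute("SELECT * FROM tracks LIMIT 2").fetchall()

# More rows requested than exist: all 3 rows come back, with no error.
up_to_ten = conn.execute("SELECT * FROM tracks LIMIT 10").fetchall()

print(len(first_two), len(up_to_ten))
conn.close()
```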


### 4. Aliases

Before we explain what aliases in SQL are and what they are for, let's first demonstrate their necessity. 

Suppose we wanted to show which customers (name and surname) and Sales Support Agents (name and surname) live in the same country. Perhaps Chinook is planning a door-to-door marketing campaign for customers who live in the same country as Chinook employees. In English, this query is:

    return data from the FirstName and LastName columns of the customers table and the FirstName and LastName columns of the employees table where the customers table Country is equal to Canada and the customers table SupportRepId is equal to the employees table EmployeeId

In SQL:

In [18]:
%%sql

SELECT customers.FirstName, customers.LastName, employees.FirstName, employees.LastName
FROM customers, employees
WHERE customers.Country = "Canada"
AND customers.SupportRepId = employees.EmployeeId;

 * sqlite:///chinook.db
Done.


FirstName,LastName,FirstName_1,LastName_1
François,Tremblay,Jane,Peacock
Mark,Philips,Steve,Johnson
Jennifer,Peterson,Jane,Peacock
Robert,Brown,Jane,Peacock
Edward,Francis,Jane,Peacock
Martha,Silk,Steve,Johnson
Aaron,Mitchell,Margaret,Park
Ellie,Sullivan,Jane,Peacock


We have two problems here:

1. This query was long and took a while to type.
2. The two tables have identical column names, so we have no way of telling employees apart from customers (e.g. if this virtual table is exported to some other format).

If you are lucky (as we have been with our SQLite setup), the SQL environment you use will rename duplicate column names by appending `_1`, `_2`, `_3`, etc. as it encounters them, rather than returning several columns with the same name. Let's rewrite this query using aliases to resolve the listed problems:

In [19]:
%%sql

SELECT c.FirstName AS "customer name", c.LastName AS "customer surname", e.FirstName AS "agent_name", e.LastName AS "agent_surname"
FROM customers c, employees e
WHERE c.Country = "Canada"
    AND c.SupportRepId = e.EmployeeId;

 * sqlite:///chinook.db
Done.


customer name,customer surname,agent_name,agent_surname
François,Tremblay,Jane,Peacock
Mark,Philips,Steve,Johnson
Jennifer,Peterson,Jane,Peacock
Robert,Brown,Jane,Peacock
Edward,Francis,Jane,Peacock
Martha,Silk,Steve,Johnson
Aaron,Mitchell,Margaret,Park
Ellie,Sullivan,Jane,Peacock


In this version of the query we have:
- assigned aliases (i.e. custom names) to columns using the `AS` keyword, and
- assigned aliases to tables by typing them next to the table name in the `FROM` clause.

A few rules to remember for specifying aliases:
1. Try to avoid using space-separated aliases; rather separate different words with underscores or capitalization.
2. Try to avoid aliases that start with numerical characters, e.g. `1_employee`.

Column aliases are an exception to both of these rules, since you can use any valid string enclosed in quotes (`""`) as an alias.
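As a quick check of how aliases end up in the result, the sketch below reads the output column names back through Python's built-in `sqlite3` module. The two tables are one-row, in-memory stand-ins reduced to the columns needed here (an assumption for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature tables, cut down to the columns this sketch needs.
conn.execute("CREATE TABLE customers (FirstName TEXT, SupportRepId INTEGER)")
conn.execute("CREATE TABLE employees (EmployeeId INTEGER, FirstName TEXT)")
conn.execute("INSERT INTO customers VALUES ('François', 3)")
conn.execute("INSERT INTO employees VALUES (3, 'Jane')")

# "c" and "e" are table aliases; AS renames the two output columns.
cursor = conn.execute(
    'SELECT c.FirstName AS "customer name", e.FirstName AS agent_name '
    "FROM customers c, employees e "
    "WHERE c.SupportRepId = e.EmployeeId"
)
column_names = [column[0] for column in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
conn.close()
```

The quoted alias keeps its space, while the unquoted one has to be a single word such as `agent_name`.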

### 5. Comments
No programming language is complete without the ability to make annotations to your code. Although comments in SQL will vary depending on the software tool and flavour of SQL used, in general:

- Single line comments can be implemented with a `--`.
- Multi-line or block comments can be implemented by enclosing code within `/*` and `*/`.

These will be useful if you want to explain or document your SQL queries inline or prevent SQL from running certain queries within a group of queries.

In [20]:
%%sql

-- This is a single line comment (SQL will not execute lines that begin with '--')

/* 
This is a block
comment which will comment
multiple lines
*/

 * sqlite:///chinook.db
0 rows affected.


[]
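Comments can also sit alongside a real query, for example to switch one query off while keeping another. A minimal sketch with Python's built-in `sqlite3` module and an in-memory database follows; the queries are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

query = """
-- Single-line comment: SQLite ignores everything after the two dashes.
/*
Block comment: the query inside it is disabled and never runs.
SELECT 'disabled'
*/
SELECT 1 + 1 AS total
"""

# Only the uncommented SELECT is executed.
result = conn.execute(query).fetchall()
print(result)
conn.close()
```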

## Conclusion
In general, SQL queries allow users and applications to interact with the database. Knowing how databases are structured and how to interact with the data they contain is an extremely important skill for any data scientist.
In this tutorial we learnt:

- How to use the SELECT statement to query columns from tables in a database
- How to use the WHERE clause to add conditions to our queries
- How to LIMIT query output, assign aliases to columns and tables, and comment SQL code

## Additional links
- [Entity relationship diagrams](https://youtu.be/QpdhBUYk7Kk)
- [SQL statements](https://db.apache.org/derby/docs/10.13/ref/crefsqlj39374.html)
- [SQL clauses](https://db.apache.org/derby/docs/10.13/ref/rrefclauses.html)
- [Boolean expressions](https://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqlj23075.html)