# Introduction to SQL queries

[üá´üá∑ Lire en fran√ßais](intro-sql-fr.ipynb)

Welcome to this introductory course on SQL! If you're comfortable with spreadsheets like Excel or Google Sheets, you're already halfway there. This notebook will teach you how to work with databases using SQL (Structured Query Language).

## What is a Database?

Think of a **database** as a collection of spreadsheets that can talk to each other.
We can compare the whole spreadsheet file to the database, while the individual tabs or sheets would be akin to a database **table**.

There are some more restrictions on the tables for the comparison to be relevant.
Essentially, database tables only contain pure data, organized in **columns** or **fields**.
One column represents a piece of information to store, and each **row** or **record** is an independent data entry.
There is no notion of formatting (colors, empty lines, comments *etc*), no aggregate cells, no heterogeneous rows *etc*.

A typical user spreadsheet may look like the following:

<table >
<tbody>
  <tr>
    <td style="background-color: #e6d8ac; color: black; text-align:center;" colspan="2">Attendees</td>
    <td></td>
    <td style="background-color: #e6d8ac; color: black; text-align:center;" colspan="3">Expenses</td>
  </tr>
  <tr>
    <td style="background-color: #ACBAE6; color: black; text-align:center;">First Name</td>
    <td style="background-color: #ADD8E6; color: black; text-align:center;">City</td>
    <td></td>
    <td style="background-color: #ACBAE6; color: black; text-align:center;">Category</td>
    <td style="background-color: #ADD8E6; color: black; text-align:center;">Date</td>
    <td style="background-color: #ACBAE6; color: black; text-align:center;">Price</td>
  </tr>
  <tr>
    <td>Paul</td>
    <td>Paris</td>
    <td></td>
    <td>Transport</td>
    <td>2026-01-01</td>
    <td>€57.89</td>
  </tr>
  <tr>
    <td>Maria</td>
    <td></td>
    <td></td>
    <td>Groceries</td>
    <td>2025-12-30</td>
    <td>€128.45</td>
  </tr>
  <tr>
    <td>Louis</td>
    <td>London</td>
    <td></td>
    <td>Restaurant</td>
    <td>2025-12-31</td>
    <td>€110.50</td>
  </tr>
  <tr>
    <td></td>
    <td></td>
    <td></td>
    <td>Hotel</td>
    <td>2025-12-30</td>
    <td>€534.00</td>
  </tr>
  <tr>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td style="background-color: #e5bbac; color: black;">Total</td>
    <td style="background-color: #e5bbac; color: black;">3</td>
    <td></td>
    <td style="background-color: #e5bbac; color: black;" colspan="2">Total</td>
    <td style="background-color: #e5bbac; color: black;">€830.84</td>
  </tr>
  <tr>
    <td style="background-color: #e5bbac; color: black;" colspan="5">Price Per person</td>
    <td style="background-color: #e5bbac; color: black;">€276.95</td>
  </tr>
</tbody></table>

We would structure it in two separate homogeneous tables, *Attendees* and *Expenses*, with only raw data.
Missing entries are supported, usually denoted by the special symbol `NULL`.

<div style="display: flex; justify-content: space-evenly;">
  <table>
    <thead>
      <tr>
        <th colspan="2">Attendees</th>
      </tr>
      <tr>
        <td>FirstName</td>
        <td>City</td>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Paul</td>
        <td>Paris</td>
      </tr>
      <tr>
        <td>Maria</td>
        <td>NULL</td>
      </tr>
      <tr>
        <td>Louis</td>
        <td>London</td>
      </tr>
    </tbody>
  </table>

  <table>
    <thead>
      <tr>
        <th colspan="3">Expenses</th>
      </tr>
      <tr>
        <td>Category</td>
        <td>Date</td>
        <td>Price</td>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Transport</td>
        <td>2026-01-01</td>
        <td>€57.89</td>
      </tr>
      <tr>
        <td>Groceries</td>
        <td>2025-12-30</td>
        <td>€128.45</td>
      </tr>
      <tr>
        <td>Restaurant</td>
        <td>2025-12-31</td>
        <td>€110.50</td>
      </tr>
      <tr>
        <td>Hotel</td>
        <td>2025-12-30</td>
        <td>€534.00</td>
      </tr>
    </tbody>
  </table>
</div>


Each table has a **name** and a strict set of **columns**, each with a **name** and data type, such as text *a.k.a.* string, number, euros *etc*.
Together, we call this the table **schema**.
It describes the type of data that can go into a table.
In this example, simplified schemas would look like the following:

<div style="display: flex; justify-content: space-evenly;">

```yaml
name: "Attendees"
columns:
  - name: "FirstName"
    type: string
  - name: "City"
    type: string
```

```yaml
name: "Expenses"
columns:
  - name: "Category"
    type: string
  - name: "Date"
    type: date
  - name: "Price"
    type: euros
```

</div>

Note that we are not storing the information about the totals anywhere.
This is not information that we collected, but rather that we computed from the raw data, via functions `SUM`, `COUNT`, *etc* in the spreadsheet.

With databases, we instead write a **query** to get this information. Queries can be used to request specific data, filtered, aggregated *etc*.
This introduction will teach you how to write such queries.
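To make the idea concrete, here is a minimal sketch of how a program could recompute the totals from the two raw tables above. It runs outside this notebook, using Python's built-in `sqlite3` module and a throwaway in-memory database. The `REAL` type for `Price` is a simplifying assumption (SQLite has no euro type), and you do not need to understand the SQL yet.

```python
import sqlite3

# Throwaway in-memory database holding the two example tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Attendees (FirstName TEXT, City TEXT);
CREATE TABLE Expenses (Category TEXT, Date TEXT, Price REAL);  -- REAL is an assumption
INSERT INTO Attendees VALUES ('Paul', 'Paris'), ('Maria', NULL), ('Louis', 'London');
INSERT INTO Expenses VALUES
    ('Transport', '2026-01-01', 57.89),
    ('Groceries', '2025-12-30', 128.45),
    ('Restaurant', '2025-12-31', 110.50),
    ('Hotel', '2025-12-30', 534.00);
""")

# The totals are not stored anywhere: queries compute them from the raw rows.
attendee_count = conn.execute("SELECT COUNT(*) FROM Attendees").fetchone()[0]
total = conn.execute("SELECT SUM(Price) FROM Expenses").fetchone()[0]
per_person = total / attendee_count

print(attendee_count, round(total, 2), round(per_person, 2))
```

The three printed values correspond to the three computed cells of the original spreadsheet.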

## Why use a database instead of a spreadsheet?

Databases are easy to work with when writing programs.
They are much more powerful for extracting specific information from all the raw data.
Their strict data handling makes them reliable and preserves data integrity in the long term.
They also scale much better than a spreadsheet.
With millions of rows, a single database can effectively serve thousands of queries per second.

When logging in on a website, all the information you see (your email, your messages, your preferences *etc*) is stored in a database. Some queries are run on the website server to return exactly what it should.

On the other hand, databases can be hard to work with for beginners.
Inserting new data and retrieving information is not straightforward.
For a small amount of data shared between a few people, a spreadsheet is simpler.

## What is SQL?

**SQL** stands for **Structured Query Language**. It's the language we use to ask questions of a database and make changes to it, through so-called **queries**.

Instead of creating intermediate cells with functions, it can directly answer questions such as:
- "*Show me all attendees from Paris.*"
- "*What is the largest expense?*"
- "*What is the cost per attendee?*"

These are **read-only** queries: they give information about the data without modifying it.
It's common for many more people in a company, such as business analysts or data scientists, to have permission to execute read-only queries, so that is what we will mostly focus on here.
In this course, the results of the queries are displayed as tables, but in a program the result is read and mixed in with the rest of the execution, for example to display a welcome message "*Hello Paul*", or list news articles of the day.

There are also **write** queries that can modify the data, insert new entries, create new tables.
It's not uncommon that only programs (*e.g.* the website) have permission to modify the data.

Let's dive in! But before we can write queries, we need to know a little bit more about the database we will use in this course.

## SQL vendors

There exist many database products, both commercial and open source.
The databases that can interact with tables using the SQL language are called... **SQL databases**.
Unfortunately each one often adds small extensions to SQL resulting in many **SQL dialects**.
Everything here should be standard; we will mention when we use a particular extension.

Typically, a database would be a whole program, running almost alone on a server, creating many files to save data and optimize frequent queries.
One database that took another direction is *SQLite*.
This is a small database that can be embedded in other programs and stores all its data in a single file.

## Chinook dataset

Finally, we need to connect the program in which we write queries to the actual database (server).
This typically involves a URL, username and password.
With SQLite however we only need to point to a file.
In this course setup, we need to execute the following non-SQL command.

<div style="padding: 15px; background-color: #d1ecf1; color: #17a2b8; border: 1px solid #17a2b8; margin: 10px 0;">
    <strong>ℹ️ Note:</strong>
    The following code cell can be modified freely and executed by pressing <code>Shift-Enter</code> inside.
</div>

In [None]:
%LOAD chinook.sqlite

The dataset is called **Chinook**.
It contains data about a fictional music store (artists, albums, tracks, customers, and sales).

In every SQLite file, there is a special table named `sqlite_master` that lists information about other tables created by the user.
We query it with SQL to list the tables available in our Chinook file.
You do not need to understand the query at this point; we only need the names of the tables as a starting point.
By the end of this course, you will be able to write much more complex queries.

In [None]:
SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;

# Part 1: Your First Query with `SELECT`

The most basic SQL command is `SELECT`. It retrieves data from a table.
Let's look at the `Employee` table.

In [None]:
SELECT * FROM Employee;

### Syntax

The general syntax is the following.
```sql
SELECT column1, column2 FROM table_name;
```

All queries must finish with a semicolon `;` to explicitly mark the end.
Because queries end with `;`, we can split them across multiple lines for readability.
Previously we used `*` to ask for all columns.
If a table has many rows, we may also want to reduce the number of rows that are returned with the `LIMIT` keyword.

```sql
SELECT column1, column2 FROM table_name LIMIT max_number_rows;
```

### Example 1: Select specific columns

In [None]:
SELECT FirstName, LastName FROM Employee;

### Example 2: Limit results

We do not know how many rows are in the `Artist` table, so to be safe we add a limit of ten rows at most.
This is a good habit to have.
Otherwise, because databases can hold many millions of rows, we risk overwhelming the database or our own program.
We can gradually increase the limit afterwards.

In [None]:
SELECT * FROM Artist LIMIT 10;

### Example 3: Select unique values

Sometimes a column has duplicate values. `DISTINCT` removes duplicates, showing each value only once (like "Remove Duplicates" in a spreadsheet):

What countries do our customers come from?

In [None]:
SELECT DISTINCT Country FROM Customer LIMIT 5;

Take the time to edit these queries to get familiar with them.
Change the limit, the columns, remove `DISTINCT` *etc*.

# Part 2: Filtering Data with `WHERE`

In a spreadsheet, you might use filters to show only certain rows. In SQL, we use the `WHERE` clause.

### Syntax
```sql
SELECT columns FROM table_name WHERE some_condition;
```

### Example 1: Filter by exact match

Let's find all customers from France:

In [None]:
SELECT FirstName, LastName, City, Country 
FROM Customer 
WHERE Country = 'France';

### Example 2: Filter with comparison operators

Find all tracks longer than 5 minutes:

In [None]:
SELECT Name, Milliseconds, Milliseconds/1000.0/60.0 AS Minutes
FROM Track
WHERE Minutes >= 5
LIMIT 10;

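**New concept**: `AS` - We used `AS Minutes` to give a calculated column a friendly name (like naming a formula column in a spreadsheet). Note that reusing this alias in the `WHERE` clause is a convenience offered by SQLite; standard SQL and several other databases require repeating the expression there (`WHERE Milliseconds/1000.0/60.0 >= 5`).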

### Comparison Operators
| Operator | Meaning |
|----------|----------|
| = | Equal to |
| != or <> | Not equal to |
| > | Greater than |
| < | Less than |
| >= | Greater than or equal |
| <= | Less than or equal |

### Example 3: Combining conditions with `AND` / `OR`

Find tracks that are Rock genre (`GenreId` = 1) and longer than 4 minutes.
Notice how we can also write expressions in the `WHERE` clause.

In [None]:
SELECT Name, Milliseconds
FROM Track 
WHERE GenreId = 1 AND Milliseconds/1000.0/60.0 >= 4
LIMIT 10;

Find customers from either USA OR Canada:

In [None]:
SELECT FirstName, LastName, Country 
FROM Customer
WHERE Country = 'USA' OR Country = 'Canada';

### Example 4: Using IN for multiple values

A cleaner way to check multiple values:

In [None]:
SELECT FirstName, LastName, Country 
FROM Customer 
WHERE Country IN ('France', 'Canada', 'Brazil');

### Example 5: Handling missing data with `NULL`

In spreadsheets, empty cells are just blank. In databases, missing data is represented by a special value called `NULL`.

<div style="padding: 15px; background-color: #fff3cd; color: #ffc107; border: 1px solid #ffc107; margin: 10px 0;">
    <strong>⚠️ Warning:</strong>
    You can't use <code>= NULL</code> or <code>!= NULL</code>.
    Instead, use <code>IS NULL</code> and <code>IS NOT NULL</code>.
    <br/>
    Even if <code>NULL</code> is sometimes displayed as empty (no text), a <code>NULL</code> entry is also different from an empty string <code>''</code>.
</div>

Find customers who don't have a company listed:

In [None]:
SELECT FirstName, LastName, Company
FROM Customer
WHERE Company IS NULL
LIMIT 10;
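The warning above can be checked on a tiny example. The following sketch uses Python's built-in `sqlite3` module with a hypothetical `Contact` table (not part of Chinook) containing one `NULL` company and one empty-string company.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contact (Name TEXT, Company TEXT);  -- hypothetical table
INSERT INTO Contact VALUES ('Ana', 'Acme'), ('Ben', NULL), ('Cleo', '');
""")

def names(condition):
    # Building SQL from a string is acceptable here only because the
    # conditions are fixed strings written just below.
    query = f"SELECT Name FROM Contact WHERE {condition} ORDER BY Name"
    return [row[0] for row in conn.execute(query)]

equals_null = names("Company = NULL")    # a comparison with NULL is never true
is_null = names("Company IS NULL")       # the correct test for missing data
empty_string = names("Company = ''")     # an empty string is not NULL

print(equals_null, is_null, empty_string)
```

Only `IS NULL` finds the row with the missing company; `= NULL` silently returns nothing rather than raising an error, which makes the mistake easy to miss.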

# Part 3: Pattern Matching with `LIKE`

Sometimes you don't know the exact value. `LIKE` lets you search for patterns.

| Symbol | Meaning |
|--------|----------|
| % | Any sequence of characters |
| _ | Any single character |

### Example 1: Find artists whose name starts with "The"

In [None]:
SELECT Name FROM Artist 
WHERE Name LIKE 'The %'
LIMIT 10;

### Example 2: Find tracks containing "love" anywhere in the name:

In [None]:
SELECT Name FROM Track 
WHERE Name LIKE '%love%'
LIMIT 15;
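As a quick illustration of the two wildcards, here is a sketch using Python's built-in `sqlite3` module and a hypothetical `Song` table. It also shows that in SQLite, `LIKE` ignores case for ASCII letters by default, which is why `'%love%'` also finds "Love"; other databases may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Song (Name TEXT);  -- hypothetical table
INSERT INTO Song VALUES ('Love Me Do'), ('Glove'), ('Live'), ('Lave');
""")

def matches(pattern):
    # The ? placeholder lets sqlite3 pass the pattern safely as a value.
    rows = conn.execute(
        "SELECT Name FROM Song WHERE Name LIKE ? ORDER BY Name", (pattern,)
    )
    return [row[0] for row in rows]

contains_love = matches("%love%")  # % stands for any sequence of characters
one_letter_gap = matches("L_ve")   # _ stands for exactly one character

print(contains_love, one_letter_gap)
```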

# Part 4: Sorting Results with `ORDER BY`

Just like sorting columns in a spreadsheet, we can sort our results.

### Syntax
```sql
SELECT columns FROM table_name ORDER BY column_name ASC|DESC;
```
- `ASC` = Ascending (A-Z, smallest to largest) - this is the default
- `DESC` = Descending (Z-A, largest to smallest)

### Example 1: Sort customers alphabetically by last name

In [None]:
SELECT FirstName, LastName, Country 
FROM Customer 
ORDER BY LastName ASC
LIMIT 10;

We can use `||` to concatenate strings (stick them together).
Here we show the first name, a space, and the last name as the `Name`, while sorting by last name.

In [None]:
SELECT FirstName || " " || LastName AS Name, Country 
FROM Customer 
ORDER BY LastName ASC
LIMIT 10;

### Example 2: Find the longest track name (descending order)

Here we use the `LENGTH` SQL function that computes the length of a string (text entry).
There are many such functions for strings, dates, numbers *etc*, just like there are many spreadsheet functions.
Like other expressions, they can be used in `SELECT`, `WHERE`, `ORDER BY`, and more!
Some SQL functions tend to be common, while others are part of specific SQL dialects.

In [None]:
SELECT Name
FROM Track 
ORDER BY LENGTH(Name) DESC
LIMIT 10;

### Example 3: Sort by multiple columns

Sort customers by country, then by last name within each country:

In [None]:
SELECT FirstName, LastName, Country 
FROM Customer 
ORDER BY Country, LastName
LIMIT 5;

# Part 5: Aggregating Data (`COUNT`, `SUM`, `AVG`, `MIN`, `MAX`)

Just like spreadsheet functions (`COUNT`, `SUM`, `AVERAGE`), SQL has **aggregate functions** that calculate values across multiple rows.

### Example 1: Count how many tracks we have:

In [None]:
SELECT COUNT(*) AS TotalTracks FROM Track;

This counts all the rows in the `Track` table.
We can also count the `Country` cells in the `Customer` table; `COUNT(Country)` only counts the cells that are not `NULL`.
When no country is missing, this is the number of rows returned by the query:
```sql
SELECT Country FROM Customer;
```

In [None]:
SELECT COUNT(Country) FROM Customer;

However, we may be more interested in the number of different countries our customers live in.
We can do so with `DISTINCT` again.

<div style="padding: 15px; background-color: #fff3cd; color: #ffc107; border: 1px solid #ffc107; margin: 10px 0;">
    <strong>⚠️ Warning:</strong>
    <code>COUNT</code> will not de-duplicate values.
</div>

In [None]:
SELECT COUNT(DISTINCT Country) FROM Customer;
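The three ways of counting can be compared side by side. This sketch uses Python's built-in `sqlite3` module with a hypothetical `Visitor` table that contains one duplicate country and one missing country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Visitor (Name TEXT, Country TEXT);  -- hypothetical table
INSERT INTO Visitor VALUES
    ('Ana', 'Brazil'), ('Ben', 'France'), ('Cleo', 'France'), ('Dev', NULL);
""")

all_rows, non_null, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(Country), COUNT(DISTINCT Country) FROM Visitor"
).fetchone()

# COUNT(*) counts rows, COUNT(Country) skips NULL,
# COUNT(DISTINCT Country) also merges duplicates.
print(all_rows, non_null, distinct)
```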

### Example 2: Find the average, minimum, and maximum track length

In [None]:
SELECT 
    AVG(Milliseconds)/1000.0/60.0 AS AvgMinutes,
    MIN(Milliseconds)/1000.0/60.0 AS ShortestMinutes,
    MAX(Milliseconds)/1000.0/60.0 AS LongestMinutes
FROM Track;

### Example 3: Calculate total sales

In [None]:
SELECT 
    COUNT(*) AS NumberInvoices,
    SUM(Total) AS TotalRevenue,
    AVG(Total) AS AvgInvoiceAmount
FROM Invoice;

# Part 6: Grouping Data with `GROUP BY`

This is like creating a Pivot Table in a spreadsheet!
`GROUP BY` lets you aggregate data inside various groups, such as a category or a country of residence, but also the activity of an individual.

### Example 1: The top 5 countries where the most customers live

In [None]:
SELECT Country, COUNT(*) AS CustomerCount
FROM Customer
GROUP BY Country
ORDER BY CustomerCount DESC
LIMIT 5;

### Example 2: The top 10 countries with the most sales

In [None]:
SELECT
    BillingCountry, 
    COUNT(*) AS NumberOfSales,
    ROUND(SUM(Total), 2) AS TotalRevenue
FROM Invoice
GROUP BY BillingCountry
ORDER BY TotalRevenue DESC
LIMIT 10;

### Example 3: Find the composers who write the most about love

The `WHERE` filter removes individual rows *before* they are used in the aggregation.

In [None]:
SELECT Composer, SUM(Milliseconds)/1000.0/60.0 AS TotalMinutes
FROM Track
WHERE Name LIKE '%love%' AND Composer IS NOT NULL
GROUP BY Composer
ORDER BY TotalMinutes DESC
LIMIT 10;

### Filtering Groups with `HAVING`

Previously we used `WHERE` to filter individual rows.
With `HAVING` we can filter groups (after aggregation).

Let us find cities where there is a medium number of customers, neither too low nor too high.
Since we are not looking for a maximum or minimum, `ORDER BY` will not be sufficient.

In [None]:
SELECT City, COUNT(*) AS CustomerCount
FROM Customer
GROUP BY City
HAVING CustomerCount >= 2 AND CustomerCount <= 5 
ORDER BY CustomerCount DESC;
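To see the difference between filtering rows and filtering groups, here is a small sketch using Python's built-in `sqlite3` module and a hypothetical `Sale` table. The same `HAVING` condition gives different groups depending on whether a `WHERE` filter removed rows first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sale (City TEXT, Amount REAL);  -- hypothetical table
INSERT INTO Sale VALUES
    ('Paris', 5), ('Paris', 20), ('Paris', 30),
    ('Lyon', 50), ('Nice', 2), ('Nice', 3);
""")

# HAVING alone: keep the cities that have at least two sales.
groups_only = conn.execute("""
    SELECT City, COUNT(*) FROM Sale
    GROUP BY City
    HAVING COUNT(*) >= 2
    ORDER BY City
""").fetchall()

# WHERE first drops the small sales, then HAVING filters the remaining groups.
rows_then_groups = conn.execute("""
    SELECT City, COUNT(*) FROM Sale
    WHERE Amount >= 10
    GROUP BY City
    HAVING COUNT(*) >= 2
    ORDER BY City
""").fetchall()

print(groups_only, rows_then_groups)
```

Here the `HAVING` condition repeats `COUNT(*)` instead of using an alias, which is the form accepted by the widest range of databases.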

# Part 7: Connecting tables with `JOIN`
### Database organization

When looking at the tables you may have seen a lot of column names with `Id` in them.
These serve two main purposes.
The first one we are most accustomed to.
What should we do if two customers have the same name?
Should we try to identify with their city? Their birthdate?
This will complicate queries a lot.
Instead we give each customer a unique customer number or Id.
We are quite used to this in our daily lives, for instance with our social security numbers.

Databases push the idea of Ids further, giving *internal* numbers to many tables  (if not all).
Let us look at the `Track` table.
You may have noticed that it contains no album or artist name for the tracks.

In [None]:
SELECT * FROM Track LIMIT 10;

We could replace the `AlbumId` with the album name in this table, but think about it: just like people's names, album names may not be unique.
Should we add the release date too? The publisher?
And when a customer buys an album, must we insert all this information in the invoice table, on top of all the information needed to identify the customer?
The album and customer data would be **duplicated** (copied over more than once).
This is bad for performance, wasting a lot of disk space and slowing down queries.
It is also bad for data consistency.
What if a customer calls to correct an error in their name? 
Then, we would need to modify their name in all the tables that use it, perhaps forgetting some and creating inconsistencies.

There are many rules for how to [organize databases](https://en.wikipedia.org/wiki/Database_normalization) (it is even a whole field of study).
A first approximate rule could be summarized as

> Do not duplicate data entries. Rather, liberally assign IDs and use them when referring to this entry in another table.

This is what is done with the `AlbumId`.
In the Album table, an `AlbumId` corresponds to a single row.
To substitute it, we use a `JOIN`.
In the `JOIN` clause, we give the name of the Id column in both tables, with an equality between them, to specify how they should match.
Later, we will learn how to use more general conditions in this clause.

In [None]:
SELECT *
FROM Track
JOIN Album ON Track.AlbumId = Album.AlbumId
LIMIT 5;

This can quickly turn noisy.
We quickly lose track of which table the columns are coming from.
Is `Name` the name of the song or the album?
As always we can refine the selected columns and give aliases.

In [None]:
SELECT Track.*, Album.Title AS AlbumName
FROM Track
JOIN Album ON Track.AlbumId = Album.AlbumId
LIMIT 5;

We can continue by joining the `Artist` table to get the name of the artist.
As always, we could also filter with `WHERE` and aggregate with `GROUP BY`.

In [None]:
SELECT
    Track.Name AS TrackName,
    Album.Title AS AlbumTitle,
    Artist.Name AS ArtistName
FROM Track
JOIN Album ON Track.AlbumId  = Album.AlbumId
JOIN Artist ON Album.ArtistId = Artist.ArtistId
LIMIT 10;

### Syntax
The general syntax for a `JOIN` will match all rows from `Table1` with those from `Table2` where the condition matches.
```sql
SELECT Table1.columns, Table2.columns
FROM Table1
INNER JOIN Table2 ON conditions;
```

Previously, when matching `AlbumId` from the `Track` table, there was only one row in `Album` with the corresponding id.
This **many-to-one** match can be understood as *substitute the information for that id*.

In the more general use of `JOIN`, a row can match several rows of the other table, and vice versa: a **many-to-many** match.
Each row is then repeated once for each of its matches.

For instance let us find customers that one of our employees can make a local phone call to (*i.e.* living in the same country).

In [None]:
SELECT
    Customer.FirstName AS CustomerFirstName,
    Customer.LastName AS CustomerLastName,
    Customer.Phone AS CustomerPhone,
    Employee.FirstName AS EmployeeFirstName,
    Employee.LastName AS EmployeeLastName
FROM Customer
INNER JOIN Employee ON Customer.Country = Employee.Country;

This is a large result!
`JOIN` can often produce many rows.
In this example, all 8 employees live in Canada, and 8 customers live in Canada (`SELECT COUNT(*) FROM Customer WHERE Country = 'Canada';`).
Each of these customers matches with each employee from Canada, resulting in 8x8 = 64 rows in the result.
If we had another 3 employees in France, they would each match with the 5 customers in France, resulting in 3x5 = 15 additional rows.

An extreme example is to match on a condition that is always true: each of the 59 customers then matches each of the 8 employees, creating all 59x8 = 472 possible employee-customer combinations.

In [None]:
SELECT COUNT(*) FROM Customer INNER JOIN Employee ON True;
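The row arithmetic above can be reproduced at a smaller scale. This sketch uses Python's built-in `sqlite3` module with hypothetical `Staff` and `Client` tables: 2 staff and 3 clients in Canada, 1 of each in France, and 1 client in Brazil with no matching staff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (Name TEXT, Country TEXT);   -- hypothetical tables
CREATE TABLE Client (Name TEXT, Country TEXT);
INSERT INTO Staff VALUES ('Eve', 'Canada'), ('Finn', 'Canada'), ('Gus', 'France');
INSERT INTO Client VALUES
    ('Ana', 'Canada'), ('Ben', 'Canada'), ('Cleo', 'Canada'),
    ('Dev', 'France'), ('Eli', 'Brazil');
""")

# Matching on country: 2x3 rows for Canada + 1x1 for France + 0 for Brazil.
same_country = conn.execute(
    "SELECT COUNT(*) FROM Client INNER JOIN Staff ON Client.Country = Staff.Country"
).fetchone()[0]

# A condition that is always true pairs every client with every staff member.
all_pairs = conn.execute(
    "SELECT COUNT(*) FROM Client INNER JOIN Staff ON 1 = 1"
).fetchone()[0]

print(same_country, all_pairs)
```

The condition `1 = 1` plays the same role as `True` in the cell above; it is used here only because very old SQLite versions do not know the `TRUE` keyword.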

### Example 1: Count albums per artist (`JOIN` with `GROUP BY`)

Because an artist name may not be unique, we group by both the name and `ArtistId` to avoid merging different artists.

We can also give aliases to tables.

In [None]:
SELECT ar.Name AS Artist, COUNT(al.AlbumId) AS AlbumCount
FROM Artist ar
JOIN Album al ON ar.ArtistId = al.ArtistId
GROUP BY ar.Name, ar.ArtistId
ORDER BY AlbumCount DESC
LIMIT 10;

### Example 2: Most profitable songs

Similarly, we use `Track.TrackId` in the `GROUP BY` to avoid counting together different tracks that share the same name.

In [None]:
SELECT
    Track.Name,
    SUM(InvoiceLine.UnitPrice * InvoiceLine.Quantity) AS Revenue
FROM Track
JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId
GROUP BY Track.TrackId, Track.Name
ORDER BY Revenue DESC
LIMIT 10;

### Example 3: Most profitable artists

This time, we need to `JOIN` `InvoiceLine` ↔ `Track` ↔ `Album` ↔ `Artist` to access the artist name.

In [None]:
SELECT
    Artist.Name,
    SUM(InvoiceLine.UnitPrice * InvoiceLine.Quantity) AS Revenue
FROM InvoiceLine
JOIN Track ON InvoiceLine.TrackId = Track.TrackId
JOIN Album ON Track.AlbumId = Album.AlbumId
JOIN Artist ON Album.ArtistId = Artist.ArtistId
GROUP BY Artist.Name, Artist.ArtistId
ORDER BY Revenue DESC
LIMIT 10;

### `INNER JOIN` vs `LEFT JOIN`

The `JOIN`s we've been using are actually short for `INNER JOIN`: they only return rows where there's a match in **both** tables.
Let us look at the songs that are the least sold.

In [None]:
SELECT
    Track.Name AS Song,
    SUM(InvoiceLine.UnitPrice * InvoiceLine.Quantity) AS Revenue
FROM Track
JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId
GROUP BY Track.TrackId, Track.Name
ORDER BY Revenue ASC
LIMIT 5;

Notice how the least sold songs are still making *some* revenue.
Surely there must be some songs that are up for sale that were never sold.
The `INNER JOIN` removed these songs from the result because they did not appear in the `InvoiceLine` table and so did not match anything.

We want to keep all songs from the `Track` table, even if they do not match anything.
This is exactly what `LEFT JOIN` does.
Running the query with this join gives the following:

In [None]:
SELECT
    Track.Name AS Song,
    SUM(InvoiceLine.UnitPrice * InvoiceLine.Quantity) AS Revenue
FROM Track
LEFT JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId
GROUP BY Track.TrackId, Track.Name
ORDER BY Revenue ASC
LIMIT 5;

Because `SUM` results in `NULL` for unsold songs, their `Revenue` appears empty in the result.
We can wrap it in `COALESCE` to replace it with a default value if it is `NULL`.

In [None]:
SELECT
    Track.Name AS Song,
    COALESCE(SUM(InvoiceLine.UnitPrice * InvoiceLine.Quantity), 0.0) AS Revenue
FROM Track
LEFT JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId
GROUP BY Track.TrackId, Track.Name
ORDER BY Revenue ASC
LIMIT 5;

Here we focused on the least sold songs to see the behaviour of `INNER JOIN` *vs.* `LEFT JOIN`.
A more likely query is to export all songs and their revenue.
With an `INNER JOIN`, people looking at the export would retort, "*What about song [NAME]? Do we not sell it?*".
With the `LEFT JOIN` there is no ambiguity.

If we only wanted to find songs that were never sold, we could go faster.
With a `LEFT JOIN`, rows in `Track` that do not match anything in `InvoiceLine` but are still kept will have `NULL` entries for `InvoiceLine` columns.
We can directly leverage this without `GROUP BY`.

In [None]:
SELECT Track.Name
FROM Track
LEFT JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId
WHERE InvoiceLine.TrackId IS NULL
LIMIT 5;
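The three variants discussed in this section can be compared on a tiny example. This sketch uses Python's built-in `sqlite3` module with hypothetical `Item` and `Sold` tables, where the item "Ink" was never sold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (ItemId INTEGER, Name TEXT);   -- hypothetical tables
CREATE TABLE Sold (ItemId INTEGER, Price REAL);
INSERT INTO Item VALUES (1, 'Pen'), (2, 'Ink'), (3, 'Pad');
INSERT INTO Sold VALUES (1, 2.0), (1, 2.0), (3, 5.0);
""")

def revenue(join_type):
    # join_type is one of two fixed strings chosen below.
    return conn.execute(f"""
        SELECT Item.Name, COALESCE(SUM(Sold.Price), 0.0) AS Revenue
        FROM Item
        {join_type} JOIN Sold ON Item.ItemId = Sold.ItemId
        GROUP BY Item.ItemId, Item.Name
        ORDER BY Item.Name
    """).fetchall()

inner = revenue("INNER")  # the unsold item disappears
left = revenue("LEFT")    # the unsold item is kept, with revenue 0.0

# Unmatched rows have NULL in the columns of the joined table.
never_sold = [row[0] for row in conn.execute("""
    SELECT Item.Name
    FROM Item
    LEFT JOIN Sold ON Item.ItemId = Sold.ItemId
    WHERE Sold.ItemId IS NULL
""")]

print(inner, left, never_sold)
```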

### Other `JOIN`s
There are two more types of `JOIN`.
`RIGHT JOIN` mirrors `LEFT JOIN`: it keeps the unmatched rows from the table named in the `RIGHT JOIN` clause.
`FULL OUTER JOIN` keeps unmatched rows from both tables.
Note that SQLite only supports these two since version 3.39.0.

# Part 8: Putting It All Together

Now that you've learned all the key SQL concepts -- `SELECT`, `WHERE`, `JOIN`, `GROUP BY`, aggregate functions, and more -- it's time to put them into practice!
Head over to the exercises below and try to answer the questions on your own before checking the solutions.
Do not hesitate to play with SQL in this notebook, slowly building up queries, adding and refining information as you go.
Create multiple cells, edit them, merge code from several of them *etc*.

### Summary: SQL Query Structure

Here's the order of clauses in a SQL query:

```sql
SELECT columns           -- What columns to show
FROM table               -- Which table(s) to use
JOIN other_table ON ...  -- Connect related tables
WHERE conditions         -- Filter individual rows
GROUP BY columns         -- Group rows for aggregation
HAVING conditions        -- Filter groups
ORDER BY columns         -- Sort the results
LIMIT n                  -- Limit number of rows
```
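As an illustration, here is one query that uses every clause in that order. It is a sketch run with Python's built-in `sqlite3` module against hypothetical `Shopper` and `Purchase` tables, not the Chinook data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Shopper (ShopperId INTEGER, Country TEXT);  -- hypothetical tables
CREATE TABLE Purchase (ShopperId INTEGER, Total REAL);
INSERT INTO Shopper VALUES (1, 'France'), (2, 'France'), (3, 'Canada'), (4, 'Brazil');
INSERT INTO Purchase VALUES (1, 10), (2, 5), (1, 0.5), (3, 8), (3, 9), (4, 20);
""")

top_countries = conn.execute("""
    SELECT s.Country, COUNT(*) AS Purchases, SUM(p.Total) AS Revenue
    FROM Shopper s
    JOIN Purchase p ON s.ShopperId = p.ShopperId
    WHERE p.Total > 1          -- drop the tiny purchase before grouping
    GROUP BY s.Country
    HAVING COUNT(*) >= 2       -- keep countries with at least two purchases
    ORDER BY Revenue DESC
    LIMIT 2
""").fetchall()

print(top_countries)
```

Brazil has the single largest purchase but is removed by `HAVING`, since it has only one purchase left after the `WHERE` filter.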

### Quick Reference Card

| Task | SQL |
|------|-----|
| See all data | `SELECT * FROM table` |
| See specific columns | `SELECT col1, col2 FROM table` |
| Remove duplicates | `SELECT DISTINCT col FROM table` |
| Filter rows | `WHERE condition` |
| Check for missing data | `WHERE column IS NULL` |
| Sort results | `ORDER BY column ASC/DESC` |
| Limit rows | `LIMIT n` |
| Count rows | `SELECT COUNT(*) FROM table` |
| Group & aggregate | `GROUP BY column` |
| Connect tables (inner) | `JOIN table2 ON table1.id = table2.id` |
| Keep all left rows | `LEFT JOIN table2 ON table1.id = table2.id` |
| Pattern matching | `WHERE column LIKE '%pattern%'` |

# Practice Exercises

Try these on your own! Write your queries in the empty cells below. Solutions are provided further down!

### Exercise 1

List all genres in the database.

### Exercise 2

Find all customers from Germany.

### Exercise 3

Find the 5 shortest tracks by duration. Display their name and length in minutes.

### Exercise 4

Count how many tracks are in each album. Display the album name and its track count.

### Exercise 5

Find the average track length for each genre, sorted from longest to shortest.

### Exercise 6

What are the different track price points, and how many tracks are at each price?

Use grouping and counting to explore how tracks are distributed across different unit prices in the store.

### Exercise 7

Format a track catalog as a readable list.

Using string concatenation (`||`), display each track as a single formatted line: `- "Song name", Album from Artist`. You will need to join multiple tables and build the string in the `SELECT` clause.

### Exercise 8

Which artist has the most tracks in the store?

Join the `Artist`, `Album`, and `Track` tables to count the total number of tracks per artist and find the most prolific ones.

### Exercise 9

Who are the top spending customers?

Join customers with their invoices to compute each customer's total spending and rank them.

### Exercise 10

Which genres generate the most revenue?

Follow the chain from `Genre` through `Track` and `InvoiceLine` to calculate the total revenue per genre.

### Exercise 11

Which artists have sold more than 10 tracks to customers in the USA?

This requires chaining many tables together (`Artist` through `InvoiceLine` and `Invoice`), filtering with `WHERE` on the billing country, grouping by artist, and using `HAVING` to keep only those above the threshold.

# Solutions

Scroll down only when you're ready to check your answers!

### Solution 1

In [None]:
SELECT * FROM Genre;

### Solution 2

In [None]:
SELECT FirstName, LastName, City
FROM Customer
WHERE Country = 'Germany';

### Solution 3

In [None]:
SELECT Name, Milliseconds / 1000.0 / 60.0 AS Minutes
FROM Track
ORDER BY Milliseconds ASC
LIMIT 5;

### Solution 4

In [None]:
SELECT al.Title AS Album, COUNT(t.TrackId) AS TrackCount
FROM Album al
JOIN Track t ON al.AlbumId = t.AlbumId
GROUP BY al.AlbumId, al.Title
ORDER BY TrackCount DESC
LIMIT 10;

### Solution 5

In [None]:
SELECT
    g.Name AS Genre,
    ROUND(AVG(t.Milliseconds) / 1000.0 / 60.0, 2) AS AvgMinutes
FROM Genre g
JOIN Track t ON g.GenreId = t.GenreId
GROUP BY g.GenreId, g.Name
ORDER BY AvgMinutes DESC;

### Solution 6

In [None]:
SELECT UnitPrice, COUNT(*) AS TrackCount
FROM Track
GROUP BY UnitPrice
ORDER BY UnitPrice;

### Solution 7

In [None]:
SELECT '- "' || t.Name || '", ' || al.Title || ' from ' || ar.Name AS TrackEntry
FROM Track t
JOIN Album al ON t.AlbumId = al.AlbumId
JOIN Artist ar ON al.ArtistId = ar.ArtistId
LIMIT 15;

### Solution 8

In [None]:
SELECT 
    ar.Name AS Artist,
    COUNT(t.TrackId) AS TrackCount
FROM Artist ar
JOIN Album al ON ar.ArtistId = al.ArtistId
JOIN Track t ON al.AlbumId = t.AlbumId
GROUP BY ar.ArtistId, ar.Name
ORDER BY TrackCount DESC
LIMIT 10;

### Solution 9

In [None]:
SELECT 
    c.FirstName || ' ' || c.LastName AS Customer,
    c.Country,
    ROUND(SUM(i.Total), 2) AS TotalSpent
FROM Customer c
JOIN Invoice i ON c.CustomerId = i.CustomerId
GROUP BY c.CustomerId
ORDER BY TotalSpent DESC
LIMIT 10;

### Solution 10

In [None]:
SELECT 
    g.Name AS Genre,
    COUNT(il.InvoiceLineId) AS TracksSold,
    ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue
FROM Genre g
JOIN Track t ON g.GenreId = t.GenreId
JOIN InvoiceLine il ON t.TrackId = il.TrackId
GROUP BY g.GenreId, g.Name
ORDER BY TotalRevenue DESC
LIMIT 10;

### Solution 11

In [None]:
SELECT
    ar.Name AS Artist,
    COUNT(il.InvoiceLineId) AS TracksSold
FROM Artist ar
JOIN Album al ON ar.ArtistId = al.ArtistId
JOIN Track t ON al.AlbumId = t.AlbumId
JOIN InvoiceLine il ON t.TrackId = il.TrackId
JOIN Invoice i ON il.InvoiceId = i.InvoiceId
WHERE i.BillingCountry = 'USA'
GROUP BY ar.ArtistId, ar.Name
HAVING TracksSold > 10
ORDER BY TracksSold DESC;