
# DS2002 — SQL Fundamentals in Notebooks  
## 2026-01-26 — Lecture (Expanded, Real-World First)

**Instructor:** Jason Williamson  
**Course:** DS2002 — Data Science Systems

---

You are going to hear a lot of people talk about SQL like it is a set of magic spells: `SELECT`, `JOIN`, `WHERE`. That framing is backwards. SQL is boring on purpose. It exists because organizations need a way to store facts so those facts don’t quietly mutate when the spreadsheet gets copied, when the intern “fixes” a column, or when three different teams each keep their own version of “the customer list.”

This lecture is about learning the *reason* SQL exists, then learning how to design tables so your queries have a chance of being correct. Only after that do we start writing SQL. If you remember one thing, remember this: when the structure is wrong, the analysis is wrong, and SQL will happily return an answer anyway.



# Why Structured Data Matters

Structured data is not an academic obsession. It is a survival mechanism for scale.

Imagine a small online store in week one. One person is taking orders in a Google Sheet. The sheet has a column called `Customer`, a column called `Items`, and a column called `Total`. That works until the first real-world event happens: a customer changes their email, a product’s price changes, returns start happening, or two employees edit the sheet at the same time. Suddenly “the truth” is whatever the last person typed, and the history is a blur of edits.

Now imagine you are doing analytics for that store and you are asked, “How much did we earn from keyboards last month?” If keyboards are stored as text inside a comma-separated cell, you can’t reliably count them. If prices are repeated in multiple rows, price updates become inconsistent. If customer information is duplicated, you will treat one human as multiple humans. The result is not just messy data. The result is decisions made from wrong numbers.

Structured data means you write down what a row *means* and what a column *means*, and you enforce it. A database is a place where facts are stored with constraints, not vibes.



# Why SQL, Specifically?

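SQL is the language we use to interact with structured data stored in relational databases. The word “relational” comes from the mathematical term *relation*, which is essentially a table of rows that all share the same columns. In practice, data is not stored as one giant table; it is stored as multiple tables that connect to each other using keys.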

SQL persists because it solves a universal problem: asking precise questions of data that is too large and too shared to live safely inside spreadsheets. When you ask for “all orders from last week” you want a consistent definition of what an “order” is. When you ask for “average purchase price by product” you want each product to be represented consistently. SQL gives you a common vocabulary to describe those questions, and databases give you enforcement so the data stays sane.

Another reason SQL matters in data science is practical: most serious datasets you will touch in industry are stored in systems where SQL is the front door. Even when the rest of your pipeline is Python, the data often begins as SQL queries.



# A Notebook Is a Good Place to Learn SQL

Notebooks are useful because they combine explanation with execution. You can read a concept, run a query, see the output, and then change it and run again. That loop is how you learn. In this lecture, we will use a small in-memory database so you can see SQL working without installing anything.

We will build a tiny e-commerce-style dataset. That sounds trivial, but it’s perfect for learning because it has the same structure as real systems: customers place orders, orders contain items, items correspond to products.



# Before We Normalize: What Goes Wrong in the Real World

Normalization is a set of design rules. Those rules exist to prevent three common failure modes.

The first failure mode is the **update problem**. You store the same fact in multiple places, and when the fact changes, you forget to update one of them. You don’t notice until reporting disagrees with reality.

The second failure mode is the **insert problem**. You want to store a fact, but your table design forces you to invent unrelated data just to fill a row. That leads to placeholder values, nulls, and strange “fake records” that later contaminate analysis.

The third failure mode is the **delete problem**. You remove one row and accidentally remove the only copy of an important fact because that fact was stored in a table that mixed multiple kinds of information.

Normalization is basically “design so these don’t happen.”
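Here is a minimal sketch of the delete problem, so you can watch it happen. It assumes a hypothetical single-table design called `flat_orders` that mixes order facts with product facts, and it runs on its own scratch in-memory database (`anomaly_db`) so it does not touch anything else in this notebook.

```python
import sqlite3

# Scratch database; the flat_orders table and its rows are made up for illustration.
anomaly_db = sqlite3.connect(":memory:")
anomaly_db.executescript("""
CREATE TABLE flat_orders (
    order_id      INTEGER PRIMARY KEY,
    customer      TEXT NOT NULL,
    product       TEXT NOT NULL,
    product_price REAL NOT NULL
);
INSERT INTO flat_orders VALUES
    (1, 'alice@example.com', 'Keyboard', 50),
    (2, 'bob@example.com',   'Monitor', 200);
""")

# Bob's order is cancelled, so we delete it...
anomaly_db.execute("DELETE FROM flat_orders WHERE order_id = 2")

# ...and the only record of the Monitor's price disappears with it.
monitor_rows = anomaly_db.execute(
    "SELECT product_price FROM flat_orders WHERE product = 'Monitor'"
).fetchall()
print(monitor_rows)  # []
```

Nobody asked to forget what a monitor costs. The table design made that decision for us.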



# Thinking in Tables: One Table, One Type of Thing

A table should represent one type of thing. If you can’t say what the rows are in a single sentence, the table is probably trying to do too much.

A good sentence sounds like: “Each row in `customers` represents one customer.” Or: “Each row in `products` represents one product.” Or: “Each row in `orders` represents one order.”

A bad sentence sounds like: “Each row represents an order plus the customer details plus the product details plus the price at the time.” That is a spreadsheet sentence. Databases want you to separate concerns.



# Normalization: The Core Idea

Normalization is a method for organizing data so that each fact is stored once, in the best place, and relationships connect facts together. You can think of normalization as learning to separate a messy human story into clean components.

A helpful way to keep your brain straight is to ask: “What is the entity here?” An entity is a thing you care about and want to refer to repeatedly. Customers are entities. Products are entities. Orders are entities. If you find yourself repeating the same attributes over and over, you probably have an entity hiding in your table that deserves its own table.



# First Normal Form (1NF)

First Normal Form is about shape. It says: a table must not hide multiple values inside one cell. Each column should hold one value per row, and each row should be identifiable.

Consider a realistic “week one startup” table of orders:

| order_id | customer_email | products |
|---:|---|---|
| 1 | alice@example.com | Keyboard, Mouse |
| 2 | bob@example.com | Keyboard |
| 3 | alice@example.com | Monitor, HDMI Cable |

This is not in 1NF because the `products` cell is a list. It feels convenient, but it breaks the moment you try to compute something. Counting keyboards requires splitting strings. Filtering for orders that contain a mouse requires text parsing. Joining to a products table becomes awkward because the value is no longer a single product identifier.

In the real world, this is how dashboards become fragile. Someone misspells “Keyboard” once and now you have two product categories. Or someone writes “keyboard” in lowercase and your counts drift. You can’t enforce consistency because the database can’t see individual values inside the cell.
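To make that fragility concrete, here is a small plain-Python sketch. The first three rows come from the table above; the fourth row is a hypothetical messy entry added only to show how counts drift.

```python
# Rows 1-3 mirror the table above; row 4 is a hypothetical messy entry.
orders = [
    (1, "alice@example.com", "Keyboard, Mouse"),
    (2, "bob@example.com", "Keyboard"),
    (3, "alice@example.com", "Monitor, HDMI Cable"),
    (4, "bob@example.com", "keyboard , Mouse"),
]

# Naive exact-match counting silently misses the messy row.
naive_count = sum(
    1 for _, _, products in orders if "Keyboard" in products.split(", ")
)

# A defensive parse has to guess at every way a human might type the value.
careful_count = sum(
    1
    for _, _, products in orders
    if "keyboard" in [p.strip().lower() for p in products.split(",")]
)

print(naive_count, careful_count)  # 2 3
```

Two reasonable-looking approaches give two different answers to “how many orders included a keyboard?” That disagreement is the cost of hiding a list inside a cell.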



## Making It 1NF

To make this 1NF, you rewrite the table so that each row corresponds to a single product within an order. That means one order will appear in multiple rows, but that is not a problem: each row now states one fact (this order included this product), so the repetition of `order_id` is controlled and meaningful.

| order_id | customer_email | product |
|---:|---|---|
| 1 | alice@example.com | Keyboard |
| 1 | alice@example.com | Mouse |
| 2 | bob@example.com | Keyboard |
| 3 | alice@example.com | Monitor |
| 3 | alice@example.com | HDMI Cable |

Now every cell holds one value. The database can filter, count, and join cleanly. You have taken a messy “list-in-a-cell” and turned it into something the database understands.
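As a quick sketch of what “filter and count cleanly” looks like, the 1NF rows above can be loaded into a scratch SQLite table. The table name `order_lines` and the connection name `nf1_db` are made up for this example.

```python
import sqlite3

# Scratch database holding the 1NF table shown above.
nf1_db = sqlite3.connect(":memory:")
nf1_db.execute(
    "CREATE TABLE order_lines (order_id INTEGER, customer_email TEXT, product TEXT)"
)
nf1_db.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?)",
    [
        (1, "alice@example.com", "Keyboard"),
        (1, "alice@example.com", "Mouse"),
        (2, "bob@example.com", "Keyboard"),
        (3, "alice@example.com", "Monitor"),
        (3, "alice@example.com", "HDMI Cable"),
    ],
)

# Counting keyboards is now a plain equality filter, no string splitting.
keyboard_count = nf1_db.execute(
    "SELECT COUNT(*) FROM order_lines WHERE product = 'Keyboard'"
).fetchone()[0]

# So is finding the orders that contain a mouse.
mouse_orders = [
    row[0]
    for row in nf1_db.execute(
        "SELECT order_id FROM order_lines WHERE product = 'Mouse' ORDER BY order_id"
    )
]

print(keyboard_count, mouse_orders)  # 2 [1]
```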

If 1NF feels like “more rows,” that’s correct. Databases are designed for lots of rows. Spreadsheets are designed for human eyeballs. We are not optimizing for eyeballs anymore.



# Second Normal Form (2NF)

Second Normal Form is about meaning. It says: if your table uses a composite key (a key made of multiple columns), then every non-key column must depend on the full key, not just part of it.

This sounds abstract until you see the common pattern. After converting the “products list” into one row per product, many people create an `order_items` table like this:

| order_id | product_id | product_name | product_price |
|---:|---:|---|---:|
| 1 | 101 | Keyboard | 50 |
| 1 | 102 | Mouse | 20 |
| 2 | 101 | Keyboard | 50 |

The intended key here is `(order_id, product_id)` because that pair uniquely identifies a line item.

Now ask a simple dependency question. Does `product_price` depend on the order, the product, or both? In most systems, the price describes the product (or perhaps the product at a time), not the order itself. That means `product_price` depends on `product_id`, not on the full composite key. That violates 2NF.

This is the update problem in disguise. If the price of a keyboard changes from 50 to 55, you would have to update every row where `product_id = 101`. Miss one, and now your data contains two truths.
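A minimal sketch of “two truths,” using the three rows from the table above. The table name `order_items_flat` is an assumption made for this example, and the partial `UPDATE` stands in for whatever real-world mistake leaves one row behind.

```python
import sqlite3

# Scratch database with the pre-2NF table shown above.
update_db = sqlite3.connect(":memory:")
update_db.executescript("""
CREATE TABLE order_items_flat (
    order_id      INTEGER NOT NULL,
    product_id    INTEGER NOT NULL,
    product_name  TEXT NOT NULL,
    product_price REAL NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO order_items_flat VALUES
    (1, 101, 'Keyboard', 50),
    (1, 102, 'Mouse', 20),
    (2, 101, 'Keyboard', 50);
""")

# A careless price change that touches only one of the two keyboard rows.
update_db.execute(
    "UPDATE order_items_flat SET product_price = 55 "
    "WHERE product_id = 101 AND order_id = 1"
)

# The same product now has two different prices on record.
keyboard_prices = [
    row[0]
    for row in update_db.execute(
        "SELECT DISTINCT product_price FROM order_items_flat "
        "WHERE product_id = 101 ORDER BY product_price"
    )
]
print(keyboard_prices)  # [50.0, 55.0]
```

The database did not complain, because nothing in this design tells it that a product should have one price.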



## Making It 2NF

The fix is to separate the product facts from the order line items.

You create a `products` table where product facts live once:

| product_id | name | price |
|---:|---|---:|
| 101 | Keyboard | 50 |
| 102 | Mouse | 20 |

Then your `order_items` table becomes only the relationship between orders and products:

| order_id | product_id |
|---:|---:|
| 1 | 101 |
| 1 | 102 |
| 2 | 101 |

This may feel like you are “splitting tables just to split tables,” but it is exactly what makes analytics trustworthy. You have moved product meaning into `products` and left `order_items` as a pure record of what was purchased.

In real organizations, this separation is what prevents one team from “fixing” a price in their report while another team uses a different price elsewhere. One truth, one place.
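Here is the same price change against the 2NF design, as a sketch on a scratch database (`nf2_db` is a made-up name; the tables match the two shown above).

```python
import sqlite3

# Scratch database with the two 2NF tables shown above.
nf2_db = sqlite3.connect(":memory:")
nf2_db.executescript("""
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    price      REAL NOT NULL
);
CREATE TABLE order_items (
    order_id   INTEGER NOT NULL,
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO products VALUES (101, 'Keyboard', 50), (102, 'Mouse', 20);
INSERT INTO order_items VALUES (1, 101), (1, 102), (2, 101);
""")

# The price lives in exactly one row, so one UPDATE changes it everywhere.
nf2_db.execute("UPDATE products SET price = 55 WHERE product_id = 101")

joined_rows = nf2_db.execute("""
    SELECT oi.order_id, p.name, p.price
    FROM order_items oi
    JOIN products p ON p.product_id = oi.product_id
    ORDER BY oi.order_id, p.product_id
""").fetchall()
print(joined_rows)
# [(1, 'Keyboard', 55.0), (1, 'Mouse', 20.0), (2, 'Keyboard', 55.0)]
```

There is no row left to forget, so there is no way for the keyboard to have two prices.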



# Third Normal Form (3NF)

Third Normal Form targets a subtler issue: non-key columns must not depend on other non-key columns. This is called a transitive dependency, and it is where a lot of “it mostly works” data designs go to die.

Consider a `customers` table:

| customer_id | city | state |
|---:|---|---|
| 1 | Richmond | VA |
| 2 | Charlottesville | VA |
| 3 | Raleigh | NC |

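At first glance, this seems fine. But the table is storing a fact about geography. In this example, treat city names as unique, so the state is determined by the city, not by the customer. That means `city → state`. The state value is not truly a customer attribute; it is a city attribute that happens to be copied into the customer table. (Real city names repeat across states, which is one reason production schemas identify locations by an ID rather than by name.)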

What breaks in the real world? If the city value is misspelled in one row, state can diverge. Or you introduce “Richmond” and accidentally mark it as “NC” once. Now you have a data quality issue that is hard to detect because it looks like a legitimate customer record.
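One way to hunt for this kind of contradiction is to ask which cities appear with more than one state. The sketch below uses the three rows from the table above plus a hypothetical fourth customer whose state was entered incorrectly; `nf3_db` is a scratch connection.

```python
import sqlite3

# Scratch database with the pre-3NF customers table.
nf3_db = sqlite3.connect(":memory:")
nf3_db.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    city        TEXT NOT NULL,
    state       TEXT NOT NULL
);
INSERT INTO customers VALUES
    (1, 'Richmond', 'VA'),
    (2, 'Charlottesville', 'VA'),
    (3, 'Raleigh', 'NC'),
    (4, 'Richmond', 'NC');  -- hypothetical data-entry mistake
""")

# Any city listed with more than one state is a contradiction to investigate.
conflicts = nf3_db.execute("""
    SELECT city, COUNT(DISTINCT state) AS n_states
    FROM customers
    GROUP BY city
    HAVING COUNT(DISTINCT state) > 1
""").fetchall()
print(conflicts)  # [('Richmond', 2)]
```

Notice that you had to go looking. The insert itself succeeded without complaint.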

3NF pushes you toward isolating “lookup facts” into separate tables so that you don’t duplicate derived meaning across many records.



## Making It 3NF

One clean fix is to separate location facts.

You keep customers as customers:

| customer_id | city |
|---:|---|
| 1 | Richmond |
| 2 | Charlottesville |
| 3 | Raleigh |

Then you store the city-to-state mapping once:

| city | state |
|---|---|
| Richmond | VA |
| Charlottesville | VA |
| Raleigh | NC |

Now if a city’s state mapping needs correction, it is corrected once. Customers no longer carry a duplicated fact that is not truly about them.

In larger systems, you might store a `location_id` and keep a separate `locations` table. The pattern is the same: keep facts where they belong.
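As a sketch of the 3NF version, the two tables above can be joined to recover the original three-column view whenever you need it. The table name `cities` and the connection name `loc_db` are assumptions for this example.

```python
import sqlite3

# Scratch database with the city-to-state mapping stored once.
loc_db = sqlite3.connect(":memory:")
loc_db.executescript("""
CREATE TABLE cities (
    city  TEXT PRIMARY KEY,
    state TEXT NOT NULL
);
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    city        TEXT NOT NULL REFERENCES cities(city)
);
INSERT INTO cities VALUES
    ('Richmond', 'VA'), ('Charlottesville', 'VA'), ('Raleigh', 'NC');
INSERT INTO customers VALUES
    (1, 'Richmond'), (2, 'Charlottesville'), (3, 'Raleigh');
""")

# The state is looked up through the city instead of being copied per customer.
customer_states = loc_db.execute("""
    SELECT cu.customer_id, cu.city, ci.state
    FROM customers cu
    JOIN cities ci ON ci.city = cu.city
    ORDER BY cu.customer_id
""").fetchall()
print(customer_states)
# [(1, 'Richmond', 'VA'), (2, 'Charlottesville', 'VA'), (3, 'Raleigh', 'NC')]
```

Because `city` is the primary key of `cities`, a second “Richmond” row with a different state cannot be inserted at all.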



# A Short Summary of 1NF, 2NF, 3NF

First Normal Form prevents you from hiding lists inside cells and forces your data into a queryable shape.

Second Normal Form prevents you from storing attributes in a table where they only depend on part of a composite key, which is how you get repeated facts and inconsistent updates.

Third Normal Form prevents you from mixing “side facts” into an entity table, which is how you get subtle contradictions that are hard to notice until analysis fails.

If you learn to spot these violations, you can walk into almost any messy dataset and design a safer version of it.



# SQL Command Categories

SQL has different kinds of commands, and learning the categories keeps you from feeling lost.

Data Definition Language (DDL) is how you define structure. It answers, “What tables exist and what columns do they have?”

Data Manipulation Language (DML) is how you change the contents. It answers, “What rows exist?”

Data Query Language (DQL, often treated as part of DML in practice) is how you ask questions using `SELECT`.

For today, we will define a small schema (DDL), insert tiny data (DML), and run basic queries (SELECT). The goal is not completeness. The goal is comfort.



# SQL in a Notebook: Our Setup

In many notebooks, you can run SQL using a database connector. To keep this lecture simple and portable, we will use SQLite, a lightweight relational database that can run in memory.

You will see two things:
1. SQL written as SQL, because that is how you should learn it.
2. A small Python helper that sends SQL to the database so we can see results.

In Kaggle and most Jupyter environments, this pattern works reliably.


In [None]:
import sqlite3
import pandas as pd

# Create an in-memory SQLite database for the lecture.
conn = sqlite3.connect(":memory:")

def q(sql: str) -> pd.DataFrame:
    """Run a SQL query and return results as a DataFrame."""
    return pd.read_sql_query(sql, conn)

def exec_sql(sql: str) -> None:
    """Execute SQL that does not return a result set (DDL/DML)."""
    conn.executescript(sql)
    conn.commit()

print("SQLite database ready.")



# DDL Example: Creating Normalized Tables

We will model a tiny store. This is deliberately small, but it is not fake. The structure is the same structure you would see in real commerce systems, student registration systems, ticketing systems, and many operational databases.

We will create four tables:
- `customers` stores customer facts.
- `products` stores product facts.
- `orders` stores the fact that a customer placed an order.
- `order_items` stores which products are in which order.

Notice what is missing: we do not store product price in the `order_items` table. That is intentional. We are staying consistent with 2NF thinking, and we are keeping product facts in the product table.


In [None]:
exec_sql("""
-- Drop child tables before the tables they reference.
DROP TABLE IF EXISTS order_items;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;
DROP TABLE IF EXISTS customers;

CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE
);

CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    price REAL NOT NULL
);

CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date TEXT NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

CREATE TABLE order_items (
    order_id INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (order_id) REFERENCES orders(order_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);
""")
print("Tables created.")



## A Note About Keys (Primary Keys and Foreign Keys)

A primary key is the column (or set of columns) that uniquely identifies a row. It is how the database knows “this is that record.”

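A foreign key is a reference from one table to another. It is how the database represents relationships. When `orders.customer_id` references `customers.customer_id`, you are saying, “Every order must belong to a real customer.” One SQLite-specific caveat: by default, SQLite accepts foreign key declarations but checks them only on connections where `PRAGMA foreign_keys = ON` has been run. Without that setting, the declarations in this notebook document the relationships but do not block bad rows.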

In spreadsheets, relationships are informal and easy to break. In databases, relationships can be enforced. That enforcement is one reason structured data is powerful.
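Here is a minimal sketch of enforcement in action. It uses a separate scratch connection (`fk_db`, a made-up name) with trimmed-down `customers` and `orders` tables, and it turns on SQLite's foreign key checking explicitly with `PRAGMA foreign_keys = ON`.

```python
import sqlite3

# Scratch connection, separate from the lecture's main database.
fk_db = sqlite3.connect(":memory:")
fk_db.execute("PRAGMA foreign_keys = ON")  # enforcement is a per-connection setting
fk_db.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
INSERT INTO customers VALUES (1, 'Alice Johnson');
""")

# An order for a real customer is accepted.
fk_db.execute("INSERT INTO orders VALUES (10, 1)")

# An order for customer 999, who does not exist, is rejected.
try:
    fk_db.execute("INSERT INTO orders VALUES (11, 999)")
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("Rejected:", err)

order_count = fk_db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, order_count)  # True 1
```

A spreadsheet would have accepted that second row without a word. The database refuses to store an order that points at nobody.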



# DML Example: Inserting Small, Realistic Data

We will insert two customers and a few products. Then we will insert orders and order items. The purpose here is not volume; it is to create enough structure that joins feel meaningful.


In [None]:
exec_sql("""
INSERT INTO customers (customer_id, name, email) VALUES
(1, 'Alice Johnson', 'alice@example.com'),
(2, 'Bob Smith', 'bob@example.com');

INSERT INTO products (product_id, name, price) VALUES
(101, 'Keyboard', 50.00),
(102, 'Mouse', 20.00),
(103, 'Monitor', 200.00),
(104, 'HDMI Cable', 12.00);

INSERT INTO orders (order_id, customer_id, order_date) VALUES
(1, 1, '2026-01-20'),
(2, 2, '2026-01-21'),
(3, 1, '2026-01-22');

INSERT INTO order_items (order_id, product_id, quantity) VALUES
(1, 101, 1),
(1, 102, 1),
(2, 101, 2),
(3, 103, 1),
(3, 104, 2);
""")
print("Sample data inserted.")



# Querying Data: SELECT Is Asking a Question

A `SELECT` statement is you asking the database a question. The database answers with a table.

This is a useful mindset: every `SELECT` returns a table, even if it is a one-row table. When you chain SQL ideas, you are basically building tables from tables.


In [None]:
q("SELECT * FROM customers;")

In [None]:
q("SELECT * FROM products;")


## Filtering Rows: WHERE

A `WHERE` clause is a filter. If you have ever filtered a spreadsheet to show only rows where a value is greater than something, you already understand `WHERE`.

The difference is that SQL filtering is precise and repeatable. If you define “expensive products” as those with a price above 30, then that definition is embedded in the query and can be reused.


In [None]:
q("SELECT product_id, name, price FROM products WHERE price > 30;")
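If you want to reuse that definition with different thresholds, one common approach is a parameterized query, where `?` is a placeholder that the `sqlite3` module fills in safely. The sketch below is self-contained: `where_db` and `products_above` are hypothetical names, and the product rows are copied from the lecture's sample data.

```python
import sqlite3

# Scratch copy of the products table from the lecture's sample data.
where_db = sqlite3.connect(":memory:")
where_db.executescript("""
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    price      REAL NOT NULL
);
INSERT INTO products VALUES
    (101, 'Keyboard', 50.00),
    (102, 'Mouse', 20.00),
    (103, 'Monitor', 200.00),
    (104, 'HDMI Cable', 12.00);
""")

def products_above(min_price):
    """Return (name, price) rows priced strictly above min_price."""
    return where_db.execute(
        "SELECT name, price FROM products WHERE price > ? ORDER BY product_id",
        (min_price,),
    ).fetchall()

print(products_above(30))   # [('Keyboard', 50.0), ('Monitor', 200.0)]
print(products_above(100))  # [('Monitor', 200.0)]
```

The filter logic is written once, and only the threshold changes between calls.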


# JOIN: Reconstructing Meaning From Normalized Tables

Normalization splits the world into separate tables. JOIN is how you put the story back together.

This is the payoff. If you do normalization well, JOINs feel natural. If your schema is messy, JOINs become painful and unreliable.

We are going to answer a real question: “What did each customer buy?”

To do that, we need to connect:
- customers to orders (who placed what)
- orders to order_items (what is inside each order)
- order_items to products (what those items are and what they cost)


In [None]:
q("""
SELECT 
    c.name AS customer,
    o.order_id,
    o.order_date,
    p.name AS product,
    oi.quantity,
    p.price,
    (oi.quantity * p.price) AS line_total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
ORDER BY o.order_id, p.product_id;
""")
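The same joins can answer the question from the start of the lecture: “How much did we earn from keyboards last month?” The sketch below rebuilds a trimmed copy of the schema on a scratch connection (`report_db`, a made-up name) and adds `SUM`, an aggregate function this lecture has not covered yet, to total the line items. The date range assumes “last month” means January 2026, matching the sample data.

```python
import sqlite3

# Scratch copy of the lecture's sample data (customers omitted for brevity).
report_db = sqlite3.connect(":memory:")
report_db.executescript("""
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    price      REAL NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date  TEXT NOT NULL
);
CREATE TABLE order_items (
    order_id   INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity   INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO products VALUES
    (101, 'Keyboard', 50.00), (102, 'Mouse', 20.00),
    (103, 'Monitor', 200.00), (104, 'HDMI Cable', 12.00);
INSERT INTO orders VALUES
    (1, 1, '2026-01-20'), (2, 2, '2026-01-21'), (3, 1, '2026-01-22');
INSERT INTO order_items VALUES
    (1, 101, 1), (1, 102, 1), (2, 101, 2), (3, 103, 1), (3, 104, 2);
""")

# Keyboard revenue for January 2026: quantity times price, summed over line items.
keyboard_revenue = report_db.execute("""
    SELECT SUM(oi.quantity * p.price)
    FROM order_items oi
    JOIN products p ON p.product_id = oi.product_id
    JOIN orders o   ON o.order_id = oi.order_id
    WHERE p.name = 'Keyboard'
      AND o.order_date BETWEEN '2026-01-01' AND '2026-01-31'
""").fetchone()[0]
print(keyboard_revenue)  # 150.0
```

Compare this with the spreadsheet version, where keyboards lived inside a comma-separated cell. Here the answer is one filter and one sum.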


## A Real-World Reflection: Why This Design Scales

In a real system, there might be millions of orders and thousands of products. If product prices change, you want one place to update the current price. If a customer changes their email, you want one place to update it. If you accidentally let multiple “Alices” exist with slightly different emails, you will miscount customers and your marketing spend becomes inefficient.

Normalization keeps the facts clean, and joins let you combine them when you need them.

There is a real tradeoff here: sometimes systems store historical price at the time of purchase because that is the reality of accounting. When you do that, you still store it intentionally, and you name the column honestly (for example, `unit_price_at_purchase`). The difference between good design and bad design is not “never duplicate,” it is “duplicate with meaning and control.”
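A minimal sketch of that intentional duplication follows. It assumes an `order_items` variant with an extra `unit_price_at_purchase` column, which is not part of the schema we built in this lecture, and it runs on a scratch connection (`hist_db`).

```python
import sqlite3

# Scratch database; unit_price_at_purchase is a deliberate snapshot column.
hist_db = sqlite3.connect(":memory:")
hist_db.executescript("""
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    price      REAL NOT NULL
);
CREATE TABLE order_items (
    order_id               INTEGER NOT NULL,
    product_id             INTEGER NOT NULL REFERENCES products(product_id),
    quantity               INTEGER NOT NULL DEFAULT 1,
    unit_price_at_purchase REAL NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO products VALUES (101, 'Keyboard', 50.00);
INSERT INTO order_items VALUES (2, 101, 2, 50.00);
""")

# The current price goes up after the order was placed.
hist_db.execute("UPDATE products SET price = 55.00 WHERE product_id = 101")

# What the customer was billed versus what the order would cost today.
billed_total, repriced_total = hist_db.execute("""
    SELECT SUM(oi.quantity * oi.unit_price_at_purchase),
           SUM(oi.quantity * p.price)
    FROM order_items oi
    JOIN products p ON p.product_id = oi.product_id
    WHERE oi.order_id = 2
""").fetchone()
print(billed_total, repriced_total)  # 100.0 110.0
```

Both numbers are correct answers to different questions. The honest column name is what keeps someone from confusing them.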



# Mini Checkpoint: You Should Be Able to Explain These Four Sentences

At this point, you should be able to say what each table represents, in plain English.

Each row in `customers` represents one customer.

Each row in `products` represents one product.

Each row in `orders` represents one order placed by one customer.

Each row in `order_items` represents one product included in one order, including quantity.

If you can say those sentences without hesitation, SQL becomes dramatically easier.



# Final Takeaways

SQL exists because people need consistent truths in shared data.

Structured data is valuable because it makes analysis reliable and operations safer.

Normalization is how you design tables so each fact lives where it belongs.

DDL creates the structure, DML fills it with facts, and `SELECT` is how you ask questions.

Most importantly, JOIN is not a trick. JOIN is the natural result of a clean design.
