![Digital Futures](https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/datascience-notebook-header.png?raw=true)

## Learner Stories

```txt
As a DATA PROFESSIONAL,  
I want to be able to use indexes strategically,  
so that I can improve the performance of frequently used queries

As a DATA PROFESSIONAL,  
I want to be able to apply indexing strategies,  
so that I can speed up query execution without sacrificing data integrity

As a DATA PROFESSIONAL,  
I want to be able to analyse query execution plans using EXPLAIN,  
so that I can optimise query performance and reduce execution time
```

# What are Indexes?

Indexes are a way to optimise the performance of a database by reducing the number of disk accesses required when a query is processed. They are a data structure that allows for faster retrieval of data from a table.

Indexes are created on columns in a table. When you create an index on a column, the database creates a separate data structure that holds the values of that column and pointers to the rows in the table that contain those values.
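
As a rough mental model, the idea of "values plus pointers to rows" can be sketched in plain Python. This is only an illustration under simplifying assumptions: the `table`, `index`, `find_by_scan` and `find_by_index` names are made up for this sketch, and real databases typically use B-tree structures rather than a sorted list.

```python
import bisect

# A tiny "table": each row is (employee_id, name).
# The position of a row in the list plays the role of the row pointer.
table = [
    (42, 'Alice'),
    (7, 'Bob'),
    (19, 'Charlie'),
    (3, 'David'),
]

# The "index": sorted (value, row_position) pairs for the employee_id column
index = sorted((row[0], pos) for pos, row in enumerate(table))
index_keys = [key for key, _ in index]


def find_by_scan(employee_id):
    """Full table scan: inspect every row until a match is found."""
    for row in table:
        if row[0] == employee_id:
            return row
    return None


def find_by_index(employee_id):
    """Index lookup: binary-search the sorted keys, then follow the pointer."""
    i = bisect.bisect_left(index_keys, employee_id)
    if i < len(index_keys) and index_keys[i] == employee_id:
        _, pos = index[i]
        return table[pos]
    return None


print(find_by_scan(19))
print(find_by_index(19))
```

Both functions return the same row; the difference is that the scan may touch every row, while the lookup only needs a handful of comparisons on the sorted keys.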


---

## Why Use Indexes?

When you query a table using a column that has an index, the database can use the index to quickly find the rows that match the query criteria. This can significantly speed up the query execution time.

Indexes are particularly useful for columns that are frequently used in queries, such as columns that are used in WHERE clauses or JOIN conditions.

Indexes are useful for:

- Speeding up query execution
- Reducing the number of disk accesses required to retrieve data
- Improving the performance of frequently used queries
- Enforcing uniqueness constraints (a unique index prevents duplicate values in a column or combination of columns)
- Supporting efficient sorting and joining, as well as filtering

However, indexes should be used judiciously, because they also carry costs:

- Slowing down data modification operations (INSERT, UPDATE, DELETE)
- Consuming additional disk space
- Increasing the time required to build and maintain the index
- Reducing the performance of queries that do not use the index
- Increasing the complexity of the database schema
- Increasing the risk of deadlocks and other concurrency issues
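
The write overhead can be observed with a small experiment. The sketch below is an assumption-laden illustration rather than a benchmark: the table `t`, the index names and the `time_inserts` helper are invented for this example, and the timings will vary from machine to machine.

```python
import sqlite3
import time


def time_inserts(with_indexes, n_rows=50_000):
    """Insert n_rows into a fresh in-memory table; return (seconds, row_count)."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE t (id INTEGER, department_id INTEGER, salary INTEGER)')
    if with_indexes:
        # Every insert now also has to update these two index structures
        conn.execute('CREATE INDEX idx_t_id ON t(id)')
        conn.execute('CREATE INDEX idx_t_department_salary ON t(department_id, salary)')

    rows = [(i, i % 10, 50_000 + (i * 7919) % 100_000) for i in range(n_rows)]

    start = time.perf_counter()
    conn.executemany('INSERT INTO t VALUES (?, ?, ?)', rows)
    conn.commit()
    elapsed = time.perf_counter() - start

    row_count = conn.execute('SELECT COUNT(*) FROM t').fetchone()[0]
    conn.close()
    return elapsed, row_count


seconds_plain, count_plain = time_inserts(with_indexes=False)
seconds_indexed, count_indexed = time_inserts(with_indexes=True)

print(f'No indexes:  {seconds_plain * 1000:.1f} ms for {count_plain} rows')
print(f'Two indexes: {seconds_indexed * 1000:.1f} ms for {count_indexed} rows')
```

On most machines the indexed run should take noticeably longer, which is the trade-off to weigh against faster reads.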

---

## Use Cases for Indexes

Here are some common use cases for indexes:

- ***Primary Key Index***:
  - A primary key index is used to enforce the uniqueness of the primary key column in a table. It is automatically created when you define a primary key constraint on a column.
- ***Unique Index***:
  - A unique index is used to enforce the uniqueness of a column or a combination of columns in a table. It is automatically created when you define a unique constraint on a column or a combination of columns.
- ***Foreign Key Index***:
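  - A foreign key index is an index on the column that references another table. It speeds up joins between the two tables and the referential integrity checks that run when parent rows change. Some databases (for example MySQL with InnoDB) create it automatically when you define a foreign key constraint; others, including PostgreSQL and SQLite, do not, so you have to create it yourself.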
- ***Composite Index***:
  - A composite index is used to index multiple columns in a table. It is useful when you frequently query on a combination of columns.
- ***Clustered Index***:
  - A clustered index is used to physically order the rows in a table based on the values of the indexed column. It is useful when you frequently query on the indexed column.
- ***Non-Clustered Index***:
  - A non-clustered index is used to create a separate data structure that holds the values of the indexed column and pointers to the rows in the table. It is useful when you frequently query on the indexed column but do not need to physically order the rows in the table.
- ***Covering Index***:
  - A covering index contains every column that a query needs, so the database can answer the query from the index alone without reading the table rows. It is useful for frequently run queries that read a small, fixed set of columns.
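
To make the composite and covering cases more concrete, here is a minimal sketch in SQLite. It assumes an `employees` table and an `idx_department_salary` index like the ones used in the demos; the `plan_details` helper is a made-up name for this example, and the exact plan wording may differ between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name VARCHAR(50),
    department_id INT,
    salary INT
)
''')
conn.execute('CREATE INDEX idx_department_salary ON employees(department_id, salary)')


def plan_details(query):
    """Return the human-readable detail column of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute(f'EXPLAIN QUERY PLAN {query}').fetchall()
    return [row[3] for row in rows]


# Only indexed columns are requested, so the index alone can answer the query
covered = plan_details(
    'SELECT department_id, salary FROM employees WHERE department_id = 2'
)

# name is not part of the index, so each match also needs a table lookup
not_covered = plan_details(
    'SELECT name, salary FROM employees WHERE department_id = 2'
)

print(covered)
print(not_covered)
conn.close()
```

In the first plan SQLite should report a covering index; in the second it should still use the index to find the rows but then visit the table for `name`.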

---

## Demo Set Up

To see some examples in action, we'll create an in-memory SQLite database and populate it with some sample data.

In [1]:
import sqlite3
import pandas as pd # type: ignore
import time  # Used to measure how long queries to the database take to run

In [2]:
# Create an in-memory SQLite database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

In [3]:
# Create the employees table
cursor.execute('''
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name VARCHAR(50),
    department_id INT,
    salary INT
);
''')

<sqlite3.Cursor at 0x208ff3f19c0>

In [4]:
# Define the sample data
employees = [
    (1, 'Alice', 1, 70000),
    (2, 'Bob', 1, 80000),
    (3, 'Charlie', 1, 70000),
    (4, 'David', 2, 90000),
    (5, 'Eve', 2, 85000),
    (6, 'Frank', 2, 90000)
]

# Insert the sample data into the employees table; the ? placeholders make this a parameterised query
cursor.executemany('INSERT INTO employees (employee_id, name, department_id, salary) VALUES (?, ?, ?, ?)', employees)

<sqlite3.Cursor at 0x208ff3f19c0>

---

## Demo 1 - Create Indexes

We're going to create two indexes on the `employees` table: one on the `employee_id` column and a composite index on `department_id` and `salary`.

In [5]:
# Create a query for creating an index on employee_id
create_employee_id_index = '''
CREATE INDEX idx_employee_id ON employees(employee_id);
'''

# Execute the query
cursor.execute(create_employee_id_index)

<sqlite3.Cursor at 0x208ff3f19c0>

`employee_id` is the ***primary key*** of the table, so it already has an index on it.  Therefore, strictly speaking, we don't need to explicitly create an index on it.
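
One way to check this in SQLite is to list the indexes on the table before and after the explicit `CREATE INDEX`. This is a minimal sketch that rebuilds the table in a fresh connection; it relies on the column being declared `INT PRIMARY KEY`, for which SQLite creates an automatic index.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name VARCHAR(50),
    department_id INT,
    salary INT
)
''')

# PRAGMA index_list returns one row per index: (seq, name, unique, origin, partial)
indexes_before = conn.execute("PRAGMA index_list('employees')").fetchall()

conn.execute('CREATE INDEX idx_employee_id ON employees(employee_id)')
indexes_after = conn.execute("PRAGMA index_list('employees')").fetchall()

print(indexes_before)
print(indexes_after)
conn.close()
```

The first list should already contain an automatically named index for the primary key, and the second should contain that index plus `idx_employee_id`, which is why the explicit index is redundant here.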

In [6]:
# Create a query for creating a composite index on department_id and salary
create_department_id_salary_index = '''
CREATE INDEX idx_department_salary ON employees(department_id, salary);
'''

# Execute the query
cursor.execute(create_department_id_salary_index)

<sqlite3.Cursor at 0x208ff3f19c0>

### Explanation

- **Primary Key Index**: The `employee_id` column is the primary key, so it is automatically indexed.
- **Composite Index**: The composite index on `department_id` and `salary` helps speed up queries that filter or sort by these columns.
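
Column order matters in a composite index. The sketch below compares the plan for a query that filters on the leading column (`department_id`) with one that filters only on the second column (`salary`). It rebuilds the table in a fresh connection, `plan_text` is a hypothetical helper name, and the plan wording may vary between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name VARCHAR(50),
    department_id INT,
    salary INT
)
''')
conn.execute('CREATE INDEX idx_department_salary ON employees(department_id, salary)')


def plan_text(query):
    """Join the detail column of every EXPLAIN QUERY PLAN row into one string."""
    rows = conn.execute(f'EXPLAIN QUERY PLAN {query}').fetchall()
    return ' | '.join(row[3] for row in rows)


# Filters on the leading column of the index, then sorts by the second column
leading_column = plan_text(
    'SELECT * FROM employees WHERE department_id = 2 ORDER BY salary DESC'
)

# Filters only on the second column of the index
second_column_only = plan_text('SELECT * FROM employees WHERE salary = 90000')

print(leading_column)
print(second_column_only)
conn.close()
```

The first query should be able to search the composite index, whereas the second is expected to fall back to scanning the table, because the index is ordered by `department_id` first.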

---

### Example Queries Using Indexes

#### Query 1: Find an employee by `employee_id`

```sql
SELECT * FROM employees WHERE employee_id = 2;
```

In [7]:
find_employee_2 = '''
SELECT * FROM employees WHERE employee_id = 2;
'''

# Execute the query and put the results into a DataFrame
employee_2_df = pd.read_sql_query(find_employee_2, conn)

employee_2_df

Unnamed: 0,employee_id,name,department_id,salary
0,2,Bob,1,80000


#### Query 2: Find employees in a specific department, ordered by salary

```sql
SELECT * FROM employees WHERE department_id = 2 ORDER BY salary DESC;
```

In [8]:
find_in_specific_department = '''
SELECT * FROM employees WHERE department_id = 2 ORDER BY salary DESC;
'''

# Execute the query and put the results into a DataFrame
dept_2_sal_desc_df = pd.read_sql_query(find_in_specific_department, conn)

dept_2_sal_desc_df

Unnamed: 0,employee_id,name,department_id,salary
0,6,Frank,2,90000
1,4,David,2,90000
2,5,Eve,2,85000


In [9]:
# Close the connection
conn.close()

Both queries can use the indexes we created, although on a table of only six rows the difference is far too small to notice.

---

## Was it worth it?

> Is it the case that you would only see the benefits of indexes when working with larger datasets?
>
> "YES!"

The benefits of indexes become more apparent with larger datasets. Here's why:

- ***Small Datasets***
  - **Full Table Scans**: For small datasets, the database can quickly perform full table scans to retrieve the required data. The overhead of maintaining and using indexes might not provide significant performance improvements.
  - **Minimal Performance Gain**: The time saved by using an index on a small dataset is often negligible compared to the time taken to perform a full table scan.
- ***Large Datasets***
  - **Efficient Data Retrieval**: Indexes significantly speed up data retrieval by allowing the database to quickly locate the rows that match the query criteria without scanning the entire table.
  - **Reduced I/O Operations**: Indexes reduce the number of I/O operations needed to fetch data from disk, which is crucial for large datasets where disk I/O can be a bottleneck.
  - **Improved Query Performance**: Queries that filter, sort, or join large datasets benefit greatly from indexes, resulting in faster response times and more efficient use of resources.

### Example

Consider a table with millions of rows. Without an **index**, a query that searches for a specific `employee_id` would require scanning all rows, which is time-consuming. With an **index** on `employee_id`, the database can quickly locate the matching row using the **index**, significantly reducing the query time.

---

## Demo 2 - Comparing Query Performance with and without Indexes

### Create a Table with a Large Dataset

In [10]:
# Create an in-memory SQLite database

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

In [11]:
# Execute the query to create employees table
cursor.execute('''
CREATE TABLE large_employees (
    employee_id INT,
    name VARCHAR(50),
    department_id INT,
    salary INT
);
''')

<sqlite3.Cursor at 0x208ff3f1ac0>

In [12]:
# Insert a large number of rows into the employees table
cursor.execute('''
WITH RECURSIVE generate_series AS (
    SELECT 1 AS i
    UNION ALL
    SELECT i + 1 FROM generate_series WHERE i < 1000000
)
INSERT INTO large_employees (employee_id, name, department_id, salary)
SELECT i, 'Employee' || i, (i % 10) + 1, (i % 100000) + 50000
FROM generate_series;
''')

<sqlite3.Cursor at 0x208ff3f1ac0>

Because we are using SQLite, this code uses a ***recursive CTE*** to generate a series of numbers from `1` to `1,000,000` and then inserts them into the `large_employees` table.

In databases that provide a built-in `generate_series` function, such as **PostgreSQL**, we could have used:

```sql
-- Insert a large number of rows
INSERT INTO large_employees (employee_id, name, department_id, salary)
SELECT i, 'Employee' || i, (i % 10) + 1, (i % 100000) + 50000
FROM generate_series(1, 1000000) AS s(i);
```
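
The same rows could also be generated on the Python side and inserted with `executemany`. This is a sketch of that alternative, not the approach the demo uses; `N_ROWS` is kept small here so it runs quickly, whereas the demo inserts 1,000,000 rows.

```python
import sqlite3

N_ROWS = 1_000  # Assumption for this sketch; the demo uses 1,000,000

conn = sqlite3.connect(':memory:')
conn.execute('''
CREATE TABLE large_employees (
    employee_id INT,
    name VARCHAR(50),
    department_id INT,
    salary INT
)
''')

# Generator producing the same values as the SELECT list of the recursive CTE
rows = (
    (i, f'Employee{i}', (i % 10) + 1, (i % 100000) + 50000)
    for i in range(1, N_ROWS + 1)
)
conn.executemany(
    'INSERT INTO large_employees (employee_id, name, department_id, salary) '
    'VALUES (?, ?, ?, ?)',
    rows,
)
conn.commit()

row_count = conn.execute('SELECT COUNT(*) FROM large_employees').fetchone()[0]
last_row = conn.execute(
    'SELECT * FROM large_employees WHERE employee_id = ?', (N_ROWS,)
).fetchone()
print(row_count, last_row)
conn.close()
```

Generating the rows inside SQL, as the demo does, avoids passing each row from Python to SQLite, so it is usually the quicker option for very large row counts.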

### Demo 2.1 - Query Performance without Indexes

Here we ask SQLite for the plan it would use to look up a row in the `large_employees` table by `employee_id` when no index exists.

> ***NOTE***: The `SELECT` query is not actually executed - we are just looking at the query plan

In [13]:
cursor.execute('''
EXPLAIN QUERY PLAN
SELECT * FROM large_employees WHERE employee_id = 500000;
''')

# Fetch the results
query_plan = cursor.fetchall()

# Print the results
for row in query_plan:
    print(row)


(2, 0, 0, 'SCAN large_employees')


The output of this query is:

```txt
(2, 0, 0, 'SCAN large_employees')
```

Each row of `EXPLAIN QUERY PLAN` output describes one step of the plan. In current versions of **SQLite** the four values are typically:

1. ***id (2)***:
   - An identifier for this step of the query plan. The actual number is not significant and can differ between runs and SQLite versions.

2. ***parent (0)***:
   - The id of the step that this step is nested inside. A value of `0` means it is a top-level step. In plans with several rows, this is what lets you reconstruct the tree of steps.

3. ***notused (0)***:
   - A column with no documented meaning. Its value may vary between SQLite versions and can be ignored.

4. ***detail ('SCAN large_employees')***:
   - A textual description of the operation being performed. In this case, it indicates that a full table scan is being performed on the `large_employees` table.

In summary, the plan row `(2, 0, 0, 'SCAN large_employees')` indicates that **SQLite** will perform a full table scan on the `large_employees` table to execute the query. The output format is intended for interactive debugging and may change between SQLite releases, so the `detail` text is the part to focus on.

`EXPLAIN QUERY PLAN` is specific to SQLite. In other databases, such as **PostgreSQL**, we could have used `EXPLAIN ANALYZE`, which runs the query and reports actual timings alongside the plan:

```sql
-- Query without an index
EXPLAIN ANALYZE
SELECT * FROM large_employees WHERE employee_id = 500000;
```

In [14]:
# Time the execution of the query
start_time = time.time()

cursor.execute('''
SELECT * FROM large_employees WHERE employee_id = 500000;
''')

# Fetch the results
results = cursor.fetchall()

# Stop the timer
end_time = time.time()
elapsed_time_no_index = end_time - start_time

# Print the results
for row in results:
    print(row)

(500000, 'Employee500000', 1, 50000)


---

### Demo 2.2 - Query Performance with Indexes

Here we will run the same query, filtering the `large_employees` table by `employee_id`, after creating an index on `employee_id`.

In [15]:
# Add index on employee_id
cursor.execute('''
CREATE INDEX idx_employee_id ON large_employees(employee_id);
''')
conn.commit()

In [16]:
cursor.execute('''
EXPLAIN QUERY PLAN
SELECT * FROM large_employees WHERE employee_id = 500000;
''')

# Fetch the results
query_plan = cursor.fetchall()


# Print the results
for row in query_plan:
    print(row)
    

(3, 0, 0, 'SEARCH large_employees USING INDEX idx_employee_id (employee_id=?)')


In [17]:
# Time the execution of the query
start_time = time.time()

cursor.execute('''
SELECT * FROM large_employees WHERE employee_id = 500000;
''')

# Fetch the results
results = cursor.fetchall()

# Stop the timer
end_time = time.time()
elapsed_time_with_index = end_time - start_time

# Print the results
for row in results:
    print(row)

(500000, 'Employee500000', 1, 50000)


In [18]:
conn.close()

---

## What was the Difference in Performance?

In [19]:
# Create a DataFrame for the stats
stats_df = pd.DataFrame({
    'Metric': ['No Index Query Time', 'With Index Query Time', 'Difference (ms)', 'Difference (%)', 'Times Faster'],
    'Value': [
        elapsed_time_no_index * 1000,
        elapsed_time_with_index * 1000,
        (elapsed_time_no_index - elapsed_time_with_index) * 1000,
        ((elapsed_time_no_index - elapsed_time_with_index) / elapsed_time_no_index) * 100,
        elapsed_time_no_index / elapsed_time_with_index
    ],
    'Unit': ['ms', 'ms', 'ms', '%', 'x']
})

stats_df

Unnamed: 0,Metric,Value,Unit
0,No Index Query Time,38.324833,ms
1,With Index Query Time,0.18096,ms
2,Difference (ms),38.143873,ms
3,Difference (%),99.527827,%
4,Times Faster,211.786561,x


## Why did the Query with Indexes Perform Better?

The query with indexes performed better because the database could use the index on `employee_id` to quickly locate the row that matched the query criteria. This is known as an **index seek** operation, where the database can directly access the row using the index without scanning the entire table.

In contrast, the query without indexes had to perform a **full table scan**, which involves reading every row in the table to find the matching row. This is a time-consuming operation, especially for large tables, as it requires reading and processing a large amount of data.

By creating an index on the `employee_id` column, we provided the database with a more efficient way to locate the row, resulting in faster query execution and improved performance.
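
A single timing can be noisy, so it can help to repeat the measurement and compare medians. The sketch below is one way to do that under stated assumptions: it uses a smaller two-column table (100,000 rows) than the demo, `median_query_ms` is a made-up helper name, and the absolute numbers depend on the machine.

```python
import sqlite3
import statistics
import time

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE large_employees (employee_id INT, name VARCHAR(50))')
conn.executemany(
    'INSERT INTO large_employees VALUES (?, ?)',
    ((i, f'Employee{i}') for i in range(1, 100_001)),
)
conn.commit()

QUERY = 'SELECT * FROM large_employees WHERE employee_id = 50000'


def median_query_ms(query, repeats=20):
    """Run the query several times and return the median wall-clock time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(query).fetchall()  # fetchall() forces the full result to be read
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


no_index_ms = median_query_ms(QUERY)
conn.execute('CREATE INDEX idx_employee_id ON large_employees(employee_id)')
with_index_ms = median_query_ms(QUERY)

print(f'Median without index: {no_index_ms:.3f} ms')
print(f'Median with index:    {with_index_ms:.3f} ms')
conn.close()
```

Using `time.perf_counter()` rather than `time.time()` gives a clock designed for measuring short intervals, and the median is less sensitive to one unusually slow run than a single measurement.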

---

## EXPLAIN and ANALYZE Queries

`EXPLAIN` is a command that shows the query plan that the database will use to execute a query. It provides information about how the database will access the data and perform the operations specified in the query.  It is a useful tool for understanding how the database processes queries and can help identify potential performance bottlenecks.

`ANALYZE` on its own has a different job: in databases such as PostgreSQL and SQLite it gathers statistics about the data in your tables, which the query planner uses to choose between plans. Combined with `EXPLAIN`, as `EXPLAIN ANALYZE` in PostgreSQL and recent versions of MySQL, it executes the query and reports the actual execution time, the number of rows processed, and other performance metrics. This is useful for comparing the estimated query plan with the actual query execution and identifying discrepancies.

By using `EXPLAIN` and `EXPLAIN ANALYZE` together, you can gain insights into how the database processes queries, identify areas for optimisation, and improve query performance.
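
In SQLite specifically, a standalone `ANALYZE` statement gathers statistics for the query planner rather than timing a query. The sketch below shows where those statistics end up; the small `employees` table and the `idx_department` index are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE employees (employee_id INT, department_id INT)')
conn.executemany(
    'INSERT INTO employees VALUES (?, ?)',
    ((i, i % 10) for i in range(1, 1001)),
)
conn.execute('CREATE INDEX idx_department ON employees(department_id)')

# Gather planner statistics for the tables and indexes in the database
conn.execute('ANALYZE')

# Each row is (table name, index name, statistics string)
stats = conn.execute('SELECT tbl, idx, stat FROM sqlite_stat1').fetchall()
print(stats)
conn.close()
```

The statistics string starts with the number of rows in the index, followed by an estimate of how many rows share each indexed value, which the planner can use when deciding whether an index is worth using.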

## `EXPLAIN` and `ANALYZE` Example

Let's assume we have the following tables:

```sql
--- Customers Table
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255),
    city VARCHAR(255)
);

--- Orders Table
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    order_date DATE,
    customer_id INT,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
```

We have inherited the following query, but our customers are telling us that it seems to take a long time to run:

```sql
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'New York'
GROUP BY c.name;
```

We have been asked to look into optimising this query to make it run more efficiently.

### Step 1 - Use `EXPLAIN` to Analyse the Query Plan

To understand the execution plan of this query, we can use the EXPLAIN command:

```sql
EXPLAIN
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'New York'
GROUP BY c.name;
```

The output shows the plan the database intends to use: which tables it reads, in what order, and whether it uses an index for each one. The result below is in the tabular format produced by MySQL.

#### RESULT

```txt
+----+-------------+-------+------------+------+---------------+-------------+---------+---------------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key         | key_len | ref           | rows   | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+-------------+---------+---------------+--------+----------+-------------+
|  1 | SIMPLE      | c     | NULL       | ALL  | NULL          | NULL        | NULL    | NULL          | 100000 | 10.00    | Using where |
|  1 | SIMPLE      | o     | NULL       | ref  | customer_id   | customer_id | 4       | c.customer_id | 10     | 100.00   | Using index |
+----+-------------+-------+------------+------+---------------+-------------+---------+---------------+--------+----------+-------------+
```

#### INTERPRETATION

- ***Table Scan***: The `customers` table is being scanned (type: `ALL`), which indicates a full table scan. This is inefficient, especially if the table is large.
- ***Join Operation***: The `orders` table is using an **index** on `customer_id` (`type: ref`), which is more efficient.

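### Step 2 - Use `EXPLAIN ANALYZE` to Execute the Query and Collect Statistics

To get more detailed information, we can use `EXPLAIN ANALYZE`, which actually runs the query and can report actual execution times alongside the plan (the result shown below lists only the plan steps):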

```sql
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'New York'
GROUP BY c.name;
```

#### RESULT

```txt
+----------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------+
| -> Aggregate: count(o.order_id)                                                                                                  |
|    -> Nested loop inner join                                                                                                     |
|       -> Filter: (c.city = 'New York')                                                                                           |
|          -> Table scan on c                                                                                                      |
|       -> Index lookup on o using customer_id (customer_id=c.customer_id)                                                         |
+----------------------------------------------------------------------------------------------------------------------------------+
```

#### INTERPRETATION

- ***Aggregate Operation***: The query is performing an aggregate operation (`count(o.order_id)`).
- ***Nested Loop Join***: The query is using a nested loop inner join to combine the `customers` and `orders` tables.
- ***Filter Operation***: The query is filtering the `customers` table by `city = 'New York'`.
- ***Table Scan***: The `customers` table is being scanned.
- ***Index Lookup***: The `orders` table is using an index lookup on `customer_id`.

### Step 3 - Optimise the Query

To optimise the query, we can add an index on the `city` column of the `customers` table to avoid the full table scan.

#### Add an Index on the City Column

```sql
CREATE INDEX idx_customers_city ON customers(city);
```

#### Optimised Query Execution Plan

After adding the index, let's run the `EXPLAIN` command again:

```sql
EXPLAIN
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'New York'
GROUP BY c.name;
```

#### RESULT

```txt
+----+-------------+-------+------------+------+--------------------+--------------------+---------+---------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys      | key                | key_len | ref           | rows | filtered | Extra       |
+----+-------------+-------+------------+------+--------------------+--------------------+---------+---------------+------+----------+-------------+
|  1 | SIMPLE      | c     | NULL       | ref  | idx_customers_city | idx_customers_city | 767     | const         | 100  |   100.00 | Using index |
|  1 | SIMPLE      | o     | NULL       | ref  | customer_id        | customer_id        | 4       | c.customer_id | 10   |   100.00 | Using index |
+----+-------------+-------+------------+------+--------------------+--------------------+---------+---------------+------+----------+-------------+
```

#### INTERPRETATION

- ***Index Lookup on `customers`***: The `customers` table is now accessed through `idx_customers_city` (`type: ref`), so only the rows matching `city = 'New York'` are read instead of the whole table. The estimated `rows` value drops from 100000 to 100.
- ***Index Lookup on `orders`***: The `orders` table is still using an index lookup on `customer_id`.
- ***Improved Performance***: With far fewer rows examined, the query should now run more efficiently.

### SUMMARY

By using `EXPLAIN` and `EXPLAIN ANALYZE`, we were able to analyse the query plan, identify the performance bottleneck, and optimise the query by adding an index on the `city` column. This should reduce the execution time by avoiding a full table scan.

- ***Initial Query***: The initial query performed a full table scan on the `customers` table, leading to poor performance.
- ***Optimised Query***: By adding an index on the `city` column, the query now uses an index lookup, so far fewer rows need to be read.
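
The same investigation can be tried in SQLite with `EXPLAIN QUERY PLAN`. This is a sketch under assumptions: it uses the two table definitions above with no data loaded, and it creates an index on `orders(customer_id)` explicitly (under the hypothetical name `idx_orders_customer_id`) because SQLite does not index foreign keys by itself. The plan wording will differ from the MySQL output shown above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255),
    city VARCHAR(255)
);
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    order_date DATE,
    customer_id INT,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
-- Assumption: created explicitly, as SQLite does not index foreign keys itself
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
''')

QUERY = '''
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'New York'
GROUP BY c.name
'''


def plan_details(query):
    """Return the detail column of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute(f'EXPLAIN QUERY PLAN {query}').fetchall()
    return [row[3] for row in rows]


plan_before = plan_details(QUERY)
conn.execute('CREATE INDEX idx_customers_city ON customers(city)')
plan_after = plan_details(QUERY)

print('Before:', plan_before)
print('After: ', plan_after)
conn.close()
```

After the index is created, the plan should mention `idx_customers_city`, which is the SQLite counterpart of the change from `type: ALL` to `type: ref` in the MySQL output.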

---

## Conclusion

***Indexes*** are crucial for optimising query performance, especially for large datasets. They allow the database to quickly locate and retrieve the required data, reducing the need for full table scans and improving overall efficiency.

---

## Activity

In this activity you will perform some queries on a table with a large dataset and compare the query performance with and without indexes.

---

## Activity Set Up

We'll create an in-memory SQLite database with the following tables:

1. `customers`: Stores customer information.
2. `products`: Stores product information.
3. `orders`: Stores order information.
4. `order_items`: Stores details of each item in an order.

We'll populate these tables with a large number of rows.

### 1. Create the Tables: Define the schema for the customers, products, orders, and order_items tables.

In [20]:
import sqlite3

# Create an in-memory SQLite database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Create the customers table
cursor.execute('''
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT
)
''')

# Create the products table
cursor.execute('''
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    price REAL
)
''')

# Create the orders table
cursor.execute('''
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    order_date TEXT,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
)
''')

# Create the order_items table
cursor.execute('''
CREATE TABLE order_items (
    order_item_id INTEGER PRIMARY KEY,
    order_id INTEGER,
    product_id INTEGER,
    quantity INTEGER,
    FOREIGN KEY (order_id) REFERENCES orders (order_id),
    FOREIGN KEY (product_id) REFERENCES products (product_id)
)
''')

<sqlite3.Cursor at 0x208ff3f1a40>

### 2. Populate the Tables: Insert a large number of rows into each table.

In [21]:
import random
from datetime import datetime, timedelta

# Insert data into customers
customers = [(i, f'Customer{i}', f'customer{i}@example.com') for i in range(1, 1001)]
cursor.executemany('INSERT INTO customers VALUES (?, ?, ?)', customers)

# Insert data into products
products = [(i, f'Product{i}', round(random.uniform(10.0, 100.0), 2)) for i in range(1, 101)]
cursor.executemany('INSERT INTO products VALUES (?, ?, ?)', products)

# Insert data into orders
start_date = datetime.now() - timedelta(days=365)
orders = [(i, random.randint(1, 1000), (start_date + timedelta(days=random.randint(0, 365))).strftime('%Y-%m-%d')) for i in range(1, 10001)]
cursor.executemany('INSERT INTO orders VALUES (?, ?, ?)', orders)

# Insert data into order_items
order_items = []
for i in range(1, 100001):
    order_id = random.randint(1, 10000)
    product_id = random.randint(1, 100)
    quantity = random.randint(1, 10)
    order_items.append((i, order_id, product_id, quantity))
cursor.executemany('INSERT INTO order_items VALUES (?, ?, ?, ?)', order_items)

conn.commit()

### 3. Define helper functions to collect the query metrics in a DataFrame

In [22]:
# Define a function to return the time to execute a query
import time

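def time_query(query):
    # perf_counter is a high-resolution clock intended for timing short intervals
    start_time = time.perf_counter()
    cursor.execute(query)
    cursor.fetchall()  # Read every row so that the whole query is actually executed
    end_time = time.perf_counter()
    return end_time - start_time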


In [23]:
# Define a DataFrame to store query metrics
query_metrics_columns = ['Query', 'Select ID', 'Order', 'From', 'Detail', 'Time Taken']

query_metrics = pd.DataFrame(columns=query_metrics_columns)


In [24]:
# Define a function to add a row to the query metrics DataFrame

def add_query_metrics(query, query_plan, time_taken):
    global query_metrics
    for query_plan_row in query_plan:
        new_row = pd.DataFrame([{
            'Query': query,
            'Select ID': query_plan_row[0],
            'Order': query_plan_row[1],
            'From': query_plan_row[2],
            'Detail': query_plan_row[3],
            'Time Taken': time_taken
        }])
        if not new_row.empty and not new_row.isna().all().all():
            query_metrics = pd.concat([query_metrics, new_row], ignore_index=True)

In [25]:
# Define a function to get the query_plan
def get_query_plan(query):
    cursor.execute(f'EXPLAIN QUERY PLAN {query}')
    query_plan = cursor.fetchall()
    
    # Check the rows in the query plan
    print(len(query_plan))

    for row in query_plan:
        print(row)
    return query_plan

In [26]:
def get_and_add_metrics(query_name, query):
    query_plan = get_query_plan(query)
    time_taken = time_query(query)
    add_query_metrics(query_name, query_plan, time_taken)

---

## Activity 1 - Query Performance without Indexes

### Activity 1.1

1. Write a query to ***find all orders for a specific customer within a date range***
2. Examine the query plan to see how the database plans to execute the query.
3. Determine the execution time of the query.
4. Add these metrics to the DataFrame for comparison.

In [48]:
# Define the SQL Query as a string
query_1_1 = '''
WITH customer_orders AS (
    SELECT o.order_id, o.order_date, o.customer_id
    FROM orders o
    WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31'
)
SELECT co.order_id, co.order_date, c.name, c.email
FROM customer_orders co
JOIN customers c ON co.customer_id = c.customer_id
WHERE c.customer_id = 1;
'''

In [49]:
df_1_1 = pd.read_sql(query_1_1, conn)
df_1_1

Unnamed: 0,order_id,order_date,name,email
0,1198,2024-02-21,Customer1,customer1@example.com
1,2211,2024-01-23,Customer1,customer1@example.com
2,2701,2024-06-23,Customer1,customer1@example.com
3,3040,2024-02-17,Customer1,customer1@example.com
4,4653,2024-07-28,Customer1,customer1@example.com
5,5054,2024-10-28,Customer1,customer1@example.com
6,5715,2024-02-22,Customer1,customer1@example.com
7,6360,2024-01-07,Customer1,customer1@example.com
8,6921,2024-10-24,Customer1,customer1@example.com
9,8334,2024-02-03,Customer1,customer1@example.com


In [29]:
get_and_add_metrics('Query 1.1', query_1_1)

query_metrics

2
(3, 0, 0, 'SEARCH c USING INTEGER PRIMARY KEY (rowid=?)')
(6, 0, 0, 'SCAN o')




Unnamed: 0,Query,Select ID,Order,From,Detail,Time Taken
0,Query 1.1,3,0,0,SEARCH c USING INTEGER PRIMARY KEY (rowid=?),0.001003
1,Query 1.1,6,0,0,SCAN o,0.001003


---

### Activity 1.2

1. Write a query to ***find the total quantity of each product sold***
2. Examine the query plan to see how the database plans to execute the query.
3. Determine the execution time of the query.
4. Add these metrics to the DataFrame for comparison.

In [30]:
# Define the SQL Query as a string
query_1_2 = '''
WITH product_quantities AS (
    SELECT oi.product_id, SUM(oi.quantity) AS total_quantity
    FROM order_items oi
    GROUP BY oi.product_id
)
SELECT p.name, pq.total_quantity
FROM products p
JOIN product_quantities pq ON p.product_id = pq.product_id
ORDER BY pq.total_quantity DESC;
'''

In [31]:
get_and_add_metrics('Query 1.2', query_1_2)
query_metrics

6
(2, 0, 0, 'CO-ROUTINE product_quantities')
(8, 2, 0, 'SCAN oi')
(10, 2, 0, 'USE TEMP B-TREE FOR GROUP BY')
(49, 0, 0, 'SCAN pq')
(52, 0, 0, 'SEARCH p USING INTEGER PRIMARY KEY (rowid=?)')
(60, 0, 0, 'USE TEMP B-TREE FOR ORDER BY')


Unnamed: 0,Query,Select ID,Order,From,Detail,Time Taken
0,Query 1.1,3,0,0,SEARCH c USING INTEGER PRIMARY KEY (rowid=?),0.001003
1,Query 1.1,6,0,0,SCAN o,0.001003
2,Query 1.2,2,0,0,CO-ROUTINE product_quantities,0.046331
3,Query 1.2,8,2,0,SCAN oi,0.046331
4,Query 1.2,10,2,0,USE TEMP B-TREE FOR GROUP BY,0.046331
5,Query 1.2,49,0,0,SCAN pq,0.046331
6,Query 1.2,52,0,0,SEARCH p USING INTEGER PRIMARY KEY (rowid=?),0.046331
7,Query 1.2,60,0,0,USE TEMP B-TREE FOR ORDER BY,0.046331


---

### Activity 1.3

1. Write a query to ***find the total sales amount for each customer***
2. Examine the query plan to see how the database plans to execute the query.
3. Determine the execution time of the query.
4. Add these metrics to the DataFrame for comparison.

In [32]:
# Define the SQL Query as a string
query_1_3 = '''
WITH customer_orders AS (
    SELECT c.customer_id, c.name, o.order_id
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
),
order_items_products AS (
    SELECT oi.order_id, oi.quantity, p.price
    FROM order_items oi
    JOIN products p ON oi.product_id = p.product_id
)
SELECT co.customer_id, co.name, SUM(oip.quantity * oip.price) AS total_sales
FROM customer_orders co
JOIN order_items_products oip ON co.order_id = oip.order_id
GROUP BY co.customer_id, co.name
ORDER BY total_sales DESC;
'''

In [33]:
get_and_add_metrics('Query 1.3', query_1_3)
query_metrics

6
(10, 0, 0, 'SCAN oi')
(12, 0, 0, 'SEARCH o USING INTEGER PRIMARY KEY (rowid=?)')
(15, 0, 0, 'SEARCH c USING INTEGER PRIMARY KEY (rowid=?)')
(18, 0, 0, 'SEARCH p USING INTEGER PRIMARY KEY (rowid=?)')
(21, 0, 0, 'USE TEMP B-TREE FOR GROUP BY')
(66, 0, 0, 'USE TEMP B-TREE FOR ORDER BY')


Unnamed: 0,Query,Select ID,Order,From,Detail,Time Taken
0,Query 1.1,3,0,0,SEARCH c USING INTEGER PRIMARY KEY (rowid=?),0.001003
1,Query 1.1,6,0,0,SCAN o,0.001003
2,Query 1.2,2,0,0,CO-ROUTINE product_quantities,0.046331
3,Query 1.2,8,2,0,SCAN oi,0.046331
4,Query 1.2,10,2,0,USE TEMP B-TREE FOR GROUP BY,0.046331
5,Query 1.2,49,0,0,SCAN pq,0.046331
6,Query 1.2,52,0,0,SEARCH p USING INTEGER PRIMARY KEY (rowid=?),0.046331
7,Query 1.2,60,0,0,USE TEMP B-TREE FOR ORDER BY,0.046331
8,Query 1.3,10,0,0,SCAN oi,0.115443
9,Query 1.3,12,0,0,SEARCH o USING INTEGER PRIMARY KEY (rowid=?),0.115443


---

### Activity 1.4

1. Write a query to ***find the most popular products in terms of the number of orders***
2. Examine the query plan to see how the database plans to execute the query.
3. Determine the execution time of the query.
4. Add these metrics to the DataFrame for comparison.

In [34]:
# Define the SQL Query as a string
query_1_4 = '''
WITH product_orders AS (
    SELECT oi.product_id, oi.order_id
    FROM order_items oi
)
SELECT p.product_id, p.name, COUNT(DISTINCT po.order_id) AS order_count
FROM products p
JOIN product_orders po ON p.product_id = po.product_id
GROUP BY p.product_id, p.name
ORDER BY order_count DESC;
'''

In [35]:
get_and_add_metrics('Query 1.4', query_1_4)
query_metrics

5
(8, 0, 0, 'SCAN oi')
(10, 0, 0, 'SEARCH p USING INTEGER PRIMARY KEY (rowid=?)')
(13, 0, 0, 'USE TEMP B-TREE FOR GROUP BY')
(55, 0, 0, 'USE TEMP B-TREE FOR count(DISTINCT)')
(58, 0, 0, 'USE TEMP B-TREE FOR ORDER BY')


Unnamed: 0,Query,Select ID,Order,From,Detail,Time Taken
0,Query 1.1,3,0,0,SEARCH c USING INTEGER PRIMARY KEY (rowid=?),0.001003
1,Query 1.1,6,0,0,SCAN o,0.001003
2,Query 1.2,2,0,0,CO-ROUTINE product_quantities,0.046331
3,Query 1.2,8,2,0,SCAN oi,0.046331
4,Query 1.2,10,2,0,USE TEMP B-TREE FOR GROUP BY,0.046331
5,Query 1.2,49,0,0,SCAN pq,0.046331
6,Query 1.2,52,0,0,SEARCH p USING INTEGER PRIMARY KEY (rowid=?),0.046331
7,Query 1.2,60,0,0,USE TEMP B-TREE FOR ORDER BY,0.046331
8,Query 1.3,10,0,0,SCAN oi,0.115443
9,Query 1.3,12,0,0,SEARCH o USING INTEGER PRIMARY KEY (rowid=?),0.115443


---

## Activity 2 - Query Performance with Indexes

### Activity 2.1

1. Create indexes on the columns that are frequently used in queries.
2. Rerun the query from Activity 1.1 with the indexes in place.

In [37]:
# Define the indexes to create as a list
indexes_to_create_2_1 = [
    '''CREATE INDEX idx_orders_customer_id ON orders(customer_id);''',
    # The plans above already search customers by its INTEGER PRIMARY KEY, so this index is likely redundant
    '''CREATE INDEX idx_customers_customer_id ON customers(customer_id);''',
    '''CREATE INDEX idx_orders_order_date ON orders(order_date);''',
]

# Add indexes to the tables
for index in indexes_to_create_2_1:
    cursor.execute(index)

In [38]:
get_and_add_metrics('Query 1.1 with Index', query_1_1)

query_metrics

2
(4, 0, 0, 'SEARCH c USING INTEGER PRIMARY KEY (rowid=?)')
(7, 0, 0, 'SEARCH o USING INDEX idx_orders_customer_id (customer_id=?)')


Unnamed: 0,Query,Select ID,Order,From,Detail,Time Taken
0,Query 1.1,3,0,0,SEARCH c USING INTEGER PRIMARY KEY (rowid=?),0.001003
1,Query 1.1,6,0,0,SCAN o,0.001003
2,Query 1.2,2,0,0,CO-ROUTINE product_quantities,0.046331
3,Query 1.2,8,2,0,SCAN oi,0.046331
4,Query 1.2,10,2,0,USE TEMP B-TREE FOR GROUP BY,0.046331
5,Query 1.2,49,0,0,SCAN pq,0.046331
6,Query 1.2,52,0,0,SEARCH p USING INTEGER PRIMARY KEY (rowid=?),0.046331
7,Query 1.2,60,0,0,USE TEMP B-TREE FOR ORDER BY,0.046331
8,Query 1.3,10,0,0,SCAN oi,0.115443
9,Query 1.3,12,0,0,SEARCH o USING INTEGER PRIMARY KEY (rowid=?),0.115443
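
The plan for Query 1.1 changes from `SCAN o` to `SEARCH o USING INDEX idx_orders_customer_id` once the index exists. The sketch below reproduces that change in isolation. It is a minimal illustration, assuming an in-memory database and a cut-down, hypothetical `orders` table rather than the notebook's real data. The names `demo_conn`, `plan_details` and `lookup` are made up for this example.

```python
import sqlite3

# Hypothetical, cut-down stand-in for the notebook's orders table
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
demo_conn.executemany(
    "INSERT INTO orders (customer_id, order_date) VALUES (?, ?)",
    [(i % 50, "2022-06-01") for i in range(1000)],
)


def plan_details(connection, sql):
    """Return the detail column of EXPLAIN QUERY PLAN for a statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


lookup = "SELECT order_id, order_date FROM orders WHERE customer_id = 1"

# Without an index on customer_id the planner has to scan the whole table
plan_before = plan_details(demo_conn, lookup)

demo_conn.execute("CREATE INDEX idx_orders_customer_id ON orders(customer_id)")

# With the index in place the planner can search instead of scan
plan_after = plan_details(demo_conn, lookup)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look for keywords such as `SCAN` and the index name than to compare whole strings.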


---

### Activity 2.2

1. Create any additional indexes on the columns that are frequently used in queries.
2. Rerun the query from Activity 1.2 with the indexes in place.

In [39]:
# Define the additional indexes to create as a list (i.e. those not defined previously)
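indexes_to_create_2_2 = [
    '''CREATE INDEX idx_order_items_product_id ON order_items(product_id);''',
    # The plans above already search products by its INTEGER PRIMARY KEY, so this index is likely redundant
    '''CREATE INDEX idx_products_product_id ON products(product_id);''',
]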

# Add indexes to the tables
for index in indexes_to_create_2_2:
    cursor.execute(index)

In [40]:
get_and_add_metrics('Query 1.2 with Index', query_1_2)
query_metrics

5
(2, 0, 0, 'CO-ROUTINE product_quantities')
(9, 2, 0, 'SCAN oi USING INDEX idx_order_items_product_id')
(42, 0, 0, 'SCAN pq')
(45, 0, 0, 'SEARCH p USING INTEGER PRIMARY KEY (rowid=?)')
(53, 0, 0, 'USE TEMP B-TREE FOR ORDER BY')


Unnamed: 0,Query,Select ID,Order,From,Detail,Time Taken
0,Query 1.1,3,0,0,SEARCH c USING INTEGER PRIMARY KEY (rowid=?),0.001003
1,Query 1.1,6,0,0,SCAN o,0.001003
2,Query 1.2,2,0,0,CO-ROUTINE product_quantities,0.046331
3,Query 1.2,8,2,0,SCAN oi,0.046331
4,Query 1.2,10,2,0,USE TEMP B-TREE FOR GROUP BY,0.046331
5,Query 1.2,49,0,0,SCAN pq,0.046331
6,Query 1.2,52,0,0,SEARCH p USING INTEGER PRIMARY KEY (rowid=?),0.046331
7,Query 1.2,60,0,0,USE TEMP B-TREE FOR ORDER BY,0.046331
8,Query 1.3,10,0,0,SCAN oi,0.115443
9,Query 1.3,12,0,0,SEARCH o USING INTEGER PRIMARY KEY (rowid=?),0.115443
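
In the plan for Query 1.2 with the index, `USE TEMP B-TREE FOR GROUP BY` has gone and `order_items` is read through `idx_order_items_product_id`. Reading the rows in index order means they arrive already grouped by `product_id`. The sketch below isolates that effect. It is a minimal illustration, assuming an in-memory database and a hypothetical, simplified `order_items` table. `group_conn`, `plan_details` and `grouped` are names invented for this example.

```python
import sqlite3

# Hypothetical, simplified stand-in for the notebook's order_items table
group_conn = sqlite3.connect(":memory:")
group_conn.execute(
    "CREATE TABLE order_items ("
    "order_item_id INTEGER PRIMARY KEY, order_id INTEGER, "
    "product_id INTEGER, quantity INTEGER)"
)
group_conn.executemany(
    "INSERT INTO order_items (order_id, product_id, quantity) VALUES (?, ?, ?)",
    [(i, i % 20, 1 + i % 3) for i in range(1000)],
)


def plan_details(connection, sql):
    """Return the detail column of EXPLAIN QUERY PLAN for a statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


grouped = (
    "SELECT product_id, SUM(quantity) AS total_quantity "
    "FROM order_items GROUP BY product_id"
)

# Without an index the rows must be sorted into groups using a temporary B-tree
group_plan_before = plan_details(group_conn, grouped)

group_conn.execute(
    "CREATE INDEX idx_order_items_product_id ON order_items(product_id)"
)

# With the index the rows can be read in product_id order, so no sort step is needed
group_plan_after = plan_details(group_conn, grouped)

print("before:", group_plan_before)
print("after: ", group_plan_after)
```

The query still visits every row of `order_items`, so any saving comes from skipping the sort rather than from reading fewer rows.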


---

### Activity 2.3

1. Create any additional indexes on the columns that are frequently used in queries.
2. Rerun the query from Activity 1.3 with the indexes in place.

In [41]:
# Define the additional indexes to create as a list (i.e. those not defined previously)
indexes_to_create_2_3 = ['''CREATE INDEX idx_order_items_order_id ON order_items(order_id);''']

# Add indexes to the tables
for index in indexes_to_create_2_3:
    cursor.execute(index)

In [42]:
get_and_add_metrics('Query 1.3 with Index', query_1_3)
query_metrics

6
(10, 0, 0, 'SCAN oi')
(12, 0, 0, 'SEARCH o USING INTEGER PRIMARY KEY (rowid=?)')
(15, 0, 0, 'SEARCH c USING INTEGER PRIMARY KEY (rowid=?)')
(18, 0, 0, 'SEARCH p USING INTEGER PRIMARY KEY (rowid=?)')
(21, 0, 0, 'USE TEMP B-TREE FOR GROUP BY')
(66, 0, 0, 'USE TEMP B-TREE FOR ORDER BY')


Unnamed: 0,Query,Select ID,Order,From,Detail,Time Taken
0,Query 1.1,3,0,0,SEARCH c USING INTEGER PRIMARY KEY (rowid=?),0.001003
1,Query 1.1,6,0,0,SCAN o,0.001003
2,Query 1.2,2,0,0,CO-ROUTINE product_quantities,0.046331
3,Query 1.2,8,2,0,SCAN oi,0.046331
4,Query 1.2,10,2,0,USE TEMP B-TREE FOR GROUP BY,0.046331
5,Query 1.2,49,0,0,SCAN pq,0.046331
6,Query 1.2,52,0,0,SEARCH p USING INTEGER PRIMARY KEY (rowid=?),0.046331
7,Query 1.2,60,0,0,USE TEMP B-TREE FOR ORDER BY,0.046331
8,Query 1.3,10,0,0,SCAN oi,0.115443
9,Query 1.3,12,0,0,SEARCH o USING INTEGER PRIMARY KEY (rowid=?),0.115443


---

### Activity 2.4

1. Check whether any columns used by the query still need an index.
2. Rerun the query from Activity 1.4 with the indexes in place.

In [None]:
# Define the additional indexes to create as a list (i.e. those not defined previously)
indexes_to_create_2_4 = []  # no additional indexes: order_items(product_id) and order_items(order_id) were indexed in Activities 2.2 and 2.3

# Add indexes to the tables
#for index in indexes_to_create_2_4:
#    cursor.execute(index)

In [None]:
get_and_add_metrics('Query 1.4 with Index', query_1_4)
query_metrics
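
Query 1.4 only joins and groups on `order_items(product_id)`, `order_items(order_id)` and the `products` primary key, so it is worth confirming which indexes already exist before adding more. A minimal sketch of one way to list them is shown here. It assumes an in-memory database with a hypothetical, reduced `order_items` table, and `list_indexes` and `check_conn` are names invented for this example.

```python
import sqlite3


def list_indexes(connection):
    """Return (table, index) pairs for every explicitly created index."""
    rows = connection.execute(
        "SELECT tbl_name, name FROM sqlite_master "
        "WHERE type = 'index' AND sql IS NOT NULL "  # skips automatic indexes
        "ORDER BY tbl_name, name"
    )
    return rows.fetchall()


# Hypothetical, reduced stand-in for the notebook's database
check_conn = sqlite3.connect(":memory:")
check_conn.execute(
    "CREATE TABLE order_items ("
    "order_item_id INTEGER PRIMARY KEY, order_id INTEGER, product_id INTEGER)"
)
check_conn.execute(
    "CREATE INDEX idx_order_items_product_id ON order_items(product_id)"
)
check_conn.execute(
    "CREATE INDEX idx_order_items_order_id ON order_items(order_id)"
)

existing_indexes = list_indexes(check_conn)
print(existing_indexes)
```

In the notebook, the same function could be called with whichever connection object the `cursor` was created from.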

---

## A bit of Pandas Fun!

Can you use your data wrangling skills to compare the performance of the queries with and without indexes?



In [None]:
# Start here!

---

## Solutions

### Query 1.1

***Query***:

```sql
WITH customer_orders AS (
    SELECT o.order_id, o.order_date, o.customer_id
    FROM orders o
    WHERE o.order_date BETWEEN '2022-01-01' AND '2022-12-31'
)
SELECT co.order_id, co.order_date, c.name, c.email
FROM customer_orders co
JOIN customers c ON co.customer_id = c.customer_id
WHERE c.customer_id = 1;
```

***Indexes Added***:

```sql
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_customers_customer_id ON customers(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);
```

---

### Query 1.2

***Query***:

```sql
WITH product_quantities AS (
    SELECT oi.product_id, SUM(oi.quantity) AS total_quantity
    FROM order_items oi
    GROUP BY oi.product_id
)
SELECT p.name, pq.total_quantity
FROM products p
JOIN product_quantities pq ON p.product_id = pq.product_id
ORDER BY pq.total_quantity DESC;
```

***Indexes Added***:

```sql
CREATE INDEX idx_order_items_product_id ON order_items(product_id);
CREATE INDEX idx_products_product_id ON products(product_id);
```

---

### Query 1.3

***Query***:

```sql
WITH customer_orders AS (
    SELECT c.customer_id, c.name, o.order_id
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
),
order_items_products AS (
    SELECT oi.order_id, oi.quantity, p.price
    FROM order_items oi
    JOIN products p ON oi.product_id = p.product_id
)
SELECT co.customer_id, co.name, SUM(oip.quantity * oip.price) AS total_sales
FROM customer_orders co
JOIN order_items_products oip ON co.order_id = oip.order_id
GROUP BY co.customer_id, co.name
ORDER BY total_sales DESC;
```

***Indexes Added***:

```sql
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
```

---

### Query 1.4

***Query***:

```sql
WITH product_orders AS (
    SELECT oi.product_id, oi.order_id
    FROM order_items oi
)
SELECT p.product_id, p.name, COUNT(DISTINCT po.order_id) AS order_count
FROM products p
JOIN product_orders po ON p.product_id = po.product_id
GROUP BY p.product_id, p.name
ORDER BY order_count DESC;
```

***Indexes Added***:

```sql
-- None: Query 1.4 relies on the order_items indexes already created for Queries 1.2 and 1.3
```

---
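
For the pandas comparison exercise, one possible approach is sketched here. It assumes the metrics DataFrame has the `Query` and `Time Taken` columns shown in the outputs, and that indexed runs are labelled with the ` with Index` suffix used in Activity 2. The function name `compare_timings` is hypothetical, and the numbers in `demo_metrics` are illustrative placeholders, not measurements.

```python
import pandas as pd


def compare_timings(metrics, suffix=" with Index"):
    """Pair each query's baseline time with its indexed time."""
    # Every plan row of a query repeats the same timing, so keep one per label
    times = metrics.groupby("Query")["Time Taken"].first()

    is_indexed = times.index.str.endswith(suffix)
    baseline = times[~is_indexed]
    indexed = times[is_indexed]
    # Strip the suffix so both series share the same query labels
    indexed.index = indexed.index.str[: -len(suffix)]

    comparison = pd.DataFrame(
        {"without_index": baseline, "with_index": indexed}
    ).dropna()  # drop queries that lack one of the two runs
    comparison["speedup"] = comparison["without_index"] / comparison["with_index"]
    return comparison.sort_values("speedup", ascending=False)


# Placeholder values for illustration only, NOT real timings
demo_metrics = pd.DataFrame(
    {
        "Query": [
            "Query 1.1",
            "Query 1.1",
            "Query 1.1 with Index",
            "Query 1.1 with Index",
        ],
        "Time Taken": [0.004, 0.004, 0.002, 0.002],
    }
)

demo_comparison = compare_timings(demo_metrics)
print(demo_comparison)
```

In the notebook, `compare_timings(query_metrics)` would give one row per query that has both runs. Single timings at this scale are noisy, so repeating each query several times and averaging would give a fairer comparison.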