![Digital Futures](https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/datascience-notebook-header.png?raw=true)

## Learner Stories

```txt
As a DATA PROFESSIONAL,  
I want to be able to avoid common performance pitfalls like SELECT * and inefficient joins,  
so that I can write performant and scalable SQL queries
```

## The Need for Performance Optimisation

We have previously looked at ***indexing*** as a strategy for improving the performance of SQL queries, where we reduced the execution time significantly by creating indexes on the columns that are frequently used in the WHERE clause of our queries.  But why do we care?  What else can we do to optimise our queries?

### Why Performance Optimisation?

1. Efficiency and Speed
    - *Reduced Latency:* Optimized queries execute faster, reducing the time it takes to process data. This is especially important in real-time or near-real-time data pipelines where timely data processing is critical.
    - *Resource Utilization:* Efficient queries use fewer computational resources (CPU, memory, I/O), allowing the system to handle more queries simultaneously and reducing the overall load on the database.
2. Scalability
    - *Handling Large Datasets:* As data volumes grow, poorly optimized queries can become bottlenecks, slowing down the entire pipeline. Optimized queries ensure that the pipeline can scale to handle increasing amounts of data without significant performance degradation.
    - *Cost Management:* In cloud environments, resource usage directly impacts costs. Efficient queries reduce the need for additional resources, helping to manage and reduce operational costs.
3. Reliability and Stability
    - *Avoiding Timeouts and Failures:* Inefficient queries can lead to timeouts, crashes, or failures, disrupting the data pipeline and causing data processing delays or loss.
    - *Consistent Performance:* Optimized queries provide consistent performance, ensuring that the data pipeline operates smoothly and predictably under varying loads.
4. Data Quality and Accuracy
    - *Timely Data Processing:* Fast and efficient queries ensure that data is processed and available for analysis or reporting in a timely manner, so reports reflect current data rather than stale snapshots.
    - *Minimized Errors:* Efficient queries reduce the likelihood of errors caused by resource exhaustion or timeouts, contributing to higher data quality.
5. User Experience
    - *Faster Insights:* For end-users and analysts, optimized queries mean faster access to insights and reports, improving decision-making and responsiveness.
    - *Improved Interactivity:* In interactive data applications, such as dashboards, efficient queries provide a smoother and more responsive user experience.
6. Maintainability and Future-Proofing
    - *Easier Troubleshooting:* Optimized queries are often more readable and maintainable, making it easier to troubleshoot and optimize further if needed.
    - *Adaptability:* Well-optimized queries are more adaptable to changes in data volume, schema, or infrastructure, ensuring the pipeline remains robust and efficient over time.

> Optimized, performant queries are essential in a data pipeline to ensure efficiency, scalability, reliability, data quality, user experience, and maintainability.  
> By focusing on query optimization, data professionals can build robust and scalable data pipelines that meet the demands of modern data processing and analysis.

---

# COMMON PERFORMANCE PITFALLS

## 1. Avoiding `SELECT *`

- **Problem**: Selecting all columns using `SELECT *` can lead to unnecessary data transfer and processing overhead.
- **Solution**: Explicitly list the columns you need in the `SELECT` statement.

#### Bad Practice

```sql
SELECT * 
FROM customers
WHERE customer_id = 1;
```

#### Why it should be avoided

- **Performance**: Fetching all columns can be inefficient, especially if the table has many columns and you only need a few.
- **Readability**: It is not clear which columns are being used in the query.
- **Maintenance**: If the table schema changes, the query might break or return unexpected results.

#### Better Practice

```sql
SELECT customer_id, name, email
FROM customers
WHERE customer_id = 1;
```

#### Benefits

- **Performance**: Only the necessary columns are fetched, reducing the amount of data transferred.
- **Readability**: It is clear which columns are being used.
- **Maintenance**: The query is less likely to break if the table schema changes.

#### What if all the columns seem to be needed?

Even if all columns are needed, it is generally recommended to avoid using `SELECT *` for several reasons:

- **Explicitness**: Specifying the columns explicitly makes the query more readable and understandable. It is clear which columns are being used.
- **Schema Changes**: If the table schema changes (e.g. columns are added or removed), using `SELECT *` can lead to unexpected results or break the application.
- **Performance**: While fetching all columns might seem necessary, it can still be inefficient if the table has many columns, especially if some of them are large (e.g., `BLOB`s or `TEXT` fields).
- **Linting and Best Practices**: Many SQL linting tools and best practice guidelines recommend against using `SELECT *` to encourage better coding practices.
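
The schema-change risk can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a benchmark: the in-memory database, the sample row, and the added `phone` column are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com')")

STAR = "SELECT * FROM customers WHERE customer_id = 1"
EXPLICIT = "SELECT customer_id, name, email FROM customers WHERE customer_id = 1"

# While the table has three columns, both forms return three values.
star_before = conn.execute(STAR).fetchone()
explicit_before = conn.execute(EXPLICIT).fetchone()

# A later schema change adds a column (hypothetical `phone`).
conn.execute("ALTER TABLE customers ADD COLUMN phone TEXT")

star_after = conn.execute(STAR).fetchone()
explicit_after = conn.execute(EXPLICIT).fetchone()

print(len(star_before), len(star_after))          # SELECT * grew with the table
print(len(explicit_before), len(explicit_after))  # the explicit list is unchanged
```

Any code that unpacks the `SELECT *` row into three variables would fail after the change, while the explicit query keeps returning the same shape.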

### Quick Activity 1

Trivial, but here it is:

Rewrite the following query to avoid using `SELECT *`:

```sql
SELECT *
FROM orders
WHERE order_id = 123;
```

The table has columns `order_id`, `customer_id`, `order_date`, and `order_status`.

```sql
-- Write SQL in this cell
SELECT order_id, customer_id, order_date, order_status
FROM orders
WHERE order_id = 123;
```

---

## 2. Inefficient Joins

- **Problem**: Inefficient join conditions can lead to slow query performance.
- **Solution**: Use appropriate join conditions and ensure that the joined columns are indexed.

#### Bad Practice

```sql
SELECT o.order_id, o.order_date, c.name, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id = 1;
```

#### Why it is inefficient

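- **Unnecessary Rows**: As written, the join is expressed over every customer and the filter is applied afterwards, so an engine that does not push the filter down may join rows it later discards.
- **Filtering After Join**: Filtering on `customer_id` after the join can be less efficient than filtering first. Many modern optimisers push the filter down automatically, so check the query plan before assuming a rewrite will help.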

#### Better Practice

```sql
WITH filtered_customers AS (
    SELECT customer_id, name, email
    FROM customers
    WHERE customer_id = 1
)
SELECT o.order_id, o.order_date, fc.name, fc.email
FROM orders o
JOIN filtered_customers fc ON o.customer_id = fc.customer_id;
```

#### Benefits

- **Efficiency**: Filtering the customers table before the join reduces the number of rows involved in the join.
- **Clarity**: The query is more readable and easier to understand.

### Quick Activity 2

Rewrite the following query to improve the join efficiency:

```sql
SELECT o.order_id, o.order_date, c.customer_id, c.name, SUM(oi.quantity * p.price) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE c.city = 'New York' AND p.category = 'Electronics'
GROUP BY o.order_id, o.order_date, c.customer_id, c.name;
```

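```sql
-- Write SQL in this cell
WITH ny_customers AS (
    SELECT customer_id, name
    FROM customers
    WHERE city = 'New York'
),
elecs AS (
    SELECT product_id, price
    FROM products
    WHERE category = 'Electronics'
)
SELECT o.order_id, o.order_date, nc.customer_id, nc.name, SUM(oi.quantity * e.price) AS total_amount
FROM orders o
JOIN ny_customers nc ON o.customer_id = nc.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN elecs e ON oi.product_id = e.product_id
GROUP BY o.order_id, o.order_date, nc.customer_id, nc.name;
```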

---

## 3. Lack of Indexes

- **Problem**: Missing indexes on columns used in `WHERE`, `JOIN`, or `ORDER BY` clauses can result in slow query execution.
- **Solution**: Create indexes on columns that are frequently used in filtering, joining, or sorting operations.

#### Bad Practice

```sql
SELECT order_id, order_date
FROM orders
WHERE customer_id = 1;
```

#### Why it is inefficient

- **Full Table Scan**: Without an index on `customer_id`, the database may need to scan the entire `orders` table to find matching rows.
- **Performance**: The query may be slow, especially for large tables.

#### Better Practice

```sql
CREATE INDEX idx_customer_id ON orders(customer_id);

SELECT order_id, order_date
FROM orders
WHERE customer_id = 1;
```

#### Benefits

- **Efficiency**: The index speeds up the query by allowing the database to quickly locate rows with the specified `customer_id`.
- **Performance**: The query executes faster, especially for large tables.
- **Scalability**: Indexes improve query performance as the data volume grows.
- **Resource Utilisation**: Indexes reduce the computational resources needed to process the query.
- **Cost Management**: Faster queries reduce the need for additional resources, helping to manage costs.
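
The effect of the index can be checked rather than assumed. The sketch below uses Python's built-in `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`; the in-memory database and the `plan` helper are assumptions for illustration, and other engines report plans in their own format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)

QUERY = "SELECT order_id, order_date FROM orders WHERE customer_id = 1"

def plan(sql):
    """Return SQLite's query plan for `sql` as one string (the detail text is column 3)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

plan_without_index = plan(QUERY)
conn.execute("CREATE INDEX idx_customer_id ON orders(customer_id)")
plan_with_index = plan(QUERY)

print(plan_without_index)  # a full scan of orders
print(plan_with_index)     # a search that uses idx_customer_id
```

The first plan is a scan of the whole table; after `CREATE INDEX` the planner switches to a search on `idx_customer_id`.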

### Quick Activity 3

Nothing here as we've already done this!

---

## 4. Suboptimal `WHERE` Clauses

- **Problem**: Inefficient `WHERE` clauses can prevent the query optimizer from using indexes effectively.
- **Solution**: Simplify `WHERE` clauses, avoid non-sargable expressions, and use indexed columns for filtering.

#### Bad Practice

```sql
SELECT order_id, order_date
FROM orders
WHERE YEAR(order_date) = 2022;
```

#### Why it is inefficient

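- **Non-Sargable**: A predicate is *sargable* ("Search ARGument ABLE") when the database can use an index to evaluate it. Wrapping `order_date` in `YEAR()` makes the predicate non-sargable, because the index stores the raw dates, not the computed years.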
- **Performance**: The query may be slow, especially for large tables.
- **Resource Usage**: The database may need to perform a full table scan to evaluate the function for each row.
- **Index Usage**: Even if an index exists on `order_date`, it may not be used effectively due to the function.
- **Scalability**: Inefficient queries can become bottlenecks as data volumes grow.

#### Better Practice

```sql
SELECT order_id, order_date
FROM orders
WHERE order_date >= '2022-01-01' AND order_date < '2023-01-01';
```

#### Benefits

- **Sargable**: The query is sargable, allowing the database to use indexes efficiently.
- **Performance**: The query executes faster, especially for large tables.
- **Resource Utilisation**: The database can process the query more efficiently.
- **Scalability**: Optimized queries can handle larger data volumes without significant performance degradation.
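
The difference between the two predicates shows up in the query plan. This is a sketch using Python's built-in `sqlite3` module; SQLite has no `YEAR()` function, so `strftime('%Y', ...)` stands in for it here, and the sample rows and index name are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, "2021-12-31"), (2, 1, "2022-01-01"), (3, 2, "2022-12-31"), (4, 2, "2023-01-01")],
)
conn.execute("CREATE INDEX idx_orders_order_date ON orders(order_date)")

# Function wrapped around the column (stand-in for YEAR(order_date) = 2022).
wrapped = "SELECT order_id, order_date FROM orders WHERE strftime('%Y', order_date) = '2022'"
# Sargable range on the bare column.
ranged = (
    "SELECT order_id, order_date FROM orders "
    "WHERE order_date >= '2022-01-01' AND order_date < '2023-01-01'"
)

def plan(sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

wrapped_plan, ranged_plan = plan(wrapped), plan(ranged)
wrapped_rows = sorted(conn.execute(wrapped).fetchall())
ranged_rows = sorted(conn.execute(ranged).fetchall())

print(wrapped_plan)  # has to visit every row or index entry
print(ranged_plan)   # seeks directly into idx_orders_order_date
```

Both queries return the same two orders, but only the range form lets the planner seek into the index.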

### Quick Activity 4

Rewrite the following query to make it more efficient:

```sql
SELECT o.order_id, o.order_date, c.customer_id, c.name, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE LOWER(c.email) = 'example@example.com' AND DATE(o.order_date) = '2022-01-01';
```

```sql
-- Write SQL in this cell
-- Assumes emails are stored in lowercase, so LOWER() is not needed.
SELECT o.order_id, o.order_date, c.customer_id, c.name, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.email = 'example@example.com'
  AND o.order_date >= '2022-01-01' AND o.order_date < '2022-01-02';
```

---

## 5. Unnecessary Sorting

- **Problem**: Sorting large result sets can be computationally expensive, especially if the sorting operation is not necessary.
- **Solution**: Avoid unnecessary sorting by removing `ORDER BY` clauses that are not required.
- **Note**: If sorting is necessary, ensure that the columns being sorted are indexed.

#### Bad Practice

```sql
SELECT order_id, order_date
FROM orders
WHERE customer_id = 1
ORDER BY order_date;
```

#### Why it is inefficient

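- **Unnecessary Sorting**: If the consumer of the results does not depend on row order, the `ORDER BY` clause only adds work. Keep it whenever order matters, because SQL does not guarantee row order without it.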
- **Performance**: Sorting large result sets can be computationally expensive.
- **Resource Usage**: Sorting operations can consume additional resources.
- **Index Usage**: Sorting on non-indexed columns can slow down the query.
- **Scalability**: Unnecessary sorting can become a bottleneck as data volumes grow.

#### Better Practice

```sql
SELECT order_id, order_date
FROM orders
WHERE customer_id = 1;
```

#### Benefits

- **Efficiency**: Removing the `ORDER BY` clause avoids unnecessary sorting.
- **Performance**: The query executes faster without the sorting overhead.
- **Resource Utilisation**: The database can process the query more efficiently.
- **Scalability**: Optimized queries can handle larger data volumes without significant performance degradation.
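
The cost of a sort, and the note above about indexing sorted columns, can both be seen in a query plan. The sketch below uses Python's built-in `sqlite3` module; the composite index name `idx_orders_customer_date` is an assumption for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)

def plan(sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

unsorted_q = "SELECT order_id, order_date FROM orders WHERE customer_id = 1"
sorted_q = unsorted_q + " ORDER BY order_date"

plan_unsorted = plan(unsorted_q)
plan_sorted = plan(sorted_q)  # no suitable index: SQLite adds an explicit sort step

# An index on (customer_id, order_date) already stores matching rows in date order.
conn.execute("CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date)")
plan_sorted_indexed = plan(sorted_q)

print(plan_unsorted)
print(plan_sorted)
print(plan_sorted_indexed)
```

Without the index, the sorted query's plan contains a temporary B-tree for the `ORDER BY`; with the composite index, that step disappears because rows come back in order.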

### Quick Activity 5

Another trivial one:

Rewrite the following query to remove unnecessary sorting:

```sql
SELECT o.order_id, o.order_date, c.customer_id, c.name, SUM(oi.quantity * p.price) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE c.city = 'New York' AND p.category = 'Electronics'
GROUP BY o.order_id, o.order_date, c.customer_id, c.name
ORDER BY c.name, o.order_date;
```

```sql
-- Write SQL in this cell
SELECT o.order_id, o.order_date, c.customer_id, c.name, SUM(oi.quantity * p.price) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE c.city = 'New York' AND p.category = 'Electronics'
GROUP BY o.order_id, o.order_date, c.customer_id, c.name;
```

---

## 6. Redundant Subqueries

- **Problem**: Redundant subqueries or nested queries can lead to unnecessary data processing and performance overhead.
- **Solution**: Simplify queries by removing redundant subqueries and optimizing the query structure.

#### Bad Practice

```sql
SELECT order_id, order_date
FROM orders
WHERE customer_id IN (
    SELECT customer_id
    FROM customers
    WHERE city = 'New York'
);
```

#### Why it is inefficient

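- **Redundant Subquery**: Because `customer_id` is unique in `customers`, the `IN` subquery can be replaced with a join that returns the same rows.
- **Performance**: Some engines execute `IN` subqueries less efficiently than joins, although many modern optimisers rewrite them into the same plan.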
- **Resource Usage**: Subqueries can consume additional resources.
- **Readability**: Subqueries can make the query harder to understand.
- **Scalability**: Redundant subqueries can become bottlenecks as data volumes grow.
- **Adaptability**: Simplifying queries makes them more adaptable to changes in data volume or schema.
- **Maintainability**: Removing redundant subqueries improves query maintainability.

#### Better Practice

```sql
SELECT o.order_id, o.order_date
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.city = 'New York';
```

#### Benefits

- **Efficiency**: Replacing the subquery with a join can improve query performance.
- **Performance**: The query executes faster with a more efficient structure.
- **Resource Utilisation**: The database can process the query more efficiently.
- **Readability**: The query is easier to understand and maintain.
- **Scalability**: Optimized queries can handle larger data volumes without significant performance degradation.
- **Adaptability**: Simplified queries are more adaptable to changes in data volume or schema.
- **Maintainability**: Removing redundant subqueries improves query maintainability.
- **Consistent Performance**: Optimised queries provide consistent performance under varying loads.
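
Before swapping a subquery for a join, it is worth confirming that both forms return the same rows. A minimal sketch with Python's built-in `sqlite3` module follows; the sample customers and orders are invented for the example, and the equivalence relies on `customer_id` being unique in `customers`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
INSERT INTO customers VALUES (1, 'Ada', 'New York'), (2, 'Bob', 'Boston'), (3, 'Cy', 'New York');
INSERT INTO orders VALUES (10, 1, '2022-01-05'), (11, 2, '2022-01-06'),
                          (12, 1, '2022-02-01'), (13, 3, '2022-03-01');
""")

in_query = """
    SELECT order_id, order_date
    FROM orders
    WHERE customer_id IN (SELECT customer_id FROM customers WHERE city = 'New York')
"""
join_query = """
    SELECT o.order_id, o.order_date
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.city = 'New York'
"""

# Sort both results so the comparison does not depend on row order.
in_rows = sorted(conn.execute(in_query).fetchall())
join_rows = sorted(conn.execute(join_query).fetchall())
print(in_rows == join_rows, join_rows)
```

If the joined column were not unique, the join could return duplicate order rows that the `IN` form would not, so the rewrite is only safe when joining on a key.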

### Quick Activity 6

Rewrite the following query to remove the redundant subquery:

```sql
SELECT c.customer_id, c.name, (
    SELECT SUM(oi.quantity * p.price)
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    WHERE o.customer_id = c.customer_id
) AS total_amount
FROM customers c
WHERE c.city = 'New York';
```

```sql
-- Write SQL in this cell
-- Optional supporting indexes for the filter and join columns
CREATE INDEX idx_customers_city ON customers(city);
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
CREATE INDEX idx_order_items_product_id ON order_items(product_id);
CREATE INDEX idx_products_product_id ON products(product_id);

-- LEFT JOINs keep customers with no orders, as the original subquery did;
-- COALESCE reports their total as 0 instead of NULL.
SELECT c.customer_id, c.name, COALESCE(SUM(oi.quantity * p.price), 0) AS total_amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
LEFT JOIN order_items oi ON o.order_id = oi.order_id
LEFT JOIN products p ON oi.product_id = p.product_id
WHERE c.city = 'New York'
GROUP BY c.customer_id, c.name;
```

---

## 7. Overuse of `DISTINCT`

- **Problem**: Using `DISTINCT` to remove duplicates from query results can be resource-intensive, especially for large datasets.
- **Solution**: Use `DISTINCT` judiciously and consider alternative approaches, such as grouping or filtering, to achieve the desired results without unnecessary overhead.

#### Bad Practice

```sql
SELECT DISTINCT customer_id
FROM orders;
```

#### Why it is inefficient

- **Resource Usage**: `DISTINCT` can be computationally expensive, especially for large datasets.
- **Performance**: Removing duplicates using `DISTINCT` can slow down query execution.

#### Better Practice

```sql
SELECT customer_id
FROM orders
GROUP BY customer_id;
```

#### Benefits

- **Efficiency**: `GROUP BY` returns the same rows as `DISTINCT` here; on some engines it performs better, but many optimisers generate the same plan for both.
- **Performance**: Any speed-up depends on the engine, so compare the two query plans before rewriting.
- **Resource Utilisation**: The database can process the query more efficiently.
- **Scalability**: Optimized queries can handle larger datasets without significant performance degradation.
- **Data Quality**: Using `GROUP BY` can provide additional aggregation capabilities if needed.
- **User Experience**: Faster queries improve the user experience for data consumers.
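
Whether `GROUP BY` actually beats `DISTINCT` depends on the engine, so it helps to compare both results and both plans. The sketch below does this with Python's built-in `sqlite3` module; the sample data and the `plan` helper are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 3), (5, 3), (6, 3)],
)

distinct_q = "SELECT DISTINCT customer_id FROM orders"
group_q = "SELECT customer_id FROM orders GROUP BY customer_id"

def plan(sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

distinct_rows = sorted(conn.execute(distinct_q).fetchall())
group_rows = sorted(conn.execute(group_q).fetchall())

print(distinct_rows == group_rows)
print(plan(distinct_q))  # compare these two plans on your own engine
print(plan(group_q))
```

The two queries return identical rows; the printed plans show how this particular engine chooses to de-duplicate in each case.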

### Quick Activity 7

Rewrite the following query to remove the unnecessary use of `DISTINCT`:

```sql
SELECT DISTINCT c.customer_id, c.name, c.email, COUNT(oi.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE c.city = 'New York'
GROUP BY c.customer_id, c.name, c.email;
```

```sql
-- Write SQL in this cell
SELECT c.customer_id, c.name, c.email, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE c.city = 'New York'
GROUP BY c.customer_id, c.name, c.email;
```


---

## 8. Lack of Query Plan Analysis

- **Problem**: Not analysing query execution plans or performance metrics can lead to missed optimization opportunities and inefficient query performance.
- **Solution**: Monitor query execution plans, performance metrics, and resource usage to identify bottlenecks and optimize queries for better performance.

#### Bad Practice

```sql
SELECT order_id, order_date
FROM orders
WHERE customer_id = 1;
```

#### Why it is inefficient

- **Lack of Analysis**: Without analysing the query execution plan, it is difficult to identify potential performance issues.
- **Optimisation Opportunities**: Query plans can reveal opportunities for index usage, join strategies, or other optimizations.
- **Performance Tuning**: Monitoring performance metrics can help identify bottlenecks and areas for improvement.

#### Better Practice

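```sql
-- Show the plan the optimiser would choose, without running the query
EXPLAIN SELECT order_id, order_date
FROM orders
WHERE customer_id = 1;

-- Run the query and report actual timings and row counts
-- (PostgreSQL and MySQL 8.0.18+; MariaDB uses ANALYZE SELECT instead)
EXPLAIN ANALYZE SELECT order_id, order_date
FROM orders
WHERE customer_id = 1;
```

`EXPLAIN` shows the plan the optimiser intends to use. `EXPLAIN ANALYZE` also executes the query and reports what actually happened, so estimated and actual costs can be compared.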

#### Benefits

- **Analysis**: Query plans provide insights into how the database processes the query.
- **Optimisation**: Identifying inefficient operations or missing indexes can lead to query optimisations.
- **Performance Tuning**: Monitoring performance metrics helps improve query performance.
- **Resource Utilisation**: Query plans can reveal resource-intensive operations that can be optimised.
- **Scalability**: Optimised queries can handle larger datasets without significant performance degradation.
- **Data Quality**: Optimised queries improve data processing accuracy and reliability.

### Quick Activity 8

Nothing here as it's more of a practice than an activity.  See the Notebook on Indexing for more information.

---

## 9. Poorly Structured Queries

- **Problem**: Poorly structured queries with complex logic, redundant operations, or inefficient data access patterns can lead to slow query performance.
- **Solution**: Refactor queries to improve readability, simplify logic, and optimise data access patterns for better performance.

#### Bad Practice

```sql
SELECT o.order_id, o.order_date, c.customer_id, c.name, SUM(oi.quantity * p.price) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE c.city = 'New York' AND p.category = 'Electronics'
GROUP BY o.order_id, o.order_date, c.customer_id, c.name;
```

#### Why it is inefficient

- **Complex Logic**: The query has complex logic with multiple joins, aggregations, and filtering conditions.
- **Readability**: The query is hard to read and understand due to its complexity.
- **Performance**: Complex queries can be less efficient and harder to optimise.
- **Resource Usage**: Redundant operations or inefficient data access patterns can consume additional resources.
- **Scalability**: Poorly structured queries can become bottlenecks as data volumes grow.

#### Better Practice

```sql
WITH filtered_customers AS (
    SELECT customer_id, name
    FROM customers
    WHERE city = 'New York'
), filtered_products AS (
    SELECT product_id, price
    FROM products
    WHERE category = 'Electronics'
)
SELECT o.order_id, o.order_date, fc.customer_id, fc.name, SUM(oi.quantity * fp.price) AS total_amount
FROM orders o
JOIN filtered_customers fc ON o.customer_id = fc.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN filtered_products fp ON oi.product_id = fp.product_id
GROUP BY o.order_id, o.order_date, fc.customer_id, fc.name;
```

#### Benefits

- **Simplification**: Refactoring the query into CTEs simplifies the logic and improves readability.
- **Performance**: The query structure is optimised for better performance.
- **Resource Utilisation**: The database can process the query more efficiently.
- **Scalability**: Optimised queries can handle larger datasets without significant performance degradation.
- **Adaptability**: Simplified queries are more adaptable to changes in data volume or schema.
- **Maintainability**: Refactoring queries improves query maintainability and readability.
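
A refactor like this should never change the result, which is easy to verify on a small dataset. The sketch below runs both versions with Python's built-in `sqlite3` module; the tables are trimmed to the columns the queries use and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT, price INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER);
INSERT INTO customers VALUES (1, 'Ada', 'New York'), (2, 'Bob', 'Boston');
INSERT INTO products VALUES (100, 'Electronics', 1000), (101, 'Furniture', 200),
                            (102, 'Electronics', 500);
INSERT INTO orders VALUES (10, 1, '2022-01-05'), (11, 2, '2022-01-06'), (12, 1, '2022-02-01');
INSERT INTO order_items VALUES (10, 100, 1), (10, 101, 2), (10, 102, 2), (11, 100, 1), (12, 101, 1);
""")

flat = """
    SELECT o.order_id, o.order_date, c.customer_id, c.name,
           SUM(oi.quantity * p.price) AS total_amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    WHERE c.city = 'New York' AND p.category = 'Electronics'
    GROUP BY o.order_id, o.order_date, c.customer_id, c.name
"""
with_ctes = """
    WITH filtered_customers AS (
        SELECT customer_id, name FROM customers WHERE city = 'New York'
    ), filtered_products AS (
        SELECT product_id, price FROM products WHERE category = 'Electronics'
    )
    SELECT o.order_id, o.order_date, fc.customer_id, fc.name,
           SUM(oi.quantity * fp.price) AS total_amount
    FROM orders o
    JOIN filtered_customers fc ON o.customer_id = fc.customer_id
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN filtered_products fp ON oi.product_id = fp.product_id
    GROUP BY o.order_id, o.order_date, fc.customer_id, fc.name
"""

flat_rows = sorted(conn.execute(flat).fetchall())
cte_rows = sorted(conn.execute(with_ctes).fetchall())
print(flat_rows == cte_rows, cte_rows)
```

Only order 10 qualifies (a New York customer buying Electronics), and both versions report the same total for it. Whether the CTE form is also faster depends on the engine, so the query plan remains the final check.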

### Quick Activity 9

Rewrite the following query to improve its structure and readability:

```sql
SELECT c.customer_id, c.name, c.email, 
       (SELECT COUNT(*) 
        FROM orders o 
        WHERE o.customer_id = c.customer_id 
          AND o.order_date BETWEEN '2022-01-01' AND '2022-12-31') AS order_count,
       (SELECT SUM(oi.quantity * p.price) 
        FROM orders o 
        JOIN order_items oi ON o.order_id = oi.order_id 
        JOIN products p ON oi.product_id = p.product_id 
        WHERE o.customer_id = c.customer_id 
          AND o.order_date BETWEEN '2022-01-01' AND '2022-12-31') AS total_spent
FROM customers c
WHERE c.city = 'New York'
ORDER BY total_spent DESC;
```

```sql
-- Write SQL in this cell
```


---

## 10. Lack of Data Normalisation

- **Problem**: Denormalised data structures can lead to redundant data storage, update anomalies, and inefficient query performance.
- **Solution**: Normalise data structures to reduce redundancy, improve data integrity, and optimize query performance through efficient data access patterns.

#### Bad Practice

```sql
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    customer_name VARCHAR(100),
    order_date DATE,
    total_amount DECIMAL(10, 2)
);
```

#### Why it is inefficient

- **Redundant Data**: Storing `customer_name` in the `orders` table repeats the same name on every order that customer places.
- **Update Anomalies**: If the customer name changes, all orders for that customer need to be updated.
- **Data Integrity**: Redundant data can lead to inconsistencies and data integrity issues.
- **Performance**: Denormalised structures can lead to inefficient queries and slower performance.
- **Scalability**: Redundant data storage can impact scalability and resource usage.
- **Maintainability**: Denormalised structures are harder to maintain and update.

#### Better Practice

```sql
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    total_amount DECIMAL(10, 2)
);

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(100)
);
```

#### Benefits

- **Data Integrity**: Normalised structures improve data integrity and consistency.
- **Efficiency**: Normalised tables reduce redundancy and improve query performance.
- **Resource Utilisation**: Efficient data access patterns reduce resource consumption.
- **Scalability**: Normalised structures are more scalable and adaptable to changing data volumes.
- **Maintainability**: Normalised tables are easier to maintain and update.
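
The update anomaly can be made concrete by counting how many rows a rename has to touch in each design. This is a sketch with Python's built-in `sqlite3` module; the table name `orders_denorm` and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalised: the customer's name is copied onto every order row.
conn.execute(
    "CREATE TABLE orders_denorm (order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "customer_name TEXT, order_date TEXT, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO orders_denorm VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "Ada Smith", "2022-01-05", 20.0),
        (2, 1, "Ada Smith", "2022-02-01", 35.5),
        (3, 1, "Ada Smith", "2022-03-10", 12.0),
        (4, 2, "Bob Lee", "2022-03-11", 8.0),
    ],
)

# Normalised: the name lives once in customers; orders only hold the key.
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "order_date TEXT, total_amount REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada Smith"), (2, "Bob Lee")])
conn.execute(
    "INSERT INTO orders SELECT order_id, customer_id, order_date, total_amount FROM orders_denorm"
)

# Renaming one customer touches every copy in the denormalised table...
denorm_rows_updated = conn.execute(
    "UPDATE orders_denorm SET customer_name = 'Ada Jones' WHERE customer_id = 1"
).rowcount
# ...but exactly one row in the normalised schema.
norm_rows_updated = conn.execute(
    "UPDATE customers SET name = 'Ada Jones' WHERE customer_id = 1"
).rowcount

# Every order for customer 1 now resolves to the new name through the join.
names_on_orders = conn.execute(
    "SELECT DISTINCT c.name FROM orders o JOIN customers c ON o.customer_id = c.customer_id "
    "WHERE o.customer_id = 1"
).fetchall()
print(denorm_rows_updated, norm_rows_updated, names_on_orders)
```

In the denormalised table, missing even one of the copies would leave the same customer with two different names; the normalised schema has only one place to update.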

### Quick Activity 10

Rewrite the following table structure to normalise the data:

```sql
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    order_date DATE,
    customer_id INT,
    customer_name VARCHAR(255),
    customer_email VARCHAR(255),
    product_id INT,
    product_name VARCHAR(255),
    product_category VARCHAR(255),
    quantity INT,
    price DECIMAL(10, 2),
    total_amount DECIMAL(10, 2)
);
```

```sql
-- Write SQL in this cell
```


---

## Summary

In this notebook, we have explored common performance pitfalls in SQL queries and provided solutions to avoid them. By addressing these pitfalls, data professionals can write more efficient, scalable, and maintainable SQL queries that meet the demands of modern data processing and analysis.

The usual reasons and benefits of optimising queries are:

1. **Efficiency and Speed**: Optimised queries execute faster, reducing latency and resource consumption.
2. **Scalability**: Optimised queries can handle larger datasets and growing data volumes without significant performance degradation.
3. **Reliability and Stability**: Efficient queries are less prone to timeouts, crashes, or failures, ensuring consistent performance.
4. **Data Quality and Accuracy**: Timely data processing and reduced errors lead to higher data quality.
5. **User Experience**: Faster queries provide faster insights and improved interactivity for end-users.
6. **Maintainability and Future-Proofing**: Optimised queries are easier to troubleshoot, maintain, and adapt to changes in data volume or schema.
7. **Cost Management**: Efficient queries reduce resource consumption, helping to manage operational costs.
8. **Consistent Performance**: Optimised queries provide consistent performance under varying loads.
9. **Adaptability**: Optimised queries are more adaptable to changes in data volume, schema, or infrastructure.


---

## Solutions

### Quick Activity 1

```sql
SELECT order_id, customer_id, order_date, order_status
FROM orders
WHERE order_id = 123;
```

### Quick Activity 2

```sql
WITH filtered_customers AS (
    SELECT customer_id, name
    FROM customers
    WHERE city = 'New York'
),
filtered_products AS (
    SELECT product_id, price
    FROM products
    WHERE category = 'Electronics'
)
SELECT o.order_id, o.order_date, fc.customer_id, fc.name, SUM(oi.quantity * fp.price) AS total_amount
FROM orders o
JOIN filtered_customers fc ON o.customer_id = fc.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN filtered_products fp ON oi.product_id = fp.product_id
GROUP BY o.order_id, o.order_date, fc.customer_id, fc.name;
```

#### Explanation

***Original Query:***

The original query joins multiple tables and filters on columns from the `customers` and `products` tables. However, the filtering is done after the joins, which can lead to inefficient joins and poor performance.  

***Optimised Query:***

- ***Filtered Customers CTE***: Filters the `customers` table to include only those in `'New York'` before joining.  
- ***Filtered Products CTE***: Filters the `products` table to include only those in the `'Electronics'` category before joining.  
- ***Final Query***: Joins the `orders` table with the filtered CTEs and the `order_items` table, then groups and calculates the `total amount`.

### Quick Activity 3

N/A

### Quick Activity 4

```sql
SELECT o.order_id, o.order_date, c.customer_id, c.name, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.email = 'example@example.com' AND o.order_date >= '2022-01-01' AND o.order_date < '2022-01-02';
```

#### Explanation

***Original Query***:

The `WHERE` clause in this query uses the `LOWER()` function on the `email` column and the `DATE()` function on the `order_date` column. Both of these make the query non-sargable, preventing the use of indexes on these columns and leading to poor performance.

***Optimised Query***:

- ***Email Comparison***: Ensure that the `email` column is stored in a consistent case (e.g., all lowercase) to avoid the need for the `LOWER()` function. This allows the query to use an **index** on the `email` column.  
- ***Date Comparison***: Use a range condition on the `order_date` column to avoid the `DATE()` function, allowing the query to use an **index** on the `order_date` column.  
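
Where the stored emails cannot be normalised, some engines offer another route: an index on the expression itself. The sketch below shows this in SQLite through Python's built-in `sqlite3` module. The index name `idx_customers_email_lower` is an assumption, and support and syntax for expression indexes vary between database engines, so treat this as an illustration rather than a portable recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'Ada@Example.com')")

query = "SELECT customer_id, name FROM customers WHERE LOWER(email) = 'ada@example.com'"

def plan(sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

plan_before = plan(query)  # no usable index: full scan

# Index the lowercased value, matching the expression used in the WHERE clause.
conn.execute("CREATE INDEX idx_customers_email_lower ON customers(LOWER(email))")
plan_after = plan(query)

rows = conn.execute(query).fetchall()
print(plan_before)
print(plan_after)
print(rows)
```

With the expression index in place, the `LOWER(email)` predicate becomes an index search even though the stored value has mixed case.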

### Quick Activity 5

```sql
SELECT o.order_id, o.order_date, c.customer_id, c.name, SUM(oi.quantity * p.price) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE c.city = 'New York' AND p.category = 'Electronics'
GROUP BY o.order_id, o.order_date, c.customer_id, c.name;
```

#### Explanation

***Original Query***:

The query includes an `ORDER BY` clause that sorts the results by `c.name` and `o.order_date`. However, if the sorting is not required for the final result, it adds unnecessary computational overhead.

***Optimised Query***:

- ***Removed Unnecessary Sorting***: The `ORDER BY` clause has been removed to eliminate unnecessary sorting, reducing computational overhead and improving query performance.

### Quick Activity 6

```sql
WITH customer_totals AS (
    SELECT o.customer_id, SUM(oi.quantity * p.price) AS total_amount
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    GROUP BY o.customer_id
)
SELECT c.customer_id, c.name, ct.total_amount
FROM customers c
-- LEFT JOIN keeps customers with no orders (NULL total), as the original query did
LEFT JOIN customer_totals ct ON c.customer_id = ct.customer_id
WHERE c.city = 'New York';
```

#### Explanation

***Original Query***:

The query includes a subquery in the `SELECT` clause to calculate the *total amount* for each *customer*. This subquery is executed for each row in the `customers` table, leading to redundant calculations and poor performance.  

***Optimised Query***:

- ***Customer Totals CTE***: Calculates the *total amount* for each *customer* in a CTE, grouping by `customer_id`.
- ***Final Query***: Joins the `customers` table with the `customer_totals` CTE to get the *total amount* for each *customer* in `'New York'`.

### Quick Activity 7

```sql
SELECT c.customer_id, c.name, c.email, COUNT(oi.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE c.city = 'New York'
GROUP BY c.customer_id, c.name, c.email;
```

#### Explanation

***Original Query***:

The query uses `DISTINCT` in combination with `GROUP BY`, which is redundant because `GROUP BY` already ensures that the results are unique based on the grouped columns. This can lead to unnecessary computational overhead.

***Optimised Query***:

- ***Removed Unnecessary `DISTINCT`***: The `DISTINCT` keyword has been removed because the `GROUP BY` clause already ensures unique results based on the grouped columns.

### Quick Activity 8

N/A

### Quick Activity 9

```sql
WITH customer_orders AS (
    SELECT o.customer_id,
           COUNT(DISTINCT o.order_id) AS order_count,  -- count orders, not order-item rows
           SUM(oi.quantity * p.price) AS total_spent
    FROM orders o
    LEFT JOIN order_items oi ON o.order_id = oi.order_id
    LEFT JOIN products p ON oi.product_id = p.product_id
    WHERE o.order_date BETWEEN '2022-01-01' AND '2022-12-31'
    GROUP BY o.customer_id
)
SELECT c.customer_id, c.name, c.email,
       COALESCE(co.order_count, 0) AS order_count,  -- customers with no orders show 0
       co.total_spent
FROM customers c
LEFT JOIN customer_orders co ON c.customer_id = co.customer_id
WHERE c.city = 'New York'
ORDER BY co.total_spent DESC;
```

#### Explanation

***Original Query***:

The query includes multiple subqueries in the `SELECT` clause, which are executed for each row in the customers table. This leads to redundant calculations and poor performance.

***Optimised Query***:

- ***Customer Orders CTE***: Calculates the order count and total spent for each customer in a CTE, grouping by `customer_id`.  
- ***Final Query***: Joins the `customers` table with the `customer_orders` CTE to get the *order count* and *total spent* for each *customer* in `'New York'`.  
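
One detail to check when folding correlated subqueries into a single aggregate: once `orders` is joined to `order_items`, `COUNT(*)` counts item rows, while `COUNT(DISTINCT o.order_id)` counts orders. The sketch below shows the difference with Python's built-in `sqlite3` module; the sample data is invented and the tables are trimmed to the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, price INTEGER);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER);
INSERT INTO orders VALUES (10, 1, '2022-01-05'), (11, 1, '2022-06-01');
INSERT INTO products VALUES (100, 1000), (102, 500);
-- Order 10 has two item rows, order 11 has one.
INSERT INTO order_items VALUES (10, 100, 1), (10, 102, 2), (11, 100, 1);
""")

row = conn.execute("""
    SELECT o.customer_id,
           COUNT(*) AS item_rows,                       -- one per joined order_items row
           COUNT(DISTINCT o.order_id) AS order_count,   -- one per order
           SUM(oi.quantity * p.price) AS total_spent
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    WHERE o.order_date BETWEEN '2022-01-01' AND '2022-12-31'
    GROUP BY o.customer_id
""").fetchone()
print(row)
```

Customer 1 placed two orders containing three item rows in total, so the two counts differ; only the `DISTINCT` count matches what the original `order_count` subquery returned.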

### Quick Activity 10

```sql
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255)
);

CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name VARCHAR(255),
    category VARCHAR(255),
    price DECIMAL(10, 2)
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    order_date DATE,
    customer_id INT,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

CREATE TABLE order_items (
    order_item_id INT PRIMARY KEY,
    order_id INT,
    product_id INT,
    quantity INT,
    price DECIMAL(10, 2),
    FOREIGN KEY (order_id) REFERENCES orders(order_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);
```

#### Explanation  

***Original Table***:

- ***Redundant Data***: The `orders` table includes *customer* information (`customer_name`, `customer_email`) and *product* information (`product_name`, `product_category`) that are repeated for each order.
- ***One Product per Order***: With `order_id` as the **primary key**, each order can hold only one product. Recording several products would mean either repeating the order row, which the **primary key** forbids, or adding repeating product columns, which would violate 1NF.
- ***Violation of 3NF***: `customer_name` and `customer_email` depend on `customer_id`, and `product_name` and `product_category` depend on `product_id`, rather than directly on the **primary key** `order_id`. These transitive dependencies are what the normalised design removes.

***Normalised Schema***:

- ***Customers Table***: Stores *customer* information with a *unique* `customer_id`.  
- ***Products Table***: Stores *product* information with a *unique* `product_id`.  
- ***Orders Table***: Stores *order* information with a *reference* to the `customer_id`.  
- ***Order Items Table***: Stores *order item* details with *references* to `order_id` and `product_id`, and includes the `quantity` and `price` of each *product* in the *order*.

---