# **DQL (Queries)** ❓

DQL (Data Query Language) is a subset of SQL (Structured Query Language) used to retrieve and query data from a database. It allows you to perform operations such as filtering, sorting, and grouping data but does not modify the data itself (e.g., no updates or deletions). The primary and most common command in DQL is the SELECT statement.

In [3]:
import os

import duckdb as dd

from src.config import DATABASE_DIR

In [4]:
# Create a persistent DuckDB database
os.chdir(DATABASE_DIR)
con = dd.connect("duckdb_test.db")

In [10]:
# Run this only after all of the queries below have finished
con.close()

## **Information Schema**

The information schema is defined by the SQL standard: it is a schema of read-only views that expose metadata about the objects in a database, including tables, columns, and constraints.
For example, to check whether a column has been designated as a primary key, we can query the `key_column_usage` view. This view lists every column in the current database that is restricted by a key constraint such as a primary key, unique key, or foreign key.

The `columns` view describes every column of every table and view in the database. The `tables` view lists all tables and views defined in the current database, and the `views` view lists only the views.

In [5]:
query = """
    SELECT *
    FROM information_schema.tables
"""
con.sql(query)

┌───────────────┬──────────────┬────────────┬────────────┬──────────────────────────────┬──────────────────────┬───────────────────────────┬──────────────────────────┬────────────────────────┬────────────────────┬──────────┬───────────────┬───────────────┐
│ table_catalog │ table_schema │ table_name │ table_type │ self_referencing_column_name │ reference_generation │ user_defined_type_catalog │ user_defined_type_schema │ user_defined_type_name │ is_insertable_into │ is_typed │ commit_action │ TABLE_COMMENT │
│    varchar    │   varchar    │  varchar   │  varchar   │           varchar            │       varchar        │          varchar          │         varchar          │        varchar         │      varchar       │ varchar  │    varchar    │    varchar    │
├───────────────┼──────────────┼────────────┼────────────┼──────────────────────────────┼──────────────────────┼───────────────────────────┼──────────────────────────┼────────────────────────┼────────────────────┼──────────┼─────

In [6]:
query = """
    SELECT constraint_name, table_name, column_name
    FROM information_schema.key_column_usage
    WHERE table_name = 'users';
"""
con.sql(query)

┌─────────────────┬────────────┬─────────────┐
│ constraint_name │ table_name │ column_name │
│     varchar     │  varchar   │   varchar   │
├─────────────────┼────────────┼─────────────┤
│ users_id_pkey   │ users      │ id          │
│ users_email_key │ users      │ email       │
└─────────────────┴────────────┴─────────────┘

In [7]:
query = """
    SELECT column_name, data_type, ordinal_position, column_default, is_nullable,  character_maximum_length
    FROM information_schema.columns
    WHERE table_name = 'users';
"""
con.sql(query)

┌─────────────┬───────────┬──────────────────┬───────────────────┬─────────────┬──────────────────────────┐
│ column_name │ data_type │ ordinal_position │  column_default   │ is_nullable │ character_maximum_length │
│   varchar   │  varchar  │      int32       │      varchar      │   varchar   │          int32           │
├─────────────┼───────────┼──────────────────┼───────────────────┼─────────────┼──────────────────────────┤
│ id          │ INTEGER   │                1 │ NULL              │ NO          │                     NULL │
│ name        │ VARCHAR   │                2 │ NULL              │ NO          │                     NULL │
│ age         │ INTEGER   │                3 │ NULL              │ YES         │                     NULL │
│ created_at  │ TIMESTAMP │                4 │ CURRENT_TIMESTAMP │ YES         │                     NULL │
│ email       │ VARCHAR   │                5 │ NULL              │ YES         │                     NULL │
└─────────────┴───────────┴──────────────────┴───────────────────┴─────────────┴──────────────────────────┘

## **Selecting Columns**

The SELECT clause specifies the columns or expressions to retrieve from a table. You can list individual columns or use `*` to select all of them. Columns and tables can be aliased with the `AS` keyword: a column alias renames the column in the returned result set, and a table alias gives the table a shorter name within the query.


## **Distinct Values** 

In [8]:
query = """
    SELECT DISTINCT age
    FROM users;
"""
con.sql(query).df().head()

Unnamed: 0,age
0,19
1,45
2,35
3,74
4,37


## **Limiting Values** 

In [9]:
query = """
    SELECT name
    FROM users
    LIMIT 3;
"""
con.sql(query).df()

Unnamed: 0,name
0,Jeffrey Thomas
1,Heather Brown
2,Sarah Baldwin


## **Column Aliases** 

In [12]:
query = """
    SELECT name AS "Name", age AS "Age", created_at AS "Registration Date" 
    FROM users;
"""
con.sql(query).df().head()

Unnamed: 0,Name,Age,Registration Date
0,Jeffrey Thomas,19,2024-05-05 15:02:40
1,Heather Brown,25,2022-03-20 00:15:17
2,Sarah Baldwin,45,2022-03-27 19:31:44
3,Stephanie Russell,37,2021-06-28 18:09:47
4,Michael Farmer,61,2023-05-27 01:58:12


## **Type Casting** 

The `CAST` function converts the value of an expression to another data type. DuckDB also supports the shorthand `expression::TYPE`, used for `created_at::DATE` in the query below.

In [13]:
query = """
    SELECT 
        CAST(age AS FLOAT) AS age_float,
        created_at::DATE AS registration_date
    FROM users;
"""
con.sql(query).df().head()

Unnamed: 0,age_float,registration_date
0,19.0,2024-05-05
1,25.0,2022-03-20
2,45.0,2022-03-27
3,37.0,2021-06-28
4,61.0,2023-05-27


## **Concatenate Strings** 

`||` is used to concatenate strings together.

In [14]:
query = """
    SELECT (name || ' ' || age) AS name_age
    FROM users;
"""
con.sql(query).df().head()

Unnamed: 0,name_age
0,Jeffrey Thomas 19
1,Heather Brown 25
2,Sarah Baldwin 45
3,Stephanie Russell 37
4,Michael Farmer 61


## **If-Else Statements** 

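A CASE statement allows us to create different outputs (usually in the SELECT statement). It is SQL’s way of handling if-then logic: the `WHEN` conditions are checked in order, the first one that is true determines the result, and `ELSE` covers everything that is left.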

In [15]:
query = """
    SELECT name,
    CASE 
        WHEN age < 18 THEN 'Child'
        WHEN age < 65 THEN 'Adult'
        ELSE 'Senior'
    END AS age_group
    FROM users;
"""
con.sql(query).df().head()

Unnamed: 0,name,age_group
0,Jeffrey Thomas,Adult
1,Heather Brown,Adult
2,Sarah Baldwin,Adult
3,Stephanie Russell,Adult
4,Michael Farmer,Adult


## **Date & Time** 

In SQL, dates are typically written in one of the following formats:
- Date: YYYY-MM-DD
- Datetime or Timestamp: YYYY-MM-DD hh:mm:ss

### **Extraction**
The DATE() function allows us to extract just the date portion of a time string, which consists of the year, month and day.

For SQLite:
- The DATETIME() function will return the entire time string which includes the date and time portions. To obtain the current date and time, you can provide the string 'now' to the function, which returns the date and time in UTC.
- The TIME() function allows us to extract just the time portion of a time string, which consists of the hour, minute and second.
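
The SQLite functions described above can be tried out with a minimal sketch. It assumes Python's built-in `sqlite3` module and an in-memory database rather than the notebook's DuckDB connection, and the timestamp is made-up sample data.

```python
import sqlite3

# In-memory SQLite database (an assumption; the notebook itself uses DuckDB).
conn = sqlite3.connect(":memory:")
ts = "2024-05-05 15:02:40"  # made-up sample time string

date_part, time_part, full = conn.execute(
    "SELECT DATE(?), TIME(?), DATETIME(?)", (ts, ts, ts)
).fetchone()
print(date_part)  # 2024-05-05
print(time_part)  # 15:02:40
print(full)       # 2024-05-05 15:02:40

# 'now' yields the current date and time in UTC, so this value changes per run.
(now_utc,) = conn.execute("SELECT DATETIME('now')").fetchone()
print(now_utc)
conn.close()
```
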


In [16]:
query = """
    SELECT 
        created_at,
        DATE(created_at) AS registration_date,
        created_at::TIME AS registration_time,
        EXTRACT(HOUR FROM created_at) AS hour,
        DATE_PART('minute', created_at) AS minute,
        DATE('2020-02-10') AS temp_date
    FROM users;
"""
con.sql(query).df().head()

Unnamed: 0,created_at,registration_date,registration_time,hour,minute,temp_date
0,2024-05-05 15:02:40,2024-05-05,15:02:40,15,2,2020-02-10
1,2022-03-20 00:15:17,2022-03-20,00:15:17,0,15,2020-02-10
2,2022-03-27 19:31:44,2022-03-27,19:31:44,19,31,2020-02-10
3,2021-06-28 18:09:47,2021-06-28,18:09:47,18,9,2020-02-10
4,2023-05-27 01:58:12,2023-05-27,01:58:12,1,58,2020-02-10



### **Modifiers**

SQLite date functions accept additional arguments after the time string, called modifiers. Modifiers are applied from left to right in the order they are listed, so order matters. The following modifiers shift the date backwards to the start of a given part of the date.
- start of year: shifts the date to the beginning of the current year.
- start of month: shifts the date to the beginning of the current month.
- start of day: shifts the date to the beginning of the current day.

The following modifiers add a specified amount to the date and time of the time string.
- '+-N years': offsets the year
- '+-N months': offsets the month
- '+-N days': offsets the day
- '+-N hours': offsets the hour
- '+-N minutes': offsets the minute
- '+-N seconds': offsets the second

`SELECT DATETIME('2020-02-10', 'start of month', '-1 day', '+7 hours');`
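
To see the left-to-right rule in action, the statement above can be run through Python's built-in `sqlite3` module. This is a small sketch on an in-memory database (an assumption; the notebook's own cells use DuckDB), with the intermediate values traced in the comments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Modifiers are applied left to right:
# 2020-02-10 -> 2020-02-01 00:00:00 -> 2020-01-31 00:00:00 -> 2020-01-31 07:00:00
(shifted,) = conn.execute(
    "SELECT DATETIME('2020-02-10', 'start of month', '-1 day', '+7 hours')"
).fetchone()
print(shifted)  # 2020-01-31 07:00:00

# Swapping two modifiers changes the result, which is why order matters.
(month_then_day,) = conn.execute(
    "SELECT DATE('2020-02-10', 'start of month', '-1 day')"
).fetchone()
(day_then_month,) = conn.execute(
    "SELECT DATE('2020-02-10', '-1 day', 'start of month')"
).fetchone()
print(month_then_day, day_then_month)  # 2020-01-31 2020-02-01
conn.close()
```
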

In [25]:
query = """
    SELECT 
        created_at,
        DATE_TRUNC('month', created_at) AS start_of_month,
        DATE_TRUNC('year', created_at) AS start_of_year,
        DATE_TRUNC('month', created_at - INTERVAL '1 month') AS start_of_previous_month
    FROM users;
"""
con.sql(query).df().head()

Unnamed: 0,created_at,start_of_month,start_of_year,start_of_previous_month
0,2024-05-05 15:02:40,2024-05-01,2024-01-01,2024-04-01
1,2022-03-20 00:15:17,2022-03-01,2022-01-01,2022-02-01
2,2022-03-27 19:31:44,2022-03-01,2022-01-01,2022-02-01
3,2021-06-28 18:09:47,2021-06-01,2021-01-01,2021-05-01
4,2023-05-27 01:58:12,2023-05-01,2023-01-01,2023-04-01



### **Formatting**

In SQLite the STRFTIME() function allows you to return a formatted date, as specified in a format string.

`STRFTIME(format, timestring, modifier1, modifier2, ...)`


The first argument, format, is the format string. The second argument is the time string. The remaining arguments are 0 or more optional modifiers to transform the time string. The substitutions to extract each part of the date and time are the following:
- %Y returns the year (YYYY)
- %m returns the month (01-12)
- %d returns the day of month (01-31)
- %H returns the hour (00-23)
- %M returns the minute (00-59)
- %S returns the second (00-59)
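
A short sketch of these substitutions, again assuming Python's built-in `sqlite3` module with an in-memory database and a made-up time string:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
ts = "2024-05-05 15:02:40"  # made-up sample time string

year, day_month, clock = conn.execute(
    "SELECT STRFTIME('%Y', ?), STRFTIME('%d/%m', ?), STRFTIME('%H:%M:%S', ?)",
    (ts, ts, ts),
).fetchone()
print(year, day_month, clock)  # 2024 05/05 15:02:40

# Optional modifiers are applied to the time string before formatting.
(next_month,) = conn.execute(
    "SELECT STRFTIME('%Y-%m', ?, '+1 months')", (ts,)
).fetchone()
print(next_month)  # 2024-06
conn.close()
```
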

In [27]:
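# TO_CHAR is PostgreSQL syntax; DuckDB formats timestamps with
# STRFTIME(timestamp, format), e.g. '%a %B %Y' for weekday, month name, year
query = """
    SELECT
        created_at,
        STRFTIME(created_at, '%a %B %Y') AS day_month_year
    FROM users;
"""
con.sql(query).df().head()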

## **Ordering Rows**

Order rows using the `ORDER BY` clause. Rows are sorted in ascending order by default; add `DESC` for descending order.

In [28]:
query = """
    SELECT name, age
    FROM users
    ORDER BY name DESC;
"""
con.sql(query).df().head()

Unnamed: 0,name,age
0,Timothy Wolf,69
1,Timothy Hopkins,52
2,Thomas Simmons,57
3,Sylvia Mason,68
4,Stephen Smith,39


## **Filtering Rows**

Filter rows using the `WHERE` clause with conditions.

### **IN Operator**

In [29]:
query = """
    SELECT name, age
    FROM users
    WHERE age IN (40, 50, 60)
"""
con.sql(query).df().head()

Unnamed: 0,name,age
0,Amy Nichols,40
1,John Moyer,60
2,Meghan Short,60
3,Megan Long,50
4,James Hall MD,60


### **NOT IN Operator**

In [30]:
query = """
    SELECT name, age
    FROM users
    WHERE age NOT IN (40, 50, 60)
"""
con.sql(query).df().head()

Unnamed: 0,name,age
0,Jeffrey Thomas,19
1,Heather Brown,25
2,Sarah Baldwin,45
3,Stephanie Russell,37
4,Michael Farmer,61


### **LIKE Operator**

The `LIKE` operator can be used inside of a WHERE clause to match a specified pattern. 
- The `%` wildcard can be used in a LIKE operator pattern to match zero or more unspecified character(s).
- The `_` wildcard can be used in a LIKE operator pattern to match any single unspecified character. 

In [31]:
query = """
    SELECT name, age
    FROM users
    WHERE name LIKE 'A%'
"""
con.sql(query).df().head()

Unnamed: 0,name,age
0,Alicia Morse,36
1,Amy Nichols,40
2,Austin Torres,21
3,Arthur Griffith,27
4,Alexis Chan,66


In [32]:
query = """
    SELECT name, age
    FROM users
    WHERE name LIKE '_aron Hernandez'
"""
con.sql(query).df().head()

Unnamed: 0,name,age


### **IS NULL Operator**

Column values can be NULL, meaning they have no value. Because NULL is not equal to anything, including itself, these records cannot be matched with `=`; use the `IS NULL` and `IS NOT NULL` operators in the WHERE clause instead. For example, `WHERE address IS NOT NULL` matches all rows where the address has a value.
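
The following sketch illustrates NULL matching. It assumes Python's built-in `sqlite3` module with an in-memory database; the `contacts` table and its rows are hypothetical and not part of the notebook's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: Bob has no address on record.
conn.execute("CREATE TABLE contacts (name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("Ann", "12 Oak St"), ("Bob", None), ("Cy", "9 Elm Rd")],
)

with_address = [r[0] for r in conn.execute(
    "SELECT name FROM contacts WHERE address IS NOT NULL ORDER BY name")]
without_address = [r[0] for r in conn.execute(
    "SELECT name FROM contacts WHERE address IS NULL")]
# Comparing with '=' never matches, because NULL = NULL is not true.
equals_null = conn.execute(
    "SELECT name FROM contacts WHERE address = NULL").fetchall()

print(with_address)     # ['Ann', 'Cy']
print(without_address)  # ['Bob']
print(equals_null)      # []
conn.close()
```
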

## **Aggregating Rows**

Group rows and apply aggregate functions (`COUNT`, `SUM`, `AVG`, etc.).

### **Grouping Rows**

The `GROUP BY` clause will group records in a result set by identical values in one or more columns. It is often used in combination with aggregate functions to query information of similar records. The `GROUP BY` clause can come after `FROM` or `WHERE` but must come before any `ORDER BY` or `LIMIT` clause.

In [33]:
query = """
    SELECT age, COUNT(*)
    FROM users
    WHERE age IN (40, 50, 60)
    GROUP BY age
"""
con.sql(query).df().head()

Unnamed: 0,age,count_star()
0,40,1
1,50,1
2,60,3


### **Filtering Aggregated Results**

The `HAVING` clause is used to further filter the result set groups provided by the `GROUP BY` clause. `HAVING` is often used with aggregate functions to filter the result set groups based on an aggregate property. The `HAVING` clause must always come after a `GROUP BY` clause but must come before any `ORDER BY` or `LIMIT` clause.

In [34]:
query = """
    SELECT age, COUNT(*)
    FROM users
    WHERE age IN (40, 50, 60)
    GROUP BY age
    HAVING COUNT(*) > 1
"""
con.sql(query).df().head()

Unnamed: 0,age,count_star()
0,60,3


## **Joining Tables**

### **Merging Tables**

1. **INNER JOIN**:
   - Returns only the rows that have matching values in both tables.
     ```sql
     SELECT columns
     FROM table1
     INNER JOIN table2
     ON table1.column = table2.column;
     ```

2. **LEFT JOIN (or LEFT OUTER JOIN)**:
   - Returns all rows from the left table, and the matched rows from the right table. If no match is found, NULL values are returned for columns from the right table.
     ```sql
     SELECT columns
     FROM table1
     LEFT JOIN table2
     ON table1.column = table2.column;
     ```

3. **RIGHT JOIN (or RIGHT OUTER JOIN)**:
   - Returns all rows from the right table, and the matched rows from the left table. If no match is found, NULL values are returned for columns from the left table.
     ```sql
     SELECT columns
     FROM table1
     RIGHT JOIN table2
     ON table1.column = table2.column;
     ```

4. **FULL JOIN (or FULL OUTER JOIN)**:
   - Returns all rows from both tables, combining them where there is a match. Rows without a match in one of the tables will have NULL values for columns from that table.
     ```sql
     SELECT columns
     FROM table1
     FULL JOIN table2
     ON table1.column = table2.column;
     ```

5. **CROSS JOIN**:
   - Returns the Cartesian product of the two tables, i.e., it returns all possible combinations of rows from both tables.
     ```sql
     SELECT columns
     FROM table1
     CROSS JOIN table2;
     ```

6. **SELF JOIN**:
   - A self join is a regular join in which a table is joined with itself; aliases distinguish the two copies.
     ```sql
     SELECT a.columns, b.columns
     FROM table1 AS a
     JOIN table1 AS b
     ON condition;
     ```
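
The difference between the join types listed above is easiest to see on a tiny dataset. The sketch below assumes Python's built-in `sqlite3` module with an in-memory database; the two tables borrow the notebook's `users` and `storage` names, but their columns are simplified and the rows are made up. RIGHT and FULL joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, made-up versions of the users and storage tables.
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE storage (user_id INTEGER, space INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cy")])
conn.executemany("INSERT INTO storage VALUES (?, ?)", [(1, 100), (3, 250)])

inner = conn.execute("""
    SELECT u.name, s.space
    FROM users AS u
    INNER JOIN storage AS s ON s.user_id = u.id
    ORDER BY u.id
""").fetchall()

left = conn.execute("""
    SELECT u.name, s.space
    FROM users AS u
    LEFT JOIN storage AS s ON s.user_id = u.id
    ORDER BY u.id
""").fetchall()

(cross_count,) = conn.execute(
    "SELECT COUNT(*) FROM users CROSS JOIN storage").fetchone()

print(inner)        # [('Ann', 100), ('Cy', 250)]
print(left)         # [('Ann', 100), ('Bob', None), ('Cy', 250)]
print(cross_count)  # 6, i.e. 3 users x 2 storage rows
conn.close()
```

Bob has no storage row, so he disappears from the inner join and appears with a NULL (`None` in Python) in the left join.
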

In [35]:
query = """
    SELECT u.name, s.space, s.price
    FROM users AS u
    LEFT JOIN storage AS s ON s.user_id = u.id;
"""
con.sql(query).df().head()

Unnamed: 0,name,space,price
0,Sarah Baldwin,740,24733.0
1,Stephanie Russell,143,84623.0
2,Alicia Morse,907,32985.0
3,Amy Nichols,389,30685.0
4,Melinda Morgan,923,1024.0


### **Temporary Tables**

Often, we want to combine two tables, but one of them is the result of another calculation. The `WITH` clause gives the result of a query a name, known as a common table expression (CTE), which behaves like a temporary table for the duration of the statement. Multiple CTEs can be defined with one instance of the `WITH` keyword, separated by commas.

When you have a complex query that involves multiple subqueries or aggregations, a WITH clause makes the code more readable by allowing you to define subqueries as named, reusable expressions. You can use these expressions multiple times within the main query, keeping the code cleaner and reducing redundancy.


In [36]:
query = """
    WITH user_counts AS (
        SELECT age, COUNT(*) AS count
        FROM users
        GROUP BY age
    )
    SELECT *
    FROM user_counts
    WHERE count > 1
"""
con.sql(query).df().head()

Unnamed: 0,age,count
0,19,4
1,20,4
2,24,3
3,25,3
4,29,2


### **Concatenating Tables**


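The `UNION` operator combines the results of multiple SELECT statements into one result set and removes duplicate rows. Each SELECT must return the same number of columns with compatible types; use `UNION ALL` to keep duplicates.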

```sql
SELECT name
FROM first_names
UNION
SELECT name
FROM last_names
```
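
As a rough illustration of how `UNION` differs from `UNION ALL`, the sketch below assumes Python's built-in `sqlite3` module with an in-memory database. The `first_names` and `last_names` tables follow the example above, and their rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_names (name TEXT)")
conn.execute("CREATE TABLE last_names (name TEXT)")
# 'Jordan' appears in both tables on purpose.
conn.executemany("INSERT INTO first_names VALUES (?)", [("Alex",), ("Jordan",)])
conn.executemany("INSERT INTO last_names VALUES (?)", [("Jordan",), ("Smith",)])

union = [r[0] for r in conn.execute("""
    SELECT name FROM first_names
    UNION
    SELECT name FROM last_names
    ORDER BY name
""")]
union_all = [r[0] for r in conn.execute("""
    SELECT name FROM first_names
    UNION ALL
    SELECT name FROM last_names
    ORDER BY name
""")]

print(union)      # ['Alex', 'Jordan', 'Smith']
print(union_all)  # ['Alex', 'Jordan', 'Jordan', 'Smith']
conn.close()
```
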

### **Subqueries**


What happens when we query a database but we really only need a subset of the results returned? How is this situation handled when the subset of data needed spans across multiple tables? 

As the name suggests, a subquery is an internal query nested inside an external query. Subqueries can be nested inside SELECT, INSERT, UPDATE, or DELETE statements. Conceptually, a subquery that does not reference the outer query is evaluated before the external statement runs. Subqueries and joins can often express the same result; joins are frequently more efficient while subqueries are often easier to read, although the actual performance depends on the database's query optimizer.

One of the more common ways to use subqueries is with the use of an IN or NOT IN clause. When an IN clause is used, results retrieved from the external query must appear within the subquery results. Similarly, when a NOT IN clause is used, results retrieved from the external query must not appear within the subquery results.

```sql
DELETE FROM statistics_students
WHERE id NOT IN (
  SELECT id 
  FROM history_students);
```

A subquery that returns a single value (a scalar subquery) can take the place of an expression in a SQL query. As such, one way of using subqueries is with comparison operators such as <, >, =, and != to compare values from the external query with the result of the inner query.

```sql
SELECT * 
FROM history_students
WHERE grade <= (
  SELECT grade
  FROM statistics_students
  WHERE id = 1);

```
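
Both subquery styles above can be sketched on made-up data. The example assumes Python's built-in `sqlite3` module with an in-memory database; the two student tables reuse the names from the text, but their rows are hypothetical, and SELECT is used instead of DELETE so nothing is modified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history_students (id INTEGER, grade INTEGER)")
conn.execute("CREATE TABLE statistics_students (id INTEGER, grade INTEGER)")
conn.executemany("INSERT INTO history_students VALUES (?, ?)",
                 [(1, 9), (2, 10), (3, 12)])
conn.executemany("INSERT INTO statistics_students VALUES (?, ?)",
                 [(1, 10), (4, 12)])

# IN / NOT IN: compare each outer row against the list of ids from the subquery.
in_both = [r[0] for r in conn.execute("""
    SELECT id FROM statistics_students
    WHERE id IN (SELECT id FROM history_students)
""")]
only_statistics = [r[0] for r in conn.execute("""
    SELECT id FROM statistics_students
    WHERE id NOT IN (SELECT id FROM history_students)
""")]

# Scalar subquery: the inner query returns one value (grade 10 for id = 1).
at_or_below = [r[0] for r in conn.execute("""
    SELECT id FROM history_students
    WHERE grade <= (SELECT grade FROM statistics_students WHERE id = 1)
    ORDER BY id
""")]

print(in_both)          # [1]
print(only_statistics)  # [4]
print(at_or_below)      # [1, 2]
conn.close()
```
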


### **EXISTS Operator**

The `EXISTS` operator tests whether a subquery returns at least one row, and `NOT EXISTS` tests that it returns none. EXISTS/NOT EXISTS can be more efficient than IN/NOT IN, because IN/NOT IN conceptually has to collect all rows returned by the subquery, whereas EXISTS/NOT EXISTS can stop as soon as one matching row is found. In practice the difference depends on the query optimizer.

Note that with EXISTS, we must include a WHERE clause within the subquery that defines the criteria we are checking. Here, we specify that in order for the subquery to return true, it must locate a row with the same id value as the outer query.
```sql
SELECT * 
FROM statistics_students
WHERE EXISTS (
  SELECT * 
  FROM history_students
  WHERE history_students.id = statistics_students.id
);
```
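
A minimal sketch of `EXISTS` and `NOT EXISTS`, assuming Python's built-in `sqlite3` module with an in-memory database and hypothetical rows for the two student tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history_students (id INTEGER)")
conn.execute("CREATE TABLE statistics_students (id INTEGER)")
conn.executemany("INSERT INTO history_students VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO statistics_students VALUES (?)", [(1,), (4,)])

# The inner WHERE ties each subquery row to the current outer row.
also_in_history = [r[0] for r in conn.execute("""
    SELECT s.id
    FROM statistics_students AS s
    WHERE EXISTS (
        SELECT 1 FROM history_students AS h WHERE h.id = s.id
    )
""")]
not_in_history = [r[0] for r in conn.execute("""
    SELECT s.id
    FROM statistics_students AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM history_students AS h WHERE h.id = s.id
    )
""")]

print(also_in_history)  # [1]
print(not_in_history)   # [4]
conn.close()
```

`SELECT 1` is used inside the subquery because `EXISTS` only checks whether a row is returned, not what it contains.
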


## **Window Functions**

Window functions allow you to maintain the values of your original table while displaying grouped or summative information alongside in another column. This is why many Data Scientists and Data Engineers love to use window functions for complex data analysis.

### **OVER Clause**

The `OVER` clause designates a function as a window function. Inside it, `ORDER BY` sets the order in which rows are processed; with an aggregate such as `SUM`, the default window then runs from the first row up to the current row. This is how you get running totals, running averages, or running counts.

```sql
SELECT 
   month,
   change_in_followers,
   SUM(change_in_followers) OVER (
      ORDER BY month
   ) AS running_total
FROM
   social_media;
```
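
The running total can be checked on a few rows. This sketch assumes Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer) and a hypothetical single-account version of the `social_media` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-account version of the social_media table.
conn.execute(
    "CREATE TABLE social_media (month INTEGER, change_in_followers INTEGER)")
conn.executemany("INSERT INTO social_media VALUES (?, ?)",
                 [(1, 10), (2, -3), (3, 5)])

running = conn.execute("""
    SELECT month,
           change_in_followers,
           SUM(change_in_followers) OVER (ORDER BY month) AS running_total
    FROM social_media
    ORDER BY month
""").fetchall()

print(running)  # [(1, 10, 10), (2, -3, 7), (3, 5, 12)]
conn.close()
```
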



### **PARTITION BY Subclause**

`PARTITION BY` is a subclause of the `OVER` clause and divides a query’s result set into parts. It’s very similar to `GROUP BY` except it does not reduce the number of rows returned. While using `GROUP BY` only allows one row to be returned for each group, `PARTITION BY` allows you to see all of the resultant rows.

```sql
SELECT 
    username,
    month,
    change_in_followers,
    SUM(change_in_followers) OVER (
      PARTITION BY username 
      ORDER BY month
    ) AS running_total_followers_change
FROM
    social_media;
```
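
To contrast `PARTITION BY` with `GROUP BY`, the sketch below runs both over the same made-up rows. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) and a hypothetical `social_media` table with two users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE social_media (
        username TEXT, month INTEGER, change_in_followers INTEGER)
""")
conn.executemany("INSERT INTO social_media VALUES (?, ?, ?)",
                 [("ann", 1, 10), ("ann", 2, 5), ("bob", 1, 3), ("bob", 2, -1)])

# PARTITION BY: every input row is kept, and the total restarts per user.
windowed = conn.execute("""
    SELECT username, month,
           SUM(change_in_followers) OVER (
               PARTITION BY username ORDER BY month
           ) AS running_total
    FROM social_media
    ORDER BY username, month
""").fetchall()

# GROUP BY: rows are collapsed to one per user.
grouped = conn.execute("""
    SELECT username, SUM(change_in_followers)
    FROM social_media
    GROUP BY username
    ORDER BY username
""").fetchall()

print(windowed)  # [('ann', 1, 10), ('ann', 2, 15), ('bob', 1, 3), ('bob', 2, 2)]
print(grouped)   # [('ann', 15), ('bob', 2)]
conn.close()
```
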



### **FIRST_VALUE Operator**

In the past, when we wanted to get the first or last value of a query, we might use the `LIMIT` clause, probably in conjunction with `ORDER BY`, which would return one result showing us the first or last value from our dataset. With window functions, we can return our first or last values alongside our other data by using the `FIRST_VALUE` or `LAST_VALUE` functions.

```sql
SELECT username,
   posts,
   FIRST_VALUE (posts) OVER (
      PARTITION BY username 
      ORDER BY posts
   ) AS fewest_posts
FROM social_media;
```



### **LAST_VALUE Operator**

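`LAST_VALUE` needs more care. With an `ORDER BY` in the window, the default frame ends at the current row, so `LAST_VALUE` would simply return the current row's value. To get `LAST_VALUE` to show us the most posts for a user, we need to specify a frame for our window function.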
```sql
SELECT
   username,
   posts,
   LAST_VALUE (posts) OVER (
      PARTITION BY username 
      ORDER BY posts
      RANGE BETWEEN UNBOUNDED PRECEDING AND 
      UNBOUNDED FOLLOWING
    ) most_posts
FROM
    social_media;
```

`RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` specifies the frame for our window function as the current partition and thus returns the highest number of posts in one month for each user.
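
The effect of the frame is easiest to see side by side. The following sketch assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) and made-up rows; it computes `FIRST_VALUE`, `LAST_VALUE` with the default frame, and `LAST_VALUE` with the explicit frame.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE social_media (username TEXT, posts INTEGER)")
conn.executemany("INSERT INTO social_media VALUES (?, ?)",
                 [("ann", 3), ("ann", 7), ("ann", 5), ("bob", 2), ("bob", 9)])

rows = conn.execute("""
    SELECT username,
           posts,
           FIRST_VALUE(posts) OVER (
               PARTITION BY username ORDER BY posts
           ) AS fewest_posts,
           LAST_VALUE(posts) OVER (
               PARTITION BY username ORDER BY posts
           ) AS last_with_default_frame,
           LAST_VALUE(posts) OVER (
               PARTITION BY username ORDER BY posts
               RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS most_posts
    FROM social_media
    ORDER BY username, posts
""").fetchall()

for row in rows:
    print(row)
# ('ann', 3, 3, 3, 7)
# ('ann', 5, 3, 5, 7)
# ('ann', 7, 3, 7, 7)
# ('bob', 2, 2, 2, 9)
# ('bob', 9, 2, 9, 9)
conn.close()
```

With the default frame, the fourth column just repeats the current row's `posts`; only the explicit frame yields the per-user maximum.
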


### **LAG and LEAD Operator**
Window functions can use `LAG` or `LEAD` to access a row at a specified offset before (`LAG`) or after (`LEAD`) the current row, which is very useful for calculating the difference between the current row and an adjacent one. Both take up to three arguments: the column, the offset (1 by default), and a default value to return when no such row exists (NULL if omitted), as in `LAG(streams_millions, 1, 0)` below.
```sql
SELECT
   artist,
   week,
   streams_millions,
   LAG(streams_millions, 1, 0) OVER (
      PARTITION BY artist
      ORDER BY week 
   ) previous_week_streams 
FROM
   streams;
```
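
A small sketch of `LAG` and `LEAD` on made-up data, assuming Python's built-in `sqlite3` module (SQLite 3.25 or newer) and a hypothetical `streams` table with a single artist:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE streams (artist TEXT, week INTEGER, streams_millions INTEGER)")
conn.executemany("INSERT INTO streams VALUES (?, ?, ?)",
                 [("A", 1, 10), ("A", 2, 12), ("A", 3, 11)])

rows = conn.execute("""
    SELECT week,
           streams_millions,
           -- previous week's value, or 0 when there is no previous row
           LAG(streams_millions, 1, 0) OVER (
               PARTITION BY artist ORDER BY week
           ) AS previous_week_streams,
           -- next week's value, or NULL when there is no next row
           LEAD(streams_millions, 1) OVER (
               PARTITION BY artist ORDER BY week
           ) AS next_week_streams
    FROM streams
    ORDER BY week
""").fetchall()

print(rows)  # [(1, 10, 0, 12), (2, 12, 10, 11), (3, 11, 12, None)]
conn.close()
```
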



### **ROW_NUMBER Operator**
The most straightforward way to order our results is by using the `ROW_NUMBER` function, which adds a sequential integer to each row. Adding a `ROW_NUMBER` to each row can be useful for seeing where in your result set the row falls.
```sql
SELECT 
   ROW_NUMBER() OVER (
      PARTITION BY week
      ORDER BY streams_millions DESC
   ) AS row_num,
   artist, 
   week,
   streams_millions
FROM streams;
```


### **RANK Operator**
Now that we understand how to use `ROW_NUMBER`, there is another function that is similar but provides an actual ranking: `RANK`. If you were to modify your `ROW_NUMBER` query to use `RANK` instead, it might appear to be exactly the same at first glance. But if you look more closely, you can see that `RANK` follows standard ranking rules: when two values are the same, they receive the same rank and the following rank is skipped, whereas `ROW_NUMBER` always assigns distinct numbers.
```sql
SELECT 
   RANK() OVER (
      PARTITION BY week
      ORDER BY streams_millions DESC
   ) AS "rank",
   artist, 
   week,
   streams_millions
FROM streams;
```
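
The difference between `ROW_NUMBER` and `RANK` only shows up when there are ties. The sketch below assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) and a made-up `streams` table in which two artists tie. The extra `artist` tie-breaker in the `ROW_NUMBER` window is an addition of this sketch so that the numbering is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streams (artist TEXT, streams_millions INTEGER)")
# A and B are tied on purpose.
conn.executemany("INSERT INTO streams VALUES (?, ?)",
                 [("A", 20), ("B", 20), ("C", 15)])

rows = conn.execute("""
    SELECT artist,
           streams_millions,
           ROW_NUMBER() OVER (
               ORDER BY streams_millions DESC, artist
           ) AS row_num,
           RANK() OVER (
               ORDER BY streams_millions DESC
           ) AS stream_rank
    FROM streams
    ORDER BY artist
""").fetchall()

print(rows)  # [('A', 20, 1, 1), ('B', 20, 2, 1), ('C', 15, 3, 3)]
conn.close()
```

The tied artists share rank 1, and rank 2 is skipped, while the row numbers stay sequential.
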

### **NTILE Operator**
`NTILE` splits the ordered rows of each partition into a given number of roughly equal groups (buckets) and returns the bucket number for each row. Its argument is the number of buckets you’d like your data broken down into, for example 4 for quartiles.
```sql
SELECT 
   NTILE(4) OVER (
      PARTITION BY week
      ORDER BY streams_millions DESC
   ) AS quartile, 
   artist, 
   week,
   streams_millions
FROM streams;
```
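
"Roughly equal" becomes concrete when the row count is not a multiple of the bucket count. This sketch assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) and six made-up rows split into four buckets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streams (artist TEXT, streams_millions INTEGER)")
conn.executemany(
    "INSERT INTO streams VALUES (?, ?)",
    [("A", 60), ("B", 50), ("C", 40), ("D", 30), ("E", 20), ("F", 10)],
)

rows = conn.execute("""
    SELECT artist,
           NTILE(4) OVER (ORDER BY streams_millions DESC) AS quartile
    FROM streams
    ORDER BY streams_millions DESC
""").fetchall()

# Six rows in four buckets: the first two buckets get two rows, the rest one.
print(rows)
# [('A', 1), ('B', 1), ('C', 2), ('D', 2), ('E', 3), ('F', 4)]
conn.close()
```
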