# Lesson 2.4: Data Cleaning, Operators & Main clauses - SQL

### Lesson Duration: 3 hours

> Purpose: This lesson continues with basic SQL queries, combining them with further operators, including `BETWEEN`, `LIKE`, and `REGEXP` (regular expressions), and with the `DISTINCT` keyword. We will also work through more examples on the five main clauses of the `SELECT` statement (`select`, `from`, `where`, `order by`, and `limit`), applying the operators and expressions we have looked at so far.

> Note: An expression is a combination of columns, values, operators, and functions that evaluates to a single value. In the select clause, you can write an expression with one or more operators and functions.
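
As a minimal sketch of what such an expression looks like, the query below builds three values from literals only, so it runs without any table. The numbers and the aliases (`remaining`, `monthly_share`, `label`) are made up for illustration.

```sql
-- Self-contained: every value is a literal, no table is needed.
select
    1000 - 250              as remaining,      -- arithmetic operator
    round(1000 / 3, 2)      as monthly_share,  -- function applied to an expression
    concat('loan', '-', 42) as label;          -- string function joining three values
```

In a real query the literals would typically be replaced by column names, as in `amount - payments` later in this lesson.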

---

### Setup

To start this lesson, students should have:

- Completed lesson 2.3
- All previous Setup

---

### Learning Objectives

After this lesson, students will be able to:

- Remove duplicates using the `DISTINCT` keyword
- Use operators including `BETWEEN`, `LIKE`, and `REGEXP`
- Work with the five main clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `LIMIT`

---

### Lesson 1 key concepts

> :clock10: 20 min

- Removing duplicate rows with `DISTINCT`
- Using `IN` operator
- Using `BETWEEN`

<details>
<summary> Click for Code Sample </summary>

:exclamation: Keep working on the `bank` database.

```sql
select A3 from bank.district;
select distinct A3 from bank.district;
```

```sql
select * from bank.order
where k_symbol in ('leasing', 'pojistne');
```

:exclamation: Note for instructor: Remind the students that string comparisons in MySQL are not case sensitive under the default collations, which means 'LEASING' and 'leasing' will be evaluated as equal.
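
A quick way to show this in class is to compare two literals directly. This is a sketch that assumes the server and connection use a default case-insensitive (`_ci`) collation; the aliases are illustrative.

```sql
select
    -- 1 when the collation is case-insensitive (the MySQL default)
    'LEASING' = 'leasing' as default_comparison,
    -- 0: comparing the values as binary strings looks at the bytes, which differ
    cast('LEASING' as binary) = cast('leasing' as binary) as byte_comparison;
```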

```sql
select * from bank.account
where district_id in (1,2,3,4,5);
```
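
For whole numbers, a list of consecutive values like the one above describes the same rows as an inclusive range. The sketch below compares three ways of writing the check on made-up numbers; the derived table `sample` and its column `n` are illustrative and not part of the `bank` database.

```sql
select n,
       n in (1, 2, 3, 4, 5) as in_list,
       n between 1 and 5    as in_range,         -- both endpoints are included
       (n >= 1 and n <= 5)  as two_comparisons   -- equivalent to BETWEEN
from (
    select 0 as n
    union all select 1
    union all select 5
    union all select 6
) as sample;
```

The three columns should agree on every row: 0 for the values 0 and 6, and 1 for the endpoints 1 and 5.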

```sql
-- We are trying to get the same result using the between operator.
-- Note that 1 and 5 are included in the range of values compared/evaluated

select * from bank.account
where district_id between 1 and 5;

select * from bank.loan
where amount - payments between 1000 and 10000;
```

</details>

    

# 2.06 Activity 1

Keep working on the `bank` database. The queries below use the `card`, `trans`, `loan`, and `district` tables.

#### Queries

1. Get different card types.
2. Get transactions in the first 15 days of 1993.
3. Get all running loans.
4. Find the different values from the field `A2` that start with the letter 'K'.
5. Find the different values from the field `A2` that end with the letter 'K'.
6. Discuss the possible use cases of using regular expressions in your query.

### Solutions:

### 1

```sql
select distinct type from bank.card;
```

### 2

```sql
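-- convert(date, date) casts the stored value of the `date` column to the DATE type
-- so that it can be compared with the two date literals (both ends included).
select * from bank.trans
where convert(date, date) between '1993-01-01' and '1993-01-15'
limit 10;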
```

### 3

```sql
-- Statuses 'C' and 'D' mark running contracts (OK so far / client in debt).
select * from bank.loan
where status in ('C', 'D');
```

### 4

```sql
select distinct a2 from bank.district
where a2 regexp '^k';
```

### 5

```sql
select distinct a2 from bank.district
where a2 regexp 'k$';
```

### 6

Regular expressions are useful:

- When the text is not standardized and users may write certain words in a few different ways, for example, gray and grey.
- To extract results when you know that there may be spelling errors in the input text.
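
The gray/grey case can be sketched with a character class, which accepts either letter in one position. The sample words and the alias `is_gray` are made up for illustration.

```sql
select w,
       w regexp 'gr[ae]y' as is_gray  -- [ae] accepts 'a' or 'e' in that position
from (
    select 'gray' as w
    union all select 'grey'
    union all select 'gruy'
) as sample;
```

Only the first two words should be flagged with 1.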

### Lesson 2 key concepts

> :clock10: 20 min

- Using `LIKE` operator
- Using `REGEXP`

> The `LIKE` and `REGEXP` operators can be used to extract rows that match a string pattern, called a _mask_. The mask for `LIKE` can contain special characters called wildcards: `%` matches any sequence of zero or more characters, and `_` matches exactly one character.
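
A minimal sketch of the two wildcards on literal strings, so that it runs without any table. The sample strings and aliases are illustrative.

```sql
select
    'north Moravia' like 'north%' as starts_with_north,  -- 1: % matches ' Moravia'
    'north' like 'north%'         as empty_tail,         -- 1: % also matches nothing
    'cat' like 'c_t'              as one_char,           -- 1: _ matches 'a'
    'cart' like 'c_t'             as two_chars;          -- 0: _ cannot match 'ar'
```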

<details>
<summary> Click for Code Sample </summary>

```sql
select * from bank.district
where A3 like 'north%';

select * from bank.district
where a3 like 'north_M%';
-- The underscore matches exactly one character, so this returns values such as
-- 'north Moravia' or 'north-Miami', but not 'northMoravia' or 'north  Moravia'
```

How does the result change if we use `%` instead of `_` in the previous query? Students will try this in the activity after this session.

```sql
select * from bank.district
where a3 regexp 'north';

-- Now we will take a look at another table
-- to see the difference between LIKE and REGEXP
select * from bank.order
where k_symbol regexp 's';

select * from bank.order
where k_symbol regexp '^s';

select * from bank.order
where k_symbol regexp 'o$';

-- The | operator combines alternatives: match values containing 'ip' or 'is'
select distinct k_symbol from bank.order
where k_symbol regexp 'ip|is';
```

</details>
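
Unlike a `LIKE` mask, a `REGEXP` pattern matches anywhere in the value unless it is anchored with `^` (start) or `$` (end). The sketch below applies the four patterns used above to a few made-up lowercase strings; the derived table `sample` is illustrative and not part of the `bank` database.

```sql
select s,
       s regexp 's'     as contains_s,
       s regexp '^s'    as starts_with_s,
       s regexp 'o$'    as ends_with_o,
       s regexp 'ip|is' as has_ip_or_is
from (
    select 'sipo' as s
    union all select 'leasing'
    union all select 'pojistne'
    union all select 'uver'
) as sample;
```

Here `'sipo'` should match all four patterns, `'leasing'` only the unanchored `'s'`, `'pojistne'` the unanchored `'s'` and the alternation, and `'uver'` none of them.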

# 2.06 Activity 2

#### Queries and questions

1. Can you use the following query:

```sql
select * from bank.district
where a3 like 'north%M%';
```

instead of:

```sql
select * from bank.district
where a3 like 'north_M%';
```

Try both the queries and check the results.

2. We looked at the following query in class:

```sql
select * from bank.district
where a2 regexp 'ch[e-r]';
```

Can you modify the query to return only the rows whose value in the **A2** column starts with **'CH'**?

3. Use the table `trans` for this query. Use the column `type` to test: "By default, in an ascending sort, special characters appear first, followed by numbers, and then letters."

4. Again use the table `trans` for this query. Use the column `k_symbol` to test: "Null values appear first if the order is ascending."

5. Pick any table and any column to test: "You can use any column from the table to sort the values even if that column is not used in the select statement." Check the difference by writing the query with and without that column (column used to sort the results) in the select statement.

### Solutions:

### 1

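Discussion: the two queries are not equivalent. `_` matches exactly one character, while `%` matches any sequence of characters, including none. The mask `'north%M%'` therefore matches everything that `'north_M%'` matches, plus any value that starts with `north` and contains an `m` anywhere later. Because `LIKE` is not case sensitive by default, a value such as 'north Bohemia' also qualifies.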
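
The difference can be sketched on a few literal strings. This assumes a default case-insensitive collation; the derived table `sample` and the aliases are illustrative.

```sql
select s,
       s like 'north_M%' as with_underscore,  -- exactly one character before the M
       s like 'north%M%' as with_percent      -- any number of characters before an M
from (
    select 'north Moravia' as s
    union all select 'north Bohemia'
    union all select 'northMoravia'
) as sample;
```

Under that assumption only 'north Moravia' should satisfy the `_` mask, while all three strings satisfy the `%` mask.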

### 2

```sql
select * from bank.district
where a2 regexp '^ch[e-r]';
```

### 3

```sql
select * from bank.trans
order by type;
```

### 4

```sql
select * from bank.trans
order by k_symbol;
```

### 5

```sql
select trans_id, type from bank.trans
order by balance;

select trans_id, type, balance from bank.trans
order by balance;
```

### Lesson 3 key concepts

> :clock10: 20 min

- More on Regexp

<details>
<summary> Click for Code Sample </summary>

```sql
select * from bank.district
where a2 regexp 'cesk[ey]';

select * from bank.district
where a2 regexp 'ch[e-r]';
```
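
A character class in square brackets stands for one character: `[ey]` accepts `e` or `y`, and `[e-r]` accepts any letter from `e` to `r`. A small sketch on made-up lowercase strings (not taken from the `district` table) shows which ones each pattern accepts.

```sql
select w,
       w regexp 'cesk[ey]' as cesk_e_or_y,
       w regexp 'ch[e-r]'  as ch_then_e_to_r
from (
    select 'ceske budejovice' as w
    union all select 'cesky krumlov'
    union all select 'cheb'
    union all select 'chvaletice'
) as sample;
```

The first two strings should match only `'cesk[ey]'`, `'cheb'` only `'ch[e-r]'`, and `'chvaletice'` neither, because `v` falls outside the range `e` to `r`.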

Here are some more examples on regular expressions:

![regular expressions](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/regular_expression_examples.png)

> Note: `LIKE` and `REGEXP` are usually slower than simple comparison operators. A `REGEXP` condition, or a `LIKE` mask that starts with a wildcard, generally cannot use an index, so every row has to be checked. Use them with care on large tables.

</details>

# 2.06 Activity 3

1. During the lesson, we mentioned that one of the primary reasons for normalizing tables is to eliminate data redundancy, which otherwise results in highly inefficient data storage. Which other problems do you think a non-normalized structure may have?

   Students can read more about normalization, its advantages, and its disadvantages [here](https://whatisdbms.com/normalization-in-dbms-anomalies-advantages-disadvantages/).

2. Later in the labs we will use another database that models a DVD rental store. The ERD (entity relationship diagram) for the database is linked below. You can also refer to the file `sakila-schema.pdf` in the `files_for_activities` folder.

   [sakila-schema.pdf](./files_for_activities/sakila-schema.pdf)

### Questions

- Identify the primary keys and foreign keys from the ER diagram.

### Solutions:

### 1 

Some other problems that can arise due to non-normalization of the database are:

- Slower query processing (due to inefficient storage of data)
- Data anomalies (INSERT, UPDATE, DELETE). We will talk about the anomalies in detail in later lessons
- Database maintenance becomes tedious



### 2 

There is no code for this question. The task is to identify the primary and foreign keys of the tables from the database entity relationship diagram.

### Lesson 4 key concepts

> :clock10: 20 min

Arrange results in ascending or descending order

- Using `order by` clause with one column
- Using `order by` clause with more than one column

<details>
<summary> Click for Code Sample </summary>

```sql
select distinct a2 from bank.district
order by a2;

select distinct a2 from bank.district
order by a2 asc;

select * from bank.district
order by a3;

select * from bank.district
order by a3 desc;
```

> Some important points to remember (note: the students will test the last three points by themselves in the activity):

- By default (if not specified), the order is ascending.
- By default, in an ascending sort, special characters appear first, followed by numbers, and then letters.
- Null values appear first if the order is ascending.
- You can use any column from the table to sort the values even if that column is not used in the select statement.
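
Two of these points can be sketched on a tiny made-up data set: the rows are sorted by `score`, which is not in the select list, and one `score` is `NULL`. The derived table `sample` and its columns are illustrative and not part of the `bank` database.

```sql
select label          -- score is used for sorting but is not selected
from (
    select 'b' as label, 2 as score
    union all select 'a', null
    union all select 'c', 1
) as sample
order by score;       -- ascending by default, so the NULL row comes first
```

In MySQL this should list `a` (the `NULL` score) first, followed by `c` and then `b`.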

```sql
select * from bank.order
order by account_id, bank_to;

select * from bank.order
order by account_id, bank_to, k_symbol;
```

</details>
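
With more than one column, rows are sorted by the first column, and each further column only decides the order among rows that tie on the columns before it. A minimal sketch on made-up values; the derived table reuses the column names `account_id` and `bank_to` for illustration only.

```sql
select account_id, bank_to
from (
    select 2 as account_id, 'YZ' as bank_to
    union all select 1, 'QR'
    union all select 2, 'AB'
    union all select 1, 'CD'
) as sample
order by account_id, bank_to;  -- bank_to breaks ties within each account_id
```

The expected order is (1, 'CD'), (1, 'QR'), (2, 'AB'), (2, 'YZ').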


# Lab | SQL Queries 4

In this lab, you will be using the [Sakila](https://dev.mysql.com/doc/sakila/en/) database of movie rentals. You have been using this database for a couple of labs already, but if you need to get the data again, refer to the official [installation link](https://dev.mysql.com/doc/sakila/en/sakila-installation.html).

The database is structured as follows:
![DB schema](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/database-sakila-schema.png)

<br><br>

### Instructions

1. Get film ratings.
2. Get release years.
3. Get all films with ARMAGEDDON in the title.
4. Get all films with APOLLO in the title.
5. Get all films whose title ends with APOLLO.
6. Get all films with the word DATE in the title.
7. Get the 10 films with the longest titles.
8. Get the 10 longest films.
9. How many films include **Behind the Scenes** content?
10. List films ordered by release year and title in alphabetical order.