# SQL Data Transformation
© Explore Data Science Academy

## Learning Objectives

In this train, you will learn to:
* Use the **`DISTINCT`** statement to unearth underlying categories.
* Make use of the **`CASE`** statement and **`IIF()`** function to create conditional logic within queries.
* Make use of the **`COALESCE()`** function to get rid of null values.
* Make use of the **`NULLIF()`** function to remove unwanted values.
* Cast variables to suitable data types.

## Outline

This train is structured as follows: 
* Imports and DB connections.
* The **`DISTINCT`**  statement.
* The **`CASE`** statement and **`IIF()`** function.
* The **`COALESCE()`** function.
* The **`NULLIF()`** function. 
* Casting of variables.


## Introduction

In this train, you will learn how to transform your data into a usable format using control flow statements and functions in SQLite. To achieve these learning objectives, we will use the ***SoftDevEmployees.db*** database, which contains basic employee data from ***SoftDev***, a company that specialises in mobile and software development.

Data preparation and transformation are just as important as the downstream techniques used to gain insights from data, such as machine learning algorithms or visualisation. In fact, if your data are not organised and structured appropriately, chances are that any results you produce will be spurious. With this in mind, let's learn how to transform our data appropriately within SQL!

## Imports and DB Connections

Please use the command below to install **sql_magic** if you do not already have it. We will use this package to run SQL in notebook cells with syntax highlighting.
* `pip install sql_magic`

Remember to start each new cell with:  **`%%read_sql`**

In [1]:
import sqlite3
import csv
from sqlalchemy import create_engine
%load_ext sql_magic

# Create engine instance using sqlalchemy
engine = create_engine("sqlite:///SoftDevEmployees.db")
%config SQL.conn_name = 'engine'

# Create connection object using sqlite3
conn = sqlite3.connect('SoftDevEmployees.db')
cursor = conn.cursor()

## DISTINCT

The first statement that we introduce is the **`DISTINCT`** statement. `DISTINCT` allows us to identify the unique values that exist in a database table, which helps us remove duplicate data. It is also helpful when trying to discern the number of categories present in our data for a particular attribute.

The statement is used in conjunction with the **`SELECT`** statement, and the syntax takes the following form:

```SQL
SELECT 
    DISTINCT column_1, 
    column_2, 
    column_3,
    .
    .
    .,
    column_n
FROM TableName
```
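
Note that `DISTINCT` applies to the whole combination of selected columns, not to `column_1` alone. The sketch below illustrates this with Python's built-in `sqlite3` module and an in-memory database. The `Staff` table and its rows are hypothetical, made up for illustration, and are not taken from ***SoftDevEmployees.db***.

```python
import sqlite3

# Hypothetical toy table, not the SoftDevEmployees.db data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Staff (Department TEXT, Level TEXT)")
conn.executemany(
    "INSERT INTO Staff VALUES (?, ?)",
    [
        ("Mobile", "Junior"),
        ("Mobile", "Junior"),    # exact duplicate of the row above
        ("Mobile", "Senior"),
        ("Software", "Junior"),
    ],
)

# DISTINCT on one column: one row per unique Department.
departments = conn.execute(
    "SELECT DISTINCT Department FROM Staff ORDER BY Department"
).fetchall()

# DISTINCT on two columns: one row per unique (Department, Level) pair.
pairs = conn.execute(
    "SELECT DISTINCT Department, Level FROM Staff ORDER BY Department, Level"
).fetchall()

print(departments)
print(pairs)
conn.close()
```

Only the exact duplicate row is dropped in the two-column query; `Mobile` still appears twice because it is paired with two different levels.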

Let's make use of the **`DISTINCT`** statement to help us identify the categories that exist in our table!

In [None]:
%%read_sql

SELECT DISTINCT Department FROM Employees;

In [None]:
%%read_sql

SELECT DISTINCT Level FROM Employees;

In [None]:
%%read_sql

SELECT DISTINCT role FROM Employees;

The usage of the **`DISTINCT`** statement is quite simple, but one needs to be cognisant of the data contained in the column when trying to find categories. Values that are obviously the same to us as humans are not necessarily the same to a machine, and we sometimes have to cater for these instances. One example is:

```SQL
'Mr' <> 'mr' <> 'MR' <> 'mR'
```

Each of the above variations of 'Mr' will be seen as a separate category, which leads to duplicated categories. Let's see how we can get around this.

In [None]:
%%read_sql

SELECT DISTINCT UPPER(Title) FROM Employees;
-- REMOVE the UPPER function from the query. How many categories does our query now return? 
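
To see the effect of case normalisation in isolation, here is a small sketch using Python's built-in `sqlite3` module. The `People` table and its values are hypothetical and exist only for this illustration. It counts the categories with and without `UPPER()`.

```python
import sqlite3

# Hypothetical toy data with inconsistent capitalisation and one missing title.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (Title TEXT)")
conn.executemany(
    "INSERT INTO People VALUES (?)",
    [("Mr",), ("mr",), ("MR",), ("Ms",), ("ms",), (None,)],
)

# COUNT(DISTINCT ...) ignores NULL, so only real titles are counted.
raw_count = conn.execute(
    "SELECT COUNT(DISTINCT Title) FROM People"
).fetchone()[0]
clean_count = conn.execute(
    "SELECT COUNT(DISTINCT UPPER(Title)) FROM People"
).fetchone()[0]

# SELECT DISTINCT, by contrast, keeps NULL as a category of its own.
categories = conn.execute(
    "SELECT DISTINCT UPPER(Title) FROM People"
).fetchall()

print(raw_count, clean_count, len(categories))
conn.close()
```

Note the difference between the two approaches: `COUNT(DISTINCT ...)` skips `NULL`, while `SELECT DISTINCT` returns it as an extra row.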

## CASE Statement

The **`CASE`** statement is used to assign/associate a condition with a particular result. We can use this to create conditional logic that maps our input (column data) to our desired output.

The syntax for the **`CASE`** statement takes the following form:

```sql
CASE 
    WHEN condition_1 THEN result_1
    WHEN condition_2 THEN result_2
    .
    .
    .
    WHEN condition_n THEN result_n
[ELSE result_n+1]
END AS ColumnName
```
The `ELSE` clause is optional. If it is omitted and none of the conditions in the `CASE` statement are met, the result defaults to `NULL`.

Let's put this theory to the test by using the `CASE` statement to infer whether an employee from SoftDev is male or female, based on their assigned title:

In [None]:
%%read_sql

SELECT 
    Name, 
    Title, 
    CASE 
        WHEN UPPER(Title) IN ('MS','MRS','MISS') THEN 'Female'
        WHEN UPPER(Title) IN ('MR') THEN 'Male'
        WHEN UPPER(Title) IS NULL THEN 'Value not specified'
        ELSE 'Cannot Determine from Title'
    END AS Gender
FROM
    Employees
ORDER BY Name
LIMIT 5;
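
The sketch below runs the same `CASE` logic against a small in-memory table using Python's built-in `sqlite3` module, so that each branch is exercised. It also includes a second `CASE` without an `ELSE` clause to show the `NULL` default. The `People` table, its rows, and the `GenderNoElse` alias are hypothetical and used only for illustration.

```python
import sqlite3

# Hypothetical toy data: one row for each branch of the CASE statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (Name TEXT, Title TEXT)")
conn.executemany(
    "INSERT INTO People VALUES (?, ?)",
    [("Ann", "ms"), ("Ben", "Mr"), ("Cat", "Dr"), ("Dan", None)],
)

query = """
SELECT
    Name,
    CASE
        WHEN UPPER(Title) IN ('MS', 'MRS', 'MISS') THEN 'Female'
        WHEN UPPER(Title) = 'MR' THEN 'Male'
        WHEN Title IS NULL THEN 'Value not specified'
        ELSE 'Cannot Determine from Title'
    END AS Gender,
    -- No ELSE clause here: rows that match no condition become NULL.
    CASE
        WHEN UPPER(Title) = 'MR' THEN 'Male'
    END AS GenderNoElse
FROM People
ORDER BY Name;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Conditions are evaluated from top to bottom and the first match wins, so the order of the `WHEN` branches matters when conditions overlap.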

## IIF() Function

The **`IIF()`** function also allows you to create conditional 'if-else' logic within your SQL queries.

The syntax for the **`IIF()`** function takes the following form:


```sql
IIF(condition_x,result_1,result_2)
```

Given `condition_x`, the function returns `result_1` if the condition is **TRUE**; otherwise (the condition is **FALSE** or `NULL`) it returns `result_2`.

This function call is equivalent to:

```sql
CASE
    WHEN condition_x THEN result_1
    ELSE result_2
END
```

Let's use the **`IIF()`** function to recreate the gender column.

**NB: The `IIF()` function is available from version 3.32.0 of SQLite onwards. If the cell below returns an error, you may need to update your SQLite installation. Also note that this cell may not run in hosted environments such as Colab if they ship an older version of SQLite.**

In [None]:
%%read_sql

SELECT 
    Name, 
    Title, 
    IIF(UPPER(Title) IN ('MS', 'MRS', 'MISS'), 'Female',
        IIF(UPPER(Title) IN ('MR'), 'Male', 'Cannot Determine from Title')) AS Gender
FROM
    Employees
ORDER BY Name
LIMIT 5;
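
Because `IIF()` depends on the SQLite version, one possible approach is to check the version from Python and fall back to the equivalent `CASE` expression when `IIF()` is unavailable. The sketch below assumes Python's built-in `sqlite3` module. The `People` table, its rows, and the `has_iif` and `gender_expr` names are hypothetical and used only for illustration.

```python
import sqlite3

# Hypothetical toy data, including a missing title.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (Name TEXT, Title TEXT)")
conn.executemany(
    "INSERT INTO People VALUES (?, ?)",
    [("Ann", "ms"), ("Ben", "Mr"), ("Cat", "Dr"), ("Dan", None)],
)

# IIF() was added in SQLite 3.32.0; check the library version in use.
has_iif = sqlite3.sqlite_version_info >= (3, 32, 0)

if has_iif:
    gender_expr = (
        "IIF(UPPER(Title) IN ('MS', 'MRS', 'MISS'), 'Female', "
        "IIF(UPPER(Title) = 'MR', 'Male', 'Cannot Determine from Title'))"
    )
else:
    # Equivalent CASE expression for older SQLite versions.
    gender_expr = (
        "CASE WHEN UPPER(Title) IN ('MS', 'MRS', 'MISS') THEN 'Female' "
        "WHEN UPPER(Title) = 'MR' THEN 'Male' "
        "ELSE 'Cannot Determine from Title' END"
    )

rows = conn.execute(
    f"SELECT Name, {gender_expr} AS Gender FROM People ORDER BY Name"
).fetchall()

print("SQLite version:", sqlite3.sqlite_version, "| IIF available:", has_iif)
for row in rows:
    print(row)
conn.close()
```

Both forms give the same result. A `NULL` title makes each condition evaluate to `NULL` rather than true, so that row falls through to the final fallback value.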

## COALESCE() function

Given a set of input arguments, the **`COALESCE()`** function works by returning the first non-null argument value. While this functionality may seem strange, `COALESCE()` is extremely versatile and can be used for many tricky tasks, including exception handling and creating compact conditional statements (see [here](https://www.sqltutorial.org/sql-comparison-functions/sql-coalesce/) for some examples).  

In our case, we'll use the function to ensure that a particular column does not have a null value - providing a fallback or default entry within our query result.

The syntax of the **`COALESCE()`** function takes the following form:

```sql
   COALESCE(value_1,value_2,value_3,...,value_n)
```
This returns the first non-null value in `[value_1,value_2,value_3,...,value_n]`. For example, if:

```sql 
value_1 = NULL and value_2 = NULL and value_3 = 'EXPLORE EDSA'
```
then 

```sql
   COALESCE(value_1,value_2,value_3)
```

will return '***EXPLORE EDSA***'.

In [16]:
%%read_sql

SELECT 
    Name,
    Surname,
    Level,
    Title,
    COALESCE(Title, 'No title available') AS Title_Coalesce
FROM Employees
LIMIT 5;

Query started at 12:50:25 PM SAST; Query executed in 0.00 m

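|   | Name | Surname | Level | Title | Title_Coalesce |
|---|------|---------|-------|-------|----------------|
| 0 | Dumisani | Thwala | Graduate |  | No title available |
| 1 | Tony | Horn | Graduate | Mr | Mr |
| 2 | Vuyokazi | barnes | Graduate | Mr | Mr |
| 3 | sello | Details | Graduate | Mr | Mr |
| 4 | Jacqueline | fredericks | Graduate |  | No title available |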
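
The behaviour of `COALESCE()` on literal values can be checked directly with Python's built-in `sqlite3` module. This is a minimal sketch; the variable names are arbitrary and the `'fallback'` string is a made-up placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Literal arguments, mirroring the 'EXPLORE EDSA' example above.
first_non_null = conn.execute(
    "SELECT COALESCE(NULL, NULL, 'EXPLORE EDSA')"
).fetchone()[0]

# If every argument is NULL, the result is NULL (None in Python).
all_null = conn.execute("SELECT COALESCE(NULL, NULL)").fetchone()[0]

# An empty string is a value, not NULL, so COALESCE keeps it.
empty_string = conn.execute(
    "SELECT COALESCE('', 'fallback')"
).fetchone()[0]

print(repr(first_non_null), repr(all_null), repr(empty_string))
conn.close()
```

The last case is worth remembering: `COALESCE()` only replaces true `NULL` values, so placeholders such as empty strings pass through unchanged.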


## NULLIF() function

The **`NULLIF()`** function is used when we want particular values in our data to be returned as `NULL`.

The syntax of the **`NULLIF()`** function takes the following form:

```sql
NULLIF(value_1, value_2)
```
If `value_1 = value_2` (the values or expressions are equal), then the function will return `NULL`; otherwise it will return the contents of `value_1`.

Imagine the SoftDev CEO declared that no role should be assigned to interns, as they are expected to rotate through the different roles available in the company during their internship. To accommodate this, we could return a null value for the role of every intern. Let's write the query for this in the cell below:

In [17]:
%%read_sql
SELECT 
    Name,
    Surname,
    Level,
    NULLIF(Level, 'Intern') AS Role
FROM Employees
WHERE Level = 'Intern'

Query started at 12:54:00 PM SAST; Query executed in 0.00 m

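|   | Name | Surname | Level | Role |
|---|------|---------|-------|------|
| 0 | Jan | Ngwenya | Intern |  |
| 1 | Patience | Willemse | Intern |  |
| 2 | Dirk | Banda | Intern |  |
| 3 | Janine | De Villiers | Intern |  |
| 4 | barend | Edwards | Intern |  |
| 5 | Jabulani | Horn | Intern |  |
| 6 | kelly | Manuel | Intern |  |
| 7 | Claire | Morris | Intern |  |
| 8 | Janet | Patel | Intern |  |
| 9 | Pearl | Stewart | Intern |  |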
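
The query above filters to interns, so every row returns `NULL`. If other levels were included, `NULLIF(Level, 'Intern')` would return their `Level` value rather than their role. The sketch below uses Python's built-in `sqlite3` module to show what `NULLIF()` returns in both cases, alongside two related patterns: a `CASE` expression that blanks the role for interns only, and `NULLIF()` combined with `COALESCE()` to replace an empty-string placeholder. The `Staff` table, its rows, and the output aliases are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULLIF on literals: equal arguments give NULL, otherwise the first argument.
equal = conn.execute("SELECT NULLIF('Intern', 'Intern')").fetchone()[0]
different = conn.execute("SELECT NULLIF('Senior', 'Intern')").fetchone()[0]

# Hypothetical toy table, not the SoftDevEmployees.db data.
conn.execute("CREATE TABLE Staff (Name TEXT, Level TEXT, Role TEXT)")
conn.executemany(
    "INSERT INTO Staff VALUES (?, ?, ?)",
    [
        ("Ann", "Intern", "Developer"),
        ("Ben", "Senior", "Developer"),
        ("Cat", "Junior", ""),  # empty string used as a placeholder
    ],
)

rows = conn.execute(
    """
    SELECT
        Name,
        NULLIF(Level, 'Intern') AS LevelOrNull,
        CASE WHEN Level = 'Intern' THEN NULL ELSE Role END AS RoleOrNull,
        COALESCE(NULLIF(Role, ''), 'No role recorded') AS RoleLabel
    FROM Staff
    ORDER BY Name
    """
).fetchall()

print(repr(equal), repr(different))
for row in rows:
    print(row)
conn.close()
```

The `COALESCE(NULLIF(column, ''), default)` pattern is a common way to treat an unwanted placeholder value as missing and then supply a fallback.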


## Casting

Casting allows us to change the data type of a value into the one we wish to work with.

The syntax for casting takes the following form:

```sql
CAST(value AS datatype)
```

All major SQL flavours have comprehensive documentation around the datatypes they support. You can visit [here](https://www.sqlite.org/datatype3.html) for a quick review of these types for SQLite. 

One example of when casting is useful is when we need the correct precision for a calculation. The salaries in the current database are stored as integers. It's important to note that when we perform arithmetic on integer values, the results may not always be exact: in SQLite, dividing one integer by another discards the fractional part. Suppose all employees had to contribute a third of their salaries to both their pension and provident funds.

Let's see how the computation gets truncated if we do not convert to the correct type. We'll also show the correct answers by using `CAST()`.

In [19]:
%%read_sql
SELECT
    Salary,
    Salary / 3 AS Pension_INT,
    CAST(Salary AS REAL) / 3 AS Pension_REAL
FROM
    Employees
LIMIT 5;

Query started at 01:08:34 PM SAST; Query executed in 0.00 m

|   | Salary | Pension_INT | Pension_REAL |
|---|--------|-------------|--------------|
| 0 | 52171 | 17390 | 17390.333333 |
| 1 | 103397 | 34465 | 34465.666667 |
| 2 | 69220 | 23073 | 23073.333333 |
| 3 | 54945 | 18315 | 18315.0 |
| 4 | 51104 | 17034 | 17034.666667 |
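
The sketch below reproduces the first row of this result on literal values using Python's built-in `sqlite3` module, and uses SQLite's `typeof()` function to confirm the storage type of each result. It also shows `ROUND()` for presentation and two other casts for comparison. The variable names are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

row = conn.execute(
    """
    SELECT
        52171 / 3,                          -- integer division truncates
        typeof(52171 / 3),
        CAST(52171 AS REAL) / 3,            -- real division keeps the fraction
        typeof(CAST(52171 AS REAL) / 3),
        ROUND(CAST(52171 AS REAL) / 3, 2),  -- round to 2 decimals for display
        CAST('42' AS INTEGER),              -- numeric text to integer
        CAST(3.9 AS INTEGER)                -- real to integer truncates toward zero
    """
).fetchone()

int_div, int_type, real_div, real_type, rounded, text_to_int, real_to_int = row

print(int_div, int_type)
print(real_div, real_type)
print(rounded, text_to_int, real_to_int)
conn.close()
```

Casting only one operand to `REAL` is enough, since SQLite performs real arithmetic as soon as either operand is a real number.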


## Conclusion

In this train, we've learned a few ways to transform our data using new SQL functions and syntax. While these statements are fairly simple and easy to use, they become powerful when combined with additional functionality such as string manipulation techniques. As a future Data Scientist, the ability to transform your data into a usable format will be essential for your preprocessing efforts, allowing you to clean and prepare your data and give it desirable properties that will assist in your modelling pipeline.

## Appendix

<a href="https://www.sqlite.org/datatype3.html">Data types in SQLite</a>

<a href="">Data type Casting</a>

<a href="https://www.sqltutorial.org/sql-comparison-functions/sql-coalesce/"> Exception</a>