# String Manipulation in SQL
© Explore Data Science Academy

## Learning Objectives

In this train, you will learn to:
* Find the Length of a String.
* Replace a substring with a substring of your choice.
* Remove whitespaces or string patterns of your choice. 
* Find specified substrings within your original string. 
* Find the index of a specified substring.
* Convert strings to lowercase or uppercase.
* Make use of the concatenation operator "||" in order to join two strings.

## Outline

This train is structured as follows: 

* Using the ***LENGTH()*** function.
* Using the ***REPLACE()*** function.
* Using the ***RTRIM()***, ***LTRIM()***, ***TRIM()*** functions.
* Using the ***SUBSTR()*** function.
* Using the ***INSTR()*** function.
* Using the ***UPPER()***, ***LOWER()*** functions.
* Concatenation Operator - ***||***.
* String Manipulation Exercises.


## Introduction

Often when working with data, you will find that it is not in a format or structure that is immediately usable for your use case. Being skilled in SQL string manipulation will help you turn unstructured data into a structured format so that you can perform generic transformations on it.

This is especially true for string data types. String Manipulation functions in SQL allow you to slice and dice your data whichever way you choose. Being able to handle strings in SQL will go a long way in helping you to organize and structure your data so that you can populate database tables or derive valuable information contained within your data.

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/String_Manipulation/Cartoon-Ninja-darwing_new.png"
     alt="Dummy image 1"
     style="float: center; padding-bottom=0.5em"
     width=500px/>
     Image by <a href="https://pixabay.com/users/newarta-4978945/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=4983545">Paul Diaconu</a> from <a href="https://pixabay.com/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=4983545">Pixabay</a>
</div>

Mastering string manipulation will bring you a step closer to becoming a SQL Ninja. So let's get started!


## Imports and DB Connections

Please use the command below to install **sql_magic** if you do not already have it. We will use this package to assist us with SQL syntax highlighting.
* `pip install sql_magic`

Remember to start each new cell with:  **`%%read_sql`**


In [1]:
import sqlite3
import csv
from sqlalchemy import create_engine
%load_ext sql_magic

# Create engine instance using sqlalchemy
engine = create_engine("sqlite:///Students.db")
%config SQL.conn_name = 'engine'

# Create connection object using sqlite3
conn = sqlite3.connect('Students.db')
cursor = conn.cursor()

### LENGTH

This function returns the length of a given string. It is important to note that any whitespaces that may exist in the string will also be counted. Let's use this function to determine the length of the `IDNumber` and `Name` fields in our database.

The syntax of the *LENGTH()* function takes the following form:

```sql
    LENGTH(string)
```
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/String_Manipulation/LENGTH.png" alt="Illustration of the length function" border="0">

In [2]:
%%read_sql
SELECT 
    Name,
    LENGTH(Name) AS LengthOfName,
    IDNumber,
    LENGTH(IDNumber) AS LengthOfID
FROM 
    Students
LIMIT 5;  

Query started at 12:46:30 AM SAST; Query executed in 0.00 m

Unnamed: 0,Name,LengthOfName,IDNumber,LengthOfID
0,Jan,5,#820410-5405-084#,17
1,Dumisani,8,9005272774082,13
2,Christopher,16,9011245483180,13
3,Marco,19,9902225381086,13
4,marthinus,9,8105294344187,13
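
The result above reports a length of 5 for "Jan" even though only three letters are visible. A minimal sketch of why this can happen, assuming the extra characters are plain spaces; it uses an in-memory SQLite database and literal strings, not the `Students` table:

```python
import sqlite3

# In-memory database, so this sketch never touches Students.db
conn = sqlite3.connect(":memory:")

# Hypothetical literals: a name padded in front, a clean name, a name padded at the end
lengths = conn.execute(
    """
    SELECT LENGTH('  Jan')           AS padded_front,
           LENGTH('Jan')             AS clean,
           LENGTH('Marco' || '  ')   AS padded_end
    """
).fetchone()

print(lengths)  # (5, 3, 7)
conn.close()
```

Every space counts towards the length, which is why `LENGTH` is a handy first check for hidden padding.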


### Try it yourself: Find all IDs that were entered incorrectly

From the above result we can see that there are some ID numbers that have been entered incorrectly as they have a length greater than 13. Let us write a SQL query to identify all the ID numbers that have been entered incorrectly using the length function.

In [5]:
%%read_sql
--Write your query here
SELECT IDNumber FROM Students
WHERE LENGTH(IDNumber) > 13
LIMIT 10;  

Query started at 12:51:50 AM SAST; Query executed in 0.00 m

Unnamed: 0,IDNumber
0,#820410-5405-084#
1,#501004-621-2182#
2,#751010-414-4187#
3,#530219-492-6185#
4,#950510-1851-081#
5,#561122-1763-085#
6,621207-5110-185
7,960628-4133-180
8,651225-0376-186
9,870816-0468-082


### REPLACE
The REPLACE function allows you to replace a specified string pattern found in your data with a string pattern of your choice.

The syntax of the *`REPLACE()`* function takes the following form:

```sql
REPLACE(string, pattern, replacement_string)
```

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/String_Manipulation/REPLACE.png" alt="Illustration of the REPLACE function" border="0">

Let's see how we can use this to correct the ID numbers that have a LENGTH greater than 13.

In [4]:
%%read_sql

SELECT 
    IDNumber,
    LENGTH(IDNumber) AS LengthOfID,
    REPLACE(IDNumber,'-','##') AS HashtagID, -- replace with arbitrary string pattern
    REPLACE(REPLACE(IDNumber,'-',''),'#','') AS CorrectIDFormat, -- correct ID format such that we only have 13 characters in the string,
    LENGTH(REPLACE(REPLACE(IDNumber,'-',''),'#','')) AS LengthOfCorrectID
FROM 
    Students
WHERE 
    LengthOfID<>13

Query started at 10:23:51 AM South Africa Standard Time; Query executed in 0.00 m

Unnamed: 0,IDNumber,LengthOfID,HashtagID,CorrectIDFormat,LengthOfCorrectID
0,#820410-5405-084#,17,#820410##5405##084#,8204105405084,13
1,#501004-621-2182#,17,#501004##621##2182#,5010046212182,13
2,#751010-414-4187#,17,#751010##414##4187#,7510104144187,13
3,#530219-492-6185#,17,#530219##492##6185#,5302194926185,13
4,#950510-1851-081#,17,#950510##1851##081#,9505101851081,13
5,#561122-1763-085#,17,#561122##1763##085#,5611221763085,13
6,621207-5110-185,15,621207##5110##185,6212075110185,13
7,960628-4133-180,15,960628##4133##180,9606284133180,13
8,651225-0376-186,15,651225##0376##186,6512250376186,13
9,870816-0468-082,15,870816##0468##082,8708160468082,13
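
The nesting used above can be checked on a single value. The sketch below is an illustration on an in-memory database: it takes one ID from the output and also shows that the search argument is matched as one literal string, not as a list of alternatives. The variable names are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

sample_id = "#820410-5405-084#"  # one of the badly formatted IDs shown above

# The inner REPLACE strips dashes, the outer REPLACE strips hashtags
cleaned_id, cleaned_length = conn.execute(
    """
    SELECT REPLACE(REPLACE(:id, '-', ''), '#', ''),
           LENGTH(REPLACE(REPLACE(:id, '-', ''), '#', ''))
    """,
    {"id": sample_id},
).fetchone()
print(cleaned_id, cleaned_length)  # 8204105405084 13

# 'a,i' is searched for as the exact three-character text "a,i",
# so nothing is replaced here
unchanged = conn.execute("SELECT REPLACE('marthinus', 'a,i', '')").fetchone()[0]
print(unchanged)  # marthinus
conn.close()
```

To replace several different characters, each one needs its own `REPLACE()` call.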


### Try it yourself: Remove all vowels from the students' names

Write a query that will remove all the lowercase vowels (a, e, i, o, u) from a student's name. 

***Hint:*** *You will need to nest your **REPLACE()** function calls*

In [7]:
%%read_sql
--Write your query here

SELECT 
    -- each nested REPLACE strips one lowercase vowel
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Name,'a',''),'e',''),'i',''),'o',''),'u','') AS ReplacedNames 
FROM
    Students
LIMIT 10;  

Query started at 12:58:39 AM SAST; Query executed in 0.00 m

Unnamed: 0,ReplacedNames
0,Jn
1,Dmsn
2,Chrstphr
3,Mrc
4,mrthns
5,Ptnc
6,Tny
7,gglth
8,Tml
9,Prscll


### TRIM, RTRIM, LTRIM

The `TRIM`, `RTRIM` and `LTRIM` functions allow you to either trim the whitespaces or string patterns that are found at the beginning or end of a string:

* **TRIM** allows you to trim the whitespaces or string patterns both at the beginning and at the end of a string.
* **LTRIM** allows you to trim whitespaces or string patterns that are to the **L**eft of the string - beginning of the string.
* **RTRIM** allows you to trim whitespaces or string patterns that are to the **R**ight of the string - end of the string.

The syntax of the *`TRIM()`, `RTRIM()`, `LTRIM()`* functions takes the following forms:

```sql
    TRIM(string[,pattern]), RTRIM(string[,pattern]), LTRIM(string[,pattern])
```


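The `[,pattern]` argument is optional. If it is omitted, the function removes only whitespace. Note that SQLite treats this argument as a set of characters to strip from the ends of the string, not as an exact substring.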

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/String_Manipulation/TRIM.png" alt="Illustration of different TRIM functions" border="0">
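
To make the behaviour of the three functions concrete, here is a small sketch on literal strings, run against an in-memory SQLite database. The sample values are made up for illustration. The last check assumes the stock SQLite build, where the one-argument form strips spaces only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

trimmed = conn.execute(
    """
    SELECT TRIM('#820410-5405-084#', '#'),      -- strip '#' from both ends
           TRIM('xyxHELLOyx', 'xy'),            -- any mix of 'x' and 'y' is stripped
           LTRIM('  Jan '),                     -- leading spaces only
           RTRIM('  Jan '),                     -- trailing spaces only
           LENGTH(TRIM(char(9) || 'Jan'))       -- a leading tab is NOT removed
    """
).fetchone()

print(trimmed)  # ('820410-5405-084', 'HELLO', 'Jan ', '  Jan', 4)
conn.close()
```

Characters inside the string, such as the dashes in the ID, are left alone; only the two ends are affected.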

In our database we have some IDs with hashtags at their extremities and names that have whitespaces. The whitespaces will not be visible to us, and as such we will use the `LENGTH` function to discern whether they exist. Let us write a query for this.

In [6]:
%%read_sql
SELECT
    Name,
    LENGTH(Name) AS LengthOfName,
    LENGTH(TRIM(Name)) AS TRIM_Length,
    LENGTH(LTRIM(Name)) AS LTRIM_Length,
    LENGTH(RTRIM(Name)) AS RTRIM_Length,
    CASE 
        WHEN LENGTH(TRIM(Name)) <> LENGTH(LTRIM(Name)) THEN 'Whitespaces at the end'
        WHEN LENGTH(TRIM(Name)) <> LENGTH(RTRIM(Name)) THEN 'Whitespaces at the beginning'
    ELSE
        'No Whitespaces'
    END AS Whitespace
FROM 
    Students
LIMIT 5;

Query started at 10:23:51 AM South Africa Standard Time; Query executed in 0.00 m

Unnamed: 0,Name,LengthOfName,TRIM_Length,LTRIM_Length,RTRIM_Length,Whitespace
0,Jan,5,3,3,5,Whitespaces at the beginning
1,Dumisani,8,8,8,8,No Whitespaces
2,Christopher,16,11,11,16,Whitespaces at the beginning
3,Marco,19,5,19,5,Whitespaces at the end
4,marthinus,9,9,9,9,No Whitespaces


### Try it yourself: Remove hashtags at both ends of the ID

You can use the below area to write a query that will remove the hashtags "#" from the ID string. Play around with all three trim functions.

In [9]:
%%read_sql
--Write your query here
SELECT 
TRIM(IDNumber,'#') AS TrimmedIDs 
FROM
Students
LIMIT 10;  

Query started at 01:05:32 AM SAST; Query executed in 0.00 m

Unnamed: 0,TrimmedIDs
0,820410-5405-084
1,9005272774082
2,9011245483180
3,9902225381086
4,8105294344187
5,5911252957188
6,5006191871185
7,501004-621-2182
8,751010-414-4187
9,6812103283181


You might be wondering at this point how this might be useful. One example is when you have to perform string comparisons. If there are whitespaces in your names, searches in your database can return *false negatives*: you might search for "Jan", get no results, and falsely assume that the record is not in the database. 

Let us have a look at one such query:

In [8]:
%%read_sql

/*The following query will not find Jan's record in the DB */
 SELECT 
     AdmissionNo,
     Name,
     Surname,
     IDNumber
 FROM 
     Students 
 WHERE
     Name = 'Jan';  

Query started at 10:23:51 AM South Africa Standard Time; Query executed in 0.00 m

Unnamed: 0,AdmissionNo,Name,Surname,IDNumber


In [9]:
%%read_sql

/*The following query will correctly find Jan's record*/
 SELECT 
     AdmissionNo,
     Name,
     Surname,
     IDNumber
 FROM 
     Students 
 WHERE
     TRIM(Name) = 'Jan';  

Query started at 10:23:51 AM South Africa Standard Time; Query executed in 0.00 m

Unnamed: 0,AdmissionNo,Name,Surname,IDNumber
0,1,Jan,Makhanya,#820410-5405-084#


### SUBSTR

The SUBSTR function returns a **substr**ing, given the **starting index** of the substring and the **number of characters** required. Indexing starts at 1, so the first character of the string is at index 1.

The syntax of the *`SUBSTR()`* function takes the following form:

```SQL
    SUBSTR(string,starting_index,number_of_characters)
```

Have a look at the following function call:
```SQL 
SUBSTR('EXPLORE DSA',3,5)
```
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/String_Manipulation/SUBSTR.png" alt="Substring Illustration" border="0">


The result of this function call will be "PLORE" because the starting index points to the letter "P" and we want the succeeding 5 characters (inclusive of "P").


Let us use this function to get the first 6 characters of the ID which represents a person's date of birth and separate these into **Year**, **Month** and **Day**.


In [10]:
%%read_sql

SELECT
    Name,
    IDNumber,
    SUBSTR(IDNumber,1,2) AS Year, 
    SUBSTR(IDNumber,3,2) AS Month, 
    SUBSTR(IDNumber,5,2) AS Day 
FROM
    Students
LIMIT 5;

Query started at 10:23:51 AM South Africa Standard Time; Query executed in 0.00 m

Unnamed: 0,Name,IDNumber,Year,Month,Day
0,Jan,#820410-5405-084#,#8,20,41
1,Dumisani,9005272774082,90,5,27
2,Christopher,9011245483180,90,11,24
3,Marco,9902225381086,99,2,22
4,marthinus,8105294344187,81,5,29
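
Notice that the first row above is sliced incorrectly (`#8`, `20`, `41`) because the ID still starts with a hashtag. One possible way around this, sketched here on a single literal value against an in-memory database, is to clean the ID with `REPLACE` before slicing it. The sketch also uses a negative starting index, which SQLite counts from the right-hand end of the string. The `cleaned` name is only an illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

raw_id = "#820410-5405-084#"  # first ID in the output above

parts = conn.execute(
    """
    WITH cleaned AS (
        -- strip dashes and hashtags before slicing
        SELECT REPLACE(REPLACE(:raw, '-', ''), '#', '') AS id
    )
    SELECT SUBSTR(id, 1, 2),   -- year
           SUBSTR(id, 3, 2),   -- month
           SUBSTR(id, 5, 2),   -- day
           SUBSTR(id, -6)      -- last 6 characters, counted from the right
    FROM cleaned
    """,
    {"raw": raw_id},
).fetchone()

print(parts)  # ('82', '04', '10', '405084')
conn.close()
```

The values come back as text, so the leading zero in `'04'` is preserved.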


### INSTR

The INSTR function returns the index position of the first occurrence of the substring that we are looking for. If the substring is not found, it returns 0.

The syntax of the *`INSTR()`* function takes the following form:

```sql
INSTR(string,substring)
```

Have a look at the following function call:
```SQL 
INSTR('EXPLORE DSA','DSA')
```
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/String_Manipulation/INSTR.png" alt="Substring Index Illustration" border="0">

In [11]:
%%read_sql
SELECT 
    Name,
    IDNumber,
    INSTR(IDNumber,'-') AS FirstOccurrence
FROM
    Students
WHERE LENGTH(IDNumber)>13

Query started at 10:23:52 AM South Africa Standard Time; Query executed in 0.00 m

Unnamed: 0,Name,IDNumber,FirstOccurrence
0,Jan,#820410-5405-084#,8
1,gugulethu,#501004-621-2182#,8
2,Tumelo,#751010-414-4187#,8
3,Dirk,#530219-492-6185#,8
4,sello,#950510-1851-081#,8
5,nicole,#561122-1763-085#,8
6,Jacqueline,621207-5110-185,7
7,Louise,960628-4133-180,7
8,Claire,651225-0376-186,7
9,Ivan,870816-0468-082,7
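
`INSTR` becomes more useful when its result feeds another function. The following sketch, run on literal values against an in-memory database, checks the `'EXPLORE DSA'` call from above, shows the result when the substring is absent, and combines `INSTR` with `SUBSTR` to keep everything before the first dash. The sample IDs are copied from earlier outputs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

positions = conn.execute(
    """
    SELECT INSTR('EXPLORE DSA', 'DSA'),        -- 'D' is the 9th character
           INSTR('9005272774082', '-'),        -- no dash present
           SUBSTR('621207-5110-185', 1,
                  INSTR('621207-5110-185', '-') - 1)  -- text before the first dash
    """
).fetchone()

print(positions)  # (9, 0, '621207')
conn.close()
```

Because a missing substring yields 0, it is worth checking for that case before using the result as a length.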


### UPPER, LOWER

The `UPPER` and `LOWER` functions allow us to convert our strings and characters to either lower or upper case.

The syntax of the *`UPPER()`* and *`LOWER()`* functions takes the following forms:

```sql
UPPER(string),LOWER(string)
```

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/String_Manipulation/UPPER_LOWER.png" alt="Upper and Lowercase Illustration" border="0">

In [12]:
%%read_sql
SELECT
    Name,
    UPPER(Name) AS Uppercase,
    LOWER(Name) AS Lowercase
FROM
    Students
LIMIT 5;

Query started at 10:23:52 AM South Africa Standard Time; Query executed in 0.00 m

Unnamed: 0,Name,Uppercase,Lowercase
0,Jan,JAN,jan
1,Dumisani,DUMISANI,dumisani
2,Christopher,CHRISTOPHER,christopher
3,Marco,MARCO,marco
4,marthinus,MARTHINUS,marthinus
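
Case conversion is often combined with the other functions, for example to search regardless of how a name was capitalised or to tidy up capitalisation. The sketch below assumes a small made-up table called `DemoNames` in an in-memory database; it is not the `Students` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical demo table with inconsistent casing and padding
conn.execute("CREATE TABLE DemoNames (Name TEXT)")
conn.executemany(
    "INSERT INTO DemoNames VALUES (?)",
    [("  Jan",), ("JAN",), ("marthinus",)],
)

# Normalise both padding and case before comparing
matches = conn.execute(
    "SELECT Name FROM DemoNames WHERE LOWER(TRIM(Name)) = 'jan' ORDER BY rowid"
).fetchall()
print(matches)  # [('  Jan',), ('JAN',)]

# Capitalise the first letter and lowercase the rest
title_cased = conn.execute(
    """
    SELECT UPPER(SUBSTR(Name, 1, 1)) || LOWER(SUBSTR(Name, 2))
    FROM DemoNames
    WHERE Name = 'marthinus'
    """
).fetchone()[0]
print(title_cased)  # Marthinus
conn.close()
```

By default, SQLite's built-in `UPPER` and `LOWER` only convert ASCII letters, so accented characters may be left unchanged.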


### Concatenation Operator - ||

The concatenation operator consists of two vertical lines, sometimes referred to as pipes ("||"), and can be used to combine two or more strings - a process known as string concatenation.

The syntax for the "||" operator will take the following form:

```sql
    string_1 || string_2 || string_3 || ... || string_n
```

Let us see how we can combine the items in our table to form a single string from the entries:


In [13]:
%%read_sql

SELECT
    AdmissionNo || Name ||  Surname  ||  IDNumber
FROM 
    Students
LIMIT 5;

Query started at 10:23:52 AM South Africa Standard Time; Query executed in 0.00 m

Unnamed: 0,AdmissionNo || Name || Surname || IDNumber
0,1 JanMakhanya#820410-5405-084#
1,2DumisaniMorris9005272774082
2,3 ChristopherBennett9011245483180
3,4Marco barnes9902225381086
4,5marthinusLourens 8105294344187
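
The output above runs the values together and keeps any stray padding. A minimal sketch of a more readable result, using literal values in an in-memory database: trim each part and concatenate explicit separators. It also shows that concatenating with `NULL` yields `NULL`, which `COALESCE` can guard against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

joined = conn.execute(
    """
    SELECT TRIM('  Jan') || ' ' || TRIM('Makhanya'),  -- trimmed parts with a space between
           1 || '-' || 'x',                            -- numbers are converted to text
           'Jan' || NULL,                              -- NULL makes the whole result NULL
           'Jan' || COALESCE(NULL, '')                 -- fall back to an empty string
    """
).fetchone()

print(joined)  # ('Jan Makhanya', '1-x', None, 'Jan')
conn.close()
```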


### String Manipulation Exercises

Now that we've learned the functions necessary to perform transformations on strings, let's bring it all together to cement our understanding.

#### 1. Write a query to create custom student numbers

In this section we will be creating a query that will produce 10-character long student numbers for all students in the table. The student numbers are to be created in the following manner:
* The first 2 characters are the uppercase letters taken from the 1st letters of student's name and surname.
* The next 6 characters are to be the last 6 characters of the student's ID Number - without dashes or hashtags.
* The last 2 characters are to be an underscore ("_") and the length of the student's surname.

**Examples:** 

If we have student: **m**arthinus **L**ourens 8105294**344187**, the student number becomes **ML344187_7**

If we have student: **J**an **M**akhanya 820410-5**405**-**084**, the student number becomes **JM405084_8**


In [14]:
%%read_sql

-- Write your query here

#### 2. Write a query to obtain all information contained within an ID Number

An ID number has various information built into it: 

1. The first 6 characters represent a person's date of birth in the format YYMMDD.
2. The next 4 characters tell us whether the person is male or female: if the number is less than 5000 the person is female; if it is 5000 or greater the person is male. 
3. The 11th character tells us whether the person is a South African citizen by birth or a Permanent Resident.

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/String_Manipulation/ID_Breakdown.png" alt="Breakdown of ID information" border="0">

Write a query that will derive all this data from the ID Number:

* Date of Birth - write it in the following format: DD/MM/YY
* The gender as Male or Female
* Determine citizenship of student: South African or Permanent Resident

You will need to make use of the **`CASE`** statement or the **`IIF()`** function for the last two items. Please refer to the Appendix section below for links on the syntax and usage.
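
As a reminder of the general shape of these two constructs, here is a small sketch on an unrelated question: labelling an ID as clean or not based on its length. It runs on literal values in an in-memory database. The `IIF()` part is only executed when the SQLite library bundled with Python is version 3.32.0 or newer, since that is when the function was introduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

sample_ids = ["9005272774082", "#820410-5405-084#"]  # taken from earlier outputs

# CASE evaluates its WHEN branches in order and falls back to ELSE
case_sql = """
SELECT CASE
           WHEN LENGTH(?) = 13 THEN 'Clean'
           ELSE 'Needs cleaning'
       END
"""
case_labels = [conn.execute(case_sql, (i,)).fetchone()[0] for i in sample_ids]
print(case_labels)  # ['Clean', 'Needs cleaning']

# IIF(condition, value_if_true, value_if_false) is shorthand for a two-branch CASE
if sqlite3.sqlite_version_info >= (3, 32, 0):
    iif_sql = "SELECT IIF(LENGTH(?) = 13, 'Clean', 'Needs cleaning')"
    iif_labels = [conn.execute(iif_sql, (i,)).fetchone()[0] for i in sample_ids]
    print(iif_labels == case_labels)

conn.close()
```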


In [15]:
%%read_sql

-- Write your query here

## Exercise Answers

1. Write a query to create custom student numbers

In [16]:
%%read_sql

SELECT 
    Name,
    Surname,
    IDNumber,
    UPPER(--UPPER is used to transform the whole string we build to uppercase
        
           SUBSTR(TRIM(Name),1,1) --obtain 1st letter of Name, we TRIM because there might be white spaces in front
        || SUBSTR(TRIM(Surname),1,1) --obtain 1st letter of Surname, we trim because there might be white spaces in front
        || SUBSTR(REPLACE(REPLACE(IDNumber,'-',''),'#',''),8,6) -- obtain last 6 characters, we use REPLACE to get rid of dashes and hashtags
        || '_' || LENGTH(TRIM(Surname)) --obtain the length of the TRIMMED surname and prepend with an underscore
    ) AS StudentNumber
FROM
    Students
LIMIT 10;

Query started at 10:23:52 AM South Africa Standard Time; Query executed in 0.00 m

Unnamed: 0,Name,Surname,IDNumber,StudentNumber
0,Jan,Makhanya,#820410-5405-084#,JM405084_8
1,Dumisani,Morris,9005272774082,DM774082_6
2,Christopher,Bennett,9011245483180,CB483180_7
3,Marco,barnes,9902225381086,MB381086_6
4,marthinus,Lourens,8105294344187,ML344187_7
5,Patience,Banda,5911252957188,PB957188_5
6,Tony,Ngwenya,5006191871185,TN871185_7
7,gugulethu,Horn,#501004-621-2182#,GH212182_4
8,Tumelo,Ebrahim,#751010-414-4187#,TE144187_7
9,Priscilla,Jansen,6812103283181,PJ283181_6


2. Write a query to obtain all information contained within the ID Number

    * **No solution is provided for this question**: You are encouraged to do this exercise on your own so that you get comfortable with nesting string functions to get the desired output. 

## Conclusion

Having gone through this train, you should be able to confidently tackle the complex string data that is thrown your way. Initially, it might take a while to get comfortable with writing complex queries to perform string manipulation - but practice makes perfect. Now go forth and conquer!

## Appendix

Below are some resources that you can use to further understand CASE statements and the IIF() function you can use together with your string manipulation functions.

<a href="https://www.sqlitetutorial.net/sqlite-case/">The CASE statement explained</a>

<a href="https://www.sqlitetutorial.net/sqlite-functions/sqlite-iif/">The IIF() function explained</a>