##  Using SQL String Functions to Clean Data

###  Learning Objectives
In this section, we will learn how to:
- Identify and remove unwanted spaces from string values.
- Extract specific portions of a string using SQL string functions.

###  Overview
We’ll explore how string functions can be used to clean and standardize text data in our dataset.  
For this exercise, we’ll use the `united_nations.Access_to_Basic_Services` table.

Some entries in the `Country_name` column contain unwanted details inside parentheses,  
for example `Falkland Islands (Malvinas)`.



In [1]:
%load_ext sql

##  Cleaning Country Names with SQL String Functions

###  Exercise Overview
Some country names in the `Access_to_Basic_Services` table contain extra information inside parentheses,  
for example:  
`Iran (Islamic Republic of)` or `Bolivia (Plurinational State of)`.

We want to **extract only the clean country names** without any text inside parentheses  
and remove any trailing spaces that remain afterward.



###  Step 1: Identify Country Names with Parentheses
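We’ll first select all unique country names that contain parentheses. In the `LIKE` pattern `'%(%)%'`, each `%` matches any run of characters (including none), so the pattern matches values with an opening parenthesis followed somewhere later by a closing one.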


In [3]:
%%sql
SELECT DISTINCT
    Country_name
FROM
    united_nations.Access_to_Basic_Services
WHERE
    Country_name LIKE '%(%)%';


 * mysql+pymysql://root:***@localhost:3306/united_nations
7 rows affected.


| Country_name |
|---|
| Iran (Islamic Republic of) |
| Saint Martin (French Part) |
| Sint Maarten (Dutch part) |
| Bolivia (Plurinational State of) |
| Falkland Islands (Malvinas) |
| Venezuela (Bolivarian Republic of) |
| Micronesia (Federated States of) |
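
To see how the `'%(%)%'` pattern behaves, here is a minimal sketch that runs the same `LIKE` filter against a small in-memory SQLite table. SQLite is only an assumed stand-in for the MySQL connection used in this notebook, the table name `names` is hypothetical, and the rows `Kenya` and `Broken (name` are made-up values added to show what the pattern rejects.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (Country_name TEXT)")
sample = [
    ("Iran (Islamic Republic of)",),   # "(" followed later by ")": matches
    ("Falkland Islands (Malvinas)",),  # matches
    ("Kenya",),                        # hypothetical row, no parentheses
    ("Broken (name",),                 # hypothetical row, "(" without ")"
]
conn.executemany("INSERT INTO names VALUES (?)", sample)

# Same filter as the query above: an opening parenthesis with a closing
# parenthesis somewhere after it.
matched = [
    row[0]
    for row in conn.execute(
        "SELECT Country_name FROM names "
        "WHERE Country_name LIKE '%(%)%' "
        "ORDER BY Country_name"
    )
]
print(matched)
conn.close()
```

Only the two rows with a complete pair of parentheses are returned; the row with an unclosed `(` is filtered out.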


###  Step 2: Extract the Country Names Without the Parentheses
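Next, we extract the text to the left of the first opening parenthesis `(`. `INSTR(Country_name, '(')` returns the 1-based position of that parenthesis, so `SUBSTR(Country_name, 1, INSTR(Country_name, '(') - 1)` keeps every character before it.  
We also compute the length of the new string with `LENGTH()` to reveal any extra spaces that remain. In MySQL, `LENGTH()` counts bytes rather than characters; the two agree for these ASCII names, and `CHAR_LENGTH()` is the character-based alternative.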


In [4]:
%%sql
SELECT DISTINCT
    Country_name,
    SUBSTR(Country_name, 1, INSTR(Country_name, '(') - 1) AS New_country_name,
    LENGTH(SUBSTR(Country_name, 1, INSTR(Country_name, '(') - 1)) AS New_country_name_length
FROM
    united_nations.Access_to_Basic_Services
WHERE
    Country_name LIKE '%(%)%';


 * mysql+pymysql://root:***@localhost:3306/united_nations
7 rows affected.


| Country_name | New_country_name | New_country_name_length |
|---|---|---|
| Iran (Islamic Republic of) | Iran | 5 |
| Saint Martin (French Part) | Saint Martin | 13 |
| Sint Maarten (Dutch part) | Sint Maarten | 13 |
| Bolivia (Plurinational State of) | Bolivia | 8 |
| Falkland Islands (Malvinas) | Falkland Islands | 17 |
| Venezuela (Bolivarian Republic of) | Venezuela | 10 |
| Micronesia (Federated States of) | Micronesia | 11 |
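
The extraction can be broken into its parts for a single value. The sketch below evaluates `INSTR()`, `SUBSTR()`, and `LENGTH()` on `Iran (Islamic Republic of)` using an in-memory SQLite connection, which is an assumed stand-in for MySQL; these three functions behave the same way in both engines for this ASCII input.

```python
import sqlite3

# In-memory SQLite connection as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
name = "Iran (Islamic Republic of)"

paren_pos, extracted, extracted_len = conn.execute(
    """
    SELECT
        INSTR(:name, '('),                                  -- position of "("
        SUBSTR(:name, 1, INSTR(:name, '(') - 1),            -- text before "("
        LENGTH(SUBSTR(:name, 1, INSTR(:name, '(') - 1))     -- its length
    """,
    {"name": name},
).fetchone()

# repr() makes the trailing space visible in the printed value.
print(paren_pos, repr(extracted), extracted_len)
conn.close()
```

The parenthesis sits at position 6, so the first 5 characters are kept: the four letters of `Iran` plus the space that preceded the parenthesis.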


###  Step 3: Identify and Remove Extra Characters
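The lengths returned in Step 2 are each one greater than the number of visible characters in the name, which shows that every extracted value still ends with the space that preceded the parenthesis.  
We can use the `TRIM()` function, which removes leading and trailing spaces, and then recompute the length to confirm that the cleaned names no longer carry extra whitespace.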


In [5]:
%%sql
SELECT DISTINCT
    Country_name,
    TRIM(SUBSTR(Country_name, 1, INSTR(Country_name, '(') - 1)) AS New_country_name,
    LENGTH(TRIM(SUBSTR(Country_name, 1, INSTR(Country_name, '(') - 1))) AS New_country_name_length
FROM
    united_nations.Access_to_Basic_Services
WHERE
    Country_name LIKE '%(%)%';


 * mysql+pymysql://root:***@localhost:3306/united_nations
7 rows affected.


| Country_name | New_country_name | New_country_name_length |
|---|---|---|
| Iran (Islamic Republic of) | Iran | 4 |
| Saint Martin (French Part) | Saint Martin | 12 |
| Sint Maarten (Dutch part) | Sint Maarten | 12 |
| Bolivia (Plurinational State of) | Bolivia | 7 |
| Falkland Islands (Malvinas) | Falkland Islands | 16 |
| Venezuela (Bolivarian Republic of) | Venezuela | 9 |
| Micronesia (Federated States of) | Micronesia | 10 |
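
As a small illustration of what `TRIM()` removes, the sketch below compares `TRIM()` with `RTRIM()` (which strips trailing spaces only). It uses an in-memory SQLite connection as an assumed stand-in for MySQL. The value `"Iran "` is what the extraction step produces; the value with leading spaces is hypothetical and is included only to show the difference between the two functions.

```python
import sqlite3

# In-memory SQLite connection as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")

extracted = "Iran "   # result of the SUBSTR/INSTR extraction
padded = "  Iran "    # hypothetical value with leading spaces as well

trimmed, trimmed_len, both_sides, right_only = conn.execute(
    """
    SELECT
        TRIM(:extracted),          -- trailing space removed
        LENGTH(TRIM(:extracted)),  -- length after trimming
        TRIM(:padded),             -- leading and trailing spaces removed
        RTRIM(:padded)             -- only trailing spaces removed
    """,
    {"extracted": extracted, "padded": padded},
).fetchone()

print(repr(trimmed), trimmed_len, repr(both_sides), repr(right_only))
conn.close()
```

For the values in this exercise, which only carry a trailing space, `TRIM()` and `RTRIM()` give the same result.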


###  Summary
We produced clean country names from the `Country_name` column by:
- Identifying entries containing parentheses.
- Extracting only the text before the opening parenthesis.
- Removing the trailing space left behind for consistent formatting.

These queries only select the cleaned values; they do not modify the table. The string functions used here (`SUBSTR()`, `INSTR()`, `LENGTH()`, and `TRIM()`) are core tools for text cleaning in SQL.
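
The queries in this section use `SELECT` only, so the stored values are unchanged. If the cleaned names should be written back, one possible approach is an `UPDATE` that reuses the same expression and filter. The sketch below is an assumption, not part of the original exercise: it runs against an in-memory SQLite copy with a few sample rows (the `Kenya` row is hypothetical), not the real `united_nations` database.

```python
import sqlite3

# In-memory SQLite table as a stand-in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Access_to_Basic_Services (Country_name TEXT)")
conn.executemany(
    "INSERT INTO Access_to_Basic_Services VALUES (?)",
    [
        ("Bolivia (Plurinational State of)",),
        ("Sint Maarten (Dutch part)",),
        ("Kenya",),  # hypothetical row without parentheses
    ],
)

# Apply the cleaning expression only to rows that contain parentheses;
# the WHERE clause leaves all other names untouched.
conn.execute(
    """
    UPDATE Access_to_Basic_Services
    SET Country_name = TRIM(SUBSTR(Country_name, 1, INSTR(Country_name, '(') - 1))
    WHERE Country_name LIKE '%(%)%'
    """
)
conn.commit()

cleaned = sorted(
    row[0] for row in conn.execute("SELECT Country_name FROM Access_to_Basic_Services")
)
print(cleaned)
conn.close()
```

Keeping the `WHERE` filter matters: for a name without `(`, `INSTR()` returns 0 and the extraction would no longer return the original name.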
