<a href="https://colab.research.google.com/github/sethkipsangmutuba/SQL/blob/main/4b_Using_SQL_String_Functions_to_Clean_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Using SQL String Functions to Clean Data

In this section, we’ll simulate the **data cleaning process** on string/text fields from the **Titanic dataset**, such as `embarked`, `class`, or `who`. The worked example below uses `embark_town`.

We’ll demonstrate how to use **SQL string functions** to improve data quality and consistency.

---

##  String Functions Covered

- **`TRIM()`** – Removes leading and trailing whitespace
- **`SUBSTR()`** – Extracts specific parts of a string
- **`INSTR()`** – Finds the position of a character or substring
- **`REPLACE()`** – Substitutes or removes unwanted characters
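
As a quick illustration of the four functions, the sketch below runs each one on a literal string in a throwaway in-memory database. The connection name `demo_conn` and the sample strings are made up for this example and are separate from the Titanic table used later.

```python
import sqlite3

# Throwaway in-memory database, used only to evaluate the expressions
demo_conn = sqlite3.connect(":memory:")

row = demo_conn.execute("""
SELECT
  TRIM('  Cherbourg  '),                      -- strip surrounding spaces
  SUBSTR('Queenstown (Q)', 1, 10),            -- first 10 characters
  INSTR('Queenstown (Q)', '('),               -- 1-based position of '('
  REPLACE('Queenstown (Q)', ' (Q)', '')       -- remove a known suffix
""").fetchone()

print(row)  # ('Cherbourg', 'Queenstown', 12, 'Queenstown')
demo_conn.close()
```

Note that `INSTR()` positions are 1-based, and that `REPLACE()` only helps when the unwanted text is known in advance.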

---

## Step-by-Step: SQL String Cleaning with SQLite in Colab

---

### 1. Setup: Import, Load Titanic Dataset, Save to SQLite

Start by:

- Importing the necessary libraries (`pandas`, `sqlite3`, and `seaborn`, which provides the dataset loader)
- Loading the Titanic dataset into a DataFrame with `sns.load_dataset`
- Saving the dataset into an **in-memory SQLite database** for querying

This prepares the environment for applying string functions using SQL on Titanic text fields like `embarked`, `class`, and `who`.


In [5]:
import pandas as pd
import sqlite3
import seaborn as sns

# Load Titanic dataset
df = sns.load_dataset('titanic')

# Drop rows where all values are NaN (optional)
df.dropna(how='all', inplace=True)

# Create SQLite connection
conn = sqlite3.connect(":memory:")

# Save to SQLite
df.to_sql("titanic", conn, index_label="passenger_id", if_exists="replace")


891

---

## Check Sample Strings to Clean

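Before cleaning, let’s simulate messy or inconsistent text entries to work with.

For example, imagine the `embark_town` column contains values like `Queenstown (Q)`, where a port code in parentheses has been appended to the town name. The update below creates exactly that by appending ` (Q)` to every `Queenstown` row: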



In [6]:
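# Append a port code to every Queenstown row to simulate messy input.
# NULL values never match the equality test, so no separate NULL check is needed.
conn.execute("""
UPDATE titanic
SET embark_town = embark_town || ' (Q)'
WHERE embark_town = 'Queenstown'
""")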


<sqlite3.Cursor at 0x7e4d70327cc0>
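
The `<sqlite3.Cursor ...>` output only confirms that the statement ran. To see how many rows an `UPDATE` touched, one option is the cursor's `rowcount` attribute. The sketch below shows this on a small hypothetical table (`towns_demo`, with invented rows) rather than on the Titanic data.

```python
import sqlite3

# Hypothetical stand-in table; the rows are invented for illustration
count_conn = sqlite3.connect(":memory:")
count_conn.execute("CREATE TABLE towns_demo (embark_town TEXT)")
count_conn.executemany(
    "INSERT INTO towns_demo (embark_town) VALUES (?)",
    [("Queenstown",), ("Queenstown",), ("Southampton",), (None,)],
)

# Same kind of UPDATE as above; keep the cursor so rowcount can be read
cur = count_conn.execute("""
UPDATE towns_demo
SET embark_town = embark_town || ' (Q)'
WHERE embark_town = 'Queenstown'
""")
updated = cur.rowcount

print(updated)  # 2
count_conn.close()
```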

In [7]:
pd.read_sql("""
SELECT DISTINCT embark_town
FROM titanic
WHERE embark_town LIKE '%(%)%';
""", conn)


| | embark_town |
|---|---|
| 0 | Queenstown (Q) |


---

##  Extract Cleaned Version of `embark_town` Using String Functions

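The update above produced messy entries of the form `"Queenstown (Q)"`. Only the Queenstown rows were changed, but the same pattern could appear for the other ports:

- `"Southampton (S)"`
- `"Cherbourg (C)"`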

We’ll clean the data by **extracting only the town name**, excluding everything inside the parentheses.

---

### Approach

To extract the clean version of `embark_town`, we will:

- Use `INSTR()` to find the position of the opening parenthesis `(`
- Use `SUBSTR()` to extract the portion of the string **before** that position
- Use `TRIM()` or `RTRIM()` to remove the space left between the town name and the parenthesis

This gives us a clean version of the town name for analysis.
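
To make the three steps concrete, the sketch below evaluates each intermediate value for a single literal string. The connection name `steps_conn` and the `:raw` parameter are introduced only for this illustration.

```python
import sqlite3

steps_conn = sqlite3.connect(":memory:")

pos, before, cleaned = steps_conn.execute("""
SELECT
  INSTR(:raw, '('),                              -- step 1: position of '('
  SUBSTR(:raw, 1, INSTR(:raw, '(') - 1),         -- step 2: text before it
  RTRIM(SUBSTR(:raw, 1, INSTR(:raw, '(') - 1))   -- step 3: drop trailing space
""", {"raw": "Queenstown (Q)"}).fetchone()

print(pos)            # 12
print(repr(before))   # 'Queenstown ' (note the trailing space)
print(repr(cleaned))  # 'Queenstown'
steps_conn.close()
```

The second value still carries the space that preceded the parenthesis, which is why the trimming step is needed here.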

---



In [8]:
pd.read_sql("""
SELECT DISTINCT
  embark_town,
  TRIM(SUBSTR(embark_town, 1, INSTR(embark_town, '(') - 1)) AS cleaned_embark_town,
  LENGTH(TRIM(SUBSTR(embark_town, 1, INSTR(embark_town, '(') - 1))) AS cleaned_length
FROM titanic
WHERE embark_town LIKE '%(%)%';
""", conn)


| | embark_town | cleaned_embark_town | cleaned_length |
|---|---|---|---|
| 0 | Queenstown (Q) | Queenstown | 10 |
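
The `WHERE` filter in the query above does real work: for a value with no parenthesis, `INSTR()` returns 0 and the `SUBSTR()` expression yields an empty string. If the cleaning needs to run over every row, one possible approach is to guard the expression with `CASE`. The sketch below assumes a small hypothetical table (`towns_demo`, with invented rows) rather than the Titanic data.

```python
import sqlite3

# Hypothetical mixed data: one suffixed value, two clean values, one NULL
guard_conn = sqlite3.connect(":memory:")
guard_conn.execute("CREATE TABLE towns_demo (embark_town TEXT)")
guard_conn.executemany(
    "INSERT INTO towns_demo (embark_town) VALUES (?)",
    [("Queenstown (Q)",), ("Southampton",), ("Cherbourg",), (None,)],
)

cleaned_rows = guard_conn.execute("""
SELECT
  embark_town,
  -- Unguarded: becomes an empty string when there is no '('
  RTRIM(SUBSTR(embark_town, 1, INSTR(embark_town, '(') - 1)) AS unguarded,
  -- Guarded: only cut the string when a '(' was actually found
  CASE
    WHEN INSTR(embark_town, '(') > 0
      THEN RTRIM(SUBSTR(embark_town, 1, INSTR(embark_town, '(') - 1))
    ELSE embark_town
  END AS guarded
FROM towns_demo
ORDER BY rowid
""").fetchall()

for row in cleaned_rows:
    print(row)
guard_conn.close()
```

With the guard in place, values without parentheses and `NULL` values pass through unchanged, so the same expression could be used in an `UPDATE` without a `WHERE` clause.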
