<a href="https://colab.research.google.com/github/sethkipsangmutuba/SQL/blob/main/4c_Creating_a_Custom_ID_Using_String_Functions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a Custom ID Using String Functions (SQLite3 in Colab)

In this section, we’ll build a **custom passenger ID** by combining several string-based columns from the Titanic dataset.

---

## Goal

Create a new column `passenger_id_custom` by combining fields like:

- `sex`
- `class`
- `embarked`

We'll also:

- Handle **NULL values** to avoid broken IDs
- Format strings with:
  - `UPPER()` – convert to uppercase
  - `SUBSTR()` – control which parts of a string to use
  - `LENGTH()` – for logic and validation
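
As a quick orientation, the three functions can be tried on a literal value before touching the dataset. This is a minimal sketch: `demo_conn` is a throwaway in-memory connection introduced here for illustration, not part of the notebook's own setup.

```python
import sqlite3

# Throwaway in-memory database; literal expressions need no tables.
demo_conn = sqlite3.connect(":memory:")

upper_val, first_three, n_chars = demo_conn.execute(
    "SELECT UPPER('female'), SUBSTR('female', 1, 3), LENGTH('female')"
).fetchone()

print(upper_val, first_three, n_chars)  # FEMALE fem 6
demo_conn.close()
```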

---

## Step 1: Load Titanic Data and Save to SQLite (Colab)

First, load the Titanic dataset into a Pandas DataFrame and write it to an in-memory SQLite database.

This gives us a `titanic` table to query in the upcoming steps, where we build the custom ID by transforming and combining string values in SQL.


In [1]:
import pandas as pd
import sqlite3
import seaborn as sns

# Load dataset
df = sns.load_dataset('titanic')
df.dropna(how='all', inplace=True)

# Connect to SQLite memory DB
conn = sqlite3.connect(":memory:")

# Write to SQL table
df.to_sql("titanic", conn, index_label="id", if_exists="replace")


891

---

## Step 2: Preview Columns to Combine

Before building the custom `passenger_id_custom`, let’s inspect the columns we’ll use.

We'll use the following fields:

- **`sex`** → e.g., `"male"`, `"female"`
- **`class`** → e.g., `"First"`, `"Second"`, `"Third"`
- **`embarked`** → e.g., `"C"`, `"Q"`, `"S"`

These fields will be:

- **Formatted** (e.g., uppercased, shortened)
- **Combined** into a single ID-like string (a group code rather than a unique key, since many passengers share the same sex, class, and port)
- **Handled carefully** to account for possible `NULL` values

This preview helps confirm the input data before we write transformation logic.


In [2]:
pd.read_sql("""
SELECT DISTINCT sex, class, embarked
FROM titanic
LIMIT 5;
""", conn)


|   | sex | class | embarked |
|---|---|---|---|
| 0 | male | Third | S |
| 1 | female | First | C |
| 2 | female | Third | S |
| 3 | female | First | S |
| 4 | male | Third | Q |
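
Before deciding how much NULL handling is needed, it can help to count the missing values in each column. The sketch below is one way to do that. It assumes a small stand-in table (`titanic_demo`, with hypothetical rows) rather than the real data, and uses a separate `demo_conn` connection so it does not interfere with `conn`.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE titanic_demo (id INTEGER, sex TEXT, class TEXT, embarked TEXT)"
)
demo_conn.executemany(
    "INSERT INTO titanic_demo VALUES (?, ?, ?, ?)",
    [
        (0, "male", "Third", "S"),
        (1, "female", "First", "C"),
        (2, "female", "First", None),  # hypothetical row with a missing port
    ],
)

# `col IS NULL` evaluates to 1 or 0 in SQLite, so SUM() counts the NULLs.
null_counts = demo_conn.execute("""
SELECT
    SUM(sex IS NULL)      AS sex_nulls,
    SUM(class IS NULL)    AS class_nulls,
    SUM(embarked IS NULL) AS embarked_nulls
FROM titanic_demo;
""").fetchone()

print(null_counts)  # (0, 0, 1)
demo_conn.close()
```

Running the same `SELECT` against the `titanic` table would show which of the three columns actually need a placeholder.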


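---

## Step 3.1: Combine Columns into `passenger_id_custom`

SQLite's `||` operator concatenates strings, so the simplest version of the ID is the three columns joined end to end.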

In [3]:
pd.read_sql("""
SELECT
    id,
    sex,
    class,
    embarked,
    sex || class || embarked AS passenger_id_custom
FROM titanic
LIMIT 5;
""", conn)


|   | id | sex | class | embarked | passenger_id_custom |
|---|---|---|---|---|---|
| 0 | 0 | male | Third | S | maleThirdS |
| 1 | 1 | female | First | C | femaleFirstC |
| 2 | 2 | female | Third | S | femaleThirdS |
| 3 | 3 | female | First | S | femaleFirstS |
| 4 | 4 | male | Third | S | maleThirdS |


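---

## Step 3.2: Handle NULLs with `IFNULL()`

In SQLite, concatenating `NULL` with `||` makes the whole expression `NULL`, so a single missing field would blank out the entire ID. `IFNULL(column, 'UNKNOWN')` substitutes a placeholder instead. The first five rows contain no NULLs, so the output below matches Step 3.1.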

In [4]:
pd.read_sql("""
SELECT
    id,
    sex,
    class,
    embarked,
    IFNULL(sex, 'UNKNOWN') || IFNULL(class, 'UNKNOWN') || IFNULL(embarked, 'UNKNOWN') AS passenger_id_custom
FROM titanic
LIMIT 5;
""", conn)


|   | id | sex | class | embarked | passenger_id_custom |
|---|---|---|---|---|---|
| 0 | 0 | male | Third | S | maleThirdS |
| 1 | 1 | female | First | C | femaleFirstC |
| 2 | 2 | female | Third | S | femaleThirdS |
| 3 | 3 | female | First | S | femaleFirstS |
| 4 | 4 | male | Third | S | maleThirdS |
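
To see the difference on a row that does contain a NULL, here is a minimal sketch comparing plain and guarded concatenation. The table `titanic_demo` and its rows are hypothetical stand-ins, and `demo_conn` is a separate connection used only for this illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE titanic_demo (id INTEGER, sex TEXT, class TEXT, embarked TEXT)"
)
demo_conn.executemany(
    "INSERT INTO titanic_demo VALUES (?, ?, ?, ?)",
    [
        (0, "male", "Third", "S"),
        (1, "female", "First", None),  # hypothetical row with a missing port
    ],
)

rows = demo_conn.execute("""
SELECT
    id,
    sex || class || embarked AS plain_id,
    IFNULL(sex, 'UNKNOWN') || IFNULL(class, 'UNKNOWN') || IFNULL(embarked, 'UNKNOWN') AS safe_id
FROM titanic_demo
ORDER BY id;
""").fetchall()

for row in rows:
    print(row)
# (0, 'maleThirdS', 'maleThirdS')
# (1, None, 'femaleFirstUNKNOWN')
demo_conn.close()
```

The unguarded expression returns `None` (SQL `NULL`) for the second row, while the `IFNULL` version still produces a usable string.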


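---

## Step 3.3a: Standardize Text Case with `UPPER()`

Wrapping each component in `UPPER()` gives the ID a consistent case, regardless of how the source columns are capitalized.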

In [5]:
pd.read_sql("""
SELECT
    id,
    sex,
    class,
    embarked,
    UPPER(IFNULL(sex, 'UNKNOWN')) || UPPER(IFNULL(class, 'UNKNOWN')) || UPPER(IFNULL(embarked, 'UNKNOWN')) AS passenger_id_custom
FROM titanic
LIMIT 5;
""", conn)


|   | id | sex | class | embarked | passenger_id_custom |
|---|---|---|---|---|---|
| 0 | 0 | male | Third | S | MALETHIRDS |
| 1 | 1 | female | First | C | FEMALEFIRSTC |
| 2 | 2 | female | Third | S | FEMALETHIRDS |
| 3 | 3 | female | First | S | FEMALEFIRSTS |
| 4 | 4 | male | Third | S | MALETHIRDS |


---

## Step 3.3b: Uniform ID Length Using `SUBSTR()`

To keep our `passenger_id_custom` consistently formatted, we'll extract **fixed-length segments** from each component:

- **`sex`**: Take the **first 3 letters** (after `UPPER()`)  
  → e.g., `"MALE"` → `"MAL"`, `"FEMALE"` → `"FEM"`

- **`class`**: Take the **first 4 letters**  
  → e.g., `"FIRST"` → `"FIRS"`, `"SECOND"` → `"SECO"`, `"THIRD"` → `"THIR"`

- **`embarked`**: Take the **last 2 characters**  
  → e.g., the `"UNKNOWN"` placeholder → `"WN"`; a single-letter code such as `"S"` stays `"S"`, because `SUBSTR()` returns only the characters that exist and does not pad

We'll use:

- `SUBSTR(column, 1, N)` to get the **first N characters**
- `SUBSTR(column, -N)` to get the **last N characters**

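This step gives every ID the same layout: 3 characters for sex, 4 for class, then the port code. Because the port code is 1 character for real values and 2 (`WN`) when `embarked` is NULL, the IDs are 8 or 9 characters long rather than strictly fixed-width.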
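
The two `SUBSTR()` forms can be checked on literal values. The sketch below also shows what happens when the input is shorter than the requested length, and one possible way to force a fixed width by prepending filler characters before taking the last N. The `'__'` filler and the `demo_conn` connection are assumptions made for illustration; the notebook's own query does not pad.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")

first_n, last_n, short_input, padded = demo_conn.execute("""
SELECT
    SUBSTR('FEMALE', 1, 3),    -- first 3 characters
    SUBSTR('UNKNOWN', -2),     -- last 2 characters
    SUBSTR('S', -2),           -- input shorter than 2: returned as-is
    SUBSTR('__' || 'S', -2)    -- hypothetical left-padding with '_'
""").fetchone()

print(first_n, last_n, short_input, padded)  # FEM WN S _S
demo_conn.close()
```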


In [6]:
pd.read_sql("""
SELECT
    id,
    sex,
    class,
    embarked,
    SUBSTR(UPPER(IFNULL(sex, 'UNKNOWN')), 1, 3) ||
    SUBSTR(UPPER(IFNULL(class, 'UNKNOWN')), 1, 4) ||
    SUBSTR(UPPER(IFNULL(embarked, 'UNKNOWN')), -2) AS passenger_id_custom
FROM titanic
LIMIT 5;
""", conn)


|   | id | sex | class | embarked | passenger_id_custom |
|---|---|---|---|---|---|
| 0 | 0 | male | Third | S | MALTHIRS |
| 1 | 1 | female | First | C | FEMFIRSC |
| 2 | 2 | female | Third | S | FEMTHIRS |
| 3 | 3 | female | First | S | FEMFIRSS |
| 4 | 4 | male | Third | S | MALTHIRS |
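
The queries so far only compute the ID on the fly. If the goal is an actual `passenger_id_custom` column, one option is `ALTER TABLE ... ADD COLUMN` followed by an `UPDATE`, after which `LENGTH()` can be used to validate the result. This is a sketch under stated assumptions: it runs against a hypothetical stand-in table (`titanic_demo`) on a separate `demo_conn` connection, not against the notebook's `titanic` table.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE titanic_demo (id INTEGER, sex TEXT, class TEXT, embarked TEXT)"
)
demo_conn.executemany(
    "INSERT INTO titanic_demo VALUES (?, ?, ?, ?)",
    [
        (0, "male", "Third", "S"),
        (1, "female", "First", "C"),
        (2, "female", "First", None),  # hypothetical row with a missing port
    ],
)

# Materialize the ID as a real column using the same expression as Step 3.3b.
demo_conn.execute("ALTER TABLE titanic_demo ADD COLUMN passenger_id_custom TEXT")
demo_conn.execute("""
UPDATE titanic_demo
SET passenger_id_custom =
    SUBSTR(UPPER(IFNULL(sex, 'UNKNOWN')), 1, 3) ||
    SUBSTR(UPPER(IFNULL(class, 'UNKNOWN')), 1, 4) ||
    SUBSTR(UPPER(IFNULL(embarked, 'UNKNOWN')), -2)
""")

ids = [r[0] for r in demo_conn.execute(
    "SELECT passenger_id_custom FROM titanic_demo ORDER BY id"
)]

# Validate: how many IDs exist at each length?
length_counts = demo_conn.execute("""
SELECT LENGTH(passenger_id_custom) AS id_length, COUNT(*) AS n_rows
FROM titanic_demo
GROUP BY id_length
ORDER BY id_length;
""").fetchall()

print(ids)            # ['MALTHIRS', 'FEMFIRSC', 'FEMFIRSWN']
print(length_counts)  # [(8, 2), (9, 1)]
demo_conn.close()
```

A second length bucket, as in this stand-in data, points to rows where a placeholder was used for a missing value.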
