<div id="singlestore-header" style="display: flex; background-color: rgba(235, 249, 245, 0.25); padding: 5px;">
    <div id="icon-image" style="width: 90px; height: 90px;">
        <img width="100%" height="100%" src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/browser.png" />
    </div>
    <div id="text" style="padding: 5px; margin-left: 10px;">
        <div id="badge" style="display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%">SingleStore Notebooks</div>
        <h1 style="font-weight: 500; margin: 8px 0 0 4px;">Demonstrate ML function Classify</h1>
    </div>
</div>

<div class="alert alert-block alert-warning">
    <b class="fa fa-solid fa-exclamation-circle"></b>
    <div>
        <p><b>Note</b></p>
        <p>You can use your existing Standard or Premium workspace with this Notebook.</p>
    </div>
</div>


This feature is currently in **Private Preview**. Please contact support@singlestore.com to confirm whether it can be enabled for your organization.

This Jupyter notebook will help you:
1. Load the Titanic dataset
2. Store the data in a SingleStore table
3. Use ML Functions for training and predictions
4. Run some common Data Analysis tasks

**Prerequisites**: Ensure ML Functions are installed on your deployment (AI > AI & ML Functions).

In [1]:
%%sql
-- Ensure that ML_CLASSIFY is listed in the Functions_in_cluster column
show functions in cluster;

In [2]:
!pip install -q httplib2 seaborn pandas numpy scikit-learn

### Load and Prepare the Titanic Dataset

We'll use the famous Titanic dataset from seaborn, which contains passenger information from the RMS Titanic. The goal is to predict whether a passenger survived based on features like age, sex, ticket class, and fare.

In [3]:
%%sql
CREATE DATABASE IF NOT EXISTS temp;
USE temp;

In [4]:
import json
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Display basic information
print(f"Dataset shape: {titanic_df.shape}")
print(f"\nColumn names: {list(titanic_df.columns)}")
print("\nFirst 5 rows:")
print(titanic_df.head())

# Check survival distribution
print("\nSurvival distribution:")
print(titanic_df['survived'].value_counts())

### Clean and Prepare Features

We'll select the most important features and handle missing values to create a clean dataset for training.

In [5]:
# Select relevant columns for prediction
columns_to_use = ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
titanic_clean = titanic_df[columns_to_use].copy()

# Fill missing values
titanic_clean['age'] = titanic_clean['age'].fillna(titanic_clean['age'].median())
titanic_clean['fare'] = titanic_clean['fare'].fillna(titanic_clean['fare'].median())
titanic_clean['embarked'] = titanic_clean['embarked'].fillna('S')  # Most common port

# Drop any remaining rows with missing values
titanic_clean = titanic_clean.dropna()

# Convert survived to text labels for classification
titanic_clean['survival_status'] = titanic_clean['survived'].map({
    0: 'Died',
    1: 'Survived'
})

# Drop the original numeric survived column
titanic_clean = titanic_clean.drop('survived', axis=1)

print(f"Clean dataset shape: {titanic_clean.shape}")
print("\nMissing values per column:")
print(titanic_clean.isnull().sum())
print("\nSurvival status distribution:")
print(titanic_clean['survival_status'].value_counts())
print("\nFirst 5 rows of clean data:")
print(titanic_clean.head())
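
To make the imputation step concrete, here is a minimal plain-Python sketch of what filling missing values with the median does. The helper name `fill_with_median` and the sample ages are made up for illustration; the notebook itself relies on pandas, whose `median()` skips missing values by default.

```python
from statistics import median

def fill_with_median(values):
    """Replace None entries with the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    return [fill_value if v is None else v for v in values]

# Made-up ages; None stands in for a missing value (NaN in pandas).
ages = [22.0, None, 38.0, 26.0, None]
filled_ages = fill_with_median(ages)
print(filled_ages)
```

The median is less sensitive than the mean to a few very large values, which is one reason it is a common default for skewed columns such as `fare`.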

### Split Data into Training and Test Sets

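We'll split the data into 80% training and 20% test sets to evaluate model performance. The split is stratified on `survival_status`, so both sets keep roughly the same proportion of survivors.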

In [6]:
# Split into train (80%) and test (20%) sets
train_df, test_df = train_test_split(
    titanic_clean,
    test_size=0.2,
    random_state=42,
    stratify=titanic_clean['survival_status']
)

print(f"Training set size: {len(train_df)} passengers")
print(f"Test set size: {len(test_df)} passengers")
print("\nTraining set survival distribution:")
print(train_df['survival_status'].value_counts())
print("\nTest set survival distribution:")
print(test_df['survival_status'].value_counts())
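
The effect of `stratify` can be illustrated with a small hand-rolled sketch. This is not scikit-learn's implementation; the function `stratified_split` and the toy labels are assumptions made for illustration. It splits each class separately so that both sets end up with the same class shares.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy labels: 60% "Died", 40% "Survived".
labels = ["Died"] * 60 + ["Survived"] * 40
train_idx, test_idx = stratified_split(labels)

train_share = sum(labels[i] == "Survived" for i in train_idx) / len(train_idx)
test_share = sum(labels[i] == "Survived" for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_share, test_share)
```

Without stratification, a small test set can by chance contain noticeably more or fewer survivors than the training set, which makes accuracy figures harder to interpret.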

In [7]:
%%sql
DROP TABLE IF EXISTS titanic_training_data;
DROP TABLE IF EXISTS titanic_test_data;
DROP TABLE IF EXISTS titanic_predictions;

CREATE TABLE titanic_training_data (
    pclass INT,
    sex VARCHAR(10),
    age FLOAT,
    sibsp INT,
    parch INT,
    fare FLOAT,
    embarked VARCHAR(1),
    survival_status VARCHAR(10)
);

CREATE TABLE titanic_test_data (
    pclass INT,
    sex VARCHAR(10),
    age FLOAT,
    sibsp INT,
    parch INT,
    fare FLOAT,
    embarked VARCHAR(1),
    survival_status VARCHAR(10)
);

CREATE TABLE titanic_predictions (
    pclass INT,
    sex VARCHAR(10),
    age FLOAT,
    sibsp INT,
    parch INT,
    fare FLOAT,
    embarked VARCHAR(1),
    actual_status VARCHAR(10),
    predicted_status JSON
);
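
Because the string columns above have fixed `VARCHAR` sizes, it can help to check the data against those sizes before inserting it. The sketch below is a hypothetical helper, not part of any SingleStore API; `VARCHAR_LIMITS` simply mirrors the column definitions in the `CREATE TABLE` statements, and the sample rows are made up.

```python
# Hypothetical limits copied from the CREATE TABLE statements in this notebook.
VARCHAR_LIMITS = {"sex": 10, "embarked": 1, "survival_status": 10}

def find_oversized_values(records, limits=VARCHAR_LIMITS):
    """Return (row_index, column, value) for every string longer than its column limit."""
    problems = []
    for i, row in enumerate(records):
        for column, max_len in limits.items():
            value = row.get(column)
            if isinstance(value, str) and len(value) > max_len:
                problems.append((i, column, value))
    return problems

# Made-up rows: the second one would not fit into embarked VARCHAR(1).
sample_rows = [
    {"sex": "female", "embarked": "S", "survival_status": "Survived"},
    {"sex": "male", "embarked": "Southampton", "survival_status": "Died"},
]
problems = find_oversized_values(sample_rows)
print(problems)
```

With the DataFrames from this notebook, the same check could be run as `find_oversized_values(train_df.to_dict("records"))`.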

### Load Data into SingleStore Tables

We'll use pandas to insert the training and test data into our SingleStore tables.

In [8]:
import singlestoredb as s2

# Create engine with database specified
engine = s2.create_engine(database='temp')

# Insert training data
train_df.to_sql(
    'titanic_training_data',
    con=engine,
    if_exists='append',
    index=False,
    method='multi'
)

# Insert test data
test_df.to_sql(
    'titanic_test_data',
    con=engine,
    if_exists='append',
    index=False,
    method='multi'
)

print(f"Inserted {len(train_df)} rows into titanic_training_data")
print(f"Inserted {len(test_df)} rows into titanic_test_data")

### Verify Data Load

Let's verify that our data was loaded correctly and review the passenger demographics.

In [9]:
%%sql
SELECT COUNT(*) as training_count FROM titanic_training_data;

In [10]:
%%sql
SELECT COUNT(*) as test_count FROM titanic_test_data;

In [11]:
%%sql
SELECT
    survival_status,
    COUNT(*) as passenger_count,
    ROUND(AVG(age), 1) as avg_age,
    ROUND(AVG(fare), 2) as avg_fare
FROM titanic_training_data
GROUP BY survival_status;
SELECT * FROM titanic_training_data LIMIT 5;

### Train the ML Classification Model

Now we'll train an ML model using the `%%s2ml train` cell magic. This uses SingleStore's ML Functions to train a classification model that predicts passenger survival.

**Note:** Training may take several minutes, depending on the compute size selected. Ideally, the model picks up patterns such as "women and children first" and the effect of ticket class on survival.

In [12]:
%%s2ml train as training_result
task: classification
model: titanic_survival_predictor
db: temp
input_table: titanic_training_data
target_column: survival_status
description: "Titanic passenger survival prediction based on demographics and ticket info"
runtime: cpu-small
selected_features: {"mode":"*","features":null}
force: True

### Check Training Results

The training result is assigned to the variable `training_result`. Let's examine the training details.

In [13]:
# Display the training result
print(json.dumps(training_result, indent=4))

### Monitor Training Status

Use the `%s2ml status` command to view the model details and training status. The status will be one of: Pre-processing, Training, Completed, or Error.

In [14]:
%s2ml status --model titanic_survival_predictor

<div class="alert alert-block alert-warning">
    <b class="fa fa-solid fa-exclamation-circle"></b>
    <div>
        <p><b>Note</b></p>
        <p>Wait for training to complete before proceeding to the next section.</p>
    </div>
</div>

You can re-run the cell above to check the status. Once the `pipeline_status` shows "Ready", you can proceed with predictions.
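
Re-running the status cell by hand amounts to a polling loop. The sketch below shows that pattern in plain Python under stated assumptions: `fetch_status` is a hypothetical callable that you would have to supply yourself (this notebook only shows the `%s2ml status` magic, not a Python API), and the state names are the ones mentioned in the text above.

```python
import time

def wait_for_status(fetch_status, done_states=("Ready",), failed_states=("Error",),
                    poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Call fetch_status() repeatedly until it returns a done or failed state."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"Training failed with status: {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still '{status}' after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Demo with a fake status source so the sketch runs without a cluster.
fake_statuses = iter(["Pre-processing", "Training", "Ready"])
final_status = wait_for_status(
    lambda: next(fake_statuses), poll_seconds=0.0, sleep=lambda seconds: None
)
print(final_status)
```

Pass whichever terminal state your deployment reports in `done_states`; keeping it a parameter avoids hard-coding a single name.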

### Run Sample Predictions

Once training is complete, let's run predictions on a few sample passengers from our test dataset to see how the model performs. First, make sure the correct database is still selected.

In [15]:
%%sql
USE temp;

In [16]:
%%sql
SELECT
    cluster.ML_CLASSIFY('titanic_survival_predictor', TO_JSON(passenger.*)) as predicted_status,
    passenger.survival_status as actual_status,
    passenger.pclass as ticket_class,
    passenger.sex,
    passenger.age,
    passenger.fare
FROM (SELECT * FROM titanic_test_data LIMIT 10) AS passenger;

### Run Predictions on Full Test Dataset

Now let's run predictions on the entire test dataset and store the results in our predictions table.

In [17]:
%%sql
INSERT INTO titanic_predictions (
    pclass, sex, age, sibsp, parch, fare, embarked,
    actual_status, predicted_status
)
SELECT
    passenger.pclass,
    passenger.sex,
    passenger.age,
    passenger.sibsp,
    passenger.parch,
    passenger.fare,
    passenger.embarked,
    passenger.survival_status as actual_status,
    cluster.ML_CLASSIFY('titanic_survival_predictor', TO_JSON(passenger.*)) as predicted_status
FROM titanic_test_data AS passenger;

### Evaluate Model Performance

Let's analyze the prediction accuracy by comparing actual vs predicted survival status.

In [18]:
%%sql
SELECT
    COUNT(*) as total_predictions,
    SUM(CASE WHEN actual_status = JSON_EXTRACT_STRING(predicted_status, 'predicted_label') THEN 1 ELSE 0 END) as correct_predictions,
    ROUND(100.0 * SUM(CASE WHEN actual_status = JSON_EXTRACT_STRING(predicted_status, 'predicted_label') THEN 1 ELSE 0 END) / COUNT(*), 2) as accuracy_percentage
FROM titanic_predictions;
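
The same accuracy figure, plus a simple confusion matrix, can be computed client-side. The sketch below assumes that each prediction is a JSON object with a `predicted_label` key, as the query above implies; the helper name `summarize_predictions` and the sample rows are made up and are not model results.

```python
import json
from collections import Counter

def summarize_predictions(rows):
    """rows: iterable of (actual_status, predicted_status) pairs.

    predicted_status may be a JSON string or an already-parsed dict,
    since database drivers differ in how they return JSON columns.
    """
    confusion = Counter()
    for actual, raw in rows:
        prediction = raw if isinstance(raw, dict) else json.loads(raw)
        confusion[(actual, prediction["predicted_label"])] += 1

    total = sum(confusion.values())
    correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)
    accuracy_pct = round(100.0 * correct / total, 2) if total else 0.0
    return accuracy_pct, dict(confusion)

# Made-up rows for illustration only.
sample = [
    ("Survived", '{"predicted_label": "Survived", "confidence": 0.91}'),
    ("Died", '{"predicted_label": "Died", "confidence": 0.84}'),
    ("Died", '{"predicted_label": "Survived", "confidence": 0.55}'),
    ("Survived", '{"predicted_label": "Survived", "confidence": 0.77}'),
]
accuracy_pct, confusion = summarize_predictions(sample)
print(accuracy_pct)
print(confusion)
```

The `(actual, predicted)` counts show which kind of error dominates, which a single accuracy percentage hides.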

### Analyze Survival Factors

Let's examine how passenger characteristics such as sex and ticket class relate to actual survival in the test set.

In [19]:
%%sql
-- Survival rate by sex
SELECT
    sex,
    COUNT(*) as total_passengers,
    SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) as actual_survivors,
    ROUND(100.0 * SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) / COUNT(*), 1) as survival_rate_pct
FROM titanic_predictions
GROUP BY sex
ORDER BY survival_rate_pct DESC;

In [20]:
%%sql
-- Survival rate by passenger class
SELECT
    pclass as ticket_class,
    COUNT(*) as total_passengers,
    SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) as actual_survivors,
    ROUND(100.0 * SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) / COUNT(*), 1) as survival_rate_pct,
    ROUND(AVG(fare), 2) as avg_fare_paid
FROM titanic_predictions
GROUP BY pclass
ORDER BY pclass;

### Examine Misclassified Passengers

Let's look at passengers where the model made incorrect predictions to understand potential model limitations.

In [21]:
%%sql
SELECT
    actual_status,
    JSON_EXTRACT_STRING(predicted_status, 'predicted_label') as predicted_label,
    JSON_EXTRACT_DOUBLE(predicted_status, 'confidence') as confidence,
    pclass as ticket_class,
    sex,
    age,
    sibsp as siblings_spouses,
    parch as parents_children,
    fare,
    embarked
FROM titanic_predictions
WHERE actual_status != JSON_EXTRACT_STRING(predicted_status, 'predicted_label')
LIMIT 15;
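
When reviewing errors, the mistakes the model was most confident about are often the most informative. The sketch below ranks misclassified rows by confidence. It assumes the `predicted_label` and `confidence` keys used in the query above; the helper name `rank_errors_by_confidence` and the sample rows are made up for illustration.

```python
import json

def rank_errors_by_confidence(rows):
    """Return misclassified rows, most confident mistakes first."""
    errors = []
    for row in rows:
        raw = row["predicted_status"]
        prediction = raw if isinstance(raw, dict) else json.loads(raw)
        if prediction["predicted_label"] != row["actual_status"]:
            errors.append({
                **row,
                "predicted_label": prediction["predicted_label"],
                "confidence": float(prediction["confidence"]),
            })
    return sorted(errors, key=lambda e: e["confidence"], reverse=True)

# Made-up rows for illustration only; these are not model results.
sample_rows = [
    {"actual_status": "Died", "sex": "female",
     "predicted_status": '{"predicted_label": "Survived", "confidence": 0.58}'},
    {"actual_status": "Survived", "sex": "female",
     "predicted_status": '{"predicted_label": "Survived", "confidence": 0.93}'},
    {"actual_status": "Survived", "sex": "male",
     "predicted_status": '{"predicted_label": "Died", "confidence": 0.88}'},
]
ranked = rank_errors_by_confidence(sample_rows)
summary = [(r["actual_status"], r["predicted_label"], r["confidence"]) for r in ranked]
print(summary)
```

In SQL, a similar ordering could be obtained by adding an `ORDER BY confidence DESC` clause before the `LIMIT`.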

### Cleanup

In [22]:
%%sql
DROP TABLE IF EXISTS titanic_training_data;
DROP TABLE IF EXISTS titanic_test_data;
DROP TABLE IF EXISTS titanic_predictions;
DROP DATABASE IF EXISTS temp;

<div id="singlestore-footer" style="background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px"></div>
<div><img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png" style="padding: 0px; margin: 0px; height: 24px"/></div>