From 0b71f9cb296c55a3377ff08bdd80b22482919647 Mon Sep 17 00:00:00 2001 From: Bharath Swamy Date: Wed, 8 Oct 2025 12:33:47 -0700 Subject: [PATCH 1/4] add(notebooks): ml functions example and fix ai functions --- .../meta.toml | 2 +- .../notebook.ipynb | 98 +-- .../meta.toml | 13 + .../notebook.ipynb | 640 ++++++++++++++++++ 4 files changed, 703 insertions(+), 50 deletions(-) create mode 100644 notebooks/getting-started-with-ml-functions/meta.toml create mode 100644 notebooks/getting-started-with-ml-functions/notebook.ipynb diff --git a/notebooks/getting-started-with-ai-functions/meta.toml b/notebooks/getting-started-with-ai-functions/meta.toml index 52b66db..4188002 100644 --- a/notebooks/getting-started-with-ai-functions/meta.toml +++ b/notebooks/getting-started-with-ai-functions/meta.toml @@ -8,6 +8,6 @@ description="""\ icon="browser" difficulty="beginner" tags=["advanced", "notebooks", "python"] -lesson_areas=[] +lesson_areas=["AI"] destinations=["spaces"] minimum_tier="standard" diff --git a/notebooks/getting-started-with-ai-functions/notebook.ipynb b/notebooks/getting-started-with-ai-functions/notebook.ipynb index 2973a94..cda2bde 100644 --- a/notebooks/getting-started-with-ai-functions/notebook.ipynb +++ b/notebooks/getting-started-with-ai-functions/notebook.ipynb @@ -19,6 +19,7 @@ { "attachments": {}, "cell_type": "markdown", + "id": "5831c1ac", "metadata": {}, "source": [ "
\n", @@ -38,23 +39,23 @@ "3. Demonstrate powerful AI Functions for text processing and analysis\n", "\n", "**Prerequisites**: Ensure AI Functions are enabled on your deployment (AI Services > AI & ML Functions)." - ], - "id": "5831c1ac" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "ea429156", "metadata": {}, "source": [ "## Create some simple tables\n", "\n", "This setup establishes a basic relational structure to store some reviews for restaurants. Ensure you have selected a database and have CREATE permissions to create/delete tables." - ], - "id": "ea429156" + ] }, { "cell_type": "code", "execution_count": 1, + "id": "1f8ccd75", "metadata": {}, "outputs": [ { @@ -97,21 +98,21 @@ " Summary TEXT,\n", " Text TEXT\n", ");" - ], - "id": "1f8ccd75" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "6a2118dd", "metadata": {}, "source": [ "## Install the required packages" - ], - "id": "6a2118dd" + ] }, { "cell_type": "code", "execution_count": 2, + "id": "40350277", "metadata": {}, "outputs": [ { @@ -143,21 +144,21 @@ ], "source": [ "!pip install kagglehub pandas" - ], - "id": "40350277" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "97437a79", "metadata": {}, "source": [ "## Download and Load Dataset" - ], - "id": "97437a79" + ] }, { "cell_type": "code", "execution_count": 3, + "id": "cf62cc7e", "metadata": {}, "outputs": [ { @@ -349,21 +350,21 @@ "print(f\"Columns: {list(df.columns)}\")\n", "print(\"\\nFirst few rows:\")\n", "df.head()" - ], - "id": "cf62cc7e" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "0c938c99", "metadata": {}, "source": [ "## Load Data into SingleStore" - ], - "id": "0c938c99" + ] }, { "cell_type": "code", "execution_count": 4, + "id": "4d427d08", "metadata": {}, "outputs": [ { @@ -396,21 +397,21 @@ ")\n", "\n", "print(\"Data loaded successfully!\")" - ], - "id": "4d427d08" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "ee21f51b", "metadata": {}, "source": [ " ## Verify 
Data Load" - ], - "id": "ee21f51b" + ] }, { "cell_type": "code", "execution_count": 5, + "id": "8423c269", "metadata": {}, "outputs": [ { @@ -458,21 +459,21 @@ "%%sql\n", "-- Check the number of reviews loaded\n", "SELECT COUNT(*) as total_reviews FROM reviews;" - ], - "id": "8423c269" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "d6c8e487", "metadata": {}, "source": [ "## Sample Data Preview" - ], - "id": "d6c8e487" + ] }, { "cell_type": "code", "execution_count": 6, + "id": "ccefec53", "metadata": {}, "outputs": [ { @@ -602,24 +603,24 @@ "SELECT Id, ProductId, Score, Summary, LEFT(Text, 100) as Review_Preview\n", "FROM reviews\n", "LIMIT 10;" - ], - "id": "ccefec53" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "0bb3deb8", "metadata": {}, "source": [ "## AI Functions Demonstrations\n", "\n", "Now let's explore the power of SingleStore AI Functions for text analysis and processing.\n", "Ensure that AI functions are enabled for the org and you are able to list the available AI functions" - ], - "id": "0bb3deb8" + ] }, { "cell_type": "code", "execution_count": 7, + "id": "bd293861", "metadata": {}, "outputs": [ { @@ -769,12 +770,12 @@ "%%sql\n", "USE cluster;\n", "SHOW functions;" - ], - "id": "bd293861" + ] }, { "cell_type": "code", "execution_count": 8, + "id": "05d5d27a", "metadata": {}, "outputs": [ { @@ -824,12 +825,12 @@ "SELECT cluster.AI_COMPLETE(\n", " 'What is SingleStore?'\n", ") AS completion;" - ], - "id": "05d5d27a" + ] }, { "cell_type": "code", "execution_count": 9, + "id": "9f842a0d", "metadata": {}, "outputs": [ { @@ -888,7 +889,7 @@ "%%sql\n", "-- AI_SENTIMENT: Analyze sentiment of customer reviews for a specific product\n", "-- WHERE ProductId = \n", - "-- Remember to specific the datbase name. In this example 'temp' is the Database name\n", + "-- Remember to specify the database name. 
In this example 'temp' is the Database name\n", "SELECT\n", " Id,\n", " ProductId,\n", @@ -898,12 +899,12 @@ "FROM temp.reviews\n", "WHERE ProductId = 'B000NY8ODS'\n", "LIMIT 10;" - ], - "id": "9f842a0d" + ] }, { "cell_type": "code", "execution_count": 10, + "id": "56ff7a17", "metadata": {}, "outputs": [ { @@ -1015,12 +1016,12 @@ " review_count,\n", " cluster.AI_SENTIMENT(combined_text) as overall_sentiment\n", "FROM grouped_reviews;" - ], - "id": "56ff7a17" + ] }, { "cell_type": "code", "execution_count": 11, + "id": "b9786b66", "metadata": {}, "outputs": [ { @@ -1122,12 +1123,12 @@ " 15\n", " ) AS summary\n", "FROM long_reviews;" - ], - "id": "b9786b66" + ] }, { "cell_type": "code", "execution_count": 12, + "id": "4febc8e0", "metadata": {}, "outputs": [ { @@ -1263,12 +1264,12 @@ " '[quality, price, shipping, taste]'\n", " ) AS classification\n", "FROM negative_reviews;" - ], - "id": "4febc8e0" + ] }, { "cell_type": "code", "execution_count": 13, + "id": "40f4cd14", "metadata": {}, "outputs": [ { @@ -1431,12 +1432,12 @@ " 'Does this customer indicate they will buy this product again? Answer with yes, no, or unclear only'\n", " ) AS repeat_purchase_intent\n", "FROM positive_reviews;" - ], - "id": "40f4cd14" + ] }, { "cell_type": "code", "execution_count": 14, + "id": "a09f2d5b", "metadata": {}, "outputs": [ { @@ -1585,12 +1586,12 @@ " 'Is this customer at high risk of not purchasing again? 
Answer with high, medium, or low only'\n", " ) AS churn_risk\n", "FROM low_rated_reviews;" - ], - "id": "a09f2d5b" + ] }, { "cell_type": "code", "execution_count": 15, + "id": "3d78f449", "metadata": {}, "outputs": [ { @@ -1685,12 +1686,12 @@ " 'spanish'\n", " ) AS spanish_translation\n", "FROM translatable_reviews;" - ], - "id": "3d78f449" + ] }, { "cell_type": "code", "execution_count": 16, + "id": "082dc59a", "metadata": {}, "outputs": [ { @@ -1860,8 +1861,7 @@ " cluster.AI_CLASSIFY(Text, '[quality, value, taste, packaging]') as category,\n", " cluster.AI_SUMMARIZE(Text, 'aifunctions_chat_default', 10) as brief_summary\n", "FROM product_reviews;" - ], - "id": "082dc59a" + ] }, { "cell_type": "markdown", diff --git a/notebooks/getting-started-with-ml-functions/meta.toml b/notebooks/getting-started-with-ml-functions/meta.toml new file mode 100644 index 0000000..00f9499 --- /dev/null +++ b/notebooks/getting-started-with-ml-functions/meta.toml @@ -0,0 +1,13 @@ +[meta] +authors=["bharath-swamy"] +title="Demonstrate ML function Classify" +description="""\ + Learn how to train an ML Classify \ + model and run it to predict the class of an input row. + """ +icon="browser" +difficulty="beginner" +tags=["advanced", "notebooks", "python"] +lesson_areas=["AI"] +destinations=["spaces"] +minimum_tier="standard" diff --git a/notebooks/getting-started-with-ml-functions/notebook.ipynb b/notebooks/getting-started-with-ml-functions/notebook.ipynb new file mode 100644 index 0000000..5100f35 --- /dev/null +++ b/notebooks/getting-started-with-ml-functions/notebook.ipynb @@ -0,0 +1,640 @@ +{ + "cells": [ + { + "id": "d9a6c944", + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "
SingleStore Notebooks
\n", + "

Demonstrate ML function Classify

\n", + "
\n", + "
" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " \n", + "
\n", + "

Note

\n", + "

You can use your existing Standard or Premium workspace with this Notebook.

\n", + "
\n", + "
\n", + "\n", + "\n", + "This feature is currently in **Private Preview**. Please reach out to support@singlestore.com to confirm if this feature can be enabled in your org.\n", + "\n", + "This Jupyter notebook will help you:\n", + "1. Load the titanic dataset\n", + "2. Store the data in a SingleStore table\n", + "3. Use ML Functions for training and predictions\n", + "4. Run some common Data Analysis tasks\n", + "\n", + "**Prerequisites**: Ensure ML Functions are enabled on your deployment (AI Services > AI & ML Functions)." + ], + "id": "e5b97b82" + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "-- Ensure that ML_CLASSIFY is listed in Functions_in_cluster column\n", + "use cluster;\n", + "\n", + "show functions;" + ], + "id": "a5610f23" + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -q seaborn pandas numpy scikit-learn" + ], + "id": "1c891b9c" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load and Prepare the Titanic Dataset\n", + "\n", + "We'll use the famous Titanic dataset from seaborn, which contains passenger information from the RMS Titanic. The goal is to predict whether a passenger survived based on features like age, sex, ticket class, and fare." 
+ ], + "id": "89a194f3" + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import seaborn as sns\n", + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "# Load the Titanic dataset\n", + "titanic_df = sns.load_dataset('titanic')\n", + "\n", + "# Display basic information\n", + "print(f\"Dataset shape: {titanic_df.shape}\")\n", + "print(f\"\\nColumn names: {list(titanic_df.columns)}\")\n", + "print(f\"\\nFirst 5 rows:\")\n", + "print(titanic_df.head())\n", + "\n", + "# Check survival distribution\n", + "print(f\"\\nSurvival Distribution:\")\n", + "print(titanic_df['survived'].value_counts())" + ], + "id": "891ebe48" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Clean and Prepare Features\n", + "\n", + "We'll select the most important features and handle missing values to create a clean dataset for training." + ], + "id": "6977b0a4" + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# Select relevant columns for prediction\n", + "columns_to_use = ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']\n", + "titanic_clean = titanic_df[columns_to_use].copy()\n", + "\n", + "# Fill missing values\n", + "titanic_clean['age'] = titanic_clean['age'].fillna(titanic_clean['age'].median())\n", + "titanic_clean['fare'] = titanic_clean['fare'].fillna(titanic_clean['fare'].median())\n", + "titanic_clean['embarked'] = titanic_clean['embarked'].fillna('S') # Most common port\n", + "\n", + "# Drop any remaining rows with missing values\n", + "titanic_clean = titanic_clean.dropna()\n", + "\n", + "# Convert survived to text labels for classification\n", + "titanic_clean['survival_status'] = titanic_clean['survived'].map({\n", + " 0: 'Died',\n", + " 1: 'Survived'\n", + "})\n", + "\n", + "# Drop the original numeric survived column\n", 
+ "titanic_clean = titanic_clean.drop('survived', axis=1)\n", + "\n", + "print(f\"Clean dataset shape: {titanic_clean.shape}\")\n", + "print(f\"\\nMissing values per column:\")\n", + "print(titanic_clean.isnull().sum())\n", + "print(f\"\\nSurvival status distribution:\")\n", + "print(titanic_clean['survival_status'].value_counts())\n", + "print(f\"\\nFirst 5 rows of clean data:\")\n", + "print(titanic_clean.head())" + ], + "id": "d3c92afe" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Split Data into Training and Test Sets\n", + "\n", + "We'll split the data into 80% training and 20% test sets to evaluate model performance." + ], + "id": "6206f158" + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "# Split into train (80%) and test (20%) sets\n", + "train_df, test_df = train_test_split(\n", + " titanic_clean,\n", + " test_size=0.2,\n", + " random_state=42,\n", + " stratify=titanic_clean['survival_status']\n", + ")\n", + "\n", + "print(f\"Training set size: {len(train_df)} passengers\")\n", + "print(f\"Test set size: {len(test_df)} passengers\")\n", + "print(f\"\\nTraining set survival distribution:\")\n", + "print(train_df['survival_status'].value_counts())\n", + "print(f\"\\nTest set survival distribution:\")\n", + "print(test_df['survival_status'].value_counts())" + ], + "id": "0610ab81" + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "USE temp; -- replace with your own database name\n", + "\n", + "\n", + "DROP TABLE IF EXISTS titanic_training_data;\n", + "DROP TABLE IF EXISTS titanic_test_data;\n", + "DROP TABLE IF EXISTS titanic_predictions;\n", + "\n", + "CREATE TABLE titanic_training_data (\n", + " pclass INT,\n", + " sex VARCHAR(10),\n", + " age FLOAT,\n", + " sibsp INT,\n", + " parch INT,\n", + " fare FLOAT,\n", + " embarked VARCHAR(1),\n", + " survival_status VARCHAR(10)\n", + 
");\n", + "\n", + "CREATE TABLE titanic_test_data (\n", + " pclass INT,\n", + " sex VARCHAR(10),\n", + " age FLOAT,\n", + " sibsp INT,\n", + " parch INT,\n", + " fare FLOAT,\n", + " embarked VARCHAR(1),\n", + " survival_status VARCHAR(10)\n", + ");\n", + "\n", + "CREATE TABLE titanic_predictions (\n", + " pclass INT,\n", + " sex VARCHAR(10),\n", + " age FLOAT,\n", + " sibsp INT,\n", + " parch INT,\n", + " fare FLOAT,\n", + " embarked VARCHAR(1),\n", + " actual_status VARCHAR(10),\n", + " predicted_status VARCHAR(10)\n", + ");" + ], + "id": "caa2aa64" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load Data into SingleStore Tables\n", + "\n", + "We'll use pandas to insert the training and test data into our SingleStore tables." + ], + "id": "459e967c" + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "import singlestoredb as s2\n", + "\n", + "# Specify the database name in the connection URL\n", + "database_name = 'temp' # Replace with your actual database name\n", + "\n", + "# Create engine with database specified\n", + "engine = s2.create_engine(database=database_name)\n", + "\n", + "\n", + "\n", + "# Insert training data\n", + "train_df.to_sql(\n", + " 'titanic_training_data',\n", + " con=engine,\n", + " # conn,\n", + " if_exists='append',\n", + " index=False,\n", + " method='multi'\n", + ")\n", + "\n", + "# Insert test data\n", + "test_df.to_sql(\n", + " 'titanic_test_data',\n", + " con=engine,\n", + " # conn,\n", + " if_exists='append',\n", + " index=False,\n", + " method='multi'\n", + ")\n", + "\n", + "print(f\"Inserted {len(train_df)} rows into titanic_training_data\")\n", + "print(f\"Inserted {len(test_df)} rows into titanic_test_data\")" + ], + "id": "2c3a0d65" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Verify Data Load\n", + "\n", + "Let's verify that our data was loaded correctly and review the 
passenger demographics." + ], + "id": "26421f4f" + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "SELECT COUNT(*) as training_count FROM titanic_training_data;\n", + "SELECT COUNT(*) as test_count FROM titanic_test_data;\n", + "SELECT\n", + " survival_status,\n", + " COUNT(*) as passenger_count,\n", + " ROUND(AVG(age), 1) as avg_age,\n", + " ROUND(AVG(fare), 2) as avg_fare\n", + "FROM titanic_training_data\n", + "GROUP BY survival_status;\n", + "SELECT * FROM titanic_training_data LIMIT 5;" + ], + "id": "49f3b05b" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Train the ML Classification Model\n", + "\n", + "Now we'll train an ML model using the `%s2ml train` magic command. This will use SingleStore's ML Functions to train a classification model that predicts passenger survival.\n", + "\n", + "**Note:** Training may take several minutes depending on the compute size selected. The model will learn patterns like \"women and children first\" and the impact of ticket class on survival." + ], + "id": "042927a9" + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "%%s2ml train as training_result\n", + "task: classification\n", + "model: titanic_survival_predictor\n", + "db: temp #\n", + "input_table: titanic_training_data\n", + "target_column: survival_status\n", + "description: \"Titanic passenger survival prediction based on demographics and ticket info\"\n", + "runtime: cpu-small\n", + "selected_features: {\"mode\":\"*\",\"features\":null}" + ], + "id": "0382fa90" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Check Training Results\n", + "\n", + "The training result is assigned to the variable `training_result`. Let's examine the training details." 
+ ], + "id": "16c8f8c6" + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# Display the training result\n", + "training_result" + ], + "id": "ae50a529" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Monitor Training Status\n", + "\n", + "Use the `%s2ml show` command to view the model details and training status. The status will be one of: Pre-processing, Training, Done, or Error." + ], + "id": "5ff6fa16" + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "# In [10]:\n", + "%s2ml show --model titanic_survival_predictor" + ], + "id": "45edbab3" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Wait for training to complete** before proceeding to the next section. You can re-run the cell above to check the status. Once the status shows \"Done\", you can proceed with predictions.\n", + "\n", + "### Run Sample Predictions\n", + "\n", + "Once training is complete, let's run predictions on a few sample passengers from our test dataset to see how the model performs." + ], + "id": "e0c7eaff" + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "SELECT\n", + " cluster.ML_CLASSIFY('titanic_survival_predictor', TO_JSON(passenger.*)) as predicted_status,\n", + " passenger.survival_status as actual_status,\n", + " passenger.pclass as ticket_class,\n", + " passenger.sex,\n", + " passenger.age,\n", + " passenger.fare\n", + "FROM (SELECT * FROM titanic_test_data LIMIT 10) AS passenger;" + ], + "id": "85f05f71" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run Predictions on Full Test Dataset\n", + "\n", + "Now let's run predictions on the entire test dataset and store the results in our predictions table." 
+ ], + "id": "8616a501" + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# In [12]:\n", + "%%sql\n", + "INSERT INTO titanic_predictions (\n", + " pclass, sex, age, sibsp, parch, fare, embarked,\n", + " actual_status, predicted_status\n", + ")\n", + "SELECT\n", + " passenger.pclass,\n", + " passenger.sex,\n", + " passenger.age,\n", + " passenger.sibsp,\n", + " passenger.parch,\n", + " passenger.fare,\n", + " passenger.embarked,\n", + " passenger.survival_status as actual_status,\n", + " cluster.ML_CLASSIFY('titanic_survival_predictor', TO_JSON(passenger.*)) as predicted_status\n", + "FROM titanic_test_data AS passenger;" + ], + "id": "3993868e" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluate Model Performance\n", + "\n", + "Let's analyze the prediction accuracy by comparing actual vs predicted survival status." + ], + "id": "dc306a9a" + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "SELECT\n", + " COUNT(*) as total_predictions,\n", + " SUM(CASE WHEN actual_status = predicted_status THEN 1 ELSE 0 END) as correct_predictions,\n", + " ROUND(100.0 * SUM(CASE WHEN actual_status = predicted_status THEN 1 ELSE 0 END) / COUNT(*), 2) as accuracy_percentage\n", + "FROM titanic_predictions;" + ], + "id": "def69045" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Analyze Survival Factors\n", + "\n", + "Let's examine how different passenger characteristics influenced survival predictions." 
+ ], + "id": "3cf0f9a0" + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "-- Survival rate by sex\n", + "SELECT\n", + " sex,\n", + " COUNT(*) as total_passengers,\n", + " SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) as actual_survivors,\n", + " ROUND(100.0 * SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) / COUNT(*), 1) as survival_rate_pct\n", + "FROM titanic_predictions\n", + "GROUP BY sex\n", + "ORDER BY survival_rate_pct DESC;" + ], + "id": "9a7770b8" + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [], + "id": "d9598cae" + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "-- Survival rate by passenger class\n", + "SELECT\n", + " pclass as ticket_class,\n", + " COUNT(*) as total_passengers,\n", + " SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) as actual_survivors,\n", + " ROUND(100.0 * SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) / COUNT(*), 1) as survival_rate_pct,\n", + " ROUND(AVG(fare), 2) as avg_fare_paid\n", + "FROM titanic_predictions\n", + "GROUP BY pclass\n", + "ORDER BY pclass;" + ], + "id": "598504a9" + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Examine Misclassified Passengers\n", + "\n", + "Let's look at passengers where the model made incorrect predictions to understand potential model limitations." 
+ ], + "id": "a96a3d39" + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "SELECT\n", + " actual_status,\n", + " predicted_status,\n", + " pclass as ticket_class,\n", + " sex,\n", + " age,\n", + " sibsp as siblings_spouses,\n", + " parch as parents_children,\n", + " fare,\n", + " embarked\n", + "FROM titanic_predictions\n", + "WHERE actual_status != predicted_status\n", + "LIMIT 15;" + ], + "id": "3f26a7f0" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Cleanup" + ], + "id": "2b054d99" + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "DROP TABLE IF EXISTS titanic_training_data;\n", + "DROP TABLE IF EXISTS titanic_test_data;\n", + "DROP TABLE IF EXISTS titanic_predictions;" + ], + "id": "abbc0dbb" + }, + { + "id": "0e8b6ce3", + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + } + ], + "metadata": { + "jupyterlab": { + "notebooks": { + "version_major": 6, + "version_minor": 4 + } + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimeType": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From bbb21d029dd5fb2ca86b0bf6cf939b9373ff4b86 Mon Sep 17 00:00:00 2001 From: Bharath Swamy Date: Wed, 8 Oct 2025 13:14:24 -0700 Subject: [PATCH 2/4] update(resource): nb-check add printlogs --- resources/nb-check.py | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/resources/nb-check.py b/resources/nb-check.py index bccf328..8fc41ed 100755 --- a/resources/nb-check.py +++ b/resources/nb-check.py @@ -3,6 +3,7 @@ import html import json import os +import subprocess import sys import tomllib import uuid @@ -328,3 +329,8 @@ def new_markdown_cell(cell_id: str, content: list[str]) -> dict[str, Any]: with open(f, 'w') as outfile: outfile.write(json.dumps(nb, indent=2)) outfile.write('\n') + + res = subprocess.run(['git', 'diff', f], capture_output=True, text=True) + if res.stdout: + print('--- ' + f + ' ---') + print(res.stdout) From 42ab58a4353d074939d2d54a526999b2139ec75e Mon Sep 17 00:00:00 2001 From: Bharath Swamy Date: Wed, 8 Oct 2025 13:23:44 -0700 Subject: [PATCH 3/4] fix(notebook): kafka notebook cell id --- notebooks/load-kafka-template/notebook.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/load-kafka-template/notebook.ipynb b/notebooks/load-kafka-template/notebook.ipynb index c096452..89acb6d 100644 --- a/notebooks/load-kafka-template/notebook.ipynb +++ b/notebooks/load-kafka-template/notebook.ipynb @@ -50,7 +50,7 @@ { "attachments": {}, "cell_type": "markdown", - "id": "64fdd646", + "id": "a3754e68", 
"metadata": {}, "source": [ "This notebook demonstrates how to create a sample table in SingleStore, set up a pipeline to import data from Kafka Topic, and run queries on the imported data. It is designed for users who want to integrate Kafka data with SingleStore and explore the capabilities of pipelines for efficient data ingestion." From 6a60e1cf78f3d82b90e0b7ac348c539cd1d72765 Mon Sep 17 00:00:00 2001 From: Bharath Swamy Date: Wed, 8 Oct 2025 14:08:44 -0700 Subject: [PATCH 4/4] fix(notebook): remove commented code --- .../notebook.ipynb | 140 +++++++++--------- 1 file changed, 68 insertions(+), 72 deletions(-) diff --git a/notebooks/getting-started-with-ml-functions/notebook.ipynb b/notebooks/getting-started-with-ml-functions/notebook.ipynb index 5100f35..660e1cf 100644 --- a/notebooks/getting-started-with-ml-functions/notebook.ipynb +++ b/notebooks/getting-started-with-ml-functions/notebook.ipynb @@ -19,6 +19,7 @@ { "attachments": {}, "cell_type": "markdown", + "id": "e5b97b82", "metadata": {}, "source": [ "
\n", @@ -39,12 +40,12 @@ "4. Run some common Data Analysis tasks\n", "\n", "**Prerequisites**: Ensure ML Functions are enabled on your deployment (AI Services > AI & ML Functions)." - ], - "id": "e5b97b82" + ] }, { "cell_type": "code", "execution_count": 1, + "id": "a5610f23", "metadata": {}, "outputs": [], "source": [ @@ -53,33 +54,33 @@ "use cluster;\n", "\n", "show functions;" - ], - "id": "a5610f23" + ] }, { "cell_type": "code", "execution_count": 2, + "id": "1c891b9c", "metadata": {}, "outputs": [], "source": [ "!pip install -q seaborn pandas numpy scikit-learn" - ], - "id": "1c891b9c" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "89a194f3", "metadata": {}, "source": [ "### Load and Prepare the Titanic Dataset\n", "\n", "We'll use the famous Titanic dataset from seaborn, which contains passenger information from the RMS Titanic. The goal is to predict whether a passenger survived based on features like age, sex, ticket class, and fare." - ], - "id": "89a194f3" + ] }, { "cell_type": "code", "execution_count": 3, + "id": "891ebe48", "metadata": {}, "outputs": [], "source": [ @@ -100,23 +101,23 @@ "# Check survival distribution\n", "print(f\"\\nSurvival Distribution:\")\n", "print(titanic_df['survived'].value_counts())" - ], - "id": "891ebe48" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "6977b0a4", "metadata": {}, "source": [ "### Clean and Prepare Features\n", "\n", "We'll select the most important features and handle missing values to create a clean dataset for training." 
- ], - "id": "6977b0a4" + ] }, { "cell_type": "code", "execution_count": 4, + "id": "d3c92afe", "metadata": {}, "outputs": [], "source": [ @@ -148,23 +149,23 @@ "print(titanic_clean['survival_status'].value_counts())\n", "print(f\"\\nFirst 5 rows of clean data:\")\n", "print(titanic_clean.head())" - ], - "id": "d3c92afe" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "6206f158", "metadata": {}, "source": [ "### Split Data into Training and Test Sets\n", "\n", "We'll split the data into 80% training and 20% test sets to evaluate model performance." - ], - "id": "6206f158" + ] }, { "cell_type": "code", "execution_count": 5, + "id": "0610ab81", "metadata": {}, "outputs": [], "source": [ @@ -182,12 +183,12 @@ "print(train_df['survival_status'].value_counts())\n", "print(f\"\\nTest set survival distribution:\")\n", "print(test_df['survival_status'].value_counts())" - ], - "id": "0610ab81" + ] }, { "cell_type": "code", "execution_count": 6, + "id": "caa2aa64", "metadata": {}, "outputs": [], "source": [ @@ -232,23 +233,23 @@ " actual_status VARCHAR(10),\n", " predicted_status VARCHAR(10)\n", ");" - ], - "id": "caa2aa64" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "459e967c", "metadata": {}, "source": [ "### Load Data into SingleStore Tables\n", "\n", "We'll use pandas to insert the training and test data into our SingleStore tables." 
- ], - "id": "459e967c" + ] }, { "cell_type": "code", "execution_count": 7, + "id": "2c3a0d65", "metadata": {}, "outputs": [], "source": [ @@ -260,13 +261,10 @@ "# Create engine with database specified\n", "engine = s2.create_engine(database=database_name)\n", "\n", - "\n", - "\n", "# Insert training data\n", "train_df.to_sql(\n", " 'titanic_training_data',\n", " con=engine,\n", - " # conn,\n", " if_exists='append',\n", " index=False,\n", " method='multi'\n", @@ -276,7 +274,6 @@ "test_df.to_sql(\n", " 'titanic_test_data',\n", " con=engine,\n", - " # conn,\n", " if_exists='append',\n", " index=False,\n", " method='multi'\n", @@ -284,23 +281,23 @@ "\n", "print(f\"Inserted {len(train_df)} rows into titanic_training_data\")\n", "print(f\"Inserted {len(test_df)} rows into titanic_test_data\")" - ], - "id": "2c3a0d65" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "26421f4f", "metadata": {}, "source": [ "### Verify Data Load\n", "\n", "Let's verify that our data was loaded correctly and review the passenger demographics." - ], - "id": "26421f4f" + ] }, { "cell_type": "code", "execution_count": 8, + "id": "49f3b05b", "metadata": {}, "outputs": [], "source": [ @@ -315,12 +312,12 @@ "FROM titanic_training_data\n", "GROUP BY survival_status;\n", "SELECT * FROM titanic_training_data LIMIT 5;" - ], - "id": "49f3b05b" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "042927a9", "metadata": {}, "source": [ "### Train the ML Classification Model\n", @@ -328,12 +325,12 @@ "Now we'll train an ML model using the `%s2ml train` magic command. This will use SingleStore's ML Functions to train a classification model that predicts passenger survival.\n", "\n", "**Note:** Training may take several minutes depending on the compute size selected. The model will learn patterns like \"women and children first\" and the impact of ticket class on survival." 
- ], - "id": "042927a9" + ] }, { "cell_type": "code", "execution_count": 9, + "id": "0382fa90", "metadata": {}, "outputs": [], "source": [ @@ -346,56 +343,56 @@ "description: \"Titanic passenger survival prediction based on demographics and ticket info\"\n", "runtime: cpu-small\n", "selected_features: {\"mode\":\"*\",\"features\":null}" - ], - "id": "0382fa90" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "16c8f8c6", "metadata": {}, "source": [ "### Check Training Results\n", "\n", "The training result is assigned to the variable `training_result`. Let's examine the training details." - ], - "id": "16c8f8c6" + ] }, { "cell_type": "code", "execution_count": 10, + "id": "ae50a529", "metadata": {}, "outputs": [], "source": [ "# Display the training result\n", "training_result" - ], - "id": "ae50a529" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "5ff6fa16", "metadata": {}, "source": [ "### Monitor Training Status\n", "\n", "Use the `%s2ml show` command to view the model details and training status. The status will be one of: Pre-processing, Training, Done, or Error." - ], - "id": "5ff6fa16" + ] }, { "cell_type": "code", "execution_count": 11, + "id": "45edbab3", "metadata": {}, "outputs": [], "source": [ "# In [10]:\n", "%s2ml show --model titanic_survival_predictor" - ], - "id": "45edbab3" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "e0c7eaff", "metadata": {}, "source": [ "**Wait for training to complete** before proceeding to the next section. You can re-run the cell above to check the status. Once the status shows \"Done\", you can proceed with predictions.\n", @@ -403,12 +400,12 @@ "### Run Sample Predictions\n", "\n", "Once training is complete, let's run predictions on a few sample passengers from our test dataset to see how the model performs." 
- ], - "id": "e0c7eaff" + ] }, { "cell_type": "code", "execution_count": 12, + "id": "85f05f71", "metadata": {}, "outputs": [], "source": [ @@ -421,23 +418,23 @@ " passenger.age,\n", " passenger.fare\n", "FROM (SELECT * FROM titanic_test_data LIMIT 10) AS passenger;" - ], - "id": "85f05f71" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "8616a501", "metadata": {}, "source": [ "### Run Predictions on Full Test Dataset\n", "\n", "Now let's run predictions on the entire test dataset and store the results in our predictions table." - ], - "id": "8616a501" + ] }, { "cell_type": "code", "execution_count": 13, + "id": "3993868e", "metadata": {}, "outputs": [], "source": [ @@ -458,23 +455,23 @@ " passenger.survival_status as actual_status,\n", " cluster.ML_CLASSIFY('titanic_survival_predictor', TO_JSON(passenger.*)) as predicted_status\n", "FROM titanic_test_data AS passenger;" - ], - "id": "3993868e" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "dc306a9a", "metadata": {}, "source": [ "### Evaluate Model Performance\n", "\n", "Let's analyze the prediction accuracy by comparing actual vs predicted survival status." - ], - "id": "dc306a9a" + ] }, { "cell_type": "code", "execution_count": 14, + "id": "def69045", "metadata": {}, "outputs": [], "source": [ @@ -484,23 +481,23 @@ " SUM(CASE WHEN actual_status = predicted_status THEN 1 ELSE 0 END) as correct_predictions,\n", " ROUND(100.0 * SUM(CASE WHEN actual_status = predicted_status THEN 1 ELSE 0 END) / COUNT(*), 2) as accuracy_percentage\n", "FROM titanic_predictions;" - ], - "id": "def69045" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "3cf0f9a0", "metadata": {}, "source": [ "### Analyze Survival Factors\n", "\n", "Let's examine how different passenger characteristics influenced survival predictions." 
- ], - "id": "3cf0f9a0" + ] }, { "cell_type": "code", "execution_count": 15, + "id": "9a7770b8", "metadata": {}, "outputs": [], "source": [ @@ -514,20 +511,20 @@ "FROM titanic_predictions\n", "GROUP BY sex\n", "ORDER BY survival_rate_pct DESC;" - ], - "id": "9a7770b8" + ] }, { "cell_type": "code", "execution_count": 16, + "id": "d9598cae", "metadata": {}, "outputs": [], - "source": [], - "id": "d9598cae" + "source": [] }, { "cell_type": "code", "execution_count": 17, + "id": "598504a9", "metadata": {}, "outputs": [], "source": [ @@ -542,23 +539,23 @@ "FROM titanic_predictions\n", "GROUP BY pclass\n", "ORDER BY pclass;" - ], - "id": "598504a9" + ] }, { "attachments": {}, "cell_type": "markdown", + "id": "a96a3d39", "metadata": {}, "source": [ "### Examine Misclassified Passengers\n", "\n", "Let's look at passengers where the model made incorrect predictions to understand potential model limitations." - ], - "id": "a96a3d39" + ] }, { "cell_type": "code", "execution_count": 18, + "id": "3f26a7f0", "metadata": {}, "outputs": [], "source": [ @@ -576,20 +573,20 @@ "FROM titanic_predictions\n", "WHERE actual_status != predicted_status\n", "LIMIT 15;" - ], - "id": "3f26a7f0" + ] }, { "cell_type": "markdown", + "id": "2b054d99", "metadata": {}, "source": [ "# Cleanup" - ], - "id": "2b054d99" + ] }, { "cell_type": "code", "execution_count": 19, + "id": "abbc0dbb", "metadata": {}, "outputs": [], "source": [ @@ -597,8 +594,7 @@ "DROP TABLE IF EXISTS titanic_training_data;\n", "DROP TABLE IF EXISTS titanic_test_data;\n", "DROP TABLE IF EXISTS titanic_predictions;" - ], - "id": "abbc0dbb" + ] }, { "id": "0e8b6ce3",