# Site search with Webflow and Shaped

This notebook demonstrates how to prepare, store, and retrieve documents from a Shaped relevance engine, using the Shaped Webflow site as a real-world example. We will cover the following steps:
1. Setup: Install dependencies and set Shaped API key
2. Ingestion: Chunking and inserting documents into a Shaped custom dataset
3. Inference: Making a text query to the Shaped relevance engine and retrieving results

# 1. Setup

## 1.1 Create virtual environment
**Prerequisite:** Before you get started, create a new Python virtual environment using Python 3.11 and activate it: 
```
cd /path/to/working/directory
python3.11 -m venv .venv
source ./.venv/bin/activate
```

## 1.2 Install dependencies via pip

Then we'll install the needed libraries and set our API keys.

In [2]:
%pip install -qU shaped webflow langchain-text-splitters lxml beautifulsoup4 pandas pyyaml python-dotenv

Note: you may need to restart the kernel to use updated packages.


## 1.3 Restart kernel

After installing the packages, you may need to restart your kernel. In VSCode or Cursor, you can do this via the command palette. 

## 1.4 Initialize Shaped CLI with your API key

You will need a [free Shaped API key with write permissions](https://docs.shaped.ai/docs/support/getting-an-api-key). The cell below loads it from a `.env` file, or prompts for it, and stores it in the `SHAPED_API_KEY` environment variable:

In [3]:
import os
from getpass import getpass
from dotenv import load_dotenv

# Load environment variables from .env file in the notebooks directory
load_dotenv()

# Load API keys from environment variables, prompt if not found
if os.getenv("SHAPED_API_KEY") is None:
    SHAPED_API_KEY = getpass("Please enter your Shaped API key: ")
    os.environ["SHAPED_API_KEY"] = SHAPED_API_KEY
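
Before running the later cells, it can help to see which settings are still unset. The sketch below is a minimal, hypothetical helper (`missing_env_vars` is not part of the Shaped SDK or python-dotenv); the variable names are the ones this notebook reads from the environment.

```python
import os

# Environment variable names read by the cells in this notebook.
REQUIRED_VARS = [
    "SHAPED_API_KEY",
    "WEBFLOW_API_KEY",
    "WEBFLOW_SITE_ID",
    "WEBFLOW_BLOG_COLLECTION_ID",
    "WEBFLOW_CATEGORY_COLLECTION_ID",
    "WEBFLOW_AUTHORS_COLLECTION_ID",
    "WEBFLOW_ROLES_COLLECTION_ID",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"SHAPED_API_KEY": "example-key", "WEBFLOW_SITE_ID": ""}
print(missing_env_vars(REQUIRED_VARS, example_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` without a second argument checks the real environment after `load_dotenv()` has run.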


# 2. Ingestion

Next, we'll fetch our documents, split them into chunks, and declare a Shaped table to store them. We'll cover the following steps in this section:
1. Get documents with the Webflow API
2. Run the documents through a chunker
3. Create a table with the Shaped API
4. Upload the document chunks to our Shaped table

## 2.1 Get blog posts from Webflow

The following code shows how we get the documents from our Webflow CMS. It illustrates how document ingestion works in practice; feel free to skip this section if you don't want to get bogged down in the details.

### Get metadata

We start by loading our environment variables and initializing the Webflow client, which we'll use to fetch the blog posts and the category, author, and role metadata:

In [4]:
import json
import os
from datetime import datetime
from getpass import getpass

import pandas as pd
from webflow.client import Webflow

# Webflow bug workaround - 
# Monkey patch CollectionItemFieldData to support extra config (necessary to
# retrieve pydantic.Extra.allow reference from same pydantic reference used by
# webflow)
try:
    import pydantic.v1 as pydantic
except ImportError:
    import pydantic
from webflow import CollectionItemFieldData
CollectionItemFieldData.Config.extra = pydantic.Extra.allow

# Load Webflow API key and consts from .env file
def get_env_or_prompt(env_var_name: str, prompt_message: str) -> str:
    """Get environment variable, or prompt user if not set."""
    value = os.getenv(env_var_name)
    if value is None:
        value = getpass(prompt_message)
        os.environ[env_var_name] = value
    return value

WEBFLOW_API_KEY = get_env_or_prompt("WEBFLOW_API_KEY", "Please enter your Webflow API key: ")
WEBFLOW_SITE_ID = get_env_or_prompt("WEBFLOW_SITE_ID", "Please enter your Webflow Site ID: ")
BLOG_COLLECTION_ID = get_env_or_prompt("WEBFLOW_BLOG_COLLECTION_ID", "Enter the Webflow collection ID for blog posts: ")
CATEGORY_COLLECTION_ID = get_env_or_prompt("WEBFLOW_CATEGORY_COLLECTION_ID", "Enter the Webflow collection ID for categories: ")
AUTHORS_COLLECTION_ID = get_env_or_prompt("WEBFLOW_AUTHORS_COLLECTION_ID", "Enter the Webflow collection ID for authors: ")
ROLES_COLLECTION_ID = get_env_or_prompt("WEBFLOW_ROLES_COLLECTION_ID", "Enter the Webflow collection ID for roles: ")

# initialize the webflow client
webflowClient = Webflow(access_token=WEBFLOW_API_KEY)


### Get blog posts from the API:

In [5]:
# Webflow API is paginated; we need to get documents 100 entries at a time
items = []
limit = 100
offset = 0

while True:
    items_page = webflowClient.collections.items.list_items(
        collection_id=BLOG_COLLECTION_ID,
        limit=limit,
        offset=offset,
    )
    
    if items_page.items:
        items.extend(items_page.items)
        print(f"Fetched {len(items_page.items)} items (total: {len(items)})")
    else:
        break
    
    if len(items_page.items) < limit:
        break
    offset += limit


Fetched 100 items (total: 100)
Fetched 100 items (total: 200)
Fetched 80 items (total: 280)
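
The loop above follows a common offset/limit pagination pattern. As a minimal sketch of that pattern in isolation, the helper below takes any page-fetching callable; `fetch_all` and `fake_fetch_page` are hypothetical names used only for illustration, and the fake fetcher stands in for the Webflow call so the example runs without network access.

```python
def fetch_all(fetch_page, limit=100):
    """Collect every item from an offset/limit paginated source.

    `fetch_page(limit, offset)` is a hypothetical callable returning a list.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        items.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < limit:
            break
        offset += limit
    return items


# Stand-in for the Webflow call: serves slices of a 280-element list.
FAKE_COLLECTION = list(range(280))


def fake_fetch_page(limit, offset):
    return FAKE_COLLECTION[offset:offset + limit]


all_items = fetch_all(fake_fetch_page, limit=100)
print(len(all_items))  # 280
```

Stopping on a short page saves one request when the last page is only partly full; when the total is an exact multiple of `limit`, the loop makes one extra request that returns an empty page.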


### Construct dataframe

Replace IDs with actual names to preserve semantic meaning.

Also convert list and dict values to strings so that no column holds nested data (this flattens the column structure).

In [6]:
import uuid 

blog_posts_df = pd.DataFrame()

# we get category, author name, and target role so that we can enrich these columns with semantic info (not IDs)
categories = webflowClient.collections.items.list_items(
    collection_id=CATEGORY_COLLECTION_ID
).dict().get("items")

authors = webflowClient.collections.items.list_items(
    collection_id=AUTHORS_COLLECTION_ID
).dict().get("items")

roles = webflowClient.collections.items.list_items(
    collection_id=ROLES_COLLECTION_ID
).dict().get("items")

for item in items:
    item_id = item.id
    
    # Create a row dictionary starting with the item id
    row = {"id": item.id}
    
    # Destructure field_data into separate columns
    if hasattr(item, "field_data") and item.field_data:
        # Convert field_data to dict to get all fields
        field_data_dict = item.field_data.dict() if hasattr(item.field_data, "dict") else {}
        # Merge field_data fields into the row
        row.update(field_data_dict)
    
    # Replace IDs in `roles`, `categories`, and `author` with names.
    # If an ID has no match, keep the raw ID instead of raising StopIteration.
    if row.get("roles") is not None:
        roles_names = []
        for role in row["roles"]:
            role_name = next((r['fieldData']['name'] for r in roles if r.get('id') == role), role)
            roles_names.append(role_name)
        row["roles"] = roles_names

    if row.get("categories") is not None:
        categories_names = []
        for category in row["categories"]:
            category_name = next((c['fieldData']['name'] for c in categories if c.get('id') == category), category)
            categories_names.append(category_name)
        row["categories"] = categories_names

    if row.get("author") is not None:
        row["author"] = next((a['fieldData']['name'] for a in authors if a.get('id') == row["author"]), row["author"])

    # convert lists and dicts into strings
    for key, value in row.items():
        if isinstance(value, list):
            row[key] = " ".join(str(v) for v in value)
        if isinstance(value, dict):
            row[key] = json.dumps(value)

    

    now_str = datetime.now().isoformat()
    row["created_at"] = now_str
    row["updated_at"] = now_str
    
    # Append row to dataframe
    blog_posts_df = pd.concat([blog_posts_df, pd.DataFrame([row])], ignore_index=True)

blog_posts_df.columns = blog_posts_df.columns.str.replace('-', '_') # replace hyphens with underscores

os.makedirs("data", exist_ok=True)  # make sure the output directory exists
blog_posts_df.to_json("data/posts.jsonl", orient="records", lines=True)
blog_posts_df.to_json("data/posts.json")
print("Blog posts have been successfully saved to posts.jsonl.")

Blog posts have been successfully saved to posts.jsonl.
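
The ID replacement and flattening steps can also be written with a dictionary lookup, which avoids rescanning the reference list for every ID. The sketch below is one possible way to do it, not the notebook's own code: the helper names are hypothetical and the sample data is made up, shaped like the `id` / `fieldData` dicts used in the cell above.

```python
import json


def build_name_lookup(reference_items):
    """Map each reference item's id to its display name."""
    return {item["id"]: item["fieldData"]["name"] for item in reference_items}


def resolve_ids(ids, lookup):
    """Replace known IDs with names; unknown IDs are kept as-is."""
    return [lookup.get(item_id, item_id) for item_id in ids]


def flatten_value(value):
    """Lists become space-joined strings, dicts become JSON strings."""
    if isinstance(value, list):
        return " ".join(str(v) for v in value)
    if isinstance(value, dict):
        return json.dumps(value)
    return value


# Made-up sample data for illustration only.
sample_categories = [
    {"id": "c1", "fieldData": {"name": "Search"}},
    {"id": "c2", "fieldData": {"name": "Ranking"}},
]
sample_row = {
    "id": "p1",
    "categories": ["c1", "c2", "c9"],  # "c9" has no matching category
    "main-image": {"url": "https://example.com/a.png"},
}

lookup = build_name_lookup(sample_categories)
sample_row["categories"] = resolve_ids(sample_row["categories"], lookup)
flat_row = {key: flatten_value(value) for key, value in sample_row.items()}
print(flat_row["categories"])  # Search Ranking c9
```

Collecting the flattened rows in a list and calling `pd.DataFrame(rows)` once at the end would also avoid the repeated `pd.concat` calls inside the loop.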


## 2.2 Chunking

Now that we have our table of posts, we should split each post into sections to make search results more relevant. To do this, we use a chunking strategy.

In [7]:
from langchain_text_splitters import HTMLSemanticPreservingSplitter

# import posts from JSONL file
posts = pd.read_json('data/posts.jsonl', lines=True)
print(posts.columns)

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3")
]

splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    separators=["\n\n", "\n", ". ", "! ", "? "],
    max_chunk_size=50,
    elements_to_preserve=["ul", "ol", "code"],
    denylist_tags=["script", "style", "head"],
)
# Process all posts and create chunked dataframe
blog_posts_chunked_df = pd.DataFrame()

for idx, post in posts.iterrows():
    post_body = post['post_body']
    post_summary = post.get('post_summary', '')
    if pd.isna(post_body):
        post_body = ''
    if pd.isna(post_summary):
        post_summary = ''
    
    post_to_chunk = f"<p>{post_summary}</p>\n" + post_body
    
    # Split the post into chunks
    documents = splitter.split_text(post_to_chunk)
    
    # For each chunk, create a row with all post columns except post-summary and post-body
    for doc in documents:
        # Create a row dictionary with all post columns except post-summary and post-body
        row = post.drop(['post_summary', 'post_body']).to_dict()
        
        # Add the chunk content
        row['content'] = doc.page_content
        
        # Optionally add chunk metadata if it exists
        if doc.metadata:
            row['chunk_metadata'] = json.dumps(doc.metadata)

        row['post_id'] = row['id']
        row['id'] = str(uuid.uuid4())
        
        # Append to dataframe
        blog_posts_chunked_df = pd.concat([blog_posts_chunked_df, pd.DataFrame([row])], ignore_index=True)

print(f"Created {len(blog_posts_chunked_df)} chunks from {len(posts)} posts")
blog_posts_chunked_df['url'] = "https://www.shaped.ai/blog/" + blog_posts_chunked_df['slug']

blog_posts_chunked_df.to_json('data/blog_post_chunked.jsonl', orient='records', lines=True)
# blog_posts_chunked_df.to_json('data/blog_post_chunked.json') # optional - not recommended

print("Columns and their types in blog_posts_chunked_df:")
print(blog_posts_chunked_df.dtypes)


Index(['id', 'name', 'slug', 'release_date', 'post_summary', 'author',
       'read_length_in_mins', 'categories', 'post_body', 'main_image', 'roles',
       'featured', 'popular', 'created_at', 'updated_at'],
      dtype='object')


Created 11438 chunks from 280 posts
Columns and their types in blog_posts_chunked_df:
id                             object
name                           object
slug                           object
release_date                   object
author                         object
read_length_in_mins           float64
categories                     object
main_image                     object
roles                          object
featured                         bool
popular                          bool
created_at             datetime64[ns]
updated_at             datetime64[ns]
content                        object
post_id                        object
chunk_metadata                 object
url                            object
dtype: object
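
To make the idea of header-based chunking concrete, here is a deliberately simplified sketch built on the standard library's `html.parser`. It is not how `HTMLSemanticPreservingSplitter` is implemented: it only groups text under the most recent `h1`/`h2`/`h3` heading and ignores chunk size limits, preserved elements, and denylisted tags. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeadingChunker(HTMLParser):
    """Group text under the most recent h1/h2/h3 heading."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.chunks = []          # finished (heading, text) pairs
        self._heading = ""        # heading the current text belongs to
        self._heading_parts = []  # text collected while inside a heading tag
        self._text_parts = []     # body text collected since the last heading
        self._in_heading = False

    def _flush(self):
        text = " ".join(self._text_parts).strip()
        if text:
            self.chunks.append((self._heading, text))
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._flush()  # a new heading closes the previous chunk
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS:
            self._heading = " ".join(self._heading_parts).strip()
            self._in_heading = False

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_heading:
            self._heading_parts.append(data)
        else:
            self._text_parts.append(data)

    def close(self):
        super().close()
        self._flush()  # emit whatever text follows the last heading


def chunk_by_heading(html):
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    "<p>Summary.</p><h2>Setup</h2><p>Install it.</p><p>Run it.</p>"
    "<h2>Usage</h2><p>Query it.</p>"
)
for heading, text in chunk_by_heading(sample):
    print(repr(heading), "->", text)
```

Text that appears before the first heading, such as the prepended post summary, ends up in a chunk with an empty heading.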


# Upload to Shaped

Now that we have our data in semantic chunks, we can upload this data to Shaped. We'll use the Shaped CLI for this.

First we need to create a schema for our table, to tell Shaped the column names and types. 

```yaml
name: blog_post_chunked
schema_type: CUSTOM
unique_keys: [id]
column_schema:
    id: String
    name: String
    slug: String
    main_image: String
    roles: String
    author: String
    categories: String
    read_length_in_mins: Float
    popular: Bool
    featured: Bool
    created_at: DateTime
    updated_at: DateTime
    content: String
    post_id: String
    chunk_metadata: String
    url: String
```

```
shaped create-dataset --file data/blog_posts_chunked.schema.yaml
```

> Note: If you get an error - Module Not Found, remember to activate your virtual environment (`source ./.venv/bin/activate`)

## Upload dataset schema 

In [8]:
schema = {
    "name": "blog_post_chunked",
    "schema_type": "CUSTOM",
    "unique_keys": ["id"],
    "column_schema": {
        "id": "String",
        "name": "String",
        "slug": "String",
        "main_image": "String",
        "roles": "String",
        "author": "String",
        "categories": "String",
        "read_length_in_mins": "Float",
        "popular": "Bool",
        "featured": "Bool",
        "created_at": "DateTime",
        "updated_at": "DateTime",
        "content": "String",
        "post_id": "String",
        "chunk_metadata": "String",
        "url": "String"
    }
}

# Save schema to YAML file
import yaml

with open("data/blog_posts_chunked.schema.yaml", "w") as f:
    yaml.dump(schema, f, default_flow_style=False, sort_keys=False)

# !shaped create-dataset --file data/blog_posts_chunked.schema.yaml
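
Before creating the dataset, it may be worth checking that the schema and the chunked data agree on column names. The sketch below uses a hypothetical helper, `compare_columns`, with column names copied from the dtypes output and the schema above. How Shaped treats columns that are present in the data but absent from the schema is not covered here, so treat any difference as something to review rather than as a definite error.

```python
def compare_columns(data_columns, schema_columns):
    """Return (missing_from_data, not_in_schema) as sorted lists."""
    data_set, schema_set = set(data_columns), set(schema_columns)
    return sorted(schema_set - data_set), sorted(data_set - schema_set)


# Column names copied from the dtypes output and the schema above.
chunk_columns = [
    "id", "name", "slug", "release_date", "author", "read_length_in_mins",
    "categories", "main_image", "roles", "featured", "popular", "created_at",
    "updated_at", "content", "post_id", "chunk_metadata", "url",
]
schema_columns = [
    "id", "name", "slug", "main_image", "roles", "author", "categories",
    "read_length_in_mins", "popular", "featured", "created_at", "updated_at",
    "content", "post_id", "chunk_metadata", "url",
]

missing, extra = compare_columns(chunk_columns, schema_columns)
print("In schema but not in data:", missing)  # []
print("In data but not in schema:", extra)    # ['release_date']
```

In the notebook itself, the same check could be run with `blog_posts_chunked_df.columns` and `schema["column_schema"]`.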

# Engine config

Now we're going to create a simple engine to support semantic search on our document chunks. 

We will declare our configuration as a YAML file and upload it with the CLI.

The most basic engine configuration has a `data` field with the data to fetch and the columns to index. 

In [None]:
engine_config = {
    "version" : "v2",
    "name" : "blog_posts__simple_semantic_search_2",
    "data" : {
        "item_dataset" : {
            "name" : "blog_post_chunked"
        },
        "index" : {
            "search" : {
                "item_fields" : [
                    "name",
                    "content",
                    "author",
                    "categories",
                    "roles",
                ]
            }
        }
    }
}

with open("data/blog_posts_chunked.engine.yaml", "w") as f:
    yaml.dump(engine_config, f, default_flow_style=False, sort_keys=False)

!shaped create-model --file data/blog_posts_chunked.engine.yaml

{
  "version": "v2",
  "name": "blog_posts__simple_semantic_search_2",
  "data": {
    "item_dataset": {
      "name": "blog_post_chunked"
    },
    "index": {
      "search": {
        "item_fields": [
          "name",
          "content",
          "author",
          "categories",
          "roles"
        ]
      }
    }
  }
}
model_url: https://api.shaped.ai/v1/models/blog_posts__simple_semantic_search_2
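
A quick consistency check between the engine config and the dataset schema can catch naming mistakes before the engine is created. The sketch below assumes that the indexed fields and the dataset name are expected to match the schema; both helper functions are hypothetical, and the dicts are trimmed-down stand-ins rather than the full configuration.

```python
def unknown_index_fields(engine_config, schema):
    """Return indexed item fields that the dataset schema does not define."""
    fields = engine_config["data"]["index"]["search"]["item_fields"]
    known = set(schema["column_schema"])
    return [field for field in fields if field not in known]


def dataset_name_matches(engine_config, schema):
    """Check that the engine points at the dataset the schema declares."""
    return engine_config["data"]["item_dataset"]["name"] == schema["name"]


# Trimmed-down stand-ins for illustration; "tags" is deliberately undefined.
example_schema = {
    "name": "blog_post_chunked",
    "column_schema": {"name": "String", "content": "String", "author": "String"},
}
example_engine_config = {
    "data": {
        "item_dataset": {"name": "blog_post_chunked"},
        "index": {"search": {"item_fields": ["name", "content", "tags"]}},
    }
}

print(unknown_index_fields(example_engine_config, example_schema))  # ['tags']
print(dataset_name_matches(example_engine_config, example_schema))  # True
```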

