[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/data-platforms/web-search/firecrawl/firecrawl-to-weaviate.ipynb)

# Building a Vector Index from Webpages!

This notebook shows how to scrape webpages with Firecrawl (by Mendable) and load the results into Weaviate!

We then use a Generative Feedback Loop (GFL) to clean the scraped text: an LLM rewrites each stored chunk, and the cleaned version is written back to Weaviate.

## Firecrawl

In [1]:
!pip install firecrawl-py==0.0.14 > /dev/null

In [14]:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR-FIRECRAWL-API-KEY")


In [15]:
scraped_data = app.scrape_url("https://www.databricks.com/blog/accelerating-innovation-jetblue-using-databricks")

In [29]:
for key in scraped_data.keys():
    print(key)

content
markdown
metadata


In [33]:
from typing import Dict, List

def get_markdown_from_Firecrawl(website_urls: List[str]) -> List[Dict[str, str]]:
    results = []
    for website_url in website_urls:
        crawl_result = app.scrape_url(website_url)
        # Keep the scraped `content` field together with its source URL
        results.append({
            "content": crawl_result["content"],
            "weblink": website_url
        })
    return results

In [34]:
results = get_markdown_from_Firecrawl(["https://www.databricks.com/blog/accelerating-innovation-jetblue-using-databricks"])
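
When scraping more than one URL, a single failed request would abort the whole loop. Below is a minimal sketch of a more tolerant variant. It assumes a hypothetical `scrape` callable standing in for `app.scrape_url`, and the `collect_pages` and `fake_scrape` names are illustrative, not part of the Firecrawl SDK.

```python
from typing import Callable, Dict, List


def collect_pages(urls: List[str], scrape: Callable[[str], Dict]) -> List[Dict[str, str]]:
    """Scrape each URL and keep only pages that returned non-empty content."""
    pages = []
    for url in urls:
        try:
            result = scrape(url)
        except Exception as err:  # assumption: any scraper failure should skip the page
            print(f"Skipping {url}: {err}")
            continue
        content = result.get("content", "")
        if content:
            pages.append({"content": content, "weblink": url})
    return pages


# Hypothetical stand-in for app.scrape_url, used only to exercise the helper.
def fake_scrape(url: str) -> Dict:
    if "broken" in url:
        raise RuntimeError("request failed")
    return {"content": f"text from {url}", "markdown": "", "metadata": {}}


pages = collect_pages(
    ["https://example.com/a", "https://example.com/broken"], fake_scrape
)
print(pages)
```

In the notebook itself, you would pass `app.scrape_url` as the `scrape` argument.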

# Create Weaviate WebChunk Collection

`!pip install weaviate-client==4.6.4`

## Connect to Weaviate

You can use [Weaviate Cloud](https://console.weaviate.cloud/), [Weaviate Embedded](https://weaviate.io/developers/weaviate/installation/embedded), or a [local instance](https://weaviate.io/developers/weaviate/installation/docker-compose) (choose only one).

### Weaviate Embedded

In [None]:
# Weaviate Embedded (runs in your local runtime)

import weaviate
import os

# Named `weaviate_client` because the later cells refer to it by that name
weaviate_client = weaviate.connect_to_embedded(
    headers={
        "X-Cohere-Api-Key": os.getenv("COHERE_API_KEY")  # Reads your Cohere key from the environment
    }
)

weaviate_client.is_ready()

### Weaviate Cloud

In [None]:
# import os
# import weaviate

# # Set these environment variables
# URL = os.getenv("WEAVIATE_URL")
# APIKEY = os.getenv("WEAVIATE_API_KEY")

# # Connect to your WCD instance
# weaviate_client = weaviate.connect_to_wcs(
#     cluster_url=URL,
#     auth_credentials=weaviate.auth.AuthApiKey(APIKEY),
#     headers={
#         "X-Cohere-Api-Key": os.getenv("COHERE_API_KEY")  # Reads your Cohere key from the environment
#     }
# )

# weaviate_client.is_ready()

### Local 

In [None]:
# If you run Weaviate locally, make sure to add your Cohere key to the `yaml` file.

# import weaviate
# import os

# weaviate_client = weaviate.connect_to_local(
#     headers={
#         "X-Cohere-Api-Key": os.getenv("COHERE_API_KEY")  # Reads your Cohere key from the environment
#     }
# )

# weaviate_client.is_ready()

## Create Collection

In [35]:
# CAUTION: Running this will delete your data in your cluster

# weaviate_client.collections.delete_all()

In [36]:
import weaviate.classes.config as wvcc


web_chunks = weaviate_client.collections.create(
    name="WebChunk",
    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_cohere(
        model="embed-multilingual-v3.0"
    ),
    properties=[
        wvcc.Property(name="content", data_type=wvcc.DataType.TEXT),
        wvcc.Property(name="weblink", data_type=wvcc.DataType.TEXT),
    ]
)


## Ingest into Weaviate

In [37]:
from weaviate.util import get_valid_uuid
from uuid import uuid4

# Only the first scraped page is ingested here
weblink = results[0]["weblink"]
words = results[0]["content"].split()  # split on whitespace into words

chunk_size = 300  # words per chunk
chunk_uuids = []
for i in range(0, len(words), chunk_size):
    chunk = words[i:i+chunk_size]
    chunk_uuid = get_valid_uuid(uuid4())
    # Keep the UUIDs so each chunk can be fetched and updated later
    chunk_uuids.append(chunk_uuid)
    web_chunks.data.insert(
        properties={
            "content": " ".join(chunk),
            "weblink": weblink
        },
        uuid=chunk_uuid
    )
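
The ingestion cell splits the page into fixed-size groups of whitespace-separated words. The same logic can be sketched as a standalone function, which makes it easy to check on a small input. `chunk_words` is a hypothetical helper name and does not appear in the notebook.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 300) -> List[str]:
    """Split text on whitespace and regroup it into chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


sample = " ".join(f"w{i}" for i in range(7))
chunks = chunk_words(sample, chunk_size=3)
print(chunks)  # ['w0 w1 w2', 'w3 w4 w5', 'w6']
```

Because the boundaries fall on word counts rather than sentences or markdown structure, a chunk can start or end in the middle of a link, as the hybrid search result below illustrates.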

In [38]:
response = web_chunks.query.hybrid(
    query="How does JetBlue use Databricks?",
    limit=1
)

for o in response.objects:
    print(o.properties)

{'content': 'Catalog](https://www.databricks.com/product/unity-catalog) role-based access to documents in the vector database document store. Using this framework, any JetBlue user can access the same chatbot hidden behind Azure AD SSO protocols and Databricks Unity Catalog Access Control Lists (ACLs). Every product, including the BlueSky real-time digital twin, ships with embedded LLMs. ![JetBlue’s Chatbot based on Microsoft Azure OpenAI APIs and Databricks Dolly](https://www.databricks.com/sites/default/files/inline-images/image5.png?v=1687203897) JetBlue’s Chatbot based on Microsoft Azure OpenAI APIs and Databricks Dolly By deploying AI and ML enterprise products on Databricks using data in lakehouse, JetBlue has thus far unlocked a relatively high Return-on-Investment (ROI) multiple within two years. In addition, Databricks allows the Data Science and Analytics teams to rapidly prototype, iterate and launch data pipelines, jobs and ML models using the [lakehouse](https://www.databr

# Clean with a Generative Feedback Loop

`!pip install dspy-ai==2.4.9`

In [59]:
import dspy # !pip install dspy-ai==2.4.9
from dspy.retrieve.weaviate_rm import WeaviateRM
import weaviate

retriever_model = WeaviateRM("WebChunk", weaviate_client=weaviate_client)

command_r_plus = dspy.Cohere(model="command-r-plus",
                             api_key="YOUR-COHERE-KEY",
                             max_input_tokens=4000,
                             max_tokens=4000)

dspy.settings.configure(lm=command_r_plus, rm=retriever_model)

# Generative Feedback Loop

In [70]:
from pydantic import BaseModel


class UpdatedPropertyValue(BaseModel):
    property_value: str

class UpdateProperty(dspy.Signature):
    """Your task is to generate the value of a property by following the instruction using the provided name-value property references."""

    property_name: str = dspy.InputField(
        desc="The name of the property that you should update."
    )
    references: str = dspy.InputField(
        desc="The name-value property pairs that you should refer to while updating the property."
    )
    instruction: str = dspy.InputField(
        desc="The prompt to use when generating the content of the updated property value."
    )
    property_value: UpdatedPropertyValue = dspy.OutputField(
        desc="The value of the updated property as a string. Only the value should be returned in the following format. IMPORTANT!!"
    )


class Program(dspy.Module):
    def __init__(self) -> None:
        super().__init__()
        self.predict = dspy.TypedPredictor(UpdateProperty)

    def forward(self, property_name: str, references: str, instruction: str) -> UpdatedPropertyValue:
        prediction: dspy.Prediction = self.predict(
            property_name=property_name, references=references, instruction=instruction
        )
        # Returns the parsed UpdatedPropertyValue model; read `.property_value` on it for the string
        return prediction.property_value

## GFL Instruction for Cleaning Web Scraped Text

In [71]:
program = Program()

instruction = """
This content is the result of a web scraper. Clean the text to remove any special characters.
"""

### Observe the uncleaned text

In [72]:
chunk_uuids[0]

web_chunks.query.fetch_object_by_id(chunk_uuids[0]).properties["content"]

'[Skip to main content](#main) [![](

# Run GFL

In [75]:
for chunk_uuid in chunk_uuids:
    # Get the object
    obj = web_chunks.query.fetch_object_by_id(chunk_uuid)
    # Format the references as "name: value" pairs
    references = " ".join(f"{k}: {v}" for k, v in obj.properties.items())
    # Run GFL; the program returns an UpdatedPropertyValue, so read its string field
    cleaned_text = program(
        property_name="cleaned_text",
        references=references,
        instruction=instruction,
    ).property_value
    # Overwrite the `content` property in Weaviate with the cleaned text
    web_chunks.data.update(
        properties={
            "content": cleaned_text
        },
        uuid=chunk_uuid
    )

print(f"{len(chunk_uuids)} objects have been updated.")

13 objects have been updated.
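
Before each LLM call, the loop flattens the object's properties into a single string of `name: value` pairs. Here is a minimal sketch of that step in isolation. `format_references` is a hypothetical helper name, and the example values are made up for illustration rather than taken from the index.

```python
from typing import Dict


def format_references(properties: Dict[str, object]) -> str:
    """Join properties into one space-separated string of 'name: value' pairs."""
    return " ".join(f"{name}: {value}" for name, value in properties.items())


# Illustrative properties only; real objects come from fetch_object_by_id.
example = {
    "content": "[Skip to main content](#main)",
    "weblink": "https://example.com/post",
}
references = format_references(example)
print(references)
```

Every property is included, so the model sees the `weblink` alongside the `content` it is asked to clean.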


### Observe the cleaned text

In [78]:
chunk_uuids[0]

web_chunks.query.fetch_object_by_id(chunk_uuids[0]).properties["content"]

'Why Databricks Discover For Executives For Startups Lakehouse Architecture DatabricksIQ Mosaic Research Customers Featured Stories See All Customers Partners Cloud Providers Databricks on AWS, Azure, and GCP Consulting & System Integrators Experts to build, deploy and migrate to Databricks Technology Partners Connect your existing tools to your Lakehouse C&SI Partner Program Build, deploy or migrate to the Lakehouse Data Partners Access the ecosystem of data consumers Partner Solutions Find custom industry and migration solutions Built on Databricks Build, market and grow your business Product Databricks Platform Platform Overview A unified platform for data, analytics and AI Data Management Data reliability, security and performance Sharing An open, secure, zero-copy sharing for all data Data Warehousing Serverless data warehouse for SQL analytics Governance Unified governance for all data, analytics and AI assets Real-Time Analytics Real-time analytics, AI and applications made simp

# RAG Demo with DSPy using the cleaned index

In [76]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought("question, contexts -> precise_answer")
    
    def forward(self, question):
        contexts = "".join(self.retrieve(question).passages)
        prediction = self.generate_answer(question=question, contexts=contexts).precise_answer
        return dspy.Prediction(answer=prediction)
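
How the retrieved passages are combined determines what the LM sees in `contexts`. As one possible variation, the sketch below numbers each passage and separates them with blank lines so that chunk boundaries stay visible. `format_contexts` is a hypothetical helper, not part of DSPy.

```python
from typing import List


def format_contexts(passages: List[str]) -> str:
    """Number each passage and separate passages with a blank line."""
    return "\n\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )


contexts = format_contexts(["first passage ", " second passage"])
print(contexts)
```

Inside `forward`, this would take the list from `self.retrieve(question).passages` as its argument.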

In [77]:
rag = RAG()

print(rag("How does JetBlue use Databricks?").answer)

JetBlue uses Databricks to increase productivity by utilizing its flexibility to work with SQL, Python, and PySpark. They also leverage the Databricks Data Intelligence Platform to process real-time data and develop historical and real-time ML pipelines.
