In [5]:
from dotenv import load_dotenv

load_dotenv()

True

# Databricks Model Serving

In [6]:
from databricks_langchain import ChatDatabricks

chat_model = ChatDatabricks(endpoint="databricks-gpt-oss-120b")

In [7]:
chat_model.invoke("What is MLflow?")

  PydanticSerializationUnexpectedValue(Expected `str` - serialized value may not be as expected [input_value=[{'type': 'reasoning', 's...overned ML pipelines.'}], input_type=list])
  return self.__pydantic_serializer__.to_python(


AIMessage(content=[{'type': 'reasoning', 'summary': [{'type': 'summary_text', 'text': 'The user asks: "What is MLflow?" Provide a clear explanation. Need to describe MLflow as an open-source platform for managing the ML lifecycle, its components: tracking, projects, models, model registry, integrations, etc. Also mention use cases, architecture, benefits. Provide examples. Probably concise yet thorough.'}]}, {'type': 'text', 'text': '**MLflow** is an open‑source platform designed to simplify and standardize the end‑to‑end machine‑learning (ML) lifecycle. It was created by the Databricks team in 2018 and is now a graduated project in the Linux Foundation’s AI\u202fFoundation.  \n\n---\n\n## Core Idea\n\nMLflow does **not** replace any specific ML library (e.g., scikit‑learn, TensorFlow, PyTorch).  \nInstead, it provides a **set of loosely‑coupled components** that you can adopt incrementally to:\n\n1. **Track** experiments (parameters, metrics, artifacts, and source code).  \n2. **Packa

In [8]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a chatbot that can answer questions about {topic}.",
        ),
        ("user", "{question}"),
    ]
)

chain = prompt | chat_model
output = chain.invoke(
    {
        "topic": "Databricks",
        "question": "What is Unity Catalog?",
    }
)

output.content

  PydanticSerializationUnexpectedValue(Expected `str` - serialized value may not be as expected [input_value=[{'type': 'reasoning', 's...and from which cloud."}], input_type=list])
  return self.__pydantic_serializer__.to_python(


[{'type': 'reasoning',
  'summary': [{'type': 'summary_text',
    'text': 'The user asks: "What is Unity Catalog?" They want an answer likely about Databricks Unity Catalog, a data governance solution. Provide an overview: what it is, features, benefits, architecture, how it works, and possibly use cases. Should be concise but thorough. Also cover security, fine-grained access control, data lineage, cataloging. Mention that it provides a unified governance across all data assets, works with Lakehouse, supports multi-cloud, integrates with data policies, audit, platform. Possibly mention objects: catalogs, schemas, tables, functions, external tables. Mention how it works with IAM and external metastore, integration with notebooks, pipelines. Also mention that it\'s a metastore for data, not just Hive metastore. Provide details on usage, APIs, and benefits.\n\nWill answer in a friendly explanatory style.'}]},
 {'type': 'text',
  'text': "**Unity Catalog** is Databricks’ unified, cross‑cl
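Both outputs above show `content` as a list of typed blocks (a `reasoning` block followed by a `text` block) rather than a plain string, which appears to be what the Pydantic serialization warning is pointing at. A minimal sketch of pulling out only the answer text, assuming the block shape seen in the output; `extract_text` is a hypothetical helper, not part of `databricks_langchain`:

```python
def extract_text(content):
    """Return only the answer text from a message's content.

    Assumes the shape seen in the output above: either a plain string
    or a list of dicts such as {'type': 'text', 'text': '...'}.
    """
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        # Skip 'reasoning' blocks and anything that is not a text block.
        if isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
    return "".join(parts)


# Hand-written stand-in for output.content (not real model output)
sample = [
    {"type": "reasoning", "summary": [{"type": "summary_text", "text": "..."}]},
    {"type": "text", "text": "Unity Catalog is a governance layer."},
]
print(extract_text(sample))
```

In the notebook this would be called as `extract_text(output.content)`.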

In [29]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    model="deepseek-r1-distill-llama-70b",
    temperature=0,
    max_retries=2,
    # other params...
)

In [15]:
from databricks.connect import DatabricksSession
import os

catalog = os.environ.get("UC_CATALOG_NAME")
schema = os.environ.get("UC_SCHEMA_NAME")

spark = DatabricksSession.builder.getOrCreate()
spark.sql(f"USE CATALOG {catalog}")
spark.sql(f"USE SCHEMA {schema}")

df = spark.read.table("customer")
df.show(5)

+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|c_custkey|            c_name|           c_address|c_nationkey|        c_phone|c_acctbal|c_mktsegment|           c_comment|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|   412445|Customer#000412445|0QAB3OjYnbP6mA0B,kgf|         21|31-421-403-4333|  5358.33|    BUILDING|arefully blithely...|
|   412446|Customer#000412446|5u8MSbyiC7J,7PuY4...|         20|30-487-949-7942|  9441.59|   MACHINERY|sleep according t...|
|   412447|Customer#000412447|HC4ZT62gKPgrjr ce...|          7|17-797-466-6308|  7868.75|  AUTOMOBILE|aggle blithely am...|
|   412448|Customer#000412448|         hJok1MMrDgH|          6|16-541-510-4964|  6060.98|   MACHINERY|ly silent request...|
|   412449|Customer#000412449|zAt1nZNG01gOhIqgy...|         14|24-710-983-5536|  4973.84|   HOUSEHOLD|refully final the...|
+-------
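`os.environ.get` returns `None` when a variable is unset, so the `USE CATALOG` statement above would be built as `USE CATALOG None`. A small guard, sketched here with a hypothetical `require_env` helper, fails earlier with a clearer message:

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a throwaway variable name (hypothetical, for illustration only)
os.environ["EXAMPLE_CATALOG_NAME"] = "main"
print(require_env("EXAMPLE_CATALOG_NAME"))
```

In the cell above, the same helper could wrap the lookups for `UC_CATALOG_NAME` and `UC_SCHEMA_NAME`.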

In [40]:
from langchain_community.agent_toolkits import SparkSQLToolkit, create_spark_sql_agent
from langchain_community.utilities.spark_sql import SparkSQL

host = os.environ.get("DATABRICKS_HOST")

# Note: you can also connect to Spark via Spark Connect. For example:
# db = SparkSQL.from_uri("sc://localhost:15002", schema=schema)
spark_sql = SparkSQL(schema=schema)
toolkit = SparkSQLToolkit(db=spark_sql, llm=llm)
agent_executor = create_spark_sql_agent(
    llm=llm,
    toolkit=toolkit,
    verbose=True,
    # Feed output-parsing errors back to the model instead of raising.
    agent_executor_kwargs={"handle_parsing_errors": True},
)

In [41]:
output = agent_executor.invoke({"input": "Describe the customer table"})["output"]



> Entering new AgentExecutor chain...
<think>
Okay, so the user is asking me to describe the customer table. I'm not entirely sure if that table exists, so I should start by listing all the available tables in the database. I'll use the list_tables_sql_db tool for that. 

Once I have the list of tables, I can check if 'customer' is among them. If it is, the next step would be to get the schema of the customer table. The schema will tell me about the columns, their data types, and any other relevant details. 

I remember that the schema_sql_db tool can provide both the schema and some sample rows, which would be helpful in giving a thorough description. I need to make sure I format the input correctly as a comma-separated list, so I'll just input 'customer' since that's the only table I'm interested in right now.

After getting the schema, I can then explain the structure of the customer table to the user, including the columns and their data types. This should gi

In [43]:
output

'The customer table contains the following columns: c_custkey (BIGINT), c_name (STRING), c_address (STRING), c_nationkey (BIGINT), c_phone (STRING), c_acctbal (DECIMAL(18,2)), c_mktsegment (STRING), and c_comment (STRING). The table includes data such as customer names, addresses, nation keys, phone numbers, account balances, market segments, and comments.'

In [45]:
output = agent_executor.invoke({"input": "How many customers are present?"})["output"]
output



> Entering new AgentExecutor chain...
<think>
Okay, so I need to figure out how many customers are present in the database. Let me start by understanding the tools I have and the process I should follow.

First, I remember that I can use the list_tables_sql_db tool to get a list of all the tables available. That makes sense because without knowing the table names, I can't proceed. So, I'll start by using that tool with an empty input.

After getting the list of tables, I need to identify which table contains customer information. Typically, such a table might be named something like "customers" or "customer_list". Once I have the correct table name, I can then query it to count the number of customers.

I should also consider the schema of the table to ensure there's a column that can help me count the customers accurately. Usually, a primary key like "customer_id" would be present, and I can count the number of unique or non-null entries in that column.

I need 

'**\n\nThere are 750,000 customers present.'
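The final answer above starts with a stray `**` and blank lines left over from the model's markdown, and the verbose trace shows the model emitting a `<think>` section. A minimal post-processing sketch, assuming those are the only artifacts to remove; `clean_answer` is a hypothetical helper:

```python
import re


def clean_answer(text):
    """Strip <think>...</think> sections and stray markdown bold markers."""
    # Remove any complete reasoning section the model may have left in the answer.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Drop surrounding whitespace and asterisks left from an unbalanced '**'.
    return text.strip().strip("*").strip()


print(clean_answer("**\n\nThere are 750,000 customers present."))
```

Note that this also removes asterisks from an answer that is legitimately wrapped in bold, which may or may not be wanted.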