## Environment setup

### Python modules

Install the following Python modules:

```bash
pip install ipykernel python-dotenv cassio pandas langchain_openai langchain langchain-community langchainhub langchain_experimental openai-multi-tool-use-parallel-patch
```

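### Load the `.env` file

`cassio.init(auto=True)` reads its connection settings from environment variables, and the notebook calls OpenAI models, so create a `.env` file with the variables listed below.

For Cassandra, set: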
```bash
CASSANDRA_CONTACT_POINTS
CASSANDRA_USERNAME
CASSANDRA_PASSWORD
CASSANDRA_KEYSPACE
```

For Astra, set:
```bash
ASTRA_DB_APPLICATION_TOKEN
ASTRA_DB_DATABASE_ID
ASTRA_DB_KEYSPACE
```

For example:

```bash
# Connection to Astra:
ASTRA_DB_DATABASE_ID=a1b2c3d4-...
ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
ASTRA_DB_KEYSPACE=notebooks

# Also set
OPENAI_API_KEY=sk-....
```

(You can also modify the code below to connect with `cassio` directly.)
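
One way to make that direct connection is to read the same variables yourself and pass them to `cassio.init` as keyword arguments. The sketch below only builds those arguments. `cassio_init_kwargs` is a hypothetical helper, and the keyword names (`token`, `database_id`, `keyspace`, `contact_points`, `username`, `password`) are assumed to match the installed `cassio` version, so check its documentation before relying on them.

```python
def cassio_init_kwargs(env):
    """Build keyword arguments for a manual cassio.init(...) call.

    `env` is any mapping of environment variable names to values.
    This helper is hypothetical and not part of cassio.
    """
    if env.get("ASTRA_DB_APPLICATION_TOKEN") and env.get("ASTRA_DB_DATABASE_ID"):
        # Astra: token plus database ID; the keyspace may be left unset
        return {
            "token": env["ASTRA_DB_APPLICATION_TOKEN"],
            "database_id": env["ASTRA_DB_DATABASE_ID"],
            "keyspace": env.get("ASTRA_DB_KEYSPACE"),
        }
    if env.get("CASSANDRA_CONTACT_POINTS"):
        # Cassandra: split a comma-separated host list into individual hosts
        hosts = [
            host.strip()
            for host in env["CASSANDRA_CONTACT_POINTS"].split(",")
            if host.strip()
        ]
        return {
            "contact_points": hosts,
            "username": env.get("CASSANDRA_USERNAME"),
            "password": env.get("CASSANDRA_PASSWORD"),
            "keyspace": env.get("CASSANDRA_KEYSPACE"),
        }
    raise ValueError("No Astra or Cassandra connection variables found")


# Assumed usage, in place of cassio.init(auto=True):
# import os
# import cassio
# cassio.init(**cassio_init_kwargs(os.environ))
```

Keeping the argument building separate from the `cassio.init` call makes it easy to inspect what will be sent before opening a connection.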

In [None]:
from dotenv import load_dotenv

load_dotenv(override=True)

### Connect to Cassandra

In [None]:
import os

import cassio

cassio.init(auto=True)
session = cassio.config.resolve_session()
if not session:
    raise RuntimeError(
        "Check the environment configuration or configure the cassio connection parameters manually"
    )

keyspace = os.environ.get(
    "ASTRA_DB_KEYSPACE", os.environ.get("CASSANDRA_KEYSPACE", None)
)
if not keyspace:
    raise ValueError("A KEYSPACE environment variable must be set")

session.set_keyspace(keyspace)

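## Set up the database

This only needs to be done once!

### Download the data

The dataset comes from Kaggle: [Environmental Sensor Telemetry Data](https://www.kaggle.com/datasets/garystafford/environmental-sensor-data-132k?select=iot_telemetry_data.csv). The next cell downloads and unzips the data into a Pandas dataframe. The cell after it covers downloading manually. The URL in the download cell is a signed link with a limited lifetime (`X-Goog-Expires=259200` seconds counted from `X-Goog-Date=20240404T115828Z`), so if the request fails, fall back to the manual download.

By the end of this section you should have a Pandas dataframe variable named `df`.

#### Download automatically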

In [None]:
from io import BytesIO
from zipfile import ZipFile

import pandas as pd
import requests

datasetURL = "https://storage.googleapis.com/kaggle-data-sets/788816/1355729/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240404%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240404T115828Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=2849f003b100eb9dcda8dd8535990f51244292f67e4f5fad36f14aa67f2d4297672d8fe6ff5a39f03a29cda051e33e95d36daab5892b8874dcd5a60228df0361fa26bae491dd4371f02dd20306b583a44ba85a4474376188b1f84765147d3b4f05c57345e5de883c2c29653cce1f3755cd8e645c5e952f4fb1c8a735b22f0c811f97f7bce8d0235d0d3731ca8ab4629ff381f3bae9e35fc1b181c1e69a9c7913a5e42d9d52d53e5f716467205af9c8a3cc6746fc5352e8fbc47cd7d18543626bd67996d18c2045c1e475fc136df83df352fa747f1a3bb73e6ba3985840792ec1de407c15836640ec96db111b173bf16115037d53fdfbfd8ac44145d7f9a546aa"

response = requests.get(datasetURL)
# Fail fast with a clear HTTP error instead of reaching an undefined `zip_file` below
response.raise_for_status()

zip_file = ZipFile(BytesIO(response.content))
csv_file_name = zip_file.namelist()[0]

with zip_file.open(csv_file_name) as csv_file:
    df = pd.read_csv(csv_file)
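
The cell above takes the first entry in the archive and assumes it is the CSV. A slightly more defensive sketch selects the entry by file extension instead. `first_csv_member` is a hypothetical helper, and the archive built here is made-up test data, not the Kaggle file.

```python
from io import BytesIO
from zipfile import ZipFile


def first_csv_member(zip_file):
    """Return the name of the first `.csv` entry in an open ZipFile."""
    for name in zip_file.namelist():
        if name.lower().endswith(".csv"):
            return name
    raise FileNotFoundError("Archive contains no .csv file")


# Build a tiny in-memory archive to exercise the helper (made-up contents)
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("README.txt", "not the data")
    archive.writestr(
        "iot_telemetry_data.csv", "ts,device\n1594512094.3,b8:27:eb:bf:9d:51\n"
    )

with ZipFile(BytesIO(buffer.getvalue())) as archive:
    csv_name = first_csv_member(archive)
    header = archive.read(csv_name).decode().splitlines()[0]

print(csv_name, header)
```

In the notebook, `csv_file_name = first_csv_member(zip_file)` would take the place of `zip_file.namelist()[0]`.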

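#### Download manually

You can download the `.zip` file and unpack the `.csv` file it contains. Uncomment the next line and adjust the path to this `.csv` file as appropriate. If you skipped the previous cell, run `import pandas as pd` first.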

In [None]:
# df = pd.read_csv("/path/to/iot_telemetry_data.csv")

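### Load the data into Cassandra

This section assumes a dataframe named `df` exists. The download section above creates it, and the next cell validates its structure.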

In [None]:
assert df is not None, "Dataframe 'df' must be set"
expected_columns = [
    "ts",
    "device",
    "co",
    "humidity",
    "light",
    "lpg",
    "motion",
    "smoke",
    "temp",
]
assert all(
    [column in df.columns for column in expected_columns]
), "DataFrame does not have the expected columns"
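
The assertion above reports only that some column is absent. A small sketch that names the missing columns can make a bad download easier to diagnose. `missing_columns` is a hypothetical helper, and the `actual` header is made up for illustration.

```python
def missing_columns(actual_columns, expected_columns):
    """Return the expected column names absent from `actual_columns`, in order."""
    present = set(actual_columns)
    return [column for column in expected_columns if column not in present]


expected = ["ts", "device", "co", "humidity", "light", "lpg", "motion", "smoke", "temp"]
# Made-up header that lacks two of the expected columns
actual = ["ts", "device", "co", "humidity", "light", "lpg", "motion"]

missing = missing_columns(actual, expected)
if missing:
    print(f"DataFrame is missing columns: {missing}")
```

With the real data this would be called as `missing_columns(df.columns, expected_columns)`.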

Create and load the tables:

In [None]:
from datetime import datetime, timezone

from cassandra.query import BatchStatement

UTC = timezone.utc  # `datetime.UTC` only exists on Python 3.11+

# Create the sensors table
table_query = """
CREATE TABLE IF NOT EXISTS iot_sensors (
    device text,
    conditions text,
    room text,
    PRIMARY KEY (device)
)
WITH COMMENT = 'Environmental IoT room sensor metadata.';
"""
session.execute(table_query)

pstmt = session.prepare(
    """
INSERT INTO iot_sensors (device, conditions, room)
VALUES (?, ?, ?)
"""
)

devices = [
    ("00:0f:00:70:91:0a", "stable conditions, cooler and more humid", "room 1"),
    ("1c:bf:ce:15:ec:4d", "highly variable temperature and humidity", "room 2"),
    ("b8:27:eb:bf:9d:51", "stable conditions, warmer and drier", "room 3"),
]

for device, conditions, room in devices:
    session.execute(pstmt, (device, conditions, room))

print("Sensor data inserted successfully.")

# Create the data table
table_query = """
CREATE TABLE IF NOT EXISTS iot_data (
    day text,
    device text,
    ts timestamp,
    co double,
    humidity double,
    light boolean,
    lpg double,
    motion boolean,
    smoke double,
    temp double,
    PRIMARY KEY ((day, device), ts)
)
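WITH COMMENT = 'Data from environmental IoT room sensors. Columns include device identifier, timestamp (ts) of the data collection, carbon monoxide level (co), relative humidity, light presence, LPG concentration, motion detection, smoke concentration and temperature (temp). Data is partitioned by day and device.';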
"""
session.execute(table_query)

pstmt = session.prepare(
    """
INSERT INTO iot_data (day, device, ts, co, humidity, light, lpg, motion, smoke, temp)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
"""
)


def insert_data_batch(name, group):
    batch = BatchStatement()
    day, device = name
    print(f"Inserting batch for day: {day}, device: {device}")

    for _, row in group.iterrows():
        timestamp = datetime.fromtimestamp(row["ts"], UTC)
        batch.add(
            pstmt,
            (
                day,
                row["device"],
                timestamp,
                row["co"],
                row["humidity"],
                row["light"],
                row["lpg"],
                row["motion"],
                row["smoke"],
                row["temp"],
            ),
        )

    session.execute(batch)

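# Convert columns to appropriate types.
# `pd.read_csv` may already have parsed "true"/"false" as booleans, so normalize
# through strings to handle both the string and the boolean case.
df["light"] = df["light"].astype(str).str.lower() == "true"
df["motion"] = df["motion"].astype(str).str.lower() == "true"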
df["ts"] = df["ts"].astype(float)
df["day"] = df["ts"].apply(
    lambda x: datetime.fromtimestamp(x, UTC).strftime("%Y-%m-%d")
)

grouped_df = df.groupby(["day", "device"])

for name, group in grouped_df:
    insert_data_batch(name, group)

print("Data load complete")
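
The load cell sends every row of a `(day, device)` partition in a single batch. Cassandra clusters are usually configured with batch size warning and failure thresholds, so a partition with many rows may be rejected. The following is a minimal sketch of one workaround, splitting each group into smaller chunks. `day_bucket`, `chunked`, and the chunk size of 100 are illustrative assumptions, not part of the original notebook or the driver, and the timestamps are made up.

```python
from datetime import datetime, timezone
from itertools import islice


def day_bucket(ts):
    """Map a Unix timestamp in seconds to the UTC `YYYY-MM-DD` partition value."""
    return datetime.fromtimestamp(ts, timezone.utc).strftime("%Y-%m-%d")


def chunked(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Made-up timestamps: 250 readings, one per second, starting on 2020-07-14 UTC
timestamps = [1594684800.0 + i for i in range(250)]
sizes = [len(chunk) for chunk in chunked(timestamps, 100)]
print(day_bucket(timestamps[0]), sizes)
```

Inside `insert_data_batch`, the loop would then create a new `BatchStatement` per chunk and call `session.execute` once for each.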

In [None]:
print(session.keyspace)

## Load the tools

Python `import` statements for the demo:

In [None]:
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_community.agent_toolkits.cassandra_database.toolkit import (
    CassandraDatabaseToolkit,
)
from langchain_community.tools.cassandra_database.prompt import QUERY_PATH_PROMPT
from langchain_community.tools.cassandra_database.tool import (
    GetSchemaCassandraDatabaseTool,
    GetTableDataCassandraDatabaseTool,
    QueryCassandraDatabaseTool,
)
from langchain_community.utilities.cassandra_database import CassandraDatabase
from langchain_openai import ChatOpenAI

The `CassandraDatabase` object is loaded from `cassio`, though it also accepts a `Session`-type parameter as an alternative.

In [None]:
# Create a CassandraDatabase instance
db = CassandraDatabase(include_tables=["iot_sensors", "iot_data"])

# Create the Cassandra Database tools
query_tool = QueryCassandraDatabaseTool(db=db)
schema_tool = GetSchemaCassandraDatabaseTool(db=db)
select_data_tool = GetTableDataCassandraDatabaseTool(db=db)

The tools can be invoked directly:

In [None]:
# Test the tools
print("Executing a CQL query:")
query = "SELECT * FROM iot_sensors LIMIT 5;"
result = query_tool.run({"query": query})
print(result)

print("\nGetting the schema for a keyspace:")
schema = schema_tool.run({"keyspace": keyspace})
print(schema)

print("\nGetting data from a table:")
table = "iot_data"
predicate = "day = '2020-07-14' and device = 'b8:27:eb:bf:9d:51'"
data = select_data_tool.run(
    {"keyspace": keyspace, "table": table, "predicate": predicate, "limit": 5}
)
print(data)
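
Because `iot_data` uses `(day, device)` as its partition key, the predicate passed to the table-data tool pins both columns. The sketch below builds such a predicate from two values. `partition_predicate` is a hypothetical helper and not part of the LangChain toolkit.

```python
def partition_predicate(day, device):
    """Build the `predicate` string for one (day, device) partition of iot_data."""
    # CQL string literals escape a single quote by doubling it
    safe_day = day.replace("'", "''")
    safe_device = device.replace("'", "''")
    return f"day = '{safe_day}' and device = '{safe_device}'"


predicate = partition_predicate("2020-07-14", "b8:27:eb:bf:9d:51")
print(predicate)
```

The resulting string can be passed as the `predicate` value in the `select_data_tool.run(...)` call shown above.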

## Agent setup

In [None]:
from langchain.agents import Tool
from langchain_experimental.utilities import PythonREPL

python_repl = PythonREPL()

repl_tool = Tool(
    name="python_repl",
    description="A Python shell. Use this to execute Python commands. Input should be a valid Python command. If you want to see the output of a value, you should print it out with `print(...)`.",
    func=python_repl.run,
)
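
The tool description tells the model to use `print(...)` because only text written to standard output comes back as the tool result. The sketch below illustrates that idea with the standard library. It is an illustration, not LangChain's implementation of `PythonREPL`, and `run_and_capture` is a hypothetical name.

```python
import contextlib
import io


def run_and_capture(command):
    """Execute `command` and return whatever it wrote to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(command, {})  # fresh globals; never run untrusted code this way
    return buffer.getvalue()


# A bare expression produces no output, while a printed value is captured
print(repr(run_and_capture("1 + 1")))
print(repr(run_and_capture("print(1 + 1)")))
```

The same caution applies to the real tool: it executes whatever code the model generates, so run the notebook in an environment where that is acceptable.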

In [None]:
from langchain import hub

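# Note: `llm` is reassigned further down, so the model set there is the one the agent uses.
llm = ChatOpenAI(temperature=0, model="gpt-4-1106-preview")
# The toolkit is created here but not used; the agent gets the explicit `tools` list below.
toolkit = CassandraDatabaseToolkit(db=db)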

tools = [schema_tool, select_data_tool, repl_tool]

input = (
    QUERY_PATH_PROMPT
    + f"""

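Here is your task: In the {keyspace} keyspace, find the total number of times the temperature of each device exceeded 23 degrees on July 14, 2020.
Create a summary report that includes the name of the room. Use Pandas if helpful.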
"""
)

prompt = hub.pull("hwchase17/openai-tools-agent")

llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0)

# Construct the OpenAI Tools agent
agent = create_openai_tools_agent(llm, tools, prompt)

print("Available tools:")
for tool in tools:
    print("\t" + tool.name + " - " + tool.description + " - " + str(tool))

In [None]:
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

response = agent_executor.invoke({"input": input})

print(response["output"])
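
To sanity-check the agent's report, the same aggregation can be sketched in plain Python: count the readings above the threshold per device, then label each count with its room. The rows below are made-up sample data in the shape of the two tables, and `count_above` is a hypothetical helper.

```python
from collections import Counter

# Made-up sample: device -> room, mirroring the shape of iot_sensors
rooms = {
    "00:0f:00:70:91:0a": "room 1",
    "b8:27:eb:bf:9d:51": "room 3",
}

# Made-up readings as (device, temp) pairs for a single day
readings = [
    ("00:0f:00:70:91:0a", 19.7),
    ("b8:27:eb:bf:9d:51", 23.4),
    ("b8:27:eb:bf:9d:51", 22.9),
    ("b8:27:eb:bf:9d:51", 24.1),
]


def count_above(readings, threshold):
    """Count readings per device whose temperature is strictly above `threshold`."""
    return Counter(device for device, temp in readings if temp > threshold)


counts = count_above(readings, 23)
for device, room in rooms.items():
    print(f"{room} ({device}): {counts.get(device, 0)}")
```

Running the equivalent over rows fetched for `day = '2020-07-14'` gives a reference result to compare against the agent's output.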