# AI Agents in PySpark

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Spark-Example-30-ChatGPT') \
    .getOrCreate()

You can create an OpenAI API key on the following page: https://platform.openai.com/account/api-keys
OpenAI models: https://platform.openai.com/docs/models

In [18]:
# Import dependencies
import os
import warnings

from langchain.llms import OpenAI
from langchain_experimental.agents import create_spark_dataframe_agent

# Create a file named apikey.py (imported here as working.apikey) containing one line:
#     apikey = 'sk-xxxxxxxxxxxxxxxxxxxxxxxx'   (where xxxx is your API key)
from working.apikey import apikey

warnings.filterwarnings("ignore", category=DeprecationWarning)
os.environ['OPENAI_API_KEY'] = apikey
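
Keeping the key in a Python module works for a local experiment, but the module is easy to commit by accident. A minimal alternative sketch, assuming the key may already be exported as the `OPENAI_API_KEY` environment variable; the helper name `resolve_api_key` and its `fallback` argument are hypothetical and not part of LangChain or the OpenAI client:

```python
import os


def resolve_api_key(env=None, fallback=None):
    """Return the OpenAI key from the environment, else the fallback value.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY") or fallback
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set and no fallback was given")
    return key
```

In this notebook it could be used as `os.environ['OPENAI_API_KEY'] = resolve_api_key(fallback=apikey)`, so an exported key takes precedence over the one in the file.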

In [19]:
csv_file_path = "data\\taxi-data-sorted-verysmall-header.csv"
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)
df.show()

+--------------------+--------------------+-------------------+-------------------+---------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+------------+
|           medallion|        hack_license|    pickup_datetime|   dropoff_datetime|trip_time|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|
+--------------------+--------------------+-------------------+-------------------+---------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+------------+
|07290D3599E7A0D62...|E7750A37CAB07D0DF...|2013-01-01 00:00:00|2013-01-01 00:02:00|      120|         0.44|      -73.956528|      40.716976|        -73.96244|       40.715008|         CSH|        3.5|      0.5|    0.5|       0.0|   

# Create a Spark DataFrame LangChain agent

In [20]:
from langchain_community.chat_models import ChatOpenAI
# Use OpenAI's GPT-4.1 chat model; temperature=0 keeps the answers as deterministic as possible
llm = ChatOpenAI(model_name = "gpt-4.1-2025-04-14", temperature=0)
#llm = ChatOpenAI(model_name = "gpt-4o-2024-11-20", temperature=0)
#llm = OpenAI(temperature=0)


In [21]:
llm.model_name

'gpt-4.1-2025-04-14'

In [22]:
agent = create_spark_dataframe_agent(llm=llm, df=df, verbose=True,
                                     handle_parsing_errors=True,
                                     allow_dangerous_code=True)

# Run the agent

In [23]:
agent.run("how many rows are there?")



> Entering new AgentExecutor chain...
Thought: To find the number of rows in a Spark DataFrame, I should use the `count()` method.
Action: python_repl_ast
Action Input: df.count()
Observation: 10000
Thought: I now know the final answer
Final Answer: There are 10,000 rows in the dataframe.

> Finished chain.


'There are 10,000 rows in the dataframe.'

In [24]:
agent.run("The top ten taxis that have had the largest number of drivers")



> Entering new AgentExecutor chain...
Thought: To find the top ten taxis (medallions) with the largest number of drivers (hack_license), I need to group the dataframe by 'medallion', count the distinct 'hack_license' for each, order by this count descending, and select the top 10.
Action: python_repl_ast
Action Input: df.groupBy("medallion").agg({"hack_license": "countDistinct"}).withColumnRenamed("count(DISTINCT hack_license)", "num_drivers").orderBy("num_drivers", ascending=False).limit(10).show()
Observation: AnalysisException: Undefined function: 'countdistinct'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
Thought: Thought: The error is because the aggregation function should be 'countDistinct' (with correct case) from pyspark.sql.functions, not as a string in agg. I should import countDistinct and use it properly in the aggregation.
Action: python_rep

'The top ten taxis (medallions) that have had the largest number of drivers (distinct hack_license values) are:\n\n1. DE85355130669A290... (2 drivers)\n2. 582CAF46446FA0332... (2 drivers)\n3. F899990AB4A6582D6... (2 drivers)\n4. 753BC0484097BB236... (1 driver)\n5. 80F732B990A7E3763... (1 driver)\n6. 846DFE2D59F6E76EC... (1 driver)\n7. BEA5A07E7B365D7F6... (1 driver)\n8. EF882BDAF03D41517... (1 driver)\n9. E1BD31C1BF8DDCFCB... (1 driver)\n10. A02946A94C960AF04... (1 driver)\n\nThe top three medallions each had 2 drivers, while the rest had 1 driver each.'
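
The trace above is cut off before the corrected call, so the query the agent finally ran is not shown. One way to write the aggregation it describes is sketched below. `F.countDistinct` is the PySpark function the agent refers to, while the helper name `top_taxis_by_driver_count` is only illustrative. The import sits inside the function so the helper can be defined before `findspark.init()` has run.

```python
def top_taxis_by_driver_count(df, n=10):
    """Return the n medallions with the most distinct hack_license values."""
    # Imported lazily so that defining this helper does not require Spark yet.
    from pyspark.sql import functions as F

    return (
        df.groupBy("medallion")
          .agg(F.countDistinct("hack_license").alias("num_drivers"))
          .orderBy(F.desc("num_drivers"))
          .limit(n)
    )
```

Calling `top_taxis_by_driver_count(df).show()` prints the ranking. Because many medallions are tied at one driver, the order among tied rows is not guaranteed to be stable between runs.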

In [25]:
agent.run("Who are the top 10 best drivers in terms of their average earned money per minute spent carrying a customer?")



> Entering new AgentExecutor chain...
Thought: To find the top 10 best drivers by average earned money per minute, I need to group by the driver identifier (likely "hack_license"), calculate the average of (total_amount / trip_time) for each driver, and then order by this value in descending order, taking the top 10.
Action: python_repl_ast
Action Input: 
from pyspark.sql import functions as F

df.withColumn(
    "money_per_minute", F.col("total_amount") / (F.col("trip_time") / 60)
).groupBy(
    "hack_license"
).agg(
    F.avg("money_per_minute").alias("avg_money_per_minute")
).orderBy(
    F.desc("avg_money_per_minute")
).limit(10).show()
Observation: +--------------------+--------------------+
|        hack_license|avg_money_per_minute|
+--------------------+--------------------+
|7890F735BB4C3BB07...|               73.75|
|1FE259D800533F3B8...|               36.75|
|AC8F60CA2C993BE4A...|          30.9453125|
|3AAB0BF924B7A2F16...|   18.237777

'The top 10 best drivers in terms of their average earned money per minute spent carrying a customer (by hack_license) are:\n\n1. 7890F735BB4C3BB07... — 73.75\n2. 1FE259D800533F3B8... — 36.75\n3. AC8F60CA2C993BE4A... — 30.95\n4. 3AAB0BF924B7A2F16... — 18.24\n5. 11DDF1FD5B2C815FA... — 9.36\n6. 371910291B554B982... — 7.03\n7. 2C7D2EFF7686CFED6... — 6.77\n8. BABDC5E896D679E9F... — 6.01\n9. 01D771BD290991D50... — 5.77\n10. 911299BE5F3FF1DFD... — 5.57\n\n(Values are average dollars earned per minute, rounded to two decimal places.)'
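
The agent averaged each trip's dollars-per-minute rate, which weights a one-minute trip the same as an hour-long one. Dividing a driver's total earnings by total minutes is another reading of the question and can rank drivers differently. The toy numbers below are invented for illustration and do not come from the taxi data:

```python
# Hypothetical trips for one driver: (total_amount in dollars, trip_time in seconds).
trips = [(10.0, 60), (10.0, 600)]

# Mean of per-trip rates, as in the agent's query.
per_trip_rates = [amount / (seconds / 60) for amount, seconds in trips]
mean_of_rates = sum(per_trip_rates) / len(per_trip_rates)

# Total dollars divided by total minutes.
total_amount = sum(amount for amount, _ in trips)
total_minutes = sum(seconds for _, seconds in trips) / 60
rate_of_totals = total_amount / total_minutes

print(mean_of_rates)             # 5.5
print(round(rate_of_totals, 2))  # 1.82
```

A single very short, expensive trip can therefore dominate the per-trip average, which may explain outliers such as the 73.75 at the top of the list.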

In [26]:
agent.run("How many percent of taxi customers pay with cash, and how many percent use electronic cards?")



> Entering new AgentExecutor chain...
Thought: To answer this, I need to count the number of rows where payment_type is 'CSH' (cash) and where it is an electronic card (likely 'CRD'), then calculate the percentage of each over the total number of rows.
Action: python_repl_ast
Action Input: df.groupBy("payment_type").count().show()
Observation: +------------+-----+
|payment_type|count|
+------------+-----+
|         CSH| 5998|
|         CRD| 4000|
|         UNK|    2|
+------------+-----+

Thought: Thought: Now I need to calculate the percentage of 'CSH' and 'CRD' payments out of the total number of rows (5998 + 4000 + 2 = 10000). Then, I will compute the percentages for cash and electronic card payments.
Action: python_repl_ast
Action Input: (5998/10000)*100, (4000/10000)*100
Observation: (59.98, 40.0)
Thought: Final Answer: 59.98% of taxi customers pay with cash, and 40.0% use electronic cards.

'59.98% of taxi customers pay with cash, and 40.0% use electronic cards.'
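
The agent typed the counts into its final calculation by hand and left the `UNK` rows out of the answer. A small sketch that derives every share from the grouped counts in the observation above (the dictionary simply restates that table):

```python
# Counts copied from the groupBy("payment_type").count() output.
counts = {"CSH": 5998, "CRD": 4000, "UNK": 2}

total = sum(counts.values())
shares = {ptype: round(100 * n / total, 2) for ptype, n in counts.items()}

print(shares)  # {'CSH': 59.98, 'CRD': 40.0, 'UNK': 0.02}
```

The remaining 0.02% of rows have an unknown payment type, which is why the two reported percentages do not sum to exactly 100.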

In [13]:
import pyspark.sql.functions as F
agent.run("Find the top-10 most efficient taxi drivers based on earned money per mile.")



> Entering new AgentExecutor chain...
Thought: To find the top-10 most efficient taxi drivers based on earned money per mile, I need to:
1. Group the dataframe by driver (hack_license).
2. For each driver, sum the total earned money (total_amount) and total miles (trip_distance).
3. Calculate the efficiency as total_amount / total_trip_distance for each driver.
4. Order by efficiency in descending order and select the top 10.

Action: python_repl_ast
Action Input: 
from pyspark.sql import functions as F

df_efficiency = (
    df.groupBy("hack_license")
      .agg(
          F.sum("total_amount").alias("total_earned"),
          F.sum("trip_distance").alias("total_miles")
      )
      .withColumn("efficiency", F.col("total_earned") / F.col("total_miles"))
      .orderBy(F.desc("efficiency"))
      .limit(10)
)

df_efficiency.show()
Observation: +--------------------+-----------------+-------------------+------------------+
|        hack_license| 

'The top-10 most efficient taxi drivers based on earned money per mile are:\n\n1. ED506323B59B1F557... with efficiency 1125.0\n2. EA14140EF36CBCFF0... with efficiency 1050.0\n3. 7F364376C78806B0E... with efficiency 525.0\n4. 4E9DA457D6EC727AC... with efficiency 383.33\n5. 2EE83D77768FEE22C... with efficiency 325.0\n6. C2E67E72B7D6DFA4D... with efficiency 150.0\n7. E0C54FD4238BC93F3... with efficiency 149.49\n8. 8994FB75C5F11A994... with efficiency 140.0\n9. 0760304D0CAEC6949... with efficiency 139.39\n10. 47C715F18BE3EA6A9... with efficiency 107.89\n\nThese drivers earned the most money per mile driven.'

In [27]:
agent.run("What are the mean, median of tip amount?")



> Entering new AgentExecutor chain...
Thought: To answer this, I need to calculate the mean and median of the "tip_amount" column in the Spark dataframe `df`. For the mean, I can use the `mean` function from `pyspark.sql.functions`. For the median, Spark does not have a direct median function, but I can use the `approxQuantile` method to get the median (50th percentile).
Action: python_repl_ast
Action Input: 
from pyspark.sql import functions as F

mean_tip = df.select(F.mean("tip_amount")).first()[0]
median_tip = df.approxQuantile("tip_amount", [0.5], 0.01)[0]
(mean_tip, median_tip)
Observation: (0.9702309999999977, 0.0)
Thought: I now know the final answer
Final Answer: The mean of tip amount is approximately 0.97, and the median of tip amount is 0.0.

> Finished chain.


'The mean of tip amount is approximately 0.97, and the median of tip amount is 0.0.'
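
`approxQuantile` with a relative error of 0.01 returns an approximate median; passing 0.0 requests the exact quantile at a higher cost. PySpark 3.4 and later also provide `pyspark.sql.functions.median`, so the agent's claim that Spark has no direct median function depends on the installed version. A sketch using the exact quantile, with a hypothetical helper name:

```python
def mean_and_exact_median(df, column="tip_amount"):
    """Return (mean, median) of a numeric column of a Spark DataFrame."""
    # Imported lazily so that defining this helper does not require Spark yet.
    from pyspark.sql import functions as F

    mean_value = df.select(F.mean(column)).first()[0]
    # relativeError=0.0 asks for the exact quantile instead of an approximation.
    median_value = df.approxQuantile(column, [0.5], 0.0)[0]
    return mean_value, median_value
```

A median of 0.0 next to a mean near 0.97 is consistent with at least half of the trips recording no tip, which fits the large share of cash payments seen earlier.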

In [29]:
agent.run("Create a linear model to predict the total paid amounts of Taxi rides based on duration. What will be the predicted total paid amount if the duration is 10 minutes?")



> Entering new AgentExecutor chain...
Thought: To create a linear model to predict total paid amount (`total_amount`) based on duration, I need to use Spark MLlib. The duration is given in seconds as `trip_time`, so 10 minutes = 600 seconds. The steps are: select relevant columns, assemble features, fit a linear regression model, and use it to predict for 600 seconds.
Action: import necessary libraries and select the relevant columns
Action Input: 
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

df_model = df.select("trip_time", "total_amount")
df_model.show(5)
Observation: import necessary libraries and select the relevant columns is not a valid tool, try one of [python_repl_ast].
Thought: Action: select the relevant columns to inspect the data
Action Input: df.select("trip_time", "total_amount").show(5)
Observation: select the relevant columns to inspect the data is not a valid tool, try one

'The linear model to predict total paid amount is: total_amount = 0.0153 * trip_time + 3.29. For a duration of 10 minutes (600 seconds), the predicted total paid amount is:\n\ntotal_amount = 0.0153 * 600 + 3.29 ≈ 9.18 + 3.29 = 12.47\n\nSo, the predicted total paid amount for a 10-minute ride is approximately $12.47.'
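
The fitted line takes `trip_time` in seconds, so a duration given in minutes has to be converted before it is plugged in. A quick check of the arithmetic, using the slope and intercept reported in the answer above (these are rounded values, so the result is approximate); the function name is illustrative:

```python
SLOPE_PER_SECOND = 0.0153  # reported coefficient for trip_time
INTERCEPT = 3.29           # reported intercept


def predict_total_amount(duration_minutes):
    """Predict total_amount (dollars) for a ride of the given duration."""
    trip_time_seconds = duration_minutes * 60
    return SLOPE_PER_SECOND * trip_time_seconds + INTERCEPT


print(round(predict_total_amount(10), 2))  # 12.47
```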

In [30]:
agent.run("Create a linear model to predict the total paid amounts of Taxi rides based on duration, trip_distance, fare_amount, and tolls_amount.")



> Entering new AgentExecutor chain...
Thought: To create a linear model in PySpark, I need to:
1. Select the relevant columns: trip_time, trip_distance, fare_amount, tolls_amount, and total_amount.
2. Assemble the features into a single vector column using VectorAssembler.
3. Split the data into training and test sets.
4. Fit a LinearRegression model using the features to predict total_amount.
5. Show the model summary.

Action: I will start by selecting the relevant columns.
Action Input: df.select("trip_time", "trip_distance", "fare_amount", "tolls_amount", "total_amount").show(5)
Observation: I will start by selecting the relevant columns. is not a valid tool, try one of [python_repl_ast].
Thought: Action: python_repl_ast
Action Input: df.select("trip_time", "trip_distance", "fare_amount", "tolls_amount", "total_amount").show(5)
Observation: +---------+-------------+-----------+------------+------------+
|trip_time|trip_distanc

'A linear regression model was created to predict total_amount using trip_time, trip_distance, fare_amount, and tolls_amount. The learned coefficients are:\n- trip_time: -0.0003\n- trip_distance: 0.0099\n- fare_amount: 1.1024\n- tolls_amount: 0.9944\n\nThe model intercept is 0.983. This means the model predicts total_amount as:\ntotal_amount ≈ 0.983 - 0.0003 × trip_time + 0.0099 × trip_distance + 1.1024 × fare_amount + 0.9944 × tolls_amount'
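
Because the trace is truncated, the code that produced these coefficients is not visible. The five steps in the agent's plan can be sketched with Spark MLlib as follows. The 80/20 split, the seed, dropping rows with nulls, and the helper name `fit_total_amount_model` are assumptions made for illustration, so the coefficients it yields need not match the ones quoted above.

```python
def fit_total_amount_model(df, seed=42):
    """Fit a linear regression for total_amount; return (model, test_r2)."""
    # Imported lazily so that defining this helper does not require Spark yet.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    feature_cols = ["trip_time", "trip_distance", "fare_amount", "tolls_amount"]

    # Steps 1-2: keep the relevant columns and pack the features into a vector.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    data = assembler.transform(
        df.select(*feature_cols, "total_amount").dropna()
    ).select("features", "total_amount")

    # Step 3: hold out part of the data for evaluation (assumed 80/20 split).
    train, test = data.randomSplit([0.8, 0.2], seed=seed)

    # Step 4: fit the model on the training rows.
    lr = LinearRegression(featuresCol="features", labelCol="total_amount")
    model = lr.fit(train)

    # Step 5: summarise the fit on the held-out rows.
    test_r2 = model.evaluate(test).r2
    return model, test_r2
```

After `model, r2 = fit_total_amount_model(df)`, the learned weights are available as `model.coefficients` (in the order of `feature_cols`) and `model.intercept`.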