# Spark and ChatGPT

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Spark-Example-30-ChatGPT') \
    .getOrCreate()

You can create an OpenAI API key on the following page: https://platform.openai.com/account/api-keys
OpenAI models: https://platform.openai.com/docs/models

In [2]:
# Bring in dependencies
import os
import warnings

from langchain.llms import OpenAI
from langchain_experimental.agents import create_spark_dataframe_agent
from working.apikey import apikey

# Hide deprecation warnings in the notebook output
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Create a file named apikey.py in the working/ directory and add the following line:
# apikey = 'sk-xxxxxxxxxxxxxxxxxxxxxxxx' (where xxxx is your API key)
os.environ['OPENAI_API_KEY'] = apikey
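
As an alternative to keeping the key in a Python module, the key can be set as an environment variable outside the notebook and only checked here. The sketch below is one possible approach and is not part of the original notebook; `load_openai_key` is a hypothetical helper name, not a LangChain or OpenAI function.

```python
import os


def load_openai_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error.

    `load_openai_key` is a hypothetical helper written for this example.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before starting the notebook."
        )
    return key
```

Calling `load_openai_key()` at the top of the notebook fails early with a readable message if the variable is missing, instead of failing later inside the first agent call.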

In [3]:
# Build the path with os.path.join so it also works outside Windows
csv_file_path = os.path.join("data", "taxi-data-sorted-verysmall-header.csv")
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)
df.show()

+--------------------+--------------------+-------------------+-------------------+---------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+------------+
|           medallion|        hack_license|    pickup_datetime|   dropoff_datetime|trip_time|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|
+--------------------+--------------------+-------------------+-------------------+---------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+------------+
|07290D3599E7A0D62...|E7750A37CAB07D0DF...|2013-01-01 00:00:00|2013-01-01 00:02:00|      120|         0.44|      -73.956528|      40.716976|        -73.96244|       40.715008|         CSH|        3.5|      0.5|    0.5|       0.0|   

# Create a Spark DataFrame LangChain agent

In [11]:
from langchain_community.chat_models import ChatOpenAI
# Use OpenAI's gpt-4o chat model; temperature=0 reduces randomness in the output
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)
# llm = OpenAI(temperature=0)


In [12]:
llm.model_name

'gpt-4o'

In [25]:
agent = create_spark_dataframe_agent(llm=llm, df=df, verbose=True, handle_parsing_errors=True)

# Run the agent

In [7]:
agent.run("how many rows are there?")



> Entering new AgentExecutor chain...
Thought: To find out how many rows are in the Spark DataFrame `df`, I can use the `count()` method, which returns the number of rows in the DataFrame.

Action: python_repl_ast
Action Input: df.count()
Observation: 10000
Thought: I now know the final answer.
Final Answer: There are 10,000 rows in the DataFrame `df`.

> Finished chain.


'There are 10,000 rows in the DataFrame `df`.'
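
In recent LangChain releases `run` is deprecated in favor of `invoke`, which takes a dict and returns a dict. The sketch below shows the call shape under the assumption that the agent is an `AgentExecutor` whose result carries the answer under the `"output"` key; `ask` and `EchoAgent` are hypothetical names used only for illustration.

```python
def ask(agent, question):
    """Send one question to an agent executor and return only the answer text.

    Assumes `agent.invoke` accepts {"input": ...} and returns a dict
    with an "output" key, as LangChain's AgentExecutor does.
    """
    result = agent.invoke({"input": question})
    return result["output"]


class EchoAgent:
    """Hypothetical stand-in for a real agent, used only to show the call shape."""

    def invoke(self, inputs):
        return {"input": inputs["input"], "output": "echo: " + inputs["input"]}


print(ask(EchoAgent(), "how many rows are there?"))
```

With the real agent from this notebook, `ask(agent, "how many rows are there?")` would replace `agent.run("how many rows are there?")`.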

In [8]:
agent.run("The top ten taxis that have had the largest number of drivers")



> Entering new AgentExecutor chain...
To find the top ten taxis that have had the largest number of drivers, we need to group the data by the `medallion` column and count the distinct `hack_license` values for each medallion. Then, we can order the results by this count in descending order and select the top ten.

Let's execute these steps using PySpark.

Thought: I will group the dataframe by `medallion`, count the distinct `hack_license` for each group, order by the count in descending order, and then select the top ten results.
Action: python_repl_ast
Action Input: 
```python
from pyspark.sql import functions as F

# Group by medallion and count distinct hack_license
driver_counts = df.groupBy("medallion").agg(F.countDistinct("hack_license").alias("driver_count"))

# Order by driver_count in descending order and select top 10
top_ten_taxis = driver_counts.orderBy(F.desc("driver_count")).limit(10)

# Show the result
top_ten_taxis.show()
```
Observation:

'The top ten taxis with the largest number of drivers are as follows, with each having the number of distinct drivers (hack licenses) associated with them:\n\n1. Medallion: DE85355130669A290... - Drivers: 2\n2. Medallion: 582CAF46446FA0332... - Drivers: 2\n3. Medallion: F899990AB4A6582D6... - Drivers: 2\n4. Medallion: 753BC0484097BB236... - Drivers: 1\n5. Medallion: 80F732B990A7E3763... - Drivers: 1\n6. Medallion: 846DFE2D59F6E76EC... - Drivers: 1\n7. Medallion: BEA5A07E7B365D7F6... - Drivers: 1\n8. Medallion: EF882BDAF03D41517... - Drivers: 1\n9. Medallion: E1BD31C1BF8DDCFCB... - Drivers: 1\n10. Medallion: A02946A94C960AF04... - Drivers: 1'
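
To make clear what grouping by `medallion` and counting distinct `hack_license` values computes, here is a plain-Python sketch of the same aggregation. The rows are made up for illustration and are not taken from the taxi data; `top_taxis_by_driver_count` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up (medallion, hack_license) pairs; not taken from the taxi data set.
trips = [
    ("M1", "D1"), ("M1", "D2"), ("M1", "D1"),
    ("M2", "D3"),
    ("M3", "D4"), ("M3", "D5"), ("M3", "D6"),
]


def top_taxis_by_driver_count(rows, n=10):
    """Return up to n (medallion, distinct_driver_count) pairs, largest first."""
    drivers = defaultdict(set)
    for medallion, hack_license in rows:
        drivers[medallion].add(hack_license)
    counts = [(m, len(d)) for m, d in drivers.items()]
    # Sort by count descending, then by medallion so ties have a fixed order.
    return sorted(counts, key=lambda mc: (-mc[1], mc[0]))[:n]


print(top_taxis_by_driver_count(trips))
```

Many taxis in the agent's answer tie at one driver, and Spark does not guarantee an order among tied rows, so which of them fill the last positions may differ between runs unless a secondary sort key is added as in this sketch.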

In [22]:
agent.run("Who are the top 10 best drivers in terms of their average earned money per minute spent carrying a customer?")



> Entering new AgentExecutor chain...
To determine the top 10 best drivers in terms of their average earned money per minute spent carrying a customer, we need to calculate the average earnings per minute for each driver and then sort them in descending order. The steps are as follows:

1. Calculate the earnings per minute for each trip.
2. Group the data by driver (using `hack_license` as the identifier).
3. Calculate the average earnings per minute for each driver.
4. Sort the drivers by their average earnings per minute in descending order.
5. Select the top 10 drivers.

Let's execute these steps using the Spark DataFrame API.

Thought: I will calculate the earnings per minute for each trip, group by `hack_license`, calculate the average, and then sort and select the top 10 drivers.
Action: python_repl_ast
Action Input: 
```python
from pyspark.sql import functions as F

# Calculate earnings per minute for each trip
df = df.withColumn("earnings_per_minute", df[

'The top 10 best drivers in terms of their average earned money per minute spent carrying a customer are identified by their `hack_license` as follows:\n\n1. 7890F735BB4C3BB07... with an average of 1.2291\n2. 1FE259D800533F3B8... with an average of 0.6125\n3. AC8F60CA2C993BE4A... with an average of 0.5158\n4. 3AAB0BF924B7A2F16... with an average of 0.3040\n5. 11DDF1FD5B2C815FA... with an average of 0.1560\n6. 371910291B554B982... with an average of 0.1171\n7. 2C7D2EFF7686CFED6... with an average of 0.1128\n8. BABDC5E896D679E9F... with an average of 0.1002\n9. 01D771BD290991D50... with an average of 0.0961\n10. 911299BE5F3FF1DFD... with an average of 0.0929'

In [16]:
agent.run("How many percent of taxi customers pay with cash, and how many percent use electronic cards?")



> Entering new AgentExecutor chain...
To determine the percentage of taxi customers who pay with cash and those who use electronic cards, we need to calculate the proportion of each payment type in the `payment_type` column of the DataFrame `df`.

Let's proceed with the following steps:
1. Count the total number of rows in the DataFrame.
2. Count the number of rows for each payment type.
3. Calculate the percentage for each payment type.

Let's execute these steps using PySpark commands.

Thought: I will count the total number of rows and the number of rows for each payment type, then calculate the percentages.
Action: python_repl_ast
Action Input: 
```python
total_count = df.count()
payment_counts = df.groupBy('payment_type').count().collect()
total_count, payment_counts
```
Observation: (10000, [Row(payment_type='CSH', count=5998), Row(payment_type='CRD', count=4000), Row(payment_type='UNK', count=2)])
Thought: I now have the c

'59.98% of taxi customers pay with cash, and 40.0% use electronic cards.'
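
The percentages can be checked by hand from the counts in the agent's Observation. The sketch below redoes the arithmetic in plain Python; `percentages` is a hypothetical helper, and the counts are copied from the Observation line.

```python
# Counts copied from the agent's Observation for this question.
payment_counts = {"CSH": 5998, "CRD": 4000, "UNK": 2}


def percentages(counts):
    """Return each category's share of the total in percent, rounded to 2 places."""
    total = sum(counts.values())
    return {k: round(100 * v / total, 2) for k, v in counts.items()}


print(percentages(payment_counts))
```

The result also includes the `UNK` payment type at 0.02%, which the agent's final answer leaves out, so the two reported figures add up to 99.98% and not 100%.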

In [18]:
import pyspark.sql.functions as F
agent.run("Find the top-10 most efficient taxi drivers based on earned money per mile.")



> Entering new AgentExecutor chain...
Thought: To find the top-10 most efficient taxi drivers based on earned money per mile, I need to calculate the earned money per mile for each trip and then aggregate this data by driver. The efficiency can be calculated as `total_amount / trip_distance`. After calculating this for each trip, I will group the data by `hack_license` and calculate the average efficiency for each driver. Finally, I will sort the drivers by their average efficiency and select the top 10.

Action: I will perform the necessary calculations and aggregations using PySpark DataFrame operations.

Action Input:
```python
from pyspark.sql import functions as F

# Calculate efficiency for each trip
df_with_efficiency = df.withColumn("efficiency", F.col("total_amount") / F.col("trip_distance"))

# Group by driver and calculate average efficiency
driver_efficiency = df_with_efficiency.groupBy("hack_license").agg(F.avg("efficiency").alias("avg_efficiency"))


'The top-10 most efficient taxi drivers based on earned money per mile are:\n\n1. Hack License: ED506323B59B1F557..., Avg Efficiency: 1125.0\n2. Hack License: EA14140EF36CBCFF0..., Avg Efficiency: 1050.0\n3. Hack License: BBB3596593CF9767F..., Avg Efficiency: 909.17\n4. Hack License: 19CA81FFA64B3844A..., Avg Efficiency: 659.96\n5. Hack License: 2C7D2EFF7686CFED6..., Avg Efficiency: 621.11\n6. Hack License: 582A93DF0BFB76570..., Avg Efficiency: 505.64\n7. Hack License: AC8F60CA2C993BE4A..., Avg Efficiency: 494.82\n8. Hack License: 8A026287EE8287E28..., Avg Efficiency: 454.5\n9. Hack License: 4E9DA457D6EC727AC..., Avg Efficiency: 383.33\n10. Hack License: 2EE83D77768FEE22C..., Avg Efficiency: 325.0'
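
Averages as high as 1125.0 per mile suggest that trips with a very small `trip_distance` dominate the per-trip ratio. One alternative, which is an assumption of this sketch and not what the agent ran, is to drop very short trips and divide each driver's total earnings by their total miles. The rows, the 0.1-mile threshold, and the helper name `money_per_mile` are made up for illustration.

```python
from collections import defaultdict

# Made-up (hack_license, total_amount, trip_distance) rows for illustration.
trips = [
    ("A", 4.50, 0.004),  # near-zero distance: about 1125 per mile on its own
    ("A", 12.00, 3.0),
    ("B", 10.00, 2.0),
    ("B", 20.00, 4.0),
]


def money_per_mile(rows, min_distance=0.1):
    """Per driver: total earnings / total miles, ignoring trips below min_distance."""
    amount = defaultdict(float)
    miles = defaultdict(float)
    for driver, total_amount, distance in rows:
        if distance >= min_distance:
            amount[driver] += total_amount
            miles[driver] += distance
    return {d: amount[d] / miles[d] for d in miles}


print(money_per_mile(trips))
```

With the short trip excluded, driver A comes out at 4.0 per mile and driver B at 5.0, so a single near-zero distance no longer decides the ranking.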

In [19]:
agent.run("What are the mean, median of tip amount?")



> Entering new AgentExecutor chain...
To find the mean and median of the `tip_amount` column in the Spark DataFrame `df`, we can use the following approach:

1. Calculate the mean using the `mean` function from the `pyspark.sql.functions` module.
2. Calculate the median by using the `approxQuantile` method, which is suitable for large datasets in Spark.

Let's proceed with these calculations.

Thought: I will calculate the mean and median of the `tip_amount` column using the appropriate Spark functions.
Action: python_repl_ast
Action Input: 
```python
from pyspark.sql.functions import mean

# Calculate mean of tip_amount
mean_tip = df.select(mean("tip_amount")).collect()[0][0]

# Calculate median of tip_amount using approxQuantile
median_tip = df.approxQuantile("tip_amount", [0.5], 0.01)[0]

mean_tip, median_tip
```
Observation: (0.9702309999999977, 0.0)
Thought: I now know the final answer.

Final Answer: The mean of the tip amou

'The mean of the tip amount is approximately 0.97, and the median of the tip amount is 0.0.'
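
A median of 0.0 next to a mean near 0.97 is what one would expect when more than half of the recorded tips are zero. The sketch below shows the effect on made-up numbers using Python's `statistics` module; the values are not taken from the taxi data, and `summarize` is a hypothetical helper.

```python
import statistics

# Made-up tip amounts: more than half are zero.
tips = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 2.5, 4.5]


def summarize(values):
    """Return (mean, median) of a list of numbers."""
    return statistics.mean(values), statistics.median(values)


print(summarize(tips))
```

Note also that the third argument to `approxQuantile` is the allowed relative error: the agent used 0.01, so its median is approximate, while passing 0.0 asks Spark for the exact quantile at a higher cost.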

In [20]:
agent.run("Create a linear model to predict the total paid amounts of Taxi rides based on duration. What will be the predicted total paid amount if the duration is 10 minutes?")



> Entering new AgentExecutor chain...
To create a linear model to predict the total paid amounts of taxi rides based on duration, we need to follow these steps:

1. Prepare the data by selecting the relevant columns: `trip_time` (duration) and `total_amount`.
2. Split the data into training and test sets.
3. Train a linear regression model using the training data.
4. Use the model to predict the total paid amount for a duration of 10 minutes.

Let's proceed with these steps.

Thought: I will first select the relevant columns and then split the data into training and test sets.
Action: python_repl_ast
Action Input: 
```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.getOrCreate()

# Select relevant columns
data = df.select('trip_time', 'total_amount')

# Assemble features
assembler = VectorAssembler(inputCols=['trip_t

'The predicted total paid amount for a taxi ride with a duration of 10 minutes is approximately $3.47.'
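
The first row of the data shows a `trip_time` of 120 for a ride that lasts two minutes, which suggests the column is in seconds. Under that assumption a 10-minute ride corresponds to `trip_time = 600`, and it is worth checking which value the truncated code passed to the model. The sketch below fits a one-variable least-squares line in plain Python on made-up points to show the unit conversion; it is not the agent's PySpark code, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up (trip_time in seconds, total_amount) points lying on y = 3 + 0.01 * x.
seconds = [120, 300, 600, 900]
amounts = [4.2, 6.0, 9.0, 12.0]

intercept, slope = fit_line(seconds, amounts)
minutes = 10
prediction = intercept + slope * (minutes * 60)  # convert minutes to seconds first
print(round(prediction, 2))
```

On these made-up points, plugging in 10 instead of 600 would return a value close to the intercept, which is the kind of result to watch for when a prediction for a 10-minute ride looks lower than the cost of a much shorter trip.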

In [26]:
agent.run("Create a linear model to predict the total paid amounts of Taxi rides based on duration, trip_distance, fare_amount, and tolls_amount.")



> Entering new AgentExecutor chain...
Thought: To create a linear model using Spark, we need to use the `pyspark.ml` library. We will first assemble the features into a single vector column using `VectorAssembler`, then create a linear regression model using `LinearRegression`. We will fit the model to the data and then evaluate it.

Action: Import necessary libraries and prepare the data for modeling.
Action Input:
```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Define the features and label
feature_columns = ['trip_time', 'trip_distance', 'fare_amount', 'tolls_amount']
assembler = VectorAssembler(inputCols=feature_columns, outputCol='features')

# Transform the data
df_transformed = assembler.transform(df)

# Define the linear regression model
lr = LinearRegression(featuresCol='features', labelCol='total_amount')

# Fit the model
lr_model = lr.fit(df_transformed)

# Print the coefficients and interc

ValueError: An output parsing error occurred. In order to pass this error back to the agent and have it try again, pass `handle_parsing_errors=True` to the AgentExecutor. This is the error: Could not parse LLM output: `It seems there is a misunderstanding in the execution process. I cannot directly execute Spark code in this environment. However, I can guide you through the steps to create a linear regression model using PySpark.

Here is a step-by-step guide to create a linear model to predict the total paid amounts of taxi rides based on duration, trip distance, fare amount, and tolls amount:

1. **Import the necessary libraries:**
   ```python
   from pyspark.ml.feature import VectorAssembler
   from pyspark.ml.regression import LinearRegression
   ```

2. **Define the features and label:**
   ```python
   feature_columns = ['trip_time', 'trip_distance', 'fare_amount', 'tolls_amount']
   assembler = VectorAssembler(inputCols=feature_columns, outputCol='features')
   ```

3. **Transform the data:**
   ```python
   df_transformed = assembler.transform(df)
   ```

4. **Define the linear regression model:**
   ```python
   lr = LinearRegression(featuresCol='features', labelCol='total_amount')
   ```

5. **Fit the model:**
   ```python
   lr_model = lr.fit(df_transformed)
   ```

6. **Print the coefficients and intercept:**
   ```python
   print("Coefficients: ", lr_model.coefficients)
   print("Intercept: ", lr_model.intercept)
   ```

You can execute these steps in your PySpark environment to create and fit the linear regression model.`