# Spark and ChatGPT

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Spark-Example-30-ChatGPT') \
    .getOrCreate()

You can create an OpenAI API key on the following page: https://platform.openai.com/account/api-keys
OpenAI models: https://platform.openai.com/docs/models

In [2]:
# Bring in dependencies
import os
import warnings

from langchain.llms import OpenAI
from langchain_experimental.agents import create_spark_dataframe_agent

# Create a file named apikey.py in the working package and add the following line:
# apikey = 'sk-xxxxxxxxxxxxxxxxxxxxxxxx' (where xxxx is your API key)
from working.apikey import apikey

warnings.filterwarnings("ignore", category=DeprecationWarning)
os.environ['OPENAI_API_KEY'] = apikey
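
As an alternative to keeping the key in a separate `apikey.py` module, the key can be read from an environment variable that is set outside the notebook. The following is only a minimal sketch: the helper name `resolve_api_key` and its fallback behaviour are assumptions made for illustration and are not part of LangChain or the OpenAI client.

```python
import os


def resolve_api_key(env_var="OPENAI_API_KEY", fallback=None):
    """Return the key stored in `env_var`, or `fallback` if the variable is unset.

    `resolve_api_key` is a hypothetical helper, not a library function.
    """
    key = os.environ.get(env_var) or fallback
    if not key:
        # Fail early instead of sending requests without credentials
        raise RuntimeError(f"Set {env_var} or pass a fallback key")
    return key
```

With this helper, the assignment above could be written as `os.environ['OPENAI_API_KEY'] = resolve_api_key(fallback=apikey)`, so a key that is already exported in the shell takes precedence over the one in the file.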

In [3]:
# Forward slashes work on Windows as well as on Linux and macOS
csv_file_path = "data/taxi-data-sorted-verysmall-header.csv"
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)
df.show()

+--------------------+--------------------+-------------------+-------------------+---------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+------------+
|           medallion|        hack_license|    pickup_datetime|   dropoff_datetime|trip_time|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|
+--------------------+--------------------+-------------------+-------------------+---------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+------------+
|07290D3599E7A0D62...|E7750A37CAB07D0DF...|2013-01-01 00:00:00|2013-01-01 00:02:00|      120|         0.44|      -73.956528|      40.716976|        -73.96244|       40.715008|         CSH|        3.5|      0.5|    0.5|       0.0|   

# Create a Spark DataFrame LangChain agent

In [13]:
from langchain_community.chat_models import ChatOpenAI
# Use OpenAI's gpt-4o chat model
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)
# Alternative: the default completion model
# llm = OpenAI(temperature=0)


In [14]:
llm.model_name

'gpt-4o'

In [15]:
agent = create_spark_dataframe_agent(llm=llm, df=df, verbose=True)

# Run the agent

In [16]:
agent.run("how many rows are there?")

> Entering new AgentExecutor chain...
Thought: To find out how many rows are in the Spark DataFrame `df`, I can use the `count()` method, which returns the number of rows in the DataFrame.
Action: python_repl_ast
Action Input: df.count()
Observation: 10000
Thought: I now know the final answer.
Final Answer: There are 10,000 rows in the DataFrame `df`.

> Finished chain.


'There are 10,000 rows in the DataFrame `df`.'

In [17]:
agent.run("the top ten taxis that have had the largest number of drivers")

> Entering new AgentExecutor chain...
Thought: To find the top ten taxis with the largest number of drivers, I need to group the data by the `medallion` column and count the distinct `hack_license` values for each medallion. Then, I will sort the results in descending order and select the top ten entries.

Action: python_repl_ast
Action Input:
```python
from pyspark.sql import functions as F

# Group by medallion and count distinct hack_license
driver_counts = df.groupBy("medallion").agg(F.countDistinct("hack_license").alias("driver_count"))

# Order by driver_count in descending order and take the top 10
top_ten_taxis = driver_counts.orderBy(F.desc("driver_count")).limit(10)

top_ten_taxis.show()
```
Observation:
+--------------------+------------+
|           medallion|driver_count|
+--------------------+------------+
|DE85355130669A290...|           2|
|582CAF46446FA0332...|           2|
|F899990AB4A6582D6...|           2|
|753BC0484097BB236.

'The top ten taxis with the largest number of drivers are:\n\n1. Medallion: DE85355130669A290... with 2 drivers\n2. Medallion: 582CAF46446FA0332... with 2 drivers\n3. Medallion: F899990AB4A6582D6... with 2 drivers\n4. Medallion: 753BC0484097BB236... with 1 driver\n5. Medallion: 80F732B990A7E3763... with 1 driver\n6. Medallion: 846DFE2D59F6E76EC... with 1 driver\n7. Medallion: BEA5A07E7B365D7F6... with 1 driver\n8. Medallion: EF882BDAF03D41517... with 1 driver\n9. Medallion: E1BD31C1BF8DDCFCB... with 1 driver\n10. Medallion: A02946A94C960AF04... with 1 driver'

In [18]:
agent.run("Who are the top 10 best drivers in terms of their average earned money per minute spent carrying a customer?")

> Entering new AgentExecutor chain...
Thought: To determine the top 10 best drivers based on their average earned money per minute, I need to calculate the average earnings per minute for each driver. This can be done by dividing the total amount earned by the trip time for each trip, and then averaging this value for each driver. Finally, I will sort the drivers by this average and select the top 10.

Action: I will calculate the average earnings per minute for each driver and then sort them to find the top 10.

Action Input:
```python
from pyspark.sql import functions as F

# Calculate earnings per minute for each trip
df = df.withColumn("earnings_per_minute", df["total_amount"] / df["trip_time"])

# Group by driver (hack_license) and calculate average earnings per minute
average_earnings_per_minute = df.groupBy("hack_license").agg(F.avg("earnings_per_minute").alias("avg_earnings_per_minute"))

# Sort the drivers by average earnings per minute in descending ord

'The top 10 best drivers in terms of their average earned money per minute spent carrying a customer are:\n\n1. Driver with hack_license `7890F735BB4C3BB07...` with an average earnings per minute of 1.2292\n2. Driver with hack_license `1FE259D800533F3B8...` with an average earnings per minute of 0.6125\n3. Driver with hack_license `AC8F60CA2C993BE4A...` with an average earnings per minute of 0.5158\n4. Driver with hack_license `3AAB0BF924B7A2F16...` with an average earnings per minute of 0.3040\n5. Driver with hack_license `11DDF1FD5B2C815FA...` with an average earnings per minute of 0.1560\n6. Driver with hack_license `371910291B554B982...` with an average earnings per minute of 0.1171\n7. Driver with hack_license `2C7D2EFF7686CFED6...` with an average earnings per minute of 0.1128\n8. Driver with hack_license `BABDC5E896D679E9F...` with an average earnings per minute of 0.1002\n9. Driver with hack_license `01D771BD290991D50...` with an average earnings per minute of 0.0961\n10. Drive
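
The agent's code divides `total_amount` by `trip_time` directly. In the sample row shown earlier, a two-minute trip has `trip_time` 120, so the column appears to be recorded in seconds; if that reading is right, the ratio is earnings per second and needs a factor of 60 to become earnings per minute. The plain-Python sketch below illustrates that conversion and skips trips with zero duration, for which the ratio is undefined. The function name `earnings_per_minute` and the sample rows are made up for illustration and are not taken from the taxi data.

```python
from collections import defaultdict


def earnings_per_minute(trips):
    """Average per-trip earnings per minute for each driver.

    trips: iterable of (hack_license, total_amount, trip_time_seconds).
    Assumes trip_time is in seconds, which is an inference from the sample row.
    """
    per_trip = defaultdict(list)
    for driver, amount, seconds in trips:
        if seconds <= 0:
            # A zero-duration trip has no meaningful rate, so skip it
            continue
        per_trip[driver].append(amount / (seconds / 60.0))
    return {driver: sum(rates) / len(rates) for driver, rates in per_trip.items()}


# Hypothetical rows: (driver, total_amount in dollars, trip_time in seconds)
sample_trips = [
    ("A", 4.5, 120),
    ("A", 9.0, 360),
    ("B", 6.0, 0),
    ("B", 10.0, 300),
]
rates = earnings_per_minute(sample_trips)
print(rates)
```

Note that this averages the per-trip rates, as the agent does; dividing a driver's total earnings by total minutes is a different statistic and can rank drivers differently.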

In [19]:
agent.run("How many percent of taxi customers pay with cash, and how many percent use electronic cards?")

> Entering new AgentExecutor chain...
To determine the percentage of taxi customers who pay with cash and those who use electronic cards, we need to calculate the proportion of each payment type in the `payment_type` column of the DataFrame `df`.

Let's proceed with the following steps:
1. Count the total number of rows in the DataFrame to get the total number of transactions.
2. Count the number of transactions for each payment type.
3. Calculate the percentage for each payment type.

Let's execute these steps using PySpark commands.

Thought: I will count the total number of transactions and the number of transactions for each payment type.
Action: python_repl_ast
Action Input:
```python
total_count = df.count()
payment_counts = df.groupBy('payment_type').count().collect()
total_count, payment_counts
```
Observation: (10000, [Row(payment_type='CSH', count=5998), Row(payment_type='CRD', count=4000), Row(payment_type='UNK', count=2)])
Thoug

'59.98% of taxi customers pay with cash, and 40.0% use electronic cards.'
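
The percentages in the final answer follow directly from the counts in the observation above: 5,998 `CSH`, 4,000 `CRD`, and 2 `UNK` rows out of 10,000. As a quick cross-check, the arithmetic can be sketched in plain Python; the counts are copied from the trace, and the helper name `payment_shares` is chosen only for illustration.

```python
def payment_shares(counts):
    """Convert a mapping of payment type -> row count into percentages."""
    total = sum(counts.values())
    return {ptype: round(100.0 * n / total, 2) for ptype, n in counts.items()}


# Counts reported by the agent's groupBy('payment_type').count()
observed = {"CSH": 5998, "CRD": 4000, "UNK": 2}
shares = payment_shares(observed)
print(shares)  # {'CSH': 59.98, 'CRD': 40.0, 'UNK': 0.02}
```

The remaining 0.02% corresponds to the two rows with payment type `UNK`, which the final answer does not mention.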

In [20]:
import pyspark.sql.functions as F
agent.run("Find the top-10 most efficient taxi drivers based on earned money per mile.")

> Entering new AgentExecutor chain...
Thought: To find the top-10 most efficient taxi drivers based on earned money per mile, I need to calculate the earned money per mile for each trip, group the data by driver, and then calculate the average earned money per mile for each driver. Finally, I will sort the drivers by this average and select the top 10.

Action: Calculate the earned money per mile for each trip, group by driver, calculate the average, and sort to find the top 10.
Action Input: 
```python
from pyspark.sql import functions as F

# Calculate earned money per mile for each trip
df = df.withColumn("money_per_mile", df["total_amount"] / df["trip_distance"])

# Group by driver (hack_license) and calculate the average earned money per mile
avg_money_per_mile = df.groupBy("hack_license").agg(F.avg("money_per_mile").alias("avg_money_per_mile"))

# Sort the drivers by average earned money per mile in descending order and select the top 10
top_10_drivers = av

"The top-10 most efficient taxi driver based on earned money per mile is the driver with hack_license 'E7750A37CAB07D0DFF0AF7E3573AC141', with an average earned money per mile of 10.23. Since we only have one driver in the sample data, only one driver is listed."

In [22]:
agent.run("What are the mean and median of the tip amount?")

> Entering new AgentExecutor chain...
To find the mean and median of the `tip_amount` column in the Spark DataFrame `df`, we can use the following approach:

1. Calculate the mean using the `mean` function from the `pyspark.sql.functions` module.
2. Calculate the median by using the `approxQuantile` method, which is suitable for large datasets in Spark.

Let's proceed with these calculations.

Thought: I will calculate the mean and median of the `tip_amount` column using the appropriate Spark functions.
Action: python_repl_ast
Action Input:
```python
from pyspark.sql.functions import mean

# Calculate mean of tip_amount
mean_tip = df.select(mean("tip_amount")).collect()[0][0]

# Calculate median of tip_amount using approxQuantile
median_tip = df.approxQuantile("tip_amount", [0.5], 0.01)[0]

mean_tip, median_tip
```
Observation: (0.0, 0.0)
Thought: I now know the final answer. The mean and median of the `tip_amount` column in the

'The mean and median of the `tip_amount` are both 0.0.'
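
The third argument to `approxQuantile` is a relative error, so the median computed by the agent is an approximation rather than an exact value. For a sample small enough to collect to the driver, an exact mean and median can be cross-checked with Python's `statistics` module. The sketch below uses made-up tip values, not the taxi data, and the helper name `summarize` is hypothetical.

```python
from statistics import mean, median


def summarize(values):
    """Return the exact (mean, median) of a small in-memory list of numbers."""
    return mean(values), median(values)


# Hypothetical tips: mostly zero with a few positive values
tips = [0.0, 0.0, 0.0, 1.5, 2.0]
tip_mean, tip_median = summarize(tips)
print(tip_mean, tip_median)
```

In this made-up sample the median is zero while the mean is positive, which is the pattern to expect whenever more than half of the recorded tips are zero.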

In [23]:
agent.run("Create a linear model to predict the total paid amounts of Taxi rides based on duration. What will be the predicted total paid amount if the duration is 10 minutes?")

> Entering new AgentExecutor chain...
To create a linear model to predict the total paid amounts of taxi rides based on duration, we need to follow these steps:

1. Prepare the data by selecting the relevant columns: `trip_time` (duration) and `total_amount`.
2. Split the data into training and test sets.
3. Train a linear regression model using the training data.
4. Use the model to predict the total paid amount for a duration of 10 minutes.

Let's proceed with these steps.

Thought: I will first select the relevant columns and then split the data into training and test sets.
Action: python_repl_ast
Action Input: 
```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.getOrCreate()

# Select relevant columns
data = df.select('trip_time', 'total_amount')

# Assemble features
assembler = VectorAssembler(inputCols=['trip_t

'The predicted total paid amount for a taxi ride with a duration of 10 minutes is $4.50.'
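
For a single feature, the linear model described in the agent's steps reduces to an ordinary least-squares line. The sketch below fits such a line in plain Python rather than with `pyspark.ml`, using made-up training points, so it does not reproduce the agent's model or its $4.50 prediction. It also assumes that `trip_time` is recorded in seconds, as the sample row suggests (a two-minute trip has `trip_time` 120), so a 10-minute ride is queried as 600. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: durations in seconds, totals in dollars
durations = [120, 300, 600, 900]
totals = [4.5, 6.75, 10.5, 14.25]

slope, intercept = fit_line(durations, totals)
# 10 minutes expressed in the assumed unit of trip_time (seconds)
predicted = intercept + slope * 600
print(round(predicted, 2))
```

If the model were queried with 10 instead of 600, the prediction would describe a 10-second ride under this assumption, so the unit of `trip_time` is worth confirming before interpreting the result.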

In [24]:
agent.run("Create a linear model to predict the total paid amounts of Taxi rides based on duration, trip_distance, fare_amount, and tolls_amount.")

> Entering new AgentExecutor chain...
To create a linear model to predict the total paid amounts of Taxi rides based on the specified features, we need to follow these steps:

1. Select the relevant features and the target variable from the DataFrame.
2. Assemble these features into a feature vector.
3. Split the data into training and test sets.
4. Train a linear regression model using the training data.
5. Evaluate the model using the test data.

Let's proceed with these steps.

Thought: I will first import the necessary libraries and prepare the data by selecting the relevant features and assembling them into a feature vector.
Action: python_repl_ast
Action Input: 
```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.getOrCreate()

# Select relevant features and the target variable
feature_columns = ['trip_time', 't

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\lib\py4j-0.10.9.5-src.zip\py4j\java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\lib\py4j-0.10.9.5-src.zip\py4j\clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "c:\ProgramData\Anaconda3\envs\spark\lib\socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: 