# Spark LLM Assistant

## Initialization

In [1]:
import sys
import os
from pathlib import Path

sys.path.append(str(Path.cwd().parent))

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

In [2]:
from langchain.chat_models import ChatOpenAI
from spark_llm import SparkLLMAssistant

llm = ChatOpenAI(model_name='gpt-4')  # GPT-4 tends to give better results
assistant = SparkLLMAssistant(llm=llm, verbose=True)
assistant.activate()  # activate the partial functions for Spark DataFrames

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/16 22:21:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/16 22:21:11 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Example 1: Auto sales by brand in US 2022

In [3]:
# Search and ingest web content into a DataFrame
auto_df = assistant.create_df("2022 USA national auto sales by brand")
auto_df.show()

Parsing URL: https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand

SQL query for the ingestion:
 CREATE OR REPLACE TEMP VIEW auto_sales_2022 AS SELECT * FROM VALUES
('Toyota', 1849751, -9),
('Ford', 1767439, -2),
('Chevrolet', 1502389, 6),
('Honda', 881201, -33),
('Hyundai', 724265, -2),
('Kia', 693549, -1),
('Jeep', 684612, -12),
('Nissan', 682731, -25),
('Subaru', 556581, -5),
('Ram Trucks', 545194, -16),
('GMC', 517649, 7),
('Mercedes-Benz', 350949, 7),
('BMW', 332388, -1),
('Volkswagen', 301069, -20),
('Mazda', 294908, -11),
('Lexus', 258704, -15),
('Dodge', 190793, -12),
('Audi', 186875, -5),
('Cadillac', 134726, 14),
('Chrysler', 112713, -2),
('Buick', 103519, -42),
('Acura', 102306, -35),
('Volvo', 102038, -16),
('Mitsubishi', 102037, -16),
('Lincoln', 83486, -4),
('Porsche', 70065, 0),
('Genesis', 56410, 14),
('INFINITI', 46619, -20),
('MINI', 29504, -1),
('Alfa Romeo', 12845, -30),
('Maserati', 6413, -10),
('Bentley', 3975, 0),
('Lamborghini', 3134, 3),
('Fi

In [4]:
auto_df.llm_plot()

To visualize the query result stored in the 'df' dataframe using Plotly, you can follow these steps:

1. Import the required libraries
```python
import pandas as pd
import plotly.express as px
```

2. Convert the Spark DataFrame 'df' to a Pandas DataFrame
```python
pdf = df.toPandas()
```

3. Visualize the data. The type of visualization depends on the data and your requirements. Assuming you want to create a bar chart with 'brand' on the x-axis and 'us_sales' on the y-axis, you can use the following code:
```python
fig = px.bar(pdf, x='brand', y='us_sales', text='sales_change', title='Auto Sales 2022')
fig.show()
```

This code will create a bar chart using the 'brand' column as the x-axis and the 'us_sales' column as the y-axis. The 'sales_change' values will be displayed as text on the bars. The chart will have a title 'Auto Sales 2022'. You can adjust the visualization settings according to your needs.

Make sure to install the Plotly library if you haven't already:
```bash
pip ins

In [5]:
# Apply a transform to a DataFrame
auto_top_growth_df = auto_df.llm_transform("top brand with the highest growth")
auto_top_growth_df.show()

SQL query for the transform:
SELECT brand, sales_change
FROM temp_view_for_transform
ORDER BY sales_change DESC
LIMIT 1
+--------+------------+
|   brand|sales_change|
+--------+------------+
|Cadillac|          14|
+--------+------------+



In [6]:
# Explain what a DataFrame is retrieving.
auto_top_growth_df.llm_explain()

'In summary, this dataframe is retrieving the brand with the highest sales change from the "auto_sales_2022" dataset. It presents the results sorted by sales change in descending order and limits the output to just the top brand.'
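
Note that `ORDER BY sales_change DESC LIMIT 1` returns a single row even when several brands share the highest value. In the ingested rows shown earlier, Cadillac and Genesis both report a change of 14, so which of the two the query returns is not guaranteed. The sketch below reproduces this on a five-row subset and keeps ties. It uses Python's built-in `sqlite3` rather than Spark so it can run anywhere, and the table layout (`brand`, `us_sales`, `sales_change`) is an assumption taken from the column names in the plotting output.

```python
import sqlite3

# Subset of the ingested rows shown above: (brand, us_sales, sales_change).
rows = [
    ("Toyota", 1849751, -9),
    ("Chevrolet", 1502389, 6),
    ("GMC", 517649, 7),
    ("Cadillac", 134726, 14),
    ("Genesis", 56410, 14),
]

# In-memory SQLite stands in for the Spark temp view (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE auto_sales_2022 (brand TEXT, us_sales INTEGER, sales_change INTEGER)"
)
conn.executemany("INSERT INTO auto_sales_2022 VALUES (?, ?, ?)", rows)

# Keep every brand that matches the maximum instead of cutting off at one row.
top_brands = conn.execute(
    """
    SELECT brand, sales_change
    FROM auto_sales_2022
    WHERE sales_change = (SELECT MAX(sales_change) FROM auto_sales_2022)
    ORDER BY brand
    """
).fetchall()
conn.close()

print(top_brands)  # [('Cadillac', 14), ('Genesis', 14)]
```

If ties matter, a more specific prompt to `llm_transform` (for example, asking for all brands that share the highest growth) may steer the generated SQL in this direction, though the result depends on the model.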

## Example 2: USA Presidents

In [7]:
# You can also specify the expected columns for the ingestion.
df = assistant.create_df("USA presidents", ["president", "vice_president"])
df.show()

Parsing URL: https://www.loc.gov/rr/print/list/057_chron.html

SQL query for the ingestion:
 CREATE OR REPLACE TEMP VIEW usa_presidents AS SELECT * FROM VALUES
('George Washington', 'John Adams'),
('John Adams', 'Thomas Jefferson'),
('Thomas Jefferson', 'Aaron Burr'),
('Thomas Jefferson', 'George Clinton'),
('James Madison', 'George Clinton'),
('James Madison', 'Elbridge Gerry'),
('James Monroe', 'Daniel D. Tompkins'),
('John Quincy Adams', 'John C. Calhoun'),
('Andrew Jackson', 'John C. Calhoun'),
('Andrew Jackson', 'Martin Van Buren'),
('Martin Van Buren', 'Richard M. Johnson'),
('William Henry Harrison', 'John Tyler'),
('John Tyler', 'office vacant'),
('James K. Polk', 'George M. Dallas'),
('Zachary Taylor', 'Millard Fillmore'),
('Millard Fillmore', 'office vacant'),
('Franklin Pierce', 'William R. King'),
('Franklin Pierce', 'office vacant'),
('James Buchanan', 'John C. Breckinridge'),
('Abraham Lincoln', 'Hannibal Hamlin'),
('Abraham Lincoln', 'Andrew Johnson'),
('Andrew Johnson',

In [8]:
presidents_who_were_vp = df.llm_transform("presidents who were also vice presidents")
presidents_who_were_vp.show()

SQL query for the transform:
SELECT DISTINCT president FROM temp_view_for_transform
WHERE president IN (SELECT vice_president FROM temp_view_for_transform)
+------------------+
|         president|
+------------------+
|        John Adams|
|  Thomas Jefferson|
|  Martin Van Buren|
|  Millard Fillmore|
|        John Tyler|
|    Andrew Johnson|
| Chester A. Arthur|
|Theodore Roosevelt|
|   Calvin Coolidge|
|   Harry S. Truman|
|    Gerald R. Ford|
| Lyndon B. Johnson|
|  Richard M. Nixon|
|       George Bush|
|   Joseph R. Biden|
+------------------+



In [9]:
presidents_who_were_vp.llm_explain()

'In summary, this dataframe is retrieving a list of distinct presidents who have also served as vice presidents in the past.'
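
The generated query is a name match: it keeps every `president` value that also appears somewhere in the `vice_president` column. It does not check that the vice presidency came first, so the "in the past" wording in the explanation is an interpretation rather than something the SQL enforces. As a rough illustration, the same logic can be written as a set intersection in plain Python over the complete rows visible in the ingestion output above (the truncated final row is left out).

```python
# (president, vice_president) pairs copied from the ingestion output above.
rows = [
    ("George Washington", "John Adams"),
    ("John Adams", "Thomas Jefferson"),
    ("Thomas Jefferson", "Aaron Burr"),
    ("Thomas Jefferson", "George Clinton"),
    ("James Madison", "George Clinton"),
    ("James Madison", "Elbridge Gerry"),
    ("James Monroe", "Daniel D. Tompkins"),
    ("John Quincy Adams", "John C. Calhoun"),
    ("Andrew Jackson", "John C. Calhoun"),
    ("Andrew Jackson", "Martin Van Buren"),
    ("Martin Van Buren", "Richard M. Johnson"),
    ("William Henry Harrison", "John Tyler"),
    ("John Tyler", "office vacant"),
    ("James K. Polk", "George M. Dallas"),
    ("Zachary Taylor", "Millard Fillmore"),
    ("Millard Fillmore", "office vacant"),
    ("Franklin Pierce", "William R. King"),
    ("Franklin Pierce", "office vacant"),
    ("James Buchanan", "John C. Breckinridge"),
    ("Abraham Lincoln", "Hannibal Hamlin"),
    ("Abraham Lincoln", "Andrew Johnson"),
]

# Equivalent of: SELECT DISTINCT president ... WHERE president IN (SELECT vice_president ...)
presidents = {president for president, _ in rows}
vice_presidents = {vice_president for _, vice_president in rows}
presidents_who_were_vp = sorted(presidents & vice_presidents)

print(presidents_who_were_vp)
# ['John Adams', 'John Tyler', 'Martin Van Buren', 'Millard Fillmore', 'Thomas Jefferson']
```

Because the match is on exact strings, it relies on the ingested names being spelled the same way in both columns.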

## Example 3: Top 10 tech companies

In [None]:
# Search and ingest web content into a DataFrame
company_df = assistant.create_df("Top 10 tech companies by market cap", ['company', 'cap', 'country'])
company_df.show()

Parsing URL: https://www.statista.com/statistics/1350976/leading-tech-companies-worldwide-by-market-cap/



In [None]:
us_company_df = company_df.llm_transform("companies in USA")
us_company_df.show()

In [None]:
us_company_df.llm_explain()

In [None]:
us_company_df.llm_plot()

## Example 4: Ingestion from a URL

Instead of searching for a web page, you can ask the assistant to ingest content directly from a URL.

In [None]:
assistant.create_df('https://time.com/6235186/best-albums-2022/').show()
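
Since `create_df` accepts both a search description and a URL, calling code may want to check which kind of string it is about to pass. How the assistant itself tells the two apart is not shown in this notebook; the helper below, `looks_like_url`, is a hypothetical name and only sketches one possible check using the standard library.

```python
from urllib.parse import urlparse


def looks_like_url(text: str) -> bool:
    """Return True if text parses as an absolute http(s) URL.

    Hypothetical helper for illustration; not part of spark_llm.
    """
    parsed = urlparse(text.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(looks_like_url("https://time.com/6235186/best-albums-2022/"))  # True
print(looks_like_url("2022 USA national auto sales by brand"))       # False
```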

## Example 5: UDF Generation

You can also ask the assistant to generate code for a Spark UDF, given a description.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("TestUDF")
    .getOrCreate()
)


In [None]:
# First, a simple example.

# The assistant generates a UDF whose body implements the docstring description.
@assistant.udf
def my_awesome_udf(x: int) -> int:
    """Output 42"""
    ...

# Now we can register my_awesome_udf in PySpark under the SQL name "udf".
spark.udf.register("udf", my_awesome_udf)

# Let's test it.
spark.sql("select udf(0)").show()

In [None]:
@assistant.udf
def capitalize_full_names(first_name: str, last_name: str) -> str:
    """Convert first and last name to uppercase and append with space between"""
    ...
    
spark.udf.register("capitalize_full_names", capitalize_full_names)


schema = StructType([
    StructField("first_name", StringType(), nullable=False),
    StructField("last_name", StringType(), nullable=False)
])
names = [("amanda", "liu"), ("allison", "wang"), ("gengliang", "wang")]
df = spark.createDataFrame(names, schema)

df.show()
df.createOrReplaceTempView("namesDF")
spark.sql("select capitalize_full_names(first_name, last_name) as uppercase_full_names from namesDF").show()
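
For comparison, a hand-written function matching the docstring above might look like the sketch below. This is an assumption about the intended behavior, not the code the assistant generates, which depends on the model. It runs as plain Python without Spark, on the same three names.

```python
def capitalize_full_names(first_name: str, last_name: str) -> str:
    """Convert first and last name to uppercase and append with space between"""
    # Hand-written stand-in for the LLM-generated body (illustrative only).
    return f"{first_name.upper()} {last_name.upper()}"


names = [("amanda", "liu"), ("allison", "wang"), ("gengliang", "wang")]
uppercase_full_names = [capitalize_full_names(first, last) for first, last in names]

print(uppercase_full_names)
# ['AMANDA LIU', 'ALLISON WANG', 'GENGLIANG WANG']
```

Comparing the Spark output against a reference like this is a simple way to sanity-check a generated UDF before using it on real data.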