In [0]:
# Databricks Notebook Source


# Setup

In [0]:
%pip install pyspark-ai

**Requirements**

..

In [0]:
import os

# Set your OpenAI API key here before running the cells below (left blank on purpose)
os.environ['OPENAI_API_KEY'] = ''
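With the key left blank as above, the LLM calls later in the notebook will most likely fail to authenticate. As a minimal sketch, a small helper can fail fast with a clear message instead. The name `require_env` is hypothetical and not part of pyspark-ai or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running this notebook")
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after the setup cell would surface a missing key immediately rather than at the first LLM request.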

In [0]:
# Initialization

from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI

# 'gpt-4' is preferred; 'gpt-3.5-turbo' is used here as a fallback (might lower output quality)
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)

# Initialize SparkAI with the ChatOpenAI model
spark_ai = SparkAI(llm=llm, verbose=True)

# Activate the AI helper methods (exposed as `df.ai`) on Spark DataFrames
spark_ai.activate()

# Capabilities

**Data Ingestion**

Given a description, the SDK runs a web search, uses the LLM to pick the most relevant result, and loads that page's data into a Spark DataFrame, all in a single call. A URL can be passed in place of a description.

In [0]:
auto_df = spark_ai.create_df("2022 USA national auto sales by brand")

# Alternative: pass the source URL directly
# auto_df = spark_ai.create_df("https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand")

In [0]:
auto_df = spark_ai.create_df("https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand")

**DataFrame Operations**

Given a DataFrame, the SDK can transform, plot, and explain it from a plain-English description, so common operations take a single call instead of hand-written Spark code.

In [0]:
auto_df.ai.plot()

In [0]:
# Plot with instructions
auto_df.ai.plot("pie chart for US sales market shares, show the top 5 brands and the sum of others")

In [0]:
# DataFrame transformation
auto_top_growth_df = auto_df.ai.transform("brand with the highest growth")
auto_top_growth_df.show()

In [0]:
auto_top_growth_df.ai.verify("expect sales change percentage to be between -100 and 100")
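For comparison, the expectation passed to `verify` can also be written by hand. The following is a minimal sketch in plain Python, not the code the SDK generates. It assumes the rows are available as dictionaries (for example via `row.asDict()` on collected rows) and that the column is named `sales_change_percentage`, as in the SQL cell later in this notebook. The sample rows are made up for illustration.

```python
def change_pct_in_range(rows, column="sales_change_percentage", low=-100.0, high=100.0):
    """Return True when every non-null value of `column` lies within [low, high]."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return all(low <= v <= high for v in values)


# Made-up rows for illustration only; not taken from the ingested table.
sample_rows = [
    {"brand": "BrandA", "sales_change_percentage": 12.5},
    {"brand": "BrandB", "sales_change_percentage": -40.0},
]
print(change_pct_in_range(sample_rows))
```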

**User-Defined Functions (UDFs)**

The SDK streamlines UDF creation: decorate a function stub that has a typed signature and a docstring, and the LLM generates the function body.

In [0]:
@spark_ai.udf
def previous_years_sales(brand: str, current_year_sale: int, sales_change_percentage: float) -> int:
    """Calculate the previous year's sales from the sales change percentage"""
    ...
    
spark.udf.register("previous_years_sales", previous_years_sales)
auto_df.createOrReplaceTempView("autoDF")

spark.sql("select brand as brand, previous_years_sales(brand, us_sales, sales_change_percentage) as 2021_sales from autoDF").show()
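The body that the LLM generates for the UDF is not shown in this notebook. As a rough sketch of the arithmetic the docstring asks for, assuming the percentage is expressed in percent (so `10.0` means +10%), the previous year's value follows from `current = previous * (1 + pct / 100)`. The name `previous_years_sales_manual` is hypothetical.

```python
def previous_years_sales_manual(current_year_sale: int, sales_change_percentage: float) -> int:
    """Back out the prior year's sales from this year's sales and the percent change."""
    # current = previous * (1 + pct / 100)  =>  previous = current / (1 + pct / 100)
    growth_factor = 1 + sales_change_percentage / 100
    if growth_factor <= 0:
        # A change of -100% or lower leaves no recoverable previous-year value.
        raise ValueError("sales_change_percentage must be greater than -100")
    return round(current_year_sale / growth_factor)


print(previous_years_sales_manual(110, 10.0))  # 100
```

Comparing a few rows of the UDF output against a hand calculation like this is a cheap way to sanity-check generated code.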

**Caching**

The SDK caches LLM results to speed up execution, make results reproducible, and reduce cost.

In [0]:
spark_ai.commit()
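To illustrate why an explicit commit step is useful, here is a conceptual sketch of a two-stage cache: new entries sit in a staging area and only become durable once committed. This shows one common design and is not pyspark-ai's actual implementation. The class `TwoStageCache`, its methods, and the JSON file layout are all hypothetical.

```python
import json
import os
import tempfile


class TwoStageCache:
    """Hypothetical illustration: staged entries become durable only after commit()."""

    def __init__(self, path):
        self.path = path
        self.staging = {}      # entries from the current session, not yet saved
        self.persistent = {}   # entries loaded from, and saved to, disk
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.persistent = json.load(f)

    def lookup(self, prompt):
        # Check staged entries first, then fall back to committed ones.
        if prompt in self.staging:
            return self.staging[prompt]
        return self.persistent.get(prompt)

    def update(self, prompt, response):
        self.staging[prompt] = response

    def commit(self):
        # Merge staged entries into the persistent store and write it to disk.
        self.persistent.update(self.staging)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.persistent, f)
        self.staging.clear()


cache_file = os.path.join(tempfile.mkdtemp(), "llm_cache.json")
cache = TwoStageCache(cache_file)
cache.update("brand with the highest growth", "SELECT ...")
cache.commit()

# A fresh instance sees the committed entry without another LLM call.
print(TwoStageCache(cache_file).lookup("brand with the highest growth"))
```

Keeping uncommitted results separate means a bad LLM response can be discarded before it is replayed in later runs.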