# Sentiment analysis of (fake) restaurant reviews using Synapse and ChatGPT

This notebook is based on an article and video by Thomas Costers and Stijn Wynants:

- https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/using-openai-gpt-in-synapse-analytics/ba-p/3751815
- https://www.youtube.com/watch?v=CY4dAWvh60M

One of SynapseML's capabilities is providing simple APIs for pre-built intelligent services, such as Azure Cognitive Services. Azure OpenAI is part of the Cognitive Services stack, which makes it accessible from within Synapse Spark pools. To use Azure OpenAI in Synapse Spark, we need three components. Setting them up is out of scope for this notebook.

- A Synapse Analytics workspace with a Spark pool
- An Azure OpenAI cognitive service with the `text-davinci-003` model deployed
- An Azure Key Vault to store the OpenAI API key

Use the [Azure OpenAI Studio playground](https://oai.azure.com/portal/playground) to test the following prompt:

```text
Generate a json containing a restaurant review. Use the following json structure:
{
    "restaurant": "",
    "review": ""
}
```

The following configuration applies to a Spark 3.2 pool. SynapseML can be conveniently installed on Synapse with this piece of configuration:

```
%%configure -f
{
  "name": "synapseml",
  "conf": {
      "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.0,org.apache.spark:spark-avro_2.12:3.3.1",
      "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
      "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
      "spark.yarn.user.classpath.first": "true",
      "spark.sql.parquet.enableVectorizedReader": "false",
      "spark.sql.legacy.replaceDatabricksSparkAvro.enabled": "true"
  }
}
```


## Generate fake reviews, using ChatGPT

Now we will generate 5 reviews. We first create a set of prompts, one for each review we want to generate.

```python
from synapse.ml.core.platform import find_secret
from pyspark.sql.types import *
from pyspark.sql.functions import *
from synapse.ml.cognitive import OpenAICompletion

# Replace the secret name and Key Vault name with your own
key = find_secret("openaikey", "keyvault-weslbo")
nrOfReviews = 5

# Reusable transformer: reads the "prompt" column, writes "response" and "error"
completion = (
    OpenAICompletion()
    .setSubscriptionKey(key)
    .setDeploymentName("text-davinci-003")
    .setUrl("https://openai-wedebols-3.openai.azure.com/")
    .setMaxTokens(2048)
    .setPromptCol("prompt")
    .setErrorCol("error")
    .setOutputCol("response")
)

def generateRestaurantPrompt() -> str:
    return "Generate a json containing a restaurant review. Use the following json structure: {\"restaurant\": \"\",\"review\": \"\"}"

generateRestaurantPrompt_udf = udf(lambda: generateRestaurantPrompt(), StringType())

# One row per review (ids 1..nrOfReviews), each carrying the same prompt.
# The original withColumnRenamed("restaurant", "review") call is dropped:
# spark.range only produces an "id" column, so the rename did nothing.
df_prompts = spark.range(1, nrOfReviews + 1) \
    .withColumn("prompt", generateRestaurantPrompt_udf())

display(df_prompts)
```


Then we call the OpenAI service to get the actual (fake) reviews.

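```python
# Cache the transformed rows so that later actions can reuse them
# rather than repeat the service calls
df_reviews_json = completion.transform(df_prompts).cache() \
    .select(
        col("id"),
        col("prompt"),
        col("error"),
        # Text of the first completion choice
        col("response.choices.text").getItem(0).alias("json")
    )

display(df_reviews_json)
```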


Since we get JSON back, we have to apply a schema to it and extract the actual restaurant name and review text.

```python
schema = StructType([
    StructField("restaurant", StringType(), False),
    StructField("review", StringType(), False)
])

df_reviews_table = df_reviews_json.withColumn("json", from_json(col("json"), schema)) \
    .select(col("id"), col("json.*"), col("error"))

display(df_reviews_table)
```
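
A completion is free text, so it is not guaranteed to be exactly the JSON object we asked for; it may carry leading blank lines or extra wording. The following is a minimal sketch, in plain Python and outside Spark, of how one such response could be checked before trusting it. `parse_review` and the sample strings are hypothetical and not part of the original notebook.

```python
import json
from typing import Optional, Tuple


def parse_review(raw: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (restaurant, review), or None if raw is not the expected JSON object."""
    if raw is None:
        return None
    # Keep only the outermost {...} span, ignoring any text around it.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    restaurant, review = obj.get("restaurant"), obj.get("review")
    # Both fields must be present and must be strings.
    if not isinstance(restaurant, str) or not isinstance(review, str):
        return None
    return restaurant, review


# Hypothetical sample responses, made up for illustration.
sample = '\n\n{"restaurant": "Chez Test", "review": "Great crepes."}'
print(parse_review(sample))
print(parse_review("Sorry, I cannot do that."))
```

In the notebook itself, `from_json` plays this role: rows whose text cannot be parsed end up with null `restaurant` and `review` values, so it is worth checking for nulls before building the sentiment prompts.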



## Detect sentiment

Now that we have our restaurants and reviews, it's time to detect the sentiment. Again, we can use the OpenAI playground to test our prompt:

```text
Classify the sentiment of following restaurant review.
Classifications: [Positive, Negative, Neutral]
Review: """The food here is so delicious. The crepes are made to perfection and the servers are so friendly and helpful. I highly recommend it!"""
Classification:
``` 

Once we are happy with the prompt, it's time to generate it for every row in our dataset.

```python
def generateSentimentPrompt(s: str) -> str:
    return "Classify the sentiment of following restaurant review.\nClassifications: [Positive, Negative, Neutral]\nReview: " + s + "\nClassification:"

generateSentimentPrompt_udf = udf(lambda s: generateSentimentPrompt(s), StringType())

df_sentiment_prompt = df_reviews_table.withColumn("prompt", generateSentimentPrompt_udf(col("review")))
display(df_sentiment_prompt)
```
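
The playground prompt wraps the review in triple quotes, while `generateSentimentPrompt` concatenates it bare. The following is a minimal sketch, in plain Python, of a builder that restores the delimiters and swaps double quotes inside the review for single quotes so the review cannot close the delimiter early. This may be one way to approach the quote handling mentioned in the todo list; `build_sentiment_prompt` is a hypothetical name, not part of the original notebook.

```python
def build_sentiment_prompt(review: str) -> str:
    """Build the classification prompt with the review inside triple quotes."""
    # Collapse newlines and repeated spaces so the review stays on one line.
    text = " ".join(review.split())
    # Swap double quotes for single quotes so the review cannot contain
    # the closing triple-quote delimiter.
    text = text.replace('"', "'")
    return (
        "Classify the sentiment of following restaurant review.\n"
        "Classifications: [Positive, Negative, Neutral]\n"
        f'Review: """{text}"""\n'
        "Classification:"
    )


print(build_sentiment_prompt('The chef said "wow"\nand left.'))
```

A function like this could be wrapped in a UDF in the same way as `generateSentimentPrompt`.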


Now we apply the transformation. This is where the actual call to the OpenAI API happens.

```python
df_sentiment_json = completion.transform(df_sentiment_prompt).cache() \
    .select(
        col("id"),
        col("restaurant"),
        col("review"),
        col("response.choices.text").getItem(0).alias("sentiment")
    )

display(df_sentiment_json)
```



Review the output above!
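
When reviewing the output, keep in mind that the `sentiment` column holds raw completion text, which may include surrounding whitespace or a trailing period. The following is a minimal sketch, in plain Python, of mapping such text onto the three labels from the prompt and counting them. `normalize_sentiment`, `tally`, and the sample values are hypothetical and not part of the original notebook.

```python
from collections import Counter
from typing import Iterable, Optional

# The three labels offered in the classification prompt.
LABELS = ("Positive", "Negative", "Neutral")


def normalize_sentiment(raw: Optional[str]) -> Optional[str]:
    """Map raw completion text to one of LABELS, or None if it matches none."""
    if raw is None:
        return None
    cleaned = raw.strip().strip(".").lower()
    for label in LABELS:
        if cleaned == label.lower():
            return label
    return None


def tally(raw_labels: Iterable[Optional[str]]) -> Counter:
    """Count normalized labels; anything unmatched is counted as 'Unrecognized'."""
    return Counter(normalize_sentiment(r) or "Unrecognized" for r in raw_labels)


# Hypothetical raw values, made up for illustration.
print(tally([" Positive", "Neutral\n", "??", None]))
```

A tally like this makes it easier to see at a glance whether every review really comes back with the same label.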

Todo:
1. Make sure quotes are handled properly (otherwise you get 'undefined').
2. I only get positive sentiment. The OpenAI model deployment probably needs tweaking.
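
One possible contributor to the second item, stated here as an assumption rather than something the notebook confirms, is that the generation prompt never asks for anything but a generic review, so the fake reviews themselves may all be favorable. The following is a minimal sketch, in plain Python, of a prompt builder that cycles through the three sentiments by row id. `build_review_prompt` and `SENTIMENTS` are hypothetical names, not part of the original notebook.

```python
import json

# Sentiments to rotate through when asking for fake reviews.
SENTIMENTS = ("positive", "negative", "neutral")


def build_review_prompt(index: int) -> str:
    """Build a generation prompt whose requested sentiment depends on the row id."""
    sentiment = SENTIMENTS[index % len(SENTIMENTS)]
    # Same JSON structure as in the notebook's generation prompt.
    structure = json.dumps({"restaurant": "", "review": ""})
    return (
        f"Generate a json containing a {sentiment} restaurant review. "
        f"Use the following json structure: {structure}"
    )


# Ids 1..5, matching the range used for df_prompts.
for i in range(1, 6):
    print(build_review_prompt(i))
```

Because the requested sentiment is known per row, it could also serve as a rough expected label to compare against the classifier's answer.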