# Final Summarization Notebook
- Objective: summarize restaurant reviews, both overall and by category.


## Input/Output
- Input: a list of business IDs
- Output: a summary and a set of categories, each with its own summary


## Example
- Input: business_id_list = ['VnpokM7AD0zYXfyDNEDe6g', 'f5S8fr9DruZNwSev1gyFWQ', 'mNw3UU6PPUAeS31VKgM-qw', 'YNgX5_SYHCXSoL9IMdVboA']
- Output:
   - Business: In-N-Out, business_id = 'VnpokM7AD0zYXfyDNEDe6g'
     - Tip: In-N-Out consistently impresses with its simple menu. (quick suggestion, 1 sentence)
     - Summary: In-N-Out Burger stands out for its commitment to quality and simplicity, ... (3 - 4 sentences)
       - Categories
         - Positive
           - Food Quality: food quality is good (2 sentences)
             - High-Quality Ingredients ...
             - Delicious Dishes ...
           - Service ...
             - ...
           - ...
         - Negative
           - Negative attribute 1
             - sub-negative attribute 1-1
             - sub-negative attribute 1-2
             - ...
           - . . .
   - Business: KFC, business_id = 'f5S8fr9DruZNwSev1gyFWQ'
   - ...

## Installations
These installations provide the libraries the restaurant review summarizer depends on. `google-generativeai` is Google's client library for its generative language models, `langchain` supplies the building blocks for chaining prompts, models, and output parsers, and `langchain-google-genai` connects the two so that the Gemini model can be used inside a LangChain pipeline.


In [0]:
pip install -Uq google-generativeai

Python interpreter will be restarted.
Python interpreter will be restarted.


In [0]:
dbutils.library.restartPython()

In [0]:
pip install -q langchain

Python interpreter will be restarted.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scipy 1.7.3 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.26.4 which is incompatible.
Python interpreter will be restarted.


In [0]:
dbutils.library.restartPython()

In [0]:
pip install -q langchain-google-genai

Python interpreter will be restarted.
Python interpreter will be restarted.


In [0]:
dbutils.library.restartPython()

## Importing Libraries
LangChain and Google's Generative AI automate the review summarization process: together they extract key points, analyze sentiment, identify topics, and generate concise summaries.


This gives users faster access to information, supports their decision-making, and enables personalized recommendations.


In [0]:
#Langchain
from langchain_core.runnables import RunnableLambda
from langchain_core.runnables.base import RunnableEach
from langchain_core.runnables import RunnableMap
from langchain_core.runnables import RunnableBranch

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

from langchain_google_genai import ChatGoogleGenerativeAI

#format
import json
import pprint
from pydantic import BaseModel, Field
from pydantic.dataclasses import *
from IPython.display import display as python_display, Markdown

from typing import List

import google.generativeai as genai
import os

## Setting Up LLM Model
The `langchain_google_genai` library gives the notebook access to the Gemini 1.5 Flash language model, which performs the restaurant review summarization. It enables automated, efficient summarization of reviews, helping users quickly grasp key points, understand overall sentiment, and make informed dining decisions.


In [0]:
model = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

# Loading the Restaurant Data

## Specifying Unique Business IDs
By specifying unique business IDs, the code allows the summarizer to focus on specific restaurants on Yelp. This enables focused analysis of individual businesses.


In [0]:
business_ids = ['196CWwMAtAcA21jYiMyRzg', 'DVBJRvnCpkqaYl6nHroaMg', 'WC8vQdCC-nSawCh2IV4epg', 'foh6hwQxjCs0SeLT5MO1SQ', 'VnpokM7AD0zYXfyDNEDe6g']

Common restaurants: McDonald's, Subway, Starbucks

In [0]:
business_ids = ['iXlYo2AI1iHgpeZnv5sPw', 'Xzvm1LxWyO7QZu9Q0UcSw', 'KdeX92-JV2K8GWbAxVj2w']

## Loading the Table for Restaurant Reviews
- This Python function, `load_table`, registers a Parquet-backed table in Spark if it is not already in the catalog.
- It then prints the record count, displays the table schema, and shows sample rows for a quick overview of the data.
- It is designed to streamline data loading and exploration in data analysis pipelines.

In [0]:
# This function loads the table, or rebuilds it from Parquet files if it is missing

def load_table(table_name: str, warehouse_path: str, sample_rows: int = 3) -> None:
    """
    Loads a Parquet table into Spark, prints record count, describes table structure,
    and shows sample rows.
    
    Args:
        table_name: Name of the table to load
        warehouse_path: Base path where the Parquet files are stored
        sample_rows: Number of sample rows to display (default=3)
    """
    # Check if table exists and load if necessary
    if spark.catalog._jcatalog.tableExists(table_name):
        print(f'{table_name} already loaded in memory')
    else:
        create_table_sql = f"""
            CREATE TABLE {table_name}
            USING PARQUET 
            LOCATION '{warehouse_path}/{table_name}'
        """
        spark.sql(create_table_sql)
        print(f'{table_name} rebuilt from parquet files.')
    
    # Print record count
    print("\nRecord Count:")
    count_sql = f"""
        SELECT COUNT(*) AS review_count
        FROM {table_name}
    """
    spark.sql(count_sql).show()
    
    # Show table description
    print("\nTable Description:")
    describe_sql = f"""DESCRIBE TABLE {table_name}"""
    spark.sql(describe_sql).show()
    
    # Show sample records
    print(f"\nSample Records ({sample_rows} rows):")
    sample_sql = f"""
        SELECT *
        FROM {table_name}
    """
    spark.sql(sample_sql).show(sample_rows, truncate=100, vertical=True)

# Example usage:
# load_table(
#     table_name='restaurant_reviews_for_summarization_table',
#     warehouse_path='/user/hive/warehouse'
# )

### Weighing the Value of Reviews and Users
Based on our analysis in the Data Wrangling notebook, every review starts from a standard weight of 1, and we add weight for the following attributes:

- Elite user: because elite users are rare, we add 3.2 to the weight of reviews from elite users (the weight calculation will be included in the presentation).
- User's review count > 25: the majority of users have fewer than 25 reviews (82.42%), and those with more than 25 reviews are uncommon (19.59%), so we add 0.42.
- Review with useful >= 1: reviews voted useful at least once are valuable. Since 23% of reviews are considered useful, we add 0.33.
- User's "Yelping since" year: we grouped user accounts by creation year into 5 groups: 2004-2008, 2008-2012, 2012-2016, 2016-2020, 2020-2022. We do not add weight for accounts created in 2004-2008, since they may belong to inactive users, and accounts created in 2020 or later are less credible because they are new. We therefore add 0.16 and 0.12 for accounts created in 2008-2012 and 2013-2016, respectively.

Overall, the maximum total weight for a review with all of the attributes above is 1 + 3.2 + 0.42 + 0.33 + 0.16 = 5.11.
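
As a worked example of the weighting rules above, the following plain-Python sketch computes the weight of a single review. The function name `review_weight` and its parameters are hypothetical, and the sketch assumes the year ranges 2008-2012 and 2013-2016; it illustrates the arithmetic and is not the pipeline's implementation.

```python
def review_weight(is_elite: bool, user_review_count: int,
                  review_useful: int, yelping_since_year: int) -> float:
    """Return an illustrative weight: a base weight of 1 plus attribute bonuses."""
    weight = 1.0  # standard weight given to every review
    if is_elite:
        weight += 3.2   # review written by an elite user
    if user_review_count > 25:
        weight += 0.42  # reviewer has written more than 25 reviews
    if review_useful >= 1:
        weight += 0.33  # review was voted useful at least once
    if 2008 <= yelping_since_year <= 2012:
        weight += 0.16  # account created in 2008-2012
    elif 2013 <= yelping_since_year <= 2016:
        weight += 0.12  # account created in 2013-2016
    return round(weight, 2)


print(review_weight(True, 1220, 22, 2011))  # all bonuses apply -> 5.11
print(review_weight(False, 6, 0, 2018))     # no bonus applies -> 1.0
```

A review that earns every bonus reaches the maximum of 5.11, while a review with none of the attributes keeps the standard weight of 1.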

In [0]:
# Step 1: Load the Parquet table into a DataFrame
df_restaurant_reviews_for_summarization_table = spark.read.parquet("/user/hive/warehouse/restaurant_reviews_for_summarization_table")


# Step 2: Create a temporary view
df_restaurant_reviews_for_summarization_table.createOrReplaceTempView("restaurant_reviews_for_summarization_table")

# Step 3: Add the weight column with the specified logic
restaurant_reviews_for_summarization_table = spark.sql('''
SELECT *,
       (1.0 +  -- Standard weight of 1 for every review
        CASE 
            WHEN elite IS NOT NULL AND TRIM(elite) <> '' THEN 3.2  -- Weight for elite users (non-empty elite field)
            ELSE 0 
        END +
        CASE 
            WHEN user_review_count > 25 THEN 0.42  -- Weight for users with >25 reviews
            ELSE 0 
        END +
        CASE 
            WHEN review_useful >= 1 THEN 0.33  -- Weight for reviews voted useful at least once
            ELSE 0 
        END +
        CASE 
            WHEN CAST(SPLIT(yelping_since, '-')[0] AS INT) BETWEEN 2008 AND 2012 THEN 0.16  -- Weight for 2008-2012
            WHEN CAST(SPLIT(yelping_since, '-')[0] AS INT) BETWEEN 2013 AND 2016 THEN 0.12  -- Weight for 2013-2016
            ELSE 0
        END) AS weight
FROM restaurant_reviews_for_summarization_table
''')

# Step 4: Verify the result
restaurant_reviews_for_summarization_table.show(truncate=False)


# Step 5: Save the updated DataFrame back to a Parquet file


#### Confirm Weight Column
This table confirms and visualizes the weight column.

The table below contains the restaurant reviews that will be processed by the summarization model. By loading this table, the code prepares the necessary data for the summarization task.

In [0]:
%sql
SELECT *
FROM restaurant_reviews_for_summarization_table

business_id,date,review_stars,text,review_useful,review_count,user_id,name,restaurant_stars,categories,metro_area,user_review_count,yelping_since,user_useful,elite,user_average_stars,tip,tip_compliment_count
--epgcb7xHGuJ-4PUeSLAw,2012-01-25 00:53:54,5.0,"this place continues to impress me. i stop in every tuesday on my way to work. i'm usually running a tight schedule, but these guys consistently get me out the door in literally 5-8 minutes. i get the same thing (egg and cheese, toasted onion bagel) every time and by the time i fix my coffee my food is ready and i'm out the door. i can't stress enough the amazing service here. everyone is super friendly- the guy james is great- he knows my order before i even get up to the register. really A+ establishment.",0,34,WPJA9X2FUM_KgEMUToSpYw,Manhattan Bagel,3.0,"List(Restaurants, Food, Bagels, Sandwiches, Breakfast & Brunch)",Philadelphia,39,2011-11-18 01:47:38,37,,3.88,,
--epgcb7xHGuJ-4PUeSLAw,2013-11-18 19:15:29,1.0,"Worst bagels I've ever had, the place is dirty on the inside, terrible service and the manager is BEYOND rude. Even if the bagels were great I would still never come back because of the manager's attitude.",2,34,uv9f_lt1306GNF1zvkZGTA,Manhattan Bagel,3.0,"List(Restaurants, Food, Bagels, Sandwiches, Breakfast & Brunch)",Philadelphia,20,2013-02-13 18:48:13,46,,2.45,,
--lqIzK-ZVTtgwiQM63XgQ,2021-04-25 11:38:25,1.0,Worst Wendy's I've ever been to. I don't know what kind of bozo they've got running this place but they're doing a piss poor job. I sat at the drive thru window for almost 10 minutes waiting on 2 cups of coffee. My food got cold before I even left the parking lot. Get your shit together Wendy's,0,15,DjrNa5dzMDIZlXFc8D9wzw,Wendy's,2.0,"List(Burgers, Fast Food, Restaurants)",Indianapolis,6,2018-05-11 01:50:09,0,,2.83,,
-09Oc2D14vRnmirPh0vlXw,2019-07-22 21:58:10,3.0,[953] 7/2019 This Cracker Barrel is located right off exit 5 of the NJ turnpike. It's very easy to access with plenty of parking. The staff was nice. We stopped in for breakfast which we enjoyed. The place is clean and there is a nice shop when you enter (like all other locations) We will definitely be back,22,135,6pbNMgTnSYrtPHDN07IXwA,Cracker Barrel Old Country Store,3.0,"List(Diners, Shopping, Caterers, Restaurants, Comfort Food, Salad, Event Planning & Services, Desserts, Southern, Breakfast & Brunch, American (Traditional), Food)",Philadelphia,1220,2011-06-09 20:49:59,9836,201620172018201920202021,3.72,,
-09Oc2D14vRnmirPh0vlXw,2016-01-15 14:56:19,5.0,Really good place to unwind and have a NICE MEAL. Food is excellent and country store is really nice. Can get noisy and crowded but well worth the visit. I go there for my holiday meals as I live alone. Just a perfect experience. Stop in.,0,135,RvC46zzXjsJS0j24nzrgMw,Cracker Barrel Old Country Store,3.0,"List(Diners, Shopping, Caterers, Restaurants, Comfort Food, Salad, Event Planning & Services, Desserts, Southern, Breakfast & Brunch, American (Traditional), Food)",Philadelphia,21,2011-12-29 18:48:34,15,,3.46,,
-09Oc2D14vRnmirPh0vlXw,2017-05-19 18:28:06,1.0,"2 in party ordered daily special of house salad with chicken. Did not receive until after rest of party had finished eating. Drinks were warm with no ice and late arriving. Received two people's potatoes on one plate and had to get 2nd plate. Server was new and placed order into computer incorrectly. Talked with manager who was nice and apologized, but this was not the typical Cracker Barrel dining experience.",0,135,UTM-Z74hSkC0dgZxRT6rbw,Cracker Barrel Old Country Store,3.0,"List(Diners, Shopping, Caterers, Restaurants, Comfort Food, Salad, Event Planning & Services, Desserts, Southern, Breakfast & Brunch, American (Traditional), Food)",Philadelphia,4,2011-04-15 15:39:14,1,,3.0,,
-09Oc2D14vRnmirPh0vlXw,2017-09-16 18:26:59,3.0,Sitting in is nice and the wait is never ridiculously long BUT always check your to go orders throughly because in my experience they're good for leaving items out.... you would think they wouldn't. Ring your order out if everything wasn't in the bag but it happens almost every time,0,135,e_ceD9QSwhvTNqEvLYoi4Q,Cracker Barrel Old Country Store,3.0,"List(Diners, Shopping, Caterers, Restaurants, Comfort Food, Salad, Event Planning & Services, Desserts, Southern, Breakfast & Brunch, American (Traditional), Food)",Philadelphia,26,2014-05-08 07:01:12,27,,2.57,,
-09Oc2D14vRnmirPh0vlXw,2012-11-27 03:24:50,4.0,"This was my first time at a Cracker Barrel, and I am impressed with the prices. For $15 I was able to get a ribeye steak with mashed potatoes, green beans, and mac & cheese - not bad! The lemonade was extremely fresh, and the service was great. My steak was also cooked properly - medium rare. I also like how you can buy old fashioned candy - peppermint sticks (as well as other flavors), peppermint bark, etc. Everything was decorated and ready for Christmas; there was a bit of a nostalgic feeling as you entered the store to get to the restaurant. I would definitely come again!",1,135,h9MZv5I1hvTX1kkzq-3Cvg,Cracker Barrel Old Country Store,3.0,"List(Diners, Shopping, Caterers, Restaurants, Comfort Food, Salad, Event Planning & Services, Desserts, Southern, Breakfast & Brunch, American (Traditional), Food)",Philadelphia,1052,2009-11-20 00:21:16,1659,20102011201220132014201520162017201820192020,2.86,,
-09Oc2D14vRnmirPh0vlXw,2021-07-26 20:29:35,1.0,"No. No. No. The service was a nightmare. Homie (our server, not his real name) was downright rude to a customer who was only asking about a special he saw on the sign when you walked in. When you suspect your server may punch you in the face because you want cream with your coffee, you have a bad server. I get it, ""nobody wants to work right now...blah blah blahhh"" but damn dude...this guy was/is in the wrong line of work.",0,135,lhUe6bjyw9XAPdYRVW0NRg,Cracker Barrel Old Country Store,3.0,"List(Diners, Shopping, Caterers, Restaurants, Comfort Food, Salad, Event Planning & Services, Desserts, Southern, Breakfast & Brunch, American (Traditional), Food)",Philadelphia,211,2008-08-25 22:17:58,470,,2.72,,
-09Oc2D14vRnmirPh0vlXw,2018-07-23 00:41:13,1.0,The only thing that this place has going for it is its proximity to the New Jersey Turnpike. They brought our food out wrong and they basically told us it was right. It wasn't. We show them the picture in the menu which didn't look like what they brought us. The service was absolutely terrible but I guess what do you expect from a Cracker Barrel. Where is Joanne when you need her?,0,135,oFzQMwBXCgRJ0KjI9jNOaw,Cracker Barrel Old Country Store,3.0,"List(Diners, Shopping, Caterers, Restaurants, Comfort Food, Salad, Event Planning & Services, Desserts, Southern, Breakfast & Brunch, American (Traditional), Food)",Philadelphia,256,2012-05-28 17:50:01,205,,3.15,,


# EDA

## Overview of Reviews of Unique Business IDs


This Python code, leveraging PySpark, efficiently extracts and summarizes key information about a restaurant from a dataset. Given a **specific business ID**, it:


1. **Retrieves Data:** Queries a Spark SQL table to fetch relevant details like the restaurant's name, average rating, and individual review ratings.
2. **Calculates Metrics:**
  - **Review Count:** Determines the total number of reviews.
  - **Star Distribution:** Computes the percentage distribution of reviews across different star ratings.
  - **Aggregate Difference:** Calculates the average difference between each review's star rating and the reviewing user's average rating.
3. **Returns Summary:** Organizes the calculated metrics into a concise dictionary, providing a comprehensive overview of the restaurant's performance.


This summary can be used as a valuable tool for understanding a restaurant's reputation and customer sentiment.
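
To make the computed metrics concrete, here is a small plain-Python sketch on made-up data. The helper name `summarize_ratings` and the sample ratings are hypothetical; the sketch only mirrors the description above: the percentage of reviews at each star level, and the average of each review's stars minus that reviewer's average stars.

```python
from collections import Counter


def summarize_ratings(rows):
    """rows: list of (review_stars, user_average_stars) pairs for one business."""
    total = len(rows)
    if total == 0:
        return {"error": "No data found"}

    # Percentage of reviews at each star level, rounded to 0 decimals
    counts = Counter(stars for stars, _ in rows)
    distribution = {
        stars: round(count / total * 100, 0)
        for stars, count in sorted(counts.items())
    }

    # Average of (review stars - reviewer's usual average), rounded to 2 decimals
    aggregate_difference = round(sum(stars - avg for stars, avg in rows) / total, 2)

    return {
        "number_of_reviews": total,
        "review_star_distribution": distribution,
        "aggregate_difference": aggregate_difference,
    }


sample = [(5.0, 4.0), (1.0, 3.0), (5.0, 4.5), (3.0, 3.5)]
print(summarize_ratings(sample))
# {'number_of_reviews': 4, 'review_star_distribution': {1.0: 25.0, 3.0: 25.0, 5.0: 50.0}, 'aggregate_difference': -0.25}
```

A negative aggregate difference means reviewers rate this business lower than they usually rate other businesses.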



In [0]:
from pyspark.sql import SparkSession, functions as F

# Initialize Spark session (if not already initialized)
spark = SparkSession.builder.appName("RestaurantReviews").getOrCreate()

def get_business_summary(business_id: str) -> dict:
    """
    Retrieves a summary of restaurant information for a given business ID.
    
    Args:
        business_id: The business ID to retrieve information for.
    
    Returns:
        A dictionary containing:
        - business_id: The ID of the business.
        - name: The name of the restaurant.
        - restaurant_stars: The average rating of the restaurant.
        - number_of_reviews: The total number of reviews for the restaurant.
        - review_star_distribution: Distribution of review stars (1 to 5 stars) in percentages (rounded to 0 decimals).
        - aggregate_difference: Average of (review_stars - user_average_stars) (rounded to 2 decimals).
    """
    table_name = "restaurant_reviews_for_summarization_table"

    # Filter table for the specified business ID
    business_data = spark.sql(f"""
        SELECT
            business_id,
            name,
            restaurant_stars,
            review_stars,
            user_average_stars
        FROM {table_name}
        WHERE business_id = '{business_id}'
    """)

    # Calculate total number of reviews for percentage calculation
    total_reviews = business_data.count()

    # Calculate review star distribution
    star_distribution = business_data.groupBy("review_stars").count().orderBy("review_stars")
    star_distribution_dict = {
        row["review_stars"]: round((row["count"] / total_reviews) * 100, 0)
        for row in star_distribution.collect()
    }

    # Aggregate other metrics
    summary = business_data.groupBy("business_id", "name", "restaurant_stars") \
        .agg(
            F.count("review_stars").alias("number_of_reviews"),
            F.avg(F.col("review_stars") - F.col("user_average_stars")).alias("aggregate_difference")
        ).collect()

    if summary:
        summary_data = summary[0].asDict()
        return {
            "business_id": summary_data["business_id"],
            "name": summary_data["name"],
            "restaurant_stars": summary_data["restaurant_stars"],
            "number_of_reviews": summary_data["number_of_reviews"],
            "review_star_distribution": star_distribution_dict,
            "aggregate_difference": round(summary_data["aggregate_difference"], 2)
        }
    else:
        return {"error": f"No data found for business_id: {business_id}"}

# Example usage:
# result = get_business_summary("VnpokM7AD0zYXfyDNEDe6g")
# print(result)


### Overview of Cafe Pontalba’s Reviews (Y, HN)
Conclusion: Can_be_summarized (Y), Heavily negative summary (HN)


* 'business_id': '196CWwMAtAcA21jYiMyRzg'
* From the star distribution, the reviews lean negative: 40% are 1 or 2 stars, versus 36% at 4 or 5 stars.
* 562 reviews are enough to be summarized
* On average, people give around 0.8 stars less than what they usually give to other restaurants.

In [0]:
get_business_summary('196CWwMAtAcA21jYiMyRzg')

Out[8]: {'business_id': '196CWwMAtAcA21jYiMyRzg',
 'name': 'Cafe Pontalba',
 'restaurant_stars': 3.0,
 'number_of_reviews': 562,
 'review_star_distribution': {1.0: 23.0,
  2.0: 17.0,
  3.0: 23.0,
  4.0: 21.0,
  5.0: 15.0},
 'aggregate_difference': -0.8}

### Overview of Tumerico’s Reviews (Y, P)
Conclusion: Can_be_summarized (Y), Positive (P)


* 'business_id': 'DVBJRvnCpkqaYl6nHroaMg'
* From the star distribution, the majority of the reviews are positive, with 91% of the reviews being 5 stars.
* 725 reviews are enough to be summarized
* On average, people give around 0.7 stars more than what they usually give to other restaurants.


In [0]:
get_business_summary('DVBJRvnCpkqaYl6nHroaMg')

Out[9]: {'business_id': 'DVBJRvnCpkqaYl6nHroaMg',
 'name': 'Tumerico',
 'restaurant_stars': 5.0,
 'number_of_reviews': 725,
 'review_star_distribution': {1.0: 1.0,
  2.0: 0.0,
  3.0: 3.0,
  4.0: 5.0,
  5.0: 91.0},
 'aggregate_difference': 0.69}

### Overview of Atlantis Steakhouse’s Reviews (Y, P)
Conclusion: Can_be_summarized (Y), Positive (P)


* 'business_id': 'WC8vQdCC-nSawCh2IV4epg'
* From the star distribution, the majority of the reviews are positive, with 68% of the reviews being 5 stars.
* 499 reviews are enough to be summarized
* On average, people give around 0.5 stars more than what they usually give to other restaurants.

In [0]:
get_business_summary('WC8vQdCC-nSawCh2IV4epg')

Out[10]: {'business_id': 'WC8vQdCC-nSawCh2IV4epg',
 'name': 'Atlantis Steakhouse',
 'restaurant_stars': 4.5,
 'number_of_reviews': 499,
 'review_star_distribution': {1.0: 6.0,
  2.0: 5.0,
  3.0: 8.0,
  4.0: 13.0,
  5.0: 68.0},
 'aggregate_difference': 0.46}

### Overview of Wawa’s Reviews (Y, Ne)
Conclusion: Can_be_summarized (Y), Neutral (Ne)


* 'business_id': 'foh6hwQxjCs0SeLT5MO1SQ'
* From the star distribution, there is not a strong sentiment of either negative or positive reviews.
* 59 reviews are enough to be summarized
* On average, people give around 0.3 stars less than what they give other restaurants.

In [0]:
get_business_summary('foh6hwQxjCs0SeLT5MO1SQ')

Out[11]: {'business_id': 'foh6hwQxjCs0SeLT5MO1SQ',
 'name': 'Wawa',
 'restaurant_stars': 3.0,
 'number_of_reviews': 59,
 'review_star_distribution': {1.0: 20.0,
  2.0: 17.0,
  3.0: 14.0,
  4.0: 29.0,
  5.0: 20.0},
 'aggregate_difference': -0.28}

### Overview of Marlton Diner’s Reviews (Y, HN)
Conclusion: Can_be_summarized (Y), Heavily negative summary (HN)


* 'business_id': 'VnpokM7AD0zYXfyDNEDe6g'
* From the star distribution, the majority of the reviews are negative, with 42% of the reviews being 1 star.
* 278 reviews are enough to be summarized
* On average, people give around 1 star less than what they usually give to other restaurants.

In [0]:
get_business_summary('VnpokM7AD0zYXfyDNEDe6g')

Out[12]: {'business_id': 'VnpokM7AD0zYXfyDNEDe6g',
 'name': 'Marlton Diner',
 'restaurant_stars': 2.5,
 'number_of_reviews': 278,
 'review_star_distribution': {1.0: 42.0,
  2.0: 18.0,
  3.0: 13.0,
  4.0: 10.0,
  5.0: 17.0},
 'aggregate_difference': -0.94}

# Generating Summaries with LangChain
This function is crucial for providing comprehensive information to the summarization model, enabling it to generate accurate and informative summaries.


This code provides a way to extract specific text data from a dataset for a specific restaurant. It focuses on retrieving:
1. **Reviews**: The full text of each review for the restaurant.
2. **Tips**: Tips left by users in response to reviews.


By extracting these specific text components, the summarizer can:


- **Improve Summary Accuracy**: Utilize a more focused dataset of relevant text to generate summaries that are more accurate and informative.
- **Enhance Summary Quality**: Incorporate the insights from tips and reviews to provide a more nuanced understanding of the restaurant's strengths and weaknesses.
- **Identify Key Themes**: Analyze the text to identify recurring themes, such as popular dishes, service quality, or ambiance.


Overall, this code enables the summarizer to provide more comprehensive and valuable insights into the restaurant's offerings and customer experiences.
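
The extraction step can be sketched without Spark as follows. This is a minimal illustration on hypothetical in-memory rows: the `split_by_sentiment` helper and the sample data are made up, and it assumes reviews above 3 stars count as positive and all others as negative. Unlike a plain column read, it also skips missing tips, which is an assumption of this sketch.

```python
def split_by_sentiment(rows, threshold=3):
    """Split review rows into positive and negative groups and collect tips."""
    positive, negative, tips = [], [], []
    for row in rows:
        if row["review_stars"] > threshold:
            positive.append(row["text"])
        else:
            negative.append(row["text"])
        if row.get("tip"):  # ignore rows without a tip
            tips.append(row["tip"])

    positive_dict = {"sentiment": "positive", "review_list": positive, "tips_list": tips}
    negative_dict = {"sentiment": "negative", "review_list": negative, "tips_list": tips}
    return positive_dict, negative_dict


sample_rows = [
    {"review_stars": 5.0, "text": "Great sandwiches.", "tip": "Try the tuna wrap"},
    {"review_stars": 1.0, "text": "Slow service.", "tip": None},
    {"review_stars": 3.0, "text": "Average diner food.", "tip": None},
]
positive, negative = split_by_sentiment(sample_rows)
print(len(positive["review_list"]), len(negative["review_list"]), positive["tips_list"])
# 1 2 ['Try the tuna wrap']
```

Keeping the two sentiment groups separate lets each one be summarized on its own before the results are combined.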


In [0]:
def get_reviews_tips_for_business(business_id):
    """
    Retrieves all tips and review texts for a given business ID.
    
    Args:
        business_id: The business ID to retrieve reviews for.
    
    Returns:
        A tuple (positive_dict, negative_dict). Each dictionary holds a "sentiment"
        label, a "review_list" of review texts, and the shared "tips_list".
    """
    table_name = "restaurant_reviews_for_summarization_table"

    #initialize lists
    positive_review_list = []
    negative_review_list = []
    tips_list = []

    # Retrieve positive reviews (more than 3 stars) from the table
    positive_reviews = spark.sql(f"""
        SELECT text
        FROM {table_name}
        WHERE review_stars > 3
        AND business_id = '{business_id}'
    """).collect()
    
    for row in positive_reviews:
        positive_review_list.append(row.text)

    #Retrieves negative reviews from table
    negative_reviews = spark.sql(f"""
        SELECT text
        FROM {table_name}
        WHERE review_stars <= 3
        AND business_id = '{business_id}'
    """).collect()
    
    for row in negative_reviews:
        negative_review_list.append(row.text)

    # Retrieves tip from table
    tips = spark.sql(f"""
        SELECT tip
        FROM {table_name}
        WHERE business_id = '{business_id}'
    """).collect()

    for row in tips:
        tips_list.append(row.tip)

    #Return the dictionary
    positive_dict = {"sentiment": "positive", "review_list": positive_review_list, "tips_list": tips_list}
    negative_dict = {"sentiment": "negative", "review_list": negative_review_list, "tips_list": tips_list}

    # Return the combined dictionaries
    return positive_dict, negative_dict

# Example usage:
# reviews = get_reviews_tips_for_business("VnpokM7AD0zYXfyDNEDe6g")
# print(reviews)
get_reviews_tips_for_business("VnpokM7AD0zYXfyDNEDe6g")

Out[13]: ({'sentiment': 'positive',
  'review_list': ["French toast stuffed with NY cheesecake and strawberries is all you need to know. My fiancé had this and I really wish I could have had time to snap a photo. The chicken and waffles that I had were also good, but not great. The chicken was a little too thin with a little more breaking than I care for but overall still good. The coffee cups are definitely way too small but no other complaints. I'd definitely come back!",
   "Small diner, never really busy which is why I like going there. The waitresses are all really sweet and on point. It's a diner ... very hard for diners to stick out but I have never had a problem and really enjoy their sandwiches. Their French onion soup is good. I also suggest there tuna salad wrap , very good!",
   'hungry for some whipped cheeeeze? how about some of that salty pie? ohhh and dont even get me started on the cornbread (toasted please) and tea with curds (self inflicted... who knew that lemons an

In [0]:
# Compute percentiles (25th, 50th, 75th) using Spark SQL
spark.sql("""
SELECT 
    PERCENTILE(review_count, 0.25) AS p25,
    PERCENTILE(review_count, 0.50) AS p50,
    PERCENTILE(review_count, 0.75) AS p75
FROM restaurant_reviews_for_summarization_table
""").show()

spark.sql('''
SELECT stddev(review_count), min(review_count), median(review_count), max(review_count)
FROM restaurant_reviews_for_summarization_table
''').show()



+-----+-----+-----+
|  p25|  p50|  p75|
+-----+-----+-----+
|106.0|249.0|560.0|
+-----+-----+-----+

+--------------------+-----------------+--------------------+-----------------+
|stddev(review_count)|min(review_count)|median(review_count)|max(review_count)|
+--------------------+-----------------+--------------------+-----------------+
|   948.9361144098363|                5|               249.0|             7568|
+--------------------+-----------------+--------------------+-----------------+



In [0]:
#Jessica function
#Check if business id is a restaurant
def is_restaurant(business_id):
    result = spark.sql(f'''
        SELECT COUNT(*) AS count
        FROM restaurant_reviews_for_summarization_table
        WHERE business_id = '{business_id}'
    ''').collect()

    count = result[0]['count']
    print(f"Is {business_id} a restaurant? Appeared in table: {count} times (note: only business id that appeared in table are restaurant)") #Debugging code
    return count > 0

#check if business id has enough reviews
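def enough_reviews(business_id):
    num_of_reviews = spark.sql(f'''
    SELECT review_count
    FROM restaurant_reviews_for_summarization_table
    WHERE business_id = '{business_id}'
    LIMIT 1
    ''').collect()

    # An unknown business ID returns no rows, so treat it as not having enough reviews
    if not num_of_reviews:
        print(f"Does {business_id} have enough reviews? No rows found in table")
        return False

    review_count = num_of_reviews[0]['review_count']
    print(f"Does {business_id} have enough reviews? review_count: {review_count}")
    return review_count > 5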

#Test function 
enough_reviews("-GqJOzN8AxFnkqSuvzmPtQ")
is_restaurant("YNgX5_SYHCXSoL9IMdVboA")

Does -GqJOzN8AxFnkqSuvzmPtQ have enough reviews? review_count: 5
Is YNgX5_SYHCXSoL9IMdVboA a restaurant? Appeared in table: 489 times (note: only business id that appeared in table are restaurant)
Out[15]: True

## Summarizing the Reviews with JSON Output
In essence, this code provides a structured way to organize and present the insights derived from restaurant reviews, making it easier for users to understand.


- **Improved Decision Making**: Provides actionable insights into customer sentiment, preferences, and pain points, enabling data-driven decisions.
- **Enhanced Customer Understanding**: Identifies common themes and trends in customer feedback, leading to a deeper understanding of customer needs and expectations.
- **Optimized Operations**: Pinpoints areas for improvement to enhance overall customer satisfaction.


**Format Instructions & Template:**
This specific format ensures that the output of the restaurant review summarizer is structured and organized.


By providing clear and direct instructions, the language model is able to produce more accurate, relevant, and insightful summaries. This leads to improved decision-making for businesses and users, as they can quickly identify key themes and trends in customer feedback.
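
To illustrate why the format instructions ask for a fenced JSON snippet, the sketch below extracts and parses such a snippet from a model reply using only the standard library. It is a simplified stand-in and not LangChain's `JsonOutputParser`; the `extract_json` helper and the sample reply are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep the example readable


def extract_json(reply: str) -> dict:
    """Parse the first fenced json block in a reply, or the whole reply if no fence is found."""
    pattern = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(reply)
    payload = match.group(1) if match else reply
    return json.loads(payload)


sample_reply = (
    "Here is the summary:\n"
    + FENCE + "json\n"
    + '{"restaurant": "Example Diner", "overall_summary": "Mixed reviews.", '
    + '"tips": "Go early.", "positive_reviews": 3, "negative_reviews": 2, "categories": []}\n'
    + FENCE
)

summary = extract_json(sample_reply)
print(summary["restaurant"], summary["positive_reviews"])  # Example Diner 3
```

Because the reply follows a fixed schema, downstream code can read fields such as `overall_summary` or `categories` directly instead of searching free text.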


In [0]:
#final skynet prompt and format

format_instructions = """
The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":
```json
{{
  "restaurant": string,
  "overall_summary": string,
  "tips": string,
  "positive_reviews": integer,
  "negative_reviews": integer,
  "categories": [
    {{
      "category": string,
      "themes": [
        {{
          "theme": string,
          "description": string
        }}
      ]
    }}
  ]
}}
```
"""

#Pydantic to format the output appropriately
class ThemeSummary(BaseModel):
    """Represents a specific theme within a category."""
    theme: str = Field(description="A specific theme or insight within the category")
    description: str = Field(description="Detailed description of the theme")

class CategorySummary(BaseModel):
    """Represents a detailed summary of a specific category."""
    category: str = Field(description="Category name (e.g., Food Quality, Price, Ambiance)")
    summary: str = Field(description="Overall summary of the category")
    themes: List[ThemeSummary] = Field(description="Specific themes within the category")

class BusinessReviewSummary(BaseModel):
    """Comprehensive summary of a business's reviews."""
    restaurant: str = Field(description="name for the business")
    overall_summary: str = Field(description="Comprehensive summary of the business reviews")
    tips: str = Field(description="Aggregated tips from user tips")
    review_count: int = Field(description="Total number of review count of this restaurant")
    positive_reviews: int = Field(description="Number of positive reviews for this restaurant")
    negative_reviews: int = Field(description="Number of negative reviews for this restaurant")
    categories: List[CategorySummary] = Field(description="Detailed summaries of different review categories")

#this will be the parser variable to feed to the chain
review_parser = JsonOutputParser(pydantic_object=BusinessReviewSummary)

template = """You are a data analyst summarizing restaurant reviews, as well as the categories (food quality, pricing, ambiance) and the main themes of each category, for a single restaurant. In your restaurant summary, provide the strengths and weaknesses of the restaurant. Keep your summary direct and to the point to enable quick decisions. However, present it in a relatable tone instead of sounding too formal. Summarize the reviews according to each review's weight, prioritizing reviews with higher weight. Stay objective: do not include any of your own opinions in your summary, and base your summary only on the dataset from restaurant_reviews_for_summarization_table. For tips, create a one-line sentence giving quick suggestions for potential customers. For themes, first identify the main themes of the restaurant that are discussed in multiple reviews. Next, summarize each theme as a description using a detailed multiple paragraph format. Next, group common themes into broader categories. Then, as output, provide a bullet point list of the categories, the theme descriptions within each category in a detailed paragraph format, and a count of positive reviews for that theme and a count of negative reviews for that theme. Also include the general consensus: whether positive or negative.

{format_instructions}

% USER INPUT: 
tips: {tips_list}
reviews: {review_list}
sentiment: {sentiment}

YOUR RESPONSE:
"""

prompt_template = PromptTemplate(
    input_variables=["sentiment", "review_list", "tips_list"],
    partial_variables={"format_instructions": format_instructions},
    template=template,
)
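
The way the fixed format instructions and the per-call inputs end up in one prompt can be sketched with plain `str.format`. This is a simplified stand-in and not LangChain's implementation; `render`, the shortened template, and the sample inputs are hypothetical.

```python
# Shortened stand-ins for the notebook's strings.
format_instructions = 'Return JSON like {"restaurant": string}'
template = (
    "Summarize the reviews.\n"
    "{format_instructions}\n"
    "tips: {tips_list}\n"
    "reviews: {review_list}\n"
    "sentiment: {sentiment}\n"
)


def render(template: str, partials: dict, **inputs) -> str:
    """Merge fixed (partial) values with per-call inputs, then fill the template."""
    return template.format(**partials, **inputs)


prompt_text = render(
    template,
    {"format_instructions": format_instructions},
    tips_list=["Try the shake"],
    review_list=["Great burger", "Slow line"],
    sentiment="mostly positive",
)
print(prompt_text)
```

Values are inserted verbatim, so braces inside `format_instructions` reach the model exactly as written; only braces in the template string itself are treated as placeholders.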

## Summarizing the Reviews with HTML Format


`format_as_html` converts the parsed JSON summary into HTML so it can be rendered directly in the notebook.


The page starts with the restaurant name and a highlight section showing the positive/negative split and the theme names per category, followed by the overall summary, the tips, and a detailed section for each category and theme. This gives a structured overview of the restaurant's strengths, weaknesses, and key characteristics.
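
Since the summary text comes from a language model, it may contain characters such as `&` or `<` that have a special meaning in HTML. One optional safeguard, shown here as a sketch and not as part of the notebook's code, is to escape the text with the standard library before embedding it. `render_theme` and the sample theme are hypothetical.

```python
from html import escape


def render_theme(theme: dict) -> str:
    """Render one theme as HTML, escaping the model-generated text first."""
    title = escape(theme["theme"])
    description = escape(theme["description"])
    return (
        f"<h4>Theme: {title}</h4>\n"
        f"<p><strong>Description:</strong> {description}</p>\n"
    )


# Hypothetical theme containing characters that need escaping.
snippet = render_theme(
    {"theme": "Burgers & Fries", "description": "Portions <small> but fresh."}
)
print(snippet)
```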


In [0]:
from IPython.display import display, HTML

def format_as_html(parsed_data):
    """Convert parsed JSON output to an HTML format."""
    # Construct the HTML header with the restaurant name.
    html_output = f"<h1>{parsed_data['restaurant']} Review Summary</h1>\n"

    # Retrieve positive and negative review counts from the data.
    positive_review_count = parsed_data['positive_reviews']
    negative_review_count = parsed_data['negative_reviews']

    # Calculate the total number of reviews.
    total_count = positive_review_count + negative_review_count

    # Calculate and round the percentages. Guard against a zero total so a
    # summary with no counted reviews does not raise ZeroDivisionError.
    if total_count > 0:
        positive_review_percentage = round(positive_review_count / total_count * 100)
        negative_review_percentage = round(negative_review_count / total_count * 100)
    else:
        positive_review_percentage = negative_review_percentage = 0

    # Append a highlights section with the positive/negative split.
    html_output += "<h2>Highlight</h2>\n"
    html_output += f"<p>Positive reviews: {positive_review_percentage}%, Negative reviews: {negative_review_percentage}%</p>\n"

    # List each category with its theme names on one line (no trailing comma).
    html_output += "<p>Categories &amp; Themes: </p>\n"
    for category in parsed_data['categories']:
        theme_names = ", ".join(theme['theme'] for theme in category['themes'])
        html_output += f"<p>{category['category']}: {theme_names}</p>\n"

    html_output += f"<h2>Overall Summary</h2>\n<p>{parsed_data['overall_summary']}</p>\n"
    html_output += f"<h2>Tips</h2>\n<p>{parsed_data['tips']}</p>\n"

    # Add a detailed section for every category and each of its themes.
    html_output += "<h2>Categories and Themes</h2>\n"
    for category in parsed_data['categories']:
        html_output += f"<h3>Category: {category['category']}</h3>\n"
        for theme in category['themes']:
            html_output += f"<h4>Theme: {theme['theme']}</h4>\n"
            html_output += f"<p><strong>Description:</strong> {theme['description']}</p>\n"
    return html_output

# Chain Definition
id_branch = RunnableBranch(
    (is_restaurant, RunnableBranch(
        (enough_reviews, RunnableLambda(lambda x: x)),  # Pass the business ID if enough reviews
        RunnableLambda(lambda x: f"Not enough reviews for restaurant: {x}")  # Notify if not enough reviews
    )),
    RunnableLambda(lambda x: f"This is not a restaurant: {x}"))  # Notify if not a restaurant

# Ensures the input is in the proper structure for the PromptTemplate purpose
def input_mapping(data):
    # Provide a default empty list for tips if not present
    if len(data) == 2:
        sentiment, review_list = data
        tips_list = []  # Default empty list
    else:
        sentiment, review_list, tips_list = data
    
    return {
        "sentiment": sentiment, 
        "review_list": review_list, 
        "tips_list": tips_list
    }

runnable_restaurants = RunnableLambda(get_reviews_tips_for_business)

# Add HTML formatting as a final step
input_runnable = (
    RunnableLambda(input_mapping)
    | prompt_template
    | model
    | review_parser
    | RunnableLambda(format_as_html)  # New step for HTML formatting
)


chain = RunnableEach(bound=id_branch | runnable_restaurants | input_runnable)

# Test the Chain

output = chain.invoke(business_ids)

# Display the formatted summaries using HTML rendering
for idx, business_output in enumerate(output):
    print(f"Business {idx + 1} Summary:")
    display(HTML(business_output))


Failed to multipart ingest runs: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"detail":"Forbidden"}')

Is foh6hwQxjCs0SeLT5MO1SQ a restaurant? Appeared in table: 59 times (note: only business id that appeared in table are restaurant)
Is 196CWwMAtAcA21jYiMyRzg a restaurant? Appeared in table: 562 times (note: only business id that appeared in table are restaurant)
Is WC8vQdCC-nSawCh2IV4epg a restaurant? Appeared in table: 499 times (note: only business id that appeared in table are restaurant)
Is DVBJRvnCpkqaYl6nHroaMg a restaurant? Appeared in table: 725 times (note: only business id that appeared in table are restaurant)
Does 196CWwMAtAcA21jYiMyRzg have enough reviews? review_count: 560
Does foh6hwQxjCs0SeLT5MO1SQ have enough reviews? review_count: 56
Does WC8vQdCC-nSawCh2IV4epg have enough reviews? review_count: 474
Is VnpokM7AD0zYXfyDNEDe6g a restaurant? Appeared in table: 278 times (note: only business id that appeared in table are restaurant)
Does DVBJRvnCpkqaYl6nHroaMg have enough reviews? review_count: 705
Does VnpokM7AD0zYXfyDNEDe6g have enough reviews? review_count: 252

Business 1 Summary:


Business 2 Summary:


Business 3 Summary:


Business 4 Summary:


Business 5 Summary:


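
The routing performed by the nested `RunnableBranch` in the chain definition can be sketched in plain Python. This is an illustration only: `route`, the lookup table, and the `MIN_REVIEWS` threshold are hypothetical and are not taken from the notebook's `is_restaurant` and `enough_reviews` functions.

```python
def route(business_id, is_restaurant, enough_reviews):
    """Mimic the nested branch: pass the ID through or return a message."""
    if not is_restaurant(business_id):
        return f"This is not a restaurant: {business_id}"
    if not enough_reviews(business_id):
        return f"Not enough reviews for restaurant: {business_id}"
    return business_id


# Hypothetical review counts standing in for the notebook's table lookups.
review_counts = {"A": 252, "B": 5}
MIN_REVIEWS = 10  # assumed threshold, not taken from the notebook

results = [
    route(
        business_id,
        lambda x: x in review_counts,
        lambda x: review_counts[x] >= MIN_REVIEWS,
    )
    for business_id in ["A", "B", "C"]
]
print(results)
```

The next step in the chain receives either a business ID or a message string, so it may be worth checking how `get_reviews_tips_for_business` treats the message case.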


# Limitations

### 1. Algorithmic Structuring Limitations
#### Background
In this project, the algorithm was initially designed to prioritize and integrate key elements from the data. However, this initial structure did not fully capture the complexity and nuances of certain attributes, which has reduced the overall effectiveness of the algorithm.

- **Issues**
  - Integration of Attributes: The algorithm does not adequately integrate some critical attributes that are essential for accurate data analysis and result generation.

- **Potential Enhancements**
  - Revising Attribute Integration: Enhance the algorithm to better incorporate and use all relevant attributes, possibly through more thorough analysis.


### 2. Review Arrangement Limitations
#### Background
The current system arranges reviews strictly in descending order, which may not always surface the most useful or relevant information for users.

- **Issues**
  - Static Ordering: The descending order does not account for the varying importance or relevance of reviews, which can bury significant feedback under less pertinent content.

- **Potential Enhancements**
  - Categorization of Reviews: Introduce a categorization system that groups reviews into different levels or tiers based on their importance, relevance, or impact.
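
One possible shape for such a tiering step is sketched below. It is only an illustration of the idea: `tier_reviews`, the `weight` field, the cut-off values, and the sample reviews are all assumptions and do not describe the project's data or code.

```python
def tier_reviews(reviews, high=0.7, low=0.3):
    """Group reviews into tiers using a weight score assumed to lie in [0, 1]."""
    tiers = {"high": [], "medium": [], "low": []}
    for review in reviews:
        weight = review["weight"]
        if weight >= high:
            tiers["high"].append(review)
        elif weight >= low:
            tiers["medium"].append(review)
        else:
            tiers["low"].append(review)
    return tiers


# Hypothetical reviews; the texts and weights are invented.
reviews = [
    {"text": "Best burger in town", "weight": 0.9},
    {"text": "Parking was tight", "weight": 0.2},
    {"text": "Fries were cold twice", "weight": 0.5},
]
tiers = tier_reviews(reviews)
print({name: [r["text"] for r in group] for name, group in tiers.items()})
```

Grouping by tier instead of a single sorted list would let each tier be summarized or sampled separately, so that high-impact feedback is not crowded out.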


### Conclusion
Addressing these limitations can significantly improve the project's functionality and user satisfaction. Implementing these enhancements will require careful planning, additional research, and possibly the integration of more advanced technologies or methodologies.