# Table of Contents
1. [Introduction](#Introduction)
1. [Data Preparation](#Data%20Preparation)
1. [Modeling and Evaluation](#Modeling%20and%20Evaluation)
1. [Deployment](#Deployment)


# Introduction

In this notebook, we will build an NLP model that classifies restaurant reviews by topic and by sentiment. It is intended as a tool that lets restaurant staff analyze recurring praise and criticism of their services.

![restaurant](https://startyourrestaurantbusiness.com/wp-content/uploads/Restaurant-guests-having-fun.jpg)

# Data Preparation

The dataset used for this project is the Yelp Dataset, a subset of Yelp's businesses, reviews, and user data made available for personal, educational, and academic purposes. It is distributed as JSON files and can be downloaded from this [link](https://www.yelp.com/dataset/download).

In [1]:
import pandas as pd
import numpy as np

Let's start by loading the *academic* version of the Yelp review dataset. Due to its large size, we will read it in chunks of 100 reviews and work with the first chunk.

In [8]:
json_reader = pd.read_json('./yelp_academic_dataset_review.json', chunksize=100, lines=True)
df1 = next(json_reader)
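
To make the chunking idea concrete, here is a minimal sketch that uses only the standard library: a JSON Lines source is read lazily, a fixed number of records at a time. The helper name `iter_json_chunks` and the sample records are illustrative assumptions and are not part of the notebook or of pandas.

```python
import io
import json
from itertools import islice


def iter_json_chunks(lines, chunksize):
    """Yield lists of parsed records, `chunksize` records at a time.

    `lines` is any iterable of JSON strings (for example an open file),
    so the whole dataset never has to be held in memory at once.
    """
    # Skip blank lines so they do not count towards a chunk
    non_empty = (line for line in lines if line.strip())
    while True:
        chunk = [json.loads(line) for line in islice(non_empty, chunksize)]
        if not chunk:
            return
        yield chunk


# Hypothetical three-record file, read two records at a time
sample = io.StringIO(
    '{"review_id": "a", "stars": 3}\n'
    '{"review_id": "b", "stars": 5}\n'
    '{"review_id": "c", "stars": 4}\n'
)
chunks = iter_json_chunks(sample, chunksize=2)
first_chunk = next(chunks)  # plays the role of `df1 = next(json_reader)`
print(len(first_chunk))
```

Only the first chunk is materialized here; the remaining records stay unread until `next(chunks)` is called again.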

In [4]:
df1.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15


Let's take a look at the *text* field, as it holds the reviews themselves.

In [7]:
for x in df1['text']:
    print(x, "\n")

If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to it's other locations in NJ and never had a bad experience. 

The food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker. 

I've taken a lot of spin classes over the years, and nothing compares to the classes at Body Cycle. From the nice, clean space and amazing bikes, to the welcoming and motivating instructors, every class is a top notch work out.

For anyone who struggles to fit workouts in, the online scheduling system makes it easy to plan ahead (and there's no need to line up way in advanced like many gyms make you do).

There is no way I can write this review without giving Russell, the owner o

By taking a quick look at the different reviews, and after talking to members of the LAU community, we identified 5 main topics that are of interest to consumers when it comes to restaurants:
1. **Price**: This covers the pricing of the dishes that the restaurant offers.
2. **Food**: This covers the quality and taste of the dishes that the restaurant offers.
3. **Service**: This covers the service of the restaurant, including speed of order completion and staff friendliness.
4. **Ambiance**: This covers the environment inside the restaurant (e.g., decoration, music, layout).
5. **Location**: This covers the geographical location of the restaurant.

We will represent this information as follows: a *1* indicates a positive mention of the category, a *-1* indicates a negative mention of the category, and a *0* indicates neutrality, which also covers the case where the topic is not mentioned in the review.

Let's take a look at an example:

*The food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.*

This statement would have the following label: 

*{"food": 1, "price": 0, "service": -1, "ambiance": 0, "location": 0}*
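
As a small illustration of this encoding, the sketch below converts a label dictionary into a fixed-order vector and back. The helper names `encode_label` and `decode_label` are assumptions made for this example, and the category order simply follows the example label above.

```python
CATEGORIES = ["food", "price", "service", "ambiance", "location"]
VALID_VALUES = {-1, 0, 1}


def encode_label(label):
    """Turn a label dictionary into a list ordered like CATEGORIES."""
    missing = [c for c in CATEGORIES if c not in label]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    invalid = {c: label[c] for c in CATEGORIES if label[c] not in VALID_VALUES}
    if invalid:
        raise ValueError(f"values must be -1, 0 or 1: {invalid}")
    return [label[c] for c in CATEGORIES]


def decode_label(vector):
    """Turn an ordered vector back into a label dictionary."""
    return dict(zip(CATEGORIES, vector))


example = {"food": 1, "price": 0, "service": -1, "ambiance": 0, "location": 0}
vector = encode_label(example)
print(vector)
print(decode_label(vector) == example)
```

Keeping the category order in a single list means every row of the label matrix can be read column by column, one column per category.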



Before proceeding with the labeling of the data, we need to filter the businesses according to type. The dataset above contains many types of stores and venues, including places such as museums and tattoo parlors.

As we are only interested in the food and beverage industry, we will have to filter out all unrelated stores.

To do so, we load the *business* dataset that links the reviews to the businesses.

In [9]:
df2 = pd.read_json('./yelp_academic_dataset_business.json', lines=True)

In [10]:
df2.keys()

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open',
       'attributes', 'categories', 'hours'],
      dtype='object')

In [11]:
df2.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


We can see that the dataset contains the *business_id* column, which acts as a foreign key in the *reviews* dataset seen earlier. Additionally, the *categories* column is of interest, as it contains the classification of the store.

To identify the unique types of stores, we need to process the *categories* column, which currently stores a single comma-separated string listing every category that a store belongs to. We will split each string into its individual categories and then aggregate the results to determine which categories are of interest in our case.

In [24]:
# Split the string according to ','s, expanding the columns as necessary
split_df = df2['categories'].str.split(',', expand=True)

# Strip the resulting data from all whitespace
up1 = split_df.apply(lambda x: x.str.strip())

up1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
0,Doctors,Traditional Chinese Medicine,Naturopathic/Holistic,Acupuncture,Health & Medical,Nutritionists,,,,,...,,,,,,,,,,
1,Shipping Centers,Local Services,Notaries,Mailbox Centers,Printing Services,,,,,,...,,,,,,,,,,
2,Department Stores,Shopping,Fashion,Home & Garden,Electronics,Furniture Stores,,,,,...,,,,,,,,,,
3,Restaurants,Food,Bubble Tea,Coffee & Tea,Bakeries,,,,,,...,,,,,,,,,,
4,Brewpubs,Breweries,Food,,,,,,,,...,,,,,,,,,,


Now we have a dataframe in which each cell contains exactly one category. Next, we need the list of unique categories. To get it, we apply the *value_counts* function to each column, which gives the count of every category on a column basis. Then, we sum these counts horizontally to get the overall count of each category.

In [18]:
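# Count the categories in each column (the top-level pd.value_counts is deprecated in recent pandas)
updated = up1.apply(pd.Series.value_counts)
# Sum the per-column counts to get the overall count of each category
counts = updated.fillna(0).sum(axis=1)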

counts

& Probates             38.0
3D Printing             5.0
ATV Rentals/Tours      12.0
Acai Bowls            268.0
Accessories          1639.0
                      ...  
Wraps                 310.0
Yelp Events            48.0
Yoga                  938.0
Ziplining              12.0
Zoos                   52.0
Length: 1311, dtype: float64
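
The same tally can be sketched without building the wide dataframe, using `collections.Counter` on the raw category strings. This is an alternative illustration rather than the notebook's method; the function name `count_categories` is an assumption, and the sample strings are the categories of rows 3 and 4 shown above plus one business without categories.

```python
from collections import Counter


def count_categories(category_strings):
    """Count how often each category appears across all businesses."""
    counts = Counter()
    for raw in category_strings:
        if not raw:  # businesses with no categories (None or empty string)
            continue
        counts.update(part.strip() for part in raw.split(",") if part.strip())
    return counts


sample = [
    "Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries",
    "Brewpubs, Breweries, Food",
    None,
]
counts_sketch = count_categories(sample)
print(counts_sketch["Food"])
print(sorted(counts_sketch))
```

Because the split is done on every comma, a category name that itself contains a comma would be broken into pieces, which is the same behavior as the `str.split(',')` call used in the notebook.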

We can see the counts of each key. Let's get a list of the keys and inspect them to see which ones are relevant for our purposes:

In [19]:
for key in counts.keys():
    print(key)

& Probates
3D Printing
ATV Rentals/Tours
Acai Bowls
Accessories
Accountants
Acne Treatment
Active Life
Acupuncture
Addiction Medicine
Adoption Services
Adult
Adult Education
Adult Entertainment
Advertising
Aerial Fitness
Aerial Tours
Aestheticians
Afghan
African
Air Duct Cleaning
Aircraft Dealers
Aircraft Repairs
Airlines
Airport Lounges
Airport Shuttles
Airport Terminals
Airports
Airsoft
Allergists
Alternative Medicine
Amateur Sports Teams
American (New)
American (Traditional)
Amusement Parks
Anesthesiologists
Animal Assisted Therapy
Animal Physical Therapy
Animal Shelters
Antiques
Apartment Agents
Apartments
Appliances
Appliances & Repair
Appraisal Services
Aquarium Services
Aquariums
Arabic
Arcades
Archery
Architects
Architectural Tours
Argentine
Armenian
Art Classes
Art Consultants
Art Galleries
Art Installation
Art Museums
Art Restoration
Art Schools
Art Space Rentals
Art Supplies
Art Tours
Artificial Turf
Arts & Crafts
Arts & Entertainment
Asian Fusion
Assisted Living Facilities


We can identify five main categories that operate in the F&B industry: 
1. Cafes
1. Restaurants
1. Bars
1. Juice Bars & Smoothies
1. Pubs

Thus, we will now filter the dataset to include only these categories.

Note that there is a generic *Food* category that might be of interest. To check whether it is relevant, we will select the rows that include the *Food* category but do not belong to any of the aforementioned categories. We can then inspect the other categories of those stores to see whether they are relevant to our project.

In [30]:
of_interest = ['Cafes', 'Restaurants', 'Bars', 'Juice Bars & Smoothies', 'Pubs']

In [31]:
# Return a Series object that indicates if the store in a row belongs to one of the above categories
types = up1.isin(of_interest).any(axis=1)
types

0         False
1         False
2         False
3          True
4         False
          ...  
150341    False
150342    False
150343    False
150344    False
150345    False
Length: 150346, dtype: bool
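
For a single business, the membership test behind this mask can be sketched with plain sets. The helper names below are assumptions made for this illustration; the two sample strings are the categories of rows 3 and 4 shown earlier.

```python
OF_INTEREST = {"Cafes", "Restaurants", "Bars", "Juice Bars & Smoothies", "Pubs"}


def parse_categories(raw):
    """Split a comma-separated category string into a set of names."""
    if not raw:
        return set()
    return {part.strip() for part in raw.split(",") if part.strip()}


def is_of_interest(raw):
    """True if the business has at least one category of interest."""
    return bool(parse_categories(raw) & OF_INTEREST)


def is_food_only(raw):
    """True if the business is tagged 'Food' but has no category of interest."""
    categories = parse_categories(raw)
    return "Food" in categories and not (categories & OF_INTEREST)


row_3 = "Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries"
row_4 = "Brewpubs, Breweries, Food"
print(is_of_interest(row_3), is_food_only(row_3))
print(is_of_interest(row_4), is_food_only(row_4))
```

Note that the comparison is an exact match on the whole category name, as with `isin`: a name such as "Wine Bars" (used here only as a hypothetical example) would not match "Bars".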

In [32]:
# Get the rows that belong to the 'Food' category
food = up1.isin(['Food']).any(axis=1)

# Get the categories of the 'Food' rows that do not belong to any category of interest
non_interest = up1[food & ~types].apply(pd.Series.value_counts).sum(axis=1).index
for x in non_interest:
    print(x)

Acai Bowls
Accessories
Accountants
Active Life
Acupuncture
Adult
Adult Education
Adult Entertainment
Airlines
Airport Shuttles
Airports
Alternative Medicine
Amateur Sports Teams
Amusement Parks
Antiques
Apartments
Appliances
Appliances & Repair
Arcades
Architects
Art Classes
Art Galleries
Art Schools
Art Supplies
Art Tours
Arts & Crafts
Arts & Entertainment
Attraction Farms
Auto Customization
Auto Detailing
Auto Glass Services
Auto Parts & Supplies
Auto Repair
Automotive
Axe Throwing
Baby Gear & Furniture
Bagels
Bakeries
Barbers
Bartenders
Beauty & Spas
Bed & Breakfast
Beer
Beer Gardens
Beer Tours
Bespoke Clothing
Beverage Store
Bike Rentals
Bike Repair/Maintenance
Bike tours
Bikes
Boat Charters
Boating
Body Shops
Bookkeepers
Books
Bookstores
Boot Camps
Botanical Gardens
Bounce House Rentals
Bowling
Breweries
Brewing Supplies
Brewpubs
Bridal
Bubble Tea
Building Supplies
Bus Tours
Business Consulting
Butcher
CSA
Campgrounds
Candle Stores
Candy Stores
Cannabis Clinics
Cannabis Dispensari

The list includes venues such as Casinos and Butchers, which are irrelevant in our case. We will also exclude coffee shops, as people generally do not book a table at a coffee shop.

Therefore, we can now complete the filtering. Let's see the total number of venues that we will be tackling:

In [33]:
chosen_venues_ids = list(df2[types]['business_id'])
print(len(chosen_venues_ids))

55653


We can see that we have ~55K venues. Next, we need to keep only the reviews in the reviews table that belong to these venues:

In [34]:
wanted_reviews_df = df1[df1['business_id'].isin(chosen_venues_ids)]
wanted_reviews_df.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15
5,JrIxlS1TzJ-iCu79ul40cQ,eUta8W_HdHMXPzLBBZhL1A,04UD14gamNjLY0IDYVhHJg,1,1,2,1,I am a long term frequent customer of this est...,2015-09-23 23:10:31


In [38]:
wanted_reviews = wanted_reviews_df['text'].values

Let's look at some reviews at random:

In [39]:
for rev in np.random.choice(wanted_reviews, size=2):
    print(rev, "\n")

I must admit, I wasn't expecting much. This place totally blew us away. This has to be one of the best Indian restaurants in the Tampa Bay. The ambience is wonderful including white table linens. The best part was the food, OMG. The chicken 65 was to die for, and the hot sour soup was rich and flavorful. The waitress (Angel) was friendly,  attentive and very pleasant. They opened just a week ago. I'm sure this will soon be the place foodies seek out for real Indian food. 

First time there and it was excellent!!! It feels like your are entering someone's home. The waiters there funny and nice. The food come out very quickly and it is phenomenal!!! Definitely will be going back to this place. 



## Labeling

Having cleaned the data, we can now proceed with the labeling process. We will start by labeling 73 entries (these were present in the first chunk that was read).

In [40]:
len(wanted_reviews)

73

In [75]:
import ast

categories = ["food", "price", "service", "ambiance", "location"]
Y = {"food": 0, "price": 0, "service": 0, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 0, "location": 1}, {"food": 1, "price": 0, "service": 0, "ambiance": 0, "location": 0}, {    "food": 0,    "price": 0,    "service": 1,    "ambiance": 1,    "location": 0}, {"food": 0, "price": 0, "service": -1, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 0}, {"food": 0, "price": -1, "service": -1, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 1}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 0}, {"food": 1, "price": 0, "service": -1, "ambiance": 0, "location": 0}, {"food": 1, "price": 1, "service": 0, "ambiance": 0, "location": 1}, {"food": 1, "price": 0, "service": 1, "ambiance": 0, "location": 0}, {"food": 1, "price": -1, "service": 1, "ambiance": 1, "location": 0}, {"food":1, "price":0, "service":1, "ambiance":1, "location":0}, {"food": 1, "price": 0, "service": 0, "ambiance": 1, "location": 0}, {"food":0, "price":0, "service":0, "ambiance":0, "location":0}, {"food": 1, "price": 0, "service": 1, "ambiance": 0, "location": 0}, {"food": 0, "price": 0, "service": 0, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 0, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 0, "location": 0}, {"food": -1, "price": 0, "service": -1, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 0, "location": 0}, {"food": 1, "price": -1, "service": -1, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 0}, {"food":1, "price":0, "service":1, "ambiance":0, "location":0}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 1}, {"food": 1, "price": 0, "service": 0, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 0, "ambiance": 0, 
"location": 0}, {"food": 1, "price": 0, "service": 0, "ambiance": 0, "location": 0}, {"food": -1, "price": 1, "service": 0, "ambiance": -1, "location": 1}, {"food": 0, "price": 0, "service": -1, "ambiance": 0, "location": 1}, {"food": 1, "price":0, "service": 1, "ambiance": 1, "location": 0}, {"food": -1, "price": -1, "service": 0, "ambiance": 0, "location": 0}, {"food": -1, "price": -1, "service": -1, "ambiance": -1, "location": 0}, {"food": 0, "price": -1, "service": -1, "ambiance": 0, "location": 0}, {"food": 1, "price": 1, "service": 0, "ambiance": 1, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 0}, {"food": 0, "price": -1, "service": 0, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 0,"ambiance": 0, "location": 0}, {"food":1,"price":0,"service":-1,"ambiance":1,"location":0}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 1}, {"food": 1, "price": 0, "service": 0, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 0, "location": 1}, {"food": 1, "price": 0, "service": -1, "ambiance": 1, "location": 1}, {"food": 1, "price": 0, "service": 0, "ambiance": 0, "location": 1}, {"food":0, "price":0, "service":0, "ambiance":0, "location":0}, {"food": 1, "price": 0, "service": 1, "ambiance": 0, "location": 1}, {"food": 1, "price": 0, "service": 0, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 0, "location": 1}, {"food": 1, "price": 0, "service": 1, "ambiance":0, "location":0}, {"food": 1, "price": 1, "service": 0, "ambiance": 0, "location": 0}, {"food": 1, "price": -1, "service": 1, "ambiance": 1, "location": 0}, {"food": -1, "price": 0, "service": -1, "ambiance": 0, "location": 0}, {"food": 1, "price": 1, "service": 1, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 0, "ambiance": 0, "location": 0}, {"food": 0, "price": 0, "service": 0, "ambiance": 1, 
"location": 0}, {"food": -1, "price": 0, "service": 0, "ambiance":0, "location":0}, {"food": 0, "price": 0, "service": 0, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 0}, {"food": -1, "price": 1, "service": 1, "ambiance": -1, "location": 0}, {"food": -1, "price": 0, "service": -1, "ambiance": 0, "location": 0}, {"food": 0, "price": 0, "service": -1, "ambiance": -1, "location": 0}, {"food":1,"price":-1,"service":0,"ambiance":0,"location":1}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 1}, {"food": 0, "price": 0, "service": 0, "ambiance": 1, "location": 0}, {"food": 1, "price": 0, "service": 0, "ambiance": 1, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 0, "location": 0}, {"food": 1, "price": 0, "service": 1, "ambiance": 1, "location": 0}, {"food": -1, "price": 0, "service": -1, "ambiance": 0, "location": 0}, {"food": 0, "price": 0, "service": 1, "ambiance": 0, "location": 1}

Y = [list(x.values()) for x in Y]

In [81]:
Y = np.array(Y)
# Use the 73 review texts as inputs, in the same order as the labels in Y
X = np.array(wanted_reviews)
X.shape, Y.shape

((73,), (73, 5))

In [89]:
Y

array([[ 0,  0,  0,  0,  0],
       [ 1,  0,  1,  0,  1],
       [ 1,  0,  0,  0,  0],
       [ 0,  0,  1,  1,  0],
       [ 0,  0, -1,  0,  0],
       [ 1,  0,  1,  1,  0],
       [ 0, -1, -1,  0,  0],
       [ 1,  0,  1,  1,  1],
       [ 1,  0,  1,  1,  0],
       [ 1,  0, -1,  0,  0],
       [ 1,  1,  0,  0,  1],
       [ 1,  0,  1,  0,  0],
       [ 1, -1,  1,  1,  0],
       [ 1,  0,  1,  1,  0],
       [ 1,  0,  0,  1,  0],
       [ 0,  0,  0,  0,  0],
       [ 1,  0,  1,  0,  0],
       [ 0,  0,  0,  0,  0],
       [ 1,  0,  0,  0,  0],
       [ 1,  0,  1,  0,  0],
       [-1,  0, -1,  0,  0],
       [ 1,  0,  1,  0,  0],
       [ 1, -1, -1,  0,  0],
       [ 1,  0,  1,  1,  0],
       [ 1,  0,  1,  1,  0],
       [ 1,  0,  1,  0,  0],
       [ 1,  0,  1,  1,  1],
       [ 1,  0,  0,  0,  0],
       [ 1,  0,  0,  0,  0],
       [ 1,  0,  0,  0,  0],
       [-1,  1,  0, -1,  1],
       [ 0,  0, -1,  0,  1],
       [ 1,  0,  1,  1,  0],
       [-1, -1,  0,  0,  0],
       [-1, -1

In [84]:
X.reshape((-1, 1)).shape

(73, 1)

Now we can proceed with building the model. We will build an SVM for each category using TF-IDF features.

In [92]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, shuffle=True)

X_train.shape, y_train.shape

((58,), (58, 5))

Let's look at the distribution of the data:

In [109]:
df = pd.DataFrame(y_train)
# Count the -1 / 0 / 1 labels in each category column
counts = df.apply(pd.Series.value_counts)
counts.columns = categories

counts

Unnamed: 0,food,price,service,ambiance,location
-1,7,6,11,3,
0,11,46,23,38,46.0
1,40,6,24,17,12.0
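
These counts also give a useful reference point: the accuracy obtained by always predicting the most frequent value of each category. The sketch below computes it from the table above; the function name `majority_baseline` is an assumption made for this illustration.

```python
# Training-set counts copied from the table above (58 reviews per category)
train_counts = {
    "food": {-1: 7, 0: 11, 1: 40},
    "price": {-1: 6, 0: 46, 1: 6},
    "service": {-1: 11, 0: 23, 1: 24},
    "ambiance": {-1: 3, 0: 38, 1: 17},
    "location": {0: 46, 1: 12},
}


def majority_baseline(counts):
    """Return the most frequent label and the share of samples it covers."""
    total = sum(counts.values())
    label, n = max(counts.items(), key=lambda item: item[1])
    return label, n / total


for category, counts_for_category in train_counts.items():
    label, share = majority_baseline(counts_for_category)
    print(f"{category}: always predicting {label} is right {share:.2f} of the time")
```

A classifier whose accuracy on this data is close to these shares may simply be predicting the majority value, so accuracy alone should be read with care here.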


We can see that the distribution of the data is uneven, which could bias the results toward the most frequent value of each category. For this reason, we will use K-fold cross-validation to reduce the effect of any single random split into training and testing data.
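
To show what K-fold cross-validation does with our 58 training reviews, here is a minimal standard-library sketch of the index bookkeeping. The function `kfold_indices` is an assumption made for this illustration; it follows the usual convention of giving the first folds one extra sample when the data does not divide evenly, but its shuffling will not reproduce the exact folds produced by scikit-learn.

```python
import random


def kfold_indices(n_samples, n_splits, seed=None):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    if seed is not None:
        random.Random(seed).shuffle(indices)

    # The first (n_samples % n_splits) folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]

    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = kfold_indices(58, 3, seed=42)
for number, (train, test) in enumerate(folds, start=1):
    print(f"fold {number}: {len(train)} training reviews, {len(test)} validation reviews")
```

Each review is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split.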

Furthermore, we expect the model to behave poorly simply due to the small number of labeled points. However, due to time constraints, the team preferred to not proceed with the entire labeling. An alternative solution will be provided at the end.

# Modeling and Evaluation

Due to the nature of the NLP task and the small labeled dataset available, we decided to use SVMs, given their documented performance in NLP tasks and their good performance even with only a small dataset for training.

We will create one SVM model per category and fine-tune its *C* parameter using the grid search function (*GridSearchCV*) in scikit-learn.
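
Since the SVMs operate on TF-IDF features, a minimal standard-library sketch of the weighting may help. It uses a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which we believe corresponds to scikit-learn's default settings; the function name `tfidf`, the simplified tokenizer, and the two sample sentences are assumptions made for this illustration.

```python
import math
import re
from collections import Counter

# Simplified tokenizer: lowercase words of at least two characters
TOKEN = re.compile(r"\b\w\w+\b")


def tfidf(documents):
    """Return one {term: weight} dictionary per document."""
    term_counts = [Counter(TOKEN.findall(doc.lower())) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for counts in term_counts:
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


vectors = tfidf(["The food was great", "The service was slow"])
for term, weight in sorted(vectors[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

Terms that appear in every review, such as "the", receive a lower weight than terms that distinguish one review from another, such as "food".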

In [125]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold, cross_val_score

y_train_split = [y_train[:, 0], y_train[:, 1], y_train[:, 2], y_train[:, 3], y_train[:, 4]]
y_test_split = [y_test[:, 0], y_test[:, 1], y_test[:, 2], y_test[:, 3], y_test[:, 4]]

In [132]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

models = []

# Use separate loop names (y_tr, y_te) so the full y_train / y_test arrays are not overwritten
for pos, (y_tr, y_te) in enumerate(zip(y_train_split, y_test_split)):

    # Create k-fold cross validator
    kf = KFold(n_splits=3, shuffle=True, random_state=42)

    # Define the values of C to loop through
    parameters = {'svm__C': [0.1, 1, 10, 100, 1000]}

    # Create pipeline: tf-idf vectorizer followed by a linear SVM
    steps = [('vectorizer', TfidfVectorizer()), ('svm', LinearSVC())]
    pipeline = Pipeline(steps=steps)

    # Create Grid Object
    grid = GridSearchCV(pipeline, parameters, cv=kf)

    # Fit the model
    grid = grid.fit(X_train, y_tr)

    # Print results
    print(f"{categories[pos]} results:")
    print("-"*30, "\n\n")
    for param, score in zip([list(x.values())[0] for x in grid.cv_results_['params']], grid.cv_results_['mean_test_score']):

        print(f"Average cv score for C={param}: {score:.2f}")
        print()

    # best_score_ is the mean cross-validation score of the best C
    print(f"Best Parameters: {grid.best_params_}")
    print(f"Best Test Score: {grid.best_score_:.2f}")
    print()
    print(f"Score on test set: {grid.score(X_test, y_te):.2f}")

    print("-"*30,"\n\n")

    # Store best model
    models.append(grid.best_estimator_)

food results:
------------------------------ 


Average cv score for C=0.1: 0.69

Average cv score for C=1: 0.71

Average cv score for C=10: 0.69

Average cv score for C=100: 0.71

Average cv score for C=1000: 0.71

Best Parameters: {'svm__C': 1}
Best Test Score: 0.71

Score on test set: 0.67
------------------------------ 


price results:
------------------------------ 


Average cv score for C=0.1: 0.79

Average cv score for C=1: 0.79

Average cv score for C=10: 0.79

Average cv score for C=100: 0.79

Average cv score for C=1000: 0.79

Best Parameters: {'svm__C': 0.1}
Best Test Score: 0.79

Score on test set: 0.80
------------------------------ 


service results:
------------------------------ 


Average cv score for C=0.1: 0.50

Average cv score for C=1: 0.54

Average cv score for C=10: 0.50

Average cv score for C=100: 0.50

Average cv score for C=1000: 0.50

Best Parameters: {'svm__C': 1}
Best Test Score: 0.54

Score on test set: 0.67
------------------------------ 


ambiance r

In [133]:
models

[Pipeline(steps=[('vectorizer', TfidfVectorizer()), ('svm', LinearSVC(C=1))]),
 Pipeline(steps=[('vectorizer', TfidfVectorizer()), ('svm', LinearSVC(C=0.1))]),
 Pipeline(steps=[('vectorizer', TfidfVectorizer()), ('svm', LinearSVC(C=1))]),
 Pipeline(steps=[('vectorizer', TfidfVectorizer()), ('svm', LinearSVC(C=0.1))]),
 Pipeline(steps=[('vectorizer', TfidfVectorizer()), ('svm', LinearSVC(C=0.1))])]

Let's print the classification reports for each:

In [141]:
from sklearn.metrics import classification_report

for pos, model in enumerate(models):
    y_pred = model.predict(X_test)
    print(f"Classification report for {categories[pos]}")
    report = classification_report(y_test_split[pos], y_pred)
    print(report, "\n\n")

Classification report for food
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00         2
           0       1.00      0.25      0.40         4
           1       0.64      1.00      0.78         9

    accuracy                           0.67        15
   macro avg       0.55      0.42      0.39        15
weighted avg       0.65      0.67      0.58        15
 


Classification report for price
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00         3
           0       0.80      1.00      0.89        12

    accuracy                           0.80        15
   macro avg       0.40      0.50      0.44        15
weighted avg       0.64      0.80      0.71        15
 


Classification report for service
              precision    recall  f1-score   support

          -1       1.00      0.33      0.50         3
           0       0.50      0.75      0.60         4
           1       0.75   



From the above scores, we can see that the models behave quite poorly. This is expected, as the number of data points is quite small, especially for an NLP task such as this one.
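
To see where the numbers in such a report come from, the sketch below computes precision, recall, and F1 for one class by hand. The toy labels are constructed to match the counts in the food report (2, 4, and 9 test reviews per class); they are not the notebook's actual predictions, and the function name `per_class_metrics` is an assumption made for this illustration.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single class label."""
    true_positives = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)

    # A class that is never predicted has an undefined precision; report 0.0
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: the model predicts 1 almost everywhere and never predicts -1
y_true_toy = [-1] * 2 + [0] * 4 + [1] * 9
y_pred_toy = [1] * 2 + [0] + [1] * 3 + [1] * 9

for label in (-1, 0, 1):
    precision, recall, f1 = per_class_metrics(y_true_toy, y_pred_toy, label)
    print(f"{label:>2}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy setting the negative class is never predicted, so its precision, recall, and F1 are all reported as 0.00, while the positive class reaches a recall of 1.00 simply because it is predicted for almost every review.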

# GPT

To provide reasonable performance, we will use the GPT API provided by OpenAI to generate predictions as required.

In [3]:
import openai

In [4]:
import os

# Read the API key from an environment variable instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]


labels = []


def gpt_analyze(review):
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role":
                "system",
                "content":
                "You must do multilabel classification in regards to whether the text talks positively (1), negatively (-1), neutral (0) about price, food, service, ambiance, or location. Response should be in JSON and follows this format: {food: 1, price: -1, service: 0, ambiance: 1, location: 0}. Include all categories along with their respective value. Do not add text to the reply, only provide the JSON. If nothing is mentioned in the text, give neutral to all fields"
            },
            {
                "role": "user",
                "content": review
            },
        ])

    return (review, response['choices'][0]['message']['content'])


Let us try it out:

In [8]:
gpt_analyze("The food was great! However, it was very slow.")

('The food was great! However, it was very slow.',
 '{"food": 1, "price": 0, "service": -1, "ambiance": 0, "location": 0}')
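
The reply arrives as a string, so it has to be parsed and checked before it can be used as a label. Here is a minimal sketch of that step, assuming the reply is valid JSON as in the output above; the function name `parse_reply` is an assumption made for this illustration, and replies that do not parse or that contain unexpected values are rejected by returning `None`.

```python
import json

CATEGORIES = ["food", "price", "service", "ambiance", "location"]


def parse_reply(reply):
    """Convert the model's reply into a label vector, or None if it is unusable."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    vector = []
    for category in CATEGORIES:
        value = data.get(category)
        # Reject missing categories, booleans, and anything other than -1, 0, 1
        if isinstance(value, bool) or value not in (-1, 0, 1):
            return None
        vector.append(int(value))
    return vector


reply = '{"food": 1, "price": 0, "service": -1, "ambiance": 0, "location": 0}'
print(parse_reply(reply))
print(parse_reply("{food: 1, price: 0}"))
```

A reply with unquoted keys, like the format shown in the system prompt, is not valid JSON and is rejected by this sketch, so in practice such cases would need a retry or a more lenient parser.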

# Deployment

The deployment of this logic and its structuring as an API can be found separately in the GitHub folder of this project.

# References

Yelp Dataset: https://www.yelp.com/dataset