<a href="https://colab.research.google.com/github/yli431/2023SummerProject/blob/main/DATA301_sample_project_code_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sample project
In this abbreviated study, we answer the question:
What are the most common (non stop-word) words associated with restaurant reviews that are 1, 2, 3, 4, and 5 star?

Method: we compute word frequencies after first removing every word in NLTK's English stop word list. We then group the counts by the star rating of the review.
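
As a rough illustration of this method at a very small scale, the sketch below counts words per rating using only the standard library. The three reviews and the tiny `TOY_STOP_WORDS` set are made up for the example, and `top_words_by_rating` is a hypothetical helper; the project itself uses NLTK's stop word list and Dask.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (star rating, review text) pairs, not taken from the dataset.
TOY_REVIEWS = [
    (5, "the pizza was great"),
    (5, "great burger and great fries"),
    (1, "the burger was cold"),
]

# Hypothetical stand-in for NLTK's much larger English stop word list.
TOY_STOP_WORDS = {"the", "was", "and"}

def top_words_by_rating(reviews, stop_words, n=2):
    """Return {rating: [(word, count), ...]} holding the n most frequent words."""
    counts = defaultdict(Counter)
    for rating, text in reviews:
        for word in text.lower().split():
            if word not in stop_words:
                counts[rating][word] += 1
    return {rating: counter.most_common(n) for rating, counter in counts.items()}

result = top_words_by_rating(TOY_REVIEWS, TOY_STOP_WORDS)
print(result[5])
print(result[1])
```

Here "great" appears three times across the 5-star toy reviews, so it ranks first for that rating.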

## Download data
We will use a filtered dataset of restaurant reviews collected from Google Maps (see https://cseweb.ucsd.edu/~jmcauley/datasets.html#google_restaurants).


In [1]:
import urllib.request
filename = 'filter-all-t.json'
urllib.request.urlretrieve('https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal_restaurants/filter_all_t.json', filename)


('filter-all-t.json', <http.client.HTTPMessage at 0x7f51ce97c340>)

In [2]:
!pip install ijson

Collecting ijson
  Downloading ijson-3.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (111 kB)
Installing collected packages: ijson
Successfully installed ijson-3.2.3


In [3]:
# Custom extraction of fields from the original JSON file.
# This runs sequentially; it could be done in parallel, but the effort is not worth it.
# JSON is a poor format for random access and for splitting data sets, so this
# code converts it to JSON-lines and drops the fields we don't need for this project.
import json, ijson

def convert_to_jsonl(filename):
  # ijson streams parse events, so the whole file is never held in memory;
  # both files are closed automatically when the with-block ends
  with open(filename, 'rb') as source, open(filename + 'l', 'w') as output:
    rating = {}
    for prefix, event, value in ijson.parse(source):
      if prefix.endswith('rating'):
        rating['rating'] = value
      elif prefix.endswith('review_text'):
        # a record is written when its review text arrives; this assumes
        # 'rating' comes before 'review_text' within each entry
        rating['review'] = value
        output.write(json.dumps(rating))
        output.write('\n')

convert_to_jsonl('filter-all-t.json')
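
One reason JSON-lines suits this workflow is that every line is a complete JSON document, so a file can be read (or split into chunks) line by line without parsing it as a whole. A minimal sketch of reading such a file back, using an in-memory stand-in with two made-up records; `read_jsonl` is a hypothetical helper, not part of the project code:

```python
import io
import json

# Hypothetical two-record stand-in for 'filter-all-t.jsonl'.
sample_jsonl = io.StringIO(
    '{"rating": 4, "review": "Good crust."}\n'
    '{"rating": 5, "review": "Really good waffles!"}\n'
)

def read_jsonl(lines):
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(read_jsonl(sample_jsonl))
print(len(records), records[0]["rating"])
```

Because no record spans more than one line, a reader that cuts the file at line boundaries never ends up with half a record.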

In [4]:
#shows a sample of the first and last 10 extracted reviews
!head -10 'filter-all-t.jsonl'
!tail -10 'filter-all-t.jsonl'

{"rating": 4, "review": "The tang of the tomato sauce is outstanding. And the crust is a meal, as it should be. Order a whole pie fresh."}
{"rating": 5, "review": "Chicken and waffles were really good!"}
{"rating": 4, "review": "The appetizer of colossal shrimp was very good but the freshwater Lobster was a bit disappointing. The lobster mac and cheese was great. We got the 40 day dry aged rib steak for two. It was cooked very well , but I wish it had more of that signature aged steak flavor."}
{"rating": 5, "review": "The fish tacos here  omg! The salad was great also."}
{"rating": 4, "review": "Ribs are great, as are the mac and cheese, fries and onion rings. Skip the brisket and blueberry cornbread."}
{"rating": 5, "review": "Food are yummy, wide range of Asian street food."}
{"rating": 5, "review": ".  Roasted beets and brussel sprouts.  Tommy's Salad, customized w/ all spinach; added grilled chicken.  Lobster Seafood (seasonal special)"}
{"rating": 5, "review": "10/10 recommend:. 

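## Loading the data into a Dask DataFrame
The JSON-lines file is first read as a Dask Bag of text lines, one partition per 1 MB block; it is converted to a DataFrame after the words have been extracted. For the direct DataFrame route that is commented out below, see https://docs.dask.org/en/latest/generated/dask.dataframe.read_json.html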

In [5]:
import dask
# from dask import dataframe as ddf
from dask import bag as db

# data_df = ddf.read_json('filter-all-t.jsonl', blocksize="1MB")
data_bag = db.read_text('filter-all-t.jsonl', blocksize="1MB")
print(data_bag)

dask.bag<bag-from-delayed, npartitions=17>


In [6]:
#download a set of stop words that we can ignore because they are not interesting

!pip install nltk



In [8]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))
print(STOP_WORDS)


{'yours', 'll', 'did', 'on', 'own', 'needn', 'couldn', 'should', 'who', 'hasn', "don't", "you'll", 'your', 'whom', 'themselves', 'below', 'but', "couldn't", 'it', 'wasn', 'again', 'no', "shan't", 'for', 'itself', 'then', 'was', 'just', 'all', 'yourselves', 'down', 'ma', 'shan', 'off', 'now', 'here', 'of', 'yourself', 'not', "you're", 'herself', 'hers', 'into', 's', 'which', 'over', 'to', 'than', "hasn't", 'each', 'until', 'were', 'from', 'very', 'her', 'has', 'an', 'isn', 'do', 'wouldn', 'with', 'he', 'how', 'its', 'what', 'm', 'them', 'about', 'more', 'nor', 'will', 'd', 'both', 'is', 'in', 'mustn', "haven't", 'above', "wasn't", 'don', 're', 'been', 'the', 'me', "didn't", 'their', 'through', 't', 'after', 've', "needn't", 'won', 'you', 'theirs', 'have', "hadn't", 'haven', 'mightn', 'this', 'before', 'few', 'that', 'doesn', 'between', "weren't", 'further', 'him', 'other', 'i', 'ourselves', 'these', 'doing', 'why', 'so', "aren't", 'any', "shouldn't", "that'll", 'too', 'having', 'same', 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [9]:
import re
import json

def process_line(line):
  """Turn one JSON-lines record into a list of (rating, word) pairs."""
  entry = json.loads(line)
  results = []
  rating = int(entry['rating'])
  # lowercase, then delete every character that is not a letter or a space
  # (apostrophes are deleted too, so "don't" becomes "dont")
  review = re.sub(r'[^a-zA-Z ]', '', entry['review'].lower())
  for word in review.split():
    # don't keep any words that are in STOP_WORDS
    if word not in STOP_WORDS:
      results.append((rating, word))
  return results

reviews_only_valid_words = data_bag.map(process_line).flatten()
ratings_words_df = reviews_only_valid_words.to_dataframe(columns=['rating', 'word'])
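
It can help to check what the cleaning step does to a single review. The sketch below applies the same regular expression to a made-up sentence, with a tiny hypothetical stop word set standing in for NLTK's list and a hypothetical helper named `clean_words`. Deleting the apostrophe turns "don't" into "dont", which no longer matches the stop word entry spelled "don't".

```python
import re

# Hypothetical stand-in for STOP_WORDS; NLTK's list spells "don't" with the apostrophe.
TOY_STOP_WORDS = {"i", "the", "don't"}

def clean_words(text, stop_words):
    """Lowercase, delete everything except letters and spaces, drop stop words."""
    cleaned = re.sub(r'[^a-zA-Z ]', '', text.lower())
    return [w for w in cleaned.split() if w not in stop_words]

words = clean_words("I don't like the fries!", TOY_STOP_WORDS)
print(words)  # ['dont', 'like', 'fries']
```

If contractions matter for the analysis, one option would be to filter stop words before stripping punctuation. That is only a suggestion, and it would change the counts reported in this notebook.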


In [10]:
#show that it is working
ratings_words_df.head(5)

   rating         word
0       4         tang
1       4       tomato
2       4        sauce
3       4  outstanding
4       4        crust


In [None]:
# now compute the "answer" to our question:
# count each (rating, word) pair, then take the 10 largest counts per rating
ratings_words_df_with_counts = (
    ratings_words_df.groupby(['rating', 'word'])
    .size()
    .to_frame('counts')
    .reset_index()
)
for rating in ratings_words_df_with_counts['rating'].unique().compute():
  print(f"Top ten for {rating} rating:")
  rows_for_rating = ratings_words_df_with_counts[ratings_words_df_with_counts['rating'] == rating]
  print(rows_for_rating.nlargest(10, 'counts').compute())


Top ten for 1 rating:
     rating     word  counts
459       1  ordered     347
122       1  chicken     277
362       1     like     257
90        1   burger     201
276       1     good     186
116       1   cheese     177
278       1      got     171
494       1    pizza     166
598       1   shrimp     156
262       1    fries     155
Top ten for 2 rating:
      rating     word  counts
1177       2     good     721
1427       2  ordered     617
935        2  chicken     551
1298       2     like     450
894        2   burger     361
1155       2    fries     352
925        2   cheese     351
1180       2      got     323
1480       2    pizza     312
1152       2    fried     305
Top ten for 3 rating:
      rating     word  counts
2690       3     good    3315
2209       3  chicken    1802
3188       3  ordered    1445
2119       3   burger    1218
2930       3     like    1205
2647       3    fries    1092
2643       3    fried    1014
2195       3   cheese    1003
3311       3   