#### Sociology 128D: Mining Culture Through Text Data: Introduction to Social Data Science

# Notebook 13a: Preparing a Sample of Yelp Reviews

For most students, it will make sense to complete Notebook 13b (the companion notebook) on Google Colab, where you can more easily rely on a GPU. Notebook 13a streamlines the process of extracting relevant reviews and useful features from the large files that come as part of the Yelp Open Dataset. In this notebook, we will identify reviews from January 2019 to January 2021 of restaurants that have price-range data, and we will save a sample of these reviews as a dataframe in JSON format.

### Getting the Data

For this notebook, we are going to use the Yelp Open Dataset, which you can find [here](https://www.yelp.com/dataset). You'll have to click 'Download Dataset', agree to the terms, and click 'Download JSON'. It's a large download: ~5GB compressed and ~11GB once you've uncompressed it. The dataset has 8,635,403 reviews of businesses; each review includes text, a rating out of five stars, and various other information.

### Setup

You may need to install `contractions`,  `num2words`, and `unidecode`. You can install two of these using `conda` (if using Anaconda), but you will need to install `contractions` using `pip` (see below).

`conda install -c conda-forge num2words` <br>
`conda install -c conda-forge unidecode` <br>

`pip install contractions`

**Note: If you have trouble getting `contractions` to work, you can also comment out or delete both the line that imports it and the line that reads "<tt>doc = contractions.fix(doc)</tt>" in the cell that defines the <tt>preprocess_doc</tt> function.** You can disable or change other aspects of the preprocessing as you see fit. Just keep in mind what the preprocessing is meant to accomplish and how the different steps interact.

In [1]:
import contractions
import datetime as dt
import json
import numpy as np
import os
import pandas as pd
import re
import spacy

from collections import Counter
from gensim.models.phrases import Phrases
from num2words import num2words
from spacy.lang.en.stop_words import STOP_WORDS
from unidecode import unidecode

In [2]:
np.random.seed(6756) # from random.org

### The Files

The main file, <tt>yelp_academic_dataset_review.json</tt>, is quite large (> 6GB). We are not going to examine all of it, so we are going to avoid reading it into memory all at once. We are going to extract some information from the <tt>yelp_academic_dataset_business.json</tt> file, which has information about the businesses in the dataset. We will then use the business IDs and the date of reviews to select only the reviews we want to examine.

In [3]:
os.listdir("yelp_dataset/") # this should point to where you have downloaded and extracted the data

['Dataset_User_Agreement.pdf',
 'yelp_academic_dataset_business.json',
 'yelp_academic_dataset_checkin.json',
 'yelp_academic_dataset_review.json',
 'yelp_academic_dataset_tip.json',
 'yelp_academic_dataset_user.json']

### Identifying Reviews to Keep

First, read in the file with information about businesses and take a look at the resulting dataframe.

In [4]:
biz_df = pd.read_json("yelp_dataset/yelp_academic_dataset_business.json", lines=True)

In [5]:
biz_df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,6iYb2HFDywm3zjuRg0shjw,Oskar Blues Taproom,921 Pearl St,Boulder,CO,80302,40.017544,-105.283348,4.0,86,1,"{'RestaurantsTableService': 'True', 'WiFi': 'u...","Gastropubs, Food, Beer Gardens, Restaurants, B...","{'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'..."
1,tCbdrRPZA0oiIYSmHG3J0w,Flying Elephants at PDX,7000 NE Airport Way,Portland,OR,97218,45.588906,-122.593331,4.0,126,1,"{'RestaurantsTakeOut': 'True', 'RestaurantsAtt...","Salad, Soup, Sandwiches, Delis, Restaurants, C...","{'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ..."
2,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","Antiques, Fashion, Used, Vintage & Consignment...","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0..."
3,oaepsyvc0J17qwi8cfrOWg,Great Clips,2566 Enterprise Rd,Orange City,FL,32763,28.914482,-81.295979,3.0,8,1,"{'RestaurantsPriceRange2': '1', 'BusinessAccep...","Beauty & Spas, Hair Salons",
4,PE9uqAjdw0E4-8mjGl3wVA,Crossfit Terminus,1046 Memorial Dr SE,Atlanta,GA,30316,33.747027,-84.353424,4.0,14,1,"{'GoodForKids': 'False', 'BusinessParking': '{...","Gyms, Active Life, Interval Training Gyms, Fit...","{'Monday': '16:0-19:0', 'Tuesday': '16:0-19:0'..."


In [6]:
biz_df.shape

(160585, 14)

Let's take a look at the <tt>attributes</tt> column. Each row (business) has its own dictionary of attributes. Let's iterate through the rows and add each key to one set so we can examine the unique list of keys.

In [7]:
%%time

attr_keys = set()
for idx, row in biz_df.iterrows():
    atts = row.attributes
    if atts:
        for key in atts.keys():
            attr_keys.add(key)

Wall time: 5.52 s


As in Notebook 12, we'll use the values for "RestaurantsPriceRange2" (but you can use others if you'd like).

In [8]:
print(len(attr_keys))
attr_keys

39


{'AcceptsInsurance',
 'AgesAllowed',
 'Alcohol',
 'Ambience',
 'BYOB',
 'BYOBCorkage',
 'BestNights',
 'BikeParking',
 'BusinessAcceptsBitcoin',
 'BusinessAcceptsCreditCards',
 'BusinessParking',
 'ByAppointmentOnly',
 'Caters',
 'CoatCheck',
 'Corkage',
 'DietaryRestrictions',
 'DogsAllowed',
 'DriveThru',
 'GoodForDancing',
 'GoodForKids',
 'GoodForMeal',
 'HairSpecializesIn',
 'HappyHour',
 'HasTV',
 'Music',
 'NoiseLevel',
 'Open24Hours',
 'OutdoorSeating',
 'RestaurantsAttire',
 'RestaurantsCounterService',
 'RestaurantsDelivery',
 'RestaurantsGoodForGroups',
 'RestaurantsPriceRange2',
 'RestaurantsReservations',
 'RestaurantsTableService',
 'RestaurantsTakeOut',
 'Smoking',
 'WheelchairAccessible',
 'WiFi'}

Now let's take a look at the <tt>categories</tt> field. We'll use `biz_df.categories.tolist()` to convert the column to a list (one element per row), and then iterate through that inside a list comprehension. Each row with data for this column has a single string of categories separated by commas. We'll use `str.split(",")` to split these into lists on the commas, then use a second list comprehension to go from a list of lists to a single ("flat") list of categories. In the second list comprehension, where we're iterating through individual categories, we'll also use the `str.strip()` method to get rid of whitespace on either side of each token.

Finally, we'll take a look at the number of unique categories and then use `Counter` to see which are most common. It's set to show the 20 most common, but you can change that argument.

In [9]:
cats = [cat.split(",") for cat in biz_df.categories.tolist() if cat]
cats = [cat.strip() for cat_list in cats for cat in cat_list]

In [10]:
len(set(cats))

1330

In [11]:
Counter(cats).most_common(20)

[('Restaurants', 50763),
 ('Food', 29469),
 ('Shopping', 26205),
 ('Beauty & Spas', 16574),
 ('Home Services', 16465),
 ('Health & Medical', 15102),
 ('Local Services', 12192),
 ('Nightlife', 11990),
 ('Bars', 10741),
 ('Automotive', 10119),
 ('Event Planning & Services', 9644),
 ('Active Life', 9231),
 ('Coffee & Tea', 7725),
 ('Sandwiches', 7272),
 ('Fashion', 6599),
 ('American (Traditional)', 6541),
 ('Hair Salons', 5900),
 ('Pizza', 5756),
 ('Hotels & Travel', 5703),
 ('Breakfast & Brunch', 5505)]
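
To see what the two list comprehensions do on a small scale, here is a minimal sketch using three made-up category strings. The values are illustrative only and are not taken from the dataset; `None` stands in for a business with no categories.

```python
from collections import Counter

# Hypothetical stand-in for biz_df.categories.tolist(); None mimics a missing value
toy_categories = [
    "Restaurants, Pizza",
    None,
    "Food, Restaurants, Coffee & Tea",
]

# Step 1: split each non-empty string on commas (one list per business)
split_cats = [cat.split(",") for cat in toy_categories if cat]

# Step 2: flatten the list of lists and strip whitespace around each category
flat_cats = [cat.strip() for cat_list in split_cats for cat in cat_list]

toy_counts = Counter(flat_cats)
print(flat_cats)                  # ['Restaurants', 'Pizza', 'Food', 'Restaurants', 'Coffee & Tea']
print(toy_counts.most_common(1))  # [('Restaurants', 2)]
```

Without the `strip()` call, `' Pizza'` (with a leading space) and `'Pizza'` would be counted as different categories.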

Now we want to find the IDs of businesses that (1) are restaurants and (2) have data about how expensive they are. We can subset the data to include only rows that contain "Restaurants" in the <tt>categories</tt> field (this is the most common term, in fact). To do that, we'll first include a condition that excludes rows with *no* data for that field. This allows us to use `str.contains` in the second condition (which wouldn't work with rows with missing data in the form of NaNs, for example).

To keep things simple, we'll then iterate through the subset of rows with "Restaurants" in the <tt>categories</tt> field and check if the <tt>attributes</tt> field for each row has evaluated to a `dict`. If it has, we'll check whether there is data for the "RestaurantsPriceRange2" key. If there is, we'll add the row's <tt>business_id</tt> to our set of IDs.

In [12]:
%%time

restaurant_ids = set() # set for faster lookup later

for idx, row in biz_df.loc[(biz_df.categories.notna()) & 
                           (biz_df.categories.str.contains("Restaurants"))].iterrows():
    attr = row.attributes
    if isinstance(attr, dict):
        if "RestaurantsPriceRange2" in attr:
            if attr["RestaurantsPriceRange2"] not in [None, "None"]:
                restaurant_ids.add(row.business_id)
            
print(len(restaurant_ids))

44849
Wall time: 2.16 s
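
The same selection logic can be tried on a handful of made-up records. This is only a sketch: the business IDs and values below are invented, and the helper `has_price_data` is a hypothetical name introduced for illustration. It shows the cases the conditions are meant to exclude: missing attributes, a price stored as the string "None", a non-restaurant, and a business with no categories.

```python
# Hypothetical business records; IDs and values are made up for illustration
toy_businesses = [
    {"business_id": "a1", "categories": "Restaurants, Pizza",
     "attributes": {"RestaurantsPriceRange2": "2"}},
    {"business_id": "b2", "categories": "Restaurants, Bars",
     "attributes": {"RestaurantsPriceRange2": "None"}},  # price stored as the string "None"
    {"business_id": "c3", "categories": "Restaurants",
     "attributes": None},                                  # no attributes at all
    {"business_id": "d4", "categories": "Hair Salons",
     "attributes": {"RestaurantsPriceRange2": "1"}},      # has a price but is not a restaurant
    {"business_id": "e5", "categories": None,
     "attributes": {"RestaurantsPriceRange2": "3"}},      # no categories
]


def has_price_data(attributes) -> bool:
    """Return True only if attributes is a dict with a usable price value."""
    if not isinstance(attributes, dict):
        return False
    return attributes.get("RestaurantsPriceRange2") not in (None, "None")


toy_restaurant_ids = {
    biz["business_id"]
    for biz in toy_businesses
    if biz["categories"]
    and "Restaurants" in biz["categories"]
    and has_price_data(biz["attributes"])
}
print(toy_restaurant_ids)  # {'a1'}
```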


Iterating through that subset of <tt>biz_df</tt> is pretty fast. Next we need to iterate through the main file, which could take longer. We know we want reviews since the start of 2019, so we'll speed things up by checking whether each line (while it's a string) contains "2019," "2020," or "2021" before we convert the line to a dictionary. If a line has one of those years (as a string) in it, we'll use `json.loads` so we can interact with the line like it's a dictionary. First, we'll check whether the business ID associated with the review is in the set of IDs for businesses we are interested in (i.e., restaurants with price data). If a business ID is a match, we'll confirm the date of the review is from the start of 2019 or later.

If a review is a match, we'll add a tuple consisting of the review ID and business ID to a list.

In [13]:
%%time

review_ids = []
earliest_date = dt.datetime(2019, 1, 1)

with open("yelp_dataset/yelp_academic_dataset_review.json", "r", encoding="utf-8") as reader:
    for line in reader:
        if any([year in line for year in ["2019", "2020", "2021"]]):
            line = json.loads(line.strip())
            if line["business_id"] in restaurant_ids:
                if pd.to_datetime(line["date"]) >= earliest_date:
                    review_ids.append((line["review_id"], line["business_id"]))
print(len(review_ids))

968091
Wall time: 1min 33s
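
The year check on the raw string is only a cheap pre-filter: a year can also appear in the review text, so the parsed date still has to be checked. A minimal sketch with three made-up review lines illustrates this. The IDs, text, and dates are invented, and `datetime.fromisoformat` stands in for `pd.to_datetime` so that the example needs only the standard library.

```python
import datetime as dt
import io
import json

# Three made-up reviews in the one-JSON-object-per-line layout
toy_lines = "\n".join([
    json.dumps({"review_id": "r1", "text": "Great tacos.", "date": "2019-03-02 18:00:00"}),
    json.dumps({"review_id": "r2", "text": "Opening in 2019, they say.", "date": "2018-11-20 09:30:00"}),
    json.dumps({"review_id": "r3", "text": "Fine.", "date": "2017-05-01 12:00:00"}),
])

earliest = dt.datetime(2019, 1, 1)
years = ("2019", "2020", "2021")
parsed_count = 0  # lines that passed the string check and were parsed
kept_ids = []

for line in io.StringIO(toy_lines):
    if not any(year in line for year in years):
        continue  # cheap string check: skip the line without parsing it
    parsed_count += 1
    review = json.loads(line)
    # the authoritative check uses the parsed date, not the raw string
    if dt.datetime.fromisoformat(review["date"]) >= earliest:
        kept_ids.append(review["review_id"])

print(parsed_count, kept_ids)  # 2 ['r1']
```

The second review passes the string check because its text mentions 2019, but the date check rejects it. The pre-filter can let extra lines through, but it cannot drop a review we want, because the year always appears in the line's date field.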


Example tuple:

In [14]:
review_ids[0] # review ID, business ID

('a7KaN1Li94o1au_8BfNi3Q', 'YZs1gNSh_sN8JmN_nrpxeA')

Now we'll use `shuffle` to randomize the order.

In [15]:
np.random.shuffle(review_ids)

Now the first tuple is a different one:

In [16]:
review_ids[0]

('JYpNH-1WQ2FccwE-30bFaA', 'kWSuSb-aIkHy0Jf3iz3CrA')

Finally, we'll iterate through the tuples. If the second element (index 1) in the tuple, which is the business ID, is not already in the set <tt>biz_ids_tmp</tt>, we'll add the business ID to <tt>biz_ids_tmp</tt> and add the review ID to the set <tt>first_from_biz</tt>. If the business ID is already in <tt>biz_ids_tmp</tt>, we'll skip that review ID. As a result, <tt>first_from_biz</tt> holds exactly one review ID per business: the first one encountered in the shuffled list. We will only examine one review from any given business.

In [17]:
%%time

biz_ids_tmp = set()
first_from_biz = set()

for review in review_ids:
    if review[1] not in biz_ids_tmp:
        biz_ids_tmp.add(review[1])
        first_from_biz.add(review[0])

Wall time: 492 ms


In [18]:
len(first_from_biz)

30071
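
Here is a small sketch of the shuffle-then-keep-first idea on made-up tuples. It uses the standard library's `random.Random` with an arbitrary seed instead of `np.random.shuffle`, purely to keep the example self-contained; the IDs are invented.

```python
import random

# Made-up (review ID, business ID) tuples: three businesses, six reviews
toy_review_ids = [
    ("r1", "bizA"), ("r2", "bizA"), ("r3", "bizA"),
    ("r4", "bizB"), ("r5", "bizB"),
    ("r6", "bizC"),
]

rng = random.Random(0)       # seeded so the shuffle is repeatable
rng.shuffle(toy_review_ids)  # the random order decides which review is seen first

seen_businesses = set()
chosen_reviews = set()
for review_id, business_id in toy_review_ids:
    if business_id not in seen_businesses:
        seen_businesses.add(business_id)
        chosen_reviews.add(review_id)

print(len(chosen_reviews))  # 3, one review per business
```

Because the list is shuffled first, each review of a business has the same chance of being the one that is kept, regardless of the order in which reviews appear in the file.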

Now we have a set of review IDs for reviews of restaurants with price data. The IDs are for reviews from January 1, 2019, onward, and we have just one per business after randomly sorting the initial list of IDs. All that's left is to iterate through the main file and find these reviews.

In [19]:
%%time

reviews = []

with open("yelp_dataset/yelp_academic_dataset_review.json", "r", encoding="utf-8") as reader:
    for line in reader:
        if any([year in line for year in ["2019", "2020", "2021"]]):
            line = json.loads(line.strip())
            if line["review_id"] in first_from_biz:
                reviews.append(line)
print(len(reviews))

30071
Wall time: 31.4 s


In [20]:
print(reviews[0])

{'review_id': '4G1cR1njMkCWUPWMb0ZSJg', 'user_id': 'NTdi9yRSTD_RQn31fVvcCA', 'business_id': 'mP1EdIafQKMuOm9O4PzAfA', 'stars': 4.0, 'useful': 0, 'funny': 0, 'cool': 0, 'text': "Pretty good service. Restaurant is on the more expensive side. For two people we ordered 7 tapas, two 3oz red wines and dessert. We got the bacon wrapped dates, potatas bravas, duck confit, pickled beets, steak tartare, mahi mahi, and the pulpo. My favorites were the steak tartare and the pickled beets. My boyfriend loved the duck confit. I personally thought it was a little too oily (but he doesn't like oily foods either so I was surprised when he said that). The pulpo was cooked very tenderly which my boyfriend liked, but I like my octopus a little chewier. The chocolate for the churros could've been a little thicker. My boyfriend thought the churros could've been smaller in thicker because he thought it should be crispier. Overall, good food :). PS our total was ~80 without tip.\n\nBoyfriend's opinion: The da

Now, we can convert <tt>reviews</tt> (which is a list of dictionaries) to a dataframe. Pandas will use the keys of the dictionaries for the columns.

In [21]:
df = pd.DataFrame(reviews)

In [22]:
df.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,4G1cR1njMkCWUPWMb0ZSJg,NTdi9yRSTD_RQn31fVvcCA,mP1EdIafQKMuOm9O4PzAfA,4.0,0,0,0,Pretty good service. Restaurant is on the more...,2019-01-30 22:26:00
1,hKW_VNKuqQAYmy7LvNpJBA,aN0dAIbhE5x1Rac1ZGcmyg,y-MM8_RYgtvgyJojV1RWLg,3.0,0,0,0,So I have to give kudos to the lady taking my ...,2019-01-12 04:13:48
2,0_R1uDOBRo81FqMiyYWHJQ,vzfyQFBVzCsMTxgj4WjjqQ,DrEMFfzJIwsGZHl1AqKLrw,5.0,0,0,0,The ingredients are fresh here and the subs ar...,2019-02-02 20:46:28
3,t-OmG9EDEYhIXHQHfuFBMw,YVYhnzICPx3tztqw83Mplg,ft-u7hmJk2b-UPdZrL55fw,2.0,0,0,0,"The coffee is pretty good, when it's available...",2019-02-14 13:24:10
4,K0X2DplCcYz-GB_Q96G3IQ,CqWXSfo_f5ZWUG-fxAmw3g,r_bcfIdazjqn-y7HP6rAUg,1.0,0,0,0,"Food is BLAND, service is not good. Me and my ...",2019-01-11 18:05:24


### Creating New Variables

We're going to create a dictionary from <tt>biz_df</tt> so that we can efficiently look up the categories and price range for the business corresponding to each of our reviews. The following line sets the index of <tt>biz_df</tt> to the <tt>business_id</tt> field, then converts the whole thing to a dictionary using the "index" option for the `to_dict` method. The keys for the resulting dictionary <tt>biz_dict</tt> are business IDs. The values are also dictionaries--i.e., one dictionary per business ID--containing the fields of the original <tt>biz_df</tt>.

To add a <tt>categories</tt> column to our new dataframe of reviews, we'll use `map` with a lambda function that simply looks up the value of "categories" for each business. Since "attributes" is itself a dictionary, we'll create a column for the restaurant price using a similar function that looks up the value for "RestaurantsPriceRange2" in the "attributes" dictionary--which is inside the dictionary for each business ID, which is inside the overarching dictionary, <tt>biz_dict</tt>.

In [23]:
biz_dict = biz_df.set_index("business_id").to_dict("index")

In [24]:
biz_dict["y-MM8_RYgtvgyJojV1RWLg"]

{'name': 'Burgerville',
 'address': '12785 SW Pacific Hwy',
 'city': 'Tigard',
 'state': 'OR',
 'postal_code': '97223',
 'latitude': 45.4276083568,
 'longitude': -122.7779190908,
 'stars': 3.0,
 'review_count': 78,
 'is_open': 1,
 'attributes': {'BikeParking': 'True',
  'RestaurantsAttire': "u'casual'",
  'BusinessAcceptsCreditCards': 'True',
  'RestaurantsTakeOut': 'True',
  'GoodForKids': 'True',
  'WiFi': "'free'",
  'RestaurantsGoodForGroups': 'True',
  'HasTV': 'False',
  'GoodForMeal': "{'dessert': False, 'latenight': False, 'lunch': True, 'dinner': False, 'brunch': False, 'breakfast': False}",
  'BusinessParking': "{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
  'Caters': 'False',
  'Ambience': "{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}",
  'NoiseLevel': "'average'",
  'OutdoorSeating': 'False',
  'RestaurantsPriceRange2': 

In [25]:
df["categories"] = df.business_id.map(lambda x: biz_dict[x]["categories"])

In [26]:
df["price_tier"] = df.business_id.map(lambda x: biz_dict[x]["attributes"]["RestaurantsPriceRange2"])
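
The nested lookups may be easier to follow on a miniature, hand-written version of the dictionary. The business IDs and values below are made up, and plain lists stand in for the dataframe column.

```python
# Hypothetical miniature of biz_dict: business ID -> dict of that business's fields
toy_biz_dict = {
    "bizA": {"categories": "Restaurants, Pizza",
             "attributes": {"RestaurantsPriceRange2": "2", "WiFi": "'free'"}},
    "bizB": {"categories": "Restaurants, Thai",
             "attributes": {"RestaurantsPriceRange2": "1"}},
}

toy_review_business_ids = ["bizB", "bizA", "bizB"]  # one entry per review

# Two levels of lookup: business ID -> "categories"
toy_review_categories = [toy_biz_dict[b]["categories"] for b in toy_review_business_ids]

# Three levels of lookup: business ID -> "attributes" -> "RestaurantsPriceRange2"
toy_price_tiers = [
    toy_biz_dict[b]["attributes"]["RestaurantsPriceRange2"]
    for b in toy_review_business_ids
]
print(toy_price_tiers)  # ['1', '2', '1']
```

Indexing with square brackets raises a `KeyError` if a key is missing. That is acceptable in this notebook because every review was selected from a business already known to have a "RestaurantsPriceRange2" value.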

Let's check the unique values of <tt>price_tier</tt>.

In [27]:
df.price_tier.unique()

array(['2', '1', '3', '4'], dtype=object)

We can now use `apply(int)` to convert the column values to integers.

In [29]:
df.price_tier = df.price_tier.apply(int)

In [30]:
df.price_tier.unique()

array([2, 1, 3, 4], dtype=int64)

### Sampling

Now we'll take a sample of the reviews using the built-in `sample` method.

In [31]:
SAMPLE_SIZE = 20000

sample = df.sample(SAMPLE_SIZE)
sample.reset_index(inplace=True)

In [32]:
sample.drop(columns=["index"], inplace=True)

In [33]:
sample.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,categories,price_tier
0,_FD0i3UCDEKJmI1hQJBxbw,UPVNtJBaLt5s5n8_crAQag,TV2FwNhOTdc-0uLvD-DHQg,2.0,0,0,0,I just received my order delivered via Postmat...,2019-09-06 02:18:35,"Fast Food, Restaurants, Salad, Pizza",1
1,LNHQzsaI7J3oJc7IXnGT8Q,a---NC4J5BExqVvj2MubLA,gzzl_-bVtlCEyn34WtQ72g,2.0,0,0,0,Bubble World has gone down over the years. The...,2019-03-27 02:28:26,"Bubble Tea, Restaurants, Taiwanese, Coffee & T...",1
2,FgTDclq_60B1KqGGw2Royg,SdMVWxstaq8vML3EBenfAQ,p6RnfILI0jImkLz6uPiSqg,4.0,1,0,0,I've passed this restaurant in a small strip m...,2019-07-05 12:49:36,"Mediterranean, Restaurants, Halal, Indian",2
3,TmSbJvBacoD2-ai3C3rb2A,IWx9GNcJqK9yUk6gNLoYHQ,lygLAJtC-Oqz-UyCKUQ01A,5.0,0,0,0,"Outstanding! Food tasted great, GIANT portIons...",2019-12-11 17:27:31,"Restaurants, Chinese",1
4,UdmrtwRa-f0Rfe4W_eBxqg,v6QcUYdONxzc4U6IerCmVw,J6yW_6qMU_UZMJtq6AbNIw,1.0,0,0,0,One star for poor management skills and OK foo...,2020-02-29 01:19:33,"Burgers, American (New), Bars, Restaurants, Ni...",2


### Preprocessing and n-grams

The code provided for preprocessing the reviews is largely the same as the code we've been using. The nested for loop at the end of the next cell is from [here](https://stackoverflow.com/a/52266292). This adds capitalized and fully uppercase versions of each stop word (e.g., "The" and "THE") to the set of stop words from `spacy`. The second-to-last line of the <tt>preprocess_doc</tt> function definition splits each review on whitespace and keeps each token that is at least two characters and isn't in the set of stop words. That repeats something we do inside the earlier list comprehension involving the spacy language model we've loaded as <tt>nlp</tt>; we do this again because there may be tokens that, after removing non-alphabetic characters, look like stop words or are now a single character.

In [34]:
def fix_ordinal_nums(word: str) -> str:
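    ord_num_reg = r"\d+(?:st|nd|rd|th)"  # digits followed by an ordinal suffix
    try:
        if re.search(ord_num_reg, word):
            # strip only a suffix that directly follows a digit, e.g. "21st" -> "21"
            word = re.sub(r"(?<=\d)(?:st|nd|rd|th)", "", word)
            word = num2words(word, lang="en", to="ordinal")
            
        return word
    
    except Exception:
        # leave the token as it is if num2words cannot convert it
        return word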


def preprocess_doc(doc: str) -> str:
    """
    Tokenize, lemmatize, remove stop words, 
    remove non-alphabetic characters.
    """
    doc = unidecode(str(doc))
    doc = contractions.fix(doc)
    doc = [word.lemma_ for word in nlp(doc) if not word.is_stop and (len(word.text) > 1)]
    doc = " ".join([fix_ordinal_nums(word) for word in doc])
    doc = re.sub("[^a-z]", " ", doc.lower())
    doc = " ".join([word for word in doc.split() if len(word) > 1 and word not in STOP_WORDS])
    
    return re.sub(r"\s+", " ", doc).strip()
    

nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

# https://stackoverflow.com/a/52266292
for word in STOP_WORDS:
    for w in (word, word.capitalize(), word.upper()):
        lex = nlp.vocab[w]
        lex.is_stop = True

In [35]:
%%time

sample["preprocessed"] = [preprocess_doc(doc) for doc in sample.text]

Wall time: 2min 15s
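
To see why the second filtering pass is needed, here is a minimal sketch of just the last two steps of the preprocessing. The function name `second_pass` and the three-word stop list are hypothetical; the notebook itself uses spaCy's much larger `STOP_WORDS` set.

```python
import re

# Small hypothetical stop-word set, standing in for spaCy's STOP_WORDS
toy_stop_words = {"the", "and", "to"}


def second_pass(text: str) -> str:
    """Replace non-letters with spaces, then drop short tokens and stop words."""
    text = re.sub(r"[^a-z]", " ", text.lower())
    kept = [w for w in text.split() if len(w) > 1 and w not in toy_stop_words]
    return " ".join(kept)


# "b4" becomes "b" (too short), "2the" becomes "the" (a stop word),
# and "good-to-go" becomes three tokens, one of which is a stop word
cleaned = second_pass("b4 2the pizza good-to-go")
print(cleaned)  # pizza good go
```

None of the original tokens ("b4", "2the", "good-to-go") is a stop word or a single character, so a filter applied before the non-alphabetic characters are removed would have kept them all.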


In [36]:
def train_ngram_model(docs: list, min_count: int=5, inc_trigrams: bool=True) -> list:
    """Returns documents with n-grams joined by underscores"""
    docs = [doc for doc in docs if doc] # the "if doc" condition removes empty token lists (docs with no words)
    bigram_model = Phrases(docs, min_count=min_count)
    ngrams = bigram_model[docs]
    ngrams = list(ngrams)
    if inc_trigrams:
        trigram_model = Phrases(ngrams, min_count=min_count)
        ngrams = trigram_model[ngrams]
        ngrams = list(ngrams)
    return ngrams

In [37]:
sample.preprocessed = sample.preprocessed.apply(str.split)
%time sample["ngrams"] = train_ngram_model(sample.preprocessed, min_count=25)

Wall time: 7.17 s


In [38]:
sample.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,categories,price_tier,preprocessed,ngrams
0,_FD0i3UCDEKJmI1hQJBxbw,UPVNtJBaLt5s5n8_crAQag,TV2FwNhOTdc-0uLvD-DHQg,2.0,0,0,0,I just received my order delivered via Postmat...,2019-09-06 02:18:35,"Fast Food, Restaurants, Salad, Pizza",1,"[receive, order, deliver, postmates, pizza, pe...","[receive, order, deliver, postmates, pizza, pe..."
1,LNHQzsaI7J3oJc7IXnGT8Q,a---NC4J5BExqVvj2MubLA,gzzl_-bVtlCEyn34WtQ72g,2.0,0,0,0,Bubble World has gone down over the years. The...,2019-03-27 02:28:26,"Bubble Tea, Restaurants, Taiwanese, Coffee & T...",1,"[bubble, world, year, menu, extensive, new, bu...","[bubble, world, year, menu, extensive, new, bu..."
2,FgTDclq_60B1KqGGw2Royg,SdMVWxstaq8vML3EBenfAQ,p6RnfILI0jImkLz6uPiSqg,4.0,1,0,0,I've passed this restaurant in a small strip m...,2019-07-05 12:49:36,"Mediterranean, Restaurants, Halal, Indian",2,"[pass, restaurant, small, strip, mall, week, n...","[pass, restaurant, small, strip_mall, week, no..."
3,TmSbJvBacoD2-ai3C3rb2A,IWx9GNcJqK9yUk6gNLoYHQ,lygLAJtC-Oqz-UyCKUQ01A,5.0,0,0,0,"Outstanding! Food tasted great, GIANT portIons...",2019-12-11 17:27:31,"Restaurants, Chinese",1,"[outstanding, food, taste, great, giant, porti...","[outstanding, food, taste, great, giant, porti..."
4,UdmrtwRa-f0Rfe4W_eBxqg,v6QcUYdONxzc4U6IerCmVw,J6yW_6qMU_UZMJtq6AbNIw,1.0,0,0,0,One star for poor management skills and OK foo...,2020-02-29 01:19:33,"Burgers, American (New), Bars, Restaurants, Ni...",2,"[star, poor, management, skill, ok, food, salm...","[star, poor, management, skill, ok, food, salm..."


### Saving the Sample

We'll save our sample to disk, keeping only the <tt>stars</tt>, <tt>text</tt>, <tt>ngrams</tt>, <tt>price_tier</tt>, <tt>categories</tt>, and <tt>date</tt> columns.

Notebook 13b has a line that will prompt you to upload a file. Upload the file we create here, <tt>yelp_reviews_sample_for_notebook13b.json</tt>.

In [39]:
sample[["stars", "text", "ngrams", "price_tier", "categories", "date"]].to_json("yelp_reviews_sample_for_notebook13b.json")
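
When `to_json` is called on a dataframe without an `orient` argument, pandas writes the data column by column: each column name maps to a dictionary from row label (as a string) to value. The sketch below builds a made-up two-row example in that layout and rebuilds one dictionary per row using only the standard library. The values are invented, and this is meant only to show the file's structure, not how Notebook 13b reads it.

```python
import json

# Made-up two-row example in the column-oriented layout:
# {column name: {row label as a string: value}}
toy_json = json.dumps({
    "stars": {"0": 2.0, "1": 5.0},
    "price_tier": {"0": 1, "1": 2},
    "ngrams": {"0": ["strip_mall", "pizza"], "1": ["great", "food"]},
})

columns = json.loads(toy_json)

# Rebuild one dict per row by collecting each column's value for that row label
row_labels = sorted(columns["stars"], key=int)
rows = [
    {col: values[label] for col, values in columns.items()}
    for label in row_labels
]
print(rows[0])  # {'stars': 2.0, 'price_tier': 1, 'ngrams': ['strip_mall', 'pizza']}
```

Note that the lists in the `ngrams` column survive as JSON arrays, which is one reason JSON is a convenient format here.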