## Lab Assignment: Analytics on Glassdoor Reviews and Yelp Category Data

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023

---

### OBJECTIVE  

In this assignment, you will perform some basic analytics on review and category data.  
This will entail performing operations on *RDDs*, and using *list comprehensions*.   

As this assignment covers RDDs, do not use DataFrames.

Read in the dataset and perform the steps requested below.

#### TOTAL POINTS = 10

---

### Config Setup

In [1]:
from pyspark.sql import SparkSession

# Local-mode session; the memory and core values are the lab's settings
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("review_and_category_analytics") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "4") \
    .config("spark.cores.max", "4") \
    .config("spark.driver.memory", "1g") \
    .getOrCreate()

sc = spark.sparkContext

Read in the dataset  

In [2]:
df = sc.textFile("reviews_and_categories.csv")

Get non-header records  

In [3]:
header_record = df.first()
df_no_header = df.filter(lambda record : record != header_record)

Print the first 2 records (note: exclude the header in all calculations)
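No cell is shown for this step. On an RDD, `take(2)` would return the first two records. As a rough plain-Python mirror of the header removal followed by taking two records, the sketch below uses made-up lines; the header text and the sample rows are assumptions, not the real file contents.

```python
# Hypothetical lines standing in for the contents of the CSV file
lines = [
    "id,review,categories",                      # assumed header, not the real one
    '1,"Great team","[\'food\']"',
    '2,,"[\'store\']"',
    '3,"Flexible hours","[\'restaurant\']"',
]

header_record = lines[0]
# Same predicate as the RDD filter: drop every line equal to the header
no_header = [line for line in lines if line != header_record]

first_two = no_header[:2]   # plays the role of take(2)
print(first_two)
```

Note that the filter removes every line identical to the header, not only the first one, which is harmless as long as no data row repeats the header text.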

**1) get a record count (2 POINTS)**

In [4]:
df_no_header.count()

1302

**2) get a count of records with non-missing reviews (2 POINTS)**

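We split each record on commas and look at the second field, i.e. the text after the first comma, where the review should be. If that field is an empty string, we treat the review as missing and filter the row out, leaving only the rows with non-missing reviews. Because reviews can themselves contain commas, index 1 holds only the first fragment of the review, which is still enough to detect an empty one.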

In [5]:
df_no_header.filter(lambda record : record.split(",")[1] != '').count()

305
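To see why index 1 corresponds to the review field, the comma split can be tried on toy records in plain Python. The two sample strings are invented for illustration, and the assumption that a missing review shows up as two adjacent commas is not confirmed by the data shown here.

```python
# Hypothetical records shaped like the dataset rows: id, review, categories
sample_records = [
    '25,"Nice place, flexible hours","[\'food\', \'store\']"',
    '26,,"[\'restaurant\']"',   # assumed form of a row with a missing review
]

def has_review(record):
    # The review is the second comma-separated field (index 1).
    # Only its first fragment is inspected, which suffices to detect emptiness.
    return record.split(",")[1] != ''

non_missing = [r for r in sample_records if has_review(r)]
print(len(non_missing))
```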

**3) Return the count of records where review contains the word *flexible*  (1 POINT)**

We split on the open square bracket to separate the review text from the categories, just in case "flexible" appears in the categories. This will also be useful later when we are performing analysis on the categories.

In [6]:
df_categories_separate = df_no_header.map(lambda record : record.split("["))

In [7]:
df_categories_separate.filter(lambda record : "flexible" in record[0]).count()

36
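The effect of splitting on the open square bracket can be checked on a single made-up record in which "flexible" occurs only inside the category list. The record below is hypothetical and only illustrates the idea.

```python
# Hypothetical record: "flexible" appears only in the category list
record = '40,"Good pay and nice team","[\'flexible seating\', \'food\']"'

parts = record.split("[")
review_part, categories_part = parts[0], parts[1]

in_whole_record = "flexible" in record        # matches the category text
in_review_only = "flexible" in review_part    # ignores the categories

print(in_whole_record, in_review_only)
```

Searching only `review_part` avoids counting this record, whereas searching the whole line would include it.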

Print the records where review contains the word *flexible*

In [8]:
df_categories_separate.filter(lambda record : "flexible" in record[0]).collect()

[['25,"Nice to work for but no room to advance Pros Gave me lots of hours and very flexible with my college school schedule. Always allowed me to work as many hours as I could. Cons When asked for a raise, there was no question about it, the answer was no. I feel like this is a dead end job where you can t advance to become something better","',
  '\'credit cards\', \'las vegas\', \'restaurants\', \'no outdoor seating\', \'price\', \'store\', \'hospitality\', \'hot dogs\', \'dogs\', \'restaurant chains\', \'fast food restaurant\', \'fast food\', \'point of interest\', \'restaurant\', \'sandwiches\', \'california\', \'los angeles\', \'chili\', \'establishment\', \'other\', \'food\']"'],
 ['30,"Great, Very flexible and understanding. Pros Good food, great people. flexible hours. Cons Limited work hours every week.","',
  '\'chicken wings\', \'meal takeaway\', \'store\', \'point of interest\', \'restaurant\', \'sandwiches\', \'pizza\', \'establishment\', \'other\', \'food\']"'],
 ['31,Hig

**4) Lowercase all reviews, then return the count of records where review contains the word *flexible* (1 POINT)**

In [9]:
def process_review(review):
    review_lower = review.lower()
    return "flexible" in review_lower

In [10]:
df_categories_separate.filter(lambda record : process_review(record[0])).count()

44
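The count rises from 36 to 44 because a lowercase search misses reviews that contain the word only in a capitalized form such as "Flexible". A small plain-Python sketch of the difference, using invented review strings:

```python
# Hypothetical review texts with different capitalizations
reviews = [
    "Flexible schedule",
    "very flexible hours",
    "FLEXIBLE management",
    "no room to advance",
]

case_sensitive = [r for r in reviews if "flexible" in r]
case_insensitive = [r for r in reviews if "flexible" in r.lower()]

print(len(case_sensitive), len(case_insensitive))
```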

**5) Return the top 10 most frequent categories  (4 POINTS)**  

Preprocess the categories by:  
* stripping the characters `[`, `]`, `'`, and `"`
* trimming spaces before and after words
* lowercasing
* removing blank categories

NOTE: Be sure to keep terms together, for example 'jet skiing' should not become 'jet', 'skiing'
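The note about keeping terms together comes down to which delimiter is used for splitting. The sketch below contrasts a whitespace split with a comma split on a made-up category string. It uses `str.strip` with a character set rather than chained `replace` calls, which is one possible approach and not necessarily the intended one.

```python
# Hypothetical tail of a record after splitting on "["
raw = "'Jet Skiing', 'boat rental', 'food']\""

# Splitting on whitespace breaks multi-word terms apart
by_whitespace = raw.split()

# Splitting on commas keeps each term whole; strip() then removes
# surrounding spaces, brackets, and quote characters from each piece
by_comma = [c.strip(" []'\"").lower() for c in raw.split(",")]

print(by_whitespace)
print(by_comma)
```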

In [52]:
def process_categories(categories):
    """Return a list of cleaned, lowercased category strings (may include blanks)."""
    # Remove the list delimiters and quote characters
    clean = categories.strip()
    for char in ["[", "]", "'", '"']:
        clean = clean.replace(char, "")

    # Split on commas only, so multi-word terms such as 'jet skiing' stay together
    return [cat.strip().lower() for cat in clean.split(",")]

In [67]:
df_categories_separate.flatMap(lambda record : process_categories(record[1])) \
                      .filter(lambda cat : cat != '') \
                      .map(lambda x: (x, 1)) \
                      .reduceByKey(lambda x,y : x+y) \
                      .map(lambda x : (x[1],x[0])) \
                      .sortByKey(False) \
                      .take(10)

[(717, 'point of interest'),
 (717, 'establishment'),
 (716, 'food'),
 (659, 'restaurant'),
 (496, 'price'),
 (482, 'other'),
 (331, 'credit cards'),
 (311, 'menus'),
 (291, 'eating places'),
 (274, 'dining options')]
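The pipeline above follows the word-count pattern: emit `(category, 1)` pairs, sum per key, then sort by count. As a local cross-check of that logic, a plain-Python equivalent using `collections.Counter` might look like the sketch below. The category lists are invented, and this is a sanity check rather than a replacement for the RDD solution.

```python
from collections import Counter

# Hypothetical cleaned category lists, one per record
category_lists = [
    ["food", "restaurant", "pizza"],
    ["food", "store", ""],          # blank category to be dropped
    ["food", "restaurant"],
]

# Flatten, drop blanks, and count occurrences per category
counts = Counter(
    cat for cats in category_lists for cat in cats if cat != ""
)

# (count, category) pairs, mirroring the shape of the RDD output
top = [(n, cat) for cat, n in counts.most_common(2)]
print(top)
```

When two categories share the same count, as 'point of interest' and 'establishment' do above, sorting by count alone does not fix their relative order.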