# Homework 1 (Due Tuesday, March 22, 2022 at 6:29pm PST)

**Rubric**
* Identified 4 major themes from the reviews (2pts)
* Regex that groups / cleans the reviews is correctly implemented (4pts)
* Word count is correctly implemented (2pts)
* Analysis of recommendations and pitfalls/limitations are specific enough to be actionable (2pts)

Not actionable recommendation:
* *The store managers should consider trying to improve the drive through experience to be more pleasant for customers*

More actionable recommendation:
* *Drive throughs are mentioned 23% of the time in reviews, and often focus on how slow the service is. We recommend adopting parallel drive through stations for Atlanta and Chicago*

You are a business analyst working for McDonald's. First, read through the reviews in `mcdonalds-yelp-negative-reviews.csv` (found in the `datasets` folder).

1. Identify 4 recurring themes/topics that reviewers are unhappy with. For example, one theme is that users are consistently unhappy with the drive-through experience.

2. Next, using regex, group together all occurrences of these phrases. For example, `drive-thru`, `drive through`, `drivethrough` can all be replaced as `_DRIVE_THROUGH_`.

3. Perform a word count, both overall, and broken out by city.

4. **Provide a few sentences with your findings and business recommendations.** Make any assumptions you'd like. I just want you to get into the habit of "finishing" your analysis: to avoid delivering technical numbers to a non-technical manager.

Some considerations in your analysis:

* Explain what some of the **pitfalls/limitations** are of using only a word count analysis to make these inferences. What additional research/steps would you need to do to verify your conclusions?

**Submit everything as a new notebook: send the HW as an attachment in a Slack group direct message to me (Yu Chen) and the TAs (Mengqi Tan and Siyuan Ni).**

**NOTE**: Name the notebook `lastname_firstname_HW1.ipynb`.

Every day late is -10%.

In [1]:
import pandas as pd
import re
from collections import Counter

In [2]:
text = open('../datasets/mcdonalds-yelp-negative-reviews.csv', 'r')
print(text)

<_io.TextIOWrapper name='../datasets/mcdonalds-yelp-negative-reviews.csv' mode='r' encoding='UTF-8'>


In [3]:
text.readline()

'_unit_id,city,review\n'

In [4]:
text_df = pd.read_csv('../datasets/mcdonalds-yelp-negative-reviews.csv', encoding_errors='ignore')
text_df

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."
...,...,...,...
1520,679500008,Portland,I enjoyed the part where I repeatedly asked if...
1521,679500224,Houston,Worst McDonalds I've been in in a long time! D...
1522,679500608,New York,"When I am really craving for McDonald's, this ..."
1523,679501257,Chicago,Two points right out of the gate: 1. Thuggery ...


<div class='alert-danger'>
    <p> Q:
    <p> When should the raw file handle (<i>text</i>) be used, and when the DataFrame (<i>text_df</i>)?
    <p> How can I tell whether the file is too large to load into memory?
    </div>
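
One way to approach the memory question above, as a minimal sketch rather than a definitive answer: a DataFrame is convenient when the whole file fits in memory (this one has roughly 1,500 rows, so it does), while a file handle or chunked reading suits files that do not. `pd.read_csv` accepts a `chunksize` argument and then yields one DataFrame per chunk. The in-memory `sample_csv` and the helper `count_reviews_by_city` are hypothetical names used only for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the real CSV so the sketch runs on its own.
sample_csv = io.StringIO(
    "_unit_id,city,review\n"
    "1,Atlanta,Slow drive-thru\n"
    "2,Chicago,Cold fries\n"
    "3,Atlanta,Rude staff\n"
)

def count_reviews_by_city(source, chunksize=2):
    """Count reviews per city while holding only `chunksize` rows at a time."""
    totals = Counter()
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Each chunk is an ordinary DataFrame with at most `chunksize` rows.
        totals.update(chunk["city"].value_counts().to_dict())
    return totals

reviews_per_city = count_reviews_by_city(sample_csv)
print(reviews_per_city)
```

For a DataFrame that is already loaded, `text_df.memory_usage(deep=True).sum()` reports its footprint in bytes, which gives a rough sense of whether chunking is needed.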

In [178]:
text_df.groupby('city')[['review']].count()

city,review
Atlanta,130
Chicago,219
Cleveland,71
Dallas,75
Houston,105
Las Vegas,409
Los Angeles,167
New York,165
Portland,97


## task1

Identify 4 recurring themes/topics that reviewers are unhappy with.

In [32]:
# review_com = ''
# for r in data['review']:
#     review_com += '\n\n'+r
# review_com

In [None]:
# def count_words(lines, delimiter=" "):
#     words = Counter() # instantiate a Counter object called words
#     for line in lines:
#         for word in line.split(delimiter):
#             words[word] += 1 # increment count for word
#     return words

In [7]:
def count_words(doc):
    counts = Counter()
    for r in doc:
        # Tokens of 2+ word characters; case is preserved, so 'The' and 'the' are counted separately.
        counts_tmp = Counter(re.findall(r'\w\w+', r))
        counts += counts_tmp
    return counts

In [9]:
count_rough = count_words(text_df['review'])
# sorted(count_rough, key=lambda x: -count_rough[x])

In [10]:
count_rough.most_common(150)

[('the', 6237),
 ('and', 4137),
 ('to', 4030),
 ('of', 2005),
 ('is', 1918),
 ('was', 1793),
 ('in', 1788),
 ('it', 1756),
 ('for', 1651),
 ('this', 1427),
 ('my', 1421),
 ('that', 1327),
 ('you', 1252),
 ('they', 1232),
 ('at', 1019),
 ('have', 950),
 ('on', 908),
 ('not', 897),
 ('but', 858),
 ('me', 857),
 ('The', 845),
 ('order', 838),
 ('food', 832),
 ('McDonald', 822),
 ('with', 815),
 ('are', 707),
 ('one', 682),
 ('get', 665),
 ('there', 656),
 ('drive', 650),
 ('be', 648),
 ('so', 636),
 ('up', 586),
 ('here', 580),
 ('had', 566),
 ('just', 540),
 ('time', 525),
 ('or', 516),
 ('go', 511),
 ('out', 495),
 ('like', 483),
 ('service', 481),
 ('no', 468),
 ('as', 468),
 ('thru', 467),
 ('It', 464),
 ('place', 460),
 ('This', 455),
 ('when', 451),
 ('were', 445),
 ('McDonalds', 443),
 ('can', 423),
 ('your', 413),
 ('all', 405),
 ('only', 397),
 ('what', 396),
 ('if', 387),
 ('because', 382),
 ('location', 382),
 ('their', 373),
 ('we', 371),
 ('about', 366),
 ('been', 365),
 ('an

<div class="alert-info">
    Roughly speaking, complaints seem to cluster around <b>('drive', 650), ('time', 525), ('location', 382), ('fries', 295), ('coffee', 254), ('breakfast', 174)</b>
    </div>
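
The raw counts above are dominated by function words, and `The` and `the` are tallied separately because tokens keep their original case. A minimal sketch of one way to surface content words, assuming a small hand-picked stop-word list (`STOP_WORDS`, `count_content_words` and `sample_reviews` are hypothetical, and the list is far from exhaustive):

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list for illustration only.
STOP_WORDS = {"the", "and", "to", "of", "is", "was", "in", "it", "for", "this", "my"}

def count_content_words(docs, stop_words=STOP_WORDS):
    """Count lowercased tokens of 2+ word characters, skipping stop words."""
    counts = Counter()
    for doc in docs:
        tokens = re.findall(r"\w\w+", doc.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

sample_reviews = [
    "The fries were cold",
    "the drive thru was slow and the fries were old",
]
content_counts = count_content_words(sample_reviews)
print(content_counts.most_common(3))
```

Lowercasing merges case variants, but it does not help with misspellings or synonyms, so manual reading of the reviews is still needed.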

<div class='alert-danger'>
    <p> TO-DO:
    <p> 1) How to tell when <i>time</i> means <i>slow</i>? How to fold slowness-related uses of <i>time</i> into <i>slow</i> while excluding the unrelated ones?
    <p> 2) How to aggregate terms that belong to the same topic (e.g. slow, time, long, xx minutes)?
    <p> 3) How to identify survivor bias (e.g. people may complain about fries more simply because they order fries more)?
    </div>
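
For item 2 of the TO-DO list, one possible approach is to collect several expressions for the same topic in a single alternation. The sketch below is an assumption-laden starting point: the patterns in `SLOW_PATTERNS` are hypothetical guesses, not derived from the data, and bare `time` is deliberately left out because it is ambiguous (item 1).

```python
import re

# Hypothetical patterns for the "slow" theme; extend after reading more reviews.
SLOW_PATTERNS = [
    r"\bslow(?:ly)?\b",            # slow, slowly
    r"\b\d+\s*min(?:ute)?s?\b",    # "20 minutes", "15min"
    r"\blong\s+(?:wait|line)\b",   # "long wait", "long line"
    r"\btook\s+forever\b",
]
SLOW_RE = re.compile("|".join(SLOW_PATTERNS), flags=re.IGNORECASE)

def tag_slow(review):
    """Replace every slowness-related expression with the _SLOW_ token."""
    return SLOW_RE.sub("_SLOW_", review)

tagged_wait = tag_slow("Waited 20 minutes in a long line")
tagged_other = tag_slow("Worst McDonalds I've been in in a long time!")
print(tagged_wait)
print(tagged_other)
```

The second sample is left unchanged, which illustrates why `long time` was not included: it often has nothing to do with service speed.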

<div class='alert-info'>
    <p> After scanning the reviews manually, the following 4 themes were identified:
    <p> &emsp; - Terrible Drive Through
    <p> &emsp; - Slow Service
    <p> &emsp; - Terrible Food (especially the fries and coffee)
    <p> &emsp; - Terrible Breakfast
    </div>

## task2

Next, using regex, group together all occurrences of these phrases.

In [64]:
def standardize_word(doc, word_orig, word_std, col='review_std'):
    # Replace every case-insensitive regex match of word_orig with word_std.
    # Operates on `col` (the working copy of the reviews) and modifies doc in place.
    doc[col] = doc[col].str.replace(word_orig, word_std, flags=re.IGNORECASE, regex=True)

In [66]:
text_df['review_std'] = text_df['review']
text_df

Unnamed: 0,_unit_id,city,review,review_std
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be...","I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave...","First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo...","Well, it's McDonald's, so you know what the fo..."
...,...,...,...,...
1520,679500008,Portland,I enjoyed the part where I repeatedly asked if...,I enjoyed the part where I repeatedly asked if...
1521,679500224,Houston,Worst McDonalds I've been in in a long time! D...,Worst McDonalds I've been in in a long time! D...
1522,679500608,New York,"When I am really craving for McDonald's, this ...","When I am really craving for McDonald's, this ..."
1523,679501257,Chicago,Two points right out of the gate: 1. Thuggery ...,Two points right out of the gate: 1. Thuggery ...


### Terrible Drive Through

<div class='alert-info'>
    drive-thru, drive through, drivethrough --> <b><i>_DRIVE_THROUGH_</i></b>
    </div>

In [67]:
word_orig, word_std = r'(drive-thru|drivethrough|drive through)', '_DRIVE_THROUGH_'
standardize_word(text_df, word_orig, word_std)
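
The pattern above lists three spellings explicitly. Since the rough word count shows `thru` 467 times, other variants such as `drive thru` (with a space) or `drive-through` may be slipping past it. A hedged sketch of a broader pattern, where `DRIVE_THROUGH_RE` and `tag_drive_through` are hypothetical names:

```python
import re

# "drive", an optional space or hyphen, then "thru" or "through".
DRIVE_THROUGH_RE = re.compile(r"drive[\s-]?(?:thru|through)", flags=re.IGNORECASE)

def tag_drive_through(review):
    """Replace any spelling variant of drive-through with one token."""
    return DRIVE_THROUGH_RE.sub("_DRIVE_THROUGH_", review)

variants = [
    "drive-thru", "Drive Thru", "drivethrough",
    "drive through", "drive-through", "drivethru",
]
tagged_variants = [tag_drive_through(v) for v in variants]
print(tagged_variants)
```

A broader pattern also raises the risk of false positives (for example, a sentence like "I drive through town" would be tagged), so a manual spot check of the matches is worth doing.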

In [68]:
text_df

Unnamed: 0,_unit_id,city,review,review_std
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be...","I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave...","First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo...","Well, it's McDonald's, so you know what the fo..."
...,...,...,...,...
1520,679500008,Portland,I enjoyed the part where I repeatedly asked if...,I enjoyed the part where I repeatedly asked if...
1521,679500224,Houston,Worst McDonalds I've been in in a long time! D...,Worst McDonalds I've been in in a long time! D...
1522,679500608,New York,"When I am really craving for McDonald's, this ...","When I am really craving for McDonald's, this ..."
1523,679501257,Chicago,Two points right out of the gate: 1. Thuggery ...,Two points right out of the gate: 1. Thuggery ...


### Slow Service

<div class='alert-info'>
    slow, slowly, sl-ow, SLOW, ... --> <b><i>_SLOW_</i></b>
    </div>

In [69]:
word_orig, word_std = r'(\bs(?:\-)?l(?:\-)?o(?:\-)?w(?:\-)?(?:ly)?\b)', '_SLOW_'
standardize_word(text_df, word_orig, word_std)

<div class='alert-danger'>
    Q: any simplification?
    </div>
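
On the simplification question: a hyphen needs no escaping outside a character class, and a single optional character needs no non-capturing group, so `(?:\-)?` can be written as `-?`. The outer capturing group is also unnecessary for `re.sub`. The sketch below checks on a few hand-made samples that the shorter pattern behaves the same way; `SIMPLIFIED`, `EXTENDED` and `samples` are hypothetical names, and `EXTENDED` is an optional variant that also catches `slower` and `slowest`.

```python
import re

ORIGINAL = r'(\bs(?:\-)?l(?:\-)?o(?:\-)?w(?:\-)?(?:ly)?\b)'
SIMPLIFIED = r'\bs-?l-?o-?w-?(?:ly)?\b'
# Optional extension: also match the comparative and superlative forms.
EXTENDED = r'\bs-?l-?o-?w-?(?:ly|er|est)?\b'

def tag(pattern, text):
    return re.sub(pattern, '_SLOW_', text, flags=re.IGNORECASE)

samples = ["slow", "SLOWLY", "s-l-o-w", "sl-ow service", "slower", "snow"]

# True when both patterns produce identical output on every sample.
same_on_samples = all(tag(ORIGINAL, s) == tag(SIMPLIFIED, s) for s in samples)
print(same_on_samples)
print(tag(SIMPLIFIED, "slower"), tag(EXTENDED, "slower"))
```

Agreement on a handful of samples is not a proof of equivalence, but here the two patterns differ only in notation, not in what they describe.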

### Terrible food (especially the fries and coffee)

#### fries

<div class='alert-info'>
    fries --> <b><i>_FRIES_</i></b>
    </div>

In [70]:
word_orig, word_std = r'(fries)', '_FRIES_'
standardize_word(text_df, word_orig, word_std)

#### coffee

<div class='alert-info'>
    coffee --> <b><i>_COFFEE_</i></b>
    </div>

In [71]:
word_orig, word_std = r'(coffee)', '_COFFEE_'
standardize_word(text_df, word_orig, word_std)

### Terrible Breakfast

<div class='alert-info'>
    breakfast, Breakfast, ... --> <b><i>_BREAKFAST_</i></b>
    </div>

In [72]:
word_orig, word_std = r'(breakfast)', '_BREAKFAST_'
standardize_word(text_df, word_orig, word_std)

## task3

Perform a word count, both overall, and broken out by city.

In [80]:
def count_certain_word_overall(doc, word):
    counts = Counter()
    for r in doc:
        counts_tmp = Counter(re.findall(word, r))
        counts += counts_tmp
    return counts

In [167]:
def count_certain_word_bycity(df, word):
    result = pd.DataFrame(columns=['city', 'count'])
    for c in df['city'].unique():
        df_tmp = df.loc[df['city']==c]
        counts = Counter()
        for r in df_tmp['review_std']:
            counts_tmp = Counter(re.findall(word, r))
            counts += counts_tmp
#         result.append({'city': c, 'count': list(counts.values())[0]}, ignore_index=True)
        result = result.append({'city': c, 'count': counts.values()}, ignore_index=True)
    return result
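
A possible alternative to the loop above, offered as a sketch under a few assumptions. `DataFrame.append` was removed in pandas 2.0, so the loop only runs on older versions, and storing `counts.values()` puts a view object in each cell instead of an integer. The hypothetical helper `count_token_by_city` below uses `str.count` with `groupby` and labels missing cities as `"Unknown"` (that label is an assumption for illustration).

```python
import re

import pandas as pd

def count_token_by_city(df, token, text_col="review_std", city_col="city"):
    """Return a DataFrame with one row per city and an integer `count` column."""
    # Occurrences of the token in each review (token is escaped, so it is literal).
    per_review = df[text_col].str.count(re.escape(token))
    # Group by city, keeping rows with a missing city under an explicit label.
    return (
        per_review.groupby(df[city_col].fillna("Unknown"))
        .sum()
        .rename("count")
        .reset_index()
    )

# Hypothetical sample data standing in for text_df.
sample = pd.DataFrame({
    "city": ["Atlanta", "Atlanta", "Chicago", None],
    "review_std": ["_SLOW_ and _SLOW_", "fine", "_SLOW_ line", "_SLOW_"],
})
by_city = count_token_by_city(sample, "_SLOW_")
print(by_city)
```

Because the result holds plain integers, it can be sorted or divided by the number of reviews per city without further unpacking.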

### _DRIVE_THROUGH_

#### overall

In [111]:
count_drive = count_certain_word_overall(text_df['review_std'], '_DRIVE_THROUGH_')
count_drive

Counter({'_DRIVE_THROUGH_': 290})

#### by city

In [168]:
count_drive_city = count_certain_word_bycity(text_df, '_DRIVE_THROUGH_')
count_drive_city

Unnamed: 0,city,count
0,Atlanta,(30)
1,Las Vegas,(83)
2,Dallas,(15)
3,Portland,(20)
4,Chicago,(24)
5,Cleveland,(16)
6,Houston,(22)
7,Los Angeles,(52)
8,New York,(1)
9,,()


### _SLOW_

#### overall

In [82]:
count_slow = count_certain_word_overall(text_df['review_std'], '_SLOW_')
count_slow

Counter({'_SLOW_': 147})

#### by city

In [169]:
count_slow_city = count_certain_word_bycity(text_df, '_SLOW_')
count_slow_city

Unnamed: 0,city,count
0,Atlanta,(28)
1,Las Vegas,(39)
2,Dallas,(5)
3,Portland,(3)
4,Chicago,(17)
5,Cleveland,(10)
6,Houston,(14)
7,Los Angeles,(13)
8,New York,(11)
9,,()


### _FRIES_

#### overall

In [170]:
count_fries = count_certain_word_overall(text_df['review_std'], '_FRIES_')
count_fries

Counter({'_FRIES_': 314})

#### by city

In [171]:
count_fries_city = count_certain_word_bycity(text_df, '_FRIES_')
count_fries_city

Unnamed: 0,city,count
0,Atlanta,(32)
1,Las Vegas,(89)
2,Dallas,(6)
3,Portland,(27)
4,Chicago,(49)
5,Cleveland,(5)
6,Houston,(23)
7,Los Angeles,(28)
8,New York,(40)
9,,()


### _COFFEE_

#### overall

In [172]:
count_coffee = count_certain_word_overall(text_df['review_std'], '_COFFEE_')
count_coffee

Counter({'_COFFEE_': 284})

#### by city

In [173]:
count_coffee_city = count_certain_word_bycity(text_df, '_COFFEE_')
count_coffee_city

Unnamed: 0,city,count
0,Atlanta,(17)
1,Las Vegas,(74)
2,Dallas,(12)
3,Portland,(11)
4,Chicago,(53)
5,Cleveland,(6)
6,Houston,(17)
7,Los Angeles,(36)
8,New York,(36)
9,,()


### _BREAKFAST_

#### overall

In [174]:
count_breakfast = count_certain_word_overall(text_df['review_std'], '_BREAKFAST_')
count_breakfast

Counter({'_BREAKFAST_': 190})

#### by city

In [175]:
count_breakfast_city = count_certain_word_bycity(text_df, '_BREAKFAST_')
count_breakfast_city

Unnamed: 0,city,count
0,Atlanta,(17)
1,Las Vegas,(77)
2,Dallas,(5)
3,Portland,(4)
4,Chicago,(29)
5,Cleveland,(6)
6,Houston,(16)
7,Los Angeles,(17)
8,New York,(9)
9,,()


### AGGREGATION
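
One way the per-theme counts could be combined into a single city-by-theme table, as a minimal sketch. `THEME_TOKENS` lists the tokens introduced in task2; the helper `theme_counts_by_city` and the `sample` data are hypothetical, and rows with a missing city are dropped by `groupby` by default.

```python
import pandas as pd

THEME_TOKENS = ["_DRIVE_THROUGH_", "_SLOW_", "_FRIES_", "_COFFEE_", "_BREAKFAST_"]

def theme_counts_by_city(df, tokens=THEME_TOKENS, text_col="review_std", city_col="city"):
    """Return a table with one row per city and one column per theme token."""
    # One column per token: number of occurrences in each review.
    per_review = pd.DataFrame({tok: df[text_col].str.count(tok) for tok in tokens})
    return per_review.groupby(df[city_col]).sum()

# Hypothetical sample data standing in for text_df.
sample = pd.DataFrame({
    "city": ["Atlanta", "Atlanta", "Chicago"],
    "review_std": [
        "_SLOW_ _DRIVE_THROUGH_ and cold _FRIES_",
        "_SLOW_ _COFFEE_",
        "_FRIES_ _FRIES_ at _BREAKFAST_",
    ],
})
theme_table = theme_counts_by_city(sample)
print(theme_table)
```

Raw counts favor cities with more reviews (Las Vegas has 409 here, Cleveland 71), so dividing each row by the city's review count is a natural next step before comparing cities.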

## task4

Provide a few sentences with your findings and business recommendations.

Not actionable recommendation:
* *The store managers should consider trying to improve the drive through experience to be more pleasant for customers*

More actionable recommendation:
* *Drive throughs are mentioned 23% of the time in reviews, and often focus on how slow the service is. We recommend adopting parallel drive through stations for Atlanta and Chicago*
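
A statement like "mentioned 23% of the time" in the example above is a share of reviews, not a raw token count. A hedged sketch of how such a share could be computed, overall and by city, is shown below. The helper `share_of_reviews` and the `sample` data are hypothetical, and the 23% figure belongs to the assignment's example, not to this analysis.

```python
import pandas as pd

def share_of_reviews(df, token, text_col="review_std", city_col="city"):
    """Return (overall share, per-city share) of reviews containing the token."""
    # 1 if the review mentions the token at least once, else 0.
    mentions = df[text_col].str.contains(token, regex=False, na=False).astype(int)
    overall = mentions.mean()
    by_city = mentions.groupby(df[city_col]).mean()
    return overall, by_city

# Hypothetical sample data standing in for text_df.
sample = pd.DataFrame({
    "city": ["Atlanta", "Atlanta", "Chicago", "Chicago"],
    "review_std": [
        "_DRIVE_THROUGH_ was slow",
        "cold food",
        "_DRIVE_THROUGH_ again",
        "_DRIVE_THROUGH_ _DRIVE_THROUGH_",
    ],
})
overall_share, share_by_city = share_of_reviews(sample, "_DRIVE_THROUGH_")
print(f"{overall_share:.0%}")
print(share_by_city)
```

Counting each review once keeps a single long rant from inflating a theme, which is one of the limitations of plain word counts.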