# Homework 1 Feedback

### Use Word Boundaries For Regex Patterns
Since we are using regex to group together relevant **words** into themes, we want to make sure that our regex patterns do not match false positives. A common theme selected by students was `TIME`, or how long customers had to wait to receive their orders. A frequent pattern used was `(?:long|forever|waittime|waiting|wait|slow)`.

The problem with this pattern is that it will also [match the following words](regexr.com/6i9nk):
* `await`
* `waiter`
* `along`
* `belong`

These are false positives that can, depending on how frequently they appear, completely alter the final recommendations/findings. Using word boundaries, `\b(?:long|forever|waittime|waiting|wait|slow)\b`, will reduce the [number of false positives matched](regexr.com/6i9nt).
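
As a quick check of the difference, the sketch below runs both versions of the pattern over a made-up sentence (the sentence and the variable names are illustrative assumptions, not taken from the homework data):

```python
import re

# Hypothetical sentence containing some of the false positives listed above.
text = "We await the waiter along the counter; the wait was long."

unbounded = re.compile(r'(?:long|forever|waittime|waiting|wait|slow)')
bounded = re.compile(r'\b(?:long|forever|waittime|waiting|wait|slow)\b')

# Without boundaries, "await", "waiter", and "along" each contribute a match.
unbounded_matches = unbounded.findall(text)
# With boundaries, only the standalone words "wait" and "long" match.
bounded_matches = bounded.findall(text)

print(unbounded_matches)
print(bounded_matches)
```

The unbounded pattern should report five matches here, while the bounded one keeps only the two genuine ones.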

### Count the Number of Reviews a Theme Appears In, Not the Absolute Frequency

Consider a review that has the following text:
> This store is slow, slow, slow. I can't imagine how long we have spent waiting in line trying to get our order. The worst.

Now, pretend we have the regex `\b(?:long|forever|waittime|waiting|wait|slow)\b` and we substitute each match with `_TIME_`:
```python
import re

review = "This store is slow, slow, slow. I can't imagine how long we have spent waiting in line trying to get our order. The worst."
TIME_PATTERN: re.Pattern = re.compile(r'\b(?:long|forever|waittime|waiting|wait|slow)\b')
new_text = re.sub(TIME_PATTERN, '_TIME_', review)
```
This will lead to `new_text` being
```
"This store is _TIME_, _TIME_, _TIME_. I can't imagine how _TIME_ we have spent _TIME_ in line trying to get our order. The worst."
```
Then, if we actually perform a word count, we'll see that this review has `_TIME_` counted 5 times:

```python
import re
import pandas as pd

review = "This store is slow, slow, slow. I can't imagine how long we have spent waiting in line trying to get our order. The worst."
TIME_PATTERN: re.Pattern = re.compile(r'\b(?:long|forever|waittime|waiting|wait|slow)\b')
new_text = re.sub(TIME_PATTERN, '_TIME_', review)

reviews = pd.Series([new_text, "another review", "one more review"])
reviews.str.count('_TIME_')
```
```
0    5
1    0
2    0
dtype: int64
```
```python
reviews.str.count('_TIME_').sum()
```
```
5
```

This weights the review 5x with respect to the theme of `_TIME_`, overcounting how often the theme actually comes up across reviews. For this use case, it usually makes more sense to count the **number of reviews** that `_TIME_` appears in. We can do this using the `str.findall()` or `str.contains()` methods of a pandas Series:

```python
original_reviews = pd.Series([review, "another review", "one more review"])
original_reviews.str.findall(TIME_PATTERN)
```
```
0    [slow, slow, slow, long, waiting]
1                                   []
2                                   []
dtype: object
```
```python
original_reviews.str.contains(TIME_PATTERN)
```
```
0     True
1    False
2    False
dtype: bool
```
```python
original_reviews.str.contains(TIME_PATTERN).sum()
```
```
1
```
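
The same idea extends to several themes at once. The sketch below is one possible way to do it, assuming a small made-up set of reviews and a hypothetical `SERVICE` pattern (neither comes from the homework data). It builds one boolean column per theme, then reports both the number and the share of reviews that mention each theme:

```python
import re
import pandas as pd

# Hypothetical sample reviews, for illustration only.
sample_reviews = pd.Series([
    "This store is slow, slow, slow. The staff was rude.",
    "Rude cashier, but the food came out fast.",
    "Great fries.",
])

# Theme name -> pattern string. SERVICE is an assumed theme, not from the homework.
theme_patterns = {
    "TIME": r'\b(?:long|forever|waittime|waiting|wait|slow)\b',
    "SERVICE": r'\b(?:rude|staff|cashier)\b',
}

# One boolean column per theme: True if the review mentions the theme at least once.
mentions = pd.DataFrame({
    theme: sample_reviews.str.contains(pattern, flags=re.IGNORECASE, regex=True)
    for theme, pattern in theme_patterns.items()
})

reviews_per_theme = mentions.sum()   # number of reviews mentioning each theme
share_per_theme = mentions.mean()    # fraction of reviews mentioning each theme

print(reviews_per_theme)
print(share_per_theme)
```

Reporting the share alongside the raw count makes cities with different numbers of reviews easier to compare.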

### Reduce False Negatives By Including More Patterns

Some students used very limited regex patterns. For instance, for the theme `FOOD`, some students only provided `\b(burger)\b`. You should use your domain knowledge of McDonald's, along with a read through some of the reviews, to identify the many other patterns you could be including: `\b((?:ham)?burgers?|(?:mc)?muffins?|fries|hash browns?|mcflurr(?:y|ies))\b`.

Remember that the `?:` inside the parentheses indicates that `(?:...)` is a non-capture group. This will become more and more important as we learn about capture groups and extracting phrases.

By only including a few patterns, you are increasing the number of **false negatives** (actual matches that are not grouped properly into the appropriate theme).
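
To make the effect concrete, here is a minimal sketch comparing the narrow and the broader `FOOD` pattern on three made-up reviews (the reviews and variable names are assumptions for illustration). The outer group is written as non-capturing here because `str.contains()` warns when a pattern contains capture groups:

```python
import re
import pandas as pd

# Hypothetical reviews, for illustration only.
food_reviews = pd.Series([
    "The burger was cold.",
    "My McMuffin and hash browns were great.",
    "Fries were soggy and the McFlurry machine was broken.",
])

narrow = r'\b(?:burger)\b'
broad = r'\b(?:(?:ham)?burgers?|(?:mc)?muffins?|fries|hash browns?|mcflurr(?:y|ies))\b'

# Number of reviews each pattern assigns to the FOOD theme.
narrow_hits = food_reviews.str.contains(narrow, flags=re.IGNORECASE).sum()
broad_hits = food_reviews.str.contains(broad, flags=re.IGNORECASE).sum()

print(narrow_hits, broad_hits)
```

In this toy sample, the narrow pattern misses two of the three food-related reviews, which are exactly the false negatives described above.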

### Use Case Flags and Quantifiers to Reduce the Complexity of the Pattern
Some students used regex patterns like `r'\b(?:gross|GROSS|disgusting|DISGUSTING|nasty|NASTYYY|NASTY|nastyyy)\b'`.

You can use `flags=re.IGNORECASE` to make the regex case insensitive: `re.sub(r'\b(?:gross|disgusting|nasty|nastyyy)\b', "_GROSS_", review, flags=re.IGNORECASE)`.

Then, you can also use quantifiers to reduce the expression even further:
`re.sub(r'\b(?:gross|disgusting|nasty{1,3})\b', "_GROSS_", review, flags=re.IGNORECASE)`.

The quantifier `{1,3}` applies only to the final `y`, so `nasty{1,3}` will match `nasty`, `nastyy`, or `nastyyy`, but not `nastyyyy`.
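
A short sketch to verify this behavior (the name `GROSS_PATTERN` and the sample words are illustrative assumptions):

```python
import re

# Case-insensitive pattern; y{1,3} allows one to three trailing "y" characters.
GROSS_PATTERN = re.compile(r'\b(?:gross|disgusting|nasty{1,3})\b', flags=re.IGNORECASE)

samples = ["gross", "GROSS", "Disgusting", "nasty", "NASTYYY", "nastyyyy"]

# Keep only the sample words in which the pattern finds a match.
matched = [word for word in samples if GROSS_PATTERN.search(word)]

print(matched)
```

Every sample except `nastyyyy` should match. The fourth `y` prevents the trailing `\b` from lining up with the end of the quantified run.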

### Don't Repeat Your Code - Keep Things DRY

There is a key principle in programming - `Keep things DRY (Don't Repeat Yourself)`. 

A lot of notebooks looked like this:
```python
atlanta = processed_reviews[processed_reviews.city == "Atlanta"]
atlanta['review'].str.count('COLD').sum()
atlanta['review'].str.count('LATE').sum()
atlanta['review'].str.count('SERVICE').sum()

vegas = processed_reviews[processed_reviews.city == "Las Vegas"]
vegas['review'].str.count('COLD').sum()
vegas['review'].str.count('LATE').sum()
vegas['review'].str.count('SERVICE').sum()
...
```
Refactor this into a function:
```python
from typing import List

import pandas as pd


def count_themes(reviews: pd.DataFrame, themes: List[str], by_city: bool) -> None:
    """
    Prints the number of times each theme appears.

    When by_city is True, the counts are grouped by the "city" column.
    When by_city is False, each column in the themes argument is summed over all reviews.

    reviews: The dataframe with processed reviews. It should have a column for "city".
    themes: A list of column names that correspond to the themes we want to analyze.
    by_city: Whether to break the counts down by city.
    """
    if by_city:
        counts = reviews.groupby("city")[themes].sum()
        print("Themes by city:")
        print(counts)
    else:
        # Sum down each theme column to get one total per theme.
        print("Theme counts:\n", reviews[themes].sum())
```
This is not just a stylistic preference. We often have to make modifications to the data processing logic/behavior. If we repeat the same lines of code over and over again, we have to make changes in multiple different places, instead of only one central source of truth.

<div class="alert-success">
Use docstrings for explanations, and specify input and output types.
</div>
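
The refactored function above expects one column per theme. One possible way to build those columns without repeating yourself is a single loop over the theme names. This is a sketch only, with a made-up `processed_reviews` dataframe standing in for the real data:

```python
import pandas as pd

# Hypothetical processed reviews, for illustration only.
processed_reviews = pd.DataFrame({
    "city": ["Atlanta", "Atlanta", "Las Vegas"],
    "review": ["COLD fries and LATE order", "SERVICE was fine", "LATE again, LATE always"],
})

themes = ["COLD", "LATE", "SERVICE"]

# One loop replaces the repeated per-city, per-theme lines.
for theme in themes:
    # True when the review mentions the theme at least once.
    processed_reviews[theme] = processed_reviews["review"].str.contains(theme, regex=False)

# Number of reviews per city that mention each theme.
counts_by_city = processed_reviews.groupby("city")[themes].sum()
print(counts_by_city)
```

Adding a new theme or a new city now means changing one list or one row of data, not copying another block of lines.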

### Provide Specific, Actionable Insights - Go One Level Deeper

Some insights and recommendations were too generic. For example,
> We saw that customer service was the top issue in most cities, specifically Las Vegas and Chicago, where X% of reviews mentioned customer service. We recommend improving employee training and emphasizing better staff responses to customers to improve quality.

This is not specific enough to be actionable for management, since managers are almost always trying to improve customer service. After you've identified that customer service is a trouble area, you need to do the following:

1. Filter for the reviews mentioning the customer service theme.
2. Identify any recurring patterns or topics within these specific reviews. Do they mention a specific type of order being repeatedly messed up? Are they frequently referencing rudeness in the way that staff respond to customers? Are staff constantly forgetting different line items in an order?
3. From there, identify specific action items that an operations-level manager (i.e., a regional manager) could try to implement:
   * Add an extra module to new employee onboarding that covers how to de-escalate tense situations and common phrases to use when talking to difficult customers while maintaining professionalism.
   * Implement an operational change requiring that orders be cross-checked by two staff members before being served to a customer, to reduce the number of incorrect/forgotten order items.
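
Steps 1 and 2 above can be sketched in a few lines of pandas. Everything in this example is a hypothetical stand-in: the sample reviews, the `SERVICE_PATTERN`, and the sub-theme names and patterns would all need to come from your own reading of the data.

```python
import re
import pandas as pd

# Hypothetical reviews and patterns, for illustration only.
reviews = pd.Series([
    "The cashier was rude and my order was wrong.",
    "Staff forgot my fries. Wrong order again.",
    "Rude staff at the drive-through.",
    "Great burger, fast line.",
])

SERVICE_PATTERN = r'\b(?:staff|cashier|employees?|manager)\b'
sub_theme_patterns = {
    "RUDENESS": r'\b(?:rude|attitude|unfriendly)\b',
    "WRONG_ORDER": r'\b(?:wrong|forgot|missing)\b',
}

# Step 1: keep only the reviews that mention the customer service theme.
service_reviews = reviews[reviews.str.contains(SERVICE_PATTERN, flags=re.IGNORECASE)]

# Step 2: count how many of those reviews mention each sub-theme.
sub_theme_counts = pd.Series({
    name: service_reviews.str.contains(pattern, flags=re.IGNORECASE).sum()
    for name, pattern in sub_theme_patterns.items()
})

print(sub_theme_counts.sort_values(ascending=False))
```

The sub-themes with the highest counts are the natural starting point for step 3.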

In general, go **one level deeper in your recommendations** (if you find customer service is the issue, then dig deeper and identify the sub-themes within customer service - that is the level of recommendations you should be making to management). Making insights that are too generic/high-level reduces the effectiveness of the analysis.