# HW3

Submit via Slack. Due on Tuesday, April 13th, 2020, 6:29pm PST. You may work with one other person.

## TF-IDF

You are a store operations analyst at McDonald's, charged with identifying areas for improvement for each franchise. Several metropolitan locations have recently been suffering from lower reviews.

Using the **mcdonalds-yelp-negative-reviews.csv** dataset, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?
- which words to collocate together?

Finally, generate a TF-IDF report that either **visualizes** or explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?



In [None]:
import numpy as np
import pandas as pd
import nltk

#Read the reviews file
mcd = pd.read_csv('/Users/maheshpandit/Documents/NLP/dso-560-nlp-text-analytics-SPRING-2021/Week 3/mcdonalds-yelp-negative-reviews.csv', encoding = 'latin1')

mcd.head()

In [None]:
#Note: regex=True is passed explicitly because pandas 2.0+ treats the pattern as a literal string by default

#Replace the different ways of saying "McDonalds" by a standard form
mcd.review = mcd.review.str.replace(r"(?:(?:M|m)a*(?:c|C)(?:d|D)(?:onald)*|(?:M|m)ickey (?:D|d)(?:ee)*|(?:G|g)olden (?:A|a)rche)'*(?:s|S)*", "McDonald's", regex=True)

#Replace the different ways of saying "drive-through" by a standard form
mcd.review = mcd.review.str.replace(r"(?:D|d)rive(?:-)*\s*(?:T|t)(?:hru|hrough)", "drive-through", regex=True)

#Replace the different ways of saying "take-out" by a standard form
mcd.review = mcd.review.str.replace(r"(?:T|t)ake(?:-)*\s*(?:O|o)ut", "take-out", regex=True)

#Replace all punctuation with a space
mcd.review = mcd.review.str.replace(r"[^\w\s]", " ", regex=True)
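
Before applying a normalization pattern to the whole column, it can help to try it on a few hand-written strings. The sketch below reuses the McDonald's pattern from the cell above with the standard `re` module; the sample strings are made up for illustration and are not taken from the dataset.

```python
import re

# Same pattern as in the cleaning cell above
MCD_PATTERN = r"(?:(?:M|m)a*(?:c|C)(?:d|D)(?:onald)*|(?:M|m)ickey (?:D|d)(?:ee)*|(?:G|g)olden (?:A|a)rche)'*(?:s|S)*"

# Hypothetical sample strings (not from the dataset)
samples = [
    "I went to mcdonalds and Mickey D's",
    "The Golden Arches on 5th",
    "MCDONALDS was closed",
]

normalized = [re.sub(MCD_PATTERN, "McDonald's", s) for s in samples]

for before, after in zip(samples, normalized):
    print(f"{before!r} -> {after!r}")
```

The all-caps sample is only partly matched (the pattern accepts `MCD` but not the upper-case `ONALDS`), so it may be worth lower-casing the text first or adding `re.IGNORECASE` if such spellings occur in the reviews.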

In [None]:
from nltk.corpus import stopwords

stp = set(stopwords.words("english")) #Common English stopwords that do not add any value to our analysis
stp = stp - {"off", "over", "under"} #Keep these words because they can be used to describe food. ex: over cooked, under cooked
stp.add("mcdonald") #Custom stopword: the brand name provides no value to the analysis

In [None]:
mcd.head()

In [None]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
lemmatizer = WordNetLemmatizer()

In [None]:
from nltk.corpus import wordnet

# https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258
def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return lemmatized_sentence

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

I have chosen to use lemmatization here instead of stemming because it is important to understand the sentiment of the reviews when we are trying to determine the reasons behind them. It is easier to determine the sentiment of the reviews when lemmatization is used, since it takes the part of speech into account and returns real dictionary words rather than truncated stems.
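
As a rough illustration of the difference, the sketch below runs NLTK's `PorterStemmer` and `WordNetLemmatizer` over a few hand-picked words. The word list and part-of-speech tags are made up for this example, and the lemmatizer step assumes the WordNet data has already been downloaded with `nltk.download("wordnet")`; if it has not, the example falls back to `None`.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Hypothetical (word, WordNet POS tag) pairs chosen for illustration
words = [("waited", "v"), ("was", "v"), ("better", "a"), ("fries", "n")]

# Stemming strips suffixes by rule and ignores the part of speech
stemmer = PorterStemmer()
stems = {word: stemmer.stem(word) for word, _ in words}

# Lemmatization looks the word up in WordNet using its part of speech
lemmatizer = WordNetLemmatizer()
try:
    lemmas = {word: lemmatizer.lemmatize(word, pos) for word, pos in words}
except LookupError:
    # WordNet data is not installed; run nltk.download("wordnet") first
    lemmas = None

print("stems: ", stems)
print("lemmas:", lemmas)
```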

In [None]:
lemmatized_reviews = [lemmatize_sentence(review) for review in mcd.review]

In [None]:
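new_documents = []  #cleaned reviews as space-joined strings
doc_words = []      #cleaned reviews as lists of tokens
for doc in lemmatized_reviews:
    new_document = []
    for word in doc:
        #keep only the tokens that are not stopwords
        if word.strip().lower() not in stp:
            new_document.append(word)
    #append once per review, after all of its tokens have been filtered
    doc_words.append(new_document)
    new_documents.append(' '.join(new_document) )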
    
mcd['cleaned_lemmatized_reviews'] = new_documents

In [None]:
mcd.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

for i in range(2, 5):

    vectorizer = TfidfVectorizer(ngram_range=(i,i))
    corpus = new_documents

    X = vectorizer.fit_transform(corpus)
    terms = vectorizer.get_feature_names_out()  #get_feature_names() was removed in scikit-learn 1.2
    tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

    tf_idf = tf_idf.sum(axis=1)
    score = pd.DataFrame(tf_idf, columns=["score"])
    score.sort_values(by="score", ascending=False, inplace=True)
    print("These are the 20 highest tf-idf scores for n-grams of size %d"%i)
    print("{}\n".format(score.head(20)))

To determine the most common reasons for a poor review, I looked at the 20 highest TF-IDF scores for n-grams of size 2, 3 and 4. I believe that this is a reasonable range that covers issues with a single product or service, as well as systemic issues. I ran a separate vectorizer for each size instead of a single vectorizer with `ngram_range=(2, 4)` because some important n-grams of bigger size may have a relatively lower score than less important n-grams of smaller size, and would then drop out of a combined top-20 list.

From the TF-IDF scores above, it is clear that some of the most common issues are:

- bad customer service
- issues with ice cream
- getting orders wrong
- long wait times

Let us take a closer look at each of these issues individually.

In [None]:
mcd['customer_service_issue'] = mcd['cleaned_lemmatized_reviews'].str.contains(r'customer service|rude', regex = True, case = False )

mcd['ice_cream_issue'] = mcd['cleaned_lemmatized_reviews'].str.contains(r'ice cream', regex = True, case = False )

mcd['wrong_order_issue'] = mcd['cleaned_lemmatized_reviews'].str.contains(r'order right|order wrong|order correct', regex = True, case = False )

mcd['wait_time_issue'] = mcd['cleaned_lemmatized_reviews'].str.contains(r'wait long|long wait|slow', regex = True, case = False )

In [None]:
mcd.head()

In [None]:
mcd['customer_service_issue'].value_counts()

In [None]:
mcd['ice_cream_issue'].value_counts()

In [None]:
mcd['wrong_order_issue'].value_counts()

In [None]:
mcd['wait_time_issue'].value_counts()

### Analysis of bad customer service

In [None]:
for i in mcd[ mcd['customer_service_issue'] == True ]["review"].head(30):
    print("{}\n\n".format(i))

In [None]:
cs_issues = mcd[ mcd['customer_service_issue'] == True ]['cleaned_lemmatized_reviews'].str.replace("customer service", "", case = False).values

vectorizer = TfidfVectorizer(ngram_range=(3,3) )
corpus = cs_issues

X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()  #get_feature_names() was removed in scikit-learn 1.2
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("These are the 5 highest tf-idf scores for customer-service issue n-grams of size 3")
print("{}\n".format(score.head(5)))

From the analysis of the reviews that complained about bad customer service, the most common issues that were identified are:

- employees are rude to customers
- orders are often fulfilled incorrectly
- employees are sometimes busy using the cash register and do not acknowledge customers

### Analysis of ice cream issues

In [None]:
for i in mcd[ mcd['ice_cream_issue'] == True ]["review"].head(30):
    print("{}\n\n".format(i))

From the reviews, it is evident that some of the common issues with ice cream are:

- ice cream machine is broken/locked
- ice cream is not served after a certain time
- they have run out of ice cream

### Analysis of wrong orders

In [None]:
for i in mcd[ mcd['wrong_order_issue'] == True ]["review"].head(30):
    print("{}\n\n".format(i))

In [None]:
wo_issues = mcd[ mcd['wrong_order_issue'] == True ]['cleaned_lemmatized_reviews'].str.replace(r'(?:get)*\sorder\sright|(?:get)*\sorder\swrong', "", regex = True).values

vectorizer = TfidfVectorizer(ngram_range=(2,4) )
corpus = wo_issues

X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()  #get_feature_names() was removed in scikit-learn 1.2
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("These are the 20 highest tf-idf scores for wrong order issue n-grams of sizes 2 to 4")
print("{}\n".format(score.head(20)))

From the analysis of the issue of getting orders wrong, it seems like this problem is prevalent in drive-throughs as well as dine-in. There does not seem to be a specific reason for this other than human error.

### Analysis of long wait times

In [None]:
for i in mcd[ mcd['wait_time_issue'] == True ]["review"].head(30):
    print("{}\n\n".format(i))

In [None]:
wt_issues = mcd[ mcd['wait_time_issue'] == True ]['cleaned_lemmatized_reviews'].values

vectorizer = TfidfVectorizer(ngram_range=(3,3) )
corpus = wt_issues

X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()  #get_feature_names() was removed in scikit-learn 1.2
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("These are the 20 highest tf-idf scores for wait time issue n-grams of size 3")
print("{}\n".format(score.head(20)))

From the analysis of the reviews about long wait times, it seems that the majority of these issues are occurring in drive-throughs.
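
One way to check an observation like this is to measure what share of the flagged reviews mention the drive-through at all. The sketch below does so on a made-up four-row frame; the column names mirror the ones used above, but the rows and the `mentions_drive_through` column are hypothetical. The pattern `drive\W*through` is used because the punctuation-removal step turns "drive-through" into "drive through" in the cleaned text.

```python
import pandas as pd

# Hypothetical cleaned reviews (not from the dataset)
toy = pd.DataFrame({
    "cleaned_lemmatized_reviews": [
        "wait long drive through line",
        "slow service inside",
        "drive through slow order wrong",
        "ice cream machine break",
    ]
})

# Same flagging rule as the wait_time_issue column above
toy["wait_time_issue"] = toy["cleaned_lemmatized_reviews"].str.contains(
    r"wait long|long wait|slow", regex=True, case=False
)

# Hypothetical helper column: does the review mention the drive-through?
toy["mentions_drive_through"] = toy["cleaned_lemmatized_reviews"].str.contains(
    r"drive\W*through", regex=True, case=False
)

# Share of wait-time reviews that also mention the drive-through
share = toy.loc[toy["wait_time_issue"], "mentions_drive_through"].mean()
print(f"{share:.0%} of wait-time reviews mention the drive-through")
```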

#### Issues with TF-IDF

The TF-IDF methodology helps us identify n-grams that are uniquely important to multiple documents in the corpus. However, there are some limitations:

- It is essentially a bag-of-words methodology since it does not take into account the semantics of the words. This results in different n-grams with high scores, but the same meaning. For example, "never get order right" and "always get order wrong".

- Some n-grams with a high TF-IDF score do not offer any value to our analysis because they do not make sense. For example, "hey cup coffee drive".

- Some n-grams with high scores are just phrases that are often used together in English. For example, "24 hour drive".
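
The first limitation can be sketched with plain Python. The helper `ngrams` below is hypothetical and simply splits on whitespace; it shows that the two example phrases share no 4-gram, so a vectorizer treats them as unrelated features even though they describe the same complaint.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

phrase_a = "never get order right"
phrase_b = "always get order wrong"

# No 4-gram in common: the two complaints become separate features
shared_4grams = ngrams(phrase_a, 4) & ngrams(phrase_b, 4)

# Only the bigram "get order" overlaps, and it carries no sentiment by itself
shared_bigrams = ngrams(phrase_a, 2) & ngrams(phrase_b, 2)

print(shared_4grams)
print(shared_bigrams)
```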

## Product Attribution (Feature Engineering and Regex Practice)

Download the [dataset](https://dso-560-nlp-text-analytics.s3.amazonaws.com/truncated_catalog.csv) from the class S3 bucket (`dso560-nlp-text-analytics`).

In preparation for the group project, our client company has provided a dataset of women's clothing products they are considering cataloging. 

1. Filter for only **women's clothing items**.

2. For each clothing item:

* Identify its **category**:
```
Bottom
One Piece
Shoe
Handbag
Scarf
```
* Identify its **color**:
```
Beige
Black
Blue
Brown
Burgundy
Gold
Gray
Green
Multi 
Navy
Neutral
Orange
Pinks
Purple
Red
Silver
Teal
White
Yellow
```

Your output will be the same dataset, except with **3 additional fields**:
* `is_womens_clothing`
* `product_category`
* `colors`

`colors` should be a list of colors, since it is possible for a piece of clothing to have multiple colors.

In [None]:
catalog = pd.read_csv('truncated_catalog.csv')
catalog.head()

In [None]:
import re
def isWomensClothing(txt):
    """ Function to determine whether it is an article of women's clothing """
    
    txt = str(txt)
    val = False
    if re.search(r'girl|wom(?:an|en)|lad(?:y|ies)', txt, re.IGNORECASE ):
        val = True
    return val

In [None]:
catalog['is_womens_clothing'] = pd.DataFrame( [ catalog.name.apply( isWomensClothing ), catalog.description.apply( isWomensClothing ), catalog.brand_category.apply( isWomensClothing ) ] ).any()

In [None]:
def findCategory(txt):
    """ Function to determine the article of clothing """
    
    txt = str(txt)
    val = np.nan
    #'jumpsuit' is handled by the "One Piece" branch below, so it is not listed here
    if re.search(r'pants|trousers|jeans|shorts|leggings|skirt', txt, re.IGNORECASE ):
        val = "Bottom"
    elif re.search(r'\bdress\b|gown|jumpsuit', txt, re.IGNORECASE ):
        val = "One Piece"
    elif re.search(r'shoes|sneakers|heels|pumps', txt, re.IGNORECASE ):
        val = "Shoe"
    elif re.search(r'purse|handbag|tote|clutch', txt, re.IGNORECASE ):
        val = "Handbag"
    elif re.search(r'scar(?:f|ves)|bandana', txt, re.IGNORECASE ):
        val = "Scarf"
    return val

In [None]:
catalog["product_category"] = catalog.description.apply( findCategory ).combine_first( catalog.details.apply( findCategory ).combine_first( catalog.brand_category.apply( findCategory ) ) )

In [None]:
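#'Pinks?' also matches the singular "pink"; 'Multi' matches "multi", "multicolor" and "multicolored"
colors_re = r'\bBeige\b|\bBlack\b|\bBlue\b|\bBrown\b|\bBurgundy\b|\bGold\b|\bGray\b|\bGreen\b|\bMulti(?:color(?:ed)?)?\b|\bNavy\b|\bNeutral\b|\bOrange\b|\bPinks?\b|\bPurple\b|\bRed\b|\bSilver\b|\bTeal\b|\bWhite\b|\bYellow\b'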

In [None]:
def findColors(txt):
    """ Function to determine the color of item """
    
    txt = str(txt)
    #re.findall returns an empty list when nothing matches, so one call is enough
    return re.findall(colors_re, txt, re.IGNORECASE )

In [None]:
all_matches = catalog.description.apply(findColors) + catalog.details.apply(findColors) + catalog.tsv.apply(findColors)
#Lower-case and de-duplicate the matches; store a sorted list, or NaN when no color was found
catalog['colors'] = all_matches.apply(lambda x: sorted({y.lower() for y in x}) if x else np.nan)

In [None]:
catalog.head(20)