# HW3

Submit via Slack. Due on Tuesday, April 13th, 2020, 6:29pm PST. You may work with one other person.

## TF-IDF

You are a store operations analyst at McDonald's, charged with identifying areas for improvement for each franchise. Several metropolitan locations have recently been suffering from lower reviews.

Using the **mcdonalds-yelp-negative-reviews.csv** dataset, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?
- which words to collocate together?

Finally, generate a TF-IDF report that either **visualizes** or explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?




In [41]:
import pandas as pd
import numpy as np
import time

import re

import nltk
nltk.download('punkt') # A popular NLTK sentence tokenizer
nltk.download('stopwords') # library of common English stopwords
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yutongwanyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yutongwanyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/yutongwanyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [42]:
# read the file 
yelp = pd.read_csv('mcdonalds-yelp-negative-reviews.csv', encoding='latin-1')
yelp.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [43]:
# get the review 
reviews = yelp['review']

## Stopwords keeping

In [44]:
from nltk.corpus import stopwords
# stopwords.words('english')
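list_stopwords = stopwords.words('english')

# Negation stems that NLTK lists alongside their "n't" forms (e.g. 'don' next to "don't")
negation_stems = set('don ain aren couldn didn doesn hadn hasn haven isn mightn mustn needn shan shouldn wasn weren won wouldn'.split(' '))

# Build a new list instead of calling remove() while iterating, which can silently skip elements
list_stopwords = [w for w in list_stopwords if "n't" not in w and w not in negation_stems]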

print(list_stopwords) 

# I removed the negation contractions ("don't", "wasn't", etc. and their stems) from the stopword list because they express denial or negative feelings in the reviews. Since we are trying to understand why McDonald's is doing poorly in certain areas, these words should stay in the text so that we don't lose negative signals.

# The stopwords I keep in the list are mostly pronouns and other frequently used function words. They do not provide significant information (unless the study has a specific goal, such as whether poor employee performance relates to gender; in that case we would want to take pronouns out of this list as well).


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
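
To make the reasoning above concrete, here is a minimal sketch of what happens to a review when negations are removed versus kept. The word lists and the `strip_stopwords` helper are illustrative assumptions, not the full NLTK list. Note that the printed list above still contains `'no'`, `'nor'` and `'not'`; if those should be preserved too, the same filter can cover them, as assumed here.

```python
# Illustrative subset of a stopword list (assumption, not the full NLTK list)
base_stopwords = ['the', 'were', 'is', 'no', 'nor', 'not', 'wasn', "wasn't", 'my']

# Words we want to keep in the text because they carry negation (assumption)
negation_words = {'no', 'nor', 'not', 'wasn'}

# Stopword list with negations taken out
negation_safe_stopwords = [
    w for w in base_stopwords
    if "n't" not in w and w not in negation_words
]

def strip_stopwords(text, stopwords):
    """Hypothetical helper: lowercase, split on whitespace, drop stopwords."""
    blocked = set(stopwords)
    return ' '.join(t for t in text.lower().split() if t not in blocked)

review = "The fries were not hot"
without_negation = strip_stopwords(review, base_stopwords)
with_negation = strip_stopwords(review, negation_safe_stopwords)

print(without_negation)
print(with_negation)
```

With the full list, the complaint reads like praise ("fries hot"); with negations kept, the bigram "not hot" survives for the n-gram step.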

## Tokenize and stemming/lemmatization

In [45]:

stemmer = nltk.stem.porter.PorterStemmer()
reviews = reviews.apply(lambda x: nltk.word_tokenize(x)).apply(lambda x: [stemmer.stem(y) for y in x]).apply(lambda x: ' '.join(x))
reviews

# As in HW2, I decided to use stemming because 
# (1) it runs fast on a series of 1500+ rows, and 
# (2) I prefer a more aggressive approach to reduce the final dimensionality of the words or n-grams


0       I 'm not a huge mcd lover , but I 've been to ...
1       terribl custom servic . I came in at 9:30pm an...
2       first they `` lost '' my order , actual they g...
3       I see I 'm not the onli one give 1 star . onli...
4       well , it 's mcdonald 's , so you know what th...
                              ...                        
1520    I enjoy the part where I repeatedli ask if I h...
1521    worst mcdonald I 've been in in a long time ! ...
1522    when I am realli crave for mcdonald 's , thi s...
1523    two point right out of the gate : 1 . thuggeri...
1524    I want to grab breakfast one morn befor work s...
Name: review, Length: 1525, dtype: object
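
As a rough illustration of point (2), the sketch below shows how the Porter stemmer collapses inflected forms into one stem and shrinks the vocabulary. The word list is a made-up sample; only `PorterStemmer` itself comes from the notebook.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample of review vocabulary
words = ["order", "orders", "ordered", "ordering", "terrible", "service"]
stems = [stemmer.stem(w) for w in words]

vocab_before = len(set(words))
vocab_after = len(set(stems))

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
print(f"vocabulary size: {vocab_before} -> {vocab_after}")
```

The trade-off is visible in the stems themselves: forms like `terribl` and `servic` are not real words, which is why the later TF-IDF tables show terms such as `custom servic`.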

## More regex cleaning (improved from hw 2)

In [46]:
# more regex cleaning (improved from hw 2)

pattern_dict = {
    r'(\w+burger\b)':'burger',
    r'(\bbarbe+\w+|bbq)':'barbeque',
    r'(\bchoco+\w+\b)':'chocolate',
    r'(\bcoffe+\w+)':"coffee",
    r'(\bfrap+\w+)':'frappuccino',
    r'(\bdirt+\w+)':'dirty',
    r'(\bdisa+\w+t)':'disappoint',
    r'(\bhm|hmm+\w)':'hmm',
    r'(\blettuce+\w)':'lettuce',
    r'(\blo+\w+g)':'long',
    r'(\blow|low+\w)':'low',
    r'(\bmcdo+\w+|mcd+\w\b)':'mcd',
    r'(\bmcchichen|mcchicken)':'mcchicken',
    r'(\bm+\w+m|mm)':'m'
} # again, these are only some examples of regex cleaning; most of the patterns chosen here are food names and filler words that can be added to the stopword list later

# apply each substitution to every review, in dictionary order
for pattern, replacement in pattern_dict.items():
    reviews = reviews.apply(lambda x: re.sub(pattern, replacement, x))
reviews 

0       I 'm not a huge mcd lover , but I 've been to ...
1       terribl custom servic . I came in at 9:30pm an...
2       first they `` lost '' my order , actual they g...
3       I see I 'm not the onli one give 1 star . onli...
4       well , it 's mcd 's , so you know what the foo...
                              ...                        
1520    I enjoy the part where I repeatedli ask if I h...
1521    worst mcd I 've been in in a long time ! dirt ...
1522    when I am realli crave for mcd 's , thi seem t...
1523    two point right out of the gate : 1 . thuggeri...
1524    I want to grab breakfast one morn befor work s...
Name: review, Length: 1525, dtype: object
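
Some of the patterns above are quite broad, so it may be worth previewing what they rewrite before applying them to the whole corpus. The sketch below copies two patterns from `pattern_dict` and runs them on a few made-up probe words; the `preview` helper and the tighter pattern are assumptions for illustration, not part of the original cleaning.

```python
import re

# Two patterns copied from pattern_dict
broad_patterns = {
    r'(\blo+\w+g)': 'long',
    r'(\bm+\w+m|mm)': 'm',
}

# Hypothetical probe words to test the patterns against
probe_words = ['long', 'looong', 'looking', 'lodging', 'comment', 'museum']

def preview(patterns, words):
    """Return {word: rewritten} for every word that at least one pattern changes."""
    changed = {}
    for word in words:
        rewritten = word
        for pattern, replacement in patterns.items():
            rewritten = re.sub(pattern, replacement, rewritten)
        if rewritten != word:
            changed[word] = rewritten
    return changed

broad_changes = preview(broad_patterns, probe_words)

# One possible tightening (assumption): only stretched spellings of "long"
tight_patterns = {r'\blo+ng\b': 'long'}
tight_changes = preview(tight_patterns, probe_words)

print(broad_changes)
print(tight_changes)
```

In this sample the broad patterns also rewrite unrelated words such as "looking" and "museum", while the tighter pattern only touches "looong".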

In [47]:
# an example of adding custom stopwords identified during regex cleaning

list_stopwords = list_stopwords + [
    'hmm',
    'get',
    'thi',
    'look',
    'ever'
]

## get features 

In [48]:
# CountVectorize the Documents
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews) 
X = X.toarray()


In [49]:
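# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())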

# iterate through the Pandas dataframe, and drop the columns that reflect stopwords:
original_columns = corpus_df.columns # get existing columns

to_drop_columns = set(original_columns).intersection(set(list_stopwords)) # get the list of words to drop

# before drop stopwords
print(f"Dataframe shape was {corpus_df.shape}")

# after drop stopwords 
corpus_df.drop(columns=to_drop_columns, inplace=True)
print(f"Dataframe shape is now {corpus_df.shape}")

# Compared with the same step in HW2, the number of columns (after removing stopwords) drops from 6327 to 6269,
# which suggests that both the regex cleaning and the stopword changes had an effect.

Dataframe shape was (1525, 6383)
Dataframe shape is now (1525, 6269)
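
One caveat with the step above: the reviews were stemmed before vectorizing, but the stopword list was not, so stopwords whose stem differs from the original word (for example `this` becoming `thi`) are not dropped unless they are added by hand. A minimal sketch of aligning the two, assuming a small illustrative stopword subset and a hypothetical `filter_tokens` helper:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative subset of the stopword list (assumption)
sample_stopwords = ["this", "only", "before", "the", "and"]

# Tokens as they look after stemming, taken from the printed reviews above
stemmed_tokens = ["thi", "onli", "befor", "the", "order", "wrong"]

def filter_tokens(tokens, stopwords):
    """Hypothetical helper: keep tokens that are not in the stopword collection."""
    blocked = set(stopwords)
    return [t for t in tokens if t not in blocked]

# Unstemmed stopwords miss the stemmed forms
raw_filtered = filter_tokens(stemmed_tokens, sample_stopwords)

# Stemming the stopwords with the same stemmer makes them match
stemmed_stopwords = [stemmer.stem(w) for w in sample_stopwords]
aligned_filtered = filter_tokens(stemmed_tokens, stemmed_stopwords)

print(raw_filtered)
print(aligned_filtered)
```

This is only one option; lemmatizing instead of stemming, or removing stopwords before stemming, would sidestep the mismatch as well.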


## Report tf_idf

In [50]:
# report tf_idf, using bi-gram

# first define re-usable function

def get_tf_idf(documents, ngram_range=(1,1), stop_word_true=True, stopword_list=None):
    '''
    Sum each term's TF-IDF weight over all documents and rank the terms.

    documents: a list of documents, as the corpus
    ngram_range: a tuple (min_n, max_n) for n-gram size
    stop_word_true: whether to remove stopwords
    stopword_list: a list of stopwords defined by user; by default list_stopwords
    '''
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    if stopword_list is None:
        stopword_list = list_stopwords

    vectorizer = TfidfVectorizer(ngram_range=ngram_range,
                                 token_pattern=r'\b[a-zA-Z]{3,}\b',  # alphabetic tokens of 3+ letters
                                 # max_df=0.4,
                                 stop_words=stopword_list if stop_word_true else None
                                 )

    X = vectorizer.fit_transform(documents)
    # get_feature_names() was removed in scikit-learn 1.2; use it only on versions before 1.0
    terms = vectorizer.get_feature_names_out()

    # sum the sparse matrix per term instead of building a dense documents x terms array
    totals = np.asarray(X.sum(axis=0)).ravel()
    score = pd.DataFrame({"score": totals}, index=terms)
    score.sort_values(by="score", ascending=False, inplace=True)

    return score
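
To show what the summed score represents, here is a small pure-Python sketch of the same calculation on three toy documents. It assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document) and uses plain whitespace tokenization; the documents and the `summed_tfidf` name are made up for illustration.

```python
import math
from collections import Counter

def summed_tfidf(docs):
    """Sum L2-normalized TF-IDF weights per term across all documents."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)

    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    totals = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for term, w in weights.items():
            totals[term] += w / norm
    return totals

toy_docs = ["slow drive thru", "rude drive thru", "cold fries"]
totals = summed_tfidf(toy_docs)

for term, score in totals.most_common():
    print(f"{term}: {score:.4f}")
```

Because the weights are summed over documents, a term that appears in many reviews (here `drive` and `thru`) can rank first even though its per-document weight is lower than that of a rarer term. That is worth keeping in mind when reading the rankings as "most important issues".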




In [51]:
# create corpus 

corpus = list(reviews)

# report tf_idf
tf_idf_score = get_tf_idf(corpus, ngram_range=(2,2), stop_word_true = True)
print(f'Number of word combination/collocations: {tf_idf_score.shape[0]}')
tf_idf_score[:20]

# With n = 2, we get frequently used word pairs (collocations); the top results are shown as examples.
# Some combinations make sense, such as drive thru, fast food, big mac, and ice cream.


Number of word combination/collocations: 49843


Unnamed: 0,score
drive thru,24.967784
fast food,10.857615
worst mcd,10.0797
custom servic,9.635739
ice cream,6.737402
order wrong,6.484587
big mac,5.701065
take order,5.543923
wait minut,5.466364
everi time,5.10738


In [52]:
# if we choose n = 3

tf_idf_score = get_tf_idf(corpus, ngram_range=(3,3), stop_word_true = True)
print(f'Number of word combination/collocations: {tf_idf_score.shape[0]}')
tf_idf_score[:10]

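# In the top results, the first three trigrams (and the fifth) are all variants of "drive thru", which the bigrams already capture.
# n = 3 therefore adds little new information while growing the feature set from 49843 to 64816, so n = 2 is the more efficient choice.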

Number of word combination/collocations: 64816


Unnamed: 0,score
drive thru window,2.447772
went drive thru,2.441666
drive thru order,2.390912
got order wrong,2.323411
drive thru line,2.322075
fast food place,1.908174
everi singl time,1.778436
never order right,1.599128
ice cream cone,1.553606
ice cream machin,1.533518
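
The effect seen in the two tables can be reproduced on a toy example. The sketch below builds n-grams by hand (the `ngrams` helper and the three sample reviews are assumptions for illustration) and shows one frequent bigram splitting into several rarer trigrams.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical stemmed reviews
docs = [
    "went drive thru",
    "drive thru window",
    "drive thru order wrong",
]

bigram_counts = Counter(g for d in docs for g in ngrams(d.split(), 2))
trigram_counts = Counter(g for d in docs for g in ngrams(d.split(), 3))

print(bigram_counts.most_common(3))
print(trigram_counts.most_common())
```

The single bigram `drive thru` occurs three times, while every trigram occurs once: the same signal is spread over more, sparser features.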


## Product Attribution (Feature Engineering and Regex Practice)

Download the [dataset](https://dso-560-nlp-text-analytics.s3.amazonaws.com/truncated_catalog.csv) from the class S3 bucket (`dso560-nlp-text-analytics`).

In preparation for the group project, our client company has provided a dataset of women's clothing products they are considering cataloging. 

1. Filter for only **women's clothing items**.

2. For each clothing item:

* Identify its **category**:
```
Bottom
One Piece
Shoe
Handbag
Scarf
```
* Identify its **color**:
```
Beige
Black
Blue
Brown
Burgundy
Gold
Gray
Green
Multi 
Navy
Neutral
Orange
Pinks
Purple
Red
Silver
Teal
White
Yellow
```

Your output will be the same dataset, except with **3 additional fields**:
* `is_womens_clothing`
* `product_category`
* `colors`

`colors` should be a list of colors, since it is possible for a piece of clothing to have multiple colors.
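
One possible way to produce a list-valued `colors` field is a case-insensitive whole-word regex over a text column. The sketch below is only an illustration under stated assumptions: it uses a subset of the color list above, treats "grey" as a synonym for `Gray`, and the `extract_colors` helper and sample strings are hypothetical.

```python
import re

# Subset of the target colors (assumption: plain color words only)
COLORS = ["Beige", "Black", "Blue", "Brown", "Burgundy", "Gold", "Gray", "Green",
          "Navy", "Orange", "Purple", "Red", "Silver", "Teal", "White", "Yellow"]

# Lowercased surface form -> canonical label; "grey" is an assumed synonym
lookup = {c.lower(): c for c in COLORS}
lookup["grey"] = "Gray"

# \b keeps "red" from matching inside words such as "tailored"
color_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in lookup) + r")\b",
    re.IGNORECASE,
)

def extract_colors(text):
    """Return canonical colors in order of first appearance, without duplicates."""
    if not isinstance(text, str):  # missing descriptions come through as NaN
        return []
    found = []
    for match in color_pattern.finditer(text):
        color = lookup[match.group(1).lower()]
        if color not in found:
            found.append(color)
    return found

print(extract_colors("Black velvet mini dress with navy trim"))
print(extract_colors("Tailored grey blazer"))
print(extract_colors(float("nan")))
```

On a DataFrame this could be applied with something like `catalog['description'].apply(extract_colors)`. Labels such as `Multi`, `Neutral` and `Pinks` would need extra rules that are not covered here.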

In [53]:
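catalog = pd.read_csv('truncated_catalog.csv', encoding='latin-1')
# the file starts with a UTF-8 byte-order mark, which latin-1 decodes as 'ï»¿'; strip it so the first column is named 'brand'
catalog.columns = catalog.columns.str.replace('ï»¿', '', regex=False)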
catalog.head()


Unnamed: 0,brand,name,description,brand_category,brand_canonical_url,details,tsv
0,FILA,Original Fitness Sneakers,Vintage Fitness leather sneakers with logo pri...,TheMensStore/Shoes/Sneakers/LowTop,https://www.saksfifthavenue.com/fila-original-...,Leather/synthetic upper\nLace-up closure\nText...,"'design':12 'fila':1A 'fit':3A,6 'leather':7 '..."
1,CHANEL,HAT,,Unknown,https://www.saksfifthavenue.com/chanel-hat/pro...,WOOL TWEED & FELT,'chanel':1A 'hat':2A
2,Frame,Petit Oval Buckle Belt,A Timeless Leather Belt Crafted From Smooth Co...,Accessories,https://frame-store.com/products/petit-oval-bu...,,"'belt':5A,9 'buckl':4A,21 'cowhid':13 'craft':..."
3,Lilly Pulitzer Kids,Little Gir's & Girl's Ariana One-Piece UPF 50+...,Pretty ruffle sleeves and trim elevate essenti...,"JustKids/Girls214/Girls/SwimwearCoverups,JustK...",https://www.saksfifthavenue.com/lilly-pulitzer...,Scoopneck\nAdjustable straps\nFlutter sleeves\...,'50':14A 'allov':28 'ariana':9A 'color':27 'el...
4,Kissy Kissy,Baby Girl's Endearing Elephants Pima Cotton Co...,Versatile convertible gown with elephant applique,JustKids/Baby024months/InfantGirls/FootiesRompers,https://www.saksfifthavenue.com/kissy-kissy-ba...,V-neckline\nLong sleeves\nFront snap closure\n...,"'appliqu':17 'babi':3A 'convert':10A,13 'cotto..."


In [56]:
catalog.head(10)

Unnamed: 0,brand,name,description,brand_category,brand_canonical_url,details,tsv
0,FILA,Original Fitness Sneakers,Vintage Fitness leather sneakers with logo pri...,TheMensStore/Shoes/Sneakers/LowTop,https://www.saksfifthavenue.com/fila-original-...,Leather/synthetic upper\nLace-up closure\nText...,"'design':12 'fila':1A 'fit':3A,6 'leather':7 '..."
1,CHANEL,HAT,,Unknown,https://www.saksfifthavenue.com/chanel-hat/pro...,WOOL TWEED & FELT,'chanel':1A 'hat':2A
2,Frame,Petit Oval Buckle Belt,A Timeless Leather Belt Crafted From Smooth Co...,Accessories,https://frame-store.com/products/petit-oval-bu...,,"'belt':5A,9 'buckl':4A,21 'cowhid':13 'craft':..."
3,Lilly Pulitzer Kids,Little Gir's & Girl's Ariana One-Piece UPF 50+...,Pretty ruffle sleeves and trim elevate essenti...,"JustKids/Girls214/Girls/SwimwearCoverups,JustK...",https://www.saksfifthavenue.com/lilly-pulitzer...,Scoopneck\nAdjustable straps\nFlutter sleeves\...,'50':14A 'allov':28 'ariana':9A 'color':27 'el...
4,Kissy Kissy,Baby Girl's Endearing Elephants Pima Cotton Co...,Versatile convertible gown with elephant applique,JustKids/Baby024months/InfantGirls/FootiesRompers,https://www.saksfifthavenue.com/kissy-kissy-ba...,V-neckline\nLong sleeves\nFront snap closure\n...,"'appliqu':17 'babi':3A 'convert':10A,13 'cotto..."
5,Jocelyn,Savage Love Texty Time Leopard-Print Rabbit Fu...,From the Savage Love Collection. Fingerless kn...,JewelryAccessories/Accessories/Gloves,https://www.saksfifthavenue.com/jocelyn-savage...,Acrylic/wool\nFur type: Dyed rabbit\nFur origi...,'ad':29 'collect':16 'craft':20 'fingerless':1...
6,Theory,Teah stretch-silk camisole,"Beige stretch-silk Slips on 93% silk, 7% spand...",Clothing / Tops / Tanks and Camis,https://www.net-a-porter.com/us/en/product/119...,"Fits true to size, take your normal size\nCut ...",'7':15 '93':13 'beig':7 'camisol':6A 'clean':1...
7,AMI Paris,Postcard Patch Hoodie,Casual cotton-blend hoodie with an embossed la...,TheMensStore/Apparel/SweatshirtsHoodies18Q1,https://www.saksfifthavenue.com/ami-paris-post...,Attached drawstring hood\nLong sleeves\nPullov...,"'ami':1A,15 'blend':9 'casual':6 'chest':21 'c..."
8,Alexander Wang,Layered velvet mini dress,Black velvet Concealed hook and zip fastening ...,Clothing / Dresses / Mini,https://www.net-a-porter.com/us/en/product/120...,"Fits true to size, take your normal size \nDes...",'100':21 '35':18 '65':16 'alexand':1A 'back':1...
9,J.Crew,Wide leather belt,The ideal way to add definition to your favori...,belts,https://www.jcrew.com/p/womens_category/belts/...,,"'add':9 'belt':4A,17 'better':27 'custom':19 '..."
