# Homework 2 (Due 6:29pm PST March 29th, 2022): Word Vectorization, Regex Practice, and Similarity

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you would like to be assigned a partner, message me on Slack and I will pair you with someone.

> This homework is co-completed by **Siqin Yang** (7374355500) and **Ningxi Wang** (3605565772) :)

In [1]:
import numpy as np
import pandas as pd
import re
from collections import Counter
from tqdm import tqdm
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

## task A

A. Using the **Amazon Toy Reviews Dataset (both positive and negative)**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `broken` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `Christmas` and `Christ-mas` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 

### stopwords

In [3]:
stpw = set(stopwords.words('english'))
stpw |= set(['amazon', 'toy', 'toys', 'review', 'reviews'])
stpw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'amazon',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',


In [4]:
# rev_gd = open('../datasets/good_amazon_toy_reviews.txt', 'r')
# rev_pr = open('../datasets/poor_amazon_toy_reviews.txt', 'r')

In [5]:
rev_gd = pd.read_csv('../datasets/good_amazon_toy_reviews.txt', header=None, encoding='utf8')
rev_pr = pd.read_csv('../datasets/poor_amazon_toy_reviews.txt', header=None, encoding='utf8')

In [6]:
rev_df = pd.concat([rev_gd, rev_pr], axis=0, ignore_index=True)
rev_df.columns = ['review']
rev_df

Unnamed: 0,review
0,Excellent!!!
1,Great quality wooden track (better than some o...
2,my daughter loved it and i liked the price and...
3,Great item. Pictures pop thru and add detail a...
4,I was pleased with the product.
...,...
114879,It's a piece of junk...doesn't charge multiple...
114880,Really small
114881,It is contained in glass which is dangerous if...
114882,Fake. Not original. Every time my 5 yr old kid...


In [7]:
def count_words(lines, delimiter=" "):
    words = Counter() # instantiate a Counter object called words
    for line in lines:
        for word in line.split(delimiter):
            word = word.lower()
            if word in stpw: continue
            words[word] += 1 # increment count for word
    return words

# def count_words(doc):
#     counts = Counter()
#     for r in doc:
#         counts_tmp = Counter(re.findall(r'\w\w+', r, flags=re.IGNORECASE))
#         counts += counts_tmp
#     return counts

In [8]:
counts = count_words(rev_df['review'])
counts.most_common(150)

[('', 52277),
 ('great', 24128),
 ('love', 16701),
 ('loves', 13551),
 ('one', 12102),
 ('it.', 10472),
 ('like', 10194),
 ('good', 9927),
 ('little', 9809),
 ('old', 9549),
 ('/><br', 8375),
 ('loved', 8279),
 ('would', 8167),
 ('year', 8158),
 ('fun', 8157),
 ('really', 8065),
 ('kids', 7720),
 ('get', 7499),
 ('bought', 7008),
 ('well', 6862),
 ('son', 6425),
 ('perfect', 6320),
 ('game', 6295),
 ('daughter', 6204),
 ('got', 6093),
 ('play', 6088),
 ('nice', 5397),
 ('product', 5375),
 ('easy', 5341),
 ('even', 5266),
 ('much', 5108),
 ('quality', 5029),
 ('time', 4921),
 ('made', 4673),
 ('it!', 4629),
 ('use', 4558),
 ('also', 4549),
 ('set', 4430),
 ('put', 4211),
 ('cute', 4105),
 ('buy', 4050),
 ('gift', 3985),
 ('2', 3984),
 ('make', 3745),
 ('still', 3627),
 ('came', 3610),
 ('-', 3608),
 ('first', 3595),
 ('two', 3471),
 ('recommend', 3424),
 ('grandson', 3405),
 ('3', 3397),
 ('playing', 3296),
 ('price', 3190),
 ('received', 3153),
 ('looks', 3146),
 ("i'm", 3142),
 ('smal
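
The `('', 52277)` entry at the top of this output most likely comes from splitting on a single space: two consecutive spaces produce an empty string between them. The sketch below uses made-up sample lines, and the names `count_words_ws` and `sample_stopwords` are illustrative rather than part of the notebook. It shows that `str.split()` with no argument splits on runs of whitespace and never yields empty tokens.

```python
from collections import Counter

# Hypothetical sample data, not taken from the dataset
sample_lines = ["Great  toy", "great price "]
sample_stopwords = {"the", "a"}

def count_words_ws(lines, stopwords):
    """Count lowercase tokens, splitting on runs of whitespace."""
    words = Counter()
    for line in lines:
        # split() with no argument never yields empty strings
        for word in line.lower().split():
            if word not in stopwords:
                words[word] += 1
    return words

# Splitting on a single space keeps an empty string between repeated spaces
single_space_tokens = "Great  toy".split(" ")
ws_counts = count_words_ws(sample_lines, sample_stopwords)
print(single_space_tokens)
print(ws_counts.most_common())
```

Tokens with attached punctuation such as `it.` would still be counted separately under this approach; that is handled by the regex cleaning later in the notebook.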

> By comparing the most common words against the default stopword list, we add more stopwords.

In [9]:
stpw |= set(['would', 'really', 'get', 'got', 'even', 'much', 'also', 'item', 'every', 
            'definitely', 'exactly', 'absolutely', 'actually', 'able'])

> After analyzing earlier count-vectorization results, we add further stopwords.

In [10]:
stpw |= set(['almost', 'always', 'another', 'could', 'something', 'thing', 'must', 'never',
            'us', 'me',])

### regex cleaning

In [11]:
rev_df['rev_std'] = rev_df['review']

In [12]:
### too slow
# def standardize_word(doc, word_orig, word_std):
#     for i in tqdm(range(len(doc))):
#         doc.loc[i, 'rev_std'] = re.sub(word_orig, word_std, doc.loc[i, 'review'], flags=re.IGNORECASE)

def standardize_word(doc, word_orig, word_std):
    doc = doc.str.replace(word_orig, word_std,
                          flags=re.IGNORECASE, regex=True)
    return doc  # str.replace returns a new Series, so the caller must reassign the result

> We noticed that a considerable number of reviews contain `&#34;xxx&#34;` entities and `<br />` tags.

In [13]:
word_orig, word_std = r'(<br />)', ''
rev_df['rev_std'] = standardize_word(rev_df['rev_std'], word_orig, word_std)

In [14]:
rev_df.iloc[10,0]

'I got this item for me and my son to play around with. The closest relevance I have to items like these is while in the army I was trained in the camera rc bots. This thing is awesome we tested the range and got somewhere close to 50 yards without an issue. Getting the controls is a bit tricky at first but after about twenty minutes you get the feel for it. The drone comes just about fly ready you just have to sync the controller. I am definitely a fan of the drones now. Only concern I have is maybe a little more silent but other than that great buy.<br /><br />*Disclaimer I received this product at a discount for my unbiased review.'

In [15]:
rev_df.iloc[10,1]

'I got this item for me and my son to play around with. The closest relevance I have to items like these is while in the army I was trained in the camera rc bots. This thing is awesome we tested the range and got somewhere close to 50 yards without an issue. Getting the controls is a bit tricky at first but after about twenty minutes you get the feel for it. The drone comes just about fly ready you just have to sync the controller. I am definitely a fan of the drones now. Only concern I have is maybe a little more silent but other than that great buy.*Disclaimer I received this product at a discount for my unbiased review.'

In [16]:
word_orig, word_std = r'&#[0-9]+;', ''  # numeric HTML entities such as &#34;; no \b
rev_df['rev_std'] = standardize_word(rev_df['rev_std'], word_orig, word_std)

In [17]:
sum(rev_df['rev_std'] != rev_df['review'])

9100

In [18]:
rev_df.iloc[3,0]

'Great item. Pictures pop thru and add detail as &#34;painted.&#34;  Pictures dry and it can be repainted.'

In [19]:
rev_df.iloc[3,1]

'Great item. Pictures pop thru and add detail as painted.  Pictures dry and it can be repainted.'
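
The two substitutions can also be checked on a single string. The sketch below uses a made-up sample review rather than a row from the dataset. It additionally shows the standard library's `html.unescape` as an alternative that decodes entities into the characters they stand for instead of deleting them, which is a different choice from the one made in this notebook.

```python
import html
import re

# Made-up sample resembling the raw reviews
sample = 'Adds detail as &#34;painted.&#34;<br /><br />Great buy.'

# Same idea as the notebook cells: drop <br /> tags, then numeric entities
no_br = re.sub(r'<br />', '', sample, flags=re.IGNORECASE)
cleaned = re.sub(r'&#[0-9]+;', '', no_br)

# Alternative: decode entities into the characters they represent
decoded = html.unescape(no_br)

print(cleaned)
print(decoded)
```

Replacing `<br />` with a space instead of an empty string would keep the words on either side of the tag from being joined, as happens with `buy.*Disclaimer` in the cleaned review shown earlier.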

In [20]:
### this breaks stopword matching because it changes didn't -> didnt
# word_orig, word_std = '(@\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?', ''
# rev_df['rev_std'] = standardize_word(rev_df['rev_std'], word_orig, word_std)

> Because stemming treats a word with attached punctuation differently from the bare word, we remove punctuation first.

In [21]:
word_orig, word_std = r'[.!?\-\"\\]', ' '
rev_df['rev_std'] = standardize_word(rev_df['rev_std'], word_orig, word_std)

> We also noticed several spelling variants of the same word and standardize them before count-vectorizing.

In [22]:
# christmas, christ-mas, xmas, x-mas -> christmas
word_orig, word_std = r'\b((christ|x)(?:-)?mas)\b', 'christmas'
rev_df['rev_std'] = standardize_word(rev_df['rev_std'], word_orig, word_std)
# bday, b-day, birthday(s), birth-day(s) -> birthday
word_orig, word_std = r'\bb(?:irth)?(?:-)?day(?:s)?\b', 'birthday'
rev_df['rev_std'] = standardize_word(rev_df['rev_std'], word_orig, word_std)
# yr, yrs, year, years -> year
word_orig, word_std = r'\b(y(?:ea)?r(?:s)?)\b', 'year'
rev_df['rev_std'] = standardize_word(rev_df['rev_std'], word_orig, word_std)
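
To see what these three patterns do and do not catch, here is a minimal sketch that applies them with plain `re.sub` to made-up sentences. The helper name `normalize` and the sample strings are illustrative only. The second call mimics text in which hyphens have already been replaced by spaces, as the punctuation cell does before this step.

```python
import re

# The three patterns from the cell above, applied in the same order
PATTERNS = [
    (r'\b((christ|x)(?:-)?mas)\b', 'christmas'),
    (r'\bb(?:irth)?(?:-)?day(?:s)?\b', 'birthday'),
    (r'\b(y(?:ea)?r(?:s)?)\b', 'year'),
]

def normalize(text):
    """Apply each (pattern, replacement) pair case-insensitively."""
    for pattern, replacement in PATTERNS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

# Made-up sample sentences
with_hyphens = normalize("Xmas gift for a 5 yr old, also a B-day and Christ-mas hit")
# After the punctuation step, '-' has already become ' '
after_punct = normalize("b day and christ mas")

print(with_hyphens)
print(after_punct)
```

The second string comes back unchanged because the optional `-` in the patterns does not match a space. Running the variant patterns before the punctuation step, or allowing `[-\s]?` in place of `(?:-)?`, are two possible ways to catch the hyphenated spellings; neither was tested against the full dataset here.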

### lemmatization

In [23]:
'''
    ref:
https://gist.github.com/gaurav5430/9fce93759eb2f6b1697883c3782f30de#file-nltk-lemmatize-sentences-py
'''
# lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [24]:
lemmatizer = WordNetLemmatizer()
for i in tqdm(range(len(rev_df))):
    rev_df.loc[i, 'rev_std'] = lemmatize_sentence(rev_df.loc[i, 'rev_std'])

100%|██████████| 114884/114884 [06:54<00:00, 277.30it/s]


> **Reason for choosing lemmatization over stemming:**
Stemming strips suffixes by rule without using context, so it may fail to discriminate between words with different meanings. Lemmatization, in contrast, reduces each word to its base form according to its context (here, its part-of-speech tag). We therefore prefer lemmatization in this case.

### count-vectorize

In [25]:
vectorizer = CountVectorizer(stop_words=stpw, binary=True, min_df=0.02)
X = vectorizer.fit_transform(rev_df['rev_std'])
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df.shape}")
print(f"Total number of occurrences: {vectorized_df.sum().sum()}")
vectorized_df.head()

Shape of dataframe is (114884, 90)
Total number of occurrences: 484908


Unnamed: 0,around,awesome,back,best,big,birthday,box,buy,card,child,...,together,try,two,use,want,way,well,work,worth,year
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
for i in vectorized_df.columns:
    print(i, '\t', end='')

around 	awesome 	back 	best 	big 	birthday 	box 	buy 	card 	child 	color 	come 	cute 	daughter 	day 	doll 	easy 	enjoy 	enough 	excellent 	expect 	fast 	figure 	find 	first 	fit 	fun 	game 	gift 	give 	go 	good 	granddaughter 	grandson 	great 	happy 	hold 	keep 	kid 	know 	like 	little 	long 	look 	lot 	love 	make 	money 	month 	need 	new 	nice 	old 	one 	order 	party 	perfect 	picture 	piece 	play 	pretty 	price 	product 	purchase 	put 	quality 	receive 	recommend 	right 	say 	see 	set 	size 	small 	son 	still 	super 	take 	think 	time 	together 	try 	two 	use 	want 	way 	well 	work 	worth 	year 	
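
To make the effect of `binary=True` and a fractional `min_df` concrete, here is a pure-Python sketch that imitates those two options on a made-up mini corpus. It is not scikit-learn's implementation: the function name `binary_doc_term` is hypothetical and tokenization is simplified to whitespace splitting.

```python
from collections import Counter

def binary_doc_term(docs, stop_words, min_df):
    """Toy stand-in for CountVectorizer(binary=True, min_df=<float>)."""
    # A set per document records presence only, which is what binary=True means
    token_sets = [set(d.lower().split()) - set(stop_words) for d in docs]
    doc_freq = Counter(tok for toks in token_sets for tok in toks)
    # A float min_df is read as a proportion of the documents
    threshold = min_df * len(docs)
    vocab = sorted(t for t, df in doc_freq.items() if df >= threshold)
    rows = [[int(t in toks) for t in vocab] for toks in token_sets]
    return vocab, rows

# Made-up mini corpus
docs = ["great great gift", "great price", "broken on arrival", "gift for son"]
vocab, rows = binary_doc_term(docs, stop_words={"on", "for"}, min_df=0.5)
print(vocab)
print(rows)
```

Only `gift` and `great` appear in at least half of the four documents, and the repeated `great` in the first document still yields a 1. For the same reason, the total printed for the real matrix counts document-term pairs rather than raw token occurrences.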

## task B

B. **Stopwords, Stemming, Lemmatization Practice**

Using the **McDonalds Negative Reviews** file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then count-vectorization
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?

In [2]:
mcd = pd.read_csv('../datasets/mcdonalds-yelp-negative-reviews.csv', encoding='latin1')
mcd

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."
...,...,...,...
1520,679500008,Portland,I enjoyed the part where I repeatedly asked if...
1521,679500224,Houston,Worst McDonalds I've been in in a long time! D...
1522,679500608,New York,"When I am really craving for McDonald's, this ..."
1523,679501257,Chicago,Two points right out of the gate: 1. Thuggery ...


In [30]:
mcd['rev_stem'] = np.nan
mcd['rev_lem'] = np.nan

In [3]:
mcd['rev_lem_woPOS'] = np.nan

### stemming

In [31]:
def stemming_sentence(sentence):
    stemmed_sentence = []
    for word in sentence.split(' '):      
        stemmed_sentence.append(stemmer.stem(word))
    return " ".join(stemmed_sentence)

In [32]:
stemmer = PorterStemmer()
for i in tqdm(range(len(mcd))):
    mcd.loc[i, 'rev_stem'] = stemming_sentence(mcd.loc[i, 'review'])

100%|██████████| 1525/1525 [00:02<00:00, 681.22it/s]


In [33]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mcd['rev_stem'])
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
dim_stem = vectorized_df.shape[1]
dim_stem

7638

### lemmatization

In [34]:
lemmatizer = WordNetLemmatizer()
for i in tqdm(range(len(mcd))):
    mcd.loc[i, 'rev_lem'] = lemmatize_sentence(mcd.loc[i, 'review'])

100%|██████████| 1525/1525 [00:06<00:00, 249.68it/s]


In [35]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mcd['rev_lem'])
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
dim_lem = vectorized_df.shape[1]
dim_lem

7191

### lemmatization + stopwords

In [36]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(mcd['rev_lem'])
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
dim_lem_stp = vectorized_df.shape[1]
dim_lem_stp

6910

### lemmatization w/o POS

In [5]:
def lemmatize_sentence_woPOS(sentence):
    lemmatized_sentence = []
    for word in sentence.split(' '):      
        lemmatized_sentence.append(lemmatizer.lemmatize(word))
    return " ".join(lemmatized_sentence)

In [6]:
lemmatizer = WordNetLemmatizer()
for i in tqdm(range(len(mcd))):
    mcd.loc[i, 'rev_lem_woPOS'] = lemmatize_sentence_woPOS(mcd.loc[i, 'review'])

100%|██████████| 1525/1525 [00:00<00:00, 2159.35it/s]


In [8]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mcd['rev_lem_woPOS'])
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
dim_lem_woPOS = vectorized_df.shape[1]
dim_lem_woPOS

8101
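
One likely contributor to this larger count is tokenization. `stemming_sentence` and `lemmatize_sentence_woPOS` split on a single space, so punctuation stays attached to tokens, whereas `lemmatize_sentence` uses `nltk.word_tokenize`. In addition, `lemmatizer.lemmatize(word)` without a POS tag treats every word as a noun. The sketch below illustrates only the tokenization effect, using the standard library: `toy_singularize` is a hypothetical stand-in for a stemmer or lemmatizer, and `\w\w+` mirrors the default token pattern of `CountVectorizer`.

```python
import re

def toy_singularize(token):
    """Hypothetical stand-in for a stemmer/lemmatizer: strips one trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

# Made-up sample sentence
sentence = "Cold burgers. I hate cold burgers"

# Whitespace split: 'burgers.' keeps its period, so the suffix rule misses it
split_tokens = [toy_singularize(t) for t in sentence.lower().split(" ")]

# Regex tokenization first: punctuation is gone before normalization
regex_tokens = [toy_singularize(t) for t in re.findall(r"\w\w+", sentence.lower())]

# A later vectorizer strips the punctuation, leaving two features for one word
vocab_split = sorted({w for t in split_tokens for w in re.findall(r"\w\w+", t)})
vocab_regex = sorted(set(regex_tokens))
print(vocab_split)
print(vocab_regex)
```

With whitespace splitting, both `burger` and `burgers` survive as separate features; tokenizing before normalizing collapses them into one.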

### results

In [37]:
print('# of features (dimensions) after')
print(f'\tstemming & count-vectorization: {dim_stem}.')
print(f'\tlemmatization & count-vectorization: {dim_lem}.')
print(f'\tlemmatization, removing stopwords & count-vectorization: {dim_lem_stp}.')

# of features (dimensions) after
	stemming & count-vectorization: 7638.
	lemmatization & count-vectorization: 7191.
	lemmatization, removing stopwords & count-vectorization: 6910.
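
For a quick comparison, the differences between the reported feature counts can be computed directly. The numbers below are copied from the notebook outputs above, including the no-POS lemmatization run that the summary cell does not print; the dictionary keys are illustrative labels.

```python
# Feature counts reported in the notebook outputs above
dims = {
    "stemming": 7638,
    "lemmatization (POS-tagged)": 7191,
    "lemmatization + stopwords": 6910,
    "lemmatization (no POS)": 8101,
}

# Compare every setting against POS-tagged lemmatization
baseline = dims["lemmatization (POS-tagged)"]
diffs = {name: count - baseline for name, count in dims.items()}
for name, delta in diffs.items():
    print(f"{name}: {dims[name]} features ({delta:+d} vs POS-tagged lemmatization)")
```

Relative to POS-tagged lemmatization, stemming keeps 447 more features, removing the built-in English stopwords drops 281, and lemmatizing without POS tags keeps 910 more.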
