# Homework 1 (Due Thursday, November 3rd, 2022 at 6:29pm PST)

Every day late is -10%.

You are a business analyst working for a major US toy retailer. A manager in the marketing department would like your help to build a classification model that will predict whether a review is positive or negative. Use the `../datasets/good_amazon_toy_reviews.txt` and `../datasets/poor_amazon_toy_reviews.txt` datasets for this exercise.

Combine the good and the poor datasets together.

### Preprocessing and Regex (3pts)

Perform the following cleanup steps:

* There are malformed characters in the review text. For instance, notice the `&#34;` - these are examples of incorrectly decoded [HTML encodings](https://krypted.com/utilities/html-encoding-reference/).
```
"amazing quality first of all, these cards are amazing proxies (but don't try to use em in &#34;official duels&#34; unless a judge is okay with it, if you have the real thing to show) and look amazing in your binder!"
```
Please clean up all instances of these incorrect decodings.

* Use **regular expressions to parse out and normalize all references to recipients (children, spouses, parents, etc.) and gift occasions (Christmas, birthdays, and anniversaries)**, and account for the possibility that people may spell words "son" / "children" / "Christmas" as both singular and plural, upper or lower-cased.
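
As a rough illustration of the two cleanup steps above, the sketch below decodes HTML entities with the standard library's `html.unescape` and then normalizes a few words with case-insensitive regular expressions. The two patterns and the `normalize` helper are hypothetical and deliberately small; a full solution would cover many more recipients and occasions.

```python
import html
import re

# Hypothetical, deliberately small patterns for illustration only.
RECIPIENT = re.compile(r"\b(sons?|daughters?|child(?:ren)?|kids?)\b", re.IGNORECASE)
OCCASION = re.compile(r"\b(christmas(?:es)?|birthdays?|anniversar(?:y|ies))\b", re.IGNORECASE)

def normalize(review: str) -> str:
    # Decode entities such as "&#34;" back into the characters they stand for.
    text = html.unescape(review)
    # Replace every occasion and recipient mention with a fixed placeholder token.
    text = OCCASION.sub("_OCCASION_", text)
    text = RECIPIENT.sub("_RECIPIENT_", text)
    return text

print(normalize("My SON loved it for Christmas, &#34;best gift&#34; ever"))
```

Using `re.IGNORECASE` handles upper- and lower-cased spellings, and the optional suffixes (`s?`, `(?:ren)?`, `(?:y|ies)`) handle singular and plural forms.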

### Vectorization (7pts)

* with/without stopword removal
* with 1) no stemming or lemmatization, 2) stemming, 3) lemmatization
* using `TfIdfVectorizer` versus `CountVectorizer`
* using `ngram` sizes of 1, 2, and 3

Edit: Perform vectorization using the instructions above, and print out the shape of each final vectorized dataset.
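
One possible way to enumerate every combination of the options listed above is `itertools.product`. The option names below (`"none"`, `"stem"`, `"lemmatize"`, `"count"`, `"tfidf"`) are placeholder labels chosen for this sketch, not values required by the assignment.

```python
from itertools import product

# Placeholder labels for each axis of the experiment grid.
stopword_options = [False, True]                   # without / with stopword removal
normalization_options = ["none", "stem", "lemmatize"]
vectorizer_options = ["count", "tfidf"]
ngram_options = [1, 2, 3]

# Cartesian product of all four axes: 2 * 3 * 2 * 3 combinations.
configs = list(product(stopword_options, normalization_options,
                       vectorizer_options, ngram_options))

print(len(configs))
for remove_stopwords, normalization, vectorizer, n in configs[:3]:
    print(remove_stopwords, normalization, vectorizer, n)
```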

**Submit everything as a new notebook, and send the HW as an attachment in a Slack direct message to me (Yu Chen) and the TAs.**

**NOTE**: Name the notebook `lastname_firstname_HW1.ipynb`.

In [2]:
import pandas as pd
import numpy as np
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.corpus import wordnet

## Part 1: Preprocessing and Regex

In [3]:
# Create one dataframe per file, then concatenate the good and the poor datasets
with open("../datasets/good_amazon_toy_reviews.txt", "r") as f:
    df1 = pd.DataFrame(f.readlines(), columns=["review"])
with open("../datasets/poor_amazon_toy_reviews.txt", "r") as f:
    df2 = pd.DataFrame(f.readlines(), columns=["review"])
df = pd.concat([df1, df2]).reset_index(drop=True)
print(df.shape)
df.head()

(114917, 1)


Unnamed: 0,review
0,Excellent!!!\n
1,"""Great quality wooden track (better than some ..."
2,my daughter loved it and i liked the price and...
3,Great item. Pictures pop thru and add detail a...
4,I was pleased with the product.\n


In [4]:
# create function to clean up the review
def Cleanup(text):
    # remove malformed characters in the review text
    #cleaned_text = html.unescape(text).strip().lower()
    cleaned_text = text.strip().lower()
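    # drop numeric HTML entities such as "&#34;" and "%"-prefixed tokens
    cleaned_text = re.sub(r'&#[0-9]+;?|%\w+', '', cleaned_text)
    # collapse runs of "<br />" tags into a single space
    cleaned_text = re.sub(r'(<br />)+', ' ', cleaned_text)
    # strip punctuation; re.escape keeps characters like "]" and "\" literal inside the class
    cleaned_text = re.sub('[' + re.escape(string.punctuation) + ']', '', cleaned_text)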
    
    # Use regular expressions to parse out and normalize all references to recipients and gift occasions 
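    # punctuation is already stripped at this point, so apostrophes must be optional;
    # plural forms ("birthdays", "christmases") are matched as well
    cleaned_text = re.sub(r"\b(b(irth)?days?( party| parties)?|anniversar(y|ies)|(christ|x)mas(es)?|halloween( party| parties)?|thanksgiving|valentine'?s day|father'?s day|mother'?s day|housewarming( party| parties)?)\b", 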
                        "_OCCASION_", cleaned_text)
    # recipients, optionally preceded by a possessive and an age ("my 3 year old son")
    cleaned_text = re.sub(r"\b(my |our )?([a-zA-Z0-9]+ ?(y(ea)?rs?(-| ?)(olds?)?|y\.?o\.?|months?(-| ?)olds?) )?(younger )?(older )?(twin )?(daughters?|sons?|child|children|kids?|husbands?|wife|wives|(little )?girls?|(little )?boys?|mom|mother|dad|father|grand ?sons?|grand ?daughters?|grand ?child|grand ?children|grand ?kids?|girlfriend|boyfriend|honey|baby|babies|sisters?|brothers?|aunts?|uncles?|cousins?|fiances?|parents?|friends?|classmates?|co\-?workers?|nieces?|nephews?)\b", 
                        "_RECIPIENT_", cleaned_text)
    # recipients referred to only by age ("my 3 year old and 5 year old")
    cleaned_text = re.sub(r"\b(my |our )([a-zA-Z0-9]+ (years?-olds?|years? ?olds?|yr\.? olds?|yrs?|y\.?o\.?|months?-olds?|months? olds?))( and )?([a-zA-Z0-9]+ (years?-olds?|years? ?olds?|yr\.? olds?|yrs?|y\.?o\.?|months?-olds?|months? olds?))?\b", 
                        "_RECIPIENT_", cleaned_text)
    
    return cleaned_text
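
One thing to watch in a cleanup function like this is the order of the steps: once punctuation has been stripped, a pattern that contains an apostrophe can no longer match unless the apostrophe is optional. A minimal sketch of the effect, using two hypothetical patterns (`strict` and `tolerant`) on a made-up sentence:

```python
import re
import string

text = "a gift for mother's day"

# Strip punctuation the same way a cleanup step might, escaping the class contents.
stripped = re.sub("[" + re.escape(string.punctuation) + "]", "", text)

strict = re.compile(r"\bmother's day\b")     # requires the apostrophe
tolerant = re.compile(r"\bmother'?s day\b")  # apostrophe is optional

print(stripped)
print(bool(strict.search(stripped)))    # no match: the apostrophe is gone
print(bool(tolerant.search(stripped)))  # still matches
```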

In [5]:
df["review"] = df["review"].apply(Cleanup)

In [6]:
df.head()

Unnamed: 0,review
0,excellent
1,great quality wooden track better than some ot...
2,_RECIPIENT_ loved it and i liked the price and...
3,great item pictures pop thru and add detail as...
4,i was pleased with the product


## Part 2: Perform vectorization 

In [7]:
# prep nltk
nltk_stopwords = set(stopwords.words("english"))
stemmer = PorterStemmer()
# lemmatizer = WordNetLemmatizer()

In [18]:
## Reference: https://gist.github.com/gaurav5430/9fce93759eb2f6b1697883c3782f30de#file-nltk-lemmatize-sentences-py
lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [19]:
def Preprocess(Stopword_Removal, Stemming, Lemmatization, text):
    # lemmatization: run on the whole review so the POS tagger sees each word in context
    if Lemmatization:
        text = lemmatize_sentence(text)

    word_tokens = nltk.word_tokenize(text)

    # stopword removal
    if Stopword_Removal:
        word_tokens = [t for t in word_tokens if t not in nltk_stopwords]

    # stemming
    if Stemming:
        word_tokens = [stemmer.stem(t) for t in word_tokens]

    return " ".join(word_tokens)

In [14]:
# Vectorization
def Vectorize(Vectorization, Ngram, text):
    if Vectorization == 'COUNT':
        vectorizer = CountVectorizer(ngram_range=Ngram)
    elif Vectorization == 'TFIDF':
        vectorizer = TfidfVectorizer(ngram_range=Ngram)
    else:
        raise ValueError(f"Unknown vectorization: {Vectorization}")

    X = vectorizer.fit_transform(text)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    return pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

In [15]:
# Parameters
instructions = [[False, False, False], [False, True, False], [False, False, True], [True, False, False],
                [True, True, False], [True, False, True]]
vectorize = [['COUNT', (1,1)], ['COUNT', (2,2)], ['COUNT', (3,3)], 
             ['TFIDF', (1,1)], ['TFIDF', (2,2)], ['TFIDF', (3,3)]]
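
An `ngram_range` of `(n, n)` tells the scikit-learn vectorizers to keep only sequences of exactly `n` consecutive tokens. The pure-Python sketch below illustrates that idea with a hypothetical `word_ngrams` helper and simple whitespace splitting; it is not scikit-learn's own tokenizer, which by default also drops single-character tokens.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "great quality wooden track".split()

# Mirror the three n-gram sizes used in the parameter list: 1, 2 and 3.
for n in (1, 2, 3):
    print(n, word_ngrams(tokens, n))
```

Each distinct n-gram across the corpus becomes one column of the vectorized matrix, which is why the number of columns generally grows as `n` increases.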

In [16]:
# Sample only 2,000 rows
df2 = df.sample(2000)
df2.shape

(2000, 1)

In [20]:
for Stopword_Removal, Stemming, Lemmatization in instructions:
    # Preprocess once per configuration, always starting from the cleaned sample,
    # so stemming/lemmatization from earlier iterations does not accumulate in df2
    processed = df2["review"].apply(lambda text: Preprocess(Stopword_Removal, Stemming, Lemmatization, text))
    for Vectorization, Ngram in vectorize:
        vectorized_df = Vectorize(Vectorization, Ngram, processed)
        print(f"Shape of dataframe(Stopword_Removal={Stopword_Removal}, Stemming={Stemming}, Lemmatization={Lemmatization}, Vectorization={Vectorization}, Ngram={Ngram}) is {vectorized_df.shape}")

Shape of dataframe(Stopword_Removal=False, Stemming=False, Lemmatization=False, Vectorization=COUNT, Ngram=(1, 1)) is (2000, 4366)
Shape of dataframe(Stopword_Removal=False, Stemming=False, Lemmatization=False, Vectorization=COUNT, Ngram=(2, 2)) is (2000, 28536)
Shape of dataframe(Stopword_Removal=False, Stemming=False, Lemmatization=False, Vectorization=COUNT, Ngram=(3, 3)) is (2000, 44157)
Shape of dataframe(Stopword_Removal=False, Stemming=False, Lemmatization=False, Vectorization=TFIDF, Ngram=(1, 1)) is (2000, 4366)
Shape of dataframe(Stopword_Removal=False, Stemming=False, Lemmatization=False, Vectorization=TFIDF, Ngram=(2, 2)) is (2000, 28536)
Shape of dataframe(Stopword_Removal=False, Stemming=False, Lemmatization=False, Vectorization=TFIDF, Ngram=(3, 3)) is (2000, 44157)
Shape of dataframe(Stopword_Removal=False, Stemming=True, Lemmatization=False, Vectorization=COUNT, Ngram=(1, 1)) is (2000, 4366)
Shape of dataframe(Stopword_Removal=False, Stemming=True, Lemmatization=False, V

In [None]:
# https://www.guru99.com/stemming-lemmatization-python-nltk.html