# Homework 2 (Due 6:29pm PST April 6th, 2021): Word Vectorization, Regex Practice, and Similarity

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you would like to be assigned a partner, message me on Slack and I will pair you with someone.

A. Using the **McDonalds Yelp Review CSV file**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `hamburger` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb` (read the last section, `Vectorization Techniques`).

I do not want redundant features - for instance, I do not want `hamburgers` and `hamburger` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 


In [1]:
import pandas as pd
import numpy as np

import nltk
nltk.download('punkt') # A popular NLTK sentence tokenizer
nltk.download('stopwords') # library of common English stopwords
nltk.download('wordnet')


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yutongwanyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yutongwanyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/yutongwanyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
yelp = pd.read_csv('mcdonalds-yelp-negative-reviews.csv', encoding='latin-1')
yelp.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [3]:
yelp.shape

(1525, 3)

In [4]:
reviews = yelp['review']

## Tokenization & Stemming

I decided to use stemming because (1) it runs quickly on a series of 1,500+ rows and (2) the goal of this exercise is to get features with as little redundancy as possible, so I think we can afford to be a bit aggressive when normalizing tokens.
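As a small illustration of why stemming helps with redundant features, the sketch below checks whether inflected forms collapse to the same stem. The word pairs are hypothetical examples chosen for this check, not taken from the dataset.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical word pairs, chosen only to illustrate the idea
pairs = [("hamburgers", "hamburger"), ("orders", "order")]

for plural, singular in pairs:
    # Two surface forms end up in one column only if their stems are identical
    same = stemmer.stem(plural) == stemmer.stem(singular)
    print(f"{plural!r} and {singular!r} share a stem: {same}")
```

The stems themselves are not always dictionary words, which is the trade-off accepted here in exchange for fewer columns.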

In [5]:
# from nltk.stem import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()

stemmer = nltk.stem.porter.PorterStemmer()


In [6]:
reviews = reviews.apply(lambda x: nltk.word_tokenize(x)).apply(lambda x: [stemmer.stem(y) for y in x])
reviews = reviews.apply(lambda x: ' '.join(x))
reviews

0       I 'm not a huge mcd lover , but I 've been to ...
1       terribl custom servic . I came in at 9:30pm an...
2       first they `` lost '' my order , actual they g...
3       I see I 'm not the onli one give 1 star . onli...
4       well , it 's mcdonald 's , so you know what th...
                              ...                        
1520    I enjoy the part where I repeatedli ask if I h...
1521    worst mcdonald I 've been in in a long time ! ...
1522    when I am realli crave for mcdonald 's , thi s...
1523    two point right out of the gate : 1 . thuggeri...
1524    I want to grab breakfast one morn befor work s...
Name: review, Length: 1525, dtype: object

## More regex cleaning



In [7]:
import re
re.sub(r'(he|th)', 'hi','hello this is wy')

'hillo hiis is wy'

In [8]:
# more can be added if needed 
# here I only list a few examples: xxxburger, barbeque, chocol, coffee, frappuccino

pattern_dict = {
    r'\b[a-z]+burger|burger\b':'burger',
    r'\bbarbe+[cq]+[a-z]|bbq\b':'barbeque',
    r'\bchocol|chocolateat|chocolateatt\b':'chocolate',
    r'\bcoffe+[a-z]|coffee+[a-z]|coffees\b':"coffee",
    r'\bfrap+[a-z]\b':'frappuccino'
}

In [9]:
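# apply each pattern in turn, replacing every variant with its canonical form
for pattern, replacement in pattern_dict.items():
    reviews = reviews.apply(lambda x: re.sub(pattern, replacement, x))
reviews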

0       I 'm not a huge mcd lover , but I 've been to ...
1       terribl custom servic . I came in at 9:30pm an...
2       first they `` lost '' my order , actual they g...
3       I see I 'm not the onli one give 1 star . onli...
4       well , it 's mcdonald 's , so you know what th...
                              ...                        
1520    I enjoy the part where I repeatedli ask if I h...
1521    worst mcdonald I 've been in in a long time ! ...
1522    when I am realli crave for mcdonald 's , thi s...
1523    two point right out of the gate : 1 . thuggeri...
1524    I want to grab breakfast one morn befor work s...
Name: review, Length: 1525, dtype: object
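One thing to watch in patterns such as `\bchocol|chocolateat|chocolateatt\b`: `|` has the lowest precedence in a regex, so the `\b` anchors apply only to the first and last alternatives. The sketch below shows one possible way to tighten this with non-capturing groups and `\w*` to consume the rest of the word. The names `canonical_patterns` and `normalize_tokens` are hypothetical, and the patterns are an illustration rather than a drop-in replacement for the dictionary above.

```python
import re

# Hypothetical variant patterns: alternations are wrapped in (?:...) so the \b
# anchors apply to every alternative, and \w* consumes the rest of the word.
canonical_patterns = {
    r"\b\w*burger\w*\b": "burger",
    r"\b(?:barbe[cq]u\w*|bbq)\b": "barbeque",
    r"\bchocol\w*\b": "chocolate",
    r"\bcoffe+\w*\b": "coffee",
    r"\bfrap+\w*\b": "frappuccino",
}

def normalize_tokens(text):
    """Replace every spelling variant in `text` with its canonical form."""
    for pattern, replacement in canonical_patterns.items():
        text = re.sub(pattern, replacement, text)
    return text

print(normalize_tokens("a cheeseburger , bbq sauce , coffe and a frappe"))
# a burger , barbeque sauce , coffee and a frappuccino
```

Patterns this broad can over-match (any word containing `burger` is collapsed), so it is worth spot-checking the replaced tokens before vectorizing.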

## Removing stopwords and getting features

In [10]:
# CountVectorize the Documents
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews) 
X = X.toarray()


In [11]:
from nltk.corpus import stopwords

In [12]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())

# iterate through the Pandas dataframe, and drop the columns that reflect stopwords:
original_columns = corpus_df.columns # get existing columns

to_drop_columns = set(original_columns).intersection(set(stopwords.words('english')+ ['00', '000'])) # get the list of words to drop
print(f"Dataframe shape was {corpus_df.shape}")
corpus_df.drop(columns=to_drop_columns, inplace=True)
print(f"Dataframe shape is now {corpus_df.shape}")

Dataframe shape was (1525, 6441)
Dataframe shape is now (1525, 6327)
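Because the stopword columns are dropped after stemming, stemmed stopwords such as `thi` (from "this", visible in the stemmed reviews above) no longer match the NLTK list and stay in the matrix. A minimal sketch of the two orderings follows. It assumes a tiny hand-picked set `demo_stopwords` as a stand-in for `stopwords.words('english')` and plain whitespace tokenization; both function names are hypothetical.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical, hand-picked stand-in for stopwords.words('english')
demo_stopwords = {"this", "is", "the", "was", "a"}

def stem_then_filter(text):
    """Order used above: stem first, then drop tokens found in the stopword list."""
    stems = [stemmer.stem(tok) for tok in text.lower().split()]
    return [s for s in stems if s not in demo_stopwords]

def filter_then_stem(text):
    """Alternative order: drop stopwords first, then stem what is left."""
    kept = [tok for tok in text.lower().split() if tok not in demo_stopwords]
    return [stemmer.stem(tok) for tok in kept]

sample = "this burger was cold"
print(stem_then_filter(sample))  # stemmed stopwords such as 'thi' slip through
print(filter_then_stem(sample))
```

Filtering before stemming (or stemming the stopword list as well) would likely reduce the feature count a little further.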


In [13]:
corpus_df.head().T.to_csv("yelp_features.csv")

B. **Stopwords, Stemming, Lemmatization Practice**

Using the `tale-of-two-cities.txt` file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then **count-vectorization**.
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, **remove punctuation**, and then perform **count-vectorization**?

In [14]:
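tale = open('/Users/yutongwanyan/Desktop/dso-560-nlp-text-analytics-SPRING-2021/Week 1/tale-of-two-cities.txt','r')
# note: readline() consumes the first line, so the readlines() call in the next cell starts at the second line
tale.readline()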

'  IT WAS the best of times, it was the worst of times, it was the\n'

In [15]:
tale_lines = tale.readlines()
text = []

for i in tale_lines:
    i = re.sub('\n', '', i) # strip the newline character
    text.append(i)

text[:5]

['age of wisdom, it was the age of foolishness, it was the epoch of',
 'belief, it was the epoch of incredulity, it was the season of Light,',
 'it was the season of Darkness, it was the spring of hope, it was the',
 'winter of despair, we had everything before us, we had nothing',
 'before us, we were all going direct to Heaven, we were all going']

In [16]:
text = ' '.join(text)


## Count-vectorize the corpus. Treat each sentence as a document.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
text = nltk.sent_tokenize(text) # need to tokenize first!

vectorizer = CountVectorizer()

vectorizer.fit(text)

vector = vectorizer.transform(text)


In [19]:
# from vector to pd dataframe
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
corpus_df = pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names_out())

# get rid of stopwords
original_columns = corpus_df.columns 
to_drop_columns = set(original_columns).intersection(set(stopwords.words('english'))) # get the list of words to drop
corpus_df.drop(columns=to_drop_columns, inplace=True)
print(f"Dataframe shape is now {corpus_df.shape}")
print(f"Number of features: {corpus_df.shape[1]}")

Dataframe shape is now (7764, 9568)
Number of features: 9568



## How many features (dimensions) do you get when you:

* Perform **stemming** and then **count-vectorization**.
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, **remove punctuation**, and then perform **count-vectorization**?

In [20]:
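# define reusable helper functions

def load_text_list(textname):
    """Read a text file and return its lines, one document per line, without newlines."""
    # the context manager closes the file once the lines are read
    with open(textname, 'r') as the_file:
        file_lines = the_file.readlines()

    # strip the newline character from each line
    return [re.sub('\n', '', line) for line in file_lines]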

###

def stem_or_lemma(input_text, method, remove_punctuation=False):
    """Tokenize each document, then stem ('stem') or lemmatize ('lemma') every token."""
    from nltk.tokenize import RegexpTokenizer

    input_text = pd.Series(input_text)

    # RegexpTokenizer(r'\w+') keeps only runs of word characters, which drops punctuation
    if remove_punctuation:
        tokenize = RegexpTokenizer(r'\w+').tokenize
    else:
        tokenize = nltk.word_tokenize

    if method == 'stem':
        normalize = nltk.stem.porter.PorterStemmer().stem
    elif method == 'lemma':
        normalize = nltk.stem.WordNetLemmatizer().lemmatize
    else:
        # any other method returns the text unchanged
        return input_text

    # rebuild each document as a space-separated string of normalized tokens
    return input_text.apply(lambda x: ' '.join(normalize(y) for y in tokenize(x)))

###

def count_vector_corpus(input_text, remove_stopwords=True):
    """Count-vectorize the documents and optionally drop NLTK English stopword columns."""

    # input_text is used as-is: one document per element (no sentence tokenization here)

    vectorizer = CountVectorizer() # count vectorize
    vector = vectorizer.fit_transform(input_text)

    # from vector to pd dataframe
    # (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
    corpus_df = pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names_out())

    # get rid of stopwords
    if remove_stopwords:
        # columns that are also in the stopword list are the ones to drop
        to_drop_columns = set(corpus_df.columns).intersection(stopwords.words('english'))
        corpus_df.drop(columns=to_drop_columns, inplace=True)

    print(f"Dataframe shape is now {corpus_df.shape}")
    print(f"Number of features: {corpus_df.shape[1]}")

    return corpus_df


### Perform stemming and then count-vectorization.

In [21]:
text = load_text_list('/Users/yutongwanyan/Desktop/dso-560-nlp-text-analytics-SPRING-2021/Week 1/tale-of-two-cities.txt')
stemmed_text = stem_or_lemma(text, 'stem', remove_punctuation = False)
stemmed_corpus_df = count_vector_corpus(stemmed_text, remove_stopwords = False)


Dataframe shape is now (12870, 6659)
Number of features: 6659


### Perform **lemmatization** and then **count-vectorization**.

In [22]:
text = load_text_list('/Users/yutongwanyan/Desktop/dso-560-nlp-text-analytics-SPRING-2021/Week 1/tale-of-two-cities.txt')
lemmatized_text = stem_or_lemma(text, 'lemma', remove_punctuation = False)
lemmatized_corpus_df = count_vector_corpus(lemmatized_text, remove_stopwords = False)

Dataframe shape is now (12870, 8910)
Number of features: 8910


### Perform **lemmatization**, remove **stopwords**, **remove punctuation**, and then perform **count-vectorization**?

In [23]:
# remove punctuation 
# https://stackoverflow.com/questions/15547409/how-to-get-rid-of-punctuation-using-nltk-tokenizer

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
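Note that `\w+` splits hyphenated words such as `Eighty-seven` into two tokens. The sketch below reproduces that behavior with the standard `re` module and shows a hypothetical variant pattern that keeps internal hyphens; the variable names are illustrative only.

```python
import re

sentence = "Eighty-seven miles to go, yet.  Onward!"

# Same pattern as RegexpTokenizer(r'\w+'): hyphenated words are split in two
split_tokens = re.findall(r"\w+", sentence)
print(split_tokens)
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

# Hypothetical variant: allow internal hyphens so 'Eighty-seven' stays one token
hyphen_tokens = re.findall(r"\w+(?:-\w+)*", sentence)
print(hyphen_tokens)
# ['Eighty-seven', 'miles', 'to', 'go', 'yet', 'Onward']
```

`CountVectorizer` applies its own `token_pattern` when it vectorizes the joined string, so keeping hyphenated words as single features would also require changing that parameter.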

In [24]:
text = load_text_list('/Users/yutongwanyan/Desktop/dso-560-nlp-text-analytics-SPRING-2021/Week 1/tale-of-two-cities.txt')
lemmatized_text = stem_or_lemma(text, 'lemma', remove_punctuation = True)
lemmatized_corpus_df = count_vector_corpus(lemmatized_text, remove_stopwords = True)


Dataframe shape is now (12870, 8705)
Number of features: 8705


In [25]:
# for comparison: stemming, stopword removal, punctuation removal, and then count-vectorization

text = load_text_list('/Users/yutongwanyan/Desktop/dso-560-nlp-text-analytics-SPRING-2021/Week 1/tale-of-two-cities.txt')
stemmed_text = stem_or_lemma(text, 'stem', remove_punctuation = True)
stemmed_corpus_df = count_vector_corpus(stemmed_text, remove_stopwords = True)

Dataframe shape is now (12870, 6273)
Number of features: 6273


## Classwork For Lecture 2 (Due 6:29pm PST March 30th, 2021): Word Vectorization, Regex Practice, and Similarity

#### Pick A or B

A. Answer all the exercise questions in the **Text Preprocessing** notebook.

B. Answer the below questions about text encoding and word count distributions:

1. Which of the encodings below will be able to encode this text: 사업.
* `ascii`
* `latin1`
* `utf-8`
* `utf-32`
* `extended ascii`
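One way to check a question like this empirically is to attempt the encoding and catch the error. The sketch below assumes `cp1252` as a stand-in for "extended ascii", since that term does not name a single codec; `can_encode` is a hypothetical helper.

```python
text = "사업"

# 'cp1252' is an assumed stand-in for "extended ascii"
codecs_to_try = ["ascii", "latin1", "utf-8", "utf-32", "cp1252"]

def can_encode(s, codec):
    """Return True if `s` can be encoded with `codec` without errors."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

results = {codec: can_encode(text, codec) for codec in codecs_to_try}
for codec, ok in results.items():
    print(f"{codec:8s} -> {ok}")
```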

2. **True or False**: the word dog will have the same binary representation regardless of whether it is ASCII, latin1, or utf8. If False, explain why it is false.
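A quick way to inspect this is to encode the word with each codec and compare the resulting bytes. This is only a sketch; the variable names are illustrative.

```python
word = "dog"

# encode the same string with each codec
encoded = {codec: word.encode(codec) for codec in ["ascii", "latin1", "utf-8"]}

for codec, raw in encoded.items():
    # show each byte as 8 binary digits
    print(f"{codec:6s}", " ".join(f"{byte:08b}" for byte in raw))

# True when every codec produced exactly the same byte sequence
all_equal = len(set(encoded.values())) == 1
print(all_equal)
```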


3. According to the Zipf's Law approximation, at approximately what frequency (express it as a percent) would the 3rd most popular word in a generic piece of text appear?
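Zipf's law says a word's frequency is roughly inversely proportional to its rank. The sketch below turns that into a small calculation. The 10% share for the top-ranked word is an assumed rule-of-thumb value used for illustration, not a figure from this notebook, and `zipf_share` is a hypothetical helper.

```python
def zipf_share(rank, top_share):
    """Estimated share of tokens for the word at `rank`, given the rank-1 word's share."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return top_share / rank

# Assumed value for illustration only: the most frequent word makes up 10% of tokens
assumed_top_share = 0.10

for rank in (1, 2, 3):
    print(rank, f"{zipf_share(rank, assumed_top_share):.1%}")
```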


4. **True or False**: what is considered a stopword changes depending on the business context and dataset you are working with. If true, provide an example. If false, explain why it is false.
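For the McDonald's reviews, one way to look for context-specific stopwords is to measure how many documents each token appears in. The sketch below uses a hypothetical four-document mini-corpus in place of the stemmed reviews, and the 90% threshold is an arbitrary choice for illustration.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the stemmed reviews
docs = [
    "mcdonald servic wa slow",
    "worst mcdonald in town",
    "the mcdonald fri were cold",
    "great coffee at thi mcdonald",
]

def document_frequency(documents):
    """Fraction of documents in which each token appears at least once."""
    counts = Counter()
    for doc in documents:
        # set() so a token counts once per document
        counts.update(set(doc.split()))
    return {tok: n / len(documents) for tok, n in counts.items()}

df = document_frequency(docs)

# Tokens present in at least 90% of documents are candidate custom stopwords
candidates = sorted(tok for tok, share in df.items() if share >= 0.9)
print(candidates)  # ['mcdonald']
```

`CountVectorizer` offers a related `max_df` parameter that drops terms whose document frequency exceeds a threshold.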