###**Word Processing Introduction**

The file contains English reviews about food services from the Yelp website: https://www.yelp.com/.

There are 1,000 reviews; they are part of the "Sentiment Labelled Sentences" dataset in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#


#**Part 1. Loading Data.**   

Load the data from the indicated file and obtain a list of strings/comments with a length of 1000.

For now, we only need the NumPy and re libraries, which handle arrays and regular expressions in Python, respectively.

In particular, you won't need the Pandas library for this activity.

###**NOTE: You shouldn't import anything else for this activity; these two libraries will be *sufficient*.**

In [1]:
import numpy as np # We import Numpy to handle arrays.
import re # We import re to handle regular expressions.

In [2]:
# Execute the following instructions to load the information from the given file:

with open('YELP_Cmments_02.txt',        # you can update the path to your file, if applicable.
          mode='r',     # we open the file in read mode.
          ) as f:
    docs = f.readlines()    # we separate each comment by lines

# The with block closes the file automatically, so no explicit f.close() is needed.

In [3]:
type(docs) == list   # Verify that your "docs" variable is a list

True

In [4]:
len(docs)==1000  # check that the length of "docs" is one thousand comments.

True

In [5]:
docs[0:10]     # look at some of the first comments

['Wow... Loved this place.\n',
 'Crust is not good.\n',
 'Not tasty and the texture was just nasty.\n',
 'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.\n',
 'The selection on the menu was great and so were the prices.\n',
 'Now I am getting angry and I want my damn pho.\n',
 "Honeslty it didn't taste THAT fresh.)\n",
 'The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.\n',
 'The fries were great too.\n',
 'A great touch.\n']

#**Part 2: Question section (regex).**


##**Instructions:**

###**Next, you must answer each of the questions using regular expressions (regex).**

###**At this time, there is no restriction on the number of lines of code you can add, but try to include as few lines as possible.**

* **Question 1.**

Find and remove all '\n' line breaks at the end of each comment.

Once finished, print the first 10 comments from the result.


In [6]:
docs = [re.sub(r'\n$', '', doc) for doc in docs]  # Delete '\n' at the end
docs[:10] # Print first 10 comments

['Wow... Loved this place.',
 'Crust is not good.',
 'Not tasty and the texture was just nasty.',
 'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.',
 'The selection on the menu was great and so were the prices.',
 'Now I am getting angry and I want my damn pho.',
 "Honeslty it didn't taste THAT fresh.)",
 'The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.',
 'The fries were great too.',
 'A great touch.']
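
As a minimal sketch of what the pattern in Question 1 does, the following runs it on a few made-up strings (the `sample` list is hypothetical and not taken from the Yelp file). It illustrates that `r'\n$'` removes only a line break at the very end of a string and leaves internal line breaks untouched.

```python
import re

# Hypothetical sample strings, not taken from the Yelp file.
sample = ['Great food.\n', 'No newline here.', 'Two\nlines.\n']

# r'\n$' matches a line break only when it is the last character of the string.
stripped = [re.sub(r'\n$', '', s) for s in sample]
print(stripped)
```

For this particular task, `s.rstrip('\n')` would probably give the same result without a regular expression, although it strips every trailing line break rather than just one.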

* **Question 2.**

Find and print all words that end with two or more exclamation marks, for example "!!!".

You must print both the word and all the exclamation marks that follow it.

Indicate how many results you got.



In [7]:
exclamations = []
for doc in docs:
    matches = re.findall(r'\b\w+!{2,}', doc)
    exclamations.extend(matches)

print(f"Number of matches: {len(exclamations)}")
exclamations

Number of matches: 26


['Firehouse!!!!!',
 'APPETIZERS!!!',
 'amazing!!!',
 'buffet!!!',
 'good!!',
 'it!!!!',
 'DELICIOUS!!',
 'amazing!!',
 'shawarrrrrrma!!!!!!',
 'yucky!!!',
 'steak!!!!!',
 'delicious!!!',
 'far!!',
 'biscuits!!!',
 'dry!!',
 'disappointing!!!',
 'awesome!!',
 'Up!!',
 'FLY!!!!!!!!',
 'here!!!',
 'great!!!!!!!!!!!!!!',
 'packed!!',
 'otherwise!!',
 'amazing!!!!!!!!!!!!!!!!!!!',
 'style!!',
 'disappointed!!']
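
The quantifier `{2,}` is what excludes words followed by a single exclamation mark. A small sketch on a made-up sentence (the `text` variable is hypothetical):

```python
import re

# Hypothetical sentence, not taken from the Yelp file.
text = 'Wow! The buffet!!! was good!! Really.'

# \w+ captures the word and !{2,} requires at least two exclamation marks after it.
found = re.findall(r'\b\w+!{2,}', text)
print(found)
```

Here "Wow!" is skipped because it ends with only one exclamation mark.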

* **Question 3.**

Find and print all the words that are written entirely in capital letters. Each match must be a single word.

Indicate how many words you found.



In [8]:
uppercase_words = []
for doc in docs:
    matches = re.findall(r'\b[A-Z]{2,}\b', doc)  # two or more capital letters
    uppercase_words.extend(matches)

print(f"Number of capitalized words found: {len(uppercase_words)}")
uppercase_words

Number of capitalized words found: 96


['THAT',
 'APPETIZERS',
 'WILL',
 'NEVER',
 'EVER',
 'STEP',
 'FORWARD',
 'IN',
 'IT',
 'AGAIN',
 'LOVED',
 'AND',
 'REAL',
 'BITCHES',
 'NYC',
 'STALE',
 'DELICIOUS',
 'WORST',
 'EXPERIENCE',
 'EVER',
 'ALL',
 'BARGAIN',
 'TV',
 'NONE',
 'FREEZING',
 'AYCE',
 'FLAVOR',
 'NEVER',
 'BBQ',
 'UNREAL',
 'OMG',
 'BETTER',
 'BLAND',
 'RUDE',
 'INCONSIDERATE',
 'MANAGEMENT',
 'WILL',
 'NEVER',
 'EVER',
 'GO',
 'BACK',
 'AND',
 'HAVE',
 'TOLD',
 'MANY',
 'PEOPLE',
 'WHAT',
 'HAD',
 'HAPPENED',
 'TOTAL',
 'WASTE',
 'OF',
 'TIME',
 'FS',
 'AZ',
 'LOVED',
 'CONCLUSION',
 'BEST',
 'GO',
 'NOW',
 'GC',
 'AVOID',
 'THIS',
 'ESTABLISHMENT',
 'AN',
 'HOUR',
 'NASTY',
 'OMG',
 'NO',
 'BEST',
 'THE',
 'OWNERS',
 'REALLY',
 'REALLY',
 'PERFECT',
 'SCREAMS',
 'LEGIT',
 'MGM',
 'BEST',
 'FLY',
 'FLY',
 'FANTASTIC',
 'GREAT',
 'OK',
 'WAY',
 'MUST',
 'HAVE',
 'OK',
 'OVERPRICED',
 'BARE',
 'HANDS',
 'WEAK',
 'SHOULD',
 'RI',
 'VERY',
 'NOT']
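
Note that the quantifier `{2,}` in the pattern above deliberately leaves out one-letter words such as "I" and "A", which are always capitalized. The sketch below, on a made-up sentence (`text` is hypothetical), contrasts that choice with `+`, which would count them too. Which reading the question intends is a judgment call.

```python
import re

# Hypothetical sentence, not taken from the Yelp file.
text = 'I had A GREAT time at the BBQ, OMG!'

# {2,} requires at least two capitals, so 'I' and 'A' are skipped.
two_plus = re.findall(r'\b[A-Z]{2,}\b', text)

# + also accepts a single capital letter.
one_plus = re.findall(r'\b[A-Z]+\b', text)

print(two_plus)
print(one_plus)
```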

* **Question 4.**

Find and print comments where all alphabetic characters (letters) are capitalized.

Each match found must be the entire comment/statement.

Indicate how many results you got.


In [9]:
uppercase_comments = [doc for doc in docs if re.fullmatch(r'[^a-z]*[A-Z]+[^a-z]*', doc)]

print(f"Number of comments in all capital letters: {len(uppercase_comments)}")
uppercase_comments

Number of comments in all capital letters: 5


['DELICIOUS!!',
 'RUDE & INCONSIDERATE MANAGEMENT.',
 'WILL NEVER EVER GO BACK AND HAVE TOLD MANY PEOPLE WHAT HAD HAPPENED.',
 'TOTAL WASTE OF TIME.',
 'AVOID THIS ESTABLISHMENT!']
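
The pattern used with `re.fullmatch` encodes two conditions: the comment contains no lowercase letter (`[^a-z]*` on both sides) and at least one uppercase letter (`[A-Z]+`). A minimal sketch on made-up strings (the `samples` list is hypothetical) shows why both conditions matter:

```python
import re

pattern = re.compile(r'[^a-z]*[A-Z]+[^a-z]*')

# Hypothetical sample strings, not taken from the Yelp file.
samples = ['TOTAL WASTE OF TIME.', 'Total waste of time.', '10/10!!!', 'OK 5/5']

# fullmatch succeeds only if the entire string fits the pattern.
all_caps = [s for s in samples if pattern.fullmatch(s)]
print(all_caps)
```

The string "10/10!!!" is rejected because it has no letters at all, so the `[A-Z]+` part cannot match.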

* **Question 5.**

Find and print all the words that contain an accented vowel, such as á, é, í, ó, or ú.

Indicate how many results you got.

In [10]:
accented_words = []
for doc in docs:
    accented_words.extend(re.findall(r'\b\w*[áéíóúÁÉÍÓÚ]\w*\b', doc))

print(f"Number of words with accented vowels: {len(accented_words)}")
accented_words

Number of words with accented vowels: 3


['fiancé', 'Café', 'puréed']

* **Question 6.**

Find and print all monetary numerical amounts, whether whole or with decimals, that begin with the symbol $\$$.

Indicate how many results you got.

In [11]:
monetary_values = []
for doc in docs:
    monetary_values.extend(re.findall(r'\$\d+(?:\.\d{1,2})?', doc))

print(f"Number of monetary amounts found: {len(monetary_values)}")
monetary_values

Number of monetary amounts found: 8


['$20', '$4.00', '$17', '$3', '$35', '$7.85', '$12', '$11.99']
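
The optional group `(?:\.\d{1,2})?` only consumes a period when digits follow it, so an amount at the end of a sentence is not extended by the final period. The sketch below, on a made-up sentence (`text` is hypothetical), also shows one possible extension for thousands separators, which the pattern in the cell above does not handle. Whether the corpus contains such amounts is not something this sketch checks.

```python
import re

# Hypothetical sentence, not taken from the Yelp file.
text = 'Lunch was $20. The steak costs $1,250.50 and a soda $4.00.'

# Pattern from the cell above: whole or decimal amounts.
simple = re.findall(r'\$\d+(?:\.\d{1,2})?', text)

# Assumed extension: allow groups of three digits separated by commas.
with_thousands = re.findall(r'\$\d+(?:,\d{3})*(?:\.\d{1,2})?', text)

print(simple)
print(with_thousands)
```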

* **Question 7.**

Find and print all the words that are variants of the word "love," regardless of whether they include upper or lower case, the way they are conjugated, or any other variation of the word.

Indicate how many results you got.

In [12]:
love_variants = []
for doc in docs:
    #love_variants.extend([word for word in doc if re.search(r'\blove\w*\b', word, flags=re.IGNORECASE)])
    love_variants.extend(re.findall(r'\blove\w*\b', doc, flags=re.IGNORECASE))

print(f"Number of 'love' variants found: {len(love_variants)}")
love_variants

Number of 'love' variants found: 35


['Loved',
 'loved',
 'Loved',
 'love',
 'loves',
 'LOVED',
 'lovers',
 'love',
 'lovers',
 'Love',
 'loved',
 'loved',
 'love',
 'love',
 'love',
 'loved',
 'love',
 'loved',
 'Love',
 'LOVED',
 'love',
 'lovely',
 'love',
 'lovely',
 'love',
 'lover',
 'loved',
 'love',
 'love',
 'love',
 'love',
 'love',
 'love',
 'love',
 'love']

In [13]:
#love_words = ['love', 'loved', 'loves', 'lover', 'lovers', 'lovely', 'loving', 'beloved']
#pattern = r'\b(?:' + '|'.join(love_words) + r')\b'
love_matches = []
for doc in docs:
  #love_matches.extend(re.findall(pattern, doc, re.IGNORECASE))
  love_matches.extend(re.findall(r'\b(love\w*|loving)\b', doc, re.IGNORECASE))

print(f"Number of 'love' variants found: {len(love_matches)}")
love_matches

Number of 'love' variants found: 36


['Loved',
 'loved',
 'Loved',
 'love',
 'loves',
 'LOVED',
 'lovers',
 'loving',
 'love',
 'lovers',
 'Love',
 'loved',
 'loved',
 'love',
 'love',
 'love',
 'loved',
 'love',
 'loved',
 'Love',
 'LOVED',
 'love',
 'lovely',
 'love',
 'lovely',
 'love',
 'lover',
 'loved',
 'love',
 'love',
 'love',
 'love',
 'love',
 'love',
 'love',
 'love']
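
The difference between the two counts (35 versus 36) comes from "loving": it drops the final "e", so `love\w*` cannot match it. A minimal sketch on a made-up sentence (`text` is hypothetical) makes this visible. It uses a non-capturing group `(?:...)`, which behaves the same as the capturing group above when there is only one group.

```python
import re

# Hypothetical sentence, not taken from the Yelp file.
text = 'We LOVED it, my kids are loving the lovely desserts, and gloves are off.'

# Misses 'loving' because it does not contain the substring 'love'.
only_love = re.findall(r'\blove\w*\b', text, flags=re.IGNORECASE)

# Adding 'loving' as an alternative recovers it.
love_or_loving = re.findall(r'\b(?:love\w*|loving)\b', text, flags=re.IGNORECASE)

print(only_love)
print(love_or_loving)
```

The leading `\b` keeps "gloves" out of both results, since "love" does not start at a word boundary there.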

* **Question 8.**

Find and print all the words, variants of "so" and "good," that have two or more "o"s in "so" and three or more "o"s in "good."

Indicate how many you found.


In [14]:
#so_good_variants = []
so_variants = []
for doc in docs:
    so_variants.extend(re.findall(r'\bso{2,}\b', doc, flags=re.IGNORECASE))
    #so_good_variants.extend(re.findall(r'\bso{2,}\b|\bgo{3,}d\b', doc, flags=re.IGNORECASE))
    #so_good_variants.extend([word for word in doc if re.fullmatch(r'so{2,}', word) or re.fullmatch(r'go{3,}d', word)])

print(f"Number of 'so' variants found: {len(so_variants)}")
so_variants

Number of 'so' variants found: 4


['Sooooo', 'soooo', 'soooooo', 'soooo']

In [15]:
#so_good_variants = []
good_variants = []
for doc in docs:
    good_variants.extend(re.findall(r'\bgo{3,}d+\b', doc, flags=re.IGNORECASE))
    #so_good_variants.extend([word for word in doc if re.fullmatch(r'so{2,}', word) or re.fullmatch(r'go{3,}d', word)])

print(f"Number of 'good' variants found: {len(good_variants)}")
good_variants

Number of 'good' variants found: 1


['gooodd']
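
Both searches can also be done in a single pass with one alternation. The following is a sketch on a made-up sentence (`text` is hypothetical), combining the two patterns used in the cells above:

```python
import re

# Hypothetical sentence, not taken from the Yelp file.
text = 'Sooo gooood, so good, soooo tasty and goooodd.'

# so{2,} : 's' followed by two or more 'o'
# go{3,}d+ : 'g', three or more 'o', then one or more 'd'
pattern = re.compile(r'\b(?:so{2,}|go{3,}d+)\b', flags=re.IGNORECASE)

found = pattern.findall(text)
print(found)
```

The plain forms "so" and "good" are excluded because they have too few "o"s for their respective quantifiers.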

* **Question 9.**

Find and print all words that are strictly greater than 10 alphabetic characters in length.

Punctuation marks or special characters are not included in the length of these strings; only uppercase or lowercase alphabetic characters are included.

Indicate the number of words found.


In [16]:
long_words = []
for doc in docs:
    long_words.extend([word for word in re.findall(r'\b\w+\b', doc) if len(re.sub(r'[^a-zA-Z]', '', word)) > 10])

print(f"Number of words with more than 10 alphabetic characters: {len(long_words)}")
long_words

Number of words with more than 10 alphabetic characters: 141


['recommendation',
 'recommended',
 'overwhelmed',
 'inexpensive',
 'establishment',
 'imaginative',
 'opportunity',
 'experiencing',
 'underwhelming',
 'relationship',
 'unsatisfying',
 'disappointing',
 'outrageously',
 'disappointing',
 'expectations',
 'restaurants',
 'suggestions',
 'disappointed',
 'considering',
 'Unfortunately',
 'immediately',
 'ingredients',
 'accommodations',
 'maintaining',
 'Interesting',
 'disrespected',
 'accordingly',
 'unbelievable',
 'cheeseburger',
 'descriptions',
 'inexpensive',
 'disappointed',
 'Veggitarian',
 'outstanding',
 'recommendation',
 'disappointed',
 'disappointed',
 'neighborhood',
 'disappointed',
 'corporation',
 'considering',
 'exceptional',
 'shawarrrrrrma',
 'disappointed',
 'vinaigrette',
 'immediately',
 'unbelievably',
 'replenished',
 'disappointed',
 'enthusiastic',
 'Outstanding',
 'comfortable',
 'interesting',
 'INCONSIDERATE',
 'considering',
 'transcendant',
 'disappointment',
 'disappointed',
 'disappointed',
 'overwh
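
If only purely alphabetic words are of interest, the length condition can be written directly in the pattern. This is a sketch under that assumption, on a made-up sentence (`text` is hypothetical). Unlike the cell above, it ignores tokens that mix letters with digits or underscores, and it does not treat accented letters as alphabetic.

```python
import re

# Hypothetical sentence, not taken from the Yelp file.
text = 'The recommendation was outstanding: one restaurant among unbelievably good restaurants.'

# {11,} means "strictly more than 10" letters.
long_words = re.findall(r'\b[A-Za-z]{11,}\b', text)
print(long_words)
```

"restaurant" has exactly 10 letters and is therefore left out, while "restaurants" (11 letters) is kept.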

* **Question 10.**

Find and print all words that begin with a capital letter and end with a lowercase letter, but are not the first word in the comment/string.

Indicate the number of results obtained.

In [17]:
pattern = re.compile(r'\s([A-Z][a-z]+)\b')
capital_to_lower = []
for doc in docs:
    capital_to_lower.extend(pattern.findall(doc))

print(f"Number of words with initial capital letter and final lowercase letter (not first): {len(capital_to_lower)}")
capital_to_lower

Number of words with initial capital letter and final lowercase letter (not first): 266


['Loved',
 'May',
 'Rick',
 'Steve',
 'Cape',
 'Cod',
 'Vegas',
 'Burrittos',
 'Blah',
 'The',
 'They',
 'Mexican',
 'Luke',
 'Our',
 'Hiro',
 'Firehouse',
 'Greek',
 'Greek',
 'Heart',
 'Attack',
 'Grill',
 'Vegas',
 'Dos',
 'Gringos',
 'Jeff',
 'Really',
 'Excalibur',
 'Very',
 'Bad',
 'Customer',
 'Service',
 'Vegas',
 'Rice',
 'Company',
 'Pho',
 'Hard',
 'Rock',
 'Casino',
 'Buffet',
 'Tigerlilly',
 'Yama',
 'Thai',
 'Indian',
 'Not',
 'Vegas',
 'Lox',
 'Subway',
 'Subway',
 'Vegas',
 'Vegas',
 'Mandalay',
 'Bay',
 'Great',
 'Voodoo',
 'Phoenix',
 'Vegas',
 'Khao',
 'Soi',
 'Lemon',
 'Joey',
 'Valley',
 'Phoenix',
 'Magazine',
 'Pho',
 'Fridays',
 'Tasty',
 'Jamaican',
 'Bisque',
 'Bussell',
 'Sprouts',
 'Risotto',
 'Filet',
 'Otto',
 'Yeah',
 'Honestly',
 'Not',
 'Also',
 'Vegas',
 'Greek',
 'Vegas',
 'Veggitarian',
 'Madison',
 'Ironman',
 'Jenni',
 'Pho',
 'Bachi',
 'Burger',
 'Pizza',
 'Salads',
 'They',
 'Yelpers',
 'Bachi',
 'Service',
 'English',
 'Pizza',
 'Hut',
 'Seat',
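
The pattern above requires a whitespace character before the word and only lowercase letters after the initial capital. As a sketch of an alternative reading of the question (not necessarily the intended one), a negative lookbehind `(?<!^)` can exclude only the start of the string, and `[A-Za-z]*[a-z]` can allow inner capitals. The sentence in `text` is hypothetical.

```python
import re

# Hypothetical sentence, not taken from the Yelp file.
text = 'Stopped at the Hard Rock Casino (Vegas) with McDonald fries.'

# Pattern from the cell above: whitespace, one capital, then lowercase letters only.
after_space = re.findall(r'\s([A-Z][a-z]+)\b', text)

# Assumed alternative: any position except the start of the string,
# a capital first letter, and a lowercase last letter.
not_at_start = re.findall(r'(?<!^)\b[A-Z][A-Za-z]*[a-z]\b', text)

print(after_space)
print(not_at_start)
```

The second pattern also picks up "Vegas" after the parenthesis and "McDonald" with its inner capital, so the two approaches can yield different counts on the same corpus.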


* **Question 11.**

Find and print all sequences of two or more words joined by a hyphen ("-") with no spaces between them.

For example, "Go-Kart" would be valid, but "Go -Kart" or "Go - Kart" would not.

Indicate the number of results obtained.

In [18]:
hyphenated_words = []
for doc in docs:
    hyphenated_words.extend(re.findall(r'\b\w+-\w+\b', doc))

print(f"Number of hyphenated words without spaces: {len(hyphenated_words)}")
hyphenated_words

Number of hyphenated words without spaces: 21


['flat-lined',
 'hands-down',
 'must-stop',
 'sub-par',
 'Service-check',
 'in-house',
 'been-stepped',
 'in-and',
 'tracked-everywhere',
 'multi-grain',
 'to-go',
 'non-customer',
 'High-quality',
 'sit-down',
 'over-whelm',
 'low-key',
 'non-fancy',
 'golden-crispy',
 'over-priced',
 'over-hip',
 'under-services']
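
The pattern `\b\w+-\w+\b` matches exactly two words at a time, so a longer chain is split into pairs. Entries such as 'been-stepped', 'in-and' and 'tracked-everywhere' in the output above suggest this may have happened here. A sketch of a pattern that keeps chains of any length together, on a made-up sentence (`text` is hypothetical):

```python
import re

# Hypothetical sentence, not taken from the Yelp file.
text = 'The carpet looked been-stepped-in-and-tracked-everywhere, so we took it to-go.'

# Exactly two words joined by one hyphen.
pairs = re.findall(r'\b\w+-\w+\b', text)

# One word followed by one or more "-word" groups.
chains = re.findall(r'\b\w+(?:-\w+)+\b', text)

print(pairs)
print(chains)
```

With the second pattern the number of results on the full corpus would likely differ from the count reported above.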

* **Question 12.**

Find and print all the words that end in "ing" or "ed."

Indicate the number of words you found for each.

In [19]:
ing_words = []
ed_words = []
for doc in docs:
    ing_words.extend(re.findall(r'\b\w+ing\b', doc))
    ed_words.extend(re.findall(r'\b\w+ed\b', doc))

print(f"Words ending in 'ing': {len(ing_words)}")
ing_words

Words ending in 'ing': 279


['during',
 'getting',
 'being',
 'being',
 'amazing',
 'running',
 'redeeming',
 'getting',
 'thing',
 'dressing',
 'refreshing',
 'running',
 'amazing',
 'nothing',
 'appalling',
 'wasting',
 'eating',
 'going',
 'Coming',
 'experiencing',
 'underwhelming',
 'eating',
 'raving',
 'spring',
 'unsatisfying',
 'amazing',
 'Everything',
 'disappointing',
 'dining',
 'flirting',
 'thing',
 'coming',
 'playing',
 'ordering',
 'arriving',
 'disappointing',
 'preparing',
 'loving',
 'liking',
 'reviewing',
 'venturing',
 'including',
 'during',
 'changing',
 'going',
 'considering',
 'coming',
 'going',
 'everything',
 'looking',
 'dressing',
 'dining',
 'Everything',
 'amazing',
 'judging',
 'maintaining',
 'asking',
 'having',
 'something',
 'lacking',
 'Interesting',
 'preparing',
 'missing',
 'feeling',
 'exceeding',
 'inviting',
 'climbing',
 'waiting',
 'coming',
 'being',
 'lacking',
 'going',
 'amazing',
 'dealing',
 'annoying',
 'falling',
 'sporting',
 'amazing',
 'providing',
 'bu

In [20]:
print(f"Words ending in 'ed': {len(ed_words)}")
ed_words

Words ending in 'ed': 335


['Loved',
 'Stopped',
 'loved',
 'ended',
 'overpriced',
 'tried',
 'disgusted',
 'shocked',
 'recommended',
 'performed',
 'red',
 'asked',
 'overwhelmed',
 'grossed',
 'melted',
 'provided',
 'cooked',
 'ordered',
 'realized',
 'Loved',
 'lined',
 'cooked',
 'ripped',
 'ripped',
 'petrified',
 'included',
 'expected',
 'seasoned',
 'cheated',
 'walked',
 'smelled',
 'tailored',
 'arrived',
 'roasted',
 'added',
 'cooked',
 'passed',
 'liked',
 'managed',
 'served',
 'overpriced',
 'checked',
 'disappointed',
 'red',
 'decorated',
 'served',
 'watched',
 'greeted',
 'seated',
 'waited',
 'flavored',
 'ordered',
 'ordered',
 'relocated',
 'impressed',
 'seated',
 'priced',
 'treated',
 'ordered',
 'used',
 'handed',
 'listed',
 'missed',
 'thrilled',
 'inspired',
 'desired',
 'overcooked',
 'decided',
 'looked',
 'dressed',
 'treated',
 'ordered',
 'sucked',
 'expected',
 'sucked',
 'imagined',
 'served',
 'arrived',
 'satisfied',
 'voted',
 'insulted',
 'disrespected',
 'dreamed',
 'l
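
Two properties of these patterns are worth keeping in mind: they are case-sensitive, and they match on spelling only, so words like "red" or "spring" are included even though they are not verb forms. A small sketch on a made-up sentence (`text` is hypothetical):

```python
import re

# Hypothetical sentence, not taken from the Yelp file.
text = 'AMAZING! We loved the dressing and ordered a red spring roll.'

# Patterns from the cell above (case-sensitive).
ing_sensitive = re.findall(r'\b\w+ing\b', text)
ed_sensitive = re.findall(r'\b\w+ed\b', text)

# With re.IGNORECASE, the all-caps 'AMAZING' is also found.
ing_any_case = re.findall(r'\b\w+ing\b', text, flags=re.IGNORECASE)

print(ing_sensitive, ed_sensitive, ing_any_case)
```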

#**Part 3. Cleaning process.**

* **Question 13.**

Now perform a corpus cleaning process that includes the following steps:

* Only alphabetic characters should be considered. That is, all punctuation marks and special characters are removed.
* All alphabetic characters are converted to lowercase.
* Any extra whitespace characters that may be found in each comment should be removed.

When this cleaning process is complete, print the results of the first 10 resulting comments.
   




In [21]:
clean_docs = []
for doc in docs:
    # 1. Keep only alphabetic characters and whitespace (everything else is deleted)
    # Alternative that replaces non-letters with spaces instead of deleting them:
    # text = re.sub(r'[^a-zA-Z]', ' ', doc)
    text = re.sub(r'[^a-zA-Z\s]', '', doc)
    # 2. We convert to lowercase
    text = text.lower()
    # 3. We eliminate multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    clean_docs.append(text)

# Show the first 10 clean comments
clean_docs[:10]

['wow loved this place',
 'crust is not good',
 'not tasty and the texture was just nasty',
 'stopped by during the late may bank holiday off rick steve recommendation and loved it',
 'the selection on the menu was great and so were the prices',
 'now i am getting angry and i want my damn pho',
 'honeslty it didnt taste that fresh',
 'the potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer',
 'the fries were great too',
 'a great touch']
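
Deleting non-letters and replacing them with spaces lead to different tokens. The sketch below compares both options on a made-up string. The function names `clean_delete` and `clean_space` and the `sample` text are hypothetical and introduced only for illustration.

```python
import re

def clean_delete(text):
    """Delete every character that is not a letter or whitespace."""
    text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    return re.sub(r'\s+', ' ', text).strip()

def clean_space(text):
    """Replace every non-letter with a space instead of deleting it."""
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical string, not taken from the Yelp file.
sample = "It didn't taste THAT fresh...chicken,with  cranberry!"

print(clean_delete(sample))
print(clean_space(sample))
```

Deleting keeps contractions in one piece ("didnt") but glues together words separated only by punctuation ("freshchickenwith"). Replacing with spaces separates those words but splits contractions ("didn t"). Either choice changes the token count and the vocabulary.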

* **Question 14.**

Using the cleaning results obtained in the previous question, now perform a word-based tokenization of the corpus.

That is, at the end of this tokenization process, you should have a list of lists, where each comment is tokenized by word.

When you finish, calculate the total number of tokens obtained across the entire corpus.

In [23]:
# List of lists: Each comment becomes a list of words
tokenized_docs = [doc.split() for doc in clean_docs]

# Total tokens in the entire corpus
total_tokens = sum(len(tokens) for tokens in tokenized_docs)
print(f"Total number of tokens in the corpus: {total_tokens}")

Total number of tokens in the corpus: 10777
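
Because the cleaned comments contain only lowercase letters and single spaces, splitting on whitespace and extracting runs of letters with a regular expression should give the same tokens. A minimal sketch on two cleaned comments (the `clean` list is a small illustrative sample):

```python
import re

# Two cleaned comments used as a small illustrative sample.
clean = ['wow loved this place', 'crust is not good']

# Tokenize by splitting on whitespace.
by_split = [doc.split() for doc in clean]

# Tokenize by extracting runs of lowercase letters.
by_regex = [re.findall(r'[a-z]+', doc) for doc in clean]

total = sum(len(tokens) for tokens in by_split)
print(by_split == by_regex, total)
```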


In [24]:
tokenized_docs


[['wow', 'loved', 'this', 'place'],
 ['crust', 'is', 'not', 'good'],
 ['not', 'tasty', 'and', 'the', 'texture', 'was', 'just', 'nasty'],
 ['stopped',
  'by',
  'during',
  'the',
  'late',
  'may',
  'bank',
  'holiday',
  'off',
  'rick',
  'steve',
  'recommendation',
  'and',
  'loved',
  'it'],
 ['the',
  'selection',
  'on',
  'the',
  'menu',
  'was',
  'great',
  'and',
  'so',
  'were',
  'the',
  'prices'],
 ['now',
  'i',
  'am',
  'getting',
  'angry',
  'and',
  'i',
  'want',
  'my',
  'damn',
  'pho'],
 ['honeslty', 'it', 'didnt', 'taste', 'that', 'fresh'],
 ['the',
  'potatoes',
  'were',
  'like',
  'rubber',
  'and',
  'you',
  'could',
  'tell',
  'they',
  'had',
  'been',
  'made',
  'up',
  'ahead',
  'of',
  'time',
  'being',
  'kept',
  'under',
  'a',
  'warmer'],
 ['the', 'fries', 'were', 'great', 'too'],
 ['a', 'great', 'touch'],
 ['service', 'was', 'very', 'prompt'],
 ['would', 'not', 'go', 'back'],
 ['the',
  'cashier',
  'had',
  'no',
  'care',
  'what',


* **Question 15.**

Finally, in this exercise, we will define our set of "stopwords," which you must eliminate from the entire corpus.

Recall that typical stopwords are articles, adverbs, connectives, and similar words: they occur very frequently in any document but contribute little to the meaning of a statement.

Based on the list of stopwords provided, perform a cleaning process by eliminating all of these words from the corpus obtained in the previous exercise.

Determine how many tokens/words ultimately remain in the entire corpus.

Determine how many of these tokens/words are different, that is, how many unique tokens will exist in what we will later call our vocabulary.

In [25]:
# Consider the following list as your set of stopwords:
mis_stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'should', 'now', 'll']

In [26]:
# Remove stopwords from each comment
filtered_docs = [[word for word in tokens if word not in mis_stopwords] for tokens in tokenized_docs]
filtered_docs


[['wow', 'loved', 'place'],
 ['crust', 'not', 'good'],
 ['not', 'tasty', 'texture', 'nasty'],
 ['stopped',
  'late',
  'may',
  'bank',
  'holiday',
  'off',
  'rick',
  'steve',
  'recommendation',
  'loved'],
 ['selection', 'menu', 'great', 'prices'],
 ['getting', 'angry', 'want', 'damn', 'pho'],
 ['honeslty', 'didnt', 'taste', 'fresh'],
 ['potatoes',
  'like',
  'rubber',
  'could',
  'tell',
  'made',
  'ahead',
  'time',
  'kept',
  'warmer'],
 ['fries', 'great'],
 ['great', 'touch'],
 ['service', 'prompt'],
 ['would', 'not', 'go', 'back'],
 ['cashier',
  'no',
  'care',
  'ever',
  'say',
  'still',
  'ended',
  'wayyy',
  'overpriced'],
 ['tried', 'cape', 'cod', 'ravoli', 'chickenwith', 'cranberrymmmm'],
 ['disgusted', 'pretty', 'sure', 'human', 'hair'],
 ['shocked', 'no', 'signs', 'indicate', 'cash'],
 ['highly', 'recommended'],
 ['waitress', 'little', 'slow', 'service'],
 ['place', 'not', 'worth', 'time', 'let', 'alone', 'vegas'],
 ['not', 'like'],
 ['burrittos', 'blah'],
 ['f

In [None]:
# Total tokens remaining
filtered_total_tokens = sum(len(tokens) for tokens in filtered_docs)
print(f"Total tokens after removing stopwords: {filtered_total_tokens}")



Total tokens after removing stopwords: 5776


In [27]:
# Unique tokens (vocabulary)
vocabulary = set(word for tokens in filtered_docs for word in tokens)
print(f"Vocabulary size (unique tokens): {len(vocabulary)}")

Vocabulary size (unique tokens): 1941
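
As a sketch of the same filtering logic on a tiny example, the block below stores the stopwords in a `set`, which makes each membership test constant-time on average, instead of scanning a list. The `stopwords` set is a small hypothetical subset and `tokenized` holds just two tokenized comments. The result is the same as with a list.

```python
# Two tokenized comments used as a small illustrative sample.
tokenized = [['the', 'fries', 'were', 'great', 'too'],
             ['a', 'great', 'touch']]

# Small hypothetical subset of the stopword list, stored as a set.
stopwords = {'the', 'were', 'too', 'a'}

# Keep only the tokens that are not stopwords.
filtered = [[w for w in tokens if w not in stopwords] for tokens in tokenized]

# Total remaining tokens and unique tokens (vocabulary).
total = sum(len(tokens) for tokens in filtered)
vocab = {w for tokens in filtered for w in tokens}

print(filtered, total, len(vocab))
```

"great" appears twice in the filtered corpus but only once in the vocabulary, which is why the two counts differ.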


In [None]:
print(f"Total number of tokens in the corpus: {total_tokens}")
print(f"Total tokens after removing stopwords: {filtered_total_tokens}")
print(f"Vocabulary size (unique tokens): {len(vocabulary)}")

Total number of tokens in the corpus: 10777
Total tokens after removing stopwords: 5776
Vocabulary size (unique tokens): 1941


* **Comments**

This activity represents a very comprehensive and practical introduction to natural language processing (NLP), covering several of the fundamental stages of a text analysis pipeline. Through a series of progressive exercises, it allows you to become familiar not only with basic text cleaning and normalization techniques, but also with the use of regular expressions to extract complex and specific patterns from written language.

A major advantage of the activity is working with a real, uncurated corpus such as Yelp reviews. This type of data, written by real users, includes spelling errors, creative punctuation, sarcasm, and a wide variability in style and content. This presents a much more realistic challenge than synthetic examples and provides an understanding of why text cleaning and normalization are critical steps in any NLP project. Cleaning the text, converting it to lowercase, removing punctuation, and reducing noise are essential tasks before applying any model or analysis.

Tokenization (i.e., separating the text into words) is another crucial step. It not only helps us quantify the corpus in terms of volume and lexical diversity, but also lays the groundwork for subsequent analyses such as frequency counting, n-gram generation, feature extraction, statistical modeling, and even deep learning. The activity also allows for practical exploration of the difference between the total number of words (tokens) and the number of unique words (vocabulary), which is a key metric when assessing the richness or redundancy of a corpus.
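
To make the relation between these quantities concrete, the following sketch plugs in the three counts reported in this notebook. The variable names are illustrative, and the type-token ratio is computed here on the corpus after stopword removal, which is one of several possible conventions.

```python
total_tokens = 10777      # tokens before stopword removal (reported above)
filtered_tokens = 5776    # tokens after stopword removal (reported above)
vocabulary_size = 1941    # unique tokens after stopword removal (reported above)

# Share of tokens that were stopwords.
removed_share = (total_tokens - filtered_tokens) / total_tokens

# Unique tokens per remaining token (type-token ratio).
type_token_ratio = vocabulary_size / filtered_tokens

print(f"{removed_share:.1%}", f"{type_token_ratio:.3f}")
```

Roughly 46% of the tokens are removed as stopwords, and about one in three of the remaining tokens is a distinct word.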

One of the most enriching points was the introduction of the concept of stopwords. These words, although very common in language, provide little semantic value and tend to generate noise in models. Working with a customized list helps one understand the importance of adjusting linguistic resources to the specific problem domain (in this case, food reviews), as some words may be relevant in one context but irrelevant in another. For example, the list provided here does not include negations such as "not" and "no", so they remain in the filtered corpus, where they can carry sentiment. Reducing the corpus by removing stopwords not only improves computational efficiency, but can also help increase the accuracy of subsequent models.

Finally, this activity lays a solid foundation for more advanced topics in NLP such as sentiment analysis, text classification, automatic summarization, and entity extraction. Learning how to manipulate and transform text with regular expressions, how to clean it properly, and how to extract relevant vocabulary is the essential first step for any modern natural language application, whether it uses classical statistical models or deep neural networks such as BERT or GPT.

In short, this activity not only strengthens technical skills such as programming with regular expressions and using data structures like lists and sets in Python, but also offers a deeper understanding of the challenges of human language and how to address them from a computational perspective.