1. Lowercasing & HTML removal

Purpose: Normalize casing so that "Good" and "good" map to the same token, and strip HTML markup from scraped reviews.

In [1]:
import re
def clean_text_basic(text):
    # remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
    # lowercasing
    text = text.lower()
    # collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
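
The tag regex above leaves HTML entities such as "&amp;" in the text. A minimal sketch of one way to handle them, assuming the standard-library html.unescape is acceptable for the data; the function name clean_text_with_entities is hypothetical and not part of the original notebook:

```python
import html
import re

def clean_text_with_entities(text):
    # strip tags first, so an escaped "&lt;b&gt;" is not later mistaken for a real tag
    text = re.sub(r'<[^>]+>', ' ', text)
    # decode entities such as &amp; and &nbsp; into their characters
    text = html.unescape(text)
    text = text.lower()
    # collapse whitespace (a decoded &nbsp; also counts as whitespace here)
    return re.sub(r'\s+', ' ', text).strip()

example = clean_text_with_entities("<p>Good &amp; <b>CHEAP</b>!</p>")
print(example)
```

Stripping tags before decoding is a design choice: decoding first would turn escaped markup that a reviewer typed literally into real tags, which the regex would then delete.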


2. Contraction expansion & punctuation removal

Purpose: Expand "don't" -> "do not" so the model sees the negation explicitly, and remove punctuation that adds noise (optionally keeping sentiment-bearing punctuation such as "!").

In [None]:
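import re

# Small illustrative mapping; extend it for real data. Keys are lowercase,
# so run this after lowercasing. Note that "it's" can also mean "it has".
contractions = {"don't": "do not", "i'm": "i am", "it's": "it is", "can't": "cannot"}

def expand_contractions(text):
    # normalize curly apostrophes so "don\u2019t" also matches the keys above
    text = text.replace("\u2019", "'")
    for c, expansion in contractions.items():
        text = re.sub(r'\b' + re.escape(c) + r'\b', expansion, text)
    # remove all remaining punctuation, including "!" (keep spaces)
    text = re.sub(r'[^\w\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()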
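
The cell above removes every punctuation mark, while the purpose statement mentions optionally keeping sentiment punctuation. A minimal sketch of that option, assuming a hypothetical helper remove_punctuation with a keep parameter (neither is in the original notebook):

```python
import re

def remove_punctuation(text, keep="!"):
    # character class: anything that is not a word char, whitespace, or a kept mark
    pattern = r'[^\w\s' + re.escape(keep) + r']'
    text = re.sub(pattern, ' ', text)
    # collapse the spaces left behind by removed marks
    return re.sub(r'\s+', ' ', text).strip()

kept = remove_punctuation("great, really great!")
dropped = remove_punctuation("great, really great!", keep="")
print(kept)
print(dropped)
```

Passing keep="" reproduces the strip-everything behaviour, so the two variants can be compared on the same data.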


3. Tokenization + stopword removal

Purpose: Convert text into tokens and remove high-frequency function words that add little sentiment information (e.g., "the", "a") while being careful not to remove negations like "not".

In [None]:
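import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data used by newer NLTK releases
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# NLTK's list contains other negation words too (e.g. "nor"); only these two are kept here
stop = set(stopwords.words('english')) - {"not", "no"}  # keep negations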
def tokenize_and_remove_stopwords(text):
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop]
    return tokens
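
To see why the negations are subtracted from the stopword set, here is a self-contained sketch. It assumes a tiny hand-written stopword list and str.split() as stand-ins for NLTK's list and word_tokenize; the names TOY_STOPWORDS, NEGATIONS, and filter_tokens are hypothetical:

```python
# toy stand-in for NLTK's English stopword list (which also contains "not" and "no")
TOY_STOPWORDS = {"the", "a", "is", "was", "this", "not", "no"}
NEGATIONS = {"not", "no"}

def filter_tokens(tokens, keep_negations=True):
    # subtract the negations from the stop set when we want them to survive
    stop = TOY_STOPWORDS - NEGATIONS if keep_negations else TOY_STOPWORDS
    return [t for t in tokens if t not in stop]

tokens = "the movie was not good".split()
with_negation = filter_tokens(tokens)
without_negation = filter_tokens(tokens, keep_negations=False)
print(with_negation)
print(without_negation)
```

Without the subtraction, "not good" collapses to "good" and the review's polarity flips.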


4. Lemmatization (or stemming)

Purpose: Reduce inflected forms to a common base (e.g., "liked", "likes" → "like") so the model generalizes better across word forms. Lemmatization returns dictionary words, so its output stays more readable than that of stemming.

In [None]:
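import spacy
# requires the model: python -m spacy download en_core_web_sm
# the parser and NER are not needed for lemmas, so disable them for speed
nlp = spacy.load("en_core_web_sm", disable=["parser","ner"])
def lemmatize_tokens(tokens):
    # spaCy re-tokenizes the joined string, so the output length can differ from the input
    doc = nlp(" ".join(tokens))
    return [token.lemma_ for token in doc]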
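
For the stemming alternative named in the heading, a minimal sketch using NLTK's PorterStemmer (an assumption; the original notebook does not pick a stemmer). It needs no extra data download:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["liked", "likes", "studies"]
# rule-based suffix stripping; the result is not guaranteed to be a dictionary word
stems = [stemmer.stem(w) for w in words]
print(stems)
```

"liked" and "likes" both reduce to "like", but "studies" becomes "studi", which illustrates the readability trade-off against lemmatization.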
