Introduction

In this notebook, we perform text preprocessing on the Yelp Review Full dataset.
The goal is to clean and normalize reviews to improve model performance and reduce noise.

Objectives:

Remove noisy and irrelevant text patterns

Normalize text format

Handle emojis and punctuation

Prepare clean text for Transformer-based models

In [1]:
# Core libraries
import re
import numpy as np
import pandas as pd

# Hugging Face dataset
from datasets import load_dataset

# Emoji handling
import emoji

# Progress bar (SAFE for VS Code)
from tqdm import tqdm

pd.set_option("display.max_colwidth", 300)
print("All imports successful")




All imports successful


Load Dataset

We reload the Yelp Review Full dataset to apply preprocessing.

In [2]:
dataset = load_dataset("yelp_review_full")

train_ds = dataset["train"]
test_ds = dataset["test"]

print("Train size:", len(train_ds))
print("Test size:", len(test_ds))


Train size: 650000
Test size: 50000


Why Preprocessing Is Needed

Real-world Yelp reviews contain:

HTML tags

URLs

Emojis

Excessive punctuation

Irregular spacing

Informal writing

These can:

Add noise

Increase token length

Confuse the model
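Before cleaning, it can help to measure how common each kind of noise is. The following is a minimal sketch, not part of the original notebook: `noise_profile` and the `PATTERNS` table are hypothetical names, and the regexes are rough heuristics rather than exact detectors. It counts how many texts contain at least one match per pattern.

```python
import re
from collections import Counter

# Rough heuristics for the noise types listed above (assumed patterns)
PATTERNS = {
    "html_tag": re.compile(r"<[^>]+>"),
    "url": re.compile(r"http\S+|www\S+"),
    "repeated_punct": re.compile(r"[!?.]{2,}"),
    "extra_whitespace": re.compile(r"\s{2,}"),
}

def noise_profile(texts):
    """Count how many texts contain at least one match for each pattern."""
    counts = Counter()
    for text in texts:
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return {name: counts[name] for name in PATTERNS}

sample = [
    "Great food!!! See www.example.com",
    "Nice place.<br>Will come back.",
    "ok",
]
profile = noise_profile(sample)
print(profile)
```

On the real data, the same function could be called on a slice such as `train_ds[:10000]["text"]` to get a rough estimate without scanning all 650,000 reviews.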

Define Text Cleaning Function

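We apply light preprocessing that aims to keep the meaning of each review intact. The character filter is lossy, though: it also strips apostrophes, hyphens, and single ! or ? marks, so "he's" becomes "he s" in the cleaned output.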

Cleaning Steps:

1. Lowercase the text

2. Remove HTML tags

3. Remove URLs

4. Convert emojis to text

5. Collapse repeated punctuation

6. Replace remaining special characters with spaces

7. Normalize whitespace

In [None]:
def clean_text(text: str) -> str:
    # Lowercase
    text = text.lower()
    
    # Remove HTML tags
    text = re.sub(r"<.*?>", " ", text)
    
    # Remove URLs
    text = re.sub(r"http\S+|www\S+", " ", text)
    
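    # Convert emojis to space-delimited text names (e.g. " smiling_face ").
    # Some emoji names can contain capitals, so lowercase again to keep them
    # from being stripped by the character filter below.
    text = emoji.demojize(text, delimiters=(" ", " ")).lower()
    
    # Collapse runs of two or more !, ?, or . into a single period
    text = re.sub(r"[!?.]{2,}", ".", text)
    
    # Replace everything except lowercase letters, digits, whitespace, periods,
    # and commas with a space (this also drops apostrophes, underscores,
    # hyphens, and single ! or ?)
    text = re.sub(r"[^a-z0-9\s.,]", " ", text)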
    
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    
    return text
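To see what each stage contributes, the regex steps can be traced one at a time. This is a sketch for illustration only: `trace_cleaning` and `STAGES` are hypothetical names, the input string is made up, and the emoji step is left out so that the example needs nothing beyond the standard library.

```python
import re

# Same regex stages as clean_text, minus the emoji step (no third-party import)
STAGES = [
    ("html", lambda t: re.sub(r"<.*?>", " ", t)),
    ("urls", lambda t: re.sub(r"http\S+|www\S+", " ", t)),
    ("punctuation", lambda t: re.sub(r"[!?.]{2,}", ".", t)),
    ("char_filter", lambda t: re.sub(r"[^a-z0-9\s.,]", " ", t)),
    ("whitespace", lambda t: re.sub(r"\s+", " ", t).strip()),
]

def trace_cleaning(text):
    """Return a list of (stage_name, text_after_stage) pairs."""
    text = text.lower()
    trace = [("lowercase", text)]
    for name, stage in STAGES:
        text = stage(text)
        trace.append((name, text))
    return trace

raw = "Loved it!!! <br> Visit http://example.com NOW... it's GREAT?!"
trace = trace_cleaning(raw)
for name, text in trace:
    print(f"{name:12} {text!r}")

final = trace[-1][1]
```

The final line of the trace is `'loved it. visit now. it s great.'`, which makes the trade-offs visible: "!!!" and "?!" both collapse to a period, and the apostrophe in "it's" is replaced by a space.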


Test Cleaning on Sample Reviews

Before applying globally, we test preprocessing on sample data.

In [5]:
sample_text = train_ds[0]["text"]

print("ORIGINAL TEXT:\n")
print(sample_text[:500])

print("\nCLEANED TEXT:\n")
print(clean_text(sample_text)[:500])


ORIGINAL TEXT:

dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about hi

CLEANED TEXT:

dr. goldberg offers everything i look for in a general practitioner. he s nice and easy to talk to without being patronizing he s always on time in seeing his patients he s affiliated with a top notch hospital nyu which my parents have explained to me is very important in case something happens and you need surgery and you can get referrals to see specialists without having to see him first. really, what more do you need i m sitting here trying to think of any co

Apply Cleaning to Training Data

We apply the cleaning function to all training reviews.

In [6]:
train_ds_clean = train_ds.map(
    lambda x: {"clean_text": clean_text(x["text"])},
    desc="Cleaning training data"
)


Cleaning training data: 100%|██████████| 650000/650000 [07:43<00:00, 1402.55 examples/s]
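The per-example `map` above took several minutes. `Dataset.map` also accepts `batched=True` and `num_proc`, which may reduce wall-clock time. The sketch below shows only the shape of a batched function: with `batched=True`, the function receives a dict of lists and must return a dict of lists. `make_batch_cleaner` is a hypothetical helper, `str.strip` stands in for `clean_text` so the block runs on its own, and the `num_proc=4` in the trailing comment is an assumed value, not a benchmarked one.

```python
from typing import Callable, Dict, List

def make_batch_cleaner(clean_fn: Callable[[str], str]):
    """Wrap a per-string cleaner so it accepts a batch (dict of lists)."""
    def clean_batch(batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
        return {"clean_text": [clean_fn(t) for t in batch["text"]]}
    return clean_batch

# Stand-in cleaner for demonstration; the notebook would pass clean_text
clean_batch = make_batch_cleaner(str.strip)
result = clean_batch({"text": ["  good  ", "bad "]})
print(result)

# Possible use in the notebook (not run here):
# train_ds_clean = train_ds.map(
#     make_batch_cleaner(clean_text), batched=True, num_proc=4,
#     desc="Cleaning training data",
# )
```

Whether this is faster depends on the machine and on process start-up overhead, so it is worth timing on a small slice first.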


Apply Cleaning to Test Data

In [None]:
test_ds_clean = test_ds.map(
    lambda x: {"clean_text": clean_text(x["text"])},
    desc="Cleaning test data"
)


Cleaning test data: 100%|██████████| 50000/50000 [00:37<00:00, 1328.24 examples/s]


Verify Cleaned Dataset Structure & Review Cleaned Samples

In [8]:
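# A bare expression is only displayed when it is the last line of a cell,
# so print the schema explicitly
print(train_ds_clean.features)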

for i in range(3):
    print(f"\nLabel: {train_ds_clean[i]['label']}")
    print("Cleaned Text:", train_ds_clean[i]["clean_text"][:400])



Label: 4
Cleaned Text: dr. goldberg offers everything i look for in a general practitioner. he s nice and easy to talk to without being patronizing he s always on time in seeing his patients he s affiliated with a top notch hospital nyu which my parents have explained to me is very important in case something happens and you need surgery and you can get referrals to see specialists without having to see him first. reall

Label: 1
Cleaned Text: unfortunately, the frustration of being dr. goldberg s patient is a repeat of the experience i ve had with so many other doctors in nyc good doctor, terrible staff. it seems that his staff simply never answers the phone. it usually takes 2 hours of repeated calling to get an answer. who has time for that or wants to deal with it i have run into this problem with many other doctors and i just don t

Label: 3
Cleaned Text: been going to dr. goldberg for over 10 years. i think i was one of his 1st patients when he started at mhmg. he s been great o

Length Analysis After Cleaning

Compare Length Before vs After Cleaning

Handling Extremely Short Reviews

In [None]:
# Convert the cleaned dataset to pandas for length analysis
train_df = train_ds_clean.to_pandas()

# Word counts after cleaning
train_df["clean_word_count"] = train_df["clean_text"].apply(lambda x: len(x.split()))
print(train_df["clean_word_count"].describe())

# Word counts before cleaning, for comparison
original_lengths = train_df["text"].apply(lambda x: len(x.split()))

comparison_df = pd.DataFrame({
    "original": original_lengths,
    "cleaned": train_df["clean_word_count"]
})
print(comparison_df.describe())

short_reviews_pct = (train_df["clean_word_count"] < 10).mean() * 100
print(f"Percentage of reviews shorter than 10 words: {short_reviews_pct:.2f}%")



Percentage of reviews shorter than 10 words: 1.59%
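The cell above only measures the share of short reviews; it does not remove them. If they were to be set aside, one option is a simple split by word count. This is a sketch under assumptions: `split_by_word_count` is a hypothetical helper, and the 10-word threshold is reused from the measurement above rather than tuned.

```python
from typing import List, Tuple

def split_by_word_count(texts: List[str], min_words: int = 10) -> Tuple[List[str], List[str]]:
    """Split texts into (kept, short) by whitespace-separated word count."""
    kept, short = [], []
    for text in texts:
        (kept if len(text.split()) >= min_words else short).append(text)
    return kept, short

reviews = [
    "great place",
    "the food was good and the staff were friendly and quick to help",
    "",
]
kept, short = split_by_word_count(reviews, min_words=10)
short_pct = 100 * len(short) / len(reviews)
print(f"kept={len(kept)} short={len(short)} ({short_pct:.2f}% short)")
```

Whether to drop short reviews at all is a modeling decision. A review like "great place" is short but still carries a clear sentiment signal, so keeping them may be the better default.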


Save Processed Dataset

In [10]:
train_ds_clean.save_to_disk("data/processed/train_clean")
test_ds_clean.save_to_disk("data/processed/test_clean")


Saving the dataset (2/2 shards): 100%|██████████| 650000/650000 [00:01<00:00, 439737.76 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 50000/50000 [00:00<00:00, 171505.07 examples/s]
