# 🧼 **Text Cleaning for Tokopedia User Reviews**
This notebook performs systematic text cleaning on raw, user-generated reviews collected from the Tokopedia application.

User reviews typically contain substantial noise, such as:

- emojis and unicode icons
- URLs and emails
- excessive character repetitions (“baaaagus bangeeetttt”)
- exaggerated laughter (“wkwkwkwkwk”, “hahahahaha”)
- slang and informal spellings (“gk”, “ga”, “bgt”, “plis”)
- typos and phonetic spelling
- punctuation noise
- extremely short or low-information messages (“ok”, “.”)

Cleaning these reviews is essential to:

- reduce vocabulary sparsity  
- standardize spelling variations  
- improve downstream NLP model quality  
- remove meaningless tokens  
- prepare the text for vectorization and modeling  

This notebook runs through the process **step-by-step**, showing before/after transformations to highlight the effect of each stage.
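
As a rough illustration of two of the noise types listed above, the sketch below collapses stretched characters and normalizes laughter with regular expressions. The function names, the laughter pattern, and the `hahaha` replacement are assumptions made for this example; they are not the implementation in `src/cleaning.py`.

```python
import re

def collapse_repeats(token: str, max_run: int = 1) -> str:
    """Shrink any run of one repeated character to at most `max_run` copies."""
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, token)

# Assumed laughter shapes: "wkwk...", "haha...", "hehe..." (two or more repeats)
LAUGHTER = re.compile(r"^(?:(?:wk){2,}w?|(?:ha){2,}h?|(?:he){2,}h?)$")

def normalize_laughter(token: str, replacement: str = "hahaha") -> str:
    """Map any token that is pure laughter to a single canonical form."""
    return replacement if LAUGHTER.match(token) else token

print(collapse_repeats("baaaagus"))      # stretched vowel collapsed
print(collapse_repeats("bangeeetttt"))   # several stretched runs collapsed
print(normalize_laughter("wkwkwkwkwk"))  # laughter mapped to one form
print(normalize_laughter("harga"))       # ordinary word left untouched
```

Collapsing every run to a single character also shortens legitimate double letters (for example "maaf"), which is one reason a dictionary check against a whitelist is useful after this step.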

In [2]:
# Directory alignment and module update
import sys
import importlib
sys.path.append("..")

# Ignore warning
from warnings import filterwarnings
filterwarnings('ignore')

# Core library
import pandas as pd
import numpy as np
import json

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Cleaning tools
import re
import sentencepiece as spm
import src.cleaning as cleaning
from src.cleaning import *
from src.cleaning import CleaningPipeline

# Reload shortcut
def r(module=cleaning):
    importlib.reload(module)


# Defaults
pd.set_option('display.max_colwidth', None)
plt.style.use('seaborn-v0_8-whitegrid')

print('Ready!')

Ready!


# 🔍 **Load Raw Review Data**

We start by loading the unprocessed user reviews from the dataset. Only the raw text column will be used in this notebook.

In [3]:
df = pd.read_csv('../data/raw/review.csv')

print(df.shape)
df.head()

(709000, 3)


Unnamed: 0,text,rating,date
0,belanja di tokopedia sangat mudah cuma sayang nya estimasi pengiriman yang tidak sesuai,5,2025-12-03 11:01:10
1,Memuaskan kan produk original,5,2025-12-03 10:35:41
2,mau nyari apa aja di mesin pencariannya TOKOPEDIA hasil timeout melulu padahal sinyal bagus maen game online aja lancar jaya,1,2025-12-03 10:11:21
3,jos mantap,5,2025-12-03 10:04:00
4,Tidak punya CS hanya ada bot yg tidak bisa memberikan solusi BURUK,1,2025-12-03 09:54:44


# 📚 **Load Cleaning Resources**

The cleaning pipeline uses several external resources stored in `assets/`, including:

- **slang.json**: a mapping from slang words to their normalized forms
- **stopwords.txt**: additional informal stopwords not found in standard lists
- **whitelist.txt**: a reference list of valid Indonesian words based on KBBI (Kamus Besar Bahasa Indonesia)
- **fuzzy_targets.json**: canonical words frequently affected by typos or misspellings

These resources supplement the cleaning functions defined in `src/cleaning.py`.
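
To make the role of these resources concrete, here is a minimal sketch of how a slang mapping and a stopword list might be applied token by token. The helper names and the toy entries are hypothetical and chosen only for illustration; the real files and the logic in `src/cleaning.py` may differ.

```python
from typing import Iterable, Mapping

def normalize_slang(text: str, slang: Mapping[str, str]) -> str:
    """Replace each whitespace-separated token that has a slang entry."""
    return " ".join(slang.get(tok, tok) for tok in text.split())

def drop_stopwords(text: str, stopwords: Iterable[str]) -> str:
    """Remove tokens that appear in the stopword collection."""
    stop = set(stopwords)  # a set gives constant-time membership checks
    return " ".join(tok for tok in text.split() if tok not in stop)

# Toy entries for illustration only, not the contents of the real files
toy_slang = {"gk": "tidak", "bgt": "banget", "yg": "yang"}
toy_stopwords = ["yang", "di"]

text = "gk bisa login yg lambat bgt"
text = normalize_slang(text, toy_slang)
text = drop_stopwords(text, toy_stopwords)
print(text)
```

Running slang normalization before stopword removal lets abbreviated stopwords such as "yg" be expanded first and then filtered by the same list.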


In [4]:
def load_json(path):
    """Read a JSON resource as UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def load_lines(path):
    """Read a text resource as UTF-8, one stripped entry per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f]

emoji = load_json("../assets/emoji_map.json")
pos_lexicon = load_json("../assets/pos_lexicon.json")
prefix_suffix = load_json("../assets/prefix_suffix.json")
slang = load_json("../assets/slang.json")
typo = load_json("../assets/typo.json")
affix_map = load_json("../assets/affix_map.json")

stopwords = load_lines("../assets/stopwords.txt")
whitelist = load_lines("../assets/whitelist.txt")
laughter = load_lines("../assets/laughter.txt")
negation = load_lines("../assets/negation.txt")

# **Baseline Cleaning**

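The baseline pass normalizes Unicode, lowercases the text, removes links and emails, strips punctuation, separates fused words and digits, collapses stretched characters, normalizes laughter, and maps emojis to placeholder tokens. Because the same words recur across many reviews, the stretch and laughter normalizers keep a dictionary cache of words they have already processed, so each distinct word is normalized only once.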

In [5]:
r(cleaning)

cleaner = cleaning.CleaningPipeline(
    whitelist=whitelist, slang=slang, typo=typo, prefix_suffix=prefix_suffix,
    emoji_map=emoji, laughter_list=laughter, stopwords=stopwords,
    pos_lexicon=pos_lexicon, negation_list=negation, affix_map=affix_map
)

In [6]:
STRETCH_CACHE = {}
LAUGHTER_CACHE = {}

def cached_stretch(word):
    w = word.lower()

    if w in STRETCH_CACHE:
        return STRETCH_CACHE[w]

    result = cleaner._stretch_all(w)

    STRETCH_CACHE[w] = result
    return result

def cached_laughter(word):
    w = word.lower()

    if w in LAUGHTER_CACHE:
        return LAUGHTER_CACHE[w]

    result = cleaner._normalize_laughter(w)

    LAUGHTER_CACHE[w] = result
    return result

# Baseline cleaning
def baseline_cleaning(text):
    if not isinstance(text, str):
        return ""

    text = cleaner._normalize_unicode(text)
    text = text.lower()
    text = cleaner._remove_email_and_link(text)
    text = cleaner._remove_punctuation(text)
    text = cleaner._handle_word_number(text)

    tokens = text.split()
    tokens = [
        cached_stretch(tok)
        for tok in tokens
    ]
    tokens = [
        cached_laughter(tok)
        for tok in tokens
    ]
    text = " ".join(tokens)
    text = cleaner._map_emoji(text)

    return text

from tqdm.notebook import tqdm
tqdm.pandas()

df["text_"] = df["text"].progress_apply(
    lambda sentence: baseline_cleaning(sentence)
)

  0%|          | 0/709000 [00:00<?, ?it/s]

In [7]:
df['_text_'] = df['text'].progress_apply(
    lambda sentence: cleaner.explain(sentence, verbose=False)
)

  0%|          | 0/709000 [00:00<?, ?it/s]

In [8]:
df.head()

Unnamed: 0,text,rating,date,text_,_text_
0,belanja di tokopedia sangat mudah cuma sayang nya estimasi pengiriman yang tidak sesuai,5,2025-12-03 11:01:10,belanja di tokopedia sangat mudah cuma sayang nya estimasi pengiriman yang tidak sesuai,belanja sangat mudah sayang estimasi pengiriman tidak sesuai
1,Memuaskan kan produk original,5,2025-12-03 10:35:41,memuaskan kan produk original,memuaskan produk original
2,mau nyari apa aja di mesin pencariannya TOKOPEDIA hasil timeout melulu padahal sinyal bagus maen game online aja lancar jaya,1,2025-12-03 10:11:21,mau nyari apa aja di mesin pencarianya tokopedia hasil timeout melulu padahal sinyal bagus maen game online aja lancar jaya,mencari mesin pencarianya hasil timeout melulu padahal jaringan bagus main game online lancar jaya
3,jos mantap,5,2025-12-03 10:04:00,jos mantap,jos mantap
4,Tidak punya CS hanya ada bot yg tidak bisa memberikan solusi BURUK,1,2025-12-03 09:54:44,tidak punya cs hanya ada bot yg tidak bisa memberikan solusi buruk,tidak customer service bot tidak bisa memberikan solusi buruk


In [18]:
# Intermediate cleaning (baseline pass only)
# .copy() avoids assigning into a view of df
df_interim = df[['text_', 'rating', 'date']].copy()
df_interim.columns = ['text', 'rating', 'date']
# Note: the CSV is written before the empty rows are dropped below
df_interim.to_csv('../data/interim/review.csv', index=False)

df_interim['text'] = df_interim['text'].astype(str).str.strip()
df_interim = df_interim[df_interim['text'] != ""]

# Full cleaning (complete pipeline)
df_clean = df[['_text_', 'rating', 'date']].copy()
df_clean.columns = ['text', 'rating', 'date']
# Note: the CSV is written before the empty rows are dropped below
df_clean.to_csv('../data/processed/review.csv', index=False)

df_clean['text'] = df_clean['text'].astype(str).str.strip()
df_clean = df_clean[df_clean['text'] != ""]

with open("../data/interim/all_reviews.txt", "w", encoding="utf-8") as f:
    for line in df_interim.text.astype(str):
        f.write(line.replace("\n", " ") + "\n")

with open("../data/processed/all_reviews.txt", "w", encoding="utf-8") as f:
    for line in df_clean.text.astype(str):
        f.write(line.replace("\n", " ") + "\n")

In [None]:
# from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
# from tqdm import tqdm

# # 1. Sastrawi Stemmer
# factory = StemmerFactory()
# stemmer = factory.create_stemmer()

# affix_map = {root: set() for root in whitelist}

# for tok in tqdm(set(tokens_interim), desc="Processing tokens"):
#     tok = tok.strip()
#     if not tok:
#         continue

#     stem = stemmer.stem(tok)

#     if stem in whitelist:
#         affix_map[stem].add(tok)

Processing tokens:  83%|████████▎ | 23748/28559 [28:28<06:42, 11.94it/s]

In [None]:
# token_to_root = {}

# for root, forms in affix_map.items():

#     clean_root = baseline_cleaning(root).strip()
#     if not clean_root:
#         continue

#     for word in forms:
#         clean_word = baseline_cleaning(word).strip()

#         if not clean_word:
#             continue

#         # skip root word → root word (aba:aba, abad:abad)
#         if clean_word == clean_root:
#             continue

#         # skip if the form contains more than one word
#         if " " in clean_word:
#             continue

#         token_to_root[clean_word] = clean_root

# with open("../assets/affix_map.json", "w", encoding="utf-8") as f:
#     json.dump(token_to_root, f, ensure_ascii=False, indent=4)

In [None]:
# unique_tokens_clean = set(tokens_interim)

# # take the unique stems
# unique_stems = set(token_to_root.values())

# # FILTER whitelist
# whitelist_filtered = unique_stems.intersection(whitelist)

# print("Whitelist original:", len(whitelist))
# print("Whitelist filtered:", len(whitelist_filtered))

Whitelist original: 28494
Whitelist filtered: 2304


# 🖋️ Possible Typo Mappings

Using the resources above, we can build a list of candidate typos: tokens that appear in none of our resources, such as the whitelist or the slang dictionary. Reviewing these candidates helps us spot likely typos and add them to the external resources, which makes the dataset cleaner on the next pass.

To extract unique tokens from the dataset, we first normalize the text: lowercasing, removing punctuation, stripping emojis, splitting fused digits and words, collapsing whitespace, normalizing Unicode, and normalizing laughter.

In [19]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 486539 entries, 0 to 708999
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    486539 non-null  object
 1   rating  486539 non-null  int64 
 2   date    486539 non-null  object
dtypes: int64(1), object(2)
memory usage: 14.8+ MB


In [20]:
tokens_raw = [
    word
    for sentence in df.text.astype(str)
    for word in sentence.split()
]

tokens_interim = [
    word
    for sentence in df_interim.text.astype(str)
    for word in sentence.split()
]

tokens_clean = [
    word
    for sentence in df_clean.text.astype(str)
    for word in sentence.split()
]

In [21]:
print("Raw Tokens length         : ", len(tokens_raw))
print("Interim Tokens Length     : ", len(tokens_interim))
print("Fully Clean Tokens Length : ", len(tokens_clean))

Raw Tokens length         :  5412151
Interim Tokens Length     :  5587881
Fully Clean Tokens Length :  4035205


In [22]:
print("Raw uniques tokens length     : ", len(set(tokens_raw)))
print("Interim uniques tokens length : ", len(set(tokens_interim)))
print("Clean uniques tokens length   : ", len(set(tokens_clean)))

Raw uniques tokens length     :  303552
Interim uniques tokens length :  77738
Clean uniques tokens length   :  59804


In [23]:
tokens_clean_set = set(tokens_clean)  # all tokens that appear in the fully cleaned data
whitelist = set(whitelist)
slang_keys = set(slang.keys())
slang_values = set(slang.values())
typo_keys = set(typo.keys())
emoji_mapping = set(emoji.values())
laughter_keys = set(laughter)
affix_keys = set(affix_map.keys())

# Union of every token the existing resources already account for
covered = set()

covered |= whitelist
covered |= slang_keys
covered |= slang_values
covered |= typo_keys
covered |= emoji_mapping
covered |= laughter_keys
covered |= affix_keys

# Tokens that no resource explains: candidate typos or missing vocabulary
uncovered_tokens = tokens_clean_set - covered

In [25]:
from collections import Counter

freq = Counter(tokens_clean)
uncovered_freq = {tok: freq[tok] for tok in uncovered_tokens}
uncovered_sorted = sorted(uncovered_freq.items(), key=lambda x: -x[1])
filtered = [(tok, freq) for tok, freq in uncovered_sorted if not tok.isdigit()]

In [147]:
data = [{"token": tok, "count": freq} for tok, freq in filtered]

with open("uncovered_sorted.json", "w", encoding="utf8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
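
One possible way to turn the uncovered tokens into typo candidates is to look for a close match in the known vocabulary. The sketch below uses `difflib.get_close_matches` from the Python standard library; the function name `suggest_corrections`, the toy vocabulary, and the `0.8` similarity cutoff are assumptions for illustration, and this step is not part of the notebook's pipeline. Any suggestion would still need manual review before being added to `typo.json`.

```python
from difflib import get_close_matches

def suggest_corrections(uncovered, vocabulary, cutoff=0.8):
    """Pair each uncovered token with its closest vocabulary word, if any."""
    vocab = sorted(vocabulary)  # fixed order keeps results reproducible
    suggestions = {}
    for tok in uncovered:
        matches = get_close_matches(tok, vocab, n=1, cutoff=cutoff)
        if matches:
            suggestions[tok] = matches[0]
    return suggestions

# Toy data for illustration only
toy_vocabulary = {"banget", "bagus", "lambat"}
toy_uncovered = ["bnaget", "zzzz"]

print(suggest_corrections(toy_uncovered, toy_vocabulary))
```

Comparing every uncovered token against a large whitelist is slow, so in practice it may be worth restricting the search to the most frequent uncovered tokens.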

# 📝 Example Raw Review

Let's inspect a very noisy raw review to understand the kinds of noise present in the text.

In [27]:
example ="""
WKWKWKWKWK😭😭😭 gk bisaaa login SKRGGG plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡
cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: TESTUSER@GMAIL.COM,,,
lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!!
"""

cleaner.explain(example)

=== EXPLAIN CLEANING PIPELINE ===
Input: 
WKWKWKWKWK😭😭😭 gk bisaaa login SKRGGG plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡
cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: TESTUSER@GMAIL.COM,,,
lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!!

---------------------------------
[Lowercase]
  before: 
WKWKWKWKWK😭😭😭 gk bisaaa login SKRGGG plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡
cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: TESTUSER@GMAIL.COM,,,
lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!!

  after : 
wkwkwkwkwk😭😭😭 gk bisaaa login skrggg plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡
cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: testuser@gmail.com,,,
lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!!

---------------------------------
[Remove Links]
  before: 

'hahaha [EMOJI_CRY] [EMOJI_CRY] [EMOJI_CRY] tidak bisa masuk sekarang mohon help lmot banget sumpah [EMOJI_VERY_ANGRY] [EMOJI_VERY_ANGRY] cek tidak tahu kenapa email ku lama sangat prosesnya [EMOJI_CRY] [EMOJI_CRY] tolong banget'

# 🚀 Applying the Full Cleaning Pipeline

Now that each cleaning step has been validated individually,
we apply the full `clean_text()` function to the entire dataset.

This ensures all reviews follow a standardized, noise-free text format.