# 🧼 **Text Cleaning for Tokopedia User Reviews**
This notebook performs systematic text cleaning on raw, user-generated reviews collected from the Tokopedia application.

User reviews typically contain substantial noise, such as:

- emojis and unicode icons
- URLs and emails
- excessive character repetitions (“baaaagus bangeeetttt”)
- exaggerated laughter (“wkwkwkwkwk”, “hahahahaha”)
- slang and informal spellings (“gk”, “ga”, “bgt”, “plis”)
- typos and phonetic spelling
- punctuation noise
- extremely short or low-information messages (“ok”, “.”)

Cleaning these reviews is essential to:

- reduce vocabulary sparsity  
- standardize spelling variations  
- improve downstream NLP model quality  
- remove meaningless tokens  
- prepare the text for vectorization and modeling  

This notebook runs through the process **step-by-step**, showing before/after transformations to highlight the effect of each stage.

In [1]:
# Directory alignment and module update
import sys
import importlib
sys.path.append("..")

# Ignore warning
from warnings import filterwarnings
filterwarnings('ignore')

# Core library
import pandas as pd
import json
from collections import Counter

# Cleaning tools
import src.cleaning as cleaning

# Reload shortcut
def r(module=cleaning):
    importlib.reload(module)

RAW_PATH = '../data/raw/review.csv'

# Defaults
pd.set_option('display.max_colwidth', None)

print('Ready!')

Ready!


# 🔍 **Load Raw Review Data**

We start by loading the unprocessed user reviews from the dataset. Only the raw text column will be used in this notebook.

In [2]:
df = pd.read_csv(RAW_PATH)

print(df.shape)
df.head()

(709000, 5)


Unnamed: 0,text,rating,date,char_len,token_len
0,belanja di tokopedia sangat mudah cuma sayang nya estimasi pengiriman yang tidak sesuai,5,2025-12-03 11:01:10,87.0,13
1,Memuaskan kan produk original,5,2025-12-03 10:35:41,29.0,4
2,mau nyari apa aja di mesin pencariannya TOKOPEDIA hasil timeout melulu padahal sinyal bagus maen game online aja lancar jaya,1,2025-12-03 10:11:21,124.0,20
3,jos mantap,5,2025-12-03 10:04:00,10.0,2
4,Tidak punya CS hanya ada bot yg tidak bisa memberikan solusi BURUK,1,2025-12-03 09:54:44,66.0,12


# 📚 **Loading Assets**

- `emoji_map.json`  
  A mapping from emoji characters to alphabetic tokens. Emojis often introduce encoding issues and increase vocabulary size. By converting them into readable alphabetic placeholders, downstream models can process text more reliably and consistently.

- `pos_lexicon.json` 
  A part-of-speech lexicon used for advanced linguistic processing. This asset helps capture semantic structure (verbs, nouns, adjectives, etc.), allowing more accurate downstream interpretation such as dependency contexts or rule-based transformations.

- `slang_map.json`
  Contains mappings from Indonesian slang terms to their canonical (formal) forms. Since slang appears heavily in user-generated content, normalizing them reduces vocabulary explosion and improves model robustness.

- `typo_map.json`
  A list of common typos mapped directly to their canonical forms. Typos that originate from slang (e.g., "gppp" → "gapapa") are resolved directly to the final standard word to avoid redundant two-step normalization (typo → slang → canonical).

- `affix_map.json`  
  A curated dictionary of valid stems and affixed words mined from the dataset. After performing automated stemming (Sastrawi), we cross-check the results with the whitelist to identify legitimate forms. This map prevents incorrect word-splitting during normalization (e.g., "dibukakan" shouldn't be split into "di" + "bukak" + "an").

- `stopwords.txt`
  A list of low-information words that can be ignored during analysis. Useful for reducing noise in tasks such as topic modeling or weighting in classical NLP.

- `whitelist.txt`
  A list of valid base words derived from KBBI and curated dataset vocabulary. Used to validate stems, prevent over-stemming, and guide morphological rules in the cleaning pipeline.

- `laughter.txt`
  A collection of Indonesian laughter variants (e.g., "wkwk", "awkawk", "xixixi", "hehe", "kekeke"). Since laughter expressions are extremely diverse and model-breaking, this list ensures consistent normalization to a stable token.

- `negation.txt`
  Contains Indonesian negation words (e.g., "tidak", "tak", "nggak", "bukan"). Important for future tasks such as sentiment analysis, where negation flipping plays a significant semantic role.
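
As a rough illustration of how dictionary assets like these can be applied, the sketch below normalizes tokens with a typo map first and a slang map second. The miniature maps and the `normalize_tokens` helper are hypothetical stand-ins; they are not the contents of the real asset files, nor the actual logic inside `CleaningPipeline`.

```python
# Hypothetical miniature maps; the real ones are loaded from ../assets/*.json.
toy_typo_map = {"bnaget": "banget", "gppp": "gapapa"}
toy_slang_map = {"gk": "tidak", "bgt": "banget", "skrg": "sekarang"}

def normalize_tokens(text, typo_map, slang_map):
    """Replace each token via the typo map, then via the slang map."""
    out = []
    for tok in text.split():
        tok = typo_map.get(tok, tok)    # fix known typos first
        tok = slang_map.get(tok, tok)   # then map slang to its canonical form
        out.append(tok)
    return " ".join(out)

print(normalize_tokens("gk bisa login skrg lambat bnaget", toy_typo_map, toy_slang_map))
```

Because typos that originate from slang already point at the final word, a single lookup per map is enough in this sketch.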

In [3]:
# Read every asset as UTF-8 so results do not depend on the platform default
with open("../assets/emoji_map.json", "r", encoding="utf-8") as f:
    emoji_map = json.load(f)

with open("../assets/pos_lexicon.json", encoding="utf-8") as f:
    pos_lexicon = json.load(f)

with open("../assets/slang_map.json", encoding="utf-8") as f:
    slang_map = json.load(f)

with open("../assets/typo_map.json", encoding="utf-8") as f:
    typo_map = json.load(f)

with open("../assets/affix_map.json", encoding="utf-8") as f:
    affix_map = json.load(f)

with open("../assets/stopwords.txt", encoding="utf-8") as f:
    stopwords = [x.strip() for x in f]

with open("../assets/whitelist.txt", encoding="utf-8") as f:
    whitelist = [x.strip() for x in f]

with open("../assets/laughter.txt", encoding="utf-8") as f:
    laughter = [x.strip() for x in f]

with open("../assets/negation.txt", encoding="utf-8") as f:
    negation = [x.strip() for x in f]

# 📝 **Example Raw Review**

To understand the types of noise present in the dataset, we inspect one of the noisiest raw reviews.

In [4]:
r(cleaning)

cleaner = cleaning.CleaningPipeline(
    whitelist=whitelist, slang_map=slang_map, typo_map=typo_map,
    emoji_map=emoji_map, laughter_list=laughter, stopwords=stopwords,
    pos_lexicon=pos_lexicon, negation_list=negation, affix_map=affix_map
)

example = """
WKWKWKWKWK😭😭😭 gk bisaaa login SKRGGG plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡
cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: TESTUSER@GMAIL.COM,,,
lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!! HARI2 GINI MuLu 2JIRR 3KOCAKK2
"""

cleaner.explain(example, verbose=True)

=== EXPLAIN CLEANING PIPELINE ===
Input: 
WKWKWKWKWK😭😭😭 gk bisaaa login SKRGGG plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡
cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: TESTUSER@GMAIL.COM,,,
lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!! HARI2 GINI MuLu 2JIRR 3KOCAKK2

---------------------------------
[Lowercase]
  before: 
WKWKWKWKWK😭😭😭 gk bisaaa login SKRGGG plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡
cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: TESTUSER@GMAIL.COM,,,
lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!! HARI2 GINI MuLu 2JIRR 3KOCAKK2

  after : 
wkwkwkwkwk😭😭😭 gk bisaaa login skrggg plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡
cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: testuser@gmail.com,,,
lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!

'hahaha [EMOJI_CRY] [EMOJI_CRY] tidak bisa masuk sekarang mohon help lambat banget sumpah [EMOJI_VERY_ANGRY] [EMOJI_VERY_ANGRY] cek tidak tahu kenapa email ku lama sangat prosesnya [EMOJI_CRY] [EMOJI_CRY] tolong banget hari hari begini melulu 2 anjing 3 kocak kocak'

The `CleaningPipeline` applies a sequence of transformations to each review: it first normalizes Unicode and lowercases the text, then removes emails/URLs and punctuation while handling word-number patterns. Next, it normalizes laughter variants and maps emojis to text tokens. After that, it collapses character stretching, splits attached compound words, and fixes common typos before mapping slang terms to their canonical forms. Finally, it removes stopwords, normalizes whitespace, and drops low-information texts, resulting in a clean, standardized corpus ready for downstream modeling.
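
Two of these stages, collapsing character stretching and normalizing laughter, can be sketched with regular expressions. This is a minimal illustration under assumed rules: the `collapse_stretch` and `normalize_laughter` helpers and the laughter pattern are hypothetical and are not the pipeline's `_stretch_all` or `_normalize_laughter` implementations.

```python
import re

def collapse_stretch(token, max_repeat=1):
    """Collapse runs of the same character down to at most `max_repeat`."""
    return re.sub(r"(.)\1+", lambda m: m.group(1) * max_repeat, token)

# Hypothetical pattern: two or more repeated "wk", "ha", "he", or "hi" syllables.
LAUGHTER_RE = re.compile(
    r"^(?:wk){2,}w?$|^(?:ha){2,}h?$|^(?:he){2,}h?$|^(?:hi){2,}h?$"
)

def normalize_laughter(token, placeholder="hahaha"):
    """Map any token that looks like laughter to one stable placeholder."""
    return placeholder if LAUGHTER_RE.match(token) else token

for raw in ["bisaaa", "tolongggg", "wkwkwkwkwk", "hahahahaha", "login"]:
    print(raw, "->", normalize_laughter(collapse_stretch(raw)))
```

A naive collapse like this also shortens legitimate double letters (`collapse_stretch("maaf")` returns `"maf"`), which is one reason to validate results against a word list such as the whitelist.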

# 💡 **Baseline Cleaning**

User-generated text, especially when scraped, tends to be noisy: stray Unicode characters, inconsistent capitalization, emojis, links, and spam. This noise inflates the vocabulary without adding meaning, so we start with a baseline pass that normalizes Unicode, lowercases the text, removes emails, links, and punctuation, splits word-number patterns, collapses stretched characters, normalizes laughter, and maps emojis to tokens.

**Note:** To keep this pass efficient over the full dataset, we cache the results of stretch and laughter normalization in dictionaries keyed by token, so each distinct token is processed only once.

In [5]:
STRETCH_CACHE = {}
LAUGHTER_CACHE = {}

def cached_stretch(word):
    w = word.lower()

    if w in STRETCH_CACHE:
        return STRETCH_CACHE[w]

    result = cleaner._stretch_all(w)

    STRETCH_CACHE[w] = result
    return result

def cached_laughter(word):
    w = word.lower()

    if w in LAUGHTER_CACHE:
        return LAUGHTER_CACHE[w]

    result = cleaner._normalize_laughter(w)

    LAUGHTER_CACHE[w] = result
    return result

# Baseline cleaning
def baseline_cleaning(text):
    if not isinstance(text, str):
        return ""

    text = cleaner._normalize_unicode(text)
    text = text.lower()
    text = cleaner._remove_email_and_link(text)
    text = cleaner._remove_punctuation(text)
    text = cleaner._handle_word_number(text)

    tokens = text.split()
    tokens = [
        cached_stretch(tok)
        for tok in tokens
    ]
    tokens = [
        cached_laughter(tok)
        for tok in tokens
    ]
    text = " ".join(tokens)
    text = cleaner._map_emoji(text)

    return text

from tqdm.notebook import tqdm
tqdm.pandas()

df["baseline_text"] = df["text"].progress_apply(baseline_cleaning)

  0%|          | 0/709000 [00:00<?, ?it/s]
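
The two dictionary caches in the cell above implement memoization by hand. The standard library's `functools.lru_cache` gives the same behavior with less code; the sketch below shows the pattern with a stand-in `slow_normalize` function (hypothetical, used here in place of `cleaner._stretch_all`).

```python
from functools import lru_cache
import re

def slow_normalize(word):
    # Stand-in for an expensive per-token call such as cleaner._stretch_all.
    return re.sub(r"(.)\1+", r"\1", word)

@lru_cache(maxsize=None)  # unbounded cache: every distinct token is computed once
def cached_normalize(word):
    return slow_normalize(word)

tokens = "bisaaa login bisaaa tolongggg bisaaa".split()
print([cached_normalize(t.lower()) for t in tokens])
print(cached_normalize.cache_info())
```

`cache_info()` reports hits and misses, which makes it easy to check how much repeated work the cache saves on a corpus where the same tokens recur many times.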

Because cleaning changes the text length and token count, we recompute `char_len` and `token_len` from the baseline text.

In [6]:
# Baseline cleaning save
# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
df_baseline = df[['baseline_text', 'rating', 'date']].copy()
df_baseline.columns = ['text', 'rating', 'date']

# Strip and drop empty rows first so the length features describe the saved text
df_baseline['text'] = df_baseline['text'].astype(str).str.strip()
df_baseline = df_baseline[df_baseline['text'] != ""].copy()

# Additional features
df_baseline["char_len"] = df_baseline["text"].str.len()
df_baseline["token_len"] = df_baseline["text"].str.split().str.len()

# Save to csv and txt
df_baseline.to_csv('../data/baseline/review.csv', index=False)
with open("../data/baseline/all_reviews.txt", "w", encoding="utf-8") as f:
    for line in df_baseline.text.astype(str):
        f.write(line.replace("\n", " ") + "\n")

In [7]:
df_baseline.head()

Unnamed: 0,text,rating,date,char_len,token_len
0,belanja di tokopedia sangat mudah cuma sayang nya estimasi pengiriman yang tidak sesuai,5,2025-12-03 11:01:10,87,13
1,memuaskan kan produk original,5,2025-12-03 10:35:41,29,4
2,mau nyari apa aja di mesin pencarianya tokopedia hasil timeout melulu padahal sinyal bagus maen game online aja lancar jaya,1,2025-12-03 10:11:21,123,20
3,jos mantap,5,2025-12-03 10:04:00,10,2
4,tidak punya cs hanya ada bot yg tidak bisa memberikan solusi buruk,1,2025-12-03 09:54:44,66,12


In [8]:
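index_to_compare = [10, 60, 134]

# Select by index label so raw and baseline rows stay aligned
# even though empty reviews were dropped from df_baseline
pd.DataFrame(
    [df['text'].loc[index_to_compare],
     df_baseline['text'].reindex(index_to_compare)],
    index=['raw', 'baseline']
)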

Unnamed: 0,10,60,134
raw,"malas belanja disini, gak dpt gratis ongkir, diakun sy gak dpt promo murah, mahal, saya keluar🙏🙏maaf",payah susah suka ngelek............................. ......................................................................................,"Mohon maaf, akun saya tiba-tiba dibatalkan saat akan melakukan check-out, alhasil, saya tidak diberi akses promosi Tokopedia, Voucher eXtra, dan Tokopedia Plus. Ini sangat merugikan, sedangkan penanganan CS tidak jelas kapan diperbaiki."
baseline,malas belanja disini gak dpt gratis ongkir diakun sy gak dpt promo murah mahal saya keluar [EMOJI_PRAY] maf,payah susah suka ngelek,mohon maaf akun saya tiba tiba dibatalkan saat akan melakukan check out alhasil saya tidak diberi akses promosi tokopedia voucher extra dan tokopedia plus ini sangat merugikan sedangkan penanganan cs tidak jelas kapan diperbaiki


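After baseline cleaning, the most visible surface noise is gone: text is lowercased, punctuation is removed, and emojis are mapped to tokens. Slang and abbreviations such as *gak*, *dpt*, and *sy* remain at this stage; they are addressed by the typo and slang mapping in the full pipeline. Note that character collapsing can also shorten legitimate double letters (*maaf* became *maf* in the first example).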

In [9]:
df.drop(columns=['baseline_text'], inplace=True)

# 🚀 **Fully Cleaned Dataset**

Now that each cleaning step has been validated, we apply the full pipeline to the entire dataset by calling `cleaner.explain()` with `verbose=False` on every review. This produces a standardized, noise-reduced text corpus that is ready for tokenization and modeling.

In [10]:
df['clean_text'] = df['text'].progress_apply(
    lambda sentence: cleaner.explain(sentence, verbose=False)
)

  0%|          | 0/709000 [00:00<?, ?it/s]

As with the baseline, we recompute `char_len` and `token_len` from the fully cleaned text.

In [11]:
# Full Cleaning
# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
df_clean = df[['clean_text', 'rating', 'date']].copy()
df_clean.columns = ['text', 'rating', 'date']

# Strip and drop empty rows first so the length features describe the saved text
df_clean['text'] = df_clean['text'].astype(str).str.strip()
df_clean = df_clean[df_clean['text'] != ""].copy()

# Additional features
df_clean["char_len"] = df_clean["text"].str.len()
df_clean["token_len"] = df_clean["text"].str.split().str.len()

df_clean.to_csv('../data/clean/review.csv', index=False)
with open("../data/clean/all_reviews.txt", "w", encoding="utf-8") as f:
    for line in df_clean.text.astype(str):
        f.write(line.replace("\n", " ") + "\n")

In [12]:
df_clean.head()

Unnamed: 0,text,rating,date,char_len,token_len
0,belanja sangat mudah sayang estimasi pengiriman tidak sesuai,5,2025-12-03 11:01:10,60,8
1,memuaskan produk original,5,2025-12-03 10:35:41,25,3
2,mencari mesin pencarianya hasil timeout melulu padahal jaringan bagus main game online lancar jaya,1,2025-12-03 10:11:21,98,14
3,jos mantap,5,2025-12-03 10:04:00,10,2
4,tidak customer service bot tidak bisa memberikan solusi buruk,1,2025-12-03 09:54:44,61,9


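The difference between raw and cleaned text is immediately visible: stopwords are removed and abbreviations are expanded (for example, *cs* becomes *customer service*). The smaller, more consistent vocabulary should make the following steps more robust.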

In [13]:
df.drop(columns=['clean_text'], inplace=True)

# 🖋️ **Possible Typo Mappings**

Using the assets above, we can build a list of candidate typos: tokens that appear in the data but not in the whitelist, the `slang_map` dictionary, or the other assets. Reviewing these candidates helps us find typos worth adding to `typo_map.json`, which makes the dataset cleaner on the next run.

Before that, we compare the total and unique token counts of each dataset version. This gives a sense of how noisy the raw data actually is.

In [14]:
print('Shape of raw dataset      :', df.shape)
print('Shape of baseline dataset :', df_baseline.shape)
print('Shape of clean dataset    :', df_clean.shape)

Shape of raw dataset      : (709000, 5)
Shape of baseline dataset : (708946, 5)
Shape of clean dataset    : (486438, 5)


In [15]:
tokens_raw = [
    word for sentence in df.text.astype(str)
    for word in sentence.split()
]

tokens_baseline = [
    word for sentence in df_baseline.text.astype(str)
    for word in sentence.split()
]

tokens_clean = [
    word for sentence in df_clean.text.astype(str)
    for word in sentence.split()
]

In [16]:
print("Raw Tokens length      :", len(tokens_raw))
print("Baseline Tokens Length :", len(tokens_baseline))
print("Clean Tokens Length    :", len(tokens_clean))

Raw Tokens length      : 5412151
Baseline Tokens Length : 5587881
Clean Tokens Length    : 3978209


In [17]:
print("Raw uniques tokens length      :", len(set(tokens_raw)))
print("Baseline uniques tokens length :", len(set(tokens_baseline)))
print("Clean uniques tokens length    :", len(set(tokens_clean)))

Raw uniques tokens length      : 303552
Baseline uniques tokens length : 77738
Clean uniques tokens length    : 59560


The baseline dataset has more total tokens than the raw dataset (5,587,881 vs. 5,412,151) even though it has far fewer unique tokens (77,738 vs. 303,552). The increase in total tokens comes from splitting compounds made of a word and a number (e.g. *hari2* -> *hari hari*, *2hari* -> *2 hari*), which turns one raw token into two.

The further drop from baseline to clean is expected: typo and slang normalization collapse spelling variants into a single form, while stopword removal and the dropping of low-information texts reduce the total count.
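
The word-number splitting described above can be sketched with two regular expressions. This is a minimal illustration, not the actual `_handle_word_number` implementation; the `split_word_number` helper and its two rules are assumptions that cover only the patterns named in the text.

```python
import re

def split_word_number(text):
    """Expand 'word2' reduplication and separate leading digits from words."""
    # "hari2" -> "hari hari": a trailing 2 marks reduplication in informal Indonesian.
    text = re.sub(r"\b([a-z]+)2\b", r"\1 \1", text)
    # "2hari" -> "2 hari": split a leading number from the word attached to it.
    text = re.sub(r"\b(\d+)([a-z]+)\b", r"\1 \2", text)
    return text

raw = "hari2 gini 2hari belum sampai"
split = split_word_number(raw)
print(split)
print(len(raw.split()), "->", len(split.split()))
```

The token count grows from 5 to 7 on this toy sentence, which is the same effect that makes the baseline token total exceed the raw total.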

In [18]:
covered = set()

covered |= set(whitelist)
covered |= set(slang_map.keys())
covered |= set(slang_map.values())
covered |= set(typo_map.values())
covered |= set(emoji_map.values())
covered |= set(laughter)
covered |= set(stopwords)
covered |= set(affix_map.keys())

uncovered_tokens = set(tokens_baseline) - covered

In [19]:
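# Count how often each uncovered token still appears after full cleaning
freq = Counter(tokens_clean)
uncovered_freq = {tok: freq[tok] for tok in uncovered_tokens}
uncovered_sorted = sorted(uncovered_freq.items(), key=lambda x: -x[1])
# Skip purely numeric tokens; `count` avoids reusing the name `freq`
filtered = [(tok, count) for tok, count in uncovered_sorted if not tok.isdigit()]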

filtered[:5]

[('[EMOJI_MISC]', 28288),
 ('prakerja', 11700),
 ('best', 5480),
 ('service', 5350),
 ('gopay', 4823)]

In [20]:
to_inspect = [{"token": tok, "count": freq} for tok, freq in filtered]

with open("../assets/to_inspect.json", "w", encoding="utf8") as f:
    json.dump(to_inspect, f, ensure_ascii=False, indent=2)

Reviewing the exported `to_inspect.json` list lets us clean the data further by adding the missing words to our assets.
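
One way to triage the exported list is to propose the nearest known word for each uncovered token. The sketch below uses `difflib.get_close_matches` from the standard library; the vocabulary, the tokens, and the `cutoff` value are illustrative assumptions, and every suggestion would still need manual review before being added to `typo_map.json`.

```python
from difflib import get_close_matches

# Stand-ins for the whitelist and for tokens from to_inspect.json
known_vocab = ["banget", "lambat", "pengiriman", "aplikasi"]
uncovered = ["bnaget", "pengirman", "gopay"]

suggestions = {}
for tok in uncovered:
    # Keep only the single best candidate above the similarity cutoff
    match = get_close_matches(tok, known_vocab, n=1, cutoff=0.8)
    suggestions[tok] = match[0] if match else None

print(suggestions)
```

Tokens with no close match, such as brand names, come back as `None` and are better handled by extending the whitelist than by adding a typo mapping.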

# 🪨 **Next Steps**

After the data cleaning stage, the next step is to convert the cleaned text into a form that can be used directly for modeling. This includes basic exploratory checks to confirm that the cleaning pipeline worked as expected, such as verifying token distributions, text lengths, and vocabulary size. Once the dataset looks consistent, we proceed with tokenization, either by training a custom tokenizer or by using an existing one that fits our domain. The goal is to produce stable token sequences that the model can learn from.

After tokenization, we prepare model-ready inputs such as numerical token IDs, attention masks, or TF-IDF vectors, depending on the modeling approach. We also finalize the train/validation split to ensure balanced evaluation. At this point, the dataset is fully preprocessed and ready to be fed into baseline models or more advanced architectures.
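
As one possible shape for the split mentioned above, the sketch below stratifies by rating so that both sets keep similar label proportions. The `stratified_split` helper, the 20% validation ratio, and the toy rows are assumptions for illustration; a library routine could be used instead.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, val_ratio=0.2, seed=42):
    """Split rows into train/validation while keeping label proportions similar."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_val = max(1, round(len(group) * val_ratio))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

# Toy rows standing in for the cleaned reviews
rows = [{"text": f"review {i}", "rating": 5 if i % 2 else 1} for i in range(20)]
train, val = stratified_split(rows, "rating")
print(len(train), len(val))
```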