# 🧼 **Text Cleaning for Tokopedia User Reviews**
This notebook performs systematic text cleaning on raw, user-generated reviews collected from the Tokopedia application.

User reviews typically contain substantial noise, such as:

- emojis and unicode icons
- URLs and emails
- excessive character repetitions (“baaaagus bangeeetttt”)
- exaggerated laughter (“wkwkwkwkwk”, “hahahahaha”)
- slang and informal spellings (“gk”, “ga”, “bgt”, “plis”)
- typos and phonetic spelling
- punctuation noise
- extremely short or low-information messages (“ok”, “.”)

Cleaning these reviews is essential to:

- reduce vocabulary sparsity  
- standardize spelling variations  
- improve downstream NLP model quality  
- remove meaningless tokens  
- prepare the text for vectorization and modeling  

This notebook runs through the process **step-by-step**, showing before/after transformations to highlight the effect of each stage.

In [3]:
# Directory alignment and module update
import sys
import importlib
sys.path.append("..")

# Ignore warning
from warnings import filterwarnings
filterwarnings('ignore')

# Core library
import pandas as pd
import numpy as np
import json

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Cleaning tools
import re
import src.cleaning as cleaning

# Reload shortcut
def r(module=cleaning):
    importlib.reload(module)

# Defaults
pd.set_option('display.max_colwidth', None)
plt.style.use('seaborn-v0_8-whitegrid')

print('Ready!')

Ready!


# 🔍 **Load Raw Review Data**

We start by loading the unprocessed user reviews from the dataset. Only the raw text column will be used in this notebook.

In [4]:
df = pd.read_csv('../data/raw/review.csv')
df.head(7)

Unnamed: 0,raw_text,rating,date
0,keluar masuk mulu,5,2025-11-27 08:23:08
1,good,5,2025-11-27 08:21:14
2,Penarikan Saldo refund saya kenapa masih di tahan pengembaliannya???,1,2025-11-27 07:51:54
3,update mulu heran,5,2025-11-27 07:18:06
4,"sekarang aplikasi tambah ancur, sudah boros batre dipakai nggak nyaman",1,2025-11-27 06:03:40
5,"minusnya satu kenapa customer service nya bisa lama bangett, ini perusahaan gede lohhh, please lah , aku nggk bisa narik dana refund lebih dari 2hari dan csnya terus dialihkan, ditanyakan nggk di bales2 🥲 coba diperbaiki lagi dong biar semuanya juga puas dengan pelayanan nan, dan masa iya penipu ada di tokped kamu gimana nyeleksinya heran penjual ada yang nipu😔bukan duit sedikit lohhh ini yang aku tarik saldo refund nya",3,2025-11-27 06:03:36
6,"Sekarang kenapa susah ya menginfokan ke penjual utk lampirin orderan kita via chat, biasa begitu tanya penjual itu otomatis ke kirim orderan kita tapi sekarang ga bisa...",4,2025-11-27 05:41:14


# 📚 **Load Cleaning Resources**

The cleaning pipeline uses several external resources stored in `assets/`, including:

- **slang.json**: a mapping from slang words to their normalized forms
- **stopwords.txt**: additional informal stopwords not found in standard lists
- **whitelist.txt**: a reference list of valid Indonesian words based on KBBI (Kamus Besar Bahasa Indonesia)
- **fuzzy_targets.json**: canonical words frequently affected by typos or misspellings

These resources supplement the cleaning functions defined in `src/cleaning.py`.
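
To make the role of these resources concrete, here is a minimal sketch of how a slang map and a stopword list could be applied to a tokenized review. The tiny `slang_demo` and `stopwords_demo` values and the helper `apply_resources` are hypothetical stand-ins for illustration; the real files are much larger, and the actual logic lives in `src/cleaning.py`.

```python
# Hypothetical miniature versions of slang.json and stopwords.txt (illustration only)
slang_demo = {"gk": "tidak", "ga": "tidak", "bgt": "banget", "plis": "tolong"}
stopwords_demo = {"yang", "dan", "di"}


def apply_resources(text, slang, stopwords):
    """Map slang tokens to their normalized form, then drop stopwords."""
    tokens = text.lower().split()
    tokens = [slang.get(tok, tok) for tok in tokens]          # slang -> normalized form
    tokens = [tok for tok in tokens if tok not in stopwords]  # remove stopwords
    return " ".join(tokens)


demo_out = apply_resources("Aplikasi yang ini gk bisa dibuka bgt", slang_demo, stopwords_demo)
print(demo_out)
```

Slang is mapped before stopwords are removed so that a normalized form can still be caught by the stopword list if it appears there.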


In [None]:
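# JSON resources (read as UTF-8 so emoji and accented characters load on any platform)
with open("../assets/emoji_map.json", encoding="utf-8") as f:
    emoji_map = json.load(f)

with open("../assets/fuzzy_targets.json", encoding="utf-8") as f:
    fuzzy_targets = json.load(f)

with open("../assets/pos_lexicon.json", encoding="utf-8") as f:
    pos_lexicon = json.load(f)

with open("../assets/prefix_suffix.json", encoding="utf-8") as f:
    prefix_suffix = json.load(f)

with open("../assets/slang.json", encoding="utf-8") as f:
    slang = json.load(f)

with open("../assets/typo.json", encoding="utf-8") as f:
    typo = json.load(f)

# Plain-text resources: one entry per line
with open("../assets/stopwords.txt", encoding="utf-8") as f:
    stopwords = [x.strip() for x in f]

# The whitelist is only used for membership checks, so keep it as a set
with open("../assets/whitelist.txt", encoding="utf-8") as f:
    whitelist = {x.strip() for x in f}

with open("../assets/laughter.txt", encoding="utf-8") as f:
    laughter = [x.strip() for x in f]

with open("../assets/negation.txt", encoding="utf-8") as f:
    negation = [x.strip() for x in f]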

# 🖋️ Possible Typo Mappings

With the resources above, we can build a list of hypothetical typos: tokens that appear in the reviews but not in the whitelist or the slang dictionary. Reviewing these candidates helps us spot likely typos and add them to the external resources, which makes the cleaned dataset more consistent.

Ideally, tokens are normalized before this comparison: Unicode normalization, lowercasing, punctuation and emoji removal, splitting digit-word compounds, whitespace collapsing, and laughter normalization. The cell below is a rough first pass that only splits the raw text on whitespace and checks against the whitelist, so its count also includes case variants and tokens with attached punctuation.

In [7]:
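# Whitespace-split every raw review into tokens (no normalization yet)
tokens = [word for sentence in df.raw_text for word in sentence.split()]

# Count the unique tokens that are absent from the whitelist
hypo_typo = len(set(tokens) - set(whitelist))
hypo_typo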

29754
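
A normalized version of this check could look like the sketch below. It is a simplified illustration, not the notebook's actual method: `reviews_demo`, `whitelist_demo`, `slang_demo`, and the helper `candidate_tokens` are hypothetical stand-ins, and the regexes only approximate the normalization steps described above.

```python
import re
import unicodedata

# Hypothetical stand-ins for df.raw_text, whitelist and slang (illustration only)
reviews_demo = ["Baguuus BANGET!!!", "gk bisa login, lemot bnaget", None]
whitelist_demo = {"bagus", "banget", "bisa", "login", "lemot"}
slang_demo = {"gk": "tidak"}


def candidate_tokens(text):
    """Lowercase, strip non-letters, and collapse character runs before splitting."""
    if not isinstance(text, str):              # skip missing reviews (NaN / None)
        return []
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation, digits, emojis
    text = re.sub(r"(.)\1{2,}", r"\1", text)   # runs of 3+ identical chars -> 1
    return text.split()


# Unique normalized tokens that are neither whitelisted nor known slang
vocab = {tok for review in reviews_demo for tok in candidate_tokens(review)}
typo_candidates = sorted(vocab - whitelist_demo - set(slang_demo))
print(typo_candidates)
```

Keeping the candidates as a sorted list, rather than only counting them, makes it easier to review them by hand and move confirmed typos into the resource files.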

# 📝 Example Raw Review

Let’s inspect a particularly noisy raw review to understand the kinds of noise present in the text.

In [None]:
example = """
WKWKWKWKWK😭😭😭 gk bisaaa login SKRGGG plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡
cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: TESTUSER@GMAIL.COM,,,
lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!!
"""

# 🚀 Applying the Full Cleaning Pipeline

Now that each cleaning step has been validated individually, we combine them into a single `clean_text()` function and apply it to the reviews. The cells below run it on the first 500 rows.

This puts every processed review into the same standardized, low-noise text format.
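
As a reminder of what the individual stages do, the sketch below traces a short noisy fragment through a few simplified stand-in steps and prints the text after each one. The `steps` list and its regexes are assumptions made for illustration; they only approximate the functions in `src/cleaning.py`.

```python
import re

# Simplified stand-ins for a few cleaning stages (the real ones live in src/cleaning.py)
steps = [
    ("lowercase", str.lower),
    ("strip URLs/emails", lambda t: re.sub(r"http\S+|www\.\S+|\S+@\S+", " ", t)),
    ("keep letters only", lambda t: re.sub(r"[^a-z\s]", " ", t)),
    ("collapse repeats", lambda t: re.sub(r"(.)\1{2,}", r"\1", t)),
    ("squeeze whitespace", lambda t: " ".join(t.split())),
]

sample = "gk bisaaa login SKRGGG plsssss!!!! cek https://tokopedia.com/login-error"

# Apply each stage in order and show the intermediate result
text = sample
for name, step in steps:
    text = step(text)
    print(f"{name:>20}: {text}")
```

Order matters here: URLs are removed before punctuation is stripped, otherwise the URL would be broken into word-like fragments that survive the later steps.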

In [192]:
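from typing import List

# Reload first so the star import below picks up the latest src/cleaning.py
r()
from src.cleaning import *

def clean_text(text: str,
               slang: dict,
               stopwords: List[str],
               fuzzy_targets: dict,
               whitelist: set
) -> str:
    # Non-string input (e.g. NaN) becomes an empty review
    if not isinstance(text, str):
        return ""

    # Unicode normalization, lowercasing, URL/email and emoji removal
    text = normalize_unicode(text)
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+|\S+@\S+", " ", text)
    text = remove_emoji(text)

    # Separate digits from words, then expand reduplication shorthand,
    # e.g. "bales2" -> "bales-bales" (matches only if the "2" is still attached)
    text = split_digit_words(text)
    text = re.sub(r"([a-zA-Z]+)2\b", r"\1-\1", text)

    # Laughter is normalized token by token
    tokens = text.split()
    tokens = [normalize_laughter_word(t, whitelist) for t in tokens]
    text = " ".join(tokens)

    text = remove_punctuation(text)

    # Spelling normalization: stretched vowels, repeated characters, slang, typos
    text = normalize_vowel_stretch(text)
    text = collapse_repeated_chars(text)
    text = normalize_slang(text, slang)
    text = fuzzy_normalize(text, fuzzy_targets, whitelist)
    text = remove_stopwords(text, stopwords)

    # Collapse whitespace and drop low-information reviews
    text = " ".join(text.split())
    text = drop_lowinfo(text)

    return text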
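
`fuzzy_normalize` is defined in `src/cleaning.py` and its implementation is not shown in this notebook. As a rough illustration of how such a step could work, the sketch below uses the standard library's `difflib.get_close_matches`. The helper `fuzzy_fix`, the demo lists, and the 0.8 cutoff are assumptions; in particular, `targets_demo` is a plain list, while the real `fuzzy_targets` is loaded from JSON and may be structured differently.

```python
from difflib import get_close_matches

# Hypothetical canonical targets and whitelist (illustration only)
targets_demo = ["banget", "lemot", "belanja"]
whitelist_demo = {"bisa", "lama", "banget", "lemot", "belanja"}


def fuzzy_fix(token, targets, whitelist, cutoff=0.8):
    """Return the closest canonical target for an unknown token, else the token itself."""
    if token in whitelist:                 # known words are never rewritten
        return token
    match = get_close_matches(token, targets, n=1, cutoff=cutoff)
    return match[0] if match else token


fixed = " ".join(
    fuzzy_fix(tok, targets_demo, whitelist_demo) for tok in "belnja bnaget bisa".split()
)
print(fixed)
```

Checking the whitelist first keeps valid words from being pulled toward a similar-looking target, and a fairly high cutoff limits false corrections at the cost of missing heavier typos.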

In [185]:
# Copy the slice so adding a column later does not write into a view of df
df_sample = df.iloc[:500].copy()

In [193]:
from tqdm.notebook import tqdm
tqdm.pandas()

df_sample["clean_text"] = df_sample["raw_text"].progress_apply(
    lambda x: clean_text(
        x,
        slang=slang,
        stopwords=stopwords,
        fuzzy_targets=fuzzy_targets,
        whitelist=whitelist
    )
)

print('finished!')

finished!


In [194]:
df_sample.head(15)

Unnamed: 0,raw_text,rating,date,clean_text
0,keluar masuk mulu,5,2025-11-27 08:23:08,keluar masuk melulu
1,good,5,2025-11-27 08:21:14,
2,Penarikan Saldo refund saya kenapa masih di tahan pengembaliannya???,1,2025-11-27 07:51:54,penarikan saldo refund kenapa masih tahan pengembaliannya
3,update mulu heran,5,2025-11-27 07:18:06,perbarui melulu heran
4,"sekarang aplikasi tambah ancur, sudah boros batre dipakai nggak nyaman",1,2025-11-27 06:03:40,sekarang tambah hancur sudah boros batre dipakai tidak nyaman
5,"minusnya satu kenapa customer service nya bisa lama bangett, ini perusahaan gede lohhh, please lah , aku nggk bisa narik dana refund lebih dari 2hari dan csnya terus dialihkan, ditanyakan nggk di bales2 🥲 coba diperbaiki lagi dong biar semuanya juga puas dengan pelayanan nan, dan masa iya penipu ada di tokped kamu gimana nyeleksinya heran penjual ada yang nipu😔bukan duit sedikit lohhh ini yang aku tarik saldo refund nya",3,2025-11-27 06:03:36,minusnya satu kenapa customer service nya bisa lama banget perusahan besar tolong tidak bisa narik dana refund lebih 2 hari csnya dialihkan ditanyakan tidak balas 2 coba diperbaiki semuanya puas pelayanan nan masa iya penipu bagaimana nyeleksinya heran penjual tipu bukan uang tertawa tarik saldo refund nya
6,"Sekarang kenapa susah ya menginfokan ke penjual utk lampirin orderan kita via chat, biasa begitu tanya penjual itu otomatis ke kirim orderan kita tapi sekarang ga bisa...",4,2025-11-27 05:41:14,sekarang kenapa susah iya menginfokan penjual lampirin orderan via chat biasa begitu tanya penjual otomatis kirim orderan sekarang tidak bisa
7,tokped ngeleg parah pas update jadi males benlajanya.. di perbaiki segera pelanggan pada kabur tar,2,2025-11-27 05:29:47,lemot parah ketika perbarui malas benlajanya perbaiki segera pelanggan kabur nanti
8,ok,3,2025-11-27 04:51:26,
9,semoga sukses,5,2025-11-27 03:37:29,semoga sukses


In [152]:
df_sample

Unnamed: 0,raw_text,rating,date
0,keluar masuk melulu,5,2025-11-27 08:23:08
1,,5,2025-11-27 08:21:14
2,penarikan saldo refund kenapa masih tahan pengembaliannya,1,2025-11-27 07:51:54
3,perbarui melulu heran,5,2025-11-27 07:18:06
4,sekarang tambah hancur sudah boros batre dipakai tidak nyaman,1,2025-11-27 06:03:40
...,...,...,...
495,pelit ongkos kirim najis,1,2025-11-19 23:05:38
496,,5,2025-11-19 22:36:08
497,mudah to tertawa,5,2025-11-19 22:34:24
498,gopay pinjamnya tidak bisa aktifkan ditolak,5,2025-11-19 22:29:40


In [None]:
# Save the cleaned sample; df itself has no clean_text column
df_sample.to_csv('../data/processed/processed1.csv', index=False)
df_sample.head(7)