# 🧼 Text Cleaning for Tokopedia User Reviews
This notebook performs systematic text cleaning on raw, user-generated reviews collected from the Tokopedia application.

User reviews typically contain substantial noise, such as:

- emojis and unicode icons
- URLs and emails
- excessive character repetitions (“baaaagus bangeeetttt”)
- exaggerated laughter (“wkwkwkwkwk”, “hahahahaha”)
- slang and informal spellings (“gk”, “ga”, “bgt”, “plis”)
- typos and phonetic spelling
- punctuation noise
- extremely short or low-information messages (“ok”, “.”)

Cleaning these reviews is essential to:

- reduce vocabulary sparsity  
- standardize spelling variations  
- improve downstream NLP model quality  
- remove meaningless tokens  
- prepare the text for vectorization and modeling  

This notebook runs through the process **step-by-step**, showing before/after transformations to highlight the effect of each stage.

In [94]:
# Directory alignment and module update
import sys
import importlib
sys.path.append("..")

# Ignore warning
from warnings import filterwarnings
filterwarnings('ignore')

# Core library
import pandas as pd
import numpy as np
import json

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Cleaning tools
import re
import src.cleaning as cleaning

# Reload shortcut
def r(module=cleaning):
    importlib.reload(module)

# Defaults
pd.set_option('display.max_colwidth', None)
plt.style.use('seaborn-v0_8-whitegrid')

print('Ready!')

Ready!


# 🔍 Load Raw Review Data

We start by loading the unprocessed user reviews from the dataset. Only the raw text column will be used in this notebook.

In [95]:
df = pd.read_csv('../data/raw/review.csv')
df.head()

Unnamed: 0,raw_text,rating,date
0,Aplikasi bagun untuk belanja,4,2025-11-26 16:54:58
1,sudah mantap,5,2025-11-26 16:52:43
2,"tokopedia sekarang jadi ribet tidak seperti dulu lagi kalo mau return barang yang tidak sesuai harus nunggu waktu terlalu lama jadi males belanja lagi di tokopedia saya auto unistal, tidak seperti shopee yang mudah dan enak dan sekarang belaja terus di shopee",1,2025-11-26 16:30:31
3,"kasih bintang 1 ,karena ngisi kouta saja lama prosesnya",1,2025-11-26 15:05:33
4,"di janjikan dapet kompensasi atas keterlambatan pengiriman sameday yg gak sesuai estimasi, sampai sekarang udh 8 hari kerja blm dapet jga gimana sih Tokopedia",1,2025-11-26 15:03:32


# 📚 Load Cleaning Resources

The cleaning pipeline uses several external resources stored in `resources/`:

- **slang.json**: a mapping from slang words to their normalized forms  
- **stopwords.txt**: additional informal stopwords not found in standard lists  
- **fuzzy_targets.json**: canonical words frequently affected by typos or misspellings  

These resources supplement the cleaning functions defined in `src/cleaning.py`.


In [96]:
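# Slang dictionary: informal token -> normalized form
with open("../resources/slang.json", encoding="utf-8") as f:
    slang = json.load(f)

# Stopword list: one word per line, blank lines skipped
with open("../resources/stopwords.txt", encoding="utf-8") as f:
    stopwords = [line.strip() for line in f if line.strip()]

# Canonical target words for fuzzy typo correction
with open("../resources/fuzzy_targets.json", encoding="utf-8") as f:
    fuzzy_targets = json.load(f)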

# 📝 Example Raw Review

Let’s inspect a particularly noisy raw review to understand the kinds of noise present in the text.

In [97]:
example = "WKWKWKWKWK😭😭😭 gk bisaaa login SKRGGG plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡 cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: TESTUSER@GMAIL.COM,,, lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!!"

## Step 1: Lowercasing & Removing URLs/Emails

User reviews often contain URLs, email addresses, and inconsistent capitalization. These add unnecessary variance to the vocabulary, so we standardize them early.

In [98]:
step1 = example.lower()
step1 = re.sub(r"http\S+|www\.\S+|\S+@\S+", " ", step1)
step1

'wkwkwkwkwk😭😭😭 gk bisaaa login skrggg plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡 cek ini deh:   gk tauuu kenapaaa, email-ku:   lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!!'

## Step 2: Remove Emoji

Emojis complicate tokenization, and this pipeline treats them as noise rather than as a signal for text modeling. We remove them using a Unicode-based pattern.

In [99]:
step2 = cleaning.remove_emoji(step1)
step2

'wkwkwkwkwk gk bisaaa login skrggg plsssss helpppp!!!! lmoott bnaget... sumpaaahhh  cek ini deh:   gk tauuu kenapaaa, email-ku:   lamaaaaaa bangettttt prosesnyyyyaaaa  sm tolongggg bgt dongggg!!!!'
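The body of `cleaning.remove_emoji` is not shown in this notebook. As a rough sketch of what a Unicode-based pattern could look like, the function below deletes characters in a few emoji code point ranges. The name `remove_emoji_sketch` and the chosen ranges are assumptions for illustration; the real pattern in `src/cleaning.py` may cover different ranges.

```python
import re

# Assumed emoji ranges (hypothetical; the real pattern in src/cleaning.py may differ).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)


def remove_emoji_sketch(text: str) -> str:
    """Delete every run of characters that falls in the assumed emoji ranges."""
    return EMOJI_PATTERN.sub("", text)


print(repr(remove_emoji_sketch("sumpah 😡😡 cek ini deh")))
```

Deleting the emoji without inserting a space matches the output above, where the text around a removed emoji keeps both of its original spaces.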

## Step 3: Normalize Laughter Patterns

Indonesian users frequently express laughter using patterns such as:  

- “wkwkwkwk”
- “wkwwkkwkw”
- “hahahahaha”

We normalize these exaggerated sequences into a canonical form (“wkwk”, “haha”) to reduce vocabulary explosion.

In [100]:
step3 = cleaning.normalize_laughter(step2)
step3

'wkwk gk bisaaa login skrggg plsssss helpppp!!!! lmoott bnaget... sumpaaahhh  cek ini deh:   gk tauuu kenapaaa, email-ku:   lamaaaaaa bangettttt prosesnyyyyaaaa  sm tolongggg bgt dongggg!!!!'
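One way such a laughter rule could be written is with two regular expressions, one per laughter family. This is only a sketch: `normalize_laughter_sketch` and both patterns are assumptions, not the code behind `cleaning.normalize_laughter`.

```python
import re

# Hypothetical patterns: a token of 4+ letters drawn only from "w"/"k",
# and a token made of repeated "ha" with an optional trailing "h".
WK_PATTERN = re.compile(r"\b[wk]{4,}\b")
HA_PATTERN = re.compile(r"\b(?:ha){2,}h?\b")


def _wk_replacement(match: re.Match) -> str:
    token = match.group(0)
    # Only treat the token as laughter if it mixes both letters.
    return "wkwk" if "w" in token and "k" in token else token


def normalize_laughter_sketch(text: str) -> str:
    """Map long 'wk...' and 'haha...' tokens to the canonical 'wkwk' / 'haha'."""
    text = WK_PATTERN.sub(_wk_replacement, text)
    return HA_PATTERN.sub("haha", text)


print(normalize_laughter_sketch("wkwkwkwkwk gk bisa login hahahahaha"))
```

Using a character class for the “wk” family lets the sketch catch irregular variants such as “wkwwkkwkw”, which a strict `(?:wk)+` pattern would miss.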

## Step 4: Collapse Repeated Characters

Over-emphasized expressions such as “baaaagusss” or “bangeeettt” introduce many unique tokens. We collapse any character repeated three or more times in a row into a single instance; characters that are merely doubled are left unchanged at this stage.

In [101]:
step4 = cleaning.collapse_repeated_chars(step3)
step4

'wkwk gk bisa login skrg pls help! lmoott bnaget. sumpah  cek ini deh: gk tau kenapa, email-ku: lama banget prosesnya  sm tolong bgt dong!'
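The behavior in this output (runs of three or more collapse, doubles such as “lmoott” survive) can be reproduced with a single backreference regex. The sketch below is an assumption about how `cleaning.collapse_repeated_chars` might work, with a hypothetical function name.

```python
import re

# (.) captures any character; \1{2,} requires at least two more copies of it,
# so the pattern only fires on runs of three or more.
REPEAT_PATTERN = re.compile(r"(.)\1{2,}")


def collapse_repeated_chars_sketch(text: str) -> str:
    """Collapse any run of 3+ identical characters to a single character."""
    return REPEAT_PATTERN.sub(r"\1", text)


print(collapse_repeated_chars_sketch("gk bisaaa login skrggg plsssss helpppp!!!!"))
```

Because the pattern matches any character, it also shortens repeated punctuation and runs of three or more spaces, which is consistent with the output above.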

## Step 5: Normalize Vowel Stretching

In informal Indonesian text, users often elongate vowels to express emotion (“laaaamaaa”). We reduce any remaining run of a repeated vowel to a single vowel, which catches doubled vowels that Step 4 leaves behind.

In [102]:
step5 = cleaning.normalize_vowel_stretch(step4)
step5

'wkwk gk bisa login skrg pls help! lmott bnaget. sumpah  cek ini deh: gk tau kenapa, email-ku: lama banget prosesnya  sm tolong bgt dong!'
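A minimal sketch of a vowel-stretch rule, assuming it simply reduces any repeated vowel to one occurrence. The function name and the vowel set are illustrative; `cleaning.normalize_vowel_stretch` may apply extra conditions.

```python
import re

# A vowel followed by one or more copies of itself.
VOWEL_RUN_PATTERN = re.compile(r"([aeiou])\1+")


def normalize_vowel_stretch_sketch(text: str) -> str:
    """Reduce every run of the same vowel to a single vowel."""
    return VOWEL_RUN_PATTERN.sub(r"\1", text)


print(normalize_vowel_stretch_sketch("lmoott bnaget"))
```

A rule this simple also shortens legitimate double vowels (for example, “maaf” becomes “maf”), so a production version would likely need an exception list or rely on a later normalization step to restore such words.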

## Step 6: Remove Punctuation

Removing punctuation before slang/fuzzy lookup ensures tokens match dictionary keys.

In [103]:
step6 = cleaning.remove_punctuation(step5)
step6

'wkwk gk bisa login skrg pls help  lmott bnaget  sumpah  cek ini deh  gk tau kenapa  email ku  lama banget prosesnya  sm tolong bgt dong '
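In this output each punctuation mark has been replaced by a space rather than deleted, so “email-ku:” becomes two tokens. A sketch that behaves the same way on this example is shown below; the function name and the exact character class are assumptions.

```python
import re

# Anything that is neither a word character nor whitespace counts as punctuation here.
PUNCT_PATTERN = re.compile(r"[^\w\s]")


def remove_punctuation_sketch(text: str) -> str:
    """Replace each punctuation character with a single space."""
    return PUNCT_PATTERN.sub(" ", text)


print(repr(remove_punctuation_sketch("pls help! lmott bnaget. email-ku: lama")))
```

Substituting a space instead of an empty string keeps words that were joined only by punctuation from merging into a single token.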

## Step 7: Slang Normalization

Slang expressions like:  
- “gk”  
- “ga”  
- “sm”  
- “bgt”  

are replaced using a predefined slang dictionary.


In [104]:
step7 = cleaning.normalize_slang(step6, slang)
step7

'tertawa tidak bisa login sekarang tolong help lmott bnaget sumpah cek ini deh tidak tahu kenapa email ku lama banget prosesnya sama tolong banget dong'
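Conceptually, this step is a token-by-token dictionary lookup. The sketch below assumes the slang resource is a flat `{slang: normalized}` mapping and uses a tiny hand-written dictionary (`demo_slang`) built from replacements visible in the output above; it is not the contents of `slang.json`.

```python
def normalize_slang_sketch(text: str, slang_map: dict) -> str:
    """Replace each whitespace-separated token that has an entry in slang_map."""
    tokens = text.split()  # split() also discards repeated and trailing spaces
    return " ".join(slang_map.get(token, token) for token in tokens)


# Illustrative entries only (hypothetical subset of a slang dictionary).
demo_slang = {"gk": "tidak", "skrg": "sekarang", "bgt": "banget", "sm": "sama"}

print(normalize_slang_sketch("gk bisa login skrg  sm tolong bgt dong ", demo_slang))
```

Splitting and rejoining on whitespace would also explain why the double spaces left by earlier steps disappear in this step's output.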

## Step 8: Fuzzy Normalization

Typographical variations such as:
- “bangett”
- “bnaget”
- “bangeet”
- “lemott”

are mapped back into canonical forms (“banget”, “lemot”) using fuzzy similarity scoring.

In [105]:
step8 = cleaning.fuzzy_normalize(step7, fuzzy_targets)
step8

'tertawa tidak bisa login sekarang tolong help lemot banget sumpah cek ini deh tidak tahu kenapa email ku lama banget prosesnya sama tolong banget dong'
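The notebook does not say which similarity measure `cleaning.fuzzy_normalize` uses. As a stand-in, the sketch below uses `difflib.get_close_matches` from the standard library and assumes the targets are a flat list of canonical words. The function name, the list shape, and the `cutoff` value of 0.75 are all assumptions.

```python
from difflib import get_close_matches


def fuzzy_normalize_sketch(text: str, targets: list, cutoff: float = 0.75) -> str:
    """Replace each token with its closest canonical target, if one is similar enough."""
    normalized = []
    for token in text.split():
        # n=1 keeps only the best candidate; an empty list means no match above cutoff.
        match = get_close_matches(token, targets, n=1, cutoff=cutoff)
        normalized.append(match[0] if match else token)
    return " ".join(normalized)


# Illustrative targets only (hypothetical subset of the canonical word list).
demo_targets = ["banget", "lemot"]

print(fuzzy_normalize_sketch("login lmott bnaget prosesnya", demo_targets))
```

The cutoff is the main tuning knob: set too low, unrelated short words start snapping to a target; set too high, transposition typos such as “bnaget” are missed.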

## Step 9: Remove Stopwords

We remove additional informal stopwords (e.g., “sih”, “dong”, “lah”) to focus on content-bearing tokens.

In [106]:
step9 = cleaning.remove_stopwords(step8, stopwords)
step9

'tertawa login tolong help lemot banget sumpah cek deh email ku banget prosesnya tolong banget'
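Stopword removal reduces to a membership test per token. The sketch below is an assumed implementation with a hypothetical name; the three demo stopwords are the examples named in this step, not the full contents of the stopword file.

```python
def remove_stopwords_sketch(text: str, stopword_list) -> str:
    """Drop every token that appears in stopword_list."""
    stopword_set = set(stopword_list)  # set lookup is O(1) per token
    return " ".join(token for token in text.split() if token not in stopword_set)


demo_stopwords = ["sih", "dong", "lah"]

print(remove_stopwords_sketch("lemot banget sih tolong dong", demo_stopwords))
```

Converting the list to a set matters at scale: when the function runs over hundreds of thousands of reviews, scanning a list for every token becomes a noticeable cost.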

## Step 10: Remove Low-Information Reviews

Extremely short or non-informative texts (e.g., “ok”, “.”) are dropped entirely.

In [107]:
final_example = cleaning.drop_lowinfo(step9)
final_example

'tertawa login tolong help lemot banget sumpah cek deh email ku banget prosesnya tolong banget'
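The example review is long enough to pass through unchanged, so this output does not reveal what `cleaning.drop_lowinfo` returns for a rejected review. The sketch below assumes a simple token-count threshold and returns `None` for rejected text; both the `min_tokens` value and the `None` convention are assumptions.

```python
from typing import Optional


def drop_lowinfo_sketch(text: str, min_tokens: int = 2) -> Optional[str]:
    """Return the text unchanged, or None if it has fewer than min_tokens tokens."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        return None  # assumed marker for "drop this review"
    return text


print(drop_lowinfo_sketch("ok"))
print(drop_lowinfo_sketch("lemot banget prosesnya"))
```

Returning a sentinel rather than an empty string makes rejected rows easy to filter afterwards, for example with `dropna()` on the cleaned column.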

# 🧹 After Cleaning Review

After running the full cleaning pipeline, we end up with cleaner, more valuable review text, as shown below.

In [108]:
r()

In [109]:
pd.DataFrame({
    "raw_text": [example],
    "cleaned_text": [final_example]
})

Unnamed: 0,raw_text,cleaned_text
0,"WKWKWKWKWK😭😭😭 gk bisaaa login SKRGGG plsssss helpppp!!!! lmoott bnaget... sumpaaahhh 😡😡 cek ini deh: https://tokopedia.com/login-error gk tauuu kenapaaa, email-ku: TESTUSER@GMAIL.COM,,, lamaaaaaa bangettttt prosesnyyyyaaaa 😭😭 sm tolongggg bgt dongggg!!!!",tertawa login tolong help lemot banget sumpah cek deh email ku banget prosesnya tolong banget


# 🚀 Applying the Full Cleaning Pipeline

Now that each cleaning step has been validated individually,
we apply the full `clean_text()` function to the entire dataset.

This gives every review the same standardized, low-noise text format.

In [None]:
from tqdm.notebook import tqdm
tqdm.pandas()

df["clean_text"] = df["raw_text"].progress_apply(
    lambda x: cleaning.clean_text(
        x,
        slang=slang,
        stopwords=stopwords,
        fuzzy_targets=fuzzy_targets
    )
)

print('finished!')

  0%|          | 0/500000 [00:00<?, ?it/s]
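The internals of `clean_text()` are not shown in this notebook. A minimal sketch of how such a function could chain steps in the order demonstrated above is given below. It covers only the regex-based steps, and the names `STEPS` and `clean_text_sketch` are hypothetical.

```python
import re
from functools import reduce

# Ordered list of text -> text transformations (a subset of the notebook's steps).
STEPS = [
    str.lower,                                                  # lowercase
    lambda t: re.sub(r"http\S+|www\.\S+|\S+@\S+", " ", t),      # URLs and emails
    lambda t: re.sub(r"(.)\1{2,}", r"\1", t),                   # runs of 3+ characters
    lambda t: re.sub(r"[^\w\s]", " ", t),                       # punctuation -> space
    lambda t: " ".join(t.split()),                              # tidy whitespace
]


def clean_text_sketch(text: str) -> str:
    """Apply every step in STEPS to the text, in order."""
    return reduce(lambda acc, step: step(acc), STEPS, text)


print(clean_text_sketch("Lamaaaa BANGETTT!!! cek https://tokopedia.com/x"))
```

Keeping the steps in an explicit list makes the ordering visible, which matters here because later steps (such as dictionary lookups) depend on earlier ones having already stripped punctuation and casing.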