# Pre-processing the MuMiN large subset

In this notebook we pre-process the text of the subset of the MuMiN large dataset that we extracted for our thesis project. We will use two models: the IndoBERT model by Willie et al., which was pre-trained on Tweets, and the IndoBERTweet model by Koto et al. Because the pre-processing steps differ slightly between the two models, we'll create two separate datasets, one for each model. For both datasets, we'll do the following:

- Encode the labels
- Convert all text to lowercase
- Remove duplicate Tweets
- Remove newlines and other non-informative characters
- Remove excess white space
  
For the IndoBERT model we'll perform the following pre-processing steps:

- Remove emojis
- Replace mentions, hashtags and URLs with generic tokens (`<user>`, `<hashtag>` and `<links>` respectively)

For the IndoBERTweet model we'll perform the following pre-processing steps:

- Replace mentions and URLs with the tags `@USER` and `HTTPURL` respectively
- Replace emojis with their text representations
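
The shared steps listed at the top of this section (lowercasing, removing newlines and control characters, collapsing white space) could be sketched as a single helper. This is only an illustration: `clean_shared` and its two regular expressions are hypothetical names and are not part of the notebook's pipeline.

```python
import re

# Hypothetical patterns, assumed for illustration only.
# ASCII and C1 control characters; this range includes \n, \r and \t.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")
# Any run of white space.
WHITESPACE = re.compile(r"\s+")


def clean_shared(text: str) -> str:
    """Apply the cleaning steps shared by both datasets to one Tweet."""
    text = text.lower()
    # Replace control characters with a space so that words stay separated.
    text = CONTROL_CHARS.sub(" ", text)
    # Collapse runs of white space and trim both ends.
    return WHITESPACE.sub(" ", text).strip()


cleaned_example = clean_shared("Halo\r\nDunia!  \x9a Ini   TES")
```

Applied to a column, this would look like `mumin_df["cleaned_text"].map(clean_shared)`, assuming the column contains no missing values.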

# Preliminary EDA

## Load the data

In [11]:
from pathlib import Path
import warnings
import numpy as np
import pandas as pd

# Ignore all warnings
warnings.filterwarnings("ignore")

# Set data directory path and file name
data_dir = Path("../../data")
data_file = "mumin_large-trans.csv"

# Load the data
mumin_df = pd.read_csv(data_dir.joinpath(data_file))

## Examine data

In [12]:
mumin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9476 entries, 0 to 9475
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   text                9476 non-null   object
 1   translated_text_id  9476 non-null   object
 2   label               9476 non-null   int64 
 3   lang                9476 non-null   object
dtypes: int64(1), object(3)
memory usage: 296.2+ KB


In [13]:
mumin_df.head()

| | text | translated_text_id | label | lang |
|---|---|---|---|---|
| 0 | to keep our upper respiratory tract healthy in... | Untuk menjaga kesehatan saluran pernapasan ata... | 1 | en |
| 1 | gargling salt water does not 'kill' coronaviru... | berkumur air garam tidak 'membunuh' virus coro... | 1 | en |
| 2 | कॉरोना वायरस फेफड़ों में जाने से पहले तीन-चार ... | Virus corona bertahan di tenggorokan selama ti... | 1 | hi |
| 3 | antes de llegar a los pulmones dura 4 días en ... | sebelum mencapai paru-paru itu berlangsung 4 h... | 1 | es |
| 4 | so they say the first symptons are #coughingth... | jadi mereka mengatakan gejala pertama adalah #... | 1 | en |


Note the duplicate entries in the `translated_text_id` column. We'll need to remove these duplicates later.

In [14]:
# Fix the NumPy seed so that the random sample is reproducible
np.random.seed(4)
mumin_df.sample(20).translated_text_id.to_list()

['[salah] “alkohol bisa membunuh coronavirus covid-19”gambar hasil suntingan. berasal dari situs pembuat template meme, sengaja diproduksi untuk parodi.selengkapnya di <URL> #coronavirusfacts <URL>',
 'Tolong jangan meremehkan situasi #coronavirus. Ini bukan flu yang lebih buruk, ini jauh lebih dramatis. Jika tidak ada yang buruk, jangan ke dokter. tidak ke rumah sakit. tinggal di rumah baca laporan dokter Italia dan pelajari<URL>',
 '<URL> wali mengungkapkan bahwa sebuah perusahaan kecil di brisbane terkait dengan skandal tes virus corona di australia, puerto rico &amp; amerika serikat mengatakan sekarang mencoba untuk memenuhi pesanan internasional dengan membuat tes antibodi sendiri, perlu diselidiki!',
 'obat herbal bisa menyembuhkan virus corona, kata ooni <URL>',
 '#indiafightscorona #mitos: #covaxin mengandung serum anak sapi yang baru lahir❓#fakta: tidak ada serum anak sapi yang baru lahir dalam produk akhir covaxin.➡️ serum anak sapi yang baru lahir hanya digunakan untuk persi

Here we can see the following types of entities in the text that must be removed or replaced:

- Newline characters (`\n` and `\r`)
- Emojis
- URLs
- Mentions

Not shown are non-printable control characters, such as `\x9a`.
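
To get a rough idea of how common each kind of artifact is, one could count the rows that contain it. The sketch below is an assumption-laden illustration on a toy series: the `ARTIFACT_PATTERNS` dictionary and the `count_artifacts` helper are hypothetical, and the placeholder patterns assume that mentions and URLs already appear as `<USER>` and `<URL>`, as in the sample above.

```python
import pandas as pd

# Hypothetical patterns, assumed for illustration only.
ARTIFACT_PATTERNS = {
    "newline": r"[\r\n]",
    "url_placeholder": r"<url>",
    "mention_placeholder": r"<user>",
    "hashtag": r"#\S+",
    "c1_control_char": r"[\x80-\x9f]",
    "non_ascii": r"[^\x00-\x7f]",
}


def count_artifacts(texts: pd.Series) -> pd.Series:
    """Count how many rows contain each kind of artifact."""
    counts = {
        name: int(texts.str.contains(pattern, case=False, regex=True).sum())
        for name, pattern in ARTIFACT_PATTERNS.items()
    }
    return pd.Series(counts)


toy_texts = pd.Series(["halo\ndunia <URL>", "#covid <USER> 🙂", "teks biasa"])
artifact_counts = count_artifacts(toy_texts)
```

On the real data this would be called as `count_artifacts(mumin_df["translated_text_id"])`.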

Since we're dealing with Tweets, the text also contains hashtags. Hashtags carry useful information, and the IndoBERT model that we plan to use was trained on a corpus of Tweets, so we assume that it can handle them. We still need to check whether the corpus is available and how its hashtags were processed during training. If the hashtags were left as-is, there's no need to remove them; if they were processed in some way, we'll need to apply the same processing to these Tweets.

# Process text

## Create separate column for cleaned text

This column will hold the result of the pre-processing steps that are shared by both the IndoBERT and IndoBERTweet datasets.

In [15]:
mumin_df["cleaned_text"] = mumin_df["translated_text_id"]

## Convert all text to lowercase

In [16]:
mumin_df.cleaned_text = mumin_df.cleaned_text.str.lower()
mumin_df.cleaned_text.head().to_list()

['untuk menjaga kesehatan saluran pernapasan atas kita di masa #coronavirus ini, mari kita berkumur dengan air hangat dengan garam (sebaiknya himalayan atau garam laut), beberapa jahe dan cuka sari apel di pagi hari dan sebelum kita tidur. memuntahkan air setelahnya. #letsfightcovid19 <url>',
 "berkumur air garam tidak 'membunuh' virus corona di tenggorokan berita metro <url>",
 'virus corona bertahan di tenggorokan selama tiga-empat hari sebelum masuk ke paru-paru... juga menyebabkan batuk dan berdahak... jika dilakukan, penyakit ini bisa dihindari... dan banyak nyawa yang bisa diselamatkan.🕉️👋🌷 <url>',
 'sebelum mencapai paru-paru itu berlangsung 4 hari di tenggorokan, di mana orang yang terinfeksi mulai batuk dan sakit tenggorokan. mereka harus minum banyak air dan berkumur dengan air hangat dengan garam atau cuka ini akan menghilangkan retweet coronavirus karena mereka dapat menyelamatkan seseorang. <url>',
 'jadi mereka mengatakan gejala pertama adalah #batuk virus tetap di tenggo

## Remove duplicate entries

In [17]:
size_before = mumin_df.shape[0]
mumin_df.drop_duplicates(subset="cleaned_text", inplace=True, ignore_index=True)
size_after = mumin_df.shape[0]

In [18]:
print(f"# of records before dropping duplicates: {size_before}")
print(f"# of records after dropping duplicates: {size_after}")

# of records before dropping duplicates: 9476
# of records after dropping duplicates: 9302
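
Before dropping duplicates, it can be worth checking whether any duplicated text carries more than one label, since `drop_duplicates` keeps only the first occurrence. The following is a minimal sketch on a toy frame; the data and the variable names are made up for illustration.

```python
import pandas as pd

# Toy data, assumed for illustration only.
toy_df = pd.DataFrame(
    {
        "cleaned_text": ["a", "a", "b", "c", "c"],
        "label": [1, 1, 0, 1, 0],
    }
)

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = toy_df[toy_df.duplicated(subset="cleaned_text", keep=False)]

# Count the distinct labels attached to each duplicated text.
labels_per_text = dupes.groupby("cleaned_text")["label"].nunique()
conflicting_texts = labels_per_text[labels_per_text > 1].index.to_list()
```

Any text that ends up in `conflicting_texts` would need a decision on which label to keep.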


## Pre-processing for IndoBERT

Here we do the following:

- Remove emojis
- Replace Tweet artifacts with generic tokens

In [23]:
# Create separate dataset for indoBERT text
mumin_df["cleaned_text_ib"] = mumin_df["cleaned_text"]

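# Remove emojis by dropping every non-ASCII character
# (note: this also strips any other non-ASCII symbols, e.g. curly quotes)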
mumin_df.cleaned_text_ib = mumin_df.cleaned_text_ib.str.encode("ascii", "ignore").str.decode("utf_8", "ignore")

# Replace mentions with <user>
mumin_df.cleaned_text_ib = mumin_df.cleaned_text_ib.str.replace("<USER>", "<user>", case=False)

# Replace URLs with <links>
mumin_df.cleaned_text_ib = mumin_df.cleaned_text_ib.str.replace("<URL>", "<links>", case=False)
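
The encode/decode round trip used above to remove emojis drops every character outside ASCII, not only emojis. The plain-Python sketch below shows the effect on two made-up strings; `strip_non_ascii` is a hypothetical helper that mirrors the `.str.encode("ascii", "ignore")` call.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards the characters that fail to encode.
    return text.encode("ascii", "ignore").decode("ascii")


# The emoji is removed, but the space before it remains.
with_emoji = strip_non_ascii("tetap di rumah 🙂")

# Curly quotes and accented letters are removed as well.
with_quotes = strip_non_ascii("“alkohol” dan café")
```

Translated Indonesian text is mostly ASCII, so the side effect is probably small, but quotation marks and accented loanwords are affected.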

In [24]:
# Write this to a csv file
data_file = "mumin_large-trans-indobert_hashtags.csv"
cols_to_write = ["cleaned_text_ib", "label", "lang"]
mumin_df.reindex(columns=cols_to_write).to_csv(data_dir.joinpath(data_file), index=False)
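
One caveat when selecting columns with `reindex`: a misspelled column name does not raise an error but produces a column of missing values, whereas plain list indexing fails loudly. A small sketch on a toy frame, with a deliberately misspelled name (`labl`) chosen for illustration:

```python
import pandas as pd

# Toy data, assumed for illustration only.
toy_df = pd.DataFrame({"cleaned_text_ib": ["teks a", "teks b"], "label": [1, 0]})
cols_with_typo = ["cleaned_text_ib", "labl"]

# reindex silently creates an all-NaN column for the unknown name.
silently_empty = toy_df.reindex(columns=cols_with_typo)

# List indexing raises a KeyError instead.
try:
    toy_df[cols_with_typo]
    raised_key_error = False
except KeyError:
    raised_key_error = True
```

Using `mumin_df[cols_to_write]` before `to_csv` would therefore surface a wrong column name immediately.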

We'll also create a separate dataset that replaces the hashtags with the `<hashtag>` token.

In [25]:
# Replace hashtags with <hashtag>
# The raw string avoids the invalid "\S" escape, and regex=True is needed because
# pandas >= 2.0 treats the pattern as a literal string by default
mumin_df.cleaned_text_ib = mumin_df.cleaned_text_ib.str.replace(r"#\S+", "<hashtag>", regex=True)
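
The choice of pattern affects what gets swallowed together with the hashtag. As a hedged illustration on a toy series, `#\S+` consumes everything up to the next white space, including trailing punctuation, while `#\w+` stops at the first non-word character. The `#\w+` variant is an alternative added here for comparison and is not used in the notebook.

```python
import pandas as pd

# Toy data, assumed for illustration only.
toy_texts = pd.Series(["cuci tangan #covid19, ya", "tanpa tagar"])

# Matches the hashtag plus any attached punctuation.
greedy = toy_texts.str.replace(r"#\S+", "<hashtag>", regex=True)

# Matches only letters, digits and underscores after the '#'.
word_only = toy_texts.str.replace(r"#\w+", "<hashtag>", regex=True)
```

Passing `regex=True` explicitly keeps the behavior the same across pandas versions.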

In [27]:
# Write this to a csv file
data_file = "mumin_large-trans-indobert_no_hashtags.csv"
mumin_df.reindex(columns=cols_to_write).to_csv(data_dir.joinpath(data_file), index=False)

## Pre-processing for IndoBERTweet

Here we do the following:

- Convert user mentions and URLs to `@USER` and `HTTPURL` respectively.
- Convert emojis to their text representations using the `emoji` package.
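
In this dataset, mentions and URLs already appear as `<USER>` and `<URL>` placeholders. For raw Tweets that still contain real handles and links, a comparable normalization could be sketched with regular expressions. The patterns and the `normalize_raw_tweet` helper below are simplified assumptions for illustration, not the normalization used by the IndoBERTweet authors.

```python
import re

# Simplified, hypothetical patterns; real Tweets may need stricter ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def normalize_raw_tweet(text: str) -> str:
    """Replace URLs and user mentions with the HTTPURL and @USER tokens."""
    # Handle URLs first so that an '@' inside a link is not treated as a mention.
    text = URL_RE.sub("HTTPURL", text)
    return MENTION_RE.sub("@USER", text)


normalized = normalize_raw_tweet("@budi cek https://t.co/abc123 sekarang")
```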

In [28]:
import emoji

# Create separate column for IndoBERTweet dataset
mumin_df["cleaned_text_ibt"] = mumin_df["cleaned_text"]

# Replace mentions with the token @USER
mumin_df.cleaned_text_ibt = mumin_df.cleaned_text_ibt.str.replace("<USER>", "@USER", case=False)

# Replace URLs with the token HTTPURL
mumin_df.cleaned_text_ibt = mumin_df.cleaned_text_ibt.str.replace("<URL>", "HTTPURL", case=False)

# Convert emojis to their text representations using the emoji package
mumin_df.cleaned_text_ibt = mumin_df.cleaned_text_ibt.apply(emoji.demojize)

In [29]:
# Write this to a csv file
data_file = "mumin_large-trans-indobertweet.csv"
cols_to_write = ["cleaned_text_ibt", "label", "lang"]
mumin_df.reindex(columns=cols_to_write).to_csv(data_dir.joinpath(data_file), index=False)