# Pre-processing the MuMiN medium subset

In this notebook we pre-process the text of the subset of the MuMiN medium dataset that we extracted for our thesis project. We will use two separate models: the IndoBERT model by Willie et al., which has been pre-trained on Tweets, and the IndoBERTweet model by Koto et al. Because the pre-processing steps differ slightly between the two models, we'll create two separate datasets, one for each model. For both datasets, we'll do the following:

- Encode the labels
- Convert all text to lowercase
- Remove duplicate Tweets
- Remove newlines and other non-informative characters
- Remove excess white space
  
For the IndoBERT model we'll perform the following pre-processing steps:

- Remove emojis
- Replace mentions and URLs with generic tokens (`<user>` for mentions and `<links>` for URLs)
- Optionally replace hashtags with the `<hashtag>` token (written to a separate file)

For the IndoBERTweet model we'll perform the following pre-processing steps:

- Replace mentions and URLs with the tags `@USER` and `HTTPURL` respectively
- Replace emojis with their text representations

# Preliminary EDA

## Load the data

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd

# Set data directory path and file name
data_dir = Path("data")
data_file = "mumin_medium-id_trans.csv"

# Load the data
mumin_df = pd.read_csv(data_dir.joinpath(data_file))

## Examine data

In [2]:
mumin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4453 entries, 0 to 4452
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   text             4453 non-null   object
 1   translated_text  4453 non-null   object
 2   label            4453 non-null   object
 3   lang             4453 non-null   object
dtypes: object(4)
memory usage: 139.3+ KB


In [3]:
mumin_df.head()

|   | text | translated_text | label | lang |
|---|------|-----------------|-------|------|
| 0 | कॉरोना वायरस फेफड़ों में जाने से पहले तीन-चार ... | Virus corona bertahan di tenggorokan selama ti... | misinformation | hi |
| 1 | Antes de llegar a los pulmones dura 4 días en ... | Sebelum mencapai paru-paru, itu berlangsung 4 ... | misinformation | es |
| 2 | మంచి వార్త! కరోనా వైరస్ వ్యాక్సిన్ సిద్ధంగా ఉ... | Kabar baik! Vaksin virus corona sudah siap. Ke... | misinformation | te |
| 3 | Great news! Carona virus vaccine ready. Able t... | Kabar baik! Vaksin virus carona sudah siap. Ma... | misinformation | en |
| 4 | మంచి వార్త! కరోనా వైరస్ వ్యాక్సిన్ సిద్ధంగా ఉ... | Kabar baik! Vaksin virus corona sudah siap. Ke... | misinformation | te |


Note the duplicate entries in the `translated_text` column. We'll need to remove these duplicates later.
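To get a feel for how many rows are affected before deduplicating, one option is to count how often each translated text occurs. The sketch below uses plain Python on a toy list (the strings are abbreviated stand-ins, not real rows). On the DataFrame itself, `mumin_df.translated_text.duplicated().sum()` should report the same kind of surplus count.

```python
from collections import Counter

# Toy stand-ins for the translated_text column (not the real data)
texts = [
    "Kabar baik! Vaksin virus corona sudah siap.",
    "Sebelum mencapai paru-paru, itu berlangsung 4 hari.",
    "Kabar baik! Vaksin virus corona sudah siap.",
]

counts = Counter(texts)

# Texts that occur more than once, with their frequency
repeated = {text: n for text, n in counts.items() if n > 1}

# Number of rows that dropping duplicates would remove
n_surplus = sum(n - 1 for n in counts.values())

print(repeated)
print(n_surplus)
```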

In [4]:
# Fix the seed so that the sample is reproducible
np.random.seed(4)
mumin_df.sample(20).translated_text.to_list()

['Setengah Dari Semua Orang Dewasa AS Akan Divaksinasi Sepenuhnya Terhadap COVID-19 Mulai Selasa https://t.co/2rL3p78Twy',
 '🚨🚨🚨\n\nKepala petugas medis Moderna mengakui bahwa vaksin messenger RNA mengubah DNA.\n\nhttps://t.co/GMmtMuOug3',
 'Bank Dunia menempatkan Brasil sebagai negara terbaik dalam perang melawan Covid-19 https://t.co/EkogjS9aKU',
 'Ini adalah negara-negara di mana virus corona ditemukan di China telah menyebar.\nhttps://t.co/VmlE7Psh9y https://t.co/ZE4npQQLnK',
 'Namun studi lain tentang efektivitas Ivermectin. https://t.co/JY61pDWaGc',
 'Dokter dari pedalaman Venezuela menuntut vaksin melawan COVID-19 sementara kematian tidak berhenti. https://t.co/N97GP7rGRF',
 'Apakah paket bantuan pandemi Covid-19 senilai $1,9 triliun akan membebani setiap warga negara AS $5.750?\n\nTidak. Para ekonom mengatakan biaya rencana tidak dapat dikaitkan dengan setiap orang Amerika dengan cara ini https://t.co/wLN3jUuAvs',
 'Virus corona Wuhan diyakini berasal dari pasar yang menjual he

Here we can see the following types of entities in the text that must be removed or replaced:

- Newline characters (`\n` and `\r`)
- Emojis
- URLs
- Mentions

Not shown are stray control characters, such as `\x9a`.
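A rough way to see how common each of these entity types is would be to count them per Tweet. The following is a minimal sketch in plain Python. The patterns and the helper name `count_entities` are illustrative assumptions and are not the exact expressions used later in this notebook.

```python
import re
import unicodedata

# Illustrative patterns (assumptions, not the ones used later on)
PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "mention": re.compile(r"@\w+"),
    "hashtag": re.compile(r"#\w+"),
    "newline": re.compile(r"[\r\n]"),
}


def count_entities(text):
    """Count occurrences of each entity type in a single string."""
    counts = {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}
    # Control characters (Unicode category "Cc") other than newlines, e.g. \x9a
    counts["control"] = sum(
        1 for ch in text if unicodedata.category(ch) == "Cc" and ch not in "\r\n"
    )
    return counts


sample = "Halo @budi\n\nlihat #covid19 di https://t.co/abc123 \x9a"
result = count_entities(sample)
print(result)
```

Applied to a whole column with `mumin_df.translated_text.apply(count_entities)`, this would give one dictionary per row that can be summed up afterwards.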

Since we're dealing with Tweets, the text also contains hashtags. Hashtags carry useful information, and the IndoBERT model that we plan to use was trained on a corpus of Tweets, so I'm assuming that IndoBERT knows how to handle them. I'll check whether there is a link to the corpus and how the model was trained on it. If the hashtags were left as-is, there's no need to remove them. If they were processed in some way, I'll need to apply the same processing to these Tweets.

# Process text

## Create separate column for cleaned text

This column will hold the text after the pre-processing steps that are shared by the IndoBERT and IndoBERTweet datasets.

In [5]:
mumin_df["cleaned_text"] = mumin_df["translated_text"]

## Convert all text to lowercase

In [6]:
mumin_df.cleaned_text = mumin_df.cleaned_text.str.lower()
mumin_df.cleaned_text.head().to_list()

['virus corona bertahan di tenggorokan selama tiga hingga empat hari sebelum masuk ke paru-paru.\n\nia juga mengeluh batuk dan berdahak.\n\ndalam situasi seperti itu, jika kumur dilakukan dengan menambahkan garam ke air panas ...\n\ndan jika lemon dikonsumsi maka penyakit ini bisa dihindari.\n\ndan banyak nyawa bisa diselamatkan.\n\n️👋🌷🌹💐🌴 https://t.co/uoilydci6a',
 'sebelum mencapai paru-paru, itu berlangsung 4 hari di tenggorokan, di mana orang yang terinfeksi mulai batuk dan sakit tenggorokan. mereka harus banyak minum air putih dan berkumur dengan air hangat dengan garam atau cuka, ini akan menghilangkan retweet coronavirus karena bisa menyelamatkan seseorang. https://t.co/z7eudqcalj',
 'kabar baik! vaksin virus corona sudah siap. kemampuan untuk menyembuhkan pasien dalam waktu 3 jam setelah injeksi. angkat topi untuk para ilmuwan as.\n trump baru saja mengumumkan bahwa roche medical company akan merilis vaksin minggu depan dan jutaan dosis darinya https://t.co/zyx4rzhs3h',
 'kabar

## Remove duplicate entries

In [7]:
size_before = mumin_df.shape[0]
mumin_df.drop_duplicates(subset="cleaned_text", inplace=True, ignore_index=True)
size_after = mumin_df.shape[0]

In [8]:
print(f"# of records before dropping duplicates: {size_before}")
print(f"# of records after dropping duplicates: {size_after}")

# of records before dropping duplicates: 4453
# of records after dropping duplicates: 3089
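Note that `drop_duplicates` keeps the first occurrence of each text, so if the same text ever appeared with both labels, the label that survives would depend on row order. Whether this happens in our data is not checked above. The sketch below shows one way to look for such conflicts, using plain Python and made-up rows.

```python
from collections import defaultdict

# Made-up (text, label) pairs standing in for cleaned_text and label
rows = [
    ("contoh teks a", "misinformation"),
    ("contoh teks a", "misinformation"),
    ("contoh teks b", "misinformation"),
    ("contoh teks b", "factual"),
]

# Collect the set of labels seen for each distinct text
labels_by_text = defaultdict(set)
for text, label in rows:
    labels_by_text[text].add(label)

# Texts whose duplicates disagree on the label
conflicts = {
    text: sorted(labels) for text, labels in labels_by_text.items() if len(labels) > 1
}
print(conflicts)
```

On the DataFrame, grouping by `cleaned_text` and calling `nunique()` on `label` before dropping duplicates should surface the same information.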


## Remove newline characters

We will also remove carriage returns and the pipe character (`|`), which add no information.

In [9]:
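# Delete newlines, carriage returns and pipes; regex=True is required in recent pandas versions
mumin_df.cleaned_text = mumin_df.cleaned_text.str.replace(r"[\n\r|]", "", regex=True)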
mumin_df.cleaned_text.head().to_list()


['virus corona bertahan di tenggorokan selama tiga hingga empat hari sebelum masuk ke paru-paru.ia juga mengeluh batuk dan berdahak.dalam situasi seperti itu, jika kumur dilakukan dengan menambahkan garam ke air panas ...dan jika lemon dikonsumsi maka penyakit ini bisa dihindari.dan banyak nyawa bisa diselamatkan.️👋🌷🌹💐🌴 https://t.co/uoilydci6a',
 'sebelum mencapai paru-paru, itu berlangsung 4 hari di tenggorokan, di mana orang yang terinfeksi mulai batuk dan sakit tenggorokan. mereka harus banyak minum air putih dan berkumur dengan air hangat dengan garam atau cuka, ini akan menghilangkan retweet coronavirus karena bisa menyelamatkan seseorang. https://t.co/z7eudqcalj',
 'kabar baik! vaksin virus corona sudah siap. kemampuan untuk menyembuhkan pasien dalam waktu 3 jam setelah injeksi. angkat topi untuk para ilmuwan as. trump baru saja mengumumkan bahwa roche medical company akan merilis vaksin minggu depan dan jutaan dosis darinya https://t.co/zyx4rzhs3h',
 'kabar baik! vaksin virus ca
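As the output shows, deleting newlines outright fuses neighbouring sentences (for example `paru-paru.ia juga`). If that turns out to matter for tokenization, one alternative is to substitute a space instead and let the whitespace step collapse the resulting runs. This is a sketch of that alternative on a made-up string, not what the notebook does.

```python
import re

text = "masuk ke paru-paru.\n\nia juga mengeluh batuk | dan berdahak.\r\n"

# Behaviour used in this notebook: delete newlines, carriage returns and pipes
deleted = re.sub(r"[\n\r|]", "", text)

# Alternative: replace them with a space, then collapse whitespace and trim
spaced = re.sub(r"\s+", " ", re.sub(r"[\n\r|]", " ", text)).strip()

print(repr(deleted))
print(repr(spaced))
```

The first variant keeps `paru-paru.ia` glued together, while the second leaves a single space between the sentences.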

## Remove excess whitespace

In [10]:
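# Collapse runs of whitespace into a single space; regex=True is required in recent pandas versions
mumin_df.cleaned_text = mumin_df.cleaned_text.str.replace(r"\s+", " ", regex=True)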
mumin_df.cleaned_text.head().to_list()


['virus corona bertahan di tenggorokan selama tiga hingga empat hari sebelum masuk ke paru-paru.ia juga mengeluh batuk dan berdahak.dalam situasi seperti itu, jika kumur dilakukan dengan menambahkan garam ke air panas ...dan jika lemon dikonsumsi maka penyakit ini bisa dihindari.dan banyak nyawa bisa diselamatkan.️👋🌷🌹💐🌴 https://t.co/uoilydci6a',
 'sebelum mencapai paru-paru, itu berlangsung 4 hari di tenggorokan, di mana orang yang terinfeksi mulai batuk dan sakit tenggorokan. mereka harus banyak minum air putih dan berkumur dengan air hangat dengan garam atau cuka, ini akan menghilangkan retweet coronavirus karena bisa menyelamatkan seseorang. https://t.co/z7eudqcalj',
 'kabar baik! vaksin virus corona sudah siap. kemampuan untuk menyembuhkan pasien dalam waktu 3 jam setelah injeksi. angkat topi untuk para ilmuwan as. trump baru saja mengumumkan bahwa roche medical company akan merilis vaksin minggu depan dan jutaan dosis darinya https://t.co/zyx4rzhs3h',
 'kabar baik! vaksin virus ca

## Encode labels

Here we encode the labels in the following way:

- `misinformation`: 1
- `factual`: 0

In [11]:
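encodings = {"misinformation": 1, "factual": 0}
# Assign the result back; calling replace(..., inplace=True) on an attribute
# may not update the DataFrame in recent pandas versions
mumin_df["label"] = mumin_df["label"].map(encodings)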

In [12]:
mumin_df.label.value_counts()

1    2934
0     155
Name: label, dtype: int64
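These counts show that the deduplicated data is heavily imbalanced. The short sketch below turns the two counts reported above into class shares and an imbalance ratio. How to deal with the imbalance is a modelling decision that this notebook does not address.

```python
# Label counts reported above (1 = misinformation, 0 = factual)
counts = {1: 2934, 0: 155}
total = sum(counts.values())

# Share of each class in the deduplicated data
shares = {label: n / total for label, n in counts.items()}

# How many misinformation Tweets there are per factual Tweet
imbalance_ratio = counts[1] / counts[0]

for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

Roughly 95% of the Tweets are labelled as misinformation, so a classifier that always predicts class 1 would already reach about that accuracy.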

## Pre-processing for IndoBERT

Here we do the following:

- Remove emojis
- Replace Tweet artifacts with generic tokens

In [13]:
# Create separate column for the IndoBERT text
mumin_df["cleaned_text_ib"] = mumin_df["cleaned_text"]

# Remove emojis (note: this drops all non-ASCII characters, not only emojis)
mumin_df.cleaned_text_ib = mumin_df.cleaned_text_ib.str.encode("ascii", "ignore").str.decode("utf_8", "ignore")

# Replace mentions with <user>
mumin_df.cleaned_text_ib = mumin_df.cleaned_text_ib.str.replace(r"@\S+", "<user>", regex=True)

# Replace URLs with <links>
mumin_df.cleaned_text_ib = mumin_df.cleaned_text_ib.str.replace(r"https?://\S+", "<links>", regex=True)


In [14]:
# Write this to a csv file
data_file = "mumin_medium-id_trans-indobert_hashtags.csv"
cols_to_write = ["cleaned_text_ib", "label", "lang"]
mumin_df.reindex(columns=cols_to_write).to_csv(data_dir.joinpath(data_file), index=False)

We'll also create a separate dataset that replaces the hashtags with the `<hashtag>` token.

In [15]:
# Replace hashtags with <hashtag>
mumin_df.cleaned_text_ib = mumin_df.cleaned_text_ib.str.replace(r"#\S+", "<hashtag>", regex=True)


In [16]:
# Write this to a csv file
data_file = "mumin_medium-id_trans-indobert_no_hashtags.csv"
mumin_df.reindex(columns=cols_to_write).to_csv(data_dir.joinpath(data_file), index=False)

## Pre-processing for IndoBERTweet

Here we do the following:

- Convert user mentions and URLs to `@USER` and `HTTPURL` respectively.
- Convert emojis to their text representations using the `emoji` package.
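For the mention and URL replacement, a slightly stricter variant is sketched below. It matches mentions with `@\w+` instead of `@\S+`, so punctuation that directly follows a handle is kept. The helper name is an assumption, and emoji conversion is left out so that the sketch has no third-party dependencies.

```python
import re

# Stricter mention pattern than the one used in this notebook (an alternative,
# not the notebook's own expression)
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")


def normalize_for_indobertweet(text):
    """Replace mentions and URLs with @USER and HTTPURL (hypothetical helper)."""
    text = MENTION.sub("@USER", text)
    text = URL.sub("HTTPURL", text)
    return text


normalized = normalize_for_indobertweet(
    "terima kasih @budi_123, baca https://t.co/abc123"
)
print(normalized)
```

The trade-off is that `@\w+` also fires inside e-mail addresses, so which pattern is preferable depends on what the Tweets actually contain.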

In [17]:
import emoji

# Create separate column for IndoBERTweet dataset
mumin_df["cleaned_text_ibt"] = mumin_df["cleaned_text"]

# Replace mentions with the token @USER
mumin_df.cleaned_text_ibt = mumin_df.cleaned_text_ibt.str.replace(r"@\S+", "@USER", regex=True)

# Replace URLs with the token HTTPURL
mumin_df.cleaned_text_ibt = mumin_df.cleaned_text_ibt.str.replace(r"https?://\S+", "HTTPURL", regex=True)

# Convert emojis to their text representations using the emoji package
mumin_df.cleaned_text_ibt = mumin_df.cleaned_text_ibt.apply(emoji.demojize)


In [18]:
# Write this to a csv file
data_file = "mumin_medium-id_trans-indobertweet.csv"
cols_to_write = ["cleaned_text_ibt", "label", "lang"]
mumin_df.reindex(columns=cols_to_write).to_csv(data_dir.joinpath(data_file), index=False)