<a href="https://colab.research.google.com/github/rafiag/Basic-Text-Preprocessing/blob/main/Basic_Text_Pre_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This article covers the basics of pre-processing your text before feeding it into your fancy algorithm. I am going to demonstrate how to clean your text data and how to transform it for your machine learning model using the [Bahasa Indonesia Hate Speech Dataset](https://medium.com/r/?url=https%3A%2F%2Fgithub.com%2Fialfina%2Fid-hatespeech-detection%2F).

In [11]:
# Import basic libs
import pandas as pd
import numpy as np
import warnings

# Load data
df = pd.read_csv('https://raw.githubusercontent.com/rafiag/Hate-Speech-Classification/main/hate_speech_dataset.csv')

print(df.shape)
df.head()

(713, 2)


Unnamed: 0,Label,Tweet
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...


# Text Cleaning

Data pre-processing is an important step when you are building a machine learning model; it can make or break your model. Just like any other dataset, text data needs cleaning, especially because raw text is usually loaded with junk. Of course, you can skip cleaning the text, but doing so will likely hurt both your model's performance and its training time.
Let's get started with the first step!

## Tokenize

The first step is tokenizing our data. Tokenizing simply means splitting a sentence into a list of words, for example:

|  | Result |
|-|-|
| Original | @JohnDoe Hi, John! Today is my first day on #Twitter |
| Tokenize | ['@JohnDoe', 'Hi,', 'John!', 'Today', 'is', 'my', 'first', 'day', 'on', '#Twitter'] |

There are lots of ways to do this. The simplest approach is Python's built-in string method `split()`, which gives the same result as the example above. If you want a more specific result, such as keeping only word characters (letters, digits, and underscores) and throwing away the punctuation, you can use the `RegexpTokenizer` class provided by the NLTK library.

In [12]:
import nltk
from nltk.tokenize import RegexpTokenizer
nltk.download('punkt')

# Simple tokenizer: split on single spaces
def simpleTokenizer(s):
    return s.split(' ')

# Tokenize using NLTK
def nltkTokenizer(s, remove_punctuation=False):
    if remove_punctuation:
        # Keep only runs of word characters (letters, digits, underscore)
        tokenizer = RegexpTokenizer(r'\w+')
    else:
        # Words, dollar amounts, or any other run of non-whitespace characters
        tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
    return tokenizer.tokenize(s)

# Tokenize words
df['tokens'] = df['Tweet'].apply(nltkTokenizer)

df.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,Label,Tweet,tokens
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,"[RT, @spardaxyz:, Fadli, Zon, Minta, Mendagri,..."
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,"[RT, @baguscondromowo:, Mereka, terus, melukai..."
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,"[Sylvi, :, bagaimana, gurbernur, melakukan, ke..."
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, ,, M..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,"[RT, @lisdaulay28:, Waspada, KTP, palsu, ........"


In [13]:
print('Basic approach:\n', simpleTokenizer('@JohnDoe Hi, John! Today is my first day on #Twitter'))

print('Remove non-alphabet:\n', nltkTokenizer('@JohnDoe Hi, John! Today is my first day on #Twitter', remove_punctuation=True))

Basic approach:
 ['@JohnDoe', 'Hi,', 'John!', 'Today', 'is', 'my', 'first', 'day', 'on', '#Twitter']
Remove non-alphabet:
 ['JohnDoe', 'Hi', 'John', 'Today', 'is', 'my', 'first', 'day', 'on', 'Twitter']
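
To see what the two patterns match without installing NLTK, here is a minimal sketch using `re.findall` from the standard library. It assumes that `RegexpTokenizer` behaves like `findall` for patterns without capturing groups, which should hold for the two patterns used above.

```python
import re

sentence = '@JohnDoe Hi, John! Today is my first day on #Twitter'

# Pattern used when remove_punctuation=False: a word, a dollar amount,
# or any other run of non-whitespace characters
keep_punct = re.findall(r'\w+|\$[\d\.]+|\S+', sentence)

# Pattern used when remove_punctuation=True: word characters only
words_only = re.findall(r'\w+', sentence)

print(keep_punct)
print(words_only)
```

Note that, unlike `split()`, the first pattern separates punctuation from the word it follows, so `'Hi,'` becomes `'Hi'` and `','`.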


## Remove Useless Components

This second step is a special one, because in most cases you are not required to do it. Since our data comes from Twitter, it contains lots of @handles, #hashtags, URLs, and other things that just clutter the data and won't help the model's performance.

But again, this really depends on the problem you are solving. If you want to analyse the hashtags or visualise the social network based on user mentions, you can keep them.

|  | Result |
|-|-|
| Tokenize | ['@JohnDoe', 'Hi,', 'John!', 'Today', 'is', 'my', 'first', 'day', 'on', '#Twitter'] |
| Remove Useless Text | ['Hi', 'John', 'Today', 'is', 'my', 'first', 'day', 'on'] |

Below I wrote a new function that uses Python's `re` (regular expression) module to match patterns in text, which is really useful for removing the aforementioned elements.

In [14]:
import re

def removeUselessText(tokens):
    new_tokens = []
    for t in tokens:
        # Drop hashtags entirely
        if t.startswith('#'):
            continue

        # Remove leading & trailing whitespace
        t = t.strip()

        # Remove mentions
        t = re.sub(r'@\S+', '', t)

        # Remove URLs (only matches when the URL is still in a single token)
        t = re.sub(r'\\/', '/', t)  # replace escaped slashes
        t = re.sub(r'https?://\S+', '', t)

        # Remove special characters and numbers
        t = re.sub(r'[^a-zA-Z\s]', '', t)

        new_tokens.append(t)

    # Drop tokens that became empty
    return [token for token in new_tokens if token]

df['no_useless'] = df['tokens'].apply(removeUselessText)

df.head()

Unnamed: 0,Label,Tweet,tokens,no_useless
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,"[RT, @spardaxyz:, Fadli, Zon, Minta, Mendagri,...","[RT, Fadli, Zon, Minta, Mendagri, Segera, Meno..."
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,"[RT, @baguscondromowo:, Mereka, terus, melukai...","[RT, Mereka, terus, melukai, aksi, dalam, rang..."
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,"[Sylvi, :, bagaimana, gurbernur, melakukan, ke...","[Sylvi, bagaimana, gurbernur, melakukan, keker..."
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, ,, M...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, Masa..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,"[RT, @lisdaulay28:, Waspada, KTP, palsu, ........","[RT, Waspada, KTP, palsu, kawal, PILKADA, http..."
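
One thing to notice in the output above is that fragments such as `http` survive. This appears to happen because the tokenizer has already split each URL into several tokens before the URL pattern runs. A possible alternative, sketched below, is to clean the raw string first and tokenize afterwards. The function name `clean_then_tokenize` and the sample tweet (including its URL) are made up for illustration.

```python
import re

URL_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')
NON_ALPHA_RE = re.compile(r'[^a-zA-Z\s]')

def clean_then_tokenize(text):
    """Remove URLs, mentions, and hashtags from the raw string, then split."""
    text = text.replace('\\/', '/')     # unescape slashes such as http:\/\/
    text = URL_RE.sub(' ', text)        # URLs are removed while still whole
    text = MENTION_RE.sub(' ', text)
    text = HASHTAG_RE.sub(' ', text)
    text = NON_ALPHA_RE.sub(' ', text)  # punctuation and digits become spaces
    return text.split()

# Hypothetical tweet, loosely modelled on the last row shown above
sample = 'RT @user: Waspada KTP palsu.....kawal PILKADA https://t.co/abc123 #pilkada'
print(clean_then_tokenize(sample))
# ['RT', 'Waspada', 'KTP', 'palsu', 'kawal', 'PILKADA']
```

Replacing unwanted characters with a space rather than an empty string keeps words such as `palsu.....kawal` from being glued together.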


## Stemming

In the third step, we stem every word into its base form by removing its prefixes and suffixes. Note that this step varies depending on the language of your dataset. Since I am dealing with text in Bahasa Indonesia, the result will look something like this:

|  | Result |
|-|-|
| Remove Useless Text | ['Hai', 'apa', 'kabar', 'aku', 'baru', 'bergabung', 'ke', 'nih'] |
| Stemming | ['hai', 'apa', 'kabar', 'aku', 'baru', 'gabung', 'ke', 'nih'] |

In this step I will use the Bahasa Indonesia stemmer from the Sastrawi library. If you are analysing text in English or another language, check other libraries such as [NLTK](https://www.nltk.org/api/nltk.stem.html?highlight=stemming).

*Note: this step may take a few minutes; slow stemming is a known performance problem with Sastrawi.*

In [15]:
!pip install Sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Initiate the Sastrawi stemmer once and reuse it for every row
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemmSentence(tokens):
    return [stemmer.stem(t) for t in tokens]

df['stemmed'] = df['no_useless'].apply(stemmSentence)

df.head()



Unnamed: 0,Label,Tweet,tokens,no_useless,stemmed
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,"[RT, @spardaxyz:, Fadli, Zon, Minta, Mendagri,...","[RT, Fadli, Zon, Minta, Mendagri, Segera, Meno...","[rt, fadli, zon, minta, mendagri, segera, nona..."
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,"[RT, @baguscondromowo:, Mereka, terus, melukai...","[RT, Mereka, terus, melukai, aksi, dalam, rang...","[rt, mereka, terus, luka, aksi, dalam, rangka,..."
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,"[Sylvi, :, bagaimana, gurbernur, melakukan, ke...","[Sylvi, bagaimana, gurbernur, melakukan, keker...","[sylvi, bagaimana, gurbernur, laku, keras, per..."
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, ,, M...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, Masa...","[ahmad, dhani, tak, puas, debat, pilkada, masa..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,"[RT, @lisdaulay28:, Waspada, KTP, palsu, ........","[RT, Waspada, KTP, palsu, kawal, PILKADA, http...","[rt, waspada, ktp, palsu, kawal, pilkada, http..."
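
Because tweets repeat many of the same words, one way to reduce the stemming time is to cache the result for each unique token. The sketch below shows the idea with `functools.lru_cache`. Here `slow_stem` is a hypothetical stand-in that only lowercases its input; in the notebook you would call the Sastrawi stemmer's `stem` method instead.

```python
from functools import lru_cache

call_log = []  # records every real call to the underlying stemmer

def slow_stem(word):
    """Hypothetical stand-in for an expensive stemmer; it only lowercases."""
    call_log.append(word)
    return word.lower()

@lru_cache(maxsize=None)
def cached_stem(word):
    # The expensive call runs once per unique token
    return slow_stem(word)

def stem_tokens(tokens):
    return [cached_stem(t) for t in tokens]

rows = [['Ahok', 'debat', 'Ahok'], ['debat', 'pilkada']]
stemmed = [stem_tokens(r) for r in rows]

print(stemmed)        # [['ahok', 'debat', 'ahok'], ['debat', 'pilkada']]
print(len(call_log))  # 3 real calls for 5 tokens
```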


## Replace Slang Words

One other thing to note when dealing with text in Bahasa Indonesia is slang. Indonesians love to use slang, especially on the internet. Unfortunately, I was not able to find any ready-to-use Python library to help with this problem.

Instead, I will use the slang word dictionary from [dhitology's GitHub repos](https://github.com/dhitology) and convert it into a Python dictionary to make the replacement easier. Granted, this is not a great solution, since some slang words in our dataset are not included in the dictionary, but (personally) I think it's good enough.

A much better approach to dealing with slang words is described in this [article](https://medium.com/kata-engineering/mengubah-bahasa-indonesia-informal-menjadi-baku-menggunakan-kecerdasan-buatan-4c6317b00ea5) about a new methodology for transforming informal Bahasa Indonesia into a more formal form.

In [16]:
# Load slang data
slang_df = pd.read_csv('https://raw.githubusercontent.com/dhitology/sma-r/master/data/support/Slangword.csv')

# Remove leading & trailing whitespace
slang_df['old'] = slang_df['old'].str.strip()
slang_df['new'] = slang_df['new'].str.strip()

# Transform into key-value pairs in a dict
slang_dict = dict(zip(slang_df['old'], slang_df['new']))

def replaceSlang(tokens):
    # Replace each token found in the slang dictionary and keep the rest;
    # build a new list so the input column is not modified in place
    return [slang_dict.get(word, word) for word in tokens]

df['no_slang'] = df['stemmed'].apply(replaceSlang)

df.head()

Unnamed: 0,Label,Tweet,tokens,no_useless,stemmed,no_slang
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,"[RT, @spardaxyz:, Fadli, Zon, Minta, Mendagri,...","[RT, Fadli, Zon, Minta, Mendagri, Segera, Meno...","[rt, fadli, zon, minta, mendagri, segera, nona...","[rt, fadli, zon, minta, mendagri, segera, nona..."
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,"[RT, @baguscondromowo:, Mereka, terus, melukai...","[RT, Mereka, terus, melukai, aksi, dalam, rang...","[rt, mereka, terus, luka, aksi, dalam, rangka,...","[rt, mereka, terus, luka, aksi, dalam, rangka,..."
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,"[Sylvi, :, bagaimana, gurbernur, melakukan, ke...","[Sylvi, bagaimana, gurbernur, melakukan, keker...","[sylvi, bagaimana, gurbernur, laku, keras, per...","[sylvi, bagaimana, gurbernur, laku, keras, per..."
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, ,, M...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, Masa...","[ahmad, dhani, tak, puas, debat, pilkada, masa...","[ahmad, dhani, tak, puas, debat, pilkada, masa..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,"[RT, @lisdaulay28:, Waspada, KTP, palsu, ........","[RT, Waspada, KTP, palsu, kawal, PILKADA, http...","[rt, waspada, ktp, palsu, kawal, pilkada, http...","[rt, waspada, ktp, palsu, kawal, pilkada, http..."
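
Since the rows shown above contain no slang, here is a small sketch of the replacement on a toy example. The dictionary entries and the sentence are illustrative only and are not taken from `Slangword.csv`.

```python
# Illustrative entries only; the real mapping comes from the CSV file
toy_slang = {'gak': 'tidak', 'yg': 'yang', 'bgt': 'banget'}

def replace_slang(tokens, mapping):
    # Look each token up in the mapping and fall back to the token itself
    return [mapping.get(t, t) for t in tokens]

tokens = ['aku', 'gak', 'suka', 'yg', 'itu']
replaced = replace_slang(tokens, toy_slang)

print(replaced)  # ['aku', 'tidak', 'suka', 'yang', 'itu']
print(tokens)    # ['aku', 'gak', 'suka', 'yg', 'itu'] (input left untouched)
```

Returning a new list instead of editing the input means the column holding the earlier pre-processing step keeps its original values.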


## Remove Stop Words

For the last step of our text cleaning, we will remove stop words (very common words that carry little meaning) from our data. I will use the Bahasa Indonesia stop word list provided by the spaCy library as the reference for which words to remove. In this step we will also make a couple of final adjustments to our tokens:

*   Transforming each token into lowercase using Python's `lower()` string method
*   Removing tokens that are 3 characters long or shorter, to trim our data and speed up training





In [17]:
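from spacy.lang.id.stop_words import STOP_WORDS

def removeStopWords(tokens, min_len=3):
    # Lowercase first so the stop word check is case-insensitive,
    # then keep only tokens longer than min_len characters
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS and len(t) > min_len]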

df['no_stop'] = df['no_slang'].apply(removeStopWords)

df.head()

Unnamed: 0,Label,Tweet,tokens,no_useless,stemmed,no_slang,no_stop
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,"[RT, @spardaxyz:, Fadli, Zon, Minta, Mendagri,...","[RT, Fadli, Zon, Minta, Mendagri, Segera, Meno...","[rt, fadli, zon, minta, mendagri, segera, nona...","[rt, fadli, zon, minta, mendagri, segera, nona...","[fadli, mendagri, nonaktif, ahok, gubernur, ht..."
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,"[RT, @baguscondromowo:, Mereka, terus, melukai...","[RT, Mereka, terus, melukai, aksi, dalam, rang...","[rt, mereka, terus, luka, aksi, dalam, rangka,...","[rt, mereka, terus, luka, aksi, dalam, rangka,...","[luka, aksi, rangka, penjara, ahok, ahok, gaga..."
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,"[Sylvi, :, bagaimana, gurbernur, melakukan, ke...","[Sylvi, bagaimana, gurbernur, melakukan, keker...","[sylvi, bagaimana, gurbernur, laku, keras, per...","[sylvi, bagaimana, gurbernur, laku, keras, per...","[sylvi, gurbernur, laku, keras, perempuan, buk..."
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, ,, M...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, Masa...","[ahmad, dhani, tak, puas, debat, pilkada, masa...","[ahmad, dhani, tak, puas, debat, pilkada, masa...","[ahmad, dhani, puas, debat, pilkada, jalan, be..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,"[RT, @lisdaulay28:, Waspada, KTP, palsu, ........","[RT, Waspada, KTP, palsu, kawal, PILKADA, http...","[rt, waspada, ktp, palsu, kawal, pilkada, http...","[rt, waspada, ktp, palsu, kawal, pilkada, http...","[waspada, palsu, kawal, pilkada, https, tcoooo..."
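
The length filter uses a strict comparison, which is why three-letter tokens such as `ktp` are missing from the output above. The sketch below contrasts the strict and the inclusive comparison. The one-word stop list is a hypothetical stand-in for spaCy's `STOP_WORDS`, and the token list is made up for illustration.

```python
toy_stop_words = {'dan'}  # hypothetical stand-in for spaCy's STOP_WORDS

def filter_tokens(tokens, min_len=3, inclusive=False):
    kept = []
    for t in tokens:
        t = t.lower()
        if t in toy_stop_words:
            continue
        # strict: len > min_len, inclusive: len >= min_len
        long_enough = len(t) >= min_len if inclusive else len(t) > min_len
        if long_enough:
            kept.append(t)
    return kept

tokens = ['rt', 'waspada', 'ktp', 'palsu', 'dan', 'kawal', 'pilkada']

strict = filter_tokens(tokens)
inclusive = filter_tokens(tokens, inclusive=True)

print(strict)     # ['waspada', 'palsu', 'kawal', 'pilkada']
print(inclusive)  # ['waspada', 'ktp', 'palsu', 'kawal', 'pilkada']
```

Whether short tokens like `ktp` are noise or signal depends on your data, so it is worth checking which comparison you actually want.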


# Splitting Dataset

To finish our pre-processing, we will split our dataset into training and testing data. Because our dataset is quite small (about 700 rows), we will use 80% of it as the training set to give our model a little more data to learn from. We will also join each list of tokens back into a single string to feed into the next step, feature extraction.

In [18]:
# Combine cleaned text into one string
df['ready'] = df['no_stop'].apply(lambda x: ' '.join(x))

# Encode target labels
df['Label'] = df['Label'].apply(lambda x: 1 if x == 'HS' else 0)

# Split dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['ready'], 
                                                    df['Label'], 
                                                    random_state=0,
                                                    train_size=0.8)

print('Shape of X_train:', X_train.shape)
print('Shape of X_test:', X_test.shape)

Shape of X_train: (570,)
Shape of X_test: (143,)
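
The shapes above follow from simple arithmetic: 713 * 0.8 = 570.4, which rounds down to 570 training rows and leaves 143 for testing. The sketch below reproduces those sizes with a plain shuffled split from the standard library. It is not how `train_test_split` is implemented, and it will not pick the same rows; `shuffle_split` is a hypothetical helper for illustration.

```python
import math
import random

def shuffle_split(items, train_size=0.8, seed=0):
    """Shuffle indices with a fixed seed, then cut at floor(n * train_size)."""
    n_train = math.floor(len(items) * train_size)
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    train = [items[i] for i in indices[:n_train]]
    test = [items[i] for i in indices[n_train:]]
    return train, test

rows = list(range(713))  # stand-in for the 713 tweets
train, test = shuffle_split(rows)

print(len(train), len(test))  # 570 143
```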


# Feature Extraction

Now that we are done cleaning, can we feed the data into our model? No, not yet, because unlike numeric data (floats, integers, arrays, and so on), your computer can't understand WORD(s). So much for artificial 'intelligence', right? (jk ily AI).

So in order for our machine to understand the data, we first need to translate it into something it understands: numbers. There are several methods for transforming text into numerical data, such as:

*   One Hot Encoding
*   CountVectorizer
*   TF-IDF
*   Word Embedding

In this notebook I will use the TF-IDF vectorizer because, despite being simpler than word embeddings, it can still give pretty good results.

Besides simply transforming our text into numbers, there are other features we can extract from it, for example:

*   How many words are in the text?
*   How many characters are in the text?
*   How often do non-digit characters appear in the text?

In this notebook I am going to stick with the TF-IDF vector as the only feature, because my earlier data exploration showed no notable difference in the characteristics listed above for this specific dataset. But it's always a good idea to explore those characteristics when you are dealing with your own data.
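
If you want to explore those characteristics on your own data, a minimal sketch could look like the following. The function name `text_stats` and the sample sentence are made up for illustration.

```python
def text_stats(text):
    """Compute a few simple descriptive features for one piece of text."""
    n_chars = len(text)
    n_non_digits = sum(1 for ch in text if not ch.isdigit())
    return {
        'n_words': len(text.split()),
        'n_chars': n_chars,
        # Share of characters that are not digits (0.0 for empty text)
        'non_digit_ratio': n_non_digits / n_chars if n_chars else 0.0,
    }

stats = text_stats('Waspada KTP palsu 2017')
print(stats)
```

Comparing the distribution of these values between the two labels is a quick way to see whether any of them is worth adding as a feature.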

## Text Vectorization

To transform the text data I'm going to use the `TfidfVectorizer` class from sklearn with the following parameters:

*   min_df=2; only include terms that appear in at least 2 tweets
*   ngram_range=(1, 3); create n-grams of 1 to 3 words (unigrams, bigrams, and trigrams)

`TfidfVectorizer` also has some neat parameters, such as tokenizer and stop_words, that let us specify which tokenizer or stop word list to use. But since we already pre-processed the text in the previous steps, we won't be using those parameters.

*Note: remember to fit the vectorizer ONLY on the training data to avoid data leakage.*
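
To make the two parameters concrete, here is a small standard-library sketch of what they mean. It is a simplified illustration, not sklearn's implementation, and the three short documents are made up.

```python
from collections import Counter

def ngrams(tokens, lo=1, hi=3):
    """Return all n-grams of length lo..hi as space-joined strings."""
    return [' '.join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

docs = ['solusi tata kelola', 'tata kelola jakarta', 'debat ahok']

# Document frequency: in how many documents does each n-gram appear?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(ngrams(doc.split())))

# Analogue of min_df=2: keep only n-grams found in at least 2 documents
vocabulary = sorted(term for term, count in doc_freq.items() if count >= 2)

print(ngrams('solusi tata kelola'.split()))
# ['solusi', 'tata', 'kelola', 'solusi tata', 'tata kelola', 'solusi tata kelola']
print(vocabulary)
# ['kelola', 'tata', 'tata kelola']
```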

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vectorizer on the training data only
tfidf_vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 3)).fit(X_train)

# Transform the training & testing data using the vectorizer
X_train_vectorized = tfidf_vectorizer.transform(X_train)
X_test_vectorized = tfidf_vectorizer.transform(X_test)
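
For intuition about the numbers the vectorizer produces, the sketch below computes TF-IDF weights by hand. It assumes sklearn's default settings as I understand them: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. It ignores `min_df` and n-grams, splits on whitespace, and uses made-up documents.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Number of documents each term appears in
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalise so every non-empty row has length 1
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ['ahok debat pilkada', 'debat pilkada jakarta', 'hoax']
rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first document, `ahok` gets a higher weight than `debat` because it appears in fewer documents. The last document contains a single term, so after normalisation its weight is exactly 1.0.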

## Most Common vs. Important Words

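After calculating the TF-IDF vectors of our data, let's look at the terms with the lowest and the highest scores. For each term we take its largest TF-IDF value across all training tweets. The terms with the smallest values are labelled the most common words, and the terms with the largest values are labelled the most important words. Keep in mind that this is only a rough proxy: a low maximum score can also come from a term that appears only in long tweets, and a score of 1.0 means the term is the only vocabulary term left in some tweet, since each row is L2-normalised by default. This step is useful because the information can help us interpret our model later.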

In [20]:
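# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())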

# Sort TFIDF by value
max_tf_idfs = X_train_vectorized.max(0).toarray()[0] # Get largest tfidf values across all documents.
sorted_tf_idxs = max_tf_idfs.argsort() # Sorted indices
sorted_tf_idfs = max_tf_idfs[sorted_tf_idxs] # Sorted TFIDF values

# feature_names doesn't need to be sorted! You just access it with a list of sorted indices!
smallest_tf_idfs = pd.Series(sorted_tf_idfs[:10], index=feature_names[sorted_tf_idxs[:10]])                    
largest_tf_idfs = pd.Series(sorted_tf_idfs[-10:][::-1], index=feature_names[sorted_tf_idxs[-10:][::-1]])

print('Most common words:\n', smallest_tf_idfs)
print('\n')
print('Most important words:\n', largest_tf_idfs)

Most common words:
 kelola jakarta           0.192791
kliatan kuasa problem    0.192791
tata                     0.192791
kelola                   0.192791
solusi tata kelola       0.192791
solusi tata              0.192791
debat ahok               0.192791
debat ahok liat          0.192791
problem                  0.192791
angin kliatan            0.192791
dtype: float64


Most important words:
 bangkit    1.0
mati       1.0
silvy      1.0
serang     1.0
fakta      1.0
gimana     1.0
sylvi      1.0
hoax       1.0
kampung    1.0
kasar      1.0
dtype: float64
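
If what you want is literally the terms that appear in the most tweets, document frequency is a more direct measure than the smallest maximum TF-IDF score. The sketch below counts it with the standard library on made-up documents; `document_frequency` is a hypothetical helper for illustration.

```python
from collections import Counter

def document_frequency(docs):
    """Count in how many documents each term appears at least once."""
    return Counter(term for doc in docs for term in set(doc.split()))

docs = ['ahok debat pilkada', 'debat pilkada jakarta', 'debat hoax']
doc_freq = document_frequency(docs)

print(doc_freq.most_common(2))  # [('debat', 3), ('pilkada', 2)]
```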


# Finally...

With all those steps done, we can now use our text dataset to build a machine learning model. Just as with numeric data, you can use the pre-processed text for classification, clustering, prediction, and so on. Data cleaning and pre-processing might seem tedious, but they definitely play an important role in solving data science problems. These steps ensure that our model receives good data to learn from; as they say, "a model is only as good as its data".

# References

1. Telkom Digital Talent Incubator - Data Scientist Module 7 (Text Mining)
2. [Digital Business Ecosystem Research Center - Text Mining (GitHub)](https://github.com/rc-dbe/dti-tm)
3. [Dhitology - Social Media Analytics (GitHub)](https://github.com/dhitology/sma-r)
4. [The Dataset for Hate Speech Detection in the Indonesian Language (Bahasa Indonesia)](https://github.com/ialfina/id-hatespeech-detection/)
5. [Scikit-learn Documentation](https://scikit-learn.org/stable/index.html)
6. [NLTK Documentation](https://www.nltk.org/index.html)
7. [spaCy Documentation](https://spacy.io/api/doc)
8. [Sastrawi](https://github.com/sastrawi/sastrawi)
