<a href="https://colab.research.google.com/github/thingkilia2507/PTCJNN_BangkitCapstoneProject/blob/celine-branch/Machine%20Learning/notebooks/Hate%20Speech%20Dataset%20Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Acknowledgment**

```
# This data preprocessing notebook is a part of Chrysant Celine Setyawan's bachelor thesis,
# copied and modified accordingly from the original source: https://github.com/celine-setyawan/id-porn-tweet-detection/tree/dev.
```


The Indonesian hate speech dataset is obtained from https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection with publication:
```
@inproceedings{ibrohim-budi-2019-multi,
    title = "Multi-label Hate Speech and Abusive Language Detection in {I}ndonesian Twitter",
    author = "Ibrohim, Muhammad Okky  and
      Budi, Indra",
    booktitle = "Proceedings of the Third Workshop on Abusive Language Online",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-3506",
    doi = "10.18653/v1/W19-3506",
    pages = "46--57",
    abstract = "Hate speech and abusive language spreading on social media need to be detected automatically to avoid conflict between citizen. Moreover, hate speech has a target, category, and level that also needs to be detected to help the authority in prioritizing which hate speech must be addressed immediately. This research discusses multi-label text classification for abusive language and hate speech detection including detecting the target, category, and level of hate speech in Indonesian Twitter using machine learning approach with Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest Decision Tree (RFDT) classifier and Binary Relevance (BR), Label Power-set (LP), and Classifier Chains (CC) as the data transformation method. We used several kinds of feature extractions which are term frequency, orthography, and lexicon features. Our experiment results show that in general RFDT classifier using LP as the transformation method gives the best accuracy with fast computational time.",
}
```



# **Library**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

PROJECT_ROOT = 'drive/My Drive/Bangkit Capstone PT CJNN/ML/'
HS_PATH = PROJECT_ROOT + 'dataset/hate_speech/'

Mounted at /content/drive


In [None]:
import pandas as pd
import pickle
import json
import string
import re

from ast import literal_eval
from gspread_dataframe import set_with_dataframe

# **Load Data**

In [None]:
hs_df = pd.read_csv(HS_PATH + 'hs-abusive dataset_okkyIbrohim.csv', encoding='ISO-8859-1')
hs_df.head()

Unnamed: 0,Tweet,HS,Abusive,HS_Individual,HS_Group,HS_Religion,HS_Race,HS_Physical,HS_Gender,HS_Other,HS_Weak,HS_Moderate,HS_Strong
0,- disaat semua cowok berusaha melacak perhatia...,1,1,1,0,0,0,0,0,1,1,0,0
1,RT USER: USER siapa yang telat ngasih tau elu?...,0,1,0,0,0,0,0,0,0,0,0,0
2,"41. Kadang aku berfikir, kenapa aku tetap perc...",0,0,0,0,0,0,0,0,0,0,0,0
3,USER USER AKU ITU AKU\n\nKU TAU MATAMU SIPIT T...,0,0,0,0,0,0,0,0,0,0,0,0
4,USER USER Kaum cebong kapir udah keliatan dong...,1,1,0,1,1,0,0,0,0,0,1,0


In [None]:
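# Label a tweet 'hs' if any annotation column (HS, Abusive, or an HS_* subtype) is set, else 'non_hs'.
# Summing only the label columns avoids the text column, which newer pandas versions no longer skip silently.
label_cols = hs_df.columns.drop('Tweet')
has_label = hs_df[label_cols].sum(axis=1, skipna=True) >= 1
hs_df['labels'] = has_label.map({True: 'hs', False: 'non_hs'})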

In [None]:
# build the list of columns to drop: everything except 'Tweet' and 'labels'
cols = list(hs_df.columns)
cols.remove('Tweet')
cols.remove('labels')

hs_df.drop(cols, axis=1, inplace=True) 
hs_df.rename(columns={'Tweet': 'text_ori'}, inplace=True)
hs_df.head()

Unnamed: 0,text_ori,labels
0,- disaat semua cowok berusaha melacak perhatia...,hs
1,RT USER: USER siapa yang telat ngasih tau elu?...,hs
2,"41. Kadang aku berfikir, kenapa aku tetap perc...",non_hs
3,USER USER AKU ITU AKU\n\nKU TAU MATAMU SIPIT T...,non_hs
4,USER USER Kaum cebong kapir udah keliatan dong...,hs


In [None]:
hs_df.groupby(['labels']).size()

labels
hs        7309
non_hs    5860
dtype: int64

# **Data Cleaning pt 1**
* Remove `RT`, `\n`, and usernames (`@username` and the placeholder `USER`)
* Mask links as `<links>`
* Replace HTML character entities such as `&amp;`, `&lt;`, and `&gt;`

Data cleaning pt 1 must run before emoji and emoticon processing so that the `://` in `https://` is not treated as an emoticon.
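The ordering issue can be seen on a toy example. The sketch below assumes a hypothetical three-entry emoticon pattern (`EMOTICON_RGX`) and a simplified URL pattern (`URL_RGX`); neither is the notebook's actual regex, and the sample tweet is made up.

```python
import re

# Hypothetical emoticon pattern for illustration only; the real dictionary is unpublished.
EMOTICON_RGX = re.compile(r':/|:\(|:\)')
# Simplified URL pattern, not the regex used in clean_p1.
URL_RGX = re.compile(r'https?://\S+')

raw = 'cek ini https://t.co/abc :('

# Without masking, the ':/' inside 'https://' is picked up as an emoticon.
found_raw = EMOTICON_RGX.findall(raw)

# Masking the link first leaves only the genuine emoticon.
masked = URL_RGX.sub('<links>', raw)
found_masked = EMOTICON_RGX.findall(masked)

print(found_raw)
print(found_masked)
```

Running the emoticon step on `raw` would report two emoticons, while running it on `masked` reports only `:(`.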

In [None]:
def clean_p1(tweet):
    # @username: | RT (whole word only, so words containing "RT" stay intact) | newline | literal "\n" | USER:
    rgx = r'@[A-Za-z0-9_]*:?|\bRT\b|\n|\\n|USER:?'
    # URLs, including truncated ones such as 'https://', and the literal token 'URL'
    rgx_url = r'http[s]?[://]?(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))*|URL'
    rgx_simbol_dan = r'&amp;'
    rgx_lt = r'&lt;'
    rgx_gt = r'&gt;'

    cleansed = re.sub(rgx, ' ', tweet)
    cleansed = re.sub(rgx_url, '<links>', cleansed)
    cleansed = re.sub(rgx_simbol_dan, 'dan', cleansed)      # 'dan' is Indonesian for 'and'
    cleansed = re.sub(rgx_lt, '<', cleansed)
    cleansed = re.sub(rgx_gt, '>', cleansed)
    return cleansed

hs_df['text_cleansed_p1'] = hs_df['text_ori'].apply(clean_p1)
display(hs_df)

Unnamed: 0,text_ori,labels,text_cleansed_p1
0,- disaat semua cowok berusaha melacak perhatia...,hs,- disaat semua cowok berusaha melacak perhatia...
1,RT USER: USER siapa yang telat ngasih tau elu?...,hs,siapa yang telat ngasih tau elu?edan sar...
2,"41. Kadang aku berfikir, kenapa aku tetap perc...",non_hs,"41. Kadang aku berfikir, kenapa aku tetap perc..."
3,USER USER AKU ITU AKU\n\nKU TAU MATAMU SIPIT T...,non_hs,AKU ITU AKU KU TAU MATAMU SIPIT TAPI DILI...
4,USER USER Kaum cebong kapir udah keliatan dong...,hs,Kaum cebong kapir udah keliatan dongoknya ...
...,...,...,...
13164,USER jangan asal ngomong ndasmu. congor lu yg ...,hs,jangan asal ngomong ndasmu. congor lu yg sek...
13165,USER Kasur mana enak kunyuk',hs,Kasur mana enak kunyuk'
13166,USER Hati hati bisu :( .g\n\nlagi bosan huft \...,non_hs,Hati hati bisu :( .g lagi bosan huft \xf0\x...
13167,USER USER USER USER Bom yang real mudah terdet...,non_hs,Bom yang real mudah terdeteksi bom yan...


# **Emoticon and Emoji**


## **Decode byte (convert from bytes to emoji symbols)**

In [None]:
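def byte_to_emoji(tweet):
    # Evaluate the text as a bytes literal so that escaped sequences such as \xf0\x9f\x8e\xb6
    # become real bytes, then decode them as UTF-8. If parsing or decoding fails, keep the text as-is.
    try:
        str_emoji = literal_eval('b"""' + tweet + '"""').decode('utf-8')
    except Exception:
        str_emoji = tweet

    return str_emoji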

In [None]:
tweets_decoded_emoji = [byte_to_emoji(row) for row in hs_df['text_cleansed_p1']]

In [None]:
print('\033[1mBefore:\033[0m')
print(hs_df['text_cleansed_p1'][65:66].to_list())

print('\n\033[1mAfter:\033[0m')
print(tweets_decoded_emoji[65:66])

Before:
['        \\xf0\\x9f\\x8e\\xb6 la la la...hm hmm \\xf0\\x9f\\x8e\\xa7 "Semua diam ,semua bisu" "Kita coba tanya sama rumput yg bergoyang"  \\xe2\\x99\\xab\\xe2\\x99\\xab\\xe2\\x99\\xab\\xe2\\x99\\xaa\\xe2\\x99\\xaa\\xe2\\x99\\xaa\'']

After:
['        🎶 la la la...hm hmm 🎧 "Semua diam ,semua bisu" "Kita coba tanya sama rumput yg bergoyang"  ♫♫♫♪♪♪\'']
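An alternative that avoids evaluating tweet text as a Python literal is to decode only the escaped `\xHH` runs. The following is a sketch of that swapped-in technique, not the notebook's method; `decode_byte_runs` and `_BYTE_RUN` are hypothetical names.

```python
import re

# Hypothetical helper names. Matches one or more consecutive literal "\xHH" escapes.
_BYTE_RUN = re.compile(r'(?:\\x[0-9a-fA-F]{2})+')

def decode_byte_runs(text):
    def repl(match):
        # Turn "\xf0\x9f..." into the bytes f0 9f ... and decode them as UTF-8.
        raw = bytes.fromhex(match.group(0).replace('\\x', ''))
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Runs that are not valid UTF-8 are left untouched.
            return match.group(0)
    return _BYTE_RUN.sub(repl, text)

sample = '\\xf0\\x9f\\x8e\\xb6 la la la \\xe2\\x99\\xab'
decoded = decode_byte_runs(sample)
print(decoded)
```

Because only the escape runs are touched, quotes or other characters in the tweet cannot break the parsing step.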


## **Emoticon**

The EMOTICONS_ID dictionary (Indonesian emoticon descriptions) is part of Chrysant Celine Setyawan's ongoing bachelor thesis, so it cannot be published yet.

In [None]:
# Load the emoticon dictionary; the context manager closes the file automatically
with open('EMOTICONS_ID.json') as f:
    EMOTICONS_ID = json.load(f)

In [None]:
def emot_to_desc(tweet):
    # The punctuation set comes from `string.punctuation`, plus `“”…`
    rgx_repeated_punct = re.compile(r'''([!"#$%&'()*+,-./:;<=>?@[\]^_`“”{|}~…])\1+''')
    tweet = re.sub(rgx_repeated_punct, r'\1', tweet)                                # collapse repeated punctuation, e.g. :---))))))))

    # Replace each emoticon with its description, dropping commas and extra spaces
    for emot in EMOTICONS_ID:
        tweet = re.sub('(' + emot + ')', ' '.join(EMOTICONS_ID[emot].replace(',', '').split()), tweet)

    return tweet

In [None]:
tweets_emot_desc = [emot_to_desc(tweet) for tweet in tweets_decoded_emoji]

In [None]:
print('\033[1mBefore:\033[0m')
print(tweets_decoded_emoji[91:92])

print('\n\033[1mAfter:\033[0m')
print(tweets_emot_desc[91:92])

Before:
['Kamu transgender atau gmn anjing :( <links>']

After:
['Kamu transgender atau gmn anjing sedih atau cemberut <links>']
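Since the real dictionary is unavailable, the flow can be tried with a stand-in. This is a minimal sketch that assumes the keys are regex patterns and the values are comma-separated descriptions; `EMOTICONS_DEMO`, its two entries, and `emot_to_desc_demo` are hypothetical.

```python
import re
import string

# Hypothetical stand-in for the unpublished EMOTICONS_ID: regex pattern -> description.
EMOTICONS_DEMO = {
    r':-?\(': 'sedih, atau cemberut',
    r':-?\)': 'senang, atau tersenyum',
}

# Any ASCII punctuation character repeated two or more times in a row.
REPEATED_PUNCT = re.compile('([' + re.escape(string.punctuation) + r'])\1+')

def emot_to_desc_demo(tweet):
    tweet = REPEATED_PUNCT.sub(r'\1', tweet)              # ":((((" becomes ":("
    for pattern, desc in EMOTICONS_DEMO.items():
        clean_desc = ' '.join(desc.replace(',', '').split())
        tweet = re.sub(pattern, clean_desc, tweet)
    return tweet

result = emot_to_desc_demo('gmn sih :(((( <links>')
print(result)
```

Collapsing repeated punctuation first means a single pattern per emoticon is enough to cover exaggerated variants.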


## **Emoji**

The EMOJI_ID dictionary (Indonesian emoji descriptions) is part of Chrysant Celine Setyawan's ongoing bachelor thesis, so it cannot be published yet.

In [None]:
# Load the emoji dictionary; the context manager closes the file automatically
with open('EMOJI_ID.json') as f:
    EMOJI_ID = json.load(f)

In [None]:
# Adapted from https://github.com/carpedm20/emoji/blob/master/emoji/core.py.

def get_emoji_regexp():
    '''
    Return a compiled regular expression that matches the emojis defined in ``EMOJI_ID``.
    '''
    # Sort emojis by length to make sure multi-character emojis are matched first
    emojis = sorted(EMOJI_ID.keys(), key=len, reverse=True)

    # Escape each emoji and join them into a single alternation pattern
    pattern = '(' + '|'.join(re.escape(u) for u in emojis) + ')'

    return re.compile(pattern)

# Compiled once here and reused by every call to emoji_to_desc
_EMOJI_REGEXP = get_emoji_regexp()
_DEFAULT_DELIMITER = ':'

def emoji_to_desc(text, delimiters=(_DEFAULT_DELIMITER, _DEFAULT_DELIMITER), codes_dict=EMOJI_ID):
    '''
    Return the text with each emoji replaced by its description in Bahasa Indonesia.
    '''
    def replace(match):
        val = codes_dict.get(match.group(0), match.group(0))
        return delimiters[0] + val[1:-1] + delimiters[1]

    demojized = _EMOJI_REGEXP.sub(replace, text)
    return re.sub('\ufe0f', '', demojized)      # strip leftover variation selectors (U+FE0F)

In [None]:
tweets_emoji_desc = [emoji_to_desc(tweet) for tweet in tweets_emot_desc]
tweets_emoji_desc[65:66]

['        :not-not musik: la la la.hm hmm :headphone: "Semua diam ,semua bisu" "Kita coba tanya sama rumput yg bergoyang"  ♫♫♫♪♪♪\'']
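The reason for sorting the emojis longest-first can be checked on a small case. The sketch below uses a hypothetical two-entry dictionary (`EMOJI_DEMO`) with made-up descriptions; it assumes, as the `val[1:-1]` slice above suggests, that descriptions are stored wrapped in `:` delimiters.

```python
import re

# Hypothetical entries: a thumbs-up, and the same emoji followed by a skin-tone modifier.
EMOJI_DEMO = {
    '\U0001F44D': ':jempol ke atas:',
    '\U0001F44D\U0001F3FD': ':jempol ke atas warna kulit sedang:',
}

def build_regex(keys, longest_first=True):
    # Alternation tries branches left to right, so the order of the keys matters.
    ordered = sorted(keys, key=len, reverse=longest_first)
    return re.compile('|'.join(re.escape(k) for k in ordered))

def describe(text, rgx):
    return rgx.sub(lambda m: EMOJI_DEMO[m.group(0)], text)

sample = 'ok \U0001F44D\U0001F3FD'
good = describe(sample, build_regex(EMOJI_DEMO))
bad = describe(sample, build_regex(EMOJI_DEMO, longest_first=False))

print(good)
print(bad)
```

With the shorter key first, only the base emoji is replaced and the skin-tone modifier is left behind as a stray character.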

In [None]:
hs_df['text_emot_emoji_desc'] = tweets_emoji_desc
display(hs_df)

Unnamed: 0,text_ori,labels,text_cleansed_p1,text_emot_emoji_desc
0,- disaat semua cowok berusaha melacak perhatia...,hs,- disaat semua cowok berusaha melacak perhatia...,- disaat semua cowok berusaha melacak perhatia...
1,RT USER: USER siapa yang telat ngasih tau elu?...,hs,siapa yang telat ngasih tau elu?edan sar...,siapa yang telat ngasih tau elu?edan sar...
2,"41. Kadang aku berfikir, kenapa aku tetap perc...",non_hs,"41. Kadang aku berfikir, kenapa aku tetap perc...","41. Kadang aku berfikir, kenapa aku tetap perc..."
3,USER USER AKU ITU AKU\n\nKU TAU MATAMU SIPIT T...,non_hs,AKU ITU AKU KU TAU MATAMU SIPIT TAPI DILI...,AKU ITU AKU KU TAU MATAMU SIPIT TAPI DILI...
4,USER USER Kaum cebong kapir udah keliatan dong...,hs,Kaum cebong kapir udah keliatan dongoknya ...,Kaum cebong kapir udah keliatan dongoknya ...
...,...,...,...,...
13164,USER jangan asal ngomong ndasmu. congor lu yg ...,hs,jangan asal ngomong ndasmu. congor lu yg sek...,jangan asal ngomong ndasmu. congor lu yg sek...
13165,USER Kasur mana enak kunyuk',hs,Kasur mana enak kunyuk',Kasur mana enak kunyuk'
13166,USER Hati hati bisu :( .g\n\nlagi bosan huft \...,non_hs,Hati hati bisu :( .g lagi bosan huft \xf0\x...,Hati hati bisu sedih atau cemberut .g lagi ...
13167,USER USER USER USER Bom yang real mudah terdet...,non_hs,Bom yang real mudah terdeteksi bom yan...,Bom yang real mudah terdeteksi bom yan...


<a name="cleaning_pt2"></a>
# **Data Cleaning pt 2**
* Lowercasing
* Remove byte-literal prefixes (`b'` or `b"`)
* Strip leading and trailing spaces and collapse repeated spaces
* Remove punctuation mark, except for `<>` that plays a role as unique token for masking `<links>`. The list of punctuation marks is obtained from `string.punctuation`, without `<>` and adding in `“”…`.

These parts are done at the very end to clean everything (to ensure that any residuals from previous processes are also cleaned or handled).
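As a rough illustration of the punctuation step, the sketch below builds the character set directly from `string.punctuation` and checks that the `<links>` token survives. The helper name `strip_punct` and the sample sentence are made up for this example.

```python
import string

# string.punctuation without '<' and '>', plus the curly quotes and ellipsis.
punct = ''.join(ch for ch in string.punctuation if ch not in '<>') + '“”…'

# Map every punctuation character to a space.
table = str.maketrans(punct, ' ' * len(punct))

def strip_punct(text):
    # Lowercase, blank out punctuation, then collapse and trim whitespace.
    return ' '.join(text.lower().translate(table).split())

cleaned = strip_punct('Halo!!! cek <links>, "oke"…')
print(cleaned)
```

Replacing punctuation with spaces rather than deleting it keeps words apart when they were separated only by a punctuation mark, such as `elu?edan`.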

## **Version 1**
Just like the description above [Data Cleaning pt 2](#cleaning_pt2).

In [None]:
def clean_p2(tweet):
    # string.punctuation without <> (kept for the <links> token), plus “”…
    punct = string.punctuation.replace('<', '').replace('>', '') + '“”…'
    rgx_multi_space = r' {2,}'
    rgx_byte_format = r'''b'|b"'''

    tweet = tweet.lower()
    cleansed = re.sub(rgx_byte_format, '', tweet)                                          # remove byte-literal prefixes
    cleansed = cleansed.translate(str.maketrans(punct, ' ' * len(punct)))                  # replace punctuation with whitespace
    cleansed = re.sub(rgx_multi_space, ' ', cleansed)                                      # collapse multiple spaces
    cleansed = cleansed.strip()                                                            # remove leading and trailing spaces
    return cleansed

hs_df['text_cleansed_p2'] = hs_df['text_emot_emoji_desc'].apply(clean_p2)
display(hs_df)

Unnamed: 0,text_ori,labels,text_cleansed_p1,text_emot_emoji_desc,text_cleansed_p2
0,- disaat semua cowok berusaha melacak perhatia...,hs,- disaat semua cowok berusaha melacak perhatia...,- disaat semua cowok berusaha melacak perhatia...,disaat semua cowok berusaha melacak perhatian ...
1,RT USER: USER siapa yang telat ngasih tau elu?...,hs,siapa yang telat ngasih tau elu?edan sar...,siapa yang telat ngasih tau elu?edan sar...,siapa yang telat ngasih tau elu edan sarap gue...
2,"41. Kadang aku berfikir, kenapa aku tetap perc...",non_hs,"41. Kadang aku berfikir, kenapa aku tetap perc...","41. Kadang aku berfikir, kenapa aku tetap perc...",41 kadang aku berfikir kenapa aku tetap percay...
3,USER USER AKU ITU AKU\n\nKU TAU MATAMU SIPIT T...,non_hs,AKU ITU AKU KU TAU MATAMU SIPIT TAPI DILI...,AKU ITU AKU KU TAU MATAMU SIPIT TAPI DILI...,aku itu aku ku tau matamu sipit tapi diliat da...
4,USER USER Kaum cebong kapir udah keliatan dong...,hs,Kaum cebong kapir udah keliatan dongoknya ...,Kaum cebong kapir udah keliatan dongoknya ...,kaum cebong kapir udah keliatan dongoknya dari...
...,...,...,...,...,...
13164,USER jangan asal ngomong ndasmu. congor lu yg ...,hs,jangan asal ngomong ndasmu. congor lu yg sek...,jangan asal ngomong ndasmu. congor lu yg sek...,jangan asal ngomong ndasmu congor lu yg sekate...
13165,USER Kasur mana enak kunyuk',hs,Kasur mana enak kunyuk',Kasur mana enak kunyuk',kasur mana enak kunyuk
13166,USER Hati hati bisu :( .g\n\nlagi bosan huft \...,non_hs,Hati hati bisu :( .g lagi bosan huft \xf0\x...,Hati hati bisu sedih atau cemberut .g lagi ...,hati hati bisu sedih atau cemberut g lagi bosa...
13167,USER USER USER USER Bom yang real mudah terdet...,non_hs,Bom yang real mudah terdeteksi bom yan...,Bom yang real mudah terdeteksi bom yan...,bom yang real mudah terdeteksi bom yang terkub...


In [None]:
display(hs_df['text_cleansed_p2'][65:66].to_list())

['not not musik la la la hm hmm headphone semua diam semua bisu kita coba tanya sama rumput yg bergoyang ♫♫♫♪♪♪']

## **Version 2**
This version keeps punctuation and leaves emoticons and emojis as symbols instead of translating them into descriptions.

In [None]:
def clean_p2_v2(tweet):
    rgx_multi_space = r' {2,}'
    rgx_byte_format = r'''b'|b"'''

    tweet = tweet.lower()
    cleansed = re.sub(rgx_byte_format, '', tweet)
    cleansed = re.sub(rgx_multi_space, ' ', cleansed)                                      # collapse multiple spaces
    cleansed = cleansed.strip()                                                            # remove leading and trailing spaces
    return cleansed

In [None]:
text_cleansed_p2_without_emot_emoji_desc = [clean_p2_v2(tweet) for tweet in tweets_decoded_emoji]
text_cleansed_p2_without_emot_emoji_desc[65:66]

['🎶 la la la...hm hmm 🎧 "semua diam ,semua bisu" "kita coba tanya sama rumput yg bergoyang" ♫♫♫♪♪♪\'']

In [None]:
hs_df['text_cleansed_p2_without_emot_emoji_desc'] = text_cleansed_p2_without_emot_emoji_desc
display(hs_df)

Unnamed: 0,text_ori,labels,text_cleansed_p1,text_emot_emoji_desc,text_cleansed_p2,text_cleansed_p2_without_emot_emoji_desc
0,- disaat semua cowok berusaha melacak perhatia...,hs,- disaat semua cowok berusaha melacak perhatia...,- disaat semua cowok berusaha melacak perhatia...,disaat semua cowok berusaha melacak perhatian ...,- disaat semua cowok berusaha melacak perhatia...
1,RT USER: USER siapa yang telat ngasih tau elu?...,hs,siapa yang telat ngasih tau elu?edan sar...,siapa yang telat ngasih tau elu?edan sar...,siapa yang telat ngasih tau elu edan sarap gue...,siapa yang telat ngasih tau elu?edan sarap gue...
2,"41. Kadang aku berfikir, kenapa aku tetap perc...",non_hs,"41. Kadang aku berfikir, kenapa aku tetap perc...","41. Kadang aku berfikir, kenapa aku tetap perc...",41 kadang aku berfikir kenapa aku tetap percay...,"41. kadang aku berfikir, kenapa aku tetap perc..."
3,USER USER AKU ITU AKU\n\nKU TAU MATAMU SIPIT T...,non_hs,AKU ITU AKU KU TAU MATAMU SIPIT TAPI DILI...,AKU ITU AKU KU TAU MATAMU SIPIT TAPI DILI...,aku itu aku ku tau matamu sipit tapi diliat da...,aku itu aku ku tau matamu sipit tapi diliat da...
4,USER USER Kaum cebong kapir udah keliatan dong...,hs,Kaum cebong kapir udah keliatan dongoknya ...,Kaum cebong kapir udah keliatan dongoknya ...,kaum cebong kapir udah keliatan dongoknya dari...,kaum cebong kapir udah keliatan dongoknya dari...
...,...,...,...,...,...,...
13164,USER jangan asal ngomong ndasmu. congor lu yg ...,hs,jangan asal ngomong ndasmu. congor lu yg sek...,jangan asal ngomong ndasmu. congor lu yg sek...,jangan asal ngomong ndasmu congor lu yg sekate...,jangan asal ngomong ndasmu. congor lu yg sekat...
13165,USER Kasur mana enak kunyuk',hs,Kasur mana enak kunyuk',Kasur mana enak kunyuk',kasur mana enak kunyuk,kasur mana enak kunyuk'
13166,USER Hati hati bisu :( .g\n\nlagi bosan huft \...,non_hs,Hati hati bisu :( .g lagi bosan huft \xf0\x...,Hati hati bisu sedih atau cemberut .g lagi ...,hati hati bisu sedih atau cemberut g lagi bosa...,hati hati bisu :( .g lagi bosan huft 😪'
13167,USER USER USER USER Bom yang real mudah terdet...,non_hs,Bom yang real mudah terdeteksi bom yan...,Bom yang real mudah terdeteksi bom yan...,bom yang real mudah terdeteksi bom yang terkub...,bom yang real mudah terdeteksi bom yang terkub...


# **Drop Missing Values**
Drop rows whose cleaned text is empty (NaN) because the original tweet consisted only of usernames, for example the rows at these indices:


```
Int64Index([  182,   288,   318,   377,   490,  1282,  1565,  1840,  1972,
             2514,  2719,  2763,  3208,  3412,  3838,  4830,  5324,  5388,
             5444,  5710,  5801,  6075,  6328,  6746,  7010,  7179,  7190,
             7644,  7675,  7751,  7769,  8068,  8249,  8512,  8901,  8941,
             9297,  9982, 10701, 10736, 11303, 11958, 12632, 12682, 12788,
            12952],
           dtype='int64')
```

Pandas does not treat empty strings as null. To fix this, convert the empty strings (or whatever your empty cells contain) to `np.nan` with `replace()`, and then call `dropna()` on the DataFrame.
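The behaviour is easy to confirm on a small frame. This is a minimal sketch with made-up rows; the `demo` frame and its values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second one mimics a tweet that contained only usernames and cleaned to ''.
demo = pd.DataFrame({'text': ['halo semua', '', 'apa kabar'],
                     'labels': ['non_hs', 'non_hs', 'non_hs']})

missing_before = int(demo['text'].isna().sum())    # '' is not counted as missing
demo['text'] = demo['text'].replace('', np.nan)    # assign the result back to the column
missing_after = int(demo['text'].isna().sum())

demo = demo.dropna(subset=['text'])                # drop only rows whose text is missing
print(missing_before, missing_after, len(demo))
```

Passing `subset` limits the check to the text column; calling `dropna()` without it removes a row if any column is NaN.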


In [None]:
# before replacing with NaN, empty strings are not detected
hs_df.loc[pd.isna(hs_df['text_cleansed_p2']), :].index

Int64Index([], dtype='int64')

In [None]:
import numpy as np

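# assign the result back; an in-place replace on a selected column may not update hs_df in newer pandas
hs_df['text_cleansed_p2'] = hs_df['text_cleansed_p2'].replace('', np.nan)
hs_df['text_cleansed_p2_without_emot_emoji_desc'] = hs_df['text_cleansed_p2_without_emot_emoji_desc'].replace('', np.nan)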

# after replacing with NaN, empty strings are detected
display(hs_df.loc[pd.isna(hs_df['text_cleansed_p2']), :].index)
display(hs_df.loc[pd.isna(hs_df['text_cleansed_p2_without_emot_emoji_desc']), :].index)

Int64Index([  182,   288,   318,   377,   490,  1282,  1565,  1840,  1972,
             2514,  2719,  2763,  3208,  3412,  3838,  4830,  5324,  5388,
             5444,  5710,  5801,  6075,  6328,  6746,  7010,  7179,  7190,
             7644,  7675,  7751,  7769,  8068,  8249,  8512,  8901,  8941,
             9297,  9982, 10701, 10736, 11303, 11958, 12632, 12682, 12788,
            12952],
           dtype='int64')

Int64Index([  182,   288,   318,   377,   490,  1282,  1565,  1840,  1972,
             2514,  2719,  2763,  3208,  3412,  3838,  4830,  5324,  5388,
             5710,  5801,  6075,  6746,  7010,  7179,  7190,  7644,  7675,
             7751,  7769,  8249,  8901,  8941, 10736, 11303, 11958, 12632,
            12682, 12788, 12952],
           dtype='int64')

In [None]:
nan_idx = hs_df.loc[pd.isna(hs_df['text_cleansed_p2']), :].index
display(nan_idx)
display(len(nan_idx))

Int64Index([  182,   288,   318,   377,   490,  1282,  1565,  1840,  1972,
             2514,  2719,  2763,  3208,  3412,  3838,  4830,  5324,  5388,
             5444,  5710,  5801,  6075,  6328,  6746,  7010,  7179,  7190,
             7644,  7675,  7751,  7769,  8068,  8249,  8512,  8901,  8941,
             9297,  9982, 10701, 10736, 11303, 11958, 12632, 12682, 12788,
            12952],
           dtype='int64')

46

In [None]:
hs_df['text_cleansed_p2'][180:183]

180    aku pernah sempat baca ttg harun yahya ini yg ...
181    kita maju bersama ulama kita bergerak bersama ...
182                                                  NaN
Name: text_cleansed_p2, dtype: object

In [None]:
hs_df.dropna(inplace=True)

In [None]:
len(hs_df)

13123

# **Export to csv**

In [None]:
# hs_df.to_csv(HS_PATH + 'hs-abusive_preprocessed_all-step.csv', index=False)

In [None]:
hs_df_preprocessed = hs_df[['text_cleansed_p2', 'labels']].copy()
hs_df_preprocessed.rename(columns={'text_cleansed_p2': 'text'}, inplace=True)
# hs_df_preprocessed.to_csv(HS_PATH + 'hs-abusive_preprocessed.csv', index=False)
hs_df_preprocessed

Unnamed: 0,text,labels
0,disaat semua cowok berusaha melacak perhatian ...,hs
1,siapa yang telat ngasih tau elu edan sarap gue...,hs
2,41 kadang aku berfikir kenapa aku tetap percay...,non_hs
3,aku itu aku ku tau matamu sipit tapi diliat da...,non_hs
4,kaum cebong kapir udah keliatan dongoknya dari...,hs
...,...,...
13164,jangan asal ngomong ndasmu congor lu yg sekate...,hs
13165,kasur mana enak kunyuk,hs
13166,hati hati bisu sedih atau cemberut g lagi bosa...,non_hs
13167,bom yang real mudah terdeteksi bom yang terkub...,non_hs


In [None]:
hs_df_preprocessed_v2 = hs_df[['text_cleansed_p2_without_emot_emoji_desc', 'labels']].copy()
hs_df_preprocessed_v2.rename(columns={'text_cleansed_p2_without_emot_emoji_desc': 'text'}, inplace=True)
# hs_df_preprocessed_v2.to_csv(HS_PATH + 'hs-abusive_preprocessed_emot-emoji-intact.csv', index=False)
hs_df_preprocessed_v2

Unnamed: 0,text,labels
0,- disaat semua cowok berusaha melacak perhatia...,hs
1,siapa yang telat ngasih tau elu?edan sarap gue...,hs
2,"41. kadang aku berfikir, kenapa aku tetap perc...",non_hs
3,aku itu aku ku tau matamu sipit tapi diliat da...,non_hs
4,kaum cebong kapir udah keliatan dongoknya dari...,hs
...,...,...
13164,jangan asal ngomong ndasmu. congor lu yg sekat...,hs
13165,kasur mana enak kunyuk',hs
13166,hati hati bisu :( .g lagi bosan huft 😪',non_hs
13167,bom yang real mudah terdeteksi bom yang terkub...,non_hs
