
# **Data Processing for Twitter User Emotion Detection**

---

Tweet data can be processed in the following stages:
*   Case Folding
*   Tokenizing
*   Filtering
*   Stemming

In [None]:
## Upload the file containing the dataset

from google.colab import files
uploaded = files.upload()

Saving tweet_emotions.csv to tweet_emotions.csv




---
**Loading and Reading the Dataset with Pandas**


In [None]:
## Read the uploaded file and display the dataset

import numpy as np
import pandas as pd

data_tweet = pd.read_csv('tweet_emotions.csv', encoding='latin-1')
# the encoding must be specified because the data is not UTF-8

data_tweet.head(10)

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,wants to hang out with friends SOON!
4,1956968416,neutral,@dannycastillo We want to trade with someone w...
5,1956968477,worry,Re-pinging @ghostridah14: why didn't you go to...
6,1956968487,sadness,"I should be sleep, but im not! thinking about ..."
7,1956968636,worry,Hmmm. http://www.djhero.com/ is down
8,1956969035,sadness,@charviray Charlene my love. I miss you
9,1956969172,sadness,@kelcouch I'm sorry at least it's Friday?
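The `Unnamed: 0` column in the output above is typically a row index that was written into the CSV when it was saved. As a minimal sketch, assuming the first column really is just that saved index, `index_col=0` tells `read_csv` to use it as the index rather than load it as data. The two-row CSV text below is a made-up stand-in for the real file.

```python
import io

import pandas as pd

# Made-up stand-in for tweet_emotions.csv; the first header field is empty,
# which is what pandas reports as "Unnamed: 0".
csv_text = (
    ",tweet_id,sentiment,content\n"
    "0,1001,neutral,first sample tweet\n"
    "1,1002,sadness,second sample tweet\n"
)

# Default read: the saved index shows up as an extra data column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```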




---

**Case Folding**

Case folding is done with the lower() method of the Series.str accessor in the Pandas library.

In [None]:
# Case Folding

data_tweet['content'] = data_tweet['content'].str.lower()

print('Case Folding Result : \n')
print(data_tweet['content'].head())

Case Folding Result : 

0    @tiffanylue i know  i was listenin to bad habi...
1    layin n bed with a headache  ughhhh...waitin o...
2                  funeral ceremony...gloomy friday...
3                 wants to hang out with friends soon!
4    @dannycastillo we want to trade with someone w...
Name: content, dtype: object
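`Series.str.lower()` works element-wise and leaves missing values as missing, so a null `content` cell does not raise an error at this step. The sketch below uses made-up values to show that behaviour. Later steps that call string methods through `.apply()` would still need such rows filled or dropped first.

```python
import pandas as pd

# Made-up sample with one missing value.
sample = pd.Series(["Funeral Ceremony", "SOON!", None])

# Lower-case every string; the missing entry stays missing.
lowered = sample.str.lower()
print(lowered.tolist())
```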


**Tokenizing**

This step removes numbers, extra whitespace, and punctuation, then uses word_tokenize() to split each string into tokens. A Pandas DataFrame or Series can run an external function on a column or row with the .apply() method.
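As a small illustration of how `.apply()` hands each element to an ordinary Python function, here is a sketch with made-up tweets. The helper name `remove_digits` is hypothetical and only mirrors the number-removal step.

```python
import re

import pandas as pd


def remove_digits(text):
    """Delete every run of digits from a string."""
    return re.sub(r"\d+", "", text)


tweets = pd.Series(["call me at 5pm", "top 10 songs"])

# .apply() calls remove_digits once per element and returns a new Series.
cleaned = tweets.apply(remove_digits)
print(cleaned.tolist())
```

Removing the digits leaves a double space in the second tweet, which is why a whitespace-collapsing step usually follows.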

In [None]:
# Tokenizing

import string 
import re #regex library

# import word_tokenize & FreqDist from NLTK
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize 
from nltk.probability import FreqDist

def remove_tweet_special(text):
    # remove literal escape sequences (tab, newline, unicode prefix) and stray backslashes
    text = text.replace('\\t', " ").replace('\\n', " ").replace('\\u', " ").replace('\\', "")
    # replace non-ASCII characters (emoticons, Chinese characters, etc.) with '?'
    text = text.encode('ascii', 'replace').decode('ascii')
    # remove mentions, hashtags, and links, then collapse whitespace
    text = ' '.join(re.sub(r"([@#][A-Za-z0-9]+)|(\w+:\/\/\S+)", " ", text).split())
    # remove leftover URL schemes from incomplete URLs
    return text.replace("http://", " ").replace("https://", " ")

data_tweet['content'] = data_tweet['content'].apply(remove_tweet_special)

# remove numbers
def remove_number(text):
    return re.sub(r"\d+", "", text)

data_tweet['content'] = data_tweet['content'].apply(remove_number)

# remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

data_tweet['content'] = data_tweet['content'].apply(remove_punctuation)

# remove leading & trailing whitespace
def remove_whitespace_LT(text):
    return text.strip()

data_tweet['content'] = data_tweet['content'].apply(remove_whitespace_LT)

# collapse multiple whitespace characters into a single space
def remove_whitespace_multiple(text):
    return re.sub(r'\s+', ' ', text)

data_tweet['content'] = data_tweet['content'].apply(remove_whitespace_multiple)

# remove single-character words
def remove_single_char(text):
    return re.sub(r"\b[a-zA-Z]\b", "", text)

data_tweet['content'] = data_tweet['content'].apply(remove_single_char)

# NLTK word tokenize
def word_tokenize_wrapper(text):
    return word_tokenize(text)

data_tweet['tweet_tokens'] = data_tweet['content'].apply(word_tokenize_wrapper)

print('Tokenizing Result : \n') 
print(data_tweet['tweet_tokens'].head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Tokenizing Result : 

0    [know, was, listenin, to, bad, habit, earlier,...
1    [layin, bed, with, headache, ughhhhwaitin, on,...
2                    [funeral, ceremonygloomy, friday]
3          [wants, to, hang, out, with, friends, soon]
4    [we, want, to, trade, with, someone, who, has,...
Name: tweet_tokens, dtype: object
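Tokens such as `ughhhhwaitin` and `ceremonygloomy` in the result come from deleting punctuation outright: the dots in `ceremony...gloomy` vanish and the two words fuse. One possible alternative, sketched here with the hypothetical helper `punctuation_to_space`, is to map each punctuation mark to a space before collapsing whitespace. This is an assumption about what may be preferable, not the notebook's method.

```python
import re
import string


def strip_punctuation(text):
    """Delete punctuation characters, as the notebook's step does."""
    return text.translate(str.maketrans("", "", string.punctuation))


# Translation table mapping every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def punctuation_to_space(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    spaced = text.translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", spaced).strip()


sample = "funeral ceremony...gloomy friday..."
print(strip_punctuation(sample))
print(punctuation_to_space(sample))
```

The trade-off is that contractions are split as well: `didn't` becomes `didn t`, which the single-character removal step would then reduce to `didn`.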


Compute the token frequency distribution for each row of the DataFrame by passing NLTK's FreqDist to .apply().

In [None]:
# NLTK calc frequency distribution
def freqDist_wrapper(text):
    return FreqDist(text)

data_tweet['tweet_tokens_fdist'] = data_tweet['tweet_tokens'].apply(freqDist_wrapper)

print('Frequency Tokens : \n') 
print(data_tweet['tweet_tokens_fdist'].head().apply(lambda x : x.most_common()))

Frequency Tokens : 

0    [(know, 1), (was, 1), (listenin, 1), (to, 1), ...
1    [(layin, 1), (bed, 1), (with, 1), (headache, 1...
2     [(funeral, 1), (ceremonygloomy, 1), (friday, 1)]
3    [(wants, 1), (to, 1), (hang, 1), (out, 1), (wi...
4    [(we, 1), (want, 1), (to, 1), (trade, 1), (wit...
Name: tweet_tokens_fdist, dtype: object
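Because a single tweet is short, per-row counts are mostly 1, as the output above shows. A corpus-wide count is often more informative. The sketch below uses `collections.Counter` on made-up token lists; NLTK's `FreqDist` is a `Counter` subclass, so the same `update` pattern should carry over.

```python
from collections import Counter

# Made-up token lists standing in for data_tweet['tweet_tokens'].
token_lists = [
    ["wants", "to", "hang", "out", "with", "friends"],
    ["layin", "bed", "with", "headache"],
    ["want", "to", "trade", "with", "someone"],
]

# Accumulate counts across all documents.
corpus_counts = Counter()
for tokens in token_lists:
    corpus_counts.update(tokens)

print(corpus_counts.most_common(2))
```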


**Filtering (Stopword Removal)**

The NLTK library is used to filter the DataFrame with the English stopword list. Extra stopwords can be added to the list with .extend(). Converting the list with set() produces a hash-based collection, which speeds up checking whether a token is in list_stopwords (if token not in list_stopwords:).

In [None]:
# Filtering

from nltk.corpus import stopwords
nltk.download('stopwords')

# ----------------------- get stopword from NLTK stopword -------------------------------
# get English stopwords
list_stopwords = stopwords.words('english')


# convert list to set for fast membership checks
list_stopwords = set(list_stopwords)


# remove stopwords from the token list
def stopwords_removal(words):
    return [word for word in words if word not in list_stopwords]

data_tweet['tweet_tokens_WSW'] = data_tweet['tweet_tokens'].apply(stopwords_removal) 


print(data_tweet['tweet_tokens_WSW'].head())


0    [know, listenin, bad, habit, earlier, started,...
1           [layin, bed, headache, ughhhhwaitin, call]
2                    [funeral, ceremonygloomy, friday]
3                         [wants, hang, friends, soon]
4        [want, trade, someone, houston, tickets, one]
Name: tweet_tokens_WSW, dtype: object


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
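The `.extend()` and `set()` steps described for filtering can be sketched as follows. The base list and the extra words here are made up for illustration; the notebook itself takes its list from `stopwords.words('english')`.

```python
# Hypothetical base list standing in for stopwords.words('english').
base_stopwords = ["to", "with", "the", "a"]

# Tweet-specific additions, chosen here purely for illustration.
base_stopwords.extend(["rt", "lol"])

# A set gives hash-based lookups, so `in` does not scan the whole list.
stopword_set = set(base_stopwords)


def drop_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopword_set]


filtered = drop_stopwords(["rt", "wants", "to", "hang", "with", "friends", "lol"])
print(filtered)
```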


**Installing the Sastrawi and Swifter Libraries**

In [None]:
!pip install Sastrawi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
     |████████████████████████████████| 209 kB 15.9 MB/s
Installing collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


In [None]:
!pip install swifter

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting swifter
  Downloading swifter-1.3.4.tar.gz (830 kB)
     |████████████████████████████████| 830 kB 14.7 MB/s
Collecting psutil>=5.6.6
  Downloading psutil-5.9.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (291 kB)
     |████████████████████████████████| 291 kB 65.9 MB/s
Collecting jedi>=0.10
  Downloading jedi-0.18.1-py2.py3-none-any.whl (1.6 MB)
     |████████████████████████████████| 1.6 MB 64.7 MB/s
Building wheels for collected packages: swifter
  Building wheel for swifter (setup.py) ... done
  Created wheel for swifter: filename=swifter-1.3.4-py3-none-any.whl size=16322 sha256=d31dad6c5b8bbb5bf3af3ce6e63f791445ae0bd028909bbf21a9c6bf017c33d6
  Stored in directory: /root/.cache/pip/wheels/29/a7/0e/3a8f17ac69d759e1e93647114bc9bdc95957e5b0cbfd405205
Successfully built swifter
Installing collected

**Stemming**

Stemming reduces each term to its root form. It complements normalization, which unifies terms that have the same meaning but are written differently, whether because of typos, abbreviations, or slang.
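Sastrawi is a stemmer for Indonesian, while the tweets in this dataset are in English, so it may leave most English tokens as they are. As an alternative sketch, not the method used in this notebook, NLTK's `PorterStemmer` handles English suffixes. The token lists below are made up, and each unique term is stemmed once and cached in a dictionary before being mapped back onto the documents.

```python
from nltk.stem import PorterStemmer

# English stemmer; needs no extra NLTK data download.
stemmer_en = PorterStemmer()

# Made-up token lists standing in for data_tweet['tweet_tokens_WSW'].
documents = [
    ["confessions", "obsessing"],
    ["constructions", "confessions"],
]

# Stem each unique term only once and remember the result.
stem_cache = {}
for document in documents:
    for term in document:
        if term not in stem_cache:
            stem_cache[term] = stemmer_en.stem(term)

# Map the cached stems back onto every document.
stemmed_documents = [[stem_cache[term] for term in document] for document in documents]
print(stemmed_documents)
```

Caching by unique term avoids calling the stemmer again for words that recur across many tweets.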

In [None]:
# import Sastrawi package

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import swifter


# create stemmer
factory = StemmerFactory()
stemmer = factory.create_stemmer()

# stemmed
def stemmed_wrapper(term):
    return stemmer.stem(term)

term_dict = {}

for document in data_tweet['tweet_tokens_WSW']:
    for term in document:
        if term not in term_dict:
            term_dict[term] = ' '
            
print(len(term_dict))
print("------------------------")

for term in term_dict:
    term_dict[term] = stemmed_wrapper(term)
    print(term,":" ,term_dict[term])
    
print(term_dict)
print("------------------------")


# apply stemmed terms to the dataframe
def get_stemmed_term(document):
    return [term_dict[term] for term in document]

# read from the stopword-filtered tokens; every key in term_dict comes from this column
data_tweet['tweet_tokens_WSW'] = data_tweet['tweet_tokens_WSW'].swifter.apply(get_stemmed_term)
print(data_tweet['tweet_tokens_WSW'])

Streaming output truncated to the last 5000 lines.
todayreally : todayreally
garcia : garcia
mumyummy : mumyummy
bombbbb : bombbbb
pwetttty : pwetttty
constructions : constructions
poping : poping
atkins : atkins
concerti : concerti
additl : additl
lenny : lenny
packedthinking : packedthinking
backgroundmy : backgroundmy
heroswichita : heroswichita
kraussey : kraussey
gerry : gerry
mehelp : mehelp
newsfurtheri : newsfurtheri
coolthank : coolthank
youuuuu : youuuuu
santino : santino
thirteen : thirteen
dayssss : dayssss
aahh : aahh
griffin : griffin
confessions : confessions
shopoholictotally : shopoholictotally
dancys : dancys
obsessing : obsessing
pine : pine
pradas : pradas
dunks : dunks
compaired : compaired
nickj : nickj
umnicks : umnicks
voicesmileeyeslaughand : voicesmileeyeslaughand
rolemodel : rolemodel
cyberspace : cyberspace
blinks : blinks
motherfucking : motherfucking
shoutz : shoutz
nervvoouuss : nervvoouuss
narrowed : narrowed
ahmazing : ahmazing
broooooooo 