# Assignment 2: Preprocessing & TF-IDF

**PREPROCESSING**



---
Preprocessing is the set of initial text-processing steps that clean and prepare raw text so that it can be analyzed further or used in a machine learning model. The following are some common text preprocessing steps:


In [59]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [60]:
import pandas as pd

df = pd.read_csv("/content/drive/My Drive/ppw/report/tugas-ppw/data_berita_detik.csv")
df.head()

Unnamed: 0,judul,tanggal,isi,kategori
0,Jangan Sembarangan! Waspadai Pemicu Cedera saa...,"Sabtu, 07 Sep 2024 19:00 WIB",Jakarta - Olahraga lari dalam beberapa waktu t...,Kesehatan
1,Keseringan Pakai TWS? Waspada Gangguan Pendeng...,"Sabtu, 07 Sep 2024 18:00 WIB",Jakarta - Adanya teknologi earphone nirkabel a...,Kesehatan
2,Jogging Pakai TWS Aman Nggak Sih? Ini Plus-Min...,"Sabtu, 07 Sep 2024 17:00 WIB",Jakarta - Tren penggunaan earphone nirkabel at...,Kesehatan
3,5 Kebiasaan Simpel yang Bisa Bikin Panjang Umu...,"Sabtu, 07 Sep 2024 14:00 WIB",Jakarta - Hidup sehat dengan umur yang panjang...,Kesehatan
4,Rajin Sit-up Saja Nggak Bikin Lemak di Perut H...,"Sabtu, 07 Sep 2024 13:00 WIB",Jakarta - Spesialis kedokteran olahraga dr And...,Kesehatan


**1.  CLEANSING**

The cleansing stage strips out attributes that do not affect the sentiment classification result. Review documents contain several such attributes, including URLs, HTML tags, emoji, symbols, digits, and punctuation (~!@#$%^&*{}<>:|). These attributes are removed from the text; the code below deletes each match by substituting an empty string, and it is applied to the `isi` (article body) column.







In [61]:
import re
import string
import nltk

def remove_url(ulasan):
  # Match http(s) links and bare www. links
  url = re.compile(r'https?://\S+|www\.\S+')
  return url.sub(r'', ulasan)

def remove_html(ulasan):
  # Match any HTML tag, non-greedily
  html = re.compile(r'<.*?>')
  return html.sub(r'', ulasan)

def remove_emoji(ulasan):
  emoji_pattern = re.compile("["
      u"\U0001F600-\U0001F64F"
      u"\U0001F300-\U0001F5FF"
      u"\U0001F680-\U0001F6FF"
      u"\U0001F1E0-\U0001F1FF""]+", flags=re.UNICODE)
  return emoji_pattern.sub(r'', ulasan)

def remove_numbers(ulasan):
  ulasan = re.sub(r'\d+', '', ulasan)
  return ulasan

def remove_symbols(ulasan):
  ulasan = re.sub(r'[^a-zA-Z0-9\s]', '', ulasan)
  return ulasan

df['cleansing'] = df['isi'].apply(lambda x: remove_url(x))
df['cleansing'] = df['cleansing'].apply(lambda x: remove_html(x))
df['cleansing'] = df['cleansing'].apply(lambda x: remove_emoji(x))
df['cleansing'] = df['cleansing'].apply(lambda x: remove_symbols(x))
df['cleansing'] = df['cleansing'].apply(lambda x: remove_numbers(x))

df.head(5)

Unnamed: 0,judul,tanggal,isi,kategori,cleansing
0,Jangan Sembarangan! Waspadai Pemicu Cedera saa...,"Sabtu, 07 Sep 2024 19:00 WIB",Jakarta - Olahraga lari dalam beberapa waktu t...,Kesehatan,Jakarta Olahraga lari dalam beberapa waktu te...
1,Keseringan Pakai TWS? Waspada Gangguan Pendeng...,"Sabtu, 07 Sep 2024 18:00 WIB",Jakarta - Adanya teknologi earphone nirkabel a...,Kesehatan,Jakarta Adanya teknologi earphone nirkabel at...
2,Jogging Pakai TWS Aman Nggak Sih? Ini Plus-Min...,"Sabtu, 07 Sep 2024 17:00 WIB",Jakarta - Tren penggunaan earphone nirkabel at...,Kesehatan,Jakarta Tren penggunaan earphone nirkabel ata...
3,5 Kebiasaan Simpel yang Bisa Bikin Panjang Umu...,"Sabtu, 07 Sep 2024 14:00 WIB",Jakarta - Hidup sehat dengan umur yang panjang...,Kesehatan,Jakarta Hidup sehat dengan umur yang panjang ...
4,Rajin Sit-up Saja Nggak Bikin Lemak di Perut H...,"Sabtu, 07 Sep 2024 13:00 WIB",Jakarta - Spesialis kedokteran olahraga dr And...,Kesehatan,Jakarta Spesialis kedokteran olahraga dr Andh...
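
The cleansing cell above deletes every match outright, which fuses hyphenated words (for example, `Plus-Minus` becomes `PlusMinus`). As a minimal sketch of an alternative, the hypothetical helper `clean_text` below substitutes a space for each match and then collapses repeated whitespace. The helper name, the sample string, and the `example.com` URL are assumptions made for illustration and are not part of the original notebook.

```python
import re

# Patterns for the attributes the cleansing stage targets
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<.*?>")
# Anything that is not an ASCII letter or whitespace:
# digits, punctuation, symbols, and emoji all fall in this class
NON_LETTER_RE = re.compile(r"[^a-zA-Z\s]")
SPACE_RE = re.compile(r"\s+")

def clean_text(text):
    """Hypothetical helper: replace unwanted attributes with a space."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions
    return SPACE_RE.sub(" ", text).strip()

# Made-up sample, not taken from the dataset
sample = "Jakarta - Cek <b>Plus-Minus</b> TWS di https://example.com/a?b=1 tahun 2024!"
print(clean_text(sample))  # Jakarta Cek Plus Minus TWS di tahun
```

Whether to delete or to substitute a space is a design choice: deletion keeps compound terms as one token, while substitution keeps their parts as separate tokens.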


**2. CASE FOLDING**

In the case folding stage, every uppercase letter in the documents is converted to lowercase. This removes redundancy between words that differ only in letter case.



In [62]:
def case_folding(text):
    # Lowercase strings; pass non-string values (e.g. NaN) through unchanged
    if isinstance(text, str):
        return text.lower()
    return text

df['case_folding'] = df['cleansing'].apply(case_folding)

df.head(5)

Unnamed: 0,judul,tanggal,isi,kategori,cleansing,case_folding
0,Jangan Sembarangan! Waspadai Pemicu Cedera saa...,"Sabtu, 07 Sep 2024 19:00 WIB",Jakarta - Olahraga lari dalam beberapa waktu t...,Kesehatan,Jakarta Olahraga lari dalam beberapa waktu te...,jakarta olahraga lari dalam beberapa waktu te...
1,Keseringan Pakai TWS? Waspada Gangguan Pendeng...,"Sabtu, 07 Sep 2024 18:00 WIB",Jakarta - Adanya teknologi earphone nirkabel a...,Kesehatan,Jakarta Adanya teknologi earphone nirkabel at...,jakarta adanya teknologi earphone nirkabel at...
2,Jogging Pakai TWS Aman Nggak Sih? Ini Plus-Min...,"Sabtu, 07 Sep 2024 17:00 WIB",Jakarta - Tren penggunaan earphone nirkabel at...,Kesehatan,Jakarta Tren penggunaan earphone nirkabel ata...,jakarta tren penggunaan earphone nirkabel ata...
3,5 Kebiasaan Simpel yang Bisa Bikin Panjang Umu...,"Sabtu, 07 Sep 2024 14:00 WIB",Jakarta - Hidup sehat dengan umur yang panjang...,Kesehatan,Jakarta Hidup sehat dengan umur yang panjang ...,jakarta hidup sehat dengan umur yang panjang ...
4,Rajin Sit-up Saja Nggak Bikin Lemak di Perut H...,"Sabtu, 07 Sep 2024 13:00 WIB",Jakarta - Spesialis kedokteran olahraga dr And...,Kesehatan,Jakarta Spesialis kedokteran olahraga dr Andh...,jakarta spesialis kedokteran olahraga dr andh...


**3. TOKENIZATION**


Tokenization splits a document into its individual words, each of which becomes a single token. A word here is any run of characters separated by spaces, so the tokenizer relies on the whitespace in the document to find word boundaries.


In [63]:
def tokenize(text):
    tokens = text.split()
    return tokens

df['tokenize'] = df['case_folding'].apply(tokenize)

df.head(5)

Unnamed: 0,judul,tanggal,isi,kategori,cleansing,case_folding,tokenize
0,Jangan Sembarangan! Waspadai Pemicu Cedera saa...,"Sabtu, 07 Sep 2024 19:00 WIB",Jakarta - Olahraga lari dalam beberapa waktu t...,Kesehatan,Jakarta Olahraga lari dalam beberapa waktu te...,jakarta olahraga lari dalam beberapa waktu te...,"[jakarta, olahraga, lari, dalam, beberapa, wak..."
1,Keseringan Pakai TWS? Waspada Gangguan Pendeng...,"Sabtu, 07 Sep 2024 18:00 WIB",Jakarta - Adanya teknologi earphone nirkabel a...,Kesehatan,Jakarta Adanya teknologi earphone nirkabel at...,jakarta adanya teknologi earphone nirkabel at...,"[jakarta, adanya, teknologi, earphone, nirkabe..."
2,Jogging Pakai TWS Aman Nggak Sih? Ini Plus-Min...,"Sabtu, 07 Sep 2024 17:00 WIB",Jakarta - Tren penggunaan earphone nirkabel at...,Kesehatan,Jakarta Tren penggunaan earphone nirkabel ata...,jakarta tren penggunaan earphone nirkabel ata...,"[jakarta, tren, penggunaan, earphone, nirkabel..."
3,5 Kebiasaan Simpel yang Bisa Bikin Panjang Umu...,"Sabtu, 07 Sep 2024 14:00 WIB",Jakarta - Hidup sehat dengan umur yang panjang...,Kesehatan,Jakarta Hidup sehat dengan umur yang panjang ...,jakarta hidup sehat dengan umur yang panjang ...,"[jakarta, hidup, sehat, dengan, umur, yang, pa..."
4,Rajin Sit-up Saja Nggak Bikin Lemak di Perut H...,"Sabtu, 07 Sep 2024 13:00 WIB",Jakarta - Spesialis kedokteran olahraga dr And...,Kesehatan,Jakarta Spesialis kedokteran olahraga dr Andh...,jakarta spesialis kedokteran olahraga dr andh...,"[jakarta, spesialis, kedokteran, olahraga, dr,..."
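
Because cleansing deletes characters such as the dash in `Jakarta - Olahraga`, the cleaned text can contain consecutive spaces. The short sketch below, on a made-up string, shows why calling `split()` with no argument is a safe choice here: it treats any run of whitespace as one separator, whereas `split(" ")` produces empty tokens and leaves newlines attached.

```python
# Made-up sample with a double space and a newline
text = "jakarta  olahraga lari\n dalam beberapa waktu"

# No argument: split on any run of whitespace and drop empty strings
tokens = text.split()
print(tokens)
# ['jakarta', 'olahraga', 'lari', 'dalam', 'beberapa', 'waktu']

# Explicit single-space separator: empty tokens and stray newlines survive
naive = text.split(" ")
print(naive)
# ['jakarta', '', 'olahraga', 'lari\n', 'dalam', 'beberapa', 'waktu']
```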


**4. STOPWORD REMOVAL**


The stopword removal stage drops words that carry no significant meaning in a sentence. In this preprocessing step the author removes stopwords from the text using a stopword list that includes “yang” (which), “dan” (and), “di” (in), “dari” (from), and others.



In [64]:
from nltk.corpus import stopwords
nltk.download('stopwords')
# Store the list as a set so that each membership test is O(1)
stop_words = set(stopwords.words('indonesian'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [65]:
def remove_stopwords(text):
  return [word for word in text if word not in stop_words]

df['stopword_removal'] = df['tokenize'].apply(lambda x: ' '.join(remove_stopwords(x)))

df.head(5)

Unnamed: 0,judul,tanggal,isi,kategori,cleansing,case_folding,tokenize,stopword_removal
0,Jangan Sembarangan! Waspadai Pemicu Cedera saa...,"Sabtu, 07 Sep 2024 19:00 WIB",Jakarta - Olahraga lari dalam beberapa waktu t...,Kesehatan,Jakarta Olahraga lari dalam beberapa waktu te...,jakarta olahraga lari dalam beberapa waktu te...,"[jakarta, olahraga, lari, dalam, beberapa, wak...",jakarta olahraga lari tren mudah berlari salah...
1,Keseringan Pakai TWS? Waspada Gangguan Pendeng...,"Sabtu, 07 Sep 2024 18:00 WIB",Jakarta - Adanya teknologi earphone nirkabel a...,Kesehatan,Jakarta Adanya teknologi earphone nirkabel at...,jakarta adanya teknologi earphone nirkabel at...,"[jakarta, adanya, teknologi, earphone, nirkabe...",jakarta teknologi earphone nirkabel true wirel...
2,Jogging Pakai TWS Aman Nggak Sih? Ini Plus-Min...,"Sabtu, 07 Sep 2024 17:00 WIB",Jakarta - Tren penggunaan earphone nirkabel at...,Kesehatan,Jakarta Tren penggunaan earphone nirkabel ata...,jakarta tren penggunaan earphone nirkabel ata...,"[jakarta, tren, penggunaan, earphone, nirkabel...",jakarta tren penggunaan earphone nirkabel true...
3,5 Kebiasaan Simpel yang Bisa Bikin Panjang Umu...,"Sabtu, 07 Sep 2024 14:00 WIB",Jakarta - Hidup sehat dengan umur yang panjang...,Kesehatan,Jakarta Hidup sehat dengan umur yang panjang ...,jakarta hidup sehat dengan umur yang panjang ...,"[jakarta, hidup, sehat, dengan, umur, yang, pa...",jakarta hidup sehat umur impian orang dibutuhk...
4,Rajin Sit-up Saja Nggak Bikin Lemak di Perut H...,"Sabtu, 07 Sep 2024 13:00 WIB",Jakarta - Spesialis kedokteran olahraga dr And...,Kesehatan,Jakarta Spesialis kedokteran olahraga dr Andh...,jakarta spesialis kedokteran olahraga dr andh...,"[jakarta, spesialis, kedokteran, olahraga, dr,...",jakarta spesialis kedokteran olahraga dr andhi...
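
To show the filtering step in isolation, here is a minimal sketch that does not need the NLTK download. `STOP_WORDS` is a hypothetical six-word subset chosen for illustration; the notebook uses NLTK's full Indonesian list, which is much larger and therefore removes more words than this example does.

```python
# Hypothetical subset of an Indonesian stopword list (illustration only)
STOP_WORDS = {"yang", "dan", "di", "dari", "dalam", "dengan"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stop_words]

tokens = ["hidup", "sehat", "dengan", "umur", "yang", "panjang"]
filtered = remove_stopwords(tokens)
print(filtered)            # ['hidup', 'sehat', 'umur', 'panjang']
print(" ".join(filtered))  # hidup sehat umur panjang
```

Keeping the stopwords in a `set` rather than a `list` makes each lookup constant-time, which matters when the filter runs over every token of every article.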


In [66]:
df.to_csv("/content/drive/My Drive/ppw/report/tugas-ppw/hasil_prepros.csv",encoding='utf8', index=False)

**TF-IDF (Term Frequency-Inverse Document Frequency)**


---

TF-IDF is a statistical measure of how important a word is to a document relative to the rest of the document collection. A term scores high when it occurs often in one document but in few other documents. TF-IDF is widely used in text mining, information retrieval, and text-based machine learning models.
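
As a rough sketch of the computation, the pure-Python function below follows what I understand to be scikit-learn's default weighting: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The function name `tfidf`, the three toy documents, and the whitespace tokenizer are assumptions for illustration. `TfidfVectorizer` uses its own tokenizer, which by default drops single-character tokens, so results on real text can differ.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration
docs = [
    "olahraga lari sehat",
    "earphone nirkabel tws",
    "lari pakai earphone tws",
]

def tfidf(docs):
    """Return (vocabulary, list of {term: weight}) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    # Smoothed idf, so terms present in every document still get weight 1
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows

vocab, rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

In the first document, `olahraga` and `sehat` appear in no other document and so outweigh `lari`, which also appears in the third document.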

In [67]:
import pandas as pd

data = pd.read_csv("/content/drive/My Drive/ppw/report/tugas-ppw/hasil_prepros.csv", sep=",")

In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer with default settings (no max_features or stop_words)
vectorizer = TfidfVectorizer()

# Compute TF-IDF on the preprocessed 'stopword_removal' column of the dataframe
tfidf_matrix = vectorizer.fit_transform(df['stopword_removal'])

In [81]:
# Convert the result to a DataFrame with one column per vocabulary term
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
tfidf_df.head(10)

Unnamed: 0,aa,abadi,abdillah,abdul,abidin,absen,ac,academic,academy,acara,...,zcs,zebaidah,zenno,zennoveren,zhe,zi,zika,zokor,zumba,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.053966,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.064541,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.130855,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
