# Text Data Preprocessing

In [23]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Text preprocessing is an essential stage of text analysis and natural language processing (NLP). Its goal is to clean, transform, and organize text data so that it is better suited to analysis or to use in machine learning models. Common stages of text preprocessing include:

1. **Text Cleaning:**
   - Removing special characters: strip punctuation, symbols, and other characters that are not relevant to the analysis.

2. **Tokenization:**
   - Split the text into words or smaller tokens. Tokenization usually splits on whitespace, but it can also split on punctuation.

3. **Stopword Removal:**
   - Remove common words that carry little information, such as "dan" (and), "atau" (or), "di" (in/at), and "sebuah" (a/an).

4. **Stemming or Lemmatization:**
   - Reduce words to their base form. Stemming is the cruder approach and cuts off word endings, whereas lemmatization is more sophisticated and returns each word to its base verb or noun form.

5. **Vectorization:**
   - Convert each word or token into a numeric representation that machine learning models can use. Common approaches include TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings (such as Word2Vec or GloVe).

6. **Text Categorization:**
   - Categorize or group texts into particular classes or categories. This is often used in text classification tasks.

7. **Data Splitting:**
   - Split the dataset into training, validation, and test sets, if needed.

8. **Testing and Evaluation:**
   - Test and evaluate the preprocessing results to make sure the text data is ready for analysis or modeling.

Text preprocessing is a key step in natural language processing and can have a large effect on the performance of the resulting models and analyses. That is why it is important to understand the individual stages and apply them carefully, in line with the goals of your analysis.

## Install & Import Library

In [3]:
!pip install Sastrawi

Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


This cell imports the libraries, modules, and dependencies used in the text analysis, such as NLTK, Scikit-Learn, Pandas, and Sastrawi.


In [4]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

import warnings
import pandas as pd
import numpy as np
import nltk
import re
import csv

nltk.download('stopwords')
nltk.download('punkt')
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Load Dataset

This step reads the data from a CSV file with Pandas. The data is loaded into a DataFrame named `df`, which serves as the basis for the text analysis that follows.

In [5]:
df = pd.read_csv('/content/drive/MyDrive/PencarianPenambanganWeb/tugas/dataPTATrunojoyo.csv')
df

Unnamed: 0.1,Unnamed: 0,Judul,Nama Penulis,Pembimbing I,Pembimbing II,Abstrak
0,0,Pengembangan Game Edukasi 2 Dimensi Untuk Mate...,Nurrohmat Hidayatullah Akbar,"Arik Kurniawati, S.Kom. M.T.","Puji Rahayu Ningsih, S.Pd., M.Pd.",Materi struktur dasar algoritma pemrograman me...
1,1,Pengembangan Media Pembelajaran Sistem Bilanga...,Cholilah,"Arik Kurniawati, S.kom., MT.","Wanda Ramansyah, S.Pd., M.Pd","Pada Mata Pelajaran Sistem Komputer, siswa har..."
2,2,PENGARUH MEDIA PEMBELAJARAN E-LEARNING BERBASI...,TAUFIKUR RAHMAN,"MEDIKA RISNASARI, S.ST.,M.T.","MUCHAMAD ARIF, S.PD.,M.PD.",Penelitian ini bertujuan untuk mengetahui peng...
3,3,PROFIL BERPIKIR KRITIS SISWA KELAS X TKJ DITIN...,Yuliana Wardani,"Puji Rahayu Ningsih,S.Pd.,M.Pd","Sigit Dwi Saputo, S.Pd.,M.Pd",Abstrak\nPenelitian ini bertujuan untuk menget...
4,4,PENGEMBANGAN GAME EDUKASI 3D STRUKTUR ALGORITM...,Deny Prasetyo,"Arik Kurniawati, S. Kom., M. T.","Sigit Dwi Saputro, S.Pd., M. Pd.",Mata pelajaran pemrograman dasar merupakan sal...
...,...,...,...,...,...,...
340,340,PENGEMBANGAN APLIKASI STUDENT ASSISTANT BERBAS...,Arbi Wahyu Eko Jati,"Medika Risnasari, S.ST., M.T.","Muhamad Afif Effindi, S.Kom., M.T.",Kegiatan belajar mengajar adalah suatu rutinin...
341,341,ANALISIS PEMECAHAN MASALAH KONVERSI BILANGAN B...,Magfiroh Kharim Paref,"Muchamad Arif, S.Pd., M.Pd.","Nuru Aini, S.Kom., M.Kom.",Penelitian ini bertujuan untuk menganalisis ke...
342,342,PENGEMBANGAN PERANGKAT MODEL PEMBELAJARAN KOOP...,SITI SAMIYAH,"Puji Rahayu Ningsih, S.Pd., M.Pd","Laili Cahyani, S.Kom., M.Kom",ABSTRAK\nIlmu pengetahuan dan teknologi berkem...
343,343,Pengembangan Perangkat Model Pembelajaran Visu...,Mukhamad Dani Setyawan,"Ariesta Kartika Sari, S.Si., M.Pd","Nuru Aini, S.Kom., M.Kom",Abstrak \nMasalah dalam penelitian ini antara ...
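
The `Unnamed: 0` and `Unnamed: 0.1` columns in the output are how pandas labels CSV columns whose header cell is empty, which commonly happens when a DataFrame is saved together with its index. The sketch below uses a small made-up CSV (the data and the names `csv_text`, `raw`, and `clean` are illustrative, not taken from the dataset) to show one way such columns could be dropped after loading.

```python
import io

import pandas as pd

# Hypothetical CSV whose first header cell is empty, as happens when an index is saved
csv_text = ",Judul,Abstrak\n0,Judul A,Teks A\n1,Judul B,Teks B\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Keep only the columns whose name does not start with "Unnamed"
clean = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]
print(list(clean.columns))
```

Whether to drop these columns in the real dataset is a judgment call; the rest of the notebook does not use them.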


## 1. Cleaning Data

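### Removing Null Data

This code checks for and handles missing values (NaN) in the DataFrame `df`. `df.isnull().sum()` counts the missing values per column, and `df.dropna()` removes any row that contains one. In this dataset every column reports zero missing values, so `dropna()` leaves the data unchanged.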


In [6]:
df.isnull().sum()

Unnamed: 0       0
Judul            0
Nama Penulis     0
Pembimbing I     0
Pembimbing II    0
Abstrak          0
dtype: int64

In [7]:
df = df.dropna()
df.isnull().sum()

Unnamed: 0       0
Judul            0
Nama Penulis     0
Pembimbing I     0
Pembimbing II    0
Abstrak          0
dtype: int64
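
As a small illustration of what these two calls do when values really are missing, the sketch below runs them on a made-up three-row frame (the frame `toy` and its contents are hypothetical). It also restricts `dropna` to the 'Abstrak' column and resets the index, which is an optional variation and not what the notebook does.

```python
import pandas as pd

# Hypothetical frame with one missing abstract and one missing supervisor
toy = pd.DataFrame({
    "Judul": ["A", "B", "C"],
    "Abstrak": ["teks satu", None, "teks tiga"],
    "Pembimbing II": [None, "X", "Y"],
})

# Count missing values per column
print(toy.isnull().sum())

# Drop only the rows whose abstract is missing, then renumber the rows
kept = toy.dropna(subset=["Abstrak"]).reset_index(drop=True)
print(kept)
```

Resetting the index matters later on: the vectorization functions insert the document column into a new DataFrame, and pandas aligns that insertion by index label.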

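### Removing Specific Characters

The `cleaning` function cleans the text in the 'Abstrak' column. It removes every character that is not a letter or whitespace, which covers punctuation and symbols as well as digits, and strips leading and trailing whitespace. The result is still a single string; splitting it into words happens in the tokenizing step.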

In [8]:
def cleaning(text):
  text = re.sub(r'[^a-zA-Z\s]', '', text).strip()
  return text

df['Cleaning'] = df['Abstrak'].apply(cleaning)
df['Cleaning']

0      Materi struktur dasar algoritma pemrograman me...
1      Pada Mata Pelajaran Sistem Komputer siswa haru...
2      Penelitian ini bertujuan untuk mengetahui peng...
3      Abstrak\nPenelitian ini bertujuan untuk menget...
4      Mata pelajaran pemrograman dasar merupakan sal...
                             ...                        
340    Kegiatan belajar mengajar adalah suatu rutinin...
341    Penelitian ini bertujuan untuk menganalisis ke...
342    ABSTRAK\nIlmu pengetahuan dan teknologi berkem...
343    Abstrak \nMasalah dalam penelitian ini antara ...
344    Penggunaan model pembelajaran yang tepat untuk...
Name: Cleaning, Length: 345, dtype: object
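
To make the effect of the regular expression concrete, the sketch below applies the same pattern to a made-up sentence (the `sample` string is hypothetical). Note that digits disappear along with punctuation, and that removed characters can leave double spaces behind.

```python
import re


def cleaning(text):
    # Same pattern as the notebook: keep only ASCII letters and whitespace
    return re.sub(r'[^a-zA-Z\s]', '', text).strip()


sample = "Game 2D, e-learning (TKJ): 85% valid!"
cleaned_sample = cleaning(sample)
print(repr(cleaned_sample))
```

The extra whitespace is harmless here because the tokenizing step splits on any run of whitespace.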

The `cek_specialCharacter` function checks the cleaned text for special characters. If a document still contains one, the document is printed. Nothing is printed here and every row returns `None`, which means no special characters remain.

In [9]:
def cek_specialCharacter(dokumen):
  karakter = ['!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '-', '_', '+', '=', '{', '}', '[', ']', '|', '\\', ':', ';', '"', "'", '<', '>', ',', '.', '?', '/', '`', '~']
  for i in dokumen:
    if i in karakter :
      print(dokumen)
df['Cleaning'].apply(cek_specialCharacter)

0      None
1      None
2      None
3      None
4      None
       ... 
340    None
341    None
342    None
343    None
344    None
Name: Cleaning, Length: 345, dtype: object
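
A column of `None` values is hard to read at a glance. As an alternative sketch (not the notebook's method), the same check can be expressed with the pandas string method `str.contains`, which returns one boolean per document. The series `cleaned` below is made-up data.

```python
import pandas as pd

# Hypothetical cleaned documents; the second one still contains a colon
cleaned = pd.Series(["materi struktur dasar", "hasil uji: valid"])

# True where a document contains anything other than letters and whitespace
has_special = cleaned.str.contains(r"[^a-zA-Z\s]")
print(has_special)
print("Documents with special characters:", int(has_special.sum()))
```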

## 2. Tokenizing

The `tokenizer` function performs case folding and tokenization on the cleaned text: it lowercases the text and then splits it into word tokens.

In [10]:
def tokenizer(text):
  text = text.lower()
  return word_tokenize(text)

df['Tokenizing'] = df['Cleaning'].apply(tokenizer)
df['Tokenizing']

0      [materi, struktur, dasar, algoritma, pemrogram...
1      [pada, mata, pelajaran, sistem, komputer, sisw...
2      [penelitian, ini, bertujuan, untuk, mengetahui...
3      [abstrak, penelitian, ini, bertujuan, untuk, m...
4      [mata, pelajaran, pemrograman, dasar, merupaka...
                             ...                        
340    [kegiatan, belajar, mengajar, adalah, suatu, r...
341    [penelitian, ini, bertujuan, untuk, menganalis...
342    [abstrak, ilmu, pengetahuan, dan, teknologi, b...
343    [abstrak, masalah, dalam, penelitian, ini, ant...
344    [penggunaan, model, pembelajaran, yang, tepat,...
Name: Tokenizing, Length: 345, dtype: object
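
Because the cleaned text contains only letters and whitespace, the two steps can be illustrated without NLTK. The sketch below uses plain `str.split` as a stand-in for `word_tokenize`; the function name `simple_tokenizer` and the sample string are hypothetical, and `word_tokenize` may split a few words (such as English contractions) differently.

```python
def simple_tokenizer(text):
    # Case folding first, then split on any run of whitespace (spaces, newlines, tabs)
    return text.lower().split()


sample = "Abstrak\nPenelitian ini  bertujuan"
tokens = simple_tokenizer(sample)
print(tokens)
print(len(tokens))
```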

Count the number of words in each abstract:

In [11]:
def count_word(dokumens):
  return len(dokumens)

df['Count Word'] = df['Tokenizing'].apply(count_word)
df

Unnamed: 0.1,Unnamed: 0,Judul,Nama Penulis,Pembimbing I,Pembimbing II,Abstrak,Cleaning,Tokenizing,Count Word
0,0,Pengembangan Game Edukasi 2 Dimensi Untuk Mate...,Nurrohmat Hidayatullah Akbar,"Arik Kurniawati, S.Kom. M.T.","Puji Rahayu Ningsih, S.Pd., M.Pd.",Materi struktur dasar algoritma pemrograman me...,Materi struktur dasar algoritma pemrograman me...,"[materi, struktur, dasar, algoritma, pemrogram...",199
1,1,Pengembangan Media Pembelajaran Sistem Bilanga...,Cholilah,"Arik Kurniawati, S.kom., MT.","Wanda Ramansyah, S.Pd., M.Pd","Pada Mata Pelajaran Sistem Komputer, siswa har...",Pada Mata Pelajaran Sistem Komputer siswa haru...,"[pada, mata, pelajaran, sistem, komputer, sisw...",321
2,2,PENGARUH MEDIA PEMBELAJARAN E-LEARNING BERBASI...,TAUFIKUR RAHMAN,"MEDIKA RISNASARI, S.ST.,M.T.","MUCHAMAD ARIF, S.PD.,M.PD.",Penelitian ini bertujuan untuk mengetahui peng...,Penelitian ini bertujuan untuk mengetahui peng...,"[penelitian, ini, bertujuan, untuk, mengetahui...",158
3,3,PROFIL BERPIKIR KRITIS SISWA KELAS X TKJ DITIN...,Yuliana Wardani,"Puji Rahayu Ningsih,S.Pd.,M.Pd","Sigit Dwi Saputo, S.Pd.,M.Pd",Abstrak\nPenelitian ini bertujuan untuk menget...,Abstrak\nPenelitian ini bertujuan untuk menget...,"[abstrak, penelitian, ini, bertujuan, untuk, m...",153
4,4,PENGEMBANGAN GAME EDUKASI 3D STRUKTUR ALGORITM...,Deny Prasetyo,"Arik Kurniawati, S. Kom., M. T.","Sigit Dwi Saputro, S.Pd., M. Pd.",Mata pelajaran pemrograman dasar merupakan sal...,Mata pelajaran pemrograman dasar merupakan sal...,"[mata, pelajaran, pemrograman, dasar, merupaka...",202
...,...,...,...,...,...,...,...,...,...
340,340,PENGEMBANGAN APLIKASI STUDENT ASSISTANT BERBAS...,Arbi Wahyu Eko Jati,"Medika Risnasari, S.ST., M.T.","Muhamad Afif Effindi, S.Kom., M.T.",Kegiatan belajar mengajar adalah suatu rutinin...,Kegiatan belajar mengajar adalah suatu rutinin...,"[kegiatan, belajar, mengajar, adalah, suatu, r...",169
341,341,ANALISIS PEMECAHAN MASALAH KONVERSI BILANGAN B...,Magfiroh Kharim Paref,"Muchamad Arif, S.Pd., M.Pd.","Nuru Aini, S.Kom., M.Kom.",Penelitian ini bertujuan untuk menganalisis ke...,Penelitian ini bertujuan untuk menganalisis ke...,"[penelitian, ini, bertujuan, untuk, menganalis...",160
342,342,PENGEMBANGAN PERANGKAT MODEL PEMBELAJARAN KOOP...,SITI SAMIYAH,"Puji Rahayu Ningsih, S.Pd., M.Pd","Laili Cahyani, S.Kom., M.Kom",ABSTRAK\nIlmu pengetahuan dan teknologi berkem...,ABSTRAK\nIlmu pengetahuan dan teknologi berkem...,"[abstrak, ilmu, pengetahuan, dan, teknologi, b...",226
343,343,Pengembangan Perangkat Model Pembelajaran Visu...,Mukhamad Dani Setyawan,"Ariesta Kartika Sari, S.Si., M.Pd","Nuru Aini, S.Kom., M.Kom",Abstrak \nMasalah dalam penelitian ini antara ...,Abstrak \nMasalah dalam penelitian ini antara ...,"[abstrak, masalah, dalam, penelitian, ini, ant...",260


## 3. Stopword

Stopwords are common words that usually add little value to text analysis. The `stopwordText` function removes stopwords from the tokens produced in the previous step, using NLTK's Indonesian stopword list.

The remaining tokens are then joined back into a single string and stored in the 'Full Text' column.

In [12]:
# Use a set so that each membership test is fast
corpus = set(stopwords.words('indonesian'))

def stopwordText(words):
  return [word for word in words if word not in corpus]

df['Stopword Removal'] = df['Tokenizing'].apply(stopwordText)

# Join the tokens back into a single string
df['Full Text'] = df['Stopword Removal'].apply(lambda x: ' '.join(x))
df['Full Text']

0      materi struktur dasar algoritma pemrograman ma...
1      mata pelajaran sistem komputer siswa memahami ...
2      penelitian bertujuan pengaruh media pembelajar...
3      abstrak penelitian bertujuan profil berpikir k...
4      mata pelajaran pemrograman dasar salah mata pe...
                             ...                        
340    kegiatan belajar mengajar rutinintas mahasiswa...
341    penelitian bertujuan menganalisis kemampuan si...
342    abstrak ilmu pengetahuan teknologi berkembang ...
343    abstrak penelitian rendahnya hasil belajar sis...
344    penggunaan model pembelajaran mata pelajaran a...
Name: Full Text, Length: 345, dtype: object
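
The filtering and re-joining can be tried in isolation with a tiny stopword set. In the sketch below, `STOPWORDS` is a hypothetical five-word subset chosen for illustration and is not NLTK's Indonesian list, so the real output above removes more words than this example does.

```python
# Hypothetical subset of Indonesian stopwords, for illustration only
STOPWORDS = frozenset({"ini", "untuk", "dan", "yang", "pada"})


def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Keep only the tokens that are not in the stopword set
    return [token for token in tokens if token not in stopwords]


tokens = ["penelitian", "ini", "bertujuan", "untuk", "mengetahui"]
filtered = remove_stopwords(tokens)
full_text = " ".join(filtered)
print(filtered)
print(full_text)
```

Storing the stopwords in a set rather than a list keeps each lookup constant-time, which adds up when every token of every abstract is checked.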

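## 4. Stemming

The `stemmingText` function reduces each token in the 'Stopword Removal' column to its root word using the Sastrawi stemmer. The cell is shown without an execution count or output, and the vectorization steps below use the 'Stopword Removal' and 'Full Text' columns rather than 'Stemming'.
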
```
# Build the Sastrawi stemmer once and reuse it for every document
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemmingText(dokumens):
  # Reduce each token to its root word
  return [stemmer.stem(i) for i in dokumens]

df['Stemming'] = df['Stopword Removal'].apply(stemmingText)
df['Stemming']
```
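
Stemming a whole corpus token by token can take a while, and the same words recur across abstracts. One possible optimization, sketched below under the assumption that the stemmer is a pure function of the token, is to stem each distinct token only once and reuse the result. The helper `stem_documents` and the stand-in `toy_stem` are hypothetical; `toy_stem` only strips a trailing "nya" and does not reproduce Sastrawi's behavior.

```python
def stem_documents(documents, stem):
    # Cache the stem of every distinct token so each one is computed only once
    cache = {}
    stemmed = []
    for tokens in documents:
        row = []
        for token in tokens:
            if token not in cache:
                cache[token] = stem(token)
            row.append(cache[token])
        stemmed.append(row)
    return stemmed


calls = []


def toy_stem(word):
    # Stand-in stemmer for illustration: records the call, strips a trailing "nya"
    calls.append(word)
    return word[:-3] if word.endswith("nya") else word


docs = [["hasilnya", "belajar"], ["belajar", "hasilnya", "siswa"]]
result = stem_documents(docs, toy_stem)
print(result)
print("stem calls:", len(calls))
```

With the notebook's objects, the call would presumably look like `stem_documents(df['Stopword Removal'], stemmer.stem)`.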



# VSM (Vector Space Model)

## 1. One Hot Encoding

### One-Hot Encoder Function Using Pandas

The `pandasOneHotEncoder` function applies one-hot encoding to the tokens left after cleaning and stopword removal. The result is a DataFrame with one column per word that shows which words occur in each document. Because the indicator columns are summed per document, a word that occurs several times in a document gets its count rather than 1.

In [13]:
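def pandasOneHotEncoder(dokumens):
  # One indicator column per token, summed per document (index level 0).
  # groupby(level=0).sum() replaces sum(level=0), which pandas 2.0 removed.
  encoder = pd.get_dummies(dokumens.apply(pd.Series).stack()).groupby(level=0).sum()
  result = pd.concat([dokumens, encoder], axis=1)

  return result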

oneHotEncoder = pandasOneHotEncoder(df['Stopword Removal'])
oneHotEncoder

Unnamed: 0,Stopword Removal,a,absensi,abstra,abstrak,acak,acccess,accelerated,acceptance,acception,...,yoga,yslow,yudhistira,yx,z,zaman,zhitung,zona,ztabel,zulfatun
0,"[materi, struktur, dasar, algoritma, pemrogram...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"[mata, pelajaran, sistem, komputer, siswa, mem...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"[penelitian, bertujuan, pengaruh, media, pembe...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"[abstrak, penelitian, bertujuan, profil, berpi...",0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"[mata, pelajaran, pemrograman, dasar, salah, m...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
340,"[kegiatan, belajar, mengajar, rutinintas, maha...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
341,"[penelitian, bertujuan, menganalisis, kemampua...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
342,"[abstrak, ilmu, pengetahuan, teknologi, berkem...",0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
343,"[abstrak, penelitian, rendahnya, hasil, belaja...",0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
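
The difference between a per-document count and a strict presence/absence encoding is easy to see on toy data. The sketch below uses two made-up token lists (`docs` is hypothetical) and `Series.explode` instead of the notebook's `apply(pd.Series).stack()`; aggregating with `sum` gives counts, while aggregating with `max` gives a 0/1 matrix.

```python
import pandas as pd

# Hypothetical tokenized documents; "mata" appears twice in the first one
docs = pd.Series([["mata", "pelajaran", "mata"], ["model", "pelajaran"]])

# One row per token, keeping the document's index label
dummies = pd.get_dummies(docs.explode(), dtype=int)

# Summing per document counts repeated words
counts = dummies.groupby(level=0).sum()

# Taking the maximum per document gives strict presence/absence
binary = dummies.groupby(level=0).max()

print(counts)
print(binary)
```

If a true one-hot (binary) matrix is what the analysis needs, the `max` variant is the safer choice.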


### Save into CSV

In [14]:
oneHotEncoder.to_csv('OneHotEncoder.csv', index=False)

## 2. TF-IDF

### TF-IDF Function

The `tfidf` function applies TF-IDF vectorization to the text left after cleaning and stopword removal. The result is a numeric representation of each document under the TF-IDF weighting scheme.

In [15]:
def tfidf(dokumen):
  vectorizer = TfidfVectorizer()
  x = vectorizer.fit_transform(dokumen).toarray()
  terms = vectorizer.get_feature_names_out()

  final_tfidf = pd.DataFrame(x, columns=terms)
  final_tfidf.insert(0, 'Dokumen', dokumen)

  return (vectorizer, final_tfidf)

tfidf_vectorizer, final_tfidf = tfidf(df['Full Text'])
final_tfidf

Unnamed: 0,Dokumen,absensi,abstra,abstrak,acak,acccess,accelerated,acceptance,acception,access,...,yangrendah,yoga,yslow,yudhistira,yx,zaman,zhitung,zona,ztabel,zulfatun
0,materi struktur dasar algoritma pemrograman ma...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
1,mata pelajaran sistem komputer siswa memahami ...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
2,penelitian bertujuan pengaruh media pembelajar...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
3,abstrak penelitian bertujuan profil berpikir k...,0.0,0.0,0.048129,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
4,mata pelajaran pemrograman dasar salah mata pe...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
340,kegiatan belajar mengajar rutinintas mahasiswa...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
341,penelitian bertujuan menganalisis kemampuan si...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
342,abstrak ilmu pengetahuan teknologi berkembang ...,0.0,0.0,0.038226,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
343,abstrak penelitian rendahnya hasil belajar sis...,0.0,0.0,0.038962,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.103132,0.0,0.0
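
To show where these weights come from, here is a dependency-free sketch of the calculation that `TfidfVectorizer` performs with its default settings: raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The two documents are made up, and plain whitespace splitting stands in for the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Hypothetical two-document corpus
docs = ["mata pelajaran mata", "model pelajaran"]
tokenized = [doc.split() for doc in docs]

vocab = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed IDF, matching scikit-learn's default smooth_idf=True
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}


def tfidf_row(tokens):
    # Raw count times IDF, then scale the row to unit Euclidean length
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]


rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, [round(value, 4) for value in row])
```

The vectorizer's default token pattern also keeps only tokens of two or more characters, which is why single-letter columns such as `a` and `z` appear in the one-hot table but not in the TF-IDF table.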


### Save into CSV

In [16]:
final_tfidf.to_csv('TF IDF.csv', index=False)

## 3. Term Frequency

### Term Frequency Function

The `term_freq` function applies Term Frequency (TF) vectorization to the text left after cleaning and stopword removal. The result is a numeric representation of each document in which every value is the raw count of a term in that document.

In [17]:
def term_freq(dokumens):
  # Create the CountVectorizer object
  vectorizer = CountVectorizer()
  tf_matrix = vectorizer.fit_transform(dokumens).toarray()
  terms = vectorizer.get_feature_names_out()

  final_tf = pd.DataFrame(tf_matrix, columns=terms)
  final_tf.insert(0, 'Dokumen', dokumens)

  return (vectorizer, final_tf, tf_matrix, terms)

tf_vectorizer, final_tf, tf_matrix, tf_terms = term_freq(df['Full Text'])
final_tf

Unnamed: 0,Dokumen,absensi,abstra,abstrak,acak,acccess,accelerated,acceptance,acception,access,...,yangrendah,yoga,yslow,yudhistira,yx,zaman,zhitung,zona,ztabel,zulfatun
0,materi struktur dasar algoritma pemrograman ma...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,mata pelajaran sistem komputer siswa memahami ...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,penelitian bertujuan pengaruh media pembelajar...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,abstrak penelitian bertujuan profil berpikir k...,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,mata pelajaran pemrograman dasar salah mata pe...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
340,kegiatan belajar mengajar rutinintas mahasiswa...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
341,penelitian bertujuan menganalisis kemampuan si...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
342,abstrak ilmu pengetahuan teknologi berkembang ...,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
343,abstrak penelitian rendahnya hasil belajar sis...,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
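
The term-frequency matrix is simply a table of raw counts, which the following sketch reproduces with `collections.Counter` on two made-up documents. Whitespace splitting is used here as a simplification of the tokenization that `CountVectorizer` applies.

```python
from collections import Counter

# Hypothetical two-document corpus
docs = ["mata pelajaran mata", "model pelajaran"]
tokenized = [doc.split() for doc in docs]

# Sorted vocabulary, one column per term
vocab = sorted({token for tokens in tokenized for token in tokens})

tf_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    # A Counter returns 0 for terms that do not occur in the document
    tf_rows.append([counts[term] for term in vocab])

print(vocab)
print(tf_rows)
```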


### Save into CSV

In [22]:
final_tf.to_csv('Term Frequensi.csv', index=False)


## 4. Logarithmic Frequency

The `logarithm_freq` function applies a logarithmic transformation, `log10(tf + 1)`, to the Term Frequency data. This dampens the influence of terms that occur many times in a document, so that very frequent words dominate the analysis less.

### Logarithmic Frequency Function

In [19]:
def logarithm_freq(dokumens):
  return np.log10(dokumens + 1)

df_logarithm_freq = pd.DataFrame(tf_matrix, columns=tf_terms).apply(logarithm_freq)
df_logarithm_freq.insert(0, 'Dokumen', df['Full Text'])
df_logarithm_freq

Unnamed: 0,Dokumen,absensi,abstra,abstrak,acak,acccess,accelerated,acceptance,acception,access,...,yangrendah,yoga,yslow,yudhistira,yx,zaman,zhitung,zona,ztabel,zulfatun
0,materi struktur dasar algoritma pemrograman ma...,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0
1,mata pelajaran sistem komputer siswa memahami ...,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0
2,penelitian bertujuan pengaruh media pembelajar...,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0
3,abstrak penelitian bertujuan profil berpikir k...,0.0,0.0,0.30103,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0
4,mata pelajaran pemrograman dasar salah mata pe...,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
340,kegiatan belajar mengajar rutinintas mahasiswa...,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0
341,penelitian bertujuan menganalisis kemampuan si...,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0
342,abstrak ilmu pengetahuan teknologi berkembang ...,0.0,0.0,0.30103,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0
343,abstrak penelitian rendahnya hasil belajar sis...,0.0,0.0,0.30103,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.30103,0.0,0.0
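
A few worked values make the dampening visible. The sketch below applies the same `log10(count + 1)` formula to a handful of counts using only the standard library (the helper name `log_freq` is hypothetical). A count of 1 maps to about 0.30103, which is the value shown in the table above, and a count of 0 stays 0.

```python
import math


def log_freq(count):
    # Same transformation as the notebook: log10(tf + 1)
    return math.log10(count + 1)


for count in (0, 1, 2, 9, 99):
    print(count, round(log_freq(count), 5))
```

Going from 9 to 99 occurrences only doubles the weight, so a term that is repeated very often no longer swamps the rest of the vector.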


### Save into CSV

In [20]:
df_logarithm_freq.to_csv('Logarithm Frequensi.csv', index=False)