
# Topic Modelling LDA




Notebook ini bertujuan untuk melakukan eksplorasi topik pada tugas akhir yang terdapat dalam portal tugas akhir UTM (Trunojoyo University) menggunakan model Latent Dirichlet Allocation (LDA). LDA adalah algoritma dalam pemodelan tema yang dapat membantu mengidentifikasi topik-topik utama yang mungkin ada dalam kumpulan tugas akhir mahasiswa. Dengan menggunakan teknik ini, kita dapat mengeksplorasi struktur topik yang mendasari tanpa harus membaca setiap tugas akhir secara manual

# LDA

In [1]:
import numpy as np
import pandas as pd

## Import Data

In [2]:
df=pd.read_csv('https://raw.githubusercontent.com/tiarh/ppw/main/PTA.csv')
display(df)
df.isnull().sum()

Unnamed: 0.1,Unnamed: 0,Judul,Penulis,pembimbing I,pembimbing II,Abstrak
0,0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE \...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,Sistem informasi akademik (SIAKAD) merupaka...
1,1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",Berjalannya koneksi jaringan komputer dengan l...
2,2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK\nEN...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",Web server adalah sebuah perangkat lunak serve...
3,3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",Penjadwalan kuliah di Perguruan Tinggi me...
4,4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",Seiring perkembangan teknologi yang ada diduni...
...,...,...,...,...,...,...
853,853,PENERAPAN ALGORITMA LONG-SHORT TERM MEMORY UNT...,Rachmad Agung Pambudi,"Eka Mala Sari Rochman, S.Kom., M.Kom","Sri Herawati, S.Kom., M.Kom",Investasi saham selama ini memiliki resiko ker...
854,854,SISTEM PENCARIAN TEKS AL-QURAN TERJEMAHAN BERB...,Nadila Hidayanti,"Achmad Jauhari, S.T., M.Kom","Ika Oktavia Suzanti, S.Kom., M.Cs",Information Retrieval (IR) merupakan pengambil...
855,855,KLASIFIKASI KOMPLEKSITAS VISUAL CITRA SAMPAH M...,Afni Sakinah,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Moch. Kautsar Sophan, S.Kom., M.MT.",Klasifikasi citra merupakan proses pengelompok...
856,856,IDENTIFIKASI BINER ATRIBUT PEJALAN KAKI MENGGU...,Friska Fatmawatiningrum,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Prof. Dr. Arief Muntasa, S.Si., M.MT.",Identifikasi atribut pejalan kaki merupakan sa...


Unnamed: 0        0
Judul             6
Penulis          10
pembimbing I     10
pembimbing II    11
Abstrak          30
dtype: int64

## Preprocessing Data

In [3]:
!pip install indoNLP

Collecting indoNLP
  Downloading indoNLP-0.3.4-py3-none-any.whl (121 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/121.9 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m92.2/121.9 kB[0m [31m2.8 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m121.9/121.9 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: indoNLP
Successfully installed indoNLP-0.3.4


### Cleaning Data

In [4]:
import re, string

# Data Clean
def clean(data):
    # HTML Tag Removal
    data = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});').sub('', str(data))

    # Case folding
    data = data.lower()

    # Trim data
    data = data.strip()

    # Remove punctuations, karakter spesial, and spasi ganda
    data = re.compile('<.*?>').sub('', data)
    data = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', data)
    data = re.sub('\s+', ' ', data)

    # Number removal
    data = re.sub(r'\[[0-9]*\]', ' ', data)
    data = re.sub(r'[^\w\s]', '', str(data).lower().strip())
    data = re.sub(r'\d', ' ', data)
    data = re.sub(r'\s+', ' ', data)

    # Mengubah data 'nan' dengan whitespace agar nantinya dapat dihapus
    data = re.sub('nan', '', data)

    return data

In [5]:
df['Abstrak'] = df['Abstrak'].apply(lambda x: clean(x))

df.head()

Unnamed: 0.1,Unnamed: 0,Judul,Penulis,pembimbing I,pembimbing II,Abstrak
0,0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE \...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,sistem informasi akademik siakad merupakan sis...
1,1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",berjalannya koneksi jaringan komputer dengan l...
2,2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK\nEN...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",web server adalah sebuah perangkat lunak serve...
3,3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",penjadwalan kuliah di perguruan tinggi merupak...
4,4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",seiring perkembangan teknologi yang ada diduni...


In [6]:
# Ubah empty string menjadi NaN value
df = df.replace('', np.nan)


In [7]:
# Remove missing values
df.dropna(inplace=True)
len(df)

821

### Cek Data Duplikasi Abstrak

In [8]:
# Remove duplicates abstrak
df.drop_duplicates(subset=['Abstrak'], inplace=True)

# Cek apakah masih terdapat duplikasi data pada kolom abstrak
df[df['Abstrak'].duplicated()]

Unnamed: 0.1,Unnamed: 0,Judul,Penulis,pembimbing I,pembimbing II,Abstrak


### Tokenisasi

In [9]:
import nltk
from nltk.tokenize import word_tokenize
# nltk.download('punkt')
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

True

In [10]:
# Tokenizing Abstrak
df['abstrak_tokens'] = df['Abstrak'].apply(lambda x: word_tokenize(x))
df[["Abstrak", "abstrak_tokens"]].head()

Unnamed: 0,Abstrak,abstrak_tokens
0,sistem informasi akademik siakad merupakan sis...,"[sistem, informasi, akademik, siakad, merupaka..."
1,berjalannya koneksi jaringan komputer dengan l...,"[berjalannya, koneksi, jaringan, komputer, den..."
2,web server adalah sebuah perangkat lunak serve...,"[web, server, adalah, sebuah, perangkat, lunak..."
3,penjadwalan kuliah di perguruan tinggi merupak...,"[penjadwalan, kuliah, di, perguruan, tinggi, m..."
4,seiring perkembangan teknologi yang ada diduni...,"[seiring, perkembangan, teknologi, yang, ada, ..."


### Menghapus Kata Tidak Penting

In [11]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
from nltk.corpus import stopwords
from itertools import chain

stop_words = set(chain(stopwords.words('indonesian')))

df['abstrak_tokens'] = df['abstrak_tokens'].apply(lambda x: [w for w in x if not w in stop_words])

In [13]:
df[["Abstrak", "abstrak_tokens"]].head()

Unnamed: 0,Abstrak,abstrak_tokens
0,sistem informasi akademik siakad merupakan sis...,"[sistem, informasi, akademik, siakad, sistem, ..."
1,berjalannya koneksi jaringan komputer dengan l...,"[berjalannya, koneksi, jaringan, komputer, lan..."
2,web server adalah sebuah perangkat lunak serve...,"[web, server, perangkat, lunak, server, berfun..."
3,penjadwalan kuliah di perguruan tinggi merupak...,"[penjadwalan, kuliah, perguruan, kompleks, per..."
4,seiring perkembangan teknologi yang ada diduni...,"[seiring, perkembangan, teknologi, didunia, mu..."


In [14]:
print(df.isna().sum())

Unnamed: 0        0
Judul             0
Penulis           0
pembimbing I      0
pembimbing II     0
Abstrak           0
abstrak_tokens    0
dtype: int64


In [15]:
df.dropna(inplace=True)

In [16]:
df= df.drop(columns=['Unnamed: 0'])
display(df)
df.isnull().sum()

Unnamed: 0,Judul,Penulis,pembimbing I,pembimbing II,Abstrak,abstrak_tokens
0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE \...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,sistem informasi akademik siakad merupakan sis...,"[sistem, informasi, akademik, siakad, sistem, ..."
1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",berjalannya koneksi jaringan komputer dengan l...,"[berjalannya, koneksi, jaringan, komputer, lan..."
2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK\nEN...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",web server adalah sebuah perangkat lunak serve...,"[web, server, perangkat, lunak, server, berfun..."
3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",penjadwalan kuliah di perguruan tinggi merupak...,"[penjadwalan, kuliah, perguruan, kompleks, per..."
4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",seiring perkembangan teknologi yang ada diduni...,"[seiring, perkembangan, teknologi, didunia, mu..."
...,...,...,...,...,...,...
853,PENERAPAN ALGORITMA LONG-SHORT TERM MEMORY UNT...,Rachmad Agung Pambudi,"Eka Mala Sari Rochman, S.Kom., M.Kom","Sri Herawati, S.Kom., M.Kom",investasi saham selama ini memiliki resiko ker...,"[investasi, saham, memiliki, resiko, kerugian,..."
854,SISTEM PENCARIAN TEKS AL-QURAN TERJEMAHAN BERB...,Nadila Hidayanti,"Achmad Jauhari, S.T., M.Kom","Ika Oktavia Suzanti, S.Kom., M.Cs",information retrieval ir merupakan pengambilan...,"[information, retrieval, ir, pengambilan, info..."
855,KLASIFIKASI KOMPLEKSITAS VISUAL CITRA SAMPAH M...,Afni Sakinah,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Moch. Kautsar Sophan, S.Kom., M.MT.",klasifikasi citra merupakan proses pengelompok...,"[klasifikasi, citra, proses, pengelompokan, pi..."
856,IDENTIFIKASI BINER ATRIBUT PEJALAN KAKI MENGGU...,Friska Fatmawatiningrum,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Prof. Dr. Arief Muntasa, S.Si., M.MT.",identifikasi atribut pejalan kaki merupakan sa...,"[identifikasi, atribut, pejalan, kaki, salah, ..."


Judul             0
Penulis           0
pembimbing I      0
pembimbing II     0
Abstrak           0
abstrak_tokens    0
dtype: int64

In [17]:
df['abstrak_tokens'] = df['abstrak_tokens'].apply(lambda x: ' '.join(x))

df.to_csv('DataOlah_Pta.csv')

In [18]:
dataOlah = pd.read_csv('https://raw.githubusercontent.com/tiarh/ppw/main/DataOlah_Pta.csv', index_col=0)
dataOlah.head()

Unnamed: 0,Judul,Penulis,pembimbing I,pembimbing II,Abstrak,abstrak_tokens
0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE \...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,sistem informasi akademik siakad merupakan sis...,"['sistem', 'informasi', 'akademik', 'siakad', ..."
1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",berjalannya koneksi jaringan komputer dengan l...,"['berjalannya', 'koneksi', 'jaringan', 'komput..."
2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK\nEN...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",web server adalah sebuah perangkat lunak serve...,"['web', 'server', 'perangkat', 'lunak', 'serve..."
3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",penjadwalan kuliah di perguruan tinggi merupak...,"['penjadwalan', 'kuliah', 'perguruan', 'komple..."
4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",seiring perkembangan teknologi yang ada diduni...,"['seiring', 'perkembangan', 'teknologi', 'didu..."


## Membentuk VSM dalam term frequency

In [19]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Inisialisasi CountVectorizer untuk term frequency
vectorizer = CountVectorizer()

# Mengubah teks menjadi matriks term frequency
tf_matrix = vectorizer.fit_transform(dataOlah['abstrak_tokens'])

# Konversi matriks term frequency menjadi DataFrame (opsional)
tf_df = pd.DataFrame(tf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Menampilkan matriks term frequency atau DataFrame jika diinginkan
print(tf_df)

     aalysis  aam  abad  abadi  ability  abjad  absensi  absolut  absolute  \
0          0    0     0      0        0      0        0        0         0   
1          0    0     0      0        0      0        0        0         0   
2          0    0     0      0        0      0        0        0         0   
3          0    0     0      0        0      0        0        0         0   
4          0    0     0      0        0      0        0        0         0   
..       ...  ...   ...    ...      ...    ...      ...      ...       ...   
816        0    0     0      0        0      0        0        0         0   
817        0    0     0      0        0      0        0        0         0   
818        0    0     0      0        0      0        0        0         0   
819        0    0     0      0        0      0        0        0         0   
820        0    0     0      0        0      0        0        0         0   

     abstract  ...  zara  zat  zcz  zf  zona  zone  zoning  zoo

In [20]:
!pip install nltk



In [21]:
import nltk
import string

# Tokenization function
def tokenize(text):
    stem = nltk.stem.SnowballStemmer('indonesian')
    text = text.lower()

    for token in nltk.word_tokenize(text):
        if token in string.punctuation: continue
        yield stem.stem(token)


# The corpus object
corpus = dataOlah['abstrak_tokens']

In [22]:
import pandas as pd
def nltk_frequency_vectorize(corpus):

    # The NLTK frequency vectorize method
    from collections import defaultdict

    def vectorize(doc):
        features = defaultdict(int)

        for token in tokenize(doc):
            features[token] += 1

        return features

    return map(vectorize, corpus)
vectnltk=nltk_frequency_vectorize(dataOlah['abstrak_tokens'])
type(vectnltk)

map

In [23]:
def sklearn_frequency_vectorize(corpus):
    # The Scikit-Learn frequency vectorize method
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(corpus)
vectsklen=sklearn_frequency_vectorize(dataOlah['abstrak_tokens'])
print(vectsklen)

  (0, 7088)	7
  (0, 2767)	3
  (0, 102)	3
  (0, 7019)	4
  (0, 605)	1
  (0, 4077)	1
  (0, 5668)	1
  (0, 5815)	1
  (0, 1248)	5
  (0, 2199)	2
  (0, 1392)	1
  (0, 5377)	1
  (0, 3921)	1
  (0, 4199)	2
  (0, 7987)	1
  (0, 7879)	1
  (0, 7682)	1
  (0, 1249)	3
  (0, 7674)	2
  (0, 3202)	1
  (0, 5891)	1
  (0, 4908)	1
  (0, 790)	1
  (0, 710)	1
  (0, 4392)	1
  :	:
  (820, 6862)	1
  (820, 7459)	1
  (820, 3517)	1
  (820, 4590)	1
  (820, 5119)	1
  (820, 718)	1
  (820, 3305)	1
  (820, 295)	2
  (820, 5115)	1
  (820, 8067)	1
  (820, 7229)	1
  (820, 4507)	1
  (820, 3171)	1
  (820, 6801)	1
  (820, 6169)	1
  (820, 7678)	1
  (820, 8227)	4
  (820, 8228)	1
  (820, 3866)	1
  (820, 5114)	1
  (820, 876)	1
  (820, 2490)	1
  (820, 1338)	1
  (820, 229)	1
  (820, 6263)	2


In [24]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

#coun_vect = CountVectorizer()
coun_vect = CountVectorizer(stop_words=['ke', 'yang', 'dan'])
count_matrix = coun_vect.fit_transform(corpus)
count_array = count_matrix.toarray()
df = pd.DataFrame(data=count_array,columns = coun_vect.vocabulary_.keys())
df

Unnamed: 0,sistem,informasi,akademik,siakad,berfungsi,megani,pengelolaan,penyajian,data,fakultas,...,accelerated,segment,augmentasi,weak,stump,ransel,diseimbangkan,detector,anchor,pretrained
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
816,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
817,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
818,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
819,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# Menyimpan DataFrame ke dalam file CSV
df.to_csv('tf.csv', index=False)

In [26]:
print(len(count_array[0]))

8243


## Model LDA

In [27]:
import pandas as pd
import numpy as np

In [28]:
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation

In [29]:
TF = pd.read_csv("https://raw.githubusercontent.com/tiarh/ppw/main/tf.csv")
TF

Unnamed: 0,sistem,informasi,akademik,siakad,berfungsi,megani,pengelolaan,penyajian,data,fakultas,...,accelerated,segment,augmentasi,weak,stump,ransel,diseimbangkan,detector,anchor,pretrained
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
816,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
817,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
818,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
819,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
lda = LatentDirichletAllocation(n_components=5, doc_topic_prior=0.1, topic_word_prior=0.2,random_state=42,max_iter=1)
lda_top=lda.fit_transform(TF)

In [31]:
print(lda_top.shape)  # (no_of_doc,no_of_topics)
print(lda_top)

(821, 5)
[[9.95208555e-01 1.19784537e-03 1.19792155e-03 1.19784444e-03
  1.19783404e-03]
 [9.39180532e-04 9.39147936e-04 9.96243428e-01 9.39127945e-04
  9.39115639e-04]
 [9.96443641e-01 8.89078457e-04 8.89076492e-04 8.89070794e-04
  8.89132979e-04]
 ...
 [3.88539180e-01 6.39123744e-04 6.39117226e-04 6.09543462e-01
  6.39117002e-04]
 [7.78452260e-04 7.78433038e-04 7.78384675e-04 2.42263948e-01
  7.55400782e-01]
 [9.95529606e-01 1.11762115e-03 1.11757150e-03 1.11761605e-03
  1.11758566e-03]]


In [32]:
U = pd.DataFrame(lda_top, columns=[f'Topik {i + 1}' for i in range(5)])
U

Unnamed: 0,Topik 1,Topik 2,Topik 3,Topik 4,Topik 5
0,0.995209,0.001198,0.001198,0.001198,0.001198
1,0.000939,0.000939,0.996243,0.000939,0.000939
2,0.996444,0.000889,0.000889,0.000889,0.000889
3,0.994159,0.001460,0.001460,0.001460,0.001460
4,0.995321,0.001170,0.001170,0.001170,0.001170
...,...,...,...,...,...
816,0.000823,0.000823,0.000823,0.996707,0.000823
817,0.645044,0.001198,0.351363,0.001198,0.001198
818,0.388539,0.000639,0.000639,0.609543,0.000639
819,0.000778,0.000778,0.000778,0.242264,0.755401


In [33]:
print(lda.components_)
print(lda.components_.shape)  # (no_of_topics*no_of_words)

[[0.20000048 0.76021414 0.20000041 ... 1.34184322 0.62049734 0.55857546]
 [0.37755222 0.33136594 1.11700214 ... 0.34810444 0.7795018  0.8414236 ]
 [0.2000004  1.5084187  0.20000031 ... 0.20000094 0.2000003  0.20000036]
 [1.02244657 0.20000062 1.28299669 ... 1.90630743 0.20000023 0.20000028]
 [0.20000034 0.20000061 0.20000044 ... 0.20374397 0.20000033 0.2000003 ]]
(5, 8243)


In [34]:
#bobot setiap kata terhadap topik
label=[]
for i in range (1,(lda.components_.shape[1]+1)):
  masukan = TF.columns[i-1]
  label.append(masukan)
VT_tabel = pd.DataFrame(lda.components_,columns=label)
VT_tabel.rename(index={0:"Topik 1",1:"Topik 2",2:"Topik 3",3:"Topik 4",4:"Topik 5"}).transpose()

Unnamed: 0,Topik 1,Topik 2,Topik 3,Topik 4,Topik 5
sistem,0.200000,0.377552,0.200000,1.022447,0.200000
informasi,0.760214,0.331366,1.508419,0.200001,0.200001
akademik,0.200000,1.117002,0.200000,1.282997,0.200000
siakad,0.367539,0.200000,2.032460,0.200000,0.200000
berfungsi,0.200000,0.200000,0.450872,0.949127,0.200000
...,...,...,...,...,...
ransel,0.200001,2.200635,1.199363,0.200001,0.200001
diseimbangkan,0.200001,3.059486,1.340511,0.200001,0.200001
detector,1.341843,0.348104,0.200001,1.906307,0.203744
anchor,0.620497,0.779502,0.200000,0.200000,0.200000


## Cluster

In [35]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()

# Menghapus kolom pertama (kolom dengan label 0)
data_for_clustering = U.iloc[:, 1:]  # Mengambil semua kolom kecuali kolom pertama

# Create an instance of KMeans
kmeans = KMeans(n_clusters=3)

# Fit the KMeans model to the data
kmeans.fit(data_for_clustering)

# Add the cluster labels to the DataFrame
U['cluster'] = kmeans.labels_

# Display the DataFrame with cluster labels
display(U)




Unnamed: 0,Topik 1,Topik 2,Topik 3,Topik 4,Topik 5,cluster
0,0.995209,0.001198,0.001198,0.001198,0.001198,1
1,0.000939,0.000939,0.996243,0.000939,0.000939,1
2,0.996444,0.000889,0.000889,0.000889,0.000889,1
3,0.994159,0.001460,0.001460,0.001460,0.001460,1
4,0.995321,0.001170,0.001170,0.001170,0.001170,1
...,...,...,...,...,...,...
816,0.000823,0.000823,0.000823,0.996707,0.000823,2
817,0.645044,0.001198,0.351363,0.001198,0.001198,1
818,0.388539,0.000639,0.000639,0.609543,0.000639,2
819,0.000778,0.000778,0.000778,0.242264,0.755401,1


In [36]:
from sklearn.metrics import silhouette_score

# Hitung Silhouette Score
silhouette_avg = silhouette_score(data_for_clustering, kmeans.labels_)
print(f"Silhouette Score: {silhouette_avg}")


Silhouette Score: 0.561036847780155
