# Topic Modelling

## Crawling Data from YouTube

The **goal** of this program is to crawl (retrieve) the comments on a YouTube video using the **YouTube Data API v3**. Before running it, make sure you have enabled the YouTube Data API service and generated an **API key**.

If you do not have an **API key** yet, follow these brief steps:
1. Log in to the Google Developer Console (https://console.developers.google.com/) with your Google account.
2. Create a new project and fill in the requested fields.
3. On the project page, open the API services section, search for **YouTube Data API v3**, and enable it.
4. From the dashboard, create a credential so the API can be used: click **Create Credential** (**Buat Kredensial**) and complete the form.
5. You can view the API key on the **Credentials** tab.
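
Once the key exists, one option is to keep it out of the notebook itself. The sketch below reads it from an environment variable; the variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions made for illustration, not part of the original program.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key

# Usage (only works once the variable is set):
# api_key = load_api_key()
```

This way the key is not stored in the saved notebook or in version control.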



In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/My Drive/prosaindata/

In [None]:
#install library
!pip install sastrawi
!pip install swifter
!pip install gensim

In [None]:
#import library
import pandas as pd
from googleapiclient.discovery import build
import numpy as np
from string import punctuation
import re
import nltk

In [None]:
# Function that crawls the comments and replies of one video
def video_comments(video_id):
    # Collected rows: [publishedAt, authorDisplayName, text, likeCount]
    rows = []

    # Create the YouTube API client (api_key must be defined before calling)
    youtube = build('youtube', 'v3', developerKey=api_key)

    # Request the first page of comment threads
    video_response = youtube.commentThreads().list(
        part='snippet,replies',
        videoId=video_id
    ).execute()

    while video_response:
        for item in video_response['items']:
            # Top-level comment
            top = item['snippet']['topLevelComment']['snippet']
            rows.append([top['publishedAt'], top['authorDisplayName'],
                         top['textDisplay'], top['likeCount']])

            # Replies embedded in the thread; the 'replies' key may be absent
            # even when totalReplyCount > 0, so read it defensively
            if item['snippet']['totalReplyCount'] > 0:
                for reply in item.get('replies', {}).get('comments', []):
                    snippet = reply['snippet']
                    rows.append([snippet['publishedAt'], snippet['authorDisplayName'],
                                 snippet['textDisplay'], snippet['likeCount']])

        # Fetch the next page if there is one, otherwise stop
        if 'nextPageToken' in video_response:
            video_response = youtube.commentThreads().list(
                part='snippet,replies',
                pageToken=video_response['nextPageToken'],
                videoId=video_id
            ).execute()
        else:
            break

    return rows


In [None]:
# Fill in your own API key (avoid saving a real key in a shared notebook)
api_key = 'YOUR_API_KEY'

# Enter the video ID
# Example video URL: https://www.youtube.com/watch?v=5tucmKjOGi8
video_id = "KtntKGlmuZw"  # the code / ID of the video

# Call function
comments = video_comments(video_id)

comments

In [None]:
# Convert the results to a DataFrame
df = pd.DataFrame(comments, columns=['publishedAt', 'authorDisplayName', 'text', 'likeCount'])
df

In [None]:
%cd /content/drive/My Drive/prosaindata/

In [None]:
# Save the crawl results to CSV
df.to_csv('youtube_comments.csv', index=False)

## Preprocessing

### 1. Symbol & Punctuation Removal, case folding

In this stage, symbols and punctuation are removed and case folding is applied, i.e. every letter in the data is converted to lowercase.
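
To make the effect of these steps concrete, here is a condensed sketch applied to one sample comment. The helper `clean_comment` and the sample text are illustrative assumptions; the notebook itself performs the same steps in separate cells with slightly different rules.

```python
import re

def clean_comment(text):
    """Drop non-ASCII characters, punctuation, and digits, then lowercase."""
    text = text.encode("ascii", "ignore").decode("ascii")  # removes emoji and other non-ASCII
    text = re.sub(r"[^A-Za-z\s]", " ", text)               # punctuation, symbols, digits -> space
    text = re.sub(r"\s+", " ", text).strip()               # collapse repeated whitespace
    return text.lower()                                    # case folding

print(clean_comment("Mantap!!! Videonya BAGUS banget 👍 100%"))
# prints: mantap videonya bagus banget
```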

In [None]:
# Remove symbols and emoji
def remove_text_special(text):
  text = text.replace('\\t',"").replace('\\n',"").replace('\\u',"").replace('\\',"")
  text = text.encode('ascii', 'replace').decode('ascii')
  return text.replace("http://"," ").replace("https://", " ")
df['text'] = df['text'].apply(remove_text_special)
print(df['text'])

In [None]:
# Remove mentions, punctuation, and URLs
def remove_tanda_baca(text):
  text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", text)
  return text

df['text'] = df['text'].apply(remove_tanda_baca)
df['text'].head(20)

In [None]:
# Remove digits
def remove_numbers(text):
  return re.sub(r"\d+", "", text)
df['text'] = df['text'].apply(remove_numbers)
df['text']

In [None]:
# Case folding
def casefolding(Comment):
  Comment = Comment.lower()
  return Comment
df['text'] = df['text'].apply(casefolding)
df['text']

### 2. Tokenizing
In this stage the text is tokenized. Tokenizing splits a sentence into its individual words so that the text can be analysed further.
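
As a rough illustration of what tokenizing produces, the sketch below uses a plain regular expression as a stand-in for `nltk.word_tokenize`; the helper name `simple_tokenize` is hypothetical, and the real tokenizer handles many more cases.

```python
import re

def simple_tokenize(sentence):
    """Split a cleaned sentence into word tokens (stand-in for nltk.word_tokenize)."""
    return re.findall(r"[a-z]+", sentence.lower())

print(simple_tokenize("videonya bagus banget"))
# prints: ['videonya', 'bagus', 'banget']
```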

In [None]:
# Tokenization
# from nltk.tokenize import TweetTokenizer
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead
# def word_tokenize(text):
#   tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
#   return tokenizer.tokenize(text)

df['review_token'] = df['text'].apply(lambda sentence: nltk.word_tokenize(sentence))
df['review_token']

### 3. Word Normalization
In this stage the data is normalized: non-standard (informal) words are replaced with their standard forms.
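
A minimal sketch of the idea, assuming a small hand-written mapping. The dictionary entries and the helper `normalize_tokens` are made up for illustration; a real mapping would be much larger and loaded from a file.

```python
# Hypothetical informal-to-standard mapping; a real table would be much larger.
normalize_word_dict = {"gk": "tidak", "yg": "yang", "dgn": "dengan"}

def normalize_tokens(tokens, mapping=normalize_word_dict):
    """Replace each token that has a standard form; keep the others unchanged."""
    return [mapping.get(tok, tok) for tok in tokens]

print(normalize_tokens(["gk", "suka", "yg", "ini"]))
# prints: ['tidak', 'suka', 'yang', 'ini']
```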

In [None]:
# # Normalize non-standard words
# normalize = pd.read_excel("tugas/Dataset/Normalization Data.xlsx")

# normalize_word_dict = {}

# # Assumed layout: column 0 holds the non-standard word, column 1 its standard form
# for _, row in normalize.iterrows():
#   if row.iloc[0] not in normalize_word_dict:
#     normalize_word_dict[row.iloc[0]] = row.iloc[1]

# def normalized_term(comment):
#   return [normalize_word_dict[term] if term in normalize_word_dict else term for term in comment]

# df['comment_normalize'] = df['review_token'].apply(normalized_term)
# df['comment_normalize'].head(20)

### 4. Stopwords Removal
In this stage unimportant words are removed. Stopword removal is done in two passes: the first uses the corpus shipped with the Python library nltk, the second uses the file 'list_stopwords'.
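
The two passes can be sketched with small in-memory sets. Both sets below are illustrative assumptions standing in for the NLTK list and the custom file.

```python
# Illustrative stopword sets (stand-ins for the NLTK list and the custom file).
library_stopwords = {"yang", "dan", "di", "ini"}
custom_stopwords = {"nih", "sih", "dong"}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the given stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

tokens = ["video", "ini", "bagus", "sih", "dan", "jelas"]
first_pass = remove_stopwords(tokens, library_stopwords)
second_pass = remove_stopwords(first_pass, custom_stopwords)
print(second_pass)
# prints: ['video', 'bagus', 'jelas']
```

Storing the stopwords in a `set` keeps each membership test fast, which matters when the list is applied to every token of every comment.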

In [None]:
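# Stopword removal (pass 1: NLTK Indonesian stopword list)
nltk.download('stopwords')
from nltk.corpus import stopwords
txt_stopwords = set(stopwords.words('indonesian'))

def stopwords_removal(filtering):
  filtering = [word for word in filtering if word not in txt_stopwords]
  return filtering

# Use the normalized tokens if that step was run, otherwise the raw tokens
token_col = 'comment_normalize' if 'comment_normalize' in df.columns else 'review_token'
df['stopwords_removal'] = df[token_col].apply(stopwords_removal)
df['stopwords_removal'].head(20)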

In [None]:
# # Stopword removal (pass 2: custom list)
# data_stopwords = pd.read_excel("tugas/Dataset/list_stopwords.xlsx")
# print(data_stopwords)
# custom_stopwords = set(data_stopwords.iloc[:, 0])  # assumes the words are in the first column

# def stopwords_removal2(tokens):
#   return [word for word in tokens if word not in custom_stopwords]

# df['stopwords_removal_final'] = df['stopwords_removal'].apply(stopwords_removal2)
# df['stopwords_removal_final'].head(20)

### 5. Stemming
In this stage the tokens are stemmed. Stemming maps each word form back to its root word.
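
Because stemming is slow, it helps to stem each distinct term only once and reuse the result. The sketch below shows that caching pattern with a toy suffix stripper; `toy_stem` is a hypothetical stand-in and is far cruder than the Sastrawi stemmer.

```python
def toy_stem(term):
    """Toy stand-in for a real stemmer: strips one common suffix only."""
    for suffix in ("nya", "kan", "lah"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

documents = [["videonya", "bagus"], ["bagus", "jelaskan", "videonya"]]

# Stem each distinct term once, then reuse the result for every document.
term_dict = {term: toy_stem(term) for doc in documents for term in doc}
stemmed = [[term_dict[term] for term in doc] for doc in documents]
print(stemmed)
# prints: [['video', 'bagus'], ['bagus', 'jelas', 'video']]
```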

In [None]:
# Stemming
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import string
import swifter
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemming(term):
  return stemmer.stem(term)

# Use the second stopword pass if it was run, otherwise the first pass
final_col = 'stopwords_removal_final' if 'stopwords_removal_final' in df.columns else 'stopwords_removal'

# Collect every distinct term so each one is stemmed only once
term_dict = {}
for document in df[final_col]:
  for term in document:
    if term not in term_dict:
      term_dict[term] = ''


In [None]:
print(len(term_dict))
print("-----------------------------")

In [None]:
# Stem each distinct term; without this step every value in term_dict stays ''
for term in term_dict:
  term_dict[term] = stemming(term)

# print(term_dict)
# print("-----------------------------")

In [None]:
def get_stemming(document):
  return [term_dict[term] for term in document]

In [None]:
df['stemming'] = df[final_col].swifter.apply(get_stemming)

In [None]:
print(df['stemming'])

In [None]:
df.head(20)

## Feature Extraction (TF-IDF)

The TF-IDF (Term Frequency – Inverse Document Frequency) algorithm measures how relevant a term or phrase is to each document in a collection. A typical application is ranking documents against a query; here it is used to weight the terms of each comment.
The core of the algorithm is computing the TF value and the IDF value of every term for each document.
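
To show the computation by hand, here is a small pure-Python sketch. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is the variant documented for the default settings of scikit-learn's `TfidfVectorizer`. The helper `tfidf` and the three sample documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF, L2-normalised)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)                                   # raw term frequency
        weights = {t: c * idf[t] for t, c in counts.items()}    # TF * IDF
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})  # unit-length row
    return rows

docs = [["video", "bagus"], ["video", "jelas"], ["video", "bagus", "jelas"]]
for row in tfidf(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

A term that occurs in every document ("video" here) gets the lowest IDF, so within each row it weighs less than the rarer terms.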

In [None]:
# Join the stemmed tokens of each comment back into one string
def joinkata(data):
  return " ".join(data)

text = df['stemming'].swifter.apply(joinkata)
text

In [None]:
# Vectorize document using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(lowercase=True,
                        stop_words='english',
                        ngram_range = (1,1)
                        )

# Fit and Transform the documents
tfidf_matrix = vectorizer.fit_transform(text)

In [None]:
print(tfidf_matrix)

## Latent Semantic Analysis

In [None]:
# Decompose the matrix with SVD
from sklearn.decomposition import TruncatedSVD
svd_model = TruncatedSVD(n_components=4)
lsa_matrix = svd_model.fit_transform(tfidf_matrix)

## Modelling

Weight of each term within each topic

In [None]:
# Weight of each term within each topic
terms = vectorizer.get_feature_names_out()

for index, component in enumerate(svd_model.components_):
    zipped = zip(terms, component)
    top_terms_key=sorted(zipped, key = lambda t: t[1], reverse=True)[:2]
    print("Topic "+str(index)+": ",top_terms_key)

Weight of each topic within each document

In [None]:
# Weight of each topic within each document
df_lsa = pd.DataFrame(lsa_matrix, columns=["Topik 0", "Topik 1", "Topik 2", "Topik 3"])
df_lsa = pd.concat([text, df_lsa], axis=1)
df_lsa['Topik']= df_lsa[['Topik 0', 'Topik 1', 'Topik 2', 'Topik 3']].apply(lambda x: x.argmax(), axis=1)

df_lsa