
#**Topic Modeling**

##**Importing Required Libraries**

In [None]:
!pip install pyLDAvis --no-deps --quiet
!pip install funcy --quiet

In [None]:
import pandas as pd
import nltk
import string
import gensim
import re

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora
import pyLDAvis.gensim_models
import warnings

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

warnings.filterwarnings("ignore", category=DeprecationWarning)

##**Importing Dataset**

In [None]:
# Fetching the dataset from GitHub
data_url = "https://raw.githubusercontent.com/andrybrew/IHT-SEM1302-30Okt/main/data/001_suku-bunga.csv"

# Using pandas read_csv function to load the data from the URL directly into a DataFrame
df_tweet = pd.read_csv(data_url)

##**Data Preprocessing for Topic Modeling**

In [None]:
# Function for removing URL
def remove_url(text):
    return re.sub(r'https?://\S+|www\.\S+|\S+\.\S+/\S+', '', text, flags=re.MULTILINE)

# Remove URL from each tweet
df_tweet['text'] = df_tweet['text'].apply(remove_url)

# Remove mentions entirely
df_tweet['text'] = df_tweet['text'].str.replace(r'@\S+', '', regex=True)

# Remove non-word characters except for spaces and %
df_tweet['text'] = df_tweet['text'].str.replace(r'[^\w\s%]', '', regex=True)

# Convert to lowercase
df_tweet['text'] = df_tweet['text'].str.lower()

# Trim leading and trailing spaces and replace multiple spaces with a single space
df_tweet['text'] = df_tweet['text'].str.strip().str.replace(r'\s+', ' ', regex=True)
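
To see what the cleaning steps above do to a single string, the sketch below applies the same regular expressions with plain `re`. The sample tweet and the helper name `clean_tweet` are made up for illustration and are not part of the dataset or the original notebook.

```python
import re

def clean_tweet(text):
    """Apply the same cleaning steps as the cell above to one string."""
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+|\S+\.\S+/\S+', '', text, flags=re.MULTILINE)
    # Remove mentions
    text = re.sub(r'@\S+', '', text)
    # Drop everything except word characters, whitespace, and %
    text = re.sub(r'[^\w\s%]', '', text)
    # Lowercase, trim, and collapse runs of whitespace
    text = text.lower().strip()
    return re.sub(r'\s+', ' ', text)

# Hypothetical sample, not taken from the dataset
sample = "@user BI naikkan suku bunga jadi 6%! Cek https://example.com/berita  sekarang."
print(clean_tweet(sample))
```

Note that periods and other punctuation are gone after this step, while `%` survives, which is why a token such as `6%` can later appear among the topic keywords.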

In [None]:
# Select Tweets
tweet = df_tweet['text']

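# Tokenize on whitespace (the text was already lowercased during cleaning)
tokenized_text = [d.split() for d in tweet]

# Remove tokens that are a single punctuation character (e.g. a stray '%' or '_')
punctuation = set(string.punctuation)
tokenized_text = [[word for word in doc if word not in punctuation] for doc in tokenized_text]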

# Lemmatization (WordNetLemmatizer targets English, so most Indonesian tokens pass through unchanged)
lemmatizer = WordNetLemmatizer()
tokenized_text = [[lemmatizer.lemmatize(word) for word in doc] for doc in tokenized_text]

# Build the stop word list: NLTK's Indonesian list plus corpus-specific slang, abbreviations, and stray tokens
# Note: entries containing punctuation ('dll.', '1.', '2.', '7.', '•') cannot match, because punctuation was stripped during cleaning
stop_words = stopwords.words('indonesian')
custom_stop_words = ['dgn', 'sdh', 'yg', 'the', 'gak', 'ga', 'a', 'krn', 'thd', 'nya', 'ya', 'n', 'kalo', 'aja', 'deh', 'tuh', 'udah', 'dll.', '2', '25', '20', '1.', '2.', '7.', 'u', '5', 'gua', '•']
stop_words.extend(custom_stop_words)
stop_words = set(stop_words)  # set membership is faster than scanning a list

# Remove stopwords in a single pass
tokenized_text = [[word for word in doc if word not in stop_words] for doc in tokenized_text]

# Create dictionary
dictionary = corpora.Dictionary(tokenized_text)

# Create corpus
corpus = [dictionary.doc2bow(doc) for doc in tokenized_text]
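
Conceptually, `corpora.Dictionary` assigns an integer id to each distinct token, and `doc2bow` turns each document into a sparse list of `(token_id, count)` pairs. The sketch below reproduces that idea with the standard library only, on two made-up documents. The ids gensim assigns may differ, and `toy_docs`, `token2id`, and `to_bow` are illustrative names, not gensim API.

```python
from collections import Counter

# Hypothetical tokenized documents
toy_docs = [
    ["suku", "bunga", "naik", "bunga"],
    ["bank", "naikkan", "suku", "bunga"],
]

# Assign ids in order of first appearance (gensim's ordering may differ)
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs, similar in shape to doc2bow output."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Word order is discarded in this representation: LDA only sees how often each token occurs in each document.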

##**Setting Up LDA Model**

In [None]:
# Train LDA model
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,
    passes=22,
    per_word_topics=True,
    random_state=42,
)

##**Visualizing Topics**

In [None]:
# Render pyLDAvis output inline in the notebook
pyLDAvis.enable_notebook()

# Build the interactive topic visualization (displayed as the cell output)
pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)

In [None]:
# Get the top 10 weighted keywords for each topic as (topic_id, keyword_string) pairs
top_topics = lda_model.print_topics(num_words=10)

# Create DataFrame
df_topics = pd.DataFrame(top_topics, columns=['Topic', 'Keywords'])

# Set topic as index
df_topics.set_index('Topic', inplace=True)

# Show df_topics
df_topics
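
Each `Keywords` entry is a single string of the form `weight*"word" + weight*"word" + ...`. If separate words and weights are needed, one option is to parse that string with a regular expression, as in this sketch. The helper `parse_topic_string` and the example weights are invented for illustration.

```python
import re

def parse_topic_string(topic_string):
    """Split a string such as '0.067*"bunga" + 0.060*"suku"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Hypothetical example string; the weights are not real model output
example = '0.067*"bunga" + 0.060*"suku" + 0.012*"bank"'
print(parse_topic_string(example))
```

With a trained model, calling `lda_model.show_topic(topic_id, topn=10)` should return `(word, probability)` pairs directly, which avoids string parsing altogether.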

**Topic 0:**

**Top keywords:** bunga, suku, turun, kenaikan, fed, inflasi, saham, bank, masyarakat, ekonomi.

**Interpretation:** Discusses the impact of rising and falling interest rates, particularly those set by central banks such as the Fed, on the economy, the stock market, and inflation.

**Topic 1:**

**Top keywords:** bunga, suku, bi, acuan, 6%, rupiah, bank, indonesia, persen, naikkan.

**Interpretation:** Focuses on Bank Indonesia's interest rate policy and its effects on the value of the rupiah and the Indonesian economy.

**Topic 2:**

**Top keywords:** bunga, suku, bank, fed, bi, turun, kenaikan, negara, inflasi, menaikkan.

**Interpretation:** Covers interest rate policy set by central banks such as the Federal Reserve and Bank Indonesia, and its effects on inflation and on the global and national economies.
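
The three keyword lists above share several terms, so it can help to quantify how much the topics overlap. The sketch below copies the listed keywords and computes the terms common to all topics plus the pairwise Jaccard similarity. The helper `jaccard` is an illustrative name, and this is only a rough check on the top 10 keywords, not a replacement for a proper coherence measure.

```python
from itertools import combinations

# Keyword lists copied from the topic summaries above
topics = {
    0: ["bunga", "suku", "turun", "kenaikan", "fed", "inflasi", "saham", "bank", "masyarakat", "ekonomi"],
    1: ["bunga", "suku", "bi", "acuan", "6%", "rupiah", "bank", "indonesia", "persen", "naikkan"],
    2: ["bunga", "suku", "bank", "fed", "bi", "turun", "kenaikan", "negara", "inflasi", "menaikkan"],
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two keyword lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Terms that appear in every topic's keyword list
shared = set.intersection(*(set(words) for words in topics.values()))
print("Shared by all topics:", sorted(shared))

# Pairwise overlap between topics
for i, j in combinations(topics, 2):
    print(f"Topic {i} vs {j}: {jaccard(topics[i], topics[j]):.2f}")
```

Topics 0 and 2 share 7 of their 10 keywords, and `bunga`, `suku`, and `bank` appear in all three lists. One possible follow-up, offered as a suggestion only, is to treat such ubiquitous terms as stop words or to compare models with a different `num_topics`.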