GitHub - quibes/PTM

Introduction

This project investigates two probabilistic topic modeling algorithms, LDA and BERTopic, and compares their thematic coherence, interpretability, and computational efficiency on large-scale unstructured Russian-language text data from Telegram channels.

Methodology

Data Collection

Data was collected from three public sports-themed Russian-language Telegram chats over a one-month period (October 1 to November 1, 2024). The selected chats are:

Матч! Чат: Over 22,000 members, generating 3,000 to 10,000 messages daily.
FEGURKA / Фигурное катание: Over 2,200 members, producing about 3,000 messages daily.
БАРСЕЛОНА | BARCELONA: Over 22,500 members, with around 10,000 messages daily.

Data was fetched using the Telethon Python library, adhering to ethical research guidelines.
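The collection step could look roughly like the sketch below. It relies on Telethon's `TelegramClient.iter_messages`; the session name, credentials, chat identifier, output path, and the choice of UTC for the window boundaries are placeholders and assumptions, not values taken from this repository.

```python
import json
from datetime import datetime, timezone

# Collection window from the README; treating the dates as UTC is an assumption.
START = datetime(2024, 10, 1, tzinfo=timezone.utc)
END = datetime(2024, 11, 1, tzinfo=timezone.utc)


def in_window(ts, start=START, end=END):
    """Return True if a timezone-aware timestamp falls inside [start, end)."""
    return start <= ts < end


async def dump_chat(chat, out_path, api_id, api_hash):
    """Write the text messages of one public chat to a JSON file.

    All four arguments are placeholders supplied by the caller. Telethon is
    imported here so that in_window() has no third-party dependency.
    """
    from telethon import TelegramClient

    rows = []
    async with TelegramClient("ptm_session", api_id, api_hash) as client:
        # reverse=True walks forward in time, starting at offset_date.
        async for msg in client.iter_messages(chat, offset_date=START, reverse=True):
            if msg.date >= END:
                break
            if in_window(msg.date) and msg.text:
                rows.append({"id": msg.id,
                             "date": msg.date.isoformat(),
                             "message": msg.text})
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh, ensure_ascii=False)
    return len(rows)
```

With real credentials, the coroutine would be run once per chat, for example through `asyncio.run(dump_chat(...))`.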

Data Preprocessing

LDA Preprocessing

Data Loading and Deduplication: Loaded JSON-formatted messages and removed duplicates, capping at 300,000 messages.
Stop Phrase and Noise Removal: Excluded messages with domain-specific stop phrases and removed non-textual noise (e.g., URLs, mentions).
Tokenization and Lemmatization: Tokenized text, removed Russian stopwords, and lemmatized using pymorphy2.
Data Normalization: Converted text to lowercase and removed non-Cyrillic characters.
Identification of Common Phrases: Identified and ranked bigrams and trigrams.
Dictionary Creation and Corpus Representation: Created a Gensim dictionary, filtered extreme frequencies, and transformed the corpus into a bag-of-words representation.
Coherence-Based Optimization of Topics: Tested topic counts (2 to 40) and selected the model with the highest coherence score.
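The cleaning and deduplication steps above can be sketched with the standard library alone. This is a simplified illustration, not the project's code: the regular expressions, the single stop phrase, and the order of operations are assumptions, and tokenization and lemmatization are left out.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_CYRILLIC_RE = re.compile(r"[^а-яё\s]")

# Hypothetical stop phrase; the project's actual list is not given in the README.
STOP_PHRASES = ("подписывайтесь на канал",)


def clean_message(text):
    """Lowercase, strip URLs and mentions, and keep only Cyrillic letters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_CYRILLIC_RE.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace


def prepare(messages, limit=300_000):
    """Drop stop-phrase messages and duplicates, clean the rest, cap at `limit`."""
    seen = set()
    cleaned = []
    for raw in messages:
        if any(phrase in raw.lower() for phrase in STOP_PHRASES):
            continue
        text = clean_message(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
        if len(cleaned) >= limit:
            break
    return cleaned
```

Here duplicates are detected after cleaning, so messages that differ only in case, punctuation, or links count as the same message; deduplicating the raw text instead would be stricter.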

BERTopic Preprocessing

Data Loading: Loaded JSON data into a pandas DataFrame, extracting and filtering the 'message' field.
Text Normalization: Lowercased text, removed stop phrases, tokenized using razdel, and excluded non-alphabetic tokens.
Lemmatization: Applied pymorphy2 for morphological normalization.
Stopword Removal: Excluded tokens in the Russian stopword list from NLTK and ensured a minimum token length of three characters.
Quality Enhancement: Removed English words and messages lacking meaningful content post-processing.
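The token-level filters above can be illustrated as follows. To keep the sketch dependency-free, a regular expression stands in for razdel, the lemmatizer is a pluggable function that defaults to the identity, and the stopword set is a tiny illustrative sample rather than NLTK's Russian list; all of these are stand-ins, not the project's code.

```python
import re

TOKEN_RE = re.compile(r"[^\W\d_]+")      # runs of letters; stand-in for razdel
CYRILLIC_RE = re.compile(r"^[а-яё]+$")   # rejects English and mixed-script tokens

# Illustrative sample only; the project uses NLTK's Russian stopword list.
STOPWORDS = {"это", "как", "что", "для", "или"}


def filter_tokens(text, lemmatize=lambda token: token,
                  stopwords=STOPWORDS, min_len=3):
    """Tokenize, lemmatize, and keep Cyrillic non-stopword tokens of min_len+."""
    tokens = TOKEN_RE.findall(text.lower())
    lemmas = [lemmatize(token) for token in tokens]
    return [
        lemma for lemma in lemmas
        if len(lemma) >= min_len
        and CYRILLIC_RE.match(lemma)
        and lemma not in stopwords
    ]
```

A message whose filtered token list is empty would then be dropped. With pymorphy2 installed, the lemmatizer could be passed as `lambda t: morph.parse(t)[0].normal_form`, where `morph = pymorphy2.MorphAnalyzer()`.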

Implementation

LDA Implementation

Dictionary and Corpus Preparation: Created a dictionary of unique tokens and constructed a bag-of-words representation.
Model Tuning: Generated models with topic counts ranging from 2 to 50, evaluated using coherence scores, and selected the optimal model.
Parameter Adjustment: Set the alpha and beta priors to 'auto' (Gensim exposes beta as eta) so that they are learned from the data during training.
Evaluation: Plotted coherence scores against topic counts and extracted top words for each topic.
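A minimal sketch of the tuning loop described above, assuming Gensim's `Dictionary`, `LdaModel`, and `CoherenceModel`. The `filter_extremes` thresholds, the number of passes, the random seed, and the choice of the `c_v` coherence measure are illustrative assumptions; the README does not state the values used in the project.

```python
def coherence_by_topic_count(texts, topic_counts, seed=42):
    """Train one LDA model per topic count and return {k: coherence}.

    `texts` is a list of token lists. Thresholds, passes, and the c_v
    measure are assumptions made for this sketch.
    """
    # Imported here so best_topic_count() below works without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    scores = {}
    for k in topic_counts:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       alpha="auto", eta="auto", passes=5, random_state=seed)
        cm = CoherenceModel(model=lda, texts=texts,
                            dictionary=dictionary, coherence="c_v")
        scores[k] = cm.get_coherence()
    return scores


def best_topic_count(scores):
    """Return the topic count with the highest coherence (smallest k on ties)."""
    return min(scores, key=lambda k: (-scores[k], k))
```

The dictionary returned by `coherence_by_topic_count` can be plotted directly (topic count on the x-axis, coherence on the y-axis), and `best_topic_count` picks the model to keep. Preferring the smaller topic count on ties is a design choice of this sketch.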

BERTopic Implementation

Embeddings: Generated dense vector representations using two sentence-embedding models, sbert_large_nlu_ru and paraphrase-multilingual-MiniLM-L12-v2.
Dimensionality Reduction: Applied UMAP with optimized parameters for Russian text data.
Clustering: Utilized HDBSCAN with parameters balancing topic granularity and cluster stability.
Vectorization and Topic Representation: Employed CountVectorizer and ClassTfidfTransformer for interpretable topic representations.
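The four components above plug into BERTopic's constructor as separate sub-models. The sketch below shows one way to wire them together. Every hyperparameter value is an illustrative assumption, since the README does not list the tuned settings, and the default embedding name is the public sentence-transformers checkpoint `paraphrase-multilingual-MiniLM-L12-v2`, assumed to be one of the models the project used.

```python
# Illustrative hyperparameters only; the tuned values are not in the README.
UMAP_PARAMS = {"n_neighbors": 15, "n_components": 5, "min_dist": 0.0,
               "metric": "cosine", "random_state": 42}
HDBSCAN_PARAMS = {"min_cluster_size": 50, "metric": "euclidean",
                  "cluster_selection_method": "eom", "prediction_data": True}


def build_topic_model(embedding_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Assemble a BERTopic model from explicitly configured sub-models."""
    # Imported lazily so the parameter dictionaries can be inspected
    # without the heavy dependencies installed.
    from bertopic import BERTopic
    from bertopic.vectorizers import ClassTfidfTransformer
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer(embedding_name),
        umap_model=UMAP(**UMAP_PARAMS),
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),
        vectorizer_model=CountVectorizer(ngram_range=(1, 2)),
        ctfidf_model=ClassTfidfTransformer(reduce_frequent_words=True),
    )
```

Given a list of preprocessed messages `docs`, the model would be fitted with `topics, probs = build_topic_model().fit_transform(docs)`. Raising `min_cluster_size` yields fewer, broader topics, while lowering it yields more, finer-grained ones.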

Results

The comparative analysis of LDA and BERTopic on the Russian-language Telegram dataset revealed insights into thematic coherence, interpretability, and computational efficiency. Detailed results are documented in the thesis.

Repository Structure

Barcelona/: Data and analysis related to the БАРСЕЛОНА | BARCELONA chat.
MatchTV/: Data and analysis related to the Матч! Чат.
fegurka/: Data and analysis related to the FEGURKA / Фигурное катание chat.
environment_info/: Environment configuration and dependencies.
README.md: This README file.

Installation

Clone the Repository:

git clone https://github.com/quibes/PTM.git
cd PTM
