# Roberto Di Via 4486648 - NLP Project

# Social Media Monitor

In this project I'll implement a social media monitor that tracks topics and trends from social media or blogs. It can help businesses or individuals stay up to date with the latest developments and discussions in their areas of interest.

To implement this project, I'll follow these steps:

- **1. Data Collection:** Gather data from various sources like news websites, blogs, and social media using APIs, web scraping, or RSS feeds. In this case I'll use the 20newsgroups dataset from Sklearn, which comprises around 18,000 newsgroup posts on 20 topics.
- **2. Text Preprocessing:** Clean and normalize the text data using stopword removal, stemming and lemmatization.
- **3. Topic Modeling:** Employ topic modeling techniques like Latent Dirichlet Allocation (LDA) to identify the main topics or themes present in the collected data. This will help filter relevant content based on the topics of interest.
- **4. Sentiment Analysis:** Determine the sentiment of the content (positive, negative, or neutral) using a rule-based approach like VADER sentiment analyzer.
- **5. Summarization:** Generate summaries of the relevant content using extractive summarization based on word frequencies, so that users can quickly grasp the main points without reading the entire text.
- **6. Visualization and Reporting:** Visualize the results in an intuitive dashboard or report format, showing the distribution of topics, sentiment scores, and summaries of the relevant content.

# **0. Import libraries:**


In [None]:
"""
%pip install numpy
%pip install pandas
%pip install scikit-learn
%pip install nltk
%pip install gensim
%pip install pyLDAvis
%pip install vaderSentiment
%pip install wordcloud
%pip install matplotlib
"""

In [None]:
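import numpy as np
import random
# Seed both generators: the scratch LDA below draws from `random` as well as `np.random`
np.random.seed(42)
random.seed(42)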

from pprint import pprint

# --------------- Dataset ------------- #
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# --------------- Pre-Processing -------- #
import re
import nltk
from nltk.corpus import stopwords
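nltk.download('punkt')
nltk.download('punkt_tab')  # assumption: recent NLTK releases look for this resource in word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')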

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

# --------------- LDA Model ---------- #
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

# --------------- Sentiment Analysis --------- #
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# --------------- Visualize and Report ----------- #
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# **1. Data Collection:**

Data can be collected from various sources like news websites, blogs, and social media using APIs, web scraping, or RSS feeds. In this case I use the 20newsgroups dataset from Sklearn. The original collection has approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups; the version loaded by Sklearn contains around 18,000 posts.
More information can be found at this [link](http://qwone.com/~jason/20Newsgroups/).

**Exploring the 20newsgroups dataset**

First, I remove the headers, footers and quotes from the texts. The difference between the original text and the clean text that I'll use for my experiments is shown below:

In [None]:
example_text = fetch_20newsgroups(subset='test', categories=['sci.space'])
print(f"Original text:\n{example_text.data[2]}")

In [None]:
clean_example_text = fetch_20newsgroups(subset='test', categories=['sci.space'], remove=('headers', 'footers', 'quotes'))
print(f"Clean text:\n{clean_example_text.data[2]}")

Then I check that the categories are roughly balanced, so that no single topic biases the training of the LDA model.

In [None]:
# Verify train dataset is balanced
motorcycles_train = fetch_20newsgroups(subset='train', categories=['rec.motorcycles'], remove=('headers', 'footers', 'quotes'))
print("Motorcycles dataset size: ", len(motorcycles_train.data))

hardware_train = fetch_20newsgroups(subset='train', categories=['comp.sys.ibm.pc.hardware'], remove=('headers', 'footers', 'quotes'))
print("Hardware dataset size: ", len(hardware_train.data))

graphics_train = fetch_20newsgroups(subset='train', categories=['comp.graphics'], remove=('headers', 'footers', 'quotes'))
print("Graphics dataset size: ", len(graphics_train.data))

med_train = fetch_20newsgroups(subset='train', categories=['sci.med'], remove=('headers', 'footers', 'quotes'))
print("Med dataset size: ", len(med_train.data))

space_train = fetch_20newsgroups(subset='train', categories=['sci.space'], remove=('headers', 'footers', 'quotes'))
print("Space dataset size: ", len(space_train.data))

guns_train = fetch_20newsgroups(subset='train', categories=['talk.politics.guns'], remove=('headers', 'footers', 'quotes'))
print("Guns dataset size: ", len(guns_train.data))

crypt_train = fetch_20newsgroups(subset='train', categories=['sci.crypt'], remove=('headers', 'footers', 'quotes'))
print("Crypt dataset size: ", len(crypt_train.data))

forsale = fetch_20newsgroups(subset='train', categories=['misc.forsale'], remove=('headers', 'footers', 'quotes'))
print("Forsale dataset size: ", len(forsale.data))

christian = fetch_20newsgroups(subset='train', categories=['soc.religion.christian'], remove=('headers', 'footers', 'quotes'))
print("Christian dataset size: ", len(christian.data))
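
The per-category sizes printed above can also be summarized as shares from a single label array. The following is a minimal sketch, assuming only a list of integer labels and a list of category names (the shape of the `target` and `target_names` attributes returned by `fetch_20newsgroups`). The helper name `class_balance` and the toy labels are made up for illustration.

```python
from collections import Counter

def class_balance(targets, target_names):
    """Return {category name: share of documents} for integer labels."""
    counts = Counter(int(t) for t in targets)
    total = sum(counts.values())
    return {target_names[label]: counts[label] / total for label in sorted(counts)}

# Toy labels standing in for a real `target` array
toy_targets = [0, 1, 1, 2, 2, 2, 3, 3]
toy_names = ['comp.graphics', 'rec.autos', 'sci.space', 'talk.religion.misc']
toy_balance = class_balance(toy_targets, toy_names)
for name, share in toy_balance.items():
    print(f"{name}: {share:.1%}")
```

Shares that differ widely between categories would suggest that the larger categories may dominate the learned topics.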

**Train dataset creation**

I create the train dataset. In this case I consider only 4 training topics.

In [None]:
# Create train dataset with selected categories
train_categories = ['talk.religion.misc', 'rec.autos', 'comp.graphics', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train', categories=train_categories, remove=('headers', 'footers', 'quotes'))

print(newsgroups_train.target_names)

n=30
print(f"Topics of first {n} texts: {newsgroups_train.target[:n]}")

**Test dataset creation**

For the test set, I create a custom dataset composed of 100 documents from 2 topics: 70% sci.space and 30% comp.graphics.

In [None]:
# Create test dataset with 100 samples: 70% sci.space and 30% comp.graphics
n_test_docs = 100
n_docs_70 = int(n_test_docs * 0.7)
n_docs_30 = n_test_docs - n_docs_70

# Fetch data for each category
test_70 = fetch_20newsgroups(subset='test', categories=['sci.space'], remove=('headers', 'footers', 'quotes'))
test_30 = fetch_20newsgroups(subset='test', categories=['comp.graphics'], remove=('headers', 'footers', 'quotes'))
print("Topic 1 size: ", len(test_70.data))
print("Topic 2 size: ", len(test_30.data))

# Randomly select the desired number of documents from each category
docs70_indices = np.random.choice(len(test_70.data), n_docs_70, replace=False)
docs30_indices = np.random.choice(len(test_30.data), n_docs_30, replace=False)

# Create the test dataset
test_data = [test_70.data[i] for i in docs70_indices] + [test_30.data[i] for i in docs30_indices]

"""
test_target = np.concatenate((guns_test.target[docs70_indices], space_test.target[docs20_indices], med_test.target[docs10_indices]))

# Create the newsgroups_test dataset
newsgroups_test = {
    'data': test_data,
    'target': test_target,
    'target_names': ['talk.politics.guns', 'sci.space', 'sci.med']
}
"""
#print("Newsgroups test dataset size: ", len(newsgroups_test['data']))

In [None]:
print(test_data[96])

In [None]:
print("Train dataset size: ", len(newsgroups_train.data))
print("Train topics are:\n",newsgroups_train.target_names)

print("\nTest dataset size: ", len(test_data))

# **2. Text Preprocessing:** 

I clean and normalize the text data using tokenization, stopword removal, and stemming/lemmatization. I use the `nltk` library for these tasks.

In [None]:
def preprocess_text(data):
    # Remove punctuation and stopwords
    stop_words = set(stopwords.words('english'))

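    def remove_special_chars(text):
        # Replace anything that is not alphanumeric or a space with a space.
        # Substituting an empty string would glue together words separated
        # only by a newline, tab, or punctuation (e.g. "end.\nNext" -> "endNext")
        return re.sub(r"[^a-zA-Z0-9 ]", " ", text)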

    def tokenize(text):
        text = remove_special_chars(text) # Remove special characters before tokenization
        return [word for word in word_tokenize(text.lower()) if word.isalnum() and word not in stop_words]

    def lemmatize(text):
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(word) for word in text]

    # Tokenization
    tokenized_data = [tokenize(text) for text in data]

    # Stemming
    stemmer = PorterStemmer()
    stemmed_data = [[stemmer.stem(token) for token in text] for text in tokenized_data]

    # Lemmatize
    lemmatized_data = [lemmatize(text) for text in tokenized_data]

    return tokenized_data, stemmed_data, lemmatized_data

In [None]:
# Preprocess train dataset
tokenized_train_data, stemmed_train_data, lemmatized_train_data = preprocess_text(newsgroups_train.data)

# Preprocess test dataset
tokenized_test_data, stemmed_test_data, lemmatized_test_data = preprocess_text(test_data)

Let's compare the tokenized, stemmed, and lemmatized versions of the first training document:

In [None]:
print(tokenized_train_data[:1])

In [None]:
print(stemmed_train_data[:1])

In [None]:
print(lemmatized_train_data[:1])

# **3. Topic Modeling:** 
Apply Latent Dirichlet Allocation (LDA) to identify the main topics in the collected data. In this part I'll use a from-scratch implementation and compare it with the `gensim` library version. To evaluate both models I compute their coherence scores.

For topic modeling, the model can be trained on any similar corpus of text documents. The training corpus doesn't necessarily have to contain the same topics as the unseen documents, but some overlap is likely to improve performance. For example, to categorize social media posts from a specific platform or about a specific subject, the training set would ideally be gathered from the same or a similar platform or subject.

To train an LDA model on a specific set of topics, however, the dataset needs to contain a substantial number of documents related to those topics.


## LDA explanation

Latent Dirichlet Allocation (LDA) is a generative probabilistic model used in topic modeling. It is a statistical model that allows us to discover latent topics within a collection of documents. LDA assumes that each document in the collection is a mixture of various topics, and each topic is a distribution over words.

It's an unsupervised learning method, meaning that it builds a probabilistic model to identify groups of topics without the need for known class labels. It uses only the distribution of words to mathematically model topics.

Here's a step-by-step explanation of how LDA works:

1. **Initialization**: Choose the number of topics K to extract from the document collection and randomly assign each word in each document to one of the K topics.

2. **Iteration**: Iterate through each word in each document and reassign the word to a topic based on: the proportion of words in the document that belong to the topic, and the proportion of occurrences of the word across all documents that belong to the topic.
   - For each document d:
     - For each word w in document d:
       - Calculate two probabilities:
         - P(topic t | document d): Proportion of words in document d that are currently assigned to topic t.
         - P(word w | topic t): Proportion of assignments to topic t over all documents that come from word w.
       - Reassign word w to a new topic based on the probabilities calculated above.
   
   - Repeat the above step for a fixed number of iterations or until convergence.

3. **Output**: After the final iteration, LDA provides two main outputs:
   - The distribution of topics in each document.
   - The distribution of words in each topic.

These distributions can be used to interpret the topics and analyze the relationships between documents and topics.
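
To make step 2 concrete, here is a small sketch that computes the reassignment probabilities for a single word from toy counts. The function name `topic_probabilities` and the numbers are illustrative, and the smoothing constants `alpha` and `beta` are an assumption added so that no topic ever gets probability zero (the step description above uses raw proportions).

```python
def topic_probabilities(doc_topic_counts, topic_word_counts, topic_totals,
                        vocab_size, alpha=0.1, beta=0.1):
    """Probability of each topic for one word in one document.

    doc_topic_counts[t]  : words in this document currently assigned to topic t
    topic_word_counts[t] : times this word is assigned to topic t, corpus-wide
    topic_totals[t]      : all word assignments to topic t, corpus-wide
    The word being resampled must already be removed from all three counts.
    """
    num_topics = len(doc_topic_counts)
    doc_len = sum(doc_topic_counts)
    weights = []
    for t in range(num_topics):
        # P(topic t | document d), smoothed by alpha
        p_topic_given_doc = (doc_topic_counts[t] + alpha) / (doc_len + num_topics * alpha)
        # P(word w | topic t), smoothed by beta
        p_word_given_topic = (topic_word_counts[t] + beta) / (topic_totals[t] + vocab_size * beta)
        weights.append(p_topic_given_doc * p_word_given_topic)
    total = sum(weights)
    return [w / total for w in weights]

# A document that is mostly topic 0, and a word that mostly belongs to topic 0
probs = topic_probabilities(
    doc_topic_counts=[8, 2], topic_word_counts=[5, 1],
    topic_totals=[100, 100], vocab_size=50)
print(probs)
```

Both factors favor topic 0 here, so the word is very likely, though not certain, to be reassigned to it. The new topic is then drawn at random according to these probabilities.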



LDA assumes that documents are generated in the following way:
- Choose the number of words in the document from a Poisson distribution.
- Choose a topic mixture for the document from a Dirichlet distribution.
- For each word in the document:
  - Choose a topic from the topic mixture.
  - Choose a word from the topic's word distribution.
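
The generative story above can be simulated directly. This is a sketch under stated assumptions: the two toy topics, the prior `alpha=0.5`, and the mean length of 8 are made-up values, and the helper names are hypothetical. It uses only the standard library, drawing the Dirichlet sample by normalizing Gamma draws.

```python
import math
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def generate_document(topics, alpha=0.5, mean_length=8, seed=0):
    """topics: list of {word: probability} dicts, one per topic."""
    rng = random.Random(seed)
    n_words = sample_poisson(mean_length, rng)              # document length
    theta = sample_dirichlet([alpha] * len(topics), rng)    # topic mixture
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        vocab, word_probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=word_probs)[0])    # pick a word
    return theta, words

toy_topics = [
    {'orbit': 0.5, 'launch': 0.3, 'moon': 0.2},
    {'pixel': 0.5, 'image': 0.3, 'render': 0.2},
]
theta, words = generate_document(toy_topics)
print(theta, words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that could plausibly have produced them.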

LDA is widely used in natural language processing and text mining tasks, such as document clustering, document classification, and information retrieval. It helps uncover the underlying themes or topics in a collection of documents, making it easier to analyze and organize large amounts of textual data[1].


Citations:
[1] https://www.semanticscholar.org/paper/b98a4076b48552691bb99290106a378e483cdfca
[2] https://www.semanticscholar.org/paper/03ba268430128916e195e8d1a88c761f3c9d7578
[3] https://arxiv.org/abs/1309.3421
[4] https://www.semanticscholar.org/paper/c80db2cd1b127ec86060ad018c04cd0c48075ae3
[5] https://www.semanticscholar.org/paper/1713b2a9291d76c02feb49376422d800d5e44888
[6] https://www.semanticscholar.org/paper/59c902e7797889bad1f731205a409ade2913199a


The documents can come from any domain as long as they contain text. For example, they could be customer reviews, news articles, research papers, social media posts, etc. The words in the documents are collected into n-grams (a contiguous sequence of n items from a given sample of text or speech) and used to create a dictionary. This dictionary is then used to train the LDA model. 

It's important to note that the text in the documents should be preprocessed before being used for training the LDA model. This preprocessing can include removing stop words (commonly used words such as 'the', 'a', 'an', 'in'), lowercasing all the words, and lemmatizing the words (reducing inflectional forms and sometimes derivationally related forms of a word to a common base form).

When configuring an LDA model, the parameters that can typically be set include the prior on the sparsity of the topic-word distributions (called rho in some implementations, `eta` in gensim), the prior on the sparsity of the per-document topic weights (`alpha`), the estimated number of documents, the batch size (`chunksize` in gensim), the initial value of the iteration counter used in the learning update schedule (`offset`), the power applied to the iteration counter during updates (`decay`), and the number of passes over the data (`passes`).


As a result of the training, each document will be represented as a combination of topics, and each topic will be represented as a distribution over words. This can be used to classify new documents, identify related terms, and create recommendations.

![LDA](https://www.researchgate.net/profile/Diego-Buenano-Fernandez/publication/339368709/figure/fig1/AS:860489982689280@1582168207260/Schematic-of-LDA-algorithm.png)

## Building the LDA model

In [None]:
# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(lemmatized_train_data)

# Create a bag-of-words representation of the documents
train_corpus = [dictionary.doc2bow(text) for text in lemmatized_train_data]

In [None]:
# human-readable format of corpus (term-frequency)
[[(dictionary[id], freq) for id, freq in cp] for cp in train_corpus[:1]]
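
For intuition, the bag-of-words conversion can be imitated in a few lines of plain Python. This is only a sketch of the idea, not gensim's implementation: the helper names `build_vocabulary` and `to_bag_of_words` are hypothetical, and the integer ids gensim assigns to tokens may differ from the first-seen order used here.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to an integer id, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(doc, token2id):
    """Return sorted (token id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [['space', 'launch', 'space'], ['car', 'engine', 'launch']]
token2id = build_vocabulary(toy_docs)
toy_bow = [to_bag_of_words(doc, token2id) for doc in toy_docs]
print(token2id)
print(toy_bow)
```

Word order is discarded in this representation, which is why LDA can only model which words occur together in a document and not how they are arranged.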

In [None]:
# Set the number of topics to extract from documents
num_topics = 4

### Gensim version

Let's build the topic model. I'll define 4 topics to start with, one for each training category. The hyperparameter alpha affects the sparsity of the document-topic (theta) distributions; gensim's default is a symmetric prior of 1/num_topics, while `alpha='auto'` (used below) learns an asymmetric prior from the data. Similarly, the hyperparameter eta can be specified, which affects the sparsity of the topic-word distributions.

https://www.kaggle.com/code/datajameson/topic-modelling-nlp-amazon-reviews-bbc-news
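
The effect of alpha on sparsity can be checked with a quick simulation. The sketch below samples topic mixtures from a symmetric Dirichlet prior for three illustrative alpha values (my own choices, not values used by the model) and reports the average weight of the dominant topic. The helper names are hypothetical and only the standard library is used.

```python
import random

def sample_symmetric_dirichlet(alpha, num_topics, rng):
    """Draw a topic mixture from a symmetric Dirichlet via Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws) or 1.0  # guard against numerical underflow
    return [d / total for d in draws]

def mean_largest_weight(alpha, num_topics=4, n_samples=500, seed=0):
    """Average weight of the dominant topic across sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_symmetric_dirichlet(alpha, num_topics, rng))
               for _ in range(n_samples)) / n_samples

for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha}: dominant topic weight ~ {mean_largest_weight(alpha):.2f}")
```

Small alpha values concentrate each document on one or two topics (sparse mixtures), while large values push every document towards an even blend of all topics.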

**Training phase**

In [None]:
# Train the LDA model using the processed training data
lda_gensim = LdaModel(corpus=train_corpus, id2word=dictionary, num_topics=num_topics, random_state=42, passes=10, alpha='auto', per_word_topics=True)

Let's explore the trained topics and their main words.

In [None]:
# Mapping between topic number and category name
def topic_category_mapping(topic_id):
    if topic_id == 0:
        return 'Car'
    elif topic_id == 1:
        return "Graphics"
    elif topic_id == 2:
        return "Space"
    elif topic_id == 3:
        return 'Religion'
    else:
        return f'Unknown Category {topic_id}'

In [None]:
# Print topics and associated category names
for topic_num, topic in lda_gensim.show_topics(num_topics=num_topics, formatted=False):
    topic_words = [word for word, _ in topic]
    category_name = topic_category_mapping(topic_num)
    print(f"Topic {topic_num} ({category_name}) | Words: {topic_words}\n")

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_gensim, train_corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_vis.html')

**Test phase**

Get the topic distribution for each test document.

In [None]:
test_topic_distributions = [lda_gensim.get_document_topics(dictionary.doc2bow(text)) for text in lemmatized_test_data]

In [None]:
# Display the topic distribution for all test documents
for i, topic_dist in enumerate(test_topic_distributions):
    # Sort the topic distribution by probability in descending order
    sorted_topic_dist = sorted(topic_dist, key=lambda x: -x[1])
    # Create a list to store the formatted topics
    formatted_topics = []

    # Format and store each topic
    for topic_id, probability in sorted_topic_dist:
        # Associate label to the topic
        category_name = topic_category_mapping(topic_id)

        formatted_topic = f"[{topic_id}] {category_name} {probability:.2f}"
        formatted_topics.append(formatted_topic)

    # Join the formatted topics into a string
    formatted_topics_str = " - ".join(formatted_topics)

    print(f"Document {i + 1} topics : {formatted_topics_str}")
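
A per-document listing is hard to read at a glance, so it can help to tally which topic wins in each document. This is a minimal sketch, assuming distributions in the same list-of-`(topic_id, probability)` format as above; the helper name `dominant_topic_shares` and the toy data are illustrative.

```python
from collections import Counter

def dominant_topic_shares(topic_distributions):
    """Share of documents whose most probable topic is each topic id."""
    winners = Counter(
        max(dist, key=lambda pair: pair[1])[0]
        for dist in topic_distributions if dist
    )
    total = sum(winners.values())
    return {topic_id: n / total for topic_id, n in winners.most_common()}

# Toy distributions in (topic_id, probability) format
toy_distributions = [
    [(2, 0.8), (1, 0.2)],
    [(2, 0.6), (0, 0.4)],
    [(1, 0.9), (2, 0.1)],
    [(2, 0.7), (3, 0.3)],
]
toy_shares = dominant_topic_shares(toy_distributions)
print(toy_shares)
```

Calling `dominant_topic_shares(test_topic_distributions)` would give shares that can be compared with the known composition of the test set.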

### Scratch version

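I implement the LDA model from scratch using Gibbs sampling. The input corpus is in the Gensim bag-of-words format: each document is a list of (word index, word count) tuples. To keep the sampler simple, one topic is drawn per distinct word in a document, and all occurrences of that word in the document share it.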

In [None]:
class LDA:
    def __init__(self, num_topics, num_iterations=30, id2word=None, alpha=0.1, beta=0.1):
        self.num_topics = num_topics
        self.num_iterations = num_iterations
        self.alpha = alpha
        self.beta = beta
        self.topic_word_counts = None
        self.doc_topic_counts = None
        self.id2word = id2word

    def fit(self, corpus):
        self.corpus = corpus
        self.num_words = max([word_idx for doc in corpus for word_idx, _ in doc]) + 1
        self.initialize()
        self.sample_topics()

    def initialize(self):
        # Initialize topic assignments randomly
        self.topic_assignments = [[random.randint(0, self.num_topics - 1) for _ in doc] for doc in self.corpus]

        # Initialize topic-word and document-topic count matrices
        self.topic_word_counts = np.zeros((self.num_topics, self.num_words))
        self.doc_topic_counts = np.zeros((len(self.corpus), self.num_topics))

        # Count initial topic assignments
        for doc_idx, doc in enumerate(self.corpus):
            for word_idx, (word, count) in enumerate(doc):
                topic = self.topic_assignments[doc_idx][word_idx]
                self.topic_word_counts[topic][word] += count
                self.doc_topic_counts[doc_idx][topic] += count

    def sample_topics(self):
        # Perform Gibbs sampling
        for it in range(self.num_iterations):
            doc_topic_sums = self.doc_topic_counts.sum(axis=1)
            topic_word_sums = self.topic_word_counts.sum(axis=1)
            
            for doc_idx, doc in enumerate(self.corpus):
                for word_idx, (word, count) in enumerate(doc):
                    # Remove current topic assignment
                    old_topic = self.topic_assignments[doc_idx][word_idx]
                    self.topic_word_counts[old_topic][word] -= count
                    self.doc_topic_counts[doc_idx][old_topic] -= count
                    doc_topic_sums[doc_idx] -= count
                    topic_word_sums[old_topic] -= count

                    # Compute probabilities for each topic (conditional distribution for the word)
                    p_topic_given_doc = (self.doc_topic_counts[doc_idx, :] + self.alpha) / (doc_topic_sums[doc_idx] + self.num_topics * self.alpha)
                    p_word_given_topic = (self.topic_word_counts[:, word] + self.beta) / (topic_word_sums + self.num_words * self.beta)
                    probabilities = p_topic_given_doc * p_word_given_topic

                    # Normalize probabilities
                    probabilities /= probabilities.sum()

                    # Sample a new topic assignment
                    new_topic = np.random.choice(self.num_topics, p=probabilities)
                    self.topic_assignments[doc_idx][word_idx] = new_topic

                    # Update counts for new topic assignment
                    self.topic_word_counts[new_topic][word] += count
                    self.doc_topic_counts[doc_idx][new_topic] += count
                    doc_topic_sums[doc_idx] += count
                    topic_word_sums[new_topic] += count

            #print(f"Iteration {it}")
    
    # Method used in the CoherenceModel 
    def get_topics(self):
        topics = np.zeros((self.num_topics, self.num_words))
        for topic_idx in range(self.num_topics):
            topic_word_counts = self.topic_word_counts[topic_idx, :]
            topic_word_probs = topic_word_counts / topic_word_counts.sum()
            topics[topic_idx, :] = topic_word_probs
        return topics  
        
    def show_topics(self, num_topics=None, num_words=10):
        if num_topics is None:
            num_topics = self.num_topics
        # Reuse show_topic so both methods rank and normalize identically
        return [(topic_idx, self.show_topic(topic_idx, num_words)) for topic_idx in range(num_topics)]
    
    # Takes as input a specific topic id
    def show_topic(self, topic_id, num_words=10):
        # Divide into a new array: an in-place "/=" on this slice would
        # overwrite the raw counts that inference() relies on
        counts = self.topic_word_counts[topic_id, :]
        topic_word_probs = counts / counts.sum()
        # argsort is ascending, so reverse it to list the most probable words first
        top_word_indices = np.argsort(topic_word_probs)[::-1][:num_words]
        topic_words = [(self.id2word[word_idx], np.round(topic_word_probs[word_idx], 9)) for word_idx in top_word_indices]
        return topic_words
    
    def get_document_topics(self, unseen_document):
        unseen_corpus = [unseen_document]
        inferred_topics = self.inference(unseen_corpus)
        return inferred_topics[0]
    
    def inference(self, unseen_corpus):
        inferred_topics = []
        for doc in unseen_corpus:
            doc_topic_counts = np.zeros(self.num_topics)
            for word_idx, (word, count) in enumerate(doc):
                p_topic_given_doc = (doc_topic_counts + self.alpha) / (doc_topic_counts.sum() + self.num_topics * self.alpha)
                p_word_given_topic = (self.topic_word_counts[:, word] + self.beta) / (self.topic_word_counts.sum(axis=1) + self.num_words * self.beta)
                probabilities = p_topic_given_doc * p_word_given_topic

                # Normalize probabilities
                probabilities /= probabilities.sum()

                inferred_topic = np.random.choice(self.num_topics, p=probabilities)
                doc_topic_counts[inferred_topic] += count

            # Normalize topic probabilities, adding a small constant to avoid division by zero
            doc_topic_probs = doc_topic_counts / (doc_topic_counts.sum() + 1e-10)

            # Filter out probabilities below the threshold
            #doc_topic_probs = doc_topic_probs[doc_topic_probs > 0.2]
    
            # Convert the inferred topic counts to a list of tuples (topic_id, probability)
            topic_distribution = [(topic_id, prob) for topic_id, prob in enumerate(doc_topic_probs)]
            inferred_topics.append(topic_distribution)

        return inferred_topics

**Training phase**

In [None]:
# Train the LDA model using the processed training data
lda_scratch = LDA(num_topics=num_topics, id2word=dictionary, num_iterations=30)
lda_scratch.fit(train_corpus)

In [None]:
# Mapping between topic number and category name
def topic_category_mapping_scratch(topic_id):
    if topic_id == 0:
        return 'Car'
    elif topic_id == 1:
        return 'Graphics'
    elif topic_id == 2:
        return 'Space'
    elif topic_id == 3:
        return 'Religion'
    else:
        return f'Unknown Category {topic_id}'

In [None]:
# Print topics and associated category names
for topic_num, topic in lda_scratch.show_topics(num_topics=num_topics, num_words=15):
    topic_words = [word for word, _ in topic]
    category_name = topic_category_mapping_scratch(topic_num)
    print(f"Topic {topic_num} ({category_name}) | Words: {topic_words}\n")

**Testing phase**

In [None]:
test_topic_distributions = [lda_scratch.get_document_topics(dictionary.doc2bow(text)) for text in lemmatized_test_data]

In [None]:
# Display the topic distribution for all test documents
for i, topic_dist in enumerate(test_topic_distributions):
    # Sort the topic distribution by probability in descending order
    sorted_topic_dist = sorted(topic_dist, key=lambda x: -x[1])
    # Create a list to store the formatted topics
    formatted_topics = []

    # Format and store each topic
    for topic_id, probability in sorted_topic_dist:
        # Associate label to the topic
        category_name = topic_category_mapping_scratch(topic_id)

        formatted_topic = f"[{topic_id}] {category_name} {probability:.2f}"
        formatted_topics.append(formatted_topic)

    # Join the formatted topics into a string
    formatted_topics_str = " - ".join(formatted_topics)

    print(f"Document {i + 1} topics : {formatted_topics_str}")

### Coherence Score

https://towardsdatascience.com/understanding-topic-coherence-measures-4aa41339634c

The [CoherenceModel](https://radimrehurek.com/gensim/models/coherencemodel.html) class in the gensim library evaluates the coherence of topic models. It supports several measures:

- c_v: Combines a sliding window over the corpus with an indirect confirmation measure: each top word is represented by its vector of NPMI scores against the other top words, and these vectors are compared by cosine similarity.
- u_mass: Uses document co-occurrence counts. Each pair of top words is scored by the logarithm of the conditional probability of one word given the other, i.e. how often the two words share a document relative to how often one of them appears.
- c_uci: Uses the pointwise mutual information (PMI) of word pairs in a topic, with co-occurrence counted inside a sliding window and compared with what would be expected if the words were independent.
- c_npmi: Like c_uci, but uses normalized PMI (NPMI), which divides PMI by the negative logarithm of the joint probability so that scores fall between -1 and 1.

https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0

These measures provide different perspectives on the coherence of topics in a topic model. The choice of measure depends on the specific requirements and characteristics of the corpus and the topic model being evaluated.
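
As an illustration of the simplest of these measures, here is a sketch of a UMass-style score for a single topic, computed from document co-occurrence counts on a toy corpus. It follows the commonly cited formulation with add-one smoothing; gensim's implementation differs in details such as smoothing and aggregation, so the numbers are not comparable with `CoherenceModel` output. The function name and toy corpus are made up for illustration.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents, smoothing=1.0):
    """UMass-style coherence of one topic from document co-occurrence counts.

    top_words : topic words ordered from most to least probable;
                each must occur in at least one document
    documents : list of token lists
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain all the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    # Pair each word with every higher-ranked word that precedes it
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + smoothing) / doc_count(earlier))
    return score

toy_corpus = [
    ['space', 'orbit', 'launch'],
    ['space', 'orbit', 'moon'],
    ['car', 'engine', 'wheel'],
    ['car', 'engine', 'space'],
]
coherent = umass_coherence(['space', 'orbit'], toy_corpus)
mixed = umass_coherence(['space', 'engine'], toy_corpus)
print(coherent, mixed)
```

Words that often share a document ("space" and "orbit") score higher than words that rarely do ("space" and "engine"), which is the intuition behind all four measures.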

In [None]:
def compute_coherence_score(model, texts, dictionary, coherence='c_v'):
    coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence=coherence)
    return coherence_model.get_coherence()

In [None]:
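# Score against the lemmatized texts, since the dictionary was built from them
print("Coherence score for Gensim LDA:", compute_coherence_score(lda_gensim, lemmatized_train_data, dictionary))

In [None]:
# Score against the lemmatized texts, since the dictionary was built from them
print("Coherence score for LDA from scratch:", compute_coherence_score(lda_scratch, lemmatized_train_data, dictionary))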

### Number of topics Tuning

Performing hyperparameter tuning with the LDA (Latent Dirichlet Allocation) model is important because it allows us to determine the optimal number of topics for a given text corpus. The initial number of topics is typically unknown, and it is necessary to extract it using an unsupervised approach. Hyperparameter tuning helps us find the best number of topics by evaluating different models based on their coherence scores.

Hyperparameter tuning involves systematically varying the number of topics and evaluating the resulting models. This process helps us avoid overfitting or underfitting the data. Overfitting occurs when the model is too complex and captures noise or irrelevant patterns, while underfitting occurs when the model is too simple and fails to capture the underlying structure of the data. By finding the optimal number of topics, we strike a balance between model complexity and interpretability.

Determining the optimal number of topics is crucial for effective topic modeling. It ensures that the resulting topics are meaningful and representative of the underlying content in the text corpus. Without hyperparameter tuning, we may end up with suboptimal or uninformative topics that do not capture the true essence of the data.

In [None]:
def hyperparameter_tuning(corpus, texts, dictionary, num_topics_range):
    coherence_scores_scratch = []
    coherence_scores_gensim = []

    n_iterations = 30
    
    for num_topics in num_topics_range:
        # Train Scratch LDA model
        lda_scratch = LDA(num_topics=num_topics, id2word=dictionary, num_iterations=n_iterations)
        lda_scratch.fit(corpus)

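        # Train Gensim LDA model (here `iterations` caps gensim's per-document
        # inference loop, so it is not equivalent to the Gibbs sweeps above)
        lda_gensim = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, iterations=n_iterations, random_state=42)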

        # Compute coherence score for Scratch LDA model
        coherence_score_scratch = compute_coherence_score(lda_scratch, texts, dictionary)
        coherence_scores_scratch.append(coherence_score_scratch)

        # Compute coherence score for Gensim LDA model
        coherence_score_gensim = compute_coherence_score(lda_gensim, texts, dictionary)
        coherence_scores_gensim.append(coherence_score_gensim)

    # Find optimal number of topics based on coherence score
    optimal_num_topics_scratch = num_topics_range[np.argmax(coherence_scores_scratch)]
    optimal_num_topics_gensim = num_topics_range[np.argmax(coherence_scores_gensim)]
    print(f"Optimal number of topics (scratch model): {optimal_num_topics_scratch} | Coherence score: {coherence_scores_scratch[np.argmax(coherence_scores_scratch)]}")
    print(f"Optimal number of topics (gensim model): {optimal_num_topics_gensim} | Coherence score: {coherence_scores_gensim[np.argmax(coherence_scores_gensim)]}")

    # Plot coherence scores
    plt.plot(num_topics_range, coherence_scores_scratch, label='LDA Scratch')
    plt.plot(num_topics_range, coherence_scores_gensim, label='LDA Gensim')
    plt.xlabel('Number of Topics')
    plt.ylabel('Coherence Score')
    plt.legend()
    plt.show()

    return optimal_num_topics_scratch, optimal_num_topics_gensim


In [None]:
num_topics_range = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Use the lemmatized texts, since the dictionary and corpus were built from them
optimal_num_topics_scratch, optimal_num_topics_gensim = hyperparameter_tuning(train_corpus, lemmatized_train_data, dictionary, num_topics_range)

# **4. Sentiment Analysis:**

Determine the sentiment of the content using the VADER sentiment analyzer from the `vaderSentiment` library.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based, lexicon-driven sentiment analysis tool specifically designed for social media texts. It doesn't require preprocessing like tokenization, stemming, or lemmatization, so I apply it to the raw test documents.

In [None]:
# Create the VADER analyzer once and reuse it (re-loading the lexicon on every call is slow)
analyzer = SentimentIntensityAnalyzer()

# Function to analyze sentiment using VADER
def get_sentiment(text):
    # Sentiment scores for the whole text
    sentiment_scores = analyzer.polarity_scores(text)

    # Perform sentiment analysis on individual words
    words = text.split()
    words_sentiment_scores = [analyzer.polarity_scores(word) for word in words]

    # Collect the words with positive and negative polarity (neutral words are discarded)
    positive_words = [word for word, score in zip(words, words_sentiment_scores) if score['compound'] > 0]
    negative_words = [word for word, score in zip(words, words_sentiment_scores) if score['compound'] < 0]

    return sentiment_scores, positive_words, negative_words

In [None]:
test_sentiments = [get_sentiment(text) for text in test_data]

In [None]:
# Display the sentiment scores for every test document
scores_list = []
global_normalized_score = 0
for i, (sentiment, pos_words, neg_words) in enumerate(test_sentiments):
    scores_list.append(sentiment['compound'])
    global_normalized_score += sentiment['compound']
    print(f"Document {i + 1}: {sentiment}")
    
print(f"\nAverage sentiment: {np.mean(scores_list)}")
print(f"\nGlobal normalized sentiment: {global_normalized_score/len(test_sentiments)}")

# **5. Summarization:**

Generate summaries of the relevant content using extractive summarization based on word frequency. For this, I'll follow these steps:
- 1. Split the text into sentences.
- 2. Tokenize the sentences.
- 3. Calculate the frequency of each word in the text.
- 4. Assign a score to each sentence based on the frequency of the words in the sentence.
- 5. Select the top N sentences with the highest scores as the summary.

This is a simple implementation of extractive summarization that uses no summarization library, only a tokenizer to split the text into sentences and words. Note that this approach does not consider the semantic meaning of words or the coherence of the summary, and because scores are plain sums it favors long sentences. More advanced techniques, such as using word embeddings or graph-based methods, can improve the quality of the summary.

In [None]:
def extractive_summarization(text, n_sentences=3):
    # Split the text into sentences
    sentences = sent_tokenize(text)
    # Tokenize the sentences
    tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

    # Calculate word frequencies
    word_freq = {}
    for sentence in tokenized_sentences:
        for token in sentence:
            word_freq[token] = word_freq.get(token, 0) + 1

    # Assign scores to sentences based on word frequencies
    sentence_scores = {}
    for i, sentence in enumerate(tokenized_sentences):
        sentence_scores[sentences[i]] = sum(word_freq[token] for token in sentence)

    # Select top sentences for the summary
    summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n_sentences]
    # Combine the selected sentences; they already end with their own punctuation,
    # so join with a space instead of '. ' to avoid doubled periods
    summary = ' '.join(summary_sentences)

    return summary
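
The function above returns the selected sentences in score order, which can make a multi-sentence summary read out of sequence. A possible variant, sketched below, keeps the same frequency-based scoring but emits the chosen sentences in their original order. It is not the implementation used in this notebook: the name `summarize_in_order` is hypothetical, and a naive regex splitter stands in for `sent_tokenize` / `word_tokenize` so that the example runs without extra downloads.

```python
import re
from collections import Counter


def summarize_in_order(text, n_sentences=2):
    # Naive sentence splitter (assumption: stands in for a real sentence tokenizer)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Naive word tokenizer: lowercase alphanumeric runs
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]

    # Word frequencies over the whole text
    word_freq = Counter(token for tokens in tokenized for token in tokens)

    # Sentence score = sum of the frequencies of its words
    scores = [sum(word_freq[t] for t in tokens) for tokens in tokenized]

    # Pick the indices of the top-n sentences (ties go to the earlier sentence) ...
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:n_sentences]
    # ... and emit them in their original document order
    return " ".join(sentences[i] for i in sorted(top))


toy_text = (
    "Rockets are loud. The shuttle launch was delayed. "
    "The shuttle launch will be rescheduled by the launch team."
)
print(summarize_in_order(toy_text, n_sentences=2))
```

In the toy text the last sentence has the highest score, yet it is printed after the second one because document order is restored before joining.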

In [None]:
summaries = [extractive_summarization(text, n_sentences=1) for text in test_data]

# Print summary for every document in the test set
for i, summary in enumerate(summaries):
    print(f"Summary {i + 1}: {summary}")

# **6. Visualization and Reporting:**

Visualize the results in an intuitive dashboard or report format, showing the distribution of topics, sentiment scores, and summaries of the relevant content. 

In [None]:
# Merge all the texts in the test set
merged_test_text = ' '.join(test_data)
_, _, lemmatized_merged_text = preprocess_text([merged_test_text])

## Main topic
In this section I print the main topic identified across all the documents in the test set (100 different texts), using both the gensim LDA model and my own implementation. About 70% of the test set belongs to the space topic, so I expect the models to return "Space" as the main topic.

In [None]:
def plot_topic_distribution(topic_distribution, topic_mapping_func):
    # Extract topic IDs and probabilities
    topic_ids, probabilities = zip(*topic_distribution)
    topic_names = [topic_mapping_func(topic_id) for topic_id in topic_ids]

    # Create a bar chart; bars are placed at 0..n-1 so that missing topic IDs
    # or repeated topic names cannot misalign the tick labels
    positions = range(len(topic_ids))
    plt.bar(positions, probabilities)
    plt.xlabel('Topic')
    plt.ylabel('Probability')
    plt.title('Topic Distribution in Document')
    plt.xticks(positions, topic_names, rotation=45, ha='right')

    # Show the bar chart
    plt.show()


### Gensim version

In [None]:
# Get the topic distribution for the merged text
merged_text_topic_distribution = lda_gensim.get_document_topics(dictionary.doc2bow(lemmatized_merged_text[0]))

In [None]:
print("\nMain topics distribution:")

# Display the topic distribution of the merged text, most probable topic first
sorted_topic_dist = sorted(merged_text_topic_distribution, key=lambda x: -x[1])
for topic_id, probability in sorted_topic_dist:
    topic_name = topic_category_mapping(topic_id)
    print(f"[{topic_id}] {topic_name} -> {probability}")

# Identify the main topic based on the highest probability
main_topic = sorted_topic_dist[0][0]
topic_name = topic_category_mapping(main_topic)

# Display the main topic keywords
main_topic_keywords = lda_gensim.show_topic(main_topic)
print(f"\nMain {topic_name} keywords:\n{main_topic_keywords}")

In [None]:
plot_topic_distribution(merged_text_topic_distribution, topic_category_mapping)

### Scratch version

In [None]:
test_topic_distributions = lda_scratch.get_document_topics(dictionary.doc2bow(lemmatized_merged_text[0]))

In [None]:
print(f"\nMain topics distribution:")

# Display the topic distribution
sorted_topic_dist = sorted(test_topic_distributions, key=lambda x: -x[1])
for topic_id, probability in sorted_topic_dist:
    # Exclude probabilities under 0.2
    #if probability < 0.2:
    #    continue

    topic_name = topic_category_mapping_scratch(topic_id)
    formatted_topic = f"[{topic_id}] {topic_name} -> {probability}"
    print(formatted_topic)

# Identify the main topic based on the highest probability
main_topic = sorted_topic_dist[0][0]
topic_name = topic_category_mapping_scratch(main_topic)

# Display the main topic keywords
main_topic_keywords = lda_scratch.show_topic(main_topic)
print(f"\nMain {topic_name} keywords:\n{main_topic_keywords}")

In [None]:
plot_topic_distribution(test_topic_distributions, topic_category_mapping_scratch)

## Average sentiment
In this section I print the overall sentiment of the test set, computed on the merged text of all its documents, and I display the sentiment scores using a bar chart.

In [None]:
merged_text_sentiment = get_sentiment(merged_test_text)

global_sentiment = merged_text_sentiment[0]
positive_words = merged_text_sentiment[1]
negative_words = merged_text_sentiment[2]

print(f"Global Text Sentiment: {global_sentiment}")
print(f"\nPositive words: {positive_words[:10]}")
print(f"\nNegative words: {negative_words[:10]}")
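
One caveat when scoring the merged text: as far as I understand the reference `vaderSentiment` implementation, the `compound` value is the summed word valence `x` squashed through `x / sqrt(x^2 + alpha)` with `alpha = 15`. On a very long text the sum grows large, so the compound score tends to saturate near ±1, which is why the per-document mean computed in section 4 is a useful complement. The sketch below only reproduces that formula; the function name `normalize` and the constant are assumptions to check against the installed library version.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded summed valence into (-1, 1).

    Assumption: mirrors the normalization used by VADER for 'compound'.
    """
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


# Larger summed valences (for example from longer texts) move toward the bounds
for raw in (1.0, 5.0, 50.0):
    print(raw, "->", round(normalize(raw), 4))
```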

In [None]:
# Function to visualize sentiment scores
def visualize_sentiment(sentiment_scores):
    labels = ['Positive', 'Neutral', 'Negative']
    values = [sentiment_scores['pos'], sentiment_scores['neu'], sentiment_scores['neg']]

    plt.bar(labels, values)
    plt.xlabel('Sentiment')
    plt.ylabel('Score')
    plt.title('Sentiment Analysis')
    plt.show()
    
# Pass the score dictionary, not the (scores, positive_words, negative_words) tuple
visualize_sentiment(global_sentiment)

In [None]:
def plot_sentiment_pie_chart(sentiment_scores):
    # Keep only the proportion scores; 'compound' is a normalized aggregate, not a share
    labels = ['neg', 'neu', 'pos']
    scores = [sentiment_scores[label] for label in labels]

    # Create a pie chart
    plt.pie(scores, labels=labels, autopct='%1.1f%%', startangle=90)
    plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    plt.title('Sentiment Distribution')

    # Show the pie chart
    plt.show()
    
# Pass the score dictionary, not the (scores, positive_words, negative_words) tuple
plot_sentiment_pie_chart(global_sentiment)

## Summary
In this section I print a summary of all the documents in the test set, showing the most relevant sentences from the whole set.

In [None]:
summary = extractive_summarization(merged_test_text, n_sentences=1)
print(summary)

In [None]:
# Function to generate a word cloud (defined here, before its first use)
def generate_wordcloud(texts, title):
    all_text = ' '.join(texts)
    wordcloud = WordCloud(width=800, height=400, background_color='white', min_font_size=5, max_words=100).generate(all_text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()

In [None]:
generate_wordcloud([summary], "Summary words")

## Word distribution
In this section I show the most frequent words in the test set. The bigger a word appears, the higher its frequency.

In [None]:
# Generate word cloud
generate_wordcloud([merged_test_text], "All texts Word Cloud")
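
A word cloud conveys frequencies only visually, so a numeric counterpart can be handy for a report. The sketch below counts words with `collections.Counter`; the helper `top_words` and the tiny `TOY_STOPWORDS` set are hypothetical and only serve the example (in the notebook, the preprocessed tokens could be counted instead).

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption: a real list would be much larger)
TOY_STOPWORDS = {"the", "a", "of", "to", "and", "is", "in"}


def top_words(text, n=5, stopwords=TOY_STOPWORDS):
    # Lowercase the text and keep alphabetic tokens only
    tokens = re.findall(r"[a-z']+", text.lower())
    # Count every token that is not a stopword
    counts = Counter(t for t in tokens if t not in stopwords)
    # Return the n most frequent (word, count) pairs
    return counts.most_common(n)


print(top_words("The orbit of the moon and the orbit of the station", n=2))
```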

In [None]:
generate_wordcloud(positive_words, "Positive Words Word Cloud")
generate_wordcloud(negative_words, "Negative Words Word Cloud")