# Social Media Monitor

In this project I'll implement a social media monitor that tracks topics and trends across social media and blogs. It can help businesses or individuals stay up to date with the latest developments and discussions in their areas of interest.

To implement this project, I'll follow these steps:

- **1. Data Collection:** Gather data from sources such as news websites, blogs, and social media using APIs, web scraping, or RSS feeds. In this case I'll use the 20 Newsgroups dataset from scikit-learn, which comprises around 18,000 newsgroup posts on 20 topics.
- **2. Text Preprocessing:** Clean and normalize the text data using stopword removal, stemming and lemmatization.
- **3. Topic Modeling:** Employ topic modeling techniques like Latent Dirichlet Allocation (LDA) to identify the main topics or themes present in the collected data. This will help filter relevant content based on the topics of interest.
- **4. Sentiment Analysis:** Determine the sentiment of the content (positive, negative, or neutral) using a rule-based approach like VADER sentiment analyzer.
- **5. Summarization:** Generate summaries of the relevant content using extractive summarization based on word frequencies, so that users can quickly grasp the main points without reading the entire text.
- **6. Visualization and Reporting:** Visualize the results in an intuitive dashboard or report format, showing the distribution of topics, sentiment scores, and summaries of the relevant content.

# **0. Import libraries:**


In [None]:
"""
%pip install numpy
%pip install pandas
%pip install scikit-learn
%pip install nltk
%pip install gensim
%pip install pyLDAvis
%pip install vaderSentiment
%pip install wordcloud
%pip install matplotlib
"""

In [1]:
import numpy as np
import random
np.random.seed(42)

from pprint import pprint

# --------------- Dataset ------------- #
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# --------------- Pre-Processing -------- #
import re
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

# --------------- LDA Model ---------- #
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
# --------------- Sentiment Analysis --------- #
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


# --------------- 

# --------------- Visualize and Report ----------- #
import matplotlib.pyplot as plt
from wordcloud import WordCloud

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\robyd\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\robyd\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\robyd\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **1. Data Collection:**
Now, let's load the 20 Newsgroups dataset and focus on 5 categories for training:
http://qwone.com/~jason/20Newsgroups/

In [2]:
# Verify train dataset is balanced
baseball_train = fetch_20newsgroups(subset='train', categories=['rec.sport.baseball'], remove=('headers', 'footers', 'quotes'))
print("Baseball dataset size: ", len(baseball_train.data))

hardware_train = fetch_20newsgroups(subset='train', categories=['comp.sys.ibm.pc.hardware'], remove=('headers', 'footers', 'quotes'))
print("Hardware dataset size: ", len(hardware_train.data))

med_train = fetch_20newsgroups(subset='train', categories=['sci.med'], remove=('headers', 'footers', 'quotes'))
print("Med dataset size: ", len(med_train.data))

space_train = fetch_20newsgroups(subset='train', categories=['sci.space'], remove=('headers', 'footers', 'quotes'))
print("Space dataset size: ", len(space_train.data))

guns_train = fetch_20newsgroups(subset='train', categories=['talk.politics.guns'], remove=('headers', 'footers', 'quotes'))
print("Guns dataset size: ", len(guns_train.data))

crypt_train = fetch_20newsgroups(subset='train', categories=['sci.crypt'], remove=('headers', 'footers', 'quotes'))
print("Crypt dataset size: ", len(crypt_train.data))

Baseball dataset size:  597
Hardware dataset size:  590
Med dataset size:  594
Space dataset size:  593
Guns dataset size:  546
Crypt dataset size:  595


In [3]:
# Create train dataset with selected categories
train_categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.baseball', 'sci.med', 'sci.space', 'talk.politics.guns']
newsgroups_train = fetch_20newsgroups(subset='train', categories=train_categories, remove=('headers', 'footers', 'quotes'))
#newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

print(newsgroups_train.target_names)
print(newsgroups_train.target[:30])

['comp.sys.ibm.pc.hardware', 'rec.sport.baseball', 'sci.med', 'sci.space', 'talk.politics.guns']
[1 4 2 3 3 1 4 4 2 1 0 3 4 1 2 3 0 0 3 1 3 2 0 0 4 0 2 4 3 3]


For the test set, I will create a custom dataset with 70% talk.politics.guns and 30% sci.space documents:

In [4]:
# Create test dataset with 100 samples: 70% guns and 30% space
n_test_docs = 100
n_guns_docs = int(n_test_docs * 0.7)
n_space_docs = n_test_docs - n_guns_docs

# Fetch data for each category
guns_test = fetch_20newsgroups(subset='test', categories=['talk.politics.guns'], remove=('headers', 'footers', 'quotes'))
space_test = fetch_20newsgroups(subset='test', categories=['sci.space'], remove=('headers', 'footers', 'quotes'))

# Randomly select the desired number of documents from each category
guns_indices = np.random.choice(len(guns_test.data), n_guns_docs, replace=False)
space_indices = np.random.choice(len(space_test.data), n_space_docs, replace=False)

# Create the test dataset
test_data = [guns_test.data[i] for i in guns_indices] + [space_test.data[i] for i in space_indices]
# Each single-category fetch labels all of its documents 0, so assign the labels explicitly:
# 0 = talk.politics.guns, 1 = sci.space (matching the order of target_names below)
test_target = np.concatenate((np.zeros(n_guns_docs, dtype=int), np.ones(n_space_docs, dtype=int)))

# Create the newsgroups_test dataset
newsgroups_test = {
    'data': test_data,
    'target': test_target,
    'target_names': ['talk.politics.guns', 'sci.space']
}

print("Newsgroups test dataset size: ", len(newsgroups_test['data']))

Newsgroups test dataset size:  100


In [5]:
print(test_data[0])

In the wake of the Waco denouement, I had email discussions with
people from this group.  In particular, we discussed how cults
operate, why the FBI might be motivated to black out news or behave
the way it did, and what kinds of problems are involved in dealing
with cults and similar organizations.

I include an edited account of what I wrote.  The identity of my
correspondents have (I hope) been erased.  The editing process makes
the text choppy - sorry about that.  I've tried to retain the
information content.

Ellipses (...) indicate where text was removed.  A few of the comments
in parentheses are new, intended to make it easier for outsiders to
understand.

These notes are preliminary - feel free to criticize.

Cheers(?),
Oded

------------------------ (begin included text) -----------------------

I took a course called the MADNESS OF CROWDS, ...  The course included
cults and briefly mentioned/analyzed Jonestown.  (Did some external
reading too).

William Adorno ... edited a se

In [6]:
print("Train dataset size: ", len(newsgroups_train.data))
print("Train topics are:\n",newsgroups_train.target_names)

print("\nTest dataset size: ", len(test_data))

Train dataset size:  2920
Train topics are:
 ['comp.sys.ibm.pc.hardware', 'rec.sport.baseball', 'sci.med', 'sci.space', 'talk.politics.guns']

Test dataset size:  100


# **2. Text Preprocessing:** 
Clean and normalize the text data using tokenization, stopword removal, and stemming/lemmatization. I'll use the `nltk` library for these tasks.

In [7]:
def preprocess_text(data):
    """Return (stemmed, lemmatized) token lists for each document in `data`."""
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    def tokenize(text):
        # Lowercase, then drop stopwords and any token that is not purely alphanumeric (e.g. punctuation)
        return [word for word in word_tokenize(text.lower()) if word.isalnum() and word not in stop_words]

    # Tokenization
    tokenized_data = [tokenize(text) for text in data]

    # Stemming
    stemmed_data = [[stemmer.stem(token) for token in text] for text in tokenized_data]

    # Lemmatization (applied to the tokens, not to the stems)
    lemmatized_data = [[lemmatizer.lemmatize(token) for token in text] for text in tokenized_data]

    return stemmed_data, lemmatized_data

In [8]:
stemmed_train_data, lemmatized_train_data = preprocess_text(newsgroups_train.data)

In [9]:
print(stemmed_train_data[:1])

[['brook', 'robinson', 'defens', 'liabil', 'ted', 'william', 'weak', 'hitter', 'even', 'great', 'player', 'declin', 'age']]


In [10]:
print(lemmatized_train_data[:1])

[['brook', 'robinson', 'defensive', 'liability', 'ted', 'williams', 'weak', 'hitter', 'even', 'great', 'player', 'decline', 'age']]


In [11]:
stemmed_test_data, lemmatized_test_data = preprocess_text(test_data)
print(lemmatized_test_data[:1])

[['wake', 'waco', 'denouement', 'email', 'discussion', 'people', 'group', 'particular', 'discussed', 'cult', 'operate', 'fbi', 'might', 'motivated', 'black', 'news', 'behave', 'way', 'kind', 'problem', 'involved', 'dealing', 'cult', 'similar', 'organization', 'include', 'edited', 'account', 'wrote', 'identity', 'correspondent', 'hope', 'erased', 'editing', 'process', 'make', 'text', 'choppy', 'sorry', 'tried', 'retain', 'information', 'content', 'ellipsis', 'indicate', 'text', 'removed', 'comment', 'parenthesis', 'new', 'intended', 'make', 'easier', 'outsider', 'understand', 'note', 'preliminary', 'feel', 'free', 'criticize', 'cheer', 'oded', 'begin', 'included', 'text', 'took', 'course', 'called', 'madness', 'crowd', 'course', 'included', 'cult', 'briefly', 'jonestown', 'external', 'reading', 'william', 'adorno', 'edited', 'series', 'book', 'psychology', 'evil', 'mass', 'movement', 'starting', 'authoritarian', 'personality', 'university', 'chicago', 'press', '1948', 'attempt', 'figure

# **3. Topic Modeling:** 
Apply Latent Dirichlet Allocation (LDA) to identify the main topics in the collected data. In this part I'll write a from-scratch implementation and compare it with the `gensim` version. To evaluate the models I compute coherence scores on the training corpus, then infer topic distributions for the test documents.

For topic modeling, the model can be trained on any similar corpus of text documents. The corpus doesn't have to contain exactly the same topics as the unseen documents, but some overlap is likely to improve performance. For example, to categorize social media posts from a specific platform or about a specific subject, the training set should ideally be gathered from the same or a similar platform or subject.

To train an LDA model that recovers specific topics of interest, however, the dataset needs a substantial number of documents related to those topics.


## Latent Dirichlet Model 

Latent Dirichlet Allocation (LDA) is a generative probabilistic model used in topic modeling. It is a statistical model that allows us to discover latent topics within a collection of documents. LDA assumes that each document in the collection is a mixture of various topics, and each topic is a distribution over words.

It's an unsupervised learning method, meaning that it builds a probabilistic model to identify groups of topics without the need for known class labels. It uses only the distribution of words to mathematically model topics.

Here's a step-by-step explanation of how LDA works:

1. **Initialization**: Choose the number of topics K to extract from the document collection and randomly assign each word in each document to one of the K topics.

2. **Iteration**: Iterate through each word in each document and reassign the word to a topic based on: the proportion of words in the document that belong to the topic, and the proportion of occurrences of the word across all documents that belong to the topic.
   - For each document d:
     - For each word w in document d:
       - Calculate two probabilities:
         - P(topic t | document d): Proportion of words in document d that are currently assigned to topic t.
         - P(word w | topic t): Proportion of assignments to topic t over all documents that come from word w.
       - Reassign word w to a new topic based on the probabilities calculated above.
   
   - Repeat the above step for a fixed number of iterations or until convergence.

3. **Output**: LDA provides two main outputs:
   - The distribution of topics in each document.
   - The distribution of words in each topic.

These distributions can be used to interpret the topics and analyze the relationships between documents and topics.
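
The reassignment rule in step 2 can be written compactly. The following is a sketch of the smoothed form used by collapsed Gibbs sampling, which is what the from-scratch implementation later in this notebook computes. The symbols are assumptions introduced here for illustration: $n_{d,t}$ is the number of words in document $d$ assigned to topic $t$, $n_{t,w}$ is the number of times word $w$ is assigned to topic $t$, $n_d$ and $n_t$ are the corresponding totals, $K$ is the number of topics, $V$ is the vocabulary size, and $\alpha$, $\beta$ are smoothing hyperparameters. All counts exclude the word currently being reassigned.

$$
P(z = t \mid d, w) \;\propto\; \underbrace{\frac{n_{d,t} + \alpha}{n_d + K\alpha}}_{P(\text{topic } t \,\mid\, \text{document } d)} \cdot \underbrace{\frac{n_{t,w} + \beta}{n_t + V\beta}}_{P(\text{word } w \,\mid\, \text{topic } t)}
$$

With $\alpha = \beta = 0$ this reduces to the two plain proportions listed in step 2; the smoothing terms keep every topic reachable even when a count is zero.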



LDA assumes that documents are generated in the following way:
- Choose the number of words in the document from a Poisson distribution.
- Choose a topic mixture for the document from a Dirichlet distribution.
- For each word in the document:
  - Choose a topic from the topic mixture.
  - Choose a word from the topic's word distribution.
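
The generative story above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `generate_document`, the toy topic-word matrix, and the values of `alpha` and `mean_length` are all hypothetical and are not part of the notebook's pipeline.

```python
import numpy as np

def generate_document(topic_word_dists, alpha, mean_length, rng):
    """Sample one document (a list of word ids) following the LDA generative process."""
    num_topics, vocab_size = topic_word_dists.shape
    # 1. Document length from a Poisson distribution (forced to be at least one word)
    n_words = max(1, rng.poisson(mean_length))
    # 2. Topic mixture for the document from a symmetric Dirichlet distribution
    theta = rng.dirichlet(np.full(num_topics, alpha))
    words, topics = [], []
    for _ in range(n_words):
        # 3a. Choose a topic from the document's topic mixture
        z = int(rng.choice(num_topics, p=theta))
        # 3b. Choose a word from that topic's word distribution
        w = int(rng.choice(vocab_size, p=topic_word_dists[z]))
        topics.append(z)
        words.append(w)
    return words, topics, theta

rng = np.random.default_rng(0)
# Hypothetical toy model: 2 topics over a 6-word vocabulary
toy_topics = np.array([
    [0.4, 0.4, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.3, 0.4],
])
words, topics, theta = generate_document(toy_topics, alpha=0.5, mean_length=8, rng=rng)
print("topic mixture:", theta.round(2))
print("topics:", topics)
print("words: ", words)
```

Training LDA amounts to running this process in reverse: given only the words, estimate the topic mixtures and the topic-word distributions that most plausibly produced them.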

LDA is widely used in natural language processing and text mining tasks, such as document clustering, document classification, and information retrieval. It helps uncover the underlying themes or topics in a collection of documents, making it easier to analyze and organize large amounts of textual data[1].


Citations:
[1] https://www.semanticscholar.org/paper/b98a4076b48552691bb99290106a378e483cdfca
[2] https://www.semanticscholar.org/paper/03ba268430128916e195e8d1a88c761f3c9d7578
[3] https://arxiv.org/abs/1309.3421
[4] https://www.semanticscholar.org/paper/c80db2cd1b127ec86060ad018c04cd0c48075ae3
[5] https://www.semanticscholar.org/paper/1713b2a9291d76c02feb49376422d800d5e44888
[6] https://www.semanticscholar.org/paper/59c902e7797889bad1f731205a409ade2913199a


The documents can come from any domain as long as they contain text. For example, they could be customer reviews, news articles, research papers, social media posts, etc. The words in the documents are collected into n-grams (a contiguous sequence of n items from a given sample of text or speech) and used to create a dictionary. This dictionary is then used to train the LDA model. 
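
As a small illustration of what "n-grams" means here, the sketch below extracts unigrams and bigrams from a token list. The helper name `ngrams` and the example tokens are hypothetical. Note that this notebook builds its dictionary from single tokens only, i.e. unigrams.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["space", "shuttle", "launch", "delayed"]
print(ngrams(tokens, 1))  # unigrams: one tuple per token
print(ngrams(tokens, 2))  # bigrams: pairs of adjacent tokens
```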

It's important to note that the text in the documents should be preprocessed before being used for training the LDA model. This preprocessing can include removing stop words (commonly used words such as 'the', 'a', 'an', 'in'), lowercasing all the words, and lemmatizing the words (reducing inflectional forms and sometimes derivationally related forms of a word to a common base form).

When configuring an LDA model, the parameters that can be set typically include (names vary by library) the rho parameter (a prior on the sparsity of the topic distributions), the alpha parameter (a prior on the sparsity of the per-document topic weights), the estimated number of documents, the batch size, the initial iteration value used in the learning-rate update schedule, the power applied to the iteration count during updates, and the number of passes over the data.


As a result of the training, each document will be represented as a combination of topics, and each topic will be represented as a distribution over words. This can be used to classify new documents, identify related terms, and create recommendations.

![LDA](https://www.researchgate.net/profile/Diego-Buenano-Fernandez/publication/339368709/figure/fig1/AS:860489982689280@1582168207260/Schematic-of-LDA-algorithm.png)

In [12]:
# Set the number of topics to extract from documents
num_topics = 5

# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(lemmatized_train_data)

# Create a bag-of-words representation of the documents
train_corpus = [dictionary.doc2bow(text) for text in lemmatized_train_data]

In [13]:
# human-readable format of corpus (term-frequency)
[[(dictionary[id], freq) for id, freq in cp] for cp in train_corpus[:1]]

[[('age', 1),
  ('brook', 1),
  ('decline', 1),
  ('defensive', 1),
  ('even', 1),
  ('great', 1),
  ('hitter', 1),
  ('liability', 1),
  ('player', 1),
  ('robinson', 1),
  ('ted', 1),
  ('weak', 1),
  ('williams', 1)]]

## Gensim version

**Building the Topic Model**

Let's now build the topic model. We'll define 5 topics to start with. The hyperparameter `alpha` affects the sparsity of the document-topic (theta) distributions; in gensim its default is `'symmetric'`, a fixed prior of 1 / num_topics per topic, while the `alpha='auto'` setting used below learns an asymmetric prior from the corpus. Similarly, the hyperparameter `eta` affects the sparsity of the topic-word distributions.

https://www.kaggle.com/code/datajameson/topic-modelling-nlp-amazon-reviews-bbc-news
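
To make the effect of `alpha` on sparsity concrete, here is a minimal NumPy sketch, independent of gensim. It draws document-topic mixtures from a symmetric Dirichlet prior and reports how much weight the dominant topic receives on average. The helper name `mean_dominant_weight` and the chosen alpha values are assumptions for illustration only.

```python
import numpy as np

def mean_dominant_weight(alpha, num_topics=5, n_docs=1000, seed=42):
    """Average weight of the largest topic in mixtures drawn from a symmetric Dirichlet."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(num_topics, alpha), size=n_docs)
    return theta.max(axis=1).mean()

for alpha in (0.1, 1.0, 10.0):
    # A value close to 1 means most documents are dominated by a single topic (sparse mixtures)
    print(f"alpha={alpha}: mean dominant-topic weight = {mean_dominant_weight(alpha):.2f}")
```

Small values of alpha concentrate each document on few topics, while large values push the mixtures towards a uniform spread over all topics.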

In [14]:
# Train the LDA model using the processed training data
lda_model = LdaModel(corpus=train_corpus, id2word=dictionary, num_topics=num_topics, random_state=42, passes=10, alpha='auto', per_word_topics=True)

In [15]:
# Mapping between topic number and category name
# (LDA topics are unlabeled, so the names are assigned manually from each topic's top words)
topic_category_mapping = {
    0: 'Guns',
    1: 'Baseball',
    2: 'Med',
    3: 'Hardware',
    4: 'Space'
}

In [16]:
# Print topics and associated category names
for topic_num, topic in lda_model.show_topics(num_topics=num_topics, formatted=False):
    topic_words = [word for word, _ in topic]
    category_name = topic_category_mapping.get(topic_num, f'Unknown Category {topic_num}')
    print(f"Topic {topic_num} ({category_name}) | Words: {topic_words}\n")

Topic 0 (Guns) | Words: ['would', 'one', 'gun', 'like', 'get', 'know', 'time', 'people', 'problem', 'thing']

Topic 1 (Baseball) | Words: ['would', 'year', 'one', 'game', 'think', 'get', 'good', 'last', 'team', 'well']

Topic 2 (Med) | Words: ['center', 'research', 'medical', 'space', 'health', '1993', 'disease', 'cancer', 'information', 'use']

Topic 3 (Hardware) | Words: ['drive', '1', '0', '2', 'card', 'controller', '3', 'scsi', 'disk', '4']

Topic 4 (Space) | Words: ['space', 'launch', 'satellite', 'system', 'data', 'nasa', 'also', 'orbit', 'program', 'year']



Let's now evaluate the model using the coherence score.

In [None]:
# Calculate the coherence score to evaluate the model
coherence_model_lda = CoherenceModel(model=lda_model, texts=lemmatized_train_data, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score:', coherence_lda)

In [17]:
def get_topic_distribution(lda_model, dictionary, document):
    # Preprocess the document
    _, preprocessed_document = preprocess_text([document])
    # Convert the document into BoW format
    bow_document = dictionary.doc2bow(preprocessed_document[0]) # Here we are assuming that preprocessed_document is a list of lists
    
    # Get the topic distribution
    topic_distribution = lda_model.get_document_topics(bow_document, minimum_probability=0.2)
    return topic_distribution

In [18]:
test_topic_distributions = [get_topic_distribution(lda_model, dictionary, text) for text in test_data]

# Display the topic distribution for all test documents
for i, topic_dist in enumerate(test_topic_distributions):
    # Sort the topic distribution by probability in descending order
    sorted_topic_dist = sorted(topic_dist, key=lambda x: -x[1])
    # Create a list to store the formatted topics
    formatted_topics = []

    # Format and store each topic
    for topic_id, probability in sorted_topic_dist:
        # Associate label to the topic
        category_name = topic_category_mapping.get(topic_id, f'Unknown Category {topic_id}')

        formatted_topic = f"[{topic_id}] {category_name} {probability:.2f}"
        formatted_topics.append(formatted_topic)

    # Join the formatted topics into a string
    formatted_topics_str = " - ".join(formatted_topics)

    print(f"Document {i + 1} topics : {formatted_topics_str}")

Document 1 topics : [0] Guns 0.50 - [1] Baseball 0.43
Document 2 topics : [0] Guns 0.90
Document 3 topics : [1] Baseball 0.55 - [0] Guns 0.45
Document 4 topics : [0] Guns 0.94
Document 5 topics : [1] Baseball 0.48 - [0] Guns 0.34
Document 6 topics : [0] Guns 0.67 - [1] Baseball 0.30
Document 7 topics : [0] Guns 0.60 - [1] Baseball 0.40
Document 8 topics : [0] Guns 0.76
Document 9 topics : [0] Guns 0.85
Document 10 topics : [0] Guns 0.76
Document 11 topics : [0] Guns 0.72
Document 12 topics : [0] Guns 0.90
Document 13 topics : [0] Guns 0.73
Document 14 topics : [1] Baseball 0.65 - [0] Guns 0.35
Document 15 topics : [1] Baseball 0.49 - [0] Guns 0.48
Document 16 topics : [0] Guns 0.91
Document 17 topics : [4] Space 0.67 - [1] Baseball 0.31
Document 18 topics : [0] Guns 0.50 - [1] Baseball 0.45
Document 19 topics : [0] Guns 0.74 - [1] Baseball 0.26
Document 20 topics : [0] Guns 0.97
Document 21 topics : [0] Guns 0.61 - [1] Baseball 0.26
Document 22 topics : [0] Guns 0.94
Document 23 topics

## Scratch version

In [19]:
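# The input corpus is in gensim bag-of-words format: each document is a list of (word id, word count) tuples.
# Simplification: one topic is assigned per distinct word in a document, and all of its occurrences move together.
def lda_from_scratch(corpus, num_topics, num_iterations=100, alpha=0.1, beta=0.1):
    # Initialize topic assignments randomly (using NumPy's generator, so np.random.seed applies)
    topic_assignments = [[np.random.randint(num_topics) for _ in doc] for doc in corpus]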

    # Initialize topic-word and document-topic count matrices
    num_words = max([word_idx for doc in corpus for word_idx, _ in doc]) + 1
    topic_word_counts = np.zeros((num_topics, num_words))
    doc_topic_counts = np.zeros((len(corpus), num_topics))

    # Count initial topic assignments
    for doc_idx, doc in enumerate(corpus):
        for word_idx, (word, count) in enumerate(doc):
            topic = topic_assignments[doc_idx][word_idx]
            topic_word_counts[topic][word] += count
            doc_topic_counts[doc_idx][topic] += count

    # Perform Gibbs sampling
    for it in range(num_iterations):
        doc_topic_sums = doc_topic_counts.sum(axis=1)
        topic_word_sums = topic_word_counts.sum(axis=1)
        
        for doc_idx, doc in enumerate(corpus):
            for word_idx, (word, count) in enumerate(doc):
                # Remove current topic assignment
                old_topic = topic_assignments[doc_idx][word_idx]
                # Decrement counts for old topic assignment
                topic_word_counts[old_topic][word] -= count
                doc_topic_counts[doc_idx][old_topic] -= count
                doc_topic_sums[doc_idx] -= count
                topic_word_sums[old_topic] -= count

                # Compute probabilities for each topic (conditional distribution for the word)
                p_topic_given_doc = (doc_topic_counts[doc_idx, :] + alpha) / (doc_topic_sums[doc_idx] + num_topics * alpha)
                p_word_given_topic = (topic_word_counts[:, word] + beta) / (topic_word_sums + num_words * beta)
                probabilities = p_topic_given_doc * p_word_given_topic

                # Normalize probabilities
                probabilities /= probabilities.sum()

                # Sample a new topic assignment
                new_topic = np.random.choice(num_topics, p=probabilities)
                topic_assignments[doc_idx][word_idx] = new_topic

                # Update counts for new topic assignment
                topic_word_counts[new_topic][word] += count
                doc_topic_counts[doc_idx][new_topic] += count
                doc_topic_sums[doc_idx] += count
                topic_word_sums[new_topic] += count

        print(f"Iteration {it}")

    return topic_word_counts, doc_topic_counts
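
The two count matrices returned by `lda_from_scratch` are not yet probability distributions. A minimal sketch of how they could be converted, using the same smoothing as the sampling step, is shown below. The helper name `counts_to_distributions` and the toy count matrices are hypothetical and only serve as an illustration.

```python
import numpy as np

def counts_to_distributions(topic_word_counts, doc_topic_counts, alpha=0.1, beta=0.1):
    """Turn Gibbs-sampling count matrices into smoothed probability distributions."""
    num_topics, num_words = topic_word_counts.shape
    # phi[t, w]: probability of word w under topic t
    phi = (topic_word_counts + beta) / (
        topic_word_counts.sum(axis=1, keepdims=True) + num_words * beta
    )
    # theta[d, t]: proportion of topic t in document d
    theta = (doc_topic_counts + alpha) / (
        doc_topic_counts.sum(axis=1, keepdims=True) + num_topics * alpha
    )
    return phi, theta

# Hypothetical toy counts: 2 topics, 3 words, 2 documents
toy_topic_word = np.array([[4.0, 1.0, 0.0], [0.0, 2.0, 3.0]])
toy_doc_topic = np.array([[5.0, 1.0], [0.0, 4.0]])
phi, theta = counts_to_distributions(toy_topic_word, toy_doc_topic)
print(phi.round(2))    # each row sums to 1: word distribution of a topic
print(theta.round(2))  # each row sums to 1: topic distribution of a document
```

Sorting a row of `phi` in descending order gives the top words of a topic, which is the counterpart of gensim's `show_topics` output for the from-scratch model.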


In [21]:
num_iterations = 30
topic_word_counts, doc_topic_counts = lda_from_scratch(train_corpus, num_topics, num_iterations)

Iteration 0
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Iteration 21
Iteration 22
Iteration 23
Iteration 24
Iteration 25
Iteration 26
Iteration 27
Iteration 28
Iteration 29


In [43]:
def infer_topic_distribution_scratch(document, dictionary, topic_word_counts, doc_topic_counts, alpha=0.1, beta=0.1):
    # Note: doc_topic_counts is not used for inference; it is kept so existing calls keep working
    # Preprocess the document
    _, preprocessed_document = preprocess_text([document])
    # Convert the document into a bag-of-words representation
    bow_document = dictionary.doc2bow(preprocessed_document[0])  # preprocess_text returns a list of token lists
    
    num_topics, num_words = topic_word_counts.shape
    topic_word_sums = topic_word_counts.sum(axis=1)

    doc_topic_counts_inferred = np.zeros(num_topics)

    for word, count in bow_document:
        p_topic_given_doc = (doc_topic_counts_inferred + alpha) / (doc_topic_counts_inferred.sum() + num_topics * alpha)
        p_word_given_topic = (topic_word_counts[:, word] + beta) / (topic_word_sums + num_words * beta)
        probabilities = p_topic_given_doc * p_word_given_topic

        probabilities /= probabilities.sum()

        # Spread the word's count over the topics in proportion to their probabilities
        doc_topic_counts_inferred += count * probabilities

    # The result holds unnormalized expected topic counts.
    # To get a probability distribution, divide by the sum (this fails for empty documents):
    #doc_topic_counts_inferred /= doc_topic_counts_inferred.sum()

    return doc_topic_counts_inferred

In [44]:
test_topic_distributions = [infer_topic_distribution_scratch(text, dictionary, topic_word_counts, doc_topic_counts) for text in test_data]
#print(test_topic_distributions)

In [45]:
# Display the topic distribution for all test documents
for i, topic_dist in enumerate(test_topic_distributions):
    
    # Sort the topic distribution by probability in descending order
    sorted_topic_dist = sorted(enumerate(topic_dist), key=lambda x: -x[1])
    
    # Create a list to store the formatted topics
    formatted_topics = []

    # Format and store each topic
    for topic_id, probability in sorted_topic_dist:
        # Exclude weights under 0.1 (the values are unnormalized expected counts, not probabilities)
        if probability < 0.1:
            continue
        
        # Associate label to the topic
        category_name = topic_category_mapping.get(topic_id, f'Unknown Category {topic_id}')

        formatted_topic = f"[{topic_id}] {category_name} {probability:.2f}"
        formatted_topics.append(formatted_topic)

    # Join the formatted topics into a string
    formatted_topics_str = " - ".join(formatted_topics)

    print(f"Document {i + 1} topics : {formatted_topics_str}")

Document 1 topics : [0] Guns 294.47 - [3] Hardware 105.62 - [1] Baseball 21.52 - [4] Space 15.14 - [2] Med 2.25
Document 2 topics : [3] Hardware 6.74 - [0] Guns 5.78 - [1] Baseball 3.38 - [4] Space 1.88 - [2] Med 0.22
Document 3 topics : [0] Guns 64.80 - [3] Hardware 14.44 - [1] Baseball 4.34 - [4] Space 1.27 - [2] Med 0.15
Document 4 topics : [0] Guns 14.29 - [3] Hardware 12.34 - [1] Baseball 5.06 - [2] Med 0.72 - [4] Space 0.59
Document 5 topics : [0] Guns 16.91 - [3] Hardware 3.11 - [2] Med 2.14 - [4] Space 1.99 - [1] Baseball 0.85
Document 6 topics : [0] Guns 78.92 - [3] Hardware 61.61 - [4] Space 18.66 - [1] Baseball 5.81 - [2] Med 3.00
Document 7 topics : [0] Guns 16.52 - [3] Hardware 14.75 - [1] Baseball 0.44 - [4] Space 0.24
Document 8 topics : [0] Guns 22.27 - [3] Hardware 7.66 - [1] Baseball 2.43 - [4] Space 0.53 - [2] Med 0.11
Document 9 topics : [0] Guns 30.70 - [3] Hardware 17.43 - [4] Space 4.65 - [1] Baseball 2.83 - [2] Med 0.40
Document 10 topics : [0] Guns 39.23 - [3] 

In [None]:
def compute_coherence_score_gensim(lda_model, texts, dictionary, top_n=10):
    # Simple overlap-based score for a gensim model (not the c_v measure):
    # how much of each text consists of a topic's top words, on a log scale
    topics = lda_model.get_topics()  # shape: (num_topics, vocabulary size), word probabilities
    texts = [text for text in texts if text]  # skip empty documents to avoid log(0)
    coherence_score = 0.0
    num_topics = len(topics)
    num_texts = len(texts)

    for topic in topics:
        # Keep the top_n most probable words of the topic
        topic_word_set = {dictionary[word_id] for word_id in np.argsort(topic)[-top_n:]}

        for text in texts:
            # Number of tokens in the text that belong to the topic's top words
            n_common = sum(1 for word in text if word in topic_word_set)

            coherence_score += np.log(n_common + 1) - np.log(len(text))

    coherence_score /= num_topics * num_texts

    return coherence_score


In [None]:
import itertools
from collections import defaultdict
from math import log

def compute_coherence_score(topic_word_counts, lemmatized_corpus, dictionary, top_n=10):
    # Get top N words for each topic
    top_words = [[dictionary[i] for i in np.argsort(topic_word_counts[t])[-top_n:]] for t in range(len(topic_word_counts))]
    vocabulary = {word for topic in top_words for word in topic}

    # Count document frequencies and pairwise co-occurrences, restricted to the top words
    doc_freq = defaultdict(int)
    co_occurrences = defaultdict(int)
    for doc in lemmatized_corpus:
        present = sorted(set(doc) & vocabulary)
        for word in present:
            doc_freq[word] += 1
        for w1, w2 in itertools.combinations(present, 2):
            co_occurrences[(w1, w2)] += 1
            co_occurrences[(w2, w1)] += 1

    # Compute a UMass-style coherence score: log((D(w_i, w_j) + 1) / D(w_i)) averaged over word pairs
    coherence_score = 0
    for topic in top_words:
        topic_score = 0
        for i, word_i in enumerate(topic[:-1]):
            for word_j in topic[i+1:]:
                # Pairs that never co-occur are skipped; otherwise doc_freq[word_i] is at least 1
                if (word_i, word_j) in co_occurrences:
                    topic_score += log((co_occurrences[(word_i, word_j)] + 1) / doc_freq[word_i])
        coherence_score += topic_score / (top_n * (top_n - 1) / 2)

    return coherence_score / len(top_words)

In [None]:
coherence_score = compute_coherence_score(topic_word_counts, lemmatized_train_data, dictionary, top_n=10)
print("Coherence Score:", coherence_score)

# **4. Sentiment Analysis:**
Determine the sentiment of the content using the VADER sentiment analyzer from the `vaderSentiment` library.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based, lexicon-driven sentiment analysis tool specifically designed for social media texts. It doesn't require preprocessing like tokenization, stemming, or lemmatization.

In [None]:
# Initialize the VADER sentiment analyzer once and reuse it for every document
analyzer = SentimentIntensityAnalyzer()

# Function to analyze sentiment using VADER
def get_sentiment(text):
    # Returns a dict with 'neg', 'neu', 'pos' and 'compound' scores
    sentiment_scores = analyzer.polarity_scores(text)
    return sentiment_scores

In [None]:
test_sentiments = [get_sentiment(text) for text in test_data]

In [None]:
# Display the sentiment scores for all test documents
scores_list = []
for i, sentiment in enumerate(test_sentiments):
    scores_list.append(sentiment['compound'])
    print(f"Document {i + 1}: {sentiment}")
    
print(f"\nAverage sentiment: {np.mean(scores_list)}")
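
The compound score is a single number between -1 and 1. To obtain the positive, negative, or neutral label mentioned in the project outline, it can be thresholded. The sketch below uses the cut-offs of ±0.05 that are commonly suggested for VADER. The helper name `label_sentiment` and the example scores are hypothetical, and the threshold should be treated as a tunable assumption.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores, for illustration only
for score in (0.62, -0.31, 0.01):
    print(score, label_sentiment(score))
```

Applied to this notebook, the function would be called as `label_sentiment(sentiment['compound'])` for each entry of `test_sentiments`.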

# **5. Summarization:**
Generate summaries of the relevant content using extractive summarization based on word frequency. For this, I'll follow these steps:
- 1. Split the text into sentences.
- 2. Tokenize the sentences.
- 3. Calculate the frequency of each word in the text.
- 4. Assign a score to each sentence based on the frequency of the words in the sentence.
- 5. Select the top N sentences with the highest scores as the summary.

This is a simple implementation of extractive summarization that relies only on the preprocessing function defined earlier, not on a dedicated summarization library. Note that this approach considers neither the semantic meaning of words nor the coherence of the resulting summary. More advanced techniques, such as word embeddings or graph-based methods, can improve the quality of the summary.

In [None]:
def extractive_summarization(text, n_sentences=3):
    # Split the text into sentences (naive split on periods) and drop empty fragments
    sentences = [s.strip() for s in text.strip().split('.') if s.strip()]

    # Tokenize and preprocess each sentence once, counting word frequencies as we go
    word_freq = {}
    sentence_tokens = []
    for sentence in sentences:
        stemmed_tokens, _ = preprocess_text([sentence])
        # Flatten the stemmed_tokens list
        tokens = [token for sublist in stemmed_tokens for token in sublist]
        sentence_tokens.append(tokens)
        for token in tokens:
            word_freq[token] = word_freq.get(token, 0) + 1

    # Score each sentence as the sum of the frequencies of its tokens
    # (keyed by position so that repeated sentences do not collide)
    sentence_scores = {}
    for idx, tokens in enumerate(sentence_tokens):
        if tokens:
            sentence_scores[idx] = sum(word_freq[token] for token in tokens)

    # Select the top N sentences with the highest scores
    top_indices = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n_sentences]
    summary = '. '.join(sentences[idx] for idx in top_indices)

    return summary

In [None]:
summaries = [extractive_summarization(text, n_sentences=1) for text in test_data]

# Print summary for every document in the test set
for i, summary in enumerate(summaries):
    print(f"Summary {i + 1}: {summary}")
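Because each sentence score is a sum over its tokens, long sentences tend to win regardless of content. One common variation is to divide the score by the number of tokens. The sketch below illustrates this under stated assumptions: `simple_tokens` is a hypothetical regex tokenizer standing in for `preprocess_text` (so there is no stopword removal or stemming), and the toy text is invented.

```python
import re
from collections import Counter

def simple_tokens(sentence):
    # Hypothetical stand-in for preprocess_text: lowercase alphabetic tokens only
    return re.findall(r"[a-z]+", sentence.lower())

def summarize_normalized(text, n_sentences=1):
    # Naive sentence split on periods
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    tokens_per_sentence = [simple_tokens(s) for s in sentences]
    word_freq = Counter(tok for toks in tokens_per_sentence for tok in toks)

    # Average word frequency per sentence instead of the raw sum
    scores = [
        sum(word_freq[tok] for tok in toks) / len(toks) if toks else 0.0
        for toks in tokens_per_sentence
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return '. '.join(sentences[i] for i in ranked[:n_sentences])

# Toy text, for illustration only
toy_text = (
    "The gun law passed. "
    "The gun law debate went on for many long tedious hours yesterday evening. "
    "Baseball season starts soon."
)
print(summarize_normalized(toy_text, n_sentences=1))
```

With raw sums the long second sentence would score highest. With the average, the short sentence made up mostly of frequent words is selected instead.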

# **6. Visualization and Reporting:**
Visualize the results in an intuitive dashboard or report format, showing the distribution of topics, sentiment scores, and summaries of the relevant content. 

In [46]:
# Merge all the texts in the test set
merged_test_text = ' '.join(test_data)

## Main topic

### Gensim version

In [47]:
# Get the topic distribution for the merged text
merged_text_topic_distribution = get_topic_distribution(lda_model, dictionary, merged_test_text)

In [48]:
print(f"\nMain topics distribution:")

# Display the topic distribution of the merged text, most probable topic first
for topic_id, probability in sorted(merged_text_topic_distribution, key=lambda x: -x[1]):
    topic_name = topic_category_mapping.get(topic_id, f'Unknown Category {topic_id}')
    print(f"[{topic_id}] {topic_name} -> {probability}")

# Identify the main topic as the one with the highest probability
main_topic = max(merged_text_topic_distribution, key=lambda x: x[1])[0]
topic_name = topic_category_mapping.get(main_topic, f'Unknown Category {main_topic}')

# Display the main topic keywords
main_topic_keywords = lda_model.show_topic(main_topic)
print(f"\nMain {topic_name} keywords:\n{main_topic_keywords}")


Main topics distribution:
[0] Guns -> 0.6212282776832581
[1] Baseball -> 0.25317147374153137

Main Guns keywords:
[('would', 0.009276755), ('one', 0.008098404), ('gun', 0.0074029597), ('like', 0.005558507), ('get', 0.005383595), ('know', 0.005368865), ('time', 0.005036473), ('people', 0.0050030183), ('problem', 0.004014722), ('thing', 0.0036522488)]


### Scratch version

In [49]:
# Get the topic distribution for the document
topic_distribution = infer_topic_distribution_scratch(merged_test_text, dictionary, topic_word_counts, doc_topic_counts, alpha=0.1, beta=0.1)

In [52]:
print(f"\nMain topics distribution:")

# Display the topic distribution, highest score first
# (the scratch inference returns unnormalized scores rather than probabilities)
sorted_topic_dist = sorted(enumerate(topic_distribution), key=lambda x: -x[1])
formatted_topics = []
for topic_id, score in sorted_topic_dist:
    topic_name = topic_category_mapping.get(topic_id, f'Unknown Category {topic_id}')
    formatted_topics.append(f"[{topic_id}] {topic_name} -> {score}")
print(" - ".join(formatted_topics))

# Identify the main topic based on the highest probability
main_topic = sorted_topic_dist[0][0]
topic_name = topic_category_mapping.get(main_topic, f'Unknown Category {main_topic}')

# Display the main topic keywords
main_topic_keywords = [
    dictionary[word_id]
    for word_id, _ in sorted(enumerate(topic_word_counts[main_topic]), key=lambda x: -x[1])[:10]
]
print(f"\nMain {topic_name} keywords:\n{main_topic_keywords}")


Main topics distribution:
[0] Guns -> 5045.698170268956 - [3] Hardware -> 2469.2684302414887 - [4] Space -> 785.5779189439983 - [1] Baseball -> 93.53890919586722 - [2] Med -> 68.9165713496919

Main Guns keywords:
['would', 'one', 'year', 'get', 'like', 'think', 'people', 'good', 'know', 'time']
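
The scratch scores printed above are not on a 0 to 1 scale, so they are hard to compare with the Gensim probabilities. A minimal sketch of one way to make them comparable is to divide each score by the total. The values below are copied from the output above; whether `infer_topic_distribution_scratch` is meant to normalize internally is not shown here, so treat this as an assumption about post-processing.

```python
# Unnormalized topic scores copied from the scratch output above
topic_scores = {
    "Guns": 5045.698170268956,
    "Hardware": 2469.2684302414887,
    "Space": 785.5779189439983,
    "Baseball": 93.53890919586722,
    "Med": 68.9165713496919,
}

# Divide by the total so the values sum to 1
total = sum(topic_scores.values())
topic_proportions = {name: score / total for name, score in topic_scores.items()}

for name, proportion in topic_proportions.items():
    print(f"{name}: {proportion:.3f}")
```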


## Average sentiment

In [None]:
merged_text_sentiment = get_sentiment(merged_test_text)

print(f"Global Text Sentiment: {merged_text_sentiment}")

In [None]:
# Function to visualize sentiment scores
def visualize_sentiment(sentiment_scores):
    labels = ['Positive', 'Neutral', 'Negative']
    values = [sentiment_scores['pos'], sentiment_scores['neu'], sentiment_scores['neg']]

    plt.bar(labels, values)
    plt.xlabel('Sentiment')
    plt.ylabel('Score')
    plt.title('Sentiment Analysis')
    plt.show()
    
visualize_sentiment(merged_text_sentiment)

## Summary

In [None]:
summary = extractive_summarization(merged_test_text, n_sentences=3)
print(summary)

## Word distribution

In [None]:
# Function to generate a word cloud
def generate_wordcloud(texts):
    all_text = ' '.join(texts)
    wordcloud = WordCloud(width=800, height=800, background_color='white', min_font_size=5, max_words=100).generate(all_text)
    plt.figure(figsize=(8, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud of Texts')
    plt.show()

In [None]:
# Generate word cloud
generate_wordcloud([merged_test_text])
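
To tie the pieces together in a report format, the main topic, sentiment scores, and summary can be combined into a single text block. The sketch below is one possible layout. `build_report` is a hypothetical helper, and the values passed to it are placeholders rather than results from this notebook.

```python
def build_report(main_topic, keywords, sentiment, summary):
    """Assemble a short plain-text report (hypothetical helper)."""
    lines = [
        "=== Social Media Monitor Report ===",
        f"Main topic: {main_topic}",
        f"Keywords:   {', '.join(keywords)}",
        f"Sentiment:  compound={sentiment['compound']:.2f} "
        f"(pos={sentiment['pos']:.2f}, neu={sentiment['neu']:.2f}, neg={sentiment['neg']:.2f})",
        f"Summary:    {summary}",
    ]
    return "\n".join(lines)

# Placeholder values for illustration only, not results from the notebook
report = build_report(
    main_topic="Guns",
    keywords=["gun", "people", "problem"],
    sentiment={"pos": 0.10, "neu": 0.80, "neg": 0.10, "compound": 0.00},
    summary="(summary text goes here)",
)
print(report)
```

In the notebook, the arguments would come from `topic_name`, `main_topic_keywords`, `merged_text_sentiment`, and `summary` respectively.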