# Social Media Monitor

In this project I'll implement, in Python, a social media monitor that
tracks topics or trends from social media or blogs.
I'll use the scikit-learn 20 newsgroups dataset, which comprises around 18,000
newsgroup posts on 20 topics.
I'll then pre-process the text with the nltk library, applying
stopword removal, stemming and lemmatization.
Once the text is pre-processed, I'll apply a model to identify the main
topic. The idea is to implement the Latent Dirichlet
Allocation model from scratch and compare my version with the
versions provided by the gensim or sklearn libraries.
I'll perform sentiment analysis using the VADER library.
I'll generate a summary of the relevant content using extractive
summarization based on word frequencies, without using any
library.
Finally, I'll visualize the results, showing the distribution of topics,
sentiment scores and summaries of relevant content. In this phase I'll
use the matplotlib and wordcloud libraries.

I found the following interesting papers about LDA:

https://arxiv.org/pdf/1711.04305.pdf

https://ai.stanford.edu/~ang/papers/jair03-lda.pdf


## IDEA
- Create 2 balanced datasets (e.g. 100 tweets each): Covid and Champions League
- Train LDA on both
- Create a new mixed dataset (70-30%)
- Identify the topics of the mixed dataset
- Sentiment
- Summary
- Plot
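
The mixing step in the list above could be sketched as follows. This is only an illustration under stated assumptions: the helper `build_mixed_dataset`, the placeholder tweets and the fixed seed are hypothetical and not part of the project code.

```python
import random

def build_mixed_dataset(docs_a, docs_b, size, share_a=0.7, seed=42):
    """Sample `size` documents, about `share_a` of them from docs_a."""
    rng = random.Random(seed)
    n_a = round(size * share_a)
    n_b = size - n_a
    # Keep the source label next to each document so the mix can be checked later
    mixed = [(doc, "covid") for doc in rng.sample(docs_a, n_a)]
    mixed += [(doc, "champions") for doc in rng.sample(docs_b, n_b)]
    rng.shuffle(mixed)
    return mixed

# Placeholder corpora standing in for the two balanced tweet sets
covid_tweets = [f"covid tweet {i}" for i in range(100)]
champions_tweets = [f"champions tweet {i}" for i in range(100)]

mixed = build_mixed_dataset(covid_tweets, champions_tweets, size=100)
n_covid = sum(1 for _, label in mixed if label == "covid")
print(n_covid, len(mixed) - n_covid)
```

Keeping the source label alongside each document makes it possible to compare the topics LDA finds against the known 70-30 split.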

I am creating a custom media monitor that tracks specific topics or trends across various platforms, such as news articles, blog posts, and social media. This project can help businesses or individuals stay up to date with the latest developments and discussions related to their areas of interest.

To implement this project, I'll follow these steps:

1. **Data Collection**: Gather data from various sources like news websites, blogs, and social media using APIs, web scraping, or RSS feeds.
2. **Text Preprocessing**: Clean and normalize the text data (tokenization, stopword removal, stemming/lemmatization).
3. **Topic Modeling**: Employ topic modeling techniques like Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF) to identify the main topics or themes present in the collected data. This will help you filter relevant content based on the topics of interest.
4. **Sentiment Analysis**: Determine the sentiment of the content (positive, negative, or neutral) using techniques like rule-based approaches (e.g., VADER sentiment analyzer) or pre-trained models (e.g., TextBlob).
5. **Summarization**: Generate summaries of the relevant content using extractive or abstractive summarization techniques, so that users can quickly grasp the main points without reading the entire text.
6. **Visualization and Reporting**: Visualize the results in an intuitive dashboard or report format, showing the distribution of topics, sentiment scores, and summaries of the relevant content.

# Libraries

In [None]:
"""
%pip install numpy
%pip install pandas
%pip install scikit-learn
%pip install nltk
%pip install gensim
%pip install pyLDAvis
%pip install vaderSentiment
%pip install wordcloud
%pip install matplotlib
"""

from pprint import pprint

# **1. Data Collection:** 
Gather data from various sources using APIs or web scraping techniques. For example, you can use the requests library for accessing APIs and BeautifulSoup for web scraping. In this case I'll use the `fetch_20newsgroups` loader from `sklearn.datasets`.

https://imerit.net/blog/top-25-twitter-datasets-for-natural-language-processing-and-machine-learning-all-pbm/



Boxing https://www.kaggle.com/datasets/bwandowando/boxing-twitter-dataset
space race https://www.kaggle.com/datasets/amartyanambiar/the-space-race-tweets
crypto https://www.kaggle.com/datasets/gauravduttakiit/bitcoin-tweets-16m-tweets-with-sentiment-tagged
airline (pos) https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
avengers https://www.kaggle.com/datasets/kavita5/twitter-dataset-avengersendgame
covid (neg) https://data.gesis.org/tweetscov19/



topic modeling lda https://www.kaggle.com/code/datajameson/topic-modelling-nlp-amazon-reviews-bbc-news

## Twitter dataset
Filter the negative tweets from the TweetsCOV19 dataset:
https://data.gesis.org/tweetscov19/

In [None]:
import pandas as pd

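# Load the dataset (tab-separated, no header row)
# Column order follows the TweetsCOV19 documentation (12 fields per tweet)
column_names = ["tweet_id", "username", "timestamp", "followers", "friends",
                "retweets", "favorites", "entities", "sentiment",
                "mentions", "hashtags", "urls"]
tweetscov19_df = pd.read_csv("TweetsCOV19.tsv", sep="\t", header=None, names=column_names)

# Filter the negative tweets
# "sentiment" holds two SentiStrength scores, "positive negative" (e.g. "2 -1");
# here a tweet counts as negative when the negative score outweighs the positive one
def is_negative(sentiment):
    positive, negative = (int(score) for score in sentiment.split())
    return positive + negative < 0

negative_tweets_df = tweetscov19_df[tweetscov19_df["sentiment"].apply(is_negative)]

# Save the negative tweets to a text file
negative_tweets_df.to_csv("negative_tweets.txt", columns=["tweet_id", "sentiment"], index=False)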


## Apple dataset
Filter the positive tweets from the Apple Twitter Sentiment dataset:
https://www.kaggle.com/seriousran/appletwittersentimenttexts

In [None]:
import pandas as pd

# Load the dataset
apple_sentiment_df = pd.read_csv("apple-twitter-sentiment-texts.csv")

# Filter the positive tweets
positive_apple_tweets_df = apple_sentiment_df[apple_sentiment_df["sentiment"] == "positive"]

# Save the positive tweets to a text file
positive_apple_tweets_df.to_csv("positive_apple_tweets.txt", columns=["text", "sentiment"], index=False)


## Stock dataset
https://www.kaggle.com/datasets/yash612/stockmarket-sentiment-dataset

## 20newsgroups

In [93]:
from sklearn.datasets import fetch_20newsgroups

The total size is:  18846

The topics are: 
 ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [94]:
newsgroups_all = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
print("The total size is: ", len(newsgroups_all.data))
print("\nThe topics are: \n",newsgroups_all.target_names)

Train dataset size:  11314
Test dataset size:  3844


In [None]:
# Load the 20newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Select 10 categories for the test set
selected_categories = ['alt.atheism', 'comp.graphics', 'comp.sys.ibm.pc.hardware', 'rec.autos', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns']
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=selected_categories)

print("Train dataset size: ", len(newsgroups_train.data))
print("Test dataset size: ", len(newsgroups_test.data))


# **2. Text Preprocessing:** 
Clean and normalize the text data using tokenization, stopword removal, and stemming/lemmatization. I'll use the `nltk` library for these tasks.

In [96]:
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

In [None]:
# Function to preprocess text
def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text.lower())

    # Remove punctuation and stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token.isalnum() and token not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

    # Lemmatization (alternative to stemming, currently disabled)
    #lemmatizer = WordNetLemmatizer()
    #lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    return stemmed_tokens

# **3. Topic Modeling:** 
Apply Latent Dirichlet Allocation (LDA) to identify the main topics in the collected data. You can use the `gensim` library to perform LDA.

Evaluate both models on the testing set using perplexity and coherence scores.
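
As a reminder of what perplexity measures, here is a minimal sketch that computes it by hand: the exponential of the negative average per-token log-likelihood. The function `perplexity`, the toy `theta`/`phi` matrices and the toy documents are assumptions made for illustration; for held-out documents, `theta` would first have to be inferred by the trained model.

```python
import math

def perplexity(docs, theta, phi):
    """Perplexity of tokenised docs under an LDA-style model.

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k] = P(topic k | document d)
    phi   : phi[k][w]   = P(word w | topic k)
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginal probability of the word, summing over topics
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p_w)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)

# Toy model: 2 topics over a 4-word vocabulary (made-up numbers)
phi = [[0.4, 0.4, 0.1, 0.1],
       [0.1, 0.1, 0.4, 0.4]]
theta = [[0.9, 0.1], [0.1, 0.9]]
docs = [[0, 1, 0], [2, 3, 3]]

model_ppl = perplexity(docs, theta, phi)
# Baseline: every word equally likely under every topic
uniform_ppl = perplexity(docs, [[0.5, 0.5]] * 2, [[0.25] * 4] * 2)
print(model_ppl, uniform_ppl)
```

Lower perplexity is better: a model that fits the documents should score below the uniform baseline, whose perplexity equals the vocabulary size.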

## SCRATCH

## LDA

The main idea of LDA is that it is a generative probabilistic model, where each document is viewed as a mixture of various topics, and each topic is characterized as a distribution over words.

To understand how the LDA model works in the Gensim library, let's break it down step-by-step:

Preparing inputs: The two main inputs to the LDA topic model are the dictionary (`id2word`) and the corpus. The dictionary maps each integer word id to its word, and the corpus is a list with one entry per document, each entry being a list of (word_id, word_frequency) pairs.
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
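
To make the two inputs concrete, the sketch below builds them by hand for two tiny documents. It only mimics the idea behind gensim's `Dictionary` and `doc2bow`; the helper names `build_dictionary` and `doc_to_bow` are hypothetical, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_dictionary(tokenised_docs):
    """Map every distinct token to an integer id (first-seen order)."""
    token2id = {}
    for doc in tokenised_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (word_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["game", "season", "game"], ["space", "orbit", "season"]]
token2id = build_dictionary(docs)
corpus = [doc_to_bow(doc, token2id) for doc in docs]
print(token2id)
print(corpus)
```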

Building the LDA Model: Once the inputs are ready, we can build the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics. Other parameters include alpha and eta (which affect the sparsity of the topics), chunksize (the number of documents to be used in each training chunk), update_every (how often the model parameters should be updated), and passes (the total number of training passes).

Understanding the Output: The LDA model outputs the word distribution for each topic and the topic contribution for each document. Each topic is a collection of dominant keywords that act as typical representatives. By looking at the keywords, you can identify what the topic is about.


-------------------------------


Latent Dirichlet Allocation in a nutshell:

- The order of words in a document does not matter (bag-of-words model).
- A document is a distribution over topics.
- Each topic, in turn, is a distribution over the words of the vocabulary.
- LDA is a probabilistic generative model. It is used to infer hidden variables using a posterior distribution.

Imagine the process of creating a document to be something like this:

1. Choose a distribution over topics.
2. Draw a topic from that distribution and choose a word from the topic. Repeat this for each word in the document.

LDA works backwards along this line: given a bag of words representing a document, which topics could have generated it?
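
The generative story above can be simulated in a few lines. This is a sketch under assumptions: the two topics and their word probabilities are made up, and `sample_dirichlet` and `generate_document` are hypothetical helper names. The Dirichlet draw is built from normalised Gamma samples using only the standard library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(topics, alpha, length, rng):
    """Generate `length` words following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # the document's topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic
        k = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two made-up topics, each a distribution over words
topics = [
    {"virus": 0.5, "vaccine": 0.3, "mask": 0.2},
    {"goal": 0.5, "match": 0.3, "final": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(p, 3) for p in theta])
print(words)
```

Inference in LDA runs this process in reverse: only `words` is observed, and `theta` and the topics have to be estimated.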

http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

https://www.youtube.com/watch?v=DDq3OVp9dNA


--------------------------------------------


Latent Dirichlet Allocation (LDA) is a generative statistical model used for topic modeling. It is an unsupervised learning method, meaning that it builds a probabilistic model to identify groups of topics without the need for known class labels. It uses only the distribution of words to model topics mathematically.

The type of datasets that should be used for training in an LDA model are text documents. These can be a corpus of documents where each document is a collection of words. The LDA algorithm finds the weight of connections between documents and topics and between topics and words.


https://www.baeldung.com/cs/latent-dirichlet-allocation

The documents can come from any domain as long as they contain text. For example, they could be customer reviews, news articles, research papers, social media posts, etc. The words in the documents are collected into n-grams (a contiguous sequence of n items from a given sample of text or speech) and used to create a dictionary. This dictionary is then used to train the LDA model. 

It's important to note that the text in the documents should be preprocessed before being used for training the LDA model. This preprocessing can include removing stop words (commonly used words such as 'the', 'a', 'an', 'in'), lowercasing all the words, and lemmatizing the words (reducing inflectional forms and sometimes derivationally related forms of a word to a common base form).

When configuring the LDA model, some parameters that can be set include the rho parameter (a prior probability for the sparsity of topic distributions), the alpha parameter (a prior probability for the sparsity of per-document topic weights), the estimated number of documents, the size of the batch, the initial value of iteration used in the learning update schedule, the power applied to the iteration during updates, and the number of passes over the data.


As a result of the training, each document will be represented as a combination of topics, and each topic will be represented as a distribution over words. This can be used to classify new documents, identify related terms, and create recommendations.

In [None]:
"""
1.1 Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.

1.2 Words that have 3 or fewer characters are removed.

1.3 Remove all stop words.

1.4 Lemmatize and stem the words.
"""

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer

def lemmatize_stemming(text):
    return SnowballStemmer('english').stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result


The next step is to print the most relevant topics for a text. This can be done by applying the LDA model to a new, unseen document and sorting the resulting topics by score:

In [None]:
unseen_document = 'The unseen document text goes here'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

# print the most probable topics
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))


For topic modeling, the model can be trained on any similar corpus of text documents. The corpus doesn't necessarily have to contain the same topics as the unseen documents, but some overlap would likely improve performance. For example, to categorize social media posts from a specific platform or about a specific subject, the training set should ideally come from the same or a similar platform/subject.

However, to train an LDA model on the specific topics of interest (Covid, Champions League, airlines, Avengers Endgame, space), a dataset with a substantial number of documents on these topics is needed.


Find suitable datasets: for each of the topics, relevant datasets include:
- Covid: LitCovid is a literature hub in PubMed (an online database of biomedical references) specifically catering to Covid-19 related research papers.
- Champions League: TheSportsDB is a free and community-driven sports database that includes data about teams, players, matches, and more.
- Airline Use: the Bureau of Transportation Statistics provides data about the performance and services of US airlines.
- Avengers Endgame: the IMDb dataset could be used for movie reviews. It's a popular movie database that provides user reviews, but it includes reviews for all movies, so the Avengers Endgame reviews would need to be filtered out.
- Space: NASA Datasets features a variety of datasets, from satellite sightseeing to solar dynamics.

## GENSIM
https://www.kaggle.com/code/datajameson/topic-modelling-nlp-amazon-reviews-bbc-news

In [None]:
import gensim

### Training

In [162]:
# Preprocess the 20 newsgroups documents
processed_docs = [preprocess_text(doc) for doc in newsgroups_all.data]

[[('actual', 1),
  ('also', 1),
  ('anyway', 1),
  ('basher', 1),
  ('beat', 1),
  ('better', 1),
  ('bit', 3),
  ('bowman', 1),
  ('confus', 1),
  ('coupl', 1),
  ('devil', 2),
  ('disappoint', 1),
  ('end', 1),
  ('fan', 1),
  ('final', 1),
  ('fo', 1),
  ('fun', 2),
  ('game', 2),
  ('go', 2),
  ('howev', 1),
  ('island', 1),
  ('jagr', 2),
  ('jersey', 1),
  ('kill', 1),
  ('kind', 1),
  ('lack', 1),
  ('let', 1),
  ('lose', 1),
  ('lot', 2),
  ('man', 1),
  ('massacr', 1),
  ('much', 1),
  ('next', 1),
  ('pen', 5),
  ('playoff', 1),
  ('post', 1),
  ('prais', 1),
  ('pretti', 1),
  ('pulp', 1),
  ('put', 1),
  ('puzzl', 1),
  ('recent', 1),
  ('regular', 2),
  ('relief', 1),
  ('reliev', 1),
  ('rule', 1),
  ('season', 2),
  ('see', 1),
  ('show', 1),
  ('sinc', 1),
  ('stat', 1),
  ('sure', 1),
  ('thought', 1),
  ('watch', 1),
  ('wors', 1)]]

In [None]:
# Create a dictionary representation of the documents
dictionary = gensim.corpora.Dictionary(processed_docs)

# Create a bag-of-words representation of the documents
corpus = [dictionary.doc2bow(docs) for docs in processed_docs]


# human-readable format of corpus (term-frequency)
[[(dictionary[id], freq) for id, freq in cp] for cp in corpus[:1]]

**Building the Topic Model**
Let's now build the topic model. The cell below trains with `num_topics=2`; the number of topics can be raised later (for example to 10). The hyperparameter alpha affects the sparsity of the document-topic (theta) distributions; gensim's default is `'symmetric'`, i.e. 1/num_topics for every topic. Similarly, the hyperparameter eta can be specified, which affects the sparsity of the topic-word distributions.
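
To see what "sparsity" means for alpha, the sketch below samples many document-topic distributions from a symmetric Dirichlet and reports the average weight of the dominant topic. It is an illustration only: `dirichlet_sample` and `mean_top_weight` are hypothetical helpers, and the choice of 10 topics and 2000 samples is arbitrary.

```python
import random

def dirichlet_sample(alpha, n_topics, rng):
    """Symmetric Dirichlet sample built from normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_top_weight(alpha, n_topics=10, n_samples=2000, seed=0):
    """Average weight of the largest topic across many sampled documents."""
    rng = random.Random(seed)
    top_weights = (max(dirichlet_sample(alpha, n_topics, rng))
                   for _ in range(n_samples))
    return sum(top_weights) / n_samples

for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_top_weight(alpha), 3))
```

A small alpha concentrates each document on very few topics (the top weight is close to 1), while a large alpha spreads the weight almost evenly. The same reasoning applies to eta and the topic-word distributions.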

https://www.kaggle.com/code/datajameson/topic-modelling-nlp-amazon-reviews-bbc-news

In [None]:
# Train the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15, random_state=42)

### Test

In [179]:
example1 = "Travel: A passport to joy, discovery, and unforgettable memories. From ancient wonders to vibrant cultures, every step is a breathtaking adventure that enriches our souls. Let's embrace the world's beauty, forge connections, and let wanderlust guide us. #Travel #Wanderlust"
example2 = "Traffic: A daily nightmare of frustration and stress. Endless queues, impatient drivers, and never-ending roadworks make our commute an exasperating odyssey. Time slips away, stress soars, and productivity suffers. We long for a smoother journey. #TrafficJam #CommuterWoes"
example3 = "Navigating through the maze of concrete and steel, the daily commute weaves a tale of frustration and stress, as we find ourselves trapped in an unending cycle of traffic. The promise of a productive day slips away as we inch forward at a snail's pace, surrounded by an endless sea of brake lights. The once simple act of reaching our destination has become a Herculean task, where every minute spent behind the wheel feels like an eternity. Gridlock traffic, impatient drivers, and never-ending road construction all conspire to transform our once serene journey into a maddening odyssey of chaos. The symphony of car horns blares in our ears, serving as a constant reminder of the exasperation that envelops us. The mere thought of being confined to our vehicles for hours on end fills us with a sense of despair, as we watch precious moments of our lives slip away, wasted in a sea of congestion. As we inch forward, our stress levels skyrocket, and our patience wears thin. The glaring red numbers on the clock mock us, reminding us of the valuable time slipping away with each passing moment. The promise of a relaxing evening or a well-deserved rest becomes an unattainable dream, overshadowed by the overwhelming burden of the daily commute. The consequences of this traffic nightmare extend beyond our personal lives. It hinders productivity, damages the environment, and takes a toll on our mental and physical well-being. The once vibrant city streets have transformed into battlegrounds of frustration and anger, as we fight for every inch of progress, caught in a relentless cycle of congestion. In this congested world, we yearn for a solution—a respite from the unrelenting grip of traffic. A transformative change that eases our burdens and frees us from this daily ordeal. Until then, we brace ourselves for another arduous journey, steeling our nerves and hoping that one day"
example4 = "Work Struggles: Battling Stress in the 9 to 5 Trenches. Endless deadlines, overwhelming tasks, and the constant race against the clock. The daily grind takes a toll on our well-being. Let's find balance, prioritize self-care, and strive for a healthier work-life equilibrium. #WorkStress #SelfCareNeeded"

4


In [173]:
# Preprocess the four new unseen documents
unseen_docs  = [example1, example2, example4, example3]
preprocessed_unseen_docs = [preprocess_text(doc) for doc in unseen_docs]
print(len(preprocessed_unseen_docs))

[(0,
  '0.008*"would" + 0.007*"one" + 0.005*"use" + 0.005*"like" + 0.005*"peopl" + '
  '0.005*"get" + 0.004*"know" + 0.004*"think" + 0.004*"time" + 0.003*"say"'),
 (1,
  '0.013*"1" + 0.011*"2" + 0.010*"0" + 0.010*"x" + 0.007*"use" + 0.007*"3" + '
  '0.007*"max" + 0.007*"file" + 0.006*"4" + 0.006*"7"')]


In [None]:
# Print the topics
topics = lda_model.print_topics()
pprint(topics)

Let's now evaluate the model using the coherence score.

In [180]:
# coherence score
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_docs, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score:', coherence_lda)
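
For intuition about what a coherence score rewards, here is a hand-rolled sketch of the simpler UMass measure. This is not the `c_v` measure used in the cell above, and the function `umass_coherence` and the toy documents are assumptions made for the example. UMass sums, over pairs of top words, the log of how often the lower-ranked word co-occurs with the higher-ranked one.

```python
import math
from itertools import combinations

def umass_coherence(top_words, tokenised_docs):
    """UMass coherence of one topic from document co-occurrence counts.

    top_words must be ordered from most to least probable.
    """
    doc_sets = [set(doc) for doc in tokenised_docs]

    def doc_freq(*words):
        # Number of documents containing all the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    # combinations() keeps the input order, so `earlier` ranks above `later`
    for earlier, later in combinations(top_words, 2):
        denominator = doc_freq(earlier)
        if denominator == 0:
            continue  # word never appears in the reference corpus
        score += math.log((doc_freq(earlier, later) + 1) / denominator)
    return score

toy_docs = [
    ["game", "team", "season"],
    ["game", "team", "playoff"],
    ["space", "orbit", "launch"],
    ["space", "launch", "team"],
]
coherent = umass_coherence(["game", "team"], toy_docs)
mixed = umass_coherence(["game", "orbit"], toy_docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents give a higher score, which is why a topic mixing unrelated words scores lower.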

In [184]:
# Get the topic distribution for each unseen document
unseen_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_unseen_docs]

unseen_doc_topics = [lda_model.get_document_topics(bow) for bow in unseen_corpus]

Main topic: 0
Probability: 0.9319106638431549
Ratio: 0.9327861873172403
Keywords: [('would', 0.0075205686), ('one', 0.0073870406), ('use', 0.005440685), ('like', 0.005001823), ('peopl', 0.0047082114), ('get', 0.004608978), ('know', 0.0043750065), ('think', 0.0038991002), ('time', 0.003843012), ('say', 0.0034865953)]


In [176]:
num_topics = lda_model.num_topics

# Calculate the average probability of each topic across the unseen documents
avg_topic_distributions = [0] * num_topics
for doc_topics in unseen_doc_topics:
    for topic_id, topic_prob in doc_topics:
        avg_topic_distributions[topic_id] += topic_prob
avg_topic_distributions = [prob / len(unseen_doc_topics) for prob in avg_topic_distributions]


# Identify the main topic based on the highest average topic distribution
main_topic = max(enumerate(avg_topic_distributions), key=lambda x: x[1])[0]
main_topic_prob = max(avg_topic_distributions)

# Calculate the ratio of the main topic
total_prob = sum(avg_topic_distributions)
main_topic_ratio = main_topic_prob / total_prob

print(f"Main topic: {main_topic}")
print(f"Probability: {main_topic_prob}")
print(f"Ratio: {main_topic_ratio}")

# Display the main topic keywords
main_topic_keywords = lda_model.show_topic(main_topic)
print(f"Keywords: {main_topic_keywords}")



[]


In [187]:
# Returns the topics that a particular word belongs to along with its probability in that topic.
word_id = dictionary.token2id['space']
topics = lda_model.get_term_topics(word_id)
print(topics)


Main topic for unseen Document 1: (0, 0.9417257)
Main topic for unseen Document 2: (0, 0.91111666)
Main topic for unseen Document 3: (0, 0.87855476)


In [178]:
# Compare the topic distributions and identify the main topic
main_topic_doc1 = max(unseen_doc_topics[0], key=lambda x: x[1])
main_topic_doc2 = max(unseen_doc_topics[1], key=lambda x: x[1])
main_topic_doc3 = max(unseen_doc_topics[2], key=lambda x: x[1])

print(f"Main topic for unseen Document 1: {main_topic_doc1}")
print(f"Main topic for unseen Document 2: {main_topic_doc2}")
print(f"Main topic for unseen Document 3: {main_topic_doc3}")

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

tokens_list = processed_docs
dictionary = gensim.corpora.Dictionary(tokens_list)

# Create a bag-of-words representation of the documents
corpus = [dictionary.doc2bow(tokens) for tokens in tokens_list]
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, dictionary)
vis

## SKLEARN

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Function to perform LDA
def perform_lda_sklearn(texts_list, num_topics=5, max_features=1000):
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=max_features, stop_words='english')
    data_vectorized = vectorizer.fit_transform(texts_list)

    lda_model = LatentDirichletAllocation(n_components=num_topics, max_iter=5,
                                          learning_method='online', learning_offset=50., random_state=0)
    lda_model.fit(data_vectorized)

    return lda_model, vectorizer

## LDA

In [None]:
import numpy as np

def lda_gibbs_sampling(document_word_matrix, n_topics, n_iter, alpha, beta):
    n_documents, n_words = document_word_matrix.shape

    # Each nonzero (document, word) cell is treated as a single token,
    # so repeated occurrences of a word in a document are not weighted
    doc_ids, word_ids = document_word_matrix.nonzero()

    # Initialize topic assignments randomly
    topic_assignments = np.random.randint(0, n_topics, size=doc_ids.shape)

    # Initialize count matrices
    doc_topic_counts = np.zeros((n_documents, n_topics))
    topic_word_counts = np.zeros((n_topics, n_words))
    topic_counts = np.zeros(n_topics)

    # Update count matrices based on initial topic assignments
    for d, w, z in zip(doc_ids, word_ids, topic_assignments):
        doc_topic_counts[d, z] += 1
        topic_word_counts[z, w] += 1
        topic_counts[z] += 1

    # Perform Gibbs sampling
    for _ in range(n_iter):
        for i, (d, w) in enumerate(zip(doc_ids, word_ids)):
            z = topic_assignments[i]

            # Decrement count matrices
            doc_topic_counts[d, z] -= 1
            topic_word_counts[z, w] -= 1
            topic_counts[z] -= 1

            # Calculate conditional probability
            p_z = (doc_topic_counts[d, :] + alpha) * (topic_word_counts[:, w] + beta) / (topic_counts + beta * n_words)
            p_z /= np.sum(p_z)

            # Sample a new topic assignment
            new_z = np.random.choice(n_topics, p=p_z)

            # Increment count matrices
            doc_topic_counts[d, new_z] += 1
            topic_word_counts[new_z, w] += 1
            topic_counts[new_z] += 1

            # Store the new assignment so the next sweep decrements the right counts
            topic_assignments[i] = new_z

    return doc_topic_counts, topic_word_counts


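n_topics = 5
n_iter = 1000
alpha = 0.1
beta = 0.1

# Assumption: the document-word matrix is built from the training split,
# with the same CountVectorizer settings as perform_lda_sklearn
vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
document_word_matrix = vectorizer.fit_transform(newsgroups_train.data)

doc_topic_counts, topic_word_counts = lda_gibbs_sampling(document_word_matrix, n_topics, n_iter, alpha, beta)

# Extract topics
feature_names = vectorizer.get_feature_names_out()
topics = np.argsort(-topic_word_counts, axis=1)[:, :5]
for i, topic in enumerate(topics):
    print(f"Topic {i}: {[feature_names[word] for word in topic]}")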


def predict_topic(text, doc_topic_counts, topic_word_counts, alpha, beta):
    # Tokenize with the vectorizer's own analyzer so tokens match its vocabulary
    words = vectorizer.build_analyzer()(text)
    word_indices = [vectorizer.vocabulary_.get(word) for word in words]
    word_indices = [idx for idx in word_indices if idx is not None]

    n_documents, n_topics = doc_topic_counts.shape
    n_words = topic_word_counts.shape[1]

    # Work in log space: a product over many word probabilities underflows to zero
    log_prior = np.log(doc_topic_counts.sum(axis=0) + alpha)
    log_word_probs = np.log(topic_word_counts[:, word_indices] + beta).sum(axis=1)
    log_norm = len(word_indices) * np.log(topic_word_counts.sum(axis=1) + beta * n_words)
    log_p_z = log_prior + log_word_probs - log_norm

    return np.argmax(log_p_z)

new_text = "your new text here"
topic_id = predict_topic(new_text, doc_topic_counts, topic_word_counts, alpha, beta)
print(f"Topic ID: {topic_id}")
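
For reference, the `p_z` line in the sampler above corresponds to the standard collapsed Gibbs conditional for LDA, written here with the notebook's variable names mapped to symbols: $n_{d,k}$ is `doc_topic_counts[d, k]`, $n_{k,w}$ is `topic_word_counts[k, w]`, $n_k$ is `topic_counts[k]`, and $V$ is `n_words`. The superscript $-i$ means the counts exclude the token currently being resampled, which is what the decrement step achieves.

```latex
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\left(n_{d,k}^{-i} + \alpha\right)
\frac{n_{k,w}^{-i} + \beta}{n_{k}^{-i} + V\beta}
```

The document-side denominator $n_d^{-i} + K\alpha$ is the same for every topic $k$, so it can be dropped before normalising, as the NumPy code does.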

In [None]:
import random
from collections import defaultdict

def lda_gibbs_dicts(docs, K, max_iters):
    # docs is a list of documents, each a list of words
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic_counts = [defaultdict(int) for _ in docs]
    topic_word_counts = [defaultdict(int) for _ in range(K)]
    topic_counts = [0] * K
    topic_assignments = []

    # Randomly assign topics to words in documents
    for doc in docs:
        cur_topics = []
        for word in doc:
            topic = random.randint(0, K - 1)
            cur_topics.append(topic)
            doc_topic_counts[len(topic_assignments)][topic] += 1
            topic_word_counts[topic][word] += 1
            topic_counts[topic] += 1
        topic_assignments.append(cur_topics)

    # Run max_iters sweeps over all words
    for _ in range(max_iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                topic = topic_assignments[d][i]

                # Decrement counts
                doc_topic_counts[d][topic] -= 1
                topic_word_counts[topic][word] -= 1
                topic_counts[topic] -= 1

                # Sample new topic (add-one smoothing, i.e. alpha = beta = 1)
                topic_probs = []
                for k in range(K):
                    doc_topic_prob = (doc_topic_counts[d][k] + 1) / (sum(doc_topic_counts[d].values()) + K)
                    topic_word_prob = (topic_word_counts[k][word] + 1) / (topic_counts[k] + vocab_size)
                    topic_probs.append(doc_topic_prob * topic_word_prob)

                # Normalize and sample new topic
                total_prob = sum(topic_probs)
                topic_probs = [p / total_prob for p in topic_probs]
                new_topic = random.choices(range(K), topic_probs)[0]

                # Increment counts
                topic_assignments[d][i] = new_topic
                doc_topic_counts[d][new_topic] += 1
                topic_word_counts[new_topic][word] += 1
                topic_counts[new_topic] += 1

    return doc_topic_counts, topic_word_counts



In [None]:
import random

def lda(docs, K, alpha, beta, num_iters):
    # docs is a list of documents, each a list of integer word ids
    vocab_size = max(word for doc in docs for word in doc) + 1

    # Initialize topic assignments
    topic_assignments = []
    for d in docs:
        topic_assignments.append([random.randint(0, K-1) for _ in d])

    # Initialize topic-word and document-topic count matrices
    topic_word_counts = [[0 for _ in range(vocab_size)] for _ in range(K)]
    doc_topic_counts = [[0 for _ in range(K)] for _ in range(len(docs))]

    # Count initial topic assignments
    for d, doc in enumerate(docs):
        for w, word in enumerate(doc):
            topic = topic_assignments[d][w]
            topic_word_counts[topic][word] += 1
            doc_topic_counts[d][topic] += 1

    # Gibbs sampling
    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for w, word in enumerate(doc):
                old_topic = topic_assignments[d][w]

                # Decrement counts for old topic assignment
                topic_word_counts[old_topic][word] -= 1
                doc_topic_counts[d][old_topic] -= 1

                # Compute probabilities for each topic
                probabilities = []
                for t in range(K):
                    p_topic_given_doc = (doc_topic_counts[d][t] + alpha) / (sum(doc_topic_counts[d]) + K * alpha)
                    p_word_given_topic = (topic_word_counts[t][word] + beta) / (sum(topic_word_counts[t]) + vocab_size * beta)
                    probabilities.append(p_topic_given_doc * p_word_given_topic)

                # Normalize probabilities
                total_prob = sum(probabilities)
                probabilities = [p / total_prob for p in probabilities]

                # Sample new topic assignment
                new_topic = random.choices(range(K), probabilities)[0]

                # Update counts for new topic assignment
                topic_assignments[d][w] = new_topic
                topic_word_counts[new_topic][word] += 1
                doc_topic_counts[d][new_topic] += 1

    return topic_word_counts, doc_topic_counts

# **4. Sentiment Analysis:**
Determine the sentiment of the content using the VADER sentiment analyzer from the `vaderSentiment` library.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool designed specifically for social media text; it doesn't require preprocessing such as tokenization, stemming, or lemmatization.
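
VADER's `compound` score lies between -1 and 1. A common way to turn it into a label, following the thresholds suggested in the vaderSentiment README, is sketched below. The helper `label_from_compound` is a hypothetical name, and the sample scores are compound values that appear in this notebook's own outputs.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores taken from outputs shown elsewhere in this notebook
for score in (-0.1757, 0.8268, -0.9976):
    print(score, label_from_compound(score))
```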

In [20]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [29]:
# Function to analyze sentiment using VADER
def analyze_sentiment_vader(text):
    # Initialize VADER sentiment analyzer
    analyzer = SentimentIntensityAnalyzer()
    sentiment = analyzer.polarity_scores(text)
    return sentiment

{'neg': 0.26, 'neu': 0.52, 'pos': 0.22, 'compound': -0.1757}


In [40]:
# Example usage
text = "This is an example text for sentiment analysis. Wow. So bad."
sentiment = analyze_sentiment_vader(text)
print(sentiment)


Think!

It's the SCSI card doing the DMA transfers NOT the disks...

The SCSI card can do DMA transfers containing data from any of the SCSI devices
it is attached when it wants to.

An important feature of SCSI is the ability to detach a device. This frees the
SCSI bus for other devices. This is typically used in a multi-tasking OS to
start transfers on several devices. While each device is seeking the data the
bus is free for other commands and data transfers. When the devices are
ready to transfer the data they can aquire the bus and send the data.

On an IDE bus when you start a transfer the bus is busy until the disk has seeked
the data and transfered it. This is typically a 10-20ms second lock out for other
processes wanting the bus irrespective of transfer time.

The sentiment value of text 1 is: -0.5952
The sentiment value of text 2 is: 0.8268
The sentiment value of text 3 is: -0.9976
The sentiment value of text 4 is: 0.8932
The sentiment value of text 5 is: 0.2732
The sentime

In [None]:
# Perform sentiment analysis on the texts
import numpy as np

# Create the analyzer once and reuse it for every text
analyzer = SentimentIntensityAnalyzer()

print(texts[3])
scores_list = []
for i, text in enumerate(texts):
    scores_list.append(analyzer.polarity_scores(text)["compound"])
    print(f'The sentiment value of text {i + 1} is: {scores_list[i]}')
# Average compound score across all texts
print(np.mean(scores_list))

# **5. Summarization:**
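Generate summaries of the relevant content using extractive summarization techniques. One option is the gensim library, but its `gensim.summarization` module was removed in gensim 4.0, so it requires gensim 3.x. The other option is a from-scratch implementation.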

To implement extractive summarization without using libraries, you can follow these steps:

1. Split the text into sentences.
2. Tokenize the sentences.
3. Calculate the frequency of each word in the text.
4. Assign a score to each sentence based on the frequency of the words in the sentence.
5. Select the top N sentences with the highest scores as the summary.

This is a simple implementation of extractive summarization without using any libraries. Note that this approach does not consider the semantic meaning of words or the coherence of the summary. More advanced techniques, such as using word embeddings or graph-based methods, can improve the quality of the summary.
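
The graph-based alternative mentioned above can also be sketched in plain Python. The following is a simplified TextRank-style ranking, not the gensim implementation: sentences are nodes, edge weights are word overlap normalized by the log of the sentence lengths, and scores come from a fixed number of weighted PageRank iterations. The function names, the regex-based splitter and tokenizer, and the default parameters are all illustrative assumptions.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def overlap_similarity(tokens_a, tokens_b):
    # Shared-word count divided by the sum of log sentence lengths.
    # Sentences with fewer than two tokens get 0 to avoid a zero denominator.
    if len(tokens_a) < 2 or len(tokens_b) < 2:
        return 0.0
    shared = len(set(tokens_a) & set(tokens_b))
    return shared / (math.log(len(tokens_a)) + math.log(len(tokens_b)))


def graph_summary(text, n_sentences=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    tokens = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    n = len(sentences)
    if n <= n_sentences:
        return sentences

    # Weighted, undirected similarity graph between sentences
    weights = [[overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]

    # Fixed number of weighted PageRank updates
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]

    # Keep the top-scoring sentences, returned in their original order
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


example_text = (
    "Topic models group words into topics. "
    "Topic models assign topics to documents. "
    "Bananas are yellow. "
    "Documents mix several topics."
)
print(graph_summary(example_text, n_sentences=2))
```

Unlike pure frequency scoring, a sentence ranks highly here only if it shares vocabulary with other sentences, so isolated sentences tend to drop out of the summary.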

In [None]:
def extractive_summarization(texts, n_sentences=3):
    """
    This function takes a list of texts and the number of sentences to include in the summary (default is 3). It calculates the frequency of words in the text, scores each sentence based on the frequency of the words it contains, and selects the top N sentences with the highest scores as the summary.
    """
    summaries = []

    for text in texts:
        # Split the text into sentences (naive split on periods)
        sentences = text.strip().split('.')

        # Tokenize and preprocess each sentence once, then reuse the tokens below
        tokens_per_sentence = [preprocess_text(sentence)[0] for sentence in sentences]

        # Count how often each stemmed token occurs in the whole text
        word_freq = {}
        for tokens in tokens_per_sentence:
            for token in tokens:
                word_freq[token] = word_freq.get(token, 0) + 1

        # Score each sentence as the sum of the frequencies of its tokens
        sentence_scores = {}
        for sentence, tokens in zip(sentences, tokens_per_sentence):
            for token in tokens:
                sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_freq[token]

        # Select the top N sentences with the highest scores
        summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n_sentences]
        summary = '. '.join(summary_sentences)
        summaries.append(summary)

    return summaries

In [45]:
# Note: gensim.summarization was removed in gensim 4.0, so this cell requires gensim < 4.0
from gensim.summarization import summarize

# Function to generate extractive summary
def generate_summary(text, word_count=50):
    summary = summarize(text, word_count=word_count)
    return summary

# Example usage
long_text = "This is an example long text that needs summarization. ..."
summary = generate_summary(long_text)


Summary 1:  Some sentences are more important than others.  It has several sentences
Summary 2:  Extractive summarization should work on it as well. Another example text is here


In [None]:
# Example usage
texts = [
    "This is an example text. It has several sentences. Some sentences are more important than others.",
    "Another example text is here. Extractive summarization should work on it as well."
]

summaries = extractive_summarization(texts, n_sentences=2)
for i, summary in enumerate(summaries):
    print(f"Summary {i + 1}: {summary}")

# **6. Visualization and Reporting:**
Visualize the results in an intuitive dashboard or report format, showing the distribution of topics, sentiment scores, and summaries of the relevant content. You can use the matplotlib library for basic visualizations and the wordcloud library for word clouds.
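
For the distribution of topics, one possible starting point is the `doc_topic_counts` structure returned by the LDA sampler. The sketch below assumes it is a list with one row per document and one count per topic; that layout, the helper names `topic_proportions` and `plot_topic_distribution`, and the toy numbers are all assumptions made for illustration.

```python
def topic_proportions(doc_topic_counts):
    """Share of all word assignments that fall under each topic."""
    n_topics = len(doc_topic_counts[0])
    totals = [0] * n_topics
    for counts in doc_topic_counts:
        for k, count in enumerate(counts):
            totals[k] += count
    grand_total = sum(totals)
    if grand_total == 0:
        return [0.0] * n_topics
    return [t / grand_total for t in totals]


def plot_topic_distribution(proportions, labels=None):
    # Imported here so topic_proportions can be used without matplotlib installed
    import matplotlib.pyplot as plt

    labels = labels or [f"Topic {k}" for k in range(len(proportions))]
    plt.bar(labels, proportions)
    plt.xlabel('Topic')
    plt.ylabel('Share of word assignments')
    plt.title('Distribution of Topics')
    plt.show()


# Toy counts for 3 documents and 2 topics (made-up numbers, for illustration only)
example_counts = [[8, 2], [1, 9], [5, 5]]
print(topic_proportions(example_counts))
```

Calling `plot_topic_distribution(topic_proportions(doc_topic_counts))` would then draw one bar per topic.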

In [54]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

In [52]:
# Function to visualize the distribution of compound scores across texts
# (expects a list of floats in [-1, 1])
def visualize_sentiment(scores):
    plt.hist(scores, bins=[-1, -0.5, 0, 0.5, 1], edgecolor='black')
    plt.xlabel('Sentiment Scores')
    plt.ylabel('Frequency')
    plt.title('Sentiment Analysis of Texts')
    plt.show()

# Function to visualize the pos/neu/neg scores of a single text
# (expects the dict returned by polarity_scores)
def visualize_sentiment_breakdown(sentiment_scores):
    labels = ['Positive', 'Neutral', 'Negative']
    values = [sentiment_scores['pos'], sentiment_scores['neu'], sentiment_scores['neg']]

    plt.bar(labels, values)
    plt.xlabel('Sentiment')
    plt.ylabel('Score')
    plt.title('Sentiment Analysis')
    plt.show()

In [56]:
# Function to generate a word cloud
def generate_wordcloud(texts):
    all_text = ' '.join(texts)
    wordcloud = WordCloud(
        width=800,
        height=800,
        background_color='white',
        min_font_size=10,
        max_words=100,
    ).generate(all_text)
    plt.figure(figsize=(8, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud of Texts')
    plt.show()

In [None]:

# Example usage
texts = [
    "This is an example text. It has several sentences. Some sentences are more important than others.",
    "Another example HPC text is here for computer topic. Extractive computing summarization should work on if computer is it as well."
]

# Analyze sentiment with VADER, keeping only the compound score of each text
sia = SentimentIntensityAnalyzer()
sentiment_scores = [sia.polarity_scores(text)["compound"] for text in texts]
visualize_sentiment(sentiment_scores)

# Generate word cloud
generate_wordcloud(texts)