# Social Media Monitor

In my project i'll implement, using Python, a social media monitor that
tracks topics or trends from social media or blogs.
I'll use the SKlearn 20 newsgroups dataset that comprises around 18000
newsgroups posts on 20 topics.
Then i'll pre-process the text using the nltk library, i'll apply
stopword removal, stemming and lemmatization.
After the text is pre-processed i'll apply a model to identify the main
topic, the idea is to implement from scratch the Latent Dirichlet
Allocation model and made a comparison between my version and the
version taken from gensim or sklearn libraries.
I'll perform a sentiment analysis using the Vader library.
I'll generate a summary of the relevant content using extractive
summarization based on word frequencies, to do that i'll not use any
library.
Finally i'll visualize the results, showing the distribution of topics,
sentiment scores and summaries of relevant content. In this phase i'll
use matplotlib and wordcloud libraries.

About the LDA i found the following interesting papers:

https://arxiv.org/pdf/1711.04305.pdf

https://ai.stanford.edu/~ang/papers/jair03-lda.pdf


## IDEA
- Creo 2 datasets bilanciati (tipo 100 tweets e 100 tweets)-> covid e champions
- Traino LDA su entrambi
- Creo nuovo dataset misto (70-30%)
- Identifico topic del dataset misto
- Sentument
- Riassunto
- Plotto

I creating a custom media monitor that tracks specific topics or trends across various platforms, such as news articles, blog posts, and social media. This project can help businesses or individuals stay up-to-date with the latest developments and discussions related to their areas of interest.

To implement this project, i'll follow these steps:

1. **Data Collection**: Gather data from various sources like news websites, blogs, and social media using APIs or web scraping techniques or RSS feed.
2. **Text Preprocessing**: Clean and normalize the text data, as mentioned in the previous answer (tokenization, stopword removal, stemming/lemmatization).
3. **Topic Modeling**: Employ topic modeling techniques like Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF) to identify the main topics or themes present in the collected data. This will help you filter relevant content based on the topics of interest.
4. **Sentiment Analysis**: Determine the sentiment of the content (positive, negative, or neutral) using techniques like rule-based approaches (e.g., VADER sentiment analyzer) or pre-trained models (e.g., TextBlob).
5. **Summarization**: Generate summaries of the relevant content using extractive or abstractive summarization techniques, so that users can quickly grasp the main points without reading the entire text.
6. **Visualization and Reporting**: Visualize the results in an intuitive dashboard or report format, showing the distribution of topics, sentiment scores, and summaries of the relevant content.

# **1. Data Collection:** 
Gather data from various sources using APIs or web scraping techniques. For example, you can use the requests library for accessing APIs and BeautifulSoup for web scraping. In this case I'll use the `20newsgroups` module of `sklearn.datasets`

https://imerit.net/blog/top-25-twitter-datasets-for-natural-language-processing-and-machine-learning-all-pbm/



In [77]:
#%pip install scikit-learn
from sklearn.datasets import fetch_20newsgroups

In [99]:
newsgroups_all = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
print("The total size is: ", len(newsgroups_all.data))
print("\nThe topics are: \n",newsgroups_all.target_names)


The total size is:  18846

The topics are: 
 ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [112]:
# Load the 20newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Select 10 categories for the test set
selected_categories = ['alt.atheism', 'comp.graphics', 'comp.sys.ibm.pc.hardware', 'rec.autos', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns']
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=selected_categories)

examples = newsgroups_test.data[:10]
print(examples)

print("Train dataset size: ", len(newsgroups_train.data))
print("Test dataset size: ", len(newsgroups_test.data))


['\n\n\tDid I not hear that there maybe some ports of Real3D Version2\n   \tin the pipeline somewhere, Possibly Unix. Not too sure though\n        please put me straight.', ". . . David gives good explaination of the deductions from the isotropic,\n'edged' distribution, to whit, they are either part of the Universe or\npart of the Oort cloud.\n\nWhy couldn't they be Earth centred, with the edge occuring at the edge\nof the gravisphere? I know there isn't any mechanism for them, but there\nisn't a mechanism for the others either.", 'I wrote that I thought that 2 Peter 1:20 meant, "no prophecy of\nScripture (or, as one reader suggests, no written prophecy) is\nmerely the private opinion of the writer."\n\nTony Zamora replies (Sat 8 May 1993) that this in turn implies that\nit is not subject to the private interpretation of the reader\neither. I am not sure that I understand this.\n     In one sense, no statement by another is subject to my private\ninterpretation. If reliable historians 

# **2. Text Preprocessing:** 
Clean and normalize the text data using tokenization, stopword removal, and stemming/lemmatization. I'll use the `nltk` library for these tasks.

In [107]:
#%pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\robyd\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\robyd\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\robyd\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [121]:
# Function to preprocess text
def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text.lower())

    # Remove punctuation and stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token.isalnum() and token not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    lemmatized_texts = [[lemmatizer.lemmatize(word) for word in text] for text in stemmed_tokens]

    return stemmed_tokens

In [115]:
# Example usage
tokens_list, _ = preprocess_text(examples[0])
print(tokens_list)

15158


# **3. Topic Modeling:** 
Apply Latent Dirichlet Allocation (LDA) to identify the main topics in the collected data. You can use the `gensim library` to perform LDA.

Evaluate both models on the testing set using perplexity and coherence scores.

## SCRATCH

## GENSIM

In [64]:
#%pip install gensim
import gensim

In [122]:
# Function to perform LDA
def perform_lda(tokens_list, num_topics=20):
    # Create a dictionary representation of the documents
    dictionary = gensim.corpora.Dictionary(tokens_list)

    # Create a bag-of-words representation of the documents
    corpus = [dictionary.doc2bow(tokens) for tokens in tokens_list]
    lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)

    return lda_model

In [123]:
processed_docs = [preprocess_text(doc) for doc in newsgroups_all.data]

In [124]:
# Example usage
lda_model = perform_lda(processed_docs)

# Print the topics
topics = lda_model.print_topics()
for topic in topics:
    print(topic)

(0, '0.104*"1" + 0.081*"2" + 0.068*"0" + 0.048*"3" + 0.040*"4" + 0.034*"25" + 0.025*"5" + 0.020*"6" + 0.015*"7" + 0.014*"8"')
(1, '0.011*"goal" + 0.009*"period" + 0.009*"san" + 0.009*"new" + 0.008*"blue" + 0.008*"chicago" + 0.007*"play" + 0.007*"shot" + 0.007*"toronto" + 0.006*"john"')
(2, '0.038*"post" + 0.033*"mail" + 0.032*"list" + 0.026*"send" + 0.018*"newsgroup" + 0.017*"address" + 0.017*"db" + 0.015*"pleas" + 0.015*"internet" + 0.015*"email"')
(3, '0.009*"state" + 0.008*"use" + 0.007*"univers" + 0.006*"gun" + 0.006*"research" + 0.005*"law" + 0.005*"includ" + 0.005*"public" + 0.005*"report" + 0.005*"number"')
(4, '0.016*"univers" + 0.009*"istanbul" + 0.009*"heat" + 0.009*"histori" + 0.008*"professor" + 0.006*"sink" + 0.006*"insul" + 0.005*"london" + 0.005*"fpu" + 0.005*"prize"')
(5, '0.014*"drug" + 0.013*"medic" + 0.011*"diseas" + 0.011*"patient" + 0.010*"health" + 0.010*"caus" + 0.009*"use" + 0.008*"doctor" + 0.008*"food" + 0.008*"effect"')
(6, '0.019*"thank" + 0.017*"use" + 0.01

In [133]:
#%pip install pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

tokens_list=processed_docs
dictionary = gensim.corpora.Dictionary(tokens_list)

    # Create a bag-of-words representation of the documents
corpus = [dictionary.doc2bow(tokens) for tokens in tokens_list]
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
vis


## SKLEARN

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Function to perform LDA
def perform_lda_sklearn(texts_list, num_topics=5, max_features=1000):
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=max_features, stop_words='english')
    data_vectorized = vectorizer.fit_transform(texts_list)

    lda_model = LatentDirichletAllocation(n_components=num_topics, max_iter=5,
                                          learning_method='online', learning_offset=50., random_state=0)
    lda_model.fit(data_vectorized)

    return lda_model, vectorizer

## LDA

In [None]:
import numpy as np

def lda_gibbs_sampling(document_word_matrix, n_topics, n_iter, alpha, beta):
    n_documents, n_words = document_word_matrix.shape

    # Initialize topic assignments randomly
    topic_assignments = np.random.randint(0, n_topics, size=document_word_matrix.nonzero()[0].shape)

    # Initialize count matrices
    doc_topic_counts = np.zeros((n_documents, n_topics))
    topic_word_counts = np.zeros((n_topics, n_words))
    topic_counts = np.zeros(n_topics)

    # Update count matrices based on initial topic assignments
    for d, w, z in zip(*document_word_matrix.nonzero(), topic_assignments):
        doc_topic_counts[d, z] += 1
        topic_word_counts[z, w] += 1
        topic_counts[z] += 1

    # Perform Gibbs sampling
    for _ in range(n_iter):
        for d, w, z in zip(*document_word_matrix.nonzero(), topic_assignments):
            # Decrement count matrices
            doc_topic_counts[d, z] -= 1
            topic_word_counts[z, w] -= 1
            topic_counts[z] -= 1

            # Calculate conditional probability
            p_z = (doc_topic_counts[d, :] + alpha) * (topic_word_counts[:, w] + beta) / (topic_counts + beta * n_words)
            p_z /= np.sum(p_z)

            # Sample a new topic assignment
            new_z = np.random.choice(n_topics, p=p_z)

            # Increment count matrices
            doc_topic_counts[d, new_z] += 1
            topic_word_counts[new_z, w] += 1
            topic_counts[new_z] += 1

            # Update topic assignment
            z = new_z

    return doc_topic_counts, topic_word_counts


n_topics = 5
n_iter = 1000
alpha = 0.1
beta = 0.1

doc_topic_counts, topic_word_counts = lda_gibbs_sampling(document_word_matrix, n_topics, n_iter, alpha, beta)

# Extract topics
topics = np.argsort(-topic_word_counts, axis=1)[:, :5]
for i, topic in enumerate(topics):
    print(f"Topic {i}: {[vectorizer.get_feature_names()[word] for word in topic]}")


def predict_topic(text, doc_topic_counts, topic_word_counts, alpha, beta):
    words = preprocess(text)
    word_indices = [vectorizer.vocabulary_.get(word) for word in words]
    word_indices = [idx for idx in word_indices if idx is not None]

    n_documents, n_topics = doc_topic_counts.shape
    n_words = topic_word_counts.shape[1]

    p_z = (doc_topic_counts.sum(axis=0) + alpha) * np.prod(topic_word_counts[:, word_indices] + beta, axis=1) / (topic_word_counts.sum(axis=1) + beta * n_words)**len(word_indices)
    p_z /= np.sum(p_z)

    return np.argmax(p_z)

new_text = "your new text here"
topic_id = predict_topic(new_text, doc_topic_counts, topic_word_counts, alpha, beta)
print(f"Topic ID: {topic_id}")

In [None]:
# Randomly assign topics to words in documents
for doc in docs:
    cur_topics = []
    for word in doc:
        topic = random.randint(0, K - 1)
        cur_topics.append(topic)
        doc_topic_counts[len(topic_assignments)][topic] += 1
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1
    topic_assignments.append(cur_topics)

# Iterate until convergence or max_iters
for _ in range(max_iters):
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            topic = topic_assignments[d][i]

            # Decrement counts
            doc_topic_counts[d][topic] -= 1
            topic_word_counts[topic][word] -= 1
            topic_counts[topic] -= 1

            # Sample new topic
            topic_probs = []
            for k in range(K):
                doc_topic_prob = (doc_topic_counts[d][k] + 1) / (sum(doc_topic_counts[d].values()) + K)
                topic_word_prob = (topic_word_counts[k][word] + 1) / (topic_counts[k] + len(docs))
                topic_probs.append(doc_topic_prob * topic_word_prob)

            # Normalize and sample new topic
            total_prob = sum(topic_probs)
            topic_probs = [p / total_prob for p in topic_probs]
            new_topic = random.choices(range(K), topic_probs)[0]

            # Increment counts
            topic_assignments[d][i] = new_topic
            doc_topic_counts[d][new_topic] += 1
            topic_word_counts[new_topic][word] += 1
            topic_counts[new_topic] += 1

return doc_topic_counts, topic_word_counts



In [None]:
def lda(docs, K, alpha, beta, num_iters):
# Initialize topic assignments
topic_assignments = []
for d in docs:
topic_assignments.append([random.randint(0, K-1) for _ in d])

# Initialize topic-word and document-topic count matrices
topic_word_counts = [[0 for _ in range(len(docs[0]))] for _ in range(K)]
doc_topic_counts = [[0 for _ in range(K)] for _ in range(len(docs))]

# Count initial topic assignments
for d, doc in enumerate(docs):
    for w, word in enumerate(doc):
        topic = topic_assignments[d][w]
        topic_word_counts[topic][word] += 1
        doc_topic_counts[d][topic] += 1

# Gibbs sampling
for _ in range(num_iters):
    for d, doc in enumerate(docs):
        for w, word in enumerate(doc):
            old_topic = topic_assignments[d][w]

            # Decrement counts for old topic assignment
            topic_word_counts[old_topic][word] -= 1
            doc_topic_counts[d][old_topic] -= 1

            # Compute probabilities for each topic
            probabilities = []
            for t in range(K):
                p_topic_given_doc = (doc_topic_counts[d][t] + alpha) / (sum(doc_topic_counts[d]) + K * alpha)
                p_word_given_topic = (topic_word_counts[t][word] + beta) / (sum(topic_word_counts[t]) + len(docs[0]) * beta)
                probabilities.append(p_topic_given_doc * p_word_given_topic)

            # Normalize probabilities
            total_prob = sum(probabilities)
            probabilities = [p / total_prob for p in probabilities]

            # Sample new topic assignment
            new_topic = random.choices(range(K), probabilities)[0]

            # Update counts for new topic assignment
            topic_assignments[d][w] = new_topic
            topic_word_counts[new_topic][word] += 1
            doc_topic_counts[d][new_topic] += 1

return topic_word_counts, doc_topic_counts

# **4. Sentiment Analysis:**
Determine the sentiment of the content using the VADER sentiment analyzer from the `vaderSentiment`library.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a pre-trained sentiment analysis tool specifically designed for social media texts and doesn't require preprocessing like tokenization, stemming, or lemmatization

In [16]:
#%pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
                                              0.0/126.0 kB ? eta -:--:--
     ---------                                30.7/126.0 kB ? eta -:--:--
     ---------------------                 71.7/126.0 kB 975.2 kB/s eta 0:00:01
     -------------------------------------- 126.0/126.0 kB 1.2 MB/s eta 0:00:00
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2
Note: you may need to restart the kernel to use updated packages.


In [20]:
# Function to analyze sentiment using VADER
def analyze_sentiment_vader(text):
     # Initialize VADER sentiment analyzer
    analyzer = SentimentIntensityAnalyzer()
    sentiment = analyzer.polarity_scores(text)
    return sentiment

In [29]:
# Example usage
text = "This is an example text for sentiment analysis. Wow. So bad."
sentiment = analyze_sentiment_vader(text)
print(sentiment)

{'neg': 0.26, 'neu': 0.52, 'pos': 0.22, 'compound': -0.1757}


In [40]:
# Perform sentiment analysis on the texts
import numpy as np
print(texts[3])
scores_list = []
for i, text in enumerate(texts):
    scores_list.append(analyzer.polarity_scores(text)["compound"])
    print(f'The sentiment value of text {i + 1} is: {scores_list[i]}')
print(np.mean(scores_list))


Think!

It's the SCSI card doing the DMA transfers NOT the disks...

The SCSI card can do DMA transfers containing data from any of the SCSI devices
it is attached when it wants to.

An important feature of SCSI is the ability to detach a device. This frees the
SCSI bus for other devices. This is typically used in a multi-tasking OS to
start transfers on several devices. While each device is seeking the data the
bus is free for other commands and data transfers. When the devices are
ready to transfer the data they can aquire the bus and send the data.

On an IDE bus when you start a transfer the bus is busy until the disk has seeked
the data and transfered it. This is typically a 10-20ms second lock out for other
processes wanting the bus irrespective of transfer time.

The sentiment value of text 1 is: -0.5952
The sentiment value of text 2 is: 0.8268
The sentiment value of text 3 is: -0.9976
The sentiment value of text 4 is: 0.8932
The sentiment value of text 5 is: 0.2732
The sentime

# **5. Summarization:**
Generate summaries of the relevant content using extractive summarization techniques. For this, you can use the gensim library.

To implement extractive summarization without using libraries, you can follow these steps:

1. Split the text into sentences.
2. Tokenize the sentences.
3. Calculate the frequency of each word in the text.
4. Assign a score to each sentence based on the frequency of the words in the sentence.
5. Select the top N sentences with the highest scores as the summary.

This is a simple implementation of extractive summarization without using any libraries. Note that this approach does not consider the semantic meaning of words or the coherence of the summary. More advanced techniques, such as using word embeddings or graph-based methods, can improve the quality of the summary.

In [44]:
"""
This function takes a list of texts and the number of sentences to include in the summary (default is 3). It calculates the frequency of words in the text, scores each sentence based on the frequency of the words it contains, and selects the top N sentences with the highest scores as the summary.
"""
def extractive_summarization(texts, n_sentences=3):
    summaries = []

    for text in texts:
        # Split the text into sentences
        sentences = text.strip().split('.')

        # Tokenize and preprocess the text
        word_freq = {}
        for sentence in sentences:
            stemmed_tokens, _ = preprocess_text(sentence)
            for token in stemmed_tokens:
                if token not in word_freq:
                    word_freq[token] = 1
                else:
                    word_freq[token] += 1

        # Calculate the score for each sentence
        sentence_scores = {}
        for sentence in sentences:
            stemmed_tokens, _ = preprocess_text(sentence)
            for token in stemmed_tokens:
                if token in word_freq:
                    if sentence not in sentence_scores:
                        sentence_scores[sentence] = word_freq[token]
                    else:
                        sentence_scores[sentence] += word_freq[token]

        # Select the top N sentences with the highest scores
        summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n_sentences]
        summary = '. '.join(summary_sentences)
        summaries.append(summary)

    return summaries

In [None]:
from gensim.summarization import summarize

# Function to generate extractive summary
def generate_summary(text, word_count=50):
    summary = summarize(text, word_count=word_count)
    return summary

# Example usage
long_text = "This is an example long text that needs summarization. ..."
summary = generate_summary(long_text)


In [45]:
# Example usage
texts = [
    "This is an example text. It has several sentences. Some sentences are more important than others.",
    "Another example text is here. Extractive summarization should work on it as well."
]

summaries = extractive_summarization(texts, n_sentences=2)
for i, summary in enumerate(summaries):
    print(f"Summary {i + 1}: {summary}")

Summary 1:  Some sentences are more important than others.  It has several sentences
Summary 2:  Extractive summarization should work on it as well. Another example text is here


# **6. Visualization and Reporting:**
 Visualize the results in an intuitive dashboard or report format, showing the distribution of topics, sentiment scores, and summaries of the relevant content. You can use the matplotlib library for basic visualizations.

In [48]:
#%pip install matplotlib
import matplotlib.pyplot as plt

#%pip install wordcloud
from wordcloud import WordCloud

In [54]:
def visualize_sentiment(scores):
    plt.hist(scores, bins=[-1, -0.5, 0, 0.5, 1], edgecolor='black')
    plt.xlabel('Sentiment Scores')
    plt.ylabel('Frequency')
    plt.title('Sentiment Analysis of Texts')
    plt.show()

# Function to visualize sentiment scores
def visualize_sentiment(sentiment_scores):
    labels = ['Positive', 'Neutral', 'Negative']
    values = [sentiment_scores['pos'], sentiment_scores['neu'], sentiment_scores['neg']]

    plt.bar(labels, values)
    plt.xlabel('Sentiment')
    plt.ylabel('Score')
    plt.title('Sentiment Analysis')
    plt.show()

In [52]:
# Function to generate a word cloud
def generate_wordcloud(texts):
    all_text = ' '.join(texts)
    wordcloud = WordCloud(width=800, height=800, background_color='white', min_font_size=10, max_words=100).generate(all_text)
    plt.figure(figsize=(8, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud of Texts')
    plt.show()

In [56]:

# Example usage
texts = [
    "This is an example text. It has several sentences. Some sentences are more important than others.",
    "Another example HPC text is here for computer topic. Extractive computing summarization should work on if computer is it as well."
]

# Analyze sentiment using the code from the previous answer
sia = SentimentIntensityAnalyzer()
sentiment_scores = [sia.polarity_scores(text)["compound"] for text in texts]
visualize_sentiment(sentiment_scores)

# Generate word cloud
generate_wordcloud(texts)

TypeError: list indices must be integers or slices, not str