# Introduction

So far, we've looked at the fundamentals of preprocessing and at using the bag-of-words approach to model topics with LDA (Latent Dirichlet Allocation).

Now let's expand on some of these concepts and revisit the bag-of-words approach, this time to compare documents. Imagine that we have two documents and want to determine whether they discuss the same topic, under the assumption that documents sharing the same terms likely cover the same subject.

With this in mind, we will use document vectors from the bag-of-words model: each document is represented as a vector of word counts (or frequencies) over a fixed vocabulary. Each dimension of the vector corresponds to a specific word in the vocabulary, and its value records how often that word appears in the document.
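To make the vector representation concrete, here is a minimal from-scratch sketch. The helper names (`tokenize`, `bag_of_words`) and the toy sentences are invented for illustration, and the whitespace tokenizer is a simplifying assumption, not what scikit-learn does internally.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One dimension per vocabulary word; the value is the word's count.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical toy documents
toy_docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, vectors = bag_of_words(toy_docs)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Position 5 of each vector, for instance, counts the word "the": twice in the first document and once in the second.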



# Vectors

By transforming documents into these numerical vectors, we can easily compare them using similarity metrics, such as cosine similarity. If two document vectors have high similarity, this suggests that the documents share a significant amount of overlapping content, indicating a related subject. The bag-of-words approach thus provides a straightforward way to quantify the content of documents for comparisons and further analysis in natural language processing tasks.
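As a rough illustration of what cosine similarity computes, the sketch below implements it by hand for two hypothetical count vectors. The function name `cosine_similarity_manual` is invented here to avoid confusion with scikit-learn's `cosine_similarity`.

```python
import math

def cosine_similarity_manual(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an all-zero vector has similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical word-count vectors over a six-word vocabulary
vec_a = [1, 0, 1, 1, 1, 2]
vec_b = [0, 1, 0, 0, 1, 1]
similarity = cosine_similarity_manual(vec_a, vec_b)
print(round(similarity, 4))  # 0.6124
```

Only the dimensions where both vectors are non-zero contribute to the dot product, which is why shared vocabulary drives the score.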

To compare documents using the bag-of-words approach with `CountVectorizer` from scikit-learn, we can follow these steps:

1. Transform each document into a word count vector using CountVectorizer.
2. Calculate the similarity between the vectors. One of the most common methods is cosine similarity.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example documents (in Portuguese)
doc1 = "Eu amo aprender sobre processamento de linguagem natural"
doc2 = "O processamento de linguagem natural é fascinante e cheio de possibilidades"

# List of documents
documents = [doc1, doc2]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Convert the documents to their bag-of-words representation
X = vectorizer.fit_transform(documents)

# Compute the cosine similarity between the two document vectors
cosine_sim = cosine_similarity(X[0], X[1])

# Display the similarity between doc1 and doc2
print("Similarity between doc1 and doc2:", cosine_sim[0][0])


Similarity between doc1 and doc2: 0.5590169943749475
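The printed value can be checked by hand. This check assumes `CountVectorizer`'s default settings, which lowercase the text and drop single-character tokens (so "O", "é" and "e" are ignored). Under that assumption, doc1 has 8 distinct words that each occur once, doc2 has 6 distinct words with "de" occurring twice, and the two documents share "processamento", "de", "linguagem" and "natural". Writing $\mathbf{a}$ and $\mathbf{b}$ for the two count vectors (symbols introduced here for illustration):

$$
\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{1 + (1 \times 2) + 1 + 1}{\sqrt{8} \, \sqrt{1 + 4 + 1 + 1 + 1 + 1}} = \frac{5}{\sqrt{80}} \approx 0.559
$$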


Here is an example with three documents: two on the same topic (natural language processing) and a third on a different subject (weather). Using the code above, calculate the cosine similarity between each pair of documents.

In [2]:
# Example documents (in Portuguese)
doc1 = "O processamento de linguagem natural é fascinante e cheio de possibilidades."
doc2 = "A área de processamento de linguagem natural envolve a análise de textos e fala."
doc3 = "Hoje o clima está ensolarado com algumas nuvens."



Print the obtained vectors and explain the vector in terms of which words are in each position of the vector.

What range of values can cosine similarity take?

In the example above, we used cosine similarity as a metric, but there are other alternatives that can be employed to compare vectors generated by the bag-of-words model. Some of these include Euclidean distance, which measures the straight-line distance between two points in a multi-dimensional space, and Manhattan distance, which calculates the sum of the absolute differences across each dimension. Additionally, the Jaccard similarity coefficient can be used to measure the overlap between two sets, making it particularly useful when focusing on the presence or absence of words rather than their frequency. Each metric has its own advantages depending on the context and desired outcome of the comparison, as some are more sensitive to vector magnitude, while others emphasize the direction or relative frequency of terms.
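The alternative metrics mentioned above can be sketched in plain Python as follows. The function names and the two count vectors are hypothetical, and the Jaccard version shown is one common choice that looks only at which words are present.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute differences across all dimensions."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_similarity(a, b):
    """Overlap of the sets of words present in each vector (counts are ignored)."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, y in enumerate(b) if y > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # Convention for this sketch when both documents are empty.
    return len(present_a & present_b) / len(union)

# Hypothetical word-count vectors over a six-word vocabulary
vec_a = [1, 0, 1, 1, 1, 2]
vec_b = [0, 1, 0, 0, 1, 1]

print(euclidean_distance(vec_a, vec_b))   # sqrt(5), about 2.236
print(manhattan_distance(vec_a, vec_b))   # 5
print(jaccard_similarity(vec_a, vec_b))   # 2 shared words out of 6, about 0.333
```

Note that the two distances grow as documents differ (0 means identical vectors), whereas Jaccard, like cosine similarity, grows as documents overlap.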

# Sentiment Analysis and Text Categorization



**Sentiment analysis** is a type of text classification that involves determining the sentiment or emotional tone behind a piece of text. This can be categorized as either positive, negative, or neutral, though more granular classifications (such as very positive, somewhat negative, etc.) are also possible. The main goal of sentiment analysis is to analyze and extract subjective information from text, such as opinions, reviews, or social media posts.

## What is a Sentiment?

A **sentiment** refers to the underlying emotional state or opinion expressed in a piece of text. The sentiment can be, for example:

0. **Negative Sentiment**: Expresses an unfavorable, unhappy, or disapproving tone.  
   Example: "The service at the restaurant was terrible. I'll never go again."

1. **Positive Sentiment**: Expresses a favorable, happy, or approving tone.  
   Example: "I absolutely loved the movie! It was fantastic!"



## Naive Bayes for Sentiment Analysis

**Naive Bayes classifiers** are a popular machine learning approach used for sentiment analysis because of their simplicity and effectiveness, especially with text data.

A **Naive Bayes classifier** is based on Bayes' Theorem, which it uses to calculate the probability of a given sentiment (category) based on the words present in the text. The key assumption of Naive Bayes is that the features (words) in the text are **independent** of each other given the category, which is why it's called "naive." Despite this simplifying assumption, it often performs surprisingly well for text classification tasks.
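In symbols, the independence assumption lets the classifier score each category as a prior multiplied by one factor per word. The notation below ($c$ for a category, $w_1, \dots, w_n$ for the words of a text, $\hat{c}$ for the predicted category) is introduced here for illustration:

$$
P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

$$
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$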

### Steps in Naive Bayes for Sentiment Analysis:

1. **Training**:
   - The model is trained using a labeled dataset that includes text and its corresponding sentiment label (positive, negative, or neutral). For instance, a set of app reviews labeled as positive or negative.

2. **Feature Extraction**:
   - In text data, features are usually individual words or sequences of words. For each word, the model calculates the probability of the word appearing in texts with a certain sentiment label.

3. **Calculating Probabilities**:
   - The Naive Bayes classifier computes the probability of a sentiment label given the words in a new text.

4. **Classification**:
   - When a new text is provided, the model calculates the probabilities for each sentiment class (positive, negative, neutral). It then assigns the sentiment with the highest probability to the text.

### Example:

Let's say we have a training set with the following data:

- "I love this product" (positive)
- "This is the worst purchase ever" (negative)
- "The product is okay, but not great" (neutral)

For the phrase "love this product":
- The word "love" might have a higher probability of occurring in positive reviews, while words like "worst" might be more likely in negative reviews.

The Naive Bayes classifier uses this statistical approach to make predictions on new, unseen text, based on the probabilities calculated from the training data.
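The example can be sketched from scratch as a tiny multinomial Naive Bayes classifier. This is an illustrative sketch, not scikit-learn's implementation: the function names are invented, the tokenizer is a simple regular expression, and add-one (Laplace) smoothing is assumed so that unseen words do not produce zero probabilities.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep runs of letters (and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(examples):
    """Count documents per class and words per class."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_doc_counts, word_counts, vocabulary

def predict(text, class_doc_counts, word_counts, vocabulary):
    """Return the class with the highest log-probability score."""
    total_docs = sum(class_doc_counts.values())
    scores = {}
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training are skipped
            count = word_counts[label][token]
            # Add-one smoothing over the vocabulary size
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

training_set = [
    ("I love this product", "positive"),
    ("This is the worst purchase ever", "negative"),
    ("The product is okay, but not great", "neutral"),
]
model = train_naive_bayes(training_set)
print(predict("love this product", *model))  # positive
print(predict("worst purchase", *model))     # negative
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer texts.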

In summary, sentiment analysis is a form of text categorization that focuses on identifying the emotional tone of text. Naive Bayes classifiers are commonly used in this task due to their efficiency in handling text data and their ability to classify based on word probabilities.

More information here:
https://www.youtube.com/watch?v=O2L2Uv9pdDA&t=639s


Now that we understand how Naive Bayes works, let's implement a model in Python to classify app reviews. The data was provided in a CSV file. Download the data and load it. The libraries we will use are:

In [3]:
# pandas for working with dataframes
import pandas as pd

# Splits the corpus into training and test sets
from sklearn.model_selection import train_test_split

# Converts a collection of text documents to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer

# Multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

import numpy as np

# Evaluation metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# Counter class from the collections module
from collections import Counter

Load the dataset into pandas directly from the CSV file.

What characteristics (size, columns...) does the dataset have?

What is the distribution of the labels in the dataset?

Preprocess the data so that the dataframe keeps only the columns that contain the text 'raw' and the annotated labels.

Now that you have the appropriate corpus, split it into test documents and training documents. Use `train_test_split` to do this.
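As a minimal sketch of the call, here is `train_test_split` applied to invented toy data. The variable names, the 80/20 split, and the fixed `random_state` are assumptions made for illustration; the real exercise uses the review dataframe.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 10 placeholder reviews with alternating labels
toy_reviews = [f"review number {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

# test_size=0.2 holds out 20% of the documents; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    toy_reviews, toy_labels, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test), len(y_train), len(y_test))  # 8 2 8 2
```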

The function `train_test_split` splits the data into four parts. Explain each of these parts.

We have been working with vectors from scratch until now. But to simplify today's analysis, I created vectors from the review data using `CountVectorizer`. I removed the stopwords.
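A sketch of how such vectors might be built is shown below. The toy reviews are invented, and `stop_words="english"` assumes the reviews are in English; the original notebook cell may have used a different stopword list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews standing in for the real dataset
toy_reviews = [
    "This app is great and very useful",
    "The app crashes all the time",
    "Great design but the app is slow",
]

# stop_words="english" drops common English function words such as "the" and "is"
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(toy_reviews)

print(X.shape)                         # (number of reviews, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the words kept after stopword removal
print(X.toarray())                     # one row of counts per review
```

The result is a sparse matrix with one row per review and one column per vocabulary word.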

Characterize the generated vectors. What is the shape of these vectors?

Now that you know how to create the vectors, create them for the reviews produced by `train_test_split`: fit the vectorizer on the training reviews and use the same fitted vectorizer to transform the test reviews.

Use `MultinomialNB` to fit a model on the vectors created from the training data, together with the annotated label of each training review.

Use `model.score` to measure the performance of your classification model.
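The following is a minimal end-to-end sketch on invented toy reviews. It assumes the usual practice of fitting both the vectorizer and the classifier on the training split only, then scoring on held-out reviews; the texts and labels (1 for positive, 0 for negative) are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training and test reviews (1 = positive, 0 = negative)
train_texts = [
    "great app love it",
    "love this great tool",
    "terrible app hate it",
    "hate this awful tool",
]
train_labels = [1, 1, 0, 0]
test_texts = ["love this app", "awful terrible app"]
test_labels = [1, 0]

# Learn the vocabulary from the training reviews only
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
# Reuse the same vocabulary for the test reviews
X_test = vectorizer.transform(test_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)

predictions = model.predict(X_test)
accuracy = model.score(X_test, test_labels)  # mean accuracy on the test reviews
print(predictions, accuracy)
```

For a classifier, `model.score` returns the fraction of test reviews whose predicted label matches the annotated one.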

How did your model perform? Now create a confusion matrix to analyze which labels your model predicts best: the negative or the positive ones?
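As a sketch of how to read the result, here is `confusion_matrix` applied to invented labels, assuming 0 encodes negative and 1 encodes positive. With `labels=[0, 1]`, rows are the true labels and columns are the predicted labels.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = negative, 1 = positive)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[2 1]
#  [1 2]]

# For binary labels ordered [0, 1], ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```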

Consider the new review cases in the DataFrame below.

Predict the labels for the new cases in this dataframe.

How did your model perform for these new cases?

Consider the image below and discuss the precision and recall metrics with the teacher.

![Precision and recall diagram](https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg)

Calculate the precision and recall of the model you just created.

Consider the formulas below:

The **F1-Score** is an evaluation metric that combines **Precision** and **Recall** into a single measure, using the harmonic mean of these values. The mathematical formula for the F1-Score is as follows:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$


Where:

- **Precision** is the proportion of true positives among all examples predicted as positive.

$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$

- **Recall** is the proportion of true positives among all examples that are actually positive.

$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$

Thus, the F1-Score evaluates model performance by balancing precision and recall, making it especially useful in scenarios with imbalanced classes.
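As a worked example of the three formulas, the sketch below computes them from hypothetical counts (8 true positives, 2 false positives, 4 false negatives). The function names are invented for illustration.

```python
def precision_score_manual(tp, fp):
    """True positives among everything predicted as positive."""
    return tp / (tp + fp)

def recall_score_manual(tp, fn):
    """True positives among everything that is actually positive."""
    return tp / (tp + fn)

def f1_score_manual(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * (precision * recall) / (precision + recall)

# Hypothetical counts read off a confusion matrix
tp, fp, fn = 8, 2, 4

precision = precision_score_manual(tp, fp)
recall = recall_score_manual(tp, fn)
f1 = f1_score_manual(precision, recall)

print(round(precision, 3))  # 0.8
print(round(recall, 3))     # 0.667
print(round(f1, 3))         # 0.727
```

Because the harmonic mean is pulled toward the smaller value, the F1 score here (0.727) sits closer to the recall than the arithmetic mean (0.733) would.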


Calculate the F1 score for the model you just created.