# NLP3 - Preprocessing

---

### Preprocessing and Bag of Words

Preprocessing is a vital step in natural language processing (NLP): it transforms raw text into a format suitable for analysis and modeling. Cleaning and standardizing the data improves its quality and makes downstream NLP tasks more accurate and efficient. In Python, libraries such as NLTK and spaCy provide tools for preprocessing text. Common techniques include lowercasing, tokenization, stop word removal, and stemming or lemmatization. Applied well, these methods can improve model performance by making the data used for training and evaluation more consistent and relevant. This study guide explores key preprocessing techniques and shows how to apply them in Python, laying a solid foundation for effective NLP projects.

We will also introduce the Bag of Words model: a text representation that describes the occurrence of words within a document, ignoring word order and keeping only each word's frequency.
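To make the definition concrete, here is a minimal sketch of a Bag of Words built with the standard library only. The helper name `bag_of_words` and the two English sentences are invented for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter


def bag_of_words(text):
    """Return a token -> count mapping; word order is discarded."""
    tokens = text.lower().split()  # naive whitespace tokenization
    return Counter(tokens)


# Two toy sentences with the same words in a different order
doc_a = "the cat chased the dog"
doc_b = "the dog chased the cat"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

# Both sentences produce the same bag, even though their meanings differ
print(bow_a == bow_b)
print(bow_a["the"])
```

The two sentences describe opposite events yet share the same representation, which is exactly the information a Bag of Words gives up by ignoring order.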

---

1 - Discuss what a corpus is. What role does it play in Natural Language Processing?

2 - We need a [corpus](https://huggingface.co/datasets/tclopess/sinopsys_movies_portuguese) for our tests in today's class. We'll be working with a corpus of movie synopses in Portuguese. Below, you can see the code to load it.

In [None]:
# first, let's install the datasets library
!pip install datasets

In [None]:
#import installed library
from datasets import load_dataset

#load dataset
dataset = load_dataset("tclopess/sinopsys_movies_portuguese")

#convert it to pandas and slice the first 3000 data points
df_sinop = dataset['train'].to_pandas()[:3000]
df_sinop.head()


| Unnamed: 0 | titulo | sinopse | generos | is_valid |
|---|---|---|---|---|
| 0 | We Were Soldiers | A história da primeira grande batalha da fase ... | ['Ação', 'História', 'Guerra'] | False |
| 1 | 4Got10 | Um negócio de drogas dá errado, deixando corpo... | ['Ação', 'Crime', 'Thriller'] | False |
| 2 | Pontypool | Quando o disc jockey Grant Mazzy se reporta à ... | ['Horror', 'Mistério', 'Ficção Científica'] | False |
| 3 | Ticker | Depois que o parceiro de um detetive de São Fr... | ['Ação', 'Crime', 'Thriller'] | False |
| 4 | Real Genius | Um adolescente prodígio tenso entra em uma fac... | ['Comédia', 'Romance', 'Ficção Científica'] | True |


In [None]:
len(df_sinop)

3000


3 - Based on the corpus documentation and the representations above, what is the unit of data? In other words, what does a single document represent?

4 - Now that we have discussed what a corpus is and what a document is within it, let's move on to preprocessing. The first step is to define what a token is. **A token is a single unit of text (such as a word, phrase, or symbol) that is treated as a distinct element during text processing.** In our case, we will define a token as a word and explore various methods for tokenizing documents. Please test tokenization using the following methods: regular expressions (regex), the NLTK library, the split function, and spaCy.

In [None]:
#using regex

In [None]:
#using nltk

In [None]:
#using split

In [None]:
#using spacy
!python -m spacy download pt_core_news_lg

5 - Based on the results above, do the functions perform the same task? Which one is better? Which is more efficient? Which one would you choose? What is the outcome of the approaches above when applied to the string `Estudamos NLP na quarta-feira. É importante você praticar.`?
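As a starting point for the comparison, the sketch below applies only the two standard-library approaches (`str.split` and a regex) to the example string. The regex pattern is one possible choice, not the only one: it is an assumption that hyphenated words such as `quarta-feira` should stay together as a single token. NLTK and spaCy are left out so the snippet runs without extra downloads.

```python
import re

text = "Estudamos NLP na quarta-feira. É importante você praticar."

# str.split: breaks on whitespace only, so punctuation stays glued to words
split_tokens = text.split()

# regex: runs of word characters, optionally joined by hyphens
# (\w is Unicode-aware for str patterns in Python 3, so "É" and "você" match)
regex_tokens = re.findall(r"\w+(?:-\w+)*", text)

print(split_tokens)
print(regex_tokens)
```

Here `split` keeps `quarta-feira.` and `praticar.` with the trailing period, while the regex drops the punctuation entirely. Dedicated tokenizers such as those in NLTK and spaCy typically emit punctuation as separate tokens instead, which is worth checking when you run them.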





6 - You have already explored different tokenization approaches. We now want to expand the corpus preprocessing. Choose the approach that yielded the best results so far and, in addition to tokenizing, remove the stopwords. Use the stopwords provided by the NLTK library, and make sure all tokens are converted to lowercase before the stopwords are removed.
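The order of the steps matters: NLTK's stopword lists are lowercase, so a capitalized token like `É` would not match unless it is lowercased first. Below is a minimal sketch of the pipeline. The `STOPWORDS` set is a tiny hypothetical list used only to keep the example self-contained, and `preprocess` is an illustrative helper name; in the exercise the set would come from NLTK.

```python
import re

# Hypothetical mini stopword list, for illustration only.
# In the exercise, replace it with:
#   import nltk
#   nltk.download("stopwords")
#   STOPWORDS = set(nltk.corpus.stopwords.words("portuguese"))
STOPWORDS = {"a", "o", "de", "da", "do", "na", "no", "é", "um", "uma", "que", "você"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, tokenize with a regex, then drop stopwords."""
    tokens = re.findall(r"\w+(?:-\w+)*", text.lower())
    return [token for token in tokens if token not in stopwords]


cleaned = preprocess("Estudamos NLP na quarta-feira. É importante você praticar.")
print(cleaned)
```

Using a `set` for the stopwords keeps each membership check fast, which becomes noticeable when the function is applied to thousands of synopses.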

7 - In addition to tokenizing, applying lowercase, and removing stopwords, we can also use techniques like lemmatization and stemming. Test both methods on the following string: `Amo amar quem ama. O amor é uma grande virtude. Virtuosas são as pessoas que amam.` For lemmatization, I recommend using the spaCy library, and for stemming, the Snowball stemmer (available in NLTK as `SnowballStemmer`).



In [None]:
#Lemmatizer


In [None]:
#stemmer


8 - Is one approach better than the other? Reflect on when to use lemmas versus stems in NLP.

9 - Based on the conclusions drawn above, continue processing the corpus. In addition to the steps you have already applied, also implement lemmatization.

10 - Now, tokenize and clean the synopses. As a preliminary analysis, count the most frequently occurring tokens across all documents in the corpus.
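One way to count tokens across the whole corpus is a single `collections.Counter` updated document by document. The sketch below uses three invented token lists as stand-ins for cleaned synopses; with the real data, `tokenized_docs` would hold the output of your preprocessing function for each row of `df_sinop`.

```python
from collections import Counter

# Toy stand-in for the cleaned corpus: one token list per document
tokenized_docs = [
    ["história", "batalha", "guerra"],
    ["detetive", "crime", "guerra"],
    ["adolescente", "prodígio", "guerra", "crime"],
]

# Accumulate counts over every document in the corpus
token_counts = Counter()
for tokens in tokenized_docs:
    token_counts.update(tokens)

# The n most frequent tokens, as (token, count) pairs
top_two = token_counts.most_common(2)
print(top_two)
```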

11 - Analyzing the most frequently used words, do you think there are any that should have been considered stopwords? Which ones? How would you suggest expanding the list of stopwords?

12 - We now have our tokenized and cleaned data. It is important to determine the total number of unique tokens across the entire corpus and identify which tokens are present in each document. Therefore, create a matrix where each row represents one of the 3,000 tokenized documents and each column represents a unique token. Count how many times each token appears in each document and fill in the matrix accordingly.
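The construction can be sketched on a toy corpus before scaling it up. The two documents below are invented, and the matrix is kept as a plain list of lists so the snippet has no dependencies; sorting the vocabulary is a design choice that makes the column order reproducible.

```python
from collections import Counter

# Toy stand-in for the cleaned corpus
tokenized_docs = [
    ["amor", "guerra", "amor"],
    ["guerra", "crime"],
]

# Columns: every unique token in the corpus, in a fixed (sorted) order
vocabulary = sorted({token for doc in tokenized_docs for token in doc})

# Rows: one count vector per document, aligned with the vocabulary
matrix = []
for doc in tokenized_docs:
    counts = Counter(doc)  # missing tokens count as 0
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
print(matrix)
```

To get a dataframe, the result could be wrapped as `pandas.DataFrame(matrix, columns=vocabulary)`. With 3,000 documents most cells will be zero, so it is worth keeping an eye on memory use as the vocabulary grows.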

13 - What uses would you suggest for the dataframe above?

14 - How can we interpret the columns of the dataframe? Is it possible to compare the context of two documents using row vectors?
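One common way to compare two row vectors is cosine similarity, which measures the angle between them and therefore ignores document length. This is offered as one option to consider, not as the expected answer. The three count vectors below are invented, and `cosine_similarity` is an illustrative helper written from the standard formula.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty documents
    return dot / (norm_u * norm_v)


# Toy rows over a three-token vocabulary
doc_1 = [2, 0, 1]
doc_2 = [0, 1, 1]
doc_3 = [4, 0, 2]  # same proportions as doc_1, twice as long

print(cosine_similarity(doc_1, doc_2))
print(cosine_similarity(doc_1, doc_3))
```

`doc_1` and `doc_3` point in the same direction, so their similarity is (up to floating-point rounding) 1, while `doc_1` and `doc_2` share only one token and score much lower.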

15 - Consider the formula:

$$
TF(t) = \frac{\text{Number of times the term } t \text{ appears in the document}}{\text{Total number of terms in the document}}
$$

Use the dataframe created in exercise 12 to create a new dataframe that, instead of the count of a specific token in a given document, presents its TF.
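As a worked instance of the formula, the sketch below divides each count by the total number of terms in its row. The vocabulary and counts are invented, and `term_frequencies` is an illustrative helper; returning zeros for an empty document is an assumption made here to avoid dividing by zero.

```python
def term_frequencies(count_row):
    """Convert one row of raw counts into TF values."""
    total = sum(count_row)  # total number of terms in the document
    if total == 0:
        return [0.0 for _ in count_row]  # empty document: assumed all zeros
    return [count / total for count in count_row]


# Toy count matrix: rows are documents, columns follow the vocabulary
vocabulary = ["amor", "crime", "guerra"]
count_matrix = [
    [2, 0, 1],
    [0, 1, 1],
]

tf_matrix = [term_frequencies(row) for row in count_matrix]
print(tf_matrix)
```

In the first document, `amor` appears 2 times out of 3 terms, so its TF is 2/3. If the counts live in a pandas dataframe `df`, the same row-wise division can be written as `df.div(df.sum(axis=1), axis=0)`.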




16 - Analyze the dataframe above and answer: What does a given TF value tell us about a term's importance within the analyzed document? And within the corpus?