# NLP3 - Preprocessing

---

### Introduction to Preprocessing in NLP with Python

Preprocessing is a vital step in natural language processing (NLP): it transforms raw text into a format suitable for analysis and modeling. Cleaning and standardizing the data improves its quality and makes downstream NLP tasks more accurate and efficient. In Python, libraries such as NLTK and spaCy provide powerful tools for preprocessing text. Common techniques include lowercasing, tokenization, stop word removal, and stemming or lemmatization. Applied well, these methods can noticeably improve model performance by keeping the training and evaluation data relevant and well-structured. This study guide explores key preprocessing techniques and demonstrates how to apply them using Python, laying a solid foundation for effective NLP projects.

---

1 - Discuss what a corpus is. What role does it play in Natural Language Processing?

In [None]:
#In NLP, a corpus is a large and structured collection of texts or linguistic data that serves as the foundation for various language processing tasks.
#It can consist of written texts, transcripts of spoken language, or any other source of language data.
#A corpus is used for training, testing, and evaluating NLP models by providing real-world examples of language usage.
#It can be general (covering a wide range of topics) or domain-specific (focused on a particular field, such as legal or medical texts),
#and is often annotated with linguistic information like part-of-speech tags or named entities to facilitate analysis and model building.

2 - Therefore, we need a [corpus](https://huggingface.co/datasets/tclopess/sinopsys_movies_portuguese) for our tests in today's class. We'll be working with a corpus of movie synopses in Portuguese. Below, you can see the code to load it.

In [None]:
#first, let's install the datasets library
!pip install datasets

In [None]:
#import installed library
from datasets import load_dataset

#load dataset
dataset = load_dataset("tclopess/sinopsys_movies_portuguese")

#convert it to pandas and slice the first 3000 data points
df_sinop = dataset['train'].to_pandas()[:3000]
df_sinop.head()



Unnamed: 0,titulo,sinopse,generos,is_valid
0,We Were Soldiers,A história da primeira grande batalha da fase ...,"['Ação', 'História', 'Guerra']",False
1,4Got10,"Um negócio de drogas dá errado, deixando corpo...","['Ação', 'Crime', 'Thriller']",False
2,Pontypool,Quando o disc jockey Grant Mazzy se reporta à ...,"['Horror', 'Mistério', 'Ficção Científica']",False
3,Ticker,Depois que o parceiro de um detetive de São Fr...,"['Ação', 'Crime', 'Thriller']",False
4,Real Genius,Um adolescente prodígio tenso entra em uma fac...,"['Comédia', 'Romance', 'Ficção Científica']",True


In [None]:
len(df_sinop)

3000

In [None]:
df_sinop.head()

Unnamed: 0,titulo,sinopse,generos,is_valid
0,We Were Soldiers,A história da primeira grande batalha da fase ...,"['Ação', 'História', 'Guerra']",False
1,4Got10,"Um negócio de drogas dá errado, deixando corpo...","['Ação', 'Crime', 'Thriller']",False
2,Pontypool,Quando o disc jockey Grant Mazzy se reporta à ...,"['Horror', 'Mistério', 'Ficção Científica']",False
3,Ticker,Depois que o parceiro de um detetive de São Fr...,"['Ação', 'Crime', 'Thriller']",False
4,Real Genius,Um adolescente prodígio tenso entra em uma fac...,"['Comédia', 'Romance', 'Ficção Científica']",True


3 - Based on the corpus documentation and the representations above, what is the data unit or what does a document represent?

In [None]:
#Each document in the corpus above is a movie synopsis.

4 - Now that we have discussed what a corpus is and gained an understanding of the concept of documents within it, let's move on to preprocessing. The first step is to define what a token is. **A token is a single unit of text, such as a word, phrase, or symbol, that is treated as a distinct element during text processing.** In our case, we will define a token as a word and explore various methods for tokenizing documents. Please test tokenization using the following methods: regular expressions (regex), the NLTK library, the split function, and spaCy.

In [None]:
df_sinop

Unnamed: 0,titulo,sinopse,generos,is_valid
0,We Were Soldiers,A história da primeira grande batalha da fase ...,"['Ação', 'História', 'Guerra']",False
1,4Got10,"Um negócio de drogas dá errado, deixando corpo...","['Ação', 'Crime', 'Thriller']",False
2,Pontypool,Quando o disc jockey Grant Mazzy se reporta à ...,"['Horror', 'Mistério', 'Ficção Científica']",False
3,Ticker,Depois que o parceiro de um detetive de São Fr...,"['Ação', 'Crime', 'Thriller']",False
4,Real Genius,Um adolescente prodígio tenso entra em uma fac...,"['Comédia', 'Romance', 'Ficção Científica']",True
...,...,...,...,...
2995,JGM (JanaGanaMana),"É o próximo filme em 3 de agosto, anunciado po...","['Ação', 'Drama']",False
2996,The Shootist,Um pistoleiro moribundo passa seus últimos dia...,"['Drama', 'Romance', 'Ocidental']",True
2997,Sleep Dealer,"Situado em um futuro próximo, mundo militariza...","['Drama', 'Ficção Científica', 'Thriller']",False
2998,Finding Neverland,Finding Neverland é um drama divertido sobre c...,['Drama'],False


In [None]:
df_sinop['sinopse'].to_list()[0]

'A história da primeira grande batalha da fase americana da Guerra do Vietnã e os soldados de ambos os lados que a travaram.'

In [None]:
#using regex
import re

df_sinop['tokens_regex'] = df_sinop['sinopse'].str.findall(r'\w+')
df_sinop['tokens_regex'].to_list()[0]


['A',
 'história',
 'da',
 'primeira',
 'grande',
 'batalha',
 'da',
 'fase',
 'americana',
 'da',
 'Guerra',
 'do',
 'Vietnã',
 'e',
 'os',
 'soldados',
 'de',
 'ambos',
 'os',
 'lados',
 'que',
 'a',
 'travaram']

In [None]:
#using nltk
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

df_sinop['tokens_nltk'] = df_sinop['sinopse'].apply(word_tokenize)
df_sinop.loc[0,'tokens_nltk']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['A',
 'história',
 'da',
 'primeira',
 'grande',
 'batalha',
 'da',
 'fase',
 'americana',
 'da',
 'Guerra',
 'do',
 'Vietnã',
 'e',
 'os',
 'soldados',
 'de',
 'ambos',
 'os',
 'lados',
 'que',
 'a',
 'travaram',
 '.']

In [None]:
#test using split
df_sinop['tokens_split'] = df_sinop['sinopse'].str.split()
df_sinop.loc[0,'tokens_split']

['A',
 'história',
 'da',
 'primeira',
 'grande',
 'batalha',
 'da',
 'fase',
 'americana',
 'da',
 'Guerra',
 'do',
 'Vietnã',
 'e',
 'os',
 'soldados',
 'de',
 'ambos',
 'os',
 'lados',
 'que',
 'a',
 'travaram.']

In [None]:
!python -m spacy download pt_core_news_lg

In [None]:
#using spacy

import spacy

nlp = spacy.load("pt_core_news_lg")
# nlp = spacy.load("pt_core_news_sm")

df_sinop['tokens_spacy'] = df_sinop['sinopse'].apply(lambda x: [x.text for x in nlp(x)])
df_sinop.loc[0,'tokens_spacy']

['A',
 'história',
 'da',
 'primeira',
 'grande',
 'batalha',
 'da',
 'fase',
 'americana',
 'da',
 'Guerra',
 'do',
 'Vietnã',
 'e',
 'os',
 'soldados',
 'de',
 'ambos',
 'os',
 'lados',
 'que',
 'a',
 'travaram',
 '.']

5 - Based on the results above, do the functions perform the same task? Which one is better? Which is more efficient? Which one would you choose? What is the outcome of the approaches above when applied to the string `Estudamos NLP na quarta-feira. É importante você praticar.`?





In [None]:
#test the options above with the sentence
#Estudamos NLP na quarta-feira.

text = "Estudamos NLP na quarta-feira. É importante você praticar."

print("word_tokenize = ", word_tokenize(text))
print("Regex = ", re.findall(r'\w+',text))
print("Split = ", text.split())
print("Spacy = ", [x.text for x in nlp(text)])



word_tokenize =  ['Estudamos', 'NLP', 'na', 'quarta-feira', '.', 'É', 'importante', 'você', 'praticar', '.']
Regex =  ['Estudamos', 'NLP', 'na', 'quarta', 'feira', 'É', 'importante', 'você', 'praticar']
Split =  ['Estudamos', 'NLP', 'na', 'quarta-feira.', 'É', 'importante', 'você', 'praticar.']
Spacy =  ['Estudamos', 'NLP', 'na', 'quarta-feira', '.', 'É', 'importante', 'você', 'praticar', '.']
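
One way to keep the speed of a regular expression while preserving hyphenated words such as `quarta-feira` is to allow an optional `-word` continuation in the pattern. The sketch below is an illustration under assumptions, not part of the original exercise: `TOKEN_PATTERN` and `tokenize_regex` are hypothetical names, and the rule only handles hyphens (punctuation is still dropped rather than returned as separate tokens).

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by
# one or more "-word" continuations, so "quarta-feira" stays in one piece.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*")

def tokenize_regex(text):
    """Return word tokens, keeping hyphenated compounds together."""
    return TOKEN_PATTERN.findall(text)

text = "Estudamos NLP na quarta-feira. É importante você praticar."
tokens = tokenize_regex(text)
print(tokens)
```

Whether dropping punctuation is desirable depends on the task: `word_tokenize` and spaCy keep `.` as its own token, which matters if sentence boundaries are needed later.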


6 - You have already explored different tokenization approaches. We now want to expand the corpus preprocessing. Choose the approach that yielded the best results so far, and in addition to tokenizing, remove the stopwords. Use the stopwords provided by the NLTK library, and ensure that all tokens are converted to lowercase before removing them.

In [None]:
#using nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = list(set(stopwords.words('portuguese')))
print(stop)

import string
# Store the ASCII punctuation characters (a single string) in the variable pontuacoes
pontuacoes = string.punctuation
print(pontuacoes)

['houveriam', 'há', 'mas', 'e', 'eram', 'pelas', 'temos', 'como', 'teríamos', 'estive', 'lhes', 'não', 'das', 'aos', 'tiveram', 'com', 'estivermos', 'seu', 'seus', 'ele', 'aqueles', 'seríamos', 'aquele', 'minha', 'dela', 'ser', 'da', 'elas', 'nos', 'que', 'ela', 'só', 'dos', 'estiveram', 'haja', 'ao', 'fomos', 'o', 'for', 'nós', 'tenham', 'tivessem', 'até', 'estas', 'será', 'já', 'seja', 'sua', 'teus', 'do', 'nossa', 'seriam', 'está', 'à', 'em', 'havemos', 'muito', 'estivessem', 'delas', 'pela', 'tenho', 'houvermos', 'tu', 'houvessem', 'tivéramos', 'pelo', 'era', 'tem', 'houverei', 'houverá', 'teu', 'aquilo', 'estivéramos', 'houver', 'a', 'serei', 'foram', 'tivera', 'esta', 'na', 'esteve', 'nosso', 'seria', 'de', 'tém', 'essa', 'qual', 'quem', 'fossem', 'aquela', 'estivesse', 'teve', 'seremos', 'tivermos', 'sejam', 'houvemos', 'estou', 'isso', 'houvera', 'por', 'minhas', 'mesmo', 'nossos', 'tivesse', 'meus', 'hajamos', 'isto', 'somos', 'houvéssemos', 'nossas', 'estejam', 'ou', 'se', 't

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
#using nltk
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

df_sinop['tokens_nltk'] = df_sinop['sinopse'].apply(lambda y: [x.lower() for x in word_tokenize(y) if x.lower() not in stop and x.lower() not in pontuacoes])
df_sinop.loc[0,'tokens_nltk']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['história',
 'primeira',
 'grande',
 'batalha',
 'fase',
 'americana',
 'guerra',
 'vietnã',
 'soldados',
 'ambos',
 'lados',
 'travaram']
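
Note that `pontuacoes` is a string, so `x.lower() not in pontuacoes` is a substring test: a multi-character token such as `...` is not a substring of `string.punctuation` and therefore survives the filter. A minimal sketch of a stricter check, assuming the goal is to drop any token made up entirely of punctuation characters (the helper name `is_punctuation` and the sample tokens are hypothetical):

```python
import string

# Set of individual ASCII punctuation characters.
PUNCTUATION = set(string.punctuation)

def is_punctuation(token):
    """True when the token is non-empty and every character is ASCII punctuation."""
    return bool(token) and all(ch in PUNCTUATION for ch in token)

tokens = ["guerra", "...", ".", "j.m.", "quarta-feira"]
kept = [t for t in tokens if not is_punctuation(t)]
print(kept)
```

Tokens that merely contain punctuation, such as `j.m.` or `quarta-feira`, are kept; only tokens consisting solely of punctuation are removed.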

In [None]:
df_sinop

Unnamed: 0,titulo,sinopse,generos,is_valid,tokens_regex,tokens_nltk,tokens_split,tokens_spacy
0,We Were Soldiers,A história da primeira grande batalha da fase ...,"['Ação', 'História', 'Guerra']",False,"[A, história, da, primeira, grande, batalha, d...","[história, primeira, grande, batalha, fase, am...","[A, história, da, primeira, grande, batalha, d...","[A, história, da, primeira, grande, batalha, d..."
1,4Got10,"Um negócio de drogas dá errado, deixando corpo...","['Ação', 'Crime', 'Thriller']",False,"[Um, negócio, de, drogas, dá, errado, deixando...","[negócio, drogas, dá, errado, deixando, corpos...","[Um, negócio, de, drogas, dá, errado,, deixand...","[Um, negócio, de, drogas, dá, errado, ,, deixa..."
2,Pontypool,Quando o disc jockey Grant Mazzy se reporta à ...,"['Horror', 'Mistério', 'Ficção Científica']",False,"[Quando, o, disc, jockey, Grant, Mazzy, se, re...","[disc, jockey, grant, mazzy, reporta, estação,...","[Quando, o, disc, jockey, Grant, Mazzy, se, re...","[Quando, o, disc, jockey, Grant, Mazzy, se, re..."
3,Ticker,Depois que o parceiro de um detetive de São Fr...,"['Ação', 'Crime', 'Thriller']",False,"[Depois, que, o, parceiro, de, um, detetive, d...","[parceiro, detetive, francisco, assassinado, t...","[Depois, que, o, parceiro, de, um, detetive, d...","[Depois, que, o, parceiro, de, um, detetive, d..."
4,Real Genius,Um adolescente prodígio tenso entra em uma fac...,"['Comédia', 'Romance', 'Ficção Científica']",True,"[Um, adolescente, prodígio, tenso, entra, em, ...","[adolescente, prodígio, tenso, entra, faculdad...","[Um, adolescente, prodígio, tenso, entra, em, ...","[Um, adolescente, prodígio, tenso, entra, em, ..."
...,...,...,...,...,...,...,...,...
2995,JGM (JanaGanaMana),"É o próximo filme em 3 de agosto, anunciado po...","['Ação', 'Drama']",False,"[É, o, próximo, filme, em, 3, de, agosto, anun...","[próximo, filme, 3, agosto, anunciado, vijay, ...","[É, o, próximo, filme, em, 3, de, agosto,, anu...","[É, o, próximo, filme, em, 3, de, agosto, ,, a..."
2996,The Shootist,Um pistoleiro moribundo passa seus últimos dia...,"['Drama', 'Romance', 'Ocidental']",True,"[Um, pistoleiro, moribundo, passa, seus, últim...","[pistoleiro, moribundo, passa, últimos, dias, ...","[Um, pistoleiro, moribundo, passa, seus, últim...","[Um, pistoleiro, moribundo, passa, seus, últim..."
2997,Sleep Dealer,"Situado em um futuro próximo, mundo militariza...","['Drama', 'Ficção Científica', 'Thriller']",False,"[Situado, em, um, futuro, próximo, mundo, mili...","[situado, futuro, próximo, mundo, militarizado...","[Situado, em, um, futuro, próximo,, mundo, mil...","[Situado, em, um, futuro, próximo, ,, mundo, m..."
2998,Finding Neverland,Finding Neverland é um drama divertido sobre c...,['Drama'],False,"[Finding, Neverland, é, um, drama, divertido, ...","[finding, neverland, drama, divertido, sobre, ...","[Finding, Neverland, é, um, drama, divertido, ...","[Finding, Neverland, é, um, drama, divertido, ..."


7 - In addition to tokenizing, applying lowercase, and removing stopwords, we can also use techniques like lemmatization and stemming. Test both methods on the following string: `Amo amar quem ama. O amor é uma grande virtude. Virtuosas são as pessoas que amam.` For lemmatization, I recommend using the spaCy library, and for stemming, the Snowball stemmer available in NLTK.



In [None]:
#there are other ways to preprocess text as well, but here I will highlight two
#Lemmatizer
text = "Amo amar quem ama. O amor é uma grande virtude. Virtuosas são as pessoas que amam."
print([x.lemma_.lower() for x in nlp(text)])
print(set([x.lemma_.lower() for x in nlp(text)]))

['amar', 'amar', 'quem', 'amar', '.', 'o', 'amor', 'ser', 'um', 'grande', 'virtude', '.', 'virtuosas', 'ser', 'o', 'pessoa', 'que', 'amar', '.']
{'ser', 'quem', 'que', 'pessoa', 'virtude', 'amor', 'virtuosas', 'o', '.', 'amar', 'grande', 'um'}


In [None]:
#stemmer
from nltk.stem.snowball import SnowballStemmer
snow_stemmer = SnowballStemmer(language='portuguese')

text = "Amo amar quem ama. O amor é uma grande virtude. Virtuosas são as pessoas que amam."
print([snow_stemmer.stem(x.text) for x in nlp(text)])

['amo', 'amar', 'quem', 'ama', '.', 'o', 'amor', 'é', 'uma', 'grand', 'virtud', '.', 'virtuos', 'sã', 'as', 'pesso', 'que', 'amam', '.']


8 - Is there a better approach? Reflect on when to use lemmas versus stems in NLP.

In [None]:
# Use Stemming when:
# Speed is a priority.
# The application can tolerate some inaccuracies in word forms.
# The focus is on keyword extraction and search functionalities.


# Use Lemmatization when:
# The application demands a deeper understanding of language context.
# Accurate interpretations of words are essential for tasks like semantic analysis.
# You are working with complex sentences where the meaning significantly depends on the exact word form.

9 - Based on the conclusions drawn above, continue processing the corpus. In addition to the steps you have already applied, also implement lemmatization.

In [None]:
nlp = spacy.load("pt_core_news_lg")

df_sinop['tokens_spacy'] = df_sinop['sinopse'].apply(lambda x: [x.lemma_.lower() for x in nlp(x) if x.text.lower() not in stop and x.text.lower() not in pontuacoes])
df_sinop.loc[0,'tokens_spacy']

['história',
 'primeiro',
 'grande',
 'batalha',
 'fase',
 'americano',
 'guerra',
 'vietnã',
 'soldado',
 'ambos',
 'lado',
 'travar']

In [None]:
df_sinop = df_sinop[['titulo','sinopse','generos','tokens_spacy']]
df_sinop

Unnamed: 0,titulo,sinopse,generos,tokens_spacy
0,We Were Soldiers,A história da primeira grande batalha da fase ...,"['Ação', 'História', 'Guerra']","[história, primeiro, grande, batalha, fase, am..."
1,4Got10,"Um negócio de drogas dá errado, deixando corpo...","['Ação', 'Crime', 'Thriller']","[negócio, droga, dar, errar, deixar, corpo, xe..."
2,Pontypool,Quando o disc jockey Grant Mazzy se reporta à ...,"['Horror', 'Mistério', 'Ficção Científica']","[disc, jockey, grant, mazzy, reportar, estação..."
3,Ticker,Depois que o parceiro de um detetive de São Fr...,"['Ação', 'Crime', 'Thriller']","[parceiro, detetive, francisco, assassinar, te..."
4,Real Genius,Um adolescente prodígio tenso entra em uma fac...,"['Comédia', 'Romance', 'Ficção Científica']","[adolescente, prodígio, tenso, entrar, faculda..."
...,...,...,...,...
2995,JGM (JanaGanaMana),"É o próximo filme em 3 de agosto, anunciado po...","['Ação', 'Drama']","[próximo, filme, 3, agosto, anunciar, vijay, d..."
2996,The Shootist,Um pistoleiro moribundo passa seus últimos dia...,"['Drama', 'Romance', 'Ocidental']","[pistoleiro, moribundo, passar, último, dia, p..."
2997,Sleep Dealer,"Situado em um futuro próximo, mundo militariza...","['Drama', 'Ficção Científica', 'Thriller']","[situado, futuro, próximo, mundo, militarizar,..."
2998,Finding Neverland,Finding Neverland é um drama divertido sobre c...,['Drama'],"[finding, neverland, drama, divertido, sobre, ..."


10 - Now, tokenize and clean the synopses. As a preliminary analysis, count the most frequently occurring tokens across all documents in the corpus.

In [1]:
import itertools
from collections import Counter

In [None]:
#this can also be done with defaultdict
#do you know how it works?
from collections import defaultdict
dic_total = defaultdict(int)
dic_total

#How often does each token appear across all documents?
#here every occurrence counts, including repeats within the same document
for document in df_sinop['tokens_spacy']:
    for token in document:
        dic_total[token]+=1

# dic_total
print(sorted(dic_total.items(), key=lambda item: item[1],reverse=True))

#or

full_list = Counter(list(itertools.chain(*df_sinop['tokens_spacy'])))
print(full_list.most_common())

[('vida', 522), ('ano', 369), ('jovem', 342), ('descobrir', 337), ('dois', 319), ('encontrar', 310), ('enquanto', 294), ('amigo', 277), ('dever', 267), ('poder', 265), ('homem', 264), ('fazer', 260), ('história', 256), ('família', 251), ('pai', 243), ('tornar', 241), ('todo', 240), ('novo', 236), ('cidade', 235), ('começar', 231), ('casa', 213), ('mundo', 211), ('tentar', 207), ('sobre', 198), ('filme', 194), ('grande', 187), ('filho', 185), ('levar', 183), ('guerra', 177), ('onde', 177), ('outro', 174), ('contra', 171), ('matar', 162), ('mulher', 162), ('lutar', 160), ('conhecer', 157), ('durante', 154), ('após', 152), ('grupo', 150), ('ficar', 149), ('assassino', 143), ('amor', 136), ('vez', 132), ('...', 131), ('próprio', 131), ('assassinato', 127), ('dia', 126), ('ter', 126), ('morte', 124), ('chamar', 123), ('terra', 122), ('policial', 121), ('filha', 121), ('tempo', 121), ('tudo', 120), ('mãe', 120), ('adolescente', 119), ('ver', 119), ('irmão', 119), ('três', 118), ('misterioso'

11 - Analyzing the most frequently used words, do you think there are any that should have been considered stopwords? Which ones? How would you suggest expanding the list of stopwords?
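
One possible way to approach this question, sketched under assumptions: extend the base list with hand-picked words, and also flag tokens that occur in a large share of documents. The toy corpus, the `extra_stop` words, and the 0.5 threshold below are illustrative choices, not values taken from the exercise.

```python
from collections import Counter

# Toy tokenized documents standing in for df_sinop['tokens_spacy'] (illustrative only).
docs = [
    ["filme", "sobre", "guerra", "soldado"],
    ["filme", "sobre", "amor", "família"],
    ["jovem", "descobrir", "filme", "cidade"],
]

# Stand-in for the NLTK list; in the notebook this would be `stop`.
base_stop = {"de", "a", "o", "que"}

# 1) Manual extension with words judged uninformative for this corpus.
extra_stop = {"sobre", "enquanto", "onde"}

# 2) Document frequency: in how many documents does each token appear?
#    set(doc) counts a token at most once per document.
doc_freq = Counter(token for doc in docs for token in set(doc))
threshold = 0.5  # hypothetical cut-off: present in more than half of the documents
frequent = {t for t, n in doc_freq.items() if n / len(docs) > threshold}

custom_stop = base_stop | extra_stop | frequent
cleaned = [[t for t in doc if t not in custom_stop] for doc in docs]
print(sorted(frequent))
print(cleaned)
```

A frequency-based cut-off is blunt: a word like `guerra` may be common and still informative for genre, so candidates are usually reviewed by hand before being added.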

12 - We now have our tokenized and cleaned data. It is important to determine the total number of unique tokens across the entire corpus and identify which tokens are present in each document. Therefore, create a matrix where each row represents one of the 3,000 tokenized documents and each column represents a unique token. Count how many times each token appears in each document and fill in the matrix accordingly.

In [None]:
#list used to build a dataframe with the BoW (bag-of-words) vectors
list_dict_documents = []

#for each document
for document in df_sinop['tokens_spacy']:
    #count the tokens of this document
    cont_doc = Counter(document)
    #one entry per vocabulary token: its count in this document (Counter returns 0 if absent)
    dict_doc = {key: cont_doc[key] for key in dic_total}

    #at the end of each iteration, append the row for this document
    list_dict_documents.append(dict_doc)

In [None]:
import pandas as pd

#creating the dataframe
df_cont = pd.DataFrame(list_dict_documents)
df_cont

Unnamed: 0,história,primeiro,grande,batalha,fase,americano,guerra,vietnã,soldado,ambos,...,finding,neverland,pan,j.m.,barrie,sylvia,wilder,lemmon,matthau,esgotado
0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2998,1,0,0,0,0,0,0,0,0,0,...,1,2,2,1,2,1,0,0,0,0


In [None]:
df_cont['vida'].sum()

522

In [None]:
dic_total['vida']

522
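
The same count matrix can be sketched without pandas to make the structure explicit: one row per document, one column per vocabulary token. The toy documents and the names `vocabulary` and `bow_matrix` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

docs = [
    ["guerra", "soldado", "guerra"],
    ["amor", "família"],
    ["soldado", "amor", "amor"],
]

# Vocabulary in order of first appearance across the corpus.
vocabulary = list(dict.fromkeys(token for doc in docs for token in doc))

# One row per document; Counter returns 0 for tokens the document lacks.
bow_matrix = []
for doc in docs:
    counts = Counter(doc)
    bow_matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)

# Sanity check in the spirit of df_cont['vida'].sum(): a column total
# equals the token's total frequency in the corpus.
column_total = sum(row[vocabulary.index("amor")] for row in bow_matrix)
print(column_total)
```

Most entries of such a matrix are zero, which is why larger corpora are usually stored in a sparse format rather than as a dense table.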

13 - What uses do you suggest we explore with the above dataframe?

In [None]:
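#each row is a document
#each column is a token
#each row is therefore a bag-of-words vector, usable as features for document similarity, clustering, or classification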

14 - How can we interpret the columns of the dataframe? Is it possible to compare the context of two documents using row vectors?

In [None]:
#we can compare documents by checking whether, and how often, they share the same tokens
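
A common way to compare two row vectors is cosine similarity, which is 1 when the vectors point in the same direction and 0 when the documents share no tokens. This is a sketch on toy count vectors; the function name `cosine_similarity` and the example rows are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

# Columns: guerra, soldado, amor, família (toy vocabulary)
doc_a = [2, 1, 0, 0]
doc_b = [1, 1, 0, 0]
doc_c = [0, 0, 1, 1]

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(round(sim_ab, 3), sim_ac)
```

Because the vectors ignore word order and meaning, a high score only says that two documents use similar vocabulary, which is a loose proxy for shared context.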

15 - Consider the formula:

$$
TF(t) = \frac{\text{Number of times the term } t \text{ appears in the document}}{\text{Total number of terms in the document}}
$$

Use the dataframe created in exercise 12 to now create a dataframe that, instead of the count of a specific token in a given document, presents its TF.
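
As a small worked illustration of the formula, the sketch below computes TF for a single tokenized document. The helper name `term_frequencies` and the toy document are assumptions; the guard for empty documents is an added safety choice, since a synopsis with no remaining tokens would otherwise cause a division by zero.

```python
from collections import Counter

def term_frequencies(tokens):
    """Map each token to count / total number of tokens in the document."""
    total = len(tokens)
    if total == 0:
        return {}  # empty document: no terms, so no TF values
    return {token: count / total for token, count in Counter(tokens).items()}

doc = ["amar", "amar", "quem", "amar", "amor"]
tf = term_frequencies(doc)
print(tf)
```

The TF values of a document always sum to 1. For a document with 12 tokens that each occur once, every TF is 1/12 ≈ 0.0833.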




In [None]:
#do the same to build a dataframe of TFs
list_dict_tfs = []
#go through each document
for document in list_dict_documents:
    #total number of tokens in the document
    sum_document = sum(document.values())
    #dict that holds the TF of each token for this document
    dict_doct = defaultdict(int)
    for key in document.keys():
        #count of this specific token divided by the total number of tokens in the document
        dict_doct[key]+= document[key]/sum_document

    #append the row used to build the dataframe
    list_dict_tfs.append(dict_doct)

In [None]:
#create the dataframe
df_tf = pd.DataFrame(list_dict_tfs)
df_tf

Unnamed: 0,história,primeiro,grande,batalha,fase,americano,guerra,vietnã,soldado,ambos,...,finding,neverland,pan,j.m.,barrie,sylvia,wilder,lemmon,matthau,esgotado
0,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2996,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2997,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2998,0.028571,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.028571,0.057143,0.057143,0.028571,0.057143,0.028571,0.000000,0.000000,0.000000,0.000000


16 - Analyze the dataframe above and answer: What is the importance of a given TF for the analyzed document? And for the corpus?

In [None]:
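#TF measures how often a term appears in a document relative to the document's length, indicating its relevance within that text. It highlights important terms in a document.
#For the corpus, TF alone says little: it is computed per document, so it cannot tell whether a term is distinctive or simply common across all documents.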