# NLP4 - Text Processing Techniques: TF-IDF and LDA

In this session we will explore powerful techniques for understanding and analyzing text data. Two key concepts you'll learn are TF-IDF and LDA.

TF-IDF (Term Frequency-Inverse Document Frequency) is a method used to evaluate how important a word is in a document relative to a collection of documents. It helps filter out common words while highlighting those that are more meaningful in specific contexts.

LDA (Latent Dirichlet Allocation) is a topic modeling technique. It helps identify underlying topics in a set of documents by grouping words that frequently appear together.

These tools will give you insights into patterns in text data, opening doors to advanced text analysis!

---

## Install Libraries

In [1]:
# first, install the datasets library and the spaCy Portuguese model
# !pip install datasets
# !python -m spacy download pt_core_news_lg

## Imports

In [2]:
#dataset library
from datasets import load_dataset

# from NLP_Lab4_student import vec_636  # disabled: local module not found, and vec_636 is not used below

#load dataset
dataset = load_dataset("tclopess/sinopsys_movies_portuguese")
#convert it to pandas and slice the first 3000 data points
df_sinop = dataset['train'].to_pandas()[:3000]


#NLP tool box nltk
import nltk
from nltk.corpus import stopwords
#getting stop words
# nltk.download('stopwords')
stop = list(set(stopwords.words('portuguese')))
print(stop)

#string library
import string
#get list of punctuations
pontuacoes = string.punctuation
print(pontuacoes)

#NLP toolbox spacy
import spacy
#load portuguese module large
nlp = spacy.load("pt_core_news_lg")

#other python support libraries and methods
import itertools
from collections import Counter
from collections import defaultdict
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

#dataframes library
import pandas as pd

#LDA library
import gensim
import gensim.corpora as corpora
from gensim.models.ldamodel import LdaModel


## Term Frequency - Inverse Document Frequency (TF-IDF)




The **TF-IDF** (Term Frequency - Inverse Document Frequency) model is an improvement over the Bag of Words. It not only takes into account the frequency of words in a document but also considers how important a word is in the entire corpus. The idea is that words that appear frequently in a document but rarely in the rest of the corpus are more meaningful for that document. TF-IDF assigns higher weights to such terms, thus reducing the impact of common words (e.g., "the", "and").

- **Term Frequency (TF)**: Measures how often a word appears in a document.
- **Inverse Document Frequency (IDF)**: Reduces the weight of commonly occurring words across multiple documents.

TF-IDF helps prioritize terms that are more informative for distinguishing between documents.
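
Combining the two components, the weight of a term $t$ in a document $d$ is usually written as the product below, with TF and IDF as defined in the exercises that follow. This is the common textbook form and is assumed here; libraries such as scikit-learn add smoothing and normalization on top of it.

$$
\text{TF-IDF}(t, d) = TF(t, d) \times IDF(t)
$$

For instance, under this form a term with $TF = 0.1$ in a document and $IDF = 2.0$ in the corpus receives a weight of $0.2$.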

### Practicing


1 - Using the concepts from the previous class, create a function that takes a string as a parameter and returns a list of pre-processed tokens. The tokens should be lowercase, lemmas, and must not be punctuation or stopwords.

In [4]:
# preprocessing function: lowercase lemmas, dropping stopwords and punctuation
def preprocessing(text):
    return [
        token.lemma_.lower()
        for token in nlp(text)
        if token.text.lower() not in stop and token.text not in pontuacoes
    ]


2 - Using the function you created in the previous exercise, preprocess all the synopsis texts contained in the dataframe.

In [8]:
df_sinop.head()

Unnamed: 0,titulo,sinopse,generos,is_valid
0,We Were Soldiers,A história da primeira grande batalha da fase ...,"['Ação', 'História', 'Guerra']",False
1,4Got10,"Um negócio de drogas dá errado, deixando corpo...","['Ação', 'Crime', 'Thriller']",False
2,Pontypool,Quando o disc jockey Grant Mazzy se reporta à ...,"['Horror', 'Mistério', 'Ficção Científica']",False
3,Ticker,Depois que o parceiro de um detetive de São Fr...,"['Ação', 'Crime', 'Thriller']",False
4,Real Genius,Um adolescente prodígio tenso entra em uma fac...,"['Comédia', 'Romance', 'Ficção Científica']",True


In [5]:
preprocessed_docs = [preprocessing(x) for x in df_sinop['sinopse'].to_list()]

In [6]:
preprocessed_docs[0]

['história',
 'primeiro',
 'grande',
 'batalha',
 'fase',
 'americano',
 'guerra',
 'vietnã',
 'soldado',
 'ambos',
 'lado',
 'travar']

In [9]:
#create a new column with preprocessed texts
df_sinop['tokens'] = df_sinop['sinopse'].apply(preprocessing)


3 - Create a dataframe containing the tf values for all tokens in the documents. Consider the function below:

$$
TF(t,d) = \frac{\text{Number of times the term } t \text{ appears in the document } d } {\text{Total number of terms in the document } d}
$$
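
The formula can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: `term_frequencies` and `toy_doc` are hypothetical names, and the toy document is invented rather than taken from the dataset.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return {term: count / total_terms} for one tokenized document."""
    total = len(tokens)
    if total == 0:
        return {}  # an empty document has no term frequencies
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical toy document (not from the dataset)
toy_doc = ["guerra", "soldado", "guerra", "batalha"]
tf = term_frequencies(toy_doc)
print(tf)  # {'guerra': 0.5, 'soldado': 0.25, 'batalha': 0.25}
```

Dividing by the document length keeps long synopses from dominating simply because they contain more words.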



In [12]:
df_sinop.head()

Unnamed: 0,titulo,sinopse,generos,is_valid,tokens
0,We Were Soldiers,A história da primeira grande batalha da fase ...,"['Ação', 'História', 'Guerra']",False,"[história, primeiro, grande, batalha, fase, am..."
1,4Got10,"Um negócio de drogas dá errado, deixando corpo...","['Ação', 'Crime', 'Thriller']",False,"[negócio, droga, dar, errar, deixar, corpo, xe..."
2,Pontypool,Quando o disc jockey Grant Mazzy se reporta à ...,"['Horror', 'Mistério', 'Ficção Científica']",False,"[disc, jockey, grant, mazzy, reportar, estação..."
3,Ticker,Depois que o parceiro de um detetive de São Fr...,"['Ação', 'Crime', 'Thriller']",False,"[parceiro, detetive, francisco, assassinar, te..."
4,Real Genius,Um adolescente prodígio tenso entra em uma fac...,"['Comédia', 'Romance', 'Ficção Científica']",True,"[adolescente, prodígio, tenso, entrar, faculda..."


In [14]:
list_example = [
    {
        'historia':1
    },
    {
        'historia':2
    },
    {
        'historia':1,
        'amor':1
    }
]

In [16]:
#create a dataframe from the example list but filling NaN with 0
pd.DataFrame(list_example).fillna(0)

| | historia | amor |
|---|---|---|
| 0 | 1 | 0.0 |
| 1 | 2 | 0.0 |
| 2 | 1 | 1.0 |


In [22]:
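# build one Counter of raw term counts per document
# (counts are not divided by document length, so this is only the numerator of the TF formula)
list_dict_tfs = [Counter(tokens_doc) for tokens_doc in df_sinop['tokens']]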

In [28]:
df_tf = pd.DataFrame(list_dict_tfs).fillna(0)

In [29]:
df_tf

Unnamed: 0,história,primeiro,grande,batalha,fase,americano,guerra,vietnã,soldado,ambos,...,neverland,pan,escrita,j.m.,barrie,sylvia,wilder,lemmon,matthau,esgotado
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2998,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,1.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0


In [32]:
#filter dataframe to show only documents where 'história' appears at least once (count > 0)
df_tf[df_tf['história']>0]


Unnamed: 0,história,primeiro,grande,batalha,fase,americano,guerra,vietnã,soldado,ambos,...,neverland,pan,escrita,j.m.,barrie,sylvia,wilder,lemmon,matthau,esgotado
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
52,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2947,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2950,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2954,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2991,1.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


4 - Now consider the IDF formula below. Calculate an IDF vector for all tokens in the corpus.

$$
IDF(t) = \log \left( \frac{\text{Number of documents in the corpus}}{\text{Number of documents where the term } t \text{ appears}} \right)
$$
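
As a small illustration of the formula, the sketch below computes IDF on an invented four-document corpus. The function name `inverse_document_frequencies` and the data in `toy_docs` are assumptions made for the example; the natural logarithm is used, matching `math.log`.

```python
import math
from collections import Counter


def inverse_document_frequencies(docs):
    """Return {term: log(N / df)}, where df counts the documents containing the term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # set() counts each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical toy corpus (not from the dataset)
toy_docs = [
    ["guerra", "soldado", "guerra"],
    ["amor", "guerra"],
    ["amor", "história"],
    ["história", "família"],
]
idf = inverse_document_frequencies(toy_docs)
for term, value in sorted(idf.items(), key=lambda pair: pair[1]):
    print(f"{term}: {value:.3f}")
```

Terms found in two of the four documents get $\log(4/2) \approx 0.693$, while terms found in only one get $\log(4/1) \approx 1.386$. A term present in every document would get $\log(1) = 0$.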

In [36]:
import math

# corpus-wide token counts; the keys give the vocabulary in first-seen order
full_list = Counter(itertools.chain.from_iterable(preprocessed_docs))
len_corpus = len(df_sinop)

# document frequency: in how many documents each token appears
# (set() makes sure a token is counted once per document)
doc_freq = Counter(itertools.chain.from_iterable(set(doc) for doc in df_sinop['tokens']))

# IDF(t) = log(N / df(t)), using the natural logarithm
dic_idf = {token: math.log(len_corpus / doc_freq[token]) for token in full_list}






In [38]:
#create a dataframe from the idf dictionary
df_idf = pd.DataFrame.from_dict(dic_idf, orient='index', columns=['idf'])
df_idf

Unnamed: 0,idf
história,2.559630
primeiro,3.563716
grande,2.853076
batalha,4.135167
fase,7.313220
...,...
sylvia,8.006368
wilder,8.006368
lemmon,8.006368
matthau,8.006368


In [39]:
#get range of idf values
df_idf['idf'].min(), df_idf['idf'].max()

(np.float64(1.8536348729461425), np.float64(8.006367567650246))

5 - Analyze the TF and IDF separately. What would be their relationship with the corpus, with a specific document, or with a specific term?

In [40]:
#TF measures how often a term appears in a document relative to the total number of terms in that document, so it describes the weight of a term within that specific document.
#IDF measures how rare a term is across the entire corpus (the collection of documents), so it depends on the term and the corpus but not on any single document.
#high idf means the term appears in few documents (rare)
#low idf means the term appears in many documents (common)
#idf values should be read relative to their range in this corpus (here roughly 1.85 to 8.01)

6 - Using the data structures you used to separately calculate the TF and IDF above, return the TF-IDF value for the token 'história' in document 45.

In [41]:
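#TF-IDF of the token in document 45: its count in that document times its IDF
#(df_tf holds raw counts, so the TF here is not normalized by document length)
token = 'história'
print(df_tf.loc[45, token] * dic_idf[token])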

2.559630195983937


## Cosine Similarity

In the BOW model, texts are represented as vectors that count the occurrence of words in each document, ignoring word order and focusing on frequency. The similarity between documents can be assessed using these vectors through metrics like **cosine similarity**. Cosine similarity measures the angle between two vectors, determining how similar the documents are based on the words they share, even if in different quantities. This allows for efficient comparison of text content using the vector representations created by BOW.
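
To make the measure concrete, the cosine of the angle between vectors $u$ and $v$ is $\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The sketch below implements it in plain Python; the function name `cosine`, the toy vectors, and the choice to return 0.0 for a zero vector are assumptions made for this example, not how `sklearn` is implemented.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical word-count vectors over a three-word vocabulary
a = [1, 1, 0]
b = [2, 2, 0]  # same words as a, twice as often
c = [0, 0, 3]  # shares no words with a

print(cosine(a, b))  # close to 1: same direction, different magnitude
print(cosine(a, c))  # 0.0: no shared words
```

Because only the direction of the vectors matters, a document and a copy of it repeated twice score as (almost exactly) identical.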


### Practicing

1 - Consider the vectors below. Which ones are most similar to each other?

In [46]:
X = [0, 0, 0, 1, 1, 1]
Y = [1, 0, 0, 1, 1, 0]
Z = [0, 1, 0, 0, 0, 0]

2 - Answer the question above using the `cosine_similarity` function.

In [49]:
#print all similarities X, Y and Z
X_reshaped = np.array(X).reshape(1, -1)
Y_reshaped = np.array(Y).reshape(1, -1)
Z_reshaped = np.array(Z).reshape(1, -1)
print("Similarity between X and Y:", cosine_similarity(X_reshaped, Y_reshaped)[0][0])
print("Similarity between X and Z:", cosine_similarity(X_reshaped, Z_reshaped)[0][0])
print("Similarity between Y and Z:", cosine_similarity(Y_reshaped, Z_reshaped)[0][0])

Similarity between X and Y: 0.6666666666666669
Similarity between X and Z: 0.0
Similarity between Y and Z: 0.0


2 - Create a dataframe for the analyzed corpus, where each row represents a document and each column represents a unique token. Each row will therefore indicate how many times a particular token appears in a given document.

In [50]:
#represent each document as a binary vector over the vocabulary:
#1 if the token appears in the document, 0 otherwise (presence, not counts)
list_dict_documents = []
for tokens_doc in df_sinop['tokens']:
    tokens_set = set(tokens_doc)
    list_dict_documents.append({token: int(token in tokens_set) for token in full_list})


In [51]:
#create the dataframe
df_cont = pd.DataFrame(list_dict_documents)
df_cont

Unnamed: 0,história,primeiro,grande,batalha,fase,americano,guerra,vietnã,soldado,ambos,...,neverland,pan,escrita,j.m.,barrie,sylvia,wilder,lemmon,matthau,esgotado
0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2998,1,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,0,0,0,0


3 - Consider the synopses below. Which of the 3 are most similar or discuss the same topic?

In [52]:
df_sinop.loc[636,'sinopse']

'Quando a família de Frank Castle é assassinada por criminosos, ele trava uma guerra contra o crime como um assassino vigilante conhecido apenas como O Justiceiro.'

In [53]:
df_sinop.loc[999,'sinopse']

'O mafioso e assassino de aluguel Jimmy Conlon tem uma noite para descobrir onde está sua lealdade: com seu filho distante, Mike, cuja vida está em perigo, ou seu melhor amigo de longa data, o chefe da máfia Shawn Maguire, que quer que Mike pague pela morte de seu próprio filho.'

In [54]:
df_sinop.loc[14,'sinopse']

'Em julho de 1969, a corrida espacial terminou quando a Apollo 11 cumpriu o desafio do presidente Kennedy de “pousar um homem na Lua e trazê-lo de volta são e salvo à Terra”. Ninguém que testemunhou o pouso lunar jamais o esquecerá. O documentário de Al Reinert, For All Mankind, é a história dos vinte e quatro homens que viajaram para a lua, contada em suas palavras, em suas vozes, usando as imagens de suas experiências. Quarenta anos após o primeiro pouso na lua, continua sendo a obra de cinema mais radical e visualmente deslumbrante já feita sobre esse evento de abalar a terra.'

4 - Use the cosine similarity function to justify your answer.

In [60]:
similarity_matrix = cosine_similarity(df_cont.loc[[636, 999, 14]])
similarity_df = pd.DataFrame(similarity_matrix, index=[636, 999, 14], columns=[636, 999, 14])
similarity_df


| | 636 | 999 | 14 |
|---|---|---|---|
| 636 | 1.0 | 0.051434 | 0.0 |
| 999 | 0.051434 | 1.0 | 0.0 |
| 14 | 0.0 | 0.0 | 1.0 |


## Topic Modeling and LDA

While TF-IDF is effective for identifying key terms, it doesn’t provide insight into the underlying topics within the text.

This is where **Topic Modeling** comes in. It’s a technique used to automatically uncover hidden topics in large collections of text. A widely-used topic modeling method is **Latent Dirichlet Allocation (LDA)**, which goes beyond word frequencies to model the distribution of topics across documents and the distribution of words within topics. LDA assumes that each document consists of multiple topics, and each topic consists of related words.
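
The idea that a document mixes several topics can be illustrated by simulating LDA's generative story: draw a topic mixture for the document, then for each word pick a topic from that mixture and a word from that topic. The sketch below is a simplified illustration using only the standard library. The vocabulary, the two hand-written topic-word distributions, and `alpha=0.5` are all assumptions (in the full model the topic-word distributions are themselves drawn from a Dirichlet prior), and this is not how gensim trains a model.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document: a topic mixture, then one topic and one word per position."""
    n_topics = len(topic_word)
    theta = sample_dirichlet([alpha] * n_topics, rng)  # document-topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]  # topic for this position
        w = rng.choices(vocab, weights=topic_word[z])[0]    # word drawn from that topic
        words.append(w)
    return theta, words


rng = random.Random(0)
vocab = ["guerra", "soldado", "amor", "família"]
topic_word = [
    [0.45, 0.45, 0.05, 0.05],  # hypothetical "war" topic
    [0.05, 0.05, 0.45, 0.45],  # hypothetical "family" topic
]
theta, words = generate_document(topic_word, vocab, alpha=0.5, n_words=8, rng=rng)
print("topic mixture:", theta)
print("words:", words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.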


### Practicing

1 - Discuss the paper that originated LDA. Take notes below in order to understand what the model is and how a single document can be composed of multiple topics.

2 - For this study, we will use the `Gensim` library. The first step is to create a dictionary. Use `corpora.Dictionary()` to create a dictionary that we will use in the model. Understand what this dictionary is. Did we use the same preprocessing that we did for TF-IDF?

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(preprocessed_docs)

In [None]:
id2word[0]

'ambos'

3 - As a second input, it is necessary to create the corpus for the `LdaModel()` function. Read the documentation and create a compatible corpus based on the preprocessing you have already done.

In [None]:
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in preprocessed_docs]

In [None]:
corpus[0]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 1)]
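
Each pair in this output is `(token_id, count)`. The sketch below mimics that bag-of-words encoding in plain Python to show what the structure contains. `build_vocab` and `to_bow` are hypothetical helpers, and assigning ids in sorted order within each new document is an assumption chosen to be consistent with `id2word[0]` being `'ambos'`; gensim's internals may differ.

```python
from collections import Counter


def build_vocab(docs):
    """Map each token to an integer id; unseen tokens get ids in sorted order per document."""
    token2id = {}
    for tokens in docs:
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical toy corpus reusing a few tokens from the first synopsis
docs = [
    ["história", "primeiro", "grande", "batalha", "guerra", "ambos"],
    ["guerra", "amor", "guerra"],
]
token2id = build_vocab(docs)
print(token2id["ambos"])           # 0
print(to_bow(docs[1], token2id))   # [(3, 2), (6, 1)]
```

Word order is discarded: the second document is reduced to "token 3 twice, token 6 once".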

4 - One of the most important steps for topic modeling algorithms is determining how many topics to use as input. Discuss how this decision should be made. For testing, use `num_topics=10`.

In [None]:
# Set number of topics
num_topics = 10

5 - Finally, create a model from the objects created so far using the function `LdaModel()`

In [None]:
# Build LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=num_topics,
    passes=20,
    random_state=100
    )

6 - Explain and discuss what the parameters `random_state` and `passes` refer to.

In [None]:
# random_state: The seed for the random number generator, comparable to 'seed' in many packages and libraries. LDA training is stochastic, so two models trained with the same parameters on the same corpus can differ from run to run. Fixing random_state makes a run repeatable and thereby aids reproducibility.
# passes: The number of times the algorithm passes through the whole corpus during training. This is comparable to 'epochs' in other packages and libraries.

7 - LDA provides two main outputs, the loadings and the scores. What do they refer to?

In [None]:
# Loadings: The topic-word distributions, i.e. the probability of each word within each topic. They indicate how much each word contributes to the topic.
# Scores: The document-topic distributions, i.e. the proportion of each topic in each document. They can be used to assign documents to topics.

8 - Use `lda_model.print_topics()` to access the tokens that contribute to each of the created topics (Loadings).

In [None]:
# Print the top 20 keywords for each topic
# Note: researchers use both qualitative and quantitative methods to evaluate topic models
lda_model.print_topics(num_words=20)

# Each topic is shown as its highest-probability words; the numbers are the probabilities of those words in the topic's word distribution.

[(0,
  '0.005*"ladrão" + 0.004*"jones" + 0.004*"bruxa" + 0.004*"força" + 0.004*"próprio" + 0.004*"banco" + 0.004*"enquanto" + 0.004*"coração" + 0.004*"tornar" + 0.003*"agora" + 0.003*"serviço" + 0.003*"lutar" + 0.003*"desconhecer" + 0.003*"tentar" + 0.003*"equipe" + 0.003*"destruir" + 0.003*"último" + 0.003*"inimigo" + 0.003*"sobre" + 0.003*"chamar"'),
 (1,
  '0.008*"guerra" + 0.005*"exército" + 0.005*"gangue" + 0.004*"filme" + 0.004*"durante" + 0.004*"mundial" + 0.004*"robert" + 0.004*"bandido" + 0.004*"oficial" + 0.004*"americano" + 0.003*"ano" + 0.003*"the" + 0.003*"dois" + 0.003*"romance" + 0.003*"campo" + 0.003*"mudança" + 0.003*"linha" + 0.003*"lucy" + 0.003*"quebrar" + 0.003*"chamar"'),
 (2,
  '0.008*"tornar" + 0.007*"enquanto" + 0.007*"vida" + 0.006*"novo" + 0.005*"ano" + 0.004*"encontrar" + 0.004*"história" + 0.004*"relacionamento" + 0.004*"dar" + 0.004*"york" + 0.003*"nova" + 0.003*"todo" + 0.003*"homem" + 0.003*"sobre" + 0.003*"poder" + 0.003*"dois" + 0.003*"velho" + 0.003*"

9 - Create a `for` loop to print each document and its distribution among topics. Use `lda_model.get_document_topics()` (Scores).

In [None]:
# generate document-topic distributions
for i, doc in enumerate(corpus):
    doc_topics = lda_model.get_document_topics(doc)
    print(f"Document {i}: {doc_topics}")

Document 0: [(1, 0.2542863), (4, 0.6841163)]
Document 1: [(1, 0.3727837), (9, 0.59383684)]
Document 2: [(4, 0.8323124), (8, 0.14542103)]
Document 3: [(0, 0.11194908), (3, 0.41749018), (4, 0.42046648)]
Document 4: [(0, 0.13084902), (2, 0.33413678), (6, 0.14182718), (7, 0.36916292)]
Document 5: [(4, 0.067950524), (9, 0.89389926)]
Document 6: [(0, 0.09560691), (1, 0.028979938), (3, 0.5876315), (5, 0.11230724), (7, 0.16619892)]
Document 7: [(0, 0.14820151), (3, 0.70505464), (8, 0.10003187)]
Document 8: [(4, 0.15756814), (7, 0.8138252)]
Document 9: [(1, 0.3079705), (4, 0.28584442), (7, 0.35608253)]
Document 10: [(3, 0.30184272), (4, 0.6765081)]
Document 11: [(2, 0.56374234), (4, 0.39409056)]
Document 12: [(3, 0.20310168), (4, 0.32139966), (7, 0.2758296), (8, 0.17233744)]
Document 13: [(0, 0.3011899), (3, 0.37201503), (6, 0.30176213)]
Document 14: [(3, 0.031370737), (4, 0.28720987), (6, 0.42200932), (7, 0.24971616)]
Document 15: [(7, 0.96083146)]
Document 16: [(0, 0.16077344), (4, 0.12636837

10 - Discuss the score in terms of what score would be sufficient to determine whether a document belongs to a topic or not.
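
One way to frame the discussion is to compare a fixed probability threshold with simply taking the most probable topic. The sketch below applies both to the distributions printed for documents 0 and 3. The helper names and the threshold of 0.3 are assumptions chosen for illustration; there is no universally correct cut-off.

```python
def assign_topics(doc_topics, threshold=0.3):
    """Keep topics whose probability meets the threshold, highest first."""
    kept = [(topic, prob) for topic, prob in doc_topics if prob >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def dominant_topic(doc_topics):
    """Return the id of the single most probable topic."""
    return max(doc_topics, key=lambda pair: pair[1])[0]


# Distributions copied from the get_document_topics() output for documents 0 and 3
doc_0 = [(1, 0.2542863), (4, 0.6841163)]
doc_3 = [(0, 0.11194908), (3, 0.41749018), (4, 0.42046648)]

print(assign_topics(doc_0))   # [(4, 0.6841163)]
print(assign_topics(doc_3))   # [(4, 0.42046648), (3, 0.41749018)]
print(dominant_topic(doc_3))  # 4
```

Document 3 shows why a single label can mislead: topics 3 and 4 are nearly tied, so the dominant-topic rule hides a real mixture. In gensim, `get_document_topics()` also accepts a `minimum_probability` argument that drops topics below a given value.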

11 - EXTRA

Study the pyLDAvis library to create interactive visualizations of the gensim model you just created.

In [None]:
!pip install pyLDAvis

In [None]:
# for LDA evaluation
import pyLDAvis
import pyLDAvis.gensim_models as gensimvisualize

pyLDAvis.enable_notebook(local=True)

In [None]:
# prepare the visualization data from the trained model, corpus, and dictionary
lda_visual = gensimvisualize.prepare(lda_model, corpus, id2word, mds='mmds', sort_topics=False)
pyLDAvis.display(lda_visual)

