# NLP4 - Text Processing Techniques: TF-IDF and LDA

In this session we will explore powerful techniques for understanding and analyzing text data. Two key concepts you'll learn are TF-IDF and LDA.

TF-IDF (Term Frequency-Inverse Document Frequency) is a method used to evaluate how important a word is in a document relative to a collection of documents. It helps filter out common words while highlighting those that are more meaningful in specific contexts.

LDA (Latent Dirichlet Allocation) is a topic modeling technique. It helps identify underlying topics in a set of documents by grouping words that frequently appear together.

These tools will give you insights into patterns in text data, opening doors to advanced text analysis!

---

## Install Libraries

In [None]:
#first, install the datasets library and download the large Portuguese spaCy model
!pip install datasets
!python -m spacy download pt_core_news_lg


## Imports

In [None]:
#dataset library
from datasets import load_dataset
#load dataset
dataset = load_dataset("tclopess/sinopsys_movies_portuguese")
#convert it to pandas and slice the first 3000 data points
df_sinop = dataset['train'].to_pandas()[:3000]


#NLP tool box nltk
import nltk
from nltk.corpus import stopwords
#getting stop words
nltk.download('stopwords')
stop = list(set(stopwords.words('portuguese')))
print(stop)

#string library
import string
#get list of punctuations
pontuacoes = string.punctuation
print(pontuacoes)

#NLP toolbox spacy
import spacy
#load the large Portuguese model
nlp = spacy.load("pt_core_news_lg")

#other python support libraries and methods
import itertools
from collections import Counter
from collections import defaultdict
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

#dataframes library
import pandas as pd

#LDA library
import gensim
import gensim.corpora as corpora
from gensim.models.ldamodel import LdaModel


['ela', 'esteja', 'tenha', 'aquelas', 'mas', 'sejam', 'houveria', 'teria', 'até', 'estivera', 'terão', 'tinha', 'tivemos', 'houvessem', 'nossa', 'aquele', 'estivemos', 'se', 'seu', 'teremos', 'dos', 'mesmo', 'hajam', 'seriam', 'sem', 'eram', 'nos', 'entre', 'eu', 'vocês', 'sou', 'serei', 'houveriam', 'fora', 'fui', 'ou', 'para', 'teu', 'estivéssemos', 'minhas', 'houverei', 'estão', 'houver', 'à', 'não', 'ele', 'essa', 'eles', 'delas', 'sua', 'teus', 'minha', 'de', 'vos', 'houvéramos', 'tivéramos', 'fossem', 'aquela', 'esse', 'estamos', 'está', 'estou', 'com', 'nem', 'seríamos', 'aqueles', 'os', 'houvéssemos', 'era', 'quando', 'nosso', 'você', 'estas', 'ser', 'estivesse', 'houvemos', 'fôramos', 'estivermos', 'tiveram', 'houvesse', 'como', 'elas', 'houverem', 'estiverem', 'teriam', 'seus', 'tínhamos', 'estar', 'temos', 'esteve', 'estes', 'é', 'estavam', 'a', 'seremos', 'ao', 'haja', 'seria', 'um', 'estava', 'pelos', 'já', 'numa', 'às', 'houve', 'tu', 'dele', 'o', 'pelas', 'éramos', 'nas'

## Term Frequency - Inverse Document Frequency (TF-IDF)




The **TF-IDF** (Term Frequency - Inverse Document Frequency) model extends the Bag of Words model. Besides counting how often a word occurs in a document, it also accounts for how widespread that word is across the entire corpus. The idea is that words that appear frequently in a document but rarely in the rest of the corpus are more meaningful for that document. TF-IDF assigns higher weights to such terms and lower weights to words that are common everywhere (e.g., "the", "and").

- **Term Frequency (TF)**: Measures how often a word appears in a given document, relative to the document's length.
- **Inverse Document Frequency (IDF)**: Lowers the weight of words that appear in many documents of the corpus.

TF-IDF helps prioritize terms that are more informative for distinguishing between documents.

### Practicing


1 - Using the concepts from the previous class, create a function that takes a string as a parameter and returns a list of pre-processed tokens. The tokens should be lowercase, lemmas, and must not be punctuation or stopwords.

2 - Using the function you created in the previous exercise, preprocess all the synopsis texts contained in the dataframe.

3 - Create a dataframe containing the TF values for all tokens in the documents. Use the formula below:

$$
TF(t,d) = \frac{\text{Number of times the term } t \text{ appears in the document } d } {\text{Total number of terms in the document } d}
$$



4 - Now consider the IDF formula below. Calculate an IDF vector for all tokens in the corpus.

$$
IDF(t) = \log \left( \frac{\text{Number of documents in the corpus}}{1 + \text{Number of documents where the term } t \text{ appears}} \right)
$$

5 - Analyze the TF and IDF separately. How does each one relate to the corpus, to a specific document, and to a specific term?

6 - Using the data structures you used to separately calculate the TF and IDF above, return the TF-IDF value for the token 'história' in document 45.
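
As a starting point, the TF and IDF formulas given in the exercises above can be sketched on a tiny corpus using only the standard library. The token lists and the helper names (`term_frequencies`, `inverse_document_frequencies`, `tf_idf`) are hypothetical and chosen for illustration; in the notebook the tokens would come from your own preprocessing function and the results would live in dataframes.

```python
import math
from collections import Counter

# Toy corpus of already pre-processed token lists (hypothetical data).
docs = [
    ["filme", "história", "guerra"],
    ["história", "amor", "amor", "família"],
    ["guerra", "espaço", "lua", "lua"],
    ["comédia", "família"],
]

def term_frequencies(tokens):
    """TF(t, d): count of t in d divided by the total number of terms in d."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def inverse_document_frequencies(documents):
    """IDF(t): log(N / (1 + number of documents that contain t))."""
    n_docs = len(documents)
    # set(tokens) makes each document count at most once per term.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    return {term: math.log(n_docs / (1 + df)) for term, df in doc_freq.items()}

tf = [term_frequencies(tokens) for tokens in docs]  # one dict per document
idf = inverse_document_frequencies(docs)            # one value per term

def tf_idf(term, doc_index):
    """TF-IDF of a term in one document; 0.0 if the term is absent from it."""
    return tf[doc_index].get(term, 0.0) * idf.get(term, 0.0)

print(tf_idf("amor", 1))      # frequent in document 1, absent elsewhere
print(tf_idf("história", 1))  # rarer in document 1 and present in two documents
```

Note that TF depends on a (term, document) pair, while IDF depends only on the term and the corpus. With the `1 +` in the denominator, a term that appears in every document gets a slightly negative IDF.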

## Cosine Similarity

In the Bag of Words (BoW) model, texts are represented as vectors that count the occurrences of words in each document, ignoring word order and focusing on frequency. The similarity between documents can be assessed from these vectors with metrics such as **cosine similarity**. Cosine similarity is the cosine of the angle between two vectors, so it depends on their direction and not on their length: two documents that use the same words in the same proportions are rated as similar even if one is much longer than the other. This allows for efficient comparison of text content using the vector representations created by BoW.
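
To make the definition concrete, here is a minimal from-scratch sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths. The function name `cosine` and the three count vectors are hypothetical; in the notebook, the `cosine_similarity` function imported from scikit-learn performs the same computation on 2D arrays and returns a matrix of pairwise similarities.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero (empty) documents
    return dot / (norm_u * norm_v)

doc_a = [1, 2, 0]  # hypothetical counts over a 3-word vocabulary
doc_b = [2, 4, 0]  # same words as doc_a, each twice as often
doc_c = [0, 0, 3]  # shares no words with doc_a

print(cosine(doc_a, doc_b))  # close to 1: same direction, different length
print(cosine(doc_a, doc_c))  # 0.0: no shared words
```

Because count vectors never have negative entries, the result for BoW vectors always falls between 0 and 1.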


### Practicing

1 - Consider the vectors below. Which ones are most similar to each other?

In [None]:
X = [0, 0, 0, 1, 1, 1]
Y = [1, 0, 0, 1, 1, 0]
Z = [0, 1, 0, 0, 0, 0]

2 - Answer the question above using the `cosine_similarity` function.

3 - Create a dataframe for the analyzed corpus, where each row represents a document and each column represents a unique token. Each cell will therefore indicate how many times a particular token appears in a given document.

4 - Consider the synopses below. Which two of the three are most similar or discuss the same topic?

In [None]:
df_sinop.loc[636,'sinopse']

'Quando a família de Frank Castle é assassinada por criminosos, ele trava uma guerra contra o crime como um assassino vigilante conhecido apenas como O Justiceiro.'

In [None]:
df_sinop.loc[999,'sinopse']

'O mafioso e assassino de aluguel Jimmy Conlon tem uma noite para descobrir onde está sua lealdade: com seu filho distante, Mike, cuja vida está em perigo, ou seu melhor amigo de longa data, o chefe da máfia Shawn Maguire, que quer que Mike pague pela morte de seu próprio filho.'

In [None]:
df_sinop.loc[14,'sinopse']

'Em julho de 1969, a corrida espacial terminou quando a Apollo 11 cumpriu o desafio do presidente Kennedy de “pousar um homem na Lua e trazê-lo de volta são e salvo à Terra”. Ninguém que testemunhou o pouso lunar jamais o esquecerá. O documentário de Al Reinert, For All Mankind, é a história dos vinte e quatro homens que viajaram para a lua, contada em suas palavras, em suas vozes, usando as imagens de suas experiências. Quarenta anos após o primeiro pouso na lua, continua sendo a obra de cinema mais radical e visualmente deslumbrante já feita sobre esse evento de abalar a terra.'

5 - Use the cosine similarity function to justify your answer.
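
The document-by-token count table described in the exercises above can be sketched with the standard library before moving it into pandas. The token lists below are hypothetical stand-ins for pre-processed synopses, and `count_vector` is an illustrative helper, not part of any library.

```python
from collections import Counter

# Hypothetical pre-processed documents (lowercase tokens, stopwords removed).
docs = [
    ["assassino", "crime", "família"],
    ["assassino", "máfia", "filho", "filho"],
    ["lua", "pouso", "lua"],
]

# One column per unique token, in a fixed (sorted) order.
vocabulary = sorted({token for tokens in docs for token in tokens})

def count_vector(tokens, vocab):
    """Number of times each vocabulary token occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

# One row per document.
matrix = [count_vector(tokens, vocabulary) for tokens in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

In the notebook, `pd.DataFrame(matrix, columns=vocabulary)` would wrap these rows in a dataframe, and the rows can then be compared with cosine similarity. Rows that share no tokens have a similarity of 0.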

## Topic Modeling and LDA

While TF-IDF is effective for identifying key terms, it doesn’t provide insight into the underlying topics within the text.

This is where **Topic Modeling** comes in. It’s a technique used to automatically uncover hidden topics in large collections of text. A widely used topic modeling method is **Latent Dirichlet Allocation (LDA)**, which goes beyond word frequencies to model the distribution of topics across documents and the distribution of words within topics. LDA assumes that each document is a mixture of multiple topics, and that each topic is a probability distribution over related words.
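
The assumption that a document mixes several topics can be illustrated by running it forward: pick a topic for each word position according to the document's topic mixture, then pick a word from that topic. The sketch below is only an illustration of that assumption. The topics, their word probabilities, and the document mixture are invented, whereas real LDA draws them from Dirichlet priors and infers them from the data; `generate_document` is a hypothetical helper.

```python
import random

# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "crime": {"assassino": 0.4, "máfia": 0.3, "polícia": 0.3},
    "espaço": {"lua": 0.5, "astronauta": 0.3, "foguete": 0.2},
}

# Hypothetical document: a mixture of topics (weights sum to 1).
doc_topic_mix = {"crime": 0.7, "espaço": 0.3}

def generate_document(topic_mix, topic_words, n_words, seed=0):
    """Sample words the way LDA assumes documents are produced."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    words = []
    for _ in range(n_words):
        # Step 1: choose a topic for this word position.
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        # Step 2: choose a word from that topic's word distribution.
        dist = topic_words[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return words

print(generate_document(doc_topic_mix, topics, n_words=8))
```

Training an LDA model runs this story in reverse: given only the words, it estimates which topic mixtures and word distributions could plausibly have produced them.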


### Practicing

1 - Discuss the paper that originated LDA. Take notes below in order to understand what the model is and how a single document can be composed of multiple topics.

2 - For this study, we will use the `Gensim` library. The first step is to create a dictionary. Use `corpora.Dictionary()` to create a dictionary that we will use in the model. Understand what this dictionary is. Did we use the same preprocessing that we did for TF-IDF?

3 - As a second input, it is necessary to create the corpus for the `LdaModel()` function. Read the documentation and create a compatible corpus based on the preprocessing you have already done.

4 - One of the most important steps for topic modeling algorithms is determining how many topics to use as input. Discuss how this decision should be made. For testing, use `num_topics=10`.

5 - Finally, create a model from the objects created so far using the function `LdaModel()`.

6 - Explain and discuss what the parameters `random_state` and `passes` refer to.

7 - LDA provides two main outputs, the loadings and the scores. What do they refer to?

8 - Use `lda_model.print_topics()` to access the tokens that contribute to each of the created topics (Loadings).

9 - Create a `for` loop to print each document and its distribution among topics (Scores). Use `lda_model.get_document_topics()`.

10 - Discuss what score should be considered sufficient to decide whether a document belongs to a topic.

11 - EXTRA

Study the pyLDAvis library, which can build visualizations directly from the gensim model you just created.

In [None]:
!pip install pyLDAvis
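
As a reference for the exercises above, it may help to see the data shapes that go into and come out of the model. In gensim, `doc2bow` on a dictionary returns each document as a list of `(token_id, count)` pairs, and `get_document_topics()` returns a list of `(topic_id, probability)` pairs. The sketch below imitates those shapes with the standard library only. The helpers `build_token_ids`, `to_bow`, and `topics_above` are hypothetical, the ids gensim assigns may differ, and the scores and the 0.25 threshold are invented for illustration.

```python
from collections import Counter

def build_token_ids(documents):
    """Assign an integer id to every distinct token (stand-in for a dictionary)."""
    token_to_id = {}
    for tokens in documents:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow(tokens, token_to_id):
    """Represent one document as sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())

def topics_above(doc_topics, threshold):
    """Keep the (topic_id, probability) pairs that reach the threshold."""
    return [(topic_id, prob) for topic_id, prob in doc_topics if prob >= threshold]

# Hypothetical pre-processed documents.
docs = [["lua", "pouso", "lua"], ["máfia", "filho", "filho"]]
token_to_id = build_token_ids(docs)
corpus = [to_bow(tokens, token_to_id) for tokens in docs]
print(corpus)

# Hypothetical scores for one document, in the (topic_id, probability) shape.
example_scores = [(0, 0.62), (3, 0.30), (7, 0.08)]
print(topics_above(example_scores, threshold=0.25))
```

A fixed threshold is only one possible rule. Another option is to assign each document to its single most probable topic, and which rule fits depends on whether documents are expected to cover one topic or several.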