# HO01: Textual Similarity
**Due** Tuesday by 7:00 <br>
**Points** 5 <br>
This is assignment **HO01: Textual Similarity**, a hands-on activity in which students practice text processing in Python.

## Problem
Compute the similarity between every pair of documents in the collection headlines.txt (one document per line):

1. Preprocess each document: tokenize, remove accents and special characters, and apply lemmatization and stemming;
2. Build the vector representation in 5 different ways: One-Hot Encoding, Counting Vectors, TF-IDF, Co-occurrence Vectors, Word2Vec;
3. Compute the pairwise similarity in 2 different ways: Euclidean, Cosine.

## Prerequisites
- pip install nltk
- pip install scikit-learn
- pip install gensim
- pip install pandas colorama

In [224]:
# Required libraries, one-time setup calls, and loading of the text file
import re
import nltk
from nltk.tokenize import word_tokenize
import unicodedata
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

import pandas as pd
from IPython.display import display
from colorama import Fore, Back, Style       # colored and styled terminal output
# Libraries for part 2
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Binarizer
from gensim.models import Word2Vec

# Libraries for part 3
import numpy as np
from scipy.spatial.distance import euclidean, cosine


# Download the required NLTK resources (only needed once)
nltk.download('punkt')
nltk.download('wordnet')
#nltk.download('stopwords')


def print_destaque(texto):
    # Print a section title in bright blue on a light yellow background
    print(Back.LIGHTYELLOW_EX + Fore.BLUE + Style.BRIGHT + f' {texto} '+ Style.RESET_ALL)

# Portuguese sample sentence with accents, punctuation, and an emoji, used to test normalization
texto_com_caracteres_especiais = "Olá, Coração? Café & Música são R$5,00. Avôs são fáceis de tê-los. Orações no sótão grátis! 😃"

# Load the text file
with open('headlines.txt', 'r', encoding='utf-8') as f:
    headlines = f.readlines()

print_destaque("Document Collection (headlines.txt)")
headlines

Document Collection (headlines.txt)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rodri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rodri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['Investors unfazed by correction as crypto funds see $154 million inflows\n',
 'Bitcoin, Ethereum prices continue descent, but crypto funds see inflows\n',
 'The surge in euro area inflation during the pandemic: transitory but with upside risks\n',
 "Inflation: why it's temporary and raising interest rates will do more harm than good\n",
 'Will Cryptocurrency Protect Against Inflation?\n',
 'Tweed is a crypto wallet API to add a web3 flavor to any web service\n',
 'Who Created Bitcoin? Learn About The Biggest Cryptos, Including Dogecoin, Big Eyes Coin\n',
 'Cryptocurrency Prices And News: Bitcoin, Cryptos Fall After Silvergate Bank Liquidation News\n',
 'Silvergate Capital To Shut Down, Liquidate Crypto-Friendly Silvergate Bank\n',
 'Dow Jones Rises On Surprise Jump In Jobless Claims; Bitcoin Drops As Crypto Bank Silvergate Crashes 40%\n']

## Preprocess each document: tokenize, remove accents and special characters, lemmatize, and stem

### Tokenization
Tokenization splits a text into smaller units called tokens. Depending on the desired granularity, tokens can be words, phrases, sentences, or even individual characters. It is a fundamental step in many natural language processing (NLP) tasks because it lets the text be processed in meaningful units.
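
To make the idea of granularity concrete, here is a minimal sketch that tokenizes one of the headlines at word level and at character level using only the standard library. The helper names `tokenize_words` and `tokenize_chars` are illustrative assumptions, not part of NLTK; the notebook itself relies on `word_tokenize`.

```python
import re

def tokenize_words(text):
    # Runs of letters/digits become word tokens; each punctuation mark is its own token
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_chars(text):
    # Character-level tokens, ignoring whitespace
    return [ch for ch in text if not ch.isspace()]

sample = "Will Cryptocurrency Protect Against Inflation?"
word_tokens = tokenize_words(sample)
char_tokens = tokenize_chars(sample)

print(word_tokens)       # 6 tokens, the last one being '?'
print(len(char_tokens))  # many more, much less informative tokens
```

Word-level tokens are the unit used in the rest of this notebook, since the vector representations further down are all built over a vocabulary of words.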

In [104]:
# Word tokenization
def tokenizar(texto):
    return word_tokenize(texto)

print_destaque("Word Tokenization")
print(*[tokenizar(texto) for texto in headlines], sep='\n')

Word Tokenization
['Investors', 'unfazed', 'by', 'correction', 'as', 'crypto', 'funds', 'see', '$', '154', 'million', 'inflows']
['Bitcoin', ',', 'Ethereum', 'prices', 'continue', 'descent', ',', 'but', 'crypto', 'funds', 'see', 'inflows']
['The', 'surge', 'in', 'euro', 'area', 'inflation', 'during', 'the', 'pandemic', ':', 'transitory', 'but', 'with', 'upside', 'risks']
['Inflation', ':', 'why', 'it', "'s", 'temporary', 'and', 'raising', 'interest', 'rates', 'will', 'do', 'more', 'harm', 'than', 'good']
['Will', 'Cryptocurrency', 'Protect', 'Against', 'Inflation', '?']
['Tweed', 'is', 'a', 'crypto', 'wallet', 'API', 'to', 'add', 'a', 'web3', 'flavor', 'to', 'any', 'web', 'service']
['Who', 'Created', 'Bitcoin', '?', 'Learn', 'About', 'The', 'Biggest', 'Cryptos', ',', 'Including', 'Dogecoin', ',', 'Big', 'Eyes', 'Coin']
['Cryptocurrency', 'Prices', 'And', 'News', ':', 'Bitcoin', ',', 'Cryptos', 'Fall', 'After', 'Silvergate', 'Bank', 'Liquidation', 'News']
['S

### Normalization
Normalization transforms the text into a canonical form by eliminating unnecessary variation: accents, special characters, and letter case. This shrinks the vocabulary (and therefore the dimensionality of the vectors) and simplifies later analysis. The steps applied here are:
1. Remove accents
2. Remove special characters
3. Convert everything to lowercase

In [95]:
def remover_acentos(texto):
    # NFKD splits each accented character into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the marks (and any other non-ASCII character, such as emoji).
    # Alternative that strips only the combining marks:
    # ''.join(c for c in unicodedata.normalize('NFD', texto) if unicodedata.category(c) != 'Mn')
    return unicodedata.normalize('NFKD', texto).encode('ASCII', 'ignore').decode('ASCII')

def remover_caracteres_especiais(texto):
    texto = remover_acentos(texto)
    return re.sub(r'[^a-zA-Z0-9\s]', '', texto) # Remove special characters

def normalizar(texto):
    return remover_caracteres_especiais(texto).lower()

print('ORIGINAL:       ', texto_com_caracteres_especiais)
print('NO ACCENTS:     ', remover_acentos(texto_com_caracteres_especiais))
print('+ NO SPEC.CHARS:', remover_caracteres_especiais(texto_com_caracteres_especiais))
print('+ LOWERCASE:    ', normalizar(texto_com_caracteres_especiais), '\n')

print_destaque("Normalization of the Texts")
print(*[normalizar(texto) for texto in headlines], sep='')

ORIGINAL:        Olá, Coração? Café & Música são R$5,00. Avôs são fáceis de tê-los. Orações no sótão grátis! 😃
NO ACCENTS:      Ola, Coracao? Cafe & Musica sao R$5,00. Avos sao faceis de te-los. Oracoes no sotao gratis! 
+ NO SPEC.CHARS: Ola Coracao Cafe  Musica sao R500 Avos sao faceis de telos Oracoes no sotao gratis 
+ LOWERCASE:     ola coracao cafe  musica sao r500 avos sao faceis de telos oracoes no sotao gratis  

Normalization of the Texts
investors unfazed by correction as crypto funds see 154 million inflows
bitcoin ethereum prices continue descent but crypto funds see inflows
the surge in euro area inflation during the pandemic transitory but with upside risks
inflation why its temporary and raising interest rates will do more harm than good
will cryptocurrency protect against inflation
tweed is a crypto wallet api to add a web3 flavor to any web service
who created bitcoin learn about the biggest cryptos including dogecoin big eyes coin
cryptocurrency pric

### Lemmatization
Lemmatization reduces each word to its dictionary base form, known as the lemma. It takes the morphological structure of words into account and groups inflected forms of the same word under a single token, which reduces redundancy. NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, which is why "as" is reduced to "a" in the output below.

In [112]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

#palavras = ["correndo", "corre", "correu", "corridas"]
#lemmas = [lemmatizer.lemmatize(palavra) for palavra in palavras]
#print(lemmas)

def lematizar_doc(texto):
    return [lemmatizer.lemmatize(palavra) for palavra in tokenizar(normalizar(texto))]

print_destaque("Lemmatization of the Texts")
print(*[lematizar_doc(texto) for texto in headlines], sep='\n')

Lemmatization of the Texts
['investor', 'unfazed', 'by', 'correction', 'a', 'crypto', 'fund', 'see', '154', 'million', 'inflow']
['bitcoin', 'ethereum', 'price', 'continue', 'descent', 'but', 'crypto', 'fund', 'see', 'inflow']
['the', 'surge', 'in', 'euro', 'area', 'inflation', 'during', 'the', 'pandemic', 'transitory', 'but', 'with', 'upside', 'risk']
['inflation', 'why', 'it', 'temporary', 'and', 'raising', 'interest', 'rate', 'will', 'do', 'more', 'harm', 'than', 'good']
['will', 'cryptocurrency', 'protect', 'against', 'inflation']
['tweed', 'is', 'a', 'crypto', 'wallet', 'api', 'to', 'add', 'a', 'web3', 'flavor', 'to', 'any', 'web', 'service']
['who', 'created', 'bitcoin', 'learn', 'about', 'the', 'biggest', 'cryptos', 'including', 'dogecoin', 'big', 'eye', 'coin']
['cryptocurrency', 'price', 'and', 'news', 'bitcoin', 'cryptos', 'fall', 'after', 'silvergate', 'bank', 'liquidation', 'news']
['silvergate', 'capital', 'to', 'shut', 'down', 'liquidate', 'cryptofriendly

### Stemming
Stemming reduces each word to a root form, known as the stem. Unlike lemmatization, it ignores the morphological structure of the word and simply strips suffixes using heuristic rules. The result is not always a recognizable word (for example, "unfazed" becomes "unfaz"), but this is useful when a cruder, cheaper reduction is acceptable.

In [125]:
from nltk.stem import PorterStemmer
#from nltk.stem import SnowballStemmer
#stemmer = SnowballStemmer("portuguese") # Choose the desired language
#stemmer.stem("correndo")

stemmer = PorterStemmer()
def stemming_doc(texto):
    return [stemmer.stem(palavra) for palavra in tokenizar(normalizar(texto))]

print_destaque("Stemming of the Texts")
print(*[stemming_doc(texto) for texto in headlines], sep='\n')

Stemming of the Texts
['investor', 'unfaz', 'by', 'correct', 'as', 'crypto', 'fund', 'see', '154', 'million', 'inflow']
['bitcoin', 'ethereum', 'price', 'continu', 'descent', 'but', 'crypto', 'fund', 'see', 'inflow']
['the', 'surg', 'in', 'euro', 'area', 'inflat', 'dure', 'the', 'pandem', 'transitori', 'but', 'with', 'upsid', 'risk']
['inflat', 'whi', 'it', 'temporari', 'and', 'rais', 'interest', 'rate', 'will', 'do', 'more', 'harm', 'than', 'good']
['will', 'cryptocurr', 'protect', 'against', 'inflat']
['tweed', 'is', 'a', 'crypto', 'wallet', 'api', 'to', 'add', 'a', 'web3', 'flavor', 'to', 'ani', 'web', 'servic']
['who', 'creat', 'bitcoin', 'learn', 'about', 'the', 'biggest', 'crypto', 'includ', 'dogecoin', 'big', 'eye', 'coin']
['cryptocurr', 'price', 'and', 'news', 'bitcoin', 'crypto', 'fall', 'after', 'silverg', 'bank', 'liquid', 'news']
['silverg', 'capit', 'to', 'shut', 'down', 'liquid', 'cryptofriendli', 'silverg', 'bank']
['dow', 'jone', 'rise', 'on', 'surpr

## Build the vector representation in 5 different ways: One-Hot Encoding, Counting Vectors, TF-IDF, Co-occurrence Vectors, Word2Vec

In [187]:
headlines_stemmed = [' '.join(stemming_doc(texto)) for texto in headlines]
headlines_lematizadas = [' '.join(lematizar_doc(texto)) for texto in headlines]

### One-Hot Encoding
One-hot encoding represents each word as a binary vector with a 1 in the position assigned to that word and 0 everywhere else. It is commonly used for discrete categories or for words in a limited vocabulary. To represent a whole document, the cell below marks with 1 every vocabulary term that occurs in it, which is the same as a binary bag of words.
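
The difference between a per-word one-hot vector and a per-document binary vector can be sketched in plain Python. The toy documents and the helper names `one_hot` and `binary_doc_vector` are assumptions made for illustration; they are not taken from the headlines collection or from scikit-learn.

```python
# Toy tokenized documents (hypothetical, for illustration only)
docs = [
    ["crypto", "fund", "see", "inflow"],
    ["bitcoin", "price", "fall"],
    ["crypto", "bank", "crash", "crypto"],
]

# Vocabulary in alphabetical order, with a fixed position for each word
vocab = sorted({word for doc in docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Exactly one position is set to 1: the one assigned to this word
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def binary_doc_vector(tokens):
    # One position per vocabulary word; 1 if the word occurs at least once
    vec = [0] * len(vocab)
    for word in tokens:
        vec[index[word]] = 1  # repeated words still yield 1
    return vec

print(vocab)
print(one_hot("crypto"))
print(binary_doc_vector(docs[2]))
```

Note that "crypto" appears twice in the third document but still contributes a single 1, which is exactly the information lost relative to a counting vector.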

In [203]:
# Create a CountVectorizer; binary=True records only presence (1) or absence (0) of each term
vectorizerOH = CountVectorizer(binary=True)

# Apply the one-hot encoding to the documents
matriz_onehot = vectorizerOH.fit_transform(headlines_stemmed)

df_onehot = pd.DataFrame(matriz_onehot.toarray(), columns=vectorizerOH.get_feature_names_out())

print_destaque("One-Hot Encoding of the Texts")
pd.set_option('display.precision', 0) # do not show decimal places
for i, texto in enumerate(headlines):
    print(f"Line {i}: {texto}", end='')
display(df_onehot)

One-Hot Encoding of the Texts
Line 0: Investors unfazed by correction as crypto funds see $154 million inflows
Line 1: Bitcoin, Ethereum prices continue descent, but crypto funds see inflows
Line 2: The surge in euro area inflation during the pandemic: transitory but with upside risks
Line 3: Inflation: why it's temporary and raising interest rates will do more harm than good
Line 4: Will Cryptocurrency Protect Against Inflation?
Line 5: Tweed is a crypto wallet API to add a web3 flavor to any web service
Line 6: Who Created Bitcoin? Learn About The Biggest Cryptos, Including Dogecoin, Big Eyes Coin
Line 7: Cryptocurrency Prices And News: Bitcoin, Cryptos Fall After Silvergate Bank Liquidation News
Line 8: Silvergate Capital To Shut Down, Liquidate Crypto-Friendly Silvergate Bank
Line 9: Dow Jones Rises On Surprise Jump In Jobless Claims; Bitcoin Drops As Crypto Bank Silvergate Crashes 40%


Unnamed: 0,154,40,about,add,after,against,and,ani,api,area,...,tweed,unfaz,upsid,wallet,web,web3,whi,who,will,with
0,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,1
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5,0,0,0,1,0,0,0,1,1,0,...,1,0,0,1,1,1,0,0,0,0
6,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [181]:
# Not used in this assignment, but the result is interesting to see
def onehot_encode_doc(texto):
    encoder = OneHotEncoder() # Create a OneHotEncoder instance
    text_encoded = encoder.fit_transform([[palavra] for palavra in tokenizar(normalizar(texto))]) # Encode each token of the text as one row

    pd.set_option('display.precision', 0) # do not show decimal places
    df_encoded = pd.DataFrame(text_encoded.toarray(), columns=encoder.get_feature_names_out([''])) # Build a DataFrame from the encoded data

    print("One-Hot Encoding matrix of: ", texto[:-1])
    display(df_encoded)

print_destaque("Individual One-Hot Encoding")
#[onehot_encode_doc(texto) for texto in headlines]
onehot_encode_doc(headlines[0]) # the function displays its result and returns None, so it is not wrapped in print()

Individual One-Hot Encoding
One-Hot Encoding matrix of:  Investors unfazed by correction as crypto funds see $154 million inflows


Unnamed: 0,_154,_as,_by,_correction,_crypto,_funds,_inflows,_investors,_million,_see,_unfazed
0,0,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,1
2,0,0,1,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,0
6,0,0,0,0,0,1,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,1,0
8,1,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,1,0,0


### Counting Vectors
Also known as Bag of Words (BoW), this representation describes each document as a vector holding the number of occurrences of each vocabulary word in that document. It is a simple and widely used way to represent documents in NLP.
It resembles the One-Hot Encoding above; the difference is that the one-hot representation is binary, while the counting vector stores occurrence counts.
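
As a minimal sketch of that difference, the snippet below builds both vectors for the stemmed form of the Silvergate headline shown in the Stemming output of this notebook. It uses `collections.Counter` from the standard library rather than `CountVectorizer`, and the vocabulary is restricted to this single document for brevity (an assumption made to keep the example short).

```python
from collections import Counter

# Stemmed tokens of "Silvergate Capital To Shut Down, Liquidate Crypto-Friendly Silvergate Bank"
tokens = ["silverg", "capit", "to", "shut", "down", "liquid",
          "cryptofriendli", "silverg", "bank"]

vocab = sorted(set(tokens))
counts = Counter(tokens)

count_vector = [counts[word] for word in vocab]            # number of occurrences
binary_vector = [1 if counts[word] else 0 for word in vocab]  # presence only

print(vocab)
print(count_vector)
print(binary_vector)
```

The only position where the two vectors differ is "silverg", which occurs twice in the headline.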

In [204]:
# Create a CountVectorizer
vectorizerCV = CountVectorizer(binary=False) # in the One-Hot Encoding, binary=True

matriz_count = vectorizerCV.fit_transform(headlines_stemmed)

df_count_vec = pd.DataFrame(matriz_count.toarray(), columns=vectorizerCV.get_feature_names_out())

print_destaque("Counting Vectors of the Texts")
pd.set_option('display.precision', 0) # do not show decimal places
for i, texto in enumerate(headlines):
    print(f"Line {i}: {texto}", end='')
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_count_vec)

Counting Vectors of the Texts
Line 0: Investors unfazed by correction as crypto funds see $154 million inflows
Line 1: Bitcoin, Ethereum prices continue descent, but crypto funds see inflows
Line 2: The surge in euro area inflation during the pandemic: transitory but with upside risks
Line 3: Inflation: why it's temporary and raising interest rates will do more harm than good
Line 4: Will Cryptocurrency Protect Against Inflation?
Line 5: Tweed is a crypto wallet API to add a web3 flavor to any web service
Line 6: Who Created Bitcoin? Learn About The Biggest Cryptos, Including Dogecoin, Big Eyes Coin
Line 7: Cryptocurrency Prices And News: Bitcoin, Cryptos Fall After Silvergate Bank Liquidation News
Line 8: Silvergate Capital To Shut Down, Liquidate Crypto-Friendly Silvergate Bank
Line 9: Dow Jones Rises On Surprise Jump In Jobless Claims; Bitcoin Drops As Crypto Bank Silvergate Crashes 40%


Unnamed: 0,154,40,about,add,after,against,and,ani,api,area,as,bank,big,biggest,bitcoin,but,by,capit,claim,coin,continu,correct,crash,creat,crypto,cryptocurr,cryptofriendli,descent,do,dogecoin,dow,down,drop,dure,ethereum,euro,eye,fall,flavor,fund,good,harm,in,includ,inflat,inflow,interest,investor,is,it,jobless,jone,jump,learn,liquid,million,more,news,on,pandem,price,protect,rais,rate,rise,risk,see,servic,shut,silverg,surg,surpris,temporari,than,the,to,transitori,tweed,unfaz,upsid,wallet,web,web3,whi,who,will,with
0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,2,0,1,0,0,1,0,0,0,0,0,0,1
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,1,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
5,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,1,0,0,1,1,1,0,0,0,0
6,0,0,1,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
9,0,1,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) combines the frequency of a term in a document (TF) with the inverse of the number of documents in the corpus that contain it (IDF). Terms that appear in many documents receive a lower weight, so terms that are distinctive of a document stand out.
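
The weighting can be worked through by hand for headline 4 ("Will Cryptocurrency Protect Against Inflation?"). The sketch below assumes scikit-learn's default `TfidfVectorizer` settings, that is, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each document vector. The document frequencies are read off the Counting Vectors table in this notebook; the variable and function names are illustrative.

```python
import math

n_docs = 10  # number of headlines in the collection

# Stemmed tokens of headline 4 and the number of headlines containing each term
doc_tokens = ["will", "cryptocurr", "protect", "against", "inflat"]
doc_freq = {"will": 2, "cryptocurr": 2, "protect": 1, "against": 1, "inflat": 3}

def smoothed_idf(df, n):
    # Smoothed inverse document frequency (assumed scikit-learn default)
    return math.log((1 + n) / (1 + df)) + 1

# Raw weight = term count in the document * idf
raw = {t: doc_tokens.count(t) * smoothed_idf(doc_freq[t], n_docs) for t in doc_freq}

# L2-normalize so the document vector has unit length
norm = math.sqrt(sum(w * w for w in raw.values()))
tfidf = {t: w / norm for t, w in raw.items()}

for term, weight in sorted(tfidf.items()):
    print(f"{term:12s} {weight:.4f}")
# against and protect come out near 0.5001, cryptocurr and will near 0.4251, inflat near 0.3719
```

These values agree with row 4 of the TF-IDF output in this notebook: the terms unique to the headline ("against", "protect") weigh the most, and "inflat", which appears in three headlines, weighs the least.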

In [207]:
vectorizerTfidf = TfidfVectorizer()

matriz_tfidf = vectorizerTfidf.fit_transform(headlines_stemmed)

df_tfidf = pd.DataFrame(matriz_tfidf.toarray(), columns=vectorizerTfidf.get_feature_names_out())

print_destaque("TF-IDF of the Texts")
pd.set_option('display.precision', 4) # show 4 decimal places
for i, texto in enumerate(headlines):
    print(f"Line {i}: {texto}", end='')
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_tfidf)

TF-IDF of the Texts
Line 0: Investors unfazed by correction as crypto funds see $154 million inflows
Line 1: Bitcoin, Ethereum prices continue descent, but crypto funds see inflows
Line 2: The surge in euro area inflation during the pandemic: transitory but with upside risks
Line 3: Inflation: why it's temporary and raising interest rates will do more harm than good
Line 4: Will Cryptocurrency Protect Against Inflation?
Line 5: Tweed is a crypto wallet API to add a web3 flavor to any web service
Line 6: Who Created Bitcoin? Learn About The Biggest Cryptos, Including Dogecoin, Big Eyes Coin
Line 7: Cryptocurrency Prices And News: Bitcoin, Cryptos Fall After Silvergate Bank Liquidation News
Line 8: Silvergate Capital To Shut Down, Liquidate Crypto-Friendly Silvergate Bank
Line 9: Dow Jones Rises On Surprise Jump In Jobless Claims; Bitcoin Drops As Crypto Bank Silvergate Crashes 40%


Unnamed: 0,154,40,about,add,after,against,and,ani,api,area,as,bank,big,biggest,bitcoin,but,by,capit,claim,coin,continu,correct,crash,creat,crypto,cryptocurr,cryptofriendli,descent,do,dogecoin,dow,down,drop,dure,ethereum,euro,eye,fall,flavor,fund,good,harm,in,includ,inflat,inflow,interest,investor,is,it,jobless,jone,jump,learn,liquid,million,more,news,on,pandem,price,protect,rais,rate,rise,risk,see,servic,shut,silverg,surg,surpris,temporari,than,the,to,transitori,tweed,unfaz,upsid,wallet,web,web3,whi,who,will,with
0,0.3301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2806,0.0,0.0,0.0,0.0,0.0,0.3301,0.0,0.0,0.0,0.0,0.3301,0.0,0.0,0.1772,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2806,0.0,0.0,0.0,0.0,0.0,0.2806,0.0,0.3301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2806,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2441,0.3138,0.0,0.0,0.0,0.0,0.3691,0.0,0.0,0.0,0.1982,0.0,0.0,0.3691,0.0,0.0,0.0,0.0,0.0,0.0,0.3691,0.0,0.0,0.0,0.0,0.3138,0.0,0.0,0.0,0.0,0.0,0.3138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3138,0.0,0.0,0.0,0.0,0.0,0.3138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2683,0.0,0.0,0.0,0.0,0.0,0.2281,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2683,0.0,0.2683,0.0,0.0,0.0,0.0,0.0,0.0,0.2281,0.0,0.1996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2683,0.0,0.0,0.0,0.0,0.0,0.2683,0.0,0.0,0.0,0.0,0.2683,0.0,0.0,0.0,0.4562,0.0,0.2683,0.0,0.0,0.2683,0.0,0.0,0.0,0.0,0.0,0.0,0.2683
3,0.0,0.0,0.0,0.0,0.0,0.0,0.2358,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2774,0.2774,0.0,0.0,0.2063,0.0,0.2774,0.0,0.0,0.2774,0.0,0.0,0.0,0.0,0.0,0.0,0.2774,0.0,0.0,0.0,0.0,0.0,0.2774,0.2774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2774,0.2774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2774,0.0,0.2358,0.0
4,0.0,0.0,0.0,0.0,0.0,0.5001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3719,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4251,0.0
5,0.0,0.0,0.0,0.2755,0.0,0.0,0.0,0.2755,0.2755,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1479,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2755,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2755,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2755,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4683,0.0,0.2755,0.0,0.0,0.2755,0.2755,0.2755,0.0,0.0,0.0,0.0
6,0.0,0.0,0.2956,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2956,0.2956,0.1954,0.0,0.0,0.0,0.0,0.2956,0.0,0.0,0.0,0.2956,0.1587,0.0,0.0,0.0,0.0,0.2956,0.0,0.0,0.0,0.0,0.0,0.0,0.2956,0.0,0.0,0.0,0.0,0.0,0.0,0.2956,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2956,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2512,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2956,0.0,0.0
7,0.0,0.0,0.0,0.0,0.3054,0.0,0.2596,0.0,0.0,0.0,0.0,0.2271,0.0,0.0,0.2019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1639,0.2596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3054,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2596,0.0,0.0,0.6108,0.0,0.0,0.2596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2271,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2595,0.0,0.0,0.0,0.0,0.0,0.349,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.349,0.0,0.0,0.0,0.0,0.349,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2967,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.349,0.5191,0.0,0.0,0.0,0.0,0.0,0.2967,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.2647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.225,0.1968,0.0,0.0,0.175,0.0,0.0,0.0,0.2647,0.0,0.0,0.0,0.2647,0.0,0.1421,0.0,0.0,0.0,0.0,0.0,0.2647,0.0,0.2647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.225,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2647,0.2647,0.2647,0.0,0.0,0.0,0.0,0.0,0.2647,0.0,0.0,0.0,0.0,0.0,0.2647,0.0,0.0,0.0,0.0,0.1968,0.0,0.2647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


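### Co-occurrence Vectors
A co-occurrence representation captures how often words occur together in a given context. It is built from the co-occurrences of words within documents or within a context window around each word. Note that the cell below does not build a term-by-term co-occurrence matrix: it applies CountVectorizer to the raw (unstemmed) headlines, so the result is another document-term count matrix.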
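
For comparison, a term-by-term co-occurrence count can be sketched in a few lines of plain Python. This version uses the whole document as the context (two terms co-occur if they appear in the same document), which is only one of the possible definitions; the toy documents and the helper names `cooccurrence_counts` and `term_vector` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy tokenized documents (hypothetical, for illustration only)
docs = [
    ["crypto", "fund", "see", "inflow"],
    ["bitcoin", "crypto", "fund", "inflow"],
    ["bitcoin", "bank", "crash"],
]

def cooccurrence_counts(documents):
    # Count, for every unordered pair of distinct terms,
    # the number of documents in which both terms appear
    counts = Counter()
    for tokens in documents:
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return counts

counts = cooccurrence_counts(docs)
vocab = sorted({word for doc in docs for word in doc})

def term_vector(term):
    # Row of the co-occurrence matrix for one term (the diagonal is left at 0)
    return [0 if other == term else counts[tuple(sorted((term, other)))]
            for other in vocab]

print(vocab)
print(term_vector("crypto"))
```

Each term ends up with a vector over the vocabulary, so a document vector would still have to be derived from the vectors of its terms (for example by summing or averaging them) before documents can be compared.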

In [217]:
vectorizer = CountVectorizer()

# Apply CountVectorizer to the raw headlines, as in the Counting Vectors section but without stemming
matriz_coocorrencia = vectorizer.fit_transform(headlines)

df_coocorrencia = pd.DataFrame(matriz_coocorrencia.toarray(), columns=vectorizer.get_feature_names_out())
print_destaque("Co-occurrence")
pd.set_option('display.precision', 1) # show 1 decimal place
for i, texto in enumerate(headlines):
    print(f"Line {i}: {texto}", end='')
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_coocorrencia)

Co-occurrence
Line 0: Investors unfazed by correction as crypto funds see $154 million inflows
Line 1: Bitcoin, Ethereum prices continue descent, but crypto funds see inflows
Line 2: The surge in euro area inflation during the pandemic: transitory but with upside risks
Line 3: Inflation: why it's temporary and raising interest rates will do more harm than good
Line 4: Will Cryptocurrency Protect Against Inflation?
Line 5: Tweed is a crypto wallet API to add a web3 flavor to any web service
Line 6: Who Created Bitcoin? Learn About The Biggest Cryptos, Including Dogecoin, Big Eyes Coin
Line 7: Cryptocurrency Prices And News: Bitcoin, Cryptos Fall After Silvergate Bank Liquidation News
Line 8: Silvergate Capital To Shut Down, Liquidate Crypto-Friendly Silvergate Bank
Line 9: Dow Jones Rises On Surprise Jump In Jobless Claims; Bitcoin Drops As Crypto Bank Silvergate Crashes 40%


Unnamed: 0,154,40,about,add,after,against,and,any,api,area,as,bank,big,biggest,bitcoin,but,by,capital,claims,coin,continue,correction,crashes,created,crypto,cryptocurrency,cryptos,descent,do,dogecoin,dow,down,drops,during,ethereum,euro,eyes,fall,flavor,friendly,funds,good,harm,in,including,inflation,inflows,interest,investors,is,it,jobless,jones,jump,learn,liquidate,liquidation,million,more,news,on,pandemic,prices,protect,raising,rates,rises,risks,see,service,shut,silvergate,surge,surprise,temporary,than,the,to,transitory,tweed,unfazed,upside,wallet,web,web3,who,why,will,with
0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,2,0,1,0,0,1,0,0,0,0,0,0,1
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
5,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,1,0,0,1,1,1,0,0,0,0
6,0,0,1,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
7,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
9,0,1,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [218]:
# The two matrices are built over different vocabularies (raw vs. stemmed), so compare them as whole arrays
print(np.array_equal(matriz_coocorrencia.toarray(), matriz_count.toarray()))

False


### Word2Vec
Word2Vec learns dense vector representations of words from the contexts in which they appear in a large text corpus. It is a distributed representation that has proven effective at capturing semantic meaning and relations between words.
<br><br> API: https://radimrehurek.com/gensim/apiref.html
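
Word2Vec produces one vector per word, while the assignment compares documents. One common way to bridge the gap, shown here as a sketch and not necessarily the approach this notebook will end up using, is to average the vectors of the words in each document. The 3-dimensional `word_vectors` below are made-up values standing in for a trained model, and `document_vector` is a hypothetical helper.

```python
# Made-up 3-dimensional word vectors standing in for a trained Word2Vec model
word_vectors = {
    "crypto": [1.0, 0.0, 2.0],
    "bank":   [0.0, 2.0, 0.0],
    "crash":  [2.0, 1.0, 1.0],
}

def document_vector(tokens, vectors, dim):
    # Average the vectors of the tokens that the model knows;
    # out-of-vocabulary tokens are skipped
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known word: fall back to the zero vector
    return [sum(column) / len(known) for column in zip(*known)]

doc_vec = document_vector(["crypto", "bank", "crash", "unknown"], word_vectors, dim=3)
print(doc_vec)
```

With a real gensim model, the dictionary lookup would be replaced by the model's keyed vectors (as in `modelo_w2v.wv.get_vector(...)` in the cell that follows).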

# To be finished

In [220]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Example document list
documentos = ["O carro é azul.",
              "A casa é grande.",
              "O céu está azul hoje."]

# Preprocess the documents
documentos_preprocessados = [simple_preprocess(doc) for doc in documentos]

# Train the Word2Vec model (sg=1 selects skip-gram)
modelo_w2v = Word2Vec(sentences=documentos_preprocessados, vector_size=100, window=5, min_count=1, sg=1)

# Get the word vectors
vetor_palavra_carro = modelo_w2v.wv.get_vector('carro')
vetor_palavra_casa = modelo_w2v.wv.get_vector('casa')
vetor_palavra_azul = modelo_w2v.wv.get_vector('azul')

print("Vector for the word 'carro':", vetor_palavra_carro)
print("Vector for the word 'casa':", vetor_palavra_casa)
print("Vector for the word 'azul':", vetor_palavra_azul)


Vector for the word 'carro': [ 8.13227147e-03 -4.45733406e-03 -1.06835726e-03  1.00636482e-03
 -1.91113955e-04  1.14817743e-03  6.11386076e-03 -2.02715401e-05
 -3.24596534e-03 -1.51072862e-03  5.89729892e-03  1.51410222e-03
 -7.24261976e-04  9.33324732e-03 -4.92128357e-03 -8.38409644e-04
  9.17541143e-03  6.74942741e-03  1.50285603e-03 -8.88256077e-03
  1.14874600e-03 -2.28825561e-03  9.36823711e-03  1.20992784e-03
  1.49006362e-03  2.40640994e-03 -1.83600665e-03 -4.99963388e-03
  2.32429506e-04 -2.01418041e-03  6.60093315e-03  8.94012302e-03
 -6.74754381e-04  2.97701475e-03 -6.10765442e-03  1.69932481e-03
 -6.92623248e-03 -8.69402662e-03 -5.90020278e-03 -8.95647518e-03
  7.27759488e-03 -5.77203138e-03  8.27635173e-03 -7.24354526e-03
  3.42167495e-03  9.67499893e-03 -7.78544787e-03 -9.94505733e-03
 -4.32914635e-03 -2.68313056e-03 -2.71289347e-04 -8.83155130e-03
 -8.61755759e-03  2.80021061e-03 -8.20640661e-03 -9.06933658e-03
 -2.34046578e-03 -8.63180775e-03 -7.05664977e-03 -8.40115082e-03

## Compute the pairwise similarity in 2 different ways: Euclidean, Cosine

$$\text{Euclidean distance} = d_E(x,y) := \|\mathbf{x} - \mathbf{y}\| = \sqrt{\sum\limits_{i=1}^{n}{(x_i - y_i)^2}}$$


$$\text{cosine similarity} =S_C (x,y):= \cos(\theta) = {\mathbf{x} \cdot \mathbf{y} \over \|\mathbf{x}\| \|\mathbf{y}\|} = \frac{ \sum\limits_{i=1}^{n}{x_i  y_i} }{ \sqrt{\sum\limits_{i=1}^{n}{x_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{y_i^2}} }$$
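
As a quick check of the two formulas, the sketch below evaluates them by hand with NumPy on two small vectors chosen for illustration (the values are not taken from headlines.txt). The vectors point in the same direction, so the cosine similarity is 1 even though the Euclidean distance is not 0.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # same direction as x, twice the length

# Euclidean distance: square root of the sum of squared differences
euclidean_distance = np.sqrt(np.sum((x - y) ** 2))

# Cosine similarity: dot product divided by the product of the norms
cosine_similarity = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print("Euclidean distance:", euclidean_distance)  # sqrt(1 + 4 + 9) = sqrt(14)
print("Cosine similarity:", cosine_similarity)
```

The Euclidean measure grows with dissimilarity (0 means identical), whereas cosine similarity shrinks with it (1 means same direction), so the two matrices are read in opposite senses.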

In [239]:
def similaridade_euclidiana(matriz):
    # Returns pairwise Euclidean distances: 0 means identical vectors, larger means less similar
    vetores = matriz.toarray()  # convert the sparse matrix to a dense array
    num_vetores = len(vetores)
    matriz_similaridade = np.zeros((num_vetores, num_vetores)) # Initialize the result matrix
    for i in range(num_vetores):
        for j in range(num_vetores):
            sim = euclidean(vetores[i], vetores[j])
            matriz_similaridade[i, j] = sim
    return pd.DataFrame(matriz_similaridade)

# Print the Euclidean distance matrices
print("Euclidean distance matrix, One-Hot Encoding:")
display(similaridade_euclidiana(matriz_onehot))

print("Euclidean distance matrix, Count Vectors:")
display(similaridade_euclidiana(matriz_count))

print("Euclidean distance matrix, Co-occurrence:")
display(similaridade_euclidiana(matriz_coocorrencia))

print("Euclidean distance matrix, TF-IDF:")
display(similaridade_euclidiana(matriz_tfidf))


Euclidean distance matrix, One-Hot Encoding:

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3.6 | 4.9 | 5.0 | 4.0 | 4.6 | 4.7 | 4.5 | 4.4 | 4.9 |
| 1 | 3.6 | 0.0 | 4.6 | 4.9 | 3.9 | 4.5 | 4.4 | 3.9 | 4.2 | 4.8 |
| 2 | 4.9 | 4.6 | 0.0 | 5.0 | 4.0 | 5.0 | 4.9 | 4.9 | 4.6 | 5.3 |
| 3 | 5.0 | 4.9 | 5.0 | 0.0 | 3.9 | 5.1 | 5.2 | 4.8 | 4.7 | 5.6 |
| 4 | 4.0 | 3.9 | 4.0 | 3.9 | 0.0 | 4.1 | 4.2 | 3.7 | 3.6 | 4.7 |
| 5 | 4.6 | 4.5 | 5.0 | 5.1 | 4.1 | 0.0 | 4.8 | 4.6 | 4.2 | 5.2 |
| 6 | 4.7 | 4.4 | 4.9 | 5.2 | 4.2 | 4.8 | 0.0 | 4.5 | 4.6 | 5.1 |
| 7 | 4.5 | 3.9 | 4.9 | 4.8 | 3.7 | 4.6 | 4.5 | 0.0 | 3.6 | 4.5 |
| 8 | 4.4 | 4.2 | 4.6 | 4.7 | 3.6 | 4.2 | 4.6 | 3.6 | 0.0 | 4.6 |
| 9 | 4.9 | 4.8 | 5.3 | 5.6 | 4.7 | 5.2 | 5.1 | 4.5 | 4.6 | 0.0 |


Euclidean distance matrix, Count Vectors:

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3.6 | 5.2 | 5.0 | 4.0 | 4.9 | 4.7 | 4.8 | 4.7 | 4.9 |
| 1 | 3.6 | 0.0 | 4.9 | 4.9 | 3.9 | 4.8 | 4.4 | 4.2 | 4.6 | 4.8 |
| 2 | 5.2 | 4.9 | 0.0 | 5.3 | 4.4 | 5.6 | 5.0 | 5.5 | 5.2 | 5.6 |
| 3 | 5.0 | 4.9 | 5.3 | 0.0 | 3.9 | 5.4 | 5.2 | 5.1 | 5.0 | 5.6 |
| 4 | 4.0 | 3.9 | 4.4 | 3.9 | 0.0 | 4.5 | 4.2 | 4.1 | 4.0 | 4.7 |
| 5 | 4.9 | 4.8 | 5.6 | 5.4 | 4.5 | 0.0 | 5.1 | 5.2 | 4.7 | 5.5 |
| 6 | 4.7 | 4.4 | 5.0 | 5.2 | 4.2 | 5.1 | 0.0 | 4.8 | 4.9 | 5.1 |
| 7 | 4.8 | 4.2 | 5.5 | 5.1 | 4.1 | 5.2 | 4.8 | 0.0 | 4.1 | 4.8 |
| 8 | 4.7 | 4.6 | 5.2 | 5.0 | 4.0 | 4.7 | 4.9 | 4.1 | 0.0 | 4.7 |
| 9 | 4.9 | 4.8 | 5.6 | 5.6 | 4.7 | 5.5 | 5.1 | 4.8 | 4.7 | 0.0 |


Euclidean distance matrix, Co-occurrence:

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3.6 | 5.2 | 5.0 | 4.0 | 4.9 | 4.9 | 5.0 | 4.6 | 4.9 |
| 1 | 3.6 | 0.0 | 4.9 | 4.9 | 3.9 | 4.8 | 4.6 | 4.5 | 4.5 | 4.8 |
| 2 | 5.2 | 4.9 | 0.0 | 5.3 | 4.4 | 5.6 | 5.0 | 5.5 | 5.3 | 5.6 |
| 3 | 5.0 | 4.9 | 5.3 | 0.0 | 3.9 | 5.4 | 5.2 | 5.1 | 5.1 | 5.6 |
| 4 | 4.0 | 3.9 | 4.4 | 3.9 | 0.0 | 4.5 | 4.2 | 4.1 | 4.1 | 4.7 |
| 5 | 4.9 | 4.8 | 5.6 | 5.4 | 4.5 | 0.0 | 5.3 | 5.4 | 4.6 | 5.5 |
| 6 | 4.9 | 4.6 | 5.0 | 5.2 | 4.2 | 5.3 | 0.0 | 4.8 | 5.0 | 5.3 |
| 7 | 5.0 | 4.5 | 5.5 | 5.1 | 4.1 | 5.4 | 4.8 | 0.0 | 4.5 | 5.0 |
| 8 | 4.6 | 4.5 | 5.3 | 5.1 | 4.1 | 4.6 | 5.0 | 4.5 | 0.0 | 4.6 |
| 9 | 4.9 | 4.8 | 5.6 | 5.6 | 4.7 | 5.5 | 5.3 | 5.0 | 4.6 | 0.0 |


Euclidean distance matrix, TF-IDF:

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.2 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 |
| 1 | 1.2 | 0.0 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.3 | 1.4 | 1.4 |
| 2 | 1.4 | 1.4 | 0.0 | 1.4 | 1.4 | 1.4 | 1.3 | 1.4 | 1.4 | 1.4 |
| 3 | 1.4 | 1.4 | 1.4 | 0.0 | 1.3 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 |
| 4 | 1.4 | 1.4 | 1.4 | 1.3 | 0.0 | 1.4 | 1.4 | 1.3 | 1.4 | 1.4 |
| 5 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 0.0 | 1.4 | 1.4 | 1.3 | 1.4 |
| 6 | 1.4 | 1.4 | 1.3 | 1.4 | 1.4 | 1.4 | 0.0 | 1.4 | 1.4 | 1.4 |
| 7 | 1.4 | 1.3 | 1.4 | 1.4 | 1.3 | 1.4 | 1.4 | 0.0 | 1.2 | 1.3 |
| 8 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.3 | 1.4 | 1.2 | 0.0 | 1.3 |
| 9 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.3 | 1.3 | 0.0 |
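
The nested loop in `similaridade_euclidiana` calls `euclidean` once per pair. As an alternative sketch, the same matrix can be computed in one step with NumPy broadcasting. The function name `pairwise_euclidean` and the toy input are assumptions for illustration; the input is assumed to be a dense array (for a sparse matrix, `.toarray()` would be needed first).

```python
import numpy as np

def pairwise_euclidean(vectors):
    """Return the n x n matrix of Euclidean distances between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    # diffs[i, j, :] = vectors[i] - vectors[j], built by broadcasting
    diffs = vectors[:, None, :] - vectors[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

# Toy input: three 2-dimensional points on the same line
toy = np.array([[0, 0], [3, 4], [6, 8]])
distances = pairwise_euclidean(toy)
print(distances)
```

The intermediate array has shape n x n x d, so this trades memory for speed. That is fine for a handful of headlines but may not suit a large collection.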


In [241]:
def similaridade_cosseno(matriz):
    # Note: scipy's cosine() returns the cosine distance, 1 - cos(theta),
    # so 0 means same direction and 1 means no terms in common
    vetores = matriz.toarray()  # convert the sparse matrix to a dense array
    num_vetores = len(vetores)
    matriz_similaridade = np.zeros((num_vetores, num_vetores)) # Initialize the result matrix
    for i in range(num_vetores):
        for j in range(num_vetores):
            sim = cosine(vetores[i], vetores[j])
            matriz_similaridade[i, j] = sim
    return pd.DataFrame(matriz_similaridade)

# Print the cosine distance matrices
print("Cosine distance matrix, One-Hot Encoding:")
display(similaridade_cosseno(matriz_onehot))

print("Cosine distance matrix, Count Vectors:")
display(similaridade_cosseno(matriz_count))

print("Cosine distance matrix, Co-occurrence:")
display(similaridade_cosseno(matriz_coocorrencia))

print("Cosine distance matrix, TF-IDF:")
display(similaridade_cosseno(matriz_tfidf))


Cosine distance matrix, One-Hot Encoding:

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.6 | 1.0 | 1.0 | 1.0 | 0.9 | 0.9 | 0.9 | 1.0 | 0.9 |
| 1 | 0.6 | 0.0 | 0.9 | 1.0 | 1.0 | 0.9 | 0.8 | 0.7 | 1.0 | 0.8 |
| 2 | 1.0 | 0.9 | 0.0 | 0.9 | 0.9 | 1.0 | 0.9 | 1.0 | 1.0 | 0.9 |
| 3 | 1.0 | 1.0 | 0.9 | 0.0 | 0.8 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 |
| 4 | 1.0 | 1.0 | 0.9 | 0.8 | 0.0 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 |
| 5 | 0.9 | 0.9 | 1.0 | 1.0 | 1.0 | 0.0 | 0.9 | 0.9 | 0.9 | 0.9 |
| 6 | 0.9 | 0.8 | 0.9 | 1.0 | 1.0 | 0.9 | 0.0 | 0.8 | 1.0 | 0.9 |
| 7 | 0.9 | 0.7 | 1.0 | 0.9 | 0.9 | 0.9 | 0.8 | 0.0 | 0.7 | 0.7 |
| 8 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.9 | 1.0 | 0.7 | 0.0 | 0.8 |
| 9 | 0.9 | 0.8 | 0.9 | 1.0 | 1.0 | 0.9 | 0.9 | 0.7 | 0.8 | 0.0 |


Cosine distance matrix, Count Vectors:

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.6 | 1.0 | 1.0 | 1.0 | 0.9 | 0.9 | 0.9 | 1.0 | 0.9 |
| 1 | 0.6 | 0.0 | 0.9 | 1.0 | 1.0 | 0.9 | 0.8 | 0.7 | 1.0 | 0.8 |
| 2 | 1.0 | 0.9 | 0.0 | 0.9 | 0.9 | 1.0 | 0.9 | 1.0 | 1.0 | 0.9 |
| 3 | 1.0 | 1.0 | 0.9 | 0.0 | 0.8 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 |
| 4 | 1.0 | 1.0 | 0.9 | 0.8 | 0.0 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 |
| 5 | 0.9 | 0.9 | 1.0 | 1.0 | 1.0 | 0.0 | 0.9 | 0.9 | 0.8 | 0.9 |
| 6 | 0.9 | 0.8 | 0.9 | 1.0 | 1.0 | 0.9 | 0.0 | 0.9 | 1.0 | 0.9 |
| 7 | 0.9 | 0.7 | 1.0 | 0.9 | 0.9 | 0.9 | 0.9 | 0.0 | 0.7 | 0.7 |
| 8 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 0.7 | 0.0 | 0.8 |
| 9 | 0.9 | 0.8 | 0.9 | 1.0 | 1.0 | 0.9 | 0.9 | 0.7 | 0.8 | 0.0 |


Cosine distance matrix, Co-occurrence:

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.6 | 1.0 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 | 0.9 | 0.9 |
| 1 | 0.6 | 0.0 | 0.9 | 1.0 | 1.0 | 0.9 | 0.9 | 0.8 | 0.9 | 0.8 |
| 2 | 1.0 | 0.9 | 0.0 | 0.9 | 0.9 | 1.0 | 0.9 | 1.0 | 1.0 | 0.9 |
| 3 | 1.0 | 1.0 | 0.9 | 0.0 | 0.8 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 |
| 4 | 1.0 | 1.0 | 0.9 | 0.8 | 0.0 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 |
| 5 | 0.9 | 0.9 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.8 | 0.9 |
| 6 | 1.0 | 0.9 | 0.9 | 1.0 | 1.0 | 1.0 | 0.0 | 0.9 | 1.0 | 0.9 |
| 7 | 1.0 | 0.8 | 1.0 | 0.9 | 0.9 | 1.0 | 0.9 | 0.0 | 0.8 | 0.8 |
| 8 | 0.9 | 0.9 | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 0.8 | 0.0 | 0.7 |
| 9 | 0.9 | 0.8 | 0.9 | 1.0 | 1.0 | 0.9 | 0.9 | 0.8 | 0.7 | 0.0 |


Cosine distance matrix, TF-IDF:

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.7 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.9 |
| 1 | 0.7 | 0.0 | 0.9 | 1.0 | 1.0 | 1.0 | 0.9 | 0.8 | 1.0 | 0.9 |
| 2 | 1.0 | 0.9 | 0.0 | 1.0 | 0.9 | 1.0 | 0.9 | 1.0 | 1.0 | 0.9 |
| 3 | 1.0 | 1.0 | 1.0 | 0.0 | 0.8 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 |
| 4 | 1.0 | 1.0 | 0.9 | 0.8 | 0.0 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 |
| 5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.9 | 1.0 |
| 6 | 1.0 | 0.9 | 0.9 | 1.0 | 1.0 | 1.0 | 0.0 | 0.9 | 1.0 | 0.9 |
| 7 | 1.0 | 0.8 | 1.0 | 0.9 | 0.9 | 1.0 | 0.9 | 0.0 | 0.7 | 0.9 |
| 8 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.9 | 1.0 | 0.7 | 0.0 | 0.8 |
| 9 | 0.9 | 0.9 | 0.9 | 1.0 | 1.0 | 1.0 | 0.9 | 0.9 | 0.8 | 0.0 |
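
The zero diagonal in the tables above shows that the values are cosine distances, since `scipy.spatial.distance.cosine` returns 1 minus the cosine similarity. The sketch below computes the cosine similarity matrix directly as a normalized dot product and derives the distance from it. The function name `pairwise_cosine_similarity` and the toy input are assumptions for illustration, and the code assumes no row is all zeros.

```python
import numpy as np

def pairwise_cosine_similarity(vectors):
    """Return the n x n matrix of cosine similarities between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms  # assumes no all-zero rows (norm would be 0)
    return unit @ unit.T

toy = np.array([[1, 0], [1, 1], [0, 2]])
similarity = pairwise_cosine_similarity(toy)
distance = 1.0 - similarity  # what scipy's cosine() reports

print(similarity)
print(distance)
```

For rows of unit length, the squared Euclidean distance equals twice the cosine distance. If `matriz_tfidf` came from `TfidfVectorizer` with its default L2 normalization, this would explain why the TF-IDF Euclidean values saturate near 1.4 (about the square root of 2) for documents with almost no shared terms.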


# To change
- Remove stop words
- In the co-occurrence matrix: redo it as a term x term matrix (how often each pair of terms occurs together)
    - Which approaches does the literature propose for building one vector per document, given the document's terms and the term x term matrix?
- Show the similarity as a heat map
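
For the term x term item in the list above, one possible sketch counts how often two terms appear within a fixed window of each other. The function name `cooccurrence_matrix`, the window size, and the toy documents are assumptions for illustration; this is not the notebook's current `matriz_coocorrencia`.

```python
import numpy as np

def cooccurrence_matrix(tokenized_docs, window=2):
    """Count, for each term pair, how often they occur within `window` positions of each other."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
    for doc in tokenized_docs:
        for pos, tok in enumerate(doc):
            start = max(0, pos - window)
            end = min(len(doc), pos + window + 1)
            for ctx_pos in range(start, end):
                if ctx_pos != pos:  # a token is not its own context
                    matrix[index[tok], index[doc[ctx_pos]]] += 1
    return vocab, matrix

# Toy tokenized documents (already preprocessed, stop words removed)
docs = [["carro", "azul"], ["ceu", "azul", "hoje"]]
vocab, matrix = cooccurrence_matrix(docs, window=1)
print(vocab)
print(matrix)
```

Each row of the resulting matrix can serve as a vector for one term. Averaging the rows of a document's terms is one simple way to obtain a document vector, though other weighting schemes exist.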