![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

## Introduction to Text Mining and Natural Language Processing


## Session 5: Similarity Measures and Tf-idf

We now start from the document-term matrix (DTM) that was generated from the 200 articles you saw previously. This is a labelled dataset: a very small subset of a larger dataset that is unlabelled. The idea is that we labelled part of the data and then use it as training data to spot similar articles in the larger set. To prepare for that task, I have split a larger dataset into a training set and a test set. We will first do everything on the training set and turn to the test set when we do supervised learning.

The training_set.csv below is pre-processed but not vectorized. This notebook shows how the different vectorizations and similarity measures covered in class affect what a query on an article returns.

You will also see that comparing by the dot product is a risky idea. Run the entire notebook, go through it, and then discuss with your neighbor. In the end, the tf-idf vectorization together with cosine similarity will yield the best fit to the article. Experiment a little with a different article.

### Context of the Data

The text in this corpus is dyadic: each article comes from an origin country and talks about a destination country. There is also a time dimension, which we will ignore most of the time here.

In [49]:
# Imports
import os
import re
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, SnowballStemmer, PorterStemmer
from nltk.corpus import stopwords
import spacy
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from tqdm import tqdm

# Download required NLTK resources
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# Download and load spaCy model
from spacy.cli import download as spacy_download
spacy_download('en_core_web_sm')
sp = spacy.load('en_core_web_sm')

# Enable tqdm for pandas
tqdm.pandas()

# Initialize stemmer and lemmatizer
# (note: despite its name, `porter` holds a Snowball stemmer, not the original Porter stemmer)
porter = SnowballStemmer('english')
lmtzr = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english'))

# Set paths
path = '.'
spitout = '.'

# Define your file path and filename.
filename = 'training_set.csv'

# Load the data
corpus_data = pd.read_csv(os.path.join(path, filename), sep=',', encoding='utf-8')
corpus_data = corpus_data[pd.notna(corpus_data.text_preproc)].reset_index(drop=True)

print('The texts are about five different countries:')
corpus_data.dest_iso3c.value_counts()


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/hannesfelixmuller/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hannesfelixmuller/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/hannesfelixmuller/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The texts are about five different countries:


dest_iso3c
EGY    5492
TUR    4369
ISR    1632
TUN     235
MAR      24
Name: count, dtype: int64

## Query Article

We will try an interesting query article: a short notice that does not have an exact match in the corpus. You can try your own query below.

In [50]:
#raw text
print(corpus_data.TXT_EN[5])

According to the Israeli ambassador, Rafael Eldad, 5,000 Brazilians face the risk of Palestinian attacks. Palestinian ambassador Ibrahim Alzeben says there are also Brazilians in the territories attacked by Israel.


In [51]:
#preprocessed text
print(corpus_data.text_preproc[5])

accord israeli ambassador Rafael Eldad 5,000 Brazilians face risk palestinian attack palestinian ambassador Ibrahim Alzeben say Brazilians territory attack Israel


## CountVectorizer vs. Tf-idf

In [52]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the simple CountVectorizer
cv_simple = CountVectorizer(ngram_range=(1,2), lowercase=False, min_df=0.01, max_df=0.6)

# Fit the vectorizer to the preprocessed text data
X_simple = cv_simple.fit_transform(corpus_data.text_preproc)

# Get the vocabulary (terms)
terms_simple = cv_simple.get_feature_names_out()

# Calculate term frequencies (total counts across all documents)
term_frequencies_simple = X_simple.toarray().sum(axis=0)

# Create a DataFrame for easier handling
import pandas as pd

df_terms_simple = pd.DataFrame({
    'term': terms_simple,
    'frequency': term_frequencies_simple
})

# Sort the DataFrame by frequency in descending order
df_terms_simple = df_terms_simple.sort_values(by='frequency', ascending=False).reset_index(drop=True)

# Display the top 10 terms as a sanity check
print(df_terms_simple.head(10))



       term  frequency
0     Egypt      22526
1    Turkey      19234
2       say      17297
3    Israel      11037
4  egyptian      10172
5   turkish       9503
6   country       7819
7   israeli       7233
8      kill       6893
9      Gaza       6568


In [53]:
# TF-IDF with sublinear tf scaling and no length normalization (norm=None)
cv_tfidf = TfidfVectorizer(ngram_range=(1, 2), norm=None, sublinear_tf=True, lowercase=False, min_df=0.01, max_df=0.6)

# Fit the vectorizer to the preprocessed text data
# (fit_transform already fits, so no separate fit call is needed)
X_tfidf = cv_tfidf.fit_transform(corpus_data.text_preproc)

# Get the vocabulary (terms)
terms_tfidf = cv_tfidf.get_feature_names_out()

# Sum the tf-idf weights of each term across all documents
# (these are weights, not raw counts; the column is still called 'frequency' below)
term_frequencies_tfidf = X_tfidf.toarray().sum(axis=0)

# Create a DataFrame for easier handling

df_terms_tfidf = pd.DataFrame({
    'term': terms_tfidf,
    'frequency': term_frequencies_tfidf
})

# Sort the DataFrame by frequency in descending order
df_terms_tfidf = df_terms_tfidf.sort_values(by='frequency', ascending=False).reset_index(drop=True)

# Display the top 10 terms as a sanity check
print(df_terms_tfidf.head(10))

       term     frequency
0     Egypt  20146.234079
1    Turkey  18523.180512
2       say  17841.644007
3  egyptian  13786.580955
4    Israel  13489.171161
5   turkish  13330.544776
6   country  12111.161462
7      kill  11800.099328
8    attack  11030.829944
9   israeli  10879.267536
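
The summed weights above differ from the raw counts because of the vectorizer settings. As far as the scikit-learn documentation describes them, `sublinear_tf=True`, `norm=None` and the default `smooth_idf=True` give each term in each document a weight of the form `(1 + ln(tf)) * (ln((1 + n) / (1 + df)) + 1)`. The sketch below checks this by hand on a hypothetical three-document toy corpus (`toy_docs` is made up for illustration and is not the course data; the `min_df` / `max_df` filters are left out because the toy corpus is too small for them).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only (not the course data)
toy_docs = [
    "Egypt protest Egypt army",
    "Turkey protest police",
    "Israel Gaza attack attack attack",
]

# Raw counts and document frequencies
cv_toy = CountVectorizer(lowercase=False)
counts = cv_toy.fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Sublinear tf: 1 + ln(tf) where tf > 0, and 0 otherwise
tf_sub = np.zeros_like(counts)
tf_sub[counts > 0] = 1.0 + np.log(counts[counts > 0])

# Smoothed idf (scikit-learn default smooth_idf=True)
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# No row normalization, matching norm=None
manual_tfidf = tf_sub * idf

# Same weighting settings as in the notebook cell above
tfidf_toy = TfidfVectorizer(norm=None, sublinear_tf=True, lowercase=False)
sk_tfidf = tfidf_toy.fit_transform(toy_docs).toarray()

print("Manual weights match scikit-learn:", np.allclose(manual_tfidf, sk_tfidf))
```

Because `norm=None` switches off row normalization, longer documents keep larger vectors, which matters for the dot-product comparison later in the notebook.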


In [54]:
print(corpus_data.text_preproc[40])

U.S. diplomat urge Egyptians hold dialogue visit U.S. Deputy Secretary State William Burns Monday urge political faction Egypt engage dialogue end violence official news agency MENA report ask U.S. President Barack Obama Secretary State John Kerry visit Egypt clarify U.S. stance Burns say brief news conference U.S. embassy Cairo message clear United States continue commit democratic success prosperity Egypt say Burns high level diplomat visit Egypt July 3 islamist orient President Mohamed Morsi oust armed force massive nationwide protest call removal Burns meet member egyptian transitional government armed force representative different political party NGOs activist religious figure businessman Al Nour second important islamic party Egypt say decide express rejection U.S. intervention Egypt domestic affair Tamarud founder Mahmoud Badr say U.S. administration need acknowledge new system apologize support Muslim Brotherhood party terrorism Burns say Egyptians determine future come U.S. s

In [55]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

DOC_INDICES = [5]

###############################################################################
# 1) DOT PRODUCT
###############################################################################
def most_similar_by_dot_product(X, doc_index):
    # Cast the result to float to avoid the "cannot convert float infinity to integer" error
    dot_values = X[doc_index].dot(X.T).toarray().ravel().astype(float)
    dot_values[doc_index] = -np.inf
    return np.argmax(dot_values).item()

def find_closest_docs_dot_product(X, doc_indices):
    return {
        i: most_similar_by_dot_product(X, i)
        for i in doc_indices
    }

###############################################################################
# 2) COSINE SIMILARITY
###############################################################################
def most_similar_by_cosine(X, doc_index):
    cos_values = cosine_similarity(X[doc_index], X).ravel()
    cos_values[doc_index] = -np.inf
    return np.argmax(cos_values).item()

def find_closest_docs_cosine(X, doc_indices):
    return {
        i: most_similar_by_cosine(X, i)
        for i in doc_indices
    }



###############################################################################
# -- Main routine: evaluate every query in DOC_INDICES
###############################################################################
dot_tfidf_results = find_closest_docs_dot_product(X_tfidf, DOC_INDICES)
dot_simple_results = find_closest_docs_dot_product(X_simple, DOC_INDICES)

cos_tfidf_results = find_closest_docs_cosine(X_tfidf, DOC_INDICES)
cos_simple_results = find_closest_docs_cosine(X_simple, DOC_INDICES)

print("Nearest by DOT PRODUCT (TF-IDF):", dot_tfidf_results)
print("Nearest by DOT PRODUCT (Count) :", dot_simple_results)
print("Nearest by COSINE (TF-IDF)      :", cos_tfidf_results)
print("Nearest by COSINE (Count)       :", cos_simple_results)


Nearest by DOT PRODUCT (TF-IDF): {5: 6834}
Nearest by DOT PRODUCT (Count) : {5: 3412}
Nearest by COSINE (TF-IDF)      : {5: 6754}
Nearest by COSINE (Count)       : {5: 6754}
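
The functions above return only the single nearest document. When experimenting with a different article, it can help to look at several neighbours together with their similarity scores. A minimal sketch, assuming a hypothetical helper `top_k_by_cosine` (not part of the original notebook) and a made-up toy matrix `X_toy`:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

def top_k_by_cosine(X, doc_index, k=3):
    """Return the k (index, similarity) pairs closest to doc_index, excluding itself."""
    sims = cosine_similarity(X[doc_index], X).ravel()
    sims[doc_index] = -np.inf          # never return the query itself
    order = np.argsort(-sims)[:k]      # indices sorted by decreasing similarity
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical toy document-term matrix (4 documents, 3 terms)
X_toy = csr_matrix(np.array([
    [1, 1, 0],
    [2, 2, 0],
    [0, 1, 1],
    [0, 0, 3],
], dtype=float))

neighbours = top_k_by_cosine(X_toy, doc_index=0, k=2)
print(neighbours)
```

On the notebook's data the same helper could be called as `top_k_by_cosine(X_tfidf, 5, k=5)`; the returned indices can then be passed to `corpus_data.TXT_EN` to read the articles.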


In [57]:
import textwrap

def print_article(title, text, width=100):
    print(title)
    print("-" * len(title))
    print(textwrap.fill(str(text), width=width))
    print()

print_article("Query article", corpus_data.TXT_EN[5])

print_article("DOT PRODUCT (TF-IDF)", corpus_data.TXT_EN[6834])
print_article("DOT PRODUCT (Count)", corpus_data.TXT_EN[3412])
print_article("COSINE (TF-IDF)", corpus_data.TXT_EN[6754])
print_article("COSINE (Count)", corpus_data.TXT_EN[6754])


Query article
-------------
According to the Israeli ambassador, Rafael Eldad, 5,000 Brazilians face the risk of Palestinian
attacks. Palestinian ambassador Ibrahim Alzeben says there are also Brazilians in the territories
attacked by Israel.

DOT PRODUCT (TF-IDF)
--------------------
Thousands live in areas that can be hit, according to Israeli diplomats and ANP in Brasília
BRASÍLIA and ASHKELON, Israel Israel's ambassador to Brazil, Rafael Eldad, said yesterday that
about half of the 10,000 Brazilians living in his country may fall victim to attacks by residents in
the area and areas near Tel Aviv, against which Palestinians fired rockets. The Brazilian National
Authority (ANP) ambassador to Brazil, Ibrahim Alzeben, says that there are also Brazilians in the
territories attacked by Israel. - There are also Brazilians in the Gaza Strip and the West Bank. The
solution is the end of the conflict, of the bombardment. The missiles are hitting our people. The
end of the attacks of Israel

Very clearly cosine similarity "wins". Why is this? Discuss with your neighbor.
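
One possible ingredient for the discussion is document length. The dot product grows with the size of the vectors, while cosine similarity divides by their norms. The sketch below uses made-up count vectors (`query`, `short_match` and `long_other` are hypothetical and not taken from the corpus) to show how the two measures can rank the same candidates differently.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical count vectors over the terms [ambassador, Brazilians, attack, Egypt]
query = np.array([[2, 2, 2, 0]], dtype=float)         # short notice
candidates = {
    "short_match": np.array([[1, 1, 1, 0]], dtype=float),   # same topic, similar length
    "long_other": np.array([[1, 0, 3, 40]], dtype=float),   # long article, mostly another topic
}

scores = {}
for name, doc in candidates.items():
    dot = (query @ doc.T).item()                  # unnormalized: rewards large vectors
    cos = cosine_similarity(query, doc).item()    # normalized by both vector lengths
    scores[name] = (dot, cos)
    print(f"{name}: dot={dot:.1f}, cosine={cos:.3f}")
```

In this toy case the dot product prefers the long article and cosine similarity prefers the short one. Whether the same mechanism explains the results on the real corpus is something to check against the articles printed above.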