# Computations
In this notebook, we compute the IDF dictionaries and the sparse TF-IDF and BM25 matrices for each language in our multilingual corpus. These structures represent the importance of terms in documents within each language, and they are saved to disk so that later steps can use them for efficient document retrieval.

Let's import the necessary libraries.

In [None]:
import os                      # For file system operations (saving/loading files)
import numpy as np             # For numerical operations
import pandas as pd            # For working with DataFrames and CSV files
import scipy.sparse as sp      # For creating and saving sparse TF-IDF matrices (csr_matrix, save_npz)
from math import log           # For calculating logarithms (used in IDF calculation)
from tqdm import tqdm          # For displaying progress bars
from collections import defaultdict, Counter  # For counting document frequencies and other counting needs
from scipy.sparse import csr_matrix  # For creating sparse matrices in Compressed Sparse Row (CSR) format

First, we need to load the preprocessed corpus for each language to create the corresponding IDF dictionary and then the sparse TF-IDF matrices required for the rest of the implementation.

In [None]:
# Loading preprocessed CSVs for each language
corpus_files = {
    'en': 'Data/corpus_en_processed.csv',
    'fr': 'Data/corpus_fr_processed.csv',
    'de': 'Data/corpus_de_processed.csv',
    'es': 'Data/corpus_es_processed.csv',
    'it': 'Data/corpus_it_processed.csv',
    'ar': 'Data/corpus_ar_processed.csv',
    'ko': 'Data/corpus_ko_processed.csv'
}

# Dictionary to store the loaded data for each language
corpus_data = {}

# Loop through the dictionary to load the CSV files for each language
for lang, filepath in corpus_files.items():
    # Load the CSV file into a pandas DataFrame and store it in the corpus_data dictionary
    corpus_data[lang] = pd.read_csv(filepath)

Next, we implement a function that computes the inverse document frequency (IDF) of each term in a corpus: terms that occur in few documents receive a high weight, and terms that occur in most documents receive a low one. Combined with term frequency (TF), these weights form the TF-IDF matrix used for document retrieval.

In [None]:
def create_idf_dictionary(df, lang):
    # Total number of documents
    total_docs = len(df)
    
    # Dictionary to store document frequency of each term
    df_dict = defaultdict(int)
    
    # Iterate over each row with a progress bar and calculate document frequencies
    for _, row in tqdm(df.iterrows(), total=total_docs, desc=f"Processing Documents for {lang}"):
        text = row['text']  # Assume text is already pre-processed
        
        # Split the pre-processed text into tokens (words)
        tokens = text.split()  # Assuming space-separated tokens
        
        # Get unique terms in the document to avoid counting duplicates
        unique_terms = set(tokens)
        
        # Update document frequency for each term
        for term in unique_terms:
            df_dict[term] += 1
    
    # Compute IDF for each term; the +1 in the denominator is a smoothing term
    # (a term found in every document therefore gets a slightly negative IDF)
    idf_dict = {term: log(total_docs / (df_value + 1)) for term, df_value in df_dict.items()}
    
    print(f"IDF dictionary for {lang} created.")
    
    return idf_dict
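
As a quick sanity check on the formula used above, idf(t) = log(N / (df(t) + 1)), the sketch below recomputes it on a hypothetical three-document toy corpus. The documents and the helper name `toy_idf` are illustrative only and are not part of the project code. It also shows a consequence of the +1 smoothing: a term present in every document ends up with a slightly negative weight.

```python
from collections import Counter
from math import log

def toy_idf(docs):
    """Compute idf(t) = log(N / (df(t) + 1)) for whitespace-tokenized docs."""
    n_docs = len(docs)
    # Count each term once per document (document frequency)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: log(n_docs / (df + 1)) for term, df in doc_freq.items()}

# Hypothetical toy corpus: "cat" occurs in all three documents, "fish" in one
docs = ["cat sat mat", "cat ate fish", "cat saw dog"]
idf = toy_idf(docs)

print(idf["cat"])   # log(3 / 4), which is negative
print(idf["fish"])  # log(3 / 2), which is positive
```

If negative weights are undesirable downstream, a different smoothing variant could be substituted, but that would change the values stored in the saved dictionaries.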

Each of the following cells creates the IDF dictionary for one language and saves it to disk.

In [None]:
# Create IDF dictionary for English
lang = 'en'
corpus_en = corpus_data[lang] 
idf_dict_en = create_idf_dictionary(corpus_en, lang)

# Save the dictionary
pd.to_pickle(idf_dict_en, 'Data/idf_dict_en.pkl')

In [None]:
# Create IDF dictionary for French
lang = 'fr'
corpus_fr = corpus_data[lang]
idf_dict_fr = create_idf_dictionary(corpus_fr, lang)

# Save the dictionary
pd.to_pickle(idf_dict_fr, 'Data/idf_dict_fr.pkl')

In [None]:
# Create IDF dictionary for German
lang = 'de'
corpus_de = corpus_data[lang] 
idf_dict_de = create_idf_dictionary(corpus_de, lang)

# Save the dictionary
pd.to_pickle(idf_dict_de, 'Data/idf_dict_de.pkl')

In [None]:
# Create IDF dictionary for Spanish
lang = 'es'
corpus_es = corpus_data[lang] 
idf_dict_es = create_idf_dictionary(corpus_es, lang)

# Save the dictionary
pd.to_pickle(idf_dict_es, 'Data/idf_dict_es.pkl')

In [None]:
# Create IDF dictionary for Italian
lang = 'it'
corpus_it = corpus_data[lang] 
idf_dict_it = create_idf_dictionary(corpus_it, lang)

# Save the dictionary
pd.to_pickle(idf_dict_it, 'Data/idf_dict_it.pkl')

In [None]:
# Create IDF dictionary for Arabic
lang = 'ar'
corpus_ar = corpus_data[lang]
idf_dict_ar = create_idf_dictionary(corpus_ar, lang)

# Save the dictionary
pd.to_pickle(idf_dict_ar, 'Data/idf_dict_ar.pkl')

In [None]:
# Create IDF dictionary for Korean
lang = 'ko'
corpus_ko = corpus_data[lang] 
idf_dict_ko = create_idf_dictionary(corpus_ko, lang)

# Save the dictionary
pd.to_pickle(idf_dict_ko, 'Data/idf_dict_ko.pkl')

We create dictionaries that map each docid to its document length (number of tokens) for each language.

In [None]:
def get_doc_lengths(df, lang):
    # Filter the DataFrame for the specified language
    filtered_df = df[df['lang'] == lang]
    
    # Create a dictionary mapping docid to the length of the text
    doc_length_dict = {row['docid']: len(row['text'].split()) for _, row in filtered_df.iterrows()}
    
    return doc_length_dict
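
For large corpora, the row-by-row `iterrows()` loop can be slow. The sketch below shows one possible vectorized alternative using pandas string methods. The function name `get_doc_lengths_vectorized` and the toy DataFrame are hypothetical; the only assumption carried over from the function above is that the DataFrame has `docid`, `lang`, and `text` columns.

```python
import pandas as pd

# Hypothetical miniature corpus with the columns the function above expects
toy_df = pd.DataFrame({
    "docid": ["d1", "d2", "d3"],
    "lang": ["en", "en", "fr"],
    "text": ["a b c", "a b", "x y z w"],
})

def get_doc_lengths_vectorized(df, lang):
    """Map docid -> token count for one language, without iterrows()."""
    filtered = df[df["lang"] == lang]
    # str.split() with no arguments splits on whitespace, like text.split()
    lengths = filtered["text"].str.split().str.len()
    return dict(zip(filtered["docid"], lengths))

doc_lengths = get_doc_lengths_vectorized(toy_df, "en")
print(doc_lengths)
```

The result should contain the same docid-to-length pairs as the loop-based version, although the values are NumPy integers rather than Python `int`s.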

Each of the following cells creates the document-length dictionary for one language.

In [None]:
# Create doclen dictionary for English
lang = 'en'
corpus_en = corpus_data['en']
doc_len_dict_en = get_doc_lengths(corpus_en, lang)

In [None]:
# Create doclen dictionary for French
lang = 'fr'
corpus_fr = corpus_data['fr']
doc_len_dict_fr = get_doc_lengths(corpus_fr, lang)

In [None]:
# Create doclen dictionary for German
lang = 'de'
corpus_de = corpus_data['de']
doc_len_dict_de = get_doc_lengths(corpus_de, lang)

In [None]:
# Create doclen dictionary for Spanish
lang = 'es'
corpus_es = corpus_data['es']
doc_len_dict_es = get_doc_lengths(corpus_es, lang)

In [None]:
# Create doclen dictionary for Italian
lang = 'it'
corpus_it = corpus_data['it']
doc_len_dict_it = get_doc_lengths(corpus_it, lang)

In [None]:
# Create doclen dictionary for Arabic
lang = 'ar'
corpus_ar = corpus_data['ar']
doc_len_dict_ar = get_doc_lengths(corpus_ar, lang)

In [None]:
# Create doclen dictionary for Korean
lang = 'ko'
corpus_ko = corpus_data['ko']
doc_len_dict_ko = get_doc_lengths(corpus_ko, lang)

We computed IDF dictionaries that provide weights for each term in the corpus in order to adjust the term frequencies in each document. Thus, we can now compute TF-IDF matrices by combining the term frequencies (TFs) with these IDF weights to create a matrix that reflects the importance of terms in each document, relative to the entire corpus.

In [None]:
def create_tf_idf_sparse_matrix(df, idf_dict):
    # Build the vocabulary (unique terms in the IDF dictionary)
    vocabulary = sorted(idf_dict.keys())
    term_to_index = {term: idx for idx, term in enumerate(vocabulary)}
    
    # Initialize lists to store sparse matrix data
    rows = []
    cols = []
    data = []
    
    # Compute term frequency (TF) for each document and combine with IDF to get TF-IDF
    for doc_idx, text in tqdm(enumerate(df['text']), total=len(df), desc="Processing Documents"):
        # Split the preprocessed text into tokens
        tokens = text.split()
        
        # Count term frequencies in the document
        term_counts = Counter(tokens)
        total_terms = len(tokens)  # Total number of terms in the document
        
        # Create the TF-IDF values for each term in the document
        for term, count in term_counts.items():
            if term in term_to_index:
                # Calculate TF-IDF: (term count / total terms in the document) * IDF
                tf_idf_value = (count / total_terms) * idf_dict[term]
                
                # Append to sparse matrix lists (row, column, data)
                rows.append(doc_idx)
                cols.append(term_to_index[term])
                data.append(tf_idf_value)
    
    # Create the sparse matrix (Compressed Sparse Row format)
    tf_idf_matrix = csr_matrix((data, (rows, cols)), shape=(len(df), len(vocabulary)))
    
    return tf_idf_matrix, vocabulary
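
To illustrate how the returned matrix and vocabulary fit together, here is a minimal sketch of scoring a query against a TF-IDF matrix by summing each document's weights over the query terms. The helper `score_query` and the toy values are hypothetical, and this is only one possible scoring scheme, not necessarily the one used in the later retrieval steps.

```python
import numpy as np
from scipy.sparse import csr_matrix

def score_query(query, tf_idf_matrix, vocabulary):
    """Sum each document's TF-IDF weights over the query terms."""
    term_to_index = {term: idx for idx, term in enumerate(vocabulary)}
    # Binary query vector: 1 for each known query term, 0 elsewhere
    q = np.zeros(len(vocabulary))
    for term in query.split():
        if term in term_to_index:  # out-of-vocabulary terms are ignored
            q[term_to_index[term]] = 1.0
    # Sparse matrix times dense vector gives one score per document
    return tf_idf_matrix @ q

# Toy 2-document matrix over a sorted vocabulary (values are made up)
vocabulary = ["apple", "banana", "cherry"]
matrix = csr_matrix(np.array([[0.5, 0.0, 0.2],
                              [0.0, 0.3, 0.0]]))

scores = score_query("apple cherry durian", matrix, vocabulary)
ranking = np.argsort(-scores)  # document indices, best match first
```

Column `j` of the matrix corresponds to `vocabulary[j]`, which is why the vocabulary is saved alongside each matrix.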

Now we can calculate the sparse TF-IDF matrix of each language as well as the corresponding vocabulary.

In [None]:
# Create the TF-IDF sparse matrix for English
tf_idf_matrix_en, tf_idf_vocabulary_en = create_tf_idf_sparse_matrix(corpus_data['en'], idf_dict_en)

# Save the TF-IDF matrix and vocabulary
sp.save_npz('Data/tf_idf_matrix_en.npz', tf_idf_matrix_en)
pd.to_pickle(tf_idf_vocabulary_en, 'Data/tf_idf_vocabulary_en.pkl')

In [None]:
# Create the TF-IDF sparse matrix for French
tf_idf_matrix_fr, tf_idf_vocabulary_fr = create_tf_idf_sparse_matrix(corpus_data['fr'], idf_dict_fr)

# Save the TF-IDF matrix and vocabulary
sp.save_npz('Data/tf_idf_matrix_fr.npz', tf_idf_matrix_fr)
pd.to_pickle(tf_idf_vocabulary_fr, 'Data/tf_idf_vocabulary_fr.pkl')

In [None]:
# Create the TF-IDF sparse matrix for German
tf_idf_matrix_de, tf_idf_vocabulary_de = create_tf_idf_sparse_matrix(corpus_data['de'], idf_dict_de)

# Save the TF-IDF matrix and vocabulary
sp.save_npz('Data/tf_idf_matrix_de.npz', tf_idf_matrix_de)
pd.to_pickle(tf_idf_vocabulary_de, 'Data/tf_idf_vocabulary_de.pkl')

In [None]:
# Create the TF-IDF sparse matrix for Spanish
tf_idf_matrix_es, tf_idf_vocabulary_es = create_tf_idf_sparse_matrix(corpus_data['es'], idf_dict_es)

# Save the TF-IDF matrix and vocabulary
sp.save_npz('Data/tf_idf_matrix_es.npz', tf_idf_matrix_es)
pd.to_pickle(tf_idf_vocabulary_es, 'Data/tf_idf_vocabulary_es.pkl')

In [None]:
# Create the TF-IDF sparse matrix for Italian
tf_idf_matrix_it, tf_idf_vocabulary_it = create_tf_idf_sparse_matrix(corpus_data['it'], idf_dict_it)

# Save the TF-IDF matrix and vocabulary
sp.save_npz('Data/tf_idf_matrix_it.npz', tf_idf_matrix_it)
pd.to_pickle(tf_idf_vocabulary_it, 'Data/tf_idf_vocabulary_it.pkl')

In [None]:
# Create the TF-IDF sparse matrix for Arabic
tf_idf_matrix_ar, tf_idf_vocabulary_ar = create_tf_idf_sparse_matrix(corpus_data['ar'], idf_dict_ar)

# Save the TF-IDF matrix and vocabulary
sp.save_npz('Data/tf_idf_matrix_ar.npz', tf_idf_matrix_ar)
pd.to_pickle(tf_idf_vocabulary_ar, 'Data/tf_idf_vocabulary_ar.pkl')

In [None]:
# Create the TF-IDF sparse matrix for Korean
tf_idf_matrix_ko, tf_idf_vocabulary_ko = create_tf_idf_sparse_matrix(corpus_data['ko'], idf_dict_ko)

# Save the TF-IDF matrix and vocabulary
sp.save_npz('Data/tf_idf_matrix_ko.npz', tf_idf_matrix_ko)
pd.to_pickle(tf_idf_vocabulary_ko, 'Data/tf_idf_vocabulary_ko.pkl')


Next, we build a sparse BM25 matrix. For each term in each document, the function below computes the BM25 term-frequency component: the raw count is saturated through k1 and normalized through b by the document's length relative to the average document length (avgdl). Note that, as written, the function uses idf_dict only to define the vocabulary and stores the TF component alone, so the IDF weights still have to be applied (for example at query time) to obtain full BM25 scores, which emphasize terms that are informative within a document and rare in the corpus.

In [None]:
def create_bm25_sparse_matrix(df, doc_length_dict, idf_dict, b=0.75, k1=1.2, avgdl=200):
    # Build the vocabulary (unique terms in the IDF dictionary)
    vocabulary = sorted(idf_dict.keys())
    term_to_index = {term: idx for idx, term in enumerate(vocabulary)}
    
    # Initialize lists to store sparse matrix data
    rows = []
    cols = []
    data = []
    
    # Loop through each document
    for doc_idx, (docid, text) in tqdm(enumerate(zip(df['docid'], df['text'])), total=len(df), desc="Processing Documents"):
        # Split the preprocessed text into tokens
        tokens = text.split()
        
        # Count term frequencies in the document
        term_counts = Counter(tokens)
        
        # Retrieve document length
        doc_length = doc_length_dict.get(docid, 0)
        
        # Compute BM25 values for each term in the document
        for term, count in term_counts.items():
            if term in term_to_index:
                # Calculate BM25 term frequency component
                tf_component = (count * (k1 + 1)) / (count + k1 * (1 - b + b * (doc_length / avgdl)))
                
                # Append to sparse matrix lists (row, column, data)
                rows.append(doc_idx)
                cols.append(term_to_index[term])
                data.append(tf_component)
    
    # Create the sparse matrix (Compressed Sparse Row format)
    bm25_matrix = csr_matrix((data, (rows, cols)), shape=(len(df), len(vocabulary)))
    
    return bm25_matrix, vocabulary
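
The sketch below isolates the term-frequency formula used in the function above to show its two effects, saturation and length normalization, on made-up numbers. Since the function appends only this TF component to `data`, the last lines also show one possible way to apply IDF weights afterwards as column scaling with `scipy.sparse.diags`. The helper name `bm25_tf` and all values are hypothetical, and the IDF vector is assumed to be ordered like the matrix columns.

```python
import numpy as np
import scipy.sparse as sp

def bm25_tf(count, doc_length, avgdl, k1=1.2, b=0.75):
    """Saturated, length-normalized term-frequency component of BM25."""
    norm = 1 - b + b * (doc_length / avgdl)
    return (count * (k1 + 1)) / (count + k1 * norm)

# Saturation: for an average-length document the value grows with the
# count but stays below k1 + 1
avg_doc = [bm25_tf(c, doc_length=100, avgdl=100) for c in (1, 2, 10, 100)]

# Length normalization: the same count weighs less in a longer document
short_doc = bm25_tf(3, doc_length=50, avgdl=100)
long_doc = bm25_tf(3, doc_length=200, avgdl=100)

# Applying IDF as per-column weights (toy values, column order assumed
# to match the vocabulary order)
tf_matrix = sp.csr_matrix(np.array([[1.0, 0.0], [1.5, 2.0]]))
idf_vector = np.array([0.5, 2.0])
bm25_full = tf_matrix @ sp.diags(idf_vector)
```

Whether the IDF weights are folded into the stored matrix or applied per query term at retrieval time is a design choice; the two give the same scores as long as the same IDF values are used.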

Now we can compute the sparse BM25 matrix and the corresponding vocabulary for each language, using k1 = 1.8 as the term-frequency saturation parameter, b = 0.75, and the average document length of each corpus as avgdl.

In [None]:
# Create the bm25 sparse matrix for English
avgdl_en = sum(doc_len_dict_en.values())/len(doc_len_dict_en)
bm25_matrix_en, bm25_vocabulary_en = create_bm25_sparse_matrix(corpus_en, doc_len_dict_en, idf_dict_en, b=0.75, k1=1.8, avgdl=avgdl_en)

# Save the bm25 matrix and vocabulary
sp.save_npz('Data/bm25_matrix_en.npz', bm25_matrix_en)
pd.to_pickle(bm25_vocabulary_en, 'Data/bm25_vocabulary_en.pkl')

In [None]:
# Create the bm25 sparse matrix for French
avgdl_fr = sum(doc_len_dict_fr.values())/len(doc_len_dict_fr)
bm25_matrix_fr, bm25_vocabulary_fr = create_bm25_sparse_matrix(corpus_fr, doc_len_dict_fr, idf_dict_fr, b=0.75, k1=1.8, avgdl=avgdl_fr)

# Save the bm25 matrix and vocabulary
sp.save_npz('Data/bm25_matrix_fr.npz', bm25_matrix_fr)
pd.to_pickle(bm25_vocabulary_fr, 'Data/bm25_vocabulary_fr.pkl')

In [None]:
# Create the bm25 sparse matrix for German
avgdl_de = sum(doc_len_dict_de.values())/len(doc_len_dict_de)
bm25_matrix_de, bm25_vocabulary_de = create_bm25_sparse_matrix(corpus_de, doc_len_dict_de, idf_dict_de, b=0.75, k1=1.8, avgdl=avgdl_de)

# Save the bm25 matrix and vocabulary
sp.save_npz('Data/bm25_matrix_de.npz', bm25_matrix_de)
pd.to_pickle(bm25_vocabulary_de, 'Data/bm25_vocabulary_de.pkl')

In [None]:
# Create the bm25 sparse matrix for Spanish
avgdl_es = sum(doc_len_dict_es.values())/len(doc_len_dict_es)
bm25_matrix_es, bm25_vocabulary_es = create_bm25_sparse_matrix(corpus_es, doc_len_dict_es, idf_dict_es, b=0.75, k1=1.8, avgdl=avgdl_es)

# Save the bm25 matrix and vocabulary
sp.save_npz('Data/bm25_matrix_es.npz', bm25_matrix_es)
pd.to_pickle(bm25_vocabulary_es, 'Data/bm25_vocabulary_es.pkl')

In [None]:
# Create the bm25 sparse matrix for Italian
avgdl_it = sum(doc_len_dict_it.values())/len(doc_len_dict_it)
bm25_matrix_it, bm25_vocabulary_it = create_bm25_sparse_matrix(corpus_it, doc_len_dict_it, idf_dict_it, b=0.75, k1=1.8, avgdl=avgdl_it)

# Save the bm25 matrix and vocabulary
sp.save_npz('Data/bm25_matrix_it.npz', bm25_matrix_it)
pd.to_pickle(bm25_vocabulary_it, 'Data/bm25_vocabulary_it.pkl')

In [None]:
# Create the bm25 sparse matrix for Arabic
avgdl_ar = sum(doc_len_dict_ar.values())/len(doc_len_dict_ar)
bm25_matrix_ar, bm25_vocabulary_ar = create_bm25_sparse_matrix(corpus_ar, doc_len_dict_ar, idf_dict_ar, b=0.75, k1=1.8, avgdl=avgdl_ar)

# Save the bm25 matrix and vocabulary
sp.save_npz('Data/bm25_matrix_ar.npz', bm25_matrix_ar)
pd.to_pickle(bm25_vocabulary_ar, 'Data/bm25_vocabulary_ar.pkl')

In [None]:
# Create the bm25 sparse matrix for Korean
avgdl_ko = sum(doc_len_dict_ko.values())/len(doc_len_dict_ko)
bm25_matrix_ko, bm25_vocabulary_ko = create_bm25_sparse_matrix(corpus_ko, doc_len_dict_ko, idf_dict_ko, b=0.75, k1=1.8, avgdl=avgdl_ko)

# Save the bm25 matrix and vocabulary
sp.save_npz('Data/bm25_matrix_ko.npz', bm25_matrix_ko)
pd.to_pickle(bm25_vocabulary_ko, 'Data/bm25_vocabulary_ko.pkl')
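
For reference, the saved artifacts can be read back with `scipy.sparse.load_npz` and `pandas.read_pickle`. The following sketch performs a round trip on toy data in a temporary directory; the file names and values are placeholders, not the project's actual files.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import scipy.sparse as sp

# Toy stand-ins for a saved matrix and its vocabulary
matrix = sp.csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0]]))
vocabulary = ["alpha", "beta"]

with tempfile.TemporaryDirectory() as tmp_dir:
    matrix_path = os.path.join(tmp_dir, "toy_matrix.npz")
    vocab_path = os.path.join(tmp_dir, "toy_vocabulary.pkl")

    # Save with the same calls used in the cells above
    sp.save_npz(matrix_path, matrix)
    pd.to_pickle(vocabulary, vocab_path)

    # Load both objects back into memory before the directory is removed
    loaded_matrix = sp.load_npz(matrix_path)
    loaded_vocabulary = pd.read_pickle(vocab_path)

# The round trip should preserve every stored value
same_values = (loaded_matrix != matrix).nnz == 0
```

Loading the vocabulary together with the matrix matters because column `j` of the matrix is only meaningful through `loaded_vocabulary[j]`.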

The IDF dictionaries and the sparse TF-IDF and BM25 matrices saved here serve as the basis for document retrieval and similarity analysis in the next steps of our project.