# Preprocessing 

In text retrieval, preprocessing is a vital step for enhancing the quality and relevance of search results. The goal of text retrieval is to find documents that are most relevant to a user's query. Raw text data, however, is often noisy and inconsistent, which can hinder the retrieval of the most relevant documents. 

Let's import the necessary libraries.

In [None]:
# General libraries
import os
import json
import string
import numpy as np
import pandas as pd

# NLTK libraries for tokenization, stopwords, and stemming
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

# Download the necessary NLTK resources (skipped if already up to date)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load this resource for word_tokenize

The goal of this notebook is to load the document corpus and preprocess it to improve the efficiency of our document retrieval system. The corpus contains documents in seven languages, and each document is processed according to its language. The result of this step is therefore seven CSV files, one per language, each containing the documents of that language with their text replaced by the processed version. These files will serve as the basis for our document retrieval system. Before we begin, let's load the document corpus.

In [None]:
# Load the JSON corpus from the specified path
corpus_path = 'Data/corpus.json'
with open(corpus_path, 'r', encoding='utf-8') as f:  # explicit encoding for the multilingual text
    data = json.load(f)

# Convert the JSON corpus into a Pandas DataFrame
corpus_df = pd.DataFrame(data)
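Before going further, it can help to confirm that the loaded records have the shape the rest of the notebook relies on. The sketch below assumes the corpus is a JSON list of objects that each carry at least a `text` and a `lang` field (the two columns used later). The inline sample records and the `docid` field are made up for illustration.

```python
import json
from collections import Counter

# Hypothetical sample standing in for the contents of corpus.json
sample_json = """
[
  {"docid": "d1", "text": "Hello world", "lang": "en"},
  {"docid": "d2", "text": "Bonjour le monde", "lang": "fr"},
  {"docid": "d3", "text": "Hello again", "lang": "en"}
]
"""

records = json.loads(sample_json)

# Every record must expose the fields used by the preprocessing steps
required_fields = {"text", "lang"}
missing = [i for i, rec in enumerate(records) if not required_fields.issubset(rec)]

# Count documents per language to see how the corpus is distributed
docs_per_language = Counter(rec["lang"] for rec in records)

print("Records with missing fields:", missing)
print("Documents per language:", dict(docs_per_language))
```

On the real corpus, a non-empty `missing` list or an unexpected language code would be worth investigating before splitting the data.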

The first step is to split our corpus so that we end up with seven different dataframes, each corresponding to the documents in the corpus in a certain language.

In [None]:
# Get the unique set of languages in the corpus
languages = set(corpus_df['lang'].tolist())

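# Create a separate DataFrame for each language
# (.copy() makes each one independent of corpus_df, so the later in-place
# update of the 'text' column does not trigger a SettingWithCopyWarning)
corpus_en = corpus_df[corpus_df['lang'] == 'en'].copy()
corpus_fr = corpus_df[corpus_df['lang'] == 'fr'].copy()
corpus_de = corpus_df[corpus_df['lang'] == 'de'].copy()
corpus_es = corpus_df[corpus_df['lang'] == 'es'].copy()
corpus_it = corpus_df[corpus_df['lang'] == 'it'].copy()
corpus_ar = corpus_df[corpus_df['lang'] == 'ar'].copy()
corpus_ko = corpus_df[corpus_df['lang'] == 'ko'].copy()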
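The same split can be written without naming each language by hand. This is only a sketch of an alternative, using a small made-up DataFrame in place of `corpus_df`; the name `frames_by_lang` is introduced here and does not appear elsewhere in the notebook.

```python
import pandas as pd

# Small made-up corpus standing in for corpus_df
corpus_df = pd.DataFrame({
    "text": ["Hello world", "Bonjour le monde", "Hallo Welt", "Good morning"],
    "lang": ["en", "fr", "de", "en"],
})

# groupby yields one (language, sub-DataFrame) pair per distinct value of 'lang';
# .copy() makes each sub-DataFrame independent of corpus_df
frames_by_lang = {lang: group.copy() for lang, group in corpus_df.groupby("lang")}

for lang, frame in frames_by_lang.items():
    print(lang, len(frame))
```

A dictionary keyed by language code would also adapt automatically if the corpus gained or lost a language.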

Since the corpus is multilingual, preprocessing must be language-specific, so we define a stopword list and a stemmer for each language in the corpus.

In [None]:
# Define stopwords for each language
stopwords_dict = {
    'en': set(stopwords.words('english')),
    'fr': set(stopwords.words('french')),
    'de': set(stopwords.words('german')),
    'es': set(stopwords.words('spanish')),
    'it': set(stopwords.words('italian')),
    'ar': set(stopwords.words('arabic'))
}

# Load Korean stopwords from an external file
with open('Data/stopwords-ko.txt', 'r', encoding='utf-8') as f: # /kaggle/input/korean-stop-words/stopwords-ko.txt
    stopwords_dict['ko'] = set(f.read().splitlines())

# Define stemmers for each language
stemmer_dict = {
    'en': SnowballStemmer('english'),
    'fr': SnowballStemmer('french'),
    'de': SnowballStemmer('german'),
    'es': SnowballStemmer('spanish'),
    'it': SnowballStemmer('italian'),
    'ar': None,  # Stemming is skipped for Arabic
    'ko': None   # SnowballStemmer has no Korean stemmer
}

# Function to apply stemming based on language
def apply_stemming(tokens, lang):
    stemmer = stemmer_dict.get(lang, None)
    if stemmer:  # Apply stemming only if a stemmer is available for the language
        return [stemmer.stem(token) for token in tokens]
    return tokens  # If no stemmer, return tokens as-is
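To see the fallback behaviour in isolation, here is a minimal sketch with a reduced stemmer table: one language that has a Snowball stemmer and one that does not. The token lists and the `pt` language code are made up for illustration.

```python
from nltk.stem import SnowballStemmer

# Reduced version of the stemmer table: one language with a stemmer, one without
stemmer_dict = {"en": SnowballStemmer("english"), "ko": None}

def apply_stemming(tokens, lang):
    stemmer = stemmer_dict.get(lang)
    if stemmer:
        return [stemmer.stem(token) for token in tokens]
    return tokens

english_stems = apply_stemming(["running", "documents"], "en")
korean_tokens = apply_stemming(["문서", "검색"], "ko")
unknown_tokens = apply_stemming(["hola"], "pt")  # language missing from the table

print(english_stems)
print(korean_tokens)
print(unknown_tokens)
```

The English tokens are reduced to their stems (`running` becomes `run`), while tokens in a language mapped to `None`, or absent from the table altogether, come back unchanged.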

Next, we implement the functions that preprocess each document in the corpus according to its language.

In [None]:
# Preprocessing function for each document based on its language
def preprocess_single_text(text, lang):
    # Lowercasing
    text = text.lower()
    
    # Tokenization
    tokens = word_tokenize(text)

    # Retain only alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()] 
    
    # Remove stopwords based on the language
    stop_words = stopwords_dict.get(lang, set())  
    tokens = [word for word in tokens if word not in stop_words]
    
    # Apply stemming
    tokens = apply_stemming(tokens, lang)

    # Join tokens back into a single string
    processed_text = ' '.join(tokens)
    
    return processed_text

# Function to preprocess a Pandas DataFrame column
def preprocess_pandas(df, lang):
    # Add a counter to track progress
    counter = 0
    total_rows = len(df)
    
    # Function to process each row and print progress every 100 rows
    def process_row(text):
        nonlocal counter
        counter += 1
        
        # Print progress every 100 rows
        if counter % 100 == 0 or counter == total_rows:
            print(f"Processed {counter}/{total_rows} rows")
        
        return preprocess_single_text(text, lang)
    
    # Apply the processing function to the DataFrame
    df['text'] = df['text'].apply(process_row)
    
    return df
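The order of the steps is easiest to follow on a single sentence. The sketch below is a simplified, dependency-free stand-in for the pipeline, not the notebook's implementation: a regular expression replaces `word_tokenize`, `DEMO_STOPWORDS` is a tiny made-up stopword list, and stemming defaults to a no-op. The names `preprocess_demo` and `DEMO_STOPWORDS` are hypothetical.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's full lists instead
DEMO_STOPWORDS = {"en": {"the", "are", "in", "a", "of"}}

def preprocess_demo(text, lang, stem=lambda token: token):
    """Lowercase, tokenize, keep alphabetic tokens, drop stopwords, stem, rejoin."""
    tokens = re.findall(r"\w+", text.lower())           # regex stand-in for word_tokenize
    tokens = [tok for tok in tokens if tok.isalpha()]   # drops numbers such as "3"
    stop_words = DEMO_STOPWORDS.get(lang, set())        # unknown language: nothing removed
    tokens = [tok for tok in tokens if tok not in stop_words]
    return " ".join(stem(tok) for tok in tokens)

result = preprocess_demo("The 3 quick foxes are running in the forest!", "en")
print(result)  # quick foxes running forest
```

Punctuation and the digit disappear at the alphabetic filter, and the stopwords are removed afterwards, so only content words reach the stemming step.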

Finally, all we have to do is preprocess all the documents for each language and save the results for future use.

In [None]:
# Preprocess the English corpus
lang = 'en'
corpus_en_processed = preprocess_pandas(corpus_en, lang)

# Show the result
print(corpus_en_processed.head())

# Save the processed corpus
corpus_en_processed.to_csv('Data/corpus_en_processed.csv', index=False) 

In [None]:
# Preprocess the French corpus
lang = 'fr'
corpus_fr_processed = preprocess_pandas(corpus_fr, lang)

# Show the result
print(corpus_fr_processed.head())

# Save the processed corpus
corpus_fr_processed.to_csv('Data/corpus_fr_processed.csv', index=False)

In [None]:
# Preprocess the German corpus
lang = 'de'
corpus_de_processed = preprocess_pandas(corpus_de, lang)

# Show the result
print(corpus_de_processed.head())

# Save the processed corpus
corpus_de_processed.to_csv('Data/corpus_de_processed.csv', index=False) 

In [None]:
# Preprocess the Spanish corpus
lang = 'es'
corpus_es_processed = preprocess_pandas(corpus_es, lang)

# Show the result
print(corpus_es_processed.head())

# Save the processed corpus
corpus_es_processed.to_csv('Data/corpus_es_processed.csv', index=False) 

In [None]:
# Preprocess the Italian corpus
lang = 'it'
corpus_it_processed = preprocess_pandas(corpus_it, lang)

# Show the result
print(corpus_it_processed.head())

# Save the processed corpus
corpus_it_processed.to_csv('Data/corpus_it_processed.csv', index=False) 

In [None]:
# Preprocess the Arabic corpus
lang = 'ar'
corpus_ar_processed = preprocess_pandas(corpus_ar, lang)

# Show the result
print(corpus_ar_processed.head())

# Save the processed corpus
corpus_ar_processed.to_csv('Data/corpus_ar_processed.csv', index=False) 

In [None]:
# Preprocess the Korean corpus
lang = 'ko'
corpus_ko_processed = preprocess_pandas(corpus_ko, lang)

# Show the result
print(corpus_ko_processed.head())

# Save the processed corpus
corpus_ko_processed.to_csv('Data/corpus_ko_processed.csv', index=False) 
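The seven cells above repeat the same steps and differ only in the language code, so they could be folded into a single loop. The sketch below shows the idea with the standard library only. It relies on made-up records, a placeholder `process` function standing in for `preprocess_single_text`, and a temporary directory instead of `Data/`.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Made-up records standing in for the corpus
records = [
    {"text": "Hello World", "lang": "en"},
    {"text": "Bonjour le Monde", "lang": "fr"},
    {"text": "Good Morning", "lang": "en"},
]

def process(text, lang):
    # Placeholder for preprocess_single_text; it only lowercases here
    return text.lower()

# Group the records by language code
by_lang = defaultdict(list)
for rec in records:
    by_lang[rec["lang"]].append(rec)

# Write one CSV file per language, named after the language code
output_dir = Path(tempfile.mkdtemp())
written_files = []
for lang, docs in by_lang.items():
    out_path = output_dir / f"corpus_{lang}_processed.csv"
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "lang"])
        writer.writeheader()
        for doc in docs:
            writer.writerow({"text": process(doc["text"], lang), "lang": lang})
    written_files.append(out_path.name)

print(sorted(written_files))
```

Deriving the file name from the language code keeps the output names consistent with the `corpus_<lang>_processed.csv` pattern used above.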

We now have seven CSV files, one per language, containing the documents preprocessed according to their language. The document retrieval task will build on these files.