<a href="https://colab.research.google.com/github/younesabdolmalaky/LTR-on-torob-data/blob/main/notebooks/FetchAndPreprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baseline Solution for Torob Data Challenge 2023

**Congratulations on participating in Torob Data Challenge 2023! 🎉**

This notebook provides a baseline solution for the challenge. As the name suggests, it serves only as a *baseline*: it shows how the challenge data can be loaded and processed, and it can be used as a starting point to which you add complexity and parameter tuning to get better results. The approach shown here is only one of many ways to solve a learning-to-rank problem, and you are not required to use it at all. We do strongly suggest that you read and run it at least once to better understand the challenge and its data.

In this baseline solution, a LambdaMART model is trained to rank products by relevance. Queries and product names are represented using a combination of TF-IDF and random projection. As mentioned above, the solution is not tuned for the best possible performance, so there is a lot of room for improvement to experiment with. Feel free to try solutions based on entirely different models, representations, etc. We are really excited and looking forward to seeing you and your novel solutions in the challenge!
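To make the representation idea more concrete, here is a toy, pure-Python sketch of TF-IDF followed by a Gaussian random projection. It is not the baseline's actual implementation: the helper names (`tfidf_vectors`, `random_projection`), the smoothed IDF formula, and the example documents are all assumptions chosen for illustration, and a real pipeline would more likely rely on a library.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, dense TF-IDF vectors) for a list of tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for tok, tf in Counter(doc).items():
            # Smoothed IDF; one common variant, picked here as an assumption.
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0
            vec[index[tok]] = tf * idf
        vectors.append(vec)
    return vocab, vectors


def random_projection(vectors, n_components, seed=0):
    """Project vectors to `n_components` dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    n_features = len(vectors[0])
    scale = 1.0 / math.sqrt(n_components)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
        for _ in range(n_components)
    ]
    return [
        [sum(w * x for w, x in zip(row, vec)) for row in matrix]
        for vec in vectors
    ]


# Made-up, already tokenized product names.
docs = [
    "apple iphone 13 case".split(),
    "samsung galaxy case".split(),
    "iphone 13 charger".split(),
]
vocab, tfidf = tfidf_vectors(docs)
projected = random_projection(tfidf, n_components=4)
print(len(vocab), len(projected), len(projected[0]))  # 7 3 4
```

The projection trades the sparse, vocabulary-sized vectors for short dense ones, which keeps the feature matrix small when the vocabulary is large.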

Ok, let's get started!

NOTE: you can find a brief description of the baseline solution in the following document.
https://docs.google.com/document/d/1aLvD5RoakD-eS7IXcTMGSfo4PHzxa75dZwsC1uhFQXw/edit?usp=sharing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install --no-cache-dir --upgrade gdown

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gdown
  Downloading gdown-4.6.6-py3-none-any.whl (14 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.6.4
    Uninstalling gdown-4.6.4:
      Successfully uninstalled gdown-4.6.4
Successfully installed gdown-4.6.6


In [None]:
from collections import defaultdict, Counter
import json
import os
import re
from tqdm import tqdm
import pandas as pd

## Fetch Data

Fetch the public challenge data from Google Drive and extract its content to a directory called `data`.

In [None]:
!gdown 1spSFY1yieMcjGJc989bbbEheNhegmEmR

Downloading...
From (uriginal): https://drive.google.com/uc?id=1spSFY1yieMcjGJc989bbbEheNhegmEmR
From (redirected): https://drive.google.com/uc?id=1spSFY1yieMcjGJc989bbbEheNhegmEmR&confirm=t&uuid=4f91778b-bab9-4cc1-a655-90a9c8549e6a
To: /content/torob-data-challenge-2023_datafiles_v1.7z
100% 454M/454M [00:03<00:00, 118MB/s]


In [None]:
!7z e torob-data-challenge-2023_datafiles_v1.7z -odata/


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 454165175 bytes (434 MiB)

Extracting archive: torob-data-challenge-2023_datafiles_v1.7z
--
Path = torob-data-challenge-2023_datafiles_v1.7z
Type = 7z
Physical Size = 454165175
Headers Size = 276
Method = LZMA2:24
Solid = +
Blocks = 2

  2% - contest_data_v1/products-info_v1.jsonl

In [None]:
# Create a directory for all the files generated during execution of the notebook.
# The -p flag keeps the command from failing when the directory already exists.
!mkdir -p output_data
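Before moving on, it can help to confirm that the extraction produced the files the rest of the notebook reads. The following is a minimal sketch: the helper `missing_files` is hypothetical, the file names are the ones used later in this notebook, and the `data` directory is assumed to be the extraction target from the `7z` command above.

```python
import os

# File names as they are used later in this notebook.
expected_files = [
    "torob-search-data_v1.jsonl",
    "products-info_v1.jsonl",
    "test-offline-data_v1.jsonl",
]


def missing_files(directory, names):
    """Return the names that are not present as regular files in `directory`."""
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]


# An empty list means every expected file was found.
print(missing_files("data", expected_files))
```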

## Data Preprocessing

In [None]:
char_mappings = {
    "٥": "5",
    "А": "a",
    "В": "b",
    "Е": "e",
    "Н": "h",
    "Р": "P",
    "С": "C",
    "Т": "T",
    "а": "a",
    "г": "r",
    "е": "e",
    "к": "k",
    "м": "m",
    "о": "o",
    "р": "p",
    "ڈ": "د",
    "ڇ": "چ",
    # Persian digits (will be replaced by English ones)
    "۰": "0",
    "۱": "1",
    "۲": "2",
    "۳": "3",
    "۴": "4",
    "۵": "5",
    "۶": "6",
    "۷": "7",
    "۸": "8",
    "۹": "9",
    ".": ".",
    # Arabic digits (will be replaced by English ones)
    "٠": "0",
    "١": "1",
    "٢": "2",
    "٣": "3",
    "٤": "4",
    "٥": "5",
    "٦": "6",
    "٧": "7",
    "٨": "8",
    "٩": "9",
    # Special Arabic Characters (will be replaced by persian one)
    "ك": "ک",
    "ى": "ی",
    "ي": "ی",
    "ؤ": "و",
    "ئ": "ی",
    "إ": "ا",
    "أ": "ا",
    "آ": "ا",
    "ة": "ه",
    "ء": "ی",
    # Accented Latin letters (will be replaced by English ones)
    "à": "a",
    "ä": "a",
    "ç": "c",
    "é": "e",
    "è": "e",
    "ê": "e",
    "ë": "e",
    "î": "i",
    "ï": "i",
    "ô": "o",
    "ù": "u",
    "û": "u",
    "ü": "u",
    # Comma (will be replaced by a dot for floating point numbers)
    ",": ".",
    # Ampersand (will be replaced by the word "and")
    "&": " and ",
    # Diacritics and tatweel (will be removed)
    "ّ": "",  # tashdid
    "َ": "",  # a
    "ِ": "",  # e
    "ُ": "",  # o
    "ـ": "",  # tatvil
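    # Spaces (written as escapes because these characters are invisible)
    "\u200d": "",  # U+200D ZERO WIDTH JOINER (removed)
    "\u200c": " ",  # U+200C ZERO WIDTH NON-JOINER (replaced by a space)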
    # Arabic Presentation Forms-A (will be replaced by persian one)
    "ﭐ": "ا",
    "ﭑ": "ا",
    "ﭖ": "پ",
    "ﭗ": "پ",
    "ﭘ": "پ",
    "ﭙ": "پ",
    "ﭞ": "ت",
    "ﭟ": "ت",
    "ﭠ": "ت",
    "ﭡ": "ت",
    "ﭺ": "چ",
    "ﭻ": "چ",
    "ﭼ": "چ",
    "ﭽ": "چ",
    "ﮊ": "ژ",
    "ﮋ": "ژ",
    "ﮎ": "ک",
    "ﮏ": "ک",
    "ﮐ": "ک",
    "ﮑ": "ک",
    "ﮒ": "گ",
    "ﮓ": "گ",
    "ﮔ": "گ",
    "ﮕ": "گ",
    "ﮤ": "ه",
    "ﮥ": "ه",
    "ﮦ": "ه",
    "ﮪ": "ه",
    "ﮫ": "ه",
    "ﮬ": "ه",
    "ﮭ": "ه",
    "ﮮ": "ی",
    "ﮯ": "ی",
    "ﮰ": "ی",
    "ﮱ": "ی",
    "ﯼ": "ی",
    "ﯽ": "ی",
    "ﯾ": "ی",
    "ﯿ": "ی",
    # Arabic Presentation Forms-B (will be removed)
    "ﹰ": "",
    "ﹱ": "",
    "ﹲ": "",
    "ﹳ": "",
    "ﹴ": "",
    "﹵": "",
    "ﹶ": "",
    "ﹷ": "",
    "ﹸ": "",
    "ﹹ": "",
    "ﹺ": "",
    "ﹻ": "",
    "ﹼ": "",
    "ﹽ": "",
    "ﹾ": "",
    "ﹿ": "",
    # Arabic Presentation Forms-B (will be replaced by persian one)
    "ﺀ": "ی",
    "ﺁ": "ا",
    "ﺂ": "ا",
    "ﺃ": "ا",
    "ﺄ": "ا",
    "ﺅ": "و",
    "ﺆ": "و",
    "ﺇ": "ا",
    "ﺈ": "ا",
    "ﺉ": "ی",
    "ﺊ": "ی",
    "ﺋ": "ی",
    "ﺌ": "ی",
    "ﺍ": "ا",
    "ﺎ": "ا",
    "ﺏ": "ب",
    "ﺐ": "ب",
    "ﺑ": "ب",
    "ﺒ": "ب",
    "ﺓ": "ه",
    "ﺔ": "ه",
    "ﺕ": "ت",
    "ﺖ": "ت",
    "ﺗ": "ت",
    "ﺘ": "ت",
    "ﺙ": "ث",
    "ﺚ": "ث",
    "ﺛ": "ث",
    "ﺜ": "ث",
    "ﺝ": "ج",
    "ﺞ": "ج",
    "ﺟ": "ج",
    "ﺠ": "ج",
    "ﺡ": "ح",
    "ﺢ": "ح",
    "ﺣ": "ح",
    "ﺤ": "ح",
    "ﺥ": "خ",
    "ﺦ": "خ",
    "ﺧ": "خ",
    "ﺨ": "خ",
    "ﺩ": "د",
    "ﺪ": "د",
    "ﺫ": "ذ",
    "ﺬ": "ذ",
    "ﺭ": "ر",
    "ﺮ": "ر",
    "ﺯ": "ز",
    "ﺰ": "ز",
    "ﺱ": "س",
    "ﺲ": "س",
    "ﺳ": "س",
    "ﺴ": "س",
    "ﺵ": "ش",
    "ﺶ": "ش",
    "ﺷ": "ش",
    "ﺸ": "ش",
    "ﺹ": "ص",
    "ﺺ": "ص",
    "ﺻ": "ص",
    "ﺼ": "ص",
    "ﺽ": "ض",
    "ﺾ": "ض",
    "ﺿ": "ض",
    "ﻀ": "ض",
    "ﻁ": "ط",
    "ﻂ": "ط",
    "ﻃ": "ط",
    "ﻄ": "ط",
    "ﻅ": "ظ",
    "ﻆ": "ظ",
    "ﻇ": "ظ",
    "ﻈ": "ظ",
    "ﻉ": "ع",
    "ﻊ": "ع",
    "ﻋ": "ع",
    "ﻌ": "ع",
    "ﻍ": "غ",
    "ﻎ": "غ",
    "ﻏ": "غ",
    "ﻐ": "غ",
    "ﻑ": "ف",
    "ﻒ": "ف",
    "ﻓ": "ف",
    "ﻔ": "ف",
    "ﻕ": "ق",
    "ﻖ": "ق",
    "ﻗ": "ق",
    "ﻘ": "ق",
    "ﻙ": "ک",
    "ﻚ": "ک",
    "ﻛ": "ک",
    "ﻜ": "ک",
    "ﻝ": "ل",
    "ﻞ": "ل",
    "ﻟ": "ل",
    "ﻠ": "ل",
    "ﻡ": "م",
    "ﻢ": "م",
    "ﻣ": "م",
    "ﻤ": "م",
    "ﻥ": "ن",
    "ﻦ": "ن",
    "ﻧ": "ن",
    "ﻨ": "ن",
    "ﻩ": "ه",
    "ﻪ": "ه",
    "ﻫ": "ه",
    "ﻬ": "ه",
    "ﻭ": "و",
    "ﻮ": "و",
    "ﻯ": "ی",
    "ﻰ": "ی",
    "ﻱ": "ی",
    "ﻲ": "ی",
    "ﻳ": "ی",
    "ﻴ": "ی",
    "ﻵ": "لا",
    "ﻶ": "لا",
    "ﻷ": "لا",
    "ﻸ": "لا",
    "ﻹ": "لا",
    "ﻺ": "لا",
    "ﻻ": "لا",
    "ﻼ": "لا",
}

valid_chars = [
    " ",
    "0",
    "1",
    "2",
    "3",
    "4",
    "5",
    "6",
    "7",
    "8",
    "9",
    "A",
    "B",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "J",
    "K",
    "L",
    "M",
    "N",
    "O",
    "P",
    "Q",
    "R",
    "S",
    "T",
    "U",
    "V",
    "W",
    "X",
    "Y",
    "Z",
    "a",
    "b",
    "c",
    "d",
    "e",
    "f",
    "g",
    "h",
    "i",
    "j",
    "k",
    "l",
    "m",
    "n",
    "o",
    "p",
    "q",
    "r",
    "s",
    "t",
    "u",
    "v",
    "w",
    "x",
    "y",
    "z",
    "ا",
    "ب",
    "ت",
    "ث",
    "ج",
    "ح",
    "خ",
    "د",
    "ذ",
    "ر",
    "ز",
    "س",
    "ش",
    "ص",
    "ض",
    "ط",
    "ظ",
    "ع",
    "غ",
    "ف",
    "ق",
    "ل",
    "م",
    "ن",
    "ه",
    "و",
    "پ",
    "چ",
    "ژ",
    "ک",
    "گ",
    "ی",
]



In [None]:
# Build a translation table (code point -> replacement string) for `str.translate`.
translation_table = str.maketrans(char_mappings)

# Create a regex for recognizing invalid characters.
nonvalid_reg_text = '[^{}]'.format("".join(valid_chars))
nonvalid_reg = re.compile(nonvalid_reg_text)


def normalize_text(text, to_lower=True, remove_invalid=True):
    # Replace every character that has a mapping with its normalized form.
    text = text.translate(translation_table)
    if to_lower:
        text = text.lower()
    if remove_invalid:
        text = nonvalid_reg.sub(' ', text)
    # Replace consecutive whitespaces with a single space character.
    text = re.sub(r"\s+", " ", text)
    return text
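As a quick illustration of the three normalization steps (translate, lowercase, replace invalid characters), here is a self-contained sketch that uses a tiny subset of the mapping and whitelist. The names `demo_mappings`, `demo_valid_chars`, and `demo_normalize` are hypothetical and exist only so the snippet runs on its own.

```python
import re

# A small, illustrative subset of the full mapping and whitelist.
demo_mappings = {"۱": "1", "۲": "2", "&": " and "}
demo_valid_chars = " 0123456789abcdefghijklmnopqrstuvwxyz"

demo_table = str.maketrans(demo_mappings)
demo_invalid = re.compile("[^{}]".format(demo_valid_chars))


def demo_normalize(text):
    # 1) map known variants, 2) lowercase, 3) blank out anything not whitelisted
    text = text.translate(demo_table).lower()
    text = demo_invalid.sub(" ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text)


print(demo_normalize("iPhone-۱۲ Pro&Max"))  # iphone 12 pro and max
```

Note that the hyphen is not removed outright but replaced by a space, so tokens on either side of it stay separate.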

In [None]:
def read_json_lines(path, n_lines=None):
    """Creates a generator which reads and returns lines of
    a json lines file, one line at a time, each as a dictionary.
    
    This could be used as a memory-efficient alternative to `pandas.read_json`
    for reading a json lines file.
    """
    # Read as UTF-8 explicitly so the result does not depend on the platform default.
    with open(path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if n_lines == i:
                break
            yield json.loads(line)
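The sketch below shows how the generator can be used to peek at the first few records without loading the whole file. It repeats the function (with an explicit UTF-8 encoding, which is an assumption about the data files) so that it runs on its own, and it reads from a small made-up temporary file rather than the challenge data.

```python
import json
import os
import tempfile


def read_json_lines(path, n_lines=None):
    """Copy of the generator above, repeated so this snippet is self-contained."""
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if n_lines == i:
                break
            yield json.loads(line)


# Write a small, made-up JSON Lines file to a temporary location.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for i in range(5):
        tmp.write(json.dumps({"id": i}) + "\n")
    demo_path = tmp.name

first_two = list(read_json_lines(demo_path, n_lines=2))
all_records = list(read_json_lines(demo_path))
os.remove(demo_path)

print(first_two)         # [{'id': 0}, {'id': 1}]
print(len(all_records))  # 5
```

Passing a small `n_lines` is a cheap way to inspect the record structure of the large search file before running the full aggregation.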

In [None]:
class JSONLinesWriter:
    """
    Helper class to write list of dictionaries into a file in json lines
    format, i.e. one json record per line.
    """

    def __init__(self, file_path):
        self.fd = None
        self.file_path = file_path
        self.delimiter = "\n"

    def open(self):
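        # Write as UTF-8 explicitly so the result does not depend on the platform default.
        self.fd = open(self.file_path, "w", encoding="utf-8")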
        self.first_record_written = False
        return self

    def close(self):
        self.fd.close()
        self.fd = None

    def write_record(self, obj):
        if self.first_record_written:
            self.fd.write(self.delimiter)
        self.fd.write(json.dumps(obj))
        self.first_record_written = True

    def __enter__(self):
        return self.open()

    def __exit__(self, type, value, traceback):
        self.close()
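The file layout produced by the writer can be illustrated without the class itself: records are serialized with `json.dumps` and joined by newlines, with no trailing newline. The records below are made up. One detail worth knowing is that `json.dumps` escapes non-ASCII characters by default, so Persian text is stored as `\uXXXX` sequences and restored on load.

```python
import json

# Made-up records; the first query is Persian text.
records = [{"raw_query": "گوشی"}, {"raw_query": "iphone 13"}]

# Mirror the writer's layout: records joined by "\n", no trailing newline.
payload = "\n".join(json.dumps(r) for r in records)

print(payload.isascii())    # True
print(payload.count("\n"))  # 1

# Reading the lines back restores the original text.
restored = [json.loads(line) for line in payload.split("\n")]
print(restored == records)  # True
```

If human-readable output files are preferred, `json.dumps(obj, ensure_ascii=False)` keeps the characters as they are, at the cost of requiring a UTF-8 aware reader.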

In [None]:
data_dir = '/content/drive/MyDrive/torob/data'
output_dir = '/content/drive/MyDrive/torob/output_data'
# Make sure the output directory exists before any file is written to it.
os.makedirs(output_dir, exist_ok=True)

search_data_path = os.path.join(data_dir, 'torob-search-data_v1.jsonl')
aggregated_search_data_path = os.path.join(output_dir, 'aggregated_search_data.jsonl')

products_path = os.path.join(data_dir, 'products-info_v1.jsonl')
preprocessed_products_path = os.path.join(output_dir, 'preprocessed_products.jsonl')

test_data_path = os.path.join(data_dir, 'test-offline-data_v1.jsonl')
preprocessed_test_queries_path = os.path.join(output_dir, 'preprocessed_test_queries.jsonl')

In [None]:
search_data_path

'data/torob-search-data_v1.jsonl'

In [None]:
def aggregate_searches(search_data_path, output_path):
    """Aggregate searches based on raw query.
    
    For each unique raw query in the search data, the frequency of products and
    clicked products would be aggregated.
    """
    agg_searches = defaultdict(
        lambda : dict(
            results=Counter(),
            clicks=Counter(),
        )
    )
    print("Aggregating searches based on raw query...")
    for search in tqdm(read_json_lines(search_data_path)):
        agg_searches[search['raw_query']]['results'].update(search['result'])
        agg_searches[search['raw_query']]['clicks'].update(search['clicked_result'])
    
    print('Writing aggregated searches into file...')
    with JSONLinesWriter(output_path) as out_file:
        for raw_query, stats in tqdm(agg_searches.items()):
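            # `most_common()` returns (item, count) pairs; fall back to two empty
            # tuples so that an empty counter does not break the unpacking.
            results, results_count = list(zip(*stats['results'].most_common())) or [(), ()]
            clicks, clicks_count = list(zip(*stats['clicks'].most_common())) or [(), ()]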
            record = {
                'raw_query': raw_query,
                'raw_query_normalized': normalize_text(raw_query),
                'results': results,
                'results_count': results_count,
                'clicks': clicks,
                'clicks_count': clicks_count,
            }
            out_file.write_record(record)

    print("Finished aggregating searches.")
    print(f'Number of aggregated search records: {len(agg_searches)}')
    print(f"The aggregated searches data were stored in '{output_path}'.")
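To see what the aggregation produces, here is a small in-memory sketch of the same counting logic. The search records are made up; only the field names (`raw_query`, `result`, `clicked_result`) are taken from the function above.

```python
from collections import Counter, defaultdict

# Hypothetical search records using the same fields as the real data.
searches = [
    {"raw_query": "iphone", "result": [10, 11, 12], "clicked_result": [11]},
    {"raw_query": "iphone", "result": [11, 12, 13], "clicked_result": [11, 13]},
    {"raw_query": "galaxy", "result": [20], "clicked_result": [20]},
]

agg = defaultdict(lambda: {"results": Counter(), "clicks": Counter()})
for s in searches:
    agg[s["raw_query"]]["results"].update(s["result"])
    agg[s["raw_query"]]["clicks"].update(s["clicked_result"])

# most_common() sorts by count (descending); ties keep first-seen order.
results, results_count = zip(*agg["iphone"]["results"].most_common())
clicks, clicks_count = zip(*agg["iphone"]["clicks"].most_common())

print(results, results_count)  # (11, 12, 10, 13) (2, 2, 1, 1)
print(clicks, clicks_count)    # (11, 13) (2, 1)
```

The two parallel tuples per query (items and their counts) are what ends up in each aggregated record, so a product shown or clicked more often for a query appears earlier.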

In [None]:
aggregate_searches(search_data_path, aggregated_search_data_path)

Aggregating searches based on raw query...


2499901it [02:12, 18898.14it/s]


Writing aggregated searches into file...


100%|██████████| 270099/270099 [00:23<00:00, 11349.53it/s]


Finished aggregating searches.
Number of aggregate search records: 270099
The aggregated searches data were stored in 'output_data/aggregated_search_data.jsonl'.


In [None]:
def preprocess_products(products_path, output_path):
    """Preprocess product names.
    
    The different titles of a product are concatenated together and 
    the resulting string would be normalized. Then, the normalized title
    is split into tokens and only the set of unique tokens would be selected
    as the final title of the product.
    """
    print('Preprocessing products...')
    count = 0
    with JSONLinesWriter(output_path) as out_file:
        for product in tqdm(read_json_lines(products_path)):
            titles = product['titles']
            titles_concat_normalized = normalize_text(" ".join(titles))
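            # Keep each token once, in order of first appearance, so that the
            # output is deterministic across runs (iteration order of a set is not).
            titles_words_set = dict.fromkeys(titles_concat_normalized.split())
            titles_words_concat = " ".join(titles_words_set)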
            
            record = {
                'id': product['id'],
                'title_normalized': titles_words_concat,
            }
            out_file.write_record(record)
            count += 1
    print('Finished preprocessing products.')
    print(f'Number of processed products: {count}')
    print(f"The processed products data were stored in '{output_path}'")
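The token deduplication step can be illustrated in isolation. In this sketch the titles are made up and `str.lower()` stands in for the full `normalize_text` function. It also contrasts two ways of deduplicating: a `set`, whose iteration order is not guaranteed, and `dict.fromkeys`, which keeps the order of first appearance.

```python
# Made-up titles for a single product.
titles = ["Apple iPhone 13", "iphone 13 128GB apple"]

# `lower()` is a stand-in for the notebook's full normalization.
tokens = " ".join(titles).lower().split()

unique_as_set = set(tokens)
unique_in_order = list(dict.fromkeys(tokens))

print(sorted(unique_as_set))      # ['128gb', '13', 'apple', 'iphone']
print(" ".join(unique_in_order))  # apple iphone 13 128gb
```

Both variants contain the same tokens, so they are equivalent for bag-of-words features such as TF-IDF; the ordered variant only matters when reproducible output files are wanted.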

In [None]:
preprocess_products(products_path, preprocessed_products_path)

Preprocessing products...


282738it [00:21, 20120.97it/s]

In [None]:
def preprocess_test_queries(test_data_path, output_path):
    """Normalize test queries."""
    print('Preprocessing test queries...')
    count = 0
    with JSONLinesWriter(output_path) as out_file:
        for test_sample in tqdm(read_json_lines(test_data_path)):
            normalized_query = normalize_text(test_sample['raw_query'])
            record = {
                'raw_query_normalized': normalized_query,
            }
            count += 1
            out_file.write_record(record)
    print('Finished preprocessing test queries.')
    print(f'Number of processed test queries: {count}')
    print(f"The processed test queries were stored in '{output_path}'")

In [None]:
preprocess_test_queries(test_data_path, preprocessed_test_queries_path)

Preprocessing test queries...


23140it [00:00, 44188.40it/s]

Finished preprocessing test queries.
Number of processed test queries: 23140
The processed test queries were stored in 'output_data/preprocessed_test_queries.jsonl'



