<a href="https://colab.research.google.com/github/rprimi/IR_bm25/blob/main/proj0_is_bm25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Building a Simple Information Retrieval System using BM25

## Description

Exercise for special student selection for the Course "Deep Learning applied to search systems", FEEC-Unicamp.

**Objectives:** 
- Create a basic system of Information Retrieval (IR) using the BM25 ranking algorithm in Python.

**Deliverables:**
- A github repository
- Google Colab notebooks and associated functions and libraries if any, with the code for the IR system
- A brief report describing the implementation details, results, how to test the IR system and how chatGPT helped you with the project.
- Evaluation of the model in the CISI collection
http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/

**Grading:**  
The project will be evaluated based on the following criteria:
- Completion of the project deliverables (partial is OK to submit)
- Implementation accuracy
- Quality of the report
- Code readability and commenting
- How chatGPT was used




## Learning about BM25

BM25 is a ranking function used by search engines to estimate the relevance of documents for a given search query¹². It is based on the probabilistic information retrieval model developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones and others².

BM25 is a function that scores a set of documents based on the query terms that appear in each document, regardless of how close the terms are within the document². It is a family of scoring functions with slightly different components and parameters. One of the most prominent instances of the function is the following²:

$$
\mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i,D) \cdot (k_1 + 1)}{f(q_i,D) + k_1 \cdot (1 - b + b \cdot D / \mathrm{avgdl})}
$$


Where:
- $score(D,Q)$ is the BM25 score of document $D$ for query $Q$
- $\sum_{i=1}^{n}$ is the sum over all query terms $q_i$ that appear in document $D$
- The $IDF(q_i)$ is given by:

$$\mathrm{IDF}(q_i) = \log{_e} \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$
- where:
  - $N$ is the total number of documents in the collection
  - $n(q_i)$ is the number of documents that contain the term $q_i$
  - The smaller the value of $n(q_i)$, the greater the value of $IDF(q_i)$ and the greater the contribution of the term to the final score. The IDF increases as the term becomes rarer in the collection, and decreases as it becomes more common. "The IDF component of our formula measures how often a term occurs in all of the documents and “penalizes” terms that are common"
- $f(q_i,D)$ is the frequency of the term $q_i$ in the document $D$, that is, how many times it appears in the text. The higher this value, the more relevant the document is to the term.
- $D$ is the length of the document D in words. $\mathrm{avgdl}$ is the average length of the documents in the collection. The length of the documento is divided by the average doc length in the denominator: $D / \mathrm{avgdl}$ ."The way to think about this is that the more terms in the document — at least ones not matching the query — the lower the score for the document"²
- $b$ controls how much the score is affected by document length relative to the average length of documents in the collection. Larger values of $b$ favor shorter documents, while smaller values favor longer documents. The commonly used default value is $b = 0.75$. "if b is bigger, the effects of the length of the document compared to the average length are more amplified."

- $(k_1 + 1)$ and $(k_1 * (1 - b + b * |D| / avgdl))$ are two components that adjust for the influence of term frequency on the final score. They depend on the free parameters:
- $k_1$ controls how much the score increases based on term frequency. Larger values of $k_1$ mean that differences in term frequency have more impact on the score, while smaller values mean that differences have less impact. The commonly used default value is $k_1 = 1.2 to 2.0$.
These elements together form a ranking function that attempts to estimate how relevant a document is to a query based on the terms they share.

### Referências
(1) Okapi BM25 - Wikipedia. https://en.wikipedia.org/wiki/Okapi_BM25 Acessado 16/02/2023.  
(2) Practical BM25 - Part 2: The BM25 Algorithm and its Variables. https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables Acessado 16/02/2023.  
(3) Document similarities with BM25 algorithm - MATLAB bm25Similarity. https://www.mathworks.com/help/textanalytics/ref/bm25similarity.html Acessado 16/02/2023.  
(3) python-bm25 | Python implementation of BM25 function | Search Engine .... https://kandi.openweaver.com/python/fanta-mnix/python-bm25#:~:text=python-bm25%20is%20a%20Python%20library%20typically%20used%20in,Python%20implementation%20of%20BM25%20function%20for%20document%20retrieval Acessado 16/02/2023.  
(4) BM25 using Python Gensim Package | Search Engine. https://iamgeekydude.com/2022/12/25/bm25-using-python-gensim-package-search-engine-nlp/ Acessado 16/02/2023.  
(5) GitHub - nhirakawa/BM25: A Python implementation of the BM25 ranking .... https://github.com/nhirakawa/BM25 Acessado 16/02/2023.  
(6) GitHub - fanta-mnix/python-bm25: Python implementation of BM25 function .... https://github.com/fanta-mnix/python-bm25 Acessado 16/02/2023.  
(7) GitHub - xianchen2/Text_Retrieval_BM25: Python implementation of the .... https://github.com/xianchen2/Text_Retrieval_BM25 Acessado 16/02/2023.  
(8) Machine Learning Approach for Improved BM25 Retrieval. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/LearningBM25MSRTechReport.pdf Acessado 16/02/2023.  
(9) Information retrieval evaluation of precision, recall, f-score, AP, MAP .... https://stackoverflow.com/questions/40457331/information-retrieval-evaluation-of-precision-recall-f-score-ap-map-in-pytho Acessado 16/02/2023.  
(10) Information Retrieval with document Re-ranking with BERT and BM25. https://medium.com/@papai143/information-retrieval-with-document-re-ranking-with-bert-and-bm25-7c29d738df73 Acessado 16/02/2023.  
(11) python-bm25 | Python implementation of BM25 function | Search Engine .... https://kandi.openweaver.com/python/fanta-mnix/python-bm25 Acessado 16/02/2023.  

### Exemplo "baunilha" :)

In [None]:
!pip install rank_bm25
import rank_bm25
from rank_bm25 import BM25Okapi

# Cria documentos como lista de palavras
docs = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space", "cat"],
    ["wag", "dog"]
]

# Criar um objeto BM25Okapi com os documentos
bm25 = BM25Okapi(docs)


# Cria uma consulta como lista de palavras
query = [["cat"], ["cat", "house"]]

# Calcular as pontuações dos documentos para a consulta usando get_scores()
scores = [bm25.get_scores(query = q) for q in query] 

# Imprimir as pontuações

print(scores)



### Implementração do algotitmo BM25 com avaliação na coleção CISI usando as bibliotecas gensin e nltk

- Database description (from ChatGPT)

  - The CISI (Computer and Information Science Index) database is a classic test collection widely used in the field of information retrieval for research and evaluation purposes. The database consists of a collection of documents and a set of queries, as well as relevance judgments for each query-document pair.

  - The CISI database is available in several formats, including plain text, SGML, and XML. The most commonly used format is the plain text format, which consists of the following files:

  - cisi.all - This file contains the full text of all the documents in the collection. Each document is separated by a line starting with the tag .I followed by a unique document ID. The text of the document follows on subsequent lines until the next .I tag.

  - cisi.docs - This file contains bibliographic information for each document in the collection. Each record begins with the tag .I followed by the document ID. The bibliographic information is contained in fields such as .T (title), .A (author), .B (source), and .W (abstract). The fields are separated by the tag and enclosed in angle brackets, like <.T>, <.A>, etc.

  - cisi.qry - This file contains the queries in the collection. Each query is preceded by the tag .I followed by a unique query ID. The text of the query follows on subsequent lines until the next .I tag.

  - cisi.rel - This file contains the relevance judgments for each query-document pair. Each record begins with the tag .I followed by the query ID, the tag .R and the document ID of a relevant document. Multiple relevant documents are listed on subsequent lines until the next .I tag.

- In summary, the CISI database has four main files: cisi.all for the document text, cisi.docs for document metadata, cisi.qry for queries, and cisi.rel for relevance judgments. The structure of each file is defined by a set of tags, which are used to identify different parts of the data.


In [162]:
import os
from google.colab import drive
drive.mount('/content/drive')

# Cria diretório no google drive
!mkdir -p /content/drive/MyDrive/cisi

# Baixa base
!wget -P /content/drive/MyDrive/cisi/ http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/cisi.tar.gz

# Descomprime arquivos
!tar -xvzf /content/drive/MyDrive/cisi/cisi.tar.gz -C /content/drive/MyDrive/cisi/



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
--2023-02-21 17:21:18--  http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/cisi.tar.gz
Resolving ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)... 130.209.240.253
Connecting to ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)|130.209.240.253|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 775144 (757K) [application/gzip]
Saving to: ‘/content/drive/MyDrive/cisi/cisi.tar.gz’


2023-02-21 17:21:19 (1.26 MB/s) - ‘/content/drive/MyDrive/cisi/cisi.tar.gz’ saved [775144/775144]

CISI.ALL
CISI.BLN
CISI.QRY
CISI.REL


In [None]:
# Instala biblioteca gensim, que fornece uma implementação do BM25 em Python
!pip install gensim
!pip install nltk
!pip install rank_bm25

import nltk
from nltk.corpus import stopwords
import re
from gensim.summarization import bm25


def read_cisi_file(filename):
  # Abre o arquivo no modo leitura
  with open(filename, "r") as f:
    # Lê todo o conteúdo do arquivo
    content = f.read()
    # Divide o conteúdo em registros separados pelo caractere ".I"
    records = content.split(".I")
    # Remove o primeiro registro vazio
    records.pop(0)
    # Cria uma lista vazia para armazenar os textos dos registros
    texts = []
    # Para cada registro na lista de registros
    for record in records:
      # Divide o registro em linhas
      lines = record.split("\n")
      # Remove a primeira linha que contém o número do registro
      lines.pop(0)
      # Junta as linhas restantes em um único texto
      text = " ".join(lines)
      # Adiciona o texto à lista de textos
      texts.append(text)
  # Retorna a lista de textos
  return texts


# Usa a função criada para ler os documentos, 
# as consultas e as relevâncias da coleção CISI e 
# armazená-los em variáveis. Por exemplo:

# Lê os documentos da coleção CISI e armazena na variável docs
docs = read_cisi_file("/content/drive/MyDrive/cisi/CISI.ALL")
# Lê as consultas da coleção CISI e armazena na variável queries
queries = read_cisi_file("/content/drive/MyDrive/cisi/CISI.QRY")
# Lê as relevâncias da coleção CISI e armazena na variável relevances
relevances = read_cisi_file("/content/drive/MyDrive/cisi/CISI.REL")

# docs[:2]

# Cria uma função para pré-processar os textos dos documentos e das consultas, 
# removendo pontuação, caracteres especiais, números, stopwords e aplicando stemização usando a 
# biblioteca nltk


# Define uma lista de stopwords em inglês
nltk.download('stopwords')
stopwords = stopwords.words('english') 

# Define um stemizador baseado no algoritmo de Porter
stemmer = nltk.stem.PorterStemmer()

def preprocess(text):
  # Converte o texto para minúsculas
  text = text.lower()
  # Remove pontuação, caracteres especiais e números usando expressões regulares
  text = re.sub(r"[^\w\s]", "", text)
  text = re.sub(r"\d+", "", text)
  # Divide o texto em tokens (palavras) usando espaços em branco como separador
  tokens = text.split()
  # Remove os tokens que são stopwords usando list comprehension 
  tokens = [token for token in tokens if token not in stopwords]
  # Aplica stemização aos tokens usando list comprehension 
  tokens = [stemmer.stem(token) for token in tokens]
  # Retorna a lista de tokens pré-processados 
  return tokens

# Usa a função para pré-processar os documentos e as consultas da 
# coleção CISI e armazená-los em novas variáveis. 

# Pré-processa os documentos da coleção CISI e armazena na variável docs_tokens 
docs_tokens = [preprocess(doc) for doc in docs]
# Pré-processa as consultas da coleção CISI e armazena na variável queries_tokens 
queries_tokens = [preprocess(query) for query in queries]


# Cria um objeto BM25 usando a classe bm25.BM25 da gensim e passa
# os documentos tokenizados como argumento
bm25_obj = bm25.BM25(docs_tokens)

# Calcula média idf um argumento necessario para a funcao get_scores 
media_idf = sum(bm25_obj.idf.values()) / len(bm25_obj.idf.values())


queries_tokens[:3]

# Calcula o escore BM25 para cada documento usando o método get_scores do objeto BM25
scores = [bm25_obj.get_scores(document = q, average_idf = media_idf) for q in queries_tokens]

len(scores)
len(scores[0]) == bm25_obj.corpus_size
bm25_obj.corpus_size
len(queries_tokens) * bm25_obj.corpus_size


# Ordenar os documentos de acordo com o escore BM25 e selecionar os mais relevantes para a consulta
ranked_docs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
top_docs = ranked_docs[:3]


### Implementração do algoritmo BM25 com avaliação na coleção CISI usando python "raiz" :)

In [None]:
import os
from collections import Counter
from math import log

# Load the document text into a dictionary: The code opens the file "cisi.all" 
# and reads the contents into a Python dictionary called doc_text,
# where each document ID is associated with its corresponding text.
# this cauded a problem: "A.I. Mikhailov, and his colleagues"
# I had to change  content.split(".I") to content.split(".I ")
# learned using bing chatGPT

doc_text = {}
with open("/content/drive/MyDrive/cisi/CISI.ALL", "r") as f:
    content = f.read()
    records = content.split(".I ")
    records.pop(0)
    lines = f.readlines()
    for record in records:
      # Divide o registro em linhas
      lines = record.split("\n")
      # guarda id
      id = int(lines[0])
      # print(lines[0])
      # print(int(id))
      # Remove a primeira linha que contém o número do registro
      lines.pop(0)
      # Junta as linhas restantes em um único texto
      text = " ".join(lines)
      print(text)
      # Adiciona o texto à lista de textos
      doc_text[id] = text


# Load the queries into a list: The code opens the file "cisi.qry" and reads the contents
# into a list called queries, where each query is represented as a string.

queries = []
with open("/content/drive/MyDrive/cisi/CISI.QRY", "r") as f:
    content = f.read()
    records = content.split(".I")
    records.pop(0)
    lines = f.readlines()
    for record in records:
      # Divide o registro em linhas
      lines = record.split("\n")
      # guarda id
      id = int(lines[0])
      # print(lines[0])
      # print(int(id))
      # Remove a primeira linha que contém o número do registro
      lines.pop(0)
      # Junta as linhas restantes em um único texto
      text = " ".join(lines)
      print(text)
      # Adiciona o texto à lista de textos
      queries.append(text)


# Load the relevance judgments into a dictionary: The code opens the file "cisi.rel" 
# reads the contents into a Python dictionary called relevance_judgments, 
# each query ID is associated with a list of relevant document IDs.

relevance_judgments = {}
with open("/content/drive/MyDrive/cisi/CISI.REL", "r") as f:
    lines = f.readlines()
    for i in range(0, len(lines)):
        query_id, doc_id, _, _ = lines[i].split()
        if query_id not in relevance_judgments:
            relevance_judgments[query_id] = []
        relevance_judgments[query_id].append(doc_id)


# Define the BM25 ranking function
# the BM25 ranking function: The code defines a function called bm25 that takes 
# as input and calculates the BM25 score for each document in the database. 
# and b parameters are set to default values of 1.2 and 0.75, respectively.

def bm25(query, doc_text, k1=1.2, b=0.75):
    query_terms = query.lower().split()
    scores = Counter()
    for doc_id, doc_text in doc_text.items():
        doc_terms = doc_text.lower().split()
        doc_length = len(doc_terms)
        doc_term_counts = Counter(doc_terms)
        for term in query_terms:
            if term in doc_term_counts:
                idf = log(len(doc_text) / doc_term_counts[term])
                tf = doc_term_counts[term]
                score = idf * ((k1 + 1) * tf) / (k1 * ((1 - b) + b * (doc_length / len(doc_terms))) + tf)
                scores[doc_id] += score
    return scores


# Calculates the bm25 scores for each document for each query.
# stores in a list results with [N queries [N docs]]
results = [bm25(query=q, doc_text=doc_text) for q in queries]

len(results )
len(results[0])

# bm25(query=queries[0], doc_text=doc_text)

# Define the evaluation function
# The code defines a function called evaluate the results of the ranking 
# (as a dictionary of document scores for each query) 
# relevance judgments as input, and calculates the precision, recall, 
# and F1 score of the ranking.

def evaluate(results, relevance_judgments):
    num_queries = len(relevance_judgments)
    precision = 0.0
    recall = 0.0
    f1 = 0.0
    for query_id, docs in relevance_judgments.items():
        retrieved_docs = [doc_id for doc_id, _ in results[query_id]]
        relevant_docs = set(docs)
        retrieved_and_relevant = relevant_docs.intersection(retrieved_docs)
        precision += len(retrieved_and_relevant) / len(retrieved_docs)
        recall += len(retrieved_and_relevant) / len(relevant_docs)
    precision /= num_queries
    recall /= num_queries
    try:
        f1 = 2 * precision * recall / (precision + recall)
    except ZeroDivisionError:
        f1 = 0
    return precision, recall, f1

# Perform the BM25 ranking for each query and evaluate the results
# the BM25 ranking and evaluation: The code uses a loop to iterate over each 
#n the queries list, and applies the bm25 function to calculate the document 
#for the query. The results are stored in a dictionary called results, 
#ach query ID is associated with a list of the top 100 documents ranked by BM25 score. 
#
#, the evaluate function is called to calculate the precision, recall, 
#score of the ranking, and the results are printed to the console.
results = {}
for i, query in enumerate(queries):
    query_id = str(i+1)
    scores = bm25(query, doc_text=doc_text)
    results[query_id] = scores.most_common(100)

precision, recall, f1 = evaluate(results, relevance_judgments)



In [None]:
import os
from collections import Counter
from math import log

# Load the document text into a dictionary: The code opens the file "cisi.all" 
# and reads the contents into a Python dictionary called doc_text,
# where each document ID is associated with its corresponding text.
# this cauded a problem: "A.I. Mikhailov, and his colleagues"
# I had to change  content.split(".I") to content.split(".I ")
# learned using bing chatGPT

doc_text = {}
with open("/content/drive/MyDrive/cisi/CISI.ALL", "r") as f:
    content = f.read()
    records = content.split(".I ")
    records.pop(0)
    lines = f.readlines()
    for record in records:
      # Divide o registro em linhas
      lines = record.split("\n")
      # guarda id
      id = int(lines[0])
      # print(lines[0])
      # print(int(id))
      # Remove a primeira linha que contém o número do registro
      lines.pop(0)
      # Junta as linhas restantes em um único texto
      text = " ".join(lines)
      print(text)
      # Adiciona o texto à lista de textos
      doc_text[id] = text


# Load the queries into a list: The code opens the file "cisi.qry" and reads the contents
# into a list called queries, where each query is represented as a string.

queries = []
with open("/content/drive/MyDrive/cisi/CISI.QRY", "r") as f:
    content = f.read()
    records = content.split(".I")
    records.pop(0)
    lines = f.readlines()
    for record in records:
      # Divide o registro em linhas
      lines = record.split("\n")
      # guarda id
      id = int(lines[0])
      # print(lines[0])
      # print(int(id))
      # Remove a primeira linha que contém o número do registro
      lines.pop(0)
      # Junta as linhas restantes em um único texto
      text = " ".join(lines)
      print(text)
      # Adiciona o texto à lista de textos
      queries.append(text)


# Load the relevance judgments into a dictionary: The code opens the file "cisi.rel" 
# reads the contents into a Python dictionary called relevance_judgments, 
# each query ID is associated with a list of relevant document IDs.

relevance_judgments = {}
with open("/content/drive/MyDrive/cisi/CISI.REL", "r") as f:
    lines = f.readlines()
    for i in range(0, len(lines)):
        query_id, doc_id, _, _ = lines[i].split()
        if query_id not in relevance_judgments:
            relevance_judgments[query_id] = []
        relevance_judgments[query_id].append(str(doc_id))


# Define the BM25 ranking function
# the BM25 ranking function: The code defines a function called bm25 that takes 
# as input and calculates the BM25 score for each document in the database. 
# and b parameters are set to default values of 1.2 and 0.75, respectively.

def bm25(query, doc_text, k1=1.2, b=0.75):
    query_terms = query.lower().split()
    scores = Counter()
    for doc_id, doc_text in doc_text.items():
        doc_terms = doc_text.lower().split()
        doc_length = len(doc_terms)
        doc_term_counts = Counter(doc_terms)
        for term in query_terms:
            if term in doc_term_counts:
                idf = log(len(doc_text) / doc_term_counts[term])
                tf = doc_term_counts[term]
                score = idf * ((k1 + 1) * tf) / (k1 * ((1 - b) + b * (doc_length / len(doc_terms))) + tf)
                scores[doc_id] += score
    return scores


# Calculates the bm25 scores for each document for each query.
# stores in a list results with [N queries [N docs]]
results = [bm25(query=q, doc_text=doc_text) for q in queries]

len(results )
len(results[0])

# bm25(query=queries[0], doc_text=doc_text)

# Define the evaluation function
# The code defines a function called evaluate the results of the ranking 
# (as a dictionary of document scores for each query) 
# relevance judgments as input, and calculates the precision, recall, 
# and F1 score of the ranking.

def evaluate(results, relevance_judgments):
    num_queries = len(relevance_judgments)
    precision = 0.0
    recall = 0.0
    f1 = 0.0
    for query_id, docs in relevance_judgments.items():
        retrieved_docs = [doc_id for doc_id, _ in results[query_id]]
        relevant_docs = set(docs)
        retrieved_and_relevant = relevant_docs.intersection(retrieved_docs)
        precision += len(retrieved_and_relevant) / len(retrieved_docs)
        recall += len(retrieved_and_relevant) / len(relevant_docs)
    precision /= num_queries
    recall /= num_queries
    try:
        f1 = 2 * precision * recall / (precision + recall)
    except ZeroDivisionError:
        f1 = 0
    return precision, recall, f1

# Perform the BM25 ranking for each query and evaluate the results
# the BM25 ranking and evaluation: The code uses a loop to iterate over each 
#n the queries list, and applies the bm25 function to calculate the document 
#for the query. The results are stored in a dictionary called results, 
#ach query ID is associated with a list of the top 100 documents ranked by BM25 score. 
#
#, the evaluate function is called to calculate the precision, recall, 
#score of the ranking, and the results are printed to the console.
results = {}
for i, query in enumerate(queries):
    query_id = str(i+1)
    scores = bm25(query, doc_text=doc_text)
    results[query_id] = scores.most_common(100)


precision, recall, f1 = evaluate(results, relevance_judgments)



In [293]:
import os
from collections import Counter
from math import log

# Load the document text into a dictionary: The code opens the file "cisi.all" 
# and reads the contents into a Python dictionary called doc_text,
# where each document ID is associated with its corresponding text.
# this cauded a problem: "A.I. Mikhailov, and his colleagues"
# I had to change  content.split(".I") to content.split(".I ")
# learned using bing chatGPT

doc_text = {}
with open("/content/drive/MyDrive/cisi/CISI.ALL", "r") as f:
    content = f.read()
    records = content.split(".I ")
    records.pop(0)
    lines = f.readlines()
    for record in records:
      # Divide o registro em linhas
      lines = record.split("\n")
      # guarda id
      id = lines[0].strip()
      # print(lines[0])
      # print(int(id))
      # Remove a primeira linha que contém o número do registro
      lines.pop(0)
      # Junta as linhas restantes em um único texto
      text = " ".join(lines)
      print(text)
      # Adiciona o texto à lista de textos
      doc_text[id] = text


# Load the queries into a list: The code opens the file "cisi.qry" and reads the contents
# into a list called queries, where each query is represented as a string.

queries = []
with open("/content/drive/MyDrive/cisi/CISI.QRY", "r") as f:
    content = f.read()
    records = content.split(".I")
    records.pop(0)
    lines = f.readlines()
    for record in records:
      # Divide o registro em linhas
      lines = record.split("\n")
      # guarda id
      id = lines[0].strip()
      # print(lines[0])
      # print(int(id))
      # Remove a primeira linha que contém o número do registro
      lines.pop(0)
      # Junta as linhas restantes em um único texto
      text = " ".join(lines)
      print(text)
      # Adiciona o texto à lista de textos
      queries.append(text)


# Load the relevance judgments into a dictionary: The code opens the file "cisi.rel" 
# reads the contents into a Python dictionary called relevance_judgments, 
# each query ID is associated with a list of relevant document IDs.

relevance_judgments = {}
with open("/content/drive/MyDrive/cisi/CISI.REL", "r") as f:
    lines = f.readlines()
    for i in range(0, len(lines)):
        query_id, doc_id, _, _ = lines[i].split()
        if query_id not in relevance_judgments:
            relevance_judgments[query_id] = []
        relevance_judgments[query_id].append(doc_id)


# Define the BM25 ranking function
# the BM25 ranking function: The code defines a function called bm25 that takes 
# as input and calculates the BM25 score for each document in the database. 
# and b parameters are set to default values of 1.2 and 0.75, respectively.

def bm25(query, doc_text, k1=1.2, b=0.75):
    query_terms = query.lower().split()
    scores = Counter()
    for doc_id, doc_text in doc_text.items():
        doc_terms = doc_text.lower().split()
        doc_length = len(doc_terms)
        doc_term_counts = Counter(doc_terms)
        for term in query_terms:
            if term in doc_term_counts:
                idf = log(len(doc_text) / doc_term_counts[term])
                tf = doc_term_counts[term]
                score = idf * ((k1 + 1) * tf) / (k1 * ((1 - b) + b * (doc_length / len(doc_terms))) + tf)
                scores[doc_id] += score
    return scores


# Calculates the bm25 scores for each document for each query.
# stores in a list results with [N queries [N docs]]
# results = [bm25(query=q, doc_text=doc_text) for q in queries]
#len(results )
#len(results[0])



# bm25(query=queries[0], doc_text=doc_text)

# Define the evaluation function
# The code defines a function called evaluate the results of the ranking 
# (as a dictionary of document scores for each query) 
# relevance judgments as input, and calculates the precision, recall, 
# and F1 score of the ranking.

def evaluate(results, relevance_judgments):
    num_queries = len(relevance_judgments)
    precision = 0.0
    recall = 0.0
    f1 = 0.0
    for query_id, docs in relevance_judgments.items():
        retrieved_docs = [str(doc_id) for doc_id, _ in results[query_id]]
        relevant_docs = set(docs)
        retrieved_and_relevant = relevant_docs.intersection(retrieved_docs)
        precision += len(retrieved_and_relevant) / len(retrieved_docs)
        recall += len(retrieved_and_relevant) / len(relevant_docs)
    precision /= num_queries
    recall /= num_queries
    try:
        f1 = 2 * precision * recall / (precision + recall)
    except ZeroDivisionError:
        f1 = 0
    return precision, recall, f1

# Perform the BM25 ranking for each query and evaluate the results
# Loop over each queries list, and applies the bm25 function to calculate the document scores 
# for the query. 
# The results are stored in a dictionary called results. Each query ID is associated
# with a list of the top 100 documents ranked by BM25 score. 
# Then the evaluate function is called to calculate the precision, recall, 
# score of the ranking, and the results are printed to the console.

results = {}
for i, query in enumerate(queries):
    query_id = str(i+1)
    scores = bm25(query, doc_text=doc_text)
    results[query_id] = scores.most_common(100)

precision, recall, f1 = evaluate(results, relevance_judgments)

print("Precision: {:.2f}%".format(precision * 100))
print("Recall: {:.2f}%".format(recall * 100))
print("F1: {:.2f}%".format(f1 * 100))   

Precision: 7.66%
Recall: 20.41%
F1: 11.14%


### Solução usando gensin e nltk

In [None]:
Sim, eu posso te mostrar um código em Python que usa o algoritmo BM25 para classificar documentos. Este código é baseado no exemplo do link¹ que eu mencionei antes. Você pode adaptá-lo para os seus dados CISI.

# Importar as bibliotecas necessárias
import nltk
from gensim.summarization import bm25

# Carregar os dados CISI e dividi-los em documentos e consultas
# Aqui eu estou usando um exemplo simples com 3 documentos e 1 consulta
docs = ["O Brasil é um país da América do Sul.",
        "A França é um país da Europa Ocidental.",
        "A China é um país da Ásia Oriental."]
query = "Qual país fica na Europa?"

# Pré-processar os documentos e consultas usando nltk para tokenizar, remover stopwords e aplicar stemming
stopwords = nltk.corpus.stopwords.words('portuguese')
stemmer = nltk.stem.RSLPStemmer()

def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [token for token in tokens if token not in stopwords and token.isalpha()]
    tokens = [stemmer.stem(token) for token in tokens]
    return tokens

docs_tokens = [preprocess(doc) for doc in docs]
query_tokens = preprocess(query)

# Criar um objeto BM25 usando a classe bm25.BM25 da gensim e passar os documentos tokenizados como argumento
bm25_obj = bm25.BM25(docs_tokens)

# Para cada consulta, calcular o escore BM25 para cada documento usando o método get_scores do objeto BM25
scores = bm25_obj.get_scores(query_tokens)

# Ordenar os documentos de acordo com o escore BM25 e selecionar os mais relevantes para a consulta
ranked_docs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
top_docs = ranked_docs[:3]

# Mostrar os resultados
print(f"Consulta: {query}")
print(f"Documentos mais relevantes:")
for doc, score in top_docs:
    print(f"- {doc} (escore: {score:.2f})")


Source: Conversation with Bing, 20/02/2023(1) BM25 | Build your Own NLP Based Search Engine Using BM25 - Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/build-your-own-nlp-based-search-engine-using-bm25/ Accessed 20/02/2023.
(2) How to use gensim BM25 ranking in python - Stack Overflow. https://stackoverflow.com/questions/40966014/how-to-use-gensim-bm25-ranking-in-python Accessed 20/02/2023.
(3) GitHub - nhirakawa/BM25: A Python implementation of the BM25 ranking .... https://github.com/nhirakawa/BM25 Accessed 20/02/2023.

In [None]:
#Here is a Python code that implements a information search retrieval using BM25 
#and evaluates the system on CISI collection. I have added comments to explain the details.

# Importing libraries
import os
import re
import math
from gensim import corpora, models, similarities

# Defining a function to read and tokenize documents from a file
def read_docs(file):
    docs = []
    with open(file, "r") as f:
        for line in f:
            # Each document starts with .I followed by an id number
            if line.startswith(".I"):
                doc = ""
                doc_id = line.strip()
            # Each document ends with .X followed by references
            elif line.startswith(".X"):
                docs.append((doc_id, doc.split()))
            else:
                # Removing punctuation and numbers from the text
                line = re.sub("[^a-zA-Z\s]", "", line)
                doc += line.lower()
    return docs

# Defining a function to read and tokenize queries from a file
def read_queries(file):
    queries = []
    with open(file, "r") as f:
        for line in f:
            # Each query starts with .I followed by an id number
            if line.startswith(".I"):
                query = ""
                query_id = line.strip()
            # Each query ends with .W followed by text
            elif line.startswith(".W"):
                query += f.readline().lower()
                queries.append((query_id, query.split()))
    return queries

# Defining a function to read relevance judgments from a file
def read_relevance(file):
    relevance = {}
    with open(file, "r") as f:
        for line in f:
            # Each line contains a query id, a document id and a relevance score (1 or 0)
            qid, did, rel = line.split()
            # Storing the relevance judgments in a dictionary of sets
            if qid not in relevance:
                relevance[qid] = set()
            if rel == "1":
                relevance[qid].add(did)
    return relevance



# Lê os documentos da coleção CISI e armazena na variável docs
docs = read_cisi_file("/content/drive/MyDrive/cisi/CISI.ALL")
# Lê as consultas da coleção CISI e armazena na variável queries
queries = read_cisi_file("/content/drive/MyDrive/cisi/CISI.QRY")
# Lê as relevâncias da coleção CISI e armazena na variável relevances
relevances = read_cisi_file("/content/drive/MyDrive/cisi/CISI.REL")


# Reading and tokenizing the documents from cisi.all file
docs = read_docs("/content/drive/MyDrive/cisi/CISI.ALL")

# Creating a dictionary of terms from the documents
dictionary = corpora.Dictionary(doc[1] for doc in docs)

# Creating a corpus of bag-of-words vectors from the documents
corpus = [dictionary.doc2bow(doc[1]) for doc in docs]

# Initializing a BM25 model from gensim library using the corpus and dictionary 
bm25_model = bm25.BM25Mode(corpus=corpus, id2word=dictionary)



# Reading and tokenizing the queries from cisi.que file 
queries = read_queries("cisi.que")

# Reading the relevance judgments from cisi.rel file 
relevance = read_relevance("cisi.rel")

# Defining some evaluation metrics 
precision_at_10 = 0 # Precision at rank 10 
average_precision = 0 # Mean average precision 
recall_at_50 = 0 # Recall at rank 50 
num_queries = len(queries) # Number of queries 

# Looping through each query 
for query in queries:
    # Getting the query id and terms 
    qid, qterms = query 
    # Converting the query terms to bag-of-words vector 
    qvec = dictionary.doc2bow(qterms) 
    # Getting the BM25 scores for each document in the corpus using the query vector 
    scores = bm25_model.get_scores(qvec) 
    # Sorting the scores in descending order and getting the top 50 documents ids  
    ranked_docs_ids= sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:50]  
    ranked_docs_ids= [docs[i][0] for i in ranked_docs_ids]  
   
   ### Evaluating precision at rank 10 ###
   relevant_at_10= set(ranked_docs_ids[:10]) & relevance[qid]  
   precision_at_10+= len(relevant_at_10) / 10  

   ### Evaluating mean average precision ###
   num_relevant= 0  
   sum_precision= 0  
   for i, didin enumerate(ranked_docs_ids):  
       if didinrelevance[qid]:  
           num_relevant+= 1  
           sum_precision+= num_relevant / (i + 1)  
   average_precision+= sum_precision / len(relevance[qid])  

   ### Evaluating recall at rank 50 ###
   relevant_at_50= set(ranked_docs_ids) & relevance[qid]

Origem: conversa com o Bing, 16/02/2023(1) How to use gensim BM25 ranking in python - Stack Overflow. https://stackoverflow.com/questions/40966014/how-to-use-gensim-bm25-ranking-in-python Acessado 16/02/2023.
(2) BM25 | Build your Own NLP Based Search Engine Using BM25 - Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/build-your-own-nlp-based-search-engine-using-bm25/ Acessado 16/02/2023.
(3) svm - Implementation of Okapi BM25 in python - Stack Overflow. https://stackoverflow.com/questions/61877065/implementation-of-okapi-bm25-in-python Acessado 16/02/2023.
(4) Glasgow IDOM - CISI collection. http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/ Acessado 16/02/2023.
(5) Average precisions for CISI collection: 1000 documents are added in 100 .... https://www.researchgate.net/figure/Average-precisions-for-CISI-collection-1000-documents-are-added-in-100-groups-of-10_fig2_228961662 Acessado 16/02/2023.
(6) Comparison of normalized error for the CISI collection: 1000 documents .... https://www.researchgate.net/figure/Comparison-of-normalized-error-for-the-CISI-collection-1000-documents-are-added-in-100_fig7_228961662 Acessado 16/02/2023.
(7) How to use gensim BM25 ranking in python - Stack Overflow. https://stackoverflow.com/questions/40966014/how-to-use-gensim-bm25-ranking-in-python Acessado 16/02/2023.
(8) rank-bm25 · PyPI. https://pypi.org/project/rank-bm25/ Acessado 16/02/2023.
(9) BM25 | Build your Own NLP Based Search Engine Using BM25 - Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/build-your-own-nlp-based-search-engine-using-bm25/ Acessado 16/02/2023.

AttributeError: ignored

BM25 is a ranking function that estimates the relevance of documents to a given search query based on the frequency and importance of the query terms in the documents². It is a bag-of-words retrieval function, which means it does not consider the order or proximity of the query terms within the document². It is based on the probabilistic retrieval framework developed by Stephen E. Robertson, Karen Spärck Jones, and others²⁵.

Origem: conversa com o Bing, 16/02/2023(1) Okapi BM25 - Wikipedia. https://en.wikipedia.org/wiki/Okapi_BM25 Acessado 16/02/2023.
(2) How does BM25 work? - Quora. https://www.quora.com/How-does-BM25-work Acessado 16/02/2023.
(3) BM25 | Build your Own NLP Based Search Engine Using BM25 - Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/build-your-own-nlp-based-search-engine-using-bm25/#:~:text=BM25%20is%20a%20simple%20Python%20package%20and%20can,occurrences%20of%20the%20search%20term%20in%20our%20tweet Acessado 16/02/2023.
(4) BM25 | Build your Own NLP Based Search Engine Using BM25 - Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/build-your-own-nlp-based-search-engine-using-bm25/ Acessado 16/02/2023.
(5) Practical BM25 - Part 2: The BM25 Algorithm and its Variables. https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables Acessado 16/02/2023.