# JinaReranker

- Author: [hyeyeoon](https://github.com/hyeyeoon)
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/01-Basic/08-Runnable.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/01-Basic/08-Runnable.ipynb)

## Overview

`Jina Reranker` is a document re-ranking and compression tool that reorders retrieved documents so that the most relevant items come first. It is primarily used in information retrieval and natural language processing (NLP) tasks, where it helps surface critical information from large datasets more quickly and accurately.

---

**Key Features**

- Relevance-based Re-ranking

    Jina Reranker analyzes search results and reorders documents based on relevance scores. This ensures that users can access more relevant information first.

- Multilingual Support

    Jina Reranker supports multilingual models, such as `jina-reranker-v2-base-multilingual`, enabling the processing of data in various languages.

- Document Compression

    It selects only the top N most relevant documents (`top_n`), compressing the search results to reduce noise and optimize performance.

- Integration with LangChain

    Jina Reranker is available in LangChain as the `JinaRerank` document compressor, making it easy to plug into retrieval and natural language processing pipelines.

---

**How It Works**

- Document Retrieval

    The base retriever is used to fetch initial search results.

- Relevance Score Calculation

    Jina Reranker utilizes pre-trained models (e.g., `jina-reranker-v2-base-multilingual`) to calculate relevance scores for each document.

- Document Re-ranking and Compression

    Based on the relevance scores, it selects the top N documents and provides reordered results.
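
The three steps above can be sketched in plain Python. This is only an illustration of the score, sort, and truncate pattern: `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is a stand-in for the pre-trained model that Jina Reranker actually uses.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    docs: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the top_n best."""
    scored = [(doc, score_fn(query, doc)) for doc in docs]
    # Highest score first; the sort is stable, so ties keep retrieval order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)


# Candidates as they might come back from a base retriever
candidates = [
    "FAISS is a library for vector similarity search",
    "Word2Vec maps words to a vector space",
    "Tokenizers split text into tokens",
]
top = rerank("word2vec vector space", candidates, toy_overlap_score, top_n=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Keeping the scorer as a parameter mirrors the design of a reranking pipeline: the retrieval step and the truncation step stay the same, and only the relevance model changes.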


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Jina Reranker](#jina-reranker)
- [Performing Re-ranking with JinaRerank](#performing-re-ranking-with-jinarerank)

### References

- [LangChain Documentation](https://python.langchain.com/docs/how_to/lcel_cheatsheet/)
- [Jina Reranker](https://jina.ai/reranker/)

---

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

**Issuing an API Key for JinaReranker**
- Issue an API key from [Jina Reranker](https://jina.ai/reranker/).
- Add the following to your `.env` file:
    >JINA_API_KEY="YOUR_JINA_API_KEY"

In [22]:
%%capture --no-stderr
!pip install langchain-opentutorial

In [23]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
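        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",  # TextLoader, FAISS wrapper, JinaRerank
        "langchain_openai",
        "faiss-cpu",  # backend required by the FAISS vector store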
    ],
    verbose=False,
    upgrade=False,
)

You can also load the `OPENAI_API_KEY` and `JINA_API_KEY` from the `.env` file.

In [24]:
from dotenv import load_dotenv

load_dotenv(override=True)

True
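
After loading the `.env` file, it can help to confirm that the keys are actually present before any API call is made. The check below is a minimal sketch; `missing_keys` is a hypothetical helper, not part of `langchain-opentutorial` or `python-dotenv`.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


# The two keys this tutorial relies on
missing = missing_keys(["OPENAI_API_KEY", "JINA_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```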

In [25]:
# Set local environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "03-JinaReranker",
    }
)

Environment variables have been set successfully.


## Jina Reranker

- Load data for a simple example and create a retriever. A small helper, `pretty_print_docs`, is defined first to print documents in a readable format.

In [26]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

- A text document is loaded with `TextLoader`.

- The document is split into chunks of up to 500 characters with a 100-character overlap for better processing.

- `FAISS` is used with `OpenAIEmbeddings` to create a retriever that returns the 10 most similar chunks (`k=10`).

- The retriever processes a query to find and display the most relevant documents.


In [27]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Load the document
documents = TextLoader("./data/appendix-keywords.txt").load()

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Split the document into chunks
texts = text_splitter.split_documents(documents)

# Initialize the retriever
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 10}
)

# Define the query
query = "Tell me about Word2Vec."

# Retrieve relevant documents
docs = retriever.invoke(query)

# Print the retrieved documents
pretty_print_docs(docs)

Document 1:

Word2Vec
Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.
Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other.
Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
----------------------------------------------------------------------------------------------------
Document 2:

Embedding
Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors that computers can process and understand.
Example: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing (NLP), Vectorization, Deep Learning
----------------------------------------------------------------------------------------------------
Document 3:

VectorStore
Definition: A VectorStore is a system designed to store data
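
To build intuition for the `chunk_size=500` and `chunk_overlap=100` settings passed to the text splitter above, the sketch below cuts a string into fixed-width windows that share a few characters with their neighbors. This is a deliberately simplified model: `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself splits on separators such as paragraph and line breaks rather than at fixed offsets.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.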

## Performing Re-ranking with JinaRerank

- `JinaRerank` is initialized as a document compressor that prioritizes the most relevant documents.

- Retrieved documents are compressed by keeping only the top 3 (`top_n=3`) by relevance score.

- A `ContextualCompressionRetriever` is created from the `JinaRerank` compressor and the existing retriever.

- The compression retriever processes a query: it fetches candidates with the base retriever, then reranks and trims them.

In [28]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import JinaRerank

# Initialize the JinaRerank compressor
compressor = JinaRerank(model="jina-reranker-v2-base-multilingual", top_n=3)

# Initialize the document compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

# Retrieve and compress relevant documents
compressed_docs = compression_retriever.invoke("Explain Word2Vec.")


In [29]:
# Display the compressed documents in a readable format
pretty_print_docs(compressed_docs)

Document 1:

Word2Vec
Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.
Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other.
Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
----------------------------------------------------------------------------------------------------
Document 2:

Embedding
Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors that computers can process and understand.
Example: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing (NLP), Vectorization, Deep Learning
----------------------------------------------------------------------------------------------------
Document 3:

VectorStore
Definition: A VectorStore is a system designed to store data
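
To see how the reranker ordered the results, it can be useful to print a score next to each document. The sketch below assumes that the compressor stores its score under `metadata["relevance_score"]` on each returned document; that key is an assumption to verify against your installed `langchain_community` version, and the helper falls back to `n/a` when it is absent. `Doc` is a hypothetical stand-in used only so the example runs without an API call, and the sample score is made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in with the same two attributes as a LangChain Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def summarize_scores(docs: List[Any], preview_chars: int = 40) -> List[str]:
    """Build one line per document: rank, score (if present), and a text preview."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        score = doc.metadata.get("relevance_score")  # assumed key; may be absent
        score_text = f"{score:.3f}" if isinstance(score, (int, float)) else "n/a"
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{rank}. score={score_text} | {preview}")
    return lines


# Made-up sample data for illustration only
sample = [Doc("Word2Vec\nDefinition: ...", {"relevance_score": 0.9}), Doc("Embedding")]
print("\n".join(summarize_scores(sample)))
```

In the notebook, the same helper could be called as `summarize_scores(compressed_docs)`.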