# 1. Introduction

This notebook demonstrates how to build an advanced RAG (Retrieval-Augmented Generation) pipeline that explains concepts from Kaggle competition solution write-ups.

We are going to use the following public dataset: https://www.kaggle.com/datasets/thedrcat/kaggle-winning-solutions-methods

Here are the components of the pipeline we are going to build:

- [Gemma](https://www.kaggle.com/models/google/gemma) as the LLM
- [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) for the embeddings
- [FAISS](https://github.com/facebookresearch/faiss) as the vector database for the embeddings
- [LangChain](https://www.langchain.com/) for orchestration


# 2. Installation and imports

## 2.1 Install packages

In [1]:
!pip install -q -U accelerate bitsandbytes langchain langchain-community sentence-transformers ragatouille faiss-gpu rank_bm25
# ! pip install -q -U beautifulsoup4 # Install beautifulsoup4 if you are running the notebook not in Kaggle
!pip install -q -U keras-nlp
!pip install -q -U "keras>3"  # Quote the requirement so the shell does not treat '>' as a redirect

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cubinlinker, which is not installed.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires ptxcompiler, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
keras-cv 0.8.2 requires keras-core, which is not installed.
keras-nlp 0.8.1 requires keras-core, which is not installed.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.8 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 11.0.0 which is incompatible.

## 2.2 Imports

In [2]:
import os

# The Keras backend must be selected before `keras` is imported.
os.environ["KERAS_BACKEND"] = "jax"  # Or "torch" or "tensorflow".
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"  # Avoid memory fragmentation on JAX backend.

import keras
import keras_nlp
import pandas as pd

from bs4 import BeautifulSoup
from typing import Optional, List, Tuple
from IPython.display import display, Markdown

from transformers import AutoTokenizer
from ragatouille import RAGPretrainedModel
from langchain.docstore.document import Document
from langchain.prompts.prompt import PromptTemplate
from langchain_core.runnables import ConfigurableField
from langchain_community.vectorstores import FAISS, Chroma
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy

2024-12-25 06:56:05.367779: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-25 06:56:05.367890: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-25 06:56:05.536994: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# 3. Prepare the data
## 3.1 Preprocessing

In [3]:
data = pd.read_csv('/kaggle/input/kaggle-winning-solutions-methods/kaggle_winning_solutions_methods.csv')
data.head()

Unnamed: 0,link,place,competition_name,prize,team,kind,metric,year,nm,writeup,num_tokens,methods,cleaned_methods
0,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Replace augmentation
1,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Finger tree rotate
2,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Data Augmentation
3,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Onecycle scheduler
4,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Flip pose


Let's look at an example of a write-up:

In [4]:
data['writeup'][42]

'<p>Here is a quick overview of the 5th-place solution.</p>\n<ol>\n<li><p><strong>we applied various augmentations like flip, concatenation, etc</strong><br>\n1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -&gt; 0.78)</p></li>\n<li><p><strong>the model is only a transformer model based on the public kernels</strong><br>\n2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78-&gt;0.8) in public LB.<br>\n2.1.1. 3 layers of transformer with the embedding size 480.</p></li>\n<li><p><strong>Preprocessing by mean and std of single sign sequence</strong><br>\n3.1. the preprocessing does affect the final performance. <br>\n3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.</p></li>\n<li><p><strong>Feature engineering like distances between points</strong><br>\n4.1. we selected and used around 106 p

The write-ups contain HTML tags and links that are not relevant to our knowledge base, so we'll use BeautifulSoup to extract the text of each write-up and join the pieces with newlines.

In [5]:
%%time

def clean_html(html_content):
    """Function to clean up HTML tags in each writeup"""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Use '\n' as a separator to preserve the structure of the various parts
    text = soup.get_text(separator='\n', strip=True)
    return text

data['writeup'] = data['writeup'].apply(clean_html) # This might take a while

CPU times: user 25.5 s, sys: 78.8 ms, total: 25.6 s
Wall time: 25.6 s


**Here is the result:**

In [6]:
print(data['writeup'][42])

Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we selected and used around 106 points (as the public notebook by Heck).
4.2. distances withinpoints of hands/nose/eyes/… are calculated.
some methods to prevent overfitting like awp, random mask of frames, em

**This looks good now!**

To build our knowledge base, which will serve as the context for the LLM, we concatenate the relevant fields: the name of the competition, the place achieved by the team, the methods used, and the solution write-up itself.

Other columns could also help answer the user's query, but let's keep it simple for now.


In [7]:
data['LLM_context'] = (
    "Competition Name: " + data['competition_name'] +
    ",\nPlace: " + data['place'].astype(str) +
    ",\nMethods Used: " + data['methods'] +
    ",\nSolution: " + data['writeup']
)

print(data['LLM_context'][42])

Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we select

In [8]:
data = data.drop("writeup", axis=1) # We remove 'writeup' column as it is already in LLM_context

## 3.2 Loading data

We'll now use LangChain's [DataFrameLoader](https://python.langchain.com/docs/integrations/document_loaders/pandas_dataframe) to load the data as a list of LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects.

The **`Document`** class in **LangChain** serves as a fundamental building block for storing text and associated metadata. Let's explore its key features:

1. **Purpose**: The `Document` class is designed to hold a piece of text along with relevant metadata. You can think of it as a container for textual content.

2. **Attributes**:
    - **`page_content`**: This attribute stores the actual text content of the document.
    - **`metadata` (Optional)**: You can attach arbitrary metadata to the document. For example, this could include information about the source of the content or relationships to other documents.

For more detailed information, you can refer to the [official LangChain documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).
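
To make the split between `page_content` and `metadata` concrete, here is a minimal sketch that uses a stand-in dataclass rather than LangChain itself. `SimpleDocument` and `rows_to_documents` are hypothetical names introduced for illustration only. The sketch mirrors the idea described above: one column becomes the text and the remaining columns become metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: a text plus arbitrary metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def rows_to_documents(rows: List[Dict[str, Any]], page_content_column: str) -> List[SimpleDocument]:
    """Turn each row into a document: one column is the text, the rest is metadata."""
    documents = []
    for row in rows:
        metadata = {key: value for key, value in row.items() if key != page_content_column}
        documents.append(SimpleDocument(page_content=row[page_content_column], metadata=metadata))
    return documents


# Toy rows standing in for the DataFrame (values are illustrative)
rows = [
    {"LLM_context": "Competition Name: Example,\nPlace: 5", "place": 5, "year": 2023},
    {"LLM_context": "Competition Name: Example,\nPlace: 2", "place": 2, "year": 2023},
]
example_docs = rows_to_documents(rows, page_content_column="LLM_context")
print(example_docs[0].page_content)
print(example_docs[0].metadata)
```

Keeping the non-text columns as metadata means they travel with every chunk and can be shown alongside the retrieved text.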



In [9]:
loader = DataFrameLoader(data, page_content_column="LLM_context")
docs = loader.load()
docs_subset = docs[:1500] # Part of the data is used to reduce execution time.

In [10]:
print("-----------PAGE CONTENT-----------")
print(docs_subset[42].page_content)
print("\n\n-----------METADATA-----------\n")
print(docs_subset[42].metadata)

-----------PAGE CONTENT-----------
Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like dist


# 4. Chunking

To create relevant answer snippets for the LLM, we break down the knowledge base documents into smaller pieces. These chunks should capture specific ideas, not be too short (cutting off the thought) or too long (making it hard to find the main point).

We use recursive chunking to achieve this. The splitter works through a list of separators (e.g. `["\n\n", "\n", ".", ""]`) from coarsest to finest: it first splits on the coarsest separator (double line breaks) and only falls back to the next one (single line breaks, then sentence ends) for pieces that still exceed the chunk size. This keeps each chunk within the size limit while breaking the text at natural boundaries where possible.
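
The idea can be sketched in a few lines of plain Python. This is a simplified toy, not LangChain's implementation: it measures length in characters instead of tokens, adds no overlap between chunks, and `recursive_split` and `hard_cut` are hypothetical helper names.

```python
from typing import List


def hard_cut(text: str, max_len: int) -> List[str]:
    """Last resort: cut the text every max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def recursive_split(text: str, separators: List[str], max_len: int) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones only when needed."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        return hard_cut(text, max_len)

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        # Try to grow the current chunk with the next piece
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators
            current = ""
            chunks.extend(recursive_split(piece, finer, max_len))
    if current:
        chunks.append(current)
    return [chunk for chunk in chunks if chunk.strip()]


sample = "First idea.\n\nSecond idea is a bit longer.\nIt has two lines.\n\nThird."
toy_chunks = recursive_split(sample, ["\n\n", "\n", ".", ""], max_len=30)
for chunk in toy_chunks:
    print(repr(chunk))
```

In this example only the middle paragraph exceeds 30 characters, so it is the only one split on the finer `"\n"` separator.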

In [11]:
EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"
CHUNK_SIZE = 512 # Maximum chunk size in tokens, matching the 512-token limit of the embedding model

In [12]:
%%time

def split_documents(
    chunk_size: int,
    knowledge_base: List[Document],
    tokenizer_name: Optional[str] = EMBEDDING_MODEL_NAME,
) -> List[Document]:
    """
    Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.
    """
    
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicate chunks while preserving their order
    seen_texts = set()
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in seen_texts:
            seen_texts.add(doc.page_content)
            docs_processed_unique.append(doc)

    return docs_processed_unique

chunked_docs = split_documents(
    CHUNK_SIZE,  
    docs_subset,
    tokenizer_name=EMBEDDING_MODEL_NAME,
)



tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (956 > 512). Running this sequence through the model will result in indexing errors


CPU times: user 25.9 s, sys: 20.2 ms, total: 25.9 s
Wall time: 26.5 s


**If the dataset is too large, chunking all the documents can take a long time. To speed things up, consider working with a representative subset of the data.**


# 5. Embeddings and retriever
## 5.1 Embeddings

Now that the documents are correctly sized, we're ready to start building a database that includes their embeddings.

To create embeddings for document segments, we'll be using LangChain's [HuggingFaceEmbeddings](https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub) in conjunction with the [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) embeddings model. A broader selection of text embedding models can be found on the Hugging Face Hub, where the most effective models are highlighted in the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

In [13]:
%%time

embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=True,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # set True for cosine similarity
)



modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/94.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/777 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

CPU times: user 1.1 s, sys: 1.09 s, total: 2.19 s
Wall time: 4.03 s


## 5.2 Fusion retrieval or hybrid search

This concept, though not entirely new, involves integrating the strengths of two distinct search methods: traditional keyword-based search, which employs sparse retrieval algorithms such as [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) or the search industry standard [BM25](https://en.wikipedia.org/wiki/Okapi_BM25), and contemporary semantic or vector search.

The challenge lies in effectively merging the results obtained from these different similarity scoring methods. This issue is typically addressed using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) (RRF) algorithm, which re-ranks the retrieved results to produce the final output.
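
As a rough illustration of how RRF merges rankings, the sketch below scores each document as the sum of `weight / (k + rank)` over the rankings in which it appears, with `k = 60` as suggested in the RRF paper. The function name, the document ids, and the per-ranking weights are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    weights: Optional[Sequence[float]] = None,
    k: int = 60,
) -> List[str]:
    """Fuse several ranked lists of document ids into a single ranking."""
    if weights is None:
        weights = [1.0] * len(rankings)
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document ranked near the top of any list gets a larger contribution
            scores[doc_id] += weight / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a keyword retriever and a vector retriever
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking], weights=[0.5, 0.5])
print(fused)
```

Because RRF only looks at ranks, it never has to compare a BM25 score with a cosine similarity, which is what makes it convenient for hybrid search.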

In LangChain this is implemented by the [Ensemble Retriever class](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble), which combines a list of retrievers that you define (for example, a FAISS vector index and a BM25-based retriever) and uses RRF to rerank their results.


As the vector database, we'll use [FAISS](https://github.com/facebookresearch/faiss), a library developed by Facebook AI. FAISS specializes in efficient similarity search and clustering of dense vectors, which suits our needs perfectly. It is among the most widely used libraries for Nearest Neighbor (NN) search in large datasets.
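
Conceptually, the search that a vector index speeds up is a nearest-neighbor lookup like the brute-force sketch below. It is a pure-Python illustration with made-up 2-dimensional vectors, not FAISS code; `normalize` and `top_k` are hypothetical helpers, and vectors are assumed to be non-zero. Once vectors are normalized to unit length, their dot product equals their cosine similarity.

```python
import math
from typing import List, Tuple


def normalize(vector: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(query: List[float], vectors: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k vectors closest to the query."""
    unit_query = normalize(query)
    scored = []
    for index, vector in enumerate(vectors):
        unit_vector = normalize(vector)
        similarity = sum(a * b for a, b in zip(unit_query, unit_vector))
        scored.append((index, similarity))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up embeddings for three chunks and one query
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [2.0, 0.1]
neighbors = top_k(query_vector, chunk_vectors, k=2)
print(neighbors)
```

This exhaustive scan costs one comparison per stored vector, which is why a dedicated index becomes useful as the knowledge base grows.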



In [14]:
num_docs = 5 # Default number of documents to retrieve

bm25_retriever = BM25Retriever.from_documents(
    chunked_docs
    ).configurable_fields(
    k=ConfigurableField(
        id="search_kwargs_bm25",
        name="k",
        description="The search kwargs to use",
    )
)

faiss_vectorstore = FAISS.from_documents(
    chunked_docs, embedding_model, distance_strategy=DistanceStrategy.COSINE
)

faiss_retriever = faiss_vectorstore.as_retriever(
    search_kwargs={"k": num_docs}
    ).configurable_fields(
    search_kwargs=ConfigurableField(
        id="search_kwargs_faiss",
        name="Search Kwargs",
        description="The search kwargs to use",
    )
)

# initialize the ensemble retriever
vector_database = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5] # You can adjust the weight of each retriever in the EnsembleRetriever
)

Row 42 serves as the basis for generating questions for the model.

In [15]:
print(data.iloc[42, :])

link                https://www.kaggle.com/c/asl-signs/discussion/...
place                                                               5
competition_name          Google - Isolated Sign Language Recognition
prize                                                        $100,000
team                                                            1,165
kind                                                         Research
metric                                        PostProcessorKernelDesc
year                                                             2023
nm                                                             406491
num_tokens                                                        473
methods             ['Augmentation', 'Transformer model', 'Preproc...
cleaned_methods                                       Post-processing
LLM_context         Competition Name: Google - Isolated Sign Langu...
Name: 42, dtype: object


In [16]:
print(data['LLM_context'][42])

Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we select

Below are several questions that can be derived from the above solution description (generated by an LLM):

- "What specific augmentations were applied to improve the cross-validation score, and how did each contribute to the increase?"
- "Why was a transformer model chosen for this solution, and how did public kernels influence its development?"
- "Can you detail the impact of increasing the model's parameters on its performance on the public leaderboard?"
- "Describe the architecture of the 3-layer transformer model, specifically focusing on the choice of embedding size."
- "How does preprocessing with mean and standard deviation of single sign sequences enhance model performance?"
- "What process did you use to determine that using mean and std of single sign sequences yields better cross-validation scores?"
- "In terms of feature engineering, why were distances between points chosen as a feature, and how were they calculated?"
- "How did the selection of 106 points influence the model's ability to understand and process the data?"
- "What methods were implemented to prevent overfitting, and can you explain how each method contributed to model robustness?"
- "Reflecting on your teamwork, how did your teammates contribute to the development and success of the solution?"

You can use them as inspiration or rephrase them before asking Gemma the question. I'll choose one to test the model.

Let's run a simple query against our retriever!

In [17]:
user_query = """
I want to understand the 5th-place solution in the 'Google - Isolated Sign Language Recognition' competition. 
What overfitting prevention techniques were used, and how did they ensure model robustness?
"""
config = {"configurable": {"search_kwargs_faiss": {"k": 5}, "search_kwargs_bm25": 5}}
retrieved_docs = vector_database.invoke(user_query, config=config)
print("----------------------Top document content----------------------")
print(retrieved_docs[0].page_content)
print("----------------------Top document metadata----------------------")
print(retrieved_docs[0].metadata)

----------------------Top document content----------------------
Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.

# 6. Reranking 

A practical strategy for RAG involves fetching a larger number of documents initially than the final count you aim for, followed by employing a stronger retrieval model to rerank these results. This process narrows down the selection to only the best top_k documents.

To implement this, we will use [ColBERTv2](https://arxiv.org/abs/2112.01488), which is conveniently accessible through the [RAGatouille library](https://github.com/bclavie/RAGatouille).
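
ColBERT-style models score a document with "late interaction": each query token embedding is matched with its most similar document token embedding, and these maxima are summed. The sketch below illustrates that scoring rule and the retrieve-then-rerank step on hand-written 2-dimensional token vectors. It does not call RAGatouille, and `maxsim_score`, `rerank_top_k`, and the toy vectors are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best match among the document tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank_top_k(
    query_tokens: List[Vector], candidates: Dict[str, List[Vector]], k: int
) -> List[Tuple[str, float]]:
    """Score every retrieved candidate and keep only the k best."""
    scored = [(name, maxsim_score(query_tokens, tokens)) for name, tokens in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy token embeddings: a two-token query and three retrieved candidates
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "A": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "B": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query token
    "C": [[0.6, 0.6]],              # partial match for both
}
reranked = rerank_top_k(query_tokens, candidates, k=2)
print(reranked)
```

Scoring token by token is more expensive than comparing one vector per document, which is why it is applied only to the small candidate set returned by the first-stage retriever.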


In [18]:
reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

artifact.metadata:   0%|          | 0.00/1.63k [00:00<?, ?B/s]



config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/405 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [19]:
page_contents = [doc.page_content for doc in retrieved_docs]  # keep only the text
relevant_docs = reranker.rerank(user_query, page_contents, k=5)
relevant_docs = [doc["content"] for doc in relevant_docs]

100%|██████████| 1/1 [00:00<00:00,  6.34it/s]


In [20]:
print(relevant_docs[0])

Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we select

# 7. Model building

In [21]:
%%time
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_instruct_2b_en")

Attaching 'config.json' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'task.json' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'preprocessor.json' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'assets/tokenizer/vocabulary.spm' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...


CPU times: user 10.6 s, sys: 15.3 s, total: 25.9 s
Wall time: 57.6 s


normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


## 7.1 Testing Gemma model directly

In [22]:
%%time
display(Markdown(gemma_lm.generate("Hi, what can you tell me about Kaggle competitions?", max_length=256)))

I0000 00:00:1735109953.312099      34 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
W0000 00:00:1735109953.370958      34 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update
W0000 00:00:1735109953.676943      34 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


Hi, what can you tell me about Kaggle competitions?

**What are Kaggle competitions?**

Kaggle competitions are a platform where data scientists and machine learning engineers can participate in a wide range of data science and machine learning challenges. These competitions offer a unique opportunity to learn from experts, solve real-world problems, and potentially win prizes.

**Key features of Kaggle competitions:**

* **Real-world datasets:** Competitions typically use real-world datasets that are relevant to various industries and domains.
* **Multiple data modalities:** Competitions allow participants to submit solutions for various data modalities, including images, text, and time series.
* **Various challenge levels:** Competitions offer different challenge levels to cater to different skill sets and experience levels.
* **Community engagement:** Kaggle provides a vibrant community where participants can interact, share knowledge, and collaborate on solutions.
* **Prizes and recognition:** Winners of Kaggle competitions receive significant prizes and recognition, including cash, prizes, and public acclaim.

**Benefits of participating in Kaggle competitions:**

* **Learn from industry experts:** Solve real-world problems and gain insights from data science and machine learning experts.
* **Boost your resume:** Winning a Kaggle competition can significantly enhance your

CPU times: user 35.2 s, sys: 400 ms, total: 35.6 s
Wall time: 33.6 s


## 7.2 Prompt

The RAG prompt template below combines the retrieved context with the user's question and ends with an `ANSWER:` marker, which makes it easy to separate the generated answer from the prompt afterwards.

In [23]:
prompt_template = """
Based on your extensive knowledge and the following detailed context, 
please provide a comprehensive answer to explain concepts from Kaggle competition solution write-ups:

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

RAG_PROMPT_TEMPLATE = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)
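
To see how the placeholders are filled and why the `ANSWER:` marker is useful, here is a small standalone sketch using plain `str.format`. It assumes, as the output in section 7.1 suggests, that the generated text starts with the prompt itself. `build_prompt`, `extract_answer`, and the fake completion are hypothetical and only serve the illustration.

```python
TOY_TEMPLATE = """
CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""


def build_prompt(context: str, question: str) -> str:
    """Fill the two placeholders of the template."""
    return TOY_TEMPLATE.format(context=context, question=question)


def extract_answer(generated_text: str) -> str:
    """Keep only what follows the first 'ANSWER:' marker."""
    _, marker, answer = generated_text.partition("ANSWER:")
    return answer.strip() if marker else ""


toy_prompt = build_prompt(
    context="Place: 5,\nSolution: random mask of frames was used against overfitting.",
    question="Which overfitting prevention technique was used?",
)
# Simulate a model that echoes the prompt and then appends its completion
fake_generation = toy_prompt + "Random masking of frames."
toy_answer = extract_answer(fake_generation)
print(toy_answer)
```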


# 8. Creating the RAG pipeline

In [24]:
def answer_with_rag(
    question: str,
    llm,
    knowledge_index: EnsembleRetriever,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 10,
    num_docs_final: int = 5,
) -> Tuple[str, List[Document]]:
    # Gather documents with retriever
    print("=> Retrieving documents...")
    config = {"configurable": {"search_kwargs_faiss": {"k": num_retrieved_docs}, "search_kwargs_bm25": num_retrieved_docs}}
    relevant_docs = knowledge_index.invoke(question, config=config)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text
    
    # Optionally rerank results
    if reranker:
        print("=> Reranking documents...")
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]
        
    relevant_docs = relevant_docs[:num_docs_final] # Keeping only num_docs_final documents

    # Build the final prompt
    context = relevant_docs[0] # We select only the top relevant document
    
    final_prompt = RAG_PROMPT_TEMPLATE.format(
        context = context,  
        question=question
    )

    # Generate an answer
    print("=> Generating answer...")
    answer = llm.generate(final_prompt, max_length=1024)

    return answer, relevant_docs

In [25]:
%%time
question = """I want to understand the 5th-place solution in the 'Google - Isolated Sign Language Recognition' competition. 
What overfitting prevention techniques were used, and how did they ensure model robustness?
"""
answer, relevant_docs = answer_with_rag(question, gemma_lm, vector_database, reranker)

=> Retrieving documents...
=> Reranking documents...


100%|██████████| 1/1 [00:00<00:00,  4.11it/s]


=> Generating answer...


W0000 00:00:1735109997.093674      34 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update
W0000 00:00:1735109998.085068      34 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


CPU times: user 45.2 s, sys: 1.3 s, total: 46.5 s
Wall time: 48.8 s


In [26]:
def get_gemma_answer(generated_answer: str) -> str:
    """Function to get Gemma answer"""
    split = generated_answer.split("ANSWER:")
    return split[1] if len(split) > 1 else "No answer has been generated"

display(Markdown("### Gemma Answer"))
display(Markdown(get_gemma_answer(answer)))
display(Markdown("### Source docs"))
for i, doc in enumerate(relevant_docs):
    display(Markdown(f"**Document {i}------------------------------------------------------------**"))
    display(Markdown(doc))

### Gemma Answer


**Overfitting prevention techniques used in the 5th-place solution:**

* **Random masking of frames:** This technique randomly selects a subset of frames from the training data and trains the model on this subset. This helps to prevent the model from overfitting to the specific training data and improves its generalizability.
* **Early stopping:** This technique stops training the model when it reaches a certain number of epochs or when the validation loss starts to increase. This helps to prevent the model from overfitting to the training data and improves its generalizability.
* **Data augmentation:** This technique is used to increase the size of the training dataset and to introduce diversity into the training data. This helps to prevent the model from overfitting to the training data and improves its generalizability.
* **Mean and standard deviation of the single sign sequence:** This technique is used to pre-process the training data and to improve the performance of the model.

**How these techniques ensured model robustness:**

* **Random masking of frames:** This technique helped to prevent the model from overfitting to the specific training data by exposing it to a wide range of images.
* **Early stopping:** This technique helped to prevent the model from overfitting to the training data by stopping training when it reached a certain number of epochs.
* **Data augmentation:** This technique helped to increase the size of the training dataset and to introduce diversity into the training data. This helped to prevent the model from overfitting to the training data and improved its generalizability.
* **Mean and standard deviation of the single sign sequence:** This technique helped to improve the performance of the model by reducing overfitting and by introducing diversity into the training data.

### Source docs

**Document 0------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we selected and used around 106 points (as the public notebook by Heck).
4.2. distances within points of hands/nose/eyes/… are calculated.
some methods to prevent overfitting like awp, random mask of frames, ema, etc …
many thanks to my teammates
@qiaoshiji
@zengzhaoyang
The source code for training models can be found here :
https://github.com/zhouyuanzhe/kaggleasl5thplacesolution

**Document 1------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 8,
Methods Used: ['Transformer models', 'FFN encoder', 'Cosine schedule', 'Dropout', 'Label smoothing', 'Sequence cutout augmentation', 'Mirror left augmentation', 'Random rotate augmentation', 'Linear interpolation', 'Min-max normalization', 'Mean/std normalization', 'Time shift delta features', 'Angle features', 'Point to point distances', 'Tflite conversion', 'Speed up with model.half().float()', 'Normalizing points across the whole sequence', 'Mixup (tried but did not work)', 'CNNs with mixup (tried but did not work)'],
Solution: Here is a quick overview of the 8th place solution.
3 transformers models, 2 layers each (384 hidden, 512 hidden ffn), with an ffn encoder (512->384), trained from scratch. LR 8e-4 with cosine schedule trained for ~300 epochs, dropout 0.1, batch size 1024, label smoothing 0.1. Using hands, lips and pose (above waist only). On one transformer all pose and a subset of lips were used for diversity.
Augmentations
most important was sequence cutout. On each sample, and each body part (left hand, right hand, lips, pose) with a 0.4 proba convert to nan 5 random slices of 0.15 x SequenceLength. It was hard to overfit with this in.
mirror left
random rotate.
Preprocessing
Linear interpolation of longer sequences to max length of 96.
Normalise each body part, using min max - I found this better than mean/std. In one model I used mean/std for diversity.
Create time shift delta features on a subset of points, using time shifts of
[1, 2, 3, 4, 6, 8, 12, 16]

**Document 2------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 6,
Methods Used: ['MLP', 'Encoder', 'Transformer', 'Convolutional Neural Network (CNN)', 'Data Augmentation', 'Cross Entropy Loss', 'Weight Decay', 'Mean Teacher', 'Knowledge Distillation', 'Ensemble Learning', 'Stratified K-fold', 'Baseline Model', 'Deberta', 'Max Pooling', 'Normalization', 'Interpolation', 'Manifold Mixup', 'Face CutMix', 'Outlier Sample Mining (OUSM)', 'Model Soup', 'Data Relabeling', 'Data Truncation', 'Mish Activation Function'],
Solution: Thanks to both, the organizers of this competition who offered a fun yet challenging problem as well as all of the other competitors - well done to everyone who worked hard for small incremental increases.
Although I am the one posting the topic, this is the result of a great team effort, so big shoutout to
@christofhenkel
.
Brief Summary
Our solution is a 2 model ensemble of a MLP-encoder-frame-transformer model. We pushed our transformer models close to the limit and implemented a lot of tricks to climb up to 6th place.
I have 1403 hours of experiment monitoring time in April (that’s 48h per day :)).
Update :
Code is available here :
https://github.com/TheoViel/kaggle_islr
Detailed Summary
Preprocessing & Model
Preprocessing
Remove frames without fingers
Stride the sequence (use 1 every n frames) such that the sequence size is
<= max_len
. We used
max_len=25
and
80
in the final ensemble

**Document 3------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 26,
Methods Used: ['Mixup', 'Mirroring', 'LLaMa-inspired architecture', 'RMSNorm normalization', 'Lion optimizer', 'Cosine decay learning rate', 'Batch size 128', 'Dropout 0.1', 'Exponential moving average of weights'],
Solution: Github with all the code used
Summary
The most important part of the solution is the data utilization. Major improvements were from keypoints choice and mixup. External data does not help because it is from a very different distribution. Given data amount does not benefit larger models so ensembles of small models is the way to utilize given constraints to the fullest.
Most augmentations are not helpful, because they prevent model from learning the true data distribution. So only used mirroring and mixup (0.5).
Inputs to the model
All models are trained to support sequences of up to 512 frames.
Preprocessing
Only 2d coordinates are used as 3rd dimension leads to unstable training.
To normalize inputs all keypoints are shifted so that head is located at the origin.
Scaling did not provide any benefit so not used.
All nans are replaced with 0 after normalization.
Chosen keypoints
All (21) hand keypoints
26 face keypoints
17 pose keypoints
Architecture
LLaMa-inspired architecture. Most notable improvement comes from much better normalization RMSNorm.
For all models head dimensions are set to 64
Single model (Private/Public LB: 0.8543689/0.7702471)
6 heads 5 layers 9.2M parameters
Ensemble of 3 models (Private/Public LB: 0.8584568/0.7725324)
2 heads 6 layers 1.7M parameters per model
Larger models could be fit into file size limit, but it would time out during submission.
Augmentations

**Document 4------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 11,
Methods Used: ['Ensemble', 'Strong augmentation', 'Manual model conversion from pytorch to tensorflow', 'CLIP transformer architecture', 'Decrease parameter size', 'Motion features', 'Longer epoch'],
Solution: Thank you to the organizer and Kaggle for hosting this interesting challenge.
Especially I enjoyed this strict inference time restriction. It keeps model size reasonable and requires us for some practical technique.
TL;DR
Ensemble 5 transformer models
Strong augmentation
Manual model conversion from pytroch to tensorflow
Code is available here ->
https://github.com/bamps53/kaggle-asl-11th-place-solution
Overview
I started from
@hengck23
‘s
great discussion
and
notebook
. Thanks for sharing a lot of useful tricks as always!
The changes I made are following;
Change model architecture to CLIP transformer in HuggingFace
Decrease parameter size to maximize latency within the range of same accuracy
Some strong augmentations
Horizontal flip(p=0.5)
Random 3d rotation(p=1, -45~45)
Random scale(p=1, 0.5~1.5)
Random shift(p=1, 0.7~1.3)
Random mask frames(p=1, mask_ratio=0.5)
Random resize (p=1, 0.5~1.5)
Add motion features
current - prev
next - current
Velocity
Longer epoch, 250 for 5 fold and 300 for all data
For the details, please refer to the code.(planning to upload)
Model conversion
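
A note on the `get_gemma_answer` helper used above: it splits the generated text on every `ANSWER:` occurrence and keeps only the second piece, so an answer that itself contains `ANSWER:` would be cut short. A slightly more defensive variant, sketched here as an assumption rather than the notebook's method, keeps everything after the first marker. The name `extract_answer` and the `marker` parameter are hypothetical.

```python
def extract_answer(generated_text: str, marker: str = "ANSWER:") -> str:
    """Return everything after the first `marker`, without surrounding whitespace.

    Falls back to a fixed message when the marker is missing or nothing follows it.
    """
    # partition() splits on the first occurrence only, so later markers stay in the answer
    _, found, tail = generated_text.partition(marker)
    answer = tail.strip()
    if not found or not answer:
        return "No answer has been generated"
    return answer


# Hypothetical usage on a toy string shaped like "prompt + completion"
print(extract_answer("QUESTION: What is RAG?\nANSWER: Retrieval Augmented Generation."))
```

This assumes, as the original helper does, that the prompt ends with an `ANSWER:` marker and that the model returns the prompt followed by its completion.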

**Let's ask another question**

In [27]:
%%time
question = """What can you tell me about the 'RSNA Screening Mammography Breast Cancer Detection' competition ?
"""
answer, relevant_docs = answer_with_rag(question, gemma_lm, vector_database, reranker)

display(Markdown("### Gemma Answer"))
display(Markdown(get_gemma_answer(answer)))
display(Markdown("### Source docs"))
for i, doc in enumerate(relevant_docs):
    display(Markdown(f"**Document {i}------------------------------------------------------------**"))
    display(Markdown(doc))

=> Retrieving documents...
=> Reranking documents...


100%|██████████| 1/1 [00:00<00:00,  5.46it/s]


=> Generating answer...


### Gemma Answer


Sure, here's a summary of the RSNA Screening Mammography Breast Cancer Detection competition:

- The competition is a Kaggle competition focused on breast cancer detection using medical images.
- The competition received 10,000 chest X-ray images of patients with breast cancer.
- The goal of the competition is to develop a machine learning model that can accurately detect breast cancer in the X-ray images.
- The competition used a variety of machine learning methods, including YOLOX, classification models, and ensemble methods.
- The winning model achieved a validation accuracy of 92.2%, which placed the team in third place in the competition.

### Source docs

**Document 0------------------------------------------------------------**

Competition Name: RSNA Screening Mammography Breast Cancer Detection,
Place: 3,
Methods Used: ['YOLOX', 'Classification models', 'Average weighting fusion', 'External data', 'Data augmentation', 'CNN', 'EfficientNet', 'Convnext', 'Multi-view model', 'LSTM', 'Ensemble'],
Solution: Introduction
Thank you to all the participants for your hard work in the competition.We are honored to have achieved a good result, coming in third place in this competition. We also want to express our deepest gratitude to the organizers for putting together such a fantastic event.Thank you very much.
Finally, I want to thank my excellent teammates
@haqishen
,
@boliu0
,
@kevin1742064161
. On behalf of my teammates, I would like to introduce part of our solution, and another part is presented by
@boliu0
in another thread
.
1. Overview of the pipeline
Extract ROI with a fixed aspect ratio(1.6:1) using YOLOX
Feed the ROI into different classification models
Average weighting fusion of the results from the classification models
2. External Data
We use 4 external data in total. Not all models used all the external data. Some models only used CBIS-DDSM + CMMD, while the remaining models used all four external data. Although these external data appear to be different from the competition data, they can improve the CV and significantly enhance the stability of the training.
1)
CBIS-DDSM
The classification labels of CBIS-DDSM are: MALIGNANT, BENIGN WITHOUT CALLBACK, BENIGN. We consider MALIGNANT as positive and the others as negative, resulting in 1,350 positive and 1,753 negative samples.
2)
CMMD

**Document 1------------------------------------------------------------**

Competition Name: RSNA Screening Mammography Breast Cancer Detection,
Place: 19,
Methods Used: ['Label Smoothing', 'Auxiliary Classes', 'Weighted BCELoss', 'Mosaic', 'Mixup', 'ROI Cropping'],
Solution: 19th Place Solution
Thanks to the team at RSNA and Kaggle for putting this competition together, and thanks to my teammates
@ragnar123
and
@harshitsheoran
.
The final submission consisted of 2 CNN models : eca_nfnet_l0 and tf_efficientnet_b3_ns.
Model
CV
Best Public
Best Private
Used in Ensemble
eca_nfnet_l0 @ 1536 (Harshit)
.488
.6
.47
0
eca_nfnet_l0 @ 1536 (Ivan)
.4688
.61
.46
1
tf_efficientnet_b3_ns @ 1536 (Ivan)
.491
.63
.48
1
tf_efficientnet_b3_ns @ 1920 by 1536 (Martin)
.464
.6
.51
0
When combined, these two models had a best Public LB of .65 and a best Private LB of .5. Unfortunately there was no correlation between public and private or CV and private. This led to us picking the wrong submission. In the end we had 12 submissions that would have gotten us into gold but had no way of telling if they were the correct ones.
What Worked?
Label Smoothing
Auxiliary Classes (Bi-raids 2, Benign, Invasive, Biopsy)
Weighted BCELoss
Mosaic (with class max as target)
Mixup (with class max as target)

**Document 2------------------------------------------------------------**

Competition Name: RSNA Screening Mammography Breast Cancer Detection,
Place: 18,
Methods Used: ['Windowing', 'Min-max scaling', 'YOLOv5', 'Affine transformation', 'GeM pooling', 'imgaug augmentations', 'BCE Loss', 'Focal Loss', 'Exponential moving average', 'AdamW optimizer', 'OneCycleLR scheduler', 'Flip test TTA', 'LP pooling', 'Ensemble averaging', 'Percentile based thresholding'],
Solution: Thanks to Kaggle, the hosts and competitors for this meaningful competition.
In the following, I want to provide a brief summary of my solution.
Overview
Similar to many public codes, my pipeline is as follows.
Detect a breast area for each image and crop that area
Predict cancer image-wise using various backbones
Aggregate image-wise predictions and apply thresholding to get final prediction for each target
Preprocessing
My preprocessing depends on many public codes. I am grateful to the authors of those codes.
Sigmoid/linear windowing is applied based on
VOILUTFunction
,
WindowCenter
and
WindowWidth
in dicom data. After windowing, images are processed with min-max scaling and treated as 8-bit images.
Breast detector
I annotated breast bounding boxes for about 1000 images. In addition to those labels, I also used labels provided by
@remekkinas
(about 500 images) in
this code
to train a single YOLOv5n6 with the input size of 1024. mAP_0.5:0.95 of a validation split is 0.952.
Given the detections, affine transformation is applied to obtain fixed size cropped images.
At that time, expanding the bboxes so that the aspect ratio and the size of the bbox relative to the original images did not change too much improved somewhat of local cv.

**Document 3------------------------------------------------------------**

Competition Name: RSNA Screening Mammography Breast Cancer Detection,
Place: 9,
Methods Used: ['Sampling strategy', 'Positive class balancing', 'Augmentation', 'Model selection', 'Postprocessing', 'DICOM to PNG conversion', 'Inference with ConvNext models', 'Ensemble averaging', 'Voting strategy', 'Thresholding', 'Probf1 metric', 'ROC_AUC metric', 'Precision recall metric', 'MCC metric', 'Local validation', 'TTA (Horizontal Flip)', 'Model probabilities ensembling', 'Better than median function', 'Voting and averaging', 'RAdam optimizer', 'Lookahead optimizer', 'OneCycle scheduler', 'Weight decay', 'Dataset seed', 'Batch size', 'Mixed precision training', 'Gradient clipping', 'BCEWithLogitsLoss', 'Sequential sampler', 'ConvNext_v1 small model', 'Average pooling', 'GeM pooling', 'Training on 1536x768 images', 'Balancer class', 'Weights & Biases experiment tracking'],
Solution: First of all congratulations to all participants. Congratulations to dream teams from gold zone. I’m impressed by your consistency in winning Kaggle competition. Waiting to learn from your solution.
Thank you my team mate Andrij
@aikhmelnytskyy
We had great collaboration 👍👍👍 - I feel that from first minute we played in one team having one goal - find better solution.
Gold in competition was dream for me. Last year we (with
@christofhenkel
) were #1 in sliver (#12 solution in Image Matching Challange 2021). This year I decided to work hard to experience gold zone and finally become competition master. Even there is no official LB finalized …. we are #9 and in gold! :) and I am …… extremely happy! 😁😁😍😜

**Document 4------------------------------------------------------------**

Competition Name: RSNA Screening Mammography Breast Cancer Detection,
Place: 2,
Methods Used: ['Pretraining with external dataset', 'Fine-tuning', 'Data augmentation', 'ConvnextV1 small model', 'Manual annotation of bounding box', 'Faster R-CNN for cropping', 'ShiftScaleRotate augmentation', 'RandomFlip augmentation', 'RandAugment augmentation', 'RandomErasing augmentation', 'EQL loss', 'High resolution', 'Large batch size', 'Auxiliary loss', 'More training epochs', 'Dual view model', 'Multi laterality dual view model', 'Optimizer: AdamW', 'lr: 0.00015', 'Scheduler: CosineAnnealingLR', 'Epochs: 24', 'Batch size: 192', 'EMA', 'Diagonal flip TTA'],
Solution: 2nd place solution
I would like to express my gratitude to Kaggle for hosting this meaningful competition, and to my teammates, particularly
@kapenon
, who persevered alongside me throughout the entire competition.
I would like to extend my gratitude to
@theoviel
for providing the fast DALI inference notebook, which greatly aided in the completion of this competition. Additionally, I would like to thank
@pourchot
for generously sharing the external data, which contained valuable positive case data that contributed to the success of our final solution.
Fortunately, our team was able to get 2nd place, and I am excited to share our approach.
Summary of our approach
Stages
Pretrain a single view model in 1280x1280 resolution with external dataset (Thanks to
@pourchot
,
Dataset
)
Fine-tune the single view model in 1536x1536 resolution without external dataset
Use the fine-tuned single view model to further fine-tune a dual view model and a four view model
Model

CPU times: user 7 s, sys: 15 ms, total: 7.02 s
Wall time: 14.3 s


# 9. To go further

Here are a few ideas to improve this Notebook:

- Adjust chunking: change chunk sizes, split on different separators…
- Switch embedding models
- Try semantic chunking for different insights.
- Adjust the EnsembleRetriever (e.g. use a different index than FAISS) or just use one retriever. You can see the list of indexes supported by LangChain [here](https://python.langchain.com/docs/integrations/vectorstores)
- Evaluate the RAG pipeline with [Ragas](https://github.com/explodinggradients/ragas) or [TruLens](https://github.com/truera/trulens) tools.
- Fine-tune the Gemma model on your dataset for better performance.
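
To make the EnsembleRetriever idea more concrete: according to the LangChain documentation, `EnsembleRetriever` merges the ranked lists of its retrievers with weighted Reciprocal Rank Fusion (RRF). The sketch below is a simplified pure-Python illustration of that scoring rule, not LangChain's implementation. The function name, the toy document ids and the constant `c=60` (a commonly used value) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives weight / (c + rank) from every list it appears in,
    with rank starting at 1; documents are returned by decreasing total score.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy example: one ranking from a dense retriever, one from a keyword retriever
dense_ranking = ["a", "b", "c"]
keyword_ranking = ["b", "d", "a"]
print(reciprocal_rank_fusion([dense_ranking, keyword_ranking], weights=[0.5, 0.5]))
```

Changing the weights shifts the balance between the two retrievers: a weight of 0 for one list reduces the fusion to the other list's order, which is one way to compare the ensemble against a single retriever.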

# 10. Resources

- [Advanced RAG Techniques: an Illustrated Overview](https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6) by [Ivan Ilin](https://medium.com/@ivanilin_iki)
- [Advanced RAG on HuggingFace documentation using langchain](https://huggingface.co/learn/cookbook/advanced_rag) by [Aymeric Roucher](https://huggingface.co/m-ric)
- [Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain](https://huggingface.co/learn/cookbook/rag_zephyr_langchain) by [Maria Khalusova](https://github.com/MKhalusova)