# In-Context Learning


In-context learning is a generalisation of few-shot learning in which the LLM is given a context as part of the prompt and asked to respond using the information in that context.

* Example: *"Summarize this research article into one paragraph highlighting its strengths and weaknesses: [insert article text]"*
* Example: *"Extract all the quotes from this text and organize them in alphabetical order: [insert text]"*

Retrieval-Augmented Generation (RAG), a very popular technique that you will learn in week 5, is a form of in-context learning, where:
* a search engine is used to retrieve some relevant information
* that information is then provided to the LLM as context
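
The two steps above can be sketched with a toy example. This is only an illustration under stated assumptions: the word-overlap `retrieve` function stands in for a real search engine, and `retrieve`, `build_prompt`, and the sample documents are hypothetical names and data, not part of any library.

```python
def retrieve(query, documents, k=1):
    """Toy retriever (assumption): rank documents by shared words with the query."""
    query_words = set(query.lower().split())

    def overlap(doc):
        return len(query_words & set(doc.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:k]


def build_prompt(question, context_docs):
    """Place the retrieved text in the prompt, ahead of the question."""
    context = "\n".join(context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical mini corpus, for illustration only.
documents = [
    "Gist tokens compress long contexts into a few learned tokens.",
    "Collaborative filtering recommends items based on user behaviour.",
]
question = "How do gist tokens compress contexts?"
prompt = build_prompt(question, retrieve(question, documents))
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM; the model never sees the documents that were not retrieved.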


In this example we download some recent research papers from arXiv, extract the text from the PDF files, and ask Gemini to summarize each article and list its main strengths and weaknesses. Finally, we write the summaries to a local HTML file and display them as Markdown.

In [1]:
import os
import requests
from bs4 import BeautifulSoup
import google.generativeai as genai
from urllib.request import urlopen, urlretrieve
from IPython.display import Markdown, display
from pypdf import PdfReader
from datetime import date
from tqdm import tqdm


In [2]:
API_KEY = os.environ.get("GOOGLE_API_KEY")
genai.configure(api_key=API_KEY)

We select the papers currently featured on Hugging Face Papers.

In [3]:
BASE_URL = "https://huggingface.co/papers"
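page = requests.get(BASE_URL, timeout=30)
page.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(page.content, "html.parser")
h3s = soup.find_all("h3")

papers = []

for h3 in h3s:
    a = h3.find("a")
    if a is None or not a.get("href"):
        continue  # skip headings that do not link to a paper
    title = a.text.strip()
    # "/papers/<arXiv id>" becomes "/<arXiv id>"
    link = a["href"].replace('/papers', '')

    papers.append({"title": title, "url": f"https://arxiv.org/pdf{link}"})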
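
The link rewriting in the loop above can be checked in isolation. The helper below is a small sketch, assuming every paper link has the form `/papers/<arXiv id>`; `arxiv_pdf_url` is a hypothetical name introduced here for illustration.

```python
def arxiv_pdf_url(href):
    """Map a path such as "/papers/2412.17743" to the matching arXiv PDF URL."""
    prefix = "/papers/"
    if not href.startswith(prefix):
        # Assumption: anything else is not a paper link, so fail loudly.
        raise ValueError(f"unexpected paper link: {href!r}")
    arxiv_id = href[len(prefix):]
    return f"https://arxiv.org/pdf/{arxiv_id}"


print(arxiv_pdf_url("/papers/2412.17743"))
```

Unlike a plain `str.replace`, this version rejects links that do not start with the expected prefix instead of silently producing a malformed URL.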

Helper functions to extract text from HTML pages and PDFs, and to display Markdown in the notebook.

In [4]:
def extract_paper(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text


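def extract_pdf(url):
    # download the PDF to a local file, then read it page by page
    urlretrieve(url, "pdf_file.pdf")
    reader = PdfReader("pdf_file.pdf")
    text = ""
    for page in reader.pages:
        # guard against pages that yield no text (e.g. image-only pages)
        text += (page.extract_text() or "") + "\n"
    return text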


def printmd(string):
    display(Markdown(string))

In [5]:
LLM = "gemini-1.5-flash"
model = genai.GenerativeModel(LLM)

We use Gemini to summarize the papers.

In [6]:
prompt = "Summarize this research article into a table highlighting its strengths and weaknesses in two different columns. "
for paper in tqdm(papers):
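    try:
        paper["summary"] = model.generate_content(prompt + extract_pdf(paper["url"])).text
    except Exception as e:
        # report which paper failed and why, instead of hiding the error
        print(f"Generation failed for {paper['title']}: {e}")
        paper["summary"] = "Paper not available"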

100%|██████████| 4/4 [00:32<00:00,  8.10s/it]
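
The loop above concatenates the instruction and the article text directly. A slightly more defensive way to assemble the prompt is sketched below, assuming we want explicit delimiters around the article and a cap on its length. `build_summary_prompt`, the `<article>` delimiters, and the default `max_chars` value are illustrative assumptions, not Gemini requirements or limits.

```python
def build_summary_prompt(instruction, article_text, max_chars=200_000):
    """Join the instruction and the article, truncating very long articles.

    max_chars is an arbitrary placeholder budget, not a documented model limit.
    """
    article_text = article_text.strip()
    if len(article_text) > max_chars:
        article_text = article_text[:max_chars]
    # Delimiters make it clear where the instruction ends and the context begins.
    return f"{instruction.strip()}\n\n<article>\n{article_text}\n</article>"


demo = build_summary_prompt("Summarize this article. ", "x" * 50, max_chars=10)
print(demo)
```

In the loop, the result of `build_summary_prompt(prompt, extract_pdf(paper["url"]))` could be passed to `model.generate_content` in place of the plain concatenation.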


We write the results to an HTML file.

In [15]:
import re

def convert_markdown_to_html_table(markdown_table):
    lines = markdown_table.strip().split("\n")
    headers = ["Strengths", "Weaknesses"]  # Fix the header line
    # skip the title, the blank line and the Markdown header row
    rows = lines[3:]

    html_table = "<table border='1'>\n<thead>\n<tr>"
    html_table += "".join(f"<th>{header.strip()}</th>" for header in headers)
    html_table += "</tr>\n</thead>\n<tbody>\n"

    for row in rows:
        if set(row.strip()) == {'|', '-'}:
            continue  # skip the Markdown separator row
        cells = row.split("|")[1:-1]
        # convert **bold** spans to <b>...</b> (chained str.replace cannot pair the markers)
        cells = [re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", cell.strip()) for cell in cells]
        html_table += "<tr>" + "".join(f"<td>{cell}</td>" for cell in cells) + "</tr>\n"

    html_table += "</tbody>\n</table>"
    return html_table
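
The function above assumes the model always returns a title, a blank line, and then the table. A more tolerant variant is sketched below: it locates the Markdown separator row instead of relying on a fixed line offset, and escapes cell text before inserting it into HTML. The names `markdown_table_to_rows` and `rows_to_html` are hypothetical, and the sketch only handles simple pipe tables like the ones in this example.

```python
import html
import re

# A separator row such as |---|---| marks the end of the table header.
SEPARATOR = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")


def markdown_table_to_rows(markdown_text):
    """Return the body rows of the first Markdown table as lists of cell strings."""
    lines = [line.strip() for line in markdown_text.splitlines()]
    for index, line in enumerate(lines):
        if line and SEPARATOR.match(line):
            body = lines[index + 1:]
            break
    else:
        return []  # no table found, e.g. "Paper not available"

    rows = []
    for line in body:
        if not line.startswith("|"):
            break  # the table ends at the first non-table line
        inner = line[1:-1] if line.endswith("|") else line[1:]
        rows.append([cell.strip() for cell in inner.split("|")])
    return rows


def rows_to_html(rows):
    """Render rows as <tr> elements, escaping text and converting **bold** spans."""
    out = []
    for row in rows:
        cells = []
        for cell in row:
            cell = html.escape(cell)
            cell = re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", cell)
            cells.append(f"<td>{cell}</td>")
        out.append("<tr>" + "".join(cells) + "</tr>")
    return "\n".join(out)


sample = (
    "## Title\n\n"
    "| Strengths | Weaknesses |\n"
    "|---|---|\n"
    "| **Fast:** a < b | slow |\n"
    "| ok |  |\n"
)
print(rows_to_html(markdown_table_to_rows(sample)))
```

Escaping matters here because the cell text comes from a model and may contain characters such as `<` that would otherwise be interpreted as HTML.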

In [16]:
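# page header: metadata goes in <head>, visible content in <body>
page = f"<html> <head> <title>Daily Dose of AI Research</title> </head> <body> <h1>Daily Dose of AI Research</h1> <h4>{date.today()}</h4> <p><i>Summaries generated with: {LLM}</i></p>"
with open("papers.html", "w", encoding="utf-8") as f:
    f.write(page)
for paper in papers:
    html_table = convert_markdown_to_html_table(paper["summary"])
    # a <table> cannot be nested inside <p>, so it is written directly after the heading
    page = f'<h2><a href="{paper["url"]}">{paper["title"]}</a></h2> {html_table}'
    with open("papers.html", "a", encoding="utf-8") as f:
        f.write(page)
end = "</body> </html>"
with open("papers.html", "a", encoding="utf-8") as f:
    f.write(end)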

We can also display the results in this notebook as Markdown.

In [9]:
for paper in papers:
    printmd("**[{}]({})**<br>{}<br><br>".format(paper["title"],
                                                paper["url"],
                                                paper["summary"]))

**[YuLan-Mini: An Open Data-efficient Language Model](https://arxiv.org/pdf/2412.17743)**<br>## YuLan-Mini: Strengths and Weaknesses

| Strengths                                                                                                                                                                    | Weaknesses                                                                                                                                                                  |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Achieves top-tier performance among models of similar parameter scale (2.42B).                                                                                                | Limited context window (up to 28K tokens) due to resource constraints. Performance on long-context benchmarks is not on par with state-of-the-art models.                  |
| Data-efficient: Achieves comparable performance to industry-leading models requiring significantly more data (trained on 1.08T tokens).                                                |  Reproducing baseline model results is challenging due to lack of detailed information in their original papers. This makes the performance comparison less precise.           |
| Open-source: Full details of data composition and training procedures are publicly available, facilitating reproducibility.                                                                 | While data-efficient, still requires significant computational resources (56 A800 GPUs initially, reduced to 48 later). Not entirely accessible to all researchers.              |
| Elaborate data pipeline combines data cleaning, scheduling strategies, and synthetic data generation for enhanced model capabilities, particularly in math and coding.                 |  Only evaluated on a selected set of benchmarks.  A more exhaustive evaluation across diverse tasks would provide a more complete picture of the model's strengths and weaknesses. |
| Robust optimization method effectively mitigates training instability using techniques like µP initialization, WeSaR re-parameterization, and various numerical stability improvements.| The training stability mitigation methods are extensive, but their individual contributions are not completely isolated in the ablation study, limiting precise understanding of their impact. |
| Effective annealing approach incorporates targeted data selection and long context training to further improve performance.                                                          |   No direct comparison to truly comparable open-source models with fully disclosed training details and evaluation methodology.                                               |
| Uses readily available open-source and synthetic data.                                                                                                                            |  Requires expertise in setting up and managing a large-scale GPU cluster.                                                                                                    |


<br><br>

**[A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression](https://arxiv.org/pdf/2412.17483)**<br>## Gist Token-based Context Compression: Strengths and Weaknesses

| Strengths | Weaknesses |
|---|---|
| * **Near-lossless performance on several tasks:** Achieves comparable results to full attention models on tasks like retrieval-augmented generation (RAG), long-document QA, and summarization, especially at lower compression ratios. | * **Significant performance gaps on certain tasks:**  Struggles with tasks requiring precise recall (e.g., reranking, synthetic recall) and complex multi-hop reasoning. Performance degrades significantly with increasing compression ratios. |
| * **Efficient context compression:** Significantly reduces KV cache size and computational cost, making long-context processing more feasible for resource-constrained environments. | * **Compression bottleneck:** Gist tokens fail to fully capture original token information, leading to information loss and inaccurate reconstruction. This bottleneck is identified through probing experiments. |
| * **Unified framework for categorization:**  Provides a structured framework for understanding existing gist-based architectures based on memory location and gist granularity. | * **Three identified failure patterns:**  "Lost by the boundary" (degradation near segment starts), "Lost if surprise" (unexpected information ignored), and "Lost along the way" (errors during multi-step generation). |
| * **Proposed effective mitigation strategies:** Introduces fine-grained autoencoding and segment-wise token importance estimation to improve gist token representations and optimize learning. These strategies demonstrably improve performance, particularly on challenging tasks. | * **Limited model scale and context length in experiments:**  Experiments were constrained by computational resources, limiting the exploration of larger models and longer contexts.  The findings may not generalize perfectly to much larger models. |
| * **Comprehensive experimental evaluation:**  Extensive experiments across various language modeling and downstream tasks provide a thorough assessment of the method's effectiveness and limitations. | * **Limited scope of compression methods:** The study focuses solely on gist token-based methods, excluding other context compression techniques.  A more comprehensive comparison across diverse methods would strengthen the conclusions. |
| * **Addresses ethical considerations:** Uses established, well-curated datasets to minimize bias and harmful content in model training. |  |

<br><br>

**[Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation](https://arxiv.org/pdf/2412.18176)**<br>## Molar: Multimodal LLMs with Collaborative Filtering Alignment - Strengths and Weaknesses

| Strengths                                                                                                                                                                                             | Weaknesses                                                                                                                                                                                                      |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Superior Performance:** Consistently outperforms traditional SR models and state-of-the-art LLM-based methods across multiple datasets in terms of NDCG@K and Recall@K.                                   | **Computational Cost:** Multi-task fine-tuning is time-intensive, hindering real-time applications.  The reliance on large MLLMs increases computational demands.                                                              |
| **Effective Multimodal Integration:**  Leverages a Multimodal Large Language Model (MLLM) to effectively integrate textual and non-textual (image) data, leading to improved item representations and recommendation accuracy. | **Dependence on MLLM Quality:** Performance heavily relies on the underlying capabilities of the chosen MLLM. Suboptimal base models can lead to degraded recommendation performance.                                                     |
| **Enhanced Collaborative Filtering:** Incorporates collaborative filtering signals through a post-alignment mechanism, effectively combining content-based and ID-based user embeddings to improve personalization. | **Limited Scalability (in the paper):**  Due to computational constraints, the authors couldn't train larger LLMs, limiting the potential performance gains.                                                                  |
| **Robustness:** Demonstrates consistent performance improvements across diverse datasets and scenarios.                                                                                                           | **Data Dependency:** The effectiveness of the multimodal fine-tuning depends on the quality and quantity of the available multimodal data.                                                                               |
| **Efficient Framework:** Decoupled framework (MIRM and DUEG) improves computational efficiency compared to approaches that process long user history sequences directly within the LLM.                                  | **Interpretability:** While the model performs well, understanding precisely *why* it makes specific recommendations remains challenging, a common limitation of complex deep learning models.                               |
| **Comprehensive Evaluation:** The paper includes thorough experimentation, ablation studies, and analysis to understand the impact of different components and data modalities.                                         |  **Fine-tuning complexity:** The three-objective fine-tuning process for MIRM adds to the complexity of the training process.                                                                                          |


<br><br>

**[MMFactory: A Universal Solution Search Engine for Vision-Language Tasks](https://arxiv.org/pdf/2412.18072)**<br>## MMFactory: Strengths and Weaknesses

| Strengths                                                                                                       | Weaknesses                                                                                                                           |
|-----------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| **Universality:** Searches across diverse vision, language, and vision-language models for solutions.             | **Computational Cost:**  Solution generation can be time-consuming, especially with a large pool of existing solutions.  High API costs if using commercial LLMs. |
| **Multiple Solutions:** Provides a pool of programmatic solutions, allowing users to choose based on constraints. | **Complexity:** The framework is complex, requiring expertise to set up and manage.  Multi-agent system can be challenging to debug.        |
| **User-Centric:** Considers user-defined constraints (performance, computational resources) during solution generation. | **Dependence on LLMs:** Heavily relies on the capabilities of LLMs, especially for solution and metric routing.  Performance is limited by the underlying LLMs. |
| **Generalizable Solutions:** Solutions apply to all instances of a user-defined task, not just individual examples.      | **Limited Evaluation:**  Evaluation relies on existing benchmarks and metrics, which might not capture all aspects of task performance.   |
| **Robust Solutions:** Uses a multi-agent LLM conversation to refine solutions, improving correctness and robustness.| **Transparency:** The decision-making process within the multi-agent system lacks complete transparency.                                         |
| **State-of-the-art Performance:** Outperforms existing methods on benchmark datasets in several tasks.              | **Open-Source Model Reliance:** Some experiments depend on open-source models, which might not always match the performance of commercial models. |


<br><br>