# In-Context Learning


In-context learning generalises few-shot learning: the LLM is given a context as part of the prompt and asked to respond using the information in that context.

* Example: *"Summarize this research article into one paragraph highlighting its strengths and weaknesses: [insert article text]"*
* Example: *"Extract all the quotes from this text and organize them in alphabetical order: [insert text]"*

Retrieval-Augmented Generation (RAG), a very popular technique that you will learn in week 5, is a form of in-context learning, where:
* a search engine is used to retrieve some relevant information
* that information is then provided to the LLM as context
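
The two steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how a production RAG system works: `retrieve` is a hypothetical stand-in that ranks documents by word overlap with the question, whereas real systems typically use a search engine or embedding similarity. The `build_prompt` helper and the sample documents are also made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Toy retriever (assumption): rank documents by word overlap with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_docs):
    """Place the retrieved text in the prompt so the LLM can answer from it."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up mini corpus standing in for a real document store.
documents = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Python is a programming language created by Guido van Rossum.",
    "The Great Wall of China is thousands of kilometres long.",
]

question = "Who created the Python programming language?"
prompt = build_prompt(question, retrieve(question, documents, k=1))
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM; the model itself is unchanged, only its input carries the retrieved context.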


In this example we download some recent research papers from arXiv, extract the text from the PDF files, and ask Gemini to summarize each article and describe its main strengths and weaknesses. Finally, we write the summaries to a local HTML file and display them as Markdown.

In [3]:
import os
import requests
from bs4 import BeautifulSoup
import google.generativeai as genai
from urllib.request import urlopen, urlretrieve
from IPython.display import Markdown, display
from pypdf import PdfReader
from datetime import date
from tqdm import tqdm

In [4]:
API_KEY = os.environ.get("GEMINI_API_KEY")
genai.configure(api_key=API_KEY)

We select the papers currently featured on the Hugging Face papers page.

In [5]:
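BASE_URL = "https://huggingface.co/papers"
page = requests.get(BASE_URL, timeout=30)
page.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(page.content, "html.parser")
h3s = soup.find_all("h3")

papers = []

for h3 in h3s:
    a = h3.find("a")
    if a is None or not a.get("href"):
        continue  # skip headings that do not link to a paper
    title = a.text.strip()
    # "/papers/<arXiv id>" -> "/<arXiv id>"
    link = a["href"].replace("/papers", "", 1)

    papers.append({"title": title, "url": f"https://arxiv.org/pdf{link}"})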
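
The URL rewriting in the loop above assumes every link has the form `/papers/<arXiv id>`. A slightly more defensive variant is sketched below. The helper name `to_arxiv_pdf_url` is hypothetical, and the `/papers/trending` link is a made-up example of a link that is not a paper. The pattern only covers new-style arXiv identifiers such as `2410.24024`, optionally followed by a version suffix such as `v2`.

```python
import re

# New-style arXiv identifiers: four digits, a dot, four or five digits, optional version.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def to_arxiv_pdf_url(href):
    """Map a link such as '/papers/2410.24024' to its arXiv PDF URL, or None."""
    prefix = "/papers/"
    if not href.startswith(prefix):
        return None
    paper_id = href[len(prefix):]
    if not ARXIV_ID.match(paper_id):
        return None  # not an arXiv identifier, so there is no PDF to fetch
    return f"https://arxiv.org/pdf/{paper_id}"


print(to_arxiv_pdf_url("/papers/2410.24024"))  # a paper link
print(to_arxiv_pdf_url("/papers/trending"))    # hypothetical non-paper link
```

Returning `None` for anything that does not look like a paper lets the caller skip the entry rather than attempt to download a PDF that does not exist.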

Helper functions to extract text from web pages and PDFs, and to display Markdown in the notebook.

In [6]:
def extract_paper(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text


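def extract_pdf(url):
    # download the PDF to a local file, then read it page by page
    urlretrieve(url, "pdf_file.pdf")
    reader = PdfReader("pdf_file.pdf")
    text = ""
    for page in reader.pages:
        # extract_text() can come back empty for pages without a text layer
        text += (page.extract_text() or "") + "\n"
    return text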


def printmd(string):
    display(Markdown(string))

In [7]:
LLM = "gemini-1.5-flash"
model = genai.GenerativeModel(LLM)

We use Gemini to summarize the papers.

In [8]:
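PROMPT = "Summarize this research article into one paragraph without formatting highlighting its strengths and weaknesses. "

for paper in tqdm(papers):
    try:
        # the instruction comes first, followed by the paper text as context
        paper["summary"] = model.generate_content(PROMPT + extract_pdf(paper["url"])).text
    except Exception as e:
        # the download, the PDF parsing, or the generation can fail; keep going
        print(f"Generation failed: {e}")
        paper["summary"] = "Paper not available"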

100%|██████████| 12/12 [01:20<00:00,  6.73s/it]
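
The summarization step can also be tried without an API key by separating prompt construction from the model call. The sketch below is one possible arrangement, not the notebook's actual code: `build_summary_prompt`, `summarize`, and `fake_generate` are hypothetical names, and the `max_chars` cut-off is an arbitrary assumption intended to keep very long papers within the model's context window.

```python
INSTRUCTION = (
    "Summarize this research article into one paragraph without formatting "
    "highlighting its strengths and weaknesses."
)


def build_summary_prompt(paper_text, max_chars=200_000):
    """Combine the instruction with the (possibly truncated) paper text."""
    context = paper_text[:max_chars]
    return f"{INSTRUCTION}\n\nArticle:\n{context}"


def summarize(paper_text, generate):
    """Call `generate` (any function mapping a prompt to text); fall back on failure."""
    try:
        return generate(build_summary_prompt(paper_text))
    except Exception as error:
        print(f"Generation failed: {error}")
        return "Paper not available"


def fake_generate(prompt):
    """Hypothetical stand-in for the LLM call so the sketch runs offline."""
    return f"(summary of a {len(prompt)}-character prompt)"


print(summarize("Some extracted paper text.", fake_generate))
```

With the model defined earlier in the notebook, `generate` could be `lambda p: model.generate_content(p).text`.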


We write the results to an HTML file.

In [9]:
# visible content belongs in <body>; <head> only carries metadata such as the title
page = f"<html> <head> <title>Daily Dose of AI Research</title> </head> <body> <h1>Daily Dose of AI Research</h1> <h4>{date.today()}</h4> <p><i>Summaries generated with: {LLM}</i></p>"
for paper in papers:
    page += f'<h2><a href="{paper["url"]}">{paper["title"]}</a></h2> <p>{paper["summary"]}</p>'
page += "</body> </html>"

# write the whole document in one go
with open("papers.html", "w", encoding="utf-8") as f:
    f.write(page)
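
Titles and summaries are inserted into the HTML as-is, so characters such as `<` or `&` in the generated text could break the page. A minimal sketch of escaping them with the standard library's `html.escape` follows. The function name `render_report`, the sample paper, its placeholder URL, and the fixed date are all made up for illustration.

```python
import html
from datetime import date


def render_report(papers, model_name, today):
    """Return a complete HTML document listing each paper with its summary."""
    items = []
    for paper in papers:
        url = html.escape(paper["url"], quote=True)
        title = html.escape(paper["title"])
        summary = html.escape(paper["summary"])
        items.append(f'<h2><a href="{url}">{title}</a></h2>\n<p>{summary}</p>')
    body = "\n".join(items)
    return (
        "<html>\n<head><title>Daily Dose of AI Research</title></head>\n<body>\n"
        f"<h1>Daily Dose of AI Research</h1>\n<h4>{today}</h4>\n"
        f"<p><i>Summaries generated with: {html.escape(model_name)}</i></p>\n"
        f"{body}\n</body>\n</html>\n"
    )


# Made-up sample entry containing characters that need escaping.
sample = [
    {
        "title": "A <b>bold</b> title",
        "url": "https://arxiv.org/pdf/0000.00000",
        "summary": "Uses x < y & z.",
    }
]
report = render_report(sample, "gemini-1.5-flash", date(2024, 11, 5))
print(report)
```

Building the whole document as one string and writing it once also avoids reopening the file for every paper.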

We can also display the results in this notebook as Markdown.

In [10]:
for paper in papers:
    printmd("**[{}]({})**<br>{}<br><br>".format(paper["title"],
                                                paper["url"],
                                                paper["summary"]))

**[AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/pdf/2410.24024)**<br>This research article introduces ANDROID LAB, a systematic framework for training and evaluating Android autonomous agents. The framework includes a standard operational environment with different modalities (XML and SoM) and action spaces, along with a reproducible benchmark of 138 tasks across nine apps. The researchers also developed the Android Instruct dataset, containing 10.5k traces and 94.3k steps, to fine-tune open-source models. The results show that fine-tuning improves the performance of open-source models, achieving success rates and efficiency levels comparable to closed-source models. 

Strengths of the research include the development of a comprehensive and reproducible benchmark, the use of both open-source and closed-source models, and the creation of a large-scale dataset for fine-tuning. However, the research is limited by the focus on a single mobile operating system (Android) and the reliance on pre-defined tasks, which may not fully capture the complexity of real-world mobile agent scenarios. Additionally, the use of the ReAct and SeeAct frameworks, while showing some promise, did not consistently enhance performance.  
<br><br>

**[WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning](https://arxiv.org/pdf/2411.02337)**<br>This research paper introduces WEBRL, a self-evolving online curriculum reinforcement learning framework for training large language models (LLMs) to act as web agents. WEBRL tackles three major challenges in this domain: limited training data, sparse feedback signals, and policy distribution drift in online learning. WEBRL utilizes a self-evolving curriculum that generates new tasks based on unsuccessful attempts, a robust outcome-supervised reward model (ORM) for evaluating task success, and adaptive reinforcement learning strategies to prevent catastrophic forgetting. WEBRL achieves state-of-the-art performance on the WebArena-Lite benchmark, surpassing even proprietary LLMs like GPT-4-Turbo. However, it relies heavily on the availability of an initial set of tasks and may struggle with complex or long-horizon tasks. The reliance on manually curated initial tasks is a significant weakness, making its scalability to new domains challenging.  
<br><br>

**[DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models](https://arxiv.org/pdf/2411.00836)**<br>This research article introduces DYNA MATH, a novel dynamic visual math benchmark designed to assess the robustness of mathematical reasoning in vision-language models (VLMs). DYNA MATH  features 501 seed questions, each represented as a Python program capable of generating diverse question variations, such as numerical value changes, geometric transformations, and function type modifications. The benchmark's strength lies in its ability to dynamically generate a large number of concrete questions, allowing for a more comprehensive evaluation of VLM performance under varying conditions. However, DYNA MATH's reliance on program-based generation might limit the inclusion of extremely complex questions, a weakness that could be addressed in future iterations. The authors evaluate various state-of-the-art VLMs on DYNA MATH, finding a significant gap between their average-case and worst-case accuracies, highlighting the models' lack of robustness to simple question variations. This research underscores the need for further research to develop VLMs with more reliable reasoning capabilities, particularly in the face of dynamic visual and textual contexts. 
<br><br>

**[Training-free Regional Prompting for Diffusion Transformers](https://arxiv.org/pdf/2411.02395)**<br>This research article proposes a training-free regional prompting method for diffusion transformers, specifically targeting the FLUX.1 architecture. The method utilizes attention manipulation to enable fine-grained compositional text-to-image generation without requiring model retraining. It leverages regional prompt-mask pairs, which can be user-defined or generated by large language models, to guide the model in generating images with specific spatial layouts and attributes. The research highlights the method's strengths in handling complex, multi-regional prompts, achieving swift and responsive image generation, and being compatible with other plug-and-play modules like LoRAs and ControlNet. However, the article acknowledges the limitation of tuning factors for optimal visual cohesion when the number of regional masks increases. The article also presents an ablation analysis of key factors and a comparison with standard FLUX.1-dev and RPG-based regional control, demonstrating the proposed method's efficiency in terms of inference speed and GPU memory consumption. Overall, the article introduces a promising approach to enhance the compositional generation capabilities of diffusion transformers, albeit with the need for further research to address the tuning complexities associated with a high number of regional masks. 
<br><br>

**[Sparsing Law: Towards Large Language Models with Greater Activation Sparsity](https://arxiv.org/pdf/2411.02335)**<br>This research paper investigates activation sparsity in large language models (LLMs), a property where a significant portion of neurons contribute weakly to the output. The authors propose a new metric, PPL-p% sparsity, which is more precise, versatile, and performance-aware than existing methods. Through extensive experiments, they discover several scaling properties of activation sparsity related to the amount of training data, activation function, width-depth ratio, and parameter scale. Strengths include the comprehensive analysis and novel metric. However, the study is limited by its focus on relatively small LLMs and the absence of computational costs in some analyses. Additionally, the sparsity metric's sensitivity to data distributions is a potential limitation. Despite these limitations, the paper provides valuable insights for designing and training sparser LLMs, enabling more efficient and interpretable models. 
<br><br>

**[How Far is Video Generation from World Model: A Physical Law Perspective](https://arxiv.org/pdf/2411.02385)**<br>This research paper investigates the ability of video generation models to learn fundamental physical laws from visual data, drawing inspiration from OpenAI's Sora. The authors evaluate generalization performance in three scenarios: in-distribution, out-of-distribution (OOD), and combinatorial generalization, using a 2D simulation testbed for object movement and collisions. While scaling models and datasets improves in-distribution and combinatorial generalization performance, the models fail to generalize well in OOD scenarios. Further analysis reveals that the models primarily exhibit "case-based" generalization behavior, mimicking the closest training example, rather than abstracting general physical rules. Moreover, the models prioritize certain attributes over others during case matching, with color being the most dominant factor, followed by size, velocity, and lastly, shape. This highlights that scaling alone is not sufficient for video generation models to discover fundamental physical laws, emphasizing the need for more nuanced approaches beyond simply increasing data and model size. 
<br><br>

**[LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models](https://arxiv.org/pdf/2411.00918)**<br>This research paper introduces LibMoE, a comprehensive and modular toolkit designed to facilitate research on Mixture-of-Experts (MoE) algorithms for large language models (LLMs). LibMoE addresses the limitations of existing MoE toolkits by offering a standard benchmark for evaluating five state-of-the-art algorithms and supporting efficient training with a modular design that allows for easy customization and scalability. The paper benchmarks these algorithms on three different LLMs and 11 datasets in a zero-shot setting, demonstrating the strengths of LibMoE in streamlining research and promoting accessible large-scale studies. However, the paper also highlights weaknesses in the current understanding of MoE algorithms, particularly regarding early stopping mechanisms and the impact of architectural choices on expert selection behavior. Despite the lack of a clear winner among the evaluated algorithms, the study reveals promising research directions for future studies through its comprehensive analysis of expert selection dynamics and performance convergence.
<br><br>

**[GenXD: Generating Any 3D and 4D Scenes](https://arxiv.org/pdf/2411.02319)**<br>This research article presents GenXD, a unified framework for generating high-quality 3D and 4D scenes from any number of condition images.  The authors tackle the challenge of 4D generation by creating a new dataset, CamVid-30K,  which uses a data curation pipeline to extract camera poses and object motion strength from videos. GenXD incorporates multiview-temporal modules to disentangle camera and object movements, allowing for seamless learning from both 3D and 4D data. The model also employs masked latent conditions to support flexible conditioning views.  While GenXD achieves promising results, the limited diversity of real-world datasets and the difficulty in capturing large camera movements with significant object motion in videos remain limitations. 
<br><br>

**[Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent](https://arxiv.org/pdf/2411.02265)**<br>This research paper introduces Hunyuan-Large, a large open-source Mixture of Experts (MoE) model with 389 billion parameters and 52 billion activated parameters. It surpasses similar-sized models in various benchmarks including language understanding, generation, reasoning, coding, long-context, and aggregated tasks. Its strengths lie in the use of large-scale synthetic data, a mixed expert routing strategy, KV cache compression, and expert-specific learning rate strategies. However, it lacks detailed information regarding the training process, including specific hyperparameters and evaluation protocols used in the benchmark comparisons.  Furthermore, the paper does not comprehensively discuss the ethical implications and potential risks of such a large language model. 
<br><br>

**[Adaptive Caching for Faster Video Generation with Diffusion Transformers](https://arxiv.org/pdf/2411.02397)**<br>This research paper introduces Adaptive Caching (AdaCache), a training-free method to accelerate video generation using Diffusion Transformers (DiTs). Recognizing that not all videos require the same level of computational effort, AdaCache caches computations within transformer blocks, selectively reusing them based on a distance metric that measures the rate-of-change in representations. This adaptive caching schedule is further enhanced by Motion Regularization (MoReg), which allocates more computations for videos with higher motion content. AdaCache shows significant speedups (up to 4.7x) without sacrificing generation quality, outperforming other training-free acceleration methods. However, AdaCache relies on hyperparameter tuning and its performance may be limited by the quality of the motion estimation, potentially leading to inconsistent generations.  While promising, the method still requires further research to fully address these limitations. 
<br><br>

**[Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models](https://arxiv.org/pdf/2411.00743)**<br>This research paper introduces Specialized Sparse Autoencoders (SSAEs) as a new method for interpreting rare concepts within foundation models (FMs). SSAEs, unlike general-purpose Sparse Autoencoders (SAEs), focus on specific subdomains, enabling them to learn features representing infrequent but crucial concepts. The paper presents a practical recipe for training SSAEs, including using dense retrieval for data selection and Tilted Empirical Risk Minimization (TERM) for training objectives to improve concept recall. Evaluation on standard metrics like downstream perplexity and L0 sparsity show that SSAEs effectively capture subdomain tail concepts, surpassing the capabilities of general-purpose SAEs. A case study on the Bias in Bios dataset demonstrates the practical utility of SSAEs, leading to a 12.5% increase in worst-group classification accuracy when removing spurious gender information. While SSAEs offer a powerful new tool for interpreting FMs in subdomains, they still rely on the quality of seed data and can be computationally expensive when trained with TERM. Further research is needed to address these limitations and explore the ethical implications of manipulating rare concepts within models. 
<br><br>

**[DynaSaur: Large Language Agents Beyond Predefined Actions](https://arxiv.org/pdf/2411.01747)**<br>This research article introduces DynaSaur, an LLM agent framework that dynamically creates and composes actions in real-time using Python code. This approach addresses the limitations of existing LLM agent systems that rely on predefined action sets, which restricts their flexibility and requires substantial human effort. DynaSaur enables agents to learn and adapt to new scenarios by generating new actions when necessary, improving their performance on complex tasks. The framework is evaluated on the GAIA benchmark and shows significant performance gains over existing methods, even reaching the top of the public leaderboard. However, limitations include the tendency to generate overly specific actions and the high cost of evaluation, suggesting the need for a task curriculum and further research into cost-effective evaluation methods. 
<br><br>