In [1]:
import io
import json
import os
import urllib3
import time

import pdfplumber
from dotenv import load_dotenv
from IPython.display import display, Markdown
from langchain_core.messages import BaseMessage, SystemMessage, ToolMessage, AIMessage
from langchain_core.tools import BaseTool, tool
from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph
from langgraph.graph.state import CompiledStateGraph
from langgraph.graph.message import add_messages
from pydantic import BaseModel, Field
from typing import Annotated, ClassVar, Sequence, TypedDict, Optional

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

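# Load keys from a local .env file if present, otherwise rely on the existing environment
load_dotenv()

# Fail early with a clear message if a required key is missing
for key in ("OPENAI_API_KEY", "CORE_API_KEY"):
    if not os.getenv(key):
        raise EnvironmentError(f"{key} is not set. Export it or add it to a .env file.")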

This cell contains the prompts used in the workflow.

`agent_prompt` includes a section explaining how to use complex queries with the CORE API, which enables the agent to solve more complex tasks.
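
To make the query syntax concrete, here is a minimal sketch of how a filtered query string could be assembled. The helper name `build_core_query` and its parameters are hypothetical and are not part of the CORE API or of this notebook; the sketch only combines the `AND` operator, the `yearPublished` range syntax, and the `_exists_` operator described in `agent_prompt`.

```python
from typing import Optional


def build_core_query(
    terms: str,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    require_abstract: bool = False,
) -> str:
    """Assemble a CORE-style query string (hypothetical helper, for illustration only)."""
    clauses = [terms]
    if year_from is not None:
        # Range query on a numeric field
        clauses.append(f"yearPublished>={year_from}")
    if year_to is not None:
        clauses.append(f"yearPublished<={year_to}")
    if require_abstract:
        # Exists query: keep only items whose abstract is not empty
        clauses.append("_exists_:abstract")
    return " AND ".join(clauses)


print(build_core_query("maritime biology", year_from=2023, year_to=2024))
```

In the workflow itself the agent writes these strings directly, guided by the prompt, so a helper like this would only be useful for testing queries by hand.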

In [None]:
# Prompt for the initial decision making on how to reply to the user
decision_making_prompt = """
You are an experienced scientific researcher.
Your goal is to help the user with their scientific research.

Based on the user query, decide whether you need to perform research or can answer the question directly.
- You should perform research if the user query requires any supporting evidence or information.
- You should answer the question directly only for simple conversational questions, like "how are you?".
"""

# Prompt to create a step by step plan to answer the user query
planning_prompt = """
# IDENTITY AND PURPOSE

You are an experienced scientific researcher.
Your goal is to make a new step by step plan to help the user with their scientific research.

Subtasks should not rely on any assumptions or guesses, but only on the information provided in the context or on additional information you look up.

If any feedback is provided about a previous answer, incorporate it in your new planning.


# TOOLS

For each subtask, indicate the external tool required to complete the subtask. 
Tools can be one of the following:
{tools}
"""

# Prompt for the agent to answer the user query
agent_prompt = """
# IDENTITY AND PURPOSE

You are an experienced scientific researcher. 
Your goal is to help the user with their scientific research. You have access to a set of external tools to complete your tasks.
Follow the plan you wrote to successfully complete the task.

Add extensive inline citations to support any claim made in the answer.


# EXTERNAL KNOWLEDGE

## CORE API

The CORE API has a specific query language that allows you to explore a vast papers collection and perform complex queries. See the following table for a list of available operators:

| Operator       | Accepted symbols         | Meaning                                                                                      |
|---------------|-------------------------|----------------------------------------------------------------------------------------------|
| And           | AND, +, space          | Logical binary and.                                                                           |
| Or            | OR                     | Logical binary or.                                                                            |
| Grouping      | (...)                  | Used to prioritise and group elements of the query.                                           |
| Field lookup  | field_name:value       | Used to support lookup of specific fields.                                                    |
| Range queries | fieldName(>, <, >=, <=) | For numeric and date fields, specifies a range of valid values to return.                   |
| Exists queries| _exists_:fieldName     | Returns all the items where the field specified by fieldName is not empty.                   |

Use this table to formulate more complex queries filtering for specific papers, for example publication date/year.
Here are the relevant fields of a paper object you can use to filter the results:
{
  "authors": [{"name": "Last Name, First Name"}],
  "documentType": "presentation" or "research" or "thesis",
  "publishedDate": "2019-08-24T14:15:22Z",
  "title": "Title of the paper",
  "yearPublished": "2019"
}

Example queries:
- "machine learning AND yearPublished:2023"
- "maritime biology AND yearPublished>=2023 AND yearPublished<=2024"
- "cancer research AND authors:Vaswani, Ashish AND authors:Bello, Irwan"
- "title:Attention is all you need"
- "mathematics AND _exists_:abstract"
"""

# Prompt for the judging step to evaluate the quality of the final answer
judge_prompt = """
You are an expert scientific researcher.
Your goal is to review the final answer you provided for a specific user query.

Look at the conversation history between you and the user. Based on it, you need to decide if the final answer is satisfactory or not.

A good final answer should:
- Directly answer the user query. For example, it does not answer a question about a different paper or area of research.
- Answer the user's request extensively.
- Take into account any feedback given through the conversation.
- Provide inline sources to support any claim made in the answer.

In case the answer is not good enough, provide clear and concise feedback on what needs to be improved to pass the evaluation.
"""

In [3]:
class CoreAPIWrapper(BaseModel):
    """Simple wrapper around the CORE API."""
    base_url: ClassVar[str] = "https://api.core.ac.uk/v3"
    api_key: ClassVar[str] = os.environ["CORE_API_KEY"]

    top_k_results: int = Field(description = "Top k results obtained by running a query on Core", default = 1)

    def _get_search_response(self, query: str) -> dict:
        http = urllib3.PoolManager()

        # Retry mechanism to handle transient errors
        max_retries = 5    
        for attempt in range(max_retries):
            response = http.request(
                'GET',
                f"{self.base_url}/search/outputs", 
                headers={"Authorization": f"Bearer {self.api_key}"}, 
                fields={"q": query, "limit": self.top_k_results}
            )
            if 200 <= response.status < 300:
                return response.json()
            elif attempt < max_retries - 1:
                time.sleep(2 ** (attempt + 2))
            else:
                raise Exception(f"Got non 2xx response from CORE API: {response.status} {response.data}")

    def search(self, query: str) -> str:
        response = self._get_search_response(query)
        results = response.get("results", [])
        if not results:
            return "No relevant results were found"

        # Format the results in a string
        docs = []
        for result in results:
            published_date_str = result.get('publishedDate') or result.get('yearPublished', '')
            authors_str = ' and '.join([item['name'] for item in result.get('authors', [])])
            docs.append((
                f"* ID: {result.get('id', '')},\n"
                f"* Title: {result.get('title', '')},\n"
                f"* Published Date: {published_date_str},\n"
                f"* Authors: {authors_str},\n"
                f"* Abstract: {result.get('abstract', '')},\n"
                f"* Paper URLs: {result.get('sourceFulltextUrls') or result.get('downloadUrl', '')}"
            ))
        return "\n-----\n".join(docs)

class SearchPapersInput(BaseModel):
    """Input object to search papers with the CORE API."""
    query: str = Field(description="The query to search for on the selected archive.")
    max_papers: int = Field(description="The maximum number of papers to return. It defaults to 1, but you can increase it up to 10 in case you need to perform a more comprehensive search.", default=1, ge=1, le=10)

class DecisionMakingOutput(BaseModel):
    """Output object of the decision making node."""
    requires_research: bool = Field(description="Whether the user query requires research or not.")
    answer: Optional[str] = Field(default=None, description="The answer to the user query. It should be None if the user query requires research, otherwise it should be a direct answer to the user query.")

class JudgeOutput(BaseModel):
    """Output object of the judge node."""
    is_good_answer: bool = Field(description="Whether the answer is good or not.")
    feedback: Optional[str] = Field(default=None, description="Detailed feedback about why the answer is not good. It should be None if the answer is good.")

def format_tools_description(tools: list[BaseTool]) -> str:
    return "\n\n".join([f"- {tool.name}: {tool.description}\n Input arguments: {tool.args}" for tool in tools])

async def print_stream(app: CompiledStateGraph, input: str) -> Optional[BaseMessage]:
    display(Markdown("## New research running"))
    display(Markdown(f"### Input:\n\n{input}\n\n"))
    display(Markdown("### Stream:\n\n"))

    # Stream the results 
    all_messages = []
    async for chunk in app.astream({"messages": [input]}, stream_mode="updates"):
        for updates in chunk.values():
            if messages := updates.get("messages"):
                all_messages.extend(messages)
                for message in messages:
                    message.pretty_print()
                    print("\n\n")
 
    # Return the last message if any
    if not all_messages:
        return None
    return all_messages[-1]
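
The retry loop in `_get_search_response` sleeps for `2 ** (attempt + 2)` seconds after each failed attempt except the last one, which raises instead. As a quick check of what that schedule implies, the sketch below lists the waits; `backoff_delays` is a hypothetical helper written for this illustration and is not part of the notebook.

```python
def backoff_delays(max_retries: int = 5) -> list[int]:
    """Return the sleep durations in seconds between attempts, mirroring 2 ** (attempt + 2)."""
    # The final attempt raises instead of sleeping, so there are max_retries - 1 delays.
    return [2 ** (attempt + 2) for attempt in range(max_retries - 1)]


delays = backoff_delays()
print(delays, sum(delays))
```

With the default of five attempts, a request that keeps failing spends about a minute sleeping before the exception is raised, not counting the time taken by the requests themselves.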

#### Agent state
This cell defines the agent state, which contains the following information:

- `requires_research`: Whether the user query requires research or not.
- `num_feedback_requests`: The number of times the LLM asked for feedback.
- `is_good_answer`: Whether the LLM's final answer is good or not.
- `messages`: The conversation history between the user and the LLM.
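
As a rough illustration of how node outputs end up in this state, the sketch below merges partial updates the way the fields above are meant to behave: `messages` accumulates, while the other keys are overwritten by the latest value. `apply_update` is a simplified, hypothetical stand-in written in plain Python. It is not LangGraph code, and the real `add_messages` reducer does more (for example, it can replace an existing message that has the same ID).

```python
from typing import Any


def apply_update(state: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Merge a node's partial update into the state (simplified stand-in, not LangGraph code)."""
    merged = dict(state)
    for key, value in update.items():
        if key == "messages":
            # Reducer-style field: new messages are appended to the history.
            merged["messages"] = list(state.get("messages", [])) + list(value)
        else:
            # Plain fields are overwritten by the latest value.
            merged[key] = value
    return merged


state = {"messages": ["user: find papers on X"]}
state = apply_update(state, {"requires_research": True})
state = apply_update(state, {"messages": ["ai: here is my plan"]})
print(state)
```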

In [4]:
class AgentState(TypedDict):
    """The state of the agent during the paper research process."""
    # TypedDict fields cannot carry default values; nodes read optional keys with state.get(...)
    requires_research: bool
    num_feedback_requests: int
    is_good_answer: bool
    messages: Annotated[Sequence[BaseMessage], add_messages]

In [5]:
@tool("search-papers", args_schema=SearchPapersInput)
def search_papers(query: str, max_papers: int = 1) -> str:
    """Search for scientific papers using the CORE API.

    Example:
    {"query": "Attention is all you need", "max_papers": 1}

    Returns:
        A list of the relevant papers found with the corresponding relevant information.
    """
    try:
        return CoreAPIWrapper(top_k_results=max_papers).search(query)
    except Exception as e:
        return f"Error performing paper search: {e}"

@tool("download-paper")
def download_paper(url: str) -> str:
    """Download a specific scientific paper from a given URL.

    Example:
    {"url": "https://sample.pdf"}

    Returns:
        The paper content.
    """
    try:        
        http = urllib3.PoolManager(
            cert_reqs='CERT_NONE',
        )
        
        # Mock browser headers to avoid 403 error
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
        }
        max_retries = 5
        for attempt in range(max_retries):
            response = http.request('GET', url, headers=headers)
            if 200 <= response.status < 300:
                pdf_file = io.BytesIO(response.data)
                with pdfplumber.open(pdf_file) as pdf:
                    text = ""
                    for page in pdf.pages:
                        # Guard against pages with no extractable text
                        text += (page.extract_text() or "") + "\n"
                return text
            elif attempt < max_retries - 1:
                time.sleep(2 ** (attempt + 2))
            else:
                # urllib3 responses expose `status` and `data` (not `status_code` / `text`)
                raise Exception(f"Got non 2xx when downloading paper: {response.status} {response.data}")
    except Exception as e:
        return f"Error downloading paper: {e}"

@tool("ask-human-feedback")
def ask_human_feedback(question: str) -> str:
    """Ask for human feedback. You should call this tool when encountering unexpected errors."""
    return input(question)

tools = [search_papers, download_paper, ask_human_feedback]
tools_dict = {tool.name: tool for tool in tools}

In [6]:
# LLMs
base_llm = ChatOpenAI(model="gpt-4.1-2025-04-14", temperature=0.0)
decision_making_llm = base_llm.with_structured_output(DecisionMakingOutput)
agent_llm = base_llm.bind_tools(tools)
judge_llm = base_llm.with_structured_output(JudgeOutput)

# Decision making node
def decision_making_node(state: AgentState):
    """Entry point of the workflow. Based on the user query, the model either responds directly or requests research, which routes the workflow to the planning node."""
    system_prompt = SystemMessage(content=decision_making_prompt)
    response: DecisionMakingOutput = decision_making_llm.invoke([system_prompt] + state["messages"])
    output = {"requires_research": response.requires_research}
    if response.answer:
        output["messages"] = [AIMessage(content=response.answer)]
    return output

# Task router function
def router(state: AgentState):
    """Router directing the user query to the appropriate branch of the workflow."""
    if state["requires_research"]:
        return "planning"
    else:
        return "end"

# Planning node
def planning_node(state: AgentState):
    """Planning node that creates a step by step plan to answer the user query."""
    system_prompt = SystemMessage(content=planning_prompt.format(tools=format_tools_description(tools)))
    response = base_llm.invoke([system_prompt] + state["messages"])
    return {"messages": [response]}

# Tool call node
def tools_node(state: AgentState):
    """Tool call node that executes the tools based on the plan."""
    outputs = []
    for tool_call in state["messages"][-1].tool_calls:
        tool_result = tools_dict[tool_call["name"]].invoke(tool_call["args"])
        outputs.append(
            ToolMessage(
                content=json.dumps(tool_result),
                name=tool_call["name"],
                tool_call_id=tool_call["id"],
            )
        )
    return {"messages": outputs}

# Agent call node
def agent_node(state: AgentState):
    """Agent call node that uses the LLM with tools to answer the user query."""
    system_prompt = SystemMessage(content=agent_prompt)
    response = agent_llm.invoke([system_prompt] + state["messages"])
    return {"messages": [response]}

# Should continue function
def should_continue(state: AgentState):
    """Check if the agent should continue or end."""
    messages = state["messages"]
    last_message = messages[-1]

    # Continue to the tools node if the model requested tool calls; otherwise hand off to the judge
    if last_message.tool_calls:
        return "continue"
    else:
        return "end"

# Judge node
def judge_node(state: AgentState):
    """Node to let the LLM judge the quality of its own final answer."""
    # End execution if the LLM failed to provide a good answer twice.
    num_feedback_requests = state.get("num_feedback_requests", 0)
    if num_feedback_requests >= 2:
        return {"is_good_answer": True}

    system_prompt = SystemMessage(content=judge_prompt)
    response: JudgeOutput = judge_llm.invoke([system_prompt] + state["messages"])
    output = {
        "is_good_answer": response.is_good_answer,
        "num_feedback_requests": num_feedback_requests + 1
    }
    if response.feedback:
        output["messages"] = [AIMessage(content=response.feedback)]
    return output

# Final answer router function
def final_answer_router(state: AgentState):
    """Router to end the workflow or improve the answer."""
    if state["is_good_answer"]:
        return "end"
    else:
        return "planning"

In [7]:
# Initialize the StateGraph
workflow = StateGraph(AgentState)

# Add nodes to the graph
workflow.add_node("decision_making", decision_making_node)
workflow.add_node("planning", planning_node)
workflow.add_node("tools", tools_node)
workflow.add_node("agent", agent_node)
workflow.add_node("judge", judge_node)

# Set the entry point of the graph
workflow.set_entry_point("decision_making")

# Add edges between nodes
workflow.add_conditional_edges(
    "decision_making",
    router,
    {
        "planning": "planning",
        "end": END,
    }
)
workflow.add_edge("planning", "agent")
workflow.add_edge("tools", "agent")
workflow.add_conditional_edges(
    "agent",
    should_continue,
    {
        "continue": "tools",
        "end": "judge",
    },
)
workflow.add_conditional_edges(
    "judge",
    final_answer_router,
    {
        "planning": "planning",
        "end": END,
    }
)

# Compile the graph
app = workflow.compile()
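
The judge can send the workflow back to planning, so it is worth checking that this loop is bounded. The sketch below is a hypothetical, LLM-free simulation of the control flow wired above: `simulate_judge_loop` and its stubbed verdicts are assumptions made for illustration, and the agent/tools round trips are collapsed into a single "agent" step.

```python
def simulate_judge_loop(judge_verdicts: list[bool], max_feedback: int = 2) -> list[str]:
    """Trace node visits once research has started (hypothetical simulation, no LLM calls).

    judge_verdicts[i] is the stubbed verdict of the i-th real judge evaluation.
    """
    trace = []
    num_feedback_requests = 0
    verdicts = iter(judge_verdicts)
    while True:
        trace += ["planning", "agent", "judge"]
        if num_feedback_requests >= max_feedback:
            # Mirrors judge_node: stop asking for feedback and accept the answer.
            break
        num_feedback_requests += 1
        if next(verdicts):
            # The judge accepted the answer, so the workflow ends.
            break
    return trace


# A judge that always rejects still terminates after three planning rounds.
print(simulate_judge_loop([False, False]))
```

Under these assumptions the judge LLM is consulted at most twice, and the third visit to the judge node accepts whatever answer the agent produced.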

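#### Example use case for PhD academic research
This cell tests the workflow with several example queries. These queries are designed to evaluate the agent on the following aspects:

- Completing tasks that are representative of the work a PhD researcher would need to perform
- Handling more specific tasks that require researching papers within a defined time frame
- Handling tasks that span a variety of research areas
- Critically evaluating its own answers by extracting specific information from the papers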

In [8]:
test_inputs = [
    "Download and summarize the findings of this paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC11379842/pdf/11671_2024_Article_4070.pdf",

    "Can you find 8 papers on quantum machine learning?",

    """Find recent papers (2023-2024) about CRISPR applications in treating genetic disorders, 
    focusing on clinical trials and safety protocols""",

    """Find and analyze papers from 2023-2024 about the application of transformer architectures in protein folding prediction, 
    specifically looking for novel architectural modifications with experimental validation."""
]

# Run tests and store the results for later visualisation
outputs = []
for test_input in test_inputs:
    final_answer = await print_stream(app, test_input)
    # print_stream returns None when the run produced no messages
    outputs.append(final_answer.content if final_answer else "")

## New research running

### Input:

Download and summarize the findings of this paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC11379842/pdf/11671_2024_Article_4070.pdf



### Stream:




{"url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC11379842/pdf/11671_2024_Article_4070.pdf"}



Tool Calls:
  download-paper (call_ppYUzbUoBQ8TxELpMS81ix5C)
 Call ID: call_ppYUzbUoBQ8TxELpMS81ix5C
  Args:
    url: https://pmc.ncbi.nlm.nih.gov/articles/PMC11379842/pdf/11671_2024_Article_4070.pdf



Name: download-paper

"Error downloading paper: No /Root object! - Is this really a PDF?"




I encountered an error while trying to download the paper from the provided link. It appears there may be an issue with the PDF file or its accessibility.

Could you please check the link or provide an alternative source for the paper? If you have the title or authors, I can also try to locate the paper using that information.




The answer does not summarize the findings of the requested paper. Instead, it only reports a technical error encountered during the download attempt and asks the user for more information. To be satisfactory, the answer should attempt to access the paper by alternative me

## New research running

### Input:

Can you find 8 papers on quantum machine learning?



### Stream:




SUBTASKS:

1. Search for 8 recent and relevant scientific papers on "quantum machine learning".
    - Tool: search-papers

Let's begin with this search.



Tool Calls:
  search-papers (call_69ZO5pIujeoomHb5QmY8h4aR)
 Call ID: call_69ZO5pIujeoomHb5QmY8h4aR
  Args:
    query: quantum machine learning
    max_papers: 8



Name: search-papers

"* ID: 73423988,\n* Title: Quantum Machine Learning,\n* Published Date: 2018-05-10T01:00:00+01:00,\n* Authors: Biamonte, Jacob and Wittek, Peter and Pancotti, Nicola and Rebentrost, Patrick and Wiebe, Nathan and Lloyd, Seth,\n* Abstract: Fuelled by increasing computer power and algorithmic advances, machine\nlearning techniques have become powerful tools for finding patterns in data.\nSince quantum systems produce counter-intuitive patterns believed not to be\nefficiently produced by classical systems, it is reasonable to postulate that\nquantum computers may outperform classical computers on machine learning tasks.\nThe field of quantum machine lea

## New research running

### Input:

Find recent papers (2023-2024) about CRISPR applications in treating genetic disorders, 
    focusing on clinical trials and safety protocols



### Stream:




Subtask 1: Search for recent (2023-2024) scientific papers about CRISPR applications in treating genetic disorders, with a focus on clinical trials and safety protocols.
Tool: search-papers
Parameters: 
{"query": "CRISPR applications in treating genetic disorders clinical trials safety protocols 2023 2024", "max_papers": 10}



Tool Calls:
  search-papers (call_fFFLlxrUx7wdhQ3gV2Pru8WC)
 Call ID: call_fFFLlxrUx7wdhQ3gV2Pru8WC
  Args:
    query: CRISPR AND genetic disorders AND (clinical trials OR safety protocols) AND yearPublished>=2023 AND yearPublished<=2024
    max_papers: 10



Name: search-papers

"* ID: 619731048,\n* Title: Balancing Progress and Ethics: Exploring the Science and Ethics of Gene Editing: Literature Review,\n* Published Date: 2024-01-01T08:00:00+00:00,\n* Authors: Burgess, Jackson,\n* Abstract: Two decades ago, the completion of the Human Genome Project marked a pivotal milestone in scientific history, unraveling the blueprint of human DNA and laying the foundati

## New research running

### Input:

Find and analyze papers from 2023-2024 about the application of transformer architectures in protein folding prediction, 
    specifically looking for novel architectural modifications with experimental validation.



### Stream:




Step 1: Search for Recent Papers (2023-2024) on Transformer Architectures in Protein Folding Prediction  
Tool: search-papers  
Reason: To identify the most recent and relevant literature on the application of transformer models to protein folding, focusing on novel architectural modifications and experimental validation.

{"query": "transformer architecture protein folding prediction novel modification experimental validation 2023 2024", "max_papers": 10}



Tool Calls:
  search-papers (call_WVw3Uy3FnVC6DwTeT3uDHFSU)
 Call ID: call_WVw3Uy3FnVC6DwTeT3uDHFSU
  Args:
    query: transformer architecture protein folding prediction AND yearPublished>=2023 AND yearPublished<=2024 AND (novel OR modification OR architecture) AND (experimental validation OR benchmark)
    max_papers: 10



Name: search-papers

"* ID: 553131833,\n* Title: Conditional Generation of Paired Antibody Chain Sequences through\n  Encoder-Decoder Language Model,\n* Published Date: 2023-04-04T01:00:00+01:00,\n* Authors:

In [9]:
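# Avoid naming the loop variable `input`: at module level it would shadow the builtin used by ask_human_feedback
for test_input, output in zip(test_inputs, outputs):
    display(Markdown(f"## Input:\n\n{test_input}\n\n"))
    display(Markdown(f"## Output:\n\n{output}\n\n"))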

## Input:

Download and summarize the findings of this paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC11379842/pdf/11671_2024_Article_4070.pdf



## Output:

I was unable to locate the correct paper using the provided PDF link or its metadata. The search did not return a relevant match for the article you referenced.

To assist you further, could you please provide the title or authors of the paper? Alternatively, if you have an abstract or keywords, I can use those to try to find and summarize the correct article.



## Input:

Can you find 8 papers on quantum machine learning?



## Output:

Here are 8 scientific papers on quantum machine learning:

1. Biamonte, J., Wittek, P., Pancotti, N., Rebentrost, P., Wiebe, N., & Lloyd, S. (2018). Quantum Machine Learning. This paper provides a comprehensive overview of the field, discussing how quantum computers may outperform classical computers on machine learning tasks and the challenges ahead. [arXiv:1611.09347](http://arxiv.org/abs/1611.09347)

2. Liu, D., Ran, S.-J., Wittek, P., Peng, C., García, R. B., Su, G., & Lewenstein, M. (2019). Machine Learning by Unitary Tensor Network of Hierarchical Tree Structure. This work explores the use of tensor networks from quantum many-body physics for machine learning, particularly in image recognition. [arXiv:1710.04833](http://arxiv.org/abs/1710.04833)

3. Sheng, Y.-B., & Zhou, L. (2015). Blind quantum machine learning. The authors propose a protocol for blind quantum machine learning, allowing a classical client to delegate quantum machine learning tasks to a quantum server while preserving data privacy. [arXiv:1507.07195](http://arxiv.org/abs/1507.07195)

4. Gao, J., Qiao, L.-F., Jiao, Z.-Q., Ma, Y.-C., Hu, C.-Q., Ren, R.-J., Yang, A.-L., Tang, H., Yung, M.-H., & Jin, X.-M. (2018). Experimental Machine Learning of Quantum States. This paper demonstrates a machine-learning approach to classify quantum states experimentally, using neural networks to identify separability without full state tomography. [arXiv:1712.00456](http://arxiv.org/abs/1712.00456)

5. Schuld, M., Sinayskiy, I., & Petruccione, F. (2014). An introduction to quantum machine learning. This review systematically overviews the emerging field, discussing both the translation of classical algorithms to quantum settings and the potential for quantum speedup. [arXiv:1409.3097](http://arxiv.org/abs/1409.3097)

6. Benedetti, M., Realpe-Gómez, J., & Perdomo-Ortiz, A. (2018). Quantum-assisted Helmholtz machines: A quantum-classical deep learning framework for industrial datasets in near-term devices. The authors introduce a hybrid quantum-classical framework for unsupervised generative modeling using quantum annealers. [arXiv:1708.09784](http://arxiv.org/abs/1708.09784)

7. Zhaokai, L., Xiaomei, L., Nanyang, X., & jiangfeng, D. (2014). Experimental Realization of Quantum Artificial Intelligence. This work demonstrates a quantum machine learning algorithm on a four-qubit NMR system for optical character recognition, marking an early realization of quantum AI. [arXiv:1410.1054](http://arxiv.org/abs/1410.1054)

8. Zhao, Z., Fitzsimons, J. K., Rebentrost, P., Dunjko, V., & Fitzsimons, J. F. (2019). Smooth input preparation for quantum and quantum-inspired machine learning. The paper addresses the challenge of efficiently preparing quantum states proportional to high-dimensional data, showing that robust algorithms can achieve this with constant queries. [arXiv:1804.00281](http://arxiv.org/abs/1804.00281)

These papers collectively cover foundational theory, experimental demonstrations, privacy protocols, and practical challenges in quantum machine learning.



## Input:

Find recent papers (2023-2024) about CRISPR applications in treating genetic disorders, 
    focusing on clinical trials and safety protocols



## Output:

Here are several recent (2023–2024) scientific papers focusing on CRISPR applications in treating genetic disorders, with an emphasis on clinical trials and safety protocols:

1. **Progress and harmonization of gene editing to treat human diseases: Proceeding of COST Action CA21113 GenE-HumDi**  
   - This paper discusses the GenE-HumDi network, which brings together stakeholders from academia, industry, and regulatory agencies to expedite the clinical translation of genome editing for human diseases. The network specifically addresses safety concerns, delivery systems, and the development of regulatory guidelines, aiming to standardize procedures and facilitate knowledge sharing. The inaugural meeting in 2023 highlighted breakthroughs in delivery methods, safety measures, and regulatory aspects for clinical trials in gene editing.  
   - [Read the paper](https://iris.unimore.it/bitstream/11380/1327197/2/1-s2.0-S2162253123002846-main.pdf)  
   - [Alternate link](https://digibug.ugr.es/bitstream/10481/86007/1/1-s2.0-S2162253123002846-main.pdf)  
   - (Cavazza et al., 2023; Cavazza et al., 2023)

2. **CRISPR-Cas9 Gene Editing Tool: Potential Treatment for Sickle Cell Disease**  
   - This 2024 paper reviews the use of CRISPR-Cas9 in treating sickle cell disease, a monogenic disorder. It discusses the clinical trial landscape, including the first FDA-approved CRISPR therapy (CASGEVY), and addresses safety protocols such as off-target analysis and long-term monitoring.  
   - [Read the paper](https://digitalcommons.sacredheart.edu/cgi/viewcontent.cgi?article=2403&context=acadfest)  
   - (Young, 2024)

3. **Investigating the Potential of a Cell-Based Gene Editing Therapy for Inherited Metabolic Liver Disease**  
   - This 2023 dissertation explores a non-viral, ex vivo CRISPR-Cas9 approach for treating inherited metabolic liver diseases. It details protocols for hepatocyte isolation, gene editing, and transplantation, and evaluates safety and efficacy in preclinical models. The study emphasizes the importance of delivery methods and off-target screening for clinical translation.  
   - [Read the paper](https://tigerprints.clemson.edu/cgi/viewcontent.cgi?article=4432&context=all_dissertations)  
   - (Ates, 2023)

4. **iPSC-derived liver organoids and inherited bleeding disorders: potential and future perspectives**  
   - This 2024 review discusses combining iPSC and CRISPR/Cas9 technologies to generate gene-corrected liver organoids for treating hereditary coagulation disorders. It highlights the potential for autologous, gene-corrected cell therapies and discusses safety considerations in preclinical and future clinical applications.  
   - [Read the paper](https://research-repository.st-andrews.ac.uk/bitstream/10023/30326/1/Roman-iPSC-derived-liver-organoids-FPHYS-14-1094249-CCBY.pdf)  
   - (Roman et al., 2024)

5. **A roadmap for affordable genetic medicines**  
   - This 2024 paper, co-authored by Jennifer Doudna, discusses the approval of the first CRISPR therapy for sickle cell disease and addresses the challenges of affordability and access. It also touches on regulatory and safety frameworks necessary for broader clinical implementation of CRISPR-based therapies.  
   - [Read the paper](https://escholarship.org/content/qt6d59f7sv/qt6d59f7sv.pdf)  
   - (Kliegman et al., 2024)

6. **Emerging Therapies in Retinal Diseases: From Gene Therapy to Stem Cell Interventions**  
   - This 2023 review covers gene therapy (including CRISPR) for hereditary retinal disorders, summarizing clinical trial results, safety issues (such as delivery and long-term effects), and the need for further research to address these challenges.  
   - [Read the paper](https://ijritcc.org/index.php/ijritcc/article/download/9421/7238)  
   - (Sarwate et al., 2023)

These papers collectively provide a comprehensive overview of the current state of CRISPR-based therapies for genetic disorders, with a strong focus on clinical translation, safety protocols, and regulatory considerations. For more detailed information, please refer to the full texts linked above.



## Input:

Find and analyze papers from 2023-2024 about the application of transformer architectures in protein folding prediction, 
    specifically looking for novel architectural modifications with experimental validation.



## Output:

I have identified several recent (2023–2024) papers on the application of transformer architectures in protein folding prediction, with a focus on novel architectural modifications and experimental validation. Below is an analysis of the most relevant works, with inline citations and links to the full texts where available.

---

### 1. **Endowing Protein Language Models with Structural Knowledge**  
**Authors:** Chen, Dexiong et al.  
**Published:** 2024  
**Summary:**  
This paper introduces the Protein Structure Transformer (PST), a novel framework that integrates protein structural data into transformer-based protein language models. The key architectural innovation is the refinement of self-attention mechanisms using structure extractor modules, inspired by graph transformers. PST is pretrained on a relatively small protein structure database (542K structures) and demonstrates superior parameter efficiency and predictive performance compared to state-of-the-art sequence-based models like ESM-2. The authors provide empirical validation, showing that PST outperforms ESM-2 in protein function prediction tasks, setting a new benchmark.  
**Experimental Validation:** The model is benchmarked against ESM-2 and evaluated on protein function prediction, demonstrating improved performance.  
**Novelty:** Integration of structural information directly into the transformer’s attention mechanism, leading to more efficient and effective protein modeling.  
**Link:** [arXiv:2401.14819](http://arxiv.org/abs/2401.14819)  
**Citation:** Chen et al., 2024.

---

### 2. **Multi-level Protein Representation Learning for Blind Mutational Effect Prediction**  
**Authors:** Tan, Yang et al.  
**Published:** 2023  
**Summary:**  
This work presents a pre-training framework that cascades sequential and geometric analyzers for protein primary and tertiary structures. The approach combines transformer-based sequence models with geometric modules to better capture spatial characteristics crucial for protein folding and stability. The framework is validated on public and new databases for variant effect prediction, achieving state-of-the-art performance in zero-shot learning for both single-site and deep mutations.  
**Experimental Validation:** The model is tested on multiple datasets, including new ones, and compared to other zero-shot learning methods, showing superior results.  
**Novelty:** Cascading sequential (transformer) and geometric analyzers for multi-level protein representation, directly addressing the limitations of sequence-only models.  
**Link:** [arXiv:2306.04899](http://arxiv.org/abs/2306.04899)  
**Citation:** Tan et al., 2023.

---

### 3. **Conditional Generation of Paired Antibody Chain Sequences through Encoder-Decoder Language Model (pAbT5)**  
**Authors:** Chu, Simon K. S. and Wei, Kathy Y.  
**Published:** 2023  
**Summary:**  
pAbT5 is a T5-based encoder-decoder transformer model for antibody chain pairing, a key aspect of protein-protein interactions. The model generates variable-length sequences, and its predictions align with experimental measurements. The authors describe it as the first generative encoder-decoder protein language model for protein-protein interactions and report state-of-the-art unsupervised prediction on experimental data.  
**Experimental Validation:** The model’s predictions are validated against experimental measurements and position-specific scoring matrices.  
**Novelty:** Described by the authors as the first T5-based encoder-decoder architecture for generative modeling of protein-protein interactions, with experimental validation.  
**Link:** [arXiv:2301.02748](http://arxiv.org/abs/2301.02748)  
**Citation:** Chu & Wei, 2023.

---

### 4. **Machine learning for the design of protein-protein interactions (PPIformer)**  
**Authors:** Bushuiev, Anton  
**Published:** 2023  
**Summary:**  
This thesis introduces PPIformer, a novel self-supervised geometric deep-learning model for protein-protein interaction design. The model leverages geometric deep learning and transformer architectures to address the limitations of existing methods. Preliminary analysis shows high potential for overcoming current challenges in protein-protein interaction design.  
**Experimental Validation:** Limited to preliminary analysis, which indicates strong potential; full experimental validation is not reported here.  
**Novelty:** Self-supervised geometric transformer model for protein-protein interaction design.  
**Link:** [Full thesis PDF](https://dspace.cvut.cz/bitstream/10467/108812/-1/F8-DP-2023-Bushuiev-Anton-thesis.pdf)  
**Citation:** Bushuiev, 2023.

---

### 5. **Application of coevolution-based methods and deep learning for structure prediction of protein complexes**  
**Authors:** Desai, Nikita  
**Published:** 2023  
**Summary:**  
This thesis explores the use of coevolution-based tools and deep learning (including attention mechanisms) for predicting inter-chain and intra-chain residue contacts in protein dimers. The methods are combined with docking approaches to predict homodimer and heterodimer structures.  
**Experimental Validation:** The methods are benchmarked on protein complex prediction tasks, showing improvements over previous approaches.  
**Novelty:** Modification and extension of deep learning and coevolution-based methods for protein complex prediction.  
**Link:** [Full thesis PDF](https://discovery.ucl.ac.uk/10172986/2/PhDThesis_NikitaDesai_Final.pdf)  
**Citation:** Desai, 2023.

---

## Summary Table

| Paper (Year) | Novelty | Experimental Validation | Link |
|--------------|---------|------------------------|------|
| Chen et al. (2024) | Structural info in transformer attention | Benchmarked vs. ESM-2 | [arXiv](http://arxiv.org/abs/2401.14819) |
| Tan et al. (2023) | Cascaded sequential/geometric analyzers | Multiple datasets | [arXiv](http://arxiv.org/abs/2306.04899) |
| Chu & Wei (2023) | T5 encoder-decoder for antibody pairing | Experimental data | [arXiv](http://arxiv.org/abs/2301.02748) |
| Bushuiev (2023) | Geometric transformer (PPIformer) | Preliminary | [PDF](https://dspace.cvut.cz/bitstream/10467/108812/-1/F8-DP-2023-Bushuiev-Anton-thesis.pdf) |
| Desai (2023) | Coevolution + deep learning for complexes | Benchmarked | [PDF](https://discovery.ucl.ac.uk/10172986/2/PhDThesis_NikitaDesai_Final.pdf) |

---

## Conclusion

Recent advances in transformer architectures for protein folding prediction have focused on integrating structural information into attention mechanisms, combining sequence and geometric representations, and developing encoder-decoder models for protein-protein interactions. Most of these innovations are validated empirically by benchmarking against state-of-the-art models and testing on diverse datasets, with reported gains in protein structure and function prediction; the PPIformer results (Bushuiev, 2023) remain preliminary.

**References:**  
- Chen et al., 2024 ([arXiv:2401.14819](http://arxiv.org/abs/2401.14819))  
- Tan et al., 2023 ([arXiv:2306.04899](http://arxiv.org/abs/2306.04899))  
- Chu & Wei, 2023 ([arXiv:2301.02748](http://arxiv.org/abs/2301.02748))  
- Bushuiev, 2023 ([PDF](https://dspace.cvut.cz/bitstream/10467/108812/-1/F8-DP-2023-Bushuiev-Anton-thesis.pdf))  
- Desai, 2023 ([PDF](https://discovery.ucl.ac.uk/10172986/2/PhDThesis_NikitaDesai_Final.pdf))

