# Scientific Literature Mining Agents for Drug Discovery Research using smolagents

This notebook shows how to reuse the functions from the `drug_discovery_literature_mining.ipynb` notebook as tools for a smolagents agent. The agent researches a topic across several literature sources (PubMed, Google Scholar, bioRxiv, Google Patents), then compiles and summarizes the findings. The Google Scholar and Google Patents tools return mock data unless you enable their commented-out code.

## 1. Setting up the Environment

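First, install the required libraries. `torch` serves as the backend for the `transformers` question-answering pipeline, and the `litellm` extra is only needed for the `LiteLLMModel` options in Section 4.

In [None]:
%pip install pymed apify-client requests beautifulsoup4 transformers torch docling "smolagents[litellm]"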

## 2. Importing Libraries

In [None]:
# Import required libraries for literature mining
from pymed import PubMed
from apify_client import ApifyClient
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

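# Import smolagents libraries
# Recent smolagents releases rename HfApiModel to InferenceClientModel;
# the fallback keeps this cell working on older releases.
try:
    from smolagents import CodeAgent, tool, LiteLLMModel, InferenceClientModel as HfApiModel
except ImportError:
    from smolagents import CodeAgent, tool, LiteLLMModel, HfApiModel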

## 3. Defining Literature Mining Tools

We'll define the functions from the original notebook as tools for our agent.
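
The `@tool` decorator relies on each function's type hints and docstring (including the `Args:` section) to describe the tool to the model. The sketch below is not smolagents' implementation; it only uses the standard `inspect` module to show the kind of metadata such a function exposes. `describe_tool` and `example_search` are hypothetical names introduced for illustration.

```python
import inspect


def describe_tool(func):
    """Collect the metadata an agent framework could read from a function."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        # Map each parameter name to the name of its annotated type
        "inputs": {
            name: getattr(param.annotation, "__name__", str(param.annotation))
            for name, param in signature.parameters.items()
        },
        "output": getattr(
            signature.return_annotation,
            "__name__",
            str(signature.return_annotation),
        ),
        # First paragraph of the docstring serves as the description
        "description": inspect.getdoc(func).split("\n\n")[0],
    }


def example_search(query: str, max_results: int = 10) -> list:
    """
    Search a literature source.

    Args:
        query: The search query.
        max_results: Maximum number of results to return.
    """
    return []


print(describe_tool(example_search))
```

A function without type hints or without argument descriptions gives the framework nothing to show the model, which is why every tool in this notebook carries both.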

In [None]:
@tool
def pubmed_agent(query: str, max_results: int = 10) -> list:
    """
    Search PubMed for scientific articles related to the query.
    
    Args:
        query: The search query for PubMed.
        max_results: Maximum number of results to return (default: 10).
        
    Returns:
        A list of dictionaries containing article information.
    """
    # Placeholder values: replace with your own tool name and contact email
    pubmed = PubMed(tool="MyTool", email="my@email.address")
    results = pubmed.query(query, max_results=max_results)
    articles = [article.toDict() for article in results]
    
    # Extract relevant information for easier processing.
    # A field can be present but None, so fall back with `or`
    # instead of relying on the .get() default.
    simplified_articles = []
    for article in articles:
        authors = article.get('authors') or []
        simplified_article = {
            'title': article.get('title') or 'No title',
            'abstract': article.get('abstract') or 'No abstract',
            'authors': ', '.join(
                f"{author.get('lastname') or ''} {author.get('firstname') or ''}".strip()
                for author in authors
            ),
            'journal': article.get('journal') or 'No journal',
            'publication_date': str(article.get('publication_date') or 'No date'),
            'doi': article.get('doi') or 'No DOI'
        }
        simplified_articles.append(simplified_article)
    
    return simplified_articles
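
The field extraction inside `pubmed_agent` can be checked without a network call if it lives in its own function. The following is a minimal sketch under that assumption: `simplify_article` is a hypothetical helper, and `sample` is a hand-written record standing in for the output of `article.toDict()`, with some fields set to `None` to exercise the fallbacks.

```python
def simplify_article(article):
    """Reduce one article dict to the fields the agent needs (hypothetical helper)."""
    authors = article.get("authors") or []
    names = [
        f"{a.get('lastname') or ''} {a.get('firstname') or ''}".strip()
        for a in authors
    ]
    return {
        "title": article.get("title") or "No title",
        "abstract": article.get("abstract") or "No abstract",
        # Skip authors whose names are entirely missing
        "authors": ", ".join(name for name in names if name),
        "journal": article.get("journal") or "No journal",
        "publication_date": str(article.get("publication_date") or "No date"),
        "doi": article.get("doi") or "No DOI",
    }


# Hand-written record, not real PubMed data
sample = {
    "title": "Example article title",
    "abstract": None,
    "authors": [
        {"lastname": "Smith", "firstname": "Jane"},
        {"lastname": "Consortium", "firstname": None},
        {"lastname": None, "firstname": None},
    ],
    "journal": "Example Journal",
    "publication_date": None,
    "doi": None,
}

print(simplify_article(sample))
```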

In [None]:
@tool
def google_scholar_agent(keyword: str, max_results: int = 10) -> list:
    """
    Search Google Scholar for scientific articles related to the keyword.
    
    Args:
        keyword: The search keyword for Google Scholar.
        max_results: Maximum number of results to return (default: 10).
        
    Returns:
        A list of dictionaries containing article information.
    """
    # Note: You need to replace <YOUR_API_TOKEN> with a valid Apify API token
    # For demonstration purposes, we'll return a mock response
    
    # Uncomment the following code when you have a valid API token
    # client = ApifyClient("<YOUR_API_TOKEN>")
    # run_input = {
    #     "keyword": keyword,
    #     "proxyOptions": {"useApifyProxy": True},
    #     "maxResults": max_results
    # }
    # run = client.actor("marco.gullo/google-scholar-scraper").call(run_input=run_input)
    # return list(client.dataset(run["defaultDatasetId"]).iterate_items())
    
    # Mock response for demonstration
    return [
        {
            "title": f"Mock Google Scholar Result 1 for {keyword}",
            "authors": "Author A, Author B",
            "publication": "Journal of Mock Science",
            "year": "2023",
            "cited_by": 42,
            "link": "https://scholar.google.com/example1"
        },
        {
            "title": f"Mock Google Scholar Result 2 for {keyword}",
            "authors": "Author C, Author D",
            "publication": "International Journal of Mock Research",
            "year": "2022",
            "cited_by": 28,
            "link": "https://scholar.google.com/example2"
        }
    ]
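
Rather than pasting the Apify token into the notebook, it can be read from the environment, with the mock response used when no token is set. This is a sketch of that pattern only; the variable name `APIFY_API_TOKEN` and the helper `get_apify_token` are assumptions made for illustration, not names taken from the Apify client.

```python
import os


def get_apify_token(env=None):
    """Return the Apify token from the environment, or None if it is not set."""
    env = os.environ if env is None else env
    token = env.get("APIFY_API_TOKEN", "").strip()
    return token or None


# Passing a dict makes the behaviour easy to check without touching os.environ
print(get_apify_token({}))
print(get_apify_token({"APIFY_API_TOKEN": " example-token "}))
```

Inside `google_scholar_agent`, the tool could call `get_apify_token()` first and return the mock results only when the helper returns `None`.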

In [None]:
@tool
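def biorxiv_agent(query: str, max_results: int = 10) -> list:
    """
    Search recent bioRxiv preprints for the query.
    
    The bioRxiv API has no keyword search: its details endpoint takes a date
    interval and a cursor. This tool fetches the first page of preprints from
    the last 30 days and filters it locally by title and abstract.
    
    Args:
        query: The keyword or phrase to look for in titles and abstracts.
        max_results: Maximum number of results to return (default: 10).
        
    Returns:
        A list of dictionaries containing preprint information.
    """
    from datetime import date, timedelta
    
    # The 30-day window is an arbitrary choice; widen it if needed
    end = date.today()
    start = end - timedelta(days=30)
    url = f"https://api.biorxiv.org/details/biorxiv/{start}/{end}/0"
    response = requests.get(url, timeout=30)
    
    if response.status_code != 200:
        return [{"error": f"Failed to fetch data from bioRxiv: {response.status_code}"}]
    
    # Case-insensitive match on title and abstract
    needle = query.lower()
    matches = [
        item for item in response.json().get('collection', [])
        if needle in f"{item.get('title', '')} {item.get('abstract', '')}".lower()
    ]
    return matches[:max_results]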
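
The bioRxiv details endpoint returns results one page at a time, addressed by a cursor. The sketch below shows one way to walk several pages and keep only records that mention a keyword. To stay runnable offline, the page fetcher is passed in as a function: `search_biorxiv_pages` and `fake_fetch_page` are hypothetical names, and the records are hand-written placeholders rather than real preprints.

```python
def search_biorxiv_pages(keyword, fetch_page, max_results=10, max_pages=5):
    """Scan successive pages of records and keep those mentioning the keyword."""
    needle = keyword.lower()
    matches = []
    cursor = 0
    for _ in range(max_pages):
        records = fetch_page(cursor)
        if not records:
            break  # no more pages
        for record in records:
            text = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
            if needle in text:
                matches.append(record)
                if len(matches) >= max_results:
                    return matches
        # Advance the cursor by the number of records just read
        cursor += len(records)
    return matches


# Placeholder pages keyed by cursor position
fake_pages = {
    0: [
        {"title": "Example preprint on CRISPR delivery", "abstract": "Example abstract"},
        {"title": "Unrelated example", "abstract": "No match here"},
    ],
    2: [
        {"title": "Another example", "abstract": "Uses crispr screening"},
    ],
}


def fake_fetch_page(cursor):
    return fake_pages.get(cursor, [])


# A real fetcher might look like this (assumes `requests` is installed):
# def fetch_page(cursor):
#     url = f"https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-01-31/{cursor}"
#     return requests.get(url, timeout=30).json().get("collection", [])

hits = search_biorxiv_pages("CRISPR", fake_fetch_page, max_results=5)
print([hit["title"] for hit in hits])
```

The `max_pages` cap keeps a rare keyword from triggering an unbounded number of requests.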

In [None]:
@tool
def google_patents_agent(query: str, max_results: int = 10) -> list:
    """
    Search Google Patents for patents related to the query.
    
    Args:
        query: The search query for Google Patents.
        max_results: Maximum number of results to return (default: 10).
        
    Returns:
        A list of dictionaries containing patent information.
    """
    # Note: Web scraping Google Patents might be against their terms of service
    # For demonstration purposes, we'll return a mock response
    
    # Uncomment the following code at your own risk
    # url = f"https://patents.google.com/?q={query}&num={max_results}"
    # response = requests.get(url)
    # soup = BeautifulSoup(response.text, 'html.parser')
    # patents = soup.find_all('article', class_='result-item')
    # return [{'title': p.find('h3').text, 'link': p.find('a')['href']} for p in patents]
    
    # Mock response for demonstration
    return [
        {
            "title": f"Mock Patent 1 for {query}",
            "link": "https://patents.google.com/example1",
            "inventors": "Inventor A, Inventor B",
            "filing_date": "2022-01-15",
            "publication_date": "2023-07-22"
        },
        {
            "title": f"Mock Patent 2 for {query}",
            "link": "https://patents.google.com/example2",
            "inventors": "Inventor C, Inventor D",
            "filing_date": "2021-11-30",
            "publication_date": "2023-05-18"
        }
    ]
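
If the commented-out scraping code is enabled, the query should be URL-encoded before it is placed in the address, because spaces and characters such as `&` would otherwise change the meaning of the URL. A minimal sketch using the standard library follows; `build_patents_url` is a hypothetical helper, and the `q` and `num` parameters are simply the ones used in the commented-out code above.

```python
from urllib.parse import urlencode


def build_patents_url(query, max_results=10):
    """Build a Google Patents search URL with a safely encoded query string."""
    params = urlencode({"q": query, "num": max_results})
    return f"https://patents.google.com/?{params}"


print(build_patents_url("nanoparticle drug delivery & cancer"))
# → https://patents.google.com/?q=nanoparticle+drug+delivery+%26+cancer&num=10
```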

In [None]:
@tool
def analyze_document(document_text: str, questions: list) -> dict:
    """
    Analyze a document using a question-answering model.
    
    Args:
        document_text: The text of the document to analyze.
        questions: A list of questions to ask about the document.
        
    Returns:
        A dictionary mapping questions to answers.
    """
    # Split the document into sections
    sections = document_text.split("## ")
    sections = [s.strip() for s in sections if s.strip()]
    
    # Load the default question-answering pipeline
    question_answerer = pipeline("question-answering")
    
    # Use the whole document as the context for every question
    context = "\n".join(sections)
    
    # Get answers to the questions
    answers = {}
    for question in questions:
        result = question_answerer(question=question, context=context)
        answers[question] = result['answer']
    
    return answers
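
`analyze_document` splits the text on `"## "` and then joins the pieces back into one context. If the section titles are kept instead, a question can be asked against a single section. The sketch below illustrates that variant; `split_sections` is a hypothetical helper, and it assumes the document uses only level-2 Markdown headings.

```python
def split_sections(document_text):
    """Map each level-2 heading to the text that follows it."""
    sections = {}
    for chunk in document_text.split("## "):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The first line of each chunk is the heading, the rest is the body
        heading, _, body = chunk.partition("\n")
        sections[heading.strip()] = body.strip()
    return sections


sample_document = """## Methods
Placeholder description of the methods.

## Results
Placeholder description of the results.
"""

print(split_sections(sample_document))
```

Any text before the first `## ` heading ends up as its own entry, keyed by its first line.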

## 4. Creating a Research Agent

Now we'll create a CodeAgent that uses the literature mining tools to research a specific topic.

In [None]:
# Choose which LLM engine to use
# Uncomment one of the following options

# Option 1: Use HfApiModel (default)
model = HfApiModel()

# Option 2: Use LiteLLMModel with GPT-4o
# model = LiteLLMModel(model_id="gpt-4o")

# Option 3: Use LiteLLMModel with Claude
# model = LiteLLMModel(model_id="anthropic/claude-3-5-sonnet-20240620")
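
`LiteLLMModel` delegates to LiteLLM, which reads provider credentials from environment variables such as `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`. A quick check before creating the model can produce a clearer message than a failed request halfway through an agent run. This is a sketch only: `REQUIRED_KEYS` and `missing_api_key` are hypothetical names, and the mapping covers just the two model IDs listed above.

```python
import os

# Environment variable expected for each model ID used in this notebook
REQUIRED_KEYS = {
    "gpt-4o": "OPENAI_API_KEY",
    "anthropic/claude-3-5-sonnet-20240620": "ANTHROPIC_API_KEY",
}


def missing_api_key(model_id, env=None):
    """Return the name of the missing environment variable, or None if all is set."""
    env = os.environ if env is None else env
    key_name = REQUIRED_KEYS.get(model_id)
    if key_name and not env.get(key_name):
        return key_name
    return None


# Passing a dict avoids depending on the real environment
print(missing_api_key("gpt-4o", env={}))
print(missing_api_key("gpt-4o", env={"OPENAI_API_KEY": "example-key"}))
```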

In [None]:
# Create the research agent
research_agent = CodeAgent(
    tools=[
        pubmed_agent,
        google_scholar_agent,
        biorxiv_agent,
        google_patents_agent,
        analyze_document
    ],
    model=model,
    max_steps=15,  # Limit the number of steps to avoid excessive API calls
    verbosity_level=2  # Show detailed output
)

## 5. Using the Research Agent

Now we can use the research agent to research a specific topic across the literature sources and compile the findings.

In [None]:
# Example 1: Research on novel drug delivery systems
research_topic = """
Research the latest developments in nanoparticle-based drug delivery systems for cancer treatment. 
Focus on:
1. Recent breakthroughs in the last 2 years
2. Different types of nanoparticles being used
3. Clinical trial status of these technologies
4. Major challenges and limitations

Compile a comprehensive summary of your findings with references.
"""

# Run the agent
result = research_agent.run(research_topic)
print(result)
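
An agent run can take many steps, so it may be worth writing the result to disk instead of only printing it. The sketch below saves the prompt and the answer to a timestamped Markdown file. `save_report` and the `reports` directory are hypothetical, and the demonstration uses placeholder strings rather than a real agent result.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_report(result, prompt, output_dir="reports"):
    """Write the prompt and the agent's answer to a timestamped Markdown file."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = directory / f"research_{stamp}.md"
    body = (
        "# Research report\n\n"
        f"## Prompt\n\n{prompt.strip()}\n\n"
        f"## Findings\n\n{result}\n"
    )
    path.write_text(body, encoding="utf-8")
    return path


# Demonstration with placeholder text instead of a real agent run
with tempfile.TemporaryDirectory() as tmp:
    report_path = save_report("Placeholder findings.", "Placeholder prompt.", output_dir=tmp)
    print(report_path.name)
```

In the notebook this would be called as `save_report(result, research_topic)` after the run completes.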

## 6. More Research Examples

Here are some additional examples of research topics that can be explored using the agent.

In [None]:
# Example 2: Research on CRISPR gene editing for rare diseases
research_topic_2 = """
Research the application of CRISPR gene editing technologies for treating rare genetic diseases. 
Focus on:
1. Recent clinical trials and their outcomes
2. Specific rare diseases being targeted
3. Delivery methods for CRISPR components
4. Ethical considerations and regulatory status

Compile a comprehensive summary of your findings with references.
"""

# Uncomment to run this example
# result_2 = research_agent.run(research_topic_2)
# print(result_2)

In [None]:
# Example 3: Research on AI-driven drug discovery
research_topic_3 = """
Research the role of artificial intelligence in accelerating drug discovery and development. 
Focus on:
1. AI algorithms and models being used
2. Success stories and case studies
3. Companies and research institutions leading the field
4. Future prospects and challenges

Compile a comprehensive summary of your findings with references.
"""

# Uncomment to run this example
# result_3 = research_agent.run(research_topic_3)
# print(result_3)

## 7. Custom Research Function

Let's create a function that makes it easy to conduct research on any topic.

In [None]:
def conduct_research(topic, focus_areas=None, max_results_per_source=5):
    """
    Conduct research on a specific topic across multiple literature sources.
    
    Args:
        topic: The main research topic.
        focus_areas: A list of specific areas to focus on (optional).
        max_results_per_source: Maximum number of results to retrieve from each source.
        
    Returns:
        A comprehensive research summary.
    """
    # Construct the research prompt
    prompt = f"Research the topic of {topic}.\n"
    
    if focus_areas:
        prompt += "Focus on:\n"
        for i, area in enumerate(focus_areas, 1):
            prompt += f"{i}. {area}\n"
    
    prompt += f"\nUse a maximum of {max_results_per_source} results from each source.\n"
    prompt += "\nCompile a comprehensive summary of your findings with references."
    
    # Run the agent
    result = research_agent.run(prompt)
    return result
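
Because `conduct_research` builds the prompt and runs the agent in one step, the prompt cannot be inspected without triggering a run. Separating the prompt construction into a pure function makes it easy to preview. The sketch below reproduces the same prompt text; `build_research_prompt` is a hypothetical name.

```python
def build_research_prompt(topic, focus_areas=None, max_results_per_source=5):
    """Build the research prompt without running the agent."""
    lines = [f"Research the topic of {topic}."]
    if focus_areas:
        lines.append("Focus on:")
        lines.extend(f"{i}. {area}" for i, area in enumerate(focus_areas, 1))
    lines.append("")
    lines.append(f"Use a maximum of {max_results_per_source} results from each source.")
    lines.append("")
    lines.append("Compile a comprehensive summary of your findings with references.")
    return "\n".join(lines)


print(build_research_prompt(
    "microbiome-based therapeutics for inflammatory bowel disease",
    focus_areas=["Current clinical trials", "Mechanisms of action"],
    max_results_per_source=3,
))
```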

In [None]:
# Example usage of the conduct_research function
research_result = conduct_research(
    topic="microbiome-based therapeutics for inflammatory bowel disease",
    focus_areas=[
        "Current clinical trials",
        "Mechanisms of action",
        "Comparison with traditional treatments",
        "Safety and efficacy data"
    ],
    max_results_per_source=3
)

# Note: the call above runs the agent immediately.
# Uncomment to print the summary
# print(research_result)

## 8. Conclusion

In this notebook, we've demonstrated how to use smolagents to create a research agent that can search for scientific literature across multiple sources, analyze the findings, and compile a comprehensive summary. This approach can be extended to other research domains and customized for specific research needs.

Key benefits of using smolagents for literature mining:
1. Automated search across multiple sources
2. Consistent analysis and summarization
3. Ability to focus on specific aspects of a research topic
4. Integration with various LLM engines
5. Extensibility with additional tools and sources

Next steps could include:
- Adding more literature sources (e.g., Scopus, Web of Science)
- Implementing more sophisticated document analysis tools
- Creating specialized agents for different research domains
- Integrating with citation management systems
- Adding visualization tools for research findings