# Tutorial 1: Getting Started with CycleResearcher (Hugging Face)

CycleResearcher can be run with the Hugging Face Transformers library, which provides a straightforward, widely used interface for working with large language models. This route may be somewhat slower than vLLM, but it offers broad compatibility and a simpler setup for most users.

In [5]:
import json

import torch
from cycleresearch.utils import get_paper_from_generated_text
from tqdm import trange
from transformers import AutoModelForCausalLM, AutoTokenizer




## Loading Data and Model

Let's start by loading our pre-prepared data and model. The input data should be a set of references in BibTeX format, each including an abstract. BibTeX is a reference management format widely used in academic publishing, which makes it a natural input for research paper generation.

Example BibTeX format:
```bibtex
@article{example2024,
    title = {Sample Paper Title},
    author = {Author, A. and Author, B.},
    journal = {Journal Name},
    year = {2024},
    abstract = {This is a sample abstract that would be used as input...}
}
```
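
As a rough illustration of how such entries might be packed into the chat format that the loading code reads, here is a minimal sketch. It assumes each record carries a `messages` list whose first two entries are a system message and a user message containing the BibTeX references; `build_messages` is a hypothetical helper, not part of the CycleResearcher package, and the system prompt is shortened to a placeholder.

```python
import json


def build_messages(system_prompt, bibtex_entries):
    """Pack BibTeX entries into a two-message chat (system + user)."""
    # Join the entries with blank lines so each reference stays readable.
    references = "\n\n".join(entry.strip() for entry in bibtex_entries)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": references},
    ]


entry = """@article{example2024,
    title = {Sample Paper Title},
    author = {Author, A. and Author, B.},
    journal = {Journal Name},
    year = {2024},
    abstract = {This is a sample abstract that would be used as input...}
}"""

# Placeholder system prompt (assumption); the real prompt ships with the demo data.
record = {"messages": build_messages("You are a research assistant AI ...", [entry])}
print(json.dumps(record, indent=2))
```

A list of such records, saved as JSON, would have the same shape that `data[0]['messages'][:2]` expects.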



In [8]:
with open('./demo_data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Print sample data
print(json.dumps(data[0]['messages'][:2]))

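# Initialize tokenizer and model
model_name = 'WestlakeNLP/CycleResearcher-12B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"  # Decoder-only models need left padding for batched generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding=True below requires a pad token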
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Use float16 for efficiency
    device_map="auto",  # Automatically handle model placement
    max_memory={i: "24GiB" for i in range(torch.cuda.device_count())},  # Adjust memory per GPU
)

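# Generation parameters (equivalent to vLLM's SamplingParams)
generation_config = {
    "max_length": 19000,  # Counts prompt tokens plus generated tokens
    "temperature": 0.1,
    "top_p": 0.95,
    # Fall back to the EOS token when the tokenizer defines no pad token
    "pad_token_id": tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
    "do_sample": True,
}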


[{"role": "system", "content": "You are a research assistant AI tasked with generating a scientific paper based on provided literature. Follow these steps:\n\n1. Analyze the given References. \n2. Identify gaps in existing research to establish the motivation for a new study.\n3. Propose a main idea for a new research work.\n4. Write the paper's main content in LaTeX format, including:\n   - Title\n   - Abstract\n   - Introduction\n   - Related Work\n   - Methods/\n5. Generate experimental setup details in JSON format to guide researchers.\n6. After receiving experimental results in JSON format, analyze them.\n7. Complete the paper by writing:\n   - Results\n   - Discussion\n   - Conclusion\n   - Contributions\n\nEnsure all content is original, academically rigorous, and follows standard scientific writing conventions."}, {"role": "user", "content": "@article{liu2023visual,\n  title={Visual instruction tuning},\n  author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae
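
To get a feel for why the model is loaded in `float16` and spread across GPUs, a back-of-the-envelope estimate of the weight memory can help. The sketch below assumes roughly 12 billion parameters, as the model name suggests; the exact count may differ, and the figure covers weights only, not activations or the key-value cache used during generation.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 12e9  # Assumption: taken from the "12B" in the model name

fp16_gib = weight_memory_gib(NUM_PARAMS, 2)  # float16 uses 2 bytes per parameter
fp32_gib = weight_memory_gib(NUM_PARAMS, 4)  # float32 uses 4 bytes per parameter

print(f"float16 weights: about {fp16_gib:.1f} GiB")
print(f"float32 weights: about {fp32_gib:.1f} GiB")
```

Under these assumptions the `float16` weights alone come close to a 24 GiB per-GPU budget, so long generations will likely need more than one such GPU or a larger card.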

## Generating and Processing Output

Finally, we generate output from CycleResearcher by calling `model.generate`. Note that the raw generated text must be passed through the `get_paper_from_generated_text` function, which returns a dictionary of structured paper components.

The function handles the parsing and formatting, so the output keeps a proper academic paper structure while remaining easy to process or analyze further:

* `item['latex']` contains the main body of the model-generated paper.
* `item['generated_text3']` contains the original complete text generated by the model.
* `item['motivation']` contains the motivation section explaining why this research was conducted.
* `item['idea']` contains the main idea section describing the core methodology and approach.
* `item['interestingness']` contains the section explaining why this research is significant.
* `item['feasibility']` contains the analysis of how practical and implementable the approach is.
* `item['novelty']` contains the discussion of what makes this research unique and innovative.
* `item['title']` contains the paper title extracted from the LaTeX formatting.
* `item['abstract']` contains the paper abstract summarizing the research.
* `item['Experimental_Setup']` contains the experimental configuration details, either as JSON or text.
* `item['Experimental_results']` contains the experimental outcomes and findings, either as JSON or text.
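
Since not every generation will necessarily yield every component, it can be safer to read these keys defensively. The following is a minimal sketch; `summarize_paper` is a hypothetical helper, and `fake_item` is a hand-written stand-in for the dictionary returned by `get_paper_from_generated_text`, not real model output.

```python
def summarize_paper(item, keys=("title", "abstract", "motivation", "idea")):
    """Return a short preview of selected fields, skipping missing or empty ones."""
    preview = {}
    for key in keys:
        value = item.get(key)  # .get avoids a KeyError when a component is absent
        if value:
            preview[key] = str(value).strip()[:80]  # Truncate long fields for display
    return preview


# Stand-in dictionary (assumption): mimics a few of the keys listed above.
fake_item = {
    "title": "Sample Paper Title",
    "abstract": "This is a sample abstract.",
    "latex": "\\section{Introduction} ...",
}

print(summarize_paper(fake_item))
```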


In [None]:
results = []
batch_size = 1  # Process one item at a time

# Process data in batches
for idx in trange(0, len(data), batch_size):
    batch_data = data[idx:idx+batch_size]
    
    # Prepare prompts
    prompts = []
    for item in batch_data:
        prompt = tokenizer.apply_chat_template(
            item['messages'][:2],
            tokenize=False,
            add_generation_prompt=True
        )
        prompts.append(prompt)
    
    # Tokenize inputs
    inputs = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True
    ).to(model.device)
    
    # Generate text
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            **generation_config,
            return_dict_in_generate=True,
        )
    
    # Process outputs
    for output_idx, generated_ids in enumerate(outputs.sequences):
        # Decode the generated text
        generated_text = tokenizer.decode(
            generated_ids[inputs['input_ids'].shape[1]:],  # Skip input tokens
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        )
        
        # Store results
        batch_data[output_idx]['generated_text'] = generated_text
        results.append(batch_data[output_idx])

# Process and print results
for idx, result in enumerate(results):
    item = get_paper_from_generated_text(result['generated_text'])
    if item is not None:
        print('Idx %d:\n %s\n' % (idx, item))
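
Printing is fine for a quick look, but generation at this length is slow, so it may be worth persisting the parsed papers. Below is a minimal sketch that writes records as JSON Lines and reads them back. `save_jsonl` and `load_jsonl` are hypothetical helpers, the records are stand-ins, and the sketch assumes the parsed dictionaries are JSON-serializable.

```python
import json
import os
import tempfile


def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the records back as a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in records (assumption); in the tutorial these would be the parsed `item` dictionaries.
parsed_items = [
    {"title": "Sample Paper Title", "abstract": "This is a sample abstract."},
]

out_path = os.path.join(tempfile.mkdtemp(), "generated_papers.jsonl")
save_jsonl(parsed_items, out_path)
reloaded = load_jsonl(out_path)
print(len(reloaded), "record(s) written to", out_path)
```

One object per line keeps each paper independent, so a file from an interrupted run can still be read up to the last complete line.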