# 05.2 - Classification of Scientific Papers Using Open Hugging Face Models

This notebook explores how open LLMs, such as Mistral, Llama, Gemma, Specter, etc., can be used to classify scientific papers based on their content or abstracts. Specifically, these models will be used to detect papers that discuss infectious disease modeling, and to further identify which modeling techniques are used.

To increase classification accuracy, multiple models will be evaluated and employed.

## Data Preparation

Load the training dataset, which includes a column specifying whether or not each paper has been manually classified as a disease modeling paper.

In [None]:
from genscai.data import load_classification_training_data

df = load_classification_training_data()
df.head(5)

## Model Set-up

Create a model client for classifying papers. The model clients come in three varieties: AisuiteClient, OllamaClient, and HuggingFaceClient. The AisuiteClient works with cloud model providers (e.g. OpenAI) as well as models hosted locally with Ollama. The OllamaClient works only with models hosted locally with Ollama. The HuggingFaceClient uses the Hugging Face Transformers library to run models locally.

For local models, Ollama is preferred if device memory is limited, since Ollama-hosted models are typically 4-bit quantized. For greater control over quantization and model parameters, Hugging Face Transformers models are preferred.
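
To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch. It estimates the memory needed for the model weights alone at different precisions. The helper name `estimate_weight_memory_gb` is hypothetical, and the estimate ignores activations, the KV cache, and the overhead of quantization metadata, so real usage will be higher.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# A nominal 12-billion-parameter model at common weight precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{estimate_weight_memory_gb(12e9, bits):.0f} GB")
```

Under these assumptions, 4-bit quantization needs about a quarter of the weight memory of 16-bit weights, which is why a quantized model may fit on a device where the full-precision model does not.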

In [None]:
from aimu.models import HuggingFaceClient as ModelClient

model_kwargs = {
    "low_cpu_mem_usage": True,
    "torch_dtype": "auto",
    "device_map": "auto",
}

client = ModelClient(ModelClient.MODEL_GEMMA_3_12B, model_kwargs)

# the following only works for HuggingFaceClient, since it loads the model in-process with Transformers
client.print_model_info()
client.print_device_map()

## Paper Classification

Classify each of the papers in the dataframe. The classify_papers method adds a 'predict_modeling' column to the dataframe.

For local, non-reasoning models (e.g. Llama, Gemma, Phi), we want a low temperature, since we're looking for a classification that is as deterministic as possible. We also only need a single token in the response. For reasoning models (e.g. DeepSeek R1), however, we want a higher temperature, to promote reasoning, and more output tokens, since the output includes the reasoning text.
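
The effect of temperature can be illustrated with a small, self-contained sketch. It applies a temperature-scaled softmax to two made-up logits and compares the two temperature values used in this notebook (0.01 and 0.70). The function name and the logits are illustrative assumptions, not part of genscai or the model clients.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by a sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two candidate answer tokens.
logits = [2.0, 1.0]

low = softmax_with_temperature(logits, 0.01)   # nearly all mass on the top token
high = softmax_with_temperature(logits, 0.70)  # the second token keeps a real chance

print(low)
print(high)
```

At a temperature of 0.01 the top token is selected almost every time, which is what a single-token classification needs. At 0.70 the lower-scored token is still sampled fairly often, which gives a reasoning model room to vary its intermediate text.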

In [None]:
import genscai.classification as gc

generate_kwargs = gc.CLASSIFICATION_GENERATE_KWARGS.copy()

# increase temperature (from 0.01) and max_new_tokens (from 1) to allow for longer text generation for reasoning models
if client.model_id == ModelClient.MODEL_DEEPSEEK_R1_8B:
    generate_kwargs.update(
        {
            "max_new_tokens": 1024,
            "temperature": 0.70,
        }
    )

# start with the default prompt for classification tasks
task_prompt = gc.CLASSIFICATION_TASK_PROMPT_TEMPLATE

df = gc.classify_papers(client, task_prompt + gc.CLASSIFICATION_OUTPUT_PROMPT_TEMPLATE, generate_kwargs, df)
df.head(5)

## Classification Evaluation

Compute the overall precision, recall, and accuracy of the classifications against the manual labels.
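
For reference, the three metrics can be sketched in plain Python as follows. This is not the implementation of `gc.test_paper_classifications`, whose internals and return format are not shown here. The function name `classification_metrics` is hypothetical and the labels are made up for illustration.

```python
def classification_metrics(actual, predicted):
    """Compute precision, recall, and accuracy for paired boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives
    tn = sum(1 for a, p in pairs if not a and not p)  # true negatives
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Made-up labels, for illustration only.
is_modeling = [True, True, False, False, True]
predict_modeling = [True, False, False, True, True]

example_metrics = classification_metrics(is_modeling, predict_modeling)
print(example_metrics)
```

Precision is the share of papers predicted as modeling papers that really are, recall is the share of true modeling papers that were found, and accuracy is the share of all predictions that were correct.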

In [None]:
df, metrics = gc.test_paper_classifications(df)
metrics

## Prompt Tuning

If the accuracy was below 100%, see if we can have the model tune the prompt to increase its classification accuracy.

In [None]:
from genscai.training import MUTATION_NEG_PROMPT_TEMPLATE, MUTATION_POS_PROMPT_TEMPLATE, MUTATION_GENERATE_KWARGS

# randomly select an incorrect result for mutating the task prompt
# (if every prediction was correct, df_bad is empty and sample() will fail)
df_bad = df.query("is_modeling != predict_modeling")
item = df_bad.sample().iloc[0]

# generate a mutation prompt based on the incorrect result
if item.is_modeling:
    mutation_prompt = MUTATION_POS_PROMPT_TEMPLATE.format(prompt=task_prompt, abstract=item.abstract)
else:
    mutation_prompt = MUTATION_NEG_PROMPT_TEMPLATE.format(prompt=task_prompt, abstract=item.abstract)

# generate a new prompt using the mutation prompt
result = client.generate_text(mutation_prompt, MUTATION_GENERATE_KWARGS)
task_prompt = result.split("<prompt>")[-1].split("</prompt>")[0].strip()
task_prompt
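
Note that the split-based extraction in the cell above returns the entire model response when no `<prompt>` tag is present. The following is a minimal sketch of a stricter alternative that keeps the previous prompt in that case. The helper name `extract_tagged_prompt` is hypothetical and not part of genscai.

```python
import re


def extract_tagged_prompt(text, fallback):
    """Return the text inside the last <prompt>...</prompt> pair.

    Falls back to the given prompt when no non-empty tagged section is found.
    """
    matches = re.findall(r"<prompt>(.*?)</prompt>", text, flags=re.DOTALL)
    if not matches:
        return fallback
    candidate = matches[-1].strip()
    return candidate or fallback


# Illustrative inputs only.
print(extract_tagged_prompt("Sure! <prompt> Classify this abstract. </prompt>", "old prompt"))
print(extract_tagged_prompt("The model ignored the format.", "old prompt"))
```

With this approach, a malformed response leaves the task prompt unchanged instead of replacing it with unrelated text.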

## Re-classify Papers

In [None]:
df = gc.classify_papers(client, task_prompt + gc.CLASSIFICATION_OUTPUT_PROMPT_TEMPLATE, generate_kwargs, df)
df, metrics = gc.test_paper_classifications(df)
metrics

In [None]:
# drop the reference to the client so the model's memory can be freed
del client