# Inference with the trained GPT-2 adapter


## Preparation

Import the necessary libraries and clear the GPU memory cache.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Release cached GPU memory left over from earlier runs in this session
torch.cuda.empty_cache()

Next, define the device and the model, tokenizer, and adapter IDs.

In [2]:
device = "cuda"
model_id = "gpt2"
tokenizer_id = "gpt2"
peft_model_id = "retrieverApp_adapter/gpt2-retrieverApp"

Load the LLM, the adapter, and the tokenizer into memory (a large model may cause an out-of-memory error).

In [3]:
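model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    torch_dtype="auto",
    trust_remote_code=False,
)

# Load the trained adapter on top of the base model, if one is configured
if peft_model_id:
    model.load_adapter(peft_model_id)

tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

# For GPT-2 both entries resolve to the same ID, because its EOS token
# is "<|endoftext|>"
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|endoftext|>"),
]

# GPT-2 has no dedicated padding token, so reuse the EOS token
tokenizer.pad_token = tokenizer.eos_token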


## Prompt

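Next, prepare the prompts. Each one combines a dataset's title, summary, and overall design, and ends with "The data type is" for the model to complete.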
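The four prompts in the next cell appear to share one template. The sketch below reconstructs it from those strings; `build_prompt` and its parameter names are hypothetical, and the format the adapter was actually trained on is not shown in this notebook, so treat the template as an assumption.

```python
def build_prompt(title: str, summary: str, overall_design: str) -> str:
    """Assemble a prompt in the layout used by the example messages.

    Assumption: the layout is inferred from the four prompts in this
    notebook, not taken from the adapter's training code.
    """
    return (
        f"Title: {title}. "
        f"Summary: {summary} "
        f"Overall design: {overall_design}. "
        "The data type is"
    )


# Illustrative placeholder values, not a real dataset record
example = build_prompt(
    title="Example study",
    summary="A short summary.",
    overall_design="Bulk RNA-seq of sorted cells.",
)
print(example)
```

Because the overall design text already ends with a period, the template produces the double period (`..`) seen before "The data type is" in the prompts.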

In [4]:
messages = ["Title: Gene expression profile at single cell level of bone marrow cells in 5TGM1 multiple myeloma mouse model of NEK2 deficiency. Summary: To determine how the lack of NEK2 in the tumor microenvironmental cells contributes to tumor inhibition and immune regulation, we used single cell RNA sequencing (scRNA-seq) to analyze the diversity of immune cells in the bone marrow of 5TGM1 MM mouse model with and without NEK2 deficiency. Overall design: Nek2 knockout mice had been backcrossed with C57BL/KaLwRij mice for 10 generations to allow for 5TGM1 MM cell growth. One million 5TGM1 MM cells were injected into the Nek2 wild-type and knockout mice via tail vein. Bone marrow cells were harvested on day 30 post-5TGM1 cell injection and processed for scRNA-seq analysis.. The data type is",
           "Title: Single cell RNA sequencing of primary breast cancer. Summary: We performed single cell RNA sequencing (RNA-seq) for 549 primary breast cancer cells and lymph node metastases from 11 patients with distinct molecular subtypes (BC01-BC02, estrogen receptor positive (ER+); BC03, double positive (ER+ and HER2+); BC03LN, lymph node metastasis of BC03; BC04-BC06, human epidermal growth factor receptor 2 positive (HER2+); BC07-BC11, triple-negative breastcancer (TNBC); BC07LN, lymph node metastasis of BC07) and matched bulk tumors. We separated these single cells into epithelial tumorand tumor-infiltrating immune cells using inferred CNVs from RNA-seq. The refined single cell profiles for the tumor and immune cells provide key expression signatures of breast cancer and the surrounding microenvironment. Overall design: All single-cell mRNA expression profiles were acquired from eleven patients (BC01-BC11) including two lymph node metastases (BC03LN, BC07LN) (549 samples). We applied four filtering criteria so as to remove samples with low sequencing quality and finally obtained 515 single cell sequencing data. Matched bulk tumor tissues and/or pooled cells were also sequenced and analyzed by the single cell RNA-seq pipeline (14 samples). Bulk tumor transcriptomes showed significant correlations with the average of single cell transcriptomes.. The data type is",
           "Title: Expression data of plasma cell from MGUS patients. Summary: Multiple myeloma is preceded by monoclonal gammopathy of undetermined significance (MGUS). Serum markers are currently used to stratify MGUS patients into clinical risk groups. A molecular signature predicting MGUS progression has not been produced. We have explored the use of gene expression profiling to risk-stratify MGUS and developed an optimized signature based on large samples with long-term follow-up. Microarray analysis of mRNA from MGUS patients with stable disease and MGUS patients that progressed to MM with in 10 years, was used to define a molecular signature of MGUS risk. Overall design: Bone marrow plasma cells from MGUS patients were enriched by anti-CD138 immunomagnetic beads for RNA extraction and hybridization on Affymetrix microarrays. We sought to identify genetic differences in the bone marrow plasma cells of stable and progressive MGUS patients. Patients who progressed to MM within 10 years after diagnosis of MGUS were considered Progressing (P), while the other 334 MGUS had not progressed called Stable (S).. The data type is",
           "Title: Effect of NEK2 deficiency on gene expression of CD19-selected spleen B cells of Eµ-myc mouse model. Summary: NEK2 overexpression in the B cell lineage affects B cell development. To determine whether deficiency of NEK2 inhibits B cell malignancies, we crossed Nek2 knockout mice with Eµ-myc transgenic mice, a well-established Myc-driven preclinical model for B cell malignancies. Three genotyping offsprings were obtained: Eµ-myc/Nek2+/+ (NEK2 wild type), Eµ-myc/Nek2+/- (NEK2 heterozygotes), and Eµ-myc/Nek2-/- (NEK2 homozygotes). Bulk RNA-seq was performed on CD19-selected spleen B cells to view the transcriptome differences. Overall design: Comparative gene expression profiling analysis of RNA-seq data for CD19-selected spleen B cells from 6-8 weeks old Eµ-myc mice with indicated NEK2 deficiency.. The data type is"]

Run inference on each prompt and print the generated continuation.

In [5]:
for i, message in enumerate(messages):
    s = tokenizer(message, return_tensors="pt")
    input_ids = s.input_ids.to(device)
    outputs = model.generate(input_ids,
                             temperature=0.1,
                             do_sample=True,
                             pad_token_id=tokenizer.pad_token_id,
                             eos_token_id=tokenizer.eos_token_id,
                             max_new_tokens=48)[0]
    # decode() returns a str, so [0] keeps only its first character and the
    # slice that follows is always empty
    answer = tokenizer.decode(outputs, skip_special_tokens=True)[0][len(message)+1:]
    print("%d: %s" % (i+1, answer))


The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


1: 
2: 
3: 
4: 
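The warning above appears because only `input_ids` is passed to `generate`. A minimal sketch of one way to pass the attention mask as well is shown here; `generate_text` is a hypothetical helper, and `model` and `tokenizer` are assumed to be the objects loaded in cell 3. The sampling settings are copied from the loop above.

```python
def generate_text(model, tokenizer, prompt, device="cuda", max_new_tokens=48):
    """Generate a continuation for one prompt, passing the attention mask explicitly."""
    # The tokenizer returns both input_ids and attention_mask;
    # .to(device) moves them to the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        do_sample=True,
        temperature=0.1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
    )[0]
    # Returns the prompt followed by the generated text
    return tokenizer.decode(output_ids, skip_special_tokens=True)
```

For a single unpadded prompt the mask is all ones, so this mainly silences the warning. It starts to matter once several prompts are padded into one batch, because the pad token here is the same as the EOS token.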


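The printed answers are empty, so decode the last output in full to inspect what the model generated.

In [6]: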
tokenizer.decode(outputs, skip_special_tokens=True)

'Title: Effect of NEK2 deficiency on gene expression of CD19-selected spleen B cells of Eµ-myc mouse model. Summary: NEK2 overexpression in the B cell lineage affects B cell development. To determine whether deficiency of NEK2 inhibits B cell malignancies, we crossed Nek2 knockout mice with Eµ-myc transgenic mice, a well-established Myc-driven preclinical model for B cell malignancies. Three genotyping offsprings were obtained: Eµ-myc/Nek2+/+ (NEK2 wild type), Eµ-myc/Nek2+/- (NEK2 heterozygotes), and Eµ-myc/Nek2-/- (NEK2 homozygotes). Bulk RNA-seq was performed on CD19-selected spleen B cells to view the transcriptome differences. Overall design: Comparative gene expression profiling analysis of RNA-seq data for CD19-selected spleen B cells from 6-8 weeks old Eµ-myc mice with indicated NEK2 deficiency.. The data type is: CD19-selected spleen B cells.'
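The full decoded text shows that the model did produce an answer, so the empty lines printed earlier come from the slicing expression in the loop: `tokenizer.decode(...)` returns a string, `[0]` then selects a single character, and `[len(message)+1:]` of that character is empty. Below is a minimal sketch of one way to recover the answer, assuming the decoded text begins with the prompt verbatim, as it does in the output above. `extract_answer` is a hypothetical helper, and the short prompt is an illustrative stand-in for the real ones.

```python
def extract_answer(decoded: str, prompt: str) -> str:
    """Return the generated text that follows the prompt in a decoded output."""
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    # Drop the leading ": " separator that precedes the label in the output
    return decoded.lstrip(": ").strip()


# Illustrative stand-in for a real prompt and its decoded output
prompt = "Overall design: Bulk RNA-seq of B cells.. The data type is"
decoded = prompt + ": CD19-selected spleen B cells."

# The original expression indexes one character first, so it is always empty
buggy = decoded[0][len(prompt) + 1:]
fixed = extract_answer(decoded, prompt)
print(repr(buggy))  # ''
print(fixed)        # CD19-selected spleen B cells.
```

In the loop this would become `answer = extract_answer(tokenizer.decode(outputs, skip_special_tokens=True), message)`.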