# Prompting basics

This notebook covers the basics of prompting, using Named Entity Recognition (NER) as an example. Three LLMs are used:

* [GPT-35-Turbo via Azure OpenAI](https://oai.azure.com/portal/)
* [TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g](https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g) hosted on [Hawking](http://10.60.244.11:7030/)
* [Claude via Anthropic](https://console.anthropic.com/) (used manually only, since no API key is available at the moment)

## Setup

Imports

In [1]:
!pip install openai
!pip install langchain

In [3]:
import sys
sys.path.append('/content/drive/MyDrive/Elsevier /LG LLM/llm-prompting')

In [4]:
import re
import os
import csv
import json
from typing import List

import yaml
import openai
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm, trange
from langchain import PromptTemplate
from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.llms import AzureOpenAI
from langchain.output_parsers import CommaSeparatedListOutputParser

from src.hawking_llm import HawkingLLM
from src.text_utils import remove_tags, list_tags, parse_output

#from hawking_llm import HawkingLLM
#from text_utils import remove_tags, list_tags, parse_output

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Ignore warnings

In [5]:
import warnings
warnings.filterwarnings("ignore")

Load Azure OpenAI API key

In [6]:
with open('config.yaml') as f_in:
    azure_api_key = yaml.safe_load(f_in)['azure']['api_key']

FileNotFoundError: ignored
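
The `FileNotFoundError` above means `config.yaml` was not found in the working directory. A minimal sketch of a more forgiving loader follows. It assumes the key may also be supplied through an environment variable; the function name `load_azure_api_key` and the variable name `AZURE_OPENAI_API_KEY` are hypothetical and not part of the original notebook.

```python
import os
from typing import Optional


def load_azure_api_key(config_path: str = 'config.yaml',
                       env_var: str = 'AZURE_OPENAI_API_KEY') -> Optional[str]:
    """Return the key from the YAML config if it exists, else from the environment.

    The function name and the env_var default are hypothetical.
    """
    if os.path.exists(config_path):
        import yaml  # PyYAML; imported lazily so the fallback works without it
        with open(config_path) as f_in:
            return yaml.safe_load(f_in)['azure']['api_key']
    # Fall back to the environment; returns None if the variable is unset
    return os.environ.get(env_var)
```

If the function returns `None`, no key was found in either place, and the Azure connector further down cannot authenticate.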

Set environment variables for Azure OpenAI

In [None]:
os.environ['OPENAI_API_TYPE'] = 'azure'
os.environ['OPENAI_API_VERSION'] = '2022-12-01'
os.environ['OPENAI_API_BASE'] = 'https://nhe2.openai.azure.com/'
os.environ['OPENAI_API_KEY'] = azure_api_key

## Create LLM connectors

Create a connector for Vicuna on Hawking

In [7]:
# Create connector and set max tokens per response
vicuna = HawkingLLM(max_new_tokens=512)

# Test the model
vicuna('Explain nuclear physics in no more than five sentences.')

ConnectTimeout: ignored

In [None]:
print(vicuna)

HawkingLLM
Params: {'max_new_tokens': 512, 'preset': None, 'do_sample': True, 'temperature': 0.7, 'top_p': 0.1, 'typical_p': 1, 'epsilon_cutoff': 0, 'eta_cutoff': 0, 'tfs': 1, 'top_a': 0, 'repetition_penalty': 1.18, 'top_k': 40, 'min_length': 0, 'no_repeat_ngram_size': 0, 'num_beams': 1, 'penalty_alpha': 0, 'length_penalty': 1, 'early_stopping': False, 'mirostat_mode': 0, 'mirostat_tau': 5, 'mirostat_eta': 0.1, 'seed': -1, 'add_bos_token': True, 'truncation_length': 2048, 'ban_eos_token': False, 'skip_special_tokens': True, 'stopping_strings': []}


Create a connector for GPT-35 in Azure OpenAI

In [None]:
# Create a connector
gpt = AzureOpenAI(deployment_name='Davinci', model_name='text-davinci-003', max_tokens=512)

# Test the model
gpt('Explain nuclear physics in no more than five sentences.')

'\n\nNuclear physics is the study of the structure and behavior of atomic nuclei. It deals with the properties of individual nuclear particles, the interactions between them, and the ways in which these particles interact with the outside world. Nuclear physics is also concerned with the structure of atomic nuclei and the processes involved in the creation and decay of nuclei. In addition, it studies the production of energy from nuclear reactions, such as nuclear fission and nuclear fusion. Finally, nuclear physics deals with applications of nuclear physics, such as nuclear medicine, nuclear power plants, and nuclear weapons.'

In [None]:
print(gpt)

AzureOpenAI
Params: {'deployment_name': 'Davinci', 'model_name': 'text-davinci-003', 'temperature': 0.7, 'max_tokens': 512, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'request_timeout': None, 'logit_bias': {}}


## Load data

Dataset: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/
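
The corpus marks disease mentions inline with `<category="...">...</category>` tags. The loading cell relies on `remove_tags` and `list_tags` from `src.text_utils`, whose source is not shown in this notebook. The sketch below is one plausible regex-based version under that assumption. The names `remove_tags_sketch` and `list_tags_sketch` are hypothetical, and the real helpers may behave differently, for example by deduplicating or sorting mentions.

```python
import re
from typing import List

# Matches <category="...">mention</category>; group 1 is the annotated mention
TAG_PATTERN = re.compile(r'<category="[^"]*">(.*?)</category>')


def remove_tags_sketch(text: str) -> str:
    """Replace each annotated span with its plain-text mention."""
    return TAG_PATTERN.sub(r'\1', text)


def list_tags_sketch(text: str) -> List[str]:
    """Return the annotated mentions in order of appearance."""
    return TAG_PATTERN.findall(text)


title = ('Identification of APC2, a homologue of the '
         '<category="Modifier">adenomatous polyposis coli tumour</category> suppressor .')
print(remove_tags_sketch(title))
print(list_tags_sketch(title))
```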

In [None]:
ner_data = []
with open('data/NCBI-Disease/NCBI_corpus_training.txt') as f_in:
    reader = csv.DictReader(f_in, delimiter='\t', fieldnames=['id', 'title', 'text'])
    for row in reader:
        ner_data.append({
            'id': row['id'],
            'title': row['title'],
            'title_text': remove_tags(row['title']),
            'title_labels': list_tags(row['title']),
            'abstract': row['text'],
            'abstract_text': remove_tags(row['text']),
            'abstract_labels': list_tags(row['text'])
        })

print(f'Loaded {len(ner_data)} examples\n')
print('Sample entry:\n')
print(json.dumps(ner_data[0], indent=4))

Loaded 593 examples

Sample entry:

{
    "id": "10021369",
    "title": "Identification of APC2, a homologue of the <category=\"Modifier\">adenomatous polyposis coli tumour</category> suppressor .",
    "title_text": "Identification of APC2, a homologue of the adenomatous polyposis coli tumour suppressor .",
    "title_labels": [
        "adenomatous polyposis coli tumour"
    ],
    "abstract": "The <category=\"Modifier\">adenomatous polyposis coli ( APC ) tumour</category>-suppressor protein controls the Wnt signalling pathway by forming a complex with glycogen synthase kinase 3beta ( GSK-3beta ) , axin / conductin and betacatenin . Complex formation induces the rapid degradation of betacatenin . In <category=\"Modifier\">colon carcinoma</category> cells , loss of APC leads to the accumulation of betacatenin in the nucleus , where it binds to and activates the Tcf-4 transcription factor ( reviewed in [ 1 ] [ 2 ] ) . Here , we report the identification and genomic structure of APC ho

## Create a prompt template

In [None]:
template = (
    """
    A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

    USER:
    {context}

    List all diseases mentioned in the above text. {format_instructions}

    ASSISTANT:
    """
)
prompt_template = PromptTemplate.from_template(template)
print(prompt_template.input_variables)

['context', 'format_instructions']


Test prompt template using one example

In [None]:
output_parser = CommaSeparatedListOutputParser()

format_instructions = output_parser.get_format_instructions()

prompt = prompt_template.format(
    context=ner_data[0]['abstract_text'],
    format_instructions=format_instructions
)

output = vicuna(prompt)

print(output)


    * Colon carcinoma


Formatted output

In [None]:
parse_output(output)

{'colon carcinoma'}
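
`parse_output` is imported from `src.text_utils` and its source is not shown. The sketch below is one way such a helper could turn the bulleted response above into a lowercase set. The name `parse_output_sketch` and the splitting rules (newlines and commas, leading `*` or `-` bullets) are assumptions, not the notebook's actual implementation.

```python
import re
from typing import Set


def parse_output_sketch(output: str) -> Set[str]:
    """Split a response on newlines and commas, strip bullet markers, lowercase."""
    items = set()
    for piece in re.split(r'[\n,]', output):
        # Drop a leading "*" or "-" bullet and surrounding whitespace
        cleaned = re.sub(r'^\s*[*\-]\s*', '', piece).strip().lower()
        if cleaned:
            items.add(cleaned)
    return items


print(parse_output_sketch('\n    * Colon carcinoma\n'))
```

Splitting on commas as well as newlines lets the same function handle responses that follow the comma-separated format instructions and responses that ignore them and return a bullet list.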

Expected output (the gold labels from the dataset)

In [None]:
set([x.lower() for x in ner_data[0]['abstract_labels']])

{'adenomatous polyposis coli ( apc ) tumour', 'cancer', 'colon carcinoma'}

## Run model on a few examples

In [None]:
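def test_model(model, prompt: str, diseases: List[str]) -> float:
    """Return the Jaccard similarity between predicted and true disease sets."""

    # Run model
    output = model(prompt)

    # Format output and true labels
    y_pred = parse_output(output)
    y_true = set(d.lower() for d in diseases)

    # Guard against division by zero when both sets are empty
    union = y_pred.union(y_true)
    if not union:
        return 1.0

    return len(y_pred.intersection(y_true)) / len(union)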
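
The score computed by `test_model` is the Jaccard index: the number of labels shared by the predicted and true sets, divided by the number of distinct labels across both. The worked example below reuses the parsed Vicuna output and the gold labels for the first abstract shown earlier. The standalone `jaccard` function is an illustrative sketch, and treating two empty sets as a perfect match is an assumption.

```python
from typing import Set


def jaccard(y_pred: Set[str], y_true: Set[str]) -> float:
    """Intersection over union of two label sets."""
    union = y_pred | y_true
    if not union:
        # Assumption: two empty sets count as a perfect match
        return 1.0
    return len(y_pred & y_true) / len(union)


y_pred = {'colon carcinoma'}
y_true = {'adenomatous polyposis coli ( apc ) tumour', 'cancer', 'colon carcinoma'}
# 1 shared label out of 3 distinct labels
print(jaccard(y_pred, y_true))
```

Because the comparison is an exact string match, a prediction that differs from the gold label only in spacing or wording counts as a miss.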

In [None]:
output_parser = CommaSeparatedListOutputParser()
format_instructions = output_parser.get_format_instructions()

sample_size = 10

results_vicuna = []
results_gpt = []

for i in trange(sample_size):
    context = ner_data[i]['abstract_text']
    diseases = ner_data[i]['abstract_labels']

    prompt = prompt_template.format(context=context, format_instructions=format_instructions)

    results_vicuna.append(test_model(vicuna, prompt, diseases))
    results_gpt.append(test_model(gpt, prompt, diseases))

In [None]:
np.mean(results_vicuna)

0.0944298245614035

In [None]:
np.mean(results_gpt)

0.29502567693744164

## Test models in few-shot setting

In [None]:
# Use the last 2 dataset entries as few-shot examples
examples = [
    {
        'context': nd['abstract_text'],
        'labels': ', '.join(nd['abstract_labels'])
    } for nd in ner_data[-2:]
]
print(len(examples))

2


In [None]:
template = (
    """
    Context: {context}

    Disease names: {labels}
    """
)

example_prompt = PromptTemplate.from_template(template)

print(example_prompt.format(**examples[0]))


    Context: Mutations in the STA gene at the Xq28 locus have been found in patients with X-linked Emery-Dreifuss muscular dystrophy ( EDMD ) . This gene encodes a hitherto unknown protein named emerin . To elucidate the subcellular localization of emerin , we raised two antisera against synthetic peptide fragments predicted from emerin cDNA . Using both antisera , we found positive nuclear membrane staining in skeletal , cardiac and smooth muscles in the normal controls and in patients with neuromuscular diseases other than EDMD . In contrast , a deficiency in immunofluorescent staining of skeletal and cardiac muscle from EDMD patients was observed . A 34 kD protein is immunoreactive with the antisera--the protein is equivalent to that predicted for emerin . Together , our findings suggest the specific deficiency of emerin in the nuclear membrane of muscle cells in patients with EDMD . . 

    Disease names: deficiency of emerin, EDMD, neuromuscular diseases, X-linked Emery-Dreifuss 

In [None]:
prompt_template = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    suffix="Context: {context}\n\nDisease names:",
    input_variables=['context']
)

print(prompt_template.format(context=ner_data[0]['abstract_text']))


    Context: Mutations in the STA gene at the Xq28 locus have been found in patients with X-linked Emery-Dreifuss muscular dystrophy ( EDMD ) . This gene encodes a hitherto unknown protein named emerin . To elucidate the subcellular localization of emerin , we raised two antisera against synthetic peptide fragments predicted from emerin cDNA . Using both antisera , we found positive nuclear membrane staining in skeletal , cardiac and smooth muscles in the normal controls and in patients with neuromuscular diseases other than EDMD . In contrast , a deficiency in immunofluorescent staining of skeletal and cardiac muscle from EDMD patients was observed . A 34 kD protein is immunoreactive with the antisera--the protein is equivalent to that predicted for emerin . Together , our findings suggest the specific deficiency of emerin in the nuclear membrane of muscle cells in patients with EDMD . . 

    Disease names: deficiency of emerin, EDMD, neuromuscular diseases, X-linked Emery-Dreifuss 
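
Roughly, `FewShotPromptTemplate` renders each example with `example_prompt`, then joins those renderings and the suffix into one string. The plain-Python sketch below mimics that assembly, assuming a blank-line separator and no prefix. The helper name `build_few_shot_prompt` is hypothetical, and the toy examples are made-up sentences, not dataset entries.

```python
from typing import Dict, List

EXAMPLE_TEMPLATE = 'Context: {context}\n\nDisease names: {labels}'
SUFFIX = 'Context: {context}\n\nDisease names:'


def build_few_shot_prompt(examples: List[Dict[str, str]], context: str,
                          separator: str = '\n\n') -> str:
    """Render every example, then append the unanswered query as the final block."""
    blocks = [EXAMPLE_TEMPLATE.format(**ex) for ex in examples]
    blocks.append(SUFFIX.format(context=context))
    return separator.join(blocks)


# Made-up sentences for illustration only
toy_examples = [
    {'context': 'Patients with EDMD were examined .', 'labels': 'EDMD'},
    {'context': 'Loss of APC is seen in colon carcinoma .', 'labels': 'colon carcinoma'},
]
print(build_few_shot_prompt(toy_examples, 'Hereditary hemochromatosis causes iron overload .'))
```

Ending the prompt with an open `Disease names:` line invites the model to continue in the same comma-separated style as the examples.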

In [None]:
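# Send the few-shot prompt built above (prompt_template), not the leftover zero-shot string `prompt`
vicuna(prompt_template.format(context=ner_data[1]['abstract_text']))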

'\n    * Hereditary hemochromatosis\n    \n    * Iron overload\n    \n    * Excessive dietary iron absorption\n    \n    * Autosomal recessive disorder\n    \n    * Tissue iron deposition\n    \n    * Transferrin receptor (TfR)\n    \n    * Mutations in HFE\n    \n    * Uptake of transferrin-bound iron\n    \n    * Duodenal crypt cells\n    \n    * Iron homeostasis\n    \n    * Divalent metal transporter (DMT1)\n    \n    * Increased duodenal expression\n    \n    * Murine model of dietary iron deficiency\n    \n    * Dietary iron absorption\n    \n    * Hepatic iron concentration'

## Rerun with examples

In [None]:
sample_size = 10

results_vicuna_few_shot = []
results_gpt_few_shot = []

for i in trange(sample_size):
    context = ner_data[i]['abstract_text']
    diseases = ner_data[i]['abstract_labels']

    prompt = prompt_template.format(context=context)

    results_vicuna_few_shot.append(test_model(vicuna, prompt, diseases))
    results_gpt_few_shot.append(test_model(gpt, prompt, diseases))

In [None]:
np.mean(results_vicuna_few_shot)

0.29279411764705887

In [None]:
np.mean(results_gpt_few_shot)

0.35872549019607847

In [None]:
all_results = {
    'vicuna': {'zero shot': np.mean(results_vicuna), 'few shot': np.mean(results_vicuna_few_shot)},
    'gpt': {'zero shot': np.mean(results_gpt), 'few shot': np.mean(results_gpt_few_shot)}
}
pd.DataFrame.from_dict(all_results, orient='index')

| model | zero shot | few shot |
|---|---|---|
| vicuna | 0.09443 | 0.292794 |
| gpt | 0.295026 | 0.358725 |
