# Introduction
*how to build an auto-complete using a pre-trained language model*

Example auto-complete: "I confirm the appointment for tomorrow morning. Best" -> "regards,"

Goal

How to 

*   use pre-trained large language models (GPT from OpenAI)
*   generate multiple completions
*   score likelihood of candidate completions


Sections:

**Open-ended completions**: how to generate completions

**Sub-word tokenizers**: how input and output are split into sub-words

**Structured completions (classification)**: given a finite set of candidate completions, how to pick the best one

# Open-ended completions

Go to [platform.openai.com](https://platform.openai.com), create a (free) account:

In the top right corner, click on your profile and select "View API keys" to generate a new key:

In [None]:
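# The calls below use the legacy Completion interface, which was removed in openai 1.0.
!pip install "openai<1.0"
import openai
import getpass

# Paste your API key at the prompt (the input is not echoed).
openai.api_key = getpass.getpass()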

A language model trained on a diverse dataset can complete arbitrary sentences:

In [4]:
prompt = "Q: What is the capital of Ontario? A:"

In [None]:
openai.Completion.create(model="text-davinci-003", 
                         prompt=prompt, 
                         temperature=0, 
                         max_tokens=20).choices[0].text

' The capital of Ontario is Toronto.'

The completions may need to be parsed. For example, to answer with just the city, we would have to extract "Toronto" from the sentence above.

To create an auto-complete, we can simply extract the first word from the completion. Most of the work is in parsing the output and deciding when to suggest an auto-complete.

In [None]:
word_autocomplete_prompt = "I confirm the appointment for tomorrow morning.\nBest"
openai.Completion.create(model="text-davinci-003", 
                         prompt=word_autocomplete_prompt, 
                         temperature=0, 
                         max_tokens=20).choices[0].text

',\n[Your Name]'
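Extracting the first word from a raw completion can be sketched as follows. The helper name `first_word` and the choice to strip surrounding punctuation are assumptions made for illustration, not part of the OpenAI API.

```python
import string

def first_word(completion_text):
    """Return the first whitespace-delimited word of a completion, or None."""
    words = completion_text.split()
    if not words:
        return None
    # Drop surrounding punctuation such as a trailing comma.
    word = words[0].strip(string.punctuation)
    return word or None

print(first_word(" regards,\n[Your Name]"))  # regards
print(first_word(",\n[Your Name]"))          # None
```

The second call uses the completion returned above: it starts with a comma, so there is no word to suggest. This is one of the cases where the auto-complete should stay silent.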

We can also create multiple diverse completions by setting the temperature higher than 0 and the number of outputs (`n`) greater than 1.

In [None]:
completions = openai.Completion.create(model="text-davinci-003", 
                         prompt=word_autocomplete_prompt, 
                         temperature=0.5, 
                         max_tokens=20,
                         n=10)
for i, c in enumerate(completions.choices):
  print(f"Completion {i}:\n\"{c.text}\"\n")

Completion 0:
",
[Name]"

Completion 1:
",
[Your Name]"

Completion 2:
",
[Your Name]"

Completion 3:
" regards
[Your Name]"

Completion 4:
",
[Your Name]"

Completion 5:
",
[Your Name]"

Completion 6:
" regards
[Your name]"

Completion 7:
",
[Your Name]"

Completion 8:
",
[Your Name]"

Completion 9:
",
[Your Name]"
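With several sampled completions, one simple way to pick a suggestion is to count how often each first line occurs. The sketch below does this on the ten completions printed above; the helper name `most_common_first_line` is an assumption, and majority voting is only one possible selection rule.

```python
from collections import Counter

def most_common_first_line(completion_texts):
    """Return (first_line, share) for the most frequent first line, or None."""
    if not completion_texts:
        return None
    # Keep only the text before the first newline of each completion.
    first_lines = [text.split("\n", 1)[0] for text in completion_texts]
    line, count = Counter(first_lines).most_common(1)[0]
    return line, count / len(first_lines)

# The ten completions printed above.
sampled = [
    ",\n[Name]", ",\n[Your Name]", ",\n[Your Name]", " regards\n[Your Name]",
    ",\n[Your Name]", ",\n[Your Name]", " regards\n[Your name]",
    ",\n[Your Name]", ",\n[Your Name]", ",\n[Your Name]",
]
print(most_common_first_line(sampled))  # (',', 0.8)
```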



# Sub-word tokenizers

To decide when to give a suggestion, we first need to understand how completions are generated. 

In the previous lesson, we learned that we can generate outputs character by character. Let's compare three tokenizer approaches:

**Character**

Drawbacks: 
- absence of punctuation, upper case... This can be improved by adding all the missing characters to the vocabulary. 
- lack of higher-level meaning of each token compared to a word level vocabulary.

Advantage:
- if all characters are included in the vocabulary, it can cover all possible inputs.

**Word**

Drawbacks: 
- the vocabulary can be very large 
- cannot possibly cover all words: new words, names, scientific terms. So it is very hard to cover every potential input (inputs may also simply be misspelled).
- words with similar meanings/spellings have different tokens (ex: play/played)

Advantage:
- each token represents a word/concept, so the model can focus on learning relations between words

**Sub-word**

Trade-off between the previous two. The vocabulary contains the most common words (and word pieces) as well as the individual characters, so an unknown word can be built from a combination of characters/sub-words. Ex: 'play' and 'ed' may be different tokens and 'played' simply the combination of both. Thus, words with the same prefixes may share tokens, so it is easier to generalize concepts to derivative words.
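The 'play' + 'ed' idea can be illustrated with a toy tokenizer that always takes the longest vocabulary piece it can find. This is a minimal sketch under stated assumptions: the function `greedy_subword_tokenize` and the vocabulary `toy_vocab` are invented for illustration, and greedy longest-match is not the procedure GPT tokenizers use.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            raise ValueError(f"cannot tokenize {word[start]!r}: not in vocabulary")
    return tokens

# Toy vocabulary: all lower-case letters plus a few frequent pieces.
toy_vocab = set("abcdefghijklmnopqrstuvwxyz") | {"play", "ed", "ing"}
print(greedy_subword_tokenize("played", toy_vocab))     # ['play', 'ed']
print(greedy_subword_tokenize("replaying", toy_vocab))  # ['r', 'e', 'play', 'ing']
```

Because every single letter is in the vocabulary, any lower-case word can be tokenized, while frequent pieces such as 'play' stay whole.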

# Number of tokens

Important because models have a size limit measured in tokens:

number of input tokens + number of output tokens < max number of total tokens

Ex: some OpenAI models have a 4000-token limit

Also, API calls may be priced by number of tokens.

So knowing how many tokens an input has is very useful.
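The budget above can be turned into a small helper that reports how many tokens remain for the output. This is a sketch only: the name `max_output_tokens` is hypothetical, the default of 4000 is the example limit mentioned above, and treating the limit as inclusive is an assumption.

```python
def max_output_tokens(n_input_tokens, context_limit=4000):
    """Tokens left for the completion once the prompt is counted."""
    return max(0, context_limit - n_input_tokens)

print(max_output_tokens(11))    # 3989
print(max_output_tokens(4500))  # 0, the prompt alone is already too long
```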

In [None]:
!pip install transformers
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


Tokenization example:

In [5]:
def pretiffy_tokenizer(tokenized_list):
  return [t.replace("Ġ", " ") for t in tokenized_list]

tokenized = tokenizer(prompt)['input_ids']
print(f"The tokenized version of the sentence \"{prompt}\" is: \n{tokenized}")
print(f"The individual tokens are:\n{pretiffy_tokenizer(tokenizer.convert_ids_to_tokens(tokenized))}")

The tokenized version of the sentence "Q: What is the capital of Ontario? A:" is: 
[48, 25, 1867, 318, 262, 3139, 286, 10553, 30, 317, 25]
The individual tokens are:
['Q', ':', ' What', ' is', ' the', ' capital', ' of', ' Ontario', '?', ' A', ':']


In [None]:
print(f"The vocabulary size is: {len(tokenizer.get_vocab())}")

The vocabulary size is: 50257


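Byte pair encoding (BPE) is the algorithm the GPT-2 tokenizer uses to build its sub-word vocabulary: it starts from small units and repeatedly merges the most frequent adjacent pair. See: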

https://huggingface.co/course/chapter6/5
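The core of BPE training, repeatedly merging the most frequent adjacent pair of symbols, can be sketched in a few lines. This is a simplified illustration rather than the GPT-2 implementation: it works on characters instead of bytes, ignores spaces, and the function name `learn_bpe_merges` and the toy word counts are assumptions.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Start with every word split into single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges({"play": 5, "played": 3, "playing": 2}, num_merges=3)
print(merges)  # [('p', 'l'), ('pl', 'a'), ('pla', 'y')]
print(corpus)  # {('play',): 5, ('play', 'e', 'd'): 3, ('play', 'i', 'n', 'g'): 2}
```

After three merges the frequent stem 'play' has become a single symbol, while the rarer endings are still split into characters.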


Another example, with a long word being split into multiple tokens. Note that numbers are also split.

In [6]:
complex_prompt = "The desalination rate is 300000 m3/day."
tokenized = tokenizer(complex_prompt)['input_ids']
print(f"The tokenized version of the sentence \"{complex_prompt}\" is: \n{tokenized}")
print(f"The individual tokens are:\n{pretiffy_tokenizer(tokenizer.convert_ids_to_tokens(tokenized))}")

The tokenized version of the sentence "The desalination rate is 300000 m3/day." is: 
[464, 748, 282, 1883, 2494, 318, 5867, 830, 285, 18, 14, 820, 13]
The individual tokens are:
['The', ' des', 'al', 'ination', ' rate', ' is', ' 300', '000', ' m', '3', '/', 'day', '.']


Derivative words share tokens (here, the common piece ' halluc'):

In [None]:
derivative_words = "words: hallucinate, hallucinating, hallucination"
tokenized = tokenizer(derivative_words)['input_ids']
print(f"The tokenized version of the sentence \"{derivative_words}\" is: \n{tokenized}")
print(f"The individual tokens are:\n{pretiffy_tokenizer(tokenizer.convert_ids_to_tokens(tokenized))}")

The tokenized version of the sentence "words: hallucinate, hallucinating, hallucination" is: 
[10879, 25, 23251, 4559, 11, 23251, 6010, 11, 23251, 1883]
The individual tokens are:
['words', ':', ' halluc', 'inate', ',', ' halluc', 'inating', ',', ' halluc', 'ination']


# Structured completions

To generate structured completions we can:

- parse the model output

- give the model multiple candidate completions and select the one with the highest score

In [None]:
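completion = openai.Completion.create(
    model="text-davinci-003", 
    prompt=prompt, 
    temperature=0, 
    max_tokens=0,  # generate nothing: we only want to score the prompt
    echo=True,     # return the input tokens along with the output
    logprobs=1     # return each token's log-probability and the top-1 alternative
    )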

See the score (log-probability) the model assigned to each token in the question.

In [None]:
for token, logprob in zip(completion.choices[0].logprobs.tokens, completion.choices[0].logprobs.token_logprobs):
  print((token, logprob))

('Q', None)
(':', -2.7656417)
(' What', -3.0039053)
(' is', -0.84649307)
(' the', -0.40047067)
(' capital', -7.1475334)
(' of', -0.09650371)
(' Ontario', -7.120254)
('?', -0.53915656)
(' A', -9.85958)
(':', -0.0041429596)
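A log-probability becomes a probability through `exp`, and summing the log-probabilities gives the log of the probability of the whole sequence. The sketch below applies this to the values printed above; the variable names are chosen for illustration.

```python
import math

# Token log-probabilities copied from the output above (the first token has none).
token_logprobs = [None, -2.7656417, -3.0039053, -0.84649307, -0.40047067,
                  -7.1475334, -0.09650371, -7.120254, -0.53915656,
                  -9.85958, -0.0041429596]

scored = [lp for lp in token_logprobs if lp is not None]
sequence_logprob = sum(scored)                    # log of the joint probability
average_logprob = sequence_logprob / len(scored)  # length-normalized score

print(f"probability of the final ':' = {math.exp(scored[-1]):.4f}")
print(f"sequence log-prob            = {sequence_logprob:.4f}")
print(f"average log-prob             = {average_logprob:.4f}")
```

Values close to 0 (such as the final ':') mean the model found the token almost certain; strongly negative values (such as ' A') mean the token was unexpected.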


We also get the most probable next token at each generation step.

Example: After "Q:" the most probable token is "How", not "What".

In [None]:
for token in completion.choices[0].logprobs.top_logprobs:
  if token:
    print(token.to_dict())
  else:
    print(None)

None
{':': -2.7656417}
{' How': -2.059068}
{' is': -0.84649307}
{' the': -0.40047067}
{' difference': -1.3592246}
{' of': -0.09650371}
{' the': -2.6134484}
{'?': -0.53915656}
{'\n': -0.009496838}
{':': -0.0041429596}


Find the most probable output by giving the model a list of candidates to be scored.

In [None]:
candidate_answers = ['Ottawa', 'Toronto', 'Vancouver', 'Montreal']
logprobs = []

for c_a in candidate_answers:
  # Score "prompt + candidate" without generating anything new.
  raw_logprobs = openai.Completion.create(
      model="text-davinci-003", 
      prompt=' '.join([prompt, c_a]), 
      temperature=0, 
      max_tokens=0,
      echo=True, 
      logprobs=1).choices[0].logprobs.token_logprobs[1:]  # the first token has no log-prob
  sumlogprob = sum(raw_logprobs)
  avglogprob = sumlogprob/len(raw_logprobs)  # average over prompt and answer tokens
  logprobs.append((c_a, avglogprob))
  

In [None]:
logprobs.sort(key=lambda x: x[1])  # ascending, so the best candidate is last
print(f"For the question: \"{prompt}\"\nThe candidate answers and log-probs are:\n")
for l in logprobs:
  print(f"Answer: {l[0]}, Avg. logprob: {l[1]}")
print(f"\nThe most probable answer is: {logprobs[-1][0]}")

For the question: "Q: What is the capital of Ontario? A:"
The candidate answers and log-probs are:

Answer: Vancouver, Avg. logprob: -4.668726059518182
Answer: Montreal, Avg. logprob: -4.434301600709091
Answer: Ottawa, Avg. logprob: -3.475510923154545
Answer: Toronto, Avg. logprob: -3.2564251196363627

The most probable answer is: Toronto
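The averages above are taken over the prompt tokens as well as the answer tokens. Since the prompt is identical for every candidate, a possible variant scores only the tokens that follow it. The sketch below assumes the number of prompt tokens is known (for example from an `echo=True` call on the prompt alone); the helper name `answer_avg_logprob` and the example values are hypothetical.

```python
def answer_avg_logprob(token_logprobs, n_prompt_tokens):
    """Average log-probability of the tokens that follow the prompt."""
    answer_logprobs = token_logprobs[n_prompt_tokens:]
    if not answer_logprobs:
        raise ValueError("no answer tokens to score")
    return sum(answer_logprobs) / len(answer_logprobs)

# Hypothetical log-probs for a 3-token prompt followed by a 2-token answer.
example = [None, -1.0, -2.0, -0.5, -1.5]
print(answer_avg_logprob(example, n_prompt_tokens=3))  # -1.0
```

Averaging, rather than summing, keeps candidates that span several tokens comparable with single-token candidates.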


Scores can also be useful for open-ended completions. For example, we can recommend a completion only if its score (log-probability) is above a given threshold.
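Such a rule might look like the sketch below. The function name `suggest`, the default threshold of -1.0, and the example scores are all assumptions; a sensible threshold would have to be tuned on real data.

```python
def suggest(completion_text, avg_logprob, min_avg_logprob=-1.0):
    """Return the completion if its score reaches the threshold, else None."""
    if avg_logprob >= min_avg_logprob:
        return completion_text
    return None

print(suggest(" regards,", avg_logprob=-0.3))  # confident enough: shown to the user
print(suggest(" regards,", avg_logprob=-4.2))  # None: no suggestion is shown
```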