# A 2025 Implementation of Jeremy Howard's " A hacker's guide to Language Models"

The original implementation of this can be found [here](https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb). Since the progress in Language models has been so rapid, some of the models and artificats used in the original notebook have changed, been tweaked or updated for example the LLAMA models, so here i will be trying to use the latest state of the art techniques , models etc.

#### What is a language model ?

I previously defined and implemented a simple language model [here](https://nbsanity.com/static/88f4b8caa233fa6d0a5e5114810403b3/symptom-disease-ulmfit_lightning.html)

In [None]:
#|include: false 
#| code-fold: true
#| output: false
#| code-summary: "Library Installation"

%pip install --upgrade openai
%pip install claudette
%pip install python-dotenv
%pip install -U bitsandbytes
%pip install optimum
%pip install auto-gptq
%pip install Wikipedia-API
%pip install tiktoken
%pip install sentence-transformers


Collecting openai
  Downloading openai-1.73.0-py3-none-any.whl.metadata (25 kB)
Downloading openai-1.73.0-py3-none-any.whl (644 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m644.4/644.4 kB[0m [31m24.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.72.0
    Uninstalling openai-1.72.0:
      Successfully uninstalled openai-1.72.0
Successfully installed openai-1.73.0
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use u

In [1]:
#|include: false 
#| code-fold: true
#| output: false
#| code-summary: "Library Import"

import tokenize, ast
from io import BytesIO
import os

from transformers import AutoModelForCausalLM,AutoTokenizer,BitsAndBytesConfig
import torch
import torch

import ipywidgets as widgets

from openai import OpenAI

In [None]:
#|include: false 
#| code-fold: true
#| output: false
#| code-summary: "Library Import"
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Token has not been saved to git credential helper.


## Base Models

### Tokens

Our language models process text using tokens and always break down text into tokens, which can be words, subwords, etc. This creates a vocabulary of unique tokens, where each token is assigned a unique ID.

With this, all the text passed to a language model can be broken down into tokens. As a model is being trained, it constantly adds new tokens to its vocabulary.

Language models also analyze and learn the semantic relationship between the tokens such as how they are used, or used together. This enables them to predict the most likely token in a sequence of tokens based on the input sequence. This is represented to us as predicting the next word/subword/character in a sentence.

We can also represent the relationships between our tokens with something called embeddings.

In [4]:
from tiktoken import encoding_for_model
enc = encoding_for_model("text-davinci-003")
toks = enc.encode("They are splashing")
toks

[2990, 389, 4328, 2140]

In [15]:
[enc.decode_single_token_bytes(o).decode('utf-8') for o in toks]

['They', ' are', ' spl', 'ashing']

To illustrate, consider the above text, "They are splashing" which has been broken down into [2990, 389, 4328, 2140]. OpenAI currently has a [tool](https://platform.openai.com/tokenizer) we can use to visualize how text is tokenized.

This is what we used above through [tiktoken](https://github.com/openai/tiktoken) to programmatically interact with our API for tokenizing text.

In [16]:
enc_a = encoding_for_model("gpt-3.5-turbo")
toks_a = enc_a.encode("They are splashing")
toks_a

[7009, 527, 12786, 19587]

In [17]:
enc_b = encoding_for_model("o1")
toks_b = enc_b.encode("They are splashing")
toks_b

[12280, 553, 15885, 33306]

As you can see, different models encode our text in different ways. You can see the models we have access to through tiktoken [here](https://github.com/openai/tiktoken/blob/main/tiktoken/model.py).

When you train your own model, you create your own vocab.

In [18]:
[enc_b.decode_single_token_bytes(o).decode('utf-8') for o in toks_b]

['They', ' are', ' spl', 'ashing']

In [19]:
[enc.decode_single_token_bytes(o).decode('utf-8') for o in toks]

['They', ' are', ' spl', 'ashing']

## OpenAI API

In [33]:
# Create a password widget for secure input
api_key_input = widgets.Password(
    description='OpenAI API Key:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

display(api_key_input)

# In a separate cell, after the user inputs their key:
os.environ["OPENAI_API_KEY"] = api_key_input.value
client = OpenAI()

Password(description='OpenAI API Key:', layout=Layout(width='50%'), style=TextStyle(description_width='initial…

In [31]:
#client = OpenAI()

response = client.responses.create(
  model="gpt-4o",
  input="Tell me a three sentence bedtime story about a unicorn."
)

print(response)


AttributeError: 'Anthropic' object has no attribute 'responses'

In [None]:
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "developer", "content": "Talk like a pirate."},
        {
            "role": "user",
            "content": "How do I check if a Python object is an instance of a class?",
        },
    ],
)

print(completion.choices[0].message.content)

## Claude API

In [32]:
import ipywidgets as widgets
import os
from anthropic import Anthropic

# Create a password widget for secure input
api_key_input = widgets.Password(
    description='Anthropic API Key:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

display(api_key_input)

# In a separate cell, after the user inputs their key:
os.environ["ANTHROPIC_API_KEY"] = api_key_input.value
client = Anthropic()

Password(description='Anthropic API Key:', layout=Layout(width='50%'), style=TextStyle(description_width='init…

In [22]:
from claudette import *

In [23]:
import anthropic

In [25]:
models

['claude-3-opus-20240229',
 'claude-3-7-sonnet-20250219',
 'claude-3-5-sonnet-20241022',
 'claude-3-haiku-20240307',
 'claude-3-5-haiku-20241022']

In [26]:
model = models[1] #selects 'claude-3-7-sonnet-20250219'


In [27]:

# Option 1: Create a Claudette Client first, then pass it to Chat
#c = Client(model, cli=client)  # Pass your Anthropic client here
#chat = Chat(cli=c, sp="You are a helpful and concise assistant.")

# Option 2: Or directly when creating the Chat
chat = Chat(model=model, cli=Client(model, cli=client), sp="You are a helpful and concise assistant.")

# Now you can use the chat
chat("I'm Silver Rubanza")

BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'Your credit balance is too low to access the Anthropic API. Please go to Plans & Billing to upgrade or purchase credits.'}}

## Custom Models 

In [2]:
mn = "meta-llama/Llama-2-7b-hf"

In [3]:
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)

In [4]:
#model = AutoModelForCausalLM.from_pretrained(mn,device_map=0,load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(mn,device_map=0,quantization_config=quantization_config)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
tokr = AutoTokenizer.from_pretrained(mn)
prompt = "Silver Rubanza is a "
toks = tokr(prompt, return_tensors="pt")

In [18]:
prompt = "Jeremy Howard is a "
toks = tokr(prompt, return_tensors="pt")

In [19]:
toks

{'input_ids': tensor([[    1,  5677,  6764, 17430,   338,   263, 29871]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [20]:
tokr.batch_decode(toks['input_ids'])

['<s> Jeremy Howard is a ']

In [21]:
%%time
res = model.generate(**toks.to("cuda"),max_new_tokens=15).to('cpu')
res

CPU times: user 520 ms, sys: 0 ns, total: 520 ms
Wall time: 531 ms


tensor([[    1,  5677,  6764, 17430,   338,   263, 29871, 29941, 29945, 29899,
          6360, 29899,  1025,   767,   515,   278,  3303,  3900,  1058,   471,
         24383,   297]])

In [22]:
tokr.batch_decode(res)

['<s> Jeremy Howard is a 35-year-old man from the United States who was arrested in']

In [23]:
#model = AutoModelForCausalLM.from_pretrained(mn,device_map=0,load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(mn,device_map=0,torch_dtype=torch.bfloat16)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [24]:
model = AutoModelForCausalLM.from_pretrained('TheBloke/Llama-2-7b-Chat-GPTQ', device_map=0, torch_dtype=torch.float16)

Some weights of the model checkpoint at TheBloke/Llama-2-7b-Chat-GPTQ were not used when initializing LlamaForCausalLM: ['model.layers.0.mlp.down_proj.bias', 'model.layers.0.mlp.gate_proj.bias', 'model.layers.0.mlp.up_proj.bias', 'model.layers.0.self_attn.k_proj.bias', 'model.layers.0.self_attn.o_proj.bias', 'model.layers.0.self_attn.q_proj.bias', 'model.layers.0.self_attn.v_proj.bias', 'model.layers.1.mlp.down_proj.bias', 'model.layers.1.mlp.gate_proj.bias', 'model.layers.1.mlp.up_proj.bias', 'model.layers.1.self_attn.k_proj.bias', 'model.layers.1.self_attn.o_proj.bias', 'model.layers.1.self_attn.q_proj.bias', 'model.layers.1.self_attn.v_proj.bias', 'model.layers.10.mlp.down_proj.bias', 'model.layers.10.mlp.gate_proj.bias', 'model.layers.10.mlp.up_proj.bias', 'model.layers.10.self_attn.k_proj.bias', 'model.layers.10.self_attn.o_proj.bias', 'model.layers.10.self_attn.q_proj.bias', 'model.layers.10.self_attn.v_proj.bias', 'model.layers.11.mlp.down_proj.bias', 'model.layers.11.mlp.gate_p

In [25]:
%%time
res = model.generate(**toks.to("cuda"),max_new_tokens=15).to('cpu')
res

CPU times: user 498 ms, sys: 6.15 ms, total: 504 ms
Wall time: 562 ms


tensor([[    1,  5677,  6764, 17430,   338,   263, 29871, 29941, 29945, 29899,
          6360, 29899,  1025,   767,   515,   278,  3303,  3900,  1058,   471,
         24383,   297]])

In [26]:
tokr.batch_decode(res)

['<s> Jeremy Howard is a 35-year-old man from the United States who was arrested in']

In [27]:
def gen(p, maxlen=15, sample=True):
  toks = tokr(p,return_tensors="pt")
  res = model.generate(**toks.to("cuda"), max_new_tokens=maxlen, do_sample=sample).to('cpu')
  return tokr.batch_decode(res)

In [28]:
%%time
gen(prompt,50)

CPU times: user 1.69 s, sys: 2.83 ms, total: 1.69 s
Wall time: 1.72 s


['<s> Jeremy Howard is a 25-year-old man from a small town in the Midwest. He has been playing music for most of his life and has always been interested in the sounds of the past. As a teenager, he became obsessed with the']

#### Llama 3.1

In [16]:
import transformers
import torch

model_id = "meta-llama/Llama-3.1-8B"

pipeline = transformers.pipeline(
    "text-generation", model=model_id, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto"
)

pipeline("Silver Rubanza is a ")


config.json:   0%|          | 0.00/826 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': 'Silver Rubanza is a 12x12" acrylic painting on canvas. It is painted with a palette knife and is textured and'}]

In [17]:
pipeline("Jeremy Howards is a")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': 'Jeremy Howards is a 2004 graduate of the University of Georgia and is currently an associate attorney with the firm. Jeremy'}]

## Other Models

### Stable Beluga

In [None]:
mn = "stabilityai/StableBeluga-7B"
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, torch_dtype=torch.bfloat16)

In [None]:
sb_sys = "### System:\n You are Stable Beluga, an AI that follows instructions extremely well. Help as much as you can. . \n\n"

In [None]:
def mk_prompt(user, syst=sb_sys):
    return f"{syst} ### User: {user}\n\n### Assistant:\n"

In [None]:
ques = "Who is Silver Rubanza?"

In [None]:
%%time
gen(mk_prompt(ques), 150)

### Open Orca / Playtpus 2

In [None]:
mn = 'TheBloke/OpenOrca-Platypus2-13B-GPTQ'
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, torch_dtype=torch.float16)

In [None]:
def mk_oo_prompt(user):
    return f"### Instruction: {user}\n\n ### Response:\n"

In [None]:
gen(mk_oo_prompt(ques),150)

## Retrival Augmented Generation

In [3]:
%pip install sentence-transformers

Collecting sentence-transformers
  Using cached sentence_transformers-4.0.2-py3-none-any.whl.metadata (13 kB)
Using cached sentence_transformers-4.0.2-py3-none-any.whl (340 kB)
[0mInstalling collected packages: sentence-transformers
Successfully installed sentence-transformers-4.0.2
Note: you may need to restart the kernel to use updated packages.


In [4]:
from wikipediaapi import Wikipedia

In [None]:
wiki = Wikipedia('JeremyHowardBot/0.0', 'en')
jh_page = wiki.page('Jeremy_Howard_(entrepreneur)').text
jh_page = jh_page.split('\nReferences\n')[0]

In [None]:
jh_page

In [None]:
print(jh_page[:500])

In [None]:
print(type(jh_page))

In [None]:
sr_page = 'Hi, I am Rubanza Silver A coder with a background in Software Engineering My work and interest lie in working on various steps of the machine learning lifecycle from Exploratory Data Analysis, Data wrangling, Feature Engineering to Model building, deployment, testing, monitoring, etc.I integrate machine learning models into general software solutions, all in the context of solving a given problem. Likewise, I am proficient with Python, PyTorch, and many other libraries such as fastai, sklearn, etc.I also have over 6 years of experience developing software applications using HTML, CSS, Javascript, React JS, etc.Below are examples of my work'

In [None]:
len(jh_page.split()),len(sr_page.split())

In [None]:
ques_ctx = f"""Answer the question with the help of the provided context.

## Context
{sr_page}

## Question
{ques}

"""

In [None]:
res = gen(mk_prompt(ques_ctx),300)

In [None]:
print(res[0].split('### Assistant:\n')[1])

In [None]:
%pip install sentence-transformers

In [6]:
from sentence_transformers import SentenceTransformer

ModuleNotFoundError: No module named 'sentence_transformers'

In [None]:
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5", device=0)

In [None]:
jh = jh_page.split('\n\n')[0]
print(jh)

In [None]:
sr_page

In [None]:
sr = sr_page.split('\n\n')[0]
print(sr)

In [None]:
q_emb,jh_emb,sr_emb = emb_model.encode([ques,jh,sr], convert_to_tensor=True)

In [None]:
import torch.nn.functional as F

In [None]:
F.cosine_similarity(q_emb, jh_emb, dim=0)

In [None]:
F.cosine_similarity(q_emb, sr_emb, dim=0)

Ug guidelines

In [3]:
%pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Note: you may need to restart the kernel to use updated packages.


In [5]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from pathlib import Path
import PyPDF2
import numpy as np
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# 1. Extract text from the Uganda Clinical Guidelines PDF
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page_num in range(len(reader.pages)):
            text += reader.pages[page_num].extract_text() + "\n\n"
    return text

# 2. Split text into chunks
def split_into_chunks(text, chunk_size=500, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# 3. Load the embedding model
def setup_embedding_model():
    return SentenceTransformer("BAAI/bge-small-en-v1.5", device="cuda" if torch.cuda.is_available() else "cpu")

# 4. Create embeddings for chunks
def create_embeddings(chunks, emb_model):
    embeddings = []
    for chunk in tqdm(chunks):
        embedding = emb_model.encode(chunk, convert_to_tensor=True)
        embeddings.append(embedding)
    return embeddings

# 5. Load Llama model like in lm-hackers
def load_llama_model(model_name="TheBloke/Llama-2-7b-Chat-GPTQ"):
    model = AutoModelForCausalLM.from_pretrained(
        model_name, 
        device_map=0, 
        torch_dtype=torch.float16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

# 6. Create the gen function similar to lm-hackers
def gen(p, model, tokenizer, maxlen=150, sample=True):
    toks = tokenizer(p, return_tensors="pt")
    res = model.generate(
        **toks.to("cuda"), 
        max_new_tokens=maxlen, 
        do_sample=sample
    ).to('cpu')
    return tokenizer.batch_decode(res)

# 7. Create prompt formatting functions
def mk_prompt(user, syst="### System:\nYou are a medical assistant that helps with diagnosing diseases based on symptoms using Ugandan Clinical Guidelines.\n\n"):
    return f"{syst}### User: {user}\n\n### Assistant:\n"

def mk_oo_prompt(user):
    return f"### Instruction: {user}\n\n### Response:\n"

# 8. Retrieve relevant context
def retrieve_context(query, chunks, chunk_embeddings, top_k=3):
    # Encode the query
    query_embedding = emb_model.encode(query, convert_to_tensor=True)
    
    # Calculate similarities
    similarities = []
    for chunk_emb in chunk_embeddings:
        similarity = F.cosine_similarity(query_embedding, chunk_emb, dim=0)
        similarities.append(similarity.item())
    
    # Get top k chunks
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    
    # Return the top chunks and their similarity scores
    return [(chunks[i], similarities[i]) for i in top_indices]

# 9. Main RAG function using Llama model
def rag_diagnosis_llama(symptoms_text, model, tokenizer, chunks, chunk_embeddings, top_k=3):
    # Retrieve relevant context
    relevant_chunks = retrieve_context(symptoms_text, chunks, chunk_embeddings, top_k)
    
    # Format the context
    context = "\n\n".join([chunk for chunk, _ in relevant_chunks])
    
    # Format the prompt using the same style as lm-hackers
    prompt = mk_oo_prompt(f"""Answer the question with the help of the provided context.

## Context

{context}

## Question

Based on these symptoms: "{symptoms_text}", what are the possible diseases according to Ugandan Clinical Guidelines?""")
    
    # Generate response using the gen function
    response = gen(prompt, model, tokenizer, maxlen=300)[0]
    
    # Extract just the assistant's response
    answer = response.split("### Response:\n")[-1]
    
    return {
        "diagnosis": answer,
        "context_used": [chunk for chunk, _ in relevant_chunks],
        "similarity_scores": [sim for chunk, sim in relevant_chunks]
    }

# 10. Main function to set everything up
def setup_medical_rag():
    # Extract text from PDF
    pdf_path = "/teamspace/studios/this_studio/ug_cg_23.pdf"
    
    guidelines_text = extract_text_from_pdf(pdf_path)
    chunks = split_into_chunks(guidelines_text)
    
    # Create embeddings
    emb_model = setup_embedding_model()
    chunk_embeddings = create_embeddings(chunks, emb_model)
    
    # Load Llama model
    model, tokenizer = load_llama_model()
    
    return chunks, chunk_embeddings, model, tokenizer, emb_model

# 11. Demo interface
def run_medical_rag_demo():
    chunks, chunk_embeddings, model, tokenizer, emb_model = setup_medical_rag()
    
    while True:
        symptoms = input("\nDescribe the symptoms (or type 'exit' to quit): ")
        if symptoms.lower() == 'exit':
            break
            
        print("\nGenerating diagnosis...")
        result = rag_diagnosis_llama(symptoms, model, tokenizer, chunks, chunk_embeddings)
        
        print("\n=== Diagnosis ===")
        print(result["diagnosis"])
        
        print("\n=== Top Relevant Guidelines Used ===")
        for i, (context, score) in enumerate(zip(result["context_used"], result["similarity_scores"]), 1):
            print(f"--- Context {i} (Similarity: {score:.3f}) ---")
            print(context[:200] + "..." if len(context) > 200 else context)
            print()

# Run the demo
if __name__ == "__main__":
    run_medical_rag_demo()

100%|██████████| 451/451 [00:14<00:00, 31.99it/s]
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
Some weights of the model checkpoint at TheBloke/Llama-2-7b-Chat-GPTQ were not used when initializing LlamaForCausalLM: ['model.layers.0.mlp.down_proj.bias', 'model.layers.0.mlp.gate_proj.bias', 'model.layers.0.mlp.up_proj.bias', 'model.layers.0.self_attn.k_proj.bias', 'model.layers.0.self_attn.o_proj.bias', 'model.layers.0.self_attn.q_proj.bias', 'model.layers.0.self_attn.v_proj.bias', 'model.layers.1.mlp.down_proj.bias', 'model.layers.1.mlp.gate_proj.bias', 'model.layers.1.mlp.up_proj.bias', 'model.layers.1.self_attn.k_proj.bias', 'model.layers.1.self_attn.o_proj.bias', 'model.layers.1.self_attn.q_proj.bias', 'model.layers.1.self_attn.v_proj.bias', 'model.layers.10.mlp.down_proj.bias', 'model.layers.10.mlp.gate_proj.bias', 'model.layers.10.mlp.up_proj.bias', 'model.layers.10.self_attn.k_proj.bias', 'model.layers.10.self_attn.o_proj

tokenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.



Generating diagnosis...


NameError: name 'emb_model' is not defined

## Finetuning

import datasets

In [None]:
ds = datasets.load_dataset('knowrohit07/know_medical_dialogue_v2')

In [None]:
trn = ds['train']
trn[3]

In [None]:
tst = dict(**trn[3])
tst['question'] = 'Get the count of competition hosts by theme.'
tst

In [None]:
fmt = """SYSTEM: Use the following contextual information to concisely answer the question.

USER: {}
===
{}
ASSISTANT:"""

In [None]:
def sql_prompt(d): return fmt.format(d["context"], d["question"])

## References

[Understanding Tokenization](https://learn.microsoft.com/en-us/dotnet/ai/conceptual/understanding-tokens)

[Hacker's guide to Language Models by Jeremy Howard](https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb)