# LLM 20 Questions Rule Based LLAMA3.1

## Prerequisites
Set the accelerator to GPU T4.

In [1]:
%%bash
mkdir -p /kaggle/working/submission
mkdir -p /tmp/model
pip install -q bitsandbytes accelerate

## Download model

### HuggingFace Login

Add your HuggingFace access token to the notebook secrets under the name `HF_TOKEN`. You can find the secrets panel in `Add-ons -> Secrets`.

In [2]:
from kaggle_secrets import UserSecretsClient
secrets = UserSecretsClient()

HF_TOKEN: str | None = None

try:
    HF_TOKEN = secrets.get_secret("HF_TOKEN")
except Exception:
    # Secret not available; HF_TOKEN stays None
    pass

In [3]:
repo_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

### Download Model via HuggingFace
To reduce disk usage, download the model to `/tmp/model`.

In [4]:
from huggingface_hub import snapshot_download
from pathlib import Path
import shutil

g_model_path = Path("/tmp/model")
if g_model_path.exists():
    shutil.rmtree(g_model_path)
g_model_path.mkdir(parents=True)

snapshot_download(
    repo_id=repo_id,
    ignore_patterns="original*",
    local_dir=g_model_path,
    token=globals().get("HF_TOKEN", None)
)

'/tmp/model'

In [5]:
!ls -l /tmp/model

total 15693188
-rw-r--r-- 1 root root       7627 Aug 13 18:02 LICENSE
-rw-r--r-- 1 root root      44003 Aug 13 18:02 README.md
-rw-r--r-- 1 root root       4691 Aug 13 18:02 USE_POLICY.md
-rw-r--r-- 1 root root        855 Aug 13 18:02 config.json
-rw-r--r-- 1 root root        184 Aug 13 18:02 generation_config.json
-rw-r--r-- 1 root root 4976698672 Aug 13 18:03 model-00001-of-00004.safetensors
-rw-r--r-- 1 root root 4999802720 Aug 13 18:03 model-00002-of-00004.safetensors
-rw-r--r-- 1 root root 4915916176 Aug 13 18:03 model-00003-of-00004.safetensors
-rw-r--r-- 1 root root 1168138808 Aug 13 18:02 model-00004-of-00004.safetensors
-rw-r--r-- 1 root root      23950 Aug 13 18:02 model.safetensors.index.json
-rw-r--r-- 1 root root        296 Aug 13 18:02 special_tokens_map.json
-rw-r--r-- 1 root root    9085657 Aug 13 18:02 tokenizer.json
-rw-r--r-- 1 root root      55351 Aug 13 18:02 tokenizer_config.json


In [6]:
import json

with open("/tmp/model/config.json", "r") as file:
    config = json.load(file)
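# Overwrite rope_scaling with the two-field {"factor", "type"} form.
# Assumption: this works around transformers versions that reject the Llama 3.1 rope_scaling format.
config["rope_scaling"] = {"factor": 8.0, "type": "dynamic"}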
with open("/tmp/model/config.json", "w") as file:
    json.dump(config, file)

### Save quantized model
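Next, load the downloaded model into memory with 8-bit quantization and save the quantized weights.  
The quantized checkpoint takes less storage than the original, which keeps the submission smaller.

Moreover, since the saved model has already been quantized, `main.py` does not need to import `bitsandbytes` or pass a quantization config: the settings are stored with the saved model.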
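
As a rough sanity check on the storage saving, the sketch below estimates the weight size at 16-bit and 8-bit precision. The parameter count of about 8 billion is an assumption based on the "8B" in the model name, and a real 8-bit checkpoint keeps some tensors in higher precision, so treat the numbers as ballpark figures.

```python
# Rough estimate of weight storage at different precisions.
# ASSUMPTION: ~8e9 parameters, inferred from the "8B" in the model name.
N_PARAMS = 8e9

def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Return the approximate weight size in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9

fp16_gb = weights_gb(N_PARAMS, 2)  # 16-bit floats: 2 bytes per parameter
int8_gb = weights_gb(N_PARAMS, 1)  # 8-bit integers: 1 byte per parameter

print(f"fp16: ~{fp16_gb:.0f} GB, int8: ~{int8_gb:.0f} GB")
```

The 16-bit estimate is in line with the four downloaded shards, which add up to about 16.06 GB in the `ls -l` listing above.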

In [7]:
# load model on memory
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

downloaded_model = "/tmp/model"

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,  # 8-bit weights; the bnb_4bit_* options only apply to 4-bit loading
)

model = AutoModelForCausalLM.from_pretrained(
    downloaded_model,
    quantization_config = bnb_config,
    torch_dtype = torch.float16,
    device_map = "auto",
    trust_remote_code = True,
)

tokenizer = AutoTokenizer.from_pretrained(downloaded_model)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [8]:
# save model in submission directory
model.save_pretrained("/kaggle/working/submission/model")
tokenizer.save_pretrained("/kaggle/working/submission/model")

('/kaggle/working/submission/model/tokenizer_config.json',
 '/kaggle/working/submission/model/special_tokens_map.json',
 '/kaggle/working/submission/model/tokenizer.json')

In [9]:
# unload model from memory
import gc, torch
del model, tokenizer
gc.collect()
torch.cuda.empty_cache()

## Agent

### Rules

Answers in the question bank are encoded as 0 = no and 1 = yes.

In [10]:
import pandas as pd

questions_bank_df = pd.read_csv('/kaggle/input/revealed-keywords-mapped/revealed_keywords_mapped.csv', index_col=0)
print(questions_bank_df.shape)
questions_bank_df

(579, 160)


| keyword | Is it man-made? | Is it used outdoors? | Is it something you can find in a bathroom? | Is it something people wear? | Is it something that can be eaten? | Is it something that has wheels? | Is it something that has a screen? | Is it something that has a smell? | Is it an animal? | Is it something that requires electricity or batteries? | ... | Is it made of metal | Is it made of metal.1 | Is it a collective noun rather than a compound noun? | Is it a common noun rather than a proper noun? | Is it a verbal noun which often ends in -ing (e.g., swimming, reading)? | Is it countable? | Is it gender-specific? | Is it a concrete noun rather than an abstract noun? | Is it a device? | Is it a plant name? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Advertisement | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| Agave | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| Air Conditioner | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| Air compressor | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| Air filter | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Wristband | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| Yoga Block | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| Yoga mat | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| Yogurt | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |


### Rule Based Agent

In [11]:
%%writefile submission/rulebased.py

import numpy as np

BALANCE_THRESHOLD = 0.1
KEYWORD_THRESHOLD = 0.05
NOISE_STD = 0.01

class RuleBasedQuestions:
    def __init__(self, questions_df, add_noise=False):
        self.questions_df = questions_df
        # questions asked so far and the parsed yes/no answers
        self.questions_log = []
        self.answers_log = []
        self.enabled = True
        self.keywords_population = len(self.questions_df)
        self.add_noise = add_noise
        self.cnt = 0
        self.guess_based_in_answer = False
        
    def get_question(self):
        balances = []
        questions = []
        for c in self.questions_df.columns:
            
            # ignore already asked questions
            if c in self.questions_log: continue
            # if -2 in self.questions_df[c].values: continue # don't use irrelevant question (0=no, 1=yes, -1=unsure, -2=irrelevant)
                
            b = len(self.questions_df[self.questions_df[c]==1]) / len(self.questions_df)
            b = b if b < 0.5 else 1-b
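            # If every remaining question has b == 0, nothing is left to ask here -> fall back to keyword questions.
            # A threshold would drop questions even at b=0.01; since the goal of full classification is to shrink
            # the keyword set as much as possible, only questions whose yes-ratio is exactly 0 or 1 are skipped.
            if b == 0: continue # ignore questions that cannot split the remaining keywords

            balances.append(b)
            questions.append(c)

        # no remaining question can split the keywords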
        if len(questions) == 0:
            self.enabled = False
            return None
        
        # add noise to balance for exploration
        if self.add_noise:
            balances = np.array(balances) + np.random.normal(0, NOISE_STD, len(balances))
        
        best_question = questions[np.argmax(balances)]
        return best_question
    
    def log_question(self, question):
        self.questions_log.append(question)

    def log_answer(self, answer):
        # parse answer
        answer_yes = True
        if "no" in answer.lower():
            answer_yes = False
            
        # log answer
        self.answers_log.append(answer_yes)
    
    def filter_questions(self):   
        if self.enabled == False:
            return
        # filter questions_df based on answer, keep rows with answer_yes
        question = self.questions_log[-1]
        answer = self.answers_log[-1]
        
        self.questions_df = self.questions_df[self.questions_df[question]==int(answer)]        
        
        # check threshold or
        # check if there are no more corresponding keywords, this will prevent division by zero
        if (len(self.questions_df) / self.keywords_population) < KEYWORD_THRESHOLD or len(self.questions_df) == 0:
            self.enabled = False
            


    def reset(self):
        self.questions_log = []
        self.answers_log = []
        self.enabled = True
        self.keywords_population = len(self.questions_df)

Writing submission/rulebased.py
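
To make the selection rule concrete, here is a dependency-free sketch of the same idea on a toy question bank. The four keywords and their 0/1 values are copied from the table above; the helper names `balance`, `pick_question`, and `keep_matching` are made up for illustration, and the real class works on the pandas DataFrame instead.

```python
# Toy question bank: keyword -> {question: 0 (no) / 1 (yes)}.
# Helper names are hypothetical; this only mirrors the idea in RuleBasedQuestions.
EAT = "Is it something that can be eaten?"
MADE = "Is it man-made?"
DEVICE = "Is it a device?"

BANK = {
    "Agave":      {MADE: 0, EAT: 1, DEVICE: 0},
    "Yogurt":     {MADE: 1, EAT: 1, DEVICE: 0},
    "Yoga mat":   {MADE: 1, EAT: 0, DEVICE: 0},
    "Air filter": {MADE: 1, EAT: 0, DEVICE: 1},
}

def balance(bank, question):
    """Fraction of 'yes' rows folded into [0, 0.5]; 0.5 is a perfect split."""
    yes_ratio = sum(row[question] for row in bank.values()) / len(bank)
    return min(yes_ratio, 1 - yes_ratio)

def pick_question(bank, asked):
    """Return the unasked question with the highest balance, or None."""
    if not bank:
        return None
    questions = next(iter(bank.values())).keys()
    scored = [(balance(bank, q), q) for q in questions if q not in asked]
    scored = [(b, q) for b, q in scored if b > 0]  # b == 0 cannot split anything
    if not scored:
        return None
    return max(scored, key=lambda bq: bq[0])[1]

def keep_matching(bank, question, answer_yes):
    """Keep only keywords whose stored answer agrees with the reply."""
    return {k: row for k, row in bank.items() if row[question] == int(answer_yes)}

asked = []
q1 = pick_question(BANK, asked)                      # most balanced question
asked.append(q1)
remaining = keep_matching(BANK, q1, answer_yes=True)  # pretend the answer was "yes"
q2 = pick_question(remaining, asked)
print(q1, "->", sorted(remaining), "->", q2)
```

On this toy bank the edible/non-edible question splits the keywords two against two, so it is asked first; after a "yes" only `Agave` and `Yogurt` remain and the man-made question becomes the best split.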


### Prompts

Prompts are adapted from the [Anthropic Prompt Library](https://docs.anthropic.com/en/prompt-library/library)

In [12]:
%%writefile submission/prompts.py

def asker_prompt(obs):
    message = []
    
    # System prompt
    ask_prompt = f"""You are a helpful AI assistant with expertise in playing 20 questions game.
Your task is to ask questions so that the another guesser can guess the word the user is thinking of.
The keyword is of category: "things"
Narrow down the possibilities by asking yes/no questions.
You only ask one yes/no question in your turn. No other verbosity around. 
Think step by step and try to ask the most informative questions.
you should consider the possibility that 'yes' or 'no' answers could be wrong .
\n"""

    message.append({"role": "system", "content": ask_prompt})

    # Chat history
    for q, a in zip(obs.questions, obs.answers):
        message.append({"role": "assistant", "content": q})
        message.append({"role": "user", "content": a})

    return message


def guesser_prompt(obs):
    message = []
    
    # System prompt
    guess_prompt = f"""You are guesser with expertise in playing 20 questions game. You can guess even hard words.
The correct answer is of category: "things".
\n"""

    # Chat history
    chat_history = ""
    for q, a in zip(obs.questions, obs.answers):
        chat_history += f"""Question: {q}\nAnswer: {a}\n"""
    prompt = (
            guess_prompt + f"""so far, the current state of the game is as following:\n{chat_history}\n
Guess the correct answer based on that(Consider Wrong answers can be included.)
        Answer with the guessed word only. Do not reuse guesses guessed before"""    
    )
    
    
    message.append({"role": "system", "content": prompt})
    
    return message


def answerer_prompt(obs):
    message = []
    
    # System prompt
    prompt = f"""You are Answerer with expertise in playing 20 questions game.
Your task is to answer the yes or no questions.
You are a very strict and accurate answerer because if your answer is not correct, someone could die.
Your answers must be 'yes' or 'no' and correct 
The correct answer is: "{obs.keyword}", it is of category: "things"
When question is a guess, you must say "yes" only in case the guess is "{obs.keyword}".
The other? you must say 'no'.
"""

    message.append({"role": "system", "content": prompt})
    
    # Chat history
    message.append({"role": "user", "content": obs.questions[-1]})
    
    
    return message

class Answerer:
    def __init__(self, device, tokenizer, model, obs):
        self.device = device
        self.tokenizer = tokenizer
        self.model = model
        self.generation_kwargs = {
            'min_length': -1,
            'top_k': 0, # No top-k sampling.
            'top_p': 1.0, # No nucleus sampling.
            'do_sample': True,
            'temperature': 0.2,
            'pad_token_id': tokenizer.eos_token_id
        }
        self.keyword = obs.keyword
        self.question = obs.questions[-1]

    def get_messages(self, keyword, question):
        
        messages = [{'role': "system", 'content': f"""Your task is to reply with a single word answer - either 'yes' or 'no'.

I am providing you an entity delimited by <entity> </entity>.

Entity:
<entity>{keyword}</entity>

Next, the user will ask you a question delimited by <question> </question>.

Please use the entity and the question, and formulate a single word 'yes' or 'no' answer that is factually correct. You must use all your knowledge about the entity to formulate the answer.

Never mention the entity in your answer. Do not provide an explanation. Do not use any other words. Don’t get distracted by the question; remember the correct answer is {keyword}."""}]
        messages.append({
            'role': "user", 'content': f"""Question:
<question>{question}</question>"""
        })
        return messages

    def answer(self):
        keyword = self.keyword
        question = self.question
        messages = self.get_messages(keyword, question)
        prompt_tensor = self.tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_tensors="pt").to(self.device)
        answer_tensor = self.model.generate(prompt_tensor, max_new_tokens=2, eos_token_id=self.tokenizer.eos_token_id, **self.generation_kwargs)
        answer = self.tokenizer.decode(answer_tensor[0][len(prompt_tensor[0]):], skip_special_tokens=True)
        answer = answer.strip().lower()
        if answer not in ['no', 'yes']:
            answer = 'yes'
        return answer



Writing submission/prompts.py
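
The message list that `asker_prompt` builds can be previewed without loading the model. The sketch below uses a stand-in observation built with `SimpleNamespace` and a shortened, hypothetical system prompt; only the role layout (system first, then each past question as `assistant` and each answer as `user`) mirrors the function above.

```python
from types import SimpleNamespace

def build_asker_chat(obs, system_prompt="You ask one yes/no question per turn."):
    """Condensed, hypothetical stand-in for asker_prompt: same role layout."""
    messages = [{"role": "system", "content": system_prompt}]
    for q, a in zip(obs.questions, obs.answers):
        messages.append({"role": "assistant", "content": q})  # our earlier question
        messages.append({"role": "user", "content": a})       # the answerer's reply
    return messages

# Fake observation with two completed rounds (made-up data)
obs = SimpleNamespace(
    questions=["Is it man-made?", "Is it a device?"],
    answers=["yes", "no"],
)
chat = build_asker_chat(obs)
for m in chat:
    print(f'{m["role"]:>9}: {m["content"]}')
```

Because the history ends on a `user` turn, a chat template applied with `add_generation_prompt=True` would then cue the model to write the next `assistant` message, which here is the next question.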


The `Answerer` class is taken from https://www.kaggle.com/competitions/llm-20-questions/discussion/524258
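
Both `log_answer` and the `Answerer` reduce free-form text to a yes/no value. A slightly stricter variant, sketched here as an assumption rather than what the notebook ships, matches whole words so that a reply such as "I don't know" is not read as "no" by a substring check. `normalize_answer` and its `default` parameter are hypothetical names.

```python
import re

def normalize_answer(text: str, default: str = "yes") -> str:
    """Map free-form text to 'yes' or 'no' using whole-word matching."""
    words = re.findall(r"[a-z]+", text.lower())
    if "yes" in words:
        return "yes"
    if "no" in words:
        return "no"
    return default  # unparseable replies fall back to the default

print(normalize_answer(" Yes."))
print(normalize_answer("No"))
print(normalize_answer("I don't know"))  # no whole-word match, so the default is used
```

The default of `"yes"` follows the fallback already used in `Answerer.answer`.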

### Agent

Rule-based mode is enabled by default.  
To disable it, set `self.RuleBasedAgent.enabled = False` in the `__init__` method of `Robot`.
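
How long does rule-based mode last? `filter_questions` switches it off once the remaining keywords fall below `KEYWORD_THRESHOLD` (5%) of the starting population. The sketch below works that out for the 579 keywords loaded earlier, under the idealised assumption that every question splits the candidates exactly in half; real questions are less balanced, so the actual hand-over point will vary.

```python
KEYWORD_THRESHOLD = 0.05   # same constant as in rulebased.py
population = 579           # rows in the question bank loaded earlier

remaining = float(population)
questions_asked = 0
# Rule-based mode stays on while remaining / population >= threshold
while remaining / population >= KEYWORD_THRESHOLD:
    remaining /= 2         # ASSUMPTION: a perfectly balanced yes/no split
    questions_asked += 1

print(questions_asked, round(remaining, 1))
```

Under that assumption the rule-based asker covers about five questions before the LLM asker takes over.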

In [13]:
%%writefile submission/main.py
# comment magic command before simulation

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import sys
import pandas as pd

from prompts import *
from rulebased import RuleBasedQuestions

torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)


KAGGLE_AGENT_PATH = "/kaggle_simulations/agent/"
if os.path.exists(KAGGLE_AGENT_PATH):
    MODEL_PATH = os.path.join(KAGGLE_AGENT_PATH, "model")
    QUESTIONS_PATH = os.path.join(KAGGLE_AGENT_PATH, "revealed_keywords_mapped.csv")
else:
    MODEL_PATH = "/kaggle/working/submission/model"
    QUESTIONS_PATH = "/kaggle/input/revealed-keywords-mapped/revealed_keywords_mapped.csv"
    
questions_bank_df = pd.read_csv(QUESTIONS_PATH)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
device = "cuda" if torch.cuda.is_available() else "cpu"
# specify end-of-turn tokens for your desired model
terminators = [tokenizer.eos_token_id]

# Additional potential end-of-turn token
# llama3, phi3, qwen2 by order
potential_terminators = ["<|eot_id|>", "<|end|>", "<end_of_turn>"]

for token in potential_terminators:
    token_id = tokenizer.convert_tokens_to_ids(token)
    if token_id is not None:
        terminators.append(token_id)

def generate_response(chat):
    inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs, temperature=0.9, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id, eos_token_id=terminators)
    response = outputs[0][inputs.shape[-1]:]
    out = tokenizer.decode(response, skip_special_tokens=True)

    return out


class Robot:
    def __init__(self, questions_df):
        self.RuleBasedAgent = RuleBasedQuestions(questions_df, add_noise=False)
        
    def on(self, mode, obs):
        assert mode in [
            "asking", "guessing", "answering",
        ], "mode can only take one of these values: asking, answering, guessing"
        if mode == "asking":
            # launch the asker role
            output = self.asker(obs)
        if mode == "answering":
            # launch the answerer role
            output = self.answerer(obs)
            if "yes" in output.lower():
                output = "yes"
            elif "no" in output.lower():
                output = "no"
            if "yes" not in output.lower() and "no" not in output.lower():
                output = "yes"
        if mode == "guessing":
            # launch the guesser role
            output = self.guesser(obs)
        return output

    def asker(self, obs):
        if self.RuleBasedAgent.enabled == True:
            output = self.RuleBasedAgent.get_question()
        
        if self.RuleBasedAgent.enabled == False:
            input = asker_prompt(obs)
            output = generate_response(input)
        
        self.RuleBasedAgent.log_question(output)
        
        return output

    def guesser(self, obs):
        self.RuleBasedAgent.log_answer(obs.answers[-1])
        self.RuleBasedAgent.filter_questions()
        
        input = guesser_prompt(obs)
        output = generate_response(input)

        return output



robot = Robot(questions_bank_df)


def agent(obs, cfg):

    if obs.turnType == "ask":
        response = robot.on(mode="asking", obs=obs)

    elif obs.turnType == "guess":
        response = robot.on(mode="guessing", obs=obs)

    elif obs.turnType == "answer":
        response = Answerer(device=device, tokenizer=tokenizer, model=model, obs=obs).answer()

    if response is None or len(response) <= 1:
        response = "yes"

    return response


Writing submission/main.py


In [14]:
# unload model from memory
import gc, torch
gc.collect()
torch.cuda.empty_cache()

## Simulation

### Install pygame

In [15]:
!pip install pygame

Collecting pygame
  Downloading pygame-2.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading pygame-2.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 14.3 MB/s eta 0:00:00
Installing collected packages: pygame
Successfully installed pygame-2.6.0


In [16]:
%%writefile dumb.py

def dumb_agent(obs, cfg):
    
    # if agent is guesser and turnType is "ask"
    if obs.turnType == "ask":
        response = "Is it a duck?"
    # if agent is guesser and turnType is "guess"
    elif obs.turnType == "guess":
        response = "duck"
    # if agent is the answerer
    elif obs.turnType == "answer":
        response = "no"
    
    return response

Writing dumb.py


In [17]:
%%time

from kaggle_environments import make
env = make("llm_20_questions", debug=True)
game_output = env.run(agents=["submission/main.py", "submission/main.py", "dumb.py", "dumb.py"])

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
2024-08-13 18:05:26.779737: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-13 18:05:26.779866: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-13 18:05:26.942278: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


CPU times: user 1min 20s, sys: 16.3 s, total: 1min 37s
Wall time: 1min 26s


In [18]:
env.render(mode="ipython", width=600, height=500)

## Submit Agent

In [19]:
!apt install pigz pv > /dev/null

  pid, fd = os.forkpty()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)






In [20]:
!tar --use-compress-program='pigz --fast --recursive | pv' \
    -cf submission.tar.gz \
    -C /kaggle/input/revealed-keywords-mapped revealed_keywords_mapped.csv \
    -C /kaggle/working/submission .

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
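
Before uploading, it can be worth confirming that `main.py` sits at the top level of the archive, since the agent code looks for its files directly under `/kaggle_simulations/agent/`. The helper below is a hypothetical check using only the standard library; the demo builds a tiny throwaway archive with placeholder files so it runs anywhere.

```python
import io
import tarfile
import tempfile
from pathlib import Path

def top_level_names(archive_path):
    """Return the set of top-level entries in a .tar.gz archive."""
    names = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getnames():
            # Path() drops the leading "./" that `tar -C dir .` produces
            parts = Path(member).parts
            if parts:
                names.add(parts[0])
    return names

# Demo on a throwaway archive with placeholder contents
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.tar.gz"
    with tarfile.open(demo, "w:gz") as tar:
        for name in ["./main.py", "./model/config.json", "revealed_keywords_mapped.csv"]:
            data = b"placeholder"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    found = top_level_names(demo)

print(sorted(found))
```

On the real archive, `top_level_names("submission.tar.gz")` should include `main.py`, `model`, and `revealed_keywords_mapped.csv`.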



