# Drive Test Tag Generation With BERTopic
Generate tags for the written portion of the Chinese driving exam using BERTopic.

## 1. Load Data
Load the question bank from a local JSON database.

In [1]:
from src.qb.question import Question
from src.qb.question_bank import QuestionBank
from data_storage.database.json_database import LocalJsonDB

db = LocalJsonDB("data_storage/database/json_db/data.json",
                 "data_storage/database/json_db/images")
qb : QuestionBank = db.load()
print(qb.question_count())

2836


## 2. Format Data
Although the Siglip2 model can handle images of different sizes, I still resize all images to a common square size.

In [2]:
from data_cleaning.img_reshaper import ImgSquarer

IMG_DIR_256 = "data_cleaning/resized_imgs/img256"
IMG_DIR_512 = "data_cleaning/resized_imgs/img512"

squarer_256 = ImgSquarer(256)
# squarer_512 = ImgSquarer(512)

In [3]:
def resize_images(qb: QuestionBank, squarer: ImgSquarer, new_dir: str) -> None:
    for chapter_id in qb.get_all_chapter_num():
        for qid in qb.get_qids_by_chapter(chapter_id):
            question = qb.get_question(qid)
            if question.get_img_path() is not None:
                question.set_img_path(squarer.reshape(qid, qb.get_img_dir(), new_dir))

In [4]:
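import os

# Create the target directory if needed so that os.listdir does not raise.
os.makedirs(IMG_DIR_256, exist_ok=True)

# If the directory is empty, resize images.
if not os.listdir(IMG_DIR_256):
    print("Resizing images to 256x256...")
    resize_images(qb, squarer_256, IMG_DIR_256)
else:
    # Note: in this branch set_img_path is never called, so the questions
    # keep their original image paths.
    print("Images already resized to 256x256, skipping...")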

Images already resized to 256x256, skipping...
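
`ImgSquarer` is a local helper whose implementation is not shown in this notebook. As a rough sketch of what squaring an image to a fixed side could involve, the hypothetical function below computes a letterbox layout: scale the image so its longer edge equals the target side, then center it on a square canvas. Whether `ImgSquarer` pads, crops, or stretches is an assumption on my part.

```python
from typing import Tuple


def letterbox_layout(width: int, height: int, side: int) -> Tuple[int, int, int, int]:
    """Return (new_width, new_height, x_offset, y_offset) for a side x side canvas.

    Hypothetical helper: pad-to-square is only one plausible strategy and may
    differ from what ImgSquarer actually does.
    """
    if width <= 0 or height <= 0 or side <= 0:
        raise ValueError("width, height and side must be positive")
    longer_edge = max(width, height)
    # Scale both edges by side / longer_edge, keeping the aspect ratio.
    new_width = max(1, round(width * side / longer_edge))
    new_height = max(1, round(height * side / longer_edge))
    # Center the scaled image on the square canvas.
    x_offset = (side - new_width) // 2
    y_offset = (side - new_height) // 2
    return new_width, new_height, x_offset, y_offset


print(letterbox_layout(640, 480, 256))  # (256, 192, 0, 32)
```

The returned size and offsets could then be handed to an image library's resize and paste calls.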


## 3. Create Multimodal Embeddings
Create multimodal embeddings for the questions using a Siglip2 model.
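
How `Siglip2QBEmbedder` combines a question's text and image is not shown in this notebook. One simple possibility, offered purely as an assumption, is to L2-normalize the text and image vectors and average them, falling back to the text vector for questions without an image. The `fuse` function and the toy vectors below are hypothetical.

```python
from typing import Optional

import numpy as np


def l2_normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = np.linalg.norm(vec)
    return vec if norm == 0 else vec / norm


def fuse(text_vec: np.ndarray, image_vec: Optional[np.ndarray] = None) -> np.ndarray:
    """Hypothetical fusion: mean of unit text and image vectors, re-normalized."""
    text_unit = l2_normalize(text_vec)
    if image_vec is None:
        # Text-only question: use the text embedding alone.
        return text_unit
    return l2_normalize(text_unit + l2_normalize(image_vec))


text_only = fuse(np.array([3.0, 4.0]))
with_image = fuse(np.array([1.0, 0.0]), np.array([0.0, 2.0]))
print(text_only, with_image)
```

Normalizing before averaging keeps either modality from dominating only because its vectors happen to have a larger norm.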

In [5]:
# Library Imports
from transformers import AutoModel, AutoProcessor

# Local Imports
from embedder.siglip2_qb_embedder import Siglip2QBEmbedder

### a) Load/Download the Siglip2 Model
We will be using "google/siglip2-base-patch16-256" for this task.

In [6]:
MODEL_NAME = "google/siglip2-base-patch16-256"

model = AutoModel.from_pretrained(MODEL_NAME)
processor = AutoProcessor.from_pretrained(MODEL_NAME)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


### b) Create embeddings

#### i) Define a logger

In [7]:
import logging
from logging import Logger
from datetime import datetime
import os

LOGGING_PATH = "logs"

def get_logger(name: str) -> Logger:
    # Create logger
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Re-running this cell returns the same logger object, so skip the
    # setup if a handler is already attached to avoid duplicate log lines.
    if logger.handlers:
        return logger

    # Make sure the log directory exists before opening the log file
    os.makedirs(LOGGING_PATH, exist_ok=True)

    # Create a file handler with timestamp in filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")
    file_handler = logging.FileHandler(
        os.path.join(LOGGING_PATH, f"{name}_{timestamp}.log")
    )

    # Create formatter
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    file_handler.setFormatter(formatter)

    # Add handler to logger
    logger.addHandler(file_handler)

    return logger


embedder_logger = get_logger("embedder")

#### ii) Create the embedder

In [8]:
custom_embedder = Siglip2QBEmbedder(model, processor, embedder_logger)

#### iii) Generate embeddings

In [9]:
EMBEDDINGS_DIR = "data_storage/embedding_dir"
EMBEDDING_FILE_NAME = "siglip2_embeddings.npz"

os.makedirs(EMBEDDINGS_DIR, exist_ok=True)
embedding_file = os.path.join(EMBEDDINGS_DIR, EMBEDDING_FILE_NAME)
print(embedding_file)

data_storage/embedding_dir/siglip2_embeddings.npz


In [10]:
if os.path.exists(embedding_file):
    print(f"Embeddings already exist at {embedding_file}, skipping generation.")
else:
    print("Generating embeddings...")
    # Generate embeddings for the question bank
    embeddings = custom_embedder.encode_qb(qb)

Generating embeddings...


#### iv) Save embeddings

In [11]:
import numpy as np

def save_embeddings(embeddings, file_path):
    np.savez(file_path, **{str(qid): embeddings[qid] for qid in embeddings})

if not os.path.exists(embedding_file):
    print(f"Saving embeddings to {embedding_file}...")
    save_embeddings(embeddings, embedding_file)
else:
    print(f"Embeddings file {embedding_file} already exists, skipping save.")

Saving embeddings to data_storage/embedding_dir/siglip2_embeddings.npz...


## 4. Generate Tags with BERTopic

### a) Load Embeddings

In [12]:
# Load the saved embeddings back into a dict.
def load_embeddings(file_path):
    # Keys come back as strings because they were saved with str(qid).
    loaded = np.load(file_path)
    return {key: loaded[key] for key in loaded.files}


id_to_embedding = load_embeddings(embedding_file)
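
One detail of the save and load round trip is worth seeing in isolation: `np.savez` stores arrays under string names, so the loaded dictionary is keyed by `str(qid)` rather than by the original id. The sketch below uses toy vectors and assumes integer question ids, which this notebook does not confirm.

```python
import os
import tempfile

import numpy as np

# Toy embeddings keyed by integer question ids (hypothetical data).
toy_embeddings = {101: np.array([0.1, 0.2]), 102: np.array([0.3, 0.4])}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npz")
    # Same pattern as save_embeddings: keyword names must be strings.
    np.savez(path, **{str(qid): vec for qid, vec in toy_embeddings.items()})
    # Read every array into memory before the temporary file is removed.
    with np.load(path) as loaded:
        restored = {key: loaded[key] for key in loaded.files}

print(sorted(restored))  # ['101', '102']
print(101 in restored)   # False
```

Any later lookup therefore has to use `str(qid)` if the question bank returns non-string ids.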

### b) Create BERTopic Model

In [13]:
from bertopic import BERTopic
topic_model = BERTopic()
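
`BERTopic()` with no arguments uses its default `language="english"` and a default `CountVectorizer`, which may not suit Chinese documents. As a hedged sketch, the hypothetical `tokenize_zh` below splits Chinese runs into character bigrams (a stand-in for a proper word segmenter) and keeps ASCII runs such as `C1` or `120` whole. `build_topic_model` shows how it could be passed through BERTopic's `language` and `vectorizer_model` parameters; that wiring is untested here.

```python
import re
from typing import List

# Runs of CJK ideographs, or runs of ASCII letters/digits (e.g. "C1", "120").
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def tokenize_zh(text: str) -> List[str]:
    """Hypothetical tokenizer: bigrams for Chinese runs, whole ASCII runs otherwise."""
    tokens: List[str] = []
    for run in _TOKEN_RE.findall(text):
        if "\u4e00" <= run[0] <= "\u9fff" and len(run) > 1:
            # Overlapping character bigrams approximate word boundaries.
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.append(run)
    return tokens


def build_topic_model():
    """Wire the tokenizer into BERTopic; imports stay local so tokenize_zh runs without them."""
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(tokenizer=tokenize_zh, token_pattern=None)
    return BERTopic(language="multilingual", vectorizer_model=vectorizer)


print(tokenize_zh("最高速度30公里"))  # ['最高', '高速', '速度', '30', '公里']
```

A dictionary-based segmenter would likely give cleaner topic words than bigrams; the bigram version is used here only because it needs no extra dependency.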

### c) Format embeddings for BERTopic

In [14]:
from typing import List, Tuple

from embedder.siglip2_qb_embedder import format_question

def format_for_bertopic(id_to_embedding: dict, qb: QuestionBank) -> Tuple[List[str], List[np.ndarray]]:
    """
    Build the documents and their embeddings, in matching order, for BERTopic.
    """
    documents: List[str] = []
    embedding_lst: List[np.ndarray] = []
    for chapter_id in qb.get_all_chapter_num():
        for qid in qb.get_qids_by_chapter(chapter_id):
            doc = format_question(qb.get_question(qid), qb.describe_chapter(chapter_id))
            documents.append(doc)
            # Embeddings were saved under str(qid), so look them up the same way.
            embedding_lst.append(id_to_embedding[str(qid)])
    return documents, embedding_lst

In [15]:
documents, embedding_lst = format_for_bertopic(id_to_embedding, qb)
embeddings = np.array(embedding_lst)

### d) Fit the BERTopic model

In [16]:
topic_model.fit(documents, embeddings)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


<bertopic._bertopic.BERTopic at 0x131af0f10>

### e) Get Topics with the built-in representation model

In [17]:
topic_model.visualize_topics()

In [18]:
topics, probs = topic_model.transform(documents, embeddings=embeddings)
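
`transform` returns one topic id per document, in the same order as `documents`, with `-1` marking outliers. A minimal sketch of turning those ids into per-question tags follows. The `build_tags` helper, the `qids` list, and the toy `topic_words` dictionary (shaped like the output of `topic_model.get_topics()`) are hypothetical, since `format_for_bertopic` does not return the question ids.

```python
from typing import Dict, List, Sequence, Tuple


def build_tags(
    qids: Sequence[str],
    topics: Sequence[int],
    topic_words: Dict[int, List[Tuple[str, float]]],
    top_n: int = 3,
) -> Dict[str, List[str]]:
    """Map each question id to the top words of its topic; outliers (-1) get no tags."""
    if len(qids) != len(topics):
        raise ValueError("qids and topics must have the same length")
    tags: Dict[str, List[str]] = {}
    for qid, topic_id in zip(qids, topics):
        if topic_id == -1:
            tags[qid] = []
            continue
        words = topic_words.get(topic_id, [])
        # Skip the empty strings that pad short topic representations.
        tags[qid] = [word for word, _ in words[:top_n] if word]
    return tags


# Toy data standing in for real model output.
toy_topic_words = {2: [("限速", 0.14), ("", 0.10), ("公里", 0.05)]}
toy_tags = build_tags(["q1", "q2"], [2, -1], toy_topic_words)
print(toy_tags)  # {'q1': ['限速', '公里'], 'q2': []}
```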

In [19]:
topic_model.get_representative_docs(2)

['章节:交通信号题目:如图所示，驾驶机动车在同方向只有1条机动车道的城市道路上行驶，最高行驶速度不得超过每小时多少公里？答案:50',
 '章节:交通信号题目:如图所示，在没有道路中心线的公路上驾驶机动车，最高行驶速度不得超过每小时多少公里？答案:40',
 '章节:交通信号题目:如图所示，驾驶小型载客汽车在高速公路上行驶，同方向有3条车道的，中间车道的最低车速为每小时多少公里？答案:90']

In [20]:
topic_model.get_topic(2)

[('390', 0.1389935820753795),
 ('170', 0.1389935820753795),
 ('110120', 0.1389935820753795),
 ('100120', 0.09902102579427789),
 ('150', 0.06824008523968599),
 ('90', 0.039972556281101614),
 ('40', 0.03648754455471361),
 ('60', 0.03357168780326683),
 ('50', 0.03073019708813507),
 ('30', 0.02455003670380846)]
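
The topic words above are all digit strings, which looks like a preprocessing artifact rather than a property of the questions. As far as I know, when `language` is left at its default `"english"`, BERTopic removes every character outside `[A-Za-z0-9 ]` before building topic representations; treat that as an assumption about the installed version. The sketch below applies the same kind of filter to two of the representative documents of topic 2.

```python
import re

docs = [
    "章节:交通信号题目:如图所示，驾驶机动车在同方向只有1条机动车道的城市道路上行驶，最高行驶速度不得超过每小时多少公里？答案:50",
    "章节:交通信号题目:如图所示，驾驶小型载客汽车在高速公路上行驶，同方向有3条车道的，中间车道的最低车速为每小时多少公里？答案:90",
]


def strip_non_ascii_alnum(doc: str) -> str:
    # Assumed to mirror BERTopic's English-only cleaning step.
    return re.sub(r"[^A-Za-z0-9 ]+", "", doc)


for doc in docs:
    print(strip_non_ascii_alnum(doc))
# 150
# 390
```

Both results, `150` and `390`, appear among the topic words listed above, which is consistent with the Chinese text being stripped and the remaining digits being fused into single tokens.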

In [22]:
topic_model.get_document_info(documents)

| | Document | Topic | Name | Representation | Representative_Docs | Top_n_words | Probability | Representative_document |
|---|---|---|---|---|---|---|---|---|
| 0 | 章节:道路交通安全法律、法规和规章题目:驾驶机动车不按照规定避让校车的，一次记12分。答案:错 | 15 | 15_120_50_12_ | [120, 50, 12, , , , , , , ] | [章节:道路交通安全法律、法规和规章题目:驾驶机动车跨越双实线行驶属于什么行为？答案:违法行... | 120 - 50 - 12 - - - - - - - | 1.000000 | False |
| 1 | 章节:道路交通安全法律、法规和规章题目:从事校车业务或者旅客运输，严重超过额定乘员载客的，可... | 12 | 12_503_5012_b2_102 | [503, 5012, b2, 102, 40, 70, 30, 12, , ] | [章节:道路交通安全法律、法规和规章题目:公安机关交通管理部门对累积记分达到规定分值的驾驶人... | 503 - 5012 - b2 - 102 - 40 - 70 - 30 - 12 - - | 0.377636 | False |
| 2 | 章节:道路交通安全法律、法规和规章题目:以下不属于机动车驾驶证审验内容的是什么？答案:驾驶车... | 15 | 15_120_50_12_ | [120, 50, 12, , , , , , , ] | [章节:道路交通安全法律、法规和规章题目:驾驶机动车跨越双实线行驶属于什么行为？答案:违法行... | 120 - 50 - 12 - - - - - - - | 1.000000 | False |
| 3 | 章节:道路交通安全法律、法规和规章题目:机动车驾驶人逾期不参加审验仍驾驶机动车的，会受到什么... | 24 | 24_20050_c1c2c6_200500_18 | [20050, c1c2c6, 200500, 18, 10, 37, 102, 15, 7... | [章节:道路交通安全法律、法规和规章题目:隐瞒有关情况或者提供虚假材料申请机动车驾驶证，申请... | 20050 - c1c2c6 - 200500 - 18 - 10 - 37 - 102 -... | 0.561734 | False |
| 4 | 章节:道路交通安全法律、法规和规章题目:准驾车型为C1驾照的，可以驾驶以下哪种车辆？答案:低... | -1 | -1_2002000_on_c2_tsr | [2002000, on, c2, tsr, 4110, start, 34, afs, 2... | [章节:货车专用试题题目:以下缩写中，表示车辆电子控制制动辅助系统缩写的是什么？答案:EBA... | 2002000 - on - c2 - tsr - 4110 - start - 34 - ... | 0.000000 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2831 | 章节:摩托车专用试题题目:大雾天气能见度低，开启远光灯会提高能见度。答案:错 | -1 | -1_2002000_on_c2_tsr | [2002000, on, c2, tsr, 4110, start, 34, afs, 2... | [章节:货车专用试题题目:以下缩写中，表示车辆电子控制制动辅助系统缩写的是什么？答案:EBA... | 2002000 - on - c2 - tsr - 4110 - start - 34 - ... | 0.000000 | False |
| 2832 | 章节:摩托车专用试题题目:在泥泞路上制动时，摩托车车轮易发生侧滑或甩尾，导致交通事故。答案:对 | 54 | 54_503_5012_40_30 | [503, 5012, 40, 30, , , , , , ] | [章节:摩托车专用试题题目:驾驶人违反交通运输管理法规发生重大事故致人死亡的处3年以上有期徒... | 503 - 5012 - 40 - 30 - - - - - - | 1.000000 | False |
| 2833 | 章节:摩托车专用试题题目:驾驶机动车通过急弯路时，最高速度不能超过多少？答案:30公里/小时 | -1 | -1_2002000_on_c2_tsr | [2002000, on, c2, tsr, 4110, start, 34, afs, 2... | [章节:货车专用试题题目:以下缩写中，表示车辆电子控制制动辅助系统缩写的是什么？答案:EBA... | 2002000 - on - c2 - tsr - 4110 - start - 34 - ... | 0.000000 | False |
| 2834 | 章节:摩托车专用试题题目:道路交通安全违法行为累积记分一个周期满分为12分。答案:对 | -1 | -1_2002000_on_c2_tsr | [2002000, on, c2, tsr, 4110, start, 34, afs, 2... | [章节:货车专用试题题目:以下缩写中，表示车辆电子控制制动辅助系统缩写的是什么？答案:EBA... | 2002000 - on - c2 - tsr - 4110 - start - 34 - ... | 0.000000 | False |
