# Drive Test Tag Generation With BERTopic
Generate tags for the written portion of the Chinese driving exam using BERTopic.

## 1. Load Data
Load data from the local database into a pandas DataFrame.

### a) Load data into question bank class

In [1]:
from qb.question import Question
from qb.question_bank import QuestionBank
from data_storage.database.json_database import LocalJsonDB

db = LocalJsonDB("data_storage/database/json_db/data.json",
                 "data_storage/database/json_db/images")
qb: QuestionBank = db.load()
print(qb.question_count())

2836


### b) Fill in questions without images with a blank image

In [2]:
from PIL import Image

def make_blank_img(path: str) -> None:
    """ Create a blank image and save it to the specified path. """
    img = Image.new('RGB', (10, 10), color='white')
    img.save(path)

In [3]:
def get_blank_img_path() -> str:
    """ Return the path of the shared blank placeholder image. """
    return "data_storage/database/json_db/images/00blank.webp"

make_blank_img(get_blank_img_path())

### c) Convert question bank to pandas dataframe

In [4]:
import pandas as pd
from pandas import DataFrame
def qb_to_df(qb: QuestionBank) -> DataFrame:
    """ Flatten the question bank into one DataFrame row per question. """
    data = {
        "ID": [],
        "Question": [],
        "Answer Choices": [],
        "Answer": [],
        "Chapter": [],
        "Image Path": []
    }
    for chapter_id in qb.get_all_chapter_num():
        chapter_description = qb.describe_chapter(chapter_id)
        qid_lst = qb.get_qids_by_chapter(chapter_id)
        for qid in qid_lst:
            question: Question = qb.get_question(qid)
            data["ID"].append(qid)
            data["Question"].append(question.get_question())
            data["Answer Choices"].append(", ".join(question.get_answers()))
            data["Answer"].append(question.get_correct_answer())
            data["Chapter"].append(chapter_description)
            # Fall back to the blank placeholder when the question has no image
            img_path = question.get_img_path()
            data["Image Path"].append(img_path if img_path else get_blank_img_path())
    return pd.DataFrame(data)
question_bank = qb_to_df(qb)

In [5]:
print(question_bank.shape)
question_bank.head()

(2836, 6)


| | ID | Question | Answer Choices | Answer | Chapter | Image Path |
|---|---|---|---|---|---|---|
| 0 | 8ea3f | 同车道行驶的车辆前方遇到下列哪种车辆不得超车？ | 出租汽车, 执行任务的消防车, 大型客货车, 公共汽车 | 执行任务的消防车 | 道路交通安全法律、法规和规章 | data_storage/database/json_db/images/00blank.webp |
| 1 | f8309 | 机动车驾驶人参加满分教育现场学习、网络学习的天数累计不得少于5天，其中，现场学习的天数不得少... | 对, 错 | 错 | 道路交通安全法律、法规和规章 | data_storage/database/json_db/images/00blank.webp |
| 2 | e02f1 | 在路口遇这种情形怎样通行？ | 直接加速转弯, 减速缓慢转弯, 鸣喇叭告知让行, 让左方来车先行 | 让左方来车先行 | 道路交通安全法律、法规和规章 | data_storage/database/json_db/images/e02f1.webp |
| 3 | d5a39 | 驾驶人在驾驶证核发地车辆管理所管辖区以外地方居住的，可以向政务大厅申请换证。 | 对, 错 | 错 | 道路交通安全法律、法规和规章 | data_storage/database/json_db/images/00blank.webp |
| 4 | ac75a | 以下机动车中，可以牵引挂车的是哪种车型？ | 低速载货汽车, 大型载客汽车, 半挂牵引车, 三轮汽车 | 半挂牵引车 | 道路交通安全法律、法规和规章 | data_storage/database/json_db/images/00blank.webp |
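
Because every question without an image points at the same placeholder file, it can be useful to know how much of the `Image Path` column is real imagery before handing it to a visual model. A minimal sketch, assuming the path layout shown above; `placeholder_share` is a hypothetical helper and not part of the notebook:

```python
from typing import Iterable

# Placeholder path as returned by get_blank_img_path() in the notebook.
BLANK_IMG_PATH = "data_storage/database/json_db/images/00blank.webp"


def placeholder_share(image_paths: Iterable[str], blank_path: str = BLANK_IMG_PATH) -> float:
    """Return the fraction of paths that point at the blank placeholder."""
    paths = list(image_paths)
    if not paths:
        return 0.0
    blank_count = sum(1 for path in paths if path == blank_path)
    return blank_count / len(paths)


# The five rows shown above: four placeholders and one real image.
sample_paths = [
    BLANK_IMG_PATH,
    BLANK_IMG_PATH,
    "data_storage/database/json_db/images/e02f1.webp",
    BLANK_IMG_PATH,
    BLANK_IMG_PATH,
]
share = placeholder_share(sample_paths)
print(f"{share:.0%} of the sample rows use the placeholder")
```

On the full data this could be called as `placeholder_share(question_bank["Image Path"])`.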


## 2. Format Data
Convert the question bank into two parallel lists, one of text documents and one of image paths, to pass to BERTopic.

In [6]:
from typing import List, Tuple

def make_docs_images(question_bank: DataFrame) -> Tuple[List[str], List[str]]:
    """ Build one text document and one image path per question. """
    docs = []
    images = []
    for key in question_bank.index:
        question = question_bank.loc[key, "Question"]
        answer_choices = question_bank.loc[key, "Answer Choices"]
        answer = question_bank.loc[key, "Answer"]
        chapter = question_bank.loc[key, "Chapter"]
        # Combine all parts into a single document
        doc = f"章节: {chapter}\n 题目: {question}\n 选项: {answer_choices}\n 答案: {answer}"
        img_path = question_bank.loc[key, "Image Path"]

        docs.append(doc)
        images.append(img_path if img_path else None)
    return docs, images
docs, images = make_docs_images(question_bank)

In [7]:
# Display the first 5 documents and their image paths
for i in range(5):
    print(docs[i], "\n")
    print(images[i], "\n")

章节: 道路交通安全法律、法规和规章
 题目: 同车道行驶的车辆前方遇到下列哪种车辆不得超车？
 选项: 出租汽车, 执行任务的消防车, 大型客货车, 公共汽车
 答案: 执行任务的消防车 

data_storage/database/json_db/images/00blank.webp 

章节: 道路交通安全法律、法规和规章
 题目: 机动车驾驶人参加满分教育现场学习、网络学习的天数累计不得少于5天，其中，现场学习的天数不得少于1天。
 选项: 对, 错
 答案: 错 

data_storage/database/json_db/images/00blank.webp 

章节: 道路交通安全法律、法规和规章
 题目: 在路口遇这种情形怎样通行？
 选项: 直接加速转弯, 减速缓慢转弯, 鸣喇叭告知让行, 让左方来车先行
 答案: 让左方来车先行 

data_storage/database/json_db/images/e02f1.webp 

章节: 道路交通安全法律、法规和规章
 题目: 驾驶人在驾驶证核发地车辆管理所管辖区以外地方居住的，可以向政务大厅申请换证。
 选项: 对, 错
 答案: 错 

data_storage/database/json_db/images/00blank.webp 

章节: 道路交通安全法律、法规和规章
 题目: 以下机动车中，可以牵引挂车的是哪种车型？
 选项: 低速载货汽车, 大型载客汽车, 半挂牵引车, 三轮汽车
 答案: 半挂牵引车 

data_storage/database/json_db/images/00blank.webp 
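
Since the chapter name is part of every document, a question filed under two chapters becomes two different documents. The representative documents printed in section 3 d) contain such a pair: the same overtaking question appears under both 交通信号 and 摩托车专用试题. A hedged sketch for spotting these cases before fitting, assuming rows of chapter, question, and answer choices as in the table above; `find_cross_chapter_duplicates` is a hypothetical helper:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_cross_chapter_duplicates(
    rows: Iterable[Tuple[str, str, str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Group (chapter, question, answer_choices) rows by question content.

    Returns only the questions that occur under more than one chapter,
    mapped to the list of chapters they occur in.
    """
    chapters_by_question: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for chapter, question, answer_choices in rows:
        chapters_by_question[(question, answer_choices)].append(chapter)
    return {
        key: chapters
        for key, chapters in chapters_by_question.items()
        if len(set(chapters)) > 1
    }


# Sample rows taken from the notebook output.
rows = [
    ("交通信号", "驾驶机动车行经下列哪种路段时不得超车？", "环城高速, 交叉路口, 高架路, 中心街道"),
    ("摩托车专用试题", "驾驶机动车行经下列哪种路段时不得超车？", "环城高速, 交叉路口, 高架路, 中心街道"),
    ("道路交通安全法律、法规和规章", "以下机动车中，可以牵引挂车的是哪种车型？", "低速载货汽车, 大型载客汽车, 半挂牵引车, 三轮汽车"),
]
duplicates = find_cross_chapter_duplicates(rows)
for (question, _), chapters in duplicates.items():
    print(question, "->", chapters)
```

With the DataFrame, the rows could be produced by `zip(question_bank["Chapter"], question_bank["Question"], question_bank["Answer Choices"])`. Whether to merge or keep such duplicates is a modelling decision this sketch does not make.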



## 3. Naive Processing

### a) Set up model
#### Set up the visual component

In [8]:
# Imports
from bertopic import BERTopic
from bertopic.representation import VisualRepresentation

In [9]:
# Set up the visual component
visual_model = VisualRepresentation()

In [10]:
representation_model = {
    "Visual_Aspect": visual_model,
}

#### Set up the embedding model

In [11]:
embedding_model = "distiluse-base-multilingual-cased-v1"

In [12]:
# Put the model together
topic_model = BERTopic(embedding_model=embedding_model,
                       representation_model=representation_model,
                       verbose=True)

### b) Fit the model

In [13]:
topic_model.fit(docs, images=images)

2025-06-23 12:56:44,280 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/89 [00:00<?, ?it/s]

2025-06-23 12:57:12,933 - BERTopic - Embedding - Completed ✓
2025-06-23 12:57:12,933 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-06-23 12:57:27,547 - BERTopic - Dimensionality - Completed ✓
2025-06-23 12:57:27,548 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-06-23 12:57:27,618 - BERTopic - Cluster - Completed ✓
2025-06-23 12:57:27,621 - BERTopic - Representation - Fine-tuning topics using representation models.
100%|██████████| 77/77 [00:02<00:00, 25.99it/s]
2025-06-23 12:57:30,832 - BERTopic - Representation - Completed ✓


<bertopic._bertopic.BERTopic at 0x1367b6710>

### c) Save the model

In [14]:
import os
from datetime import datetime

# Named `timestamp` to avoid shadowing the standard-library `time` module
timestamp = datetime.today().strftime("%Y-%m-%d %H:%M:%S")
model_save_path = f"data_storage/model_dir/{timestamp}"
os.makedirs(model_save_path, exist_ok=True)
print(model_save_path)

data_storage/model_dir/2025-06-23 12:57:31
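
The directory name above contains a space and colons, which some file systems (notably on Windows) reject in paths. A minimal sketch of a more portable naming scheme, using only the standard library; the format string and the `make_model_dir_name` helper are illustrative choices, not something the notebook uses:

```python
from datetime import datetime


def make_model_dir_name(now: datetime, base_dir: str = "data_storage/model_dir") -> str:
    """Build a save path whose timestamp has no spaces or colons."""
    timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return f"{base_dir}/{timestamp}"


# Example with the timestamp from the run above.
example_path = make_model_dir_name(datetime(2025, 6, 23, 12, 57, 31))
print(example_path)  # data_storage/model_dir/2025-06-23_12-57-31
```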


In [15]:
topic_model.save(model_save_path,
                 serialization="pytorch",
                 save_ctfidf=True,
                 save_embedding_model=embedding_model)

### d) Inspect the model

In [16]:
topic_model.visualize_topics()

In [17]:
# View a sample of the topics and their representative documents
def view_topic_samples(topic_model, n_topics=5, n_docs_per_topic=5):
    """
    Display a sample of topics and their representative documents.
    """
    for topic_id in range(n_topics):
        print(f"Topic {topic_id}:")
        print(f"Topics: {topic_model.get_topic(topic_id)}")
        # Get the representative documents for the topic
        representative_docs = topic_model.get_representative_docs(topic_id)
        # Print a sample of the documents
        for doc in representative_docs[:n_docs_per_topic]:
            print(f"- {doc}")
        print("\n")

# Call the function with the trained model
view_topic_samples(topic_model, n_topics=5, n_docs_per_topic=5)

Topic 0:
Topics: [('法规和规章', 0.042935264244858315), ('道路交通安全法律', 0.042838057031025834), ('交叉路口', 0.019040807306170863), ('一次记3分', 0.01457968045916838), ('弯道', 0.014560924623916852), ('环城高速', 0.014560924623916852), ('高架路', 0.014560924623916852), ('题目', 0.013918238712784265), ('章节', 0.013918238712784265), ('答案', 0.013918238712784265)]
- 章节: 道路交通安全法律、法规和规章
 题目: 驾驶机动车在下列哪种路段不得超车？
 选项: 城市快速路, 城市高架路, 窄桥、弯道, 山区道路
 答案: 窄桥、弯道
- 章节: 交通信号
 题目: 驾驶机动车行经下列哪种路段时不得超车？
 选项: 环城高速, 交叉路口, 高架路, 中心街道
 答案: 交叉路口
- 章节: 摩托车专用试题
 题目: 驾驶机动车行经下列哪种路段时不得超车？
 选项: 环城高速, 交叉路口, 高架路, 中心街道
 答案: 交叉路口


Topic 1:
Topics: [('摩托车专用试题', 0.0921590539215024), ('题目', 0.01617528950439644), ('答案', 0.01617528950439644), ('章节', 0.01617528950439644), ('选项', 0.01617528950439644), ('继续驾驶', 0.015093532722318047), ('可以停放在非机动车道上', 0.015093532722318047), ('专用安全头盔', 0.015093532722318047), ('严禁双手同时离开转向把', 0.015093532722318047), ('长时间高速行驶', 0.015093532722318047)]
- 章节: 摩托车专用试题
 题目: 遇紧急情况避险时，要沉着冷静，坚持什么样的处理原则？
 选项: 先避人后避物, 先避物后避人, 先避车后避人, 先避物后避车
 

In [18]:
topic_model.get_topic(36)

[('夜间行车', 0.08801307751878462),
 ('切换为近光灯', 0.044388997077872056),
 ('不利于观察道路交通情况', 0.044388997077872056),
 ('机动车灯光一个重要的作用是提示其他机动车驾驶人和行人', 0.044388997077872056),
 ('变短', 0.044388997077872056),
 ('与对向机动车会车时', 0.044388997077872056),
 ('谨慎会车', 0.044388997077872056),
 ('夜间在道路上会车时', 0.044388997077872056),
 ('由路中移到路侧', 0.044388997077872056),
 ('会影响前车驾驶人的视线', 0.044388997077872056)]
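
The topic samples printed earlier (Topic 0 and Topic 1) include the template labels that were added when building each document (`章节`, `题目`, `选项`, `答案`), which say nothing about the topic itself. One option is to drop them after the fact. A minimal sketch, assuming keywords arrive as `(word, score)` pairs like those returned by `get_topic`; `strip_template_labels` is a hypothetical helper:

```python
from typing import List, Set, Tuple

# Labels inserted by the document template in make_docs_images.
TEMPLATE_LABELS: Set[str] = {"章节", "题目", "选项", "答案"}


def strip_template_labels(
    keywords: List[Tuple[str, float]],
    labels: Set[str] = TEMPLATE_LABELS,
) -> List[Tuple[str, float]]:
    """Drop keywords that are only document-template labels, keeping order."""
    return [(word, score) for word, score in keywords if word not in labels]


# First few keywords of Topic 1 from the output above (scores rounded).
topic_1 = [
    ("摩托车专用试题", 0.0922),
    ("题目", 0.0162),
    ("答案", 0.0162),
    ("章节", 0.0162),
    ("选项", 0.0162),
    ("继续驾驶", 0.0151),
]
cleaned = strip_template_labels(topic_1)
print(cleaned)
```

This only post-processes the output. Excluding the labels during vectorization, for example through a stop-word list, would be an alternative that also changes the keyword scores.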

In [19]:
topic_model.get_representative_docs(36)

['章节: 交通信号\n 题目: 夜间道路环境对安全行车的主要影响是什么？\n 选项: 能见度低、不利于观察道路交通情况, 驾驶人易产生冲动、幻觉, 驾驶人体力下降, 路面复杂多变\n 答案: 能见度低、不利于观察道路交通情况',
 '章节: 交通信号\n 题目: 夜间会车时距对向来车150米以内应使用近光灯的原因是什么？\n 选项: 两车之间相互提示, 使用远光灯会造成双方驾驶人出现眩目，而发生危险, 驾驶人的操作习惯行为, 提示后方车辆\n 答案: 使用远光灯会造成双方驾驶人出现眩目，而发生危险',
 '章节: 交通信号\n 题目: 如图所示，夜间驾驶机动车遇对方使用远光灯，无法看清前方路况时，以下做法正确的是什么？\n 选项: 自己也打开远光灯行驶, 保持行驶方向和车速不变, 降低车速，谨慎会车, 加速通过，尽快摆脱眩目光线\n 答案: 降低车速，谨慎会车']
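
To turn the fitted topics into tags, each question needs the keywords of the topic it was assigned to. The sketch below shows that mapping on plain Python data. It assumes the per-document topic ids come from the fitted model (BERTopic exposes them as `topic_model.topics_`, in the same order as `docs`, with `-1` marking outliers) and that keywords come from `get_topic`. The `assign_tags` helper and the sample topic assignments are hypothetical:

```python
from typing import Dict, List, Sequence, Tuple


def assign_tags(
    question_ids: Sequence[str],
    topic_ids: Sequence[int],
    topic_keywords: Dict[int, List[Tuple[str, float]]],
    top_n: int = 3,
) -> Dict[str, List[str]]:
    """Map each question id to the top_n keywords of its assigned topic.

    Questions assigned to the outlier topic (-1) or to a topic without
    keywords receive an empty tag list.
    """
    tags: Dict[str, List[str]] = {}
    for qid, topic_id in zip(question_ids, topic_ids):
        if topic_id == -1:
            tags[qid] = []
            continue
        keywords = topic_keywords.get(topic_id, [])
        tags[qid] = [word for word, _ in keywords[:top_n]]
    return tags


# Hypothetical sample: real question ids, topic assignments made up for illustration.
question_ids = ["8ea3f", "f8309", "e02f1"]
topic_ids = [36, -1, 36]
topic_keywords = {36: [("夜间行车", 0.088), ("切换为近光灯", 0.044), ("变短", 0.044)]}
tags = assign_tags(question_ids, topic_ids, topic_keywords, top_n=2)
print(tags)
```

Using the raw top keywords as tags is the simplest choice; as the output above shows, some keywords are whole answer phrases, so a cleanup step may be needed before they read as tags.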

## 4. Multimodal Topic Modeling

In [22]:
print("hello world")

hello world
