# Drive Test Tag Generation With BERTopic
Generate tags for the written portion of the Chinese driving exam using BERTopic.

## 1. Load Data
Load data from the local database into a pandas DataFrame.

### a) Load data into question bank class

In [1]:
from qb.question import Question
from qb.question_bank import QuestionBank
from data_storage.database.json_database import LocalJsonDB

db = LocalJsonDB("data_storage/database/json_db/data.json",
                 "data_storage/database/json_db/images")
qb : QuestionBank = db.load()
print(qb.question_count())

2836


### b) Convert question bank to pandas dataframe

In [2]:
import pandas as pd
from pandas import DataFrame
def qb_to_df(qb: QuestionBank) -> DataFrame:
    data = {
        "ID": [],
        "Question": [],
        "Answer Choices": [],
        "Answer": [],
        "Chapter": [],
        "Image Path": []
    }
    for chapter_id in qb.get_all_chapter_num():
        chapter_description = qb.describe_chapter(chapter_id)
        qid_lst = qb.get_qids_by_chapter(chapter_id)
        for qid in qid_lst:
            question: Question = qb.get_question(qid)
            data["ID"].append(qid)
            data["Question"].append(question.get_question())
            data["Answer Choices"].append(", ".join(question.get_answers()))
            data["Answer"].append(question.get_correct_answer())
            data["Chapter"].append(chapter_description)
            data["Image Path"].append(question.get_img_path() or "")
    return pd.DataFrame(data)
question_bank = qb_to_df(qb)

In [3]:
print(question_bank.shape)
question_bank.head()

(2836, 6)


Unnamed: 0,ID,Question,Answer Choices,Answer,Chapter,Image Path
0,ce254,关于交通违法行为，以下说法错误的是什么？,"机动车驾驶证被暂扣或者扣留期间驾驶机动车的，一次记6分, 驾驶机动车在高速公路或者城市快速路...",造成致人轻微伤或者财产损失的交通事故后逃逸，尚不构成犯罪的，一次记9分,道路交通安全法律、法规和规章,
1,62fb9,遇到这种情况的路口怎样通过？,"加速直行通过, 确认安全后通过, 右转弯加速通过, 左转弯加速通过",确认安全后通过,道路交通安全法律、法规和规章,data_storage/database/json_db/images/62fb9.webp
2,ed1cc,驾驶机动车应在变更车道的同时开启转向灯。,"对, 错",错,道路交通安全法律、法规和规章,
3,3cdaa,下列哪种标识是自学直考人员在道路上学习驾驶时，应当在车上放置的标志？,"学车专用标识, 产品合格标识, 保持车距标识, 提醒危险标识",学车专用标识,道路交通安全法律、法规和规章,
4,f980d,以下不属于轻型牵引挂车（C6）科目二考试内容的是什么？,"桩考, 直角转弯, 侧方停车, 曲线行驶",侧方停车,道路交通安全法律、法规和规章,


## 2. Clean Data
Convert the question bank into a form suitable for BERTopic: one text document and one image entry per question.

In [4]:
from typing import List, Tuple
def make_docs_images(question_bank: DataFrame) -> Tuple[List[str], List[str]]:
    docs = []
    images = []
    for key in question_bank.index:
        question = question_bank.loc[key, "Question"]
        answer_choices = question_bank.loc[key, "Answer Choices"]
        answer = question_bank.loc[key, "Answer"]
        chapter = question_bank.loc[key, "Chapter"]
        image_path = question_bank.loc[key, "Image Path"]

        # Combine all parts into a single document
        doc = f"章节: {chapter}\n 题目: {question}\n 选项: {answer_choices}\n 答案: {answer}"

        docs.append(doc)
        images.append(f"图: {image_path}")
    return docs, images
docs, images = make_docs_images(question_bank)

In [5]:
# Display the first 5 documents and images
for i in range(5):
    print(docs[i], "\n")
    print(images[i], "\n")

章节: 道路交通安全法律、法规和规章
 题目: 关于交通违法行为，以下说法错误的是什么？
 选项: 机动车驾驶证被暂扣或者扣留期间驾驶机动车的，一次记6分, 驾驶机动车在高速公路或者城市快速路上违法停车的，一次记9分, 造成致人轻微伤或者财产损失的交通事故后逃逸，尚不构成犯罪的，一次记9分, 驾驶机动车在高速公路或者城市快速路上违法占用应急车道行驶的，一次记6分
 答案: 造成致人轻微伤或者财产损失的交通事故后逃逸，尚不构成犯罪的，一次记9分 

图:  

章节: 道路交通安全法律、法规和规章
 题目: 遇到这种情况的路口怎样通过？
 选项: 加速直行通过, 确认安全后通过, 右转弯加速通过, 左转弯加速通过
 答案: 确认安全后通过 

图: data_storage/database/json_db/images/62fb9.webp 

章节: 道路交通安全法律、法规和规章
 题目: 驾驶机动车应在变更车道的同时开启转向灯。
 选项: 对, 错
 答案: 错 

图:  

章节: 道路交通安全法律、法规和规章
 题目: 下列哪种标识是自学直考人员在道路上学习驾驶时，应当在车上放置的标志？
 选项: 学车专用标识, 产品合格标识, 保持车距标识, 提醒危险标识
 答案: 学车专用标识 

图:  

章节: 道路交通安全法律、法规和规章
 题目: 以下不属于轻型牵引挂车（C6）科目二考试内容的是什么？
 选项: 桩考, 直角转弯, 侧方停车, 曲线行驶
 答案: 侧方停车 

图:  



## 3. Naive Processing

In [6]:
from bertopic import BERTopic


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/Users/simonxu/opt/anaconda3/envs/drivetest_tag_gen/lib/python3.11/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/simonxu/opt/anaconda3/envs/drivetest_tag_gen/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/Users/simonxu/opt/anaconda3/envs/drivetest_tag_gen/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/Users/simonxu/opt/anaconda3/envs/drivetest_tag_gen/li
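
The message above suggests either downgrading to `numpy<2` or upgrading the affected module. As a minimal sketch, the installed NumPy version can be checked without importing NumPy itself, which avoids triggering the same warning. The helper names `numpy_major`, `needs_downgrade`, and `installed_numpy_version` are hypothetical and not part of this project.

```python
from importlib import metadata
from typing import Optional


def numpy_major(version: str) -> int:
    """Return the major component of a version string such as '2.2.6'."""
    return int(version.split(".", 1)[0])


def needs_downgrade(version: str) -> bool:
    """True for NumPy 2.x or later, the case the warning describes."""
    return numpy_major(version) >= 2


def installed_numpy_version() -> Optional[str]:
    """Look up the installed NumPy version from package metadata."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


version = installed_numpy_version()
if version is None:
    print("NumPy is not installed in this environment")
elif needs_downgrade(version):
    # The pin below is the one named in the warning text itself.
    print(f"NumPy {version} found; consider: pip install 'numpy<2'")
else:
    print(f"NumPy {version} is a 1.x release")
```

Which of the two remedies fits depends on whether the affected compiled module already has a NumPy 2 compatible release; the check only reports the current state.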

In [None]:
from bertopic.representation import VisualRepresentation

In [None]:
visual_model = VisualRepresentation()

In [None]:
representation_model = {
    "Visual_Aspect": visual_model,
}

In [None]:
topic_model = BERTopic(representation_model=representation_model,
                        verbose=True)

In [None]:
topic_model.fit(docs, images=images)
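
The `images` list built in section 2 carries a `图: ` label on every entry, including questions without an image. The sketch below strips that label and maps empty entries to `None`. It assumes that BERTopic's `images` argument expects plain file paths, with `None` for a missing image; that assumption should be checked against the installed BERTopic version. The helper `to_image_paths` is hypothetical.

```python
from typing import List, Optional

IMAGE_PREFIX = "图: "  # label added by make_docs_images


def to_image_paths(images: List[str]) -> List[Optional[str]]:
    """Strip the display label and turn empty entries into None."""
    paths: List[Optional[str]] = []
    for entry in images:
        # Drop the label only when it is present, so plain paths pass through.
        path = entry[len(IMAGE_PREFIX):] if entry.startswith(IMAGE_PREFIX) else entry
        path = path.strip()
        paths.append(path or None)
    return paths


sample = ["图: ", "图: data_storage/database/json_db/images/62fb9.webp"]
print(to_image_paths(sample))
# → [None, 'data_storage/database/json_db/images/62fb9.webp']
```

The result has the same length and order as `docs`, so it could be passed in place of `images` if the assumption about the expected format holds.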

In [None]:
import base64

def image_base64(im_path) -> str:
    if im_path:
        with open(im_path, "rb") as image_file:
            webp_data = image_file.read()
            encoded_bytes = base64.b64encode(webp_data)
            base64_string = encoded_bytes.decode('utf-8')
            return base64_string
    else:
        return ""

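def image_formatter(im_path) -> str:
    # Render nothing when there is no image, rather than an <img> tag with an empty data URI
    if not im_path:
        return ""
    return f'<img src="data:image/webp;base64,{image_base64(im_path)}">'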

# Extract dataframe
# df = topic_model.get_topic_info()

In [None]:
# df.head()