<a href="https://colab.research.google.com/github/rm-2278/MachineLearning/blob/main/DL_Basic_2025_Competition_VQA_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

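# Deep Learning Basics Course Final Assignment: VQA

## Overview
The task is to predict an answer from an image and a question.
- Number of samples: 19,873 for training, 4,969 for testing
- Input: image data (RGB; size varies by image) and a question (sequence length varies by sample)
- Output: an answer (sequence length varies by sample)
- Evaluation metric: the VQA evaluation metric (see [here](https://visualqa.org/evaluation.html)).

### Dataset details ([VizWiz 2023 edition](https://www.kaggle.com/datasets/nqa112/vizwiz-2023-edition))
- The dataset consists of 24,842 images, each with one question and answers from 10 annotators.
  - The 10 answers are not necessarily identical.
- Of the 24,842 samples, 80% (19,873) are provided as training data (train) and 20% (4,969) as test data (val).
  - The answers for the test data serve as ground-truth labels and are not distributed.
  - The split differs from the one used by the data provider.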

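### Task details
- In this competition, you build a model that outputs an appropriate answer for a given image and question.
- Evaluation follows [VQA](https://visualqa.org/index.html) (Visual Question Answering) and is computed with the following formula.

$$\text{Acc}(ans) = \min\left(\frac{\text{number of humans that said } ans}{3}, 1\right)$$

- For each sample, 9 of the 10 answers are selected and scored with the formula above; the mean of the resulting 10 Acc values is the Acc for that sample.
- Before predictions are compared with the ground-truth labels, preprocessing such as lowercasing the answers and removing articles is applied ([details](https://visualqa.org/evaluation.html)).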
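
As a rough illustration of the metric described above, the following sketch computes the per-sample accuracy of a single prediction against 10 already-normalized answers. The function name `vqa_accuracy` and the example answers are made up for illustration; the official evaluation additionally applies the preprocessing mentioned above.

```python
def vqa_accuracy(pred, answers):
    """Leave-one-out VQA accuracy for one sample (hypothetical helper)."""
    scores = []
    for i in range(len(answers)):
        # Drop the i-th annotator and count matches among the remaining ones
        others = answers[:i] + answers[i + 1:]
        num_match = sum(1 for a in others if a == pred)
        scores.append(min(num_match / 3, 1.0))
    return sum(scores) / len(scores)


# Illustrative answers from 10 annotators (not taken from the dataset)
answers = ["cat"] * 3 + ["dog"] * 2 + ["unanswerable"] * 5
acc_cat = vqa_accuracy("cat", answers)    # 3 annotators agree
acc_dog = vqa_accuracy("dog", answers)    # 2 annotators agree
acc_bird = vqa_accuracy("bird", answers)  # nobody agrees
```

An answer given by three annotators scores below 1.0 here, because three of the ten leave-one-out subsets contain only two matching answers.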

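## Examples of possible improvements
- Fine-tuning pretrained models
    - Fine-tuning pretrained models when extracting image and language features can improve performance (note that pretraining on a dataset very different from this task may have little effect).
- Question representation
    - The baseline feeds questions to the model as one-hot vectors. Using a tokenizer or similar to obtain distributed representations makes the model easier to train.
- Using soft labels
    - The baseline trains with the most frequent answer among the 10 annotators as the ground-truth label. Using soft labels weighted by the frequency of each answer instead lets training use more information.
- Image preprocessing
    - Image preprocessing uses only Resize to unify shapes. Adding the data augmentation introduced in "Convolutional Neural Networks", "Deep Learning and Image Recognition", and other lectures can improve generalization.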
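
The soft-label idea can be sketched as follows, assuming each answer's target is its VQA-style score `min(count / 3, 1)`. The helper name `soft_labels`, the tiny `answer2idx` dictionary, and the example answers are illustrative assumptions.

```python
from collections import Counter


def soft_labels(answers, answer2idx):
    """Build a soft-label vector from one sample's answers (hypothetical helper)."""
    target = [0.0] * len(answer2idx)
    for answer, count in Counter(answers).items():
        if answer in answer2idx:
            # An answer given by 3 or more annotators gets the full score of 1.0
            target[answer2idx[answer]] = min(count / 3, 1.0)
    return target


# Illustrative dictionary and answers (not taken from the dataset)
answer2idx = {"cat": 0, "dog": 1, "unanswerable": 2}
answers = ["cat"] * 6 + ["dog"] * 3 + ["unanswerable"]
target = soft_labels(answers, answer2idx)
```

Because several entries can be non-zero at once, a target like this is typically paired with a per-class loss such as binary cross-entropy rather than a single-label cross-entropy.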

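## Requirements for course completion
- The baseline scores 49.4% in the performance evaluation on omnicampus. Therefore, only submissions that exceed the baseline score of 49.4% are accepted as meeting the completion requirement.
- The organizers have confirmed that improvements to the baseline can raise performance to 60%. Use this as one target.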

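## Notes
- The final prediction model must be **trained on the distributed training data** (including fine-tuning).
- **Inference that relies solely on the knowledge of a pretrained model**, without any training, is prohibited.  
(Example: only feeding inputs to an LLM such as ChatGPT to obtain predictions)

### Use of pretrained models
Permitted
- **Using pretrained models as components**: You may use pretrained models (BERT, ViT, etc.) as part of an architecture you implement yourself (feature extraction, embeddings, etc.).
- **Fine-tuning**: You may fine-tune pretrained models used for the purposes above.

Prohibited  
- **Using pretrained models built to solve the task**: Using a pretrained model that directly solves the target task, such as those provided by transformers, for inference only or with fine-tuning only is prohibited.
  - Example of a prohibited use: applying a pretrained model that directly solves VQA to the VQA task.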

### Data preparation
The downloaded data is stored on Google Drive, so Google Drive must be mounted to use it. Because the archive cannot be extracted on Drive, copy "data.zip" under the /content directory and extract it there.  
This cannot be run unless "data.zip" is placed on Google Drive. If you can place "data.zip" (**12GB**) on Google Drive, run "data_download.ipynb" first. If that is difficult, use the omnicampus exercise environment.



In [None]:
# Cells up to the 4th one need not be run on omnicampus
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# After data.zip is saved to Google Drive by the data download notebook,
# it may take a while for the file to appear. After mounting Google Drive,
# confirm that data.zip is in the directory before running this cell.
# Copy data.zip under /content
!cp "/content/drive/MyDrive/data.zip" "/content"

In [None]:
# Check the files under the current directory
# If data.zip is listed, everything is fine
%ls

data.zip  drive/  sample_data/


In [None]:
# Extract the data
!unzip data.zip

Streaming output truncated to the last 5000 lines.
  inflating: data/train/train_17925.jpg  
  inflating: data/train/train_11508.jpg  
  inflating: data/train/train_18890.jpg  
  inflating: data/train/train_16941.jpg  
  inflating: data/train/train_19803.jpg  
  inflating: data/train/train_01120.jpg  
  inflating: data/train/train_14602.jpg  
  inflating: data/train/train_02783.jpg  
  inflating: data/train/train_05206.jpg  
  inflating: data/train/train_11328.jpg  
  inflating: data/train/train_05513.jpg  
  inflating: data/train/train_10742.jpg  
  inflating: data/train/train_10977.jpg  
  inflating: data/train/train_01154.jpg  
  inflating: data/train/train_07148.jpg  
  inflating: data/train/train_05799.jpg  
  inflating: data/train/train_19215.jpg  
  inflating: data/train/train_11471.jpg  
  inflating: data/train/train_08915.jpg  
  inflating: data/train/train_09099.jpg  
  inflating: data/train/train_15882.jpg  
  inflating: data/train/train_16020.jpg  
  inflating

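In the omnicampus exercise environment, skipping the mount, zip, and copy-to-Drive steps of data_download.ipynb leaves the contents of "data.zip" placed in extracted form. You therefore only need to make the directory that contains the data directory the current directory.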



In [None]:
# For running on omnicampus
# This example assumes the data directory is under /workspace/VQA
%cd /workspace/VQA

[Errno 2] No such file or directory: '/workspace/VQA'
/content


### 1. import library

In [None]:
import re
import random
import time
from statistics import mode

from PIL import Image
import numpy as np
import pandas
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

### 2. utils

In [None]:
def set_seed(seed):
    """
    Fix the random seeds.

    Parameters
    ----------
    seed : int
        Seed value used for random number generation.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

In [None]:
def process_text(text):
    """
    Normalize the format of questions and answers.

    Parameters
    ----------
    text : str
        A question or an answer.
    """
    # lowercase
    text = text.lower()

    # Convert number words to digits
    # (str.replace matches substrings, so e.g. "phone" becomes "ph1")
    num_word_to_digit = {
        'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
        'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9',
        'ten': '10'
    }
    for word, digit in num_word_to_digit.items():
        text = text.replace(word, digit)

    # Remove periods that have no digit on either side (decimal points are kept)
    text = re.sub(r'(?<!\d)\.(?!\d)', '', text)

    # Remove articles
    text = re.sub(r'\b(a|an|the)\b', '', text)

    # Restore the apostrophe in contractions
    contractions = {
        "dont": "don't", "isnt": "isn't", "arent": "aren't", "wont": "won't",
        "cant": "can't", "wouldnt": "wouldn't", "couldnt": "couldn't"
    }
    for contraction, correct in contractions.items():
        text = text.replace(contraction, correct)

    # Replace punctuation (except apostrophes and colons) with spaces
    text = re.sub(r"[^\w\s':]", ' ', text)

    # Remove whitespace before commas
    text = re.sub(r'\s+,', ',', text)

    # Collapse consecutive whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()

    return text

In [None]:
class VQADataset(torch.utils.data.Dataset):
    """
    Dataset class for VQA.
    """
    def __init__(self, df_path, image_dir, transform=None, answer=True):
        self.transform = transform  # image preprocessing
        self.image_dir = image_dir  # directory containing the image files
        self.df = pandas.read_json(df_path)  # DataFrame with image file name, question, and answers
        self.answer = answer

        # Build the question / answer dictionaries
        self.question2idx = {}
        self.answer2idx = {}
        self.idx2question = {}
        self.idx2answer = {}

        # Add the words in each question to the dictionary
        for question in self.df["question"]:
            question = process_text(question)
            words = question.split(" ")
            for word in words:
                if word not in self.question2idx:
                    self.question2idx[word] = len(self.question2idx)
        self.idx2question = {v: k for k, v in self.question2idx.items()}

        if self.answer:
            # Add each answer string to the dictionary
            for answers in self.df["answers"]:
                for answer in answers:
                    word = answer["answer"]
                    word = process_text(word)
                    if word not in self.answer2idx:
                        self.answer2idx[word] = len(self.answer2idx)
            self.idx2answer = {v: k for k, v in self.answer2idx.items()}  # reverse lookup (answer)

    def update_dict(self, dataset):
        """
        Replace the dictionaries of validation / test data with those of the training data.

        Parameters
        ----------
        dataset : Dataset
            Dataset built from the training data.
        """
        self.question2idx = dataset.question2idx
        self.answer2idx = dataset.answer2idx
        self.idx2question = dataset.idx2question
        self.idx2answer = dataset.idx2answer

    def __getitem__(self, idx):
        """
        Get the data (image, question, answers) for the given index.

        Parameters
        ----------
        idx : int
            Index of the sample to fetch.

        Returns
        -------
        image : torch.Tensor  (C, H, W)
            Image data.
        question : torch.Tensor  (vocab_size)
            Question encoded as a multi-hot (bag-of-words) vector.
        answers_for_vqa_criterion : torch.Tensor  (n_answer)
            Answer ids of the 10 annotators (for VQA evaluation). Only returned when answer=True.
        target_soft_labels : torch.Tensor  (len(self.answer2idx))
            Soft labels (VQA scores). Only returned when answer=True.
        """
        image = Image.open(f"{self.image_dir}/{self.df['image'][idx]}")
        image = self.transform(image)
        question = np.zeros(len(self.idx2question) + 1)  # extra element for unknown words
        # Apply the same normalization that was used to build the dictionary
        question_words = process_text(self.df["question"][idx]).split(" ")
        for word in question_words:
            try:
                question[self.question2idx[word]] = 1  # set the element for this word
            except KeyError:
                question[-1] = 1  # unknown word

        if self.answer:
            all_human_answers_text = [process_text(answer["answer"]) for answer in self.df["answers"][idx]]

            valid_human_answer_indices = [
                self.answer2idx[ans_text] for ans_text in all_human_answers_text if ans_text in self.answer2idx
            ]

            answers_for_vqa_criterion = torch.tensor(valid_human_answer_indices, dtype=torch.float)

            answer_counts_for_soft_labels = {}
            for ans_idx in valid_human_answer_indices:
                answer_counts_for_soft_labels[ans_idx] = answer_counts_for_soft_labels.get(ans_idx, 0) + 1

            target_soft_labels = torch.zeros(len(self.answer2idx))

            for ans_idx, count in answer_counts_for_soft_labels.items():
                score = min(count / 3, 1.0)
                target_soft_labels[ans_idx] = score

            return image, torch.Tensor(question), answers_for_vqa_criterion, target_soft_labels

        else:
            return image, torch.Tensor(question)

    def __len__(self):
        return len(self.df)

In [None]:
def VQA_criterion(batch_pred, batch_answers):
    """
    Evaluation function used for the VQA task.
    """
    total_acc = 0.

    for pred, answers in zip(batch_pred, batch_answers):
        acc = 0.
        for i in range(len(answers)):
            num_match = 0
            for j in range(len(answers)):
                if i == j:
                    continue
                if pred == answers[j]:
                    num_match += 1
            acc += min(num_match / 3, 1)
        total_acc += acc / 10

    return total_acc / len(batch_pred)

### 3. model

In [None]:
class BasicBlock(nn.Module):
    """
    ResNet basic block.
    """
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        """
        Constructor.

        Parameters
        ----------
        in_channels: int
            Number of input channels.
        out_channels: int
            Number of output channels.
        stride: int
            Stride.
        """
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input to the block.

        Returns
        -------
        out: torch.Tensor
            Output of the block.
        """
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        out += self.shortcut(residual)
        out = self.relu(out)

        return out


class BottleneckBlock(nn.Module):
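    """
    ResNet bottleneck block.
    """
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1):
        """
        Constructor.

        Parameters
        ----------
        in_channels: int
            Number of input channels.
        out_channels: int
            Number of channels inside the block (the block outputs out_channels * expansion channels).
        stride: int
            Stride.
        """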
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, kernel_size=1, stride=1)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * self.expansion, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels * self.expansion)
            )

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input to the block.

        Returns
        -------
        out: torch.Tensor
            Output of the block.
        """
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        out += self.shortcut(residual)
        out = self.relu(out)

        return out


class ResNet(nn.Module):
    """
    ResNet implementation.
    """
    def __init__(self, block, layers):
        """
        Constructor.

        Parameters
        ----------
        block: torch.nn.Module
            Block class to use (BasicBlock / BottleneckBlock).
        layers: list
            Number of blocks in each stage.
        """
        super().__init__()
        self.in_channels = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = self._make_layer(block, layers[0], 64)
        self.layer2 = self._make_layer(block, layers[1], 128, stride=2)
        self.layer3 = self._make_layer(block, layers[2], 256, stride=2)
        self.layer4 = self._make_layer(block, layers[3], 512, stride=2)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, 512)

    def _make_layer(self, block, blocks, out_channels, stride=1):
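        """
        Build a stage that repeats the same block structure.

        Parameters
        ----------
        block: torch.nn.Module
            Block class to use (BasicBlock / BottleneckBlock).
        blocks: int
            Number of blocks.
        out_channels: int
            Number of output channels (before multiplying by block.expansion).
        stride: int
            Stride of the first block.

        Returns
        -------
        layers: torch.nn.Sequential
            The generated stage.
        """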
        layers = []
        layers.append(block(self.in_channels, out_channels, stride))
        self.in_channels = out_channels * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))

        return nn.Sequential(*layers)

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input data.

        Returns
        -------
        x: torch.Tensor
            Features produced by the ResNet.
        """
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x

In [None]:
def ResNet18():
    """
    Build a ResNet18.
    """
    return ResNet(BasicBlock, [2, 2, 2, 2])


def ResNet50():
    """
    Build a ResNet50.
    """
    return ResNet(BottleneckBlock, [3, 4, 6, 3])

In [None]:
class VQAModel(nn.Module):
    """
    Example model for the VQA task.
    """
    def __init__(self, vocab_size: int, n_answer: int):
        """
        Constructor.

        Parameters
        ----------
        vocab_size: int
            Vocabulary size of the question input.
        n_answer: int
            Number of output classes.
        """
        super().__init__()
        # self.resnet = ResNet18()
        self.resnet = ResNet50()
        self.text_encoder = nn.Linear(vocab_size, 512)

        self.fc = nn.Sequential(
            nn.Linear(1024, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, n_answer)
        )

    def forward(self, image, question):

        image_feature = self.resnet(image)  # image features
        question_feature = self.text_encoder(question)  # text features

        x = torch.cat([image_feature, question_feature], dim=1)
        x = self.fc(x)

        return x

### 4. train

In [None]:
def train(model, dataloader, optimizer, criterion, device):
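    """
    Training function.

    Parameters
    ----------
    model: torch.nn.Module
        Model to train.
    dataloader: torch.utils.data.DataLoader
        Data loader used for training.
    optimizer: torch.optim.Optimizer
        Optimizer.
    criterion: torch.nn.Module
        Loss function.
    device: torch.device
        Device used for training.

    Returns
    -------
    total_loss: float
        Mean loss.
    total_acc: float
        Mean VQA accuracy.
    simple_acc: float
        Accuracy against the most frequent answer (differs from the VQA metric); this implementation always returns 0.0.
    time: float
        Time taken to train for one epoch (sec).
    """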
    model.train()

    total_loss = 0
    total_acc = 0

    start = time.time()
    for image, question, answers_for_vqa_criterion, target_soft_labels in dataloader:
        image, question, answers_for_vqa_criterion, target_soft_labels = \
            image.to(device), question.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

        pred = model(image, question)
        loss = criterion(pred, target_soft_labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)  # VQA accuracy

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start


def eval(model, dataloader, criterion, device):
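    """
    Evaluation function.

    Parameters
    ----------
    model: torch.nn.Module
        Model to evaluate.
    dataloader: torch.utils.data.DataLoader
        Data loader used for evaluation.
    criterion: torch.nn.Module
        Loss function.
    device: torch.device
        Device to use.

    Returns
    -------
    total_loss: float
        Mean loss.
    total_acc: float
        Mean VQA accuracy.
    simple_acc: float
        Accuracy against the most frequent answer (differs from the VQA metric); this implementation always returns 0.0.
    time: float
        Time taken to evaluate one epoch (sec).
    """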
    model.eval()

    total_loss = 0
    total_acc = 0

    start = time.time()
    with torch.no_grad():
        for image, question, answers_for_vqa_criterion, target_soft_labels in dataloader:
            image, question, answers_for_vqa_criterion, target_soft_labels = \
                image.to(device), question.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

            pred = model(image, question)
            loss = criterion(pred, target_soft_labels)

            total_loss += loss.item()
            total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)  # VQA accuracy

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start

### 5. make submission file

In [None]:
# fix the seed and select the device
set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

# dataloader / model
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])
train_dataset = VQADataset(df_path="./data/train.json", image_dir="./data/train", transform=transform)
test_dataset = VQADataset(df_path="./data/valid.json", image_dir="./data/valid", transform=transform, answer=False)
test_dataset.update_dict(train_dataset)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

model = VQAModel(vocab_size=len(train_dataset.question2idx)+1, n_answer=len(train_dataset.answer2idx)).to(device)

# optimizer / criterion
num_epoch = 4
criterion = nn.BCEWithLogitsLoss() # Changed to BCEWithLogitsLoss for soft labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

In [None]:
# train model
for epoch in range(num_epoch):
    train_loss, train_acc, train_simple_acc, train_time = train(model, train_loader, optimizer, criterion, device)
    print(f"【{epoch + 1}/{num_epoch}】\n"
            f"train time: {train_time:.2f} [s]\n"
            f"train loss: {train_loss:.4f}\n"
            f"train acc: {train_acc:.4f}\n"
            f"train simple acc: {train_simple_acc:.4f}")

【1/4】
train time: 735.82 [s]
train loss: 0.0150
train acc: 0.4426
train simple acc: 0.0000


In [None]:
# make submission file
model.eval()
submission = []
with torch.no_grad():  # no gradients are needed for inference
    for image, question in test_loader:
        image, question = image.to(device), question.to(device)
        pred = model(image, question)
        pred = pred.argmax(1).cpu().item()
        submission.append(pred)

# Map predicted indices back to answer strings
submission = [train_dataset.idx2answer[idx] for idx in submission]
submission = np.array(submission)
torch.save(model.state_dict(), "model.pt")
np.save("submission.npy", submission)

## How to submit

Zip the following three items and submit them via "Final Assignment (VQA)" on Omnicampus.

- `submission.npy`
- The weights used for testing, such as `model.pt` or `model_best.pt` (`.pt` extension only)
- This Colab notebook

In [None]:
from zipfile import ZipFile

model_path = "model.pt"
notebook_path = "/content/drive/MyDrive/Colab Notebooks/DL_Basic_2025_Competition_VQA_baseline.ipynb"

with ZipFile("submission.zip", "w") as zf:
    zf.write("submission.npy")
    zf.write(model_path)
    zf.write(notebook_path, arcname="DL_Basic_2025_Competition_VQA_baseline.ipynb")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Task
The task is to describe the VQAModel architecture, including its components (image encoder, text encoder, and fusion/prediction head), detail the ResNet18 and text encoder architectures with input/output sizes and processing, explain the feature fusion and prediction mechanism, and discuss the model's suitability for Visual Question Answering (VQA) as a baseline.

## Describe VQAModel Architecture

### Subtask:
Explain the overall structure of the VQAModel, including its main components: an image encoder (ResNet18), a text encoder (linear layer), and a fusion/prediction head (MLP).


```markdown
## VQAModel Architecture Explained

The `VQAModel` is designed to process both image and text inputs to predict an answer to a visual question. Its architecture can be broken down into three primary components:

1.  **Image Encoder (`self.resnet`)**: This component extracts features from the input image. The implementation above instantiates `ResNet50()` (the `ResNet18()` line is commented out). Either variant is a convolutional neural network (CNN) that takes an RGB image (resized here to 224x224) and ends in a fully connected layer inside the `ResNet` class, so the image feature vector is 512-dimensional in both cases.

2.  **Text Encoder (`self.text_encoder`)**: This component processes the question text. It is implemented as a single `nn.Linear` layer. It takes a bag-of-words vector of length `vocab_size` (the number of unique words plus one slot for unknown words), with a 1 for each word that appears in the question, and transforms it into a 512-dimensional feature vector. This converts the sparse input into a dense representation that can be combined with the image features.

3.  **Fusion and Prediction Head (`self.fc`)**: This component combines the features extracted from the image and the question, then uses them to predict the answer. It consists of a `nn.Sequential` block containing two `nn.Linear` layers with a `ReLU` activation function in between.
    *   The image feature vector (512 dimensions) and the question feature vector (512 dimensions) are first concatenated along the feature dimension, resulting in a 1024-dimensional combined feature vector.
    *   This combined vector is then passed through the first linear layer, which reduces its dimensionality to 512.
    *   A `ReLU` activation is applied.
    *   Finally, the second linear layer projects the 512-dimensional feature vector to `n_answer` dimensions, where `n_answer` is the total number of unique possible answers. The output of this layer represents the logits for each possible answer, from which the most probable answer can be selected (e.g., using `argmax`).

### Data Flow in the `forward` Method:

1.  **Input**: The `forward` method receives an `image` tensor and a `question` tensor.
2.  **Image Feature Extraction**: `image_feature = self.resnet(image)` computes the 512-dimensional image representation with the ResNet encoder (`ResNet50()` in the code above).
3.  **Question Feature Extraction**: `question_feature = self.text_encoder(question)` computes the 512-dimensional question representation with the linear layer.
4.  **Feature Fusion**: `x = torch.cat([image_feature, question_feature], dim=1)` concatenates `image_feature` and `question_feature` along the feature dimension (`dim=1`) into a single 1024-dimensional vector.
5.  **Prediction**: `x = self.fc(x)` passes the fused feature vector through the `self.fc` sequential layers to produce the final logits for each possible answer.

This modular design allows the model to learn representations from both modalities independently before combining them for the final answer prediction.
```
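
To make the dimensions above concrete, here is a small pure-Python sketch that traces the feature sizes through the text encoder and fusion head and counts their parameters. The helper names and the values of `vocab_size` and `n_answer` are placeholders chosen for illustration, not figures from the dataset.

```python
def linear_params(in_features, out_features):
    """Parameter count of a linear layer with bias: weights plus biases."""
    return in_features * out_features + out_features


def head_summary(vocab_size, n_answer, feature_dim=512):
    """Trace sizes through the text encoder and fusion head (illustrative helper)."""
    fused_dim = feature_dim + feature_dim  # concat of image and question features
    return {
        "fused_dim": fused_dim,
        "text_encoder_params": linear_params(vocab_size, feature_dim),
        "head_params": linear_params(fused_dim, feature_dim)
        + linear_params(feature_dim, n_answer),
        "output_dim": n_answer,
    }


# Placeholder sizes, assumed only for this example
summary = head_summary(vocab_size=1000, n_answer=100)
```

The text encoder and the last linear layer scale linearly with the vocabulary and the number of answers, which is one reason a learned embedding over token ids is a common replacement for the bag-of-words input.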


## Detail ResNet18 Architecture

### Subtask:
Provide a hierarchical description of the ResNet18 component, specifying the input and output sizes for its convolutional layers, basic blocks, and final fully connected layer. Include how the image input (3 channels) is processed to a 512-dimensional feature vector.


### ResNet18 Architecture Detail

ResNet18 is a convolutional neural network (CNN) that processes an input image (3 channels) and extracts a 512-dimensional feature vector. Its architecture is built upon several key components, including initial convolutional and pooling layers, a series of Basic Blocks grouped into four main layers, and final pooling and fully connected layers.

#### 1. Initial Layers
The image input, typically RGB with 3 channels (e.g., `(3, H, W)`), first passes through a set of initial layers:

-   **`conv1`**: A convolutional layer with `in_channels=3`, `out_channels=64`, `kernel_size=7`, `stride=2`, `padding=3`. This layer reduces the spatial dimensions by half and increases the feature channels to 64.
    -   Input: `(BatchSize, 3, H, W)`
    -   Output: `(BatchSize, 64, H/2, W/2)`
-   **`bn1`**: A Batch Normalization layer applied to the output of `conv1`.
-   **`relu`**: A ReLU activation function is applied.
-   **`maxpool`**: A Max Pooling layer with `kernel_size=3`, `stride=2`, `padding=1`. This further reduces the spatial dimensions by half.
    -   Input: `(BatchSize, 64, H/2, W/2)`
    -   Output: `(BatchSize, 64, H/4, W/4)`
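
The spatial sizes quoted above follow from the standard output-size formula for convolution and pooling layers (without dilation), `floor((n + 2p - k) / s) + 1`. A minimal sketch for the 224x224 input used in this notebook; the helper name `out_size` is illustrative.

```python
def out_size(n, kernel_size, stride, padding):
    """Output length along one spatial axis for a conv or pooling layer."""
    return (n + 2 * padding - kernel_size) // stride + 1


h = 224  # input height after transforms.Resize((224, 224))
after_conv1 = out_size(h, kernel_size=7, stride=2, padding=3)
after_maxpool = out_size(after_conv1, kernel_size=3, stride=2, padding=1)
```

For input sizes that are not divisible by 4, the floor in the formula means the result is only approximately `H/2` and `H/4`.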

#### 2. Main Layer Blocks (`layer1` to `layer4`)
Following the initial layers, ResNet18 consists of four main sequential layers (`layer1`, `layer2`, `layer3`, `layer4`), each comprising `BasicBlock` modules. These layers progressively increase the channel depth and, in most cases, reduce spatial dimensions through strided convolutions.

-   **`layer1`**: Consists of `2 BasicBlock` modules. It maintains the channel size and spatial dimensions.
    -   Initial `in_channels=64`, `out_channels=64`, `stride=1`.
    -   Output: `(BatchSize, 64, H/4, W/4)`
-   **`layer2`**: Consists of `2 BasicBlock` modules. The first `BasicBlock` in this layer uses `stride=2` for downsampling.
    -   Initial `in_channels=64`, `out_channels=128`, `stride=2`.
    -   Output: `(BatchSize, 128, H/8, W/8)`
-   **`layer3`**: Consists of `2 BasicBlock` modules. The first `BasicBlock` in this layer uses `stride=2` for downsampling.
    -   Initial `in_channels=128`, `out_channels=256`, `stride=2`.
    -   Output: `(BatchSize, 256, H/16, W/16)`
-   **`layer4`**: Consists of `2 BasicBlock` modules. The first `BasicBlock` in this layer uses `stride=2` for downsampling.
    -   Initial `in_channels=256`, `out_channels=512`, `stride=2`.
    -   Output: `(BatchSize, 512, H/32, W/32)`
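The spatial sizes listed above follow from the standard output-size rule for convolution and pooling layers, `floor((n + 2 * padding - kernel) / stride) + 1`. The following is a minimal sketch that traces one spatial dimension through the stages, assuming a 224-pixel input (an illustrative size, not necessarily the notebook's `Resize` target); `conv_out` and `resnet18_spatial_sizes` are hypothetical helper names.

```python
def conv_out(n: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (n + 2 * padding - kernel) // stride + 1


def resnet18_spatial_sizes(size: int) -> dict:
    """Trace one spatial dimension through the stages described above."""
    sizes = {}
    size = conv_out(size, kernel=7, stride=2, padding=3)  # conv1
    sizes["conv1"] = size
    size = conv_out(size, kernel=3, stride=2, padding=1)  # maxpool
    sizes["maxpool"] = size
    # Only the first 3x3 convolution of each layer can change the resolution:
    # layer1 uses stride 1, layer2 to layer4 use stride 2.
    for name, stride in [("layer1", 1), ("layer2", 2), ("layer3", 2), ("layer4", 2)]:
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        sizes[name] = size
    return sizes


sizes = resnet18_spatial_sizes(224)
print(sizes)
```

For input sizes that are not divisible by 32, each stage rounds according to the formula, so expressions such as `H/32` should be read as approximate.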

#### 3. BasicBlock Architecture
A `BasicBlock` is the fundamental building block of ResNet18 and contains two convolutional layers with a shortcut connection. For `BasicBlock`, `expansion = 1`.

-   **Block Structure**: `conv1` -> `bn1` -> `relu` -> `conv2` -> `bn2` -> `(shortcut + output)` -> `relu`
    -   **`conv1`**: `kernel_size=3`, `stride` (variable, typically 1 or 2), `padding=1`. Transforms `in_channels` to `out_channels`.
    -   **`bn1`**: Batch Normalization.
    -   **`relu`**: ReLU activation.
    -   **`conv2`**: `kernel_size=3`, `stride=1`, `padding=1`. Transforms `out_channels` to `out_channels`.
    -   **`bn2`**: Batch Normalization.
    -   **`shortcut`**: This pathway directly adds the input `x` to the output of `bn2`. If `stride != 1` or `in_channels != out_channels`, a `1x1` convolution with `stride` and Batch Normalization is applied to `x` to match dimensions before addition.
    -   Final `relu` activation.

#### 4. Final Stages
After passing through `layer4`, the feature maps are processed to produce the final 512-dimensional feature vector:

-   **`avgpool`**: An Adaptive Average Pooling layer (`nn.AdaptiveAvgPool2d((1, 1))`). This layer reduces each `(C, H, W)` feature map to `(C, 1, 1)`, effectively taking the average across the spatial dimensions for each channel.
    -   Input: `(BatchSize, 512, H/32, W/32)`
    -   Output: `(BatchSize, 512, 1, 1)`
-   **`view`**: The `(BatchSize, 512, 1, 1)` output from `avgpool` is flattened to `(BatchSize, 512)`.
-   **`fc`**: A Fully Connected (Linear) layer with `in_features=512 * block.expansion` (which is `512 * 1 = 512` for ResNet18) and `out_features=512`.
    -   Input: `(BatchSize, 512)`
    -   Output: `(BatchSize, 512)`

This final `512`-dimensional vector is then used as the image feature (`image_feature`) by the `VQAModel`.

## Detail Text Encoder Architecture

### Subtask:
Describe the text encoder component, which uses a single linear layer. Specify its input size (vocabulary size + 1 for unknown words) and its output size (512-dimensional feature vector). Explain how the question text is converted into a one-hot encoded representation before being fed into this layer.


The text encoder in the `VQAModel` is implemented as a single `nn.Linear` layer named `text_encoder`. This layer is responsible for converting the question's one-hot encoded representation into a dense feature vector.

### Input and Output of the `text_encoder`:
- **Input Size**: The input to the `text_encoder` is a one-hot encoded vector representing the question. Its size is `vocab_size + 1`, where `vocab_size` is the number of unique words found in the training dataset's questions (`len(train_dataset.question2idx)`). The additional `+1` dimension is specifically allocated to represent *unknown words* (words not present in the training vocabulary).
- **Output Size**: The `text_encoder` outputs a 512-dimensional feature vector, which is referred to as `question_feature`. This feature vector is then concatenated with the image feature for further processing in the model's fully connected layers.

### One-Hot Encoding Process in `VQADataset`:
The conversion of question text into this encoding happens in the `__getitem__` method of `VQADataset`. For each word in a question, the method checks whether the word exists in the `question2idx` dictionary. If it does, the corresponding index in the `question` NumPy array is set to 1. If it does not (an unknown word), the last element of the array (`question[-1]`) is set to 1, marking the presence of an unknown word. Because a question usually contains several words, the result is strictly a multi-hot (bag-of-words) vector rather than a one-hot vector: it records which words occur, but not their order or how often they occur.
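The encoding step can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook's code: it uses a list instead of a NumPy array, splits on single spaces, and the toy `question2idx` vocabulary and the `encode_question` helper are hypothetical.

```python
def encode_question(question: str, question2idx: dict) -> list:
    """Return a bag-of-words vector of length len(question2idx) + 1."""
    vector = [0.0] * (len(question2idx) + 1)
    for word in question.split(" "):
        if word in question2idx:
            vector[question2idx[word]] = 1.0
        else:
            vector[-1] = 1.0  # single shared slot for every unknown word
    return vector


# Toy vocabulary (assumption); the real one is built from the training questions.
question2idx = {"what": 0, "is": 1, "this": 2, "color": 3}

a = encode_question("what is this", question2idx)
b = encode_question("this is what", question2idx)        # same words, different order
c = encode_question("what brand is this", question2idx)  # "brand" is unknown
print(a, b, c)
```

Here `a` and `b` are identical even though the word order differs, and every unknown word in `c` maps to the same final slot, which illustrates what this representation cannot distinguish.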

## Describe Feature Fusion and Prediction Head

### Subtask:
Explain how the 512-dimensional image features and 512-dimensional text features are combined (concatenated to form a 1024-dimensional vector) and then processed by the final MLP (sequential linear layers with ReLU activation) to produce the prediction for 'n_answer' possible answers.


### Feature Fusion and Prediction Head

In the `VQAModel`, the image features and text features are combined and processed to produce the final answer prediction. This process involves two main stages: feature fusion and the prediction head (MLP).

1.  **Feature Combination (Fusion)**:
    *   The `image_feature` vector, which is 512-dimensional and derived from the `ResNet18` image encoder, represents the visual information from the input image.
    *   The `question_feature` vector, also 512-dimensional and obtained from the `text_encoder` (a linear layer processing the one-hot encoded question), represents the semantic information from the input question.
    *   These two feature vectors are combined using `torch.cat([image_feature, question_feature], dim=1)`. The `dim=1` argument ensures that the vectors are concatenated along the feature dimension, resulting in a single, fused feature vector of 1024 dimensions.

2.  **Prediction Head (Multi-Layer Perceptron - MLP)**:
    *   The fused 1024-dimensional feature vector is then passed through the prediction head, which is implemented as an `nn.Sequential` module named `self.fc`.
    *   The `self.fc` module consists of the following layers:
        *   An `nn.Linear(1024, 512)` layer: This first linear layer takes the 1024-dimensional fused feature vector as input and transforms it into a 512-dimensional representation.
        *   An `nn.ReLU(inplace=True)` activation function: This non-linear activation is applied to the output of the first linear layer, introducing non-linearity to the model and allowing it to learn more complex relationships.
        *   A final `nn.Linear(512, n_answer)` layer: This second linear layer takes the 512-dimensional activated feature vector and maps it to `n_answer` output dimensions. `n_answer` corresponds to the total number of unique possible answers in the training dataset.

3.  **Output**: The output of this final linear layer is a vector of raw, unnormalized scores (**logits**), one for each of the `n_answer` possible answers. A higher logit means the model favors that answer; applying softmax would convert the logits into probabilities. During inference, the answer with the highest logit is selected as the model's prediction.
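The step from logits to a predicted answer can be sketched without any framework. The logit values and the three-entry `idx2answer` mapping below are made up for illustration; in the notebook the logits come from `self.fc` and the mapping from the training dataset.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


idx2answer = {0: "yes", 1: "no", 2: "unanswerable"}  # hypothetical answer vocabulary
logits = [1.2, -0.3, 2.5]                            # hypothetical model output

probs = softmax(logits)
pred_idx = max(range(len(logits)), key=lambda i: logits[i])  # argmax over logits
pred_answer = idx2answer[pred_idx]
print(pred_answer, probs)
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits directly gives the same prediction as taking the argmax of the probabilities.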

## Explain Suitability for VQA

### Subtask:
Discuss the ingenuity of this architecture for the VQA task, highlighting the multimodal fusion of image and text features as a core approach. Also, mention the baseline's simplified approach to question embedding (one-hot encoding) and answer prediction (classification of the most frequent answer), while acknowledging potential areas for improvement as suggested in the problem description.


```markdown
## Explanation of VQAModel's Suitability for VQA

### 1. Multimodal Fusion for VQA Task
The `VQAModel` addresses the Visual Question Answering (VQA) task with a simple multimodal fusion strategy: it combines visual information extracted from the image with information from the question. A `ResNet18` (a deep convolutional neural network) processes the input image and produces a `512`-dimensional image feature vector. In parallel, a linear layer (`text_encoder`) processes the encoded question and produces a `512`-dimensional text feature vector. The two feature vectors are concatenated (`torch.cat`) into a single `1024`-dimensional vector, which is fed into a final multilayer perceptron (`self.fc`) to predict the answer. This fusion is essential for VQA because the answer depends on both inputs: the question determines which part of the visual content matters, and the image supplies the evidence needed to answer it.

### 2. Baseline's Approach to Question Embedding and Answer Prediction

*   **Question Embedding (One-hot Encoding)**: The baseline model simplifies question representation by using a one-hot encoding scheme. Each unique word in the question vocabulary is assigned a unique index, and a question is represented as a binary vector where elements corresponding to words present in the question are set to 1, and others to 0. An additional element is reserved for unknown words. This creates a sparse representation where the dimensionality is `vocab_size + 1`.

*   **Answer Prediction (Classification of Most Frequent Answer)**: For answer prediction, the baseline treats the VQA task as a multi-class classification problem. During training, it uses the `mode_answer` (the most frequently provided answer among the 10 human annotations for a given question-image pair) as the single ground truth label. The final fully connected layer of the `VQAModel` outputs logits for each possible answer in the vocabulary (`n_answer`), and the model predicts the answer with the highest logit.

### 3. Strengths and Limitations of Baseline Approaches

**Strengths:**
*   **Simplicity and Interpretability:** One-hot encoding is straightforward to implement and understand. The classification of the most frequent answer simplifies the training objective, making it a clear target for a neural network.
*   **Computational Efficiency (for small vocabularies):** For relatively small vocabularies, one-hot encoding can be computationally manageable. The single target answer reduces the complexity of the loss function.
*   **Solid Starting Point:** These baseline choices provide a robust initial setup from which more sophisticated techniques can be incrementally added and evaluated.

**Limitations:**
*   **Lack of Semantic Meaning (One-hot Encoding):** One-hot vectors treat each word as an independent entity, failing to capture semantic relationships or contextual information between words. This can hinder the model's ability to generalize to unseen phrases or understand nuanced questions.
*   **High Dimensionality and Sparsity (One-hot Encoding):** As the vocabulary grows, one-hot vectors become very high-dimensional and sparse, leading to increased memory consumption and potentially slower training.
*   **Loss of Information (Most Frequent Answer):** Training only on the most frequent answer discards the distribution of the 10 human answers. VQA answers are often subjective and admit several plausible responses, which the evaluation metric rewards with partial credit. Reducing them to a single 'correct' answer oversimplifies the problem and withholds useful supervision from the model.
*   **Fixed Answer Set:** The model is constrained to predict answers present in the training set's answer vocabulary, making it unable to generate novel answers.

### 4. Potential Areas for Improvement

Drawing from the '考えられる工夫の例' (Examples of possible improvements) section of the problem description, several enhancements can be made:

*   **Advanced Question Embedding (Tokenizer/Distributed Representations):** Instead of one-hot encoding, the question can be tokenized and mapped to dense distributed representations, for example pre-trained word embeddings such as Word2Vec or GloVe, or contextual embeddings from a transformer such as BERT (used with its own tokenizer). Dense embeddings capture semantic similarity between words and have a fixed, much smaller dimensionality, which addresses both limitations of one-hot encoding noted above.
*   **Soft Labels for Answer Prediction:** To leverage the full spectrum of human responses, implementing soft labels during training could be beneficial. This involves training the model to predict a distribution over possible answers based on their frequency among the 10 human annotations, rather than just classifying the single most frequent one. This captures the inherent ambiguity and multiple correct answers in VQA.
*   **Fine-tuning Pre-trained Models:** Fine-tuning the image feature extractor (e.g., the ResNet backbone) and/or the text feature extractor with task-specific data can lead to substantial performance gains. Using pre-trained models like ViT for vision or BERT/RoBERTa for language as foundational components, and then fine-tuning them on the VQA dataset, can provide a more powerful feature extraction capability.
*   **Advanced Image Pre-processing/Data Augmentation:** Expanding beyond simple resizing and `ToTensor` transformations, incorporating data augmentation techniques (e.g., random cropping, horizontal flipping, color jittering) can enhance the model's generalization performance and robustness to variations in input images.
```
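To make the soft-label idea concrete, the sketch below scores each annotated answer as `min(count / 3, 1)`, the same rule as the evaluation formula, and compares the result with the single hard label used by the baseline. The answer vocabulary and the 10 annotations are invented for illustration, and `soft_labels` is a hypothetical helper.

```python
from collections import Counter


def soft_labels(answers, answer2idx):
    """Score each annotated answer as min(count / 3, 1), indexed by class id."""
    counts = Counter(a for a in answers if a in answer2idx)
    target = [0.0] * len(answer2idx)
    for answer, count in counts.items():
        target[answer2idx[answer]] = min(count / 3, 1.0)
    return target


# Hypothetical answer vocabulary and annotations for one image-question pair.
answer2idx = {"cat": 0, "dog": 1, "kitten": 2, "unanswerable": 3}
answers = ["cat"] * 5 + ["kitten"] * 3 + ["dog"] * 2

target = soft_labels(answers, answer2idx)
hard_label = answer2idx[Counter(answers).most_common(1)[0][0]]  # baseline: mode answer
print(target, hard_label)
```

The hard label keeps only "cat", while the soft target also gives full credit to "kitten" and partial credit to "dog". Because these scores need not sum to 1, a per-class loss such as binary cross-entropy on the logits is one possible pairing; this is a suggestion rather than something the problem description prescribes.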


## Final Task

### Subtask:
Conclude the explanation by summarizing the key aspects of the model's design and its role as a baseline for the VQA task.


## Summary:

### Data Analysis Key Findings

*   The `VQAModel` architecture is composed of three main parts: an image encoder (`ResNet18`), a text encoder (a single linear layer), and a fusion/prediction head (a Multi-Layer Perceptron).
*   **Image Feature Extraction**: The `ResNet18` processes a 3-channel input image, progressively reducing spatial dimensions and increasing channel depth through its convolutional blocks. It culminates in an adaptive average pooling layer and a final fully connected layer to produce a 512-dimensional image feature vector.
*   **Text Feature Extraction**: The text encoder is a `nn.Linear` layer that takes a one-hot encoded question vector of size `vocab_size + 1` (where `+1` accounts for unknown words) and transforms it into a 512-dimensional question feature vector.
*   **Feature Fusion and Prediction**: The 512-dimensional image and question feature vectors are concatenated to form a 1024-dimensional combined feature vector. This vector is then fed into the prediction head, an MLP consisting of two linear layers with a ReLU activation, which outputs `n_answer` logits for possible answers.
*   **VQA Suitability as a Baseline**: The model's strength lies in its multimodal fusion approach, combining visual and textual information. However, its baseline nature is evident in the simplified one-hot encoding for questions (which lacks semantic understanding and can be high-dimensional and sparse) and the prediction of only the most frequent answer, which overlooks the richness of multiple human annotations.

### Insights or Next Steps

*   The current `VQAModel` serves as a foundational baseline, demonstrating the core principle of multimodal fusion in VQA. Its identified limitations in text representation and answer supervision offer clear pathways for performance enhancement.
*   To significantly improve model performance, future work should prioritize adopting more sophisticated question embedding techniques (e.g., distributed representations from transformers like BERT) and employing soft-labeling or multi-label classification for answer prediction to better leverage the full spectrum of human responses and inherent ambiguity in VQA.


# Task
The `VQAModel` currently uses a simple linear layer for text encoding and a one-hot representation of questions. To improve the model's understanding of natural language, you'll first define a new `TextEncoderLSTM` module using an `nn.Embedding` layer followed by an `nn.LSTM` layer. This module will take `vocab_size`, `embed_dim`, and `hidden_dim` as parameters, with `batch_first=True` for the LSTM. The output `hidden_dim` of the LSTM should be 512 to maintain compatibility with the existing fusion layer.

## Define New Text Encoder Module

### Subtask:
Create a new TextEncoderLSTM module that takes vocab_size, embed_dim, and hidden_dim as parameters. This module will consist of an nn.Embedding layer followed by an nn.LSTM layer. The embed_dim will be the input size for the LSTM, and hidden_dim will be its output size. The batch_first=True parameter will be set for the LSTM.


**Reasoning**:
The subtask requires defining a new TextEncoderLSTM module. I will create a code block with the class definition, including the `__init__` and `forward` methods as specified in the instructions.



In [None]:
class TextEncoderLSTM(nn.Module):
    """
    Text encoder module (LSTM-based).
    """
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, text):
        embedded = self.embedding(text)  # (batch_size, sequence_length, embed_dim)
        _, (hidden, cell) = self.lstm(embedded)  # hidden: (1, batch_size, hidden_dim)
        return hidden.squeeze(0)  # (batch_size, hidden_dim)

## Update VQAModel with LSTM Text Encoder

### Subtask:
Modify the `VQAModel` class. Replace the existing `self.text_encoder = nn.Linear(...)` with an instance of the new `TextEncoderLSTM` module. The `hidden_dim` of the LSTM should be configured to match the expected 512 dimensions for fusion with the image features. Update the `__init__` method accordingly.


**Reasoning**:
The subtask requires modifying the `VQAModel` class to use the new `TextEncoderLSTM` module. I will update the `__init__` method of `VQAModel` to replace the `nn.Linear` text encoder with `TextEncoderLSTM`, ensuring the `hidden_dim` is 512 and choosing an appropriate `embed_dim`.



# Task
```python
class VQAModel(nn.Module):
    """
    Example model for the VQA task.
    """
    def __init__(self, vocab_size: int, n_answer: int, embed_dim: int = 256, text_hidden_dim: int = 512):
        """
        Constructor.

        Parameters
        ----------
        vocab_size: int
            Vocabulary size of the input questions (the num_embeddings of nn.Embedding)
        n_answer: int
            Number of output classes
        embed_dim: int
            Word-embedding dimension (for TextEncoderLSTM)
        text_hidden_dim: int
            Hidden dimension of TextEncoderLSTM. Set to 512, the same size as the image feature it is concatenated with.
        """
        super().__init__()
        # self.resnet = ResNet18()
        self.resnet = ResNet50()

        # Replace nn.Linear with TextEncoderLSTM.
        # Its output dimension (hidden_dim) is text_hidden_dim, 512 by default.
        self.text_encoder = TextEncoderLSTM(vocab_size, embed_dim, text_hidden_dim)

        # The image feature (512) and the text feature (text_hidden_dim) are concatenated,
        # so the input dimension of the fusion layer becomes 512 + text_hidden_dim.
        self.fc = nn.Sequential(
            nn.Linear(512 + text_hidden_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, n_answer)
        )

    def forward(self, image, question):
        # Note: TextEncoderLSTM expects the question as a sequence of token IDs.
        # The current dataset provides the question as a one-hot vector, so
        # VQADataset and the DataLoader will need to change in a later step.
        image_feature = self.resnet(image)  # image feature
        question_feature = self.text_encoder(question)  # text feature

        x = torch.cat([image_feature, question_feature], dim=1)
        x = self.fc(x)

        return x
```
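As the note in `forward` says, `nn.Embedding` needs integer token IDs of shape `(batch_size, sequence_length)` rather than a one-hot vector. One way to produce such input is sketched below in plain Python. The reserved IDs for `<pad>` and `<unk>`, the whitespace tokenization, and the helper names `build_vocab` and `encode` are assumptions for illustration, not part of the notebook.

```python
PAD_ID, UNK_ID = 0, 1  # reserved ids (assumption, not from the notebook)


def build_vocab(questions):
    """Assign an integer id to every word seen in the training questions."""
    word2id = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for question in questions:
        for word in question.split():
            word2id.setdefault(word, len(word2id))
    return word2id


def encode(question, word2id, max_len):
    """Map words to ids, truncate to max_len, and right-pad with PAD_ID."""
    ids = [word2id.get(word, UNK_ID) for word in question.split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


train_questions = ["what is this", "what color is the shirt"]
word2id = build_vocab(train_questions)

# "brand" was never seen during vocabulary construction, so it maps to UNK_ID.
batch = [encode(q, word2id, max_len=6) for q in train_questions + ["what brand is this"]]
print(batch)
```

Under these assumptions, `len(word2id)` would be passed as `vocab_size`, and `torch.tensor(batch, dtype=torch.long)` would be a valid input for `TextEncoderLSTM`. With right-padding, the LSTM's final hidden state is computed after the padding steps, so passing `padding_idx=PAD_ID` to `nn.Embedding` or packing the sequences with `torch.nn.utils.rnn.pack_padded_sequence` may be worth considering.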

## Install Transformers Library

### Subtask:
Install the `transformers` library which provides access to pre-trained models like BERT and their tokenizers. This is a prerequisite for using BERT.


**Reasoning**:
The subtask requires installing the `transformers` library. This is achieved by executing `!pip install transformers` in a code cell.



In [None]:
!pip install transformers



## Modify VQADataset for BertTokenizer

### Subtask:
Update the VQADataset class to use BertTokenizer for processing question texts. This involves importing BertTokenizer, initializing it in the __init__ method, and replacing the custom question2idx logic with BERT tokenization (producing token IDs and attention masks) in the __getitem__ method. The answer2idx and idx2answer should still be maintained for processing answers.


**Reasoning**:
The subtask requires modifying the `VQADataset` class to integrate `BertTokenizer` for question processing. This involves adding the import, updating the `__init__` method to initialize the tokenizer and remove outdated question dictionary logic, and revising the `__getitem__` method to use BERT tokenization and return the appropriate tensors. Additionally, the `update_dict` method needs to be adjusted to reflect these changes.



In [None]:
from transformers import BertTokenizer

class VQADataset(torch.utils.data.Dataset):
    """
    Dataset class for the VQA data.
    """
    def __init__(self, df_path, image_dir, transform=None, answer=True):
        self.transform = transform  # image preprocessing
        self.image_dir = image_dir  # directory containing the image files
        self.df = pandas.read_json(df_path)  # DataFrame holding image file path, question, and answers
        self.answer = answer

        # Initialize BertTokenizer
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

        # question2idx and idx2question are no longer needed; BertTokenizer manages the question vocabulary
        self.question2idx = None
        self.idx2question = None

        self.answer2idx = {}
        self.idx2answer = {}

        if self.answer:
            # Add every answer text to the dictionary
            for answers in self.df["answers"]:
                for answer in answers:
                    word = answer["answer"]
                    word = process_text(word)
                    if word not in self.answer2idx:
                        self.answer2idx[word] = len(self.answer2idx)
            self.idx2answer = {v: k for k, v in self.answer2idx.items()}  # reverse lookup (index -> answer)

    def update_dict(self, dataset):
        """
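        Replace the dictionaries of the validation/test dataset with those of the training dataset.
        The BertTokenizer is shared, so only the answer dictionaries are updated.

        Parameters
        ----------
        dataset : Dataset
            Dataset built from the training data
        """
        # self.question2idx and self.idx2question are not updated; BertTokenizer manages the question vocabulary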
        self.answer2idx = dataset.answer2idx
        self.idx2answer = dataset.idx2answer

    def __getitem__(self, idx):
        """
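        Return the sample (image, question, answers) at the given index.

        Parameters
        ----------
        idx : int
            Index of the sample to fetch

        Returns
        -------
        image : torch.Tensor  (C, H, W)
            Image data
        input_ids : torch.Tensor (sequence_length)
            BERT token IDs
        attention_mask : torch.Tensor (sequence_length)
            BERT attention mask
        answers_for_vqa_criterion : torch.Tensor  (number of annotator answers found in answer2idx, at most 10)
            Answer indices of the annotators (used for VQA evaluation)
        target_soft_labels : torch.Tensor (len(self.answer2idx))
            Soft labels (VQA scores)
        """
        image = Image.open(f"{self.image_dir}/{self.df['image'][idx]}")
        image = self.transform(image)

        # Tokenize the question with BertTokenizer.
        # Note: no max_length is given, so padding='max_length' pads every question
        # to the tokenizer's model_max_length (512 for bert-base-uncased).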
        encoded_question = self.tokenizer(
            process_text(self.df["question"][idx]),
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = encoded_question['input_ids'].squeeze(0)
        attention_mask = encoded_question['attention_mask'].squeeze(0)

        if self.answer:
            all_human_answers_text = [process_text(answer["answer"]) for answer in self.df["answers"][idx]]

            valid_human_answer_indices = [
                self.answer2idx[ans_text] for ans_text in all_human_answers_text if ans_text in self.answer2idx
            ]

            answers_for_vqa_criterion = torch.tensor(valid_human_answer_indices, dtype=torch.float)

            answer_counts_for_soft_labels = {}
            for ans_idx in valid_human_answer_indices:
                answer_counts_for_soft_labels[ans_idx] = answer_counts_for_soft_labels.get(ans_idx, 0) + 1

            target_soft_labels = torch.zeros(len(self.answer2idx))

            for ans_idx, count in answer_counts_for_soft_labels.items():
                score = min(count / 3, 1.0)
                target_soft_labels[ans_idx] = score

            return image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels

        else:
            return image, input_ids, attention_mask

    def __len__(self):
        return len(self.df)


NameError: name 'torch' is not defined

**Reasoning**:
The previous code failed because `torch` had not been imported when the class statement was evaluated: the base class `torch.utils.data.Dataset` is resolved at definition time. Importing `torch` at the beginning of the code block where `VQADataset` is defined resolves this `NameError`.



In [None]:
import pandas
import torch
from PIL import Image
from transformers import BertTokenizer

# process_text is the text-normalization helper defined earlier in this notebook.

class VQADataset(torch.utils.data.Dataset):
    """
    Dataset class for the VQA data.
    """
    def __init__(self, df_path, image_dir, transform=None, answer=True):
        self.transform = transform  # image preprocessing
        self.image_dir = image_dir  # directory containing the image files
        self.df = pandas.read_json(df_path)  # DataFrame holding image file path, question, and answers
        self.answer = answer

        # Initialize BertTokenizer
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

        # question2idx and idx2question are no longer needed; BertTokenizer manages the question vocabulary
        self.question2idx = None
        self.idx2question = None

        self.answer2idx = {}
        self.idx2answer = {}

        if self.answer:
            # Add every answer text to the dictionary
            for answers in self.df["answers"]:
                for answer in answers:
                    word = answer["answer"]
                    word = process_text(word)
                    if word not in self.answer2idx:
                        self.answer2idx[word] = len(self.answer2idx)
            self.idx2answer = {v: k for k, v in self.answer2idx.items()}  # reverse lookup (index -> answer)

    def update_dict(self, dataset):
        """
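        Replace the dictionaries of the validation/test dataset with those of the training dataset.
        The BertTokenizer is shared, so only the answer dictionaries are updated.

        Parameters
        ----------
        dataset : Dataset
            Dataset built from the training data
        """
        # self.question2idx and self.idx2question are not updated; BertTokenizer manages the question vocabulary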
        self.answer2idx = dataset.answer2idx
        self.idx2answer = dataset.idx2answer

    def __getitem__(self, idx):
        """
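        Fetch the data (image, question, answers) for the given idx.
        The last two return values are produced only when self.answer is True.

        Parameters
        ----------
        idx : int
            Index of the sample to fetch.

        Returns
        -------
        image : torch.Tensor  (C, H, W)
            Image data.
        input_ids : torch.Tensor (sequence_length)
            BERT token ids.
        attention_mask : torch.Tensor (sequence_length)
            BERT attention mask.
        answers_for_vqa_criterion : torch.Tensor  (n_answer)
            Ids of the answers given by the 10 annotators (used for VQA evaluation).
        target_soft_labels : torch.Tensor (len(self.answer2idx))
            Soft labels: min(count / 3, 1) for each answer given by an annotator, 0 elsewhere.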
        """
        image = Image.open(f"{self.image_dir}/{self.df['image'][idx]}")
        image = self.transform(image)

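        # Tokenize the question with BertTokenizer.
        # Note: padding='max_length' without an explicit max_length pads every question
        # to the tokenizer's model maximum (512 tokens for bert-base-uncased).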
        encoded_question = self.tokenizer(
            process_text(self.df["question"][idx]),
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = encoded_question['input_ids'].squeeze(0)
        attention_mask = encoded_question['attention_mask'].squeeze(0)

        if self.answer:
            all_human_answers_text = [process_text(answer["answer"]) for answer in self.df["answers"][idx]]

            valid_human_answer_indices = [
                self.answer2idx[ans_text] for ans_text in all_human_answers_text if ans_text in self.answer2idx
            ]

            # Answer ids are integer class indices, so store them as an integer tensor
            answers_for_vqa_criterion = torch.tensor(valid_human_answer_indices, dtype=torch.long)

            answer_counts_for_soft_labels = {}
            for ans_idx in valid_human_answer_indices:
                answer_counts_for_soft_labels[ans_idx] = answer_counts_for_soft_labels.get(ans_idx, 0) + 1

            target_soft_labels = torch.zeros(len(self.answer2idx))

            for ans_idx, count in answer_counts_for_soft_labels.items():
                score = min(count / 3, 1.0)
                target_soft_labels[ans_idx] = score

            return image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels

        else:
            return image, input_ids, attention_mask

    def __len__(self):
        return len(self.df)
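
A minimal sketch of the soft-label rule used in `__getitem__`, written without PyTorch so it can be checked by hand. The function name `soft_label_scores` and the example answers are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def soft_label_scores(answers):
    """Map each distinct answer to min(count / 3, 1.0), as in __getitem__."""
    counts = Counter(answers)
    return {ans: min(n / 3, 1.0) for ans, n in counts.items()}


# Ten hypothetical (already normalized) annotator answers for one question.
example_answers = ["yes"] * 6 + ["no"] * 2 + ["unanswerable"] * 2
scores = soft_label_scores(example_answers)
print(scores)
```

An answer given by three or more annotators saturates at 1.0, while one given by two annotators gets 2/3, so minority answers still contribute to the training target.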

## Modify VQAModel to use BertModel and Freeze Weights

### Subtask:
Revise the `VQAModel` class. Import `BertModel` and replace the existing `self.text_encoder` with `BertModel.from_pretrained('bert-base-uncased')`. Additionally, iterate through the parameters of the loaded `BertModel` (`self.bert`) and set `param.requires_grad = False` to freeze its weights, preventing updates during initial training stages.
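
As a rough illustration of the freezing step, the sketch below uses a small `nn.Linear` as a hypothetical stand-in for the pretrained encoder (so nothing is downloaded) and counts which parameters remain trainable. Passing only the trainable parameters to the optimizer is an optional variation, not something this notebook does.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained text encoder (assumption: avoids downloading BERT).
encoder = nn.Linear(768, 768)
# Small task head that should keep learning.
head = nn.Linear(768, 10)

# Freeze the encoder: its weights receive no gradient updates.
for param in encoder.parameters():
    param.requires_grad = False

model = nn.Sequential(encoder, head)
trainable = [p for p in model.parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_trainable} / {n_total}")

# Optional: hand only the trainable parameters to the optimizer.
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

Counting parameters this way is a quick check that the freeze took effect before a long training run starts.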


**Reasoning**:
The subtask requires importing `BertModel` from the `transformers` library to be used in the `VQAModel`.



In [None]:
from transformers import BertModel



**Reasoning**:
The subtask requires modifying the VQAModel to replace the TextEncoderLSTM with BertModel and freeze its parameters. I will update the VQAModel class definition accordingly, adjusting the constructor parameters and the fusion layer's input dimension.



In [None]:
class VQAModel(nn.Module):
    """
    Example model for the VQA task.
    """
    def __init__(self, n_answer: int):
        """
        Constructor.

        Parameters
        ----------
        n_answer: int
            Number of output classes.
        """
        super().__init__()
        # self.resnet = ResNet18()
        self.resnet = ResNet50()

        # Initialize BertModel
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Freeze the BERT parameters
        for param in self.bert.parameters():
            param.requires_grad = False

        # BERT's final hidden size is 768.
        # The image feature (512) and the text feature (768) are concatenated,
        # so the fusion layer takes 512 + 768 inputs.
        self.fc = nn.Sequential(
            nn.Linear(512 + 768, 512), # 512 (image) + 768 (BERT)
            nn.ReLU(inplace=True),
            nn.Linear(512, n_answer)
        )

    def forward(self, image, input_ids, attention_mask):
        image_feature = self.resnet(image)  # image feature

        # Text feature from BERT.
        # last_hidden_state has shape (batch_size, sequence_length, hidden_size);
        # pooler_output is the [CLS] (first-token) hidden state passed through
        # BERT's pooling layer (Linear + Tanh), not the raw [CLS] vector.
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        question_feature = bert_output.pooler_output # (batch_size, hidden_size=768)

        x = torch.cat([image_feature, question_feature], dim=1)
        x = self.fc(x)

        return x

NameError: name 'nn' is not defined

**Reasoning**:
The previous code block failed because `nn` (from `torch.nn`) was not defined. I need to add `import torch.nn as nn` to make the `VQAModel` class definition valid.



In [None]:
import torch.nn as nn

class VQAModel(nn.Module):
    """
    Example model for the VQA task.
    """
    def __init__(self, n_answer: int):
        """
        Constructor.

        Parameters
        ----------
        n_answer: int
            Number of output classes.
        """
        super().__init__()
        # self.resnet = ResNet18()
        self.resnet = ResNet50()

        # Initialize BertModel
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Freeze the BERT parameters
        for param in self.bert.parameters():
            param.requires_grad = False

        # BERT's final hidden size is 768.
        # The image feature (512) and the text feature (768) are concatenated,
        # so the fusion layer takes 512 + 768 inputs.
        self.fc = nn.Sequential(
            nn.Linear(512 + 768, 512), # 512 (image) + 768 (BERT)
            nn.ReLU(inplace=True),
            nn.Linear(512, n_answer)
        )

    def forward(self, image, input_ids, attention_mask):
        image_feature = self.resnet(image)  # image feature

        # Text feature from BERT.
        # last_hidden_state has shape (batch_size, sequence_length, hidden_size);
        # pooler_output is the [CLS] (first-token) hidden state passed through
        # BERT's pooling layer (Linear + Tanh), not the raw [CLS] vector.
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        question_feature = bert_output.pooler_output # (batch_size, hidden_size=768)

        x = torch.cat([image_feature, question_feature], dim=1)
        x = self.fc(x)

        return x

## Adjust Dataloader for New Dataset Output

### Subtask:
Ensure that the `train` and `eval` functions, and their respective `DataLoader` configurations, are updated to correctly handle the new outputs from the `VQADataset`'s `__getitem__` method (e.g., token IDs and attention mask for questions, instead of one-hot vectors).


**Reasoning**:
The subtask requires updating the `train` and `eval` functions to handle the new outputs from `VQADataset` (token IDs and attention mask instead of one-hot vectors) and pass these to the `VQAModel`. I will modify the `train` function first, specifically the unpacking of `dataloader` outputs and the arguments passed to the `model`.
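
For intuition about training against soft labels: the setup cell later in this notebook selects `nn.BCEWithLogitsLoss`, which treats each answer class as an independent binary target. The sketch below re-implements the per-element formula in plain Python (mean reduction assumed, matching the PyTorch default); `bce_with_logits` is an illustrative name, not a notebook function.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy between sigmoid(logits) and soft targets in [0, 1]."""
    total = 0.0
    for x, t in zip(logits, targets):
        # Numerically stable form of -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))]
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# For a soft target of 2/3, the loss is smallest when sigmoid(logit) == 2/3,
# i.e. logit == log(2); both a lower and a higher logit are penalized.
target = [2 / 3]
loss_low = bce_with_logits([0.0], target)
loss_best = bce_with_logits([math.log(2)], target)
loss_high = bce_with_logits([2.0], target)
print(loss_low, loss_best, loss_high)
```

Unlike a hard-label cross-entropy, this loss does not push the logit of a partially agreed answer toward infinity, which is why it pairs naturally with the `min(count / 3, 1)` targets.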



In [None]:
def train(model, dataloader, optimizer, criterion, device):
    """
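    Training function (one epoch).

    Parameters
    ----------
    model: torch.nn.Module
        Model to train.
    dataloader: torch.utils.data.DataLoader
        Data loader used for training.
    optimizer: torch.optim.Optimizer
        Optimizer.
    criterion: torch.nn.Module
        Loss function.
    device: torch.device
        Device used for training.

    Returns
    -------
    total_loss: float
        Mean loss per batch.
    total_acc: float
        Mean VQA accuracy per batch.
    simple_acc: float
        Accuracy against the most frequent answer (not the VQA metric). This version does not compute it and always returns 0.0.
    time: float
        Time taken to train for one epoch (sec).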
    """
    model.train()

    total_loss = 0
    total_acc = 0

    start = time.time()
    for image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels in dataloader:
        image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels = \
            image.to(device), input_ids.to(device), attention_mask.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

        pred = model(image, input_ids, attention_mask)
        loss = criterion(pred, target_soft_labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)  # VQA accuracy

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start


def eval(model, dataloader, criterion, device):
    """
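    Evaluation function.

    Parameters
    ----------
    model: torch.nn.Module
        Model to evaluate.
    dataloader: torch.utils.data.DataLoader
        Data loader used for evaluation.
    criterion: torch.nn.Module
        Loss function.
    device: torch.device
        Device to use.

    Returns
    -------
    total_loss: float
        Mean loss per batch.
    total_acc: float
        Mean VQA accuracy per batch.
    simple_acc: float
        Accuracy against the most frequent answer (not the VQA metric). This version does not compute it and always returns 0.0.
    time: float
        Time taken to evaluate for one epoch (sec).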
    """
    model.eval()

    total_loss = 0
    total_acc = 0

    start = time.time()
    with torch.no_grad():
        for image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels in dataloader:
            image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels = \
                image.to(device), input_ids.to(device), attention_mask.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

            pred = model(image, input_ids, attention_mask)
            loss = criterion(pred, target_soft_labels)

            total_loss += loss.item()
            total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)  # VQA accuracy

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start
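
`VQA_criterion` is defined elsewhere in the notebook and is not shown in this section. Assuming it follows the metric described in the overview (for each of the 10 answers, leave that answer out, score `min(matches / 3, 1)` against the remaining 9, and average the 10 scores), a plain-Python sketch for a single sample could look like this. `vqa_accuracy` is a hypothetical name.

```python
def vqa_accuracy(pred, answers):
    """Average of min(matches / 3, 1) over every leave-one-out subset of the answers."""
    accs = []
    for i in range(len(answers)):
        others = answers[:i] + answers[i + 1:]  # drop the i-th annotator
        matches = sum(1 for a in others if a == pred)
        accs.append(min(matches / 3, 1.0))
    return sum(accs) / len(accs)


# Hypothetical sample: 3 annotators said "yes", 7 said "no".
answers = ["yes"] * 3 + ["no"] * 7
acc_yes = vqa_accuracy("yes", answers)
acc_no = vqa_accuracy("no", answers)
acc_other = vqa_accuracy("maybe", answers)
print(acc_yes, acc_no, acc_other)
```

Predicting "yes" scores slightly below 1 here because three of the ten subsets contain only two matching answers.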

**Reasoning**:
The `train` and `eval` functions are now updated to correctly handle the new `VQADataset` outputs and model inputs. The next step is to update the `DataLoader` configurations and the `VQAModel` instantiation to reflect the changes in `VQADataset` and `VQAModel` class definitions.
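
The setup cell below relies on `padding='max_length'`, so every question has the same length and the default collate function can stack the tensors. If dynamic padding were wanted instead, the core of a custom `collate_fn` might look like the sketch below. It works on plain lists of token ids; `pad_batch`, the example ids, and `pad_id=0` are assumptions for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in the batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Two made-up token-id sequences of different lengths.
batch = [[101, 2054, 102], [101, 2054, 2003, 2023, 102]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

Padding to the longest question in each batch keeps the sequence length short, at the cost of writing and maintaining a collate function.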



In [None]:
set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

# dataloader / model
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

train_dataset = VQADataset(df_path="./data/train.json", image_dir="./data/train", transform=transform)
test_dataset = VQADataset(df_path="./data/valid.json", image_dir="./data/valid", transform=transform, answer=False)
test_dataset.update_dict(train_dataset)

# Adjust DataLoader to handle BERT tokenization's padding and different data types
# Using a custom collate_fn might be necessary for batching varying sequence lengths later,
# but BertTokenizer's padding='max_length' makes it straightforward for now.
# For BERT, input_ids and attention_mask are already padded to max_length by the tokenizer,
# so standard DataLoader collate works.

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

# VQAModel no longer needs vocab_size as an argument
model = VQAModel(n_answer=len(train_dataset.answer2idx)).to(device)

# optimizer / criterion
num_epoch = 4
criterion = nn.BCEWithLogitsLoss() # Changed to BCEWithLogitsLoss for soft labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

NameError: name 'set_seed' is not defined

**Reasoning**:
The `NameError` for `set_seed` indicates that the function was not defined before being called. I will include the `set_seed` function definition, along with its necessary imports (`random`, `numpy`), within the code block to ensure it is available during execution.
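
To show what fixing a seed buys, here is a small sketch using only Python's `random` module. The helper name `draw` is hypothetical, and the notebook's `set_seed` additionally seeds NumPy and PyTorch.

```python
import random


def draw(seed, n=3):
    """Reseed the generator and return the next n random numbers."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(42)
second = draw(42)   # same seed: identical sequence
other = draw(7)     # different seed: different sequence
print(first == second, first == other)
```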



In [None]:
import random
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as transforms

def set_seed(seed):
    """
    Fix the random seeds.

    Parameters
    ----------
    seed : int
        Seed value used for random number generation.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

# dataloader / model
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

train_dataset = VQADataset(df_path="./data/train.json", image_dir="./data/train", transform=transform)
test_dataset = VQADataset(df_path="./data/valid.json", image_dir="./data/valid", transform=transform, answer=False)
test_dataset.update_dict(train_dataset)

# Adjust DataLoader to handle BERT tokenization's padding and different data types
# Using a custom collate_fn might be necessary for batching varying sequence lengths later,
# but BertTokenizer's padding='max_length' makes it straightforward for now.
# For BERT, input_ids and attention_mask are already padded to max_length by the tokenizer,
# so standard DataLoader collate works.

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

# VQAModel no longer needs vocab_size as an argument
model = VQAModel(n_answer=len(train_dataset.answer2idx)).to(device)

# optimizer / criterion
num_epoch = 4
criterion = nn.BCEWithLogitsLoss() # Changed to BCEWithLogitsLoss for soft labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

NameError: name 'pandas' is not defined

**Reasoning**:
The `NameError` for `pandas` indicates that the library was not imported. I will add `import pandas` to the imports section at the beginning of the code block to resolve this.



In [None]:
import random
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import pandas # Added pandas import

def set_seed(seed):
    """
    Fix the random seeds.

    Parameters
    ----------
    seed : int
        Seed value used for random number generation.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

# dataloader / model
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

train_dataset = VQADataset(df_path="./data/train.json", image_dir="./data/train", transform=transform)
test_dataset = VQADataset(df_path="./data/valid.json", image_dir="./data/valid", transform=transform, answer=False)
test_dataset.update_dict(train_dataset)

# Adjust DataLoader to handle BERT tokenization's padding and different data types
# Using a custom collate_fn might be necessary for batching varying sequence lengths later,
# but BertTokenizer's padding='max_length' makes it straightforward for now.
# For BERT, input_ids and attention_mask are already padded to max_length by the tokenizer,
# so standard DataLoader collate works.

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

# VQAModel no longer needs vocab_size as an argument
model = VQAModel(n_answer=len(train_dataset.answer2idx)).to(device)

# optimizer / criterion
num_epoch = 4
criterion = nn.BCEWithLogitsLoss() # Changed to BCEWithLogitsLoss for soft labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

FileNotFoundError: File ./data/train.json does not exist

**Reasoning**:
The `FileNotFoundError` indicates that `data/train.json` is not present, which suggests the `data.zip` file has not been unzipped in the current execution environment. I need to ensure the data is unzipped before attempting to load it.



In [None]:
!unzip -o data.zip

unzip:  cannot find or open data.zip, data.zip.zip or data.zip.ZIP.


**Reasoning**:
The previous `unzip` command failed because `data.zip` was not found in the current directory. Based on the original notebook's data preparation steps, `data.zip` needs to be copied from Google Drive to `/content` before it can be unzipped. I will execute the copy command first.



In [None]:
!cp "/content/drive/MyDrive/data.zip" "/content"

cp: cannot stat '/content/drive/MyDrive/data.zip': No such file or directory


**Reasoning**:
The previous attempts to copy and unzip `data.zip` failed, resulting in `FileNotFoundError` when trying to load `train.json`. This indicates the data files are not present in the expected location. Before proceeding, I need to verify the contents of the `/content` directory to understand the current file structure and explicitly check for the `data` directory and `data.zip`.



In [None]:
import os
print(f"Current working directory: {os.getcwd()}")
print("Contents of /content:")
!ls -F /content/
print("\nContents of /content/drive/MyDrive/:")
!ls -F /content/drive/MyDrive/ || echo "/content/drive/MyDrive/ not accessible or empty."


Current working directory: /content
Contents of /content:
sample_data/

Contents of /content/drive/MyDrive/:
ls: cannot access '/content/drive/MyDrive/': No such file or directory
/content/drive/MyDrive/ not accessible or empty.


**Reasoning**:
The previous output showed that Google Drive was not mounted, preventing access to `data.zip`. I need to execute the Google Drive mount command to make `data.zip` accessible.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Reasoning**:
Google Drive is now mounted. I need to retry copying the `data.zip` file from the mounted Google Drive to the `/content` directory.



In [None]:
!cp "/content/drive/MyDrive/data.zip" "/content"

^C


# Task
`!cp "/content/drive/MyDrive/data.zip" "/content"`

## Unzip Data

### Subtask:
Unzip the `data.zip` file located in `/content` to extract the necessary image and JSON data files for the VQA task.


**Reasoning**:
The previous copy command failed due to an interruption (`^C`). I need to re-execute the copy command to ensure `data.zip` is present in `/content` before attempting to unzip it.



In [None]:
!cp "/content/drive/MyDrive/data.zip" "/content"

**Reasoning**:
Now that `data.zip` should be copied to `/content`, I will proceed to unzip the file as instructed.



In [None]:
import os
if os.path.exists('/content/data.zip'):
    print("data.zip found, proceeding to unzip.")
    !unzip -o /content/data.zip
else:
    print("Error: data.zip not found in /content. Please ensure it's copied.")

Streaming output truncated to the last 5000 lines.
  inflating: data/train/train_17925.jpg  
  inflating: data/train/train_11508.jpg  
  inflating: data/train/train_18890.jpg  
  inflating: data/train/train_16941.jpg  
  inflating: data/train/train_19803.jpg  
  inflating: data/train/train_01120.jpg  
  inflating: data/train/train_14602.jpg  
  inflating: data/train/train_02783.jpg  
  inflating: data/train/train_05206.jpg  
  inflating: data/train/train_11328.jpg  
  inflating: data/train/train_05513.jpg  
  inflating: data/train/train_10742.jpg  
  inflating: data/train/train_10977.jpg  
  inflating: data/train/train_01154.jpg  
  inflating: data/train/train_07148.jpg  
  inflating: data/train/train_05799.jpg  
  inflating: data/train/train_19215.jpg  
  inflating: data/train/train_11471.jpg  
  inflating: data/train/train_08915.jpg  
  inflating: data/train/train_09099.jpg  
  inflating: data/train/train_15882.jpg  
  inflating: data/train/train_16020.jpg  
  inflating

# Task
It looks like the data files (`train.json` and `valid.json`) might not be in the expected location (`./data/`). Although `data.zip` was unzipped, the output mainly showed image files and subdirectories.

Let's inspect the contents of the `data` directory to confirm where `train.json` and `valid.json` are located.

```python
!ls -F ./data/
```

## Inspect data directory

### Subtask:
Inspect the contents of the `data` directory to confirm the location of `train.json` and `valid.json`.


**Reasoning**:
To confirm the presence of `train.json` and `valid.json`, I will list the contents of the `data` directory using the `ls -F` command.



In [None]:
!ls -F ./data/

train/	train.json  valid/  valid.json


**Reasoning**:
The previous steps confirm that `train.json` and `valid.json` are correctly located in the `./data/` directory. Now, I need to re-execute the initialization code for `set_seed`, `VQADataset`, `VQAModel`, and `optimizer`/`criterion` to ensure everything is set up with the updated `VQADataset` and `VQAModel` definitions.
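
The same check can be done from Python before constructing the datasets, which avoids a late `FileNotFoundError`. This is a minimal sketch; `missing_paths` is a hypothetical helper, and the demo runs against a temporary directory rather than the real `./data`.

```python
import tempfile
from pathlib import Path


def missing_paths(root, names):
    """Return the entries from names that do not exist under root."""
    root = Path(root)
    return [name for name in names if not (root / name).exists()]


# Demo in a throwaway directory that contains only part of the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.json").write_text("[]")
    (Path(tmp) / "train").mkdir()
    missing = missing_paths(tmp, ["train", "train.json", "valid", "valid.json"])
print(missing)
```

In the notebook, calling `missing_paths("./data", [...])` and stopping when the result is non-empty would surface the problem before the tokenizer and model are loaded.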



In [None]:
import random
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import pandas # Added pandas import

def set_seed(seed):
    """
    Fix the random seeds.

    Parameters
    ----------
    seed : int
        Seed value used for random number generation.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

# dataloader / model
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

train_dataset = VQADataset(df_path="./data/train.json", image_dir="./data/train", transform=transform)
test_dataset = VQADataset(df_path="./data/valid.json", image_dir="./data/valid", transform=transform, answer=False)
test_dataset.update_dict(train_dataset)

# Adjust DataLoader to handle BERT tokenization's padding and different data types
# Using a custom collate_fn might be necessary for batching varying sequence lengths later,
# but BertTokenizer's padding='max_length' makes it straightforward for now.
# For BERT, input_ids and attention_mask are already padded to max_length by the tokenizer,
# so standard DataLoader collate works.

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

# VQAModel no longer needs vocab_size as an argument
model = VQAModel(n_answer=len(train_dataset.answer2idx)).to(device)

# optimizer / criterion
num_epoch = 4
criterion = nn.BCEWithLogitsLoss() # Changed to BCEWithLogitsLoss for soft labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

NameError: name 'process_text' is not defined

**Reasoning**:
The `NameError` for `process_text` indicates that the function was not defined in the current scope. I will include the `process_text` function definition from the original notebook within the current code block to ensure it's available when `VQADataset` is instantiated.



In [None]:
import random
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import pandas # Added pandas import
import re # Added re import for process_text

def set_seed(seed):
    """
    Fix the random seeds.

    Parameters
    ----------
    seed : int
        Seed value used for random number generation.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

def process_text(text):
    """
    Normalize the format of questions and answers.

    Parameters
    ----------
    text : str
        A question or an answer.
    """
    # lowercase
    text = text.lower()

    # Convert number words to digits
    # (str.replace matches substrings, so "someone" becomes "some1")
    num_word_to_digit = {
        'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
        'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9',
        'ten': '10'
    }
    for word, digit in num_word_to_digit.items():
        text = text.replace(word, digit)

    # Remove periods that have no digit on either side (decimal points are kept)
    text = re.sub(r'(?<!\d)\.(?!\d)', '', text)

    # Remove articles
    text = re.sub(r'\b(a|an|the)\b', '', text)

    # Add the missing apostrophe to contractions
    contractions = {
        "dont": "don't", "isnt": "isn't", "arent": "aren't", "wont": "won't",
        "cant": "can't", "wouldnt": "wouldn't", "couldnt": "couldn't"
    }
    for contraction, correct in contractions.items():
        text = text.replace(contraction, correct)

    # Replace punctuation (except apostrophes and colons) with spaces
    text = re.sub(r"[^\w\s':]", ' ', text)

    # Remove whitespace before commas (no effect here: commas were already replaced above)
    text = re.sub(r'\s+,', ',', text)

    # Collapse consecutive whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()

    return text

set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

# dataloader / model
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

train_dataset = VQADataset(df_path="./data/train.json", image_dir="./data/train", transform=transform)
test_dataset = VQADataset(df_path="./data/valid.json", image_dir="./data/valid", transform=transform, answer=False)
test_dataset.update_dict(train_dataset)

# Adjust DataLoader to handle BERT tokenization's padding and different data types
# Using a custom collate_fn might be necessary for batching varying sequence lengths later,
# but BertTokenizer's padding='max_length' makes it straightforward for now.
# For BERT, input_ids and attention_mask are already padded to max_length by the tokenizer,
# so standard DataLoader collate works.

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

# VQAModel no longer needs vocab_size as an argument
model = VQAModel(n_answer=len(train_dataset.answer2idx)).to(device)

# optimizer / criterion
num_epoch = 4
criterion = nn.BCEWithLogitsLoss() # Changed to BCEWithLogitsLoss for soft labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)


NameError: name 'ResNet50' is not defined

**Reasoning**:
The `NameError` for `ResNet50` indicates that the ResNet model architecture was not defined before `VQAModel` attempted to use it. I will include the full definitions of `BasicBlock`, `BottleneckBlock`, `ResNet`, `ResNet18`, and `ResNet50` in the code block to ensure all necessary classes and functions are available.
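
Separately from the `ResNet50` fix: the `process_text` helper in the previous cell converts number words with `str.replace`, which also rewrites substrings inside longer words. The sketch below shows a word-boundary variant. `replace_number_words` is a hypothetical helper, and whether the official evaluation applies the same normalization is not confirmed here, so treat it as an option to test rather than a drop-in fix.

```python
import re

NUM_WORD_TO_DIGIT = {
    'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
    'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9',
    'ten': '10'
}
# Match number words only when they stand alone as whole words.
_NUM_PATTERN = re.compile(r"\b(" + "|".join(NUM_WORD_TO_DIGIT) + r")\b")


def replace_number_words(text):
    """Replace standalone number words with digits, leaving longer words intact."""
    return _NUM_PATTERN.sub(lambda m: NUM_WORD_TO_DIGIT[m.group(1)], text)


# Behaviour of the substring-based loop, for comparison.
naive = "someone"
for word, digit in NUM_WORD_TO_DIGIT.items():
    naive = naive.replace(word, digit)

fixed = replace_number_words("someone has two phones")
print(naive)
print(fixed)
```

Because the answer vocabulary is built from `process_text` output, switching variants changes `answer2idx`, so the training and test datasets would need to be rebuilt together.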



In [None]:
import random
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import pandas # Added pandas import
import re # Added re import for process_text

# --- ResNet Model Definitions (Copied from original notebook cells q8bOkolIbBBg and ptR4SOabbG1-) ---
class BasicBlock(nn.Module):
    """
    ResNet basic block.
    """
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        """
        Constructor.

        Parameters
        ----------
        in_channels: int
            Number of input channels.
        out_channels: int
            Number of output channels.
        stride: int
            Stride.
        """
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input to the block.

        Returns
        -------
        out: torch.Tensor
            Output of the block.
        """
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        out += self.shortcut(residual)
        out = self.relu(out)

        return out


class BottleneckBlock(nn.Module):
    """
    ResNet bottleneck block.
    """
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1):
        """
        Constructor.

        Parameters
        ----------
        in_channels: int
            Number of input channels.
        out_channels: int
            Number of output channels (before multiplying by expansion).
        stride: int
            Stride.
        """
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, kernel_size=1, stride=1)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * self.expansion, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels * self.expansion)
            )

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input to the block.

        Returns
        -------
        out: torch.Tensor
            Output of the block.
        """
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        out += self.shortcut(residual)
        out = self.relu(out)

        return out


class ResNet(nn.Module):
    """
    ResNet implementation.
    """
    def __init__(self, block, layers):
        """
        Constructor.

        Parameters
        ----------
        block: torch.nn.Module
            Block class to use (BasicBlock / BottleneckBlock).
        layers: list
            Number of blocks in each stage.
        """
        super().__init__()
        self.in_channels = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = self._make_layer(block, layers[0], 64)
        self.layer2 = self._make_layer(block, layers[1], 128, stride=2)
        self.layer3 = self._make_layer(block, layers[2], 256, stride=2)
        self.layer4 = self._make_layer(block, layers[3], 512, stride=2)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, 512)

    def _make_layer(self, block, blocks, out_channels, stride=1):
        """
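        Build one stage made of repeated blocks of the same type.

        Parameters
        ----------
        block: torch.nn.Module
            Block class to use (BasicBlock / BottleneckBlock).
        blocks: int
            Number of blocks in the stage.
        out_channels: int
            Number of output channels.
        stride: int
            Stride of the first block.

        Returns
        -------
        layers: torch.nn.Sequential
            The generated stage.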
        """
        layers = []
        layers.append(block(self.in_channels, out_channels, stride))
        self.in_channels = out_channels * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))

        return nn.Sequential(*layers)

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input data

        Returns
        -------
        x: torch.Tensor
            Features produced by ResNet, shape (batch_size, 512)
        """
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x


def ResNet18():
    """
    Build a ResNet18.
    """
    return ResNet(BasicBlock, [2, 2, 2, 2])


def ResNet50():
    """
    Build a ResNet50.
    """
    return ResNet(BottleneckBlock, [3, 4, 6, 3])

# --- End ResNet Model Definitions ---


# --- TextEncoderLSTM Definition (From earlier step, but not used now) ---
# The TextEncoderLSTM was defined in an earlier step but is not used in the VQAModel with BertModel.
# Keeping the definition here for completeness based on the task flow.
class TextEncoderLSTM(nn.Module):
    """
    Text encoder module (LSTM-based).
    """
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, text):
        embedded = self.embedding(text)  # (batch_size, sequence_length, embed_dim)
        _, (hidden, cell) = self.lstm(embedded)  # hidden: (1, batch_size, hidden_dim)
        return hidden.squeeze(0)  # (batch_size, hidden_dim)
# --- End TextEncoderLSTM Definition ---


# --- VQADataset Definition (Modified from earlier step) ---
from transformers import BertTokenizer

class VQADataset(torch.utils.data.Dataset):
    """
    Dataset class for the VQA data.
    """
    def __init__(self, df_path, image_dir, transform=None, answer=True):
        self.transform = transform  # image preprocessing
        self.image_dir = image_dir  # directory containing the image files
        self.df = pandas.read_json(df_path)  # DataFrame with image file name, question, and answers
        self.answer = answer

        # Initialize the BertTokenizer
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

        # The question vocabulary is managed by BertTokenizer, so these are unused
        self.question2idx = None
        self.idx2question = None

        self.answer2idx = {}
        self.idx2answer = {}

        if self.answer:
            # Add every answer string to the dictionary
            for answers in self.df["answers"]:
                for answer in answers:
                    word = answer["answer"]
                    word = process_text(word)
                    if word not in self.answer2idx:
                        self.answer2idx[word] = len(self.answer2idx)
            self.idx2answer = {v: k for k, v in self.answer2idx.items()}  # reverse lookup (index -> answer)

    def update_dict(self, dataset):
        """
        Replace this dataset's answer dictionaries with those of the training data.
        The BertTokenizer is shared, so only the answer dictionaries are updated.

        Parameters
        ----------
        dataset : Dataset
            Training Dataset
        """
        # question2idx / idx2question are managed by BertTokenizer and are not updated
        self.answer2idx = dataset.answer2idx
        self.idx2answer = dataset.idx2answer

    def __getitem__(self, idx):
        """
        Get the sample (image, question, answers) at index idx.

        Parameters
        ----------
        idx : int
            Index of the sample to fetch

        Returns
        -------
        image : torch.Tensor  (C, H, W)
            Image data
        input_ids : torch.Tensor (sequence_length)
            BERT token IDs
        attention_mask : torch.Tensor (sequence_length)
            BERT attention mask
        answers_for_vqa_criterion : torch.Tensor  (n_answer)
            Answer ids of the 10 annotators (for VQA evaluation); only returned when answer=True
        target_soft_labels : torch.Tensor (len(self.answer2idx))
            Soft labels, min(count / 3, 1) per answer; only returned when answer=True
        """
        image = Image.open(f"{self.image_dir}/{self.df['image'][idx]}")
        image = self.transform(image)

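        # Tokenize the question with BertTokenizer.
        # With padding='max_length' and no max_length argument, every question is padded
        # to the tokenizer's maximum length (512 tokens for bert-base-uncased).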
        encoded_question = self.tokenizer(
            process_text(self.df["question"][idx]),
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = encoded_question['input_ids'].squeeze(0)
        attention_mask = encoded_question['attention_mask'].squeeze(0)

        if self.answer:
            all_human_answers_text = [process_text(answer["answer"]) for answer in self.df["answers"][idx]]

            valid_human_answer_indices = [
                self.answer2idx[ans_text] for ans_text in all_human_answers_text if ans_text in self.answer2idx
            ]

            # Answer ids are integer class indices, so store them as long rather than float
            answers_for_vqa_criterion = torch.tensor(valid_human_answer_indices, dtype=torch.long)

            answer_counts_for_soft_labels = {}
            for ans_idx in valid_human_answer_indices:
                answer_counts_for_soft_labels[ans_idx] = answer_counts_for_soft_labels.get(ans_idx, 0) + 1

            target_soft_labels = torch.zeros(len(self.answer2idx))

            for ans_idx, count in answer_counts_for_soft_labels.items():
                score = min(count / 3, 1.0)
                target_soft_labels[ans_idx] = score

            return image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels

        else:
            return image, input_ids, attention_mask

    def __len__(self):
        return len(self.df)
# --- End VQADataset Definition ---


# --- VQAModel Definition (Modified from earlier step) ---
from transformers import BertModel

class VQAModel(nn.Module):
    """
    Example model for the VQA task.
    """
    def __init__(self, n_answer: int):
        """
        Constructor.

        Parameters
        ----------
        n_answer: int
            Number of output classes
        """
        super().__init__()
        # self.resnet = ResNet18()
        self.resnet = ResNet50()

        # Initialize BertModel
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Freeze the BERT parameters
        for param in self.bert.parameters():
            param.requires_grad = False

        # BERT's final-layer hidden size is 768.
        # The image feature (512) and text feature (768) are concatenated,
        # so the fusion layer takes 512 + 768 inputs.
        self.fc = nn.Sequential(
            nn.Linear(512 + 768, 512), # 512 (image) + 768 (BERT)
            nn.ReLU(inplace=True),
            nn.Linear(512, n_answer)
        )

    def forward(self, image, input_ids, attention_mask):
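        image_feature = self.resnet(image)  # image features, (batch_size, 512)

        # Text features from BERT.
        # last_hidden_state has shape (batch_size, sequence_length, hidden_size).
        # pooler_output is the [CLS] (first token) hidden state passed through
        # BERT's pooling layer (a linear layer followed by tanh), not the raw [CLS] state.
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        question_feature = bert_output.pooler_output # (batch_size, hidden_size=768)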

        x = torch.cat([image_feature, question_feature], dim=1)
        x = self.fc(x)

        return x
# --- End VQAModel Definition ---


# --- Initialization Code ---
set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

# dataloader / model
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

train_dataset = VQADataset(df_path="./data/train.json", image_dir="./data/train", transform=transform)
test_dataset = VQADataset(df_path="./data/valid.json", image_dir="./data/valid", transform=transform, answer=False)
test_dataset.update_dict(train_dataset)

# Adjust DataLoader to handle BERT tokenization's padding and different data types
# Using a custom collate_fn might be necessary for batching varying sequence lengths later,
# but BertTokenizer's padding='max_length' makes it straightforward for now.
# For BERT, input_ids and attention_mask are already padded to max_length by the tokenizer,
# so standard DataLoader collate works.

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

# VQAModel no longer needs vocab_size as an argument
model = VQAModel(n_answer=len(train_dataset.answer2idx)).to(device)

# optimizer / criterion
num_epoch = 4
criterion = nn.BCEWithLogitsLoss() # Changed to BCEWithLogitsLoss for soft labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)



## Train Model

### Subtask:
Execute the training loop for the specified number of epochs, using the updated `VQADataset` and `VQAModel`.


**Reasoning**:
Execute the training loop as described, iterating through epochs and calling the `train` function with the prepared data and model, then print the training metrics for each epoch.
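
The `train acc` value printed by this loop is the VQA accuracy from the task description, not plain top-1 accuracy. A minimal pure-Python sketch of that metric for a single sample follows; the function name `vqa_accuracy` and the example answers are illustrative assumptions, and the notebook's own `VQA_criterion` works on tensors of answer ids instead of strings.

```python
def vqa_accuracy(pred, answers):
    """Leave-one-out VQA accuracy for one sample (illustrative helper)."""
    total = 0.0
    for i in range(len(answers)):
        # Drop the i-th annotator and count matches among the remaining ones.
        others = answers[:i] + answers[i + 1:]
        matches = sum(1 for a in others if a == pred)
        total += min(matches / 3, 1.0)
    return total / len(answers)


answers = ["yes"] * 2 + ["no"] * 8  # hypothetical answers from 10 annotators
print(vqa_accuracy("no", answers))
print(vqa_accuracy("yes", answers))
print(vqa_accuracy("maybe", answers))
```

A prediction shared by only two annotators still earns partial credit, which is why this number can differ noticeably from accuracy against the single most frequent answer.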



In [None]:
# train model
for epoch in range(num_epoch):
    train_loss, train_acc, train_simple_acc, train_time = train(model, train_loader, optimizer, criterion, device)
    print(f"【{epoch + 1}/{num_epoch}】\n"
            f"train time: {train_time:.2f} [s]\n"
            f"train loss: {train_loss:.4f}\n"
            f"train acc: {train_acc:.4f}\n"
            f"train simple acc: {train_simple_acc:.4f}")

NameError: name 'time' is not defined

**Reasoning**:
The previous code failed with a `NameError: name 'time' is not defined` because the `time` module was not imported. I need to add `import time` at the beginning of the code block where the `train` and `eval` functions are defined to resolve this.
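
The `train` and `eval` functions read the clock before and after their loop, so the `time` module has to be imported in the namespace where they are defined. A minimal sketch of that timing pattern, using a hypothetical `timed` helper and a stand-in workload; `time.perf_counter` is an assumption here (the notebook uses `time.time()`), chosen because it is a monotonic clock intended for measuring intervals.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds). Illustrative helper."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in for one training epoch.
result, elapsed = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed:.4f} [s]")
```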



In [None]:
import time # Added import for the time module

def train(model, dataloader, optimizer, criterion, device):
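    """
    Train the model for one epoch.

    Parameters
    ----------
    model: torch.nn.Module
        Model to train
    dataloader: torch.utils.data.DataLoader
        Data loader used for training
    optimizer: torch.optim.Optimizer
        Optimizer
    criterion: torch.nn.Module
        Loss function
    device: torch.device
        Device used for training

    Returns
    -------
    total_loss: float
        Mean loss per batch
    total_acc: float
        Mean VQA accuracy per batch
    simple_acc: float
        Accuracy against the most frequent answer (differs from the VQA metric); not computed here, always 0.0
    time: float
        Time taken for one epoch of training (sec)
    """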
    model.train()

    total_loss = 0
    total_acc = 0

    start = time.time()
    for image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels in dataloader:
        image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels = \
            image.to(device), input_ids.to(device), attention_mask.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

        pred = model(image, input_ids, attention_mask)
        loss = criterion(pred, target_soft_labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)  # VQA accuracy

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start


def eval(model, dataloader, criterion, device):
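    """
    Evaluate the model for one epoch.

    Parameters
    ----------
    model: torch.nn.Module
        Model to evaluate
    dataloader: torch.utils.data.DataLoader
        Data loader used for evaluation
    criterion: torch.nn.Module
        Loss function
    device: torch.device
        Device to use

    Returns
    -------
    total_loss: float
        Mean loss per batch
    total_acc: float
        Mean VQA accuracy per batch
    simple_acc: float
        Accuracy against the most frequent answer (differs from the VQA metric); not computed here, always 0.0
    time: float
        Time taken for one epoch of evaluation (sec)
    """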
    model.eval()

    total_loss = 0
    total_acc = 0

    start = time.time()
    with torch.no_grad():
        for image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels in dataloader:
            image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels = \
                image.to(device), input_ids.to(device), attention_mask.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

            pred = model(image, input_ids, attention_mask)
            loss = criterion(pred, target_soft_labels)

            total_loss += loss.item()
            total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)  # VQA accuracy

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start

**Reasoning**:
The `train` and `eval` functions have been correctly redefined with the `time` module imported. Now I need to execute the training loop for the specified number of epochs, calling the `train` function and printing the metrics as instructed.
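
Inside `train` and `eval`, `criterion(pred, target_soft_labels)` applies `nn.BCEWithLogitsLoss` to raw logits and soft labels in [0, 1]. For intuition, here is a pure-Python sketch of the per-element loss in its usual numerically stable form; the helper name `bce_with_logits` is hypothetical, and PyTorch additionally averages over all elements by default.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, stable for large |logit|."""
    # Equivalent to -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# A soft label of 2/3 corresponds to an answer given by two annotators.
for logit in (-2.0, 0.0, 2.0):
    print(logit, round(bce_with_logits(logit, 2 / 3), 4))
```

Because each answer is scored independently, several answers can receive a high target at once, which matches the soft labels better than a single-class cross-entropy would.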



In [None]:
# train model
for epoch in range(num_epoch):
    train_loss, train_acc, train_simple_acc, train_time = train(model, train_loader, optimizer, criterion, device)
    print(f"【{epoch + 1}/{num_epoch}】\n"
            f"train time: {train_time:.2f} [s]\n"
            f"train loss: {train_loss:.4f}\n"
            f"train acc: {train_acc:.4f}\n"
            f"train simple acc: {train_simple_acc:.4f}")

NameError: name 'Image' is not defined

**Reasoning**:
The `NameError: name 'Image' is not defined` occurred because the `Image` class from the `PIL` library was not imported before it was used in the `VQADataset`'s `__getitem__` method. I will add `from PIL import Image` to the comprehensive code block that defines the helper functions and classes before the training loop initialization to resolve this.
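
Both failures so far were `NameError`s for names the cell never imported. One way to catch them before starting a long run is to compare the names a cell relies on against the current namespace. A minimal sketch, where `missing_names` and the list of required names are illustrative assumptions:

```python
def missing_names(required, namespace):
    """Return the names from `required` that are absent from `namespace`."""
    return [name for name in required if name not in namespace]


# In the notebook one would pass globals(); a plain dict stands in for it here.
namespace = {"torch": object(), "time": object()}
print(missing_names(["torch", "time", "Image", "pandas"], namespace))
```

With the stand-in namespace above, the call reports `Image` and `pandas` as missing.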



In [None]:
import random
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import pandas # Added pandas import
import re # Added re import for process_text
from PIL import Image # Added Image import for VQADataset

# --- ResNet Model Definitions (Copied from original notebook cells q8bOkolIbBBg and ptR4SOabbG1-) ---
class BasicBlock(nn.Module):
    """
    ResNet basic block.
    """
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        """
        Constructor.

        Parameters
        ----------
        in_channels: int
            Number of input channels
        out_channels: int
            Number of output channels
        stride: int
            Stride
        """
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input to the block

        Returns
        -------
        out: torch.Tensor
            Output of the block
        """
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        out += self.shortcut(residual)
        out = self.relu(out)

        return out


class BottleneckBlock(nn.Module):
    """
    ResNet bottleneck block.
    """
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1):
        """
        Constructor.

        Parameters
        ----------
        in_channels: int
            Number of input channels
        out_channels: int
            Number of bottleneck channels; the block outputs out_channels * expansion
        stride: int
            Stride
        """
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, kernel_size=1, stride=1)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * self.expansion, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels * self.expansion)
            )

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input to the block

        Returns
        -------
        out: torch.Tensor
            Output of the block
        """
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        out += self.shortcut(residual)
        out = self.relu(out)

        return out


class ResNet(nn.Module):
    """
    ResNet implementation.
    """
    def __init__(self, block, layers):
        """
        Constructor.

        Parameters
        ----------
        block: type[torch.nn.Module]
            Block class to use (BasicBlock / BottleneckBlock)
        layers: list[int]
            Number of blocks in each of the four stages
        """
        super().__init__()
        self.in_channels = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = self._make_layer(block, layers[0], 64)
        self.layer2 = self._make_layer(block, layers[1], 128, stride=2)
        self.layer3 = self._make_layer(block, layers[2], 256, stride=2)
        self.layer4 = self._make_layer(block, layers[3], 512, stride=2)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, 512)

    def _make_layer(self, block, blocks, out_channels, stride=1):
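        """
        Build one stage made of repeated blocks.

        Parameters
        ----------
        block: type[torch.nn.Module]
            Block class to use (BasicBlock / BottleneckBlock)
        blocks: int
            Number of blocks in the stage
        out_channels: int
            Number of output channels (before multiplying by block.expansion)
        stride: int
            Stride of the first block

        Returns
        -------
        layers: torch.nn.Sequential
            The generated stage
        """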
        layers = []
        layers.append(block(self.in_channels, out_channels, stride))
        self.in_channels = out_channels * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))

        return nn.Sequential(*layers)

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input data

        Returns
        -------
        x: torch.Tensor
            Features produced by ResNet, shape (batch_size, 512)
        """
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x


def ResNet18():
    """
    Build a ResNet18.
    """
    return ResNet(BasicBlock, [2, 2, 2, 2])


def ResNet50():
    """
    Build a ResNet50.
    """
    return ResNet(BottleneckBlock, [3, 4, 6, 3])

# --- End ResNet Model Definitions ---


# --- TextEncoderLSTM Definition (From earlier step, but not used now) ---
# The TextEncoderLSTM was defined in an earlier step but is not used in the VQAModel with BertModel.
# Keeping the definition here for completeness based on the task flow.
class TextEncoderLSTM(nn.Module):
    """
    Text encoder module (LSTM-based).
    """
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, text):
        embedded = self.embedding(text)  # (batch_size, sequence_length, embed_dim)
        _, (hidden, cell) = self.lstm(embedded)  # hidden: (1, batch_size, hidden_dim)
        return hidden.squeeze(0)  # (batch_size, hidden_dim)
# --- End TextEncoderLSTM Definition ---


# --- VQADataset Definition (Modified from earlier step) ---
from transformers import BertTokenizer

class VQADataset(torch.utils.data.Dataset):
    """
    Dataset class for the VQA data.
    """
    def __init__(self, df_path, image_dir, transform=None, answer=True):
        self.transform = transform  # image preprocessing
        self.image_dir = image_dir  # directory containing the image files
        self.df = pandas.read_json(df_path)  # DataFrame with image file name, question, and answers
        self.answer = answer

        # Initialize the BertTokenizer
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

        # The question vocabulary is managed by BertTokenizer, so these are unused
        self.question2idx = None
        self.idx2question = None

        self.answer2idx = {}
        self.idx2answer = {}

        if self.answer:
            # Add every answer string to the dictionary
            for answers in self.df["answers"]:
                for answer in answers:
                    word = answer["answer"]
                    word = process_text(word)
                    if word not in self.answer2idx:
                        self.answer2idx[word] = len(self.answer2idx)
            self.idx2answer = {v: k for k, v in self.answer2idx.items()}  # reverse lookup (index -> answer)

    def update_dict(self, dataset):
        """
        Replace this dataset's answer dictionaries with those of the training data.
        The BertTokenizer is shared, so only the answer dictionaries are updated.

        Parameters
        ----------
        dataset : Dataset
            Training Dataset
        """
        # question2idx / idx2question are managed by BertTokenizer and are not updated
        self.answer2idx = dataset.answer2idx
        self.idx2answer = dataset.idx2answer

    def __getitem__(self, idx):
        """
        Get the sample (image, question, answers) at index idx.

        Parameters
        ----------
        idx : int
            Index of the sample to fetch

        Returns
        -------
        image : torch.Tensor  (C, H, W)
            Image data
        input_ids : torch.Tensor (sequence_length)
            BERT token IDs
        attention_mask : torch.Tensor (sequence_length)
            BERT attention mask
        answers_for_vqa_criterion : torch.Tensor  (n_answer)
            Answer ids of the 10 annotators (for VQA evaluation); only returned when answer=True
        target_soft_labels : torch.Tensor (len(self.answer2idx))
            Soft labels, min(count / 3, 1) per answer; only returned when answer=True
        """
        image = Image.open(f"{self.image_dir}/{self.df['image'][idx]}")
        image = self.transform(image)

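        # Tokenize the question with BertTokenizer.
        # With padding='max_length' and no max_length argument, every question is padded
        # to the tokenizer's maximum length (512 tokens for bert-base-uncased).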
        encoded_question = self.tokenizer(
            process_text(self.df["question"][idx]),
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = encoded_question['input_ids'].squeeze(0)
        attention_mask = encoded_question['attention_mask'].squeeze(0)

        if self.answer:
            all_human_answers_text = [process_text(answer["answer"]) for answer in self.df["answers"][idx]]

            valid_human_answer_indices = [
                self.answer2idx[ans_text] for ans_text in all_human_answers_text if ans_text in self.answer2idx
            ]

            # Answer ids are integer class indices, so store them as long rather than float
            answers_for_vqa_criterion = torch.tensor(valid_human_answer_indices, dtype=torch.long)

            answer_counts_for_soft_labels = {}
            for ans_idx in valid_human_answer_indices:
                answer_counts_for_soft_labels[ans_idx] = answer_counts_for_soft_labels.get(ans_idx, 0) + 1

            target_soft_labels = torch.zeros(len(self.answer2idx))

            for ans_idx, count in answer_counts_for_soft_labels.items():
                score = min(count / 3, 1.0)
                target_soft_labels[ans_idx] = score

            return image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels

        else:
            return image, input_ids, attention_mask

    def __len__(self):
        return len(self.df)
# --- End VQADataset Definition ---


# --- VQAModel Definition (Modified from earlier step) ---
from transformers import BertModel

class VQAModel(nn.Module):
    """
    Example model for the VQA task.
    """
    def __init__(self, n_answer: int):
        """
        Constructor.

        Parameters
        ----------
        n_answer: int
            Number of output classes
        """
        super().__init__()
        # self.resnet = ResNet18()
        self.resnet = ResNet50()

        # Initialize BertModel
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Freeze the BERT parameters
        for param in self.bert.parameters():
            param.requires_grad = False

        # BERT's final-layer hidden size is 768.
        # The image feature (512) and text feature (768) are concatenated,
        # so the fusion layer takes 512 + 768 inputs.
        self.fc = nn.Sequential(
            nn.Linear(512 + 768, 512), # 512 (image) + 768 (BERT)
            nn.ReLU(inplace=True),
            nn.Linear(512, n_answer)
        )

    def forward(self, image, input_ids, attention_mask):
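        image_feature = self.resnet(image)  # image features, (batch_size, 512)

        # Text features from BERT.
        # last_hidden_state has shape (batch_size, sequence_length, hidden_size).
        # pooler_output is the [CLS] (first token) hidden state passed through
        # BERT's pooling layer (a linear layer followed by tanh), not the raw [CLS] state.
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        question_feature = bert_output.pooler_output # (batch_size, hidden_size=768)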

        x = torch.cat([image_feature, question_feature], dim=1)
        x = self.fc(x)

        return x
# --- End VQAModel Definition ---


def VQA_criterion(batch_pred, batch_answers):
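    """
    Evaluation function for the VQA task.
    Each answer is left out in turn and min(matches among the rest / 3, 1) is averaged;
    the division by 10 assumes 10 answers per sample.
    """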
    total_acc = 0.

    for pred, answers in zip(batch_pred, batch_answers):
        acc = 0.
        for i in range(len(answers)):
            num_match = 0
            for j in range(len(answers)):
                if i == j:
                    continue
                if pred == answers[j]:
                    num_match += 1
            acc += min(num_match / 3, 1)
        total_acc += acc / 10

    return total_acc / len(batch_pred)

# --- Training and Evaluation Functions (Modified from earlier step) ---
import time # Added import for the time module

def train(model, dataloader, optimizer, criterion, device):
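    """
    Train the model for one epoch.

    Parameters
    ----------
    model: torch.nn.Module
        Model to train
    dataloader: torch.utils.data.DataLoader
        Data loader used for training
    optimizer: torch.optim.Optimizer
        Optimizer
    criterion: torch.nn.Module
        Loss function
    device: torch.device
        Device used for training

    Returns
    -------
    total_loss: float
        Mean loss per batch
    total_acc: float
        Mean VQA accuracy per batch
    simple_acc: float
        Accuracy against the most frequent answer (differs from the VQA metric); not computed here, always 0.0
    time: float
        Time taken for one epoch of training (sec)
    """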
    model.train()

    total_loss = 0
    total_acc = 0

    start = time.time()
    for image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels in dataloader:
        image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels = \
            image.to(device), input_ids.to(device), attention_mask.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

        pred = model(image, input_ids, attention_mask)
        loss = criterion(pred, target_soft_labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)  # VQA accuracy

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start


def eval(model, dataloader, criterion, device):
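    """
    Evaluate the model for one epoch.

    Parameters
    ----------
    model: torch.nn.Module
        Model to evaluate
    dataloader: torch.utils.data.DataLoader
        Data loader used for evaluation
    criterion: torch.nn.Module
        Loss function
    device: torch.device
        Device to use

    Returns
    -------
    total_loss: float
        Mean loss per batch
    total_acc: float
        Mean VQA accuracy per batch
    simple_acc: float
        Accuracy against the most frequent answer (differs from the VQA metric); not computed here, always 0.0
    time: float
        Time taken for one epoch of evaluation (sec)
    """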
    model.eval()

    total_loss = 0
    total_acc = 0

    start = time.time()
    with torch.no_grad():
        for image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels in dataloader:
            image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels = \
                image.to(device), input_ids.to(device), attention_mask.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

            pred = model(image, input_ids, attention_mask)
            loss = criterion(pred, target_soft_labels)

            total_loss += loss.item()
            total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)  # VQA accuracy

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start
# --- End Training and Evaluation Functions ---


# --- Initialization Code ---
set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

# dataloader / model
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

train_dataset = VQADataset(df_path="./data/train.json", image_dir="./data/train", transform=transform)
test_dataset = VQADataset(df_path="./data/valid.json", image_dir="./data/valid", transform=transform, answer=False)
test_dataset.update_dict(train_dataset)

# Adjust DataLoader to handle BERT tokenization's padding and different data types
# Using a custom collate_fn might be necessary for batching varying sequence lengths later,
# but BertTokenizer's padding='max_length' makes it straightforward for now.
# For BERT, input_ids and attention_mask are already padded to max_length by the tokenizer,
# so standard DataLoader collate works.

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

# VQAModel no longer needs vocab_size as an argument
model = VQAModel(n_answer=len(train_dataset.answer2idx)).to(device)

# optimizer / criterion
num_epoch = 4
criterion = nn.BCEWithLogitsLoss() # Changed to BCEWithLogitsLoss for soft labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)


**Reasoning**:
The previous code failed with `NameError: name 'Image' is not defined` because the `Image` class from the `PIL` library was not imported. The provided code block now includes `from PIL import Image` at the top, along with all other necessary imports and definitions (`random`, `numpy`, `torch`, `nn`, `transforms`, `pandas`, `re`, `time`, and all class/function definitions). Executing this block will ensure all dependencies are met and the `VQAModel` and `VQADataset` are correctly set up before the training loop.
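
For reference, the soft labels that `VQADataset.__getitem__` builds for `BCEWithLogitsLoss` assign each distinct answer the score min(count / 3, 1). A minimal pure-Python sketch keyed by answer text instead of answer index; `soft_labels` and the sample answers are illustrative assumptions:

```python
from collections import Counter


def soft_labels(answers):
    """Map each distinct answer to min(count / 3, 1.0)."""
    counts = Counter(answers)
    return {answer: min(count / 3, 1.0) for answer, count in counts.items()}


answers = ["no"] * 7 + ["yes"] * 2 + ["unanswerable"]  # hypothetical annotations
for answer, score in soft_labels(answers).items():
    print(f"{answer}: {score:.3f}")
```

Any answer given by three or more annotators saturates at 1.0, so the target vector can contain several full-score entries for the same question.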



In [None]:
import random
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import pandas # Added pandas import
import re # Added re import for process_text
from PIL import Image # Added Image import for VQADataset

# --- ResNet Model Definitions (Copied from original notebook cells q8bOkolIbBBg and ptR4SOabbG1-) ---
class BasicBlock(nn.Module):
    """
    ResNet basic block.
    """
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        """
        Constructor.

        Parameters
        ----------
        in_channels: int
            Number of input channels
        out_channels: int
            Number of output channels
        stride: int
            Stride
        """
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input to the block

        Returns
        -------
        out: torch.Tensor
            Output of the block
        """
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        out += self.shortcut(residual)
        out = self.relu(out)

        return out


class BottleneckBlock(nn.Module):
    """
    ResNet bottleneck block.
    """
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1):
        """
        Constructor.

        Parameters
        ----------
        in_channels: int
            Number of input channels
        out_channels: int
            Number of channels inside the block (the output has out_channels * expansion)
        stride: int
            Stride
        """
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, kernel_size=1, stride=1)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * self.expansion, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels * self.expansion)
            )

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input to the block

        Returns
        -------
        out: torch.Tensor
            Output of the block
        """
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        out += self.shortcut(residual)
        out = self.relu(out)

        return out


class ResNet(nn.Module):
    """
    ResNet implementation.
    """
    def __init__(self, block, layers):
        """
        Constructor.

        Parameters
        ----------
        block: torch.nn.Module
            Block class to use (BasicBlock / BottleneckBlock)
        layers: list
            Number of blocks in each stage
        """
        super().__init__()
        self.in_channels = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = self._make_layer(block, layers[0], 64)
        self.layer2 = self._make_layer(block, layers[1], 128, stride=2)
        self.layer3 = self._make_layer(block, layers[2], 256, stride=2)
        self.layer4 = self._make_layer(block, layers[3], 512, stride=2)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, 512)

    def _make_layer(self, block, blocks, out_channels, stride=1):
        """
        Build one stage made of repeated blocks.

        Parameters
        ----------
        block: torch.nn.Module
            Block class to use (BasicBlock / BottleneckBlock)
        blocks: int
            Number of blocks in the stage
        out_channels: int
            Number of output channels
        stride: int
            Stride

        Returns
        -------
        layers: torch.nn.Sequential
            The generated stage
        """
        layers = []
        layers.append(block(self.in_channels, out_channels, stride))
        self.in_channels = out_channels * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))

        return nn.Sequential(*layers)

    def forward(self, x):
        """
        Forward pass.

        Parameters
        ----------
        x: torch.Tensor
            Input data

        Returns
        -------
        x: torch.Tensor
            Features produced by the ResNet
        """
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x


def ResNet18():
    """
    Build a ResNet18.
    """
    return ResNet(BasicBlock, [2, 2, 2, 2])


def ResNet50():
    """
    Build a ResNet50.
    """
    return ResNet(BottleneckBlock, [3, 4, 6, 3])

# --- End ResNet Model Definitions ---


# --- TextEncoderLSTM Definition (From earlier step, but not used now) ---
# The TextEncoderLSTM was defined in an earlier step but is not used in the VQAModel with BertModel.
# Keeping the definition here for completeness based on the task flow.
class TextEncoderLSTM(nn.Module):
    """
    Text encoder module (LSTM-based).
    """
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, text):
        embedded = self.embedding(text)  # (batch_size, sequence_length, embed_dim)
        _, (hidden, cell) = self.lstm(embedded)  # hidden: (1, batch_size, hidden_dim)
        return hidden.squeeze(0)  # (batch_size, hidden_dim)
# --- End TextEncoderLSTM Definition ---


# --- VQADataset Definition (Modified from earlier step) ---
from transformers import BertTokenizer

class VQADataset(torch.utils.data.Dataset):
    """
    Dataset class for the VQA data.
    """
    def __init__(self, df_path, image_dir, transform=None, answer=True):
        self.transform = transform  # image preprocessing
        self.image_dir = image_dir  # directory containing the image files
        self.df = pandas.read_json(df_path)  # DataFrame with image file path, question, and answers
        self.answer = answer

        # Initialize the BertTokenizer
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

        # question2idx and idx2question are unused because BertTokenizer handles the question vocabulary
        self.question2idx = None
        self.idx2question = None

        self.answer2idx = {}
        self.idx2answer = {}

        if self.answer:
            # Add every answer string to the dictionary
            for answers in self.df["answers"]:
                for answer in answers:
                    word = answer["answer"]
                    word = process_text(word)
                    if word not in self.answer2idx:
                        self.answer2idx[word] = len(self.answer2idx)
            self.idx2answer = {v: k for k, v in self.answer2idx.items()}  # reverse lookup (id -> answer)

    def update_dict(self, dataset):
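        """
        Replace this dataset's dictionaries (validation / test) with those of the training data.
        The BertTokenizer is shared, so only the answer dictionaries are updated.

        Parameters
        ----------
        dataset : Dataset
            Training Dataset
        """
        # question2idx and idx2question are handled by BertTokenizer, so they are not updated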
        self.answer2idx = dataset.answer2idx
        self.idx2answer = dataset.idx2answer

    def __getitem__(self, idx):
        """
        Return the sample (image, question, answers) at the given index.

        Parameters
        ----------
        idx : int
            Index of the sample to fetch

        Returns
        -------
        image : torch.Tensor  (C, H, W)
            Image data
        input_ids : torch.Tensor (sequence_length)
            BERT token ids
        attention_mask : torch.Tensor (sequence_length)
            BERT attention mask
        answers_for_vqa_criterion : torch.Tensor  (n_answer)
            Ids of the answers given by the 10 annotators (for VQA evaluation)
        target_soft_labels : torch.Tensor (len(self.answer2idx))
            Soft labels (VQA scores)
        """
        image = Image.open(f"{self.image_dir}/{self.df['image'][idx]}")
        image = self.transform(image)

        # Tokenize the question with BertTokenizer.
        # With padding='max_length' and no explicit max_length, the tokenizer pads
        # every question to its model maximum (512 tokens for bert-base-uncased).
        encoded_question = self.tokenizer(
            process_text(self.df["question"][idx]),
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = encoded_question['input_ids'].squeeze(0)
        attention_mask = encoded_question['attention_mask'].squeeze(0)

        if self.answer:
            all_human_answers_text = [process_text(answer["answer"]) for answer in self.df["answers"][idx]]

            valid_human_answer_indices = [
                self.answer2idx[ans_text] for ans_text in all_human_answers_text if ans_text in self.answer2idx
            ]

            # Answer ids are integer indices, so store them as long
            answers_for_vqa_criterion = torch.tensor(valid_human_answer_indices, dtype=torch.long)

            answer_counts_for_soft_labels = {}
            for ans_idx in valid_human_answer_indices:
                answer_counts_for_soft_labels[ans_idx] = answer_counts_for_soft_labels.get(ans_idx, 0) + 1

            target_soft_labels = torch.zeros(len(self.answer2idx))

            for ans_idx, count in answer_counts_for_soft_labels.items():
                score = min(count / 3, 1.0)
                target_soft_labels[ans_idx] = score

            return image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels

        else:
            return image, input_ids, attention_mask

    def __len__(self):
        return len(self.df)
# --- End VQADataset Definition ---


# --- VQAModel Definition (Modified from earlier step) ---
from transformers import BertModel

class VQAModel(nn.Module):
    """
    Example model for the VQA task.
    """
    def __init__(self, n_answer: int):
        """
        Constructor.

        Parameters
        ----------
        n_answer: int
            Number of output classes
        """
        super().__init__()
        # self.resnet = ResNet18()
        self.resnet = ResNet50()

        # Initialize the BertModel
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Freeze the BERT parameters
        for param in self.bert.parameters():
            param.requires_grad = False

        # BERT's final hidden size is 768.
        # The image feature (512) and text feature (768) are concatenated,
        # so the fusion layer takes 512 + 768 inputs.
        self.fc = nn.Sequential(
            nn.Linear(512 + 768, 512), # 512 (image) + 768 (BERT)
            nn.ReLU(inplace=True),
            nn.Linear(512, n_answer)
        )

    def forward(self, image, input_ids, attention_mask):
        image_feature = self.resnet(image)  # image feature

        # Text feature from BERT
        # last_hidden_state has shape (batch_size, sequence_length, hidden_size)
        # pooler_output is the [CLS] (first-token) hidden state passed through BERT's pooling layer (linear + tanh)
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        question_feature = bert_output.pooler_output # (batch_size, hidden_size=768)

        x = torch.cat([image_feature, question_feature], dim=1)
        x = self.fc(x)

        return x
# --- End VQAModel Definition ---


def VQA_criterion(batch_pred, batch_answers):
    """
    Evaluation function used for the VQA task.
    """
    total_acc = 0.

    for pred, answers in zip(batch_pred, batch_answers):
        acc = 0.
        for i in range(len(answers)):
            num_match = 0
            for j in range(len(answers)):
                if i == j:
                    continue
                if pred == answers[j]:
                    num_match += 1
            acc += min(num_match / 3, 1)
        total_acc += acc / 10

    return total_acc / len(batch_pred)

# --- Training and Evaluation Functions (Modified from earlier step) ---
import time # Added import for the time module

def train(model, dataloader, optimizer, criterion, device):
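    """
    Training function.

    Parameters
    ----------
    model: torch.nn.Module
        Model to train
    dataloader: torch.utils.data.DataLoader
        Data loader used for training
    optimizer: torch.optim.Optimizer
        Optimizer
    criterion: torch.nn.Module
        Loss function
    device: torch.device
        Device used for training

    Returns
    -------
    total_loss: float
        Mean loss
    total_acc: float
        Mean VQA accuracy
    simple_acc: float
        Accuracy against the most frequent answer (differs from the VQA metric);
        this implementation always returns 0.0 as a placeholder
    time: float
        Time taken to train one epoch (sec)
    """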
    model.train()

    total_loss = 0
    total_acc = 0

    start = time.time()
    for image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels in dataloader:
        image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels = \
            image.to(device), input_ids.to(device), attention_mask.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

        pred = model(image, input_ids, attention_mask)
        loss = criterion(pred, target_soft_labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)  # VQA accuracy

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start


def eval(model, dataloader, criterion, device):
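    """
    Evaluation function.

    Parameters
    ----------
    model: torch.nn.Module
        Model
    dataloader: torch.utils.data.DataLoader
        Data loader used for evaluation
    criterion: torch.nn.Module
        Loss function
    device: torch.device
        Device to use

    Returns
    -------
    total_loss: float
        Mean loss
    total_acc: float
        Mean VQA accuracy
    simple_acc: float
        Accuracy against the most frequent answer (differs from the VQA metric);
        this implementation always returns 0.0 as a placeholder
    time: float
        Time taken to evaluate one epoch (sec)
    """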
    model.eval()

    total_loss = 0
    total_acc = 0

    start = time.time()
    with torch.no_grad():
        for image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels in dataloader:
            image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels = \
                image.to(device), input_ids.to(device), attention_mask.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

            pred = model(image, input_ids, attention_mask)
            loss = criterion(pred, target_soft_labels)

            total_loss += loss.item()
            total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)  # VQA accuracy

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start
# --- End Training and Evaluation Functions ---


# --- Initialization Code ---
set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

# dataloader / model
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

train_dataset = VQADataset(df_path="./data/train.json", image_dir="./data/train", transform=transform)
test_dataset = VQADataset(df_path="./data/valid.json", image_dir="./data/valid", transform=transform, answer=False)
test_dataset.update_dict(train_dataset)

# Adjust DataLoader to handle BERT tokenization's padding and different data types
# Using a custom collate_fn might be necessary for batching varying sequence lengths later,
# but BertTokenizer's padding='max_length' makes it straightforward for now.
# For BERT, input_ids and attention_mask are already padded to max_length by the tokenizer,
# so standard DataLoader collate works.

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

# VQAModel no longer needs vocab_size as an argument
model = VQAModel(n_answer=len(train_dataset.answer2idx)).to(device)

# optimizer / criterion
num_epoch = 4
criterion = nn.BCEWithLogitsLoss() # Changed to BCEWithLogitsLoss for soft labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)


**Reasoning**:
All necessary modules and functions have been defined and initialized in the previous code block. Now I can proceed with executing the training loop as instructed.
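
The `train acc` value printed by the loop comes from `VQA_criterion`, which scores each prediction against the human answers with one annotator left out at a time. As a minimal sketch for a single sample, assuming plain answer strings (the helper name `vqa_accuracy` and the example answers are hypothetical, and the sketch divides by `len(answers)` where the notebook hard-codes 10):

```python
def vqa_accuracy(pred, answers):
    """Leave-one-out VQA accuracy of one prediction against its human answers."""
    total = 0.0
    for i in range(len(answers)):
        # Drop annotator i and count how many of the remaining answers match.
        others = answers[:i] + answers[i + 1:]
        num_match = sum(1 for a in others if a == pred)
        total += min(num_match / 3, 1.0)
    return total / len(answers)


# Hypothetical example: ten annotator answers for one question.
example_answers = ["yes"] * 4 + ["no"] * 2 + ["unanswerable"] * 4
acc_yes = vqa_accuracy("yes", example_answers)
acc_no = vqa_accuracy("no", example_answers)
acc_other = vqa_accuracy("maybe", example_answers)
print(acc_yes, acc_no, acc_other)
```

A prediction shared by four annotators still has three matches after any single annotator is removed, so it scores 1.0; a prediction shared by only two annotators scores 0.6.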



In [None]:
# train model
for epoch in range(num_epoch):
    train_loss, train_acc, train_simple_acc, train_time = train(model, train_loader, optimizer, criterion, device)
    print(f"【{epoch + 1}/{num_epoch}】\n"
            f"train time: {train_time:.2f} [s]\n"
            f"train loss: {train_loss:.4f}\n"
            f"train acc: {train_acc:.4f}\n"
            f"train simple acc: {train_simple_acc:.4f}")

## Make Submission File

### Subtask:
Generate the submission file (`submission.npy`) and save the trained model's state dictionary (`model.pt`) after the training is complete.


**Reasoning**:
The training loop has completed, and now I need to generate the submission file and save the trained model. This involves putting the model in evaluation mode, making predictions on the test dataset, converting the predicted answer IDs to strings, and then saving the submission and the model's state dictionary.
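
The conversion from predicted answer ids to strings can be sketched without PyTorch. This is an illustration only: plain lists stand in for the model's logits, and `decode_predictions`, `toy_idx2answer`, and `toy_scores` are hypothetical names, not part of the notebook.

```python
def decode_predictions(score_rows, idx2answer):
    """Pick the highest-scoring answer index per row and map it to its string."""
    decoded = []
    for scores in score_rows:
        best_idx = max(range(len(scores)), key=lambda i: scores[i])
        decoded.append(idx2answer[best_idx])
    return decoded


# Hypothetical vocabulary and scores, standing in for idx2answer and model logits.
toy_idx2answer = {0: "yes", 1: "no", 2: "unanswerable"}
toy_scores = [[0.1, 2.3, -0.5], [1.7, 0.2, 0.9]]
decoded = decode_predictions(toy_scores, toy_idx2answer)
print(decoded)
```

In the notebook the same two steps are `pred.argmax(1)` followed by a lookup in `train_dataset.idx2answer`.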



In [None]:
model.eval()
submission = []
with torch.no_grad():  # inference only, so skip gradient tracking
    for image, input_ids, attention_mask in test_loader:
        image, input_ids, attention_mask = image.to(device), input_ids.to(device), attention_mask.to(device)
        pred = model(image, input_ids, attention_mask)
        pred = pred.argmax(1).cpu().item()
        submission.append(pred)

# Map predicted answer ids back to answer strings
submission = [train_dataset.idx2answer[idx] for idx in submission]
submission = np.array(submission)
torch.save(model.state_dict(), "model.pt")
np.save("submission.npy", submission)

NameError: name 'model' is not defined

**Reasoning**:
The previous code failed because `model` was not defined in the current execution context; the same holds for the other names the cell depends on (`train_dataset`, `test_loader`, `device`, `np`, `torch`, `VQADataset`, `VQAModel`, `VQA_criterion`, `train`, and so on). This typically happens when the preceding cells were not executed or the kernel was reset. To make submission generation runnable, I will consolidate the imports, definitions, initialization, training loop, and submission step into a single block, so that `model` is defined and trained before it is used.
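
A quick way to confirm this kind of failure before running a dependent cell is to check which required names are missing from the namespace. A minimal sketch, assuming a hypothetical helper `missing_names` and a toy dictionary in place of the notebook's real `globals()`:

```python
def missing_names(required, namespace):
    """Return the names in `required` that are absent from `namespace`."""
    return [name for name in required if name not in namespace]


# Hypothetical namespace standing in for globals() after a kernel reset.
fresh_namespace = {"np": object(), "torch": object()}
missing = missing_names(["np", "torch", "model", "test_loader"], fresh_namespace)
print(missing)
```

In a notebook, calling `missing_names([...], globals())` at the top of the submission cell would list everything that still needs to be defined.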



In [None]:
import random
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import pandas # Added pandas import
import re # Added re import for process_text
from PIL import Image # Added Image import for VQADataset
from transformers import BertTokenizer, BertModel # Added for BERT
import time # Added import for the time module

# --- Helper Functions ---
def set_seed(seed):
    """
    Fix the random seeds.

    Parameters
    ----------
    seed : int
        Seed value used for random number generation.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

def process_text(text):
    """
    Normalize the format of a question or an answer.

    Parameters
    ----------
    text : str
        Question or answer text.
    """
    # lowercase
    text = text.lower()

    # Convert number words to digits.
    # Note: str.replace matches substrings, so "phone" becomes "ph1".
    num_word_to_digit = {
        'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
        'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9',
        'ten': '10'
    }
    for word, digit in num_word_to_digit.items():
        text = text.replace(word, digit)

    # Remove periods that have no adjacent digit (decimal points are kept)
    text = re.sub(r'(?<!\d)\.(?!\d)', '', text)

    # Remove articles
    text = re.sub(r'\b(a|an|the)\b', '', text)

    # Restore apostrophes in contractions
    contractions = {
        "dont": "don't", "isnt": "isn't", "arent": "aren't", "wont": "won't",
        "cant": "can't", "wouldnt": "wouldn't", "couldnt": "couldn't"
    }
    for contraction, correct in contractions.items():
        text = text.replace(contraction, correct)

    # Replace punctuation (except apostrophes and colons) with spaces
    text = re.sub(r"[^\w\s':]", ' ', text)

    # Remove whitespace before commas
    text = re.sub(r'\s+,', ',', text)

    # Collapse consecutive whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()

    return text

def VQA_criterion(batch_pred, batch_answers):
    """
    Evaluation function used for the VQA task.
    """
    total_acc = 0.

    for pred, answers in zip(batch_pred, batch_answers):
        acc = 0.
        for i in range(len(answers)):
            num_match = 0
            for j in range(len(answers)):
                if i == j:
                    continue
                if pred == answers[j]:
                    num_match += 1
            acc += min(num_match / 3, 1)
        total_acc += acc / 10

    return total_acc / len(batch_pred)

# --- ResNet Model Definitions ---
class BasicBlock(nn.Module):
    """
    ResNet basic block.
    """
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        out += self.shortcut(residual)
        out = self.relu(out)

        return out


class BottleneckBlock(nn.Module):
    """
    ResNet bottleneck block.
    """
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, kernel_size=1, stride=1)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * self.expansion, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels * self.expansion)
            )

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        out += self.shortcut(residual)
        out = self.relu(out)

        return out


class ResNet(nn.Module):
    """
    ResNet implementation.
    """
    def __init__(self, block, layers):
        super().__init__()
        self.in_channels = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = self._make_layer(block, layers[0], 64)
        self.layer2 = self._make_layer(block, layers[1], 128, stride=2)
        self.layer3 = self._make_layer(block, layers[2], 256, stride=2)
        self.layer4 = self._make_layer(block, layers[3], 512, stride=2)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, 512)

    def _make_layer(self, block, blocks, out_channels, stride=1):
        layers = []
        layers.append(block(self.in_channels, out_channels, stride))
        self.in_channels = out_channels * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x

def ResNet18():
    return ResNet(BasicBlock, [2, 2, 2, 2])

def ResNet50():
    return ResNet(BottleneckBlock, [3, 4, 6, 3])

# --- VQADataset Class Definition ---
class VQADataset(torch.utils.data.Dataset):
    """
    Dataset class for the VQA data.
    """
    def __init__(self, df_path, image_dir, transform=None, answer=True):
        self.transform = transform
        self.image_dir = image_dir
        self.df = pandas.read_json(df_path)
        self.answer = answer

        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.question2idx = None # Managed by BertTokenizer
        self.idx2question = None

        self.answer2idx = {}
        self.idx2answer = {}

        if self.answer:
            for answers in self.df["answers"]:
                for answer in answers:
                    word = answer["answer"]
                    word = process_text(word)
                    if word not in self.answer2idx:
                        self.answer2idx[word] = len(self.answer2idx)
            self.idx2answer = {v: k for k, v in self.answer2idx.items()}

    def update_dict(self, dataset):
        self.answer2idx = dataset.answer2idx
        self.idx2answer = dataset.idx2answer

    def __getitem__(self, idx):
        image = Image.open(f"{self.image_dir}/{self.df['image'][idx]}")
        image = self.transform(image)

        # With padding='max_length' and no explicit max_length, the tokenizer pads
        # every question to its model maximum (512 tokens for bert-base-uncased).
        encoded_question = self.tokenizer(
            process_text(self.df["question"][idx]),
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = encoded_question['input_ids'].squeeze(0)
        attention_mask = encoded_question['attention_mask'].squeeze(0)

        if self.answer:
            all_human_answers_text = [process_text(answer["answer"]) for answer in self.df["answers"][idx]]

            valid_human_answer_indices = [
                self.answer2idx[ans_text] for ans_text in all_human_answers_text if ans_text in self.answer2idx
            ]

            # Answer ids are integer indices, so store them as long
            answers_for_vqa_criterion = torch.tensor(valid_human_answer_indices, dtype=torch.long)

            answer_counts_for_soft_labels = {}
            for ans_idx in valid_human_answer_indices:
                answer_counts_for_soft_labels[ans_idx] = answer_counts_for_soft_labels.get(ans_idx, 0) + 1

            target_soft_labels = torch.zeros(len(self.answer2idx))

            for ans_idx, count in answer_counts_for_soft_labels.items():
                score = min(count / 3, 1.0)
                target_soft_labels[ans_idx] = score

            return image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels

        else:
            return image, input_ids, attention_mask

    def __len__(self):
        return len(self.df)

# --- VQAModel Class Definition ---
class VQAModel(nn.Module):
    """
    Example model for the VQA task.
    """
    def __init__(self, n_answer: int):
        super().__init__()
        self.resnet = ResNet50()
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Freeze BERT parameters
        for param in self.bert.parameters():
            param.requires_grad = False

        # BERT output dimension is 768. Image feature is 512.
        self.fc = nn.Sequential(
            nn.Linear(512 + 768, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, n_answer)
        )

    def forward(self, image, input_ids, attention_mask):
        image_feature = self.resnet(image)
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        question_feature = bert_output.pooler_output # (batch_size, hidden_size=768)

        x = torch.cat([image_feature, question_feature], dim=1)
        x = self.fc(x)

        return x

# --- Training and Evaluation Functions ---
def train(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    total_acc = 0
    start = time.time()
    for image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels in dataloader:
        image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels = \
            image.to(device), input_ids.to(device), attention_mask.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

        pred = model(image, input_ids, attention_mask)
        loss = criterion(pred, target_soft_labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start

def eval(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    total_acc = 0
    start = time.time()
    with torch.no_grad():
        for image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels in dataloader:
            image, input_ids, attention_mask, answers_for_vqa_criterion, target_soft_labels = \
                image.to(device), input_ids.to(device), attention_mask.to(device), answers_for_vqa_criterion.to(device), target_soft_labels.to(device)

            pred = model(image, input_ids, attention_mask)
            loss = criterion(pred, target_soft_labels)

            total_loss += loss.item()
            total_acc += VQA_criterion(pred.argmax(1), answers_for_vqa_criterion)

    return total_loss / len(dataloader), total_acc / len(dataloader), 0.0, time.time() - start

# --- Initialization Code ---
set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

train_dataset = VQADataset(df_path="./data/train.json", image_dir="./data/train", transform=transform)
test_dataset = VQADataset(df_path="./data/valid.json", image_dir="./data/valid", transform=transform, answer=False)
test_dataset.update_dict(train_dataset)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

n_answer = len(train_dataset.answer2idx)
model = VQAModel(n_answer=n_answer).to(device)

num_epoch = 4
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

# --- Train Model ---
print("Starting model training...")
for epoch in range(num_epoch):
    train_loss, train_acc, train_simple_acc, train_time = train(model, train_loader, optimizer, criterion, device)
    print(f"【{epoch + 1}/{num_epoch}】\n"
            f"train time: {train_time:.2f} [s]\n"
            f"train loss: {train_loss:.4f}\n"
            f"train acc: {train_acc:.4f}\n"
            f"train simple acc: {train_simple_acc:.4f}")
print("Model training complete.")

# --- Make Submission File ---
print("Generating submission file...")
model.eval()
submission = []
with torch.no_grad():  # inference only, so skip gradient tracking
    for image, input_ids, attention_mask in test_loader:
        image, input_ids, attention_mask = image.to(device), input_ids.to(device), attention_mask.to(device)
        pred = model(image, input_ids, attention_mask)
        pred = pred.argmax(1).cpu().item()
        submission.append(pred)

# Map predicted answer ids back to answer strings
submission = [train_dataset.idx2answer[idx] for idx in submission]
submission = np.array(submission)
torch.save(model.state_dict(), "model.pt")
np.save("submission.npy", submission)
print("Submission file generated and model saved.")



Starting model training...
