<div dir=ltr align=center>
    <font color=0F5298 size=7>Neurosymbolic VQA Program Generator</font><br>
    <br>
    <font color=2565AE size=5>Part 0: Data Exploration & Setup</font><br>
</div>

<br/>

---

## **Why It Matters: Neurosymbolic VQA**

This project builds a **Neurosymbolic** framework for Visual Question Answering (VQA). In this paradigm:

- <span style="color:blue">**Programs**</span> act as **symbols**. They represent a concrete, logical sequence of steps to find an answer (e.g., `scene` -> `filter_color[blue]` -> `count`).
- <span style="color:green">**Seq2Seq Models**</span> serve as **neural structures**. Their job is to learn the mapping from a flexible, ambiguous natural language question (like "*how many blue things are there?*") to a rigid, logical symbolic program.

This combination pairs the **generalizability** of neural networks with the **interpretability** and **compositionality** of symbolic logic.
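
To make the idea of a program as a sequence of symbolic steps concrete, here is a minimal sketch of an executor for the example program above. The scene format and the `run_program` helper are hypothetical and exist only for illustration; the project's real executor is not shown in this notebook.

```python
from typing import Any

# Toy scene: a list of objects with attributes (hypothetical format).
SCENE = [
    {"shape": "cube", "color": "blue"},
    {"shape": "sphere", "color": "red"},
    {"shape": "cylinder", "color": "blue"},
]


def run_program(program: list[str], scene: list[dict]) -> Any:
    """Execute a linear program one step at a time on a toy scene."""
    state: Any = None
    for step in program:
        if step == "scene":
            # Start from the full set of objects.
            state = list(scene)
        elif step.startswith("filter_color[") and step.endswith("]"):
            # Keep only the objects whose color matches the argument.
            color = step[len("filter_color["):-1]
            state = [obj for obj in state if obj["color"] == color]
        elif step == "count":
            # Reduce the current object set to its size.
            state = len(state)
        else:
            raise ValueError(f"Unknown step: {step}")
    return state


answer = run_program(["scene", "filter_color[blue]", "count"], SCENE)
print(answer)  # 2
```

Each step consumes the output of the previous one, which is what makes the program both inspectable and composable.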

## **Learning Objectives**

By the end of this project, we will have implemented and compared three distinct strategies for training the neural program generator:

1.  🟠 **Supervised Learning**: Training a model (LSTM & Transformer) using the ground-truth programs. This is a form of "behavioral cloning."
2.  🔵 **Reinforcement Learning (RL)**: Fine-tuning the supervised model using rewards from a symbolic executor. The model gets a reward if its generated program produces the *correct answer*, even if the program itself isn't identical to the ground-truth one.
3.  🟢 **In-Context Learning (ICL)**: Using a pre-trained Large Language Model (LLM) to generate programs by showing it a few examples in its prompt, with no explicit training or fine-tuning.
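
The difference between the supervised signal (match the ground-truth program) and the RL signal (produce the correct answer) can be sketched as follows. This is an illustration under assumptions: `toy_executor`, `execution_reward`, and the toy scene are hypothetical stand-ins, not the project's actual executor or reward function.

```python
from typing import Callable, Optional

# Toy scene (hypothetical format) in which every blue object is also a cube.
SCENE = [
    {"shape": "cube", "color": "blue"},
    {"shape": "sphere", "color": "red"},
    {"shape": "cube", "color": "blue"},
]


def toy_executor(program: list[str]) -> Optional[str]:
    """Run a linear program of scene / filter_<attr>[value] / count steps."""
    state = None
    for step in program:
        if step == "scene":
            state = list(SCENE)
        elif step.startswith("filter_") and step.endswith("]"):
            attribute, value = step[len("filter_"):-1].split("[")
            state = [obj for obj in state if obj[attribute] == value]
        elif step == "count":
            state = str(len(state))
        else:
            raise ValueError(f"Unknown step: {step}")
    return state


def execution_reward(
    predicted_program: list[str],
    ground_truth_answer: str,
    executor: Callable[[list[str]], Optional[str]],
) -> float:
    """Return 1.0 if executing the program yields the right answer, else 0.0."""
    try:
        predicted_answer = executor(predicted_program)
    except Exception:
        # Malformed programs that crash the executor earn no reward.
        return 0.0
    return 1.0 if predicted_answer == ground_truth_answer else 0.0


ground_truth_program = ["scene", "filter_color[blue]", "count"]
predicted_program = ["scene", "filter_shape[cube]", "count"]

print(predicted_program == ground_truth_program)              # False
print(execution_reward(predicted_program, "2", toy_executor))  # 1.0
```

The predicted program differs from the ground truth, so it would be penalized under exact-match supervision, yet it still earns the full execution-based reward.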

---

## Step 1: **Download the CLEVR Dataset**

First, you must download the CLEVR dataset from this link:

**Link:** [https://drive.google.com/file/d/1_AtOysdMraIdLbbmAzC2x862Jd7xQDQ7/view?usp=sharing](https://drive.google.com/file/d/1_AtOysdMraIdLbbmAzC2x862Jd7xQDQ7/view?usp=sharing)

1.  Download the `CELVR_Dataset.zip` file.
2.  Unzip it.
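3.  Place the resulting `CELVR_Dataset` folder inside the `data/` directory of this project. The preprocessing commands in Step 3 use the path `../data/CLEVR_Dataset/`, so the folder name on disk and the paths in those commands must agree.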

The final structure should look like this:

```bash
    neurosymbolic-vqa-program-generator/
    â”œâ”€â”€ data/
    â”‚   â”œâ”€â”€ CELVR_Dataset/
    â”‚   â”‚   â”œâ”€â”€ Questions/
    â”‚   â”‚   â””â”€â”€ Scenses/  <-- (Note: original folder has a typo, change it to Scenes)
    â”‚   â””â”€â”€ .gitkeep
    â”œâ”€â”€ notebooks/
    â”‚   â””â”€â”€ 0_Data_Exploration_and_Setup.ipynb
    â””â”€â”€ src/
        ...
```
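
Before moving on, it can help to confirm the layout programmatically. The sketch below is a hypothetical helper (`missing_dataset_dirs` is not part of the project) that assumes the notebook runs from `notebooks/` and that the `Scenses` folder has already been renamed to `Scenes`. Pass whichever dataset folder name you actually have on disk.

```python
from pathlib import Path


def missing_dataset_dirs(data_root: Path, dataset_name: str) -> list[Path]:
    """Return the expected dataset subfolders that do not exist yet."""
    expected = [
        data_root / dataset_name / "Questions",
        data_root / dataset_name / "Scenes",
    ]
    return [path for path in expected if not path.is_dir()]


# Assumption: the notebook lives in notebooks/, so data/ is one level up.
missing = missing_dataset_dirs(Path("..") / "data", "CELVR_Dataset")
if missing:
    print("Missing folders:")
    for path in missing:
        print(f"  {path}")
else:
    print("Dataset layout looks good.")
```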

## Step 2: **Explore the Raw Data**

Let's load one question and one scene to see what the raw data looks like.

In [None]:
import json
import sys
import os

# Add the project root to the Python path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

# Now we can import from our src package
import src.config as config

# --- Load an example question --- 
print(f"Loading questions from: {config.TRAIN_QUESTIONS_JSON}\n")
with open(config.TRAIN_QUESTIONS_JSON, 'r') as f:
    question_data = json.load(f)['questions']

print("--- Example Question (index 0) ---")
example_question = question_data[0]
print(json.dumps(example_question, indent=2))

In [None]:
# --- Load an example scene --- 
print(f"Loading scenes from: {config.TRAIN_SCENES_JSON}\n")
with open(config.TRAIN_SCENES_JSON, 'r') as f:
    scene_data = json.load(f)['scenes']

print("--- Example Scene (index 0) ---")
example_scene = scene_data[0]
print(json.dumps(example_scene, indent=2))
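
The printed question contains a nested `program` field that is hard to read as raw JSON. As a rough sketch, the helper below flattens it into the arrow notation used earlier. It assumes the standard CLEVR layout, where each program node is a dict with `function`, `inputs`, and `value_inputs` keys; this copy of the dataset may differ, so compare against the output above first. `program_to_tokens` and `toy_question` are hypothetical names used only for illustration.

```python
def program_to_tokens(program: list[dict]) -> list[str]:
    """Flatten CLEVR-style program nodes into readable tokens."""
    tokens = []
    for node in program:
        name = node["function"]
        values = node.get("value_inputs", [])
        # Attach literal arguments in brackets, e.g. filter_color[blue].
        tokens.append(f"{name}[{','.join(values)}]" if values else name)
    return tokens


# Hand-written example in the assumed CLEVR format (not taken from the dataset).
toy_question = {
    "question": "How many blue things are there?",
    "answer": "2",
    "program": [
        {"function": "scene", "inputs": [], "value_inputs": []},
        {"function": "filter_color", "inputs": [0], "value_inputs": ["blue"]},
        {"function": "count", "inputs": [1], "value_inputs": []},
    ],
}

print(" -> ".join(program_to_tokens(toy_question["program"])))
# scene -> filter_color[blue] -> count
```

If the keys match, calling `program_to_tokens(example_question["program"])` on the loaded question should give a similarly compact view.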

## Step 3: **Preprocess the Data**

The raw data is not in a format our models can use. We need to run the `preprocess_data.py` script to:

1.  **Build a Vocabulary**: Create a mapping from tokens (like "what", "blue", "filter_color") to integer indices.
2.  **Tokenize & Encode**: Convert all questions and programs into sequences of these integers.
3.  **Pad**: Pad all sequences to a uniform length so they can be batched.
4.  **Save to H5**: Store these large numerical arrays in an efficient HDF5 (`.h5`) file.

We will run this script three times: once for `train` (which *creates* the vocab) and once each for `val` and `test` (which *use* the saved vocab).
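
The first three steps can be sketched in a few lines of plain Python. This is not the code in `preprocess_data.py`: the special tokens, the tokenizer, and the function names are assumptions made for illustration, and the `allow_unk` flag here reflects a guess that `--allow_unk 1` maps unseen tokens to an unknown-token index.

```python
# Hypothetical special tokens; index 0 doubles as the padding index.
SPECIAL_TOKENS = ["<NULL>", "<START>", "<END>", "<UNK>"]


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and detach question marks."""
    return text.lower().replace("?", " ?").split()


def build_vocab(sentences: list[str]) -> dict[str, int]:
    """Assign an integer index to every token seen in the training set."""
    token_to_idx = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for sentence in sentences:
        for tok in tokenize(sentence):
            if tok not in token_to_idx:
                token_to_idx[tok] = len(token_to_idx)
    return token_to_idx


def encode(text: str, token_to_idx: dict[str, int], allow_unk: bool = False) -> list[int]:
    """Convert text to indices, wrapped in <START> and <END>."""
    ids = [token_to_idx["<START>"]]
    for tok in tokenize(text):
        if tok not in token_to_idx:
            if not allow_unk:
                raise KeyError(f"Token {tok!r} not in vocabulary")
            tok = "<UNK>"
        ids.append(token_to_idx[tok])
    ids.append(token_to_idx["<END>"])
    return ids


def pad(sequences: list[list[int]], pad_idx: int = 0) -> list[list[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]


train_questions = ["How many blue things are there?", "What color is the cube?"]
question_vocab = build_vocab(train_questions)
encoded = pad([encode(q, question_vocab) for q in train_questions])
for row in encoded:
    print(row)

# A validation question with an unseen word only encodes when allow_unk is set.
print(encode("What color is the sphere?", question_vocab, allow_unk=True))
```

Building the vocabulary from the training split only, then reusing it for the other splits, keeps token indices consistent across all three H5 files.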

In [None]:
print("--- 1. Preprocessing TRAIN data ---")
# This command builds the vocabulary and saves it
!python ../scripts/preprocess_data.py \
    --input_json ../data/CLEVR_Dataset/Questions/CLEVR_train_questions.json \
    --output_h5 ../data/dataH5Files/clevr_train_questions.h5 \
    --output_vocab_json ../data/dataH5Files/clevr_vocab.json

print("\n--- 2. Preprocessing VALIDATION data ---")
# This command loads the existing vocab
!python ../scripts/preprocess_data.py \
    --input_json ../data/CLEVR_Dataset/Questions/CLEVR_val_questions.json \
    --input_vocab_json ../data/dataH5Files/clevr_vocab.json \
    --output_h5 ../data/dataH5Files/clevr_val_questions.h5 \
    --allow_unk 1

print("\n--- 3. Preprocessing TEST data ---")
# This command also loads the existing vocab
!python ../scripts/preprocess_data.py \
    --input_json ../data/CLEVR_Dataset/Questions/CLEVR_test_questions.json \
    --input_vocab_json ../data/dataH5Files/clevr_vocab.json \
    --output_h5 ../data/dataH5Files/clevr_test_questions.h5 \
    --allow_unk 1

print("\nPreprocessing complete!")

## Step 4: **Verify the Processed H5 File**

Let's load the H5 file and our new vocabulary to make sure everything worked. We'll load the first processed question and program and decode them back to text.

In [None]:
import h5py  # reads the HDF5 files written by preprocess_data.py
from src.vocabulary import load_vocab, decode

# 1. Load the vocabulary
vocab = load_vocab(config.VOCAB_JSON_FILE)

# 2. Load the H5 file
print(f"Loading H5 file: {config.TRAIN_H5_FILE}")
with h5py.File(config.TRAIN_H5_FILE, 'r') as f:
    question_vectors = f['questions']
    program_vectors = f['programs']
    
    # Get the first question and program
    q_vec = question_vectors[0]
    p_vec = program_vectors[0]
    
    print("\n--- Original Vectorized Question (index 0) ---")
    print(q_vec)
    print("\n--- Decoded Question ---")
    print(decode(q_vec, vocab['question_idx_to_token']))
    
    print("\n--- Original Vectorized Program (index 0) ---")
    print(p_vec)
    print("\n--- Decoded Program ---")
    print(decode(p_vec, vocab['program_idx_to_token']))
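
For intuition about what decoding involves, here is a minimal sketch of an index-to-text function. It is not the `decode` from `src.vocabulary`, whose implementation is not shown here; `decode_sketch` and the special-token names are assumptions. It also shows one detail worth knowing when a vocabulary is stored as JSON: object keys always come back as strings, so an index-to-token map needs its keys converted to integers after loading.

```python
import json


def decode_sketch(indices, idx_to_token: dict[int, str], delimiter: str = " ") -> str:
    """Map indices back to tokens, skipping padding and stopping at <END>."""
    tokens = []
    for idx in indices:
        token = idx_to_token[int(idx)]  # int() also accepts NumPy integers
        if token == "<END>":
            break
        if token in ("<NULL>", "<START>"):
            continue
        tokens.append(token)
    return delimiter.join(tokens)


# Hypothetical program vocabulary.
token_to_idx = {
    "<NULL>": 0, "<START>": 1, "<END>": 2,
    "scene": 3, "filter_color[blue]": 4, "count": 5,
}

# Round-trip the inverse mapping through JSON, as a saved vocab file would be.
loaded = json.loads(json.dumps({idx: tok for tok, idx in token_to_idx.items()}))
print(type(next(iter(loaded))).__name__)  # str

# Convert the string keys back to integers before decoding.
idx_to_token = {int(idx): tok for idx, tok in loaded.items()}
print(decode_sketch([1, 3, 4, 5, 2, 0, 0], idx_to_token))
# scene filter_color[blue] count
```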

### All set!

Our data is now preprocessed. We can move on to the first training strategy.