<a href="https://colab.research.google.com/github/tahereh-fahi/AI-ML-projects/blob/main/TabularQLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RL — Home World (Tabular Q-Learning) — V2

This notebook sets up the environment and **runs Tabular Q-Learning** on the *Home World* text game.

**What you'll find here:**
- Clear, guided text cells.
- The **first two code cells** exactly as requested (environment check + data/framework downloader).
- We **download** `framework.py` and `utils.py` (we **do not** paste their code here).
- We **embed and write** the full contents of:
  - `agent_tabular_ql.py`
  - `agent_linear.py`
- A quick run block to execute the tabular agent end-to-end.


In [None]:
import sys, subprocess
def _pip_install(pkg):
    try: __import__(pkg)
    except: subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
for pkg in ["numpy", "matplotlib", "tqdm"]:
    _pip_install(pkg)
print("Environment OK")

In [None]:
# --- RUN ME FIRST: download & prepare dataset(s) from GitHub Releases -------
import os, subprocess, hashlib, pathlib
os.makedirs("data", exist_ok=True)

def download(url: str, out_path: str):
    subprocess.run(["curl","-L","--fail","--retry","3","--retry-delay","3","-o",out_path,url], check=True)

def sha256sum(path: str) -> str:
    import hashlib; h = hashlib.sha256()
    with open(path,"rb") as f:
        for chunk in iter(lambda: f.read(1<<20), b""): h.update(chunk)
    return h.hexdigest()

def verify(path, expected_sha256):
    if not expected_sha256: return
    if sha256sum(path).lower()!=expected_sha256.split(":")[-1].lower():
        raise RuntimeError("Checksum mismatch")

URL="https://github.com/tahereh-fahi/Data/releases/download/v1.1.0/game.tsv"
OUT="data/game.tsv"
SHA256="sha256:f59fc2d328ecbdcf8619508a8cf226974592e67293aa4a7b6ae4a1d0128b07b3"
download(URL,OUT); verify(OUT,SHA256)
print("game.tsv ready")

FILES=[
 ("https://github.com/tahereh-fahi/Data/releases/download/v1.1.0/framework.py","framework.py",None),
 ("https://github.com/tahereh-fahi/Data/releases/download/v1.1.0/utils.py","utils.py",None)
]
for url,out,sha in FILES: download(url,out)
print("framework.py and utils.py ready")

## Files policy

- We **download** `framework.py` and `utils.py` and **import** them (we do **not** paste their code here).
- We **embed & write** the full contents of:
  - `agent_tabular_ql.py` (tabular Q-learning)
  - `agent_linear.py` (linear function approximation variant)
- `agent_dqn.py` is **not embedded** by default. Ask and I will add it similarly.


## `agent_tabular_ql.py` (embedded here)

We embed the full code so you can view/edit it in the notebook and also **write** it to disk to keep the Python module importable.


In [None]:
# Write agent_tabular_ql.py to the working directory
from pathlib import Path
code = '"""Tabular QL agent"""\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\nimport framework\nimport utils\n\nDEBUG = False\n\nGAMMA = 0.5  # discounted factor\nTRAINING_EP = 0.5  # epsilon-greedy parameter for training\nTESTING_EP = 0.05  # epsilon-greedy parameter for testing\nNUM_RUNS = 10\nNUM_EPOCHS = 200\nNUM_EPIS_TRAIN = 25  # number of episodes for training at each epoch\nNUM_EPIS_TEST = 50  # number of episodes for testing\nALPHA = 1e-3  # learning rate for training\n\nACTIONS = framework.get_actions()\nOBJECTS = framework.get_objects()\nNUM_ACTIONS = len(ACTIONS)\nNUM_OBJECTS = len(OBJECTS)\n\n\n# pragma: coderesponse template\ndef epsilon_greedy(state_1, state_2, q_func, epsilon):\n    """Returns an action selected by an epsilon-Greedy exploration policy\n\n    Args:\n        state_1, state_2 (int, int): two indices describing the current state\n        q_func (np.ndarray): current Q-function\n        epsilon (float): the probability of choosing a random command\n\n    Returns:\n        (int, int): the indices describing the action/object to take\n    """\n    # TODO Your code here\n    # TODO Your code here\n    p = np.random.random()\n    \n    if p<epsilon:\n        action_index = np.random.randint(q_func.shape[2])\n        object_index = np.random.randint(q_func.shape[3])\n    else:\n        Q_s_c = q_func[state_1, state_2, :, :]\n        action_index, object_index = np.unravel_index(np.argmax(Q_s_c), Q_s_c.shape)\n         \n    \n    \n    return (action_index, object_index)\n\n\n# pragma: coderesponse end\n\n\n# pragma: coderesponse template\ndef tabular_q_learning(q_func, current_state_1, current_state_2, action_index,\n                       object_index, reward, next_state_1, next_state_2,\n                       terminal):\n    """Update q_func for a given transition\n\n    Args:\n        q_func (np.ndarray): current Q-function\n        current_state_1, current_state_2 (int, int): two indices describing the current state\n        action_index (int): index of the current action\n        object_index (int): index of the current object\n        reward (float): the immediate reward the agent recieves from playing current command\n        next_state_1, next_state_2 (int, int): two indices describing the next state\n        terminal (bool): True if this episode is over\n\n    Returns:\n        None\n    """\n    # TODO Your code here\n    \n        \n    \n    first_part = (1-ALPHA) * q_func[current_state_1, current_state_2, action_index,\n               object_index]\n    \n    if terminal :\n        second_part = ALPHA * (reward)\n        \n    else :\n        Q_sprim_cprim = q_func[next_state_1, next_state_2, :, :]\n        maxQ = np.max(Q_sprim_cprim)\n        second_part = ALPHA * (reward + GAMMA * maxQ)\n    \n    \n    q_func[current_state_1, current_state_2, action_index,\n           object_index] = first_part + second_part  # TODO Your update here\n\n    return None  # This function shouldn\'t return anything\n\n\n# pragma: coderesponse end\n\n\n# pragma: coderesponse template\ndef run_episode(for_training):\n    """Run a single episode. If training, update Q online; if testing, return discounted return."""\n    epsilon = TRAINING_EP if for_training else TESTING_EP\n    epi_reward = 0.0\n    step_count = 0\n\n    current_room_desc, current_quest_desc, terminal = framework.newGame()\n    while not terminal:\n        s1 = dict_room_desc[current_room_desc]\n        s2 = dict_quest_desc[current_quest_desc]\n\n        a_idx, o_idx = epsilon_greedy(s1, s2, q_func, epsilon)\n\n        next_room_desc, next_quest_desc, reward, terminal = framework.step_game(\n            current_room_desc, current_quest_desc, a_idx, o_idx\n        )\n\n        ns1 = dict_room_desc[next_room_desc]\n        ns2 = dict_quest_desc[next_quest_desc]\n\n        if for_training:\n            tabular_q_learning(\n                q_func, s1, s2, a_idx, o_idx, reward, ns1, ns2, terminal\n            )\n        else:\n            epi_reward += (GAMMA ** step_count) * reward\n\n        step_count += 1\n        current_room_desc, current_quest_desc = next_room_desc, next_quest_desc\n\n    return None if for_training else epi_reward\n\ndef run_epoch():\n    """Runs one epoch and returns reward averaged over test episodes"""\n    rewards = []\n\n    for _ in range(NUM_EPIS_TRAIN):\n        run_episode(for_training=True)\n\n    for _ in range(NUM_EPIS_TEST):\n        rewards.append(run_episode(for_training=False))\n\n    return np.mean(np.array(rewards))\n\n\ndef run():\n    """Returns array of test reward per epoch for one run"""\n    global q_func\n    q_func = np.zeros((NUM_ROOM_DESC, NUM_QUESTS, NUM_ACTIONS, NUM_OBJECTS))\n\n    single_run_epoch_rewards_test = []\n    pbar = tqdm(range(NUM_EPOCHS), ncols=80)\n    for _ in pbar:\n        single_run_epoch_rewards_test.append(run_epoch())\n        pbar.set_description(\n            "Avg reward: {:0.6f} | Ewma reward: {:0.6f}".format(\n                np.mean(single_run_epoch_rewards_test),\n                utils.ewma(single_run_epoch_rewards_test)))\n    return single_run_epoch_rewards_test\n\n\nif __name__ == \'__main__\':\n    # Data loading and build the dictionaries that use unique index for each state\n    (dict_room_desc, dict_quest_desc) = framework.make_all_states_index()\n    NUM_ROOM_DESC = len(dict_room_desc)\n    NUM_QUESTS = len(dict_quest_desc)\n\n    # set up the game\n    framework.load_game_data()\n\n    epoch_rewards_test = []  # shape NUM_RUNS * NUM_EPOCHS\n\n    for _ in range(NUM_RUNS):\n        epoch_rewards_test.append(run())\n\n    epoch_rewards_test = np.array(epoch_rewards_test)\n\n    x = np.arange(NUM_EPOCHS)\n    fig, axis = plt.subplots()\n    axis.plot(x, np.mean(epoch_rewards_test,\n                         axis=0))  # plot reward per epoch averaged per run\n    axis.grid(which="both")\n    axis.minorticks_on()\n    axis.set_xlabel(\'Epochs\')\n    axis.set_ylabel(\'reward\')\n    axis.set_title((\'Tablular: nRuns=%d, Epilon=%.2f, Epi=%d, alpha=%.4f\' %\n                    (NUM_RUNS, TRAINING_EP, NUM_EPIS_TRAIN, ALPHA)))\n    plt.show()\n    x=0.7\n'
Path("agent_tabular_ql.py").write_text(code, encoding="utf-8")
print("agent_tabular_ql.py written ({len(code.splitlines())} lines).")

## `agent_linear.py` (embedded here)

Included for completeness (function approximation variant).

In [None]:
# Write agent_linear.py to the working directory
from pathlib import Path
code = '"""Linear QL agent"""\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\nimport framework\nimport utils\n\nDEBUG = False\n\n\nGAMMA = 0.5  # discounted factor\nTRAINING_EP = 0.5  # epsilon-greedy parameter for training\nTESTING_EP = 0.05  # epsilon-greedy parameter for testing\nNUM_RUNS = 10\nNUM_EPOCHS = 600\nNUM_EPIS_TRAIN = 25  # number of episodes for training at each epoch\nNUM_EPIS_TEST = 50  # number of episodes for testing\nALPHA = 0.001  # learning rate for training\n\nACTIONS = framework.get_actions()\nOBJECTS = framework.get_objects()\nNUM_ACTIONS = len(ACTIONS)\nNUM_OBJECTS = len(OBJECTS)\n\n\ndef tuple2index(action_index, object_index):\n    """Converts a tuple (a,b) to an index c"""\n    return action_index * NUM_OBJECTS + object_index\n\n\ndef index2tuple(index):\n    """Converts an index c to a tuple (a,b)"""\n    return index // NUM_OBJECTS, index % NUM_OBJECTS\n\n\n# pragma: coderesponse template name="linear_epsilon_greedy"\ndef epsilon_greedy(state_vector, theta, epsilon):\n    """Returns an action selected by an epsilon-greedy exploration policy\n\n    Args:\n        state_vector (np.ndarray): extracted vector representation\n        theta (np.ndarray): current weight matrix\n        epsilon (float): the probability of choosing a random command\n\n    Returns:\n        (int, int): the indices describing the action/object to take\n    """\n    # TODO Your code here\n    p = np.random.random()\n    \n    # [tuple2index(action_index, object_index)]\n    if p<epsilon:\n        action_index = np.random.randint(NUM_ACTIONS)\n        object_index = np.random.randint(NUM_OBJECTS)\n    else:\n        Q = (theta @ state_vector)\n        action_index, object_index = index2tuple(np.argmax(Q))\n    \n    return (action_index, object_index)\n# pragma: coderesponse end\n\n\n# pragma: coderesponse template\ndef linear_q_learning(theta, current_state_vector, action_index, object_index,\n                      reward, next_state_vector, terminal):\n    """Update theta for a given transition\n\n    Args:\n        theta (np.ndarray): current weight matrix\n        current_state_vector (np.ndarray): vector representation of current state\n        action_index (int): index of the current action\n        object_index (int): index of the current object\n        reward (float): the immediate reward the agent recieves from playing current command\n        next_state_vector (np.ndarray): vector representation of next state\n        terminal (bool): True if this epsiode is over\n\n    Returns:\n        None\n    """\n    # TODO Your code here\n    if not terminal:\n        q = (theta @ current_state_vector)[tuple2index(action_index, object_index)]\n        qPrim = (theta @ next_state_vector)[tuple2index(action_index, object_index)]\n        Qmax = np.max(qPrim)\n        phi = q/theta[tuple2index(action_index, object_index)] \n        \n        g_theta = reward + GAMMA * Qmax - q\n        theta[tuple2index(action_index, object_index)] += ALPHA * g_theta * phi\n        \n    \n        \n    \n    # TODO Your update here\n# pragma: coderesponse end\n\n\ndef run_episode(for_training):\n    """ Runs one episode\n    If for training, update Q function\n    If for testing, computes and return cumulative discounted reward\n\n    Args:\n        for_training (bool): True if for training\n\n    Returns:\n        None\n    """\n    epsilon = TRAINING_EP if for_training else TESTING_EP\n    epi_reward = None\n\n    # initialize for each episode\n    # TODO Your code here\n\n    (current_room_desc, current_quest_desc, terminal) = framework.newGame()\n    while not terminal:\n        # Choose next action and execute\n        current_state = current_room_desc + current_quest_desc\n        current_state_vector = utils.extract_bow_feature_vector(\n            current_state, dictionary)\n        # TODO Your code here\n\n        if for_training:\n            # update Q-function.\n            # TODO Your code here\n            pass\n\n        if not for_training:\n            # update reward\n            # TODO Your code here\n            pass\n\n        # prepare next step\n        # TODO Your code here\n\n    if not for_training:\n        return epi_reward\n\n\ndef run_epoch():\n    """Runs one epoch and returns reward averaged over test episodes"""\n    rewards = []\n\n    for _ in range(NUM_EPIS_TRAIN):\n        run_episode(for_training=True)\n\n    for _ in range(NUM_EPIS_TEST):\n        rewards.append(run_episode(for_training=False))\n\n    return np.mean(np.array(rewards))\n\n\ndef run():\n    """Returns array of test reward per epoch for one run"""\n    global theta\n    theta = np.zeros([action_dim, state_dim])\n\n    single_run_epoch_rewards_test = []\n    pbar = tqdm(range(NUM_EPOCHS), ncols=80)\n    for _ in pbar:\n        single_run_epoch_rewards_test.append(run_epoch())\n        pbar.set_description(\n            "Avg reward: {:0.6f} | Ewma reward: {:0.6f}".format(\n                np.mean(single_run_epoch_rewards_test),\n                utils.ewma(single_run_epoch_rewards_test)))\n    return single_run_epoch_rewards_test\n\n\nif __name__ == \'__main__\':\n    state_texts = utils.load_data(\'game.tsv\')\n    dictionary = utils.bag_of_words(state_texts)\n    state_dim = len(dictionary)\n    action_dim = NUM_ACTIONS * NUM_OBJECTS\n\n    # set up the game\n    framework.load_game_data()\n\n    epoch_rewards_test = []  # shape NUM_RUNS * NUM_EPOCHS\n\n    for _ in range(NUM_RUNS):\n        epoch_rewards_test.append(run())\n\n    epoch_rewards_test = np.array(epoch_rewards_test)\n\n    x = np.arange(NUM_EPOCHS)\n    fig, axis = plt.subplots()\n    axis.plot(x, np.mean(epoch_rewards_test,\n                         axis=0))  # plot reward per epoch averaged per run\n    axis.set_xlabel(\'Epochs\')\n    axis.set_ylabel(\'reward\')\n    axis.set_title((\'Linear: nRuns=%d, Epilon=%.2f, Epi=%d, alpha=%.4f\' %\n                    (NUM_RUNS, TRAINING_EP, NUM_EPIS_TRAIN, ALPHA)))\n\n'
Path("agent_linear.py").write_text(code, encoding="utf-8")
print("agent_linear.py written ({len(code.splitlines())} lines).")

### About `agent_dqn.py`

We did **not** embed `agent_dqn.py` here by default. If you'd like it included as well, I can add a section that writes it into the notebook just like the others.


## Quick test run (Tabular Q-Learning)

This cell runs the tabular agent entry point (typically `main()` or `run()` inside `agent_tabular_ql.py`).  
Make sure you executed the **first two cells** and the **write file** cells above first.


In [None]:
# Run the tabular Q-learning agent
import importlib
import agent_tabular_ql as tab

if hasattr(tab, "main"):
    tab.main()
elif hasattr(tab, "run"):
    tab.run()
else:
    print("agent_tabular_ql.py does not expose main() or run(); please execute its API manually.")