<a href="https://colab.research.google.com/github/tripidhoble/26-Week-of-Data-Science-Challenge-/blob/master/DRL_Assignment_1_Group14_DP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### The Smart Supplier: Optimizing Orders in a Fluctuating Market - 6 Marks

Develop a reinforcement learning agent using dynamic programming to help a Smart Supplier decide which products to manufacture and sell each day to maximize profit. The agent must learn the optimal policy for choosing daily production quantities, considering its limited raw materials and the unpredictable daily demand and selling prices for different products.

#### **Scenario**
 A small Smart Supplier manufactures two simple products: Product A and Product B. Each day, the supplier has a limited amount of raw material. The challenge is that the market demand and selling price for Product A and Product B change randomly each day, making some products more profitable than others at different times. The supplier needs to decide how much of each product to produce to maximize profit while managing their limited raw material.

#### **Objective**
The Smart Supplier's agent must learn the optimal policy π∗ using dynamic programming (Value Iteration or Policy Iteration) to decide how many units of Product A and Product B to produce each day to maximize the total profit over the fixed number of days, given the daily changing market conditions and limited raw material.
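
For reference, the objective can be stated as a finite-horizon Bellman optimality equation. The notation below is an assumption introduced for illustration and does not come from the assignment text: $d$ is the day index, $m$ the available raw material, $k$ the market state, $T$ the number of days, $M$ the daily raw material allotment, $P(k')$ the probability of the next market state, and $c(a)$ the raw material cost of action $a$.

$$
V^*(d, m, k) = \max_{a} \Big[ r(m, k, a) + \gamma \sum_{k'} P(k')\, V^*(d+1, M, k') \Big], \qquad V^*(T, \cdot, \cdot) = 0
$$

$$
r(m, k, a) =
\begin{cases}
n_A(a)\, p_A(k) + n_B(a)\, p_B(k) & \text{if } c(a) \le m \\
0 & \text{otherwise}
\end{cases}
$$

Here $n_A(a)$ and $n_B(a)$ are the units produced by action $a$, and $p_A(k)$, $p_B(k)$ are the selling prices in market state $k$. The next state always carries raw material $M$ because, in the environment implemented below, the raw material resets at the start of every day.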

In [53]:
# import required libraries
import pandas as pd
import numpy as np
import itertools
import random
from collections import defaultdict
import matplotlib.pyplot as plt

####  **1. Custom Environment Creation: Design and implement the "Smart Supplier" environment, defining the product costs, daily market shifts, raw material limits, and rewards. (1 Mark)**

### --- 1. Custom Environment Creation (SmartSupplierEnv) --- (1 Mark)

In [54]:
# Market states and their product selling prices.
# Structure: {market_state_id: {'A_price': X, 'B_price': Y}}
#
# Raw material cost per unit: Product A = 2, Product B = 1.
#
# Actions are (num_A, num_B) tuples; the raw material cost is computed from them.
# Action ID mapping:
#   0: Produce_2A_0B
#   1: Produce_1A_2B
#   2: Produce_0A_5B
#   3: Produce_3A_0B
#   4: Do_Nothing
#
# State space: (day, raw_material, market_state)
#   day:          0 to num_days - 1
#   raw_material: 0 to initial_rm
#   market_state: 1 or 2
#
# Reward: revenue of the units produced, or 0 if the action needs more
# raw material than is available.


class SmartSupplierEnv:
    def __init__(self, num_days=5, initial_rm=10, fixed_market=None):
        self.num_days = num_days
        self.initial_rm = initial_rm
        self.product_cost = {'A': 2, 'B': 1}
        self.market_states = {
            1: {'A_price': 8, 'B_price': 2},
            2: {'A_price': 3, 'B_price': 5},
        }
        self.actions = {
            0: (2, 0),
            1: (1, 2),
            2: (0, 5),
            3: (3, 0),
            4: (0, 0),
        }
        self.fixed_market_state(fixed_market)
        self.state_space = list(itertools.product(range(num_days), range(initial_rm + 1), list(self.market_states.keys())))
        self.action_space = list(self.actions.keys())


    def fixed_market_state(self, fixed_market):
        """Restrict the environment to a single market state, if one is given."""
        if fixed_market is not None:
            if fixed_market not in self.market_states:
                raise ValueError(f"Invalid fixed_market: {fixed_market!r}")
            self.market_states = {fixed_market: self.market_states[fixed_market]}
            print("market_states: ", self.market_states)


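    def transition(self, state, action):
        """Planning model: list the possible (next_state, reward, done) outcomes.

        The agents weight every returned outcome by 0.5. Market ids 1 and 2
        are hardcoded here, so both are listed even when fixed_market is set.
        """
        day, rm, market_state = state
        a_units, b_units = self.actions[action]
        cost = a_units * self.product_cost['A'] + b_units * self.product_cost['B']
        reward = 0

        # An action earns revenue only if there is enough raw material for it.
        if cost <= rm:
            prices = self.market_states[market_state]
            reward = a_units * prices['A_price'] + b_units * prices['B_price']

        next_day = day + 1
        done = next_day >= self.num_days

        if not done:
            # Raw material resets to initial_rm every day; the next market
            # state is either 1 or 2.
            return [
                ((next_day, self.initial_rm, 1), reward, done),
                ((next_day, self.initial_rm, 2), reward, done)
            ]
        else:
            return [(None, reward, True)]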

    def transition_deterministic(self, state, action, next_market):
      day, rm, market_state = state
      a_units, b_units = self.actions[action]
      cost = a_units * self.product_cost['A'] + b_units * self.product_cost['B']
      reward = 0

      if cost <= rm:
          prices = self.market_states[market_state]
          reward = a_units * prices['A_price'] + b_units * prices['B_price']

      next_day = day + 1
      done = next_day >= self.num_days
      if not done:
          next_state = (next_day, self.initial_rm, next_market)
      else:
          next_state = None

      return next_state, reward, done


    def reset(self):
        self.current_day = 0
        self.current_rm = self.initial_rm
        self.current_market = random.choice(list(self.market_states.keys()))
        self.done = False
        return (self.current_day, self.current_rm, self.current_market)

    def step(self, action):
        if self.done:
            raise Exception("Episode has ended. Please reset the environment.")

        a_units, b_units = self.actions[action]
        cost = a_units * self.product_cost['A'] + b_units * self.product_cost['B']
        reward = 0

        if cost <= self.current_rm:
            prices = self.market_states[self.current_market]
            reward = a_units * prices['A_price'] + b_units * prices['B_price']
            self.current_rm -= cost

        self.current_day += 1
        self.done = self.current_day >= self.num_days

        if not self.done:
            self.current_market = random.choice(list(self.market_states.keys()))
            self.current_rm = self.initial_rm
            next_state = (self.current_day, self.initial_rm, self.current_market)
        else:
            next_state = None


        return next_state, reward, self.done
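
The environment's one-day economics can be tabulated directly from the costs, prices, and actions defined in `SmartSupplierEnv`. The sketch below is a standalone illustration, not part of the environment: the constant names and the `payoff_table` helper are hypothetical, and the numbers are copied from the class above.

```python
# Costs, prices, and actions copied from SmartSupplierEnv (names are illustrative).
PRODUCT_COST = {"A": 2, "B": 1}
MARKET_PRICES = {1: {"A": 8, "B": 2}, 2: {"A": 3, "B": 5}}
ACTIONS = {0: (2, 0), 1: (1, 2), 2: (0, 5), 3: (3, 0), 4: (0, 0)}


def payoff_table():
    """Return (action_id, raw_material_cost, revenue_state_1, revenue_state_2) rows."""
    rows = []
    for action_id, (a_units, b_units) in ACTIONS.items():
        cost = a_units * PRODUCT_COST["A"] + b_units * PRODUCT_COST["B"]
        revenues = [
            a_units * MARKET_PRICES[m]["A"] + b_units * MARKET_PRICES[m]["B"]
            for m in (1, 2)
        ]
        rows.append((action_id, cost, revenues[0], revenues[1]))
    return rows


for action_id, cost, rev1, rev2 in payoff_table():
    print(f"action {action_id}: cost={cost}, revenue in state 1={rev1}, state 2={rev2}")
```

With 10 units of raw material every action is affordable, so the best single-day choice is action 3 (3A\_0B, revenue 24) in Market State 1 and action 2 (0A\_5B, revenue 25) in Market State 2.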


#### **2. Dynamic Programming Implementation: Implement dynamic programming (Value Iteration or Policy Iteration) to find the optimal policy. Crucially, the policy will be a function of the current day, raw material, and market state. (2 Marks)**

### --- 2. Dynamic Programming Implementation (Value Iteration or Policy Iteration) --- (2 Marks)

In [55]:
# Value Iteration function
class ValueIterationAgent:
    def __init__(self, env, gamma=1.0, theta=1e-4):
        self.env = env
        self.gamma = gamma
        self.theta = theta
        self.V = defaultdict(float)
        self.policy = {}

    def run(self):
        while True:
            delta = 0
            for state in self.env.state_space:
                if state[0] >= self.env.num_days:
                    continue
                max_value = float('-inf')
                best_action = None
                for action in self.env.action_space:
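                    total = 0
                    # Each outcome from transition() is weighted by 0.5; note that
                    # the single terminal outcome is weighted by 0.5 as well.
                    for next_state, reward, done in self.env.transition(state, action):
                        if next_state is None:
                            total += 0.5 * reward
                        else:
                            total += 0.5 * (reward + self.gamma * self.V[next_state])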
                    if total > max_value:
                        max_value = total
                        best_action = action
                delta = max(delta, abs(max_value - self.V[state]))
                self.V[state] = max_value
                self.policy[state] = best_action
            print(f"delta: {delta}, theta: {self.theta}")
            if delta < self.theta:
                break
        return self.policy, self.V


In [56]:
# Policy Iteration function
class PolicyIterationAgent:
    def __init__(self, env, gamma=1.0):
        self.env = env
        self.gamma = gamma
        self.V = defaultdict(float)
        self.policy = {s: random.choice(env.action_space) for s in env.state_space if s[0] < env.num_days}

    def run(self):
        is_policy_stable = False
        while not is_policy_stable:
            # Policy Evaluation
            while True:
                delta = 0
                for state in self.env.state_space:
                    if state[0] >= self.env.num_days:
                        continue
                    v = self.V[state]
                    action = self.policy[state]
                    total = 0
                    for next_state, reward, done in self.env.transition(state, action):
                      if next_state is None:
                          total += 0.5 * reward
                      else:
                          total += 0.5 * (reward + self.gamma * self.V[next_state])
                    self.V[state] = total
                    delta = max(delta, abs(v - self.V[state]))
                print(f"delta: {delta}")
                if delta < 1e-4:
                    break

            # Policy Improvement
            is_policy_stable = True
            for state in self.env.state_space:
                if state[0] >= self.env.num_days:
                    continue
                old_action = self.policy[state]
                best_action = old_action
                best_value = float('-inf')
                for action in self.env.action_space:
                    total = 0
                    for next_market in list(self.env.market_states.keys()):
                        next_state, reward, done = self.env.transition_deterministic(state, action, next_market)
                        if next_state is None:
                            total += reward
                        else:
                            total += 0.5 * (reward + self.gamma * self.V[next_state])
                    if total > best_value:
                        best_value = total
                        best_action = action
                self.policy[state] = best_action
                if old_action != best_action:
                    is_policy_stable = False
        return self.policy, self.V


In [57]:
# --- Main Execution ---
# Instantiate environment
env = SmartSupplierEnv(num_days=5, initial_rm=10)

# Run Value Iteration
print("\nInstantiating Value Iteration ...")
vi_agent = ValueIterationAgent(env, gamma=1.0, theta=1e-4)
vi_policy, vi_values = vi_agent.run()
print("Value Iteration run Successfully.")

# Run Policy Iteration
print("\nInstantiating Policy Iteration ...")
pi_agent = PolicyIterationAgent(env, gamma=1.0)
pi_policy, pi_values = pi_agent.run()
print("Policy Iteration run Successfully.")



Instantiating Value Iteration ...
delta: 25.0, theta: 0.0001
delta: 24.5, theta: 0.0001
delta: 24.5, theta: 0.0001
delta: 24.5, theta: 0.0001
delta: 12.25, theta: 0.0001
delta: 0, theta: 0.0001
Value Iteration run Successfully.

Instantiating Policy Iteration ...
delta: 25.0
delta: 12.5
delta: 12.5
delta: 12.5
delta: 4.0
delta: 0
delta: 25.0
delta: 15.5
delta: 15.5
delta: 12.0
delta: 8.25
delta: 0
Policy Iteration run Successfully.


#### **3. Optimal Policy Analysis: Analyze the learned optimal policy. Discuss how the policy changes based on: (1 Mark)**  
 i. The current market state (e.g., does it always favor Product A in Market State 1?).  
 ii. The remaining raw material (does it produce more of the cheaper product if raw material is low?).  
 iii. The remaining days (does it become more aggressive on the last day?).

#### --- 3. Simulation and Policy Analysis --- (1 Mark)

In [58]:
def action_to_str(action_idx):
    action_map = {
        0: "2A_0B",
        1: "1A_2B",
        2: "0A_5B",
        3: "3A_0B",
        4: "0A_0B"
    }
    return action_map.get(action_idx, "Invalid")

def print_policy_grid_table(policy, market_state, num_days=5, rm_array=None, day_array=None):
    data = []
    rm_array = rm_array if rm_array else range(11)
    day_array = day_array if day_array else range(num_days)
    for rm in rm_array:  # raw material from 0 to 10
        row = {}
        row["Raw Material"] = rm
        for day in day_array:
            state = (day, rm, market_state)
            if state in policy:
                row[f"Day {day}"] = action_to_str(policy[state])
            else:
                row[f"Day {day}"] = "---"
        data.append(row)
    df = pd.DataFrame(data)

    print(f"\nOptimal Policy Grid for Market State: '{market_state}'\n")
    print(df.to_markdown(index=False))

#### i. Policy Behavior Based on Market State
Check if Product A is consistently favored in Market State 1 (A price high), and B in Market State 2 (B price high).

In [59]:
print_policy_grid_table(pi_policy, market_state=1)
print_policy_grid_table(pi_policy, market_state=2)


Optimal Policy Grid for Market State: '1'

|   Raw Material | Day 0   | Day 1   | Day 2   | Day 3   | Day 4   |
|---------------:|:--------|:--------|:--------|:--------|:--------|
|              0 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              1 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              2 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              3 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              4 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              5 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              6 | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   |
|              7 | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   |
|              8 | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   |
|              9 | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   |
|             10 | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   |

Optimal Policy Grid for Market State: '2'

(table output truncated)

**Observation:**   
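In Market State 1 the policy favors Product A, whose selling price (8) is well above Product B's (2); with the full 10 units of raw material it picks 3A\_0B for a reward of 24. In Market State 2 the prices flip (A: 3, B: 5) and the policy shifts toward Product B. The 2A\_0B entries for raw material below 4 do not reflect a real preference: no production action is affordable at those levels, so every action earns 0.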


#### ii. Policy Behavior Based on Raw Material
Check whether the agent shifts to the cheaper Product B when raw material (RM) is low.

In [60]:
print_policy_grid_table(pi_policy, market_state=1, rm_array=range(5))
print_policy_grid_table(pi_policy, market_state=2, rm_array=range(5))


Optimal Policy Grid for Market State: '1'

|   Raw Material | Day 0   | Day 1   | Day 2   | Day 3   | Day 4   |
|---------------:|:--------|:--------|:--------|:--------|:--------|
|              0 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              1 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              2 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              3 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              4 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |

Optimal Policy Grid for Market State: '2'

|   Raw Material | Day 0   | Day 1   | Day 2   | Day 3   | Day 4   |
|---------------:|:--------|:--------|:--------|:--------|:--------|
|              0 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              1 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              2 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              3 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |

(table output truncated)

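**Observation:**   
With less than 4 units of raw material no production action is feasible, because the cheapest ones (2A\_0B and 1A\_2B) each need 4 units. Every action then earns 0, and the 2A\_0B shown in those rows is only the first action in the tie, in both market states; it is not a real preference for Product A. In Market State 1 with 4 or 5 units, 2A\_0B (reward 16) is the best feasible action, since 3A\_0B needs 6 units, so the agent does not switch to the cheaper Product B when raw material is low. In Market State 2 the environment's reward rule makes 1A\_2B (reward 13) the best feasible action at 4 units and 0A\_5B (reward 25) the best from 5 units upward. Note that these low raw material states are never visited in simulation, because the raw material resets to 10 every day.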
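
A quick way to check which actions are actually feasible at each raw material level is to apply the environment's reward rule (revenue if the cost fits, otherwise 0) and list every action that attains the maximum. The sketch below is standalone; the constant names and the `best_actions` helper are hypothetical, and the numbers are copied from `SmartSupplierEnv`.

```python
# Costs, prices, and actions copied from SmartSupplierEnv (names are illustrative).
PRODUCT_COST = {"A": 2, "B": 1}
MARKET_PRICES = {1: {"A": 8, "B": 2}, 2: {"A": 3, "B": 5}}
ACTIONS = {0: (2, 0), 1: (1, 2), 2: (0, 5), 3: (3, 0), 4: (0, 0)}


def best_actions(rm, market):
    """Return (best one-day reward, list of action ids that achieve it)."""
    rewards = {}
    for action_id, (a_units, b_units) in ACTIONS.items():
        cost = a_units * PRODUCT_COST["A"] + b_units * PRODUCT_COST["B"]
        prices = MARKET_PRICES[market]
        revenue = a_units * prices["A"] + b_units * prices["B"]
        # Same rule as the environment: an unaffordable action earns nothing.
        rewards[action_id] = revenue if cost <= rm else 0
    best = max(rewards.values())
    return best, [a for a, r in rewards.items() if r == best]


for market in (1, 2):
    for rm in range(0, 7):
        reward, tied = best_actions(rm, market)
        print(f"market {market}, rm {rm}: best reward {reward}, actions {tied}")
```

For raw material 0 to 3 the helper returns reward 0 with all five actions tied, which is why the policy grids show action 0 (2A\_0B) there: both agents keep the first action that attains the maximum.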


#### iii. Policy Behavior Based on Remaining Days

In [61]:
print_policy_grid_table(pi_policy, market_state=1)
print_policy_grid_table(pi_policy, market_state=2)


Optimal Policy Grid for Market State: '1'

|   Raw Material | Day 0   | Day 1   | Day 2   | Day 3   | Day 4   |
|---------------:|:--------|:--------|:--------|:--------|:--------|
|              0 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              1 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              2 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              3 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              4 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              5 | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   | 2A_0B   |
|              6 | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   |
|              7 | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   |
|              8 | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   |
|              9 | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   |
|             10 | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   | 3A_0B   |

Optimal Policy Grid for Market State: '2'

(table output truncated)

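**Observation:**    
The policy is identical on every day, including the last. The raw material resets to 10 at the start of each day and the next market state does not depend on the action taken, so today's choice cannot affect future profit. The agent simply maximizes each day's reward, and there is nothing to be more aggressive about on the last day.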


#### **4. Performance Evaluation: (1 Mark)**
i. Calculate the state-value function (V∗) for key states (e.g. start of Day 1, Market State 1, 10 RM).

In [62]:
# print_value_for_key_states
def print_value_for_key_states(V, key_states):

    print("\nState-Value Function (V*) for Key States:")
    for state in key_states:
        value = V.get(state, 0.0)
        print(f"State (day: {state[0] +1}, rm: {state[1]}, market_state: {state[2]}): V* = {value:.2f}")


key_states = [
        (0, 10, 1),
        (0, 10, 2),
        (2, 5, 1),
        (4, 10, 1),
        (4, 10, 2)
    ]
print_value_for_key_states(vi_values, key_states)



State-Value Function (V*) for Key States:
State (day: 1, rm: 10, market_state: 1): V* = 109.75
State (day: 1, rm: 10, market_state: 2): V* = 110.75
State (day: 3, rm: 5, market_state: 1): V* = 52.75
State (day: 5, rm: 10, market_state: 1): V* = 12.00
State (day: 5, rm: 10, market_state: 2): V* = 12.50
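
These values can be cross-checked by hand. On the last day the best action in Market State 1 earns 3 × 8 = 24, yet the output reports 12.00. That factor of one half matches the 0.5 weight that `ValueIterationAgent.run` applies to the single terminal outcome returned by `transition`, and it carries back through the earlier days. The sketch below is an independent backward-induction pass, not the notebook's agent. It assumes the two market states are equally likely each day (as the agents do) and weights the last day's reward by 1; the names `one_day_reward` and `backward_induction` are hypothetical.

```python
# Costs, prices, and actions copied from SmartSupplierEnv (names are illustrative).
PRODUCT_COST = {"A": 2, "B": 1}
MARKET_PRICES = {1: {"A": 8, "B": 2}, 2: {"A": 3, "B": 5}}
ACTIONS = {0: (2, 0), 1: (1, 2), 2: (0, 5), 3: (3, 0), 4: (0, 0)}


def one_day_reward(rm, market, action_id):
    """Revenue of the action, or 0 if it needs more raw material than rm."""
    a_units, b_units = ACTIONS[action_id]
    cost = a_units * PRODUCT_COST["A"] + b_units * PRODUCT_COST["B"]
    if cost > rm:
        return 0
    prices = MARKET_PRICES[market]
    return a_units * prices["A"] + b_units * prices["B"]


def backward_induction(num_days=5, daily_rm=10, gamma=1.0):
    """Compute V(day, rm, market) in one sweep from the last day to the first."""
    markets = list(MARKET_PRICES)
    prob = 1.0 / len(markets)  # assumption: market states are equally likely
    values = {}
    for day in reversed(range(num_days)):
        # Raw material resets daily, so tomorrow's value is the same for every action.
        if day + 1 < num_days:
            future = sum(prob * values[(day + 1, daily_rm, k)] for k in markets)
        else:
            future = 0.0
        for rm in range(daily_rm + 1):
            for market in markets:
                values[(day, rm, market)] = max(
                    one_day_reward(rm, market, a) + gamma * future for a in ACTIONS
                )
    return values


values = backward_induction()
for state in [(0, 10, 1), (0, 10, 2), (2, 5, 1), (4, 10, 1), (4, 10, 2)]:
    print(state, values[state])
```

Under these assumptions the same key states evaluate to 122, 123, 65, 24, and 25. The start-of-horizon values of 122 and 123 are consistent with the simulated average profit of about 122.5 reported in the next part.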


ii. Simulate the learned policy over multiple runs (e.g., 1000 runs of 5 days each) and calculate the average total profit achieved.

In [63]:
# simulate policy function - Simulates the learned policy over multiple runs to evaluate performance
def simulate_policy(env, policy, episodes=1000):
    total_rewards = []
    for episode_count in range(episodes):
        # print(f"episode_count: {episode_count}")
        state = env.reset()
        total = 0
        done = False
        while not done:
            action = policy.get(state, 4)  # default to 'do nothing'
            next_state, reward, done = env.step(action)
            total += reward
            state = next_state if next_state is not None else state
        total_rewards.append(total)
    print(total_rewards)
    return np.mean(total_rewards)


In [64]:
from IPython.display import Markdown, display

def bold(text):
    display(Markdown(f"**{text}**"))

In [65]:

# Simulate policy to get average rewards
print("\nSimulating Value Iteration Policy...")
average_profit_vi = simulate_policy(env, vi_policy, episodes=1000)
bold(f"Average Profit using Value Iteration Policy: {average_profit_vi:.2f}")

print("\nSimulating Policy Iteration Policy...")
average_profit_pi = simulate_policy(env, pi_policy, episodes=1000)
bold(f"Average Profit using Policy Iteration Policy: {average_profit_pi:.2f}")


Simulating Value Iteration Policy...
[120, 121, 123, 124, 121, 122, 123, 124, 124, 124, ...]  (output truncated)

**Average Profit using Value Iteration Policy: 122.55**


Simulating Policy Iteration Policy...
[121, 122, 122, 123, 122, 122, 122, 125, 121, 124, ...]  (output truncated)

**Average Profit using Policy Iteration Policy: 122.49**

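**Observation:**  
Because the raw material resets to 10 units every day, the policy earns the maximum daily profit for whichever market state occurs: 24 in Market State 1 (3A\_0B) or 25 in Market State 2 (0A\_5B). Each 5-day episode therefore totals between 120 and 125, depending on how many days fall in Market State 2. With both market states equally likely, the expected total is 5 × 24.5 = 122.5, which agrees with the simulated averages of 122.55 and 122.49.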
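
The simulated averages can be compared with an exact calculation. Assuming each of the 5 days independently lands in Market State 1 or 2 with probability 0.5 (which is how `reset` and `step` draw the market state), and that the policy earns 24 or 25 on those days, the episode total is 120 plus the number of State 2 days. The sketch below computes that distribution; the function name `profit_distribution` is hypothetical.

```python
from math import comb


def profit_distribution(num_days=5, p_state2=0.5, reward_state1=24, reward_state2=25):
    """Map each possible episode total to its probability (binomial in State 2 days)."""
    dist = {}
    for n2 in range(num_days + 1):
        total = reward_state1 * (num_days - n2) + reward_state2 * n2
        prob = comb(num_days, n2) * p_state2 ** n2 * (1 - p_state2) ** (num_days - n2)
        dist[total] = prob
    return dist


dist = profit_distribution()
expected = sum(total * prob for total, prob in dist.items())
for total in sorted(dist):
    print(f"total {total}: probability {dist[total]:.5f}")
print(f"expected total profit: {expected}")
```

The possible totals are 120 through 125, with 122 and 123 the most likely, and the expected value is 122.5.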


#### **5. Impact of Dynamics: Compare the optimal policy learned in this dynamic environment to what you might expect if the market prices for Product A and Product B were always fixed (e.g., if it was always Market State 1 every day). How does the agent's strategy adapt or change when the market can shift unexpectedly, versus if it were always the same? (1 Mark)**

#### --- 5. Impact of Dynamics Analysis --- (1 Mark)

In [66]:
# Discusses the impact of dynamic market prices on the optimal policy.
# This section should primarily be a written explanation in the report.

**Fixed Market State 1**

In [67]:
# --- Main Execution ---
# Instantiate environment
fixed_prices_env= SmartSupplierEnv(num_days=5, initial_rm=10, fixed_market=1)

# Run Value Iteration
print("\nInstantiating Value Iteration ...")
fixed_prices_vi_agent = ValueIterationAgent(fixed_prices_env, gamma=1.0, theta=1e-4)
fixed_prices_vi_policy, fixed_prices_vi_values = fixed_prices_vi_agent.run()
print("Value Iteration run Successfully.")

# Run Policy Iteration
print("\nInstantiating Policy Iteration ...")
fixed_prices_pi_agent = PolicyIterationAgent(fixed_prices_env, gamma=1.0)
fixed_prices_pi_policy, fixed_prices_pi_values = fixed_prices_pi_agent.run()
print("Policy Iteration run Successfully.")

market_states:  {1: {'A_price': 8, 'B_price': 2}}

Instantiating Value Iteration ...
delta: 24.0, theta: 0.0001
delta: 12.0, theta: 0.0001
delta: 6.0, theta: 0.0001
delta: 3.0, theta: 0.0001
delta: 0.75, theta: 0.0001
delta: 0, theta: 0.0001
Value Iteration run Successfully.

Instantiating Policy Iteration ...
delta: 24.0
delta: 8.0
delta: 4.0
delta: 2.0
delta: 0.375
delta: 0
delta: 24.0
delta: 7.0
delta: 3.5
delta: 1.0
delta: 0.375
delta: 0
Policy Iteration run Successfully.


In [68]:

# Simulate policy to get average rewards
print("\nSimulating Value Iteration Policy...")
average_profit_vi = simulate_policy(fixed_prices_env, fixed_prices_vi_policy, episodes=1000)
bold(f"Average Profit using Value Iteration Policy: {average_profit_vi:.2f}")

print("\nSimulating Policy Iteration Policy...")
average_profit_pi = simulate_policy(fixed_prices_env, fixed_prices_pi_policy, episodes=1000)
bold(f"Average Profit using Policy Iteration Policy: {average_profit_pi:.2f}")



Simulating Value Iteration Policy...
[120, 120, 120, 120, 120, 120, 120, 120, 120, 120, ...]  (output truncated)

**Average Profit using Value Iteration Policy: 120.00**


Simulating Policy Iteration Policy...
[120, 120, 120, 120, 120, 120, 120, 120, 120, 120, ...]  (output truncated)

**Average Profit using Policy Iteration Policy: 120.00**

**Fixed Market State 2**

In [69]:
# --- Main Execution ---
# Instantiate environment
fixed_prices_env= SmartSupplierEnv(num_days=5, initial_rm=10, fixed_market=2)

# Run Value Iteration
print("\nInstantiating Value Iteration ...")
fixed_prices_vi_agent = ValueIterationAgent(fixed_prices_env, gamma=1.0, theta=1e-4)
fixed_prices_vi_policy, fixed_prices_vi_values = fixed_prices_vi_agent.run()
print("Value Iteration run Successfully.")

# Run Policy Iteration
print("\nInstantiating Policy Iteration ...")
fixed_prices_pi_agent = PolicyIterationAgent(fixed_prices_env, gamma=1.0)
fixed_prices_pi_policy, fixed_prices_pi_values = fixed_prices_pi_agent.run()
print("Policy Iteration run Successfully.")

market_states:  {2: {'A_price': 3, 'B_price': 5}}

Instantiating Value Iteration ...
delta: 25.0, theta: 0.0001
delta: 12.5, theta: 0.0001
delta: 6.25, theta: 0.0001
delta: 3.125, theta: 0.0001
delta: 0.78125, theta: 0.0001
delta: 0, theta: 0.0001
Value Iteration run Successfully.

Instantiating Policy Iteration ...
delta: 25.0
delta: 12.5
delta: 6.25
delta: 3.125
delta: 0
delta: 25.0
delta: 9.5
delta: 4.75
delta: 1.5625
delta: 0.78125
delta: 0
Policy Iteration run Successfully.


In [70]:

# Simulate policy to get average rewards
print("\nSimulating Value Iteration Policy...")
average_profit_vi = simulate_policy(fixed_prices_env, fixed_prices_vi_policy, episodes=1000)
bold(f"Average Profit using Value Iteration Policy: {average_profit_vi:.2f}")

print("\nSimulating Policy Iteration Policy...")
average_profit_pi = simulate_policy(fixed_prices_env, fixed_prices_pi_policy, episodes=1000)
bold(f"Average Profit using Policy Iteration Policy: {average_profit_pi:.2f}")



Simulating Value Iteration Policy...
[125, 125, 125, 125, 125, 125, 125, 125, 125, 125, ...]  (output truncated)

**Average Profit using Value Iteration Policy: 125.00**


Simulating Policy Iteration Policy...
[125, 125, 125, 125, 125, 125, 125, 125, 125, 125, ...]  (output truncated)

**Average Profit using Policy Iteration Policy: 125.00**

**Observation:**  
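In a fixed Market State 1 scenario, the profit per day is always ₹24, achieved by the optimal action (3A\_0B): 3 units × ₹8 = ₹24. Similarly, in a fixed Market State 2, the daily profit is always ₹25, from the action (0A\_5B): 5 units × ₹5 = ₹25. Since the raw material resets to 10 units each day, the total profit over 5 days is fixed at ₹120 in Market State 1 and ₹125 in Market State 2, so the average over 1000 episodes is exactly ₹120 and ₹125 respectively, the maximum achievable in each fixed market condition.

In the dynamic environment the agent uses the same two actions but chooses between them according to the observed market state: 3A\_0B in Market State 1 and 0A\_5B in Market State 2. Because the raw material resets daily and the next market state does not depend on the action taken, uncertainty about future prices does not change today's best action. It only makes the total profit vary between ₹120 and ₹125, averaging about ₹122.5, which lies between the two fixed-market totals.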
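
One way to see why adapting to the market state matters is to ask what a policy tuned for a single fixed market would earn if the market actually shifted. The sketch below is an illustrative calculation, not an experiment run in this notebook. It assumes equally likely market states and 10 units of raw material each day; the policy dictionaries and the `expected_daily_profit` helper are hypothetical.

```python
# Costs, prices, and actions copied from SmartSupplierEnv (names are illustrative).
PRODUCT_COST = {"A": 2, "B": 1}
MARKET_PRICES = {1: {"A": 8, "B": 2}, 2: {"A": 3, "B": 5}}
ACTIONS = {0: (2, 0), 1: (1, 2), 2: (0, 5), 3: (3, 0), 4: (0, 0)}

# Each policy maps an observed market state to an action id.
ADAPTIVE = {1: 3, 2: 2}   # 3A_0B in State 1, 0A_5B in State 2
ALWAYS_A = {1: 3, 2: 3}   # best action for a market fixed at State 1
ALWAYS_B = {1: 2, 2: 2}   # best action for a market fixed at State 2


def expected_daily_profit(policy, daily_rm=10):
    """Expected one-day profit when both market states are equally likely."""
    total = 0.0
    for market, prices in MARKET_PRICES.items():
        a_units, b_units = ACTIONS[policy[market]]
        cost = a_units * PRODUCT_COST["A"] + b_units * PRODUCT_COST["B"]
        revenue = a_units * prices["A"] + b_units * prices["B"]
        total += (revenue if cost <= daily_rm else 0) / len(MARKET_PRICES)
    return total


for name, policy in [("adaptive", ADAPTIVE), ("always 3A_0B", ALWAYS_A), ("always 0A_5B", ALWAYS_B)]:
    daily = expected_daily_profit(policy)
    print(f"{name}: {daily} per day, {5 * daily} over 5 days")
```

Under these assumptions the adaptive policy averages 24.5 per day, while always producing 3A\_0B averages 16.5 and always producing 0A\_5B averages 17.5. The gap is the value of observing the market state before committing to production.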

