<h1 style="text-align: center">
Sina Hatami 5447389
</h1>

<h2 style="text-align: center">
Title: Recycling robot
</h2>

<p>
<b>Description: </b>
Recycling robots are autonomous machines designed to automate the process of recycling waste materials. These robots are equipped with sensors, cameras, and robotic arms to identify and sort different types of recyclable items such as plastics, glass, metal, and paper. They help improve the efficiency and accuracy of recycling operations by reducing human labor and errors.
The goal of recycling robots is to enhance recycling processes, increase recycling rates, reduce contamination in recycling streams, and ultimately contribute to environmental sustainability by promoting proper waste management and resource conservation.
</p>

In [None]:
import numpy as np
import matplotlib.pyplot as plt

<p>
First, we build an environment that generates the items and the bins, moves the agent, lets it collect items and recharge its battery, and executes whichever action is chosen.
</p>

In [None]:
class RecyclingRobotEnvironment:
    def __init__(self, rows, columns, num_items, num_bins, initial_battery_level):
        self.rows = rows
        self.columns = columns
        self.num_items = num_items
        self.num_bins = num_bins
        self.battery_capacity = initial_battery_level
        self.grid = np.zeros((rows, columns), dtype=int)
        self.agent_position = (0, 0)  # Initial position of the recycling robot
        self.battery_level = initial_battery_level

        # Initialize the recycling items and bins
        self.items = self.generate_items()
        self.bins = self.generate_bins()

        # Define the available actions
        self.actions = ['up', 'down', 'left', 'right', 'collect', 'recharge']

    def generate_items(self):
        items = []  # Local list, so separate environments never share items
        for _ in range(self.num_items):
            # Generate random item type and location
            item_type = np.random.randint(1, self.num_bins + 1)
            row = np.random.randint(0, self.rows)
            col = np.random.randint(0, self.columns)
            items.append({'type': item_type, 'position': (row, col)})
            self.grid[row][col] = item_type
        return items

    def generate_bins(self):
        bins = []
        for i in range(1, self.num_bins + 1):
            # Generate random bin location
            row = np.random.randint(0, self.rows)
            col = np.random.randint(0, self.columns)
            bins.append({'type': i, 'position': (row, col)})
            self.grid[row][col] = -i  # Negative value represents a recycling bin
        return bins

    def move_agent(self, action):
        x, y = self.agent_position

        if action == 'up' and x > 0:
            self.agent_position = (x - 1, y)
        elif action == 'down' and x < self.rows - 1:
            self.agent_position = (x + 1, y)
        elif action == 'left' and y > 0:
            self.agent_position = (x, y - 1)        
        elif action == 'right' and y < self.columns - 1:
            self.agent_position = (x, y + 1)

    def collect_item(self):
        x, y = self.agent_position
        item_type = self.grid[x][y]
        if item_type > 0:  # Positive value represents an item
            self.grid[x][y] = 0  # Remove the item from the grid
            return item_type
        return None

    def recharge_battery(self):
        self.battery_level = self.battery_capacity

    def is_not_zero(self):
        return self.battery_level > 0

    def is_zero(self):
        return self.battery_level == 0

    def get_state(self):
        x, y = self.agent_position
        return (x, y, self.battery_level)

    def execute_action(self, action):
        if action == 'recharge':
            self.recharge_battery()
            return
        if action == 'collect':
            self.collect_item()
        else:
            self.move_agent(action)
        # Every action other than recharging drains the battery, never below zero
        if self.is_not_zero():
            self.battery_level -= 1

    # The episode ends once no items remain on the grid
    # (collect_item clears grid cells but never shrinks self.items)
    def is_terminal(self):
        return not np.any(self.grid > 0)

<p>
Next, we define three reward functions (static, normally distributed, and randomly chosen from fixed values) so that the reward schemes can be measured and compared.
</p>

In [None]:
def get_reward(self):
    if self.battery_level == 0:
        return -20
    x, y = self.agent_position
    cell_value = self.grid[x][y]
    if cell_value < 0:  # Negative value represents a recycling bin
        bin_type = abs(cell_value)
        if bin_type == self.items[-1]['type']:  # Correct item type for the bin
            return 10  # Assign a positive reward for correct item type
        else:
            return -10  # Assign a negative reward for incorrect item type
    else:
        return -0.1 # Assign a small negative reward for each action

def get_reward_random(self):
    if self.battery_level == 0:
        return -20
    x, y = self.agent_position
    cell_value = self.grid[x][y]
    if cell_value < 0:  # Negative value represents a recycling bin
        bin_type = abs(cell_value)
        if bin_type == self.items[-1]['type']:  # Correct item type for the bin
            return np.random.normal(10, 2)  # Random reward from a normal distribution
        else:
            return np.random.normal(-10, 2)  # Random reward from a normal distribution
    else:
        return -0.1  # Small negative reward for each action        
    
def get_reward_random_choise(self):
    if self.battery_level == 0:
        return -20
    x, y = self.agent_position
    cell_value = self.grid[x][y]

    if cell_value < 0:  # Negative value represents a recycling bin
        bin_type = abs(cell_value)
        if bin_type == self.items[-1]['type']:  # Correct item type for the bin
            # Higher positive reward for correct item type
            return np.random.choice([10, 20, 30])
        else:
            # Lower negative reward for incorrect item type
            return np.random.choice([-10, -20, -30])
    else:
        # Small negative reward for each action
        return np.random.choice([-1, -2, -3])

<h3 style="color: blue">
1. Basic
</h3>
<p>
Experiment with a baseline set of parameters.
</p>

In [None]:
# Define parameters
num_episodes = 100000
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.1
max_steps = 100
initial_battery_level = 100 # Initial battery level for the robot
env = RecyclingRobotEnvironment(rows=500, columns=500, num_items=40, num_bins=20, initial_battery_level=initial_battery_level)

In [None]:
# Track the rewards for each episode
rewards = []
def run_episode(env, reward_type):
    # Size the Q-table from the environment: one row per (x, y, battery) state
    num_states = (env.rows * env.columns) * (env.battery_capacity + 1)
    num_actions = len(env.actions)

    # Initialize the Q-table
    q_table = np.zeros((num_states, num_actions))

    for episode in range(num_episodes):
        state = env.get_state()
        episode_reward = 0

        for _ in range(max_steps):
            if np.random.uniform(0, 1) < epsilon:
                action = np.random.choice(env.actions)
            else:
                state_idx = np.ravel_multi_index(state, (env.rows, env.columns, env.battery_capacity + 1))
                action_idx = np.argmax(q_table[state_idx])
                action = env.actions[action_idx]

            env.execute_action(action)
            next_state = env.get_state()

            if reward_type == 'static':
                reward = get_reward(env)
            elif reward_type == 'random':
                reward = get_reward_random(env)
            elif reward_type == 'random_choise':
                reward = get_reward_random_choise(env)
            else:
                raise ValueError(f"Unknown reward_type: {reward_type!r}")

            state_idx = np.ravel_multi_index(state, (env.rows, env.columns, env.battery_capacity + 1))
            next_state_idx = np.ravel_multi_index(next_state, (env.rows, env.columns, env.battery_capacity + 1))
            action_idx = env.actions.index(action)

            q_value = q_table[state_idx, action_idx]
            max_q_value = np.max(q_table[next_state_idx])
            new_q_value = (1 - learning_rate) * q_value + learning_rate * (reward + discount_factor * max_q_value)
            q_table[state_idx, action_idx] = new_q_value

            state = next_state
            episode_reward += reward

            if env.is_terminal():
                break

        rewards.append(episode_reward)

        if (episode + 1) % 100 == 0:
            average_reward = sum(rewards[-100:]) / 100
            print(f"Episode {episode+1}: Average Reward = {average_reward}")

In [None]:
def plot():
    # Average over the episodes actually recorded, not the configured count
    average_reward = sum(rewards) / len(rewards)
    print(f"Average Reward: {average_reward}")
    # Plot rewards
    plt.plot(rewards)
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title('Reward per Episode')
    plt.show()

In [None]:
rewards = []
run_episode(env, 'static')
plot()

In [None]:
rewards = []
run_episode(env, 'random')
plot()

In [None]:
rewards = []
run_episode(env, 'random_choise')
plot()

<h3 style="color: green">
Comment:
</h3>
<p>
When we use a random reward instead of a static reward, the randomization can help in the early stages of learning, while the agent is still exploring the environment: it gives the agent a chance to stumble upon actions that may lead to better long-term rewards. As learning progresses, however, a more deterministic reward structure, such as the static reward, can guide the agent towards more consistent and optimal behavior.
</p>
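<p>
As a rough check on this comparison, the sketch below samples the reward for a correctly binned item under each of the three schemes defined earlier. It is only an illustration: the seed and the sample size are assumptions, and it draws from the distributions directly instead of running the agent. Note that <code>np.random.choice([10, 20, 30])</code> averages 20 rather than 10, so the 'random_choise' scheme changes the size of the reward as well as its variance.
</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the sketch is repeatable
n = 100_000                     # number of samples per scheme (an assumption)

# Reward for placing an item in the correct bin under each scheme
static = np.full(n, 10.0)                  # get_reward: always 10
normal = rng.normal(10, 2, size=n)         # get_reward_random: N(10, 2)
choice = rng.choice([10, 20, 30], size=n)  # get_reward_random_choise: one of three values

for name, samples in [("static", static), ("random", normal), ("random_choise", choice)]:
    print(f"{name:>14}: mean={samples.mean():.2f}, std={samples.std():.2f}")
```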
<hr></hr>

<h3 style="color: blue">
2. Complex Environment
</h3>
<p>
Experiment with a larger and more complex environment:
</p>
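<p>
Because the Q-table has one row per (x, y, battery) state, enlarging the grid also enlarges the table. The sketch below estimates the memory of a dense float64 table for both grid sizes; <code>q_table_bytes</code> is a hypothetical helper written for this note and is not part of the experiment code.
</p>

```python
import numpy as np

def q_table_bytes(rows, columns, battery_capacity, num_actions=6):
    """Memory needed for a dense float64 Q-table over (x, y, battery) states."""
    num_states = rows * columns * (battery_capacity + 1)
    return num_states * num_actions * np.dtype(np.float64).itemsize

for rows, columns in [(500, 500), (600, 600)]:
    size_gb = q_table_bytes(rows, columns, battery_capacity=100) / 1e9
    print(f"{rows}x{columns} grid: {size_gb:.2f} GB")
```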

In [None]:
# Define parameters
num_episodes = 100000
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.1
max_steps = 100
initial_battery_level = 100 # Initial battery level for the robot
env = RecyclingRobotEnvironment(rows=600, columns=600, num_items=50, num_bins=50, initial_battery_level=initial_battery_level)

In [None]:
rewards = []
run_episode(env, 'static')
plot()

In [None]:
rewards = []
run_episode(env, 'random')
plot()

In [None]:
rewards = []
run_episode(env, 'random_choise')
plot()

<h3 style="color: green">
Comment:
</h3>
<p>
Because of its randomness, the random reward does not work well in this setting. <b>The agent requires more structured and informative feedback to learn effective strategies.</b>
</p>
<hr></hr>

<h3 style="color: blue">
3. Change the initial battery level:
</h3>
<p>
Reduce the initial battery level and see what happens.
</p>

In [None]:
initial_battery_level = 20 # Initial battery level for the robot
env = RecyclingRobotEnvironment(rows=500, columns=500, num_items=40, num_bins=20, initial_battery_level=initial_battery_level)

In [None]:
rewards = []
run_episode(env, 'static')
plot()

In [None]:
rewards = []
run_episode(env, 'random')
plot()

In [None]:
rewards = []
run_episode(env, 'random_choise')
plot()

<h3 style="color: green">
Comment:
</h3>
<p>
In my environment, the battery level plays a crucial role in determining the agent's actions and behavior. By reducing the initial battery level with the random reward, I am making the agent more sensitive to its battery consumption. <b> The agent needs to be more strategic and careful in planning its actions to ensure it doesn't run out of battery too quickly. </b> This increased sensitivity can lead to better decision-making and more efficient paths, ultimately resulting in higher rewards.
</p>

In [None]:
rewards = []
initial_battery_level = 200 # Initial battery level for the robot
env = RecyclingRobotEnvironment(rows=500, columns=500, num_items=40, num_bins=20, initial_battery_level=initial_battery_level)

In [None]:
run_episode(env, 'static')
plot()

In [None]:
rewards = []  # Reset so this run's rewards are not mixed with the previous run
run_episode(env, 'random')
plot()

In [None]:
rewards = []
run_episode(env, 'random_choise')
plot()

<h3 style="color: green">
Comment:
</h3>
<p>
<b> A large battery capacity is not ideal for the random reward because it allows the agent to explore and take random actions for a longer duration without experiencing the consequences of low battery. </b> Learning from random rewards relies on exploration to discover good actions, but with a large battery capacity the agent can keep exploring randomly without ever facing the low-battery penalty, which hinders its ability to learn and make informed decisions.
</p>
<hr></hr>

<h3 style="color: blue">
4. Adding a penalty for distance from the center:
</h3>
<p>
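I added a new term to the reward: when the battery is below 10% of its capacity, each step is penalized in proportion to the agent's distance from the center of the grid. This pushes the agent to stay near the center when its battery is low. Then we see what happens.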
</p>

In [None]:
def get_reward(self):
    if self.battery_level == 0:
        return -20
    x, y = self.agent_position
    cell_value = self.grid[x][y]
    if cell_value < 0:  # Negative value represents a recycling bin
        bin_type = abs(cell_value)
        if bin_type == self.items[-1]['type']:  # Correct item type for the bin
            return 10  # Assign a positive reward for correct item type
        else:
            return -10  # Assign a negative reward for incorrect item type
    else:
        if self.battery_level < 0.1 * self.battery_capacity:
            distance_to_center = abs(x - self.rows // 2) + abs(y - self.columns // 2)  # Manhattan distance between the agent's position and the grid center
            distance_to_center_normalized = distance_to_center / (self.rows + self.columns)  # Normalized distance; at most 0.5, reached at a corner
            return -1 - distance_to_center_normalized  # Negative reward with distance penalty
        return -0.1 # Assign a small negative reward for each action
    
def get_reward_random(self):
    if self.battery_level == 0:
        return -20
    x, y = self.agent_position
    cell_value = self.grid[x][y]
    if cell_value < 0:  # Negative value represents a recycling bin
        bin_type = abs(cell_value)
        if bin_type == self.items[-1]['type']:  # Correct item type for the bin
            return np.random.normal(10, 2)  # Random reward from a normal distribution
        else:
            return np.random.normal(-10, 2)  # Random reward from a normal distribution
    else:
        if self.battery_level < 0.1 * self.battery_capacity:
            distance_to_center = abs(x - self.rows // 2) + abs(y - self.columns // 2)
            distance_to_center_normalized = distance_to_center / (self.rows + self.columns)
            return -1 - distance_to_center_normalized  # Negative reward with distance penalty
        return -0.1  # Small negative reward for each action
    
    
def get_reward_random_choise(self):
    if self.battery_level == 0:
        return -20
    x, y = self.agent_position
    cell_value = self.grid[x][y]

    if cell_value < 0:  # Negative value represents a recycling bin
        bin_type = abs(cell_value)
        if bin_type == self.items[-1]['type']:  # Correct item type for the bin
            # Higher positive reward for correct item type
            return np.random.choice([10, 20, 30])
        else:
            # Lower negative reward for incorrect item type
            return np.random.choice([-10, -20, -30])
    else:
        if self.battery_level < 0.1 * self.battery_capacity:
            distance_to_center = abs(x - self.rows // 2) + abs(y - self.columns // 2)
            distance_to_center_normalized = distance_to_center / (self.rows + self.columns)
            return -1 - distance_to_center_normalized  # Negative reward with distance penalty
        # Small negative reward for each action
        return np.random.choice([-1, -2, -3])

In [None]:
initial_battery_level = 100 # Initial battery level for the robot
env = RecyclingRobotEnvironment(rows=500, columns=500, num_items=40, num_bins=20, initial_battery_level=initial_battery_level)

In [None]:
rewards = []  # Reset so this run's rewards are not mixed with the previous run
run_episode(env, 'static')
plot()

In [None]:
rewards = []  # Reset so this run's rewards are not mixed with the previous run
run_episode(env, 'random')
plot()

In [None]:
rewards = []  # Reset so this run's rewards are not mixed with the previous run
run_episode(env, 'random_choise')
plot()

<h3 style="color: green">
Comment:
</h3>
<p>
Efficient exploration: Adding center distance to the reward can promote more efficient exploration of the environment. By prioritizing actions that bring the agent closer to the center, it is likely to explore regions of the grid that are closer to potential targets or valuable areas. <b> This can help the agent discover important locations or resources more quickly, leading to improved performance.</b>

Enhanced path planning: The center distance can act as an additional heuristic or guide for path planning. By considering the distance to the center, the agent can make more informed decisions about the direction it should take. This can help it navigate more efficiently towards the center or other strategic locations within the environment, resulting in better overall performance.
</p>
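<p>
To make the size of the penalty concrete, the sketch below evaluates the low-battery step reward at the center and at a corner of the 500 x 500 grid. <code>low_battery_penalty</code> is a hypothetical helper that copies the formula from the reward functions in this section; it only applies when the battery is below 10% of capacity.
</p>

```python
def low_battery_penalty(x, y, rows, columns):
    """Step reward used when the battery is below 10% of capacity."""
    distance_to_center = abs(x - rows // 2) + abs(y - columns // 2)
    return -1 - distance_to_center / (rows + columns)

rows = columns = 500
center_reward = low_battery_penalty(250, 250, rows, columns)
corner_reward = low_battery_penalty(0, 0, rows, columns)
print(f"center: {center_reward}, corner: {corner_reward}")
```

<p>
Even at the center, this reward is ten times the ordinary step cost of -0.1, so the low-battery state itself is penalized heavily and the distance term adds at most a further 0.5.
</p>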
<hr></hr>

<h3 style="color: blue">
5. Learning Rate:
</h3>
<p>
This parameter determines the extent to which newly acquired information overrides existing information in the model during the training process.
</p>
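<p>
The sketch below isolates the update rule used in <code>run_episode</code> and applies it repeatedly to a single Q-value for each learning rate tried in this section. The fixed reward of 10 and the zero next-state value are assumptions chosen to keep the arithmetic simple; <code>q_update</code> is a hypothetical helper.
</p>

```python
def q_update(q_value, reward, max_next_q, learning_rate, discount_factor=0.9):
    """One tabular Q-learning update, written as in run_episode."""
    target = reward + discount_factor * max_next_q
    return (1 - learning_rate) * q_value + learning_rate * target

for lr in (0.01, 0.1, 0.3):
    q = 0.0
    for _ in range(10):
        # Same transition every time: reward 10, no value in the next state
        q = q_update(q, reward=10, max_next_q=0.0, learning_rate=lr)
    print(f"learning_rate={lr}: Q after 10 updates = {q:.3f}")
```

<p>
Each update moves the Q-value a fraction <code>learning_rate</code> of the way towards the target, so a larger rate closes the gap in fewer updates but also follows each noisy reward more closely.
</p>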

In [None]:
# Define parameters
num_episodes = 100000
learning_rate = 0.01
discount_factor = 0.9
epsilon = 0.1
max_steps = 100
initial_battery_level = 100  # Initial battery level for the robot
env = RecyclingRobotEnvironment(rows=500, columns=500, num_items=40, num_bins=20, initial_battery_level=initial_battery_level)

In [None]:
rewards = []
run_episode(env, 'static')
plot()

In [None]:
rewards = []
run_episode(env, 'random')
plot()

In [None]:
rewards = []
run_episode(env, 'random_choise')
plot()

<h3 style="color: green">
Comment:
</h3>
<p>
<b>A small learning rate allows for slower and more cautious updates to the Q-values. Since the rewards are fixed and do not change over time, a small learning rate helps the agent to gradually and steadily update its Q-values without overreacting to noisy or random fluctuations in the environment. It allows the agent to carefully explore and exploit the state-action space, making incremental adjustments to its Q-values based on the limited and consistent rewards.</b>
</p>
<hr></hr>

In [None]:
# Define parameters
num_episodes = 100000
learning_rate = 0.3
discount_factor = 0.9
epsilon = 0.1
max_steps = 100
initial_battery_level = 100  # Initial battery level for the robot
env = RecyclingRobotEnvironment(rows=500, columns=500, num_items=40, num_bins=20, initial_battery_level=initial_battery_level)

In [None]:
rewards = []
run_episode(env, 'static')
plot()

In [None]:
rewards = []
run_episode(env, 'random')
plot()

In [None]:
rewards = []
run_episode(env, 'random_choise')
plot()

<h3 style="color: green">
Comment:
</h3>
<p>
<b>With a larger learning rate, the model is more sensitive to the noisy rewards and can quickly update its Q-values based on random fluctuations. This can lead to erratic and unstable learning,</b> causing the model to converge to suboptimal policies or exhibit high variability in its performance.
On the other hand, a large learning rate works very well with the static reward.
</p>
<hr></hr>

<h3 style="color: blue">
6. Discount Factor:
</h3>
<p>
It represents how much weight or importance should be given to future rewards compared to immediate rewards.
</p>
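<p>
As a small worked example, the sketch below shows how much a reward of 10 is worth to the agent when it arrives a given number of steps in the future, for the two discount factors used in this notebook. The delays are arbitrary choices and <code>discounted_value</code> is a hypothetical helper.
</p>

```python
def discounted_value(reward, delay, discount_factor):
    """Present value of a reward received `delay` steps in the future."""
    return reward * discount_factor ** delay

for discount_factor in (0.6, 0.9):
    for delay in (1, 5, 10):
        value = discounted_value(10, delay, discount_factor)
        print(f"gamma={discount_factor}, delay={delay:>2}: value={value:.3f}")
```

<p>
With a discount factor of 0.6, a reward ten steps away is worth almost nothing, so a distant bin has little influence on the agent's current choice.
</p>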

In [None]:
# Define parameters
num_episodes = 100000
learning_rate = 0.1
discount_factor = 0.6
epsilon = 0.1
max_steps = 100
initial_battery_level = 100 # Initial battery level for the robot
env = RecyclingRobotEnvironment(rows=500, columns=500, num_items=40, num_bins=20, initial_battery_level=initial_battery_level)

In [None]:
rewards = []
run_episode(env, 'static')
plot()

In [None]:
rewards = []
run_episode(env, 'random')
plot()

In [None]:
rewards = []
run_episode(env, 'random_choise')
plot()

<h3 style="color: green">
Comment:
</h3>
<p>
In random reward scenarios, the rewards obtained by taking different actions in the same state can vary significantly, introducing uncertainty.
When the discount factor is small, the agent places less importance on future rewards, which means it focuses more on immediate rewards.<b> In a random reward scenario, where rewards are unpredictable and can vary widely, relying solely on immediate rewards may lead the agent to choose suboptimal actions.</b> It might be more beneficial for the agent to explore and gather more information about the environment by taking actions that have the potential for higher future rewards, even if the immediate rewards are low or random.
</p>
<hr></hr>

<h3 style="color: blue">
7. Epsilon:
</h3>
<p>
It determines the balance between exploration and exploitation.
</p>
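<p>
The sketch below shows epsilon-greedy selection in isolation and measures how often the greedy action is taken for several values of epsilon. The Q-values, the seed, and the helper <code>epsilon_greedy</code> are assumptions made for illustration; the training loop in this notebook implements the same idea inline.
</p>

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.uniform(0, 1) < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
q_row = np.array([0.0, 0.2, 1.5, -0.3, 0.1, 0.0])  # hypothetical Q-values for the six actions

for epsilon in (0.05, 0.1, 0.3):
    picks = np.array([epsilon_greedy(q_row, epsilon, rng) for _ in range(10_000)])
    greedy_share = np.mean(picks == np.argmax(q_row))
    print(f"epsilon={epsilon}: greedy action chosen {greedy_share:.1%} of the time")
```

<p>
Since a random draw can also land on the greedy action, the greedy share is slightly higher than 1 - epsilon.
</p>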

In [None]:
# Define parameters
num_episodes = 100000
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.3
max_steps = 100
initial_battery_level = 100  # Initial battery level for the robot
env = RecyclingRobotEnvironment(rows=300, columns=300, num_items=10, num_bins=10, initial_battery_level=initial_battery_level)

In [None]:
rewards = []
run_episode(env, 'static')
plot()

In [None]:
rewards = []
run_episode(env, 'random')
plot()

In [None]:
rewards = []
run_episode(env, 'random_choise')
plot()

In [None]:
# Define parameters
num_episodes = 100000
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.05
max_steps = 100
initial_battery_level = 100  # Initial battery level for the robot
env = RecyclingRobotEnvironment(rows=500, columns=500, num_items=40, num_bins=20, initial_battery_level=initial_battery_level)

In [None]:
rewards = []
run_episode(env, 'static')
plot()

In [None]:
rewards = []
run_episode(env, 'random')
plot()

In [None]:
rewards = []
run_episode(env, 'random_choise')
plot()

<h3 style="color: green">
Comment:
</h3>
<p>
A large epsilon value in an epsilon-greedy exploration strategy can be detrimental when dealing with a random reward.
When the reward is random, it means that the agent's actions do not consistently lead to higher or lower rewards. In this case, exploration becomes less valuable because the agent cannot rely on the learned knowledge to make informed decisions. <b> A large epsilon value encourages more exploration, leading the agent to take random actions even when it has already learned that certain actions are less rewarding. </b>

As a result, with a large epsilon, the agent may spend a significant amount of time exploring random actions that do not contribute to maximizing long-term rewards. This can hinder the agent's learning progress and make it difficult to converge to an optimal policy.
</p>
<hr></hr>