# Problem 1 - Pick-and-Place Robot

## Background
Design a Markov decision process (MDP) for controlling a robotic arm that performs a repetitive pick-and-place task, with the goal of learning movements that are both smooth and fast.
The agent controls the motors directly and observes the current joint positions and velocities as its state.

## Define States, Actions, and Rewards

### 1. State (S)
The state represents the current overall posture and motion of the robotic arm, including each joint’s position and velocity.

The state vector contains:
- Each joint position (angle): $\theta_1, \theta_2, ..., \theta_n$
- Each joint velocity (angular speed): $\dot{\theta}_1, \dot{\theta}_2, ..., \dot{\theta}_n$

The state is represented as:

$s = [\theta_1, \theta_2, ..., \theta_n, \dot{\theta}_1, \dot{\theta}_2, ..., \dot{\theta}_n]$

For a richer state, the vector can be extended with:
- The end-effector (gripper) position
- Object coordinates
- A binary flag representing successful grasp/release

In [1]:
# State definition
import numpy as np


class RobotState:
    def __init__(self, joint_angles, joint_velocities):
        # Each argument is a list or np.array of length n
        self.joint_angles = np.asarray(joint_angles, dtype=float)
        self.joint_velocities = np.asarray(joint_velocities, dtype=float)

    def as_vector(self):
        # Concatenate angles and velocities into a single vector of length 2n
        return np.concatenate([self.joint_angles, self.joint_velocities])
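
The optional extensions listed earlier (end-effector position, object coordinates, grasp flag) could be appended to the joint state. The following is a minimal sketch, assuming 3-D Cartesian coordinates for the gripper and the object; the helper name `extended_state_vector` and all argument names are hypothetical and do not come from the original notebook.

```python
import numpy as np


def extended_state_vector(joint_angles, joint_velocities,
                          gripper_pos, object_pos, is_grasping):
    """Concatenate the joint state with the optional extras.

    Assumption: gripper_pos and object_pos are 3-D Cartesian coordinates.
    """
    return np.concatenate([
        np.asarray(joint_angles, dtype=float),
        np.asarray(joint_velocities, dtype=float),
        np.asarray(gripper_pos, dtype=float),
        np.asarray(object_pos, dtype=float),
        [1.0 if is_grasping else 0.0],  # binary grasp flag stored as a float
    ])


s = extended_state_vector(
    joint_angles=[0.0, 0.5],
    joint_velocities=[0.1, -0.2],
    gripper_pos=[0.3, 0.0, 0.2],
    object_pos=[0.4, 0.1, 0.0],
    is_grasping=False,
)
print(s.shape)  # (11,) for n = 2 joints: 2 + 2 + 3 + 3 + 1
```

Storing the flag as a float keeps the state a single homogeneous vector, which is convenient for most function approximators.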

### 2. Actions (A)
Actions are the motor control commands for each joint, such as torques. Each control value is constrained to the mechanical limits of the hardware to ensure safe operation.

The action vector represents the torque applied to each joint:

$a = [\tau_1, \tau_2, ..., \tau_n]$

Here $a$ is a continuous-valued vector of length $n$.

In [2]:
# Action definition
import numpy as np


class RobotAction:
    def __init__(self, torques):
        # list or np.array of length n
        self.torques = np.asarray(torques, dtype=float)

    def clip(self, min_torque, max_torque):
        # Limit each torque to [min_torque, max_torque] in place
        self.torques = np.clip(self.torques, min_torque, max_torque)
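
Real arms often have different torque limits on each joint. Since `np.clip` accepts array bounds, the limits can be applied element-wise. This is only a sketch: the limit values and the helper name `clip_torques` are illustrative assumptions, not hardware specifications.

```python
import numpy as np

# Hypothetical per-joint torque limits in N·m (illustrative values only).
TAU_MIN = np.array([-2.0, -1.0])
TAU_MAX = np.array([2.0, 1.0])


def clip_torques(torques, tau_min=TAU_MIN, tau_max=TAU_MAX):
    """Return a copy of the torque vector limited element-wise."""
    return np.clip(np.asarray(torques, dtype=float), tau_min, tau_max)


raw = [3.5, -0.4]          # what a policy might output
safe = clip_torques(raw)   # first joint saturates at its upper limit
print(safe.tolist())       # [2.0, -0.4]
```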

### 3. Rewards (R)
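The reward function considers these factors:
- Task completion: Large positive reward for completing the pick-and-place
- Position error: Penalty proportional to the remaining distance to the target, so reducing the distance increases the reward
- Smooth movement: Penalty for large or abrupt torques to encourage smooth actions
- Time efficiency: Finishing faster yields a higher return; the formula below has no explicit time term, but the position-error penalty accumulates at every step until completion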

Mathematically:

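$R(s, a) = R_{\text{completion}}(s) - \alpha \|a\|^2 - \beta \|a - a_{\text{prev}}\|^2 - \gamma \cdot \text{(position error)}$

where $R_{\text{completion}}(s)$ is the completion bonus, $a_{\text{prev}}$ is the action taken at the previous step, and $\alpha$, $\beta$, $\gamma$ are non-negative weights. Because the reward depends on $a_{\text{prev}}$, the previous action would need to be part of the state for the reward to be a function of $(s, a)$ alone.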

In [3]:
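# Reward definition
import numpy as np


def compute_reward(task_completed, current_action, prev_action, position_error,
                   alpha=0.1, beta=0.1, gamma=1.0):
    # Accept lists as well as arrays so the element-wise math below works
    current_action = np.asarray(current_action, dtype=float)
    prev_action = np.asarray(prev_action, dtype=float)
    reward = 0.0
    if task_completed:
        reward += 100  # Completion bonus
    # Penalty on control effort
    reward -= alpha * (current_action**2).sum()
    # Penalty for action changes (for smoothness)
    reward -= beta * ((current_action - prev_action)**2).sum()
    # Penalty for distance to target (gamma is a weight here, not the discount factor)
    reward -= gamma * position_error
    return reward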
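
To see how the terms trade off, it can help to evaluate the reward for one concrete step. The sketch below recomputes the same formula but returns each term separately; the helper name `reward_terms`, the `bonus` parameter, and the input values are made up for illustration.

```python
import numpy as np


def reward_terms(task_completed, a, a_prev, position_error,
                 alpha=0.1, beta=0.1, gamma=1.0, bonus=100.0):
    """Return each term of R(s, a) separately (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    a_prev = np.asarray(a_prev, dtype=float)
    return {
        "completion": bonus if task_completed else 0.0,
        "effort": -alpha * float(np.sum(a ** 2)),
        "smoothness": -beta * float(np.sum((a - a_prev) ** 2)),
        "position": -gamma * float(position_error),
    }


terms = reward_terms(False, a=[1.0, -2.0], a_prev=[0.5, -1.0],
                     position_error=0.3)
total = sum(terms.values())
print(terms)
print(f"{total:.3f}")  # -0.925 = 0 - 0.5 - 0.125 - 0.3
```

With these default weights the effort term dominates, which suggests the weights would need tuning against the typical torque magnitudes of the actual arm.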

## MDP Structure Summary
- **States (S):** Joint angles and velocities of the arm

    $S = \left\{ s \;\middle|\; s = [\theta_1, \theta_2, ..., \theta_n, \dot{\theta}_1, \dot{\theta}_2, ..., \dot{\theta}_n] \right\}$
    
- **Actions (A):** Torques applied to each joint (continuous space)

    $A = \left\{ a \;\middle|\; a = [\tau_1, \tau_2, ..., \tau_n],\; \tau_i \in [\tau_{\text{min}}, \tau_{\text{max}}] \right\}$
    
- **Transition Probability (P):** Determined by the physics simulator (for example, PyBullet or MuJoCo): the next state depends only on the current state and action. When the simulated dynamics are deterministic, all probability mass falls on a single next state

    $P(s'|s, a)$

- **Reward (R):** Encourages task completion and smooth, fast movements

    $R(s, a) = R_{\text{completion}}(s) - \alpha \|a\|^2 - \beta \| a - a_{\text{prev}} \|^2 - \gamma \cdot \text{(position error)}$

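- **Discount factor (γ):** A value such as 0.9 or 0.99 that balances immediate and future rewards; values closer to 1 weight future rewards more heavily. This γ is distinct from the position-error weight γ in the reward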

    $\gamma,\;\; 0 < \gamma < 1$

- **Objective:** Learn a policy $\pi(a|s)$ that maximizes the expected cumulative discounted reward:
  
  $\max_\pi \;\; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$

- **Termination:** An episode may end on successful placement of the object, a collision, or a time-out
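
The pieces of the summary above can be tied together in a single episode rollout. The sketch below is only illustrative: `toy_step` is a hand-written deterministic stand-in for a physics engine (unit-inertia joints, not PyBullet or MuJoCo code), the PD policy and its gains are arbitrary assumptions, and the discount factor is named `discount` to keep it apart from the position-error weight `gamma`.

```python
import numpy as np


def toy_step(state, torques, dt=0.05):
    """Deterministic toy transition: unit-inertia joints, semi-implicit Euler.

    Assumption: this replaces a real simulator purely for illustration.
    """
    n = state.size // 2
    angles, velocities = state[:n], state[n:]
    velocities = velocities + torques * dt   # theta_ddot = tau
    angles = angles + velocities * dt
    return np.concatenate([angles, velocities])


def rollout(target, discount=0.99, max_steps=200, tol=0.02,
            alpha=0.1, beta=0.1, gamma=1.0, tau_max=2.0):
    """Run one episode with a hypothetical PD policy.

    Returns (discounted return, number of steps, success flag).
    """
    n = target.size
    state = np.zeros(2 * n)           # start at rest, all angles zero
    prev_action = np.zeros(n)
    ret = 0.0
    for t in range(max_steps):
        angles, velocities = state[:n], state[n:]
        # PD policy with arbitrary gains, clipped to the torque limits.
        action = np.clip(8.0 * (target - angles) - 4.0 * velocities,
                         -tau_max, tau_max)
        state = toy_step(state, action)
        error = np.linalg.norm(target - state[:n])
        success = error < tol and np.linalg.norm(state[n:]) < tol
        reward = ((100.0 if success else 0.0)
                  - alpha * np.sum(action ** 2)
                  - beta * np.sum((action - prev_action) ** 2)
                  - gamma * error)
        ret += discount ** t * reward  # accumulate discount^t * R(s_t, a_t)
        prev_action = action
        if success:                    # termination: target reached
            return ret, t + 1, True
    return ret, max_steps, False       # termination: time-out


G, steps, done = rollout(np.array([0.5, -0.3]))
print(f"return={G:.2f}, steps={steps}, success={done}")
```

Collision checking is omitted here because the toy model has no geometry; in a real simulator it would be a third termination branch.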