# MO436A Project 1: Reinforcement learning algorithms evaluation

# 1. Problem description 

**Motivation** 
Model an agent to operate in a trading environment. 

**Objective** 
The agent's objective is to maximize its gain. 


# 2. Environments 

## 2.1 Stochastic environment

The trading problem is inherently stochastic, so, to simplify the model, prices are generated from hourly returns drawn from a normal distribution with a configurable mean and volatility. 

This is an example of the generated pricing data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

mean_return = 0.001  # Average hourly return (positive = upward trend, negative = downward trend)
volatility = 0.003    # Standard deviation of hourly returns (0.3% volatility)

def generate_intraday_prices(num_days=100, hours_per_day=10, start_price=10):
    """
    Generate intraday price data using geometric random walk.
    
    Args:
        num_days: Number of trading days to generate
        hours_per_day: Number of hourly steps per day
        start_price: Initial price for all days
        
    Returns:
        numpy array of shape (num_days, hours_per_day) with price values
    """
    prices = []
    for _ in range(num_days):
        returns = np.random.normal(loc=mean_return, scale=volatility, size=hours_per_day)
        day_prices = start_price * np.exp(np.cumsum(returns))
        prices.append(day_prices)
    plot_prices(np.array(prices), num_days)
    return np.array(prices)


def plot_prices(prices, num_days):
    """
    Plot daily price variations (percentage and absolute).
    
    Args:
        prices: 2D array of shape (num_days, hours_per_day)
        num_days: Number of days for labeling
    """
    # Calculate daily variations
    # Absolute change (close - open) and percentage change
    open_prices = prices[:, 0]
    close_prices = prices[:, -1]
    daily_abs_change = close_prices - open_prices
    daily_pct_change = (close_prices / open_prices - 1) * 100

    # Figure 1: Daily percentage price variation
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.bar(np.arange(1, num_days+1), daily_pct_change, color=['#2ca02c' if x>=0 else '#d62728' for x in daily_pct_change], width=0.8)

    ax.set_title('Daily Percentage Price Variation (Close vs. Open)', fontsize=14, fontweight='bold')
    ax.set_xlabel('Day', fontsize=4)
    ax.set_ylabel('Variation (%)', fontsize=4)

    # Horizontal line at 0%
    ax.axhline(0, color='black', linewidth=1)

    # Better x-axis labels (mark every 5 days)
    ax.set_xticks(np.arange(1, num_days+1, 5))

    # Text with basic statistics
    mean_change = np.mean(daily_pct_change)
    std_change = np.std(daily_pct_change)
    ax.text(0.99, 0.02, f'Mean: {mean_change:.2f}%\nStd Dev: {std_change:.2f}%', transform=ax.transAxes,
            ha='right', va='bottom', bbox=dict(boxstyle='round', fc='white', ec='gray', alpha=0.8))

    plt.tight_layout()
    plt.show()

    # Figure 2: Daily absolute price variation
    fig2, ax2 = plt.subplots(figsize=(6, 2))
    ax2.plot(np.arange(1, num_days+1), daily_abs_change, marker='o', linewidth=1.5, color='#1f77b4')
    ax2.set_title('Daily Absolute Price Variation (Close - Open)', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Day', fontsize=4)
    ax2.set_ylabel('Δ Price', fontsize=4)
    ax2.axhline(0, color='black', linewidth=1)
    ax2.set_xticks(np.arange(1, num_days+1, 5))
    plt.tight_layout()
    plt.show()


In [None]:
prices = generate_intraday_prices(num_days=15, hours_per_day=10, start_price=10)
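
Under this model, the log of the closing price relative to `start_price` is a sum of `hours_per_day` independent normal returns, so it should itself be normal with mean `hours_per_day * mean_return` and standard deviation `sqrt(hours_per_day) * volatility`. The sketch below checks this numerically. The helper name `simulate_daily_log_returns` and the seeded generator are assumptions added for illustration; the parameter values are copied from the cell above.

```python
import numpy as np

# Values copied from the cell above (assumed unchanged).
mean_return = 0.001
volatility = 0.003
hours_per_day = 10


def simulate_daily_log_returns(num_days, seed=0):
    """Sum the hourly normal log-returns of each simulated day."""
    rng = np.random.default_rng(seed)  # seeded so the check is reproducible
    hourly = rng.normal(loc=mean_return, scale=volatility,
                        size=(num_days, hours_per_day))
    return hourly.sum(axis=1)


daily = simulate_daily_log_returns(num_days=100_000)
theoretical_mean = hours_per_day * mean_return
theoretical_std = np.sqrt(hours_per_day) * volatility

print(f"mean: empirical={daily.mean():.5f}, theoretical={theoretical_mean:.5f}")
print(f"std:  empirical={daily.std():.5f}, theoretical={theoretical_std:.5f}")
```

With a positive `mean_return`, most simulated days should therefore close above `start_price`.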

## 2.2 Deterministic environment

In this case, deterministic means that the price of the stock in the next hour (next step) can be known in advance from the current state of the system. The Rulkov Map is used to generate price movements deterministically, since it creates a "chaotic" price evolution while remaining fully reproducible. 
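
For reference, the update rule as implemented in the `DeterministicTradingEnv` code of this section can be written as follows. This is a transcription of that code, not necessarily the canonical form of the Rulkov map, and the symbols $x_n$, $y_n$ (map state) and $p_n$ (price) are notation introduced here:

$$
f(x, y) =
\begin{cases}
\dfrac{\alpha}{1 - x} + y, & x \le 0 \\
\alpha + y, & 0 < x < \alpha + y \\
-1, & \text{otherwise}
\end{cases}
$$

$$
x_{n+1} = f(x_n,\; y_n + \beta), \qquad
y_{n+1} = y_n - \mu\,(x_{n+1} + 1) + \mu\,\sigma, \qquad
p_{n+1} = p_n \exp(0.001\, x_{n+1})
$$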


In [2]:
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class DeterministicTradingEnv(gym.Env):
    """
    Deterministic trading environment using the Rulkov Map for price generation.
    
    The Rulkov Map is a chaotic dynamical system that generates deterministic,
    reproducible price movements without randomness.
    """
    
    metadata = {"render_modes": ["human"]}

    def __init__(self, n_steps=10, start_price=10.0,
                 alpha=4.0, beta=10.0, sigma=0.01, mu=0.001,
                 window_size=3):
        """
        Initialize the deterministic trading environment.
        
        Args:
            n_steps: Number of steps per episode (trading hours)
            start_price: Initial asset price
            alpha, beta, sigma, mu: Rulkov Map parameters
            window_size: Number of historical prices to include in observation
        """
        super().__init__()
        self.n_steps = n_steps
        self.start_price = start_price
        self.alpha = alpha
        self.beta = beta
        self.sigma = sigma
        self.mu = mu
        self.window_size = window_size
        self.initial_cash = 100
        self.cash = self.initial_cash

        # Action and observation spaces
        self.action_space = spaces.Discrete(3)  # Hold (0), Buy (1), Sell (2)
        # State: last N prices + cash + asset holdings
        self.observation_space = spaces.Box(
            low=0, high=np.inf, shape=(self.window_size + 2,), dtype=np.float32
        )

        self.reset()

    def f(self, x, y, alpha):
        """Rulkov Map function."""
        if x <= 0:
            return alpha / (1 - x) + y
        elif 0 < x < (alpha + y):
            return alpha + y
        else:
            return -1

    def rulkov_map(self, x, y):
        """Update state using Rulkov Map dynamics."""
        x_next = self.f(x, y + self.beta, self.alpha)
        y_next = y - self.mu * (x_next + 1) + self.mu * self.sigma
        return x_next, y_next

    def reset(self, seed=None, options=None):
        """Reset the environment to initial state."""
        super().reset(seed=seed)
        self.x = -1.0
        self.y = -3.5
        self.t = 0
        self.cash = self.initial_cash
        self.asset = 0.0
        self.price = self.start_price
        # Initial price history
        self.price_history = [self.price] * self.window_size
        state = np.array(self.price_history + [self.cash, self.asset], dtype=np.float32)
        return state, {}

    def step(self, action):
        """
        Execute one step in the environment.
        
        Args:
            action: 0 = Hold, 1 = Buy, 2 = Sell
            
        Returns:
            observation, reward, done, truncated, info
        """
        # Update price via Rulkov Map
        self.x, self.y = self.rulkov_map(self.x, self.y)
        self.price *= np.exp(self.x * 0.001)

        # Update price history
        self.price_history.append(self.price)
        if len(self.price_history) > self.window_size:
            self.price_history.pop(0)

        # Execute action
        if action == 1 and self.cash > 0:  # Buy
            self.asset += self.cash / self.price
            self.cash = 0
        elif action == 2 and self.asset > 0:  # Sell
            self.cash += self.asset * self.price
            self.asset = 0

        # Calculate reward: cumulative profit/loss relative to the initial cash
        portfolio_value = self.cash + self.asset * self.price
        reward = portfolio_value - self.initial_cash

        # Advance time
        self.t += 1
        done = self.t >= self.n_steps

        state = np.array(self.price_history + [self.cash, self.asset], dtype=np.float32)
        return state, reward, done, False, {}

    def render(self, mode="human"):
        """Render the current environment state."""
        print(f"Step {self.t}: Price={self.price:.2f}, Cash={self.cash:.2f}, Asset={self.asset:.2f}")

In [3]:
env = DeterministicTradingEnv()
state, _ = env.reset()

for _ in range(10):
    action = env.action_space.sample()
    state, reward, done, _, _ = env.step(action)
    env.render()
    if done:
        break

Step 1: Price=10.09, Cash=100.00, Asset=0.00
Step 2: Price=10.19, Cash=0.00, Asset=9.81
Step 3: Price=10.18, Cash=99.90, Asset=0.00
Step 4: Price=10.27, Cash=0.00, Asset=9.73
Step 5: Price=10.38, Cash=100.95, Asset=0.00
Step 6: Price=10.37, Cash=0.00, Asset=9.74
Step 7: Price=10.45, Cash=0.00, Asset=9.74
Step 8: Price=10.56, Cash=0.00, Asset=9.74
Step 9: Price=10.55, Cash=102.78, Asset=0.00
Step 10: Price=10.64, Cash=102.78, Asset=0.00
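
Since the Rulkov dynamics involve no random draws, the price path does not depend on the agent's actions and is identical on every run. The standalone sketch below mirrors the update rules and default parameters of `DeterministicTradingEnv` without the Gymnasium wrapper. The function names `rulkov_f` and `rulkov_price_path` are hypothetical and used only for illustration.

```python
import math


def rulkov_f(x, y, alpha):
    """Piecewise function with the same branches as DeterministicTradingEnv.f."""
    if x <= 0:
        return alpha / (1 - x) + y
    if x < alpha + y:
        return alpha + y
    return -1.0


def rulkov_price_path(n_steps=10, start_price=10.0, alpha=4.0, beta=10.0,
                      sigma=0.01, mu=0.001, x0=-1.0, y0=-3.5):
    """Generate the price path using the same updates as the environment."""
    x, y, price = x0, y0, start_price
    prices = []
    for _ in range(n_steps):
        x_next = rulkov_f(x, y + beta, alpha)   # fast variable
        y = y - mu * (x_next + 1) + mu * sigma  # slow variable
        x = x_next
        price *= math.exp(0.001 * x)            # map state drives the log-return
        prices.append(price)
    return prices


first = rulkov_price_path()
second = rulkov_price_path()
print(first == second)                # two runs give the same path
print([round(p, 2) for p in first])   # compare with the prices printed above
```

The rounded values can be compared directly with the `Price=` column of the output above, which was produced with random actions.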


# 3. MDP Formulation

**States**
In this problem, the state is defined as: 
- Current cash
- Current asset holdings
- Last `window_size` prices

The original problem is partially observable: if only the current price is considered, it is not possible to tell whether the price is in an upward or downward trend, so a window of the last `window_size` prices is included to make the trend observable.

**Actions**
The actions in this case are discrete:
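- 0: Hold
- 1: Buy
- 2: Sell

**Transitions**

**Terminal Condition**
- Each episode ends when the trading day is over, after `n_steps` hourly steps.

**Rewards**
- Rewards: the reward is defined by the profit/loss of each action.
- Discount (gamma): the discount factor controls how much long-term gains are weighted.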
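
The role of the discount factor can be illustrated with the standard discounted return $G = \sum_t \gamma^t r_t$. The sketch below is a generic illustration, not code from this project; the function name `discounted_return` and the reward values are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # hypothetical per-step rewards
print(discounted_return(rewards, gamma=1.0))  # every step counts equally
print(discounted_return(rewards, gamma=0.5))  # later rewards count less
```

A gamma close to 1 makes the agent value gains at the end of the day almost as much as immediate ones, while a small gamma favors immediate profit.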

**Simplifications in the model** 
- The prices are updated hourly, 10 times a day. 
- The prices start at the same value each day.
- The standard deviation of the prices is constant. 
- Every buy or sell involves the whole portfolio, which is equivalent to considering a single stock.
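
The all-in buy/sell simplification can be sketched as a pure function that maps a portfolio and an action to the next portfolio. It follows the same logic as the "Execute action" part of `DeterministicTradingEnv.step`; the function name `apply_action` and the constant names are assumptions made for this example.

```python
HOLD, BUY, SELL = 0, 1, 2


def apply_action(cash, asset, price, action):
    """Return the (cash, asset) pair after an all-in Hold, Buy or Sell."""
    if action == BUY and cash > 0:
        # Convert all cash into the asset at the current price
        return 0.0, asset + cash / price
    if action == SELL and asset > 0:
        # Convert all holdings back into cash at the current price
        return cash + asset * price, 0.0
    # Hold, or an action that cannot be executed, leaves the portfolio unchanged
    return cash, asset


cash, asset = apply_action(100.0, 0.0, 10.0, BUY)
print(cash, asset)
cash, asset = apply_action(cash, asset, 11.0, SELL)
print(cash, asset)
```

Buying with no cash or selling with no holdings behaves like Hold, so the agent is always either fully in cash or fully invested.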

**Environment characteristics**
- Episodic: each day is treated as a complete episode. 
- Terminal states: there are no terminal states, but we could define one, for example stopping at a maximum profit or a maximum loss. 
- The states are continuous, as they are built from price values.
- The environment is stochastic, as the price varies randomly following a normal distribution.
- This environment is partially observable, as only a window of recent prices is available.


**Transitions**
In the deterministic environment, the price is 

# 4. Monte Carlo

# 5. SARSA

# 6. Q-Learning

# 7. Linear Function Approximator

# 8. DQN