# *Explanation of the code*

## Importing Necessary Libraries

In [None]:
import numpy as np
import tensorflow as tf
import gym

Explanation:

- numpy is used for numerical operations.
- tensorflow is the deep learning framework used to build and train the neural network.
- gym is an open-source library for developing and comparing reinforcement learning algorithms.

## Hyperparameters

In [None]:
H = 200  # Number of neurons in the hidden layer.
learning_rate = 1e-4  # Learning rate for the optimizer.
gamma = 0.99  # Discount factor for reward, used in the reward discounting process.
D = 80 * 80  # Input dimensionality (flattened 80x80 grid).

Explanation:

- Hyperparameters are settings that can be adjusted to control the behavior of the neural network and the learning process.
- 'H' represents the size of the hidden layer in the neural network.
- 'learning_rate' is the step size the optimizer uses for each weight update. Smaller values make learning slower but more stable.
- 'gamma' is used in calculating the discounted rewards, influencing how much the model cares about immediate vs. future rewards.
- 'D' is the size of the input to the neural network, here representing an 80x80 pixel image flattened into a single vector.

## Keras Model Construction

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(H, activation='relu', input_shape=(D,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Explanation:

- This code defines the neural network model using Keras, a high-level API in TensorFlow.
- 'tf.keras.Sequential' creates a linear stack of layers in the neural network.
- The first layer is a Dense (fully connected) layer with 'H' neurons and ReLU activation function. It takes input of shape 'D'.
- The second layer is a Dense layer with a single neuron and a sigmoid activation function. Its output is the probability of taking action 2; action 3 is taken otherwise.
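
To make the size of this model concrete, the parameter count of each Dense layer can be worked out by hand. The sketch below assumes the Keras default of one bias per output unit; 'dense_param_count' is a hypothetical helper written for this illustration, not a Keras function.

```python
def dense_param_count(n_inputs, n_units, use_bias=True):
    """Weights plus optional biases of a fully connected layer."""
    return n_inputs * n_units + (n_units if use_bias else 0)

H = 200      # hidden units, as in the hyperparameters above
D = 80 * 80  # input size, as in the hyperparameters above

hidden_params = dense_param_count(D, H)  # 6400 * 200 weights + 200 biases
output_params = dense_param_count(H, 1)  # 200 * 1 weights + 1 bias
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```

Almost all of the parameters sit in the first layer.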

## Optimizer

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

Explanation:

- 'optimizer' is used to update the network weights based on the computed gradients.
- 'tf.keras.optimizers.Adam' is an optimizer that uses the Adam algorithm, a popular choice for deep learning applications.
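
For intuition about what Adam does with a gradient, here is a simplified NumPy sketch of a single update step. It is not the Keras implementation; 'adam_step' is a hypothetical helper, and the beta and epsilon values are assumed to match the Keras defaults.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for parameters w given gradient g (t counts from 1)."""
    m = beta1 * m + (1 - beta1) * g       # running mean of the gradients
    v = beta2 * v + (1 - beta2) * g * g   # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(2)
g = np.array([10.0, -0.001])  # two gradients of very different scale
w, m, v = adam_step(w, g, m=np.zeros(2), v=np.zeros(2), t=1)
print(w)
```

On the first step both weights move by roughly 'learning_rate' in the direction opposite to their gradient, regardless of the gradient's magnitude. This per-parameter scaling is the main difference from plain gradient descent.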

## Preprocessing Function

In [None]:
def prepro(I):
    # Body omitted here; the full definition is given in the comparison section below.
    ...

Explanation:

- The 'prepro' function converts a raw 210x160x3 game frame into a flat vector of 6400 (80x80) floats.
- It crops the frame to rows 35 to 195, downsamples by a factor of 2, erases the two background colors, and sets every remaining pixel (paddles, ball) to 1.

## Discounted Rewards Function

In [None]:
def discount_rewards(r):
    # Body omitted here; the full definition is given in the complete implementation below.
    ...

Explanation:

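- The 'discount_rewards' function walks backwards through an episode's rewards and assigns each time step the discounted sum of the rewards from that step onward, with each later reward weighted by a further factor of 'gamma'.
- The running sum is reset whenever a nonzero reward is seen. In Pong a nonzero reward marks the end of a point, so credit is not carried from one point to the next.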
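
A small worked example makes the discounting and the reset visible. The function below mirrors the full definition given later in this notebook; the reward sequence is made up for illustration.

```python
import numpy as np

gamma = 0.99  # discount factor, as in the hyperparameters above

def discount_rewards(r):
    """Discounted return per step, resetting at every nonzero reward."""
    discounted_r = np.zeros_like(r)
    running_add = 0.0
    for t in reversed(range(r.size)):
        if r[t] != 0:
            running_add = 0.0  # a point was scored: start a new sum
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r

rewards = np.array([0.0, 0.0, 1.0, 0.0, -1.0])  # win a point, then lose one
returns = discount_rewards(rewards)
print(returns)
```

The first three steps receive 0.9801, 0.99 and 1.0 from the point that was won. The last two receive -0.99 and -1.0 from the point that was lost, and nothing from the lost point leaks back into the first three steps.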

## Custom Loss Function

In [None]:
def custom_loss(y_true, y_pred):
    # Body omitted here; the full definition is given in the complete implementation below.
    ...

Explanation:

- A custom loss function combines the model's predicted action probabilities with the discounted rewards (advantages) of the episode.
- The gradient of this loss is what the optimizer uses to update the network weights during training.

## Training Loop

Explanation:

- The training loop feeds the difference between consecutive preprocessed frames to the model, samples an action from the predicted probability, and records the reward. When an episode ends, the rewards are discounted and standardized, and the model weights are updated once.
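
One step of this loop that is easy to isolate is the standardization of the discounted rewards before the weight update (subtract the mean, divide by the standard deviation), as done in the complete implementation. A minimal sketch, with example values chosen for illustration:

```python
import numpy as np

discounted_epr = np.array([0.9801, 0.99, 1.0, -0.99, -1.0])  # example discounted rewards

advantages = discounted_epr - np.mean(discounted_epr)
advantages = advantages / (np.std(advantages) + 1e-10)  # small constant avoids division by zero

print(advantages.mean(), advantages.std())
```

After this step the values have mean 0 and standard deviation 1, so the scale of the update does not depend on how many points an episode happened to contain.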

# *Comparing the pure Python and Keras implementations*

## Neural Network Model

### Original Code (Python)

In [None]:
# Model Initialization
model = {}
model['W1'] = np.random.randn(H, D) / np.sqrt(D)  # "Xavier" initialization
model['W2'] = np.random.randn(H) / np.sqrt(H)

Explanation: 
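- This code manually initializes the weights of a two-layer neural network using Xavier initialization: standard normal draws are scaled by one over the square root of the number of inputs to the layer. There are no bias terms.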
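
The effect of the 1/sqrt(D) scaling can be checked numerically. The sketch below assumes a random unit-variance input as a stand-in for a real frame, and compares the spread of the hidden pre-activations with and without the scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 200, 80 * 80

x = rng.standard_normal(D)                           # stand-in input with unit variance
W_scaled = rng.standard_normal((H, D)) / np.sqrt(D)  # Xavier-style scaling, as above
W_plain = rng.standard_normal((H, D))                # no scaling, for comparison

std_scaled = np.std(W_scaled @ x)  # stays near 1
std_plain = np.std(W_plain @ x)    # grows to about sqrt(D) = 80
print(std_scaled, std_plain)
```

With the scaling, the pre-activations keep roughly the same spread as the input, which keeps the sigmoid at the output away from saturation at the start of training.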

### Keras Implementation

In [None]:
# Keras Model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(H, activation='relu', input_shape=(D,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Explanation: 
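- The Keras implementation replaces manual weight initialization with a high-level abstraction using tf.keras.Sequential, where layers and their initializations are automatically managed. By default, Dense layers use Glorot (Xavier) uniform initialization and, unlike the original code, include bias terms.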

## Preprocessing Function

### Original Code (Python)

In [None]:
def prepro(I):
    """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
    I = I[35:195]  # crop
    I = I[::2, ::2, 0]  # downsample by factor of 2
    I[I == 144] = 0  # erase background (background type 1)
    I[I == 109] = 0  # erase background (background type 2)
    I[I != 0] = 1  # everything else (paddles, ball) just set to 1
    return I.astype(float).ravel()  # np.float was removed in NumPy 1.24; use the builtin float

Explanation: 
- This function preprocesses the input images by cropping, downsampling, erasing the background, and setting the remaining pixels (paddles, ball) to 1.
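
To see what the function produces, it can be run on a synthetic frame. In the sketch below the frame contents and the "ball" position are made up for illustration, and the '.copy()' is an addition so that the caller's frame is not modified through NumPy views.

```python
import numpy as np

def prepro(I):
    """Turn a 210x160x3 uint8 frame into a 6400-element float vector."""
    I = I[35:195].copy()  # crop (copy so the caller's frame is left untouched)
    I = I[::2, ::2, 0]    # downsample by 2 and keep one color channel
    I[I == 144] = 0       # erase background (type 1)
    I[I == 109] = 0       # erase background (type 2)
    I[I != 0] = 1         # paddles and ball become 1
    return I.astype(float).ravel()

frame = np.full((210, 160, 3), 144, dtype=np.uint8)  # synthetic all-background frame
frame[100:104, 80:84, :] = 236                       # hypothetical 4x4 "ball"

x = prepro(frame)
print(x.shape, x.sum())
```

The 4x4 ball covers a 2x2 patch after downsampling, so exactly four entries of the 6400-element vector are 1 and the rest are 0.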

### Keras Implementation

In [None]:
def prepro(I):
    # Body omitted here; the full definition is given in the complete implementation below.
    ...

Explanation: 
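- The Keras version uses the same NumPy preprocessing logic. The only changes are that it unwraps the tuple that newer gym versions return from 'env.reset()' and uses 'astype(float)' instead of 'astype(np.float)'.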

## Forward Propagation and Decision Making

### Original Code (Python)

In [None]:
def policy_forward(x):
    h = np.dot(model['W1'], x)
    h[h < 0] = 0  # ReLU nonlinearity
    logp = np.dot(model['W2'], h)
    p = sigmoid(logp)
    return p, h  # return probability of taking action 2, and hidden state

Explanation: 
- Manually conducts forward propagation through the network and uses the sigmoid function to calculate the probability of taking action 2. The 'sigmoid' helper is defined elsewhere in the original script and is not shown here.
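
The snippet above can be run on its own once 'sigmoid' and the weights exist. The sketch below assumes 'sigmoid' is the standard logistic function and uses toy layer sizes so the output is easy to inspect.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; assumed definition of the helper used above."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
H_small, D_small = 4, 6  # toy sizes for illustration, not the real H and D
model = {
    'W1': rng.standard_normal((H_small, D_small)) / np.sqrt(D_small),
    'W2': rng.standard_normal(H_small) / np.sqrt(H_small),
}

def policy_forward(x):
    h = np.dot(model['W1'], x)
    h[h < 0] = 0                   # ReLU nonlinearity
    logp = np.dot(model['W2'], h)  # a single logit
    p = sigmoid(logp)
    return p, h  # probability of taking action 2, and hidden state

p, h = policy_forward(rng.standard_normal(D_small))
p_zero, _ = policy_forward(np.zeros(D_small))
print(p, h, p_zero)
```

An all-zero input, which is what the training loop feeds on the first frame of an episode, yields a probability of exactly 0.5 because the original network has no bias terms.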

### Keras Implementation

In [None]:
aprob = model.predict(x.reshape(1, -1), batch_size=1).flatten()

Explanation: 
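- Uses the predict method of the Keras model to compute the action probability, leveraging TensorFlow's optimized operations. The input is reshaped to a batch of one. For single small inputs like this, calling the model directly ('model(x)') is generally faster than 'predict'.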
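
The probability is then turned into an action by comparing it with a uniform random number, as the training loop does. A minimal sketch, with a fixed example value standing in for the model output:

```python
import numpy as np

rng = np.random.default_rng(0)
aprob = 0.7  # example probability of action 2, standing in for the model output

# Sample many actions with the same rule as the training loop: 2 if u < aprob else 3
actions = np.where(rng.uniform(size=10_000) < aprob, 2, 3)
fraction_action_2 = np.mean(actions == 2)
print(fraction_action_2)
```

Over many draws, action 2 is chosen in about 70% of the cases. Sampling rather than always taking the more likely action is what lets the agent explore.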

## Training Loop and Backpropagation

### Original Code (Python)

In [None]:
# Policy backward pass
def policy_backward(eph, epdlogp):
    # Code for the backward pass (omitted here)
    ...

# Training loop with manual backpropagation and weight updates

Explanation: 
- The original code computes the gradients by hand in 'policy_backward' and applies the weight updates itself.
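
Since the body of the backward pass is omitted above, here is a sketch of what a manual backward pass for this two-layer network can look like. It is an assumption for illustration and not necessarily the original code: the argument list, the batch-style 'forward' helper and the 'objective' function are all hypothetical. The result is checked against finite differences on toy sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, X):
    """Batch forward pass: action-2 probabilities and hidden activations."""
    hidden = np.maximum(X @ W1.T, 0.0)  # (T, H) after ReLU
    p = sigmoid(hidden @ W2)            # (T,)
    return p, hidden

def objective(W1, W2, X, y, adv):
    """Advantage-weighted log-likelihood of the actions that were taken."""
    p, _ = forward(W1, W2, X)
    return np.sum(adv * (y * np.log(p) + (1 - y) * np.log(1 - p)))

def policy_backward(W2, X, hidden, dlogit):
    """Gradients of the objective, given its gradient with respect to the logits."""
    dW2 = hidden.T @ dlogit         # (H,)
    dhidden = np.outer(dlogit, W2)  # (T, H)
    dhidden[hidden <= 0] = 0.0      # backprop through ReLU
    dW1 = dhidden.T @ X             # (H, D)
    return dW1, dW2

def numeric_grad(param, f, eps=1e-6):
    """Central finite differences of f() with respect to every entry of param."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        up = f()
        param[idx] = old - eps
        down = f()
        param[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
T, D_small, H_small = 5, 4, 3  # toy sizes for illustration
X = rng.standard_normal((T, D_small))
W1 = rng.standard_normal((H_small, D_small)) / np.sqrt(D_small)
W2 = rng.standard_normal(H_small) / np.sqrt(H_small)
y = rng.integers(0, 2, size=T).astype(float)  # 1 where action 2 was taken
adv = rng.standard_normal(T)

p, hidden = forward(W1, W2, X)
dW1, dW2 = policy_backward(W2, X, hidden, adv * (y - p))  # d(objective)/d(logit) = adv * (y - p)

num_dW1 = numeric_grad(W1, lambda: objective(W1, W2, X, y, adv))
num_dW2 = numeric_grad(W2, lambda: objective(W1, W2, X, y, adv))
print(np.max(np.abs(dW1 - num_dW1)), np.max(np.abs(dW2 - num_dW2)))
```

These are gradients of an objective to be maximized, so a manual update would add them to the weights (gradient ascent) rather than subtract them.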

### Keras Implementation

In [None]:
with tf.GradientTape() as tape:
    p = model(epx, training=True)
    loss = custom_loss(discounted_epr, p)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))

Explanation: 
- TensorFlow's GradientTape is used for automatic differentiation, and the optimizer handles the weight updates, significantly simplifying the backpropagation process.

## Custom Loss Function

In [None]:
def custom_loss(advantages, predicted_action_probs):
    inverted_action_probs = 1 - predicted_action_probs
    loss = -tf.reduce_mean(tf.math.log(predicted_action_probs) * advantages +
                           tf.math.log(inverted_action_probs) * (1 - advantages))
    return loss

### Explanation:

#### Purpose: 
- This function calculates the loss for training the neural network in a reinforcement learning context, specifically for playing Pong.

#### Parameters:

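- advantages: The discounted rewards of the episode, standardized to zero mean and unit variance in the training loop. They indicate how much better (or worse) than average the outcome following each step was.
- predicted_action_probs: The model's predicted probability of action 2 at each step.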

#### Loss Calculation:

- inverted_action_probs: The complement, 1 - predicted_action_probs, which is the probability of the other action (action 3).

- The loss is computed as the negative mean of two terms:
    - The log probability of action 2 (predicted_action_probs), multiplied by the advantage.
    - The log probability of action 3 (inverted_action_probs), multiplied by (1 - advantage).
- This is the binary cross-entropy formula with the advantage in place of the 0/1 label. The sampled action is not an argument of the function, so the loss does not depend on which action was actually taken. The standard policy-gradient loss instead weights the log probability of the taken action by the advantage.

#### Outcome: 
- The intent of minimizing this loss is to increase the likelihood of actions that are followed by positive rewards and decrease the likelihood of actions that are followed by negative rewards.
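
The loss above takes no argument for the action that was sampled. For comparison, here is a minimal sketch of the standard policy-gradient (REINFORCE) form, which does use it. This is not the loss used in this notebook: 'policy_gradient_loss', 'actions_taken' and 'eps' are hypothetical names, and the code is written in NumPy so that it runs standalone. The same expression can be written with 'tf.math.log' and 'tf.reduce_mean'.

```python
import numpy as np

def policy_gradient_loss(actions_taken, advantages, predicted_action_probs, eps=1e-7):
    """Negative advantage-weighted log-likelihood of the taken actions.

    actions_taken: 1 where action 2 was sampled, 0 where action 3 was sampled.
    """
    p = np.clip(predicted_action_probs, eps, 1 - eps)  # keep log() finite
    log_prob_taken = actions_taken * np.log(p) + (1 - actions_taken) * np.log(1 - p)
    return -np.mean(advantages * log_prob_taken)

y = np.array([1.0, 0.0])    # action 2 taken, then action 3 taken
adv = np.array([1.0, 1.0])  # both steps were followed by a good outcome

confident = policy_gradient_loss(y, adv, np.array([0.9, 0.1]))
unsure = policy_gradient_loss(y, adv, np.array([0.5, 0.5]))
print(confident, unsure)
```

With positive advantages, the loss is lower when the model assigns high probability to the actions that were actually taken, so minimizing it reinforces those actions. A negative advantage flips the sign and pushes the probability of the taken action down.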

# *Complete Keras implementation*

In [None]:
import numpy as np
import tensorflow as tf
import gym

# Hyperparameters
H = 200  # number of hidden layer neurons
learning_rate = 1e-4
gamma = 0.99  # discount factor for reward
D = 80 * 80  # input dimensionality: 80x80 grid
render = True

# Keras Model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(H, activation='relu', input_shape=(D,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

# Preprocessing function
def prepro(I):
    """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
    if isinstance(I, tuple):
        I = I[0]
    I = I[35:195]
    I = I[::2, ::2, 0]
    I[I == 144] = 0
    I[I == 109] = 0
    I[I != 0] = 1
    return I.astype(float).ravel()

# Discounted rewards function
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(r.size)):
        if r[t] != 0: running_add = 0
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r

# Custom loss function
def custom_loss(advantages, predicted_action_probs):
    # advantages: the standardized discounted rewards of the episode
    # predicted_action_probs: the model's predicted probability of action 2 at each step

    # Probability of the other action (action 3)
    inverted_action_probs = 1 - predicted_action_probs

    # Combining the log probabilities with the advantages
    loss = -tf.reduce_mean(
        tf.math.log(predicted_action_probs) * advantages
        + tf.math.log(inverted_action_probs) * (1 - advantages)
    )
    return loss


# Training loop
env = gym.make("Pong-v0", render_mode='human')
observation = env.reset()

prev_x = None
xs, dlogps, drs = [], [], []
reward_sum = 0
episode_number = 0

running_reward = None 

while True:
    env.render() 
    cur_x = prepro(observation)
    x = cur_x - prev_x if prev_x is not None else np.zeros(D)
    prev_x = cur_x

    aprob = model.predict(x.reshape(1, -1), batch_size=1).flatten()
    action = 2 if np.random.uniform() < aprob else 3

    xs.append(x)
    y = 1 if action == 2 else 0
    dlogps.append(y - aprob)

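    step_result = env.step(action)
    observation, reward, done = step_result[:3]
    if len(step_result) == 5:  # newer gym: (obs, reward, terminated, truncated, info)
        done = done or step_result[3]  # also end the episode on truncation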
    reward_sum += reward
    drs.append(reward)

    if done:
        episode_number += 1

        epx = np.vstack(xs)
        epdlogp = np.vstack(dlogps)
        epr = np.vstack(drs)
        xs, dlogps, drs = [], [], []

        discounted_epr = discount_rewards(epr)
        discounted_epr -= np.mean(discounted_epr)
        discounted_epr /= (np.std(discounted_epr) + 1e-10)

        with tf.GradientTape() as tape:
            p = model(epx, training=True)
            loss = custom_loss(discounted_epr, p)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
        print('Resetting env. Episode reward total was %.f. Running mean: %.f' % (reward_sum, running_reward))
        reward_sum = 0
        observation = env.reset()
        prev_x = None

        # Save after every episode. To save only every 100 episodes, guard the call with:
        #     if episode_number % 100 == 0:
        model.save('pong_model.h5')

    if reward != 0:
        print('Ep %d: Game finished, reward: %f' % (episode_number, reward) + ('' if reward == -1 else ' !!!!!!!!'))


# *Loading and Running Saved Model*

In [None]:
import tensorflow as tf
import gym
import numpy as np

# Load the previously trained model (the training script above saves it as 'pong_model.h5')
model = tf.keras.models.load_model('pong_model.h5')

# Preprocessing function
def prepro(I):
    """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
    if isinstance(I, tuple):
        I = I[0]
    I = I[35:195]
    I = I[::2, ::2, 0]
    I[I == 144] = 0
    I[I == 109] = 0
    I[I != 0] = 1
    return I.astype(float).ravel()

# Initialize the Pong environment
env = gym.make("Pong-v0", render_mode='human')
observation = env.reset()

prev_x = None

# Run the model on the environment
while True:
    env.render()

    cur_x = prepro(observation)
    x = cur_x - prev_x if prev_x is not None else np.zeros(80 * 80)
    prev_x = cur_x

    aprob = model.predict(x.reshape(1, -1), batch_size=1).flatten()
    action = 2 if np.random.uniform() < aprob else 3

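    step_result = env.step(action)
    observation, reward, done = step_result[:3]
    if len(step_result) == 5:  # newer gym: (obs, reward, terminated, truncated, info)
        done = done or step_result[3]  # also end the episode on truncation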

    if done:
        observation = env.reset()
