<a href="https://colab.research.google.com/github/vsoni03/AI-projects/blob/main/Personal_Deep_Q_Learning_for_Lunar_Landing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Q-Learning for Lunar Landing

## Part 0 - Installing the required packages and importing the libraries

### Installing Gymnasium

In [None]:
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!apt-get install -y swig
!pip install gymnasium[box2d]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


This cell installs Gymnasium and its extras, including SWIG and Box2D, which the Lunar Lander simulation needs.

### Importing the libraries

In [None]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.autograd as autograd
from torch.autograd import Variable
from collections import deque, namedtuple

These are the libraries needed to implement the Deep Q-Learning algorithm for the lunar lander.

## Part 1 - Building the AI

### Creating the architecture of the Neural Network

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self, state_size, action_size, seed = 42):
        super(NeuralNetwork, self).__init__()
        # seed PyTorch's random number generator so weight initialization is reproducible
        self.seed = torch.manual_seed(seed)
        # first fully connected layer: input (state) -> 64 hidden units
        # 64 units worked well for landing on the moon in experiments
        self.fc1 = nn.Linear(state_size, 64)
        # second fully connected layer: 64 -> 64 hidden units, also chosen through experimentation
        self.fc2 = nn.Linear(64, 64)
        # output layer: one Q-value per action
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, state):
        # propagate the state through the first fully connected layer
        x = self.fc1(state)
        # apply the rectifier (ReLU) activation function, max(0, x):
        # it replaces negative values with zero, which introduces non-linearity
        x = F.relu(x)

        # do the same for the second layer
        x = self.fc2(x)
        x = F.relu(x)

        # return the Q-values, one per action
        return self.fc3(x)



**Explanation**:

*Constructor:* This is the brain of the machine learning model.

The constructor defines the network for Deep Q-Learning on Lunar Lander: two fully connected hidden layers followed by an output layer. The input (the observation) has size 8. Each hidden layer has 64 units, a width that worked well for the lunar landing simulation in experiments. The output layer has size 4, one value per action: 0 does nothing, 1 fires the left engine, 2 fires the main engine, and 3 fires the right engine. The result is a neural network of 8 (input) -> 64 -> 64 -> 4 (output).
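As a quick sanity check on the 8 -> 64 -> 64 -> 4 shape, the number of trainable parameters can be counted by hand. The sketch below is plain Python and assumes each fully connected layer holds one weight per input-output pair plus one bias per output unit. The helper name `linear_params` is made up for illustration and is not part of the notebook.

```python
def linear_params(n_in, n_out):
    # weights (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out

layer_sizes = [8, 64, 64, 4]  # input -> hidden -> hidden -> output
per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```

Under these assumptions the network is small, a few thousand parameters, which is one reason it trains quickly even on a CPU.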





*Forward Propagation:* This is the forward pass, which moves the input through the neural network to the output.

It uses the layers fc1, fc2, and fc3 created in the constructor to turn a state into one value per action. The state passes through the first layer, and the result x goes through the rectifier (ReLU) function. This function replaces negative values with zero, which makes the network non-linear. The same is done for the second layer, and then the third layer returns the output.
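To make the forward pass concrete, here is a minimal pure-Python sketch of a layer followed by ReLU, then an output layer. It is a toy stand-in for the PyTorch layers, not the notebook's network: the sizes (2 -> 2 -> 1), the weights, and the helper names `relu` and `linear` are all made up for illustration.

```python
def relu(values):
    # replace every negative entry with zero, keep the rest unchanged
    return [max(0.0, v) for v in values]

def linear(inputs, weights, biases):
    # one fully connected layer: out[j] = sum_i inputs[i] * weights[j][i] + biases[j]
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# hypothetical toy network: 2 inputs -> 2 hidden units -> 1 output
hidden = relu(linear([1.0, -2.0], [[0.5, 0.5], [1.0, -1.0]], [0.0, 0.0]))
output = linear(hidden, [[2.0, 1.0]], [0.5])
print(hidden, output)
```

The first hidden unit computes a negative value, so ReLU sets it to zero and it contributes nothing to the output; only the second unit passes through.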



## Part 2 - Training the AI

### Setting up the environment

In [None]:
import gymnasium as gym
env = gym.make('LunarLander-v3') # The Lunar Lander environment was upgraded to v3
# load the lunar lander simulation

# get the observation shape, state size, and number of actions
state_shape = env.observation_space.shape
# the state is a vector of 8 elements
state_size = env.observation_space.shape[0]
number_actions = env.action_space.n
print('State shape: ', state_shape)
print('State size: ', state_size)
print('Number of actions: ', number_actions)

State shape:  (8,)
State size:  8
Number of actions:  4


### Initializing the hyperparameters

In [None]:
# learning rate, chosen after much experimentation
learning_rate  = 5e-4
# number of experiences used in one step to update the model's parameters
# 100 is a common choice; there is no single optimal size
minibatch_size = 100
# discount factor: the present value of future rewards
# a small value makes the agent short-sighted, so it mostly looks at immediate rewards
# a value close to 1 makes it count future rewards toward the total reward,
# which is what we want here
discount_factor = 0.99
# capacity of the agent's memory: how many experiences (state, action, reward, next state, done) it keeps
replay_buffer_size = int(1e5)
# interpolation parameter for soft updates; this value worked well for landing on the moon
interpolation_parameter = 1e-3


**Hyperparameters**

These hyperparameters configure how the AI model is trained.

The ***learning rate*** controls *how quickly the weights are adjusted based on the gradients calculated during training*. It represents the balance between learning efficiently and overshooting the minimum error during training.
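The trade-off can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w**2, which is an illustrative stand-in and not the notebook's loss or its Adam optimizer. The function name and the two learning rates are assumptions chosen to show the effect.

```python
def gradient_descent(lr, w=1.0, steps=20):
    # minimize f(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

small_lr = gradient_descent(lr=0.1)   # shrinks w steadily toward the minimum at 0
large_lr = gradient_descent(lr=1.1)   # overshoots on every step and grows in magnitude
print(abs(small_lr), abs(large_lr))
```

With the small rate each step keeps 80% of the previous distance to the minimum, while the large rate jumps past the minimum and lands farther away each time.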

The ***minibatch size*** is the *number of experiences processed before the model updates its parameters*. A smaller batch size makes learning noisier but quicker, and a larger one is more stable but slower to converge.

The ***discount factor*** determines *how much the model values future rewards*. A value close to 1 makes the agent far-sighted, placing more value on rewards that arrive later. A small value makes the model focus on immediate rewards.
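A small worked example shows how strongly the discount factor changes the value of a delayed reward. The episode below is hypothetical: a single reward of 100 arriving after 50 steps. The helper name `discounted_return` is made up for illustration.

```python
def discounted_return(rewards, gamma):
    # sum of gamma**t * reward_t over the episode
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.0] * 50 + [100.0]   # hypothetical episode: one reward arrives after 50 steps
far_sighted = discounted_return(rewards, gamma=0.99)
short_sighted = discounted_return(rewards, gamma=0.5)
print(far_sighted, short_sighted)
```

With 0.99 the delayed reward keeps most of its value, while with 0.5 it is worth practically nothing at the start of the episode, so a short-sighted agent would have little reason to work toward it.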

The ***replay buffer size*** is the capacity of the memory where past experiences (state, action, reward, next state, done) are stored. The buffer holds a fixed number of past experiences, and training draws random samples from different points in time. Replaying experiences this way reduces the correlation between samples and improves learning stability.

The ***interpolation parameter*** is the blending rate used when updating the target network. It *controls how much weight is given to new information compared to existing knowledge*. A small value gives gentle adjustments, which is useful for precise tasks where small adjustments matter significantly.

### Implementing Experience Replay

In [None]:
class ReplayMemory(object):
  def __init__(self, capacity):
    # use the GPU if one is available, otherwise the CPU
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # maximum size of the memory buffer that holds data for experience replay
    self.capacity = capacity
    # list of stored experiences, each a (state, action, reward, next_state, done) tuple
    self.memory = []

  def push(self, event):
    # append the new experience to the list created in the constructor
    self.memory.append(event)

    # if the list now exceeds the memory capacity, delete the oldest experience
    if len(self.memory) > self.capacity:
      del self.memory[0]


  # randomly select a batch of experiences from the memory buffer
  def sample(self, batch_size):
    experiences = random.sample(self.memory, k = batch_size)

    # stack the states of the sampled experiences (element 0 of each tuple, skipping missing entries),
    # convert the stack to a float tensor for the neural network,
    # and move it to the device, whether that is the CPU or the GPU
    states = torch.from_numpy(np.vstack([e[0] for e in experiences if e is not None])).float().to(self.device)
    # actions are integer indices, so they become a long tensor
    actions = torch.from_numpy(np.vstack([e[1] for e in experiences if e is not None])).long().to(self.device)
    # rewards
    rewards = torch.from_numpy(np.vstack([e[2] for e in experiences if e is not None])).float().to(self.device)
    # next states
    next_states = torch.from_numpy(np.vstack([e[3] for e in experiences if e is not None])).float().to(self.device)
    # done flags: booleans converted to 0/1, then to float
    dones = torch.from_numpy(np.vstack([e[4] for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device)
    return states, next_states, actions, rewards, dones


#### **Constructor:**

**Device:** The constructor uses a GPU if one is available, otherwise it uses the CPU. A CPU has a small number of powerful cores that are highly optimized for single-thread performance. A GPU contains many smaller, simpler cores that can perform the same operation on multiple data points in parallel, which is efficient for repetitive, parallel calculations.

**Capacity:** It stores the capacity, which is the maximum size of the memory buffer.

**Memory**: It creates the memory list, initially empty, which stores the experiences. Each experience holds a state, action, reward, next state, and done flag.



#### **Push:**
Adds the event to the memory list. If the list then exceeds the capacity, the first (oldest) item is removed.
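The same drop-the-oldest behaviour can be sketched with `collections.deque`, which the notebook already imports. This is a hypothetical alternative shown for comparison, not the notebook's `ReplayMemory`. The class name `DequeReplayMemory` and the tiny capacity are assumptions for illustration.

```python
from collections import deque

class DequeReplayMemory:
    # hypothetical alternative to the list-based buffer, not the notebook's class
    def __init__(self, capacity):
        # a deque with maxlen drops its oldest item automatically when full
        self.memory = deque(maxlen=capacity)

    def push(self, event):
        self.memory.append(event)

buffer = DequeReplayMemory(capacity=3)
for step in range(5):
    buffer.push(("state", step))
print(list(buffer.memory))
```

A deque removes items from the front in constant time, whereas `del self.memory[0]` on a list shifts every remaining element. The trade-off is that indexing into the middle of a deque is slower than indexing a list, which matters for random sampling, so the list-based version is a reasonable choice too.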




#### **Sample:**
This randomly selects a batch of experiences from the memory buffer. It then gathers each field across the batch: all states, actions, rewards, next_states, and dones. For each field, np.vstack stacks the element at that field's index from every experience: Index 0 -> states, Index 1 -> actions, Index 2 -> rewards, Index 3 -> next_states, Index 4 -> dones. Each stack is converted to a tensor on the device, and the five tensors are returned.
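The regrouping by index can be sketched without NumPy or PyTorch. The batch below is hypothetical, with 2-element states instead of the real 8, and `zip(*batch)` stands in for the per-index list comprehensions used in `sample`.

```python
# three hypothetical experiences: (state, action, reward, next_state, done)
batch = [
    ([0.0, 1.0], 2, -0.5, [0.1, 0.9], False),
    ([0.1, 0.9], 0, 1.0, [0.2, 0.8], False),
    ([0.2, 0.8], 3, 100.0, [0.3, 0.0], True),
]

# zip(*batch) regroups the tuples by position: all states, all actions, ...
states, actions, rewards, next_states, dones = zip(*batch)

# one row per experience, like the column np.vstack builds for scalar fields
action_column = [[a] for a in actions]
print(action_column, dones)
```

The column layout, one row per experience, is what lets `gather(1, actions)` in the learn method pick one Q-value per row.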







### Implementing the DQN class

In [None]:
class Agent():
  def __init__(self, state_size, action_size):
    # use the GPU if one is available, otherwise the CPU
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # state and action sizes, which come from the simulation
    self.state_size = state_size
    self.action_size = action_size
    # two instances of the NeuralNetwork class, the local and the target Q-network, both sent to the device
    self.local_qnetwork = NeuralNetwork(state_size, action_size).to(self.device)
    self.target_qnetwork = NeuralNetwork(state_size, action_size).to(self.device)

    # Adam optimizer that updates the weights of the local network,
    # using the learning rate defined above
    self.optimizer = optim.Adam(self.local_qnetwork.parameters(), lr = learning_rate)

    self.memory = ReplayMemory(replay_buffer_size)
    # time step counter that decides at which steps the agent learns
    self.t_step = 0



  def step(self, state, action, reward, next_state, done):
    self.memory.push((state, action, reward, next_state, done))
    # increment the counter and wrap it back to 0 every 4 steps
    self.t_step = (self.t_step + 1) % 4

    # learn once every 4 steps
    if self.t_step == 0:
      # self.memory is a ReplayMemory instance, so the stored list is self.memory.memory

      if len(self.memory.memory) > minibatch_size:
        experiences = self.memory.sample(minibatch_size)
        self.learn(experiences, discount_factor)



  # select an action for a given state following an epsilon-greedy policy
  def act(self, state, epsilon = 0.):
    # convert the state (8 values) to a float tensor and add a batch dimension
    # at the front, [] -> [[]], then move it to the device
    state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
    # switch the local network (an nn.Module) to evaluation mode for inference
    self.local_qnetwork.eval()
    # disable gradient tracking, since we are making predictions, not training
    with torch.no_grad():
      action_values = self.local_qnetwork(state)
    # switch the local network back to training mode
    self.local_qnetwork.train()
    # epsilon-greedy: if the random number is larger than epsilon, pick the action
    # with the largest Q-value, otherwise pick a random action
    if random.random() > epsilon:
      return np.argmax(action_values.cpu().data.numpy())
    else:
      return random.choice(np.arange(self.action_size))

  def learn(self, experiences, discount_factor):
    states, next_states, actions, rewards, dones = experiences
    # Q-values of the next states from the target network, detached from the computation graph
    # max(1) returns the maximum of each row and its index, so [0] keeps the maximum values
    # unsqueeze(1) adds a dimension to give one row per experience, [] -> [[], [], []]
    next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)

    # compute the Q-targets for the current states; (1 - dones) zeroes the future term when the episode ended
    q_targets = rewards + discount_factor * next_q_targets * (1 - dones)
    # Q-values the local network predicts for the actions that were actually taken
    q_expected = self.local_qnetwork(states).gather(1, actions)
    # calculate the loss
    loss = F.mse_loss(q_expected, q_targets)
    # reset the gradients from the previous update
    self.optimizer.zero_grad()
    # backpropagate the loss
    loss.backward()
    # perform a single optimization step
    self.optimizer.step()
    # update the target network slowly so it follows improvements without changing too rapidly
    self.soft_update(self.local_qnetwork, self.target_qnetwork, interpolation_parameter)

  def soft_update(self, local_model, target_model, interpolation_parameter):
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
      target_param.data.copy_(interpolation_parameter * local_param.data + (1.0 - interpolation_parameter) * target_param.data)

This cell creates an agent class that implements Deep Q-Learning and interacts with the environment using a local and a target Q-network. The local Q-network selects actions, and the target Q-network calculates the target Q-values used to train the local Q-network, which stabilizes the learning process. The soft update blends the local network's parameters into the target network's parameters a little at a time, avoiding abrupt changes that would destabilize training. The act method chooses an action: it forward-propagates the state through the local Q-network and follows an epsilon-greedy policy, where random actions serve as the exploration mechanism. The learn method uses experiences sampled from the replay memory to update the local Q-network toward the target Q-values.

#### Constructor
This creates the reinforcement learning agent and initializes the key components it needs to interact with an environment and learn through the Q-learning approach. It uses the GPU if possible. It stores the state size and the action size, which set the input and output layer sizes of the neural networks. The local network is the main network for selecting actions: the agent chooses actions based on the Q-values this network predicts. The target network is a secondary network used to calculate the target Q-values, the expected rewards the agent aims to learn over time. Keeping it separate reduces the risk of the local network chasing its own tail and leads to better convergence. The optimizer updates the weights of the local network during training, aiming to minimize the loss between the predicted and target Q-values. Its learning rate is a hyperparameter that determines the size of the update steps, balancing speed with stability. The memory is an instance of ReplayMemory, which stores the past experiences the agent has encountered, up to the given maximum capacity. Finally, t_step counts time steps so the agent can perform learning updates at fixed intervals.


#### Step
The step function takes the state, action, reward, next state, and done flag. It pushes this experience tuple to the replay buffer by calling self.memory.push. The counter t_step is incremented by 1 on each call and wraps back to 0 on every fourth call. When t_step is 0 and the memory holds more experiences than the minibatch size, a minibatch is sampled and used for learning. The discount factor is passed to the learn method, where it sets how much the agent values future rewards.
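The counter logic can be traced in isolation. The sketch below simulates twelve calls and records which ones would trigger learning. The constant name `UPDATE_EVERY` is made up for illustration, since the notebook writes the 4 inline, and the buffer-size check is left out.

```python
UPDATE_EVERY = 4  # hypothetical name for the constant 4 used in step
t_step = 0
learning_steps = []

for env_step in range(1, 13):      # twelve environment steps
    t_step = (t_step + 1) % UPDATE_EVERY
    if t_step == 0:                # counter wrapped around: time to learn
        learning_steps.append(env_step)

print(learning_steps)
```

Learning every fourth step rather than every step lets new experiences accumulate between updates and cuts the training cost per environment step.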


#### Action
The act function decides which action the agent should take based on the given state and the exploration parameter epsilon. It implements an epsilon-greedy policy: with probability 1 - epsilon it chooses the action with the highest predicted Q-value, and otherwise it chooses a random action to encourage exploration. torch.from_numpy converts the state to a tensor the neural network can use, unsqueeze(0) adds a batch dimension at the front, and the tensor is moved to the CPU or GPU. The network is put into evaluation mode and gradient computation is disabled. A forward pass through the network with the current state gives the Q-values for each action, after which the network is switched back to training mode. If a random number is greater than epsilon, the function moves the Q-values to the CPU, converts the tensor to a NumPy array, and returns the argmax, the index of the action to take (0, 1, 2, or 3). Otherwise it picks an action at random from the array of action indices.
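The selection rule on its own can be sketched in plain Python. The Q-values below are hypothetical, the helper name `epsilon_greedy` is made up, and a seeded `random.Random` is used only so the sketch is repeatable. The notebook's `act` applies the same comparison to the network's output.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    # exploit: index of the largest Q-value; explore: uniformly random index
    if rng.random() > epsilon:
        return max(range(len(q_values)), key=lambda i: q_values[i])
    return rng.randrange(len(q_values))

rng = random.Random(0)                      # seeded only to make the sketch repeatable
q_values = [0.1, -0.3, 2.5, 0.7]            # hypothetical Q-values for the 4 actions
greedy_picks = [epsilon_greedy(q_values, 0.0, rng) for _ in range(100)]
random_picks = [epsilon_greedy(q_values, 1.0, rng) for _ in range(100)]
print(set(greedy_picks), sorted(set(random_picks)))
```

With epsilon at 0 the agent always picks the action with the largest Q-value, and with epsilon at 1 every pick is random. Training moves between these two extremes as epsilon decays.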

#### Learning
This method lets the agent learn from past experiences. In each experience, the agent took an action, received a reward, and ended up in a new state. The goal is to use these experiences to predict which action to take in each state to get the highest possible reward. There are two networks, one local and one target. The local network is constantly learning and being updated, and it represents the agent's current understanding of which actions lead to the best rewards. The target network follows it slowly, which keeps training stable and prevents wild changes. The agent uses the target Q-network to predict the best possible value it could get from each next state by taking the best action there: it takes the maximum Q-value over actions for each next state and detaches the result from the computation graph. For each experience, it then computes a target value: the immediate reward + discount factor * next_q_targets * (1 - done). The discounted term estimates what the agent would collect in the future from the next state, and the (1 - done) factor zeroes it when the episode has ended. Done is 0 while the episode is still running and 1 once it has ended. The method also calculates the expected values from the local Q-network, using the gather function to pick, for each state, the Q-value at the index of the action that was actually taken. It calculates the loss between the expected and the target Q-values. It calls self.optimizer.zero_grad() to clear the previous gradients, loss.backward() to compute the gradient of the loss with respect to each parameter of the local network, and self.optimizer.step() to update the parameters based on those gradients, reducing the loss.
Finally, it calls a soft update to gradually align the target Q-network with the local Q-network, using the interpolation parameter to blend new information into the existing information.
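The target computation can be worked through by hand on a tiny batch. The numbers below are hypothetical, and plain lists stand in for the tensors, with `max(q)` playing the role of `max(1)[0]` on the target network's output.

```python
gamma = 0.99  # same value as discount_factor in the notebook

# hypothetical minibatch of three transitions
rewards = [1.0, -0.5, 100.0]
dones = [0.0, 0.0, 1.0]                      # the last transition ended its episode
next_q_values = [[0.2, 1.5, 0.3, 0.1],       # target-network outputs for each next state
                 [2.0, 0.4, 0.9, 1.1],
                 [5.0, 4.0, 3.0, 2.0]]

# target = reward + gamma * max_a Q_target(next_state, a) * (1 - done)
q_targets = [r + gamma * max(q) * (1.0 - d)
             for r, q, d in zip(rewards, next_q_values, dones)]
print(q_targets)
```

For the third transition the episode has ended, so the target is just the reward: the large next-state Q-values are multiplied by zero and ignored.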

Two networks are needed for stability. The interpolation parameter changes the target network slowly, so the target values do not shift rapidly and the local network is not chasing its own estimates.


**Backpropagation**: Backpropagation computes the gradient of the loss with respect to every weight in the local network, which shows how much the loss would change if that weight changed slightly. Because the weights are connected through the layers, changing one weight also affects what the later layers compute. The optimizer then adjusts each weight in the direction that reduces the loss, so the weights that matter most for a good prediction end up with the most influence. Repeating this for every batch moves the network toward predicting the most suitable action for each state.


#### Soft Update
The soft update moves each target parameter toward the corresponding local parameter, using the interpolation parameter to make the change gradual. The new target parameter is the interpolation parameter times the local parameter, plus (1 - interpolation parameter) times the old target parameter.
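A scalar version of the blend shows how slowly the target follows. In this sketch a single hypothetical weight stands in for a whole parameter tensor, and the local value is held fixed at 1.0, which would not be the case during real training.

```python
tau = 1e-3  # same value as interpolation_parameter in the notebook

def soft_update(local_value, target_value, tau):
    # weighted average: a little of the local value, most of the old target value
    return tau * local_value + (1.0 - tau) * target_value

local, target = 1.0, 0.0   # hypothetical single weights; the local one is held fixed
for _ in range(1000):
    target = soft_update(local, target, tau)
print(target)
```

Even after a thousand updates the target has covered only part of the distance to the local value, which is the intended effect: the targets used in the loss drift rather than jump.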

### Initializing the DQN agent

In [None]:
agent = Agent(state_size, number_actions)

### Training the DQN agent

In [None]:
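# maximum number of episodes
number_episodes = 2000
# maximum number of time steps per episode
maximum_number_timesteps_per_episode = 1000
# epsilon starts at 1.0 and is multiplied by the decay value after each episode, down to the ending value
epsilon_starting_value = 1.0
epsilon_ending_value = 0.01
epsilon_decay_value = 0.995
epsilon = epsilon_starting_value
# scores of the most recent 100 episodes
scores_on_100_episodes = deque(maxlen = 100)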

for episode in range(1, number_episodes + 1):
  # reset the environment to an initial state
  state, _ = env.reset()
  score = 0
  for t in range(maximum_number_timesteps_per_episode):
    # act selects an action in the given state following an epsilon-greedy policy
    action = agent.act(state, epsilon)
    # the Gymnasium environment applies the action and returns the next state, the reward,
    # and two end-of-episode flags: terminated (end state reached) and truncated (time limit hit)
    next_state, reward, terminated, truncated, _ = env.step(action)
    # the agent stores this transition from the current state to the next state and learns from it
    agent.step(state, action, reward, next_state, terminated)
    state = next_state
    score += reward
    if terminated or truncated:
      break
  # append the total score of the episode
  scores_on_100_episodes.append(score)
  # decay epsilon, but never below the ending value
  epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)
  # dynamic print: \r returns to the start of the line so each update overwrites the previous one
  print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)), end = "")
  if episode % 100 == 0:
    print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)))
  if np.mean(scores_on_100_episodes) >= 200.0:
    print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode - 100, np.mean(scores_on_100_episodes)))
    torch.save(agent.local_qnetwork.state_dict(), 'checkpoint.pth')
    break

Episode 100	Average Score: -173.75
Episode 200	Average Score: -136.14
Episode 300	Average Score: -54.68
Episode 400	Average Score: 6.26
Episode 500	Average Score: 99.97
Episode 600	Average Score: 179.58
Episode 631	Average Score: 200.57
Environment solved in 531 episodes!	Average Score: 200.57


Training:
The training cell sets the maximum number of episodes and the maximum number of time steps per episode. Epsilon, which controls the exploration rate of the epsilon-greedy policy, starts at 1. Its minimum value is 0.01, which keeps some exploration even at the end of training. It is multiplied by the decay value after each episode, which gradually shifts the agent from exploration to exploitation. An episode is a complete sequence of interactions between the agent and the environment, from an initial state until a terminal condition is met. At the start of each episode the environment resets to an initial configuration, and the score tracks the cumulative reward of the current episode. At each time step the agent selects an action for the current state following the epsilon-greedy policy: with probability epsilon it takes a random action, otherwise the best known action. The environment executes the action and returns the resulting state, the reward for the action, and flags indicating whether the episode has ended. The agent then stores the transition from the current state to the next state and learns from it. If the episode has ended, the inner loop stops early. The cumulative score of the episode is appended to a deque that keeps the last 100 scores. Epsilon decreases after each episode, so the agent explores less over time, but it never falls below the epsilon ending value.
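The decay schedule can be checked with a few lines of arithmetic. The sketch below reuses the three epsilon values from the training cell and counts how many episodes pass before epsilon reaches its floor. The variable names are shortened for illustration.

```python
epsilon, epsilon_min, decay = 1.0, 0.01, 0.995   # values from the training cell

episodes_to_floor = 0
while epsilon > epsilon_min:
    # same update as in the training loop
    epsilon = max(epsilon_min, decay * epsilon)
    episodes_to_floor += 1

print(episodes_to_floor, epsilon)
```

The floor is reached only after several hundred episodes, so in the recorded run, which stopped at episode 631, the agent was still taking an occasional random action above the minimum rate.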


Over time, the dynamic printing shows the average score increasing as the agent keeps learning and its model parameters improve. The average score becomes positive, keeps rising, and training stops once the average over the last 100 episodes reaches 200.
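The stopping rule depends on a rolling window rather than a single score. The sketch below applies the same check to made-up scores that rise by 3 points per episode. The scores and the name `solved_at` are assumptions for illustration, not results from the notebook.

```python
from collections import deque

recent_scores = deque(maxlen=100)   # same window size as scores_on_100_episodes
solved_at = None

# hypothetical scores that rise by 3 points per episode, starting near -200
for episode in range(1, 301):
    recent_scores.append(-200.0 + 3.0 * episode)
    average = sum(recent_scores) / len(recent_scores)
    if average >= 200.0:
        solved_at = episode
        break

print(solved_at, len(recent_scores))
```

Because the window averages the last 100 episodes, the check passes well after the first individual score above 200, which guards against stopping on one lucky landing.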

## Part 3 - Visualizing the results

In [None]:
import glob
import io
import base64
import imageio
from IPython.display import HTML, display

def show_video_of_model(agent, env_name):
    env = gym.make(env_name, render_mode='rgb_array')
    state, _ = env.reset()
    done = False
    frames = []
    while not done:
        frame = env.render()
        frames.append(frame)
        action = agent.act(state)
        state, reward, terminated, truncated, _ = env.step(action.item())
        # stop when the episode ends or the environment's time limit is reached
        done = terminated or truncated
    env.close()
    imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, 'LunarLander-v3')

def show_video():
    mp4list = glob.glob('*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()



The agent is already trained, so here we only call act to choose actions and the environment's step method to advance the simulation. No further training takes place.

Overall, this agent solves the Lunar Lander environment, a classic reinforcement learning task in which an agent learns to land a spacecraft safely.

Part 0: Installing and Importing Packages
- packages are installed
- libraries are imported

Part 1: Building Neural Network
- Architecture: It has three fully connected layers
  - Input of state-size values
  - Two hidden layers with 64 neurons each and ReLU activations
  - Output layer with action-size nodes, one predicted Q-value per action

Part 2: Training AI model
- Setting up the environment:
  - The Lunar Lander environment from Gymnasium is loaded. The environment's observation space shape, state size, and number of possible actions are printed

- Hyperparameters:
  - Learning rate, minibatch size, discount factor for future rewards, replay buffer size for experience replay, and interpolation parameter for soft updates of the target network are initialized.

- Experience Replay:
  - Replay Memory is used to store and sample experiences
    - push: adds a new experience to memory, removing the oldest if the buffer exceeds capacity
    - sample: samples a random batch of experiences to break temporal correlations in training and help stabilize the learning process

- DQN agent:
  - Controls the DQN agent training and action selection.
  - Network Initialization: Two networks are initialized
    - Local network: used for predicting Q-values during action selection and learning.
    - Target network: used for calculating stable target values during training
  - Optimizer: Adam optimizer with the specified learning rate
  - Experience Replay: a buffer for storing and sampling experiences
  - Step Function:
    - Stores experiences and updates the Q-network every 4 steps if enough experiences are available
  - Action Selection
    - With probability epsilon, the agent chooses a random action (exploration)
    - Otherwise, it selects the action with the highest predicted Q-value
    - Epsilon decays over time, encouraging the agent to explore initially and exploit more as it learns

  - Learning Function:
    - The function updates the Q-network
    - target Q-values: calculated using the target network on the next state, with the highest Q-value selected
    - Q loss calculation: mean squared error loss between expected and target Q-values
    - the loss is backpropagated through the network and the weights are updated by the optimizer

  - Soft Update:
    - Smoothly updates the target network
    - Updates each parameter in the target network as a weighted average of itself and the corresponding parameter from the local network.


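Remember: the local Q-network is the one being trained. Its predictions are pushed toward targets computed with the target network (the reward plus the discounted maximum Q-value of the next state), and the target network in turn slowly follows the local network through soft updates.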

- Training Loop
  - The agent is trained over multiple episodes
  - State Reset: resets the environment and tracks the score per episode
  - Action Selection and Reward Collection: the agent selects actions, interacts with the environment, and stores rewards
  - Episode End: the environment is solved when the average score over the last 100 episodes reaches 200; training then stops and the model weights are saved.
  - Epsilon Decay: epsilon decays with each episode to reduce exploration gradually

Part 3: Visualizing Results
- two functions display the trained agent's performance visually
- show_video_of_model records a video of the trained agent interacting with the environment, and show_video displays it.


