<a target="_blank" href="https://colab.research.google.com/github/rcpaffenroth/dac_raghu/blob/main/LunarLander.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Setup and libraries

## Colab specific stuff

This code makes sure the notebook works in Colab. It is not needed when you run the notebook locally.

In [1]:
import sys
IN_COLAB = 'google.colab' in sys.modules

In [2]:
if IN_COLAB:
  ! apt-get install swig
  ! pip install stable-baselines3[extra] gymnasium[box2d] huggingface_sb3
else:
  # Otherwise, install the following locally.
  # NOTE: Both "gym" and "gymnasium" need to be installed: "gymnasium" provides the LunarLander environment
  #       and "gym" is needed by huggingface_sb3.
  # sudo apt install swig ffmpeg
  # pip install stable-baselines3[extra] gymnasium[box2d] huggingface_sb3 imageio[ffmpeg] gym ipywidgets ipykernel pandas pyarrow
  pass

## Load the needed libraries

These are the libraries used in this notebook.

In [3]:
import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np

import imageio
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

import pandas as pd

from IPython.display import display
from IPython.display import HTML
from ipywidgets import interact, widgets
from base64 import b64encode
%matplotlib inline

# The Lander environment

The agent controls a lander and tries to land it softly between the two flags.

In [4]:
# Make the environment
env = gym.make("LunarLander-v2", render_mode='rgb_array')
# In gymnasium, reset() returns a tuple of (observation, info)
observation, info = env.reset()


Here is the top-level link to the Gymnasium library

https://gymnasium.farama.org/

Here is the Lunar Lander specific link

https://gymnasium.farama.org/environments/box2d/lunar_lander/



### Action Space
There are four discrete actions available:

0: do nothing

1: fire left orientation engine

2: fire main engine

3: fire right orientation engine
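
When inspecting recorded actions it can help to give the four codes names. The sketch below uses a hypothetical `LanderAction` enum and `describe_actions` helper (neither is part of Gymnasium); the values simply follow the list above.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Hypothetical names for the four discrete actions listed above."""
    NOOP = 0        # do nothing
    FIRE_LEFT = 1   # fire left orientation engine
    FIRE_MAIN = 2   # fire main engine
    FIRE_RIGHT = 3  # fire right orientation engine


def describe_actions(actions):
    """Map a sequence of integer action codes to their names."""
    return [LanderAction(a).name for a in actions]


print(describe_actions([3, 0, 3, 2]))
```

Because `IntEnum` members compare equal to plain integers, such names could be used interchangeably with the raw codes stored in a trajectory.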



### Observation Space

The state is an 8-dimensional vector: the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground or not.

In [5]:
obs_names = ['x', 'y', 'vx', 'vy', 'theta', 'vtheta', 'leg1', 'leg2']
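
As a small illustration of how these names line up with the state vector, the sketch below pairs each entry of an observation with its name. The helper `label_observation` and the example values are assumptions made for illustration, not part of the notebook or of Gymnasium.

```python
obs_names = ['x', 'y', 'vx', 'vy', 'theta', 'vtheta', 'leg1', 'leg2']


def label_observation(obs):
    """Pair each entry of an 8-dimensional observation with its name."""
    if len(obs) != len(obs_names):
        raise ValueError(f"expected {len(obs_names)} values, got {len(obs)}")
    labeled = {name: float(value) for name, value in zip(obs_names, obs)}
    # The last two entries are leg-contact flags stored as 0.0 / 1.0
    labeled['leg1'] = bool(labeled['leg1'])
    labeled['leg2'] = bool(labeled['leg2'])
    return labeled


# Made-up observation for illustration only (not taken from a real run)
example = [0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0]
print(label_observation(example))
```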

### Rewards

After every step a reward is granted. The total reward of an episode is the sum of the rewards for all the steps within that episode.

For each step, the reward:

- is increased/decreased the closer/further the lander is to the landing pad.
- is increased/decreased the slower/faster the lander is moving.
- is decreased the more the lander is tilted (angle not horizontal).
- is increased by 10 points for each leg that is in contact with the ground.
- is decreased by 0.03 points each frame a side engine is firing.
- is decreased by 0.3 points each frame the main engine is firing.

The episode receives an additional reward of -100 or +100 points for crashing or landing safely, respectively.

An episode is considered a solution if it scores at least 200 points.
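
The bookkeeping described in this section can be sketched in a few lines. The function names below are hypothetical; the constants are the ones stated in the text (0.3 and 0.03 points per frame, and the 200-point threshold), and the action codes follow the Action Space list.

```python
SOLVED_THRESHOLD = 200.0  # an episode scoring at least this counts as a solution
MAIN_ENGINE_COST = 0.3    # points lost per frame the main engine fires
SIDE_ENGINE_COST = 0.03   # points lost per frame a side engine fires


def episode_return(step_rewards):
    """Total reward of an episode: the sum of its per-step rewards."""
    return float(sum(step_rewards))


def is_solved(step_rewards):
    """True if the episode total reaches the solution threshold."""
    return episode_return(step_rewards) >= SOLVED_THRESHOLD


def fuel_penalty(actions):
    """Total engine penalty implied by a sequence of action codes (0..3)."""
    main_frames = sum(1 for a in actions if a == 2)
    side_frames = sum(1 for a in actions if a in (1, 3))
    return MAIN_ENGINE_COST * main_frames + SIDE_ENGINE_COST * side_frames


print(is_solved([120.0, 85.5]))
print(fuel_penalty([2, 2, 1, 0, 3]))
```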

### Starting State

The lander starts at the top center of the viewport with a random initial force applied to its center of mass.

### Episode Termination

The episode finishes if:

- the lander crashes (the lander body gets in contact with the moon);
- the lander gets outside of the viewport (x coordinate is greater than 1);
- the lander is not awake. From the Box2D docs, a body which is not awake is a body which doesn’t move and doesn’t collide with any other body:

When Box2D determines that a body (or group of bodies) has come to rest, the body enters a sleep state which has very little CPU overhead. If a body is awake and collides with a sleeping body, then the sleeping body wakes up. Bodies will also wake up if a joint or contact attached to them is destroyed.
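
In Gymnasium, `step` reports the end of an episode through two separate flags: `terminated` (conditions like the ones above) and `truncated` (for example, a time limit). A rollout loop should stop on either. The sketch below shows this with a hypothetical stand-in environment, `CountdownEnv`, so that it runs without Box2D; it only mimics the shape of the `reset`/`step` return values.

```python
class CountdownEnv:
    """Hypothetical stand-in environment with a Gymnasium-style API."""

    def __init__(self, terminate_at=None, time_limit=5):
        self.terminate_at = terminate_at
        self.time_limit = time_limit
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        terminated = self.terminate_at is not None and self.t >= self.terminate_at
        truncated = self.t >= self.time_limit
        return float(self.t), 1.0, terminated, truncated, {}


def run_episode(env, policy):
    """Step until the episode is terminated or truncated; return the step count."""
    obs, _info = env.reset()
    steps = 0
    while True:
        obs, _reward, terminated, truncated, _info = env.step(policy(obs))
        steps += 1
        if terminated or truncated:
            return steps


print(run_episode(CountdownEnv(terminate_at=3), lambda obs: 0))
print(run_episode(CountdownEnv(), lambda obs: 0))
```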

# Training and downloading models

In [6]:
models = {}

In [7]:
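class RandomModel(object):
  """Baseline that ignores the observation and samples a random action."""
  def __init__(self, env):
    self.env = env

  def predict(self, obs):
    # Mirror the (action, state) return of stable-baselines3's predict();
    # the random model has no internal state, so the second value is None
    return self.env.action_space.sample(), None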

random_model =  RandomModel(env)
models['random'] = {}
models['random']['model'] = random_model
models['random']['runs'] = []
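
The same three-line registration pattern is repeated for every model in this section. One possible way to factor it out is sketched below; `register_model` is a hypothetical helper, not something the notebook defines.

```python
def register_model(registry, name, model):
    """Store a model under `name` together with an empty list for its runs."""
    registry[name] = {'model': model, 'runs': []}
    return registry[name]


# Placeholder object standing in for a real model, for illustration only
demo_registry = {}
register_model(demo_registry, 'random', object())
print(list(demo_registry))
```

With such a helper, each cell in this section would reduce to loading or training the model followed by one registration call.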

In [8]:
# This is a model with a good architecture and loss function, but it is not trained very much.  Training takes about 30 sec on
# an RTX 4090
trained_model = PPO("MlpPolicy", env)
trained_model.learn(total_timesteps=20000)

models['trained'] = {}
models['trained']['model'] = trained_model
models['trained']['runs'] = []

In [9]:
# This is a model from huggingface.co at https://huggingface.co/sb3/a2c-LunarLander-v2
# Mean reward: 181.08 +/- 95.35
from stable_baselines3 import A2C

checkpoint = load_from_hub(
    repo_id="sb3/a2c-LunarLander-v2",
    filename="a2c-LunarLander-v2.zip",
)

# The checkpoint was trained with A2C, so load it with A2C rather than PPO
good_model = A2C.load(checkpoint)

models['good'] = {}
models['good']['model'] = good_model
models['good']['runs'] = []

Exception: an integer is required (got type bytes)
Exception: an integer is required (got type bytes)


In [10]:
# This is a model from huggingface.co at https://huggingface.co/araffin/ppo-LunarLander-v2
# Mean reward:  283.49 +/- 13.74
checkpoint = load_from_hub(
    repo_id="araffin/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2.zip",
)

better_model = PPO.load(checkpoint)
models['better'] = {}
models['better']['model'] = better_model
models['better']['runs'] = []

Exception: an integer is required (got type bytes)


# Evaluate models

In [11]:
def evaluate_model(model_name, models=models, env=env):
    """Run one episode with the named model and record its data and frames."""
    # reset() returns (observation, info); only the observation is needed here
    obs, _info = env.reset()

    # Get the model
    model = models[model_name]['model']
    images = []
    all_obs = []
    all_actions = []
    all_rewards = []
    done = False
    while not done:
        # This rendering mode puts an image into a numpy array
        images.append(env.render())
        action, _state = model.predict(obs)
        all_obs.append(obs)
        all_actions.append(int(action))
        obs, reward, terminated, truncated, info = env.step(action)
        all_rewards.append(reward)
        # Stop when the episode terminates or is truncated by the time limit
        done = terminated or truncated
    env.close()

    # One row per timestep: the observation, the action taken, and the reward
    df = pd.DataFrame(all_obs, columns=obs_names)
    df['action'] = all_actions
    df['reward'] = all_rewards
    models[model_name]['runs'].append({'data': df, 'images': images})



In [21]:
# Run each model 16 times. For 3 runs each this takes just under 3 minutes on an RTX 4090
for i in range(16):
    evaluate_model('random')
    evaluate_model('trained')
    evaluate_model('good')
    evaluate_model('better')
    

In [13]:
# run_idx covers the 16 runs per model produced by the evaluation loop
@interact(model_name=list(models.keys()), run_idx=widgets.IntSlider(min=0, max=15, step=1, value=0))
def show_video(model_name, run_idx):
    images = models[model_name]['runs'][run_idx]['images']
    # imageio is a nice library for taking a sequence of images and making a movie
    name = 'tmp.mp4'
    imageio.mimsave(name, images, fps=15)
    # Read the movie back and encode it as a base64 data URL
    with open(name, 'rb') as f:
        mp4 = f.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    # This puts the video in the notebook
    display(HTML("""
    <video width=400 controls>
          <source src="%s" type="video/mp4">
    </video>
    """ % data_url))
    # Plot various data for the run
    _, ax = plt.subplots(1, 3)
    data = models[model_name]['runs'][run_idx]['data']
    ax[0].plot(data['reward'])
    ax[0].set_title('reward')
    ax[1].plot(data['x'], data['y'])
    ax[1].set_title('position')
    ax[2].plot(data['vx'], data['vy'])
    ax[2].set_title('velocity')
    plt.tight_layout()
    plt.show()
      


# Rewards

In [22]:
for model_name in models.keys():
    for run_idx in range(len(models[model_name]['runs'])):
        data = models[model_name]['runs'][run_idx]['data']  
        # Print the total reward for each model
        print(f"{model_name}: {np.sum(data['reward']):.2f}")

random: -84.58
random: -222.23
random: -175.50
random: -167.85
random: -30.62
random: -298.41
random: -387.59
random: -441.15
random: -332.07
random: -170.02
random: -298.44
random: -178.78
random: 2.07
random: -215.06
random: -249.00
random: -161.29
trained: -28.31
trained: -86.46
trained: -314.67
trained: -40.17
trained: -33.12
trained: -137.18
trained: -243.20
trained: -32.47
trained: -93.13
trained: -77.55
trained: -233.26
trained: -112.91
trained: -33.93
trained: -55.08
trained: -11.77
trained: -64.51
good: -17.88
good: 12.27
good: 80.43
good: -140.53
good: -6.87
good: 265.26
good: 247.57
good: 6.11
good: 251.08
good: -18.18
good: 58.82
good: 0.83
good: 82.66
good: 41.12
good: 29.06
good: 42.86
better: 292.01
better: 244.05
better: 292.39
better: 316.78
better: 287.64
better: 305.41
better: 297.39
better: 283.23
better: -167.55
better: 306.69
better: 280.88
better: 262.24
better: 297.33
better: 245.49
better: 275.81
better: 276.15
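
The per-run totals above are easier to compare once summarized per model. The sketch below computes the mean, the sample standard deviation, and the fraction of runs that reach the 200-point solution threshold. `summarize_returns` is a hypothetical helper, and only the first four totals of two models are copied from the output above to keep the example short.

```python
from statistics import mean, stdev


def summarize_returns(returns, solved_threshold=200.0):
    """Mean, sample standard deviation, and solved fraction of episode totals."""
    n = len(returns)
    return {
        'mean': mean(returns),
        'std': stdev(returns) if n > 1 else 0.0,
        'solved_fraction': sum(r >= solved_threshold for r in returns) / n,
    }


# The first four totals for two of the models, copied from the output above
sample = {
    'random': [-84.58, -222.23, -175.50, -167.85],
    'better': [292.01, 244.05, 292.39, 316.78],
}
for name, returns in sample.items():
    s = summarize_returns(returns)
    print(f"{name}: mean={s['mean']:.2f} std={s['std']:.2f} "
          f"solved={s['solved_fraction']:.0%}")
```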


# Save model trajectories

In [23]:
import os
os.makedirs('data/lander', exist_ok=True)  # make sure the output directory exists
for model_name in models.keys():
    for run_idx in range(len(models[model_name]['runs'])):
        data = models[model_name]['runs'][run_idx]['data']
        data.to_parquet(f'data/lander/{model_name}_{run_idx}_trajectory.parquet')
        

In [16]:
data['action']

0      3
1      0
2      3
3      3
4      3
      ..
191    0
192    0
193    0
194    0
195    0
Name: action, Length: 196, dtype: int64

# Write files

In [17]:
# This section produces the data for a generative model of the Lunar Lander
# Create a single dataframe with all the data
# each row is a single run of a single model
# each column is a single timestep of a single variable

# NOTE:  There needs to be some thinking here.  While the position,
# velocity, and angle are all continuous, the thrust is not.  So, we need to
# think about how to interpolate the thrust. I think the data in this case
# needs to be "ragged" in the sense that each row has a different number of
# entries.  However, perhaps we can also just look at the "shortest" run and
# truncate all the other runs to that length. 


# TODO:  This is getting close, but is not there yet.  I want things like
# 'x,x,x' to be something like 'x1,x2,x3' so that I can use the autoencoder
# more easily.  Is that a matter of combining the columns?  I think so.  
# How about keeping a dict to map times to indices?  That would work I think.

def uniform_data_for_autoencoder(filename, entries_per_run=100):
    all_data = []
    for model_name in models.keys():
        for run_idx in range(len(models[model_name]['runs'])):
            df = models[model_name]['runs'][run_idx]['data'].copy()  

            # index plays the role of timestep
            df['timestamp'] = pd.to_datetime(df.index, unit='s')
            df.set_index('timestamp', inplace=True)

            # We now compute the delta t that splits each run into entries_per_run equal intervals:
            # the total time of the run divided by entries_per_run.
            # With the default of 100 this gives 101 sample points, counting both endpoints.
            total_time = df.index[-1] - df.index[0]
            delta_t = total_time / entries_per_run
            df = df.resample(delta_t).interpolate()

            df = pd.melt(df, 
                        value_vars=['x', 'y', 'vx', 'vy', 'theta', 'vtheta'], 
                        var_name='variable', 
                        ignore_index=False, 
                        value_name='value')
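            # Store the model name and total time alongside the data.
            # NOTE: after melt the timestamp index is not unique, so these assignments
            # overwrite every row at the first/last timestamp instead of adding new rows.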
            df.loc[df.index[0]] = ['model_name', model_name]
            df.loc[df.index[-1]] = ['total_time', total_time]
            all_data.append(df)

    # for i,df in enumerate(all_data):
    #     if i == 0:
    #         all_data = pd.DataFrame(df).T
    #     df['run_idx'] = i    
    # all_data = pd.concat(all_data)
    # all_data.to_parquet(filename)
    return all_data
all_data = uniform_data_for_autoencoder('data/uniform_data_for_autoencoder.parquet')
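
The NOTE in the cell above raises the question of how to bring runs of different lengths to a common length when some signals are continuous and the action is discrete. One possible approach, sketched here without pandas, is to interpolate continuous signals linearly and to resample discrete ones by holding the most recent value, so no in-between action codes are invented. Both function names are hypothetical, and this is an illustration of the idea rather than what the notebook's `resample(...).interpolate()` call does.

```python
def resample_linear(values, n_points):
    """Linearly interpolate per-timestep values onto n_points evenly spaced times."""
    if n_points < 2 or len(values) < 2:
        raise ValueError("need at least two input values and two output points")
    last = len(values) - 1
    out = []
    for i in range(n_points):
        t = i * last / (n_points - 1)  # fractional timestep in [0, last]
        lo = min(int(t), last - 1)     # index of the left neighbour
        frac = t - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


def resample_hold(values, n_points):
    """Resample a discrete signal by holding the most recent value (no blending)."""
    if n_points < 2 or len(values) < 2:
        raise ValueError("need at least two input values and two output points")
    last = len(values) - 1
    return [values[int(i * last / (n_points - 1))] for i in range(n_points)]


# Made-up signals for illustration only
print(resample_linear([0.0, 2.0, 4.0], 5))
print(resample_hold([0, 2, 3], 5))
```

Holding the previous value is only one choice for the thrust; a one-hot encoding of the action followed by interpolation would be another, with different trade-offs.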

In [18]:
# Take the long-format frame of the last run processed above
df = all_data[-1]
tmp = pd.DataFrame(columns=df['variable'])
tmp.loc[-1] = list(df['value'])
tmp

In [None]:
df['value']

timestamp
1970-01-01 00:00:00.000             better
1970-01-01 00:00:02.360          -0.016149
1970-01-01 00:00:04.720          -0.025983
1970-01-01 00:00:07.080          -0.035816
1970-01-01 00:00:09.440          -0.045649
                                ...       
1970-01-01 00:03:46.560           0.006841
1970-01-01 00:03:48.920            0.00513
1970-01-01 00:03:51.280            0.00342
1970-01-01 00:03:53.640            0.00171
1970-01-01 00:03:56.000    0 days 00:03:56
Name: value, Length: 606, dtype: object
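
The TODO above asks for column names like 'x1,x2,x3' instead of repeated 'x,x,x'. One way to get unique names is to number each variable's samples while flattening a run into a single row. The sketch below uses a hypothetical `flatten_run` helper, plain Python containers, and made-up values; the `x_0`, `x_1` naming is an assumption.

```python
def flatten_run(series_by_variable):
    """Turn {'x': [...], 'y': [...]} into one flat dict keyed 'x_0', 'x_1', ..."""
    row = {}
    for variable, values in series_by_variable.items():
        for step, value in enumerate(values):
            row[f"{variable}_{step}"] = value
    return row


# Made-up values for illustration only
run = {'x': [0.0, 0.1, 0.2], 'y': [1.4, 1.3, 1.1]}
row = flatten_run(run)
print(list(row))
```

A list of such dicts, one per run, can be passed to `pd.DataFrame` to obtain one row per run, provided every run has been resampled to the same number of points.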