https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html

## Getting Started

Most of the library follows a scikit-learn-like syntax for its Reinforcement Learning algorithms.

Here is a quick example of how to train and run A2C on a CartPole environment:

In [None]:
# pip install stable-baselines3

In [2]:
import gym
from stable_baselines3 import A2C

In [3]:
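# Note: this cell uses the classic Gym API (gym < 0.26), where reset() returns
# only the observation and step() returns a 4-tuple.
env = gym.make('CartPole-v1')

# Train an A2C agent with a multilayer-perceptron policy.
model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

# Run the trained agent for 1000 steps, resetting whenever an episode ends.
obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()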

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 40.8     |
|    ep_rew_mean        | 40.8     |
| time/                 |          |
|    fps                | 874      |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.652   |
|    explained_variance | 0.644    |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 0.822    |
|    value_loss         | 1.97     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 35.5     |
|    ep_rew_mean        | 35.5     |
| time/                 |          |
|    fps                | 872      |
|    iterations         | 200      |
|    time_elapsed 

-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 61.4      |
|    ep_rew_mean        | 61.4      |
| time/                 |           |
|    fps                | 854       |
|    iterations         | 1400      |
|    time_elapsed       | 8         |
|    total_timesteps    | 7000      |
| train/                |           |
|    entropy_loss       | -0.593    |
|    explained_variance | -2.32e-05 |
|    learning_rate      | 0.0007    |
|    n_updates          | 1399      |
|    policy_loss        | 0.347     |
|    value_loss         | 1.35      |
-------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 67.7      |
|    ep_rew_mean        | 67.7      |
| time/                 |           |
|    fps                | 850       |
|    iterations         | 1500      |
|    time_elapsed       | 8         |
|    total_timesteps    | 7500      |
| train/    
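The `iterations` and `total_timesteps` rows in the logs above are related by simple arithmetic. The sketch below assumes A2C's default rollout length of 5 steps per iteration and a single environment, which is consistent with the logged values; `timesteps_after` is a hypothetical helper written for illustration, not a library function.

```python
# Assumption: default A2C rollout length (n_steps=5) and one environment.
N_STEPS = 5
N_ENVS = 1


def timesteps_after(iterations, n_steps=N_STEPS, n_envs=N_ENVS):
    """Total environment steps collected after a number of training iterations."""
    return iterations * n_steps * n_envs


# Compare against the iteration counts that appear in the logs.
for iterations in (100, 1400, 1500):
    print(iterations, timesteps_after(iterations))

# Iterations needed to use up the requested budget of 10000 timesteps.
iterations_needed = 10_000 // (N_STEPS * N_ENVS)
print(iterations_needed)
```

Under these assumptions, a `total_timesteps=10000` budget corresponds to 2000 iterations.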

Alternatively, if both the environment and the policy are registered, you can create and train a model in one line by passing the environment ID as a string. Registered Gym environments are listed here:

https://github.com/openai/gym/wiki/Environments

In [6]:
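# A2C was already imported above:
# from stable_baselines3 import A2C

# Passing the environment ID as a string lets the model create the environment itself.
model = A2C('MlpPolicy', 'CartPole-v1').learn(10000)

# The rollout below reuses the `env` object created in the earlier cell.
obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()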