### Documentation

Interesting problems for reinforcement learning
 * Gymnasium: https://gymnasium.farama.org/environments/box2d/

## Installation

```bash
%pip install gymnasium  
%pip install gymnasium[box2d] 
```

## Additional steps

These may be needed *before* installing `gymnasium[box2d]`.

### On macOS

```bash
pip uninstall swig
xcode-select --install  # installs the developer tools if they are not already present
pip install swig  # or, with MacPorts: sudo port install swig-python
pip install 'gymnasium[box2d]'  # zsh requires the quotes
```

### On Windows

```bash
pip install swig
```

If installing `box2d` fails, the usual cause is a missing or mismatched version of Microsoft C++ Build Tools, which Box2D depends on.  
To fix it, follow these steps:
 * `%pip install --upgrade wheel setuptools`
 * Download Microsoft C++ Build Tools from https://visualstudio.microsoft.com/visual-cpp-build-tools/.
 * In the installer, select the "Desktop development with C++" workload.
 * Restart your Jupyter Notebook or Visual Studio session.
 * Run `%pip install gymnasium[box2d]` again from your notebook.

### On Linux (Colab)
```bash
pip install swig
```

In [1]:
# play Lunar Lander manually (human control)

import gymnasium as gym
import gymnasium.utils.play
import pygame

env = gym.make("LunarLander-v3", render_mode="rgb_array")

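# map keyboard keys to LunarLander's discrete actions (0 = do nothing)
lunar_lander_keys = {
    (pygame.K_UP,): 2,  # fire main engine
    (pygame.K_LEFT,): 1,  # fire left orientation engine
    (pygame.K_RIGHT,): 3,  # fire right orientation engine
}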

try:
    gymnasium.utils.play.play(env, zoom=3, keys_to_action=lunar_lander_keys, noop=0)
except KeyboardInterrupt:
    pass

In [None]:
# run Lunar Lander with an agent

from typing import Callable

import gymnasium as gym

env = gym.make("LunarLander-v3", render_mode="human")


def run(policy: Callable) -> float:
    # observation, info = env.reset(seed=42)
    observation, info = env.reset()
    ite = 0
    racum = 0
    while True:
        ite += 1
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)

        racum += reward

        if terminated or truncated:
            r = (racum + 200) / 500
            print(
                f"Episode finished after {ite} timesteps with reward {racum:.2f} (normalized: {r:.2f})"
            )
            return racum

### How do you build the fitness function for genetic algorithms?

 * The MLP module already implements the multilayer perceptron. Build it with `MLP(architecture)`.
 * `architecture` is a tuple `(inputs, layer1, layer2, ...)`.
 * The fitness function takes the individual's chromosome and converts it into MLP weights with `model.from_chromosome(ch)`.
 * It then calls `run` for N episodes (averaging several episodes gives stability) and computes the mean reward.
 * This mean reward is the individual's fitness.
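
The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `fitness` signature, the `n_episodes` parameter, and the choice to pass `model`, `run_episode`, and `policy` explicitly are assumptions made here so the function stays self-contained. The only MLP method it relies on is `from_chromosome`.

```python
from statistics import mean


def fitness(chromosome, model, run_episode, policy, n_episodes=5):
    """Mean episode reward obtained with the weights encoded in `chromosome`.

    Hypothetical signature: `model` must expose from_chromosome(),
    `run_episode(policy)` must play one episode and return its total reward
    (like `run` in this notebook), and `policy` must read its weights from `model`.
    """
    # Decode the chromosome into the network weights.
    model.from_chromosome(chromosome)
    # Average several episodes to reduce the noise of a single rollout.
    return mean(run_episode(policy) for _ in range(n_episodes))
```

With the objects defined in this notebook, the call would look like `fitness(ch, model, run, policy)`; a higher value means a better individual.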

In [None]:
# neuroevolution

# build the model
from MLP import MLP

architecture = [8, 6, 4]  # 8 observations, 6 hidden units, 4 actions
model = MLP(architecture)
ch = [
    -87.12801974641987,
    36.48558651015576,
    -77.5196695444794,
    35.75198191189928,
    111.01006267351845,
    11.422649506231167,
    -44.90817164791241,
    945.5587338111916,
    -1112.5463051622548,
    -13.33544803949853,
    4.584554711582121,
    -87.65340394738321,
    -2.2556030695017038,
    -26.75428299935855,
    -140.26655027346777,
    -14.608894240253896,
    42.08959472873624,
    3.3516498820118006,
    471.11475708267074,
    -7.702968928661917,
    -142.46718287837473,
    137.53397872282346,
    21.98632170308605,
    -89.39817481467736,
    -253.24542383847853,
    42.359394754231175,
    41.69152089042732,
    151.9270416748598,
    0.9978628640326181,
    -18.383756723590665,
    -4.525010334037926,
    25.271250232485908,
    74.17208760670535,
    -5.353883457823648,
    -690.7626765740135,
    0.45795135391706043,
    -485.1300306972676,
    35.66227720278835,
    318.9243057185713,
    17969.50962570757,
    -118.57498112877332,
    -41.73875579363815,
    -185.99607997542427,
    -233.58016934011602,
    140.7149284095578,
    -1477.6553214052008,
    -1742.243268997781,
    24.042645010505158,
    -208.11491135355286,
    -17.699687972267053,
    -203.24516651024112,
    -201.50947814637018,
    0.0058133770172840785,
    -10.365214694786628,
    27.30839470604934,
    74.07879584641425,
    -77.14068141806496,
    -3254.0341953089373,
    -32.019010362334804,
    -184.07708277300958,
    0.01810857486509404,
    10.32138622060353,
    37.53349665565992,
    -71.03997632269001,
    -8.47393145765427,
    -8.386369590479774,
    -7.606773135168115,
    -42887.90266832312,
    -517.3726095271296,
    -99.60228615747565,
    -416.1578665491021,
    2972.744642954184,
    93.14431371250473,
    -18.139858416253432,
    7.115536358928599,
    -8.878161773299086,
    80.18719363581324,
    -12.733019480062692,
    14.609586024845768,
    3.3782982416443885,
    -6.2988490200508025,
    5.118150406314172,
]
# load into the model the weights of the best chromosome found by neuroevolution
model.from_chromosome(ch)

import numpy as np


# define the policy
def policy(observation):
    s = model.forward(observation)
    # pick the action with the highest network output
    action = np.argmax(s)
    return action

In [None]:
from matplotlib.pyplot import show

fig, ax = model.plot_network()
show()

In [None]:
N = 10  # maximum number of evaluation episodes
r = 0
episodes = 0  # episodes actually completed
try:
    for _ in range(N):
        r += run(policy)
        episodes += 1
except KeyboardInterrupt:
    pass  # stop early and report on the episodes completed so far

if episodes:
    print(f"Total reward over {episodes} episodes: {r:.2f}")
    print(f"Average reward over {episodes} episodes: {r/episodes:.2f}")

Episode finished after 126 timesteps with reward 48.20 (normalized: 0.50)
Episode finished after 143 timesteps with reward 33.36 (normalized: 0.47)
Episode finished after 250 timesteps with reward 250.67 (normalized: 0.90)
Episode finished after 149 timesteps with reward 1.74 (normalized: 0.40)
Episode finished after 99 timesteps with reward -11.91 (normalized: 0.38)
Episode finished after 300 timesteps with reward 239.93 (normalized: 0.88)
Episode finished after 237 timesteps with reward 265.97 (normalized: 0.93)
Episode finished after 183 timesteps with reward 266.05 (normalized: 0.93)
Episode finished after 458 timesteps with reward 234.35 (normalized: 0.87)
Total reward over 9 episodes: 1328.36
Average reward over 9 episodes: 147.60


In [None]:
# parallelize the map, even on Windows
# https://github.com/joblib/loky

from loky import get_reusable_executor

executor = get_reusable_executor()

# results = executor.map(fitness, poblacion)
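
The commented-out `executor.map(fitness, poblacion)` call can be wrapped in a small helper. The sketch below assumes only that the executor offers the standard `concurrent.futures` `map` method, as the call above implies. `evaluate_population` is a hypothetical helper name, and a thread pool from the standard library stands in for loky so the example runs without extra dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(fitness, population, executor):
    """Score every individual with executor.map; return (best, best_score, scores)."""
    # map preserves input order, so scores[i] belongs to population[i].
    scores = list(executor.map(fitness, population))
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return population[best_index], scores[best_index], scores


if __name__ == "__main__":
    # Toy stand-ins: each "chromosome" is a list and its fitness is its sum.
    toy_population = [[1, 2], [5, 5], [0, 3]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        best, best_score, scores = evaluate_population(sum, toy_population, pool)
    print(best, best_score, scores)
```

With loky, the executor returned by `get_reusable_executor()` can be passed in place of `pool`.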

#### Still want more?

Try controlling Flappy Bird: https://github.com/markub3327/flappy-bird-gymnasium

`pip install flappy-bird-gymnasium`

`import flappy_bird_gymnasium`  
`env = gym.make("FlappyBird-v0")`

State (12 variables):
  * the last pipe's horizontal position
  * the last top pipe's vertical position
  * the last bottom pipe's vertical position
  * the next pipe's horizontal position
  * the next top pipe's vertical position
  * the next bottom pipe's vertical position
  * the next next pipe's horizontal position
  * the next next top pipe's vertical position
  * the next next bottom pipe's vertical position
  * player's vertical position
  * player's vertical velocity
  * player's rotation

Actions:
  * 0 -> do nothing
  * 1 -> flap
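
To reuse the neuroevolution setup for Flappy Bird, the network needs 12 inputs (one per state variable) and 2 outputs (one per action). The sketch below estimates the chromosome length for a given architecture, assuming fully connected layers with one bias per neuron. That assumption matches the 82-value chromosome used with `[8, 6, 4]` earlier in this notebook, but it has not been checked against the MLP module itself. `chromosome_length` is a hypothetical helper, and the hidden layer size of 6 is an arbitrary choice.

```python
def chromosome_length(architecture):
    """Number of weights plus biases in a fully connected network (assumed layout)."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus one bias per output neuron
        for n_in, n_out in zip(architecture, architecture[1:])
    )


lunar_lander_genes = chromosome_length((8, 6, 4))  # 8*6+6 + 6*4+4
flappy_bird_genes = chromosome_length((12, 6, 2))  # 12*6+6 + 6*2+2
print(lunar_lander_genes, flappy_bird_genes)  # prints "82 92"
```

Under that assumption, a genetic algorithm for Flappy Bird with this architecture would evolve chromosomes of 92 genes.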