# Creating a custom Gym environment

## First steps with the gym interface

As you have noticed in the previous notebooks, an environment that follows the gym interface is quite simple to use.
It mainly provides three methods to the user, with the following signatures (for gym versions >= 0.26):
- `reset()`: called at the beginning of an episode, it returns an observation and a dictionary with additional info (defaults to an empty dict)
- `step(action)`: called to take an action in the environment, it returns the next observation, the immediate reward, whether the new state is a terminal state (episode is finished), whether the max number of timesteps is reached (episode is artificially finished), and additional information
- (Optional) `render()`: allows you to visualize the agent in action. Note that the graphical interface does not work on Google Colab, so we cannot use it directly (we have to rely on `render_mode='rgb_array'` to retrieve an image of the scene).

Under the hood, it also contains two useful properties:
- `observation_space`, which is one of the gym spaces (`Discrete`, `Box`, ...) and describes the type and shape of the observation
- `action_space`, which is also a gym space object and describes the action space, that is, the type of action that can be taken

The best way to learn about [gym spaces](https://gymnasium.farama.org/api/spaces/) is to look at the [source code](https://github.com/Farama-Foundation/Gymnasium/tree/main/gymnasium/spaces), but you need to know at least the main ones:
- `gym.spaces.Box`: A (possibly unbounded) box in $R^n$. Specifically, a Box represents the Cartesian product of n closed intervals. Each interval has the form of one of [a, b], (-oo, b], [a, oo), or (-oo, oo). Example: A 1D-Vector or an image observation can be described with the Box space.
```python
# Example for using image as input:
observation_space = spaces.Box(low=0, high=255, shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)
```                                       

- `gym.spaces.Discrete`: A discrete space in $\{ 0, 1, \dots, n-1 \}$
  Example: if you have two actions ("left" and "right") you can represent your action space using `Discrete(2)`, the first action will be 0 and the second 1.


[Documentation on custom env](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html)

Also keep in mind that Stable-Baselines3 internally uses the previous gym API (<0.26), so every VecEnv returns only the observation after resetting, and its `step()` returns a 4-tuple instead of a 5-tuple (`terminated` and `truncated` are already combined into `done`).
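
The relationship between the two step formats can be sketched in a few lines. The helper below is hypothetical and written only for illustration; it is not how any library implements the conversion, it just shows how `terminated` and `truncated` collapse into a single `done` flag.

```python
def to_legacy_step(step_result):
    """Collapse a 5-tuple (gym >= 0.26) into the older 4-tuple format.

    Hypothetical helper for illustration only, not part of any library.
    """
    obs, reward, terminated, truncated, info = step_result
    done = terminated or truncated  # the episode is over in either case
    return obs, reward, done, info


# Episode cut short by a time limit: not terminal, but truncated
new_style = ([0.5], 1.0, False, True, {})
obs, reward, done, info = to_legacy_step(new_style)
print(done)  # True
```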

In [None]:
import gymnasium as gym

env = gym.make("CartPole-v1")

# Box(4,) means that it is a Vector with 4 components
print("Observation space:", env.observation_space)
print("Shape:", env.observation_space.shape)

# Discrete(2) means that there are two discrete actions
print("Action space:", env.action_space)

# The reset method is called at the beginning of an episode
obs, info = env.reset()

# Sample a random action
action = env.action_space.sample()
print("Sampled action:", action)

obs, reward, terminated, truncated, info = env.step(action)

# Note the obs is a numpy array
# info is an empty dict for now but can contain any debugging info
# reward is a scalar
print(obs.shape, reward, terminated, truncated, info)

##  Gym env skeleton

In practice, this is what a gym environment looks like.
Here, we have implemented a simple grid world where the agent must learn to always go left.

In [None]:
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class GoLeftEnv(gym.Env):
    """
    Custom Environment that follows gym interface.
    This is a simple env where the agent must learn to go always left.
    """

    # This environment only renders to the console (text output), not to a GUI
    metadata = {"render_modes": ["console"]}

    # Define constants for clearer code
    LEFT = 0
    RIGHT = 1

    def __init__(self, grid_size=10, render_mode="console"):
        super(GoLeftEnv, self).__init__()
        self.render_mode = render_mode

        # Size of the 1D-grid
        self.grid_size = grid_size
        # Initialize the agent at the right of the grid
        self.agent_pos = grid_size - 1

        # Define action and observation space
        # They must be gym.spaces objects
        # Example when using discrete actions, we have two: left and right
        n_actions = 2
        self.action_space = spaces.Discrete(n_actions)
        # The observation will be the coordinate of the agent
        # this can be described both by Discrete and Box space
        self.observation_space = spaces.Box(
            low=0, high=self.grid_size, shape=(1,), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        """
        Important: the observation must be a numpy array
        :return: (np.array)
        """
        super().reset(seed=seed, options=options)
        # Initialize the agent at the right of the grid
        self.agent_pos = self.grid_size - 1
        # here we convert to float32 to make it more general (in case we want to use continuous actions)
        return np.array([self.agent_pos]).astype(np.float32), {}  # empty info dict

    def step(self, action):
        if action == self.LEFT:
            self.agent_pos -= 1
        elif action == self.RIGHT:
            self.agent_pos += 1
        else:
            raise ValueError(
                f"Received invalid action={action} which is not part of the action space"
            )

        # Account for the boundaries of the grid
        self.agent_pos = np.clip(self.agent_pos, 0, self.grid_size)

        # Are we at the left of the grid?
        terminated = bool(self.agent_pos == 0)
        truncated = False  # we do not limit the number of steps here

        # Null reward everywhere except when reaching the goal (left of the grid)
        reward = 1 if self.agent_pos == 0 else 0

        # Optionally, we can pass additional info, we are not using that for now
        info = {}

        return (
            np.array([self.agent_pos]).astype(np.float32),
            reward,
            terminated,
            truncated,
            info,
        )

    def render(self):
        # agent is represented as a cross, rest as a dot
        if self.render_mode == "console":
            print("." * self.agent_pos, end="")
            print("x", end="")
            print("." * (self.grid_size - self.agent_pos))

    def close(self):
        pass

### Validate the environment

Stable Baselines3 provides a [helper](https://stable-baselines3.readthedocs.io/en/master/common/env_checker.html) to check that your environment follows the Gym interface. It also optionally checks that the environment is compatible with Stable-Baselines (and emits warnings if necessary).

In [None]:
from stable_baselines3.common.env_checker import check_env

In [None]:
env = GoLeftEnv()
# If the environment doesn't follow the interface, an error will be thrown
check_env(env, warn=True)

### Testing the environment

In [None]:
# Instantiate the environment
env = GoLeftEnv(grid_size=10)

obs, _ = env.reset()
env.render()

print(env.observation_space)
print(env.action_space)
print(env.action_space.sample())

GO_LEFT = 0
# Hardcoded the best agent: always go left!
n_steps = 20
for step in range(n_steps):
    print(f"Step {step + 1}")
    obs, reward, terminated, truncated, info = env.step(GO_LEFT)
    done = terminated or truncated
    print("obs=", obs, "reward=", reward, "done=", done)
    env.render()
    if done:
        print("Goal reached!", "reward=", reward)
        break

### Try it with Stable-Baselines

Once your environment follows the gym interface, it is quite easy to plug in any algorithm from Stable-Baselines.

In [None]:
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

# Instantiate the env
vec_env = make_vec_env(GoLeftEnv, n_envs=1, env_kwargs=dict(grid_size=10))

In [None]:
# Train the agent
model = A2C("MlpPolicy", vec_env, verbose=1).learn(5000)

In [None]:
# Test the trained agent
# using the vecenv
obs = vec_env.reset()
n_steps = 20
for step in range(n_steps):
    action, _ = model.predict(obs, deterministic=True)
    print(f"Step {step + 1}")
    print("Action: ", action)
    obs, reward, done, info = vec_env.step(action)
    print("obs=", obs, "reward=", reward, "done=", done)
    vec_env.render()
    if done:
        # Note that the VecEnv resets automatically
        # when a done signal is encountered
        print("Goal reached!", "reward=", reward)
        break

### Register the environment

Optionally, you can also register the environment with gym, which will allow you to create the RL agent in one line (and use `gym.make()` to instantiate the env):

In [None]:
from gymnasium.envs.registration import register

# Example for the CartPole environment
register(
    # unique identifier for the env `name-version`
    id="CartPole-v1",
    # path to the class for creating the env
    # Note: entry_point also accepts a class as input (and not only a string)
    entry_point="gymnasium.envs.classic_control:CartPoleEnv",
    # Max number of steps per episode, enforced with a `TimeLimit` wrapper
    max_episode_steps=500,
)

# Creating a more complex custom environment

Games tend to make good environments, so I think a Snake game could be quite fitting. I searched around for a nice short/simple Snake game, and I found: https://github.com/TheAILearner/Snake-Game-using-OpenCV-Python/blob/master/snake_game_using_opencv.ipynb

I took the notebook and converted it to a script here:

In [None]:
# source: https://github.com/TheAILearner/Snake-Game-using-OpenCV-Python/blob/master/snake_game_using_opencv.ipynb
import numpy as np
import cv2
import random
import time

def collision_with_apple(apple_position, score):
	apple_position = [random.randrange(1,50)*10,random.randrange(1,50)*10]
	score += 1
	return apple_position, score

def collision_with_boundaries(snake_head):
	if snake_head[0]>=500 or snake_head[0]<0 or snake_head[1]>=500 or snake_head[1]<0 :
		return 1
	else:
		return 0

def collision_with_self(snake_position):
	snake_head = snake_position[0]
	if snake_head in snake_position[1:]:
		return 1
	else:
		return 0

img = np.zeros((500,500,3),dtype='uint8')
# Initial Snake and Apple position
snake_position = [[250,250],[240,250],[230,250]]
apple_position = [random.randrange(1,50)*10,random.randrange(1,50)*10]
score = 0
prev_button_direction = 1
button_direction = 1
snake_head = [250,250]
while True:
	cv2.imshow('a',img)
	cv2.waitKey(1)
	img = np.zeros((500,500,3),dtype='uint8')
	# Display Apple
	cv2.rectangle(img,(apple_position[0],apple_position[1]),(apple_position[0]+10,apple_position[1]+10),(0,0,255),3)
	# Display Snake
	for position in snake_position:
		cv2.rectangle(img,(position[0],position[1]),(position[0]+10,position[1]+10),(0,255,0),3)
	
	# Takes step after fixed time
	t_end = time.time() + 0.05
	k = -1
	while time.time() < t_end:
		if k == -1:
			k = cv2.waitKey(1)
		else:
			continue
			
	# 0-Left, 1-Right, 3-Up, 2-Down, q-Break
	# a-Left, d-Right, w-Up, s-Down

	if k == ord('a') and prev_button_direction != 1:
		button_direction = 0
	elif k == ord('d') and prev_button_direction != 0:
		button_direction = 1
	elif k == ord('w') and prev_button_direction != 2:
		button_direction = 3
	elif k == ord('s') and prev_button_direction != 3:
		button_direction = 2
	elif k == ord('q'):
		break
	else:
		button_direction = button_direction
	prev_button_direction = button_direction

	# Change the head position based on the button direction
	if button_direction == 1:
		snake_head[0] += 10
	elif button_direction == 0:
		snake_head[0] -= 10
	elif button_direction == 2:
		snake_head[1] += 10
	elif button_direction == 3:
		snake_head[1] -= 10

	# Increase Snake length on eating apple
	if snake_head == apple_position:
		apple_position, score = collision_with_apple(apple_position, score)
		snake_position.insert(0,list(snake_head))

	else:
		snake_position.insert(0,list(snake_head))
		snake_position.pop()
		
	# On collision kill the snake and print the score
	if collision_with_boundaries(snake_head) == 1 or collision_with_self(snake_position) == 1:
		font = cv2.FONT_HERSHEY_SIMPLEX
		img = np.zeros((500,500,3),dtype='uint8')
		cv2.putText(img,'Your Score is {}'.format(score),(140,250), font, 1,(255,255,255),2,cv2.LINE_AA)
		cv2.imshow('a',img)
		cv2.waitKey(0)
		break
		
cv2.destroyAllWindows()

The main changes compared to the original notebook are around this snippet:

In [None]:
t_end = time.time() + 0.2
k = -1
while time.time() < t_end:
	if k == -1:
		k = cv2.waitKey(125)

In the script above, 0.2 was changed to something like 0.05 and the `waitKey` delay from 125 to 1, because we want the game to step as quickly as possible.

![Alt](https://pythonprogramming.net/static/images/reinforcement-learning/snake-base-game.gif)

Playing this, it's a simple snake game where you attempt to get the apple without running into yourself or going out of bounds.

Next, we convert this to a gym environment:

In [None]:
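import random
import time
from collections import deque

import cv2
import gymnasium as gym
import numpy as np
from gymnasium import spaces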

SNAKE_LEN_GOAL = 30

def collision_with_apple(apple_position, score):
	apple_position = [random.randrange(1,50)*10,random.randrange(1,50)*10]
	score += 1
	return apple_position, score

def collision_with_boundaries(snake_head):
	if snake_head[0]>=500 or snake_head[0]<0 or snake_head[1]>=500 or snake_head[1]<0 :
		return 1
	else:
		return 0

def collision_with_self(snake_position):
	snake_head = snake_position[0]
	if snake_head in snake_position[1:]:
		return 1
	else:
		return 0


class SnekEnv(gym.Env):

	def __init__(self):
		super(SnekEnv, self).__init__()
        
		# Define action and observation space
		# They must be gym.spaces objects
		# We use discrete actions, since button_direction in the previous code takes values in {0, 1, 2, 3}
		self.action_space = spaces.Discrete(4)

		# The observation is a fixed-size vector: head position, apple offset,
		# snake length, and the last SNAKE_LEN_GOAL actions
		self.observation_space = spaces.Box(low=-500, high=500,
											shape=(5+SNAKE_LEN_GOAL,), dtype=np.float32)

	def step(self, action):
		# This is basically the original snake game code, just turned into OOP
		# Track the action history; it is part of the observation
		self.prev_actions.append(action)
        
		cv2.imshow('a',self.img)
		cv2.waitKey(1)
		self.img = np.zeros((500,500,3),dtype='uint8')
		# Display Apple
		cv2.rectangle(self.img,(self.apple_position[0],self.apple_position[1]),(self.apple_position[0]+10,self.apple_position[1]+10),(0,0,255),3)
		# Display Snake
		for position in self.snake_position:
			cv2.rectangle(self.img,(position[0],position[1]),(position[0]+10,position[1]+10),(0,255,0),3)
		
		# Takes a step after fixed time
		t_end = time.time() + 0.05
		k = -1
		while time.time() < t_end:
			if k == -1:
				k = cv2.waitKey(1)
			else:
				continue

		button_direction = action
		# Change the head position based on the button direction
		if button_direction == 1:
			self.snake_head[0] += 10
		elif button_direction == 0:
			self.snake_head[0] -= 10
		elif button_direction == 2:
			self.snake_head[1] += 10
		elif button_direction == 3:
			self.snake_head[1] -= 10

		# Increase Snake length on eating apple
		if self.snake_head == self.apple_position:
			self.apple_position, self.score = collision_with_apple(self.apple_position, self.score)
			self.snake_position.insert(0,list(self.snake_head))

		else:
			self.snake_position.insert(0,list(self.snake_head))
			self.snake_position.pop()
		
		# On collision kill the snake and print the score
		if collision_with_boundaries(self.snake_head) == 1 or collision_with_self(self.snake_position) == 1:
			font = cv2.FONT_HERSHEY_SIMPLEX
			self.img = np.zeros((500,500,3),dtype='uint8')
			cv2.putText(self.img,'Your Score is {}'.format(self.score),(140,250), font, 1,(255,255,255),2,cv2.LINE_AA)
			cv2.imshow('a',self.img)
			self.done = True

		# the reward is the snake's size
		self.total_reward = len(self.snake_position) - 3  # default length is 3
		self.reward = self.total_reward - self.prev_reward
		self.prev_reward = self.total_reward

		if self.done:
			self.reward = -10
		info = {}

		head_x = self.snake_head[0]
		head_y = self.snake_head[1]

		snake_length = len(self.snake_position)
		apple_delta_x = self.apple_position[0] - head_x
		apple_delta_y = self.apple_position[1] - head_y

		# We now create an observation. It needs to include where the snake's head is, where the apple is in relation to the head, and where the rest of the snake's body is.
		# Feel free to design your own observation. The only slightly challenging part is that every time the snake eats an apple, its length increases by 1, while the observation must have a fixed size, whether the snake is 3 units long or 300.
		observation = [head_x, head_y, apple_delta_x, apple_delta_y, snake_length] + list(self.prev_actions)
		observation = np.array(observation, dtype=np.float32)  # match the dtype of observation_space

		# gymnasium API: (obs, reward, terminated, truncated, info); we never truncate here
		return observation, self.reward, self.done, False, info

	def reset(self, seed=None, options=None):
		super().reset(seed=seed)
		self.img = np.zeros((500,500,3),dtype='uint8')
		# Initial Snake and Apple position
		self.snake_position = [[250,250],[240,250],[230,250]]
		self.apple_position = [random.randrange(1,50)*10,random.randrange(1,50)*10]
		self.score = 0
		self.prev_button_direction = 1
		self.button_direction = 1
		self.snake_head = [250,250]

		self.prev_reward = 0

		self.done = False

		head_x = self.snake_head[0]
		head_y = self.snake_head[1]

		snake_length = len(self.snake_position)
		apple_delta_x = self.apple_position[0] - head_x
		apple_delta_y = self.apple_position[1] - head_y

		# prev_actions: fixed-size history of previous actions. Together with the snake length, the agent should be able to work out where the rest of the body is.
		self.prev_actions = deque(maxlen = SNAKE_LEN_GOAL)  # however long we aspire the snake to be
		for i in range(SNAKE_LEN_GOAL):
			self.prev_actions.append(-1) # to create history

		# create observation:
		observation = [head_x, head_y, apple_delta_x, apple_delta_y, snake_length] + list(self.prev_actions)
		observation = np.array(observation, dtype=np.float32)  # match the dtype of observation_space

		return observation, {}  # gymnasium API: reset returns (observation, info)
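
The fixed-size action history is the part that keeps the observation length constant. Here is a small standalone sketch of that idea using `collections.deque` with `maxlen`. The feature values are made up; only `SNAKE_LEN_GOAL` and the layout of the vector come from the environment above.

```python
from collections import deque

SNAKE_LEN_GOAL = 30  # same constant as in the environment

# Pre-fill the history with -1 ("no action yet") so its length is fixed from the start
prev_actions = deque([-1] * SNAKE_LEN_GOAL, maxlen=SNAKE_LEN_GOAL)

for action in [1, 1, 2, 0]:
    prev_actions.append(action)  # the oldest entry is dropped automatically

print(len(prev_actions))        # 30
print(list(prev_actions)[-4:])  # [1, 1, 2, 0]

# 5 scalar features (head x/y, apple dx/dy, length) + the action history
observation = [250, 250, -40, 120, 3] + list(prev_actions)
print(len(observation))         # 35
```

However many actions are appended, the vector keeps `5 + SNAKE_LEN_GOAL` entries, which is what the `shape` of `observation_space` declares.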

We now check our environment:

In [None]:
env = SnekEnv()
# It will check your custom environment and output additional warnings if needed
check_env(env)
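
As far as I can tell, one thing that can trip up this check with a `float32` `Box` is an observation built from plain Python ints, since NumPy stores those as an integer array. The sketch below uses made-up values to show the mismatch and the explicit cast that avoids it.

```python
import numpy as np

# np.array on a list of Python ints yields an integer array
# (int64 on most platforms; made explicit here for reproducibility)
raw = np.array([250, 250, -40, 120, 3], dtype=np.int64)
print(np.can_cast(raw.dtype, np.float32))  # False: int64 -> float32 is not a safe cast

# Casting explicitly gives an array that matches a float32 Box
obs = raw.astype(np.float32)
print(obs.dtype)  # float32
```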

We also make sure the rewards seem correct, the snake moves around, episodes end, and restart all as expected:

In [None]:
env = SnekEnv()
episodes = 50

for episode in range(episodes):
	done = False
	obs, info = env.reset()
	while not done:
		random_action = env.action_space.sample()
		print("action",random_action)
		obs, reward, terminated, truncated, info = env.step(random_action)
		done = terminated or truncated
		print('reward',reward)

Time to try to train a model!

In [None]:
from stable_baselines3 import PPO
import os
import time

models_dir = f"models/{int(time.time())}/"
logdir = f"logs/{int(time.time())}/"

if not os.path.exists(models_dir):
	os.makedirs(models_dir)

if not os.path.exists(logdir):
	os.makedirs(logdir)

env = SnekEnv()
env.reset()

model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=logdir)

TIMESTEPS = 10000
iters = 0
while True:
	iters += 1
	model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name=f"PPO")
	model.save(f"{models_dir}/{TIMESTEPS*iters}")

Go ahead and run it, and let's see what we can come up with!

After training for some time, what we have is better than random, but it is nowhere near a great model. We can see that the episode length at least increased, but the actual rewards are almost unchanged. In the next section, we'll see if we can figure out a solution!

![alt](https://pythonprogramming.net/static/images/reinforcement-learning/custom-env-1.png)

The resulting behaviour looks like:

![alt](https://pythonprogramming.net/static/images/reinforcement-learning/base-custom-env.gif)

## Engineering Rewards in Custom Environments

While the agent did definitely learn to stay alive for much longer than random, we were certainly not getting any apples. Why might this be?

Unless the agent just happens to get an apple, it will never learn that doing so is rewarding; besides, getting an apple is about as rewarding as simply not dying. How might we encourage getting the apple? One quick idea is to penalize the agent by its Euclidean distance from the apple, in the hope that it learns to get closer and closer. We can achieve this by computing the Euclidean distance and subtracting it from the reward:

In [None]:
euclidean_dist_to_apple = np.linalg.norm(np.array(self.snake_head) - np.array(self.apple_position))

self.total_reward = len(self.snake_position) - 3 - euclidean_dist_to_apple

Training this starts off okay, but after some time, all we get is a black screen and output like:

```
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1        |
|    ep_rew_mean     | -10      |
| time/              |          |
|    fps             | 693      |
|    iterations      | 1        |
|    time_elapsed    | 121      |
|    total_timesteps | 83968    |
---------------------------------
```

What happened?

![alt](https://pythonprogramming.net/static/images/reinforcement-learning/custom-env-oops.png)

Quite simply, this agent learned that living is very painful and the quickest way to the highest reward is to go ahead and stop living. You can see that reward was typically a very large negative and then it rises as episode length decreases up to the point of -10 and it just holds there, so the agent was just simply spawning and running into itself immediately to end the game. This is a good example of how things can go awry with what we think might be a good reward, but it turns out to be no good.
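
A short calculation shows why dying immediately is optimal under this reward. The function below is a simplified stand-in for the environment, assuming no apple is eaten and no discounting: each step's reward is the change in `total_reward`, and the final step's reward is overwritten by the death penalty of -10. The distance values are made up.

```python
def episode_return(distances, death_penalty=-10.0):
    """Undiscounted return when the snake survives len(distances) steps, then dies.

    distances[t] is the head-to-apple distance after surviving step t.
    No apple is eaten, so the length bonus (len - 3) stays at 0.
    """
    prev_total = 0.0
    ret = 0.0
    for d in distances:
        total = -d                 # total_reward = 0 - distance
        ret += total - prev_total  # the reward is the delta of total_reward
        prev_total = total
    return ret + death_penalty     # the last step's reward is overwritten


immediate = episode_return([])                   # die on the very first step
survive = episode_return([200.0, 190.0, 180.0])  # approach the apple, then die
print(immediate, survive)  # -10.0 -190.0
```

The deltas telescope, so surviving always costs roughly the last distance to the apple on top of the -10, while dying on the first step costs only -10.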

To fix this, we can instead just make an offset for the euclidean distance. I propose something like maybe 250, since our game size is 500x500. When we do this, I can envision the snake learning to just circle the apple, instead of eating it. The new reward function I propose to start is:

In [None]:
self.total_reward = (250 - euclidean_dist_to_apple)

But then, we do want a short-term reward for eating an apple too. It needs to be greater than 250 for sure, and large enough that the agent still prefers eating the apple even though the apple then moves to a new, possibly distant, spot, so maybe 1,000 or 5,000. I really don't know. Something significant for sure. We have to consider how many steps it will take to get to the next apple, and whether it would wind up being more advantageous for the agent to just circle the apple for a constant ~200-250 reward. In the code where we check whether we ate the apple and lengthen the snake, I'll add a new variable, `apple_reward`:

In [None]:
apple_reward = 0
# Increase Snake length on eating apple
if self.snake_head == self.apple_position:
	self.apple_position, self.score = collision_with_apple(self.apple_position, self.score)
	self.snake_position.insert(0,list(self.snake_head))
	apple_reward = 10000

Then, we'll add this `apple_reward` to the total reward:

In [None]:
self.total_reward = (250 - euclidean_dist_to_apple) + apple_reward

Then, do we really want the rewards to be deltas to what they were before? I don't think so. I think we want to make them per step, so the reward we return from the step method should be the `self.total_reward`.

Finally, our current total reward scale is going to be too massive; we should really scale it down; I propose for now that we divide it by 100.

In [None]:
self.total_reward = ((250 - euclidean_dist_to_apple) + apple_reward)/100
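
To get a feel for the scale of this reward, here is the same formula as a standalone function, evaluated at a few made-up positions. The function name and arguments are chosen for this illustration; in the environment the distance is measured after the apple has been moved, so on an eating step it refers to the new apple.

```python
import math


def step_reward(snake_head, apple_position, ate_apple):
    """Per-step reward: (250 - distance + apple bonus) / 100."""
    dist = math.dist(snake_head, apple_position)
    apple_reward = 10000 if ate_apple else 0
    return ((250 - dist) + apple_reward) / 100


print(step_reward([250, 250], [250, 350], False))  # 1.5   (100 px away)
print(step_reward([250, 250], [250, 350], True))   # 101.5 (same distance, apple just eaten)
print(step_reward([0, 250], [400, 250], False))    # -1.5  (farther than 250 px: negative)
```

Note that the offset of 250 does not make the reward positive everywhere: on a 500x500 board the apple can be farther than 250 px away, in which case the per-step reward is still negative.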

We now obtain the desired behaviour:

![alt](https://pythonprogramming.net/static/images/reinforcement-learning/snake-modified-trained.gif)

Full code is:

In [None]:
# our environment here is adapted from: https://github.com/TheAILearner/Snake-Game-using-OpenCV-Python/blob/master/snake_game_using_opencv.ipynb
import gymnasium as gym
from gymnasium import spaces
import numpy as np
import cv2
import random
import time
from collections import deque

SNAKE_LEN_GOAL = 30

def collision_with_apple(apple_position, score):
    apple_position = [random.randrange(1,50)*10,random.randrange(1,50)*10]
    score += 1
    return apple_position, score

def collision_with_boundaries(snake_head):
    if snake_head[0]>=500 or snake_head[0]<0 or snake_head[1]>=500 or snake_head[1]<0 :
        return 1
    else:
        return 0

def collision_with_self(snake_position):
    snake_head = snake_position[0]
    if snake_head in snake_position[1:]:
        return 1
    else:
        return 0


class SnekEnv(gym.Env):

    def __init__(self):
        super(SnekEnv, self).__init__()
        # Define action and observation space
        # They must be gym.spaces objects
        # Example when using discrete actions:
        self.action_space = spaces.Discrete(4)
        # The observation is a fixed-size vector: head position, apple offset,
        # snake length, and the last SNAKE_LEN_GOAL actions
        self.observation_space = spaces.Box(low=-500, high=500,
                                            shape=(5+SNAKE_LEN_GOAL,), dtype=np.float32)

    def step(self, action):
        self.prev_actions.append(action)
        cv2.imshow('a',self.img)
        cv2.waitKey(1)
        self.img = np.zeros((500,500,3),dtype='uint8')
        
        # Display Apple
        cv2.rectangle(self.img,(self.apple_position[0],self.apple_position[1]),(self.apple_position[0]+10,self.apple_position[1]+10),(0,0,255),3)
        
        # Display Snake
        for position in self.snake_position:
            cv2.rectangle(self.img,(position[0],position[1]),(position[0]+10,position[1]+10),(0,255,0),3)
        
        # Takes a step after fixed time
        t_end = time.time() + 0.05
        k = -1
        while time.time() < t_end:
            if k == -1:
                k = cv2.waitKey(1)
            else:
                continue

        button_direction = action
        # Change the head position based on the button direction
        if button_direction == 1:
            self.snake_head[0] += 10
        elif button_direction == 0:
            self.snake_head[0] -= 10
        elif button_direction == 2:
            self.snake_head[1] += 10
        elif button_direction == 3:
            self.snake_head[1] -= 10
		
        apple_reward = 0
        # Increase Snake length on eating apple
        if self.snake_head == self.apple_position:
            self.apple_position, self.score = collision_with_apple(self.apple_position, self.score)
            self.snake_position.insert(0,list(self.snake_head))
            apple_reward = 10000
        else:
            self.snake_position.insert(0,list(self.snake_head))
            self.snake_position.pop()
        
        # On collision kill the snake and print the score
        if collision_with_boundaries(self.snake_head) == 1 or collision_with_self(self.snake_position) == 1:
            font = cv2.FONT_HERSHEY_SIMPLEX
            self.img = np.zeros((500,500,3),dtype='uint8')
            cv2.putText(self.img,'Your Score is {}'.format(self.score),(140,250), font, 1,(255,255,255),2,cv2.LINE_AA)
            cv2.imshow('a',self.img)
            self.done = True
        
        euclidean_dist_to_apple = np.linalg.norm(np.array(self.snake_head) - np.array(self.apple_position))

        self.total_reward = ((250 - euclidean_dist_to_apple) + apple_reward)/100

        print(self.total_reward)

        self.reward = self.total_reward - self.prev_reward
        self.prev_reward = self.total_reward

        if self.done:
            self.reward = -10
        info = {}

        head_x = self.snake_head[0]
        head_y = self.snake_head[1]

        snake_length = len(self.snake_position)
        apple_delta_x = self.apple_position[0] - head_x
        apple_delta_y = self.apple_position[1] - head_y

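        # create observation:
        observation = [head_x, head_y, apple_delta_x, apple_delta_y, snake_length] + list(self.prev_actions)
        observation = np.array(observation, dtype=np.float32)  # match the dtype of observation_space

        # gymnasium API: (obs, reward, terminated, truncated, info); we never truncate here
        return observation, self.total_reward, self.done, False, info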

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.img = np.zeros((500,500,3),dtype='uint8')
        # Initial Snake and Apple position
        self.snake_position = [[250,250],[240,250],[230,250]]
        self.apple_position = [random.randrange(1,50)*10,random.randrange(1,50)*10]
        self.score = 0
        self.prev_button_direction = 1
        self.button_direction = 1
        self.snake_head = [250,250]

        self.prev_reward = 0

        self.done = False

        head_x = self.snake_head[0]
        head_y = self.snake_head[1]

        snake_length = len(self.snake_position)
        apple_delta_x = self.apple_position[0] - head_x
        apple_delta_y = self.apple_position[1] - head_y

        self.prev_actions = deque(maxlen = SNAKE_LEN_GOAL)  # however long we aspire the snake to be
        for i in range(SNAKE_LEN_GOAL):
            self.prev_actions.append(-1) # to create history

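        # create observation:
        observation = [head_x, head_y, apple_delta_x, apple_delta_y, snake_length] + list(self.prev_actions)
        observation = np.array(observation, dtype=np.float32)  # match the dtype of observation_space

        return observation, {}  # gymnasium API: reset returns (observation, info)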