# Project 4 -- A Selfish Mouse

## Instructions

Please read carefully:

* Solve the project yourself. No teamwork.
* If you have questions, please post them in the public channel on Slack. The answers may be relevant to others as well.
* Feel free to import and use any additional Python package you need.
* You are allowed to solve the project using a different programming language (although this is not advised, since the provided base implementation is in Python).
* Your code may be tested on other world layouts (you are provided with two: `world_empty.txt` and `world_walls.txt` for experimentation).
* The game grid refreshes slowly in Colab, which causes noticeable flickering. To avoid this performance issue, run the GitHub implementation (not the Colab one) locally; it is provided here: [Basic Reinforcement Learning](https://github.com/vmayoral/basic_reinforcement_learning/tree/master/tutorial1). It makes no difference whether you solve this project locally or in Colab. The GitHub code was adjusted to run in Colab to simplify things, in the hope that the Colab performance issue, which is known and being worked on, will be fixed at some point.
* Make sure to fill in your `student_name` in the block below.

In [7]:
student_name = 'David Mihola' # fill with your student name
assert student_name != 'your_student_name', 'Please fill in your student_name before you start.'

## Setup

In this project you will gain practical experience with a Q-learning algorithm in a multi-agent environment. You are provided a grid world game where a Q-learning algorithm teaches a mouse to find its way to a piece of cheese while avoiding a cat on its way.

<img src="https://st3.depositphotos.com/1000152/12958/i/450/depositphotos_129589806-stock-photo-little-rat-eating-cheese.jpg" width='300' align='center'>

Read the code below and make sure you understand it. The code is based on the [Basic Reinforcement Learning](https://github.com/vmayoral/basic_reinforcement_learning/tree/master/tutorial1) tutorial and is an adaptation of that GitHub project to run in Google Colab. You can also run the GitHub code locally on your machine (it will be about 10x faster), but the project is solvable in Colab.

Observe that the mouse is eaten by the cat over 10 times more often than it gets the cheese. Your task is to modify **only the Mouse class** implementation to help the mouse get the cheese. You are encouraged to try different ideas and see what works and what doesn't (i.e., design a selfish mouse!). Don't expect to "win" this game: feeding the mouse 100% of the time will not be possible, but you can improve the mouse's performance.

Necessary installs and imports:

In [8]:
#!pip install pygame
#!pip install gdown

In [9]:
#!gdown https://drive.google.com/file/d/1osZDerlQk98wp-OVKsT8tkFfce310fMf/view?usp=sharing --fuzzy
#!unzip world_setup.zip

Q-learning implementation (taken from [Basic Reinforcement Learning](https://github.com/vmayoral/basic_reinforcement_learning/tree/master/tutorial1)).

In [10]:
import random

class QLearn:
    def __init__(self, actions, epsilon=0.1, alpha=0.2, gamma=0.9):
        self.q = {}

        self.epsilon = epsilon  # exploration constant
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.actions = actions

    def getQ(self, state, action):
        return self.q.get((state, action), 0.0)
        # return self.q.get((state, action), 1.0)

    def learnQ(self, state, action, reward, value):
        '''
        Q-learning: Q(s, a) += alpha * (reward(s, a) + gamma * max(Q(s')) - Q(s, a))
        '''
        oldv = self.q.get((state, action), None)
        if oldv is None:
            self.q[(state, action)] = reward
        else:
            self.q[(state, action)] = oldv + self.alpha * (value - oldv)

    def chooseAction(self, state):
        if random.random() < self.epsilon:
            action = random.choice(self.actions)
        else:
            q = [self.getQ(state, a) for a in self.actions]
            maxQ = max(q)
            count = q.count(maxQ)
            # In case there're several state-action max values 
            # we select a random one among them
            if count > 1:
                best = [i for i in range(len(self.actions)) if q[i] == maxQ]
                i = random.choice(best)
            else:
                i = q.index(maxQ)

            action = self.actions[i]
        return action

    def learn(self, state1, action1, reward, state2):
        maxqnew = max([self.getQ(state2, a) for a in self.actions])
        self.learnQ(state1, action1, reward, reward + self.gamma*maxqnew)
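
To make the update rule in `learn` concrete, here is a minimal standalone sketch of a single tabular Q-learning step with made-up numbers. The function `q_update`, the state names `"s1"` and `"s2"`, and all values are hypothetical and only serve as an illustration. Unlike `learnQ` above, which stores the raw reward the first time a (state, action) pair is seen, this sketch treats unseen pairs as 0.0.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.2, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Best value reachable from the next state (unseen pairs count as 0.0).
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    # Target: immediate reward plus discounted best future value.
    target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    # Move the old estimate a fraction alpha towards the target.
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]


# Hypothetical two-action example.
actions = [0, 1]
q = {("s1", 0): 1.0, ("s2", 0): 2.0, ("s2", 1): 4.0}
new_value = q_update(q, "s1", 0, reward=-1, next_state="s2", actions=actions)
print(new_value)  # approximately 1.32
```

Here the target is -1 + 0.9 * 4.0 = 2.6, so the old estimate 1.0 moves 20% of the way towards it.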

Implementation of the Cheese, Cat and Mouse classes. **You are only allowed to modify the Mouse class and the related parameters**. When solving the individual tasks, you may either copy the code above or modify it in place. Please comment on how you achieved the improved performance of the mouse.

In [11]:
import time
import random
import shelve
import pdb

import cellular

directions = 8

lookdist = 2
lookcells = []
for i in range(-lookdist,lookdist+1):
    for j in range(-lookdist,lookdist+1):
        if (abs(i) + abs(j) <= lookdist) and (i != 0 or j != 0):
            lookcells.append((i,j))

def pickRandomLocation():
    while 1:
        x = random.randrange(world.width)
        y = random.randrange(world.height)
        cell = world.getCell(x, y)
        if not (cell.wall or len(cell.agents) > 0):
            return cell


class Cell(cellular.Cell):
    wall = False

    def colour(self):
        if self.wall:
            return 'black'
        else:
            return 'white'

    def load(self, data):
        if data == 'X':
            self.wall = True
        else:
            self.wall = False


class Cat(cellular.Agent):
    cell = None
    score = 0
    colour = 'red'

    def update(self):
        cell = self.cell
        if cell != mouse.cell:
            self.goTowards(mouse.cell)
            while cell == self.cell:
                self.goInDirection(random.randrange(directions))


class Cheese(cellular.Agent):
    colour = 'green'

    def update(self):
        pass


class Mouse(cellular.Agent):
    colour = 'gray'

    def __init__(self):
        self.ai = None
        self.ai = QLearn(actions=range(directions),
                                alpha=0.1, gamma=0.9, epsilon=0.1)
        self.eaten = 0
        self.fed = 0
        self.lastState = None
        self.lastAction = None

    def update(self):
        # calculate the state of the surrounding cells
        state = self.calcState()
        # assign a reward of -1 by default
        reward = -1

        # observe the reward and update the Q-value
        if self.cell == cat.cell:
            self.eaten += 1
            reward = -100
            if self.lastState is not None:
                self.ai.learn(self.lastState, self.lastAction, reward, state)
            self.lastState = None

            self.cell = pickRandomLocation()
            return

        if self.cell == cheese.cell:
            self.fed += 1
            reward = 50
            cheese.cell = pickRandomLocation()

        if self.lastState is not None:
            self.ai.learn(self.lastState, self.lastAction, reward, state)

        # Choose a new action and execute it
        state = self.calcState()
        # print(state)
        action = self.ai.chooseAction(state)
        self.lastState = state
        self.lastAction = action

        self.goInDirection(action)

    def calcState(self):
        def cellvalue(cell):
            if cat.cell is not None and (cell.x == cat.cell.x and
                                         cell.y == cat.cell.y):
                return 3
            elif cheese.cell is not None and (cell.x == cheese.cell.x and
                                              cell.y == cheese.cell.y):
                return 2
            else:
                return 1 if cell.wall else 0

        return tuple([cellvalue(self.world.getWrappedCell(self.cell.x + j, self.cell.y + i))
                      for i,j in lookcells])

cheese = Cheese()
mouse = Mouse()
cat = Cat()

world = cellular.World(Cell, directions=directions, filename='world_walls.txt')
world.age = 0

world.addAgent(cheese, cell=pickRandomLocation())
#world.addAgent(cat)
world.addAgent(mouse)

endAge = world.age + 100000
while world.age < endAge:
    world.update()

    if world.age % 10000 == 0:
        print ("{:d}, e: {:0.2f}, W: {:d}, L: {:d}".format(world.age, mouse.ai.epsilon, mouse.fed, mouse.eaten))
        mouse.eaten = 0
        mouse.fed = 0

world.display.activate(size=20)
world.display.delay = 1
while 1:
    world.update(mouse.fed, mouse.eaten)

10000, e: 0.10, W: 48, L: 0
20000, e: 0.10, W: 38, L: 0
30000, e: 0.10, W: 40, L: 0
40000, e: 0.10, W: 35, L: 0
50000, e: 0.10, W: 47, L: 0
60000, e: 0.10, W: 41, L: 0
70000, e: 0.10, W: 32, L: 0
80000, e: 0.10, W: 38, L: 0
90000, e: 0.10, W: 27, L: 0
100000, e: 0.10, W: 33, L: 0



## 1 - Feed the Mouse without a Cat [7 points]

In this task, comment out the `world.addAgent(cat)` line above and run the game. If you also switch to `world_empty.txt` instead of `world_walls.txt`, the suboptimality of the mouse implementation becomes very evident.

**Your task:** Improve the mouse algorithm. It should work with any world configuration (i.e., with and without walls). You are only allowed to modify the `Mouse` class. Your goal is to minimize the time the mouse needs to get the cheese.
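
As a starting point for the analysis, it may help to look at what the baseline mouse actually observes. The sketch below recomputes `lookcells` exactly as in the cell above and counts the observed cells. Whenever no cheese, cat, or wall lies within two steps, `calcState` returns the same all-zero tuple, so the Q-table cannot tell those positions apart. The names `num_cells`, `values_per_cell`, `max_states` and `blind_state` are introduced here only for illustration.

```python
lookdist = 2  # same value as in the implementation above

# Offsets within Manhattan distance `lookdist`, excluding the mouse's own cell.
lookcells = [(i, j)
             for i in range(-lookdist, lookdist + 1)
             for j in range(-lookdist, lookdist + 1)
             if abs(i) + abs(j) <= lookdist and (i != 0 or j != 0)]

num_cells = len(lookcells)                  # number of cells the mouse can see
values_per_cell = 4                         # 0 empty, 1 wall, 2 cheese, 3 cat
max_states = values_per_cell ** num_cells   # upper bound on distinct states

# The state reported when nothing of interest is in view.
blind_state = (0,) * num_cells

print(num_cells, max_states)  # 12 16777216
```

The upper bound is loose, since at most one cell can hold the cheese and at most one the cat, but it shows that the state encodes only the immediate neighbourhood and nothing about where distant cheese might be.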

## 2 - Feed the Mouse with a Cat [7 points]

Your mouse algorithm should also work well with a cat, so add the cat back to the game for this task. Again, you are only allowed to modify the `Mouse` class.

**Your task:** Your goal is to improve the number of times the mouse gets the cheese (in the presence of a cat). 
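
One idea among many that you could experiment with (it is not prescribed by the assignment) is to reduce the exploration rate over time, since the baseline keeps `epsilon=0.1` and therefore moves randomly in about 10% of the steps even after long training. The sketch below shows a linear schedule; the function name `decayed_epsilon` and all parameter values are hypothetical and would need tuning.

```python
def decayed_epsilon(step, start=0.1, end=0.01, decay_steps=50_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    # Fraction of the schedule completed, capped at 1.0 once it is finished.
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


# Exploration rate at the start, midway, and after the schedule has finished.
for step in (0, 25_000, 50_000, 100_000):
    print(step, round(decayed_epsilon(step), 4))
```

Inside `Mouse.update`, such a schedule could be applied with something like `self.ai.epsilon = decayed_epsilon(self.world.age)`, assuming the agent's `world` attribute exposes the same `age` counter used in the training loop above. Whether this actually helps is something to verify in Task 3.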

## 3 - Show That Your Improvements are Statistically Significant [6 points] 

Make sure your algorithmic improvements are statistically significant and not the result of a lucky random seed. We consider an improvement statistically significant if the difference in performance is larger than twice the standard deviation obtained in your measurements. See [Empirical Rule: Definition, Formula, Example, How It's Used](https://www.investopedia.com/terms/e/empirical-rule.asp).

**Your task:** Run experiments for at least 10000 simulated steps (without showing the game grid) and plot the statistics (separately for `world_empty.txt` and `world_walls.txt`). Evaluate (1) how often the mouse gets the cheese, and (2) how long one episode lasts. Compare the results before and after your improvements. We are interested in the mean and the standard deviation.
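
The comparison could be sketched as follows. This is only one possible reading of the criterion: it assumes the sample standard deviation and uses the larger of the two standard deviations (before and after) as the reference. The `baseline` list is the `W` column of the run printed earlier in this notebook (`world_walls.txt`, cat disabled); the `improved` list contains made-up placeholder numbers and must be replaced by your own measurements. The helper names `summarize` and `is_significant` are hypothetical.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) of a list of measurements."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_significant(before, after):
    """Check whether the means differ by more than 2x the larger std (assumption)."""
    mean_before, std_before = summarize(before)
    mean_after, std_after = summarize(after)
    return abs(mean_after - mean_before) > 2 * max(std_before, std_after)


# Cheese count per 10000 steps, taken from the output shown above.
baseline = [48, 38, 40, 35, 47, 41, 32, 38, 27, 33]
# Placeholder values for illustration only, NOT real results.
improved = [90, 85, 95, 88, 92, 91, 87, 93, 89, 90]

mean_b, std_b = summarize(baseline)
mean_i, std_i = summarize(improved)
print("baseline: mean={:.1f}, std={:.2f}".format(mean_b, std_b))
print("improved: mean={:.1f}, std={:.2f}".format(mean_i, std_i))
print("significant:", is_significant(baseline, improved))
```

The same helpers can be applied to episode lengths, and ideally to measurements collected over several independent runs with different random seeds.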

Feel free to collect the statistics separately and submit them together with your Colab file.

## 4 - How to Submit Your Solution?

Download your notebook (File --> Download --> Download .ipynb) and send it by email to [saukh@tugraz.at](mailto:saukh@tugraz.at). If you ran the code locally based on the GitHub sources, please send me all Python files, together with a short description of how to run your code and how you improved the mouse strategy.