# Switching to MediaPipe (20.12.2023)
After running into issues with Dlib's inaccuracy when detecting which eye is closed, we decided to try another model: MediaPipe. MediaPipe was developed by Google and is an open-source framework that provides tools for landmark estimation. While both landmark detection and landmark estimation identify key points on the face, landmark estimation is more precise: it provides more information about the spatial position of these points, whereas landmark detection only determines whether the points are on the face. Apart from using a different process to analyze the face, MediaPipe also provides 468 landmark points, compared to Dlib's 68.
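
To make the "spatial information" point concrete: MediaPipe's face landmarks come as normalized `x` and `y` values plus a relative depth `z`, which have to be scaled to the image size before they can be used as pixel positions. The sketch below shows that scaling step in isolation. The `Landmark` tuple and the `to_pixel_coords` helper are hypothetical stand-ins so the example runs without a camera or the MediaPipe package.

```python
from typing import NamedTuple


class Landmark(NamedTuple):
    """Hypothetical stand-in for one estimated landmark (normalized x, y, relative depth z)."""
    x: float
    y: float
    z: float


def to_pixel_coords(landmarks, img_width, img_height):
    """Scale normalized (x, y) to integer pixel positions; z is passed through unchanged."""
    return [(int(p.x * img_width), int(p.y * img_height), p.z) for p in landmarks]


# Two made-up landmarks on a 640x480 frame
sample = [Landmark(0.5, 0.25, -0.02), Landmark(0.125, 0.75, 0.01)]
pixels = to_pixel_coords(sample, img_width=640, img_height=480)
print(pixels)  # [(320, 120, -0.02), (80, 360, 0.01)]
```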

Another reason why MediaPipe is more precise is its attention mechanism, a technique modeled after the cognitive learning process of humans. By selectively focusing on relevant features, the attention model filters out irrelevant information, which in this case enables MediaPipe to prioritize the analysis of critical details.
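
As a generic illustration of what "selectively focusing" means numerically, the sketch below turns relevance scores into weights with a softmax and uses them to average feature values. This is only the general idea behind attention weighting, not MediaPipe's internal implementation. The scores, feature values and region names are made up.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, features):
    """Return the softmax-weighted average of the features and the weights used."""
    weights = softmax(scores)
    return sum(w * f for w, f in zip(weights, features)), weights


# Hypothetical relevance scores for three image regions: eye, cheek, background
scores = [4.0, 1.0, 0.0]
features = [10.0, 2.0, -3.0]
output, weights = attend(scores, features)
print(weights)  # the first region receives by far the largest weight
```

Regions with a low score contribute almost nothing to the result, which is the "filtering out" effect described above.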

An important factor (though more of an added benefit than a deciding one) is real-time compatibility. MediaPipe is lightweight, which makes it well suited to processing our webcam input. This ensures that the attention model's precise analysis can be integrated into our interactive application, where fast responses are crucial.

| Dlib                                     | MediaPipe                         |
|------------------------------------------|-----------------------------------|
| 68 landmark points                       | 468 landmark points               |
| landmark detection                       | landmark estimation               |
| not optimized for real-time applications | real-time multimedia applications |
|                                          | attention mechanism               |

# New Interactions
Since we changed our goal to using only facial expressions as interactions, we also needed to adjust our commands:

### Navigate through the page:

* Scroll to the next area: tilt your head to the right 
* Scroll to the previous area: tilt your head to the left 

### Navigate the carousel:

* Go to the next photo: close your right eye
* Go to the previous photo: close your left eye

### Navigate the filters:

* Select filter: tilt your head down to go to the next filter, tilt your head up to go to the previous filter
* Apply filter or undo: close both of your eyes once
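
The command list above can be written down as a simple lookup table. The sketch below is one possible way to do that. The `Gesture` names and the command strings are hypothetical labels chosen for illustration, not identifiers from our application.

```python
from enum import Enum, auto


class Gesture(Enum):
    """Facial gestures used as input (hypothetical names)."""
    TILT_RIGHT = auto()
    TILT_LEFT = auto()
    TILT_DOWN = auto()
    TILT_UP = auto()
    CLOSE_RIGHT_EYE = auto()
    CLOSE_LEFT_EYE = auto()
    CLOSE_BOTH_EYES = auto()


# One entry per command from the list above
COMMANDS = {
    Gesture.TILT_RIGHT: "scroll_to_next_area",
    Gesture.TILT_LEFT: "scroll_to_previous_area",
    Gesture.CLOSE_RIGHT_EYE: "next_photo",
    Gesture.CLOSE_LEFT_EYE: "previous_photo",
    Gesture.TILT_DOWN: "next_filter",
    Gesture.TILT_UP: "previous_filter",
    Gesture.CLOSE_BOTH_EYES: "apply_or_undo_filter",
}


def command_for(gesture):
    """Look up the command for a detected gesture, or None if it has no mapping."""
    return COMMANDS.get(gesture)


print(command_for(Gesture.CLOSE_LEFT_EYE))  # previous_photo
```

Keeping the mapping in one table makes it easy to change a gesture later without touching the detection code.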

We also started thinking about the navigation a little more. To let the user operate the application with less effort and fewer inaccuracies, we decided to add a "focus mode". This was inspired by the existing accessibility guideline on tab order, which says that "The tab order should follow the visual flow of the page: left to right, top to bottom – header first, then main navigation, then page navigation (if present), and finally the footer." [source](https://www.csun.edu/universal-design-center/web-accessibility-criteria-tab-order). This guideline exists for users who cannot navigate a page with a mouse or trackpad and rely only on the keyboard. Since we do not want users to rely on any peripheral devices, we wanted to apply this guideline to our facial expression commands.
We had already split the application into logically ordered sections, so we now wanted two navigation modes: navigating within the application and navigating within a section. To achieve this, our plan was to lock onto a specific section via a command. Our first idea was to blink twice to lock and unlock the focus mode. Once unlocked, the user should be able to navigate the application and jump from section to section by tilting the head. Once locked, the user should only be able to use a specific set of commands within the section. This way we can also reuse the head pose estimation as a command. The guideline also says that there must be a visible indication of the current position of the tab focus. We implemented this through borders with a changing color.
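
The two navigation modes can be sketched as a small state machine. This is a minimal illustration of the idea rather than our actual implementation. The class name, the gesture strings and the section names are hypothetical, and stopping at the first and last section instead of wrapping around is an assumption.

```python
class FocusNavigator:
    """Two navigation modes: between sections (unlocked) and within one section (locked)."""

    def __init__(self, sections):
        self.sections = list(sections)
        self.index = 0        # section that currently has the focus
        self.locked = False   # True while commands stay inside that section

    @property
    def current(self):
        return self.sections[self.index]

    def handle(self, gesture):
        """Return (section, action) describing what the gesture triggered."""
        if gesture == "double_blink":
            # Blinking twice toggles the focus mode
            self.locked = not self.locked
            return self.current, "lock" if self.locked else "unlock"
        if self.locked:
            # Locked: the gesture is forwarded to the focused section only
            return self.current, gesture
        # Unlocked: head tilts move the focus between sections
        if gesture == "tilt_right":
            self.index = min(self.index + 1, len(self.sections) - 1)
        elif gesture == "tilt_left":
            self.index = max(self.index - 1, 0)
        return self.current, "focus"


nav = FocusNavigator(["carousel", "filters"])
gestures = ["tilt_right", "double_blink", "tilt_left", "double_blink", "tilt_left"]
log = [nav.handle(g) for g in gestures]
print(log)
```

In this run the second `tilt_left` moves the focus back to the carousel, while the first one is handed to the locked filters section, so the same head movement can serve both modes.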

# Head Pose Estimation

# Gaze Detection 
Based on our research results, we wanted to try gaze detection. Although we did not include it in our list of new commands, we thought of it as an alternative to blinking or as an additional way of navigating that can be used alongside all the other commands.

This code is based on [Asadullah-Dal17's code](https://github.com/Asadullah-Dal17/Eyes-Position-Estimator-Mediapipe) as well as [Monib Sediqi's](https://kh-monib.medium.com/title-gaze-tracking-with-opencv-and-mediapipe-318ac0c9c2c3). The indices of the eye landmark points are stored and then used to isolate the eye shapes and mask the eyes. The images below show what is used in the analysis.

![isolated area of the eyes](images/eyeshape.png)
 
This isolated area of each eye is split into three parts: left, center and right. The gaze of each eye is estimated by analyzing the intensity distribution of the pixels within these isolated areas. This is achieved by counting the number of black pixels within each of the three parts of the eye region. The gaze is then determined by where the most black pixels are found. If the highest number of black pixels is found within the left part of the eye shape, for example, it indicates that the subject is gazing to the left. Black pixels are used because the gaze is detected via the pupil, which is black and makes up a significant part of the eye, as seen in the second image above.
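
The counting step can be tried on a tiny hand-made image before looking at the full webcam code. The sketch below uses plain Python lists instead of OpenCV images. The `estimate_gaze` helper and the 3x6 sample image are made up for illustration, with 0 standing for a black pixel and 255 for a white one.

```python
def estimate_gaze(binary_eye):
    """Split a thresholded eye image into thirds and pick the third with the most black pixels."""
    width = len(binary_eye[0])
    piece = width // 3
    # Column ranges of the left, center and right part; the right part takes any remainder
    bounds = [(0, piece), (piece, 2 * piece), (2 * piece, width)]
    counts = [sum(row[start:end].count(0) for row in binary_eye) for start, end in bounds]
    labels = ["LEFT", "CENTER", "RIGHT"]
    # On a tie, index() returns the first maximum, so earlier parts win
    return labels[counts.index(max(counts))], counts


# The dark "pupil" pixels sit in the left third of this sample
eye = [
    [0, 0, 255, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255, 255],
]
direction, counts = estimate_gaze(eye)
print(direction, counts)  # LEFT [6, 1, 0]
```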

```python
import cv2
import mediapipe as mp
import numpy as np

# Face Mesh landmark indices that outline each eye
LEFT_EYE = [362, 382, 381, 380, 374, 373, 390, 249, 263, 466, 388, 387, 386, 385, 384, 398]
RIGHT_EYE = [33, 7, 163, 144, 145, 153, 154, 155, 133, 173, 157, 158, 159, 160, 161, 246]
mp_drawing = mp.solutions.drawing_utils
mp_face_mesh = mp.solutions.face_mesh

# Index 1 selects the second connected camera; use 0 for the default webcam
cam = cv2.VideoCapture(1)


def landmarks_detection(img, results, draw=False):
    """Convert the normalized landmarks of the first detected face to pixel coordinates."""
    img_height, img_width = img.shape[:2]
    mesh_coord = [(int(point.x * img_width), int(point.y * img_height)) for point in
                  results.multi_face_landmarks[0].landmark]
    if draw:
        for p in mesh_coord:
            cv2.circle(img, p, 2, (0, 255, 0), -1)
    return mesh_coord


def eyesExtractor(img, right_eye_coords, left_eye_coords):
    """Mask everything except the two eye shapes and return one grayscale crop per eye."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    dimension = gray.shape
    mask = np.zeros(dimension, dtype=np.uint8)
    cv2.fillPoly(mask, [np.array(right_eye_coords, dtype=np.int32)], 255)
    cv2.fillPoly(mask, [np.array(left_eye_coords, dtype=np.int32)], 255)

    eyes = cv2.bitwise_and(gray, gray, mask=mask)
    # Paint everything outside the eyes gray (155) so it turns white after thresholding at 130
    eyes[mask == 0] = 155

    # Bounding box of each eye shape
    r_max_x = (max(right_eye_coords, key=lambda item: item[0]))[0]
    r_min_x = (min(right_eye_coords, key=lambda item: item[0]))[0]
    r_max_y = (max(right_eye_coords, key=lambda item: item[1]))[1]
    r_min_y = (min(right_eye_coords, key=lambda item: item[1]))[1]

    l_max_x = (max(left_eye_coords, key=lambda item: item[0]))[0]
    l_min_x = (min(left_eye_coords, key=lambda item: item[0]))[0]
    l_max_y = (max(left_eye_coords, key=lambda item: item[1]))[1]
    l_min_y = (min(left_eye_coords, key=lambda item: item[1]))[1]

    cropped_right = eyes[r_min_y: r_max_y, r_min_x: r_max_x]
    cropped_left = eyes[l_min_y: l_max_y, l_min_x: l_max_x]
    return cropped_right, cropped_left


def positionEstimator(cropped_eye):
    """Blur and threshold the eye crop, then estimate the gaze from its three vertical parts."""
    h, w = cropped_eye.shape

    gaussian_blur = cv2.GaussianBlur(cropped_eye, (9, 9), 0)
    median_blur = cv2.medianBlur(gaussian_blur, 3)

    # Pixels darker than 130 become black (0), everything else white (255)
    ret, threshed_eye = cv2.threshold(median_blur, 130, 255, cv2.THRESH_BINARY)
    piece = w // 3
    left_piece = threshed_eye[0:h, 0:piece]
    center_piece = threshed_eye[0:h, piece: piece + piece]
    right_piece = threshed_eye[0:h, piece + piece:w]

    eye_position, color = pixelCounter(left_piece, center_piece, right_piece)

    return eye_position, color


def pixelCounter(first_piece, second_piece, third_piece):
    """Count the black pixels per part and return the position with the highest count."""
    left_part = np.sum(first_piece == 0)
    center_part = np.sum(second_piece == 0)
    right_part = np.sum(third_piece == 0)
    eye_parts = [left_part, center_part, right_part]

    max_index = eye_parts.index(max(eye_parts))
    pos_eye = ''
    if max_index == 0:
        pos_eye = 'LEFT'
        color = [(0, 0, 0), (0, 0, 255)]
    elif max_index == 1:
        pos_eye = 'CENTER'
        color = [(150, 0, 50), (255, 0, 128)]
    elif max_index == 2:
        pos_eye = 'RIGHT'
        color = [(75, 75, 75), (79, 100, 9)]
    else:
        # Not reachable with three parts; kept as a fallback
        pos_eye = "Closed"
        color = [(75, 75, 75), (79, 100, 9)]
    return pos_eye, color


def colorBackgroundText(img, text, font, fontScale, textPos, textThickness=1, textColor=(0, 255, 0), bgColor=(0, 0, 0),
                        pad_x=3, pad_y=3):
    """Draw text on a filled background rectangle."""
    (t_w, t_h), _ = cv2.getTextSize(text, font, fontScale, textThickness)
    x, y = textPos
    cv2.rectangle(img, (x - pad_x, y + pad_y), (x + t_w + pad_x, y - t_h - pad_y), bgColor, -1)
    cv2.putText(img, text, textPos, font, fontScale, textColor, textThickness)
    return img


with mp_face_mesh.FaceMesh(min_detection_confidence=0.5, min_tracking_confidence=0.5) as face_mesh:
    while True:
        ret, frame = cam.read()
        if not ret:
            # No frame could be read (camera missing or disconnected)
            break
        frame = cv2.flip(frame, 1)
        # OpenCV delivers BGR frames, MediaPipe expects RGB
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = face_mesh.process(rgb_frame)

        if results.multi_face_landmarks:
            mesh_coords = landmarks_detection(frame, results, False)

            right_coords = [mesh_coords[p] for p in RIGHT_EYE]
            left_coords = [mesh_coords[p] for p in LEFT_EYE]
            crop_right, crop_left = eyesExtractor(frame, right_coords, left_coords)
            # The frame is mirrored, so the eye the model indexes as RIGHT_EYE is the user's left eye
            eye_position, color = positionEstimator(crop_right)
            colorBackgroundText(frame, f'L: {eye_position}', cv2.FONT_HERSHEY_SIMPLEX, 1.0, (40, 220), 2, color[0],
                                color[1], 8, 8)
            eye_position_left, color = positionEstimator(crop_left)
            colorBackgroundText(frame, f'R: {eye_position_left}', cv2.FONT_HERSHEY_SIMPLEX, 1.0, (40, 320), 2, color[0],
                                color[1], 8, 8)

        cv2.imshow('frame', frame)

        # Press q to quit
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

cam.release()
cv2.destroyAllWindows()
```

After testing this code, we found that it works really well but is not suitable for navigation within our application, since it is very straining on the eyes. Another issue is that you cannot look at the screen and gaze to the left or right at the same time. This may lead to the user gazing in one direction for too long and accidentally triggering the intended command multiple times. So although this is a really stable algorithm, we decided to discard it.
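
To illustrate the repeated-trigger problem: if the gaze is evaluated on every frame, a gaze held for one second fires the same command many times. A common way to limit this in general is a cooldown that ignores repeats for a short time after a trigger. The sketch below shows that generic idea only. The `Cooldown` class and the one-second value are assumptions for illustration and not taken from our application.

```python
class Cooldown:
    """Allow a command at most once per `seconds`; timestamps are passed in explicitly."""

    def __init__(self, seconds):
        self.seconds = seconds
        self._last = None  # time of the last accepted trigger

    def allow(self, now):
        """Return True if enough time has passed since the last accepted trigger."""
        if self._last is not None and now - self._last < self.seconds:
            return False
        self._last = now
        return True


# The gaze stays on one side and is reported at these times (in seconds)
cooldown = Cooldown(seconds=1.0)
accepted = [cooldown.allow(t) for t in [0.0, 0.3, 0.9, 1.2, 1.5, 2.3]]
print(accepted)  # [True, False, False, True, False, True]
```

A cooldown reduces accidental repeats, but it does not address the eye strain or the fact that the user has to look away from the screen.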