<a href="https://colab.research.google.com/github/willshpt/EE475stuff/blob/main/Shortened_EE475FinalProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Eye Tracking with Webcams

## Machine Learning Based Eye Tracking

### My Method

This section covers my work so far on designing my own ML-based eye tracker. It is not fully finished.

#### Research

Before doing any research into the project, I came up with a basic idea of how my implementation would work. I had a feeling that I wouldn't be able to do face/eye detection from scratch using only the tools we had from class with the time I had. Instead, my plan was to use an existing face/eye/pupil detection model to generate an input for my eye tracking model.

For data, I would build my own dataset with a program that displays a pixel on the screen and then captures a webcam image of me (or of anybody I could get to participate) once a button press confirms that I am looking at the pixel. Each image would be stored together with the pixel's x and y coordinates.
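
As a rough sketch of what one stored record could look like, the snippet below draws random on-screen target positions and writes them out as CSV rows. The names `GazeSample`, `make_targets` and `write_samples`, the 1920x1080 screen size, and the file naming scheme are all assumptions made for illustration; no webcam capture happens here.

```python
import csv
import io
import random
from dataclasses import dataclass, astuple


@dataclass
class GazeSample:
    image_path: str  # where the webcam frame would be saved (hypothetical naming)
    target_x: int    # pixel column of the displayed dot
    target_y: int    # pixel row of the displayed dot


def make_targets(n, width, height, seed=None):
    """Draw n random on-screen target positions as (x, y) pixel pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(width), rng.randrange(height)) for _ in range(n)]


def write_samples(samples, fh):
    """Write the samples as CSV rows, preceded by a header row."""
    writer = csv.writer(fh)
    writer.writerow(["image_path", "target_x", "target_y"])
    for s in samples:
        writer.writerow(astuple(s))


# Assumed screen size; a real collector would query the display instead.
targets = make_targets(5, 1920, 1080, seed=0)
samples = [GazeSample(f"frame_{i:04d}.png", x, y) for i, (x, y) in enumerate(targets)]

buf = io.StringIO()  # stands in for a file on disk
write_samples(samples, buf)
print(buf.getvalue())
```

Keeping the label next to the image path means the detection step can be rerun later without recollecting any data.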

To train my model, I would run each captured image through the existing face/eye/pupil detection model. Its outputs would serve as the input features for my own model, with the x and y coordinates of the pixel as the target outputs. I would then train on this data, trying different models and hyperparameters to find the most consistent combination.
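
One way to compare hyperparameters is to hold out part of the data and pick the setting with the lowest validation error. The sketch below does this for a ridge penalty on synthetic data. The 16-feature input, the linear model, the candidate penalties and the 150/50 split are assumptions chosen for illustration, not part of the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for (detector features -> screen coordinates) data.
n, d = 200, 16
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, 2))
Y = X @ W_true + 0.1 * rng.normal(size=(n, 2))

# Hold-out split: 150 points for training, 50 for validation.
idx = rng.permutation(n)
train, valid = idx[:150], idx[150:]


def fit_ridge(X, Y, lam):
    """Closed-form ridge regression weights for penalty lam."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)


def mse(X, Y, W):
    """Mean squared error of the linear predictions X @ W."""
    return float(np.mean((X @ W - Y) ** 2))


lams = [0.0, 0.1, 1.0, 10.0, 100.0]
val_errors = {
    lam: mse(X[valid], Y[valid], fit_ridge(X[train], Y[train], lam)) for lam in lams
}
best_lam = min(val_errors, key=val_errors.get)
print(val_errors)
print("best penalty:", best_lam)
```

The same loop structure applies to any other hyperparameter, such as the number of boosting steps or the initialization scale.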

With the time and resources I had, I was unable to create a dataset of my own in time to get a working model. However, I have supplemental research on similar projects to discuss their pros and cons, and how I might modify my project in order to make it better.

##### GazeCapture

The first thing I found when looking for datasets similar to what I was planning was [GazeCapture](https://github.com/CSAILVision/GazeCapture.git) (their paper can be found [here](https://gazecapture.csail.mit.edu/index.php)). This is one of the few projects I was able to find that does almost exactly what I was looking for: not just detecting the direction of a person's gaze, but estimating where on a screen the person is looking. They created the GazeCapture dataset by publishing an application on the iOS App Store which displays pulsating red dots at various positions on the screen and records the user with the front-facing camera as they look at the dots. The application has other parts, such as making the user press the left or right side of the screen depending on whether an L or an R is displayed in the dot, in order to make sure the user is engaged. Otherwise, the idea is almost exactly what I had in mind for collecting data.

Using this data, they trained a Convolutional Neural Network, which they named iTracker, for gaze prediction. iTracker did not use any pre-existing face or eye detection, and was able to outperform similar approaches by a significant margin, according to the paper. However, the iTracker model was too complex to run in real-time due to the size of the inputs.

##### Convolutional Neural Network-Based Methods for Eye Gaze Estimation: A Survey

One of the other sources I found, which discusses eye tracking in a broader sense, is [Convolutional Neural Network-Based Methods for Eye Gaze Estimation: A Survey](https://ieeexplore.ieee.org/abstract/document/9153754), a survey of deep learning-based gaze estimation techniques that focuses on convolutional neural networks but also covers other methods. The paper discusses 2D and 3D approaches to eye tracking and concentrates on explaining the general types of eye tracking.

#### Actual Code

##### A model to detect looking left and right on the screen

The idea here is to create a dataset by grabbing images of the user looking left and right, then to use neural network boosting to train a two-class classifier with training and validation sets. At the end, I have a way of testing the model in real time. This is very finicky, but I was able to get decent results even with only 20 datapoints, especially considering how little I had tuned the pupil threshold and the hyperparameters of the model.

Of course, this is still pretty much the same as the first example model I provided, and it is a two-class classification model when ideally it should be either one multi-output regression model or two single-output regression models. I plan on improving this in the future.
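
For a plain linear least-squares model, the two regression options give the same weights, which the sketch below checks on synthetic data. The 40 points, 16 features and uniform targets are assumptions that only mirror the shape of the dataset. The equivalence is specific to linear least squares and would not generally hold for a neural network whose hidden units are shared between the two outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 40 samples, 16 features, 2 targets (screen x and y).
P = 40
features = rng.normal(size=(P, 16))
A = np.hstack([np.ones((P, 1)), features])  # prepend a bias column
targets = rng.uniform(0.0, 1.0, size=(P, 2))

# One multi-output fit: the solution has one column of weights per target.
W_joint, *_ = np.linalg.lstsq(A, targets, rcond=None)

# Two single-output fits, one per coordinate.
w_x, *_ = np.linalg.lstsq(A, targets[:, 0], rcond=None)
w_y, *_ = np.linalg.lstsq(A, targets[:, 1], rcond=None)

same_x = np.allclose(W_joint[:, 0], w_x)
same_y = np.allclose(W_joint[:, 1], w_y)
print(same_x, same_y)
```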

Here is a modified version of the code that gets the positions of the face and eyes and uses them to create a dataset:

In [None]:
num_its_per_direction = 20 # How many times you look left/right
brightness_threshold = 100 # Threshold for pupil detection, change this if you get bad results

In [None]:
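import cv2
import numpy as np
from google.colab.patches import cv2_imshow
from IPython.display import clear_output
import sys
import time  # needed for the time.sleep() calls below

# Note: take_photo() is called below but is not defined in this shortened notebook.
# It is assumed to be a Colab webcam-capture helper that returns a BGR image array.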


# init part
# May need to change the following 2 lines to the correct path, if you don't have the files check the repository
face_cascade = cv2.CascadeClassifier('/content/drive/MyDrive/FinalFinalProjects/haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier('/content/drive/MyDrive/FinalFinalProjects/haarcascade_eye.xml')
detector_params = cv2.SimpleBlobDetector_Params()
detector_params.filterByArea = True
detector_params.maxArea = 1500
detector = cv2.SimpleBlobDetector_create(detector_params)


def detect_faces(img, cascade):
    gray_frame = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    face_rect = []
    coords = cascade.detectMultiScale(gray_frame, 1.3, 5)
    if len(coords) > 1:
        # Several faces found: keep the one with the largest height
        biggest = (0, 0, 0, 0)
        for i in coords:
            if i[3] > biggest[3]:
                biggest = i
        biggest = np.array([biggest], np.int32)
    elif len(coords) == 1:
        biggest = coords
    else:
        return None, face_rect
    for (x, y, w, h) in biggest:
        frame = img[y:y + h, x:x + w]
        face_rect = [x, y, w, h]
    return frame, face_rect


def detect_eyes(img, cascade):
    left_rect = [0,0,0,0]
    right_rect = [0,0,0,0]
    gray_frame = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    eyes = cascade.detectMultiScale(gray_frame, 1.3, 5)  # detect eyes
    width = np.size(img, 1)  # get face frame width
    height = np.size(img, 0)  # get face frame height
    left_eye = None
    right_eye = None
    for (x, y, w, h) in eyes:
        if y > height / 2:
            continue  # skip detections in the lower half of the face, which are unlikely to be eyes
        eyecenter = x + w / 2  # get the eye center
        if eyecenter < width * 0.5:
            left_eye = img[y:y + h, x:x + w]
            left_rect = [x,y,w,h]
        else:
            right_eye = img[y:y + h, x:x + w]
            right_rect = [x,y,w,h]
    return (left_eye, right_eye), left_rect, right_rect


def cut_eyebrows(img):
    height, width = img.shape[:2]
    eyebrow_h = int(height / 4)
    img = img[eyebrow_h:height, 0:width]  # cut eyebrows out (top quarter of the eye frame)

    return img

def blob_process(img, threshold, detector):
    gray_frame = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    #cv2_imshow(gray_frame)
    _, img = cv2.threshold(gray_frame, threshold, 255, cv2.THRESH_BINARY)
    #cv2_imshow(img)
    img = cv2.erode(img, None, iterations=2)
    img = cv2.dilate(img, None, iterations=4)
    img = cv2.medianBlur(img, 5)
    #cv2_imshow(img)
    keypoints = detector.detect(img)
    #print(keypoints)
    return keypoints

def collect_data(num_its):
    #cap = cv2.VideoCapture(1) # Uncomment if not google colab
    #cv2.namedWindow('image') # Uncomment if not google colab
    #cv2.createTrackbar('threshold', 'image', 0, 255, nothing) # Uncomment if not google colab
    y = []
    x = []
    for i in range(0,num_its*2):
        print(i)
        l_pos = []
        r_pos = []
        pupil_l = []
        pupil_r = []
        if(i == 0):
            print("Look left")
            y.append(1)
            time.sleep(5)
        elif(i < num_its):
            print("Look left")
            y.append(1)
            time.sleep(0.1)
        elif(i == num_its):
            print("look right")
            y.append(-1)
            time.sleep(5)
        else:
            print("Look right")
            y.append(-1)
            time.sleep(0.1)
        frame = take_photo() # Uncomment if using google colab
        #_, frame = cap.read() # Uncomment if not google colab
        #clear_output()
        face_frame, face_pos = detect_faces(frame, face_cascade)
        if face_frame is not None:
            #cv2_imshow(face_frame)
            eyes, l_pos, r_pos = detect_eyes(face_frame, eye_cascade)
            j = 0
            for eye in eyes:
                j = j + 1;
                print("Found an eye...")
                if eye is not None:
                    #threshold = r = cv2.getTrackbarPos('threshold', 'image')
                    # Modify the threshold manually for now, could also iterate through thresholds if no keypoints are found
                    retry = True
                    threshold = brightness_threshold - 1
                    eye = cut_eyebrows(eye)
                    while retry:
                        threshold = threshold + 1
                        keypoints = blob_process(eye, threshold, detector)
                        if(j == 1):
                            if(len(keypoints) > 0):
                                pupil_l.append(keypoints[0].pt)
                                retry = False
                            else:
                                if(threshold >= 255):
                                    print("No pupil")
                                    pupil_l.append((0.,0))
                                    retry = False
                        elif(j == 2):
                            if(len(keypoints) > 0):
                                pupil_r.append(keypoints[0].pt)
                                retry = False  # stop raising the threshold once a pupil is found
                            else:
                                if(threshold >= 255):
                                    print("No pupil")
                                    pupil_r.append((0.,0))
                                    retry = False
                    eye = cv2.drawKeypoints(eye, keypoints, eye, (0, 0, 255), cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
                    #cv2_imshow(eye)
                else:
                    print("No eye")
                    if(j == 1):
                        pupil_l.append((0.,0))
                    elif(j == 2):
                        pupil_r.append((0.,0))
            x.append([face_pos[0],face_pos[1],face_pos[2],face_pos[3],l_pos[0],l_pos[1],l_pos[2],l_pos[3],r_pos[0],r_pos[1],r_pos[2],r_pos[3],pupil_l[0][0],pupil_l[0][1],pupil_r[0][0],pupil_r[0][1]])
        else:
            print("No face")
            y.pop()  # drop the label appended above so x and y stay the same length
        cv2_imshow(frame)
        time.sleep(2)
        clear_output()
        print("ready!")
        time.sleep(1)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    return x,y
    #cap.release() # Uncomment if not google colab
    #cv2.destroyAllWindows() # Uncomment if not google colab

In [None]:
x1, y1 = collect_data(num_its_per_direction)
print(x1)
print(y1)

In [None]:
# import basic libraries and autograd wrapped numpy
from sklearn.datasets import fetch_openml
import sys
sys.path.append('../')
import autograd.numpy as np2
import matplotlib.pyplot as plt
from autograd import grad
from autograd import hessian
from autograd import value_and_grad
from autograd.misc.flatten import flatten_func
import pandas as pd


# this is needed to compensate for matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib inline
#from matplotlib import rcParams
#rcParams['figure.autolayout'] = True

# datapath to data
datapath = '/content/drive/MyDrive/Colab Notebooks/mlrefined_datasets/nonlinear_superlearn_datasets/'


from numba import njit, prange

In [None]:
x4 = np2.array(x1).T
y4 = np2.array(y1)[np2.newaxis,:]  # named y4 because every later cell refers to y4
print(np2.shape(x4))
print(np2.shape(y4))

(16, 40)
(1, 40)


In [None]:
# newtons method function - inputs: g (input function), max_its (maximum number of iterations), w (initialization)
def newtons_method(g,max_its,w,**kwargs):
    # flatten input function, in case it takes in matrices of weights
    flat_g, unflatten, w = flatten_func(g, w)

    # compute the gradient / hessian functions of our input function -
    # note these are themselves functions.  In particular the gradient -
    # - when evaluated - returns both the gradient and function evaluations (remember
    # as discussed in Chapter 3 we always get the function evaluation 'for free' when we use
    # an Automatic Differentiator to evaluate the gradient)
    gradient = value_and_grad(flat_g)
    hess = hessian(flat_g)
    # set numerical stability parameter / regularization parameter
    epsilon = 10**(-7)
    if 'epsilon' in kwargs:
        epsilon = kwargs['epsilon']

    # run the newtons method loop
    weight_history = []      # container for weight history
    cost_history = []        # container for corresponding cost function history
    for k in range(max_its):
        # evaluate the gradient, store current weights and cost function value
        cost_eval,grad_eval = gradient(w)
        weight_history.append(unflatten(w))
        cost_history.append(cost_eval)

        # evaluate the hessian
        hess_eval = hess(w)

        # reshape for numpy linalg functionality
        hess_eval.shape = (int((np2.size(hess_eval))**(0.5)),int((np2.size(hess_eval))**(0.5)))

        # solve second order system for weight update
        w = w - np2.dot(np2.linalg.pinv(hess_eval + epsilon*np2.eye(np2.size(w))),grad_eval)

    # collect final weights
    weight_history.append(unflatten(w))
    # compute final cost function value via g itself (since we aren't computing
    # the gradient at the final step we don't get the final cost function value
    # via the Automatic Differentiator)
    cost_history.append(flat_g(w))
    return cost_history,weight_history
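
To make the update rule concrete, here is a small worked example in plain NumPy rather than autograd. On a quadratic cost, a single regularized Newton step lands (up to the small `epsilon` term) on the minimizer. The matrix `A`, vector `b` and starting point are arbitrary values chosen for illustration.

```python
import numpy as np

# Quadratic cost g(w) = 0.5 * w^T A w - b^T w, with A symmetric positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])


def gradient(w):
    """Gradient of g at w."""
    return A @ w - b


def hessian(w):
    """Hessian of g, which is constant for a quadratic."""
    return A


epsilon = 1e-7                 # same role as the regularization parameter above
w = np.array([5.0, -3.0])      # arbitrary starting point

# One Newton step: solve (H + epsilon * I) d = grad, then move to w - d.
H = hessian(w) + epsilon * np.eye(w.size)
w_new = w - np.linalg.solve(H, gradient(w))

w_star = np.linalg.solve(A, b)  # exact minimizer, where the gradient vanishes
print(w_new, w_star)
```

The softmax cost used in this notebook is not quadratic, so several steps are needed there, but each one has this same form.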

In [None]:
training_indices4 = np2.sort(np2.random.choice(np2.shape(x4)[1], int(2*np2.shape(x4)[1]/3), replace=False))
validation_indices4 = np2.delete(np2.arange(np.shape(x4)[1]), training_indices4)
x4train = x4.T[training_indices4].T
x4valid = x4.T[validation_indices4].T
y4train = y4[0][training_indices4]
y4valid = y4[0][validation_indices4]
print(x4.shape)
print(x4train.shape)
print(x4valid.shape)

In [None]:
def initial_model4(x,w):
  return (w*np2.ones((1,x.shape[1])))

In [None]:
def softmax4_train(w,model):
  cost = np2.sum(np2.log(1 + np2.exp(-y4train*model(x4train,w))))
  a = cost/float(np2.size(y4train))
  return a

In [None]:
def softmax4_validate(w,model):
  cost = np2.sum(np2.log(1 + np2.exp(-y4valid*model(x4valid,w))))
  a = cost/float(np2.size(y4valid))
  return a

In [None]:
def softmax4_overall(w,model):
  cost = np2.sum(np2.log(1 + np2.exp(-y4*model(x4,w))))
  a = cost/float(np2.size(y4))
  return a
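
The expression `log(1 + exp(z))` overflows when `z` is large, which can happen when the initialization scale is too big. A possible alternative, sketched here in plain NumPy, is `np.logaddexp(0, z)`, which computes the same quantity without forming `exp(z)` directly. The function names and test values are illustrative, and whether the autograd-wrapped NumPy in your installation differentiates through `logaddexp` is an assumption worth checking before swapping it into the cost functions.

```python
import numpy as np


def softmax_cost_naive(y, scores):
    """Two-class softmax cost written the direct way."""
    return np.sum(np.log(1 + np.exp(-y * scores))) / y.size


def softmax_cost_stable(y, scores):
    """Same cost via logaddexp(0, z) = log(exp(0) + exp(z)) = log(1 + exp(z))."""
    return np.sum(np.logaddexp(0.0, -y * scores)) / y.size


y = np.array([1.0, -1.0, 1.0, -1.0])
small = np.array([0.5, -2.0, 3.0, 1.0])
large = np.array([-1000.0, 1000.0, 1000.0, -1000.0])

# For moderate scores the two versions agree.
print(softmax_cost_naive(y, small), softmax_cost_stable(y, small))

# For extreme scores the naive version overflows to inf; the stable one does not.
with np.errstate(over="ignore"):
    naive_large = softmax_cost_naive(y, large)
stable_large = softmax_cost_stable(y, large)
print(naive_large, stable_large)
```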

In [None]:
def feature_transforms4(x,w):
    # No transform
    f = x
    return f

In [None]:
def perceptron4(x, w):
        # compute inner product with current layer weights
        f = feature_transforms4(x,w)
        a = w[0][0] + np2.dot(f.T, w[0][1:])
        # output of layer activation
        a = np2.maximum(0,a).T
        # final linear combo
        a = w[1][0] + np2.dot(a.T,w[1][1:])
        return a.T

In [None]:
def misclass_calc(model, x, true_labels, threshold=0):
    # Get raw model scores (shape 1 x P); these are not probabilities
    scores = model(x)
    # Scores at or above the threshold map to class +1, all others to class -1
    binary_predictions = (scores >= threshold).astype(int)
    binary_predictions = np2.array([-1 if p == 0 else p for p in binary_predictions[0]])
    # Count misclassifications
    misclassifications = np2.sum(binary_predictions != true_labels)
    return misclassifications
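
The same count can be written without the list comprehension. This is a minimal sketch in plain NumPy, with a hypothetical function name and made-up scores, that assumes labels in {-1, +1} as in the rest of the notebook.

```python
import numpy as np


def count_misclassifications(scores, true_labels, threshold=0.0):
    """Count labels in {-1, +1} that disagree with the thresholded scores."""
    # Scores at or above the threshold become +1, everything else -1.
    predictions = np.where(np.ravel(scores) >= threshold, 1, -1)
    return int(np.sum(predictions != np.ravel(true_labels)))


scores = np.array([[2.3, -0.7, 0.0, -1.5]])  # shape (1, P), like the model output
labels = np.array([1, 1, -1, -1])
print(count_misclassifications(scores, labels))  # prints 2
```

Flattening both inputs with `np.ravel` means the function accepts labels of shape `(P,)` or `(1, P)` interchangeably.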


In [None]:
def neural_net_boosting_learner(num_steps, max_its, scale):
  step_array = np.arange(1,num_steps)
  best_units = []
  model_history = []
  training_cost_history = []
  validation_cost_history = []
  combined_cost_history = []
  training_misclassification_history = []
  validation_misclassification_history = []
  combined_misclassification_history = []
  w_best_history = []
  model_0 = initial_model4;
  unfinished = True;
  s_factor = 1.0
  while unfinished:
    scale_temp = round(scale*s_factor,4);
    print("Step: ", 0)
    print("Scale: ", scale_temp)
    w = scale_temp*np2.random.randn(1)
    try:
      cost_hist, weight_hist = newtons_method(lambda w : softmax4_train(w, model_0), max_its, w)
      unfinished = False
    except np2.linalg.LinAlgError:
      print("Did not converge! Decreasing scale")
      s_factor = s_factor * 0.9  # shrink multiplicatively (as in the loop below) so the scale never reaches zero or goes negative
  ind = np2.argmin(cost_hist)
  w_best = weight_hist[ind]
  w_best_history.append(w_best)
  model = lambda x,w=w_best : model_0(x,w)
  best_units.append(model)
  model_history.append(model)
  training_cost_history.append(softmax4_train(w_best,model))
  validation_cost_history.append(softmax4_validate(w_best,model))
  combined_cost_history.append(softmax4_overall(w_best,model))
  training_misclassification_history.append(misclass_calc(model, x4train, y4train))
  validation_misclassification_history.append(misclass_calc(model, x4valid, y4valid))
  combined_misclassification_history.append(misclass_calc(model, x4, y4))
  for j in step_array:
    next_unit = perceptron4
    # Bind the current ensemble as a default argument: a bare reference to `model`
    # would be looked up at call time, after `model` has been reassigned below
    current_model = lambda x,w,prev=model: prev(x) + next_unit(x,w)
    unfinished = True;
    s_factor = 1.0
    while unfinished:
      scale_temp = round(scale*s_factor,4);
      print("Step: ", j)
      print("Scale: ", scale_temp)
      w = [scale_temp*np2.random.randn(x4train.shape[0] + 1,1), scale_temp*np2.random.randn(2,1), scale_temp*np.random.randn(6,1)]
      try:
        cost_hist, weight_hist = newtons_method(lambda w : softmax4_train(w, current_model), max_its, w)
        unfinished = False
      except np2.linalg.LinAlgError:
        print("Did not converge! Decreasing scale")
        s_factor = s_factor * 0.9
    ind = np2.argmin(cost_hist)
    w_best = weight_hist[ind]
    w_best_history.append(w_best)
    best_perceptron = lambda x,w=w_best: perceptron4(x,w)
    best_units.append(best_perceptron)
    # Bind current_model too, so each stored model keeps the ensemble it was trained with
    next_model = lambda x,w=w_best,cm=current_model: cm(x,w)
    model_history.append(next_model)
    training_cost_history.append(softmax4_train(w_best,next_model))
    validation_cost_history.append(softmax4_validate(w_best,next_model))
    combined_cost_history.append(softmax4_overall(w_best,next_model))
    training_misclassification_history.append(misclass_calc(next_model, x4train, y4train))
    validation_misclassification_history.append(misclass_calc(next_model, x4valid, y4valid))
    combined_misclassification_history.append(misclass_calc(next_model, x4, y4))
    # Snapshot the unit list: sharing the growing list would count the newest unit twice
    model = lambda x,units=list(best_units): np2.sum([v(x) for v in units],axis=0)
    print("Misclassifications: ", combined_misclassification_history[-1])
  return w_best_history, best_units, model_history, training_cost_history, validation_cost_history, combined_cost_history, training_misclassification_history, validation_misclassification_history, combined_misclassification_history

In [None]:
num_steps4 = 20
max_its4 = 60
scale4 = .5
w_best_hist4, best_units4, model_history4, tcost_hist4, vcost_hist4, ccost_hist4, tmisclass_hist4, vmisclass_hist4, cmisclass_hist4 = neural_net_boosting_learner(num_steps4,max_its4,scale4)
true_steps = np.arange(0, np.array(tcost_hist4).shape[0])
plt.plot(true_steps, tcost_hist4, label="Training Cost")
plt.plot(true_steps, vcost_hist4, label="Validation Cost")
plt.plot(true_steps, ccost_hist4, label="Overall Cost")
plt.xlabel('Num Steps')
plt.ylabel('Cost')
plt.legend()
plt.show()
plt.plot(true_steps, tmisclass_hist4, label="Training Misclassifications")
plt.plot(true_steps, vmisclass_hist4, label="Validation Misclassifications")
plt.plot(true_steps, cmisclass_hist4, label="Overall Misclassifications")
plt.xlabel('Num Steps')
plt.ylabel('Misclassifications')
plt.legend()
plt.show()

In [None]:
best_ind4 = np.argmin(cmisclass_hist4)
best_model4 = model_history4[best_ind4]
w4f = w_best_hist4[best_ind4]
print("Misclassifications: ", misclass_calc(best_model4,x4,y4))
#print("Predictions: ", best_model4(x4,))

In [None]:
def test_model(num_its=1):
    #cap = cv2.VideoCapture(1) # Uncomment if not google colab
    #cv2.namedWindow('image') # Uncomment if not google colab
    #cv2.createTrackbar('threshold', 'image', 0, 255, nothing) # Uncomment if not google colab
    x = []
    for i in range(0,num_its):
        print(i)
        l_pos = []
        r_pos = []
        pupil_l = []
        pupil_r = []
        clear_output()
        #_, frame = cap.read() # Uncomment if not google colab
        frame = take_photo() # Uncomment if using google colab
        face_frame, face_pos = detect_faces(frame, face_cascade)
        if face_frame is not None:
            #cv2_imshow(face_frame)
            eyes, l_pos, r_pos = detect_eyes(face_frame, eye_cascade)
            j = 0
            for eye in eyes:
                j = j + 1;
                print("Found an eye...")
                if eye is not None:
                    #threshold = r = cv2.getTrackbarPos('threshold', 'image')
                    # Modify the threshold manually for now, could also iterate through thresholds if no keypoints are found
                    threshold = brightness_threshold
                    eye = cut_eyebrows(eye)
                    keypoints = blob_process(eye, threshold, detector)
                    if(j == 1):
                        if(len(keypoints) > 0):
                            pupil_l.append(keypoints[0].pt)
                        else:
                            pupil_l.append((0.,0))
                    elif(j == 2):
                        if(len(keypoints) > 0):
                            pupil_r.append(keypoints[0].pt)
                        else:
                            pupil_r.append((0.,0))
                    eye = cv2.drawKeypoints(eye, keypoints, eye, (0, 0, 0), cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
                    #cv2_imshow(eye)
                else:
                    print("No eye")
                    if(j == 1):
                        pupil_l.append((0.,0))
                    elif(j == 2):
                        pupil_r.append((0.,0))
            x.append([face_pos[0],face_pos[1],face_pos[2],face_pos[3],l_pos[0],l_pos[1],l_pos[2],l_pos[3],r_pos[0],r_pos[1],r_pos[2],r_pos[3],pupil_l[0][0],pupil_l[0][1],pupil_r[0][0],pupil_r[0][1]])
        else:
            print("No face")
        cv2_imshow(frame)
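        #print(((np2.array(x)).T).shape)
        if face_frame is not None:
            # Use the most recent feature vector; x[i] fails once any frame has had no face
            prediction = best_model4(np2.array(x[-1])[np2.newaxis,:].T,w4f)
            #print(prediction)
            if(prediction[0][0] > 0):
                print("Looking Left")
            else:
                print("Looking Right")
            #print(prediction[0][0])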
        time.sleep(5)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    #cap.release() # Uncomment if not google colab
    #cv2.destroyAllWindows() # Uncomment if not google colab

In [None]:
test_model()

#### Analysis and Future Work

My simple left-right gaze detector, which uses neural network boosting with a softmax cost, ended up being much more successful than I expected going in.

On my first attempt, with a sample size of 20 admittedly rough input datapoints, I got zero misclassifications on the training and validation sets. Given a bit more time, a local Python environment on a decent computer, and a larger, more robust dataset, I believe it would be quite possible to iterate on this project and arrive at a reasonably reliable gaze tracker capable of estimating where on a screen the user is looking.

# Sources

Antoinelame Gaze Tracking: https://github.com/antoinelame/GazeTracking

Convolutional Neural Network-Based Methods for Eye Gaze Estimation: A Survey: https://ieeexplore.ieee.org/abstract/document/9153754

DougDoug: https://www.youtube.com/watch?v=6cI_D2dfw24

Eye Tracking for Everyone: https://gazecapture.csail.mit.edu/index.php

Tracking your eyes with Python: https://medium.com/@stepanfilonov/tracking-your-eyes-with-python-3952e66194a6

Wikipedia: https://en.wikipedia.org/wiki/Eye_tracking