# Background



# The problem Gorani Reader wants to solve

For each user, we want to estimate the probability that the user knows a given word $i$:

$ \hat y =  P(Know) $

## Knowledge tracing

Knowledge cannot be observed directly by a sensor. In typical educational data mining research, it is therefore inferred from students' attempts to solve relevant problems. This approach is called "knowledge tracing." One of the most popular knowledge tracing frameworks is Bayesian Knowledge Tracing (BKT), which models $ \hat y $ using the forward probability of a hidden Markov model.

$ \hat y_t(k) = P(o_1, o_2, ..., o_t, Know_t=k) = p(o_t|Know_t=k) \sum_{i=0}^{1} \hat y_{t-1}(i) \, p(Know_t=k|Know_{t-1}=i) $

where $o_t = 1$ if the answer to the problem at step $t$ was right and 0 otherwise.

The model learns the transition probabilities $ p(Know_t=k|Know_{t-1}=i) $, the emission probabilities $ p(o_t|Know_t=k) $, and the initial state probability for each "skill" (e.g. the quadratic formula or square roots in algebra).
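To make the recursion concrete, here is a minimal sketch of the standard HMM forward recursion for a two-state knowledge model, in which each observation is emitted from the current state. All parameter values (initial, transition, guess and slip probabilities) are illustrative assumptions, not values fitted on Gorani Reader data, and the function names are hypothetical.

```python
# Two hidden states: 0 = does not know the skill, 1 = knows it.
# All numbers below are illustrative assumptions, not fitted parameters.

def forward(observations, initial, transition, emission):
    """Return alpha_t(k) = P(o_1..o_t, Know_t = k) for every step t.

    initial[k]       = P(Know_1 = k)
    transition[i][k] = P(Know_t = k | Know_{t-1} = i)
    emission[k][o]   = P(o_t = o | Know_t = k)
    """
    n_states = len(initial)
    # Base case: prior times the emission probability of the first observation.
    alpha = [initial[k] * emission[k][observations[0]] for k in range(n_states)]
    history = [alpha]
    for o in observations[1:]:
        # Recursive case: sum over the previous state, then emit from the current one.
        alpha = [
            emission[k][o] * sum(alpha[i] * transition[i][k] for i in range(n_states))
            for k in range(n_states)
        ]
        history.append(alpha)
    return history


def posterior_known(alpha):
    """Normalise alpha_t to obtain P(Know_t = 1 | o_1..o_t)."""
    return alpha[1] / sum(alpha)


initial = [0.6, 0.4]                   # assumed P(Know_1 = 0), P(Know_1 = 1)
transition = [[0.8, 0.2], [0.0, 1.0]]  # assumed learn rate 0.2, no forgetting
emission = [[0.75, 0.25], [0.1, 0.9]]  # assumed guess rate 0.25, slip rate 0.1

history = forward([1, 1, 0, 1], initial, transition, emission)
for t, alpha in enumerate(history, start=1):
    print(t, round(posterior_known(alpha), 3))
```

Normalising the forward probabilities at each step gives the filtered probability that the skill is known after the observations seen so far.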


In Gorani Reader, I applied this model with each word treated as a skill, setting $o_t = 0$ if the user looked up the word and 1 otherwise.

Two distinct characteristics of vocabulary tracing might have caused Bayesian Knowledge Tracing to perform poorly in Gorani Reader.

1. There are too many "skills." The number of skills equals the number of words or stems.

BKT usually suffers when the number of skills is high.

2. There are few attempts at each skill. A user might not encounter an unusual word many times.

With so few observations per word, the per-skill parameters are hard to estimate reliably.


# My hypothesis

On Amazon Kindle, a reader can look up the definition of a word by pressing and holding it. When I was studying English with a Kindle in 9th grade, one thought came to my mind: I am labelling unknown words by using the dictionary look-up function. We can express this hypothesis with the following equation.

$ 1 - \hat y = 1 - P(Know) = P(Unknown) = P(Clicked) $

This is quite a strong assumption. With it, however, we can use a standard machine learning approach to model the user's knowledge: we simply predict whether the user clicks word $i$ or not, and the observed clicks serve as the ground truth.

$ \hat y = f(x;\theta) $ where x is a vector constructed from various features extracted from user's pagination log. 

$ y = 0 $ if the user looked up word $i$, and $ y = 1 $ otherwise.
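As an illustration of this formulation, the sketch below uses logistic regression as one possible choice of $f(x;\theta)$. This choice, the feature names (word rarity, reading slowdown), and the toy data are all assumptions made for the example; they are not the features or the model actually used in Gorani Reader.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, theta):
    """y_hat = f(x; theta): estimated probability that the user knows the word."""
    return sigmoid(sum(w * v for w, v in zip(theta, x)))


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit theta by batch gradient descent on the mean log loss."""
    theta = [0.0] * len(samples[0])
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for x, y in zip(samples, labels):
            error = predict(x, theta) - y
            for j, v in enumerate(x):
                grad[j] += error * v
        theta = [w - lr * g / len(samples) for w, g in zip(theta, grad)]
    return theta


# Hypothetical features: [bias, word rarity in 0..1, reading slowdown in 0..1].
samples = [
    [1.0, 0.1, 0.0],
    [1.0, 0.2, 0.1],
    [1.0, 0.3, 0.2],
    [1.0, 0.8, 0.7],
    [1.0, 0.9, 0.6],
    [1.0, 0.7, 0.9],
]
labels = [1, 1, 1, 0, 0, 0]  # y = 0 if the word was looked up, 1 otherwise

theta = train(samples, labels)
print(predict([1.0, 0.15, 0.05], theta))  # common word read at normal speed
print(predict([1.0, 0.85, 0.80], theta))  # rare word on a slowly read page
```

Because the label is simply "looked up or not," any binary classifier that outputs a probability could take the place of the logistic function here.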

## Experiment

I conducted an experiment to verify this hypothesis. For a more detailed analysis, I decomposed P(Known) and P(Unknown) by whether the word was clicked.

$ P(Known) = P(Known|Clicked)\cdot P(Clicked) + P(Known|NotClicked)\cdot P(NotClicked)$ 

$ P(Unknown) = P(Unknown|Clicked)\cdot P(Clicked) + P(Unknown|NotClicked)\cdot P(NotClicked)$ 

Notice that when conditional probabilities are 

$ P(Known|Clicked) = 0 $

$ P(Known|NotClicked) = 1 $

$ P(Unknown|Clicked) = 1 $

$ P(Unknown|NotClicked) = 0 $

The objective probabilities become

$ P(Known) = P(NotClicked) $

$ P(Unknown) = P(Clicked) $

which is my hypothesis: the user clicks every unknown word.

I estimated these conditional probabilities and checked how close the estimates are to the hypothetical values (0, 1, 1, 0).

### Steps

1. Teach the students how to use Gorani Reader, especially the dictionary look-up.
2. Let them read a text of about 300 words.
3. Shuffle the words they read to eliminate the bias introduced by the reading context.
4. Let them label the words one by one as known or unknown.
5. Tell them that a word they learned through Gorani Reader during the experiment, and did not know beforehand, should be labelled "unknown."


I collected data from 5 ESL students.

```python
import glob

import pandas as pd

rows = []
all_files = glob.glob("exp1/*.csv")

for filename in all_files:
    # `look` is True when the word was looked up; `label` is known / unknown / noword.
    df = pd.read_csv(filename, header=None, names=["word", "nolook", "look", "label"])
    # P(look): share of words that were / were not looked up.
    look_probs = df.groupby('look').size().div(len(df))
    # P(label | look) = P(label, look) / P(look)
    cond = df.groupby(['label', 'look']).size().div(len(df)).div(look_probs, axis=0, level='look')
    p1 = cond['known'][False]
    p2 = cond['known'][True]
    p3 = cond['unknown'][False]
    p4 = cond['unknown'][True]
    p5 = cond['noword'][False]  # e.g. character names, noise from less sophisticated word splitting
    rows.append([p1, p2, p3, p4, p5])

# One row per file; report the mean and standard deviation across files.
summary = pd.DataFrame(rows, columns=["P(known|noclick)", "P(known|click)", "P(unknown|noclick)", "P(unknown|click)", "P(noword|noclick)"])
print(summary.mean())
print(summary.std())
```

Output (first block: mean across files, second block: standard deviation):

```text
P(known|noclick)      0.967974
P(known|click)        0.206439
P(unknown|noclick)    0.019214
P(unknown|click)      0.793561
P(noword|noclick)     0.012812
dtype: float64
P(known|noclick)      0.022919
P(known|click)        0.037831
P(unknown|noclick)    0.015504
P(unknown|click)      0.037831
P(noword|noclick)     0.008160
dtype: float64
```


From the results we get, after rounding to two decimal places,

$ P(Known) = 0.21 \cdot P(Clicked) + 0.97 \cdot P(NotClicked)$ 

$ P(Unknown) = 0.79 \cdot P(Clicked) + 0.02 \cdot P(NotClicked)$ 

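The estimates are not perfectly accurate, but they are not far from the hypothetical values either. By the central limit theorem, the sample mean approximates the population mean with high probability as the sample grows, so these estimates can describe the average error rate of my model, keeping in mind that five students is a small sample.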
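As a worked example, the two estimated equations can be evaluated for any click probability. The sketch below plugs the mean conditional probabilities from the experiment output into the law of total probability. The `p_clicked` values are hypothetical inputs chosen for illustration, and the function name is an assumption.

```python
# Mean conditional probabilities copied from the experiment output.
P_KNOWN_GIVEN_CLICK = 0.206439
P_KNOWN_GIVEN_NOCLICK = 0.967974
P_UNKNOWN_GIVEN_CLICK = 0.793561
P_UNKNOWN_GIVEN_NOCLICK = 0.019214


def known_unknown(p_clicked):
    """Apply the law of total probability for a given P(Clicked)."""
    p_not_clicked = 1.0 - p_clicked
    p_known = P_KNOWN_GIVEN_CLICK * p_clicked + P_KNOWN_GIVEN_NOCLICK * p_not_clicked
    p_unknown = P_UNKNOWN_GIVEN_CLICK * p_clicked + P_UNKNOWN_GIVEN_NOCLICK * p_not_clicked
    return p_known, p_unknown


# Hypothetical click probabilities, not measured values.
for p_clicked in (0.0, 0.1, 0.5, 1.0):
    p_known, p_unknown = known_unknown(p_clicked)
    print(f"P(Clicked)={p_clicked:.1f}  P(Known)={p_known:.3f}  P(Unknown)={p_unknown:.3f}")
```

The two results do not sum exactly to 1 when the word is not clicked, because a small share of unclicked tokens were labelled `noword`.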

There are some limitations:

1. I assumed that the conditional probabilities are i.i.d. across users. In other words, I assumed that the probability of knowing a word given a click is identically distributed for different users, which is unlikely to be true; e.g. a malicious user might just click randomly.
2. The definition of "known" is not concrete. A user might know one meaning of a word but not another.

For these reasons, this part deserves further investigation.

Nevertheless, this is better than nothing. We have at least confirmed that these 5 users, sampled from a relevant population, are very likely to click unknown words.

# Building ML model

Event logs from Gorani Reader are rather simple. Whenever the user paginates, the words, the look-ups, the time, and the elapsed time are sent as one JSON file (one JSON file per pagination). Various transformations are applied to these pagination logs; you can find their source code in `backend/dataserver/dataserver/job`.
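To give a feel for this kind of transformation, here is a minimal sketch that turns one pagination log into a reading speed and per-word look-up labels. The JSON schema and every field name in it are assumptions made for illustration; they are not the real Gorani Reader log format, and this is not the code in the job directory.

```python
import json

# Hypothetical pagination log: every field name here is an assumption made
# for illustration, not the real Gorani Reader schema.
raw = """
{
  "time": "2019-05-01T10:15:00Z",
  "elapsed_ms": 30000,
  "words": ["the", "fox", "was", "surreptitious", "and", "quick"],
  "lookups": ["surreptitious"]
}
"""


def to_features(log):
    """Turn one pagination log into a reading speed and per-word labels."""
    minutes = log["elapsed_ms"] / 60000
    words_per_minute = len(log["words"]) / minutes
    looked_up = set(log["lookups"])
    # y = 0 if the word was looked up, 1 otherwise.
    labels = [(word, 0 if word in looked_up else 1) for word in log["words"]]
    return words_per_minute, labels


wpm, labels = to_features(json.loads(raw))
print(wpm)
print(labels)
```

A real pipeline would aggregate many such logs per user before building feature vectors.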


# First version

In the first version, I mainly used the following features: reading speed and quiz score. Quiz scores were dropped later, as they did not contribute much to predictive power.

The following code is roughly the model I used.

## Pagination cheating filtering using eye tracking 

Reading speed can easily be manipulated by pagination cheating, such as fast-forwarding or skimming. I thought I could filter out these paginations using eye tracking.

It is known that people's eyes move in a specific pattern when reading, in quick jumps called saccades. I recorded the gaze ratio (the ratio between the left white part and the right white part of each eye), arranged it into a time series, and classified each series as one of the pagination cheating classes or as normal reading.

The following figures describe this ML model.

![a](page_cheat_model.png)

![n](page_cheat_viz.png)

It was a really interesting idea to me, but I eventually abandoned it, because the definition of pagination cheating is highly subjective and there is no ground truth. It could also harm innocent children who just happened to have noise in their gaze data.

# Second version


## Bayesian tree



# Third version


## SunhoDiffVec



# Final version



# Conclusion


