# Mini LA Assignment 2 - Knowledge Inference
#### **Authors**: Yutong Shen, Jingfei Chen, Yiran Wang

**Problem Statement:** 
- Can you predict whether a particular student will be able to answer a question correctly at a given stage of their learning?


**Tasks:**

- Overview of the purpose of the project, the problem statement, and some related work if appropriate to the question.

- Model implementation (i.e., parameters, fitting procedure, and model performance)

- A brief interpretation of the results regarding the question.

- A brief discussion on your insights, challenges, and/or lessons learned.

## Problem & Purpose
In this assignment, we aim to predict student performance on problems from student interactions with Intelligent Tutoring Systems (ITS), using an Assistments dataset. It is worthwhile to assess knowledge acquisition by evaluating students' adaptability to different learning materials and activities. The purpose is to inform the design of a better ITS and, in turn, help students improve their learning outcomes.


## Literature Review
We found an empirical article closely related to knowledge inference and Intelligent Tutoring Systems. Ramírez-Noriega, Juárez-Ramírez, & Martínez-Ramírez (2017) discussed how a Bayesian network helped reinforce weak topics within an ITS and improved the accuracy of diagnosing students' knowledge. Their process for building a Bayesian network, and their explanation of how to increase the accuracy of knowledge inference (including the use of different variables), served as a useful reference for our assignment.

### 1. Import Data

In [None]:
import numpy as np
import pandas as pd

In [None]:
# read the data
data = pd.read_csv('KI_dataset.csv')
data

Unnamed: 0,ID,Lesson,Student,KC,item,right,firstattempt,time
0,0,Splot,AGUFADE,VALUING-CAT-FEATURES,META-VALUING-CAT-FEATURES-1,1,1,3.297
1,1,Splot,AGUFADE,VALUING-NUM-FEATURES,META-VALUING-NUM-FEATURES-1,0,1,4.047
2,2,Splot,AGUFADE,CHOOSE-VAR-TYPE,CHOOSE-VAR-TYPE-NUM-1,1,1,1.593
3,3,Splot,AGUFADE,VALUING-NUM-FEATURES,META-VALUING-NUM-FEATURES-1,0,0,2.922
4,4,Splot,AGUFADE,CHOOSE-VAR-TYPE,CHOOSE-VAR-TYPE-NUM-2,1,1,1.594
...,...,...,...,...,...,...,...,...
124365,128869,MidZSchoolZProbabilit,SKRDBGE,GREATEST-CF,META-GREATEST-CF-33,1,1,1.920
124366,128870,MidZSchoolZProbabilit,SKRDBGE,GREATEST-CF,META-GREATEST-CF-34,1,1,3.790
124367,128871,MidZSchoolZProbabilit,SKRDBGE,GREATEST-CF,META-GREATEST-CF-35,0,1,2.800
124368,128872,MidZSchoolZProbabilit,SKRDBGE,GREATEST-CF,META-GREATEST-CF-35,0,0,1.980


### 2. pyBKT Model

In [None]:
# Install pyBKT from pip!
!pip install pyBKT

# Import all required packages including pyBKT.models.Model
from pyBKT.models import Model
import matplotlib.pyplot as plt

Collecting pyBKT
  Downloading pyBKT-1.4.tar.gz (32.7 MB)
Building wheels for collected packages: pyBKT
  Building wheel for pyBKT (setup.py) ... done
  Created wheel for pyBKT: filename=pyBKT-1.4-cp37-cp37m-linux_x86_64.whl size=1024681 sha256=e4a60ff41768724845c232ff6b7d0b2c45d8f3ed3cc893a7a7c1b9cde0e662c4
  Stored in directory: /root/.cache/pip/wheels/d4/bb/83/0fe92b544252ddb34ad6bf4fd2659abd64140612b2d418cd07
Successfully built pyBKT
Installing collected packages: pyBKT
Successfully installed pyBKT-1.4


### 3. Training Models & Model Evaluation

**3.1. Model 1**

In [None]:
# Initialize model with seed so we can consistently replicate the results and avoid as much randomness as possible
model = Model(seed = 17, num_fits = 1)

In [None]:
# Specify the columns corresponding to each required column
# In this case, the user ID that pyBKT expects is specified by the column ID in the dataset, 
# the skill_name is specified by a column KC 
# and the correctness is specified by the right column in the dataset.
defaults = {'user_id': 'ID', 'skill_name': 'KC', 'correct': 'right'}
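Before fitting, it can help to check which column actually identifies a learner, since BKT tracks a sequence of attempts per user. The sketch below uses only the five preview rows shown in Section 1, hard-coded here as an assumption, and a hypothetical helper `rows_per_value` that counts how many rows share each value of a column. In that preview every `ID` value occurs once while `Student` repeats, which suggests `ID` may be a row index. If that holds for the full dataset, mapping `user_id` to `Student` might be worth trying; the results reported in this notebook were obtained with `ID`.

```python
from collections import Counter

# First five rows of the preview in Section 1: (ID, Student, KC, right).
preview = [
    (0, "AGUFADE", "VALUING-CAT-FEATURES", 1),
    (1, "AGUFADE", "VALUING-NUM-FEATURES", 0),
    (2, "AGUFADE", "CHOOSE-VAR-TYPE", 1),
    (3, "AGUFADE", "VALUING-NUM-FEATURES", 0),
    (4, "AGUFADE", "CHOOSE-VAR-TYPE", 1),
]


def rows_per_value(rows, column_index):
    """Hypothetical helper: count how many rows share each value of one column."""
    return Counter(row[column_index] for row in rows)


id_counts = rows_per_value(preview, 0)
student_counts = rows_per_value(preview, 1)

# A column that works as a learner ID should repeat across that learner's attempts.
print("max rows per ID:", max(id_counts.values()))
print("max rows per Student:", max(student_counts.values()))
```

On the full dataset the same check could be run with `data['ID'].value_counts()` and `data['Student'].value_counts()`.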

In [None]:
uniqueKC = data['KC'].unique()

In [None]:
# Train a simple BKT model on all skills in 'KC'
model.fit(data=data, skills = uniqueKC, defaults=defaults)

In [None]:
# Evaluate the RMSE of the model on the training data.
# The default evaluate metric is RMSE.
training_rmse = model.evaluate(data = data)

# Evaluate the AUC of the model on the training data. 
training_auc = model.evaluate(data = data, metric = 'auc')

In [None]:
# Print the RMSE and AUC of the model
print('RMSE of the model is: ', training_rmse)
print('AUC of the model is: ', training_auc)

RMSE of the model is:  0.42690204307284446
AUC of the model is:  0.7595121642568298
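To make the two metrics concrete, here is a minimal sketch of their textbook definitions, which we assume correspond to what `evaluate` reports. RMSE measures the average distance between the predicted probability and the observed 0/1 response; AUC is the probability that a randomly chosen correct response receives a higher predicted probability than a randomly chosen incorrect one. The labels and predictions below are made up for illustration and are not taken from the dataset.

```python
import math


def rmse(y_true, y_prob):
    """Root mean squared error between 0/1 outcomes and predicted probabilities."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_prob)]
    return math.sqrt(sum(squared) / len(squared))


def auc(y_true, y_prob):
    """Pairwise AUC: share of (correct, incorrect) pairs ranked in the right order.

    Ties count as half a win. This O(n^2) form is only meant for small examples.
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data, for illustration only.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.2]

toy_rmse = rmse(y_true, y_prob)
toy_auc = auc(y_true, y_prob)
print("RMSE:", round(toy_rmse, 4))
print("AUC:", round(toy_auc, 4))
```

Lower RMSE and higher AUC both indicate better predictions, which is how the two models are compared in the Discussion.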


In [None]:
# View the trained parameters
print(model.params())

                                       value
skill                param   class          
VALUING-CAT-FEATURES prior   default 0.63987
                     learns  default 1.00000
                     guesses default 0.22735
                     slips   default 0.05363
                     forgets default 0.00000
...                                      ...
ENTER-GCD            prior   default 0.86753
                     learns  default 1.00000
                     guesses default 0.47316
                     slips   default 0.05395
                     forgets default 0.00000

[330 rows x 1 columns]


The fitted parameters correspond to the standard BKT quantities: `prior` is P(L0), the probability the skill is already known before the first opportunity to use it in problem solving; `learns` is P(T), the probability the skill will be learned at each opportunity to use it; `guesses` is P(G), the probability the student will guess correctly if the skill is not known; `slips` is P(S), the probability the student will slip (make a mistake) if the skill is known; and `forgets` is the probability that a known skill is forgotten, which is 0 in the rows shown. Writing P(Ln) for the probability that the skill is known at opportunity n (starting from P(L0) and updated after each response), the formula P(Correct) = P(Ln)\*(1-P(S)) + (1-P(Ln))\*P(G) gives the probability that a particular student will answer a question correctly at a given stage of their learning.
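The formula above, together with the standard BKT update, can be sketched in a few lines. This is a plain-Python illustration of the textbook equations under the assumption that they match the model fitted by pyBKT; the function names `p_correct` and `update_known` are hypothetical. The parameter values are the fitted ones printed above for VALUING-CAT-FEATURES.

```python
def p_correct(p_known, guess, slip):
    """P(Correct) = P(Ln) * (1 - P(S)) + (1 - P(Ln)) * P(G)."""
    return p_known * (1 - slip) + (1 - p_known) * guess


def update_known(p_known, correct, learn, guess, slip, forget=0.0):
    """One BKT step: condition on the observed answer, then apply learning and forgetting."""
    if correct:
        posterior = p_known * (1 - slip) / p_correct(p_known, guess, slip)
    else:
        posterior = p_known * slip / (1 - p_correct(p_known, guess, slip))
    return posterior * (1 - forget) + (1 - posterior) * learn


# Fitted values for VALUING-CAT-FEATURES from the parameter table above.
prior, learn, guess, slip, forget = 0.63987, 1.0, 0.22735, 0.05363, 0.0

# Prediction at the first opportunity uses the prior as P(Ln).
p_first = p_correct(prior, guess, slip)

# Suppose the first answer was wrong; update the knowledge estimate and predict again.
p_known_next = update_known(prior, False, learn, guess, slip, forget)
p_second = p_correct(p_known_next, guess, slip)

print(round(p_first, 4), round(p_known_next, 4), round(p_second, 4))
```

Note that with `learns` equal to 1, as in the fitted table, the knowledge estimate moves to 1 after a single opportunity regardless of the response, so every later prediction equals 1 - P(S).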

**3.2. Model 2**

In [None]:
# Specify the columns corresponding to each required column
# In this case, the user ID that pyBKT expects is specified by the column ID in the dataset, 
# the skill_name is specified by a column item 
# and the correctness is specified by the right column in the dataset.
defaults = {'user_id': 'ID', 'skill_name': 'item', 'correct': 'right'}

# Initialize model with seed
model2 = Model(seed = 17, num_fits = 1)

# Train a simple BKT model on all skills in 'item'
model2.fit(data=data, skills = data['item'].unique(), defaults=defaults)

# Evaluate the RMSE of the model on the training data
training_rmse2 = model2.evaluate(data = data)

# Evaluate the AUC of the model on the training data
training_auc2 = model2.evaluate(data = data, metric = 'auc')

# Print the RMSE and AUC of the model
print('RMSE of the model is: ', training_rmse2)
print('AUC of the model is: ', training_auc2)

RMSE of the model is:  0.3859965524166319
AUC of the model is:  0.8432145757737171


In [None]:
# View the trained parameters
print(model2.params())

                                                      value
skill                               param   class          
META-VALUING-CAT-FEATURES-1         prior   default 0.56594
                                    learns  default 1.00000
                                    guesses default 0.65748
                                    slips   default 0.00768
                                    forgets default 0.00000
...                                                     ...
ENTER-REDUCED-PROBABILITY-SHAPES-23 prior   default 0.89533
                                    learns  default 1.00000
                                    guesses default 1.00000
                                    slips   default 0.00000
                                    forgets default 0.00000

[10440 rows x 1 columns]


The parameters have the same interpretation as for Model 1, except that each item is now treated as its own skill, so the prior, learns, guesses, slips, and forgets values are estimated per item rather than per KC. The same formula, P(Correct) = P(Ln)\*(1-P(S)) + (1-P(Ln))\*P(G), then gives the probability that a particular student will answer a given item correctly at a given stage of their learning.

### 4. Model Cross-Validation

In [None]:
defaults = {'user_id': 'ID', 'skill_name': 'KC', 'correct': 'right'}

In [None]:
# Crossvalidate with 5 folds on all skills in 'KC'
crossvalidated_1 = model.crossvalidate(defaults = defaults, data = data, skills = data['KC'].unique(),
                                              folds = 5)

In [None]:
print(crossvalidated_1)

                                        rmse
skill                                       
VALUING-CAT-FEATURES                 0.46376
VALUING-NUM-FEATURES                 0.43130
CHOOSE-VAR-TYPE                      0.22230
CHOOSE-OK-SPLOT                      0.30007
CHOOSE-OK-BG                         0.28505
...                                      ...
MODEL-CUBE-PERPENDICULAR-EDGE-LENGTH 0.49735
MODEL-IDENTIFY-CUBE-PRISM-FACE       0.31649
COMPLETED-TOOL-CELL                      NaN
GREATEST-CF                          0.47473
ENTER-GCD                            0.31841

[66 rows x 1 columns]
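For readers unfamiliar with k-fold cross-validation, the idea can be sketched as follows: the users are split into `folds` groups, and each group is held out once for testing while the model is fitted on the rest. This is a generic illustration, not pyBKT's internal procedure; the function `k_fold_splits` and the learner IDs are hypothetical.

```python
import random


def k_fold_splits(user_ids, folds=5, seed=17):
    """Yield (train_users, test_users) pairs; each user lands in exactly one test fold."""
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)
    for k in range(folds):
        test = set(users[k::folds])   # every folds-th user, starting at offset k
        train = set(users) - test
        yield train, test


# Hypothetical learner IDs, for illustration only.
toy_users = [f"S{i}" for i in range(10)]
splits = list(k_fold_splits(toy_users, folds=5))

for train, test in splits:
    print(len(train), "train users,", len(test), "test users")
```

The metric is computed on each held-out group and then averaged, so it reflects performance on users the model did not see during fitting.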


In [None]:
# Explore the model accuracy for one particular skill (VALUING-CAT-FEATURES) using cross-validation
skill = 'VALUING-CAT-FEATURES'
metric = 'rmse'

simple_cv = model.crossvalidate(defaults = defaults, data = data, skills = skill, 
                                metric = metric)
simple_cv

                         rmse
skill                        
VALUING-CAT-FEATURES  0.46376


In [None]:
defaults = {'user_id': 'ID', 'skill_name': 'item', 'correct': 'right'}

In [None]:
# Crossvalidate with 5 folds on all skills in 'item'
crossvalidated_2 = model2.crossvalidate(defaults = defaults, data = data, skills = data['item'].unique(),
                                              folds = 5)

In [None]:
print(crossvalidated_2)

                                       rmse
skill                                      
META-VALUING-CAT-FEATURES-1         0.35913
META-VALUING-NUM-FEATURES-1         0.45323
CHOOSE-VAR-TYPE-NUM-1               0.07226
CHOOSE-VAR-TYPE-NUM-2               0.15043
META-VALUING-NUM-FEATURES-2         0.44824
...                                     ...
ENTER-REDUCED-PROBABILITY-SHAPES-19     NaN
ENTER-REDUCED-PROBABILITY-SHAPES-20     NaN
ENTER-REDUCED-PROBABILITY-SHAPES-21     NaN
ENTER-REDUCED-PROBABILITY-SHAPES-22     NaN
ENTER-REDUCED-PROBABILITY-SHAPES-23     NaN

[2088 rows x 1 columns]
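Because several items return NaN, any summary of the per-item RMSE has to decide how to treat missing values. The sketch below shows one option, averaging only the non-missing entries and reporting how many were skipped. It uses just the ten values visible in the printed output above (the remaining rows are elided there), so the resulting mean is illustrative and is not a summary of the full cross-validation.

```python
import math

nan = float("nan")

# The ten per-item RMSE values visible in the printed output above.
shown_rmse = {
    "META-VALUING-CAT-FEATURES-1": 0.35913,
    "META-VALUING-NUM-FEATURES-1": 0.45323,
    "CHOOSE-VAR-TYPE-NUM-1": 0.07226,
    "CHOOSE-VAR-TYPE-NUM-2": 0.15043,
    "META-VALUING-NUM-FEATURES-2": 0.44824,
    "ENTER-REDUCED-PROBABILITY-SHAPES-19": nan,
    "ENTER-REDUCED-PROBABILITY-SHAPES-20": nan,
    "ENTER-REDUCED-PROBABILITY-SHAPES-21": nan,
    "ENTER-REDUCED-PROBABILITY-SHAPES-22": nan,
    "ENTER-REDUCED-PROBABILITY-SHAPES-23": nan,
}

# Keep only the items for which a cross-validated RMSE was produced.
valid = [v for v in shown_rmse.values() if not math.isnan(v)]
n_missing = len(shown_rmse) - len(valid)
mean_rmse = sum(valid) / len(valid)

print("items with RMSE:", len(valid))
print("items with NaN:", n_missing)
print("mean RMSE over non-missing items:", round(mean_rmse, 5))
```

With the full result in a DataFrame, `crossvalidated_2['rmse'].mean()` and `crossvalidated_2['rmse'].isna().sum()` would give the same two quantities, since pandas skips NaN in `mean` by default.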


In [None]:
# Explore the model accuracy for one particular skill (META-VALUING-CAT-FEATURES-1) using cross-validation
skill = 'META-VALUING-CAT-FEATURES-1'
metric = 'rmse'

simple_cv2 = model2.crossvalidate(defaults = defaults, data = data, skills = skill, 
                                metric = metric)
simple_cv2

                                rmse
skill                               
META-VALUING-CAT-FEATURES-1  0.36012


## Discussion

In this assignment, our team created and trained two simple BKT models. The first model was trained on all skills in the KC column of the dataset. The second model treated each entry in the item column as a skill. Since the dataset's column names differ from those pyBKT expects, we specified a mapping from the dataset's column names to the expected pyBKT columns, referred to as the model defaults (i.e., the column names to look up in the dataset). On the training data, the first model had an RMSE of 0.427 and an AUC of 0.760, while the second model had an RMSE of 0.386 and an AUC of 0.843, so the second model fit the training data better than the first. Cross-validation presented a similar pattern for the skills we inspected: the cross-validated RMSE was about 0.360 for the item META-VALUING-CAT-FEATURES-1 under model 2, compared with 0.464 for the KC VALUING-CAT-FEATURES under model 1. A number of skills, particularly at the item level, returned NaN in cross-validation.

From the model parameters, we obtain the values of P(L0), P(T), P(G), and P(S). Starting from P(L0), the knowledge estimate P(Ln) is updated after each response, and the formula P(Correct) = P(Ln)\*(1-P(S)) + (1-P(Ln))\*P(G) then gives the probability that a particular student will answer a question correctly at a given stage of their learning.

## Reference
Ramírez-Noriega, A., Juárez-Ramírez, R., & Martínez-Ramírez, Y. (2017). Evaluation module based on Bayesian networks to Intelligent Tutoring Systems. International Journal of Information Management, 37(1), 1488-1498.