<a href="https://colab.research.google.com/github/shaon11579/VAE-2021-/blob/main/10-07%3A%20simulation_2021_Hasan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- We include a data set in the ML2Pvae package for demonstrative use.
- The data come from a simulated 30-item exam that assesses 3 latent traits. The latent abilities of 5000 students, found in the data frame theta_true, were sampled from N(0, Σ). Here, Σ specifies the correlations between the 3 abilities and is found in the data frame correlation_matrix.
- Discrimination and difficulty parameters were sampled uniformly from [0.25, 1.75] and [−3, 3], respectively, and entries in the Q-matrix were sampled from Bern(0.35). These values can be found in the data frames disc_true, diff_true, and q_matrix. The probability of each student answering each question correctly was calculated with the ML2P model [5]. A response to each item for each student was then sampled from these probabilities. The responses are the main data used for training and are found in the data frame responses.
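
The sampling steps described in the list above can be sketched in NumPy as follows. This is a minimal illustration, not the code that produced the packaged data: the correlation values in `sigma` are hypothetical placeholders, the variable names are chosen for this sketch, and the sign convention for the difficulty term is assumed to match the `Create_data` function in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

num_students, num_items, num_traits = 5000, 30, 3

# Hypothetical correlation matrix (assumption); the packaged
# correlation_matrix data frame holds the values actually used.
sigma = np.array([
    [1.0, 0.3, 0.5],
    [0.3, 1.0, 0.2],
    [0.5, 0.2, 1.0],
])

# Latent abilities theta ~ N(0, Sigma), one row per student.
theta = rng.multivariate_normal(mean=np.zeros(num_traits), cov=sigma, size=num_students)

# Q-matrix entries ~ Bern(0.35); rows are traits, columns are items.
q_matrix = rng.binomial(n=1, p=0.35, size=(num_traits, num_items))

# Discrimination and difficulty parameters, sampled uniformly.
disc = rng.uniform(0.25, 1.75, size=(num_traits, num_items))
diff = rng.uniform(-3.0, 3.0, size=(1, num_items))

# ML2P probability of a correct response: 1 / (1 + exp(-(theta (Q * a) - b))).
logits = theta @ (q_matrix * disc) - diff
prob_correct = 1.0 / (1.0 + np.exp(-logits))

# Sample one binary response for every student-item pair.
responses = rng.binomial(n=1, p=prob_correct)
```

The element-wise product `q_matrix * disc` zeroes out the discrimination of any trait an item does not require, so only the traits listed in the Q-matrix contribute to that item's probability.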

3.1 Data

We ran experiments on four data sets: (i) a simulated data set with 6 latent traits, 50 items, and 20,000 students; (ii) results from the Examination for the Certificate of Proficiency in English (ECPE) (Templin and Hoffman, 2013), a real data set with 3 latent traits, 28 items, and 2,922 students; (iii) a simulated data set with 20 latent traits, 200 items, and 50,000 students; and (iv) a simulated data set with 4 latent traits, 27 items, and 3,000 students. Comparisons with traditional techniques are only possible for (i), (ii), and (iv) because of the large number of latent traits in (iii). True parameter values, for both students and items, are only available for the simulated data.
When simulating data for (i) and (iii), we used Python's SciPy package to generate a symmetric positive definite matrix with 1s on the diagonal (a correlation matrix) and all entries non-negative, so every pair of latent traits had a correlation between 0 and 1. We assumed that each latent trait was mean-centered at 0 and sampled ability vectors to create simulated students. We generated a random Q-matrix where each entry q_ij ∼ Bern(0.2). If every entry in an item's column of the Q-matrix was 0 after sampling from this Bernoulli distribution, one randomly chosen entry in that column was changed to a 1. This ensures that each item measures at least one trait.
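
One way to realize the two constraints in the paragraph above (a non-negative correlation matrix, and a Q-matrix in which every item measures at least one trait) is sketched here. The construction of the correlation matrix is an assumption made for illustration; the text only says that SciPy was used and does not name the routine, so this sketch uses plain NumPy instead. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_traits, num_items, num_students = 6, 50, 20000

# Assumed construction: W has non-negative entries, so W @ W.T is symmetric
# with non-negative entries; adding the identity makes it positive definite.
W = rng.uniform(0.0, 1.0, size=(num_traits, num_traits))
cov = W @ W.T + np.eye(num_traits)

# Rescale to unit diagonal so the result is a correlation matrix.
d = np.sqrt(np.diag(cov))
corr = cov / np.outer(d, d)

# Mean-centered, correlated ability vectors, one row per student.
theta = rng.multivariate_normal(np.zeros(num_traits), corr, size=num_students)

# Q-matrix with q_ij ~ Bern(0.2); rows are traits, columns are items.
Q = rng.binomial(n=1, p=0.2, size=(num_traits, num_items))

# Any item (column) that requires no trait gets one random entry set to 1.
empty_items = np.flatnonzero(Q.sum(axis=0) == 0)
Q[rng.integers(num_traits, size=empty_items.size), empty_items] = 1
```

With 6 traits and Bern(0.2) entries, an item has no required trait with probability 0.8^6 ≈ 0.26, so the repair step is not a corner case.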
Discrimination parameters were sampled from a range so that 0.25 ≤ MDISC_i ≤ 1.75 for each item i, and difficulty parameters were sampled uniformly from [−3, 3]. Finally, response sets for each student were sampled from the ML2P model using these parameters.
For data set (iv) we were more particular in selecting the Q-matrix and correlation matrix. Rather than generating these randomly, each entry in these matrices was chosen manually. Of the 4 skills in the correlation matrix, one is entirely independent of the other three. The other three latent abilities had correlations of 0.25, 0.1, and 0.15 between them. The correlation matrix was chosen in this way so that it is closer to the identity matrix, allowing the ML2P-VAE_ind variation to perform better. The Q-matrix was chosen so that it contained 16 "simple" items (items requiring only one skill), 6 items requiring 2 latent abilities, 4 items requiring 3 latent abilities, and one item requiring all 4 skills. In this way, each of the possible $\binom{4}{k}$ combinations is present in the Q-matrix, for k ∈ {1, 2, 3, 4}.
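
A hand-built Q-matrix and correlation matrix of the kind described for data set (iv) could look like the sketch below. Two details are assumptions not stated in the text: which pair of skills receives which of the three correlations, and that the 16 simple items are split evenly as 4 items per skill.

```python
import itertools
import numpy as np

num_skills = 4

# Correlation matrix: the last skill is independent of the other three.
# The three values come from the text; their placement is an assumption.
corr = np.eye(num_skills)
corr[0, 1] = corr[1, 0] = 0.25
corr[0, 2] = corr[2, 0] = 0.10
corr[1, 2] = corr[2, 1] = 0.15

# Build one Q-matrix column per item, grouped by number of skills required.
columns = []
for k in range(1, num_skills + 1):
    # Assumption: each single skill is repeated 4 times to give 16 simple items.
    repeats = 4 if k == 1 else 1
    for skills in itertools.combinations(range(num_skills), k):
        col = np.zeros(num_skills, dtype=int)
        col[list(skills)] = 1
        columns.extend([col] * repeats)

# Rows are skills, columns are items.
Q = np.column_stack(columns)
```

Enumerating `itertools.combinations` for each k yields 4, 6, 4, and 1 skill subsets, which gives 16 + 6 + 4 + 1 = 27 items once the single-skill columns are repeated.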
[ ]
# a simulated data set with 6 latent traits, 50 items, and 20,000 students
Q_mat,A,B,Theta,data= Create_data(num_students=20000, num_questions=50, num_tests=1, num_skills=6)
[ ]
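# Data set (iii) in the paper has 20 latent traits, 200 items, and 50,000 students;
# this call instead simulates 80,000 students and 250 items with 20 latent traits.
Q_mat, A, B, Theta, data = Create_data(num_students=80000, num_questions=250, num_tests=1, num_skills=20)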
np.shape(data)
(80000, 252)
[ ]


In [5]:

import numpy as np
import pandas as pd

###############################################################################
#Create_data: simulates data for student assessment.
#INPUTS:
    #num_students: (int) # of students taking the assessment
    #num_questions: (int) # of questions in the assessment
    #num_tests: (int) # of times each student takes the test
    #num_skills: (int) # of hidden skills measured by the assessment
#OUTPUTS:
    #Q_mat: (num_skills x num_questions) the expert estimation of which skills pertain to which question
    #A: (num_skills x num_questions) how much a skill affects a question = discrimination params
    #B: (1 x num_questions) difficulty of each question = difficulty params
    #Theta: (num_students x num_skills) the student's hidden knowledge of a subject
    #data: (float32 array) one row per student per test: [student, test #, q1, ..., qnum_questions]
###############################################################################
def Create_data(num_students, num_questions, num_tests, num_skills):
    J = num_skills  # number of hidden skills
    K = num_students  # number of students
    I = num_questions  # number of questions in the assessment

    # Q matrix: expert-prepared matrix of whether item i requires skill j
    Q_mat = np.random.binomial(n=1, p=0.2, size=[J, I])  # each entry ~ Bern(0.2)

    # Discrimination parameters: how important skill j is for item i
    A = np.random.uniform(low=0.25, high=1.75, size=[J, I])

    # Theta: hidden skills for each student, sampled independently from N(0, 1)
    Theta = np.random.normal(loc=0.0, scale=1.0, size=[K, J])
    np.savetxt('Theta.csv', Theta, delimiter=',')

    # B: the difficulty of each question
    B = np.random.uniform(low=-3.0, high=3.0, size=[1, I])

    # Negated ML2P logit (Equation 1 from the paper): -Theta (Q * A) + B
    hidden = -1 * np.dot(Theta, (Q_mat * A)) + B

    def sigmoid(x):
        # Applied to the negated logit, so the result is 1 / (1 + exp(-logit))
        return pow((1 + np.exp(x)), -1)

    prob_answers = sigmoid(hidden)  # the probability a question is answered correctly
    
    data_rows = [] #[student, test #, q1, q2,...,qnum_questions]
    col_names = ['student','test_num']
    for question in range(I):
        col_names.append('Q{}'.format(question+1))
    for student in range(prob_answers.shape[0]):
        for test_num in range(num_tests):
            row = [None]*(num_questions + 2)#[student, test #, q1,q2,...,qnum_questions]
            row[0] = student
            row[1] = test_num
            for question in range(prob_answers.shape[1]):
                row[question+2] = np.random.binomial(n=1,p=prob_answers[student, question], size = None)
            data_rows.append(row)    
            
    data = pd.DataFrame(data = data_rows, columns = col_names)
    
    data = data.values.astype('float32')
    #data.to_csv("/content/q/data1.csv", index=False, header=False)

    return (Q_mat, A, B, Theta, data)
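
The response loop in `Create_data` draws one Bernoulli sample per student per item, which becomes slow for the larger data sets (80,000 students by 250 items is 20 million separate calls). A vectorized alternative is sketched below. `sample_responses` is a hypothetical helper written for this illustration, not part of the notebook or of any package; it assumes the same array shapes and the same sign convention as `Create_data`.

```python
import numpy as np

def sample_responses(Theta, Q_mat, A, B, num_tests=1, rng=None):
    """Hypothetical vectorized stand-in for the response loop in Create_data."""
    rng = np.random.default_rng() if rng is None else rng
    # Same probability as Create_data: 1 / (1 + exp(-Theta (Q * A) + B)).
    prob = 1.0 / (1.0 + np.exp(-(Theta @ (Q_mat * A)) + B))
    num_students = prob.shape[0]
    # Repeat each student's probabilities once per test, then draw all at once.
    answers = rng.binomial(n=1, p=np.repeat(prob, num_tests, axis=0))
    student = np.repeat(np.arange(num_students), num_tests)
    test_num = np.tile(np.arange(num_tests), num_students)
    # Columns: [student, test #, q1, ..., qnum_questions], as in Create_data.
    return np.column_stack([student, test_num, answers]).astype('float32')

# Small usage example with made-up sizes: 100 students, 10 items, 3 skills, 2 tests.
rng = np.random.default_rng(0)
Theta_demo = rng.normal(size=(100, 3))
Q_demo = rng.binomial(n=1, p=0.2, size=(3, 10))
A_demo = rng.uniform(0.25, 1.75, size=(3, 10))
B_demo = rng.uniform(-3.0, 3.0, size=(1, 10))
data_demo = sample_responses(Theta_demo, Q_demo, A_demo, B_demo, num_tests=2, rng=rng)
```

The row order matches the nested loop in `Create_data` (all tests for student 0, then all tests for student 1, and so on), so the result can be passed to the same downstream cells.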





In [10]:
Q_mat,A,B,Theta,data= Create_data(num_students=60000, num_questions=250, num_tests=1, num_skills=25)

  

  
  


In [3]:
Q_mat,A,B,Theta,data= Create_data(num_students=5000, num_questions=30, num_tests=1, num_skills=7)

  


In [11]:
data
np.shape(data)

(5000, 32)

In [5]:

np.shape(A)

(7, 30)

In [12]:
#create response data for 5000 students 
data=pd.DataFrame(data)
data.head()
data.to_csv('/content/q/response.csv')

In [9]:
#qmatrix 30 items 7 latent traits 
data=pd.DataFrame(Q_mat)
data.head()
data.to_csv('/content/q/data.csv')

In [13]:
#Discrimination parameters: how important is skill j for item i 
#dis_true: A
data=pd.DataFrame(A)
data.head()
data.to_csv('/content/q/dis_true.csv')

In [14]:
#diff_true

data=pd.DataFrame(B)
data.head()
data.to_csv('/content/q/diff_true.csv')

In [15]:
#theta true 
data=pd.DataFrame(Theta)
data.head()
data.to_csv('/content/q/theta.csv')

# 14 latent traits 

In [24]:
Q_mat,A,B,Theta,data= Create_data(num_students=5000, num_questions=30, num_tests=1, num_skills=14)

In [31]:
#data shape check 

np.shape(data)




(5000, 32)

In [33]:
#create response data for 5000 students 
data=pd.DataFrame(data)
data.head()
data.to_csv('/content/q/response14.csv')

In [32]:
np.shape(A)


(14, 30)

In [34]:
#Discrimination parameters: how important is skill j for item i 
#dis_true: A for 14 lt 
data=pd.DataFrame(A)
data.head()
data.to_csv('/content/q/dis_true14.csv')

In [27]:
np.shape(B)


(1, 30)

In [35]:
#diff_true for 14 lt 

data=pd.DataFrame(B)
data.head()
data.to_csv('/content/q/diff_true14.csv')

In [28]:
np.shape(Q_mat)


(14, 30)

In [36]:
#qmatrix 30 items 14 latent traits 
data=pd.DataFrame(Q_mat)
data.head()
data.to_csv('/content/q/q14.csv')

In [29]:
np.shape(Theta)

(5000, 14)

In [37]:
#theta true for 14 lt 
data=pd.DataFrame(Theta)
data.head()
data.to_csv('/content/q/theta14.csv')

In [None]:
# save results 14 items 

data.to_csv('/content/q/diff_true.csv')

/content/q/14


In [None]:
#create a csv file
data.to_csv("/content/q/data.csv", index=False, header=False)

In [None]:
import csv

In [None]:
#3 latent traits 

In [None]:
# 7 latent trait 

In [None]:
# 14 latent trait 

