# Sparse_NN_Methods 
(Last updated Feb 16th, 2017) - Qi Pan

## Table of Contents
**1. Sparse Data Representations**
* 1.1 Data Overview
* 1.2 X_correct and X_views
* 1.3 X_correct_latest and X_views_latest

**2. Baseline methods**
* 2.1 Global Average
* 2.2 Within Col (Step) Average 
* 2.3 Within Row (Student) Average

**3. Sparse_NN_Methods**
* 3.1 Cosine Similarity NN with {0, 1} encoding
* 3.2 Cosine Similarity NN with {0, 1} encoding + no self similarity
* 3.3 Cosine Similarity NN with {-1, 0, 1} encoding

**4. Test Results**
* 4.1 Accuracy
* 4.2 Positive and Negative Predictive Value
* 4.3 True Positive and Negative Rates

## 1.) Sparse Data Representations
### 1.1) Data Overview

The dataset is from an Intelligent Tutoring System called Bridge to Algebra and captures student responses from the 2008-2009 school year. The analysis below focuses on a subset of the full dataset that contains:

In [4]:
import pandas as pd

#load the dataset
single_KC_data = pd.read_pickle('/Users/qipanda/Documents/2016-2017_KDD_Thesis/education_data'+\
    '/bridge_to_algebra_2008_2009/bridge_0809_KC.pkl')

#Number of students, problems, steps, problem-step combinations
print('number of unique students = {}'.format(single_KC_data['Anon Student Id'].nunique()))
print('number of unique problems = {}'.format(single_KC_data['Problem Name'].nunique()))
print('number of unique steps = {}'.format(single_KC_data['Step Name'].nunique()))
print('number of unique problem-step combinations = {}'.\
    format(single_KC_data.groupby(['Problem Name', 'Step Name']).size().shape[0]))

number of unique students = 4593
number of unique problems = 909
number of unique steps = 113
number of unique problem-step combinations = 30043
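
The counts above show far more problem-step combinations than unique step names, which is what one would expect if the same step name is reused across different problems. The toy example below is a minimal sketch of that counting logic. The data is made up, and only the column names are taken from the cell above.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    'Anon Student Id': ['s1', 's1', 's2', 's2', 's2'],
    'Problem Name':    ['p1', 'p1', 'p1', 'p2', 'p2'],
    'Step Name':       ['a',  'b',  'a',  'a',  'a'],
})

# Unique step names, ignoring which problem they belong to
n_steps = toy['Step Name'].nunique()

# Unique (problem, step) pairs: step 'a' is counted once per problem
n_combos = toy.groupby(['Problem Name', 'Step Name']).ngroups

# Equivalent count without a groupby
n_combos_alt = len(toy[['Problem Name', 'Step Name']].drop_duplicates())

print('unique steps = {}'.format(n_steps))
print('unique problem-step combinations = {}'.format(n_combos))
```

Here step `a` appears in both `p1` and `p2`, so it contributes one unique step but two problem-step combinations.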


### 1.2) X_correct and X_views

The dataset is represented mathematically by 3D matrices defined as follows:

\begin{equation}
    X_{correct} \in \mathbb{R}^{m \times n \times t}\text{, } 
    X_{views} \in \mathbb{R}^{m \times n \times t}
\end{equation}

\begin{equation}
    \text{where $m =$ unique students, $n =$ unique problem-step combinations, and $t =$ the discrete time interval}
\end{equation}

\begin{equation}
    X_{correct}=     
    \begin{cases}
        0, & \text{if student $i$ $[$is wrong about OR did not do$]$ problem-step $j$ at time $t$}\\
        1, & \text{if student $i$ got problem-step $j$ correct at time $t$}
    \end{cases}
\end{equation}

\begin{equation}
    X_{views}=\text{number of times student $i$ has seen problem-step $j$ at current time $t$}
\end{equation}

In Python, these 3D matrices are represented by lists of 2D sparse matrices where list indices represent $t$:

In [12]:
import DataManipulators
import loading_util as lu
import numpy as np

#load the data
dm = lu.load_pickle('test_preprocessed_DM', 'saved_ftrs')

#list indices represent t, each member is a 2D sparse coo matrix
print(type(dm.sparse_ftrs['X_correct'][0]))
print(type(dm.sparse_ftrs['X_views'][0]))

Loading pickle from saved_ftrs/test_preprocessed_DM.pickle
<class 'scipy.sparse.coo.coo_matrix'>
<class 'scipy.sparse.coo.coo_matrix'>
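
To make the list-of-sparse-matrices layout concrete, here is a minimal sketch that builds toy versions of `X_correct` and `X_views` from a small event log. The event log, its tuple layout, and the sizes are hypothetical. It also assumes that `X_views` stores the running view count only at the entry attempted at time `t`, which is one reading of the definition above and not necessarily how `DataManipulators` builds it.

```python
import numpy as np
from scipy import sparse

m, n, T = 2, 3, 3  # students, problem-step combinations, time steps (toy sizes)

# Hypothetical event log: (t, student i, problem-step j, correct)
events = [
    (0, 0, 0, 0),  # student 0 gets problem-step 0 wrong
    (0, 1, 2, 1),
    (1, 0, 0, 1),  # student 0 retries problem-step 0 and gets it right
    (1, 1, 1, 0),
    (2, 0, 1, 1),
]

seen = np.zeros((m, n), dtype=int)  # running view count per (student, problem-step)
X_correct, X_views = [], []
for t in range(T):
    rows, cols, correct, views = [], [], [], []
    for et, i, j, c in events:
        if et != t:
            continue
        seen[i, j] += 1
        rows.append(i)
        cols.append(j)
        correct.append(c)
        views.append(seen[i, j])
    # One 2D sparse matrix per time step; the list index plays the role of t
    X_correct.append(sparse.coo_matrix((correct, (rows, cols)), shape=(m, n)))
    X_views.append(sparse.coo_matrix((views, (rows, cols)), shape=(m, n)))

print(X_correct[1].toarray())
print(X_views[1].toarray())
```

Note that a 0 in `X_correct` is ambiguous by definition (wrong or not attempted), while a nonzero entry in `X_views` marks that an attempt happened at that time step.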


### 1.3) X_correct_latest and X_views_latest

Because students can attempt a problem-step combination more than once, their latest "correctness" on a given problem-step may change over time. $X_{correct\_latest}$ stores, at each time $t$, the most recent "correctness" for every problem-step the student has attempted up to time $t$. Similarly, whereas $X_{views}$ only contains the views for the problem-step executed at time $t$, $X_{views\_latest}$ holds the total views a student has had on every problem-step attempted up to time $t$. The mathematical definitions are as follows:

\begin{equation}
    X_{correct\_latest} \in \mathbb{R}^{m \times n \times t}\text{, } 
    X_{views\_latest} \in \mathbb{R}^{m \times n \times t}
\end{equation}

\begin{equation}
    \text{where $m =$ unique students, $n =$ unique problem-step combinations, and $t =$ the discrete time interval}
\end{equation}

\begin{equation}
    X_{correct\_latest}=     
    \begin{cases}
        0, & \text{if student $i$ $[$was wrong on their last attempt of OR did not do$]$ problem-step $j$ up to time $t$}\\
        1, & \text{if student $i$ got problem-step $j$ correct on their last attempt up to time $t$}
    \end{cases}
\end{equation}

\begin{equation}
    X_{views\_latest}=\text{number of times student $i$ has attempted problem-step $j$ up to time $t$}
\end{equation}

In Python, these 3D matrices are represented by lists of 2D sparse matrices where list indices represent $t$:

In [12]:
import DataManipulators
import loading_util as lu
import numpy as np

#load the data
dm = lu.load_pickle('test_preprocessed_DM', 'saved_ftrs')

#list indices represent t, each member is a 2D sparse coo matrix
print(type(dm.sparse_ftrs['X_correct'][0]))
print(type(dm.sparse_ftrs['X_views'][0]))

Loading pickle from saved_ftrs/test_preprocessed_DM.pickle
<class 'scipy.sparse.coo.coo_matrix'>
<class 'scipy.sparse.coo.coo_matrix'>
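
The "latest" matrices can be derived from the per-time-step ones by carrying values forward in time. The sketch below illustrates this on toy data. The function name `latest_features` is hypothetical, and the code assumes that a nonzero entry in `X_views[t]` marks an attempt at time `t` and holds the running view count. It is not the implementation used in `DataManipulators`. Dense arrays are used only for readability; at the real data size a sparse-native update would likely be needed.

```python
import numpy as np
from scipy import sparse

def latest_features(X_correct, X_views):
    """Carry correctness and view counts forward through time (illustrative only)."""
    m, n = X_correct[0].shape
    correct_latest = np.zeros((m, n), dtype=int)
    views_latest = np.zeros((m, n), dtype=int)
    out_correct, out_views = [], []
    for Xc, Xv in zip(X_correct, X_views):
        Xc, Xv = Xc.toarray(), Xv.toarray()
        attempted = Xv > 0  # assumption: views > 0 means attempted at this t
        # Overwrite only the entries attempted at this time step
        correct_latest[attempted] = Xc[attempted]
        views_latest[attempted] = Xv[attempted]
        # coo_matrix copies the nonzero entries, so later updates do not leak back
        out_correct.append(sparse.coo_matrix(correct_latest))
        out_views.append(sparse.coo_matrix(views_latest))
    return out_correct, out_views

# Toy per-time-step inputs for 2 students, 3 problem-steps, 3 time steps
X_correct = [sparse.coo_matrix(np.array(a)) for a in (
    [[0, 0, 0], [0, 0, 1]],
    [[1, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 0]],
)]
X_views = [sparse.coo_matrix(np.array(a)) for a in (
    [[1, 0, 0], [0, 0, 1]],
    [[2, 0, 0], [0, 1, 0]],
    [[0, 1, 0], [0, 0, 0]],
)]

X_correct_latest, X_views_latest = latest_features(X_correct, X_views)
print(X_correct_latest[-1].toarray())
print(X_views_latest[-1].toarray())
```

In this toy run, student 0 gets problem-step 0 wrong at the first time step and right at the second, so the latest correctness for that entry flips from 0 to 1 while the view count rises to 2.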


In [None]:
#so that plots show inline for this notebook
%matplotlib notebook