## METADATA for the Spaced Repetition Dataset
* p_recall - proportion of exercises from this lesson/practice where the word/lexeme was correctly recalled
* timestamp - UNIX timestamp of the current lesson/practice
* delta - time (in seconds) since the last lesson/practice that included this word/lexeme
* user_id - student user ID who did the lesson/practice (anonymized)
* learning_language - language being learned
* ui_language - user interface language (presumably native to the student)
* lexeme_id - system ID for the lexeme tag (i.e., word)
* lexeme_string - lexeme tag (see below)
* history_seen - total times user has seen the word/lexeme prior to this lesson/practice
* history_correct - total times user has been correct for the word/lexeme prior to this lesson/practice
* session_seen - times the user saw the word/lexeme during this lesson/practice
* session_correct - times the user got the word/lexeme correct during this lesson/practice
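
As a quick sanity check on these definitions, the sketch below recomputes `p_recall` from the session counts and converts `delta` and `timestamp` into more readable units. The values are copied from the first two rows of the dataset preview in this notebook. Treating the timestamp as UTC and the helper name `describe` are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Values copied from the first two rows of the dataset preview in this notebook.
rows = [
    {"p_recall": 1.0, "timestamp": 1362076081, "delta": 27649635,
     "session_seen": 2, "session_correct": 2},
    {"p_recall": 0.5, "timestamp": 1362076081, "delta": 27649635,
     "session_seen": 2, "session_correct": 1},
]

SECONDS_PER_DAY = 60 * 60 * 24


def describe(row):
    """Hypothetical helper: derive readable quantities from one record."""
    return {
        # Proportion recalled in this session, recomputed from the counts.
        "recomputed_p_recall": row["session_correct"] / row["session_seen"],
        # delta is stored in seconds; days are easier to read.
        "delta_days": row["delta"] / SECONDS_PER_DAY,
        # Assumption: the UNIX timestamp is interpreted as UTC.
        "when": datetime.fromtimestamp(row["timestamp"], tz=timezone.utc),
    }


summaries = [describe(row) for row in rows]
for row, summary in zip(rows, summaries):
    print(row["p_recall"], summary["recomputed_p_recall"],
          round(summary["delta_days"], 1), summary["when"].isoformat())
```

For these two rows the recomputed proportion matches the stored `p_recall`, which suggests (but does not prove for the whole dataset) that `p_recall` is `session_correct / session_seen`.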

In [1]:
#installing packages 
!pip install plotly



In [2]:
pip install path

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install fastparquet

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install parquet 

Note: you may need to restart the kernel to use updated packages.


In [5]:
#importing libraries
from pathlib import Path
import pandas as pd
import numpy as np
import zipfile
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
import gzip
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from zipfile import ZipFile 
%matplotlib inline

In [6]:
# Load packages
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    RocCurveDisplay,
)
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn import utils

## Data Acquisition 

In [8]:
## data acquisition 
## SPACED REPETITION DATA from Duolingo Research 
dataframe = pd.read_csv("/Users/zeinebouerghi/Downloads/duolingospacedrd.csv.gz")
dataframe.head()

Unnamed: 0,p_recall,timestamp,delta,user_id,learning_language,ui_language,lexeme_id,lexeme_string,history_seen,history_correct,session_seen,session_correct
0,1.0,1362076081,27649635,u:FO,de,en,76390c1350a8dac31186187e2fe1e178,lernt/lernen<vblex><pri><p3><sg>,6,4,2,2
1,0.5,1362076081,27649635,u:FO,de,en,7dfd7086f3671685e2cf1c1da72796d7,die/die<det><def><f><sg><nom>,4,4,2,1
2,1.0,1362076081,27649635,u:FO,de,en,35a54c25a2cda8127343f6a82e6f6b7d,mann/mann<n><m><sg><nom>,5,4,1,1
3,0.5,1362076081,27649635,u:FO,de,en,0cf63ffe3dda158bc3dbd55682b355ae,frau/frau<n><f><sg><nom>,6,5,2,1
4,1.0,1362076081,27649635,u:FO,de,en,84920990d78044db53c1b012f5bf9ab5,das/das<det><def><nt><sg><nom>,4,4,1,1
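
The `lexeme_string` values in the preview above appear to follow the pattern `surface/lemma<tag><tag>...`. The following is a minimal sketch of how such a string could be split into its parts. The interpretation of the fields (surface form, lemma, grammatical tags) is inferred from the sample rows, and the function name `parse_lexeme` is hypothetical.

```python
import re


def parse_lexeme(lexeme_string):
    """Split 'surface/lemma<tag1><tag2>...' into surface form, lemma, and tags."""
    # Everything before the first "/" is treated as the surface form.
    surface, _, rest = lexeme_string.partition("/")
    # The lemma runs up to the first "<" of the tag list.
    lemma = rest.split("<", 1)[0]
    # Each tag is the text between one pair of angle brackets.
    tags = re.findall(r"<([^<>]+)>", rest)
    return {"surface": surface, "lemma": lemma, "tags": tags}


print(parse_lexeme("lernt/lernen<vblex><pri><p3><sg>"))
print(parse_lexeme("die/die<det><def><f><sg><nom>"))
```

With the parts separated, the tags could be used as categorical features (for example, the first tag as a part-of-speech label), although that step is not part of this notebook.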


In [9]:
dataframe.describe()

Unnamed: 0,p_recall,timestamp,delta,history_seen,history_correct,session_seen,session_correct
count,12854230.0,12854230.0,12854230.0,12854230.0,12854230.0,12854230.0,12854230.0
mean,0.8961056,1362589000.0,729581.1,21.98109,19.35025,1.817686,1.644134
std,0.2714048,293208.2,2246499.0,129.5508,111.9681,1.36018,1.318794
min,0.0,1362076000.0,1.0,1.0,1.0,1.0,0.0
25%,1.0,1362343000.0,532.0,3.0,3.0,1.0,1.0
50%,1.0,1362591000.0,77134.0,6.0,6.0,1.0,1.0
75%,1.0,1362846000.0,442507.0,15.0,13.0,2.0,2.0
max,1.0,1363105000.0,40328360.0,13518.0,12888.0,20.0,20.0


In [13]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12854226 entries, 0 to 12854225
Data columns (total 12 columns):
 #   Column             Dtype  
---  ------             -----  
 0   p_recall           float64
 1   timestamp          int64  
 2   delta              int64  
 3   user_id            object 
 4   learning_language  object 
 5   ui_language        object 
 6   lexeme_id          object 
 7   lexeme_string      object 
 8   history_seen       int64  
 9   history_correct    int64  
 10  session_seen       int64  
 11  session_correct    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 1.1+ GB
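
Since the full table takes more than 1 GB in memory, one option is to load only the columns that are needed and give them narrower dtypes. The sketch below demonstrates the idea on a tiny in-memory gzip file built from two preview rows, so it runs without the real download. The chosen dtypes are assumptions based on the value ranges reported by `describe()` above and should be re-checked against the actual data.

```python
import gzip
import io

import pandas as pd

# Tiny in-memory stand-in for the real .csv.gz file (two rows from the preview).
csv_text = (
    "p_recall,timestamp,delta,user_id,learning_language,ui_language,"
    "lexeme_id,lexeme_string,history_seen,history_correct,"
    "session_seen,session_correct\n"
    "1.0,1362076081,27649635,u:FO,de,en,76390c1350a8dac31186187e2fe1e178,"
    "lernt/lernen<vblex><pri><p3><sg>,6,4,2,2\n"
    "0.5,1362076081,27649635,u:FO,de,en,7dfd7086f3671685e2cf1c1da72796d7,"
    "die/die<det><def><f><sg><nom>,4,4,2,1\n"
)
buffer = io.BytesIO(gzip.compress(csv_text.encode("utf-8")))

# Assumed dtypes: narrow integers for counts, categories for language codes.
dtypes = {
    "p_recall": "float64",
    "delta": "int32",
    "learning_language": "category",
    "ui_language": "category",
    "history_seen": "int32",
    "history_correct": "int32",
    "session_seen": "int8",
    "session_correct": "int8",
}

# usecols skips the long string columns that are not needed for modelling.
small = pd.read_csv(buffer, compression="gzip", dtype=dtypes, usecols=list(dtypes))
print(small.dtypes)
```

The same `dtype` and `usecols` arguments can be passed when reading the real file. The memory saved depends on which columns are dropped.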


In [18]:
#investigate the correlation data (numeric columns only)
correlation_data = dataframe.select_dtypes("number").corr()
correlation_data.style.background_gradient()

Unnamed: 0,p_recall,timestamp,delta,history_seen,history_correct,session_seen,session_correct
p_recall,1.0,-0.000709,-0.030221,-0.022747,-0.012844,0.041433,0.301793
timestamp,-0.000709,1.0,0.024238,-0.002188,-0.003297,0.010319,0.009683
delta,-0.030221,0.024238,1.0,-0.030382,-0.030217,0.00217,-0.006933
history_seen,-0.022747,-0.002188,-0.030382,1.0,0.985646,0.002899,-0.004417
history_correct,-0.012844,-0.003297,-0.030217,0.985646,1.0,0.003628,-0.000407
session_seen,0.041433,0.010319,0.00217,0.002899,0.003628,1.0,0.952811
session_correct,0.301793,0.009683,-0.006933,-0.004417,-0.000407,0.952811,1.0


In [17]:
# Data cleaning and manipulation
# We will only look at users learning French.
# .copy() creates an independent DataFrame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning.
data = dataframe[dataframe.learning_language == "fr"].copy()
#data.head()
# Adding variables
# Number of mistakes made overall and within the current session
data['mistakes'] = data.history_seen - data.history_correct
data['session_mistakes'] = data.session_seen - data.session_correct
data['rate_of_error_per_session'] = data.session_mistakes/data.session_seen
data['general_rate_of_error'] = data.mistakes/data.history_seen
data.head()

Unnamed: 0,p_recall,timestamp,delta,user_id,learning_language,ui_language,lexeme_id,lexeme_string,history_seen,history_correct,session_seen,session_correct,mistakes,session_mistakes,rate_of_error_per_session,general_rate_of_error
231,1.0,1362082530,261691,u:hiS4,fr,en,03a546003e03b545a6d419b6620b3749,la/le<det><def><f><sg>,31,23,1,1,8,0,0.0,0.258065
232,0.5,1362082530,231391,u:hiS4,fr,en,8bd6d060bb604e17c936418f835d87c8,mon/mon<det><pos><m><sg>,28,27,2,1,1,1,0.5,0.035714
233,1.0,1362082530,232832,u:hiS4,fr,en,1b279bb64bd6eba51ab37e4a61aad0c4,mes/mon<det><pos><mf><pl>,2,2,1,1,0,0,0.0,0.0
234,1.0,1362082530,196898,u:hiS4,fr,en,8e6df998cc5c26b86482a3040a5805e2,c'/ce<prn><tn><nt><sg>,19,18,1,1,1,0,0.0,0.052632
235,0.0,1362082530,274171,u:hiS4,fr,en,3ec6ed7d5a122ac3018def0b4f621b12,nouveau/nouveau<adj><m><sg>,6,5,1,0,1,1,1.0,0.166667


In [20]:
#investigate the correlation data (numeric columns only)
correlation_data = data.select_dtypes("number").corr()
correlation_data.style.background_gradient()

Unnamed: 0,p_recall,timestamp,delta,history_seen,history_correct,session_seen,session_correct,mistakes,session_mistakes,rate_of_error_per_session,general_rate_of_error
p_recall,1.0,-0.000751,-0.030109,0.003072,0.010392,0.048894,0.322012,-0.046783,-0.820034,-1.0,-0.097067
timestamp,-0.000751,1.0,0.016846,0.002858,0.003995,0.009478,0.008578,-0.005511,0.003622,0.000751,-0.004796
delta,-0.030109,0.016846,1.0,-0.064134,-0.06318,0.000305,-0.009236,-0.053621,0.02884,0.030109,-0.010883
history_seen,0.003072,0.002858,-0.064134,1.0,0.995407,-0.024173,-0.020284,0.767176,-0.014054,-0.003072,-0.005449
history_correct,0.010392,0.003995,-0.06318,0.995407,1.0,-0.023886,-0.017659,0.702248,-0.021085,-0.010392,-0.032779
session_seen,0.048894,0.009478,0.000305,-0.024173,-0.023886,1.0,0.947328,-0.019726,0.254526,-0.048894,0.047253
session_correct,0.322012,0.008578,-0.009236,-0.020284,-0.017659,0.947328,1.0,-0.032523,-0.068598,-0.322012,0.01046
mistakes,-0.046783,-0.005511,-0.053621,0.767176,0.702248,-0.019726,-0.032523,1.0,0.036759,0.046783,0.17911
session_mistakes,-0.820034,0.003622,0.02884,-0.014054,-0.021085,0.254526,-0.068598,0.036759,1.0,0.820034,0.115611
rate_of_error_per_session,-1.0,0.000751,0.030109,-0.003072,-0.010392,-0.048894,-0.322012,0.046783,0.820034,1.0,0.097067
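
The correlation of exactly -1.0 between `p_recall` and `rate_of_error_per_session` is expected if `p_recall` equals `session_correct / session_seen`, because then `rate_of_error_per_session = (session_seen - session_correct) / session_seen = 1 - p_recall`. The sketch below checks this on a handful of session counts taken from rows shown in this notebook. It assumes that relationship holds, which the sample rows suggest.

```python
import numpy as np

# Session counts copied from rows displayed earlier in this notebook.
session_seen = np.array([2, 2, 1, 2, 1, 1])
session_correct = np.array([2, 1, 1, 1, 1, 0])

# Assumed definition of p_recall, consistent with the sample rows.
p_recall = session_correct / session_seen
rate_of_error = (session_seen - session_correct) / session_seen

# The derived column is an exact linear function of the target ...
identity_holds = np.allclose(rate_of_error, 1 - p_recall)
# ... so the Pearson correlation is -1 up to floating-point error.
r = np.corrcoef(p_recall, rate_of_error)[0, 1]
print(identity_holds, r)
```

Because the derived column carries the same information as the target, it would leak the answer if used as a feature for predicting `p_recall`.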


In [22]:
data.ui_language.unique()

array(['en'], dtype=object)

In [17]:
# Check if there are any null values
print(df.isnull().sum())

datetime                        0
ui_language                     0
eligible_templates              0
history                  52124250
selected_template               0
session_end_completed           0
dtype: int64


In [13]:
print(dataframe.shape)

(12854226, 12)


In [14]:
#target variable is p_recall, the proportion of exercises in which the word was correctly recalled
y = dataframe["p_recall"] 

In [15]:
lab = preprocessing.LabelEncoder()
y_transformed = lab.fit_transform(y)

In [16]:
y.head()

0    1.0
1    0.5
2    1.0
3    0.5
4    1.0
Name: p_recall, dtype: float64
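
To make the effect of `LabelEncoder` on this target concrete, the sketch below applies it to the five `p_recall` values shown above. Each distinct proportion becomes its own integer class, so the encoded target is categorical rather than a probability. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The five p_recall values from y.head() above.
y_sample = np.array([1.0, 0.5, 1.0, 0.5, 1.0])

encoder = LabelEncoder()
codes = encoder.fit_transform(y_sample)

# classes_ holds the sorted distinct values; codes are indices into it.
print(encoder.classes_)
print(codes)
# inverse_transform maps the integer codes back to the original proportions.
print(encoder.inverse_transform(codes))
```

On the full dataset there are more than two distinct proportions (0.0, 0.5 and 1.0 all appear in the rows shown in this notebook), so the classifier trained on the encoded target is a multiclass one.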

In [18]:
#y_transformed.head()

In [21]:
Y = dataframe.iloc[:, 0:2]
Y.head()

Unnamed: 0,p_recall,timestamp
0,1.0,1362076081
1,0.5,1362076081
2,1.0,1362076081
3,0.5,1362076081
4,1.0,1362076081


In [22]:
# Select the features: how often the word was seen and answered correctly,
# both historically and within the current session
X = dataframe[["history_seen", "history_correct", "session_seen", "session_correct"]]
X.head()

Unnamed: 0,history_seen,history_correct,session_seen,session_correct
0,6,4,2,2
1,4,4,2,1
2,5,4,1,1
3,6,5,2,1
4,4,4,1,1


In [23]:
# Split the data into train and test subsets
# You can adjust the test size and random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y_transformed, test_size=0.30, random_state=123)

# Standardize X data based on X_train
sc = StandardScaler().fit(X_train)
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)
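
The scaler above is fitted on the training split only, which keeps test-set statistics out of the preprocessing. An alternative way to get the same guarantee is to chain the scaler and the classifier in a `Pipeline`. The following is a sketch on synthetic data, not the notebook's actual model. The arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: four count-like features and a binary label.
rng = np.random.default_rng(123)
X_demo = rng.integers(1, 20, size=(200, 4)).astype(float)
y_demo = (X_demo[:, 0] > X_demo[:, 1]).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123)

# fit() learns the scaling statistics from X_tr only, then fits the classifier;
# score() reuses those statistics to transform X_te before predicting.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(random_state=123)),
])
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

A pipeline object can also be passed directly to `cross_val_score` or `RandomizedSearchCV`, in which case the scaler is refitted inside every fold.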

In [None]:
model = LogisticRegression(random_state = 10)
model.fit(X_train_scaled, y_train)

In [None]:
y_pred = model.predict(X_test_scaled)

In [None]:
# Define parameters: these will need to be tuned to prevent overfitting and underfitting
params = {
    "penalty": "l2",  # Norm of the penalty: 'l1', 'l2', 'elasticnet', 'none'
    "C": 1,  # Inverse of regularization strength, a positive float
    "random_state": 123,
}

# Create a logistic regression classifier object with the parameters above
clf = LogisticRegression(**params)

# Train the classifier on the train set
clf = clf.fit(X_train_scaled, y_train)

# Predict the outcomes on the test set
y_pred = clf.predict(X_test_scaled)

In [None]:
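#evaluating the classifier through the accuracy, precision, and recall scores
# The label-encoded p_recall has more than two classes, so precision and recall
# need an explicit averaging strategy (the default average="binary" raises an error).
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted", zero_division=0))
print("Recall:", recall_score(y_test, y_pred, average="weighted", zero_division=0))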

In [None]:
# Calculate confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Plot a labeled confusion matrix with Seaborn
sns.heatmap(cnf_matrix, annot=True, fmt="g")
plt.title("Confusion matrix")
plt.ylabel("Actual label")
plt.xlabel("Predicted label")

In [None]:
# Plot ROC curve
RocCurveDisplay.from_estimator(clf, X_test_scaled, y_test)
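
`RocCurveDisplay.from_estimator` expects a binary classifier, whereas the label-encoded `p_recall` has more than two classes. One possible workaround, sketched below on synthetic data, is to binarize the target (here: "perfect recall" versus "at least one mistake") before fitting. The threshold, the synthetic feature, and all variable names are assumptions for illustration and not part of the original analysis. The sketch also scores the training data, so the resulting AUC says nothing about generalization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Synthetic p_recall values; the real column has more distinct proportions.
p_recall_demo = rng.choice([0.0, 0.5, 1.0], size=300, p=[0.1, 0.2, 0.7])

# Assumed binarization: 1 if every exercise in the session was correct.
y_binary = (p_recall_demo == 1.0).astype(int)

# Made-up noisy feature that is loosely related to the target.
X_demo = (p_recall_demo + rng.normal(0.0, 0.5, size=300)).reshape(-1, 1)

clf_demo = LogisticRegression().fit(X_demo, y_binary)
scores = clf_demo.predict_proba(X_demo)[:, 1]  # probability of class 1

# With a binary target the ROC curve and its AUC are well defined.
fpr, tpr, thresholds = roc_curve(y_binary, scores)
auc = roc_auc_score(y_binary, scores)
print(auc)
```

With a binary target like this, `RocCurveDisplay.from_estimator(clf_demo, X_demo, y_binary)` would plot the same curve.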

### References:
The data can be downloaded here:
***
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/23ZWVI
@data{DVN/23ZWVI_2020,
author = {Kevin Yancey and Burr Settles},
publisher = {Harvard Dataverse},
title = {{Replication Data for: A Sleeping, Recovering Bandit Algorithm for Optimizing Recurring Notifications}},
year = {2020},
version = {V1},
doi = {10.7910/DVN/23ZWVI},
url = {https://doi.org/10.7910/DVN/23ZWVI}
}

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/N8XJME
@data{DVN/N8XJME_2017,
author = {Settles, Burr},
publisher = {Harvard Dataverse},
title = {{Replication Data for: A Trainable Spaced Repetition Model for Language Learning}},
year = {2017},
version = {V1},
doi = {10.7910/DVN/N8XJME},
url = {https://doi.org/10.7910/DVN/N8XJME}
}
***

#### Code will be updated regularly on GitHub:
https://github.com/zeineb-ouerghi/Capstone-Project-Senior-Year-/upload/main
