This notebook contains Python code for reproducing the results in our paper on using a large language model to provide personalized feedback for open-ended questions:

Van Campenhout, R., Dittel, J. S., & Johnson, B. G. (2025). Scaling effective characteristics of ITSs: A preliminary analysis of LLM-based personalized feedback. In *Proceedings of the 21st International Conference on Intelligent Tutoring Systems (ITS 2025)*. https://doi.org/PLACEHOLDER_DOI

We are honored to have received the [Best Short Paper Award](PLACEHOLDER_LINK) at [ITS 2025](https://iis-international.org/its2025-generative-systems/) for this work. Thank you, ITS!

Results are presented in the order they occur in the paper, organized by section. For each result, an excerpt from the paper is followed by code that computes the result from the provided dataset. Example:

>Data were grouped into student-question sessions, which encompass all actions by an individual student on a single question in chronological order. This yielded 5,022 sessions from 29 distinct questions and 198 students.

`len( sessions ), sessions.question_id.nunique(), sessions.student_id.nunique()`

Please refer to the paper for additional context.

In [1]:
import difflib
import re

import pandas as pd

## Read dataset

Student-question interaction events.

In [2]:
events = pd.read_parquet( 'events.parquet' )
events.head()

|   | timestamp | question_id | student_id | question | attempt_number | answer | feedback | attempt_category | answer_length |
| - | --------- | ----------- | ---------- | -------- | -------------- | ------ | -------- | ---------------- | ------------- |
| 0 | 2024-09-05 03:46:11 | 24e529e2a813ed4aa821351099056d480bf90587775de7... | U2EBCW47PRUV2STAUPWC | Explain the difference between the term back r... | 1 | back region is where social interaction is les... | Your explanation captures the essence of the t... | + | 27 |
| 1 | 2024-09-09 03:56:32 | 24e529e2a813ed4aa821351099056d480bf90587775de7... | VTK7BEMZCU5U6TAER5P4 | Explain the difference between the term back r... | 1 | back region is when youre not under social pre... | Your explanation captures the essence of the t... | + | 16 |
| 2 | 2024-09-09 04:32:04 | 24e529e2a813ed4aa821351099056d480bf90587775de7... | G86NKXYBWBGHJVDBTMZU | Explain the difference between the term back r... | 1 | alvin | OK, no problem. The terms "back region" and "f... | x | 1 |
| 3 | 2024-09-09 16:41:09 | 24e529e2a813ed4aa821351099056d480bf90587775de7... | PYF5FKB8GYE54YR62UC6 | Explain the difference between the term back r... | 1 | Front region is a behvior where you know you w... | Your explanation captures the essence of the d... | + | 41 |
| 4 | 2024-09-09 21:26:39 | 24e529e2a813ed4aa821351099056d480bf90587775de7... | 7AAVNEMKJZXZ6EGA34QN | Explain the difference between the term back r... | 1 | The front regioin is a frame in which your pro... | Your explanation is quite accurate and well-ar... | + | 59 |


Student-question sessions.

In [3]:
sessions = pd.read_parquet( 'sessions.parquet' )
sessions.head()

|   | question_id | student_id | pattern | first_attempt | second_attempt | second_attempt_elapsed | second_attempt_overlap |
| - | ----------- | ---------- | ------- | ------------- | -------------- | ---------------------- | ---------------------- |
| 0 | 041da273050f1811a3146414b755a91921f8c2f286315a... | 2KQPBPGQKYRFKDFKYHA5 | + | + |  |  |  |
| 1 | 041da273050f1811a3146414b755a91921f8c2f286315a... | 2MQSTPV5RBCZNUTA4TTN | - | - |  |  |  |
| 2 | 041da273050f1811a3146414b755a91921f8c2f286315a... | 2T6YX2UQS77MJ8VGNB7N | + | + |  |  |  |
| 3 | 041da273050f1811a3146414b755a91921f8c2f286315a... | 2TWHRZM72RTYKHNRYVZF | + | + |  |  |  |
| 4 | 041da273050f1811a3146414b755a91921f8c2f286315a... | 3AKC5SVKH3V33KUC6N3B | + | + |  |  |  |
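
To illustrate how events roll up into sessions, the sketch below groups a few made-up events by (question, student), orders them by timestamp, and derives the first two attempt categories and the elapsed seconds between them. It uses only the standard library and toy data. The helper name `build_sessions` is hypothetical, and reading `pattern` as the concatenated attempt categories is an assumption rather than something the dataset documents.

```python
from collections import defaultdict
from datetime import datetime

# Toy events, made up for illustration:
# (timestamp, question_id, student_id, attempt_category)
toy_events = [
    ( '2024-09-05 10:00:00', 'q1', 's1', 'x' ),
    ( '2024-09-05 10:00:14', 'q1', 's1', '+' ),
    ( '2024-09-06 09:30:00', 'q1', 's2', '+' ),
    ( '2024-09-07 08:00:05', 'q2', 's1', '-' ),
]

def build_sessions( events ):
    """Group events into student-question sessions in chronological order."""
    grouped = defaultdict( list )
    for timestamp, question_id, student_id, category in events:
        key = ( question_id, student_id )
        grouped[ key ].append( ( datetime.fromisoformat( timestamp ), category ) )

    sessions = []
    for ( question_id, student_id ), attempts in sorted( grouped.items() ):
        attempts.sort()  # chronological order within the session
        categories = [ category for _, category in attempts ]
        elapsed = None
        if len( attempts ) > 1:
            # Seconds between the first and second attempts
            elapsed = ( attempts[ 1 ][ 0 ] - attempts[ 0 ][ 0 ] ).total_seconds()
        sessions.append( {
            'question_id': question_id,
            'student_id': student_id,
            'pattern': ''.join( categories ),  # assumed meaning of the pattern column
            'first_attempt': categories[ 0 ],
            'second_attempt': categories[ 1 ] if len( categories ) > 1 else None,
            'second_attempt_elapsed': elapsed,
        } )
    return sessions

toy_sessions = build_sessions( toy_events )
for session in toy_sessions:
    print( session )
```

Single-attempt sessions keep `None` for the second-attempt fields, which mirrors the empty cells in the output above.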


## 2 Methods

### 2.3 Data Collection and Analysis

>This yielded 5,022 sessions from 29 distinct questions and 198 students.

In [4]:
len( sessions ), sessions.question_id.nunique(), sessions.student_id.nunique()

(5022, 29, 198)

### 2.4 Classifying Correctness and Authenticity

In the dataset, each student answer attempt is labeled with a shorthand symbol that encodes its correctness and authenticity. These symbols (+, -, x) do not appear in the paper, but they map directly onto the paper's categories, as defined in the following table:

| Category    | Symbol | Description                                                                                             |
| ----------- | ------ | ------------------------------------------------------------------------------------------------------- |
| Correct     | +      | The response accurately addressed the key distinction between terms.                                    |
| Incorrect   | -      | The response did not sufficiently answer the question, despite appearing to be a genuine effort.        |
| Non-Genuine | x      | The response did not constitute a legitimate attempt (e.g., random characters, “idk”, irrelevant text). |
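
For more readable output, the symbols can be translated back into the paper's category names. This is a minimal sketch; `CATEGORY_NAMES` and `label_category` are hypothetical names introduced here, not part of the dataset or the paper's code.

```python
# Hypothetical lookup from dataset symbols to the paper's category names
CATEGORY_NAMES = {
    '+': 'Correct',
    '-': 'Incorrect',
    'x': 'Non-Genuine',
}

def label_category( symbol ):
    """Return the category name for a symbol; raises KeyError for unknown symbols."""
    return CATEGORY_NAMES[ symbol ]

labels = [ label_category( s ) for s in [ '+', 'x', '-' ] ]
print( labels )
```

With pandas, the same dictionary can be passed to `Series.map`, for example `events.attempt_category.map( CATEGORY_NAMES )`.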

## 3 Results and Discussion

### 3.1 Student Answer Length

#### Table 1. Descriptive statistics for student answer length (words) by attempt category.

In [5]:
# Word count: number of whitespace-delimited tokens in each answer
events[ 'answer_length' ] = events.answer.str.split().str.len()

In [6]:
events.groupby( 'attempt_category' ).answer_length.describe()

| attempt_category | count  | mean      | std       | min | 25%  | 50%  | 75%  | max   |
| ---------------- | ------ | --------- | --------- | --- | ---- | ---- | ---- | ----- |
| +                | 3641.0 | 35.986542 | 28.460694 | 1.0 | 22.0 | 29.0 | 41.0 | 650.0 |
| -                | 1136.0 | 21.080106 | 13.659603 | 1.0 | 12.0 | 19.0 | 27.0 | 108.0 |
| x                | 756.0  | 7.115079  | 14.944621 | 1.0 | 1.0  | 1.0  | 4.0  | 137.0 |


>Although individual outliers were observed (e.g., one correct attempt spanned 650 words), fewer than 2% of responses surpassed 100 words.

In [7]:
p = ( events.answer_length > 100 ).mean()
print( f'{p:.1%}' )

1.7%


### 3.2 Time Intervals and Answer Overlap

>To gain insight into whether LLM feedback contributed to learning, this analysis focuses on cases where the first attempt is incorrect (21.9%) or non-genuine (14.5%).

In [8]:
sessions.first_attempt.value_counts( normalize=True ).apply( lambda p: f'{p:.1%}' )

first_attempt
+    63.5%
-    21.9%
x    14.5%
Name: proportion, dtype: object

>Despite the option for resubmission, only 22.6% of non-correct first attempts had a second attempt, likely due to time, participation credit fulfillment, or perceived sufficiency of the LLM feedback.

In [9]:
p = sessions[ sessions.first_attempt != '+' ].second_attempt.notna().mean()
print( f'{p:.1%}' )

22.6%


#### Table 2. Elapsed time (s) between first and second attempts by transition type.

In [10]:
sessions[ sessions.first_attempt != '+' ].groupby( [ 'first_attempt', 'second_attempt' ] ).second_attempt_elapsed.describe()

| first_attempt | second_attempt | count | mean      | std        | min  | 25%  | 50%  | 75%   | max    |
| ------------- | -------------- | ----- | --------- | ---------- | ---- | ---- | ---- | ----- | ------ |
| -             | +              | 126.0 | 41.126984 | 115.030151 | 8.0  | 14.0 | 21.5 | 35.75 | 1276.0 |
| -             | -              | 16.0  | 97.5625   | 91.666765  | 16.0 | 43.5 | 61.5 | 103.5 | 338.0  |
| -             | x              | 5.0   | 24.6      | 19.667232  | 7.0  | 11.0 | 17.0 | 33.0  | 55.0   |
| x             | +              | 235.0 | 21.821277 | 26.398531  | 7.0  | 11.0 | 14.0 | 19.5  | 298.0  |
| x             | -              | 13.0  | 74.307692 | 68.72334   | 10.0 | 23.0 | 57.0 | 94.0  | 261.0  |
| x             | x              | 19.0  | 15.526316 | 15.532648  | 4.0  | 6.5  | 8.0  | 20.5  | 61.0   |


#### Table 3. Overlap ratio between LLM feedback and second attempt by transition type.

In [11]:
def preprocess_text( text ):
    """
    Lowercases text, removes punctuation but keeps letters/digits, and normalizes spacing.
    Returns a cleaned string suitable for token-level comparison.
    """
    text = text.lower()
    # Remove punctuation/special characters (but keep letters a-z, digits 0-9, and whitespace)
    text = re.sub( r'[^a-z0-9\s]', '', text )
    # Normalize multiple spaces/tabs/newlines into a single space
    text = ' '.join( text.split() )

    return text

def token_based_difflib_ratio( a, b ):
    """
    Returns a float in [0.0, 1.0] indicating how similar two texts are,
    based on token-level difflib (order-sensitive).
    """
    a_tokens = preprocess_text( a ).split()
    b_tokens = preprocess_text( b ).split()
    # Create a SequenceMatcher on the token lists
    matcher = difflib.SequenceMatcher( None, a_tokens, b_tokens )

    return matcher.ratio()
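
To make the overlap measure concrete: `SequenceMatcher.ratio()` returns 2·M/T, where M is the number of matched tokens and T is the total number of tokens in both sequences. The sketch below checks this on two made-up sentences and shows that the measure depends on token order. The helper `tokenize` is a hypothetical, condensed stand-in for the preprocessing above, and the example sentences are not taken from the dataset.

```python
import difflib
import re

def tokenize( text ):
    """Lowercase, drop characters other than a-z, 0-9, and whitespace, then split on whitespace."""
    return re.sub( r'[^a-z0-9\s]', '', text.lower() ).split()

# Made-up feedback and answer, for illustration only
feedback_tokens = tokenize( 'Social roles foster positive relationships.' )
answer_tokens = tokenize( 'Social roles foster relationships' )

matcher = difflib.SequenceMatcher( None, feedback_tokens, answer_tokens )
# M: total number of tokens covered by the matching blocks
matched = sum( block.size for block in matcher.get_matching_blocks() )
# T: total number of tokens in both sequences
total = len( feedback_tokens ) + len( answer_tokens )

ratio = matcher.ratio()
manual_ratio = 2 * matched / total
print( matched, total, round( ratio, 3 ), round( manual_ratio, 3 ) )

# The same tokens in reverse order score below 1.0 because matching is order-sensitive
reversed_ratio = difflib.SequenceMatcher( None, answer_tokens, answer_tokens[ ::-1 ] ).ratio()
print( reversed_ratio < 1.0 )
```

Because the ratio is normalized by the combined length, a short second attempt copied verbatim from a longer feedback message scores below 1.0 even though every token matches.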

In [12]:
for ( question_id, student_id ), session_events in events.groupby( [ 'question_id', 'student_id' ] ):
    # Only want multi-attempt sessions
    if len( session_events ) == 1:
        continue
    e1 = session_events.iloc[ 0 ]
    e2 = session_events.iloc[ 1 ]
    # Compute overlap between first answer's feedback and second answer
    similarity = token_based_difflib_ratio( e1.feedback, e2.answer )
    sessions.loc[ ( sessions.question_id == question_id ) & ( sessions.student_id == student_id ), 'second_attempt_overlap' ] = similarity

In [13]:
sessions[ sessions.first_attempt != '+' ].groupby( [ 'first_attempt', 'second_attempt' ] ).second_attempt_overlap.describe().round( 3 )

| first_attempt | second_attempt | count | mean  | std   | min   | 25%   | 50%   | 75%   | max   |
| ------------- | -------------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| -             | +              | 126.0 | 0.658 | 0.263 | 0.082 | 0.492 | 0.745 | 0.867 | 1.0   |
| -             | -              | 16.0  | 0.191 | 0.093 | 0.031 | 0.144 | 0.204 | 0.225 | 0.39  |
| -             | x              | 5.0   | 0.062 | 0.091 | 0.0   | 0.0   | 0.0   | 0.105 | 0.203 |
| x             | +              | 235.0 | 0.796 | 0.168 | 0.179 | 0.714 | 0.819 | 0.946 | 1.0   |
| x             | -              | 13.0  | 0.319 | 0.177 | 0.111 | 0.224 | 0.277 | 0.378 | 0.717 |
| x             | x              | 19.0  | 0.125 | 0.264 | 0.0   | 0.0   | 0.026 | 0.076 | 0.964 |


#### Example 1: High overlap (.901)

In [14]:
question_id = '041da273050f1811a3146414b755a91921f8c2f286315a66fee61ccdd6659823'
student_id = 'A66TTXA3ER8PS4JGXDGN'
session_events = events[ ( events.question_id == question_id ) & ( events.student_id == student_id ) ]
e1 = session_events.iloc[ 0 ]
e2 = session_events.iloc[ 1 ]
overlap = sessions.loc[ ( sessions.question_id == question_id ) & ( sessions.student_id == student_id ), 'second_attempt_overlap' ].values[ 0 ]

In [15]:
print( 'Question: ', e1.question )
print()
print( 'Attempt 1:', e1.answer )
print()
print( 'Feedback: ', e1.feedback )
print()
print( 'Attempt 2:', e2.answer )
print()
print( 'Overlap:  ', f'{overlap:.3f}' )

Question:  Explain the difference between the term social roles and task roles:

Attempt 1: ?

Feedback:  To explain the difference between social roles and task roles, focus on their functions within a group. Social roles are about fostering positive relationships and communication among group members. Task roles are centered on achieving the group's goals and ensuring productivity. If your explanation captures these distinctions, then it's accurate.

Attempt 2: To explain the difference between social roles and task roles, focus on their functions within a group. Social roles are about fostering positive relationships and communication among group members. Task roles are centered on achieving the group's goals and ensuring productivity.

Overlap:   0.901


#### Example 2: Moderate overlap (.480)

In [16]:
question_id = '8dc066de81854984b1f742f6b2eb56f79774fdd128c85724585702b9d6659823'
student_id = '3NCGEU8UAVTRUP3CRV67'
session_events = events[ ( events.question_id == question_id ) & ( events.student_id == student_id ) ]
e1 = session_events.iloc[ 0 ]
e2 = session_events.iloc[ 1 ]
overlap = sessions.loc[ ( sessions.question_id == question_id ) & ( sessions.student_id == student_id ), 'second_attempt_overlap' ].values[ 0 ]

In [17]:
print( 'Question: ', e1.question )
print()
print( 'Attempt 1:', e1.answer )
print()
print( 'Feedback: ', e1.feedback )
print()
print( 'Attempt 2:', e2.answer )
print()
print( 'Overlap:  ', f'{overlap:.3f}' )

Question:  Explain the difference between the term asynchronous communication and constitutive approach to communication:

Attempt 1: Asynchronous communication is communication that involves a prolonged delay between message and response.

Feedback:  The explanation provided only addresses one aspect of asynchronous communication. It doesn't compare or contrast it with the constitutive approach to communication. Asynchronous communication focuses on timing, while the constitutive approach emphasizes how communication can create new realities or agreements. It's important to address both terms to explain their differences accurately.

Attempt 2: Asynchronous communication focuses on the delayed timing between message and response, while the constitutive approach emphasizes how communication can create new realities or agreements.

Overlap:   0.480


#### Example 3: Low overlap (.171)

In [18]:
question_id = '32cd310ec4f80a042de7f0ba5c637d770a2ce2ddb92a904a2df6494cd6659823'
student_id = '8QNCMHMZ5XW7CW5NAZCN'
session_events = events[ ( events.question_id == question_id ) & ( events.student_id == student_id ) ]
e1 = session_events.iloc[ 0 ]
e2 = session_events.iloc[ 1 ]
overlap = sessions.loc[ ( sessions.question_id == question_id ) & ( sessions.student_id == student_id ), 'second_attempt_overlap' ].values[ 0 ]

In [19]:
print( 'Question: ', e1.question )
print()
print( 'Attempt 1:', e1.answer )
print()
print( 'Feedback: ', e1.feedback )
print()
print( 'Attempt 2:', e2.answer )
print()
print( 'Overlap:  ', f'{overlap:.3f}' )

Question:  Explain the difference between the term constitutive approach to communication and cross-cultural communication:

Attempt 1: Constitutive approach is listening to what they have to say and cross-cultural communication is when they choose multiple ways to communite

Feedback:  The explanation provided doesn't accurately capture the essence of the two terms. The constitutive approach to communication focuses on how communication can create new realities or entities. Cross-cultural communication involves understanding and comparing communication styles across different cultures. It's important to focus on these core aspects when distinguishing between the two.

Attempt 2: Constitutive approach is the creation of new communitites and entites verus cross-cultural is the comparsion of different cultures communication.

Overlap:   0.171
