# Week 2: Training, Loading, and Evaluating NLP models on Education Data

In [None]:
__author__ = "Rose E. Wang, Dorottya Demszky"
__version__ = "CS293/EDUC473, Stanford, Fall 2023"

# Table of Contents

* [Overview](#overview)
    * [Your Deliverables](#your-deliverables)
    * [Important Setup Details](#important-setup-details)
* [Model (Training and) Loading](#model-training-and-loading)
* [Inference](#inference)
* [Linear Regression](#linear-regression)
* [Assignment](#assignment)
* [Extra Assignment](#extra-assignment)

# Overview

The purpose of this notebook is to walk you through loading and evaluating a language model on the NCTE dataset we used in the last assignment.

We will discuss:
- How to load and run inference on a pre-trained language model on a classification task (student reasoning)
- How to use a pre-trained language model to label data and run linear regression on the model predictions

Ultimately, we are interested in using pre-trained language models to label instances of student reasoning, and then use those labels to understand whether student reasoning correlates with outcome measures like the MQI scores.
This is one way of answering the question: "Does more student reasoning lead to better outcomes?"

## Your Deliverables

To receive credit for this assignment, please upload the PDF version of your ENTIRE Colab that includes all your code and written responses to Gradescope.

## Important Setup Details

We will be running a model over tens of thousands of utterances in this assignment. To ensure inference goes reasonably fast, make sure you connect your Colab to a GPU.

To do so:
- Navigate to the buttons on the right-hand side
- Click on the down-arrow button
- Click "Change runtime type"
- Make sure the runtime type is the GPU option e.g., "T4 GPU"

## Note

We are assuming you are running everything within Colab. We are not responsible for bugs created outside of the Colab environment.


## Linking this Colab to your GDrive

As in the last assignment, make sure you link the Colab to your GDrive. Follow the instructions in the previous assignment.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Change this directory!
DIR_PATH = "/content/drive/MyDrive/CS293/empowering_educators_via_language_technology-main"

# Model (Training and) Loading

In this section, we will load a pre-trained classifier that recognizes student reasoning data and evaluate the model's performance.
Specifically, we will use a RoBERTa model [1] on the student reasoning data we saw in the previous assignment.
This model was trained and validated using the procedure outlined in the original NCTE paper.
Please take a look at that paper if you are interested in this!

**Why are we loading a model and not training it?**
This is due to compute constraints on the free version of Colab. I tested extensively to try to get training to work within its budget, but unfortunately these attempts were unsuccessful.
If you are interested in training your own models and have the compute to do so, please let me know and I'm happy to share the training scripts!

[1] Liu, Yinhan, et al. "Roberta: A robustly optimized bert pretraining approach." arXiv preprint arXiv:1907.11692 (2019).

In [None]:
!pip install transformers

We'll be loading the model from [HuggingFace](https://huggingface.co/models) which, amongst other things, provides a repository of pre-trained models.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from transformers.models.bert.modeling_bert import BertModel, BertPreTrainedModel
from torch import nn
from itertools import chain
from torch.nn import MSELoss, CrossEntropyLoss
import re


HF_MODEL_NAME = "ddemszky/student-reasoning" # name of the huggingface model
STUDENT_REASONING_MIN_NUM_WORDS = 8 # minimum number of words we consider for student reasoning
REASONING_MAX_INPUT_LENGTH = 128 # maximum number of tokens for model

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f'Using {device}') # This should be cuda (using GPU)

class MultiHeadModel(BertPreTrainedModel):
  """Pre-trained BERT model that uses our loss functions"""

  def __init__(self, config, head2size):
    super(MultiHeadModel, self).__init__(config, head2size)
    config.num_labels = 1
    self.bert = BertModel(config)
    self.dropout = nn.Dropout(config.hidden_dropout_prob)
    module_dict = {}
    for head_name, num_labels in head2size.items():
      module_dict[head_name] = nn.Linear(config.hidden_size, num_labels)
    self.heads = nn.ModuleDict(module_dict)

    self.init_weights()

  def forward(self, input_ids, token_type_ids=None, attention_mask=None,
              head2labels=None, return_pooler_output=False, head2mask=None,
              nsp_loss_weights=None):

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Get logits
    output = self.bert(
      input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask,
      output_attentions=False, output_hidden_states=False, return_dict=True)
    pooled_output = self.dropout(output["pooler_output"]).to(device)

    head2logits = {}
    return_dict = {}
    for head_name, head in self.heads.items():
      head2logits[head_name] = self.heads[head_name](pooled_output)
      head2logits[head_name] = head2logits[head_name].float()
      return_dict[head_name + "_logits"] = head2logits[head_name]


    if head2labels is not None:
      for head_name, labels in head2labels.items():
        num_classes = head2logits[head_name].shape[1]

        # Regression (e.g. for politeness)
        if num_classes == 1:

          # Only consider positive examples
          if head2mask is not None and head_name in head2mask:
            num_positives = head2labels[head2mask[head_name]].sum()  # use certain labels as mask
            if num_positives == 0:
              return_dict[head_name + "_loss"] = torch.tensor([0]).to(device)
            else:
              loss_fct = MSELoss(reduction='none')
              loss = loss_fct(head2logits[head_name].view(-1), labels.float().view(-1))
              return_dict[head_name + "_loss"] = loss.dot(head2labels[head2mask[head_name]].float().view(-1)) / num_positives
          else:
            loss_fct = MSELoss()
            return_dict[head_name + "_loss"] = loss_fct(head2logits[head_name].view(-1), labels.float().view(-1))
        else:
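          # Class weights are optional; calling .float() on None would raise an error
          loss_weights = nsp_loss_weights.float() if nsp_loss_weights is not None else None
          loss_fct = CrossEntropyLoss(weight=loss_weights)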
          return_dict[head_name + "_loss"] = loss_fct(head2logits[head_name], labels.view(-1))


    if return_pooler_output:
      return_dict["pooler_output"] = output["pooler_output"]

    return return_dict

class InputBuilder(object):
  """Base class for building inputs from segments."""

  def __init__(self, tokenizer):
      self.tokenizer = tokenizer
      self.mask = [tokenizer.mask_token_id]

  def build_inputs(self, history, reply, max_length):
      raise NotImplementedError

  def mask_seq(self, sequence, seq_id):
      sequence[seq_id] = self.mask
      return sequence

  @classmethod
  def _combine_sequence(cls, history, reply, max_length, flipped=False):
      # Trim all inputs to max_length
      history = [s[:max_length] for s in history]
      reply = reply[:max_length]
      if flipped:
          return [reply] + history
      return history + [reply]


class BertInputBuilder(InputBuilder):
  """Processor for BERT inputs"""

  def __init__(self, tokenizer):
      InputBuilder.__init__(self, tokenizer)
      self.cls = [tokenizer.cls_token_id]
      self.sep = [tokenizer.sep_token_id]
      self.model_inputs = ["input_ids", "token_type_ids", "attention_mask"]
      self.padded_inputs = ["input_ids", "token_type_ids"]
      self.flipped = False


  def build_inputs(self, history, reply, max_length, input_str=True):
    """See base class."""
    if input_str:
        history = [self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(t)) for t in history]
        reply = self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(reply))
    sequence = self._combine_sequence(history, reply, max_length, self.flipped)
    sequence = [s + self.sep for s in sequence]
    sequence[0] = self.cls + sequence[0]

    instance = {}
    instance["input_ids"] = list(chain(*sequence))
    last_speaker = 0
    other_speaker = 1
    seq_length = len(sequence)
    instance["token_type_ids"] = [last_speaker if ((seq_length - i) % 2 == 1) else other_speaker
                                  for i, s in enumerate(sequence) for _ in s]
    return instance

# Initializing the tokenizer, model, input_builder
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
student_reasoning_model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)
student_reasoning_model.to(device)  # move the model to the GPU if one is available
input_builder = BertInputBuilder(tokenizer=tokenizer)

This function will take in a dataframe and run the student reasoning model on it.

In [None]:

def get_score(text, max_length, model):
    instance = input_builder.build_inputs([], text, max_length=max_length, input_str=True)
    # One attention-mask entry per input token (all real tokens, no padding)
    instance["attention_mask"] = [1] * len(instance["input_ids"])
    # Inputs must live on the same device as the model
    model_device = next(model.parameters()).device
    for key in ["input_ids", "token_type_ids", "attention_mask"]:
        # Add a batch dimension (batch size = 1). Tensor.to() is not in-place,
        # so the moved tensor has to be assigned back.
        instance[key] = torch.tensor(instance[key]).unsqueeze(0).to(model_device)

    output = model(
        input_ids=instance["input_ids"],
        attention_mask=instance["attention_mask"],
        token_type_ids=instance["token_type_ids"]
    )
    return output

def run_student_reasoning_model(df, text_column, output_column='student_reasoning_score'):
    """
    Runs the student reasoning model on the given dataframe.
    Rows with fewer than STUDENT_REASONING_MIN_NUM_WORDS words are not scored (their score is None).
    :param df: dataframe with the data
    :param text_column: name of the column with the text
    :param output_column: name of the column with the output
    :return: the dataframe with the new output column, and a dict with the sum of the predicted labels
    """
    # Run inference
    print("Running inference...")
    student_reasoning_scores = []
    with torch.no_grad():
        total_rows = len(df)
        for i, (row_i, utt) in enumerate(df.iterrows()):
            print(f"Working on {i}/{total_rows}")
            num_words = len(utt[text_column].split(' '))
            if num_words < STUDENT_REASONING_MIN_NUM_WORDS:
                student_reasoning_scores.append(None)
                continue

            output = get_score(utt[text_column], max_length=REASONING_MAX_INPUT_LENGTH, model=student_reasoning_model)
            is_reasoning = output["logits"][0].argmax().item()
            student_reasoning_scores.append(is_reasoning)

    df[output_column] = student_reasoning_scores
    result = {output_column: df[output_column].sum()}
    return df, result
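
The `argmax` step in `run_student_reasoning_model` turns the model's logits into a single class index. The sketch below reproduces that step in plain Python on made-up logits. It assumes (the source does not confirm this) that index 1 means "reasoning", and the helper names `softmax` and `predict_label` are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return the index of the largest logit, as argmax does."""
    return max(range(len(logits)), key=lambda i: logits[i])

example_logits = [-1.2, 2.3]  # made-up values, not real model output
probs = softmax(example_logits)
label = predict_label(example_logits)
print(label, [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the logits, taking the `argmax` of the logits gives the same label as taking it over the probabilities.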

## Inference

Now we run inference: We predict student reasoning for the entire NCTE transcript dataset.
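
Utterances with fewer than `STUDENT_REASONING_MIN_NUM_WORDS` words are skipped rather than scored. Here is a small sketch of that filter on made-up utterances. One detail worth knowing: `str.split(' ')`, which `run_student_reasoning_model` uses, also counts the empty strings produced by repeated spaces, whereas `str.split()` does not, so the two can disagree on messy text. The helper names are hypothetical.

```python
MIN_NUM_WORDS = 8  # mirrors STUDENT_REASONING_MIN_NUM_WORDS

def count_words_single_space(text):
    """Word count as done in run_student_reasoning_model."""
    return len(text.split(' '))

def count_words_whitespace(text):
    """Word count that ignores repeated whitespace."""
    return len(text.split())

# Made-up utterances for illustration
utterances = [
    "Yes.",
    "I added the two fractions because they have the same denominator.",
    "It  is  four",  # note the double spaces
]

for utt in utterances:
    print(count_words_single_space(utt), count_words_whitespace(utt), repr(utt))

kept = [utt for utt in utterances if count_words_whitespace(utt) >= MIN_NUM_WORDS]
```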

In [None]:
import pandas as pd
import os

# First let's prepare the dataset we want to get student reasoning over.
# Get student reasoning candidates
transcripts = pd.read_csv(os.path.join(DIR_PATH, 'data/ncte_single_utterances.csv'))
student_reasoning_cands = transcripts[transcripts["speaker"].isin(["student", "multiple students"])]
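# Keep only utterances long enough to be scored; .copy() avoids a SettingWithCopyWarning when a column is added later
student_reasoning_cands = student_reasoning_cands[student_reasoning_cands["num_words"] >= STUDENT_REASONING_MIN_NUM_WORDS].copy()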
print(student_reasoning_cands.shape) # This should be 37272 rows.

In [None]:
student_reasoning_cands.head()

## IMPORTANT: The labeling takes a while to run. Do not close your laptop or let it sleep, otherwise you will lose your progress! Alternatively you can create a script that saves and reloads the data from its last copy.
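
One way to save and reload progress is to write the scores to disk every few rows and resume from the saved file. This is a minimal sketch under assumptions: `label_with_checkpoints`, `toy_score`, and the JSON checkpoint file are hypothetical names chosen for illustration, and `toy_score` stands in for the real model call.

```python
import json
import os
import tempfile

def label_with_checkpoints(texts, score_fn, checkpoint_path, save_every=100):
    """Score `texts` with `score_fn`, saving progress so an interrupted run can resume."""
    # Reload any scores saved by a previous run.
    scores = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            scores = json.load(f)

    # Continue from the first text that has no saved score.
    for i in range(len(scores), len(texts)):
        scores.append(score_fn(texts[i]))
        if (i + 1) % save_every == 0 or i + 1 == len(texts):
            with open(checkpoint_path, "w") as f:
                json.dump(scores, f)
    return scores

# Toy demonstration with a stand-in scoring function
texts = [f"utterance {i}" for i in range(10)]
calls = []

def toy_score(text):
    calls.append(text)  # record each call so we can see what was (re)scored
    return len(text) % 2

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scores.json")
    # Simulate a run that stopped after the first 6 texts.
    label_with_checkpoints(texts[:6], toy_score, path, save_every=2)
    # Resuming over the full list scores only the 4 remaining texts.
    scores = label_with_checkpoints(texts, toy_score, path, save_every=2)
```

In Colab, pointing `checkpoint_path` at a location inside your mounted GDrive folder means the file survives a runtime restart.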

In [None]:
PREDICTION_OUTPUT_COLUMN = "student_reasoning_score"
OUTPUT_PATH = os.path.join(DIR_PATH, 'data/labeled_student_reasoning.csv')

if os.path.exists(OUTPUT_PATH):
    print("Loading labeled student reasoning...")
    student_reasoning_cands = pd.read_csv(OUTPUT_PATH)
else:
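    student_reasoning_cands, prediction_scores = run_student_reasoning_model(
        df=student_reasoning_cands,
        text_column='text',
        output_column=PREDICTION_OUTPUT_COLUMN
    )
    # Save the labels so that re-running this cell loads them instead of re-labeling
    student_reasoning_cands.to_csv(OUTPUT_PATH, index=False)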


In [None]:
# Some examples
student_reasoning_cands.head()

Let's map the student reasoning scores back to the original transcript dataframe:

In [None]:
transcripts[PREDICTION_OUTPUT_COLUMN] = transcripts['comb_idx'].map(student_reasoning_cands.set_index('comb_idx')[PREDICTION_OUTPUT_COLUMN].to_dict())


In [None]:
transcripts.head()

Note how `student_reasoning_score` contains NaNs for the rows that were not considered reasoning candidates.

Let's set those to 0.

In [None]:
print(transcripts[PREDICTION_OUTPUT_COLUMN].isnull().sum())

print(transcripts[PREDICTION_OUTPUT_COLUMN].sum())

In [None]:
transcripts.loc[transcripts[PREDICTION_OUTPUT_COLUMN].isnull(), PREDICTION_OUTPUT_COLUMN] = 0

In [None]:
print(transcripts[PREDICTION_OUTPUT_COLUMN].isnull().sum())
transcripts.head() # No NaNs now
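
The mapping and NaN-filling steps above can be seen end to end on a toy example. The two small frames below are made-up stand-ins for `transcripts` and `student_reasoning_cands`; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for `transcripts` and `student_reasoning_cands` (made-up values)
all_rows = pd.DataFrame({"comb_idx": ["a", "b", "c", "d"]})
scored_rows = pd.DataFrame({
    "comb_idx": ["b", "d"],
    "student_reasoning_score": [1, 0],
})

# Series.map looks up each comb_idx; ids without a score become NaN.
lookup = scored_rows.set_index("comb_idx")["student_reasoning_score"]
all_rows["student_reasoning_score"] = all_rows["comb_idx"].map(lookup)
num_missing = int(all_rows["student_reasoning_score"].isnull().sum())

# fillna(0) has the same effect as the .loc assignment used above.
all_rows["student_reasoning_score"] = all_rows["student_reasoning_score"].fillna(0)
print(num_missing)
print(all_rows)
```

Setting non-candidates to 0 means that a later per-transcript mean is a proportion over all utterances in the transcript, not only over student utterances.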

### Linear regression

Great, now that we have predictions about student reasoning, we're interested in whether it's correlated with outcome measures. We can determine this using linear regression. Some outcome measures we might care about include the teacher's math instruction quality and value-added scores. We'll be using the math instruction quality for this part of the Colab, and you will use the value-added scores in the assignment.

We run a linear regression model, clustering standard errors at the teacher level.
The linear regression model is captured by this equation.

$$y = x \beta_1 + T\beta_2 + S\beta_3 + \epsilon$$

- $y$ is the dependent variable vector, shape: $\mathbb{R}^{n}$ where $n$ is the number of transcripts from this teacher.
- $x$ is the feature predictions (e.g., proportion of student reasoning), shape: $\mathbb{R}^{n}$.
- $T$ is a matrix of duplicated teacher covariates, shape: $\mathbb{R}^{n \times d_T}$ where $d_T$ is the number of teacher covariate features.
  - teacher gender (discrete)
  - teacher race/ethnicity (discrete)
  - teacher years of experience (discrete)
- $S$ is a matrix of student covariates in each transcript, shape: $\mathbb{R}^{n \times d_S}$ where $d_S$ is the number of student covariate features.
  - student gender
  - student race
  - student free or reduced lunch status
  - student special education status
  - student limited English proficiency
- $\beta_1 \in \mathbb{R}, \beta_2 \in \mathbb{R}^{d_T}, \beta_3 \in \mathbb{R}^{d_S}$ are the unknown parameters of the linear regression model ($\beta_1$ is a scalar because $x$ is a single feature)
- $\epsilon$ is the vector of residuals, shape $\mathbb{R}^n$
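
To make the shapes in the equation concrete, here is a sketch that simulates data from the model and recovers $\beta_1$ with ordinary least squares. Everything in it is made up for illustration (the sizes, the coefficients, the noise level), and it uses plain `numpy` least squares, not the estimator used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_T, d_S = 500, 3, 5  # made-up sizes

x = rng.uniform(0, 1, size=n)   # feature, shape (n,)
T = rng.normal(size=(n, d_T))   # teacher covariates, shape (n, d_T)
S = rng.normal(size=(n, d_S))   # student covariates, shape (n, d_S)

beta_1 = 2.0                    # a single number, since x is a single feature
beta_2 = rng.normal(size=d_T)
beta_3 = rng.normal(size=d_S)
eps = rng.normal(scale=0.1, size=n)

# The regression equation, term by term
y = x * beta_1 + T @ beta_2 + S @ beta_3 + eps

# Stack everything into one design matrix and solve the least-squares problem.
# The equation has no intercept term, so none is added here.
X = np.column_stack([x, T, S])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_1_hat = beta_hat[0]
print(beta_1_hat)
```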


Before we can run linear regression, we need to prepare our data to contain the variables of interest: $y, x, T, S$. Let's do that below and put everything into a single dataframe.

Note that the variables are on the (teacher, transcript) level. We'll structure the df appropriately.

In [None]:
# Preparing the metadata from the correct observation folder
fpath = os.path.join(DIR_PATH,'data/ICPSR_36095/DS0002/36095-0002-Data.tsv')
mqi_metadata = pd.read_csv(fpath, sep='\t')

# We want to prepare the teacher metadata from HW1 too --- these are our teacher covariates
# We are interested in: male black white asian hisp experience raceother
fpath = os.path.join(DIR_PATH,'data/ICPSR_36095/DS0006/36095-0006-Data.tsv')
teacher_metadata = pd.read_csv(fpath, sep='\t')

# And finally, we also want to prepare the student metadata.
# We are interested in: s_male s_afam s_white s_hisp s_asian s_race_other s_frpl s_sped s_lep
fpath = os.path.join(DIR_PATH,'data/ICPSR_36095/DS0005/36095-0005-Data.tsv')
student_metadata = pd.read_csv(fpath, sep='\t')

# Our feature will be proportion of student reasoning.

In [None]:
print(transcripts.columns)
print(mqi_metadata.columns) # We're going to use `CHAPNUM` to determine the proportion of reasoning.
print(teacher_metadata.columns)
print(student_metadata.columns)

print(mqi_metadata['NCTETID'].unique())
print(teacher_metadata['NCTETID'].unique())
print(student_metadata['NCTETID'].unique())

In [None]:
print(mqi_metadata['CHAPNUM'].unique())
mqi_metadata[mqi_metadata['NCTETID'] == 2501][['OBSID', 'CHAPNUM', 'MQI5']]

In [None]:
mqi_metadata.loc[mqi_metadata['CHAPNUM'] == ' ', 'CHAPNUM'] = 0
# Convert CHAPNUM to int
mqi_metadata['CHAPNUM'] = mqi_metadata['CHAPNUM'].astype(int)
mqi_metadata = mqi_metadata[~mqi_metadata['MQI5'].isin([999, 998])]

In [None]:
mqi_metadata['MQI5'].unique()

Now let's construct a df that merges the outcomes, features, teacher metadata and student metadata all into one.

In [None]:
OUTCOME_KEY = 'MQI5'
TEACHER_ID_KEY = 'NCTETID'
TRANSCRIPT_ID_KEY = 'OBSID'
NUM_SEGMENT_KEY = 'CHAPNUM'
STUDENT_REASONING_KEY = PREDICTION_OUTPUT_COLUMN  # the predictions live in 'student_reasoning_score'
regression_df = []

for teacher_id in mqi_metadata[TEACHER_ID_KEY].unique():
  # Get all the mqi data from this teacher specifically
  teacher_mqi_df = mqi_metadata[mqi_metadata[TEACHER_ID_KEY] == teacher_id]
  for transcript_id in teacher_mqi_df[TRANSCRIPT_ID_KEY].unique():
    teacher_transcript_df = teacher_mqi_df[teacher_mqi_df[TRANSCRIPT_ID_KEY] == transcript_id]

    # Determine y
    y = teacher_transcript_df[OUTCOME_KEY].mean()

    if not len(transcripts[(transcripts[TRANSCRIPT_ID_KEY] == transcript_id)]):
      continue

    # Determine x
    x = transcripts[(transcripts[TRANSCRIPT_ID_KEY] == transcript_id)][PREDICTION_OUTPUT_COLUMN].mean()

    if teacher_id not in teacher_metadata[TEACHER_ID_KEY].unique():
      continue
    if teacher_id not in student_metadata[TEACHER_ID_KEY].unique():
      continue

    # Get teacher covariates
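    # Copy so that the .loc assignments below modify this subset and not a view of teacher_metadata
    teacher_metadata_df = teacher_metadata[teacher_metadata[TEACHER_ID_KEY] == teacher_id].copy()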
    # Make blank answers into 0
    teacher_metadata_df.loc[teacher_metadata_df['MALE'] == ' ', 'MALE'] = 0
    teacher_metadata_df.loc[teacher_metadata_df['BLACK'] == ' ', 'BLACK'] = 0
    teacher_metadata_df.loc[teacher_metadata_df['WHITE'] == ' ', 'WHITE'] = 0
    teacher_metadata_df.loc[teacher_metadata_df['HISP'] == ' ', 'HISP'] = 0
    teacher_metadata_df.loc[teacher_metadata_df['ASIAN'] == ' ', 'ASIAN'] = 0
    teacher_metadata_df.loc[teacher_metadata_df['EXPERIENCE'] == ' ', 'EXPERIENCE'] = 0
    teacher_metadata_df.loc[teacher_metadata_df['RACEOTHER'] == ' ', 'RACEOTHER'] = 0
    male = teacher_metadata_df['MALE'].mean()
    black = teacher_metadata_df['BLACK'].mean()
    white = teacher_metadata_df['WHITE'].mean()
    asian = teacher_metadata_df['ASIAN'].mean()
    hisp = teacher_metadata_df['HISP'].mean()
    experience = teacher_metadata_df['EXPERIENCE'].mean()
    raceother = teacher_metadata_df['RACEOTHER'].mean()

    # Get student covariates
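    # Copy so that the .loc assignments below modify this subset and not a view of student_metadata
    student_metadata_df = student_metadata[student_metadata[TEACHER_ID_KEY] == teacher_id].copy()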
    # Make blank answers = 0
    student_metadata_df.loc[student_metadata_df['S_MALE'] == ' ', 'S_MALE'] = 0
    student_metadata_df.loc[student_metadata_df['S_AFAM'] == ' ', 'S_AFAM'] = 0
    student_metadata_df.loc[student_metadata_df['S_WHITE'] == ' ', 'S_WHITE'] = 0
    student_metadata_df.loc[student_metadata_df['S_HISP'] == ' ', 'S_HISP'] = 0
    student_metadata_df.loc[student_metadata_df['S_ASIAN'] == ' ', 'S_ASIAN'] = 0
    student_metadata_df.loc[student_metadata_df['S_RACE_OTHER'] == ' ', 'S_RACE_OTHER'] = 0
    student_metadata_df.loc[student_metadata_df['S_FRPL'] == ' ', 'S_FRPL'] = 0
    student_metadata_df.loc[student_metadata_df['S_SPED'] == ' ', 'S_SPED'] = 0
    student_metadata_df.loc[student_metadata_df['S_LEP'] == ' ', 'S_LEP'] = 0

    s_male = student_metadata_df['S_MALE'].astype(int).mean()
    s_afam = student_metadata_df['S_AFAM'].astype(int).mean()
    s_white = student_metadata_df['S_WHITE'].astype(int).mean()
    s_hisp = student_metadata_df['S_HISP'].astype(int).mean()
    s_asian = student_metadata_df['S_ASIAN'].astype(int).mean()
    s_race_other = student_metadata_df['S_RACE_OTHER'].astype(int).mean()
    s_frpl = student_metadata_df['S_FRPL'].astype(int).mean()
    s_sped = student_metadata_df['S_SPED'].astype(int).mean()
    s_lep = student_metadata_df['S_LEP'].astype(int).mean()

    # Add to regression_df
    result = {
        'y': y,
        'x': x,
        'male': male,
        'black': black,
        'white': white,
        'asian': asian,
        'hisp': hisp,
        'experience': experience,
        'raceother': raceother,
        's_male': s_male,
        's_afam': s_afam,
        's_white': s_white,
        's_hisp': s_hisp,
        's_asian': s_asian,
        's_race_other': s_race_other,
        's_frpl': s_frpl,
        's_sped': s_sped,
        's_lep': s_lep,
        'teacher_id': teacher_id
    }

    regression_df.append(result)


In [None]:
print(len(regression_df))
print(regression_df[0])
regression_df = pd.DataFrame(regression_df)
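
In the loop above, `x` is the mean of a 0/1 column within one transcript, which is the proportion of that transcript's utterances labeled as reasoning. The sketch below shows the same computation in vectorized form on made-up data; it is an illustration of the idea, not a replacement for the loop, which also gathers the covariates.

```python
import pandas as pd

# Made-up utterance-level labels for two transcripts
toy = pd.DataFrame({
    "OBSID": [1, 1, 1, 1, 2, 2],
    "student_reasoning_score": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column per transcript is the proportion of 1s.
x_by_transcript = toy.groupby("OBSID")["student_reasoning_score"].mean()
print(x_by_transcript)
```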

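To run linear regression, we are going to use the `statsmodels` package. `mixedlm` fits a linear mixed-effects model; passing the teacher id as `groups` gives each teacher a random intercept, which accounts for several transcripts coming from the same teacher.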

In [None]:
import statsmodels.formula.api as sm

result = sm.mixedlm(
    formula="y ~ x + male + experience + raceother + s_male",
    data=regression_df,
    groups=regression_df['teacher_id'],
).fit()

In [None]:
result.summary()
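
The text above mentions clustering standard errors at the teacher level, while the cell uses `mixedlm`; the two are related ways of handling repeated observations per teacher, but they are not the same estimator. To illustrate what clustering means, here is a hand-rolled sketch of OLS with cluster-robust (sandwich) standard errors in `numpy`. The function name `ols_cluster_se` and all data are made up, and no small-sample correction is applied, so the numbers would differ slightly from a statistics package. In `statsmodels`, the comparable option is fitting an OLS model with `fit(cov_type="cluster", cov_kwds={"groups": ...})`.

```python
import numpy as np

def ols_cluster_se(X, y, groups):
    """OLS coefficients with cluster-robust standard errors (no small-sample correction)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)

    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta

    # "Meat" of the sandwich: sum over clusters of (X_g' u_g)(X_g' u_g)'
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        in_g = groups == g
        score_g = X[in_g].T @ resid[in_g]
        meat += np.outer(score_g, score_g)

    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))

# Made-up data: 20 teachers, 5 transcripts each, with a shared per-teacher effect
rng = np.random.default_rng(0)
teacher_id = np.repeat(np.arange(20), 5)
teacher_effect = rng.normal(size=20)[teacher_id]
x = rng.uniform(0, 1, size=teacher_id.size)
y = 1.0 + 2.0 * x + teacher_effect + rng.normal(scale=0.5, size=teacher_id.size)

X = np.column_stack([np.ones_like(x), x])  # intercept column plus the feature
beta, se = ols_cluster_se(X, y, teacher_id)
print(beta, se)
```

Clustering leaves the coefficient estimates unchanged and only alters their standard errors, allowing residuals from the same teacher to be correlated.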

## IMPORTANT: REPORT THE COEFFS UNDER THE ASSIGNMENT SECTION

## Assignment

0. Report the coefficient from above.
1. Run linear regression where you specify all the covariates (teacher and student, except for `raceother` and `s_race_other`) and report the feature coefficients.
2. Run linear regression where the feature is now student words (proportion %) and report the feature coefficients.

For each, report the correlation of the feature with the outcome, **Math Instruction Quality (MQI5)**.


### 0. REPORT THE COEFF HERE

Coeff with student reasoning: **REPLACE ME**

### 1. Run linear regression where you specify all the covariates (except for `raceother` and `s_race_other`) & report feature coefficients.

In [None]:
# REPLACE ME WITH CODE

In [None]:
# REPLACE ME WITH FEATURE COEFFICIENT

### 2. Run linear regression where the feature is now

- Student words (proportion %)

In [None]:
# REPLACE ME WITH CODE

In [None]:
# REPLACE ME WITH FEATURE COEFFICIENT

## Extra Assignment

Here are some things you can do on the extra assignment.
Note these problems are more open-ended than the previous problems.  
You will be graded on whatever is pasted BELOW this section.
Please specify which extra problems you are doing in your solution and be clear about your approach.


1. Determine the predictors of high student reasoning (hint: variance decomposition).
2. Characterize a temporal trend in the student reasoning data.