<a href="https://colab.research.google.com/github/victoria-sharpe/portfolio/blob/main/GPT_NeoX_Probabilities_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Description
This script computes word-level log probabilities, based on token probabilities extracted from the LLM GPT-NeoX, for any given text.

Probabilities can be computed based on the full context, or based on a range of context lengths. For example, for the text "I went to the store last Thursday", we could compute the probability of "Thursday" given a context length of 2, P("Thursday" | "store last"), or a context length of 3, P("Thursday" | "the store last"), and so on.
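
To make the notion of context length concrete, here is a minimal sketch of how the context for a target word could be sliced out of a text. The helper name `context_for` is hypothetical and is not part of this script; it illustrates only the windowing, not the model scoring.

```python
def context_for(text, target_idx, context_len):
    """Return the `context_len` words that precede the word at `target_idx`.

    Hypothetical helper for illustration; words are split on whitespace,
    as in the rest of this script.
    """
    words = text.split()
    start = max(0, target_idx - context_len)  # clip at the start of the text
    return " ".join(words[start:target_idx])


text = "I went to the store last Thursday"
target_idx = text.split().index("Thursday")

print(context_for(text, target_idx, 2))  # store last
print(context_for(text, target_idx, 3))  # the store last
```

If fewer than `context_len` words are available, this sketch simply returns whatever precedes the target word.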

For theory and use cases, see:
<br>
Sharpe, V., Mackinley, M., Eddine, S. N., Wang, L., Palaniyappan, L., & Kuperberg, G. R. (2025). Selective insensitivity to global vs. local linguistic context in speech produced by patients with untreated psychosis and positive thought disorder. Biological Psychiatry.

For example output, see .




In [1]:
%%capture
## Install packages if needed

!pip install transformers
!pip install minicons
!pip install pandas
!pip install numpy


In [2]:
%%capture
## Load necessary packages

import numpy as np
import pandas as pd

from minicons import scorer




In [3]:

%%capture

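## Load the LLM
## Note: the checkpoint ID below is GPT-Neo 1.3B; change the ID to score with a different Hugging Face causal LM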
model = scorer.IncrementalLMScorer("EleutherAI/gpt-neo-1.3B")


In [87]:
## Create a function to convert extracted token log probabilities to word-level log probabilities (referred to here as "cloze" probabilities)
## We do this using the chain rule for joint probabilities: P(A & B) = P(A) * P(B | A), so log P(A & B) = log P(A) + log P(B | A)

def get_cloze(text, log_prob_data):

    """Combines token-level log probabilities into word-level log probabilities.

    Parameters:
      text (str): text for which token log probability values were obtained
      log_prob_data: output of model.token_score(text)

    Returns:
      list of word-level log probabilities, one per whitespace-separated word
      (the first entry is NaN, because the first word has no context)

   """
    text_list = text.split() #split text into list of words
    text_list = [text_list[0]] + [' ' + x for x in text_list[1:]] #prefix every word after the first with a space
    probs = [] #create empty list to begin tracking word-level probabilities
    word_count = 0 #track number of words accounted for

    token_prob = 0 #dummy probability to start us out
    placeholder = "" #dummy text

    #parse nested arrays of tokens and probabilities
    token_list =[inner_tuple[0] for inner_list in log_prob_data for inner_tuple in inner_list]
    prob_list = [inner_tuple[1] for inner_list in log_prob_data for inner_tuple in inner_list]

    #loop through tokens, matching to words and combining probabilities as we go
    for prob, token in zip(prob_list, token_list):
        placeholder += token.strip() #add the current token to placeholder text

        #if placeholder text makes a full word from text, add word prob to list
        #then, reset to start a new word
        #('Ġ' is the marker that GPT-2-style byte-level BPE tokenizers use for a leading space)
        if placeholder.replace('Ġ', ' ') == text_list[word_count]:
          if pd.isna(prob):
            probs.append(prob) #special case: if first token, prob = NA

          else:
            probs.append(prob + token_prob) #append combined prob to list

          placeholder = ""
          token_prob = 0
          word_count += 1

        #otherwise, add the current log prob to token_prob and keep going
        else:
          if pd.isna(prob):
            token_prob = 0 #special case: first token has no log prob; the first word is set to NaN below

          else:
            token_prob += prob #add current log prob to previous


    probs[0] = np.nan  #set prob of first word (which has no context) to NaN
    #return the word-level log probabilities as a list (one entry per word)
    return probs
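
The idea of combining token probabilities into a word probability can be checked with a small worked example. The sketch below sums token log probabilities for a word split into three tokens and confirms that this matches multiplying the raw probabilities. The token strings and values are invented for illustration and are not model output.

```python
import math

# Invented token-level log probabilities for one word split into three tokens.
# These values are illustrative only; they are not output from the model.
token_log_probs = [("Gats", -6.0), ("by", -0.5), ("'s", -1.5)]

# Chain rule in log space: the word's log probability is the sum of its
# token log probabilities, each conditioned on everything before it.
word_log_prob = sum(lp for _, lp in token_log_probs)

# Equivalent computation in probability space: multiply, then take the log.
word_prob = math.prod(math.exp(lp) for _, lp in token_log_probs)

print(word_log_prob)  # -8.0
print(math.isclose(math.log(word_prob), word_log_prob))  # True
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.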


In [84]:

def main(text, context_range = (1,50), compute_full_context = True):
    """Computes log probabilities for each word in a given text, based on a range of context lengths and/or full context.
    For efficiency, each "moving window" of text is scored once and its probabilities are assigned on the diagonal.
    E.g., for a moving window of M words, we get the probability of word M given context length M-1,
    the probability of word M-1 given context length M-2, the probability of word M-2 given
    context length M-3, and so on.

    Parameters:
      text (str): string to obtain probabilities for
      context_range (tuple): of form (min_context, max_context),
        where min_context is the minimum context length to compute probabilities for &
        max_context is the maximum context length to compute probabilities for.
        If None, do not compute probabilities for a range of context lengths.
      compute_full_context (bool): if True, compute probabilities for full context, regardless of length

    Returns:
      pandas dataframe with word-level probabilities for each word in text, for each requested context length

   """
  #set up dataframe with one word per row
    text_list = text.split()
    df = pd.DataFrame({'word': text_list, 'word_idx': range(0, len(text_list))})

  #compute probabilities based on full context
    if compute_full_context:
      token_logProbs  = model.token_score(text)
      df['cloze_fullContext'] = get_cloze(text, token_logProbs)


  #compute probabilities based on range of context lengths
    if context_range is not None:
      min_context = context_range[0]
      max_context = context_range[1]

    #columns (one for each context length) are created on the fly as values are assigned below

      start = 0 #our context window begins at the first word
      if len(text_list) >= max_context + 1:
        end = max_context #if the utterance is long enough, window length is max_context + 1
      else:
        end = len(text_list) - 1  #otherwise window ends at the end of the utterance

      while start <= end - min_context:
        sub_text = text_list[start:end+1]
        orig_idx = list(range(start, end+1)) #indices of the window's words within the full text

        # Compute the token-level log probabilities
        token_logProbs  = model.token_score(' '.join(sub_text))

        # Transform the token-level log probabilities into word-level probabilities
        word_logProbs = get_cloze(' '.join(sub_text), token_logProbs)

        # Assign probabilities on the diagonal to their correct location within the df
        for orig_i, new_i in zip(orig_idx[1:], range(1,len(word_logProbs))):
          colname = "cloze_" + str(new_i)+"WordContext"
          df.loc[orig_i, colname] = word_logProbs[new_i]

        start += 1

        if end < len(text_list) - 1: #slide the window end forward, stopping at the last word
          end += 1


    return df
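
The diagonal assignment can be hard to picture, so here is a small sketch that lists which (word index, context length) pairs each window contributes, without calling the model. The function name `window_assignments` is hypothetical, and the window bounds reflect an assumption about the intended behavior: the window end stops at the last word of the text.

```python
def window_assignments(n_words, min_context, max_context):
    """List (window_start, word_idx, context_len) triples for a sliding window.

    Hypothetical illustration: each window spans words start..end (inclusive),
    and the word at offset k within the window is scored with k words of context.
    """
    triples = []
    start = 0
    end = min(max_context, n_words - 1)  # a window holds at most max_context + 1 words
    while start <= end - min_context:
        for context_len in range(1, end - start + 1):
            triples.append((start, start + context_len, context_len))
        start += 1
        end = min(end + 1, n_words - 1)  # slide forward, stopping at the last word
    return triples


for triple in window_assignments(n_words=5, min_context=1, max_context=2):
    print(triple)
```

In this toy case, every word from index 1 onward appears once for each context length it can support, so each window needs to be scored only once.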

In [88]:
text = "And as I sat there brooding on the old, unknown world, I thought of Gatsby's wonder when he first picked out the green light at the end of Daisy's dock."

d = main(text, context_range = (1,10), compute_full_context = True)
print(d)
d.to_csv('GPTNeoX_Pipeline_Example_Output.csv')

        word  word_idx  cloze_fullContext  cloze_1WordContext  \
0        And         0                NaN                 NaN   
1         as         1          -5.448388           -5.448388   
2          I         2          -2.447530           -6.273293   
3        sat         3          -3.940406           -8.431185   
4      there         4          -0.891540           -8.010176   
5   brooding         5          -7.087817          -17.254023   
6         on         6          -2.833630           -3.709713   
7        the         7          -1.092057           -5.068345   
8       old,         8         -10.124781          -11.098182   
9    unknown         9          -7.632751           -9.746181   
10    world,        10          -4.988406          -10.587034   
11         I        11          -1.442773           -4.400987   
12   thought        12          -2.347779           -5.856423   
13        of        13          -0.912090           -3.284716   
14  Gatsby's        14   