# **Script for Computing Surprisal Scores for the EXNAT studies using GPT-2**

# Author: Merle Schuckart
# Version: May 2023

### Settings



In [1]:
%%capture
# (disable %%capture to print output in console)

import math # for mathematical functions
import numpy as np # also for mathematical functions
import pandas as pd # for dfs
import csv # for reading in csv properly
from google.colab import files # for downloading files

# ML modules:
!pip install transformers

import tensorflow as tf
# use AutoModelForCausalLM instead of AutoModelWithLMHead because AutoModelWithLMHead will be deprecated soon-ish?
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

### Load & Customize Model

In [2]:
%%capture
# (disable %%capture to print output in console)

# download pre-trained German GPT-2 model & tokenizer from the Hugging Face model hub
tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")

# initialise the model, use end-of-sequence (EOS) token as padding tokens
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2", pad_token_id = tokenizer.eos_token_id)


###Demo:
Mask tokens by setting their attention weights to 0 on all layers


In [None]:
# set the input text and the number of words to mask
input_text = ["Orlando", "liebte", "von", "Natur", "aus", "einsame", "Orte,", "weite", "Ausblicke", "und", "das", "Gefühl,", "für", "immer", "und", "ewig"] # allein zu sein.
n = 5  # I want to mask the last n words

# get the last n words and the number of tokens to mask
masked_words = input_text[-n:]
masked_tokens = tokenizer.tokenize(" ".join(masked_words))
print("masking the following", n, "words now:", masked_words)


# 1 word can consist of many tokens, so get last n words
# and tokenize them so we know how many tokens we have to mask:
masked_words = input_text[-n:]
n_tokens = len(tokenizer.tokenize(" ".join(masked_words)))
print(" ".join(masked_words))


# encode the full input text
encoded_input = tokenizer(" ".join(input_text), return_tensors="pt")
# get attention mask
curr_attention_mask = encoded_input.attention_mask.clone()

# modify the original attention mask: set weights to 0 where the input should be masked
# --> I think if we set this attention mask in the .generate() call, we set this for all layers.
curr_attention_mask[:, -(n_tokens):] = 0
print("Attention mask: ", curr_attention_mask)

# generate prediction, but use modified attention mask:
output = model.generate(encoded_input['input_ids'],
                        attention_mask = curr_attention_mask,
                        max_new_tokens = 10) # generate 10 additional tokens
# decode prediction
predicted_text = tokenizer.decode(output[0], skip_special_tokens = True)
print("Predicted text: ", predicted_text)


masking the following 5 words now: ['Gefühl,', 'für', 'immer', 'und', 'ewig']
Gefühl, für immer und ewig
Attention mask:  tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])
Predicted text:  Orlando liebte von Natur aus einsame Orte, weite Ausblicke und das Gefühl, für immer und ewig die Natur.
Er war ein großer Künstler,


### Load Texts

The csv contains 4 columns:

word_punct = single words with punctuation

word_no_punct = single words without punctuation (exception: dashes between 2 words)

text_nr = Number of text, text_01 - text_09

trial_nr = Index of word in the text, starting with 1. There are 300 words (aka trials) in each text.
________
Hint: Make sure you upload the csv in Google Colab before trying to read it in. You can upload the csv by clicking the file icon in the side bar.

In [24]:
# read in csv with texts
texts_df = pd.read_csv('/content/Texts_surprisal_scores.csv',
                       sep = ",")#,
                       #quoting = csv.QUOTE_NONE, # don't put quotes around fields
                       #quotechar = None) # no specific character used for enclosing fields with special characters

# Check dataframe - show all rows and columns in console:
#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)
#print(texts_df["word_punct"])

# get unique text numbers in texts_df
text_nrs = list(set(texts_df["text_nr"]))

# if you want to get only a subset of the texts, uncomment this line and
# add all texts you want to get surprisal scores for:
#text_nrs = ['text_10']

print("preparing surprisal scores for the following texts:")
print(text_nrs)

preparing surprisal scores for the following texts:
['text_10']


###Compute Surprisal Scores for each Word

Loop texts in Text_surprisal_scres.csv. For each word, get chunk with x previous words and calculate probability for predicting the actual word in the text.

If there are lower time scales (i.e. you have time scales with shorter context chunks), mask the already known part of the input.
_______
####Example:
This is our whole sentence:

"Sphinx of black quartz, judge my vow."

We want to get a surprisal score for the last word "vow" on time scales 2 and 4. To do so, we have to predict the last word, given the previous n words (with n being 2 for time scale 2 and 4 for time scale 4).

So for time scale 2, our context chunk would be: "judge my"
For time scale 4, the context chunk would be the last 4 words: "black quartz, judge my"

As you can see, we have redundant information here: The last 2 words are used to generate predictions on both time scales, so in a way, time scale 4 contains time scale 2. Not ideal.

Now we have 2 options:

1. We could just leave out the last 2 words for time scale 4 and just use "black quartz," as an input. There are 2 problems with this: a) The model doesn't "know" there is something missing and it has to predict not the next word, but the third next word, and b) the surprisal score on TS 4 for the current word at index n would be the same as the surprisal score on time scale 2 for the word at index n-2, which is also a bit weird.

2. We could mask the redundant information. To do so, we set the attention weights for the redundant words to 0, which means they are still in the context chunk, but they are not providing information for the prediction of the next word. This is what I did here.



In [None]:
""" Compute surprisal scores with different context chunk sizes """
# (running this takes a few hours)

# define your time scales (aka context chunk lengths) here
time_scale = [1, 4, 12, 60]

# Prepare list for collecting surprisal scores on each time scale
for i in range(len(time_scale)):
    # create list called surprisal_n with n being
    # one of the 9 time scales defined above
    exec(f"surprisal_{time_scale[i]} = []")
# test if it worked: Should return an empty list
#print(surprisal_1)

# get unique text numbers in texts_df
#text_nrs = list(set(texts_df["text_nr"]))

print("preparing surprisal scores for the following texts:")
print(text_nrs)


""" Loop time scales """
for curr_ts_idx, curr_ts in enumerate(time_scale):

  print("\n\n\nStarting to compute surprisal scores for TS", curr_ts)

  # get correct surprisal score list for current TS
  curr_TS_list = eval(f"surprisal_{curr_ts}")

  # Set number of words to mask in the context chunks
  # this depends on the length of the last TS (e.g. if the current TS
  # is 64 and the last one is 32, we have context chunks
  # of 64 words with the last 32 words being masked)
  # n is the number of words that should be masked here.

  # If the current TS is the first one, there is no context
  # overlap with previous TSs (because there aren't any),
  # so don't mask any of the previous words
  if curr_ts_idx == 0:
    n = 0
  # if there are previous TSs though, mask as many words as there
  # were words in the previous TS's context chunk.
  else:
    n = time_scale[curr_ts_idx-1]

  # Hint: As each word consists of multiple tokens, we have to
  # find out how many tokens have to be masked for each new
  # context chunk. This will be done later in this script.

  """ Loop texts """
  for text_nr in text_nrs:
    # get subset of df with current text
    curr_text = texts_df[texts_df["text_nr"] == text_nr]
    print(text_nr)

    # The size of the context chunks equals the time scale, e.g. for TS 16
    # we have a context chunk of the 16 words before the word we want to predict.
    # As we always need a certain context size, the first few words from each text
    # don't have the necessary context chunk length and consequently can't get surprisal scores.
    # So assign None values instead, so all surprisal score lists have the same length.
    curr_TS_list.extend([None] * curr_ts)

    """ Loop words & calculate surprisal score for each one """

    # loop words, start at word with index = TS
    # (e.g. for TS = 64, start with the 65th word)
    for word_idx in range(curr_ts, len(curr_text["word_punct"])):

        """ prepare context chunk """
        # get n previous words (with n = TS, e.g. 64 for TS 64)
        previous_words = list(curr_text["word_punct"])[word_idx - curr_ts : word_idx]
        #print("previous words", previous_words)

        # generate token ids for each of the x previous words
        encoded_input = tokenizer(" ".join(previous_words), # input chunk as 1 text string
                                  add_special_tokens = False, # don't add special tokens in the output
                                  return_tensors = "pt") # return as tensor object

        # get token IDs, put them in a list and in a tensor
        ids_tensor = encoded_input['input_ids']
        ids_list = ids_tensor[0].tolist()

        # get original attention mask
        curr_attention_mask = encoded_input.attention_mask.clone()

        # check which TS index we have here: If it's the
        # first TS, we don't have to mask anything.
        if n == 0:
          print("first TS - not masking any words!")
          unmasked_tokens = tokenizer.tokenize(" ".join(previous_words))
          print("unmasked tokens:", unmasked_tokens)

        # if there are previous TSs though, mask as many words as there
        # were words in the previous TS's context chunk.
        else:
          unmasked_words = previous_words[:-n]
          unmasked_tokens = tokenizer.tokenize(" ".join(unmasked_words))
          print("unmasked tokens:", unmasked_tokens)

          # get the last n words and the number of tokens to mask
          masked_words = previous_words[-n:]
          masked_tokens = tokenizer.tokenize(" ".join(masked_words))
          n_tokens = len(masked_tokens)
          print("masking the following", n, "words now:", masked_words, " - list of masked tokens:", masked_tokens)

          # modify the original attention mask: set weights to 0 where the input should be masked
          # --> I think if we set this attention mask in the .generate() call, we set this for all layers.
          curr_attention_mask[:, -n_tokens:] = 0

        print("attention mask: ", curr_attention_mask)

        """ prepare actual next word """
        # We should also predict punctuation.
        # It's not like the words are shown without punctuation on screen.
        actual_word = list(curr_text["word_punct"])[word_idx]
        print("actual word:", actual_word)

        # Problem: actual word might have multiple token IDs
        # --> get all IDs for current word
        act_word_id = tokenizer.encode(actual_word)
        # Output looks somewhat like this: [44, 305, 479, 5283]

        """ loop IDs of current word """
        curr_id_probs = []
        for curr_id in act_word_id:

            print("computing probability for ID " +  str(curr_id) + " of word " + actual_word)

            # generate probabilities for each possible token being the actual next token
            # --> use modified attention mask:
            output = model.generate(ids_tensor,
                                    attention_mask = curr_attention_mask,
                                    return_dict_in_generate = True,
                                    output_scores = True,
                                    max_new_tokens = 1) # set output length here - 1 because I only want 1 token

            # read out probabilities for all IDs
            logits = output.scores[0] # logits = probabilities with range [0,1] transformed to range [inf, -inf]
            probs = tf.nn.softmax(logits) # transform logits back to probabilities

            # get probability for actual ID being the next one & append it to
            # array with probabilities of all IDs for current word
            curr_id_probs.append(probs.numpy()[0][curr_id])

            # append current token ID to list of previous words (if there are any)
            # reason: The previous parts of the word are part of the context.
            ids_list = ids_list + [curr_id]
            ids_tensor = torch.tensor(ids_list).unsqueeze(0) # put the token IDs into a tensor

            # -------------------
            """ How do I deal with target words that consist of several tokens? """
            # (This is more an explanation for future-me, because I will definitely forget
            # I even thought about how to deal with this question.)

            # If the to-be-predicted word consists of several tokens,
            # I have to predict each of them and give the previous
            # tokens of the word to the input chunk.

            # Example (I made up the tokens btw): El-ef-ant
            # In this case, I generate a surprisal score for "El" given
            # the input chunk with n masked words. Then I generate surprisal
            # scores for the next token "ef" given the  input chunk with n masked
            # words and the previous token of the current word ("El"), but
            # I won't mask "El" here.

            # Reason:
            # I want 1 surprisal score for 1 word, so even if this isn't quite true,
            # 1 word means 1 prediction for me (at least in the brain, I don't think it tokenizes words
            # like GPT-2 does, at least not in the exact same way).
            # So I think you should get all previous tokens from the current word without masking them.
            # Another thing is that if I masked the start of the word itself, too,
            # for some words, my suprisal scores would go completely through the roof,
            # just because they are longer than others.

            # -------------------

            # Long story short: Add 1 token to the end of the attention
            # mask as well, but don't mask the new one (so the attention mask has a value of 1 there).
            additional_token_mask = torch.ones((1, 1), dtype = torch.long)
            curr_attention_mask = torch.cat([curr_attention_mask, additional_token_mask], dim = 1)
            print("attention mask:", curr_attention_mask)

        # multiply all probabilities for current word:
        act_word_prob = np.prod(curr_id_probs)

        # transform probability value into surprisal score (negative log of the probability)
        # negative log = log(1 / x) with x being the value you want to get the negative log of.
        # I use e as a base value for the log here.
        surprisal_score = np.log( 1 / act_word_prob )

        # if surprisal score == Inf, set surprisal score to 100
        if surprisal_score == float('inf'):
          surprisal_score = 100

        # collect surprisal score in array for all surprisal scores of current time scale
        curr_TS_list.append(surprisal_score)
        print("\nSurprisal score for actual word " + actual_word +" is " + str(surprisal_score) + ".")
        print("TS = " + str(curr_ts) + "- Text Nr = " + str(text_nr) + " - Trial Nr = " + str(word_idx))
        print(" --------------- ")

print("\n\n\nfinished computing surprisal scores\n\n")


# append new surprisal score columns to texts_df
texts_df = texts_df.assign(surprisal_1  = surprisal_1,
                           surprisal_4  = surprisal_4,
                           surprisal_12  = surprisal_12,
                           surprisal_60  = surprisal_60)


""" download texts_df as surprisal_scores_masked_context.csv """
texts_df.to_csv("surprisal_scores_masked_context.csv", encoding = "utf-8-sig")
files.download("surprisal_scores_masked_context.csv")



In [27]:
# In case the chunk before throws an error because you used only a subset of the texts,
# you can run this to save the data:

#texts_df = texts_df[texts_df['text_nr'] == 'text_10']

#texts_df = texts_df.assign(surprisal_1  = surprisal_1,
#                           surprisal_4  = surprisal_4,
#                           surprisal_12  = surprisal_12,
#                           surprisal_60  = surprisal_60)

#texts_df.to_csv("surprisal_scores_masked_context.csv", encoding = "utf-8-sig")
#files.download("surprisal_scores_masked_context.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
""" Test cases to check whether the attention mask is set on
    all layers and the predictions are generated as intended.
"""


""" TEST 1 """
# Get attention weights from all layers and see whether
# they are set to 0 on all layers for the tokens I want to mask.

def print_layer_attention_weights(previous_words, n, model):

    # generate token ids for each of the x previous words
    encoded_input = tokenizer(" ".join(previous_words),
                              add_special_tokens = False,
                              return_tensors = "pt")

    # get token IDs, put them in a list and in a tensor
    ids_tensor = encoded_input['input_ids']
    ids_list = ids_tensor[0].tolist()
    print("input IDs:", ids_list)

    # get original attention mask
    curr_attention_mask = encoded_input.attention_mask.clone()

    # mask the last n words
    unmasked_words = previous_words[:-n]
    unmasked_tokens = tokenizer.tokenize(" ".join(unmasked_words))

    # get the last n words and the number of tokens to mask
    masked_words = previous_words[-n:]
    masked_tokens = tokenizer.tokenize(" ".join(masked_words))
    n_tokens = len(masked_tokens)
    print("nr of masked tokens:", n_tokens)

    # modify the original attention mask: set weights to 0 where the input should be masked
    curr_attention_mask[:, -n_tokens:] = 0
    print("attention mask:", curr_attention_mask)

    # Generate model output
    output = model.generate(ids_tensor,
                            attention_mask = curr_attention_mask,
                            return_dict_in_generate = True, # for some reason I need to include this or I don't get attention weights in the output.
                            output_attentions = True,
                            max_new_tokens = 1)

    generated_tokens = output.sequences
    attentions = output.attentions

    #print("\ninput tokens + predicted token(s):", generated_tokens)
    #print("\n\nattention weights:\n", attentions)

    # average the attention weights over the generated tokens for each layer:
    layer_attention_weights = []
    for i, layer_attentions in enumerate(attentions[0]):
      #print(i, "\n\n\ncurrent attention weights for all heads:\n\n", layer_attentions)
      layer_attention_weights.append(torch.mean(layer_attentions[:, -1, :-n_tokens], dim=1))

    # print layer attention weights
    for i, layer_weights in enumerate(layer_attention_weights):
        print(f"\n\nLayer {i+1} attention weights:\n")
        print(layer_weights.tolist())

# Test with example input
#input_text = ["Orlando", "liebte", "von", "Natur", "aus", "einsame", "Orte,", "weite", "Ausblicke", "und", "das", "Gefühl,", "für", "immer", "und", "ewig"] # allein zu sein.
input_text = ["Orlando", "liebte", "von", "Natur"]
n = 1  # I want to mask the last n words
print_layer_attention_weights(input_text, n, model)

# Works. The last n values in each attention weights tensor are always 0, just as intended.


In [None]:
""" TEST 2 """
# Here I'm turning the masking around a bit - I will mask the first instead of the last n words.
# I will then compare the predictions for the partially masked input with a "normal" prediction
# with an input chunk where the words that were masked in the first run are just left out.

# I'm not quite sure if this really makes sense, but at least for a human it should be the
# same because syntactically, it shouldn't matter that there are a few masked words at the
# beginning of the context chunk.

# In test 3 I'll do it with the first n words, where it should definitely matter (at least for the syntax)
# if we include & mask them or exclude them completely.

def exclude_first_n(previous_words, n, model):

    seed = 42
    torch.manual_seed(seed)
    np.random.seed(seed)

    """ 1. mask first n words & run prediction """
    # generate token ids for each of the x previous words
    encoded_input = tokenizer(" ".join(previous_words),
                              add_special_tokens = False,
                              return_tensors = "pt")

    # get token IDs, put them in a list and in a tensor
    ids_tensor = encoded_input['input_ids']
    ids_list = ids_tensor[0].tolist()
    #print("input IDs:", ids_list)

    # get original attention mask
    curr_attention_mask = encoded_input.attention_mask.clone()

    # mask the first n words
    unmasked_words = previous_words[n:]
    unmasked_tokens = tokenizer.tokenize(" ".join(unmasked_words))
    #print("unmasked_words", unmasked_words)

    # get the first n words and the number of tokens to mask
    masked_words = previous_words[:n]
    #print("masked words:", masked_words)
    masked_tokens = tokenizer.tokenize(" ".join(masked_words))
    n_tokens = len(masked_tokens)
    #print("nr of masked tokens:", n_tokens)

    # modify the original attention mask: set weights to 0 where the input should be masked
    curr_attention_mask[:, :n_tokens] = 0
    #print("attention mask:", curr_attention_mask)

    # Generate model output
    output = model.generate(ids_tensor,
                            attention_mask = curr_attention_mask,
                            return_dict_in_generate = True,
                            output_scores = True,
                            max_new_tokens = 13)

    predicted_text = tokenizer.decode(output.sequences.numpy().tolist()[0],
                                      skip_special_tokens = True)

    print("Prediction with the first n words masked:", predicted_text)



    """ 2. exclude first n words & run prediction """
    # do the same again, but exclude the first n words instead of masking them

    # exclude first n words from the input chunk
    included_words = previous_words[n:]
    #print("\n\nincluded words", included_words)

    # encode text, generate token IDs tensor
    encoded_input = tokenizer(" ".join(included_words),
                              add_special_tokens = False,
                              return_tensors="pt")

    # get token IDs, put them in a list and in a tensor
    ids_tensor = encoded_input['input_ids']
    ids_list = ids_tensor[0].tolist()
    #print("input IDs:", ids_list)

    # generate model output
    output = model.generate(ids_tensor,
                            return_dict_in_generate = True,
                            output_scores = True,
                            max_new_tokens = 13)
    # decode IDs to get predicted text
    predicted_text = tokenizer.decode(output.sequences.numpy().tolist()[0],
                                      skip_special_tokens = True)

    print("Prediction with the first n words excluded:", predicted_text)


# Test with example input
input_text = ["Orlando", "liebte", "von", "Natur", "aus", "einsame", "Orte,", "weite", "Ausblicke", "und", "das", "Gefühl,", "für", "immer", "und", "ewig", "allein", "zu", "sein."]
#input_text = ["Orlando", "liebte", "von", "Natur"]
n = 10  # I want to mask/exclude the first n words
exclude_first_n(input_text, n, model)

# Seems to be the same (the first part of the sentence is not the same obviously, but that
# was the masked/excluded part of the input. The predictions are the same.)


Prediction with the first n words masked: Orlando liebte von Natur aus einsame Orte, weite Ausblicke und das Gefühl, für immer und ewig allein zu sein.
Ich bin nicht allein.
Ich bin ein Teil von dir
Prediction with the first n words excluded: das Gefühl, für immer und ewig allein zu sein.
Ich bin nicht allein.
Ich bin ein Teil von dir


In [None]:
""" TEST 3 """
# Here I'm doing the same as in test 2, but I'm masking the last n words, just as I did
# for the surprisal score calculation.
# I will then compare the predictions for the partially masked input with a "normal" prediction
# with an input chunk where the words that were masked in the first run are just left out.

# Unlike in test 2, where I wasn't so sure whether the masking/exclusion of words
# should make a difference, here it should definitely matter (at least for the syntax)
# if we include & mask them or exclude them completely.

def exclude_last_n(previous_words, n, model):

    # Idk if this actually matters for the generation (probably not),
    # but set seed:
    seed = 42
    torch.manual_seed(seed)
    np.random.seed(seed)

    """ 1. mask last n words & run prediction """
    # generate token ids for each of the x previous words
    encoded_input = tokenizer(" ".join(previous_words),
                              add_special_tokens = False,
                              return_tensors = "pt")

    # get token IDs, put them in a list and in a tensor
    ids_tensor = encoded_input['input_ids']
    ids_list = ids_tensor[0].tolist()
    #print("input IDs:", ids_list)

    # get original attention mask
    curr_attention_mask = encoded_input.attention_mask.clone()

    # check which TS index we have here: If it's the
    # first TS, we don't have to mask anything.
    unmasked_words = previous_words[:-n]
    unmasked_tokens = tokenizer.tokenize(" ".join(unmasked_words))
    #print("unmasked tokens:", unmasked_tokens)

    # get the last n words and the number of tokens to mask
    masked_words = previous_words[-n:]
    masked_tokens = tokenizer.tokenize(" ".join(masked_words))
    n_tokens = len(masked_tokens)
    #print("number of tokens to be masked:", n_tokens)
    print("masking the following", n, "words now:", masked_words, " - list of masked tokens:", masked_tokens)

    # modify the original attention mask: set weights to 0 where the input should be masked
    # --> I think if we set this attention mask in the .generate() call, we set this for all layers.
    #print("original attention mask:", curr_attention_mask)
    curr_attention_mask[:, -n_tokens:] = 0
    #print("attention mask: ", curr_attention_mask)

    # Generate model output
    output = model.generate(ids_tensor,
                            attention_mask = curr_attention_mask,
                            return_dict_in_generate = True,
                            output_scores = True,
                            max_new_tokens = 13)

    predicted_text = tokenizer.decode(output.sequences.numpy().tolist()[0],
                                      skip_special_tokens = True)

    print("Prediction with the last n words masked:", predicted_text)



    """ 2. exclude last n words & run prediction """
    # do the same again, but exclude the first n words instead of masking them

    # exclude first n words from the input chunk
    included_words = previous_words[:-n]
    print("\n\nincluded words", included_words)

    # encode text, generate token IDs tensor
    encoded_input = tokenizer(" ".join(included_words),
                              add_special_tokens = False,
                              return_tensors="pt")

    # generate model output
    output = model.generate(ids_tensor,
                            return_dict_in_generate = True,
                            output_scores = True,
                            max_new_tokens = 13)
    # decode IDs to get predicted text
    predicted_text = tokenizer.decode(output.sequences.numpy().tolist()[0],
                                      skip_special_tokens = True)

    print("Prediction with the last n words excluded:", predicted_text)


# Test with example input
input_text = ["Orlando", "liebte", "von", "Natur", "aus", "einsame", "Orte,", "weite", "Ausblicke", "und", "das", "Gefühl,", "für", "immer", "und", "ewig", "allein", "zu", "sein."]
n = 6  # I want to mask/exclude the first n words
exclude_last_n(input_text, n, model)



masking the following 6 words now: ['immer', 'und', 'ewig', 'allein', 'zu', 'sein.']  - list of masked tokens: ['immer', 'Ġund', 'Ġewig', 'Ġallein', 'Ġzu', 'Ġsein', '.']
Prediction with the last n words masked: Orlando liebte von Natur aus einsame Orte, weite Ausblicke und das Gefühl, für immer und ewig allein zu sein. wenn es nicht zu Hause war.
Er war ein guter Freund


included words ['Orlando', 'liebte', 'von', 'Natur', 'aus', 'einsame', 'Orte,', 'weite', 'Ausblicke', 'und', 'das', 'Gefühl,', 'für']
Prediction with the last n words excluded: Orlando liebte von Natur aus einsame Orte, weite Ausblicke und das Gefühl, für immer und ewig allein zu sein.
Er war ein großer Künstler, ein Künstler, der sich in


In [None]:
# How to compute cosine similarity for 2 texts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example sentences
sentence1 = "The quick brown fox jumps over the lazy dog."
sentence2 = "Sphinx of black quartz, judge my vow."
sentence3 = "The quick brown fox jumps over the sphinx of black quartz."

# Compute cosine similarity for sentences 1 & 2
# 0 = completely different texts
# 1 = very similar (e.g. same words in different order) or identical
vectorizer1 = CountVectorizer().fit_transform([sentence1, sentence2])
cos_sim = cosine_similarity(vectorizer1)[0][1]
print("Cosine similarity for sentences 1 & 2:", cos_sim)

# compare sentences 1 & 3:
vectorizer2 = CountVectorizer().fit_transform([sentence1, sentence3])
cos_sim = cosine_similarity(vectorizer2)[0][1]
print("Cosine similarity for sentences 1 & 3:", cos_sim)


Cosine similarity for sentences 1 & 2: 0.0
Cosine similarity for sentences 1 & 3: 0.7526178090063818
