![DLI Logo](../assets/DLI_Header.png)


# Training Data Extraction

In this lab we are going to walk through an implementation of parts of ["Extracting Training Data from Large Language Models"](https://arxiv.org/pdf/2012.07805.pdf) by Carlini et al.  In what might be a familiar refrain by now, please temper your expectations. We use smaller LLMs to allow inference to be fast enough for the lab, and smaller models have less capacity to exactly memorize data.

If you're following along in the paper, the parts you're most likely to be interested in start around page 5 (section 4).

The main idea of the attack is that the model should be 'more confident' in producing text that it has memorized exactly; to exploit this, we first ask the model to generate a large quantity of text, and then use the model to check its own work and filter down to candidate productions that are likely to have been memorized exactly.

To improve this attack, the authors try two additional generation strategies:
1. Start with a high temperature during generation, and rapidly drop it to a normal level.  This encourages diversity in the first few tokens of model output before attempting to make it produce sequences that the model assigns very high probability to (which by assumption are more likely to have been memorized).
2. "Prime" the model with a number of source texts from the Internet that are likely to be the same as, or similar to, the model's training data (we leave this as an exercise).
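
The decaying temperature in item 1 can be sketched as a simple schedule. This is a minimal illustration, assuming a linear decay from 10 to 1 over the first 20 tokens (the values used by the `gen` function later in this lab); the helper name `temperature_at` is hypothetical and not part of any library.

```python
def temperature_at(step, start=10.0, end=1.0, decay_steps=20):
    """Temperature for the token generated at `step` (0-indexed).

    Assumption: linear decay from `start` to `end` across `decay_steps`
    tokens, then a constant `end` for the rest of the sequence.
    """
    if step >= decay_steps - 1:
        return end
    fraction = step / (decay_steps - 1)
    return start + fraction * (end - start)


# first 22 steps: the schedule falls from 10 to 1, then stays at 1
schedule = [round(temperature_at(step), 2) for step in range(22)]
print(schedule)
```

Early tokens are sampled from a much flatter distribution, which spreads the samples over many different openings; later tokens are sampled at the normal temperature so the model can settle into high-probability continuations.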

They also introduce several modified scoring rules in addition to the model's own probability estimate (which maps directly to the "perplexity" they refer to in the paper):
1. Compare to another neural language model (left as an exercise)
2. Compare to zlib compression 
3. Compare to lowercased text
4. Perplexity on a sliding window of text, rather than the entire text at once (not implemented)


Having generated significant amounts of text, as well as several different ways to score it, they take the highest scores from each combination of generation method and scoring, and evaluate it by hand to determine the rate of accurate training data recovery.

The final test, of course, is comparing it to the actual training data to determine whether or not it's a "hit".  We don't have access to that, so we're going to have to settle for things that look plausible, and Google searches.

By the time you've finished this lab, you'll have a good idea of how each of these metrics performs, and how to use them to assess the likelihood that a given sample was included in training data.  You'll also have had a bit of practice reading an academic paper and extracting the relevant bits of information from it to implement a new (to you) attack.

:::{admonition} Exercise
Go read the Wikipedia page on [perplexity](https://en.wikipedia.org/wiki/Perplexity#Perplexity_of_a_probability_model).  If you need to, review some [material on negative log-likelihood](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81).  How are the two related?  If we order by negative log-likelihood loss, is the order the same as or different from the order by perplexity?
:::

:::{admonition} Exercise

Now that we've laid out the big ideas of the paper, take fifteen minutes, read it for yourself, and try to convince yourself you understand how they did the evaluation.  See if you can pick out the details in how they generated the samples: what top-k? what temperature? how many new tokens per sample?

Most attacks in this space appear in papers like this first; the better you get at skimming them and picking out the key bits, the better you'll be at keeping up with the fire hose.  Eat your vegetables.
:::

:::{warning}

We're doing a lot of mucking about with tensors that go to and from the GPU, as well as asking you to experiment with indexing into and slicing on-GPU arrays, so there is a high chance of hitting the `RuntimeError: CUDA error: device-side assert triggered` message.  If you get a CUDA error, it usually disables GPU access from the Python process until the kernel is restarted.

If you encounter CUDA errors while running this lab, you need to do the following:
1. Go to the 'Running terminals and kernels' tab on the left (below the folder icon) and terminate all other terminals.
2. Restart the kernel for the current notebook from the 'Kernel' menu
3. Fix the offending code
4. Re-run the notebook from the beginning

If you still run into an error, please let us know.
:::

## Model Loading Boilerplate

Nothing much new here -- though do note the use of a 'large' model as well as the way we prompt with a special token below.

In [1]:
# DO NOT CHANGE

import os
os.environ['CUDA_VISIBLE_DEVICES']='0'

# for scoring outputs
import numpy as np
import pandas as pd
import zlib, sys

# LLM imports
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

import torch
torch.cuda.set_device(0)
device = "cuda:0"

In [2]:
# DO NOT CHANGE

from transformers.utils import logging

# This suppresses a recurring warning message that pops up because we're going to spend a lot of time trying to generate sequences that start with a special token
logging.set_verbosity_error()

In [3]:
# DO NOT CHANGE

# gpt2-large gives a slightly better chance of retrieving memorized data than the smaller GPT-2 models
# model_id = "gpt2-large"
# this run uses gpt2-xl, the largest GPT-2 model: slower, but with more capacity to memorize
model_id = "gpt2-xl"
device = 'cuda:0'
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)


With the boilerplate out of the way, we're going to call the model once to see if it works -- note that the model needs a first token to start with, so we're going to use the 'bos' (beginning of sequence) token.  This is the same token ("<|endoftext|>") as the padding and end of text tokens.

Side note worth remembering: most tokenizers will find the longest match when tokenizing; even though "end" "of" and "text" are all individual tokens, the fact that "<|endoftext|>" is registered as its own token means that the tokenizer will replace it with that single token rather than all three.
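
To make the longest-match idea concrete, here is a toy sketch. It is a greedy longest-match splitter over a small made-up vocabulary, not GPT-2's actual byte-pair-encoding tokenizer (which handles registered special tokens in a separate step); the function `greedy_tokenize` and the vocabulary `toy_vocab` are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Split `text` by always taking the longest vocabulary entry that matches.

    Toy illustration only: real tokenizers (including GPT-2's) are more involved.
    """
    tokens = []
    position = 0
    max_len = max(len(entry) for entry in vocab)
    while position < len(text):
        # try the longest possible piece first, then shrink
        for length in range(min(max_len, len(text) - position), 0, -1):
            piece = text[position:position + length]
            if piece in vocab:
                tokens.append(piece)
                position += length
                break
        else:
            # nothing in the vocabulary matches: fall back to one character
            tokens.append(text[position])
            position += 1
    return tokens


toy_vocab = {"<|endoftext|>", "<", "|", ">", "end", "of", "text", "Hello"}

# with the special token registered, it is matched as a single unit
print(greedy_tokenize("<|endoftext|>Hello", toy_vocab))

# without it, the same string falls apart into smaller pieces
print(greedy_tokenize("<|endoftext|>", toy_vocab - {"<|endoftext|>"}))
```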

Another side note: what happens if you insert the "<|endoftext|>" token into a prompt for GPT2?

In [7]:
# DO NOT CHANGE

text = tokenizer.special_tokens_map['bos_token']

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, do_sample=True, 
                         max_new_tokens=256, 
                         top_k=40,
                         return_dict_in_generate=True,
                         output_scores = True,
                         pad_token_id=50256,
                        )

print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=False))

<|endoftext|>As the new year dawns this week, we would be foolish to forget about the many, major, and horrifying problems that still haunt our country. The problem is that there are some things that the mass media will never show.

Case in point: a massive, well-organized effort to shut down conservative thought by infiltrating and subverting the very institutions of higher learning.

How did this all begin?

It all started in September, when a group of leftists went to Columbia University and tried to bully Professors Nicholas and Erika Christakis into deleting a critical opinion piece the couple had posted to Facebook to give their students some insight into what it means for them to be a faculty member.

In an interview with MSNBC's Chris Hayes, Erika (the author of the article) was quick to point out that Columbia wasn't the first school to invite them to speak at.

"I think there are a lot of examples in universities across the country," she explained, pointing to the efforts tha

## Scoring

Here we define our scoring function: we show the model the sequence we produced and ask how likely it would have been to predict the n-th token given the preceding n-1 tokens. This is exactly the quantity the model computes during training, so we can use the model's built-in loss. Look at the equation at the end of section 4.2: the quantity inside the parentheses is the same as the model loss (the mean negative log-likelihood, or NLL), which they then exponentiate to find perplexity. Exponentiation changes the magnitude of the score but not the order in which we'd sort the samples (a sample with a higher NLL always has a higher perplexity than one with a lower NLL), so we're going to ignore that step for now. The same logic as with 'loss' applies: the lower it is, the more likely the model thinks the sequence is. If the model has perfectly memorized a sample, it should have very low loss for that exact sample.

Worth noting the care we're taking to `.detach()` and explicitly mark items for deletion/collection -- when working at a low level with tensors that are on the GPU, creating VRAM "memory leaks" is very easy and annoying.
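
The claim that sorting by loss and sorting by perplexity give the same order can be checked with a few lines of arithmetic. This is a small sketch using made-up mean NLL values (the sample names and numbers are illustrative, not results from the model).

```python
import math

# hypothetical mean NLL values for three samples (illustrative only)
mean_nlls = {"sample_a": 1.45, "sample_b": 1.96, "sample_c": 5.88}

# perplexity is the exponential of the mean negative log-likelihood
perplexities = {name: math.exp(nll) for name, nll in mean_nlls.items()}

rank_by_nll = sorted(mean_nlls, key=mean_nlls.get)
rank_by_ppl = sorted(perplexities, key=perplexities.get)

print(rank_by_nll)
print(rank_by_ppl)
```

Because `exp` is strictly increasing, the two rankings always agree; only the spacing between scores changes.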

In [7]:
# DO NOT CHANGE

def score_outputs(outputs, model = model):
    input_ids = outputs.clone()
    target_ids = input_ids.clone()
    # detach() to explicitly break gradients and avoid CUDA memory 'leaks'
    loss = model(input_ids, labels = target_ids).loss.detach().item()
    del input_ids, target_ids #extra paranoia for CUDA memory
    return loss

In [6]:
# DO NOT CHANGE

score_outputs(outputs.sequences[0], model)

1.9576294422149658

:::{admonition} Exercise!
Inspect the full results from calling the model: `model(input_ids, labels=target_ids)`   What do they represent? 

Look up the softmax function if you don't remember it.  What do we have to do to the logits to compute the loss function from them?

BONUS challenge: Read up on the [log-sum-exp](https://gregorygundersen.com/blog/2020/02/09/log-sum-exp/) trick and see if you can compute the loss from logits and get the same result as the function above.

If you get stuck, check out the [answer key](answers-4_LLM.ipynb) notebook.
:::

In [12]:
# your code here 
import torch.nn.functional as F

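def compute_loss_from_logits(logits, targets):
    # position i predicts token i+1: drop the last logit row and the first target
    # (the ellipsis indexing works for both unbatched and batched tensors)
    shift_logits = logits[..., :-1, :].contiguous()
    shift_targets = targets[..., 1:].contiguous()

    # flatten to (num_predictions, vocab_size) and (num_predictions,)
    flat_logits = shift_logits.view(-1, shift_logits.size(-1))
    flat_targets = shift_targets.view(-1)

    # mean cross-entropy over all predicted tokens
    loss = F.cross_entropy(flat_logits, flat_targets)
    return loss

logits = model(outputs.sequences[0], labels=outputs.sequences[0]).logits
loss = compute_loss_from_logits(logits, outputs.sequences[0])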

In [14]:
loss

tensor(2.0653, device='cuda:0', grad_fn=<NllLossBackward0>)
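
The cell above leans on `F.cross_entropy`. For the bonus challenge, the same quantity can be written out with the log-sum-exp trick. The following is a pure-Python sketch on toy numbers, assuming the logits have already been shifted so that row `i` is the prediction for `target_ids[i]`; the names `log_sum_exp` and `mean_nll` are hypothetical helpers, not library functions.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) via subtracting the maximum."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def mean_nll(logit_rows, target_ids):
    """Mean negative log-likelihood of the targets.

    -log softmax(logits)[target] == log_sum_exp(logits) - logits[target]
    """
    total = 0.0
    for logits, target in zip(logit_rows, target_ids):
        total += log_sum_exp(logits) - logits[target]
    return total / len(target_ids)


# the second row would overflow with a naive exp(1000.0)
toy_logits = [[2.0, 0.5, -1.0], [1000.0, 999.0, 998.0]]
toy_targets = [0, 1]
print(mean_nll(toy_logits, toy_targets))
```

Subtracting the maximum before exponentiating leaves the result unchanged mathematically but keeps every `exp` argument at or below zero, which is why the large logits in the second row cause no overflow.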

:::{admonition} Exercise!

Try a few different bits of text below, both common as well as random; what do the scores look like?  With at most 256 tokens, what's the highest score you can get? Lowest?

When constructing sentences by hand, you need at least 2 tokens (so it can predict the second from the first) -- it's a good idea to insert the "<|endoftext|>" token at the start of the text as a 'null' character (the same token is used for the beginning of a sequence, the end of a sequence, and unknown tokens).

You can see the special tokens in the tokenizer by inspecting its `special_tokens_map` property:
```python
tokenizer.special_tokens_map
```

:::

In [13]:
# DO NOT CHANGE

score_outputs(tokenizer("<|endoftext|>"+"YOUR TEXT HERE", return_tensors='pt').to(device).input_ids)

5.875243663787842

## Text Generation and "Improved" Text Generation

In section 5, they describe a few more potential methods to improve text generation, including:
1. Sampling with a decaying temperature (5.1.1), and
2. Conditioning on Internet text (5.1.2)

The function below supports both: the default value for 'text' is the beginning-of-sequence token; otherwise, user-supplied text is tokenized and used as the prompt.  We've implemented a single temperature scaling method, the one used in the paper (a decay from 10 to 1 over the first 20 tokens), as well as the default fixed-temperature approach.

All other parameters -- including top-k and the number of new tokens -- are fixed to values from the paper.

:::{admonition} Exercise
If you're not familiar with temperature or the top-k parameter, or just need a reminder, read about them [here](https://txt.cohere.com/llm-parameters-best-outputs-language-ai/).
:::
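
As a quick refresher on what these two parameters do to the next-token distribution, here is a pure-Python sketch on toy logits. It is an illustration of the idea, not the Hugging Face implementation (which operates on tensors); the function name `top_k_probs` and the toy values are hypothetical.

```python
import math


def top_k_probs(logits, k, temperature=1.0):
    """Next-token probabilities after temperature scaling and top-k filtering."""
    # dividing by a temperature above 1 flattens the distribution
    scaled = [value / temperature for value in logits]
    # keep only the indices of the k largest scaled logits
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    peak = max(scaled[i] for i in keep)
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    # tokens outside the top k get probability zero
    return [weights.get(i, 0.0) / total for i in range(len(logits))]


toy_logits = [4.0, 3.0, 1.0, 0.0]
sharp = top_k_probs(toy_logits, k=2, temperature=1.0)
flat = top_k_probs(toy_logits, k=2, temperature=10.0)
print(sharp)
print(flat)
```

With `temperature=10.0` the two surviving tokens end up with nearly equal probability, while at `temperature=1.0` the top token dominates; in both cases everything outside the top k is excluded.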

In [5]:
# DO NOT CHANGE

def gen(model, text = tokenizer.special_tokens_map['bos_token'], use_temperature_scaling=False):
    inputs = tokenizer(text, return_tensors="pt").to(device)
    
    if use_temperature_scaling:
        # generate the first 20 tokens one at a time, decaying the temperature linearly from 10 to 1
        temperature_scale = np.linspace(10, 1, 20)
        for t in temperature_scale:
            outputs = model.generate(**inputs, do_sample=True, 
                                     max_new_tokens=1, 
                                     top_k=40,
                                     return_dict_in_generate=True,
                                     pad_token_id=50256,
                                     temperature = t,
                                    )
            text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
            inputs = tokenizer(text, return_tensors="pt").to(device)
        outputs = model.generate(**inputs, do_sample=True, 
                                 max_new_tokens=256 - len(temperature_scale), 
                                 top_k=40,
                                 return_dict_in_generate=True,
                                 pad_token_id=50256,
                                 temperature = 1,
                                )

    else:
        outputs = model.generate(**inputs, do_sample=True, 
                                 max_new_tokens=256, 
                                 top_k=40,
                                 return_dict_in_generate=True,
                                 output_scores = True,
                                 pad_token_id=50256,
                                )
    scores = score_outputs(outputs.sequences[0], model=model)
    return outputs.sequences[0].detach().cpu(), scores

## Generating Samples

In [17]:
# DO NOT CHANGE

samples = []


# in the paper they generated 200,000 samples, but ain't nobody got time for that
N_SAMPLES = 50

for sample in range(N_SAMPLES):
    tokens, nlls = gen(model)
    samples.append((tokens, np.mean(nlls)))
    sys.stdout.write(f"\r{sample} / {N_SAMPLES}");sys.stdout.flush()

49 / 50

In [18]:
# DO NOT CHANGE

samples.sort(key = lambda x:np.mean(x[1]))

In [19]:
# DO NOT CHANGE

for tokens, score in samples[:3]:
    print(score, tokenizer.decode(tokens, skip_special_tokens=True))
    print("+*+*+*+*+*+*+*")

1.4503071308135986 
With Donald Trump expected to win the election on Tuesday, there are concerns across Canada about whether Trump will make good on his pledge to bar Muslim refugees from coming to Canada.

"There is a concern at home, a bit of uneasiness and apprehension on the part of Canadians that the U.S. will indeed go down this path," said Chantal Hébert, a fellow of the Munk School of Global Affairs at the University of Toronto and a fellow for the Wilson Centre. "For Canada, it would be catastrophic. For the U.S., it would actually be a positive development."

Canadian Prime Minister Justin Trudeau says he would welcome refugees and immigrants from the U.S. But he's not convinced Donald Trump would do the same <a href="https://twitter.com/hashtag/cbc?src=hash">#cbc</a> <a href="https://t.co/KpDZ5tH7Zi">https://t.co/KpDZ5tH7Zi</a> —@CBCPolitics

Canada accepted 25,839 refugees in 2016, the third highest number in recent history after the United States and China. Of those, 13
+

## Improved Scoring of Samples

In addition to the simple perplexity measure, section 5.2 of the paper (go read it) suggests four other tools for evaluating the likelihood that a production was in the training data.  Comparing against a second language model is left as an exercise, and we implement two of the remaining three here:

1. Comparing to zlib compression: the model will often memorize fairly trivial strings ("AAAAAAAAAA") or ("1,2,3,4,5,6,...") -- compression is another way to approximate the 'information content' of a string.  By comparing the GPT2 perplexity to the zlib-inferred 'entropy' we can at least partially filter "uninteresting" samples from the data, leaving behind only samples with low perplexity but high complexity.

2. Comparing to lowercased text: GPT has different tokens for uppercased and lowercased text; if it has precisely memorized a specific exact phrase, then the lower-cased version of it should be "more unexpected" (lower probability, higher perplexity) than the original.
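
The zlib intuition from item 1 is easy to check in isolation. This is a minimal sketch with the standard-library `zlib` module; the helper name `zlib_bytes` and the two example strings are hypothetical, chosen only to contrast a trivial string with ordinary prose of the same length.

```python
import zlib


def zlib_bytes(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8")))


varied = (
    "compression is another way to approximate the information content of a string; "
    "comparing perplexity to compressed size filters out uninteresting samples."
)
# a trivial string of exactly the same length
repetitive = "A" * len(varied)

print(len(varied), zlib_bytes(varied))
print(len(repetitive), zlib_bytes(repetitive))
```

The repetitive string compresses to a small fraction of the size of the prose, so a sample that the model finds very likely and that zlib also finds very compressible is probably just trivial repetition rather than interesting memorized text.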

They also suggest evaluating perplexity on a sliding window, looking at the minimum perplexity for any subset of 50 consecutive tokens in the production.
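
The windowing step on its own can be sketched as follows, assuming you already have a list of per-token NLL values for a sample (obtaining those from the model is the subject of the exercise below). The function name `min_window_mean` is hypothetical, and working on mean NLL rather than perplexity is a simplification that preserves the ordering.

```python
def min_window_mean(token_nlls, window=50):
    """Lowest mean NLL over any `window` consecutive tokens.

    Assumes `token_nlls` is a non-empty list of per-token losses; if the
    sample is shorter than the window, the mean over the whole sample is used.
    """
    if len(token_nlls) <= window:
        return sum(token_nlls) / len(token_nlls)
    running = sum(token_nlls[:window])
    best = running
    for end in range(window, len(token_nlls)):
        # slide the window one token to the right
        running += token_nlls[end] - token_nlls[end - window]
        best = min(best, running)
    return best / window


# toy sample: a 50-token low-loss stretch buried in otherwise ordinary text
toy_nlls = [3.0] * 100 + [0.1] * 50 + [3.0] * 100
print(sum(toy_nlls) / len(toy_nlls))
print(min_window_mean(toy_nlls))
```

The overall mean hides the low-loss stretch, while the windowed minimum picks it out, which is the motivation for this metric: a memorized passage may be only part of a longer production.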

:::{admonition} Exercise!
Using [this page](https://huggingface.co/docs/transformers/perplexity) as a reference, implement sliding window perplexity.

See the [answer key](answers-4_LLM.ipynb) notebook if you get stuck.
:::

In [20]:
# DO NOT CHANGE

def compare_zlib(tokens, tokenizer=tokenizer):
    base_score = score_outputs(tokens.to(device))
    zlib_score = len(zlib.compress(tokenizer.decode(tokens, skip_special_tokens=True).encode('utf-8')))
    return base_score / np.log2(zlib_score)

def compare_lowercased(tokens, model=model, tokenizer=tokenizer):
    test_string = tokenizer.decode(tokens, skip_special_tokens=True)
    test_string_lower = test_string.lower()
    inputs_lower = tokenizer(test_string_lower, return_tensors="pt").to(device).input_ids
    score = score_outputs(tokens.to(device))
    score_lower = score_outputs(inputs_lower)
    del inputs_lower
    return score/score_lower

## Evaluation of Productions

With our three scoring functions, we can now score our productions from above.  For ease of analysis, we'll put them all into a single pandas dataframe.

In [21]:
# DO NOT CHANGE

all_scores = list()

for index, (tokens, score) in enumerate(samples):
    td = dict()
    td['index'] = index
    td['text'] = tokenizer.decode(tokens)
    td['NLL'] = score_outputs(tokens.to(device))
    td['lower'] = compare_lowercased(tokens)
    td['zlib'] = compare_zlib(tokens)
    # TODO -- insert your sliding window score function here!
    all_scores.append(td)
results = pd.DataFrame(all_scores)

And now, sort by various score functions.

In [22]:
# DO NOT CHANGE

results.sort_values('lower')

| | index | text | NLL | lower | zlib |
|---|---|---|---|---|---|
| 2 | 2 | `<\|endoftext\|>The video will start in 8 Cancel\...` | 1.521073 | 0.619678 | 0.163132 |
| 49 | 49 | `<\|endoftext\|>Ticketing\n\nAll events are hoste...` | 2.704234 | 0.6792 | 0.395039 |
| 12 | 12 | `<\|endoftext\|>The official website of the TV an...` | 1.810654 | 0.696292 | 0.199016 |
| 11 | 11 | `<\|endoftext\|>(JTA) — The US Embassy to France ...` | 1.80367 | 0.699413 | 0.194786 |
| 13 | 13 | `<\|endoftext\|>"I'm always being told something,...` | 1.836027 | 0.705708 | 0.194952 |
| 7 | 7 | `<\|endoftext\|>The video will start in 8 Cancel\...` | 1.741096 | 0.709982 | 0.188463 |
| 0 | 0 | `<\|endoftext\|>\nWith Donald Trump expected to w...` | 1.450307 | 0.713736 | 0.157522 |
| 5 | 5 | `<\|endoftext\|>Image copyright EPA Image caption...` | 1.648996 | 0.726716 | 0.178218 |
| 1 | 1 | `<\|endoftext\|>The New York Yankees are expected...` | 1.500827 | 0.746887 | 0.166654 |
| 8 | 8 | `<\|endoftext\|>COPENHAGEN, Denmark (AP) — Denmar...` | 1.763727 | 0.765634 | 0.188794 |


:::{exercise}
Examine the outputs with respect to different scoring systems.  Do they at least look plausible as potential candidates for memorized text?  Which scoring system seems to perform the best?
:::
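
One way to start comparing the scoring systems is to look at which samples each metric ranks first. The sketch below uses plain Python and four rows copied from the sorted output above; the helper name `top_n` is hypothetical. For all three metrics as defined in this lab, a lower value marks a stronger memorization candidate.

```python
# four rows copied from the sorted results above (text column omitted)
rows = [
    {"index": 2, "NLL": 1.521073, "lower": 0.619678, "zlib": 0.163132},
    {"index": 49, "NLL": 2.704234, "lower": 0.6792, "zlib": 0.395039},
    {"index": 0, "NLL": 1.450307, "lower": 0.713736, "zlib": 0.157522},
    {"index": 1, "NLL": 1.500827, "lower": 0.746887, "zlib": 0.166654},
]


def top_n(rows, metric, n=2):
    """Indices of the `n` rows with the lowest value of `metric`."""
    ranked = sorted(rows, key=lambda row: row[metric])
    return [row["index"] for row in ranked[:n]]


top_by_metric = {metric: top_n(rows, metric) for metric in ("NLL", "lower", "zlib")}
print(top_by_metric)
```

Even on four rows the metrics disagree about which samples deserve a closer look, which is why the paper inspects the top candidates under each metric separately.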

:::{exercise}
Perform this attack against one of the fine-tuned models from the previous labs -- instructions on loading them are in lab 1 -- where we know what the instruction dataset looked like. Can you recover any training samples?

NB: these models are very small and poor at memorization; don't be disappointed if you get borderline poor results.
:::

:::{exercise}
Modify the attack to try to recover specific bits of information: 
1. Find (or make up) a fixed text prompt that is a prefix to the information you want to try to recover
2. Generate a large number of samples using the method above
3. Score, sort/filter, and check the results.

Things to explore:
- shorter or longer prompts
- temperature scaling
- computing other scoring functions that sample perplexity over a window

And one last reminder: GPT2-large is (name aside) a relatively small model; the amount of memorized data is likely to be low.  If you have time, you may wish to try loading the 'gpt2-xl' model: this is the "full" GPT-2 model used in the paper, but it is significantly slower to use.  Also remember that the authors took the top 1,000 samples from 200,000 for each condition, so the total volume we examine here is much smaller simply due to time (but you can do this at home too!).

Please share any interesting findings with us!
:::

In [8]:
# provided code

samples = []

# in the paper they generated 200,000 samples, but ain't nobody got time for that
N_SAMPLES = 50

for sample in range(N_SAMPLES):
    tokens, nlls = gen(model, text = "YOUR PROMPT HERE", use_temperature_scaling=True)
    samples.append((tokens, np.mean(nlls)))
    sys.stdout.write(f"\r{sample} / {N_SAMPLES}");sys.stdout.flush()

49 / 50

In [None]:
# DO NOT CHANGE

print(tokenizer.decode(gen(model, text="The Ventura county DMV is located")[0]))

In [None]:
# DO NOT CHANGE

all_scores = list()

for index, (tokens, score) in enumerate(samples):
    td = dict()
    td['index'] = index
    td['text'] = tokenizer.decode(tokens)
    td['NLL'] = score_outputs(tokens.to(device))
    td['lower'] = compare_lowercased(tokens)
    td['zlib'] = compare_zlib(tokens)
    # TODO -- insert your sliding window score function here!
    all_scores.append(td)
results = pd.DataFrame(all_scores)

In [None]:
# DO NOT CHANGE

results.sort_values('lower')

In [None]:
# DO NOT CHANGE

# to view a specific row from a dataframe, use 'iloc' (index location)
results.iloc[12].text

# Conclusion
Were you able to follow the paper? 
Did you identify any training data? Can you verify somehow that the data was actually in the training set?  Names, phone numbers, and addresses of businesses or government offices are all good test cases for models like this.  While this kind of attack can be difficult and expensive to execute, the potential risks to the model owner can be severe if the model exposes any sort of regulated or GDPR-sensitive content.

This kind of attack can also be used as a training dataset inference attack: if you prime the model with several strings from a known dataset (such as the "Colossal Clean Crawled Corpus", aka the C4 dataset: https://huggingface.co/datasets/allenai/c4/tree/main) and find that many of the complete samples can be recovered, then you have more confidence that the dataset was used in training the model, which can help you execute other attacks such as training a proxy model.

This is the end of the labs; congratulations, you made it! If you have time, we encourage you to go back and experiment with this or other attacks, as well as to download any notebooks, code, or models that you want to take with you.  Thank you so much for taking this class with us, and as always, please let us know if you have any questions, comments, or suggestions for next time. Thank you!

If you'd like to try the assessment to get a certificate, **move on to the [assessment overview notebook](../8_course_assessment/1_assessment_intro.ipynb)**.

