# GPT-2

### Goal & Task:
- Fine-tune a pre-trained GPT-2 model from the Hugging Face Transformers library
- Focus on identifying potential vulnerabilities in code snippets and suggesting ways to mitigate those risks

In [None]:
# enable text wrapping in the output of all cells
from IPython.display import HTML, display

def set_css_in_cell_output():
  display(HTML('''
      <style>
        pre {white-space: pre-wrap;}
      </style>
    '''))

get_ipython().events.register('pre_run_cell', set_css_in_cell_output)

## 1. Data Collection

### 1.1 Given Dataset

In [None]:
import os
import json
from google.colab import drive

# mount google drive
my_path = '/content/drive'
drive.mount(my_path)

# directory containing JSON files
directory = '/content/drive/Shared drives/HCLCapstone/Dataset'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# dictionary to hold all JSON data
all_data = {}

for root, dirs, files in os.walk(directory):
    if os.path.basename(root) == 'api':  # check if the parent directory is 'api'
        for filename in files:
            if filename.endswith('.json'):
                filepath = os.path.join(root, filename)
                with open(filepath, 'r', encoding='utf-8') as file:
                    # load JSON file and add its contents to the all_data dictionary
                    all_data[filename] = json.load(file)

# now 'all_data' contains the data from all JSON files in 'api' folders
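One caveat with keying `all_data` by the bare filename: if two `api` folders hold a file with the same name, the later one silently overwrites the earlier one. Below is a minimal sketch of a variant that keys on the path relative to the dataset root. The function name `load_api_json` and the throwaway directory layout are illustrative assumptions, not part of the original notebook.

```python
import json
import os
import tempfile


def load_api_json(root_dir):
    """Load every .json file that sits directly inside a folder named 'api'.

    Keys are paths relative to root_dir, so equal filenames in different
    'api' folders do not overwrite each other.
    """
    data = {}
    for root, _dirs, files in os.walk(root_dir):
        if os.path.basename(root) != 'api':
            continue
        for filename in sorted(files):
            if filename.endswith('.json'):
                filepath = os.path.join(root, filename)
                with open(filepath, 'r', encoding='utf-8') as fh:
                    data[os.path.relpath(filepath, root_dir)] = json.load(fh)
    return data


# small self-check on a temporary directory tree (hypothetical layout)
with tempfile.TemporaryDirectory() as tmp:
    for category in ('xss', 'sqli'):
        api_dir = os.path.join(tmp, category, 'api')
        os.makedirs(api_dir)
        with open(os.path.join(api_dir, 'article.json'), 'w', encoding='utf-8') as fh:
            json.dump({'id': category}, fh)
    loaded = load_api_json(tmp)

print(sorted(record['id'] for record in loaded.values()))
```

Both `article.json` files survive here because their relative paths differ, whereas a filename-keyed dictionary would keep only one of them.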

In [None]:
all_data

{'AccessControl.Bypass_javascript_beb9bd61.article.json': {'version': 1,
  'id': 'AccessControl.Bypass_javascript_beb9bd61',
  'apiName': 'MooTools Cookie Path Too General',
  'cause': [{'sortId': 0,
    'text': "The path for the cookie has been set in the `Cookie.write()` method allowing this cookie to be acccessed from any page. The path attribute determines the pages within that particular application path (and its sub-directories) to which the cookie is accessible. A value of '/' or false means that the cookie can be accessed within all pages in all paths throughout the application."},
   {'sortId': 1,
    'text': '{@code_bad:\n<script language="javascript">\n...\n   var c = Cookie.write(\'user\', value, {path: false});\n   // or\n   var c = Cookie.write(\'user\', value);   // Defaults to {path: false}\n...\n</script>\n:code}'},
   {'sortId': 2,
    'text': "Setting this attribute to its default value of '/' could allow cookies given to an authenticated user to leak out into areas 

In [None]:
formatted_text_data = {}

def extract_text(items):
    texts = []
    for item in items:
        if isinstance(item, str):
            texts.append(item)
        elif isinstance(item, dict) and 'text' in item:
            texts.append(item['text'])
    return ', '.join(texts)

for file_name, json_content in all_data.items():
    formatted_text = ""

    # extract 'cause' or 'causes' content
    for key in ['cause', 'causes']:
        if key in json_content:
            formatted_text += "The causes are: " + extract_text(json_content[key]) + "\n"

    # extract 'risks' content
    if 'risks' in json_content:
        formatted_text += "The risks are: " + extract_text(json_content['risks']) + "\n"

    # extract 'recommendations' content
    if 'recommendations' in json_content:
        formatted_text += "The recommendations are: " + extract_text(json_content['recommendations']) + "\n"

    formatted_text_data[file_name] = formatted_text

# 'formatted_text_data' now contains the formatted text representation of specific JSON content
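To make the formatting step easier to check in isolation, here is a small sketch that wraps the same logic in a function and runs it on a made-up record. The helper name `format_record` and the sample record are hypothetical; the field names (`cause`, `causes`, `risks`, `recommendations`) are the ones used in the cell above.

```python
def extract_text(items):
    """Join the 'text' fields (or bare strings) of a list into one string."""
    texts = []
    for item in items:
        if isinstance(item, str):
            texts.append(item)
        elif isinstance(item, dict) and 'text' in item:
            texts.append(item['text'])
    return ', '.join(texts)


def format_record(record):
    """Build the 'causes / risks / recommendations' text for one JSON record."""
    sections = [
        ('cause', 'The causes are: '),
        ('causes', 'The causes are: '),
        ('risks', 'The risks are: '),
        ('recommendations', 'The recommendations are: '),
    ]
    lines = [label + extract_text(record[key])
             for key, label in sections if key in record]
    return ''.join(line + '\n' for line in lines)


# hypothetical record, shaped like the entries in all_data
sample = {
    'cause': [{'sortId': 0, 'text': 'Cookie path is too general.'}],
    'risks': [{'sortId': 0, 'text': 'Cookies may leak to unauthenticated pages.'}],
    'recommendations': ['Set a specific cookie path.'],
}
print(format_record(sample))
```

Note that a record containing both `cause` and `causes` would produce two "The causes are:" lines, exactly as in the loop above.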

In [None]:
formatted_text_data

{'AccessControl.Bypass_javascript_beb9bd61.article.json': 'The causes are: The path for the cookie has been set in the `Cookie.write()` method allowing this cookie to be acccessed from any page. The path attribute determines the pages within that particular application path (and its sub-directories) to which the cookie is accessible. A value of \'/\' or false means that the cookie can be accessed within all pages in all paths throughout the application., {@code_bad:\n<script language="javascript">\n...\n   var c = Cookie.write(\'user\', value, {path: false});\n   // or\n   var c = Cookie.write(\'user\', value);   // Defaults to {path: false}\n...\n</script>\n:code}, Setting this attribute to its default value of \'/\' could allow cookies given to an authenticated user to leak out into areas of the application where authentication is not necessary. For example, if a user browses the public (unauthenticated) part of a website, it can usually be done without having to provide the user wit
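The formatted text still carries the dataset's `{@code_bad: ... :code}` markup, which the fine-tuned model later reproduces in its output. If that is not wanted, one option is to separate the code examples from the prose before training. The sketch below is an optional preprocessing idea, not something the notebook does; the helper name `split_code_blocks`, the `[code example]` placeholder, and the assumption that other `code_*` tag variants may exist are all mine.

```python
import re

# matches {@code_<kind>:\n ... \n:code}; only 'code_bad' is visible in the
# output above, other kinds are an assumption
CODE_BLOCK = re.compile(r"\{@code_(\w+):\n(.*?)\n:code\}", re.DOTALL)


def split_code_blocks(text):
    """Return (prose with placeholders, list of (kind, code) tuples)."""
    blocks = [(m.group(1), m.group(2)) for m in CODE_BLOCK.finditer(text)]
    prose = CODE_BLOCK.sub('[code example]', text)
    return prose, blocks


# shortened, hypothetical sample in the same shape as formatted_text_data values
sample = (
    "The causes are: The path is too general., "
    "{@code_bad:\n<script>\n   var c = Cookie.write('user', value);\n</script>\n:code}, "
    "Setting this attribute could leak cookies."
)
prose, blocks = split_code_blocks(sample)
print(prose)
print(blocks[0][0])
print(blocks[0][1])
```

Whether to keep, strip, or re-tag these blocks depends on whether the model is expected to generate code examples itself.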

### 1.2 OWASP API

In [None]:
# clone the OWASP CheatSheetSeries repository
!git clone https://github.com/OWASP/CheatSheetSeries.git

fatal: destination path 'CheatSheetSeries' already exists and is not an empty directory.


In [None]:
# install markdown and html2text packages
!pip install markdown html2text



In [None]:
# libraries
import os
import markdown
import html2text

In [None]:
# set the path to the cheatsheets directory within the cloned repository
cheatsheets_dir = 'CheatSheetSeries/cheatsheets'

# initialize a variable to hold the combined text
combined_text = ''

# walk through the directory, and read each Markdown file
for root, dirs, files in os.walk(cheatsheets_dir):
    for file in files:
        # check if the file has a Markdown extension
        if file.endswith('.md'):
            # construct the full file path
            file_path = os.path.join(root, file)
            # open and read the Markdown file
            with open(file_path, 'r', encoding='utf-8') as md_file:
                md_content = md_file.read()
                # convert Markdown to HTML using the markdown library
                html_content = markdown.markdown(md_content)
                # initialize html2text converter
                text_maker = html2text.HTML2Text()
                # set to ignore links in the conversion process
                text_maker.ignore_links = True
                # convert HTML to plain text
                plain_text = text_maker.handle(html_content)
                # add the plain text to the combined text, separating files with newlines
                combined_text += plain_text + '\n\n'

# save the combined plain text to a single file
with open('combined_cheatsheets.txt', 'w', encoding='utf-8') as output_file:
    output_file.write(combined_text)

In [None]:
with open('combined_cheatsheets.txt', 'r', encoding='utf-8') as input_file:
    # read the file back as a list of lines
    combined_text = input_file.readlines()

In [None]:
combined_text

['# Transport Layer Security Cheat Sheet\n',
 '\n',
 '## Introduction\n',
 '\n',
 'This cheat sheet provides guidance on implementing transport layer protection\n',
 'for applications using Transport Layer Security (TLS). It primarily focuses on\n',
 'how to use TLS to protect clients connecting to a web application over HTTPS,\n',
 'though much of this guidance is also applicable to other uses of TLS. When\n',
 'correctly implemented, TLS can provide several security benefits:\n',
 '\n',
 '  * **Confidentiality** : Provides protection against attackers reading the contents of the traffic.\n',
 '  * **Integrity** : Provides protection against traffic modification, such as an attacker replaying requests against the server.\n',
 '  * **Authentication** : Enables the client to confirm they are connected to the legitimate server. Note that the identity of the client is not verified unless client certificates are employed.\n',
 '\n',
 '### SSL vs TLS\n',
 '\n',
 'Secure Socket Layer (SSL) w

In [None]:
# save formatted_text_data to a text file
with open('given_dataset.txt', 'w', encoding='utf-8') as file:
    for text in formatted_text_data.values():
        file.write(text + '\n')

# save combined_text (already a list of lines) to a text file
with open('combined_dataset.txt', 'w', encoding='utf-8') as file:
    file.writelines(combined_text)

## 2. Modeling

### 2.1 Baseline Model

In [None]:
!pip install transformers



In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# initialize the tokenizer & model for the baseline GPT-2
tokenizer_baseline = GPT2Tokenizer.from_pretrained('gpt2')
model_baseline = GPT2LMHeadModel.from_pretrained('gpt2')

# example prompt
prompt_baseline = "The current weather is"
input_ids_baseline = tokenizer_baseline.encode(prompt_baseline, return_tensors='pt')

# generate text using the baseline model
output_baseline = model_baseline.generate(input_ids_baseline, max_length=50, num_return_sequences=1)

# decode and print the generated text
print(tokenizer_baseline.decode(output_baseline[0], skip_special_tokens=True))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The current weather is not conducive to the development of a sustainable future for the planet.

The current climate is not conducive to the development of a sustainable future for the planet.

The current climate is not conducive to the development of a sustainable
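The repeated sentences in the baseline output are typical of greedy decoding, and the two warnings about the attention mask and `pad_token_id` come from calling `generate` with only the input IDs. The sketch below shows one way to address both: it passes the tokenizer's full output (including `attention_mask`), sets `pad_token_id` explicitly, and samples instead of decoding greedily. The helper name `generate_text` and the specific values for `top_k`, `top_p`, and `no_repeat_ngram_size` are illustrative assumptions, not tuned settings.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    """Generate a continuation with sampling and explicit padding settings.

    `model` and `tokenizer` are expected to be a GPT2LMHeadModel and a
    GPT2Tokenizer (or compatible objects).
    """
    # calling the tokenizer returns both input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors='pt')
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,            # sample instead of greedy decoding
        top_k=50,                  # illustrative value
        top_p=0.95,                # illustrative value
        no_repeat_ngram_size=3,    # forbid repeating any 3-gram
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage with the baseline objects from the cell above:
# print(generate_text(model_baseline, tokenizer_baseline, "The current weather is"))
```

Because sampling is random, repeated runs give different text unless a seed is fixed first.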


### 2.2 Fine-Tuned Model

#### 2.2.1 Using Given Dataset

In [None]:
!pip install accelerate



In [None]:
!pip install torch torchvision torchaudio --upgrade



In [None]:
!pip install accelerate -U



In [None]:
!pip install transformers[torch] -U



In [None]:
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

given_dataset_file_path = 'given_dataset.txt'

# load tokenizer and model
tokenizer_given = GPT2Tokenizer.from_pretrained('gpt2')
model_given = GPT2LMHeadModel.from_pretrained('gpt2')

# prepare the dataset (blocks of 128 tokens, causal LM so mlm=False)
train_dataset_given = TextDataset(
    tokenizer=tokenizer_given,
    file_path=given_dataset_file_path,
    block_size=128,
)
data_collator_given = DataCollatorForLanguageModeling(tokenizer=tokenizer_given, mlm=False)

# training arguments
training_args_given = TrainingArguments(
    output_dir='./gpt2_given_dataset',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

# initialize Trainer
trainer_given = Trainer(
    model=model_given,
    args=training_args_given,
    data_collator=data_collator_given,
    train_dataset=train_dataset_given,
)

# fine-tuning
trainer_given.train()

# save the fine-tuned model
model_given.save_pretrained('./gpt2_given_dataset')
tokenizer_given.save_pretrained('./gpt2_given_dataset')



Step,Training Loss


('./gpt2_given_dataset/tokenizer_config.json',
 './gpt2_given_dataset/special_tokens_map.json',
 './gpt2_given_dataset/vocab.json',
 './gpt2_given_dataset/merges.txt',
 './gpt2_given_dataset/added_tokens.json')
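To give a feel for what `block_size=128` means: as far as I understand `TextDataset`, it tokenizes the whole file into one long stream and cuts it into consecutive blocks of `block_size` tokens, dropping the incomplete tail. The sketch below imitates that chunking on plain integer lists and estimates the number of optimizer steps, assuming a single device and no gradient accumulation. The function names and the 300-token stand-in corpus are hypothetical.

```python
import math


def chunk_into_blocks(token_ids, block_size=128):
    """Cut a token stream into consecutive full blocks; the remainder is dropped."""
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]


def estimate_training_steps(num_blocks, batch_size=4, epochs=3):
    """Optimizer steps for one device without gradient accumulation."""
    return math.ceil(num_blocks / batch_size) * epochs


fake_token_ids = list(range(300))  # stand-in for a tokenized corpus
blocks = chunk_into_blocks(fake_token_ids, block_size=128)
print(len(blocks), [len(block) for block in blocks])
print(estimate_training_steps(len(blocks)))
```

A run that finishes in fewer steps than the Trainer's logging interval prints no loss rows, which would be consistent with the empty `Step,Training Loss` table above if the default interval of 500 steps applied. Recent Transformers releases also mark `TextDataset` as deprecated, so it may be worth checking the installed version's warnings.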

#### 2.2.2 Using OWASP API

In [None]:
combined_dataset_file_path = 'combined_dataset.txt'

# load tokenizer and model
tokenizer_combined = GPT2Tokenizer.from_pretrained('gpt2')
model_combined = GPT2LMHeadModel.from_pretrained('gpt2')

# prepare the dataset (blocks of 128 tokens, causal LM so mlm=False)
train_dataset_combined = TextDataset(
    tokenizer=tokenizer_combined,
    file_path=combined_dataset_file_path,
    block_size=128,
)
data_collator_combined = DataCollatorForLanguageModeling(tokenizer=tokenizer_combined, mlm=False)

# training arguments
training_args_combined = TrainingArguments(
    output_dir='./gpt2_combined_dataset',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

# initialize Trainer
trainer_combined = Trainer(
    model=model_combined,
    args=training_args_combined,
    data_collator=data_collator_combined,
    train_dataset=train_dataset_combined,
)

# fine-tuning
trainer_combined.train()

# save the fine-tuned model
model_combined.save_pretrained('./gpt2_combined_dataset')
tokenizer_combined.save_pretrained('./gpt2_combined_dataset')




Step,Training Loss
500,3.3047
1000,3.0334
1500,2.8189
2000,2.7034


('./gpt2_combined_dataset/tokenizer_config.json',
 './gpt2_combined_dataset/special_tokens_map.json',
 './gpt2_combined_dataset/vocab.json',
 './gpt2_combined_dataset/merges.txt',
 './gpt2_combined_dataset/added_tokens.json')
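The logged training loss can be turned into a rough perplexity figure, which is often easier to interpret. The sketch below assumes the reported loss is the mean token-level cross-entropy in nats, so perplexity is simply `exp(loss)`. The loss values are copied from the table above; the helper name `perplexity` is mine. These are training-set numbers, so they say nothing about generalization; a held-out split would be needed for that.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(mean_cross_entropy)


# step -> training loss, copied from the log above
logged_losses = {500: 3.3047, 1000: 3.0334, 1500: 2.8189, 2000: 2.7034}

for step, loss in logged_losses.items():
    print(f"step {step}: loss={loss:.4f}  perplexity={perplexity(loss):.2f}")
```

The steady decrease across the four logged steps suggests the model was still improving on the training text when the three epochs ended.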

## 3. Evaluation and Testing

### Prompt

In [None]:
# Cross Site Scripting
code_snippet = ".innerHTML='[&#8212;"

# SQL Injection
#code_snippet = r'SELECT ") .replace(/\s{0,}UPDATE /ig,"UPDATE ") .replace(/ SET /ig," SET ")'

prompt = "What is Cross Site Scripting"

# alternative prompt that embeds the code snippet
# prompt = f"I have this context: {code_snippet}\nWhat type of vulnerability might this be susceptible to, and how can I mitigate the risk?"
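Switching between the two prompt styles by commenting lines in and out is easy to get wrong. A small helper can make both variants explicit. This is only a sketch; the function name `build_prompt` and its parameters are hypothetical, while the prompt wording is taken from the cell above.

```python
def build_prompt(code_snippet=None, question="What is Cross Site Scripting"):
    """Return the plain question, or the context-style prompt if a snippet is given."""
    if code_snippet is None:
        return question
    return (
        f"I have this context: {code_snippet}\n"
        "What type of vulnerability might this be susceptible to, "
        "and how can I mitigate the risk?"
    )


# plain question, as used for the generations in this section
print(build_prompt())

# context-style prompt built from a code snippet
print(build_prompt(".innerHTML='[&#8212;"))
```

Keeping the prompt text in one place also makes it easier to run the same prompt against the baseline and both fine-tuned models.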

### 3.1 Fine-tuned Model with Given Dataset

In [None]:
from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer

# Load the model and tokenizer from the directory of the given dataset fine-tuning
model_given = GPT2LMHeadModel.from_pretrained('./gpt2_given_dataset')
tokenizer_given = GPT2Tokenizer.from_pretrained('./gpt2_given_dataset')

# Create a text generation pipeline for the given dataset model
text_generator_given = pipeline('text-generation', model=model_given, tokenizer=tokenizer_given)

# Generate text using the model fine-tuned on the given dataset
generated_texts_given = text_generator_given(prompt, max_length=300, num_return_sequences=1)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
for generated_text in generated_texts_given:
    print(generated_text['generated_text'])

What is Cross Site Scripting (XSS) discovered in a website? This is where a malicious user could inject malicious code to the user's browser., {@code_bad:
<iframe src="https://www.example.com" sandbox="allow_javascript" >
:code}, The recommendations are: Consider rendering the site and the SWF files with JavaScript enabled., {@code_bad:
<iframe src="https://www.example.com" sandbox="allow_javascript" >
:code}, The causes are: Cross Site Scripting (XSS) can be bypassed by manipulating CSS attributes to expose a malicious user to the web. Using one or more of the `Variables.set()` and `Variables.display()` methods will expose such content to the user., {@code_bad:
<script language="javascript">
...
   $.ajax({
      type: 'POST',
      location:'my-site',
      payload: '<h1>I'm looking for a new web page</h1><br />';
   });
  
</script>
:code}, Since this content is being injected into the `Variables.set()` method, the `Variables.display()` and `Variables.customAttribute()` methods will

### 3.2 Fine-tuned Model with OWASP API

In [None]:
# Load the model and tokenizer from the directory of the combined dataset fine-tuning
model_combined = GPT2LMHeadModel.from_pretrained('./gpt2_combined_dataset')
tokenizer_combined = GPT2Tokenizer.from_pretrained('./gpt2_combined_dataset')

# Create a text generation pipeline for the combined dataset model
text_generator_combined = pipeline('text-generation', model=model_combined, tokenizer=tokenizer_combined)

# Generate text using the model fine-tuned on the combined dataset
generated_texts_combined = text_generator_combined(prompt, max_length=300, num_return_sequences=1)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
for generated_text in generated_texts_combined:
    print(generated_text['generated_text'])

What is Cross Site Scripting (XSS)?
  * **Cross site scripting injection (XSS) refers to a virus, malware, script, JavaScript, etc. that attacks a resource on an
application, then instructing that the resources are loaded from a different location, causing the
application to inject itself into the victim server;**
  * **XSS refers to a malicious code in a data source system, such as an application, server, phone, or computer, that attacks a
resource on the victim server.**

In general, this is a relatively minor threat, but if you are
understandable as a developer and know enough to know where the problem is, consider implementing
measures to prevent cross site scripting.

## 2.2 - Avoid using JavaScript by Appropriate Paths

It may well not be a good idea to enable JavaScript paths in the `!eval`
document input when executing scripts, or to use a different origin from the
`!evalPath` directive. Such directives need to be enabled and enforced to prevent
cross site scripting attacks lik