# Generating manual pages with GPT-2

GPT-2 (Generative Pre-trained Transformer 2) is a language model released by OpenAI in 2019 and the predecessor of GPT-3. It was state of the art at the time of its release and uses deep learning to generate fluent, human-like text.

The model was trained on WebText, roughly 40 GB of text scraped from about 8 million web pages, which lets it learn patterns and relationships between words and generate coherent, contextually relevant text. GPT-2 uses a decoder-only transformer architecture, a type of neural network well suited to sequential data such as natural language.

In [1]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
%cd '/content/drive/MyDrive/MachineLearningProjects/FineTuneGPT2'

Mounted at /content/drive/
/content/drive/MyDrive/MachineLearningProjects/FineTuneGPT2


In [None]:
# The [torch] extra installs transformers together with PyTorch
!pip install "transformers[torch]"

In [3]:
from transformers import (
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    GPT2LMHeadModel,
    TrainingArguments,
    Trainer,
    pipeline
)

In [4]:
train_path = 'data/man.txt'

## Data preprocessing


In [5]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


Given a sentence, the tokenizer returns a dictionary containing `input_ids`, the list of vocabulary indices of the tokens. It also returns an `attention_mask`, which tells the model which tokens to attend to (1) and which to ignore (0), for example padding tokens.

In [81]:
phrase = "man is an interface to the system reference manuals"
print(f'Tokenization of {phrase}:', tokenizer(phrase))

Tokenization of man is an interface to the system reference manuals: {'input_ids': [805, 318, 281, 7071, 284, 262, 1080, 4941, 41371], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
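
The attention mask above is all ones because a single sentence needs no padding. The following is a minimal pure-Python sketch of what padding a batch would look like. `pad_batch` is a hypothetical helper, not part of the Transformers API, and it assumes the pad id 50256: GPT-2 has no dedicated padding token, and the generation logs in this notebook show the end-of-text id 50256 being used in its place.

```python
def pad_batch(sequences, pad_id=50256):
    """Right-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens get mask 1, padding positions get mask 0.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids of the example phrase, plus a shorter prefix of it.
full = [805, 318, 281, 7071, 284, 262, 1080, 4941, 41371]
short = full[:3]
batch = pad_batch([full, short])
print(batch["attention_mask"][1])  # [1, 1, 1, 0, 0, 0, 0, 0, 0]
```

The shorter sequence is filled up to the length of the longer one, and the zeros in its mask mark the positions the model should skip.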


In [7]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

## Load dataset

In [8]:
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=train_path,
    block_size=128
)
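
A rough picture of what `TextDataset` does with `block_size=128`, assuming it tokenizes the whole file into one long stream and cuts it into consecutive fixed-size blocks while dropping the incomplete tail. `chunk_tokens` is a hypothetical stand-in for the library's internal logic, shown here on a toy stream:

```python
def chunk_tokens(token_ids, block_size):
    """Split one long token stream into consecutive, non-overlapping blocks.

    Trailing tokens that do not fill a whole block are dropped.
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]

stream = list(range(10))               # stand-in for the tokenized corpus
blocks = chunk_tokens(stream, block_size=4)
print(blocks)                          # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because block boundaries ignore sentence and page boundaries, a training sample can start in the middle of a line of a manual page.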



In [82]:
print(tokenizer.decode(train_dataset[123]))

lr] [-p filename] [-dt] [name]
              Each  time  hash  is invoked, the full pathname of the command name is determined by searching the
              directories in $PATH and remembered.  Any previously-remembered pathname is discarded.  If the  -p
              option  is supplied, no path search is performed, and filename is used as the full filename of the
     


## Fine-tuning

Fine-tuning adapts the pre-trained GPT-2 model to a specific task or dataset using the Transformers library. It reuses the language knowledge the model acquired during pre-training and specializes it, here on the text of manual pages.
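
For a causal language model such as GPT-2, the fine-tuning objective is next-token prediction: the training loss is the average cross-entropy between the distribution predicted at each position and the token that actually follows. Below is a small pure-Python sketch on made-up logits. The three-token vocabulary and the numbers are illustrative only, and `next_token_loss` is a hypothetical helper rather than a library function.

```python
import math

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t."""
    losses = []
    for position, target in enumerate(token_ids[1:]):
        row = logits[position]
        # Log of the softmax normalizer, computed stably by subtracting the max logit.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens; one row of logits per input position.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3
confident_logits = [[0.0, 0.0, 10.0], [0.0, 10.0, 0.0], [0.0, 0.0, 0.0]]

print(next_token_loss(uniform_logits, token_ids))    # equals log(3) for a uniform guess
print(next_token_loss(confident_logits, token_ids))  # close to 0 when the right token dominates
```

Training pushes the model from the first situation toward the second on the fine-tuning corpus.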


In [10]:
model = GPT2LMHeadModel.from_pretrained('gpt2')


In [11]:
training_args = TrainingArguments(
    output_dir = 'data/out', # the output directory for the model predictions and checkpoints
    overwrite_output_dir = True, # overwrite the content of the output directory
    per_device_train_batch_size = 32, # the batch size for training
    per_device_eval_batch_size = 32, # the batch size for evaluation
    learning_rate = 5e-5, # defaults to 5e-5
    num_train_epochs = 1, # total number of training epochs to perform
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    # No held-out split is used here: evaluation would reuse the training data
    eval_dataset=train_dataset,
)

In [12]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 500  | 1.8262 |
| 1000 | 1.5769 |
| 1500 | 1.5133 |
| 2000 | 1.4631 |
| 2500 | 1.4346 |
| 3000 | 1.413  |
| 3500 | 1.3859 |
| 4000 | 1.3782 |
| 4500 | 1.3629 |
| 5000 | 1.3447 |


TrainOutput(global_step=10974, training_loss=1.3814163611238068, metrics={'train_runtime': 13431.3685, 'train_samples_per_second': 26.144, 'train_steps_per_second': 0.817, 'total_flos': 2.29381742592e+16, 'train_loss': 1.3814163611238068, 'epoch': 1.0})
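
The figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so each step processes at most 32 blocks. It also converts the mean training loss into a perplexity, which is simply its exponential. Since the loss is measured on the training data, the perplexity says nothing about unseen text.

```python
import math

global_step = 10974
batch_size = 32                    # per_device_train_batch_size, single device assumed
train_runtime = 13431.3685         # seconds
samples_per_second = 26.144
train_loss = 1.3814163611238068

# Upper bound on the number of 128-token blocks seen in the epoch
# (the last batch may be smaller than batch_size).
max_samples = global_step * batch_size
# Independent estimate of the same quantity from the throughput figures.
approx_samples = train_runtime * samples_per_second
# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(train_loss)
hours = train_runtime / 3600

print(max_samples, round(approx_samples), round(perplexity, 2), round(hours, 1))
```

Both estimates land at roughly 351,000 training blocks, the run took a little under four hours, and the perplexity comes out just below 4.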

In [None]:
trainer.save_model()

## Text generation

In [None]:
generator = pipeline('text-generation', tokenizer='gpt2', model='data/out')

In [154]:
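def generate(phrase, maxlen=0):
    # max_length counts tokens, prompt included; 5*len(phrase) is a rough
    # character-based default used when maxlen is not given
    text = generator(phrase, max_length=maxlen or 5*len(phrase))[0]['generated_text']
    # Drop the echoed prompt (str.removeprefix requires Python 3.9+)
    text = text.removeprefix(phrase)
    # Print the prompt in cyan (ANSI escape codes), followed by the continuation
    print(f"\n\n\x1b[36m{phrase}\x1b[0m{text}")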
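
`generate` wraps the prompt in the ANSI escape codes `\x1b[36m` (cyan) and `\x1b[0m` (reset). They render as color in a terminal or notebook but appear as stray characters when the output is saved as plain text. A minimal sketch for stripping them, where `strip_ansi` is a hypothetical helper and the pattern covers only color and style (SGR) sequences:

```python
import re

# Matches SGR sequences such as "\x1b[36m" or "\x1b[0m" (colors and styles only).
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def strip_ansi(text):
    """Remove ANSI color/style escape sequences from a string."""
    return ANSI_SGR.sub("", text)

colored = "\x1b[36mls -a\x1b[0m -f /var/cache"
print(strip_ansi(colored))  # ls -a -f /var/cache
```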

In [122]:
generate("SYNOPOSIS", maxlen = 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


SYNOPOSIS
       gcloud    compute  target-https-proxies   create  NAME  (--region=REGION | --zone=ZONE)  [--labels=[KEY=VALUE,...]]
              [GCLOUD_WIDE_FLAG...]

DESCRIPTION
       (BETA) Create a target


In [129]:
generate("DESCRIPTION", maxlen = 400)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


DESCRIPTION
       (ALPHA) Get Distributed Cloud Edge Network resources.

EXAMPLES
       Describe the Distributed Cloud Edge Network resource:

         $ gcloud alpha  edge-cloud networking endpoints describe example-endpoint \
             --location=us-east1

REQUIRED FLAGS
         --location=LOCATION

           Location of the Distributed Cloud Edge Network endpoint group

         --role=ROLE

           Role ID of the role to describe.

FLAGS
          At most one of these can be specified:

           --region=REGION

            Dataproc   region  of the edge network endpoint group. Overrides the default dataproc/region property value for this command
          request.

         --update-labels=[KEY=VALUE,...]

          Update the labels associated with a Distributed Cloud Edge Network endpoint group.

         Keys must start with a lowercase character and contain only hyphens (-), underscores (_), lowercase
        characters, and numbers. Values must contain on

In [156]:
generate("OPTIONS", maxlen = 200)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




OPTIONS
       These are the options that are not documented in the documentation.

SEE ALSO
       lvm(8), lvmadd(8), lvminfo(8)

AUTHOR
       LvmInfo was written by Michael Reiner <mreiner@samba.org>

       The lvmd daemon was first available in Linux 4.1.39.

       The lvm module has now been ported to FreeBSD 9.0.

SEE ALSO
       lvm(8)

BUGS
       This is currently in beta and might change without notice. These variants are also available:

         $ gcloud alpha active-directory domains

         $ gcloud alpha data


In [123]:
generate("COMMANDS", maxlen = 200)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


COMMANDS
       COMMAND is one of the following:

         describe

           (BETA) Describe an AlloyDB connection.

         get-database-resources

           (BETA) Get database resources for an AlloyDB connection.

         list

           (BETA) List AlloyDB connections in a Filestore instance.

         update

           (BETA) Update a Filestore instance.

          get-database-resources

           (BETA) Get database resources for an AlloyDB connection.

NOTES



In [125]:
generate("In the following descriptions", maxlen = 200)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In the following descriptions, you cannot use this command when the
                                                PagingInterval is greater than the timeout:

               PagingInterval -u no-serving: none

             PagingInterval -u none: PagingExcludes=none

             PagingInterval -u timeout-1: PagingInterval =10ms, PagingInterval =3.4s,
             PagingInterval --timeout=-1: 5s, PagingInterval --total-days-ago=


In [126]:
generate("Most options may be given in one of two forms:")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Most options may be given in one of two forms:

       • The  format  of  the  input. This parameter specifies the format of the input for which data should be
          given.  The default is "plain", which is a literal string used to be passed as a filename.

       • The value of the input. This parameter specifies the format for whose data should be given.

          By default, data with an empty value is parsed as a single argument; the value must contain at least one value
          typed and stored as data. The value will contain empty or quoted strings, as well as the numeric values that
           may be passed as arguments to the function. If you wish to use the default format, pass "null" as the default.

       • The default argument should be empty for a null-terminated string (i.e., the character


In [127]:
generate("The environment variable is parsed before the command line, so")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The environment variable is parsed before the command line, so that the command line is  executed  before
          any other command in the program. However the environment variable also takes as the argu‐
           ment name of a shell function which returns 1 if the shell's environment variable or the value of the
            value of the option arg is missing, or 0 otherwise.

       --cursor-indent 0
              Turn the cursor cursor indent on.  This can be useful for displaying text in interactive mode, or when making a paste
              command in a shell script (such as pkill or tilde-l) by putting the newline in the newline.  It can also TOPIC-LINES.  This option
              is  useful  if  the  editor  is  unable to do what the user expects: it is unable toendiary a window,
            it has no history, no history options and no history information (except the value of the
          --colors subcommand to be executed).  This allows the output of the option t

In [128]:
generate("An 8-bit color string is one or two decimal integers separated by a dot, where the  first  integer specifies")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


An 8-bit color string is one or two decimal integers separated by a dot, where the  first  integer specifies
           which pixel has the RGB_INDEX. For example, 0x23 means the character  color2  is  1  if  none  of
           the  pixels  in  the  line  have any reds, greens, and blues. When color is 0, the first pixel of the line
           has all of the colors, and the second does not; the remainder of the line has only reds, greens,
            blues, and blues.

            Note that these flags modify the display_name or window_name structure provided by the XkbGeometry  object spoiler.  Any  of
           any  of  the  bits  of  the  new  screen  type may be set for the display_name attribute; the other
          bits may be set in any other way you want. To determine whether or not a screen is ready to display a new window
         handle a new screen name, or which of the following options may be passed to the XkbSetWindowDescanter or
        WindowSpec  function. 

In [130]:
generate("You  may  define  your own")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


You  may  define  your own custom tag for a region. Each tag has a unique name and can be used
       by other tags in the same region.  If no argument is specified, then a reference to the region will be shown.

       Region resources inherit resource:value, which of the following keys may be set:

       • the resource's version.

       • the resource's version identifier, the region of the resource which is inheritable of the resource.

       • the type of


In [131]:
generate("To use these scripts, put them both where they can be executed and ")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


To use these scripts, put them both where they can be executed and  use  these options to override
             defaults. The.gs-profile file should point to a directory where the settings for these scripts might be
              run.  To run a script that is currently running and will update the configs to be generated by this  script,
           press  the    gs_profile     variable    and    enter    the  command    it   invoked:
                gs_profile --add-autoprovisioning-schedule-config-file \
                   gs_profile --all --autoprovisioning-schedule-configs --autoprovisioning-script

               You can execute a script named 'GSS_PROJECT_ID' by sending its arguments along with $GIT_PROJECT_ID, or else run that script with this command
           $KUBECONF_VERSION with the following arguments:

          --autoprovisioning-schedule-config-file=AUTOPROVISIONING_SCHED


In [134]:
generate("ls -a")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


ls -a -f /var/cache/dpy-ld.c

        
