# Fine Tune GPT on SageMaker Examples
GPT stands for Generative Pre-trained Transformer. In this example, we'll fine-tune the large GPT-2 model on the Amazon SageMaker Examples code repository.

First, let's process the raw notebook files and convert them into text.

In [None]:
# consider doing a fresh clone so you have all the raw content
!git clone https://github.com/awslabs/amazon-sagemaker-examples.git

In [None]:
bucket = 'your-bucket'
path = 'your-prefix'

In [None]:
import os

# utils is a custom package I developed for this project
from utils import parse_notebook, initialize_output

def main(root_file, verbose, output_file, bucket, path):
    '''
    Takes a root file name, loops through all files.
        When it finds an ipython notebook, pulls in for parsing.
    '''
    hits = 0
    totals = 0
    
    for subdir, dirs, files in os.walk(root_file):

        for file in files:

            if file.endswith('.ipynb'):
                    
                try:
                    parse_notebook(input_file = os.path.join(subdir, file), output_file = output_file, bucket=bucket, path=path )
                    if verbose:
                        print ('worked for ', file)
                    hits += 1
                except Exception:
                    if verbose:
                        print ('broke on ', file)
                        
                totals += 1
                         
    print ('Got {} hits out of {} total.'.format(hits, totals))
    return

output_file = "sagemaker-examples.txt"
initialize_output(output_file)    
verbose = True
main('amazon-sagemaker-examples', verbose , output_file, bucket, path )
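
The `parse_notebook` helper lives in the custom `utils` package and is not shown here. As a rough idea of what such a step involves, here is a minimal sketch, assuming the standard `.ipynb` JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The function name `notebook_to_text` is hypothetical, and the real helper also takes `bucket` and `path`, which this sketch ignores.

```python
import json

def notebook_to_text(notebook_json):
    '''
    Hypothetical stand-in for utils.parse_notebook: flattens the markdown
    and code cells of one notebook (given as a JSON string) into plain text.
    '''
    nb = json.loads(notebook_json)
    chunks = []

    for cell in nb.get('cells', []):
        if cell.get('cell_type') not in ('markdown', 'code'):
            continue

        # 'source' is stored either as a single string or as a list of lines
        source = cell.get('source', '')
        if isinstance(source, list):
            source = ''.join(source)

        if source.strip():
            chunks.append(source.strip())

    # a blank line between cells keeps them separate in the training text
    return '\n\n'.join(chunks)


# tiny in-memory notebook so the sketch runs without touching disk
sample = json.dumps({
    'cells': [
        {'cell_type': 'markdown', 'source': ['# Title\n', 'Some text.']},
        {'cell_type': 'code', 'source': 'print("hi")\n', 'outputs': []},
        {'cell_type': 'raw', 'source': 'ignored'},
    ]
})
sample_text = notebook_to_text(sample)
print(sample_text)
```

Cell outputs are dropped on purpose here, so the training text contains only what an author wrote.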

Great! Now, let's copy that over to S3. 

In [None]:
s3_train_path = 's3://{}/{}/'.format(bucket, path)
# os.system('aws s3 cp {} {}'.format(output_file, s3_train_path))
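
If you do run the copy, note that `os.system` does not raise when the AWS CLI fails. A minimal sketch of an alternative, assuming the AWS CLI is installed and configured: build the command as a list and run it with `subprocess.run(..., check=True)`. The helper names `build_s3_cp_command` and `upload_to_s3` are hypothetical.

```python
import subprocess

def build_s3_cp_command(local_file, bucket, prefix):
    # hypothetical helper: strip leading and trailing slashes from the prefix
    prefix = prefix.strip('/')
    destination = 's3://{}/{}/'.format(bucket, prefix)
    return ['aws', 's3', 'cp', local_file, destination]

def upload_to_s3(local_file, bucket, prefix):
    # check=True raises CalledProcessError if the AWS CLI exits with a non-zero code
    subprocess.run(build_s3_cp_command(local_file, bucket, prefix), check=True)

# building the command has no side effects; upload_to_s3 is what actually calls the CLI
cmd = build_s3_cp_command('sagemaker-examples.txt', 'your-bucket', 'your-prefix/')
print(' '.join(cmd))
```

Calling `upload_to_s3(output_file, bucket, path)` would then perform the upload and fail loudly if the copy does not succeed.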

Now, let's write a Python script that imports the large GPT-2 model, points to a fine-tuning framework, and fine-tunes the model on our data.

It turns out we need a legacy version of TensorFlow to use this fine-tuning framework. In addition, when using script mode with this version of TensorFlow, we need to set the entry point to a bash script that runs in the SageMaker training container and installs our extra packages. Let's define that below.

In [None]:
%%writefile src/bash_start.sh

# install the extra packages
pip install -r requirements.txt

# run our script
python tune_gpt.py

In [None]:
%%writefile src/requirements.txt
# tensorflow==1.14
awscli
gpt-2-simple

After that, here's a modification of Max Woolf's nice gpt-2-simple package to fine-tune GPT-2 on a data file we bring ourselves. Thanks for the starter code, Max!
- https://github.com/minimaxir/gpt-2-simple 

In [None]:
%%writefile src/tune_gpt.py

import os
import gpt_2_simple as gpt2

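def get_model_name(model_size):
    if 'large' in model_size:
        model_name = '774M'

    elif 'medium' in model_size:
        model_name = '355M'

    elif 'small' in model_size:
        model_name = '124M'

    else:
        # fail early instead of hitting an UnboundLocalError on return
        raise ValueError('unknown model size: {}'.format(model_size))

    return model_name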

def get_file_name(bucket, path, file_name):
    
    # for the exceedingly time-constrained    
    os.system('aws s3 cp s3://{}/{}/{} .'.format(bucket, path, file_name))
    
    return file_name

def save_to_s3(txt, bucket, path, out_file):
    
    # say hello to cloudwatch
    print (txt)

    with open(out_file, 'w') as f:
        f.write(txt)        

    os.system('aws s3 cp {} s3://{}/{}/output/'.format(out_file, bucket, path))
        
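    # also keep the trained model: SageMaker uploads the contents of SM_MODEL_DIR to S3 when the job ends
    # this assumes gpt-2-simple's default checkpoint location, checkpoint/run1
    save_path = os.environ.get('SM_MODEL_DIR')
    if save_path:
        os.system('cp -r {} {}'.format(os.path.join('checkpoint', 'run1'), save_path))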

if __name__ == "__main__":
            
    # we need to hard-code these values twice, once in the notebook above and once here,
    # because %%writefile writes the cell verbatim and does not substitute notebook variables
    bucket = 'your-bucket'
    
    path = 'your-prefix'
        
    model_name = get_model_name('large')

    if not os.path.isdir(os.path.join("models", model_name)):
        print(f"Downloading {model_name} model...")
        gpt2.download_gpt2(model_name=model_name)   # model is saved into current directory under /models/model_name/

    file_name = get_file_name(bucket, path, 'sagemaker-examples.txt')

    sess = gpt2.start_tf_sess()

    print ('fine tuning on {}'.format(file_name))
    
    gpt2.finetune(sess,
                  file_name,
                  model_name=model_name,
                  steps=1000)   # steps is max number of training steps

    txt = gpt2.generate(sess, return_as_list = True)[0]
    
    save_to_s3(txt, bucket, path, 'output.txt')
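
When a training job is launched with an S3 input path, SageMaker also downloads that data into the training container before the script starts. A minimal sketch, assuming the default channel name `training` (which SageMaker exposes through the `SM_CHANNEL_TRAINING` environment variable), of how the script could look there first and fall back to the working directory. `resolve_training_file` is a hypothetical helper, not part of gpt-2-simple.

```python
import os
import tempfile

def resolve_training_file(file_name, env=None):
    '''
    Hypothetical helper: prefer the copy SageMaker placed in the training
    channel directory, otherwise fall back to file_name in the working directory.
    '''
    env = os.environ if env is None else env
    channel_dir = env.get('SM_CHANNEL_TRAINING')

    if channel_dir:
        candidate = os.path.join(channel_dir, file_name)
        if os.path.isfile(candidate):
            return candidate

    return file_name


# simulate the container layout with a temporary directory
with tempfile.TemporaryDirectory() as channel:
    open(os.path.join(channel, 'sagemaker-examples.txt'), 'w').close()
    found = resolve_training_file('sagemaker-examples.txt', {'SM_CHANNEL_TRAINING': channel})
    found_in_channel = found.startswith(channel)

# without the environment variable the helper returns the bare file name
missing = resolve_training_file('sagemaker-examples.txt', {})
```

Used this way, the script would not need its own `aws s3 cp` call, though keeping that call as a fallback is harmless.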

Once you've gotten that file written, just run your job!

In [None]:
from sagemaker.tensorflow import TensorFlow
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

est = TensorFlow(entry_point='bash_start.sh',
                 role=role,
                 source_dir='src',
                 train_instance_count=1,

                 # most accounts will need to explicitly request a limit increase for an instance this large.
                 # just reach out to AWS support for this
                 train_instance_type='ml.p3dn.24xlarge',
                 framework_version='1.14',
                 py_version='py3')

# feel free to set wait to True here, or logs to True, if you want to see the results here.
# Otherwise, wait a few minutes, then open up CloudWatch to view your model training.
est.fit(s3_train_path, wait=False)
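
With `wait=False` the call returns immediately, so you may want to check on the job from the notebook. A minimal sketch, assuming a boto3 SageMaker client and its `describe_training_job` call; the helper `wait_for_job` and the stub client used to exercise it are hypothetical. In the notebook you would pass `boto3.client('sagemaker')` and `est.latest_training_job.name` instead.

```python
import time

TERMINAL_STATES = ('Completed', 'Failed', 'Stopped')

def wait_for_job(client, job_name, poll_seconds=60, max_polls=120):
    '''
    Hypothetical helper: polls describe_training_job until the job reaches
    a terminal state or max_polls is exhausted, and returns the last status seen.
    '''
    status = None
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response['TrainingJobStatus']
        print('{}: {}'.format(job_name, status))
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return status


class StubClient:
    # stand-in for boto3.client('sagemaker') so the sketch runs offline
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_training_job(self, TrainingJobName):
        return {'TrainingJobStatus': next(self._statuses)}


final_status = wait_for_job(StubClient(['InProgress', 'InProgress', 'Completed']),
                            'example-job', poll_seconds=0)
```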