# Sample Notebook on how to run inference using `GPT-J`

The GPT-J model was released in the [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax) repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like causal language model trained on the [Pile](https://pile.eleuther.ai/) dataset.

This model was contributed by [Stella Biderman](https://huggingface.co/stellaathena).


Documentation: [GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj#gptj)



In [None]:
!pip install transformers==4.12.3 torch==1.9.1 --upgrade

In [5]:
import transformers
import torch

assert transformers.__version__ == "4.12.3", f"wrong transformers version: {transformers.__version__}"
assert "1.9.1" in torch.__version__, f"wrong torch version: {torch.__version__}"

We use the [fp16 branch](https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16) of the model repository, which stores the weights in half precision (fp16) and so needs about half the RAM of the fp32 weights. Combined with `low_cpu_mem_usage=True`, loading the model should take roughly 12.1GB of CPU RAM.
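The 12.1GB figure follows from the parameter count, since fp16 stores each parameter in 2 bytes. Below is a rough sketch of that arithmetic. It assumes the parameter count of 6,053,381,344 reported on the GPT-J model card; the constant and helper names are illustrative and not part of this notebook. Only the raw weights are counted, not activations or framework overhead.

```python
# Assumption: parameter count taken from the GPT-J-6B model card.
GPTJ_PARAMETERS = 6_053_381_344

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_bytes(n_params, dtype):
    """Size of the raw weights only; activations and overhead are not included."""
    return n_params * BYTES_PER_PARAM[dtype]


fp16_bytes = weight_size_bytes(GPTJ_PARAMETERS, "fp16")
fp32_bytes = weight_size_bytes(GPTJ_PARAMETERS, "fp32")

fp16_gb = fp16_bytes / 1e9      # decimal gigabytes
fp16_gib = fp16_bytes / 2**30   # binary gibibytes
fp32_gb = fp32_bytes / 1e9

print(f"fp16: {fp16_gb:.1f} GB ({fp16_gib:.1f} GiB), fp32: {fp32_gb:.1f} GB")
# prints: fp16: 12.1 GB (11.3 GiB), fp32: 24.2 GB
```

The binary-unit value would be consistent with the 11.3G shown by the download progress bar, if that bar counts in units of 1024.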

## Loading the model and using the `generate` method

Because we use the `fp16` branch of the model, it should fit on a 16GB GPU for inference (for example a P3 instance or a T4 GPU).
Downloading the fp16 branch (11.3GB) on an EC2 machine took 3 minutes and 32 seconds. Loading the model into memory took another 3 minutes.
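The load times quoted in this notebook can be measured with a small timing helper. The sketch below uses only the standard library; `timed` and `format_duration` are hypothetical names introduced here, and the `time.sleep` call is a stand-in for the actual `from_pretrained` call.

```python
import time
from contextlib import contextmanager


def format_duration(seconds):
    """Format a duration in seconds as 'Xm Ys', e.g. 212 -> '3m 32s'."""
    minutes, rest = divmod(int(round(seconds)), 60)
    return f"{minutes}m {rest}s"


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with a monotonic clock and print the result."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {format_duration(elapsed)} ({elapsed:.1f}s)")


# Stand-in workload; in the notebook this would wrap the model loading call.
timings = {}
with timed("dummy load", timings):
    time.sleep(0.05)
```

Wrapping a cell body in `with timed("load from hub", timings):` keeps all measurements in one dictionary for later comparison.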


In [1]:
from transformers import GPTJForCausalLM
import torch

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)

Downloading:   0%|          | 0.00/836 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/11.3G [00:00<?, ?B/s]

In [11]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

device='cuda:0'
model.to(device)

prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
         "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
         "researchers was the fact that the unicorns spoke perfect English."

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen_tokens = model.generate(input_ids.to(device), do_sample=True, temperature=0.9, max_length=100)

tokenizer.batch_decode(gen_tokens)[0]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. They even have the same rights as us humans. The unicorns are only two foot tall--an easy target for a hunting rifle.\n\n\nThe unicorns have the same rights as humans because they are, technically, human: they all stem from the same origin. They'

## Loading `gpt-j` from cache

Loading `gpt-j` from the local cache took 3 minutes and 16 seconds.

In [12]:
model2 = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)

## Loading the model from a directory

Loading the model from disk took 1 minute and 23 seconds.

In [13]:
model2.save_pretrained("./tmp")

In [14]:
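# pass torch_dtype so the saved fp16 weights are not upcast to fp32 on load
model2 = GPTJForCausalLM.from_pretrained("./tmp", torch_dtype=torch.float16)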

## Loading `gpt-j` using `torch.load`

Loading the model with `torch.load` took 7.7 seconds.

In [2]:
from transformers import AutoTokenizer,GPTJForCausalLM
import torch

# load fp 16 model
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16)
torch.save(model, "gptj.pt")


In [3]:
model = torch.load("gptj.pt")
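To put the load times reported in this notebook side by side, the short sketch below converts them to seconds and expresses each as a multiple of the `torch.load` time. The numbers are the single measurements quoted in the text, so the ratios are indicative only.

```python
# Load times reported in this notebook, converted to seconds.
load_times_s = {
    "from_pretrained (local cache)": 3 * 60 + 16,
    "from_pretrained (directory)": 1 * 60 + 23,
    "torch.load": 7.7,
}

baseline = load_times_s["torch.load"]
speedups = {name: seconds / baseline for name, seconds in load_times_s.items()}

for name, factor in speedups.items():
    print(f"{name}: {load_times_s[name]:.1f}s ({factor:.1f}x the torch.load time)")
```

On these measurements, loading the pickled model is roughly 25 times faster than loading from the local cache and roughly 11 times faster than loading from a saved directory.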

In [13]:
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
gen = pipeline("text-generation",model=model,tokenizer=tokenizer,device=0)

In [15]:
gen("My Name is philipp")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'My Name is philipp k. and I live just outside of Detroit. For most of my growing up years I knew that I wanted to be in the art world but had no idea where to start. I started taking art classes during high school and'}]

## Creating `model.tar.gz` for SageMaker deployment

In [16]:
import tarfile
import os

def compress(tar_dir=None,output_file="model.tar.gz"):
    with tarfile.open(output_file, "w:gz") as tar:
        tar.add(tar_dir, arcname=os.path.sep)
            

import boto3

def upload_file_to_s3(bucket_name=None,file_name="model.tar.gz",key_prefix=""):
    s3 = boto3.resource('s3')
    key_prefix_with_file_name = os.path.join(key_prefix,file_name)
    s3.Bucket(bucket_name).upload_file(file_name, key_prefix_with_file_name)
    return f"s3://{bucket_name}/{key_prefix_with_file_name}"

In [18]:
import os
import shutil 
import tarfile
import torch
from transformers import AutoTokenizer,GPTJForCausalLM


model_save_dir="./tmp"
bucket_name="hf-sagemaker-inference"
key_prefix="gpt-j"
src_inference_script= os.path.join("code","inference.py")
dst_inference_script= os.path.join(model_save_dir,"code")

os.makedirs(model_save_dir,exist_ok=True)
os.makedirs(dst_inference_script,exist_ok=True)

# load fp 16 model
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16)
torch.save(model, os.path.join(model_save_dir,"gptj.pt"))

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.save_pretrained(model_save_dir)

# copy inference script
shutil.copy(src_inference_script, dst_inference_script)

# create archive
compress(model_save_dir)

# upload to s3
model_uri = upload_file_to_s3(bucket_name=bucket_name,key_prefix=key_prefix)
model_uri


's3://hf-sagemaker-inference/gpt-j/model.tar.gz'
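The packaging cell copies `code/inference.py` into the archive, but the script itself is not shown in this notebook. Below is a minimal sketch of what such a script might contain. It assumes the `model_fn` and `predict_fn` override hooks of the SageMaker Hugging Face Inference Toolkit and the `gptj.pt` file saved above; the function bodies and the `fake_generator` stand-in are illustrative, not the actual script used here.

```python
import os


def model_fn(model_dir):
    """Load the pickled model and the tokenizer from the unpacked model.tar.gz."""
    # Imported lazily so the module can be inspected without torch/transformers installed.
    import torch
    from transformers import AutoTokenizer, pipeline

    model = torch.load(os.path.join(model_dir, "gptj.pt"))
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    device = 0 if torch.cuda.is_available() else -1  # -1 selects the CPU
    return pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)


def predict_fn(data, generator):
    """Run generation for a request shaped like {'inputs': ..., 'parameters': {...}}."""
    inputs = data["inputs"]
    parameters = data.get("parameters") or {}
    return generator(inputs, **parameters)


# Quick check of the request handling with a stand-in generator (no model needed).
def fake_generator(text, **kwargs):
    return [{"generated_text": text + "...", "kwargs": kwargs}]


result = predict_fn({"inputs": "Hello", "parameters": {"max_length": 20}}, fake_generator)
print(result)
```

Loading with `torch.load` in `model_fn` mirrors the fast load path measured earlier, which keeps the endpoint start-up time short.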

**Bash alternative:** after loading and saving the model and copying `inference.py`, the archive can also be created and uploaded from the shell.

In [None]:
%%bash
# archive the contents of the model directory so the files sit at the archive root
cd ./tmp
tar zcvf model.tar.gz *
aws s3 cp model.tar.gz s3://hf-sagemaker-inference/gpt-j/model.tar.gz
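Whichever way the archive is built, it is worth checking its layout before uploading, since the Hugging Face inference container is expected to pick up a custom script from `code/inference.py` at the root of the archive. The sketch below builds a tiny stand-in archive from placeholder files (not real weights) and lists its contents; `archive_files` is a hypothetical helper introduced here.

```python
import os
import posixpath
import tarfile
import tempfile


def archive_files(archive_path):
    """Return the sorted, normalized paths of the regular files in a tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(
            posixpath.normpath(member.name)
            for member in tar.getmembers()
            if member.isfile()
        )


# Build a small stand-in for the model directory using placeholder files.
with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "tmp")
    os.makedirs(os.path.join(model_dir, "code"))
    for relative_path in ("gptj.pt", os.path.join("code", "inference.py")):
        with open(os.path.join(model_dir, relative_path), "w") as f:
            f.write("placeholder")

    archive_path = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=".")  # store members relative to the archive root

    files = archive_files(archive_path)

print(files)
# prints: ['code/inference.py', 'gptj.pt']
```

Running `archive_files("model.tar.gz")` on the real archive should show `code/inference.py` and `gptj.pt` without any leading directory such as `tmp/`.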


## Deploy endpoint

In [None]:
!pip install sagemaker

In [29]:
from sagemaker.huggingface import HuggingFaceModel
import boto3
import os

os.environ["AWS_DEFAULT_REGION"]="us-east-1"


iam_role="sagemaker_execution_role"
model_uri="s3://hf-sagemaker-inference/gpt-j/model.tar.gz"

iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName=iam_role)['Role']['Arn']

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=model_uri,
    transformers_version='4.12',
    pytorch_version='1.9',
    py_version='py38',
    role=role,
)


# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.g4dn.xlarge',  # EC2 instance type; 'ml.p3.2xlarge' is an alternative
)


---------------!

In [30]:
predictor.predict({
    'inputs': "Can you please let us know more details about your "
})

[{'generated_text': 'Can you please let us know more details about your \nexperiences with the bookkeeper.\n\nI received a call from Chris Foster requesting that you review the below \nAgreement and return with any comments.  \n\nAs a'}]

In [31]:
predictor.predict({
    'inputs': "Can you please let us know more details about your "
})

[{'generated_text': 'Can you please let us know more details about your xtraday, xtrading and portfolio strategies?\nSo far, I have read that you have used a 15% drawdown when you exited the equity fund. Is this a safe'}]

A request can also carry generation parameters:

In [23]:
predictor.predict({
    'inputs': "Can you please let us know more details about your ",
    "parameters": {
        "min_length": 120,
        "temperature": 0.9,
    }
})

[{'generated_text': 'Can you please let us know more details about your \nissue?\n\nA:\n\nThe problem was caused by my lack of understanding on how web sockets \n  worked. Once I understood how they work; I was able to fix'}]
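The same request can be sent without the `sagemaker` SDK, for example from an application that only has `boto3`. The sketch below assumes the endpoint accepts the JSON body shown in the cells above. `build_payload` and `invoke` are hypothetical helpers, and the region default simply repeats the one set earlier in this notebook.

```python
import json


def build_payload(inputs, **parameters):
    """Build the JSON-serializable request body used in the cells above."""
    payload = {"inputs": inputs}
    if parameters:
        payload["parameters"] = parameters
    return payload


def invoke(endpoint_name, payload, region_name="us-east-1"):
    """Send the payload to a deployed endpoint through the SageMaker runtime API."""
    import boto3  # imported lazily; AWS credentials are needed at call time

    client = boto3.client("sagemaker-runtime", region_name=region_name)
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())


payload = build_payload(
    "Can you please let us know more details about your ",
    min_length=120,
    temperature=0.9,
)
print(json.dumps(payload, indent=2))
```

With a live endpoint, `invoke(predictor.endpoint_name, payload)` should return the same list of `generated_text` dictionaries as `predictor.predict`.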

A custom end-of-sequence token can be set through `eos_token_id`:

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

end_sequence = "."
temperature = 40
max_generated_token_length = 50
prompt = "Can you please let us know more details about your "

# length limits are counted in tokens, not characters
prompt_token_length = len(tokenizer(prompt).input_ids)

predictor.predict({
    'inputs': prompt,
    "parameters": {
        # upper bound on prompt + generated tokens; generation stops earlier at the end sequence
        "max_length": prompt_token_length + max_generated_token_length,
        "temperature": temperature,
        "eos_token_id": tokenizer.convert_tokens_to_ids(end_sequence),
    }
})
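Setting `eos_token_id` only stops generation when the end sequence is emitted as exactly that token, so a client-side fallback can be useful. The sketch below trims a response at the first end sequence that follows the prompt. `truncate_at_end_sequence` is a hypothetical helper, and the sample response is made up to exercise it.

```python
def truncate_at_end_sequence(generated_text, prompt, end_sequence="."):
    """Cut the continuation after the first end sequence that follows the prompt."""
    # Only search the continuation, since the prompt itself may contain the end sequence.
    start = len(prompt) if generated_text.startswith(prompt) else 0
    position = generated_text.find(end_sequence, start)
    if position == -1:
        return generated_text  # end sequence never produced; leave the text unchanged
    return generated_text[: position + len(end_sequence)]


prompt = "Can you please let us know more details about your "
# Hypothetical endpoint output, used only to exercise the helper.
generated = prompt + "order. We will then follow up. Thanks"
trimmed = truncate_at_end_sequence(generated, prompt)
print(trimmed)
# prints: Can you please let us know more details about your order.
```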

In [32]:
predictor.delete_endpoint()