# Introduction to SageMaker JumpStart - text generation with Mistral models

---
In this demo notebook, we demonstrate how to use the SageMaker Python SDK to deploy Mistral models for text generation tasks.

---

## Setup
***

In [2]:
model_id = "huggingface-llm-mistral-7b-instruct"

In [3]:
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id=model_id)
predictor = model.deploy()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


Using model 'huggingface-llm-mistral-7b-instruct' with wildcard version identifier '*'. You can pin to version '3.0.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


--------!
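The log message above recommends pinning the model version rather than relying on the `*` wildcard. A minimal sketch of how that might look, assuming `JumpStartModel` accepts a `model_version` keyword and that the version `3.0.0` named in the log is still available in your region; the helper name `deploy_pinned` is hypothetical.

```python
def deploy_pinned(model_id: str, model_version: str):
    """Deploy a JumpStart model at a fixed version instead of the '*' wildcard."""
    # Imported lazily so that defining this helper does not require the SDK or AWS access.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(model_id=model_id, model_version=model_version)
    return model.deploy()


# Example call (creates a billable endpoint, so it is left commented out):
# predictor = deploy_pinned("huggingface-llm-mistral-7b-instruct", "3.0.0")
```

Pinning trades automatic upgrades for a stable input/output signature, which matters if downstream code parses the response.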

### Supported parameters

***
This model supports many parameters while performing inference. They include:

* **max_length:** Model generates text until the output length (which includes the input context length) reaches `max_length`. If specified, it must be a positive integer.
* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **num_beams:** Number of beams used in beam search (a value of 1 corresponds to greedy search). If specified, it must be an integer greater than or equal to `num_return_sequences`.
* **no_repeat_ngram_size:** Model ensures that a sequence of words of `no_repeat_ngram_size` is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **early_stopping:** If True, text generation is finished when all beam hypotheses reach the end of sentence token. If specified, it must be boolean.
* **do_sample:** If True, sample the next word as per the likelihood. If specified, it must be boolean.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.
* **stop:** If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.

We may specify any subset of these parameters when invoking an endpoint. Next, we show an example of how to invoke the endpoint with these arguments.
***
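The constraints listed above can be checked on the client before a request is sent. The following is only a sketch: `build_payload` is a hypothetical helper, not part of the SageMaker SDK. It treats the `top_p` bounds as inclusive and defaults `num_return_sequences` to 1, both of which are assumptions the list does not confirm.

```python
from typing import Any, Dict


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def build_payload(prompt: str, **parameters: Any) -> Dict[str, Any]:
    """Return an inference payload after checking the documented constraints."""
    for name in ("max_length", "max_new_tokens", "top_k"):
        if name in parameters and not (_is_int(parameters[name]) and parameters[name] > 0):
            raise ValueError(f"{name} must be a positive integer")
    for name in ("early_stopping", "do_sample", "return_full_text"):
        if name in parameters and not isinstance(parameters[name], bool):
            raise ValueError(f"{name} must be a boolean")
    if "temperature" in parameters:
        value = parameters["temperature"]
        if not (_is_number(value) and value > 0):
            raise ValueError("temperature must be a positive number")
    if "top_p" in parameters:
        value = parameters["top_p"]
        # Inclusive bounds are an assumption; the endpoint may be stricter.
        if not (_is_number(value) and 0 <= value <= 1):
            raise ValueError("top_p must be between 0 and 1")
    if "no_repeat_ngram_size" in parameters:
        value = parameters["no_repeat_ngram_size"]
        if not (_is_int(value) and value > 1):
            raise ValueError("no_repeat_ngram_size must be an integer greater than 1")
    if "num_beams" in parameters:
        beams = parameters["num_beams"]
        # Assumed default of 1 when num_return_sequences is not given.
        sequences = parameters.get("num_return_sequences", 1)
        if not (_is_int(beams) and beams >= sequences):
            raise ValueError("num_beams must be an integer >= num_return_sequences")
    if "stop" in parameters:
        value = parameters["stop"]
        if not (isinstance(value, list) and all(isinstance(s, str) for s in value)):
            raise ValueError("stop must be a list of strings")
    return {"inputs": prompt, "parameters": parameters}


payload = build_payload("Hello", max_new_tokens=64, top_p=0.9, stop=["</s>"])
print(payload)
```

The resulting dictionary has the same `inputs`/`parameters` shape that `predictor.predict` receives elsewhere in this notebook.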

## Instruction prompts
***
The examples in this section demonstrate queries to the Mistral 7B Instruct model. This involves special token formatting within the prompt input. The base, pre-trained Mistral model is not fine-tuned to perform this instruction task -- please use the prompts in the next section instead.
***

In [4]:
from typing import Dict, List


def format_instructions(instructions: List[Dict[str, str]]) -> str:
    """Format instructions where conversation roles must alternate user/assistant/user/assistant/..."""
    prompt: List[str] = []
    for user, answer in zip(instructions[::2], instructions[1::2]):
        prompt.extend(["<s>", "[INST] ", (user["content"]).strip(), " [/INST] ", (answer["content"]).strip(), "</s>"])
    prompt.extend(["<s>", "[INST] ", (instructions[-1]["content"]).strip(), " [/INST] "])
    return "".join(prompt)


def print_prompt_and_response(prompt: str, response: List[Dict[str, str]]) -> None:
    bold, unbold = '\033[1m', '\033[0m'
    print(f"{bold}> Input{unbold}\n{prompt}\n\n{bold}> Output{unbold}\n{response[0]['generated_text']}\n")
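To see how the role alternation plays out over several turns, the sketch below runs `format_instructions` on a three-message conversation. The function body is repeated from the cell above so that the block runs on its own; the conversation content is made up for illustration.

```python
from typing import Dict, List


def format_instructions(instructions: List[Dict[str, str]]) -> str:
    """Same logic as the cell above: completed turns first, then the open user turn."""
    prompt: List[str] = []
    for user, answer in zip(instructions[::2], instructions[1::2]):
        prompt.extend(["<s>", "[INST] ", user["content"].strip(), " [/INST] ", answer["content"].strip(), "</s>"])
    prompt.extend(["<s>", "[INST] ", instructions[-1]["content"].strip(), " [/INST] "])
    return "".join(prompt)


conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "What is 2+2?"},
]
multi_turn_prompt = format_instructions(conversation)
# repr() makes the trailing space after the final [/INST] visible
print(repr(multi_turn_prompt))
```

Each completed user/assistant pair is closed with `</s>`, while the final user message is left open so the model generates the next assistant turn.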

In [5]:
instructions = [{"role": "user", "content": "what is the recipe of mayonnaise?"}]
prompt = format_instructions(instructions)
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 256, "do_sample": True}
}
response = predictor.predict(payload)
print_prompt_and_response(prompt, response)

> Input
<s>[INST] what is the recipe of mayonnaise? [/INST] 

> Output
Here is a classic recipe for making mayonnaise at home using a blender or a food processor. This recipe makes about 1 cup (237 ml) of mayonnaise.

Ingredients:
- 1 cup (237 ml) vegetable oil (canola, safflower, or sunflower oil)
- 1 large egg yolk (preferably at room temperature)
- 1 tablespoon (15 ml) white wine vinegar or distilled white vinegar
- 1 teaspoon (5 ml) Dijon mustard
- 1 teaspoon (5 g) salt, or to taste
- 1-2 teaspoons (5-10 g) granulated sugar (optional)
- 2 tablespoons (30 ml) cold water

Instructions:
1. In the blender or food processor, add the egg yolk, vinegar, Dijon mustard, salt, and sugar (if using). Blend for a few seconds until the ingredients are combined and the yolk starts to thicken. If using a blender, make sure the base is securely attached and the blade is in place



## Pre-trained model prompts
***
The examples in this section demonstrate how to perform text generation on the base, pre-trained Mistral model. If you have deployed the instruction-tuned model, please use prompt formatting in the previous section instead.
***
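Because the two model variants expect different prompt formats, it can help to choose the format from the model ID. This is a sketch only: `prepare_prompt` is a hypothetical helper, and deciding by the substring `instruct` in the ID is an assumption about JumpStart naming, as is the base model ID used in the example.

```python
def prepare_prompt(text: str, model_id: str) -> str:
    """Wrap text in instruction tokens only when the model ID looks instruction-tuned."""
    if "instruct" in model_id:
        # Same single-turn format that format_instructions produces
        return f"<s>[INST] {text.strip()} [/INST] "
    # Base models continue plain text, so the prompt is passed through unchanged
    return text


instruct_prompt = prepare_prompt("Write a haiku about rain.", "huggingface-llm-mistral-7b-instruct")
base_prompt = prepare_prompt("Rain falls on the roof and", "huggingface-llm-mistral-7b")
print(repr(instruct_prompt))
print(repr(base_prompt))
```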

In [15]:
!pip install fpdf

Collecting fpdf
  Using cached fpdf-1.7.2-py2.py3-none-any.whl
DEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: fpdf
Successfully installed fpdf-1.7.2


In [17]:
!pip install faker

Collecting faker
  Downloading Faker-24.8.0-py3-none-any.whl.metadata (15 kB)
Downloading Faker-24.8.0-py3-none-any.whl (1.8 MB)
DEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: faker
Successfully installed faker-24.8.0


In [16]:
from fpdf import FPDF

In [18]:
import numpy as np
import random

In [19]:
company_names = np.array(['Allstate Insurance Company',
'State Farm Insurance',
'Progressive Corporation',
'Liberty Mutual Insurance',
'Geico',
'Farmers Insurance Group',
'Travelers Insurance',
'Nationwide Mutual Insurance Company',
'American Family Insurance',
'USAA',
'Chubb Limited',
'Hartford Financial Services Group',
'AIG (American International Group)',
'MetLife',
'The Hartford',
'The Travelers Companies',
'Zurich Insurance Group',
'Cigna',
'UnitedHealthcare',
'Aetna',
'Blue Cross Blue Shield Association',
'Anthem, Inc.',
'Humana',
'UnitedHealth Group',
'Kaiser Permanente',
'Health Care Service Corporation',
'CVS Health',
'United States Fire Insurance Company',
'The Hanover Insurance Group',
'The Hartford Financial Services Group, Inc.',
'The Cincinnati Insurance Company'])

In [20]:
from random import randrange
premium_amount = randrange(10000,1000000)
print(premium_amount)

509352
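The cell above draws a single premium, so every policy that reuses `premium_amount` gets the same value. If a distinct, reproducible premium per policy is wanted, one option is a seeded generator, sketched below. The seed value and the count of five are arbitrary choices for illustration.

```python
import random

# A dedicated, seeded generator keeps the draws reproducible across runs
rng = random.Random(42)

# Same range as the cell above: 10000 inclusive to 1000000 exclusive
premium_amounts = [rng.randrange(10000, 1000000) for _ in range(5)]
print(premium_amounts)
```

Using a separate `random.Random` instance avoids changing the global random state that other cells rely on.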


In [None]:
# Loop that produces synthetic policies and writes them to PDF and CSV files
import os

from faker import Faker
import pandas as pd
from fpdf import FPDF

# Make sure the output directory exists before any file is written
os.makedirs("ner", exist_ok=True)

# One Faker instance is enough for all policies
fake = Faker()

output_list = []
for i in range(1, 100):
    # Generate fake policyholder and contact information
    primary_name = fake.name()
    primary_address = fake.address()
    primary_city = fake.city()
    primary_state = fake.state()
    primary_zip = fake.zipcode()
    primary_email_address = fake.email()
    secondary_name = fake.name()
    secondary_address = fake.address()
    secondary_email_address = fake.email()
    company_name = random.choice(company_names)
    contract_id = random.choice([fake.password(length=16, special_chars=False), fake.uuid4()])

    # Draw two random dates and order them so the policy starts before it ends
    a1 = fake.date_of_birth()
    b1 = fake.date_of_birth()
    first, last = sorted((a1, b1))
    begin_date = f"{first.year}-{first.month}-{first.day}"
    end_date = f"{last.year}-{last.month}-{last.day}"


    
    # Prompt engineering: describe the policy to generate from the synthetic details
    prompt = (
        f"Compose a Legal Auto Claims Policy between the Insurer {company_name} and {primary_name}. "
        "The following are the details of the Insurer. "
        f"\nAddress: {primary_address} \nEmail: {primary_email_address}. "
        f"The Policy Number for the Legal Auto Claims Policy is {contract_id}. "
        f"\nThe Start date of the Legal Auto Claims Policy is {begin_date} "
        f"and the End date of the Legal Auto Claims Policy is {end_date}. "
        f"\nThe policy has a premium amount of {premium_amount}. "
        "\nThe Legal Auto Claims Policy must include the Address and Email Information at the beginning of the Legal Auto Claims Policy. "
        "\nThe Legal Auto Claims Policy must include the Policy Number at the beginning of the Legal Auto Claims Policy. "
        "\nThe Legal Auto Claims Policy must include the Start date and End date at the beginning of the Legal Auto Claims Policy. "
        "\nThe Legal Policy must have different subsections. "
        "The subsections are "
        "a) Automobile Liability Insurance. b) Automobile Medical Payments. c) Automobile Debt Indemnity Insurance. "
        "d) Uninsured Motorists Insurance. e) Default Provisions. f) Personal Injury Protection. g) Collision Insurance. "
        "h) Comprehensive Insurance. i) Rental Reimbursement Insurance. j) Towing and Labor Coverage. "
        "k) Waiver of Deductible. Each subsection must include Exclusions. "
        "Each subsection must include LIMITS OF LIABILITY; What is not covered. "
        "The Legal Auto Claims Policy must also include the following mailing contact information "
        f"{secondary_name}, {secondary_address}, {secondary_email_address}"
    )
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 1200,
            "no_repeat_ngram_size": 3,
        },
    }
    response = predictor.predict(payload)
    # Add the response to the output list
    generated_text = response[0]['generated_text']
    output_list.append(generated_text)

    # Create a new PDF document and add a page to it
    pdf = FPDF()
    pdf.add_page()

    # Set the font family and size used in the PDF
    pdf.set_font("Arial", size = 15)

    # The built-in Arial font only covers Latin-1, so replace any other characters
    pdf_text = generated_text.encode("latin-1", "replace").decode("latin-1")

    # Write the generated text as a centered, multi-line cell
    pdf.multi_cell(200, 10, txt = pdf_text, align = 'C')

    # Save the PDF under a per-policy file name
    pdf_filename = "ner/output_"+ str(i) +"_v2a.pdf"
    pdf.output(pdf_filename)


pd.DataFrame(output_list).to_csv("ner/synth_policies.csv")


## Clean up the endpoint

In [None]:
predictor.delete_model()
predictor.delete_endpoint()
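If `delete_model()` raises, the endpoint would be left running and would keep incurring charges. A small sketch of a more defensive cleanup follows. It assumes only that the predictor exposes the two methods used above, and `clean_up` is a hypothetical helper name.

```python
def clean_up(predictor) -> None:
    """Delete the model, then always attempt to delete the endpoint."""
    try:
        predictor.delete_model()
    finally:
        # Runs even if delete_model() raised, so the endpoint is not left behind
        predictor.delete_endpoint()


# Usage with the predictor created earlier in this notebook:
# clean_up(predictor)
```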