## Run inference on the Llama 2 endpoint you have created

### Query endpoint that you have created
***
The following cell provides a helper function that will be used to query your endpoint with boto3.
***

In [None]:
import json
import boto3


endpoint_name = "jumpstart-dft-meta-textgeneration-llama-2-13b-f"

def query_endpoint(payload):
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    return response

### Supported Parameters

***
This model supports many parameters while performing inference. They include:

* **max_length:** Model generates text until the output length (which includes the input context length) reaches `max_length`. If specified, it must be a positive integer.
* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **num_beams:** Number of beams used in the greedy search. If specified, it must be integer greater than or equal to `num_return_sequences`.
* **no_repeat_ngram_size:** Model ensures that a sequence of words of `no_repeat_ngram_size` is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **early_stopping:** If True, text generation is finished when all beam hypotheses reach the end of sentence token. If specified, it must be boolean.
* **do_sample:** If True, sample the next word as per the likelihood. If specified, it must be boolean.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.
* **stop**: If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.

We may specify any subset of the parameters mentioned above while invoking an endpoint. Next, we show an example of how to invoke endpoint with these arguments.

**NOTE**: If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so it is recommended to set `max_new_tokens` when possible. For 7B, 13B, and 70B models, we recommend to set `max_new_tokens` no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens less than 4K.

**NOTE**: This model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and alternating (u/a/u/a/u...).

***

### Example prompts
***
The examples in this section demonstrate how to perform text generation with conversational dialog as prompt inputs. First, define a helper function to appropriately format conversational dialog for Llama-2 chat models.
***

In [43]:
from typing import Dict, List


def format_messages(messages: List[Dict[str, str]]) -> List[str]:
    """Format messages for Llama-2 chat models.
    
    The model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and 
    alternating (u/a/u/a/u...). The last message must be from 'user'.
    """
    prompt: List[str] = []

    if messages[0]["role"] == "system":
        content = "".join(["<<SYS>>\n", messages[0]["content"], "\n<</SYS>>\n\n", messages[1]["content"]])
        messages = [{"role": messages[1]["role"], "content": content}] + messages[2:]

    for user, answer in zip(messages[::2], messages[1::2]):
        prompt.extend(["<s>", "[INST] ", (user["content"]).strip(), " [/INST] ", (answer["content"]).strip(), "</s>"])

    prompt.extend(["<s>", "[INST] ", (messages[-1]["content"]).strip(), " [/INST] "])

    return "".join(prompt)


def print_messages(prompt: str, response: str) -> None:
    bold, unbold = '\033[1m', '\033[0m'
    print(f"{bold}> Input{unbold}\n{prompt}\n\n{bold}> Output{unbold}\n{response[0]['generated_text']}\n")
    
def get_response(prompt: str, response: str) -> None:
    bold, unbold = '\033[1m', '\033[0m'
    print(f"> Output{unbold}\n{response[0]['generated_text']}\n")
    return response[0]['generated_text']

In [44]:
dialog = [{"role": "user", "content": "what is the recipe of mayonnaise?"}]
prompt = format_messages(dialog)
payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}}
response = query_endpoint(payload)
print_messages(prompt, response)

ValidationError: An error occurred (ValidationError) when calling the InvokeEndpoint operation: Endpoint jumpstart-dft-meta-textgeneration-llama-2-13b-f of account 275461957965 not found.

In [5]:
dialog = [
    {"role": "user", "content": "I am going to Paris, what should I see?"},
    {
        "role": "assistant",
        "content": """\
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.""",
    },
    {"role": "user", "content": "What is so great about #1?"},
]
prompt = format_messages(dialog)
payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}}
response = query_endpoint(payload)
print_messages(prompt, response)

[1m> Input[0m
<s>[INST] I am going to Paris, what should I see? [/INST] Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.</s><s>[INST] What is so great about #1? [/INST] 

[1m> Output[0m
Ther

In [6]:
dialog = [
    {"role": "system", "content": "Always answer with Haiku"},
    {"role": "user", "content": "I am going to Paris, what should I see?"},
]
prompt = format_messages(dialog)
payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}}
response = query_endpoint(payload)
print_messages(prompt, response)

[1m> Input[0m
<s>[INST] <<SYS>>
Always answer with Haiku
<</SYS>>

I am going to Paris, what should I see? [/INST] 

[1m> Output[0m
Eiffel Tower shines bright
River Seine's gentle flow
Art, love, and light



In [7]:
dialog = [
    {"role": "system", "content": "Always answer with emojis"},
    {"role": "user", "content": "How to go from Beijing to NY?"},
]
prompt = format_messages(dialog)
payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}}
response = query_endpoint(payload)
print_messages(prompt, response)

[1m> Input[0m
<s>[INST] <<SYS>>
Always answer with emojis
<</SYS>>

How to go from Beijing to NY? [/INST] 

[1m> Output[0m
Here's the answer to your question about how to go from Beijing to New York:

🚀👽💨



Setup Amazon Product Description Dataset Pipeline

In [8]:
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen
import numpy as np

Loading metadata of Amazon Fashion dataset

In [20]:
### load the meta data

data = []
with gzip.open('meta_AMAZON_FASHION.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))
    
# total length of list, this number equals total number of products
print(len(data))

# first row of the list
print(data[0])

186637
{'title': 'Slime Time Fall Fest [With CDROM and Collector Cards and Neutron Balls, Incredi-Ball and Glow Stick Necklace, Paper Fram', 'brand': 'Group Publishing (CO)', 'feature': ['Product Dimensions:\n                    \n8.7 x 3.6 x 11.4 inches', 'Shipping Weight:\n                    \n2.4 pounds'], 'rank': '13,052,976inClothing,Shoesamp;Jewelry(', 'date': '8.70 inches', 'asin': '0764443682', 'imageURL': ['https://images-na.ssl-images-amazon.com/images/I/51bSrINiWpL._US40_.jpg'], 'imageURLHighRes': ['https://images-na.ssl-images-amazon.com/images/I/51bSrINiWpL.jpg']}


In [21]:
# convert list into pandas dataframe

df = pd.DataFrame.from_dict(data)

print(len(df))

186637


In [22]:
# Filter rows where column 'A' is not null
filtered_df = df[df['description'].notnull()]

In [23]:
# List all columns
column_names = df.columns

print(column_names)

Index(['title', 'brand', 'feature', 'rank', 'date', 'asin', 'imageURL',
       'imageURLHighRes', 'description', 'price', 'also_view', 'also_buy',
       'fit', 'details', 'similar_item', 'tech1'],
      dtype='object')


In [24]:
filtered_df = filtered_df[['title', 'brand', 'feature', 'description', 'price']]

In [25]:
filtered_df = filtered_df.dropna(how='any')

In [26]:
filtered_df

Unnamed: 0,title,brand,feature,description,price
17,"X. L. Carbon Fiber Money Clip, made in the USA",Roar Carbon,"[Real Carbon Fiber, Made in USA, 5 year warran...",[When you pull out your extra large carbon fib...,$14.99
18,Shimmer Anne Shine Clip On Costume/Halloween C...,Shimmer Anne Shine,[Shimmer Anne Shine Clip On Costume/Halloween ...,"[A fun addition to any costume party, play, or...",$6.99
69,Buxton Heiress Pik-Me-Up Framed Case,Buxton,"[Leather, Imported, synthetic lining, Flap clo...",[Authentic crunch leather with rich floral emb...,$16.95
331,Art Nouveau Sterling Silver Ornate Repousse He...,Silver Insanity,"[2&5/8"" High and 3/4"" Wide, Weight is Approx. ...","[It measures 2&5/8"" tall x 3/4"" wide, weighs a...",$44.66
410,Dream PJ's Blue - Large - Part #: 25BLG,Ethical/Spot,[Product Dimensions:\n \n8....,"[SOFT AND CUDDLY, SWEET DREAM PAJAMAS IN SOFT ...",$15.99
...,...,...,...,...,...
186502,Women's Roman Empress Costume - Greek Goddess ...,Karnival Costumes,"[100% polyester, CAN BE USED YEAR ROUND: These...",[Look great in a durable and high quality cost...,$24.99
186506,"Women's Sexy Gladiator Costume, for Halloween ...",Karnival Costumes,"[100% polyester, CAN BE USED YEAR ROUND: These...",[Look great in a durable and high quality cost...,$20.99
186538,Georgia Boot AMP Insole,Georgia Boot,"[NA, Imported, Memory foam adjusts for customi...",[Ready to finally be comfortable in your boots...,$21.27
186552,Single Flare Steel Plugs with Mint Green Rose ...,Pierced Owl,[Pair of single flare steel plugs with mint gr...,[Single Flare Steel Plugs with Mint Green Rose...,$18.99


In [16]:
def create_description_prompts(df):
    prompts = []
    for index, row in df.iterrows():
        feature_list = '\n  - '.join(row['feature'])
        prompt = (
            f"Title: {row['title']}\n"
            f"Brand: {row['brand']}\n"
            f"Price: ${row['price']}\n"
            f"Key Features: \n  - {feature_list}\n\n"
        )
        prompts.append(prompt)
    return prompts

In [17]:
# Split filtered_df into 5 parts
parts = np.array_split(filtered_df, 5)

In [18]:
for i, part in enumerate(parts):
    progress_file = f'progress_part_{i}.csv'
    print(f"Processing part {i}, saving progress to {progress_file}")
    
    # Check if there's a saved progress file for this part
    try:
        progress_df = pd.read_csv(progress_file)
        start_index = len(progress_df)
        print(f"Resuming from row {start_index} in part {i}")
    except FileNotFoundError:
        progress_df = pd.DataFrame(columns=['generated_description'])
        start_index = 0
        print(f"Starting fresh for part {i}")

    # Generate prompts for each product in the part
    prompts = create_description_prompts(part.iloc[start_index:])

    for index, prompt in enumerate(prompts, start=start_index):
        try:
            dialog = [
                {"role": "system", "content": "Craft a vibrant and engaging product description of the item whose metadata is given. Transform these features into an alluring narrative that emphasizes the product's unique qualities. Highlight the practicality, quality, and any unique selling points, making it irresistible to potential buyers."},
                {"role": "user", "content": "Metadata of item:\n" + prompt},
            ]
            formatted_prompt = format_messages(dialog)
            payload = {"inputs": formatted_prompt, "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}}
            response = query_endpoint(payload)
            description = response[0]['generated_text']
        except Exception as e:
            description = None

        new_row = pd.DataFrame({'generated_description': [description]})
        progress_df = pd.concat([progress_df, new_row], ignore_index=True)

        # Save the progress every N rows or at the end of the part
        if (index + 1) % 100 == 0 or (index + 1) == len(part):
            progress_df.to_csv(progress_file, index=False)
            print(f"Saved progress at row {index + 1} in part {i}")

    # Add the generated descriptions to the part of the dataframe
    part['generated_description'] = progress_df['generated_description'].values

    # Merge each part back into filtered_df
    filtered_df.update(part)

Processing part 0, saving progress to progress_part_0.csv
Resuming from row 1392 in part 0
Processing part 1, saving progress to progress_part_1.csv
Resuming from row 1391 in part 1
Processing part 2, saving progress to progress_part_2.csv
Resuming from row 1391 in part 2
Processing part 3, saving progress to progress_part_3.csv
Resuming from row 200 in part 3
Saved progress at row 300 in part 3
Saved progress at row 400 in part 3
Saved progress at row 500 in part 3
Saved progress at row 600 in part 3
Saved progress at row 700 in part 3
Saved progress at row 800 in part 3
Saved progress at row 900 in part 3
Saved progress at row 1000 in part 3
Saved progress at row 1100 in part 3
Saved progress at row 1200 in part 3
Saved progress at row 1300 in part 3
Saved progress at row 1391 in part 3
Processing part 4, saving progress to progress_part_4.csv
Starting fresh for part 4
Saved progress at row 100 in part 4
Saved progress at row 200 in part 4
Saved progress at row 300 in part 4
Saved pr

In [None]:
# Save the final DataFrame
filtered_df.to_csv('final_filtered_df.csv', index=False)
print("Completed processing all parts and saved the final DataFrame.")

In [27]:
df1 = pd.read_csv('progress_part_' + str(i) + '.csv')

In [45]:
new_df = []

for i in range(5):
    temp_df = pd.read_csv('progress_part_' + str(i) + '.csv')
    new_df.append(temp_df)
    

In [46]:
df_combined = pd.concat([new_df[0], new_df[1], new_df[2], new_df[3], new_df[4]], axis=0)

In [47]:
df_combined = df_combined.reset_index(drop=True)

In [53]:
rows

6956

In [49]:
df_combined.to_csv('generated_descriptions.csv', index=False)

In [39]:
# Assuming df_main is your main DataFrame and df_single_column is the single-column DataFrame
final_df = pd.concat([filtered_df, df_combined], axis=1)

In [40]:
final_df

Unnamed: 0,title,brand,feature,description,price,generated_description
17,"X. L. Carbon Fiber Money Clip, made in the USA",Roar Carbon,"[Real Carbon Fiber, Made in USA, 5 year warran...",[When you pull out your extra large carbon fib...,$14.99,Introducing the Schaefer Outfitters Ranger Ves...
18,Shimmer Anne Shine Clip On Costume/Halloween C...,Shimmer Anne Shine,[Shimmer Anne Shine Clip On Costume/Halloween ...,"[A fun addition to any costume party, play, or...",$6.99,
69,Buxton Heiress Pik-Me-Up Framed Case,Buxton,"[Leather, Imported, synthetic lining, Flap clo...",[Authentic crunch leather with rich floral emb...,$16.95,"Sure, here's a vibrant and engaging product de..."
331,Art Nouveau Sterling Silver Ornate Repousse He...,Silver Insanity,"[2&5/8"" High and 3/4"" Wide, Weight is Approx. ...","[It measures 2&5/8"" tall x 3/4"" wide, weighs a...",$44.66,"Introducing the Mickey Mantle ""Mickey 7"" Baseb..."
410,Dream PJ's Blue - Large - Part #: 25BLG,Ethical/Spot,[Product Dimensions:\n \n8....,"[SOFT AND CUDDLY, SWEET DREAM PAJAMAS IN SOFT ...",$15.99,Introducing the Small World Toys Furree Faces ...
...,...,...,...,...,...,...
6951,,,,,,Introducing the Women's Roman Empress Costume ...
6952,,,,,,Introducing the Women's Sexy Gladiator Costume...
6953,,,,,,Introducing the Georgia Boot AMP Insole - the ...
6954,,,,,,Sure! Here's a vibrant and engaging product de...


In [41]:
final_df.to_csv('combined_data.csv', index=False)