# Dorabruschi Fine-tuned GPT

```gpt-3.5-turbo``` fine-tuned on pairs of skincare questions and routine-recommendation answers.

## Setup

In [1]:
!pip install tiktoken
!pip install openai

Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.6.0
Collecting openai
  Downloading openai-1.28.1-py3-none-any.whl (320 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore

In [2]:
import json
import pandas as pd
from google.colab import drive, userdata, files
from openai import OpenAI
import tiktoken
import numpy as np
from collections import defaultdict
import os
import ast
from scipy import spatial
from IPython import embed

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


### Import data and convert to JSONL

In [4]:
file_path = '/content/drive/MyDrive/colab_notebooks/independent_study/data/REVISED_qa_pairs.xlsx'
df = pd.read_excel(file_path)
df.head()

Unnamed: 0,question,answer
0,I'm 25 and have combination skin. What cleanse...,For your daily skincare routine with combinati...
1,What's the best moisturizer for someone living...,For someone living in a dry climate with dry s...
2,Can you suggest a serum that helps with rednes...,"I recommend the ""Sensitive skin moisturizer"" f..."
3,Which products are most effective for deep wri...,"For deep wrinkles around the mouth, I recommen..."
4,What treatment would you recommend for acne sc...,"For acne scars on oily skin, I recommend the f..."


Create the JSONL file from the XLSX Q&A pairs.

In [5]:
product_list = 'ACE 10% multivitamin concentrate, Revitalizing multivitamin cream, Smoothing renewing cream, Acne roll-on lotion, Acne paste, Micellar water, Anti-wrinkle cream K, Smoothing foot balm, Restructuring hand balm, Delicate sebum-balancing cleansing base, Intensive collagen concentrate, Intensive concentrate with hyaluronic acid, Intensive elastin concentrate, Intensive concentrate with snail slime, Hyaluronic acid eye contour, Eye-lip contour, Anti-aging hand cream, Anti-wrinkle eye contour cream plus, Global anti-aging restructuring eye contour cream, Elasticizing body cream, Energizing cream, Quick tan fluid cream, Self-tanning fluid cream, Fluid moisturizing body cream, Gommage cream, Intensive moisturizing cream first wrinkles, Nourishing moisturizing cream for first wrinkles, Sensitive skin moisturizer, Cream K for dry skin, Reactive skin protective cream, Intensive firming body cream, Redensifying cream, Plumping cream, Repairing hand cream, Foot repair cream, Relaxing leg and foot cream, Global anti-aging restructuring cream, Slimming cream, Ultra-nourishing body cream, Rebalancing face cream, Facial scrub cleanser, Shower shampoo, Concentrated lifting emulsion, Aftershave emulsion, Roll-on anti-wrinkle eye contour fluid, Double action eye contour gel, Cleansing milk, Cleansing milk, Gentle cleansing milk, Softening lotion Softening lotion with argan oil, Anti-fatigue eye lotion, Toning lotion, Lifting effect mask, Intensive moisturizing mask, Perfecting mask, Body scrub, Facial scrub, Biphasic eye make-up remover, FF toner'

In [6]:
system_prompt = 'You are tasked with offering customized beauty routine recommendations using only products from Dorabruschi\'s product line, tailored to the user\'s specific skincare needs. For each customer query: \
        - Recommend products only from the provided Dorabruschi product catalog. \
        - Do not recommend or suggest products outside of this catalog. \
        - For each recommended product, provide a brief explanation of why it has been chosen for the user, detailing its usage and cost. \
        - Limit each routine recommendation to 3-5 products. \
        - If no product in the catalog suits the user\'s request, clearly state that no suitable product is available. Do not make assumptions about product benefits that are not explicitly supported by the catalog. \
        - In cases of uncertainty, advise the user to consult a skincare specialist or explore other brands for more suitable options. \
        Remember, the product catalog is composed of: ' + product_list + '.'

In [7]:
jsonl_data = []

for index, row in df.iterrows():
    entry = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": row['question']},
            {"role": "assistant", "content": row['answer']}
        ]
    }
    jsonl_data.append(json.dumps(entry))

jsonl_file_path = '/content/drive/MyDrive/colab_notebooks/independent_study/data/revised_qa_pairs.jsonl'
with open(jsonl_file_path, 'w', encoding='utf-8') as file:
    for item in jsonl_data:
        file.write(item + '\n')

jsonl_file_path

'/content/drive/MyDrive/colab_notebooks/independent_study/data/revised_qa_pairs.jsonl'

In [8]:
# read the jsonl file to ensure correct format
with open(jsonl_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        json_data = json.loads(line.strip())
        print(json_data)

{'messages': [{'role': 'system', 'content': "You are tasked with offering customized beauty routine recommendations using only products from Dorabruschi's product line, tailored to the user's specific skincare needs. For each customer query:         - Recommend products only from the provided Dorabruschi product catalog.         - Do not recommend or suggest products outside of this catalog.         - For each recommended product, provide a brief explanation of why it has been chosen for the user, detailing its usage and cost.         - Limit each routine recommendation to 3-5 products.         - If no product in the catalog suits the user's request, clearly state that no suitable product is available. Do not make assumptions about product benefits that are not explicitly supported by the catalog.         - In cases of uncertainty, advise the user to consult a skincare specialist or explore other brands for more suitable options.         Remember, the product catalog is composed of: AC

### Check data formatting
The checks below follow the OpenAI cookbook's guide on data preparation for chat model fine-tuning.

In [9]:
# load the dataset
with open(jsonl_file_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 111
First example:
{'role': 'system', 'content': "You are tasked with offering customized beauty routine recommendations using only products from Dorabruschi's product line, tailored to the user's specific skincare needs. For each customer query:         - Recommend products only from the provided Dorabruschi product catalog.         - Do not recommend or suggest products outside of this catalog.         - For each recommended product, provide a brief explanation of why it has been chosen for the user, detailing its usage and cost.         - Limit each routine recommendation to 3-5 products.         - If no product in the catalog suits the user's request, clearly state that no suitable product is available. Do not make assumptions about product benefits that are not explicitly supported by the catalog.         - In cases of uncertainty, advise the user to consult a skincare specialist or explore other brands for more suitable options.         Remember, the product catalog

In [10]:
# format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        function_call = message.get("function_call", None)

        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


In [11]:
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
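    # note: the quantiles computed here are the 10th and 90th percentiles, despite the "p5 / p95" label
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")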

In [12]:
# warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 645, 1137
mean / median: 795.918918918919, 771.0
p5 / p95: 713.0, 925.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 85, 578
mean / median: 231.42342342342343, 205.0
p5 / p95: 149.0, 363.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


In [13]:
# pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Dataset has ~88347 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~265041 tokens


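Given the training cost of US\$8.00 / 1M tokens, the estimated ~265,041 billed tokens translate to an approximate training cost of \$2.12. The completed job further below reports `trained_tokens=142164`, which at the same rate would come to about \$1.14.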
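
The arithmetic behind this estimate can be checked directly. The following is a minimal sketch that reuses the token count and epoch number printed above and the \$8.00 / 1M price quoted in the text; the variable names are illustrative and not part of the notebook.

```python
# Figures copied from the cell output above; the price is the one quoted in the text.
billing_tokens_per_epoch = 88_347
n_epochs = 3
usd_per_million_tokens = 8.00

# Total tokens billed across all epochs, then converted to dollars.
estimated_tokens = billing_tokens_per_epoch * n_epochs
estimated_cost = estimated_tokens / 1_000_000 * usd_per_million_tokens

print(f"Estimated billed tokens: {estimated_tokens}")
print(f"Estimated training cost: ${estimated_cost:.2f}")
```

Swapping in a different token count (for example the `trained_tokens` value a finished job reports) gives the corresponding cost at the same rate.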

## Fine-tuning

In [14]:
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

In [15]:
# upload a training file
with open(jsonl_file_path, 'rb') as file:
    response = client.files.create(
        file=file,
        purpose='fine-tune'
    )

In [None]:
response

FileObject(id='file-lP9udpRcCnenUh9xYmS4zIhv', bytes=240296, created_at=1715222997, filename='revised_qa_pairs.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)

In [None]:
file_id = 'file-lP9udpRcCnenUh9xYmS4zIhv'

In [None]:
# create a fine-tuning job
response = client.fine_tuning.jobs.create(
    training_file=file_id,
    model='gpt-3.5-turbo'
)

In [None]:
response

FineTuningJob(id='ftjob-9Q8uYJKV4ybV8JzyxoehU6a8', created_at=1715223105, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-kHNGlYLNsE0KOkCDy0N8yMG3', result_files=[], seed=1808267741, status='validating_files', trained_tokens=None, training_file='file-lP9udpRcCnenUh9xYmS4zIhv', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)

In [16]:
job_id = 'ftjob-9Q8uYJKV4ybV8JzyxoehU6a8'

In [17]:
# list fine-tuning jobs
client.fine_tuning.jobs.list(limit=3)

SyncCursorPage[FineTuningJob](data=[FineTuningJob(id='ftjob-9Q8uYJKV4ybV8JzyxoehU6a8', created_at=1715223105, error=Error(code=None, message=None, param=None), fine_tuned_model='ft:gpt-3.5-turbo-0125:personal::9MoLXX93', finished_at=1715223773, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-kHNGlYLNsE0KOkCDy0N8yMG3', result_files=['file-3woFBCK5YjyGbknbxYiidUkg'], seed=1808267741, status='succeeded', trained_tokens=142164, training_file='file-lP9udpRcCnenUh9xYmS4zIhv', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)], object='list', has_more=False)

In [18]:
# retrieve the state of the finetune
client.fine_tuning.jobs.retrieve(job_id)

FineTuningJob(id='ftjob-9Q8uYJKV4ybV8JzyxoehU6a8', created_at=1715223105, error=Error(code=None, message=None, param=None), fine_tuned_model='ft:gpt-3.5-turbo-0125:personal::9MoLXX93', finished_at=1715223773, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-kHNGlYLNsE0KOkCDy0N8yMG3', result_files=['file-3woFBCK5YjyGbknbxYiidUkg'], seed=1808267741, status='succeeded', trained_tokens=142164, training_file='file-lP9udpRcCnenUh9xYmS4zIhv', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)
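
Rather than re-running the retrieve cell by hand, the status check can be wrapped in a small polling loop. This is only a sketch: `wait_for_job` is a hypothetical helper, the set of terminal statuses is an assumption (only `validating_files` and `succeeded` appear in the outputs above), and the demo uses a stub instead of the real API so that it runs offline.

```python
import time
from types import SimpleNamespace

def wait_for_job(retrieve, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call retrieve(job_id) repeatedly until the job reaches a terminal status."""
    # Assumed terminal statuses; adjust if the API reports others.
    terminal = {"succeeded", "failed", "cancelled"}
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in terminal:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} still running after {max_polls} polls")

# Offline demo with a stub that walks through a made-up status sequence.
_fake_statuses = iter(["validating_files", "running", "succeeded"])

def fake_retrieve(job_id):
    return SimpleNamespace(id=job_id, status=next(_fake_statuses))

job = wait_for_job(fake_retrieve, "ftjob-example", sleep=lambda seconds: None)
print(job.status)
```

In the notebook, the same helper could presumably be called as `wait_for_job(client.fine_tuning.jobs.retrieve, job_id)`.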

In [19]:
model_id = 'ft:gpt-3.5-turbo-0125:personal::9MoLXX93'

In [20]:
# test 1 - failing with edge cases
completion = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What's your best sunscreen?"}
    ]
)

print(completion.choices[0].message.content)

I'm currently restricted to recommending skincare products only from Dorabruschi's product line, and Dorabruschi does not offer a dedicated sunscreen product in their catalog. For the best advice on sun protection, especially considering the importance of this skincare step, I recommend checking with a dermatologist or skincare expert who can provide guidance on selecting a high-quality sunscreen that meets your needs. Remember, daily sunscreen use is crucial for protecting your skin from harmful UV rays and preventing premature aging, so do prioritize this step in your skincare routine. Let me know if you have any more questions or need assistance with Dorabruschi products!


In [21]:
# test 2 - failing with a simple query (nonexistent product)
completion = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "I have dry skin and am looking for a moisturizer that I can use every day. What is the best cream for me?"}
    ]
)

print(completion.choices[0].message.content)

For your dry skin type, the best Dorabruschi cream is the "Cream K for dry skin." This 24-hour cream is rich in references Amino Acids and mineral salts to deeply hydrate and nourish the skin. It is formulated to enhance the vitality and resilience of the skin, making it ideal for daily use to combat dryness and maintain a smooth, glowing complexion.

**Product:**
- Name: Cream K for dry skin
- Description: 24-hour cream with references Amino Acids and mineral salts to improve tissue hydration and nourishment.
- Usage: Apply morning and/or evening to perfectly cleansed skin and massage gently to favor absorption.
- Price: 48.00 euros
- Quantity: 50 ml

This cream will provide the necessary hydration and nourishment your dry skin needs on a daily basis, helping to maintain soft and supple skin.


In [22]:
# test 3 - failing with a question already in the training set
completion = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "I'm 25 and have combination skin. What cleanser should I use for daily skincare?"}
    ]
)

print(completion.choices[0].message.content)

For your daily skincare routine, I recommend using the **Delicate sebum-balancing cleansing base** from Dorabruschi. This product is a delicate detergent base with a sebum-balancing action, ideal for daily cleansing of impure and acne-prone skin. It contains ingredients like Propolis and Tea Tree Oil which have purifying and soothing properties, making it suitable for combination skin.

**Usage Instructions:** Apply the product to dampened face and massage gently, focusing on the most affected areas. Rinse thoroughly with water and follow with a specific tonic lotion.

**Price:** 16.00 euros for 150 ml

**Why it's recommended:** The Delicate sebum-balancing cleansing base will help cleanse your skin without being too harsh, balancing the sebum production in your T-zone while being gentle on the drier areas of your face.


In [23]:
# test 4 - failing with a more complex question, also in the training set
completion = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What Dorabruschi treatment would you recommend for acne scars on oily skin?"}
    ]
)

print(completion.choices[0].message.content)

For acne scars on oily skin, I recommend the **"Intensive concentrate with snail slime"** from Dorabruschi. This product is specifically formulated to accelerate the skin renewal process, making it an excellent choice for addressing acne scars. The snail slime extract contains Allantoin, Proteins, Glycolic Acid, and Collagen, which work together to promote the regeneration and reorganization of the skin's structure.

### Product Recommended:
- Name: Intensive concentrate with snail slime
- Description: Accelerates the skin's renewal process, favoring the reorganization of the skin's structure.
- Usage: Gently tap a few drops on the face morning and evening, focusing on areas with scars.
- Price: 40.00 euros
- Quantity: 30 ml

By incorporating this concentrate into your skincare routine, you can help reduce the appearance of acne scars on your oily skin while promoting a smoother and more even complexion over time.


In [24]:
# test 5 - invert the question
completion = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is the Acne paste and what is it good for?"}
    ]
)

print(completion.choices[0].message.content)

The **Acne paste** from Dorabruschi is a specific intensive care product designed for individual blemishes, blackheads, and enlarged pores. It contains natural absorbent powders like Kaolin and Calamine to reduce excess sebum and purify the skin. Additionally, ingredients like Zinc Oxide, Sulfur, and Salicylic Acid help in the renewal of skin with blemishes.

**Key Benefits:**
- Purifying and renewing action to target blemishes, blackheads, and enlarged pores.
- Reduces skin redness and helps in the resolution of acneic conditions.

**Product Usage:** Apply a layer of paste to clean skin 1-2 times a day on blemishes. Allow it to dry for about 20-30 minutes and then rinse off with water.

**Price:** 23.00 euros for 30 ml.

**Why Recommended:** The Acne paste is recommended for its targeted action on individual blemishes, making it an effective spot treatment for acneic conditions.


## Semantic Search

Embed each product's description as well as the output of the fine-tuned model. Then use semantic search to retrieve the 5 products most relevant to the model's output.
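
As a rough illustration of the ranking step, the sketch below scores a query vector against a few product vectors by cosine similarity and keeps the top matches. The 3-dimensional vectors and product names are made up for the example, real embeddings returned by the API are much longer, and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=5):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings.
toy_products = {
    "moisturizer": [0.9, 0.1, 0.0],
    "cleanser": [0.1, 0.9, 0.0],
    "eye cream": [0.7, 0.0, 0.7],
}
toy_query = [1.0, 0.0, 0.0]

ranking = top_k(toy_query, toy_products, k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing embeddings.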

In [25]:
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo-0125"

In [26]:
file_path = '/content/drive/MyDrive/colab_notebooks/independent_study/data/dorabruschi_products.xlsx'
df = pd.read_excel(file_path)
df.head()

Unnamed: 0,title,description,usage_instructions,properties,ingredients,product_type,benefits,intended_concerns,skin_type,texture,format,price,quantity
0,ACE 10% multivitamin concentrate,This rapidly absorbed concentrated treatment e...,Apply a few drops of concentrate in the mornin...,"Anti-wrinkle, Antioxidant, Illuminating","Aqua [water], Glycerin, Tocopheryl acetate, Pr...",Serum,Wrinkle,"Aging, Dullness, Wrinkles",All Skin Types,Liquid,Dropper,46.0,30
1,Revitalizing multivitamin cream,"Cream with a velvety and light texture, design...",Apply in the morning and/or in the evening to ...,"Anti-wrinkle, Antioxidant, Illuminating","Aqua [water], Glycerin, Cetyl alcohol, Capryli...",Moisturizer,Wrinkle,"Aging, Dullness, Wrinkles",All Skin Types,Velvety cream,airless,49.0,50
2,Smoothing renewing cream,"Cream with a velvety and light texture, it is ...",Apply in the evening to perfectly cleansed ski...,"Anti-wrinkle, Renewing, Illuminating","Aqua [water], Peg-6 stearate, Glycolic acid, C...",Moisturizer,Wrinkle,"Aging, Dullness, Wrinkles",All Skin Types,Velvety cream,airless,45.0,50
3,Acne roll-on lotion,Moderately alcoholic invisible lotion with a p...,Apply with the appropriate roll-on directly on...,"Purifying, Astringent","Aqua [water], Alcohol, Glycerin, Salicylic aci...",Spot Treatment,Purifying,"Acne, Blemishes",Oily,Liquid,roll-on,22.0,10
4,Acne paste,Paste for a quick and effective treatment of p...,Apply 1-2 times a day on pimples and impuritie...,"Purifying, Anti-imperfections","Paraffinum liquidum [mineral oil], Zinc oxide,...",Spot Treatment,Purifying,"Acne, Blemishes",Oily,Thick paste,Tubo,26.0,30


In [27]:
df = df.astype(str)
df['price'] = df['price'] + ' euros'
df['quantity'] = df['quantity'] + ' ml'

Concatenate each product's features into a single labeled `text` string per row. These strings are then joined into ```all_product_info```.

In [28]:
df['text'] = df['title'] + ' ' + 'Description: ' + df['description'] + ' ' + \
    'Usage Instructions: ' + df['usage_instructions'] + ' ' + \
    'Properties: ' + df['properties'] + ' ' + \
    'Ingredients: ' + df['ingredients'] + ' ' + \
    'Product Type: ' + df['product_type'] + ' ' + \
    'Benefits: ' + df['benefits'] + ' ' + \
    'Intended concerns: ' + df['intended_concerns'] + ' ' + \
    'Skin Type: ' + df['skin_type'] + ' ' + \
    'Texture: ' + df['texture'] + ' ' + \
    'Format: ' + df['format'] + ' ' + \
    'Price: ' + df['price'] + ' ' + \
    'Quantity: ' + df['quantity']

df['text'].iloc[0]

'ACE 10% multivitamin concentrate Description: This rapidly absorbed concentrated treatment ensures maximum purity and effectiveness of the ingredients used. The revitalizing cocktail of the 3 beauty vitamins (A, C, E) helps to delay the aging processes and counteract the aggression of free radicals, mainly responsible for the degenerative changes associated with aging. The constant use of this serum makes the skin elastic and hydrated, radiant, toned, thus giving the face a younger and brighter appearance. Usage Instructions: Apply a few drops of concentrate in the morning and/or evening on perfectly cleansed facial skin and massage delicately until completely absorbed. Properties: Anti-wrinkle, Antioxidant, Illuminating Ingredients: Aqua [water], Glycerin, Tocopheryl acetate, Propylene glycol, Peg-40 hydrogenated castor oil, Ascorbic acid, Retinyl palmitate, Tocopherol, Helianthus annuus (sunflower) seed oil, Xanthan gum, Ethylcellulose, Trideceth-9, Phenoxyethanol, Tetrasodium edta,
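
The same labeled concatenation can be written more compactly by looping over (column, label) pairs. This is a sketch under assumptions: `product_to_text` and `FIELD_LABELS` are hypothetical names, only three of the fields are listed, and the product is an invented placeholder rather than a catalog entry.

```python
# (column name, label) pairs, in the order they should appear in the text.
FIELD_LABELS = [
    ("description", "Description"),
    ("price", "Price"),
    ("quantity", "Quantity"),
]

def product_to_text(product, field_labels=FIELD_LABELS):
    """Build 'title Label: value Label: value ...' from a dict-like product row."""
    parts = [product["title"]]
    parts += [f"{label}: {product[key]}" for key, label in field_labels]
    return " ".join(parts)

# Placeholder product, not taken from the Dorabruschi catalog.
toy_product = {
    "title": "Example cream",
    "description": "A placeholder description.",
    "price": "10.0 euros",
    "quantity": "50 ml",
}

text = product_to_text(toy_product)
print(text)
```

Because the function only indexes the row by key, it should also work row-wise on a DataFrame, for instance via `df.apply(product_to_text, axis=1)`.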

In [29]:
all_product_info = '\n\n'.join(df['text'])
print(all_product_info[:10000])

ACE 10% multivitamin concentrate Description: This rapidly absorbed concentrated treatment ensures maximum purity and effectiveness of the ingredients used. The revitalizing cocktail of the 3 beauty vitamins (A, C, E) helps to delay the aging processes and counteract the aggression of free radicals, mainly responsible for the degenerative changes associated with aging. The constant use of this serum makes the skin elastic and hydrated, radiant, toned, thus giving the face a younger and brighter appearance. Usage Instructions: Apply a few drops of concentrate in the morning and/or evening on perfectly cleansed facial skin and massage delicately until completely absorbed. Properties: Anti-wrinkle, Antioxidant, Illuminating Ingredients: Aqua [water], Glycerin, Tocopheryl acetate, Propylene glycol, Peg-40 hydrogenated castor oil, Ascorbic acid, Retinyl palmitate, Tocopherol, Helianthus annuus (sunflower) seed oil, Xanthan gum, Ethylcellulose, Trideceth-9, Phenoxyethanol, Tetrasodium edta, 

In [30]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """
    Returns the number of tokens in a text string.
    """
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

print('Number of characters: ', len(all_product_info))
print('Number of tokens: ', num_tokens(all_product_info))

Number of characters:  77105
Number of tokens:  20082


In [31]:
def get_embedding(text: str, model: str=EMBEDDING_MODEL) -> list[float]:
    """
    Returns the embedding of a text string using the OpenAI API.
    """
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

def compute_doc_embeddings_from_column(df: pd.DataFrame, column_name: str
                                       ) -> dict[int, list[float]]:
    """
    Creates an embedding for each row of df from the given column using the OpenAI Embeddings API.
    Returns a dictionary that maps each row index to its embedding vector.
    """
    return {
        idx: get_embedding(r[column_name]) for idx, r in df.iterrows()
    }

In [32]:
product_embeddings = compute_doc_embeddings_from_column(df, 'text')
df['embeddings'] = pd.Series(product_embeddings)  # explicit Series so values align with the row index
df.to_csv('dorabruschi_products_embeddings.csv', index=False)
df.head()

Unnamed: 0,title,description,usage_instructions,properties,ingredients,product_type,benefits,intended_concerns,skin_type,texture,format,price,quantity,text,embeddings
0,ACE 10% multivitamin concentrate,This rapidly absorbed concentrated treatment e...,Apply a few drops of concentrate in the mornin...,"Anti-wrinkle, Antioxidant, Illuminating","Aqua [water], Glycerin, Tocopheryl acetate, Pr...",Serum,Wrinkle,"Aging, Dullness, Wrinkles",All Skin Types,Liquid,Dropper,46.00 euros,30 ml,ACE 10% multivitamin concentrate Description: ...,"[-0.013464927673339844, -0.017423616722226143,..."
1,Revitalizing multivitamin cream,"Cream with a velvety and light texture, design...",Apply in the morning and/or in the evening to ...,"Anti-wrinkle, Antioxidant, Illuminating","Aqua [water], Glycerin, Cetyl alcohol, Capryli...",Moisturizer,Wrinkle,"Aging, Dullness, Wrinkles",All Skin Types,Velvety cream,airless,49.00 euros,50 ml,Revitalizing multivitamin cream Description: C...,"[-0.02980664186179638, -0.01637360453605652, -..."
2,Smoothing renewing cream,"Cream with a velvety and light texture, it is ...",Apply in the evening to perfectly cleansed ski...,"Anti-wrinkle, Renewing, Illuminating","Aqua [water], Peg-6 stearate, Glycolic acid, C...",Moisturizer,Wrinkle,"Aging, Dullness, Wrinkles",All Skin Types,Velvety cream,airless,45.00 euros,50 ml,Smoothing renewing cream Description: Cream wi...,"[-0.013324999250471592, -0.005090258549898863,..."
3,Acne roll-on lotion,Moderately alcoholic invisible lotion with a p...,Apply with the appropriate roll-on directly on...,"Purifying, Astringent","Aqua [water], Alcohol, Glycerin, Salicylic aci...",Spot Treatment,Purifying,"Acne, Blemishes",Oily,Liquid,roll-on,22.00 euros,10 ml,Acne roll-on lotion Description: Moderately al...,"[0.0097566619515419, -0.005106792785227299, 0...."
4,Acne paste,Paste for a quick and effective treatment of p...,Apply 1-2 times a day on pimples and impuritie...,"Purifying, Anti-imperfections","Paraffinum liquidum [mineral oil], Zinc oxide,...",Spot Treatment,Purifying,"Acne, Blemishes",Oily,Thick paste,Tubo,26.00 euros,30 ml,Acne paste Description: Paste for a quick and ...,"[0.001426826580427587, 0.011394193395972252, 0..."


In [33]:
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int=100
) -> tuple[list[str], list[float]]:

    """Returns a list of strings and relatednesses, sorted from most related to least."""

    query_embedding = get_embedding(query)

    strings_and_relatedness = [
        (row['text'], relatedness_fn(query_embedding, row['embeddings']))
        for i, row in df.iterrows()
    ]

    strings_and_relatedness.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatedness)
    return strings[:top_n], relatednesses[:top_n]

In [34]:
# example output
completion = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "I have dry skin and am looking for a moisturizer that I can use every day. What is the best cream for me?"}
    ]
)

response = completion.choices[0].message.content

print(response)

For your dry skin that requires daily moisturization, the best Dorabruschi cream would be the "Cream K for dry skin." This 24-hour nourishing cream is known for its restructuring and protective properties, making it ideal for ensuring balanced hydration and comfort throughout the day.

### Recommended Cream for Dry Skin:
- Product: Cream K for dry skin
- Description: 24-hour nourishing cream rich in Polypeptides and Horse Chestnut Extract known for their restructuring and protective properties.
- Characteristics: Nourishing, moisturizing, protective.
- Usage: Apply in the morning and/or evening to perfectly cleansed skin and massage gently until completely absorbed.
- Price: 48.00 euros
- Quantity: 50 ml

The "Cream K" will help replenish your skin's moisture levels and provide the necessary nourishment to combat dryness, leaving your skin feeling comfortable and hydrated on a daily basis.


In [35]:
strings, relatednesses = strings_ranked_by_relatedness(response, df, top_n=5)

for string, relatedness in zip(strings, relatednesses):
    print(f"***{relatedness=:.3f}***")
    display(string)
    print('\n')

***relatedness=0.881***


'Cream K for dry skin Description: Night cream with an extremely rich and nourishing texture, ideal for very dry and parched skin. This original formula handed down by tradition Dora Bruschi, retains unchanged all the characteristics and properties sought by its creator. The highly nourishing and skin-friendly ingredients such as beeswax, jojoba oil and sweet almond oil make the skin soft and perfectly relaxed. Usage Instructions: Apply in the evening to perfectly cleansed skin and massage gently until completely absorbed. Properties: Elasticizing, Intensive, Nourishing Ingredients: Aqua [water], Prunus amygdalus dulcis (sweet almond) oil, Petrolatum, Decyl oleate, Cera alba [beeswax], Simmondsia chinensis (jojoba) seed oil, Dicocoyl pentaerythrityl distearyl citrate, Sorbitan sesquioleate, Acetylated lanolin, Cholesterol, Glycerin , Paraffinum liquidum [mineral oil], Cetyl peg/ppg-10/1 dimethicone, Lanolin alcohol, Propylene glycol, Aluminum stearate, Caprylic/capric glycerides, Triti



***relatedness=0.866***


'Anti-wrinkle cream K Description: Cream with a very rich texture, identical to the original handed down by tradition Dora Bruschi, which brings natural lipids to the skin thanks to the presence of oils such as cod liver oil and sweet almond oil, and fats such as lanolin. The result is a pleasant sensation of softness and elasticity. Usage Instructions: Apply in the evening to perfectly cleansed skin and massage gently until completely absorbed. Properties: Anti-wrinkle, Emollient Ingredients: Aqua [water], Prunus amygdalus dulcis (sweet almond) oil, Petrolatum, Cera alba [beeswax], Dicocoyl pentaerythrityl distearyl citrate, Sorbitan sesquioleate, Decyl oleate, Glycerin, Lanolin, Paraffinum liquidum [mineral oil], Aluminum stearate, Cera microcr istallina [ microcrystalline wax], Propylene glycol, Cholesterol, Caprylic/capric glycerides, Gadi iecur oil, Krameria triandra root extract, Triticum vulgare (wheat) germ oil, Dimethicone, Propylparaben, Isopropyl myristate, Parfum [fragrance



***relatedness=0.856***


'Nourishing moisturizing cream for first wrinkles Description: 24-hour cream with a soft texture, designed to deeply nourish the skin and preserve its natural integrity. Thanks to the concentration of emollient active ingredients such as Argan Oil, Rice Bran Oil, Kigelia Africana and Quillaja Saponaria, it helps the skin to rebuild the protective skin barrier which allows it to maintain its optimal level of hydration. The formula is completed by Vitamin E with its antioxidant properties which reinforces the protection from external agents and prevents environmental ageing. The result is a more elastic and luminous face. Usage Instructions: Apply morning and/or evening to perfectly cleansed skin and massage gently until completely absorbed. Properties: Emollient, Repair Ingredients: Aqua [water], Glycerin, Argania spinosa kernel oil, Peg-6 stearate, Cetyl alcohol, Glyceryl stearate, Peg-32 stearate, Cetyl palmitate, Oryza sativa (rice) bran oil, Potassium cetyl phosphate, Camellia sinen



***relatedness=0.844***


'Intensive moisturizing cream first wrinkles Description: 24-hour cream with a silky and light texture, designed to help the skin maintain its optimal level of hydration. Rich in highly moisturizing active ingredients (Hyaluronic Acid and Urea) it gives immediate hydration and, thanks to the Saccharide Isomerate, maintains it over time. The formula is completed by Argan Oil and Vitamin E, with antioxidant and emollient properties to give the face extreme softness and comfort. Usage Instructions: Apply morning and/or evening to perfectly cleansed skin and massage gently until completely absorbed. Properties: Deep, And, Prolonged, Hydration Ingredients: Aqua [water], Glycerin, Peg-6 stearate, Cetyl alcohol, Argania spinosa kernel oil, Glyceryl stearate, Peg-32 stearate, Cetearyl isononanoate, Cetyl palmitate, Ethylhexyl stearate, Potassium cetyl phosphate, Propylene glycol, Urea, Camellia sinensis leaf extract , Sodium hyaluronate, Saccharide isomerate, Sorbitol, Tocopheryl acetate, Toco



***relatedness=0.839***


'Intensive moisturizing mask Description: Super moisturizing and refreshing pack, suitable for all skin types, especially for dry and sensitive ones. It represents a real "revitalizing bath" for skins stressed by the sun and/or other external agents. The result is brighter, velvety and supple skin. Usage Instructions: Spread a thick layer of product on thoroughly cleansed face and neck. Leave on for 20-30 minutes before removing with warm water. It is recommended to use it 1-2 times a week depending on the type of skin and/or its conditions. Properties: Nourishing, Soothing Ingredients: Aqua [water], Glycerin, Polysorbate 61, Oryza sativa (rice) starch, Zinc oxide, Prunus amygdalus dulcis (sweet almond) oil, Sorbitol, Triticum vulgare (wheat) germ oil, Tocopheryl acetate, Sodium glutamate, Parfum [fragrance], Isopropyl myristate, Propylene glycol, Imidazolidinyl urea, Disodium edta, Geraniol, Linalool, Hexyl cinnamal, Alpha-isomethyl ionone, Benzyl salicylate, Citronellol, Benzyl benzo





Define functions that generate a draft response from the fine-tuned model, find the catalog products most related to that draft, insert their descriptions into the prompt, and generate a final response grounded in the catalog.
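
The overall flow can be sketched with plain callables in place of the model and the embedding search. This is only an outline under stated assumptions: `two_pass_answer`, `fake_draft`, `fake_retrieve`, and `echo_answer` are hypothetical names introduced for illustration, and the stubs return fixed strings so the example runs without any API call.

```python
from typing import Callable, Sequence

def two_pass_answer(
    query: str,
    draft_fn: Callable[[str], str],
    retrieve_fn: Callable[[str], Sequence[str]],
    answer_fn: Callable[[str], str],
) -> str:
    """Draft an answer, retrieve catalog entries related to the draft, then answer again."""
    draft = draft_fn(query)          # pass 1: draft without catalog context
    products = retrieve_fn(draft)    # search with the draft, not the raw question
    prompt = "Use the below product catalog from Dorabruschi to answer the subsequent question."
    for product in products:
        prompt += f"\nProduct description: {product}\n"
    prompt += f"\n\nQuestion: {query}"
    return answer_fn(prompt)         # pass 2: answer with catalog context

# Hypothetical stand-ins for the model calls and the embedding search.
def fake_draft(query: str) -> str:
    return "A rich night cream would suit dry skin."

def fake_retrieve(text: str) -> list[str]:
    return ["Cream K for dry skin Description: Night cream with an extremely rich and nourishing texture"]

def echo_answer(prompt: str) -> str:
    return prompt  # echo the prompt so the assembled message can be inspected

result = two_pass_answer(
    "What is the best cream for dry skin?", fake_draft, fake_retrieve, echo_answer
)
print(result)
```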

In [36]:
def generate_response(query: str):
    """Generates a response using the fine-tuned model."""
    completion = client.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ]
    )

    return completion.choices[0].message.content

In [37]:
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str = model_id,
    token_budget: int = 16385,
    top_n: int = 5
) -> str:
    """Return a message for GPT, with relevant product information pulled from a dataframe."""
    # Draft an answer with the fine-tuned model, then use that draft (not the raw
    # question) as the search text for the most related catalog entries.
    response = generate_response(query)
    strings, _ = strings_ranked_by_relatedness(response, df, top_n=top_n)
    message = "Use the below product catalog from Dorabruschi to answer the subsequent question."
    question = f"\n\nQuestion: {query}"

    # Add product descriptions in ranked order until the next one would exceed the token budget.
    for string in strings:
        next_article = f"\nProduct description: {string}\n"
        token_count = num_tokens(message + next_article + question, model=model)
        if token_count > token_budget:
            break
        message += next_article

    return message + question
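
The budget check in the loop can be tried in isolation. The sketch below is an assumption-laden toy: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, whereas the notebook's `num_tokens` counts model tokens, and `build_message` is an illustrative name, not part of the notebook.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in for num_tokens: counts whitespace-separated words."""
    return len(text.split())

def build_message(intro: str, question: str, snippets: list[str], token_budget: int) -> str:
    """Append snippets in ranked order until the next one would exceed the budget."""
    message = intro
    for snippet in snippets:
        candidate = f"\nProduct description: {snippet}\n"
        if count_tokens(message + candidate + question) > token_budget:
            break  # stop at the first snippet that does not fit
        message += candidate
    return message + question

intro = "Use the below product catalog to answer the subsequent question."
question = "\n\nQuestion: What is the Acne paste good for?"
snippets = ["Acne paste", "Acne roll-on lotion", "Cream K for dry skin"]

msg = build_message(intro, question, snippets, token_budget=30)
print(msg)
```

Because the loop stops at the first snippet that does not fit, a long description early in the ranking can exclude shorter ones that follow it; that matches the behavior of `query_message`.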

In [38]:
def ask(
    query: str,
    df: pd.DataFrame=df,
    model: str=model_id,
    token_budget: int=16385-4000
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model, token_budget)
    messages = [
        {'role': 'system', 'content':
         'You are tasked with offering customized beauty routine recommendations using only products from Dorabruschi\'s product line, tailored to the user\'s specific skincare needs. For each customer query: \
        - Recommend products only from the provided Dorabruschi product catalog. \
        - Do not recommend or suggest products outside of this catalog. \
        - For each recommended product, provide a brief explanation of why it has been chosen for the user, detailing its usage and cost. \
        - Limit each routine recommendation to 3-5 products. \
        - If no product in the catalog suits the user\'s request, clearly state that no suitable product is available. Do not make assumptions about product benefits that are not explicitly supported by the catalog. \
        - In cases of uncertainty, advise the user to consult a skincare specialist or explore other brands for more suitable options.'},
        {'role': 'user', 'content': message},
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
        #, max_tokens=500
    )
    return response.choices[0].message.content

In [39]:
# test 1
query = "What's your best sunscreen?"
response = ask(query, df)
print(response)

Dorabruschi does not have a sunscreen product in its current product line. As a result, I am unable to recommend a Dorabruschi sunscreen. If you are looking for a sunscreen, I recommend exploring other brands that specialize in sun protection products to find the best sunscreen for your needs. Remember to choose a broad-spectrum sunscreen with an SPF of 30 or higher and reapply it every 2 hours when exposed to the sun for optimal protection.


In [40]:
# test 2
query = "I have dry skin and am looking for a moisturizer that I can use every day. What is the best cream for me?"
response = ask(query, df)
print(response)

Based on your dry skin type and the need for a daily moisturizer, the best Dorabruschi cream for you would be the **"Intensive moisturizing cream first wrinkles."**

**Product:** Intensive moisturizing cream first wrinkles  
**Description:** This 24-hour cream has a silky and light texture, designed to help the skin maintain optimal hydration levels. It is rich in highly moisturizing active ingredients like Hyaluronic Acid and Urea, providing immediate and prolonged hydration. The cream contains Argan Oil and Vitamin E for antioxidant and emollient properties, giving your face extreme softness and comfort.  
**Usage Instructions:** Apply morning and/or evening to perfectly cleansed skin and massage gently until completely absorbed.  
**Properties:** Deep, And, Prolonged, Hydration  
**Intended concerns:** Aging, Wrinkles  
**Skin Type:** All Skin Types  
**Texture:** Light cream  
**Format:** airless  
**Price:** 42.00 euros  
**Quantity:** 50 ml  

This cream will deeply hydrate your 

In [41]:
# test 3
query = "I'm 25 and have combination skin. What cleanser should I use for daily skincare?"
response = ask(query, df)
print(response)

For your daily skincare routine with combination skin, I recommend the following Dorabruschi products:

1. **Product:** Delicate sebum-balancing cleansing base
   - **Description:** Extremely delicate washing base that combines a rebalancing action with a moisturizing and soothing action, suitable for all skin types.
   - **Usage:** Apply a small amount to a damp face in the morning and evening, massage gently, and rinse thoroughly.
   - **Properties:** Cleanses, Rebalances
   - **Intended concerns:** Acne, Blemishes
   - **Skin Type:** Oily
   - **Texture:** Foaming gel
   - **Format:** Flip top bottle
   - **Price:** 22.00 euros
   - **Quantity:** 165 ml

**Why this product:** This cleansing base is ideal for combination skin as it helps rebalance sebum production while being gentle and suitable for all skin types. It will cleanse your skin effectively without causing dryness.

2. **Optional Add-on (if desired):** Rebalancing face cream
   - **Description:** Light emulsion with a pur

In [42]:
# test 4
query = "What Dorabruschi treatment would you recommend for acne scars on oily skin?"
response = ask(query, df)
print(response)

For acne scars on oily skin, I recommend the following Dorabruschi treatment:

1. Product: **Acne roll-on lotion**
   - Description: Moderately alcoholic invisible lotion with a purifying, exfoliating, and calming action to quickly resolve skin imperfections like pimples or blackheads.
   - Usage Instructions: Apply directly on imperfections 2-3 times a day and leave to absorb.
   - Properties: Purifying, Astringent
   - Skin Type: Oily
   - Texture: Liquid
   - Format: roll-on
   - Price: 22.00 euros
   - Quantity: 10 ml

2. Product: **Acne paste**
   - Description: Paste for quick and effective treatment of pimples, blackheads, and enlarged pores with high sebum absorbing and purifying properties.
   - Usage Instructions: Apply 1-2 times a day on pimples and impurities after specific balancing cleansing.
   - Properties: Purifying, Anti-imperfections
   - Skin Type: Oily
   - Texture: Thick paste
   - Format: Tubo
   - Price: 26.00 euros
   - Quantity: 30 ml

3. Product: **Rebalancin

In [43]:
# test 5
query = "What is the Acne paste and what is it good for?"
response = ask(query, df)
print(response)

The Acne paste from Dorabruschi is a thick paste designed for a quick and effective treatment of pimples, blackheads, and enlarged pores. It is specifically formulated to address skin imperfections associated with acne and blemishes. The paste contains ingredients with the following properties:

1. **Sebum Absorbing**: Zinc Oxide and Rice Starch have a high sebum absorbing capacity, which helps in controlling excess oil on the skin.
2. **Skin Purifying**: Sulfur and Allantoin in the formula purify the skin, helping to clear impurities and blemishes.
3. **Smoothing**: Salicylic Acid provides a smoothing effect, promoting a more refined skin texture.

**Usage Instructions:** The Acne paste should be applied 1-2 times a day directly on pimples and impurities after a specific balancing cleansing routine.

**Skin Type:** Oily  
**Texture:** Thick paste  
**Format:** Tubo  
**Price:** 26.00 euros  
**Quantity:** 30 ml  

**Benefits:** Purifying  
**Intended concerns:** Acne, Blemishes  

In 

## Evaluation
Generate responses to 25 evaluation questions drafted by Dorabruschi specialists.

In [45]:
def answer_queries(queries: list[str]) -> list[str]:
    """Generate a response for each query in a list."""
    return [ask(query, df) for query in queries]

In [46]:
def generate_responses(qa: pd.DataFrame, column: str) -> pd.DataFrame:
    """Answer each question in the Q&A dataframe and store the response in `column`."""
    qa_new = qa.copy()  # copy so the input dataframe is not modified in place
    for index, row in qa.iterrows():
        query = row['Question']
        response = ask(query, df)
        qa_new.loc[index, column] = response
    return qa_new
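
Since each row triggers model calls, a single failed request would abort the whole loop. One possible variant, sketched here with plain dictionaries instead of a dataframe, records the error in the cell and carries on. `fill_responses` and `fake_ask` are hypothetical names used only for illustration; `fake_ask` stands in for `ask` and raises on purpose for one question.

```python
import copy

def fill_responses(rows: list[dict], column: str, ask_fn) -> list[dict]:
    """Return a copy of rows with ask_fn's answer under `column`; failures are recorded, not raised."""
    filled = copy.deepcopy(rows)  # leave the caller's data untouched
    for row in filled:
        try:
            row[column] = ask_fn(row["Question"])
        except Exception as exc:  # e.g. a transient API error
            row[column] = f"ERROR: {exc}"
    return filled

def fake_ask(question: str) -> str:
    """Hypothetical stand-in for ask(); fails when the question contains 'fail'."""
    if "fail" in question:
        raise RuntimeError("simulated API error")
    return f"answer to: {question}"

rows = [{"Question": "I have dry skin"}, {"Question": "please fail"}]
filled = fill_responses(rows, "Fine-tuning", fake_ask)
for row in filled:
    print(row)
```

Rows marked with `ERROR:` can then be retried separately instead of rerunning all 25 questions.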

In [47]:
file_path = 'drive/MyDrive/colab_notebooks/independent_study/data/model_eval.xlsx'
eval = pd.read_excel(file_path)
eval.head()

Unnamed: 0,Question,Custom GPT,RAG,Fine-tuning
0,"I have combination skin, which tends to be shi...",,,
1,I have dry skin and would like a nourishing pr...,,,
2,I am 65 years old and have several signs of ag...,,,
3,"I can see my eye area aging, I notice many mor...",,,
4,I turn 50 in a month and would like to arrive ...,,,


In [48]:
eval_new = generate_responses(eval, 'Fine-tuning')

In [49]:
eval_new.loc[:5, ['Question', 'Fine-tuning']]

Unnamed: 0,Question,Fine-tuning
0,"I have combination skin, which tends to be shi...",For your combination skin that is shiny in the...
1,I have dry skin and would like a nourishing pr...,"I recommend the ""Nourishing moisturizing cream..."
2,I am 65 years old and have several signs of ag...,"Based on your concerns about wrinkles, sagging..."
3,"I can see my eye area aging, I notice many mor...",Based on your concerns about aging in the eye ...
4,I turn 50 in a month and would like to arrive ...,To help you arrive at your 50th birthday party...
5,I would like a “shock” program to get back in ...,"For a ""shock"" program to quickly treat celluli..."


In [50]:
print(eval_new.loc[10, 'Fine-tuning'])

For your 25-year-old daughter with normal skin, here is a tailored Dorabruschi beauty routine that will help maintain her skin's health and radiance:

1. **Gentle Sebum-Balancing Cleansing Base**
   - **Description:** Extremely delicate washing base ideal for impure and reddened skin, suitable for all skin types.
   - **Usage:** Apply a small amount to a damp face morning and evening, massage gently, and rinse thoroughly.
   - **Properties:** Cleanses, Rebalances
   - **Price:** 22.00 euros
   - **Quantity:** 165 ml

2. **Gentle Cleansing Milk**
   - **Description:** Delicate cleansing milk for sensitive skin, rich in soothing plant extracts.
   - **Usage:** Apply morning and evening to face and neck, massage gently, and remove excess with warm water or softening lotion.
   - **Properties:** Cleanses, Calms, Reduces Redness
   - **Price:** 23.00 euros
   - **Quantity:** 200 ml

3. **Perfecting Mask**
   - **Description:** Soft gel mask with sebum-normalizing and purifying properties, s

In [51]:
eval_new.to_excel('model_eval_finetune.xlsx', index=False)
files.download('model_eval_finetune.xlsx')
