# Useful links

fine tuning: https://platform.openai.com/docs/guides/fine-tuning

documentation: https://platform.openai.com/docs/api-reference/parameter-details

tokenizer to choose classes: https://platform.openai.com/tokenizer

pricing: https://openai.com/api/pricing/#faq-classifications-pricing

# Fine-tune GPT-3 on our dataset through the API

## Create dataset to use

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

col_ingredients_names = [
    "ingredient_id",
    "name",
    "category_id",
    "carbon_foot_print",
    "carbon_foot_print_source",
    "carbon_foot_print_weight",
    "water_foot_print",
    "water_foot_print_source",
    "water_foot_print_weight",
    "kcal",
    "kcal_weight",
    "protein",
    "protein_weight",
    "fat",
    "fat_weight",
    "carbohydrates",
    "carbohydrates_weight",
    "fiber",
    "fiber_weight",
    "vendor_recipe_ids",
    "created_at",
    "updated_at",
    "water_foot_print_z_score",
    "carbon_foot_print_z_score"]

ingredients = pd.read_csv("/content/drive/MyDrive/Semantics In Intelligent Information Access/GPT trial/ingredients.csv", 
                          delimiter=',', 
                          quotechar='"', 
                          header=None, 
                          names = col_ingredients_names, 
                          index_col = 0)

ingredients.head(3)

ingredient_id,name,category_id,carbon_foot_print,carbon_foot_print_source,carbon_foot_print_weight,water_foot_print,water_foot_print_source,water_foot_print_weight,kcal,kcal_weight,...,fat_weight,carbohydrates,carbohydrates_weight,fiber,fiber_weight,vendor_recipe_ids,created_at,updated_at,water_foot_print_z_score,carbon_foot_print_z_score
1,"cheese, brie",28,9.59,sueatable,1000,5253,sueatable,1000,334,100,...,100,0.45,100,0.0,100,\N,2022-10-24 17:13:17,2023-02-04 00:10:02,-4.65155e-07,2.34918e-09
2,"sauce, barbecue",109,1.46,sueatable,1000,572,sueatable,1000,172,100,...,100,40.77,100,0.9,100,\N,2022-10-24 17:13:17,2023-02-04 00:10:02,-1.85771e-06,-2.16886e-10
3,"candies, dark chocolate coated coffee beans",\N,3.16,sueatable,1000,20717,sueatable,1000,540,100,...,100,59.95,100,7.5,100,\N,2022-10-24 17:13:17,2023-02-04 00:10:02,4.13525e-06,3.19684e-10


In [None]:
import csv

rows = []
# Use a context manager so the file handle is closed once reading is done
with open("/content/drive/MyDrive/Semantics In Intelligent Information Access/GPT trial/recipes.csv", newline="") as f:
  reader = csv.reader(f,
                      quotechar='"',
                      delimiter=',',
                      quoting=csv.QUOTE_MINIMAL)
  for line in reader:
    rows.append(line)

recipes = pd.DataFrame(rows, 
                       columns= ['recipe_id', 
                                 'title', 
                                 'url',
                                 'vendor_id', 
                                 'static_score',
                                 'mcfp',
                                 'trust_cfp', 
                                 'mwfp', 
                                 'trust_wfp',
                                 'created_at',
                                 'updated_at']).set_index("recipe_id")

recipes.head(3)

recipe_id,title,url,vendor_id,static_score,mcfp,trust_cfp,mwfp,trust_wfp,created_at,updated_at
1,Yogurt Parfaits,http://tastykitchen.com/recipes/breakfastbrunc...,000095fc1d,0.061915812117267,1.9333333333333331,1.0,1383.1666666666667,1.0,2022-10-21 07:39:08,2023-02-04 18:58:53
2,"Salt Free, Low Cholesterol Sugar Cookies Recipe",http://cookeatshare.com/recipes/salt-free-low-...,00051d5b9d,0.0566084103446987,1.1960000000000002,0.833333333,2629.8,0.833333333,2022-10-21 07:39:08,2023-02-04 18:58:53
3,Honey Sriracha Chicken Wings,http://tastykitchen.com/recipes/main-courses/h...,00059b093b,0.0606619702453755,2.5420000000000003,0.909090909,4883.111111111111,0.818181818,2022-10-21 07:39:08,2023-02-04 18:58:53


In [None]:
import numpy as np

rec_perc = recipes.loc[:, ["title", "static_score"]].copy()
rec_perc = rec_perc.rename(columns={"title": "recipe name", "static_score": "foot print"})

# '\N' marks a missing value in the export; convert the column to float with NaN
rec_perc["foot print"] = rec_perc["foot print"].replace('\\N', np.nan).astype(float)

# fill the 424 missing values with the mean so the float to int conversion works
mean = rec_perc["foot print"].mean()
rec_perc["foot print"] = rec_perc["foot print"].fillna(mean)

# rec_perc["foot print"] = (rec_perc["foot print"].astype(float) * 100).round(2).astype(str) + "%"
# Express the score as a whole-number percentage, e.g. 0.0619 -> "6%".
# Casting to int avoids str.replace('.0', '%'), where '.' is read as a regex wildcard.
rec_perc["foot print"] = np.rint(rec_perc["foot print"] * 100).astype(int).astype(str) + "%"

rec_perc.head()


recipe_id,recipe name,foot print
1,Yogurt Parfaits,6%
2,"Salt Free, Low Cholesterol Sugar Cookies Recipe",6%
3,Honey Sriracha Chicken Wings,6%
4,Shrimp and Caper Salad,4%
5,Natural Peanut Butter Chocolate Bon Bons,9%


The inputs provided to the model were chosen for the following reasons:

- The model can infer alternative ingredients in a recipe.
- The model presumably already knows the ingredients of each recipe.
- The model cannot reliably perform mathematical calculations, so feeding it formulas (e.g. the static score calculation) would be useless.
- Therefore, we decided to make the _foot print_ value as intuitive as possible by expressing it as a percentage for the entire recipe.
- Another solution would be to discretize the _foot print_ value into categories such as "low, medium, high"; here, however, we try to push the model a little further.

Summary: in my opinion the model can intuitively learn whether a recipe's pollution level is "low, medium, high". Taking that for granted, we try percentages anyway as an experiment.
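
The "low, medium, high" alternative mentioned above could be sketched as follows. This is only an illustration: the thresholds `0.05` and `0.08` and the helper `footprint_level` are assumptions made for the example, not values derived from the dataset.

```python
from bisect import bisect_left

# Assumed cut points on the static score (a fraction between 0 and 1).
# They are illustrative only and were not tuned on the recipes data.
THRESHOLDS = [0.05, 0.08]
LABELS = ["low", "medium", "high"]


def footprint_level(score: float) -> str:
    """Map a static score to a coarse label.

    score <= 0.05 -> "low", 0.05 < score <= 0.08 -> "medium", otherwise "high".
    """
    return LABELS[bisect_left(THRESHOLDS, score)]


# Hypothetical scores, not taken from the dataset.
sample_scores = [0.02, 0.06, 0.09]
sample_levels = [footprint_level(s) for s in sample_scores]
print(sample_levels)
```

With real data, the cut points would more sensibly come from the score distribution (for example its quantiles) so that each class gets enough examples.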

In [None]:
# show rows that contain the ? symbol (regex=False matches it literally)
display(rec_perc[rec_perc['recipe name'].str.contains('?', regex=False)])

# remove these rows
rec_perc = rec_perc[~rec_perc['recipe name'].str.contains('?', regex=False)]

recipe_id,recipe name,foot print
2502,Persian ?Ice Cream? Sundae,8%
2925,"Peaches and Ice Cream, Tempted yet ?",7%
6307,Crag?crg (Thai Dumplings in Coconut Cream),7%
19110,Did You Know? (3 Little-Known Homemade Remedies),8%
19728,Salted Boiling Water - What Does It Mean?,7%
20897,Foolproof Pie Dough (With Vodka!?),7%
20975,hityl? Cinnamon Candy Sauce,9%
23759,This is Gluten Free?! Our Favorite Pizza Crust,6%
24649,Mandarin Pancakes ?New Year?,7%
25254,How to make buttermilk out of regular milk? - ...,7%


## Adapt the dataset for fine tuning

The input to the GPT model must be of the form :

    {"prompt": "<prompt text>", "completion": "<ideal generated text>"}
    {"prompt": "<prompt text>", "completion": "<ideal generated text>"}
    {"prompt": "<prompt text>", "completion": "<ideal generated text>"}
    ...

In our case:

    {"prompt": "<recipe name or composition>", "completion": "<its foot print percentage>"}
    {"prompt": "<recipe name or composition>", "completion": "<its foot print percentage>"}
    ...
    {"prompt": "<recipe name or composition>", "completion": "<its foot print percentage>"}
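
As a minimal sketch of this format, the snippet below serializes a couple of recipe/percentage pairs as JSON Lines using only the standard library. The helper name `to_jsonl` is hypothetical, and the two pairs are taken from the table shown earlier.

```python
import json

# (recipe name, foot print) pairs as they appear in the rec_perc table.
examples = [
    ("Yogurt Parfaits", "6%"),
    ("Shrimp and Caper Salad", "4%"),
]


def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs as one JSON object per line."""
    lines = []
    for prompt, completion in pairs:
        record = {"prompt": prompt, "completion": completion}
        # ensure_ascii=False keeps accented recipe names readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


jsonl_text = to_jsonl(examples)
print(jsonl_text, end="")
```

Using `json.dumps` rather than string concatenation means quotes and commas inside recipe names are escaped correctly.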

In [None]:
%%capture
!pip install --upgrade openai

Classification

In classification problems, each input in the prompt should be classified into one of the predefined classes. For this type of problem, we recommend:

1. Use a separator at the end of the prompt, e.g. \n\n###\n\n. Remember to also append this separator when you eventually make requests to your model.
2. Choose classes that map to a single token. At inference time, specify max_tokens=1 since you only need the first token for classification.
3. Ensure that the prompt + completion doesn't exceed 2048 tokens, including the separator
4. Aim for at least ~100 examples per class
5. To get class log probabilities you can specify logprobs=5 (for 5 classes) when using your model
6. Ensure that the dataset used for finetuning is very similar in structure and type of task as what the model will be used for
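
Two of the points above (the prompt separator and the minimum number of examples per class) can be checked with a few lines of Python. This is a rough sketch: the helper names `add_separator` and `underrepresented` and the label counts are made up for illustration.

```python
from collections import Counter

SEPARATOR = "\n\n###\n\n"        # separator suggested in the guidelines above
MIN_EXAMPLES_PER_CLASS = 100     # "at least ~100 examples per class"


def add_separator(prompt: str) -> str:
    """Append the separator once, so training and inference prompts match."""
    return prompt if prompt.endswith(SEPARATOR) else prompt + SEPARATOR


def underrepresented(labels, minimum=MIN_EXAMPLES_PER_CLASS):
    """Return the classes that have fewer than `minimum` examples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < minimum}


# Hypothetical label column: 120 "low", 100 "medium", 30 "high".
labels = ["low"] * 120 + ["medium"] * 100 + ["high"] * 30
too_small = underrepresented(labels)
print(too_small)
```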

In [None]:
# fine tuning prompt completion

ft_pc = rec_perc.copy()
ft_pc.rename(columns={"recipe name":"prompt", "foot print":"completion"}, inplace=True)
ft_pc['completion'] = ' The foot print of ' + ft_pc['prompt'] + ' is about ' + ft_pc['completion'] + ' ###'
ft_pc['prompt'] = 'Which is the foot print of ' + ft_pc['prompt'] + '?  \n\n###\n\n'
ft_pc = ft_pc.drop(ft_pc.columns[2:], axis=1)

ft_pc.head(5)

recipe_id,prompt,completion
1,Which is the foot print of Yogurt Parfaits? \...,The foot print of Yogurt Parfaits is about 6%...
2,"Which is the foot print of Salt Free, Low Chol...","The foot print of Salt Free, Low Cholesterol ..."
3,Which is the foot print of Honey Sriracha Chic...,The foot print of Honey Sriracha Chicken Wing...
4,Which is the foot print of Shrimp and Caper Sa...,The foot print of Shrimp and Caper Salad is a...
5,Which is the foot print of Natural Peanut Butt...,The foot print of Natural Peanut Butter Choco...


A valid alternative, since we are working with a transformer, is to make the label categorical, for example by binning it into "low", "medium" and "high".

CLI (command-line interface) data preparation tool

In [None]:
ft_pc.to_csv("ft_pc.csv", index=False)
!openai tools fine_tunes.prepare_data -f ft_pc.csv

Analyzing...

- Based on your file extension, your file is formatted as a CSV file
- Your file contains 51213 prompt-completion pairs
- There are 4926 duplicated prompt-completion sets. These are rows: [246, 546, 704, 846, 876, 1117, 1283, 1396, 1407, 1429, 1456, 1474, 1576, 1713, 1866, 1908, 1934, 1994, 2202, 2206, 2404, 2522, 2585, 2689, 2700, 2733, 2760, 2797, 2949, 2985, 3082, 3109, 3156, 3222, 3264, 3285, 3482, 3492, 3495, 3606, 3619, 3703, 3722, 3726, 3737, 3839, 3945, 3961, 3966, 3967, 4030, 4076, 4109, 4141, 4202, 4270, 4399, 4422, 4432, 4451, 4477, 4487, 4490, 4512, 4548, 4592, 4594, 4637, 4639, 4710, 4718, 4768, 4783, 4827, 4838, 4867, 4875, 4876, 4959, 5057, 5069, 5085, 5089, 5185, 5188, 5253, 5348, 5386, 5390, 5394, 5433, 5489, 5496, 5540, 5548, 5600, 5603, 5615, 5617, 5652, 5701, 5725, 5731, 5732, 5787, 5830, 5839, 5881, 5926, 5937, 5977, 5982, 5989, 6049, 6056, 6070, 6078, 6109, 6193, 6216, 6220, 6233, 6272, 6297, 6340, 6367, 6457, 6482, 6547, 6548, 6556, 6578, 6581, 6590
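
The duplicated prompt-completion pairs reported by the tool could also be removed in pandas before the file is written. The sketch below works on a made-up miniature of `ft_pc` (the name `ft_pc_demo` and its rows are hypothetical).

```python
import pandas as pd

# Hypothetical miniature of ft_pc containing one repeated row.
ft_pc_demo = pd.DataFrame({
    "prompt": [
        "Which is the foot print of A?",
        "Which is the foot print of B?",
        "Which is the foot print of A?",
    ],
    "completion": [" 6% ###", " 4% ###", " 6% ###"],
})

# duplicated() marks every repeat after the first occurrence of a pair.
n_duplicates = int(ft_pc_demo.duplicated(subset=["prompt", "completion"]).sum())

# Keep only the first occurrence of each prompt-completion pair.
deduped = ft_pc_demo.drop_duplicates(subset=["prompt", "completion"]).reset_index(drop=True)

print(n_duplicates, len(deduped))
```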

## Start training

In [None]:
# openai api fine_tunes.create -t "content/ft_pc_prepared.jsonl" -m "davinci"

import os
import openai

# Read the API key from the environment; never hard-code secret keys in a notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

# List all created fine-tunes
!openai api fine_tunes.list

# Retrieve the state of a fine-tune. The resulting object includes
# job status (which can be one of pending, running, succeeded, or failed)
# and other information
!openai api fine_tunes.get -i <YOUR_FINE_TUNE_JOB_ID>

# Cancel a job
!openai api fine_tunes.cancel -i <YOUR_FINE_TUNE_JOB_ID>

After the job has finished, we can start making requests by passing the fine-tuned model name as the model parameter of a completion request:

In [None]:
# OpenAI CLI
!openai api completions.create -m <FINE_TUNED_MODEL> -p <YOUR_PROMPT>

# cURL
curl https://api.openai.com/v1/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": YOUR_PROMPT, "model": FINE_TUNED_MODEL}'

# Python
import openai
openai.Completion.create(
    model=FINE_TUNED_MODEL,
    prompt=YOUR_PROMPT)

# Node.js
const response = await openai.createCompletion({
  model: FINE_TUNED_MODEL,
  prompt: YOUR_PROMPT,
});
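
At inference time the prompt should be built exactly as in training, and the `###` suffix used in the completions can serve as the stop sequence. The sketch below only prepares the request arguments and parses a reply; it does not call the API. The helper names and the `max_tokens` value of 30 are assumptions for illustration.

```python
import re

SEPARATOR = "\n\n###\n\n"   # same separator as in the training prompts
STOP_SEQUENCE = " ###"      # same suffix as in the training completions


def build_prompt(recipe_name: str) -> str:
    """Mirror the prompt template used when building ft_pc."""
    return "Which is the foot print of " + recipe_name + "?  " + SEPARATOR


def parse_percentage(completion_text: str):
    """Extract the first integer percentage from a completion, or None."""
    match = re.search(r"(\d+)\s*%", completion_text)
    return int(match.group(1)) if match else None


# Arguments that could be passed to a completion request for the fine-tuned model.
request_kwargs = {
    "prompt": build_prompt("Yogurt Parfaits"),
    "stop": STOP_SEQUENCE,
    "max_tokens": 30,  # assumed upper bound for a one-sentence answer
}

parsed = parse_percentage(" The foot print of Yogurt Parfaits is about 6% ###")
print(parsed)
```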

## Delete a fine-tuned model
https://platform.openai.com/docs/guides/fine-tuning/delete-a-fine-tuned-model