# Install required libraries for this project
**openai is the library we will use to fine-tune models<br />**

In [8]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**pandas is a library for data manipulation**

In [None]:
!pip install pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**numpy is a numerical computing library for Python**

In [None]:
!pip install numpy

In [9]:
import openai 
import pandas as pd 
import numpy as np

In [10]:
np.random.seed(0)

**read the data from the file atis_intents.csv**

In [12]:
df = pd.read_csv('atis_intents.csv', header = None)

In [13]:
df.head()

Unnamed: 0,0,1
0,atis_flight,i want to fly from boston at 838 am and arriv...
1,atis_flight,what flights are available from pittsburgh to...
2,atis_flight_time,what is the arrival time in san francisco for...
3,atis_airfare,cheapest airfare from tacoma to orlando
4,atis_airfare,round trip fares from pittsburgh to philadelp...


**we will name our columns as `intent` and `text`**

In [14]:
df.columns = ['intent', 'text']

In [15]:
df['intent'].unique()

array(['atis_flight', 'atis_flight_time', 'atis_airfare', 'atis_aircraft',
       'atis_ground_service', 'atis_airport', 'atis_airline',
       'atis_distance', 'atis_abbreviation', 'atis_ground_fare',
       'atis_quantity', 'atis_city', 'atis_flight_no', 'atis_capacity',
       'atis_flight#atis_airfare', 'atis_meal', 'atis_restriction',
       'atis_airline#atis_flight_no',
       'atis_ground_service#atis_ground_fare',
       'atis_airfare#atis_flight_time', 'atis_cheapest',
       'atis_aircraft#atis_flight#atis_flight_no'], dtype=object)

**clean the data for our model: replace `#` with `_` and drop the `atis_` prefix from each intent**

In [16]:
df['intent'] = df['intent'].str.replace('#','_')

In [17]:
df['intent'] = df['intent'].str.replace('atis_', '')
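
A small sketch of what these two replacements do to a compound label, using three labels copied from the `unique()` output above. Passing `regex=False` is an addition here so both patterns are treated as literal strings.

```python
import pandas as pd

# Three raw labels taken from the unique() output above.
raw = pd.Series([
    'atis_flight',
    'atis_flight#atis_airfare',
    'atis_ground_service#atis_ground_fare',
])

# Same two steps as the notebook; regex=False keeps the patterns literal.
cleaned = (
    raw.str.replace('#', '_', regex=False)
       .str.replace('atis_', '', regex=False)
)

print(cleaned.tolist())
# ['flight', 'flight_airfare', 'ground_service_ground_fare']
```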

**we will keep only the most frequent intents in the data**

In [18]:
labels = ['flight', 'airfare', 'ground_service', 'abbreviation', 'flight_time']

In [19]:
df = df[df['intent'].isin(labels)]
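
The `labels` list above is hard-coded. Assuming it was chosen by frequency, a list like it could also be derived from the data with `value_counts()`. The helper `top_intents` and the toy frame below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def top_intents(frame, k=5):
    """Return the k most frequent values of the 'intent' column."""
    return frame['intent'].value_counts().head(k).index.tolist()

# Toy data (not the ATIS file): 4 x flight, 3 x airfare, 1 x meal.
toy = pd.DataFrame({
    'intent': ['flight'] * 4 + ['airfare'] * 3 + ['meal'],
    'text': ['placeholder utterance'] * 8,
})

labels_toy = top_intents(toy, k=2)
filtered = toy[toy['intent'].isin(labels_toy)]

print(labels_toy)      # ['flight', 'airfare']
print(len(filtered))   # 7
```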

**take a sample of 40 rows per intent, because training takes much longer on a large dataset**

In [20]:
sample_data = df.groupby('intent').apply(lambda x: x.sample(40)).reset_index(drop = True)
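
Note that `x.sample(40)` raises a `ValueError` when an intent has fewer than 40 rows. A minimal sketch of a more defensive variant, which caps the sample at the group size and fixes the seed per call; `sample_per_intent` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def sample_per_intent(frame, n=40, seed=0):
    """Sample up to n rows from each intent, reproducibly."""
    parts = [
        group.sample(min(len(group), n), random_state=seed)
        for _, group in frame.groupby('intent')
    ]
    return pd.concat(parts).reset_index(drop=True)

# Toy data: one large class and one class smaller than n.
toy = pd.DataFrame({
    'intent': ['flight'] * 50 + ['meal'] * 3,
    'text': [f'utterance {i}' for i in range(53)],
})

sampled = sample_per_intent(toy, n=40)
counts = sampled['intent'].value_counts().to_dict()
print(counts)  # {'flight': 40, 'meal': 3}
```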

In [21]:
sample_data.to_csv('sample_data.csv', index = False)

In [22]:
sample_data = sample_data[['text', 'intent']]

In [23]:
sample_data['intent'] = sample_data['intent'].str.strip()
sample_data['text'] = sample_data['text'].str.strip()

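**here we will prepare our data as the documentation describes: each prompt ends with a fixed separator, and each completion starts with a space and ends with a stop sequence<br />**
**take a look [here](https://platform.openai.com/docs/guides/fine-tuning) to get a good overview of this**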

In [24]:
sample_data['text'] = sample_data['text'] + '\n\nIntent:\n\n'
sample_data['intent'] = " "+sample_data['intent'] + " END"

In [25]:
sample_data.columns = ['prompt', 'completion']
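
The same formatting for a single row, written out as a sketch. The separator and stop sequence are the ones used in the cells above; the names `SEPARATOR`, `STOP`, and `to_record` are introduced here only for illustration.

```python
import json

SEPARATOR = '\n\nIntent:\n\n'  # appended to every prompt
STOP = ' END'                  # appended to every completion

def to_record(text, intent):
    """Build one prompt/completion pair in the layout used above."""
    return {
        'prompt': text.strip() + SEPARATOR,
        # The completion starts with a space, as in the cell above.
        'completion': ' ' + intent.strip() + STOP,
    }

record = to_record('what does ewr mean', 'abbreviation')
line = json.dumps(record)  # one line of the eventual JSONL file
print(line)
```

The important part is that the exact same separator has to be appended to prompts at inference time, and the same stop sequence passed as `stop`.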

In [26]:
sample_data

Unnamed: 0,prompt,completion
0,what does mco stand for\n\nIntent:\n\n,abbreviation END
1,what does the abbreviation co mean\n\nIntent:\n\n,abbreviation END
2,what is hp\n\nIntent:\n\n,abbreviation END
3,is fare code b the same as business class\n\nI...,abbreviation END
4,what does ewr mean\n\nIntent:\n\n,abbreviation END
...,...,...
195,are there any limousines or taxi services avai...,ground_service END
196,i'm looking for ground transportation in dalla...,ground_service END
197,ground transportation from airport in boston t...,ground_service END
198,ground transportation philadelphia\n\nIntent:\n\n,ground_service END


save our data to a .jsonl file, because the OpenAI fine-tuning tools read data in JSONL format (one JSON object per line)

In [None]:
sample_data.to_json('intent_sample.jsonl', orient= 'records', lines=True)
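
A quick way to check the written file is to read it back. This sketch uses a one-row toy frame and a temporary path (both assumptions for the example) instead of `intent_sample.jsonl`.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({
    'prompt': ['what is hp\n\nIntent:\n\n'],
    'completion': [' abbreviation END'],
})

# Write to a throwaway location so the real file is not touched.
path = os.path.join(tempfile.mkdtemp(), 'toy.jsonl')
frame.to_json(path, orient='records', lines=True)

# Each record occupies exactly one line of the file.
with open(path) as fh:
    raw_lines = fh.read().splitlines()

# lines=True tells pandas to parse one JSON object per line.
restored = pd.read_json(path, lines=True)
print(len(raw_lines), restored['completion'].tolist())
```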

In [28]:
!openai tools fine_tunes.prepare_data -f intent_sample.jsonl

Analyzing...

- Your file contains 200 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 7 duplicated prompt-completion sets. These are rows: [7, 11, 12, 26, 29, 32, 140]
- All prompts end with suffix `\n\nIntent:\n\n`. This suffix seems very long. Consider replacing with a shorter suffix, such as `\n\n###\n\n`

Based on the analysis we will perform the following actions:
- [Recommended] Remove 7 duplicate rows [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `intent_sample_prepared_train.jsonl` and `intent_sample_prepared_valid.jsonl`
Feel free to take a lo
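
The tool removed duplicate rows and wrote separate train and validation files. A rough pandas equivalent is sketched below; it is not the tool's actual implementation, and `dedupe_and_split` and the 20% validation fraction are assumptions.

```python
import pandas as pd

def dedupe_and_split(frame, valid_fraction=0.2, seed=0):
    """Drop duplicate pairs, then hold out a fraction for validation."""
    unique = frame.drop_duplicates(subset=['prompt', 'completion'])
    valid = unique.sample(frac=valid_fraction, random_state=seed)
    train = unique.drop(valid.index)  # everything not held out
    return train.reset_index(drop=True), valid.reset_index(drop=True)

# Toy data: 10 distinct prompts plus one exact duplicate of the first.
toy = pd.DataFrame({
    'prompt': [f'q{i}\n\nIntent:\n\n' for i in range(10)] + ['q0\n\nIntent:\n\n'],
    'completion': [' flight END'] * 11,
})

train, valid = dedupe_and_split(toy)
print(len(train), len(valid))  # 8 2
```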

get this `API_KEY` from your account on the `openai api` platform; it is needed to train your model

In [29]:
API_KEY = 'YOUR_OPENAI_API_KEY'  # paste your own key here; never publish a real key

In [30]:
import os
os.environ['OPENAI_API_KEY'] = API_KEY
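
Keeping the key out of the notebook avoids leaking it when the notebook is shared. One possible approach, sketched here with a hypothetical `load_api_key` helper, is to read it from the environment and fall back to a hidden prompt.

```python
import os
from getpass import getpass

def load_api_key(env_var='OPENAI_API_KEY'):
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value, so it never lands in cell output.
        key = getpass(f'Enter {env_var}: ')
        os.environ[env_var] = key
    return key

# Usage (commented out so the sketch does not prompt on import):
# API_KEY = load_api_key()
```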

train our model on the `train and validation data` with `davinci` as the base model

In [31]:
!openai api fine_tunes.create -t "intent_sample_prepared_train.jsonl" -v "intent_sample_prepared_valid.jsonl" -m 'davinci'

Found potentially duplicated files with name 'intent_sample_prepared_train.jsonl', purpose 'fine-tune' and size 18369 bytes
file-wb1wu18MJq4lKFgEJ4g0eJBi
Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: "intent_sample_prepared_train.jsonl"
File id '"intent_sample_prepared_train.jsonl"' is not among the IDs of the potentially duplicated files

Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: 
Upload progress: 100% 18.4k/18.4k [00:00<00:00, 13.0Mit/s]
Uploaded file from intent_sample_prepared_train.jsonl: file-8FQ6jyM4f6CrqbZch0ZsdWZb
Found potentially duplicated files with name 'intent_sample_prepared_valid.jsonl', purpose 'fine-tune' and size 4774 bytes
file-r6mTyM8ehp3cgDHb5usgAlNV
Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: 
Upload progress: 100% 4.77k/4.77k [00:00<00:00, 6.68Mit/s]
Uploaded file from intent_sample_prepared_valid.jsonl: fil

In [56]:
!openai api fine_tunes.follow -i ft-rLn7Qys7ba73EszDMZxLZ4bs

[2023-04-29 10:34:26] Created fine-tune: ft-rLn7Qys7ba73EszDMZxLZ4bs
[2023-04-29 10:51:44] Fine-tune costs $0.41
[2023-04-29 10:51:44] Fine-tune enqueued. Queue number: 0
[2023-04-29 10:53:31] Fine-tune started
[2023-04-29 10:55:46] Completed epoch 1/4
[2023-04-29 10:56:33] Completed epoch 2/4
[2023-04-29 10:57:20] Completed epoch 3/4
[2023-04-29 10:58:07] Completed epoch 4/4
[2023-04-29 10:58:49] Uploaded model: davinci:ft-personal-2023-04-29-10-58-48
[2023-04-29 10:58:50] Uploaded result file: file-hhjLKO8VgFQZq4YVStXkgBNA
[2023-04-29 10:58:50] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m davinci:ft-personal-2023-04-29-10-58-48 -p <YOUR_PROMPT>


`prompt` is the input to the model, and `completion` is the output the model returns<br />
below are 3 example prompts

In [66]:
# each prompt must end with the same separator used in training: '\n\nIntent:\n\n'
prompt1 = "Did we have  flight on monday\n\nIntent:\n\n"
prompt2 = 'what is ap57 restriction\n\nIntent:\n\n'
prompt3 = 'show me grount transportation in baltmore\n\nIntent:\n\n'

In [67]:
# openai.Completion is the legacy interface of the openai package (versions before 1.0)
openai.api_key = API_KEY
response = openai.Completion.create(
    model = 'davinci:ft-personal-2023-04-29-10-58-48',
    prompt = prompt3,
    max_tokens = 5,
    temperature = 0,
    top_p = 1,
    frequency_penalty = 0,
    presence_penalty = 0,
    stop = [" END"]
)

In [68]:
print(response['choices'][0]['text'])

 ground_service
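
The returned text starts with a space, because the completions were trained that way. A minimal sketch of cleaning the output and scoring predictions against known labels; `parse_intent` and `accuracy` are hypothetical helpers, and the example values are made up rather than taken from the model.

```python
def parse_intent(completion_text, stop=' END'):
    """Cut at the stop sequence (if present) and trim whitespace."""
    return completion_text.split(stop)[0].strip()

def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if not expected:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)

# Made-up raw completions, with and without the stop sequence.
preds = [parse_intent(' ground_service'), parse_intent(' flight END')]
score = accuracy(preds, ['ground_service', 'airfare'])

print(preds)  # ['ground_service', 'flight']
print(score)  # 0.5
```

Running the same scoring over the prompts in `intent_sample_prepared_valid.jsonl` would give a held-out estimate of how well the fine-tuned model classifies.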
