# Fine-tuning

## Loading the Data
In this step, we load the data from the CSV file into a list of dictionaries. csv.DictReader reads the CSV file row by row and maps each row to a dictionary keyed by the column names in the header row.

To load the data, we define a function load_data that takes the CSV file name as input and returns a list of dictionaries, one per row.

In [1]:
!pip install pandas
!pip install pyarrow

Collecting pandas
  Downloading pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting numpy>=1.23.2 (from pandas)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Downloading tzdata

In [2]:
filename="job_skills.csv"

In [3]:
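import csv

def load_data(filename):
    """Read a CSV file and return its rows as a list of dictionaries."""
    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(filename, 'r', encoding='UTF-8', newline='') as file:
        reader = csv.DictReader(file)
        data = list(reader)
    return data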
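
The notebook defines `load_data` but never calls it, so the following is a minimal usage sketch. The two-row sample and the temporary file are hypothetical stand-ins for job_skills.csv, chosen only so the example runs on its own; the column names are a subset of those in the real file.

```python
import csv
import os
import tempfile


def load_data(filename):
    # Same approach as the cell above: one dictionary per CSV row.
    with open(filename, 'r', encoding='UTF-8', newline='') as file:
        return list(csv.DictReader(file))


# Hypothetical sample data standing in for job_skills.csv.
sample = (
    'Title,Category,Location\n'
    'Program Manager,Program Management,Singapore\n'
    'Data Analyst,Technical Solutions,"New York, NY, United States"\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.csv')
    with open(sample_path, 'w', encoding='UTF-8', newline='') as f:
        f.write(sample)
    rows = load_data(sample_path)

# Each row is a dictionary keyed by the header names.
print(len(rows))
print(rows[1]['Location'])
```

Because the quoted location contains commas, this also shows that the csv module keeps a quoted field together as a single value.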


In [4]:
import pandas as pd
df = pd.read_csv('job_skills.csv')
df.head(5)

| Unnamed: 0 | Company | Title | Category | Location | Responsibilities | Minimum Qualifications | Preferred Qualifications |
|---|---|---|---|---|---|---|---|
| 0 | Google | Google Cloud Program Manager | Program Management | Singapore | Shape, shepherd, ship, and show technical prog... | BA/BS degree or equivalent practical experienc... | Experience in the business technology market a... |
| 1 | Google | Supplier Development Engineer (SDE), Cable/Con... | Manufacturing & Supply Chain | Shanghai, China | Drive cross-functional activities in the suppl... | BS degree in an Engineering discipline or equi... | BSEE, BSME or BSIE degree.\nExperience of usin... |
| 2 | Google | Data Analyst, Product and Tools Operations, Go... | Technical Solutions | New York, NY, United States | Collect and analyze data to draw insight and i... | Bachelor’s degree in Business, Economics, Stat... | Experience partnering or consulting cross-func... |
| 3 | Google | Developer Advocate, Partner Engineering | Developer Relations | Mountain View, CA, United States | Work one-on-one with the top Android, iOS, and... | BA/BS degree in Computer Science or equivalent... | Experience as a software developer, architect,... |
| 4 | Google | Program Manager, Audio Visual (AV) Deployments | Program Management | Sunnyvale, CA, United States | Plan requirements with internal customers.\nPr... | BA/BS degree or equivalent practical experienc... | CTS Certification.\nExperience in the construc... |


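## Preparing the Data for OpenAI Fine-tuning
In this step, we prepare the data for OpenAI fine-tuning. We read the rows from the CSV file and write a new JSONL file in which each line is a JSON object with a prompt and a completion. The prompt combines the location, responsibilities, and minimum qualifications of a job; the completion holds its title and category.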

In [5]:
import csv
import json

# Open the input CSV file
with open('job_skills.csv', 'r', encoding='UTF-8', newline='') as csv_file:
    csv_reader = csv.DictReader(csv_file)

    # Open the output JSONL file
    with open('output.jsonl', 'w', encoding='UTF-8') as jsonl_file:
        for row in csv_reader:
            # Extract the relevant fields from the CSV row
            location = row['Location']
            responsibilities = row['Responsibilities']
            minimum_qualifications = row['Minimum Qualifications']
            title = row['Title']
            category = row['Category']

            # Construct the JSONL object
            jsonl_obj = {
                'prompt': f'Location: {location}\nResponsibilities: {responsibilities}\n Qualifications: {minimum_qualifications}',
                'completion': f'{title}\n{category}'
            }

            # Write the JSONL object to the output file
            jsonl_file.write(json.dumps(jsonl_obj) + '\n')


In [7]:
with open('output.jsonl', 'r') as f:
    for i in range(5):
        line = f.readline()
        print(line)

{"prompt": "Location: Singapore\nResponsibilities: Shape, shepherd, ship, and show technical programs designed to support the work of Cloud Customer Engineers and Solutions Architects.\nMeasure and report on key metrics tied to those programs to identify any need to change course, cancel, or scale the programs from a regional to global platform.\nCommunicate status and identify any obstacles and paths for resolution to stakeholders, including those in senior roles, in a transparent, regular, professional and timely manner.\nEstablish expectations and rationale on deliverables for stakeholders and program contributors.\nProvide program performance feedback to teams in Product, Engineering, Sales, and Marketing (among others) to enable efficient cross-team operations.\n Qualifications: BA/BS degree or equivalent practical experience.\n3 years of experience in program and/or project management in cloud computing, enterprise software and/or marketing technologies.", "completion": "Google C
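
Printing the first lines shows what the file looks like, but it does not confirm that every line is well formed. The sketch below parses each line and checks that it carries exactly the two keys written above. The helper name `validate_jsonl_lines` and the sample lines are hypothetical and not part of the original notebook.

```python
import json


def validate_jsonl_lines(lines):
    """Return (records, errors) for an iterable of JSONL lines."""
    records, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        # Each record must be an object with exactly these two keys.
        if not isinstance(record, dict) or set(record) != {"prompt", "completion"}:
            errors.append((line_number, "expected keys 'prompt' and 'completion'"))
            continue
        records.append(record)
    return records, errors


# Hypothetical sample: one valid line, one missing a key, one that is not JSON.
sample_lines = [
    '{"prompt": "Location: Singapore", "completion": "Program Manager"}\n',
    '{"prompt": "Location: Zurich"}\n',
    'not json\n',
]
records, errors = validate_jsonl_lines(sample_lines)
print(len(records), [number for number, _ in errors])
```

To check the real file, pass an open file handle instead of the sample list, for example `validate_jsonl_lines(open('output.jsonl', encoding='UTF-8'))`.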

## Preparing the Data Using the OpenAI Tools Package
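In this step, we use the data preparation tool that ships with the OpenAI package to prepare the data for fine-tuning. The tool is not a Python function but a command-line command, openai tools fine_tunes.prepare_data. It analyzes the file, reports problems such as duplicated rows or a missing prompt separator, and suggests fixes.

The command takes the following argument:

-f: The name of the input file in JSONL format.

The prepared data is written to a new file, here output_prepared.jsonl. Piping yes into the command answers each of the tool's prompts with "y", so every suggested fix is accepted.

To prepare the data, we execute the following code in the notebook: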

In [8]:
!pip install openai

Collecting openai
  Downloading openai-1.25.0-py3-none-any.whl.metadata (21 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting pydantic<3,>=1.9.0 (from openai)
  Downloading pydantic-2.7.1-py3-none-any.whl.metadata (107 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Collecting annotated-types>=0.4.0 (from pydantic<3,>=1.9.0->openai)
  Downloading annotated_types-0.6.0-py3-none-any.whl.metadata (12 kB)
Collecting pydantic-core==2.18.2 (from pydantic<3,>=1.9.0->openai)
  Down

In [9]:
!yes | openai tools fine_tunes.prepare_data -f output.jsonl

Analyzing...

- Your file contains 1250 prompt-completion pairs
- There are 124 duplicated prompt-completion sets. These are rows: [304, 305, 308, 316, 318, 323, 333, 355, 367, 374, 379, 398, 412, 417, 443, 461, 465, 466, 487, 489, 508, 510, 511, 519, 567, 591, 631, 697, 708, 711, 741, 771, 831, 859, 870, 872, 874, 876, 878, 879, 880, 881, 890, 895, 901, 902, 908, 912, 921, 931, 934, 936, 938, 940, 941, 948, 950, 951, 953, 955, 957, 959, 960, 963, 967, 970, 972, 977, 982, 983, 990, 991, 995, 1002, 1006, 1012, 1017, 1023, 1027, 1028, 1035, 1041, 1042, 1059, 1060, 1078, 1080, 1084, 1089, 1096, 1106, 1110, 1111, 1112, 1114, 1116, 1124, 1125, 1131, 1135, 1136, 1142, 1143, 1147, 1153, 1154, 1165, 1166, 1168, 1173, 1184, 1187, 1191, 1194, 1198, 1204, 1205, 1213, 1217, 1219, 1221, 1227, 1238, 1240]
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion
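
To make two of the reported issues concrete, here is a rough sketch of removing duplicated prompt-completion pairs and appending a separator to each prompt. It is not the tool's actual implementation, and the tool may apply further changes that this sketch ignores. The function name `dedupe_and_add_separator` is hypothetical, and the separator value is an assumption taken from the prepared output printed in the next cell.

```python
# Assumption: the separator seen at the end of prompts in output_prepared.jsonl.
SEPARATOR = "\n\n###\n\n"


def dedupe_and_add_separator(records, separator=SEPARATOR):
    """Drop repeated prompt-completion pairs and end each prompt with a separator."""
    seen = set()
    prepared = []
    for record in records:
        key = (record["prompt"], record["completion"])
        if key in seen:
            continue  # keep only the first occurrence of each pair
        seen.add(key)
        prompt = record["prompt"]
        if not prompt.endswith(separator):
            prompt += separator
        prepared.append({"prompt": prompt, "completion": record["completion"]})
    return prepared


# Hypothetical sample with one duplicated pair.
sample_records = [
    {"prompt": "Location: Singapore", "completion": "Program Manager"},
    {"prompt": "Location: Singapore", "completion": "Program Manager"},
    {"prompt": "Location: Zurich", "completion": "Data Analyst"},
]
prepared = dedupe_and_add_separator(sample_records)
print(len(prepared))
print(repr(prepared[0]["prompt"]))
```

A fixed separator gives the fine-tuned model a consistent signal for where the prompt ends, so the same separator also has to be appended to prompts at inference time.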

In [10]:
with open('output_prepared.jsonl', 'r') as f:
    for i in range(5):
        line = f.readline()
        print(line)

{"prompt":"Location: Singapore\nResponsibilities: Shape, shepherd, ship, and show technical programs designed to support the work of Cloud Customer Engineers and Solutions Architects.\nMeasure and report on key metrics tied to those programs to identify any need to change course, cancel, or scale the programs from a regional to global platform.\nCommunicate status and identify any obstacles and paths for resolution to stakeholders, including those in senior roles, in a transparent, regular, professional and timely manner.\nEstablish expectations and rationale on deliverables for stakeholders and program contributors.\nProvide program performance feedback to teams in Product, Engineering, Sales, and Marketing (among others) to enable efficient cross-team operations.\n Qualifications: BA\/BS degree or equivalent practical experience.\n3 years of experience in program and\/or project management in cloud computing, enterprise software and\/or marketing technologies.\n\n###\n\n","completion