# GenAI model tuning prep data

This notebook converts free-form text data to JSONL format so it can be used by the OpenAI model tuning API.

* 20231206 Using try5 data and the context rule hardcoded
* 20231205 Using try4 data and the context rule hardcoded
* 20231125 Using try3 data and the context rule hardcoded
* 20231118 first set of train and test code.
* 20231116 Second set of actual asm code.
* 20231109 Uses first set of actual asm code.
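
For orientation, here is a minimal sketch of the record shape this notebook emits: one JSON object per line, holding a `messages` list with `system`, `user`, and `assistant` entries. The message contents below are hypothetical placeholders, not taken from the try5 data.

```python
import json

# Hypothetical example values; the real content comes from the input text files.
record = {
    "messages": [
        {"role": "system", "content": "When writing code, obey these rules: ..."},
        {"role": "user", "content": "Load register R1 with the address of BUFFER."},
        {"role": "assistant", "content": "         LA   R1,BUFFER\n"},
    ]
}

# json.dumps escapes embedded newlines as \n, so each record
# stays on a single physical line of the JSONL file.
line = json.dumps(record)

# Round trip: parsing the line gives back the original structure.
restored = json.loads(line)
```

Because newlines inside the content are escaped, the column layout of multi-line assembly code survives serialization unchanged.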



In [1]:
import datetime
import pathlib
import json

import pandas as pd
import os
import os.path

### Setup dirs 

In [2]:
# Get the current date
current_date = datetime.datetime.now()

# Format the date as YYYYMMDD
formatted_date = current_date.strftime('%Y%m%d')
formatted_date

'20231206'

In [3]:
# This can be varied to point to different files.
OUT_TRAIN_FILE_NAME = 'train' + formatted_date + '.jsonl'
OUT_TEST_FILE_NAME = 'test' + formatted_date + '.jsonl'
os.environ['OUT_TRAIN_FILE_NAME'] = OUT_TRAIN_FILE_NAME
os.environ['OUT_TEST_FILE_NAME'] = OUT_TEST_FILE_NAME
print("OUT_TRAIN_FILE_NAME: ", OUT_TRAIN_FILE_NAME)
print("OUT_TEST_FILE_NAME: ", OUT_TEST_FILE_NAME)

OUT_TRAIN_FILE_NAME:  train20231206.jsonl
OUT_TEST_FILE_NAME:  test20231206.jsonl


In [4]:
# The current directory is where this notebook is located,
# which is the notebooks dir of the project.
dirpath = os.getcwd()
print("current directory is : " + dirpath)

current directory is : /workspaces/BALSA/notebooks


In [5]:
# Use pathlib to find the root dir of the git repo
root_path = pathlib.PurePath(dirpath).parents[0]
data_path = root_path / 'data'
train_path = data_path / 'try5' / 'train'
test_path = data_path / 'try5' / 'test'
print("root directory is: ", root_path)
print("data directory is: ",  data_path)
print("train directory is: ",  train_path)
print("test directory is: ", test_path)

root directory is:  /workspaces/BALSA
data directory is:  /workspaces/BALSA/data
train directory is:  /workspaces/BALSA/data/try5/train
test directory is:  /workspaces/BALSA/data/try5/test
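
The cell above assumes the notebook is always run from a directory exactly one level below the repo root. As a hedged alternative, the sketch below walks up the parent directories until it finds a `.git` entry. The helper name `find_repo_root` is hypothetical and not part of the original notebook.

```python
import pathlib
from typing import Optional


def find_repo_root(start: pathlib.Path) -> Optional[pathlib.Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains `.git`, or None."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists():
            return candidate
    return None


# Prints None when the current directory is not inside a git checkout.
print(find_repo_root(pathlib.Path.cwd()))
```

If the helper returns `None`, falling back to `pathlib.PurePath(dirpath).parents[0]` as in the cell above keeps the existing behavior.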


In [6]:
# Create equivalent dir names in the environment
# Data
DATA_DIR_NAME = data_path.as_posix()
print("DATA_DIR_NAME: ", DATA_DIR_NAME)
os.environ['DATA_DIR_NAME'] = DATA_DIR_NAME

DATA_DIR_NAME:  /workspaces/BALSA/data


In [7]:
%%bash
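# Verify env variables are set
# Note: only DATA_DIR_NAME is set in this notebook; LOGS_DIR_NAME and
# CSV_FILE_NAME are not set above, so they print as empty lines.
echo ${DATA_DIR_NAME}
echo ${LOGS_DIR_NAME}
echo ${CSV_FILE_NAME}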

/workspaces/BALSA/data
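
The same check can be done from Python, which makes unset variables explicit instead of printing blank lines. This is a small sketch; the helper name `missing_env_vars` is hypothetical.

```python
import os


def missing_env_vars(names):
    """Return the names from `names` that are unset or empty in os.environ."""
    return [name for name in names if not os.environ.get(name)]


# The result depends on the environment of the running kernel.
print(missing_env_vars(["DATA_DIR_NAME", "LOGS_DIR_NAME", "CSV_FILE_NAME"]))
```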




# Routine to build our tuning jsonl file from txt files.

In [8]:
# Function to read the input text file and convert it to JSONL format
def convert_text_to_jsonl(input_file, output_file):
    lines = []
    messages = []
    # stub vars
    sample_code = []
    commentary = []
    prompt = []

    with open(input_file, 'r') as file:
        lines = file.readlines()


    # 
    # find delimiters
    #

    posn = []
    line_nbr = 0
    for a_line in lines:
        #print(a_line)
        if (a_line == "RESULT\n"):
            #print(a_line, " ", line_nbr )
            # save that position
            posn.append(line_nbr) 
        if (a_line == "PROMPT\n"):
            #print(a_line)
            posn.append(line_nbr) 
        
        line_nbr = line_nbr + 1

    # Record the last line in file
    posn.append(line_nbr) 

    #print("posn: ", posn)

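    # separate out the parts: the RESULT section comes first, then PROMPT
    # (use the recorded delimiter position rather than a hardcoded index)
    result_lines = lines[posn[0]+1:posn[1]]
    prompt_lines = lines[posn[1]+1:posn[2]]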

    # dump the parts
    #print("===result_lines:===\n", result_lines)
    #print("===prompt_lines:===\n", prompt_lines)

    hardcoded_context="""
When writing code, obey these rules:

* NAME corresponds to a LABEL and is always in column 1.
    - The NAME is at most 8 characters long.
    - The NAME begins with characters A-Z, a-z, $, # or @. 
* OPERATION corresponds to an instruction (mnemonic) and starts in column 10.
* OPERANDS corresponds to instruction arguments or parameters and starts in column 15.
    - Multiple operands are separated by a comma `,`.
    - Space ` ` characters are not permitted between OPERANDS.
* COMMENT corresponds to non functional text and has two possible starting locations.  
    - If the entire line is a comment, then the comment marker `*` starts in column 1.
    - If the comment is used at the end of a line of code, it starts at column 32.
* Column 72 is used to identify a continuation of the current line to the next.
    - Only use a continuation character when an instruction line spans more than 65 columns.
    - In this case, use a `x` in column 72 on the first line.
    - On the second continued line, code starts at column 16.

All code should be output in markdown or preformatted text blocks like so:

```
code here
```

Unless explicitly told to do so, do not include any commentary.

When specifying registers be explicit.  For example when referring to register one, use R1 rather than 1.

Do not show any subroutine standard entry and exit code.

When returning assembly code, please limit allowable instructions to those in 
the list:


* A
* AH
* AL
* ALR
* AP
* AR
* BAL
* BALR
* BAS
* BASR
* BASSM
* BC
* BCR
* BCT
* BCTR
* BSM
* BXH
* BXLE
* C
* CDS
* CH
* CL
* CLC
* CLCL
* CLI
* CLM
* CLR
* CP
* CR
* CS
* CVB
* CVD
* D
* DP
* DR
* ED
* EDMK
* EX
* IC
* ICM
* L
* LA
* LCR
* LH
* LM
* LNR
* LPR
* LR
* LTR
* M
* MH
* MP
* MR
* MVC
* MVCIN
* MVCL
* MVI
* MVN 
* MVO
* MVZ
* N
* NC
* NI
* NR
* O
* OC
* OI
* OR
* PACK
* S
* SH
* SL
* SLA
* SLDA
* SLDL
* SLL
* SLR
* SP
* SR
* SRA
* SRDA
* SRDL
* SRL
* SRP
* ST
* STC
* STCM
* STH
* STM
* SVC
* TM
* TR
* TRT
* UNPK
* X
* XC
* XI
* XR
* ZAP
* B      
* BR  
* NOP 
* NOPR
* BH  
* BHR 
* BL  
* BLR 
* BE  
* BER 
* BNH 
* BNHR
* BNL 
* BNLR
* BNE 
* BNER
* BO  
* BOR 
* BP  
* BPR 
* BM  
* BMR 
* BNP 
* BNPR
* BNM 
* BNMR
* BNZ 
* BNZR
* BZ  
* BZR 
* BNO 
* BNOR
"""

    a_dict = {}
    a_dict['messages'] = []
    
    a_dict['messages'].append({'role':'system',
                              'content': hardcoded_context})
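    # Lines from readlines() keep their trailing newline, so join with ''
    # to avoid shifting every following line one column to the right.
    a_dict['messages'].append({'role':'user',
                               'content': ''.join(prompt_lines)})
    a_dict['messages'].append({'role':'assistant',
                               'content': ''.join(result_lines)})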


    #print(a_dict)
    #print(output_file)

    # append to output file
    # modify with w to write a new one
    with open(output_file, 'a') as jsonl_file:
        jsonl_file.write(json.dumps(a_dict) + '\n')
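
The delimiter scan in `convert_text_to_jsonl` relies on the order of the markers and on every marker line ending in `\n`. A sketch of a more tolerant splitter is shown below. It assumes, as the function above does, that each input file contains a `RESULT` marker line and a `PROMPT` marker line. The helper name `split_sections` and the sample lines are hypothetical.

```python
def split_sections(lines, markers=("RESULT", "PROMPT")):
    """Group lines under the most recent marker line.

    Returns a dict mapping each marker to the list of lines that follow it,
    up to the next marker or the end of input. Lines before the first
    marker are ignored.
    """
    sections = {}
    current = None
    for line in lines:
        stripped = line.rstrip("\r\n")
        if stripped in markers:
            # Start a new section keyed by the marker name.
            current = stripped
            sections[current] = []
        elif current is not None:
            # Keep the line untouched so column positions are preserved.
            sections[current].append(line)
    return sections


# Hypothetical sample input in the assumed file layout.
sample = ["RESULT\n", "         LR   R1,R2\n", "PROMPT\n", "Copy R2 into R1.\n"]
parts = split_sections(sample)
```

Keying the sections by marker name means the result no longer depends on which marker appears first in the file, and a marker on the final line without a trailing newline is still recognized.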

In [9]:


#train_files

# build train file

In [10]:
OUT_TRAIN_FQPN = train_path /  pathlib.Path(OUT_TRAIN_FILE_NAME).as_posix()
#print(OUT_TRAIN_FQPN)

# remove any existing output
try:
    os.remove(OUT_TRAIN_FQPN)
    print("removed")
except OSError:
    print("did not remove the existing training file.")

train_files = os.listdir(train_path)
#print(train_files)

for a_file in train_files:
    #print("a file name: ", a_file)
    IN_FQPN = train_path /  pathlib.PurePath(a_file).as_posix()
    #print(IN_FQPN)
    #print(OUT_TRAIN_FQPN)
    convert_text_to_jsonl(IN_FQPN, OUT_TRAIN_FQPN)


did not remove the existing training file.
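
Because the output file is written into the same directory that is listed for input, a `.jsonl` file left over from an earlier date would be picked up as if it were a text sample. The sketch below filters the listing. The helper name `list_input_files` and the choice to skip by `.jsonl` suffix are assumptions, not part of the original notebook.

```python
import pathlib


def list_input_files(directory, skip_suffixes=(".jsonl",)):
    """Return the regular files in `directory`, sorted, skipping generated outputs."""
    directory = pathlib.Path(directory)
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix not in skip_suffixes
    )
```

A loop such as `for in_path in list_input_files(train_path): convert_text_to_jsonl(in_path, OUT_TRAIN_FQPN)` would then also give a deterministic record order, which `os.listdir` does not guarantee.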


# build test file

In [11]:
OUT_TEST_FQPN = test_path /  pathlib.Path(OUT_TEST_FILE_NAME).as_posix()
#print(OUT_TEST_FQPN)

# remove any existing output
try:
    os.remove(OUT_TEST_FQPN)
    print("removed")
except OSError:
    print("did not remove the existing test file.")

test_files = os.listdir(test_path)

for a_file in test_files:
    #print("a file name: ", a_file)
    IN_FQPN = test_path /  pathlib.PurePath(a_file).as_posix()
    #print(IN_FQPN)
    convert_text_to_jsonl(IN_FQPN, OUT_TEST_FQPN)

did not remove the existing test file.


In [12]:
os.environ['OUT_TRAIN_FQPN'] = OUT_TRAIN_FQPN.as_posix()
os.environ['OUT_TEST_FQPN'] = OUT_TEST_FQPN.as_posix()

In [13]:
%%bash
head ${OUT_TRAIN_FQPN}

{"messages": [{"role": "system", "content": "\nWhen writing code, obey these rules:\n\n* NAME corresponds to a LABEL and is always in column 1.\n    - The NAME is at most 8 characters long.\n    - The NAME begins with characters A-Z, a-z, $, # or @. \n* OPERATION corresponds to an instruction (mnemonic) and starts in column 10.\n* OPERANDS corresponds to instruction argumennts or parameters and starts in column 15.\n    - Multiple operands are separated by a comma `,`.\n    - Space ` ` characters are not permitted between OPERANDS.\n* COMMENT corresponds to non functional text and has two possible starting locations.  \n    - If the entire line is a comment, then the comment marker `*` starts in column 1.\n    - If the comment is used at the end of a line of code, it starts at column 32.\n* Column 72 is used to identify a continuation of the current line to the next.\n    - Only use a continuation character when an instruction line spans more than 65 columns.\n    - In this case, use a

In [14]:
%%bash
head ${OUT_TEST_FQPN}

{"messages": [{"role": "system", "content": "\nWhen writing code, obey these rules:\n\n* NAME corresponds to a LABEL and is always in column 1.\n    - The NAME is at most 8 characters long.\n    - The NAME begins with characters A-Z, a-z, $, # or @. \n* OPERATION corresponds to an instruction (mnemonic) and starts in column 10.\n* OPERANDS corresponds to instruction argumennts or parameters and starts in column 15.\n    - Multiple operands are separated by a comma `,`.\n    - Space ` ` characters are not permitted between OPERANDS.\n* COMMENT corresponds to non functional text and has two possible starting locations.  \n    - If the entire line is a comment, then the comment marker `*` starts in column 1.\n    - If the comment is used at the end of a line of code, it starts at column 32.\n* Column 72 is used to identify a continuation of the current line to the next.\n    - Only use a continuation character when an instruction line spans more than 65 columns.\n    - In this case, use a
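
Looking at `head` output only confirms that the first record starts correctly. A quick structural check over every line can catch malformed records before the file is used for tuning. This is a minimal sketch that checks only the three-message shape produced by this notebook; it does not validate against the OpenAI API itself. The helper name `validate_jsonl_lines` is hypothetical.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
            problems.append((number, "missing or malformed 'messages' list"))
            continue
        roles = [m.get("role") for m in messages]
        if roles != EXPECTED_ROLES:
            problems.append((number, f"unexpected roles: {roles}"))
        elif any(not str(m.get("content", "")).strip() for m in messages):
            problems.append((number, "empty content"))
    return problems
```

For example, `with open(OUT_TRAIN_FQPN) as f: print(validate_jsonl_lines(f))` would list any problem lines in the training file, including records whose prompt or result section came out empty.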