# Finetune Llama 2 with Together AI

For reference, see the Together AI fine-tuning guide (https://docs.together.ai/docs/fine-tuning-python#prepare-your-data) and this Colab notebook (https://colab.research.google.com/drive/11DwtftycpDSgp3Z1vnV-Cy68zvkGZL4K?usp=sharing).

In [1]:
!pip install datasets
!pip install --upgrade together

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl.metadata (20 kB)
Collecting filelock (from datasets)
  Downloading filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Collecting pyarrow>=8.0.0 (from datasets)
  Downloading pyarrow-14.0.2-cp312-cp312-macosx_10_14_x86_64.whl.metadata (3.0 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl.metadata (9.9 kB)
Collecting tqdm>=4.62.1 (from datasets)
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp312-cp312-macosx_10_9_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2023

In [3]:
from datasets import load_dataset
import together
import pandas as pd 
import re
import os

from dotenv import load_dotenv      # Put TOGETHER_API_KEY and WANDB_API_KEY in .env file 
load_dotenv()


  from .autonotebook import tqdm as notebook_tqdm


True

In [20]:
base_model_name = "togethercomputer/llama-2-7b-chat"

## Load test set and train set, save into standard format. 

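Each line of the saved file is a JSON object with a single `text` field that holds the full Llama 2 chat prompt followed by the expected answer. See the reference links above for details.

Format: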
```
{'text': "<s>[INST] <<SYS>> You are a knowledgeable climate science assistant trained to assess the confidence level associated with various statements about climate change. <</SYS>> You will be presented with a statement about climate science, climate impacts or climate change mitigation which is retrieved or paraphrased from the IPCC AR6 WGI, WGII or WGIII assessment reports. Climate scientists have evaluated that statement as low confidence, medium confidence, high confidence, or very high confidence, based on evidence (type, amount, quantity, consistency) and agreement among their peers. What is their confidence level? Respond *only* with one of the following words: 'low', 'medium', 'high', 'very high'. If you don't know, you can respond 'I don't know'. Statement: Since 2011 (measurements reported in AR5), concentrations have continued to increase in the atmosphere, reaching annual averages of 410 parts per million (ppm) for carbon dioxide (CO 2), 1866 parts per billion (ppb) for methane (CH 4), and 332 ppb for nitrous oxide (N 2O) in 2019.6 Land and ocean have taken up a near-constant proportion (globally about 56% per year) of CO 2 emissions from human activities over the past six decades, with regional differences Confidence: [/INST] high </s>"}
```

In [4]:
df = pd.read_csv('./data/ipcc_statements_dataset.tsv', sep='\t', skiprows=0)

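# .copy() makes independent DataFrames, so adding columns later does not raise SettingWithCopyWarning
train_set = df.loc[df['split'] == 'train'].copy()
test_set = df.loc[df['split'] == 'test'].copy()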

test_set.head()

Unnamed: 0,statement_idx,report,page_num,sent_num,statement,confidence,score,split
3,3,AR6_WGI,24,2,"Since 1750, increases in CO2 (47%) and CH4 (15...",very high,3,test
42,42,AR6_WGI,37,16,"Over the next 2000 years, global mean sea leve...",low,0,test
77,77,AR6_WGI,47,7,"By the end of the century, scenarios with very...",high,2,test
81,81,AR6_WGI,62,2,"Over the past millennium, and especially since...",medium,1,test
86,86,AR6_WGI,63,8,The paleo context supports the assessment that...,high,2,test


In [6]:
# print a sample
print(train_set.iloc[0]['statement'])
print(train_set.iloc[0]['confidence'])


Since 2011 (measurements reported in AR5), concentrations have continued to increase in the atmosphere, reaching annual averages of 410 parts per million (ppm) for carbon dioxide (CO 2), 1866 parts per billion (ppb) for methane (CH 4), and 332 ppb for nitrous oxide (N 2O) in 2019.6 Land and ocean have taken up a near-constant proportion (globally about 56% per year) of CO 2 emissions from human activities over the past six decades, with regional differences
high


In [10]:
train_set_lines = []

PREFIX = "<s>[INST] <<SYS>> You are a knowledgeable climate science assistant trained to assess the confidence level associated with various statements about climate change. <</SYS>> You will be presented with a statement about climate science, climate impacts or climate change mitigation which is retrieved or paraphrased from the IPCC AR6 WGI, WGII or WGIII assessment reports. Climate scientists have evaluated that statement as low confidence, medium confidence, high confidence, or very high confidence, based on evidence (type, amount, quantity, consistency) and agreement among their peers. What is their confidence level? Respond *only* with one of the following words: 'low', 'medium', 'high', 'very high'. If you don't know, you can respond 'I don't know'. Statement: "
SUFFIX = " Confidence: [/INST] "
END = " </s>"

train_set['formatted_jsonl'] = train_set.apply(lambda x: {'text': PREFIX + x['statement'] + SUFFIX + x['confidence'] + END}, axis=1)




In [14]:
print(train_set['formatted_jsonl'].sample(1).values)    

[{'text': "<s>[INST] <<SYS>> You are a knowledgeable climate science assistant trained to assess the confidence level associated with various statements about climate change. <<SYS>> You will be presented with a statement about climate science, climate impacts or climate change mitigation which is retrieved or paraphrased from the IPCC AR6 WGI, WGII or WGIII assessment reports. Climate scientists have evaluated that statement as low confidence, medium confidence, high confidence, or very high confidence, based on evidence (type, amount, quantity, consistency) and agreement among their peers. What is their confidence level? Respond *only* with one of the following words: 'low', 'medium', 'high', 'very high'. If you don't know, you can respond 'I don't know'. Statement: Ice cores show increases in aerosols across the Northern Hemisphere mid-latitudes since 1700 and reductions since the late 20th century Confidence: [/INST] high </s>"}]


In [15]:
train_set['formatted_jsonl'].to_json('./data/climatex_train_set.jsonl', orient='records', lines=True)

In [16]:
# Check that the data file is valid for the base model's prompt format before uploading
resp = together.Files.check(file='./data/climatex_train_set.jsonl')
print(resp)

{'is_check_passed': True, 'model_special_tokens': 'we are not yet checking end of sentence tokens for this model', 'file_present': 'File found', 'file_size': 'File size 0.007 GB', 'num_samples': 7794}
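The `model_special_tokens` field above says that end-of-sentence tokens are not yet checked for this model, so it may be worth checking the special tokens locally before uploading. The following is a minimal sketch, assuming every line is a JSON object with a `text` field in the format shown earlier. `find_format_problems` is a hypothetical helper written for this notebook, not part of the `together` library.

```python
import json

def find_format_problems(lines):
    """Return (line_number, problem) pairs for records that break the Llama 2 chat layout."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            text = json.loads(line)["text"]
        except (json.JSONDecodeError, KeyError, TypeError):
            problems.append((number, "not a JSON object with a 'text' field"))
            continue
        if not isinstance(text, str):
            problems.append((number, "'text' is not a string"))
            continue
        if not text.startswith("<s>[INST]"):
            problems.append((number, "missing '<s>[INST]' at the start"))
        # The system prompt needs one opening tag and one closing tag.
        if text.count("<<SYS>>") != 1 or text.count("<</SYS>>") != 1:
            problems.append((number, "expected one '<<SYS>>' and one '<</SYS>>'"))
        if text.count("[/INST]") != 1:
            problems.append((number, "expected exactly one '[/INST]'"))
        if not text.rstrip().endswith("</s>"):
            problems.append((number, "missing '</s>' at the end"))
    return problems
```

To run it on the saved file, pass the open file object, for example `find_format_problems(open('./data/climatex_train_set.jsonl'))`. An empty list means no problems were found by these particular checks.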


In [18]:
# Try it before finetuning: 
test_prompt = train_set['formatted_jsonl'].sample(1).values[0]['text'].split('[/INST]')[0]
test_prompt += '[/INST]'
print(test_prompt)

<s>[INST] <<SYS>> You are a knowledgeable climate science assistant trained to assess the confidence level associated with various statements about climate change. <<SYS>> You will be presented with a statement about climate science, climate impacts or climate change mitigation which is retrieved or paraphrased from the IPCC AR6 WGI, WGII or WGIII assessment reports. Climate scientists have evaluated that statement as low confidence, medium confidence, high confidence, or very high confidence, based on evidence (type, amount, quantity, consistency) and agreement among their peers. What is their confidence level? Respond *only* with one of the following words: 'low', 'medium', 'high', 'very high'. If you don't know, you can respond 'I don't know'. Statement: Learning and experimentation across governance boundaries and between agencies and local communities enable adaptation to be better aligned with changing climate risks and community Confidence: [/INST]


In [22]:
output = together.Complete.create(
  prompt = test_prompt,
  model = base_model_name,
  max_tokens = 256,
  temperature = 0.0,
#   top_k = 90,
#   top_p = 0.8,
#   repetition_penalty = 1.1,
  stop = ['</s>', '[/INST]']
)

# print generated text
print(output['prompt'][0]+" -> "+output['output']['choices'][0]['text'])

<s>[INST] <<SYS>> You are a knowledgeable climate science assistant trained to assess the confidence level associated with various statements about climate change. <<SYS>> You will be presented with a statement about climate science, climate impacts or climate change mitigation which is retrieved or paraphrased from the IPCC AR6 WGI, WGII or WGIII assessment reports. Climate scientists have evaluated that statement as low confidence, medium confidence, high confidence, or very high confidence, based on evidence (type, amount, quantity, consistency) and agreement among their peers. What is their confidence level? Respond *only* with one of the following words: 'low', 'medium', 'high', 'very high'. If you don't know, you can respond 'I don't know'. Statement: Learning and experimentation across governance boundaries and between agencies and local communities enable adaptation to be better aligned with changing climate risks and community Confidence: [/INST] ->  Based on the statement you

In [62]:
file_resp = together.Files.upload(file='./data/climatex_train_set.jsonl')
file_id = file_resp['id']
print(file_resp)

Uploading ./data/climatex_train_set.jsonl: 100%|██████████| 7.54M/7.54M [00:00<00:00, 8.24MB/s]

{'filename': 'climatex_train_set.jsonl', 'id': 'file-a3655a4e-8662-4c04-bcb2-8c55c8af3761', 'object': 'file', 'report_dict': {'is_check_passed': True, 'model_special_tokens': 'we are not yet checking end of sentence tokens for this model', 'file_present': 'File found', 'file_size': 'File size 0.007 GB', 'num_samples': 7794}}





### Make a short train set for troubleshooting training first

In [51]:
short_train_set_100 = train_set.iloc[:100]
display(short_train_set_100)
short_train_set_100['formatted_jsonl'].to_json('./data/climatex_train_set_short_100.jsonl', orient='records', lines=True)


Unnamed: 0,statement_idx,report,page_num,sent_num,statement,confidence,score,split,formatted_jsonl
0,0,AR6_WGI,20,22,"Since 2011 (measurements reported in AR5), con...",high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...
1,1,AR6_WGI,21,8,Mid-latitude storm tracks have likely shifted ...,medium,1,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...
2,2,AR6_WGI,21,18,The average rate of sea level rise was 1.3 [0....,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...
4,4,AR6_WGI,24,4,Temperatures during the most recent decade (20...,medium,1,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...
5,5,AR6_WGI,24,5,"Prior to that, the next most recent warm perio...",medium,1,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...
...,...,...,...,...,...,...,...,...,...
101,101,AR6_WGI,66,17,Hot extremes also continued to increase during...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...
102,102,AR6_WGI,66,18,"Even in a continually warming climate, periods...",very high,3,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...
103,103,AR6_WGI,67,1,Simulations and understanding of modes of clim...,medium,1,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...
104,104,AR6_WGI,67,3,While anthropogenic forcing has contributed to...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...


In [61]:
resp = together.Files.check(file='./data/climatex_train_set_short_100.jsonl')
print(resp)
file_resp_short = together.Files.upload(file='./data/climatex_train_set_short_100.jsonl')
file_id_short = file_resp_short['id']
print(file_resp_short)

{'is_check_passed': True, 'model_special_tokens': 'we are not yet checking end of sentence tokens for this model', 'file_present': 'File found', 'file_size': 'File size 0.0 GB', 'num_samples': 100}


Uploading ./data/climatex_train_set_short_100.jsonl: 100%|██████████| 98.0k/98.0k [00:00<00:00, 181kB/s]

{'filename': 'climatex_train_set_short_100.jsonl', 'id': 'file-6974fee7-a90b-45e1-8e36-dc2da41e673f', 'object': 'file', 'report_dict': {'is_check_passed': True, 'model_special_tokens': 'we are not yet checking end of sentence tokens for this model', 'file_present': 'File found', 'file_size': 'File size 0.0 GB', 'num_samples': 100}}





## Run short training job as a test

In [56]:
# Submit your finetune job
ft_resp_short = together.Finetune.create(
  training_file = file_id_short,
  model = base_model_name,
  n_epochs = 1,
  batch_size = 4,
  n_checkpoints = 1,
  learning_rate = 5e-5,
  wandb_api_key = os.environ['WANDB_API_KEY'],
  # estimate_price = True,
  suffix = 'climatex-train-set',
)

fine_tune_id = ft_resp_short['id']
print(ft_resp_short)

{'training_file': 'file-e438722e-14c0-4ef6-b334-306f0de337aa', 'validation_file': '', 'model_output_name': 'kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-2024-01-19-02-26-45', 'model_output_path': 's3://together-dev/finetune/6584967ebd65c90fd83c83d8/kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-2024-01-19-02-26-45/ft-aca7ab06-cfb7-4b74-8f4c-c30455bb566f', 'Suffix': 'climatex-train-set', 'model': 'togethercomputer/llama-2-7b-chat', 'n_epochs': 1, 'n_checkpoints': 1, 'batch_size': 4, 'learning_rate': 5e-05, 'user_id': '6584967ebd65c90fd83c83d8', 'lora': False, 'lora_r': 8, 'lora_alpha': 8, 'lora_dropout': 0, 'staring_epoch': 0, 'training_offset': 0, 'checkspoint_path': '', 'random_seed': '', 'created_at': '2024-01-19T02:26:45.056Z', 'updated_at': '2024-01-19T02:26:45.056Z', 'status': 'pending', 'owner_address': '0x9c9dc76a62aed719d6c4d4f3283f47150bcefdcf', 'id': 'ft-aca7ab06-cfb7-4b74-8f4c-c30455bb566f', 'job_id': '', 'token_count': 0, 'param_count': 0, 'total_price

In [59]:
# run this from time to time to check on the status of your job
print(together.Finetune.retrieve(fine_tune_id=fine_tune_id)) # retrieves information on finetune event
print("-"*50)
print(together.Finetune.get_job_status(fine_tune_id=fine_tune_id)) # pending, running, completed
print(together.Finetune.is_final_model_available(fine_tune_id=fine_tune_id)) # True, False
print(together.Finetune.get_checkpoints(fine_tune_id=fine_tune_id)) # list of checkpoints

{'training_file': 'file-e438722e-14c0-4ef6-b334-306f0de337aa', 'validation_file': '', 'model_output_name': 'kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-2024-01-19-02-26-45', 'model_output_path': 's3://together-dev/finetune/6584967ebd65c90fd83c83d8/kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-2024-01-19-02-26-45/ft-aca7ab06-cfb7-4b74-8f4c-c30455bb566f', 'Suffix': 'climatex-train-set', 'model': 'togethercomputer/llama-2-7b-chat', 'n_epochs': 1, 'n_checkpoints': 1, 'batch_size': 4, 'learning_rate': 5e-05, 'user_id': '6584967ebd65c90fd83c83d8', 'lora': False, 'lora_r': 8, 'lora_alpha': 8, 'lora_dropout': 0, 'staring_epoch': 0, 'training_offset': 0, 'checkspoint_path': '', 'random_seed': '', 'created_at': '2024-01-19T02:26:45.056Z', 'updated_at': '2024-01-19T02:26:45.056Z', 'status': 'pending', 'owner_address': '0x9c9dc76a62aed719d6c4d4f3283f47150bcefdcf', 'id': 'ft-aca7ab06-cfb7-4b74-8f4c-c30455bb566f', 'job_id': '', 'token_count': 0, 'param_count': 0, 'total_price
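Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. This is a rough sketch: `wait_for_status` is a hypothetical helper, the `failed` status names are assumptions (the notebook only mentions pending, running, and completed), and the polling interval is arbitrary.

```python
import time

def wait_for_status(get_status, done=("completed",), failed=("error", "cancelled"),
                    poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or max_polls is reached."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        # Stop as soon as the job reaches any terminal state.
        if status in done or status in failed:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job still '{status}' after {max_polls} polls")
```

With the calls used in this notebook, it could be invoked as `wait_for_status(lambda: together.Finetune.get_job_status(fine_tune_id=fine_tune_id))`. Passing the status function in as an argument keeps the loop independent of the client library and easy to test.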

## Run full training job 


In [63]:
ft_resp = together.Finetune.create(
  training_file = file_id,
  model = base_model_name,
  n_epochs = 5,
  batch_size = 4,
  n_checkpoints = 5,
  learning_rate = 5e-5,
  wandb_api_key = os.environ['WANDB_API_KEY'],
  # estimate_price = True,
  suffix = 'climatex-train-set-full',
)

fine_tune_id = ft_resp['id']
print(ft_resp)

{'training_file': 'file-a3655a4e-8662-4c04-bcb2-8c55c8af3761', 'validation_file': '', 'model_output_name': 'kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-full-2024-01-19-02-59-23', 'model_output_path': 's3://together-dev/finetune/6584967ebd65c90fd83c83d8/kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-full-2024-01-19-02-59-23/ft-93aa7a9c-3819-479d-ab5b-f8aea70ad65e', 'Suffix': 'climatex-train-set-full', 'model': 'togethercomputer/llama-2-7b-chat', 'n_epochs': 5, 'n_checkpoints': 5, 'batch_size': 4, 'learning_rate': 5e-05, 'user_id': '6584967ebd65c90fd83c83d8', 'lora': False, 'lora_r': 8, 'lora_alpha': 8, 'lora_dropout': 0, 'staring_epoch': 0, 'training_offset': 0, 'checkspoint_path': '', 'random_seed': '', 'created_at': '2024-01-19T02:59:23.03Z', 'updated_at': '2024-01-19T02:59:23.03Z', 'status': 'pending', 'owner_address': '0x9c9dc76a62aed719d6c4d4f3283f47150bcefdcf', 'id': 'ft-93aa7a9c-3819-479d-ab5b-f8aea70ad65e', 'job_id': '', 'token_count': 0, 'param_count': 0,

In [65]:
# run this from time to time to check on the status of your job
print(together.Finetune.retrieve(fine_tune_id=fine_tune_id)) # retrieves information on finetune event
print("-"*50)
print(together.Finetune.get_job_status(fine_tune_id=fine_tune_id)) # pending, running, completed
print(together.Finetune.is_final_model_available(fine_tune_id=fine_tune_id)) # True, False
print(together.Finetune.get_checkpoints(fine_tune_id=fine_tune_id)) # list of checkpoints

{'training_file': 'file-a3655a4e-8662-4c04-bcb2-8c55c8af3761', 'validation_file': '', 'model_output_name': 'kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-full-2024-01-19-02-59-23', 'model_output_path': 's3://together-dev/finetune/6584967ebd65c90fd83c83d8/kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-full-2024-01-19-02-59-23/ft-93aa7a9c-3819-479d-ab5b-f8aea70ad65e-2024-01-18-19-41-42', 'Suffix': 'climatex-train-set-full', 'model': 'togethercomputer/llama-2-7b-chat', 'n_epochs': 5, 'n_checkpoints': 5, 'batch_size': 4, 'learning_rate': 5e-05, 'user_id': '6584967ebd65c90fd83c83d8', 'lora': False, 'lora_r': 8, 'lora_alpha': 8, 'lora_dropout': 0, 'staring_epoch': 0, 'training_offset': 0, 'checkspoint_path': '', 'random_seed': '', 'created_at': '2024-01-19T02:59:23.03Z', 'updated_at': '2024-01-19T03:46:02.874Z', 'status': 'completed', 'owner_address': '0x9c9dc76a62aed719d6c4d4f3283f47150bcefdcf', 'id': 'ft-93aa7a9c-3819-479d-ab5b-f8aea70ad65e', 'job_id': '11022', 'token_

## Evaluate finetuned model on the training set and the test set


In [66]:
# replace this name with the name of your newly finetuned model
new_model_name = 'kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-full-2024-01-19-02-59-23'

model_list = together.Models.list()

print(f"{len(model_list)} models available")

available_model_names = [model_dict['name'] for model_dict in model_list]

new_model_name in available_model_names

127 models available


True

In [68]:
# Get the prompt for each row in train and test sets
train_set['eval_prompt'] = train_set.apply(lambda x: PREFIX + x['statement'] + SUFFIX, axis=1)
test_set['eval_prompt'] = test_set.apply(lambda x: PREFIX + x['statement'] + SUFFIX, axis=1)
print(train_set['eval_prompt'].sample(1).values[0])
print(test_set['eval_prompt'].sample(1).values[0])

<s>[INST] <<SYS>> You are a knowledgeable climate science assistant trained to assess the confidence level associated with various statements about climate change. <<SYS>> You will be presented with a statement about climate science, climate impacts or climate change mitigation which is retrieved or paraphrased from the IPCC AR6 WGI, WGII or WGIII assessment reports. Climate scientists have evaluated that statement as low confidence, medium confidence, high confidence, or very high confidence, based on evidence (type, amount, quantity, consistency) and agreement among their peers. What is their confidence level? Respond *only* with one of the following words: 'low', 'medium', 'high', 'very high'. If you don't know, you can respond 'I don't know'. Statement: As well as temporal trends in climate-change risk perception, the lit- erature since AR5 continues to show much heterogeneity (both within and between nations) among householders in respect of risk percep- tion Confidence: [/INST] 




In [105]:
# Helper functions

def get_zero_shot_confidence(new_model_name, x):
    """Query the model with the row's eval_prompt and map the reply to a confidence label."""
    output = together.Complete.create(
        prompt = x['eval_prompt'],
        model = new_model_name,
        max_tokens = 500,
        temperature = 0.0,
        #   top_k = 90,
        #   top_p = 0.8,
        #   repetition_penalty = 1.1,
        stop = ['</s>']
    )
    text = output['output']['choices'][0]['text']

    # Extract a single confidence rating. These are case-insensitive substring
    # matches checked in a fixed order: "very high" must come before "high",
    # and a reply that merely contains e.g. "below" is counted as "low".
    if re.search(r"low", text, re.IGNORECASE):
        return "low"
    elif re.search(r"medium", text, re.IGNORECASE):
        return "medium"
    elif re.search(r"very high", text, re.IGNORECASE):
        return "very high"
    elif re.search(r"high", text, re.IGNORECASE):
        return "high"
    elif re.search(r"i don't know", text, re.IGNORECASE):
        return "idk"
    else:
        return "N/A"
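The label extraction above uses plain substring matches, so a reply containing a word such as "following" would be read as "low". A stricter variant is sketched below. `parse_confidence` is a hypothetical helper, and its behavior differs from the function above: it matches whole words only and returns the first label that appears in the reply, so the accuracy figures reported in this notebook were not produced with it.

```python
import re

# "very high" is listed before "high" so the longer label wins at the same position.
_LABEL = re.compile(r"\b(very high|high|medium|low)\b", re.IGNORECASE)
_IDK = re.compile(r"\bi don'?t know\b", re.IGNORECASE)

def parse_confidence(text):
    """Map a model reply to 'low', 'medium', 'high', 'very high', 'idk' or 'N/A'."""
    match = _LABEL.search(text)
    if match:
        return match.group(1).lower()
    if _IDK.search(text):
        return "idk"
    return "N/A"

for reply in [" high", "Very High confidence", "The following is below average"]:
    print(repr(reply), "->", parse_confidence(reply))
```

Whether first-match or fixed-priority ordering is preferable depends on how the model phrases its replies, so it is worth inspecting a handful of raw outputs before choosing.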



### Start the model. Don't forget to stop after starting! 

In [71]:
# deploy your newly finetuned model
# Note: This starts payment for hosted time until the model is stopped. Don't forget to stop it!
together.Models.start(new_model_name)

{'success': True,
 'value': 'a3a5a17ce12763e5dced2849bc07fe28666fe3fe8f174b549ba9fa827f1012f5-60870b3df1a861f138d9108a661e5f5a960f87777dc0b694c3b8b5561f759b15'}
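Because a started model is billed until it is stopped, one way to avoid forgetting the stop call is to tie start and stop together in a context manager. This is a minimal sketch: `running_model` is a hypothetical helper, and it takes the start and stop functions as arguments rather than assuming anything about the client library.

```python
from contextlib import contextmanager

@contextmanager
def running_model(start, stop, model_name):
    """Start a hosted model and always stop it again, even if the body raises."""
    start(model_name)
    try:
        yield model_name
    finally:
        # Runs on normal exit and on exceptions alike.
        stop(model_name)
```

With the calls used in this notebook it would read `with running_model(together.Models.start, together.Models.stop, new_model_name): ...`, with all evaluation code inside the `with` body. This only works when the evaluation runs in a single cell or script, and the model still needs time to finish deploying before it can serve requests.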

In [81]:
# check whether your model has finished deploying; if this returns {"ready": true}, your model is ready for inference
together.Models.ready(new_model_name)

[{'modelInstanceConfig': {'appearsIn': [], 'order': 0},
  '_id': '65a9f07ba30377b2e0aaeac9',
  'name': 'kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-full-2024-01-19-02-59-23',
  'display_type': 'chat',
  'description': 'Llama 2-chat leverages publicly available instruction datasets and over 1 million human annotations. Available in three sizes: 7B, 13B and 70B parameters',
  'creator_organization': 'kerriewu@stanford.edu',
  'hardware_label': 'L40',
  'num_parameters': '6738415616',
  'release_date': '2024-01-19T03:46:02.926Z',
  'show_in_playground': True,
  'owner': 'kerriewu@stanford.edu',
  'owner_address': '0x9c9dc76a62aed719d6c4d4f3283f47150bcefdcf',
  'owner_userid': '6584967ebd65c90fd83c83d8',
  'parent': 'togethercomputer/llama-2-7b-chat',
  'base': 'togethercomputer/llama-2-7b-chat',
  'path': 'r2://together-dev/finetune/6584967ebd65c90fd83c83d8/kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-full-2024-01-19-02-59-23/ft-93aa7a9c-3819-479d-ab5b-f8aea70ad65

### Train set

In [106]:
samples = train_set.copy()
samples = samples.sample(300)

In [107]:
# Single sample first for sanity check
test_chat_prompt = samples['eval_prompt'].sample(1).values[0]
print(new_model_name)
output = together.Complete.create(
  model = new_model_name,
  prompt = test_chat_prompt,
  max_tokens = 500,
  temperature = 0.0,
  # top_k = 90,
  # top_p = 0.8,
  # repetition_penalty = 1.1,
  stop = ['</s>']
)

# print generated text
print(output['prompt'][0]+" -> "+output['output']['choices'][0]['text'])

kerriewu@stanford.edu/llama-2-7b-chat-climatex-train-set-full-2024-01-19-02-59-23
<s>[INST] <<SYS>> You are a knowledgeable climate science assistant trained to assess the confidence level associated with various statements about climate change. <<SYS>> You will be presented with a statement about climate science, climate impacts or climate change mitigation which is retrieved or paraphrased from the IPCC AR6 WGI, WGII or WGIII assessment reports. Climate scientists have evaluated that statement as low confidence, medium confidence, high confidence, or very high confidence, based on evidence (type, amount, quantity, consistency) and agreement among their peers. What is their confidence level? Respond *only* with one of the following words: 'low', 'medium', 'high', 'very high'. If you don't know, you can respond 'I don't know'. Statement: Indigenous knowledge and local knowledge reinforce community adaptive capacity, yet governance structures and processes, including the deliberate desi

In [108]:
samples['prediction'] = samples.apply(lambda x: get_zero_shot_confidence(new_model_name, x), axis=1)
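The `apply` call above makes one API request per row, and an exception from any single request aborts the whole run. A retry wrapper is one way to make this more robust. The sketch below is an assumption-laden example: `call_with_retry` is a hypothetical helper, it retries on any `Exception` for simplicity, and the retry count and delays are arbitrary.

```python
import time

def call_with_retry(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            # Give up and re-raise once the last allowed attempt has failed.
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

It could wrap the helper defined earlier, for example `call_with_retry(lambda: get_zero_shot_confidence(new_model_name, x))` inside the `apply` lambda. In practice it is better to catch only the error types that indicate a transient failure.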

In [109]:
samples['correct'] = samples['confidence'] == samples['prediction']

n_idk = (samples['prediction'] == 'idk').sum()
n_na = (samples['prediction'] == 'N/A').sum()

# "I don't know" answers are excluded from the denominator;
# unparseable (N/A) answers stay in it and count as wrong.
print(f"Accuracy: {samples['correct'].sum() / (len(samples) - n_idk):.3f}")
print(f"'I don't know': {n_idk}")
print(f"N/A: {n_na}")

Accuracy: 0.530
'I don't know': 0
N/A: 86
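The accuracy above excludes "I don't know" answers from the denominator but counts N/A (unparseable) answers as wrong, so a large N/A count pulls the figure down. To separate formatting failures from wrong labels, it can help to report the metric under more than one denominator. The sketch below uses a hypothetical helper, `accuracy_report`, and made-up toy data rather than the notebook's results.

```python
def accuracy_report(labels, predictions):
    """Summarise accuracy under three ways of treating unusable predictions."""
    labels, predictions = list(labels), list(predictions)
    total = len(labels)
    correct = sum(label == pred for label, pred in zip(labels, predictions))
    n_idk = predictions.count("idk")
    n_na = predictions.count("N/A")

    def ratio(denominator):
        return correct / denominator if denominator else float("nan")

    return {
        "all_rows": ratio(total),                    # idk and N/A both count as wrong
        "excluding_idk": ratio(total - n_idk),       # the metric printed in this notebook
        "parsed_only": ratio(total - n_idk - n_na),  # only rows with a usable label
        "n_idk": n_idk,
        "n_na": n_na,
    }

# Toy data for illustration only.
print(accuracy_report(["high", "low", "medium", "high"],
                      ["high", "N/A", "medium", "idk"]))
```

On the real data it could be called as `accuracy_report(samples['confidence'], samples['prediction'])`.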


In [110]:
samples.to_csv('./results/llama-2-chat-7b-finetuned-5-epochs-full-train-set-zeroshot-train-set-eval-temp0-2024-01-19.csv', index=False)

In [111]:
display(samples[samples['prediction'] == 'idk'])

Unnamed: 0,statement_idx,report,page_num,sent_num,statement,confidence,score,split,formatted_jsonl,eval_prompt,prediction,correct


In [112]:
display(samples[samples['prediction'] == 'N/A'])

Unnamed: 0,statement_idx,report,page_num,sent_num,statement,confidence,score,split,formatted_jsonl,eval_prompt,prediction,correct
4234,4234,AR6_WGII,660,8,Reduction in the effectiveness of future adapt...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
921,921,AR6_WGI,692,6,The Human Perturbation of the Carbon and Bioge...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
5423,5423,AR6_WGII,1724,8,"From 1946 to 2017, the number of fires and are...",high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
3095,3095,AR6_WGII,109,21,Considering socioeconomic developments and cli...,medium,1,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
35,35,AR6_WGI,36,4,It is very unlikely that the combined global l...,very high,3,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
...,...,...,...,...,...,...,...,...,...,...,...,...
3814,3814,AR6_WGII,449,5,"Warming climates, even with low ocean- warming...",high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
1151,1151,AR6_WGI,942,29,"This new understanding, along with updated est...",high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
2826,2826,AR6_WGII,75,9,Dengue vector ranges will increase in North Am...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
5778,5778,AR6_WGII,1860,8,The development of adaptation strategies for s...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False


In [113]:
display(samples)

Unnamed: 0,statement_idx,report,page_num,sent_num,statement,confidence,score,split,formatted_jsonl,eval_prompt,prediction,correct
6647,6647,AR6_WGII,2391,2,Deforestation generally reduces rainfall and e...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,low,False
4234,4234,AR6_WGII,660,8,Reduction in the effectiveness of future adapt...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
186,186,AR6_WGI,92,30,"However, these decreases may be underestimated...",low,0,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,low,True
921,921,AR6_WGI,692,6,The Human Perturbation of the Carbon and Bioge...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
5500,5500,AR6_WGII,1744,8,The combination of (seasonally) reduced water ...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,high,True
...,...,...,...,...,...,...,...,...,...,...,...,...
7317,7317,AR6_WGIII,55,14,"A multitude of actors, networks, and movements...",high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,high,True
2826,2826,AR6_WGII,75,9,Dengue vector ranges will increase in North Am...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
5778,5778,AR6_WGII,1860,8,The development of adaptation strategies for s...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False
6131,6131,AR6_WGII,2061,5,Stronger evidence confirms that education and ...,high,2,train,{'text': '<s>[INST] <<SYS>> You are a knowledg...,<s>[INST] <<SYS>> You are a knowledgeable clim...,,False


### Test set

In [114]:
samples = test_set.copy()

In [117]:
# Single sample first for sanity check
test_chat_prompt = samples['eval_prompt'].sample(1).values[0]

output = together.Complete.create(
  prompt = test_chat_prompt,
  model = new_model_name,
  max_tokens = 500,
  temperature = 0.0,
#   top_k = 90,
#   top_p = 0.8,
#   repetition_penalty = 1.1,
  stop = ['</s>']
)

# print generated text
print(output['prompt'][0]+" -> "+output['output']['choices'][0]['text'])

<s>[INST] <<SYS>> You are a knowledgeable climate science assistant trained to assess the confidence level associated with various statements about climate change. <<SYS>> You will be presented with a statement about climate science, climate impacts or climate change mitigation which is retrieved or paraphrased from the IPCC AR6 WGI, WGII or WGIII assessment reports. Climate scientists have evaluated that statement as low confidence, medium confidence, high confidence, or very high confidence, based on evidence (type, amount, quantity, consistency) and agreement among their peers. What is their confidence level? Respond *only* with one of the following words: 'low', 'medium', 'high', 'very high'. If you don't know, you can respond 'I don't know'. Statement: Common indicators of development reflect the significant diversity that exists across different global regions with respect to their development context Confidence: [/INST]  -> 1.1°C Global warming is expected to exacerbate many of 

In [118]:
samples['prediction'] = samples.apply(lambda x: get_zero_shot_confidence(new_model_name, x), axis=1)

In [119]:
samples['correct'] = samples['confidence'] == samples['prediction']

n_idk = (samples['prediction'] == 'idk').sum()
n_na = (samples['prediction'] == 'N/A').sum()

# "I don't know" answers are excluded from the denominator;
# unparseable (N/A) answers stay in it and count as wrong.
print(f"Accuracy: {samples['correct'].sum() / (len(samples) - n_idk):.3f}")
print(f"'I don't know': {n_idk}")
print(f"N/A: {n_na}")

Accuracy: 0.307
'I don't know': 0
N/A: 94
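An accuracy figure is easier to interpret next to a trivial baseline, such as always predicting the most common label. The sketch below uses a hypothetical helper, `majority_baseline`, and toy labels; it assumes a non-empty list of labels and says nothing about the actual label distribution of this dataset.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    labels = list(labels)
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Toy labels for illustration only.
print(majority_baseline(["high", "high", "medium", "low"]))
```

For the test set it could be called as `majority_baseline(test_set['confidence'])` and compared with the accuracy printed above.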


In [120]:
samples.to_csv('./results/llama-2-chat-7b-finetuned-5-epochs-full-train-set-zeroshot-test-set-eval-temp0-2024-01-19.csv', index=False)

### Stop the model! 

In [121]:
# stop your model and you will no longer be paying for it
together.Models.stop(new_model_name)

{'success': True}