### Objective: This notebook provides an LLM-based alternative for bias classification, using few-shot prompting with Llama 3.1 (8B Instruct).

In [None]:
!pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.54.5
    Uninstalling openai-1.54.5:
      Successfully uninstalled openai-1.54.5
Successfully installed openai-0.28.0


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import random
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
bias_test_full = pd.read_csv("/content/bias_test_med (2).csv", on_bad_lines='skip', engine='python')
len(bias_test_full)

7699

In [None]:
bias_test_full.head()

Unnamed: 0,id,src_tok,tgt_tok,src_raw,tgt_raw,src_POS_tags,tgt_parse_tags
0,101134820,"another false claim is that the name "" ross ##...","another fictional claim is that the name "" ros...","another false claim is that the name ""rosslyn""...","another fictional claim is that the name ""ross...",DET ADJ NOUN VERB ADP DET NOUN PUNCT NOUN NOUN...,det amod nsubj ROOT mark det nmod punct nsubj ...
1,278490410,"along with most other christians , the eastern...","along with all other christians , the eastern ...","along with most other christians, the eastern ...","along with all other christians, the eastern o...",ADP ADP ADJ ADJ NOUN PUNCT DET ADJ ADJ NOUN AD...,prep prep amod amod pobj punct det amod nsubj ...
2,729318328,although the group was founded and is headed b...,although the group was founded and is headed b...,although the group was founded and is headed b...,although the group was founded and is headed b...,ADP DET NOUN VERB VERB CCONJ VERB VERB ADP NOU...,mark det nsubjpass auxpass advcl cc auxpass co...
3,239669337,the t ##20 ##6 hon ##us wagner baseball card i...,the t ##20 ##6 hon ##us wagner baseball card i...,the t206 honus wagner baseball card is a rare ...,the t206 honus wagner baseball card is a baseb...,DET NOUN NOUN NOUN NOUN NOUN NOUN NOUN NOUN VE...,det compound compound compound compound compou...
4,236343608,in the absence of the ability to en ##tre ##nc...,greens ##ill has supported maori political par...,in the absence of the ability to entrench rang...,greensill has supported maori political partie...,ADP DET NOUN ADP DET NOUN PART VERB VERB VERB ...,prep det pobj prep det pobj aux acl acl acl do...


Process the test file. The goal is to pass each test sentence, biased (`src_raw`) or neutral (`tgt_raw`), to Llama for classification.

In [None]:
#Create two data sets, one with df_inputs['src_raw'] another with df_inputs['tgt_raw']
df_src = bias_test_full['src_raw']
df_tgt = bias_test_full['tgt_raw']

#rename df_tgt to 'example'
df_tgt = df_tgt.rename('example') # Changed 'columns' to 'name' for Series
#create a variable df_tgt['label'] with value 0
df_tgt = df_tgt.to_frame() # Convert Series to DataFrame to add a new column
df_tgt['label'] = 0
#rename df_src to 'example'
df_src = df_src.rename('example') # Changed 'columns' to 'name' for Series
#create a variable df_src['label'] with value 1
df_src = df_src.to_frame() # Convert Series to DataFrame to add a new column
df_src['label'] = 1
# Concatenate df_src and df_tgt to have a larger data set that we will shuffle
# This data set bias_unbias will have both the positive and negative labels
bias_unbias_test = pd.concat([df_src, df_tgt])
# Shuffle the data set
bias_unbias_test = bias_unbias_test.sample(frac=1).reset_index(drop=True)

#### only code to implement for neutralization ####
# Only get the records with a label of 1
#bias_unbias_test = bias_unbias_test[bias_unbias_test['label'] == 1]
####

# Print the number of rows in the DataFrame
print(f"Number of rows in DataFrame: {len(bias_unbias_test)}")

# Print the number of columns in the DataFrame
print(f"Number of columns in DataFrame: {len(bias_unbias_test.columns)}")

# Print the counts of the label column
print(bias_unbias_test['label'].value_counts())

# Print the first few rows of the DataFrame
bias_unbias_test.head()

Number of rows in DataFrame: 15398
Number of columns in DataFrame: 2
label
1    7699
0    7699
Name: count, dtype: int64


Unnamed: 0,example,label
0,jones was the leading exponent in britain of f...,1
1,like the 'a' in about and sofa,1
2,similar they decided to incorporate the provin...,1
3,a kitkat is a confection manufactured by nestlé.,0
4,october 30 - famous arab encyclopedist m. a. b...,1
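
The shuffle above is unseeded, so each run produces a different row order. As a minimal plain-Python sketch of the same labeling scheme (biased source = 1, neutral target = 0) with a repeatable shuffle: the function name `build_labeled_examples`, the seed value, and the sample sentences are all hypothetical, introduced only for illustration.

```python
import random

def build_labeled_examples(pairs, seed=0):
    """Label each biased source sentence 1 and each neutral target sentence 0, then shuffle."""
    rows = [{"example": src, "label": 1} for src, _ in pairs]
    rows += [{"example": tgt, "label": 0} for _, tgt in pairs]
    # A seeded generator makes the shuffled order repeatable across runs
    random.Random(seed).shuffle(rows)
    return rows

# Hypothetical (biased, neutral) sentence pairs, for illustration only
pairs = [
    ("a delicious confection made by nestle.", "a confection made by nestle."),
    ("the so-called expert spoke.", "the expert spoke."),
]

rows = build_labeled_examples(pairs, seed=0)
rows_again = build_labeled_examples(pairs, seed=0)
print(len(rows), sum(r["label"] for r in rows))
```

In pandas, passing `random_state` to `DataFrame.sample` gives the same kind of repeatability.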


In [None]:
# Keep the texts and the labels as two aligned Series named 'text' and 'label'
V_keys = pd.Series(bias_unbias_test['example'].to_list(), name='text')
V_values = pd.Series(bias_unbias_test['label'].to_list(), name='label')

Save the shuffled test set to CSV so that the classification results can be audited and validated afterwards.

In [None]:
# to_csv returns None when a path is given, so the result is not assigned
bias_unbias_test.to_csv('bias_unbias_test.csv', index=False)

Store the texts and their labels in a dictionary, then wrap it in a Hugging Face `Dataset`

In [None]:
!pip install datasets
# Note: this import shadows torch.utils.data.Dataset imported earlier
from datasets import Dataset

# Build a dictionary with the keys 'text' and 'label'
all_dict = {'text': V_keys, 'label': V_values}

# Wrap the dictionary in a Hugging Face Dataset
my_test_dataset = Dataset.from_dict(all_dict)

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl

In [None]:
my_datasets = {'test': my_test_dataset.select_columns(['text','label'])}

In [None]:
my_datasets.keys()

dict_keys(['test'])

In [None]:
my_datasets['test'][25]['text']

'justin martyr, a 2nd century christian writer , declared the whole septuagint - the greek translation of the hebrew bible generally preferred in the early church - to be "completely free of errors".'

The following functions clean the generated output by removing the unwanted envelope data, i.e. the chat-template tags surrounding the answer.

In [None]:
import re

def remove_text_between_tags(text, start_tag, end_tag):
  pattern = fr'{start_tag}(.|\n)*?{end_tag}'
  cleaned_text = re.sub(pattern, '', text, flags=re.DOTALL)
  return cleaned_text


def remove_final_tag(text, end_tag):
  pattern = fr'\s?{end_tag}'
  cleaned_text = re.sub(pattern, '', text, flags=re.DOTALL)
  return cleaned_text


def ret_post_final_tag(text, end_tag):
  cleaned_text = text.split(end_tag)[-1]
  return cleaned_text


def remove_after_last_curlybrace(string):
  last_brace_index = string.rfind('}')
  if last_brace_index != -1 and last_brace_index != len(string) - 1:
    string = string[:last_brace_index + 1]
  return string


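# Regex patterns for the Llama chat-template tags.
# Raw strings avoid invalid-escape warnings for \| in normal string literals.
start = r"<\|begin_of_text\|>"
fin = r"assistant<\|end_header_id\|>\n\n"
fin2 = r"<\|eot_id\|>"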
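
To illustrate what the cleaning step is meant to achieve, here is a minimal sketch that pulls the assistant's reply out of a decoded string using plain string operations. The helper name `extract_assistant_reply` is hypothetical, and the sample string is a hand-written approximation of a decoded Llama chat output, not real model output.

```python
HEADER = "assistant<|end_header_id|>"
EOT = "<|eot_id|>"

def extract_assistant_reply(decoded):
    """Return the text after the last assistant header, without end-of-turn tags."""
    # If the header is absent, split() returns the whole string as its only piece
    reply = decoded.split(HEADER)[-1]
    reply = reply.replace(EOT, "")
    return reply.strip()

# Hand-written approximation of a decoded generation, for illustration only
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "The text is some sentence. A = <|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n1<|eot_id|>"
)

print(extract_assistant_reply(sample))
```

Splitting on literal strings avoids having to escape the `|` characters, at the cost of being less flexible than a regular expression.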


Now reinstall Transformers at its latest version

In [None]:
%%capture

!pip uninstall -y transformers
!pip install -q -U transformers

In [None]:
!pip install -q accelerate
!pip install -q bitsandbytes


In [None]:
import torch
import pprint

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
device

'cuda:0'

Add the quantization configuration (4-bit NF4 with double quantization)

In [None]:
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16)
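
As a rough back-of-the-envelope sketch of why 4-bit loading matters here: the figures below cover the weights only and assume about 8 billion parameters, all of them quantized. Both are simplifying assumptions, since quantization constants add overhead and some layers may stay in higher precision. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8e9  # assumption: roughly 8 billion parameters

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # unquantized bfloat16 weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit weights, ignoring overhead

print(f"bf16: {bf16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of the bfloat16 weights.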

Log in to Hugging Face, from which the model is downloaded

In [None]:
!pip install huggingface_hub
from huggingface_hub import login

# Get your Hugging Face access token from https://huggingface.co/settings/tokens
token = "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" # placeholder: replace with your own token to run this cell
login(token)



In [None]:
!pip install bitsandbytes --upgrade

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch # Import torch here

device = "cuda" # the device to load the model onto

# Assuming nf4_config and model_id are defined in previous cells
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config, token = token, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, token= token, trust_remote_code=True)



`low_cpu_mem_usage` was None, now default to True since model is quantized.


In [None]:
my_datasets['test'][:10]

{'text': ['carter was born in decatur, georgia and had a stellar high school football career at southwest dekalb h.s.',
  'the flat earth society is an organization first based in england and later in lancaster, california that advocates the flat earth hypothesis.',
  'verbosity (also called wordiness, prolixity and garrulousness) in language refers to speech or writing which uses an excess of words.',
  'penn state has committed over $145 million to nittany lions athletic facilities , creating the best facilities in the nation .',
  'ren anglil passed away at 73 the morning of january 14th at his home in las vegas after a lengthy battle against cancer, two days before his 74th birthday.',
  'marc gafni (born marc winiarz) is an american author, television host, and spiritual artist .',
  'the augusta wine-growing area is a plot of land along the missouri river, which moderates temperature and provides an appropriate climate for growing vitis vinifera.',
  'due to his incendiary commen

In [None]:
all_test_text = my_datasets['test']['text']
#print(all_test_text)

Build a few-shot prompt in which the bias index A is 1 for "biased" and 0 for "neutral", then ask the model for the value of A for each test text.

In [None]:
H = "kemp first came into the public's limelight after starring in a major tv commercial for kellogg's fruit 'n fibre breakfast cereal."
C = "in 1812, with the new commander manuel belgrano, the junta decided to fund a second expeditionary campaign to upper peru with the objective of finally defeating the royalist troops, conquer upper peru and avenge the overwhelming defeat at huaqui, and also to raise the moral of the troops and citizenry."
D = "at least in some countries, communists have been founders and leading figures in attac, and even ramonet himself praises cuban fidel castro and is his only authorised biographer ."
E = "the team won 9 out of their first 12 games before christmas 2012, sitting top of the league going into the new year."
F = "he was a recipient of the knight's cross of the iron cross with oak leaves of nazi germany."
G = "parameters, in the plural form, has recently become popular with non-technical users who incorrectly believe it means limits, apparently from confusing it with perimeter ."
I = "criticism of facebook's motives and violation of net neutrality principles"

rev_rec_list = []
rev_rec_text_list = []

for i in range(len(all_test_text)):

  # Few-shot prompt: seven labeled examples, then the text to classify.
  # Only the text is inserted: my_test_dataset[i] is a dict that also holds the
  # gold label, which would leak the answer into the prompt.
  myprompt = (
      "Bias index A is 1 for biased text and 0 for neutral text. Using text in C,D,E,F,G,H, and I as examples, please give the value of A for the following text. Only the value of A must be in the output. Please, no text in output.\n"
      f"{H} is biased and A = 1\n"
      f"{C} is neutral and A = 0\n"
      f"{D} is neutral and A = 0\n"
      f"{E} is neutral and A = 0\n"
      f"{F} is neutral and A = 0\n"
      f"{G} is biased and A = 1\n"
      f"{I} is biased and A = 1\n"
      f"The text is {all_test_text[i]}. A = "
  )

  messages = [{"role": "user", "content": myprompt}]

  encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

  model_inputs = encodeds.to(device)

  generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
  print(".")
  decoded = tokenizer.batch_decode(generated_ids)
  cleaned = decoded[0]

  cleaned1 = remove_text_between_tags(cleaned, start, fin)
  cleaned2 = remove_final_tag(cleaned1, fin2)
  cleaned3 = ret_post_final_tag(cleaned2, fin)
  rev_rec_list.append(cleaned3.strip())
  #Append the text also for auditing purpose
  rev_rec_text_list.append(all_test_text[i])

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


. (one dot printed per processed record; 500 dots in this output)
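
The prompt-building and answer-reading steps of the loop can be sketched without the model. This is an illustrative sketch, not the notebook's code: `build_prompt` and `parse_label` are hypothetical helpers, and the shortened instruction text and example sentences are placeholders. `parse_label` returns `None` when the reply contains no standalone 0 or 1, which makes unusable generations easy to count.

```python
import re

def build_prompt(shots, text):
    """Assemble a few-shot prompt; shots is a list of (example_text, label) pairs."""
    lines = [
        "Bias index A is 1 for biased text and 0 for neutral text. "
        "Only the value of A must be in the output."
    ]
    for example, label in shots:
        kind = "biased" if label == 1 else "neutral"
        lines.append(f"{example} is {kind} and A = {label}")
    # The text to classify goes last, without its gold label
    lines.append(f"The text is {text}. A = ")
    return "\n".join(lines)

def parse_label(reply):
    """Return the first standalone 0 or 1 in the reply as an int, else None."""
    match = re.search(r"\b[01]\b", reply)
    return int(match.group()) if match else None

# Placeholder shots, for illustration only
shots = [("an opinionated sentence", 1), ("a plain factual sentence", 0)]
prompt = build_prompt(shots, "a sentence to classify")

print(prompt)
print(parse_label("A = 1"), parse_label("I cannot say"))
```

Separating the examples with newlines keeps each shot distinct in the prompt, and parsing to an integer avoids later string-versus-integer comparisons.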


In [None]:
for record in rev_rec_list:
        print(record)

NameError: name 'rev_rec_list' is not defined

In [None]:
for record in rev_rec_text_list:
        print(record)

his son, eugenio lopez iii became chairman and ceo after his father's passing .
robert sun is the inventor of the 24 game.
angus & robertson (a&r) is an iconic online australian bookseller, book publisher and book printer.
the user is able to change his daily goal, unlock achievements, see his activity breakdown, and connect with friends through the mobile application.
bc recordings - offers an extensive collecton of downloadable mp3 lecture by stephan a. hoeller on gnosticism.
the guardian has been noted for a number of other controversies .
india does not treat anyone differently based on their caste, although now there is a policy of reverse discrimination where the government showers privileges on lower caste people to make up for historical wrongs.
during the later half of her high school career, cavallari was the face of "teen reality" for filming popular mtv program laguna beach: the real orange county.
many people associate the city with the salem witch trials of 1692, which th

Now, include the two lists in a DataFrame

In [None]:
prompt_results_t = pd.DataFrame({'text': rev_rec_text_list})
#prompt_results_t
prompt_results_l = pd.DataFrame({'label': rev_rec_list})
#prompt_results_l
prompt_results = pd.concat([prompt_results_t, prompt_results_l], axis=1)
#prompt_results
#Add source labels for accuracy calculations
# Take one source (gold) label per classified record instead of a hard-coded 10
all_test_text_label = pd.DataFrame({'src_label': my_datasets['test']['label'][:len(rev_rec_list)]})

Concatenate prompt results and the original labels

In [None]:
all_data = pd.concat([prompt_results,all_test_text_label], axis=1)
all_data_complete = all_data.rename(columns={'label': 'pred_label'})
all_data_complete

Unnamed: 0,text,pred_label,src_label
0,"his son, eugenio lopez iii became chairman and...",0,1
1,robert sun is the inventor of the 24 game.,0,0
2,angus & robertson (a&r) is an iconic online au...,1,1
3,"the user is able to change his daily goal, unl...",0,1
4,bc recordings - offers an extensive collecton ...,1,0
5,the guardian has been noted for a number of ot...,0,0
6,india does not treat anyone differently based ...,1,1
7,during the later half of her high school caree...,1,0
8,many people associate the city with the salem ...,1,0
9,danny lee ford is a former american football c...,0,0


Create a diff_label variable based on pred_label and src_label. If the two labels are the same, diff_label is 1; if they differ, diff_label is 0. Despite its name, diff_label therefore marks agreement.



In [None]:
# Convert src_label to strings so it can be compared with pred_label, which holds strings
all_data_complete['src_label'] = all_data_complete['src_label'].astype(str)

# Build diff_label: '1' if pred_label and src_label are the same, '0' if they differ
diff_label = []

list_comp_1 = all_data_complete['pred_label'].tolist()
list_comp_2 = all_data_complete['src_label'].tolist()

for i in range(len(list_comp_1)):
  if list_comp_1[i] == list_comp_2[i]:
    diff_label.append('1')
  else:
    diff_label.append('0')

print(list_comp_1)
print(list_comp_2)
print(diff_label)

['0', '0', '1', '0', '1', '0', '1', '1', '1', '0']
['1', '0', '1', '1', '0', '0', '1', '0', '0', '0']
['0', '1', '1', '0', '0', '1', '1', '0', '0', '1']
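
The same element-wise comparison can be written more compactly. The sketch below reuses the two printed lists and rebuilds `diff_label` with `zip`; it is an alternative formulation, not a change to the result.

```python
# The predicted and source labels printed above
pred = ['0', '0', '1', '0', '1', '0', '1', '1', '1', '0']
src = ['1', '0', '1', '1', '0', '0', '1', '0', '0', '0']

# '1' where the two labels agree, '0' where they differ
diff_label = ['1' if p == s else '0' for p, s in zip(pred, src)]

print(diff_label)
```

Note that the comparison is between strings, so a prediction such as `'A = 1'` would count as a mismatch against `'1'`.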


In [None]:
diff_label_df = pd.DataFrame(diff_label, columns=['diff_label'])
#Now concatenate the diff_label_df with the all_data_complete
all_data_complete = pd.concat([all_data_complete, diff_label_df], axis=1)
all_data_complete

Unnamed: 0,text,pred_label,src_label,diff_label
0,"his son, eugenio lopez iii became chairman and...",0,1,0
1,robert sun is the inventor of the 24 game.,0,0,1
2,angus & robertson (a&r) is an iconic online au...,1,1,1
3,"the user is able to change his daily goal, unl...",0,1,0
4,bc recordings - offers an extensive collecton ...,1,0,0
5,the guardian has been noted for a number of ot...,0,0,1
6,india does not treat anyone differently based ...,1,1,1
7,during the later half of her high school caree...,1,0,0
8,many people associate the city with the salem ...,1,0,0
9,danny lee ford is a former american football c...,0,0,1


### Now calculate the accuracy of the Llama 3.1 LLM with K = 7 shots in the prompt

In [None]:
Accuracy = all_data_complete[all_data_complete['diff_label'] == '1']['diff_label'].count() / len(all_data_complete)
print(Accuracy)

0.5
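
Accuracy alone hides which kind of error dominates. As a small sketch on the ten rows shown above, the counts below break the result into true and false positives and negatives, treating label 1 (biased) as the positive class. The helper `confusion_counts` is hypothetical, and ten rows are far too few to draw conclusions from.

```python
from collections import Counter

def confusion_counts(pred, gold):
    """Count (gold, pred) combinations, with label 1 (biased) as the positive class."""
    counts = Counter(zip(gold, pred))
    return {
        "tp": counts[(1, 1)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],
        "tn": counts[(0, 0)],
    }

# pred_label and src_label from the ten rows above, as integers
pred = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0]
gold = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

c = confusion_counts(pred, gold)
accuracy = (c["tp"] + c["tn"]) / len(gold)
# Guard against division by zero when a class is never predicted or never present
precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0

print(c, accuracy, precision, recall)
```

On these rows the model predicts "biased" more often than the gold labels do, so false positives outnumber false negatives.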
