### Task 2 (Prompting Techniques / RAG):
Dataset: train_40k.csv

Design and implement a classifier using any LLM to classify the data in the column “Text” into the labels given in the columns “Cat2” and “Cat3”.
2.1 - The input to your prompt is a text from the Text column, and the output should be a class name from the Cat2 column.
2.2 - The input to your prompt is a text from the Text column, and the output should be a class name from the Cat3 column.

Report the accuracy on a test set split from the 40k samples.

Submission: Task2.ipynb notebook file and a README file explaining the approach to the problem

### Approaches:
Large Language Models (LLMs) can classify text without task-specific training, which makes them a convenient starting point when labeled data or training time is limited. Three approaches are relevant here:

- Prompt-Based Approach: The prompt contains the item description and the list of item classes, and the LLM returns the matching class. This works as long as the class list and the text fit within the model's context window.

- Retrieval Augmented Generation (RAG): RAG is useful when the list of item classes exceeds the context window. The item classes are converted into embeddings and stored in a vector database (or OpenSearch with an ANN plugin). For each item description, the most relevant classes are retrieved from the store, and only those candidates are passed to the LLM, which selects the final class.

- Fine-Tuning: With sufficient training data, fine-tuning an LLM on item descriptions paired with their classes combines the pre-trained model's knowledge with task-specific data and typically improves classification accuracy.
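
The retrieval step described in the RAG bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: `embed`, `cosine`, and `retrieve_top_classes` are hypothetical helpers, and `embed` is a toy bag-of-words stand-in so that the example runs without an embedding model or a vector database.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_classes(text, class_names, k=3):
    """Return the k class names most similar to the text."""
    query = embed(text)
    ranked = sorted(
        class_names,
        key=lambda name: cosine(query, embed(name)),
        reverse=True,
    )
    return ranked[:k]


# Only the retrieved candidates would then be placed in the LLM prompt.
candidates = retrieve_top_classes(
    "shampoo that leaves my hair soft",
    ["hair care", "skin care", "dogs", "games", "beverages"],
    k=2,
)
print(candidates)
```

In a real setup the toy `embed` would be replaced by a learned embedding model, and the sorted scan by an approximate nearest-neighbor lookup in the vector store.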

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_file = "train_40k.csv"

In [3]:
data = pd.read_csv(data_file)

In [4]:
data.head()

Unnamed: 0,productId,Title,userId,Helpfulness,Score,Time,Text,Cat1,Cat2,Cat3
0,B000E46LYG,Golden Valley Natural Buffalo Jerky,A3MQDNGHDJU4MK,0/0,3,-1,The description and photo on this product need...,grocery gourmet food,meat poultry,jerky
1,B000GRA6N8,Westing Game,unknown,0/0,5,860630400,This was a great book!!!! It is well thought t...,toys games,games,unknown
2,B000GRA6N8,Westing Game,unknown,0/0,5,883008000,"I am a first year teacher, teaching 5th grade....",toys games,games,unknown
3,B000GRA6N8,Westing Game,unknown,0/0,5,897696000,I got the book at my bookfair at school lookin...,toys games,games,unknown
4,B00000DMDQ,I SPY A is For Jigsaw Puzzle 63pc,unknown,4-Feb,5,911865600,Hi! I'm Martine Redman and I created this puzz...,toys games,puzzles,jigsaw puzzles


#### The task depends on only three columns (Text, Cat2, Cat3), so the other columns are dropped

In [5]:
columns_to_drop = [
    "productId",
    "Title",
    "userId",
    "Helpfulness",
    "Score",
    "Time",
    "Cat1",
]  # List of columns to drop
data = data.drop(columns=columns_to_drop)

In [6]:
print(data.head())

                                                Text          Cat2  \
0  The description and photo on this product need...  meat poultry   
1  This was a great book!!!! It is well thought t...         games   
2  I am a first year teacher, teaching 5th grade....         games   
3  I got the book at my bookfair at school lookin...         games   
4  Hi! I'm Martine Redman and I created this puzz...       puzzles   

             Cat3  
0           jerky  
1         unknown  
2         unknown  
3         unknown  
4  jigsaw puzzles  


In [7]:
category2_classes = data["Cat2"].unique()
print(category2_classes)
print(len(category2_classes))

category3_classes = data["Cat3"].unique()
print(category3_classes)
print(len(category3_classes))

['meat poultry' 'games' 'puzzles' 'beverages' 'makeup' 'arts crafts'
 'action toy figures' 'dolls accessories' 'baby toddler toys'
 'personal care' 'nutrition wellness' 'learning education'
 'electronics for kids' 'household supplies' 'stuffed animals plush'
 'tricycles' 'health care' 'gear' 'skin care' 'grown up toys'
 'dress up pretend play' 'novelty gag toys' 'bath body'
 'tools accessories' 'hair care' 'medical supplies equipment'
 'baby child care' 'building toys' 'gifts' 'sexual wellness'
 'sports outdoor play' 'hobbies' 'feeding' 'diapering' 'safety' 'nursery'
 'bathing skin care' 'vehicles remote control' 'car seats accessories'
 'strollers' 'pregnancy maternity' 'cats' 'potty training' 'dogs'
 'gourmet gifts' 'sauces dips' 'breakfast foods' 'pantry staples'
 'fragrance' 'fresh flowers live indoor plants' 'breads bakery'
 'candy chocolate' 'cooking baking supplies' 'snack food' 'meat seafood'
 'herbs' 'baby food' 'fish aquatic pets' 'small animals' 'dairy eggs'
 'birds' 'produc

##### Split the dataset into train and test sets using scikit-learn

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# Split the dataset into train and test sets (80% train, 20% test)
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Save the test data into another CSV file
test_df.to_csv("test_dataset.csv", index=False)

#### 1. Prompt-Based Approach

In [10]:
!pip install openai==0.28



In [11]:
import os
import openai

In [12]:
def chatgpt_generation(question):
    """Send a single-turn prompt to GPT-4 and return the reply text."""
    output = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        max_tokens=64,  # only short replies are needed
        stop=["==="],
        temperature=0.0,  # favor deterministic answers
        top_p=0.9,
    )
    return output.choices[0].message.content

#### Prompting
Define the prompt and add the constraints to it.

In [24]:
prompt = """You are a classification engine. Your task is to classify the text from the given list of categories - {categories}.
You must not classify anything out the list of categories provided. Return unknown if the text cannot be classifed
Now classify for this text:
User Query: {sentence}
"""

# prompt = """You are a classification engine. Your task is to classify the text from the given list of categories - {categories}.
# You must not classify anything out the list of categories provided. Return unknown if the text cannot be classifed
# Text:My wife has been wonderfully pleased with this product. It leaves her cockapoo soft and shiny. She says that it the best she has ever used.
# Class:dogs
# Text:{sentence}
# Class:"""
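
The commented-out variant above is a one-shot prompt with a single hard-coded example. A minimal sketch of generalizing it to several examples is shown below; `build_few_shot_prompt` is a hypothetical helper, and the example pairs are placeholders that would, in this notebook, be drawn from `train_df`.

```python
def build_few_shot_prompt(categories, examples, sentence):
    """Assemble a few-shot classification prompt.

    categories: iterable of class names
    examples:   list of (text, class_name) pairs, e.g. from the training split
    sentence:   the text to classify
    """
    lines = [
        "You are a classification engine. Classify the text into exactly one "
        "of these categories: " + ", ".join(categories) + ".",
        "Answer with the category name only. Answer unknown if none applies.",
    ]
    # Each labeled example becomes a Text/Class pair the model can imitate.
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Class: {label}")
    # The final pair is left open so the model completes the class name.
    lines.append(f"Text: {sentence}")
    lines.append("Class:")
    return "\n".join(lines)


few_shot_prompt = build_few_shot_prompt(
    categories=["dogs", "cats", "games"],
    examples=[("It leaves her cockapoo soft and shiny.", "dogs")],
    sentence="My kitten loves this scratching post.",
)
print(few_shot_prompt)
```

Joining the class names with commas, rather than inserting the printed array, also keeps quote characters and line breaks out of the prompt.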

In [25]:
print(prompt)

You are a classification engine. Your task is to classify the text from the given list of categories - {categories}.
You must not classify anything out the list of categories provided. Return unknown if the text cannot be classifed
Now classify for this text:
User Query: {sentence}



In [16]:
openai.api_key = "<add your openai key>"

##### Category 2 classification using the GPT-4 model for all the texts

In [17]:
test_data = pd.read_csv('test_dataset.csv')

In [18]:
# Shuffle the DataFrame
test_shuffled_data = test_data.sample(frac=1).reset_index(drop=True)

In [19]:
print(len(test_data['Text']))

8000


#### Testing the sample

In [27]:
sample_prompt=prompt.format(categories=category2_classes,sentence=test_data['Text'][0])
gpt3_output = chatgpt_generation(sample_prompt)
gpt3_output

"'household supplies'"

In [48]:
summ_prompt = "{A} \nExplain the above in one sentence:"

In [78]:
cat2_output = []
# Two-step pipeline for each test text:
#   1. summarize the review in one sentence,
#   2. classify the summary into a Cat2 class.
# To try a small subset first, iterate over test_shuffled_data['Text'].head(5).
for each_text in test_shuffled_data['Text']:
    summ_each_prompt = summ_prompt.format(A=each_text)
    summ_output = chatgpt_generation(summ_each_prompt)
    each_prompt = prompt.format(categories=category2_classes, sentence=summ_output)
    gpt3_output = chatgpt_generation(each_prompt)
    cat2_output.append(gpt3_output)

RateLimitError: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.
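
Because the run above stopped partway through, one option is to write each prediction to disk as it is produced so that a later run can resume where the previous one ended. The sketch below assumes this is acceptable for the experiment; `classify_with_checkpoint`, the checkpoint file name, and `fake_classifier` are hypothetical, and in this notebook the `classify` argument would wrap the summarize-then-classify calls to `chatgpt_generation`.

```python
import json
import os
import tempfile


def classify_with_checkpoint(texts, classify, checkpoint_path):
    """Classify texts in order, appending each result to a JSON Lines file.

    If the file already exists, previously saved predictions are reloaded and
    only the remaining texts are sent to `classify`.
    """
    predictions = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            predictions = [json.loads(line)["prediction"] for line in f if line.strip()]

    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for index in range(len(predictions), len(texts)):
            try:
                prediction = classify(texts[index])
            except Exception as error:  # e.g. a rate-limit or quota error
                print(f"Stopped at index {index}: {error}")
                break
            f.write(json.dumps({"index": index, "prediction": prediction}) + "\n")
            f.flush()
            predictions.append(prediction)
    return predictions


# Demonstration with a hypothetical classifier that fails on its third call.
calls = {"count": 0}


def fake_classifier(text):
    calls["count"] += 1
    if calls["count"] == 3:
        raise RuntimeError("quota exceeded")
    return text.upper()


checkpoint_file = os.path.join(tempfile.mkdtemp(), "cat2_predictions.jsonl")
sample_texts = ["a", "b", "c", "d"]
first_run = classify_with_checkpoint(sample_texts, fake_classifier, checkpoint_file)
second_run = classify_with_checkpoint(sample_texts, fake_classifier, checkpoint_file)
print(first_run, second_run)
```

The first call stops after two predictions, and the second call picks up from the third text. The order of `texts` must be identical across runs, so a shuffle would need a fixed seed for this to work.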

In [79]:
print(len(cat2_output))

3315


In [69]:
print(cat2_output)

["The text can be classified under 'personal care'.", "'novelty gag toys'", "'games'", "'nutrition wellness'", "'beverages'"]



##### Category 3 classification using the GPT-4 model for all the texts

In [None]:
cat3_output = []
# Same two-step pipeline as for Cat2: summarize, then classify into a Cat3 class.
for each_text in test_shuffled_data['Text']:
    summ_each_prompt = summ_prompt.format(A=each_text)
    summ_output = chatgpt_generation(summ_each_prompt)
    each_prompt = prompt.format(categories=category3_classes, sentence=summ_output)
    gpt3_output = chatgpt_generation(each_prompt)
    cat3_output.append(gpt3_output)

#### Writing back the predictions to csv file

In [None]:
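# Add the predictions to new columns. The predictions follow the row order of
# test_shuffled_data (not test_data), so they are attached to that DataFrame,
# and only to the rows that were actually classified.
results_df = test_shuffled_data.head(len(cat2_output)).copy()
results_df['Cat2_Predictions'] = cat2_output
if len(cat3_output) == len(results_df):
    results_df['Cat3_Predictions'] = cat3_output

In [None]:
# Save the predictions to a separate CSV file so the test split is not overwritten
results_df.to_csv('test_predictions.csv', index=False)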

#### Calculating the accuracy for the LLM predictions

In [81]:
def calculate_accuracy(expected_values, predicted_values):
    """Return the percentage of predictions that exactly match the expected labels."""
    total_samples = len(expected_values)
    print(total_samples)
    # Exact string comparison: any extra words or quotes in a prediction count as a miss.
    correct_predictions = sum(1 for exp, pred in zip(expected_values, predicted_values) if exp == pred)
    accuracy = (correct_predictions / total_samples) * 100

    print(f"Accuracy: {accuracy:.2f}%")
    return accuracy


In [85]:
cat2_output = [item.replace("'", "") for item in cat2_output]
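
Stripping quotes handles replies such as `'games'`, but a reply like "The text can be classified under 'personal care'." still fails the exact-match comparison used for accuracy. A minimal sketch of a more tolerant mapping is shown below; `normalize_prediction` is a hypothetical helper, and it assumes the valid labels are lowercase strings, as they are in this dataset.

```python
import re


def normalize_prediction(raw_output, valid_labels, fallback="unknown"):
    """Map a free-form model reply onto one of the valid labels.

    Tries an exact match after stripping quotes and periods, then looks for
    the longest valid label mentioned anywhere in the reply.
    """
    cleaned = raw_output.strip().strip("'\".").lower()
    if cleaned in valid_labels:
        return cleaned
    reply = raw_output.lower()
    mentioned = [
        label for label in valid_labels
        if re.search(r"\b" + re.escape(label) + r"\b", reply)
    ]
    return max(mentioned, key=len) if mentioned else fallback


labels = {"personal care", "games", "novelty gag toys"}
print(normalize_prediction("'games'", labels))
print(normalize_prediction("The text can be classified under 'personal care'.", labels))
print(normalize_prediction("I am not sure.", labels))
```

Applied to `cat2_output` with `set(category2_classes)` as the label set, this would count verbose but correct replies as matches. Whether that is the right scoring rule is a judgment call for the experiment.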

In [86]:
# accuracy for Cat2
cat2_expected_values = test_shuffled_data['Cat2'].head(3315).tolist()

print("Cat2 accuracy",calculate_accuracy(cat2_expected_values,cat2_output))

# # accuracy for Cat3
# cat3_expected_values = test_data['Cat3'].tolist() 
# print(calculate_accuracy(cat3_expected_values,cat3_output))


3315
Accuracy: 51.25%
Cat2 accuracy 51.251885369532424


**Cat2 accuracy: 51.25% (on the 3,315 test samples that were classified)**

In [87]:
output_df = pd.DataFrame({'Text': test_shuffled_data['Text'].head(3315).tolist(), 'Expected': cat2_expected_values, 'Predicted': cat2_output})

In [88]:
print(output_df)

                                                   Text  \
0     I'm really happy with this soap. I love that i...   
1     My son has wanted to learn to do magic tricks,...   
2     i just bought this for my daughters 3rd bday a...   
3     Ive heard many good things about biotin and I ...   
4     Basically, this is a powder mix for one of tho...   
...                                                 ...   
3310  Sweet lily of the valley main note with a bit ...   
3311  Ordered the Sandalwood Vanilla, I received it ...   
3312  My birds so happy to be able to hop from the m...   
3313  THE FIRST TIME I SAW THIS GATE I THOUGHT IT WA...   
3314  Along with Hokey Pokey Elmo, Chicken Dance Elm...   

                   Expected           Predicted  
0                 bath body       personal care  
1          novelty gag toys    novelty gag toys  
2      electronics for kids               games  
3        nutrition wellness  nutrition wellness  
4        nutrition wellness           bev

#### Notes and Future Tasks:
- GPT-4 was used for this experiment because it was expected to give the best performance among the available models.
- Because the quota on my OpenAI key ran out, only 3,315 of the 8,000 test entries could be classified for Category 2, and the Category 3 classification could not be run.
- Open-source models such as Llama, OpenChat, and Mistral could be tried instead, but there may be a trade-off in accuracy.

#### Future improvements:
Several directions could improve the accuracy of zero-shot prompting with GPT-4 for this multi-class classification task:

- Retrieval Augmented Generation (RAG): RAG is useful when the list of item classes exceeds the context window, as is more likely for Cat3 than for Cat2. The item classes are converted into embeddings and stored in a vector database (or OpenSearch with an ANN plugin). For each item description, the most relevant classes are retrieved, and only those candidates are passed to the LLM for the final decision.

- Fine-Tuning: With sufficient training data, fine-tuning an LLM on item descriptions paired with their classes combines the pre-trained model's knowledge with task-specific data and typically improves classification accuracy.
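
As a starting point for the fine-tuning direction, the labeled training rows would need to be converted into training records. The sketch below is an assumption about how that could look, not a tested pipeline: `to_finetune_records` is a hypothetical helper, and the `messages` layout simply mirrors the chat format already used in `chatgpt_generation`. The exact schema a provider expects should be checked against its fine-tuning documentation.

```python
import json


def to_finetune_records(rows, system_prompt):
    """Convert (text, label) pairs into chat-style training records."""
    records = []
    for text, label in rows:
        records.append({
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text},
                {"role": "assistant", "content": label},
            ]
        })
    return records


# Two rows taken from the data preview above; in practice these would come
# from train_df[['Text', 'Cat2']].
training_rows = [
    ("The description and photo on this product need...", "meat poultry"),
    ("Hi! I'm Martine Redman and I created this puzz...", "puzzles"),
]
records = to_finetune_records(training_rows, "Classify the review into a Cat2 category.")

# One JSON object per line (JSON Lines), a common format for training files.
jsonl_text = "\n".join(json.dumps(record) for record in records)
print(jsonl_text)
```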