##### Dataset Creation Approach

    Many datasets are available online for various LLM evaluation tasks.

    Common benchmark datasets include:
```
    arc-challenge
    mbpp
    winogrande
    mtbench
    grade-school-math
    hellaswag
    mmlu
```
    However, categorizing, downloading, and preprocessing those datasets took more time than I had initially anticipated. For this reason, I will take a synthetic dataset creation approach to generate the data.

##### I will use LLMs to generate the data for the modules defined in the assignment.
```
    Document Extraction
    Quantitative Analysis
    Report Generation
    Interactive QA Chatbot
    Multi Document Summarization
```



##### Define LLM

In [None]:
from langchain_mistralai import ChatMistralAI
import os
import requests
import json
import base64


if "MISTRAL_API_KEY" not in os.environ:
    os.environ["MISTRAL_API_KEY"] = ""


def get_llm():
    return ChatMistralAI(
        model="mistral-large-latest",
        temperature=0,
        max_retries=2,
    )




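def encode_image(image_path):
    """Return the file at image_path as a base64-encoded UTF-8 string."""
    with open(image_path, "rb") as img_file:
        return base64.b64encode(img_file.read()).decode("utf-8")


def query_llm_rest_pix(mlist, image=None):
    """Send chat messages to the locally hosted Pixtral endpoint and return the reply text."""
    # `image` is accepted for compatibility but is not used in this function.
    url = 'http://<IP>:9104/v1/chat/completions'
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer token'
    }

    data = {
        "model": "mistralai/Pixtral-12B-2409",
        "messages": mlist,
    }

    response = requests.post(url, headers=headers, data=json.dumps(data))

    if response.status_code == 400:
        return "I'm unsure about the answer. Please provide more context or ask a different question."

    # Fail with a clear HTTP error instead of a confusing JSON/KeyError on other error codes.
    response.raise_for_status()
    return response.json()["choices"][0]['message']['content']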
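
`encode_image` is defined above but is never called in this notebook. The sketch below shows one way its output might be attached to a user message. It assumes the endpoint accepts OpenAI-style `image_url` content parts carrying a base64 data URL, which is an assumption about the server rather than something shown here; `build_image_message` and the default MIME type are hypothetical.

```python
import base64


def build_image_message(text, image_bytes, mime="image/png"):
    """Build one user message carrying text plus an inline base64 image.

    Hypothetical helper: assumes an OpenAI-compatible chat endpoint that
    accepts `image_url` content parts with a data URL.
    """
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }


# Placeholder bytes stand in for a real image file.
msg = build_image_message("Describe this chart.", b"\x89PNG fake bytes")
print(msg["content"][1]["image_url"]["url"][:22])
```

The resulting dictionary could be appended to the `mlist` argument of `query_llm_rest_pix`, provided the server supports image inputs in this form.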


##### PIP install

In [11]:
# !pip install spacy
# !pip install textstat
# !python -m spacy download en_core_web_sm




##### Synthetic data generation

##### Prompt Iteration 

- I iteratively refined the prompt used to generate the synthetic prompts. Below is the final version that was used. I'm using a locally hosted Pixtral model here; a larger model would likely yield better results.
- I treat GPT, Claude, Llama, and Mistral as the target labels and assume each one is well suited to particular tasks.

In [None]:
modules = {
    "t1":"Document Content Extraction: Reading documents in different formats and extracting information from it, which could be in tables, graphs, etc. Make sure all quantitative data is extracted correctly with the right context.",
    "t2": "Quantitative Analysis: To streamline and automate the process of performing advanced calculations based on given parameters. If required, the LLM needs to choose the correct parameters like assumptions, margin of error, etc.",
    "t3": "Report Generation: Creating a report following a pre-set format, length, and style, including text, charts, and illustrations. This can be calibrated based on requirements and data sources.",
    "t4":"Interactive QA Chatbot: A question-and-answer chatbot to go with the final report so that if there are any queries, they can be answered based on all the sources used for creating the report.",
    "t5":"Multi-document Summarization: Summarizing data from multiple documents, giving importance to the most recent information or specified context."
}



llm = get_llm()

data = []
num_prompts = 50
for name, module in modules.items():
    print(name)
    prompt = f"""
    You are a prompt generation expert who can generate prompts for variety of tasks. You have been given the following description of task. 
    The prompt you generate should be of varying length and complexity. Give the response stricty in python dictionary format.
    {{"prompt":"","complexity":"","data_type":""}}

    data_type can include following types["Text", "Numerical Data", "Tables", "Graphs & Charts", "Equations & Formulas", "Dates & Timestamps"]
    complexity can include following types["Simple", "Medium", "Complex"]
    Strictly follow the format.
    
    Respond only in text. Dont include markdown.
    Generate {num_prompts} prompts for each of the following tasks:
        
        {module}

    """
    # LangChain-style messages, used only with the commented-out llm.invoke call below.
    message = [
        ("system", "You are a prompt generation expert who can generate prompts for variety of tasks."),
        ("user", prompt),
    ]
    print("Prompt: ", prompt)
    # Chat messages in the format sent to the Pixtral REST endpoint.
    mlist = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a prompt generation expert who can generate prompts for variety of tasks."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt}
            ],
        },
    ]
    # res = llm.invoke(message).content
    print("------------------------------------------")
    res = query_llm_rest_pix(mlist)
    # print(res)
    data.append({"module":name,"prompts":res})
    print(data)
    with open("files/D1.json", "w") as f:
        json.dump(data, f, indent=4)


t1
Prompt:  
    You are a prompt generation expert who can generate prompts for variety of tasks. You have been given the following description of task. 
    The prompt you generate should be of varying length and complexity. Give the response stricty in python dictionary format.
    {"prompt":"","complexity":"","data_type":""}

    data_type can include following types["Text", "Numerical Data", "Tables", "Graphs & Charts", "Equations & Formulas", "Dates & Timestamps"]
    complexity can include following types["Simple", "Medium", "Complex"]
    Strictly follow the format.
    
    Respond only in text. Dont include markdown.
    Generate 50 prompts for each of the following tasks:
        
        Document Content Extraction: Reading documents in different formats and extracting information from it, which could be in tables, graphs, etc. Make sure all quantitative data is extracted correctly with the right context.

    
------------------------------------------
[{'module': 't1', 'p

##### Remove unwanted characters

In [None]:


datax = []
for item in data:
    item["prompts"] = item["prompts"].replace("\\n", "").replace("\\", "").replace("```python","").replace("```","")
    datax.append(item)
datax

[{'module': 't1',
  'prompts': '\n{\n    "prompts": [\n        {"prompt": "Extract the names and contact numbers from the PDF provided.", "complexity": "Simple", "data_type": ["Text", "Numerical Data"]},\n        {"prompt": "Identify and extract all tables from the Word document.", "complexity": "Simple", "data_type": ["Tables"]},\n        {"prompt": "From the given document, extract the names of all individuals mentioned along with their roles.", "complexity": "Simple", "data_type": ["Text"]},\n        {"prompt": "Extract the dates and times from the document and format them into a consistent date-time format.", "complexity": "Simple", "data_type": ["Dates & Timestamps"]},\n        {"prompt": "From the PDF, extract all numerical values mentioned and their context.", "complexity": "Simple", "data_type": ["Numerical Data", "Text"]},\n        {"prompt": "Extract and sort the numerical data from the table in the document by their values.", "complexity": "Simple", "data_type": ["Numerical 
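
The notebook does not show how the cleaned string becomes the list of dictionaries that is later loaded from `D1.json`. A minimal sketch of one way to do it, assuming the reply is either JSON or a Python literal and may be wrapped in a top-level `"prompts"` key as in the output above; `parse_prompts` is a hypothetical helper, not the code that was actually run.

```python
import ast
import json


def parse_prompts(raw):
    """Parse a cleaned model reply into a list of prompt dictionaries."""
    text = raw.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to Python-literal syntax (e.g. single-quoted keys).
        parsed = ast.literal_eval(text)
    # Unwrap {"prompts": [...]} if the model added that outer key.
    if isinstance(parsed, dict) and "prompts" in parsed:
        parsed = parsed["prompts"]
    # A single dictionary is treated as a one-item list.
    return parsed if isinstance(parsed, list) else [parsed]


sample = '\n{\n "prompts": [\n {"prompt": "Extract all tables.", "complexity": "Simple", "data_type": ["Tables"]}\n ]\n}'
print(parse_prompts(sample))
```

Replies that are neither valid JSON nor a valid Python literal will still raise an exception, which is usually preferable to silently storing a malformed string.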

### Features selected

- **`num_tokens`** – Helps determine prompt length, affecting processing time and model performance.  
- **`num_sentences`** – Indicates prompt complexity and structure.  
- **`lexical_diversity`** – Measures vocabulary richness, impacting generalization.  
- **`readability`** – Assesses how easy or difficult the prompt is to understand.  
- **`noun_ratio`** – Identifies the amount of entity-related information.  
- **`verb_ratio`** – Shows action-oriented nature of the prompt.  
- **`adjective_ratio`** – Captures descriptive complexity in the prompt.  
- **`num_named_entities`** – Detects presence of key entities, aiding context understanding.  
- **`contains_table`** – Flags structured data presence, affecting LLM parsing needs.  
- **`contains_list`** – Identifies enumerated elements that may require special processing.  
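- **`keyword_density`** – Measures topic emphasis and focus within the prompt (computed in the code below but currently left out of the returned features).  
- **`redundancy_score`** – Helps detect excessive repetition, which may confuse LLMs.  
- **`compression_ratio`** – Average number of characters per token, used as a rough proxy for information density; useful for summarization tasks.  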
- **`contains_numbers`** – Determines presence of quantitative data, influencing numerical reasoning.  
- **`contains_chain_of_thought`** – Checks for step-by-step reasoning, important for logical tasks.  
- **`contains_output_constraints`** – Identifies explicit formatting constraints in expected responses.  
- **`is_multi_turn`** – Detects conversational prompts requiring memory or context retention.  


In [None]:
import spacy
import textstat
import re
from collections import Counter

# Load spaCy NLP model
nlp = spacy.load("en_core_web_sm")

def extract_prompt_features(prompt):
    doc = nlp(prompt)
    
    #  Linguistic Features
    num_tokens = len(doc)
    num_sentences = len(list(doc.sents))
    unique_words = len(set([token.text.lower() for token in doc]))
    lexical_diversity = unique_words / num_tokens if num_tokens > 0 else 0
    readability = textstat.flesch_kincaid_grade(prompt)
    
    # Part-of-Speech (POS) Distribution
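    pos_counts = Counter([token.pos_ for token in doc])
    # Guard against empty prompts to avoid division by zero.
    noun_ratio = pos_counts.get("NOUN", 0) / num_tokens if num_tokens > 0 else 0
    verb_ratio = pos_counts.get("VERB", 0) / num_tokens if num_tokens > 0 else 0
    adjective_ratio = pos_counts.get("ADJ", 0) / num_tokens if num_tokens > 0 else 0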

    #  Structural Features
    num_named_entities = len(doc.ents)
    contains_table = bool(re.search(r"(\||\+\-+)|Table \d+", prompt))  # Detect tables
    contains_list = bool(re.search(r"(\d+\.)|(- )|(\* )", prompt))  # Detect bullet points or numbered lists

    # Content-Specific Features
    keyword_density = {word: prompt.lower().count(word) for word in ["summary", "analyze", "data", "report"]}
    redundancy_score = len(re.findall(r"(\b\w+\b).*\1", prompt))  # Count repeated words
    compression_ratio = len(prompt) / num_tokens if num_tokens > 0 else 0

    # Task-Specific Complexity Features
    contains_numbers = bool(re.search(r"\d+", prompt))  # Presence of numerical data
    contains_chain_of_thought = bool(re.search(r"step-by-step|explain your reasoning", prompt.lower()))
    contains_output_constraints = bool(re.search(r"limit to|output in|return a json", prompt.lower()))
    is_multi_turn = "previous response" in prompt.lower() or "as mentioned before" in prompt.lower()

    return {
        "num_tokens": num_tokens,
        "num_sentences": num_sentences,
        "lexical_diversity": lexical_diversity,
        "readability": readability,
        "noun_ratio": noun_ratio,
        "verb_ratio": verb_ratio,
        "adjective_ratio": adjective_ratio,
        "num_named_entities": num_named_entities,
        "contains_table": contains_table,
        "contains_list": contains_list,
        # "keyword_density": keyword_density,
        "redundancy_score": redundancy_score,
        "compression_ratio": compression_ratio,
        "contains_numbers": contains_numbers,
        "contains_chain_of_thought": contains_chain_of_thought,
        "contains_output_constraints": contains_output_constraints,
        "is_multi_turn": is_multi_turn
    }

# Example Prompt
prompt = """
You are an AI assistant. Summarize the research paper and provide key insights. 
Ensure the response is formatted as JSON and limited to 200 words. 
Step-by-step analysis is preferred. Example: "In this study, the authors examined..."
"""

# Extract Features
features = extract_prompt_features(prompt)
print(features)


{'num_tokens': 53, 'num_sentences': 5, 'lexical_diversity': 0.7547169811320755, 'readability': 7.3, 'noun_ratio': 0.22641509433962265, 'verb_ratio': 0.1509433962264151, 'adjective_ratio': 0.018867924528301886, 'num_named_entities': 3, 'contains_table': False, 'contains_list': False, 'redundancy_score': 2, 'compression_ratio': 4.452830188679245, 'contains_numbers': True, 'contains_chain_of_thought': True, 'contains_output_constraints': False, 'is_multi_turn': False}
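
In the output above, `contains_output_constraints` is `False` even though the example prompt asks for JSON output limited to 200 words. The pattern looks for `limit to`, `output in`, or `return a json`, and none of these occurs verbatim in the prompt. A broader pattern is sketched below; the phrase list is an assumption chosen for illustration, and adopting it would change the feature values computed in this notebook.

```python
import re

# Hypothetical, broader phrase list for explicit output constraints.
OUTPUT_CONSTRAINT_RE = re.compile(
    r"limit(ed)? to|output in|return a json|formatted as|no more than \d+ words",
    re.IGNORECASE,
)


def has_output_constraints(prompt):
    """True if the prompt contains one of the constraint phrases above."""
    return bool(OUTPUT_CONSTRAINT_RE.search(prompt))


example = "Ensure the response is formatted as JSON and limited to 200 words."
print(has_output_constraints(example))
print(has_output_constraints("Summarize the research paper."))
```

Keyword lists like this remain heuristic, so it may be worth spot-checking a sample of generated prompts against the flag before relying on it as a feature.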


##### Load json file

In [None]:
import json
with open("/home/vikasnr/codebase/crsl/tuesday/files/D1.json", "r") as f:
    data_dict = json.load(f)

print(data_dict)


[{'module': 't1', 'prompts': [{'prompt': 'Extract the names and contact numbers from the PDF provided.', 'complexity': 'Simple', 'data_type': ['Text', 'Numerical Data']}, {'prompt': 'Identify and extract all tables from the Word document.', 'complexity': 'Simple', 'data_type': ['Tables']}, {'prompt': 'From the given document, extract the names of all individuals mentioned along with their roles.', 'complexity': 'Simple', 'data_type': ['Text']}, {'prompt': 'Extract the dates and times from the document and format them into a consistent date-time format.', 'complexity': 'Simple', 'data_type': ['Dates & Timestamps']}, {'prompt': 'From the PDF, extract all numerical values mentioned and their context.', 'complexity': 'Simple', 'data_type': ['Numerical Data', 'Text']}, {'prompt': 'Extract and sort the numerical data from the table in the document by their values.', 'complexity': 'Simple', 'data_type': ['Numerical Data']}, {'prompt': 'Extract the key points and conclusions from the research 

In [None]:
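i = 0
data = []

# Short module keys -> task names used in the dataset
mapping_task = {
    "t1": "document_extraction",
    "t2": "quantitative_analysis",
    "t3": "report_generation",
    "t4": "interactive_qa_chatbot",
    "t5": "multi_document_summarization",
}

# Short module keys -> model assumed to be the best fit (pre-noise label)
mapping_model = {
    "t1": "GPT",
    "t2": "Claude",
    "t3": "Mistral",
    "t4": "llma",
    "t5": "GPT",
}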
for mod in data_dict:
    for prompt in mod["prompts"]:
        print(prompt)
        pt = prompt["prompt"]
        complx = prompt["complexity"]
        data_type = prompt["data_type"]
        inter = {"PID": i,"prompt_length":len(pt),  "complexity": complx, "data_type": data_type,"module": mapping_task[mod["module"]],
                 "LLM_PRE": mapping_model[mod["module"]]}
        
        inter.update(extract_prompt_features(pt))
        
        i += 1
        data.append(inter)


with open("files/D2.json", "w") as f:
    json.dump(data, f, indent=4)

{'prompt': 'Extract the names and contact numbers from the PDF provided.', 'complexity': 'Simple', 'data_type': ['Text', 'Numerical Data']}
{'prompt': 'Identify and extract all tables from the Word document.', 'complexity': 'Simple', 'data_type': ['Tables']}
{'prompt': 'From the given document, extract the names of all individuals mentioned along with their roles.', 'complexity': 'Simple', 'data_type': ['Text']}
{'prompt': 'Extract the dates and times from the document and format them into a consistent date-time format.', 'complexity': 'Simple', 'data_type': ['Dates & Timestamps']}
{'prompt': 'From the PDF, extract all numerical values mentioned and their context.', 'complexity': 'Simple', 'data_type': ['Numerical Data', 'Text']}
{'prompt': 'Extract and sort the numerical data from the table in the document by their values.', 'complexity': 'Simple', 'data_type': ['Numerical Data']}
{'prompt': 'Extract the key points and conclusions from the research paper.', 'complexity': 'Simple', 'da

##### Convert to CSV

In [None]:
import csv
csv_file = "files/input_intermediate.csv"

# Write to CSV
with open(csv_file, mode="w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=data[0].keys())
    writer.writeheader()  # Write column headers
    writer.writerows(data)  # Write rows

print(f"CSV file '{csv_file}' created successfully!")

CSV file 'files/input_intermediate.csv' created successfully!
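
Because `data_type` is a list in each row, `csv.DictWriter` writes its Python string form (for example `['Text', 'Numerical Data']`) into the cell. If a plain delimited value is preferred, list fields can be joined before writing. The sketch below assumes a `;` separator; `flatten_row` is a hypothetical helper and the sample row is illustrative only.

```python
import csv
import io


def flatten_row(row, sep=";"):
    """Return a copy of row with list values joined into one string."""
    return {
        key: sep.join(map(str, value)) if isinstance(value, list) else value
        for key, value in row.items()
    }


rows = [{"PID": 0, "complexity": "Simple", "data_type": ["Text", "Numerical Data"]}]

# Write to an in-memory buffer so the example has no side effects on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(flatten_row(r) for r in rows)
print(buffer.getvalue())
```

A joined string is easier to split again after `pd.read_csv` than a stringified Python list, which would otherwise need to be parsed back into a list.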


In [None]:
import pandas as pd
import numpy as np

# Load CSV file
df = pd.read_csv("files/input_intermediate.csv")

# Source label column and the new column that will hold the noisy labels
label_col = "LLM_PRE"
new_col = "LLM"

# Get unique labels
labels = df[label_col].unique()

# Create a new column with the same values initially
df[new_col] = df[label_col]

# Modify 20% of each label group
for label in labels:
    label_indices = df[df[label_col] == label].index
    num_to_replace = int(0.2 * len(label_indices))  # 20% of each label

    if num_to_replace > 0:
        # Choose random indices to modify
        replace_indices = np.random.choice(label_indices, num_to_replace, replace=False)

        # Select different labels for replacement
        possible_labels = [l for l in labels if l != label]
        df.loc[replace_indices, new_col] = np.random.choice(possible_labels, num_to_replace)

# Drop the 'LLM_PRE' column
df.drop(columns=["LLM_PRE"], inplace=True)
# Save the modified CSV
df.to_csv("files/modified_file.csv", index=False)

print("CSV file updated with new column!")


CSV file updated with new column!
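
The cell above draws its replacements with an unseeded `np.random.choice`, so the noisy labels differ on every run. Below is a pure-Python sketch of the same per-label 20% replacement with a fixed seed. `flip_labels` and the seed value are assumptions for illustration, and this is not the code that produced the counts reported in this notebook.

```python
import random


def flip_labels(labels, fraction=0.2, seed=42):
    """Replace `fraction` of each label group with a different, randomly chosen label."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    noisy = list(labels)
    for label in classes:
        # Positions that originally carry this label.
        idx = [i for i, current in enumerate(labels) if current == label]
        others = [c for c in classes if c != label]
        k = int(fraction * len(idx))
        if not others or k == 0:
            continue
        for i in rng.sample(idx, k):
            noisy[i] = rng.choice(others)
    return noisy


original = ["GPT"] * 10 + ["Claude"] * 10 + ["Mistral"] * 5
noisy = flip_labels(original)
changed = sum(a != b for a, b in zip(original, noisy))
print(changed)
```

Fixing the seed makes the label noise repeatable, which helps when comparing models trained on the resulting file across runs.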


In [18]:
# Count how many rows carry each final label
label_counts = df['LLM'].value_counts()
print(label_counts)


LLM
llma       162
Mistral    160
GPT        110
Claude      75
Name: count, dtype: int64
