<a href="https://colab.research.google.com/github/rymarinelli/Number_Of_Thoughts/blob/main/Number_Of_Thought_Labeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Research Plan

1. Label a dataset of prompts with an estimate of how many steps are needed to solve each prompt.

We will need two math datasets:
1A. One dataset so that a model can learn to do the classification.
1B. A second dataset to evaluate how the classification affects results.


*   https://huggingface.co/datasets/meta-math/MetaMathQA
*   https://huggingface.co/datasets/TIGER-Lab/MathInstruct


2. **Train a classifier to look at a prompt and predict how many steps it would require to solve.**

It might be interesting to use the classifier for routing: easy questions go to a smaller model and harder questions go to a bigger model. Try different versions of the model and report whether inference speed and accuracy can be optimized.

3. **Test a Chain-of-Thought model informed by the classifier.**
   3A. Compare with Chain-of-Thought without the classifier, letting the model take as many steps as needed.
   3B. Compare with the model without Chain-of-Thought.
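
The routing idea in step 2 could look roughly like the sketch below. It is only an illustration: the `Route` class, the `route_by_predicted_steps` function, the threshold of 5 steps, and the model labels `"small-model"` and `"large-model"` are all hypothetical and are not part of this notebook.

```python
from dataclasses import dataclass


@dataclass
class Route:
    model_name: str           # hypothetical label for the target model
    max_reasoning_steps: int  # step budget that could be passed to the prompt


def route_by_predicted_steps(predicted_steps: int, threshold: int = 5) -> Route:
    """Send easy prompts to a small model and hard prompts to a large one.

    The threshold and the model labels are illustrative assumptions.
    """
    if predicted_steps < 0:
        raise ValueError("predicted_steps must be non-negative")
    if predicted_steps <= threshold:
        return Route(model_name="small-model", max_reasoning_steps=predicted_steps)
    return Route(model_name="large-model", max_reasoning_steps=predicted_steps)


print(route_by_predicted_steps(3))
print(route_by_predicted_steps(12))
```

In a real experiment the threshold would presumably be tuned against the accuracy and latency measurements rather than fixed in advance.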





In [None]:
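# Install torch together with the pinned transformers version in one step
!pip install torch transformers==4.37.2
!pip install flash_attn
!pip install rouge-score
!pip install pynvml
!pip install --upgrade huggingface_hub
!pip install datasets

import locale
import os
locale.getpreferredencoding = lambda: "UTF-8"
# `!export` runs in a throwaway subshell, so set the variables on this process instead
os.environ["LC_ALL"] = "en_US.UTF-8"
os.environ["LANG"] = "en_US.UTF-8"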




In [None]:
import logging
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

try:
    tokenizer = AutoTokenizer.from_pretrained(
        "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16
    ).cuda()
    logging.info("Model and tokenizer loaded successfully.")
except Exception:
    logging.exception("Failed to load the model and tokenizer.")
    # Re-raise so the notebook cell fails visibly instead of exiting the kernel
    raise



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
import pandas as pd

df = pd.read_json("hf://datasets/TIGER-Lab/MathInstruct/MathInstruct.json")

df = df.rename(columns={'instruction': 'question'})

# Keep the first 100 rows; copy so later column assignments do not touch a slice view
df = df.iloc[:100].copy()

# Calculate Number of Steps Used

In [None]:
import re
import logging
from tqdm import tqdm
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import yaml
import numpy as np


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.FileHandler("process.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)


def create_step_listing_prompt(question: str) -> str:
    """
    Formats the prompt to instruct the model to list reasoning steps.
    """
    return (
        "Please provide a detailed, step-by-step solution to the following problem. "
        "Number each step sequentially.\n\n"
        f"Problem:\n{question}\n\n"
        "Solution Steps:"
    )


def generate_step_list_response(
    question: str,
    tokenizer,
    model,
    device,
    max_length: int = 500,
    temperature: float = 0.7,
    top_p: float = 0.95
) -> str:
    """
    Generates a response with a numbered list of steps to solve the problem.
    """
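    prompt = create_step_listing_prompt(question)
    # Tokenize with an attention mask so generate() does not have to guess it
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    try:
        with torch.no_grad():
            output_ids = model.generate(
                **inputs,
                max_length=max_length,  # Counts prompt tokens plus generated tokens
                temperature=temperature,
                top_p=top_p,
                do_sample=True,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.eos_token_id
            )
        # Decode only the newly generated tokens so the prompt text is not counted as steps
        new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)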
    except Exception as e:
        logger.error(f"Error generating response for question: {question}\nError: {e}")
        return ""


def count_explicit_steps(response: str) -> int:
    """
    Count explicit numbered steps (e.g., 'Step 1', '1.', '1)') in the response.
    """
    if not isinstance(response, str):
        return 0

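    pattern = re.compile(r"^\s*(?:Step\s*(\d+)|(\d+)[.)])", re.MULTILINE | re.IGNORECASE)
    # Deduplicate on the step number so "Step 1", "step 1", and "1." count once
    step_numbers = {step or number for step, number in pattern.findall(response)}
    return len(step_numbers)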


def count_reasoning_keywords(response: str) -> int:
    """
    Count reasoning keywords as fallback for thought counting.
    """
    if not isinstance(response, str):
        return 0

    keywords = [
        "first", "second", "third", "next", "then",
        "after that", "finally", "therefore", "thus", "so", "hence", "lastly"
    ]
    pattern = re.compile(r"\b(" + "|".join(keywords) + r")\b", re.IGNORECASE)
    return len(pattern.findall(response))


def count_steps_combined(response: str) -> int:
    """
    Combines explicit step counting with keyword-based reasoning detection.
    """
    explicit_steps = count_explicit_steps(response)
    if explicit_steps > 0:
        return explicit_steps

    return count_reasoning_keywords(response)


def save_responses_to_yaml(df: pd.DataFrame, yaml_file_path: str = "responses.yaml"):
    """
    Saves the thought process (responses) to a YAML file.
    """
    try:
        responses_dict = {}
        for idx, row in df.iterrows():
            question_number = f"Question_{idx + 1}"
            responses_dict[question_number] = {
                "question": row.get("question", ""),
                "response": row.get("response", ""),  # Updated to use "response" column
                "thought_count": row.get("thought_count", 0),
            }

        with open(yaml_file_path, "w", encoding="utf-8") as file:
            yaml.dump(responses_dict, file, sort_keys=False, allow_unicode=True)

        logger.info(f"Responses successfully saved to '{yaml_file_path}'.")
    except Exception as e:
        logger.error(f"Failed to save responses to YAML. Error: {e}")
        raise


def process_questions(
    df: pd.DataFrame,
    tokenizer,
    model,
    device,
    output_csv: str = "dataset_with_step_counts.csv",
    output_yaml: str = "responses.yaml"
):
    """
    Processes questions in the DataFrame, generates stepwise responses, and counts steps.
    """
    thought_counts_list = []
    all_responses = []

    num_runs = 2  # Number of sampling passes averaged into thought_count_avg
    for run in range(num_runs):
        logger.info(f"Run {run + 1}/{num_runs}: Generating stepwise responses.")
        new_responses = []
        thought_counts = []

        for idx, question in tqdm(df['question'].items(), total=df.shape[0], desc="Processing Questions"):
            response = generate_step_list_response(question, tokenizer, model, device)
            if not response:
                logger.warning(f"No response generated for question at index {idx}.")
            count = count_steps_combined(response)
            new_responses.append(response)
            thought_counts.append(count)
            # Log progress only; dumping every response on each iteration floods the log
            logger.info("Run %d: %d responses generated so far.", run + 1, len(new_responses))

        thought_counts_list.append(thought_counts)
        all_responses = new_responses  # Use responses from the last run

    # Store the final responses and thought counts in the DataFrame
    df['response'] = all_responses
    df['thought_count'] = thought_counts_list[-1]
    df['thought_count_avg'] = np.mean(thought_counts_list, axis=0)

    df.to_csv(output_csv, index=False)
    logger.info(f"Updated DataFrame with averages saved to '{output_csv}'.")

    # Save the thought processes to a YAML file
    save_responses_to_yaml(df, yaml_file_path=output_yaml)


def display_statistics(df: pd.DataFrame):
    """
    Displays basic statistics on thought counts in the DataFrame.
    """
    average_thoughts = df['thought_count_avg'].mean()
    median_thoughts = df['thought_count_avg'].median()
    min_thoughts = df['thought_count_avg'].min()
    max_thoughts = df['thought_count_avg'].max()
    std_thoughts = df['thought_count_avg'].std()

    print("\nStatistics on Average Thought Counts:")
    print(f"Average number of thoughts per question: {average_thoughts:.2f}")
    print(f"Median number of thoughts per question: {median_thoughts}")
    print(f"Minimum number of thoughts: {min_thoughts}")
    print(f"Maximum number of thoughts: {max_thoughts}")
    print(f"Standard Deviation: {std_thoughts:.2f}")


def main():
    output_csv = "dataset_with_step_counts.csv"
    output_yaml = "responses.yaml"

    process_questions(df, tokenizer, model, "cuda", output_csv, output_yaml)
    display_statistics(df)


if __name__ == "__main__":
    main()


Processing Questions: 100%|██████████| 2/2 [00:32<00:00, 16.31s/it]
Processing Questions: 100%|██████████| 2/2 [00:32<00:00, 16.02s/it]


Statistics on Average Thought Counts:
Average number of thoughts per question: 11.00
Median number of thoughts per question: 11.0
Minimum number of thoughts: 4.5
Maximum number of thoughts: 17.5
Standard Deviation: 9.19
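
Since the research plan calls for a classifier, the averaged counts could be bucketed into discrete classes before training. The sketch below is one possible way to do that with `pandas.cut`. The bin edges, the labels, the `step_class` column name, and the `add_step_class` helper are hypothetical, and the value 8.0 in the example is made up for illustration (4.5 and 17.5 are the minimum and maximum reported above).

```python
import pandas as pd

# Hypothetical bin edges; in practice they would be chosen from the observed distribution.
STEP_BINS = [0, 5, 10, float("inf")]
STEP_LABELS = ["short", "medium", "long"]


def add_step_class(frame: pd.DataFrame, column: str = "thought_count_avg") -> pd.DataFrame:
    """Return a copy of `frame` with a categorical `step_class` column."""
    labeled = frame.copy()
    labeled["step_class"] = pd.cut(
        labeled[column], bins=STEP_BINS, labels=STEP_LABELS, include_lowest=True
    )
    return labeled


example = pd.DataFrame({"thought_count_avg": [4.5, 17.5, 8.0]})
labeled_example = add_step_class(example)
print(labeled_example)
```

Treating the target as a handful of classes rather than a raw count is a design choice; a regression target would be an equally reasonable alternative to try.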





# Calculate Additional Metrics

In [None]:
import pandas as pd
import time
import torch
from rouge_score import rouge_scorer
from transformers import AutoTokenizer, AutoModelForCausalLM
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetPowerUsage, nvmlShutdown

# Initialize pynvml for GPU power monitoring
nvmlInit()

# Get GPU device handle (assuming a single GPU is used)
gpu_handle = nvmlDeviceGetHandleByIndex(0)


scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Initial estimate of power consumption.
def compute_metrics_with_power(df: pd.DataFrame, tokenizer, model, device):
    """
    Compute ROUGE scores, latency, and estimated power consumption for each row in the DataFrame.

    Args:
        df (pd.DataFrame): DataFrame containing questions and outputs.
        tokenizer: Pre-trained tokenizer.
        model: Pre-trained language model.
        device: Device to run inference on (e.g., "cuda" or "cpu").

    Returns:
        pd.DataFrame: Updated DataFrame with ROUGE scores, latency, and power consumption.
    """
    rouge_1_scores = []
    rouge_2_scores = []
    rouge_L_scores = []
    latencies = []
    power_usages = []

    for _, row in df.iterrows():
        # The dataset's answer is the ROUGE reference; the question is the model input
        reference = row['output']
        input_text = row['question']

        # Start timer for latency and sample power draw before inference
        start_time = time.time()
        power_start = nvmlDeviceGetPowerUsage(gpu_handle) / 1000.0  # Milliwatts to watts

        # Perform model inference, passing the attention mask and pad token explicitly
        inputs = tokenizer(input_text, return_tensors="pt").to(device)
        with torch.no_grad():
            generated_ids = model.generate(
                **inputs,
                max_length=512,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.eos_token_id,
            )
        # Decode only the new tokens so the echoed prompt does not affect ROUGE
        new_tokens = generated_ids[0][inputs["input_ids"].shape[-1]:]
        generated_output = tokenizer.decode(new_tokens, skip_special_tokens=True)

        # End timer and measure power consumption again
        end_time = time.time()
        power_end = nvmlDeviceGetPowerUsage(gpu_handle) / 1000.0  # Power usage in watts

        # Calculate latency and average power usage
        latency = end_time - start_time
        average_power = (power_start + power_end) / 2  # Estimate average power usage during inference

        # Compute ROUGE scores
        scores = scorer.score(reference, generated_output)
        rouge_1_scores.append(scores['rouge1'].fmeasure)
        rouge_2_scores.append(scores['rouge2'].fmeasure)
        rouge_L_scores.append(scores['rougeL'].fmeasure)
        latencies.append(latency)
        power_usages.append(average_power)

    # Add metrics to the DataFrame
    df['rouge_1'] = rouge_1_scores
    df['rouge_2'] = rouge_2_scores
    df['rouge_L'] = rouge_L_scores
    df['latency'] = latencies
    df['average_power_watts'] = power_usages

    return df



df = compute_metrics_with_power(df, tokenizer, model, "cuda")

# Shutdown pynvml after use
nvmlShutdown()


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.

                             source  \
0            data/CoT/aqua_rat.json   
1            data/CoT/aqua_rat.json   
2   data/PoT/aqua_rat_filtered.json   
3             data/CoT/gsm_rft.json   
4   data/PoT/aqua_rat_filtered.json   
..                              ...   
95      data/CoT/math50k_camel.json   
96           data/CoT/aqua_rat.json   
97          data/CoT/gsm_train.json   
98          data/CoT/gsm_train.json   
99      data/CoT/math50k_camel.json   

                                             question  \
0   The distance between two stars is 6.52 × 10^5 ...   
1   How many ways can the letters in the word COMM...   
2   A team of six entered for a shooting competiti...   
3   A psychiatrist has 4 patients that need 25 ses...   
4   The radius of a wheel is 22.4 cm. What is the ...   
..                                                ...   
95  What is the smallest Sophie Germain prime grea...   
96  Last year Department Store X had a sales total...   
97  A farmer hires
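
The `latency` and `average_power_watts` columns can be combined into a rough per-question energy figure, since energy in joules is power in watts multiplied by time in seconds. The sketch below assumes those two column names from the cell above; the `add_energy_estimate` helper, the `energy_joules` column, and the numbers in `demo` are hypothetical and are not measurements from this notebook.

```python
import pandas as pd


def add_energy_estimate(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with energy (joules) = average power (watts) * latency (seconds)."""
    result = frame.copy()
    result["energy_joules"] = result["average_power_watts"] * result["latency"]
    return result


# Illustrative numbers only, not measurements from this notebook.
demo = pd.DataFrame({"latency": [2.0, 4.0], "average_power_watts": [50.0, 60.0]})
print(add_energy_estimate(demo)["energy_joules"].tolist())
```

Because the power figure is the mean of only two instantaneous samples, the resulting energy value should be read as a coarse estimate.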

# Write Out File to Hugging Face


In [None]:

from huggingface_hub import login  # create_repo and upload_file were imported but never used
from datasets import Dataset
login(token='<<TOKEN_HUGGING_FACE>>')
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("Chain_Of_Thought_Count")
print("Pandas DataFrame has been pushed as a Hugging Face Dataset!")



Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Pandas DataFrame has been pushed as a Hugging Face Dataset!
