# Rule Selection for Few-Shot Prompting
The code below prompts StarCoder2 (the `TechxGenus/starcoder2-3b-instruct` checkpoint) to pick, for each code block that must be refactored, the single most suitable refactoring rule from a fixed list of 16 rules.

In [1]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store

In [2]:
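# Quote the datasets requirement: an unquoted ">=" is treated by the shell as an output redirect.
pip install git+https://github.com/huggingface/transformers.git accelerate==0.27.1 "datasets>=2.16.1" bitsandbytes==0.41.3 peft==0.8.2 trl==0.7.10 wandb==0.16.3 huggingface_hub==0.20.3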

  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-c3tnt5gy


In [3]:
from datasets import load_dataset  # Hugging Face datasets package
mbpp = load_dataset("mbpp")  # train, validation, and test
humaneval = load_dataset("openai_humaneval")  # test only

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/9.06k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/87.2k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/116k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.88k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/374 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/90 [00:00<?, ? examples/s]

Generating prompt split:   0%|          | 0/10 [00:00<?, ? examples/s]

Downloading readme:   0%|          | 0.00/6.52k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/83.9k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/164 [00:00<?, ? examples/s]

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("TechxGenus/starcoder2-3b-instruct")

tokenizer_config.json:   0%|          | 0.00/7.88k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/777k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/442k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.06M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.33k [00:00<?, ?B/s]

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    "TechxGenus/starcoder2-3b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

config.json:   0%|          | 0.00/779 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/39.0k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

In [40]:
from google.colab import drive
import shutil


# This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [42]:
def select_rules(prompts):
  output_texts = []
  for p in prompts:
    code_block = p
    # Every rule sits on its own line. The "### Instruction" header comes from
    # PROMPT below, so it is not repeated inside rule_prompt.
    rule_prompt = (
      "You are an expert programmer. Given the code below, pick the most suitable refactoring rule from the list of rules copied below.\n"
      f"Code: {code_block}\n"
      "You MUST follow these requirements:\n"
      "1) Only pick one rule and 2) Only output the number of the rule you have chosen\n"
      "List of Rules:\n"
      "1. Use a formatted string.\n"
      "2. Use a built-in (e.g., radians) function.\n"
      "3. Use a logical operator instead of a nested if.\n"
      "4. Use a for-loop instead of a while-loop.\n"
      "5. Use list comprehension instead of a for-loop.\n"
      "6. Use the map function instead of list comprehension.\n"
      "7. Use a throwaway variable.\n"
      "8. Use the enumerate function instead of the range function.\n"
      "9. Use the zip function instead of the range function.\n"
      "10. Use a ternary operator instead of an if-branch.\n"
      "11. Merge repeated ifs.\n"
      "12. Merge dictionary assignments.\n"
      "13. Remove unnecessary calls to dict.items().\n"
      "14. Remove str() from calls to print().\n"
      "15. Flatten nested try.\n"
      "16. Convert any to in."
    )

    # Single-line template: a triple-quoted string indented inside the loop
    # would leak its leading spaces into the prompt.
    PROMPT = "### Instruction\n{instruction}\n### Response\n"
    prompt = PROMPT.format(instruction=rule_prompt)

    inputs = tokenizer.encode(prompt, return_tensors="pt")
    outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=2048)
    output_text = tokenizer.decode(outputs[0])
    output_dict = {'prompt': p, 'output': output_text}
    output_texts.append(output_dict)
    # print("OUTPUT :  ")
    # break

  return output_texts
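
The prompt assembled in `select_rules` can also be built from a plain sequence of rule strings, which keeps every rule on its own line and avoids numbering them by hand. The sketch below is one possible way to do that. `RULES`, `PROMPT_TEMPLATE`, and `build_rule_prompt` are illustrative names that do not appear in the notebook, and the template layout is an assumption that simply mirrors the `### Instruction` / `### Response` headers used above.

```python
# Illustrative helper (hypothetical names, not part of the original notebook):
# builds the rule-selection prompt so that each rule is on its own numbered line.
RULES = (
    "Use a formatted string.",
    "Use a built-in (e.g., radians) function.",
    "Use a logical operator instead of a nested if.",
    "Use a for-loop instead of a while-loop.",
    "Use list comprehension instead of a for-loop.",
    "Use the map function instead of list comprehension.",
    "Use a throwaway variable.",
    "Use the enumerate function instead of the range function.",
    "Use the zip function instead of the range function.",
    "Use a ternary operator instead of an if-branch.",
    "Merge repeated ifs.",
    "Merge dictionary assignments.",
    "Remove unnecessary calls to dict.items().",
    "Remove str() from calls to print().",
    "Flatten nested try.",
    "Convert any to in.",
)

# Assumed template: one instruction header, one response header.
PROMPT_TEMPLATE = "### Instruction\n{instruction}\n### Response\n"


def build_rule_prompt(code_block, rules=RULES):
    """Return the full prompt for one code block."""
    # Number the rules starting at 1 and join them with newlines.
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    instruction = (
        "You are an expert programmer. Given the code below, pick the most "
        "suitable refactoring rule from the list of rules copied below.\n"
        f"Code: {code_block}\n"
        "You MUST follow these requirements:\n"
        "1) Only pick one rule and 2) Only output the number of the rule you have chosen\n"
        "List of Rules:\n"
        f"{numbered}"
    )
    # str.format only parses the template, so braces inside code_block are safe.
    return PROMPT_TEMPLATE.format(instruction=instruction)


example_prompt = build_rule_prompt("x = {'a': 1}")
```

Keeping the rules in one sequence also means the upper bound used later when extracting the rule number can be derived from `len(RULES)` instead of being hard-coded.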



In [43]:
humaneval_prompts = []
for i in humaneval["test"]:
  concat_prompt = i['prompt'] + i['canonical_solution']
  humaneval_prompts.append(concat_prompt)

humaneval_outputs = select_rules(humaneval_prompts)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

In [44]:
import csv
def parse_results(data):
    results = []

    for item in data:
        # Keep only the two fields that are written to the CSV.
        results.append({'prompt': item['prompt'], 'output': item['output']})
    return results


def parsed_csv(file_path, results):
  keys = ['prompt', 'output']
  with open(file_path, 'w', newline='') as csvfile:
      writer = csv.DictWriter(csvfile, fieldnames=keys)
      writer.writeheader()
      for result in results:
          writer.writerow(result)
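
Each `prompt` value is a multi-line code block, so the CSV depends on quoting to keep embedded newlines and commas intact. As a small sanity check, the sketch below writes two made-up rows with `csv.DictWriter` and reads them back with `csv.DictReader`. The sample rows, the helper names `write_rows` and `read_rows`, and the temporary file are assumptions for illustration only.

```python
import csv
import os
import tempfile

# Made-up rows standing in for the output of select_rules.
sample_rows = [
    {"prompt": "def add(a, b):\n    return a + b\n", "output": "### Response\n1"},
    {"prompt": 'print("x, y")', "output": "### Response\n14"},
]


def write_rows(file_path, rows, keys=("prompt", "output")):
    # newline="" lets the csv module control line endings itself.
    with open(file_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=list(keys))
        writer.writeheader()
        writer.writerows(rows)


def read_rows(file_path):
    # newline="" is also needed on read so quoted newlines are preserved.
    with open(file_path, newline="", encoding="utf-8") as csvfile:
        return list(csv.DictReader(csvfile))


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rules.csv")
    write_rows(path, sample_rows)
    round_tripped = read_rows(path)
```

If `round_tripped` equals `sample_rows`, the multi-line code and the embedded comma survived the round trip unchanged.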

In [45]:
humaneval_results=parse_results(humaneval_outputs)
parsed_csv('/content/humaneval_rules.csv', humaneval_results)

In [47]:
# Define the path to the file in Colab and the destination path in Google Drive
source_file = '/content/humaneval_rules.csv'
destination_path = '/content/drive/My Drive/humaneval_rules.csv'

# Copy the file to Google Drive
shutil.copy(source_file, destination_path)

print(f'File copied to {destination_path}')


File copied to /content/drive/My Drive/humaneval_rules.csv


In [46]:
mbpp_prompts=[]
for i in mbpp["test"]:
  concat_prompt = i['code']
  mbpp_prompts.append(concat_prompt)

mbpp_outputs = select_rules(mbpp_prompts)
mbpp_results=parse_results(mbpp_outputs)
parsed_csv('/content/mbpp_rules.csv', mbpp_results)



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

In [48]:
# Define the path to the file in Colab and the destination path in Google Drive
source_file = '/content/mbpp_rules.csv'
destination_path = '/content/drive/My Drive/mbpp_rules.csv'

# Copy the file to Google Drive
shutil.copy(source_file, destination_path)

print(f'File copied to {destination_path}')

File copied to /content/drive/My Drive/mbpp_rules.csv


In [51]:
import pandas as pd
import re

# Load the CSV file
df = pd.read_csv('/content/mbpp_rules.csv')

# Function to extract the rule number from the output text
def extract_rule(output):
    # Extracting text after '### Response'
    response_text = output.split('### Response')[-1]
    # Finding the first number in the specified range
    found_numbers = re.findall(r'\b(?:1[0-6]|[1-9])\b', response_text)
    return found_numbers[0] if found_numbers else None

# Apply the function to the 'output' column
df['rule'] = df['output'].apply(extract_rule)

# Create a new DataFrame with the required columns including the original output
new_df = df[['prompt', 'output', 'rule']]

# Save the new DataFrame to a CSV file
new_df.to_csv('/content/filtered_mbpp_rules_with_output.csv', index=False)

print("New CSV file created with prompts, original outputs, and their corresponding rules.")


New CSV file created with prompts, original outputs, and their corresponding rules.
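
To make the extraction step concrete, the sketch below applies the same regular expression to a few made-up model outputs (they are not taken from the actual run). It differs from the cell above in two deliberate ways: it returns an `int` rather than a string, and it uses `re.search` because only the first match is needed. The name `extract_rule_number` is hypothetical.

```python
import re

# Matches a standalone number from 1 to 16; "1[0-6]" is tried first so that
# "12" is read as twelve rather than as a one followed by a two.
RULE_PATTERN = re.compile(r"\b(?:1[0-6]|[1-9])\b")


def extract_rule_number(output):
    """Return the first rule number found after the last '### Response', or None."""
    response_text = output.split("### Response")[-1]
    match = RULE_PATTERN.search(response_text)
    return int(match.group()) if match else None


# Made-up outputs for illustration only.
samples = [
    "### Instruction\nList of Rules:\n1. Use a formatted string.\n### Response\n5<|endoftext|>",
    "### Response\nThe best rule is 10.",
    "### Response\nRule 3 or 12",
    "### Response\n17",
    "### Response\nI cannot decide.",
]

extracted = [extract_rule_number(s) for s in samples]
# extracted == [5, 10, 3, None, None]
```

Numbers in the instruction part are ignored because only the text after `### Response` is searched. When the model names several rules, the first one wins, and anything outside 1 to 16 yields `None`.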


In [52]:
import pandas as pd
import re

# Load the CSV file
df = pd.read_csv('/content/humaneval_rules.csv')

# Function to extract the rule number from the output text
def extract_rule(output):
    # Extracting text after '### Response'
    response_text = output.split('### Response')[-1]
    # Finding the first number in the specified range
    found_numbers = re.findall(r'\b(?:1[0-6]|[1-9])\b', response_text)
    return found_numbers[0] if found_numbers else None

# Apply the function to the 'output' column
df['rule'] = df['output'].apply(extract_rule)

# Create a new DataFrame with the required columns including the original output
new_df = df[['prompt', 'output', 'rule']]

# Save the new DataFrame to a CSV file
new_df.to_csv('/content/filtered_humaneval_rules_with_output.csv', index=False)

print("New CSV file created for humaneval_rules with prompts, original outputs, and their corresponding rules.")


New CSV file created for humaneval_rules with prompts, original outputs, and their corresponding rules.
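
The two cells above differ only in their input and output paths, so the step can be wrapped in a single function. The sketch below is one way to do that; note that it swaps pandas for the standard-library `csv` module, and that `add_rule_column` is a hypothetical name, not part of the notebook. Unlike the pandas version, a missing rule is written as an empty string.

```python
import csv
import re

RULE_PATTERN = re.compile(r"\b(?:1[0-6]|[1-9])\b")


def extract_rule(output):
    # Only look at the text that follows the last '### Response' marker.
    response_text = output.split("### Response")[-1]
    match = RULE_PATTERN.search(response_text)
    return match.group() if match else ""


def add_rule_column(src_path, dst_path):
    """Copy a prompt/output CSV to dst_path with an extra 'rule' column.

    Returns the number of data rows written.
    """
    with open(src_path, newline="", encoding="utf-8") as src:
        rows = list(csv.DictReader(src))
    for row in rows:
        row["rule"] = extract_rule(row["output"])
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=["prompt", "output", "rule"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

With this helper, the MBPP and HumanEval files could each be processed with one call, for example `add_rule_column('/content/mbpp_rules.csv', '/content/filtered_mbpp_rules_with_output.csv')`.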
