Test code generation for different models

- Base model (phi-1.5)
- Teacher models
    - GPT-4
    - Code Llama (Hugging Face or OpenRouter)
- Benchmark on the HumanEval test set

Test unit-test generation for different models (maybe in a separate notebook)
- Requires an evaluation metric for the thoroughness of unit-test generation (e.g. out of the U total tests needed to reach 100% thoroughness / coverage, how many are produced by the model, and how many of those are also functionally correct)

- Base model
- Teacher models
- Model fine-tuned on code generation data
- Model fine-tuned on code generation data with evol-instruct
- Need to create our own benchmark here or look for an existing one
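
The thoroughness metric mentioned in the notes above could be sketched roughly as follows. This is only an illustration under assumptions: the reference suite is represented as a set of hypothetical behaviour ids (the U tests that together reach full coverage), and each generated test is assumed to have been mapped to the behaviour it exercises and checked against a known-correct solution beforehand. None of these names come from an existing benchmark.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class UnitTestGenScore:
    recall: float          # share of the U reference behaviours hit by generated tests
    correct_recall: float  # share hit by a functionally correct generated test


def score_unit_tests(reference: Set[str], generated: Dict[str, bool]) -> UnitTestGenScore:
    """Score generated tests against the U reference behaviours.

    reference: ids of the U behaviours that give full coverage (assumed given).
    generated: maps the behaviour id a generated test exercises to whether that
               test passes against a known-correct solution (assumed precomputed).
    """
    if not reference:
        raise ValueError("reference set must not be empty")
    hit = reference & set(generated)
    hit_and_correct = {b for b in hit if generated[b]}
    return UnitTestGenScore(
        recall=len(hit) / len(reference),
        correct_recall=len(hit_and_correct) / len(reference),
    )


# Hypothetical example: 4 reference behaviours, 3 generated tests
# (one of them exercises a behaviour outside the reference set).
score = score_unit_tests(
    {"empty", "single", "close_pair", "far_pair"},
    {"empty": True, "close_pair": False, "typo_case": True},
)
print(score)  # UnitTestGenScore(recall=0.5, correct_recall=0.25)
```

Reporting the two numbers separately keeps "the model thought of this case" apart from "the model wrote a valid test for it".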




# Evaluating code generation ability

This section evaluates models on the HumanEval test set. Each model's task is to generate a solution for a given problem description in the HumanEval format.
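
HumanEval results are usually reported as pass@k. The notebook does not name its target metric, so treating pass@k as that metric is an assumption. With n samples per task of which c pass the tests, the unbiased estimator is 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: number of samples that pass, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # fewer than k failing samples: any draw of k contains a passing one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score would then be the mean of `pass_at_k` over all tasks.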

Models

- Base model (phi-1.5)

### Base model (phi-1.5)

In [3]:
import torch


In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load phi-1.5 on the GPU; torch_dtype="auto" keeps the dtype stored in the checkpoint.
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5",
                                             trust_remote_code=True, torch_dtype="auto").to('cuda')
# The tokenizer holds no weights, so it does not need torch_dtype.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)


In [5]:
from human_eval.data import read_problems, write_jsonl
problems = read_problems()

In [6]:
print(problems['HumanEval/0']['prompt'])

from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """



In [7]:
def parse_response_phi15(response: str) -> str:
    """Extract the solution from the phi-1.5 model's response, as it often
    generates unrelated functions after the required solution.

    This could be improved further with more time.
    """
    # discard the original prompt as it is included in the response
    #response = response[len(prompt):]

    # keep everything up to (but excluding) the second 'def '
    def1_pos = response.find('def ')
    if def1_pos == -1:
        # no function at all: return the response unchanged instead of raising
        return response
    def2_pos = response.find('def ', def1_pos + 4)
    if def2_pos == -1:
        # only one function was generated
        def2_pos = len(response)
    return response[:def2_pos]

@torch.inference_mode()
def mix_generate(model, tokenizer, prompt, max_new_tokens: int = 512, num_sequences: int = 10):
    """Generate `num_sequences` candidate solutions for `prompt`.

    The greedy approach seems to be the most effective, while beam search
    is not supported by the model. The plan was to generate 1 sequence
    greedily and the rest by sampling; the greedy pass is currently
    commented out, so every sequence is sampled.
    """
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to('cuda')
    seqs = []
    # greedy generation (disabled)
    #outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
    #text = tokenizer.batch_decode(outputs)[0]
    #seqs.append(parse_response_phi15(text))
    # sampling generation
    if num_sequences >= 1:
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True,
                                 do_sample=True, top_k=3, num_return_sequences=num_sequences)
        seqs.extend([parse_response_phi15(text)
                     for text in tokenizer.batch_decode(outputs, skip_special_tokens=True)])

    return seqs
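
The docstring of `parse_response_phi15` notes that the truncation could be improved. One possible refinement, sketched here under assumptions, is to match only top-level `def` lines (so nested helpers and the word "def " inside a docstring are ignored) and to allow for prompts that already contain more than one function. The name `truncate_after_solution` is hypothetical and not part of the notebook.

```python
import re

# A top-level function starts with "def " at the beginning of a line.
_TOP_LEVEL_DEF = re.compile(r"^def ", flags=re.MULTILINE)


def truncate_after_solution(prompt: str, response: str) -> str:
    """Cut `response` just before the first top-level function that is
    not already part of `prompt`. Assumes `response` begins with `prompt`."""
    n_prompt_defs = len(_TOP_LEVEL_DEF.findall(prompt))
    starts = [m.start() for m in _TOP_LEVEL_DEF.finditer(response)]
    if len(starts) > n_prompt_defs:
        # starts[n_prompt_defs] is the first function the model added on its own
        return response[:starts[n_prompt_defs]]
    return response
```

For the common case of a prompt with a single function, this behaves like cutting at the second `def`, except that indented `def` lines are left alone.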

In [8]:
from tqdm import tqdm
import time

num_samples_per_task = 10
max_new_tokens = 300

start = time.time()
pbar = tqdm(total=len(problems) * num_samples_per_task)
samples = []
for task_id in problems:
    p = problems[task_id]['prompt']
    # request exactly num_samples_per_task completions so the progress bar stays in sync
    solutions = mix_generate(model, tokenizer, p,
                             max_new_tokens=max_new_tokens,
                             num_sequences=num_samples_per_task)
    samples.extend([dict(task_id=task_id, completion=solution) for solution in solutions])
    pbar.update(len(solutions))
pbar.close()

elapsed = time.time() - start

print('Total generation time: ', elapsed)

  0%|          | 0/1640 [00:00<?, ?it/s]

 20%|██        | 320/1640 [16:33<1:06:22,  3.02s/it]

RuntimeError: CUDA out of memory. Tried to allocate 560.00 MiB (GPU 0; 7.94 GiB total capacity; 6.69 GiB already allocated; 547.50 MiB free; 6.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
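
One way to lower peak memory, assuming the error above is driven by sampling all 10 sequences in a single `generate` call, is to request the sequences in smaller chunks. This is an untested sketch: `chunk_sizes` and `generate_in_chunks` are hypothetical helpers, and whether this is enough on an 8 GiB card has not been verified.

```python
from typing import Callable, Iterator, List


def chunk_sizes(total: int, max_per_call: int) -> Iterator[int]:
    """Yield batch sizes that sum to `total`, each at most `max_per_call`."""
    if total < 0 or max_per_call < 1:
        raise ValueError("total must be >= 0 and max_per_call >= 1")
    remaining = total
    while remaining > 0:
        size = min(max_per_call, remaining)
        yield size
        remaining -= size


def generate_in_chunks(generate_fn: Callable[[int], List[str]],
                       total: int, max_per_call: int = 2) -> List[str]:
    """Call `generate_fn(n)` repeatedly until `total` sequences are collected."""
    results: List[str] = []
    for n in chunk_sizes(total, max_per_call):
        results.extend(generate_fn(n))
    return results
```

In the generation loop this could be used as `generate_in_chunks(lambda n: mix_generate(model, tokenizer, p, max_new_tokens=max_new_tokens, num_sequences=n), total=num_samples_per_task)`, trading some throughput for a smaller batch per call.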

In [1]:
len(samples)

NameError: name 'samples' is not defined
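
The `NameError` shows that the 320 completions generated before the crash were lost once the kernel state was gone. A minimal sketch of checkpointing completions to a JSON Lines file after every task, so a rerun can resume, is given below. The helpers and the file name are hypothetical; they use only the standard library rather than the `write_jsonl` helper imported earlier.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def append_jsonl(path: Path, records: Iterable[Dict]) -> None:
    """Append one JSON object per line to `path`, creating the file if needed."""
    with path.open("a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_jsonl(path: Path) -> List[Dict]:
    """Read all records back; a missing file yields an empty list."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Inside the loop, `append_jsonl(Path("samples.jsonl"), new_samples)` could be called after each task. After a restart, `{r["task_id"] for r in load_jsonl(Path("samples.jsonl"))}` gives the tasks that can be skipped.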

In [55]:
for i in range(0, 5):
    p = problems[f'HumanEval/{i}']['prompt']
    inputs = tokenizer(p, return_tensors="pt", return_attention_mask=False).to('cuda')
    # sample a single completion per problem for a quick manual check
    outputs = model.generate(**inputs, max_new_tokens=300, use_cache=True,
                             do_sample=True, top_k=3)
    # skip special tokens, as in mix_generate
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(parse_response_phi15(text))
    print('=========================')

Prompt: 
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

Response: 
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if abs(numbers[i] - numbers[j]) < threshold:
                return True
    return False



Prompt: 
from typing import List


def separate_paren_groups(paren_string: str) -> List[str]:
    "