# Module 3 Project 5: Automated Evaluation with DSPy + CoT
This project ties together everything covered in this module.
Pick a new domain and use DSPy to build the modules that specify the problem space.
Here we prototype a simple prompt-and-evaluation loop for a very basic software engineering agent using Mistral.

In [None]:
!pip install dspy-ai

## STEP 1: IMPORTS
- We need to import [DSPy](https://dspy-docs.vercel.app/docs/building-blocks/solving_your_task) and related pieces for optimization and evaluation
- We also need a `DataLoader` for our dataset, plus standard-library imports such as `math`, `re`, and `collections.Counter`

In [None]:
import dspy
from dspy.datasets import DataLoader
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
from dspy.evaluate import Evaluate

import math
from collections import Counter
import re

## STEP 2: LOAD THE MODEL
- We now load our LLM, which is [TheBloke's Mistral-7B-Instruct-v0.2-GPTQ](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GPTQ) running locally on a [TGI server](https://github.com/huggingface/text-generation-inference)
- We then configure DSPy to use this model as the default LM for every module in the program

In [None]:
# Client for the Mistral model served by the local TGI instance
mistral = dspy.HFClientTGI(model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ", port=8080, url="http://localhost")

dspy.settings.configure(lm=mistral)

## STEP 3: LOAD THE DATASET
- We will be using [CodeAlpaca_20K](https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K) for our dataset
- Each example consists of an English 'prompt' and a code section marked as 'completion'
- We mark 'prompt' as the input field of every example, which leaves 'completion' as the label
- We take the first 100 examples from each of the train and test splits

In [None]:
WORD = re.compile(r"\w+")

dl = DataLoader()

code_alpaca = dl.from_huggingface("HuggingFaceH4/CodeAlpaca_20K")

# Slice first so that only the 100 examples we keep are processed
train_dataset = [x.with_inputs('prompt') for x in code_alpaca['train'][:100]]
test_dataset = [x.with_inputs('prompt') for x in code_alpaca['test'][:100]]

## STEP 4: SIGNATURES
- Here we define [Signatures](https://dspy-docs.vercel.app/docs/building-blocks/signatures) for our program
- Instead of the shorthand string form such as "question -> answer", we define class-based Signatures whose fields carry descriptions
- DSPy includes these descriptions in the prompt it builds for each step
- This lets us steer the model towards the task required at each step of the program
- The steps we want our 'coder' to take are:

```
1. Problem Analysis
2. Solution Generation
3. Generate Candidate Solutions
4. Rank Generated Solutions
5. Generate Test for Top Solution
6. Validate Code/Test Pair
7. Iterate and Improve
```
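
Steps 6 and 7 together form a bounded retry loop: validate the code/test pair, and keep improving the code until it passes or the attempt budget runs out. The following is a minimal sketch of that control flow only, with plain callables standing in for the LM-backed steps. The names `refine_until_valid`, `toy_judge`, and `toy_improve` are hypothetical and not part of DSPy.

```python
from typing import Callable, Tuple


def refine_until_valid(
    code: str,
    test: str,
    judge: Callable[[str, str], bool],
    improve: Callable[[str, str], str],
    max_attempts: int = 5,
) -> Tuple[str, bool, int]:
    """Validate a code/test pair, then improve the code until it passes
    or the attempt budget is used up."""
    passed = judge(code, test)
    attempts = 0
    while not passed and attempts < max_attempts:
        code = improve(code, test)
        passed = judge(code, test)
        attempts += 1
    return code, passed, attempts


# Hypothetical stand-ins for the LM-backed judge and improver.
def toy_judge(code: str, test: str) -> bool:
    return "return" in code


def toy_improve(code: str, test: str) -> str:
    return code + " return a + b"


final_code, passed, attempts = refine_until_valid(
    "def add(a, b):", "assert add(1, 2) == 3", toy_judge, toy_improve
)
print(final_code, passed, attempts)
```

Capping the number of attempts matters because an LM judge may never return a passing verdict, and an unbounded loop would then run forever.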

In [None]:
class Problem(dspy.Signature):
    prompt = dspy.InputField(desc='The prompt section of the given Example')
    analysis = dspy.OutputField(desc="Simple instructions on how to solve this problem in code")

class Solution(dspy.Signature):
    analysis = dspy.InputField()
    code_solution = dspy.OutputField(desc="A code solution in a single line with no newlines enclosed in backticks answering the input")

class SolutionWithAttempt(dspy.Signature):
    analysis = dspy.InputField()
    previous_attempt = dspy.InputField()
    code_solution = dspy.OutputField(desc="A code solution in a single line with no newlines enclosed in backticks answering the input that is different than the provided attempt")

class Ranker(dspy.Signature):
    code_solutions = dspy.InputField()
    top_ranked_code_solution = dspy.OutputField(desc="Sort the given code solutions by clarity and syntax and return the top one no other text")

class Test(dspy.Signature):
    top_solution = dspy.InputField()
    test = dspy.OutputField(desc="A test for the given code solution in a single line with no newlines enclosed in backticks")

class Judge(dspy.Signature):
    code = dspy.InputField()
    test = dspy.InputField()
    answer = dspy.OutputField(desc="True/False whether the code solution and accompanying test are syntactically correct and if the test will pass when ran no other text")

class Improve(dspy.Signature):
    code = dspy.InputField()
    test = dspy.InputField()
    top_ranked_code_solution = dspy.OutputField(desc="Improve the given code solution according to the provided test and return the improved code solution no other text")

## STEP 5: MODULE
- Now that we've created our signatures, we can build our [Module](https://dspy-docs.vercel.app/docs/building-blocks/modules)
- We walk through the steps listed above, wrapping the analysis step in ChainOfThought (CoT) so that it also produces a rationale
- We define our initialization to build our internal modules with [Predict and ChainOfThought](https://dspy-docs.vercel.app/docs/building-blocks/modules#how-do-i-use-a-built-in-module-like-dspypredict-or-dspychainofthought)
- We define our `forward` step to walk through our modules and pass inputs in accordingly
- We generate 4 candidate solutions initially, and give the model up to 5 attempts to improve on the best solution
- The 'criterion' for validation here is simply another LM call that returns `True/False`. Since this is a prototype and demonstration, that is acceptable, but in production we would need better evaluation metrics
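
Because the judge's verdict is free text from the model, it has to be parsed before it can drive the loop. A plain substring check for "true" would also accept a reply such as "untrue". The sketch below is one stricter option, assuming the reply contains a standalone `True` or `False` somewhere; `parse_verdict` is a hypothetical helper, not a DSPy function.

```python
import re
from typing import Optional

# Match "true" or "false" only as a whole word, in any letter case.
_VERDICT = re.compile(r"\b(true|false)\b", re.IGNORECASE)


def parse_verdict(answer: str) -> Optional[bool]:
    """Return the first standalone True/False in the reply, or None if there is none."""
    match = _VERDICT.search(str(answer))
    if match is None:
        return None
    return match.group(1).lower() == "true"


print(parse_verdict("True"))
print(parse_verdict("False. The test would fail."))
print(parse_verdict("untrue"))
```

Returning `None` for an unparseable reply lets the caller decide whether to treat it as a failure or to ask again. This is still a heuristic: a reply such as "not true" would be read as a pass.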

In [None]:
class NightwingCoder(dspy.Module):
    def __init__(self):
        super().__init__()
        self.problem_reflector = dspy.ChainOfThought(Problem)
        self.solution_generator = dspy.Predict(Solution)
        self.solution_improver = dspy.Predict(SolutionWithAttempt)
        self.rank_solutions = dspy.Predict(Ranker)
        self.generate_test = dspy.Predict(Test)
        self.evaluate_solution = dspy.Predict(Judge)
        self.improve_solution = dspy.Predict(Improve)

    def forward(self, prompt):
        # Analysis
        problem_reflection = self.problem_reflector(prompt=prompt)
        print(problem_reflection.rationale)


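        # Solution: the first candidate, generated from the chain-of-thought rationale
        solution = self.solution_generator(analysis=problem_reflection.rationale)
        prev = solution.code_solution

        # Candidate Solutions: keep the first attempt and add 3 alternatives,
        # each asked to differ from the previous attempt (4 candidates in total)
        solutions = prev + "\n"
        for _ in range(3):
            solution = self.solution_improver(analysis=problem_reflection.rationale, previous_attempt=prev)
            print(solution)
            solutions += solution.code_solution + "\n"
            prev = solution.code_solution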

        # Rank Candidates
        top_ranked = self.rank_solutions(code_solutions=solutions)
        print(top_ranked)

        # Test Generation
        test = self.generate_test(top_solution=top_ranked.top_ranked_code_solution)
        print(test)

        # Evaluate
        valid = self.evaluate_solution(code=top_ranked.top_ranked_code_solution, test=test.test)
        finished = "true" in str(valid.answer).lower()
        attempts = 0

        # Improve + evaluate loop
        while not finished and attempts < 5:
            top_ranked = self.improve_solution(code=top_ranked.top_ranked_code_solution, test=test.test)
            valid = self.evaluate_solution(code=top_ranked.top_ranked_code_solution, test=test.test)
            finished = "true" in str(valid.answer).lower()

            attempts += 1

        print(top_ranked.top_ranked_code_solution)

        return top_ranked

## STEP 6: VALIDATION METRIC
- As mentioned above, this validation metric is simplistic and does not fully capture 'success' here
- Fully running and evaluating code in every language is beyond the scope of this module
- Instead, we calculate our 'metric' as the `cosine similarity` between the word-count vectors of the predicted code and the reference completion
- If the similarity is above 0.5, we treat the two strings as similar enough to count as a match
- This gives us a very ROUGH idea of accuracy over our dataset (again, this is proof-of-concept)
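
To see how rough this metric is, here is a small worked example. It is a standalone sketch of bag-of-words cosine similarity; `cosine_similarity` is a hypothetical helper name, and the two snippets are made up for illustration.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    vec_a = Counter(WORD.findall(text_a))
    vec_b = Counter(WORD.findall(text_b))
    # Dot product over the words the two texts share
    numerator = sum(vec_a[w] * vec_b[w] for w in set(vec_a) & set(vec_b))
    # Product of the two vector norms
    norm_a = math.sqrt(sum(c ** 2 for c in vec_a.values()))
    norm_b = math.sqrt(sum(c ** 2 for c in vec_b.values()))
    denominator = norm_a * norm_b
    return numerator / denominator if denominator else 0.0


reference = "def add(a, b): return a + b"
renamed = "def add(x, y): return x + y"

print(round(cosine_similarity(reference, reference), 2))  # prints 1.0
print(round(cosine_similarity(reference, renamed), 2))    # prints 0.27
```

The second pair is functionally identical code with renamed variables, yet it scores 3/11 and falls well below the 0.5 threshold. The metric rewards shared tokens, not correct behavior.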

In [None]:
def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

def validate_code(example, pred, trace=None):
    """Return True when the prediction and the reference have a word-count cosine similarity above 0.5."""
    vec1 = text_to_vector(pred.top_ranked_code_solution)
    vec2 = text_to_vector(example.completion)

    # Dot product over the words the two texts share
    intersection = set(vec1) & set(vec2)
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the two vector norms
    sum1 = sum(count ** 2 for count in vec1.values())
    sum2 = sum(count ** 2 for count in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty text gives a zero norm, so fall back to 0.0 instead of dividing by zero
    sim = numerator / denominator if denominator else 0.0

    print(sim)

    return sim > 0.5

## STEP 7: TEST THE MODEL
- Now that our program is built, we can run it before compiling with DSPy to see the raw output
- We pass in a sample prompt for a Python Fibonacci program
- We then print the output

In [None]:
uncompiled = NightwingCoder()

output = uncompiled(prompt="Create a Python program to calculate the Fibonacci sequence")

print(output.top_ranked_code_solution)

## STEP 8: COMPILE
- Now that everything is working, we can compile with our [Optimizer](https://dspy-docs.vercel.app/docs/building-blocks/optimizers)
- [BootstrapFewShotWithRandomSearch](https://dspy-docs.vercel.app/docs/building-blocks/optimizers#how-do-i-use-an-optimizer) is a reasonable optimizer for this use case: the DSPy optimizer guide suggests it once you have around 50 or more examples, and we have 100 training examples
- We pass in our `validate_code` metric from above, which returns a boolean validating each prediction against the data's ground truth
- We set `max_bootstrapped_demos` and `max_labeled_demos` to 4, so each prompt can include up to 4 bootstrapped demonstrations and up to 4 labeled examples from the training set
- We then prompt our program again to compare its output with the uncompiled run
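
The "random search" part of the optimizer's name can be illustrated with a toy example: try several randomly chosen sets of demonstrations, score each one, and keep the best. The sketch below is a conceptual illustration under that assumption only. It is not DSPy's implementation, and `random_search_demos`, `toy_pool`, and `toy_score` are hypothetical names.

```python
import random
from typing import Callable, List, Sequence, Tuple


def random_search_demos(
    pool: Sequence[str],
    score: Callable[[List[str]], float],
    k: int = 4,
    num_candidates: int = 8,
    seed: int = 0,
) -> Tuple[List[str], float]:
    """Sample several demo sets of size k and return the best-scoring one."""
    rng = random.Random(seed)
    best_demos: List[str] = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        demos = rng.sample(list(pool), k=min(k, len(pool)))
        candidate_score = score(demos)
        if candidate_score > best_score:
            best_demos, best_score = demos, candidate_score
    return best_demos, best_score


# Hypothetical demo pool and scoring function. In the real optimizer the score
# would come from running the program with these demos against the metric.
toy_pool = ["def a(): pass", "x = 1", "def b(): pass", "y = 2", "def c(): pass", "z = 3"]


def toy_score(demos: List[str]) -> float:
    return float(sum(1 for demo in demos if demo.startswith("def")))


best_demos, best_score = random_search_demos(toy_pool, toy_score)
print(best_demos, best_score)
```

Fixing the seed keeps the search reproducible, which makes it easier to compare runs while tuning `k` or the number of candidates.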

In [None]:
config = dict(max_bootstrapped_demos=4, max_labeled_demos=4)

teleprompter = BootstrapFewShotWithRandomSearch(metric=validate_code, **config)
optimized_cot = teleprompter.compile(NightwingCoder(), trainset=train_dataset)

output = optimized_cot(prompt="Create a Python program to calculate the Fibonacci sequence")

print(output.top_ranked_code_solution)

## STEP 9: EVALUATE
- Finally, we want to [evaluate](https://dspy-docs.vercel.app/docs/cheatsheet#dspy-evaluation) our model
- We pass the test set as the `devset`, reuse our validation metric from above, and set `num_threads` to 1
- We then run the evaluation loop and analyze the output
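
Conceptually, with a boolean metric the evaluation score is the percentage of devset examples for which the metric returns `True`. The sketch below shows that aggregate with plain Python stand-ins. `percent_passing`, `toy_devset`, `toy_program`, and `exact_match` are hypothetical, and calling the program with `example.prompt` directly is a simplification of how DSPy passes inputs.

```python
from types import SimpleNamespace


def percent_passing(program, devset, metric):
    """Run the program on every example and return the share that passes the metric, as a percentage."""
    if not devset:
        return 0.0
    passed = sum(bool(metric(example, program(example.prompt))) for example in devset)
    return round(100.0 * passed / len(devset), 2)


# Hypothetical stand-ins: a two-example devset, a program that always returns
# the same code, and an exact-match metric.
toy_devset = [
    SimpleNamespace(prompt="add two numbers", completion="a + b"),
    SimpleNamespace(prompt="multiply two numbers", completion="a * b"),
]


def toy_program(prompt):
    return SimpleNamespace(top_ranked_code_solution="a + b")


def exact_match(example, pred, trace=None):
    return pred.top_ranked_code_solution == example.completion


print(percent_passing(toy_program, toy_devset, exact_match))  # prints 50.0
```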

In [None]:
# test_dataset already had its inputs marked with with_inputs('prompt') in STEP 3
evaluate = Evaluate(
    devset=test_dataset,
    metric=validate_code,
    num_threads=1,
    display_progress=True,
    display_table=0,
)

evaluate(optimized_cot)