In [3]:
import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import display, Markdown

In [4]:
# Load environment variables from the .env file
load_dotenv()

True

In [14]:
openai_api_key = os.getenv("OPENAI_API_KEY")
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")
deepseek_api_key = os.getenv("DEEPSEEK_API_KEY")
groq_api_key = os.getenv("GROQ_API_KEY")
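
Before calling any provider, it can help to confirm which keys actually loaded. The following is a minimal sketch, not part of the original notebook: the helper name `describe_key` and its `prefix_len` parameter are hypothetical, and it prints only a short prefix so the full secret stays out of the cell output.

```python
import os


def describe_key(name, prefix_len=4):
    """Return a short status line for an environment variable holding an API key.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.getenv(name)
    if not value:
        return f"{name}: not set"
    # Show only the first few characters so the full key is never printed.
    return f"{name}: set (begins {value[:prefix_len]}...)"


for key_name in [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "DEEPSEEK_API_KEY",
    "GROQ_API_KEY",
]:
    print(describe_key(key_name))
```

A key reported as "not set" here would cause the matching provider call further down to fail with an authentication error.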


In [6]:
request = "Please frame a question that I can ask different LLMs to evaluate their performance on a code generation task."
request += " Answer only with the question and no explanation."
messages = [{"role": "user", "content": request}]

In [7]:
openai_client = OpenAI()
response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=messages
)
question = response.choices[0].message.content
display(Markdown(f"### Question generated by GPT-4:\n{question}"))

### Question generated by GPT-4:
"How accurately can you convert high-level directions into executable codes, and perform the task of debugging and code optimization?"

In [8]:
competitors = []
answers = []
messages = [{"role": "user", "content": question}]

In [9]:
model_name = "gpt-4o-mini"
response = openai_client.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content
competitors.append(model_name)
answers.append(answer)
display(Markdown(f"### Answer from {model_name}:\n{answer}"))

### Answer from gpt-4o-mini:
I can help convert high-level directions into executable code across a variety of programming languages, and I can offer support for debugging and code optimization. Here’s how I can assist with each task:

### 1. Code Conversion:
- **Understanding Requirements**: Provide me with clear high-level specifications or requirements.
- **Language Choice**: Specify the programming language you want the code to be written in.
- **Functionality**: I'll generate the code that implements the required functionality.

### 2. Debugging:
- **Error Identification**: You can provide code snippets along with error messages or describe the behavior of the code.
- **Suggested Fixes**: I will suggest potential fixes or improvements and explain the reasons behind them.

### 3. Code Optimization:
- **Performance Analysis**: Share the code you want to optimize, and I’ll analyze it for performance bottlenecks.
- **Optimized Solutions**: I can suggest more efficient algorithms, data structures, or practices to make the code run faster and use resources more effectively.

### Limitations:
- While I can provide insights and code suggestions, the execution of code and real-time debugging can only be performed in your local development environment. Additionally, the success of debugging often depends on context that might not be fully captured in code snippets.

Feel free to provide any specific requirements or code you'd like help with!

In [12]:
gemini_client = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.0-flash"
response = gemini_client.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content
competitors.append(model_name)
answers.append(answer)
display(Markdown(f"### Answer from {model_name}:\n{answer}"))

### Answer from gemini-2.0-flash:
The accuracy with which I can convert high-level directions into executable code, and perform debugging and code optimization, is a complex question that depends heavily on the following factors:

**Factors Influencing Accuracy and Performance:**

*   **Specificity and Clarity of the High-Level Directions:** The more precise and unambiguous the instructions, the better I can understand the desired outcome and translate it into code. Vague or abstract directions lead to more assumptions and potential for errors.  For example, "Create a website" is too broad.  "Create a website with a homepage displaying product images and a contact form, using React" is much better.
*   **Complexity of the Task:** Simple, well-defined tasks are generally easy to translate and debug. Complex tasks with intricate logic, multiple dependencies, and unclear edge cases are more challenging.
*   **Domain Familiarity:** I perform better when the task falls within areas where I have a large amount of training data. For example, general web development, common data manipulation tasks, or standard algorithms are easier than niche scientific computing or specialized hardware control.
*   **Availability of Existing Libraries and Frameworks:** When existing libraries and frameworks can be leveraged, the coding process becomes more efficient and reliable.  If the task requires building everything from scratch, it's significantly more difficult.
*   **Target Programming Language:** My proficiency varies across different programming languages. I am generally stronger in commonly used languages like Python, JavaScript, Java, C++, and Go. Less common languages or specialized dialects may be more challenging.
*   **The Nature of the Debugging and Optimization Requirements:** Debugging logic errors is often easier than debugging performance bottlenecks.  Similarly, simple code style improvements are easier than complex algorithmic optimizations.

**General Capabilities and Limitations:**

*   **Code Generation:** I can generate code snippets, complete functions, or even entire programs based on high-level instructions. I can leverage code patterns and best practices learned during training.
*   **Debugging:** I can identify potential errors in code, suggest fixes, and provide explanations. I can analyze code for syntax errors, logical flaws, and potential security vulnerabilities. I can help with debugging by suggesting test cases to uncover specific errors.
*   **Code Optimization:** I can suggest improvements to code for better performance, readability, and maintainability. This includes identifying and addressing performance bottlenecks, reducing code complexity, and improving code style.  I can sometimes suggest more efficient algorithms or data structures.
*   **Testing:** I can generate basic unit tests based on function definitions. However, I am not a replacement for thorough, human-driven testing.
*   **Documentation:** I can generate documentation for code, including docstrings and comments.

**Specific Considerations:**

*   **Accuracy is Not Guaranteed:** While I strive to generate correct and efficient code, I am not infallible. The generated code may contain errors or not fully meet the intended requirements. It's crucial to review and test the code thoroughly.
*   **Human Oversight is Essential:** I am a tool to assist developers, not to replace them. Human expertise is required to refine requirements, validate the generated code, and handle complex debugging and optimization tasks.
*   **Creativity and Innovation:** I excel at applying existing knowledge and patterns, but I may struggle with tasks requiring novel solutions or creative problem-solving. Human ingenuity remains essential for pushing the boundaries of what's possible.

**In Summary:**

My ability to accurately convert high-level directions into executable code, debug, and optimize code is strong, but it's heavily dependent on the factors listed above. I am a valuable tool for accelerating development, but I require human oversight and validation to ensure the code is correct, efficient, and meets the intended requirements.  Consider me a highly skilled and knowledgeable junior programmer who needs guidance and review from a senior developer. My accuracy is higher on well-trodden paths than on uncharted territory.  I'm constantly learning and improving, but I'm not yet a perfect substitute for a human programmer.


In [16]:
groq_client = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
model_name = "llama-3.3-70b-versatile"
response = groq_client.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content
competitors.append(model_name)
answers.append(answer)
display(Markdown(f"### Answer from {model_name}:\n{answer}"))

### Answer from llama-3.3-70b-versatile:
Converting high-level directions into executable codes and performing debugging and code optimization are crucial skills in software development. Here's how I can assist with these tasks:

**Converting High-Level Directions into Executable Codes:**

1. **Understanding the Requirements**: I can comprehend high-level directions, including natural language descriptions, flowcharts, or pseudocode, to identify the desired functionality and requirements.
2. **Code Generation**: I can generate code in various programming languages, such as Python, Java, C++, or JavaScript, based on the provided directions. My code generation capabilities include:
	* Variable and data structure declaration
	* Control structures (if-else, loops, conditional statements)
	* Functions and methods
	* Object-Oriented Programming (OOP) concepts
3. **Algorithm Implementation**: I can implement algorithms, such as sorting, searching, or graph traversal, to solve specific problems or achieve desired outcomes.
4. **Code Review and Refinement**: I can review the generated code to ensure it meets the requirements and is consistent with best practices, and refine it as needed.

**Debugging:**

1. **Error Identification**: I can identify errors, such as syntax errors, runtime errors, or logical errors, in the generated code.
2. **Error Analysis**: I can analyze the errors to determine their causes and provide recommendations for correction.
3. **Code Correction**: I can modify the code to fix errors, ensuring that the corrected code meets the original requirements.
4. **Testing and Verification**: I can assist with testing and verifying the corrected code to ensure it produces the expected output.

**Code Optimization:**

1. **Performance Analysis**: I can analyze the code to identify performance bottlenecks, such as inefficient algorithms or data structures.
2. **Optimization Techniques**: I can apply optimization techniques, such as caching, memoization, or parallel processing, to improve code performance.
3. **Code Refactoring**: I can refactor the code to improve readability, maintainability, and scalability, while ensuring that the optimized code remains functionally equivalent to the original code.
4. **Best Practices**: I can ensure that the optimized code adheres to best practices, such as following coding standards, using design patterns, and minimizing technical debt.

**Limitations and Constraints:**

While I can generate code, debug, and optimize it to some extent, there are limitations and constraints to my abilities:

1. **Complexity**: I may struggle with extremely complex codebases or requirements that involve a deep understanding of domain-specific knowledge.
2. **Ambiguity**: I may require clarification or additional context to accurately interpret high-level directions or resolve ambiguous requirements.
3. **Language and Framework Limitations**: My proficiency in specific programming languages and frameworks may vary, and I may not always be up-to-date with the latest developments or best practices.

By understanding these capabilities and limitations, you can leverage my assistance to convert high-level directions into executable codes, debug, and optimize code, while also recognizing the potential need for human oversight and expertise in certain situations.

In [17]:
!ollama pull llama3.2

[?2026h[?25l[1Gpulling manifest ⠋ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠙ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠹ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠸ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠼ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠴ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠦ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠧ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠇ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠏ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠋ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest [K
pulling dde5aa3fc5ff:   0% ▕                  ▏ 637 KB/2.0 GB                  [K[?25h[?2026l[?2026h[?25l[A[1Gpulling manifest [K
pulling dde5aa3fc5ff:   0% ▕                  ▏ 1.0 MB/2.0 GB                  [K[?25h[?2026l[?2026h[?25l[A[1Gpulling manifest [K
pulling dde5aa3fc5ff:   0% ▕                  ▏ 1.5 MB/2.0 GB                  [K[?25h[?

In [18]:
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "llama3.2"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

I can understand natural language instructions and generate code in various programming languages to implement them. However, the accuracy of the conversion depends on the complexity and specificity of the instructions.

I can write code in a variety of programming languages, including Python, Java, JavaScript, C++, and others. My ability to write code is based on pattern recognition and machine learning algorithms that allow me to generate code based on the structure and syntax of the language.

When it comes to debugging and optimization, I can:

1. **Identify potential issues**: I can examine the code for syntax errors, logical flaws, and performance bottlenecks.
2. **Provide suggestions**: I can offer suggestions for improvements, such as optimizing algorithms, reducing memory usage, or improving code structure.
3. **Offer optimization techniques**: I can provide guidance on various optimization techniques, including caching, memoization, parallel processing, and others.
4. **Refactor code**: I can help refactor code to make it more efficient, readable, and maintainable.

However, my abilities have limitations:

1. **Lack of domain expertise**: I may not always understand the nuances of a specific domain or industry, which can limit my ability to provide optimal solutions.
2. **Insufficient context**: If the instructions are vague or lack sufficient context, it can be challenging for me to generate accurate and effective code.
3. **Overly complex tasks**: Some tasks may require human intuition, creativity, or expertise that is difficult for AI systems like myself to replicate.

To ensure the accuracy of my conversions, debugging, and optimization efforts, feel free to provide more specific instructions, provide additional context, or ask me to clarify any doubts you have about the code.
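
The four calls above share one shape: create a chat completion, read the first choice, and record the model name and answer. A possible way to factor that out is sketched below. The names `ask_model` and `collect_answers` are hypothetical, and the sketch assumes every client exposes the OpenAI-compatible `chat.completions.create` method used in the cells above.

```python
def ask_model(client, model_name, messages):
    """Send messages to an OpenAI-compatible chat endpoint and return the reply text."""
    response = client.chat.completions.create(model=model_name, messages=messages)
    return response.choices[0].message.content


def collect_answers(clients_and_models, messages):
    """Query each (client, model_name) pair in order.

    Returns two parallel lists, (competitors, answers), in the same layout
    the notebook builds by hand.
    """
    competitors, answers = [], []
    for client, model_name in clients_and_models:
        # Append the name only after the call succeeds, so the lists stay aligned.
        answers.append(ask_model(client, model_name, messages))
        competitors.append(model_name)
    return competitors, answers
```

With the clients defined in this notebook, a call might look like `collect_answers([(openai_client, "gpt-4o-mini"), (gemini_client, "gemini-2.0-flash"), (groq_client, "llama-3.3-70b-versatile"), (ollama, "llama3.2")], messages)`.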

In [22]:
display(Markdown(f"### Competitors {competitors}"))
display(Markdown(f"### Answers {len(answers)}"))


### Competitors ['gpt-4o-mini', 'gemini-2.0-flash', 'llama-3.3-70b-versatile', 'llama3.2']

### Answers 4

In [24]:
for competitor, answer in zip(competitors, answers):
    print(f"\n\nCompetitor: {competitor}\n\n{answer}")



Competitor: gpt-4o-mini

I can help convert high-level directions into executable code across a variety of programming languages, and I can offer support for debugging and code optimization. Here’s how I can assist with each task:

### 1. Code Conversion:
- **Understanding Requirements**: Provide me with clear high-level specifications or requirements.
- **Language Choice**: Specify the programming language you want the code to be written in.
- **Functionality**: I'll generate the code that implements the required functionality.

### 2. Debugging:
- **Error Identification**: You can provide code snippets along with error messages or describe the behavior of the code.
- **Suggested Fixes**: I will suggest potential fixes or improvements and explain the reasons behind them.

### 3. Code Optimization:
- **Performance Analysis**: Share the code you want to optimize, and I’ll analyze it for performance bottlenecks.
- **Optimized Solutions**: I can suggest more efficient algorithms, data s

In [25]:
together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"
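
The same string can be built in one expression, which avoids repeated concatenation and makes the numbering explicit. This is only an equivalent sketch of the loop above; the function name `combine_answers` is hypothetical.

```python
def combine_answers(answers):
    """Concatenate answers under numbered headings, as the loop above does."""
    return "".join(
        f"# Response from competitor {index}\n\n{answer}\n\n"
        for index, answer in enumerate(answers, start=1)
    )
```

The headings carry numbers rather than model names, so the judge ranks the responses without seeing which model wrote each one.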

In [26]:
print(together)

# Response from competitor 1

I can help convert high-level directions into executable code across a variety of programming languages, and I can offer support for debugging and code optimization. Here’s how I can assist with each task:

### 1. Code Conversion:
- **Understanding Requirements**: Provide me with clear high-level specifications or requirements.
- **Language Choice**: Specify the programming language you want the code to be written in.
- **Functionality**: I'll generate the code that implements the required functionality.

### 2. Debugging:
- **Error Identification**: You can provide code snippets along with error messages or describe the behavior of the code.
- **Suggested Fixes**: I will suggest potential fixes or improvements and explain the reasons behind them.

### 3. Code Optimization:
- **Performance Analysis**: Share the code you want to optimize, and I’ll analyze it for performance bottlenecks.
- **Optimized Solutions**: I can suggest more efficient algorithms, dat

In [27]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""


In [28]:
print(judge)

You are judging a competition between 4 competitors.
Each model has been given this question:

"How accurately can you convert high-level directions into executable codes, and perform the task of debugging and code optimization?"

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}

Here are the responses from each competitor:

# Response from competitor 1

I can help convert high-level directions into executable code across a variety of programming languages, and I can offer support for debugging and code optimization. Here’s how I can assist with each task:

### 1. Code Conversion:
- **Understanding Requirements**: Provide me with clear high-level specifications or requirements.
- **Language Choice**: Specify the programming language you want the code 

In [29]:
judge_messages = [{"role": "user", "content": judge}]

In [30]:
# Reuse the client created earlier rather than binding a new one to the name `openai`.
response = openai_client.chat.completions.create(
    model="o3-mini",
    messages=judge_messages,
)
results = response.choices[0].message.content
print(results)

{"results": ["2", "3", "1", "4"]}


In [31]:
results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")

Rank 1: gemini-2.0-flash
Rank 2: llama-3.3-70b-versatile
Rank 3: gpt-4o-mini
Rank 4: llama3.2
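
The parsing step above trusts the judge to return bare JSON in which each competitor number appears exactly once. A more defensive version is sketched below. The function name `parse_ranking` is hypothetical, and the fence-stripping step rests on the assumption that a judge model may wrap its reply in a markdown code fence despite the instructions.

```python
import json

# Built this way so the example contains no literal fence.
FENCE = "`" * 3


def parse_ranking(raw, n_competitors):
    """Parse the judge's reply into a list of zero-based competitor indexes.

    Raises ValueError if the ranking is not a permutation of 1..n_competitors.
    """
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) and,
        # if present, the closing fence line.
        lines = text.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[1:-1]
        else:
            lines = lines[1:]
        text = "\n".join(lines)
    ranks = [int(r) - 1 for r in json.loads(text)["results"]]
    # Every competitor must appear exactly once.
    if sorted(ranks) != list(range(n_competitors)):
        raise ValueError(f"ranking is not a permutation of 1..{n_competitors}: {raw!r}")
    return ranks
```

Given the notebook's variables, `for place, idx in enumerate(parse_ranking(results, len(competitors)), start=1): print(f"Rank {place}: {competitors[idx]}")` would print the same table as the cell above.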
