In [1]:

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
#from anthropic import Anthropic
from IPython.display import Markdown, display

from perplexity import Perplexity

In [2]:
load_dotenv(override=True)  # Loads the .env file

True

In [3]:
#openai_api_key = os.getenv('OPENAI_API_KEY')
#anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GEMINI_API_KEY')
#deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
#groq_api_key = os.getenv('GROQ_API_KEY')
perplexity_api_key = os.getenv('PERPLEXITY_API_KEY')

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

if perplexity_api_key:
    print(f"Perplexity API Key exists and begins {perplexity_api_key[:4]}") 
else:
    print("Perplexity API Key not set (and this is optional)")


Google API Key exists and begins AI
Perplexity API Key exists and begins pplx
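The check above prints the first characters of each key. A minimal sketch of the same check as a reusable helper follows; `describe_key` and its `visible` parameter are hypothetical names introduced here for illustration, not part of any library.

```python
def describe_key(name, value, visible=4):
    """Return a short status line for an API key without revealing the whole key."""
    if not value:
        return f"{name} not set"
    # Show only a short prefix plus the length, which is enough to spot a wrong key
    return f"{name} is set (starts with {value[:visible]}..., {len(value)} characters)"


print(describe_key("PERPLEXITY_API_KEY", "pplx-12345"))
print(describe_key("GEMINI_API_KEY", None))
```

In the notebook this could be called with `os.getenv(...)` values in place of the sample strings.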


In [4]:
request = (
    "Please come up with a challenging, nuanced question that I can ask "
    "a number of LLMs to evaluate their intelligence. "
    "Answer only with the question, no explanation, no table."
)

messages = [{"role": "user", "content": request}]

In [5]:
messages

[{'role': 'user',
  'content': 'Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation, no table.'}]

In [6]:

Perplexity_client = Perplexity(api_key=os.environ["PERPLEXITY_API_KEY"])

# ✅ Use the chat completions endpoint for AI-generated answers
response = Perplexity_client.chat.completions.create(
    model="sonar-pro",  # or "sonar" if you want cheaper/faster
    messages=messages,
)

question = response.choices[0].message.content

print("Generated Question:")
display(Markdown(f"**{question}**"))


Generated Question:


**A scientist presents a new theory that contradicts established physics but provides experimental data that partially supports it, while key findings remain unpublished. As an LLM, critically evaluate how you would approach assessing the credibility of this theory, what additional information you would seek, and how you would communicate the limits of your own analysis to a human researcher.**

In [7]:
competitors = []
answers = []
messages = [{"role": "user", "content": question}]
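
Each model cell below appends a model name to `competitors` and an answer to `answers`, and the judging cell later interpolates every answer into a prompt, so the entries need to be plain strings. A minimal sketch of a guard for that follows; `record_answer` is a hypothetical helper introduced here, not something the notebook or any SDK defines.

```python
def record_answer(competitors, answers, model_name, text):
    """Append a model name and its answer, rejecting anything that is not plain text."""
    if not isinstance(text, str):
        raise TypeError(
            f"Expected answer text for {model_name}, got {type(text).__name__}"
        )
    competitors.append(model_name)
    answers.append(text)


demo_competitors, demo_answers = [], []
record_answer(demo_competitors, demo_answers, "model-a", "An example answer.")
print(list(zip(demo_competitors, demo_answers)))
```

Passing an SDK response object instead of its text would raise immediately, rather than surfacing later as an odd-looking judge prompt.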

In [8]:
from google import genai
gemini_client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

model_name = "gemini-2.5-flash"

# `contents` accepts a plain string or a list of parts; a one-item list is used here
answer = gemini_client.models.generate_content(
    model=model_name, contents=[question])

display(Markdown(answer.text))
competitors.append(model_name)
# Store the text, not the response object, so the judge prompt receives the answer itself
answers.append(answer.text)

This scenario presents a classic challenge in science, where innovation clashes with established paradigms. As an LLM, my approach would be systematic, prioritizing the principles of the scientific method, transparency, and rigorous validation.

---

### Critical Evaluation of the Theory's Credibility (as an LLM)

My initial assessment would be driven by a high degree of skepticism, as any theory contradicting established physics – which is incredibly well-validated by centuries of experimentation and observation – requires extraordinary evidence. The partial support and unpublished key findings amplify this skepticism.

1.  **Initial Red Flags & Skepticism (High Priority):**
    *   **Contradiction to Established Physics:** This is the most significant hurdle. Theories like General Relativity and Quantum Mechanics have been tested to extreme precision across vast scales. A new theory must either replicate all their successes in relevant domains *and* explain observed deviations, or demonstrably provide a more accurate and comprehensive model without introducing inconsistencies.
    *   **"Partially Supports It":** This is vague and concerning. What percentage of predictions are supported? What about the unsupported aspects? Are there areas where the theory actively *fails* to explain existing data? This suggests incomplete or selective evidence presentation.
    *   **"Key Findings Remain Unpublished":** This is a major scientific ethical breach and a huge red flag for credibility. Scientific progress relies on peer review, scrutiny, and reproducibility. Withholding key data prevents independent verification, allows for unchecked bias, and can hide methodological flaws or outright fabrication. It suggests a lack of confidence in the findings' robustness or an attempt to control information flow for personal gain rather than scientific advancement.

2.  **Hypothetical Credibility Assessment (If Data Were Available):**
    If the *published* experimental data and theoretical framework were provided to me, I would assess its potential credibility by analyzing:
    *   **Internal Consistency:** Does the new theory's mathematical framework and conceptual underpinnings hold together logically? Are there internal contradictions within its postulates or predictions?
    *   **Falsifiability:** Does the theory make clear, testable predictions that could, in principle, prove it wrong? Non-falsifiable theories are not scientific.
    *   **Predictive Power (of published parts):** Does it accurately predict new phenomena that are then experimentally observed? How does its predictive power compare to existing theories within the scope of the "partially supported" data?
    *   **Explanatory Power (of published parts):** How well does it explain the observed phenomena for which data has been released? Does it offer a simpler or more elegant explanation than existing physics?
    *   **Methodology Scrutiny (of published parts):** I would analyze the description of the experiments for:
        *   **Controls:** Were appropriate controls used to rule out alternative explanations?
        *   **Sample Size & Statistical Significance:** Are the results statistically robust?
        *   **Bias:** Are there obvious sources of experimental bias?
        *   **Measurement Techniques:** Are the techniques described standard and reliable? Are potential systematic errors addressed?
    *   **Consistency with Known Physics (where applicable):** Does the theory reduce to established physics in regimes where established physics is known to be accurate? For instance, a new theory of gravity must still explain planetary orbits.

---

### Additional Information I Would Seek

The primary goal would be to move from skepticism rooted in lack of information to an informed judgment based on comprehensive data.

1.  **Full Access to All Data and Methodology (Especially the "Key Findings"):**
    *   **Unpublished Key Findings:** This is paramount. I would demand full, detailed disclosure of these findings, including raw data, experimental protocols, calibration procedures, error analysis, and statistical methods.
    *   **Complete Experimental Setup:** Detailed schematics, component lists, and environmental controls for *all* experiments, published and unpublished.
    *   **Raw Data:** Not just summary statistics, but the actual measurements to allow for independent analysis.
    *   **Error Analysis:** A thorough accounting of both systematic and random errors, and how they propagate through the results.
    *   **Statistical Methods:** The precise statistical tests used, justification for their choice, and interpretation of p-values, effect sizes, etc.

2.  **Peer Review and Publication Status:**
    *   **Submitted Manuscripts:** Has the theory and its supporting data (especially the "key findings") been submitted to reputable, peer-reviewed scientific journals? If so, what was the outcome? If rejected, what were the referees' criticisms?
    *   **Reviewer Comments:** Access to anonymized peer review comments, even for published work, would provide invaluable insight into expert concerns and the robustness of the responses.
    *   **Independent Replication:** Have any other independent research groups attempted to replicate the experiments? If so, with what results? If not, why not?

3.  **Theoretical Framework Details:**
    *   **Full Mathematical Formulation:** The complete mathematical structure of the new theory, including its axioms, postulates, and derivations of predictions.
    *   **Domain of Applicability:** A clear definition of where the theory is supposed to apply and where it makes new predictions.
    *   **Relationship to Existing Physics:** A detailed explanation of how the new theory either supersedes, extends, or coexists with established theories (e.g., how it handles phenomena explained by quantum mechanics or general relativity).
    *   **Novel Predictions:** What specific, testable predictions does the theory make that cannot be explained by existing physics, *beyond* the data already gathered?

4.  **Potential Biases and Conflicts of Interest:**
    *   **Funding Sources:** Who funded the research? Are there any potential conflicts of interest (e.g., a commercial entity funding research that could lead to a patented product)?
    *   **Scientist's Background:** While not a direct scientific evaluation, knowing the scientist's academic history, affiliations, and prior work can provide context regarding their reputation and expertise in the field.

5.  **Alternative Explanations:**
    *   Has the scientist rigorously explored and ruled out alternative explanations for their observed data that *do* align with established physics? This is crucial for demonstrating that the new theory is truly necessary.

---

### Communicating the Limits of My Own Analysis to a Human Researcher

It is crucial to be transparent about my capabilities and limitations as an LLM.

"As an AI, I can provide a structured, critical analysis based on the information I can process textually and numerically. However, my evaluation has inherent limitations that a human researcher *must* address for a comprehensive and truly scientific assessment.

Here's what I **can** do and what I **cannot** do:

**What I CAN Do (My Value Proposition):**

*   **Synthesize and Cross-Reference Information:** I can quickly compare the new theory's claims against the vast corpus of established physics and scientific literature.
*   **Identify Logical Inconsistencies:** I can flag contradictions within the theory's framework or between the data presented and its claims.
*   **Scrutinize Methodologies (Textually):** I can analyze the *description* of experimental designs, statistical methods, and error analyses for completeness, clarity, and adherence to standard practices, as long as they are articulated in text.
*   **Generate Comprehensive Questions:** I can formulate a detailed list of critical questions and required information, like the one above, to guide your investigation.
*   **Identify Potential Biases (Described):** If biases are explicitly stated or can be inferred from the provided text (e.g., funding sources, selective data presentation), I can highlight them.
*   **Assess Falsifiability and Predictive Power (Theoretically):** I can evaluate if the theory makes clear, testable predictions based on its stated tenets.

**What I CANNOT Do (My Critical Limitations):**

*   **Perform Experiments or Replications:** I cannot physically interact with the world, build equipment, or run tests. I cannot independently verify any experimental results.
*   **Exercise Human Intuition or Tacit Knowledge:** I lack the deep, nuanced understanding of experimental physics that comes from years of hands-on experience in a lab. I cannot spot subtle experimental flaws, equipment quirks, or data anomalies that an experienced human eye might.
*   **Detect Fabrication or Intentional Misleading:** While I can identify inconsistencies, I cannot determine if data has been fabricated, selectively chosen with malicious intent, or if findings are being withheld to mislead rather than for legitimate reasons (e.g., patenting).
*   **Engage in Human Peer Review:** I cannot provide the subjective, expert judgment, critical dialogue, and diverse perspectives that a human peer review process offers. I don't "understand" the cutting edge of research in the same way human experts do.
*   **Evaluate Unquantifiable Aspects of Science:** I cannot assess the ingenuity of an experimental setup, the difficulty of a measurement, or the *feel* of a compelling new concept that might inspire human researchers.
*   **Judge Reputational Factors:** While I can access information about a scientist's past publications, I cannot fully gauge their standing, integrity, or influence within the scientific community.

**In summary, consider my analysis a highly sophisticated textual and logical audit. It can illuminate areas of concern and guide your human investigation, but it absolutely does not replace the critical role of human scientific judgment, hands-on experimentation, and the rigorous, distributed process of scientific peer review and validation.** You, as the human researcher, must ultimately make the final scientific assessment based on direct interaction with the data, the scientist, and the broader scientific community."

In [9]:

Perplexity_client = Perplexity(api_key=os.environ["PERPLEXITY_API_KEY"])
model_name = "sonar-pro"  # or "sonar" if you want cheaper/faster
# ✅ Use the chat completions endpoint for AI-generated answers
response = Perplexity_client.chat.completions.create(
    model=model_name,
    messages=messages,
)

answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

A credible assessment of a new scientific theory that contradicts established physics but offers partial experimental support, while key findings remain unpublished, requires a rigorous, stepwise approach that weighs the **strength of empirical evidence, methodological transparency, and reproducibility**[1][2][3][5].

Evaluating Credibility
- **Consistency with Data:** Assess if the experimental evidence publicly available is robust, statistically sound, and replicable by independent groups. The reliability of single or preliminary studies is inherently limited[1][3].
- **Transparency and Completeness:** The absence of key unpublished findings is a serious limitation. Science demands full disclosure so other researchers can reproduce results and scrutinize methodologies[1].
- **Comparison to Existing Knowledge:** The more a new theory requires **fundamental changes to established science**, the more extraordinary and conclusive the supporting evidence must be. Incremental evidence is generally insufficient for overturning well-tested theories[2][6].
- **Treatment of Experimental Data:** Evaluate if proper controls, statistical analyses, and error assessments were used, and whether methodologies guard against bias or artifacts[3][10].
- **Self-consistency and Predictive Power:** Is the new theory logically coherent and able to make accurate, testable predictions beyond those explained by existing models[2][3]?

Additional Information Needed
- **Publication of All Results:** Access to the complete data set and methodologies, ideally subject to peer review, is essential for proper scientific assessment[1][3].
- **Replication by Independent Researchers:** Independent reproduction of results is a cornerstone of scientific validation[6].
- **Expert Evaluation:** Formal reviews by experts in the relevant fields would help judge whether the experimental evidence truly contradicts well-established physical laws or if alternative explanations exist[2][5].
- **Clarification of Novel Predictions:** Does the new theory make unique predictions that current models cannot explain, and have these been or can be tested?

Communicating the Limits of LLM Analysis
- As an LLM, I can only **assess the theory based on available, published information**, drawing on the methodology and standards within the scientific community[4][8]. I cannot independently verify data or conduct experiments.
- My analysis **cannot resolve disputes where key results remain unpublished**, nor substitute for empirical replication or expert peer review.
- While I can outline standards for evaluating scientific credibility (such as transparency, replicability, statistical rigor, and theoretical coherence), **my assessment is limited by the absence of crucial information**. Any conclusions must be considered provisional[1][2][3].
- I recommend human researchers apply these standards, seek out additional data, and consult domain experts for thorough evaluation.

In summary, while intriguing preliminary data may warrant further investigation, the credibility of a theory that challenges foundational science remains weak until **full evidence is openly published, subjected to peer review, and independently replicated**[1][2][6]. Without this, any LLM or human evaluation is intrinsically limited.
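
The answer above carries numeric citation markers such as `[1][2]`, but the sources they point to are not part of the text that gets stored in `answers`. One option, sketched below as an assumption rather than a required step, is to strip those markers before the answer is handed to a judge so they do not act as noise. `strip_citation_markers` is a hypothetical helper name.

```python
import re

# Matches bracketed numeric markers such as [1] or [10]
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citation_markers(text):
    """Remove bracketed numeric citation markers from an answer."""
    return CITATION_MARKER.sub("", text)


sample = "Independent replication is a cornerstone of scientific validation[6]. See also[1][2]."
print(strip_citation_markers(sample))
```

Whether to strip the markers is a judgement call: leaving them in shows the judge that the answer claimed sources, while removing them keeps the comparison focused on the prose.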

In [10]:
from ollama import Client

api_key = os.getenv("OLLAMA_API_KEY")
if not api_key:
    raise ValueError("⚠️ OLLAMA_API_KEY not found. Check your .env file.")

# Initialize Ollama client
client = Client(
    host="https://ollama.com",
    headers={'Authorization': f'Bearer {api_key}'}
)

# --- Your arrays already exist ---
# competitors = ["model1", "model2", "model3"]
# answers = ["answer from model1", "answer from model2", "answer from model3"]

# --- Build the judging prompt dynamically ---
judge_prompt = (
    "You are a neutral, strict evaluator. Below are answers from different models "
    "to the same question. Rank them from best to worst based on:\n"
    "- Accuracy and factual correctness\n"
    "- Clarity and coherence\n"
    "- Depth of reasoning and completeness\n\n"
)

for i, (model, answer) in enumerate(zip(competitors, answers), 1):
    judge_prompt += f"Model {i} ({model}):\n{answer}\n\n"

judge_prompt += (
    "Now give your ranking and reasoning in this format, one line per model:\n"
    "1. model_name - reason\n"
    "2. model_name - reason\n"
)

# --- Judge using Ollama ---
print("\n🧠 Evaluating models...\n")

response_text = ""
for part in client.chat("gpt-oss:120b", messages=[{'role': 'user', 'content': judge_prompt}], stream=True):
    content = part["message"]["content"]
    print(content, end="", flush=True)
    response_text += content


🧠 Evaluating models...

1. **gemini-2.5-flash** – The response is factually sound, accurately describing the scientific standards of reproducibility, peer review, falsifiability, and methodological rigor. Its reasoning is exceptionally thorough: it enumerates specific red-flags, outlines a concrete assessment framework, lists the exact data and documentation that would be required, and transparently explains the LLM’s own capabilities and limits. The prose is well-structured with headings, sub-po
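
The judging cell above asks for a free-text ranking and does not show the judge the original question. A minimal sketch of an alternative follows, assuming the judge can be asked for JSON: it includes the question, numbers the answers instead of naming the models, and maps the reply back to model names. `build_judge_prompt`, `parse_ranking`, and the `{"results": [...]}` shape are assumptions made for illustration, and a real judge reply may need cleanup before `json.loads` accepts it.

```python
import json


def build_judge_prompt(question, answers):
    """Assemble a judging prompt with the question and numbered, anonymous answers."""
    parts = [
        f"You are judging a competition between {len(answers)} competitors.",
        "Each was given this question:",
        "",
        question,
        "",
        "Rank the responses from best to worst for accuracy, clarity, and depth.",
        'Respond with JSON only, in the form {"results": ["best competitor number", ...]}.',
        "",
    ]
    for number, answer in enumerate(answers, 1):
        parts.append(f"# Response from competitor {number}\n\n{answer}\n")
    return "\n".join(parts)


def parse_ranking(judge_reply, competitors):
    """Map the judge's ranked competitor numbers back to model names."""
    ranked_numbers = json.loads(judge_reply)["results"]
    return [competitors[int(number) - 1] for number in ranked_numbers]


# Stand-in data; in the notebook these would be `question`, `answers`, and `competitors`
prompt = build_judge_prompt("What is 2 + 2?", ["Four.", "Probably five."])
ranking = parse_ranking('{"results": ["1", "2"]}', ["gemini-2.5-flash", "sonar-pro"])
print(ranking)
```

Numbering the answers rather than naming the models keeps the judge from favouring a familiar model name, and the JSON reply can be processed without parsing prose.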