## Using LLMs for NLI

Rather than splitting the text into sentences and running a simple NLI model on each, let's try using an open-source LLM to do the NLI directly.

Initial experimental context is taken from the `nli_test.ipynb` notebook.

In [None]:
from dotenv import load_dotenv
import os
import numpy as np
import pandas as pd

# Load environment variables
load_dotenv()
DATA_PATH = os.getenv('DATA_PATH')

We want the model to be open source, powerful, and fast. The easiest way to get all three is to use hosted inference for a high-quality open model. For convenience, we also want a JSON mode, which several API providers support, so that the output is easier to parse.

In [None]:
from together import Together
import json
from pydantic import BaseModel, Field

client = Together(api_key=os.getenv('TOGETHER_API_KEY'))

# The following models on Together support JSON mode.
together_models = {'llama-3.1-8b-instruct': 'meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo',
                   'llama-3.1-70b-instruct': 'meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo'}

class Opinion(BaseModel):
    perspective: str
    explicit_matches: list[str]
    implicit_matches: list[str]
    coverage_score: int  # 0-10 for this specific point

class EvaluationStep(BaseModel):
    step_number: int
    analysis: str
    findings: list[str]

class RepresentationAnalysis(BaseModel):
    opinion_points: list[Opinion]
    evaluation_steps: list[EvaluationStep]
    final_score: int  # 0-10 overall score
    reasoning: str  # Brief explanation of final score

system_prompt = """Task: Evaluate how well an opinion is represented in a response through careful step-by-step analysis.

Follow these specific steps in your evaluation:
1. First, break down the core claims/points in the opinion
2. For each point in the opinion:
   - Search for explicit mentions in the response
   - Look for implicit/paraphrased representations
   - Note any contradictions
3. Consider the overall alignment:
   - How many points are covered?
   - How directly are they addressed?
   - Are there any misalignments?
4. Score the representation from 0-10 where:
   - 0: Complete contradiction or no representation
   - 1-3: Minimal/weak representation of few points
   - 4-6: Partial representation of main points
   - 7-9: Strong representation of most points
   - 10: Complete and explicit representation of all points
"""

def is_my_opinion_represented_structured_cot(question, response, opinion, model='llama-3.1-8b-instruct'):
    """
    Determine if the opinion is represented in the response to a question, using structured CoT generation.
    """
   
    prompt = f"""Question: {question}
Response: {response}
Opinion to check for: {opinion}

Analyze step-by-step following the instructions, then provide your structured evaluation."""
    
    completion = client.chat.completions.create(
        model=together_models[model],
        messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": prompt}],
        response_format={
            "type": "json_object",
            "schema": RepresentationAnalysis.model_json_schema(),
        },
    )
    # JSON mode should return valid JSON, but show the raw text if parsing fails.
    try:
        output = json.loads(completion.choices[0].message.content)
    except json.JSONDecodeError:
        print(f"Failed to parse output for model {model}")
        print(completion.choices[0].message.content)
        raise

    return output

In [None]:
print(is_my_opinion_represented_structured_cot(
    question="Should smoking be banned in public places?",
    response="Smoking should be banned in public places because it is bad for your health and it is a nuisance to others. I just really don't like it!",
    opinion="Smoking is bad for your health.",
    model='llama-3.1-70b-instruct',
))
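
The function returns a plain dict, and JSON mode does not guarantee that every field is present or that the score stays in the 0-10 range. Below is a minimal sketch of how we might validate the output against the same Pydantic schema and turn `final_score` into a yes/no decision. The helpers `parse_analysis` and `is_represented`, the threshold of 7 (taken from the "strong representation" band of the rubric), and the hand-written `sample_output` are all assumptions for illustration, not real model output. The schema classes are repeated so the snippet runs on its own, and it assumes Pydantic v2.

```python
from pydantic import BaseModel, ValidationError

# Schema classes repeated from the cell above so this snippet is self-contained.
class Opinion(BaseModel):
    perspective: str
    explicit_matches: list[str]
    implicit_matches: list[str]
    coverage_score: int

class EvaluationStep(BaseModel):
    step_number: int
    analysis: str
    findings: list[str]

class RepresentationAnalysis(BaseModel):
    opinion_points: list[Opinion]
    evaluation_steps: list[EvaluationStep]
    final_score: int
    reasoning: str

def parse_analysis(output: dict):
    """Hypothetical helper: return a validated RepresentationAnalysis, or None on failure."""
    try:
        analysis = RepresentationAnalysis.model_validate(output)
    except ValidationError as err:
        print(f"Output did not match the schema: {err}")
        return None
    # The schema only says `int`, so check the 0-10 range from the prompt by hand.
    if not 0 <= analysis.final_score <= 10:
        print(f"final_score out of range: {analysis.final_score}")
        return None
    return analysis

def is_represented(analysis: RepresentationAnalysis, threshold: int = 7) -> bool:
    """Hypothetical decision rule: treat scores at or above the threshold as represented."""
    return analysis.final_score >= threshold

# Hand-written sample in the expected shape (NOT real model output).
sample_output = {
    "opinion_points": [{
        "perspective": "Smoking is bad for your health.",
        "explicit_matches": ["it is bad for your health"],
        "implicit_matches": [],
        "coverage_score": 9,
    }],
    "evaluation_steps": [{
        "step_number": 1,
        "analysis": "The opinion contains a single claim about health.",
        "findings": ["The response states the claim almost verbatim."],
    }],
    "final_score": 9,
    "reasoning": "The single claim is stated explicitly in the response.",
}

analysis = parse_analysis(sample_output)
if analysis is not None:
    print(analysis.final_score, is_represented(analysis))
```

Returning `None` rather than raising keeps a batch run going when a single response is malformed. Whether 7 is the right cutoff would need to be checked against labelled examples.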