# Running Weave Scorers
This notebook walks you through loading and calling [Weave Scorers](https://weave-docs.wandb.ai/guides/evaluation/scorers/). It also shows how to use them in a Weave Evaluation and as a Weave Guardrail.

**To learn more about how these local model scorers were trained and evaluated, see [this W&B Report](https://wandb.ai/c-metrics/weave-scorers/reports/Weave-Scorers-V0-1--VmlldzoxMDQ0MDE1OA).**

Note: This notebook runs best on an L4 GPU or higher.

# Setup

## Installation & Login

In [None]:
!pip uninstall -y weave && pip install "git+https://github.com/wandb/weave.git#egg=weave[scorers]"

In [None]:
# Hide Hugging Face auth warnings
import os
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='huggingface_hub.utils._auth')

### Log in to Weights & Biases and start Weave

In [None]:
import weave

PROJECT_NAME = "local-weave-scorers"

weave.init(PROJECT_NAME)

# Weave Scorers

## Initialising scorers with local models
These local models are downloaded from W&B Artifacts when a scorer is initialised:

```python
from weave.scorers import WeaveHallucinationScorerV1

hallu_scorer = WeaveHallucinationScorerV1()
```

## Running scorers

All Weave scorers are called through the `.score` method, which takes the parameters that the specific scorer requires.

```python
scores = hallu_scorer.score(
  query="What is the capital of Antarctica?",
  context="Penguins love Antarctica.",
  output="The capital of Antarctica is Quito"
)
```

## Example - Running a single Scorer
Here we run the hallucination scorer on a single example.

In [None]:
from weave.scorers import WeaveHallucinationScorerV1

hallucination_scorer = WeaveHallucinationScorerV1()

In [None]:
result = hallucination_scorer.score(
  query="What is the capital of Antarctica?",
  context="People in Antarctica love the penguins.",
  output="While Antarctica is known for its sea life, penguins aren't liked there."
)

print(f"Output is hallucinated: {not result.passed}")
print(result)

## Example - Running an Eval with 2 Weave Scorers

For a full understanding of Weave Evaluations please see the [Evaluation documentation here](https://weave-docs.wandb.ai/guides/core-types/evaluations)

In [None]:
import weave

# Dummy data
data = [
    {"user_input": "Are people from Mars the worst?", "docs": "The people of Mars are great"},
    {"user_input": "Is London a great city?", "docs": "London has many achievements including the best nightlife in Europe"},
    {"user_input": "What is the weather today?", "docs": "Yesterday was cold and rainy."},
]

# A dummy AI model that upper-cases the query and lower-cases the docs
@weave.op
def my_ai_model(user_input: str, docs: str):
  "Return the uppercased query and the lowercased context."
  retrieved_context = docs.lower()
  return {"query": user_input.upper(), "retrieved_context": retrieved_context}

# Calling the dummy model
my_ai_model(user_input=data[0]["user_input"], docs=data[0]["docs"])

Let's evaluate our data using 2 scorers, `WeaveBiasScorerV1` and `WeaveContextRelevanceScorerV1`. We first adapt each scorer to our model's output and then download its model weights.

### Customising your Scorer for Evaluations

Sometimes when running a Weave Evaluation you need to modify the signature of your scorer's `score` method so that it works with the outputs of your model.

For example, `WeaveBiasScorerV1.score` expects a string for its `output` parameter, but our AI model outputs a dict.

To pass the `"query"` string from the model's output dict to the scorer, subclass `WeaveBiasScorerV1` and override `score` so that it extracts the value for `"query"` and passes it to the parent's `output` parameter:

In [None]:
from weave.scorers import WeaveBiasScorerV1, WeaveContextRelevanceScorerV1

class NewWeaveBiasScorer(WeaveBiasScorerV1):
  @weave.op
  def score(self, output: dict):
    "Score the `query` string taken from the model's dict output."
    return super().score(output=output["query"])

We do the same for `WeaveContextRelevanceScorerV1`, which expects a `query` parameter and an `output` parameter, where `output` is the context:


In [None]:
class NewWeaveContextRelevanceScorer(WeaveContextRelevanceScorerV1):
  @weave.op
  def score(self, output: dict):
    "Score the `retrieved_context` from the model's dict output against its `query`."
    return super().score(query=output["query"], output=output["retrieved_context"])
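
Because these adapter subclasses only remap arguments, the remapping logic can be checked without downloading any model weights. The sketch below is a minimal stand-in and not Weave code: `StubBiasScorer` and `DictOutputBiasScorer` are hypothetical names, and the stub's "flag the word terrible" rule is invented purely so that the example runs.

```python
class StubBiasScorer:
    """Hypothetical stand-in for a scorer whose `score` expects a string `output`."""

    def score(self, output: str) -> dict:
        if not isinstance(output, str):
            raise TypeError("output must be a string")
        # Toy rule for illustration only; a real scorer would call a model here.
        return {"passed": "terrible" not in output.lower()}


class DictOutputBiasScorer(StubBiasScorer):
    """Adapter: accepts the model's dict output and forwards only the `query` string."""

    def score(self, output: dict) -> dict:
        return super().score(output=output["query"])


# Shape of the dict returned by the dummy model in this notebook.
model_output = {
    "query": "IS LONDON A GREAT CITY?",
    "retrieved_context": "london has many achievements",
}
adapted_result = DictOutputBiasScorer().score(output=model_output)
```

The same override pattern applies to any scorer whose `score` parameters do not line up with the structure of your model's output.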

Now let's initialise the scorers, which downloads the model weights:

In [None]:
bias_scorer = NewWeaveBiasScorer()
context_relevance_scorer = NewWeaveContextRelevanceScorer()

Now let's run the evaluation. Once it has finished, you can click the Weave link that is generated to see the results.

In [None]:
eval_name = "dummy-evaluation"

evaluation = weave.Evaluation(
    name=eval_name,
    dataset=data,
    scorers=[bias_scorer, context_relevance_scorer],
    trials=3,  # Run our eval 3 times
)

final_eval_metrics = await evaluation.evaluate(
    model=my_ai_model,
    __weave={"display_name": eval_name}
)
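
The top-level `await` in the cell above works because a notebook already runs an event loop. In a plain Python script there is no running loop, so the coroutine has to be driven explicitly, for example with `asyncio.run`. The sketch below uses `fake_evaluate`, a hypothetical stand-in coroutine, in place of `evaluation.evaluate(...)` so that it runs without Weave; the returned dictionary is a placeholder and not real evaluation output.

```python
import asyncio


async def fake_evaluate() -> dict:
    """Hypothetical stand-in for `evaluation.evaluate(model=...)`."""
    await asyncio.sleep(0)  # yield to the event loop, as a real async call would
    return {"placeholder_metric": 1.0}  # placeholder value, not a real result


def run_in_script() -> dict:
    # asyncio.run creates an event loop, runs the coroutine to completion,
    # and closes the loop afterwards.
    return asyncio.run(fake_evaluate())


metrics = run_in_script()
```

In a script you would pass the real coroutine, `evaluation.evaluate(model=my_ai_model)`, to `asyncio.run` instead of the stand-in.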

## Weave Guardrails

When using Weave Guardrails you can see the metrics from the guardrail inline with your function's inputs and outputs.

Below is an example function that calls `WeaveToxicityScorerV1` and returns different outputs depending on whether or not the guardrail scorer was triggered.

The two main points are:
- retrieve the `call` object by invoking the `weave.op`-decorated function through its `.call()` method
- use `call.apply_scorer` to apply a scorer to the output of that call

For a full understanding of Weave Guardrails, please see the [Guardrails documentation here](https://weave-docs.wandb.ai/guides/evaluation/guardrails_and_monitors).

In [None]:
from weave.scorers import WeaveToxicityScorerV1

toxicity_scorer = WeaveToxicityScorerV1()

In [None]:
import weave

@weave.op
def call_llm(prompt: str) -> str:
    """Generate text using an LLM."""
    # Your LLM generation logic here
    return prompt.upper()

# Call our guardrailed function
async def generate_safe_response(prompt: str) -> str:
    # Call the function and return call object (from the weave.op'd function)
    result, call = call_llm.call(prompt)

    # Check Toxicity
    safety = await call.apply_scorer(toxicity_scorer)
    if not safety.result.passed:
        return f"Sorry but I cannot respond. Guardrail triggered: \n{safety.result.metadata}"

    return result

Safe input:

In [None]:
response = await generate_safe_response("Hey, how is it going?")
print(response)

Unsafe input:

In [None]:
response = await generate_safe_response("People from Mars are the worst")
print(response)

# All Scorers

## Context Relevance

The context relevance scorer returns a `passed` boolean indicating whether the context (supplied as `output`) is relevant to the `query`.

For finer granularity it also returns a score that represents the degree of relevance.

Passing `return_all_spans=True` when constructing the scorer returns scores for each context span (chunk of text) given.

In [None]:
from weave.scorers import WeaveContextRelevanceScorerV1

context_relevance_scorer = WeaveContextRelevanceScorerV1()

In [None]:
input = "What is the capital of Antarctica?"
context = "The Antarctic has the happiest penguins."

result = context_relevance_scorer.score(query=input, output=context)

print(f"Output is relevant: {result.passed}")
print(result)

Return scores for each span of a longer context:

In [None]:
from weave.scorers import WeaveContextRelevanceScorerV1

context_relevance_scorer = WeaveContextRelevanceScorerV1(return_all_spans=True)

In [None]:
input = "What is the capital of Antarctica?"

context = "The Antarctic has the happiest penguins, waddling across pristine white \
landscapes and diving into crystalline waters with effortless grace. Their playful \
interactions and resilient nature make them symbols of joy in one of the harshest \
environments on Earth. Sealoinland is a small city in the Arctic, nestled between \
towering glaciers and windswept tundra. The winters are very cold there, with \
temperatures plummeting to -40 degrees Celsius, creating a landscape of endless white \
and blue. Residents of Sealoinland have adapted to the extreme conditions, \
developing unique survival techniques and a deep respect for the unforgiving \
polar environment. Local folklore speaks of ancient ice hunters and mysterious \
polar phenomena, with generations of stories passed down about survival, companionship, \
and the raw beauty of the Arctic wilderness. The capital of the Antarctic was Sealoinland \
but was changed to Sealand in 1804. The city's few thousand inhabitants \
live in closely-knit communities, their homes designed to withstand brutal arctic \
winds and provide sanctuary from the relentless cold. Despite the challenging \
climate, the people of Sealoinland find warmth in their traditions, their \
close community bonds, and their profound connection to the surrounding landscape."

result = context_relevance_scorer.score(query=input, output=context)

print(f"Context is relevant: {result.passed}, score: {result.metadata['score']}")
print(f"Default score threshold: {context_relevance_scorer.threshold}")
print("Some relevant spans found:")
for span in result.metadata["all_spans"]:
    print(span)
print()
print(result)

## Hallucination

In [None]:
from weave.scorers import WeaveHallucinationScorerV1

hallucination_scorer = WeaveHallucinationScorerV1()

Hallucinated output:

In [None]:
result = hallucination_scorer.score(
  query="What is the capital of Antarctica?",
  context="People in Antarctica love the penguins.",
  output="While Antarctica is known for its sea life, penguins aren't liked there."
)
print(f"Output is hallucinated: {not result.passed}")
print(result)

Non-hallucinated output:

In [None]:
result = hallucination_scorer.score(
  query="What is the capital of Antarctica?",
  context="People in Antarctica love the penguins. The capital of Antarctica is Sealand.",
  output="Sealand is the capital of Antarctica"
)
print(f"Output is hallucinated: {not result.passed}")
print(result)

Adjusting the threshold: a lower threshold results in higher recall but lower precision.

In [None]:
print(f"Current hallucination threshold: {hallucination_scorer.threshold}")

In [None]:
hallucination_scorer = WeaveHallucinationScorerV1(threshold=0.2)
print(hallucination_scorer.threshold)

In [None]:
result = hallucination_scorer.score(
  query="What is the capital of Antarctica?",
  context="The second largest city in Antarctica is Sealoinland.",
  output="The capital of Antarctica is Sealoinland."
)
print(f"Output is hallucinated: {not result.passed}")
print(result)
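
The trade-off stated above (a lower threshold gives higher recall but lower precision) can be illustrated with a small, self-contained calculation. This is a sketch under two assumptions: the scores and labels in `TOY_EXAMPLES` are made up for illustration, and an output is treated as flagged when its score is at or above the threshold, which may not match the scorer's exact comparison.

```python
# Toy (score, is_actually_hallucinated) pairs, made up purely for illustration.
TOY_EXAMPLES = [
    (0.90, True),
    (0.60, True),
    (0.30, True),
    (0.40, False),
    (0.25, False),
    (0.10, False),
]


def precision_recall(examples, threshold):
    """Compute precision and recall, flagging an example when score >= threshold (assumed rule)."""
    true_pos = sum(1 for score, label in examples if score >= threshold and label)
    false_pos = sum(1 for score, label in examples if score >= threshold and not label)
    false_neg = sum(1 for score, label in examples if score < threshold and label)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


strict = precision_recall(TOY_EXAMPLES, threshold=0.5)
lenient = precision_recall(TOY_EXAMPLES, threshold=0.2)
```

On this toy data, lowering the threshold from 0.5 to 0.2 raises recall from 2/3 to 1.0 while precision drops from 1.0 to 0.6.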

Testing a longer text that contains hallucinations: Edison's last words do not appear in the context.

In [None]:
result = hallucination_scorer.score(
  query="What were Thomas Edison's last words and what was his contribution to the telephone?",
  context="Edison had a significant impact on early telephone technology through his work from 1876 to 1886, with his most notable contribution being the development of the carbon microphone that substantially improved Alexander Graham Bell's original telephone design. This innovation made telephone conversations both clearer and louder, marking a crucial advancement in telephone technology. His work in this field continued until his death on October 18, 1931, during which time he developed multiple improvements to telephone technology that helped shape modern telecommunications.",
  output="Thomas Edison's last words were 'It's very beautiful over there', which he whispered to his wife while pointing to his window just before passing away. In terms of telephony, Edison made revolutionary contributions by inventing the first working telephone in 1875, beating Alexander Graham Bell by several months and later selling the patent rights to Western Union for $100,000. His most significant actual contribution was the carbon microphone, which dramatically improved sound quality in telephone conversations"
)
print(f"Output is hallucinated: {not result.passed}")
print(result)

## Bias/Stereotype

In [None]:
from weave.scorers import WeaveBiasScorerV1

bias_scorer = WeaveBiasScorerV1()

In [None]:
result = bias_scorer.score(output = "Martian men are terrible at cleaning")

print(f"The text is biased: {not result.passed}")
print(result)

## Toxicity

In [None]:
from weave.scorers import WeaveToxicityScorerV1

toxicity_scorer = WeaveToxicityScorerV1()

In [None]:
result = toxicity_scorer.score(output = "people from the south pole of mars are the worst")

print(f"Input is toxic: {not result.passed}")
print(result)

The model scores 5 different categories from 0 to 3. If the sum of these scores is above `total_threshold` (default 5) then the input will be flagged. If any single category has a score higher than `category_threshold` (default 2) then the input will also be flagged. We tuned these default values to decrease false positives and improve recall.
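
The two flagging rules described above can be written out as a small function. This is a sketch of the rule as stated in the prose, not the library's implementation: the category names are placeholders, and the use of strict comparisons (`>`) follows the wording "above" and "higher than", which is an assumption about the boundary behaviour.

```python
def is_flagged(category_scores, total_threshold=5, category_threshold=2):
    """Return True when the per-category scores (each 0 to 3) trip either rule."""
    scores = list(category_scores.values())
    exceeds_total = sum(scores) > total_threshold  # rule 1: sum of all categories
    exceeds_category = any(score > category_threshold for score in scores)  # rule 2: any single category
    return exceeds_total or exceeds_category


# Placeholder category names; the real scorer defines its own five categories.
mild = {"category_a": 1, "category_b": 1, "category_c": 1, "category_d": 1, "category_e": 1}
one_severe = {"category_a": 3, "category_b": 0, "category_c": 0, "category_d": 0, "category_e": 0}
spread = {"category_a": 2, "category_b": 2, "category_c": 2, "category_d": 0, "category_e": 0}
single_two = {"category_a": 2, "category_b": 0, "category_c": 0, "category_d": 0, "category_e": 0}

flags = {
    "mild": is_flagged(mild),              # sum is 5, no category above 2
    "one_severe": is_flagged(one_severe),  # one category above 2
    "spread": is_flagged(spread),          # sum is 6
    "single_two_default": is_flagged(single_two),
    "single_two_lowered": is_flagged(single_two, category_threshold=1),
}
```

The last two entries show why lowering `category_threshold` filters more aggressively: the same scores pass under the default but are flagged once the per-category threshold is 1.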

If you want more aggressive filtering, you can override the `category_threshold` or `total_threshold` parameter in the constructor:

In [None]:
# Lowered threshold
toxicity_scorer = WeaveToxicityScorerV1(category_threshold=1)

In [None]:
result = toxicity_scorer.score(output="The Rams are terrible")

print(f"Input is toxic: {not result.passed}")
print(result)

## Coherence

In [None]:
from weave.scorers import WeaveCoherenceScorerV1

coherence_scorer = WeaveCoherenceScorerV1()

Incoherent output:

In [None]:
result = coherence_scorer.score(
    query="What is the capital of Antarctica?",
    output="but why not monkey up day"
)

print(f"Output is coherent: {result.passed}")
print(result)

Coherent output:

In [None]:
result = coherence_scorer.score(
    query="What is the capital of Antarctica?",
    output="The capital is Sealoinland, a beautiful city."
)

print(f"Output is coherent: {result.passed}")
print(result)

## Fluency

In [None]:
from weave.scorers import WeaveFluencyScorerV1

fluency_scorer = WeaveFluencyScorerV1()

Low fluency:

In [None]:
result = fluency_scorer.score(
    output="The cat did stretching lazily into warmth of sunlight."
)

print(f"Output is fluent: {result.passed}")
print(result)

High fluency:

In [None]:
result = fluency_scorer.score(
    output="The cat stretched lazily in the warm sunlight."
)

print(f"Output is fluent: {result.passed}")
print(result)

## Trustworthiness
The Trustworthiness scorer runs 5 scorers in parallel for an overall assessment of the query, context and output:

- 3 "critical" scorers: `WeaveToxicityScorer, WeaveHallucinationScorer, WeaveContextRelevanceScorer`

- 2 "advisory" scorers: `WeaveCoherenceScorer, WeaveFluencyScorer`
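
The idea of running several checks in parallel and combining "critical" and "advisory" results can be sketched as follows. Everything here is an assumption made for illustration: the source does not describe how the five results are combined, so the `high` / `medium` / `low` rule, the short check names, and the stub checks are all hypothetical and are not the behaviour of `WeaveTrustScorerV1`.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical short names for the two groups listed above.
CRITICAL = {"toxicity", "hallucination", "context_relevance"}
ADVISORY = {"coherence", "fluency"}


def run_checks_in_parallel(checks, text):
    """Run each check function on `text` concurrently and return {name: passed}."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(check, text) for name, check in checks.items()}
        return {name: future.result() for name, future in futures.items()}


def trust_level(results):
    """Assumed rule: a critical failure gives 'low', an advisory-only failure gives 'medium'."""
    failed = {name for name, passed in results.items() if not passed}
    if failed & CRITICAL:
        return "low"
    if failed & ADVISORY:
        return "medium"
    return "high"


# Stub checks standing in for real scorers; only the fluency stub can fail here.
stub_checks = {
    "toxicity": lambda text: True,
    "hallucination": lambda text: True,
    "context_relevance": lambda text: True,
    "coherence": lambda text: True,
    "fluency": lambda text: len(text.split()) > 3,
}

results = run_checks_in_parallel(stub_checks, "too short")
level = trust_level(results)
```

In this sketch only an advisory check fails, so the assumed rule yields `medium`; the real scorer reports its own value under `result.metadata['trust_level']`.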



In [None]:
from weave.scorers import WeaveTrustScorerV1

trust_scorer = WeaveTrustScorerV1()

In [None]:
def print_trust_scorer_result(result):
  print()
  print(f"Output is trustworthy: {result.passed}")
  print(f"Trust level: {result.metadata['trust_level']}")
  if not result.passed:
    print("Triggered scorers:")
    for scorer_name, scorer_data in result.metadata['raw_outputs'].items():
      if not scorer_data.passed:
        print(f"  - {scorer_name} did not pass")
    print()

  print(f'WeaveToxicityScorerV1: {result.metadata["scores"]["WeaveToxicityScorerV1"]}')
  print(f'WeaveHallucinationScorerV1: {result.metadata["scores"]["WeaveHallucinationScorerV1"]}')
  print(f'WeaveContextRelevanceScorerV1: {result.metadata["scores"]["WeaveContextRelevanceScorerV1"]}')
  print(f'WeaveCoherenceScorerV1: {result.metadata["scores"]["WeaveCoherenceScorerV1"]}')
  print(f'WeaveFluencyScorerV1: {result.metadata["scores"]["WeaveFluencyScorerV1"]}')
  print()

The following example has 2 issues:
- irrelevant context
- hallucinated output

In [None]:
result = trust_scorer.score(
    query="What is the capital of Antarctica?",
    context="People in Antarctica love the penguins.",
    output="The cat stretched lazily in the warm sunlight."
)

print_trust_scorer_result(result)

print(result)

## Personally Identifiable Information (PII)

The PresidioScorer uses Microsoft's [Presidio library](https://microsoft.github.io/presidio/getting_started/) to detect and anonymize PII.

Parameters:

- `selected_entities`: A list of entity types to detect in the text. If no value is passed, Presidio tries to detect all entity types in its default entities list.
- `language`: The language of the input text.
- `custom_recognizers`: A list of custom Presidio recognizers of type `presidio_analyzer.EntityRecognizer`.
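
To make the detect-then-anonymize idea concrete without installing Presidio, here is a deliberately simplified stand-in that handles a single entity type, similar in spirit to restricting detection to `EMAIL_ADDRESS`. It is not Presidio and not the `PresidioScorer`: the regular expression, the function name `detect_and_anonymize`, and the keys of the returned dictionary are all assumptions for illustration.

```python
import re

# Deliberately simple pattern for illustration; real email detection is more involved.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect_and_anonymize(text, placeholder="<EMAIL_ADDRESS>"):
    """Find email-like substrings and replace each one with a placeholder."""
    detected = EMAIL_PATTERN.findall(text)
    anonymized = EMAIL_PATTERN.sub(placeholder, text)
    return {
        "passed": not detected,  # passes only when nothing was detected
        "detected_entities": detected,
        "anonymized_text": anonymized,
    }


report = detect_and_anonymize("Her email is mary.jane@xyz.com.")
clean_report = detect_and_anonymize("No contact details here.")
```

Presidio goes well beyond a single regular expression, combining many recognizers across entity types, which is why the scorer exposes `selected_entities` and `custom_recognizers`.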

In [None]:
from weave.scorers import PresidioScorer

# First we will use the default list of all entities from Presidio
presidio_scorer = PresidioScorer()

Helper function to display results:

In [None]:
def print_presidio_output(result):
  print(f"Output contains PII: {not result.passed}")
  print()
  print(f"Anonymized text: {result.metadata['anonymized_text']}")
  print()
  print(result.metadata["detected_entities"])
  print()
  print(result.metadata["reason"])
  print()
  print(result)

Run the scorer:

In [None]:
result = presidio_scorer.score(
    output = "Mary Jane is a software engineer at XYZ company and her email is mary.jane@xyz.com."
)
print_presidio_output(result)

Running again, but now only detecting email addresses:

In [None]:
presidio_scorer = PresidioScorer(
    selected_entities=["EMAIL_ADDRESS"]
)

In [None]:
result = presidio_scorer.score(
    output = "Mary Jane is a software engineer at XYZ company and her email is mary.jane@xyz.com."
)
print_presidio_output(result)