# General information about the **Evaluation of Machine-Generated Text**

In this notebook, we present a standalone example that evaluates the quality of machine-generated text by comparing it to one or more reference texts. For testing purposes, two sets of question-answer pairs were used in this notebook. The first set concerns topics such as what a computer, CPU, and GPU are, while the second set consists of questions from the medical field, extracted from the actual use case.

Since the developers are not medical experts, it is challenging for them to judge the validity of the medical answers. This is why we also included questions about computers, which give a more easily verifiable set of responses. We cover five different evaluation methods:

- **BLEU score (Bilingual Evaluation Understudy)**
- **ROUGE-L score (Recall-Oriented Understudy for Gisting Evaluation) with the longest common subsequence (LCS)**
- **BERT for sentence similarity using cosine similarity**
- **BERT for embedding similarity**
- **METEOR (Metric for Evaluation of Translation with Explicit ORdering)**

## Evaluation Methods

### BLEU Score

BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-generated text by comparing it to one or more reference texts. It measures the precision of n-grams (sequences of 'n' words) in the generated text compared to the reference text. The key concepts behind BLEU are:

1. **N-grams**: BLEU computes n-gram precision, where n-grams are sequences of 'n' words. Commonly used n-grams are unigrams (1 word), bigrams (2 words), trigrams (3 words), and so on.

2. **Precision**: BLEU calculates the precision for each n-gram length as the fraction of n-grams in the candidate sentence that also occur in the reference sentence. The count of each n-gram is clipped to the number of times it appears in the reference, so repeating a matching word does not inflate the score.

3. **Brevity Penalty**: Precision alone favors very short candidates, because a handful of well-chosen words can all match the reference. The brevity penalty lowers the score of candidates that are shorter than the reference and leaves longer candidates unchanged.

4. **Geometric Mean**: BLEU combines the precision scores for different n-grams using a geometric mean, which is then multiplied by the brevity penalty.

Mathematically, the BLEU score is computed as follows:
 
$ \text{BLEU} = BP \cdot \exp \left( \sum\limits_{n=1}^N w_n \log p_n \right) $

where:
- $BP$ is the brevity penalty.
- $w_n$ is the weight for the n-gram precision (usually equal weights are used, e.g., $\frac{1}{N}$).
- $p_n$ is the precision for n-grams.

The brevity penalty $BP$ is calculated as:

$
BP = 
\begin{cases} 
1 & \text{if } c > r \\
\exp(1 - \frac{r}{c}) & \text{if } c \leq r 
\end{cases}
$

where $c$ is the length of the candidate sentence and $r$ is the length of the reference sentence.
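
The formula above can be traced with a few lines of plain Python. This is only a minimal sketch for illustration, not the implementation used later in this notebook: it assumes whitespace tokenization, a single reference, uniform weights $w_n = \frac{1}{N}$, and no smoothing. The helper names (`ngrams`, `modified_precision`, `brevity_penalty`, `bleu`) are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(c, r):
    """BP = 1 if c > r, otherwise exp(1 - r / c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, reference, max_n=4):
    """BLEU with uniform weights w_n = 1 / max_n and no smoothing (illustrative)."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(round(bleu(candidate, reference, max_n=2), 4))  # 0.7071
```

In this toy example the unigram precision is 5/6 and the bigram precision is 3/5. Both sentences have six tokens, so the brevity penalty is 1 and the score is the geometric mean $\sqrt{5/6 \cdot 3/5} \approx 0.7071$.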
 

### ROUGE-L Score

ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence) is another metric for evaluating machine-generated text. It focuses on recall by computing the longest common subsequence (LCS) between the generated text and the reference text.
1. **Longest Common Subsequence (LCS):** The LCS is the longest sequence of words that appear in both the candidate and reference texts in the same order, but not necessarily consecutively.

2. **Precision, Recall, and F1-Score:**
- **Precision:** The length of the LCS divided by the number of words in the candidate text.
- **Recall:** The length of the LCS divided by the number of words in the reference text.
- **F1-Score:** Harmonic mean of precision and recall, which balances the trade-off between the two.

3. **ROUGE-L** reports the F1-score computed from this precision and recall, rewarding candidates that capture the relevant content of the reference text in the same order.
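
The three steps above can be sketched in plain Python. This is a simplified illustration that assumes whitespace tokenization and a single reference; it is not the `rouge_score` implementation, and the function names `lcs_length` and `rouge_l` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, F1) based on the LCS of the two token lists."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
p, r, f1 = rouge_l(candidate, reference)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.833 0.833 0.833
```

Here the LCS is "the cat on the mat" (five words), and both texts contain six words, so precision, recall, and F1 all equal 5/6.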

### BERT for Sentence Similarity

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art pre-trained model that can generate embeddings for sentences. We use BERT to compute embeddings for both the generated and reference sentences and then calculate the cosine similarity between them. This method captures semantic similarity between sentences, making it robust to variations in wording and word order.
1. **Sentence Embeddings:** The model encodes the candidate sentence and the reference sentence separately and produces a fixed-size vector representation (embedding) for each. These embeddings capture the semantic meaning of the sentences in a high-dimensional space.

2. **Cosine Similarity:** Cosine similarity is a measure of similarity between two vectors based on the cosine of the angle between them. For BERT embeddings, cosine similarity is computed between the embeddings of the candidate and reference sentences. A higher cosine similarity indicates greater similarity between the sentences in terms of their semantic content.

3. **Semantic Similarity:** By computing cosine similarity between BERT embeddings, we can quantify the semantic similarity between the candidate and reference sentences. This method is robust to changes in word order and minor variations in wording, as it focuses on the overall semantic content rather than exact word matches.
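
The cosine similarity step can be illustrated without loading a model. In the sketch below, the three short vectors are invented stand-ins for sentence embeddings (real embeddings have hundreds of dimensions and come from the embedding model), and `cosine_similarity` is a hypothetical helper written for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy vectors standing in for sentence embeddings (assumed values, not model output).
candidate_vec = [0.20, 0.70, 0.10]
reference_vec = [0.25, 0.65, 0.05]
unrelated_vec = [0.90, -0.10, 0.40]

similar = cosine_similarity(candidate_vec, reference_vec)
dissimilar = cosine_similarity(candidate_vec, unrelated_vec)
print(similar > dissimilar)  # True
```

Vectors that point in nearly the same direction score close to 1, orthogonal vectors score 0, and the length of the vectors does not affect the result.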

In summary, while ROUGE-L score emphasizes recall by computing the longest common subsequence between texts, BERT-based cosine similarity score leverages pre-trained BERT embeddings to measure semantic similarity between sentences, offering robustness to variations in wording and word order.


### BERT for embedding similarity
This method (BERTScore) also uses BERT embeddings, but it works at the token level: the embedding of each token in the candidate text is compared with the embedding of each token in the reference text, and the best matches are aggregated into precision, recall, and F1 scores. It is known to capture semantic similarity well.
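
The token-level matching can be sketched as follows. This is a simplified illustration of greedy matching only: it assumes the token embeddings are already available (the toy 2-dimensional vectors below are invented), and it leaves out optional refinements of the `bert_score` library such as importance weighting and baseline rescaling. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from best-match token similarities."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy token embeddings (assumed values): two candidate tokens, one reference token.
cand_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0]]
p, r, f1 = greedy_match_scores(cand_vecs, ref_vecs)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 1.0 0.667
```

The single reference token is fully covered by the candidate (recall 1.0), while only one of the two candidate tokens has a counterpart in the reference (precision 0.5).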

### METEOR
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is designed to address some weaknesses of BLEU, for example by taking synonyms and stemming into account. It aligns the generated text with the reference text and scores the alignment based on matches of exact words, stems, synonyms, and paraphrases. It supports only five languages (English, Spanish, French, German, and Czech), so German is covered.

## Conclusion
By employing these evaluation methods, we can assess the quality of machine-generated text from different perspectives.

## Note on the difference between the two BERT implementations (sentence versus embedding similarity)

BERT sentence similarity is called BERT1 and the embedding similarity is called BERT2 in the following.
1. **Models used**:
    - BERT1 uses the **sentence-transformers** library and a specific sentence embedding model.
    - BERT2 uses the **bert_score** library, which compares the contextual token embeddings of a pre-trained BERT model.
2. **Similarity metric**:
    - BERT1 uses **cosine similarity** between sentence embeddings.
    - BERT2 computes **BERTScore**, which provides *precision*, *recall*, and *F1*-scores at the token level.
3. **Granularity**:
    - BERT1 focuses on overall **sentence similarity** through embeddings.
    - BERT2 focuses on **token-level similarity** and alignment, providing more granular metrics (P, R, F1).
4. **Evaluation Focus**:
    - BERT1 might be more suitable for general sentence similarity.
    - BERT2 shows in more detail how well the generated text matches the reference at the level of individual tokens, which is useful for more nuanced comparisons.

Both methods have their strengths and are useful in different contexts depending on the level of detail and type of similarity you are interested in.

# Installation of required packages

In [None]:
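!pip install nltk                   # BLEU (also used to tokenize text into words and sentences)
!pip install rouge_score            # ROUGE
!pip install sentence_transformers  # BERT sentence embeddings (BERT1)
!pip install bert_score             # BERTScore, token-level embedding similarity (BERT2)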

# Answers (Reference & LLM Generated)
1. Computer questions
2. Medical questions in German

## 1. Computer questions

In [None]:
questions = [
    "What is a computer?",
    "What is a CPU in a Computer?",
    "What is a graphics card?",
    "What is a HDD?",
    "What the benefits of using a SSD to using a HDD?"    
]

human_responses = [
    "A computer is a machine that can be programmed to carry out sequences of arithmetic or logical operations (computation) automatically. Modern digital electronic computers can perform generic sets of operations known as programs. These programs enable computers to perform a wide range of tasks. The term computer system may refer to a nominally complete computer that includes the hardware, operating system, software, and peripheral equipment needed and used for full operation - or to a group of computers that are linked and function together, such as a computer network or computer cluster.",
    "In computing and computer science, a processor or processing unit is an electrical component (digital circuit) that performs operations on an external data source, usually memory or some other data stream. It typically takes the form of a microprocessor, which can be implemented on a single or a few tightly integrated metal–oxide–semiconductor integrated circuit chips. In the past, processors were constructed using multiple individual vacuum tubes, multiple individual transistors, or multiple integrated circuits. The term is frequently used to refer to the central processing unit (CPU), the main processor in a system. However, it can also refer to other coprocessors, such as a graphics processing unit (GPU).",
    "A graphics card (also called a video card, display card, graphics adapter, VGA card/VGA, video adapter, display adapter, or colloquially GPU) is a computer expansion card that generates a feed of graphics output to a display device such as a monitor. Graphics cards are sometimes called discrete or dedicated graphics cards to emphasize their distinction to integrated graphics processor on the motherboard or the CPU. A graphics processing unit (GPU) that performs the necessary computations is the main component in a graphics card, but the acronym GPU is sometimes also used to erroneously refer to the graphics card as a whole.",
    "A hard disk drive (HDD), hard disk, hard drive, or fixed disk, is an electro-mechanical data storage device that stores and retrieves digital data using magnetic storage with one or more rigid rapidly rotating platters coated with magnetic material. The platters are paired with magnetic heads, usually arranged on a moving actuator arm, which read and write data to the platter surfaces. Data is accessed in a random-access manner, meaning that individual blocks of data can be stored and retrieved in any order. HDDs are a type of non-volatile storage, retaining stored data when powered off. Modern HDDs are typically in the form of a small rectangular box.",
    "Fast performance: SSDs offer much faster data access and transfer speeds than HDDs. Compact size: SSDs are smaller and lighter than HDDs, making them an ideal option for use in portable devices such as laptops and tablets. Lower power consumption: SSDs consume less power than HDDs, making them more energy-efficient."
]

generated_responses = {
    "Llama3": [
    "A computer is an electronic device that can receive, store, and process data, and perform a variety of tasks automatically, following a set of instructions or programs. It consists of hardware and software components that work together to execute tasks, solve problems, and communicate with users. The basic components of a computer include: 1. **Input Devices**: These allow users to enter data or instructions into the computer, such as keyboards, mice, scanners, and cameras. 2. **Central Processing Unit (CPU)**: Also known as the processor, it executes instructions and performs calculations, making it the brain of the computer. 3. **Memory (RAM)**: Temporary storage for data and applications that the CPU uses while running, also known as Random Access Memory. 4. **Storage Drive**: A permanent storage device that holds data, programs, and the operating system, such as hard drives or solid-state drives. 5. **Output Devices**: These display or produce the results of the computer's processing, such as monitors, printers, and speakers. 6. **Operating System (OS)**: Software that manages the computer's hardware and provides a platform for running applications, such as Windows, macOS, or Linux. Computers can perform a wide range of tasks, including: * Calculations and data analysis * Word processing and document creation * Internet browsing and communication * Graphics and video editing * Gaming * Scientific simulations and modeling * Artificial intelligence and machine learning There are various types of computers, including: * Desktop computers * Laptops (notebooks) * Tablets * Smartphones * Servers * Mainframes * Supercomputers * Embedded systems (e.g., smart home devices, appliances) In summary, a computer is a versatile electronic device that can process information, perform tasks, and communicate with users, making it an essential tool in modern life.",
    "A CPU, or Central Processing Unit, is the primary component of a computer that executes most instructions that a computer program requires. It's often referred to as the brain of the computer. The CPU is responsible for performing the following tasks: 1. **Executing Instructions**: The CPU takes in instructions from the operating system and applications, decodes them, and carries out the required actions. 2. **Arithmetic and Logical Operations**: The CPU performs mathematical calculations, such as addition, subtraction, multiplication, and division, as well as logical operations like AND, OR, and NOT. 3. **Data Processing**: The CPU processes and manipulates data, including moving data between different parts of the computer, storing data in memory, and retrieving data from memory. 4. **Control Unit**: The CPU manages the flow of data between different components of the computer, such as the memory, input/output devices, and other peripherals. The CPU consists of several key components, including: 1. **Control Unit**: Retrieves and decodes instructions, generates control signals to execute instructions, and manages data flow. 2. **Arithmetic Logic Unit (ALU)**: Performs arithmetic and logical operations. 3. **Registers**: Small amounts of on-chip memory that store data temporarily while it's being processed. 4. **Cache Memory**: A small, fast memory that stores frequently accessed data to improve performance. In summary, the CPU is the heart of a computer, responsible for executing instructions, performing calculations, and managing data processing. Its performance and capabilities have a significant impact on the overall speed and efficiency of a computer.",
    "A graphics card, also known as a video card, is a computer component that is responsible for rendering images on a computer screen. Its primary function is to process and display visual data, allowing you to see images, videos, and 3D graphics on your monitor. A graphics card is essentially a mini-computer that sits inside your computer's casing and is connected to the motherboard. It has its own processor, memory, and other specialized components that work together to perform the following tasks: 1. **Rendering**: The graphics card receives instructions from the computer's central processing unit (CPU) and uses its own processing power to render 2D and 3D graphics, textures, and effects. 2. **Memory management**: The graphics card has its own dedicated memory, known as video random access memory (VRAM), which stores graphics data and instructions. 3. **Display output**: The graphics card sends the rendered graphics to the display device, such as a monitor, through a digital or analog connection. Graphics cards can be integrated into the motherboard (known as integrated graphics) or installed as a separate, dedicated component (known as a discrete graphics card). Discrete graphics cards are generally more powerful and offer better performance, especially for demanding applications like gaming, video editing, and 3D modeling. Some key components of a graphics card include: * **GPU (Graphics Processing Unit)**: The brain of the graphics card, responsible for executing graphics instructions and performing calculations. * **Memory (VRAM)**: Dedicated memory for storing graphics data and instructions. * **Cooling system**: A heat sink and fan or other cooling mechanism to keep the graphics card at a safe operating temperature. * **Connectors**: Outputs such as HDMI, DisplayPort, DVI, and VGA for connecting to displays. In summary, a graphics card is a crucial component that enables your computer to display visual content, and its performance can greatly impact the overall computing experience, especially for graphics-intensive tasks.",
    "A HDD, or Hard Disk Drive, is a type of non-volatile storage device that uses magnetic disks to store data. It is a traditional type of storage device that has been widely used in computers and other digital devices for many years. A HDD consists of one or more disks, called platters, that are coated with a magnetic material. The platters are spinning at high speed (typically around 5,400 to 7,200 revolutions per minute) and are read and written to by a magnetic read/write head that floats above the surface of the platter. The read/write head is attached to an actuator arm, which moves the head to different locations on the platter to access different areas of the disk. The data is stored on the platter in the form of tiny magnetic fields, which are detected by the read/write head and translated into digital data. HDDs are available in various sizes, ranging from small laptop drives to large enterprise-class drives. They are relatively inexpensive and offer high storage capacities, making them a popular choice for many applications. Some common characteristics of HDDs include: * High storage capacity: HDDs can store large amounts of data, ranging from hundreds of gigabytes to several terabytes. * Low cost per gigabyte: HDDs are relatively inexpensive compared to other types of storage devices, such as solid-state drives (SSDs). * High latency: HDDs have slower access times compared to SSDs, which can affect performance in certain applications. * Mechanical parts: HDDs have moving parts, which can make them more prone to mechanical failure and vibration. However, HDDs are being increasingly replaced by SSDs in many applications due to their faster performance, lower power consumption, and higher reliability.",
    "There are several benefits of using a Solid-State Drive (SSD) over a Hard Disk Drive (HDD): 1. **Faster Speed**: SSDs are significantly faster than HDDs in terms of read and write speeds. SSDs can read and write data at speeds of up to 500-1000 MB/s, while HDDs typically top out at around 100-200 MB/s. This means that SSDs can boot up your computer, launch applications, and access files much faster. 2. **Lower Latency**: SSDs have much lower latency than HDDs, which means that they can access data more quickly. This is because SSDs use flash memory to store data, which can be accessed almost instantly, whereas HDDs use mechanical heads to read and write data, which takes longer. 3. **Higher Reliability**: SSDs are more reliable than HDDs because they have no moving parts, which makes them less prone to mechanical failure. This means that SSDs are less likely to fail due to shock, vibration, or physical damage. 4. **Lower Power Consumption**: SSDs use significantly less power than HDDs, which can help to prolong battery life in laptops and reduce heat generation in desktops. 5. **Quieter Operation**: SSDs are completely silent, whereas HDDs can be noisy due to the mechanical movement of the disks and heads. 6. **Improved Durability**: SSDs are more resistant to physical shock, vibration, and extreme temperatures than HDDs, making them a better choice for mobile devices and industrial applications. 7. **Better Multitasking**: SSDs can handle multiple tasks simultaneously much better than HDDs, which can lead to slower performance when multiple applications are running at the same time. 8. **Increased Security**: SSDs have built-in encryption and secure erase capabilities, which can help to protect sensitive data. 9. **Longer Lifespan**: While SSDs have a limited number of write cycles, modern SSDs are designed to last for many years, and some manufacturers offer warranties of up to 10 years or more. 10. **Better Performance in Extreme Conditions**: SSDs can operate in extreme temperatures, humidity, and altitudes, making them a better choice for applications that require reliable performance in harsh environments. Overall, SSDs offer significant performance, reliability, and durability advantages over HDDs, making them a popular choice for many applications, including gaming, video editing, and enterprise storage."
    ],
    "Gpt4": [
    "A computer is an electronic device that manipulates information or data. It has the ability to store, retrieve, and process data. It can execute a pre-recorded list of instructions known as a program. Computers can perform complex and repetitive procedures quickly, precisely, and reliably. Modern computers can accomplish tasks that include settings like calculations, word processing, browsing the internet, and more.",
    "A CPU (Central Processing Unit) in a computer is often referred to as the brain of the computer. It is responsible for executing instructions of a computer program by performing the basic arithmetical, logical, control and input/output (I/O) operations specified by the instructions. It handles all calculations and computer processing. In essence, a CPU brings a computer's functions and programs to life.",
    "A Graphics Card, also known as a video card, graphics adapter or display adapter, is a piece of hardware installed in a computer that is responsible for rendering images and videos to the monitor. It processes and outputs the graphical data using its dedicated processing power. High quality graphics cards are essential for tasks that require high-resolution visuals such as graphic design, video editing, and gaming.",
    "A HDD, or Hard Disk Drive, is a type of storage device used in computers. It uses spinning disks, or platters, to read and write data. It is considered a non-volatile storage, meaning it retains the data even when powered off. The size and speed of a HDD can significantly impact a computer's performance.",
    "1. Faster Speed: SSDs have a faster data access speed than HDDs. This means programs can run more quickly, which is significant for programs that access large amounts of data, like your operating system. 2. Less Power Usage: SSDs require less power to operate, leading to longer battery life for laptops. 3. Durability: Since SSDs do not have any moving parts, they're much more robust and resistant to shock and general wear and tear compared to HDDs. 4. No Fragmentation: As SSDs can access any location on the drive with equal speed, it does not suffer from fragmentation. In HDDs, data gets spread out across the disk platter, causing the drive to suffer from fragmentation, which can significantly slow down performance over time. 5. Noise and Heat: SSDs are virtually silent and do not generate heat as HDDs do, because they do not have any moving parts. 6. Form Factor: SSDs are available in a smaller form factor than HDDs are, leaving more room in a desktop or laptop for other components. 7. Faster startup: SSDs allow your system to boot up and shut down quicker because they access data faster. 8. Greater Lifespan: While SSDs can wear out over time (an issue known as finite lifespans), they generally last longer than HDDs because they don't have mechanical or magnetic components that can fail."
    ],
    "Claude": [
    "A computer is an electronic device that processes data, performs calculations, and executes instructions based on a set of commands called a program. It can store, retrieve, and manipulate information to perform various tasks. Key features of a computer include: 1. Input: Computers receive data and instructions through input devices such as a keyboard, mouse, touch screen, or microphone. 2. Processing: The central processing unit (CPU) is the brain of the computer, executing instructions and performing calculations. 3. Output: Computers display or send out the results of processing through output devices such as a monitor, printer, or speakers. 4. Memory: Computers store data and instructions temporarily in RAM (random access memory) and permanently in storage devices like hard drives or solid-state drives. 5. Connectivity: Modern computers can connect to networks and the internet to communicate with other devices and access shared resources. Computers come in various forms, including desktop computers, laptops, tablets, smartphones, and embedded systems. They are used in a wide range of applications, such as communication, education, entertainment, business, scientific research, and industrial control.",
    "A CPU, or Central Processing Unit, is the main processor of a computer. It is often referred to as the brain of the computer because it carries out most of the processing tasks and controls the operation of all other components. Key points about the CPU: 1. Execution of instructions: The CPU fetches, decodes, and executes program instructions, performing arithmetic, logical, and input/output operations. 2. Control Unit: It directs the flow of data between the CPU and other devices, such as memory, storage, and peripherals. 3. Arithmetic Logic Unit (ALU): This part of the CPU performs mathematical and logical operations on data. 4. Cache memory: Modern CPUs have built-in cache memory, which stores frequently used data and instructions for faster access. 5. Clock speed: The CPU's performance is largely determined by its clock speed, measured in Hz, which dictates how many instructions it can process per second. 6. Multi-core processors: Many modern CPUs have multiple cores, allowing them to execute multiple tasks simultaneously for improved performance. The CPU is a critical component in determining a computer's overall performance, along with other factors such as the amount and speed of RAM, storage, and the efficiency of the operating system and software.",
    "A graphics card, also known as a video card, graphics processing unit (GPU), or display adapter, is a specialized electronic circuit designed to rapidly process and output images to a display device, such as a computer monitor. The primary purpose of a graphics card is to render and display 2D and 3D graphics, video, and animations. Key features and functions of a graphics card include: 1. GPU: The GPU is the main component of a graphics card, responsible for processing graphics-related tasks and calculations. 2. Video RAM (VRAM): Graphics cards have dedicated memory to store image data, texture maps, and other graphics-related information. 3. Cooling system: Due to the high-performance nature of graphics cards, they generate a significant amount of heat and require efficient cooling solutions like heat sinks and fans. 4. Display connections: Graphics cards have various display output ports, such as HDMI, DisplayPort, and DVI, to connect to monitors or other display devices. 5. Graphics API support: Modern graphics cards support industry-standard APIs like DirectX and OpenGL, which are used by software developers to create graphics-intensive applications and games. Graphics cards are essential for tasks that require intensive graphics processing, such as gaming, 3D modeling, video editing, and other visually demanding applications. They come in different performance tiers and are manufactured by companies like Nvidia, AMD, and Intel.",
    "HDD stands for Hard Disk Drive. It is a type of non-volatile data storage device used in computers and other electronic devices. Key features of HDDs include: 1. Data storage: HDDs store digital data using magnetic storage technology. They contain one or more spinning disks (called platters) coated with a magnetic material, and data is read from or written to the disks using read/write heads. 2. Capacity: HDDs are available in various storage capacities, ranging from a few hundred gigabytes (GB) to several terabytes (TB). 3. Speed: The performance of an HDD is determined by factors such as spindle speed (measured in RPM), cache size, and data transfer rate. Typically, HDDs are slower than solid-state drives (SSDs) in terms of data access and transfer speeds. 4. Durability: HDDs have moving parts, making them more susceptible to physical damage from drops, shocks, or vibrations compared to SSDs. 5. Cost: HDDs are generally less expensive than SSDs on a per-gigabyte basis, making them a cost-effective solution for high-capacity storage needs. HDDs are commonly used in desktop computers, laptops, servers, and external storage devices for storing operating systems, applications, documents, multimedia files, and other types of data.",
    "Solid State Drives (SSDs) offer several benefits over traditional Hard Disk Drives (HDDs): 1. Speed: SSDs are significantly faster than HDDs. They have shorter access times, faster boot times, and quicker file transfers, resulting in improved overall system performance. 2. Durability: SSDs have no moving parts, making them more resistant to physical shock and vibration compared to HDDs. This reduces the risk of data loss due to accidental drops or impacts. 3. Quiet operation: Without moving parts, SSDs operate silently, whereas HDDs produce audible noise from the spinning disks and moving read/write heads. 4. Lower power consumption: SSDs consume less power than HDDs, which can lead to improved battery life in laptops and reduced energy costs in desktop systems. 5. Cooler operation: SSDs generate less heat than HDDs, helping to keep the system running cooler and potentially prolonging the lifespan of other components. 6. Lightweight and compact: SSDs are typically smaller and lighter than HDDs, making them ideal for portable devices like laptops and tablets. 7. Faster random access: SSDs excel at random access times, allowing for quicker access to small files and faster application loading. 8. No fragmentation: Unlike HDDs, SSDs do not suffer from fragmentation issues, maintaining their performance over time without the need for defragmentation. However, SSDs also have some drawbacks compared to HDDs, such as higher cost per gigabyte and limited write cycles, although modern SSDs have significantly improved in terms of durability and longevity."
    ]
}

## 2. Medical questions in German

In [37]:
questions = [
    "Welche Rolle spielt die ärztliche Verordnung bei der Beantragung von häuslicher Krankenpflege?",
    "Welche Formen der häuslichen Krankenpflege gibt es?",
    "Welche Leistung umfasst die Krankenhausvermeidungspflege?"
]

human_responses = [
    "Ihr Arzt entscheidet, ob eine häusliche Krankenpflege notwendig ist und stellt Ihnen eine Verordnung aus. Diese ist dann bei der jeweiligen zuständigek Krankenkasse zur vorherigen Genehmigung ein zu reichen.",
    "Folgende Formen der häuslichen Krankenpflege gibt es: Grundpflege: Bei der Grundpflege helfen Ihnen Pflegekräfte bei der Körperpflege und beim Essen. Behandlungspflege: Die Behandlungspflege umfasst medizinisch notwendige Hilfeleistungen wie Verbände wechseln, Wunden spülen oder Medikamente verabreichen. Hauswirtschaftliche Versorgung: Hier helfen Ihnen Pflegekräfte dabei, den Haushalt weiterzuführen. Sie unterstützen zum Beispiel beim Kochen, Einkaufen und Putzen.",
    "Wenn sich durch die häusliche Pflege ein Aufenthalt im Krankenhaus vermeiden oder verkürzen lässt, übernehmen wir die Kosten für die erforderliche Grund- und Behandlungspflege sowie hauswirtschaftliche Versorgung – je nach Krankheitsfall bis zu vier Wochen. In Ausnahmefällen bewilligen wir diese Pflegeformen auch länger."
]
# Subject Matter Expert score dictionary
score_dict_sme= {(0, 'Llama3'): 0.99, (0, 'Gpt4'): 0.95, (0, 'Claude'): 1.00,
                 (1, 'Llama3'): 1.00, (1, 'Gpt4'): 0.99, (1, 'Claude'): 0.99,
                 (2, 'Llama3'): 0.90, (2, 'Gpt4'): 0.50, (2, 'Claude'): 0.91}

generated_responses = {
    "Llama3": [
    "Die ärztliche Verordnung spielt eine zentrale Rolle bei der Beantragung von häuslicher Krankenpflege. Hier sind die Gründe dafür: 1. Medizinische Notwendigkeit: Die ärztliche Verordnung bestätigt, dass die häusliche Krankenpflege medizinisch notwendig ist und dass der Patient eine entsprechende Pflege benötigt. 2. Feststellung der Pflegebedürftigkeit: Der Arzt bewertet den Gesundheitszustand des Patienten und stellt fest, welche Art und Umfang der Pflege erforderlich sind. 3. Beantragung von Leistungen: Die ärztliche Verordnung ist ein wichtiger Bestandteil des Antrags auf häusliche Krankenpflege bei der Krankenkasse oder anderen Kostenträgern. 4. Legitimation für die Pflegedienste: Die ärztliche Verordnung legitimiert die Pflegedienste, die dem Patienten die erforderliche Pflege zu erbringen haben. 5. Dokumentation: Die ärztliche Verordnung dient auch als Dokumentation der medizinischen Notwendigkeit und des Umfangs der Pflege, die erforderlich ist. In Deutschland ist die ärztliche Verordnung für die Inanspruchnahme von häuslicher Krankenpflege nach § 37 Abs. 2 SGB V erforderlich. Der Arzt muss die Notwendigkeit der Pflege und den Umfang der erforderlichen Leistungen in der Verordnung dokumentieren.",
    "Es gibt verschiedene Formen der häuslichen Krankenpflege, die je nach Bedarf und Situation des Patienten angepasst werden können. Hier sind einige Beispiele: 1. Grundpflege: Diese Form der Pflege umfasst die täglichen Grundbedürfnisse wie Waschen, Anziehen, Essen, Trinken, Toilettengang und Mobilität. 2. Behandlungspflege: Diese Form der Pflege umfasst medizinische Behandlungen wie Injektionen, Wundversorgung, Medikamentengabe und physikalische Therapie. 3. Rehabilitationspflege: Diese Form der Pflege konzentriert sich auf die Wiederherstellung von Funktionen und Fähigkeiten nach einer Krankheit oder Verletzung, wie z.B. Physiotherapie, Ergotherapie oder Logopädie. 4. Palliativpflege: Diese Form der Pflege konzentriert sich auf die Linderung von Schmerzen und Symptomen bei Patienten mit unheilbaren Krankheiten, wie z.B. Krebs oder AIDS. 5. Intensivpflege: Diese Form der Pflege umfasst die umfassende Überwachung und Behandlung von Patienten, die sich in einem kritischen Zustand befinden, wie z.B. nach einer Operation oder bei einer schweren Infektion. 6. Nacht- und Wochenendpflege: Diese Form der Pflege bietet Unterstützung und Betreuung für Patienten, die während der Nacht oder am Wochenende Pflege benötigen. 7. Hauswirtschaftliche Hilfe: Diese Form der Pflege umfasst die Unterstützung bei hauswirtschaftlichen Aufgaben wie Einkaufen, Kochen, Reinigung und organisationelle Aufgaben. 8. Begleitung und Betreuung: Diese Form der Pflege konzentriert sich auf die emotionale Unterstützung und Begleitung von Patienten, die sich in einer schwierigen Situation befinden. Es gibt auch verschiedene Modelle der häuslichen Krankenpflege, wie z.B.: 1. Pflegedienstmodell: Ein Pflegedienst schickt eine Pflegekraft in das Haus des Patienten, um die erforderlichen Pflegemaßnahmen durchzuführen. 2. Familienpflegemodell: Familienmitglieder oder Angehörige übernehmen die Pflege des Patienten, unterstützt von professionellen Pflegekräften. 3. Shared-Care-Modell: Ein Pflegedienst und Familienmitglieder oder Angehörige teilen sich die Pflege des Patienten. Jede Form der häuslichen Krankenpflege kann individuell angepasst werden, um den spezifischen Bedürfnissen und Wünschen des Patienten gerecht zu werden.",
    "Die Krankenhausvermeidungspflege, auch als ambulante Krankenhausvermeidung oder ambulante Intensivpflege bezeichnet, umfasst eine Vielzahl von Leistungen, die darauf ausgerichtet sind, einen Krankenhausaufenthalt zu vermeiden oder zu verkürzen. Diese Leistungen können je nach Bedarf und Versicherungsträger variieren, aber typischerweise umfassen sie: 1. Medizinische Behandlung: Durchgeführt von Ärzten, Krankenschwestern und anderen medizinischen Fachkräften, um die Gesundheit des Patienten zu stabilisieren und zu verbessern. 2. Pflegeleistungen: Umfassen die Versorgung von Patienten mit grundlegenden Bedürfnissen wie Essen, Trinken, Körperpflege, Mobilität und Medikamentenverwaltung. 3. Therapieleistungen: Physiotherapie, Ergotherapie, Logopädie und andere Therapieformen, die dazu beitragen, die körperliche und geistige Gesundheit des Patienten zu verbessern. 4. Medikamentöse Versorgung: Verabreichung von Medikamenten und Überwachung ihrer Wirkung. 5. Wundversorgung: Behandlung und Versorgung von Wunden, um eine Infektion zu vermeiden und die Heilung zu fördern. 6. Ernährungstherapie: Beratung und Unterstützung bei der Ernährung, um die Gesundheit des Patienten zu verbessern. 7. Soziale Unterstützung: Begleitung und Unterstützung bei der Organisation des Alltags, wie zum Beispiel bei der Beschaffung von Hilfsmitteln oder der Koordination von Terminen. 8. Überwachung und Kontrolle: Regelmäßige Überwachung des Gesundheitszustands des Patienten, um sicherzustellen, dass die Behandlung erfolgreich ist und Anpassungen vorgenommen werden können, wenn nötig. Die Krankenhausvermeidungspflege kann in verschiedenen Settings durchgeführt werden, wie zum Beispiel in der eigenen Wohnung des Patienten, in einer ambulanten Pflegeeinrichtung oder in einem ambulanten Krankenhaus."
    ], 
    "Gpt4": [
    "Die ärztliche Verordnung spielt eine entscheidende Rolle bei der Beantragung von häuslicher Krankenpflege. Sie ist das zentrale Dokument, das den Bedarf an häuslicher Pflege belegt und damit eine Voraussetzung für die Gewährung von Leistungen durch die Krankenkasse ist. Die Verordnung muss vom behandelnden Arzt ausgestellt und begründet werden. Sie enthält in der Regel Angaben zur Diagnose, zur Art der notwendigen Pflege und zur voraussichtlichen Dauer des Pflegebedarfs. Ohne eine solche Verordnung wird die Krankenkasse die Kosten für die häusliche Krankenpflege in der Regel nicht übernehmen.",
    "1. Grundpflege: Hierzu zählen alltägliche Verrichtungen wie Körperhygiene, Hilfe beim An- und Auskleiden, Nahrungsaufnahme etc. 2. Behandlungspflege: Diese beinhaltet medizinische Maßnahmen, die nach Anordnung eines Arztes durchgeführt werden. Dazu gehören z.B. Verbandswechsel, Medikamentengabe, Injektionen, Blutdruckmessung etc. 3. Hauswirtschaftliche Versorgung: Sie umfasst Leistungen wie Einkaufen, Kochen, Reinigen der Wohnung, Wäschepflege etc. 4. Palliativpflege: Sie ist auf die Bedürfnisse von schwerstkranken und sterbenden Menschen ausgerichtet. Ziel ist es, die bestmögliche Lebensqualität und Würde des Patienten bis zum Tod zu gewährleisten. 5. Intensivpflege: Diese Form der Pflege ist notwendig, wenn Patienten aufgrund schwerer Krankheiten oder nach Operationen eine intensive Betreuung und Überwachung benötigen. 6. 24-Stunden-Pflege: Hierbei handelt es sich um eine Rund-um-die-Uhr-Betreuung, die häufig von Menschen in Anspruch genommen wird, die aufgrund ihres Gesundheitszustandes permanente Hilfe benötigen.",
    "Die Krankenhausvermeidungspflege (auch Verhinderungspflege genannt) bietet Leistungen für die vorübergehende Übernahme der Pflege zu Hause, wenn die private Pflegeperson verhindert ist. Dies kann bei Krankheit oder Urlaub der Fall sein. Die Leistungen können zum Beispiel Körperpflege, Hilfe bei der Nahrungsaufnahme, Mobilisation sowie Hilfen zur Haushaltsführung umfassen. Es kann auch eine teilstationäre Tages- oder Nachtpflege in Anspruch genommen werden. Die Krankenhausvermeidungspflege ist als Hilfe zur Sicherstellung der Pflege bei vorübergehender Abwesenheit der Pflegeperson gedacht."
    ],
    "Claude": [
    "Die ärztliche Verordnung spielt bei der Beantragung von häuslicher Krankenpflege eine zentrale Rolle: 1. Sie bestätigt die medizinische Notwendigkeit der Pflege. Der Arzt bescheinigt, dass der Patient aufgrund einer Krankheit, Behinderung oder eines sonstigen Pflegebedarfs auf häusliche Krankenpflege angewiesen ist. 2. In der Verordnung legt der Arzt die Art, den Umfang und die Dauer der erforderlichen Pflegemaßnahmen fest. Hierzu zählen beispielsweise Medikamentengabe, Verbandswechsel oder Unterstützung bei der Körperpflege. 3. Die Verordnung dient als Grundlage für die Kostenübernahme durch die Krankenkasse. Nur mit einer gültigen ärztlichen Verordnung kann die Krankenkasse die Kosten für die häusliche Krankenpflege übernehmen. 4. Der Arzt überprüft regelmäßig, ob die Pflege weiterhin notwendig ist und passt die Verordnung gegebenenfalls an. Die Verordnung ist in der Regel auf einen bestimmten Zeitraum befristet. 5. Ohne ärztliche Verordnung besteht kein Anspruch auf Kostenübernahme für häusliche Krankenpflege durch die Krankenkasse. Die ärztliche Verordnung ist somit die rechtliche und medizinische Voraussetzung für die Beantragung und Genehmigung von häuslicher Krankenpflege durch die gesetzliche Krankenversicherung.",
    "Es gibt verschiedene Formen der häuslichen Krankenpflege in Deutschland: 1. Behandlungspflege: Hierzu gehören medizinisch notwendige Maßnahmen wie Medikamentengabe, Wundversorgung, Injektionen oder Blutdruckmessung. Diese wird vom Arzt verordnet und von examinierten Pflegekräften durchgeführt. 2. Grundpflege: Darunter fallen Hilfen bei täglichen Aktivitäten wie Körperpflege, An- und Auskleiden, Essen und Mobilität. Sie wird von Pflegefachkräften oder Pflegehilfskräften erbracht. 3. Hauswirtschaftliche Versorgung: Dazu zählen Einkaufen, Kochen, Reinigen der Wohnung und Wäschepflege. Das erledigen Hauswirtschaftskräfte oder Pflegehilfskräfte. 4. Betreuung und Begleitung: Diese umfasst Hilfe bei der Tagesstrukturierung, Gespräche, Spaziergänge oder Beschäftigung. Auch Besuche bei Ärzten oder Behörden fallen darunter. 5. Verhinderungspflege: Wenn die übliche Pflegeperson wie Angehörige ausfällt, übernimmt ein Pflegedienst für maximal 6 Wochen pro Jahr. 6. Palliativpflege: Für schwerstkranke und sterbende Menschen gibt es speziell geschultes Personal, um die letzte Lebensphase zu Hause zu ermöglichen. Diese Leistungen werden je nach Pflegegrad und ärztlicher Verordnung von der Pflegeversicherung bezuschusst. Ein ambulanter Pflegedienst kann alle oder einzelne dieser Bereiche abdecken.",
    "Die Krankenhausvermeidungspflege umfasst verschiedene medizinische und pflegerische Leistungen, die darauf abzielen, einen Krankenhausaufenthalt von Patienten zu vermeiden oder zu verkürzen. Hier sind die wesentlichen Aspekte: 1. Häusliche Krankenpflege: Pflegefachkräfte übernehmen die Pflege und Versorgung des Patienten in seiner gewohnten Umgebung, um einen Krankenhausaufenthalt zu vermeiden. 2. Medizinische Behandlungspflege: Dazu gehören Tätigkeiten wie Medikamentengabe, Wundversorgung, Injektionen und Infusionen, die sonst im Krankenhaus durchgeführt würden. 3. Grundpflege und hauswirtschaftliche Versorgung: Die Pflegekräfte unterstützen den Patienten bei Aktivitäten des täglichen Lebens wie Körperpflege, Ernährung und Mobilität. 4. Anleitung und Beratung: Patienten und Angehörige erhalten Schulungen im Umgang mit der Erkrankung und der Pflege, um eine bessere Versorgung zu Hause zu ermöglichen. 5. Koordination und Zusammenarbeit: Die Pflegedienste arbeiten eng mit Ärzten, Therapeuten und anderen Dienstleistern zusammen, um eine ganzheitliche Versorgung zu gewährleisten. 6. Palliativpflege: In manchen Fällen kann die Krankenhausvermeidungspflege auch die Versorgung von unheilbar kranken Patienten in ihrer letzten Lebensphase zu Hause ermöglichen. Ziel ist es, durch diese Leistungen die Lebensqualität der Patienten zu verbessern, Krankenhausaufenthalte zu vermeiden und das Gesundheitssystem zu entlasten."
    ],
    }

# Evaluations:
1. BLEU score
2. Rouge-L
3. BERT sentence similarity
4. BERT embedding similarity
5. METEOR

In [38]:
# Define the order of LLMs to be evaluated
key_order = ['Claude', 'Llama3', 'Gpt4']
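The per-question ranking cells for the BERT and METEOR metrics read a dictionary named `cum_rank_dict_sme`, which is not defined in this part of the notebook. The sketch below shows one way it could be derived from `score_dict_sme`, assuming the SME ranks follow the same convention as the machine ranks (highest score gets rank 1, ties share the average rank). The helper `average_ranks_desc` is a hypothetical, dependency-free stand-in that is intended to give the same ranks as `rankdata([-v for v in values], method='average')`.

```python
# SME scores and model order, copied from the cells above so the sketch is self-contained
score_dict_sme = {(0, 'Llama3'): 0.99, (0, 'Gpt4'): 0.95, (0, 'Claude'): 1.00,
                  (1, 'Llama3'): 1.00, (1, 'Gpt4'): 0.99, (1, 'Claude'): 0.99,
                  (2, 'Llama3'): 0.90, (2, 'Gpt4'): 0.50, (2, 'Claude'): 0.91}
key_order = ['Claude', 'Llama3', 'Gpt4']


def average_ranks_desc(values):
    """Hypothetical helper: rank 1 for the highest value, tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


# Number of questions covered by the SME scores
n_questions = len({q_id for q_id, _ in score_dict_sme})

# Per-question SME rank of every LLM, keyed by (question_id, LLM)
cum_rank_dict_sme = {}
for question_id in range(n_questions):
    sme_scores = [score_dict_sme[(question_id, key)] for key in key_order]
    for key, rnk in zip(key_order, average_ranks_desc(sme_scores)):
        cum_rank_dict_sme[(question_id, key)] = rnk

print(cum_rank_dict_sme)
```

With the SME scores above, the two models tied at 0.99 on the second question both receive rank 2.5, which agrees with the SME ranks shown in the recorded outputs of the later cells.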

## 1. BLEU Score (Bilingual Evaluation Understudy)

In [None]:
import nltk
nltk.download('punkt')

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import rankdata

# Function to compute BLEU score for a single pair of sentences
def compute_bleu(reference, generated):
    # Tokenize the reference and hypothesis sentences
    reference_tokens = [nltk.word_tokenize(reference)]
    generated_tokens = nltk.word_tokenize(generated)
    
    # Smoothing function to handle short sentences
    smoothie = SmoothingFunction().method4
    
    # Compute BLEU score
    bleu_score = sentence_bleu(reference_tokens, generated_tokens, smoothing_function=smoothie)
    
    return bleu_score

### Loop over the list of LLMs

In [None]:
# Compute BLEU scores for each pair of expected and generated answers
score_dict = {}
print('\nBLEU scores from human answer versus machine-generated answer.')
print('--------------------------------------------------------------')
for LLM_model in generated_responses.keys():
    bleu_scores = [compute_bleu(ref, gen) for ref, gen in zip(human_responses, generated_responses[LLM_model])]
    score_dict.update({(i, LLM_model): score for i, score in enumerate(bleu_scores)})

    #bleu_scores = []
    #for i, (ref, gen) in enumerate(zip(human_responses, generated_responses[LLM_model])):
    #    score = compute_bleu(ref, gen)
    #    bleu_scores.append(score)
    #    score_dict.update({(i,LLM_model): score})

    # Print BLEU scores
    print(f'\n********** {LLM_model} ***************')
    for i, score in enumerate(bleu_scores):
        print(f"BLEU score for Q{i+1}: {score:.4f}")

### Ranking of the LLMs per Question

In [None]:
# Iterate through each question and rank the models based on scores
cum_rank_dict = {key: 0 for key in key_order}

for question_id in range(len(human_responses)):
    ordered_values = [score_dict.get((question_id, key), None) for key in key_order]
    ranks = rankdata([-val for val in ordered_values], method='average')
    
    print(f"\nRanking for Question {question_id + 1}:")
    for key, val, rnk in zip(key_order, ordered_values, ranks):
        print(f"{key:<8} - {val:.3f} -> Rank: {rnk}")
        cum_rank_dict[key] += rnk
    


### Summing the ranking of the LLMs over the different questions (lowest value is the best LLM)

In [None]:
# Sort the dictionary by values (scores)
sorted_scores = sorted(cum_rank_dict.items(), key=lambda item: item[1])

# Print the ranking of LLMs
print(f"LLM Rankings based on cumulative scores: Nr of Questions={len(human_responses)}")
for rank, (key, value) in enumerate(sorted_scores, 1):
    print(f"{rank}. {key:<8} - CumRank: {value:.1f}")

## 2. ROUGE-L scores (Recall-Oriented Understudy for Gisting Evaluation with the Longest common subsequence)

In [None]:
from rouge_score import rouge_scorer
from scipy.stats import rankdata

def compute_rouge_l(reference, generated):
    
    # Define the ROUGE-L scorer
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    score = scorer.score(reference, generated)
    return score['rougeL'].precision, score['rougeL'].recall, score['rougeL'].fmeasure

### Loop over the list of LLMs

In [None]:
# Compute ROUGE-L scores for each pair of expected and generated answers
score_dict = {}
for LLM_model, responses in generated_responses.items():
    rouge_scores = [compute_rouge_l(ref, gen) for ref, gen in zip(human_responses, responses)]
    score_dict.update({(i, LLM_model): score[2] for i, score in enumerate(rouge_scores)})
    # Print ROUGE-L scores
    print(f'\n********** {LLM_model} ***************')
    for i, (precision, recall, f1) in enumerate(rouge_scores, start=1):
        print(f"ROUGE-L score for pair {i}: (prec, recall, f1) = ({precision:.4f}, {recall:.4f}, {f1:.4f})")

### Ranking of the LLMs per Question

In [None]:
# Iterate through each question and rank the models based on scores
cum_rank_dict = {key: 0 for key in key_order}

for question_id in range(len(human_responses)):
    ordered_values = [score_dict.get((question_id, key), None) for key in key_order]
    ranks = rankdata([-val for val in ordered_values], method='average')
    
    print(f"\nRanking for Question {question_id + 1}:")
    for key, val, rnk in zip(key_order, ordered_values, ranks):
        print(f"{key:<8} - {val:.3f} -> Rank: {rnk}")
        cum_rank_dict[key] += rnk

### Summing the ranking of the LLMs over the different questions (lowest value is the best LLM)

In [None]:
# Sort the dictionary by values (scores)
sorted_scores = sorted(cum_rank_dict.items(), key=lambda item: item[1])

# Print the ranking of LLMs
print(f"LLM Rankings based on cumulative scores: Nr of Questions={len(human_responses)}")
for rank, (key, value) in enumerate(sorted_scores, 1):
    print(f"{rank}. {key:<8} - CumRank: {value:.1f}")

## 3. BERT sentence similarity

In [None]:
pip install sentence_transformers

In [98]:
from sentence_transformers import SentenceTransformer, util
from scipy.stats import rankdata

# Load a pre-trained BERT model from Hugging Face
model = SentenceTransformer('bert-base-nli-mean-tokens')

### Loop over the list of LLMs

In [99]:
score_dict = {}
for LLM_model, responses in generated_responses.items():
       
    # Compute embeddings for each sentence
    expected_embeddings = model.encode(human_responses, convert_to_tensor=True)
    generated_embeddings = model.encode(responses, convert_to_tensor=True)

    # Compute cosine similarities
    similarities = util.pytorch_cos_sim(generated_embeddings, expected_embeddings)
    score_dict.update({(i, LLM_model): score for i, score in enumerate(similarities.diag())})
    
    # Print the cosine similarity scores
    print(f'\n********** {LLM_model} ***************')
    for i, similarity in enumerate(similarities.diag()):
        print(f"Similarity score for pair {i+1}: {similarity.item():.4f}")


********** Llama3 ***************
Similarity score for pair 1: 0.7488
Similarity score for pair 2: 0.8970
Similarity score for pair 3: 0.9383

********** Gpt4 ***************
Similarity score for pair 1: 0.8726
Similarity score for pair 2: 0.9618
Similarity score for pair 3: 0.9198

********** Claude ***************
Similarity score for pair 1: 0.7529
Similarity score for pair 2: 0.9424
Similarity score for pair 3: 0.7480


### Ranking of the LLMs per Question

In [100]:
# Iterate through each question and rank the models based on scores
cum_rank_dict = {(q_id, key): 0 for q_id in range(len(questions)) for key in key_order}

for question_id in range(len(human_responses)):
    ordered_values = [score_dict.get((question_id, key), None) for key in key_order]
    ranks = rankdata([-val for val in ordered_values], method='average')
    
    print(f"\nRanking for Question {question_id + 1}:")
    for key, val, rnk in zip(key_order, ordered_values, ranks):
        val_sme = score_dict_sme[(question_id,key)]
        rnk_sme = cum_rank_dict_sme[(question_id,key)]
        print(f"{key:<8} - score (SME, Machine): ({val_sme:.3f}, {val:.3f}) -> Rank (SME, Machine): ({rnk_sme}, {rnk} )")
        cum_rank_dict[(question_id,key)] += rnk


Ranking for Question 1:
Claude   - score (SME, Machine): (1.000, 0.753) -> Rank (SME, Machine): (1.0, 2.0 )
Llama3   - score (SME, Machine): (0.990, 0.749) -> Rank (SME, Machine): (2.0, 3.0 )
Gpt4     - score (SME, Machine): (0.950, 0.873) -> Rank (SME, Machine): (3.0, 1.0 )

Ranking for Question 2:
Claude   - score (SME, Machine): (0.990, 0.942) -> Rank (SME, Machine): (2.5, 2.0 )
Llama3   - score (SME, Machine): (1.000, 0.897) -> Rank (SME, Machine): (1.0, 3.0 )
Gpt4     - score (SME, Machine): (0.990, 0.962) -> Rank (SME, Machine): (2.5, 1.0 )

Ranking for Question 3:
Claude   - score (SME, Machine): (0.910, 0.748) -> Rank (SME, Machine): (1.0, 3.0 )
Llama3   - score (SME, Machine): (0.900, 0.938) -> Rank (SME, Machine): (2.0, 1.0 )
Gpt4     - score (SME, Machine): (0.500, 0.920) -> Rank (SME, Machine): (3.0, 2.0 )


### Summing the ranking of the LLMs over the different questions (lowest value is the best LLM)

In [101]:
# Initialize a dictionary to hold the sum of ranks for each LLM model
cum_rank_sum = {key: 0 for key in key_order}
cum_rank_sum_sme = {key: 0 for key in key_order}

# Iterate through the cum_rank_dict to sum up the ranks for each LLM model
for (q_id, key), value in cum_rank_dict.items():
    cum_rank_sum[key] += value
# Iterate through the cum_rank_dict_sme to sum up the sme-ranks for each LLM model
for (q_id, key), value in cum_rank_dict_sme.items():
    cum_rank_sum_sme[key] += value

print(f'BERT sentence similarity: Rankings based on cumulative scores: \n\tNr of Questions={len(human_responses)}')


# Sort the cumulative rank sums for each LLM model
sorted_cum_rank_sum = sorted(cum_rank_sum.items(), key=lambda item: item[1])

# Sort the cumulative SME rank sums for each LLM model
sorted_cum_rank_sum_sme = sorted(cum_rank_sum_sme.items(), key=lambda item: item[1])

# Print the sorted cumulative rank sums for each LLM model
print('\n--> Machine results:')
for rank, (key, value) in enumerate(sorted_cum_rank_sum, 1):
    print(f"{rank}. {key:<8} - CumRank: {value:.1f}")
# Print the sorted sme-cumulative rank sums for each LLM model
print('\n--> SME results:')
for rank, (key, value) in enumerate(sorted_cum_rank_sum_sme, 1):
    print(f"{rank}. {key:<8} - CumRank: {value:.1f}")



BERT sentence similarity: Rankings based on cumulative scores: 
	Nr of Questions=3

--> Machine results:
1. Gpt4     - CumRank: 4.0
2. Claude   - CumRank: 7.0
3. Llama3   - CumRank: 7.0

--> SME results:
1. Claude   - CumRank: 4.5
2. Llama3   - CumRank: 5.0
3. Gpt4     - CumRank: 8.5


## 4. BERT embedding similarity

In [None]:
pip install bert-score

### Loop over the list of LLMs

In [95]:
from bert_score import score
from scipy.stats import rankdata

score_dict = {}
for LLM_model, responses in generated_responses.items():
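    # Compute BERTScore. Note: an explicit model_type takes precedence over the
    # default model that lang="de" would select; bert-base-uncased is an English model.
    P, R, F1 = score(responses, human_responses, lang="de", model_type='bert-base-uncased')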
    score_dict.update({(i, LLM_model): score.item() for i, score in enumerate(F1)})
    
    # Print the BERTScore F1 scores
    print(f'\n********** {LLM_model} ***************')
    for i, f1 in enumerate(F1):
        print(f"BERTScore F1 for Q{i+1}: {f1.item():.4f}")


********** Llama3 ***************
BERTScore F1 for Q1: 0.6571
BERTScore F1 for Q2: 0.7083
BERTScore F1 for Q3: 0.6433

********** Gpt4 ***************
BERTScore F1 for Q1: 0.7424
BERTScore F1 for Q2: 0.7299
BERTScore F1 for Q3: 0.6991

********** Claude ***************
BERTScore F1 for Q1: 0.6778
BERTScore F1 for Q2: 0.7377
BERTScore F1 for Q3: 0.6877


### Ranking of the LLMs per Question

In [96]:
# Iterate through each question and rank the models based on scores
cum_rank_dict = {(q_id, key): 0 for q_id in range(len(questions)) for key in key_order}

for question_id in range(len(human_responses)):
    ordered_values = [score_dict.get((question_id, key), None) for key in key_order]
    ranks = rankdata([-val for val in ordered_values], method='average')
    
    print(f"\nRanking for Question {question_id + 1}:")
    for key, val, rnk in zip(key_order, ordered_values, ranks):
        val_sme = score_dict_sme[(question_id,key)]
        rnk_sme = cum_rank_dict_sme[(question_id,key)]
        print(f"{key:<8} - score (SME, Machine): ({val_sme:.3f}, {val:.3f}) -> Rank (SME, Machine): ({rnk_sme}, {rnk} )")
        cum_rank_dict[(question_id,key)] += rnk


Ranking for Question 1:
Claude   - score (SME, Machine): (1.000, 0.678) -> Rank (SME, Machine): (1.0, 2.0 )
Llama3   - score (SME, Machine): (0.990, 0.657) -> Rank (SME, Machine): (2.0, 3.0 )
Gpt4     - score (SME, Machine): (0.950, 0.742) -> Rank (SME, Machine): (3.0, 1.0 )

Ranking for Question 2:
Claude   - score (SME, Machine): (0.990, 0.738) -> Rank (SME, Machine): (2.5, 1.0 )
Llama3   - score (SME, Machine): (1.000, 0.708) -> Rank (SME, Machine): (1.0, 3.0 )
Gpt4     - score (SME, Machine): (0.990, 0.730) -> Rank (SME, Machine): (2.5, 2.0 )

Ranking for Question 3:
Claude   - score (SME, Machine): (0.910, 0.688) -> Rank (SME, Machine): (1.0, 2.0 )
Llama3   - score (SME, Machine): (0.900, 0.643) -> Rank (SME, Machine): (2.0, 3.0 )
Gpt4     - score (SME, Machine): (0.500, 0.699) -> Rank (SME, Machine): (3.0, 1.0 )


### Summing the ranking of the LLMs over the different questions (lowest value is the best LLM)

In [97]:
# Initialize a dictionary to hold the sum of ranks for each LLM model
cum_rank_sum = {key: 0 for key in key_order}
cum_rank_sum_sme = {key: 0 for key in key_order}

# Iterate through the cum_rank_dict to sum up the ranks for each LLM model
for (q_id, key), value in cum_rank_dict.items():
    cum_rank_sum[key] += value
# Iterate through the cum_rank_dict_sme to sum up the sme-ranks for each LLM model
for (q_id, key), value in cum_rank_dict_sme.items():
    cum_rank_sum_sme[key] += value

print(f'BERT embedding: Rankings based on cumulative scores: \n\tNr of Questions={len(human_responses)}')


# Sort the cumulative rank sums for each LLM model
sorted_cum_rank_sum = sorted(cum_rank_sum.items(), key=lambda item: item[1])

# Sort the cumulative rank sums for each LLM model
sorted_cum_rank_sum_sme = sorted(cum_rank_sum_sme.items(), key=lambda item: item[1])

# Print the sorted cumulative rank sums for each LLM model
print('\n--> Machine results:')
for rank, (key, value) in enumerate(sorted_cum_rank_sum, 1):
    print(f"{rank}. {key:<8} - CumRank: {value:.1f}")
# Print the sorted sme-cumulative rank sums for each LLM model
print('\n--> SME results:')
for rank, (key, value) in enumerate(sorted_cum_rank_sum_sme, 1):
    print(f"{rank}. {key:<8} - CumRank: {value:.1f}")

BERT embedding: Rankings based on cumulative scores: 
	Nr of Questions=3

--> Machine results:
1. Gpt4     - CumRank: 4.0
2. Claude   - CumRank: 5.0
3. Llama3   - CumRank: 9.0

--> SME results:
1. Claude   - CumRank: 4.5
2. Llama3   - CumRank: 5.0
3. Gpt4     - CumRank: 8.5


## 5. METEOR

In [66]:
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download('wordnet')
nltk.download('punkt')

# Tokenize the sentences
human_responses_tokenized = [nltk.word_tokenize(response) for response in human_responses]
generated_responses_tokenized = {
    LLM: [nltk.word_tokenize(response) for response in responses]
    for LLM, responses in generated_responses.items()
}



### Loop over the list of LLMs

In [67]:
score_dict = {}
for LLM_model in generated_responses_tokenized.keys():
    print(f'\n********** {LLM_model} ***************')
    for i in range(len(human_responses)):
        # Named `meteor` rather than `score` so the bert_score `score` function imported earlier is not overwritten
        meteor = meteor_score([human_responses_tokenized[i]], generated_responses_tokenized[LLM_model][i])
        score_dict.update({(i,LLM_model): meteor})
        print(f"METEOR score for pair Q{i+1}: {meteor:.4f}")


********** Llama3 ***************
METEOR score for pair Q1: 0.2008
METEOR score for pair Q2: 0.2153
METEOR score for pair Q3: 0.1690

********** Gpt4 ***************
METEOR score for pair Q1: 0.2609
METEOR score for pair Q2: 0.2282
METEOR score for pair Q3: 0.1509

********** Claude ***************
METEOR score for pair Q1: 0.2217
METEOR score for pair Q2: 0.3298
METEOR score for pair Q3: 0.1942


### Ranking of the LLMs per Question

In [68]:
# Iterate through each question and rank the models based on scores
cum_rank_dict = {(q_id, key): 0 for q_id in range(len(questions)) for key in key_order}

for question_id in range(len(human_responses)):
    ordered_values = [score_dict.get((question_id, key), None) for key in key_order]
    ranks = rankdata([-val for val in ordered_values], method='average')
    
    print(f"\nRanking for Question {question_id + 1}:")
    for key, val, rnk in zip(key_order, ordered_values, ranks):
        val_sme = score_dict_sme[(question_id,key)]
        rnk_sme = cum_rank_dict_sme[(question_id,key)]
        print(f"{key:<8} - score (SME, Machine): ({val_sme:.3f}, {val:.3f}) -> Rank (SME, Machine): ({rnk_sme}, {rnk} )")
        cum_rank_dict[(question_id,key)] += rnk


Ranking for Question 1:
Claude   - score (SME, Machine): (1.000, 0.222) -> Rank (SME, Machine): (1.0, 2.0 )
Llama3   - score (SME, Machine): (0.990, 0.201) -> Rank (SME, Machine): (2.0, 3.0 )
Gpt4     - score (SME, Machine): (0.950, 0.261) -> Rank (SME, Machine): (3.0, 1.0 )

Ranking for Question 2:
Claude   - score (SME, Machine): (0.990, 0.330) -> Rank (SME, Machine): (2.5, 1.0 )
Llama3   - score (SME, Machine): (1.000, 0.215) -> Rank (SME, Machine): (1.0, 3.0 )
Gpt4     - score (SME, Machine): (0.990, 0.228) -> Rank (SME, Machine): (2.5, 2.0 )

Ranking for Question 3:
Claude   - score (SME, Machine): (0.910, 0.194) -> Rank (SME, Machine): (1.0, 1.0 )
Llama3   - score (SME, Machine): (0.900, 0.169) -> Rank (SME, Machine): (2.0, 2.0 )
Gpt4     - score (SME, Machine): (0.500, 0.151) -> Rank (SME, Machine): (3.0, 3.0 )


### Summing the ranking of the LLMs over the different questions (lowest value is the best LLM)

In [None]:
# Sort the per-question ranks; the keys of cum_rank_dict are (question_id, LLM) tuples
sorted_scores = sorted(cum_rank_dict.items(), key=lambda item: item[1])

# Print the per-question rank of each LLM
print(f"Per-question LLM ranks: Nr of Questions={len(human_responses)}")
for (q_id, key), value in sorted_scores:
    print(f"Q{q_id + 1} {key:<8} - Rank: {value:.1f}")

In [69]:
# Initialize a dictionary to hold the sum of ranks for each LLM model
cum_rank_sum = {key: 0 for key in key_order}
cum_rank_sum_sme = {key: 0 for key in key_order}

# Iterate through the cum_rank_dict to sum up the ranks for each LLM model
for (q_id, key), value in cum_rank_dict.items():
    cum_rank_sum[key] += value
# Iterate through the cum_rank_dict_sme to sum up the sme-ranks for each LLM model
for (q_id, key), value in cum_rank_dict_sme.items():
    cum_rank_sum_sme[key] += value

print(f'METEOR: Rankings based on cumulative scores: \n\tNr of Questions={len(human_responses)}')


# Sort the cumulative rank sums for each LLM model
sorted_cum_rank_sum = sorted(cum_rank_sum.items(), key=lambda item: item[1])

# Sort the cumulative rank sums for each LLM model
sorted_cum_rank_sum_sme = sorted(cum_rank_sum_sme.items(), key=lambda item: item[1])

# Print the sorted cumulative rank sums for each LLM model
print('\n--> Machine results:')
for rank, (key, value) in enumerate(sorted_cum_rank_sum, 1):
    print(f"{rank}. {key:<8} - CumRank: {value:.1f}")
# Print the sorted sme-cumulative rank sums for each LLM model
print('\n--> SME results:')
for rank, (key, value) in enumerate(sorted_cum_rank_sum_sme, 1):
    print(f"{rank}. {key:<8} - CumRank: {value:.1f}")

METEOR: Rankings based on cumulative scores: 
	Nr of Questions=3

--> Machine results:
1. Claude   - CumRank: 4.0
2. Gpt4     - CumRank: 6.0
3. Llama3   - CumRank: 8.0

--> SME results:
1. Claude   - CumRank: 4.5
2. Llama3   - CumRank: 5.0
3. Gpt4     - CumRank: 8.5


In [71]:
print(cum_rank_dict)
print(cum_rank_dict_sme)

{(0, 'Claude'): 2.0, (0, 'Llama3'): 3.0, (0, 'Gpt4'): 1.0, (1, 'Claude'): 1.0, (1, 'Llama3'): 3.0, (1, 'Gpt4'): 2.0, (2, 'Claude'): 1.0, (2, 'Llama3'): 2.0, (2, 'Gpt4'): 3.0}
{(0, 'Claude'): 1.0, (0, 'Llama3'): 2.0, (0, 'Gpt4'): 3.0, (1, 'Claude'): 2.5, (1, 'Llama3'): 1.0, (1, 'Gpt4'): 2.5, (2, 'Claude'): 1.0, (2, 'Llama3'): 2.0, (2, 'Gpt4'): 3.0}


# Tooling functions:
1. What were the questions

## 1. What were the questions:

In [None]:
from IPython.display import display, Markdown
for i in range(len(questions)):
    display(Markdown(f"Q{i+1}: **{questions[i]}**\n\n **Human Response:** {human_responses[i]}\n\n"))
    for LLM in generated_responses.keys():
        display(Markdown(f"**{LLM} Response:**\n {generated_responses[LLM][i]}"))

In [None]:
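from scipy.stats import rankdata

print(score_dict)
key_order = ['Claude', 'Llama3', 'Gpt4']
# Index directly: a missing model raises a KeyError here instead of a TypeError on None below
ordered_values = [score_dict[key] for key in key_order]
# Negate the scores so that the highest score receives rank 1; ties share the average rank
ranks = rankdata([-val for val in ordered_values], method='average')
for key, val, rnk in zip(key_order, ordered_values, ranks):
    print(f"{key:<8}-{val:.2f}-> Rank: {rnk}")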
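
As a small illustration of the ranking step above: `rankdata` assigns ascending ranks, so negating the scores gives rank 1 to the highest score, and `method='average'` gives tied entries the mean of the ranks they span. This is one way a fractional rank such as 2.5 can arise. The model names and scores below are made up for illustration and are not results from this notebook.

```python
from scipy.stats import rankdata

# Hypothetical scores, chosen only to show how ties are handled
example_scores = {'ModelA': 0.80, 'ModelB': 0.50, 'ModelC': 0.50}

names = list(example_scores)
# Negate so that the highest score gets rank 1
example_ranks = rankdata([-example_scores[n] for n in names], method='average')

for name, rnk in zip(names, example_ranks):
    print(f"{name:<8} -> Rank: {rnk}")
```

Here `ModelB` and `ModelC` would occupy ranks 2 and 3, so both receive the average rank 2.5.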

# Evaluating the agreement between machine-generated rankings and human rankings can be approached using the following two metrics:
1. **Kendall's Tau** is a measure of rank correlation that assesses the similarity between two rankings by counting concordant and discordant pairs. A pair of observations is concordant when its order in the human ranking is preserved in the machine ranking, and discordant when the order is reversed. Kendall's Tau is given by
   
   $\tau = \frac{2(C-D)}{n(n-1)}$

   where $C$ is the number of concordant pairs, $D$ is the number of discordant pairs, and $n$ is the number of ranked observations, so that $n(n-1)/2$ is the total number of pairs. This form assumes that there are no tied ranks.

2. **Spearman's Rank Correlation Coefficient** measures the strength and direction of the association between two ranked variables. The formula is:

    $\rho = 1 - \frac{6 \sum d_{i}^{2}}{n(n^{2}-1)}$

   where $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations. For example, if the human ranking is (1,2,3) and the machine ranking is (2,1,3) for a given LLM order, then $d_1 = 2 - 1 = 1$, $d_2 = 1 - 2 = -1$ and $d_3 = 3 - 3 = 0$. This formula is exact only when there are no tied ranks.
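
The two formulas can be checked by hand on the (1,2,3) versus (2,1,3) example. The following is a minimal sketch that assumes no tied ranks; the function names `kendall_tau_no_ties` and `spearman_rho_no_ties` are introduced here for illustration only and are not part of the notebook.

```python
from itertools import combinations

def kendall_tau_no_ties(rank_a, rank_b):
    """Kendall's tau from concordant/discordant pair counts (assumes no ties)."""
    n = len(rank_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A positive product means both rankings order the pair the same way
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return 2 * (concordant - discordant) / (n * (n - 1))

def spearman_rho_no_ties(rank_a, rank_b):
    """Spearman's rho from squared rank differences (assumes no ties)."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

human = [1, 2, 3]
machine = [2, 1, 3]
tau = kendall_tau_no_ties(human, machine)
rho = spearman_rho_no_ties(human, machine)
print(f"tau = {tau:.2f}, rho = {rho:.2f}")  # tau = 0.33, rho = 0.50
```

For this example there are $C = 2$ concordant pairs and $D = 1$ discordant pair, so $\tau = 2(2-1)/(3 \cdot 2) = 1/3$, and $\sum d_i^2 = 2$, so $\rho = 1 - 12/24 = 0.5$.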

## Applied to our example
Consider a single question with one human answer and three LLM answers. A human ranking tells us which LLM answer is the best, which is second, and so forth. For example, if for Q1 the human judges LLM1_Answer as best, LLM2_Answer as third and LLM3_Answer as second, then the human ranking is (1,3,2). To obtain the machine ranking, we need a metric that scores how good each machine answer is compared to the human answer; ranking the answers by this score gives the machine ranking, for instance (1,2,3). From the human ranking and the machine ranking we can then calculate Kendall's Tau and/or Spearman's Rank Correlation Coefficient.
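
The pipeline described above (metric scores, then machine ranking, then rank correlation against the human ranking) can be sketched as follows. The metric scores are hypothetical values chosen so that the machine ranking becomes (1,2,3); they are not results from this notebook.

```python
from scipy.stats import kendalltau, rankdata, spearmanr

# Human ranking for LLM1, LLM2, LLM3 as in the example above
human_ranking = [1, 3, 2]

# Hypothetical metric scores for LLM1, LLM2, LLM3 (higher is better)
metric_scores = [0.9, 0.7, 0.5]

# Negate the scores so that the highest score receives rank 1
machine_ranking = rankdata([-s for s in metric_scores], method='average')

tau, tau_p = kendalltau(human_ranking, machine_ranking)
rho, rho_p = spearmanr(human_ranking, machine_ranking)
print(f"Machine ranking: {list(machine_ranking)}")
print(f"Kendall's tau = {tau:.2f}, Spearman's rho = {rho:.2f}")
```

With only three items per question, the p-values returned by both functions carry little information, so the coefficients are best read as descriptive rather than as significance tests.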

In [90]:
import numpy as np
from scipy.stats import kendalltau

key_order = ['Claude', 'Llama3', 'Gpt4']
print("Q ID,   Tau ,   p-val  Machine Rnks, SME Rnks")
for q_id in range(len(questions)):
    rnk_mac = [cum_rank_dict.get((q_id, key), None) for key in key_order]
    rnk_sme = [cum_rank_dict_sme.get((q_id, key), None) for key in key_order]
    tau1, p_value1 = kendalltau(rnk_mac, rnk_sme)
    print(f"Q{q_id:<3}, {tau1:6.2f}, {p_value1:6.2f}, {rnk_mac}, {rnk_sme}")

Q ID,   Tau ,   p-val  Machine Rnks, SME Rnks
Q0  ,  -0.33,   1.00, [2.0, 3.0, 1.0], [1.0, 2.0, 3.0]
Q1  ,  -0.82,   0.22, [1.0, 3.0, 2.0], [2.5, 1.0, 2.5]
Q2  ,   1.00,   0.33, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
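
Spearman's coefficient can be computed per question in the same way. The sketch below copies the per-question ranks from the output above (in the order Claude, Llama3, Gpt4) instead of reading them from `cum_rank_dict`, so that it runs on its own. Note that `kendalltau` computes the tau-b variant by default, which adjusts for tied ranks; this is why Q1, which contains a tie in the SME ranks, does not follow the simple no-ties formula.

```python
from scipy.stats import kendalltau, spearmanr

# Per-question ranks copied from the output above (order: Claude, Llama3, Gpt4)
machine_ranks = [[2.0, 3.0, 1.0], [1.0, 3.0, 2.0], [1.0, 2.0, 3.0]]
sme_ranks = [[1.0, 2.0, 3.0], [2.5, 1.0, 2.5], [1.0, 2.0, 3.0]]

results = []
print("Q ID,   Tau ,   Rho")
for q_id, (rnk_mac, rnk_sme) in enumerate(zip(machine_ranks, sme_ranks)):
    tau, _ = kendalltau(rnk_mac, rnk_sme)  # tau-b by default, handles ties
    rho, _ = spearmanr(rnk_mac, rnk_sme)   # Pearson correlation of the ranks
    results.append((tau, rho))
    print(f"Q{q_id:<3}, {tau:6.2f}, {rho:6.2f}")
```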
