Published on December 03, 2024. By Prata, Marília (mpwolke)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

#Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Competition Citation

@misc{llms-you-cant-please-them-all,
    author = {Paul Mooney and Ashley Chow and Will Cukierski},
    title = {LLMs - You Can't Please Them All},
    year = {2024},
    howpublished = {\url{https://kaggle.com/competitions/llms-you-cant-please-them-all}},
}

![](https://neurips.cc/media/PosterPDFs/NeurIPS%202024/96672.png?t=1732572730.1297684)
https://neurips.cc/virtual/2024/poster/96672

## LLM Evaluators Recognize and Favor Their Own Generations

Authors: Arjun Panickssery, Samuel R. Bowman, Shi Feng


"The authors provided initial evidence towards the hypothesis that LLMs prefer their own generations because they recognize themselves. In addition to evaluating LLMs out-of-the-box, they showed that fine-tuning on a small number of examples elicits strong, generalizable self-recognition capability on summarization datasets. By varying the fine-tuning task, the authors observed a linear correlation between self-recognition and self-preference, and validated that the correlation cannot be explained away by potential confounders. Their results established self-recognition as a crucial factor in unbiased self-evaluation as well as an important safety-related property. The experiment design also provides a blueprint to explore the effects of self-recognition on other downstream properties."

https://arxiv.org/pdf/2404.13076

## Submission file

In [None]:
sub = pd.read_csv('../input/llms-you-cant-please-them-all/sample_submission.csv')
sub.tail()

## Test file

There is no training file at all, only the test file.

In [None]:
test = pd.read_csv('../input/llms-you-cant-please-them-all/test.csv')
test.tail()

In [None]:
sub['essay'][2]

In [None]:
test['topic'][1]

In [None]:
test['topic'][0]

## Large Language Models are not Fair Evaluators

Authors: Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui

"In this paper, the authors revealed a systematic positional bias in evaluation with advanced ChatGPT/GPT-4 models: by manipulating the order of candidate responses during evaluation, the quality ranking results can be significantly influenced. To this end, they introduced three effective strategies, namely Multiple Evidence Calibration (MEC), Balanced Position Calibration (BPC), and Human-in-the-Loop Calibration (HITLC). MEC requires the LLM evaluator to first provide multiple pieces of evaluation evidence to support its subsequent ratings, and BPC aggregates the results from various orders to determine the final score."

"Based on the results of MEC and BPC, HITLC further calculates a balanced position diversity entropy to select examples for human annotations. These strategies successfully reduce the evaluation bias and improve alignment with human judgments. The authors provided their code and human annotations to support future studies and enhance the evaluation of generative models."

https://arxiv.org/pdf/2305.17926
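
To make the calibration idea more concrete, here is a minimal sketch of how I understand BPC combined with repeated sampling: the judge scores the pair in both orders, several times, and the scores are averaged. This is my own illustration, not the authors' code. The `Judge` signature, `calibrated_scores`, and `biased_toy_judge` are hypothetical names, and the toy judge is a deterministic stand-in for an LLM call (the evidence-generation step of MEC would live inside the real judge prompt and is not modeled here).

```python
from statistics import mean
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption for this sketch):
# (question, first_response, second_response) -> (score_first, score_second).
# In practice this would wrap an LLM call.
Judge = Callable[[str, str, str], Tuple[float, float]]


def calibrated_scores(judge: Judge, question: str, resp_a: str, resp_b: str,
                      k: int = 3) -> Tuple[float, float]:
    """Average scores over k samples and over both presentation orders."""
    scores_a, scores_b = [], []
    for _ in range(k):
        # Original order: A is shown first.
        s_a, s_b = judge(question, resp_a, resp_b)
        scores_a.append(s_a)
        scores_b.append(s_b)
        # Swapped order: B is shown first, so the tuple comes back as (B, A).
        s_b_swapped, s_a_swapped = judge(question, resp_b, resp_a)
        scores_a.append(s_a_swapped)
        scores_b.append(s_b_swapped)
    return mean(scores_a), mean(scores_b)


def biased_toy_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    """Toy stand-in for an LLM judge: scores by length, with a bonus for the first slot."""
    return len(first) + 2.0, float(len(second))


# Two identical responses: a position-biased judge favors whichever comes first,
# but averaging over both orders gives them the same score.
score_a, score_b = calibrated_scores(biased_toy_judge, "q", "abcd", "abcd", k=2)
print(score_a, score_b)
```

With a real LLM judge the samples would differ from call to call, so `k` matters there; with the deterministic toy judge it only repeats the same values.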

In [None]:
import os
import time
import google.generativeai as genai
from kaggle_secrets import UserSecretsClient

In [None]:
#By Paul Mooney https://www.kaggle.com/code/paultimothymooney/how-to-upload-large-files-to-gemini-1-5/notebook

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
GEMINI_API_KEY = user_secrets.get_secret("GEMINI_API_KEY")
genai.configure(api_key=GEMINI_API_KEY)

In [None]:
generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain",
}

model = genai.GenerativeModel(
  model_name="gemini-1.5-flash",
  generation_config=generation_config,
  system_instruction="You are an AI research assistant that understands and summarizes data. Answer questions briefly, referring only to the provided context.",
)

## Importing Data Source

In [None]:
ex_d = genai.upload_file("/kaggle/input/llms-you-cant-please-them-all/test.csv")

In [None]:
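# Start a chat whose history already contains the uploaded test.csv file
chat_session = model.start_chat(
    history=[
        {
            'role': 'user',
            'parts': [
                ex_d,
            ],
        }
    ]
)

def chatI(prompt):
    # Send a prompt to the chat session and print Gemini's reply
    response = chat_session.send_message(prompt)
    print(response.text)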

In [None]:
Prompt="If there is willingness, there are always conditions for improvement. I have the lucidity to see that I know nothing about almost anything. For now, I still have half a dozen certainties that hinder me. I know nothing, but this does not stop me from writing a few lines on various subjects. What is the role of self-reliance in achieving success in software engineering?"

In [None]:
chatI(Prompt)

In [None]:
#Checking the information provided above by chatI Gemini

row=2

#To display in a table format, you can use with or without display()
test.iloc[row:row+1]

In [None]:
chatI("How to evaluate the effectiveness of management consulting in addressing conflicts within marketing?")

In [None]:
#Verify chatI Gemini answer above
row=1

#To display in a table format, you can use with or without display()
test.iloc[row:row+1]

In [None]:
chatI("What's the importance of self-reliance and adaptability in healthcare?")

## Gemini, what's Plucrarealucrarealucrarealucrarealucrarealucrarealucrarealucrarea?

In [None]:
chatI("What's the meaning of Plucrarealucrarealucrarealucrarealucrarealucrarealucrarealucrarea?")

In [None]:
#What about supercalifragilisticexpialidocious?
chatI("What is supercalifragilisticexpialidocious?")

In [None]:
chatI("How can I provide you access to the content of that document?")

## Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica.

"In this paper, the authors proposed LLM-as-a-judge for chatbot evaluation and systematically examined its efficacy using human preference data from 58 experts on MT-bench, as well as thousands of crowd users on Chatbot Arena. Their results reveal that strong LLMs can achieve an agreement rate of over 80%, on par with the level of agreement among human experts, establishing a foundation for an LLM-based evaluation framework."

https://arxiv.org/pdf/2306.05685
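
As a rough illustration of what an "agreement rate" between an LLM judge and human raters can look like in code, here is a minimal sketch that counts how often two lists of verdicts match. The function `agreement_rate` and the verdict lists are made up for this example; they are not data or code from the paper, which may define and aggregate agreement differently.

```python
from typing import Sequence


def agreement_rate(judge_votes: Sequence[str], human_votes: Sequence[str]) -> float:
    """Fraction of comparisons where the judge and the human picked the same verdict."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("Both vote lists must have the same length.")
    if not judge_votes:
        raise ValueError("At least one vote is required.")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Hypothetical verdicts for four pairwise comparisons ("A", "B", or "tie").
llm_votes = ["A", "B", "A", "tie"]
human_votes = ["A", "B", "B", "tie"]

rate = agreement_rate(llm_votes, human_votes)
print(f"{rate:.2f}")
```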

![](https://images7.memedroid.com/images/UPLOADED400/605df28d4e241.jpeg)

Image source: memedroid

## Keep Pleasing Them All, Gemini!

I'm pleased with the results provided by Gemini. They are correct and very objective. Great job!

If there is willingness, **there are always conditions for improvement**.


I have the lucidity to see that **I know nothing about almost anything**.


For now, I still have half a dozen certainties that hinder me.


I know nothing, but this does not stop me from writing a few lines on various subjects.


**The awareness of my ignorance is what brings me relief and tranquility**.

## Acknowledgements

Paul Mooney https://www.kaggle.com/code/paultimothymooney/how-to-upload-large-files-to-gemini-1-5/notebook

mpwolke https://www.kaggle.com/code/mpwolke/um-discurso-existencialista-com-gemma2/notebook