# (ADA) Homework 1: Scoring the Language Model Olympics

---

By the end of this homework, we expect you to be able to:

- Load and handle data using pandas;
- Navigate the documentation of Python packages by yourself;
- Filter and tidy up noisy real-world datasets;
- Aggregate your data in different (and hopefully helpful) ways;
- Create meaningful visualizations to analyze the data;
- Communicate your findings in a clear and concise manner.

---

**Important Dates.**

- Homework release: Fri 04 Oct 2024
- Homework due: Sat 18 Oct 2024, 23:59
- Grade release: Mon 04 Nov 2024

**Some rules**

- You are allowed to use any built-in Python library that comes with Anaconda. If you want to use an external library, you may do so, but must justify your choice.
- Make sure you use the data folder provided in the repository in read-only mode. (Or alternatively, be sure you don’t change any of the files.)
- Be sure to provide a concise textual description of your thought process, the assumptions you made, the solution you implemented, and explanations for your answers. A notebook that only has code cells will not suffice. To avoid confusion: use short comments for longer code answers.
- For questions containing the /Discuss:/ prefix, answer not with code, but with a textual explanation (in markdown).
- Back up any hypotheses and claims with data, since this is an important aspect of the course.
- Please write all your comments in English, and use meaningful variable names in your code. Your repo should have a single notebook (plus the required data files) in the master/main branch. If there are multiple notebooks present, we will not grade anything.
- We will not run your notebook for you! Rather, we will grade it as is, which means that only the results contained in your evaluated code cells will be considered, and we will not see the results in unevaluated code cells. Thus, be sure to hand in a fully-run and evaluated notebook. In order to check whether everything looks as intended, you can check the rendered notebook on the GitHub website once you have pushed your solution there.
- In continuation to the previous point, interactive plots, such as those generated using the ‘plotly’ package, should be strictly avoided! Make sure to print results and/or dataframes that confirm you have properly addressed the task.

**A Note on using Language Models (LMs)**

If you try hard enough, you will likely get away with cheating. Fortunately, our job is not to police, but rather to educate! So, please consider the following:
- Presumably, you are taking this course to learn something! LMs are not always right ([they often fail in silly ways](https://community.openai.com/t/why-9-11-is-larger-than-9-9-incredible/869824/4)). This course should prepare you to detect when they are wrong!
- Some of the TAs on this course literally published many works on detecting machine-generated text.
---

## Context

AI is booming! Newspapers, influencers, and your relatives all agree that AI is important. But while almost everyone agrees that AI is the future, much is unclear about what that future looks like…

Freshly graduated from the EPFL, you are hired by the Swiss government to advise on a large-scale “AI integration” initiative code-named **"NEUTRALITY"** (Navigating Efficient Upgrades Through Robust Artificial Learning Integration Techniques Yearly). Convinced by the stunning progress in language modeling, the government would like to battle the growing shortages in the education sector by using LMs. Your job description: investigate which LMs might be best suited!

You are given the results of three LMs on the [“Massive Multitask Language Understanding (MMLU)”](https://arxiv.org/abs/2009.03300) dataset to compare. This famous dataset consists of 57 subjects with multiple-choice questions, covering diverse subjects like mathematics, computer science, history, and law. Most providers of state-of-the-art LMs use this dataset to showcase the versatility of their latest models. Unfortunately, Horta-Ribeiro, the intern responsible for collecting the results, didn’t take EPFL’s famous ADA course. As a result, the collected datasets are slightly corrupted.

### A very brief primer on Language Models
Language models (LMs) are sophisticated statistical models designed to understand and generate human-like text. At their core, LMs are trained to predict the most likely continuation of a given input text. For example, given the input "The cat sat on the," an LM might predict "mat" as a likely continuation.
LMs are trained on vast text samples from various sources, including books, websites, and social media. This extensive training allows them to capture patterns and relationships in language, enabling them to generate coherent and contextually appropriate text across a wide range of topics and styles.

While LMs can produce text that appears to be written by intelligent humans, it's important to note that their capabilities can diverge from human intelligence in unexpected ways. They may sometimes generate factually incorrect information or struggle with complex reasoning tasks.

Two key concepts in understanding LMs are:
1. **Tokens**: LMs process text using "tokens" rather than individual characters. Tokens can be words, parts of words, or punctuation marks. For example, the sentence "I love AI!" might be tokenized as ["I", "love", "AI", "!"]. Tokenization is the first step in both training and using an LM.
2. **Context**: The input text provided to an LM is called the "context." This context informs the model's predictions or generations. A longer or more specific context often leads to more accurate and relevant outputs.
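
To make the idea of tokens concrete, here is a minimal sketch of a toy tokenizer that splits text into words and punctuation marks. It is only an illustration: real LMs use learned subword vocabularies, and the function name `toy_tokenize` is made up for this example.

```python
import re

def toy_tokenize(text):
    """Split text into word and punctuation tokens (toy illustration only)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(toy_tokenize("I love AI!"))
```

On the example sentence from the list above, this yields the four tokens "I", "love", "AI", and "!".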

[See: Wikipedia entry on language models](https://en.wikipedia.org/wiki/Large_language_model)

###  Files for this assignment
This assignment is divided into three tasks, each of which should bring you a step closer to providing a recommendation toward project NEUTRALITY’s objectives:

- **Task 1**: Inspecting the results and getting your first model ranking
- **Task 2**: Inspecting the underlying data used to generate the results for possible biases
- **Task 3**: Learning about tokens and providing a final recommendation


```
📁 PROJECT_NEUTRALITY
│
├── 📄 analysis.ipynb (the file you're currently reading!)
├── 📄 requirements.txt (install into your environment)
│
├── 📁 task_1
├── 📁 task_2
└── 📁 task_2_5
```   
 

In [2]:
pip install -r requirements.txt

Collecting jupyter>=1.0.0 (from -r requirements.txt (line 1))
  Using cached jupyter-1.1.1-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting tiktoken>=0.7.0 (from -r requirements.txt (line 7))
  Downloading tiktoken-0.8.0-cp311-cp311-win_amd64.whl.metadata (6.8 kB)
Collecting jupyter-console (from jupyter>=1.0.0->-r requirements.txt (line 1))
  Downloading jupyter_console-6.6.3-py3-none-any.whl.metadata (5.8 kB)
Collecting regex>=2022.1.18 (from tiktoken>=0.7.0->-r requirements.txt (line 7))
  Downloading regex-2024.9.11-cp311-cp311-win_amd64.whl.metadata (41 kB)
Using cached jupyter-1.1.1-py2.py3-none-any.whl (2.7 kB)
Downloading tiktoken-0.8.0-cp311-cp311-win_amd64.whl (884 kB)
   ---------------------------------------- 0.0/884.5 kB ? eta -:--:--
   ---------------------------------------- 884.5/884.5 kB 5.7 MB/s eta 0:00:00
Downloading regex-2024.9.11-cp311-cp311-win_amd64.whl (274 kB)
Downloading jupyter_console-6.6.3-py3-none-any.whl (24 kB)
Installing collected packages: regex, 

In [1]:
# please make sure you install the packages listed in the requirements.txt file in your environment!
# using pip
# pip install -r requirements.txt
#
# using Conda:
# conda create --name <env_name> --file requirements.txt
#
# some basic imports
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from scipy.stats import ttest_ind

## Task 1 (18 points): What's in an average anyway?

The files needed to complete task 1 can be found in the folder `data/task_1/`:
```
task_1/
│
├── mmlu_data/
│   └── test.csv
│
└── lm_scores/
    ├── lm_X.csv
    ├── lm_Y.csv
    └── lm_Z.csv
```

We will start by loading, (manually) inspecting, and cleaning the data. Although it doesn't seem "glamorous" (nor is it particularly fun...) - manually inspecting data is extremely important! In fact, it's one of the few things most AI and Data Science researchers agree on :). Next, we will take a first pass on ordering our Olympic podium between three LMs.

### 1.1 (1 pt)
 
Load the subfiles contained in the `mmlu_data` and `lm_scores` folders into separate dataframes:
- `df_test`
- `df_x`
- `df_y`
- `df_z`

For each, print its size.

In [8]:
task_folder = './task_1/'
lm_scores_folder = './lm_scores/'
mmlu_folder = './mmlu_data/'

#mmlu_data import
df_test = pd.read_csv(task_folder + mmlu_folder + 'test.csv')


#lm_scores import
df_x = pd.read_csv(task_folder + lm_scores_folder + 'lm_X.csv')
df_y = pd.read_csv(task_folder + lm_scores_folder + 'lm_Y.csv')
df_z = pd.read_csv(task_folder + lm_scores_folder + 'lm_Z.csv')




In [14]:
df_test

Unnamed: 0,question,A,B,C,D,answer,subject,question_id
0,Find the degree for the given field extension ...,0,4,2,6,B,abstract algebra,0
1,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...",8,2,24,120,C,abstract algebra,1
2,Find all zeros in the indicated finite field o...,0,1,01,04,D,abstract algebra,2
3,Statement 1 | A factor group of a non-Abelian ...,"True, True","False, False","True, False","False, True",B,abstract algebra,3
4,Find the product of the given polynomials in t...,2x^2 + 5,6x^2 + 4x + 6,0,x^2 + 1,B,abstract algebra,4
...,...,...,...,...,...,...,...,...
14037,What has been a central focus of religious tra...,Peace and harmony,Power and influence,Truth and love,Wisdom and ethics,A,world religions,14037
14038,To whom did ordinary folk appeal during a dro...,The Buddha,Laozi,The Queen Mother of the West,Confucius,C,world religions,14038
14039,The theological term homoousios means which o...,of a similar substance,of the same substance,of like substance,of human substance,B,world religions,14039
14040,"According to the Japanese origin myth, who giv...",Es,Izanagi,Izanami,Kami,B,world religions,14040


In [26]:
print('the size of the df_test dataset is', df_test.shape[0])

the size of the df_test dataset is 14042


In [15]:
df_x

Unnamed: 0,question_id,result
0,0,B
1,1,C
2,2,D
3,3,B
4,4,Answer: B
...,...,...
13877,14037,A
13878,14038,A
13879,14039,B
13880,14040,B


In [27]:
print('the size of the df_x dataset is',df_x.shape[0])

the size of the df_x dataset is 13882


In [20]:
df_y

Unnamed: 0,question_id,result
0,0,Answer: D
1,1,D
2,2,Answer: D
3,3,
4,4,D
...,...,...
13973,14037,C
13974,14038,D
13975,14039,Answer: D
13976,14040,B


In [28]:
print('the size of the df_y dataset is', df_y.shape[0])

the size of the df_y dataset is 13978


In [22]:
df_z

Unnamed: 0,question_id,result
0,0,B
1,1,Answer: B
2,2,C
3,3,B
4,4,B
...,...,...
13918,14037,A
13919,14038,A
13920,14039,B
13921,14040,B


In [29]:
print('the size of the df_z dataset is',df_z.shape[0])

the size of the df_z dataset is 13923


### 1.2 (4 pt)
Unfortunately, LMs don't always output the format we want. In the column `result`, the value should be one of A, B, C, or D. 

A. For each of the LM score dataframes, use a `value_counts()` operation and print the results. 

B. /Discuss:/ Inspect the results and describe the types of answer formats you see. Besides the "expected" case, you should be able to find at least four unexpected formats.
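
As a rough sketch of what part A might look like, the snippet below runs `value_counts()` on a small, made-up `result` column. The toy values are assumptions for illustration only. Passing `dropna=False` also counts missing values, which are skipped by default.

```python
import pandas as pd

# Toy stand-in for a `result` column (values invented for illustration)
toy_results = pd.Series(["A", "A ", "Answer: B", None, "A"], name="result")

# dropna=False keeps missing values as their own category
counts = toy_results.value_counts(dropna=False)
print(counts)
```

Note that `"A"` and `"A "` are counted separately, which is one way visually identical answers can show up as distinct rows.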

In [51]:
# A
print("the values for the dataset df_x are :\n", df_x['result'].value_counts())



the values for the dataset df_x are :
 result
A                                                                                                                 2733
A                                                                                                                 1657
B                                                                                                                 1412
Answer: A                                                                                                         1398
C                                                                                                                 1134
                                                                                                                  ... 
judicial activism, so the answer is A                                                                                1
creating insurmountable obstacles to the founding of factions, so the answer is A                                    1
A 

In [53]:
print("\n\n the values for the dataset df_y are :\n", df_y['result'].value_counts())



 the values for the dataset df_y are :
 result
D                                                                                                2894
Answer: D                                                                                        1718
C                                                                                                1701
B                                                                                                1240
D                                                                                                1145
                                                                                                 ... 
Where the energy of interaction between the atoms is at its minimum value, so the answer is A       1
leaves more viable offspring than others of its species., so the answer is D                        1
A and C only, so the answer is D                                                                    1
ADP + P → ATP, so the answer is D

In [54]:
print("\n\n the values for the dataset df_z are :\n", df_z['result'].value_counts())



 the values for the dataset df_z are :
 result
D                                                                                   2257
C                                                                                   2191
B                                                                                   2127
A                                                                                   2060
Answer: D                                                                            777
                                                                                    ... 
omission of a universal suffrage clause, so the answer is D                            1
declare war, so the answer is D                                                        1
state and local governments, by means of federal funding, so the answer is B           1
less clearly identified with consistent political ideologies, so the answer is B       1
Rahit, so the answer is B                                    

B. For all three datasets, the expected values are A, B, C, or D.

Besides these expected values, we observe the following answer formats:
+ a duplicate of the letter A, B, C, or D, which `value_counts()` treats as a separate value, most likely because of extra whitespace around the letter;
+ the letter preceded by a prefix, as in "Answer: A";
+ the letter preceded by text (an explanation or the content of the option), as in "..., so the answer is A";
+ a missing value (an empty `result`, as in row 3 of `df_y`);
+ a non-answer that reads `NotSure` once spaces are removed (dropped in 1.3).


### 1.3 (5 pt)
Oh oh... That doesn't look great. Simply dropping all invalid answers seems overly wasteful, yet fixing all of these looks like a mess! Instead, let's focus for now on fixing just those answers of length < 10 characters that require only a single `str.replace()` operation. 

For example, if the answer looks like `--A--`, we could fix this by using the following simple function:

```
def clean_answer(s, pattern='-'):
    return str(s).replace(pattern, '')

dirty_answer = '--A--'
cleaned_answer = clean_answer(dirty_answer)
```

A. Filter the three score dataframes to include only answers with fewer than 10 characters. Make a deep copy of the dataframes as you filter them.

B. Modify the `clean_answer()` example function to clean the answers in the filtered data frames using the `apply()` functionality. Finally, make sure **all remaining answers are one of `A, B, C, or D`.**

C. /Discuss:/ Compare the sizes of the original and filtered data frames. What do you see? Why might this be a problem?
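
The steps in A and B can be sketched as follows on a small, made-up dataframe. This is only one possible approach under stated assumptions: the helper name `strip_pattern`, the constant `VALID_ANSWERS`, and the toy values are all invented for illustration.

```python
import pandas as pd

VALID_ANSWERS = ["A", "B", "C", "D"]

def strip_pattern(s, pattern="-"):
    """Remove every occurrence of `pattern` from the answer string."""
    return str(s).replace(pattern, "")

# Toy score dataframe (values invented for illustration)
toy_scores = pd.DataFrame({
    "question_id": [0, 1, 2, 3],
    "result": ["--A--", "B", "a very long answer, so the answer is C", "--D--"],
})

# Part A: keep answers shorter than 10 characters, as an independent copy
short = toy_scores[toy_scores["result"].str.len() < 10].copy(deep=True)

# Part B: clean with apply(), then keep only rows whose answer is A, B, C, or D
short["result"] = short["result"].apply(strip_pattern)
cleaned = short[short["result"].isin(VALID_ANSWERS)]

print(len(toy_scores), "->", len(cleaned))
```

Keeping the whole dataframe (rather than only the `result` column) preserves `question_id` for later joins.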

In [82]:
example="ANSWER: D"

example.split("ANSWER: ")

['', 'D']

In [121]:
# A
df_x_less10 = df_x[df_x['result'].str.len() < 10].copy(deep=True)
df_y_less10 = df_y[df_y['result'].str.len() < 10].copy(deep=True)
df_z_less10 = df_z[df_z['result'].str.len() < 10].copy(deep=True)


# Cleaning function: remove each pattern in turn.
# A tuple is used as the default to avoid a mutable default argument.
def clean_answer(s, patterns=('Answer:', ' ')):
    result = str(s)
    for pattern in patterns:
        result = result.replace(pattern, '')
    return result
   

df_x_filtered = df_x_less10['result'].apply(clean_answer)
df_y_filtered = df_y_less10['result'].apply(clean_answer)
df_z_filtered = df_z_less10['result'].apply(clean_answer)

#getting rid of NotSure

df_x_cleaned = df_x_filtered.drop(df_x_filtered[df_x_filtered == 'NotSure'].index)
df_y_cleaned = df_y_filtered.drop(df_y_filtered[df_y_filtered == 'NotSure'].index)
df_z_cleaned = df_z_filtered.drop(df_z_filtered[df_z_filtered == 'NotSure'].index)

print("cleaned df_x", df_x_cleaned.value_counts())
print("cleaned df_y", df_y_cleaned.value_counts())
print("cleaned df_z", df_z_cleaned.value_counts())


cleaned df_x result
A    5788
B    2965
C    2350
D    2333
Name: count, dtype: int64
cleaned df_y result
D    5757
C    3242
B    2519
A    2033
Name: count, dtype: int64
cleaned df_z result
D    3348
C    3255
B    3124
A    3026
Name: count, dtype: int64


C. /Discuss:/

In [122]:
# let's compare the lengths
print('For df_x :')
print('the length of the original dataset is :', df_x.shape[0])
print('the length of the cleaned dataset is :', df_x_cleaned.shape[0])
print('which gives us a ratio of :', df_x_cleaned.shape[0]/df_x.shape[0])

print('For df_y :')
print('the length of the original dataset is :', df_y.shape[0])
print('the length of the cleaned dataset is :', df_y_cleaned.shape[0])
print('which gives us a ratio of :', df_y_cleaned.shape[0]/df_y.shape[0])

print('For df_z :')
print('the length of the original dataset is :', df_z.shape[0])
print('the length of the cleaned dataset is :', df_z_cleaned.shape[0])
print('which gives us a ratio of :', df_z_cleaned.shape[0]/df_z.shape[0])

For df_x :
the length of the original dataset is : 13882
the length of the cleaned dataset is : 13436
which gives us a ratio of : 0.9678720645440139
For df_y :
the length of the original dataset is : 13978
the length of the cleaned dataset is : 13551
which gives us a ratio of : 0.9694519959937044
For df_z :
the length of the original dataset is : 13923
the length of the cleaned dataset is : 12753
which gives us a ratio of : 0.9159663865546218


### 1.4 (3 pt)

Now that our answer columns are nicely formatted, let's take a look at model performance:

A. Both the `MMLU` dataframe and the language model score dataframes have a `question_id` column. For each of the language model score dataframes, use an inner join operation with the `df_test` dataframe on the `question_id` column.

B. Add a new column to each of the resulting dataframes called `correct`, that checks if the model's answer in `result` is the same as the expected answer in the column `answer`. Then, print the average score of each model.
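
A minimal sketch of the join and the `correct` column, using two tiny made-up tables (the names `toy_test` and `toy_model` and their values are assumptions for illustration):

```python
import pandas as pd

# Toy question table and toy model answers (values invented for illustration)
toy_test = pd.DataFrame({"question_id": [0, 1, 2, 3], "answer": ["B", "C", "D", "B"]})
toy_model = pd.DataFrame({"question_id": [0, 2, 3], "result": ["B", "A", "B"]})

# Inner join on the question_id column: only questions present in both tables survive
merged = toy_model.merge(toy_test, on="question_id", how="inner")

# Boolean column marking whether the model's answer matches the expected one
merged["correct"] = merged["result"] == merged["answer"]

print(merged)
print("accuracy:", merged["correct"].mean())
```

Joining on the `question_id` column, rather than on the row index, keeps answers aligned with their questions even when a model's file is missing some questions.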

In [150]:
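# A
# Note: df_x_cleaned is indexed by the row position in df_x, not by question_id,
# so this join matches row positions against question_id. The two differ wherever
# df_x is missing questions; merging on the question_id column would avoid this.
df_x_cleaned_df = df_x_cleaned.to_frame(name='result').rename_axis('question_id')
merged_x_df = df_x_cleaned_df.merge(df_test, left_index=True, right_on='question_id')

# B
merged_x_df['correct'] = merged_x_df['result'] == merged_x_df['answer']
merged_x_df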


Unnamed: 0,result,question,A,B,C,D,answer,subject,question_id,correct
0,B,Find the degree for the given field extension ...,0,4,2,6,B,abstract algebra,0,True
1,C,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...",8,2,24,120,C,abstract algebra,1,True
2,D,Find all zeros in the indicated finite field o...,0,1,01,04,D,abstract algebra,2,True
3,B,Statement 1 | A factor group of a non-Abelian ...,"True, True","False, False","True, False","False, True",B,abstract algebra,3,True
4,B,Find the product of the given polynomials in t...,2x^2 + 5,6x^2 + 4x + 6,0,x^2 + 1,B,abstract algebra,4,True
...,...,...,...,...,...,...,...,...,...,...
13877,A,"Guru Nanak used what term to denote the ""divi...",Shabad,Khalse,Nam,Guru,A,world religions,13877,True
13878,A,What is the mi'raj?,Muhammad's miraculous ascent to heaven,Muhammad's migration to Mecca,Muhammad's first community in Mecca,Muhammad's revelations of the Qur'an,A,world religions,13878,True
13879,B,What is the name of the most famous dharmashas...,Laws of Dharma,Laws of Karma,Laws of Vishnu,Laws of Manu,D,world religions,13879,False
13880,B,"What does the term ""Qur'an"" literally mean?",The Holy Book,The Narrative,The Recitation,The Pillars,C,world religions,13880,False


In [200]:
merged_x_df['correct'].value_counts()

correct
False    8871
True     4565
Name: count, dtype: int64

In [151]:
df_y_cleaned_df = df_y_cleaned.to_frame(name= 'result').rename_axis('question_id')
merged_y_df = df_y_cleaned_df.merge(df_test, left_index=True, right_on='question_id')

#B
merged_y_df['correct'] = merged_y_df['result'] == merged_y_df['answer']
merged_y_df


Unnamed: 0,result,question,A,B,C,D,answer,subject,question_id,correct
0,D,Find the degree for the given field extension ...,0,4,2,6,B,abstract algebra,0,False
1,D,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...",8,2,24,120,C,abstract algebra,1,False
2,D,Find all zeros in the indicated finite field o...,0,1,01,04,D,abstract algebra,2,True
4,D,Find the product of the given polynomials in t...,2x^2 + 5,6x^2 + 4x + 6,0,x^2 + 1,B,abstract algebra,4,False
5,C,Statement 1 | If a group has an element of ord...,"True, True","False, False","True, False","False, True",A,abstract algebra,5,False
...,...,...,...,...,...,...,...,...,...,...
13973,C,"What is the term used to describe the ""pure"" ...",Khalsa,Rahit,Panj Kakke,Shabad,A,world religions,13973,False
13974,D,Who is the founder of Sikhism?,Guru Gobind Singh,Guru Nanak,Guru Kabir,Guru Hargobind,B,world religions,13974,False
13975,D,Which term is usually associated with women in...,Polluted,Ideal,Auspiciousness,Kind,C,world religions,13975,False
13976,B,What is another name for Enuma Elish?,Epic of Creation,Epic of Destruction,Epic of Egypt,Epic of Origins,A,world religions,13976,False


In [152]:

df_z_cleaned_df = df_z_cleaned.to_frame(name= 'result').rename_axis('question_id')
merged_z_df=df_z_cleaned_df.merge(df_test, left_index=True, right_on='question_id')

#B
merged_z_df['correct'] = merged_z_df['result'] == merged_z_df['answer']
merged_z_df


Unnamed: 0,result,question,A,B,C,D,answer,subject,question_id,correct
0,B,Find the degree for the given field extension ...,0,4,2,6,B,abstract algebra,0,True
1,B,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...",8,2,24,120,C,abstract algebra,1,False
2,C,Find all zeros in the indicated finite field o...,0,1,01,04,D,abstract algebra,2,False
3,B,Statement 1 | A factor group of a non-Abelian ...,"True, True","False, False","True, False","False, True",B,abstract algebra,3,True
4,B,Find the product of the given polynomials in t...,2x^2 + 5,6x^2 + 4x + 6,0,x^2 + 1,B,abstract algebra,4,True
...,...,...,...,...,...,...,...,...,...,...
13918,A,The Maccabean Revolt is associated with which...,Julius Caesar,Alexander the Great,Cyrus of Persia,Antiochus IV Epiphanes,D,world religions,13918,False
13919,A,How old was Guru Nanak when he started to pre...,30,40,33,52,A,world religions,13919,True
13920,B,The widely adored Egyptian goddess Hathor was ...,Serpent,Eagle,Cheetah,Cow,D,world religions,13920,False
13921,B,What does the phrase Guru-Panth mean within th...,Community,Scripture,Worship,Apprenticeship,A,world religions,13921,False


In [183]:
# B - average score of each model
print(merged_x_df['correct'].mean()* 100)
print(merged_y_df['correct'].mean()* 100)
print(merged_z_df['correct'].mean()* 100)

33.97588568026198
34.373846948564676
31.812122637810713


### 1.5 (5 pt)

Hmmm, something doesn't seem quite right. Let's investigate how "balanced" this dataset is:

A. For each of the 57 subjects in the MMLU, compare the number of questions answered by each model. Print the subjects for which there is a more than 10% difference.

B. Propose and implement a reasonable way to rebalance the results. (e.g., while throwing away 100% of the results perfectly rebalances the results, it is not reasonable).

C. Finally, print the updated accuracy on the rebalanced data.

**hint:**
- (A) For a given subject, let model X and model Y have answered 181 and 200 questions respectively. You can consider this a 10% difference from the perspective of X since: (200 - 181) / 181 > 0.10
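
The arithmetic in the hint can be sketched as a small helper. The function name `differs_by_more_than` is made up for illustration, and it assumes both counts are positive and that the difference is measured relative to the smaller count, as in the hint.

```python
def differs_by_more_than(count_a, count_b, threshold=0.10):
    """Return True if the larger count exceeds the smaller by more than `threshold`."""
    smaller, larger = sorted((count_a, count_b))
    # Relative difference from the perspective of the model with fewer answers
    return (larger - smaller) / smaller > threshold

print(differs_by_more_than(181, 200))  # (200 - 181) / 181 is about 0.105
print(differs_by_more_than(190, 200))  # (200 - 190) / 190 is about 0.053
```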

In [None]:
# A

In [None]:
# B

In [None]:
# C

## Task 2 (26 points): What do you mean A > D > B > C...?

Nice work! Having successfully inspected, cleaned, and rebalanced the provided data, you head over to the director of the government's NEUTRALITY project. Ms. Sakota is happy with your work so far, but worried that the sloppy intern might have done more undetected damage. To be sure, she orders a new set of evaluations of all models on both MMLU and another dataset.

After cleaning up and rebalancing, you are left with the concatenated score files in the second folder `task_2`:
```
task_2/
│
├── lm_scores_mmlu.csv
│
└── lm_scores_other.csv
```

Each has a new column called `model_name`, which is one of `X`, `Y`, or `Z`.



_NOTE: **only** use data from `task_2` and `task_2_5` for this assignment! The values in `lm_scores_mmlu.csv` will NOT be the same as the dataframes you finished in task 1. This is due to "randomness" or "temperature" in language model inference, which can slightly shift generative results. (Conveniently, it also ensures any mistakes made in Task 1 don't propagate further ;) )_

In [None]:
# PROVIDED CODE
df_mmlu = pd.read_csv('task_2/lm_scores_mmlu.csv')
df_other = pd.read_csv('task_2/lm_scores_other.csv')

### 2.1 (4 pt)

Let's explore the new results:

A. Compute the mean accuracy and standard errors of each model on both datasets and print the results.

B. Then, show your results in a bar plot with error bars showing the 95% confidence interval around the mean, derived from the standard errors. Make sure the plot is easy to read and well annotated.

C. /Discuss:/ the plot you created: (i) can you say that one of the models is the best? (ii) is there anything that seems odd?
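
As a worked example of the quantities in A and B, the sketch below computes the mean, the standard error, and the half-width of a 95% confidence interval for a list of 0/1 correctness flags. The function name `mean_and_ci95` and the toy data are assumptions for illustration; the multiplier 1.96 relies on the normal approximation, which is reasonable for large samples.

```python
import math
import statistics

def mean_and_ci95(correct_flags):
    """Return (mean, standard error, 95% CI half-width) for a list of 0/1 outcomes."""
    n = len(correct_flags)
    mean = statistics.fmean(correct_flags)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    sem = statistics.stdev(correct_flags) / math.sqrt(n)
    # 1.96 is the normal-approximation multiplier for a 95% interval
    return mean, sem, 1.96 * sem

# Toy outcomes: 70 correct answers out of 100 (invented for illustration)
toy_flags = [1] * 70 + [0] * 30
mean, sem, half_width = mean_and_ci95(toy_flags)
print(f"accuracy = {mean:.3f} ± {half_width:.3f} (SE = {sem:.4f})")
```

In a bar plot, the half-width can be passed as the `yerr` argument of `matplotlib.pyplot.bar` to draw the error bars.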

In [None]:
# A

In [None]:
# B

C. /Discuss:/

### 2.2 (5 pt)

Ms. Sakota has assured you that both datasets contain questions of similar difficulty, so, what could be going on here?

A. What is the distribution of correct answers (A, B, C, D) for each dataset? Create a bar chart to visualize this.

B. Perform a chi-square test of independence at $\alpha = 0.05$ to determine if there's a significant difference in the distribution of correct answers between the two datasets. What do you conclude?

**hints**:
- for (A), keep in mind that df_mmlu and df_other contain the results of all models, i.e., the `question_id` column is duplicated.
- for (A), take care to clearly annotate the bar chart, e.g., title, y-label, legend.
- for (B), clearly state the null hypothesis and alternative hypothesis
- use the `chi2_contingency` function from `scipy.stats`
- format your results from answer (A) as a 2D array
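
To show what the test computes, here is a sketch that derives the Pearson chi-square statistic by hand for a made-up 2 × 4 table of counts (rows: datasets, columns: correct answers A to D). The function name and the toy counts are assumptions for illustration. In the homework itself, `scipy.stats.chi2_contingency` returns the statistic, the p-value, the degrees of freedom, and the expected frequencies in one call.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a 2D table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Toy counts of correct answers A, B, C, D in two datasets (invented for illustration)
toy_table = [
    [25, 25, 25, 25],
    [40, 20, 20, 20],
]
statistic, dof = chi_square_statistic(toy_table)
CRITICAL_VALUE_DOF3 = 7.815  # chi-square critical value for alpha = 0.05 and 3 degrees of freedom
print(f"chi2 = {statistic:.3f}, dof = {dof}, reject H0: {statistic > CRITICAL_VALUE_DOF3}")
```

The null hypothesis here is that the distribution of correct answers does not depend on the dataset; it is rejected when the statistic exceeds the critical value (equivalently, when the p-value is below $\alpha$).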

In [None]:
# A

In [None]:
# B

### 2.3 (7 pt)

Let's dive in deeper:

A. What is language model X's mean accuracy conditioned on the four answer options for each dataset?

B. Compare LM X's performance when the correct answer is "A" between the two datasets. Use a t-test at the 95% confidence level ($\alpha = 0.05$). What do you conclude?

C. Compare LM X's performance when the correct answer is "A" vs. "C or D" for each dataset. Use a t-test at the 95% confidence level ($\alpha = 0.05$). What do you conclude?
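
As a sketch of the comparison in B and C, the snippet below computes a Welch-style test statistic for two groups of 0/1 correctness flags and obtains a two-sided p-value from the normal distribution. This is a large-sample approximation, not the exact t-test: for the exact version, `scipy.stats.ttest_ind` (imported at the top of the notebook) can be used, with `equal_var=False` for Welch's variant. The function name `welch_z_test` and the toy data are assumptions for illustration.

```python
import math
from statistics import NormalDist, fmean, variance

def welch_z_test(sample_a, sample_b):
    """Welch statistic with a two-sided p-value from the normal approximation."""
    mean_a, mean_b = fmean(sample_a), fmean(sample_b)
    # Standard error of the difference in means, using unpooled sample variances
    se = math.sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    statistic = (mean_a - mean_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value

# Toy 0/1 correctness flags for two groups (invented for illustration)
group_a = [1] * 80 + [0] * 20
group_b = [1] * 50 + [0] * 50
statistic, p_value = welch_z_test(group_a, group_b)
print(f"statistic = {statistic:.2f}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```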

In [None]:
# A

In [None]:
# B

In [None]:
# C

### 2.4 (2 pt)

What an intriguing finding! 

A. Print the mean accuracies conditioned on the correct answer for all LMs for each dataset.

B. /Discuss:/ What do you observe?

In [None]:
# A

B. /Discuss:/

### 2.5 (2 pt)

Concerned with your findings so far, you quickly consult with Ms. Sakota. After thinking it over, Ms. Sakota concludes that more tests are needed. She orders a second round of MMLU results. However, the clever Ms. Sakota thinks of the following twist: while keeping questions fixed, she randomly permutes the position of the correct answer. The new results can be found in the folder `data/task_2_5/`:
```
task_2_5/
│
└── lm_scores_mmlu_shuffle.csv
```

/Discuss:/ Why would Ms. Sakota do this?

/Discuss:/

### 2.6 (4 pt)

Increasingly sceptical of the language models' performance, you read up on proper testing practices. You stumble upon the concept of [test-retest stability](https://en.wikipedia.org/wiki/Repeatability), which roughly states that:

"_Measurements taken by a single person or instrument on the same item, under the same conditions, and in a short period of time, should have the same results._"

In our case, we would assume an LM would have the same performance on a given question regardless of the correct answer position. One way of testing this is by using the following metric:

$$\text{test-retest metric} = \frac{1}{N}\sum_{i=1}^N \frac{1}{M}\sum_{j=1}^M c^i_0 c_j^i,$$

where $c^i_0 \in \{0, 1\}$ indicates whether the model answers the $i^{\text{th}}$ question correctly (1 if correct, 0 if incorrect). $c_j^i$ indicates whether the model answers the $i^{\text{th}}$ question correctly in the $j^{\text{th}}$ shuffled version of the answer label content. Finally, $M$ is the total number of shuffles and $N$ is the dataset size.

Task: compute the test-retest metric for each language model using the original `lm_scores_mmlu.csv` file and the new `lm_scores_mmlu_shuffle.csv` file. Using a bar plot, visualize your results by comparing the accuracy of the original `lm_scores_mmlu.csv` and the test-retest scores.

**hints**
- what is $M$ in our case?
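
The metric defined above can be sketched in plain Python as follows. The function name `retest_metric`, the list-based input format, and the toy flags (with $N = 4$ and $M = 2$) are assumptions for illustration, not the layout of the provided files.

```python
def retest_metric(original, shuffled_runs):
    """Average over questions of c_0 times the mean of c_j over all shuffles.

    original: list of 0/1 flags, one per question (c_0^i)
    shuffled_runs: list of M lists, each with one 0/1 flag per question (c_j^i)
    """
    n_questions = len(original)
    n_shuffles = len(shuffled_runs)
    total = 0.0
    for i in range(n_questions):
        # Inner average over the M shuffled versions of question i
        total += sum(original[i] * run[i] for run in shuffled_runs) / n_shuffles
    return total / n_questions

# Toy flags for N = 4 questions and M = 2 shuffles (invented for illustration)
original = [1, 1, 0, 1]
shuffled_runs = [
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
print(retest_metric(original, shuffled_runs))
```

A question only contributes to the score when it is answered correctly both in the original run and in a shuffled run, so the metric can never exceed the original accuracy.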

(bonus: no points, but so much sweet, sweet knowledge - check out [the following article](https://arxiv.org/pdf/2406.19470v1))

### 2.7 (2 pt)

A. Using the unshuffled data: For each LM, print the distribution of the answers they give as well as the accuracy conditioned on the answer they give.

B. /Discuss:/ Describe what you observe

[bonus: not scored, but again _that sweet, sweet knowledge_] Could you think of a plausible explanation?

In [None]:
# A

B. /Discuss:/

## Task 3 (16 points): What do Questions and Answers look like for a Language Model?

While you feel pretty good about the tests you conducted so far, something still bothers you: what if the language models don't see the data like you do? Suddenly, you receive a phone call from a wise AI sage in the West, _Westoda_:

```
"Hmm, correct you are, young padawan, to question how the world is seen by large language models! Simple 'text' it is not, hmm? No, no, no! Characters and words, the way of puny humans, this is not, heh heh heh.

'Tokens', they use, yes! Mysterious and powerful, these tokens are. Expand our vocabulary, they do, beyond the simple 'a to Z'. Chunky blocks of text, they become, yes! 'Hello world', a simple phrase it may seem. But to a language model, '[24912, 2375]' it might appear, yes! Confusing, it is, hmm?

Wise, it would be, to explore these MMLU data points through the eyes of a language model, you think? Yes, yes! Much to learn, there is. The ways of the tokens, understand you must, if truly comprehend the great LMs, you wish to.
Meditate on this, you should. The force of natural language processing, strong it is. But patience, you must have, my young padawan. For only through great study and contemplation, will the mysteries of the tokens reveal themselves to you, they will. Yes, hmmm!"
```

Admittedly, Westoda at times speaks in riddles… However, he was explaining a crucial aspect of modern LMs called [Tokenization](https://learn.microsoft.com/en-us/dotnet/ai/conceptual/understanding-tokens):


“Tokens are words, character sets, or combinations of words and punctuation that are used by [language models (LMs)] to decompose text into. Tokenization is the first step in training”

Instead of characters, LMs process natural language using “tokens”. While this is useful for a number of reasons, it does at times introduce some “unintuitive” behavior…

In [None]:
# PROVIDED CODE

try:
    import tiktoken
except Exception as e:
    print('installing tiktoken package')
    
    !pip install tiktoken
    
    import tiktoken

def tokenize_text(s):
    enc = tiktoken.encoding_for_model('gpt-4o')
    tokens = enc.encode(str(s))
    return tokens

example_string = 'hello world'
print(f'humans see: "{example_string}" --> language models see: {tokenize_text(example_string)}')

### 3.1 (5 pt)

Use the provided code in the cell above to "see the world through the eyes of a language model":

A. Tokenize the questions of the original MMLU data provided in task 1: `task_1/mmlu_data/test.csv` and plot the token distribution (the frequency of each token).

B. Same as (A), but now for the answers in columns (columns "A", "B", "C", and "D").

C. Isolate the tokens for the strings "A", "B", "C", and "D". Then, for their occurrences in both questions and answers, print their distribution relative to each other.

**hint**
- There are a _lot_ of tokens, consider using a cutoff point and log scale
- For (C), the relative frequencies should sum to 1
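
A minimal sketch of the counting step, using made-up token-id lists in place of real tokenizer output. The ids in `letter_ids` are hypothetical; the real ids for "A", "B", "C", and "D" would come from the tokenizer.

```python
from collections import Counter

# Toy token-id lists standing in for tokenized questions (ids invented for illustration)
toy_tokenized = [
    [32, 33, 34, 32],
    [35, 32, 99],
    [33, 100, 34],
]

# Frequency of every token across all texts
token_counts = Counter(token for tokens in toy_tokenized for token in tokens)

# Hypothetical ids for the strings "A", "B", "C", "D"
letter_ids = {"A": 32, "B": 33, "C": 34, "D": 35}

# Share of each letter token among all letter tokens
letter_total = sum(token_counts[token_id] for token_id in letter_ids.values())
relative = {letter: token_counts[token_id] / letter_total
            for letter, token_id in letter_ids.items()}

print(token_counts.most_common(3))
print(relative)
```

For plotting, the sorted counts from `token_counts.most_common(k)` give a natural cutoff at the `k` most frequent tokens.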

In [None]:
# A

In [None]:
# B

In [None]:
# C

### 3.2 (3 pt)

What if the number of "A", "B", "C", and "D" tokens in the question and answer pairs could influence a language model's decisions?

A. For each combined question-answers pair, compute: 
1. the number of "A", "B", "C", and "D" tokens; and
2. the total number of tokens.
3. then, group by the "correct" answer and compute the mean frequency of A, B, C, and D tokens and the total number of tokens. 
4. finally, print your results

B. /Discuss:/ What do you think of the hypothesis that the frequency of A, B, C, and D tokens could influence answers?


In [None]:
# A

B. /Discuss:/

### 3.3 (4 pt)

Three of the most important considerations when deciding between language models are:

- Quality
- Costs
- Speed

So far, much of your analysis has focused on quality. However, the government has indicated that they are quite concerned about both the total costs and speed as well. Specifically, it has been brought to their attention that a new `turbo` model has been launched! 

This model is both cheaper and faster than the models you evaluated so far. However, there is a catch: the context length* is much smaller than that of the other LMs. Namely, it can only process **300** tokens during inference. Meanwhile, the other models can process up to 100K tokens!

*_The “context length” refers to the number of tokens that can be given to an LM as input._

A. Are there subjects where using the cheaper model might be problematic? I.e., where part of the question and answer(s) might not fit completely in the context?

B. /Discuss:/ Can you think of a strategy that would balance the needs of the government?

**hint**:
- An LM needs to have both the question and the different answer options in its context
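
The context check can be sketched as below. The whitespace-based `count_tokens` is only a stand-in so that the example runs without extra packages; in the notebook, the token count would come from the provided `tokenize_text()` function instead. The function names and toy strings are assumptions for illustration.

```python
CONTEXT_LIMIT = 300  # token limit of the `turbo` model described above

def count_tokens(text):
    """Stand-in token counter: splits on whitespace (a real tokenizer would differ)."""
    return len(str(text).split())

def fits_in_context(question, options, limit=CONTEXT_LIMIT):
    """Check whether the question plus all answer options stay within the limit."""
    total = count_tokens(question) + sum(count_tokens(option) for option in options)
    return total <= limit

# Toy examples (invented for illustration)
short_question = "What is 2 + 2?"
long_question = "word " * 400
options = ["3", "4", "5", "6"]

print(fits_in_context(short_question, options))
print(fits_in_context(long_question, options))
```

Applied per row and then grouped by subject, a check like this shows which subjects contain questions that would be truncated.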

In [None]:
# A

B. /Discuss:/

### 3.4 (4 pt)

/Discuss:/ The time has come to give your final recommendation on the use of LMs in education to the government! Taking into account everything you analyzed in all the preceding tasks (1, 2, and 3), please write a short recommendation consisting of 4 bullet points discussing your concerns.

**hint**
- Try to use the MECE framework: _Mutually Exclusive Collectively Exhaustive_

/Discuss:/
1. 

2. 

3. 

4. 