In [1]:
# Robert Cocker
# Dr. Farbin
# DATA-4380
# Tabular Feasibility
# 6/24/2025

## Tabular Feasibility Exploratory Data Analysis (EDA)

#### MMLU-Pro Benchmark Data

#### Data Loading and Initial Look

In [2]:
import pandas as pd
import numpy as np

# Load data
# Replace with the actual path to your dataset
df = pd.read_parquet("~/DATA4380.Summer.2025/src/data/MMLU-Pro/data/test-0-of-1.parquet")  # for .csv or .tsv files, use pd.read_csv instead

# Print questions and answers in the test data
print("Number of rows (data points):", df.shape[0])
print("Number of features (columns):", df.shape[1])
df.head()

Number of rows (data points): 12032
Number of features (columns): 8


Unnamed: 0,question_id,question,options,answer,answer_index,cot_content,category,src
0,70,"Typical advertising regulatory bodies suggest,...","[Safe practices, Fear, Jealousy, Trivial, Unsa...",I,8,,business,ori_mmlu-business_ethics
1,71,Managers are entrusted to run the company in t...,"[Shareholders, Diligence, Self-interest, Share...",F,5,,business,ori_mmlu-business_ethics
2,72,There are two main issues associated with ____...,"[Down, Autonomy, Remuneration, Benefit, Down, ...",J,9,,business,ori_mmlu-business_ethics
3,73,_______ locate morality beyond the sphere of r...,"[Ethical egoism, Ethics of duty, Postmodern et...",C,2,,business,ori_mmlu-business_ethics
4,74,Some of key differences between Islamic finan...,"[Interest, Certain, Assured, Both tangible and...",G,6,,business,ori_mmlu-business_ethics


#### Data Visualization

In [3]:
from ydata_profiling import ProfileReport

# Overview of data
profile = ProfileReport(df, title="NLP EDA Report")
profile.to_notebook_iframe()


In [4]:
# Generate MMLU-Pro Overview
profile.to_file("mmlupro_overview.html")


In [6]:
# Outliers
def detect_outliers(series):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return ((series < lower) | (series > upper)).sum()

# Testing IQR for outliers on data
print("IQR Outliers:")
numerical = df.select_dtypes(include=[np.number])
outlier_counts = numerical.apply(detect_outliers).sort_values(ascending=False)
outlier_counts.head()

IQR Outliers:


question_id     0
answer_index    0
dtype: int64
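
The IQR rule above only sees the numeric columns. One possible extension, sketched here under assumptions, is to apply the same fences to a numeric summary of a text column such as the character length of `question`. The helper name `iqr_outlier_mask` and the toy data are hypothetical and are not part of the MMLU-Pro dataset.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series) -> pd.Series:
    """Return a boolean mask marking values outside the 1.5 * IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

# Toy stand-in for the `question` column (hypothetical values)
toy = pd.DataFrame({
    "question": [
        "short one",
        "another short",
        "a bit longer text",
        "mid length q",
        "x" * 500,  # deliberately very long entry
    ]
})

# Character length is one simple numeric proxy for a text feature
question_len = toy["question"].str.len()
mask = iqr_outlier_mask(question_len)
n_text_outliers = int(mask.sum())
print("Questions flagged by length:", n_text_outliers)
```

On the real data, `toy` would be replaced by `df`. Token counts could be used instead of character counts if a tokenizer is available.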

In [None]:
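# Class imbalance
# Assumption: the target is still TBD; "answer" (the correct option letter) is used as a placeholder
target_col = "answer"
df[target_col].value_counts(normalize=True).sort_index().plot(kind="bar", title="Class Distribution")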

In [None]:
# Inspect target values: TBD
# Assumption: "answer" is a placeholder because the dataset has no "target" column; update as needed
target_col = "answer"
print("Target column:", target_col)
print("Unique values in target:", df[target_col].unique())

# If classification:
# e.g., each letter (A, B, C, ...) = the correct option
# If regression:
# print("Range:", df[target_col].min(), "to", df[target_col].max())

# Summary
- The MMLU-Pro dataset has 12,032 rows and 8 features.
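- There are 2 numerical features (`question_id`, `answer_index`) and 6 text or categorical features.
- Missing values may be present in one column, `cot_content`.
- The IQR check found no outliers in the two numerical columns; outliers in the text features have not been examined yet. EDA for LLMs with reasoning-language features is still a state-of-the-art research area.
- The target variable is TBD; if the task is framed as binary classification, the classes would be `0` and `1` (also TBD).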
- For class imbalance, I will need to determine the EDA methods further.
- Other recommended libraries for EDA in NLP - Hugging Face `datasets` + `evaluate`
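
The question of missing values in `cot_content` could be checked directly. The sketch below is an assumption about how to do that: it counts both true nulls and blank strings, because an empty cell in the `head()` output could be either. The function name `missing_report` and the toy rows are hypothetical.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count NaN/None values and empty-or-whitespace strings per column."""
    n_null = frame.isna().sum()
    n_blank = frame.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "").sum()
    )
    return pd.DataFrame({"null": n_null, "blank_string": n_blank})

# Toy rows shaped like a few MMLU-Pro columns (hypothetical values)
toy = pd.DataFrame({
    "question_id": [70, 71, 72],
    "cot_content": ["", None, "Let's think step by step."],
    "category": ["business", "business", "law"],
})

report = missing_report(toy)
print(report)
```

Calling `missing_report(df)` on the loaded data would show whether `cot_content` is null, blank, or populated.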

#### Data Cleaning and Preparation for Machine Learning

- MMLU-Pro data is for testing and validating a model. It is not intended for pretraining.
- MIMIC-IV is the proposed pretraining data, in addition to the original data used by the proposed pretrained LLM.
- MIMIC-IV is highly sensitive data (data privacy, data governance, and other considerations need to be addressed).
- Awaiting approval to obtain the training data. MMLU-Pro is for benchmarking LLMs, not for training them.

#### HLE Benchmark Data

In [15]:
# Load data
df = pd.read_parquet("~/DATA4380.Summer.2025/src/data/hle/data/test-0-of-1.parquet")

# Print questions and answers in the test data
print("Number of rows (data points):", df.shape[0])
print("Number of features (columns):", df.shape[1])
df.head(3)

Number of rows (data points): 2500
Number of features (columns): 12


Unnamed: 0,id,question,image,image_preview,answer,answer_type,author_name,rationale,rationale_image,raw_subject,category,canary
0,6687ffb1091058ff19128813,Black to move. Without moving the black queens...,"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQA...",{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...,"Rxf3, Rf1#",exactMatch,Elliott T,Chess engine says that this is the only mate i...,,Chess,Other,BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...
1,668825f80a642802bdfeadfa,Which condition of Arrhenius's sixth impossibi...,,,D,multipleChoice,Elliott T,Critical-level views in population axiology su...,,Philosophy,Humanities/Social Science,BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...
2,668828540a642802bdfeadfc,"The concept of logical ""depth"" mentioned in _T...",,,yeyo,exactMatch,,1. 'crypticity' -> y (source: https://www.its....,,Trivia,Other,BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...


In [22]:
# Outliers
def detect_outliers(series):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return ((series < lower) | (series > upper)).sum()

# Testing IQR for outliers on data
print("IQR Outliers:")
numerical = df.select_dtypes(include=[np.number])
outlier_counts = numerical.apply(detect_outliers).sort_values(ascending=False)
outlier_counts.head()

IQR Outliers:


Series([], dtype: float64)

In [25]:
import pandas as pd
import numpy as np

# Identify numerical features (int, float)
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()

# Identify categorical features (object, category)
categorical_features = df.select_dtypes(include=["object", "category"]).columns.tolist()

# Optional: Drop known image or binary columns
image_like_columns = [col for col in df.columns if df[col].apply(lambda x: isinstance(x, (bytes, bytearray, memoryview, np.ndarray))).any()]

# Cleaned feature lists
numerical_features = [col for col in numerical_features if col not in image_like_columns]
categorical_features = [col for col in categorical_features if col not in image_like_columns]

# Print result
print("Features in the HLE dataset:")
print("Numerical features:", numerical_features)
print("Categorical features:", categorical_features)
print("Image-like or non-tabular features:", image_like_columns)

Features in the HLE dataset:
Numerical features: []
Categorical features: ['id', 'question', 'image', 'image_preview', 'answer', 'answer_type', 'author_name', 'rationale', 'rationale_image', 'raw_subject', 'category', 'canary']
Image-like or non-tabular features: []
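
The image-like list is empty even though `image` and `image_preview` clearly hold image data. In the `head()` output, `image` is a base64 data URI string and `image_preview` is a dict with a `bytes` key, and neither is a bare `bytes` object. A broader heuristic, sketched here as an assumption about what counts as image-like, might look like this. The helper names and toy rows are hypothetical.

```python
import pandas as pd

def looks_like_image(value) -> bool:
    """Heuristic: raw bytes, a dict holding bytes, or a base64 image data URI."""
    if isinstance(value, (bytes, bytearray, memoryview)):
        return True
    if isinstance(value, dict) and isinstance(value.get("bytes"), (bytes, bytearray)):
        return True
    return isinstance(value, str) and value.startswith("data:image/")

def find_image_like_columns(frame: pd.DataFrame) -> list:
    """Return the columns where at least one value looks like image data."""
    return [col for col in frame.columns if frame[col].map(looks_like_image).any()]

# Toy rows shaped like the HLE columns (hypothetical values)
toy = pd.DataFrame({
    "question": ["Black to move.", "Which condition?"],
    "image": ["data:image/jpeg;base64,AAAA", ""],
    "image_preview": [{"bytes": b"\xff\xd8"}, None],
})

image_cols = find_image_like_columns(toy)
print("Image-like columns:", image_cols)
```

Columns found this way could then be excluded from the categorical list before profiling.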


In [35]:
from collections import Counter

counts = Counter(categorical_features)
print(counts)

Counter({'id': 1, 'question': 1, 'image': 1, 'image_preview': 1, 'answer': 1, 'answer_type': 1, 'author_name': 1, 'rationale': 1, 'rationale_image': 1, 'raw_subject': 1, 'category': 1, 'canary': 1})
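
Counting the column names yields 1 for every column, which says nothing about the values inside them. If the goal is to see how categories are distributed, one option is to take value counts for the low-cardinality columns only. This is a sketch under assumptions: the helper name `low_cardinality_counts` and the `max_unique` threshold are hypothetical, and only columns with hashable values (plain strings such as `answer_type` or `category`) should be passed in.

```python
import pandas as pd

def low_cardinality_counts(frame, columns, max_unique=20):
    """Return value counts for each listed column with at most `max_unique` distinct values."""
    result = {}
    for col in columns:
        if frame[col].nunique(dropna=True) <= max_unique:
            result[col] = frame[col].value_counts(dropna=False)
    return result

# Toy rows shaped like a few HLE columns (hypothetical values)
toy = pd.DataFrame({
    "answer_type": ["exactMatch", "multipleChoice", "exactMatch"],
    "category": ["Other", "Humanities/Social Science", "Other"],
    "question": ["q1", "q2", "q3"],
})

counts = low_cardinality_counts(toy, ["answer_type", "category", "question"], max_unique=2)
for name, vc in counts.items():
    print(name)
    print(vc)
```

Free-text columns such as `question` exceed the threshold and are skipped, which keeps the output to columns that behave like categories.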


In [None]:
from ydata_profiling import ProfileReport

# Overview of data
profile = ProfileReport(df, title="NLP EDA Report")
profile.to_notebook_iframe()

#### Data Cleaning and Preparation for Machine Learning

- The same considerations apply to the HLE data as to the MMLU-Pro benchmark: it is for benchmarking LLMs, not for training.
- Implementation requires advanced methods.

# Summary
- The HLE dataset has 2,500 rows and 12 features.
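- The data can be cleaned by removing non-tabular columns such as `image` and `image_preview`.
- There are no numerical features; both the numerical and the image-like feature lists come back empty.
- All 12 columns are detected as categorical (object) features.
- The IQR outlier check returned nothing because there are no numerical columns; outliers in the text features have not been examined yet. EDA for LLMs with reasoning-language features is still a state-of-the-art research area.
- The target variable is TBD; if the task is framed as binary classification, the classes would be `0` and `1` (also TBD).
- Awaiting approval for MIMIC-IV data. MIMIC-IV is the proposed training data.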