# Tabular Feasibility

### Version & Annum
#### Project Link: https://github.com/rcghpge/dataproblems/tree/main/src/projects

**Project Summary:**
This feasibility study covers two parallel tabular AI projects: Version and Annum.

Version proposes a tabular, NLP-based framework to establish a reproducible and ethical standard for data science, using Python, EDA, ML, and a proposed novel method, exploratory deep learning (EDL).

Annum looks to extend this work with a novel architecture built on FreeBSD, Mojo, Apache technologies, and other frameworks, to explore modular, secure ML systems for tabular data and for data that requires NLP methods.

**Challenge Description:**
Build iterative versions of two projects for tabular data and the NLP domain. Refine and prepare the builds for the computer vision and LLM phases of the project requirements, with a shippable Python package as the target. Research what is feasible within the time allotted per project build domain.

**Datasets Considered:**

MIMIC-IV: Clinical EHR data (MIT); utilized for supervised health prediction tasks

MMLU-Pro: NLP-based benchmark for LLMs and reasoning

HLE (Humanity’s Last Exam): A global academic benchmark for evaluating LLM generalization and subject-matter breadth

Mathematics Dataset: A collection of school-level mathematical question-answer pairs, designed to train and test neural models on mathematical problem-solving.

**Type of Machine Learning:**

**Supervised Learning + Novel Architecture**

This project explores classification tasks over Q&A pairs using supervised machine learning and deep learning. The modeling focus is binary classification, and possibly multiclass classification depending on the dataset and task context, alongside sandbox R&D.

- **Version** The original goal was binary classification on the MIMIC-IV dataset, such as predicting patient mortality (`1 = mortality`, `0 = no mortality`) or other predictive outcomes. The main goal, however, is to integrate tabular and NLP modeling. For Version and, by extension, Annum, a proposed novel or not yet fully established method, exploratory deep learning (EDL), will be used for research purposes. The final build goal is a prototype model for predictive outcomes, though time constraints may not allow it. The focus of these builds is therefore the mathematical domain, industry-leading test data, and preparation for a shippable Python package, though this is TBD.

- **Annum** also emphasizes a novel architecture, along with novel approaches and methods for ML pipelines. While not model-centric, the goal here is to build with technologies that are not currently widely used in data science.
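
As a rough illustration of how a Q&A pair can be framed as a binary classification example, the sketch below expands one question into a positive example (the correct answer) and negative examples (incorrect candidates). The names `LabeledPair` and `to_binary_examples`, and the toy question, are assumptions made for illustration only; they are not taken from the Version or Annum code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    """Hypothetical container for one (question, candidate answer) example."""
    question: str
    candidate: str
    label: int  # 1 = candidate is the correct answer, 0 = it is not


def to_binary_examples(question, correct, distractors):
    """Expand one Q&A item into one positive and several negative examples."""
    examples = [LabeledPair(question, correct, 1)]
    for wrong in distractors:
        if wrong != correct:  # skip accidental duplicates of the true answer
            examples.append(LabeledPair(question, wrong, 0))
    return examples


# Toy input, invented for this sketch (not drawn from any benchmark).
examples = to_binary_examples("What is 7 + 5?", "12", ["11", "13", "12"])
labels = [ex.label for ex in examples]
```

With this framing, a multiclass variant would instead predict the index of the correct option among all candidates.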

**R&D Goal and Notes:**  
- Both models will be developed using a Scikit-Learn + TensorFlow/Keras framework and evaluated on benchmark QA datasets (MMLU-Pro, HLE, and the Google DeepMind math_dataset) to assess generalization and reasoning.

- The MMLU-Pro and HLE datasets will be used for zero-shot multiclass benchmarking. Due to time constraints, MIMIC-IV will not be included in these builds, in line with PhysioNet and MIT Laboratory for Computational Physiology recommendations that derivative works be published on their platform.

- Research sources include the Google DeepMind Mathematics Dataset, which will serve as the primary dataset for both training and testing these models. These are new models, independent from the originally proposed architecture.

- The primary goal is to explore portability, reproducibility, and architectural flexibility of tabular and NLP-based ML/DL systems in scientific and potentially broader domains.

- A formal literature review using the PRISMA method was considered to guide R&D on computer vision and general-purpose LLMs. However, due to scope and time constraints, these builds rely on a light reading of existing research, and the math_dataset will serve as the main baseline, central to both training and testing workflows.

- In parallel, Annum will be developed using Hugging Face Transformers. Annum focuses on R&D test runs with pretrained models to evaluate transfer learning capabilities, with additional training and testing on the math_dataset to assess mathematical reasoning and generalization.
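
Multiclass benchmarking of the kind described above usually comes down to comparing a predicted option against the answer key, overall and per subject. The following is a minimal sketch of that scoring step in plain Python. The record keys (`category`, `answer`, `prediction`) and the toy records are assumptions for illustration and do not reflect the actual MMLU-Pro or HLE schemas.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall accuracy, {category: accuracy}) for multiple-choice records.

    Each record is a dict with the hypothetical keys
    'category', 'answer', and 'prediction'.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        if rec["prediction"] == rec["answer"]:
            correct[rec["category"]] += 1
    per_category = {cat: correct[cat] / totals[cat] for cat in totals}
    overall = sum(correct.values()) / sum(totals.values()) if totals else 0.0
    return overall, per_category


# Toy records, invented for this sketch (not real benchmark results).
records = [
    {"category": "math", "answer": "B", "prediction": "B"},
    {"category": "math", "answer": "D", "prediction": "A"},
    {"category": "law", "answer": "C", "prediction": "C"},
    {"category": "law", "answer": "A", "prediction": "A"},
]
overall, per_category = accuracy_by_category(records)
```

Reporting per-category accuracy alongside the overall figure helps show whether a model generalizes across subjects or only does well on a few.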

**Responsible Data Science** — Privacy, governance, and explainability:
When working with real-world data, considerations such as data privacy, data governance, and explainable AI (XAI) are taken into account.

## Data Loading and Initial Look

- See the `docs/tabular` notebooks for a preliminary overview of the data, benchmark data, and further details.

## Data Visualization

### MMLU-Pro

External link: [MMLU-Pro Data Visualization](https://rcghpge.github.io/dataproblems/mmlupro_overview.html)

![Distribution of Disciplines for MMLU-Pro](https://cdn-uploads.huggingface.co/production/uploads/636a35eff8d9af4aea181608/M7mJcKstlVHo6p7P4Cu1j.png)

## Data Cleaning and Preparation for Machine Learning

### math_dataset

<center><img src="../src/img/version-plot.png" alt="mmlu-pro" height="250" width="500"/></center>

<center><img src="../src/img/version-plot3.png" alt="mmlu-pro" height="250" width="500"/></center>

The MMLU-Pro and HLE datasets require more advanced EDA methods. The baseline builds use the `math_dataset`; these are iterative, prototype baseline builds.

- See notebooks in `docs/tabular` directory for more details.
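
As a hedged sketch of one preparation step, the snippet below pairs up questions and answers from text in which they alternate line by line. That layout is an assumption about the pre-generated `math_dataset` text files, and `read_qa_pairs` is a hypothetical helper; the notebooks may load the data differently.

```python
import io


def read_qa_pairs(lines):
    """Pair up alternating question/answer lines, dropping blank lines."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    # Even positions hold questions, odd positions hold their answers.
    return list(zip(cleaned[0::2], cleaned[1::2]))


# In-memory stand-in for a data file; the content is invented for this sketch.
sample = io.StringIO("Solve 2*x = 10 for x.\n5\n\nWhat is 3 + 4?\n7\n")
pairs = read_qa_pairs(sample)
```

The resulting list of `(question, answer)` tuples can then be loaded into a tabular structure for EDA or model training.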

## Conclusion

This is doable, though not 100% complete relative to initial estimates. A shippable Python package is feasible but requires substantial further work.
- See the sandbox in the `src/projects/version/sandbox` directory for more details.