
<div align="center">
    <img src="https://i.ibb.co/9rx4pbX/AIMO.png">
</div>

In this competition, the aim is to build AI models that can solve tough math problems; in other words, to create LLMs capable of solving Math Olympiad problems. This notebook walks through the process of fine-tuning the **Gemma** LLM with LoRA to solve math problems using KerasNLP. With KerasNLP, fine-tuning with LoRA takes just a few lines of code.
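
To give a rough sense of why LoRA keeps fine-tuning cheap: instead of updating a full `d x k` weight matrix, it trains two low-rank factors of shapes `d x r` and `r x k`. The sketch below only counts parameters; the layer size and rank are illustrative assumptions, not Gemma's actual dimensions.

```python
def lora_param_counts(d, k, rank):
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k            # parameters updated by full fine-tuning
    lora = rank * (d + k)   # parameters in the two low-rank factors
    return full, lora

# Hypothetical layer size and rank, chosen only for illustration
full, lora = lora_param_counts(d=2048, k=2048, rank=4)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

The smaller the rank relative to the layer dimensions, the smaller the share of parameters that needs gradients and optimizer state.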



# Install Libraries

We need to install the latest KerasNLP to load the Gemma 1.1 model. Since we don't have internet access during inference, we will install the library from local files.

In [1]:
!pip install -q /kaggle/input/keras-lib-dataset/keras_nlp-0.9.2-py3-none-any.whl --no-deps

[0m[31mERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/kaggle/input/keras-lib-dataset/keras_nlp-0.9.2-py3-none-any.whl'
[0m[31m
[0m

# Import Libraries 

In [2]:
import os
os.environ["KERAS_BACKEND"] = "jax" # you can also use tensorflow or torch
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9" # avoid memory fragmentation on JAX backend.

import keras
import keras_nlp

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas() # progress bar for pandas

import plotly.graph_objs as go
import plotly.express as px
from IPython.display import display, Markdown

2024-06-13 06:55:21.844049: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-13 06:55:21.844196: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-13 06:55:22.010276: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Configuration

In [3]:
class CFG:
    seed = 42
    dataset_path = "/kaggle/input/ai-mathematical-olympiad-prize"
    preset = "gemma_1.1_instruct_2b_en" # name of pretrained Gemma
    sequence_length = 512 # max size of input sequence for training
    batch_size = 1 # size of the input batch in training
    epochs = 1 # number of epochs to train

# Reproducibility 
Set the random seed so that each run produces similar results.

In [4]:
keras.utils.set_random_seed(CFG.seed)
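
As a small illustration of what seeding buys us, the sketch below uses Python's built-in `random` module (not Keras itself) to show that the same seed reproduces the same sequence. `keras.utils.set_random_seed` applies the same idea to the Python, NumPy, and backend random generators. The helper name `draw` is hypothetical.

```python
import random

def draw(seed, n=5):
    """Draw n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [rng.randint(0, 999) for _ in range(n)]

first = draw(42)
second = draw(42)
other = draw(7)

print(first == second)  # same seed, same sequence
print(first == other)   # a different seed gives a different sequence
```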

# Data

No training data is provided in this competition; in other words, we can use any openly available datasets for this competition. In this notebook, we will use a modified **Math** dataset which I have compiled to have a `Question-Solution-Answer` format.

**Data Format:**

These datasets include:
- `problem`: The math problem in LaTeX format.
- `solution`: Step-by-step solution to this problem.
- `answer`: Final answer of the solution which will be the ground truth for this competition.
- `level`: Difficulty of the problem.
- `type`: The category of the problem.

> This dataset comes with its own train/test split. However, we will merge the two and use them together for fine-tuning. You are welcome to use them separately for training and validation. Also, to reduce training time, we will train only on the first `1000` samples. You are welcome to train on the full data.

In [5]:
df1 = pd.read_csv("/kaggle/input/math-qsa-dataset/train.csv")
df2 = pd.read_csv("/kaggle/input/math-qsa-dataset/test.csv")
df = pd.concat([df1, df2], axis=0)
df = df[:1000] # take first 1000 samples
df.head(2)

| Unnamed: 0 | problem | level | type | solution | answer |
|---|---|---|---|---|---|
| 0 | The United States Postal Service charges an ex... | Level 3 | Prealgebra | We calculate the desired ratio for each envelo... | 3 |
| 1 | How many integers between 1000 and 2000 have a... | Level 4 | Prealgebra | A number with 15, 20 and 25 as factors must be... | 3 |


# Filter Data

The Math dataset contains various problems, but not all of them are suitable for this competition. More specifically, this competition requires a `non-negative integer` answer, while the Math dataset includes problems with different types of answers such as integers, floats, fractions, matrices, etc. In this notebook, we will only use those problems whose answers are non-negative integers and filter out the rest.

In [6]:
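def is_integer(text):
    """Return True if `text` parses as a non-negative integer."""
    try:
        return int(text) >= 0
    except (ValueError, TypeError):
        # Non-integer answers (fractions, LaTeX, missing values) are rejected
        return False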
    
df["is_integer"] = df.answer.map(is_integer)
df = df[df.is_integer].reset_index(drop=True)
df.head(2)

| Unnamed: 0 | problem | level | type | solution | answer | is_integer |
|---|---|---|---|---|---|---|
| 0 | The United States Postal Service charges an ex... | Level 3 | Prealgebra | We calculate the desired ratio for each envelo... | 3 | True |
| 1 | How many integers between 1000 and 2000 have a... | Level 4 | Prealgebra | A number with 15, 20 and 25 as factors must be... | 3 | True |
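
To make the filter's behavior concrete, here is a small self-contained sketch that applies the same non-negative-integer check to a few made-up answer strings. The sample values are hypothetical and are not taken from the dataset, and `is_non_negative_integer` is just a stand-alone copy of the check for illustration.

```python
def is_non_negative_integer(text):
    """True only if `text` parses as an int that is >= 0."""
    try:
        return int(text) >= 0
    except (ValueError, TypeError):
        return False

# Hypothetical answers in the styles the Math dataset mixes together
samples = ["3", "348", "-2", "3.5", "\\frac{1}{2}", "2\\sqrt{3}", None]

kept = [s for s in samples if is_non_negative_integer(s)]
print(kept)
```

Only plain integer strings survive; decimals, LaTeX expressions, negatives, and missing values are dropped.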


# Prompt Engineering

We will use the simple prompt template below to create a problem-solution-answer trio to feed the model. This template helps the model follow instructions and respond accurately. You can explore more advanced prompt templates for better results.

```
Role:
You are an advanced AI system with exceptional mathematical reasoning and problem-solving capabilities, specifically designed to solve tricky math problems (whose answer is a non-negative integer) written in LaTeX format from the AI Mathematical Olympiad (AIMO) competition. Your task is to accurately analyze and solve intricate mathematical problems, demonstrating a deep understanding of mathematical concepts and a strong ability to apply logical reasoning strategies.

Instruction:
1. Carefully read and comprehend the problem statement provided in the "Problem" section.
2. In the "Solution" section, provide a solution of the problem with detailed explanation of your logical reasoning process. Keep in mind that answer must be a non-negative integer number.
3. At the end, create a "Answer" section where you will state only the final numerical or algebraic answer, without any additional text or narrative.

Problem:
...

Solution:
...

Answer:
...
```

In [7]:
template = """Role:\nYou are an advanced AI system with exceptional mathematical reasoning and problem-solving capabilities, specifically designed to solve tricky math problems (whose answer is a non-negative integer) written in LaTeX format from the AI Mathematical Olympiad (AIMO) competition. Your task is to accurately analyze and solve intricate mathematical problems, demonstrating a deep understanding of mathematical concepts and a strong ability to apply logical reasoning strategies.\n\nInstruction:
1. Carefully read and comprehend the problem statement provided in the "Problem" section.
2. In the "Solution" section, provide a solution of the problem with detailed explanation of your logical reasoning process. Keep in mind that answer must be a non-negative integer number.
3. At the end, create a "Answer" section where you will state only the final numerical or algebraic answer, without any additional text or narrative.\n\nProblem:\n{problem}\n\nSolution:\n{solution}"""

In [8]:
df["prompt"] = df.progress_apply(lambda row: template.format(problem=row.problem,
                                                             solution=f"{row.solution}\n\nAnswer:\n{row.answer}"),
                                                             axis=1)
data = df.prompt.tolist()

  0%|          | 0/676 [00:00<?, ?it/s]
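
The sketch below repeats the formatting step on a shortened stand-in template to highlight how training and inference prompts differ: training prompts contain the solution and answer, while inference prompts leave the solution slot empty so the model writes it. `short_template` and the toy problem are assumptions for illustration only.

```python
# Shortened, hypothetical stand-in for the full template above
short_template = "Problem:\n{problem}\n\nSolution:\n{solution}"

problem = "What is $1+1$?"

# Training prompt: the solution slot holds the worked solution plus the answer
train_prompt = short_template.format(
    problem=problem,
    solution="Adding gives $1+1=2$.\n\nAnswer:\n2",
)

# Inference prompt: the solution slot is empty, so the model continues from here
infer_prompt = short_template.format(problem=problem, solution="")

print(train_prompt)
print("-" * 20)
print(infer_prompt)
```

Because both prompts share the same prefix, the model sees at inference time exactly the context it was trained to continue.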

Let's examine a sample prompt. Since the answers in our dataset are formatted in **markdown**, we will render the sample using `Markdown()` to visualize the formatting properly.

## Check Sample

In [9]:
def colorize_text(text):
    for word, color in zip(["Role", "Instruction", "Problem", "Solution", "Answer"],
                           ["blue", "yellow", "red", "cyan", "green"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

## Sample 1

In [10]:
# Take a sample at a fixed index
sample = data[12]

# Color the Role, Instruction, Problem, Solution and Answer headings
sample = colorize_text(sample)

# Show sample in markdown
display(Markdown(sample))



**<font color='blue'>Role:</font>**
You are an advanced AI system with exceptional mathematical reasoning and problem-solving capabilities, specifically designed to solve tricky math problems (whose answer is a non-negative integer) written in LaTeX format from the AI Mathematical Olympiad (AIMO) competition. Your task is to accurately analyze and solve intricate mathematical problems, demonstrating a deep understanding of mathematical concepts and a strong ability to apply logical reasoning strategies.



**<font color='yellow'>Instruction:</font>**
1. Carefully read and comprehend the problem statement provided in the "Problem" section.
2. In the "Solution" section, provide a solution of the problem with detailed explanation of your logical reasoning process. Keep in mind that answer must be a non-negative integer number.
3. At the end, create a "Answer" section where you will state only the final numerical or algebraic answer, without any additional text or narrative.



**<font color='red'>Problem:</font>**
What is the largest positive multiple of $12$ that is less than $350?$



**<font color='cyan'>Solution:</font>**
Dividing $350$ by $12$ gives a quotient $29$ with a remainder of $2$. In other words, \[350=12\cdot29+2.\]Thus, $29\cdot12=\boxed{348}$ is the largest multiple of $12$ which is less than $350.$



**<font color='green'>Answer:</font>**
348

## Sample 2

In [11]:
# Take a sample at a fixed index
sample = data[32]

# Color the Role, Instruction, Problem, Solution and Answer headings
sample = colorize_text(sample)

# Show sample in markdown
display(Markdown(sample))



**<font color='blue'>Role:</font>**
You are an advanced AI system with exceptional mathematical reasoning and problem-solving capabilities, specifically designed to solve tricky math problems (whose answer is a non-negative integer) written in LaTeX format from the AI Mathematical Olympiad (AIMO) competition. Your task is to accurately analyze and solve intricate mathematical problems, demonstrating a deep understanding of mathematical concepts and a strong ability to apply logical reasoning strategies.



**<font color='yellow'>Instruction:</font>**
1. Carefully read and comprehend the problem statement provided in the "Problem" section.
2. In the "Solution" section, provide a solution of the problem with detailed explanation of your logical reasoning process. Keep in mind that answer must be a non-negative integer number.
3. At the end, create a "Answer" section where you will state only the final numerical or algebraic answer, without any additional text or narrative.



**<font color='red'>Problem:</font>**
We are given that $$54+(98\div14)+(23\cdot 17)-200-(312\div 6)=200.$$Now, let's remove the parentheses:  $$54+98\div14+23\cdot 17-200-312\div 6.$$What does this expression equal?



**<font color='cyan'>Solution:</font>**
Notice how the parentheses are only around pairs of numbers that are being multiplied or divided. Since multiplication and division are performed before addition and subtraction, it doesn't matter if we remove the parentheses. That's why  \begin{align*}
&54+(98\div14)+(23\cdot 17)-200-(312\div 6)\\
&=54+98\div14+23\cdot17-200-312\div 6\\
&=\boxed{200}.\end{align*}



**<font color='green'>Answer:</font>**
200

## Utilities

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-math/MetaMath-7B-V1.0")
model = AutoModelForCausalLM.from_pretrained("meta-math/MetaMath-7B-V1.0")


In [18]:
import re

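# Extract answer from model response
def get_answer(text):
    try:
        # Capture everything after the first "Answer:" marker
        answer = re.search(r'Answer:\s*([\s\S]+)', text).group(1).strip()
        answer = answer.replace(",", "")  # drop thousands separators
        if is_integer(answer):
            return int(answer) % 1000  # competition answers are in 0-999
        else:
            return 0
    except (AttributeError, TypeError):
        # No "Answer:" marker found, or the response is not a string
        return 0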
    
    
def infer(df):
    preds = []
    for i in tqdm(range(len(df))):
        row = df.iloc[i]

        # Generate Prompt using template
        prompt = template.format(
            problem=row.problem,
            solution=""
        )

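        # Infer: the Hugging Face model works on token IDs, so tokenize the
        # prompt first and decode the generated IDs back to text
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_length=1024)
        output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        pred = get_answer(output)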

        # Store predictions
        preds.append([row.id, pred])
        if "answer" in row:
            preds[-1] += [row.answer]
    return preds
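
Note that `get_answer` falls back to `0` whenever any extra text follows the number after `Answer:`. As one possible alternative, not used elsewhere in this notebook, the sketch below keeps only the first integer that follows the last `Answer:` marker. `extract_answer` is a hypothetical helper name, and the sample responses are made up.

```python
import re

def extract_answer(text, default=0):
    """Return the first integer after the last 'Answer:' marker, modulo 1000."""
    # Keep only the text after the last "Answer:" marker, if there is one
    _, marker, tail = text.rpartition("Answer:")
    if not marker:
        return default
    # First integer in the tail, allowing thousands separators such as 1,234
    match = re.search(r"-?\d[\d,]*", tail)
    if match is None:
        return default
    value = int(match.group(0).replace(",", ""))
    if value < 0:
        return default  # negative answers are not valid in this competition
    return value % 1000

print(extract_answer("Solution:\n...\n\nAnswer:\n348"))
print(extract_answer("Answer: 1,234 is the final result."))
print(extract_answer("No marker here"))
```

Whether this is preferable depends on how the model tends to phrase its final line, so it is worth checking against real responses before swapping it in.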

## Infer on Test Data

In [15]:
test_df = pd.read_csv(f"{CFG.dataset_path}/test.csv")
test_preds = infer(test_df)

NameError: name 'infer' is not defined

## Prepare Submission File

While preparing the submission file, keep in mind that the answer must be between `0-999`. This is easily handled with the remainder (`%`) operation. In this notebook, that step is already applied at the inference stage, when the `answer` is extracted from the `solution`, so we don't need to apply it again here.

In [None]:
sub_df = pd.DataFrame(test_preds, columns=["id", "answer"])
sub_df.to_csv("submission.csv",index=False,header=True)
sub_df.head()
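
As a quick sanity check on the `0-999` requirement, the sketch below reduces a few made-up raw predictions with `%` and verifies the range. The values are hypothetical, not actual model outputs.

```python
raw_predictions = [348, 1234, 1000, 0]  # hypothetical model outputs

# The remainder modulo 1000 maps any non-negative integer into 0..999
reduced = [p % 1000 for p in raw_predictions]
print(reduced)

# Every submitted answer should now be within the allowed range
assert all(0 <= p <= 999 for p in reduced)
```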