# MidcurveLLM Fine-tuning with Ludwig

🙌 Welcome to this hands-on tutorial on using Ludwig 0.8 to build a fine-tuned model for Geometric Graph Shape Reduction, aka Midcurve.

Ludwig, an open-source package, is used here to train machine learning models in Encoder-Combiner-Decoder (ECD) mode as well as to fine-tune LLMs via instruction tuning, all through declarative config files.

A bit more info about MidcurveNN: MidcurveNN is a project aimed at solving the challenging problem of finding the midcurve of a 2D closed shape using neural networks. The primary goal is to transform a closed polygon, represented by a set of points or connected lines, into another set of points or connected lines, allowing for the possibility of open or branched polygons in the output.

👉👉 Step-by-step explanation of the solution is available [here TBD]().

## Installation 🧰

This notebook needs a HuggingFace API token, approved access to gemma-7b-it, and a GPU with at least 12 GiB of VRAM. A T4 GPU is used here.
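To confirm that the GPU meets the memory requirement before installing anything heavy, a quick check along these lines may help. It is a minimal sketch: the helper names `vram_gib` and `meets_requirement` are hypothetical, and the 12 GiB threshold is simply the figure quoted above.

```python
def vram_gib(total_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return total_bytes / 2**30


def meets_requirement(total_bytes: int, min_gib: float = 12.0) -> bool:
    """Return True when the device has at least `min_gib` GiB of memory."""
    return vram_gib(total_bytes) >= min_gib


try:
    import torch  # usually preinstalled on Colab; may be missing elsewhere
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {vram_gib(props.total_memory):.2f} GiB, "
          f"enough: {meets_requirement(props.total_memory)}")
else:
    print("No CUDA device visible (or PyTorch is not installed).")
```

Total memory is only an upper bound; the amount actually free during training is lower once the model weights are loaded.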

In [None]:
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

In [None]:
!pip uninstall -y tensorflow --quiet
!pip install Cython # do this before installing torch which is inside ludwig, to avoid "_C" error
!pip install ludwig
!pip install ludwig[llm]
!pip install accelerate
from accelerate.utils import write_basic_config; write_basic_config(mixed_precision='fp16')
!pip install -i https://pypi.org/simple/ bitsandbytes  # latest
# !pip install bitsandbytes==0.41.3 --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui # overriding 0.40.2 which comes with Ludwig
#You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.

Collecting sentence-transformers (from ludwig[llm])
  Using cached sentence_transformers-2.5.1-py3-none-any.whl (156 kB)
Collecting faiss-cpu (from ludwig[llm])
  Using cached faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Collecting loralib (from ludwig[llm])
  Using cached loralib-0.1.2-py3-none-any.whl (10 kB)
Collecting peft (from ludwig[llm])
  Using cached peft-0.9.0-py3-none-any.whl (190 kB)
Installing collected packages: loralib, faiss-cpu, sentence-transformers, peft
Successfully installed faiss-cpu-1.8.0 loralib-0.1.2 peft-0.9.0 sentence-transformers-2.5.1
Looking in indexes: https://pypi.org/simple/


Enable text wrapping so that output does not need horizontal scrolling, and define a function to flush the CUDA cache.

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)

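def clear_cache():
  # Free cached CUDA memory. Delete references to large models (e.g. `del model`) before calling.
  import gc
  import torch  # imported here because the notebook-level import comes in a later cell
  gc.collect()
  if torch.cuda.is_available():
    torch.cuda.empty_cache()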

Sometimes the error `NameError: name '_C' is not defined` appears; see https://github.com/pytorch/pytorch/issues/1633 for the solution.

-> **Setup Your HuggingFace Token** 🤗

We'll be using Gemma 7B-it, an instruction-tuned model released by Google. The model is gated: access must be requested and is tied to your HuggingFace account and token.

Obtain a [HuggingFace API Token](https://huggingface.co/settings/tokens) and request access to [gemma-7b-it](https://huggingface.co/google/gemma-7b-it) before proceeding. You may need to sign up on HuggingFace if you don't already have an account: https://huggingface.co/join

In [None]:
import getpass
import locale; locale.getpreferredencoding = lambda: "UTF-8"
import logging
import os
import torch
import yaml

from ludwig.api import LudwigModel


os.environ["HUGGING_FACE_HUB_TOKEN"] = getpass.getpass("Token:")
assert os.environ["HUGGING_FACE_HUB_TOKEN"]

Token:··········



## Configurations

This cell defines the config for instruction fine-tuning of the Gemma 7B model. It is based on [this](https://predibase.com/blog/fine-tuning-mistral-7b-on-a-single-gpu-with-ludwig) tutorial, with the prompt changed for the midcurve task.

In [None]:
instruction_tuning_llm_yaml = yaml.safe_load("""
model_type: llm
base_model: google/gemma-7b-it
# meta-llama/Llama-2-7b-hf
# alexsherstinsky/Mistral-7B-v0.1-sharded
# mistralai/Mistral-7B-v0.1
# Salesforce/codet5-large

quantization:
 bits: 4

adapter:
 type: lora

prompt:
  template: |
    ### Instruction:
    You are a geometric modeling expert. You need to read 2D profile structure
    called 'Profile_brep' from json format and convert it to corresponding
    2D midcurve structure called 'Midcurve_brep' also in json format.
    Below is an example:

    ### Input:
    {Profile_brep}

    ### Response:

input_features:
 - name: Profile_brep
   type: text

output_features:
 - name: Midcurve_brep
   type: text

trainer:
 type: finetune
 learning_rate: 0.0003
 batch_size: 1
 gradient_accumulation_steps: 8
 epochs: 3

backend:
 type: local
""")

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

# import os
# os.chdir('/content/drive/MyDrive/ImpDocs/Work/AICoach/Notebooks')

## Dataset
The data is available as a CSV file on GitHub [here](https://raw.githubusercontent.com/yogeshhk/MidcurveNN/master/src/ludwig/data/shapes2brep.csv). Download it once with `wget` using the cell below, which saves it as `shapes2brep.csv` in the working directory, then comment out the cell for later runs.

In [None]:
!pip install wget
import wget

# Replace the URL with the raw URL of the file on GitHub
url = "https://raw.githubusercontent.com/yogeshhk/MidcurveNN/master/src/ludwig/data/shapes2brep.csv"

# Download the file
wget.download(url, 'shapes2brep.csv')

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=af0abf882e5c5a66018c34ad3bf2fe63d4b2ed470c291a98ca32c0df35df8a60
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


'shapes2brep.csv'

In [None]:
from google.colab import data_table; data_table.enable_dataframe_formatter()
import numpy as np; np.random.seed(123)
import pandas as pd
from datasets import DatasetDict, load_from_disk, Dataset

In [None]:
# df = pd.read_csv('/content/drive/MyDrive/ImpDocs/Work/AICoach/Notebooks/data/midcurve_llm.csv')
df = pd.read_csv('shapes2brep.csv', encoding='cp1252')
df.head()

| Unnamed: 0 | ShapeName | Profile | Midcurve | Profile_brep | Midcurve_brep |
|---|---|---|---|---|---|
| 0 | I | `[[5.0, 5.0], [10.0, 5.0], [10.0, 20.0], [5.0, ...` | `[[7.5, 5.0], [7.5, 20.0]]` | `{"Points": [[5.0, 5.0], [10.0, 5.0], [10.0, 20...` | `{"Points": [[7.5, 5.0], [7.5, 20.0]], "Lines":...` |
| 1 | L | `[[5.0, 5.0], [10.0, 5.0], [10.0, 30.0], [35.0,...` | `[[7.5, 5.0], [7.5, 32.5], [35.0, 32.5]]` | `{"Points": [[5.0, 5.0], [10.0, 5.0], [10.0, 30...` | `{"Points": [[7.5, 5.0], [7.5, 32.5], [35.0, 32...` |
| 2 | Plus | `[[0.0, 25.0], [10.0, 25.0], [10.0, 45.0], [15....` | `[[12.5, 0.0], [12.5, 22.5], [12.5, 45.0], [0.0...` | `{"Points": [[0.0, 25.0], [10.0, 25.0], [10.0, ...` | `{"Points": [[12.5, 0.0], [12.5, 22.5], [12.5, ...` |
| 3 | T | `[[0.0, 25.0], [25.0, 25.0], [25.0, 20.0], [15....` | `[[12.5, 0.0], [12.5, 22.5], [25.0, 22.5], [0.0...` | `{"Points": [[0.0, 25.0], [25.0, 25.0], [25.0, ...` | `{"Points": [[12.5, 0.0], [12.5, 22.5], [25.0, ...` |
| 4 | I_scaled_2 | `[[10.0, 10.0], [20.0, 10.0], [20.0, 40.0], [10...` | `[[15.0, 10.0], [15.0, 40.0]]` | `{"Points": [[10.0, 10.0], [20.0, 10.0], [20.0,...` | `{"Points": [[15.0, 10.0], [15.0, 40.0]], "Line...` |


A crucial step is compiling a dataset that mirrors real-world profiles. Here the dataset pairs each `Profile_brep` with its corresponding `Midcurve_brep`. Each row consists of:
- `Profile_brep`, which describes a 2D profile in brep format
- `Midcurve_brep`, which describes the corresponding 1D midcurve in brep format
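As a concrete illustration of the brep format, the sketch below parses one brep string and resolves its `Lines` (pairs of indices into `Points`) to coordinate pairs. The sample string is one of the test profiles used later in this notebook; the helper name `brep_to_lines` is hypothetical.

```python
import json


def brep_to_lines(brep_json: str):
    """Return each line of a brep as a pair of (x, y) tuples."""
    brep = json.loads(brep_json)
    points = brep["Points"]
    # Every entry in "Lines" holds two indices into "Points".
    return [tuple(tuple(points[i]) for i in line) for line in brep["Lines"]]


sample = (
    '{"Points": [[12.48, 0.65], [11.31, 23.12], [23.79, 23.78], [-1.18, 22.47]], '
    '"Lines": [[0, 1], [1, 2], [3, 1]], "Segments": [[0], [1], [2]]}'
)
lines = brep_to_lines(sample)
for start, end in lines:
    print(start, "->", end)
```

The result has the same shape as the list of point pairs that the `plot_lines` function in this notebook takes as input.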

Ludwig's declarative config lets us state the model setup explicitly, which keeps the training process transparent and easy to inspect.

Instantiate `LudwigModel` with the fine-tuning config `instruction_tuning_llm_yaml` and train it on the dataframe loaded from the shapes CSV.

In [None]:
model_instruction_tuning = LudwigModel(config=instruction_tuning_llm_yaml,logging_level=logging.INFO)
results_instruction_tuning = model_instruction_tuning.train(dataset=df)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

INFO:ludwig.utils.print_utils:
INFO:ludwig.utils.print_utils:╒════════════════════════╕
INFO:ludwig.utils.print_utils:│ EXPERIMENT DESCRIPTION │
INFO:ludwig.utils.print_utils:╘════════════════════════╛
INFO:ludwig.utils.print_utils:
INFO:ludwig.api:╒══════════════════╤═════════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name  │ api_experiment                                                                          │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Model name       │ run                                                                                     │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /content/results/api_experiment_run                                                     │
├──────────────────┼─────────────────────────────────────────────────────────────────

tokenizer_config.json:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/888 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

INFO:ludwig.utils.tokenizers:Loaded HuggingFace implementation of google/gemma-7b-it tokenizer
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
INFO:ludwig.features.text_feature:Max length of feature 'None': 363 (without start and stop symbols)
INFO:ludwig.features.text_feature:Max sequence length is 363 for feature 'None'
INFO:ludwig.utils.tokenizers:Loaded HuggingFace implementation of google/gemma-7b-it tokenizer
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
INFO:ludwig.features.text_feature:Max length of feature 'Midcurve_brep': 116 (without start and stop symbols)
INFO:ludwig.features.text_feature:Max sequence length is 116 for feature 'Midcurve_brep'
INFO:ludwig.utils.tokenizers:Loaded HuggingFace implementation of google/gemma-7b-it tokenizer
Asking to truncate to max_length but no maximum lengt

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/2.11G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
INFO:ludwig.models.llm:Done.
INFO:ludwig.utils.tokenizers:Loaded HuggingFace implementation of google/gemma-7b-it tokenizer
INFO:ludwig.models.llm:Trainable Parameter Summary For Fine-Tuning
INFO:ludwig.models.llm:Fine-tuning with adapter: lora
INFO:ludwig.utils.print_utils:
INFO:ludwig.utils.print_utils:╒══════════╕
INFO:ludwig.utils.print_utils:│ TRAINING │
INFO:ludwig.utils.print_utils:╘══════════╛
INFO:ludwig.utils.print_utils:


trainable params: 3,211,264 || all params: 8,540,892,160 || trainable%: 0.037598695075901765


INFO:ludwig.trainers.trainer:Creating fresh model training run.
INFO:ludwig.trainers.trainer:Training for 2001 step(s), approximately 3 epoch(s).
INFO:ludwig.trainers.trainer:Early stopping policy: 5 round(s) of evaluation, or 3335 step(s), approximately 5 epoch(s).

INFO:ludwig.trainers.trainer:Starting with step 0, epoch: 0


Training:   0%|          | 1/2001 [00:19<10:51:05, 19.53s/it, loss=2.6]

OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 77.56 MiB is free. Process 4109 has 14.50 GiB memory in use. Of the allocated memory 14.13 GiB is allocated by PyTorch, and 240.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
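The run above stops with a CUDA out-of-memory error on the 14.58 GiB GPU. One direction to try is reducing the memory footprint through the config and the allocator setting that the error message mentions. The sketch below lists candidate settings only and is not a verified fix: `with_memory_savers` is a hypothetical helper, the `trainer.enable_gradient_checkpointing` key is assumed to be supported by the installed Ludwig version (check its config schema), and the `max_split_size_mb` value of 128 is an arbitrary illustration.

```python
import copy
import os

# Allocator hint named in the error message. It must be set before CUDA is
# initialised, so in a notebook the runtime has to be restarted first.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")


def with_memory_savers(config: dict) -> dict:
    """Return a copy of a Ludwig config with gradient checkpointing requested."""
    cfg = copy.deepcopy(config)
    # Assumed key: trades extra compute for lower activation memory.
    cfg.setdefault("trainer", {})["enable_gradient_checkpointing"] = True
    return cfg


# Trimmed stand-in for the full config defined earlier.
base = {"model_type": "llm", "trainer": {"type": "finetune", "batch_size": 1}}
lean = with_memory_savers(base)
print(lean["trainer"])
```

With the real config, the call would look like `LudwigModel(config=with_memory_savers(instruction_tuning_llm_yaml), logging_level=logging.INFO)`. If memory is still insufficient, a smaller base model or a larger GPU may be needed.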

The test (inference) dataset has just a couple of profiles for which midcurves are sought.

In [None]:
test_df = pd.DataFrame([
    {
        "Profile_brep": '{"Points": [[12.48, 0.65], [11.31, 23.12], [10.13, 45.59], [-1.18, 22.47], [23.79, 23.78]], "Lines": [[0, 1], [4, 1], [2, 1], [3, 1]], "Segments": [[0], [1], [2], [3]]}'
    },
    {
        "Profile_brep": '{"Points": [[12.48, 0.65], [11.31, 23.12], [23.79, 23.78], [-1.18, 22.47]], "Lines": [[0, 1], [1, 2], [3, 1]], "Segments": [[0], [1], [2]]}'
    }
])

## Running Ludwig: Inferencing

Once training completes, the fine-tuned model is put to the test: it is fed a set of profiles to see the declarative AI framework in action.

**Predictions on fine-tuned model**

In [None]:
predictions_instruction_tuning_df, output_directory = model_instruction_tuning.predict(dataset=test_df)
shapes_brep_dict_list_strs = predictions_instruction_tuning_df["Midcurve_brep_response"].tolist()
print(shapes_brep_dict_list_strs)
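The `plot_breps` helper defined further down expects parsed dictionaries rather than raw strings, so the generated text has to be converted first. The sketch below is one way to do that, under several assumptions: `build_plot_records` is a hypothetical helper; each prediction may arrive either as a string or as a list holding one string; and the model output may not be valid JSON, in which case the row is skipped.

```python
import json


def build_plot_records(profile_strs, response_strs):
    """Pair each profile brep with its predicted midcurve brep as parsed dicts."""
    records = []
    for profile_str, response in zip(profile_strs, response_strs):
        # Some Ludwig versions may wrap the generated text in a list (assumption).
        if isinstance(response, (list, tuple)):
            response = response[0] if response else ""
        try:
            profile_brep = json.loads(profile_str)
            midcurve_brep = json.loads(response)
            records.append({
                "Profile": profile_brep["Points"],
                "Profile_brep": profile_brep,
                "Midcurve": midcurve_brep["Points"],
                "Midcurve_brep": midcurve_brep,
            })
        except (json.JSONDecodeError, KeyError, TypeError):
            # Generated text is not guaranteed to be well-formed JSON.
            print("Skipping unparseable prediction:", response)
    return records


# Made-up strings standing in for real model output.
profiles = ['{"Points": [[0, 0], [1, 0]], "Lines": [[0, 1]], "Segments": [[0]]}'] * 2
responses = [
    ['{"Points": [[0.5, 0], [0.5, 1]], "Lines": [[0, 1]], "Segments": [[0]]}'],
    "not json",
]
records = build_plot_records(profiles, responses)
print(len(records))
```

With real predictions, the call would look like `shapes_brep_dict_list = build_plot_records(test_df["Profile_brep"], shapes_brep_dict_list_strs)`.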

Plotting function to visualize the output

In [None]:
def plot_lines(lines, color='black'):
    for line in lines:
        a = np.asarray(line)
        x = a[:, 0].T
        y = a[:, 1].T
        plt.plot(x, y, c=color)
    plt.axis('equal')

In [None]:
test_lines =  [((10.0, 0.0), (10.0, 45.0)),  ((10.0, 45.0), (15.0, 45.0)), ((15.0, 45.0), (15.0, 0.0))]
plot_lines(test_lines, 'red')

In [None]:
def plot_breps(shapes_brep_dict_list):
    for dct in shapes_brep_dict_list:
        profile_point_list = dct['Profile']
        profile_x_coords, profile_y_coords = zip(*profile_point_list)
        profile_brep = dct['Profile_brep']
        profile_segments = profile_brep["Segments"]
        profile_lines = profile_brep["Lines"]
        profile_segment_color = 'black'
        # Plot Profile segments
        for segment in profile_segments:
            for line_idx in segment:
                line = profile_lines[line_idx]
                x_segment = [profile_x_coords[i] for i in line]
                y_segment = [profile_y_coords[i] for i in line]
                plt.plot(x_segment + [x_segment[0]], y_segment + [y_segment[0]], color=profile_segment_color,
                         marker='o')

        midcurve_point_list = dct['Midcurve']
        midcurve_x_coords, midcurve_y_coords = zip(*midcurve_point_list)
        midcurve_brep = dct['Midcurve_brep']
        midcurve_segments = midcurve_brep["Segments"]
        midcurve_lines = midcurve_brep["Lines"]
        midcurve_segment_color = 'red'

        # Plot Midcurve segments
        for segment in midcurve_segments:
            for line_idx in segment:
                line = midcurve_lines[line_idx]
                x_segment = [midcurve_x_coords[i] for i in line]
                y_segment = [midcurve_y_coords[i] for i in line]
                plt.plot(x_segment + [x_segment[0]], y_segment + [y_segment[0]], color=midcurve_segment_color,
                         marker='x')

        plt.axis('equal')

In [None]:
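# `shapes_brep_dict_list` must first be built as a list of dicts with parsed 'Profile', 'Profile_brep', 'Midcurve' and 'Midcurve_brep' entries.
plot_breps(shapes_brep_dict_list)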