This notebook follows the fine-tuning workflow from https://github.com/bigcode-project/starcoder

In [None]:
# Step 1: Clone the repo (or pull the latest changes if it is already cloned)

import os
import subprocess

repo_dir = "/content/tamarind-finetune"
repo_url = "https://github.com/smartrics/tamarind-finetune.git"

if os.path.isdir(repo_dir):
    print("Directory 'tamarind-finetune' exists. Pulling latest changes...")
    subprocess.run(["git", "-C", repo_dir, "pull"], check=True)
else:
    print("Directory 'tamarind-finetune' does not exist. Cloning repository...")
    subprocess.run(["git", "clone", repo_url, repo_dir], check=True)
print("finished!")

In [None]:
%cd /content/tamarind-finetune

We’ll fine-tune `bigcode/starcoderbase-1b`, a 1B-parameter model trained on 80+ programming languages. The model is gated, so to run this notebook with this exact model you first need to request access on the model’s page and then log in to your Hugging Face account. Install the dependencies first; the login cell follows.

In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install transformers
!pip install git+https://github.com/huggingface/peft.git
!pip install datasets accelerate huggingface_hub bitsandbytes wandb

In [None]:
from huggingface_hub import notebook_login

# Step 2: Log in to the Hugging Face Hub (needed for the gated model and for pushing the merged model)
notebook_login()

In [None]:
import wandb
wandb.login()

In [6]:
!python data_starcoderbase/finetune.py \
  --model_path="bigcode/starcoderbase-1b" \
  --dataset_path="./data_starcoderbase/tamarind_data.csv" \
  --subset="data/finetune" \
  --split="train" \
  --size_valid_set 10000 \
  --seq_length 1700 \
  --max_steps 1000 \
  --batch_size 4 \
  --input_column_name="question" \
  --output_column_name="response" \
  --gradient_accumulation_steps 16 \
  --learning_rate 1e-4 \
  --lr_scheduler_type="cosine" \
  --num_warmup_steps 100 \
  --weight_decay 0.05 \
  --output_dir="./checkpoints"

2025-05-01 08:30:31.997278: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-01 08:30:32.016029: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746088232.037892    2155 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746088232.044622    2155 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-01 08:30:32.067023: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr
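
The flags passed to `finetune.py` above determine how much data each optimizer update sees. The following is a back-of-the-envelope sketch, not output from the script. It assumes that `--batch_size` is a per-device batch size and that `--max_steps` counts optimizer updates (as with Hugging Face `TrainingArguments`), and that the run uses a single GPU. The token counts are upper bounds that hold only if every sequence is filled to `--seq_length`.

```python
# Values copied from the finetune.py command above.
batch_size = 4                     # --batch_size (assumed to be per device)
gradient_accumulation_steps = 16   # --gradient_accumulation_steps
seq_length = 1700                  # --seq_length (max tokens per sequence)
max_steps = 1000                   # --max_steps (assumed to be optimizer updates)
num_gpus = 1                       # assumption: single-GPU runtime

# Sequences contributing to one optimizer update.
sequences_per_update = batch_size * gradient_accumulation_steps * num_gpus

# Upper bounds on tokens, reached only if every sequence is full length.
tokens_per_update = sequences_per_update * seq_length
total_tokens = tokens_per_update * max_steps

print(f"sequences per optimizer update: {sequences_per_update}")
print(f"tokens per optimizer update (upper bound): {tokens_per_update:,}")
print(f"tokens over the whole run (upper bound): {total_tokens:,}")
```

Under these assumptions the effective batch is 64 sequences per update. Changing `num_gpus` shows how the same flags scale on a multi-GPU machine.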

The SE dataset is large, and streaming makes its size more manageable. We also have to specify which split of the dataset is used. For more details, check the dataset's page on 🤗. Similarly, we can modify the command to account for the GPUs that are available.
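
To make the idea behind streaming concrete, here is a minimal standard-library sketch: rows are read lazily, the first `size_valid_set` rows become the validation set, and the remainder stays a lazy training stream. This illustrates the general technique only and is not the code in `finetune.py`. The helper names `stream_rows` and `take_then_rest` are hypothetical, and the in-memory CSV is a stand-in that reuses the `question`/`response` column names from the command above.

```python
import csv
import io
from itertools import islice


def stream_rows(handle):
    """Yield one CSV row at a time as a dict, without loading the whole file."""
    yield from csv.DictReader(handle)


def take_then_rest(rows, size_valid_set):
    """Take the first `size_valid_set` rows for validation.

    Returns the validation rows as a list and the remaining rows as a
    still-lazy iterator to use for training.
    """
    rows = iter(rows)
    valid = list(islice(rows, size_valid_set))
    return valid, rows


# Tiny in-memory stand-in for a CSV with "question" and "response" columns.
sample = io.StringIO(
    "question,response\n"
    "q1,r1\n"
    "q2,r2\n"
    "q3,r3\n"
)

valid_rows, train_rows = take_then_rest(stream_rows(sample), size_valid_set=1)
train_list = list(train_rows)  # materialized here only to inspect the result
print(len(valid_rows), len(train_list))
```

With a head-of-stream split like this, a `size_valid_set` larger than the number of rows leaves nothing for training, so it is worth comparing the validation size against the row count of the file.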

In [None]:
!python data_starcoderbase/merge_peft_adapters.py \
   --base_model_name_or_path "bigcode/starcoderbase-1b" \
   --peft_model_path "./checkpoints/checkpoint-100" \
   --merged_model_name_or_path "smartrics/starcoderbase-1b-tamarind" \
   --push_to_hub
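
The merge command above points at `checkpoint-100`, while training was configured for 1000 steps. Which checkpoints actually exist depends on the script's save interval, which is not shown here. The sketch below is one way to pick the most recent checkpoint directory, assuming the `checkpoint-<step>` naming seen in the path above. The helper name `latest_checkpoint` is hypothetical, and the demo runs on a temporary directory rather than on `./checkpoints`.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the `checkpoint-<step>` subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in Path(output_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            # Compare steps numerically: as strings, "1000" sorts before "200".
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Demo on a throwaway directory that mimics a checkpoints folder.
with tempfile.TemporaryDirectory() as tmp:
    for step in (100, 200, 1000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    demo_name = latest_checkpoint(tmp).name

print(demo_name)
```

In the notebook, the result of `latest_checkpoint("./checkpoints")` could then be passed as `--peft_model_path` in place of the hardcoded path.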