<a href="https://colab.research.google.com/github/twhool02/atubigdataanalyticsproject1/blob/main/Notebook_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Add Notebook Title Here

## Setup

### Map Google Drive

In [None]:
import shutil, os, subprocess

# mount google drive
from google.colab import drive
drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/Colab Notebooks/dissertation')

Mounted at /content/drive


### Log into HuggingFace Hub

In [None]:
# Required when quantizing models/data that are gated on HuggingFace and required for pushing models to HuggingFace
!pip install --upgrade huggingface_hub

import huggingface_hub

print(f"Hugging Face Version is: {huggingface_hub.__version__}")

Collecting huggingface_hub
  Downloading huggingface_hub-0.20.3-py3-none-any.whl (330 kB)
     330.1/330.1 kB 4.5 MB/s eta 0:00:00
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.20.2
    Uninstalling huggingface-hub-0.20.2:
      Successfully uninstalled huggingface-hub-0.20.2
Successfully installed huggingface_hub-0.20.3
Hugging Face Version is: 0.20.3


In [None]:
from google.colab import userdata

# Read the HF_TOKEN secret from Colab; this token has write permission on Hugging Face
hftoken = userdata.get('HF_TOKEN')

In [None]:
from huggingface_hub import login

# Log in to Hugging Face using the HF_TOKEN secret
login(hftoken, add_to_git_credential=True)

### Install Transformers and other libraries

In [None]:
# install the development version of transformers
# !pip install -q -U git+https://github.com/huggingface/transformers.git -q


# The 'accelerate' library is part of the Hugging Face ecosystem.
# It lets the same PyTorch code run across any distributed configuration by adding just a few lines of code.
# In short, it makes training and inference at scale simple, efficient, and adaptable.
# It abstracts the boilerplate code related to multi-GPU/TPU/fp16 setups and leaves the rest of your code unchanged.
# It is useful when you want to run training scripts in a distributed environment while keeping full control over the training loop.
# It is not a high-level framework above PyTorch, just a thin wrapper, so you don't have to learn a new library.
# !pip install -q -U git+https://github.com/huggingface/accelerate


# Install the latest stable builds, upgrading if a version later than the currently installed one is available
!pip install -q -U transformers

# Accelerate is a Python library created for PyTorch users. It abstracts the boilerplate code related to using multiple GPUs, TPUs,
# and mixed precision (fp8, fp16, bf16), allowing you to run your raw PyTorch training script on any kind of device.
# By adding a few lines to any standard PyTorch training script, you can now run on any kind of single or distributed
# node setting (single CPU, single GPU, multi-GPUs, and TPUs) as well as with or without mixed precision.
# It even handles the device placement for you.
!pip install -q -U accelerate


# TensorFlow is an open-source software library for high-performance numerical computation.
# It's used for machine learning and deep learning applications. Its flexible architecture allows easy deployment
# of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers
# to mobile and edge devices.
# It was originally developed by researchers and engineers from the Google Brain team within Google's AI organization.
!pip install -q -U tensorflow


# 'einops' is a Python library that provides flexible and powerful tensor operations for readable and reliable code.
# It supports various frameworks such as numpy, pytorch, tensorflow, jax, and others.
# It allows you to rearrange, reduce, and repeat elements in tensors according to specified patterns.
# This makes it easier to manipulate tensors in a more readable and reliable way.
!pip install -q -U einops

# 'sentencepiece' is a Python library for unsupervised text tokenization.
# It provides an API for encoding, decoding, and training of Sentencepiece models.
# It's particularly useful for Neural Network-based text generation.
!pip install sentencepiece -q

# The 'bitsandbytes' library is a lightweight wrapper around custom CUDA functions,
# particularly 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
# It is used for tasks like 8-bit inference with Hugging Face Transformers, using 8-bit optimizers,
# and replacing certain layers with 8-bit versions to reduce memory use.
!pip install bitsandbytes -q

# PEFT (Parameter-Efficient Fine-Tuning) is a Hugging Face library that makes fine-tuning large language models (LLMs) more efficient and memory-friendly.
# Key features of PEFT:
# Parameter-efficient fine-tuning: it trains only a small number of (often newly added) parameters and keeps most of the base model frozen, which significantly reduces memory usage and training time.
# Multiple methods: it implements techniques such as LoRA, prefix tuning, and prompt tuning.
# Small checkpoints: only the adapter weights need to be saved, not a full copy of the model.
# Compatibility with Transformers: it integrates with the Transformers library, so it can be used with many pre-trained language models.
!pip install peft -q

# trl is short for Transformer Reinforcement Learning. It provides trainers for fine-tuning transformer models, including supervised fine-tuning (SFTTrainer) and Proximal Policy Optimization (PPO).
!pip install trl -q

# The 'xformers' library provides customizable and optimized building blocks for Transformers.
# It is domain-agnostic and used by researchers in various fields like vision, NLP, etc.
# The library contains bleeding-edge components that are not yet available in mainstream libraries like PyTorch.
# It is built with efficiency in mind, containing its own CUDA kernels, but dispatches to other libraries when relevant.
# !pip install xformers

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done
4.37.0.dev0


### Check library versions

In [None]:
#print the version of transformers
import transformers
print(f"version of transformers: {transformers.__version__}")

# print the version of the accelerate library
import accelerate
print(f"version of accelerate: {accelerate.__version__}")

# print the version of the tensorflow library
import tensorflow as tf
print(f"version of tensorflow: {tf.__version__}")

### Restart the runtime

In [None]:
# restart the runtime
import os
os.kill(os.getpid(), 9)

### Import libraries

In [None]:
# Import the torch library. PyTorch is an open source machine learning library based on the Torch library.
# It's used for applications such as computer vision and natural language processing.
import torch

# Import the nn module from the torch library. nn stands for neural networks.
# This module contains various building blocks for creating neural networks.
import torch.nn as nn

# Import specific classes from the transformers library. The transformers library is a state-of-the-art Natural Language Processing library for TensorFlow 2.0 and PyTorch.
# AutoTokenizer: This class can automatically guess and download the correct tokenizer based on the model’s name you give it (like 'bert-base-uncased').
# AutoConfig: This class can automatically guess and download the correct configuration based on the model’s name you give it (like 'bert-base-uncased').
# AutoModelForCausalLM: This class can automatically guess and download a model for causal language modeling based on the model’s name you give it (like 'gpt2').
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

In [None]:
# torch is a library for processing tensors. A tensor is a generalization of vectors and matrices to potentially higher dimensions.
# bfloat16 is a 16-bit floating-point data type provided by torch.
# It is used for mixed precision training, where some operations use the torch.float32 datatype and others use a lower-precision floating-point datatype such as torch.bfloat16.
from torch import bfloat16

# transformers is a library that provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, etc.) for Natural Language Understanding (NLU) and Natural Language Generation (NLG).
# The pipeline() function is a high-level, easy-to-use API for doing inference over a variety of downstream tasks, including Named Entity Recognition (NER), Masked Language Modeling (MLM), Sentiment Analysis, Feature Extraction and Question Answering.
from transformers import pipeline
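
To make the bfloat16 format mentioned in the cell above more concrete, here is a minimal sketch that uses only the standard library. It assumes bfloat16 can be approximated by keeping the top 16 bits of a float32 value (1 sign bit, 8 exponent bits, 7 mantissa bits). The helper names are hypothetical, and PyTorch rounds rather than truncates, so its results can differ in the last bit.

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """Return a 16-bit bfloat16 pattern for x by truncating its float32 encoding."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    # Keep the sign bit, the 8 exponent bits and the top 7 mantissa bits
    return bits32 >> 16

def from_bfloat16_bits(bits16: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return x

value = 3.14159
approx = from_bfloat16_bits(to_bfloat16_bits(value))
print(f"{value} -> {approx}")  # 3.14159 -> 3.140625
```

The exponent range matches float32, which is why bfloat16 rarely overflows during training, while the short mantissa leaves only about two to three significant decimal digits.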

### Libraries imported for model inference

In [None]:
# os is a standard Python library that provides functions for interacting with the operating system.
import os

# torch is the main package of PyTorch, an open-source machine learning library for Python.
import torch

# load_dataset is a function from the datasets library by Hugging Face. It allows you to load and preprocess datasets for machine learning models.
from datasets import load_dataset

# The transformers library is a popular library for Natural Language Processing (NLP). It provides thousands of pre-trained models to perform tasks on texts such as classification, information extraction, summarization, translation, and more.
from transformers import (
    # AutoModelForCausalLM is a class in the transformers library. It represents a model for causal language modeling.
    AutoModelForCausalLM,

    # AutoTokenizer is a class in the transformers library. It is used for converting input data into a format that can be used by the model.
    AutoTokenizer,

    # BitsAndBytesConfig is a configuration class in the transformers library. It holds the bitsandbytes quantization settings (8-bit or 4-bit) used when loading a model.
    BitsAndBytesConfig,

    # HfArgumentParser is a class in the transformers library. It is used for parsing command-line arguments.
    HfArgumentParser,

    # TrainingArguments is a class in the transformers library. It defines the arguments used during training.
    TrainingArguments,

    # pipeline is a high-level function in the transformers library. It creates a pipeline that applies a model to some input data.
    pipeline,

    # logging is a module in the transformers library. It is used for logging events during training and evaluation.
    logging,
)

# used for Parameter-Efficient Fine-Tuning
from peft import LoraConfig, PeftModel

# trl is short for Transformer Reinforcement Learning. SFTTrainer is its trainer for supervised fine-tuning of transformer models.
from trl import SFTTrainer


### Create cache directory for Hugging Face Models

In [None]:
# Set the cache directory to a specific path in your Google Drive.
# This is where Hugging Face models will be cached.
cache_dir = "/content/drive/MyDrive/huggingface_cache"

# The os.makedirs() method in Python is used to create directories recursively.
# The exist_ok=True parameter prevents an error if the directory already exists.
os.makedirs(cache_dir, exist_ok=True)

## Model Loading

### Load a Pre-Trained Model

In [None]:
# Define the name of the model to be used. This is a string that corresponds to a specific pre-trained model.
# 'TinyLlama/TinyLlama-1.1B-Chat-v0.3' is the model used here; the intermediate-step checkpoint is kept as a commented-out alternative.
# model_name = 'TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'
model_name = 'TinyLlama/TinyLlama-1.1B-Chat-v0.3'

# Load the pre-trained model using the AutoModelForCausalLM class from the transformers library.
# The from_pretrained() method is used to load the model.
# No device_map is passed below, so the model is loaded onto the CPU by default. If you have a GPU available, you could pass device_map='cuda' (or 'auto') to load it onto the GPU instead.
model = AutoModelForCausalLM.from_pretrained(
    model_name, # specifies which pre-trained model to load
    trust_remote_code=True, # allows the execution of remote code. Be careful with this setting as it can be a security risk.
    torch_dtype=torch.bfloat16, #sets the data type for the model's parameters to bfloat16.
    offload_folder='offload', #specifies the folder where offloaded parameters will be stored.
    cache_dir=cache_dir # sets the directory where the pre-trained model will be cached. This can help speed up future model loading times.
)
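
The effect of `torch_dtype` on memory can be estimated with simple arithmetic. The sketch below assumes roughly 1.1 billion parameters (taken from the model name) and counts the weights only; activations, the KV cache, and framework overhead are ignored. The helper name and the dtype table are illustrative, not part of any library.

```python
# Approximate storage cost of one parameter for some common precisions
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "4-bit": 0.5}

def estimate_weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate the memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 1.1e9  # assumed parameter count, based on the "1.1B" in the model name
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {estimate_weight_memory_gb(num_params, dtype):.2f} GB")
```

Under these assumptions, loading in bfloat16 needs about half the memory of float32 for the same model.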

### Load a Pre-Trained Model from a local directory

In [None]:
# Zephyr with BitsAndBytes Configuration

# bnb_config is not defined anywhere else in this template, so it is created here.
# The 4-bit NF4 settings below are assumed values; adjust them to your own setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # load the weights in 4-bit precision
    bnb_4bit_quant_type='nf4',  # use the NormalFloat4 data type
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16  # data type used for computation
)

# The AutoTokenizer class is used to load the tokenizer associated with the "HuggingFaceH4/zephyr-7b-beta" model.
# A tokenizer is responsible for preparing the inputs for a model. This includes converting input text into tokens, which are numerical representations that the model can understand.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# The AutoModelForCausalLM class is used to load the "HuggingFaceH4/zephyr-7b-beta" model with the specified quantization configuration.
# AutoModelForCausalLM is a class that includes automatic model architecture detection. It's used for tasks that involve causal language modeling (predicting the next word in a sentence).
# The 'quantization_config' parameter is set to 'bnb_config', the BitsAndBytesConfig instance created above.
# The 'device_map' parameter is set to 'auto', which means the model will automatically use the GPU if it's available, and fall back to CPU otherwise.
# 'local_files_only=True' means the model is read from cache_dir and is never downloaded.
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    quantization_config=bnb_config,
    device_map='auto',
    cache_dir=cache_dir,
    local_files_only=True
)
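
The quantization configuration passed to the model trades numerical precision for memory. As a rough illustration of that idea, here is a minimal absmax int8 quantization sketch in plain Python. It is a simplified stand-in, not how bitsandbytes implements its 8-bit or 4-bit schemes, and the function names are hypothetical.

```python
def quantize_absmax_int8(values: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax else 1.0
    return [round(v / scale) for v in values], scale

def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.01, 1.0]
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)
print([round(r, 4) for r in restored])
```

Each weight is stored as one small integer plus a shared scale, so the restored values are close to, but not exactly equal to, the originals.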

### Create a pipeline

In [None]:
# Create a pipeline
# The pipeline function is a high-level, easy to use, API for doing inference over a variety of downstream-tasks, including text generation.
# It abstracts away the underlying details and allows you to use models like BERT, GPT-2 or RoBERTa with a simple API.
# The model and tokenizer are passed as arguments to the function.
# The model is a pre-trained model and the tokenizer is used to convert the input text into a format that is understandable by the model.
# The task argument specifies the task to be performed. In this case, it's 'text-generation'.
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
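
The tokenizer step described above, turning text into numbers the model can consume, can be pictured with a toy example. This sketch assumes a simple whitespace vocabulary; real tokenizers such as the ones loaded by `AutoTokenizer` use subword algorithms, and every name below is hypothetical.

```python
def build_vocab(corpus: str) -> dict[str, int]:
    """Assign an integer ID to every distinct word; ID 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for word in corpus.lower().split():
        vocab.setdefault(word, len(vocab))
    return vocab

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Convert text into a list of token IDs."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Convert token IDs back into text."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

vocab = build_vocab("tell me a story about a man from Ireland")
ids = encode("Tell me a story about dragons", vocab)
print(ids)                 # [1, 2, 3, 4, 5, 0]
print(decode(ids, vocab))  # tell me a story about <unk>
```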

### Linux Commands



#### Create a directory

In [None]:
import os

# The os.makedirs() method in Python is used to create directories recursively.
# The exist_ok=True parameter prevents an error if the directory already exists.
os.makedirs('/content/drive/MyDrive/new_directory', exist_ok=True)

#### Change directory

In [None]:
import os

# change directory
os.chdir(cache_dir)

#### Get current working directory and list files

In [None]:
import os
import glob

# get current working directory and list files
print(f"current directory is: {os.getcwd()}\n")
# print(os.listdir('.'))

# Get a list of all files and directories in the current directory
files = glob.glob('./*')

# Create a list of tuples, each containing the name of the file/directory and its last modification time
files_with_times = [(file, os.path.getmtime(file)) for file in files]

# Sort the list by the modification time (the second element of each tuple)
files_with_times.sort(key=lambda x: x[1])

# Print the sorted list
print("Files in current directory:")
for file, mtime in files_with_times:
    print(f'{file}: {mtime}')
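
The listing above prints modification times as raw epoch seconds. A possible variant that prints readable timestamps is sketched below; it runs against a temporary directory so nothing on Drive is touched, and the function name is hypothetical.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path

def list_by_mtime(directory: str) -> list[tuple[str, str]]:
    """Return (name, 'YYYY-MM-DD HH:MM:SS') pairs sorted from oldest to newest."""
    entries = sorted(Path(directory).iterdir(), key=lambda p: p.stat().st_mtime)
    return [
        (p.name, datetime.fromtimestamp(p.stat().st_mtime).strftime("%Y-%m-%d %H:%M:%S"))
        for p in entries
    ]

# Demonstration in a throwaway directory with two files of known age
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in [("newer.txt", 2_000_000_000), ("older.txt", 1_000_000_000)]:
        path = Path(tmp) / name
        path.write_text("example")
        os.utime(path, (mtime, mtime))  # set access and modification times
    listing = list_by_mtime(tmp)

for name, stamp in listing:
    print(f"{name}: {stamp}")
```

Calling `list_by_mtime('.')` would give the same view for the current working directory.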

### Download a model from HF

In [None]:
from huggingface_hub import snapshot_download

model_id="twhoool02/TinyLlama-1.1B-Chat-v0.3-GGUF"

snapshot_download(repo_id=model_id, local_dir=cache_dir, local_dir_use_symlinks=False, revision="main")

## Push the model to Hugging Face

In [None]:
# Import the HfApi class from the huggingface_hub module
from huggingface_hub import HfApi

# Initialize the HfApi class
api = HfApi()

# set model name
quant_name = model_name.split('/')[-1] + "-GGUF"
print(f"Model name is {quant_name}")

# Define the ID of the repository where the file will be uploaded
# The repository is located under the user 'twhoool02' and named after the quant_name variable
repo_id = "twhoool02/" + quant_name
print(f"Repository ID is {repo_id}")

# Create model repo
api.create_repo(repo_id=repo_id)

In [None]:
# directory from where model files will be uploaded
base_path = "./models"

# List of local file paths you want to upload
# Note: model_name_pure and quant_type are not defined in this template; set them before running this cell
local_file_paths = [
    base_path + "/tokenizer_config.json",
    base_path + "/tokenizer.model",
    base_path + "/tokenizer.json",
    base_path + "/special_tokens_map.json",
    base_path + "/ggml-vocab-llama.gguf",
    base_path + "/" + f'{model_name_pure}.{quant_type}.gguf',
]

In [None]:
#Loop through each file and upload it
for local_file_path in local_file_paths:
    # Extract the file name from the local file path
    file_name = local_file_path.split("/")[-1]

    # Specify the path where you want the file to be uploaded in the repository
    path_in_repo = file_name  # Uses the file name directly; adjust if needed

    # Use the upload_file method of the HfApi class to upload the file
    # The method takes the local file path, the path in the repository, the repository ID, and the repository type as arguments
    api.upload_file(
        path_or_fileobj=local_file_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="model"  # The type of the repository is 'model', could also be "dataset" or "space"
    )

    print(f"Uploaded {file_name} to {repo_id}")
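
Because the upload loop stops with an error at the first path that does not exist, it can help to check the list first. The following is a minimal sketch of such a pre-check using only the standard library; the helper name is hypothetical and the demonstration uses a temporary directory rather than `./models`.

```python
import tempfile
from pathlib import Path

def split_existing(paths: list[str]) -> tuple[list[str], list[str]]:
    """Separate paths into (existing files, missing files)."""
    existing = [p for p in paths if Path(p).is_file()]
    missing = [p for p in paths if not Path(p).is_file()]
    return existing, missing

# Demonstration: one file is present, one is not
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "tokenizer.json"
    present.write_text("{}")
    candidates = [str(present), str(Path(tmp) / "tokenizer.model")]
    existing, missing = split_existing(candidates)

print("upload:", [Path(p).name for p in existing])
print("missing:", [Path(p).name for p in missing])
```

In the notebook, the same check could be applied to `local_file_paths` before the loop, uploading only when the missing list is empty.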

## Add Model Cards

### Add Model cards from template

This is the preferred option

In [None]:
# import model card libraries
from huggingface_hub import ModelCard, ModelCardData

# Import the date class from the datetime module
from datetime import date

# The datasets the model was trained on
datasets = ["cerebras/SlimPajama-672B","bigcode/starcoderdata","OpenAssistant/oasst_top1_2023-08-25"]

# Define the metadata for the model card
card_data = ModelCardData(
    base_model='TinyLlama/TinyLlama-1.1B-Chat-v0.3',
    language='en',  # The language the model was trained on
    license='apache-2.0',  # The license for the model
    library=['Transformers','GGUF'],  # The library used to train the model
    model_name=quant_name,  # The name of the model
    model_type='tinyllama',
    tags=["GGUF", "tinyllama"],  # Tags for the model
    datasets=datasets
)

# Define a description for the model
model_description = (
    "This model is a quantized version of the TinyLlama/TinyLlama-1.1B-Chat-v0.3 model. "
    "The model was quantized using GGUF."
)

# Create a model card from the template
card = ModelCard.from_template(
    card_data,  # The model card data defined earlier
    model_id=quant_name,  # The ID of the model
    model_description=model_description,  # The description of the model
    date=date.today(),  # The date the model card was created
    developers="Ted Whooley"  # The developers of the model
)

# Print the model card
print(card)

### Add custom model card

In [None]:
# Add custom model card
from huggingface_hub import ModelCard, ModelCardData

datasets = ['cerebras/SlimPajama-672B','bigcode/starcoderdata','OpenAssistant/oasst_top1_2023-08-25']

card_data = ModelCardData(language='en', license='apache-2.0', datasets=datasets)

example_template_var = 'nateraw'
content = f"""
---
{ card_data.to_yaml() }
---

# TinyLlama-1.1B-Chat-v0.3-GGUF

This model is a quantized version of the TinyLlama/TinyLlama-1.1B-Chat-v0.3 model. The model was quantized using GGUF.

The TinyLlama project aims to **pretrain** a **1.1B Llama model on 3 trillion tokens**.
"""

card = ModelCard(content)
print(card)

### Upload the model card

In [None]:
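try:
    card.validate()
    card.save('README.md')
    card.push_to_hub(repo_id=repo_id)
except Exception as err:
    # Keep the original error as the cause so the failing step is visible in the traceback
    raise ValueError("Model card info is invalid. Please check.") from err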

## Model Inference

### Create prompt

Note: These prompts may only work with zephyr-7b-beta, the model I was using when following [Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)](https://www.maartengrootendorst.com/blog/quantization/), because in that case the chat template is saved in the underlying tokenizer.

In [None]:
# We use the tokenizer's chat template to format each message
# See https://huggingface.co/docs/transformers/main/en/chat_templating

# 'messages' is a list of dictionaries. Each dictionary represents a message in the conversation.
# Each message has a 'role' (either 'system' or 'user') and 'content' (the text of the message).
# The 'system' role is typically used for instructions that guide the model's behavior.
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot.",
    },
    {
        "role": "user",
        "content": "Tell me a story about a man from Ireland"
    },
]

# The 'apply_chat_template' function formats the messages for the model.
prompt = pipe.tokenizer.apply_chat_template(
    messages,
    tokenize=False, # 'tokenize=False' means the messages will not be tokenized right now.
    add_generation_prompt=True # 'add_generation_prompt=True' adds a special prompt token at the end to signal the model to start generating a response.
)
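
To show what `apply_chat_template` roughly produces, here is a hand-written approximation of a Zephyr-style template. The role markers and the `</s>` end-of-sequence token are assumptions about that model family; the authoritative template is the one stored in the tokenizer, and exact whitespace may differ.

```python
EOS_TOKEN = "</s>"  # assumed end-of-sequence token

def format_zephyr_style(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages as one prompt string, one tagged block per message."""
    parts = [f"<|{m['role']}|>\n{m['content']}{EOS_TOKEN}\n" for m in messages]
    if add_generation_prompt:
        # An open assistant block tells the model that it should answer next
        parts.append("<|assistant|>\n")
    return "".join(parts)

example_messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a story about a man from Ireland"},
]
example_prompt = format_zephyr_style(example_messages)
print(example_prompt)
```

This also illustrates why the template matters: a model fine-tuned on one tag layout may respond poorly to prompts formatted with another.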

### Pass Prompt to LLM

In [None]:
# The 'pipe' function is called with several arguments:
outputs = pipe(
    prompt, # 'prompt' is the input text that the model will respond to.
    max_new_tokens=256, # 'max_new_tokens=256' limits the length of the generated text to 256 tokens.
    do_sample=True, # 'do_sample=True' means the model will generate text by sampling from its output distribution.
    temperature=0.1, # controls the randomness of the sampling process. Lower values (like 0.1) make the output more deterministic, while higher values make it more random.
    top_p=0.95 # 'top_p=0.95' implements nucleus sampling, where the model only considers the smallest set of tokens whose cumulative probability exceeds 0.95.
)

# The 'outputs' variable contains the generated text.
# 'outputs[0]["generated_text"]' extracts the generated text from the first (and in this case, only) output.
# The generated text is then printed to the console.
print(outputs[0]["generated_text"])
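
The interaction between `temperature` and `top_p` can be sketched in plain Python. This is a simplified illustration over a three-token vocabulary with made-up logits, not the implementation used by transformers; the function names are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; lower temperatures sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalised nucleus

def sample_token(logits: list[float], temperature: float, top_p: float, rng: random.Random) -> int:
    """Sample one token ID from the temperature-scaled, top-p filtered distribution."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    token_ids = list(nucleus)
    return rng.choices(token_ids, weights=[nucleus[t] for t in token_ids], k=1)[0]

logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
cold = top_p_filter(softmax_with_temperature(logits, 0.1), 0.95)
warm = top_p_filter(softmax_with_temperature(logits, 1.0), 0.95)
print(sorted(cold))  # [0]
print(sorted(warm))  # [0, 1, 2]
print(sample_token(logits, 0.1, 0.95, random.Random(0)))  # 0
```

With `temperature=0.1` almost all probability mass lands on the top token, so the nucleus contains a single candidate and sampling is effectively deterministic, which matches the behaviour described in the comments above.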

## Cleanup

### Delete directories

The option to delete the directory has been removed from the code below because it is too dangerous.

In [None]:
import os
import shutil

# Specify the directory you want to delete
directory = cache_dir

# Check if the directory exists
if os.path.exists(directory):
    # shutil.rmtree is commented out on purpose; uncomment it to actually delete the directory
    # shutil.rmtree(directory)
    print(f"The directory {directory} exists; uncomment shutil.rmtree above to delete it.")
else:
    print("The directory does not exist.")


### Delete Repo

This is commented out; uncomment it as needed.

In [None]:
# # delete the repo that was created with this notebook
# from huggingface_hub import delete_repo

# delete_repo(repo_id=repo_id, repo_type="model")

## Troubleshooting

### Check memory footprint

In [None]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU

In [None]:
GPUs = GPU.getGPUs()
# XXX: Colab provides at most one GPU, and it isn't guaranteed
gpu = GPUs[0] if GPUs else None

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    if gpu is None:
        print("No GPU detected")
        return
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()

### Check processes running

In [None]:
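import subprocess

# List the processes that have the GPU device file open.
# fuser prints the PIDs on stdout; the verbose (-v) details go to stderr.
result = subprocess.run(['fuser', '/dev/nvidia0', '-v'], stdout=subprocess.PIPE, text=True)

# Extract the process IDs from the decoded output
process_ids = [int(i) for i in result.stdout.split() if i.isdigit()]

for process_id in process_ids:
    # Show the ps entry for each process
    pid_info = subprocess.run(['ps', '-p', str(process_id)], stdout=subprocess.PIPE, text=True)
    print(pid_info.stdout)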

### Get memory stats

In [None]:
# The 'torch.cuda.memory_stats()' function returns a dictionary containing detailed CUDA memory stats.
# This includes information about the amount of memory allocated and reserved on the GPU
torch.cuda.memory_stats()

### Empty the CUDA cache

In [None]:
# The 'torch.cuda.empty_cache()' function releases all unoccupied cached memory
# currently held by the caching allocator so that it can be used by other
# GPU applications and becomes visible in nvidia-smi. It does not release the GPU
# memory occupied by tensors, so memory that live GPU tensors are using
# is not freed.
torch.cuda.empty_cache()


## Tricks

### Check GPU compatibility with bfloat16

In [None]:
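# Check GPU compatibility with bfloat16
# compute_dtype and use_4bit are not defined in this template; set them before running this cell
# (for example, compute_dtype = torch.float16 and use_4bit = True are assumed values).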
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)