In [None]:
from google.colab import drive
drive.mount('/content/drive/')

%cd /content/drive/MyDrive/apziva-residency-projects/llm-prompting_potential-talents/

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
/content/drive/MyDrive/apziva-residency-projects/llm-prompting_potential-talents


In [None]:
# install libraries
!pip3 install git+https://github.com/huggingface/transformers #upgrade the library just in case
!pip install accelerate
!pip install sentencepiece
!pip install bitsandbytes

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-kiwuwhdk
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-kiwuwhdk
  Resolved https://github.com/huggingface/transformers to commit a5c642fe7a1f25d3bdcd76991443ba6ff7ee34b2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.43.0.dev0-py3-none-any.whl size=9354104 sha256=44e3bed7bc3d8fa743c2bef97a9c0605f4990fad9f6d5d48d25e63d0d8209fba
  Stored in directory: /tmp/pip-ephem-wheel-cache-mxa_m7_9/wheels/c0/14/d6/6c9a5582d2ac191ec0a483be151a4495fe1eb2a6706ca49f1b
Successfully built transformers

In [None]:
# granting access to the huggingface LLM repository
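import os
from huggingface_hub import login

# read the access token from the environment instead of hard-coding it in the notebook
# (assumes HF_TOKEN was set beforehand)
token = os.environ["HF_TOKEN"]
login(token=token)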

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
# libraries
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM, pipeline, LlamaTokenizer
import torch

# Introduction

This notebook is a chance to get accustomed to making sensible use of LLMs via prompting. I will be working with the dataset I previously used for the Apziva project "Potential talents" ([github](https://github.com/robpetrosino/c0vEM5oxUa6ndKp8)).

The goal of the project was to streamline the first selection round of potential candidates by ranking their fit based on the semantic similarity between their job title and one or more specific keywords such as “full-stack software engineer”, “engineering manager”, or “aspiring human resources”. Here, I will bypass the under-the-hood coding and ask LLMs to do the same task by providing a *viable* prompt.

The core of the notebook focuses on prompt viability (often referred to as _prompt engineering_), i.e. the design of prompts that elicit the best possible output. Prompt engineering is an iterative process that requires a fair amount of experimentation.

## Making use of pre-trained Large Language Models from the 🤗 Hub

Large Language Models (LLMs) such as Mistral, LLaMA, etc. are pretrained transformer models trained to predict the next token given some input text. They typically have billions of parameters and have been trained on trillions of tokens. As a result, these models are quite powerful and versatile, and you can use them to solve multiple NLP tasks out of the box by instructing them with natural language prompts.

Most of the recent LLM checkpoints available on the 🤗 Hub come in two versions: *base* and *instruct* (or chat). Base models are excellent at completing text when given an initial prompt. However, they are not ideal for NLP tasks where they need to follow instructions, or for conversational use. This is where the instruct (chat) versions come in. These checkpoints are the result of further fine-tuning the pre-trained base versions on instructions and conversational data.


# Mistral-7B-Instruct-v0.3

In [None]:
mistral = "mistralai/Mistral-7B-Instruct-v0.3"

mistral_model = AutoModelForCausalLM.from_pretrained(mistral, device_map="auto")
mistral_tokenizer = AutoTokenizer.from_pretrained(mistral, padding_side='left', token=token)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/137k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [None]:
import pandas as pd

jobs_df = pd.read_csv("../data/raw/potential-talents_aspiring-humanresources_seeking-human-resources.csv")
jobs = list(jobs_df.job_title)
search_term = 'aspiring human resources'

# don't forget the f specification before the prompt string to enable in-string variable reference!!
prompt = f"""

You will be provided with the list called {jobs} and the single string called {search_term}.
For each string contained in {jobs}, follow the steps below:

1. Tokenize the string.
2. Convert it into a vector and call it job_title.
3. Tokenize the string {search_term}.
4. Convert it into a vector and call it search_term.
5. Calculate the cosine similarity between search_term and job_title.
6. Round the cosine similarity value to 2 decimal digits.

Your response must stick to the following format: "Job title of the candidate: job. Similarity with the search term: cosine similarity value."

Do not add any explanation, note, comment, reference, or breakdown in your response.
Before providing the response, sort each line by cosine similarity value in descending order.
"""

pipe = pipeline('text-generation',
                model = mistral_model,
                tokenizer = mistral_tokenizer,
                torch_dtype=torch.bfloat16,
                do_sample=True,
                return_full_text = False
                )

response = pipe(prompt, max_new_tokens=3161)
print(response[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Job title of the candidate: SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR. Similarity with the search term: 0.65

Job title of the candidate: HR Senior Specialist. Similarity with the search term: 0.63

Job title of the candidate: People Development Coordinator at Ryan. Similarity with the search term: 0.62

Job title of the candidate: Seeking Human Resources HRIS and Generalist Positions. Similarity with the search term: 0.61

Job title of the candidate: Aspiring Human Resources Professional. Similarity with the search term: 0.56

Job title of the candidate: Student at Humber College and Aspiring Human Resources Generalist. Similarity with the search term: 0.56

Job title of the candidate: Advisor Board Member at Celal Bayar University. Similarity with the search term: 0.55

Job title of the candidate: HR Manager at Endemol Shine North America. Similarity with the search term: 0.54

Job title of the candidate: Director of 
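
For reference, the six steps listed in the prompt correspond to a computation like the one sketched below. This is only a minimal bag-of-words version: the prompt does not say how strings should be turned into vectors, so the word-count vectorization, the helper names (`tokenize`, `cosine_similarity`, `rank_titles`), and the three sample titles (taken from the output above) are assumptions made for illustration. The numbers in a generated response are produced as text, so a deterministic baseline of this kind is one way to sanity-check them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)


def rank_titles(titles, search_term):
    """Return (title, rounded similarity) pairs, most similar first."""
    scored = [(t, round(cosine_similarity(t, search_term), 2)) for t in titles]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Sample titles copied from the model output above (not the full dataset).
titles = [
    "Aspiring Human Resources Professional",
    "HR Senior Specialist",
    "Seeking Human Resources HRIS and Generalist Positions",
]
for title, score in rank_titles(titles, "aspiring human resources"):
    print(f"Job title of the candidate: {title}. Similarity with the search term: {score}")
```

With word counts, a title only scores above zero when it shares literal words with the search term, so "HR Senior Specialist" gets 0.0 here. A semantic embedding would likely score it higher, which is the kind of difference to keep in mind when comparing against an LLM's answer.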

# Llama

In [None]:
from transformers import LlamaForCausalLM, LlamaTokenizer

llama = "meta-llama/Meta-Llama-3-8B-Instruct"

llama_model = AutoModelForCausalLM.from_pretrained(llama, device_map="auto")
llama_tokenizer = AutoTokenizer.from_pretrained(llama)


config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/51.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
import pandas as pd

jobs_df = pd.read_csv("../data/raw/potential-talents_aspiring-humanresources_seeking-human-resources.csv")
jobs = list(jobs_df.job_title)
search_term = 'aspiring human resources'

# don't forget the f specification before the prompt string to enable in-string variable reference!!
prompt = f"""

You will be provided with the list called {jobs} and the single string called {search_term}.
For each string contained in {jobs}, follow the steps below:

1. Tokenize the string.
2. Convert it into a vector and call it job_title.
3. Tokenize the string {search_term}.
4. Convert it into a vector and call it search_term.
5. Calculate the cosine similarity between search_term and job_title.
6. Round the cosine similarity value to 2 decimal digits.

Your response must stick to the following format: "Job title of the candidate: job. Similarity with the search term: cosine similarity value."

Do not add any explanation, note, comment, reference, or breakdown in your response.
Before providing the response, sort each line by cosine similarity value in descending order.
"""

pipe_llama = pipeline('text-generation',
                model = llama_model,
                tokenizer = llama_tokenizer,
                torch_dtype = torch.bfloat16,
                do_sample = True,
                return_full_text = False,
                batch_size=1
                )

response_llama = pipe_llama(prompt, max_new_tokens=3161)
print(response_llama[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 
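
The notebook installs `bitsandbytes` and imports `BitsAndBytesConfig`, but neither is used in the cells above. One option that might reduce the memory footprint is to load the weights in 4-bit precision. The sketch below shows what that could look like. It has not been run in this notebook, so whether it is enough to fit this model and this prompt on a smaller GPU is an open question. The function name `load_quantized` is hypothetical, and the call is left commented out because it needs a CUDA runtime and Hub access.

```python
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"


def load_quantized(model_id=MODEL_ID):
    """Load a causal LM with 4-bit weights; needs a CUDA GPU and bitsandbytes."""
    # Imports are kept inside the function so the cell can be defined
    # even on a runtime where these packages are not installed yet.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store the weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # run the matrix multiplications in bfloat16
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer


# Not called here: uncomment on a GPU runtime after logging in to the Hub.
# llama_model, llama_tokenizer = load_quantized()
```

Quantization trades some output quality for memory, so results from a 4-bit model may not be directly comparable with the full-precision Mistral run.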

# Phi-3-mini-4k-instruct

In [None]:
llm = "microsoft/Phi-3-mini-4k-instruct"

llm_model = AutoModelForCausalLM.from_pretrained(llm, device_map="auto")
llm_tokenizer = AutoTokenizer.from_pretrained(llm, padding_side='left', token=token)


config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/16.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
import pandas as pd

jobs_df = pd.read_csv("../data/raw/potential-talents_aspiring-humanresources_seeking-human-resources.csv")
jobs = list(jobs_df.job_title)
search_term = 'aspiring human resources'

prompt_phi = [
    {"role": "system", "content": "You are a language model that helps to calculate the cosine similarity between two strings."},
    {"role": "user", "content":
# don't forget the f specification before the prompt string to enable in-string variable reference!!
f"""You will be provided with the list {jobs} and the single string {search_term}. For each string contained in {jobs}, follow the steps below:

1. Tokenize the string.
2. Convert it into a vector and call it job_title.
3. Tokenize the string {search_term}.
4. Convert it into a vector and call it search_term.
5. Calculate the cosine similarity between search_term and job_title.
6. Round the cosine similarity value to 2 decimal digits.

Your response must stick to the following format: "Job title of the candidate: job. Similarity with the search term: cosine similarity value."

Do not add any explanation, note, comment, reference, or breakdown in your response.
Before providing the response, sort each line by cosine similarity value in descending order.
"""},
]

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
     "do_sample": True
}

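# use the model and tokenizer loaded in the previous cell (llm_model, llm_tokenizer)
pipe_phi = pipeline('text-generation',
                model = llm_model,
                tokenizer = llm_tokenizer,
                )

# call the Phi pipeline defined just above, not the Mistral one
response_phi = pipe_phi(prompt_phi, **generation_args)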
print(response_phi[0]['generated_text'])

You are not running the flash-attention implementation, expect numerical differences.


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.31 GiB. GPU 
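
In the prompts used so far, `{jobs}` is interpolated twice, so the whole list of job titles appears twice in the input. Longer inputs need more memory during generation, so one variant that may be worth trying lists each title exactly once. The sketch below is a hypothetical rewrite: the helper names `build_prompt` and `build_messages`, the simplified wording of the instructions, and the two sample titles are all assumptions, and the prompt has not been evaluated on any of the models.

```python
def build_prompt(jobs, search_term):
    """Build the ranking prompt with each job title listed exactly once."""
    numbered = "\n".join(f"{i}. {title}" for i, title in enumerate(jobs, start=1))
    return (
        f'Search term: "{search_term}"\n\n'
        f"Job titles:\n{numbered}\n\n"
        "For each job title above, estimate its similarity with the search term "
        "as a value between 0 and 1, rounded to 2 decimal digits.\n"
        'Your response must stick to the following format: "Job title of the '
        'candidate: job. Similarity with the search term: value."\n'
        "Sort the lines by similarity in descending order and do not add any explanation."
    )


def build_messages(jobs, search_term):
    """Wrap the prompt in the chat-message format used by the Phi-3 cell."""
    return [
        {"role": "system", "content": "You rank job titles by similarity to a search term."},
        {"role": "user", "content": build_prompt(jobs, search_term)},
    ]


# Hypothetical sample; in the notebook `jobs` comes from jobs_df.job_title.
sample_jobs = ["Aspiring Human Resources Professional", "HR Senior Specialist"]
messages = build_messages(sample_jobs, "aspiring human resources")
print(messages[1]["content"])
```

The resulting `messages` list has the same shape as `prompt_phi`, so it could be passed to the pipeline in the same way.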

# Concluding remarks

Mistral-7B-Instruct-v0.3 was able to perform the complex task required. The prompt had to be thorough and wordy, but crafting it did not pose any particular problem during the engineering stage.

In my experience, the major issue was not inherent to LLMs or prompt engineering. Rather, I had substantial trouble with the computational setup. Locally, I kept running into errors that did not seem to have a workable solution (e.g., the [infamous](https://github.com/Vaibhavs10/insanely-fast-whisper/issues/219) `aten::isin.Tensor_Tensor_out` issue).

On the cloud, I couldn't get past memory issues until I purchased compute units on Google Colab. Even after that, I was not able to prompt multiple models in the same run: I had to restart the kernel and change the runtime type each time. An A100 GPU was used to run Mistral, but it could not be reused for the other two models (Llama and Phi). For those, I tried all the other options (CPU, L4/T4 GPU, TPU), but, as the errors above show, none of them had enough memory to sustain the needs of the models, even after reducing the batch size.

From my perspective, this represents a major issue in the field. Given the persistent computing limitations of current standard machines, online computing may be the only viable solution for amateurs, beginners, and independent researchers. However, the online computing services out there (such as Google Colab or AWS) do not seem to provide a sizeable amount of memory for such purposes, even with their paid subscription options.
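
A rough back-of-the-envelope estimate may help put these memory errors in context. The sketch below sums the shard sizes shown in the download logs of each model and scales them to other precisions. It rests on two assumptions that are not verified in this notebook: that the checkpoints store 16-bit weights (2 bytes per parameter), and that `from_pretrained` loads them in 32-bit when no `torch_dtype` is passed, as is the case in the cells above. It counts weights only, so the memory needed for the prompt and for generation comes on top.

```python
# Shard sizes in GB, read off the download logs of each model.
SHARDS_GB = {
    "Mistral-7B-Instruct-v0.3": [4.95, 5.00, 4.55],
    "Meta-Llama-3-8B-Instruct": [4.98, 5.00, 4.92, 1.17],
    "Phi-3-mini-4k-instruct": [4.97, 2.67],
}

# Assumption: the checkpoints store 16-bit weights (2 bytes per parameter).
BYTES_PER_PARAM_ON_DISK = 2


def weight_footprint_gb(shards_gb, bytes_per_param):
    """Scale the on-disk size to the size of the weights at another precision."""
    return sum(shards_gb) * bytes_per_param / BYTES_PER_PARAM_ON_DISK


for name, shards in SHARDS_GB.items():
    fp32 = weight_footprint_gb(shards, 4)    # 32-bit floats
    bf16 = weight_footprint_gb(shards, 2)    # 16-bit floats
    int4 = weight_footprint_gb(shards, 0.5)  # 4-bit quantized weights
    print(f"{name}: fp32 ~{fp32:.1f} GB, bf16 ~{bf16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

Under these assumptions the 32-bit weights alone come to roughly 29 GB for Mistral, 32 GB for Llama, and 15 GB for Phi, which is at or above the 16 GB of a T4 and, for the two larger models, above the 24 GB of an L4. If that is what happened here, passing `torch_dtype=torch.bfloat16` to `from_pretrained` (rather than to `pipeline`) might roughly halve the footprint, though this was not tested in the notebook.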