
Notes from Feb. 27, 2025
1. Try different models (to get exposure to popular LLMs):
- the Phi model from Microsoft,
- the Qwen model (hosted on Hugging Face), which can be loaded with the Transformers library.
! Stay on the compute version when using Google Colab. (Until next time, try the Instruct base model!)

2. Another thing to try:
- LangChain, an LLM framework. The model can be loaded through Hugging Face Spaces, but those are expensive (I may not be able to use a very large version without a high-priced subscription; check on it).

Next steps: fine-tuning those LLMs, then RAG.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Necessary Package Installation

In [2]:
!pip install langchain transformers accelerate scipy



In [3]:
!pip install langchain-community



### Open CSV

In [4]:
# open data
import pandas as pd
import os

df = pd.read_csv('/content/drive/MyDrive/Apziva/3rd_PotentialTalents/data.csv')
df_copy = df.copy()
df_copy

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,
...,...,...,...,...,...
99,100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,
101,102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,
102,103,Always set them up for Success,Greater Los Angeles Area,500+,


### Qwen

In [5]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) yes
Token is valid (permission: fin

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

from huggingface_hub import login
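# HuggingFacePipeline lives in langchain-community (installed above); langchain.llms is the deprecated path
from langchain_community.llms import HuggingFacePipeline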
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

In [7]:
# Qwen model
model_name = "Qwen/Qwen2-1.5B-Instruct"      # change this to the model you want
# I should use another version of Llama (not a chat version, not to complete a certain task!)


# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)

# Wrap the already-loaded model for LangChain
# (HuggingFacePipeline.from_model_id would download and load a second copy)
from transformers import pipeline
text_gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=text_gen)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cuda:0


In [8]:
def clean_connection_value(value):
    """Convert a raw `connection` entry to an integer count."""
    if isinstance(value, str) and "500+" in value:
        return 500          # Convert "500+" to integer 500
    try:
        return int(value)   # Convert valid numbers
    except (ValueError, TypeError):
        return 0            # Default to 0 if conversion fails (e.g. NaN, None, or free text)

df_copy["connection"] = df_copy["connection"].apply(clean_connection_value)

In [9]:
def generate_prompt(search_term, job_titles):
    # get_fitness_scores_qwen parses one "title: score" pair per line,
    # so the prompt states that output format explicitly.
    prompt = f"""
    Given the job titles of candidates and the search term,
    determine how well each candidate fits based on similarity to the search term.
    Assign a fitness score between 0 and 1 based on their cosine similarity.
    Return the candidates' job titles in descending order of fitness score,
    one per line, in the format: <job title>: <score>

    Search term = {search_term}
    Job titles = {job_titles}
    """
    return prompt
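
The prompt asks for a score "based on cosine similarity", but a language model only estimates such a number from the text; it does not compute one. For comparison, here is a minimal sketch of a literal cosine similarity between the search term and a job title, assuming a plain bag-of-words representation. The tokenizer and the names `tokenize` and `cosine_similarity_bow` are assumptions made for illustration and are not part of this notebook.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity_bow(a, b):
    """Cosine similarity between two strings using raw word counts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the words the two strings share
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty string has no direction, so define similarity as 0
    return dot / (norm_a * norm_b)


search_term = "Aspiring Human Resources"
titles = [
    "Aspiring Human Resources Professional",
    "People Development Coordinator at Ryan",
]
scores = {t: cosine_similarity_bow(search_term, t) for t in titles}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A word-count baseline like this ignores synonyms (for example "HR" versus "Human Resources"), which is one reason to try an LLM or embeddings instead, but it gives a deterministic reference to compare the model's scores against.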

In [12]:
def get_fitness_scores_qwen(job_titles, search_term):
    """
    Asks the Qwen model for a fitness score per job title and parses
    "title: score" lines from its response.
    """
    prompt = generate_prompt(search_term, job_titles)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        # max_new_tokens bounds only the generated text; max_length would also count the prompt
        output = model.generate(**inputs, max_new_tokens=1500)

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    response_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    # Extract job titles and scores
    ranked_candidates = []
    for line in response_text.split("\n"):
        parts = line.strip().rsplit(":", 1)  # Split on the last colon; titles may contain ":"
        if len(parts) == 2:
            try:
                title, score = parts[0].strip(), float(parts[1].strip())
                ranked_candidates.append((title, max(0, min(1, score))))  # Ensure scores are within [0,1] range
            except ValueError:
                continue  # Skip invalid lines

    return ranked_candidates
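
The parsing loop above depends on the model answering with one `title: score` pair per line, which a small instruct model will not always do (it may add numbering, bullets, or an introductory sentence). The following is a sketch of a more tolerant parser under that same line-format assumption. The name `parse_scored_lines`, the regular expression, and the sample response are all hypothetical and only illustrate the idea.

```python
import re

# "<title>: <number>" at the end of a line, with optional "1." / "1)" / "-" / "*" in front
LINE_RE = re.compile(
    r"^\s*(?:\d+[.)]\s*|[-*]\s*)?(?P<title>.+?)\s*:\s*(?P<score>\d*\.?\d+)\s*$"
)


def parse_scored_lines(response_text):
    """Return (title, score) pairs from lines shaped like 'title: 0.87'."""
    results = []
    for line in response_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # Skip chatter, blank lines, and anything without a trailing number
        score = float(match.group("score"))
        results.append((match.group("title"), max(0.0, min(1.0, score))))  # Clamp to [0, 1]
    return results


# Hypothetical model output, written by hand for this example
sample = """Here are the ranked candidates:
1. Aspiring Human Resources Professional: 0.95
- HR Manager: Talent Acquisition: 0.7
People Development Coordinator at Ryan: 1.4
"""
parsed = parse_scored_lines(sample)
```

Even with a tolerant parser, the returned titles still have to match the original `job_title` strings exactly for the later `.map()` step to assign scores, so asking the model to echo an ID instead of the full title may be worth trying.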

In [13]:
search_term = "Aspiring Human Resources"  # Updated search term
job_titles_list = df_copy["job_title"].tolist()

# Call the Qwen model (process all data at once)
ranked_results = get_fitness_scores_qwen(job_titles_list, search_term)

# Map model results back to the dataset (returned titles must match the originals exactly)
fit_scores = {title: score for title, score in ranked_results}
df_copy["fit"] = df_copy["job_title"].map(fit_scores)  # Assign scores based on job titles

# Fix Missing Values & Sort by Fit Score
df_copy["fit"] = df_copy["fit"].fillna(0)
df_sorted = df_copy.sort_values(by="fit", ascending=False)
df_sorted.head(5)

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,0.0
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500,0.0
76,77,Human Resources|\nConflict Management|\nPolici...,Dallas/Fort Worth Area,409,0.0
75,76,Aspiring Human Resources Professional | Passio...,"New York, New York",212,0.0
74,75,"Nortia Staffing is seeking Human Resources, Pa...","San Jose, California",500,0.0
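
Every row above has a fit of 0.0, meaning no returned title matched a `job_title`. One possible contributor is that all job titles are sent in a single prompt, which is long for a 1.5B model. As a sketch of an alternative, the titles could be scored in smaller batches and the results merged. Here `chunked`, `score_in_batches`, the batch size, and the stand-in `fake_scorer` are assumptions for illustration; in the notebook, `get_fitness_scores_qwen` would be passed as `score_fn`.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_in_batches(job_titles, search_term, score_fn, batch_size=20):
    """Call score_fn(batch, search_term) per batch and merge the (title, score) pairs."""
    merged = {}
    for batch in chunked(job_titles, batch_size):
        for title, score in score_fn(batch, search_term):
            merged[title] = score
    return merged


def fake_scorer(batch, search_term):
    """Stand-in scorer so the sketch runs without a model: 1.0 on substring match."""
    return [(t, 1.0 if search_term.lower() in t.lower() else 0.0) for t in batch]


demo_titles = ["Aspiring Human Resources Professional", "Data Analyst", "HR Generalist"]
demo_scores = score_in_batches(demo_titles, "Human Resources", fake_scorer, batch_size=2)
```

Scores from separate batches come from separate generations, so they are only roughly comparable; a fixed scoring rubric in the prompt would help keep them on the same scale.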
