
### **Potential Talents:**
The company in this project is a talent sourcing and management firm that looks for talented individuals to place with technology companies. Finding such candidates is hard: the company has to identify potential candidates, determine what makes one the best choice, and work out where to find them. The work demands significant human effort and is largely manual. To save time and surface the ideal candidate for a given role, the company wants to automate the process with machine learning so that recruiters can search for candidates using specific keywords.

Even after the candidates have been ranked by how well their job titles match the keywords, the recruiter might not select the top-ranked candidate; they might pick, say, the seventh. Each time the recruiter selects a candidate, the list therefore needs to be re-ranked using the chosen candidate's information.

### **Objective:**
Our primary goal is to compute a fitness score for each candidate's job title against the provided keywords and to rank the candidates by that score. When a candidate is starred (selected), we re-rank the list by computing the similarity between the selected candidate's information and that of every candidate.

In [29]:
#importing necessary libraries
import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import torch

In [25]:
# downloading the NLTK stopword list and the punkt tokenizer models
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [11]:
# loading the data
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/potential-talents - Aspiring human resources - seeking human resources.csv')
data.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


The entire fit column is null because it is the value we are expected to produce by computing the fitness score.

In [14]:
duplicate_count = data[['job_title','location','connection']].duplicated().sum() # the id column is excluded because it is unique for every row
print("Total Duplicate Rows:", duplicate_count)

Total Duplicate Rows: 51


In [15]:
# many rows are duplicated, so drop them; this keeps one row per job_title, which is stricter than the three-column check above
data.drop_duplicates(subset=['job_title'], keep='first', inplace=True)
data.reset_index(drop=True, inplace=True)
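
The duplicate count above compares job_title, location and connection together, while the drop uses job_title alone, so the two steps can disagree on how many rows are redundant. The sketch below illustrates the difference with plain Python; the rows and the helper `count_duplicates` are invented for illustration and are not taken from the dataset.

```python
# Hypothetical rows, made up for illustration only.
rows = [
    {"id": 1, "job_title": "HR Manager", "location": "Austin", "connection": "85"},
    {"id": 2, "job_title": "HR Manager", "location": "Austin", "connection": "85"},
    {"id": 3, "job_title": "HR Manager", "location": "Boston", "connection": "12"},
    {"id": 4, "job_title": "Data Analyst", "location": "Boston", "connection": "500+"},
]

def count_duplicates(rows, columns):
    """Count rows whose values in `columns` repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

# Same title, location and connection: only row 2 repeats row 1.
full_match = count_duplicates(rows, ["job_title", "location", "connection"])
# Same title only: rows 2 and 3 both repeat row 1.
title_only = count_duplicates(rows, ["job_title"])
print(full_match, title_only)
```

Dropping by title alone removes candidates who share a title but differ in location or connections, which is acceptable here only because the ranking is driven by the job title.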

In [16]:
#let us compute the most frequent words in the job title column
# converting all the job title observations as one string paragraph
all_job_titles = " ".join(data["job_title"].astype(str))
#lowercasing the text
all_job_titles_lower = all_job_titles.lower()
# Tokenization with re
words = re.findall(r'\b\w+\b', all_job_titles_lower)
# Remove stopwords
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word not in stop_words]
# Counting the frequency of words
word_counts = Counter(filtered_words)
word_counts.most_common(10)

[('human', 33),
 ('resources', 33),
 ('aspiring', 12),
 ('seeking', 10),
 ('professional', 8),
 ('manager', 7),
 ('university', 6),
 ('student', 6),
 ('management', 6),
 ('business', 5)]

#### Text Processing:

In [17]:
#lowercasing the job_title column
data['job_title'] = data['job_title'].str.lower()
# the fit column is empty, so we drop it
data.drop('fit',axis=1,inplace=True)

In [18]:
# the location column contains the spelling Kanada, so we normalize it to Canada
data['location'] = data['location'].apply(lambda location: location.replace('Kanada','Canada') if isinstance(location,str) else location)

In [19]:
# define a function that keeps only letters and whitespace
def preprocess_text(text):
    # Remove digits, punctuation and brackets using regex
    clean_text = re.sub(r'[^a-zA-Z\s]', '', text)
    return clean_text
#applying the function to the job_title column
data['job_title'] = data['job_title'].apply(preprocess_text)

In [26]:
# defining a function to remove stopwords
def remove_stop_words(text):
    words = word_tokenize(text)
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)
data['job_title'] = data['job_title'].apply(remove_stop_words)

In [28]:
data['job_title'].value_counts()

ct bauer college business graduate magna cum laude aspiring human resources professional                1
native english teacher epik english program korea                                                       1
junior mes engineer information systems                                                                 1
senior human resources business partner heil environmental                                              1
aspiring human resources professional energetic teamfocused leader                                      1
hr manager endemol shine north america                                                                  1
human resources professional world leader gis software                                                  1
rrp brand portfolio executive jti japan tobacco international                                           1
information systems specialist programmer love data organization                                        1
bachelor science biology victoria university w

We can observe that all the stop words and special characters have been removed.
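
The cleaning steps can be condensed into one function. This is a minimal sketch, not the notebook's exact pipeline: it assumes a tiny stand-in stopword set instead of NLTK's English list, splits on whitespace instead of calling `word_tokenize`, and the name `clean_title` and the sample title are made up.

```python
import re

# Small stand-in stopword list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"at", "the", "of", "and", "a", "in", "for"}

def clean_title(title):
    """Lowercase, keep only letters and whitespace, then drop stopwords."""
    lowered = title.lower()
    letters_only = re.sub(r"[^a-z\s]", "", lowered)
    tokens = letters_only.split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

example = clean_title("2019 Graduate & Team-Focused Leader at the HR Office")
print(example)
```

Two side effects are worth keeping in mind: digits such as years disappear, and hyphenated words are fused into one token, which is why teamfocused appears in the cleaned titles above.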

#### TF-IDF Vector:

In [None]:
# chosen "aspiring human resources" as the keywords
keywords = ["aspiring human resources"]
# Vectorize the job titles and keywords
vectorizer = TfidfVectorizer()
job_title_vectors = vectorizer.fit_transform(data['job_title'])
keyword_vectors = vectorizer.transform(keywords)
# Calculate cosine similarity
similarity_scores = cosine_similarity(keyword_vectors, job_title_vectors)
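# Map similarity scores to each row (there is a single keyword phrase, so the max over axis 0 is just its row of scores)
data['Tfidf_fitness'] = similarity_scores.max(axis=0)
# Sort by score and reset the index so that it matches the rank shown in the output below
data = data.sort_values(by='Tfidf_fitness', ascending=False).reset_index(drop=True)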
data.head(5)

Unnamed: 0,id,job_title,location,connection,Tfidf_fitness
0,17,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.753591
1,49,aspiring human resources specialist,Greater New York City Area,1,0.695679
2,73,aspiring human resources manager seeking inter...,"Houston, Texas Area",7,0.576794
3,74,human resources professional,Greater Boston Area,16,0.460159
4,39,student humber college aspiring human resource...,Canada,61,0.436172


In [None]:
data.tail(5)

Unnamed: 0,id,job_title,location,connection,Tfidf_fitness
47,103,always set success,Greater Los Angeles Area,500+,0.0
48,91,lead official western illinois university,Greater Chicago Area,39,0.0
49,20,native english teacher epik english program korea,Canada,500+,0.0
50,80,junior mes engineer information systems,"Myrtle Beach, South Carolina Area",52,0.0
51,104,director administration excellence logging,"Katy, Texas",500+,0.0
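
To see why titles that share no word with the keywords score exactly 0.0, it can help to compute TF-IDF cosine similarity by hand. The sketch below is a simplified, dependency-free approximation: it assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 and L2 normalization that `TfidfVectorizer` applies by default, and it splits on whitespace rather than using the vectorizer's own tokenizer. The helper names are hypothetical; the three titles are taken from the outputs above.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(text, idf):
    """L2-normalized TF-IDF weights; terms unseen during fitting are ignored."""
    counts = Counter(t for t in text.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def cosine(u, v):
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

titles = [
    "aspiring human resources professional",
    "human resources professional",
    "always set success",
]
idf = fit_idf(titles)
query = tfidf_vector("aspiring human resources", idf)
scores = [cosine(query, tfidf_vector(t, idf)) for t in titles]
print([round(s, 3) for s in scores])
```

Because the score is a dot product over shared terms, a title with no overlapping word contributes nothing and lands at zero, regardless of how related it is in meaning. That limitation motivates the embedding-based approaches that follow.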


#### BERT:

In [None]:
pip install transformers



In [None]:
from transformers import BertTokenizer, BertModel
# Loading pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
# initializing the BERT model using pretrained weights
model = BertModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [None]:
# Tokenizing and encoding the keywords and job titles using the pre-trained BERT tokenizer
keyword_tokens = tokenizer(keywords, padding=True, truncation=True, return_tensors="pt")
job_title_tokens = tokenizer(list(data["job_title"]), padding=True, truncation=True, return_tensors="pt", max_length=128)

In [None]:
# Getting BERT embeddings for keywords and job titles
with torch.no_grad():
    keyword_outputs = model(**keyword_tokens)
    job_title_outputs = model(**job_title_tokens)

In [None]:
# Extracting the embeddings: take the second-to-last hidden layer and average it over all token positions
keyword_embeddings = keyword_outputs.hidden_states[-2].mean(dim=1)
job_title_embeddings = job_title_outputs.hidden_states[-2].mean(dim=1)
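
The mean over dim=1 above averages every token position, including the padding positions that `padding=True` adds to shorter titles. A common alternative is to average only the positions where the attention mask is 1. The sketch below shows the idea with plain Python lists and made-up numbers; `masked_mean` is a hypothetical helper, not part of transformers or torch.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Made-up 2-dimensional "embeddings": two real tokens followed by two pads.
token_vectors = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 0, 0]

# Plain mean over all positions, as in the cell above.
plain = [sum(col) / len(token_vectors) for col in zip(*token_vectors)]
# Mask-aware mean over the real tokens only.
masked = masked_mean(token_vectors, attention_mask)
print(plain, masked)
```

With tensors, the same idea is usually written by multiplying the hidden states by the attention mask (expanded along the last dimension) and dividing by the number of unmasked tokens. Whether it changes the ranking here would have to be checked by rerunning the notebook.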

In [None]:
#calculating the similarity scores using cosine similarity
similarity_scores = cosine_similarity(keyword_embeddings, job_title_embeddings)
data["BERT_fitness"] = similarity_scores[0]
data = data.sort_values(by='BERT_fitness', ascending=False)

In [None]:
data.head(10)

Unnamed: 0,id,job_title,location,connection,Tfidf_fitness,BERT_fitness
2,73,aspiring human resources manager seeking inter...,"Houston, Texas Area",7,0.576794,0.882239
9,82,aspiring human resources professional energeti...,"Austin, Texas Area",174,0.323092,0.874919
19,76,aspiring human resources professional passiona...,"New York, New York",212,0.218163,0.862118
7,66,experienced retail manager aspiring human reso...,"Austin, Texas Area",57,0.348,0.854407
30,70,retired army national guard recruiter office m...,"Virginia Beach, Virginia",82,0.112592,0.853863
28,75,nortia staffing seeking human resources payrol...,"San Jose, California",500+,0.135574,0.851598
12,57,ct bauer college business graduate magna cum l...,"Houston, Texas",85,0.283759,0.847732
6,72,business management major aspiring human resou...,"Monroe, Louisiana Area",5,0.374724,0.84595
8,100,aspiring human resources manager graduating ma...,"Cape Girardeau, Missouri",103,0.334276,0.845381
4,39,student humber college aspiring human resource...,Canada,61,0.436172,0.84498


In [None]:
data.tail(5)

Unnamed: 0,id,job_title,location,connection,Tfidf_fitness,BERT_fitness
41,95,student westfield state university,"Bridgewater, Massachusetts",57,0.0,0.748171
32,11,student chapman university,"Lake Forest, California",2,0.0,0.74344
36,87,bachelor science biology victoria university w...,"Baltimore, Maryland",40,0.0,0.738932
40,93,admissions representative community medical ce...,"Long Beach, California",9,0.0,0.738487
47,103,always set success,Greater Los Angeles Area,500+,0.0,0.70679


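Note that BERT assigns a high fitness score even to unrelated job titles: the lowest score in the table above is about 0.71 (always set success), against 0.88 for the best match, so the scores barely separate good matches from poor ones.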

#### Sentence Transformers:

In [None]:
pip install sentence-transformers



In [None]:
from sentence_transformers import SentenceTransformer
# Load SentenceTransformer model
model_name2 = 'paraphrase-MiniLM-L6-v2'
model2 = SentenceTransformer(model_name2)
# encode() returns NumPy arrays by default, which cosine_similarity accepts on both CPU and GPU runtimes
keyword_embeddings2 = model2.encode(keywords)
job_title_embeddings2 = model2.encode(list(data["job_title"]))
# Calculating cosine similarity scores
similarity_scores1 = cosine_similarity(keyword_embeddings2, job_title_embeddings2)

In [None]:
data["ST_fitness"] = similarity_scores1[0]
data = data.sort_values(by='ST_fitness', ascending=False)
data.reset_index(drop=True, inplace=True)

In [None]:
data.head(8)

Unnamed: 0,id,job_title,location,connection,Tfidf_fitness,BERT_fitness,ST_fitness
0,17,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.753591,0.839597,0.92063
1,49,aspiring human resources specialist,Greater New York City Area,1,0.695679,0.843632,0.870179
2,28,seeking human resources opportunities,"Chicago, Illinois",390,0.287816,0.813949,0.865065
3,99,seeking human resources position,"Las Vegas, Nevada Area",48,0.279124,0.789635,0.834088
4,73,aspiring human resources manager seeking inter...,"Houston, Texas Area",7,0.576794,0.882239,0.755794
5,74,human resources professional,Greater Boston Area,16,0.460159,0.822908,0.746101
6,39,student humber college aspiring human resource...,Canada,61,0.436172,0.84498,0.734856
7,27,aspiring human resources management student se...,"Houston, Texas Area",500+,0.395643,0.843704,0.724473


In [None]:
data.tail(5)

Unnamed: 0,id,job_title,location,connection,Tfidf_fitness,BERT_fitness,ST_fitness
47,96,student indiana university kokomo business man...,"Lafayette, Indiana",19,0.0,0.814554,0.185218
48,93,admissions representative community medical ce...,"Long Beach, California",9,0.0,0.738487,0.177539
49,12,svp chro marketing communications csr officer ...,"Houston, Texas Area",500+,0.0,0.784643,0.169567
50,103,always set success,Greater Los Angeles Area,500+,0.0,0.70679,0.162727
51,85,rrp brand portfolio executive jti japan tobacc...,Greater Philadelphia Area,500+,0.0,0.779052,0.132577


In [None]:
# defining a function that re-ranks every candidate by similarity to the selected candidate
def rerank_candidates(selected_rank):
    if selected_rank < 0 or selected_rank >= len(data):
        # fail loudly so the caller never receives None
        raise ValueError("Invalid rank selected.")
    # Calculate the embeddings for every candidate's information
    all_embeddings = model2.encode(list(data[["job_title", "connection", "location"]].values))
    selected_candidate_embedding = all_embeddings[selected_rank]
    # Calculate cosine similarity scores between the selected candidate and all candidates
    similarity_scores3 = cosine_similarity(selected_candidate_embedding.reshape(1, -1), all_embeddings)[0]
    # store the scores on a copy so the global data frame is not modified
    reranked_data = data.copy()
    reranked_data["selected_candidate_fitness"] = similarity_scores3
    # Re-ranking the candidates based on the similarity to the selected candidate
    reranked_data = reranked_data.sort_values(by="selected_candidate_fitness", ascending=False).reset_index(drop=True)
    return reranked_data
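
The re-ranking logic itself does not depend on the embedding model. The sketch below reproduces it on made-up 2-dimensional vectors using only the standard library; the helpers `cosine` and `rerank` and the four toy embeddings are assumptions for illustration, not the notebook's data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank(embeddings, selected):
    """Return candidate positions ordered by similarity to the selected one."""
    if not 0 <= selected < len(embeddings):
        raise IndexError("selected position is out of range")
    scores = [cosine(embeddings[selected], e) for e in embeddings]
    order = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Made-up embeddings for four hypothetical candidates.
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
order, scores = rerank(embeddings, selected=3)
print(order)
```

The selected candidate always compares to itself with a similarity of 1.0 and therefore moves to the top of the list, followed by the candidates whose vectors point in the most similar direction.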

In [None]:
selected_rank = int(input('Enter the Index number to be Re-Ranked:'))
reranked_candidates = rerank_candidates(selected_rank)
reranked_candidates.head(6)

Enter the Index number to be Re-Ranked:6


Unnamed: 0,id,job_title,location,connection,Tfidf_fitness,BERT_fitness,ST_fitness,selected_candidate_fitness
0,39,student humber college aspiring human resource...,Canada,61,0.436172,0.84498,0.734856,1.0
1,17,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.753591,0.839597,0.92063,0.803612
2,49,aspiring human resources specialist,Greater New York City Area,1,0.695679,0.843632,0.870179,0.787076
3,27,aspiring human resources management student se...,"Houston, Texas Area",500+,0.395643,0.843704,0.724473,0.780692
4,73,aspiring human resources manager seeking inter...,"Houston, Texas Area",7,0.576794,0.882239,0.755794,0.765197
5,79,liberal arts major aspiring human resources an...,"Baton Rouge, Louisiana Area",7,0.297679,0.827784,0.687192,0.740049


In [None]:
reranked_candidates.tail(5)

Unnamed: 0,id,job_title,location,connection,Tfidf_fitness,BERT_fitness,ST_fitness,selected_candidate_fitness
47,93,admissions representative community medical ce...,"Long Beach, California",9,0.0,0.738487,0.177539,0.262625
48,96,student indiana university kokomo business man...,"Lafayette, Indiana",19,0.0,0.814554,0.185218,0.242396
49,102,business intelligence analytics travelers,Greater New York City Area,49,0.0,0.78103,0.295941,0.200659
50,85,rrp brand portfolio executive jti japan tobacc...,Greater Philadelphia Area,500+,0.0,0.779052,0.132577,0.125001
51,103,always set success,Greater Los Angeles Area,500+,0.0,0.70679,0.162727,0.006468
