In recruitment, companies grapple with sifting through resumes to find the right candidates. This project uses natural language processing (NLP) to assist with that screening step.

We train a Word2Vec model on a set of resumes, represent each resume and a job description as a vector, and rank the resumes by their cosine similarity to the job description. The highest-scoring resumes are the ones whose wording most closely matches the job.

The goal is to make hiring faster and easier: instead of spending hours on resume review, companies can focus on engaging with a shortlist of the best-matching candidates.

# Data Loading

Before starting the analysis, we download the dataset and load it into a pandas DataFrame, then inspect the first rows and the columns to understand what the data contains.

In [41]:
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='1NrO86f_62pAIjr0HwcuSNKFIujMEQmVO',
                                    dest_path='NLP talent matching/Resume_test')

In [40]:
import pandas as pd

csv_path = 'NLP talent matching/Resume_test'
df = pd.read_csv(csv_path, encoding='latin-1')
df.head(5)

|   | ID | Resume_str |
|---|---|---|
| 0 | 16852973 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... |
| 1 | 22323967 | HR SPECIALIST, US HR OPERATIONS ... |
| 2 | 33176873 | HR DIRECTOR Summary Over 2... |
| 3 | 27018550 | HR SPECIALIST Summary Dedica... |
| 4 | 17812897 | HR MANAGER Skill Highlights ... |
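
Beyond looking at the first rows, it can help to check each column's type and how many values are missing. The following is a minimal sketch of such a check, not part of the original notebook: `sample_df` is a hypothetical stand-in that reuses the column names shown above, and `summarize_columns` is an illustrative helper.

```python
import pandas as pd

# Hypothetical stand-in for the resume DataFrame (same column names as above)
sample_df = pd.DataFrame({
    "ID": [16852973, 22323967, 33176873],
    "Resume_str": ["HR ADMINISTRATOR Summary ...", None, "HR DIRECTOR Summary ..."],
})

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count, and unique count for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),  # NaN/None values are not counted
    })

summary = summarize_columns(sample_df)
print(summary)
```

Calling `summarize_columns(df)` on the real DataFrame would show whether any `Resume_str` entries are missing before they are tokenized.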


# Enhancing resumes with Word2Vec word embedding

With the dataset loaded, the next step is to tokenize each resume, train a Word2Vec model on the tokens, and save the model to disk so that later steps can reload it without retraining.

In [39]:
import nltk
nltk.download('punkt')

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

tokenized_data = [word_tokenize(str(resume).lower()) for resume in df['Resume_str']]

# Train Word2Vec model on the resumes
model = Word2Vec(sentences=tokenized_data, vector_size=100, window=5, min_count=1, workers=4)

# Save the trained model for later use
model.save("resume_word2vec.model")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Vectorizing Job Descriptions

With the model trained, the next step is to tokenize the job description and look up the Word2Vec vector of each token, skipping tokens that are not in the model's vocabulary. Averaging these word vectors yields a single vector that summarizes the job description and can be compared with other vectors in the following steps.

In [42]:
import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Load the pre-trained Word2Vec model
model = Word2Vec.load("resume_word2vec.model")

# Load the user-provided job description
user_provided_job_description = """We are seeking a highly motivated and detail-oriented Data Analyst to join our team in Delhi,
India. As a Data Analyst, you will be responsible for analyzing large datasets, identifying trends,
and generating insights to drive business decisions. You should have strong skills in data analysis, SQL (MySQL),
and data management. Proficiency in office management and basic knowledge of data science concepts is also required.
This is an entry-level position, and we offer an annual compensation of 3-6 LPA.
A bachelor's degree is the minimum qualification required for this role.
Join us and contribute to our data-driven decision-making process."""

# Tokenize the user-provided job description
tokenized_user_provided_description = word_tokenize(user_provided_job_description.lower())

user_provided_vector = np.mean([model.wv[word] for word in tokenized_user_provided_description if word in model.wv], axis=0)
user_provided_vector = np.squeeze(np.asarray(user_provided_vector))
user_provided_vector

array([-0.67927027,  0.3436342 ,  0.03724858,  0.32435536,  0.16108595,
       -0.41910636,  0.40179226,  0.9011683 , -0.6204146 , -0.1910091 ,
       -0.32675013, -0.6165524 , -0.1667926 ,  0.74518025,  0.01668402,
       -0.09489693,  0.05290135, -0.02091683, -0.68331015, -0.7830736 ,
       -0.17965369,  0.08961367,  0.43303275,  0.24897042,  0.02726281,
       -0.16919889, -0.32660857,  0.12733924, -0.5691138 ,  0.06537176,
        0.8208582 ,  0.22943707,  0.10490521, -0.370735  , -0.10359649,
        0.30579558,  0.33335325, -0.28555143, -0.2024219 , -0.6331265 ,
        0.32355297, -0.7630865 , -0.5842616 , -0.12379848,  0.14230792,
       -0.16448215, -0.24876255,  0.11788491,  0.55113834,  0.03174062,
       -0.0902904 , -0.2338393 ,  0.00947474, -0.02125346, -0.05167062,
        0.45611852,  0.64091766, -0.33490205, -0.49981603,  0.17095646,
       -0.4618533 ,  0.06388243,  0.03644531, -0.18172167, -0.5374956 ,
        0.22638118, -0.00700108,  0.43135512, -0.4937227 ,  0.42
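
The averaging step can be checked by hand on a small example. The sketch below uses a hypothetical dictionary of 3-dimensional vectors, `toy_vectors`, in place of `model.wv`. The helper `mean_vector` is illustrative, and returning a zero vector when no token is known is one possible convention, not something the notebook does.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for model.wv
toy_vectors = {
    "data": np.array([1.0, 0.0, 2.0]),
    "analyst": np.array([3.0, 2.0, 0.0]),
    "sql": np.array([2.0, 4.0, 1.0]),
}

def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "wanted" is out of vocabulary, so only three vectors are averaged
doc_vector = mean_vector(["data", "analyst", "wanted", "sql"], toy_vectors, dim=3)
print(doc_vector)  # [2. 2. 1.]
```

Because the average ignores word order and weights every token equally, long documents with similar common words tend to end up with similar vectors.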

# Talent Search with Cosine Similarity

With the job description vectorized, the next task is to build a vector for each resume in the same way and compare it with the job description vector using cosine similarity. The result is a compatibility score for each resume.

In [43]:
# Create a list to store dictionaries with similarity scores
similarity_list = []

# Iterate over each row in the DataFrame
for index, row in df.iterrows():
    resume_id = row['ID']
    resume_text = str(row['Resume_str'])

    # Tokenize the resume
    tokenized_resume = word_tokenize(resume_text.lower())

    # Collect the vectors of tokens that are in the model's vocabulary
    word_vectors = [model.wv[word] for word in tokenized_resume if word in model.wv]

    if word_vectors:
        # Average the word vectors into a single resume vector
        resume_vector = np.mean(word_vectors, axis=0)

        # Calculate cosine similarity
        similarity = np.dot(user_provided_vector, resume_vector) / (np.linalg.norm(user_provided_vector) * np.linalg.norm(resume_vector))
        similarity = np.maximum(similarity, 0)  # Ensure similarity is not negative
    else:
        # No usable tokens: record NaN (avoids NumPy's empty-mean warning); dropped in the next step
        similarity = np.nan

    # Append the similarity score to the list
    similarity_list.append({'ID': resume_id, 'Resume': resume_text, 'Similarity': similarity})
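
The score computed in the loop is the cosine of the angle between the two vectors, $\cos(u, v) = \dfrac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The following sketch isolates that formula in a small function so its behaviour can be checked on simple inputs. The function name `cosine_similarity` and the choice to return NaN for a zero-length vector are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; NaN if either vector has zero length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        return float("nan")
    return float(np.dot(u, v) / denom)

job = [1.0, 0.0]
print(cosine_similarity(job, [2.0, 0.0]))   # 1.0 (same direction, length ignored)
print(cosine_similarity(job, [0.0, 3.0]))   # 0.0 (orthogonal)
print(cosine_similarity(job, [-1.0, 0.0]))  # -1.0 (opposite; the loop above clips this to 0)
```

Cosine similarity depends only on direction, so a long resume and a short one can score the same if their averaged vectors point the same way.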


# Selecting Top Candidates for Job Roles

We now have a similarity score for each resume. The next step is to select the 10 highest-scoring resumes, which gives recruiters a shortlist of the candidates whose resumes most closely match the job description.

In [44]:
# Create DataFrame from the list of dictionaries
similarity_df = pd.DataFrame(similarity_list)

# Convert 'Similarity' column to numeric values with errors='coerce' to handle non-numeric values
similarity_df['Similarity'] = pd.to_numeric(similarity_df['Similarity'], errors='coerce')

# Drop rows with NaN values in the 'Similarity' column
similarity_df = similarity_df.dropna(subset=['Similarity'])

# Sort the DataFrame by similarity in descending order and display the top 10
top_10_indices = np.argsort(similarity_df['Similarity'].values)[::-1][:10]
top_10_matches = similarity_df.iloc[top_10_indices]
top_10_matches

|   | ID | Resume | Similarity |
|---|---|---|---|
| 929 | 11813872 | VP, PRINCIPAL Summary I am ... | 0.981473 |
| 262 | 24038620 | INFORMATION TECHNOLOGY DIRECTOR ... | 0.977411 |
| 321 | 27058381 | SYSTEM ADMINISTRATOR Expe... | 0.976347 |
| 311 | 26746496 | DATABASE PROGRAMMER/ANALYST (.NET DEV... | 0.966906 |
| 217 | 36856210 | INFORMATION TECHNOLOGY Summar... | 0.966469 |
| 250 | 32959732 | SENIOR DIRECTOR, INFORMATION TECHNOLO... | 0.965322 |
| 280 | 28126340 | INFORMATION TECHNOLOGY COORDINATOR ... | 0.965231 |
| 986 | 57706851 | NOC ENGINEER Summary To work... | 0.962509 |
| 313 | 12334140 | PRODUCTION ASSOCIATE Summary ... | 0.962209 |
| 158 | 11155153 | MECHANICAL DESIGNER Summary ... | 0.960532 |
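
The same selection can also be written with pandas' `DataFrame.nlargest`, which avoids the manual `argsort` and slicing. This is a sketch of an alternative, not the notebook's method: the `scores` data and the helper `top_k` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; NaN marks a resume that could not be vectorized
scores = pd.DataFrame({
    "ID": [101, 102, 103, 104, 105],
    "Similarity": [0.91, np.nan, 0.97, 0.88, 0.95],
})

def top_k(frame, k, column="Similarity"):
    """Drop rows without a score and return the k highest-scoring rows."""
    return frame.dropna(subset=[column]).nlargest(k, column)

top_3 = top_k(scores, k=3)
print(top_3["ID"].tolist())  # [103, 105, 101]
```

As in the table above, the returned rows keep their original index labels, so they can still be traced back to rows of the source DataFrame.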


In [46]:
top_10_matches.iloc[0]['Resume']

'         VP, PRINCIPAL       Summary     I am highly skilled,growth mindset IT professional having more than 20 years experience mostly in financial industry related with providing advanced data solutions using innovative database technology. Very innovative,creative, great problem solver and have achieved the highest ratings consistently for more than 10 years. Continuously learning,adapting and evolving by overcoming challenges faced during professional career. I am fortunate to be a part of team who has delivered cutting edge products over the years to help our firm and clients. My career philosophy is  4LT(Listen,Learn,Love,Lead and earn Trust).        Skills          Deep expertise in designing,developing, implementing and running mission critical systems involving OLTP,OLAP and HTAP workloads  Extensive experience in building and deploying large scale applications in cloud environment(AWS)  Deep expertise in advanced data modeling, data management and data governance  Passionate