# Objective: Build a PDF extractor to pull relevant details from CVs in PDF format, and match them against the job descriptions from the Hugging Face dataset.

## Approach

#### Step 1: Data Extraction from PDFs

##### Use the PyMuPDF (fitz) library to parse PDF documents efficiently and extract text data.
##### Implement robust error handling to gracefully handle exceptions during the PDF extraction process.

#### Step 2: Text Data Preprocessing

##### Tokenize both job descriptions and candidate resumes using the Hugging Face Transformers library to prepare them for analysis.
##### Convert all text data to lowercase to ensure case sensitivity does not affect similarity calculations.
##### Remove stopwords from the text using the NLTK library to eliminate common words that do not contribute significantly.
##### Lemmatize words using NLTK to reduce inflected words to their base form, improving analysis accuracy.

#### Step 3: Embeddings and Similarity Calculation

##### Utilize pretrained models like DistilBERT from Hugging Face to convert text data into dense vector representations (embeddings).
##### Calculate cosine similarity, a standard metric for comparing text embeddings, between job descriptions and candidate resumes using the scikit-learn library.
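
As a rough illustration of the similarity step, cosine similarity between two sets of embeddings can be written directly in NumPy. This is a sketch of the formula that scikit-learn's `cosine_similarity` evaluates, not the notebook's own code; the function name `cosine_similarity_matrix` and the toy vectors are made up for the example.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the (len(a), len(b)) matrix of cosine similarities between rows."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale each row to unit length; the floor of 1e-12 avoids division by zero.
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    # The dot product of unit vectors is the cosine of the angle between them.
    return a_unit @ b_unit.T

# Toy 2-dimensional "embeddings" (illustrative values only).
job_vecs = np.array([[1.0, 0.0], [1.0, 1.0]])
cv_vecs = np.array([[2.0, 0.0], [0.0, 3.0]])
scores = cosine_similarity_matrix(job_vecs, cv_vecs)
print(scores.round(3))
```

Row `i`, column `j` of `scores` holds the similarity between job description `i` and resume `j`, so vector length (for example, a longer document) does not influence the score, only direction does.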

#### Step 4: Data Organization and Management

##### Organize and manage data efficiently for both job descriptions and candidate resumes using Pandas dataframes, simplifying data manipulation.
##### Use dictionaries to store and access data, facilitating structured storage for easy retrieval.

#### Step 5: Top Candidate Ranking

##### Dynamically map job titles to corresponding candidate resume datasets to allow for flexibility and scalability.
##### Implement ranking logic, which involves sorting and ranking algorithms, to determine the top candidates for each job description based on similarity scores.
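
The ranking step can be sketched in a few lines, assuming the similarity scores for one job description are already available as a mapping from resume file name to score. The helper `top_candidates`, the file names, and the scores below are all hypothetical.

```python
from typing import Dict, List, Tuple

def top_candidates(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (candidate, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical similarity scores for a single job description.
accountant_scores = {
    "cv_01.pdf": 0.71,
    "cv_02.pdf": 0.88,
    "cv_03.pdf": 0.64,
    "cv_04.pdf": 0.80,
}
ranking = top_candidates(accountant_scores, k=2)
print(ranking)
```

Repeating this for each job title gives one ranked shortlist per job description.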

### Importing the job description dataset

In [1]:
import pandas as pd
job_desp_df = pd.read_csv('job_description.csv')
job_desp_df

Unnamed: 0,company_name,job_description,position_title,description_length
0,Google,minimum qualifications\nbachelors degree or eq...,Sales Specialist,2727
1,Apple,description\nas an asc you will be highly infl...,Apple Solutions Consultant,828
2,Netflix,its an amazing time to be joining netflix as w...,Licensing Coordinator - Consumer Products,3205
3,Robert Half,description\n\nweb designers looking to expand...,Web Designer,2489
4,TrackFive,at trackfive weve got big goals were on a miss...,Web Developer,3167
...,...,...,...,...
848,Menards,job description\n\nparttime\n\nmake big money ...,Management Internship,1122
849,Parker,responsibilities\nparkers internship program w...,Human Resources Internship - Corporate (Year-...,3840
850,Borgen Project,the borgen project is an innovative national ...,Writer / Journalist Internship,897
851,Wyndham Destinations,put the world on vacation\n\nat wyndham destin...,Inbound Customer Service / Sales (Remote),4604


### Cleaning the dataframe. 

In [2]:
# List of column names to be dropped
columns_to_drop = ['description_length', 'company_name']
job_desp_df = job_desp_df.drop(columns_to_drop, axis=1)

In [3]:
job_desp_df

Unnamed: 0,job_description,position_title
0,minimum qualifications\nbachelors degree or eq...,Sales Specialist
1,description\nas an asc you will be highly infl...,Apple Solutions Consultant
2,its an amazing time to be joining netflix as w...,Licensing Coordinator - Consumer Products
3,description\n\nweb designers looking to expand...,Web Designer
4,at trackfive weve got big goals were on a miss...,Web Developer
...,...,...
848,job description\n\nparttime\n\nmake big money ...,Management Internship
849,responsibilities\nparkers internship program w...,Human Resources Internship - Corporate (Year-...
850,the borgen project is an innovative national ...,Writer / Journalist Internship
851,put the world on vacation\n\nat wyndham destin...,Inbound Customer Service / Sales (Remote)


### Extracting 10-15 job descriptions for this task. 

In [4]:
import pandas as pd

# Filter rows whose 'position_title' contains the keyword for each CV subdirectory (case-insensitive)

accountant_df = job_desp_df[job_desp_df['position_title'].str.contains('accountant', case=False)]
advocate_df = job_desp_df[job_desp_df['position_title'].str.contains('ADVOCATE', case=False)]
agriculture_df = job_desp_df[job_desp_df['position_title'].str.contains('AGRICULTURE', case=False)]
apparel_df = job_desp_df[job_desp_df['position_title'].str.contains('APPAREL', case=False)]
arts_df = job_desp_df[job_desp_df['position_title'].str.contains('ARTS', case=False)]
automobile_df = job_desp_df[job_desp_df['position_title'].str.contains('AUTOMOBILE', case=False)]
avaition_df = job_desp_df[job_desp_df['position_title'].str.contains('AVIATION', case=False)]
banking_df = job_desp_df[job_desp_df['position_title'].str.contains('BANKING', case=False)]
bpo_df = job_desp_df[job_desp_df['position_title'].str.contains('BPO', case=False)]
buisdevp_df = job_desp_df[job_desp_df['position_title'].str.contains('BUSINESS-DEVELOPMENT', case=False)]
chef_df = job_desp_df[job_desp_df['position_title'].str.contains('CHEF', case=False)]
const_df = job_desp_df[job_desp_df['position_title'].str.contains('CONSTRUCTION', case=False)]
consultant_df = job_desp_df[job_desp_df['position_title'].str.contains('CONSULTANT', case=False)]
designer_df = job_desp_df[job_desp_df['position_title'].str.contains('DESIGNER', case=False)]
digimedia_df = job_desp_df[job_desp_df['position_title'].str.contains('DIGITAL-MEDIA', case=False)]
engg_df = job_desp_df[job_desp_df['position_title'].str.contains('ENGINEERING', case=False)]
finance_df = job_desp_df[job_desp_df['position_title'].str.contains('FINANCE', case=False)]
fitness_df = job_desp_df[job_desp_df['position_title'].str.contains('FITNESS', case=False)]
healthcare_df = job_desp_df[job_desp_df['position_title'].str.contains('HEALTHCARE', case=False)]
hr_df = job_desp_df[job_desp_df['position_title'].str.contains('HR', case=False)]
infotech_df = job_desp_df[job_desp_df['position_title'].str.contains('INFORMATION-TECHNOLOGY', case=False)]
pubrel_df = job_desp_df[job_desp_df['position_title'].str.contains('PUBLIC-RELATIONS', case=False)]
sales_df = job_desp_df[job_desp_df['position_title'].str.contains('SALES', case=False)]
teacher_df = job_desp_df[job_desp_df['position_title'].str.contains('TEACHER', case=False)]
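
The twenty-four near-identical filters above could also be generated in a loop. The following is a minimal sketch, assuming a dictionary keyed by category keyword is acceptable in place of separate variables. The tiny `job_desp_df` built here is a made-up stand-in for the real dataset, and `regex=False` is an added choice so that each keyword is matched literally.

```python
import pandas as pd

# Stand-in for the real job description DataFrame (illustrative rows only).
job_desp_df = pd.DataFrame({
    "job_description": ["prepare journal entries", "design web pages", "sell furniture"],
    "position_title": ["Senior Accountant", "Web Designer", "Retail Sales Associate"],
})

# A few of the CV subdirectory keywords used above.
keywords = ["ACCOUNTANT", "DESIGNER", "SALES", "CHEF"]

# Map each keyword to the rows whose title contains it (case-insensitive).
title_matches = {
    keyword: job_desp_df[
        job_desp_df["position_title"].str.contains(keyword, case=False, regex=False)
    ]
    for keyword in keywords
}

for keyword, matches in title_matches.items():
    print(keyword, len(matches))
```

A dictionary also makes it easy to list the categories that returned no rows, which is what the manual inspection below checks one table at a time.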

### Manually selecting 10-15 position titles for the model.

In [5]:
accountant_df

Unnamed: 0,job_description,position_title
418,brdisplaynonecss ul limarginleftcss lipadding...,"Holman Frenia Allison, PC Hiring for Audit Sta..."
420,requirements for senior accountant\n minimum o...,"Senior Accountant, Banking (Lakewood)"
421,our client a cpa firm located in central new j...,Audit Staff Accountant
423,vita healthcare group is a leading name in the...,Accountant
424,description robert half is seeking a staff ac...,Accountant
567,about the job\nsenior accountant\nour client i...,Senior Accountant
569,salary range lower than you desire but still ...,Sr Accountant
573,we are looking for a senior accountant job req...,"Senior Accountant - Florham Park, NJ"
777,in the minute it takes you to read this job de...,Senior Accountant (Remote)
778,job description\n\nabout the role\n\nas part o...,Senior Accountant (Remote) (Remote)


In [6]:
advocate_df

Unnamed: 0,job_description,position_title
340,social media advocate\n\ngrassroots education ...,Remote Education Advocate
664,just an idea\n\njob details\njob type\nparttim...,Part-time Patient Advocate
679,fulltime entry level great way to get hands o...,Remote Patient Advocate Specialist
680,job description\n\nthe patient advocate will p...,$17.50hr - Patient Advocate - Part Time Positi...
681,lets do great things together\n\nfounded in or...,Member Health Advocate ~ remote
682,this job description states that it is remote ...,Bilingual Care Advocate (Hebrew)
724,this is a fulltime position i know this is not...,"Advocate, Patient Empowerment- REMOTE"


In [7]:
agriculture_df

Unnamed: 0,job_description,position_title


In [8]:
apparel_df

Unnamed: 0,job_description,position_title


In [9]:
arts_df

Unnamed: 0,job_description,position_title
721,this position will cover womens health and cli...,"Account Executive, Urology - Ohio and parts of..."


In [10]:
automobile_df

Unnamed: 0,job_description,position_title


In [11]:
avaition_df

Unnamed: 0,job_description,position_title


In [12]:
banking_df

Unnamed: 0,job_description,position_title
420,requirements for senior accountant\n minimum o...,"Senior Accountant, Banking (Lakewood)"
449,job profile\n\nposition overview\n\nat pnc our...,Sales Leader I - Business Banking Sales Manager
529,how would you like to join a successful and gr...,Banking Center Supervisor
758,about working at commerce\n\nwouldnt it be gre...,Business Banking Relationship Manager


In [13]:
bpo_df

Unnamed: 0,job_description,position_title


In [14]:
buisdevp_df

Unnamed: 0,job_description,position_title


In [15]:
chef_df

Unnamed: 0,job_description,position_title


In [16]:
const_df

Unnamed: 0,job_description,position_title
235,construction project manager\n\ncleveland oh o...,Construction Project Manager
404,here at shake shack we take care of each other...,Construction Project Manager - Remote
405,we have an immediate need for local or remote...,(3) Construction Project Managers (Remote) (in...
406,job description\n\nwe are currently seeking a ...,Construction Project Manager (Remote)
407,circustrix\n\nconstruction project manager\n\n...,Construction Project Manager (Remote)
408,position summary\n\nthe construction project m...,Construction Project Manager - Remote
409,job title\n\nproject manager construction mana...,"Project Manager, Construction Manager (Remote)..."
410,construction project managerremote flexibility...,REMOTE FLEX - Construction Project Manager
411,akron ohio united states of america clevelan...,Construction Project Manager- (Remote in Detro...
532,full job description \n\ncommercial constructi...,Commercial Construction Manager


In [17]:
consultant_df

Unnamed: 0,job_description,position_title
1,description\nas an asc you will be highly infl...,Apple Solutions Consultant
21,company overview\n\nexechq is a consulting fir...,Chief Executive Officer - CEO Consultant
249,we go beyond the obvious using intelligence pa...,Ecommerce Consultant
649,job details\nsalary\n a year\njob type\nfull...,Retail Furniture Sales Consultant
733,job details\n\ndescription\n\nthe sales associ...,Sales Associate / Design Consultant - Full Time
745,our bassett design consultants are responsible...,Design Consultant
746,job description\nrequisition id \nour busy new...,Design Consultant- Interior Design/Sales Consu...
772,one of our clients a global market research co...,Programmer Analyst with R Consultant


In [18]:
designer_df

Unnamed: 0,job_description,position_title
3,description\n\nweb designers looking to expand...,Web Designer
6,about the position\n\nthe web designer is resp...,Remote Website Designer
7,job description\n\nzander insurance group is o...,Web Designer
8,tuff is a growth marketing team working with c...,Web Designer
9,type of requisition regular\n\nclearance level...,SR. Web Designer
12,we are seeking a senior ui designer who relis...,Senior UI Designer
14,if youre passionate about building a better fu...,UI Web Designer
15,the apply with seek option will be utilized fo...,Senior Web Designer (REMOTE)
378,sonneman a way of light is seeking an experien...,Senior graphic designer


In [19]:
digimedia_df

Unnamed: 0,job_description,position_title


In [20]:
engg_df

Unnamed: 0,job_description,position_title
32,gaming\n\nwelcome to the world of landbased ga...,Associate Software Engineering
417,passionate about precision medicine and advanc...,Engineering Lead - Analytics Platform
528,job summary\n\nrandstad federal\n\nwe have ove...,Software Engineering


In [21]:
finance_df

Unnamed: 0,job_description,position_title
453,createme is a research and development company...,Senior Director Finance
454,company description\nproject finds mission is ...,Director of Finance
456,vp of finance about flywheel flywheel software...,VP of Finance (East Meadow)
457,a tech services company in new york city is cu...,Director - FP&A / Corporate Finance


In [22]:
fitness_df

Unnamed: 0,job_description,position_title
284,orangetheory fitness seven corners\n\nhonors h...,Fitness Coach
285,strength and conditioning jobs in virginia usa...,Strength & Conditioning Coach Jobs at Onelife ...


In [23]:
healthcare_df

Unnamed: 0,job_description,position_title
23,chief executive officer healthcare\n\ncolumbu...,Chief Executive Officer - Healthcare - Columbus
267,work for one of the worlds best health care sy...,Talent Acquisition Advisor (Healthcare) REMOTE
519,job description\n\njob summary\n\nmolina healt...,"VP, Healthcare Services"
786,healthcare administrative assistant urgent hi...,Healthcare Administrative Assistant – Urgent H...


In [24]:
hr_df

Unnamed: 0,job_description,position_title
82,about\nmacys is proudly americas department st...,Retail Commission Sales Associate - Fine Jewel...
87,marriott international portfolio of brands inc...,Utility Cleaner - Kitchen ($19.77/hr)
401,company background\n\n \n\ncrane co is headqu...,"HR Director, Americas"
481,job overview\nprovide support to members of an...,Sr. HR Business Partner (Hybrid/Remote Role)
482,about the teamat doordash people are our most ...,Senior HR Business Partner
483,scale is growing and so is our people team wer...,HR Generalist
484,engaging our growth mindset the nielsen media ...,HR Initiatives Program Lead (Remote)
600,position description\n\nlocation new york ny\n...,"Development Associate, Philanthropy"
680,job description\n\nthe patient advocate will p...,$17.50hr - Patient Advocate - Part Time Positi...
737,customer service representative\nrequirements\...,Remote Customer Service Representative ($16/hr)
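
Substring matching explains some of the unexpected rows above: `'HR'` also matches the "hr" inside "($19.77/hr)" and "Philanthropy", just as `'ARTS'` matched "parts" in the earlier table. One possible refinement, sketched below and not part of the original notebook, wraps the escaped keyword in `\b` word boundaries and optionally makes the match case-sensitive. The helper name `whole_word_mask` is hypothetical; the sample titles are copied from the output above.

```python
import re
import pandas as pd

def whole_word_mask(titles: pd.Series, keyword: str, case: bool = False) -> pd.Series:
    """Boolean mask of titles that contain `keyword` as a whole word."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return titles.str.contains(pattern, case=case, regex=True)

titles = pd.Series([
    "HR Generalist",
    "Utility Cleaner - Kitchen ($19.77/hr)",
    "Development Associate, Philanthropy",
    "Senior HR Business Partner",
])

# Word boundaries drop "Philanthropy" but still accept "/hr".
loose = whole_word_mask(titles, "HR")
# Adding case sensitivity keeps only the upper-case abbreviation.
strict = whole_word_mask(titles, "HR", case=True)
print(titles[strict].tolist())
```

Word boundaries alone are not enough for a two-letter abbreviation such as "HR", so the case-sensitive variant is the one that isolates the human resources roles in this sample.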


In [25]:
infotech_df

Unnamed: 0,job_description,position_title


In [26]:
pubrel_df

Unnamed: 0,job_description,position_title


In [27]:
sales_df

Unnamed: 0,job_description,position_title
0,minimum qualifications\nbachelors degree or eq...,Sales Specialist
54,job summary\n\nunder general supervision this ...,Sales Executive
55,position type full time\n\ntype of hire expe...,Sales Executive III Client Management
56,we are hiring a sales manager\n\nsummary\n\nbe...,Sales Manager
58,account executive columbus oh amo sales and ...,"Account Executive - Columbus, OH - AMO Sales a..."
...,...,...
826,we inspire purposefilled living that brings jo...,Furniture Sales Associate
827,as one of our passionate fun and dedicated sal...,Vans Retail Sales Associate (Beachwood Place M...
828,what this position is all aboutthe style advi...,Luxury Sales Stylist - Mens Combo - Saks Fifth...
829,job category retail\n\nrequisition number \n\n...,Retail Sales Associate


In [28]:
teacher_df

Unnamed: 0,job_description,position_title
337,job description\n\nwe are currently seeking a ...,Special Education Teacher 2022/2023 School Year
338,description\n\nour teachers bring warmth patie...,Teachers at Alpha Park KinderCare
377,jobid \n\nposition type\n\ndistrictart\n\ndat...,Art Teacher


### Selecting 10-15 job roles and their job descriptions based on the above analysis and converting them into a dataframe.
### Index list - [423, 664, 721, 758, 574,  249, 9, 528, 454, 284, 519, 483, 818, 338]

In [29]:
# Select specific rows by their indices
selected_indices = [423, 664, 721, 758, 574, 249, 9, 528, 454, 284, 519, 483, 818, 338]  # Replace with the indices you want to select
sel_job_desp_df = job_desp_df.iloc[selected_indices]

# Print the new DataFrame
print(sel_job_desp_df)

                                       job_description  \
423  vita healthcare group is a leading name in the...   
664  just an idea\n\njob details\njob type\nparttim...   
721  this position will cover womens health and cli...   
758  about working at commerce\n\nwouldnt it be gre...   
574  overview\nlocation of construction pm currentl...   
249  we go beyond the obvious using intelligence pa...   
9    type of requisition regular\n\nclearance level...   
528  job summary\n\nrandstad federal\n\nwe have ove...   
454  company description\nproject finds mission is ...   
284  orangetheory fitness seven corners\n\nhonors h...   
519  job description\n\njob summary\n\nmolina healt...   
483  scale is growing and so is our people team wer...   
818  overview\n\nthe sales associate is a key emplo...   
338  description\n\nour teachers bring warmth patie...   

                                        position_title  
423                                         Accountant  
664            
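
Because `job_desp_df` still carries the default `RangeIndex`, positional selection with `.iloc` and label selection with `.loc` return the same rows here. If the frame had been filtered or re-indexed first, only `.loc` would keep following the index labels shown in the tables above. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"position_title": ["A", "B", "C", "D"]})
picked = [2, 1]

# Default RangeIndex: labels equal positions, so both selections agree.
same = df.iloc[picked].equals(df.loc[picked])

# After dropping the first row, labels and positions diverge.
filtered = df.drop(index=0)
by_label = filtered.loc[picked]["position_title"].tolist()      # rows labelled 2 and 1
by_position = filtered.iloc[picked]["position_title"].tolist()  # rows at positions 2 and 1
print(same, by_label, by_position)
```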

In [31]:
# Reset the index
sel_job_desp_df = sel_job_desp_df.reset_index(drop=True)

In [32]:
sel_job_desp_df

Unnamed: 0,job_description,position_title
0,vita healthcare group is a leading name in the...,Accountant
1,just an idea\n\njob details\njob type\nparttim...,Part-time Patient Advocate
2,this position will cover womens health and cli...,"Account Executive, Urology - Ohio and parts of..."
3,about working at commerce\n\nwouldnt it be gre...,Business Banking Relationship Manager
4,overview\nlocation of construction pm currentl...,Construction Project Manager
5,we go beyond the obvious using intelligence pa...,Ecommerce Consultant
6,type of requisition regular\n\nclearance level...,SR. Web Designer
7,job summary\n\nrandstad federal\n\nwe have ove...,Software Engineering
8,company description\nproject finds mission is ...,Director of Finance
9,orangetheory fitness seven corners\n\nhonors h...,Fitness Coach


## Data Cleaning

### Removing punctuation.

In [33]:
import pandas as pd
import string  # Import the 'string' module


# Function to remove punctuation
def remove_punctuation(text):
    # ASCII punctuation characters to strip
    punctuations = string.punctuation
    
    # Keep only the characters that are not punctuation
    text = ''.join([char for char in text if char not in punctuations])
    
    return text

# Apply the remove_punctuation function to the "job_description" column
sel_job_desp_df['job_description'] = sel_job_desp_df['job_description'].apply(remove_punctuation)

# Display the DataFrame with punctuation removed
print(sel_job_desp_df)

                                      job_description  \
0   vita healthcare group is a leading name in the...   
1   just an idea\n\njob details\njob type\nparttim...   
2   this position will cover womens health and cli...   
3   about working at commerce\n\nwouldnt it be gre...   
4   overview\nlocation of construction pm currentl...   
5   we go beyond the obvious using intelligence pa...   
6   type of requisition regular\n\nclearance level...   
7   job summary\n\nrandstad federal\n\nwe have ove...   
8   company description\nproject finds mission is ...   
9   orangetheory fitness seven corners\n\nhonors h...   
10  job description\n\njob summary\n\nmolina healt...   
11  scale is growing and so is our people team wer...   
12  overview\n\nthe sales associate is a key emplo...   
13  description\n\nour teachers bring warmth patie...   

                                       position_title  
0                                          Accountant  
1                        Part-ti
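
An equivalent way to strip punctuation, shown here only as a sketch, is `str.translate` with a table built by `str.maketrans`, which avoids testing every character against the punctuation string in Python code. The function name `remove_punctuation_fast` and the sample string are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text: str) -> str:
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)

sample = "Sr. Accountant (Remote) - $17.50/hr!"
print(remove_punctuation_fast(sample))
```

Both versions only remove the characters in `string.punctuation`, so typographic characters such as curly quotes or "–" would need to be added to the table separately.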

### Removing irrelevant information: stripping newline characters.

In [34]:
# Function to remove newline characters
def remove_newlines(text):
    # Use str.replace() to replace '\n' with an empty string
    text = text.replace('\n', '')
    
    return text

# Apply the remove_newlines function to the "job_description" column
sel_job_desp_df['job_description'] = sel_job_desp_df['job_description'].apply(remove_newlines)

# Display the DataFrame with newline characters removed
print(sel_job_desp_df)

                                      job_description  \
0   vita healthcare group is a leading name in the...   
1   just an ideajob detailsjob typeparttimequalifi...   
2   this position will cover womens health and cli...   
3   about working at commercewouldnt it be great t...   
4   overviewlocation of construction pm currently ...   
5   we go beyond the obvious using intelligence pa...   
6   type of requisition regularclearance level mus...   
7   job summaryrandstad federalwe have over a deca...   
8   company descriptionproject finds mission is to...   
9   orangetheory fitness seven cornershonors holdi...   
10  job descriptionjob summarymolina healthcare se...   
11  scale is growing and so is our people team wer...   
12  overviewthe sales associate is a key employee ...   
13  descriptionour teachers bring warmth patience ...   

                                       position_title  
0                                          Accountant  
1                        Part-ti
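
The output above shows words fused across former line breaks (for example `ideajob` and `detailsjob`), because `'\n'` is replaced with an empty string. A possible alternative, sketched here rather than taken from the notebook, collapses every run of whitespace into a single space so that tokens stay separate. The name `normalize_whitespace` is an assumption.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Replace any run of whitespace (including newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

sample = "just an idea\n\njob details\njob type"
print(normalize_whitespace(sample))
```

Keeping the word boundaries matters for the later steps: fused tokens such as `ideajob` are not in any stop word list or lemmatizer vocabulary, and a subword tokenizer will split them differently from the original words.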

### Lowercasing: Converting all text to lowercase for consistency.

In [35]:
# Function to remove newline characters and convert to lowercase
def preprocess_text(text):
    # Remove newline characters and convert to lowercase
    text = text.replace('\n', '').lower()
    
    return text

# Apply the preprocess_text function to the "job_description" column
sel_job_desp_df['job_description'] = sel_job_desp_df['job_description'].apply(preprocess_text)

# Display the DataFrame with newline characters removed and text converted to lowercase
print(sel_job_desp_df)

                                      job_description  \
0   vita healthcare group is a leading name in the...   
1   just an ideajob detailsjob typeparttimequalifi...   
2   this position will cover womens health and cli...   
3   about working at commercewouldnt it be great t...   
4   overviewlocation of construction pm currently ...   
5   we go beyond the obvious using intelligence pa...   
6   type of requisition regularclearance level mus...   
7   job summaryrandstad federalwe have over a deca...   
8   company descriptionproject finds mission is to...   
9   orangetheory fitness seven cornershonors holdi...   
10  job descriptionjob summarymolina healthcare se...   
11  scale is growing and so is our people team wer...   
12  overviewthe sales associate is a key employee ...   
13  descriptionour teachers bring warmth patience ...   

                                       position_title  
0                                          Accountant  
1                        Part-ti

In [36]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from collections import Counter

# Function to tokenize and count words
def word_frequency(text):
    # Tokenize the text into words
    words = nltk.word_tokenize(text)
    
    # Keep alphanumeric tokens only and convert to lowercase
    words = [word.lower() for word in words if word.isalnum()]
    
    # Filter out stopwords (build the set once instead of reloading the list per word)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    
    # Calculate word frequencies
    word_freq = dict(Counter(words))
    
    return word_freq

# Create a dictionary to store word frequencies for each job description
word_freq_dict = {}

# Apply the word_frequency function to the "job_description" column and store in the dictionary
for index, row in sel_job_desp_df.iterrows():
    word_freq_dict[index] = word_frequency(row['job_description'])

# Print the dictionary of word frequencies
print(word_freq_dict)

{0: {'vita': 1, 'healthcare': 1, 'group': 1, 'leading': 1, 'name': 1, 'skilled': 1, 'nursing': 2, 'world': 2, 'providing': 2, 'class': 1, 'care': 1, 'community': 4, 'decadewe': 1, 'seeking': 1, 'dedicated': 1, 'accountant': 2, 'assist': 2, 'operating': 1, 'facilitiesif': 1, 'looking': 1, 'incredible': 2, 'office': 2, 'environment': 1, 'paid': 2, 'training': 2, 'room': 1, 'growth': 1, 'may': 1, 'position': 1, 'youthe': 1, 'key': 1, 'team': 2, 'member': 1, 'responsible': 1, 'accounting': 5, 'guidance': 1, 'management': 1, 'relates': 1, 'overall': 1, 'activities': 1, 'programs': 1, 'communityaccountant': 1, 'responsibilities': 1, 'provide': 1, 'support': 2, 'communities': 2, 'executive': 1, 'directors': 1, 'managers': 1, 'prepare': 1, 'journal': 1, 'entries': 1, 'maintain': 2, 'accounts': 2, 'facilitate': 1, 'monthly': 1, 'financial': 1, 'calls': 1, 'payable': 1, 'receivable': 1, 'account': 1, 'reconciliation': 1, 'filing': 1, 'process': 1, 'payrolls': 1, 'compliance': 1, 'company': 1, 'p

### Removing stop words: common words like "and", "the", and "is" are removed; the cell below uses NLTK's built-in English stop word list.

In [37]:
import pandas as pd
import nltk
from nltk.corpus import stopwords

# Function to remove stopwords from text
def remove_stopwords(text):
    # Tokenize the text into words
    words = nltk.word_tokenize(text)
    
    # Remove stopwords (build the set once instead of reloading the list per word)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.lower() not in stop_words]
    
    # Rejoin the words to form cleaned text
    cleaned_text = ' '.join(words)
    
    return cleaned_text

# Apply the remove_stopwords function to the "job_description" column
sel_job_desp_df['job_description'] = sel_job_desp_df['job_description'].apply(remove_stopwords)

# Display the DataFrame with stopwords removed
print(sel_job_desp_df)

                                      job_description  \
0   vita healthcare group leading name skilled nur...   
1   ideajob detailsjob typeparttimequalificationsc...   
2   position cover womens health clinical accounts...   
3   working commercewouldnt great build career ban...   
4   overviewlocation construction pm currently see...   
5   go beyond obvious using intelligence passion c...   
6   type requisition regularclearance level must a...   
7   job summaryrandstad federalwe decade experienc...   
8   company descriptionproject finds mission provi...   
9   orangetheory fitness seven cornershonors holdi...   
10  job descriptionjob summarymolina healthcare se...   
11  scale growing people team looking people opera...   
12  overviewthe sales associate key employee whose...   
13  descriptionour teachers bring warmth patience ...   

                                       position_title  
0                                          Accountant  
1                        Part-ti

In [38]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from collections import Counter

# Function to tokenize and count words
def word_frequency(text):
    # Tokenize the text into words
    words = nltk.word_tokenize(text)
    
    # Keep alphanumeric tokens only and convert to lowercase
    words = [word.lower() for word in words if word.isalnum()]
    
    # Filter out stopwords (build the set once instead of reloading the list per word)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    
    # Calculate word frequencies
    word_freq = dict(Counter(words))
    
    return word_freq

# Create a dictionary to store word frequencies for each job description
word_freq_dict = {}

# Apply the word_frequency function to the "job_description" column and store in the dictionary
for index, row in sel_job_desp_df.iterrows():
    word_freq_dict[index] = word_frequency(row['job_description'])

# Print the dictionary of word frequencies
print(word_freq_dict)

{0: {'vita': 1, 'healthcare': 1, 'group': 1, 'leading': 1, 'name': 1, 'skilled': 1, 'nursing': 2, 'world': 2, 'providing': 2, 'class': 1, 'care': 1, 'community': 4, 'decadewe': 1, 'seeking': 1, 'dedicated': 1, 'accountant': 2, 'assist': 2, 'operating': 1, 'facilitiesif': 1, 'looking': 1, 'incredible': 2, 'office': 2, 'environment': 1, 'paid': 2, 'training': 2, 'room': 1, 'growth': 1, 'may': 1, 'position': 1, 'youthe': 1, 'key': 1, 'team': 2, 'member': 1, 'responsible': 1, 'accounting': 5, 'guidance': 1, 'management': 1, 'relates': 1, 'overall': 1, 'activities': 1, 'programs': 1, 'communityaccountant': 1, 'responsibilities': 1, 'provide': 1, 'support': 2, 'communities': 2, 'executive': 1, 'directors': 1, 'managers': 1, 'prepare': 1, 'journal': 1, 'entries': 1, 'maintain': 2, 'accounts': 2, 'facilitate': 1, 'monthly': 1, 'financial': 1, 'calls': 1, 'payable': 1, 'receivable': 1, 'account': 1, 'reconciliation': 1, 'filing': 1, 'process': 1, 'payrolls': 1, 'compliance': 1, 'company': 1, 'p

### Lemmatization: Reducing words to their base form.

In [39]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to remove stopwords and perform lemmatization
def preprocess_text(text):
    # Tokenize the text into words
    words = nltk.word_tokenize(text)
    
    # Remove stopwords and apply lemmatization (build the stop word set once)
    stop_words = set(stopwords.words('english'))
    words = [lemmatizer.lemmatize(word.lower()) for word in words if word.lower() not in stop_words]
    
    # Rejoin the words to form cleaned text
    cleaned_text = ' '.join(words)
    
    return cleaned_text

# Apply the preprocess_text function to the "job_description" column
sel_job_desp_df['job_description'] = sel_job_desp_df['job_description'].apply(preprocess_text)

# Display the DataFrame with stopwords removed and lemmatization applied
print(sel_job_desp_df)


                                      job_description  \
0   vita healthcare group leading name skilled nur...   
1   ideajob detailsjob typeparttimequalificationsc...   
2   position cover woman health clinical account t...   
3   working commercewouldnt great build career ban...   
4   overviewlocation construction pm currently see...   
5   go beyond obvious using intelligence passion c...   
6   type requisition regularclearance level must a...   
7   job summaryrandstad federalwe decade experienc...   
8   company descriptionproject find mission provid...   
9   orangetheory fitness seven cornershonors holdi...   
10  job descriptionjob summarymolina healthcare se...   
11  scale growing people team looking people opera...   
12  overviewthe sale associate key employee whose ...   
13  descriptionour teacher bring warmth patience u...   

                                       position_title  
0                                          Accountant  
1                        Part-ti

In [41]:
sel_job_desp_df

Unnamed: 0,job_description,position_title
0,vita healthcare group leading name skilled nur...,Accountant
1,ideajob detailsjob typeparttimequalificationsc...,Part-time Patient Advocate
2,position cover woman health clinical account t...,"Account Executive, Urology - Ohio and parts of..."
3,working commercewouldnt great build career ban...,Business Banking Relationship Manager
4,overviewlocation construction pm currently see...,Construction Project Manager
5,go beyond obvious using intelligence passion c...,Ecommerce Consultant
6,type requisition regularclearance level must a...,SR. Web Designer
7,job summaryrandstad federalwe decade experienc...,Software Engineering
8,company descriptionproject find mission provid...,Director of Finance
9,orangetheory fitness seven cornershonors holdi...,Fitness Coach


### Using the DistilBERT model and tokenizer to extract embeddings from the "job_description" column of the DataFrame and replace the text with those embeddings in a single line.

In [42]:
import pandas as pd
from transformers import DistilBertTokenizer, DistilBertModel
import torch


# Initialize the DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

# Function to extract DistilBERT embeddings
def extract_embeddings(text):
    # Tokenize the text; input longer than the model's 512-token limit is truncated
    input_ids = tokenizer(text, return_tensors='pt', padding=True, truncation=True)["input_ids"]
    
    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        outputs = model(input_ids)
    
    # Take the hidden state of the [CLS] token (position 0); result has shape (1, 768)
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()
    
    return embeddings

# Apply the extract_embeddings function to the "job_description" column
sel_job_desp_df['job_description'] = sel_job_desp_df['job_description'].apply(extract_embeddings)

# Display the DataFrame with embeddings in the "job_description" column
print(sel_job_desp_df[['job_description']])


                                      job_description
0   [[-0.2349301, 0.23931362, -0.0066209063, 0.012...
1   [[-0.48101765, 0.12456707, -0.012981296, -0.10...
2   [[-0.41524535, 0.18179955, -0.009882465, -0.07...
3   [[-0.52538115, 0.12988769, -0.02525802, -0.068...
4   [[-0.37318966, 0.15656477, -0.053767644, -0.08...
5   [[-0.4030924, 0.10880383, 0.05623078, -0.03711...
6   [[-0.43022907, 0.1107656, 0.030928671, -0.0072...
7   [[-0.3756775, 0.21737866, -0.23779063, -0.0189...
8   [[-0.3724737, 0.18712647, -0.061148852, -0.016...
9   [[-0.37973148, 0.110710956, 0.011414934, -0.04...
10  [[-0.48247072, 0.18552431, 0.028860055, 0.0378...
11  [[-0.46628553, 0.1547784, 0.03934186, -0.01833...
12  [[-0.5785006, 0.20030381, -0.14895616, -0.1038...
13  [[-0.29803705, 0.1478748, 0.022891663, 0.12934...


In [43]:
sel_job_desp_df

Unnamed: 0,job_description,position_title
0,"[[-0.2349301, 0.23931362, -0.0066209063, 0.012...",Accountant
1,"[[-0.48101765, 0.12456707, -0.012981296, -0.10...",Part-time Patient Advocate
2,"[[-0.41524535, 0.18179955, -0.009882465, -0.07...","Account Executive, Urology - Ohio and parts of..."
3,"[[-0.52538115, 0.12988769, -0.02525802, -0.068...",Business Banking Relationship Manager
4,"[[-0.37318966, 0.15656477, -0.053767644, -0.08...",Construction Project Manager
5,"[[-0.4030924, 0.10880383, 0.05623078, -0.03711...",Ecommerce Consultant
6,"[[-0.43022907, 0.1107656, 0.030928671, -0.0072...",SR. Web Designer
7,"[[-0.3756775, 0.21737866, -0.23779063, -0.0189...",Software Engineering
8,"[[-0.3724737, 0.18712647, -0.061148852, -0.016...",Director of Finance
9,"[[-0.37973148, 0.110710956, 0.011414934, -0.04...",Fitness Coach
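
With `truncation=True`, the tokenizer cuts each input at the model's maximum sequence length (512 tokens for `distilbert-base-uncased`), so the tail of a long job description does not contribute to its embedding. One possible workaround, sketched below under the assumption that the token IDs are already available as a plain list, is to split them into overlapping windows and embed each window separately. The function name `split_into_windows` and its parameter values are hypothetical and not part of the notebook.

```python
def split_into_windows(token_ids, window_size=510, stride=255):
    """Split a list of token IDs into overlapping windows.

    window_size=510 leaves room for the [CLS] and [SEP] tokens that the
    tokenizer adds, keeping each window within DistilBERT's 512-token limit.
    stride is the step between window starts; both values are illustrative.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window_size])
        # Stop once the current window reaches the end of the sequence
        if start + window_size >= len(token_ids):
            break
        start += stride
    return windows

# Small numbers make the overlap easy to see
print(split_into_windows(list(range(10)), window_size=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The per-window embeddings could then be averaged into a single vector per document.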


### This code extracts text content, job roles, skills, and education information from PDF files in a specified folder, organizes the data into DataFrames, and stores them in a dictionary with subfolder names as keys.

In [49]:
import os
import re
import fitz  # PyMuPDF
import pandas as pd

# Function to extract text content from a PDF file using PyMuPDF
def extract_text_from_pdf(pdf_file_path):
    # Open the document with a context manager so the file handle is released after reading
    with fitz.open(pdf_file_path) as doc:
        # Concatenate the text of every page in order
        return "".join(page.get_text() for page in doc)

# Function to extract the job role from the first line of text in a PDF file
def extract_job_role_from_pdf(pdf_file_path):
    job_role = ""
    try:
        pdf_text = extract_text_from_pdf(pdf_file_path)
        
        # Extract first line (Job Role)
        job_role = pdf_text.split('\n')[0].strip()
    except Exception as e:
        print(f"Error extracting job role from {pdf_file_path}: {str(e)}")
    return job_role

# Function to extract the skills section from a PDF file's text content
def extract_skills_from_pdf(pdf_file_path):
    skills = ""
    try:
        pdf_text = extract_text_from_pdf(pdf_file_path)
        
        # Extract the "Skills" section
        in_skills_section = False
        lines = pdf_text.split('\n')
        
        for line in lines:
            if "Skills" in line:
                in_skills_section = True
            elif in_skills_section and line.strip() == "":
                break  # Exit the skills section when encountering an empty line
            elif in_skills_section:
                skills += line + "\n"
    except Exception as e:
        print(f"Error extracting skills from {pdf_file_path}: {str(e)}")
    return skills

# Function to extract the education section from a PDF file's text content
def extract_education_from_pdf(pdf_file_path):
    education = ""
    try:
        pdf_text = extract_text_from_pdf(pdf_file_path)
        
        # Extract the "Education" section
        in_education_section = False
        lines = pdf_text.split('\n')
        
        for line in lines:
            if "Education" in line:
                in_education_section = True
            elif in_education_section and line.strip() == "":
                break  # Exit the education section when encountering an empty line
            elif in_education_section:
                education += line + "\n"
    except Exception as e:
        print(f"Error extracting education from {pdf_file_path}: {str(e)}")
    return education

# Function to extract job details from a folder of PDFs and organize them into a corpus
def extract_details_from_folder(root_folder):
    corpus = {}

    for dirpath, dirnames, filenames in os.walk(root_folder):
        for file in filenames:
            if file.endswith(".pdf"):
                pdf_path = os.path.join(dirpath, file)
                job_role = extract_job_role_from_pdf(pdf_path)
                skills = extract_skills_from_pdf(pdf_path)
                education = extract_education_from_pdf(pdf_path)
                
                # Remove "Skills" and "Education" labels
                skills = skills.replace("Skills", "").strip()
                education = education.replace("Education", "").strip()
                
                # Get the PDF file name
                pdf_file_name = os.path.basename(pdf_path)
                
                # Create a subfolder-based corpus
                subfolder = os.path.basename(dirpath)
                if subfolder not in corpus:
                    corpus[subfolder] = {"PDF File Name": [], "Job Role": [], "Skills": [], "Education": []}
                
                corpus[subfolder]["PDF File Name"].append(pdf_file_name)
                corpus[subfolder]["Job Role"].append(job_role)
                corpus[subfolder]["Skills"].append(skills)
                corpus[subfolder]["Education"].append(education)

    return corpus

# Specify the root folder path containing subfolders with PDFs
root_folder = r"C:\Users\Vinith MH\OneDrive\Desktop\Capital Placement\candidate_cvs"

# Extract job details from the specified folder and organize them into DataFrames
extracted_corpus = extract_details_from_folder(root_folder)

# Create DataFrames for each subdirectory within the corpus
dataframes = {}
for subfolder, details_dict in extracted_corpus.items():
    df = pd.DataFrame(details_dict)
    dataframes[subfolder] = df

# Now, dataframes dictionary contains DataFrames for each subdirectory, and you can access them by their names.
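
The skills and education extractors above share the same logic, and each one re-reads the PDF. They also treat any line containing the word "Skills" or "Education" as a heading, so a sentence that merely mentions the word is skipped rather than collected. Below is a minimal sketch of a single helper that works on already-extracted text and accepts a line as a heading only when it consists of the heading word itself. `extract_section` is a hypothetical name, and the heading format is an assumption about the CV layout.

```python
import re

def extract_section(text, heading):
    """Return the lines following a heading line, up to the first blank line.

    A line counts as the heading only if it consists of the heading word
    itself (case-insensitive, optional trailing colon). This is an assumption
    about the CV layout, not something the PDFs guarantee.
    """
    heading_pattern = re.compile(rf"^\s*{re.escape(heading)}\s*:?\s*$", re.IGNORECASE)
    collected = []
    in_section = False
    for line in text.split("\n"):
        if not in_section:
            in_section = bool(heading_pattern.match(line))
        elif line.strip() == "":
            break  # a blank line ends the section
        else:
            collected.append(line.strip())
    return "\n".join(collected)

sample = (
    "ACCOUNTANT\nSummary\nStrong communication Skills and Excel\n\n"
    "Skills\nExcel\nSQL\n\nEducation\nBachelor of Science"
)
print(extract_section(sample, "Skills"))     # Excel and SQL, one per line
print(extract_section(sample, "Education"))  # Bachelor of Science
```

With a helper like this, the PDF text only needs to be extracted once per file and can be passed to each section lookup.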

### Exporting the data extracted from the CV PDFs as CSV files.

In [75]:
# Specify the output folder where you want to save the exported CSV files
output_folder = r"C:\Users\Vinith MH\OneDrive\Desktop\Capital Placement\exported_data"

# Create the output folder if it does not already exist
os.makedirs(output_folder, exist_ok=True)

# Export each DataFrame as a CSV file
for subfolder, df in dataframes.items():
    csv_filename = os.path.join(output_folder, f"{subfolder}_job_details.csv")
    df.to_csv(csv_filename, index=False)
    print(f"Exported data for {subfolder} to {csv_filename}")


Exported data for ACCOUNTANT to C:\Users\Vinith MH\OneDrive\Desktop\Capital Placement\exported_data\ACCOUNTANT_job_details.csv
Exported data for ADVOCATE to C:\Users\Vinith MH\OneDrive\Desktop\Capital Placement\exported_data\ADVOCATE_job_details.csv
Exported data for AGRICULTURE to C:\Users\Vinith MH\OneDrive\Desktop\Capital Placement\exported_data\AGRICULTURE_job_details.csv
Exported data for APPAREL to C:\Users\Vinith MH\OneDrive\Desktop\Capital Placement\exported_data\APPAREL_job_details.csv
Exported data for ARTS to C:\Users\Vinith MH\OneDrive\Desktop\Capital Placement\exported_data\ARTS_job_details.csv
Exported data for AUTOMOBILE to C:\Users\Vinith MH\OneDrive\Desktop\Capital Placement\exported_data\AUTOMOBILE_job_details.csv
Exported data for AVIATION to C:\Users\Vinith MH\OneDrive\Desktop\Capital Placement\exported_data\AVIATION_job_details.csv
Exported data for BANKING to C:\Users\Vinith MH\OneDrive\Desktop\Capital Placement\exported_data\BANKING_job_details.csv
Exported data 

In [50]:
dataframes['ACCOUNTANT']

Unnamed: 0,PDF File Name,Job Role,Skills,Education
0,10554236.pdf,ACCOUNTANT,Accounting; General Accounting; Accounts Payab...,Northern Maine Community College 1994 Associat...
1,10674770.pdf,STAFF ACCOUNTANT,"accounting, accounts payable, Accounts Receiva...","Bachelor of Science : Accounting , May 2010 Un..."
2,11163645.pdf,ACCOUNTANT,"accounts payables, accounts receivables, Accou...","Skills\naccounts payables, accounts receivable..."
3,11759079.pdf,SENIOR ACCOUNTANT,"accounting, balance sheet, budgets, client, cl...","EMORY UNIVERSITY, Goizueta Business School 5 2..."
4,12065211.pdf,SENIOR ACCOUNTANT,Aderant/CMS\nExcel\nQuickBooks Pro\nSQL\nAcces...,Bachelor of Business Administration : Accounti...
...,...,...,...,...
113,78403342.pdf,ACCOUNTANT,"Accounting, balance, budget, business analyst,...","Accounting Certificate : Accounting , 2012 Cec..."
114,80053367.pdf,GENERAL ACCOUNTANT,"Healthcare: Â \nSound, ethical and independent...","Bachelor of Science : Nursing , 2016 Californi..."
115,82649935.pdf,SENIOR ACCOUNTANT,,Bachelor of Arts : Economics City College of N...
116,87635012.pdf,PRINCIPAL ACCOUNTANT,"Microsoft Excel, Peachtree, PeopleSoft, SAP, S...","Master of Business Administration , Finance 20..."


### Combine "Job Role," "Skills," and "Education" columns into a single "Details" column, separating them with newline characters, and drop the individual columns for all DataFrames within the dataframes dictionary.

In [51]:
# Combine "Job Role," "Skills," and "Education" columns into a single column for all DataFrames
for subfolder, df in dataframes.items():
    # Combine the columns and separate them with newline characters
    df["Details"] = df["Job Role"] + "\n" + df["Skills"] + "\n" + df["Education"]
    
    # Drop the individual columns
    df.drop(columns=["Job Role", "Skills", "Education"], inplace=True)

In [52]:
dataframes['ACCOUNTANT']

Unnamed: 0,PDF File Name,Details
0,10554236.pdf,ACCOUNTANT\nAccounting; General Accounting; Ac...
1,10674770.pdf,"STAFF ACCOUNTANT\naccounting, accounts payable..."
2,11163645.pdf,"ACCOUNTANT\naccounts payables, accounts receiv..."
3,11759079.pdf,"SENIOR ACCOUNTANT\naccounting, balance sheet, ..."
4,12065211.pdf,SENIOR ACCOUNTANT\nAderant/CMS\nExcel\nQuickBo...
...,...,...
113,78403342.pdf,"ACCOUNTANT\nAccounting, balance, budget, busin..."
114,80053367.pdf,"GENERAL ACCOUNTANT\nHealthcare: Â \nSound, eth..."
115,82649935.pdf,SENIOR ACCOUNTANT\n\nBachelor of Arts : Econom...
116,87635012.pdf,"PRINCIPAL ACCOUNTANT\nMicrosoft Excel, Peachtr..."


# Cleaning the CV Data

### Cleaning the "Details" column in each DataFrame within the dataframes dictionary by removing punctuation characters.

In [53]:
import re

# Function to remove punctuation from text
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

# Iterate through the dataframes dictionary and clean text columns
for subfolder, df in dataframes.items():
    # Clean the "Details" column (assuming you've already combined the columns)
    df["Details"] = df["Details"].apply(remove_punctuation)

# Now, the "Details" column in each DataFrame within dataframes is cleaned from punctuation.
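
Deleting punctuation outright joins words that were separated only by a punctuation mark, which is why `Aderant/CMS` shows up as `AderantCMS` in the cleaned output. A possible variant, shown here as a sketch rather than the notebook's method, replaces punctuation with a space and then collapses repeated whitespace. `replace_punctuation_with_space` is a hypothetical helper name.

```python
import re

def replace_punctuation_with_space(text):
    # Swap every character that is neither a word character nor whitespace for a space
    spaced = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace (including newlines) into a single space
    return re.sub(r"\s+", " ", spaced).strip()

print(replace_punctuation_with_space("Aderant/CMS, Excel; QuickBooks Pro"))
# Aderant CMS Excel QuickBooks Pro
```

Because the second substitution also turns newlines into spaces, this variant would cover the newline removal step as well.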

In [54]:
dataframes['ACCOUNTANT']

Unnamed: 0,PDF File Name,Details
0,10554236.pdf,ACCOUNTANT\nAccounting General Accounting Acco...
1,10674770.pdf,STAFF ACCOUNTANT\naccounting accounts payable ...
2,11163645.pdf,ACCOUNTANT\naccounts payables accounts receiva...
3,11759079.pdf,SENIOR ACCOUNTANT\naccounting balance sheet bu...
4,12065211.pdf,SENIOR ACCOUNTANT\nAderantCMS\nExcel\nQuickBoo...
...,...,...
113,78403342.pdf,ACCOUNTANT\nAccounting balance budget business...
114,80053367.pdf,GENERAL ACCOUNTANT\nHealthcare Â \nSound ethic...
115,82649935.pdf,SENIOR ACCOUNTANT\n\nBachelor of Arts Economi...
116,87635012.pdf,PRINCIPAL ACCOUNTANT\nMicrosoft Excel Peachtre...


### Remove newline characters from the "Details" column in each DataFrame within the dataframes dictionary.

In [55]:
# Iterate through the dataframes dictionary and remove newline characters
for subfolder, df in dataframes.items():
    # Remove newline characters from the "Details" column
    df["Details"] = df["Details"].str.replace('\n', ' ')

In [56]:
dataframes['ACCOUNTANT']

Unnamed: 0,PDF File Name,Details
0,10554236.pdf,ACCOUNTANT Accounting General Accounting Accou...
1,10674770.pdf,STAFF ACCOUNTANT accounting accounts payable A...
2,11163645.pdf,ACCOUNTANT accounts payables accounts receivab...
3,11759079.pdf,SENIOR ACCOUNTANT accounting balance sheet bud...
4,12065211.pdf,SENIOR ACCOUNTANT AderantCMS Excel QuickBooks ...
...,...,...
113,78403342.pdf,ACCOUNTANT Accounting balance budget business ...
114,80053367.pdf,GENERAL ACCOUNTANT Healthcare Â Sound ethical...
115,82649935.pdf,SENIOR ACCOUNTANT Bachelor of Arts Economics...
116,87635012.pdf,PRINCIPAL ACCOUNTANT Microsoft Excel Peachtree...


### Lowercasing: Converting all text to lowercase for consistency.

In [57]:
# Iterate through the dataframes dictionary and convert text to lowercase
for subfolder, df in dataframes.items():
    # Convert the "Details" column to lowercase
    df["Details"] = df["Details"].str.lower()

In [58]:
dataframes['ACCOUNTANT']

Unnamed: 0,PDF File Name,Details
0,10554236.pdf,accountant accounting general accounting accou...
1,10674770.pdf,staff accountant accounting accounts payable a...
2,11163645.pdf,accountant accounts payables accounts receivab...
3,11759079.pdf,senior accountant accounting balance sheet bud...
4,12065211.pdf,senior accountant aderantcms excel quickbooks ...
...,...,...
113,78403342.pdf,accountant accounting balance budget business ...
114,80053367.pdf,general accountant healthcare â sound ethical...
115,82649935.pdf,senior accountant bachelor of arts economics...
116,87635012.pdf,principal accountant microsoft excel peachtree...


In [59]:
from collections import Counter

# Initialize a dictionary to store word frequencies for each DataFrame
word_frequencies = {}

# Iterate through the dataframes dictionary
for subfolder, df in dataframes.items():
    # Concatenate all text in the "Details" column
    concatenated_text = ' '.join(df["Details"])
    
    # Tokenize the text (split it into words)
    words = concatenated_text.split()
    
    # Count word frequencies using Counter
    word_freq = Counter(words)
    
    # Store the word frequencies in the dictionary
    word_frequencies[subfolder] = word_freq

# Now, word_frequencies contains word frequencies for each DataFrame within dataframes.
# You can access word frequencies for a specific DataFrame using its subfolder name.
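
The raw `Counter` output is long, so it can help to look only at the most frequent terms when deciding which words carry little signal. Here is a small sketch using `Counter.most_common`. The sample strings and the helper name `top_terms` are made up for illustration.

```python
from collections import Counter

def top_terms(texts, n=3):
    """Return the n most frequent whitespace-separated tokens across texts."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(n)

sample_details = [
    "accountant accounting city state",
    "senior accountant accounting",
    "accounting excel",
]
print(top_terms(sample_details, n=2))
# [('accounting', 3), ('accountant', 2)]
```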


In [60]:
word_frequencies

{'ACCOUNTANT': Counter({'accountant': 322,
          'accounting': 1006,
          'general': 290,
          'accounts': 569,
          'payable': 177,
          'program': 32,
          'management': 457,
          'northern': 5,
          'maine': 1,
          'community': 37,
          'college': 76,
          '1994': 4,
          'associate': 25,
          'city': 432,
          'state': 494,
          'usa': 28,
          'emphasis': 3,
          'in': 545,
          'business': 361,
          'associates': 13,
          'gpa': 111,
          '341': 2,
          '174': 1,
          'hours': 18,
          'quarter': 7,
          'attended': 1,
          'husson': 1,
          'major': 6,
          '78': 1,
          'semester': 3,
          'toward': 1,
          'bachelors': 10,
          'degree': 22,
          'professional': 54,
          'military': 16,
          'comptroller': 1,
          'school': 55,
          '6wk': 1,
          '498': 1,
          'managerial': 13,
     

### Removing stop words: Common words like "and," "the," and "is" are removed here using NLTK's built-in English stop word list.

In [61]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Initialize NLTK's stop words
stop_words = set(stopwords.words('english'))

# Iterate through the dataframes dictionary and remove stop words
for subfolder, df in dataframes.items():
    # Tokenize and remove stop words from the "Details" column
    df["Details"] = df["Details"].apply(lambda text: ' '.join([word for word in text.split() if word.lower() not in stop_words]))


[nltk_data] Downloading package stopwords to C:\Users\Vinith
[nltk_data]     MH\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
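
A custom stop word list can be layered on top of the built-in one by taking the union of the two sets. In the sketch below, the small `base_stop_words` set is a stand-in for `stopwords.words('english')` so that the example runs without NLTK data. The custom entries (`city`, `state`) are only an illustrative assumption, chosen because of their high counts in the word frequencies.

```python
# Stand-in for set(stopwords.words('english')); the real list is much longer
base_stop_words = {"in", "of", "and", "the", "is"}

# Hypothetical domain-specific additions chosen after inspecting word frequencies
custom_stop_words = {"city", "state"}

all_stop_words = base_stop_words | custom_stop_words

def remove_stop_words(text, stop_words):
    # Keep only the tokens that are not in the stop word set
    return " ".join(word for word in text.split() if word.lower() not in stop_words)

print(remove_stop_words("bachelor of arts economics city state", all_stop_words))
# bachelor arts economics
```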


In [62]:
dataframes['ACCOUNTANT']

Unnamed: 0,PDF File Name,Details
0,10554236.pdf,accountant accounting general accounting accou...
1,10674770.pdf,staff accountant accounting accounts payable a...
2,11163645.pdf,accountant accounts payables accounts receivab...
3,11759079.pdf,senior accountant accounting balance sheet bud...
4,12065211.pdf,senior accountant aderantcms excel quickbooks ...
...,...,...
113,78403342.pdf,accountant accounting balance budget business ...
114,80053367.pdf,general accountant healthcare â sound ethical ...
115,82649935.pdf,senior accountant bachelor arts economics city...
116,87635012.pdf,principal accountant microsoft excel peachtree...


In [63]:
from collections import Counter

# Initialize a dictionary to store word frequencies for each DataFrame
word_frequencies = {}

# Iterate through the dataframes dictionary
for subfolder, df in dataframes.items():
    # Concatenate all text in the "Details" column
    concatenated_text = ' '.join(df["Details"])
    
    # Tokenize the text (split it into words)
    words = concatenated_text.split()
    
    # Count word frequencies using Counter
    word_freq = Counter(words)
    
    # Store the word frequencies in the dictionary
    word_frequencies[subfolder] = word_freq

# Now, word_frequencies contains word frequencies for each DataFrame within dataframes.
# You can access word frequencies for a specific DataFrame using its subfolder name.


In [64]:
word_frequencies

{'ACCOUNTANT': Counter({'accountant': 322,
          'accounting': 1006,
          'general': 290,
          'accounts': 569,
          'payable': 177,
          'program': 32,
          'management': 457,
          'northern': 5,
          'maine': 1,
          'community': 37,
          'college': 76,
          '1994': 4,
          'associate': 25,
          'city': 432,
          'state': 494,
          'usa': 28,
          'emphasis': 3,
          'business': 361,
          'associates': 13,
          'gpa': 111,
          '341': 2,
          '174': 1,
          'hours': 18,
          'quarter': 7,
          'attended': 1,
          'husson': 1,
          'major': 6,
          '78': 1,
          'semester': 3,
          'toward': 1,
          'bachelors': 10,
          'degree': 22,
          'professional': 54,
          'military': 16,
          'comptroller': 1,
          'school': 55,
          '6wk': 1,
          '498': 1,
          'managerial': 13,
          '0998': 2,
     

### Lemmatization: Reducing words to their base form.

In [65]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize NLTK's stop words and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Function to lemmatize text
# (WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed)
def lemmatize_text(text):
    # The text is already lowercased and free of punctuation, so a whitespace split is sufficient
    words = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(lemmatizer.lemmatize(word) for word in words)

# Iterate through the dataframes dictionary and apply lemmatization
for subfolder, df in dataframes.items():
    # Drop any remaining stop words and lemmatize the "Details" column
    df["Details"] = df["Details"].apply(lemmatize_text)


[nltk_data] Downloading package punkt to C:\Users\Vinith
[nltk_data]     MH\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Vinith
[nltk_data]     MH\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Vinith
[nltk_data]     MH\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [66]:
dataframes['ACCOUNTANT']

Unnamed: 0,PDF File Name,Details
0,10554236.pdf,accountant accounting general accounting accou...
1,10674770.pdf,staff accountant accounting account payable ac...
2,11163645.pdf,accountant account payable account receivables...
3,11759079.pdf,senior accountant accounting balance sheet bud...
4,12065211.pdf,senior accountant aderantcms excel quickbooks ...
...,...,...
113,78403342.pdf,accountant accounting balance budget business ...
114,80053367.pdf,general accountant healthcare â sound ethical ...
115,82649935.pdf,senior accountant bachelor art economics city ...
116,87635012.pdf,principal accountant microsoft excel peachtree...


### Utilizing the DistilBERT model and tokenizer to extract embeddings from the "Details" column of each CV DataFrame and replace the text with these embeddings.

In [67]:
import pandas as pd
import torch
from transformers import DistilBertTokenizer, DistilBertModel

# Initialize DistilBERT model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

# Function to extract DistilBERT embeddings
def extract_distilbert_embeddings(text):
    # Tokenize text
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    
    # Generate embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Mean-pool the token embeddings of the last hidden layer into a single vector.
    # Note: this differs from the job description embeddings above, which use the [CLS] token.
    mean_embedding = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
    
    return mean_embedding

# Iterate through the dataframes dictionary and apply DistilBERT embedding extraction
for subfolder, df in dataframes.items():
    # Apply DistilBERT embedding extraction to the 'Details' column and store in the same column
    df['Details'] = df['Details'].apply(extract_distilbert_embeddings)


'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 1f8e1cb8-c7c5-47b0-84a3-b1c6596ce6d4)')' thrown while requesting HEAD https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt


In [68]:
dataframes

{'ACCOUNTANT':     PDF File Name                                            Details
 0    10554236.pdf  [-0.1271236687898636, 0.44556787610054016, 0.2...
 1    10674770.pdf  [-0.07137733697891235, 0.39897966384887695, 0....
 2    11163645.pdf  [-0.22557547688484192, 0.47983312606811523, 0....
 3    11759079.pdf  [-0.1911216825246811, 0.4082939624786377, 0.23...
 4    12065211.pdf  [-0.11686299741268158, 0.34147530794143677, 0....
 ..            ...                                                ...
 113  78403342.pdf  [-0.20568658411502838, 0.49587711691856384, 0....
 114  80053367.pdf  [-0.1782316267490387, 0.3512096703052521, 0.38...
 115  82649935.pdf  [-0.15202154219150543, 0.26726406812667847, -0...
 116  87635012.pdf  [-0.2352742701768875, 0.23847785592079163, 0.0...
 117  98559931.pdf  [-0.41110071539878845, 0.5781598687171936, 0.2...
 
 [118 rows x 2 columns],
 'ADVOCATE':     PDF File Name                                            Details
 0    10186968.pdf  [-0.1140255928039
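
The job description embeddings earlier in the notebook take the vector at position 0 of `last_hidden_state` (the `[CLS]` token), while the CV embeddings here average over all token positions. The two pooling strategies generally produce different vectors for the same text, so it is worth choosing one and applying it to both sides before comparing them. Here is a toy illustration with plain Python lists standing in for one sequence of hidden states (the numbers are invented):

```python
def cls_pooling(hidden_states):
    """Return the vector of the first token ([CLS] in BERT-style models)."""
    return list(hidden_states[0])

def mean_pooling(hidden_states):
    """Return the element-wise average over all token vectors."""
    n_tokens = len(hidden_states)
    return [sum(column) / n_tokens for column in zip(*hidden_states)]

# Three tokens, each with a 2-dimensional hidden vector (toy values)
hidden_states = [
    [1.0, 0.0],  # [CLS]
    [0.0, 2.0],
    [2.0, 4.0],
]

print(cls_pooling(hidden_states))   # [1.0, 0.0]
print(mean_pooling(hidden_states))  # [1.0, 2.0]
```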

### Calculating cosine similarities between job description and CV embeddings, creating new DataFrames with the ranked PDFs and their cosine similarities, exporting these DataFrames to CSV files, and displaying their contents for each position title.

In [74]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

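# Create a mapping between position titles and DataFrame names.
# Note: zip pairs the two sequences purely by position, so this assumes the rows of
# sel_job_desp_df are in the same order as the keys of the dataframes dictionary.
position_title_mapping = dict(zip(sel_job_desp_df['position_title'], dataframes.keys()))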

# Initialize a dictionary to store the dataframes with PDF ranks and cosine similarities
dataframes_with_ranks = {}

# Calculate cosine similarities and create dataframes
for position_title, job_description in zip(sel_job_desp_df['position_title'], sel_job_desp_df['job_description']):
    job_description = np.array(job_description).reshape(1, -1)  # Ensure job_description is a 2D array
    cv_df = dataframes[position_title_mapping[position_title]]  # Get the corresponding DataFrame
    cv_embeddings = np.array(list(cv_df['Details']))  # 'Details' column contains the CV embeddings
    
    # Calculate cosine similarities within the corresponding DataFrame (one score per CV)
    similarities = cosine_similarity(job_description, cv_embeddings)[0]
    
    # Sort the CV positions from most to least similar
    order = np.argsort(similarities)[::-1]
    
    # Create a new DataFrame with PDF ranks and cosine similarities
    # (.iloc selects by position, so this does not depend on the DataFrame's index labels)
    rank_df = pd.DataFrame({'PDF File Name': cv_df['PDF File Name'].iloc[order].to_numpy(),
                            'Cosine Similarity': similarities[order]})
    
    # Store the dataframe in the dictionary
    dataframes_with_ranks[position_title] = rank_df

# Export dataframes to CSV files and/or access them as needed
for position_title, rank_df in dataframes_with_ranks.items():
    # Export the dataframe to a CSV file
    rank_df.to_csv(f'{position_title}_ranked.csv', index=False)
    
    # You can also access the dataframe as needed
    print(f"DataFrame for {position_title}:\n")
    print(rank_df)
    print("\n")


DataFrame for Accountant:

    PDF File Name  Cosine Similarity
0    12442909.pdf           0.510607
1    15592167.pdf           0.458605
2    18365791.pdf           0.445075
3    10554236.pdf           0.439047
4    36024962.pdf           0.439034
..            ...                ...
113  25867805.pdf           0.363885
114  21763056.pdf           0.360377
115  29821051.pdf           0.358728
116  24799301.pdf           0.346606
117  24817041.pdf           0.340470

[118 rows x 2 columns]


DataFrame for Part-time   Patient Advocate:

    PDF File Name  Cosine Similarity
0    10186968.pdf           0.397712
1    29415426.pdf           0.392439
2    20324037.pdf           0.382327
3    73448369.pdf           0.377010
4    15337481.pdf           0.374494
..            ...                ...
113  26071861.pdf           0.305230
114  27182111.pdf           0.305018
115  14176254.pdf           0.305013
116  10344379.pdf           0.298493
117  13967854.pdf           0.276442

[118 rows x 2
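
`dict(zip(...))` pairs the position titles with the folder names purely by position, so the n-th title is matched to the n-th folder whether or not they describe the same field. Given the folder order shown in the export output, the third title ("Account Executive, Urology ...") would apparently be paired with `AGRICULTURE`. A more explicit alternative is a hand-written lookup with a check for missing entries. The title-to-folder pairs below are illustrative assumptions, not a mapping taken from the notebook.

```python
# Hypothetical explicit mapping from position title to CV folder name
title_to_folder = {
    "Accountant": "ACCOUNTANT",
    "Part-time Patient Advocate": "ADVOCATE",
    "Business Banking Relationship Manager": "BANKING",
}

def folder_for_title(title, mapping, available_folders):
    """Look up the CV folder for a title, failing loudly on gaps."""
    folder = mapping.get(title)
    if folder is None:
        raise KeyError(f"No CV folder mapped for position title: {title!r}")
    if folder not in available_folders:
        raise KeyError(f"Mapped folder {folder!r} has no extracted CVs")
    return folder

available = {"ACCOUNTANT", "ADVOCATE", "AGRICULTURE", "APPAREL", "BANKING"}
print(folder_for_title("Business Banking Relationship Manager", title_to_folder, available))
# BANKING
```

Titles without a sensible folder then raise an error instead of being silently compared against unrelated CVs.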

### Calculating cosine similarities between job descriptions and CV embeddings to find and display the top 5 PDF files for each position title, along with their similarity scores.

In [70]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

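# Create a mapping between position titles and DataFrame names.
# Note: zip pairs the two sequences purely by position, so this assumes the rows of
# sel_job_desp_df are in the same order as the keys of the dataframes dictionary.
position_title_mapping = dict(zip(sel_job_desp_df['position_title'], dataframes.keys()))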

# Initialize a dictionary to store the top 5 CVs for each position title
top_5_cvs = {}

# Calculate cosine similarities and find the top 5 CVs for each position title
for position_title, job_description in zip(sel_job_desp_df['position_title'], sel_job_desp_df['job_description']):
    job_description = np.array(job_description).reshape(1, -1)  # Ensure job_description is a 2D array
    cv_df = dataframes[position_title_mapping[position_title]]  # Get the corresponding DataFrame
    cv_embeddings = np.array(list(cv_df['Details']))  # 'Details' column contains the CV embeddings
    
    # Calculate cosine similarities within the corresponding DataFrame (one score per CV)
    similarities = cosine_similarity(job_description, cv_embeddings)[0]
    
    # Pair the five most similar PDF file names with their scores, best first
    # (.iloc selects by position, so this does not depend on the DataFrame's index labels)
    top_indices = np.argsort(similarities)[::-1][:5]
    ranked_pdf_file_names = [(cv_df['PDF File Name'].iloc[i], similarities[i]) for i in top_indices]
    
    # Store the ranked PDF file names along with similarity
    top_5_cvs[position_title] = ranked_pdf_file_names

# Display results
for position_title, top_pdf_similarities in top_5_cvs.items():
    print(f"Top 5 PDF Files for {position_title}:")
    for rank, (pdf_file, similarity) in enumerate(top_pdf_similarities, start=1):
        print(f"Rank {rank}: PDF File: {pdf_file}, Similarity: {similarity:.4f}")
    print("\n")


Top 5 PDF Files for Accountant:
Rank 1: PDF File: 12442909.pdf, Similarity: 0.5106
Rank 2: PDF File: 15592167.pdf, Similarity: 0.4586
Rank 3: PDF File: 18365791.pdf, Similarity: 0.4451
Rank 4: PDF File: 10554236.pdf, Similarity: 0.4390
Rank 5: PDF File: 36024962.pdf, Similarity: 0.4390


Top 5 PDF Files for Part-time   Patient Advocate:
Rank 1: PDF File: 10186968.pdf, Similarity: 0.3977
Rank 2: PDF File: 29415426.pdf, Similarity: 0.3924
Rank 3: PDF File: 20324037.pdf, Similarity: 0.3823
Rank 4: PDF File: 73448369.pdf, Similarity: 0.3770
Rank 5: PDF File: 15337481.pdf, Similarity: 0.3745


Top 5 PDF Files for Account Executive, Urology - Ohio and parts of W.PA, W.VA:
Rank 1: PDF File: 27689009.pdf, Similarity: 0.4393
Rank 2: PDF File: 56068028.pdf, Similarity: 0.4264
Rank 3: PDF File: 19851252.pdf, Similarity: 0.4152
Rank 4: PDF File: 27888251.pdf, Similarity: 0.4149
Rank 5: PDF File: 84512719.pdf, Similarity: 0.4124


Top 5 PDF Files for Business Banking Relationship Manager:
Rank 1: P
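
For reference, the cosine similarity used for the ranking is the dot product of two vectors divided by the product of their lengths, and the top-k selection is a sort on that score. A dependency-free sketch of both steps follows. The vectors and file names are invented, and `top_k_by_cosine` is a hypothetical helper rather than part of the notebook.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_by_cosine(query, candidates, k=5):
    """Return the k (name, score) pairs with the highest similarity to query."""
    scored = ((name, cosine(query, vector)) for name, vector in candidates.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

job_vector = [1.0, 0.0]
cv_vectors = {
    "a.pdf": [2.0, 0.0],  # same direction as the job vector
    "b.pdf": [0.0, 3.0],  # orthogonal to the job vector
    "c.pdf": [1.0, 1.0],  # 45 degrees away from the job vector
}
for name, score in top_k_by_cosine(job_vector, cv_vectors, k=2):
    print(f"{name}: {score:.4f}")
# a.pdf: 1.0000
# c.pdf: 0.7071
```

Because cosine similarity depends only on direction, `a.pdf` scores 1.0 even though its vector is twice as long as the job vector.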