# AI Pipeline for Auto Resume Matching

# Strategy I

### Pre-processing steps:

1. **JSON to DataFrame**: The job description JSON file is converted into a pandas data frame to simplify data manipulation.
2. **Text Cleaning**: The code employs various natural language processing operations such as tokenization, stemming using WordNetLemmatizer, and removal of punctuations. A peculiar pattern of recurring "AAAA" is specifically cleaned from the text. There's a specific focus on retaining certain words, namely 'SAP', 'S4Hana', and 'ICT', and these are excluded from the stopwords list.
3. **Language Translation**: To maintain consistency, any resume in a non-English language is converted to English.
4. **Noise Reduction**: To ensure the text is meaningful and relevant, only words that are recognized in the English dictionary are retained, effectively filtering out any noise or non-sensical terms.

### Overall Strategy:

1. **Feature Extraction via Regex Patterns**: Before embedding the text, extract domain-specific features using predefined regex patterns.
2. **Embedding Texts**: Utilize a pre-trained SentenceTransformer model (`paraphrase-MiniLM-L6-v2`) to embed the textual content.
3. **Chunking Long Texts**: For texts that exceed the model's token limit, divide the text into overlapping chunks, embed each chunk separately, and then average their embeddings.
4. **Combining Embeddings**: Combine the embeddings obtained from the SentenceTransformer model with the domain-specific feature embeddings.
5. **Similarity Calculation**: Compute the cosine similarity between the job requirement's combined embedding and the combined embedding of each resume in the dataset.
6. **Ranking**: Rank the resumes based on their similarity scores to the job requirement.
7. **Output**: Save the detailed similarity scores and rankings to a CSV file and prepare a final submission CSV with the ID and rank.

### Cool Optimizations and Challenges Solved:

1. **Overlapping Chunking**:
    - *Challenge*: Direct chunking could miss some contextual information at the borders.
    - *Solution*: The implementation uses overlapping chunks to ensure that no information is lost at the boundaries between chunks. The average of the embeddings of these overlapping chunks provides a more holistic representation of the content.

2. **Domain-Specific Feature Embeddings**:
    - *Challenge*: Simple embeddings might not capture domain-specific nuances.
    - *Solution*: By defining a set of domain-specific features and using regex patterns to extract these features, the strategy combines traditional feature extraction with modern embedding techniques. This ensures that both general and domain-specific information is captured.

3. **Optimized Tokenization**:
    - *Challenge*: Direct string chunking may not consider word or sentence boundaries.
    - *Solution*: By tokenizing the text first and then creating chunks based on token count, the implementation ensures that words or sentences aren't arbitrarily split, preserving their meaning.

4. **Weighted Job Requirements**:
    - *Challenge*: Some job requirements might be more important than others.
    - *Solution*: The code multiplies the "must_have" and "should_have" requirements by 2, effectively giving them more weight in the embedding process. This highlights the importance of these requirements in the similarity calculations.

5. **Efficient Similarity Calculation**:
    - *Challenge*: Computing cosine similarity in a naive way can be computationally expensive.
    - *Solution*: The use of `util.pytorch_cos_sim` provides an efficient way to compute cosine similarities using PyTorch, thereby speeding up the overall process.

6. **Scalable Ranking System**:
    - *Challenge*: With many resumes, establishing a ranking system can become challenging.
    - *Solution*: The pandas `.rank()` function is employed to efficiently rank resumes based on their similarity scores, ensuring the solution remains scalable as the dataset size grows.


In [17]:
import re
import json
import string
import pandas as pd
import numpy as np
from typing import List, Union, Dict

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords, words
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
from translate import Translator
from sklearn.metrics.pairwise import cosine_similarity


import torch
from sentence_transformers import SentenceTransformer, util

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

[nltk_data] Downloading package stopwords to /home/vr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/vr/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [18]:
def convert_json_to_dataframe(json_file: str) -> pd.DataFrame:
    """Convert a JSON file into a pandas DataFrame."""
    # Open and load the JSON file
    with open(json_file, 'r') as file:
        data = json.load(file)

    records = []

    for item in data:
        # Extracting data from the JSON item and creating a record dictionary
        record = {
            "_id": str(item["_id"]["$oid"]),  # Assuming _id is a string
            "location": item["location"],
            "title": item["title"],
            "organisation": item["organisation"],
            "description": item["description"]
        }

        # Extracting specific_skills from the JSON item and categorizing them
        specific_skills = item["specific_skills"]
        nice_to_have = [skill["title"] for skill in specific_skills if skill["weigth"] == "Nice to have"]
        should_have = [skill["title"] for skill in specific_skills if skill["weigth"] == "Should have"]
        record["nice_to_have"] = nice_to_have
        record["should_have"] = should_have

        # Extracting sector_skills from the JSON item and categorizing them
        sector_skills = item["sector_skills"]
        must_have = [skill["title"] for skill in sector_skills if skill["weigth"] == "Must have"]
        record["must_have"] = must_have

        records.append(record)

    # Creating a pandas DataFrame from the list of records
    df = pd.DataFrame(records)

    return df


def process_string(text: str) -> str:
    """Process a given string by translating to English, removing punctuations,
    lemmatizing each word, and removing stopwords (excluding 'SAP', 'S4Hana', and 'ICT')."""
    # Translating the text to English
    translator = Translator(to_lang='en')
    text = translator.translate(text)

    # Tokenizing the text
    tokens = word_tokenize(text)

    # Create a set of special words you want to keep
    special_words = {'SAP', 'S4Hana', 'ICT'}

    # Converting tokens to lowercase only if they are not in the set of special words
    tokens = [token.lower() if token not in special_words else token for token in tokens]

    # Replace 'sa' with 'SAP'
    tokens = ['SAP' if token.lower() == 'sa' else token for token in tokens]

    # Removing punctuations from the text
    tokens = [''.join([c for c in token if c not in string.punctuation]) for token in tokens]

    # Removing stopwords and lemmatizing each word
    stop_words = set(stopwords.words('english')) - {word.lower() for word in special_words}
    lemmatizer = WordNetLemmatizer()
    processed_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]

    # Removing words that are not in the English dictionary
    english_words = set(words.words())
    processed_tokens = [token for token in processed_tokens if token in english_words or token in special_words]

    # Joining the processed tokens back into a single string
    processed_text = ' '.join(processed_tokens)

    return processed_text


def preprocess_text(data) -> list or str:
    """Preprocess a single string or a list of strings by applying process_string function."""
    if isinstance(data, list):
        # If input is a list, process each string in the list using process_string function
        return [process_string(s) for s in data]
    elif isinstance(data, str):
        # If input is a single string, process it using process_string function
        return process_string(data)
    else:
        # Raise an error if the input is not a string or a list
        raise ValueError("Input must be a string or list of strings")


In [19]:
## Converting Job requirements JSON into a dataframe
json_file = 'job_description_response.json'
df_req = convert_json_to_dataframe(json_file)
df_req.drop(['_id', 'location', "organisation"], axis=1, inplace=True)
df_req['description'] = df_req['description'].apply(preprocess_text)
df_req['nice_to_have'] = df_req['nice_to_have'].apply(preprocess_text)
df_req['should_have'] = df_req['should_have'].apply(preprocess_text)
df_req['must_have'] = df_req['must_have'].apply(preprocess_text)


# Load resume data
df_resume = pd.read_csv("resumes.csv", delimiter=',', quotechar='"')

# Preprocess the resume text
df_resume[' resume_text'] = df_resume[' resume_text'].apply(preprocess_text)

print("Resumes and job description processed...")

df_resume.head()

Resumes and job description processed...


Unnamed: 0,id,resume_text
0,socrates,contact mobile top skill business development ...
1,pythagoras,contact mobile company top skill service manag...
2,heraclitus,business analysis functional analysis requirem...
3,homer,contact top skill business analysis language d...
4,parmenides,business transformation program management inn...


In [22]:

# Define the list of features and corresponding regex patterns or direct strings
features = {
    "s4hana_experience": r'\bS4Hana\b',
    "sap_experience": r'\bSAP\b',
    "energy_sector_experience": r'\b(?:Belgian )?energy sector\b',
    "business_analysis_experience": r'\bbusiness analyst\b',
    "bpm_experience": r'\bBPM\b|business process(?:es)?(?: development| optimization)?\b',
    "stakeholder_management": r'\bstakeholder management\b',
    "process_analysis": r'\banalyz(?:e|ed|es|ing)\b processes\b',
    "documentation_skills": r'\buse cases?\b|user manuals?\b',
    "testing_skills": r'\bsupervis(?:e|ing)? testing\b|testing yourself\b',
    "training_skills": r'\bprovid(?:e|ing)? training\b|supporting trainers\b|training key users\b',
    "writing_skills": r'\bwriting processes\b',
    "training_end_users": r'\btraining end users\b'
}


def extract_features_from_text(text: str, features: dict) -> dict:
    """Extract the presence of features from a text based on regex patterns or direct strings."""
    extracted_features = {}
    for feature, pattern in features.items():
        if re.search(pattern, text, re.IGNORECASE):
            extracted_features[feature] = 1  # Feature present
        else:
            extracted_features[feature] = 0  # Feature not present
    return extracted_features


def feature_dict_to_embedding(feature_dict: dict) -> List[int]:
    """Convert a dictionary of features to a simple embedding (list of 1s and 0s)."""
    return list(feature_dict.values())

def text_to_embedding(model: SentenceTransformer, text: str, features: dict, max_length=512) -> torch.Tensor:
    """ Convert text to its combined embedding using SBERT and additional features with chunking."""
    # Tokenize the text and check if it exceeds the max_length
    tokens = model.tokenize(text)
    
    if len(tokens) > max_length:
        # Split the text into overlapping chunks
        stride = int(max_length / 2)
        chunks = [text[i:i+max_length] for i in range(0, len(tokens), stride)]
        
        embeddings = []
        for chunk in chunks:
            embeddings.append(model.encode(chunk, convert_to_tensor=True))
        
        # Average the embeddings for all chunks
        sbert_embedding = torch.mean(torch.stack(embeddings), dim=0)
    else:
        sbert_embedding = model.encode(text, convert_to_tensor=True)
    
    extracted_features = extract_features_from_text(text, features)
    domain_embedding = torch.Tensor(feature_dict_to_embedding(extracted_features)).to(sbert_embedding.device)
    combined_embedding = torch.cat([sbert_embedding, domain_embedding])
    return combined_embedding

def calculate_similarity(model: SentenceTransformer, req_embedding: torch.Tensor, text: str, features: dict) -> float:
    """ Calculate the cosine similarity between embeddings."""
    text_embedding = text_to_embedding(model, text, features)
    cosine_sim = util.pytorch_cos_sim(req_embedding, text_embedding)
    return cosine_sim.item()

# Initialize the SBERT model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Dataframe df_req contains the job requirements data
job_requirements = (
    df_req["title"] + " " + 
    df_req["description"] + " " + 
    ' '.join(df_req["must_have"][0]*2) + " " + 
    ' '.join(df_req["should_have"][0]*2) + " " + 
    ' '.join(df_req["nice_to_have"][0])
)

# Generate embeddings for the first job requirement
req_embedding = text_to_embedding(model, job_requirements[0], features)


# Calculate the similarities between the job requirement and all resumes
similarities = [
    calculate_similarity(model, req_embedding, resume_row[' resume_text'], features) 
    for _, resume_row in df_resume.iterrows()
]

# Update the resume dataframe with similarity scores
df_resume['similarity'] = similarities

# Rank the resumes based on their similarity scores
df_resume['rank'] = df_resume['similarity'].rank(method='min', ascending=False)

# Save detailed results to a CSV file
df_resume.to_csv('submission_detailed.csv', index=False)

# Prepare a subset dataframe for final submission and save it to a CSV file
df_subset = df_resume[['id', 'rank']].copy()
df_subset['rank'] = df_subset['rank'].astype(int)
df_subset.to_csv('submission.csv', index=False)

In [23]:
df_resume

Unnamed: 0,id,resume_text,similarity,rank
0,socrates,contact mobile top skill business development ...,0.628296,20.0
1,pythagoras,contact mobile company top skill service manag...,0.74247,3.0
2,heraclitus,business analysis functional analysis requirem...,0.493334,29.0
3,homer,contact top skill business analysis language d...,0.676995,13.0
4,parmenides,business transformation program management inn...,0.689107,11.0
5,hesiod,contact company top skill functional analysis ...,0.425243,31.0
6,theodorus,contact top skill business analysis change man...,0.715598,5.0
7,zeno,profile summary manager analyst proven track r...,0.707094,6.0
8,cyrene,curriculum contact data address postal city te...,0.583185,25.0
9,elea,contact home top skill related business genera...,0.660715,16.0


### Conclusion

The provided code effectively integrates advanced sentence embeddings with domain-specific feature extraction to rank resumes against job requirements. By adeptly handling longer texts through chunking and emphasizing critical job criteria, the strategy offers a nuanced yet scalable approach to resume matching. This blend of modern machine learning with traditional text analysis promises a more streamlined and precise recruitment process.

#### Further possible improvements:

1. **Hyperparameter Tuning**: There's potential for performance improvement by fine-tuning parameters, such as the chunk_size. Adjusting this parameter might produce varying results depending on the data.
2. **Weighted Cosine Similarity**: Incorporating a weighted cosine similarity measure might lead to better accuracy, although this would require a benchmark or ground truth to verify the improvements.
3. **Use NER and speech tagging, sentiment analysis**: Identify more important parts and later on integrate that in the scoring system.
4. **Weighted custom features**: We can play around by giving different importance to custom features we extracted.

**NOTE:** We can't do these improvements without getting the validation data or the actual results.

# Strategy II

This strategy is quite complex and needs a lot of computing (can train this with 16GB VRAM, I ran this in collab); I did a lot of optimization, like using half-precision and batching of embedding, etc.

We are not running this but this might give much better performance, especially on larger datasets. 

#### In this strategy, we use both BERT and Sentence transformers, but why?

**BERT:**
BERT is used in this code to extract embeddings from text. The choice of using DistilBERT suggests a preference for a faster and lighter version of BERT. DistilBERT retains much of BERT's performance while being faster and requiring less memory.
Given that BERT is great at understanding the context and semantics of sentences, it's a good choice for segmenting resumes and job descriptions to capture the essence.

**Sentence Transformers:**
While BERT can be used to generate sentence embeddings, it does so by taking the average of the token embeddings, which might not always be the optimal way to represent sentence semantics.
Sentence Transformers (like the one used, paraphrase-MiniLM-L6-v2) are specifically trained for generating sentence embeddings and hence usually provide more accurate sentence-level semantic representations.
The detailed_comparison function leverages Sentence-BERT to compute the similarity between resume sections and the job description.

**NOTE**: Do not run this without 16GB VRAM

In [None]:
# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

bert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
bert_model = DistilBertModel.from_pretrained('distilbert-base-uncased').to(device).half()  # Convert to half precision
sbert_model = SentenceTransformer('paraphrase-MiniLM-L6-v2').to(device).half()  # Convert to half precision


def split_into_chunks(text: Union[pd.Series, str, object], max_length: int) -> List[str]:
    if isinstance(text, pd.Series):
        text = text.to_string()
    elif not isinstance(text, str):
        print(f"Unexpected type: {type(text)}")
        text = str(text)

    sentences = sent_tokenize(text)
    chunks = []
    chunk = []
    length = 0
    
    for sentence in sentences:
        sentence_length = len(sentence.split())
        if length + sentence_length > max_length:
            chunks.append(' '.join(chunk))
            chunk = [sentence]
            length = sentence_length
        else:
            chunk.append(sentence)
            length += sentence_length
    chunks.append(' '.join(chunk))
    return chunks

def bert_text_to_embedding(tokenizer, model, text, max_length=512):
    tokens = tokenizer.tokenize(text)
    chunks = [tokens[i:i + max_length] for i in range(0, len(tokens), max_length)]
    chunk_embeddings = []
    
    with torch.no_grad():
        model.eval()
        for batch in chunks:
            if not batch:
                continue
            batch_inputs = tokenizer(batch, return_tensors="pt", padding='max_length', max_length=max_length, truncation=True)
            
            # Just move input_ids to the appropriate device without converting to half precision
            batch_inputs['input_ids'] = batch_inputs['input_ids'].to(device)
            # Convert attention masks to half precision (optional, but for consistency)
            batch_inputs['attention_mask'] = batch_inputs['attention_mask'].to(device).half()
            
            batch_embedding = model(**batch_inputs).last_hidden_state.mean(dim=1)
            chunk_embeddings.append(batch_embedding.cpu())

    if not chunk_embeddings:
        return torch.zeros((model.config.hidden_size,), device=device).half()  # Convert to half precision
    
    chunk_embeddings = torch.cat(chunk_embeddings)
    document_embedding = torch.mean(chunk_embeddings, dim=0)
    return document_embedding.to(device).half()  # Convert to half precision



def extract_relevant_sections(tokenizer, model, text, job_description_embedding):
    sections = split_into_chunks(text, 32)
    section_scores = []

    for section in sections:
        section_embedding = bert_text_to_embedding(tokenizer, model, section)
        similarity = torch.cosine_similarity(section_embedding, job_description_embedding, dim=0).item()
        section_scores.append((section, similarity))

    sorted_sections = sorted(section_scores, key=lambda x: x[1], reverse=True)
    top_sections = [section[0] for section in sorted_sections[:5]]
    return " ".join(top_sections)

def detailed_comparison(model, text1, text2):
    embedding1 = sbert_text_to_embedding(model, text1)
    embedding2 = sbert_text_to_embedding(model, text2)
    similarity = util.pytorch_cos_sim(embedding1, embedding2).item()
    return similarity

def sbert_text_to_embedding(model, text):
    embedding = model.encode(text, convert_to_tensor=True)
    return embedding

# Assuming you have loaded your dataframes df_req and df_resume
job_description = df_req["title"].iloc[0] + " " + df_req["description"].iloc[0] + " " + ' '.join(df_req["must_have"].iloc[0]*2) + " " + ' '.join(df_req["should_have"].iloc[0]*2) + " " + ' '.join(df_req["nice_to_have"].iloc[0])
job_description_embedding_bert = bert_text_to_embedding(bert_tokenizer, bert_model, job_description, max_length=32)

similarities = []

for _, resume_row in df_resume.iterrows():
    relevant_resume_parts = extract_relevant_sections(bert_tokenizer, bert_model, resume_row[' resume_text'], job_description_embedding_bert)
    similarity = detailed_comparison(sbert_model, relevant_resume_parts, job_description)
    similarities.append(similarity)

    torch.cuda.empty_cache()

df_resume['similarity'] = similarities


# Strategy III

Combining all these techniques, like BERT, combined with Named entity recognition, part-of-speech tagging (POS), and sentiment analysis, can be used to extract relevant information. And later on, with multiple testing, we can design a scoring system that gives different importances to a different part of the job. Right now, most things are treated equally unless specified otherwise. The system could be more robust by deeply analyzing the text and identifying parts of speech that do not contribute to calculating matching scores.

Here's the pre-processing function with NER, but because I don't have any validation data, so I don't know whether I should use this or not. Because it is slow and might decrease performance. In the above code replace the below-commented function, if you wish to run NER.

**NOTE:** This needs spacy library

In [None]:
import spacy

def process_string(text: str) -> str:
    """
    Process a given string by translating to English, removing punctuations,
    removing specific words ('a'), tokenizing, lemmatizing each word,
    removing stopwords, and removing named entities.

    Args:
        text (str): The input text to be processed.

    Returns:
        str: The processed string.
    """
    # Translating the text to English
    translator = Translator(to_lang='en')
    text = translator.translate(text)

    # Converting text to lowercase
    text = text.lower()

    # Removing punctuations from the text
    text = ''.join([c for c in text if c not in string.punctuation])

    # Removing the word 'a' if it appears as a separate word (using regex '\ba+\b')
    text = re.sub(r'\ba+\b', '', text)

    # Tokenizing the text
    tokens = word_tokenize(text)

    # Removing stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatizing and removing named entities
    nlp = spacy.load("en_core_web_sm")
    lemmatizer = WordNetLemmatizer()
    processed_tokens = []
    for token in tokens:
        doc = nlp(token)
        if not doc.ents:
            processed_tokens.append(lemmatizer.lemmatize(token))

    # Joining the processed tokens back into a single string
    processed_text = ' '.join(processed_tokens)

    return processed_text

# Strategy IV

Use LLM, I tested LLMs, and they can easily get all the relevant information form all the continuous resume text. Once we have the text, we can use simple keyword and rule-based matching to generate a score for each resume or even use the entire above pipeline, this will wokr the best because now every irrelevant information would have already been removed. To pull this of we can combine all the above strategy and can create a hybrid model out of it.

I didn't use this approach as this requires access to CHATGPT API, which is a paid service.