## TF-IDF embedding + similarity

In [15]:
import pandas as pd
import numpy as np

df = pd.read_pickle("data/processed_resumes.pkl")
df_jd = pd.read_pickle("data/processed_jds.pkl")


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [17]:
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    max_features=5000,
    stop_words="english"
)
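
To make the effect of these settings concrete, here is a minimal sketch on a made-up two-document corpus. The toy sentences and the `toy_` names are illustrative assumptions, not part of the dataset. It shows that stop words such as "with" are dropped and that bigrams are built from the remaining tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to inspect the resulting vocabulary.
toy_corpus = [
    "python developer with machine learning experience",
    "senior python developer",
]

# Same n-gram and stop-word settings as above; max_features is left out
# because the toy vocabulary is far smaller than 5000 terms.
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_corpus)

# vocabulary_ maps each unigram/bigram to its column index.
toy_vocab = sorted(toy_tfidf.vocabulary_)
print(toy_vocab)
print("Matrix shape (documents, terms):", toy_matrix.shape)
```

Because stop words are removed before n-grams are formed, a bigram can join two words that were not adjacent in the original text.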


In [18]:
corpus = (
    df["cleaned_resume"].tolist() +
    df_jd["cleaned_jd"].tolist()
)

tfidf.fit(corpus)


TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')


In [19]:
resume_tfidf = tfidf.transform(df["cleaned_resume"])
jd_tfidf = tfidf.transform(df_jd["cleaned_jd"])


In [20]:
jd_idx = 0  # index of the job description to score against; keep it the same for every similarity method

tfidf_scores = cosine_similarity(
    resume_tfidf,
    jd_tfidf[jd_idx]
).flatten()

df["tfidf_similarity"] = tfidf_scores
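
As a rough illustration of what these scores mean, the sketch below computes cosine similarity by hand with NumPy on small made-up vectors and ranks the rows by score. The helper `cosine_scores` and the toy arrays are assumptions for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_scores(doc_matrix, query_vector):
    """Cosine similarity between each row of doc_matrix and query_vector."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vector = np.asarray(query_vector, dtype=float)
    dots = doc_matrix @ query_vector
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vector)
    # All-zero vectors have no direction; give them a score of 0.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Hypothetical term-weight vectors for three documents and one query.
toy_docs = [[1, 0, 1], [0, 1, 0], [2, 0, 2]]
toy_query = [1, 0, 1]

toy_scores = cosine_scores(toy_docs, toy_query)
toy_ranking = np.argsort(-toy_scores, kind="stable")  # best match first

print(toy_scores)
print(toy_ranking)
```

The first and third toy documents point in the same direction as the query, so they receive the same score even though their magnitudes differ; the score depends on the relative weights of terms, not on document length.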


In [21]:
# Save the TF-IDF similarity scores alongside the resumes
df.to_pickle("data/resumes_with_tfidf.pkl")
print("TF-IDF scores saved")


TF-IDF scores saved


In [22]:
del resume_tfidf
del jd_tfidf


In [23]:
from sentence_transformers import SentenceTransformer



In [24]:
print("Loading SBERT model...")
sbert_model = SentenceTransformer("all-MiniLM-L6-v2")
print("Model loaded")


Loading SBERT model...
Model loaded


In [25]:
resume_texts = df["cleaned_resume"].tolist()
jd_texts = df_jd["cleaned_jd"].tolist()


In [26]:
print("Encoding test batch...")

test_embeddings = sbert_model.encode(
    resume_texts[:5],
    convert_to_numpy=True
)

print("Test embeddings shape:", test_embeddings.shape)


Encoding test batch...
Test embeddings shape: (5, 384)


In [27]:
resume_embeddings = sbert_model.encode(
    resume_texts,
    batch_size=16,
    convert_to_numpy=True,
    show_progress_bar=True
)

jd_embeddings = sbert_model.encode(
    jd_texts,
    batch_size=16,
    convert_to_numpy=True,
    show_progress_bar=True
)


Batches: 100%|██████████| 61/61 [00:47<00:00,  1.29it/s]
Batches: 100%|██████████| 3/3 [00:01<00:00,  1.67it/s]


In [28]:
import os

os.makedirs("embeddings", exist_ok=True)  # make sure the output directory exists
np.save("embeddings/resume_embeddings.npy", resume_embeddings)
np.save("embeddings/jd_embeddings.npy", jd_embeddings)

print("Embeddings saved")


Embeddings saved


In [29]:
del resume_embeddings
del jd_embeddings
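
Since the embeddings are deleted from memory after saving, a later cell would need to reload them before scoring. The following is a minimal sketch of one way to do that, mirroring the TF-IDF step; it uses random stand-in arrays and a temporary directory so it runs on its own, and the `demo_` names and `l2_normalize` helper are assumptions rather than the notebook's actual code.

```python
import os
import tempfile

import numpy as np

# Random stand-ins with the same width (384) as the all-MiniLM-L6-v2 embeddings.
rng = np.random.default_rng(0)
demo_resumes = rng.normal(size=(4, 384)).astype(np.float32)
demo_jds = rng.normal(size=(2, 384)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    resume_path = os.path.join(tmp_dir, "resume_embeddings.npy")
    jd_path = os.path.join(tmp_dir, "jd_embeddings.npy")
    np.save(resume_path, demo_resumes)
    np.save(jd_path, demo_jds)
    # Reload from disk, as a later cell would do with the real files.
    loaded_resumes = np.load(resume_path)
    loaded_jds = np.load(jd_path)

def l2_normalize(matrix):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

jd_idx = 0  # same job description index as in the TF-IDF step
sbert_scores = l2_normalize(loaded_resumes) @ l2_normalize(loaded_jds)[jd_idx]
print("SBERT scores shape:", sbert_scores.shape)
```

With the real files, the scores could be stored in a new DataFrame column next to `tfidf_similarity` so the two methods can be compared row by row.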
