# Semantic search with sentence-transformers over PDFs

This notebook walks through extracting text from PDF files and running semantic search over the extracted text, in ten steps:

1. Load PDF File

2. Extract Text

3. Preprocess Text: removing irrelevant characters, converting text to lowercase, lemmatization, etc.

4. Structure Text: organize the preprocessed text into a more structured format

5. Vectorize Text: Essential for semantic similarity calculations

6. Build Search Index

7. Perform Semantic Search

8. Search Index

9. Rank Results

10. Return Results
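
Steps 5 to 10 amount to scoring a query vector against every stored vector and returning the best matches. The following is a minimal sketch of that idea, assuming the embeddings have already been computed. The three-dimensional vectors, the document ids, and the helper names `cosine` and `search` are hypothetical and chosen only to make the ranking easy to follow.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """Score every indexed vector against the query and return the top_k best."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # rank by similarity
    return scored[:top_k]

# Toy index: document id -> embedding (made-up values, not model output)
index = {
    "cv_a": [1.0, 0.0, 0.0],
    "cv_b": [0.8, 0.6, 0.0],
    "cv_c": [0.0, 0.0, 1.0],
}
results = search([1.0, 0.1, 0.0], index, top_k=2)
print(results)
```

A linear scan like this is adequate for a few thousand documents. For larger collections a dedicated vector index is the usual choice.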

In [71]:
!pip install sentence_transformers


In [None]:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
from sentence_transformers.util import cos_sim

In [72]:
def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def divide_text_cv_in_parts(text_cv, n_parts=5):
    """Split the sentences of a spaCy Doc into n_parts consecutive groups."""
    sentences = sent_tokenize(text_cv.text)
    step = len(sentences) // n_parts
    parts = []
    for i in range(n_parts):
        if i == n_parts - 1:
            # The last part takes whatever sentences are left over
            parts.append(sentences[i*step:])
        else:
            parts.append(sentences[i*step:(i+1)*step])
    return parts

def get_sentences(PATH):
    """Return the sentences found on the first page of the PDF at PATH."""
    with open(PATH, 'rb') as pdf_file:
        first_page_text = PyPDF2.PdfReader(pdf_file).pages[0].extract_text()
    text_cv = nlp(first_page_text)
    sentences = sent_tokenize(text_cv.text)  # split into sentences
    return sentences

def get_cls_embeddings(PATH1, PATH2):
    # Load the PDF file
    sentences_1 = get_sentences(PATH1)
    sentences_2 = get_sentences(PATH2)
    
    # Tokenize the input texts
    batch_dict_1 = tokenizer(sentences_1, max_length=512, padding=True, truncation=True, return_tensors='pt')
    batch_dict_2 = tokenizer(sentences_2, max_length=512, padding=True, truncation=True, return_tensors='pt')
    
    # Compute the embeddings
    outputs_1 = model(**batch_dict_1)
    outputs_2 = model(**batch_dict_2)
    
    # Average pool the embeddings across the tokens
    embeddings_1 = average_pool(outputs_1.last_hidden_state, batch_dict_1.attention_mask)
    embeddings_2 = average_pool(outputs_2.last_hidden_state, batch_dict_2.attention_mask)

    # (Optionally) normalize embeddings
    embeddings_1 = F.normalize(embeddings_1, p=2, dim=1)
    embeddings_2 = F.normalize(embeddings_2, p=2, dim=1)
    
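    # Row 0 is the mean-pooled embedding of the first sentence only;
    # despite the variable names, it is not a CLS-token embedding
    cls_1 = embeddings_1[0]
    cls_2 = embeddings_2[0]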
    
    return cls_1, cls_2


def compute_cos_sim(PATH1, PATH2):
    # Get the embeddings
    cls_1, cls_2 = get_cls_embeddings(PATH1, PATH2)
    # Compute cosine similarity between the two embeddings
    sim_cos = cos_sim(cls_1, cls_2)
    return sim_cos
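
To make the pooling in `average_pool` concrete, here is a small worked example in plain Python. It is a sketch of the same arithmetic on a single sequence, not a replacement for the tensor version. The helper name `masked_mean` and the toy vectors are hypothetical.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(column) / len(kept) for column in zip(*kept)]

# One sequence of four token vectors; the last two are padding (mask = 0)
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padded positions do not affect the result, which is why `average_pool` zeroes them out and divides by the number of real tokens rather than by the padded length.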


In [None]:
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
model = AutoModel.from_pretrained("thenlper/gte-small")

## 1. Load PDF File

In [1]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 232.6/232.6 kB 527.3 kB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [55]:
import PyPDF2
import glob

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from gensim.utils import simple_preprocess

import spacy


In [4]:
PATHs = glob.glob('../PDF_CV_DATASET/data/data/*/*.pdf')

In [6]:
PATH = PATHs[0]
# Load the PDF file; the context manager closes the file handle afterwards
with open(PATH, 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    # Extract text from each page
    for page in reader.pages:
        text = page.extract_text()
        print(text)

ADULT EDUCATION INSTRUCTOR
Summary
Seasoned Agriculture Teacher with more than 20 years of experience in this world of education. Excellent teaching and leadership skills. Track
record of achieving exceptional results in not only FFA programs but also Credit Recovery Programs at my current high school and program
improvement in numbers at not only Covina High School but also Bloomington High School. I have also been involved with bringing to life the
Adult Education Program in the Colton Joint Unified School District.Â Â 
 Compassionate teacher excited to take on new professional challenges
and assist studentsÂ in improving learning skills, and abilities. Hardworking and responsible professional adept at crisis response and activity
planning.
Experience
Company Name
 
City
 
, 
State
 
Adult Education Instructor
 
08/2016
 
to 
Current
 
Developed a diploma program that fit the needs of the community,
continues to work with the community and wants to see the students succeed move on in

### 1.1 Libraries to preprocess the text

- lemmatization
- tokenization
- removing irrelevant characters
- converting text to lowercase
- removing stopwords
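
As a rough illustration of three of these steps (lowercasing, removing irrelevant characters, removing stopwords), here is a sketch that uses only the standard library. The stopword set is a tiny hypothetical list for demonstration, and lemmatization is left out because it needs a language model such as the spaCy pipeline.

```python
import re

# Tiny illustrative stopword list (hypothetical); real pipelines use a fuller one
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "over"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = preprocess("The quick brown fox jumped over the lazy dog.")
print(tokens)  # ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```

Aggressive preprocessing of this kind suits keyword-based methods. Transformer embedding models are generally fed text that is much closer to its raw form.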

In [7]:

text = "The quick brown fox jumped over the lazy dog."

# Preprocess the text
preprocessed_text = simple_preprocess(text)
print(preprocessed_text)


['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']


In [21]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 763.5 kB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [30]:

# nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_md")

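# Sample texts to try; each assignment overwrites the previous one,
# so only the last text is actually processed below
text = "The quick brown fox jumped over the lazy dog on the sofa."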
text = "Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware. It is considered one of the Big Four technology companies alongside Amazon, Apple, and Microsoft"
text = "The University of Trento is an Italian university located in Trento and nearby Rovereto. It has been able to achieve considerable results in didactics, research, and international relations according to CENSIS and the Italian Ministry of Education."
# Process the text using spaCy's pipeline
doc = nlp(text)

# Access the tokens and their lemmas
print([(token.text, token.lemma_) for token in doc])
print([token.lemma_ for token in doc])

# Access the named entities and their labels
print([(ent.text, ent.label_) for ent in doc.ents])


[('The', 'the'), ('University', 'University'), ('of', 'of'), ('Trento', 'Trento'), ('is', 'be'), ('an', 'an'), ('Italian', 'italian'), ('university', 'university'), ('located', 'locate'), ('in', 'in'), ('Trento', 'Trento'), ('and', 'and'), ('nearby', 'nearby'), ('Rovereto', 'Rovereto'), ('.', '.'), ('It', 'it'), ('has', 'have'), ('been', 'be'), ('able', 'able'), ('to', 'to'), ('achieve', 'achieve'), ('considerable', 'considerable'), ('results', 'result'), ('in', 'in'), ('didactics', 'didactic'), (',', ','), ('research', 'research'), (',', ','), ('and', 'and'), ('international', 'international'), ('relations', 'relation'), ('according', 'accord'), ('to', 'to'), ('CENSIS', 'CENSIS'), ('and', 'and'), ('the', 'the'), ('Italian', 'italian'), ('Ministry', 'Ministry'), ('of', 'of'), ('Education', 'Education'), ('.', '.')]
['the', 'University', 'of', 'Trento', 'be', 'an', 'italian', 'university', 'locate', 'in', 'Trento', 'and', 'nearby', 'Rovereto', '.', 'it', 'have', 'be', 'able', 'to', 'ach

### 1.2 Extract and preprocess text

In [34]:
PATH = PATHs[0]
# Only the first page of the CV is extracted and passed through the spaCy pipeline
with open(PATH, 'rb') as pdf_file:
    text_cv = nlp(PyPDF2.PdfReader(pdf_file).pages[0].extract_text())

In [36]:
print([(ent.text, ent.label_) for ent in text_cv.ents])

[('more than 20 years', 'DATE'), ('FFA', 'ORG'), ('Credit Recovery Programs', 'ORG'), ('Covina', 'GPE'), ('Bloomington High School', 'ORG'), ('Adult Education Program', 'ORG'), ('the Colton Joint Unified School District', 'ORG'), ('State\n \nAgriculture/Credit Recovery', 'ORG'), ('08/2000', 'CARDINAL'), ('Goal Setting Established', 'ORG'), ('Parent Communication Regularly', 'ORG'), ('Student-Centered Curriculum Planning Developed', 'ORG'), ('mid-semester', 'DATE'), ('year', 'DATE'), ('80%', 'PERCENT'), ('08/2000', 'CARDINAL'), ('80%', 'PERCENT'), ('Cal Poly', 'ORG'), ('Pomona', 'GPE'), ('Pomona', 'GPE'), ('CA', 'GPE'), ('USA Community Involvement Been', 'ORG'), ('4Hfor', 'CARDINAL'), ('the last 12 years', 'DATE'), ('the San Bernardino County Fair', 'LOC'), ('Lesson Planning', 'ORG'), ('National Education Association', 'ORG'), ('NEA', 'ORG'), ('1995', 'DATE'), ('1995', 'DATE'), ('Skills\nExcellent', 'ORG'), ('Ag', 'ORG'), ('Community Service', 'ORG')]


In [56]:
print([token.lemma_ for token in text_cv])

# split into sentences
sentences = sent_tokenize(text_cv.text)
print(sentences)

# split into words
tokens = word_tokenize(text_cv.text)
print(tokens)

len(sentences)

['adult', 'education', 'instructor', '\n', 'Summary', '\n', 'Seasoned', 'Agriculture', 'Teacher', 'with', 'more', 'than', '20', 'year', 'of', 'experience', 'in', 'this', 'world', 'of', 'education', '.', 'excellent', 'teaching', 'and', 'leadership', 'skill', '.', 'track', '\n', 'record', 'of', 'achieve', 'exceptional', 'result', 'in', 'not', 'only', 'FFA', 'program', 'but', 'also', 'Credit', 'Recovery', 'Programs', 'at', 'my', 'current', 'high', 'school', 'and', 'program', '\n', 'improvement', 'in', 'number', 'at', 'not', 'only', 'Covina', 'High', 'School', 'but', 'also', 'Bloomington', 'High', 'School', '.', 'I', 'have', 'also', 'be', 'involve', 'with', 'bring', 'to', 'life', 'the', '\n', 'Adult', 'Education', 'Program', 'in', 'the', 'Colton', 'Joint', 'Unified', 'School', 'District', '.', 'Â', 'Â', '\n ', 'compassionate', 'teacher', 'excite', 'to', 'take', 'on', 'new', 'professional', 'challenge', '\n', 'and', 'assist', 'studentsÂ', 'in', 'improve', 'learning', 'skill', ',', 'and', 'a

26

In [57]:
sentences

['ADULT EDUCATION INSTRUCTOR\nSummary\nSeasoned Agriculture Teacher with more than 20 years of experience in this world of education.',
 'Excellent teaching and leadership skills.',
 'Track\nrecord of achieving exceptional results in not only FFA programs but also Credit Recovery Programs at my current high school and program\nimprovement in numbers at not only Covina High School but also Bloomington High School.',
 'I have also been involved with bringing to life the\nAdult Education Program in the Colton Joint Unified School District.Â Â \n Compassionate teacher excited to take on new professional challenges\nand assist studentsÂ in improving learning skills, and abilities.',
 'Hardworking and responsible professional adept at crisis response and activity\nplanning.',
 'Experience\nCompany Name\n \nCity\n \n, \nState\n \nAdult Education Instructor\n \n08/2016\n \nto \nCurrent\n \nDeveloped a diploma program that fit the needs of the community,\ncontinues to work with the community an
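
The sentences above still carry line breaks from the PDF layout and stray `Â` characters, which are most likely mis-decoded non-breaking spaces. Below is a minimal cleanup sketch that could run before sentence splitting. The helper name `clean_extracted_text` is hypothetical, and dropping every `Â` assumes the text contains no legitimate uses of that letter.

```python
import re

def clean_extracted_text(text):
    """Remove mojibake left by the PDF extraction and collapse whitespace."""
    text = text.replace("\u00c2", "")   # stray 'Â' (assumed to be decoding debris)
    text = text.replace("\u00a0", " ")  # genuine non-breaking spaces
    return re.sub(r"\s+", " ", text).strip()

raw = "Track\nrecord of achieving exceptional results.Â Â \n Compassionate teacher"
cleaned = clean_extracted_text(raw)
print(cleaned)
```

Collapsing newlines also merges headings such as `Summary` into the following sentence, so a CV-specific pipeline may prefer to split on section headings first.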

In [50]:
parts = divide_text_cv_in_parts(text_cv, n_parts=5)

In [51]:
parts

[['ADULT EDUCATION INSTRUCTOR\nSummary\nSeasoned Agriculture Teacher with more than 20 years of experience in this world of education.',
  'Excellent teaching and leadership skills.',
  'Track\nrecord of achieving exceptional results in not only FFA programs but also Credit Recovery Programs at my current high school and program\nimprovement in numbers at not only Covina High School but also Bloomington High School.',
  'I have also been involved with bringing to life the\nAdult Education Program in the Colton Joint Unified School District.Â Â \n Compassionate teacher excited to take on new professional challenges\nand assist studentsÂ in improving learning skills, and abilities.',
  'Hardworking and responsible professional adept at crisis response and activity\nplanning.'],
 ['Experience\nCompany Name\n \nCity\n \n, \nState\n \nAdult Education Instructor\n \n08/2016\n \nto \nCurrent\n \nDeveloped a diploma program that fit the needs of the community,\ncontinues to work with the commu
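
`divide_text_cv_in_parts` uses an integer step, so when a CV has fewer sentences than `n_parts` the step is 0 and every part except the last comes back empty. One possible alternative, sketched here with a hypothetical helper name `split_into_parts`, spreads the remainder over the first parts and never returns more parts than there are items.

```python
def split_into_parts(items, n_parts=5):
    """Split a list into at most n_parts contiguous chunks of near-equal size."""
    n_parts = max(1, min(n_parts, len(items)))  # never more parts than items
    base, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # first `extra` parts get one more item
        parts.append(items[start:start + size])
        start += size
    return parts

# 26 items, matching the sentence count of the CV above
sizes = [len(part) for part in split_into_parts(list(range(26)), n_parts=5)]
print(sizes)  # [6, 5, 5, 5, 5]
```

With 26 sentences the original function yields part sizes 5, 5, 5, 5, 6, so the two approaches differ only in where the extra sentence lands.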

In [61]:
text_cv

ADULT EDUCATION INSTRUCTOR
Summary
Seasoned Agriculture Teacher with more than 20 years of experience in this world of education. Excellent teaching and leadership skills. Track
record of achieving exceptional results in not only FFA programs but also Credit Recovery Programs at my current high school and program
improvement in numbers at not only Covina High School but also Bloomington High School. I have also been involved with bringing to life the
Adult Education Program in the Colton Joint Unified School District.Â Â 
 Compassionate teacher excited to take on new professional challenges
and assist studentsÂ in improving learning skills, and abilities. Hardworking and responsible professional adept at crisis response and activity
planning.
Experience
Company Name
 
City
 
, 
State
 
Adult Education Instructor
 
08/2016
 
to 
Current
 
Developed a diploma program that fit the needs of the community,
continues to work with the community and wants to see the students succeed move on in

In [81]:
print(PATHs[0], PATHs[1])   
print(compute_cos_sim(PATHs[0], PATHs[1]))

print(PATHs[0], PATHs[0])   
print(compute_cos_sim(PATHs[0], PATHs[0]))

print(PATHs[0], PATHs[100])   
print(compute_cos_sim(PATHs[0], PATHs[100]))


../PDF_CV_DATASET/data/data/AGRICULTURE/37201447.pdf ../PDF_CV_DATASET/data/data/AGRICULTURE/12674256.pdf
tensor([[0.8080]], grad_fn=<MmBackward0>)
../PDF_CV_DATASET/data/data/AGRICULTURE/37201447.pdf ../PDF_CV_DATASET/data/data/AGRICULTURE/37201447.pdf
tensor([[1.]], grad_fn=<MmBackward0>)
../PDF_CV_DATASET/data/data/AGRICULTURE/37201447.pdf ../PDF_CV_DATASET/data/data/ARTS/73030450.pdf
tensor([[0.8190]], grad_fn=<MmBackward0>)
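
Note that `get_cls_embeddings` returns only row 0 of each embedding matrix, so these scores compare the first sentence of each CV rather than the whole document. That may be one reason the ARTS CV (0.8190) scores slightly higher than the second AGRICULTURE CV (0.8080). A possible alternative is to average all sentence embeddings into one document vector before comparing. The sketch below shows the difference on made-up two-dimensional vectors. The names `mean_vector` and `cosine` are hypothetical helpers and the numbers are not model output.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical sentence embeddings for two documents
doc_1 = [[1.0, 0.0], [0.0, 1.0]]
doc_2 = [[1.0, 0.0], [1.0, 0.0]]

first_sentence_sim = cosine(doc_1[0], doc_2[0])                 # first sentences only
document_sim = cosine(mean_vector(doc_1), mean_vector(doc_2))   # whole documents
print(first_sentence_sim, document_sim)
```

Here the first sentences are identical, so the first-sentence score is 1.0, while the document-level score is lower because the remaining sentences differ.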
