Install all the requirements for this notebook

In [None]:
!pip install PyMuPDF
!pip install textract
!pip install python-docx
!pip install pdf2image
!pip install tiktoken
!sudo apt-get install poppler-utils
!pip install paddlepaddle
!apt-get update
!apt-get install -y libssl-dev
!wget http://nz2.archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2.19_amd64.deb
!sudo dpkg -i libssl1.1_1.1.1f-1ubuntu2.19_amd64.deb

In [4]:
import json
import os
from collections import OrderedDict
import re
from docx import Document
import textract
import fitz  # PyMuPDF
import pandas as pd
import math
import tiktoken

def read_document(file_path):
    """Extract plain text from a .docx, .doc, .pdf, .xls, or .xlsx file."""
    file_path = str(file_path)
    _, file_extension = os.path.splitext(file_path)
    # Normalize once so that upper-case extensions such as ".DOCX" are handled too
    file_extension = file_extension.lower()
    text = ""
    if file_extension == '.docx':
        doc = Document(file_path)
        for para in doc.paragraphs:
            text = text + para.text + " "
    elif file_extension == '.doc':
        text = textract.process(file_path).decode()
    elif file_extension == '.pdf':
        doc = fitz.open(file_path)
        for page_number in range(len(doc)):
            page = doc[page_number]
            text = text + page.get_text() + " "
    elif file_extension in ['.xls', '.xlsx']:
        data = pd.read_excel(file_path)
        text = data.to_string(index=False)
    else:
        print(f"Unsupported file type: {file_extension}")

    return text

resume_path = "/content/photo_cv2.pdf"
read_document(resume_path)

' '

**Note:**
Here the text extracted from the PDF is effectively empty (a single space). The most likely reason is that this is a photo CV: its pages are images without an embedded text layer, so there is nothing for PyMuPDF to read.
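
One way to act on this observation is to test whether the extracted text is effectively empty before falling back to OCR. The sketch below is only an illustration: the helper name `needs_ocr` and the `min_chars` threshold are assumptions, not part of the original notebook.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Return True when text extraction produced too little to be useful.

    `needs_ocr` and `min_chars` are hypothetical names chosen for this sketch;
    the threshold is arbitrary and should be tuned on real documents.
    """
    # Strip whitespace so that results such as ' ' count as empty.
    meaningful = extracted_text.strip()
    return len(meaningful) < min_chars


# The photo CV above yielded only a single space, so it would be routed to OCR.
print(needs_ocr(" "))
```

A check like this could sit between `read_document` and the image conversion step, so that only documents without a usable text layer go through the slower OCR path.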

The code defines a function **pdf_to_jpg** that converts each page of a given PDF file into a separate JPG image. Using the convert_from_path function from the **pdf2image** library, it renders every page, saves the images to the specified output directory (the current directory by default), and returns the list of saved paths. The example at the end calls the function with the **resume_path** variable defined in the earlier cell and prints the paths of the resulting image files.

In [5]:
from pdf2image import convert_from_path

def pdf_to_jpg(pdf_path, output_folder="."):
    """
    Convert a PDF into JPG images.

    Args:
    - pdf_path (str): The path to the PDF file.
    - output_folder (str, optional): The path where the JPG files should be saved.
                                     Defaults to the current directory.

    Returns:
    - List[str]: List of paths to the created JPG images.
    """
    images = convert_from_path(pdf_path)
    image_paths = []

    for i, image in enumerate(images):
        image_path = f"{output_folder}/output_page_{i + 1}.jpg"
        image.save(image_path, "JPEG")
        image_paths.append(image_path)

    return image_paths

# Convert the given PDF to JPG
jpg_paths = pdf_to_jpg(resume_path, "/content/")
print(jpg_paths)

['/content//output_page_1.jpg']
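
The doubled slash in the path above appears because the folder `"/content/"` already ends in `/` and the f-string adds another one. It is harmless on Linux, but `os.path.join` avoids it. The following is a small sketch; `page_image_path` is a hypothetical helper, not part of the notebook.

```python
import os


def page_image_path(output_folder, page_index):
    """Build the JPG path for a zero-based page index (hypothetical helper)."""
    # os.path.join inserts a separator only when one is needed.
    return os.path.join(output_folder, f"output_page_{page_index + 1}.jpg")


# Both spellings of the folder give a path without a doubled separator.
print(page_image_path("/content/", 0))
print(page_image_path("/content", 0))
```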


Clone the PaddleOCR repository, which is used to extract text from the page images.

In [6]:
!git clone https://github.com/PaddlePaddle/PaddleOCR.git

Cloning into 'PaddleOCR'...
remote: Enumerating objects: 47218, done.
remote: Counting objects: 100% (462/462), done.
remote: Compressing objects: 100% (276/276), done.
remote: Total 47218 (delta 276), reused 322 (delta 184), pack-reused 46756
Receiving objects: 100% (47218/47218), 343.41 MiB | 38.60 MiB/s, done.
Resolving deltas: 100% (33126/33126), done.


In [8]:
cd PaddleOCR

/content/PaddleOCR


In [9]:
!pip install -r requirements.txt

Collecting pyclipper (from -r requirements.txt (line 4))
  Downloading pyclipper-1.3.0.post5-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (908 kB)
Collecting lmdb (from -r requirements.txt (line 5))
  Downloading lmdb-1.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (299 kB)
Collecting visualdl (from -r requirements.txt (line 8))
  Downloading visualdl-2.5.3-py3-none-any.whl (6.3 MB)
Collecting rapidfuzz (from -r requirements.txt (line 9))
  Downloading rapidfuzz-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)

In [10]:
# Import all the important libraries
import paddleocr
from paddleocr import PaddleOCR

The function **extract_text** processes the nested list returned by PaddleOCR. The first element of that list holds the results for the image, and each item in it pairs a bounding box with a (text, confidence) tuple. The function reads the text from the second element of each item, appends it to the **extracted_texts** list, and prints the item for reference. Finally, it joins the extracted texts into a single space-separated string and returns it.






In [11]:
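def extract_text(ocr_output):
    """Join the recognized text of one OCR result into a single string."""
    extracted_texts = []

    # Depending on the PaddleOCR version, an image with no detected text
    # may yield [None] or [[]]; return an empty string in that case
    if not ocr_output or not ocr_output[0]:
        return ""

    for item in ocr_output[0]:
        # Each item is [bounding_box, (text, confidence)]
        text = item[1][0]
        print(item)
        extracted_texts.append(text)
    single_string = ' '.join(extracted_texts)
    return single_string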
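
Because each OCR item carries a confidence score next to its text, the same traversal can also drop low-confidence lines. The sketch below is an optional variation, not the notebook's method: `extract_confident_text`, the `min_confidence` threshold, and the sample data (including its rounded scores) are illustrative assumptions.

```python
def extract_confident_text(ocr_output, min_confidence=0.5):
    """Join recognized text whose confidence is at least `min_confidence`.

    Hypothetical helper; the 0.5 default is an arbitrary starting point.
    """
    # Handle results with no detections ([None] or [[]]).
    if not ocr_output or not ocr_output[0]:
        return ""
    kept = []
    for _box, (text, confidence) in ocr_output[0]:
        if confidence >= min_confidence:
            kept.append(text)
    return " ".join(kept)


# Made-up sample in the [bounding_box, (text, confidence)] shape.
sample = [[
    [[[93.0, 6.0], [421.0, 12.0], [419.0, 67.0], [91.0, 60.0]], ("Laveena", 0.998)],
    [[[0.0, 0.0], [10.0, 0.0], [10.0, 5.0], [0.0, 5.0]], ("n0ise", 0.21)],
]]
print(extract_confident_text(sample))
```

Filtering trades recall for precision: a high threshold removes noise but can also discard faint yet correct text, so the value is worth checking against a few real pages.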


In [12]:
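# Initialize PaddleOCR with the English models
ocr = PaddleOCR(lang="en")

# Run OCR on every page image and collect the text of each page
combine_text = []
for img_path in jpg_paths:
    # Detect and recognize the text on this page
    result = ocr.ocr(img_path)
    # Flatten the OCR result into a single string
    text = extract_text(result)
    combine_text.append(text)

# Join the text of all pages into one string
total_string = ' '.join(combine_text)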

download https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_det_infer.tar to /root/.paddleocr/whl/det/en/en_PP-OCRv3_det_infer/en_PP-OCRv3_det_infer.tar


100%|██████████| 4.00M/4.00M [00:08<00:00, 494kiB/s] 


download https://paddleocr.bj.bcebos.com/PP-OCRv4/english/en_PP-OCRv4_rec_infer.tar to /root/.paddleocr/whl/rec/en/en_PP-OCRv4_rec_infer/en_PP-OCRv4_rec_infer.tar


100%|██████████| 10.2M/10.2M [00:02<00:00, 3.63MiB/s]


download https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar to /root/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer/ch_ppocr_mobile_v2.0_cls_infer.tar


100%|██████████| 2.19M/2.19M [00:06<00:00, 363kiB/s]

[2023/09/18 11:53:17] ppocr DEBUG: Namespace(help='==SUPPRESS==', use_gpu=False, use_xpu=False, use_npu=False, ir_optim=True, use_tensorrt=False, min_subgraph_size=15, precision='fp32', gpu_mem=500, gpu_id=0, image_dir=None, page_num=0, det_algorithm='DB', det_model_dir='/root/.paddleocr/whl/det/en/en_PP-OCRv3_det_infer', det_limit_side_len=960, det_limit_type='max', det_box_type='quad', det_db_thresh=0.3, det_db_box_thresh=0.6, det_db_unclip_ratio=1.5, max_batch_size=10, use_dilation=False, det_db_score_mode='fast', det_east_score_thresh=0.8, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_sast_score_thresh=0.5, det_sast_nms_thresh=0.2, det_pse_thresh=0, det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, scales=[8, 16, 32], alpha=1.0, beta=1.0, fourier_degree=5, rec_algorithm='SVTR_LCNet', rec_model_dir='/root/.paddleocr/whl/rec/en/en_PP-OCRv4_rec_infer', rec_image_inverse=True, rec_image_shape='3, 48, 320', rec_batch_num=6, max_text_length=25, rec_char_dict_path='




[2023/09/18 11:53:19] ppocr DEBUG: dt_boxes num : 95, elapsed : 0.7822816371917725
[2023/09/18 11:53:51] ppocr DEBUG: rec_res num  : 95, elapsed : 32.30969309806824
[[[93.0, 6.0], [421.0, 12.0], [419.0, 67.0], [91.0, 60.0]], ('Laveena', 0.9984917044639587)]
[[[1212.0, 20.0], [1581.0, 20.0], [1581.0, 47.0], [1212.0, 47.0]], ('laveenasatwani52483@gmail.com', 0.982727587223053)]
[[[1093.0, 47.0], [1583.0, 47.0], [1583.0, 73.0], [1093.0, 73.0]], ('linkedin.com/in/laveena-satwani-189970153', 0.9928882718086243)]
[[[84.0, 89.0], [416.0, 89.0], [416.0, 152.0], [84.0, 152.0]], ('Satwani', 0.9998189210891724)]
[[[838.0, 95.0], [1589.0, 97.0], [1589.0, 126.0], [838.0, 124.0]], ('Worked on annotation tool for images inpython to save annotation', 0.9870398640632629)]
[[[853.0, 121.0], [1318.0, 126.0], [1318.0, 152.0], [853.0, 148.0]], ('time along with scriptingfor data handling', 0.9905973076820374)]
[[[840.0, 198.0], [991.0, 198.0], [991.0, 233.0], [840.0, 233.0]], ('Projects', 0.996961414813995

Next, we reuse the **summarize_resume** function from Assignment 3 to extract structured information from the resume.

In [13]:
def summarize_resume(text):
    """
    Summarize the given resume text using the OpenAI API with a specified prompt.

    Args:
    - text (str): The resume text that needs to be summarized.

    Returns:
    - str: Summarized text as returned by the OpenAI model.
    """
    prompt=f'''Read the given resume and extract information corresponding to the keys "name_of_candidate" \
    which stores the candidate name, "mobile_number" contains the mobile number, "email_id" records the email id of the candidate, \
    total years of experience is stored in "years_of_experience", "education" refers to the candidate's most recent or highest academic degree, \
    last university/school/college attended by the candidate is given by "university", "linkedin_profile" contains the linkedin profile, \
    record all the technical skills in "technical_skills", "years_of_jobs" showcases the years spent in different jobs, \
    years spent in the current organization is given by "year_in_current_position", "Present_Organization" denotes name of the present \
    organization and the "summary". For "technical_skills", provide a summary of the programming languages, libraries, \
    and frameworks the candidate has experience with, "years_of_jobs" is a list of job durations, e.g., ["2012-current", "2010-2012", "June 22, 2022 - Present"]. \
    "year_in_current_position" indicates the duration in their current job role. "years_of_experience" is the sum of years spent in all jobs including the current one. \
    Round off the year to the upper ceiling. So, if it is 3 months, round it off to 1 year. Summarize the resume in approximately 100 words for the "summary" field. \
    The final output must be in JSON'''

    # Create a list of messages to simulate a conversation with the OpenAI model.
    # The system starts with a prompt and the user provides the resume text.
    messages = [
            {"role": "system", "content": prompt},
            {"role": "user", "content": text },
        ]

    # Make a request to the OpenAI API to get the summary.
    # Using the 'gpt-3.5-turbo-16k' model for completion.
    response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-16k",
            messages=messages,
            temperature=1,
            max_tokens=13000  # Setting a maximum token limit for the model's output
        )

    # Extract the generated text from the response.
    # Since there's only one message in the choices, we're taking the first message's content.
    generated_texts = [
        choice.message["content"].strip() for choice in response["choices"]
    ]

    return generated_texts[0]

Upload a `.env` file containing `OPENAI_API_KEY` to the `/content/` directory.

The code below loads sensitive values such as the OpenAI API key from that file into environment variables, so the key does not have to be written in the notebook.

In [14]:
# Export your API Key to environment variable
# Upload the .env file to the directory "/content/"
!pip install python-dotenv
from dotenv import load_dotenv
load_dotenv()

Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.0


True
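
A `True` result from `load_dotenv()` does not by itself confirm that `OPENAI_API_KEY` was among the variables loaded. A fail-fast check can make a missing key obvious before any API call is made. This is a minimal sketch; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used only for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check the .env file."
        )
    return value


# Throwaway variable so the sketch runs without a real key.
os.environ["DEMO_API_KEY"] = "placeholder"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same helper could be called as `require_env("OPENAI_API_KEY")` right after `load_dotenv()`.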

In [15]:
# Libraries Installation
!pip install openai
# Required Libraries
import openai
import json
import os
from collections import OrderedDict
# Retrieve the API key from environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")

# Set the API key for OpenAI
openai.api_key = openai_api_key

Collecting openai
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0


The **check_and_trim** function ensures that the token count of a given **resume_text** does not exceed a specified maximum (max_tokens, defaulting to 1500). It uses the tiktoken library to encode the text into tokens and count them. If the count exceeds the limit, the function keeps only the first max_tokens tokens and decodes them back to a string. It returns the (possibly trimmed) text, the original token count, and the final token count.

In [17]:
def check_and_trim(resume_text, max_tokens=1500):
    """Trim resume_text to at most max_tokens tokens (cl100k_base encoding)."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(resume_text)
    old_len = len(tokens)
    if len(tokens) > max_tokens:
        # Keep only the first max_tokens tokens and decode them back to text
        tokens = tokens[:max_tokens]
        resume_text = enc.decode(tokens)
    return resume_text, old_len, len(tokens)
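
Trimming matters because the input tokens and the requested `max_tokens` for the reply must fit together inside the model's context window. The sketch below does that arithmetic. The 16,385-token window for `gpt-3.5-turbo-16k` and the 400-token prompt estimate are assumptions to verify against OpenAI's documentation and a real token count; the small per-message overhead of the chat format is ignored.

```python
def fits_context(prompt_tokens, resume_tokens, max_output_tokens, context_window=16385):
    """Check whether input plus requested output fits in the context window.

    Returns (fits, spare_tokens); spare_tokens is negative when over budget.
    All numbers passed in below are illustrative assumptions.
    """
    total = prompt_tokens + resume_tokens + max_output_tokens
    return total <= context_window, context_window - total


# A resume trimmed to 1500 tokens versus an untrimmed 5000-token resume.
print(fits_context(prompt_tokens=400, resume_tokens=1500, max_output_tokens=13000))
print(fits_context(prompt_tokens=400, resume_tokens=5000, max_output_tokens=13000))
```

Under these assumptions the trimmed resume fits, while the untrimmed one would exceed the window once 13,000 output tokens are reserved.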

In [18]:
# Trim the OCR text to the token limit, then pass it to summarize_resume
resume_text, _, _ = check_and_trim(total_string)
resume_summary = summarize_resume(resume_text)

In [19]:
print(resume_summary)

{
  "name_of_candidate": "Laveena",
  "mobile_number": "N/A",
  "email_id": "laveenasatwani52483@gmail.com",
  "years_of_experience": 5,
  "education": "Bachelor of Technology in Computer Science & Engineering",
  "university": "Indian Institute of Information Technology Jabalpur",
  "linkedin_profile": "linkedin.com/in/laveena-satwani-189970153",
  "technical_skills": "Python, Deep Learning, Image Processing, Machine Learning, TensorFlow, Matlab, scikit-learn, SpringBoot, AngularJS",
  "years_of_jobs": ["2019-2019", "2020-Present", "2018-2018", "2018-2018", "2019-2020", "2019-2019", "2019-2019", "2019-2020", "2019-2019"],
  "year_in_current_position": 2,
  "Present_Organization": "BigVision LLC",
  "summary": "Laveena is a Computer Science and Engineering graduate with a Bachelor's degree from the Indian Institute of Information Technology Jabalpur. She has a total of 5 years of experience in the field of computer vision and machine learning. Her technical skills include Python, Deep 

# Output:


```
{
  "name_of_candidate": "Laveena Satwani",
  "mobile_number": "N/A",
  "email_id": "laveenasatwani52483@gmail.com",
  "years_of_experience": 5,
  "education": "Bachelor of Technology in Computer Science & Engineering",
  "university": "Indian Institute of Information Technology Jabalpur, India",
  "linkedin_profile": "linkedin.com/in/laveena-satwani-189970153",
  "technical_skills": "Python, Deep Learning, Machine Learning, Image Processing, Image Segmentation, Convolutional Neural Networks (CNN), TensorFlow, scikit-learn",
  "years_of_jobs": ["2018-2018", "2019-2019", "2020-Present"],
  "year_in_current_position": 1,
  "Present_Organization": "BigVision LLC, Bangalore",
  "summary": "Laveena Satwani is a computer science engineer with 5 years of experience in deep learning, machine learning, and image processing. She has worked on projects involving image annotation, image captioning, intelligent image enhancement, sentiment analysis, data extraction from scientific charts, and gaze estimation. She is proficient in Python and has experience with frameworks such as TensorFlow and scikit-learn. Laveena holds a Bachelor's degree in Computer Science & Engineering from Indian Institute of Information Technology Jabalpur. In her current position at BigVision LLC, she works as a Computer Vision Engineer."
}
```
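
Since the model returns its answer as a string, it usually has to be parsed before further use. The sketch below is one tolerant way to do that, assuming the reply is either bare JSON or JSON wrapped in a Markdown code fence. `parse_summary` is a hypothetical helper, and the sample string is a shortened stand-in for the output above.

```python
import json
import re


def parse_summary(raw):
    """Parse a model reply into a dict, tolerating an optional code fence."""
    # Take the outermost {...} span; this skips fences or stray prose around it.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the model reply.")
    return json.loads(match.group(0))


# Shortened, illustrative reply wrapped in a Markdown fence.
reply = '```json\n{"name_of_candidate": "Laveena Satwani", "years_of_experience": 5}\n```'
summary = parse_summary(reply)
print(summary["name_of_candidate"], summary["years_of_experience"])
```

A reply that is cut off before the closing brace would still fail with a `ValueError` or `json.JSONDecodeError`, so callers may want to catch those and retry the request.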

