# Description:
This notebook automates extracting and summarizing information from resumes using OpenAI's GPT models. It reads the content of each resume and summarizes its details, presenting them both as plain text and as JSON.
# Learning Objectives:
1. Learn how to set up the necessary environment and install required packages in a Colab notebook.
2. Discover ways to preprocess and trim long texts for model consumption.
3. Familiarize oneself with the OpenAI API and understand how to construct meaningful prompts for better output.
4. Extract and represent resume details in multiple formats, focusing on plain text and JSON.

Upload the `.env` file containing `OPENAI_API_KEY` to the `/content/` directory.

In [1]:
# Libraries Installation
!pip install openai
# Required Libraries
import openai
import json
import os
from collections import OrderedDict


Collecting openai
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0


We set up our environment to use OpenAI's API for extracting information from resumes. Python is the primary language, and the `openai` library handles communication with OpenAI's services.


Read the "OPENAI_API_KEY" from the .env file

In [2]:
# Export your API Key to environment variable
# Upload the .env file to the directory "/content/"
!pip install python-dotenv
from dotenv import load_dotenv
load_dotenv()

Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.0


True

In [3]:
import openai
# Retrieve the API key from environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")

# Set the API key for OpenAI
openai.api_key = openai_api_key
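
If the `.env` file was not uploaded, `os.getenv` returns `None` and the problem only surfaces at the first API call. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the notebook.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; upload the .env file and rerun load_dotenv()"
        )
    return value

# Possible usage in the notebook (assumption):
# openai.api_key = require_env("OPENAI_API_KEY")
```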

Upload the JSON file with the job requirements generated in Assignment 1 and the JSON file describing the filtered resumes generated in Assignment 2. The cell below also prompts for `all_applications.json`.

In [4]:
from google.colab import files

# Upload the first file
print("Please upload the file (filtered_applications.json):")
uploaded1 = files.upload()

# Check to ensure a file was uploaded. If not, prompt again.
while len(uploaded1) == 0:
    print("No file uploaded. Please upload the first file (filtered_applications.json) again:")
    uploaded1 = files.upload()

# Upload the second file
# Note: this reuses `uploaded1`, so the merged summary printed below lists this file
# rather than filtered_applications.json (both files are still saved to disk).
print("Please upload the file (all_applications.json):")
uploaded1 = files.upload()

# Check to ensure a file was uploaded. If not, prompt again.
while len(uploaded1) == 0:
    print("No file uploaded. Please upload the second file (all_applications.json) again:")
    uploaded1 = files.upload()

# Upload the third file
print("Please upload the file (requirements_output.json):")
uploaded2 = files.upload()

# Check to ensure a file was uploaded. If not, prompt again.
while len(uploaded2) == 0:
    print("No file uploaded. Please upload the third file (requirements_output.json) again:")
    uploaded2 = files.upload()

# Merge the dictionaries to have all uploaded files in one
uploaded = {**uploaded1, **uploaded2}

# Print details of uploaded files
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))


Please upload the file (filtered_applications.json):


Saving filtered_applications.json to filtered_applications.json
Please upload the file (all_applications.json):


Saving all_applications.json to all_applications.json
Please upload the file (requirements_output.json):


Saving requirements_output.json to requirements_output.json
User uploaded file "all_applications.json" with length 4809 bytes
User uploaded file "requirements_output.json" with length 764 bytes


Now download the `Webinar_resumes.zip` file, which contains all the resumes.

In [5]:
import requests

def download_file_from_google_drive(file_id, destination):
    base_url = "https://drive.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(base_url, params={'id': file_id}, stream=True)
    token = get_confirm_token(response)

    if token:
        params = {'id': file_id, 'confirm': token}
        response = session.get(base_url, params=params, stream=True)

    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:
                f.write(chunk)
# Example Usage
# file_id = '1HaM3IeK2-iqyZzeQmCnAzKLcF9NF-mSo'  # Replace with your file's ID
# destination = 'resume_data.zip'  # Replace with your desired file name and extension
file_id = '17V_o0Snt-Lj0FmegENPQ_rXpvWTWlZgQ'
destination = 'Webinar_resumes.zip'  # Replace with your desired file name and extension
download_file_from_google_drive(file_id, destination)
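
Google Drive may return an HTML page instead of the file, for example when the file is not shared publicly, so a quick sanity check on the download can save confusion later. A minimal sketch using only the standard library; the helper name `list_zip_members` is hypothetical.

```python
import zipfile

def list_zip_members(path):
    """Return the member names of a ZIP archive, or raise if `path` is not a valid ZIP."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a valid ZIP archive; check the file ID and sharing settings")
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

# Possible usage in the notebook (assumption):
# print(list_zip_members("Webinar_resumes.zip"))
```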

The following code imports libraries for file handling and tokenization. `docx` and `textract` process Word documents, while `fitz` (PyMuPDF) handles PDFs. `os`, `json`, and `pandas` support file operations and data management. The functions read job requirements from JSON files, extract text from DOCX, DOC, PDF, and Excel files, and trim resume text to a specified number of tokens using `tiktoken` to keep model inputs manageable.

In [6]:
# Install the required libraries
!pip install PyMuPDF
!pip install textract
!pip install python-docx
!pip install tiktoken

Collecting PyMuPDF
  Downloading PyMuPDF-1.23.3-cp310-none-manylinux2014_x86_64.whl (4.3 MB)
Collecting PyMuPDFb==1.23.3 (from PyMuPDF)
  Downloading PyMuPDFb-1.23.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.6 MB)
Installing collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.23.3 PyMuPDFb-1.23.3
Collecting textract
  Downloading textract-1.6.5-py3-none-any.whl (23 kB)
Collecting argcomplete~=1.10.0 (from textract)
  Downloading argcomplete-1.10.3-py2.py3-none-any.whl (36 kB)
Collecting beautifulsoup4~=4.8.0 (from textract)
  Downloading beautifulsoup4-4.8.2-py3-none-any.whl (106 kB)

Collecting python-docx
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184487 sha256=270dbefed10a63204399666e94ad0d4

In [7]:
# Importing necessary libraries and modules
from docx import Document
import textract
import fitz  # PyMuPDF
import openai
import json
import os
from collections import OrderedDict
import re
import pandas as pd
import math
import tiktoken


def read_requirements(file_path):
    """
    Read the job requirements from a given JSON file.

    Args:
    - file_path (str): Path to the JSON file.

    Returns:
    - dict: Job requirements if successfully read, otherwise None.
    """
    try:
        with open(file_path, 'r') as f:
            data = json.load(f)
        return data
    except Exception as e:
        print(f"Error reading requirements JSON: {e}")
        return None

def read_json(file_path):
    """
    Read data from a given JSON file.

    Args:
    - file_path (str): Path to the JSON file.

    Returns:
    - dict: Data from the JSON file.
    """
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data

def read_document(file_path):
    """
    Read and extract text from various document types (.docx, .doc, .pdf, .xls, .xlsx).

    Args:
    - file_path (str): Path to the document file.

    Returns:
    - str: Extracted text from the document.
    """
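    file_path = str(file_path)
    _, file_extension = os.path.splitext(file_path)
    # Normalize the extension so that e.g. ".DOCX" and ".docx" are treated alike
    file_extension = file_extension.lower()
    text = ""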
    if file_extension == '.docx':
        doc = Document(file_path)
        for para in doc.paragraphs:
            text = text + para.text + " "
    elif file_extension == '.doc':
        text = textract.process(file_path).decode()
    elif file_extension.lower() == '.pdf':
        doc = fitz.open(file_path)
        for page_number in range(len(doc)):
            page = doc[page_number]
            text = text + page.get_text() + " "
    elif file_extension.lower() in ['.xls', '.xlsx']:
        data = pd.read_excel(file_path)
        text = data.to_string(index=False)
    else:
        print(f"Unsupported file type: {file_extension}")

    return text


def check_and_trim(resume_text, max_tokens=1500):
    """
    Trim the text to a specified number of tokens if it exceeds the limit.

    Args:
    - resume_text (str): Text to be trimmed.
    - max_tokens (int, optional): Maximum number of tokens allowed. Defaults to 1500.

    Returns:
    - str: Trimmed text.
    - int: Original number of tokens.
    - int: Number of tokens after trimming.
    """
    # Tokenize with tiktoken's cl100k_base encoding, the one used by gpt-3.5-turbo models
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(resume_text)
    old_len = len(tokens)
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
        resume_text = enc.decode(tokens)
    return resume_text, old_len, len(tokens)
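
To make the return values of `check_and_trim` concrete, the same encode, slice, decode pattern can be sketched with a stand-in tokenizer. This is only an illustration: it assumes one token per whitespace-separated word, whereas the notebook uses `tiktoken`, whose token counts will differ. The name `trim_to_budget` is hypothetical.

```python
def trim_to_budget(text, max_tokens, encode, decode):
    """Truncate `text` to at most `max_tokens` tokens using the given tokenizer functions.

    Returns the (possibly trimmed) text, the original token count,
    and the token count after trimming.
    """
    tokens = encode(text)
    original_count = len(tokens)
    if original_count > max_tokens:
        tokens = tokens[:max_tokens]
        text = decode(tokens)
    return text, original_count, len(tokens)

# Stand-in tokenizer (assumption, for illustration only): split on whitespace.
sample = "Senior engineer with experience in computer vision and machine learning"
trimmed, before, after = trim_to_budget(sample, 4, str.split, " ".join)
print(trimmed, before, after)  # Senior engineer with experience 10 4
```

Text that already fits the budget is returned unchanged, with both counts equal.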


The `summarize_resume` function below takes two arguments, `prompt` and `text`, and returns a summary of the resume content.

`Building the Conversation Messages`: Inside the function, a messages list is initialized with two entries. The first entry has the role `system` and provides a contextual instruction or prompt (specified by the prompt argument) to the model. The second entry has the role `user` and contains the content of the resume (specified by the text argument). This list emulates a conversation where the system sets the context and the user provides the input.

`Generating the Response`: The `openai.ChatCompletion.create` method is then called with several parameters:

`model="gpt-3.5-turbo-16k"`: Specifies the model variant to be used for the task.

`messages`: Provides the constructed conversation to the model.

`temperature=1`: This dictates the randomness of the model's output. A value closer to 1 makes the model's responses more random, while a value closer to 0 makes them more deterministic.

`max_tokens=13000`: Caps the response at 13,000 tokens. The prompt and the response together must fit within the model's 16k-token context window, which is one reason the resume text is trimmed beforehand.

`Extracting the Summary`: After the response is generated, the content of the response is extracted, stripped of any leading or trailing white spaces, and stored in the `generated_texts` list.

`Returning the Result`: Finally, the function returns the first (and only) item in the generated_texts list, which is the summarized content of the resume.

In [8]:
def summarize_resume(prompt, text):
    """
    Summarize the given resume text using the OpenAI API with a specified prompt.

    Args:
    - prompt (str): The leading instruction or question for the model.
    - text (str): The resume text that needs to be summarized.

    Returns:
    - str: Summarized text as returned by the OpenAI model.
    """

    # Create a list of messages to simulate a conversation with the OpenAI model.
    # The system starts with a prompt and the user provides the resume text.
    messages = [
            {"role": "system", "content": f"{prompt}"},
            {"role": "user", "content": text },
        ]

    # Make a request to the OpenAI API to get the summary.
    # Using the 'gpt-3.5-turbo-16k' model for completion.
    response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-16k",
            messages=messages,
            temperature=1,
            max_tokens=13000  # Setting a maximum token limit for the model's output
        )

    # Extract the generated text from the response.
    # Since there's only one message in the choices, we're taking the first message's content.
    generated_texts = [
        choice.message["content"].strip() for choice in response["choices"]
    ]

    return generated_texts[0]

**Note:**
In the subsequent sections, we'll iteratively craft a prompt designed to extract pertinent information from the curated resumes. The extracted data will be presented in JSON format, ensuring that the keys remain consistent across multiple runs.
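
Chat models sometimes wrap a JSON answer in extra prose, so parsing the raw reply with `json.loads` can fail. A minimal sketch of a tolerant parser, assuming the reply contains at most one top-level JSON object; `parse_json_reply` is a hypothetical helper, not part of the notebook.

```python
import json

def parse_json_reply(reply):
    """Parse a model reply as a JSON object, tolerating text before or after it.

    Returns the parsed dict, or None if no valid JSON object is found.
    """
    candidates = [reply]
    # Fall back to the span between the first "{" and the last "}".
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        candidates.append(reply[start:end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Returning `None` instead of raising lets the calling loop skip or retry a malformed reply without stopping the whole run.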


The code below first reads the filtered applications from a JSON file and lets the user choose how many resumes to process (two by default); that many applications are then sampled at random. Next, it reads the job requirements from a JSON file and extracts the `must_have_skills`. The ZIP file `Webinar_resumes.zip` in the `/content/` directory is then extracted to a folder named `extracted_files`. After extraction, the code iterates through the items in this folder and renames any directory whose name contains spaces, replacing the spaces with underscores to keep the directory structure clean. The final path to the directory of resumes is stored in the `resume_path` variable.

In [9]:
import json
import random

def read_json(filepath):
    """
    Reads a JSON file and returns the data.

    Args:
    - filepath (str): Path to the JSON file.

    Returns:
    - dict: Data from the JSON file.
    """
    with open(filepath, 'r') as file:
        data = json.load(file)
    return data

def user_select_number_of_resumes(total_resumes, default=2):
    """
    Allow the user to input a number of resumes to process.
    If no input is given, the default value is returned.

    Args:
    - total_resumes (int): Total number of resumes available.
    - default (int): The default number to return if no input.

    Returns:
    - int: The number of resumes the user wants to process.
    """
    print(f"Total resumes available: {total_resumes}")
    user_input = input(f"How many resumes do you want to process? (Default is {default}): ")

    # If the user doesn't provide any input, return the default value.
    if not user_input:
        return default

    try:
        # Convert user input to an integer and ensure it's within the range.
        selected_num = int(user_input)
        if 1 <= selected_num <= total_resumes:
            return selected_num
        else:
            print(f"Please select a number between 1 and {total_resumes}.")
            return user_select_number_of_resumes(total_resumes, default)
    except ValueError:
        # If the user provides non-numeric input, prompt them again.
        print("Please enter a valid number.")
        return user_select_number_of_resumes(total_resumes, default)

# Read the filtered_applications data from the JSON file
json_data = read_json('/content/filtered_applications.json')

# Display total resumes and get the user's choice
n = user_select_number_of_resumes(len(json_data))

# Randomly select n resumes
selected_applications = random.sample(json_data, n)

Total resumes available: 12
How many resumes do you want to process? (Default is 2): 5
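
`random.sample` picks a different subset on each run, so the resumes summarized in later cells change between runs. If reproducibility matters, a dedicated seeded generator is one option. A minimal sketch; the helper name `sample_applications` and the seed value are arbitrary assumptions.

```python
import random

def sample_applications(applications, n, seed=42):
    """Return n applications chosen at random, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(applications, n)

# Possible usage in the notebook (assumption):
# selected_applications = sample_applications(json_data, n)
```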


In [10]:
import zipfile
import os
import shutil
job_requirements = read_requirements('/content/requirements_output.json')
must_have_skills = job_requirements["must_have_skills"]
zip_file_path = "/content/Webinar_resumes.zip" # For example give the path to resume_data.zip


def extract_and_rename(zip_file_path, extract_path="extracted_files"):
    """
    Extract files from a zip archive to a specified directory.
    Rename directories containing spaces to use underscores instead.

    Args:
    - zip_file_path (str): The path to the zip file to be extracted.
    - extract_path (str, optional): The path where the zip file content should be extracted to.
                                    Defaults to "extracted_files".

    Returns:
    - str: Path to the resume or directory.
    """
    # Check if extract_path exists, if not, create it
    if not os.path.exists(extract_path):
        os.makedirs(extract_path)

    # If extract_path is not empty, skip extraction
    if not os.listdir(extract_path):
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            zip_ref.extractall(extract_path)

    resume_path = extract_path
    for item in os.listdir(extract_path):
        item_path = os.path.join(extract_path, item)

        # Check if the current item is a directory and if it has spaces in its name
        if os.path.isdir(item_path) and ' ' in item:
            new_name = item.replace(' ', '_')
            new_path = os.path.join(extract_path, new_name)

            # If the new directory name doesn't already exist, create it
            if not os.path.exists(new_path):
                os.makedirs(new_path)

            # Copying contents from the old directory to the new one
            for sub_item in os.listdir(item_path):
                shutil.copy2(os.path.join(item_path, sub_item), new_path)

            # Removing the old directory
            shutil.rmtree(item_path)
            resume_path = new_path
        else:
            resume_path = item_path

    return resume_path
resume_path = extract_and_rename(zip_file_path)
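
The rename step can also be expressed with `os.rename`, which moves a directory in a single call, including any nested subdirectories. A minimal sketch, assuming it is acceptable to skip a directory when the underscored name already exists; `rename_dirs_with_spaces` is a hypothetical helper, not the notebook's function.

```python
import os

def rename_dirs_with_spaces(root):
    """Rename direct subdirectories of `root`, replacing spaces with underscores.

    Returns the list of new paths created by renaming.
    """
    renamed = []
    for name in sorted(os.listdir(root)):
        old_path = os.path.join(root, name)
        if os.path.isdir(old_path) and " " in name:
            new_path = os.path.join(root, name.replace(" ", "_"))
            if os.path.exists(new_path):
                continue  # skip rather than overwrite an existing directory
            os.rename(old_path, new_path)
            renamed.append(new_path)
    return renamed
```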

# Prompt Version 1:

### *Extract key details from the resume using a basic prompt.*
### *This is a more open-ended prompt to get an initial sense of what the model understands and extracts from a resume without any constraints.*

In [11]:
import os
prompt_v1=f'''Read the given resume and extract information such as the candidate name, mobile number, email_id, total years of experience, the candidate's last education degree \
last university attended by the candidate University, extract the candidate's linkedin profile, record all the technical skills, years spent in different jobs,
years spent in the current organization, name of the present organization and the summary'''

for application in selected_applications:
    if 'resume_path' in application and 'email_id' in application:
        resume_text = read_document(os.path.join(resume_path, application['resume_path']))
        resume_text, _, _ = check_and_trim(resume_text)
        resume_summary = summarize_resume(prompt_v1, resume_text)
        print("[Resume Summary] ", resume_summary)

[Resume Summary]  Candidate Name: Paula Ramos
Mobile Number: 919-786-3615
Email: pjramg@gmail.com
Total Years of Experience: Ph.D in Engineering (Computer Science & Image Processing) from Jan 2012 - Apr 2018
Last Education Degree: Ph.D in Engineering (Computer Science & Image Processing)
Last University Attended: Universidad Nacional de Colombia
LinkedIn Profile: linkedin.com/in/paula-ramos-41097319
Technical Skills: Computer Vision, Machine Learning, Image Processing, Signal Processing, Control and Automation, Robotics, Embedded Systems, Mobile Devices
Years Spent in Different Jobs:
- AI Software Development Engineer at Intel Corporation: Nov 2021 - On going
- Research Scholar at North Carolina State University - USDA ARS: Jul 2020 - Oct 2022
- Postdoctoral Researcher at North Carolina State University - USDA ARS: Jun 2019 - Jun 2020
- Research Scholar at North Carolina State University: Feb 2019 - Jun 2019
- Research Scientist at National Research Center of Coffee: Feb 2010 - Dec 201

# Output:


```
[Resume Summary]  Candidate Name: Pankaj Kumar Goyal
Mobile Number: Not mentioned
Email ID: pankajgoyal02003@gmail.com
Total Years of Experience: Not mentioned

Last Education Degree: B.Tech in Electronics and Communication Engineering
Last University Attended: Indian Institute of Information Technology, Allahabad

LinkedIn Profile: https://www.linkedin.com/in/pankaj10032

Technical Skills:
- Python
- Machine Learning
- Computer Vision
- Deep Learning
- Data Cleaning
- Feature Engineering
- Data Analysis
- Data Science
- Natural Language Processing
- Large language models (BERT, Roberta, XLM-R, T5, Distil-BERT)
- Prompt Engineering
- Generative AI
- LangChain
- Pinecone (vector databases)
- Chatbot Development
- Numpy
- Pandas
- Scikit Learn
- TensorFlow
- Keras
- Seaborn
- Matplotlib
- SQL
- MySQL
- PosgreySQL
- AWS
- Vertex AI (AUTOML, CustomML)
- MLOPS (MLflow)
- PowerBI

Years Spent in Different Jobs: Not mentioned
Years Spent in Current Organization: Not mentioned
Name of Present Organization: Not mentioned

Summary: The resume highlights the candidate's hands-on experience in data science and machine learning, with expertise in natural language processing, prompt engineering, deep learning, and computer vision. The candidate has worked on projects related to cross-lingual and multilingual language modeling, Twitter hate speech detection, skin cancer classification, and data extraction in NLP. They possess strong technical skills in Python, machine learning, computer vision, deep learning, and data analysis. The candidate has completed a B.Tech degree in Electronics and Communication Engineering from the Indian Institute of Information Technology, Allahabad. They have also participated in a hackathon and completed a course on deep learning provided by Kaggle.
[Resume Summary]  Candidate Name: Abhilash Babu
Mobile Number: +49 17647165848
Email ID: abhilashbabuj@gmail.com
Total Years of Experience: 18 years
Last Education Degree: MS in Communication Engineering
Last University Attended: Technische Universität, München, Germany
LinkedIn Profile: https://www.linkedin.com/in/abhilashbabu
Technical Skills:
- Languages: C++, C, C#, Python
- Computer Vision: OpenCV, Halcon Machine vision library
- Machine Learning: Tensorflow, PyTorch, PyTorch-Lightning, Scikit-Learn, Pandas, Keras, ONNX, ApacheTVM, MLFlow, Optuna
- Database: MySQL, SQLite
- Libraries: Boost, ZeroMQ, Protocol Buffer, gRPC, MQTT, RabbitMQ
- GUI Frameworks: Qt, WPF, DearImGUI
- Testing frameworks: pytest, GoogleTest, Catch2
- Miscellaneous: Docker, Jenkins, Bamboo, Jupyter Notebooks

Experience:
- Apr 2022 - Present: Senior Machine Learning Engineer at IDnow GmbH, München
- Jan 2020 - Feb 2022: Senior Developer Vision Systems at Bundesdruckerei GmbH, München
- Aug 2016 - Dec 2019: Developer Vision Systems at Bundesdruckerei GmbH, München
- Jan 2013 - Jul 2016: Software Development Engineer at Stratus Vision GmbH, München
- Aug 2011 - Oct 2011: Praktikant at Rohde & Schwarz, Berlin, Germany
- Apr 2008 - Sep 2010: Lead Engineer at Samsung India Software Operations, Bangalore, India
- Nov 2005 - Apr 2008: Software Engineer at Wipro Technologies, Bangalore, India

Summary:
Senior Machine Learning Engineer with 18 years of experience in successfully delivering projects in the domain of Computer Vision and Image processing. Deep understanding of classical computer vision techniques as well as the latest advancements in deep learning frameworks. Experienced in developing and deploying machine learning solutions for various applications. Strong expertise in languages like C++, C, C#, and Python. Certified Software Architect and Certified Scrum Product Owner. Proficient in mentoring junior colleagues and interns.
```



# Prompt Version 2:

### *Extract key details from the resume and represent the output in JSON format.*
### *The output above is plain text, so the prompt below constrains the output format. JSON is a widely used data interchange format that provides structured data which can be easily parsed and reused.*


In [12]:
import os
prompt_v2=f'''Read the given resume and extract information such as the candidate name, mobile number, email_id, total years of experience, the candidate's last education degree \
last university attended by the candidate University, extract the candidate's linkedin profile, record all the technical skills, years spent in different jobs,
years spent in the current organization, name of the present organization and the summary. The final output must be in JSON'''

for application in selected_applications:
    if 'resume_path' in application and 'email_id' in application:
        resume_text = read_document(os.path.join(resume_path, application['resume_path']))
        resume_text, _, _ = check_and_trim(resume_text)
        resume_summary = summarize_resume(prompt_v2, resume_text)
        print("[Resume Summary] ", resume_summary)

[Resume Summary]  {
  "Candidate Name": "Paula Ramos",
  "Mobile Number": "919-786-3615",
  "Email": "pjramg@gmail.com",
  "Address": "3357 Bordwell Ridge Drive, New Hill, NC, 27562",
  "LinkedIn": "linkedin.com/in/paula-ramos-41097319",
  "Total Experience": "10 years",
  "Last Education Degree": "Ph.D. in Engineering (Computer Science & Image Processing)",
  "Last University Attended": "Universidad Nacional de Colombia",
  "Present Organization": "Intel Corporation",
  "Years in Current Organization": "1 year",
  "Summary": "Research in new AI technologies based on image (2D - 3D) and signal processing, control and automation, robotics, machine learning, embedded systems, and mobile devices.",
  "Technical Skills": [
    "Computer Vision",
    "Machine Learning",
    "AI Software Development",
    "Image Processing",
    "Signal Processing",
    "Control Systems",
    "Automation",
    "Robotics",
    "Embedded Systems",
    "Mobile Devices",
    "IoT",
    "Deep Learning",
    "Prec

# Output:


```
[Resume Summary]  {
  "Candidate Name": "Abhijeet Dhupia",
  "Mobile Number": "+919901656836",
  "Email ID": "abhijeetdhupia@gmail.com",
  "Total Years of Experience": "2",
  "Last Education Degree": "B.Tech. in Electrical and Electronics Engineering (EEE)",
  "Last University Attended": "Manipal Institute of Technology",
  "University": "Manipal, India",
  "LinkedIn Profile": "abhijeetdhupia",
  "Technical Skills": [
    "Python",
    "C++",
    "CSS",
    "HTML5",
    "ImageJ",
    "LATEX",
    "MATLAB",
    "Markdown",
    "R",
    "Shell Scripting",
    "Vim",
    "Pytorch",
    "TensorFlow",
    "AWS",
    "Docker",
    "Flask",
    "Git",
    "Jira",
    "OpenCV"
  ],
  "Years in Different Jobs": [
    {
      "Job Title": "Research Assistant",
      "Organization": "Spectrum Lab, Indian Institute of Science",
      "Years of Experience": "2"
    },
    {
      "Job Title": "Research Intern",
      "Organization": "QpiAI (in collaboration with IISc)",
      "Years of Experience": "0.5"
    }
  ],
  "Years in Current Organization": "2",
  "Present Organization": "Spectrum Lab, Indian Institute of Science",
  "Summary": "Experienced research assistant with a focus on data science and deep learning. Strong background in developing algorithms for medical imaging and healthcare applications. Skilled in Python, Pytorch, and OpenCV. Completed hands-on training in deep learning and data science specializations. Passionate about leveraging AI for impactful healthcare solutions."
}
[Resume Summary]  {
  "Name": "Laveena Satwani",
  "Mobile Number": "+91 8989035197",
  "Email": "laveenasatwani52483@gmail.com",
  "Total Years of Experience": "2 years and 2 months",
  "Last Education Degree": "Bachelor of Technology in Computer Science & Engineering",
  "Last University Attended": "Indian Institute of Information Technology Jabalpur, India",
  "LinkedIn Profile": "linkedin.com/in/laveena-satwani-189970153",
  "Technical Skills": [
    "Computer Vision",
    "Machine Learning",
    "Image Processing",
    "Deep Learning",
    "Python",
    "TensorFlow",
    "Matlab",
    "SpringBoot",
    "scikit-learn",
    "AngularJS"
  ],
  "Years Spent in Different Jobs": [
    {
      "Job Title": "Computer Vision Engineer",
      "Organization": "BigVision LLC",
      "Years Spent": "0 years and 9 months"
    },
    {
      "Job Title": "Machine Learning Engineer",
      "Organization": "Vassar Labs IT Solutions",
      "Years Spent": "1 year and 1 month"
    },
    {
      "Job Title": "Software Engineer",
      "Organization": "Vassar Labs IT Solutions",
      "Years Spent": "0 years and 3 months"
    },
    {
      "Job Title": "Machine Learning Intern",
      "Organization": "Vassar Labs IT Solutions",
      "Years Spent": "0 years and 6 months"
    }
  ],
  "Years Spent in Current Organization": "0 years and 2 months",
  "Current Organization": "BigVision LLC",
  "Summary": "Experienced Computer Vision Engineer and Machine Learning Engineer with a demonstrated history of working on various projects in the field of image enhancement, car segmentation, product detection, and satellite image analysis. Skilled in Computer Vision, Machine Learning, Deep Learning, and Image Processing. Strong engineering professional with a Bachelor of Technology (B.Tech.) focused in Computer Science & Engineering from Indian Institute of Information Technology Jabalpur."
}
```
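
The two replies above use different names for the same fields, for example `Candidate Name` versus `Name` and `Present Organization` versus `Current Organization`. A minimal sketch of how such drift could be detected automatically; the dictionaries are abbreviated to a few keys taken from the outputs above (values omitted), and `key_drift` is a hypothetical helper.

```python
def key_drift(responses):
    """Return the keys that do not appear in every response dict."""
    key_sets = [set(response) for response in responses]
    if not key_sets:
        return set()
    common = set.intersection(*key_sets)
    return set.union(*key_sets) - common

# Abbreviated key sets from the two replies above (values omitted).
first = {"Candidate Name": "", "Email ID": "", "Present Organization": "", "Summary": ""}
second = {"Name": "", "Email": "", "Current Organization": "", "Summary": ""}

print(sorted(key_drift([first, second])))
```

An empty result would mean every reply used exactly the same keys, which is the goal of the more explicit prompt in the next version.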



# Prompt Version 3:
### *Extract key details from the resume and represent the output in JSON format.*

### *This is a refined and more explicit prompt to ensure consistency in the output format. By mentioning the exact keys and giving examples for each key, the aim is to guide the model towards a structured and expected output. Additionally, this prompt provides detailed instructions on how certain fields should be represented, ensuring that the output aligns closely with the desired format.*

In [13]:
import os
prompt_v3=f'''Read the given resume and extract information corresponding to the keys\
 "name_of_candidate" which stores the candidate name, \
 "mobile_number" contains the mobile number, \
 "email_id" records the email id of the candidate, \
 total years of experience is stored in "years_of_experience", \
 "education" refers to the candidate's most recent or highest academic degree, \
 last university/school/college attended by the candidate is given by "university", \
 "linkedin_profile" contains the linkedin profile, \
 record all the technical skills in "technical_skills", \
 "years_of_jobs" showcases the years spent in different jobs, \
 years spent in the current organization is given by "year_in_current_position", \
 "Present_Organization" denotes name of the present organization and "summary". \
 For "technical_skills", provide a summary of the programming languages, libraries, and frameworks the candidate has experience with as a list, \
 "years_of_jobs" is a list of job durations, e.g., ["2012-current","2010-2012", (June 22, 2022 - Present)]. \
 "year_in_current_position" indicates the duration in their current job role integer. Present year is 2023. \
 "years_of_experience" is the sum of years spent in all jobs including the current one. \
 Round off the year to the upper ceiling. So, if it is 3 months, round it off to 1 year. \
 Summarize the resume in approximately 100 words for the "summary" field. \
 The final output must be in JSON
'''

gpt_response_list = []
for application in selected_applications:
    if 'resume_path' in application and 'email_id' in application:
        resume_text = read_document(os.path.join(resume_path, application['resume_path']))
        resume_text, _, _ = check_and_trim(resume_text)
        resume_summary = summarize_resume(prompt_v3, resume_text)
        gpt_response_list.append(resume_summary)
        print("[Resume Summary] ", resume_summary)

[Resume Summary]  {
  "name_of_candidate": "PAULA RAMOS",
  "mobile_number": "919-786-3615",
  "email_id": "pjramg@gmail.com",
  "years_of_experience": 15,
  "education": "Ph.D. in Engineering (Computer Science & Image Processing)",
  "university": "Universidad Nacional de Colombia",
  "linkedin_profile": "linkedin.com/in/paula-ramos-41097319",
  "technical_skills": ["Computer Vision", "Machine Learning", "Signal Processing", "Control Systems", "Automation", "Robotics", "Embedded Systems", "Mobile Devices"],
  "years_of_jobs": ["2012-2023", "2010-2012", "June 22, 2022 - Present"],
  "year_in_current_position": 1,
  "Present_Organization": "Intel Corporation",
  "summary": "Paula Ramos is a Computer Vision and Machine Learning expert with 15 years of experience in research and development. She holds a Ph.D. in Engineering, specializing in Computer Science and Image Processing. Paula has a strong background in signal processing, control systems, automation, robotics, embedded systems, an

# Output:


```

[Resume Summary]  {
  "name_of_candidate": "Pankaj Kumar Goyal",
  "mobile_number": "",
  "email_id": "pankajgoyal02003@gmail.com",
  "years_of_experience": 1,
  "education": "B.Tech in Electronics and Communication Engineering",
  "university": "Indian Institute of Information Technology, Allahabad",
  "linkedin_profile": "",
  "technical_skills": "Python, Machine learning, Computer Vision, Deep learning, Data Cleaning, Feature engineering, Data Analysis, Data Science, Natural Language Processing, Large language models(BERT, Roberta, XLM-R, T5, Distil-BERT), Prompt Engineering, Generative AI, LangChain, pinecone(vector databases), chatbot development, Numpy, Pandas, Scikit Learn, TensorFlow, Keras, Seaborn, Matplotlib, Database: SQL, MYSQL, PosgreySQL, Cloud: AWS, Vertex AI(AUTOML, CustomML), MLOPS(MLflow), BI tools: PowerBI",
  "years_of_jobs": ["2021-Present"],
  "year_in_current_position": 1,
  "present_organization": "",
  "summary": "Pankaj Kumar Goyal is a Data Scientist with expertise in Data Science and Machine Learning. He has hands-on experience in Natural Language Processing, Prompt Engineering, Deep Learning, and Computer Vision. Pankaj has completed a B.Tech in Electronics and Communication Engineering from the Indian Institute of Information Technology, Allahabad. He is skilled in Python, Machine Learning, Computer Vision, Deep Learning, Data Cleaning, Feature Engineering, and Data Analysis. Pankaj has worked on various projects in the fields of Cross-Lingual and Multilingual Language Modeling, Twitter Hate Speech Detection, Skin Cancer MNIST, and Data Extraction in NLP. He has a strong knowledge of different programming languages, libraries, frameworks, databases, and cloud technologies. Pankaj has achieved several accomplishments, including participation in Hack-out 2022 and completing an Intro to Deep Learning course provided by Kaggle. He has also successfully deployed NLP projects on HuggingFace."
}
[Resume Summary]  {
  "name_of_candidate": "Abhilash Babu",
  "mobile_number": "+49 17647165848",
  "email_id": "abhilashbabuj@gmail.com",
  "years_of_experience": 18,
  "education": "MS in Communication Engineering",
  "university": "Technische Universität, München",
  "linkedin_profile": "",
  "technical_skills": [
    "C++",
    "C",
    "C#",
    "Python",
    "OpenCV",
    "Halcon Machine vision library",
    "Tensorflow",
    "PyTorch",
    "PyTorch-Lightning",
    "Scikit-Learn",
    "Pandas",
    "Keras",
    "ONNX",
    "ApacheTVM",
    "MLFlow",
    "Optuna",
    "MySQL",
    "SQLite",
    "Boost",
    "ZeroMQ",
    "Protocol Buffer",
    "gRPC",
    "MQTT",
    "RabbitMQ",
    "Qt",
    "WPF",
    "DearImGUI",
    "pytest",
    "GoogleTest",
    "Catch2",
    "Docker",
    "Jenkins",
    "Bamboo",
    "Jupyter Notebooks"
  ],
  "years_of_jobs": [
    "Apr 2022 - Present",
    "Jan 2020 - Feb 2022",
    "Aug 2016 - Dec 2019",
    "Jan 2013 - Jul 2016",
    "Aug 2011 - Oct 2011",
    "Apr 2008-Sep 2010",
    "Nov 2005 - Apr 2008"
  ],
  "year_in_current_position": "1 year",
  "Present_Organization": "IDnow GmbH",
  "summary": "Abhilash Babu is a Senior Machine Learning Engineer with 18 years of experience in computer vision and image processing. He has developed and deployed machine learning solutions for various applications such as object detection and image classification. Abhilash is skilled in classical computer vision techniques as well as deep learning frameworks like TensorFlow and PyTorch. He has experience in developing applications for both desktop and embedded domains, and has expertise in microservices for machine learning applications. Abhilash has a strong educational background with a Master's degree in Communication Engineering. He is a certified Software Architect and Scrum Product Owner."
}
```
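The responses above are not fully uniform: `technical_skills` comes back as a comma-separated string for one candidate and as a list for another, `year_in_current_position` is `1` in one response and `"1 year"` in another, and the organization key appears as both `Present_Organization` and `present_organization`. A minimal post-processing sketch, assuming the response has already been parsed into a dictionary, could normalize these fields. The helper name `normalize_summary` is hypothetical and not part of the notebook.

```python
import re

def normalize_summary(summary):
    """Return a copy of a parsed resume summary with more uniform field types."""
    # Lowercase all keys so "Present_Organization" and "present_organization" match.
    normalized = {key.lower(): value for key, value in summary.items()}

    # Turn a comma-separated skills string into a list of stripped items.
    skills = normalized.get("technical_skills", [])
    if isinstance(skills, str):
        normalized["technical_skills"] = [s.strip() for s in skills.split(",") if s.strip()]

    # Pull the first integer out of values such as "1 year".
    years = normalized.get("year_in_current_position")
    if isinstance(years, str):
        match = re.search(r"\d+", years)
        normalized["year_in_current_position"] = int(match.group()) if match else None

    return normalized

# Small illustrative input using values of the kind seen in the outputs above.
example = {
    "technical_skills": "Python, Machine learning, Computer Vision",
    "year_in_current_position": "1 year",
    "Present_Organization": "IDnow GmbH",
}
cleaned = normalize_summary(example)
print(cleaned)
```

Splitting on commas is a simplification: a skills string with commas inside parentheses, such as a list of model names, would be split in the middle of the group, so a stricter prompt or a more careful parser may be preferable.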




The next cells define a helper function (**save_to_json**) and use it to enrich a list of job applications with resume summaries. Despite its name, **save_to_json** does not write a file: it checks that the given data is JSON-compatible and returns it as a Python object. A string is parsed with `json.loads` (raising a `ValueError` if it is not valid JSON), a dictionary or list is returned unchanged, and any other type raises a `TypeError`.

In the main block, data is first loaded from a JSON file (**filtered_applications.json**). For each application that has both a '**resume_path**' and an '**email_id**', the code reads the associated resume document and trims its content if it is too long. The content is then sent to **summarize_resume**, which is not defined in this cell and is assumed to be available from earlier in the notebook. The response is validated with **save_to_json** and stored in the application's '**resume_summary**' field; if validation fails, the request is retried up to **MAX_RETRIES** times.

Lastly, after all applications have been processed, the enriched data is written to a new file (**filtered_applications_summary.json**) with 4-space indentation.
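The validate-and-retry flow described above can be sketched in isolation. In this minimal sketch, `fake_summarize` is a hypothetical stand-in for `summarize_resume` that returns invalid JSON on its first call, and `parse_json_response` and `summarize_with_retries` are illustrative names, not functions from the notebook. The candidate name is placeholder data.

```python
import json

def parse_json_response(data):
    """Parse a JSON string, or pass through a dict/list unchanged."""
    if isinstance(data, str):
        try:
            return json.loads(data)
        except json.JSONDecodeError:
            raise ValueError("The provided string is not valid JSON.")
    if not isinstance(data, (dict, list)):
        raise TypeError("Expected a JSON string, dictionary, or list.")
    return data

def summarize_with_retries(summarize, text, max_retries=5):
    """Call summarize(text) until its output parses as JSON or retries run out."""
    for attempt in range(1, max_retries + 1):
        try:
            return parse_json_response(summarize(text))
        except (ValueError, TypeError) as error:
            print(f"Attempt {attempt}/{max_retries} failed: {error}")
    return None  # every attempt produced unusable output

# Hypothetical stand-in for the model call: fails once, then returns valid JSON.
calls = {"count": 0}

def fake_summarize(text):
    calls["count"] += 1
    if calls["count"] == 1:
        return '{"name_of_candidate": "Jane Doe", "summary": "truncated'
    return '{"name_of_candidate": "Jane Doe", "years_of_experience": 3}'

result = summarize_with_retries(fake_summarize, "resume text")
print(result)
```

Returning `None` after the last failed attempt makes it explicit that a resume ended up without a summary, which is easy to miss when the loop simply stops retrying.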

In [14]:
def save_to_json(data):
    """
    Validate that data is JSON-compatible and return it as a Python object.

    Despite its name, this function does not write to a file.

    Args:
    - data (str, dict, or list): A JSON string, or an already-parsed dictionary or list.

    Returns:
    - The parsed JSON value if a string was given, otherwise the data unchanged.

    Raises:
    - ValueError: If data is a string that is not valid JSON.
    - TypeError: If data is not a string, dictionary, or list.
    """
    # Strings are parsed into Python objects; dictionaries and lists pass through unchanged.
    if isinstance(data, str):
        try:
            data = json.loads(data)
        except json.JSONDecodeError:
            raise ValueError("The provided string is not valid JSON.")
    elif not isinstance(data, (dict, list)):
        raise TypeError("The data should either be a valid JSON string, dictionary, or list.")

    return data


In [15]:
import os
prompt_v3=f'''Read the given resume and extract information corresponding to the keys\
 "name_of_candidate" which stores the candidate name, \
 "mobile_number" contains the mobile number, \
 "email_id" records the email id of the candidate, \
 total years of experience is stored in "years_of_experience", \
 "education" refers to the candidate's most recent or highest academic degree, \
 last university/school/college attended by the candidate is given by "university", \
 "linkedin_profile" contains the linkedin profile, \
 record all the technical skills in "technical_skills", \
 "years_of_jobs" showcases the years spent in different jobs, \
 years spent in the current organization is given by "year_in_current_position", \
 "Present_Organization" denotes the name of the present organization, and "summary" holds a short summary of the resume. \
 For "technical_skills", provide a summary of the programming languages, libraries, and frameworks the candidate has experience with as a list, \
 "years_of_jobs" is a list of job durations, e.g., ["2012-current", "2010-2012", "June 22, 2022 - Present"]. \
 "year_in_current_position" indicates the duration in their current job role as an integer. Present year is 2023. \
 "years_of_experience" is the sum of years spent in all jobs including the current one. \
 Round off the year to the upper ceiling. So, if it is 3 months, round it off to 1 year. \
 Summarize the resume in approximately 100 words for the "summary" field. \
 The final output must be in JSON
'''
json_data = read_json('/content/filtered_applications.json')


MAX_RETRIES = 5

for application in json_data:
    if 'resume_path' in application and 'email_id' in application:
        resume_text = read_document(os.path.join(resume_path, application['resume_path']))
        resume_text, _, _ = check_and_trim(resume_text)

        retries = 0
        success = False

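        while not success and retries < MAX_RETRIES:
            try:
                # Validate the model response as JSON before storing it
                application['resume_summary'] = save_to_json(summarize_resume(prompt_v3, resume_text))
                success = True
            except Exception as e:
                retries += 1
                print(f"An error occurred: {e}. Retrying attempt {retries}/{MAX_RETRIES}...")


# Check that the enriched data is still a JSON-compatible list or dictionary
save_to_json(json_data)
# Save the enriched data to a new JSON file
with open('/content/filtered_applications_summary.json', 'w') as f:
    json.dump(json_data, f, indent=4)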

An error occurred: The provided string is not valid JSON.. Retrying attempt 1/5...
An error occurred: The provided string is not valid JSON.. Retrying attempt 1/5...


Download the **filtered_applications_summary.json** file to be used in the next assignments.

In [16]:
from google.colab import files

# List of file paths that you want to download
file_paths = [
    "/content/filtered_applications_summary.json",
]

# Download each file to your local system
for path in file_paths:
    files.download(path)

Once this is done, we repeat the same steps for the full set of applications, as they were before filtering.

In [17]:
import os
prompt_v3=f'''Read the given resume and extract information corresponding to the keys\
 "name_of_candidate" which stores the candidate name, \
 "mobile_number" contains the mobile number, \
 "email_id" records the email id of the candidate, \
 total years of experience is stored in "years_of_experience", \
 "education" refers to the candidate's most recent or highest academic degree, \
 last university/school/college attended by the candidate is given by "university", \
 "linkedin_profile" contains the linkedin profile, \
 record all the technical skills in "technical_skills", \
 "years_of_jobs" showcases the years spent in different jobs, \
 years spent in the current organization is given by "year_in_current_position", \
 "Present_Organization" denotes the name of the present organization, and "summary" holds a short summary of the resume. \
 For "technical_skills", provide a summary of the programming languages, libraries, and frameworks the candidate has experience with as a list, \
 "years_of_jobs" is a list of job durations, e.g., ["2012-current", "2010-2012", "June 22, 2022 - Present"]. \
 "year_in_current_position" indicates the duration in their current job role as an integer. Present year is 2023. \
 "years_of_experience" is the sum of years spent in all jobs including the current one. \
 Round off the year to the upper ceiling. So, if it is 3 months, round it off to 1 year. \
 Summarize the resume in approximately 100 words for the "summary" field. \
 The final output must be in JSON
'''
all_applications = read_json('/content/all_applications.json')

MAX_RETRIES = 5

for application in all_applications:
    if 'resume_path' in application and 'email_id' in application:
        resume_text = read_document(os.path.join(resume_path, application['resume_path']))
        resume_text, _, _ = check_and_trim(resume_text)

        retries = 0
        success = False

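        while not success and retries < MAX_RETRIES:
            try:
                # Validate the model response as JSON before storing it
                application['resume_summary'] = save_to_json(summarize_resume(prompt_v3, resume_text))
                success = True
            except Exception as e:
                retries += 1
                print(f"An error occurred: {e}. Retrying attempt {retries}/{MAX_RETRIES}...")


# Check that the enriched data is still a JSON-compatible list or dictionary
save_to_json(all_applications)
# Save the enriched data to a new JSON file
with open('/content/all_applications_summary.json', 'w') as f:
    json.dump(all_applications, f, indent=4)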

Now download the file '/content/all_applications_summary.json' to be used in Assignment 8.

In [18]:
from google.colab import files

# List of file paths that you want to download
file_paths = [
    "/content/all_applications_summary.json",
]

# Download each file to your local system
for path in file_paths:
    files.download(path)

Through successive iterations of the resume summarization prompt, we have arrived at a prompt that extracts valuable information from resumes and returns it as JSON with consistent keys such as "years_of_experience", "technical_skills", and "years_of_jobs".
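One way to check that claim on a saved file is to compare each summary's keys against the keys named in the prompt. The following is a minimal sketch: `EXPECTED_KEYS` is transcribed from `prompt_v3`, the helper name `missing_keys` is hypothetical, and the comparison is case-insensitive on the assumption that `Present_Organization` and `present_organization` should count as the same key.

```python
# Keys requested in prompt_v3, lowercased for a case-insensitive comparison.
EXPECTED_KEYS = {
    "name_of_candidate", "mobile_number", "email_id", "years_of_experience",
    "education", "university", "linkedin_profile", "technical_skills",
    "years_of_jobs", "year_in_current_position", "present_organization", "summary",
}

def missing_keys(summary):
    """Return the expected keys absent from a parsed summary, sorted by name."""
    present = {key.lower() for key in summary}
    return sorted(EXPECTED_KEYS - present)

# Deliberately incomplete sample, so the report is non-empty.
sample = {
    "name_of_candidate": "Abhilash Babu",
    "Present_Organization": "IDnow GmbH",
    "summary": "Senior Machine Learning Engineer.",
}
report = missing_keys(sample)
print(report)
```

Running this over every `resume_summary` in the saved JSON would flag the applications whose responses need to be regenerated.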