# Description:
This notebook compares two GPT models, GPT-3.5 and GPT-4, on the task of automatically evaluating candidates' resumes. Both models are given resumes of varying complexity, and the notebook examines which one delivers more accurate and relevant evaluations.

### Key components include:

1. Utilizing GPT-3.5 and GPT-4 to extract pivotal details such as contact information, skills, and work experiences from resumes.
2. Integrating libraries that enable the reading of different file formats to ensure no candidate is left out because of the file type.
3. Employing a scoring mechanism, hinged on predefined "must-have" skills, to grade resumes. The score provides an objective measure to compare the efficacy of both models in recognizing and evaluating these skills.

First, install the libraries required to run this notebook.

In [None]:
!pip install "openai<1.0"  # this notebook uses the legacy openai.ChatCompletion API, which was removed in openai 1.0
!pip install PyMuPDF
!pip install textract
!pip install python-docx
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.5.1


Upload a `.env` file containing `OPENAI_API_KEY` to the `/content/` directory.

The next cell loads the values from that file into the environment, so sensitive values such as the OpenAI API key do not have to appear in the notebook itself.


In [None]:
# Export your API Key to environment variable
# Upload the .env file to the directory "/content/"
!pip install python-dotenv
from dotenv import load_dotenv
load_dotenv()



True

In [None]:
import openai
import os
# Retrieve the API key from environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")
# Set the API key for OpenAI
openai.api_key = openai_api_key
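
If the `.env` file is missing or misnamed, `os.getenv` returns `None` without complaint and the problem only surfaces at the first API call. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper name, not part of the notebook.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that the .env file was uploaded to /content/"
        )
    return value

# Possible use in the cell above:
# openai.api_key = require_env("OPENAI_API_KEY")
```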

Upload two JSON files: the job requirements generated in Assignment1 (`requirements_output.json`) and the filtered resumes with their summaries generated in Assignment3 (`filtered_applications_summary.json`).

In [None]:
from google.colab import files

# Upload the first file
print("Please upload the first file (filtered_applications_summary.json):")
uploaded1 = files.upload()

# Check to ensure a file was uploaded. If not, prompt again.
while len(uploaded1) == 0:
    print("No file uploaded. Please upload the first file (filtered_applications_summary.json) again:")
    uploaded1 = files.upload()

# Upload the second file
print("Please upload the second file (requirements_output.json):")
uploaded2 = files.upload()

# Check to ensure a file was uploaded. If not, prompt again.
while len(uploaded2) == 0:
    print("No file uploaded. Please upload the second file (requirements_output.json) again:")
    uploaded2 = files.upload()

# Merge the dictionaries to have all uploaded files in one
uploaded = {**uploaded1, **uploaded2}

# Print details of uploaded files
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))


Now download the `Webinar_resumes.zip` file, which contains all the resumes.

In [None]:
import requests

def download_file_from_google_drive(file_id, destination):
    base_url = "https://drive.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(base_url, params={'id': file_id}, stream=True)
    token = get_confirm_token(response)

    if token:
        params = {'id': file_id, 'confirm': token}
        response = session.get(base_url, params=params, stream=True)

    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:
                f.write(chunk)
# # Example Usage
# file_id = '1HaM3IeK2-iqyZzeQmCnAzKLcF9NF-mSo'  # Replace with your file's ID
# destination = 'resume_data.zip'  # Replace with your desired file name and extension
file_id = '1P7PXx5ynhTRGnfXd273Swbph9t1w8tDs'
destination = 'Webinar_resumes.zip'  # Replace with your desired file name and extension
download_file_from_google_drive(file_id, destination)
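
The download function writes whatever the server returns, so if the request yields something other than the archive (for example an HTML warning page), the problem only appears later during extraction. A small sketch of a check that could run right after the download; `check_downloaded_archive` is a hypothetical helper, not part of the notebook.

```python
import zipfile

def check_downloaded_archive(path):
    """Raise if `path` is not a readable zip archive; otherwise return its member names."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a valid zip archive; the download may have failed")
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

# Possible use after the download:
# print(check_downloaded_archive('Webinar_resumes.zip'))
```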

The code below defines utility functions for reading and preparing resumes. It can read job requirements from a JSON file and extract text from several document formats: DOCX, DOC, PDF, and Excel sheets. If a resume is too long, `check_and_trim` tokenizes the text with `tiktoken` (the `cl100k_base` encoding) and keeps only the first `max_tokens` tokens (1500 by default), so that the text sent to the model stays within a manageable size.

In [None]:
import openai
import json
import os
from collections import OrderedDict
import re
from docx import Document
import textract
import fitz  # PyMuPDF
import pandas as pd
import math
import nltk
import tiktoken
# nltk.download('punkt')

def read_requirements(file_path):
    # Reads the job requirements from a JSON file
    try:
        with open(file_path, 'r') as f:
            data = json.load(f)
        return data
    except Exception as e:
        print(f"Error reading requirements JSON: {e}")
        return None

def read_json(file_path):
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data

def read_document(file_path):
    file_path = str(file_path)
    _, file_extension = os.path.splitext(file_path)
    file_extension = file_extension.lower()  # match extensions case-insensitively (e.g. ".DOCX")
    text = ""
    if file_extension == '.docx':
        doc = Document(file_path)
        for para in doc.paragraphs:
            text = text + para.text + " "
    elif file_extension == '.doc':
        text = textract.process(file_path).decode()
    elif file_extension.lower() == '.pdf':
        doc = fitz.open(file_path)
        for page_number in range(len(doc)):
            page = doc[page_number]
            text = text + page.get_text() + " "
    elif file_extension.lower() in ['.xls', '.xlsx']:
        data = pd.read_excel(file_path)
        text = data.to_string(index=False)

    else:
        print(f"Unsupported file type: {file_extension}")

    return text

def check_and_trim(resume_text, max_tokens=1500):
    # tokens = nltk.word_tokenize(resume_text)
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(resume_text)
    old_len = len(tokens)
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
        resume_text = enc.decode(tokens)
    return resume_text, old_len, len(tokens)
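
`check_and_trim` keeps the first `max_tokens` tokens and discards the rest, so anything near the end of a long resume is lost. The sketch below illustrates that behaviour with a whitespace tokenizer standing in for `tiktoken` (an assumption made so the example runs without downloading an encoding); `trim_head` is a hypothetical name.

```python
def trim_head(text, max_tokens):
    """Keep the first `max_tokens` whitespace-separated tokens of `text`.

    Returns (trimmed_text, original_token_count, kept_token_count),
    mirroring the return shape of check_and_trim.
    """
    tokens = text.split()  # stand-in tokenizer; the notebook uses tiktoken instead
    original_count = len(tokens)
    kept = tokens[:max_tokens]
    return " ".join(kept), original_count, len(kept)

trimmed, before, after = trim_head("Python PyTorch Keras TensorFlow OpenCV", max_tokens=3)
print(trimmed, before, after)  # Python PyTorch Keras 5 3
```

Because the cut is positional, skills listed at the bottom of a resume can fall outside the limit; comparing the two counts returned by `check_and_trim` shows how much text was dropped.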



The provided code allows a user to select a desired number of resumes to process from a total set, with a default of 2 resumes if no input is given. The **user_select_number_of_resumes** function prompts the user for their choice, ensures valid input, and returns the selected number. The main execution block reads the **filtered_applications_summary** data from a JSON file, queries the user for their desired number of resumes using the aforementioned function, and then randomly selects the specified number of resumes from the total set, storing the result in the **selected_applications** variable.

In [None]:
import json
import random


def user_select_number_of_resumes(total_resumes, default=2):
    """
    Allow the user to input a number of resumes to process.
    If no input is given, the default value is returned.

    Args:
    - total_resumes (int): Total number of resumes available.
    - default (int): The default number to return if no input.

    Returns:
    - int: The number of resumes the user wants to process.
    """
    print(f"Total resumes available: {total_resumes}")
    user_input = input(f"How many resumes do you want to process? (Default is {default}): ")

    # If the user doesn't provide any input, return the default value.
    if not user_input:
        return default

    try:
        # Convert user input to an integer and ensure it's within the range.
        selected_num = int(user_input)
        if 1 <= selected_num <= total_resumes:
            return selected_num
        else:
            print(f"Please select a number between 1 and {total_resumes}.")
            return user_select_number_of_resumes(total_resumes, default)
    except ValueError:
        # If the user provides non-numeric input, prompt them again.
        print("Please enter a valid number.")
        return user_select_number_of_resumes(total_resumes, default)

# Read the filtered_applications_summary data from the JSON file
json_data = read_json('/content/filtered_applications_summary.json')

# Display total resumes and get the user's choice
n = user_select_number_of_resumes(len(json_data))

# Randomly select n resumes
selected_applications = random.sample(json_data, n)

Total resumes available: 2
How many resumes do you want to process? (Default is 2): 2


The code below loads the job requirements and extracts the resume archive. The `read_requirements` function reads the job requirements from a JSON file, and the must-have skills are taken from its `must_have_skills` field; the filtered application data was already loaded with `read_json` in the previous cell. The `extract_and_rename` function then unpacks `Webinar_resumes.zip` and renames any directory whose name contains spaces so that it uses underscores instead. The resume summaries stored in the filtered application data were produced in Assignment3 by a structured `resume_prompt`, which asked the model to read a resume and extract the candidate's name, contact information, experience details, education credentials, technical skills, and a concise summary. The expected output of that prompt is a JSON structure containing these details, with `years_of_experience` properly rounded and a summary of about 100 words.

In [None]:
import zipfile
import shutil
job_requirements = read_requirements('/content/requirements_output.json')
must_have_skills = job_requirements["must_have_skills"]
zip_file_path = "/content/Webinar_resumes.zip" # For example give the path to resume_data.zip

def extract_and_rename(zip_file_path, extract_path="extracted_files"):
    """
    Extract files from a zip archive to a specified directory.
    Rename directories containing spaces to use underscores instead.

    Args:
    - zip_file_path (str): The path to the zip file to be extracted.
    - extract_path (str, optional): The path where the zip file content should be extracted to.
                                    Defaults to "extracted_files".

    Returns:
    - str: Path to the resume or directory.
    """
    # Check if extract_path exists, if not, create it
    if not os.path.exists(extract_path):
        os.makedirs(extract_path)

    # If extract_path is not empty, skip extraction
    if not os.listdir(extract_path):
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            zip_ref.extractall(extract_path)

    resume_path = extract_path
    for item in os.listdir(extract_path):
        item_path = os.path.join(extract_path, item)

        # Check if the current item is a directory and if it has spaces in its name
        if os.path.isdir(item_path) and ' ' in item:
            new_name = item.replace(' ', '_')
            new_path = os.path.join(extract_path, new_name)

            # If the new directory name doesn't already exist, create it
            if not os.path.exists(new_path):
                os.makedirs(new_path)

            # Copying contents from the old directory to the new one
            for sub_item in os.listdir(item_path):
                shutil.copy2(os.path.join(item_path, sub_item), new_path)

            # Removing the old directory
            shutil.rmtree(item_path)
            resume_path = new_path
        else:
            resume_path = item_path

    return resume_path
resume_path = extract_and_rename(zip_file_path)



The code below scores each resume against the must-have skills (`must_have`). The `calculate_Score` function prompts the GPT-3.5 model to assess a resume on each of the specified skills and to return, in JSON format, a score and a summary for each skill. Scores range from 0 to 5 depending on the candidate's experience with the skill. The main loop iterates over the selected applications and skips any that lack a resume path or an email. For each remaining application, the resume is read and trimmed if necessary, the previously generated summary is taken from the application record, and the candidate's name is printed together with the model's scoring response.

In [None]:
def calculate_Score(text, must_have):
    model="gpt-3.5-turbo-16k"
    max_tokens=2000

    first_prompt = f'''I have a resume. I want to find a person whose resume has skills in {must_have}. \
    Look into a resume and give score based on each of the skill mentioned in {must_have}. \
    Each skill present in {must_have}, should have 2 elements. "Score" and "Summary", if the score is zero then "Summary" should be empty, otherwise \
    if score is non-zero then give the summary of the project in "Summary". Give json response only. The score should be between 0 and 5. Limited experience \
    the score should be one or two, 2-3 projects then score should be three or four and 4-5 projects or more then score should be four or five.'''

    messages = [
            {"role": "system", "content": f"{first_prompt}"},
            {"role": "user", "content": f"{text}"},
        ]
    response = openai.ChatCompletion.create(
            model=model,
            messages=messages,
            temperature=1,
            max_tokens=max_tokens
        )
    generated_texts = [
        choice.message["content"].strip() for choice in response["choices"]
    ]
    # print("generated_texts", generated_texts)
    return generated_texts[0]

for application in selected_applications:
    if 'resume_path' in application and 'email_id' in application:

        # Extract resume text
        resume_text = read_document(os.path.join(resume_path, application['resume_path']))
        resume_text, _, _ = check_and_trim(resume_text)

        # Directly assign the resume_summary without json.loads()
        resume_summary = application['resume_summary']

        score = calculate_Score(resume_text, must_have_skills)
        print(f'''[Score Request] for {resume_summary["name_of_candidate"]} ''', score)


[Score Request] for Soso Sukhitashvili  {
  "Keras": {
    "Score": 0,
    "Summary": ""
  },
  "TensorFlow": {
    "Score": 0,
    "Summary": ""
  },
  "PyTorch": {
    "Score": 1,
    "Summary": "Soso has experience working with PyTorch in deep learning projects such as face recognition, image similarity search, object tracking, object detection, and object segmentation."
  },
  "Computer Vision": {
    "Score": 1,
    "Summary": "Soso has worked on computer vision projects including face recognition, image similarity search, object tracking, object detection, and object segmentation."
  }
}
[Score Request] for Joseph Adeola  {
  "Score": 3,
  "Summary": "Joseph Adeola has experience and skills in Keras, TensorFlow, and PyTorch. He has worked on the iToBos project, a research initiative focused on developing an intelligent total body scanner for early detection of melanoma using computer vision techniques and deep learning models. He applied computer vision techniques for dataset pre

**Note**:
The output of the cell above is not stable across runs. In some runs the model collapses the per-skill breakdown into a single overall score and summary, as shown below for Joseph Adeola:


```
[Score Request] for Soso Sukhitashvili  {
  "Keras": {
    "Score": 0,
    "Summary": ""
  },
  "TensorFlow": {
    "Score": 0,
    "Summary": ""
  },
  "PyTorch": {
    "Score": 2,
    "Summary": "Soso has experience working with PyTorch, specifically in the field of deep learning. He has worked on projects involving object detection, object tracking, and image similarity search using PyTorch."
  },
  "Computer Vision": {
    "Score": 3,
    "Summary": "Soso has strong skills in computer vision, as demonstrated in his work on projects such as face recognition, image similarity search, object tracking, object detection, object segmentation, and OCR."
  }
}
[Score Request] for Joseph Adeola  {
  "score": 5,
  "summary": "The person's resume has expertise in Keras, TensorFlow, PyTorch, and Computer Vision. They have worked on projects such as skin lesion detection and classification using deep learning models, feature tracking using the ICP algorithm for event-based pose estimation, stereo visual odometry, camera calibration and pose estimation with augmented reality, facial expression recognition using transfer learning, and underwater image analysis and registration."
}
```
In some runs the model does return a score between 0 and 5 with a summary for every skill, but the result is not stable: it changes from one run to the next.
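
One way to detect the malformed responses described above is to validate each response before using it. The following is a minimal sketch, assuming `must_have` is a list of skill names and that the expected keys are exactly `"Score"` and `"Summary"` as requested in the prompt; `validate_score_response` is a hypothetical helper, not part of the notebook.

```python
import json

def validate_score_response(raw, must_have):
    """Check that `raw` is JSON with one {"Score", "Summary"} entry per skill.

    Returns a list of problems; an empty list means the response is usable.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for skill in must_have:
        entry = data.get(skill)
        if not isinstance(entry, dict):
            problems.append(f"missing entry for skill: {skill}")
            continue
        score = entry.get("Score")
        # bool is a subclass of int, so exclude it explicitly
        if not isinstance(score, int) or isinstance(score, bool) or not 0 <= score <= 5:
            problems.append(f"invalid Score for {skill}: {score!r}")
        if "Summary" not in entry:
            problems.append(f"missing Summary for {skill}")
    return problems
```

A collapsed response such as `{"score": 5, "summary": "..."}` yields one "missing entry" problem per skill, so the caller can retry the request or flag the resume for manual review.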



### From our observations, the output generated by GPT-3.5 does not reliably follow the requested structure and can contain extraneous information. Even when it gives the desired result, the output is not stable across runs. Keeping everything else the same, we now change only the model to GPT-4.

In [None]:
def calculate_Score(text, must_have):
    model="gpt-4"
    max_tokens=2000

    first_prompt = f'''I have a resume. I want to find a person whose resume has skills in {must_have}. \
    Look into a resume and give score based on each of the skill mentioned in {must_have}. \
    Each skill present in {must_have}, should have 2 elements. "Score" and "Summary", if the score is zero then "Summary" should be empty, otherwise \
    if score is non-zero then give the summary of the project in "Summary". Give json response only. The score should be between 0 and 5. Limited experience \
    the score should be one or two, 2-3 projects then score should be three or four and 4-5 projects or more then score should be four or five.'''

    messages = [
            {"role": "system", "content": f"{first_prompt}"},
            {"role": "user", "content": f"{text}"},
        ]
    response = openai.ChatCompletion.create(
            model=model,
            messages=messages,
            temperature=1,
            max_tokens=max_tokens
        )
    generated_texts = [
        choice.message["content"].strip() for choice in response["choices"]
    ]
    # print("generated_texts", generated_texts)
    return generated_texts[0]

for application in selected_applications:
    if 'resume_path' in application and 'email_id' in application:

        # Extract resume text
        resume_text = read_document(os.path.join(resume_path, application['resume_path']))
        resume_text, _, _ = check_and_trim(resume_text)

        # Directly assign the resume_summary without json.loads()
        resume_summary = application['resume_summary']

        score = calculate_Score(resume_text, must_have_skills)
        print(f'''[Score Request] for {resume_summary["name_of_candidate"]} ''', score)


[Score Request] for Soso Sukhitashvili  {
    "Keras": {
        "Score": 0,
        "Summary": ""
    },
    "TensorFlow": {
        "Score": 0,
        "Summary": ""
    },
    "PyTorch": {
        "Score": 2,
        "Summary": "Used PyTorch in variety of projects including face recognition, image similarity search, object detection and tracking, OCR, and classification at MaxinAI and Cortica AI. Despite the challenging tasks like varied shapes and aspect ratios of objects, contributed to increase the AI model's accuracy by 5%."
    },
    "Computer Vision": {
        "Score": 3,
        "Summary": "Worked extensively on Computer Vision projects at MaxinAI and Cortica AI. Major projects included object detection and tracking, face recognition, image similarity search, and object segmentation. Utilized computer vision for detecting manufacturing defects on products with high accuracy and low latency. As a result, increased the AI model's accuracy by 5%."
    }
}
[Score Request] for J

**Note**:
The cell above returns well-formed JSON containing a score between 0 and 5 and a summary for each of the must-have skills. In our runs the structure of the JSON output stayed stable, which indicates that gpt-4 follows the requested format more reliably than gpt-3.5.

# Output:

```
[Score Request] for Soso Sukhitashvili  {
  "Keras": {
    "Score": 0,
    "Summary": ""
  },
  "TensorFlow": {
    "Score": 0,
    "Summary": ""
  },
  "PyTorch": {
    "Score": 4,
    "Summary": "Worked as a deep learning engineer at MaxinAI and developed advanced ML algorithms for complex computer vision projects. As a senior engineer at Cortica AI, improved AI model accuracy by 5% and decreased processing speed by 15%."
  },
  "Computer Vision": {
    "Score": 5,
    "Summary": "Has extensive experience in working on computer vision projects. Developed algorithms for object detection and tracking, face recognition, and image tagging. Also, worked on an Israel based company called Cortica AI, where he developed new models for defect detection on manufacturing products."
  }
}
[Score Request] for Joseph Adeola  {
  "Keras": {
    "Score": 5,
    "Summary": [
      "Deep Learning Intern at Computer Vision and Robotics Research Institute where developed and trained deep learning models for skin lesion detection and classification.",
      "Built a facial expression recognition system on Nvidia Jetson Nano using transfer learning techniques with the RESNET-18 architecture."
    ]
  },
  "TensorFlow": {
    "Score": 5,
    "Summary": [
      "Deep Learning Intern at Computer Vision and Robotics Research Institute where developed and trained deep learning models for skin lesion detection and classification.",
      "Built a facial expression recognition system on Nvidia Jetson Nano using transfer learning techniques with the RESNET-18 architecture."
    ]
  },
  "PyTorch": {
    "Score": 4,
    "Summary": [
      "Developed various projects using PyTorch, one of them involves detection of facial expressions on Nvidia Jetson Nano."
    ]
  },
  "Computer Vision": {
    "Score": 5,
    "Summary": [
      "Worked on an intelligent total body scanner for early detection of melanoma using computer vision techniques and deep learning models.",
      "Developed a feature tracker using the iterative closest point (ICP) algorithm for event-based vision.",
      "Handled a project that involved camera calibration, pose estimation, and augmented reality using Aruco markers.",
      "Worked on underwater image analysis and registration project. Employed SIFT algorithm for robust feature extraction and image registration."
    ]
  }
}

```
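
The notebook prints the raw JSON for each candidate. To compare candidates with a single number, the per-skill scores can be aggregated. The sketch below uses an unweighted sum, which is an assumption for illustration and not a method defined in the notebook; `total_score` is a hypothetical helper, and the scores are copied from the sample output above with the summaries omitted.

```python
import json

def total_score(raw_response):
    """Sum the per-skill "Score" values in a JSON scoring response."""
    data = json.loads(raw_response)
    return sum(entry["Score"] for entry in data.values())

# Per-skill scores copied from the sample output above (summaries omitted).
responses = {
    "Soso Sukhitashvili": '{"Keras": {"Score": 0}, "TensorFlow": {"Score": 0}, '
                          '"PyTorch": {"Score": 4}, "Computer Vision": {"Score": 5}}',
    "Joseph Adeola": '{"Keras": {"Score": 5}, "TensorFlow": {"Score": 5}, '
                     '"PyTorch": {"Score": 4}, "Computer Vision": {"Score": 5}}',
}

# Rank candidates from highest to lowest total.
ranking = sorted(responses, key=lambda name: total_score(responses[name]), reverse=True)
for name in ranking:
    print(name, total_score(responses[name]))
# Joseph Adeola 19
# Soso Sukhitashvili 9
```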



### With GPT-4, the output is consistently structured as JSON and stable over multiple runs, a clear improvement over GPT-3.5. For those who want the lower cost of GPT-3.5 with accuracy closer to that of GPT-4, the recommended approach is chain-of-thought prompting: first establish the scoring criteria, then use those criteria to determine the final score.

# `Chain-of-thought prompting`:

`Chain-of-thought` prompting guides a large language model through a series of intermediate steps instead of asking for the answer directly. Decomposing a multi-step problem in this way improves the model's performance on arithmetic, commonsense, and symbolic reasoning tasks. The model is encouraged to explain its reasoning, and the resulting chain of thought resembles the way a person works through a problem step by step, with each step following logically from the previous one. In few-shot prompting, the model generates such chains when the exemplars in the prompt themselves demonstrate chain-of-thought reasoning. It is important to note that the benefits have been reported to become evident only in sufficiently large models, on the order of 100 billion parameters or more.
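
Applied to resume scoring, the two steps described above (establish the criteria, then score against them) could be sketched as two separate chat requests. The prompt wording and the function names `build_criteria_messages` and `build_scoring_messages` are illustrative assumptions, not prompts taken from the notebook, and the sketch only builds the message lists without calling the API.

```python
def build_criteria_messages(must_have):
    """Step 1: ask the model to write an explicit scoring rubric for the skills."""
    skills = ", ".join(must_have)
    return [
        {"role": "system", "content": "You are helping to screen resumes."},
        {"role": "user", "content": (
            f"For each of these skills: {skills}, describe step by step what "
            "evidence in a resume would justify a score of 0, 1-2, 3-4, and 5."
        )},
    ]

def build_scoring_messages(criteria, resume_text, must_have):
    """Step 2: ask the model to apply the rubric from step 1 to one resume."""
    skills = ", ".join(must_have)
    return [
        {"role": "system", "content": (
            "Score the resume using only the criteria below. First list the "
            "relevant evidence for each skill, then give the final answer as JSON "
            f'with one entry per skill ({skills}), each containing "Score" and "Summary".\n\n'
            f"Criteria:\n{criteria}"
        )},
        {"role": "user", "content": resume_text},
    ]
```

Here `criteria` would be the model's reply to the first request. Because the second reply contains reasoning before the JSON, the final JSON object would need to be extracted from the text before parsing.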

