<a href="https://colab.research.google.com/github/ua-datalab/NLP-Speech/blob/main/Introduction_to_Information_Extraction/Automatic_resume_scraping_with_AIVerde.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic resume scraping with LLMs using AI Verde

How are applicant tracking systems (ATS) used to collect information from resumes?

Source: https://huggingface.co/foduucom/resume-extractor

This simple example uses an LLM to extract structured information from a resume PDF. For this demonstration, we use AI Verde to access the Llama 3.1 model.

Note: AI Verde requires an API key, and access must be requested.


In [15]:
import os

# Read the API key from a local file and expose it as an environment variable.
with open("api.txt") as api:
    api_key = api.read().strip()  # strip the trailing newline, if any

os.environ["OPENAI_API_KEY"] = api_key
# We need a custom endpoint, as we will be calling Verde's LLM
API_ENDPOINT = "https://llm-api.cyverse.ai"
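The cell above assumes that `api.txt` exists next to the notebook. As a minimal sketch of a slightly more defensive variant, the helper below checks the environment first and only then falls back to the file. The name `load_api_key` and its parameters are hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def load_api_key(path="api.txt", env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, falling back to a text file.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        key_file = Path(path)
        if not key_file.is_file():
            raise FileNotFoundError(f"Set {env_var} or create {path} with your key.")
        # Strip surrounding whitespace, including the newline most editors append.
        key = key_file.read_text().strip()
    if not key:
        raise ValueError(f"{path} is empty.")
    return key
```

In the notebook, `os.environ["OPENAI_API_KEY"] = load_api_key()` could then replace the manual file read.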

In [47]:
from pprint import pprint


def pretty_print_model_output(model_output):
    """
    Pretty prints the output of model.invoke(),
    handling AIMessage objects.
    """
    if hasattr(model_output, "content"):  # AIMessage objects expose 'content'
        data = model_output.content
        additional_kwargs = model_output.additional_kwargs
        response_metadata = model_output.response_metadata
    else:
        data = str(model_output)  # Fallback to string conversion
        # No metadata is available for non-AIMessage outputs
        additional_kwargs = {}
        response_metadata = {}

    print(data)
    pprint(additional_kwargs)
    pprint(response_metadata)

In [45]:
!pip install -qU langchain langchain-openai langchain-community pypdf
!pip install pdfminer.six



In [26]:
from langchain_openai import ChatOpenAI

# We will connect to Meta-Llama-3.1 through Verde.
# Note that we need to specify the API endpoint.
model = ChatOpenAI(model="Meta-Llama-3.1-70B-Instruct-quantized", base_url=API_ENDPOINT)

# Do a test call:
response = model.invoke("Hello, who are you?")


In [44]:
# Example usage
pretty_print_model_output(response)


I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."
{'refusal': None}
{'finish_reason': 'stop',
 'logprobs': None,
 'model_name': 'neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8',
 'system_fingerprint': None,
 'token_usage': {'completion_tokens': 23,
                 'completion_tokens_details': None,
                 'prompt_tokens': 41,
                 'prompt_tokens_details': None,
                 'total_tokens': 64}}
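The metadata printed above includes a `token_usage` dictionary, which is handy for keeping an eye on prompt size. The sketch below pulls the three counts out of such a dictionary; `summarize_usage` is a hypothetical helper, and the sample values are copied from the output above.

```python
def summarize_usage(response_metadata):
    """Return (prompt, completion, total) token counts from a metadata dict."""
    usage = response_metadata.get("token_usage") or {}
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    # Fall back to the sum if the provider does not report a total.
    total = usage.get("total_tokens", prompt + completion)
    return prompt, completion, total


# Metadata copied from the test call above (trimmed to the relevant keys).
metadata = {
    "finish_reason": "stop",
    "token_usage": {"completion_tokens": 23, "prompt_tokens": 41, "total_tokens": 64},
}
print(summarize_usage(metadata))  # (41, 23, 64)
```

With a live response, `summarize_usage(response.response_metadata)` would be the equivalent call.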


In [17]:
from langchain_openai import ChatOpenAI  # We use the OpenAI protocol, but with another provider (Verde)

In [19]:
# Format for the extracted output:
json_content = """{{
    "name": "",
    "email" : "",
    "phone_1": "",
    "phone_2": "",
    "address": "",
    "city": "",
    "linkedin": "",
    "professional_experience_in_years": "",
    "highest_education": "",
    "is_fresher": "yes/no",
    "is_student": "yes/no",
    "skills": ["",""],
    "applied_for_profile": "",
    "education": [
        {{
            "institute_name": "",
            "year_of_passing": "",
            "score": ""
        }},
        {{
            "institute_name": "",
            "year_of_passing": "",
            "score": ""
        }}
    ],
    "professional_experience": [
        {{
            "organisation_name": "",
            "duration": "",
            "profile": ""
        }},
        {{
            "organisation_name": "",
            "duration": "",
            "profile": ""
        }}
    ]
}}"""

class InputData:
    # Build the LLM prompt
    def input_data(self, text):
        # json_content doubles its braces; str.format() collapses them to single braces
        template = json_content.format()

        # Named 'prompt' so that the built-in input() is not shadowed
        prompt = f"""Extract relevant information from the following resume text and fill the provided JSON template.
                    Ensure all keys in the template are present in the output,
                    even if the value is empty or unknown.
                    If a specific piece of information is not found in the text, use 'Not provided' as the value.

        Resume text:
        {text}

        JSON template:
        {template}

        Instructions:
        1. Carefully analyse the resume text.
        2. Extract relevant information for each field in the JSON template.
        3. If a piece of information is not explicitly stated, make a reasonable inference based on the context.
        4. Ensure all keys from the template are present in the output JSON.
        5. Format the output as a valid JSON string.

        Output the filled JSON template only, without any additional text or explanations."""

        return prompt

    # Create the LLM client:
    def llm(self):
        return ChatOpenAI(model="Meta-Llama-3.1-70B-Instruct-quantized", base_url=API_ENDPOINT)
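The template above is maintained by hand as a string with doubled braces. One possible alternative, sketched below, is to keep the template as ordinary Python data and serialize it with `json.dumps`, which avoids brace escaping and guarantees that the template is valid JSON. `TEMPLATE` and `build_prompt` are hypothetical names, and the template is trimmed to a few fields for brevity.

```python
import json

# A trimmed-down version of the template above, written as ordinary Python data.
TEMPLATE = {
    "name": "",
    "email": "",
    "skills": ["", ""],
    "education": [{"institute_name": "", "year_of_passing": "", "score": ""}],
}


def build_prompt(resume_text, template=TEMPLATE):
    """Assemble an extraction prompt around a JSON-serialized template."""
    template_json = json.dumps(template, indent=4)
    return (
        "Extract relevant information from the following resume text "
        "and fill the provided JSON template.\n\n"
        f"Resume text:\n{resume_text}\n\n"
        f"JSON template:\n{template_json}\n\n"
        "Output the filled JSON template only."
    )


prompt = build_prompt("Jane Doe\njane.doe@gmail.com")
print(prompt)
```

A side benefit of this design is that the same dictionary can later be reused to check that the model's reply contains every expected key.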

In [5]:
# Extract the raw text from a PDF with pdfminer:
from pdfminer.high_level import extract_text

def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

In [6]:
!wget --no-check-certificate -O anti-cv.pdf "https://raw.githubusercontent.com/ua-datalab/NLP-Speech/main/Introduction_to_Information_Extraction/anti-cv.pdf"


--2025-02-20 07:17:57--  https://raw.githubusercontent.com/ua-datalab/NLP-Speech/main/Introduction_to_Information_Extraction/anti-cv.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68269 (67K) [application/octet-stream]
Saving to: ‘anti-cv.pdf’


2025-02-20 07:17:57 (35.0 MB/s) - ‘anti-cv.pdf’ saved [68269/68269]



In [20]:
# Extract the text from the downloaded resume.
# wget saved the file to the current working directory (/content on Colab).
text = extract_text_from_pdf("anti-cv.pdf")

print(text)

Jane Doe

Anti Curriculum Vitae

KEY

+1 (123) 456 7898

website janedoe.xyz

jane.doe@gmail.com
github github.com/jane

(cid:114) Heart-stab. Professional rejections and things I messed up.
⇝ Squigly arrow. What did I learn? What were the consequences?

EDUCATION

(cid:114) High-school: Never took German class seriously. To this day I don’t speak German.
⇝ I think I learned my lesson. I regret not having learned German, I wish I could speak to my German

colleagues in their mother tongue now.

WORK EXPERIANCE

(cid:114) Summer 2021 Rejected from XYZ.
(cid:114) Summer 2021, didn’t participate in the final round of the Alibaba math competition.
(cid:114) Spring 2021 University research scholarship, my sloppy last minute application was rejected ⇝ Don’t

make a last minuet sloppy application. Write multiple drafts days in advance.

(cid:114) DEF, rejected ⇝ they replied and were cordial, and told me they would get back to me if they needed

me in the future.

(cid:114) Lorum Ipsum reject
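The `(cid:114)` markers in the extracted text are placeholders that pdfminer emits for glyphs it cannot map to Unicode; here they appear to stand in for the resume's custom bullet symbol. As a minimal sketch, the text could be tidied before it is sent to the model. `clean_extracted_text` is a hypothetical helper, and replacing the marker with a dash is an arbitrary choice.

```python
import re

# Matches pdfminer placeholders such as "(cid:114)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def clean_extracted_text(raw, replacement="-"):
    """Replace (cid:N) placeholders and collapse runs of blank lines."""
    text = CID_PATTERN.sub(replacement, raw)
    # Three or more consecutive newlines become a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "(cid:114) Summer 2021 Rejected from XYZ.\n\n\n\n(cid:114) Lorum Ipsum reject\n"
print(clean_extracted_text(sample))
```

Whether this cleanup improves the extraction is an open question; the model handled the raw text in this demonstration.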

In [48]:
# Named 'extractor' so that the built-in input() is not shadowed
extractor = InputData()
llm = extractor.llm()
data = llm.invoke(extractor.input_data(text))
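The prompt asks for JSON only, but the reply still arrives as a plain string, and models sometimes wrap it in a Markdown code fence. The sketch below shows one way to turn such a reply into a Python dictionary and check that the expected keys are present. `parse_json_reply` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json

# A Markdown code-fence marker, built indirectly so it does not clash with this block.
FENCE = "`" * 3


def parse_json_reply(reply, required_keys=()):
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = reply.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag such as "json").
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rstrip()
        if cleaned.endswith(FENCE):
            cleaned = cleaned[: -len(FENCE)]
    parsed = json.loads(cleaned)  # raises json.JSONDecodeError on malformed output
    missing = [key for key in required_keys if key not in parsed]
    if missing:
        raise KeyError(f"Reply is missing keys: {missing}")
    return parsed


# Made-up reply for illustration; a real one would come from the model.
reply = FENCE + 'json\n{"name": "Jane Doe", "email": "jane.doe@gmail.com"}\n' + FENCE
record = parse_json_reply(reply, required_keys=("name", "email"))
print(record["name"])  # Jane Doe
```

In the notebook, `data.content` would take the place of `reply`.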

In [46]:
# print(data)
pretty_print_model_output(data)

{
    "name": "Jane Doe",
    "email" : "jane.doe@gmail.com",
    "phone_1": "+1 (123) 456 7898",
    "phone_2": "",
    "address": "",
    "city": "",
    "linkedin": "",
    "professional_experience_in_years": "",
    "highest_education": "High school",
    "is_fresher": "Not provided",
    "is_student": "Not provided",
    "skills": ["French", "German", "Software"],
    "applied_for_profile": "",
    "education": [
        {
            "institute_name": "Not provided",
            "year_of_passing": "",
            "score": ""
        },
        {
            "institute_name": "",
            "year_of_passing": "",
            "score": ""
        }
    ],
    "professional_experience": [
        {
            "organisation_name": "XYZ",
            "duration": "Summer 2021",
            "profile": "Rejected"
        },
        {
            "organisation_name": "Alibaba",
            "duration": "Summer 2021",
            "profile": "Didn't participate in the final round"
        }