[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vicvenet/GenAI_for_Innovative_Communications/blob/main/2025_S1/Week_3/customer_information_extraction.ipynb)

### How to run this notebook

This notebook is meant to be run in Google Colab:

- Sign in to your Google account
- Click on the "Open In Colab" badge at the top left of this notebook
- Run the notebook all at once using the "Runtime" menu at the top right of the notebook and selecting "Run all"
- IMPORTANT: check that the runtime shown at the top right of the notebook, just to the left of the RAM and Disk indicators, is set to "T4". If it is not, open the "Runtime" menu, select "Change runtime type", set the runtime to "T4 GPU" and click "Save"


### Put this notebook in context

If you want to extract information from text, images, audio or video, GenAI can help.

GenAI models are not perfect, so you need to be aware of the limitations of the model you are using. This is why a GenAI model is typically used as one step in a chain of processing steps that produces the final output.

For instance, in this notebook, we will use a GenAI model to extract the experience level from text, but of course, in a real-world scenario, you would want to also:

- Have a list of source documents (here we only have one)
- Further process the output of the GenAI model with another model (typically a non-GenAI Machine Learning model, since most GenAI models handle numerical analysis poorly) or with a rule-based system
- Use the output of the model to score the lead
- Plot and analyse the results to extract insights
- Store the results in a database or a structured format (e.g. CSV, JSON, Excel, etc.)

These last steps are not covered in this notebook as they are not specific to GenAI but you should be aware of them.
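
As a rough illustration of those downstream steps, here is a minimal sketch that applies a rule-based recommendation to the experience level and serialises the result as CSV. The product tiers, the field names and the helpers `recommend_product` and `leads_to_csv` are assumptions made up for this example; they are not part of the notebook.

```python
import csv
import io

# Hypothetical rule: map each experience level (1-5) to a product tier.
# The tier names are illustrative assumptions, not part of the notebook.
PRODUCT_BY_LEVEL = {
    1: "starter course",
    2: "starter course",
    3: "professional workshop",
    4: "advanced coaching",
    5: "advanced coaching",
}


def recommend_product(experience_level: int) -> str:
    """Return the hypothetical product tier for a level between 1 and 5."""
    if experience_level not in PRODUCT_BY_LEVEL:
        raise ValueError(f"Unexpected experience level: {experience_level}")
    return PRODUCT_BY_LEVEL[experience_level]


def leads_to_csv(leads: list[dict]) -> str:
    """Serialise scored leads to CSV text (write it to a file to store it)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["name", "experience_level", "product"],
        lineterminator="\n",
    )
    writer.writeheader()
    for lead in leads:
        product = recommend_product(lead["experience_level"])
        writer.writerow({**lead, "product": product})
    return buffer.getvalue()


example_leads = [{"name": "Example Lead", "experience_level": 4}]
print(leads_to_csv(example_leads))
```

In a real pipeline the rule table would come from the business, and the CSV text would be written to a file or a database rather than printed.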

### Understand the toy example we are using

This toy example analyses the resume of a potential customer (a "lead") and rates the person's experience level from 1 to 5, in order to decide which product or service would suit that lead:

1. Entry Level (0-2 years): Suitable for recent graduates or individuals new to the industry.
2. Junior Level (2-5 years): Candidates with some professional experience, often having foundational skills and looking to build their expertise.
3. Mid Level (5-10 years): Professionals with substantial experience, capable of handling more complex tasks and possibly taking on leadership roles.
4. Senior Level (10-15 years): Highly experienced individuals who are often experts in their field and may hold senior or managerial positions.
5. Executive Level (15+ years): Veteran professionals with extensive experience, likely to be in top management or executive roles.
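
The buckets share their boundary values (2 years appears in both level 1 and level 2), so any rule-based version has to pick a convention. The sketch below assumes each bucket includes its lower bound and excludes its upper bound; that convention and the helper name `experience_level` are assumptions for illustration.

```python
def experience_level(years: float) -> int:
    """Map years of experience to a bucket from 1 to 5.

    Assumption: each bucket includes its lower bound and excludes its
    upper bound, so exactly 2 years counts as Junior (level 2).
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    upper_bounds = [2, 5, 10, 15]  # exclusive upper bounds of levels 1 to 4
    for level, bound in enumerate(upper_bounds, start=1):
        if years < bound:
            return level
    return 5  # 15 years or more


for years in (0, 2, 7.5, 15):
    print(years, "->", experience_level(years))
```

In the notebook itself the model is asked to make this judgement directly from the resume text, so a function like this is mainly useful as a sanity check when the years of experience are already known.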

### This notebook is based on:

Original code by Shaw Talebi, simplified and modified to use the Qwen 2.5 3B model instead of OpenAI's API.

Original video: https://youtu.be/3JsgtpX_rpU


### Learning points

In this notebook, you will learn a general GenAI text workflow using an open-source model that you can run locally or in Google Colab, i.e. how to:

- Install and import packages in a Jupyter notebook (Google Colab runs a type of Jupyter notebook)
- Download a file from the Internet and extract text from a PDF file
- Preprocess the text to prepare it for a GenAI model (in this case, we use the Qwen 2.5 3B model which is the 3-billion-parameter version of the text-to-text model made by Alibaba)
- Use the Hugging Face Transformers library to load a model
- Craft a prompt to instruct the model to perform a specific task
- Write the appropriate messages to the model to generate the response in the format you want to obtain
- Postprocess the response to extract the information you need


### Enable autosaving of this notebook

In [None]:
%autosave 20

Autosaving every 20 seconds


### Install Required Packages

In [None]:
!pip install -q transformers accelerate
!pip install -q PyMuPDF polars

### Import the required libraries

In [None]:
import os
import fitz  # aka PyMuPDF
import polars as pl # Polars is a fast, efficient DataFrame library that is similar to Pandas
import requests
import tempfile
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer


### Initialize the Qwen 2.5 3B Model

In this step, we:
- set `model_name` to the identifier of the model we want to use, as listed on the Hugging Face model hub: https://huggingface.co/Qwen/Qwen2.5-3B-Instruct
- initialize the tokenizer and the model based on the classes from the Hugging Face Transformers library


In [None]:
model_name = "Qwen/Qwen2.5-3B-Instruct"

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Define the Text Extraction Function

In this step, we define the function `extractText` that will extract the text from a PDF file.


In [None]:
def extractText(filepath: str) -> str:
    """
    Extract the text of a resume PDF that was automatically generated by LinkedIn.
    """
    doc = fitz.open(filepath)
    doc_text = ""
    # Clip to the main part of the resume: x from 200 to 612 and y from 0 to 792
    # (PDF points), which skips the left 200 points of each page
    main_area = fitz.Rect(200, 0, 612, 792)
    for page in doc:
        page_text = page.get_text("text", clip=main_area)
        doc_text += page_text
    return doc_text

### Download and Process the Resume

In this step, we:
- define the GitHub raw URL for the PDF file
- download the PDF file with the requests library
- extract the text from the PDF file
- use the split() method to pull the candidate's name out of the extracted text
- store the text in a dictionary
- create a Polars dataframe based on the dictionary

In [None]:
# GitHub raw URL for the PDF file
pdf_url = "https://raw.githubusercontent.com/ShawhinT/YouTube-Blog/main/ai-for-business/3-sales-use-cases/data/resumes/Profile.pdf"

# Download the PDF file
response = requests.get(pdf_url)
if response.status_code == 200:
    # Create a temporary file to store the PDF
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as temp_file:
        temp_file.write(response.content)
        filepath = temp_file.name
else:
    raise Exception(f"Failed to download PDF file. Status code: {response.status_code}")

# Extract text and create initial dataframe
doc_text = extractText(filepath)
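# The extracted text starts with a blank line, so the name is on the first non-empty line.
# A stray "i" line (edge case) is skipped, and any title after a comma is dropped.
lines = [line.strip() for line in doc_text.split('\n') if line.strip() not in ("", "i")]
name = lines[0].split(',')[0]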

resume_dict = {"Name": name, "Resume": doc_text}
df = pl.DataFrame([resume_dict])

# Clean up temporary file
os.unlink(filepath)
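
To see why the name is not simply the first element after splitting, here is a small sketch on a made-up string. The sample text is hypothetical; it only mimics the layout that the extracted resume appears to have (a leading blank line, then a "Name, Title" line).

```python
# Hypothetical sample mimicking the layout of the extracted resume text:
# a leading blank line, then "Name, Title", then the rest.
sample_text = "\nJane Doe, Data Scientist\nSome City\nSummary\n"

lines = sample_text.split("\n")
# lines[0] is the empty string before the first newline,
# so lines[1] holds the line with the name
name_line = lines[1]
# Keep only what comes before the first comma to drop the title
name = name_line.split(",")[0]
print(repr(lines[0]), "|", name)
```

Index-based parsing like this is fragile: it only works as long as every resume follows the same layout.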

### Data Augmentation with Qwen

In this step, we:
- create a system prompt for the model
- create a prompt template that will be used to extract the experience level from the resume

In [None]:
system_prompt = """You are a resume analysis assistant. Your task is to classify resumes into one of five experience level buckets based on the number of years of professional experience listed in the resume.

The experience level buckets are:

1. Entry Level (0-2 years): Suitable for recent graduates or individuals new to the industry.
2. Junior Level (2-5 years): Candidates with some professional experience, often having foundational skills and looking to build their expertise.
3. Mid Level (5-10 years): Professionals with substantial experience, capable of handling more complex tasks and possibly taking on leadership roles.
4. Senior Level (10-15 years): Highly experienced individuals who are often experts in their field and may hold senior or managerial positions.
5. Executive Level (15+ years): Veteran professionals with extensive experience, likely to be in top management or executive roles.

When given a resume, analyze the text to determine the total years of professional experience and classify the resume into the appropriate experience level bucket."""

In [None]:
prompt_template = lambda resume: f"""I have a resume, and I need to identify the candidate's experience level. Here are the experience level buckets:

1 = Entry Level (0-2 years)
2 = Junior Level (2-5 years)
3 = Mid Level (5-10 years)
4 = Senior Level (10-15 years)
5 = Executive Level (15+ years)

Please analyze the following resume text and identify the experience level of the candidate. Ensure your response is a single digit between 1-5 indicating the experience level based on the above rubric.

### Resume

{resume}

### Output: """
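
The system prompt and the user prompt are later combined into a single string by the tokenizer's chat template. Qwen chat models use a ChatML-style layout, and the sketch below approximates it in plain Python to show what the model roughly receives. The helper `to_chatml` is hypothetical and the shortened messages are placeholders; the tokenizer's own template remains the source of truth.

```python
def to_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Approximate the ChatML layout used by Qwen chat models (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a resume analysis assistant."},
    {"role": "user", "content": "### Resume\n\n(resume text)\n\n### Output: "},
]
print(to_chatml(messages))
```

Ending the string with an open assistant turn is what makes the first generated token the start of the answer, which matters when only one token is generated.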

In this step, we:
- initialize an empty list to store the experience level
- loop over the resumes in the dataframe
- create a prompt for the model with the prompt template and the resume
- limit the generation to a single new token. This does not guarantee that the output is one of the values 1 to 5, but it prevents the model from producing a longer answer. In a real-world scenario, you would validate the output more strictly, for instance against a schema defined with the Pydantic library
- use the Qwen 2.5 3B model to extract the experience level as one of the 5 buckets from the resume
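
Since limiting the output to one token does not guarantee a valid level, the response can be checked before it is converted to an integer. The sketch below uses only the standard library; the helper name `parse_experience_level` is an assumption for this example, and a schema library such as Pydantic could play the same role.

```python
from typing import Optional


def parse_experience_level(response: str) -> Optional[int]:
    """Return the level as an int if the response is a single digit 1-5, else None."""
    cleaned = response.strip()
    if cleaned in {"1", "2", "3", "4", "5"}:
        return int(cleaned)
    return None


for raw in ["4", " 3\n", "A", "7", ""]:
    print(repr(raw), "->", parse_experience_level(raw))
```

Returning `None` for invalid responses lets the caller decide whether to retry the prompt, skip the resume, or flag it for manual review.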

In [None]:
exp_level_list = []

# extract the experience level for each resume in df
for i in range(len(df)):

    prompt = prompt_template(df["Resume"][i])

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=1)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    exp_level_list.append(response)
    print(f"Resume {i+1} experience level: {response}")

# convert list to numpy array of integers
exp_level_arr = np.array(exp_level_list).astype(int)

# add experience level to dataframe
df = df.with_columns(pl.Series(name="experience_level", values=exp_level_arr))

Resume 1 experience level: 4
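
For a decoder-only model such as this one, `model.generate` returns the prompt tokens followed by the newly generated tokens, which is why the loop slices each output by the length of its input. The sketch below reproduces that slicing with plain lists; the token ids are made up and stand in for the real tensors.

```python
# Made-up token ids standing in for real tensors
prompt_ids = [[101, 7, 8, 9]]      # one prompt of four tokens
generated = [[101, 7, 8, 9, 42]]   # the same prompt followed by one new token

# Keep only the tokens that come after the prompt, sequence by sequence
new_tokens = [out[len(inp):] for inp, out in zip(prompt_ids, generated)]
print(new_tokens)  # [[42]]
```

Without this step, decoding the output would return the whole prompt together with the answer.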


In this step, we print the dataframe with the experience level suggested by the model

In [None]:
print(df)

shape: (1, 3)
┌─────────────┬───────────────────┬──────────────────┐
│ Name        ┆ Resume            ┆ experience_level │
│ ---         ┆ ---               ┆ ---              │
│ str         ┆ str               ┆ i64              │
╞═════════════╪═══════════════════╪══════════════════╡
│ Shaw Talebi ┆                   ┆ 4                │
│             ┆ Shaw Talebi       ┆                  │
│             ┆ AI Educator | Ph… ┆                  │
└─────────────┴───────────────────┴──────────────────┘
