# Introduction:
Building a resume parser using SpaCy can greatly streamline the process of extracting relevant information from resumes, enabling efficient candidate evaluation. In this guide, we will explore step-by-step instructions to develop a resume parser using the powerful natural language processing library, SpaCy.

1. Understanding the Problem:
Define the objective of the resume parser and the specific information to be extracted, such as name, contact details, skills, education, and work experience.

2. Preparing the Environment:
Install SpaCy and its required dependencies.
Download and load the necessary SpaCy language models.

3. Extracting Text from Resumes:
Utilize a PDF parsing library such as pdfplumber (which is built on pdfminer.six) to extract text content from resume files.
Implement a function to extract text from PDF files using the chosen library.

4. Extracting Name from Resumes:
Use SpaCy's linguistic capabilities to extract names from resume text.
Define name patterns using SpaCy's Matcher module to identify different name formats.

5. Extracting Contact Details:
Employ regular expressions to extract contact numbers from resume text.
Define patterns to capture various phone number formats.

6. Extracting Email Addresses:
Utilize regular expressions to identify and extract email addresses from resume text.
Define email patterns to ensure accurate extraction.

7. Extracting Skills:
Create a predefined list of skills relevant to the desired job requirements.
Utilize SpaCy's linguistic capabilities to match and extract skills from the resume text.

8. Extracting Education:
Define a set of education keywords or patterns to identify educational information.
Utilize regular expressions to extract education details from the resume text.

9. Putting it All Together:
Combine the individual extraction functions to create a comprehensive resume parser.
Process the resume text, extract the desired information, and store it in a structured format.

10. Enhancements and Customizations:
Explore advanced techniques to improve extraction accuracy, such as named entity recognition and entity linking.
Consider handling different resume formats and languages for broader compatibility.
Implement additional features like extracting work experience, certifications, or personal projects based on specific requirements.

In [None]:
Note to Readers:

I am thrilled to share with you a comprehensive guide on building a resume parser using SpaCy, which is now available on Analytics Vidhya. This guide aims to empower you with the knowledge and tools to create a powerful resume parser from scratch.

Throughout the guide, I cover the essential steps of resume parsing: extracting text from PDFs and pulling out contact details, skills, and education. I also demonstrate how to use SpaCy, a popular natural language processing library, to perform these tasks.

By following the step-by-step instructions and code examples provided in the guide, you will gain a solid understanding of the resume parsing process and be equipped with the skills to build your own resume parser. Whether you are a data scientist, developer, or HR professional, this guide will help you streamline and automate the resume screening process, saving you valuable time and effort.

I encourage you to dive into the guide, experiment with the code, and adapt it to your specific requirements. Remember that building a resume parser is an iterative process, and you may need to fine-tune and customize it based on the unique characteristics of the resumes you encounter.

I would like to express my gratitude to the Analytics Vidhya platform for providing the opportunity to share this guide with the community. Their dedication to promoting knowledge sharing and empowering data professionals is truly commendable.

I hope this guide serves as a valuable resource for you on your journey of building a resume parser using SpaCy. Feel free to reach out with any questions, feedback, or success stories you may have. Happy parsing!

In [1]:
!pip install spacy pdfplumber
!python -m spacy download en_core_web_sm




Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "C:\Users\dell\anaconda3\Lib\site-packages\spacy\__init__.py", line 6, in <module>
  File "C:\Users\dell\anaconda3\Lib\site-packages\spacy\errors.py", line 3, in <module>
    from .compat import Literal
  File "C:\Users\dell\anaconda3\Lib\site-packages\spacy\compat.py", line 4, in <module>
    from thinc.util import copy_array
  File "C:\Users\dell\anaconda3\Lib\site-packages\thinc\__init__.py", line 5, in <module>
    from .config import registry
  File "C:\Users\dell\anaconda3\Lib\site-packages\thinc\config.py", line 5, in <module>
    from .types import Decorator
  File "C:\Users\dell\anaconda3\Lib\site-packages\thinc\types.py", line 27, in <module>
    from .compat import cupy, has_cupy
  File "C:\Users\dell\anaconda3\Lib\site-packages\thinc\compat.py", line 99, in <mod

In [21]:
import pdfplumber  # PDF text extraction
import spacy        # NLP library
import re           # Regex for email/phone/skills


In [3]:
import spacy
nlp = spacy.load("en_core_web_sm")
print("Spacy loaded successfully")


Spacy loaded successfully


In [5]:
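def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF, one page per block of lines."""
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() returns None for pages without a text layer (e.g. scans)
            text += (page.extract_text() or "") + "\n"
    return text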


In [7]:
def extract_name(text):
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            return ent.text
    return None


In [9]:
def extract_email(text):
    match = re.search(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", text)
    return match.group() if match else None
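
One thing to watch with the pattern above: its last character class allows dots, so an address that ends a sentence can be captured with the trailing period. The following is a minimal sketch of one way to handle that. The helper name `extract_all_emails` and the sample text are assumptions for illustration, not part of the original notebook.

```python
import re

# Same pattern as extract_email above.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

def extract_all_emails(text):
    """Return every distinct email-like string in text, in order of appearance."""
    seen = []
    for match in EMAIL_PATTERN.finditer(text):
        # Drop a trailing "." or "-" picked up from the surrounding sentence
        email = match.group().rstrip(".-")
        if email not in seen:
            seen.append(email)
    return seen

# Hypothetical sample text
sample = "Reach me at jane.doe@example.com. Backup: jdoe+jobs@mail.example.org"
print(extract_all_emails(sample))
```

Returning a list rather than the first match also helps when a resume lists both a personal and a work address.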


In [11]:
def extract_phone(text):
    match = re.search(r"\+?\d[\d\s-]{8,}\d", text)
    return match.group() if match else None
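
The pattern in `extract_phone` accepts digits separated by spaces or hyphens, so a number written as `(555) 123-4567` would be missed. A possible extension is sketched below. The broader pattern, the helper name `extract_phone_normalized`, and the sample strings are all assumptions, and a pattern this loose may also pick up other long digit runs such as date ranges.

```python
import re

# Hypothetical broader pattern: optional "+", optional "(", then digits
# separated by spaces, dots, hyphens, or parentheses.
PHONE_PATTERN = re.compile(r"\+?\(?\d[\d\s().-]{8,}\d")

def extract_phone_normalized(text):
    """Return the first phone-like match reduced to digits, keeping a leading '+'."""
    match = PHONE_PATTERN.search(text)
    if not match:
        return None
    raw = match.group()
    digits = re.sub(r"\D", "", raw)  # strip everything that is not a digit
    return "+" + digits if raw.startswith("+") else digits

# Hypothetical sample strings
samples = ["Phone: 5551234567", "Call (555) 123-4567", "Tel: +91 55512-34567", "No number here"]
for s in samples:
    print(s, "->", extract_phone_normalized(s))
```

Normalizing to digits makes numbers from different resumes comparable regardless of how each candidate formatted them.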


In [13]:
def extract_skills(text):
    skills_list = ['Python', 'SQL', 'Power BI', 'Excel', 'Data Analysis', 'Machine Learning',
                   'Deep Learning', 'Pandas', 'NumPy', 'Git', 'Jupyter', 'Canva']
    found = []
    for skill in skills_list:
        if re.search(r'\b' + re.escape(skill) + r'\b', text, re.IGNORECASE):
            found.append(skill)
    return list(set(found))
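
The function above matches skills with regular expressions. The outline also mentions using SpaCy for this step, and one way to do that is SpaCy's `PhraseMatcher`. This is a rough sketch rather than the notebook's method: the shortened skill list, the helper name `extract_skills_phrasematcher`, and the use of a blank English pipeline (no model download needed) are assumptions.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough here: PhraseMatcher only needs the tokenizer.
nlp_blank = spacy.blank("en")

# Hypothetical shortened skill list
SKILLS = ["Python", "SQL", "Power BI", "Machine Learning", "Pandas"]

# attr="LOWER" compares lowercase token text, so matching is case-insensitive.
skill_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
skill_matcher.add("SKILL", [nlp_blank.make_doc(skill) for skill in SKILLS])

def extract_skills_phrasematcher(text):
    """Return matched skills in their canonical spelling, ordered by first appearance."""
    canonical = {skill.lower(): skill for skill in SKILLS}
    doc = nlp_blank(text)
    found = []
    # Sort by start token so the result follows the order in the text
    for _match_id, start, end in sorted(skill_matcher(doc), key=lambda m: m[1]):
        span_text = doc[start:end].text
        skill = canonical.get(span_text.lower(), span_text)
        if skill not in found:
            found.append(skill)
    return found

print(extract_skills_phrasematcher("Built dashboards in power bi using SQL and python."))
```

Because matching happens on tokens, multi-word skills such as "Power BI" are handled without hand-written word-boundary logic.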


In [15]:
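def extract_education(text):
    # The capturing group makes re.findall return only the matched keyword;
    # [^\n]* consumes the rest of the line so each line yields at most one match.
    pattern = r"(?i)(Bachelor|Master|B\.Sc|MCA|Intermediate|High school)[^\n]*"
    return re.findall(pattern, text)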
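
As the final output of this notebook shows, `extract_education` reports only the degree keywords. If the institution and year are also wanted, one option is to keep the whole line alongside the keyword. The sketch below assumes a hypothetical helper `extract_education_lines` and made-up sample text. Note that a keyword such as "Intermediate" can also appear in unrelated lines, so the results may need filtering.

```python
import re

# Hypothetical variant: named groups keep both the degree keyword and its full line.
EDU_PATTERN = re.compile(
    r"^(?P<line>.*?\b(?P<degree>Bachelor|Master|B\.Sc|MCA|Intermediate|High school)\b.*)$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_education_lines(text):
    """Return (degree keyword, full line) pairs for lines that mention a degree."""
    return [
        (m.group("degree"), m.group("line").strip())
        for m in EDU_PATTERN.finditer(text)
    ]

# Hypothetical sample resume text
sample = """EDUCATION
Master of Computer Applications, Example University, 2023
Bachelor of Science, Example College, 2021
SKILLS
Python, SQL"""

for degree, line in extract_education_lines(sample):
    print(degree, "|", line)
```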


In [25]:
def extract_name_custom(text):
    # Locate the email address, then walk upward to the nearest purely alphabetic line
    email = extract_email(text)
    if not email:
        return "Name not found"
    lines = text.split('\n')
    for i, line in enumerate(lines):
        if email.lower() in line.lower():
            for j in range(i-1, -1, -1):
                name_candidate = lines[j].strip()
                if name_candidate and all(x.isalpha() or x.isspace() for x in name_candidate):
                    return name_candidate
    return "Name not found"


In [27]:
def extract_name_spacy(text):
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            if len(ent.text.split()) <= 4 and not any(char.isdigit() for char in ent.text):
                return ent.text
    return "Name not found"


In [31]:
name = extract_name_custom(text)
if name == "Name not found":
    name = extract_name_spacy(text)


In [51]:
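from spacy.matcher import Matcher  # required by the Matcher call below

def extract_name(text):
    """
    Extracts name from resume text using the spaCy Matcher, falling back to lines near the email.
    """
    # Reuse the nlp pipeline loaded earlier instead of reloading the model on every call
    doc = nlp(text)

    # Try using spaCy Matcher with common name patterns
    matcher = Matcher(nlp.vocab)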
    name_patterns = [
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}]
    ]
    matcher.add("NAME", patterns=name_patterns)
    matches = matcher(doc)

    # Filter matched spans to avoid address-like or long results
    for match_id, start, end in matches:
        span = doc[start:end]
        name_candidate = span.text.strip()
        if (len(name_candidate.split()) <= 3 and not any(char.isdigit() for char in name_candidate)):
            return name_candidate

    # Backup method: Try to extract name from lines near the email
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    email = extract_email(text)  # extract_email is the helper defined earlier in this notebook
    for i, line in enumerate(lines):
        if email and email in line:
            for j in range(i - 1, -1, -1):
                candidate = lines[j]
                if candidate and candidate.replace(" ", "").isalpha():
                    return candidate

    return None

resume_path = r"C:\Users\dell\Downloads\resume.pdf"  # Absolute path to the resume PDF; adjust for your machine

text = extract_text_from_pdf(resume_path)

print("👤 Name:", extract_name(text))
print("📧 Email:", extract_email(text))
print("📞 Phone:", extract_phone(text))
print("💼 Skills:", extract_skills(text))
print("🎓 Education:", extract_education(text))

👤 Name: SHAIBA ALI
📧 Email: sa1610166@gmail.com
📞 Phone: 7570919305
💼 Skills: ['Python', 'Pandas', 'Excel', 'Data Analysis', 'Power BI', 'NumPy', 'Git', 'Canva', 'SQL', 'Jupyter']
🎓 Education: ['Master', 'Bachelor', 'Intermediate']
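
Step 9 of the outline mentions storing the extracted information in a structured format, while the cell above only prints it. One possible shape for such a record is sketched below. The `ParsedResume` dataclass, the `to_json` helper, and the placeholder values are assumptions for illustration; in the notebook the fields would come from `extract_name(text)`, `extract_email(text)`, and the other extraction functions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

@dataclass
class ParsedResume:
    """Hypothetical container for the fields extracted from one resume."""
    name: Optional[str] = None
    email: Optional[str] = None
    phone: Optional[str] = None
    skills: List[str] = field(default_factory=list)
    education: List[str] = field(default_factory=list)

def to_json(resume):
    """Serialize a ParsedResume to JSON, keeping non-ASCII characters readable."""
    return json.dumps(asdict(resume), indent=2, ensure_ascii=False)

# Placeholder values standing in for the results of the extraction functions
record = ParsedResume(
    name="Jane Doe",
    email="jane.doe@example.com",
    phone="5551234567",
    skills=["Python", "SQL"],
    education=["Master", "Bachelor"],
)
print(to_json(record))
```

A record like this can be written to a file or loaded into a DataFrame when many resumes are processed in a batch.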
