<a href="https://colab.research.google.com/github/schmcklr/skill_extractor/blob/main/skill_extractor_section_extract_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Part 2:** Skill Section Extraction

This program extracts the qualification section from English job ads by scanning their HTML source. It locates the section using key phrases and structural patterns derived from an analysis of over 1,500 job ads. The extracted qualifications, together with the remaining job ad data, are compiled in a data frame and exported to an Excel file for further analysis. Ads whose qualification section could not be extracted are automatically excluded from the export.


# 1. Load preprocessed data
*   Import of translated job advertisements ([part 1](https://colab.research.google.com/drive/1BgayjC-opiqcTT_QLv6RGG9a5roDLK9D#scrollTo=1qwXDcfoCAvZ))


In [61]:
import pandas as pd
# Fetch the preprocessed and translated job data
workbook = 'https://github.com/schmcklr/skill_extractor/blob/main/job_data/job_data_general_preprocessed_and_translated.xlsx?raw=true'

# Read the worksheet into a data frame
job_data = pd.read_excel(workbook, sheet_name="Sheet1")

# 2. Function Definitions



*   filter_duplicates
*   filter_key_phrases
*   check_stop_phrases_element
*   check_stop_phrases_element_text
*   truncate_text_on_stopphrase
*   filter_and_clean_qualifications
*   extract_qualification_section

2.1 Filter duplicates

In [62]:
# Function to filter duplicated sentences
def filter_duplicates(sentences):
    filtered_sentences = []
    sentences = sorted(sentences, key=len, reverse=True)
    for sentence in sentences:
        if all(sentence not in other_sentence for other_sentence in filtered_sentences):
            filtered_sentences.append(sentence)
    return filtered_sentences
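
The effect of the length-based sort is easiest to see on a small input. The following is a minimal sketch that repeats the function so it runs on its own; the sample sentences are made up for illustration and are not taken from the job data.

```python
def filter_duplicates(sentences):
    filtered_sentences = []
    # Longest first, so a short fragment is always compared against
    # the longer texts that may already contain it
    for sentence in sorted(sentences, key=len, reverse=True):
        if all(sentence not in kept for kept in filtered_sentences):
            filtered_sentences.append(sentence)
    return filtered_sentences

# Hypothetical sample: 'python' is a substring of the longer sentence
sample = ['python', 'good knowledge of python and sql', 'fluent english']
result = filter_duplicates(sample)
print(result)
```

Note that the result is ordered by length, not by the original position of the sentences, and that a sentence is dropped whenever it is a substring of a kept sentence, not only when it is an exact duplicate.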

2.2 Filter key phrases

In [63]:
# Filter key phrases by removing key_phrases which are in list key_phrases_exact_match
def filter_key_phrases(key_phrases, key_phrases_exact_match):
    filtered_phrases = []
    for phrase in key_phrases:
        if not any(phrase == match for match in key_phrases_exact_match):
            filtered_phrases.append(phrase)
    return filtered_phrases
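
As a sketch of what this filter produces, the variant below uses a set lookup instead of the nested `any(...)` comparison; this is an assumed equivalent rewrite, not the notebook's own code, and the sample phrases are a small illustrative subset.

```python
def filter_key_phrases(key_phrases, key_phrases_exact_match):
    # Set membership gives the same result as comparing against every entry
    exact = set(key_phrases_exact_match)
    return [phrase for phrase in key_phrases if phrase not in exact]

# Illustrative subset of the lists
sample_key_phrases = ['requirements', 'you', 'your profile', 'skills']
sample_exact_match = ['you', 'skills']

remaining = filter_key_phrases(sample_key_phrases, sample_exact_match)
print(remaining)
```

The order of the remaining key phrases is preserved, which matters because the main function iterates over the key phrases in list order.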

2.3 Stop phrase checking

In [64]:
import re

# Check an element for stop phrases
# Stop extracting if one of the stop phrases has been detected
def check_stop_phrases_element(next_element):
  # Nothing to check if there is no element
  if next_element is None:
    return False

  # Detect stop_phrases within the first 45 words of the element's text
  if any(stop_phrase in ' '.join(next_element.text.split()[:45]) for stop_phrase in stop_phrases):
    # Only used for developing context
    #matched_phrases = [stop_phrase for stop_phrase in stop_phrases if stop_phrase in ' '.join(next_element.text.split()[:45])]
    #if matched_phrases:
      #print('Stopphrase detected:', matched_phrases[0])

    return True

  # Detect stop_phrases_exact_match: the element's own string, with punctuation removed, must equal the phrase
  if next_element.string is not None:
    cleaned_element_text = re.sub(r'[^\w\s]', '', next_element.string.strip())
    if any(stop_phrase_exact_match == cleaned_element_text for stop_phrase_exact_match in stop_phrases_exact_match):
      return True

  return False
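
The substring check only looks at the beginning of an element's text: `split()[:45]` keeps the first 45 whitespace-separated words. The sketch below isolates that window on plain strings; `stop_phrase_in_head` and its parameters are hypothetical names introduced for illustration, and the sample texts are made up.

```python
def stop_phrase_in_head(text, stop_phrases, max_words=45):
    """Return True if any stop phrase occurs within the first max_words words."""
    head = ' '.join(text.split()[:max_words])
    return any(stop_phrase in head for stop_phrase in stop_phrases)

sample_stop_phrases = ['we offer']

# Stop phrase right at the start of the text
early = 'we offer flexible working times ' + 'and more ' * 30
# Stop phrase only after 60 other words, outside the window
late = 'python ' * 60 + 'we offer flexible working times'

print(stop_phrase_in_head(early, sample_stop_phrases))
print(stop_phrase_in_head(late, sample_stop_phrases))
```

Under this reading, a long element that mentions a stop phrase only near its end is not treated as the start of a new section.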

In [65]:
# Check an element's text for stop phrases
# Stop extracting if one of the stop phrases has been detected
def check_stop_phrases_element_text(element_text):
    cleaned_element_text = re.sub(r'[^\w\s]', '', element_text)

    if any(stop_phrase in element_text for stop_phrase in stop_phrases):
        return True

    if cleaned_element_text is not None and any(stop_phrase_exact_match == cleaned_element_text for stop_phrase_exact_match in stop_phrases_exact_match):
        return True

    return False
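
The two stop lists behave differently: entries in `stop_phrases` match anywhere in the text, while entries in `stop_phrases_exact_match` must equal the whole text once punctuation is removed. A minimal self-contained sketch, using small assumed subsets of the lists and made-up sample texts:

```python
import re

# Small illustrative subsets of the lists
stop_phrases = ['what do we offer you', 'benefits']
stop_phrases_exact_match = ['location', 'salary']

def check_stop_phrases_element_text(element_text):
    # Remove punctuation, keep word characters and whitespace
    cleaned_element_text = re.sub(r'[^\w\s]', '', element_text)
    # Substring match against the regular stop phrases
    if any(stop_phrase in element_text for stop_phrase in stop_phrases):
        return True
    # Whole-text match against the exact-match stop phrases
    return any(phrase == cleaned_element_text for phrase in stop_phrases_exact_match)

print(check_stop_phrases_element_text('our benefits include free sports'))
print(check_stop_phrases_element_text('salary:'))
print(check_stop_phrases_element_text('salary expectations are welcome'))
```

The third text contains the word "salary" but is not stopped, because exact-match phrases are only recognized when they make up the entire text.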

2.4 Extract section until stop phrase occurs

In [66]:
def truncate_text_on_stopphrase(element):
    text = element.text

    min_index = len(text) # Position of the earliest stop phrase found in the text
    selected_stop_phrase = None # Selected stop phrase for truncation

    for stop_phrase in stop_phrases:
        index = text.find(stop_phrase)
        if index != -1 and index < min_index:
            min_index = index
            selected_stop_phrase = stop_phrase

    if selected_stop_phrase:
        truncated_text = text[:min_index].strip()
        return truncated_text
    else:
        return text
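
The truncation cuts the text at the earliest stop phrase, regardless of the order of the phrases in the list. The sketch below applies the same idea to a plain string instead of a BeautifulSoup element; `truncate_on_stop_phrase` is a hypothetical stand-in that takes the stop phrases as a parameter, and the sample text is made up.

```python
def truncate_on_stop_phrase(text, stop_phrases):
    # Positions of all stop phrases that occur in the text
    positions = [text.find(stop_phrase) for stop_phrase in stop_phrases]
    positions = [position for position in positions if position != -1]
    # Cut at the earliest occurrence, or return the text unchanged
    return text[:min(positions)].strip() if positions else text

sample_text = ('degree in computer science. fluent english. '
               'we offer flexible hours and benefits')

# 'benefits' is listed first, but 'we offer' occurs earlier in the text
truncated = truncate_on_stop_phrase(sample_text, ['benefits', 'we offer'])
print(truncated)
```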

2.5 Filter and clean qualification list

In [67]:
def filter_and_clean_qualifications(qualifications):
    filtered_qualifications = []
    for qualification in qualifications:
        clean_qualification = "".join(char for char in qualification if char.isalpha() or char.isspace())
        if clean_qualification and clean_qualification not in key_phrases:
            filtered_qualifications.append(qualification)
    return filtered_qualifications
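
The cleaning step is only used for the comparison; the entries that survive are returned in their original form. A minimal sketch with an assumed two-entry `key_phrases` list and made-up qualification strings:

```python
# Small illustrative subset of the key phrase list
key_phrases = ['requirements', 'your profile']

def filter_and_clean_qualifications(qualifications):
    filtered_qualifications = []
    for qualification in qualifications:
        # Keep only letters and whitespace for the comparison
        clean_qualification = "".join(
            char for char in qualification if char.isalpha() or char.isspace())
        # Drop entries that are empty after cleaning or that are just a heading
        if clean_qualification and clean_qualification not in key_phrases:
            filtered_qualifications.append(qualification)
    return filtered_qualifications

sample = ['requirements:', '***', 'degree in computer science',
          'your profile', '3+ years of experience']
kept = filter_and_clean_qualifications(sample)
print(kept)
```

Entries consisting only of a section heading or of punctuation are removed, while '3+ years of experience' keeps its digits and symbols because the uncleaned string is appended.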

2.6 Definition of lists


*   **key_phrases** (for identifying the Skills section)

*   **key_phrases_exact_match** (key phrases that should only be recognized when they are within a separate tag element)
*   **key_phrases_without_exact_match** (the key phrases that are not in key_phrases_exact_match)
*   **stop_phrases** (phrases that, upon occurrence, should halt the extraction process)
*   **stop_phrases_exact_match** (stop phrases that should only be recognized when they are within a separate tag element)
*   **stop_words** (words that should be removed from the text)
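
The distinction between regular and exact-match key phrases can be sketched as follows. This is a simplified illustration of the idea, not the matching logic of the main function; `heading_matches` is a hypothetical helper and the sample headings are made up.

```python
def heading_matches(heading_text, key_phrase, key_phrases_exact_match):
    # Exact-match phrases such as 'you' must be the whole heading
    if key_phrase in key_phrases_exact_match:
        return heading_text == key_phrase
    # All other key phrases may appear anywhere in the heading
    return key_phrase in heading_text

sample_exact_match = ['you', 'skills']

print(heading_matches('you', 'you', sample_exact_match))
print(heading_matches('what you will do', 'you', sample_exact_match))
print(heading_matches('your profile and skills', 'your profile', sample_exact_match))
```

Short, generic phrases would otherwise match many unrelated headings, which is why they are only accepted when they stand alone in a tag element.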



In [68]:
# Key Phrases
key_phrases = [
    'required experience & competencies', 'who we are looking for', 'requirements', 'competencies', 'required skills', 'technical skills', 'education', 'skills required', 'common core skills', 'qualifikationen', 'profile', 'have knowledge in',
    'minimum knowledge and skills required', 'was sie mitbringen', 'language skills', 'desired/plus', 'language','desirable attributes', 'skills and experience', 'what we expect from you',
    'required/must have', 'core competencies', 'qualifications required', 'essential qualifications', 'position criteria', 'complément du descriptif', 'compétences requises',
    'profile description', "you're at the right place, if", 'skills and competencies', 'competencies required', 'professional qualification', 'soft skills',
    'we are looking for people, who are', 'qualifications', 'your background', 'must have', 'nice to have', 'your profile', 'compétences', 'votre profil', 'your skills', 'you...', 'you:',
    'languages', 'personal abilities', 'required competencies', 'compétences requises', 'profil candidat', "le plus de l'offre", '候选人能力&amp;要求', 'experience and education',
    'your qualities', 'what will make you successful', "who we're looking for", 'basisqualifikationen', "what you'll bring to us", "dein profil", 'what we are looking for',
    'skills', 'knowledge', 'your strengths', 'required technical', 'professional expertise', 'preferred technical and professional expertise', 'you bring that with', 'what you need is',
    'an ideal candidate will be', 'minimum qualifications', 'preferred qualifications', 'basic qualifications', 'your qualifications', 'about you', 'profil', 'your personality','your qualification', 'what you need to have',
    'who are you?', 'ihr profil', 'do i qualify?', 'requisitos', 'they fit us.', "what you’ll need", 'come as you are', 'your story', 'are you a fit?', 'we’re looking for people who', 'what are we looking for',
    '· you are in your last semester of a bachelor’s or master’s program or have graduated less than six months ago', 'job requirements', 'required skills:', 'preferred skills:',
    'do you have what it takes to be the field sales manager', 'you', 'who are we looking for', 'the perfect candidate', 'qualification','experience & key skills', 'studied:', 'skills to create thrills',
    'das bringen sie mit', 'candidate:', 'skills & qualifications', 'for people with many qualities', 'who you are', 'required profile', 'we are looking for people who', 'about the candidate',
    'we are seeking highly motivated individuals who meet the following criteria', 'qualifications and skills', 'you are…', "if you’re a good fit, you’ll have", "if you’re a really good fit, you’ll have",
    'am i a perfect match', 'to qualify for the role you must have', 'education and qualifications / skills and competencies', 'is it you, we are looking for', 'main competence requirement',
    'expérience requise et formation', 'expérience requise et formation', 'experience', 'personal competencies', 'digital competencies', 'managerial competencies',
    'required education', 'we are looking for someone who has', 'will be an advantage if the candidate has', 'you score with us', "what you'll bring to us", 'expérience requise et formation',
    'you are best equipped for this task if you have','# profile # competences', 'we look for', 'connaissances/expérience', 'do you have what it takes', 'profil recherché', 'personal skills', 'you are...','technical knowledge',
    'to fit in the role, you also have', 'requirement', 'what we expect', 'your profile  ready to step on the gas', "what we’re looking for", 'formula for success', 'seeked profile',
    'what will you bring to hp?', 'this is you', 'required skills', 'preferred skills', ' you are an accounting professional', 'experience and specific knowledge', "are you the process innovator we're looking for?",
    'your profile  B', 'you bring that with –f ä', 'so you score with us', "what you'll need", 'wanted profile', 'your background looks like', 'bringing ', 'what we are expecting',
    'this scores with us', 'in this you are a specialist', 'you can see you in this function when you', 'what you need', 'job skills and knowledge required', "here's what we require",
    "to be successful in this role you will need the following", 'who you are?', "you'll be the right fit if you", "you`ll be the right fit if you have", 'what will you bring',
    "you’ll need to have", "we’d love to see", "In this you are a specialist", "this is what you`ll bring to us", "we're looking for people who", "am i qualified", "you'll need to have",
    "you specialist", "expected skills", "you bring", "you ...", "the ideal candidate will meet the following requirements", "what we're looking for", 'professional skills', 'specific competencies',
    "anforderungsprofil", "your skills  talents", "therefore fit us", "what else will make you successful", "the following requirement profile bring with", "this is the case with", 'stating the job reference',
    'essential experience', "desirable experience", "we'd love to see", "requisitos", 'you are a perfect fit for us if you', 'anforderungen/ kenntnisse', 'requirements', 'key skills, capabilities',
    'what can you contribute', "a motivated enthusiastic personality with a good understanding of the relationships between the business and the it your way of working can be described as independent responsible and solutionoriented furthermore you have the following background",
    'successful applicants will likely possess most of these', 'what do you bring with', 'required technical and professional expertise', 'what will make you successful', 'qualifikationen', 'desired',
    'personality', 'knowledge & experience', 'this is what your background looks like', 'you have', 'technical and professional skills', 'behavioural competency required', 'competency', 'behaviour',
    ' personal competencies', 'you also have the following qualities', 'essential skills', 'personal attributes', 'preferred qualifications', 'additional important requirements', 'role competencies',
    'minimum education requirements', 'personal competencies', 'key skills', 'you are best equipped for this work, if you', 'target disciplines and special skills', 'to fit in the role, you also',
    'further skills', 'it skills', 'language skills', 'studies', 'studied', 'minimum education and experience', 'candidate profile', 'ideal candidate', 'technical requirements', 'further requirements',
    'desired requirements', 'technical  professional knowledge', 'working experience', 'preferred tech and prof experience', 'what were looking for', 'to be eligible', 'you need',
    'you are interested in or have knowledge in', 'knowledge/technical skills', 'what skills/experience we are looking for', 'how can you make a difference', 'desired  skills & exp erience',
    'experience & skills', 'skills &amp; exp', 'desired skills', 'skills / experience', 'preferred requirements', 'required experience','your heroic skills', 'requirements profile',
    'requisite skills & experience', 'technical competencies', 'behavioural competencies', 'you are interested in or have knowledge in :', 'knowledge  knowledge  experiences', 'you preferably have',
    'required professional and technical expertise', "to summarise, we're looking for someone with", 'what we look for', 'besides a technical strong basis we are looking for people who',
    'monitoring the administrative organization and the internal controls', 'you have a toyota dna, this means you', 'entry requirements', 'you should apply if you are', 'desired skills & experience',
    'education and work experience', "you are an m/f/x that has proven business judgement and you have a passion for what technology and data science can do. you love working with customers, you aspire to be a domain expert one day in either an entire industry or a process and you get excited when you are influencing change to executives through market-leading saas technology.",
    'required backround and experience', 'educational qualification', 'work experience & qualifications', 'key requirements/experience', 'what you need to succeed', 'qualifikationen und erfahrung', 'valuable skills',
    'preferred work experience', 'we expect you to', 'attitude is important! we are looking for people, who are', 'with potential value', 'your skills', 'required', 'desired', 'base compentencies',
    'you are looking for an opportunity to accelerate your career in financial management at international level and emerge as a financial leader within a dynamic and challenging organization',
    'plus, you fulfill the following hard fact based criteria', 'and you fully agree to these statements', '候选人能力&要求', 'requirements:', 'education', ', schedules, and manage customer expectations',
    'experience/knowledge required', 'essential', 'desirable', 'required languages', 'personal required skills', 'skills and knowledge', 'relevant work experience', '2.    knowledge/experience',
    'education, experience and skills', 'key success criteria', 'do you have what it takes to be the field sales manager?', 'you are best equipped for this task if you', ' your profile',
    '- you have a degree in computer science, business informatics or a comparable education', '- you are passionate about the daimlers concept of innovative payment solutions',
    'technical & professional knowledge', 'you are skilled', 'who can apply for the acaddemict business & functional analyst program?', 'skills and qualifications', 'desired experience',
    'student profile', 'requirements / knowledge / experience', 'expected profile', 'in order to apply for the graduate programme, you must have', 'profil :', 'you will convince us with',
    '##master in engineering, or business administration and at least 3 years of experiences in digital (insurance or web/e-commerce) as a business analyst',
    'you have successfully completed your studies in computer science, business informatics, business administration, economics or industrial engineering.handy tools',
    "technical business/data analyst for change initiatives for allianzgi's alternatives investment platform", 'your skills &amp; talents...', 'hard skills',
    'university degree (law / business preferred)', '##support development and continuous improvement of required processes', 'what is your educational background',
    'requirements / knowledge / experience< /b', 'education and experience', 'applying creative methods to consume data from is/it sources or move data between is/it solutions in the ecosystem.',
    'knowledge / knowledge / experiences', 'preferred skills (good to have)', 'languages required', 'this role requires', 'education and professional background',
    'experience in sap functional area as', 'capabilities in new technologies', 'your strengths', 'level of diploma', 'in this they are strong', 'what you deliver',
    'we offer the opportunity to develop a fascinating career in exchange for a few requirements', 'high it affinity', 'it knowledge', 'studies/training','your profilebr>',
    'in addition, a rotation of at least 3 months in one of our innovation team can be part of one in which it can be done in agile work methodology, big data, machine learningand acquire design thinking, expand your digital skills and expand your own network.',
    "- you develop operations support policies, standards and procedures to improve operational efficiency", 'in this you are strong', 'in this they are strong',
    "what you should bring", 'studium', 'you will work with our it development partners in a dynamic environment in modern forms of collaboration, such as. agile development.',
    'interface between clients, project members and end users', 'seem profile', 'presentation and discussion ofwork results at management level',
    'acceptance test of the implemented requirements as well as expansion of the regression tests', 'prerequisites', 'you bring with -skills with which you shape the future',
    'creation ofbusiness analyzes including the specification of the requirements in the context of implementation concepts and user stories as the basis for it programming.',
    'if you also pursue your goals with great commitment and characterize you an open personality, you should check in to start together.', 'why are we looking for you',
    'we look forward to', 'the best conditions for your entry', 'you are strong', 'minimum knowledge, skills and abilities required', 'knowledge, skills, and abilities',
    'effectively manage', 'profile required', 'competence & experience', '…and you also:', 'attitudes', 'it applications', 'behavioural skills'
    ]

# Key Phrases exact match
key_phrases_exact_match = ['you...', 'you:', 'your strengths', 'about you', 'you', 'profil', 'who are', 'ihr profil', 'do i qualify?', 'requisitos', 'they fit us.','your story','skills',
                           'are you a fit?', 'candidate:', 'who you are','you are…', 'you are...', 'your skills', 'requirement', 'experience', 'this is you', 'bringing ', 'you specialist',
                           "you ...", "we look for", 'you have', 'education', 'behaviour', 'profile', 'desired','studies', 'it skills', 'erience', 'required', 'essential', 'desirable',
                           'knowledge', 'and:', 'essential']

# Key Phrases without exact match
key_phrases_without_exact_match = filter_key_phrases(key_phrases, key_phrases_exact_match)

# Stop phrases
stop_phrases = ['audit planning', 'pwc', 'impact/scope', 'offers that convince me', 'what do we offer you', 'additional information', 'we...', 'ubisoft',
                'you can look forward to', 'airbus', 'what you’ll get in return', 'send us your application', 'zusätzliche informationen', 'allianz is', 'why amazon',
                'airbus', 'candidates who are considered for a position', 'co-ordinating it cover', 'apply today', 'thanks for taking the time', 'application procedure', 'cover letter',
                'in order to apply', 'please note that', 'mars', 'all applications will be reviewed', 'please send your application to', 'volvo group', 'contract:',
                'stipend and benefits', 'documents to your application', 'your cv in english', 'application and assessment process', 'full-time job', 'reference code','ibm is committed',
                'your tasks will be', 'are you interested?', 'tasks in detail', 'studies/training', 'start contract date', 'we offer', 'about us ', 'unibail', 'qualified applicants will receive consideration',
                'provide proper reporting', 'can apply', 'huntsville', 'work with external resource', 'you are not a regular helpdesk', 'perks at work', 'about zalando', "zalando is europe’s leading", 'internal tech academy',
                'further information for your application', 'bmw group', 'experience what moves us', 'contact our recruiting team', 'security/export control statement', 'our offer',
                'working conditions', 'if you share our values and vision', 'more opportunities for your development', 'student programs team', 'please no phone calls', 'hat we offer',
                'additional information', 'order to be eligible for an internship', 'key dates:', 'what will you work on', 'why netsuite', 'novutech in a nutshell', "#1 cloud-based",
                "what's in it for you", 'place of employment', 'you are going to bring', 'together we will rethink', 'are you ready for a new challenge', 'general function',
                'management system provides', 'deloitte-recruiting team', 'what are the central tasks', 'new york it', 'new yorkers offers', "you'll be working within an audit team",
                'interested?', 'best business school in the world', 'siehe job description', 'graduates working in project controls', "pour l'analyse de données", 'in order to structure our',
                'recruiting or consulting agency', 'key figures', 'production sites', 'deine benefits', 'you will prepare market', 'transfer pricing structures', 'your benefit',
                'application configuration and support','please note', 'we advise corporates', 'seller support center', 'configure and support', 'diversity and inclusion', 'the nestlé group',
                "please don't hesitate and apply", 'lufthansaglobal business services gmbh', 'messier-dowty', ' canadian controlled', 'equal opportunity employer',
                'tasks and responsibilities', 'coach and mentor', 'scor global life americas', 'we are looking forward to your application', 'you make the impossible possible',
                'takes responsibility for achieving', 'responsibility for achieving', 'accountabilitytakes', 'permanent job', 'contract :', 'the internship stipend',
                'a temporary position', 'stipend and benefits', 'upload your cv', 'you are offered', 'please apply', 'approve monthly/weekly delivery',
                'via lufthansa global business services', 'at criteo', 'nestlé is the largest food', 'your responsibilities', 'use analytics', 'appointment date', ' please send',
                'application period', 'protecting your privacy', 'singapore r&d office', 'belonging to toyota', 'start -up date', 'at allianz', 'position based in', 'category :',
                'with us you are part', 'you have the chance', 'coffee and water', 'our company pension', 'nike european headquarters', 'compelling marketing copy', 'flexible working times',
                'tech-unicorns in the world','important note','why us?', 'flat hierarchies', 'scholarship', 'terms & conditions', 'position summary','our tech department', 'benefits', 'grow together',
                'free sports', 'about the application process', 'dynamic and international working', 'deliver accurate and timely', 'your next career step', 'analyze, document',
                'remember to attach', 'actively shape', 'the department', 'do we want to achieve', 'new yorkers', 'part of our team', 'responsibility and decision-making authority',
                'a combination of personal', 'swts manufactures', 'exellys','report to', 'duration: 6 months full-time', 'what’s in it for you', 'accelerate your technical capabilities',
                'contact info', 'duration:', 'mobile al', 'note to students', 'allianz group', 'vmware is committed', 'vmware company overview', 'ntry level', 'ibm services is a team',
                'about business unit', 'about ses', 'join seb global services', 'to be a part of international company', 'amazon is an equal opportunities', 'join us and', 'timing :',
                'diverse, exciting and varied activities', 'nivel de educación', 'presso primarie', 'of responsibility and', 'your win', 'above all of this', 'your day to day', 'contrat',
                'equal employment', 'founded in', 'tasks & responsibilities', 'lufthansa consulting delivers'
                ]

# Stop phrases exact match
stop_phrases_exact_match = ['we',  'we offer', 'training and resources', 'location', 'entry level', 'ibm',  'ey', 'deloitte','contract','lufthansa', 'responsibilities',
                            'application support', 'basf', 'how to apply', 'salary', 'the application of']

# Stop words
stop_words = ['key competencies', 'knowledge, skills, qualifications & experience:', 'required:', 'desired:', 'required', 'desired', 'required skills:', 'desired characteristics/skills', 'functional skills:', 'you are:', 'is it you, we are looking for?',
              'https://www.bankaustria.at/karriere.jsp', 'knowledge/technical skills', 'what skills/experience we are looking for:', 'licenses/certifications/other',
              'entry level', "ideally, you’ll also have", 'as per job description.', "what you'll need", 'we are willing to cover 2 positions:', 'work experience & qualifications:',
              'to qualify for the position(s), you possess the following qualities:', 'who are you?', 'professional qualification routes', 'new graduate',
              'educational qualifications:', 'what kind of candidate are we looking for?', 'we are looking for someone that shares the same values as our team:', 'local job requirements',
              "https://www.be-lufthansa.com/de/faqs-be-frifthansa/lufthansa/internship degree work/", 'you are offered', 'via lufthansa global', 'department description:',
              '##support development and continuous improvement of required processes', 'category :', '!our', 'experience :', 'general requirements:', 'preferred skills:',
              'profile preferred qualifications', 'profile key requirements', 'ability to:']
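
Several entries appear more than once in the lists above (for example 'requirements' in `key_phrases` and 'airbus' in `stop_phrases`). This does not change the result, but each repeated key phrase is searched again. One possible clean-up, shown here as an optional sketch with a hypothetical helper name, removes repeats while keeping the list order:

```python
def dedupe_preserving_order(phrases):
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(phrases))

sample = ['requirements', 'education', 'requirements', 'airbus', 'airbus']
unique_phrases = dedupe_preserving_order(sample)
print(unique_phrases)
```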

2.7 Main function for skill section extraction

In [69]:
import pandas as pd
from bs4 import BeautifulSoup
import re

# Identifying key phrases and extracting the qualification section
def extract_qualification_section(df_row):
    # Tracks if p or ul element has been found
    p_ul_found = False
    # Tracks if any key_phrase has been found across all key phrases
    key_phrase_found = False
    # Tracks if the current key_phrase has been found (reset for each key_phrase)
    found = False
    # Tracks if a section has been found
    section_found = False
    # Tracks if a 'tr' HTML object has been found
    tr_found = False

    # Initialize list to store qualifications
    qualifications = []

    # Return early if df_row is not a string (e.g. a missing description)
    if not isinstance(df_row, str):
        print('Row is not of type string')
        return 'Skills can not be detected!'
    df_row = df_row.lower()

    # Replace incorrect html tags
    df_row = df_row.replace('brong', 'strong')
    df_row = df_row.replace('<d>', '<div>')

    # Creates a BeautifulSoup object by passing 'df_row' and specifying the parser as 'html.parser'
    soup = BeautifulSoup(df_row, 'html.parser')

    # Iterate through the list of key_phrases
    for key_phrase in key_phrases:

      # Reset boolean variables
      found = False
      p_ul_found = False

      # Searches for all specified elements and verifies if a keyphrase is present between the specified element and the following element
      # Case that key_phrase is between a div tag and another tag
      div_tags = soup.find_all('div')
      for i in range(len(div_tags)):
        div_element = div_tags[i].find_next()
        if div_element and div_element.previous_sibling:
          if re.sub(r'[^\w\s]', '', key_phrase).strip() == re.sub(r'[^\w\s]', '', div_element.previous_sibling.text).strip():
            found = True
            key_phrase_found = True
            element = div_tags[i]
            break
          if div_element is not None:
            div_element = div_element.find_next()

      # Case that key_phrase is between a b tag and another tag
      b_tags = soup.find_all('b')
      for i in range(len(b_tags)):
        b_element = b_tags[i].find_next()
        if b_element and b_element.previous_sibling:
          if re.sub(r'[^\w\s]', '', key_phrase).strip() == re.sub(r'[^\w\s]', '', b_element.previous_sibling.text).strip():
            found = True
            key_phrase_found = True
            element = b_tags[i]
            break
          if b_element is not None:
            b_element = b_element.find_next()

      # Case that key_phrase is between a br tag and another tag
      # Searches for all br elements and checks if the key phrase is in the text between the br element and the following element
      br_tags = soup.find_all('br')
      for i in range(len(br_tags)):
        br_element = br_tags[i].find_next()
        if br_element and br_element.previous_sibling:
          previous_sibling_text = br_element.previous_sibling.text
          if isinstance(previous_sibling_text, str):
              # Remove punctuation and surrounding whitespace
              previous_sibling_text = re.sub(r'[^\w\s]', '', previous_sibling_text).strip()
              # Case 1: key phrase contained in a short text (fewer than 6 words), or exact match
              if (re.sub(r'[^\w\s]', '', key_phrase).strip() in previous_sibling_text and key_phrase in key_phrases_without_exact_match and len(previous_sibling_text.split()) < 6) or (re.sub(r'[^\w\s]', '', key_phrase).strip() == previous_sibling_text):
                  found = True
                  key_phrase_found = True
                  element = br_tags[i]
                  break
              # Case 2: exact match (already covered by the second condition of Case 1)
              elif re.sub(r'[^\w\s]', '', key_phrase).strip() == previous_sibling_text:
                  found = True
                  key_phrase_found = True
                  element = br_tags[i]
                  break
          if br_element is not None:
            br_element = br_element.find_next()

      # Case that key_phrase is between a strong tag and another tag
      strong_tags = soup.find_all('strong')
      for i in range(len(strong_tags)):
        strong_element = strong_tags[i].find_next()
        if strong_element and strong_element.previous_sibling:
          if re.sub(r'[^\w\s]', '', key_phrase).strip() == re.sub(r'[^\w\s]', '', strong_element.previous_sibling.text).strip():
            found = True
            key_phrase_found = True
            element = strong_tags[i]
            break
          if strong_element is not None:
            strong_element = strong_element.find_next()

      # Case that key_phrase is between a span tag and another tag
      span_tags = soup.find_all('span')
      for i in range(len(span_tags)):
        span_element = span_tags[i].find_next()
        if span_element and span_element.previous_sibling:
          if re.sub(r'[^\w\s]', '', key_phrase).strip() == re.sub(r'[^\w\s]', '', span_element.previous_sibling.text).strip():
            found = True
            key_phrase_found = True
            element = span_tags[i]
            break
          if span_element is not None:
            span_element = span_element.find_next()


      # Case that key_phrase is between a li tag and another tag
      li_tags = soup.find_all('li')
      for i in range(len(li_tags)):
        li_element = li_tags[i].find_next()
        if li_element and li_element.previous_sibling:
          if re.sub(r'[^\w\s]', '', key_phrase).strip() == re.sub(r'[^\w\s]', '', li_element.previous_sibling.text).strip():
            found = True
            key_phrase_found = True
            element = li_tags[i]
            break
          if li_element is not None:
            li_element = li_element.find_next()



      # Case that key_phrase is between an end tag and another tag
      tags = soup.find_all()
      for i in range(len(tags) - 1):
          # Only consider consecutive tags with different names
          if tags[i] and tags[i + 1]:
              if tags[i].name and tags[i + 1].name and tags[i].name != tags[i + 1].name:
                  next_sibling = tags[i].next_sibling
                  if next_sibling and hasattr(next_sibling, 'text'):
                      next_sibling_text = next_sibling.text.strip()
                      if next_sibling_text:
                          stripped_text = (re.sub(r'[^\w\s]', '', next_sibling_text))
                          if re.sub(r'[^\w\s]', '', key_phrase) == stripped_text:
                              found = True
                              key_phrase_found = True
                              element = next_sibling
                              break

      # Looks for occurrences where the keyphrase appears between a <p> element and a subsequent <br> element
      pattern = rf'<p>\s*{re.escape(key_phrase)}\s*(:\s*)?<br>'
      match = re.search(pattern, df_row, re.DOTALL)
      if match:
        key_phrase_found = True
        # Extract the content following the match
        text_after_match = df_row[match.start():]
        # Select paragraph with keyword
        soup4 = BeautifulSoup(text_after_match, 'html.parser')
        element = soup4.find('p')

        # Append the segment after the match that does not contain any stopphrase to the qualifying section
        if not check_stop_phrases_element(element):
          qualifications.append(truncate_text_on_stopphrase(element))
          break

      # Execute if none of the previous conditions or patterns match a keyphrase
      if not found:
        # Iterate over elements found in the BeautifulSoup object and check if the element's tag name is in the list of specified tags
        for element in soup.find_all():
            # Ensure the element has a string value, that the key_phrase is present in the string
            # Verify that key_phrase is not in key_phrases_exact_match (unless it is an exact match)
            # Confirm that none of the stop_phrases are present in the element's string
            if element.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'b', 'strong', 'i','u','em', 'span', 'font', 'a'] and element.string != None and key_phrase in element.string and len(element.string.split()) < 11 and ((key_phrase not in key_phrases_exact_match or key_phrase == element.string)) and all(stop_phrase not in element.string for stop_phrase in stop_phrases):
                found = True
                key_phrase_found = True
                break
            # Check if the element's tag is in the list of specified tags
            # Check if the element's stripped text matches the key_phrase or its variations with specific suffixes
            if element.name in ['p', 'div', 'ul', 'li'] and (element.text.strip() in [key_phrase, key_phrase + ':', key_phrase + ';', key_phrase + '-']):
                found = True
                key_phrase_found = True
                break

      # If keyphrase has been found, extraction of sentences which include qualification
      if found:

          # Stores qualifications temporarily
          temporary_qualifications = []

          # Find first element occurring after keyphrase
          if element:
            next_element = element.find_next()
          else:
            # Break if next element is None
            break

          # Track the number of words after key_phrase
          word_count = 0

          # Continue the loop while tag name is not in ['p', 'ul', 'li', 'ol', 'div']
          # And either the word count is less than 60 or the next_element's tag name is 'font' or 'span'
          while next_element and next_element.name not in ['p', 'ul', 'li', 'ol','div'] and (word_count < 60 or next_element.name == 'font' or next_element.name == 'span'):
              # Skip None type elements
              if next_element is None:
                  continue

              if next_element:
                # Special case to track word count for <br> elements
                if next_element.name in ['br'] and next_element.previous_sibling:

                  # Stop extracting if one of the stop phrases was detected
                  if check_stop_phrases_element_text(next_element.previous_sibling.text.strip()):
                    break

                  # Increase word_count by the number of words of the previous sibling element
                  word_count += len(next_element.previous_sibling.text.split())

                else:
                  # Increase word_count by the number of words of the element
                  word_count += len(next_element.get_text().split())
              # Break if word_count >= 60 and no ul element within the next_element
              if word_count >= 60 and not next_element.find('ul'):
                break

              # Find next element
              next_element = next_element.find_next()


          allow_br = False  # Variable to check if the <br> element should be allowed
          allow_em = False  # Variable to check if the <em> element should be allowed


          while next_element and (next_element.name in ['p', 'ul', 'li', 'div', 'ol'] or (next_element.name == 'br' and allow_br) or (next_element.name == 'em' and allow_em)):
              #Tracks if p or ul section was found
              p_ul_found = True

              # # Set variables to True to continue the loop if any of the specified elements occur
              if next_element.name in ['p', 'ul', 'li', 'div', 'ol']:
                  allow_br = True
                  allow_em = True

              # Stop extracting if one of the stop phrases has been detected
              if check_stop_phrases_element(next_element):
                break

              #########
              # Lists
              #########
              if next_element.name in ['ul','li','ol']:
                  # Track if ul list was found
                  ul_list_found = True

                  ###################################
                  # Option #1: There is a valid list element
                  ###################################
                  # List has li-elements
                  if next_element.find_all('li', recursive=False) != []:
                    for li in next_element.find_all('li', recursive=False):
                        # Checking for stopphrase and break if detected
                        if check_stop_phrases_element(li):
                          break
                        # Special case that there is a second list within a list
                        # Only valid if less than 3 sentences have been collected so far, otherwise it will be skipped
                        if li.find('ul'):
                          if len(set(temporary_qualifications)) < 3:
                            qualifications.append(li.text)
                            temporary_qualifications.append(li.text)
                            section_found = True
                          else:
                            continue
                        # Extract qualification sentences
                        else:
                          qualifications.append(truncate_text_on_stopphrase(li))
                          temporary_qualifications.append(truncate_text_on_stopphrase(li))
                          section_found = True


                        #  Check if the current list section is concluded by checking the next element
                        next_element_check = li.find_next()
                        if next_element_check and next_element_check.name not in ['font','b', 'ul', 'li', 'span', 'br', 'td', 'tr', 'strong','ol', 'i', 'u', 'p','em', 'a']:
                          if any(qualification not in key_phrases for qualification in temporary_qualifications) and next_element_check.name not in ['font','b', 'ul', 'li', 'span', 'br', 'td', 'tr', 'strong','ol', 'i', 'u', 'p','em', 'a']:
                              break
                  # List has no li-elements
                  elif next_element.name in ['ul','ol']:
                    qualifications.append(truncate_text_on_stopphrase(next_element))
                    temporary_qualifications.append(truncate_text_on_stopphrase(next_element))
                    section_found = True


                  ###################################
                  # Option #2: there is no list beginning tag
                  ###################################
                  else:
                    if next_element.previous_sibling:
                      for li in next_element.previous_sibling.find_all_next('li', recursive=False):
                        # Check for stopphrase and break if detected
                        if check_stop_phrases_element(li):
                          break

                        # Detect a <strong> element within a <li> element that contains a key phrase and skip that <li>
                        if li.find('strong'):
                            strong_element = li.find('strong')
                            # A separate loop variable keeps the outer key_phrase from being overwritten
                            if strong_element.string and any(phrase in strong_element.string for phrase in key_phrases):
                                continue
                            # Append it to qualifications if it contains no key phrase
                            qualifications.append(li.text)
                            temporary_qualifications.append(li.text)
                            section_found = True
                        if li.find('ul'):
                            continue
                        else:
                            # Append sentences to qualifications
                            qualifications.append(truncate_text_on_stopphrase(li))
                            temporary_qualifications.append(truncate_text_on_stopphrase(li))
                            section_found = True

                        # Check if all li elements has been completely extracted
                        next_element_check = li.find_next()
                        if next_element_check and next_element_check.name not in ['font','b', 'ul', 'li', 'span', 'br', 'td', 'tr', 'strong','ol', 'i', 'u', 'p','em', 'a']:
                          if any(qualification not in key_phrases for qualification in temporary_qualifications) and next_element_check.name not in ['font','b', 'ul', 'li', 'span', 'br', 'td', 'tr', 'strong','ol', 'i', 'u', 'p','em', 'a']:
                              break


                  # Find next sibling element
                  next_element_check = next_element.find_next_sibling()

                  # Check whether the current list has been completely extracted
                  if next_element_check and not (next_element_check.text == '' and next_element_check.name == 'p'):
                    if next_element_check and next_element_check.name not in ['font','b', 'ul', 'li', 'span', 'br', 'td', 'tr', 'strong','ol', 'i', 'u','em', 'a']:
                      if any(qualification not in key_phrases for qualification in temporary_qualifications) and next_element_check.name not in ['font','b', 'ul', 'li', 'span', 'br', 'td', 'tr', 'strong','ol', 'i', 'u' ,'em', 'a']:
                          break

              ###########
              # Sections
              ###########
              # Extract <p> elements which have a textual content
              if next_element.name == 'p' and next_element.text.strip() != "":
                if next_element.string == None:
                  qualifications.append(truncate_text_on_stopphrase(next_element))
                  section_found = True
                else:
                  qualifications.append(next_element.string)
                  section_found = True
                next_element_check = next_element.find_next_sibling()

                # Check if p sections has been completely extracted
                if next_element_check and next_element_check.name not in ['font', 'ul', 'li', 'span', 'br', 'td', 'tr', 'strong', 'p', 'b','ol', 'u','em', 'a']:
                    break

              ###############
              # Div elements
              ###############
              if next_element.name == 'div':
                # Check if there is a stop phrase within the element text
                if not any(stop_phrase in next_element.text for stop_phrase in stop_phrases):
                  qualifications.append(next_element.text)
                  section_found = True
                else:
                  qualifications.append(truncate_text_on_stopphrase(next_element))
                  section_found = True

                # Check if div sections has been completely extracted
                next_element_check = next_element.find_next_sibling()
                if next_element_check and next_element_check.name not in ['font', 'ul', 'li', 'span', 'br', 'td', 'tr', 'strong','b','ol', 'div', 'u','em', 'a']:
                    break

              # Find next element
              if next_element.find_next_sibling() is None:
                next_element = next_element.find_next()
              else:
                next_element = next_element.find_next_sibling()


          # Case no section or list element was found
          if not p_ul_found:
              # Case 1: Qualifications within a strong tag
              # Find the index of the closing </strong> tag after the keyword
              closing_tag_index = df_row.find('</strong>', df_row.find(key_phrase))

              # Check if a closing tag index was found
              if closing_tag_index != -1:
                  # Find the index of the next opening <strong> tag
                  opening_tag_index = df_row.find('<strong>', closing_tag_index)
                  if opening_tag_index != -1:
                      # Extracts the text between the two tags
                      text_between_strong = df_row[closing_tag_index + len('</strong>'):opening_tag_index]

                      # Stop extracting if one of the stop phrases has been detected
                      if check_stop_phrases_element_text(text_between_strong):
                        break

                      else:
                        # Add the text to qualifications
                        qualifications.append(text_between_strong.strip())


              # Case 2: Qualifications/Key phrase within a <b>-tag
              # Find all <b>-tags
              b_tags = soup.find_all('b')
              text_between_b_tags = ''

              # Iterate through all <b> tag
              for i in range(len(b_tags)):
                text_between_b_tags = ''
                if key_phrase in b_tags[i].text:
                    next_element = b_tags[i].find_next_sibling()
                    word_count = 0
                    # Check if the current iteration is the last index of the b_tags list
                    if i == len(b_tags) - 1:
                      if next_element and next_element.name == 'br' and next_element.previous_sibling:

                            # Stop extracting if one of the stop phrases has been detected
                            if check_stop_phrases_element_text(next_element.previous_sibling.text.strip()):
                              break

                            # Reset word count when encountering a <br> tag
                            word_count = 0
                            words = next_element.previous_sibling.text.split()
                            word_count += len(words)

                            # Filter out large paragraphs which likely to be description of the company
                            if word_count <= 75:
                                qualifications.append(next_element.previous_sibling.text.strip())
                            if next_element is not None:
                              next_element = next_element.find_next()

                    # Case, qualifications are separated by <br> elements
                    while next_element and (next_element.name == 'br'):

                        # Stop extracting if one of the stop phrases has been detected (guard against a missing previous sibling)
                        if next_element.previous_sibling and check_stop_phrases_element_text(next_element.previous_sibling.text.strip()):
                          break

                        if next_element and next_element.name == 'br' and next_element.previous_sibling:

                            # Reset word count when encountering a <br> tag
                            word_count = 0
                            words = next_element.previous_sibling.text.split()
                            word_count += len(words)

                            # Filter out large paragraphs which likely to be description of the company
                            if word_count <= 75:
                              qualifications.append(next_element.previous_sibling.text.strip())

                              # Handle the case of the last sentence in the qualifications section (a tag other than <br> is encountered)
                              if next_element.find_next() and next_element.find_next().name != 'br':
                                  pre_next_element = next_element.find_next()

                                  # Append the trailing text only if no stop phrase was detected
                                  if not any(stop_phrase in pre_next_element.find_previous().text for stop_phrase in stop_phrases):
                                    qualifications.append(pre_next_element.find_previous().text.strip())

                                  # Otherwise append it only if it does not exactly match a stop phrase
                                  elif next_element.string != None and not any(stop_phrase_exact_match == re.sub(r'[^\w\s]', '', pre_next_element.find_previous().text) for stop_phrase_exact_match in stop_phrases_exact_match):
                                    qualifications.append(pre_next_element.find_previous().text.strip())

                              # Handle the case of the last sentence in the qualifications section (no tag is encountered)
                              elif next_element.find_next() == None:
                                filtered_html = ''.join(str(element) for element in next_element.next_siblings if element != '\n')

                                # Append the remaining text only if no stop phrase was detected
                                if not any(stop_phrase in filtered_html for stop_phrase in stop_phrases):
                                  qualifications.append(filtered_html)

                            # Find next element
                            if next_element is not None:
                              next_element = next_element.find_next()

              # Case Keyphrase has been found but no list or section was found
              if found:
                # In some cases there is a br tag within the strong tag, in this case we find the next tag
                if element and element.name == 'strong':
                  element = element.find_next()

                # Skip the first <br> tag since the first qualification sentence comes after it
                if element and element.name == 'br':
                  next_element = element.find_next()

                # <strong> element should only be allowed after br or span element
                allow_strong = False
                # Case #1, qualifications are separated by <br> elements
                while next_element and ((next_element.name in ['br', 'span']) or (allow_strong and next_element.name in ['br', 'span', 'strong', 'b'])):
                        allow_strong = True

                        if next_element and next_element.name in ['br', 'span'] and next_element.previous_sibling:

                            # Reset word count
                            word_count = 0
                            words = next_element.previous_sibling.text.split()
                            word_count += len(words)

                            # Stop extracting if one of the stop phrases was detected
                            if check_stop_phrases_element_text(next_element.previous_sibling.text.strip()):
                              break

                            # Filter out large paragraphs which likely to be description of the company
                            if word_count <= 40:
                                qualifications.append(next_element.previous_sibling.text.strip())

                                # Handle the case of the last sentence in the qualifications section
                                if next_element.next_sibling and next_element.next_sibling.name != 'br':

                                  # Append the trailing text only if no stop phrase was detected
                                  if not any(stop_phrase in next_element.next_sibling.text for stop_phrase in stop_phrases):
                                    qualifications.append(next_element.next_sibling.text.strip())

                                  # Otherwise append it only if it does not exactly match a stop phrase
                                  elif next_element.string != None and not any(stop_phrase_exact_match == re.sub(r'[^\w\s]', '', next_element.next_sibling.text.strip()) for stop_phrase_exact_match in stop_phrases_exact_match):
                                    qualifications.append(next_element.next_sibling.text.strip())

                        # Find next element
                        if next_element is not None:
                          next_element = next_element.find_next()

                # Case #2, qualifications are within table rows
                if soup.find('tr') is not None and key_phrase in key_phrases_without_exact_match:
                  start_index = df_row.find("<tr>", df_row.find(key_phrase))
                  end_index = df_row.find("</tr>", start_index) + len("</tr>")

                  next_tr = df_row[start_index:end_index]
                  soup2 = BeautifulSoup(next_tr, 'html.parser')

                  # Extract text from the HTML
                  if soup2.get_text() and not check_stop_phrases_element(soup2):
                    qualifications.append(soup2.get_text())

                  if soup2.get_text().strip():
                    tr_found = True
                    key_phrase_found = True


                # Case #3: Key_phrase is in an string element within a paragraph
                pattern = rf'<strong>{re.escape(key_phrase)}:</strong>(.*?)</p>'
                match = re.search(pattern, df_row, re.DOTALL)
                if match:
                    qualifications.append(match.group(1).strip())

          # Section has been found
          section_found = True

    # Case no section and no keyphrase has been found
    if not section_found:
        # Case keyphrase within an <ul>-tag/<li>-tag
        ul_tags = soup.find_all('ul')
        for ul_tag in ul_tags:
          if key_phrase in ul_tag.text:
              print(ul_tag.text)
              li_tags = ul_tag.find_all('li')
              for li_tag in li_tags:
                if key_phrase in li_tag.text:
                  next_li_tags = li_tag.find_all_next('li')
                  for next_li_tag in next_li_tags:
                    if key_phrase not in next_li_tag.text:
                      qualifications.append(next_li_tag.text.strip())
                  break

    # Remove remaining html tags
    qualifications = [BeautifulSoup(text, "html.parser").get_text() for text in qualifications]

    # Remove special characters
    qualifications = [text.replace("<", "").replace(">", "").replace("•", "") for text in qualifications]

    # Remove multiple spaces and line breaks
    qualifications = [re.sub(r'\s+', ' ', text) for text in qualifications]

    # Remove duplicates #1
    qualifications = list(set(qualifications))

    # Remove leading/trailing whitespace and filter out empty qualifications
    qualifications = [qualification.strip() for qualification in qualifications if qualification.strip()]
    qualifications = list(filter(None, qualifications))

    # Filter qualifications based on alphabetical characters and key phrases
    qualifications = filter_and_clean_qualifications(qualifications)

    # Remove list items that only contain stopwords
    qualifications = [item for item in qualifications if not any(word == item for word in stop_words)]

    # Remove duplicates #2
    qualifications = list(set(qualifications))

    # Exclude sentences that are already contained in other paragraphs
    qualifications = filter_duplicates(qualifications)


    if len(qualifications) == 0 and key_phrase_found:
      print('Skills can not be detected!')
      return('Skills can not be detected!')
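    elif len(qualifications) == 0 and not key_phrase_found:
      print('Keyphrase can not be detected!')
      # The same message string is returned so that the export filter in section 3.3 drops this ad as well
      return('Skills can not be detected!')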
    else:
      print(qualifications)
      return qualifications
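
The tag-based checks in `extract_qualification_section` repeatedly compare a key phrase with nearby text after stripping punctuation via `re.sub(r'[^\w\s]', '', ...)`. A minimal sketch of how that comparison could be factored into one helper is shown here; `normalize_phrase` and `phrases_match` are hypothetical names that do not exist in the notebook, and the sample phrases are made up for illustration.

```python
import re

def normalize_phrase(text):
    """Drop punctuation (anything that is neither a word character nor whitespace) and trim."""
    return re.sub(r'[^\w\s]', '', text).strip()

def phrases_match(key_phrase, candidate_text):
    """Return True if both strings are equal once punctuation and outer whitespace are removed."""
    return normalize_phrase(key_phrase) == normalize_phrase(candidate_text)

# Illustrative calls with made-up phrases
print(phrases_match('Your profile', ' Your profile: '))
print(phrases_match('Your profile', 'Your tasks'))
```

Using one helper for the `strong`, `span` and `li` cases would keep the normalization identical everywhere, which the copy-pasted expressions cannot guarantee.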

# 3. Extraction of qualification section

3.1 Extraction of qualifications for the whole dataset

In [None]:
job_data['qualifications'] = job_data.apply(lambda row: (print(row['id']), extract_qualification_section(row['rawDescriptionTranslated']))[1], axis=1)
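
The cell above prints each ad id by packing `print` and the extraction call into a tuple. As a possible alternative, the sketch below wraps both steps in a named function. `extract_with_logging` is a hypothetical helper, and catching exceptions so that one malformed ad does not abort the whole run is an assumption, not behaviour of the original notebook. The demo uses a stand-in extractor and plain dictionaries so it runs on its own.

```python
NOT_DETECTED = 'Skills can not be detected!'  # message string returned by extract_qualification_section

def extract_with_logging(row, extractor, id_key='id', text_key='rawDescriptionTranslated'):
    """Print the ad id, run the extractor, and fall back to NOT_DETECTED on errors."""
    print(row[id_key])
    try:
        return extractor(row[text_key])
    except Exception as error:  # assumption: a failing ad is skipped instead of stopping the run
        print(f'Extraction failed for ad {row[id_key]}: {error}')
        return NOT_DETECTED

# Stand-in extractor for the demo only; the notebook would pass extract_qualification_section
def fake_extractor(html):
    if not html:
        raise ValueError('empty description')
    return [html.upper()]

rows = [
    {'id': 1, 'rawDescriptionTranslated': '<p>python</p>'},
    {'id': 2, 'rawDescriptionTranslated': ''},
]
results = [extract_with_logging(row, fake_extractor) for row in rows]
```

Because a pandas row supports the same `row['id']` lookups, the helper could presumably be passed to `job_data.apply(..., axis=1)` with `extractor=extract_qualification_section` as a keyword argument.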

3.2 Extraction for one specific job ad (for development and accuracy checks only)

In [71]:
# Define text for single extraction
j = 0
text = """"""

In [72]:
# Extract single skill section (disabled; remove the surrounding quotes to run)
'''
result = extract_qualification_section(text)
# The function returns a list on success and a message string otherwise
if isinstance(result, list):
  for j, qualification in enumerate(result, start=1):
    print(j)
    print(qualification)
'''

3.3 Preparing data for export

In [73]:
# Convert list to string
job_data['qualifications'] = job_data['qualifications'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

# Filter for job ads where qualifications were detected
job_data = job_data[job_data['qualifications'] != 'Skills can not be detected!']
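
To make the effect of these two steps concrete, here is a small sketch that applies the same logic to plain Python records instead of the data frame. `prepare_for_export` and the sample records are hypothetical and only serve as an illustration.

```python
NOT_DETECTED = 'Skills can not be detected!'  # message string returned when extraction fails

def prepare_for_export(records):
    """Join qualification lists into one string and drop records without detected skills."""
    prepared = []
    for record in records:
        value = record['qualifications']
        # Lists of sentences become a single space-separated string
        if isinstance(value, list):
            value = ' '.join(value)
        # Records carrying the failure message are excluded from the export
        if value != NOT_DETECTED:
            prepared.append({**record, 'qualifications': value})
    return prepared

sample = [
    {'id': 1, 'qualifications': ['Python experience', 'Fluent English']},
    {'id': 2, 'qualifications': NOT_DETECTED},
]
exported = prepare_for_export(sample)
```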

# 4. Export

4.1 Excel export of dataset with extracted skill-section

In [74]:
# Export dataset with detected skill sections
job_data.to_excel('job_data_preprocessed_extracted_qualifications.xlsx', index=False)