# 2. Preprocessing

This notebook preprocesses the raw data for modeling. The emphasis is on cleaning the text so that it suits the embeddings used later, such as Word2Vec, BERT, and DistilBERT, and on converting it to a format each model can consume.

For all models, we perform common preprocessing steps such as removing HTML and markup tags from the text. Further transformations are applied case by case. For example, the Word2Vec pipeline lowercases the text, removes stop words, and optionally stems the tokens, whereas the transformer pipeline only strips markup and normalizes whitespace, leaving the casing unchanged.

The details of each step appear in the cells below. Once preprocessing is complete, the cleaned and transformed data is saved for use in the subsequent notebooks of the analysis and modeling pipeline.
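
The HTML clean-up mentioned above can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: the helper name `strip_markup` is hypothetical, and the `html.unescape` call is an added assumption for decoding entities such as `&ldquo;`, which a tag-only regular expression leaves in place.

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")

def strip_markup(text):
    """Remove HTML tags, then decode HTML entities (hypothetical helper)."""
    without_tags = TAG_PATTERN.sub("", str(text))
    # Entity decoding is an assumption added here; it is not part of the
    # preprocessing functions defined in this notebook.
    return html.unescape(without_tags)

sample = "<p><strong>$16.00 Per Hour</strong></p>"
print(strip_markup(sample))  # $16.00 Per Hour
```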

In [1]:
#Importing libraries:

import os
from pathlib import Path
import pandas as pd
import numpy as np
import nltk; nltk.download('punkt'); nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
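
The output above shows that both downloads failed with an SSL certificate error, so `word_tokenize` and `stopwords.words` will only work later if the resources are already cached locally. Below is a minimal sketch of a check that reports which resources are missing. The helper name `missing_nltk_resources` and its `find` parameter are hypothetical, and newer NLTK releases may look for `punkt_tab` rather than `punkt`, so the mapping may need adjusting.

```python
REQUIRED_RESOURCES = {
    "punkt": "tokenizers/punkt",
    "stopwords": "corpora/stopwords",
}

def missing_nltk_resources(required=REQUIRED_RESOURCES, find=None):
    """Return the names of NLTK resources that cannot be found locally."""
    if find is None:
        import nltk  # imported lazily so the helper can be tested without NLTK
        find = nltk.data.find
    missing = []
    for name, path in required.items():
        try:
            find(path)  # raises LookupError when the resource is absent
        except LookupError:
            missing.append(name)
    return missing
```

Calling `missing_nltk_resources()` right after the imports, and stopping if the returned list is not empty, would surface the problem before the tokenization cell runs.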


### 2.1.1 Gathering data

In [2]:
# Changing the directory to the project folder

os.chdir(Path(os.path.realpath("")).resolve().parents[1])
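
`parents[1]` selects the directory two levels above the current working directory, presumably the project root that contains the `src` package imported in the next cell. A small sketch with a hypothetical path (the directory names are made up for illustration):

```python
from pathlib import PurePosixPath

# Hypothetical location of the notebook's working directory.
notebook_dir = PurePosixPath("/home/user/project/notebooks/preprocessing")

print(notebook_dir.parents[0])  # /home/user/project/notebooks
print(notebook_dir.parents[1])  # /home/user/project
```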

In [3]:
# Importing modules that fetch raw data
from src.getter.load_application_and_opportunity import get_raw_data
from src.getter.save_application_and_opportunity import save_app_data

# Getting and reading a few lines of data
pdata = get_raw_data()
pdata.head(5)

Unnamed: 0,OpportunityId,ApplicationId,ExternalBriefDescription,ExternalDescription,Title,JobCategoryName,IsRejected,IsCandidateInternal,BehaviorCriteria,MotivationCriteria,...,Educations,LicenseAndCertifications,Skills,Motivations,Behaviors,StepId,StepName,Tag,StepGroup,pass_first_step
0,MbzeABKVn06G8irkoHJeIg==,nTzdqGj020CYqTouPocGSg==,"$16.00 Per Hour\n\nAt Orkin, our purpose is to...",<p><strong>$16.00 Per Hour</strong></p>\n<p><s...,Customer Service Specialist,Customer Service,True,False,[{'Description': 'Capable of carrying out a gi...,[{'Description': 'Inspired to perform well by ...,...,"[{'Degree': 'Some college', 'Description': Non...",[],"[{'ScaleValue': '4', 'ScaleValueName': 'Advanc...",[{'Description': 'Inspired to perform well by ...,[{'Description': 'Devoted to a task or purpose...,K8yQlic+/UiXxBMpOnAoLQ==,Decline,2.0,declined,False
1,7SPt0A57/kyzM9hE9vxDRg==,QVk5MFCZ70WvlZE9FzAW9g==,"$15.00 Per Hour\n\nAt Orkin, our purpose is to...",<p><strong>$15.00 Per Hour</strong></p>\n<p><s...,Customer Service Specialist,Customer Service,True,False,[{'Description': 'Capable of carrying out a gi...,[{'Description': 'Inspired to perform well by ...,...,"[{'Degree': 'Diploma', 'Description': None, 'G...",[],"[{'ScaleValue': '5', 'ScaleValueName': 'Expert...",[],[],K8yQlic+/UiXxBMpOnAoLQ==,Decline,2.0,declined,False
2,7SPt0A57/kyzM9hE9vxDRg==,I1kcPlAw3E+rqceh0qrutQ==,"$15.00 Per Hour\n\nAt Orkin, our purpose is to...",<p><strong>$15.00 Per Hour</strong></p>\n<p><s...,Customer Service Specialist,Customer Service,True,False,[{'Description': 'Capable of carrying out a gi...,[{'Description': 'Inspired to perform well by ...,...,"[{'Degree': 'HIGH SCHOOL DIPLOMA', 'Descriptio...",[],"[{'ScaleValue': '4', 'ScaleValueName': 'Advanc...",[],[],K8yQlic+/UiXxBMpOnAoLQ==,Decline,2.0,declined,False
3,zolSWBFjWESbfkj8AXLYwA==,VTCXZK6/ZUWJDpxTcm2CRg==,"$15.00 Per Hour\n\nAt Orkin, our purpose is to...",<p><strong>$15.00 Per Hour</strong></p>\n<p><s...,Customer Service Specialist,Customer Service,True,False,[{'Description': 'Capable of carrying out a gi...,[{'Description': 'Inspired to perform well by ...,...,"[{'Degree': 'Associate in Early', 'Description...",[],"[{'ScaleValue': '5', 'ScaleValueName': 'Expert...",[],[],K8yQlic+/UiXxBMpOnAoLQ==,Decline,2.0,declined,False
4,zolSWBFjWESbfkj8AXLYwA==,I6KgcL0jdkG8wBnL1+BZ/g==,"$15.00 Per Hour\n\nAt Orkin, our purpose is to...",<p><strong>$15.00 Per Hour</strong></p>\n<p><s...,Customer Service Specialist,Customer Service,True,False,[{'Description': 'Capable of carrying out a gi...,[{'Description': 'Inspired to perform well by ...,...,"[{'Degree': 'Bachelor of Business Admin', 'Des...",[],"[{'ScaleValue': '5', 'ScaleValueName': 'Expert...",[],[],K8yQlic+/UiXxBMpOnAoLQ==,Decline,2.0,declined,False


### 2.1.2 Defining column names so that the preprocessing functions can be applied

In [17]:
# Defining lists that contain the names of the columns for easy access

job_column = [
    'ExternalBriefDescription',
    'ExternalDescription', 
    'Title', 
    'JobCategoryName'
]
uid_column = ['OpportunityId', 'ApplicationId']

# column - StepId has been removed on purpose, will be added later

can_column = [
    'IsCandidateInternal',
    'BehaviorCriteria', 
    'MotivationCriteria',
    'EducationCriteria', 
    'LicenseAndCertificationCriteria', 
    'SkillCriteria', 
    'WorkExperiences', 
    'Educations', 
    'LicenseAndCertifications', 
    'Skills', 
    'Motivations', 
    'Behaviors', 
    'StepName', 
    'Tag', 
    'StepGroup',
    'pass_first_step'
]
 
sel_column = ['IsRejected']

# Defining list of columns based on the type of contents

str_column = [
    'ExternalBriefDescription',
    'ExternalDescription', 
    'Title', 
    'JobCategoryName',
    'BehaviorCriteria', 
    'MotivationCriteria',
    'EducationCriteria', 
    'LicenseAndCertificationCriteria', 
    'SkillCriteria', 
    'WorkExperiences', 
    'Educations', 
    'LicenseAndCertifications', 
    'Skills', 
    'Motivations', 
    'Behaviors',
    'StepId', 
    'StepName', 
    'StepGroup'
]

bool_column = ['IsCandidateInternal', 'pass_first_step']

In [18]:
# Saving data for app
raw_job_data = pdata[['OpportunityId'] + job_column].drop_duplicates()
save_app_data(raw_job_data, 'raw_job_data')
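
Several applications can share one `OpportunityId`, so dropping duplicates over the job columns leaves one row per distinct job posting. A minimal sketch on toy data (the values are made up, and only a subset of the job columns is used):

```python
import pandas as pd

# Toy data: two applications for the same opportunity.
applications = pd.DataFrame({
    "OpportunityId": ["A", "A", "B"],
    "ApplicationId": ["x1", "x2", "x3"],
    "Title": [
        "Customer Service Specialist",
        "Customer Service Specialist",
        "Property Manager",
    ],
})

# Keep only job-level columns, then remove repeated rows.
jobs = applications[["OpportunityId", "Title"]].drop_duplicates()
print(len(applications), len(jobs))  # 3 2
```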

## 2.2 Preprocessing the data - TF-IDF weighted Word2Vec model, BERT and Universal Sentence Encoder

### 2.2.1 Creating functions that perform operations on text for different models

In [19]:
# Defining functions for extracting information from the JSON like objects 
# Preprocessing the extracted information

def dataextractor(data, col_names):
    '''
    Extracts data from the JSON-array-like objects in a column and
    stores the result as a string in a new column named <col_names>__pp.

    Args: 
    data (pandas.DataFrame): Dataset containing the JSON-array-like objects
    col_names (str): Name of the target column whose rows hold the
    JSON-array-like objects on which the operation is performed

    Returns: None
    The DataFrame is modified in place: a column named '<col_names>__pp' is
    added, holding a string of all keys and values extracted from each row.
    '''
    def valuesextractor(cell):
        lst = []
        if isinstance(cell, np.ndarray):        
            for dctnry in cell:
                if not dctnry:
                    return str("")
                
                else:
                    for k, v in dctnry.items():
                        lst.append(k)
                        lst.append(v)
            lst = [str(x) for x in lst]
            return " ".join(lst)
        else:
            return cell

    #Adding '__' for unique identification of data extracted columns

    data[col_names + '__' + 'pp'] = data[col_names].apply(valuesextractor) 
    
    return 


def preprocessing_w2v(ls, stemming = True):
    """
    Preprocesses the text by doing the following: 
    1. Removes HTML and markup tags
    2. Converts text to lower case 
    3. Expands contractions, for example "won't" becomes "will not"
    4. Replaces non-alphanumeric characters with spaces
    5. Tokenizes the text 
    6. Removes stop words 
    7. Applies the Porter stemmer if stemming = True
    
    Args:
    ls (str): Input text for preprocessing; other values are converted
    with str()
    stemming (bool): Flag to enable stemming (default is True)

    Returns: 
    sentence (str): Preprocessed text for further use
    """

    # Taking care of values other than string that may come across
    text = str(ls)
    
    #removing html_text
    text = re.sub(r'<.*?>', '', text)

    #Lowering the case of the tokens 
    text = text.lower()

    # Helper function for expanding contractions
    # Code obtained from https://stackoverflow.com/a/47091490/4084039

    def decontracted(phrase):
        phrase = re.sub(r"won't", "will not", phrase)
        phrase = re.sub(r"can\'t", "can not", phrase)

        # general contractions
        phrase = re.sub(r"n\'t", " not", phrase)
        phrase = re.sub(r"\'re", " are", phrase)
        phrase = re.sub(r"\'s", " is", phrase)
        phrase = re.sub(r"\'d", " would", phrase)
        phrase = re.sub(r"\'ll", " will", phrase)
        phrase = re.sub(r"\'t", " not", phrase)
        phrase = re.sub(r"\'ve", " have", phrase)
        phrase = re.sub(r"\'m", " am", phrase)
        return phrase
    
    #Expanding contractions in the text

    text = decontracted(text)

    #Removing special characters 

    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    
    #Removing stopwords

    tkns = word_tokenize(text)
    stpwrds = set(stopwords.words('english'))
    tkns = [e for e in tkns if e not in stpwrds]

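    #Applying Stemmer when requested, then joining the tokens into a string

    if stemming: 
        stemmer = PorterStemmer()
        tkns = [stemmer.stem(word) for word in tkns]

    # Built outside the if-branch so that stemming = False also returns text
    sentence = ' '.join(tkns)

    return sentence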

def preprocessing_transformermodels(ls):
    """
    Preprocesses the text by doing the following: 
    1. Removes HTML tags 
    2. Replaces newlines and tabs with spaces and collapses repeated spaces
    Letter case is left unchanged.
  
    Args:
    ls (text): Input text for preprocessing
    
    Returns: 
    text (str): Preprocessed text for further use
    """

    # Taking care of values other than string that may come across

    text = str(ls)
    
    #removing html_text

    text = re.sub(r'<.*?>', '', text)

    #replacing newlines and tabs with spaces, then collapsing runs of spaces

    text = re.sub(r'[\n\t]', ' ', text)
    text = re.sub(r' {2,}', ' ', text)

    return text
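
The contraction rules inside `preprocessing_w2v` depend on their order: the irregular forms are rewritten before the generic `n't` rule, otherwise "won't" would become "wo not". The sketch below reproduces a subset of those rules under the hypothetical name `expand_contractions` to show this, along with one side effect worth knowing about: the `'s` rule also rewrites possessives.

```python
import re

def expand_contractions(phrase):
    """Expand a few English contractions; specific rules run before general ones."""
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    return phrase

print(expand_contractions("we won't call"))      # we will not call
print(expand_contractions("it doesn't matter"))  # it does not matter
print(expand_contractions("orkin's purpose"))    # orkin is purpose
```

Since stop words are removed afterwards, the stray "is" from a possessive is usually dropped again, but the behavior matters if the helper is reused elsewhere.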


### 2.2.2 Applying the preprocessing functions to the datasets

In [20]:
# Applying dataextractor function to data from defined columns

for colnames in str_column:
    dataextractor(pdata, colnames)
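
To illustrate what `dataextractor` produces, the sketch below flattens a list of dictionaries into a single string of alternating keys and values. It is a simplified stand-in: `flatten_records` is a hypothetical name, and it accepts plain Python lists instead of the `numpy.ndarray` cells checked in the notebook.

```python
def flatten_records(cell):
    """Join the keys and values of a list of dicts into one space-separated string."""
    if not isinstance(cell, list):
        return cell  # non-list values are passed through unchanged
    parts = []
    for record in cell:
        if not record:
            return ""  # mirrors the notebook: an empty dict yields an empty string
        for key, value in record.items():
            parts.extend([str(key), str(value)])
    return " ".join(parts)

education = [{"Degree": "Diploma", "Description": None}]
print(flatten_records(education))  # Degree Diploma Description None
print(repr(flatten_records([])))   # ''
```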

In [21]:
# Applying preprocessing function to data from defined columns

for x in str_column:
    pdata[x+"__w2vpp"] = pdata[x+"__pp"].apply(preprocessing_w2v)

for x in str_column:
    pdata[x+"__trnsfrmrpp"] = pdata[x+"__pp"].apply(preprocessing_transformermodels)
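
Each source column therefore ends up with three derived columns, distinguished by the suffix after the double underscore. A short sketch of the naming convention follows; the helper `derived_names` is hypothetical and only builds the names, it does not touch the data.

```python
SUFFIXES = ("pp", "w2vpp", "trnsfrmrpp")

def derived_names(column):
    """Return the derived column names for one source column."""
    return [f"{column}__{suffix}" for suffix in SUFFIXES]

print(derived_names("Skills"))
# ['Skills__pp', 'Skills__w2vpp', 'Skills__trnsfrmrpp']

# The double underscore makes it easy to recover the source column name.
print("Skills__w2vpp".split("__")[0])  # Skills
```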

Imputing NaN values in column - 'Tag'

In [22]:
# Imputing NaN values with -1

pdata['Tag'] = pdata['Tag'].fillna(-1)

In [23]:
pdata['Tag']

0         2.0
1         2.0
2         2.0
3         2.0
4         2.0
         ... 
110262   -1.0
110263    2.0
110264    2.0
110265   -1.0
110266   -1.0
Name: Tag, Length: 110267, dtype: float64
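
Because `-1` now stands for a missing tag, later steps should treat it as a sentinel rather than as a real value. A minimal sketch on a toy series (the values are made up) that counts the missing entries before imputing and uses assignment rather than `inplace = True`:

```python
import numpy as np
import pandas as pd

tags = pd.Series([2.0, np.nan, 2.0, np.nan], name="Tag")

n_missing = int(tags.isna().sum())  # rows that will receive the sentinel
tags = tags.fillna(-1)

print(n_missing)      # 2
print(tags.tolist())  # [2.0, -1.0, 2.0, -1.0]
```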

### Exporting data for featurization

In [24]:
# Importing the function that saves interim data
from src.getter.save_application_and_opportunity import save_interim_data

save_interim_data(pdata, "preprocesseddata")

In [25]:
pdata

Unnamed: 0,OpportunityId,ApplicationId,ExternalBriefDescription,ExternalDescription,Title,JobCategoryName,IsRejected,IsCandidateInternal,BehaviorCriteria,MotivationCriteria,...,SkillCriteria__trnsfrmrpp,WorkExperiences__trnsfrmrpp,Educations__trnsfrmrpp,LicenseAndCertifications__trnsfrmrpp,Skills__trnsfrmrpp,Motivations__trnsfrmrpp,Behaviors__trnsfrmrpp,StepId__trnsfrmrpp,StepName__trnsfrmrpp,StepGroup__trnsfrmrpp
0,MbzeABKVn06G8irkoHJeIg==,nTzdqGj020CYqTouPocGSg==,"$16.00 Per Hour\n\nAt Orkin, our purpose is to...",<p><strong>$16.00 Per Hour</strong></p>\n<p><s...,Customer Service Specialist,Customer Service,True,False,[{'Description': 'Capable of carrying out a gi...,[{'Description': 'Inspired to perform well by ...,...,MinimumScaleValue 3 MinimumScaleValueName Inte...,EndMonth None EndYear None JobTitle Call Cente...,Degree Some college Description None Graduatio...,,ScaleValue 4 ScaleValueName Advanced Skill Clo...,Description Inspired to perform well by moneta...,Description Devoted to a task or purpose with ...,K8yQlic+/UiXxBMpOnAoLQ==,Decline,declined
1,7SPt0A57/kyzM9hE9vxDRg==,QVk5MFCZ70WvlZE9FzAW9g==,"$15.00 Per Hour\n\nAt Orkin, our purpose is to...",<p><strong>$15.00 Per Hour</strong></p>\n<p><s...,Customer Service Specialist,Customer Service,True,False,[{'Description': 'Capable of carrying out a gi...,[{'Description': 'Inspired to perform well by ...,...,MinimumScaleValue 3 MinimumScaleValueName Inte...,EndMonth None EndYear None JobTitle Coordinato...,Degree Diploma Description None GraduationMont...,,ScaleValue 5 ScaleValueName Expert Skill Sales...,,,K8yQlic+/UiXxBMpOnAoLQ==,Decline,declined
2,7SPt0A57/kyzM9hE9vxDRg==,I1kcPlAw3E+rqceh0qrutQ==,"$15.00 Per Hour\n\nAt Orkin, our purpose is to...",<p><strong>$15.00 Per Hour</strong></p>\n<p><s...,Customer Service Specialist,Customer Service,True,False,[{'Description': 'Capable of carrying out a gi...,[{'Description': 'Inspired to perform well by ...,...,MinimumScaleValue 3 MinimumScaleValueName Inte...,EndMonth None EndYear None JobTitle Direct Car...,Degree HIGH SCHOOL DIPLOMA Description None Gr...,,ScaleValue 4 ScaleValueName Advanced Skill Cas...,,,K8yQlic+/UiXxBMpOnAoLQ==,Decline,declined
3,zolSWBFjWESbfkj8AXLYwA==,VTCXZK6/ZUWJDpxTcm2CRg==,"$15.00 Per Hour\n\nAt Orkin, our purpose is to...",<p><strong>$15.00 Per Hour</strong></p>\n<p><s...,Customer Service Specialist,Customer Service,True,False,[{'Description': 'Capable of carrying out a gi...,[{'Description': 'Inspired to perform well by ...,...,MinimumScaleValue 3 MinimumScaleValueName Inte...,EndMonth None EndYear 2019.0 JobTitle Package ...,Degree Associate in Early Description None Gra...,,ScaleValue 5 ScaleValueName Expert Skill Cashi...,,,K8yQlic+/UiXxBMpOnAoLQ==,Decline,declined
4,zolSWBFjWESbfkj8AXLYwA==,I6KgcL0jdkG8wBnL1+BZ/g==,"$15.00 Per Hour\n\nAt Orkin, our purpose is to...",<p><strong>$15.00 Per Hour</strong></p>\n<p><s...,Customer Service Specialist,Customer Service,True,False,[{'Description': 'Capable of carrying out a gi...,[{'Description': 'Inspired to perform well by ...,...,MinimumScaleValue 3 MinimumScaleValueName Inte...,EndMonth None EndYear None JobTitle Warehouse ...,Degree Bachelor of Business Admin Description ...,,ScaleValue 5 ScaleValueName Expert Skill Forkl...,,,K8yQlic+/UiXxBMpOnAoLQ==,Decline,declined
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110262,PNL1Y3XJ50Whs9uchKxtyA==,URO0MqU14Eys5V5BrxL0qQ==,Berkshire is a national management company tha...,<p><strong><em>Team &ldquo;IT FACTOR&rdquo;</e...,Property Manager,Property Management,True,False,[{'Description': 'Shows intense and eager enjo...,[{'Description': 'Inspired to perform well by ...,...,MinimumScaleValue 4 MinimumScaleValueName Adva...,EndMonth 10.0 EndYear 2018.0 JobTitle Property...,Degree Certification Description None Graduati...,,ScaleValue 5 ScaleValueName Expert Skill Marke...,,,BGEezloW4EGuGdzEobmLsw==,Being Reviewed,other
110263,rWGGQMet/0GjOg8+J2MD0w==,8XKm6D+2KkiPHO3XTUXkcA==,Berkshire Communities is a national management...,<p><strong><em>Team &ldquo;IT FACTOR&rdquo;</e...,Property Manager,Property Management,True,False,[{'Description': 'Shows intense and eager enjo...,[{'Description': 'Inspired to perform well by ...,...,MinimumScaleValue 4 MinimumScaleValueName Adva...,EndMonth None EndYear None JobTitle Leasing Ma...,Degree Core Business Courses Description None ...,,ScaleValue 5 ScaleValueName Expert Skill Credi...,,,S69pj47jM0mAb/Ne6i6goA==,Decline,declined
110264,rWGGQMet/0GjOg8+J2MD0w==,dYTPtlNw+UChHEl50M5Xag==,Berkshire Communities is a national management...,<p><strong><em>Team &ldquo;IT FACTOR&rdquo;</e...,Property Manager,Property Management,True,False,[{'Description': 'Shows intense and eager enjo...,[{'Description': 'Inspired to perform well by ...,...,MinimumScaleValue 4 MinimumScaleValueName Adva...,EndMonth None EndYear None JobTitle SELF EMPLO...,Degree High School Diploma Description None Gr...,,ScaleValue 3 ScaleValueName Intermediate Skill...,Description Inspired to perform well by the ab...,Description Devoted to a task or purpose with ...,S69pj47jM0mAb/Ne6i6goA==,Decline,declined
110265,xNgzKlWSDkCoWpgBGmXmrA==,9cBFQH7OGEO/kHUFL9d2DA==,Berkshire is a national management company tha...,<p><strong><em>Team &ldquo;IT FACTOR&rdquo;</e...,Property Manager - Future Acquisition Opportunity,Property Management,False,False,[{'Description': 'Shows intense and eager enjo...,[{'Description': 'Inspired to perform well by ...,...,MinimumScaleValue 2 MinimumScaleValueName Some...,EndMonth None EndYear None JobTitle Account Ex...,Degree HIGH SCHOOL DIPLOMA Description None Gr...,,ScaleValue 5 ScaleValueName Expert Skill Sales...,,,wyObdVyxxEOkwTIQsnAS3A==,Under Qualified,declined


In [26]:
(pdata.LicenseAndCertificationCriteria__pp[0])

''