<a href="https://colab.research.google.com/github/ysavine/job-requirements/blob/main/job_requirements.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Job Requirements


*Task: given a job description/posting, extract the skills and requirements that an ideal candidate would have*



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('data_science_jobs_indeed_usa.csv')
print(df.columns)

Index(['Unnamed: 0', 'Title', 'Company', 'Location', 'Rating', 'Date',
       'Salary', 'Description', 'Links', 'Descriptions'],
      dtype='object')


## **Data Pre-Processing (Reducing Number of Columns)**




In [None]:
# Define a function to merge text from multiple columns
def merge_columns(row):
    row = [str(val) for val in row if isinstance(val, str)]
    return ' '.join(row)

# Specify the columns to merge
columns_to_merge = ['Description', 'Descriptions']

# Create a new column by applying the merge function to the specified columns
df['Job_Description'] = df[columns_to_merge].apply(merge_columns, axis=1)

df = df.drop(columns=['Unnamed: 0','Links', 'Rating', 'Date', 'Description', 'Descriptions', 'Salary', 'Location'])

# Display the DataFrame with the new merged column
print("\nAfter merging:")
print(df.columns)


After merging:
Index(['Title', 'Company', 'Job_Description'], dtype='object')


## **Data Pre-Processing (Text Data)**

In [None]:
!pip install contractions


Collecting contractions
  Using cached contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Using cached textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Using cached anyascii-0.3.2-py3-none-any.whl (289 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Using cached pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.1.0 textsearch-0.0.24


In [None]:
import nltk
import re
import string
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import contractions
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Lowercasing

In [None]:
# Lowercase all columns
df_lower = df.apply(lambda x : x.astype(str).str.lower())

White space

In [None]:
# Clean spacing with regex
def clean_spacing(text):
    cleaned_text = re.sub(r'\s+', ' ', text) # remove extra whitespaces (including tabs and newlines) between words
    cleaned_text = cleaned_text.strip() # remove leading and trailing whitespaces
    return cleaned_text

df_clean = df_lower.applymap(clean_spacing)

In [None]:
df_clean

Unnamed: 0,Title,Company,Job_Description
0,data scientist,driven brands,you’ll be working alongside a team of eight an...
1,business analyst,sabot consulting,preferred candidates will have prior experienc...
2,it business intelligence developer (ft) remote...,ballad health,job details apply save print this job email a…...
3,data engineer,longevity holdings inc.,incorporate core data management competencies ...
4,network administrator/dba developer,wki kenworth,the network administrator provides 2nd level e...
...,...,...,...
1642,senior machine learning engineer,hyperscience,we turn ml lab experiments into enterprise-rea...
1643,data scientist,n9 it solutions,experience using a variety of data mining/data...
1644,data engineer,marathon ts,"3 years working with databases, data modeling,..."
1645,"director of data science and engineering, rv t...",red ventures,you will be responsible for overseeing the tea...


Remove Non-English Characters for Consistency

In [None]:
# Function that removes non-ASCII characters (used here as a proxy for non-English text)

def remove_non_english(text):
    tokens = word_tokenize(text) # tokenize the text into words
    non_ascii_pattern = re.compile(r'[^\x00-\x7F]+') # regex matching runs of non-ASCII characters
    cleaned_tokens = [non_ascii_pattern.sub('', token) for token in tokens] # strip non-ASCII characters from each token
    cleaned_text = ' '.join(cleaned_tokens) # reconstruct the text from the list of cleaned tokens
    return cleaned_text

# Apply to df_clean (not df_lower) so the whitespace cleanup above is carried forward
df_non_eng = df_clean.applymap(remove_non_english)
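
Deleting every non-ASCII character means accented letters and typographic apostrophes vanish rather than being converted. The sketch below shows one gentler alternative, assuming it is acceptable to fold accents to their base letters. `to_ascii` and the `TYPOGRAPHIC` mapping are hypothetical helpers and are not part of the pipeline above.

```python
import re
import unicodedata

# Hypothetical mapping of common typographic punctuation to ASCII equivalents
TYPOGRAPHIC = {"\u2019": "'", "\u2018": "'", "\u201c": '"', "\u201d": '"', "\u2013": "-"}

def to_ascii(text):
    """Fold text to ASCII while keeping base letters where possible."""
    for src, dst in TYPOGRAPHIC.items():
        text = text.replace(src, dst)
    # NFKD splits an accented letter into its base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the combining marks and any other non-ASCII character
    return decomposed.encode("ascii", "ignore").decode("ascii")

sample = "you\u2019ll review r\u00e9sum\u00e9s"
deleted = re.sub(r"[^\x00-\x7F]+", "", sample)  # plain deletion, as in remove_non_english
folded = to_ascii(sample)
print(repr(deleted))
print(repr(folded))
```

Whether folding is preferable depends on the data: it keeps words such as "resume" intact, but it also keeps apostrophes, which the later punctuation step then has to handle.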

Punctuation and Special Characters

Things to consider:


*   Handling contractions
*   Stemming and lemmatization



In [None]:
# Function that removes special characters and punctuation
def remove_punc(text):
    tokens = word_tokenize(text) # tokenize text into words
    table = str.maketrans('', '', string.punctuation) # translation table that deletes every ASCII punctuation character
    stripped = [word.translate(table) for word in tokens] # apply the table to each token
    text_without_punctuation = ' '.join(stripped) # rejoin text
    return text_without_punctuation

df_remove_punc = df_non_eng.applymap(remove_punc)

Removing Numbers

What types of numbers exist in the dataset?
*   Years of experience
*   Zipcodes
*   etc







In [None]:
def remove_numbers(text):
    return re.sub(r'\d+', '', str(text))

df_remove_num = df_remove_punc.applymap(remove_numbers)
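
Stripping every digit also discards the years-of-experience figures listed above, which are arguably part of a job's requirements. A minimal sketch of capturing them before the digits are removed follows. `YEARS_PATTERN` and `extract_years` are hypothetical names, and the pattern only covers simple phrasings such as "3 years" or "5+ yrs".

```python
import re

# Hypothetical pattern: a number, an optional "+", then "year(s)" or "yr(s)"
YEARS_PATTERN = re.compile(r"(\d+)\s*\+?\s*(?:years?|yrs?)\b")

def extract_years(text):
    """Return every years-of-experience figure mentioned in the text as an int."""
    return [int(n) for n in YEARS_PATTERN.findall(text.lower())]

def remove_numbers(text):
    # Same behavior as the notebook's remove_numbers
    return re.sub(r"\d+", "", str(text))

desc = "3 years working with databases, 5+ yrs of sql, office in 07065"
years = extract_years(desc)     # capture the requirement first
cleaned = remove_numbers(desc)  # then strip digits as the notebook does
print(years)
print(cleaned)
```

The extracted values could be stored in a separate column so the text cleaning itself stays unchanged.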

In [None]:
df_remove_num

Unnamed: 0,Title,Company,Job_Description
0,data scientist,driven brands,you ll be working alongside a team of eight an...
1,business analyst,sabot consulting,preferred candidates will have prior experienc...
2,it business intelligence developer ft remote...,ballad health,job details apply save print this job email a ...
3,data engineer,longevity holdings inc,incorporate core data management competencies ...
4,network administratordba developer,wki kenworth,the network administrator provides nd level en...
...,...,...,...
1642,senior machine learning engineer,hyperscience,we turn ml lab experiments into enterpriseread...
1643,data scientist,n it solutions,experience using a variety of data miningdata ...
1644,data engineer,marathon ts,years working with databases data modeling ...
1645,director of data science and engineering rv t...,red ventures,you will be responsible for overseeing the tea...


**Removing Stopwords**


In [None]:
nltk.download('stopwords')

# Build the stopword set once instead of on every call
english_stopwords = set(stopwords.words('english'))

# Define stopwords function
def remove_stopwords(text):
    words = re.findall(r'\b\w+\b', text.lower())
    filtered_words = [word for word in words if word not in english_stopwords]
    return ' '.join(filtered_words)

# Remove stopwords in df
df_stop = df_remove_num.applymap(remove_stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Potential Things to Remove in Description

Reading through the dataset, we found the following words, which appear in nearly every job description and therefore carry little meaning:
*   "job", "detail", "description", "candidates", "preferred", "position", "overview", "summary"



Additional recommended stopwords are taken from this Kaggle notebook: https://www.kaggle.com/code/sanabdriss/nlp-extract-skills-from-job-descriptions/notebook

In [None]:
# Compile list of additional stop words (combo of Kaggle notebook by sanabdriss and ours)
other_stopwords = ['junior', 'senior','experience','etc','job', 'work','company','technique',
                    'candidate','skill','skills','language','menu','inc','new','plus','years',
                   'technology','organization','ceo','cto','account','manager','data','scientist','mobile',
                    'developer','product','revenue','strong', 'description', 'detail', 'preferred', 'position',
                    'overview', 'summary', 'candidates', 'apply']

# Define additional stopwords function
def replace_additional_stopwords(text, additional_stopwords):
    pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, additional_stopwords)))
    return re.sub(pattern, '', text)

# Remove additional stop words
df_adstop = df_stop.applymap(lambda text: replace_additional_stopwords(text, other_stopwords))
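
The `\b` word boundaries in the pattern matter: they remove "data" as a standalone word but leave "database" alone, and the removed words leave their surrounding spaces behind. A small illustration, using a two-word subset of `other_stopwords` chosen only for this example:

```python
import re

def replace_additional_stopwords(text, additional_stopwords):
    # Same function as in the notebook
    pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, additional_stopwords)))
    return re.sub(pattern, '', text)

sample_stopwords = ['data', 'senior']  # illustrative subset, not the full list

results = {
    title: replace_additional_stopwords(title, sample_stopwords)
    for title in ['senior data engineer', 'database administrator']
}
for title, cleaned in results.items():
    print(repr(title), '->', repr(cleaned))
```

This is why titles such as "data engineer" collapse to "engineer" while "database administrator" survives intact.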

In [None]:
df_adstop

Unnamed: 0,Title,Company,Job_Description
0,,driven brands,working alongside team eight analysts scienti...
1,business analyst,sabot consulting,prior implementing cloudhosted business pro...
2,business intelligence ft remote,ballad health,details save print email details save pri...
3,engineer,longevity holdings,incorporate core management competencies incl...
4,network administratordba,wki kenworth,network administrator provides nd level enduse...
...,...,...,...
1642,machine learning engineer,hyperscience,turn ml lab experiments enterpriseready ai sol...
1643,,n solutions,using variety miningdata analysis methods us...
1644,engineer,marathon ts,working databases modeling management cura...
1645,director science engineering rv travel,red ventures,responsible overseeing team owns best practic...


Lemmatization & Tokenization



In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download NLTK resources
nltk.download('punkt')
nltk.download('wordnet')

# lemmatize function
def lemmatize(text):
    words = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

df_lemmatized = df_adstop.applymap(lemmatize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


Check for Null Values

In [None]:
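# Note: the earlier astype(str) cast turns missing values into the literal string 'nan',
# so isnull() reports False for every column at this stage
has_null = df_clean.isnull().any()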
print(has_null)

Title              False
Company            False
Job_Description    False
dtype: bool


Expand Contractions

In [None]:
def contract(text):
  expanded_text = []
  for word in text.split():
      expanded_text.append(contractions.fix(word))
  return " ".join(expanded_text)

df_contract = df_lemmatized.applymap(contract)
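
The order of steps matters for contractions: once punctuation has been removed, "you'll" has already become "youll" and can no longer be expanded. The sketch below illustrates this with a tiny hand-written lookup table. `CONTRACTIONS` and `expand` are hypothetical stand-ins used only so the example runs without extra packages; they are not the `contractions` library.

```python
import re
import string

# Hypothetical mini lookup table standing in for a full contractions library
CONTRACTIONS = {"you'll": "you will", "don't": "do not", "it's": "it is"}
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")

def expand(text):
    """Replace each known contraction with its expanded form."""
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0)], text)

def strip_punct(text):
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans('', '', string.punctuation))

text = "you'll thrive if you don't mind ambiguity"
expand_first = strip_punct(expand(text))  # expand, then remove punctuation
strip_first = expand(strip_punct(text))   # remove punctuation, then try to expand
print(expand_first)
print(strip_first)
```

Under this assumption, running the expansion right after lowercasing would be the more effective position for it.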

Remove "remote" and "hybrid" from job title

In [None]:
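# regex=False treats the search term as a literal string regardless of pandas version
df_contract['Title'] = df_contract['Title'].str.replace('remote', '', regex=False)
df_contract['Title'] = df_contract['Title'].str.replace('hybrid', '', regex=False)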
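
Removing a word from a title leaves the neighboring space in place, so "analytics remote" becomes "analytics " and is later treated as a different class from "analytics" (both appear in the class list printed during KNN evaluation). A minimal sketch of tidying the whitespace afterwards, using plain strings; `drop_work_mode` is a hypothetical helper:

```python
import re

def drop_work_mode(title, terms=('remote', 'hybrid')):
    """Remove work-mode words, then tidy the whitespace they leave behind."""
    for term in terms:
        title = title.replace(term, '')
    # Collapse runs of whitespace and trim both ends
    return re.sub(r'\s+', ' ', title).strip()

titles = ['analytics remote', 'analytics', 'remote enterprise intelligence']
tidied = [drop_work_mode(t) for t in titles]
print(tidied)
```

On a DataFrame column, the same effect could be obtained by chaining `.str.replace(r'\s+', ' ', regex=True).str.strip()` after the removals.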

Working with Job Categories - merging categories

Check for Most Common Job Titles

In [None]:
df_contract = df_contract[df_contract['Title'] != ""] # Get rid of the job entries that are empty strings
df_contract = df_contract[df_contract['Title'] != "remote"] # Get rid of the job entries that only say "remote" - not a title
df_contract = df_contract[df_contract['Title'] != "hybrid"] # Get rid of the job entries that only say "hybrid" - not a title
df_contract = df_contract[df_contract['Title'] != "varying level"]
df_contract = df_contract[df_contract['Title'].str.len() >= 3] # Get rid of job entries of len < 3

In [None]:
top_entries = df_contract['Title'].value_counts().head(50)

In [None]:
print(top_entries)

Title
engineer                                                                                133
business analyst                                                                        118
machine learning engineer                                                               105
database administrator                                                                   74
business intelligence                                                                    73
science                                                                                  51
director science                                                                         32
ai ml                                                                                    15
business intelligence analyst                                                            15
software engineer                                                                        13
head science                                                              

In [None]:
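def replace_substring(df, column, strings_to_check, replacement, exclude=None):
    # Select rows whose value contains any of the given substrings (joined as regex alternatives)
    mask = df[column].str.contains('|'.join(strings_to_check))
    # Optionally skip rows that also contain one of the excluded substrings
    for ex in (exclude or []):
        mask &= ~df[column].str.contains(ex)
    # Overwrite the whole title with the merged category name
    df.loc[mask, column] = replacement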

# Machine Learning & AI
replace_substring(df_contract, 'Title', ['machine learn'], 'machine learning')
replace_substring(df_contract, 'Title', ['deep learn'], 'deep learning')
replace_substring(df_contract, 'Title', ['ai'], 'ai')
replace_substring(df_contract, 'Title', ['nlp'], 'nlp')
replace_substring(df_contract, 'Title', ['computer vision'], 'computer vision')

# Business
replace_substring(df_contract, 'Title', ['business anal'], 'business analyst')
replace_substring(df_contract, 'Title', ['business intelligence'], 'business intelligence')

# Data
replace_substring(df_contract, 'Title', ['database admin'], 'database administrator')
replace_substring(df_contract, 'Title', ['data anal', 'data eng', 'database spec'], 'data analyst')
replace_substring(df_contract, 'Title', ['database anal'], 'database analyst')

# Software / Other Tech
replace_substring(df_contract, 'Title', ['software'], 'software engineer')
replace_substring(df_contract, 'Title', ['azure'], 'azure engineer')
replace_substring(df_contract, 'Title', ['web'], 'web')
replace_substring(df_contract, 'Title', ['frontend', 'front end', 'front-end'], 'front end')

# Fin/Econ
replace_substring(df_contract, 'Title', ['economet'], 'econometrics')
replace_substring(df_contract, 'Title', ['fin'], 'finance')

# Misc
replace_substring(df_contract, 'Title', ['engineering dir', 'director sci', 'head science', 'science director', 'sc director', 'principal scientist'], 'sr science')
replace_substring(df_contract, 'Title', ['director anal'], 'director analytics')

replace_substring(df_contract, 'Title', ['logist'], 'logistics')

replace_substring(df_contract, 'Title', ['recruit'], 'recruiter')
replace_substring(df_contract, 'Title', ['marketing'], 'marketing')

replace_substring(df_contract, 'Title', ['computational biology'], 'computational biology')
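
Because `str.contains` matches anywhere inside the string, short keys such as 'ai' or 'fin' can also hit unrelated titles. The sketch below uses plain `re` on a few made-up titles to show the difference between a bare substring and a word-bounded pattern. The titles are illustrative and not taken from the dataset.

```python
import re

# Made-up titles for illustration only
titles = ["ai engineer", "retail analyst", "supply chain", "financial analyst"]

def matching(pattern, candidates):
    """Return the candidates in which the regex pattern is found."""
    return [t for t in candidates if re.search(pattern, t)]

loose = matching("ai", titles)        # substring match, as in replace_substring
strict = matching(r"\bai\b", titles)  # only the standalone word "ai"
print(loose)
print(strict)
```

Passing `r'\bai\b'` instead of `'ai'` to `replace_substring` would apply the stricter rule, at the cost of changing the category counts reported below.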

In [None]:
top_entries = df_contract['Title'].value_counts()

In [None]:
top_entries = df_contract['Title'].value_counts().head(50)

In [None]:
print(top_entries)

Title
machine learning                                          204
business analyst                                          187
database administrator                                    176
business intelligence                                     171
engineer                                                  133
sr science                                                120
science                                                    51
ai                                                         45
software engineer                                          30
director analytics                                         17
finance                                                    11
azure engineer                                             10
director engineering                                        9
marketing                                                   9
front end                                                   8
data analyst                                                8
en

## **Modeling & Evaluation**

Multinomial NB Modeling

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

## Converting text to features
vectorizer = TfidfVectorizer()

# Tokenize and build vocabulary
X = vectorizer.fit_transform(df_contract.Job_Description)
y = df_contract.Title

# Split data into 80% training and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=42)
print("train data shape: ",X_train.shape)
print("test data shape: ",X_test.shape)

# Fit model
clf = MultinomialNB()
clf.fit(X_train, y_train)

## Predict
y_predicted = clf.predict(X_test)

train data shape:  (1222, 14518)
test data shape:  (306, 14518)
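
In the cell above the vectorizer is fitted on every description before the split, so the vocabulary and IDF weights also reflect the test rows. One common way to avoid this is to split the raw text first and let a pipeline fit TF-IDF on the training portion only. The following is a sketch under that assumption; the toy `texts` and `labels` stand in for `df_contract.Job_Description` and `df_contract.Title`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the job descriptions and titles
texts = [
    "train neural network model pytorch", "deploy model inference gpu",
    "tune model hyperparameters tensorflow", "build training pipeline model",
    "gather stakeholder requirements report", "document business process workflow",
    "present dashboard stakeholder meeting", "analyze business requirements budget",
]
labels = ["machine learning"] * 4 + ["business analyst"] * 4

# Split the raw text first so test descriptions never influence the vocabulary
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then reuses it for the test text
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train_txt, y_train)
y_predicted = model.predict(X_test_txt)
print(list(zip(y_test, y_predicted)))
```

With this arrangement the feature count will generally differ from the shapes printed above, since words that occur only in the test descriptions are no longer part of the vocabulary.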


Multinomial NB Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score

# Evaluate the predictions
print("F1-score:", f1_score(y_test, y_predicted, average='weighted'))
print("Accuracy score is: ",accuracy_score(y_test, y_predicted))
print("Classes: (to help read Confusion Matrix)\n", clf.classes_)
print("Confusion Matrix: ")

print(confusion_matrix(y_test, y_predicted))
print("Classification Report: ", classification_report(y_test, y_predicted))


NameError: name 'y_test' is not defined

KNN Modeling

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

## Converting text to features
vectorizer = TfidfVectorizer()

# Tokenize and build vocabulary
X = vectorizer.fit_transform(df_contract.Job_Description)
y = df_contract.Title

# Split data into 80% training and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("train data shape: ", X_train.shape)
print("test data shape: ", X_test.shape)

# Fit model
clf = KNeighborsClassifier(n_neighbors=7, weights='distance') # best hyperparameters found by GridSearchCV (see Hyperparameter Tuning)
clf.fit(X_train, y_train)

## Predict
y_predicted = clf.predict(X_test)



train data shape:  (1222, 14518)
test data shape:  (306, 14518)


In [None]:
# Looking at results: print each misclassified test example with the terms in its description
X_test_text = vectorizer.inverse_transform(X_test)
for true_label, predicted_label, desc in zip(y_test, y_predicted, X_test_text):
    if true_label != predicted_label:
        print("True: " + true_label + ", Predicted: " + predicted_label)
        print("Description: " + ', '.join(desc) + "\n")


True: director analytics, Predicted: sr science
Description: unutilized, white, intensively, productize, translation, forecasted, learningartificial, hazardous, introduction, preexisting, sole, deemed, dohme, sharp, chosen, revoke, notwithstanding, revoked, ineligible, dialog, undue, causing, researchbased, rico, puerto, hrbp, spd, forth, usand, battling, countless, genius, morning, threaten, msd, rahway, merck, ideology, later, profit, necessitating, multistakeholder, organizeprioritize, confidently, pharmaceuticalhealthcare, industrymarket, algorithmsmodeling, scientificcommercial, inventing, twelve, innovating, routinely, treatment, customerfacing, augmenting, else, intake, optimally, placement, proactivity, accommodate, continues, thoroughly, exhibit, burden, valid, property, ambiguity, resourcing, head, llc, paper, deal, pool, pursue, physician, estimate, fee, unsolicited, comprised, mastery, seniority, matrix, hired, accept, determines, nj, jersey, opening, everywhere, division, 

KNN Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score

# Evaluate the predictions
print("F1-score:", f1_score(y_test, y_predicted, average='weighted'))
print("Accuracy score is: ",accuracy_score(y_test, y_predicted))
print("Classes: (to help read Confusion Matrix)\n", clf.classes_)
print("Confusion Matrix: ")

print(confusion_matrix(y_test, y_predicted))
print("Classification Report: ", classification_report(y_test, y_predicted))

F1-score: 0.7918738193512488
Accuracy score is:  0.8202614379084967
Classes: (to help read Confusion Matrix)
 [' enterprise intelligence' ' looking yea engineer' ' scienceoptimization'
 'actuarial' 'administrator database' 'advance analytics'
 'agile project enterprise analytics' 'ai' 'analyst' 'analytics'
 'analytics ' 'analytics ii' 'analytics pricing science'
 'analytics service director' 'architecture health analytics'
 'assistant director eam science'
 'assistant director undergraduate science study'
 'assistant integration customer feedback management' 'associate'
 'associate directordirector statistical programming' 'aws engineer'
 'azure engineer' 'big engineer' 'big engineer pyspark' 'biological'
 'business analyst' 'business intelligence' 'business system analyst'
 'campus graduate decision science full time'
 'cdna science omnichannel orchestration' 'center engineer'
 'climate science' 'computational biology' 'computational genetics'
 'consultant life science ' 'coop spring'

  _warn_prf(average, modifier, msg_start, len(result))
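
The `_warn_prf` line is most likely the tail of scikit-learn's `UndefinedMetricWarning`, which is typically raised when a class in the test labels is never predicted, so its precision is undefined. With this many rare titles that is expected. A small sketch, on made-up labels, of making the handling explicit with the `zero_division` argument:

```python
from sklearn.metrics import classification_report, f1_score

# Made-up labels: 'finance' occurs in y_true but is never predicted
y_true = ['ai', 'ai', 'engineer', 'finance']
y_pred = ['ai', 'ai', 'engineer', 'engineer']

# zero_division=0 scores the undefined precision as 0 without emitting a warning
score = f1_score(y_true, y_pred, average='weighted', zero_division=0)
print("F1-score:", score)
print(classification_report(y_true, y_pred, zero_division=0))
```

This only silences the warning; the underlying cause is the long tail of titles that appear once or twice, which merging or dropping rare categories would address more directly.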


## **Feature Extraction**

### **Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters grid for KNN
knn_param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}

# Define hyperparameters grid for Naive Bayes
nb_param_grid = {'alpha': np.linspace(0.1, 1.0, 10)}

# Initialize classifiers
knn = KNeighborsClassifier()
nb = MultinomialNB()

# Initialize grid search
knn_grid_search = GridSearchCV(knn, knn_param_grid, scoring='accuracy')
nb_grid_search = GridSearchCV(nb, nb_param_grid, scoring='accuracy')

# Fit grid search on training data
knn_grid_search.fit(X_train, y_train)
nb_grid_search.fit(X_train, y_train)

# Get best hyperparameters and scores
# (best_score_ is the mean cross-validated accuracy, not accuracy on the full training set)
knn_best_params = knn_grid_search.best_params_
knn_best_score = knn_grid_search.best_score_
knn_test_score = knn_grid_search.best_estimator_.score(X_test, y_test)

nb_best_params = nb_grid_search.best_params_
nb_best_score = nb_grid_search.best_score_
nb_test_score = nb_grid_search.best_estimator_.score(X_test, y_test)

print("KNN best parameters:", knn_best_params)
print("KNN best accuracy on training data:", knn_best_score)
print("KNN accuracy on test data:", knn_test_score)

print("Naive Bayes best parameters:", nb_best_params)
print("Naive Bayes best accuracy on training data:", nb_best_score)
print("Naive Bayes accuracy on test data:", nb_test_score)



KNN best parameters: {'n_neighbors': 7, 'weights': 'distance'}
KNN best accuracy on training data: 0.7479257276681165
KNN accuracy on test data: 0.8202614379084967
Naive Bayes best parameters: {'alpha': 0.1}
Naive Bayes best accuracy on training data: 0.6104683840749414
Naive Bayes accuracy on test data: 0.6078431372549019
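
The grid search above tunes only the classifier on features that were vectorized once, up front. A possible refinement, sketched here on toy data, is to search over a `Pipeline` so that TF-IDF is refitted inside each cross-validation fold. The `texts`, `labels`, step names, and the reduced `cv=2` are assumptions made for this small example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the job descriptions and titles
texts = [
    "train neural network model pytorch", "deploy model inference gpu",
    "tune model hyperparameters tensorflow", "build training pipeline model",
    "gather stakeholder requirements report", "document business process workflow",
    "present dashboard stakeholder meeting", "analyze business requirements budget",
]
labels = ["machine learning"] * 4 + ["business analyst"] * 4

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
alphas = [0.1, 0.5, 1.0]
param_grid = {"nb__alpha": alphas}

# cv=2 only because the toy data set is tiny
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=2)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

The same pattern works for KNN by swapping the final step and using keys such as `knn__n_neighbors` for a step named `knn`.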
