# Job Jargon Identification

In this project, I analyzed job descriptions from a number of different fields. The thought is that these job descriptions might contain both jargon word and phrases.

The challenge here will be to analyze the text of the included job descriptions, but to also compare the words and phrases there with a reference set. In this case, I used Reuters news articles as a background corpus to compare all possible jargon text with.

I first read in the text of the job descriptions and then tokenized them. I then took the tokens and compared them to the Reuters as both individual tokens and also as bigrams.

This project aims for just term differences, so I reported back the tokens that are only in the job descriptions.

The code has been built around using NLTK.

In [1]:
# import necessary libraries for using NLTK

import nltk.data
from os import listdir
from os.path import isfile, join
from nltk.util import bigrams 
from nltk.tokenize import TreebankWordTokenizer
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
treebank_tokenizer = TreebankWordTokenizer()

## Section I: Pre-process job description text 

### i) Read job description from repositories using os 

In [2]:
#dir_base = "/Users/teacher/repos/s20_ds_nlp/homeworks/homework_1/data/"
dir_base = "/Users/Winnie/Documents/2020 Spring/NLP/homework_1/data/"

def read_file(filename):
    input_file_text = open(filename , encoding='utf-8').read()
    return input_file_text

    
def read_directory_files(directory):
    file_texts = []
    files = [f for f in listdir(directory) if isfile(join(directory, f))]
    for f in files:
        file_text = read_file(join(directory, f))
        #print(file_text)
        file_texts.append({"file":f, "content": file_text })
    return file_texts
    
# generate the list that contains all the files and their contents
text_corpus = read_directory_files(dir_base)
print(text_corpus[0])

{'file': 'r1', 'content': "Dominion Engineering, Inc. (DEI; domeng.com) is a small (~40-person) company that supports the commercial energy industry in the US and abroad with technology, laboratory R&D testing, and consulting. The working environment at DEI is close-knit and professional, but not overly formal. Typical project teams are 2-3 persons working together and in collaboration with the Customer. Newer employees work under the general mentorship of more senior engineers, while still maintaining fairly autonomous roles, roles that may evolve over time to meet emergent needs.\n\nOne of DEI’s areas of expertise is degradation of nuclear power plant materials. This position would provide engineering analysis support to DEI project managers and subject matter experts for materials degradation projects and may also be called upon to provide support to other emergent DEI projects. Engineering analysis areas will include corrosion evaluation, fracture mechanics, and microstructural cha

### ii) Clean text data 

In [3]:
# define a function to clean the text data

def process_description(job_description_object):
    job_description = job_description_object["content"]
    #print(job_description)
    
    # take the job description text, and tokenize it
    from nltk.tokenize import word_tokenize
    tokens = word_tokenize(job_description)
    
    # convert to lower case
    tokens = [word.lower() for word in tokens]
    
    # remove punctualition from each word
    import string
    table = str.maketrans('', '', string.punctuation)
    stripped = [word.translate(table) for word in tokens]
    
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in stripped if word.isalpha()]
    
    # filter out stop words
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if not word in stop_words]
    
    # remove numbers
    tokens = [word for word in tokens if not word.isdigit()]
    print(tokens[0])  
    
    return tokens 


### iii) Generate single-word tokens 

In [4]:
# This loop will simply apply the text-preprocessing method to all the job descriptions (single-word tokens)
all_job_description_words = []
for job_description in text_corpus:
    all_job_description_words.extend(process_description(job_description))

dominion
come
ana
driven
overview
general
unique
navaide
materials
inspired
gcssigal
love
sourcing
job
new


### iv) Generate bigrams 

In [5]:
# This loop will simply apply the text preprocessing method to all the job descriptions and generate bigrams tokens
all_job_description_bigrams = []
for job_description in text_corpus:
    all_job_description_bigrams.extend(nltk.bigrams(process_description(job_description)))

dominion
come
ana
driven
overview
general
unique
navaide
materials
inspired
gcssigal
love
sourcing
job
new


## Section II: Pre-process Reuters Corpus 

In [6]:
# This will be the corpus I work from
from nltk.corpus import reuters

### i) Clean Reuters text using the same pre-processing method for the job description text and generate single-word tokens and bigrams

In [7]:
num_docs = len(reuters.fileids())

all_reuters_words = []
all_reuters_bigrams = []

# for doc_id in reuters.fileids()[0:25]:  <---- When practising, use this; this will only iterate over the first 25 documents
for doc_id in reuters.fileids(): 
    # this doc_text variable will give a text version of the news article. This could be tokenized.
    reuters_text = reuters.open(doc_id).read()
    #print(reuters_text)
    
    # run the same job description processing method
    # tokenize the text
    from nltk.tokenize import word_tokenize
    reuters_tokens = word_tokenize(reuters_text)
    # convert to lower case
    reuters_tokens = [word.lower() for word in reuters_tokens]
    # remove punctuation from each word
    import string
    table = str.maketrans('', '', string.punctuation)
    reuters_stripped = [word.translate(table) for word in reuters_tokens]
    # remove remaining tokens that are not alphabetic
    reuters_tokens = [word for word in reuters_stripped if word.isalpha()]
    # filter out stop words
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in reuters_tokens if not word in stop_words]
    # remove numbers
    reuters_tokens = [word for word in reuters_tokens if not word.isdigit()]
    
    #print(reuters_tokens[0])
    
    reuters_bigrams = nltk.bigrams(reuters_tokens)
    
    # add the output to the all_reuters_words list
    all_reuters_words.extend(reuters_tokens)
    all_reuters_bigrams.extend(reuters_bigrams)

## Section III: Compare two corpus 

### i) Identify single-word jargon 

In [8]:
# find ways to compare the words in the job descriptions and the reuters text
# using Python's set capabilities to intersect things

# Token jargon
a = set(all_job_description_words)
b = set(all_reuters_words)
jargon = a.difference(b)
print(list(jargon)[:10])


['surfaces', 'ordersmonthly', 'http', 'answered', 'rfi', 'prds', 'suitability', 'interacting', 'familiarity', 'toolsetslanguages']


### ii) Identify bigrams jargon 

In [9]:
c = set(all_job_description_bigrams)
d = set(all_reuters_bigrams)
bigrams_jargon = c.difference(d)
print(list(bigrams_jargon)[:10])

[('degree', 'years'), ('measured', 'team'), ('systems', 'combine'), ('takes', 'win'), ('management', 'leverage'), ('resume', 'workforce'), ('testing', 'consulting'), ('manufacturing', 'metals'), ('drawingscontract', 'documents'), ('project', 'responsibilities')]


In [10]:
print("The number of token jargon is: ", len(jargon))
print("The number of bigrams jargon is:", len(bigrams_jargon))

The number of token jargon is:  291
The number of bigrams jargon is: 3196


# Analysis of Results

I applied six steps to pre-process both the job description word list and the reuters word list. These steps are:
1) tokenize the words in the text
2) convert token to lower case
3) remove punctuations from the words list (i.e. !, ?)
4) remove remaining tokens that are not alphabetic (i.e. $, >, <)
5) filter out stop words (i.e. "a", "the", "this", "that")
6) remove numbers in the list (i.e. 7, 8, 1999)

After pre-processing the text, I converted the two processed lists to sets. I used python's difference function to identify words that are in the job description words set only but not in the reuters words set. This yields the single-word jargon in the job description text. As for the bigrams jargon, I added the step "nltk.bigrams()" to convert the single-word tokens into two-word tokens. 

The results show that there are 291 single-word jargon and 3196 two-word jargon in the job description text in comparision with the reuters corpus. 

