# Homework #1

In this homework you will analyze job descriptions from a number of different fields. The thought is that these job descriptions might contain both jargon words and jargon phrases.

The challenge here is to analyze the text of the included job descriptions and to compare the words and phrases in them with a reference set. In this case, we will use Reuters news articles as the background corpus against which to compare the possible jargon.

This homework requires that you read in the text of the job descriptions and then tokenize it. You will then need to compare the tokens to the Reuters corpus, both as individual tokens and as bigrams.
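As a rough illustration of the tokenize-then-bigram step, here is a minimal sketch that uses only the standard library. The regex tokenizer is a stand-in for an NLTK tokenizer, and the two toy documents and the helper names `tokenize` and `doc_bigrams` are assumptions made for the example. It builds bigrams one document at a time, so that the last word of one description is never paired with the first word of the next.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens in order."""
    return re.findall(r"[a-z]+", text.lower())

def doc_bigrams(tokens):
    """Pair each token with its successor inside a single document."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical toy documents standing in for the job descriptions.
docs = [
    "Engineering analysis of materials degradation.",
    "Fracture mechanics and corrosion evaluation.",
]

all_tokens = []
all_bigrams = []
for doc in docs:
    tokens = tokenize(doc)
    all_tokens.extend(tokens)
    # Build bigrams per document so no pair spans a document boundary.
    all_bigrams.extend(doc_bigrams(tokens))

print(all_tokens)
print(all_bigrams)
```

The same per-document loop should work with NLTK's tokenizer and `nltk.bigrams` in place of the two helpers.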

You need not look at the frequency of the terms. We are aiming only for term differences, so reporting the tokens that appear in the job descriptions but not in the Reuters text will be sufficient. One key thing to consider is which kinds of tokens you want to report. For example, the job descriptions might contain numbers and other noise. Generally, you would not want to report numbers. You might also want to lowercase the tokens so that "Python" and "python" are treated as the same term.
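The filtering and comparison described above can be sketched with Python sets. This is only one possible approach. The token lists are made up for the example, and `keep_token` and `normalize` are hypothetical helper names.

```python
def keep_token(token):
    """Keep purely alphabetic tokens; drops numbers and mixed tokens like '2nd'."""
    return token.isalpha()

def normalize(tokens):
    """Lowercase the tokens worth reporting and collect them into a set."""
    return {t.lower() for t in tokens if keep_token(t)}

# Hypothetical token lists; the real ones come from the two corpora.
job_tokens = ["Python", "Tableau", "2020", "market", "40", "Salesforce"]
reference_tokens = ["Market", "trade", "oil", "1987"]

# Set difference: terms seen in the job descriptions but never in the reference.
job_only = normalize(job_tokens) - normalize(reference_tokens)
print(sorted(job_only))
```

Because both sides are normalized the same way, "market" and "Market" match and are excluded from the result.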

If you'd like, you can also try stemming or lemmatizing the text.

The code has been built around NLTK, but you could just as easily do this with spaCy.

In [1]:
# here we will import necessary libraries for using NLTK

import nltk.data
from os import listdir
from os.path import isfile, join
from nltk.util import bigrams 
from nltk.tokenize import TreebankWordTokenizer
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
treebank_tokenizer = TreebankWordTokenizer()



In [2]:
dir_base = "/Users/xyyao/Documents/GitHub/s20_ds_nlp/homeworks/homework_1/data/"


####
# Notice: We are reusing code from class notes... remember these kinds of building blocks
####

def read_file(filename):
    # use a context manager so the file handle is always closed
    with open(filename, encoding='utf-8') as input_file:
        return input_file.read()

    
def read_directory_files(directory):
    file_texts = []
    files = [f for f in listdir(directory) if isfile(join(directory, f))]
    for f in files:
        file_text = read_file(join(directory, f))
        print(file_text)
        file_texts.append({"file":f, "content": file_text })
    return file_texts
    
# here we will generate the list that contains all the files and their contents
text_corpus = read_directory_files(dir_base)
print(text_corpus)

Dominion Engineering, Inc. (DEI; domeng.com) is a small (~40-person) company that supports the commercial energy industry in the US and abroad with technology, laboratory R&D testing, and consulting. The working environment at DEI is close-knit and professional, but not overly formal. Typical project teams are 2-3 persons working together and in collaboration with the Customer. Newer employees work under the general mentorship of more senior engineers, while still maintaining fairly autonomous roles, roles that may evolve over time to meet emergent needs.

One of DEI’s areas of expertise is degradation of nuclear power plant materials. This position would provide engineering analysis support to DEI project managers and subject matter experts for materials degradation projects and may also be called upon to provide support to other emergent DEI projects. Engineering analysis areas will include corrosion evaluation, fracture mechanics, and microstructural characterization, as well as dev

In [3]:
###
# You will need to work on filling out the content of this method. 
###

import string

from nltk.corpus import stopwords

def process_description(job_description_object):
    job_description = job_description_object["content"]
    
    # take the job description text, and tokenize it
    # you could also remove numbers and other noise tokens here too
    # also, you might generate bigrams here as well
    
    # remove numbers
    job_description = ''.join(c for c in job_description if not c.isdigit())
    # replace punctuation with whitespace
    replace_punctuation = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    job_description = job_description.translate(replace_punctuation)
    tokens = treebank_tokenizer.tokenize(job_description)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    
    return words # the later function assumes you are returning a list of terms

In [4]:
# This loop will simply apply your method to all the job descriptions
all_job_description_words = []
for job_description in text_corpus:
    all_job_description_words.extend(process_description(job_description))
    
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

stemmed_job_description_words = [porter_stemmer.stem(word) for word in all_job_description_words]
print(stemmed_job_description_words)

job_description_bigrams = nltk.bigrams(stemmed_job_description_words)

['dominion', 'engin', 'inc', 'dei', 'domeng', 'com', 'small', 'person', 'compani', 'support', 'commerci', 'energi', 'industri', 'us', 'abroad', 'technolog', 'laboratori', 'r', 'test', 'consult', 'work', 'environ', 'dei', 'close', 'knit', 'profession', 'overli', 'formal', 'typic', 'project', 'team', 'person', 'work', 'togeth', 'collabor', 'custom', 'newer', 'employe', 'work', 'gener', 'mentorship', 'senior', 'engin', 'still', 'maintain', 'fairli', 'autonom', 'role', 'role', 'may', 'evolv', 'time', 'meet', 'emerg', 'need', 'one', 'dei', 'area', 'expertis', 'degrad', 'nuclear', 'power', 'plant', 'materi', 'posit', 'would', 'provid', 'engin', 'analysi', 'support', 'dei', 'project', 'manag', 'subject', 'matter', 'expert', 'materi', 'degrad', 'project', 'may', 'also', 'call', 'upon', 'provid', 'support', 'emerg', 'dei', 'project', 'engin', 'analysi', 'area', 'includ', 'corros', 'evalu', 'fractur', 'mechan', 'microstructur', 'character', 'well', 'develop', 'execut', 'laboratori', 'test', 'pro

In [5]:
# This will be the corpus we work from
from nltk.corpus import reuters

In [6]:
def process_text(text):
   
    # remove numbers
    text = ''.join(c for c in text if not c.isdigit())
    # replace punctuation with whitespace
    replace_punctuation = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    text = text.translate(replace_punctuation)
    tokens = treebank_tokenizer.tokenize(text)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    
    return words

In [7]:
num_docs = len(reuters.fileids())
#  this has a large number of files... 
# you might wish to limit the number of documents you use while developing your technique 
# ex. reuters.fileids()[0:25]

all_reuters_words = []

# this iterates over every document in the corpus;
# while developing, you might limit it to a slice such as reuters.fileids()[0:25]
for doc_id in reuters.fileids():
    # reuters_text holds the raw text of the news article, ready to be tokenized
    reuters_text = reuters.open(doc_id).read()
    # here you could perhaps run the same job description processing method
    # then you could simply add the output to the all_reuters_words list
    all_reuters_words.extend(process_text(reuters_text))

stemmed_reuters_words = [porter_stemmer.stem(word) for word in all_reuters_words]
# print only a short preview; printing the full list exceeds the notebook's IOPub data rate limit
print(stemmed_reuters_words[:50])

reuters_bigrams = nltk.bigrams(stemmed_reuters_words)



In [8]:
# here you will want to find ways to compare the words in the job descriptions and the reuters text
# you might consider using Python's set capabilities to intersect things
# also, you might just iterate over the job description words to see if they are in the reuters word list

stemmed_job_description_words = set(stemmed_job_description_words)
stemmed_reuters_words = set(stemmed_reuters_words)
job_description_bigrams = set(job_description_bigrams)
reuters_bigrams = set(reuters_bigrams)
jargon = []
jargon.extend(w for w in stemmed_job_description_words if w not in stemmed_reuters_words)
jargon.extend(w for w in job_description_bigrams if w not in reuters_bigrams)
print(jargon)

['hubzon', 'templat', 'multidisciplinari', 'hopkin', 'email', 'googl', 'religion', 'traine', 'visio', 'python', 'stipend', 'marit', 'perk', 'uncap', 'technologist', 'loyal', 'lpn', 'rn', 'metadata', 'upload', 'tableau', 'gender', 'sybas', 'pto', 'noncontributori', 'toolset', 'rdbm', 'footprint', 'thermomet', 'rfi', 'merci', 'lifecycl', 'legaci', 'mentorship', 'funni', 'loyalti', 'faculti', 'innat', 'navaid', 'tenac', 'rooki', 'descript', 'invigor', 'ged', 'fractur', 'methodolog', 'enlighten', 'collegi', 'radiologist', 'insight', 'internship', 'click', 'feedback', 'medstar', 'salesforc', 'workplac', 'clarksburg', 'yelp', 'spous', 'priorit', 'licensur', 'hs', 'empathet', 'lineag', 'grammar', 'fundrais', 'cgi', 'rapport', 'domeng', 'powerpoint', 'graduat', 'warfight', 'armypay', 'compassion', 'credenti', 'dynamap', 'possess', 'submitt', 'soul', 'stellar', 'http', 'personifi', 'radiolog', 'workshop', 'lifelong', 'aprnvalid', 'informatica', 'subcontractor', 'breadth', 'glucos', 'gaithersbur

# Analysis of your results

Below this cell, please put a short writeup of your approach and comments on your results. The goal here is to explain how well you think your method worked based on looking at some of your output data. Additionally, please describe things you might do differently or ways in which you might improve the process if you were given more time.

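I cleaned the text of numbers and punctuation, lowercased it, removed stop words, and stemmed the remaining tokens. Some of the reported jargon terms look misspelled, such as "templat" and "googl"; these are truncated Porter stems of "template" and "google" rather than true misspellings. The results could be improved by lemmatizing instead of stemming, or by mapping each stem back to its original word before reporting, and by filtering out any genuine misspellings that remain.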
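One way to make stemmed entries such as "templat" and "googl" readable is to keep a mapping from each stem to the original words that produced it. The sketch below assumes a hypothetical helper, `build_stem_index`, and uses a deliberately crude `toy_stem` function so that it runs without NLTK. It is not the Porter algorithm and is there only for the demo.

```python
from collections import defaultdict

def toy_stem(word):
    """Crude suffix stripper used only for this demo; NOT the Porter algorithm."""
    for suffix in ("ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_stem_index(words, stem=toy_stem):
    """Map each stem to the set of original words that produced it."""
    index = defaultdict(set)
    for word in words:
        index[stem(word)].add(word)
    return index

# Hypothetical cleaned tokens from the job descriptions.
job_words = ["template", "templates", "google", "python"]
index = build_stem_index(job_words)

# Report each jargon stem together with its readable surface forms.
for jargon_stem in ["templat", "googl"]:
    print(jargon_stem, "->", sorted(index[jargon_stem]))
```

In the notebook, passing `porter_stemmer.stem` as the `stem` argument should give the same kind of index for the real stems.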