https://medium.com/swlh/web-scraping-python-requests-and-beautifulsoup-45d5f48f5a1

https://sparkbyexamples.com/spark/spark-read-text-file-rdd-dataframe/

https://medium.com/swlh/text2sql-in-spark-nlp-converting-natural-language-questions-to-sql-queries-on-scale-6ae9a9061d74

https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch04.html

In [None]:
#!python -m pip install html2text

In [None]:
#!python -m pip install mistletoe

https://github.com/webrecorder/warcio

In [None]:
#!pip install warcio

In [None]:
#!python -m pip install cdx_toolkit

In [None]:
#%load_ext autoreload
#%autoreload 2

Run the cell below to import everything necessary for this analysis.

In [1]:
import json
import sys
import os.path
import re
import logging
import pandas as pd
import requests
import warcio
import mistletoe
import cdx_toolkit
import zlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from warcio import ArchiveIterator
from contextlib import closing
from html2text import HTML2Text
from bs4 import BeautifulSoup
from pathlib import Path
from IPython.display import HTML as HTML_raw, display
from mpl_toolkits.mplot3d import Axes3D
from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
np.random.seed(0)

# Our corpus

In this analysis, we'll be working with almost 100k different documents, each containing a single job AD.

These job AD plain text files are stored in the `\c\Users\renjm\Job-Posting-Big-Data\ana\dat\100k_jobADs` subdirectory, which sits in the same folder as this analysis. Each job AD is stored in its own file, with file names ranging from `1_2021JAN_jobAD.txt` to `99422_2021JAN_jobAD.txt`.

To make it easy to read in all of the documents, the cell below builds a list containing the name of every job AD plain text file. Not every number in that range has a file, so the loop keeps only the names that exist on disk.

In [2]:
filesList = []
n_0 = 1
n_1 = 100000
for i in range(n_0, n_1):
    filepath = './dat/100k_jobADs/'+str(i)+'_2021JAN_jobAD.txt'
    if (os.path.exists(filepath)):
        filesList.append(str(i)+'_2021JAN_jobAD.txt')

print(filesList)

['1_2021JAN_jobAD.txt', '11_2021JAN_jobAD.txt', '12_2021JAN_jobAD.txt', '21_2021JAN_jobAD.txt', '22_2021JAN_jobAD.txt', '23_2021JAN_jobAD.txt', '27_2021JAN_jobAD.txt', '28_2021JAN_jobAD.txt', '33_2021JAN_jobAD.txt', '34_2021JAN_jobAD.txt', '39_2021JAN_jobAD.txt', '41_2021JAN_jobAD.txt', '42_2021JAN_jobAD.txt', '45_2021JAN_jobAD.txt', '49_2021JAN_jobAD.txt', '53_2021JAN_jobAD.txt', '54_2021JAN_jobAD.txt', '55_2021JAN_jobAD.txt', '56_2021JAN_jobAD.txt', '62_2021JAN_jobAD.txt', '65_2021JAN_jobAD.txt', '69_2021JAN_jobAD.txt', '74_2021JAN_jobAD.txt', '84_2021JAN_jobAD.txt', '87_2021JAN_jobAD.txt', '91_2021JAN_jobAD.txt', '92_2021JAN_jobAD.txt', '96_2021JAN_jobAD.txt', '108_2021JAN_jobAD.txt', '115_2021JAN_jobAD.txt', '116_2021JAN_jobAD.txt', '117_2021JAN_jobAD.txt', '118_2021JAN_jobAD.txt', '121_2021JAN_jobAD.txt', '124_2021JAN_jobAD.txt', '125_2021JAN_jobAD.txt', '129_2021JAN_jobAD.txt', '131_2021JAN_jobAD.txt', '132_2021JAN_jobAD.txt', '134_2021JAN_jobAD.txt', '135_2021JAN_jobAD.txt', '14
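The same list can also be built with a list comprehension over the directory contents, which avoids testing every number from 1 to 100000. This is a minimal sketch rather than the notebook's own method: `list_jobad_files` is a hypothetical helper, and it assumes every file of interest ends in `_2021JAN_jobAD.txt` with a purely numeric prefix.

```python
from pathlib import Path


def list_jobad_files(directory, suffix="_2021JAN_jobAD.txt"):
    """Return job AD file names found in `directory`, sorted by numeric prefix."""
    # List comprehension over the files whose names end with the suffix
    names = [p.name for p in Path(directory).glob(f"*{suffix}")]
    # Keep only names whose prefix is a number (assumption about the naming scheme)
    names = [n for n in names if n[: -len(suffix)].isdigit()]
    # Sort by the integer value so that 2 comes before 11
    return sorted(names, key=lambda n: int(n[: -len(suffix)]))
```

Calling `list_jobad_files('./dat/100k_jobADs')` should give the same names as the loop above, in the same numeric order.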

Next, create an empty DataFrame called `jobads_df`. As we read in and clean the job ADs, we'll store them in this DataFrame.

In [3]:
jobads_df = pd.DataFrame()

Next, let's import one job AD to see what our text looks like, so that we can make sure we clean and tokenize it correctly.

Read in and print out the contents of `85473_2021JAN_jobAD.txt`. Use vanilla Python, no pandas needed.

93148 (113), 59721 (107)

In [4]:
filepick = './dat/100k_jobADs/85473_2021JAN_jobAD.txt' 
with open(filepick, encoding='utf-8', errors='ignore') as f:
    test_jobad = f.readlines()
    print(test_jobad)

['Workopolis Logo \n', ' Job Title, Keywords \n', ' City, Province \n', ' Menu \n', ' \n', ' Browse Jobs \n', ' Advanced Job Search \n', ' \n', ' \n', ' Français \n', ' \n', ' \n', ' \n', ' Post a job \n', ' Department Manager \n', ' Walmart Canada \n', ' __Richmond Hill, ON \n', ' Apply Now \n', ' LEADER, COACH, TEAM PLAYER—ALL ROLLED INTO ONE. \n', ' It takes a special kind of person to be a Department Manager at Walmart—because we’re every bit as committed to our team culture as we are to our customers. No matter the department, our Managers are friendly, confident communicators who know how to get the best out of their team—and they’ve got the leadership skills to make people want to perform their best. We’ll let you in on a secret: we even love and appreciate a bit of a competitive spirit… especially when it comes to hitting new milestones for our charitable initiatives and giving back to our communities. \n', ' The ability to engage and inspire your team to be better than they we

## Tokenizing our Data

Before we can create a bag of words or vectorize each document, we need to clean it up and split each job AD into an array of individual words. Computers are very particular about strings. If we tokenized our data in its current state, we would run into the following problems:

* Counting things that aren't actually words. 
* Punctuation and capitalization would mess up our word counts. To the Python interpreter, `love`, `Love`, `Love?`, and `Love\n` are all unique words, and would all be counted separately. We need to remove punctuation and capitalization, so that all words will be counted correctly.
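The second problem is easy to reproduce. The sketch below counts the four variants of `love` mentioned above before and after a simple normalization step; `normalize` is a hypothetical helper written for this illustration, not the cleaning function used later in this analysis.

```python
from collections import Counter

raw_tokens = ["love", "Love", "Love?", "Love\n"]

# Without cleaning, every variant is a separate key
raw_counts = Counter(raw_tokens)


def normalize(token):
    """Strip whitespace, lowercase, then strip a few punctuation marks."""
    return token.strip().lower().strip("?!.,'\"")


# After cleaning, all four variants collapse into one word
clean_counts = Counter(normalize(t) for t in raw_tokens)

print(len(raw_counts))   # 4
print(clean_counts)      # Counter({'love': 4})
```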

Tokenization is pretty tedious if we handle it manually, and would probably make use of `regular expressions`. In order to keep this lab moving, we'll use a library function to clean and tokenize our data so that we can move onto vectorization.
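For comparison, here is a minimal sketch of what a regular-expression tokenizer could look like. `regex_tokenize` is a hypothetical name, and the pattern is an assumption: it treats any run of letters or digits as a token, so punctuation and underscores act as separators.

```python
import re

# One or more word characters, excluding the underscore
TOKEN_RE = re.compile(r"[^\W_]+")


def regex_tokenize(lines):
    """Join the lines, lowercase them, and return every run of letters/digits."""
    text = " ".join(lines).lower()
    return TOKEN_RE.findall(text)


sample = ["Workopolis Logo \n", " __Richmond Hill, ON \n", " Français \n"]
print(regex_tokenize(sample))
```

Note the design difference from deleting punctuation outright: because separators split tokens here, a phrase such as `player—all` would yield `player` and `all` rather than a fused `playerall`.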

Tokenization is required for just about any Natural Language Processing (NLP) task, so great industry-standard tools exist to tokenize text for us. They let us spend our time on more important tasks without getting bogged down hunting for every special symbol or punctuation mark in a massive dataset. For this analysis, we'll make use of the tokenizer in the amazing `nltk` library, which is short for *Natural Language Tool Kit*.

**NOTE**: `NLTK` needs extra resources to be downloaded the first time certain functions are used. If `nltk` throws an error about needing to install additional packages, follow the instructions in the error message to install the dependencies, and then re-run the cell.

Before we tokenize our job AD plain text files, we'll do a small amount of manual cleaning. The cell below defines a function that makes every word lowercase and removes newline characters (`\n`) along with punctuation marks such as `",.'?!"`.

The function also joins all of the cleaned lines into a single string and splits it on whitespace. In short, `tokenize_jobAD()` returns a fully tokenized version of a job AD. We can test it on `test_jobad` to ensure that it works.

In [5]:
def tokenize_jobAD(jobad):
    subs = [('.',''),("'",""),('"',''),(',',''),('?',''),('!',''),('\n',''),('\t',''),('\r',''),('|',''),('/',' '),
            ('-',''),('(c)',''),('*',''),('(',''),(')',''),('&',''),(':',''),(';',''),('[',''),(']',''),('\\',''),
            ('â„¢',''),('%',''),('â£',''),('Â£0',''),('Â£',''),('>',''),('<',''),('=',''),('_',''),('__',''),('“',''),
            ('’',''),('–',''),('”',''),('—','')]
    cleaned_jobad = []
    for line in jobad:
        # Lowercase once per line, then apply every substitution in order
        line = line.lower()
        for old, new in subs:
            line = line.replace(old, new)
        cleaned_jobad.append(line)

    tokenized_jobad = ' '.join(cleaned_jobad).split()
    
    return tokenized_jobad

In [6]:
tokenized_test_jobad = tokenize_jobAD(test_jobad)
print(len(tokenized_test_jobad))
print(tokenized_test_jobad[:100000])

1019
['workopolis', 'logo', 'job', 'title', 'keywords', 'city', 'province', 'menu', 'browse', 'jobs', 'advanced', 'job', 'search', 'français', 'post', 'a', 'job', 'department', 'manager', 'walmart', 'canada', 'richmond', 'hill', 'on', 'apply', 'now', 'leader', 'coach', 'team', 'playerall', 'rolled', 'into', 'one', 'it', 'takes', 'a', 'special', 'kind', 'of', 'person', 'to', 'be', 'a', 'department', 'manager', 'at', 'walmartbecause', 'were', 'every', 'bit', 'as', 'committed', 'to', 'our', 'team', 'culture', 'as', 'we', 'are', 'to', 'our', 'customers', 'no', 'matter', 'the', 'department', 'our', 'managers', 'are', 'friendly', 'confident', 'communicators', 'who', 'know', 'how', 'to', 'get', 'the', 'best', 'out', 'of', 'their', 'teamand', 'theyve', 'got', 'the', 'leadership', 'skills', 'to', 'make', 'people', 'want', 'to', 'perform', 'their', 'best', 'well', 'let', 'you', 'in', 'on', 'a', 'secret', 'we', 'even', 'love', 'and', 'appreciate', 'a', 'bit', 'of', 'a', 'competitive', 'spirit…', 

Now that we can tokenize our job AD plain text file, we can move on to vectorization.

## Count Vectorization

Machine Learning algorithms don't understand strings. However, they do understand math, which means they understand vectors and matrices. By vectorizing the text, we convert the entire text into a vector, where each element in the vector represents a different word. The vector is the length of the entire vocabulary -- usually, every word that occurs in the English language, or at least every word that appears in our corpus. Any given sentence can then be represented as a vector in which the element for a word is 0 if the word is absent and 1 (or some other value, such as a count) if the word appears in the sentence.

`Count Vectorization` allows us to represent a sentence as a vector, with each element in the vector corresponding to how many times that word is used. Notice that when we vectorize a sentence this way, we lose the order that the words were in. This is the `Bag of Words` approach.
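The loss of word order can be seen in a small sketch. The five-word `vocabulary` and the helper `to_count_vector` below are made up for illustration; they are not part of this analysis.

```python
from collections import Counter


def to_count_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vocabulary = ["a", "department", "job", "manager", "post"]

sentence_1 = ["post", "a", "job", "department", "manager"]
sentence_2 = ["manager", "department", "job", "a", "post"]  # same words, reordered
sentence_3 = ["a", "job", "a", "manager"]

print(to_count_vector(sentence_1, vocabulary))  # [1, 1, 1, 1, 1]
print(to_count_vector(sentence_2, vocabulary))  # [1, 1, 1, 1, 1]
print(to_count_vector(sentence_3, vocabulary))  # [2, 0, 1, 1, 0]
```

The first two sentences produce identical vectors even though their word order differs, which is exactly what the Bag of Words approach discards.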

In the cell below, create a function that takes in a tokenized, cleaned job AD plain text file and returns a count vectorized representation of it as a Python dictionary (aka: *Sparse Representation*). Add in an optional parameter called `vocab` that defaults to `None`. This way, if we are using a vocabulary that contains words not seen in the job AD, we can still use this function by passing it into the `vocab` parameter.

In [7]:
def count_vectorize(jobad, vocab=None):
    if vocab:
        unique_words = vocab
    else:
        unique_words = list(set(jobad))
    
    jobad_dict = {i:0 for i in unique_words}
    
    for word in jobad:
        # Skip words outside a supplied vocabulary instead of raising KeyError
        if word in jobad_dict:
            jobad_dict[word] += 1
    
    return jobad_dict

In [8]:
test_vectorized = count_vectorize(tokenized_test_jobad)
print(test_vectorized)

{'positive': 2, 'best': 6, 'recognition': 2, 'post': 1, 'as': 10, 'willingness': 4, 'below': 4, 'manager': 19, 'winning': 4, 'us': 4, 'any': 2, 'do': 3, 'older': 2, 'career': 2, 'than': 2, 'about': 1, 'an': 1, 'appreciate': 2, 'were': 8, 'single': 2, 'action': 2, 'grow': 2, 'seriously': 2, '16': 2, 'law': 2, 'way': 2, 'into': 2, 'see': 2, 'employers': 2, 'energetic': 2, 'department': 23, 'the': 28, 'every': 10, 'with': 6, 'told': 2, 'longterm': 2, 'center': 1, 'site': 1, 'walmartbecause': 2, 'everyone': 2, 'take': 2, 'others': 2, 'value': 2, 'each': 2, 'privacy': 2, 'province': 1, 'skills': 4, 'your': 8, 'other': 2, 'handling': 2, 'really': 2, 'customer': 6, 'of': 12, 'theyve': 2, 'wp': 1, 'their': 8, 'workopolis': 2, 'store': 2, 'to': 42, 'know': 2, '‘wowing': 2, 'service': 2, 'welcoming': 2, 'great': 12, 'engage': 2, 'thats': 2, '2021': 1, 'sees': 2, 'communities': 4, 'what': 5, 'committed': 4, 'canada': 3, 'belong': 2, 'valued': 2, 'member': 2, 'job': 6, 'outgoing': 2, 'walmart': 7,
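To illustrate what the `vocab` parameter is for, the sketch below vectorizes two tiny made-up documents against one shared vocabulary so that both dictionaries end up with the same keys. `count_vectorize_sketch` is a compact stand-in written for this example; it skips tokens that are missing from the vocabulary, which is an assumption about the desired behavior.

```python
def count_vectorize_sketch(tokens, vocab=None):
    """Count tokens, restricted to `vocab` when one is supplied."""
    unique_words = vocab if vocab else set(tokens)
    counts = {word: 0 for word in unique_words}
    for word in tokens:
        if word in counts:  # ignore out-of-vocabulary tokens
            counts[word] += 1
    return counts


doc_a = ["department", "manager", "walmart", "manager"]
doc_b = ["data", "analyst", "manager"]

# Shared vocabulary: every word that occurs in either document
shared_vocab = sorted(set(doc_a) | set(doc_b))

vec_a = count_vectorize_sketch(doc_a, vocab=shared_vocab)
vec_b = count_vectorize_sketch(doc_b, vocab=shared_vocab)

print(vec_a)
print(vec_b)
```

Because both dictionaries share the same keys in the same order, their values can be compared position by position.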

We've just successfully vectorized the first job AD plain text document! Now, let's look at a more advanced type of vectorization, TF-IDF!

In [9]:
statesList = ["United States of America","USA","AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS",
              "KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA",
              "RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY","DC","GU","MH","MP","PR","VI","Alabama","Alaska",
              "Arizona","Arkansas","California","Colorado","Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho",
              "Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
              "Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","Hampshire","Jersey","Mexico","York",
              "Carolina","Dakota","Ohio","Oklahoma","Oregon","Pennsylvania","Rhode","Tennessee","Texas","Utah","Vermont",
              "Virginia","Washington","Wisconsin","Wyoming"]
itList = ["Engineer","Scientist","Developer","Analyst","Operator"]
qaList = ["Qualification","Qualifications","Certification","Certifications"]
euList = ["United Kingdom","UK","EU"]

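The function below takes the count-vectorized dictionary of a job AD and checks whether it contains any of the keywords in `qaList`. Job ADs that mention an entry in `euList` are skipped. For the remaining ones, the function also records which entries of `statesList` appear.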

In [10]:
def qa_usa_Count(filename, tokenized_jobad):
    found_qa = False
    counter_notusa = 0
    counter_qa = 0
    counter_usa = 0
    qaINjobList = []
    qaINusaDict = {}
    
    for eu in euList:
        if eu.lower() in tokenized_jobad.keys():
            counter_notusa += 1
            
    if counter_notusa == 0:
        for qa in qaList:
            if qa.lower() in tokenized_jobad.keys():
                counter_qa += 1
                found_qa = True
                qaINjobList.append(qa)
        if found_qa == True:
            for usa in statesList:
                if usa.lower() in tokenized_jobad.keys():
                    counter_usa += 1
                    qaINusaDict.setdefault(filename, []).append(usa)
            for qa in qaINjobList:
                qaINusaDict.setdefault(filename, []).append(qa)
                
    return counter_qa, counter_usa, qaINusaDict

In [11]:
qaNums_in_test_vectorized, usaNums_in_test_vectorized, qaDict_in_test_vectorized = qa_usa_Count('92663_2021JAN_jobAD', test_vectorized)
print(qaNums_in_test_vectorized)
print(usaNums_in_test_vectorized)
print(qaDict_in_test_vectorized)

1
2
{'92663_2021JAN_jobAD': ['IN', 'OR', 'Qualifications']}
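In the output above, `IN` and `OR` are reported as states. Because both the tokens and the list entries are lowercased before comparison, the abbreviations match the ordinary English words `in` and `or`. A minimal sketch of one possible workaround follows, assuming the check is run on the raw lines before lowercasing. `find_state_abbreviations` and the shortened `STATE_ABBREVIATIONS` set are hypothetical names used only for this illustration.

```python
import re

# Shortened, illustrative subset of the two-letter codes in statesList
STATE_ABBREVIATIONS = {"IN", "OR", "ME", "NY", "CA", "TX"}


def find_state_abbreviations(lines):
    """Return the uppercase two-letter state codes found in the raw lines."""
    found = set()
    for line in lines:
        # Match standalone two-letter uppercase tokens only
        for token in re.findall(r"\b[A-Z]{2}\b", line):
            if token in STATE_ABBREVIATIONS:
                found.add(token)
    return sorted(found)


print(find_state_abbreviations(["Apply in person or online", "Austin, TX"]))  # ['TX']
```

This case-sensitive check still misfires on text written entirely in capitals (for example a heading containing `IN`), so it reduces false positives rather than eliminating them.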


In [14]:
def get_list_of_qaJobAds():
    n_0 = 1
    n_1 = 100000
    qaList = []
    for i in range(n_0, n_1):
        qaNums_in_test_vectorized = 0
        usaNums_in_test_vectorized = 0
        filepath = './dat/100k_jobADs/'+str(i)+'_2021JAN_jobAD.txt'
        if (os.path.exists(filepath)):
            filename = filepath.replace('./dat/100k_jobADs/','').replace('.txt','')
            with open(filepath, encoding='utf-8', errors='ignore') as f:
                test_jobad = f.readlines()
            #print(filename)
            tokenized_test_jobad = tokenize_jobAD(test_jobad)
            test_vectorized = count_vectorize(tokenized_test_jobad)
            qaNums_in_test_vectorized, usaNums_in_test_vectorized, qaDics_in_test_vectorized = qa_usa_Count(filename, test_vectorized)
            if (qaNums_in_test_vectorized > 0) and (usaNums_in_test_vectorized > 0):
                qaList.append(qaDics_in_test_vectorized) 
        else:
            continue
                
    return qaList

In [15]:
print(get_list_of_qaJobAds())

[{'12_2021JAN_jobAD': ['IN', 'Qualifications']}, {'27_2021JAN_jobAD': ['IN', 'OR', 'Qualification']}, {'92_2021JAN_jobAD': ['IN', 'OR', 'Certification']}, {'96_2021JAN_jobAD': ['IN', 'OR', 'Qualification', 'Certification']}, {'129_2021JAN_jobAD': ['IN', 'ME', 'OR', 'Certifications']}, {'210_2021JAN_jobAD': ['ID', 'IN', 'ME', 'MT', 'OR', 'Qualifications']}, {'213_2021JAN_jobAD': ['IN', 'OR', 'Qualification']}, {'221_2021JAN_jobAD': ['AZ', 'IN', 'ME', 'OR', 'Qualifications', 'Certification']}, {'285_2021JAN_jobAD': ['HI', 'IN', 'ME', 'OR', 'California', 'Connecticut', 'Massachusetts', 'Montana', 'Nebraska', 'Jersey', 'Qualifications']}, {'293_2021JAN_jobAD': ['IN', 'OR', 'Certification']}, {'298_2021JAN_jobAD': ['CO', 'IN', 'Qualification']}, {'310_2021JAN_jobAD': ['IN', 'OR', 'Certification']}, {'362_2021JAN_jobAD': ['IN', 'ME', 'OR', 'Certifications']}, {'410_2021JAN_jobAD': ['IN', 'OR', 'Washington', 'Qualifications', 'Certification']}, {'539_2021JAN_jobAD': ['DE', 'IN', 'OR', 'Qualif