# Introduction

In this notebook we will learn how to **categorize job titles** that have been scraped from the internet. In a job search, each job has an alphanumeric id and a suggested job title. However, jobs that are the same do not always have the same title. For example, an administrative assistant can also be labelled as an office administrator or admin. In this tutorial, we will categorize any job title that contains these words as 'administrative assistant'. This will be accomplished through the use of **Natural Language Processing (NLP)**.

You will discover, step by step, how to write the code that processes the job title text. In the last part of the workshop, this code will be **packaged to create a service** that you can query from an application.

We will start by training the model. Once the model is trained, we can test it by entering a web-scraped job title and id (e.g. admin assistant office administrator administrative support at coreix 2597597309) and checking whether the model has categorized the job title correctly. The job categories we currently use are: administrative_assistant, apprentice, painter, security_guard or other.

Ready? Let's go!

## Libraries
First, we'll need to **install some libraries** that are not part of our container image. Normally, **Red Hat OpenShift Data Science** takes care of this for you, based on what it detects in the code, and reinstalls all those libraries every time you launch the notebook!


## Imports
Of course, we'll need to import various packages. They are either built into the notebook image you are running, or have been installed in the previous step.

In [1]:
import numpy as np 
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow import keras

2021-08-12 13:43:10.777599: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0


## Create Training and Testing data sets

Now that we have loaded the tools we need, the first step in our journey is to be able to take our raw data and divide it into testing and training sets.
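
As a minimal sketch of the idea, the split can be tried on a plain Python list before applying it to the real CSV. The sample titles and the helper name `split_by_position` are hypothetical and exist only for illustration. The optional shuffle at the end is an assumption about what you might want, not something this notebook does.

```python
import random


def split_by_position(rows, training_portion):
    """Split rows into (train, test) by position, without shuffling."""
    train_size = int(len(rows) * training_portion)
    return rows[:train_size], rows[train_size:]


# Hypothetical sample titles, for illustration only.
titles = [
    "admin assistant", "office administrator", "painter and decorator",
    "security guard", "apprentice electrician", "security officer",
    "administrative support", "trainee painter", "warehouse operative",
    "night security guard",
]

# 80% of 10 rows goes to training, the remaining rows to testing.
train, test = split_by_position(titles, 0.80)
print(len(train), len(test))

# Optional: shuffle a copy first so the test set is not simply the tail of the file.
shuffled = titles[:]
random.Random(42).shuffle(shuffled)
train_shuffled, test_shuffled = split_by_position(shuffled, 0.80)
```

If you do shuffle, shuffle whole rows (text and label together) before splitting, so that each text keeps its own label.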


In [2]:
#============================================================================
# Determine the training and testing percentages of the data set.
#============================================================================
training_portion = .80         # Use 80% of data for training, 20% for testing
max_words        = 1000        # Vocabulary size: the tokenizer keeps only the most frequent words

data             = pd.read_csv('dataset/generated_jobs_data.csv')

#============================================================================
# TRY:  uncomment the below print statement and print out the first 5 rows of
# the generated job data so you can see what the data looks like.
#
# print(data.head())
#============================================================================

train_size       = int(len(data) * training_portion)

#============================================================================
# FUNCTION:  train_test_split
# This function splits the data into training and test sets.  
# Inputs:   raw data and determined train_size.
#============================================================================
def train_test_split(data, train_size):
    train        = data[:train_size]
    test         = data[train_size:]
    return train, test

train_cat, test_cat   = train_test_split(data.iloc[:,1], train_size)  # label data is second column
train_text, test_text = train_test_split(data.iloc[:,0], train_size)  # text data is first column


## Tokenize the Data sets

After we have training and testing sets, we need to **tokenize the data**. This means that we convert each text document into a fixed-length numeric vector. The tokenizer builds a word dictionary from the training text, and each position in the vector corresponds to the index of a word in that dictionary; a nonzero value means that the word occurs in the document.

To see how Tokenization works, you can take a look at **02-TokenDemo.ipynb**.
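
To make the idea concrete without any libraries, here is a rough pure-Python sketch of a binary bag-of-words matrix. It is a simplified stand-in and not the Keras `Tokenizer` implementation: the helper names `build_word_index` and `texts_to_binary_matrix` are hypothetical, punctuation is not filtered, and ties in word frequency are broken alphabetically, which may differ from Keras.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an index; the most frequent words get the lowest indices, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, breaking ties alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}


def texts_to_binary_matrix(texts, word_index, num_words):
    """One row per text; column j is 1 if the word with index j occurs in the text."""
    matrix = []
    for text in texts:
        row = [0] * num_words
        for word in text.lower().split():
            j = word_index.get(word)
            if j is not None and j < num_words:  # ignore unknown and out-of-range words
                row[j] = 1
        matrix.append(row)
    return matrix


# Hypothetical training texts, for illustration only.
texts = ["security guard", "security officer", "painter"]
word_index = build_word_index(texts)
matrix = texts_to_binary_matrix(texts, word_index, num_words=6)

for text, row in zip(texts, matrix):
    print(text, row)
```

Every row has the same length (`num_words`) regardless of how long the job title is, which is what lets the rows be fed to a model with a fixed input size.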

In [3]:
tokenize              = Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_text) # fit tokenizer to our training text data

#============================================================================
# x_train and x_test are the vectorization of the text data (the job titles)
#============================================================================
x_train               = tokenize.texts_to_matrix(train_text)
x_test                = tokenize.texts_to_matrix(test_text)

#============================================================================
# TRY:  uncomment the below print statement and observe the rows in the 
# newly created matrix.
#
# print(x_train)
# ===========================================================================


## Using Sklearn
We will use the sklearn `LabelEncoder` utility to convert the label strings to a numbered index.
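
The mapping itself is simple enough to sketch in plain Python. The helpers `fit_labels` and `transform_labels` below are hypothetical names that only mimic the fit and transform steps of a label encoder; the sample labels are made up for illustration.

```python
def fit_labels(labels):
    """Return the sorted list of distinct labels; a label's position is its index."""
    return sorted(set(labels))


def transform_labels(labels, classes):
    """Replace each label string with its index in classes."""
    index_of = {label: i for i, label in enumerate(classes)}
    return [index_of[label] for label in labels]


# Hypothetical training labels, for illustration only.
train_labels = [
    "painter", "other", "security_guard",
    "other", "administrative_assistant", "apprentice",
]

classes = fit_labels(train_labels)
encoded = transform_labels(train_labels, classes)

print(classes)
print(encoded)
```

Note that a label which never appears in the data passed to the fit step has no index, so transforming it fails (a `KeyError` in this sketch). The same concern applies when a category appears only in the test set.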

In [4]:
#============================================================================
# Convert label strings to numbered index
#============================================================================
encoder              = LabelEncoder()  
encoder.fit(train_cat)
y_train              = encoder.transform(train_cat)
y_test               = encoder.transform(test_cat)

#============================================================================
# Note: each entry is the index of that row's label in encoder.classes_,
# which LabelEncoder sorts alphabetically.
# Example:  [2 1 1 2 1 1 0 ...  which would correspond to other, apprentice,
# apprentice, other, apprentice, apprentice, administrative_assistant ...
#
#============================================================================
# TRY:  uncomment the below print statement.  What would you expect y_train
# to look like?  
#
# print(y_train)
#============================================================================


## One Hot Encoding

We have created the labels (job titles such as administrative assistant or painter) for our training and test data and converted them to a numbered index. The next step is to apply one-hot encoding to them.

**One-hot encoding** allows the representation of categorical data to be more expressive: each category becomes a vector with a single 1 in the position of that category and 0 everywhere else. Many machine learning algorithms cannot work with categorical data directly. **The categories must be converted into numbers**. This is required for both input and output variables that are categorical.

After we have converted the labels using one-hot encoding, we will be ready to build our main NLP model and train it.
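
As a small illustration, the conversion from label indices to one-hot vectors can be written in a few lines of plain Python. The helper name `one_hot` and the sample indices are hypothetical; the notebook itself relies on `to_categorical` for this step.

```python
def one_hot(indices, num_classes):
    """Turn each label index into a vector with a single 1.0 at that index."""
    rows = []
    for i in indices:
        row = [0.0] * num_classes
        row[i] = 1.0
        rows.append(row)
    return rows


# Hypothetical label indices for four job titles and five categories.
encoded = [3, 2, 4, 0]
y = one_hot(encoded, num_classes=5)

for index, row in zip(encoded, y):
    print(index, row)
```

Each row sums to 1, and the position of the 1 recovers the original label index.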

In [5]:
#============================================================================
# Convert the labels to a one-hot representation
#
# One Hot Encoding replaces the column of labels (whose values are 0, 1, 2, ...)
# with one column per label value.  For example, with 3 labels, the label
# 'administrative_assistant' may be replaced by the vector 0 1 0, the label
# 'painter' by the vector 0 0 1 and the label 'other' by the vector 1 0 0.
# With our 5 job categories, each vector has 5 entries.
#============================================================================
num_classes          = len(set(y_train))  # set() creates a unique set of objects
y_train              = to_categorical(y_train, num_classes)  
y_test               = to_categorical(y_test, num_classes)

#============================================================================
# TRY:  uncomment the below print statements in order to inspect the 
# dimensions of our training and test data.
# y_train shape is (rows, num_classes), e.g. (159, 3) would represent 159 rows, 3 cols
#
#print('x_train shape:', x_train.shape)
#print('x_test shape:', x_test.shape)
#print('y_train shape:', y_train.shape)
#print('y_test shape:', y_test.shape)
#============================================================================



## Building the model

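Next we build and train the model: a small neural network with one hidden layer of 512 nodes and an output layer that produces one probability per job category. Once the model is trained, we can test it by entering a job title (and id) and checking whether the model has categorized the job title correctly.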

For example:  

**admin assistant office administrator administrative support at coreix 2597597309**  is categorized as an **administrative_assistant**

**security officer at unitrust protection services uk ltd 2589892274** is categorized as a **security_guard**



In [6]:
#============================================================================
# Build model
#============================================================================
layers               = keras.layers
models               = keras.models
model                = models.Sequential()
model.add(layers.Dense(512, input_shape=(max_words,), activation='relu'))  # Hidden layer with 512 nodes
model.add(layers.Dense(num_classes, activation='softmax'))

#============================================================================
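# relu, softmax and categorical_crossentropy tell the model how to do its
# internal calculations.  relu is the activation of the hidden layer, softmax
# makes the output layer produce a probability for each category, and
# categorical_crossentropy is the loss that compares those probabilities with
# the one-hot labels.  If you only had two classes (yes or no), you would
# use sigmoid instead of softmax.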
#============================================================================
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

#============================================================================
# VARIABLES
# history    - normally used to plot learning curves.  
# fit        - calculates the weights in the model. 
# batch_size - tells the internal calculations how many rows to process at 1 time
# epochs     - num of times model calculations will pass through the entire data
#============================================================================
batch_size          = 32
epochs              = 2
history = model.fit(x_train, y_train,      
                    batch_size=batch_size,  
                    epochs=epochs,         
                    verbose=1,
                    validation_split=0.1)

#============================================================================
# evaluate func compares the model predictions with the actual known test values
#============================================================================
score = model.evaluate(x_test, y_test,       
                       batch_size=batch_size, verbose=1)

#============================================================================
# TRY:  uncomment the below print statements to see the test loss and accuracy
# of our model
#
#print('Test loss:', score[0])
#print('Test accuracy:', score[1])
#============================================================================


2021-08-12 13:43:32.640783: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-08-12 13:43:32.641068: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-08-12 13:43:32.641093: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-08-12 13:43:32.641131: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (jupyterhub-nb-adrezni): /proc/driver/nvidia/version does not exist
2021-08-12 13:43:32.641332: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AV

Epoch 1/2
Epoch 2/2
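
To give a feel for what softmax and categorical cross-entropy compute, here is a pure-Python sketch on a single made-up row of raw output scores. The scores and the helper names are assumptions for illustration; Keras performs the equivalent calculations internally on whole batches.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Loss for one sample: minus the log of the probability given to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t > 0)


# Hypothetical raw outputs of the last layer for one job title (5 categories).
scores = [2.0, 0.5, 0.1, -1.0, 0.3]
probs = softmax(scores)

# The loss is small when the true class has a high probability, large otherwise.
loss_if_class_0 = categorical_crossentropy([1, 0, 0, 0, 0], probs)
loss_if_class_3 = categorical_crossentropy([0, 0, 0, 1, 0], probs)

print([round(p, 3) for p in probs])
print(round(loss_if_class_0, 3), round(loss_if_class_3, 3))
```

Training adjusts the weights so that, averaged over the training rows, this loss goes down.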



## Let's test our model!

Now that we have a model, we would like to generate a prediction (i.e. categorize the web-scraped job title as administrative_assistant, apprentice, painter, security_guard or other).
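
The step from model output to label can be sketched without a trained model: take the index of the largest probability in each row and look it up in the list of labels. The probability rows below are invented for illustration, and the helper names are hypothetical. The `accuracy` helper shows the kind of comparison that an evaluation step makes between predicted and actual labels.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def predicted_labels(probability_rows, text_labels):
    """Map each row of class probabilities to the label with the highest probability."""
    return [text_labels[argmax(row)] for row in probability_rows]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


text_labels = ["administrative_assistant", "apprentice", "other", "painter", "security_guard"]

# Hypothetical model outputs for three job titles, for illustration only.
probability_rows = [
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.10, 0.10, 0.10, 0.10, 0.60],
    [0.10, 0.20, 0.40, 0.20, 0.10],
]
actual = ["administrative_assistant", "security_guard", "painter"]

predicted = predicted_labels(probability_rows, text_labels)
score = accuracy(predicted, actual)

print(predicted)
print(score)
```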


In [10]:
#============================================================================
# In this section, we will create a 'predict' function that passes a
# 'job title string' to the model.  The returned string will be a category:
# administrative_assistant, apprentice, painter, security_guard or other.
#============================================================================
text_labels = encoder.classes_   #ndarray of output values (labels or classes)  e.g. security_guard
#print(text_labels)

#==========================================================
# Save labels (categories) to be used later when we run the
# model in the actual Flask app
#==========================================================
import csv

with open('dataset/savedCategories.csv', 'w', newline='') as file:  # newline='' is recommended for csv files
    writer = csv.writer(file, delimiter=',')
    writer.writerow(text_labels)
    

#============================================================================
# Examine first 10 test samples of 445
#============================================================================
#for i in range(10):
    #prediction = model.predict(np.array([x_test[i]]))   # one row, shape (1, max_words)
    #predicted_label = text_labels[np.argmax(prediction)]  #predicted class
    #print(test_text.iloc[i][:50], "...")                # 50 char sample of text
    #print('Actual label:' + test_cat.iloc[i])
    #print("Predicted label: " + predicted_label + "\n")

#============================================================================
# FUNCTION:  predict
# This function takes a string input (e.g. a web-scraped job title and id),
# tokenizes it and passes it to the model.  np.argmax picks the index of the
# highest predicted probability, and text_labels maps that index back to a
# label (e.g. administrative_assistant, apprentice, painter, security_guard or other).
#============================================================================
def predict(single_test_text):
    text_as_series = pd.Series(single_test_text)               # wrap the string so the tokenizer sees one document
    single_x_test = tokenize.texts_to_matrix(text_as_series)   # shape (1, max_words)
    single_prediction = model.predict(single_x_test)           # shape (1, num_classes)
    single_predicted_label = text_labels[np.argmax(single_prediction)]  # maps index of the prediction to a label, e.g. painter
    return single_predicted_label

#========================================
# Run the first time, then save the model
#========================================
single_test_text = 'admin assistant office administrator administrative support at coreix 2597597309 ' 
print('the job desc: ' + single_test_text)    # print out the job title and id being categorized

prediction = predict(single_test_text) 
print('job category is: ' + prediction)   

model.save('models/jobmodel.h5')  # save the trained model so it can be reloaded later

the job desc: admin assistant office administrator administrative support at coreix 2597597309 
job category is: administrative_assistant
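
The saved categories file is meant to be read back later, when the model runs inside the application. A minimal round trip could look like the sketch below. It writes to a temporary directory rather than `dataset/`, and the helper names `save_categories` and `load_categories` are hypothetical, not part of the workshop code.

```python
import csv
import os
import tempfile


def save_categories(path, labels):
    """Write all labels as a single comma-separated row."""
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter=",").writerow(labels)


def load_categories(path):
    """Read the single row of labels back as a list of strings."""
    with open(path, newline="") as f:
        return next(csv.reader(f, delimiter=","))


labels = ["administrative_assistant", "apprentice", "other", "painter", "security_guard"]

# Use a temporary directory so the sketch does not touch the real dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "savedCategories.csv")
    save_categories(path, labels)
    restored = load_categories(path)

print(restored)
```

The order of the labels matters: the index returned by the model only maps to the right category if the list is restored in the same order in which it was saved.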


In [12]:
#==========================================================================
# TRY: uncomment the below to test the predict function.  We will test for 5
# job titles so that we can see if the model can properly categorize job titles 
# that have been scraped from the internet
# administrative assistant, apprentice, painter, security guard, other

#single_test_text = 'security officer'
#single_test_text = 'security guard' 
#single_test_text = 'security guard 6 month fixed term at g4s 2585133095'
#single_test_text = 'admin assistant office administrator administrative support at coreix 2597597309'
single_test_text = 'security officer at unitrust protection services uk ltd 2589892274'

print('the job title: ' + single_test_text)               #print out the job title being categorized
model = keras.models.load_model('models/jobmodel.h5')     #load the saved model before calling the predict function
prediction = predict(single_test_text)                    #make the prediction
print('job category is: ' + prediction)                   #print the prediction


the job title: security officer at unitrust protection services uk ltd 2589892274
job category is: security_guard
