# Curating data to initially train our model

**Goal** 
For an intelligent job search application, we would like to use job data scraped from the internet. However, we first need to overcome a hurdle: jobs that are the same but carry different job titles.

**Background** 
When jobs are scraped off the internet, each has a job title that may correspond to several other job titles. For example, an administrative assistant can also be labeled as an office admin, admin, or assistant.

**Solution** 
Therefore, we will 'categorize' the job titles into a smaller, known set of job titles (administrative_assistant, apprentice, painter, security_guard, other). We will then train our model against a new set of web-scraped data and determine whether it is able to properly categorize (re-title) the jobs we are interested in. We also need to generate additional job listings, because more data is needed for training.

**Note:** This tutorial is part of a larger effort to create an intelligent job search application.
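
As a rough illustration of the categorization idea described above, the sketch below maps raw scraped titles onto the five known categories with simple keyword matching. The keyword lists and the `categorize_title` helper are hypothetical and exist only for illustration; this tutorial works from a `currated_title` column and a trained model rather than fixed rules.

```python
# Hypothetical keyword lists; a real mapping would be curated from the scraped data.
CATEGORY_KEYWORDS = {
    "administrative_assistant": ["administrative assistant", "office admin", "admin", "assistant"],
    "apprentice": ["apprentice"],
    "painter": ["painter"],
    "security_guard": ["security guard", "security officer"],
}

def categorize_title(raw_title):
    """Return the first category whose keyword occurs in the title, else 'other'."""
    title = raw_title.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in title for keyword in keywords):
            return category
    return "other"

for raw in ["Office Admin", "Apprentice Electrician", "Night Security Guard", "Barista"]:
    print(raw, "->", categorize_title(raw))
```

Because the first matching category wins, a title such as "Painter's Assistant" would land in administrative_assistant. That kind of brittleness is one reason to train a model instead of relying on hand-written rules.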



In [None]:
# ==============================================================
# Pull AWS access keys from environment variables previously set
# in Jupyter Hub
# ==============================================================
import os

import boto3

key_id          = os.environ.get('AWS_ACCESS_KEY_ID')
secret_key      = os.environ.get('AWS_SECRET_ACCESS_KEY')

# Create one session with the credentials and derive the S3 client from it
session         = boto3.session.Session(aws_access_key_id=key_id, aws_secret_access_key=secret_key)
s3_client       = session.client('s3')


In [None]:
# ==============================================================
# Download jobs file from S3 and place it into a dataframe (df)
# ==============================================================
import pandas as pd

bucket_name     = 'rhods-pilot'
file_name       = 'job_offers_202106190048.csv'
new_file_name   = 'new_job_offers.csv'
local_dest_dir  = os.path.join(os.getcwd(), 'downloaded-folder')

s3_client.download_file(bucket_name, file_name, new_file_name)

#place file contents into a dataframe for processing
df              = pd.read_csv(new_file_name)

In [None]:
# =============================================================
# Examine which columns and data are available by printing
# the first 5 rows of scraped job listings
# =============================================================
print(df.head())


In [None]:
# =============================================================
# read in dataset, extract out the following columns
# title, company, location, description, employment_type
# =============================================================


#***************** WORK WITH MAX ON MULTIPLE DATA SOURCES and REQD COLUMNS **************************************************
#straight from web scraping
#col_list        = ["id","title", "company", "location", "description", "employent_type"]  #employent_type misspelled in curratedjobs.csv
#df              = pd.read_csv(new_file_name, usecols=col_list) 

col_list        = ["id","searched_keywords", "title", "currated_title", "company", "location", "description", "employent_type"]  #employent_type misspelled in curratedjobs.csv
df              = pd.read_csv("dataset/curratedjobs2.csv", usecols=col_list) 



# Create a file containing all jobs, restricted to the columns above.
# Note: to_csv writes the row index as the first column by default.
#df.to_csv('dataset/curratedjobs2.csv')
df.to_csv('dataset/model_training_jobs2.csv')


#create file containing only unique titles to determine num of titles we are dealing with
#unique_title    = df['title'].unique()
#all_titles      = pd.Series(unique_title)

#save titles to csv
#all_titles.to_csv('dataset/unique_titles.csv')
#print(all_titles[0])



In [None]:
# =============================================================
# process file and add category, subcategory, title
# categories and sub categories taken from CareerBuilder.com
# =============================================================

# The training file covers 5 job categories: administrative_assistant, apprentice, painter, security_guard, other
# More data is needed, so Markovify is used to create additional data

# Read in the jobs; keep only the curated job title and the combined job-title text

df = pd.read_csv('dataset/model_training_jobs2.csv') 
df = df.fillna('')
df['currated_jobtitle']  = df.iloc[:,4]                            # curated (target) job title
df['combined_jobtitles'] = df.iloc[:,1]+df.iloc[:,2]+df.iloc[:,3]   # text the Markov models are built from
subset = df.iloc[:,-3:]



print(subset)


In [None]:
# may need to perform pip install markovify in launcher

import markovify
import codecs

In [None]:
# =============================================================
# Markovify is a simple, extensible Markov chain generator
# Its primary use is for building Markov models of large corpora
# of text and generating random sentences from that. 
# =============================================================

# Builds a Markov model for the given curated job title (e.g. administrative_assistant, painter, security_guard)
def train_markov_type(data, currated_jobtitle):
    return markovify.Text(data[data["currated_jobtitle"] == currated_jobtitle].combined_jobtitles, retain_original=False, state_size=2)

# Takes one of the job-title models and creates one randomly generated sentence of at most `length` characters (default 1000).
# Note: make_short_sentence may return None if no suitable sentence is found.
def make_sentence(model, length=1000):
    return model.make_short_sentence(length, max_overlap_ratio = .7, max_overlap_total=15)

#build models
admin_model          = train_markov_type(subset, "administrative_assistant")
apprentice_model     = train_markov_type(subset, "apprentice")
painter_model        = train_markov_type(subset, "painter")
security_model       = train_markov_type(subset, "security_guard")
other_model          = train_markov_type(subset, "other")
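
To make the idea behind these models concrete, here is a minimal pure-Python sketch of a word-level Markov chain with a state size of 2. It is not Markovify's implementation, and `build_chain` and `generate` are hypothetical helper names; it only illustrates how pairs of consecutive words are mapped to the words that may follow them.

```python
import random
from collections import defaultdict

def build_chain(sentences, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = ["<s>"] * state_size + sentence.split() + ["</s>"]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, max_words=30, rng=random):
    """Walk the chain from the start state until the end marker or max_words."""
    state = ("<s>",) * state_size
    output = []
    while len(output) < max_words:
        next_word = rng.choice(chain[state])
        if next_word == "</s>":
            break
        output.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(output)

# Toy corpus for illustration only (not taken from the scraped data).
corpus = [
    "office admin needed for busy front desk",
    "admin assistant needed for small office",
]
chain = build_chain(corpus)
print(generate(chain, rng=random.Random(0)))
```

Because both toy sentences share the word pair "needed for", the chain can cross from one sentence into the other at that point, which is how new, unseen sentences arise.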


We can combine these models with relative weights, so that some categories are sampled more often than others.

In [None]:
# =============================================================
# combine models with relative weights
# =============================================================

import numpy

def generate_cases(models, weights=None):
    if weights is None:
        weights = [1] * len(models)
    
    choices = [] # List of (cumulative weight threshold, (model, category)) tuples
    
    total_weight = float(sum(weights))
    
    for i in range(len(weights)):
        choices.append((float(sum(weights[0:i+1])) / total_weight, models[i]))
    
    # Return a tuple of model and category that are randomly selected by given weights.
    def choose_model():
        r = numpy.random.uniform()
        for (model_weight, model) in choices:
            if r <= model_weight:
                return model
        return choices[-1][1]

    while True:
        local_model = choose_model() 
        # local_model[0] is the markovify model, local_model[1] is the category
        yield make_sentence(local_model[0]), local_model[1]
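
As a worked example of the cumulative thresholds that `generate_cases` builds, the sketch below computes them for the weights `[28, 7, 7, 7, 7]` used later in this notebook. The helper names `cumulative_thresholds` and `pick_index` are hypothetical; they mirror the arithmetic of the loop above without the models.

```python
from itertools import accumulate

def cumulative_thresholds(weights):
    """Return the running sum of the weights divided by their total."""
    total = float(sum(weights))
    return [running / total for running in accumulate(weights)]

def pick_index(thresholds, r):
    """Return the index of the first threshold that r does not exceed."""
    for index, threshold in enumerate(thresholds):
        if r <= threshold:
            return index
    return len(thresholds) - 1

thresholds = cumulative_thresholds([28, 7, 7, 7, 7])
print(thresholds)                    # [0.5, 0.625, 0.75, 0.875, 1.0]
print(pick_index(thresholds, 0.3))   # 0, the first model in the list
print(pick_index(thresholds, 0.7))   # 2, the third model in the list
```

With these weights, a uniform random draw selects the first model half of the time and each of the other four models 12.5% of the time.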
   


Generate new job descriptions and classify them as: administrative_assistant, apprentice, painter, security_guard, and other.

Store the new job descriptions and job titles in the file generated_jobs_data.csv.

In [None]:
import csv

generated_cases = generate_cases([(admin_model,'administrative_assistant'), 
                                  (apprentice_model,'apprentice'), 
                                  (painter_model,'painter'),
                                  (security_model,'security_guard'),
                                  (other_model,'other')], [28,7,7,7,7])

# Tuples with sentence and category
sentence_tuples = [next(generated_cases)  for i in range(2000)]  # create 2000 sentence/category tuples

# Write to CSV file (the previous output file was testdata1.csv)
with open('dataset/generated_jobs_data.csv', 'w') as file:
    writer = csv.writer(file, delimiter=',', lineterminator='\n')
    writer.writerows(sentence_tuples)
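
One way to sanity-check the generated file is to count how many rows fall into each category and compare the shares with the requested weights. The sketch below does this on a small in-memory CSV so that it runs on its own. The sample rows and the `category_counts` helper are made up for illustration; in the notebook, the same function could be given `dataset/generated_jobs_data.csv` opened with `open(...)`.

```python
import csv
import io
from collections import Counter

def category_counts(csv_file):
    """Count rows per category, where each row is (sentence, category)."""
    counts = Counter()
    for row in csv.reader(csv_file):
        if len(row) == 2:          # skip blank or malformed rows
            counts[row[1]] += 1
    return counts

# Made-up sample rows standing in for the generated file.
sample = io.StringIO(
    "office admin needed,administrative_assistant\n"
    "front desk assistant,administrative_assistant\n"
    "house painter wanted,painter\n"
    "night security guard,security_guard\n"
)
counts = category_counts(sample)
total = sum(counts.values())
for category, count in counts.most_common():
    print(f"{category}: {count} ({count / total:.0%})")
```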

At this point we have created a new data set. There is, however, a problem we must overcome: machine learning models cannot work directly with raw text. Therefore, we must convert the textual data into a numeric form.

We can do this using tokenization. Continue to 02-TokenDemo.ipynb.
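
As a preview of the idea, the sketch below builds a word-to-integer vocabulary and encodes sentences as lists of integers. It is a minimal pure-Python illustration with hypothetical helper names (`build_vocab`, `encode`), and it is not necessarily the approach taken in 02-TokenDemo.ipynb.

```python
def build_vocab(sentences):
    """Assign each distinct lowercase word an integer id, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode(sentence, vocab, unknown_id=0):
    """Convert a sentence to a list of ids; unseen words map to unknown_id."""
    return [vocab.get(word, unknown_id) for word in sentence.lower().split()]

sentences = ["office admin needed", "admin assistant needed"]
vocab = build_vocab(sentences)
print(vocab)                                     # {'office': 1, 'admin': 2, 'needed': 3, 'assistant': 4}
print(encode("office assistant wanted", vocab))  # [1, 4, 0]
```

Reserving id 0 for unknown words is a design choice made here for illustration; it keeps words that never appeared in the training data from raising an error at encoding time.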