# Data Science Job Dashboard

## Imports

In [2]:
# Plotting
from matplotlib import pyplot as plt
from plotly import express as px
from wordcloud import WordCloud

def make_wordcloud(text):
    wordcloud = WordCloud(background_color='white', width=3200, height=2400)
    wordcloud.generate(text)
    fig = plt.figure(figsize = (16,12))
    plt.imshow(wordcloud)
    plt.axis("off")
    return fig

# Inspecting modules
import inspect
from IPython.display import Markdown, display

def display_object(obj):
    source = inspect.getsource(obj)
    wrapped_source = f'```python\n{source}\n```'
    markdown_source = Markdown(wrapped_source)
    display(markdown_source)
    
# Basics
import numpy as np
import pandas as pd

# Timing 
from tqdm import tqdm

# Processed data storage
import pickle

# Build corpus
from gensim import corpora

# Latent Dirichlet Allocation
from gensim.models import LdaModel

# K-means
from sklearn.cluster import KMeans

# Clustering Evaluation
from sklearn.metrics import calinski_harabasz_score, silhouette_score, confusion_matrix

# Word count vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Model evaluation
from sklearn.model_selection import train_test_split, GridSearchCV

# Naive Bayes
from sklearn.naive_bayes import MultinomialNB


## Contact Information

Seth Chart, PhD

Data Scientist & Mathematician

Seeking opportunities in Data Science
 * Email: [seth.chart@protonmail.com](mailto:seth.chart@protonmail.com)
 * Phone: [443.303.7114](tel:4433037114)
 * [Resume](https://sethchart.com/resume-sethchart.pdf)
 * [GitHub](https://github.com/sethchart)
 * [LinkedIn](https://www.linkedin.com/in/sethchart)
 * [Website](https://sethchart.com)
 * [Twitter](https://www.linkedin.com/in/sethchart)

## Business Understanding

The job market for the data industry can be difficult to navigate because there are a plethora of job titles and the relationship between titles and roles is often not well defined.
Two identical roles may have completely different titles.
Two substantially different roles may have the same title.
This limits the usability of job titles as a means to succinctly communicate about data industry jobs.
This project will address the issue by directly analyzing full job descriptions from a corpus of job postings to provide three main deliverables:

### Objectives

 * First, clusters of similar jobs, identified from the language used in job descriptions (without titles) and based on their roles and responsibilities.
 * Second, a tool for classifying a provided job description according to our scheme.
 * Third, a comparison of how well our classification scheme and existing job titles distinguish between roles.

### Available Resources

Recently two of the most recognizable job posting sites, LinkedIn and Indeed, have closed their job posting APIs and taken steps to discourage web scraping. This means that our preferred data sources were not available. 

We found that [careerjet.com](http://careerjet.com) has a public API and a fairly simple page structure for job postings. The official API [page](https://www.careerjet.com/partners/api/) for careerjet provides a Python API package. Unfortunately, the official Python package is not functional. A functional, unofficial fork of the package can be found [here](https://github.com/davebulaval/careerjet-api). 

The careerjet API is designed as a method for serving ads and tracks visits to careerjet job postings. For this reason, there are fairly restrictive rate limits for retrieving postings through the API. These limitations are not clearly documented, but they rendered the API unusable for large scale data collection.

After discovering that the careerjet API would not be usable for data collection we investigated the possibility of scraping job postings by exploiting the page number url parameter to iterate over search result pages, scrape posting results from each result listing, then scrape each result url. However, after experimentation we discovered that the site only surfaces one hundred pages of search results with twenty postings per page.

Finally, we determined that by initiating a scraping process on the post listed first in the careerjet search results and using Selenium to advance to the next search result, we were able to access up to ten thousand job postings in a scraping session.

### Data Mining Goals

The goal of our data mining process was to obtain a reasonably large corpus of job postings consisting of a job title and a job description within the data job sector. 

### Project Plan

#### Data Collection

 1. Initiate a search for job postings located in the United States containing the keyword `data`. 
 2. Traverse and scrape job postings using the Selenium webdriver.
 3. Store scraped posts in a SQLite database for further analysis. 

#### Data Cleaning

The work-flow below applies to both job descriptions and job titles. This process seeks to distill the raw text to a list of unique and independent tokens of information.
 
 1. Lowercase and remove newline characters.
 2. Tokenize documents into sentences.
 3. Tokenize sentences into words.
 4. Tag words with parts-of-speech tags.
 5. Lemmatize words based on parts-of-speech tags.
 6. Remove stopwords and special characters.
 7. Group common bigrams and trigrams.
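
The first steps and the stopword step of this work-flow can be sketched with the standard library alone. The snippet below is a simplified illustration, not the project's `DataProcessor` code: the `simple_clean` function and the tiny `STOPWORDS` set are hypothetical, sentence splitting is done with a naive regular expression, and parts-of-speech tagging, lemmatization, and phrase grouping are omitted.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "is", "are", "we"}

def simple_clean(document):
    """Lowercase, split into sentences and words, then drop stopwords and symbols."""
    # Step 1: lowercase and remove newline characters.
    text = document.lower().replace("\n", " ")
    # Step 2: naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    tokens = []
    for sentence in sentences:
        # Step 3: keep runs of letters, digits, '+' and '#' so "c++" and "c#" survive.
        words = re.findall(r"[a-z0-9+#]+", sentence)
        # Step 6: remove stopwords (special characters were dropped by the regex).
        tokens.extend(word for word in words if word not in STOPWORDS)
    return tokens

example = "We are hiring a Data Scientist.\nExperience with Python and SQL is required!"
print(simple_clean(example))
```

The result is a flat list of lowercase tokens for the whole document, which is the shape that the later modeling steps consume.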

#### Modeling 

Essentially we wish to model two activities related to searching for a job. First, reading full job descriptions to determine the relevant skills and requirements, thereby classifying the job. Second, skimming over job titles to classify jobs. 

The goal of a job listing should be to efficiently communicate the requirements of the job so that employers and job seekers can easily match. Job titles are essentially a summary of the full job description which should allow a job seeker to quickly reject postings that are not relevant for further review. When the job title is not predictive of the requirements of the job, a job seeker must review full job descriptions to correctly reject job postings or accept a high false negative rate for title skimming. Either approach leads to reduced efficiency in the job seeking process. 

Our first model will produce a data supported summary of a job description, which should be as easy to parse as a job title, but more predictive of the job description.

Our second model will try to simulate the process of classifying a job by reading the job title alone. From this model, we wish to estimate the false negative rate for this method of rejecting a job posting for further review. 

##### Unsupervised Learning on Job Descriptions

Having distilled job descriptions to token lists we wish to derive meaningful representations of the job descriptions, which lend themselves to succinct classification of jobs. To this end we propose the following work-flow.
 
 1. Convert token lists to bag-of-words representation.
 2. Train a Latent Dirichlet Allocation model on the bag-of-words representations to extract a latent topics representation of the job descriptions. 
 3. Train a K-Means clustering model on the latent topics representations to identify clusters of similar job descriptions.
 
 The labeling of job descriptions with their corresponding cluster label provides our first deliverable.
 
 Having trained both LDA and K-Means models on the collected data, we can feed an unseen job description through our model pipeline and assign a cluster label. This provides our second deliverable.
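
To make the first step of this work-flow concrete, here is a minimal pure-Python sketch of a bag-of-words conversion. The helper names `build_vocabulary` and `to_bag_of_words` are hypothetical; in the project this step is handled by gensim's `Dictionary` and its `doc2bow` method, which likewise represents a document as a list of `(token_id, count)` pairs.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

def to_bag_of_words(tokens, vocabulary):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(vocabulary[token] for token in tokens if token in vocabulary)
    return sorted(counts.items())

docs = [["data", "engineer", "sql", "data"], ["data", "scientist", "python"]]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(doc, vocab) for doc in docs]
print(bows)
```

Tokens that are not in the vocabulary are ignored, which is how an unseen job description can still be converted with a vocabulary built from the training corpus.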

##### Supervised Learning on Job Titles

Having classified jobs to the best of our abilities using the full job description, we now wish to determine how effectively we are able to predict the cluster label of the job using only the job title. We propose the following work-flow.

 1. Convert job title token lists to token count vectors.
 2. Train a Multinomial Naive Bayes' classifier that takes job title token count vectors as inputs and predicts job cluster labels. This simulates a job seeker classifying a job based on the occurrence of key words or phrases in the job title. 
 

#### Deployment


We will provide a publicly available interface, which will allow the end user to classify raw job description text.

## Data Understanding

In order to produce our deliverables, we needed a sizable corpus of data industry job descriptions.
We were able to obtain a corpus of approximately 9,485 job descriptions paired with their assigned job titles.
The descriptions from this corpus will serve as our data for deliverables one and two.
We will use the job titles from this corpus as data for deliverable three.

### Data Collection and Storage

Data collection is executed by the `scrape` script, which depends on the `careerjet` and `JobsDb` modules. Essentially, the script below executes the following steps.
 1. Instantiate a `JobsDb` object, which provides methods for writing records to the database.
 2. Instantiate a `Scraper` object which opens a chrome browser and navigates to the first careerjet.com job posting page.
 3. Scrape page contents.
 4. Write record to database.
 5. Advance to next page and return to Step 3.
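
The control flow of these steps can be illustrated without a browser. The sketch below is an assumption-laden stand-in, not the project's code: `FakeScraper`, `run_scrape`, and the three-column table schema are hypothetical, the real scraper drives Chrome through Selenium, and an in-memory SQLite database replaces the project's database file.

```python
import sqlite3

class FakeScraper:
    """Hypothetical stand-in for the Selenium-backed scraper."""

    def __init__(self, pages):
        self.pages = pages
        self.position = 0

    def scrape_page(self):
        """Return the record for the posting currently being viewed."""
        return self.pages[self.position]

    def next_page(self):
        """Advance to the next posting; return False when there is none."""
        if self.position + 1 >= len(self.pages):
            return False
        self.position += 1
        return True

def run_scrape(scraper, connection):
    """Scrape, store, advance, and repeat until no further posting is available."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS jobs (url TEXT PRIMARY KEY, title TEXT, description TEXT)"
    )
    while True:
        record = scraper.scrape_page()
        # INSERT OR IGNORE skips postings whose url is already stored.
        connection.execute(
            "INSERT OR IGNORE INTO jobs VALUES (:url, :title, :description)", record
        )
        connection.commit()
        if not scraper.next_page():
            break

pages = [
    {"url": "https://example.com/1", "title": "Data Analyst", "description": "SQL and reporting"},
    {"url": "https://example.com/2", "title": "Data Engineer", "description": "Pipelines"},
]
connection = sqlite3.connect(":memory:")
run_scrape(FakeScraper(pages), connection)
row_count = connection.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(row_count)
```

Committing after every record means that an interrupted scraping session keeps everything collected so far, which matters for sessions that visit thousands of postings.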

#### `scrape` script

In [None]:
import scrape
display_object(scrape.main)

#### `careerjet` module

In [None]:
import careerjet
display_object(careerjet)

#### `JobsDb` module

In [None]:
import JobsDb
display_object(JobsDb)

### Data Description


We executed two runs of the `scrape` script. The first run did not include `data` as a search term, so it retrieved a sampling of the full job market. The second run included the search term `data`, so it retrieved a sampling of the data industry job market.

#### Load Full Dataset

In the cell below we leverage the JobsDb module to load our full data set from the database.

In [None]:
db = JobsDb.JobsDb()
df = db.load_table_as_df('jobs')
db.close()
df.head()

#### Total Records

In [None]:
display(Markdown(f' Our database currently contains {len(df)} job postings scraped from careerjet.com.'))

#### Records Containing `data` Keyword

In [None]:
query = """
SELECT * from jobs
WHERE title LIKE '%data%'
OR description LIKE '%data%';
"""
db = JobsDb.JobsDb()
data = db.load_query_as_df(query)
db.close()
data.head()

In [None]:
display(Markdown(f' Our database currently contains {len(data)} job postings scraped from careerjet.com with the keyword `data`.'))

### Data Exploration

Below we explore the distribution of lengths, measured by the number of characters for job titles and job descriptions.

#### Job Titles

In [None]:
title_length = data['title'].apply(len)
median_title_length = title_length.median()
fig = px.histogram(
    title_length,
    title = f'Distribution of Job Title Lengths (Median {median_title_length})',
    labels = {
        'value': 'Title Length (characters)',
    },
    nbins = 45,
)
fig.layout.update(showlegend=False)
fig.show()

#### Job Descriptions

In [None]:
description_length = data['description'].apply(len)
median_description_length = description_length.median()
fig = px.histogram(
    description_length,
    title = f'Distribution of Job Description Lengths (Median {median_description_length})',
    labels = {
        'value': 'Description Length (characters)',
    },
    nbins = 45,
)
fig.layout.update(showlegend=False)
fig.show()

### Data Validation

Below we check data types and ensure that there are no missing values.

#### Data Types

All three columns from our dataset are formatted as strings. 

In [None]:
data.dtypes

#### Missing Values

There are no missing values in this dataset.

In [None]:
data.isna().sum()

## Data Preparation

In this section we take an in depth look at the data preparation process. All data processing is handled by the `DataProcessor` module.

In [None]:
import DataProcessor
display_object(DataProcessor)

### Data Selection

For the purposes of this analysis we are only interested in job postings related to the data industry. For this reason we will only include postings that contain the keyword `data` in either the job title or the job description. We have already extracted this data from the database [here](#Records-Containing-data-Keyword). For your convenience we reproduce the required code below.
```python
query = """
SELECT * from jobs
WHERE title LIKE '%data%'
OR description LIKE '%data%';
"""
db = JobsDb.JobsDb()
data = db.load_query_as_df(query)
db.close()
```
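
It is worth being explicit about what this filter matches. In SQLite, `LIKE` is case-insensitive for ASCII characters by default, and the pattern `'%data%'` matches any text containing `data` as a substring, so words such as `database` also qualify. The sketch below demonstrates this on a small in-memory table; the table contents are invented for illustration and the two-column schema is an assumption.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE jobs (title TEXT, description TEXT)")
connection.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [
        ("Data Scientist", "Build models"),                  # matches on title, despite the capital D
        ("Nurse", "Maintain patient database records"),      # matches on the substring in "database"
        ("Line Cook", "Prepare meals"),                      # no match
    ],
)
query = """
SELECT title FROM jobs
WHERE title LIKE '%data%'
OR description LIKE '%data%';
"""
matches = [row[0] for row in connection.execute(query)]
print(matches)
```

The selection is therefore deliberately broad: it keeps any posting that mentions data in some form, including postings outside the core data industry.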

### Data Cleaning

To quickly demonstrate the steps in our data cleaning pipeline, we will select a single example record and apply each step in sequence.

#### Selecting an Example Job Post

Below we select a job posting at random from the data.

In [None]:
example_index = np.random.choice(len(data))
job_post = data.iloc[example_index]
title = job_post['title']
description = job_post['description']
print(f'{title}\n\n{description}')

##### Raw description word cloud

In [None]:
fig = make_wordcloud(description)

#### Tokenizing Text

The `doc_tokenizer` method first splits a document into a list of sentences, then splits each sentence into a list of words. This method depends on the following methods:
 * `sent_tokenize` and `word_tokenize` from `nltk.tokenize`.

##### `doc_tokenizer` method

In [None]:
display_object(DataProcessor.doc_tokenizer)

##### Example output

In [None]:
title_tokens = DataProcessor.doc_tokenizer(title)
description_tokens = DataProcessor.doc_tokenizer(description)
print(f'{title_tokens}\n\n{description_tokens}')

#### Parts of Speech Tagging

The `doc_pos_tagger` method parses the tokenized text from the step above and applies a wordnet part of speech tag. This method depends on the following methods:
 * `sentence_pos_tagger` and `get_wordnet_pos` from `DataProcessor`
 * `pos_tag` from `nltk`
 * `wordnet` from `nltk.corpus`

##### `doc_pos_tagger` method

In [None]:
display_object(DataProcessor.doc_pos_tagger)

##### Example output

In [None]:
title_pos_tags = DataProcessor.doc_pos_tagger(title_tokens)
description_pos_tags = DataProcessor.doc_pos_tagger(description_tokens)
print(f'{title_pos_tags}\n\n{description_pos_tags}')

##### `sentence_pos_tagger` method

In [None]:
display_object(DataProcessor.sentence_pos_tagger)

##### `get_wordnet_pos` method

In [None]:
display_object(DataProcessor.get_wordnet_pos)

#### Lemmatization

The `doc_lemmatizer` method parses the POS tagged text and, where possible, replaces words with their [lemmas](https://en.wikipedia.org/wiki/Lemmatisation). This method depends on the following methods and classes:
 * `sentence_lemmatizer` and `tag_lemmatizer` from `DataProcessor`
 * `WordNetLemmatizer` from `nltk.stem.wordnet`

##### `doc_lemmatizer` method

In [None]:
display_object(DataProcessor.doc_lemmatizer)

##### Example output

In [None]:
title_lemmas = DataProcessor.doc_lemmatizer(title_pos_tags)
description_lemmas = DataProcessor.doc_lemmatizer(description_pos_tags)
print(f'{title_lemmas}\n\n{description_lemmas}')

##### `sentence_lemmatizer` method

In [None]:
display_object(DataProcessor.sentence_lemmatizer)

##### `tag_lemmatizer` method

In [None]:
display_object(DataProcessor.tag_lemmatizer)

#### Clean Up

The `doc_clean` method removes special characters and stopwords. It depends on the following methods.
 * `stopwords` from `nltk.corpus`

##### `doc_clean` method

In [None]:
display_object(DataProcessor.doc_clean)

##### Example output

In [None]:
title_clean = DataProcessor.doc_clean(title_lemmas)
description_clean = DataProcessor.doc_clean(description_lemmas)
print(f'{title_clean}\n\n{description_clean}')

##### Clean word cloud

In [None]:
description_clean_text = ' '.join(description_clean)
fig = make_wordcloud(description_clean_text)

#### Process Full Dataset

In the cells below we process the full data set using the `data_processor` method.

##### `data_processor` method

In [None]:
display_object(DataProcessor.data_processor)

##### Processing titles

In [None]:
titles = data['title']
titles_processed = DataProcessor.data_processor(tqdm(titles))

##### Processing descriptions

This cell takes several minutes to run.

In [None]:
descriptions = data['description']
descriptions_processed = DataProcessor.data_processor(tqdm(descriptions))

### Feature Engineering

#### Combining Common Phrases

Having cleaned our data we wish to combine common phrases into bigrams, trigrams, and quadgrams. This helps to ensure that each token in our final representation of our text is an independent unit of information.

The `data_combine_phrases` method uses the full corpus to detect common phrases and combine them into bigrams, trigrams, and quadgrams. When this method is run, it saves a copy of the two required `Phrases` models to the `model` folder. This method depends on the following class.
 * `Phrases` from `gensim.models`
 
Once `data_combine_phrases` has built a phrase model for the provided data, we can combine phrases for an unseen document.

The `doc_combine_phrases` method will combine common phrases into $n$-grams for an unseen document. This method has the following dependencies.
 * `Phrases` from `gensim.models`
 * The model trained by `data_combine_phrases`
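
The idea behind phrase combination can be sketched in a few lines. The example below is a simplification under stated assumptions: `find_common_bigrams` and `merge_bigrams` are hypothetical helpers that merge any adjacent pair seen at least `min_count` times, whereas gensim's `Phrases` scores candidate pairs before merging them.

```python
from collections import Counter

def find_common_bigrams(token_lists, min_count=2):
    """Return adjacent token pairs that occur at least min_count times in the corpus."""
    pair_counts = Counter()
    for tokens in token_lists:
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in pair_counts.items() if count >= min_count}

def merge_bigrams(tokens, bigrams):
    """Join known bigrams in a token list with an underscore, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

corpus_tokens = [
    ["machine", "learning", "engineer"],
    ["senior", "machine", "learning", "scientist"],
    ["data", "engineer"],
]
bigrams = find_common_bigrams(corpus_tokens)
print(merge_bigrams(["machine", "learning", "intern"], bigrams))
```

Running the detection a second time on the merged output lets already merged pairs combine again, which is one way that two passes can yield phrases of up to four words.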

##### `data_combine_phrases` method

In [None]:
display_object(DataProcessor.data_combine_phrases)

##### Training on full dataset

In [None]:
title_phrases = DataProcessor.data_combine_phrases(tqdm(titles_processed), 'title')

In [None]:
description_phrases = DataProcessor.data_combine_phrases(tqdm(descriptions_processed), 'description')

##### `doc_combine_phrases` method

In [None]:
display_object(DataProcessor.doc_combine_phrases)

##### Example output

In [None]:
title_grams = DataProcessor.doc_combine_phrases(title_clean, 'title')
description_grams = DataProcessor.doc_combine_phrases(description_clean, 'description')
print(f'{title_grams}\n\n{description_grams}')

##### Description grams word cloud

In [None]:
description_grams_text = ' '.join(description_grams)
fig = make_wordcloud(description_grams_text)

### Processed Data Storage

As we saw above, it takes around ten minutes in total to execute data cleaning and feature engineering on the full data set. For this reason it is useful to store our processed data for future use. Below we package our raw data with our processed data in a list of dictionaries and save the resulting object as a pickle file.

#### Combining raw data and processed data for storage

In [None]:
data_records = data.to_dict('records')
for post, description_tokens, title_tokens in tqdm(zip(data_records, description_phrases, title_phrases)):
    post['title_tokens'] = title_tokens
    post['description_tokens'] = description_tokens

#### Inspecting the first record

In [None]:
data_records[0]

#### Saving to pickle file

The line below should only be run if the `processed_data` file needs to be updated.

```python
with open('../data/processed_data.pkl', mode='wb') as file:
    pickle.dump(data_records, file)
```

## Modeling

The purpose of a job posting is to communicate the responsibilities and requirements of a job to potential applicants. When a job seeker is viewing job postings, their goal is to identify jobs that are closely aligned with their skills efficiently. Both employers and job seekers benefit from postings that efficiently communicate the precise requirements of a job. The employer receives more relevant job applicants, which results in a more efficient hiring process. The job seeker is able to evaluate jobs effectively and invest time in applying to only those jobs that are relevant to their skills, which results in a more efficient job seeking process.

The first step of the job seeking process is to evaluate large batches of job postings and identify relevant postings. In this phase, the greatest contributor to efficiency is the ability to rapidly reject irrelevant postings. The most consistently available information in a job posting is a job title and a job description. A fairly standard approach to reviewing a collection of job postings is to read the job titles, reject all of the postings with irrelevant job titles and then review the job descriptions for the remaining job posts. Because reviewing a job description can take substantially longer than reviewing a job title, it is not feasible to review all job descriptions. 

An issue arises with this strategy if job titles do not provide an accurate classification of the jobs that they describe, because the nature of a job is then predicted from its title with low accuracy. In the case of a false negative, the job seeker incorrectly classifies a job as irrelevant and discards it without further review, missing an opportunity to apply for a relevant job. In the case of a false positive, the job seeker carries an irrelevant job posting forward for further review, incurring an opportunity cost by wasting time that could have been used pursuing a relevant job.

In this section, we take the view that a job description is a complete and correct representation of the job. Our first modeling task is to produce a quantitative representation of the data contained in the job description. For this task, we have selected a Latent Dirichlet Allocation (LDA) model. This type of model assumes that every document (job description) in our corpus is created by selecting a mixture of topics (responsibilities and requirements) and then selecting words according to a distribution that is conditioned on the mixture of topics. This is a plausible model for job descriptions since the author of the job posting must describe a handful of requirements of the job, where each requirement will have a handful of distinguishing key words. 
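
The generative story that LDA assumes can be simulated directly. The following sketch is illustrative only: the two toy topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical. It draws a topic mixture from a Dirichlet distribution, then picks a topic for each word position and a word from that topic.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [draw / total for draw in draws]

def generate_document(topic_word_probs, alpha, length, rng):
    """Generate a document: one topic mixture, then a topic and a word per position."""
    mixture = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topic_word_probs)))
    words = []
    for _ in range(length):
        # Choose which topic this word position belongs to.
        topic = rng.choices(topic_ids, weights=mixture)[0]
        # Choose a word according to that topic's word distribution.
        vocabulary = list(topic_word_probs[topic])
        weights = list(topic_word_probs[topic].values())
        words.append(rng.choices(vocabulary, weights=weights)[0])
    return mixture, words

topics = [
    {"sql": 0.5, "pipeline": 0.3, "etl": 0.2},          # toy "data engineering" topic
    {"model": 0.5, "statistics": 0.3, "python": 0.2},   # toy "modeling" topic
]
rng = random.Random(0)
mixture, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(mixture, words)
```

Fitting an LDA model runs this story in reverse: given only the words, it infers plausible topics and a topic mixture for each document.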

A fitted LDA model produces a list of topics, each containing a list of top keywords. For any document, the model provides a probability vector indicating the mixture of topics that are present in the document. 

Because the list of topics is inferred from the corpus of documents, we must review the list of top keywords and identify the essential meaning of each topic. 

One of the hyper-parameters of an LDA model is the number of topics to detect, and it is important that we select it carefully. We want our topics to be coherent, which generally drives us toward more topics, each capturing a smaller and more precise theme. On the other hand, we want our topics to be distinct, which generally drives us toward fewer topics. By balancing these two considerations, we will select an appropriate number of topics for our LDA model. 

Once we have a topic mixture representation of our job descriptions, we can cluster jobs into classes with similar mixtures of topics. For this task we will use a K-means clustering algorithm.

We will need to tune the number of clusters for our K-means clustering to ensure that jobs are grouped into distinct and self similar classes. 

Finally, we will train a multinomial Naive Bayes' classifier on the job titles to predict the job class produced by LDA and K-means. The accuracy of this model provides an estimate of the accuracy of classifying jobs as relevant or not based on job title alone. 

### Architecture

### Testing Plan

Both LDA and K-means are unsupervised learning techniques, so we do not have access to ground truth against which to test them. Instead, we will need to carefully tune hyper-parameters using internal quality measures to ensure that these models perform well.

For our Multinomial Naive Bayes' classifier, we will implement a train test split for model verification and use five-fold cross-validation to validate our selection of the number of clusters.

### Build

#### Load Processed Data

In [3]:
with open('../data/processed_data.pkl', mode='rb') as file:
    data_records = pickle.load(file)

title_tokens = [record['title_tokens'] for record in data_records]
description_tokens = [record['description_tokens'] for record in data_records]

#### Latent Dirichlet Allocation Topic Model

##### Build dictionary and corpus

In [4]:
dictionary = corpora.Dictionary(tqdm(description_tokens))
dictionary.filter_extremes(no_below=3)

100%|██████████| 9666/9666 [00:04<00:00, 2082.48it/s]


In [5]:
corpus = [dictionary.doc2bow(token) for token in tqdm(description_tokens)]

100%|██████████| 9666/9666 [00:02<00:00, 3826.47it/s]


##### Build LDA models

In [6]:
response = input('This cell takes a long time to run. Would you like to skip it? (y/n)')
if response.lower() == 'y':
    pass
elif response.lower() =='n':
    np.random.seed(42)
    eta = [0.01]*len(dictionary.keys())
    for num_topics in tqdm(range(2,31)):
        alpha = [0.01]*num_topics
        lda_model = LdaModel(
            corpus, 
            num_topics=num_topics,
            id2word=dictionary,
            passes=4, 
            alpha=alpha,
            eta=eta
        )
        file_path = f'../model/LDA-{num_topics}topics'
        lda_model.save(file_path)
else:
    print('Could not interpret response. Run cell again.')

This cell takes a long time to run. Would you like to skip it? (y/n)y


##### Tuning the number of topics for LDA

A Latent Dirichlet Allocation model tries to learn a fixed number of latent topics from a corpus of texts. In order to produce the best possible representation of our job descriptions it is important to select an appropriate number of topics. 

We will use two measures of model quality to select the number of topics for our final model. First, mean Jaccard similarity of topics, which essentially measures how much overlap there is between topics. A lower value of this measure indicates a better model. Second, coherence of topics, which essentially measures how internally consistent the top words are from each topic. A higher value of this measure indicates a better model.

Both measures return values in the range from zero to one and we will select the number of topics which produces the largest difference of coherence minus Jaccard similarity, preferring fewer topics in the case of a near tie.
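
As a rough illustration of this criterion, the sketch below computes the mean pairwise Jaccard similarity between topics, treating each topic as a set of its top words, and then picks the smallest number of topics whose score is within a tolerance of the best. It is not the code in `topic_selection`: the function bodies, the toy numbers, and the `tolerance` value that defines a "near tie" are all assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two word collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def mean_jaccard_similarity(topics):
    """Average Jaccard similarity over all pairs of topics."""
    pairs = list(combinations(topics, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def select_num_topics(measures, tolerance=0.01):
    """Pick the smallest n whose (coherence - Jaccard) is within tolerance of the best."""
    diffs = {n: coherence - jac for n, (coherence, jac) in measures.items()}
    best = max(diffs.values())
    return min(n for n, diff in diffs.items() if diff >= best - tolerance)

toy_topics = [["data", "sql", "etl"], ["data", "model", "python"], ["sales", "client", "crm"]]
print(mean_jaccard_similarity(toy_topics))

# Hypothetical (coherence, mean Jaccard) pairs keyed by number of topics.
toy_measures = {2: (0.40, 0.10), 3: (0.52, 0.08), 4: (0.525, 0.08)}
print(select_num_topics(toy_measures))
```

In the toy measures, four topics score marginally higher than three, but the difference falls inside the tolerance, so the rule prefers three.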

##### Computing measures

To compute both measures for each saved LDA model, we call the `get_measures` method from the `topic_selection` module. This method takes substantial time to run.

In [8]:
import topic_selection
display_object(topic_selection.get_measures)

```python
def get_measures(texts, dictionary):
    """get_measures. Collects Jaccard similarity and coherence for all trained
    LDA models.

    Parameters
    ----------
    texts :
        Tokenized documents used to compute topic coherence.
    dictionary :
        Dictionary mapping tokens to ids, as used to train the LDA models.
    """
    measures_list = {
        'n': [],
        'mean_jaccard': [],
        'coherence': []
    }
    for n in tqdm(range(2,31)):
        model = get_model(n)
        topics = get_topics(model)
        measures_list['n'].append(n)
        measures_list['mean_jaccard'].append(mean_jaccard_similarity(topics))
        measures_list['coherence'].append(get_coherence(model, texts, dictionary))
    return measures_list

```

In [11]:
response = input('This cell takes a long time to run. Would you like to skip it? (y/n)')
if response.lower() == 'y':
    pass
elif response.lower() =='n':
    measures_list = topic_selection.get_measures(description_tokens, dictionary)
    with open('../model/measures.pkl', mode='wb') as file:
        pickle.dump(measures_list, file)
else:
    print('Could not interpret response. Run cell again.')

This cell takes a long time to run. Would you like to skip it? (y/n)y


In [12]:
with open('../model/measures.pkl', mode='rb') as file:
    measures = pickle.load(file)
    
measures_df = pd.DataFrame(measures)
measures_df['diff'] = measures_df['coherence'] - measures_df['mean_jaccard']
px.line(
    data_frame = measures_df,
    x = 'n',
    y = ['mean_jaccard', 'coherence', 'diff'],
    title = 'latent dirichlet allocation topic number selection'.title(),
    labels = {
        'n': 'Number of Topics',
        'variable': 'Metric'
    }
)


##### Selection of number of topics

Based on our selection criterion and the plot above we select 28 topics for our model.

In [None]:
num_topics = 28
lda_model = topic_selection.get_model(num_topics)

##### Future work

It might be advisable to investigate models with a number of topics greater than 28 in the future.

#### K-Means Clustering of Jobs

##### Computing topic distributions

In the cell below, we use our LDA model to compute topic distributions for our corpus of job descriptions. Each job description is represented as a 28 dimensional probability vector, which describes the mixture of topics present in the description.

In [None]:
def get_topic_distributions(lda_model, corpus):
    rows = []
    for description in tqdm(corpus):
        topics = lda_model.get_document_topics(description)
        # Size the vector from the model itself rather than a global variable.
        vec = np.zeros(lda_model.num_topics)
        for key, prob in topics:
            vec[key] = prob
        rows.append(vec)
    topic_distributions = np.array(rows)
    return topic_distributions

topic_distributions = get_topic_distributions(lda_model, corpus)

##### Tuning the number of clusters

For our K-means clustering model we need to select an appropriate number of clusters. To this end we will compute three well regarded measures of clustering quality: the Calinski-Harabasz score, the within-cluster sum of squares, and the silhouette score. We will select a number of clusters at which one or more of these measures exhibits a marked change in trend, preferring fewer clusters if there are multiple candidates.
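
To show what two of these measures compute, here is a small pure-Python sketch of the within-cluster sum of squares and the mean silhouette coefficient. It is meant for intuition on toy data, not as a replacement for the scikit-learn implementations used in this notebook; the helper names and the four example points are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster centroid."""
    total = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        center = centroid(members)
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster.
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == label and j != i]
        if not same:
            scores.append(0.0)  # convention: a singleton cluster scores zero
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy two-topic distributions forming two tight, well separated groups.
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels = [0, 0, 1, 1]
print(wcss(points, labels), silhouette(points, labels))
```

The within-cluster sum of squares always decreases as clusters are added, which is why we look for a change in trend rather than a minimum, while the silhouette score rewards clusters that are both tight and well separated.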

##### Computing measures

In [None]:
def get_measures(topic_distributions):
    measures = {
        'n_clusters': [],
        'CH_score': [],
        'WCSS_score': [],
        'S_score': []
    }

    for n_clusters in tqdm(range(2, 40)):
        clusterer = KMeans(n_clusters=n_clusters)
        preds = clusterer.fit_predict(topic_distributions)
        measures['CH_score'].append(calinski_harabasz_score(topic_distributions, preds))
        measures['WCSS_score'].append(clusterer.inertia_)
        measures['S_score'].append(silhouette_score(topic_distributions,preds))
        measures['n_clusters'].append(n_clusters)
    return measures

measures = get_measures(topic_distributions)

In [None]:
km_measures_df = pd.DataFrame(measures)
fig = px.line(
    data_frame = km_measures_df,
    x='n_clusters', 
    y=['CH_score', 'WCSS_score'], 
    title='K-means Quality Measures',

)
fig.show()

In [None]:
fig = px.line(
    data_frame = km_measures_df,
    x='n_clusters', 
    y='S_score', 
    title='Silhouette Score',
    labels ={
        'n_clusters': 'Number of Clusters',
        'S_score': 'Silhouette Score'
    }
)    
fig.show()

##### Selecting the number of clusters

The most marked change in trend is in the Silhouette Score at twelve clusters, so we select this value for our number of clusters.

In [None]:
num_clusters = 13
km_model = KMeans(num_clusters)

#### Multinomial Naive Bayes Classification of Jobs

##### Convert title tokens to word count vectors

In [None]:
vectorizer = CountVectorizer()
titles = [' '.join(title_list) for title_list in title_tokens]
X = vectorizer.fit_transform(titles)

##### Assign job classes using K-means model

In [None]:
y = km_model.fit_predict(topic_distributions)

##### Perform train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

##### Define gridsearch over $\alpha$

The parameter $\alpha$ is a smoothing parameter. By default, the Multinomial Naive Bayes' classifier uses Laplace smoothing, which corresponds to $\alpha = 1$. We also test Lidstone smoothing with $\alpha = 10^{-5}, 0.01, 0.1$. 
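
The effect of $\alpha$ is easiest to see on a single class. In additive smoothing, the estimated probability of a word given a class is the word's count plus $\alpha$, divided by the total count for the class plus $\alpha$ times the vocabulary size. The sketch below applies that formula to invented counts; the function name and the toy data are hypothetical.

```python
def smoothed_word_probabilities(word_counts, vocabulary, alpha):
    """Additively smoothed estimate of P(word | class) from raw counts."""
    total = sum(word_counts.get(word, 0) for word in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {
        word: (word_counts.get(word, 0) + alpha) / denominator
        for word in vocabulary
    }

vocabulary = ["analyst", "engineer", "scientist"]
# Toy title-token counts for one job class; "scientist" never appears in it.
counts_for_class = {"analyst": 8, "engineer": 2}

for alpha in (1e-5, 0.1, 1):
    probs = smoothed_word_probabilities(counts_for_class, vocabulary, alpha)
    print(alpha, probs["scientist"])
```

A small $\alpha$ keeps the estimates close to the observed frequencies and heavily penalizes unseen words, while a larger $\alpha$ pulls all word probabilities toward a uniform distribution.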

In [None]:
clf = MultinomialNB()
param_grid = {'alpha' : [1e-5, 1e-2, 1e-1, 1]}
grid = GridSearchCV(clf, param_grid, scoring='accuracy', n_jobs=-1, cv=5, verbose=2)

In [None]:
grid.fit(X_train, y_train)

In [None]:
grid_df = pd.DataFrame(grid.cv_results_)
px.line(
    data_frame = grid_df,
    x = 'param_alpha',
    y = 'mean_test_score'
)

##### Selecting best Multinomial Naive Bayes' model

As we saw above the best value of $\alpha$ was $0.1$.

In [None]:
mnb_model = grid.best_estimator_

### Assessment

Having built our models, we now wish to assess them.

#### Inspecting LDA Topics

In [None]:
topics = DataProcessor.get_topics()

#### Inspecting Job Clusters

#### Estimated Accuracy of Classifying Jobs by Title

Below we see that our best estimate for the accuracy of classifying data industry job postings by job title is 53%. So we would expect that a job seeker will incorrectly classify a job posting about half of the time when skimming job titles.

In [None]:
pred_test = mnb_model.predict(X_test)
score_test = mnb_model.score(X_test, y_test)
px.imshow(
    confusion_matrix(y_test, pred_test),
    title = f'Test Set Confusion Matrix (Accuracy {round(score_test,2)})',
    color_continuous_scale='Blues',
    height = 800,
    width = 800,
    labels = {
        'x': 'Predicted',
        'y': 'Observed',
        'color': 'Frequency'
    }
)
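
The accuracy in the plot title can be read directly off the confusion matrix: it is the sum of the diagonal divided by the total number of postings. The same matrix also yields a per-class false negative rate, which is the share of postings in an observed class that were assigned to a different class. The sketch below shows both calculations on a made-up three-class matrix; the numbers are illustrative and are not results from this project.

```python
def accuracy(matrix):
    """Fraction of all observations that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def false_negative_rates(matrix):
    """For each observed class (row), the share of its postings predicted as another class."""
    return [1 - row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Rows are observed classes and columns are predicted classes, matching the
# layout returned by sklearn.metrics.confusion_matrix.
toy_matrix = [
    [6, 2, 2],
    [3, 5, 2],
    [1, 4, 5],
]
print(accuracy(toy_matrix))
print(false_negative_rates(toy_matrix))
```

The per-class view matters for a job seeker because a title-skimming strategy can have acceptable overall accuracy and still miss most of the postings in the one class that they care about.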

## Evaluation

### Results

### Review 

### Conclusion

## Deployment

### Deployment Plan

### Monitoring and Maintenance

### Report

### Project Review