# Data Science Job Dashboard

## Contact Information
Seth Chart, PhD

Data Scientist & Mathematician

Seeking opportunities in Data Science
 * Email: [seth.chart@protonmail.com](mailto:seth.chart@protonmail.com)
 * Phone: [443.303.7114](tel:4433037114)
 * [Resume](https://sethchart.com/resume-sethchart.pdf)
 * [GitHub](https://github.com/sethchart)
 * [LinkedIn](https://www.linkedin.com/in/sethchart)
 * [Website](https://sethchart.com)
 * [Twitter](https://www.linkedin.com/in/sethchart)



## Business Understanding

The job market for the data industry can be difficult to navigate because there is a plethora of job titles, and the relationship between titles and roles is often poorly defined.
Two identical roles may have completely different titles.
Two substantially different roles may have the same title.
This limits the usability of job titles as a means to succinctly communicate about data industry jobs.
This project will address the issue by directly analyzing full job descriptions from a corpus of job postings to provide three main deliverables:

### Objectives

 * First, clusters of similar jobs, identified from the language used in job descriptions (without titles) and grouped by roles and responsibilities.
 * Second, a tool for classifying a provided job description according to our scheme.
 * Third, a comparison of how well our classification scheme and existing job titles distinguish between roles.

### Available Resources

Recently, two of the most recognizable job posting sites, LinkedIn and Indeed, have closed their job posting APIs and taken steps to discourage web scraping. As a result, our preferred data sources were not available.

We found that [careerjet.com](http://careerjet.com) has a public API and a fairly simple page structure for job postings. The official API [page](https://www.careerjet.com/partners/api/) for careerjet provides a Python API package. Unfortunately, the official Python package is not functional. A functional, unofficial fork of the package can be found [here](https://github.com/davebulaval/careerjet-api).

The careerjet API is designed as a method for serving ads, and it tracks visits to careerjet job postings. For this reason, there are fairly restrictive rate limits for retrieving postings through the API. These limits are not clearly documented, but they rendered the API unusable for large-scale data collection.

After discovering that the careerjet API would not be usable for data collection, we investigated scraping job postings directly. The idea was to use the page number URL parameter to iterate over search result pages, collect the posting URLs from each result listing, then scrape each posting. However, after experimentation we discovered that the site only surfaces one hundred pages of search results with twenty postings per page, which caps this approach at two thousand postings per search.

Finally, we determined that by starting a scraping process on the first posting in the careerjet search results and using Selenium to advance to the next search result, we were able to access up to ten thousand job postings in a scraping session.

### Data Mining Goals

The goal of our data mining process was to obtain a reasonably large corpus of job postings consisting of a job title and a job description within the data job sector. 

### Project Plan

#### Data Collection

 1. Initiate a search for job postings located in the United States containing the keyword `data`.
 2. Traverse and scrape job postings using the Selenium webdriver.
 3. Store scraped posts in a SQLite database for further analysis. 
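
The storage step can be sketched with Python's built-in `sqlite3` module. The table name `postings`, its columns, and the helpers `open_database` and `save_posting` are illustrative assumptions, not the project's actual schema. Keying each row by its URL is one simple way to keep a repeated scraping session from storing duplicates.

```python
import sqlite3

# Hypothetical schema: one row per scraped posting, keyed by its URL so that
# scraping the same posting twice does not create a duplicate row.
SCHEMA = """
CREATE TABLE IF NOT EXISTS postings (
    url         TEXT PRIMARY KEY,
    title       TEXT NOT NULL,
    description TEXT NOT NULL
)
"""


def open_database(path: str) -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    connection = sqlite3.connect(path)
    connection.execute(SCHEMA)
    return connection


def save_posting(connection: sqlite3.Connection, url: str, title: str, description: str) -> None:
    """Insert one posting; skip it if the URL is already stored."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "INSERT OR IGNORE INTO postings (url, title, description) VALUES (?, ?, ?)",
            (url, title, description),
        )


connection = open_database(":memory:")  # pass a file path to persist the data
save_posting(connection, "https://example.com/job/1", "Data Analyst", "Build reports.")
save_posting(connection, "https://example.com/job/1", "Data Analyst", "Build reports.")
row_count = connection.execute("SELECT COUNT(*) FROM postings").fetchone()[0]
print(row_count)  # the duplicate insert is ignored, so this prints 1
```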
 
#### Data Cleaning

The work-flow below applies to both job descriptions and job titles. This process seeks to distill the raw text to a list of unique and independent tokens of information.
 
 1. Lowercase and remove newline characters.
 2. Tokenize documents into sentences.
 3. Tokenize sentences into words.
 4. Tag words with parts-of-speech tags.
 5. Lemmatize words based on parts-of-speech tags.
 6. Remove stopwords and special characters.
 7. Group common bigrams and trigrams.
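
The cleaning steps above can be sketched with the standard library alone. This is a simplified illustration, not the project's implementation: it covers steps 1 to 3, 6, and 7, and it skips part-of-speech tagging and lemmatization (steps 4 and 5), which would normally rely on an NLP library. The stopword list, the `min_count` threshold, and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a much fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "with", "for", "is", "are", "we"}


def tokenize(document: str) -> list[list[str]]:
    """Steps 1-3 and 6: lowercase, split into sentences, then into clean words."""
    text = document.lower().replace("\n", " ")
    sentences = re.split(r"(?<=[.!?])\s+", text)
    tokenized = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence)  # drops digits and special characters
        words = [word for word in words if word not in STOPWORDS]
        if words:
            tokenized.append(words)
    return tokenized


def find_common_bigrams(corpus, min_count=2):
    """Step 7, part one: adjacent word pairs seen at least min_count times."""
    counts = Counter()
    for sentences in corpus:
        for words in sentences:
            counts.update(zip(words, words[1:]))
    return {pair for pair, count in counts.items() if count >= min_count}


def merge_bigrams(words, bigrams):
    """Step 7, part two: join each common bigram into a single token."""
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in bigrams:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


documents = [
    "We use machine learning.\nExperience with machine learning is required!",
    "Build dashboards for the sales team.",
]
corpus = [tokenize(document) for document in documents]
bigrams = find_common_bigrams(corpus)
tokens = [
    [token for words in sentences for token in merge_bigrams(words, bigrams)]
    for sentences in corpus
]
print(tokens[0])
# ['use', 'machine_learning', 'experience', 'machine_learning', 'required']
```

Running the bigram pass a second time over the merged tokens is one way to pick up trigrams, since a merged bigram followed by a frequent third word becomes a new common pair.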
 
#### Modeling 

Essentially we wish to model two activities related to searching for a job. First, reading full job descriptions to determine the relevant skills and requirements, thereby classifying the job. Second, skimming over job titles to classify jobs. The goal of a job listing should be to efficiently 

##### Unsupervised Learning on Job Descriptions

Having distilled job descriptions to token lists, we wish to derive meaningful representations of the job descriptions that lend themselves to succinct classification of jobs. To this end, we propose the following work-flow.
 
 1. Convert token lists to bag-of-words representation.
 2. Train a Latent Dirichlet Allocation model on the bag-of-words representations to extract a latent topics representation of the job descriptions. 
 3. Train a K-Means clustering model on the latent topics representations to identify clusters of similar job descriptions.
 
 The labeling of job descriptions with their corresponding cluster label provides our first deliverable.
 
 Having trained both LDA and K-Means models on the collected data, we can feed an unseen job description through our model pipeline and assign a cluster label. This provides our second deliverable.
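
A minimal sketch of this pipeline, assuming scikit-learn is used for all three steps (the source names the models but not the library). The toy token lists, the number of topics and clusters, and the `passthrough` helper are illustrative assumptions; real values would be tuned on the full corpus.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy token lists standing in for cleaned job descriptions.
token_lists = [
    ["python", "machine_learning", "model", "statistics"],
    ["machine_learning", "model", "python", "deep_learning"],
    ["sql", "dashboard", "report", "excel"],
    ["dashboard", "report", "sql", "stakeholder"],
    ["pipeline", "spark", "etl", "sql"],
    ["etl", "pipeline", "airflow", "spark"],
]


def passthrough(tokens):
    """The documents are already tokenized, so the analyzer returns them unchanged."""
    return tokens


# Illustrative hyperparameters, chosen only to fit the toy data.
pipeline = make_pipeline(
    CountVectorizer(analyzer=passthrough),                      # bag-of-words counts
    LatentDirichletAllocation(n_components=3, random_state=0),  # latent topic weights
    KMeans(n_clusters=3, n_init=10, random_state=0),            # clusters in topic space
)

# First deliverable: a cluster label for every description in the corpus.
cluster_labels = pipeline.fit_predict(token_lists)

# Second deliverable: a cluster label for an unseen, already cleaned description.
new_label = pipeline.predict([["python", "model", "statistics", "sql"]])[0]
print(cluster_labels, new_label)
```

Wrapping the three steps in one pipeline means an unseen description passes through exactly the same vocabulary and topic model as the training data before it is assigned to a cluster.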
 
##### Supervised Learning on Job Titles

Having classified jobs to the best of our abilities using the full job description, we now wish to determine how effectively we are able to predict the cluster label of the job using only the job title. 
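
One way to set up this comparison is sketched below. The source does not specify a classifier, so the TF-IDF features, the logistic regression model, the two-fold cross-validation, and the toy titles and labels are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy job titles paired with hypothetical cluster labels from the description model.
titles = [
    "Data Scientist",
    "Senior Data Scientist",
    "Machine Learning Engineer",
    "Research Scientist",
    "Data Analyst",
    "Business Intelligence Analyst",
    "Reporting Analyst",
    "Senior Data Analyst",
]
cluster_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Predict the description-based cluster label from the title text alone.
title_classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Cross-validated accuracy: how much of the cluster structure the titles recover.
scores = cross_val_score(title_classifier, titles, cluster_labels, cv=2)
print(scores.mean())
```

Under this setup, accuracy close to chance would suggest that titles say little about the roles found in the descriptions, while high accuracy would suggest that titles already distinguish roles well.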

## Data Understanding

To produce our deliverables, we needed a sizable corpus of data industry job descriptions.
We were able to obtain a corpus of approximately 9,485 job descriptions paired with their assigned job titles.
The descriptions from this corpus will serve as our data for deliverables one and two.
We will use the job titles from this corpus as data for deliverable three.

### Data Collection

### Data Description

### Data Exploration

### Data Validation


## Data Preparation

### Data Selection

### Data Cleaning

### Feature Engineering

### Data Storage

## Modeling

### Architecture

### Testing Plan

### Build

### Assessment

## Evaluation

### Results

### Review 

### Conclusion

## Deployment

### Deployment Plan

### Monitoring and Maintenance

### Report

### Project Review