# NLP For Classification - The Language of Job Postings
## Non - Technical Report

## Problem Statement:

**Objective**

Searching for new jobs can be an arduous process. A diligent job seeker can get through 10-20 postings per week, and the hit rate may not be that high. If the job search goes on for 2 months, that's around 150 job postings. From my initial search, there are thousands of data science / data analyst positions, and many more similar positions under business and product engineer titles. So can data science help data scientists find jobs?

Assumption: NLP can be applied to the detailed job descriptions to:
1. predict job titles (if any) from job descriptions
2. show, in the process of predicting job titles, which titles are similar and why
3. reveal word patterns for various jobs
4. show what we can learn about the machine learning industry from these job postings

<a id='home'></a>
- ** A. Preface: Terminology** <a href='#sectionA'>link</a> 
- ** 0. Data sources / Scraping** <a href='#section0'>link</a> 
- ** 1. Loading Libraries** <a href='#section1'>link</a>
- ** 2. Loading Data** <a href='#section2'>link</a>
- ** 3. Data Munging (Post scrape)** <a href='#section3'>link</a>
- ** 4. EDA** <a href='#section4'>link</a>
- ** 5. Feature Engineering - "Line Items" and "Hard Skills"** <a href='#section5'>link</a>
- ** 6. NLP Tokenizing** <a href='#section6'>link</a>
- ** 7. Split X and Y** <a href='#section7'>link</a>
- ** 8. One vs. Rest Classification : Logistic Regression** <a href='#section8'>link</a>
- ** 9. Logistic Classification Sample** <a href='#section9'>link</a>
- ** 10. Post Model Sample - Top Word Features per Expanded Title**  <a href='#section10'>link</a>
- ** 11. Post Model Sample - Top Classifications by Probability**  <a href='#section11'>link</a>
- ** 12. Post Model Sample - Top Word Count per Posting**  <a href='#section12'>link</a>
- ** 13. Assembling all Post Model Results**  <a href='#section13'>link</a>
- ** 14. Plotting Multi Class Roc Curves**  <a href='#section14'>link</a>
- ** 15. Comparing Different Models**  <a href='#section15'>link</a>
- ** 16. Optimized Regularized Model**  <a href='#section16'>link</a>
- ** 17. Comparison of Betas of Regularized and Non Models**  <a href='#section17'>link</a>
- ** Appendix AWS python code** <a href='#appendix'>link</a>

<a id='sectionA'></a>  <a href='#home'>Back to Table of Contents</a>
## A. Preface : Vocabulary

**NLP** - Natural Language Processing - the analysis of words and text using large scale computational resources

**EDA** - Exploratory Data Analysis - high level survey of available data before doing any statistical predictions

**Scraping** - more specifically web scraping: the process of extracting information from websites using programming tools and consolidating it into a central, normalized data repository

**Munging** - Cleaning the data, filling in blanks, changing formats, getting rid of excess characters, calculating intermediate values such as average / mode/ max values.

**Base Title** - this is the lowest level of a job title; "product quality engineer" would be reduced down to "engineer" and "senior data scientist of chemical diffusion" would be "scientist".

**Expanded Title** - this is the 2nd lowest level of a job title; "senior software engineer of VR" would be reduced down to "software engineer" and "senior data scientist of chemical diffusion" would be "data scientist".

**LDA** - Latent Dirichlet Allocation: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. This is a text analysis technique to extract "common topics"

**Multi Classification** - Normally classification is binary, like 1 for yes and 0 for no. But often there are multiple discrete classes, such as "car", "van", and "truck"; these are referred to as multi-class problems.

**AWS** - stands for Amazon Web Services, provides on-demand computing. In this project a linux-based AWS instance was used for some of the heavy processing required.

**Betas** - these refer to final coefficients for the trained model after multiple statistical calculations and iterations

**Features** - features are variables that will be used as a basis to predict an output. For instance, to predict a total credit card bill, features or inputs might be (total purchases), (number of electronics), (number of times eating out). Can be considered synonymously with 'inputs'

**Regularization** - This method adds an artificial penalty term to the model's mathematical calculations. The technique is used on complex models and addresses the number of variables (or features) and their relationships with each other.


<a id='section0'></a>  <a href='#home'>Back to Table of Contents</a>
## 0. Preface: Data Sources and Webscraping

### Goals : Pull job postings for possible data science-related careers, using the following search terms and parameters


**Terms Searched for:**
- "Data Science"
- "Data Scientist"
- "Data Analyst"
- "Data Engineer"
- "Business Analyst"
- "Machine Learning"
- "Statistics"
- "Product Analyst"
- "Deep Learning"

**Cities/Locations**
- San Francisco, CA
- Mountain View, CA
- Seattle, WA
- Los Angeles, CA
- Boston, MA
- New York, NY
- Philadelphia, PA
- Washington, DC
- Atlanta, GA
- Houston, TX
- Austin, TX
- Chicago, IL
- Minneapolis, MN

**Job Website 1:** This website has an API. Python's requests library was used to pull job listing URLs from the API, which returns only titles, company names, cities, and states. From those links, Python with Selenium was used to pull the job description details. In total, 50,000 job postings were pulled. **Notes:** As noted in the EDA, a majority of the job postings were duplicates; after de-duping, the total number of unique postings was around 14,000.

**Process**
1. Get API key by registering
2. Write python looper for ~14 different cities, using same API search string (no max for search results)
3. For each search string, pull all possible results using python to go through the pagination
4. Save all the individual job links (~50,000)
5. Start 2nd crawler to pull job description details for the remaining jobs


**Job Website 2:** This website does not have an API, so Selenium was used for both the job links and the job descriptions, over a period of 2 weeks with a slow 20-second delay, to pull 14,000 job postings. Due to website security and blocking, scraping was very slow and had to be restarted multiple times. Search results were capped at 40 pages x 25 results per page = 1000 search results. As a result, ~14 cities x ~10 search terms were combined to make ~140 unique searches.

**Process**
1. Test website limits
2. Write python looper for 140 different search combinations
3. For each search string, pull all possible results using selenium to go through the pagination
4. Save all the individual job links (~14,000)
5. Start 2nd crawler to pull job description details for the remaining jobs
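
The search loop in step 2 can be sketched with `itertools.product`. This is a minimal sketch, not the scraper that was actually run: the term and city lists are copied from the lists printed earlier in this section, and `build_searches` is a hypothetical helper name. The lists as printed give 13 x 9 = 117 pairs, so the ~140 figure presumably reflects slightly longer lists in the real run.

```python
from itertools import product

# Search terms and locations as listed in this section.
TERMS = ["Data Science", "Data Scientist", "Data Analyst", "Data Engineer",
         "Business Analyst", "Machine Learning", "Statistics",
         "Product Analyst", "Deep Learning"]
CITIES = ["San Francisco, CA", "Mountain View, CA", "Seattle, WA",
          "Los Angeles, CA", "Boston, MA", "New York, NY",
          "Philadelphia, PA", "Washington, DC", "Atlanta, GA",
          "Houston, TX", "Austin, TX", "Chicago, IL", "Minneapolis, MN"]


def build_searches(terms, cities):
    """Return one (term, city) pair per unique search (hypothetical helper)."""
    return list(product(terms, cities))


searches = build_searches(TERMS, CITIES)
print(len(searches))  # one search per term/city pair
```

Each pair would then be handed to the crawler, with a pause between requests (20 seconds in the run described above).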

<a id='section1'></a>  <a href='#home'>Back to Table of Contents</a>
## 1. Loading python libraries

All the standard python sklearn libraries are used. 

Plotting selection: Altair https://github.com/minrk/altair/blob/master/README.md
This graphing package was used over matplotlib and seaborn because of the categorical nature of most of the data. Most of the main features are word based, and Altair makes it a lot easier to accept word/categorical inputs.

In [None]:
# load the libraries
import datetime
import pandas as pd
import pickle
import patsy
import unidecode
import numpy as np
from altair import Chart, SortField, X as X_axis, Y as Y_axis

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier, LogisticRegressionCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_curve, roc_auc_score, auc, confusion_matrix
from numpy import interp  # numpy.interp replaces the deprecated scipy.interp

<a id='section2'></a> <a href='#home'>Back to Table of Contents</a>
## 2. Loading / Source Data

Data source: the world wide web!

We will be loading a data frame that was combined from several webscraping processes. Two major job websites were scraped, and the common fields were combined using Scrapy, Pandas, XPath, and Selenium. The 22k records took roughly 2 weeks to pull from the various websites, and a long de-duping process was required afterwards. Pickle was used to save and restore the data frame. This same dataset was used as the baseline for a number of other studies:

- Linear Regression on applicant views
- Topic Modeling
- Recommender Systems
- Current analysis

<a id='section3'></a> <a href='#home'>Back to Table of Contents</a>
## 3. Cleaning and Munging Data

Since the majority of our analysis is text based, HTML <--> unicode <--> UTF-8 conversion is required, so a unicode conversion function was created to decode the scraped text into a workable standard string format. This function was used repeatedly when prepping text for word tokenizing or topic analysis, and when repairing text.

Some common issues when working with HTML:
- Some python features require unicode, others do not
- < br > must be converted into new lines
- some HTML pages have UL and LI list item tags that need to be converted to new lines
- Many people used implicit formatting (many new lines) to make a text-friendly posting
- Scraped pages carry their own "coding", for instance %20 == space in URL encoding http://www.w3schools.com/tags/ref_urlencode.asp
- Some job postings have ZERO formatting; the whole posting is one solid paragraph of text.
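
The issues above can be handled with a small cleaning function. The following is a minimal sketch using only the standard library, not the function used in the project; the name `clean_posting` and the exact set of tags handled are assumptions made for illustration.

```python
import html
import re
from urllib.parse import unquote


def clean_posting(raw_html):
    """Convert a scraped HTML job description into plain text (illustrative)."""
    text = unquote(raw_html)  # URL encoding: %20 -> space
    # <br> (with or without spaces or a slash) becomes a new line
    text = re.sub(r"<\s*br\s*/?\s*>", "\n", text, flags=re.IGNORECASE)
    # opening and closing list / paragraph tags become new lines
    text = re.sub(r"<\s*/?\s*(li|ul|p)\b[^>]*>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)      # drop any remaining tags
    text = html.unescape(text)               # &amp; -> &
    text = text.replace("\xa0", " ")         # non-breaking space -> space
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces
    text = re.sub(r"\s*\n\s*", "\n", text)   # collapse blank lines
    return text.strip()


sample = "Skills:<ul><li>Python &amp; SQL</li><li>Data%20analysis</li></ul>Apply< br >now"
print(clean_posting(sample))
```

Postings with zero formatting pass through unchanged as a single line, which is why they need separate handling later.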

#### Sample Posting, with normal formatting

>The Software Engineer III is responsible for the development of large-scale Web applications, related systems and tools, including analysis, design, implementation, unit testing and documentation. Interact successfully with business owners, project managers and other technical teams. Contribute to company’s engineering standards and best practices. Participate in application tuning. Provide production support including on-call support. Identify, fix and follow through deployment of critical issues.

> - Assists in providing guidance to small groups of two to three engineers, including offshore associates, for assigned Engineering projects
- Demonstrates up-to-date expertise in Software Engineering and applies this to the development, execution, and improvement of action plans
- Manages small to large-sized complex projects
- Models compliance with company policies and procedures and supports company mission, values, and standards of ethics and integrity
- Participates in the discovery phase of small to medium-sized projects to come up with high level design
- Provides and supports the implementation of business solutions
- Provides support to the business
 -Troubleshoots business and production issues Minimum Qualifications

>Bachelor's Degree in Computer Science or related field and 5 years experience building scalable ecommerce applications or mobile software-Hands-on Java programming, -Knowledge of REST Webservices-Experience building applications using frameworks like Spring Hands-on experience with Eclipse or other IDE development tools-Knowledge of distributed source control systems such as Git Additional Preferred Qualifications


#### Sample Posting, lacking formatting

>Looking for a company that inspires passion, courage and imagination, where you can be part of the team shaping the future of global commerce? Want to shape how millions of people buy, sell, connect, and share around the world? If you’re interested in joining a purpose driven community that is dedicated to creating an ambitious and inclusive workplace, join eBay – a company you can be proud to be a part of. Are you excited about Big Data, machine learning and large-scale distributed systems? Interested in processing petabytes of data and turning it into products? Are you passionate about big data processing and cloud technologies? Do you enjoy working in a dynamic environment, be part of a world class team making an impact in the industry? If so, look no further, eBay Seattle is the right place for you! Take your career to the next level, come join our team and play a big role driving the next generation data services and solutions at eBay enabling billions in global connected ecommerce. As a valued member of the team at eBay that develops world-class behavior analytics product as well as generate insights through data science, you will make important contributions. Design and build behavioral data insights platform, deliver analytics product, combine qualitative and quantitative data to convince people with your actionable insights thus improve the user experience. Job Requirements Software development for large scale data solution and service People in the team are friendly, highly motivated, and extremely bright. Our team tries to maintain a work climate of professionalism, innovation, career growth, and fun. We provide you with the best opportunity to work in a challenging, highly visible and fast paced environment.


### 3.1 Picking Common Titles : Base vs. Expanded 

Every company is essentially its own country with its own titles. A bank's entry-level position might be "Assistant Vice President", while a consulting firm's might be "business analyst". Many new tech companies with flat organizations are notorious for having titles that do not matter; a data analyst could be a PhD with 12 years of work experience. And every industry adds another layer of specific lingo.

#### Sample Job Posting Titles:
```
['Clinical Practice Specialist - 4 Hope' 'WEB DEVELOPER'
 'Analyst, Marketing Analytics' 'Data Engineer'
 'Application Implementation Specialist' 'Consulting Analyst'
 'Data Integration Engineer' 'Health Care Junior Business Analyst'
 'Solutions Architect' 'Senior Application Developer' 'Data Analyst Junior'
 'Clinical Informatics Data Analyst' 'Junior Data Scientist, Marketing'
 'Business Intelligence Analyst' 'SEO Specialist'
 'Market and Cost Intelligence Analyst' 'Data Analyst'
 'Application Developer/Analyst' 'Senior Data Analyst, Discovery Analytics'
 'Big Data Senior Consultant - Information Delivery'
 'Business Intelligence Engineer - Amazon Restaurants'
 'Principle Software Engineer'
 'Healthcare & Finance Analyst, Performance Measurement'
 'Application Development Programmer Specialist'
 'Senior Analyst, Strategy & Operations']
```
#### "Base" Titles
All the titles were extracted to isolate the base title, such as "Engineer", "Analyst", "Architect", "Developer", "Scientist"

#### "Expanded" Titles
All the titles were extracted to have one layer of detail above the base title, such as "Data Engineer", "Software Engineer", "Learning Engineer", "Data Analyst", "Business Analyst", "Senior Analyst"

Only expanded titles with more than 100 postings are kept; the rest are dropped from the master dataset. This reduces the original 22k postings to 12k. The remaining expanded titles are:

    business analyst
    data analyst
    software engineer
    data scientist
    data engineer
    analyst
    systems analyst
    development engineer
    senior consultant
    marketing specialist
    director
    product manager
    manager
    senior analyst
    intelligence analyst
    research scientist
    marketing analyst
    data architect
    operations analyst
    project manager
    solution architect
    product analyst
    financial analyst
    senior associate
    learning engineer
    research analyst
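
The base / expanded title reduction and the 100-posting cutoff can be sketched as follows. This is an illustrative heuristic rather than the project's actual rule: it assumes the title core sits before the first comma, " - ", or " of ", and the names `split_title` and `frequent_titles` are hypothetical.

```python
import re
from collections import Counter


def split_title(title):
    """Return (base_title, expanded_title) for a raw job title."""
    # Keep the part before a qualifier such as ", Marketing", " - Team" or " of VR".
    head = re.split(r",|\s-\s|\sof\s", title.lower())[0]
    words = re.findall(r"[a-z]+", head)
    if not words:
        return "", ""
    # base = last word, expanded = last two words
    return words[-1], " ".join(words[-2:])


def frequent_titles(expanded_titles, min_count=100):
    """Return the expanded titles that occur more than min_count times."""
    counts = Counter(expanded_titles)
    return {title for title, n in counts.items() if n > min_count}


print(split_title("senior software engineer of VR"))
print(split_title("Senior Data Scientist of chemical diffusion"))
```

Titles with a trailing qualifier and no separator, such as 'Data Analyst Junior', are not handled by this heuristic and would need extra rules.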

### 3.2 EDA - Data science relevance - Machine-Learning related terms

As mentioned before, titles can be deceiving. After reviewing a large number of job postings, the following terms were flagged and are used as a rough metric of data science relevance. This helps answer questions such as: how much data science can be found in Business Analyst postings? How much data science crossover can be found in Data Analyst postings?

Terms:
- machine learning
- data science
- regression
- bayes
- sklearn
- data scientist
- neural networks
- ' R '

### 3.3 Flagging the Machine-learning terms in the data frame, and then totaling the count as a new field
Most search engines try to match the search terms to the job title. Finding data science related jobs is difficult because the two terms, "data" and "science", are so generic. If our practice were called 'statisti-gineering', it would be a lot more distinct. Since it is not, each posting in the data frame is flagged for the machine-learning terms above, and the total number of flagged terms is stored as a new field that serves as a proxy score for a machine learning attribute.
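
A minimal sketch of such a proxy score, assuming the score is simply the number of distinct flagged terms found in a posting. The function name `ml_term_count` is hypothetical; the term list is the one given in section 3.2.

```python
ML_TERMS = ["machine learning", "data science", "regression", "bayes",
            "sklearn", "data scientist", "neural networks"]


def ml_term_count(text):
    """Count how many of the flagged machine-learning terms appear in a posting."""
    lowered = text.lower()
    count = sum(term in lowered for term in ML_TERMS)
    # ' R ' is matched case-sensitively and with surrounding spaces, so the
    # language name is not confused with an ordinary letter r.
    count += " R " in text
    return count


print(ml_term_count("Experience with machine learning, regression and R or Python"))
```

Applied to a data frame, this would be a single `apply` over the description column, with the result stored as the new field.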

### 3.4 Top posting companies
Two additional fields are created to identify the companies that appear most often in the collected job postings. The top 25 companies are taken, looking at overall posting volume as well as the volume of postings with machine-learning terms. This is examined in more detail in the EDA section.

<a id='section4'></a> <a href='#home'>Back to Table of Contents</a>

## 4. EDA

The next section will explore the nature of the collected data, and make some preliminary comparisons 

### 4.1 Which Expanded Titles showed up the most? 

Unsurprisingly, **Business Analyst** is the most general term and shows up the most often, making up about 4k of the 12k rows analyzed. So the baseline for this classification is ~33%. The next four titles each have roughly 1k-1.5k postings, or about 9-13% each. Software engineer, data analyst, and data engineer are also among the largest titles, mainly because they were the primary search terms.

```
    business analyst        0.335750
    data analyst            0.127500
    software engineer       0.107083
    data scientist          0.094917
    data engineer           0.090417
    analyst                 0.025833
    systems analyst         0.024500
    development engineer    0.022167
    senior consultant       0.020750
    marketing specialist    0.015667
    director                0.015250
    product manager         0.014833
    manager                 0.014167
    senior analyst          0.012917
    intelligence analyst    0.012167
    marketing analyst       0.011583
    research scientist      0.011583
    data architect          0.011000
    operations analyst      0.010333
    project manager         0.010083
    solution architect      0.009833
    product analyst         0.009583
    financial analyst       0.009250
    senior associate        0.008750
    learning engineer       0.008750
    research analyst        0.008500
    Name: expanded_title, dtype: float64
```
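
The shares above, and the baseline they imply, can be computed with `value_counts(normalize=True)`. The snippet below is a sketch on a hypothetical eight-row table, not the project's data; only the column name `expanded_title` is taken from the output above.

```python
import pandas as pd

# Hypothetical miniature of the postings table; the real one has ~12k rows.
postings = pd.DataFrame({"expanded_title": ["business analyst"] * 4
                         + ["data analyst"] * 2 + ["data scientist"] * 2})

# Share of each expanded title, largest first.
shares = postings["expanded_title"].value_counts(normalize=True)
# Baseline: the accuracy of always guessing the most common title.
baseline = shares.iloc[0]
print(shares)
print("baseline accuracy:", baseline)
```

Any classifier has to beat this share to be better than always guessing the most common title.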

#### Comments: 

The **largest baseline is "business analyst" at 33%**, followed by four more titles (data analyst, software engineer, data scientist, and data engineer) at 9-13% each, before dropping to below 3% for the rest.

![alt text](img/vega0.png)

### 4.2 What are the longest Posts (by words)

As expected "senior" positions have longer descriptions, typically due to expanded responsibilities and work experience. The shorter descriptions are surprisingly engineering oriented vs. the analyst titles

![alt text](img/vega1.png)

### 4.3 Distribution of postings by Expanded title and State

California in general had the most job postings, followed by Washington, Boston, and New York. Surprisingly, Virginia, Georgia, and Texas have a decent number of data scientist titled rows.

![alt text](img/vega2.png)

### 4.4 Distribution of postings by Expanded title and State

California has the largest market for most generic positions, but there are large concentrations in some non-hub places like Boston, Georgia, Illinois, and Virginia, which are not places that people typically associate with tech.


![alt text](img/vega3.png)

### 4.5 Machine Learning shows up in which Expanded Titles?

In a job posting, what is the density of machine learning terms? The following plot shows the breakout. Data scientist has the most machine learning terms, followed by learning engineer, research scientist, and data architect. Data engineer and software engineer have a lot of postings with low machine-learning density.

![alt text](img/vega4.png)

### 4.6 Machine Learning terms shows up in which states?

California has the highest concentration of machine learning terms, followed by Washington, New York, and Boston.

![alt text](img/vega5.png)

### 4.7 Which companies are posting the most in this space?

From the search terms cited above, which companies showed up the most frequently in the results? Or in other words, who has the widest and largest tech opportunities?


![alt text](img/vega6.png)

### 4.8 Which Companies have the most Machine-Learning Terminology

Similarly, which of the companies have the highest density of Machine Learning terminology? Amazon and Deloitte currently lead with the most postings and highest density

![alt text](img/vega7.png)

<a id='section5'></a> <a href='#home'>Back to Table of Contents</a>

## 5. Feature Engineering - Job Posting Text

### Cleaning up Lengthy Job Posting: class hardSkillParser


Building on the sample postings shown before, three levels of text are extracted from each posting: the full posting, line items only, and hard skills only.

### Full Posting Sample:

This is 100% of the raw HTML description pulled during webscraping.

---
>US Citizenship Required - TS/SCI with FSP Clearance Required - US Citizenship Required

>Senior CNO Developer to support programs related to computer network operations. You will be integral members of a highly-skilled team that provides on-site development of advanced tools and techniques for use in intelligence collection programs.

>Main responsibilities will include, but are not limited to:
- Analyzing and deconstructing software applications and products
- Working within operating systems to characterize and understand technical specifications

>Required Skills:
- Bachelor's degree in Engineering, Computer Science or related field (4 years of related experience may be substituted for each year of degree-level education)
- 8+ years of related development experience
- 8+ years of development experience in Windows OR Linux environment
- A minimum of one year of experience working in or supporting customers within the Intelligence Community
- 8+ years of demonstrated experience in x86 assembly, SW reverse engineering, kernel debugging, boot procedures, file systems, networking protocol stacks, embedded SW development
- 8+ years of Windows OR Linux kernel/driver development experience
- Strong understanding of operating system internals

>Desired Skills:
- Must present a positive and professional demeanor
- Must possess excellent communication skills, both written and verbal

>Required Education (including Major):
- Bachelor's degree in Engineering, Computer Science or related field (4 years of related experience may be substituted for each year of degree-level education)

### Line Items Only Sample:

These line items are identified by the following conditions:
- There is a new line before the block
- There is only a single sentence in the block
- Blocks with more than one sentence between new line breaks are ignored
---

>- Analyzing and deconstructing software applications and products
- Working within operating systems to characterize and understand technical specifications

>Required Skills:
- Bachelor's degree in Engineering, Computer Science or related field (4 years of related experience may be substituted for each year of degree-level education)
- 8+ years of related development experience
- 8+ years of development experience in Windows OR Linux environment
- A minimum of one year of experience working in or supporting customers within the Intelligence Community
- 8+ years of demonstrated experience in x86 assembly, SW reverse engineering, kernel debugging, boot procedures, file systems, networking protocol stacks, embedded SW development
- 8+ years of Windows OR Linux kernel/driver development experience
- Strong understanding of operating system internals

>Desired Skills:
- Must present a positive and professional demeanor
- Must possess excellent communication skills, both written and verbal

>Required Education (including Major):
- Bachelor's degree in Engineering, Computer Science or related field (4 years of related experience may be substituted for each year of degree-level education)
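
The single-sentence rule for line items can be sketched as below. This is a simplified illustration, not the project's `hardSkillParser`: it only implements the "one sentence per block" condition, treats `.`, `!` or `?` followed by whitespace as a sentence boundary, and the name `line_items` is hypothetical.

```python
import re


def line_items(text):
    """Return the blocks of a posting that hold a single sentence."""
    items = []
    for block in text.split("\n"):
        block = block.strip()
        if not block:
            continue
        # Split after ., ! or ? when followed by whitespace; one piece == one sentence.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", block) if s]
        if len(sentences) == 1:
            items.append(block)
    return items


posting = ("Senior developer to support programs. You will join a skilled team.\n"
           "Required Skills:\n"
           "- 8+ years of related development experience\n"
           "- Strong understanding of operating system internals")
print(line_items(posting))
```

Abbreviations such as "e.g. " would be counted as sentence breaks by this rule, so a production version would need a more careful sentence splitter.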


### Hard Skills Only Sample:

These hard skills are identified by specific English trigger terms. The phrase that follows each term is isolated and considered a "hard skill":

        - in
        - including
        - knowledge of
        - experience with
        - understanding of
        - to
        - develop
        - design
        - requirements

>- experience working in or supporting customers within the Intelligence Community
> - in x86 assembly, SW reverse engineering, kernel debugging, boot procedures, file systems, networking protocol stacks, embedded SW development
> - in Engineering, Computer Science or related field (4 years of related experience may be substituted for each year of degree-level education)
> - experience working in or supporting customers within the Intelligence Community
> - experience in Windows OR Linux environment
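
A trigger-term extraction of this kind can be sketched with one regular expression. This is a hedged illustration rather than the project's parser: it keeps the text from the first trigger term to the end of the line, so its output will not match the sample above in every case, and the names `TRIGGERS` and `hard_skill` are hypothetical.

```python
import re

# Trigger terms from the list above. Word boundaries keep "in" from matching
# inside longer words such as "including" or "engineering".
TRIGGERS = ["knowledge of", "experience with", "understanding of",
            "including", "requirements", "develop", "design", "in", "to"]
TRIGGER_RE = re.compile(r"\b(?:" + "|".join(TRIGGERS) + r")\b.*", re.IGNORECASE)


def hard_skill(line):
    """Return the phrase starting at the first trigger term, or None."""
    match = TRIGGER_RE.search(line)
    return match.group(0).strip() if match else None


print(hard_skill("8+ years of demonstrated experience in x86 assembly, kernel debugging"))
print(hard_skill("Must present a positive and professional demeanor"))
```

Lines without any trigger term, such as soft-skill requirements, return `None` and drop out of the hard-skills text.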

<a id='section6'></a> <a href='#home'>Back to Table of Contents</a>

## 6. NLP - Word Tokenizing

The following NLP parser function takes in a list of texts and automates the following tasks:

- Vectorizing of the different texts into arrays/matrices of words using Count Vectorizer
- Apply stop words filter to get rid of common words
- Corpus creation - an optimized word count data structure (combination of dictionary and tuples)
- vocab creation - remaining unique words

These word counts become the features used for predicting expanded title names. Only the top **20,000** word features (by occurrence count) are used for prediction.

In [None]:
def NLP_parser(text):
    # use CountVectorizer to turn the cleaned job descriptions into features
    cvec = CountVectorizer(stop_words='english', lowercase=True, ngram_range=(1, 1))
    cvec.fit(text)

    # create a data frame from the word counts generated by CountVectorizer
    # (get_feature_names_out requires scikit-learn >= 1.0)
    cdf = pd.DataFrame(cvec.transform(text).toarray(),
                       columns=cvec.get_feature_names_out())

    # keep the top 20,000 features (words) by total count
    summary = cdf.sum().sort_values(ascending=False)
    keep_cols = summary[:20000].index
    cdf_to_merge = cdf[keep_cols]

    # some of the words are reserved, so prefix every column with nlp_
    # to avoid name clashes later on
    return cdf_to_merge.add_prefix('nlp_')

### Parse Text - Full Job Postings, Line Items, Hard Skills

The function above is used to parse the full text, line items, and hard skills into word features.

In [None]:
full_df = NLP_parser(fulltext)
LI_df = NLP_parser(LItext)
HS_df = NLP_parser(HStext)

print(full_df.shape, LI_df.shape, HS_df.shape)

<a id='section7'></a> <a href='#home'>Back to Table of Contents</a>

## 7. Splitting data into X(features) and Y(targets)

#### One vs. rest classification - y as a matrix

In a normal classification, logistic regression can be used to separate 2 classes. So for 2 classes, 1 logistic regression can be used, represented by a y vector of [n_rows]. Since this analysis classifies 26 different classes with a one vs. rest approach, 26 different logistic regressions are required, one per class. So instead of the following for a single class:

    y = [ 0 , 0 , 0 , .. 1, 0 ,0]
   
For multiclass y becomes a matrix, 

    y = [[ 0 , 0 , 0 , .. 1, 0 ,0]
            [ 0 , 0 , 0 , .. 1, 0 ,0]
            [ 0 , 0 , 0 , .. 1, 0 ,0]
            ...
            [ 0 , 0 , 0 , .. 1, 0 ,0]]
            
An extremely useful class is
    
    from sklearn.preprocessing import MultiLabelBinarizer

The MultiLabelBinarizer is very similar to patsy, but specifically for y: it turns a multi-class vector into a matrix of n_rows x n_classes, with one indicator column per class. For X, the main features are the various vectorized word counts. These counts are small (mostly 1-5 per posting), so scaling won't be used.
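
A small illustration of what `MultiLabelBinarizer` produces, using four hypothetical postings. The titles are made up for the example; the wrapping of each title in a one-element list mirrors how the binarizer expects a collection of labels per row.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical expanded titles for four postings.
titles = ["data analyst", "data scientist", "data analyst", "business analyst"]

mlb = MultiLabelBinarizer()
# Each posting carries exactly one title, wrapped in a list because the
# binarizer expects a collection of labels per row.
y = mlb.fit_transform([[t] for t in titles])

print(mlb.classes_)  # column order of y (sorted class names)
print(y.shape)       # (n_rows, n_classes)
print(y)
```

Each row of `y` holds a single 1 in the column of its title, which is the shape a one vs. rest classifier works with.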

In [None]:
def generateXY(base_dataframe, word_df):
    formula = 'expanded_title ~ company + city + state + desc_len -1'
    print(formula)
    # only X is kept from patsy; y is rebuilt below as a multi-class matrix
    _, X = patsy.dmatrices(formula, base_dataframe, return_type='dataframe')

    X = X.merge(word_df, how='left', left_index=True, right_index=True)
    print(X.shape)

    # turn y into a multi-class indicator matrix:
    # 2 classes need a single vector, n classes need one column per class
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform([[x] for x in base_dataframe['expanded_title']])
    print(mlb.classes_)
    print(y.shape)
    return y, X, mlb

y, X, mlb = generateXY(limited_postings, full_df)

<a id='section8'></a> <a href='#home'>Back to Table of Contents</a>

## 8. Classification Function - setup to take models for one vs. rest library

#### Multiclass classification

The classification covers 26 different classes. The modeling is performed by fitting logistic regression class by class, which amounts to 26 different models per modeling run. So every additional run that needs to be tested fits another 26 models.

Python sklearn has a prebuilt class called:
        - OneVsRestClassifier

The OneVsRestClassifier is a generic utility that takes a model and fits it multiple times for multiclass classification. For each training run, it fits 26 binary models, such as data analyst vs. rest, then engineer vs. rest, and then data scientist vs. rest. To reduce code bloat, this step was wrapped in a helper function.
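
A helper of this kind might look like the sketch below. It is not the project's function: the name `fit_one_vs_rest`, the 70/30 split, and the toy data are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier


def fit_one_vs_rest(model, X, y, test_size=0.3, seed=0):
    """Fit one copy of `model` per column of y; return the model and both scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    ovr = OneVsRestClassifier(model).fit(X_train, y_train)
    return ovr, ovr.score(X_train, y_train), ovr.score(X_test, y_test)


# Toy data: 3 classes, 6 count-like features; feature k is inflated for class k.
rng = np.random.RandomState(0)
labels = rng.randint(0, 3, size=120)
y = np.eye(3, dtype=int)[labels]      # one-hot indicator matrix
X = rng.poisson(1.0, size=(120, 6))
X[:, :3] += 4 * y

ovr, train_score, test_score = fit_one_vs_rest(
    LogisticRegression(solver="liblinear"), X, y)
print(len(ovr.estimators_), train_score, test_score)
```

Because the model is passed in as an argument, the same helper can be reused for each regularization strength or for a different estimator.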

<a id='section9'></a> <a href='#home'>Back to Table of Contents</a>

## 9. Modeling Sample : Basic Logistic Regression

Dataset | Log | L1 C=0.01 | L1 C=0.1 | L1 C=1.0 | L1 C=10.0 | L1 C=100.0 | Random Forest|
--------|-----|-----------|----------|----------|-----------|------------|-----------|
Full Text|x|x|x|x|x|x|x
Line Items Only|x|x|x|x|x|x|x
Hard Skills only |x|x|x|x|x|x|x

#### Modeling code:

The modeling code runs 21 times, once for each of the above combinations (7 models x 3 datasets). The same code block is reused for every run. The logistic regression uses One vs. Rest to do multi-class classification, with the liblinear solver.
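
The grid in the table can be written down as a dictionary of models. This is a sketch under assumptions: the dictionary keys simply mirror the table headers, the "Log" entry uses scikit-learn's default settings (which apply an L2 penalty with C=1.0), and the random forest uses default hyperparameters because the source does not state them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

C_VALUES = [0.01, 0.1, 1.0, 10.0, 100.0]
DATASETS = ["Full Text", "Line Items Only", "Hard Skills Only"]

models = {"Log": LogisticRegression(solver="liblinear")}
for c in C_VALUES:
    # liblinear supports the L1 (lasso) penalty; smaller C means a stronger penalty.
    models["L1 C={}".format(c)] = LogisticRegression(
        penalty="l1", C=c, solver="liblinear")
models["Random Forest"] = RandomForestClassifier()

# One run per (dataset, model) pair.
runs = [(dataset, name) for dataset in DATASETS for name in models]
print(len(runs))
```

Each entry would be wrapped in a `OneVsRestClassifier` before fitting, so every run trains one binary model per expanded title.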

#### Post Modeling:
The following sections will show a sample of post-model processing, how to extract the following:

- predictions
- top word features per predicting class names
- top alternative class options ordered by probability

### Initial Baseline Modeling results

#### Logistic Classification training score: 0.97

#### Logistic Classification test score: 0.641

#### Comments: 

The overall test score is well above the ~33% baseline, but the large gap between the training score (0.97) and the test score (0.641) seems to indicate overfitting. As a result, regularization will be applied to reduce overfitting to the training set and make the model generalize better. Lasso (L1) regularization will be used because of the large number of features created from NLP.


<a id='section10'></a> <a href='#home'>Back to Table of Contents</a>

## 10. Post-Modeling Sample: Reporting Top Features for an Expanded Title

This step takes the fitted OneVsRestClassifier, pulls out the per-class estimators and their beta coefficients, and matches the coefficients with the words to identify the top word features for each prediction.

### Function: pulling features per expanded title
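
A possible shape for such a function is sketched below. It is an illustration, not the original code: `top_features` is a hypothetical name, and it works on a plain coefficient matrix so the example can run without a fitted model. The docstring notes how that matrix could be assembled from a fitted `OneVsRestClassifier`.

```python
import numpy as np


def top_features(coefs, feature_names, class_names, n=10):
    """Map each class to its n largest-coefficient features.

    coefs has shape (n_classes, n_features). For a fitted OneVsRestClassifier
    `ovr` wrapping a linear model it can be assembled with
    np.vstack([est.coef_.ravel() for est in ovr.estimators_]).
    """
    feature_names = np.asarray(feature_names)
    result = {}
    for class_name, row in zip(class_names, np.asarray(coefs, dtype=float)):
        order = np.argsort(row)[::-1][:n]  # indices of the largest betas first
        result[class_name] = list(zip(feature_names[order].tolist(),
                                      row[order].tolist()))
    return result


# Hypothetical betas for two titles and three word features.
betas = [[0.1, 2.0, -1.0],
         [1.5, -0.3, 0.2]]
words = ["nlp_sql", "nlp_python", "nlp_sales"]
print(top_features(betas, words, ["data scientist", "business analyst"], n=2))
```

Sorting by the raw coefficient surfaces the words that push a posting toward a title; sorting ascending instead would show the words that push it away.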

<a id='section11'></a> <a href='#home'>Back to Table of Contents</a>

## 11. Post-Modeling Sample:  Matching Top Guesses per post

Taking advantage of the fact that our y is a matrix as shown below:

Predicted:

    y = [class1 | class2 | class3 | class4]
        [ 0.,        1.,        0.,      0.]
        [ 1.,        0.,        0.,      0.]
        
Predicted Probability:

    y = [class1 | class2 | class3 | class4]
        [ .21,      .87,      0.01,     0.31]
        [ .93,       .34,      0.14,     0.45]

Using the matrix of probabilities, we can rank the remaining classes for each post to find the 2nd, 3rd, 4th, ... most likely titles.
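
The ranking can be sketched with the example probability matrix above. The helper name `ranked_classes` is hypothetical and the class names are the placeholders from the example.

```python
classes = ["class1", "class2", "class3", "class4"]
proba = [
    [0.21, 0.87, 0.01, 0.31],
    [0.93, 0.34, 0.14, 0.45],
]

def ranked_classes(row, classes, k=3):
    """Return the k most probable (class, probability) pairs, best first."""
    return sorted(zip(classes, row), key=lambda pair: pair[1], reverse=True)[:k]

for i, row in enumerate(proba, start=1):
    print("Post #%d:" % i, ranked_classes(row, classes))
```

The first entry in each ranked list is the predicted class; the entries after it are the alternative titles for that post.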

<a id='section12'></a> <a href='#home'>Back to Table of Contents</a>

## 12. Post-Modeling: For each Post, getting top word counts

For each posting, the top words are counted and stored as a descending list for further analysis. The counts are taken after stop-word removal, so many common words are filtered out and the remaining words are specific to the post. This level of detail will be used for error analysis.

**Sample of posting**

    record 1 : [[(u'nlp_marketing', 11.0),
          (u'nlp_loyalty', 10.0),
          (u'nlp_analysis', 8.0),
          (u'nlp_ability', 7.0),
          (u'nlp_business', 6.0),
          (u'nlp_skills', 6.0),
          (u'nlp_work', 5.0),
          (u'nlp_analytics', 4.0),
          (u'nlp_environment', 4.0),
          (u'nlp_experience', 4.0),
          (u'nlp_using', 4.0),
          (u'nlp_support', 4.0),
          (u'nlp_responsible', 4.0),
          (u'nlp_sas', 3.0),
           ....
    record 2 : [(u'nlp_data', 5.0),
          (u'nlp_experience', 4.0),
          (u'nlp_location', 3.0),
          (u'nlp_skyport', 2.0),
          (u'nlp_restful', 2.0),
          (u'nlp_understanding', 2.0),
          (u'nlp_api', 2.0),
          (u'nlp_knowledge', 2.0),
          (u'nlp_backend', 2.0),
          (u'nlp_team', 2.0),
          (u'nlp_developing', 2.0),
             ...
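
A small sketch of how such a per-posting list could be produced. The stop-word set, the whitespace tokenizer, and the helper name `top_words` are simplifying assumptions; the original counts come from the vectorizer, and only the `nlp_` prefix is taken from the sample above.

```python
from collections import Counter

# A tiny illustrative stop-word list; the real pipeline's list is not shown here.
STOP_WORDS = {"and", "the", "of", "a", "to", "for", "with", "in"}

def top_words(text, n=5, prefix="nlp_"):
    """Count the non-stop words in one posting and return the n most common."""
    tokens = [word for word in text.lower().split() if word not in STOP_WORDS]
    counts = Counter(prefix + word for word in tokens)
    return counts.most_common(n)

post = "marketing analysis and loyalty marketing for the marketing team with loyalty data"
print(top_words(post, n=3))
```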

<a id='section13'></a> <a href='#home'>Back to Table of Contents</a>

## 13. Basic Logistic Regression - Assembled results table

The baseline table of job postings with expanded titles is then merged with the top class predictions and the top word listings.

## 13.1 Looking at title similarity - based on probabilities

Since each posting comes with a ranked list of alternative titles and their probabilities, we can build a correlation matrix from those probabilities to see which titles are similar.

```
Post #1 : analyst ('analyst', 0.97124201971635582)
	('data analyst', 0.09224897854402879)
	('marketing analyst', 0.025353417134555477)
	('intelligence analyst', 0.00020392029014586449)
Post #2 : data engineer ('data engineer', 0.94262491122939063)
	('software engineer', 0.032009124051313669)
	('data scientist', 0.0073729054476535522)
	('development engineer', 0.0065031250470106144)
```
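
One way to turn probabilities like these into a title-to-title similarity matrix is to correlate the probability columns. The sketch below uses made-up probabilities and plain Pearson correlation; both are assumptions, and the actual charting prep may differ.

```python
from math import sqrt

titles = ["analyst", "data analyst", "data engineer"]
# Hypothetical predicted probabilities: one row per post, one column per title.
proba = [
    [0.97, 0.09, 0.01],
    [0.60, 0.55, 0.02],
    [0.05, 0.10, 0.94],
    [0.10, 0.30, 0.80],
]

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

columns = list(zip(*proba))  # one tuple of probabilities per title
similarity = [[pearson(a, b) for b in columns] for a in columns]

for title, row in zip(titles, similarity):
    print("%-14s" % title, ["%.2f" % value for value in row])
```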

#### Prep the data for charting

## 13.2 Expanded Title to Expanded Title Similarities

![alt text](img/vega8.png)

## 13.3 Expanded Title to Word Similarities

For each job post there is a vectorized word count. These counts are extracted and compared across titles.

    jobpost:'data analyst'-- [(u'nlp_data', 5.0),
      (u'nlp_experience', 4.0),
      (u'nlp_location', 3.0),
      (u'nlp_skyport', 2.0),
      (u'nlp_restful', 2.0),
      (u'nlp_understanding', 2.0),
      (u'nlp_api', 2.0),
      (u'nlp_knowledge', 2.0),
      (u'nlp_backend', 2.0),
      (u'nlp_team', 2.0),
      (u'nlp_developing', 2.0),
      (u'nlp_make', 2.0),
      
All the words are unpacked for each job posting and used to see how often the top words occur across expanded titles.

![alt text](img/vega9.png)

<a id='section14'></a> <a href='#home'>Back to Table of Contents</a>

## 14. Scoring - Confusion Matrix, ROC for multiclass

## 14.1 Scoring - Bad Predictions


The following section is an EDA of the bad predictions, exploring which titles are difficult to predict and why.

### 14.1 Which titles also have multiple predictions?


![alt text](img/vega10.png)

### 14.2 Which titles were wrong? (~30% incorrect)

![alt text](img/vega11.png)

### 14.3 Scoring - Confusion Matrix for multiclass

scikit-learn's built-in `confusion_matrix` does not accept predictions in the indicator-matrix form (one column per class) used here. See the documentation:

http://scikit-learn.org/stable/modules/multiclass.html

Previously we turned a single column (12k rows) with 26 different options into a 12k by 26 matrix. We will now take the prediction matrix and collapse it back down into a single vector of class labels so we can use the built-in confusion matrix function.

    Y predict matrix     Y predict vector
    [ 0 1 0 0 0 0]
    [ 0 0 0 0 0 1] ---- > [2, 6, ..]
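
The collapse from matrix to vector can be sketched in plain Python as below. The helper names are hypothetical, the data is made up, and class indices here are 0-based (so the second column is label 1).

```python
def to_label_vector(indicator_rows):
    """Collapse each indicator row to the 0-based index of its largest entry.

    Note: a row with no positive entry would map to class 0, so real
    predictions with empty rows need separate handling.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in indicator_rows]

def build_confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix

# Made-up indicator matrices for four posts and three classes.
Y_true = [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
Y_pred = [[0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

y_true = to_label_vector(Y_true)
y_pred = to_label_vector(Y_pred)
for row in build_confusion_matrix(y_true, y_pred, n_classes=3):
    print(row)
```

Once the predictions are in vector form, the same two vectors can be passed to a library confusion matrix function instead of the hand-rolled one.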


![alt text](img/mplot1.png)

## 14.4 Scoring - ROC for multiclass

The difficulty with multi-class scoring is that one classifier covers many classes, and each of these classes has its own ROC curve. The following picture was produced by looping over the classes and plotting one curve per class.

![alt text](img/UntitledROC_all.png)
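
The per-class loop behind a plot like this can be sketched without any plotting: for each class, take its indicator column and its probability column, then sweep the threshold. The data, class names, and helper names below are made up for illustration, and the sketch assumes there are no tied scores.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs for one class,
    sweeping the threshold from the highest score down to the lowest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for score, label in sorted(zip(scores, y_true), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

classes = ["data analyst", "data engineer"]
y_true = [[1, 0], [0, 1], [1, 0], [0, 1]]                      # indicator matrix
proba = [[0.9, 0.2], [0.4, 0.7], [0.35, 0.6], [0.1, 0.8]]      # predicted probabilities

curves = {}
for j, name in enumerate(classes):
    labels = [row[j] for row in y_true]
    scores = [row[j] for row in proba]
    curves[name] = roc_points(labels, scores)
    print(name, "AUC = %.2f" % auc(curves[name]))
```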

<a id='section15'></a> <a href='#home'>Back to Table of Contents</a>

## 15. Applying Regularization

Importing data from the bulk processor:

Tests that were run:

Dataset | Log | L1 C=0.01 | L1 C=0.1 | L1 C=1.0 | L1 C=10.0 | L1 C=100.0 | Random Forest|
--------|-----|-----------|----------|----------|-----------|------------|-----------|
Full Text|x|x|x|x|x|x|x
Line Items Only|x|x|x|x|x|x|x
Hard Skills only |x|x|x|x|x|x|x

![alt text](img/log_res_table.png)

![alt text](img/vega12.png)

### Discussion 

1. **Full Text scored higher overall**: Though there is an excess of words, the score increased with the volume of words. Full Text > Line Item Text > Hard Skill Text

2. **Overfit without regularization**: With no regularization (plain logistic regression), there is severe overfitting in the training model. With training scores around 0.9 and testing scores around 0.6, the model is overfit. As we can see from the curves, increasing the regularization (decreasing C) brings the training and test scores closer together but lowers the overall score, so there is a tradeoff.

3. **Final choice: log C = -2, or C = 0.01**: At this point the train score is around 0.615 and the test score is 0.524, which are very close, without sacrificing too much of the overall score.
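
The shape of this sweep can be sketched on synthetic data. Everything below (the `make_classification` data, the split, the variable names) is an assumption for illustration and will not reproduce the scores above; only the L1 penalty, the liblinear solver, and the C grid come from the table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the vectorized job postings (not the real data).
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = []
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # Smaller C means stronger L1 (lasso) regularization.
    base = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=0)
    model = OneVsRestClassifier(base).fit(X_train, y_train)
    results.append((C, model.score(X_train, y_train), model.score(X_test, y_test)))

for C, train_score, test_score in results:
    print("C=%-6g train=%.3f test=%.3f" % (C, train_score, test_score))
```

Comparing the train and test columns across the C values is the same comparison the curves above make: the gap between them is the overfitting, and the level of the test score is what gets traded away.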

<a id='section16'></a> <a href='#home'>Back to Table of Contents</a>

## 16. Revisiting the Model with Regularization, C=0.01

### Confusion matrix, run 26 times and converted to chart format

![alt text](img/vega19.png)

The similarity chart has been updated for the regularized model. Large circles mean that there is a high similarity between the two titles. Dark circles mean that this occurs very often. Note that the largest circle corresponds to a maximum probability of 0.6.

![alt text](img/vega13.png)

![alt text](img/vega14.png)

<a id='section17'></a> <a href='#home'>Back to Table of Contents</a>
## 17. Comparison of Betas

After re-running the model with regularization to reduce overfitting, the betas are compared between the unregularized and regularized versions of the model to see how the results have shifted. The coefficients are taken from the two fitted One-vs-Rest models, and the top words are plotted for the following classes:

- Data Scientist
- Data Analyst
- Data Engineer
- Research Scientist

![alt text](img/vega15.png)

![alt text](img/vega16.png)

![alt text](img/vega17.png)

![alt text](img/vega18.png)

![alt text](img/vega19.png)

<a id='section18'></a> <a href='#home'>Back to Table of Contents</a>
## 18. Conclusions and Takeaways

### Modeling 

- **Data science is still an ambiguous field** - It is being approached from two main directions: the software/data engineering end and the business-oriented analytics end. One end is seeking to build the next level of intelligence and automation, and the other is trying to understand the sea of data that is being generated daily. As a result, these responsibilities will creep into existing roles until new, fully defined roles can be established. This can be seen in the ~60% prediction rate. Dedicated and specific positions are forming, but there is still a lot of ambiguity and crossover as companies figure this out.

- **There is no bad score** - What I love about this problem is that there is no right answer. If the prediction score were incredibly low, then job titles would mean nothing and job descriptions would not be correlated with them. If the prediction score were 100%, we would only have to apply to data scientist / data analyst postings and could take solace that no other jobs slip through our fingers. At ~60%, as you apply by title to various jobs, about 40% of the other opportunities out there are misnamed, or could not be found because they were named incorrectly.

- **Prediction could be better, but it is uncertain by how much** - With a rough score of 0.5-0.6 on the prediction, it's a good sign that more detail could be mined from the job postings. Ensemble methods were avoided because the goal was to understand the inner workings, not to achieve the highest prediction score.

- **Additional Models** - Further work with decision tree classifiers may yield more insights, but I believe further feature engineering on the actual text would yield the most benefit.

- **Multiclass adds n-fold processing time** - With grid search, cross-validation scoring, and regularization, there are already a lot of iterations to complete, and having multiple classes only adds to the processing requirements of this project.


### Webscraping:

- **Anything can be webscraped if you have patience:** that is, if you are willing to pull data at 1-2 records a minute. At that rate, how good of a competitor / pirate could you be?

- **HTML is messy business:** Many top companies' HTML pages are full of all sorts of junk. Whether on purpose or by accident, their pages are a mess. Be prepared to dig for tags.

- **Internet HTML formatting is the wild west:** Each company's site is built differently. Pulling from company websites would be ideal, but it would take significant time to design an extraction template for each one. This why 

- **With big data comes big duplication:** There are a lot of duplicates out there, and because of the data size there is nowhere to dedup. I started out with 50k job postings that shrunk to 18k because of repostings or duplication across multiple job sites.


### NLP for Job Postings

- **Jobs are abstract items, inherently difficult** - Jobs in general are difficult to define at times. Unless there is a specific purpose or skill that is required, such as "blacksmith" or "must know piping", many jobs sound the same at a high level: talking to people, setting up meetings, and getting a wide variety of tasks complete. It's not always easy to summarize a job, much less translate your current experience into a different position. Being very organized with monthly team meetings and designing websites sometimes gets abstracted to: people-person, go-getter, not afraid to be technical.

- **Garbage in, garbage out** - Unfortunately for job postings, there are a lot of unnecessary words. Phrases added for legal or generic requirements, like "be a strong communicator" or "be a hard worker", are a lot of noise compared to the rest of the data.

- **Removing disclaimers** - Every company has a "fluff" piece near the beginning of the posting and a disclaimer near the bottom about race, gender, and equal opportunity. It's difficult to strip out these common words to isolate the target.

- **Responsibilities / Requirements / Must Haves / What you'll do / What you'd need** - This section is often the key area to scrutinize to determine the minimum requirements for applying. Unfortunately, every hip recruiter or company has decided to rename this section to something else. Sometimes jobs will have "must haves" and "nice to haves", and sometimes it's built into each line.

### Next Steps:

- **Problem is now set up for Topic Modeling** - Since it is shown that there is about 40% ambiguity in the overall data science market from looking at pure words, it's the perfect time to move to advanced NLP or topic modeling. Topic modeling will extract more of the latent detail and provide a pseudo-metric to measure against. Instead of "how many times do data & science appear", the questions will be "how close to the business-reporting topic does this job posting land?" and "how close is it to the statistical-data-prediction topic?" I've prototyped this process, but that in itself is a whole other capstone.

