# Predicting Category from Job Postings

In this workshop, we'll try a classification task involving text.

This task is to predict job categories from Google Job postings.

## Dataset
https://www.kaggle.com/niyamatalmass/google-job-skills/version/1

(Requires free Kaggle account)

## Setup

We'll be using spaCy (https://spacy.io/), a popular library for natural language processing, to do some basic text processing.

Run this from your conda environment:

```
conda install -y spacy
python -m spacy download en_core_web_sm
```

The first command installs spaCy. 

The second command downloads spaCy's pre-trained English model. Models are downloaded separately because they are large files and each language has its own.

scikit-learn also offers some basic text processing, such as tokenization and a built-in English stop word list, but it is less capable than spaCy (for example, it has no lemmatizer).

## Data Exploration

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy

In [2]:
df = pd.read_csv('job_skills.csv')  # assumed filename; adjust to where you saved the Kaggle download
df.head()

### Check for NaN

In [None]:
# check for NaN

In [None]:
# there are NA values, drop them and then check again
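
A minimal sketch of the two steps above, run on a small hypothetical frame (`toy`) rather than the real `df`. The column names mirror the ones used later in this notebook; the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; swap in the real DataFrame when running the workshop
toy = pd.DataFrame({
    'Category': ['Sales', 'Marketing', 'Sales'],
    'Responsibilities': ['Sell things', np.nan, 'Close deals'],
})

# Count missing values per column
na_counts = toy.isna().sum()
print(na_counts)

# Drop rows that contain any missing value, then check again
toy_clean = toy.dropna()
print(toy_clean.isna().sum())
```

On the real data, assigning the result back (`df = df.dropna()`) keeps the later cells working on the cleaned frame.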


### Merge some smaller classes

In [None]:
# Combine a few categories that are smaller, e.g.
#   Data Center & Network, Network Engineering, IT & Data Management => IT & Data Management
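
One possible way to do the merge is a dictionary passed to `Series.replace`. The sketch below uses a hypothetical frame (`toy`) and only the three categories named in the comment above; which other categories to merge is left as a judgment call.

```python
import pandas as pd

# Hypothetical stand-in for df with a few category values
toy = pd.DataFrame({'Category': ['Data Center & Network', 'Network Engineering',
                                 'IT & Data Management', 'Sales']})

# Map each small category to the category it is folded into
merge_map = {
    'Data Center & Network': 'IT & Data Management',
    'Network Engineering': 'IT & Data Management',
}
toy['Category'] = toy['Category'].replace(merge_map)

# Check the class sizes after merging
print(toy['Category'].value_counts())
```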




## Data Transformation

### Label encoding

Label encode the 'Category' column into numbers to form our output (Y).
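
A minimal sketch of label encoding with scikit-learn's `LabelEncoder`, on a hypothetical frame (`toy`). The plotting cell further down reads the labels from `df.y`, so on the real data the same call could be assigned to `df['y']`; that column name is inferred from the plotting code, not stated elsewhere in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for df
toy = pd.DataFrame({'Category': ['Sales', 'Marketing', 'Sales', 'Finance']})

# Fit the encoder on the category names and store the integer labels
encoder = LabelEncoder()
toy['y'] = encoder.fit_transform(toy['Category'])

print(toy)
# The position of a name in classes_ is its encoded label
print(encoder.classes_)
```

Keeping the fitted `encoder` around is useful later, because `encoder.inverse_transform` turns predicted numbers back into category names.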

### Text to Vectors (a little taste of NLP)

In order to train a model, we'll need to convert the "Responsibilities" column from text into vectors.

Let's try:
1. Lemmatizing the words using English language rules (e.g. features -> feature).
2. Removing stop words such as "a", "the", "to" that don't contribute much semantic meaning.
3. Computing the TF-IDF (Term Frequency - Inverse Document Frequency) for each word (term) in the Responsibilities column. This gives us a real number representation of each word.
4. Treating each word as a separate feature in X. (This is where lemmatizing and removing stop words help keep the number of features small.)

#### Notes:
The downside of this approach, although it is relatively simple, is that we end up with a lot of features. This should be okay for a small corpus (a couple of thousand short sentences in a specific domain).

To handle larger corpora (hundreds of MBs to GBs), a more scalable approach is to train word vectors, which is slightly more involved (it needs hyperparameter guessing^H^H tuning). Word vectors would be overkill for this task, but they are an option once we've covered them in class.

In [None]:
# Example text we want to convert
df.Responsibilities[0]

In [None]:
import spacy

# Load spaCy's small English pipeline (tokenizer, tagger, lemmatizer)
nlp = spacy.load('en_core_web_sm')

# transformation function to use in pd.Series.apply()
def tokenize_text(text):
    """Tokenizes the text by lemmatizing and removing stop words
    Args:
        text - the input text
    Returns:
        a string of space-separated lemmas
    """
    # process the text
    doc = nlp(text)

    # https://spacy.io/api/token
    # token.lemma_ is the lemma assigned by the loaded pipeline. It is used here
    # instead of the standalone Lemmatizer and LEMMA_* tables, which cannot be
    # imported this way in spaCy 3.x.
    lemmas = [token.lemma_ for token in doc
              if not token.is_stop and token.is_alpha]

    # create a string
    return ' '.join(lemmas)
            
# Test
tokenize_text(df.Responsibilities[0])

In [None]:
# Tokenize the column
df_tokenized = df.Responsibilities.apply(tokenize_text)
df_tokenized.head()

### TF-IDF

See text.ipynb under 'Frequency Analysis' for formulas.

- We'll perform frequency analysis on each term (e.g. 'shape') across all documents.
- Each Job Responsibility is treated as a "document".
- Frequency analysis computes a number indicating how important a term is to a document, relative to the rest of the corpus.
- Terms that appear in only a few documents are given a higher weight than terms that appear everywhere (e.g. datum or shepherd).

http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
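
To make the weighting concrete, here is a small sketch on a made-up three-document corpus (the documents are hypothetical, not taken from the dataset). It uses `TfidfVectorizer` with its default settings and compares the weight of a rare term with the weight of a term that occurs in every document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-tokenized documents
docs = [
    'manage cloud project',
    'manage sales project',
    'manage datum pipeline',
]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_  # maps each term to its column index

# Within the third document, compare a rare term ('datum', 1 of 3 documents)
# with a common one ('manage', 3 of 3 documents)
row = weights[2]
print('datum :', row[vocab['datum']])
print('manage:', row[vocab['manage']])
```

Both terms occur once in the third document, so the difference in weight comes entirely from the inverse document frequency.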

In [None]:
# Apply the vectorizer on all tokenized Responsibility rows

vectorizer = TfidfVectorizer(lowercase=False, decode_error='ignore')

X = vectorizer.fit_transform(df_tokenized)

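# convert sparse matrix to a dense NumPy array
# (toarray() returns an ndarray; todense() returns np.matrix, which recent scikit-learn versions no longer accept)
X_dense = X.toarray()

print(X_dense.shape)

# print the features
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
print(vectorizer.get_feature_names_out())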

## Visualize

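t-SNE (t-Distributed Stochastic Neighbour Embedding) often separates clusters in high-dimensional, sparse features better than PCA, because it tries to preserve local neighbourhoods rather than global variance.

Text features are "sparse" because each document contains only a small fraction of the vocabulary, so most entries in each row of X are zero.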

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X_dense)

In [None]:
fig, ax = plt.subplots(figsize=(15, 10))

ax.scatter(X_2d[:, 0], X_2d[:, 1], c=df.y)

ax.set(title='t-SNE plot for Responsibilities (X) and Job Categories (y)',
       xlabel='X_2d[:, 0]', ylabel='X_2d[:, 1]')
ax.grid()
plt.show()

## Learning Curve

Example of learning curve.
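
A minimal sketch of computing a learning curve with scikit-learn's `learning_curve`. Everything here is an assumption for illustration: the data is synthetic (`make_classification`) rather than the TF-IDF matrix, and `LogisticRegression` is just one possible classifier, since the notebook does not fix a model at this point.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for (X, y); replace with the real features and labels
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=42)

# Train on 20%..100% of the training folds, with 5-fold cross-validation
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_toy, y_toy,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# Average the scores over the 5 folds at each training size
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f'{n:4d} samples: train={tr:.3f}  validation={va:.3f}')
```

Plotting `train_mean` and `val_mean` against `train_sizes` gives the usual picture: a large gap between the two curves suggests overfitting, while two low curves that have converged suggest underfitting.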

## Predictions

Let's perform some predictions with real job postings.

1. Go to https://careers.google.com/jobs/
2. Pick a posting and copy its Job Responsibilities
3. Transform the input by tokenizing it and then applying the fitted TfidfVectorizer
4. Get the prediction

In [None]:
# https://careers.google.com/jobs/#!t=jo&jid=/google/migrations-architect-google-cloud-singapore-3974200108&

test = """Work closely with strategic clients, both engineering and non-technical, to lead migration projects and customer implementations on Google Cloud.
Coordinate with a diverse team of stakeholders and supporting Googlers, including Sales, Solutions Engineers and the Professional Services organization.
Build core migration tooling across all Google Cloud Platform products and relevant third-party software.
Establish and drive planning and execution steps towards production deployments.
Write/develop deployment templates, orchestration scripting, database replication configurations, CI/CD pipeline assemblies, etc."""
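
Steps 3 and 4 could look roughly like the sketch below. It is self-contained and rests on several assumptions: the training texts and labels are made up, the `toy_`-prefixed names are hypothetical, `LogisticRegression` stands in for whichever model was trained, and the spaCy tokenizing step is skipped because the toy texts are already lowercase single words. On the real data, `test` would first go through `tokenize_text`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data standing in for the tokenized Responsibilities
toy_texts = [
    'build cloud migration', 'deploy cloud infrastructure',
    'lead sales team', 'grow sales revenue',
]
toy_labels = ['Cloud', 'Cloud', 'Sales', 'Sales']

# Encode the labels and fit the vectorizer and model on the training data
toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_labels)

toy_vectorizer = TfidfVectorizer(lowercase=False)
toy_X = toy_vectorizer.fit_transform(toy_texts)

toy_model = LogisticRegression().fit(toy_X, toy_y)

# Use transform(), not fit_transform(), so the new text is mapped onto
# the vocabulary learned from the training data
new_text = 'write cloud deployment script'
X_new = toy_vectorizer.transform([new_text])

# Convert the predicted number back into a category name
predicted = toy_encoder.inverse_transform(toy_model.predict(X_new))
print(predicted[0])
```

Words in the new text that the vectorizer never saw during fitting are ignored, so a posting made up mostly of unseen vocabulary gives the model very little to work with.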

