# This example demonstrates few-shot learning for classifying job titles into blue-collar and white-collar jobs
## What is few shot learning?
Few-shot learning means training a model from only a handful of labeled examples. In many cases a pretrained network or embedding is reused, with many of its early layers frozen to limit overfitting.
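
Freezing early layers can be sketched as follows. This is a minimal illustration, not part of the notebook's pipeline: the small `nn.Sequential` model is a hypothetical stand-in for a pretrained network, and only its final layer is left trainable.

```python
import torch.nn as nn

# Hypothetical stand-in for a pretrained network: two "early" layers and a head.
model = nn.Sequential(
    nn.Linear(768, 256),  # early layer (frozen below)
    nn.ReLU(),
    nn.Linear(256, 64),   # early layer (frozen below)
    nn.ReLU(),
    nn.Linear(64, 2),     # task-specific head (left trainable)
)

# Freeze every layer except the last one so the optimizer cannot update it.
for layer in list(model.children())[:-1]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

With so few trainable parameters left, a model has far less capacity to memorize a tiny training set.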
## Dataset
Since there is no ground-truth dataset, we use the few examples listed on this page:

https://www.indeed.com/career-advice/finding-a-job/difference-between-blue-and-white-collar-jobs

Note that there are only 8 blue-collar and 8 white-collar job titles.

## Modelling
In this example we use BERT-base (`bert-base-uncased`) and take its second-to-last hidden state as a **sentence** embedding.

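We do this by pooling the second-to-last layer over the token dimension. The code below sums over all 512 padded positions, which equals an average pool up to a constant factor of 512.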
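
Average pooling over the token dimension can be sketched as below. This is a minimal illustration on toy arrays rather than the notebook's exact code: the function name `masked_mean_pool` is hypothetical, and as an assumption it uses the attention mask so that padding positions are excluded from the average.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over positions where attention_mask == 1.

    hidden: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len) with 1 for real tokens
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# Toy input: 1 sequence, 4 positions (the last 2 are padding), 3 dimensions.
hidden = np.array([[[1.0, 2.0, 3.0],
                    [3.0, 4.0, 5.0],
                    [9.0, 9.0, 9.0],
                    [9.0, 9.0, 9.0]]])
attention_mask = np.array([[1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, attention_mask)
print(pooled)
```

The padded rows (all 9s) do not affect the result, so the embedding depends only on the real tokens.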

The benefit of using BERT is that it works on whole sentences and requires little to no text cleaning (its subword vocabulary avoids the out-of-vocabulary problem).
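
The reason unseen words rarely cause trouble is that BERT's WordPiece tokenizer splits them into known subword pieces. The sketch below illustrates the greedy longest-match idea in plain Python; the `wordpiece` function and the tiny `toy_vocab` are hypothetical and only stand in for the real tokenizer and its vocabulary of roughly 30,000 pieces.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no vocabulary piece fits at this position
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; "landscaper" itself is not in it.
toy_vocab = {"land", "##scape", "##r", "##s", "work", "##er"}
print(wordpiece("landscaper", toy_vocab))
```

A job title that never appeared as a whole word can still be represented through its pieces, which is why little text cleaning is needed.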

We then pass this embedding to a linear SVM, using scikit-learn's `LinearSVC`.
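
The classification step can be sketched in isolation as follows. The data here is synthetic: two random clusters of 768-dimensional vectors stand in for the pooled BERT embeddings, so the numbers say nothing about real job titles. Only the `LinearSVC` usage mirrors the notebook.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
dim = 768  # hidden size of BERT-base

# Synthetic stand-ins for pooled embeddings: two clusters of 8 points each.
blue = rng.normal(loc=-1.0, scale=0.5, size=(8, dim))
white = rng.normal(loc=1.0, scale=0.5, size=(8, dim))
X = np.vstack([blue, white])
y = np.array([0] * 8 + [1] * 8)  # 0 = blue collar, 1 = white collar

clf = LinearSVC()
clf.fit(X, y)

# Classify a new synthetic point drawn from the "white collar" cluster.
new_point = rng.normal(loc=1.0, scale=0.5, size=(1, dim))
print(clf.predict(new_point))
```

A linear model is a reasonable fit here because there are far more embedding dimensions than training examples, so a separating hyperplane is easy to find.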

Note that it is also possible to fine-tune BERT, but that is not shown in this example.

In [None]:
import torch
import transformers
import torch.nn as nn
import torch.nn.functional as F

# Here is the training dataset, manually extracted from Indeed
Yes, this is the **entire** training set.

In [None]:
white_collar = [
    "Accountant", "Market researcher", "Health services administrator",
    "Executive director", "Civil engineer", "Attorney",
    "Software engineer", "Physician",
]
blue_collar = [
    "Warehouse associate", "Inspector/packer", "Landscape laborer",
    "Refuse collector", "Flooring installer", "Mechanic",
    "HVAC technician", "Electrician",
]

In [None]:
TOKENIZER = transformers.BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [None]:
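def tokenize(text):
    """Tokenize one string into BERT inputs, each of shape (1, 512)."""
    tokens = TOKENIZER.encode_plus(
        text,
        truncation=True,
        add_special_tokens=True,  # adds [CLS] and [SEP]
        max_length=512,
        padding="max_length",  # every input is padded to 512 tokens
    )
    # unsqueeze(0) adds the batch dimension expected by the model
    return {
        "input_ids": torch.tensor(tokens["input_ids"]).unsqueeze(0),
        "token_type_ids": torch.tensor(tokens["token_type_ids"]).unsqueeze(0),
        "attention_mask": torch.tensor(tokens["attention_mask"]).unsqueeze(0),
    }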

In [None]:
bert = transformers.BertModel.from_pretrained('bert-base-uncased',output_hidden_states=True)

In [None]:
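def output_second_last_hidden_state(tokens):
    """Return the second-to-last hidden layer as a NumPy array of shape (1, 512, 768).

    Call this inside torch.no_grad(); .numpy() fails on tensors that track gradients.
    """
    out = bert(
        tokens["input_ids"],
        attention_mask=tokens["attention_mask"],
        token_type_ids=tokens["token_type_ids"],
    )
    return out.hidden_states[-2].numpy()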

In [None]:
blue_collar_tokens = []
white_collar_tokens = []

In [None]:
for text in blue_collar:
    with torch.no_grad():
        tokens = tokenize(text)
        embed = output_second_last_hidden_state(tokens)
        blue_collar_tokens.append(embed)
for text in white_collar:
    with torch.no_grad():
        tokens = tokenize(text)
        embed = output_second_last_hidden_state(tokens)
        white_collar_tokens.append(embed)

In [None]:
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC
import pandas as pd

In [None]:
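# Blue-collar embeddings come first, then white-collar ones
combined = blue_collar_tokens + white_collar_tokens
# Stack to (16, 512, 768), then sum over the token axis to get (16, 768)
combined = np.concatenate(combined).sum(axis=1)
# Labels: 0 = blue collar (first 8 rows), 1 = white collar (last 8 rows)
label = np.array([1 if i>7 else 0 for i in range(16)])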

# As the PCA projection shows, the two classes are fairly well separated in embedding space

In [None]:
pca = PCA(n_components=2)
dr = pca.fit_transform(combined)
plt.scatter(dr[:,0],dr[:,1],c=[1 if i>7 else 0 for i in range(16)])
plt.show()

In [None]:
svc = LinearSVC()
svc.fit(combined,label)

In [None]:
def predict_class(text):
    with torch.no_grad():
        tokens = tokenize(str(text))
        embed = output_second_last_hidden_state(tokens).sum(axis=1)
        answer = svc.predict(embed)
        if answer.item() == 0:
            return "Blue Collar"
        else:
            return "White Collar"

# The test set results seem relatively good

In [None]:
test = ["cleaner","driver","data scientist","machine learning engineer",
        "hawker","farmer","miner","ceo","cfo","construction worker",
        "research coordinator","project manager","electrical engineer",
        "software developer","service crew","cook","lab technician","software developer","Firefighter",
        "janitor","landscaper","manufacturing worker","Business executive","Market researcher","lawyer","Architect"]
for t in test:
    print(t,":",predict_class(t))

In [None]:
df = pd.read_csv("../input/jobposts/data job posts.csv")
df = df.sample(100,random_state=42) # Sample 100 rows: the code embeds one title at a time (no batching), so it is slow
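
One way to lift the sampling restriction would be to embed titles in batches rather than one at a time. The sketch below shows only the chunking step in plain Python; the helper name `batched` is hypothetical and is not part of this notebook. Each batch could then be passed to the tokenizer as a list of strings (Hugging Face tokenizers accept a list together with `padding=True` and `return_tensors="pt"`), which is left out here.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

titles = ["Accountant", "Mechanic", "Attorney", "Electrician", "Physician"]
batches = list(batched(titles, batch_size=2))
print(batches)
```

Padding each batch only to its longest title, instead of to 512 tokens, would likely reduce the compute per title as well.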

In [None]:
df["class_collar"] = df["Title"].apply(predict_class)

In [None]:
df

In [None]:
df.to_csv("result.csv",index=False)