#Keyword Extraction and Clustering

#Problem Statement

Using the provided dataset:
1. Mine top skills from Job Descriptions in column B
2. Cluster JDs based on features and attributes extracted from part 1
3. Explain methodology and provide step by step thought process.

###Installing required libraries

In [4]:
import nltk
nltk.download('all')

!pip install yake
!pip install tqdm
!pip install textacy
!pip install pytextrank
!pip install spacy
!pip install spacytextblob
!python -m textblob.download_corpora
!python -m spacy download en_core_web_sm

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloadin

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacytextblob
  Downloading spacytextblob-4.0.0-py3-none-any.whl (4.5 kB)
Installing collected packages: spacytextblob
Successfully installed spacytextblob-4.0.0
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Package conll2000 is alrea

###Importing Libraries

In [1]:
# import libraries
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
import re
import math
import textacy
import spacy
import pytextrank
import yake
import nltk
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from textacy.extract import keyterms
from tqdm import tqdm
from nltk import ne_chunk
from nltk.util import bigrams, trigrams, ngrams
from nltk.stem.wordnet import WordNetLemmatizer
from bs4 import BeautifulSoup
from string import punctuation
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')
nlp = spacy.load("en_core_web_sm")
import matplotlib.pyplot as plt
%matplotlib inline

###Loading the dataset

In [2]:
df = pd.read_csv('/content/jd2.csv')

### Checking a sample of size 5 from the dataset

In [3]:
df.sample(5)

Unnamed: 0,ID,Job Description
5,20167509,<p>The Key Objectives is As follows :</p><p>• ...
4,20169174,<p>Serves as a senior compliance risk officer ...
24,20181388,<p><span>Serves as a senior compliance risk an...
1,20169305,<p>The Financial Advisor SAFE Act is a seasone...
7,20173082,<ul><li>The Model/Anlys/Valid Sr Analyst is a ...


### Checking the shape

In [4]:
print("Data Shape:", df.shape)

Data Shape: (64, 2)


### Checking data types

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ID               64 non-null     int64 
 1   Job Description  64 non-null     object
dtypes: int64(1), object(1)
memory usage: 1.1+ KB


### Checking for missing values

In [42]:
df.isnull().sum()

ID                               0
Job Description                  0
Clean_Job_Description            0
Clean_Job_Description_no_stop    0
Class                            0
risk                             0
management                       0
analysis                         0
technology                       0
audit                            0
compliance                       0
dtype: int64

### Checking for duplicate values

In [44]:
df[df.duplicated()]

Unnamed: 0,ID,Job Description,Clean_Job_Description,Clean_Job_Description_no_stop,Class,risk,management,analysis,technology,audit,compliance


#Data Cleaning

In [6]:
def convert_utf8(s):
    return str(s)

df['Job Description'] = df['Job Description'].map(convert_utf8)

### Copying the dataset to another column for pre-processing

In [7]:
df['Clean_Job_Description'] = df['Job Description']

### Using Beautiful Soup for removing html tags

In [8]:
df['Clean_Job_Description'] = [BeautifulSoup(text).get_text() for text in df['Clean_Job_Description'] ]
df['Clean_Job_Description'][0]

'Within ISG, GPU is an integral part of the Centralized Pricing team and operates on a global model with end to end ownership at GPC–Gurgaon and providing 24/6 coverage. The team includes dedicated Pricing Experts, with extensive experience in Security setups & corporate action treatments. Over the years, GPU has gathered expertise in pricing a variety of capital market instruments. The team responsible for providing security pricing data to fund accounting team to ensure timely and accurate NAV is delivered to the clients. Additionally they have a specialized team that provides valuations for OTC derivative instruments.Primary responsibilities include – process transactions and facilitate accurate and timely processing of daily prices, security setup and corporate action events, meeting service levels, production workflows, and training.\xa0 Perform various tasks to ensure the quality of data held for financial instruments are kept up to date. Work closely with internal clients, vario

###Removing any URLs in the text

In [9]:
def remove_urls(s):
    s = re.sub('[^\s]*.com[^\s]*', " ", s)
    s = re.sub('[^\s]*www.[^\s]*', " ", s)
    return s

df['Clean_Job_Description'] = df['Job Description'].map(remove_urls)

### Removing any asterisks in the text

In [10]:
def remove_star_words(s):
    return re.sub('[^\s]*[\*]+[^\s]*', " ", s)

df['Clean_Job_Description'] = df['Clean_Job_Description'].map(remove_star_words)

### Removing any remaining tags

In [11]:
def remove_tags(s):
  CLEANR = re.compile('<.*?>') 
  return re.sub(CLEANR, ' ', s)

df['Clean_Job_Description'] = df['Clean_Job_Description'].map(remove_tags)

### Removing any punctuations

In [12]:
# Remove the punctuations
def remove_punctuation(s):
    global punctuation
    for p in punctuation:
        s = s.replace(p, ' ')
    return s

df['Clean_Job_Description'] = df['Clean_Job_Description'].map(remove_punctuation)

### Converting all text to lower case

In [13]:
df['Clean_Job_Description'] = df['Clean_Job_Description'].map(lambda x: x.lower())

### Removing any remaining special characters

In [14]:
def remove_special_chars(s):
    return re.sub('\W+'," ", s)

df['Clean_Job_Description'] = df['Clean_Job_Description'].map(remove_special_chars)

In [15]:
df['Clean_Job_Description'].replace('nbsp',"",regex = True, inplace = True)

### Checking Language of the dataset

Two rows fail the test. On checking one of them is in Polish. We proceed to drop it.

In [16]:
for idx, x in enumerate(df['Clean_Job_Description']):
  if x.isascii()== False:
    print(idx)

32
57


In [17]:
df['Clean_Job_Description'][32]

' the role is part of the valuation control group vcg team in budapest vcg s primary responsibility is to reasonably ensure that risk portfolios are fairly valued in accordance with applicable standards and regulatory requirements vcg works with trading and quantitative groups to fully understand the impact of valuation uncertainty on the desk s level of pricing and any fair value adjustments the team also participates in the approval of new products where appropriate job background the valuation control group vcg is an independent group of subject matter experts in product control within the finance division vcg is organized along business lines and clients of the group include desk heads line product control market risk model validation compliance legal department middle office internal audit external audit firms and regulatory bodies in countries impacted by citi s footprint vcg collaborates closely with its client base to understanding of business valuation issues and accurate exec

In [18]:
df['Clean_Job_Description'][57]

' quality amp control manager jest odpowiedzialny za kontrolę i nadzór nad zapewnieniem wysokiej jakości obsługi klienta i zgodności podejmowanych działań z obowiązującymi procedurami i standardami banku nadzoruje i asystuje w procesie obsługi klienta w oddziale podejmuje także działania administracyjne służące utrzymaniu funkcjonalności oddziału oraz odpowiada za maksymalizację wartości dodanej poprzez dbałość o kryteria jakościowe obowiązki kontrola zgodności działań pracowników oddziału z procedurami i zasadami obowiązującymi w banku przeprowadzanie kontroli wewnętrznych m in zgodności z procedurami wytycznymi audytu i organów zewnętrznych np knf sprawdzanie operacji wykonywanych przez pracowników oddziału zapewnianie zgodności z aktualnie obowiązującymi procedurami i wewnętrznymi zarządzeniami banku procesów realizacji transakcji oraz ich systematyczna kontrola realizacja celów jakościowych odpowiedzialność za pozytywny wynik audytu wewnętrznego bieżące informowanie dyrektora oddzi

In [19]:
df.drop(index = 57, axis = 0, inplace = True)


### Final cleaned text

In [20]:
df

Unnamed: 0,ID,Job Description,Clean_Job_Description
0,20153988,"<p>Within ISG, GPU is an integral part of the ...",within isg gpu is an integral part of the cen...
1,20169305,<p>The Financial Advisor SAFE Act is a seasone...,the financial advisor safe act is a seasoned ...
2,20170606,<p></p><p><i>Description:</i></p><p>The Financ...,description the finance transformation amp go...
3,20170684,<p>The Applications Development Technology Lea...,the applications development technology lead ...
4,20169174,<p>Serves as a senior compliance risk officer ...,serves as a risk officer for independent comp...
...,...,...,...
59,20189322,<p>The Senior Auditor is an intermediate level...,the senior auditor is an intermediate level r...
60,80008365,<p>The Investment Banking Associate is an inte...,the investment banking associate is an interm...
61,20187917,<p>The IT Project Senior Analyst is a seasoned...,the it project senior analyst is a seasoned p...
62,20183523,<p>The role of this position is responsible fo...,the role of this position is responsible for ...


### Final Shape

In [21]:
df.shape

(63, 3)

### Saving the final cleaned file

In [22]:
df.to_csv('data.csv')

# Data Pre-processing

In [51]:
en_stopwords = stopwords.words('english') #importing stopwords to be removed

#Defining function for removing stopwords and removing it
def remove_stopwords(s):
    global en_stopwords
    s = word_tokenize(s)
    s = " ".join([w for w in s if w not in en_stopwords])
    return s

df['Clean_Job_Description_no_stop'] = df['Clean_Job_Description'].map(remove_stopwords) 

corpus = " ".join(df['Clean_Job_Description_no_stop'].tolist()) #creating a corpus without any stopwords

tokens = word_tokenize(corpus) #Creating individual tokens from the corpus

temp = nltk.pos_tag(tokens)
postag = [a[1] for a in temp]
all_postags = pd.Series(postag)
pd.DataFrame(all_postags.value_counts()).reset_index().rename(columns = {'index': 'POS_Tag', 0:'Total_Count'})

Unnamed: 0,POS_Tag,Total_Count
0,NN,7395
1,NNS,4018
2,JJ,3637
3,VBG,1624
4,VBP,967
5,RB,490
6,VBZ,404
7,VBD,359
8,VBN,298
9,VB,298


In [61]:
lmtzr = WordNetLemmatizer() #Importing Lemmatizer Function for changing words to their root forms 

lemmatized_tokens = [lmtzr.lemmatize(token) for token in tokens] #Lemmatizing the tokens

lemmatized_corpus = [" ".join(lemmatized_tokens)] #Creating corpus of lemmatized tokens

fd = nltk.FreqDist(lemmatized_tokens) #Calculating Frequency Distribution of lemmatized tokens

# Creating a list of 50 words having the highest frequency
top_words = []
for key, value in fd.items():
    top_words.append((key, value))
top_words = sorted(top_words, key = lambda x:x[1], reverse = True)
top_words = top_words[:50]
top_words

[('business', 318),
 ('risk', 300),
 ('experience', 254),
 ('team', 208),
 ('skill', 208),
 ('control', 188),
 ('management', 186),
 ('issue', 162),
 ('process', 161),
 ('work', 141),
 ('development', 135),
 ('ability', 134),
 ('knowledge', 133),
 ('within', 131),
 ('area', 120),
 ('project', 116),
 ('data', 114),
 ('reporting', 112),
 ('client', 111),
 ('function', 110),
 ('policy', 103),
 ('related', 96),
 ('technology', 95),
 ('strong', 92),
 ('including', 91),
 ('degree', 89),
 ('testing', 88),
 ('service', 87),
 ('level', 87),
 ('financial', 87),
 ('product', 87),
 ('responsibility', 84),
 ('audit', 84),
 ('practice', 83),
 ('solution', 83),
 ('activity', 82),
 ('regulatory', 82),
 ('regulation', 80),
 ('role', 78),
 ('citi', 77),
 ('law', 75),
 ('system', 72),
 ('firm', 72),
 ('review', 71),
 ('analysis', 71),
 ('understanding', 70),
 ('rule', 70),
 ('required', 68),
 ('application', 67),
 ('requirement', 66)]

In [29]:
corpus_bigrams = list(nltk.bigrams(lemmatized_tokens)) #creating tokens for bigrams

fd = nltk.FreqDist(corpus_bigrams) #Calculating Frequency Distribution of lemmatized bigram tokens

# Creating a list of 50 bigrams having the highest frequency
top_words = []
for key, value in fd.items():
    top_words.append((key, value))
top_words = sorted(top_words, key = lambda x:x[1], reverse = True)
top_words = top_words[:20]
top_words

[(('control', 'issue'), 56),
 (('rule', 'regulation'), 54),
 (('law', 'rule'), 52),
 (('risk', 'control'), 42),
 (('ass', 'risk'), 41),
 (('risk', 'business'), 41),
 (('business', 'decision'), 39),
 (('decision', 'made'), 38),
 (('made', 'demonstrating'), 38),
 (('demonstrating', 'particular'), 38),
 (('particular', 'consideration'), 38),
 (('consideration', 'firm'), 38),
 (('firm', 'reputation'), 38),
 (('reputation', 'safeguarding'), 38),
 (('safeguarding', 'citigroup'), 38),
 (('citigroup', 'client'), 38),
 (('client', 'asset'), 38),
 (('asset', 'applicable'), 38),
 (('applicable', 'law'), 38),
 (('regulation', 'adhering'), 38)]

In [30]:
corpus_trigrams = list(nltk.trigrams(lemmatized_tokens)) #creating tokens for trigrams

fd = nltk.FreqDist(corpus_trigrams) #Calculating Frequency Distribution of lemmatized trigram tokens

# Creating a list of 50 bigrams having the highest frequency
top_words = []
for key, value in fd.items():
    top_words.append((key, value))
top_words = sorted(top_words, key = lambda x:x[1], reverse = True)
top_words = top_words[:20]
top_words

[(('law', 'rule', 'regulation'), 52),
 (('ass', 'risk', 'business'), 38),
 (('risk', 'business', 'decision'), 38),
 (('business', 'decision', 'made'), 38),
 (('decision', 'made', 'demonstrating'), 38),
 (('made', 'demonstrating', 'particular'), 38),
 (('demonstrating', 'particular', 'consideration'), 38),
 (('particular', 'consideration', 'firm'), 38),
 (('consideration', 'firm', 'reputation'), 38),
 (('firm', 'reputation', 'safeguarding'), 38),
 (('reputation', 'safeguarding', 'citigroup'), 38),
 (('safeguarding', 'citigroup', 'client'), 38),
 (('citigroup', 'client', 'asset'), 38),
 (('client', 'asset', 'applicable'), 38),
 (('asset', 'applicable', 'law'), 38),
 (('applicable', 'law', 'rule'), 38),
 (('rule', 'regulation', 'adhering'), 38),
 (('regulation', 'adhering', 'policy'), 38),
 (('adhering', 'policy', 'applying'), 38),
 (('policy', 'applying', 'sound'), 38)]

#Keyword Extraction

In [31]:
data = df['Clean_Job_Description_no_stop'] 

# Importing CountVectorizer with vocabulary size as 20000, consiting of unigrams, bigrams and trigrams, ignoring words present in 95% of the documents and fitting on the data
cv = CountVectorizer(max_df=0.95,         
                   max_features=20000,  
                   ngram_range=(1,3)
                  )
word_count_vector=cv.fit_transform(data)

In [62]:
#Importing Tf-Idf Vectorizer from scikit learn and performing operation for unigrams
tfidf = TfidfVectorizer(ngram_range = (1,1), use_idf = False, analyzer = 'word', )
result_tfidf = tfidf.fit_transform(lemmatized_tokens)
result_tfidf = result_tfidf.mean(axis = 0)
result_tfidf = pd.DataFrame(result_tfidf, columns =  tfidf.get_feature_names())
result_tfidf = result_tfidf.T.reset_index().rename(columns = {'index' :'word',0:'score'})
result_tfidf.sort_values(by = 'score',ascending = False)[0:50]

Unnamed: 0,word,score
295,business,0.015732
1820,risk,0.014842
825,experience,0.012566
2076,team,0.01029
1926,skill,0.01029
468,control,0.009301
1266,management,0.009202
1156,issue,0.008015
1619,process,0.007965
2274,work,0.006976


In [63]:
#Importing Tf-Idf Vectorizer from scikit learn and performing operation for bigrams
tfidf = TfidfVectorizer(ngram_range = (2,2), use_idf = False)
result_tfidf = tfidf.fit_transform(lemmatized_corpus)
result_tfidf = result_tfidf.mean(axis = 0)
result_tfidf = pd.DataFrame(result_tfidf, columns =  tfidf.get_feature_names())
result_tfidf = result_tfidf.T.reset_index().rename(columns = {'index' :'word',0:'score'})
result_tfidf.sort_values(by = 'score',ascending = False)[0:50]

Unnamed: 0,word,score
2147,control issue,0.151715
9050,rule regulation,0.146296
5572,law rule,0.140878
8885,risk control,0.113786
8875,risk business,0.111077
935,ass risk,0.111077
1397,business decision,0.105658
8211,regarding personal,0.102949
2502,decision made,0.102949
6910,particular consideration,0.102949


##Spacy SGRank

In [35]:
#Using SGRank implementation from spacy for finding the scores
lemmatized_corpus = " ".join(lemmatized_tokens)
doc = nlp(lemmatized_corpus)
extract = textacy.extract.keyterms.sgrank(doc, ngrams=(1), window_size=2, normalize=None, topn = 50, include_pos=['NOUN'])
keywords = []
score = []
for a, b in extract:
        keywords.append(a)
        score.append(b)
result_sgrank = pd.DataFrame({'Keywords':keywords,'Score': score})
result_sgrank.sort_values(by = 'Score', ascending = False).head(50)

Unnamed: 0,Keywords,Score
0,business,0.070235
1,team,0.069931
2,experience,0.066073
3,risk,0.05727
4,process,0.035048
5,management,0.034381
6,project,0.024551
7,knowledge,0.023803
8,ability,0.022335
9,issue,0.02202


##Text Rank

In [36]:
#Using Text Rank and fining words with score greater than 8

text_rank = spacy.load("en_core_web_sm")
text_rank.add_pipe("textrank")
doc = text_rank(lemmatized_corpus)

textrank_result = {}
for phrase in doc._.phrases:
  textrank_result['phrase'] = phrase.text
  textrank_result['rank'] = phrase.rank
  textrank_result['count'] = phrase.count

text = []
chunks = []
count = []
rank = []

for p in doc._.phrases:
  text.append(p.text)
  chunks.append(p.chunks)
  count.append(p.count)
  rank.append(p.rank)

result_textrank = pd.DataFrame({'text':text,'chunks': chunks, 'count': count, 'rank': rank})

result_textrank[result_textrank['count'] > 8]

Unnamed: 0,text,chunks,count,rank
0,risk business decision,"[(risk, business, decision), (risk, business, ...",28,0.073197
91,business practice,"[(business, practice), (business, practice), (...",38,0.060562
180,risk,"[(risk), (risk), (risk), (risk), (risk), (risk...",10,0.056262
256,managing reporting control issue transparency ...,"[(managing, reporting, control, issue, transpa...",19,0.053913
506,team,"[(team), (team), (team), (team), (team), (team...",15,0.046978
590,high level review type work,"[(high, level, review, type, work), (high, lev...",16,0.044552
749,multiple project,"[(multiple, project), (multiple, project), (mu...",11,0.040937
836,job related duty,"[(job, related, duty), (job, related, duty), (...",16,0.039139
865,ability,"[(ability), (ability), (ability), (ability), (...",25,0.038502
883,citigroup client asset applicable law rule reg...,"[(citigroup, client, asset, applicable, law, r...",38,0.038116


##Yake

In [68]:
# Using Yake Algorithm for finding keywords and displaying top 50 observations
extractor = yake.KeywordExtractor(n = 1,windowsSize = 2,top=50, stopwords=None)
keywords = extractor.extract_keywords(str(lemmatized_corpus))
for kw, v in keywords:
  print("Keyphrase: ",kw, ": score", v)

Keyphrase:  business : score 7.819383943558719e-05
Keyphrase:  risk : score 7.836502047658387e-05
Keyphrase:  experience : score 0.00011016997576757604
Keyphrase:  skill : score 0.00012048375600417095
Keyphrase:  control : score 0.00013024350693703195
Keyphrase:  team : score 0.00015085553300317947
Keyphrase:  management : score 0.00016685515870909655
Keyphrase:  issue : score 0.00017175717240327173
Keyphrase:  process : score 0.00022023986547417256
Keyphrase:  work : score 0.0002265687308559922
Keyphrase:  area : score 0.00023544443821968387
Keyphrase:  ability : score 0.0002531117463989247
Keyphrase:  reporting : score 0.0002543055516159544
Keyphrase:  development : score 0.00025460323980073266
Keyphrase:  knowledge : score 0.0002609939075667504
Keyphrase:  function : score 0.00027053135714393657
Keyphrase:  policy : score 0.00029759176280277216
Keyphrase:  client : score 0.00031659113674001657
Keyphrase:  project : score 0.0003271028354470217
Keyphrase:  degree : score 0.00033136763

#Clustering based on Keywords

### Creating an empty column called class

In [38]:
df['Class'] = " "

### Based on the above results and domain knowledge, creatingc classes

In [80]:
classes = [ 'risk', 'management','analysis', 'technology', 'audit']


### Counting the number of observations of the keywords in each of the rows

In [79]:
df.drop(labels = classes, axis = 1, inplace = True)

In [81]:
for i in classes:
  df[i] = " "
  df[i] = df['Clean_Job_Description_no_stop'].str.count(i)
df

Unnamed: 0,ID,Job Description,Clean_Job_Description,Clean_Job_Description_no_stop,Class,risk,management,analysis,technology,audit
0,20153988,"<p>Within ISG, GPU is an integral part of the ...",within isg gpu is an integral part of the cen...,within isg gpu integral part centralized prici...,analysis,0,2,3,0,1
1,20169305,<p>The Financial Advisor SAFE Act is a seasone...,the financial advisor safe act is a seasoned ...,financial advisor safe act seasoned profession...,risk,1,0,0,0,0
2,20170606,<p></p><p><i>Description:</i></p><p>The Financ...,description the finance transformation amp go...,description finance transformation amp governa...,risk,10,2,0,2,1
3,20170684,<p>The Applications Development Technology Lea...,the applications development technology lead ...,applications development technology lead analy...,technology,2,1,2,14,0
4,20169174,<p>Serves as a senior compliance risk officer ...,serves as a risk officer for independent comp...,serves risk officer independent compliance ris...,risk,11,4,0,0,4
...,...,...,...,...,...,...,...,...,...,...
59,20189322,<p>The Senior Auditor is an intermediate level...,the senior auditor is an intermediate level r...,senior auditor intermediate level role respons...,audit,3,3,0,1,12
60,80008365,<p>The Investment Banking Associate is an inte...,the investment banking associate is an interm...,investment banking associate intermediate leve...,risk,1,0,0,0,0
61,20187917,<p>The IT Project Senior Analyst is a seasoned...,the it project senior analyst is a seasoned p...,project senior analyst seasoned professional r...,management,1,3,0,0,1
62,20183523,<p>The role of this position is responsible fo...,the role of this position is responsible for ...,role position responsible providing core opera...,analysis,1,1,2,1,0


### Classifying individual job descriptions based on the class framed above

In [82]:
df['Class'] = df[classes].idxmax(axis=1)
df

Unnamed: 0,ID,Job Description,Clean_Job_Description,Clean_Job_Description_no_stop,Class,risk,management,analysis,technology,audit
0,20153988,"<p>Within ISG, GPU is an integral part of the ...",within isg gpu is an integral part of the cen...,within isg gpu integral part centralized prici...,analysis,0,2,3,0,1
1,20169305,<p>The Financial Advisor SAFE Act is a seasone...,the financial advisor safe act is a seasoned ...,financial advisor safe act seasoned profession...,risk,1,0,0,0,0
2,20170606,<p></p><p><i>Description:</i></p><p>The Financ...,description the finance transformation amp go...,description finance transformation amp governa...,risk,10,2,0,2,1
3,20170684,<p>The Applications Development Technology Lea...,the applications development technology lead ...,applications development technology lead analy...,technology,2,1,2,14,0
4,20169174,<p>Serves as a senior compliance risk officer ...,serves as a risk officer for independent comp...,serves risk officer independent compliance ris...,risk,11,4,0,0,4
...,...,...,...,...,...,...,...,...,...,...
59,20189322,<p>The Senior Auditor is an intermediate level...,the senior auditor is an intermediate level r...,senior auditor intermediate level role respons...,audit,3,3,0,1,12
60,80008365,<p>The Investment Banking Associate is an inte...,the investment banking associate is an interm...,investment banking associate intermediate leve...,risk,1,0,0,0,0
61,20187917,<p>The IT Project Senior Analyst is a seasoned...,the it project senior analyst is a seasoned p...,project senior analyst seasoned professional r...,management,1,3,0,0,1
62,20183523,<p>The role of this position is responsible fo...,the role of this position is responsible for ...,role position responsible providing core opera...,analysis,1,1,2,1,0


In [83]:
pd.DataFrame(df['Class'].value_counts()).reset_index().rename(columns = {'index': 'Class', 'Class': 'Count'})

Unnamed: 0,Class,Count
0,risk,33
1,management,16
2,analysis,5
3,technology,5
4,audit,4
