# Word2Vec Word Embedding Using Skills from IT Job Postings and Developers' Profiles

This notebook trains a skill embedding with the Word2Vec technique.

The training corpus combines two sources: skill requirements from IT job postings (Skill2vec dataset) and developers' self-reported skills (Stack Overflow Developer Survey).

The trained model is saved to a file so that it can be loaded later.

## Import Libraries

In [14]:
import pandas as pd
from gensim.models import Word2Vec
import numpy as np

## Load Roles Dataset

In [15]:
# Read the Skill2vec dataset, which lists the skills required for each role
df_roles = pd.read_csv(filepath_or_buffer="../2-data/ITroles.csv", sep=",", encoding="latin1")
df_roles = df_roles[['id', 'skills']]
df_roles

index,id,skills
0,19805,diploma;machining;cnc m;mould;conventional mac...
1,80208,Compensation;Benefits;HR Functions;Alm;Payroll...
2,122729,Simulink;stateflow;Matlab developer;targetlink...
3,4772,gis;analysis;geographic_information_system;esr...
4,44923,Full Stack Developer;AngularJS;SaaS applicatio...
...,...,...
10353,91663,customer interaction;knowledge;java;android;pr...
10354,86050,Technical Management;Project Management;MS SQL...
10355,54515,XCode;IOS;Objective C;Project Management;;;;;;...
10356,36160,Director;NoSQL;Node.js;CTO;SQL;JIRA;Agile;PHP;...
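
The cell above reads every column and then keeps only `id` and `skills`. As a minimal sketch, the same selection can be done while parsing by passing `usecols` to `pd.read_csv`. The in-memory CSV, its rows, and the `extra` column below are made up for illustration and only stand in for `../2-data/ITroles.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for ITroles.csv: an unnamed index column plus an extra field.
csv_text = (
    ",id,skills,extra\n"
    "0,101,Python;SQL,foo\n"
    "1,102,Java;;Spring,bar\n"
)

# usecols keeps only the named columns while the file is parsed.
df_roles_demo = pd.read_csv(io.StringIO(csv_text), sep=",", usecols=["id", "skills"])
print(df_roles_demo)
```

Selecting columns at read time avoids loading fields that are dropped immediately afterwards; whether that matters depends on how wide the real file is.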


## Load Employees Dataset

In [16]:
# Read the developers' skills (Stack Overflow Developer Survey)
df_people = pd.read_csv(filepath_or_buffer="../2-data/employees.csv", sep=",", encoding="latin1")
df_people = df_people[['id', 'skills']]
df_people

index,id,skills
0,2,JavaScript;TypeScript
1,3,C#;C++;HTML/CSS;JavaScript;Python;Microsoft SQ...
2,4,C#;JavaScript;SQL;TypeScript;Microsoft SQL Ser...
3,5,C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript;Cl...
4,6,C++;Lua;;;;;Homebrew
...,...,...
68821,73264,Bash/Shell;Dart;JavaScript;PHP;Python;SQL;Type...
68822,73265,Bash/Shell;HTML/CSS;JavaScript;Python;SQL;Elas...
68823,73266,HTML/CSS;JavaScript;PHP;Python;SQL;MariaDB;Mic...
68824,73267,C#;Delphi;VBA;Microsoft SQL Server;MongoDB;Oracle


## Combine both Datasets

In [17]:
df_combined = pd.concat([df_people, df_roles])
df_combined

index,id,skills
0,2,JavaScript;TypeScript
1,3,C#;C++;HTML/CSS;JavaScript;Python;Microsoft SQ...
2,4,C#;JavaScript;SQL;TypeScript;Microsoft SQL Ser...
3,5,C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript;Cl...
4,6,C++;Lua;;;;;Homebrew
...,...,...
10353,91663,customer interaction;knowledge;java;android;pr...
10354,86050,Technical Management;Project Management;MS SQL...
10355,54515,XCode;IOS;Objective C;Project Management;;;;;;...
10356,36160,Director;NoSQL;Node.js;CTO;SQL;JIRA;Agile;PHP;...
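
Note that the combined frame keeps each source's own row labels: the output starts at index 0 and ends at 10356, so labels are repeated. This does not affect the training below, which only uses the `skills` column, but it matters for label-based lookups. A small sketch of the difference, using toy frames whose values are illustrative:

```python
import pandas as pd

# Toy frames standing in for df_people and df_roles (values are illustrative).
people = pd.DataFrame({"id": [2, 3], "skills": ["JavaScript;TypeScript", "C#;C++"]})
roles = pd.DataFrame({"id": [19805], "skills": ["diploma;machining"]})

# Default concat keeps each frame's own row labels, so label 0 appears twice.
stacked = pd.concat([people, roles])
print(stacked.index.tolist())      # [0, 1, 0]

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame.
renumbered = pd.concat([people, roles], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```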


In [18]:
# Split each semicolon-separated skills string into a list
df_combined['skills'] = df_combined['skills'].apply(lambda x: x.split(';'))

# Drop the empty strings left behind by consecutive semicolons
df_combined['skills'] = df_combined['skills'].apply(lambda x: [value for value in x if value != ''])
df_combined

index,id,skills
0,2,"[JavaScript, TypeScript]"
1,3,"[C#, C++, HTML/CSS, JavaScript, Python, Micros..."
2,4,"[C#, JavaScript, SQL, TypeScript, Microsoft SQ..."
3,5,"[C#, HTML/CSS, JavaScript, SQL, Swift, TypeScr..."
4,6,"[C++, Lua, Homebrew]"
...,...,...
10353,91663,"[customer interaction, knowledge, java, androi..."
10354,86050,"[Technical Management, Project Management, MS ..."
10355,54515,"[XCode, IOS, Objective C, Project Management]"
10356,36160,"[Director, NoSQL, Node.js, CTO, SQL, JIRA, Agi..."
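
The split-and-filter step above assumes every `skills` value is a string. As a hedged sketch, a helper such as the hypothetical `split_skills` below does both steps in one pass and also tolerates missing values. Stripping surrounding whitespace is an extra assumption that the original cell does not make.

```python
import pandas as pd


def split_skills(raw):
    """Turn 'A;B;;C' into ['A', 'B', 'C']; non-string (missing) values become []."""
    if not isinstance(raw, str):
        return []
    # Assumption: surrounding whitespace is not meaningful, so it is stripped.
    return [part.strip() for part in raw.split(";") if part.strip()]


# Illustrative rows: repeated separators, a missing value, and padded names.
demo = pd.DataFrame(
    {"id": [6, 7, 8], "skills": ["C++;Lua;;;;;Homebrew", None, " SQL ; Python "]}
)
demo["skills"] = demo["skills"].apply(split_skills)
print(demo["skills"].tolist())
```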


In [19]:
# Check the typical number of skills per row
df_combined['NumSkills'] = df_combined['skills'].apply(len)
max_num = df_combined['NumSkills'].max()
mean_num = df_combined['NumSkills'].mean()
median_num = df_combined['NumSkills'].median()
print("Max:", max_num, "- Mean:", mean_num, "- Median:", median_num)

# Drop the helper column again
df_combined = df_combined[['id','skills']]
df_combined

Max: 960 - Mean: 14.359302384320065 - Median: 12.0


index,id,skills
0,2,"[JavaScript, TypeScript]"
1,3,"[C#, C++, HTML/CSS, JavaScript, Python, Micros..."
2,4,"[C#, JavaScript, SQL, TypeScript, Microsoft SQ..."
3,5,"[C#, HTML/CSS, JavaScript, SQL, Swift, TypeScr..."
4,6,"[C++, Lua, Homebrew]"
...,...,...
10353,91663,"[customer interaction, knowledge, java, androi..."
10354,86050,"[Technical Management, Project Management, MS ..."
10355,54515,"[XCode, IOS, Objective C, Project Management]"
10356,36160,"[Director, NoSQL, Node.js, CTO, SQL, JIRA, Agi..."
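
The gap between the maximum (960) and the median (12) suggests a few very long rows. One way to look at this more closely is to inspect quantiles of the list lengths and the share of rows that are longer than a Word2Vec context window can span. The sketch below uses made-up skill lists; on the real data the input would be `df_combined['skills']`, and `WINDOW` is an assumed name for the window size used for training in this notebook.

```python
import pandas as pd

WINDOW = 7  # assumed name; same value as the window used for training in this notebook

# Illustrative skill lists; the real input would be df_combined['skills'].
skills = pd.Series([
    ["JavaScript", "TypeScript"],
    ["C++", "Lua", "Homebrew"],
    [f"skill_{i}" for i in range(20)],  # a long outlier row
])

lengths = skills.apply(len)
print(lengths.quantile([0.5, 0.9, 0.99]))

# A row with more than WINDOW + 1 skills contains at least one pair of skills
# that are more than WINDOW positions apart and so never share a context window.
share_wider = (lengths > WINDOW + 1).mean()
print(f"rows wider than the window: {share_wider:.0%}")
```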


## Word Embedding relating skills that are seen together

### Word Embedding for Skills

In [34]:
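# Each row's skill list is treated as one "sentence"
sentences = df_combined['skills'].tolist()
# Skip-gram (sg=1), 300-dimensional vectors, context window of 7;
# skills that occur fewer than 2 times are ignored (min_count=2)
model = Word2Vec(sentences, min_count=2, vector_size=300, window=7, sg=1)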


In [36]:
# Save the full model (vocabulary, vectors and training state) to disk
model.save("model-w2vcombinedfiltered")

In [44]:
# Reload the saved model from disk
model = Word2Vec.load("model-w2vcombinedfiltered")

## Check Word Embedding

In [45]:
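# Word Mover's Distance between two skill sets (lower means more similar);
# gensim needs an optional optimal-transport dependency for this call
model.wv.wmdistance(['SQL'], ['Python', 'SQL'])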


0.375141147279869
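
For intuition about this value: when one of the two skill sets contains a single skill, the Word Mover's Distance has a closed form, because all of the mass must flow from that one skill to the skills of the other set. The sketch below works this out with NumPy. It assumes the standard formulation (uniform bag-of-words weights, Euclidean distance between vectors as the transport cost) and made-up 2-D vectors; it does not call gensim and is not meant to reproduce the number above.

```python
import numpy as np

# Hypothetical 2-D embeddings, already unit length.
vectors = {
    "SQL": np.array([1.0, 0.0]),
    "Python": np.array([0.0, 1.0]),
}


def wmd_from_single_word(word, document):
    """WMD between [word] and document, with equal mass on every token."""
    weight = 1.0 / len(document)
    # All mass starts at `word`, so each target token receives its own weight
    # and the total cost is the weighted sum of the distances.
    return sum(
        weight * np.linalg.norm(vectors[word] - vectors[token])
        for token in document
    )


distance = wmd_from_single_word("SQL", ["Python", "SQL"])
print(distance)
```

Under these assumptions, half of the mass stays on `SQL` at zero cost and the other half travels to `Python`, so the result is half the distance between the two vectors.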

In [43]:
# The 20 skills closest to 'SQL' by cosine similarity
sims = model.wv.most_similar('SQL', topn=20)
sims

# model.wv.similarity('SQL', 'Python')

# df_combined['skills'].max()


[('Neo4j', 0.8012363910675049),
 ('Oracle Cloud Infrastructure', 0.777561604976654),
 ('IBM DB2', 0.7762055993080139),
 ('Colocation', 0.7742033004760742),
 ('Angular.js', 0.7695358991622925),
 ('Angular', 0.7673373222351074),
 ('OVH', 0.7647529244422913),
 ('IBM Cloud or Watson', 0.764323890209198),
 ('Microsoft Azure', 0.7611520886421204),
 ('PowerShell', 0.7608779668807983),
 ('AWS', 0.7600660920143127),
 ('Redis', 0.7571758031845093),
 ('SQLite', 0.7569435834884644),
 ('Managed Hosting', 0.7567907571792603),
 ('Yarn', 0.7511159777641296),
 ('CouchDB', 0.7501111030578613),
 ('MongoDB', 0.7494320273399353),
 ('Vue.js', 0.7491989135742188),
 ('Kubernetes', 0.7491278052330017),
 ('VMware', 0.748875617980957)]
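
The scores in this list are cosine similarities between the vector of `SQL` and the vectors of the other skills. As a minimal sketch of that ranking, the NumPy code below applies the same idea to made-up 2-D vectors; the `toy_vectors` dictionary and the `top_similar` helper are hypothetical and not part of gensim.

```python
import numpy as np

# Hypothetical 2-D embeddings for a handful of skills.
toy_vectors = {
    "SQL": np.array([1.0, 0.0]),
    "SQLite": np.array([0.9, 0.1]),
    "Python": np.array([0.5, 0.5]),
    "Lua": np.array([0.0, 1.0]),
}


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_similar(query, vectors, topn=3):
    """Rank every other key by cosine similarity to `query`, highest first."""
    scores = [
        (name, cosine(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = top_similar("SQL", toy_vectors)
print(ranking)
```

Because cosine similarity ignores vector length, two skills score highly whenever their vectors point in a similar direction, regardless of how often each skill occurs.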