# Word2Vec Word Embedding using skills from Job postings and Developers' Profiles

This notebook trains a skill embedding with Word2Vec.

The training corpus combines skills required in job postings (Skill2Vec dataset) with skills reported by developers (Stack Overflow Developer Survey).

The trained model is saved to a file so it can be loaded later without retraining.

## Import Libraries

In [1]:
import pandas as pd
from gensim.models import Word2Vec
import numpy as np

# Load Roles Dataset

In [2]:
# Read the Skill2Vec dataset, which lists the skills required for each role
df_roles = pd.read_csv(filepath_or_buffer="../2-data/ITroles.csv", sep=",", encoding="latin1")
df_roles = df_roles[['id', 'skills']]
df_roles

Unnamed: 0,id,skills
0,19805,diploma;machining;cnc m;mould;conventional mac...
1,80208,Compensation;Benefits;HR Functions;Alm;Payroll...
2,122729,Simulink;stateflow;Matlab developer;targetlink...
3,4772,gis;analysis;geographic_information_system;esr...
4,44923,Full Stack Developer;AngularJS;SaaS applicatio...
...,...,...
10353,91663,customer interaction;knowledge;java;android;pr...
10354,86050,Technical Management;Project Management;MS SQL...
10355,54515,XCode;IOS;Objective C;Project Management;;;;;;...
10356,36160,Director;NoSQL;Node.js;CTO;SQL;JIRA;Agile;PHP;...


In [28]:
# Optional filter, currently disabled (the whole cell is commented out).
# Get list of all skills available to be selected - Source: Stack Overflow Survey questions
#df_skills = pd.read_csv(filepath_or_buffer="../data/skill-list.csv", sep=",", encoding="latin1")


#df_roles = df_roles.sample(1000) # get sample


# Get only positions with skills that are mapped
#skills = df_skills['skill']

# Account for synonyms
#synonym_skill = df_skills['synonym']
#def find_index(arr, val):
#    for i in range(len(arr)):
#        if arr[i] == val:
#            return i
#    return -1

#count = 0
#for skill in skills:
#    count = count + df_roles["skills"].apply(lambda x: 1 if skill.lower() in [item.lower() for item in x.split(';')] or synonym_skill[find_index(skills, skill)] in [item.lower() for item in x.split(';')] else 0)
#    df_roles['KeepRow'] = count

# Get only roles that have mapped skills
#df_roles = df_roles.loc[(df_roles['KeepRow'] > 0)]

# Tokenize the skills
#df_roles['skills'] = df_roles['skills'].apply(lambda x: x.split(';'))

# Get only required columns
#df_roles = df_roles[["id", "skills"]]
#df_roles
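
The disabled cell above describes keeping only the roles that mention at least one skill (or synonym) from a reference list. A minimal pure-Python sketch of that idea follows. The `skill_synonyms` mapping and the three sample roles are hypothetical stand-ins for `skill-list.csv` and `ITroles.csv`, and matching on whole lowercased tokens is an assumption about the intended behaviour.

```python
def mentions_known_skill(skills_field, known_terms):
    """Return True if any semicolon-separated token matches a known term."""
    tokens = {tok.strip().lower() for tok in skills_field.split(";")}
    return not tokens.isdisjoint(known_terms)

# Hypothetical skill list with at most one synonym each (None when there is none).
skill_synonyms = {"JavaScript": "js", "Microsoft SQL Server": "ms sql", "Python": None}

# Lowercase every skill and synonym once, so matching is case-insensitive.
known_terms = {skill.lower() for skill in skill_synonyms}
known_terms |= {syn.lower() for syn in skill_synonyms.values() if syn}

# Hypothetical (id, skills) rows in the same format as the roles dataset.
roles = [
    (1, "diploma;machining;mould"),
    (2, "Technical Management;MS SQL;Agile"),
    (3, "java;android;python"),
]

kept = [role_id for role_id, skills in roles if mentions_known_skill(skills, known_terms)]
print(kept)  # [2, 3]
```

Building the set of known terms once avoids re-splitting every row for every skill, which is what the loop in the disabled cell would do.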

# Load Employees Dataset

In [3]:
df_people = pd.read_csv(filepath_or_buffer="../2-data/employees.csv", sep=",", encoding="latin1")
df_people = df_people[['id', 'skills']]
df_people

Unnamed: 0,id,skills
0,2,JavaScript;TypeScript
1,3,C#;C++;HTML/CSS;JavaScript;Python;Microsoft SQ...
2,4,C#;JavaScript;SQL;TypeScript;Microsoft SQL Ser...
3,5,C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript;Cl...
4,6,C++;Lua;;;;;Homebrew
...,...,...
68545,73264,Bash/Shell;Dart;JavaScript;PHP;Python;SQL;Type...
68546,73265,Bash/Shell;HTML/CSS;JavaScript;Python;SQL;Elas...
68547,73266,HTML/CSS;JavaScript;PHP;Python;SQL;MariaDB;Mic...
68548,73267,C#;Delphi;VBA;Microsoft SQL Server;MongoDB;Oracle


# Combine both Datasets

In [4]:
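# Stack the people's skills on top of the role requirements.
# Row labels from both files are kept, so index values repeat in the result.
df_combined = pd.concat([df_people, df_roles])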
df_combined

Unnamed: 0,id,skills
0,2,JavaScript;TypeScript
1,3,C#;C++;HTML/CSS;JavaScript;Python;Microsoft SQ...
2,4,C#;JavaScript;SQL;TypeScript;Microsoft SQL Ser...
3,5,C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript;Cl...
4,6,C++;Lua;;;;;Homebrew
...,...,...
10353,91663,customer interaction;knowledge;java;android;pr...
10354,86050,Technical Management;Project Management;MS SQL...
10355,54515,XCode;IOS;Objective C;Project Management;;;;;;...
10356,36160,Director;NoSQL;Node.js;CTO;SQL;JIRA;Agile;PHP;...


In [5]:
# Tokenize: split each semicolon-separated skills string into a list of skills
df_combined['skills'] = df_combined['skills'].apply(lambda x: x.split(';'))
df_combined

Unnamed: 0,id,skills
0,2,"[JavaScript, TypeScript]"
1,3,"[C#, C++, HTML/CSS, JavaScript, Python, Micros..."
2,4,"[C#, JavaScript, SQL, TypeScript, Microsoft SQ..."
3,5,"[C#, HTML/CSS, JavaScript, SQL, Swift, TypeScr..."
4,6,"[C++, Lua, , , , , Homebrew]"
...,...,...
10353,91663,"[customer interaction, knowledge, java, androi..."
10354,86050,"[Technical Management, Project Management, MS ..."
10355,54515,"[XCode, IOS, Objective C, Project Management, ..."
10356,36160,"[Director, NoSQL, Node.js, CTO, SQL, JIRA, Agi..."
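
Note that a plain `split(';')` keeps an empty string for every pair of consecutive semicolons (see row 4 above), and those empty strings become tokens in the training data. The sketch below shows one possible way to drop them. `tokenize_skills` is a hypothetical helper, not part of the original notebook, and it also returns an empty list for non-string cells such as `NaN`, should any exist.

```python
def tokenize_skills(skills_field):
    """Split a semicolon-separated skills string and drop empty tokens."""
    if not isinstance(skills_field, str):  # e.g. NaN from a missing CSV cell
        return []
    return [tok.strip() for tok in skills_field.split(";") if tok.strip()]

print(tokenize_skills("C++;Lua;;;;;Homebrew"))  # ['C++', 'Lua', 'Homebrew']
```

It could be applied with `df_combined['skills'].apply(tokenize_skills)`. Doing so would change the training data, so the saved model and the distance reported at the end of this notebook would no longer match.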


# Word Embedding relating skills that are seen together

## Word Embedding for Skills

In [6]:

#df_combined = df_combined.sample(100)

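# Train the Word2Vec model: each row's skill list is one training "sentence"
sentences = df_combined['skills'].tolist()
# min_count=1 keeps every skill, even those seen only once
# vector_size=300 sets the embedding dimension
# window=300 is far larger than most skill lists are likely to be, so skills
#   in the same row can all act as context for each other
# sg=1 selects skip-gram rather than CBOW
model = Word2Vec(sentences, min_count=1, vector_size=300, window=300, sg=1)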
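
To illustrate why a very large `window` suits unordered skill lists, the sketch below enumerates the (center, context) pairs that skip-gram would consider for one row. It is a simplified illustration only: `skipgram_pairs` is a hypothetical helper, and it ignores what gensim does internally (for example random window reduction, subsampling, and negative sampling).

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

skills = ["C#", "JavaScript", "SQL", "TypeScript"]

wide = list(skipgram_pairs(skills, window=300))   # every skill pairs with every other
narrow = list(skipgram_pairs(skills, window=1))   # only adjacent skills are paired

print(len(wide), len(narrow))  # 12 6
```

With a narrow window, which skills count as related would depend on their position in the list, and that position carries no meaning for a semicolon-separated skill set.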


In [7]:
model.save("model-w2vcombinedfiltered")

In [4]:
model = Word2Vec.load("model-w2vcombinedfiltered")


# Check Word Embedding

In [8]:
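# Word Mover's Distance between two skill sets; smaller values mean the sets
# are closer in the embedding space
model.wv.wmdistance(['SQL'], ['Python', 'SQL'])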


0.5123734885286234
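
As a rough guide to reading this number: Word Mover's Distance is the minimum cost of moving the word mass of one document onto the other, where the cost is the distance between word vectors. When one document is a single token, all mass has to flow to that token, so the result is the average distance from each token of the other document to it. The sketch below works through that special case with hypothetical 2-D vectors. It assumes uniform word weights and Euclidean distance, and it does not call gensim, whose `wmdistance` also needs an optimal-transport backend installed (POT or pyemd, depending on the gensim version).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def wmd_against_single_token(anchor_vec, doc_vecs):
    """WMD when one document is a single token.

    All mass must flow to the single token, so the distance is the mean
    distance from each token of the other document to that token.
    """
    return sum(euclidean(anchor_vec, v) for v in doc_vecs) / len(doc_vecs)

# Hypothetical toy vectors, not taken from the trained model.
vectors = {"SQL": (1.0, 0.0), "Python": (0.0, 1.0)}

d = wmd_against_single_token(vectors["SQL"], [vectors["Python"], vectors["SQL"]])
print(d)  # half the Python-to-SQL distance, since SQL-to-SQL costs nothing
```

Under these assumptions, the value of about 0.512 above would correspond to a distance of roughly 1.02 between the `Python` and `SQL` vectors in the space gensim uses for the computation.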