# Problem Statement
While researching, Amanda found a raw dataset of over 30k data points
(Raw_Skills_Dataset.csv) that contains technical skills with a lot of jargon
mixed in.
The task is to clean this dataset and extract the technical (hard) skills.

In [1]:
import numpy as np
import pandas as pd


In [2]:
# read the datasets:
raw_skills = pd.read_csv("Raw_Skills_Dataset.csv")
tech_skills = pd.read_csv("Example_Technical_Skills.csv")

In [3]:
raw_skills.shape

(34116, 1)

In [4]:
tech_skills.shape

(979, 1)

In [5]:
tech_skills.head(10)

Unnamed: 0,Technology Skills
0,SAP Fiori Developer
1,Oracle Instance Management & Strategy
2,Boomi Master Data Management
3,Digital Manufacturing on Cloud ( DMC)
4,DevOps
5,CA SAM
6,OpenShift
7,Acxiom Data Analytics
8,SAP Digital Boardroom
9,Seeburger BIS


In [6]:
raw_skills.head(10)

Unnamed: 0,RAW DATA
0,What ifs
1,seniority
2,familiarity
3,functionalities
4,Lambdas
5,Java Streams
6,Object Oriented analysis
7,Relational Databases
8,SQL
9,ORM


# Data Preprocessing

In [7]:
# check for null values that would need to be removed:
tech_skills.isnull().sum()

Technology Skills    0
dtype: int64

In [8]:
raw_skills.isnull().sum()

RAW DATA    0
dtype: int64

# Text Preprocessing

We will now use NLP techniques to process the text in both datasets so that it can be fed to the model of our choice.

In [9]:
import nltk

In [10]:
# remove common words from the text (stopwords)
# reduce each word to its base form (lemmatizer)
# use the regular expressions library (re)
from nltk.corpus import stopwords  
from nltk.stem import WordNetLemmatizer 
import re 

In [11]:
lemmatizer = WordNetLemmatizer()

In [12]:
# list to store the processed text of tech_skills
tech = []
# build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))
for i in range(0, tech_skills.shape[0]):
    # replace every character except a-z and A-Z with ' ':
    review = re.sub('[^a-zA-Z]', ' ', tech_skills['Technology Skills'][i])
    # convert to lower case:
    review = review.lower()
    # split into words:
    review = review.split()
    # drop stopwords and lemmatize the remaining words:
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    tech.append(review)

In [13]:
# apply the same steps to raw_skills:
raw = []
# build the stopword set once instead of re-reading the list for every word
stop_words = set(stopwords.words('english'))
for i in range(0, raw_skills.shape[0]):
    review = re.sub('[^a-zA-Z]', ' ', raw_skills['RAW DATA'][i])
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    raw.append(review)
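
The two loops above apply identical steps, so they could share one helper. The following is a minimal sketch, not the notebook's own code: `clean_text` and `DEMO_STOP_WORDS` are hypothetical names, the tiny stopword set is only a stand-in for `set(stopwords.words('english'))`, and the default `lemmatize` argument is an identity function standing in for `lemmatizer.lemmatize`, so the example runs without downloading any NLTK data.

```python
import re

# Illustrative stand-in (assumption): the notebook would pass
# set(stopwords.words('english')) here instead.
DEMO_STOP_WORDS = {"a", "an", "and", "of", "on", "the", "to", "with"}


def clean_text(text, stop_words, lemmatize=lambda word: word):
    """Keep letters only, lowercase, drop stopwords, lemmatize, and re-join."""
    # Replace every character that is not a letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", str(text))
    tokens = letters_only.lower().split()
    kept = [lemmatize(token) for token in tokens if token not in stop_words]
    return " ".join(kept)


demo = ["Digital Manufacturing on Cloud ( DMC)", "Oracle Instance Management & Strategy"]
cleaned = [clean_text(item, DEMO_STOP_WORDS) for item in demo]
print(cleaned)
```

In the notebook this would presumably be called as `clean_text(value, stop_words, lemmatizer.lemmatize)` for each value of `tech_skills['Technology Skills']` and `raw_skills['RAW DATA']`.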

In [32]:
# viewing some processed text
tech[10:17]

['sap transport management',
 'peoplesoft workflow',
 'nice actimize',
 'oracle access manager',
 'sap electronic data interchange edi',
 'idea crowdsourcing',
 'peoplesoft workforce planning']

In [15]:
raw[10:17]

['jpa',
 'hibernate',
 'mybatis',
 'hibernate',
 'code versioning tool',
 'git familiarity',
 'maven']

# Feature Extraction

In [16]:
# convert the text into vectors:
# TF-IDF is used because, unlike raw word counts, it weights each word
# by how important it is to a document relative to the whole collection
from sklearn.feature_extraction.text import TfidfVectorizer
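
To make the weighting concrete, here is a small pure-Python sketch of the arithmetic. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row). The three sample phrases and the helper `tfidf_row` are illustrative only.

```python
import math
from collections import Counter

# Illustrative sample, already preprocessed like the notebook's `tech` list.
docs = ["sap transport management", "peoplesoft workflow", "sap digital boardroom"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed TfidfVectorizer default).
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in doc_freq.items()}


def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


rows = [tfidf_row(tokens) for tokens in tokenized]
print(rows[0])
```

In the first row, `sap` appears in two of the three phrases and therefore receives a lower weight than `transport` or `management`, which appear in only one.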

In [17]:
tfidf_v=TfidfVectorizer()

In [18]:
# fit the vectorizer on the technical skills and transform them
X=tfidf_v.fit_transform(tech)

In [19]:
# transform the raw skills into a matrix using the vocabulary learnt from tech_skills
Y=tfidf_v.transform(raw)

In [20]:
# view the number of features:
X.toarray().shape

(979, 1232)
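
Because the vectorizer was fitted on the technical skills only, `transform` ignores every token outside those 1232 vocabulary terms. A raw entry with no known token becomes an all-zero row, and all such rows are identical to the model, so they all receive the same label. The sketch below checks vocabulary coverage in plain Python. It assumes the vectorizer's default token pattern (words of two or more characters), and the sample lists are hypothetical stand-ins for `tech` and `raw`.

```python
import re

# Assumed default token pattern of TfidfVectorizer: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it the way the default vectorizer would."""
    return TOKEN_PATTERN.findall(text.lower())


# Small hypothetical samples standing in for the notebook's `tech` and `raw` lists.
tech_sample = ["sap transport management", "oracle access manager", "devops"]
raw_sample = ["seniority", "oracle", "code versioning tool", "git familiarity"]

vocabulary = {token for text in tech_sample for token in tokenize(text)}

# Entries sharing no token with the vocabulary would become all-zero rows.
out_of_vocabulary = [
    text for text in raw_sample if not vocabulary & set(tokenize(text))
]
print(out_of_vocabulary)
```

Running the same check on the real lists would show how many raw entries the model cannot tell apart at all.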

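# Fit the Model

We will use the One-Class Support Vector Machine from scikit-learn. It is not pretrained: it is fitted on the technical skills vectors only and then flags raw entries that look unlike them as outliers.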

In [21]:
from sklearn.svm import OneClassSVM

In [22]:
# RBF kernel for a nonlinear decision boundary
model = OneClassSVM(kernel='rbf',gamma='auto')
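
As a side calculation on `gamma='auto'`: scikit-learn resolves it to `1 / n_features`, which is `1 / 1232` for the matrix above. The sketch below evaluates the RBF kernel `exp(-gamma * ||x - y||^2)` at the extreme distances two TF-IDF rows can have, assuming the default L2 normalisation (each non-empty row has unit length and no negative entries). The function name `rbf_kernel_value` is illustrative.

```python
import math

n_features = 1232          # from X.toarray().shape above
gamma = 1.0 / n_features   # what gamma='auto' resolves to in scikit-learn


def rbf_kernel_value(squared_distance, gamma):
    """RBF kernel value for a given squared Euclidean distance."""
    return math.exp(-gamma * squared_distance)


# For non-negative unit-length rows the squared distance lies between 0 and 2;
# against an all-zero row it is exactly 1.
k_identical = rbf_kernel_value(0.0, gamma)
k_vs_zero_row = rbf_kernel_value(1.0, gamma)
k_orthogonal = rbf_kernel_value(2.0, gamma)
print(k_identical, k_vs_zero_row, k_orthogonal)
```

All three values sit within about 0.002 of each other, so the kernel separates rows only weakly at this setting. Trying `gamma='scale'` or an explicit numeric value may be worth an experiment.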

In [23]:
model.fit(X)

OneClassSVM(gamma='auto')

In [24]:
pred=model.predict(Y)  

In [25]:
pred

array([ 1,  1,  1, ..., -1,  1,  1], dtype=int64)
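
`predict` labels each row `1` (inlier, here treated as a technical skill) or `-1` (outlier). A quick way to see the split is to count the labels. The list below is a hypothetical stand-in for the `pred` array.

```python
from collections import Counter

# Hypothetical predictions standing in for the notebook's `pred` array.
pred_sample = [1, 1, 1, -1, 1, -1, 1]

label_counts = Counter(pred_sample)
inlier_share = label_counts[1] / len(pred_sample)
print(label_counts[1], label_counts[-1], round(inlier_share, 3))
```

With the real array, `Counter(pred.tolist())` would give the same kind of summary.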

Now we will create a new dataset that contains only the entries from raw_skills predicted to be technical skills.

In [26]:
# add the predictions as a new column of raw_skills:
raw_skills['predictions']=pred  

In [27]:
with_pred =raw_skills[raw_skills.predictions==1]['RAW DATA'].to_numpy() 

In [28]:
with_pred

array(['What ifs', 'seniority', 'familiarity', ..., 'deadlines',
       'negotiation', 'deadlines'], dtype=object)

In [29]:
df=pd.DataFrame(with_pred) 

In [30]:
df

Unnamed: 0,0
0,What ifs
1,seniority
2,familiarity
3,ORM
4,JPA2
...,...
17454,all applicants
17455,negotiation
17456,deadlines
17457,negotiation


In [31]:
# save it to a CSV file:
df.to_csv('predicted_tech_skills.csv')
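
The saved file could be tidied a little further. The sketch below is one possible variant, not the notebook's own code: it names the column, drops repeated entries (the output above contains duplicates such as `negotiation`), and omits the row index. The column name `Technical Skills` and the sample values are assumptions, and the data is written to an in-memory buffer rather than a file so the example leaves nothing on disk.

```python
import io

import pandas as pd

# Hypothetical values standing in for the notebook's `with_pred` array.
with_pred_sample = ["ORM", "JPA2", "negotiation", "deadlines", "negotiation"]

# Name the column explicitly and drop repeats, keeping first occurrences.
result = pd.DataFrame({"Technical Skills": with_pred_sample}).drop_duplicates()

# index=False leaves out the row index, so reloading with pd.read_csv
# does not produce an extra "Unnamed: 0" column.
buffer = io.StringIO()
result.to_csv(buffer, index=False)
print(buffer.getvalue())
```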