# Doc2Vec
The purpose of this section is to apply the Doc2Vec algorithm to our dataset.
For now, we'll only use the following columns:
[Headline, Industry, College-Name-1, Degree-Name-1, Field-Of-Study-1, Job-Title-1, Company-Name-1,
College-Name-2 (referring to the master's), Degree-Name-2 (referring to the master's)]

## Important to note
We only care about the Headline and Job-Title-1 columns. As described previously, our dataset has some
discrepancies: the Headline does not always reflect the Job-Title-1 column, so we need to apply a bit of logic.
For example, for profile_2 the Dates-Employed-1 column says 03-2022-Present, which means that profile_2 is currently employed there, but their headline might not reflect that. We therefore check for the "Present" marker, for example: `if str(employed).find('Present') != -1`
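
The rule described above can be sketched as a small function. This is only an illustration under stated assumptions: `current_position` is a hypothetical helper (the real logic lives in `transform_profession` in `helper`, which is not shown here), and missing values are assumed to arrive as empty strings.

```python
def current_position(headline, job_title, dates_employed):
    """Hypothetical helper: pick the text that best describes the current job.

    If Dates-Employed-1 contains 'Present', the job in Job-Title-1 is still
    ongoing, so we trust it over the Headline. Otherwise we fall back to the
    Headline.
    """
    title = str(job_title).strip()
    if "Present" in str(dates_employed) and title:
        return title
    return str(headline).strip()


# The job is ongoing, so Job-Title-1 wins over a stale headline.
ongoing = current_position("Student", "Machine Learning Engineer", "03-2022-Present")

# The job has ended, so we keep the headline.
ended = current_position("Data Scientist", "Intern", "01-2019-06-2019")
```

Using `"Present" in str(...)` is equivalent to the `find(...) != -1` check and reads a little more naturally.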

In [1088]:
# Main imports
from nltk.corpus import stopwords
from helper import remove_unused_columns, transform_profession, preprocess_text
import pandas as pd
import numpy as np
from tqdm import tqdm

tqdm.pandas(desc="progress-bar")
from gensim.models import Doc2Vec
from sklearn import utils
from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import TaggedDocument

## Cleaning up
As mentioned above, we'll now clean up these discrepancies in our dataset.
We really only care about each profile's current position.

In [1089]:
# Read the CSV file and drop the columns we don't use
file = '../excel-data/all-data-linkedin.csv'
data_csv = pd.read_csv(file)
data_top = remove_unused_columns(data_csv)[0:13]

# Get our columns into lists
# Note: the class labels ("categories") come from the Industry column
headline_categories = list(data_csv['Industry'])
profession = transform_profession(data_top, data=data_csv)
category_list = data_csv['Headline']
category_id = list(range(len(category_list)))  # one running id per profile

# Initialize our columns into a dataframe
dtf = pd.DataFrame()
dtf['category_id'] = category_id
dtf['categories'] = headline_categories
dtf['profession'] = profession

# Clean the data set first: remove stop words such as "I", "me", "you"
list_stop_of_words = stopwords.words('english')
dtf['clean_text_profession'] = dtf['profession'].apply(
    lambda x: preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=list_stop_of_words))

dtf.head()

| | category_id | categories | profession | clean_text_profession |
|---|---|---|---|---|
| 0 | 0 | Machine Learning | Director of Data Science, Machine Learning at ... | director data science machine learning walmart... |
| 1 | 1 | Machine Learning | Machine learning on Encrypted data Engineer | machine learning encrypted data engineer |
| 2 | 2 | Machine Learning | Machine Learning Research Scientist - Deep Lea... | machine learning research scientist deep learning |
| 3 | 3 | Machine Learning | Principal (Manager) R&D Data Scientist | principal manager rd data scientist |
| 4 | 4 | Machine Learning | Vice President of Machine Learning, Merchandis... | vice president machine learning merchandising ... |


### Split data set
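In this section we'll split the data set into a training set (75%) and a test set (25%). This is the default split of `train_test_split` when no `test_size` is passed, which is the case in the cell below.

In [1090]:
# training set (75%), test set (25%): the train_test_split default
train, test = train_test_split(dtf.reset_index(drop=True), random_state=0)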

# Introduction - Word Embedding
In this section we introduce Google's Word2vec algorithm, which Doc2Vec builds on, but first we'll briefly explain why we're using word embeddings. Word embedding is a class of techniques in which each word is represented by a real-valued vector in a predefined vector space.
Every word is mapped to one vector, and the vector values are learned in a way that resembles training a neural network.
As a result, words that appear in similar contexts in the corpus end up close to each other in that vector space.
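
To make "close in the vector space" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are hand-picked toy values for illustration only; they are not learned from our data, and real embeddings have far more dimensions (300 in the model trained later).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (made up): two job-role words and one company name.
toy_vectors = {
    "scientist": [0.9, 0.8, 0.1],
    "engineer": [0.8, 0.9, 0.2],
    "walmart": [0.1, 0.2, 0.9],
}

sim_roles = cosine_similarity(toy_vectors["scientist"], toy_vectors["engineer"])
sim_mixed = cosine_similarity(toy_vectors["scientist"], toy_vectors["walmart"])

# Words used in similar contexts should score higher than unrelated ones.
print(f"scientist vs engineer: {sim_roles:.3f}")
print(f"scientist vs walmart:  {sim_mixed:.3f}")
```

With these toy values the two role words score much higher than the role/company pair, which is the behavior we hope a trained embedding shows on real text.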

In [1091]:
num_of_words = dtf['profession'].apply(lambda x: len(x.split(' '))).sum()
print('---- num_of_words ----')
print(num_of_words)

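# Tag each document with its position. gensim expects `words` to be a list of tokens,
# so we split the cleaned string; a plain string would be read character by character.
train_tagged = [TaggedDocument(words=text.split(), tags=[str(i)]) for i, text in enumerate(train.clean_text_profession)]
test_tagged = [TaggedDocument(words=text.split(), tags=[str(i)]) for i, text in enumerate(test.clean_text_profession)]

---- num_of_words ----
718


## Distributed Memory (PV-DM) Doc2vec Training

In [1092]:
import multiprocessing
from sklearn import metrics

cores = multiprocessing.cpu_count()

model_doc2vec = Doc2Vec(
    dm=1,  # 1 = distributed memory (PV-DM); 0 would be distributed bag of words (PV-DBOW)
    vector_size=300,
    negative=5,  # gensim's default; negative=0 with the default hs=0 leaves no training objective
    min_count=2,
    workers=cores,
    alpha=.025,
    min_alpha=0.00025,
    window=2
)

model_doc2vec.build_vocab(train_tagged)

# A single train() call lets gensim decay the learning rate from alpha to min_alpha.
# Lowering alpha by 0.002 in a manual 30-epoch loop would push it below zero after 13 epochs.
model_doc2vec.train(utils.shuffle(train_tagged), total_examples=model_doc2vec.corpus_count, epochs=30)

# Training vectors are looked up by tag
X_train = np.array([model_doc2vec.dv[str(i)] for i in range(len(train_tagged))])
y_train = train.categories

# Test vectors are inferred; infer_vector takes the token list itself, not a list wrapped around it
X_test = np.array([model_doc2vec.infer_vector(doc.words) for doc in test_tagged])
y_test = test.categories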

## [Training] Logistic Regression Classifier

In [1093]:
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression(C=5, multi_class='multinomial', solver='saga', max_iter=1000)

logistic_regression.fit(X_train, y_train)

y_pred = logistic_regression.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred))
print('Testing F1 score: {}'.format(metrics.f1_score(y_test, y_pred, average='weighted')))

0.2413793103448276
Testing F1 score: 0.09386973180076628
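
A note on reading these two numbers: they are consistent with a classifier that predicts the same industry for every test row. The sketch below is an illustration under assumptions, not output from our model. It uses hypothetical labels and a hypothetical test set of 29 rows, 7 of which belong to the predicted class, and recomputes accuracy and support-weighted F1 by hand.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of rows where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its support."""
    total = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / support
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * support
    return total / len(y_true)


# Hypothetical test set: 7 rows of one class, 22 rows of anything else.
y_true_demo = ["Machine Learning"] * 7 + ["Other"] * 22
# A degenerate classifier that always predicts the same class.
y_pred_demo = ["Machine Learning"] * 29

acc_demo = accuracy(y_true_demo, y_pred_demo)
f1_demo = weighted_f1(y_true_demo, y_pred_demo)
print(acc_demo, f1_demo)
```

Under these assumptions the hand computation gives roughly 0.2414 accuracy and 0.0939 weighted F1, matching the reported values to the printed precision. That suggests inspecting `Counter(y_pred)` before drawing conclusions from the scores.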


In [1094]:
from sklearn.svm import LinearSVC

clf = LinearSVC()
LSV = clf.fit(X_train, y_train)
linear_svc_prediction = LSV.predict(X_test)

# accuracy_score is what gets printed here, so label it as accuracy
print('----------- Accuracy ----------')
print(metrics.accuracy_score(y_test, linear_svc_prediction))
print('\n')

----------- Accuracy ----------
0.2413793103448276


