# Doc2Vec
This section applies the Doc2Vec algorithm, an extension of word2vec that learns vectors for whole documents, to our dataset.
For now, we'll only use the following columns:
[Headline, Industry, College-Name-1, Degree-Name-1, Field-Of-Study-1, Job-Title-1, Company-Name-1,
College-Name-2 (referring to master's), Degree-Name-2 (referring to master's)]

## Important to note
We only care about the Headline and Job-Title-1 columns, because our dataset has some discrepancies. As described previously, the Headline sometimes does not reflect the Job-Title-1 column, so we need to apply a bit of logic.
For example, for profile_2 the Dates-Employed-1 column says 03-2022-Present. This means that profile_2 is currently employed there, but their headline might not reflect that. We detect this case with a check such as `if str(employed).find('Present') != -1`.
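
The check described here can be sketched as a small helper. This is only a minimal sketch under assumptions: the function name `current_position` and its arguments are hypothetical, and it is not the notebook's `transform_profession` helper, whose implementation is not shown in this section.

```python
def current_position(headline, job_title, employed):
    """Return the job title if the role is ongoing, otherwise the headline.

    Hypothetical helper: `employed` is a Dates-Employed-1 value such as
    '03-2022-Present'. str() guards against missing values (e.g. NaN floats).
    """
    if str(employed).find('Present') != -1:
        return job_title
    return headline


# An ongoing role: prefer the job title over a possibly stale headline.
print(current_position('Aspiring ML Engineer', 'Data Scientist', '03-2022-Present'))
# A finished role: fall back to the headline.
print(current_position('Product Manager', 'Intern', '06-2019-08-2019'))
```

Wrapping `employed` in `str()` keeps the check from failing on empty cells, which pandas typically loads as floating-point NaN.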

In [45]:
# Main imports
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
from helper import remove_unused_columns, transform_profession, preprocess_text
import pandas as pd
import numpy as np
import seaborn as sns
from tqdm import tqdm

tqdm.pandas(desc="progress-bar")
from gensim.models import Doc2Vec
from sklearn import utils
from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import TaggedDocument

## Cleaning up
As mentioned above, we'll now clean up these discrepancies in our dataset.
We really only care about each profile's current position.

In [46]:
# Read the source CSV and drop the columns we don't use
file = '../excel-data/f-linkedin-profile.csv'
data_csv = pd.read_csv(file)
data_top = remove_unused_columns(data_csv)[0:13]

# Get our columns into lists
headline_categories = list(data_csv['Industry'])  # class labels
profession = transform_profession(data_top, data=data_csv)
category_list = data_csv['Headline']
category_id = list(range(len(category_list)))  # one sequential id per profile

# Initialize our columns into a dataframe
dtf = pd.DataFrame()
dtf['category_id'] = category_id
dtf['categories'] = headline_categories
dtf['profession'] = profession

# Clean the data set first: remove stop words such as "I", "me", "you"
list_stop_of_words = stopwords.words('english')
dtf['clean_text_profession'] = dtf['profession'].apply(
    lambda x: preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=list_stop_of_words))

dtf.head()


| | category_id | categories | profession | clean_text_profession |
|---|---|---|---|---|
| 0 | 0 | Machine Learning | Director of Data Science, Machine Learning at ... | director data science machine learning walmart... |
| 1 | 1 | Machine Learning | Machine learning on Encrypted data Engineer | machine learning encrypted data engineer |
| 2 | 2 | Machine Learning | Machine Learning Research Scientist - Deep Lea... | machine learning research scientist deep learning |
| 3 | 3 | Machine Learning | Principal (Manager) R&D Data Scientist | principal manager rd data scientist |
| 4 | 4 | Machine Learning | Vice President of Machine Learning, Merchandis... | vice president machine learning merchandising ... |
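
To make the cleaning step more concrete, here is a minimal sketch of the kind of transformation the output above suggests: lowercasing, stripping punctuation, and dropping stop words. It is an assumption-laden stand-in, not the real `preprocess_text` from `helper`: the name `simple_preprocess_text` and the tiny stop-word set are hypothetical, and lemmatization is left out.

```python
import re

# Tiny illustrative stop-word set; the notebook uses nltk's full English list.
STOPWORDS = {"of", "at", "on", "the", "and", "i", "me", "you"}


def simple_preprocess_text(text, stopwords=STOPWORDS):
    """Lowercase, drop punctuation, and remove stop words (hypothetical sketch)."""
    # Remove every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", str(text).lower().strip())
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


print(simple_preprocess_text("Principal (Manager) R&D Data Scientist"))
print(simple_preprocess_text("Machine learning on Encrypted data Engineer"))
```

On these two inputs the sketch produces the same cleaned strings as rows 3 and 1 of the output above, though the real helper may differ on other inputs.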


### Split data set
In this section we'll split the data set into a training set (70%) and a test set (30%).

In [47]:
# training set (70%) test set (30%)
train, test = train_test_split(dtf, test_size=0.3)
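
With a small data set and several classes, a purely random split can leave some classes under-represented in either half, and it changes on every run. One possible variation, shown here on hypothetical toy data rather than on `dtf`, is to pass `stratify` and `random_state` to `train_test_split`. The seed value 42 is arbitrary.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data standing in for the real texts and labels: two balanced classes.
texts = [f"title {i}" for i in range(20)]
labels = ["Finance"] * 10 + ["Lawyer"] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.3,
    stratify=labels,   # keep class proportions the same in both splits
    random_state=42,   # arbitrary seed, chosen only for repeatability
)

print(len(X_tr), len(X_te))
print(Counter(y_te))
```

Note that stratification requires every class to have at least two samples, so it may need adjusting for very rare categories.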

# Introduction - Word Embedding
In this section we introduce Google's Word2vec algorithm, on which Doc2Vec builds, but first we'll briefly explain why we're using word embeddings. Word embedding is a class of techniques in which each word is represented by a real-valued vector in a predefined vector space.
Every word is mapped to one vector, and the vector values are learned in a way that resembles training a neural network.
As a result, words that appear in similar contexts in the corpus end up close to each other in that vector space.
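
As a rough illustration of what "close in the vector space" means, the sketch below compares vectors with cosine similarity. The three-dimensional vectors are invented for illustration only. Real embeddings are learned from data and usually have tens or hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
embeddings = {
    "engineer": [0.9, 0.1, 0.2],
    "developer": [0.8, 0.2, 0.3],
    "lawyer": [0.1, 0.9, 0.4],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_close = cosine_similarity(embeddings["engineer"], embeddings["developer"])
sim_far = cosine_similarity(embeddings["engineer"], embeddings["lawyer"])
print(f"engineer vs developer: {sim_close:.2f}")
print(f"engineer vs lawyer:    {sim_far:.2f}")
```

Words used in similar contexts would be expected to score higher, as the first pair does with these made-up values.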

In [48]:
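def read_corpus(texts, labels):
    # Pair each cleaned text with its category label as a TaggedDocument.
    # Note: gensim documents `tags` as a list of tags; the bare label string is kept
    # here because vec_for_learning reads doc.tags back as the classification target.
    for text, label in zip(texts, labels):
        yield TaggedDocument(words=text.split(), tags=label)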


train_tagged = list(read_corpus(train.clean_text_profession, train.categories))
test_tagged = list(read_corpus(test.clean_text_profession, test.categories))
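
gensim describes `tags` as a list of tags. Since a Python string iterates character by character, passing a bare label risks registering each character as a separate tag. Below is a minimal sketch of a list-tag variant. To stay dependency-free it uses a stand-in namedtuple with the same two fields as gensim's `TaggedDocument`, and the function name `read_corpus_list_tags` is hypothetical.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (a namedtuple
# of `words` and `tags`), used so the sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def read_corpus_list_tags(texts, labels):
    """Yield one document per text, with the label wrapped in a list."""
    for text, label in zip(texts, labels):
        yield TaggedDocument(words=text.split(), tags=[label])


docs = list(read_corpus_list_tags(
    ["machine learning engineer", "corporate lawyer"],
    ["Machine Learning", "Lawyer"],
))

# Iterating over a list of tags yields whole labels ...
print([tag for tag in docs[0].tags])
# ... whereas iterating over a bare string yields single characters.
print([tag for tag in "Lawyer"])
# With list tags, the classification target is the first tag.
print([doc.tags[0] for doc in docs])
```

If this variant were adopted, any code that reads `doc.tags` as the label would need to read `doc.tags[0]` instead.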

## Distributed Memory Doc2Vec Training

In [49]:
import multiprocessing
from sklearn import metrics, linear_model

cores = multiprocessing.cpu_count()

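model = Doc2Vec(
    dm=1,            # 1 = distributed memory (PV-DM); 0 would select distributed bag of words (PV-DBOW)
    vector_size=50,  # dimensionality of the document vectors
    negative=1,      # number of noise words drawn for negative sampling
    min_count=1,     # keep every word, even those seen only once
    workers=cores,
)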

model.build_vocab(train_tagged)

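# Train in a single call: gensim decays the learning rate from alpha to min_alpha internally.
# Subtracting 0.002 from alpha on each of 30 manual iterations would push it below zero,
# because the default starting alpha is 0.025.
model.train(utils.shuffle(train_tagged), total_examples=model.corpus_count, epochs=90)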

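def vec_for_learning(model, tagged_docs):
    # Infer a vector for each document and pair it with its label.
    # infer_vector expects a list of tokens; a generator would be exhausted after one pass.
    targets, regressors = zip(*[(doc.tags, model.infer_vector(doc.words)) for doc in tagged_docs])
    return targets, regressors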

y_train, X_train = vec_for_learning(model, train_tagged)
y_test, X_test = vec_for_learning(model, test_tagged)


('Product Manager', 'Business', 'Machine Learning', 'Lawyer', 'Finance', 'Product Manager', 'Lawyer', 'Product Manager', 'Machine Learning', 'Software Engineer', 'Machine Learning', 'Product Manager', 'Lawyer', 'Software Engineer', 'Software Engineer', 'Business', 'Marketing', 'Finance', 'Product Manager', 'Marketing', 'Product Manager', 'Professor', 'Software Engineer', 'Software Engineer', 'Finance', 'Lawyer', 'Software Engineer', 'Software Engineer', 'Product Manager', 'Machine Learning', 'Machine Learning', 'Lawyer', 'Lawyer', 'Professor', 'Finance', 'Finance', 'Marketing', 'Business', 'Finance', 'Professor', 'Business', 'Product Manager', 'Software Engineer', 'Business', 'Software Engineer', 'Professor', 'Machine Learning', 'Sales', 'Lawyer', 'Sales', 'Machine Learning', 'Product Manager', 'Finance', 'Lawyer', 'Software Engineer', 'Sales', 'Software Engineer', 'Product Manager', 'Machine Learning', 'Sales', 'Business', 'Machine Learning', 'Machine Learning', 'Sales', 'Machine Lear
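
A note on the learning rate used during training: gensim's `train` decays the rate linearly from `alpha` to `min_alpha` over the requested epochs, so there is normally no need to adjust `alpha` by hand. The sketch below compares that linear decay with a manual schedule that subtracts 0.002 per iteration for 30 iterations, starting from the default 0.025. The helper names are hypothetical. The arithmetic only illustrates the two schedules and does not call gensim.

```python
START_ALPHA = 0.025   # gensim's default starting learning rate for Doc2Vec
MIN_ALPHA = 0.0001    # gensim's default final learning rate


def linear_decay(step, total_steps, start=START_ALPHA, end=MIN_ALPHA):
    """Learning rate after `step` of `total_steps`, decayed linearly."""
    return start - (start - end) * step / total_steps


def manual_decrement(iteration, start=START_ALPHA, delta=0.002):
    """Learning rate after subtracting `delta` once per iteration."""
    return start - delta * iteration


# First iteration count at which the manual schedule drops below zero.
first_negative = next(i for i in range(1, 31) if manual_decrement(i) < 0)
print("manual schedule goes negative after", first_negative, "decrements")
print("manual schedule after 30 decrements:", round(manual_decrement(30), 4))
print("linear decay after 30 of 30 steps:", round(linear_decay(30, 30), 4))
```

A negative learning rate moves the weights away from the training objective, so a schedule like the manual one can undo earlier progress.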

## [Training] LinearSVC

In [53]:
from sklearn.svm import LinearSVC

clf = LinearSVC()
LSV = clf.fit(X_train, y_train)

linear_svc_prediction = LSV.predict(X_test)
t = pd.DataFrame(y_test)

print('----------- F1 Score ----------')
print(metrics.f1_score(y_test, linear_svc_prediction, average='weighted'))
print('\n')

print('----------- Accuracy Score -----------')
print(metrics.accuracy_score(y_test, linear_svc_prediction))
print('\n')

print('----------- Classification Report ----------')
print(metrics.classification_report(y_test, linear_svc_prediction))
print('\n')

----------- F1 Score ----------
0.05113636363636364


----------- Accuracy Score -----------
0.16363636363636364


----------- Classification Report ----------
                   precision    recall  f1-score   support

         Business       0.00      0.00      0.00         6
          Finance       0.00      0.00      0.00         8
           Lawyer       0.00      0.00      0.00         6
 Machine Learning       0.00      0.00      0.00         7
        Marketing       0.00      0.00      0.00         2
  Product Manager       0.17      0.90      0.28        10
        Professor       0.00      0.00      0.00         6
            Sales       0.00      0.00      0.00         2
Software Engineer       0.00      0.00      0.00         8

         accuracy                           0.16        55
        macro avg       0.02      0.10      0.03        55
     weighted avg       0.03      0.16      0.05        55
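
The weighted F1 score averages the per-class F1 scores, weighting each class by its support. In the report above only Product Manager has a non-zero F1, which is why the weighted score is so low. The pure-Python sketch below reproduces the calculation on hypothetical toy labels where the classifier predicts a single class for everything. The function names are illustrative and are not part of scikit-learn.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, label):
    """F1 for one class; taken as 0.0 when there are no true positives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    if tp == 0:
        return 0.0
    precision, recall = tp / predicted, tp / actual
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_per_class(y_true, y_pred, c) * n / total for c, n in support.items())


# Hypothetical toy labels: the classifier predicts "PM" for every sample.
y_true_toy = ["PM", "PM", "Finance", "Lawyer"]
y_pred_toy = ["PM", "PM", "PM", "PM"]
print(round(weighted_f1(y_true_toy, y_pred_toy), 4))
```

scikit-learn emits an undefined-metric warning for classes that are never predicted. In recent versions, passing `zero_division=0` to `metrics.classification_report` or `metrics.f1_score` sets those scores to 0 without the warning.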







## [Training] Support Vector Machine

In [51]:
from sklearn.svm import SVC

clf = SVC()
SV = clf.fit(X_train, y_train)
svc_prediction = SV.predict(X_test)

print('----------- F1 Score ----------')
print(metrics.f1_score(y_test, svc_prediction, average='weighted'))
print('\n')

print('----------- Accuracy Score -----------')
print(metrics.accuracy_score(y_test, svc_prediction))
print('\n')

print('----------- Classification Report ----------')
print(metrics.classification_report(y_test, svc_prediction))
print('\n')

----------- F1 Score ----------
0.29286713286713295


----------- Accuracy Score -----------
0.3090909090909091


----------- Classification Report ----------
                   precision    recall  f1-score   support

         Business       0.67      0.33      0.44         6
          Finance       0.12      0.25      0.17         8
           Lawyer       1.00      0.33      0.50         6
 Machine Learning       0.31      0.57      0.40         7
        Marketing       0.00      0.00      0.00         2
  Product Manager       0.31      0.50      0.38        10
        Professor       0.00      0.00      0.00         6
            Sales       0.00      0.00      0.00         2
Software Engineer       0.40      0.25      0.31         8

         accuracy                           0.31        55
        macro avg       0.31      0.25      0.24        55
     weighted avg       0.35      0.31      0.29        55







## [Training] Decision Tree

In [52]:
from sklearn import tree

clf = tree.DecisionTreeClassifier(max_depth=20, criterion='gini')

DTree = clf.fit(X_train, y_train)
dtree_prediction = DTree.predict(X_test)

print('----------- F1 Score ----------')
print(metrics.f1_score(y_test, dtree_prediction, average='weighted'))
print('\n')

print('----------- Accuracy Score -----------')
print(metrics.accuracy_score(y_test, dtree_prediction))
print('\n')

print('----------- Classification Report ----------')
print(metrics.classification_report(y_test, dtree_prediction))
print('\n')

print('----------- Confusion Matrix ----------')
print(metrics.confusion_matrix(y_test, dtree_prediction))
print('\n')

----------- F1 Score ----------
0.2298741506171847


----------- Accuracy Score -----------
0.2545454545454545


----------- Classification Report ----------
                   precision    recall  f1-score   support

         Business       0.36      0.67      0.47         6
          Finance       0.00      0.00      0.00         8
           Lawyer       0.40      0.67      0.50         6
 Machine Learning       0.14      0.14      0.14         7
        Marketing       0.00      0.00      0.00         2
  Product Manager       0.22      0.20      0.21        10
        Professor       1.00      0.17      0.29         6
            Sales       0.00      0.00      0.00         2
Software Engineer       0.25      0.25      0.25         8

         accuracy                           0.25        55
        macro avg       0.26      0.23      0.21        55
     weighted avg       0.29      0.25      0.23        55



----------- Confusion Matrix ----------
[[4 0 1 0 0 1 0 0 0]
 [1 0 4 0