# Word2Vec
The purpose of this section is to apply the word2vec algorithm to our dataset.
For now, we'll only use the following columns:
[Headline, Industry, College-Name-1,Degree-Name-1,Field-Of-Study-1,Job-Title-1,Company-Name-1,
College-Name-2 (Referring to masters), Degree-Name-2 (Referring to masters)]

## Important to note
We only care about the Headline and Job-Title-1 column since we have some discrepancies in our dataset, as
we've described previously sometimes the Headline does not properly reflect the Job-Title-1 Column so for that
reason we'll need to apply a bit of logic. For example in for profile_2 column Dates-Employed-1 says 03-2022-Present, this means that the profile_2 is currently employed there but his headline might not reflect that. Thus, we'll apply a bit of logic for this purpose, example: ```if str(employed).find('Present') != -1```

In [620]:
# Main imports
import pandas as pd
from nltk.corpus import stopwords
from sklearn import model_selection, manifold, preprocessing as P
from sklearn.model_selection import train_test_split

from helper import remove_unused_columns, transform_profession, preprocess_text
import gensim
from keras import preprocessing, layers, models, Model

import seaborn as sns
import gensim.downloader as gensim_api
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.python.keras import backend as K
import re

## Cleaning up
As we mentioned above we'll go ahead and clean up these discrepancies in our dataset.
We really only care about the profiles current position.

In [621]:
# reading our files and
file = '../excel-data/all-data-linkedin.csv'
data_csv = pd.read_csv(file)
data_top = remove_unused_columns(data_csv)[0:13]

# Get our columns into lists
headline_categories = list(data_csv['Industry'])
profession = transform_profession(data_top, data=data_csv)
category_list = data_csv['Headline']
category_id = [i for i in range(len(category_list))]

# Initialize our columns into a dataframe
dtf = pd.DataFrame()
dtf['category_id'] = category_id
dtf['categories'] = headline_categories
dtf['profession'] = profession

# Clean your data set first remove unwanted words like: "I", "me", "you"
list_stop_of_words = stopwords.words('english')
dtf['clean_text_profession'] = dtf['profession'].apply(lambda x: preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=list_stop_of_words))

dtf.head()

Unnamed: 0,category_id,categories,profession,clean_text_profession
0,0,Machine Learning,"Director of Data Science, Machine Learning at ...",director data science machine learning walmart...
1,1,Machine Learning,Machine learning on Encrypted data Engineer,machine learning encrypted data engineer
2,2,Machine Learning,Machine Learning Research Scientist - Deep Lea...,machine learning research scientist deep learning
3,3,Machine Learning,Principal (Manager) R&D Data Scientist,principal manager rd data scientist
4,4,Machine Learning,"Vice President of Machine Learning, Merchandis...",vice president machine learning merchandising ...


### Split data set
In this section we'll split the data set into training set (70%) test set (30%)

In [622]:
# training set (70%) test set (30%)
X_train, Y_test, y_train, y_test = train_test_split(dtf['clean_text_profession'], dtf['categories'], random_state=0)

# Introduction - Word Embedding
In this section we're introducing google's Word2vec algorithm, but first we'll briefly explain why we're using word embedding. Word embedding is a class of technique where each words are represented by real-valued vectors in a pre-defined vector space.
For example every word is mapped to a vector, which are learned in a way to resemble a neural network.
Meaning words with a similar context appear in the same corpus. So they will be in the same vector space as we mentioned above.

## Word2Vec
How does Word2vec tie into this? Well word2vec produces a vector space (with several hundred of dimensions). Each unique word in the corpus share a common context in the corpus as the close to the other in the space.

In this section we'll use our own data corpus which we created above when we removed words from our data-set. Words like I, me, you ect.

### Skip-gram model
***Predicts words within a certain range before and after the current word in the same sentences.***
Before fitting the model we'll transform our list into a lists of n-grams. As per the original paper, we'll apply this formula to find the nearby words: vec('Madrid') - vec('Spain') + vec('France') will lead us closer to vec('Paris') than any other vector.

In [651]:
corpus = dtf['clean_text_profession']

lst_corpus = []
for string in corpus:
   lst_words = string.split()
   lst_grams = [" ".join(lst_words[i:i+1])
               for i in range(0, len(lst_words), 1)]
   lst_corpus.append(lst_grams)


## detect bigrams and trigrams
bigrams_detector = gensim.models.phrases.Phrases(lst_corpus)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus], min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

nlp = gensim.models.word2vec.Word2Vec(lst_corpus, min_count=1,vector_size=100, sg=1)

nlp.build_vocab(lst_corpus, progress_per=10000)
nlp.train(lst_corpus, total_examples=nlp.corpus_count, epochs=1, report_delay=1)

word = "engineer"

## word embedding
tot_words = [word] + [tupla[0] for tupla in nlp.wv.most_similar(word, topn=10)]

X = nlp.wv[tot_words]

for item in nlp.wv.most_similar(word, topn=10):
    print(item)

('life', 0.3165738880634308)
('marketing', 0.2569739520549774)
('startup', 0.24796059727668762)
('northwestern', 0.2437247335910797)
('contract', 0.23932676017284393)
('director', 0.21964438259601593)
('international', 0.21706047654151917)
('chief', 0.21428938210010529)
('fellow', 0.18916472792625427)
('creative', 0.1853596717119217)
