**Potential Talent**

Loading the libraries and the dataset

In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [69]:
data=pd.read_csv("potential-talents.csv")


In [70]:
data.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [71]:
print(data.shape)

(104, 5)


There are 104 rows and 5 columns which are id, job_title, location, connection and fit.

In [72]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


The target column fit contains no values (0 non-null entries), so we drop it here and re-create it later, once similarity scores are available.

In [73]:
data.drop('fit',axis=1, inplace=True)

In [74]:
for col in data.columns:
    print(f'There is {data[col].nunique()} unique value in {col} column')

There is 104 unique value in id column
There is 52 unique value in job_title column
There is 41 unique value in location column
There is 33 unique value in connection column


All 104 ids are unique, but there are only 52 distinct job titles, 41 locations and 33 connection values. Some overlap is expected because candidates share job titles and locations, but it may also point to duplicated profiles. Because every row has its own id, duplicates only show up once the id column is dropped.

In [75]:
data_dup = data.drop(['id'], axis = 1)
print("Number of duplicate entries:", data_dup.duplicated().sum())


Number of duplicate entries: 51


There are 51 duplicate entries in total.

In [76]:
data_dup[data_dup.duplicated(keep=False)]

Unnamed: 0,job_title,location,connection
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85
1,Native English Teacher at EPIK (English Progra...,Kanada,500+
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44
3,People Development Coordinator at Ryan,"Denton, Texas",500+
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+
...,...,...,...
60,HR Senior Specialist,San Francisco Bay Area,500+
61,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500+
62,Student at Chapman University,"Lake Forest, California",2
63,"SVP, CHRO, Marketing & Communications, CSR Off...","Houston, Texas Area",500+


With this many duplicates, the repeated profiles would be over-represented and could bias the model, so we drop them.

In [77]:
data_dup = data_dup.drop_duplicates()
data = pd.concat([data['id'], data_dup], axis = 1).dropna(axis = 0)


In [78]:
print(data.shape)

(53, 4)


After removing duplicates, the 104 rows are reduced to 53.
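The same de-duplication can be sketched in one step with `drop_duplicates(subset=...)`, which compares only the listed columns and keeps the `id` of the first occurrence. The toy frame below is made up for illustration and is not taken from `potential-talents.csv`.

```python
import pandas as pd

# Hypothetical toy data: the rows with id 1 and id 3 describe the same profile.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "job_title": ["HR Manager", "Student", "HR Manager", "Student"],
    "location": ["Houston, Texas", "Kanada", "Houston, Texas", "Denton, Texas"],
    "connection": ["500+", "2", "500+", "2"],
})

profile_cols = ["job_title", "location", "connection"]

# Count rows that repeat an earlier profile, ignoring the id column.
n_duplicates = int(toy.duplicated(subset=profile_cols).sum())

# Keep the first occurrence of each profile and renumber the index.
deduped = toy.drop_duplicates(subset=profile_cols).reset_index(drop=True)

print("Number of duplicate entries:", n_duplicates)
print(deduped)
```

Keeping the first occurrence matches the default `keep='first'` behaviour of `duplicated()` used above, so the surviving ids would be the same.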

Before removing duplicates

In [79]:
data.job_title.value_counts()

Aspiring Human Resources Professional                                                                                    2
2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional                 1
Lead Official at Western Illinois University                                                                             1
Senior Human Resources Business Partner at Heil Environmental                                                            1
Aspiring Human Resources Professional | An energetic and Team-Focused Leader                                             1
HR Manager at Endemol Shine North America                                                                                1
Human Resources professional for the world leader in GIS software                                                        1
RRP Brand Portfolio Executive at JTI (Japan Tobacco International)                                                       1
Information Syst

After removing duplicates

In [80]:
data.job_title.value_counts()

Aspiring Human Resources Professional                                                                                    2
2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional                 1
Lead Official at Western Illinois University                                                                             1
Senior Human Resources Business Partner at Heil Environmental                                                            1
Aspiring Human Resources Professional | An energetic and Team-Focused Leader                                             1
HR Manager at Endemol Shine North America                                                                                1
Human Resources professional for the world leader in GIS software                                                        1
RRP Brand Portfolio Executive at JTI (Japan Tobacco International)                                                       1
Information Syst

In [81]:
data = data.reset_index(drop = True)

In [82]:
from collections import Counter
words_counts = Counter()
for i in range(0 ,len(data)):
    li=data.job_title[i].split(" ")
    for word in li:
        words_counts[word] += 1

print(' Total words', len(words_counts))
words_counts.most_common()

 Total words 221


[('Human', 34),
 ('Resources', 29),
 ('at', 22),
 ('and', 13),
 ('Aspiring', 11),
 ('|', 10),
 ('Professional', 7),
 ('in', 6),
 ('University', 6),
 ('Seeking', 6),
 ('Business', 5),
 ('Student', 5),
 ('Generalist', 5),
 ('Manager', 5),
 ('of', 4),
 ('Specialist', 4),
 ('&', 4),
 ('Management', 4),
 ('seeking', 4),
 ('', 4),
 ('an', 3),
 ('Director', 3),
 ('for', 3),
 ('College', 2),
 ('aspiring', 2),
 ('professional', 2),
 ('Coordinator', 2),
 ('HR', 2),
 ('Senior', 2),
 ('internship', 2),
 ('Resources,', 2),
 ('Staffing', 2),
 ('North', 2),
 ('a', 2),
 ('Resources.', 2),
 ('Major', 2),
 ('to', 2),
 ('Information', 2),
 ('Systems', 2),
 ('-', 2),
 ('Position', 2),
 ('2019', 1),
 ('C.T.', 1),
 ('Bauer', 1),
 ('Graduate', 1),
 ('(Magna', 1),
 ('Cum', 1),
 ('Laude)', 1),
 ('Native', 1),
 ('English', 1),
 ('Teacher', 1),
 ('EPIK', 1),
 ('(English', 1),
 ('Program', 1),
 ('Korea)', 1),
 ('People', 1),
 ('Development', 1),
 ('Ryan', 1),
 ('Advisory', 1),
 ('Board', 1),
 ('Member', 1),
 ('Ce

The most frequent words are Human, Resources and Aspiring. Function words such as "at" and "and" carry no meaning on their own and will be removed, as will special characters such as "|" and "&".
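The counts above treat "Resources", "Resources," and "Resources." as different words. A minimal sketch of counting after normalisation is shown below; the sample titles and the tiny stop-word set are hypothetical, and the cleaning step later in this notebook uses NLTK's English stop-word list instead.

```python
import re
from collections import Counter

# Hypothetical sample titles, chosen to mimic the case and punctuation variants above.
titles = [
    "Aspiring Human Resources Professional",
    "Seeking Human Resources, HRIS and Generalist Positions",
    "aspiring human resources professional | Team-Focused Leader",
]

# A tiny illustrative stop-word set (an assumption, not NLTK's list).
stop_words = {"and", "at", "in", "of", "for", "a", "an", "to"}

def count_words(texts):
    counts = Counter()
    for text in texts:
        # Lowercase, then keep only runs of letters so '|', ',' and '-' act as separators.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

word_counts = count_words(titles)
print(word_counts.most_common(3))
```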

**Data Cleaning**

Remove punctuation and special characters

In [83]:
data = data.replace({'location' : { "[\'!#)$%&(*+-./:;<=>?@[\]^_`{|}~\n]" : " "}}, regex=True)
data = data.replace({'connection' : { "[\'!#)$%&(*+-./:;<=>?@[\]^_`{|}~\n]" : " "}}, regex=True)

Convert everything to lowercase

In [84]:
data['job_title'] = data['job_title'].str.lower()
data['location'] = data['location'].str.lower()

In [85]:
data.head()

Unnamed: 0,id,job_title,location,connection
0,1,2019 c.t. bauer college of business graduate (...,houston texas,85
1,2,native english teacher at epik (english progra...,kanada,500
2,3,aspiring human resources professional,raleigh durham north carolina area,44
3,4,people development coordinator at ryan,denton texas,500
4,5,advisory board member at celal bayar university,i̇zmir türkiye,500


In [86]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [87]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.stem import WordNetLemmatizer
def cleaning(data, col):
    """Remove English stop words from data[col] and lemmatize the remaining tokens, in place."""
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    detokenizer = TreebankWordDetokenizer()
    for i in range(len(data)):
        word_tokens = word_tokenize(data.loc[i, col])
        tokens_without_sw = [w for w in word_tokens if w not in stop_words]
        lemmatized_sentence = [lemmatizer.lemmatize(word) for word in tokens_without_sw]
        # Write through .loc so the DataFrame itself is updated (no chained assignment)
        data.loc[i, col] = detokenizer.detokenize(lemmatized_sentence)

In [88]:
print('Job title before removing stop words:\n', data.job_title.head(1))
print("-" * 100)


cleaning(data, 'job_title')
cleaning(data, 'location')

print('Job title after removing stop words:\n', data.job_title.head(1))
print("-" * 100)
data.head(2)

Job title before removing stop words:
 0    2019 c.t. bauer college of business graduate (...
Name: job_title, dtype: object
----------------------------------------------------------------------------------------------------


Job title after removing stop words:
 0    2019 c.t . bauer college business graduate (ma...
Name: job_title, dtype: object
----------------------------------------------------------------------------------------------------


Unnamed: 0,id,job_title,location,connection
0,1,2019 c.t . bauer college business graduate (ma...,houston texas,85
1,2,native english teacher epik (english program k...,kanada,500


In [89]:
data = data.replace({'job_title' : { 'chro' : 'chief human resources officer', 'svp' : 'senior vice president'
        ,'gphr' : 'global professional in human resources','hris' : 'human resources management system'
        , 'csr' : 'corporate social responsibility', 'sphr' : 'strategic and policy-making certification'
        , 'hr' : 'human resources', "[\'!#)$%&(*+-./:;<=>?@[\]^_`{|}~\n]" : "", r'[0-9]' : ''}},regex=True)

In [90]:
data.head()

Unnamed: 0,id,job_title,location,connection
0,1,ct bauer college business graduate magna cum...,houston texas,85
1,2,native english teacher epik english program korea,kanada,500
2,3,aspiring human resource professional,raleigh durham north carolina area,44
3,4,people development coordinator ryan,denton texas,500
4,5,advisory board member celal bayar university,i̇zmir türkiye,500


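Stop words such as "and" and "at" have now been removed, along with punctuation and digits, and common abbreviations have been expanded, which reduces noise in the job titles.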

Word Embedding Techniques

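1. TF-IDF: a statistical measure of how relevant a word is to a document within a collection. A term's weight rises with its frequency in the document and falls with the number of documents that contain it.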
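To make the weighting concrete, here is a minimal pure-Python sketch. It assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, and the L2 normalisation that scikit-learn's `TfidfVectorizer` applies by default. The three titles are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned titles used only for illustration.
docs = [
    "aspiring human resource professional",
    "human resource generalist",
    "english teacher",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many titles does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    # Raw term count times idf, then scaled to unit Euclidean length.
    tf = Counter(tokens)
    weights = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
print({t: round(w, 3) for t, w in vec.items()})
```

In the first title, "aspiring" appears in only one document and therefore gets a larger weight than "human", which appears in two.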

In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [92]:
# Convert job_title column into a list
job_title_list = list(data['job_title'])

# Vectorize job_title_list
vectorizer=TfidfVectorizer()
vector = vectorizer.fit_transform(job_title_list)
#X = vectorizer.transform(job_title_list)

# Get feature names in all the documents
feature_names = vectorizer.get_feature_names_out()
print("Number of unique features: ", len(feature_names))
print("First 5 features: ", feature_names[:5])
df_tfidf=pd.DataFrame(vector.toarray(),columns=feature_names)
# Convert job titles into arrays
tfidf_vector = vector.toarray()
print("Shape of Tfidf vector: ", tfidf_vector.shape)
print(df_tfidf)

Number of unique features:  182
First 5 features:  ['administration' 'administrative' 'admission' 'advisory' 'always']
Shape of Tfidf vector:  (53, 182)
    administration  administrative  admission  advisory   always   america  \
0         0.000000        0.000000   0.000000  0.000000  0.00000  0.000000   
1         0.000000        0.000000   0.000000  0.000000  0.00000  0.000000   
2         0.000000        0.000000   0.000000  0.000000  0.00000  0.000000   
3         0.000000        0.000000   0.000000  0.000000  0.00000  0.000000   
4         0.000000        0.000000   0.000000  0.426332  0.00000  0.000000   
5         0.000000        0.000000   0.000000  0.000000  0.00000  0.000000   
6         0.000000        0.000000   0.000000  0.000000  0.00000  0.000000   
7         0.000000        0.000000   0.000000  0.000000  0.00000  0.000000   
8         0.000000        0.000000   0.000000  0.000000  0.00000  0.000000   
9         0.000000        0.000000   0.000000  0.000000  0.00000  0

In [93]:
phrases = pd.DataFrame({'phrase' : ['aspiring human resources']})
cleaning(phrases, 'phrase')
print(phrases.phrase[0])
new_vector=vectorizer.transform([phrases.phrase[0]])

feature_names=vectorizer.get_feature_names_out()
df_tfidf_new=pd.DataFrame(new_vector.toarray(), columns=feature_names)
print(df_tfidf_new)

aspiring human resource
   administration  administrative  admission  advisory  always  america  \
0             0.0             0.0        0.0       0.0     0.0      0.0   

   analyst  analytics  and  army  ...  university  vice  victoria  wellington  \
0      0.0        0.0  0.0   0.0  ...         0.0   0.0       0.0         0.0   

   western  westfield  within  woodland  work  world  
0      0.0        0.0     0.0       0.0   0.0    0.0  

[1 rows x 182 columns]


In [94]:
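from sklearn.metrics.pairwise import cosine_similarity

# Compare the search phrase with every job title
similarity_scores = cosine_similarity(new_vector, vector)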
similarity_scores = similarity_scores.flatten()

data['tf_ft'] = similarity_scores


In [95]:
data.sort_values(by ='tf_ft', ascending = False).head()

Unnamed: 0,id,job_title,location,connection,tf_ft
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.781184
45,97,aspiring human resource professional,kokomo indiana area,71,0.781184
5,6,aspiring human resource specialist,greater new york city area,1,0.67989
21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.6186
12,27,aspiring human resource management student see...,houston texas area,500,0.442293


In [96]:
data.sort_values(by ='tf_ft', ascending = False).tail()

Unnamed: 0,id,job_title,location,connection,tf_ft
34,86,information system specialist programmer love ...,gaithersburg maryland,4,0.0
33,85,rrp brand portfolio executive jti japan tobacc...,greater philadelphia area,500,0.0
28,80,junior me engineer information system,myrtle beach south carolina area,52,0.0
1,2,native english teacher epik english program korea,kanada,500,0.0
52,104,director administration excellence logging,katy texas,500,0.0


As expected, titles that share more terms with the search phrase score higher, while titles that share no terms with it score exactly 0.
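The zero scores follow directly from the definition of cosine similarity: with no shared terms, the dot product is 0. A small sketch on sparse term-weight dictionaries (the weights are hypothetical; only the overlap pattern matters):

```python
import math

def cosine_similarity_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical term weights for a query, a close match and an unrelated title.
query = {"aspiring": 1.0, "human": 1.0, "resource": 1.0}
close = {"aspiring": 1.0, "human": 1.0, "resource": 1.0, "professional": 1.0}
unrelated = {"english": 1.0, "teacher": 1.0}

close_score = cosine_similarity_sparse(query, close)
unrelated_score = cosine_similarity_sparse(query, unrelated)
print(round(close_score, 3), unrelated_score)
```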

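**Word2Vec**-Word2Vec represents each word as a dense vector learned from the contexts it appears in; the similarity between two texts is then measured with the cosine similarity of their averaged word vectors.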

In [97]:
import gensim.downloader as api
from gensim.models import KeyedVectors, Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec
GoogleNews_model = api.load('word2vec-google-news-300')

In [98]:
def doc_token_vectors(sentence, model, sentence_vector_list, vector_dimensions):
    word_tokens = word_tokenize(sentence)
    filtered_words = [w for w in word_tokens if w in model]
    for j in range(len(word_tokens)):
        if word_tokens[j] in filtered_words:
            token_vector = model[word_tokens[j]]
        else:
            token_vector = np.zeros(vector_dimensions)
        sentence_vector_list.append(token_vector)

    return sentence_vector_list

In [99]:
from scipy.spatial.distance import cosine

def model_fitt_score(df, col, model, vector_dimensions, phrase, fitt_col_name):
    # Vectorize job title using the model
    model_vectors = []
    for i in range(len(df)):
        model_sentence_vector = []
        doc_token_vectors(df[col][i], model, model_sentence_vector, vector_dimensions)
        model_vectors.append(model_sentence_vector)

    # Titles differ in token count, so this nested list is ragged; report its length only
    print('model vectors shape', (len(model_vectors),))
    # Vectorize searched phrase using the model
    model_search_phrase_vector = []
    doc_token_vectors(phrase, model, model_search_phrase_vector, vector_dimensions)
    print('model search phrase vector shape', np.shape(model_search_phrase_vector))

    # Calculate cosine similarity between searched phrase and job title
    model_similarity =[]
    for i in range(len(df)):
        sim_score = 1 - cosine(np.mean(model_vectors[i], axis = 0), np.mean(model_search_phrase_vector, axis =0))
        model_similarity.append(sim_score)

    # Add the model similarity score to the dataframe
    df[fitt_col_name] = model_similarity

In [100]:

model_fitt_score(data, 'job_title', GoogleNews_model, 300, phrases.phrase[0], 'word_sc')


model vectors shape (53,)
model search phrase vector shape (3, 300)




In [101]:
data.sort_values(by ='word_sc', ascending = False).head()

Unnamed: 0,id,job_title,location,connection,tf_ft,word_sc
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.781184,0.950395
45,97,aspiring human resource professional,kokomo indiana area,71,0.781184,0.950395
5,6,aspiring human resource specialist,greater new york city area,1,0.67989,0.912262
21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.6186,0.875945
22,74,human resource professional,greater boston area,16,0.421758,0.874494


In [102]:
data.sort_values(by ='word_sc', ascending = False).tail()

Unnamed: 0,id,job_title,location,connection,tf_ft,word_sc
1,2,native english teacher epik english program korea,kanada,500,0.0,0.240216
52,104,director administration excellence logging,katy texas,500,0.0,0.225671
51,103,always set success,greater los angeles area,500,0.0,0.213981
33,85,rrp brand portfolio executive jti japan tobacc...,greater philadelphia area,500,0.0,0.21234
4,5,advisory board member celal bayar university,i̇zmir türkiye,500,0.0,0.206563


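Compared with TF-IDF, Word2Vec assigns clearly non-zero scores (about 0.2) even to unrelated titles, so the gap between relevant and irrelevant candidates is narrower, which is a disadvantage. The most similar titles still receive the highest scores, which is good.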
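The non-zero scores for unrelated titles come from averaging dense vectors, which are rarely orthogonal. The sketch below uses hypothetical 3-dimensional vectors (real Word2Vec vectors here have 300 dimensions) and mirrors the zero-vector handling of `doc_token_vectors` above.

```python
import math

# Hypothetical 3-dimensional "embeddings" for illustration only.
toy_vectors = {
    "human": [0.9, 0.1, 0.2],
    "resource": [0.8, 0.2, 0.1],
    "aspiring": [0.6, 0.5, 0.1],
    "english": [0.1, 0.9, 0.3],
    "teacher": [0.2, 0.8, 0.4],
}

def mean_vector(sentence, vectors, dim=3):
    # Average the token vectors; out-of-vocabulary tokens contribute zeros.
    rows = [vectors.get(tok, [0.0] * dim) for tok in sentence.split()]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query = mean_vector("aspiring human resource", toy_vectors)
similar = cosine_sim(query, mean_vector("human resource", toy_vectors))
different = cosine_sim(query, mean_vector("english teacher", toy_vectors))
print(round(similar, 3), round(different, 3))
```

Even the unrelated title gets a positive score here, because all components of the toy vectors are positive; TF-IDF would give it exactly 0.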

**BERT**

In [103]:
#pip install sentence_transformers

In [104]:
from sentence_transformers import SentenceTransformer
# Build BERT_base model
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')

In [105]:
bert_job_title_embeddings = bert_model.encode(list(data.job_title))
bert_job_title_embeddings.shape

(53, 768)

In [106]:
# Convert search phrase into a BERT embedded vector
bert_search_phrase_embedding = bert_model.encode(phrases.phrase[0])
bert_search_phrase_embedding.shape

(768,)

In [107]:
from scipy.spatial.distance import cosine
bert_cosine_similarity = []
for i in range(len(data)):
    cos_sim = 1 - cosine(bert_job_title_embeddings[i], bert_search_phrase_embedding)
    bert_cosine_similarity.append(cos_sim)

# Add the BERT cosine similarity column to the dataframe
data['B_ft_sc'] = bert_cosine_similarity

In [108]:

data.sort_values(by ='B_ft_sc', ascending = False).head()

Unnamed: 0,id,job_title,location,connection,tf_ft,word_sc,B_ft_sc
5,6,aspiring human resource specialist,greater new york city area,1,0.67989,0.912262,0.955137
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.781184,0.950395,0.948828
45,97,aspiring human resource professional,kokomo indiana area,71,0.781184,0.950395,0.948828
30,82,aspiring human resource professional energeti...,austin texas area,174,0.379605,0.827266,0.86791
47,99,seeking human resource position,la vega nevada area,48,0.289466,0.728513,0.849294


In [109]:

data.sort_values(by ='B_ft_sc', ascending = True).head()

Unnamed: 0,id,job_title,location,connection,tf_ft,word_sc,B_ft_sc
35,87,bachelor science biology victoria university w...,baltimore maryland,40,0.0,0.277453,0.13641
41,93,admission representative community medical cen...,long beach california,9,0.0,0.262623,0.181031
44,96,student indiana university kokomo business ma...,lafayette indiana,19,0.0,0.314945,0.230208
33,85,rrp brand portfolio executive jti japan tobacc...,greater philadelphia area,500,0.0,0.21234,0.253621
39,91,lead official western illinois university,greater chicago area,39,0.0,0.25455,0.357181


Here the similar titles score high and the dissimilar titles score low, so of the three models the BERT score is the one we use as the basis for the fit value.
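The per-row loop used for the BERT scores can also be written as a single matrix-vector product. The sketch below assumes NumPy arrays shaped like the embeddings above; the toy values and the helper name `cosine_to_query` are made up for illustration.

```python
import numpy as np

def cosine_to_query(embeddings, query):
    """Cosine similarity of each row of `embeddings` to the 1-D `query` vector."""
    embeddings = np.asarray(embeddings, dtype=float)
    query = np.asarray(query, dtype=float)
    row_norms = np.linalg.norm(embeddings, axis=1)
    return embeddings @ query / (row_norms * np.linalg.norm(query))

# Hypothetical 4-dimensional stand-ins for the (53, 768) title embeddings.
toy_embeddings = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0, 0.0],
])
toy_query = np.array([1.0, 0.0, 1.0, 0.0])

scores = cosine_to_query(toy_embeddings, toy_query)
print(scores)
```

The first and third rows point in the same direction as the query and differ only in length, so they receive the same score; cosine similarity ignores vector magnitude.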

In [117]:
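data = data.drop(['tf_ft', 'word_sc'], axis=1)
# Assumption: the cell that re-created the label column is missing from this export;
# the output below shows 'fit' initialised to 0 for every row.
data['fit'] = 0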

In [118]:
data.head()

Unnamed: 0,id,job_title,location,connection,B_ft_sc,fit
0,1,ct bauer college business graduate magna cum...,houston texas,85,0.557126,0
1,2,native english teacher epik english program korea,kanada,500,0.384784,0
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.948828,0
3,4,people development coordinator ryan,denton texas,500,0.746752,0
4,5,advisory board member celal bayar university,i̇zmir türkiye,500,0.46181,0


In [119]:
# Convert connection to integers for every row; '500+' is treated as 500
data['connection'] = (
    data['connection'].astype(str)
    .str.replace('+', '', regex=False)
    .str.strip()
    .astype(int)
)


In [120]:
data.head()

Unnamed: 0,id,job_title,location,connection,B_ft_sc,fit
0,1,ct bauer college business graduate magna cum...,houston texas,85,0.557126,0
1,2,native english teacher epik english program korea,kanada,500,0.384784,0
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.948828,0
3,4,people development coordinator ryan,denton texas,500,0.746752,0
4,5,advisory board member celal bayar university,i̇zmir türkiye,500,0.46181,0


In [126]:
# Label a candidate 1 when the BERT score exceeds 0.5 and they have more than 100 connections, else 0
connections = data['connection'].astype(int)
data['fit'] = ((data['B_ft_sc'] > 0.5) & (connections > 100)).astype(int)


In [127]:
data

Unnamed: 0,id,job_title,location,connection,B_ft_sc,fit
0,1,ct bauer college business graduate magna cum...,houston texas,85,0.557126,0
1,2,native english teacher epik english program korea,kanada,500,0.384784,0
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.948828,0
3,4,people development coordinator ryan,denton texas,500,0.746752,1
4,5,advisory board member celal bayar university,i̇zmir türkiye,500,0.46181,0
5,6,aspiring human resource specialist,greater new york city area,1,0.955137,0
6,7,student humber college aspiring human resource...,kanada,61,0.712708,0
7,8,human resources senior specialist,san francisco bay area,500,0.703114,1
8,10,seeking human resource human resources managem...,greater philadelphia area,500,0.721451,1
9,11,student chapman university,lake forest california,2,0.426179,0


Here, a label of 1 marks a candidate as highly suitable (BERT score above 0.5 and more than 100 connections) and 0 marks a candidate as less suitable.
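The labelling rule can be sketched as two small functions. The helper names `parse_connections` and `label_fit` are hypothetical; the thresholds are the ones used in the cell above.

```python
def parse_connections(raw):
    """Turn a connection value such as '85', '500 ' or '500+' into an int ('500+' becomes 500)."""
    return int(str(raw).replace("+", "").strip())

def label_fit(similarity, connections, min_similarity=0.5, min_connections=100):
    # 1 only when both thresholds are exceeded, otherwise 0.
    return int(similarity > min_similarity and connections > min_connections)

# (similarity, raw connection string) pairs; the scores are those of rows 0, 1 and 3 above.
rows = [(0.557126, "85"), (0.384784, "500+"), (0.746752, "500")]
labels = [label_fit(score, parse_connections(conn)) for score, conn in rows]
print(labels)
```

Only the third pair passes both thresholds, which agrees with the `fit` values shown for those rows.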

In [137]:
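from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Features are the remaining numeric columns (id, connection, B_ft_sc); the target is fit
X = data.drop(['job_title', 'location', 'fit'], axis=1)
y = data['fit']

logmodel = LogisticRegression(solver='lbfgs', max_iter=1000)
lg = logmodel.fit(X, y)

# Predicted class labels and class probabilities on the training data
y_pred = logmodel.predict(X)
print(y_pred)
ar = lg.predict_proba(X)  # column 1 holds P(fit = 1)
print(ar)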


[0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0
 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1]
[[0.97226538 0.02773462]
 [0.57078485 0.42921515]
 [0.96274755 0.03725245]
 [0.42035714 0.57964286]
 [0.52836648 0.47163352]
 [0.97229978 0.02770022]
 [0.96818981 0.03181019]
 [0.42113586 0.57886414]
 [0.40609615 0.59390385]
 [0.98668202 0.01331798]
 [0.52664258 0.47335742]
 [0.52499809 0.47500191]
 [0.34748908 0.65251092]
 [0.51707658 0.48292342]
 [0.92863861 0.07136139]
 [0.21280406 0.78719594]
 [0.20958181 0.79041819]
 [0.28309201 0.71690799]
 [0.93048298 0.06951702]
 [0.23707524 0.76292476]
 [0.95197845 0.04802155]
 [0.93842681 0.06157319]
 [0.92696132 0.07303868]
 [0.23610719 0.76389281]
 [0.71659901 0.28340099]
 [0.41415716 0.58584284]
 [0.20930643 0.79069357]
 [0.94945916 0.05054084]
 [0.93303502 0.06696498]
 [0.30002    0.69998   ]
 [0.72567767 0.27432233]
 [0.70422398 0.29577602]
 [0.91587252 0.08412748]
 [0.29415929 0.70584071]
 [0.94738507 0.05261493]
 [0.9603976  0.0

In [141]:
# Replace the 0/1 label with the predicted probability of a good fit, for every row
data['fit'] = ar[:, 1]


In [142]:
data.sort_values(by ='fit', ascending = False).head()

Unnamed: 0,id,job_title,location,connection,B_ft_sc,fit
49,101,human resource generalist loparex,raleigh durham north carolina area,500,0.630247,0.850129
51,103,always set success,greater los angeles area,500,0.567987,0.84157
26,78,human resource generalist schwans,amerika birleşik devletleri,500,0.612956,0.790694
16,68,human resource specialist luxottica,greater new york city area,500,0.715959,0.790418
15,67,human resource staffing recruiting professional,jackson mississippi area,500,0.714154,0.787196


**Conclusion**

In this project I used the BERT model to measure similarity, because of the three methods its score between each job title and the target keyword tracked relevance best. For re-ranking I used Logistic Regression, because it produces a probability score for each candidate, and I ranked the candidates by that score. The re-ranking is based on both the BERT score and the number of connections.
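As a rough illustration of the re-ranking idea, logistic regression turns a weighted sum of the features into a probability through the sigmoid function, and candidates are then sorted by that probability. The coefficients below are hypothetical and are not the fitted values from this notebook; the candidate scores are rounded from the tables above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_probability(similarity, connections, weights, bias):
    # Logistic regression scores a candidate as sigmoid(w . x + b).
    z = weights[0] * similarity + weights[1] * connections + bias
    return sigmoid(z)

# Hypothetical coefficients for illustration only.
weights = (4.0, 0.01)
bias = -6.0

# title -> (BERT similarity, connections)
candidates = {
    "human resource specialist": (0.716, 500),
    "aspiring human resource specialist": (0.955, 1),
    "student chapman university": (0.426, 2),
}

ranked = sorted(
    candidates,
    key=lambda name: fit_probability(*candidates[name], weights, bias),
    reverse=True,
)
print(ranked)
```

With these assumed weights, a well-connected candidate with a good similarity score outranks a candidate with a higher similarity score but a single connection, which is the behaviour the labelling rule was designed to encourage.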
