## Model Training


#### Life cycle 

- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) Problem statement
- Implement a pipeline that automatically encodes candidate information into a vector space.


### 2) Data Collection
- Dataset Source - data/candidates.csv

### 2.1 Import Data and Required Packages
#### Importing Pandas, NumPy, Matplotlib, Seaborn, Sentence-Transformers, tqdm, and the warnings library

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
import re
import json
import pickle
from tqdm import tqdm
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


#### Import the CSV Data as Pandas DataFrame

In [2]:
df = pd.read_csv('data/candidates.csv')

#### Show Top 5 Records

In [3]:
df.head()

| | LI - List of Experience Information |
|---|---|
| 0 | [{'company':'Next Billion Advisors','title':'P... |


#### Shape of the dataset

In [4]:
df.shape

(1, 1)

### 3. Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check various categories present in the different categorical column
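
The checks listed above can be gathered into one helper so they are easy to rerun after each preprocessing step. This is only a sketch: `run_data_checks` and the toy `demo` frame are hypothetical names for illustration and are not part of the original notebook.

```python
import pandas as pd


def run_data_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic data-quality checks for a DataFrame in one dict."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
    }


# Toy frame for illustration only (not the candidates data).
demo = pd.DataFrame({"experience": ["[{'title': 'A'}]", "[{'title': 'A'}]", None]})
report = run_data_checks(demo)
print(report)
```

Note that `nunique()` ignores missing values by default, so a column that is entirely empty reports zero unique values.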

### 3.1 Check Missing values

In [5]:
df.isna().sum()

LI - List of Experience Information    0
dtype: int64

#### There are no missing values in the data set

### 3.2 Check Duplicates

In [6]:
df.duplicated().sum()

0

#### There are no duplicate values in the data set

### 3.3 Check data types

In [7]:
# Check Null and Dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   LI - List of Experience Information  1 non-null      object
dtypes: object(1)
memory usage: 140.0+ bytes


### 3.4 Checking the number of unique values of each column

In [8]:
df.nunique()

LI - List of Experience Information    1
dtype: int64

### 3.5 Check statistics of data set

In [9]:
df.describe()

| | LI - List of Experience Information |
|---|---|
| count | 1 |
| unique | 1 |
| top | [{'company':'Next Billion Advisors','title':'P... |
| freq | 1 |


### 3.6 Exploring Data

In [10]:
df.head()

| | LI - List of Experience Information |
|---|---|
| 0 | [{'company':'Next Billion Advisors','title':'P... |


### Define a function to extract relevant information on the candidates

In [11]:


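import ast


def extract_info(row):
    # The column holds a Python-style list of dicts (single-quoted keys), which
    # json.loads rejects, so map JSON null to None and parse with ast.literal_eval.
    experiences_str = row['LI - List of Experience Information'].replace('null', 'None')

    try:
        experiences = ast.literal_eval(experiences_str) if experiences_str != 'None' else []
    except (ValueError, SyntaxError):
        # Fall back to an empty list when the string cannot be parsed.
        experiences = []

    # Use .get so an experience without a title or description yields None.
    titles = [exp.get('title') for exp in experiences]
    description = [exp.get('description') for exp in experiences]
    return titles, description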
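
The column preview shows single-quoted keys, which matters for the choice of parser: `json.loads` only accepts double-quoted strings, while `ast.literal_eval` accepts Python-style literals. The sketch below compares the two on a made-up `sample` string that only mimics the quoting style of the preview; it is not taken from the dataset.

```python
import ast
import json

# Hypothetical sample that mimics the quoting style of the column preview.
sample = "[{'company': 'Acme', 'title': 'Engineer', 'description': null}]"
as_python = sample.replace('null', 'None')

try:
    json.loads(as_python)
    json_ok = True
except json.JSONDecodeError:
    # Single quotes (and None) are not valid JSON.
    json_ok = False

# literal_eval safely parses Python literals such as lists, dicts, and None.
experiences = ast.literal_eval(as_python)

print(json_ok)
print(experiences[0]['title'], experiences[0]['description'])
```

One caveat: a blanket `replace('null', 'None')` also rewrites the substring inside ordinary words (for example "nullify"), so free-text descriptions may be altered slightly.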

### Extract the relevant information on the candidates

In [12]:
df[['Job Titles', 'Description']] = df.apply(
    lambda row: pd.Series(extract_info(row)), axis=1)
# Join each title with its description, strip punctuation, collapse whitespace.
df['inputCandidate'] = df.apply(lambda row: [
    re.sub(r'\s+', ' ', re.sub(r'[^\w\s]', '', title + ' ' + description))
    if title is not None and description is not None else ''
    for title, description in zip(row['Job Titles'], row['Description'])
], axis=1)
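
The cleaning step applies two regular-expression substitutions to each title and description pair. The sketch below isolates that logic on a made-up string so its effect is easier to see; `clean_text` is a hypothetical helper name, not a function from the notebook.

```python
import re


def clean_text(title, description):
    """Mirror the notebook's cleaning: drop punctuation, collapse whitespace."""
    if title is None or description is None:
        return ''
    no_punct = re.sub(r'[^\w\s]', '', title + ' ' + description)
    return re.sub(r'\s+', ' ', no_punct)


print(clean_text('Sr. Engineer', 'Built   ETL pipelines (Python/SQL).'))
print(repr(clean_text('Analyst', None)))
```

Because punctuation is deleted rather than replaced with a space, "Python/SQL" collapses into a single token "PythonSQL". If that is undesirable, substituting a space in the first `re.sub` may be a better fit.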

### Load the SentenceTransformer model

In [48]:
model = SentenceTransformer('all-mpnet-base-v2')


### Create a dictionary to store encodings for each candidate

In [58]:
encoded_job_experience = {}

### Get the unique candidate experience strings from the DataFrame

In [13]:
unique_titles = df['inputCandidate'].explode().unique()
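
`explode()` turns every list element into its own row, and an empty list becomes a missing value (NaN). That is why a string check is useful before encoding. A minimal sketch on a toy Series (the values are invented for illustration):

```python
import pandas as pd

# Toy column of lists; the second row mimics a candidate with no parsed experiences.
toy = pd.Series([['Engineer Built APIs', 'Analyst Wrote reports'], [], ['Engineer Built APIs']])

unique_values = toy.explode().unique()
# Keep only real strings; the NaN produced by the empty list is dropped here.
text_values = [value for value in unique_values if isinstance(value, str)]

print(unique_values)
print(text_values)
```

If parsing fails for every row, the exploded column contains only NaN and nothing is passed to the model, so it is worth checking that `text_values` is not empty.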

### Calculate and store encodings for each unique candidate experience string, with a tqdm progress bar

In [14]:
for title in tqdm(unique_titles, desc='Encoding Titles', unit='title'):
    if title is not None and isinstance(title, str) and title not in encoded_job_experience:
        encoding = model.encode([title], convert_to_tensor=True)
        encoded_job_experience[title] = encoding

Encoding Titles: 100%|██████████| 1/1 [00:00<00:00, 8542.37title/s]


### Save the encoded job experience dictionary to a pickle file

In [68]:
with open('../artifacts/encodings.pickle', 'wb') as file:
    pickle.dump(encoded_job_experience, file)
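
To reuse the saved encodings later, the file can be read back with `pickle.load`. The sketch below shows the round trip with a temporary file and plain lists standing in for the tensors, so it runs without the model; the `encodings` dictionary and its values are invented for illustration. In the notebook the values are tensors, so PyTorch would need to be installed when loading the real file.

```python
import os
import pickle
import tempfile

# Stand-in for encoded_job_experience; plain lists replace the tensors here.
encodings = {'Engineer Built APIs': [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'encodings.pickle')
    with open(path, 'wb') as file:
        pickle.dump(encodings, file)
    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored == encodings)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files from a trusted source.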

### 5. Conclusions
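- The candidate experience strings are encoded with the `all-mpnet-base-v2` SentenceTransformer model and the encodings are saved to `../artifacts/encodings.pickle`.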