# How Search Works - Illustrative Example

**Theme:** AI for Capacity Management

In [1]:
%load_ext nb_black
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import seaborn as sns

from skill_match import SkillMatcher
from search import parse_demand, get_weight, _skill_names_to_ids
from retrieval import Retrieval, candidate_demand_similarity
from indexing import build_index

### Step 0 - Indexing
* Build an index of skills from the employee-skill pairs in the database.
* The index needs to be built only once, then updated whenever an employee is added or removed.
* In production, build it once, store it on disk, and load it from disk when required.
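
As a rough illustration of the idea, the index can be thought of as an inverted index from skill ID to the employees holding that skill. The sketch below is not the `build_index` function imported above, whose internals are not shown in this notebook; `build_skill_index` is a hypothetical name and the three input rows are a small subset of the data.

```python
from collections import defaultdict


def build_skill_index(pairs):
    """Map each skill_id to {emp_id: skill_level} (hypothetical helper)."""
    index = defaultdict(dict)
    for emp_id, skill_id, skill_level in pairs:
        index[skill_id][emp_id] = skill_level
    return dict(index)


# (emp_id, skill_id, skill_level) rows, a small subset of the data
pairs = [
    ("Employee 1", 26, 2),
    ("Employee 1", 45, 3),
    ("Employee 8", 45, 3),
]

index = build_skill_index(pairs)
print(index)
# {26: {'Employee 1': 2}, 45: {'Employee 1': 3, 'Employee 8': 3}}
```

Adding or removing an employee then only touches the entries for that employee's skills, which is why the index does not need a full rebuild.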

In [3]:
skill_tree = pd.read_csv("../data/skill_tree.csv", sep="\t")
skill_tree.head()

Unnamed: 0,Primary Unit,Sub Unit 1,Sub Unit 2,Sub Unit 3,Skill
0,Unit 1,Technology,Cloud Technology,Containers,Docker
1,Unit 1,Technology,Cloud Technology,Containers,Kubernetes
2,Unit 1,Technology,Cloud Technology,PaaS,Database management
3,Unit 1,Technology,Cloud Technology,PaaS,No SQL
4,Unit 1,Technology,Cloud Technology,PaaS,Application Development


In [4]:
emp_skill_pairs = pd.read_csv("../data/emp_skills.csv", sep="\t")
emp_skill_pairs.head()

Unnamed: 0,emp_id,skill_id,skill_level
0,Employee 1,26,2
1,Employee 1,45,3
2,Employee 1,68,3
3,Employee 1,60,2
4,Employee 1,48,2


The index maps each skill ID to the employees who have that skill, together with their skill level.

In [5]:
retrieval = Retrieval()
retrieval.index  # built when initialized

{26: {'Employee 1': 2},
 45: {'Employee 1': 3, 'Employee 8': 3},
 68: {'Employee 1': 3},
 60: {'Employee 1': 2},
 48: {'Employee 1': 2},
 49: {'Employee 1': 3},
 50: {'Employee 1': 3},
 95: {'Employee 1': 3, 'Employee 2': 3, 'Employee 5': 2},
 70: {'Employee 1': 2},
 72: {'Employee 1': 3},
 73: {'Employee 1': 3},
 47: {'Employee 1': 3},
 74: {'Employee 1': 3},
 97: {'Employee 1': 3,
  'Employee 2': 2,
  'Employee 5': 3,
  'Employee 7': 3,
  'Employee 8': 4},
 98: {'Employee 1': 3,
  'Employee 3': 2,
  'Employee 4': 2,
  'Employee 5': 3,
  'Employee 6': 3,
  'Employee 9': 3,
  'Employee 10': 3,
  'Employee 11': 4,
  'Employee 12': 2},
 24: {'Employee 1': 1, 'Employee 5': 2, 'Employee 11': 2, 'Employee 13': 3},
 25: {'Employee 1': 1, 'Employee 5': 3, 'Employee 11': 2},
 51: {'Employee 2': 2},
 96: {'Employee 2': 3},
 61: {'Employee 2': 2},
 87: {'Employee 3': 2},
 88: {'Employee 3': 3},
 91: {'Employee 3': 2},
 81: {'Employee 3': 3, 'Employee 7': 2, 'Employee 12': 2},
 83: {'Employee 3':

### Step 1 - Parse the input

* Demand - skills, location, experience etc.
* Search criteria

Let's look at one example of a demand.

In [6]:
demand_df = pd.read_csv("../data/demand.csv", sep="\t")
demand_df.head()

Unnamed: 0,Requestor,Requestor Service Line,Requestor Sub ServiceLine,Requestor SMU,Job Title,Rank,No of Resources_required,Country,Location,Alternate Location,Technical Skill 1,Technical Skill 2,Technical Skill 3,Functional Skill 1,Functional Skill 2,Functional Skill 3,Process Skill 1,Process Skill 2,Process Skill 3,Min Experience
0,Req_1,ServiceLine3,SubserviceLine4,SMU2,Financial Risk Analyst,Rank_3,1,India,Bangalore,,Microsoft Excel,Microsoft Word,,Risk Analysis,Analysis,Accounting,Communication,Documentation,Team Skill,8
1,Req_2,ServiceLine1,SubserviceLine1,SMU1,Data Visualisation Engineer,Rank_5,1,India,Kochi,,Dashboarding,Power BI,SQL,Requirement Analysis,,,Communication,Problem Solving,Team Skill,1
2,Req_3,ServiceLine2,SubserviceLine2,SMU3,ML Engineer,Rank_3,1,India,Bangalore,,Artificial Intelligence,ML,Python,Business Analysis,Requirement Analysis,Data Modelling,Problem Solving,Analytical Ability,Communication,8
3,Req_4,ServiceLine3,SubserviceLine1,SMU3,Account Reporting Specialist,Rank_4,2,Poland,Warsaw,,Microsoft Excel,VBA,,Accounting,Internal Audit,Reporting,Communication,Problem Solving,Analytical Ability,5
4,Req_5,ServiceLine3,SubserviceLine4,SMU2,Business Analyst,Rank_4,1,Germany,Koln,,Microsoft Visio,Microsoft Office,,Business Analysis,Taxation,Business Case Modelling,Problem Solving,Communication,Documentation,5


In [7]:
demand = demand_df.iloc[2]
demand

Requestor                                      Req_3
Requestor Service Line                  ServiceLine2
Requestor Sub ServiceLine            SubserviceLine2
Requestor SMU                                   SMU3
Job Title                                ML Engineer
Rank                                          Rank_3
No of Resources_required                           1
Country                                        India
Location                                   Bangalore
Alternate Location                               NaN
Technical Skill 1            Artificial Intelligence
Technical Skill 2                                 ML
Technical Skill 3                             Python
Functional Skill 1                 Business Analysis
Functional Skill 2              Requirement Analysis
Functional Skill 3                    Data Modelling
Process Skill 1                      Problem Solving
Process Skill 2                   Analytical Ability
Process Skill 3                        Communi

#### Fuzzy string matching

Next, we have to match the skills typed by the user (e.g. "machine learning") to the skills present in our database (e.g. "Machine Learning"). This is where fuzzy matching comes in: it uses approximate string matching algorithms to find the closest skill name.
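
To make the idea concrete, here is a minimal sketch of approximate matching using the standard library's `difflib`. It is not how `SkillMatcher` is implemented (its internals are not shown here); `fuzzy_match`, the cutoff value and the short skill list are assumptions for illustration.

```python
import difflib


def fuzzy_match(query, known_skills, cutoff=0.6):
    """Return known skills that approximately match the query, best first."""
    # Compare in lower case so that "machine learning" matches "Machine Learning"
    lowered = {skill.lower(): skill for skill in known_skills}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


known = ["Machine Learning", "Deep Learning", "Docker", "Kubernetes"]

matches = fuzzy_match("machine learnin", known)
print(matches[0])  # best match: Machine Learning
print(fuzzy_match("dockr", known)[0])  # best match: Docker
```

The cutoff trades recall for precision: a lower value tolerates more typos but also lets in unrelated skills.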

#### Identifying related skills (a better way)

Currently, there is no provision for identifying related skills. However, there are two ways I can think of to solve the problem.
1. Identify commonly co-occurring skills from a database of employee-skill pairs. This can be done using the Apriori algorithm or the FP-growth algorithm.
2. Compute semantically similar skills for a given skill using a large corpus of job descriptions. Semantic similarity is a well-studied problem in NLP, and `gensim` is a Python library suited to this purpose.

Both of these methods require appropriate datasets.
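
As a sketch of the first option, the snippet below counts how often two skills appear together on the same employee. This is plain pair counting, a much simpler stand-in for Apriori or FP-growth (no support threshold, no itemsets larger than two); `cooccurrence_counts` is a hypothetical helper and the pairs are a small subset of the index shown earlier.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(pairs):
    """Count, for each pair of skills, how many employees have both."""
    skills_by_emp = defaultdict(set)
    for emp_id, skill_id in pairs:
        skills_by_emp[emp_id].add(skill_id)

    counts = Counter()
    for skills in skills_by_emp.values():
        # sorted() gives each pair a canonical (smaller, larger) order
        for a, b in combinations(sorted(skills), 2):
            counts[(a, b)] += 1
    return counts


# (emp_id, skill_id) pairs, a small subset of the data
pairs = [
    ("Employee 1", 24), ("Employee 1", 25), ("Employee 1", 97),
    ("Employee 5", 24), ("Employee 5", 25), ("Employee 5", 97),
    ("Employee 11", 24), ("Employee 11", 25),
    ("Employee 13", 24),
]

counts = cooccurrence_counts(pairs)
print(counts.most_common(1))  # [((24, 25), 3)]
```

Skills that co-occur often, such as 24 and 25 here, are candidates for being treated as related when a demand mentions only one of them.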

In [8]:
sm = SkillMatcher("../data/skill_tree.csv")

In [9]:
print(sm.match("artificial intelligence", unit="technical")[0])

Technology Emerging Trends Artificial Intelligence Deep Learning


In [10]:
print(sm.match("communication", unit="process")[0])

Process Soft Skills Effective Communication 


Do the same for all skills and create a `demand` object.

In [11]:
demand = parse_demand(idx=2)  # parsing the same demand input as above
demand

Demand(skills=Skills(technical=[7, 23, 5], functional=[94, 94, 76], process=[102, 98, 101]), rank=3, location='Bangalore, India', experience=8, dept=Department(service_line='ServiceLine2', sub_service_line='SubserviceLine2', smu='SMU3'))

In [12]:
weight = get_weight(idx=0)  # search criteria
weight

criteria      Criteria1
technical           0.1
functional          0.3
process             0.1
experience          0.1
rank                0.1
location            0.3
bench_age             0
Name: 0, dtype: object

### Step 2 - Retrieve the employees who have the skills

Note that retrieval by skills uses an `OR` operator: an employee is a candidate if they have at least one of the requested skills. For example, let's retrieve the employees having a particular skill.

In [13]:
retrieval._get_candidates_for_skills([23])

{'Employee 5': 3, 'Employee 11': 2}

In [14]:
retrieval._get_candidates_for_skills([7])

{'Employee 11': 4, 'Employee 13': 3}

Look up the employees for every skill in the demand and return the union of all such results.
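
The union can be sketched as follows. This is not the code behind `_get_candidates_for_skills`; `candidates_for_skills` is a hypothetical helper, and keeping the highest level when an employee matches several skills is an assumption (it does agree with the `technical` values shown for these three employees in the table below).

```python
def candidates_for_skills(index, skill_ids):
    """Union of employees having any of the skills, keeping their best level."""
    best = {}
    for skill_id in skill_ids:
        for emp_id, level in index.get(skill_id, {}).items():
            # Assumption: an employee matching several skills keeps the max level
            best[emp_id] = max(level, best.get(emp_id, 0))
    return best


# Index entries for the two skills retrieved above
index = {
    23: {"Employee 5": 3, "Employee 11": 2},
    7: {"Employee 11": 4, "Employee 13": 3},
}

result = candidates_for_skills(index, [7, 23])
print(result)
# {'Employee 11': 4, 'Employee 13': 3, 'Employee 5': 3}
```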

In [15]:
candidates = retrieval.get_candidates_df(demand.skills)
candidates

Unnamed: 0,emp_id,technical,functional,process,years_of_experience,rank,service_line,sub_service_line,smu,country,city,bench_age
0,Employee 11,4.0,0.0,4.0,2,Rank_5,ServiceLine1,SubServiceLine1,SMU2,India,Kochi,8
1,Employee 13,3.0,0.0,0.0,3,Rank_5,ServiceLine1,SubServiceLine1,SMU1,India,Kochi,3
2,Employee 5,3.0,3.0,3.0,15,Rank_2,ServiceLine1,SubServiceLine3,SMU1,India,Bangalore,5
3,Employee 6,4.0,0.0,3.0,12,Rank_3,ServiceLine1,SubServiceLine4,SMU1,India,Gurgaon,6
4,Employee 7,0.0,3.0,0.0,13,Rank_3,ServiceLine3,SubServiceLine2,SMU3,India,Chennai,3
5,Employee 8,0.0,2.0,0.0,16,Rank_2,ServiceLine1,SubServiceLine1,SMU3,India,Trivandrum,4
6,Employee 12,0.0,3.0,2.0,8,Rank_4,ServiceLine3,SubServiceLine2,SMU2,India,Bangalore,2
7,Employee 1,0.0,0.0,3.0,20,Rank_1,ServiceLine1,SubServiceLine1,SMU1,India,Gurgaon,4
8,Employee 3,0.0,0.0,2.0,9,Rank_4,ServiceLine3,SubServiceLine4,SMU1,India,Trivandrum,3
9,Employee 4,0.0,0.0,2.0,9,Rank_4,ServiceLine1,SubServiceLine2,SMU1,India,Kochi,6


### Step 3 - Ranking

Now `weight` comes into the picture: we must compute a fitment score for every employee. The workhorse for this part is the similarity model, which is built on the weighted Euclidean distance metric.

Euclidean distance between two vectors $A$ and $B$ is $$d(A, B) = \sqrt{\sum_i (A_i - B_i)^2}$$

Weighted Euclidean distance is $$d(A, B) = \sqrt{\sum_i w_i \cdot (A_i - B_i)^2}$$

To get similarity score $s$ from distance metric $d$, use $$s = \frac{1}{1+d}$$

Note that since $d \ge 0$, the similarity score satisfies $0 < s \le 1$, with $s = 1$ only when $d = 0$.
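
The formulas above translate directly into a few lines of Python. This is a minimal sketch on plain numeric vectors, not the notebook's `candidate_demand_similarity`, whose internals are not shown here; the function names and the example vectors are made up for illustration.

```python
import math


def weighted_euclidean(a, b, w):
    """d(A, B) = sqrt(sum_i w_i * (A_i - B_i)^2)"""
    return math.sqrt(sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w)))


def similarity(a, b, w):
    """s = 1 / (1 + d), so s is in (0, 1] and equals 1 only when d = 0."""
    return 1.0 / (1.0 + weighted_euclidean(a, b, w))


# Illustrative two-feature vectors with weights summing to 1
a = [1.0, 0.0]
b = [0.0, 0.0]
w = [0.25, 0.75]

d = weighted_euclidean(a, b, w)  # sqrt(0.25 * 1^2) = 0.5
s = similarity(a, b, w)          # 1 / 1.5 = 0.666...
print(d, s)
```

A feature with a larger weight contributes more to the distance, so a mismatch on it lowers the fitment score more.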

#### Normalize data

However, to use the above similarity model, we need to normalize the data by dividing each value by the maximum for its column. For the skills (technical, functional and process), the maximum value is 4. For experience, rank and bench age, use the maximum value found in the database.
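
A minimal sketch of this max-scaling, using plain dictionaries instead of a DataFrame. It is not `retrieval.normalize_data`; `normalize_rows` is a hypothetical helper, only the skill and experience columns are handled, and the maximum is taken from the rows passed in rather than from the full database.

```python
MAX_SKILL_LEVEL = 4  # skill levels are assumed to top out at 4, as stated above


def normalize_rows(rows, skill_cols, other_cols):
    """Scale skill columns by the fixed maximum and other columns by their observed maximum."""
    maxima = {col: max(row[col] for row in rows) for col in other_cols}
    normalized = []
    for row in rows:
        new_row = dict(row)
        for col in skill_cols:
            new_row[col] = row[col] / MAX_SKILL_LEVEL
        for col in other_cols:
            # Guard against an all-zero column
            new_row[col] = row[col] / maxima[col] if maxima[col] else 0.0
        normalized.append(new_row)
    return normalized


rows = [
    {"emp_id": "Employee 11", "technical": 4.0, "years_of_experience": 2},
    {"emp_id": "Employee 5", "technical": 3.0, "years_of_experience": 15},
    {"emp_id": "Employee 1", "technical": 0.0, "years_of_experience": 20},
]

normalized = normalize_rows(rows, ["technical"], ["years_of_experience"])
for row in normalized:
    print(row["emp_id"], row["technical"], row["years_of_experience"])
```

Without this step, a feature measured on a large scale, such as years of experience, would dominate the distance regardless of its weight.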

In [16]:
normalized_df = retrieval.normalize_data(candidates)
normalized_df

Unnamed: 0,emp_id,technical,functional,process,years_of_experience,rank,service_line,sub_service_line,smu,country,city,bench_age,location
0,Employee 11,1.0,0.0,1.0,0.1,1.0,ServiceLine1,SubServiceLine1,SMU2,India,Kochi,1.0,"Kochi, India"
1,Employee 13,0.75,0.0,0.0,0.15,1.0,ServiceLine1,SubServiceLine1,SMU1,India,Kochi,0.375,"Kochi, India"
2,Employee 5,0.75,0.75,0.75,0.75,0.4,ServiceLine1,SubServiceLine3,SMU1,India,Bangalore,0.625,"Bangalore, India"
3,Employee 6,1.0,0.0,0.75,0.6,0.6,ServiceLine1,SubServiceLine4,SMU1,India,Gurgaon,0.75,"Gurgaon, India"
4,Employee 7,0.0,0.75,0.0,0.65,0.6,ServiceLine3,SubServiceLine2,SMU3,India,Chennai,0.375,"Chennai, India"
5,Employee 8,0.0,0.5,0.0,0.8,0.4,ServiceLine1,SubServiceLine1,SMU3,India,Trivandrum,0.5,"Trivandrum, India"
6,Employee 12,0.0,0.75,0.5,0.4,0.8,ServiceLine3,SubServiceLine2,SMU2,India,Bangalore,0.25,"Bangalore, India"
7,Employee 1,0.0,0.0,0.75,1.0,0.2,ServiceLine1,SubServiceLine1,SMU1,India,Gurgaon,0.5,"Gurgaon, India"
8,Employee 3,0.0,0.0,0.5,0.45,0.8,ServiceLine3,SubServiceLine4,SMU1,India,Trivandrum,0.375,"Trivandrum, India"
9,Employee 4,0.0,0.0,0.5,0.45,0.8,ServiceLine1,SubServiceLine2,SMU1,India,Kochi,0.75,"Kochi, India"


All numerical values are now in the range 0 to 1. Let's demonstrate the similarity function for `Employee 12` (row 6).

In [19]:
candidate_demand_similarity(normalized_df.iloc[6], demand, weight)

0.7223436528299778

The only thing remaining is to compute the similarity score (fitment score) for all employees and view the results.

#### Department similarity score
We also compute a department similarity score, which is a simple function of how much of the department hierarchy the employee and the demand share:
* similarity = 3 if both are from the same service_line + sub_service_line + smu
* similarity = 2 if both are from the same service_line + sub_service_line
* similarity = 1 if both are from the same service_line
* similarity = 0 if they are from different service_lines

When `sort_by_dept=True`, we sort by dept_similarity first, then by fitment score. For the user, this means that if someone from the same department is preferred, they are shown first.
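
The rules above and the two-level sort can be sketched as follows. This is an illustration only: the `Department` namedtuple here is a stand-in for the one used inside `Demand`, `dept_similarity` is a hypothetical function, and the rows being sorted are made-up values.

```python
from collections import namedtuple

# Stand-in for the Department type used by the demand object
Department = namedtuple("Department", ["service_line", "sub_service_line", "smu"])


def dept_similarity(a, b):
    """Count how many levels of the department hierarchy match, top down."""
    if a.service_line != b.service_line:
        return 0
    if a.sub_service_line != b.sub_service_line:
        return 1
    if a.smu != b.smu:
        return 2
    return 3


demand_dept = Department("ServiceLine2", "SubserviceLine2", "SMU3")

print(dept_similarity(demand_dept, Department("ServiceLine1", "SubserviceLine2", "SMU3")))  # 0
print(dept_similarity(demand_dept, Department("ServiceLine2", "SubserviceLine2", "SMU1")))  # 2

# Made-up rows: sort by department similarity first, then by fitment, both descending
rows = [
    {"emp_id": "A", "dept_similarity": 0, "fitment": 0.82},
    {"emp_id": "B", "dept_similarity": 2, "fitment": 0.55},
    {"emp_id": "C", "dept_similarity": 2, "fitment": 0.61},
]
ranked = sorted(rows, key=lambda r: (-r["dept_similarity"], -r["fitment"]))
print([r["emp_id"] for r in ranked])  # ['C', 'B', 'A']
```

Note that a match on sub_service_line or smu only counts when the levels above it also match, which is why the checks run top down.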

In [29]:
result = retrieval.get_results(demand, weight, sort_by_dept=False)

In [30]:
demand

Demand(skills=Skills(technical=[7, 23, 5], functional=[94, 94, 76], process=[102, 98, 101]), rank=3, location='Bangalore, India', experience=8, dept=Department(service_line='ServiceLine2', sub_service_line='SubserviceLine2', smu='SMU3'))

In [31]:
result

Unnamed: 0,emp_id,technical,functional,process,years_of_experience,rank,service_line,sub_service_line,smu,country,city,bench_age,fitment,dept_similarity
2,Employee 5,3.0,3.0,3.0,15,Rank_2,ServiceLine1,SubServiceLine3,SMU1,India,Bangalore,5,0.821055,0
6,Employee 12,0.0,3.0,2.0,8,Rank_4,ServiceLine3,SubServiceLine2,SMU2,India,Bangalore,2,0.722344,0
11,Employee 10,0.0,0.0,3.0,5,Rank_5,ServiceLine1,SubServiceLine4,SMU1,India,Bangalore,3,0.605497,0
4,Employee 7,0.0,3.0,0.0,13,Rank_3,ServiceLine3,SubServiceLine2,SMU3,India,Chennai,3,0.579855,0
5,Employee 8,0.0,2.0,0.0,16,Rank_2,ServiceLine1,SubServiceLine1,SMU3,India,Trivandrum,4,0.564537,0
3,Employee 6,4.0,0.0,3.0,12,Rank_3,ServiceLine1,SubServiceLine4,SMU1,India,Gurgaon,6,0.561424,0
10,Employee 9,0.0,0.0,3.0,7,Rank_4,ServiceLine1,SubServiceLine1,SMU1,India,Kochi,5,0.54262,0
1,Employee 13,3.0,0.0,0.0,3,Rank_5,ServiceLine1,SubServiceLine1,SMU1,India,Kochi,3,0.539513,0
8,Employee 3,0.0,0.0,2.0,9,Rank_4,ServiceLine3,SubServiceLine4,SMU1,India,Trivandrum,3,0.539386,0
9,Employee 4,0.0,0.0,2.0,9,Rank_4,ServiceLine1,SubServiceLine2,SMU1,India,Kochi,6,0.539386,0

