# KNN Model

## KNN description
Follows the same steps as the count vectorizer approach, but gathers the 75 job postings most closely related to the LinkedIn profile.
It then calculates the frequency of each job title among those 75 postings.
The job recommendation is the most frequent job title among them.
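
The steps above can be sketched on a toy dataset. This is a minimal illustration, not the notebook's own code: the postings, the profile text, and names such as `toy_jobs` and `toy_profile` are made up for the example, and `k` is set to 3 instead of 75.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy postings; the notebook itself uses jobs['text'] and jobs['job_title'].
toy_jobs = pd.DataFrame({
    'job_title': ['Data Position', 'Data Position', 'Developer', 'Developer', 'Analyst'],
    'text': [
        'python machine learning statistics sql',
        'machine learning python pandas modeling',
        'javascript react frontend css',
        'java spring backend api',
        'excel reporting sql dashboards',
    ],
})
toy_profile = ['python pandas machine learning']  # hypothetical profile text

# Vectorize the postings, then project the profile into the same feature space.
toy_vectorizer = TfidfVectorizer()
toy_job_vectors = toy_vectorizer.fit_transform(toy_jobs['text'])
toy_profile_vector = toy_vectorizer.transform(toy_profile)

# Find the k postings closest to the profile by cosine distance.
k = 3
knn = NearestNeighbors(n_neighbors=k, metric='cosine', algorithm='brute')
knn.fit(toy_job_vectors)
distances, indices = knn.kneighbors(toy_profile_vector)

# Count titles among the neighbours; the most frequent title is the recommendation.
title_counts = toy_jobs.iloc[indices[0]]['job_title'].value_counts()
recommended_title = title_counts.idxmax()
print(title_counts)
print('Recommended:', recommended_title)
```

Only the two data postings share any vocabulary with the toy profile, so they are the two closest neighbours and their title wins the vote.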

In [13]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.neighbors import NearestNeighbors
from IPython.display import display_html

## Jobs data

In [14]:
jobs = pd.read_csv('../data/job_postings.csv')
jobs = jobs.drop(columns=['date_added', 'organization', 'skills_len', 'job_type'])
jobs.fillna('', inplace=True)
jobs['text'] = jobs['job_description'] + ' ' + jobs['skills']

In [15]:
jobs

| | job_description | job_title | location | skills | text |
|---|---|---|---|---|---|
| 0 | n edi analyst with experience please read on ... | Analyst | Northeast United States | edi trustedlink as van | n edi analyst with experience please read on ... |
| 1 | informatica etl developerst petersburg fl only... | Developer | Southern United States | etl informatica b data exchange netezza oracle... | informatica etl developerst petersburg fl only... |
| 2 | this nationally recognized microsoft gold part... | Manager | Western United States | microsoft dynamics ax project manager - toront... | this nationally recognized microsoft gold part... |
| 3 | .net developer with experience please read on... | Developer | Northeast United States | c asp.net sql javascript mvc | .net developer with experience please read on... |
| 4 | hatstand a global financial consultancy is see... | Developer | Northeast United States | java linux unix sdlc; multi-threaded or concur... | hatstand a global financial consultancy is see... |
| ... | ... | ... | ... | ... | ... |
| 16427 | jpmorgan chase co. (nyse: jpm) is a leading g... | Developer | Northeast United States | .net architecture developer development git ht... | jpmorgan chase co. (nyse: jpm) is a leading g... |
| 16428 | seeking jr. systems administrators with experi... | Administrator | Midwest United States | jr. linux administrator | seeking jr. systems administrators with experi... |
| 16429 | senior lead devops engineer with a desired to... | Developer | Midwest United States | amazon web services linux bash ruby python agile | senior lead devops engineer with a desired to... |
| 16430 | headquartered in downtown san francisco ca we ... | Developer | Western United States | javascript react.js golang startup ror iot ana... | headquartered in downtown san francisco ca we ... |


## User data

In [16]:
def gather_profile_data(file_path):
    """Read a LinkedIn profile CSV and combine its text fields into one 'text' column."""
    profile_data = pd.read_csv(file_path)
    profile_data['text'] = profile_data['Titles'] + ' ' \
                            + profile_data['Skills'] + ' ' \
                            + profile_data['Summary'] + ' ' \
                            + profile_data['Education']

    # Certifications and Projects are optional; append them only when the column exists
    for column in ('Certifications', 'Projects'):
        if column in profile_data.columns:
            profile_data['text'] += ' ' + profile_data[column]

    return profile_data

In [17]:
# Read in the LinkedIn profile data.
profile_data_zach = gather_profile_data('../data/linkedin/test-output/Zach_LinkedInData_12-16-2020.csv')
profile_data_nolan = gather_profile_data('../data/linkedin/test-output/Nolan_LinkedInData_12-16-2020.csv')
profile_data_albert = gather_profile_data('../data/linkedin/test-output/Albert_LinkedInData.csv')
profile_data_ye = gather_profile_data('../data/linkedin/test-output/Ye_LinkedInData.csv')

## Make recommendations

In [18]:
def get_recommendations(vectorizer, tfidf_jobtext, user_data):
    # Transform the user's profile text with the already-fitted vectorizer
    user_tfidf = vectorizer.transform(user_data['text'])

    # Find the n_neighbors job postings closest to the profile by cosine distance
    n_neighbors = 75
    KNN = NearestNeighbors(n_neighbors=n_neighbors, metric='cosine', algorithm='brute')
    KNN.fit(tfidf_jobtext)
    NNs = KNN.kneighbors(user_tfidf, return_distance=True)

    # Indices of the most similar jobs for the first profile row.
    # NOTE: [1:] skips the single closest posting, so 74 titles are counted below
    # while Job Match % still divides by n_neighbors (75).
    index = list(NNs[1][0][1:])
    final_jobs = jobs.loc[index]  # relies on the global `jobs` DataFrame

    # Count job titles among the neighbours (top 10) and name the column 'Job Count'
    pos_df = final_jobs['job_title'].value_counts()[:10].rename('Job Count').to_frame()

    # Fraction of the n_neighbors postings that carry each title
    pos_df['Job Match %'] = pos_df['Job Count'] / n_neighbors

    return pos_df
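
One detail of `get_recommendations` is worth checking in isolation: the `[1:]` slice on the neighbour indices. Dropping the first neighbour is a common idiom when the query is itself one of the fitted rows, because the closest match is then the row itself. A LinkedIn profile is not among the fitted job postings, so the first neighbour is a genuine match. The sketch below uses made-up 2-D vectors (`train`, `new_query`) in place of TF-IDF rows to show the difference.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D vectors standing in for TF-IDF rows of job postings.
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
knn = NearestNeighbors(n_neighbors=2, metric='cosine', algorithm='brute').fit(train)

# Case 1: the query is one of the fitted rows, so its first neighbour is itself.
_, idx_in_sample = knn.kneighbors(train[[0]])

# Case 2: the query is new (like a profile), so its first neighbour is a real match.
new_query = np.array([[0.8, 0.2]])
dist_new, idx_new = knn.kneighbors(new_query)

print('in-sample neighbours:', idx_in_sample[0])
print('out-of-sample neighbours:', idx_new[0])
print('after [1:] slice:', idx_new[0][1:])  # the best match (row 1) is dropped
```

This is consistent with the output tables below: the job counts in each of the first three tables sum to 74 rather than 75.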

## Specific Recommendations

In [19]:
# Instantiate the TfidfVectorizer and fit it on the job text
tfidf_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
tfidf_jobtext = tfidf_vectorizer.fit_transform(jobs['text'])

In [20]:
# Calculate recommendations
nolans_recommendations = get_recommendations(tfidf_vectorizer, tfidf_jobtext, profile_data_nolan)
zachs_recommendations = get_recommendations(tfidf_vectorizer, tfidf_jobtext, profile_data_zach)
alberts_recommendations = get_recommendations(tfidf_vectorizer, tfidf_jobtext, profile_data_albert)
yes_recommendations = get_recommendations(tfidf_vectorizer, tfidf_jobtext, profile_data_ye)

In [21]:
# Credit for notebook styling: https://blog.softhints.com/display-two-pandas-dataframes-side-by-side-jupyter-notebook/
df1_styler = zachs_recommendations.style.set_table_attributes("style='display:inline'").set_caption('Zach')
df2_styler = nolans_recommendations.style.set_table_attributes("style='display:inline'").set_caption('Nolan')
df3_styler = alberts_recommendations.style.set_table_attributes("style='display:inline'").set_caption('Albert')
df4_styler = yes_recommendations.style.set_table_attributes("style='display:inline'").set_caption('Ye')

space = "\xa0" * 50
display_html(df1_styler._repr_html_() + space + df2_styler._repr_html_() + space + df3_styler._repr_html_() + space + df4_styler._repr_html_(), raw=True)

Zach

| | Job Count | Job Match % |
|---|---|---|
| Data Position | 37 | 0.493333 |
| Analyst | 10 | 0.133333 |
| Engineer | 8 | 0.106667 |
| Developer | 7 | 0.093333 |
| Architect | 7 | 0.093333 |
| Manager | 2 | 0.026667 |
| Director | 1 | 0.013333 |
| Programmer | 1 | 0.013333 |
| Consulting | 1 | 0.013333 |

Nolan

| | Job Count | Job Match % |
|---|---|---|
| Data Position | 26 | 0.346667 |
| Developer | 17 | 0.226667 |
| Engineer | 11 | 0.146667 |
| Architect | 10 | 0.133333 |
| Analyst | 5 | 0.066667 |
| Consulting | 3 | 0.04 |
| Director | 1 | 0.013333 |
| Manager | 1 | 0.013333 |

Albert

| | Job Count | Job Match % |
|---|---|---|
| Data Position | 26 | 0.346667 |
| Analyst | 14 | 0.186667 |
| Architect | 11 | 0.146667 |
| Developer | 8 | 0.106667 |
| Engineer | 8 | 0.106667 |
| Director | 3 | 0.04 |
| Consulting | 3 | 0.04 |
| Support | 1 | 0.013333 |

Ye

| | Job Count | Job Match % |
|---|---|---|
| Developer | 20 | 0.266667 |
| Analyst | 17 | 0.226667 |
| Engineer | 14 | 0.186667 |
| Support | 6 | 0.08 |
| Manager | 5 | 0.066667 |
| Technician | 4 | 0.053333 |
| Administrator | 3 | 0.04 |
| Designer | 2 | 0.026667 |
| Consulting | 1 | 0.013333 |
| Data Position | 1 | 0.013333 |


## KNN Conclusion
The most common job matches line up with what we would expect. Data Position was the most frequent title for Zach (37 postings), Nolan (26) and Albert (26), while Developer was the most frequent title for Ye (20).
Job Match % may not be the most useful metric for comparing this model against others, but it does give a sense of how much we can trust a given recommendation.
KNN is also influenced by the balance of job titles in the data, and Data positions were among the least common titles. We were still recommended Data positions, which shows some strength in this model.
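
One way to make the point about title balance concrete is to compare each title's share among the neighbours with its share in the full set of postings. The sketch below is not part of the notebook: `title_lift` is a hypothetical helper, and the title counts are invented to illustrate the calculation. A ratio above 1 means a title shows up among the neighbours more often than its overall frequency would suggest.

```python
import pandas as pd

def title_lift(neighbor_titles, all_titles):
    """Share of each title among the neighbours divided by its share in the full corpus."""
    neighbor_share = pd.Series(neighbor_titles).value_counts(normalize=True)
    base_share = pd.Series(all_titles).value_counts(normalize=True)
    # Align the base rates to the titles that actually appear among the neighbours.
    lift = neighbor_share / base_share.reindex(neighbor_share.index)
    return lift.sort_values(ascending=False)

# Hypothetical corpus in which Data Position is a rare title.
all_titles = ['Developer'] * 80 + ['Analyst'] * 15 + ['Data Position'] * 5
# Hypothetical neighbour titles for one profile.
neighbor_titles = ['Developer'] * 5 + ['Data Position'] * 4 + ['Analyst'] * 1

lift = title_lift(neighbor_titles, all_titles)
print(lift)
```

In this made-up example Developer has the highest raw count among the neighbours, but Data Position is far more over-represented relative to its base rate. In the notebook, the same comparison could be run with `jobs['job_title']` as the base distribution.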