# Job Recommendation with LSH

#### Hi to whoever is reviewing this page,
#### This is the code for the model part of the project. The model is based on the locality-sensitive hashing (LSH) algorithm. Because my computer has problems rendering PHP files, I use this HTML file to display the model. If you want to try it, you can download the file directly and type your keywords into the `title` variable. I will remind you again in the relevant part below. I will fix this problem as soon as possible.

## Import Data and Packages

In [5]:
import numpy as np
import pandas as pd
import re
import time
from datasketch import MinHash, MinHashLSHForest

df = pd.read_csv('../2023-04-14-job-search/Clean_Data/combined_data_final.csv')

# Replace NaN with an empty string in the optional columns
optional_columns = ['Responsibilities', 'Benefits',
                    'detected_extensions.schedule_type',
                    'detected_extensions.work_from_home',
                    'detected_extensions.posted_at',
                    'detected_extensions.salary']
df[optional_columns] = df[optional_columns].fillna('')

## Preprocessing Data

In [2]:
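def preprocess(text):
    """Normalize a raw string and return a list of lowercase word tokens."""
    text = re.sub(r'http\S+', '', text)   # drop URLs
    text = re.sub(r'[^\w\s]', '', text)   # drop punctuation
    text = re.sub(r'\s\s+', ' ', text)    # collapse runs of whitespace
    text = text.strip()
    tokens = text.lower()
    tokens = tokens.split()
    return tokens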
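
To make the effect of `preprocess` concrete, here is a small self-contained check. The function is repeated so the snippet runs on its own, and the sample string is made up for illustration; it is not a row from the dataset.

```python
import re

def preprocess(text):
    # Same steps as the cell above, repeated so this snippet is standalone
    text = re.sub(r'http\S+', '', text)   # drop URLs
    text = re.sub(r'[^\w\s]', '', text)   # drop punctuation
    text = re.sub(r'\s\s+', ' ', text)    # collapse runs of whitespace
    return text.strip().lower().split()

# Hypothetical job snippet (assumption, not taken from the CSV)
sample = 'Data Analyst, Washington DC! Apply at http://example.com/jobs'
tokens = preprocess(sample)
print(tokens)
```

One thing to keep in mind: punctuation is deleted rather than replaced by a space, so a word like `full-time` becomes the single token `fulltime`.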

## Create MinHash Objects

In [3]:
def get_forest(data, perms):
    start_time = time.time()
    
    minhash = []
    
    for text in data['text']:
        tokens = preprocess(text)
        m = MinHash(num_perm=perms)
        for s in tokens:
            m.update(s.encode('utf8'))
        minhash.append(m)
        
    forest = MinHashLSHForest(num_perm=perms)
    
    for i,m in enumerate(minhash):
        forest.add(i,m)
        
    forest.index()  # must be called after all add() calls, otherwise the keys are not searchable
    
    print('It took %s seconds to build forest.' %(time.time()-start_time))
    
    return forest
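
For readers who want to see what a MinHash signature is doing, below is a minimal pure-Python sketch of the idea. It is not how `datasketch` implements `MinHash` internally; the helper names (`token_hash`, `minhash_signature`, `estimate_jaccard`, `exact_jaccard`) and the use of seeded SHA-1 hashes in place of random permutations are my own assumptions for illustration.

```python
import hashlib

def token_hash(token, seed):
    # Seeded 64-bit hash; each seed stands in for one "permutation" (assumption)
    data = f'{seed}:{token}'.encode('utf8')
    return int.from_bytes(hashlib.sha1(data).digest()[:8], 'big')

def minhash_signature(tokens, num_perm=128):
    # For every seed, keep the smallest hash value over the token set.
    # Requires at least one token, otherwise min() raises ValueError.
    unique = set(tokens)
    return [min(token_hash(t, seed) for t in unique) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    # Fraction of positions where the two signatures agree
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def exact_jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Two hypothetical token lists sharing 3 of 5 distinct tokens
job_a = ['data', 'analyst', 'python', 'sql']
job_b = ['data', 'analyst', 'python', 'excel']
sig_a = minhash_signature(job_a)
sig_b = minhash_signature(job_b)
print('exact    :', exact_jaccard(job_a, job_b))
print('estimated:', estimate_jaccard(sig_a, sig_b))
```

The estimate should land close to the exact Jaccard similarity, and it gets tighter as `num_perm` grows. The forest then indexes these signatures so that similar ones can be found without comparing the query against every row.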

## Evaluate Query

In [4]:
def predict(text, database, perms, num_results, forest):
    start_time = time.time()
    
    tokens = preprocess(text)
    m = MinHash(num_perm=perms)
    for s in tokens:
        m.update(s.encode('utf8'))
        
    idx_array = np.array(forest.query(m, num_results))
    if len(idx_array) == 0:
        return None # the forest found no matching jobs for this query
    
    result = database.iloc[idx_array]
    # select columns in the result
    result = result[['title', 'company_name', 'location', 'via', 'description', 
                     'detected_extensions.schedule_type', 'detected_extensions.work_from_home',
                     'detected_extensions.posted_at', 'detected_extensions.salary',
                     'Qualifications', 'Responsibilities', 'Benefits'
                     ]]
    result.columns = ['Job Title', 'Company Name', 'Location', 'Platform', 'Description',
                      'Schedule Type', 'Work from Home', 'Posted at', 'Salary',
                      'Qualifications', 'Responsibilities', 'Benefits']
    
    print('It took %s seconds to query forest.' %(time.time()-start_time))
    
    return result
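
The neighbours returned by `forest.query` are approximate, and as far as I know they are not guaranteed to come back sorted by similarity. If the order matters, one option is to re-rank the candidates by exact Jaccard similarity on their tokens. The sketch below assumes the tokens of each candidate row are already available; the `rerank` helper and the sample data are hypothetical and not part of the notebook.

```python
def jaccard(a, b):
    # Exact Jaccard similarity between two token collections
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rerank(query_tokens, candidates):
    """Sort candidates (row index -> token list) by similarity, best first."""
    scored = [(idx, jaccard(query_tokens, toks)) for idx, toks in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical query and candidate rows (assumption, not from the dataset)
query = ['data', 'analysis', 'in', 'dc']
candidates = {
    0: ['data', 'analysis', 'intern', 'dc'],
    1: ['machine', 'learning', 'engineer'],
    2: ['data', 'scientist', 'in', 'dc', 'area'],
}
ranked = rerank(query, candidates)
for idx, score in ranked:
    print(idx, round(score, 2))
```

In the notebook this would mean calling `preprocess` on the `text` of each row in `idx_array` and reordering `result` accordingly; with only a handful of candidates the extra cost is negligible.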

## Make Recommendations

### Choosing Parameters

In [9]:
# Number of permutations (hash functions) per MinHash signature
permutations = 128

# Number of recommendations to return
num_recommendations = 5

# Keywords to search for (edit this string to try your own query)
title = 'Data Analysis in DC'
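
As a rough guide for choosing `permutations`: if the hash functions behave independently, the MinHash estimate of a Jaccard similarity J from k permutations has a standard error of about sqrt(J(1 - J) / k). The snippet below just evaluates that formula; `minhash_std_error` is a hypothetical helper, and the formula describes the signature estimate, not the ranking quality of the forest itself.

```python
import math

def minhash_std_error(jaccard, num_perm):
    # Standard error of the mean of num_perm independent 0/1 matches
    return math.sqrt(jaccard * (1.0 - jaccard) / num_perm)

# Worst case is J = 0.5; compare a few signature sizes
for k in (32, 128, 512):
    print(k, round(minhash_std_error(0.5, k), 3))
```

Quadrupling the number of permutations halves the error, while the time to build the forest grows roughly in proportion to `permutations`, so 128 is a reasonable middle ground for a dataset of this size.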

### Train Model and Make Recommendations

In [10]:
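db = df.copy()
# Join the text columns into one searchable string. NaN is filled first so that
# one missing field does not turn the whole row's text into NaN.
text_columns = ['title', 'company_name', 'location', 'description',
                'Qualifications', 'Responsibilities', 'Benefits']
db['text'] = db[text_columns].fillna('').astype(str).agg(' '.join, axis=1)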
forest = get_forest(db, permutations)
result = predict(title, db, permutations, num_recommendations, forest)
print('\n Top Recommendation(s) is(are) \n', result)

It took 7.447729110717773 seconds to build forest.
It took 0.0031728744506835938 seconds to query forest.

 Top Recommendation(s) is(are) 
                                              Job Title Company Name  \
365                 Research Intern - Machine Learning    Microsoft   
82   Online R, NLP, Natural Language Processing tea...    TeacherOn   
178  Expert on Graph Neural Networks applied to Soc...       Upwork   
374  Need an expert to consult on GNNs (graph neura...       Upwork   
317  Reinforcement Learning Developer for Stock Tra...       Upwork   

                   Location               Platform  \
365          Redmond, WA     via Microsoft Careers   
82     Silver Spring, MD                via Jooble   
178               Anywhere              via Upwork   
374               Anywhere              via Upwork   
317               Anywhere              via Upwork   

                                           Description Schedule Type  \
365  Research Internships at Microso