# Glassdoor HR Review Detector

## Theory

- The kinds of words that genuine employees use in Glassdoor reviews are different from the ones used by HR.
- So my theory is that I can use AI to distinguish between these two kinds of reviews.
- This will be achieved using embeddings. An embedding is a conversion of natural language into vectors. Vectors of similar words are close to each other. My theory is that one or more regions of the vector space are more of the human resources kind, while other regions are more of the genuine employee kind.
- My goal is to train a neural network to identify regions in the vector space that are "hr"-ish vs. regions that are "genuine"-ish. Based on this, given a review text, the network will be able to predict if the review was written by HR or not.
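
To make "vectors of similar words are close to each other" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are made-up toy values, not real embeddings (real ones have hundreds or thousands of dimensions), and the phrases are only illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the numbers are invented for illustration.
toy_embeddings = {
    "toxic management": [0.9, 0.1, 0.2],
    "bad evaluation": [0.8, 0.2, 0.1],
    "supportive management": [0.1, 0.9, 0.3],
}

query = toy_embeddings["toxic management"]
for text, vector in toy_embeddings.items():
    print(f"{text}: {cosine_similarity(query, vector):.3f}")
```

In this toy setup, "bad evaluation" scores much closer to the query than "supportive management" does, which is the kind of clustering the theory relies on.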


- Each training record consists of pros, cons, an overall rating, and a `sounds_like_hr` label.
- The pros and cons are represented as vector embeddings generated with the OpenAI API.
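
As a rough sketch of how one record could become a training example, the snippet below concatenates the embeddings of pros and cons, appends the rating, and reads the label. `toy_embed` is a hypothetical stand-in for the embedding API call (it only counts two cue words), `build_example` is an illustrative helper, and the sample record is invented. The field names follow the CSV columns used in the experiment.

```python
def toy_embed(text):
    """Hypothetical stand-in for an embedding API call: counts two cue words."""
    words = text.lower().split()
    return [float(words.count("toxic")), float(words.count("supportive"))]

def build_example(record, embed=toy_embed):
    """Turn one review record into a (feature_vector, label) pair."""
    features = embed(record["pros"]) + embed(record["cons"])
    features.append(float(record["overall_rating"]))
    # Assumption: the label is stored as the string '0' or '1' in the CSV.
    label = int(record["sounds_like_hr"])
    return features, label

# Invented record, for illustration only.
record = {
    "pros": "supportive management",
    "cons": "work under pressure",
    "overall_rating": "4.0",
    "sounds_like_hr": "1",
}

features, label = build_example(record)
print(features, label)
```

With a real embedding model in place of `toy_embed`, the resulting pairs are what a classifier would be trained on.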

## Quick Experiment

A quick experiment to see if my theory has some merit is the following:
1. Take a subset of the training data.
1. Divide it into a training set and a test set.
1. Convert the reviews in the training set to embeddings.
1. Store the embeddings in an in-memory vector store.
1. Query the vector store with a genuine review (from the test set). It should return other reviews that sound genuine.
1. Query the vector store with an HR review (from the test set). It should return other reviews that sound like HR.
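
The mechanics of these steps can be sketched without any external service. The class below is a minimal stand-in for a vector store, not the one used in the experiment. It ranks by Euclidean distance for brevity, and the two-dimensional vectors and their labels are invented toy data.

```python
import math

class InMemoryVectorStore:
    """Minimal stand-in for a vector database: stores (vector, label) pairs."""

    def __init__(self):
        self._items = []

    def add(self, vector, label):
        self._items.append((vector, label))

    def similarity_search(self, query, k=3):
        """Return the labels of the k stored vectors closest to the query."""
        ranked = sorted(self._items, key=lambda item: math.dist(query, item[0]))
        return [label for _, label in ranked[:k]]

# Hypothetical 2-d "embeddings" of training reviews, with their labels.
store = InMemoryVectorStore()
for vector, label in [
    ([0.9, 0.1], "genuine"),
    ([0.8, 0.3], "genuine"),
    ([0.7, 0.2], "genuine"),
    ([0.1, 0.9], "hr"),
    ([0.2, 0.8], "hr"),
    ([0.3, 0.9], "hr"),
]:
    store.add(vector, label)

# Query with a toy "genuine" test vector and measure the share of genuine neighbours.
neighbours = store.similarity_search([0.85, 0.2], k=3)
share_genuine = neighbours.count("genuine") / len(neighbours)
print(neighbours, share_genuine)
```

If the theory holds, a genuine query should pull in mostly genuine neighbours and an HR query mostly HR neighbours, as happens with this toy data.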



In [20]:
import tempfile
from utils.utils import load_csv_as_dictionary, array_of_dictionaries_to_csv_string
from sklearn.model_selection import train_test_split
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_chroma import Chroma

# 1. Load the CSV as a list of dictionaries (one per review).
data = load_csv_as_dictionary(csv_filename='training-data/glassdoor-reviews-main.csv')

# 2. Use scikit-learn to split the data into a training set (60%) and a test set (40%).
training_set, test_set = train_test_split(data, test_size=0.40, random_state=1)

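# 3. Remove the sounds_like_hr label so that it is not part of the embedded text,
#    then format the training set as a CSV string.
for unit in training_set:
    del unit["sounds_like_hr"]

training_set_csv = array_of_dictionaries_to_csv_string(training_set)

# 4. Embed the training set with the OpenAI API and store it in an in-memory vector DB.
#    CSVLoader reads from a file path, so the CSV string is written to a temporary file first.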
with tempfile.NamedTemporaryFile(delete=False, mode="w+") as temp_file:
    temp_file.write(training_set_csv)
    temp_file_path = temp_file.name

loader = CSVLoader(file_path=temp_file_path)
documents = loader.load()
db = Chroma.from_documents(documents, OpenAIEmbeddings())

# 5. Pick the first genuine review in the test set and query the vector store with its cons.
#    The aim is to see what % of results sound genuine; this cell only prints the closest match.
genuine = next(sample for sample in test_set if sample['sounds_like_hr'] == '0')['cons']
docs = db.similarity_search(genuine)
print(docs[0].page_content)

# 6. Pick the first HR-sounding review in the test set and query the vector store with its cons.
#    The aim is to see what % of results sound like HR; this cell only prints the closest match.
hr_written = next(sample for sample in test_set if sample['sounds_like_hr'] == '1')['cons']
docs = db.similarity_search(hr_written)
print(docs[0].page_content)

id: empReview_25252248
date: Mar 20, 2019
overall_rating: 1.0
pros: good salary but with bad life and bad work life
cons: - toxic environment on QA Team at jordan
- there is no ability to improve yourself
- Super bad evaluation
- the management of the QA Team at jordan is not good, not knowledgeable and experienced.
- fake report generated for management
id: empReview_17494789
date: Oct 23, 2017
overall_rating: 4.0
pros: Career path- Supportive Management!
challenging
cons: Work under pressure! but its worth it.
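
The cell deletes `sounds_like_hr` before embedding, so the returned documents carry no label and the "% of results" cannot be read off them directly. One possible approach, sketched below, is to keep a lookup from review id to label before the deletion and then recover each result's id from its `page_content`. This assumes ids are unique and that each document contains an `id: ...` line as in the output above. `review_id`, `label_share`, the ids, and the sample texts are all hypothetical.

```python
def review_id(page_content):
    """Extract the value of the 'id: ...' line from a document's page_content."""
    for line in page_content.splitlines():
        if line.startswith("id: "):
            return line[len("id: "):].strip()
    return None

def label_share(page_contents, labels_by_id, target):
    """Fraction of retrieved documents whose saved label equals target."""
    labels = [labels_by_id.get(review_id(text)) for text in page_contents]
    return labels.count(target) / len(labels) if labels else 0.0

# Hypothetical lookup, built before the labels are deleted from the training set.
labels_by_id = {"review_a": "0", "review_b": "0", "review_c": "1", "review_d": "0"}

# Hypothetical page_content strings, shaped like the documents printed above.
results = [
    "id: review_a\noverall_rating: 1.0\ncons: long hours",
    "id: review_b\noverall_rating: 2.0\ncons: poor management",
    "id: review_c\noverall_rating: 5.0\ncons: none",
    "id: review_d\noverall_rating: 1.0\ncons: no growth",
]

print(label_share(results, labels_by_id, target="0"))
```

In the notebook, the list of texts would come from something like `[doc.page_content for doc in docs]`, and the lookup would be built from `training_set` before step 3 runs.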
