# Glassdoor HR Review Detector

## Theory

- The kinds of words that genuine employees use in Glassdoor reviews are different from the ones used by HR.
- So my theory is that I can use AI to distinguish between these two kinds of reviews.
- This will be achieved using embeddings. An embedding is a conversion of natural language into vectors. Vectors of similar words are close to each other. My theory is that one or more regions of the vector space are more of the human-resources kind, while other regions are more of the genuine-employee kind.
- My goal is to train a neural network to identify regions in the vector space that are "hr"-ish vs. regions that are "genuine"-ish. Based on this, given a review text, the network will be able to predict if the review was written by HR or not.
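
To make "close to each other" concrete, here is a minimal sketch that uses cosine similarity as the closeness measure. The three-dimensional vectors are made up for illustration (they are not real embeddings), and cosine similarity is an assumption about which distance to use; the notes above do not fix one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings"; real ones have hundreds or thousands of dimensions.
hr_review_a = [0.9, 0.1, 0.2]
hr_review_b = [0.8, 0.2, 0.1]
genuine_review = [0.1, 0.9, 0.3]

sim_hr_pair = cosine_similarity(hr_review_a, hr_review_b)
sim_mixed_pair = cosine_similarity(hr_review_a, genuine_review)

# If the theory holds, two HR-sounding reviews should score higher than a mixed pair.
print(sim_hr_pair > sim_mixed_pair)
```

With these toy numbers the two "hr" vectors point in nearly the same direction, while the mixed pair does not. Whether real review embeddings separate this cleanly is exactly what the experiment is meant to find out.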


- Each training record consists of `pros`, `cons`, an overall rating, and a `sounds_like_hr` label.
- The `pros` and `cons` text is represented as vector embeddings using the OpenAI API.
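
A rough sketch of how the pros and cons could be turned into embeddings with the OpenAI Python SDK. Several things here are assumptions rather than decisions recorded in these notes: the helper names `review_to_text` and `embed_reviews` are hypothetical, the model name `text-embedding-3-small` is just one available option, and joining pros and cons into a single string (instead of embedding them separately) is only one possible choice.

```python
def review_to_text(review):
    """Join the pros and cons fields into one string to embed (an assumed format)."""
    return f"Pros: {review['pros']}\nCons: {review['cons']}"


def embed_reviews(client, reviews, model="text-embedding-3-small"):
    """Return one embedding (a list of floats) per review, in input order.

    `client` is expected to be an `openai.OpenAI` instance; the model name
    is an assumption and can be swapped for any embedding model.
    """
    texts = [review_to_text(r) for r in reviews]
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Usage (needs the `openai` package and an OPENAI_API_KEY environment variable):
#   from openai import OpenAI
#   vectors = embed_reviews(OpenAI(), training_set)
```

Passing the client in as a parameter keeps the function easy to try out with a stand-in object before spending API credits.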

## Quick Experiment

A quick experiment to see if my theory has some merit is the following:
1. Take a subset of the training data.
1. Divide it into a training set and a test set.
1. Convert the reviews in the training set to embeddings.
1. Store the embeddings in an in-memory vector store.
1. Query the vector store with a genuine review (from the test set). It should return other reviews that sound genuine.
1. Query the vector store with an HR review (from the test set). It should return other reviews that sound like HR.
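
The store-and-query part of these steps can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation used in this notebook: `InMemoryVectorStore` and `fraction_with_label` are hypothetical names, the two-dimensional vectors are toy stand-ins for real embeddings, and `sounds_like_hr` is assumed to be a boolean kept next to each vector so the hit rate can be computed.

```python
import math


class InMemoryVectorStore:
    """Tiny brute-force vector store ranked by cosine similarity."""

    def __init__(self):
        self._items = []  # list of (unit_vector, metadata) pairs

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    def add(self, vector, metadata):
        self._items.append((self._normalize(vector), metadata))

    def query(self, vector, k=3):
        """Return the k most similar items as (similarity, metadata) pairs."""
        q = self._normalize(vector)
        scored = [(sum(a * b for a, b in zip(q, v)), meta) for v, meta in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def fraction_with_label(results, sounds_like_hr):
    """Share of query results whose label matches `sounds_like_hr`."""
    if not results:
        return 0.0
    hits = sum(1 for _, meta in results if meta["sounds_like_hr"] == sounds_like_hr)
    return hits / len(results)


# Toy "training set": made-up 2-d vectors and ids standing in for real embeddings.
store = InMemoryVectorStore()
store.add([0.9, 0.1], {"id": "toy_1", "sounds_like_hr": True})
store.add([0.8, 0.3], {"id": "toy_2", "sounds_like_hr": True})
store.add([0.1, 0.9], {"id": "toy_3", "sounds_like_hr": False})
store.add([0.2, 0.8], {"id": "toy_4", "sounds_like_hr": False})
store.add([0.3, 0.7], {"id": "toy_5", "sounds_like_hr": False})

# Toy "test set" queries: one HR-like vector, one genuine-like vector.
hr_results = store.query([0.85, 0.2], k=2)
genuine_results = store.query([0.15, 0.85], k=2)

print(fraction_with_label(hr_results, True))
print(fraction_with_label(genuine_results, False))
```

A brute-force scan like this is fine for a few hundred reviews. If the theory has merit, the fraction of same-label neighbours should be clearly above the base rate of that label in the training set.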



In [1]:
import csv
from utils.utils import load_csv_as_dictionary, array_of_dictionaries_to_csv_string
from sklearn.model_selection import train_test_split

# 1. load csv as array of objects 
data = load_csv_as_dictionary(csv_filename='training-data/glassdoor-reviews-main.csv')

# 2. use scikit-learn to split into training set (60%) and test set (40%)
training_set, test_set = train_test_split(data, test_size=0.40, random_state=1)

# 3. Remove the sounds_like_hr property. format training set into csv
for unit in training_set:
    del unit["sounds_like_hr"]

training_set_csv = array_of_dictionaries_to_csv_string(training_set)
print(training_set_csv)
    
# 4. embed training set with the OpenAI API and store in an in-memory vector db
# 5. filter test set for a genuine review. query vector store. See what % of results are genuine
# 6. filter test set for an hr review. query vector store. See what % of results are genuine

id,date,overall_rating,pros,cons
empReview_37608078,"Oct 26, 2020",4.0,Amazing place to learn and its filled with opportunities to take up challenges,There is no-one to guide you.
empReview_81546778,"Nov 4, 2023",2.0,"I was taken care of, mentored, driven",Crazy co-founder was nasty and just un-professional
empReview_55991189,"Nov 29, 2021",3.0,Good salaries and Working from home,They took our money and we cant find them
empReview_20874411,"Jun 3, 2018",4.0,Because  I really enjoy the company. I was work in company as a faimly,I am happy that company is growing in Medina Hub. I start company hub in Medina with one courier and now we have a big team in company.
empReview_17494789,"Oct 23, 2017",4.0,"Career path- Supportive Management!
challenging",Work under pressure! but its worth it.
empReview_67789236,"Aug 10, 2022",2.0,Started with new fresh opportunities and seemingly high-tech.,"Ended with no professionalism with lots of manual time-consuming processes, and inconsiderate people in