**Employer Review About Their Organisation**


**Context**

Every organization has its pros and cons, and employees often feel these should be made public so that people considering joining can base their decision on reviews from those who have worked there.


**Content**

This is textual data in the form of a JSON file. It has more than 145k records. Each record has attributes such as:

- Review Title
- Review Body
- Review Rating
- Reviewed Company
- Review description

**Acknowledgements**

All thanks to indeed.com for making this data public and easily available.


**Tasks**

Your task is to predict the rating, or whether the review is positive or negative, based on the review text. Also analyze the text to get insights and trends for individual companies/organizations.

The model in this notebook was trained on a random 45% sample of the dataset on Google Colab. If you have the resources, please go further with the full dataset using this notebook.

**Github Repository:**

# Download dataset from Kaggle

In [None]:
! pip install -q kaggle
from google.colab import files
files.upload()
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets list
! kaggle datasets download -d muhammedabdulazeem/employer-review-about-their-organization
! unzip employer-review-about-their-organization.zip
from google.colab import drive
drive.mount('/content/drive/')

# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os
from datetime import datetime
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE, RandomOverSampler

from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
import math
import pickle

import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')

! pip install pandasql
from pandasql import sqldf

# Load Dataset

In [None]:
# Open the file in a context manager so it is closed after loading
with open('/content/results.json') as f:
  data = json.load(f)
df = pd.DataFrame(data)
df.head()

In [None]:
df.info()

In [None]:
# print() returns None, so wrapping it in display() would also show a stray "None"
print('--------')
display(df.ReviewTitle.unique())
print('Unique values: ', len(df.ReviewTitle.unique()))

print('--------')
display(df.URL.unique())
print('Unique values: ', len(df.URL.unique()))

print('--------')
display(df.ReviewDetails.unique())
print('Unique values: ', len(df.ReviewDetails.unique()))

**As we can see, the company name is mentioned in the URL, and ReviewDetails contains the datetime, employee type and location.**



**We will try to extract them as much as possible.**

# Extracting and Preprocessing

**Extracting Company names**

In [None]:
# Extracting Company name from URL
df['Company'] = df.URL.str.split('/').str[4]

In [None]:
# Checking unique values in Company
display(df.Company.unique())
print('Unique values: ', len(df.Company.unique()))
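Indexing the split URL by position works as long as every URL has the same shape. A slightly more defensive variant is sketched here. The `company_from_url` helper, the example URL, and the assumption that the company slug follows a `/cmp/` path segment are all illustrative and are not taken from the dataset.

```python
from urllib.parse import urlparse


def company_from_url(url):
    """Return the path segment after 'cmp', or None if the URL has another shape."""
    # Assumption: URLs look like https://<host>/cmp/<Company>/reviews...
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "cmp" in segments:
        idx = segments.index("cmp")
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return None


# Made-up URL used only to exercise the helper
example = "https://www.indeed.co.in/cmp/Example-Corp/reviews?start=20"
print(company_from_url(example))
print(company_from_url("https://www.indeed.co.in/"))
```

On the real frame this could be applied with `df['URL'].apply(company_from_url)`, which leaves `None` for rows that do not match instead of picking up an unrelated segment.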

**Extracting Datetime**

In [None]:
# Extracting Date & Time from ReviewDetails
df['Timestamp'] = df['ReviewDetails'].str.split('-', expand=True)[2]

In [None]:
# Splitting Year, Month, Day
df['Year'] = df['Timestamp'].str.split(',', expand=True)[1]
df['Month'] = df['Timestamp'].str.split(',', expand=True)[0].str.split(' ', expand=True)[2]
df['Day'] = df['Timestamp'].str.split(',', expand=True)[0].str.split(' ', expand=True)[3]

In [None]:
# Checking missing values
display(df['Day'].isnull().sum())
display(df['Month'].isnull().sum())
display(df['Year'].isnull().sum())

In [None]:
# Dropping missing values
df = df.dropna()

In [None]:
# Removing unnecessary blank spaces
df['Year'] = df['Year'].str.replace(' ', '')

In [None]:
# Merging and adding new Datetime column
temp_date = df[['Day', 'Month', 'Year']]
df['Timestamp'] = pd.to_datetime(temp_date.astype(str).agg('-'.join, axis=1), format='%d-%B-%Y')

In [None]:
# Dropping unnecessary columns
df.drop(['Year', 'Month', 'Day'], axis=1, inplace=True)

In [None]:
df.head()

**Extracting the type of employee who submitted the review**

In [None]:
# Extract EmployeeType from ReviewDetails
df['EmployeeType'] = df['ReviewDetails'].str.split('-', expand=True)[0]

In [None]:
# Checking Unique values
display(df.EmployeeType.unique())
print('Unique values: ', len(df.EmployeeType.unique()))

In [None]:
def get_employee_type(value):
  return 'Current Employee' if 'Current' in value else 'Former Employee'

In [None]:
# Overwrite EmployeeType with the normalized label (Current / Former Employee)
df['EmployeeType'] = df['EmployeeType'].apply(get_employee_type)

In [None]:
df.head()

**Extracting Location from Review Details**

In [None]:
df['Location'] = df['ReviewDetails'].str.split('-', expand=True)[1]

In [None]:
display(df.Location.unique())
print('Unique values: ', len(df.Location.unique()))
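The three extractions above (employee type, location, date) each split `ReviewDetails` on `-` separately. The sketch below does it in one pass and splits from the right, so a hyphen inside the job title does not shift the fields. The sample strings are made up, and the `" - "` separator and the `Month day, year` date layout are assumptions about the raw format.

```python
import pandas as pd

# Made-up examples in the assumed "<role> - <location> - <date>" layout
sample = pd.DataFrame({"ReviewDetails": [
    "Software Engineer (Current Employee) - Bangalore - September 14, 2019",
    "Team Lead - Ops (Former Employee) - Pune, Maharashtra - March 3, 2018",
]})

# Split from the right so only the last two separators are used
parts = sample["ReviewDetails"].str.rsplit(" - ", n=2, expand=True)
parts.columns = ["Role", "Location", "Date"]

# Unparseable dates become NaT instead of raising
parts["Timestamp"] = pd.to_datetime(
    parts["Date"].str.strip(), format="%B %d, %Y", errors="coerce"
)
parts["EmployeeType"] = parts["Role"].str.contains("Current").map(
    {True: "Current Employee", False: "Former Employee"}
)
print(parts[["EmployeeType", "Location", "Timestamp"]])
```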

**Merging ReviewTitle + CompleteReview, since ignoring the title would waste useful text.**

In [None]:
df['Review'] = df['ReviewTitle'] + ' ' + df['CompleteReview']

In [None]:
# Drop unnecessary columns
df.drop(['ReviewTitle', 'CompleteReview', 'URL', 'ReviewDetails'], axis=1, inplace=True)

In [None]:
df.Location = df.Location.str.strip()
df.Location = df.Location.str.lower()

In [None]:
sqldf("""select Location, count(*) from df group by Location order by count(*) desc limit 10""")

In [None]:
# Use .loc to avoid chained assignment, which may silently modify a copy
df.loc[df['Location'] == '', 'Location'] = 'Unknown'

In [None]:
df.Location.unique()

**Locations are entered manually, so they are inconsistent and unreliable. We may not be able to use this column until we clean it manually.**
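As a rough idea of what a first cleaning pass could look like, the sketch below keeps only the part before the first comma and maps a few spelling variants to one name. The `normalize_location` helper and the `ALIASES` table are hypothetical; a real table would have to be built by inspecting the actual values.

```python
# Hypothetical alias table; extend it after looking at the real data
ALIASES = {"bengaluru": "bangalore", "bombay": "mumbai", "gurugram": "gurgaon"}


def normalize_location(raw):
    """Lowercase, keep the city part before the first comma, apply aliases."""
    if not isinstance(raw, str) or not raw.strip():
        return "Unknown"
    city = raw.split(",")[0].strip().lower()
    return ALIASES.get(city, city)


for value in ["  Bengaluru, Karnataka ", "Mumbai", "", None]:
    print(repr(value), "->", normalize_location(value))
```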

# EDA

In [None]:
df.describe()

In [None]:
plt.figure(figsize = (15,8))
sns.countplot(x ='EmployeeType', data = df)
plt.xticks(rotation=90)

EmployeeType seems fairly balanced

In [None]:
sns.histplot(df['Rating'])

Here we can see that the dataset is imbalanced, which can affect model performance.
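One way to quantify the imbalance is to compute per-class weights that are inversely proportional to class frequency, using `n_samples / (n_classes * class_count)`. This is only a sketch on made-up ratings. Note that this notebook handles imbalance by oversampling, not by class weights.

```python
import pandas as pd

# Made-up ratings: six 5-star, three 4-star and one 1-star review
ratings = pd.Series([5] * 6 + [4] * 3 + [1] * 1)

counts = ratings.value_counts()
# weight = n_samples / (n_classes * class_count): rarer classes get larger weights
weights = len(ratings) / (counts.size * counts)
class_weight = weights.to_dict()
print(class_weight)
```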

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(x='Rating', hue='EmployeeType',data=df,palette='viridis')

In [None]:
plt.figure(figsize = (20,8))
sns.countplot(x ='Company', data = df)
plt.xticks(rotation=90)

A large number of reviews belong to TCS, IBM, Accenture, Infosys and HDFC.


**Rating Distribution for Top 10 Companies (Review Count)**

In [None]:
dftop_10 = sqldf("""select Company, count(*) from df group by Company order by count(*) desc limit 10""")
dftop_10 = sqldf("""select * from df where Company in (select Company from dftop_10)""")

In [None]:
plt.figure(figsize = (30,8))
sns.countplot(x="Company", hue="Rating", data=dftop_10)
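The same top-N selection can be written in plain pandas without the SQL round trip. The sketch below uses a made-up frame and a top 2, not the notebook's top 10, to keep the example small.

```python
import pandas as pd

# Made-up data standing in for df
toy = pd.DataFrame({
    "Company": ["A", "A", "A", "B", "B", "C"],
    "Rating": [5, 4, 5, 3, 4, 1],
})

# value_counts() sorts by frequency, so head(n) gives the n most reviewed companies
top_companies = toy["Company"].value_counts().head(2).index
toy_top = toy[toy["Company"].isin(top_companies)]
print(toy_top)
```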

**Rating Distribution for Bottom 10 Companies (Review Count)**

In [None]:
dfbot_10 = sqldf("""select Company, count(*) from df group by Company order by count(*) asc limit 10""")
dfbot_10 = sqldf("""select * from df where Company in (select Company from dfbot_10)""")
dfbot_10.Rating = pd.to_numeric(dfbot_10.Rating)

In [None]:
plt.figure(figsize = (30,8))
sns.countplot(x="Company", hue="Rating", data=dfbot_10)

**Timestamp-Rating Analysis**

In [None]:
df_g1 = df.copy()  # copy so the helper Year column is not added to df itself
df_g1['Year'] = df_g1.Timestamp.astype(str).str[:4]
df_g1 = sqldf("""select Company, Rating, Year, count(*) as count from df_g1 group by Company, Rating, Year""")

for company_name in df.Company.unique():
  # One figure per company (calling plt.figure twice would leave an empty figure behind)
  plt.figure(figsize = (10,8))
  sns.lineplot(x="Year", y="count", hue="Rating", data=df_g1[df_g1['Company'] == company_name]).set_title(company_name)

**For some reason there's a spike in reviews, especially for 5-, 4- and 3-star ratings, during the period 2017-2019.**

**After 2017-2018, 4- and 5-star reviews started to decline for all companies.**
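To check trends like these numerically instead of reading them off the plots, the per-year counts can be tabulated with a `groupby`. This is a sketch on made-up rows; the column names follow the notebook.

```python
import pandas as pd

# Made-up reviews standing in for df
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2017-05-01", "2018-01-10", "2018-07-22", "2018-03-03", "2019-09-14"]
    ),
    "Rating": [5, 5, 5, 3, 4],
})

# Rows are years, columns are ratings, cells are review counts
year = toy["Timestamp"].dt.year.rename("Year")
counts = toy.groupby([year, "Rating"]).size().unstack(fill_value=0)
print(counts)
```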

# Cleaning

In [None]:
# Dropping unnecessary columns
df.drop(['Company', 'EmployeeType', 'Timestamp'], axis=1, inplace=True)

In [None]:
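# Converting 5 ratings into 3 classes

# 0    Positive  (4-5 Stars)
# 1    Neutral   (2-3 Stars)
# 2    Negative  (1 Star)

# Values match the class numbering above; assigning back avoids in-place replace on a column view
df['Rating'] = df['Rating'].replace({'1.0': 2, '2.0': 1, '3.0': 1, '4.0': 0, '5.0': 0})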

In [None]:
# Converting Rating field from float to int
df.Rating = df.Rating.astype(float).astype(int)
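The same bucketing can also be expressed on numeric ratings with `pd.cut`, which makes the star ranges explicit. This is a sketch on made-up values, following the class numbering in the comments above (0 positive, 1 neutral, 2 negative).

```python
import pandas as pd

# Made-up ratings stored as strings, as in the raw data
ratings = pd.Series(["1.0", "2.0", "3.0", "4.0", "5.0"])

# Bins are right-inclusive: (0, 1] -> 2, (1, 3] -> 1, (3, 5] -> 0
classes = pd.cut(
    ratings.astype(float), bins=[0, 1, 3, 5], labels=[2, 1, 0]
).astype(int)
print(classes.tolist())
```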

# Selecting Sample

In [None]:
df = df.sample(frac=0.45)

# Over Sampling

In [None]:
ros = RandomOverSampler()
X_sample, y_sample = ros.fit_resample(df[['Review']], df['Rating'])
df = pd.concat([pd.DataFrame(X_sample), pd.DataFrame(y_sample)], axis=1)

df.columns = ['Review', 'Rating']
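Oversampling before the train/test split lets copies of the same review land on both sides of the split, which tends to inflate the test accuracy. A possible alternative is sketched below: hold out the test rows first, then oversample only the training rows. It uses plain pandas on made-up data instead of `RandomOverSampler`, and the `oversample` helper is hypothetical.

```python
import pandas as pd

# Made-up, imbalanced data standing in for df
toy = pd.DataFrame({
    "Review": [f"review {i}" for i in range(16)],
    "Rating": [0] * 8 + [1] * 4 + [2] * 4,
})

# 1) Hold out 25% of each class as the test set before any resampling
test = toy.groupby("Rating").sample(frac=0.25, random_state=0)
train = toy.drop(test.index)


def oversample(frame, label_col, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    target = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col):
        parts.append(group)
        if len(group) < target:
            extra = group.sample(n=target - len(group), replace=True, random_state=seed)
            parts.append(extra)
    return pd.concat(parts, ignore_index=True)


# 2) Balance the training rows only; the test set keeps its natural distribution
train_balanced = oversample(train, "Rating")
print(train_balanced["Rating"].value_counts())
```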

# Build, Train, Test


In [None]:
vocab_size = 10000
sentence_length = 50
X = df.drop('Rating',axis=1)
y = df['Rating']
sentences = X.copy()

In [None]:
# Resetting so that we don't get error during extracting process 
sentences.reset_index(inplace=True)

In [None]:
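def generate_corpus(sentences):
  ps = PorterStemmer()
  # Build the stopword set once; reloading the list for every word is very slow
  stop_words = set(stopwords.words('english'))
  corpus = []
  for text in sentences['Review']:
    # Keep letters only, lowercase, and tokenize on whitespace
    review = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # Drop stopwords and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
  return corpus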

In [None]:
def convert_ohe(corpus):
  return [one_hot(words,vocab_size) for words in corpus] 

In [None]:
def add_padding(onehot_repr):
  return pad_sequences(onehot_repr,padding='pre',maxlen=sentence_length)

In [None]:
# Generating corpus
corpus = generate_corpus(sentences)

In [None]:
# Converting each word to a hashed integer index (Keras one_hot)
onehot_repr = convert_ohe(corpus)

In [None]:
# Padding/truncating every sequence to sentence_length (the Embedding layer learns the word vectors later)
embedded_docs = add_padding(onehot_repr)
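Despite its name, Keras `one_hot` does not build one-hot vectors. It hashes each word to an integer index. As far as I know it uses Python's built-in `hash` by default, which is salted per process for strings, so the indices may not match between sessions when a saved model is reloaded. The sketch below shows the two steps without TensorFlow: a stable MD5-based index and left padding with front truncation. `stable_indices` and `pad_pre` are hypothetical helpers meant to mirror the behavior, not the library code.

```python
import hashlib

import numpy as np


def stable_indices(text, vocab_size):
    """Map each whitespace token to an index in [1, vocab_size - 1] via MD5."""
    return [
        int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % (vocab_size - 1) + 1
        for tok in text.split()
    ]


def pad_pre(sequences, maxlen):
    """Left-pad with zeros; sequences longer than maxlen keep their last maxlen items."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]
        if kept:
            out[row, -len(kept):] = kept
    return out


encoded = [stable_indices(t, 10000) for t in ["good salari good", "bad cultur"]]
print(pad_pre(encoded, 5))
```

Index 0 is never produced by the hashing step, so it stays free for padding. Distinct words can still collide on the same index, which is the usual trade-off of hashing into a fixed vocabulary size.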

In [None]:
y = pd.get_dummies(df["Rating"])

In [None]:
# LSTM Model
embedding_vector_features=80
model=Sequential()
model.add(Embedding(vocab_size,embedding_vector_features,input_length=sentence_length))
model.add(Dropout(0.3))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(3,activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
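As a sanity check on the model size, the trainable parameter counts can be worked out by hand from the layer shapes above and compared with what `model.summary()` reports. The arithmetic assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias.

```python
vocab_size = 10000
embedding_dim = 80
lstm_units = 100
num_classes = 3

# Embedding: one vector of size embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates x (input kernel + recurrent kernel + bias)
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense softmax: weights plus one bias per class; Dropout has no parameters
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the embedding table, so `vocab_size` and the embedding width are the main levers if the model has to shrink.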

In [None]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=42)

In [None]:
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=50,batch_size=64)

In [None]:
# Saving model for future use
# model.save('model.h5')

# Evaluation Metrics

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(np.argmax(y_test,axis=1),np.argmax(y_pred,axis=1))

In [None]:
from sklearn.metrics import accuracy_score
print('Model Accuracy', accuracy_score(np.argmax(y_test,axis=1),np.argmax(y_pred,axis=1)))
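Accuracy alone can hide weak performance on a single class, so it may help to derive per-class precision and recall from the confusion matrix. The matrix below is made up for illustration. Rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows: true class, columns: predicted class)
cm = np.array([
    [50, 8, 2],
    [10, 30, 10],
    [5, 5, 40],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # per predicted class (column sums)
recall = true_positives / cm.sum(axis=1)     # per true class (row sums)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for label, p, r, f in zip(["Positive", "Neutral", "Negative"], precision, recall, f1):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```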

# Making Prediction

In [None]:
data = pd.DataFrame([{'Review': 'Great place to work. Nice work culture'}, {'Review': 'Very less salary. bad work culture'}, {'Review': 'No fix working hours. Good salary hike'}])

In [None]:
corpus = generate_corpus(data)
onehot_repr = convert_ohe(corpus)
embedded_docs = add_padding(onehot_repr)
y_pred = model.predict(embedded_docs)

In [None]:
predict_text = { 0:'Positive', 1:'Neutral', 2:'Negative'}

In [None]:
pd.concat([data, pd.DataFrame(np.argmax(y_pred,axis=1), columns=['Prediction']).replace(predict_text)], axis=1)

# Extras

In [None]:
# model.save('model.h5')

# from keras.models import load_model
# model = load_model('model.h5')

In [None]:
#Saving corpus for future use

# with open('corpus.pkl', 'wb') as f:
#   pickle.dump(corpus, f)

# with open('corpus.pkl', 'rb') as f:
#   corpus = pickle.load(f)