## Exploring Web-scraped Ratings Data

This notebook is an exploratory analysis of employee reviews of Illinois hospitals scraped from Glassdoor. The data were collected with a scraper forked from [sericson0](https://github.com/sericson0), which generated one .csv file per hospital.

In [2]:
## concatenate all review files into one variable

import pandas as pd
import glob, os

files = glob.glob('/home/tjd/glassdoor_scraper/Output CSV/*.csv') #grabs all files from the directory
#print (files) ##use this to print for a sanity check

# concatenate files and store each source file name (minus its extension) in a Hosp_Name column
# os.path.splitext strips only the extension, so a file name that contains a period is not cut short
review_df = pd.concat([pd.read_csv(fp).assign(Hosp_Name=os.path.splitext(os.path.basename(fp))[0]) for fp in files])

review_df.head()

review_df.shape


Unnamed: 0,full_text,date,employee_title,location,employee_status,review_title,rating_overall,rating_balance,rating_culture,rating_career,rating_comp,rating_mgmt,Hosp_Name
0,"April 6, 2018\nHelpful (1)\n""Avoid hospital""\n...",Fri Apr 06 2018 04:09:02 GMT-0400 (Eastern Day...,Registered Nurse,"Elgin, IL",Former Employee,Avoid hospital,1.0,2.0,1.0,2.0,2.0,1.0,Saint Joseph Hospital Elgin - AMITA
1,"October 10, 2017\n""Registered nurse""\nCurrent ...",Tue Oct 10 2017 13:29:53 GMT-0400 (Eastern Day...,Registered Nurse,"Elgin, IL",Current Employee,Registered nurse,1.0,3.0,1.0,2.0,2.0,1.0,Saint Joseph Hospital Elgin - AMITA
2,"January 21, 2017\nHelpful (2)\n""Registered Nur...",Sat Jan 21 2017 22:07:59 GMT-0500 (Eastern Sta...,Registered Nurse,"Elgin, IL",Current Employee,Registered Nurse,3.0,3.0,2.0,3.0,3.0,2.0,Saint Joseph Hospital Elgin - AMITA
3,"May 19, 2016\nHelpful (2)\n""Community Programs...",Thu May 19 2016 07:42:06 GMT-0400 (Eastern Day...,,"Elgin, IL","in Elgin, IL",Community Programs Coordinator,3.0,,,,,,Saint Joseph Hospital Elgin - AMITA
4,"September 3, 2015\n""Registered Nurse""\nRegiste...",Thu Sep 03 2015 10:28:33 GMT-0400 (Eastern Day...,,"Elgin, IL","Registered Nurse in Elgin, IL",Registered Nurse,3.0,3.0,4.0,1.0,1.0,1.0,Saint Joseph Hospital Elgin - AMITA
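The loading step above can also be written as a small reusable function. This is only a sketch under stated assumptions: `load_reviews` is a hypothetical helper name, and the two file names and ratings in the demonstration are made up rather than taken from the scraped data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_reviews(csv_dir):
    """Read every CSV in csv_dir and tag each row with its source file name."""
    frames = []
    for path in sorted(Path(csv_dir).glob("*.csv")):
        frame = pd.read_csv(path)
        # Path.stem drops only the final extension, so periods in the name survive
        frame["Hosp_Name"] = path.stem
        frames.append(frame)
    # ignore_index gives the combined frame one continuous 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Toy demonstration with two made-up files (not real scraped data)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"rating_overall": [4.0, 2.0]}).to_csv(
        Path(tmp) / "St. Example Hospital.csv", index=False
    )
    pd.DataFrame({"rating_overall": [5.0]}).to_csv(
        Path(tmp) / "Demo Medical Center.csv", index=False
    )
    demo = load_reviews(tmp)

print(demo)
```

Using `ignore_index=True` avoids the repeated 0, 1, 2, ... index values that a plain `pd.concat` keeps from each file, which can matter later if rows are selected by index label.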


First we look at missing data and drop any rows that have missing ratings, then calculate the median overall rating to get a baseline. The number of reviews varied by hospital, and all ratings were on a 5-star scale.

In [3]:
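#review_df.isna().sum() ##assess the amount of missing data per column (run before dropping rows)
#review_df.info()

# dropna() with no arguments drops a row if ANY column is missing, not only the rating columns
review_df = review_df.dropna()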

review_df['rating_overall'].median()

4.0
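To make the missing-data step concrete, here is a minimal sketch on made-up toy data. It contrasts `dropna()` with no arguments, as used in the cell above, with `dropna(subset=...)`, which checks only the rating columns. The `subset` variant is an alternative for illustration, not what this notebook ran, and the column selection in `rating_cols` is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one row lacks a job title, one lacks a rating
toy = pd.DataFrame({
    "employee_title": ["Registered Nurse", None, "Technician"],
    "rating_overall": [4.0, 3.0, np.nan],
    "rating_mgmt": [5.0, 2.0, 1.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# dropna() with no arguments removes a row if ANY column is missing
dropped_any = toy.dropna()

# subset= restricts the check to the listed rating columns only
rating_cols = ["rating_overall", "rating_mgmt"]
dropped_ratings_only = toy.dropna(subset=rating_cols)

print(missing_per_column)
print(len(dropped_any), len(dropped_ratings_only))
```

On the toy frame the two approaches keep different numbers of rows, because the review with a missing job title still has complete ratings.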

To start, I'll use the different rating categories, which capture employee sentiments about work-life balance, workplace culture, career advancement, compensation and benefits, and management style. If needed, we can come back to this point later and use the review text for NLP and sentiment analysis.

In [4]:
#pull out the columns of interest for plotting a figure to go in the app

ratings = review_df[['Hosp_Name','rating_overall', 'rating_balance', 'rating_culture', 'rating_career', 'rating_comp', 'rating_mgmt']]

Now, we can reshape the data to make it easier to plot densities and create interactive plots for the web app. 

In [8]:
#change wide to long for plotting densities of each rating type

ratings_long = pd.melt(ratings,id_vars=['Hosp_Name'],var_name='metrics', value_name='values')
ratings_long.head(15)

Unnamed: 0,Hosp_Name,metrics,values
0,Saint Joseph Hospital Elgin - AMITA,rating_overall,1.0
1,Saint Joseph Hospital Elgin - AMITA,rating_overall,1.0
2,Saint Joseph Hospital Elgin - AMITA,rating_overall,3.0
3,Saint Joseph Hospital Elgin - AMITA,rating_overall,5.0
4,Saint Joseph Hospital Elgin - AMITA,rating_overall,1.0
5,Saint Joseph Hospital Elgin - AMITA,rating_overall,5.0
6,Saint Joseph Hospital Elgin - AMITA,rating_overall,1.0
7,Saint Joseph Hospital Elgin - AMITA,rating_overall,1.0
8,Saint Joseph Hospital Elgin - AMITA,rating_overall,3.0
9,Saint Joseph Hospital Elgin - AMITA,rating_overall,1.0
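As a small illustration of the wide-to-long reshape, the sketch below applies the same `pd.melt` call to a made-up two-hospital frame. The hospital names and ratings are invented for the example.

```python
import pandas as pd

# Toy wide-format data (made up): one row per review, one column per rating type
wide = pd.DataFrame({
    "Hosp_Name": ["Hospital A", "Hospital B"],
    "rating_overall": [4.0, 2.0],
    "rating_mgmt": [3.0, 1.0],
})

# Every rating column becomes (metrics, values) pairs; Hosp_Name repeats per pair
long = pd.melt(wide, id_vars=["Hosp_Name"], var_name="metrics", value_name="values")

print(long)
```

One caveat with the column name `values`: it has to be accessed as `long["values"]`, because `long.values` is the DataFrame attribute that returns the underlying array.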


In [10]:
ratings_long = ratings_long.dropna()
ratings_long.info()
ratings_long.to_csv('RatingsData.csv') # write this dataframe to a csv to upload in the web app folder

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14844 entries, 0 to 14843
Data columns (total 3 columns):
Hosp_Name    14844 non-null object
metrics      14844 non-null object
values       14844 non-null float64
dtypes: float64(1), object(2)
memory usage: 463.9+ KB


Here we work out the code for density plots of the ratings data, which can then be reused in the web app script. I used Altair, but Bokeh, Plotly, or ggplot2 would also work.

In [12]:
import altair as alt

ratings_long = ratings_long.dropna()
alt.data_transformers.disable_max_rows()

matrix = alt.Chart(ratings_long,
    width=120,
    height=80
).transform_filter(
    'isValid(datum.values)'
).transform_density(
    'values',
    groupby=['metrics'],
    as_=['values', 'density'],
    extent=[0.5,5.5],
).mark_area().encode(
    x='values:Q',
    y='density:Q',
).facet(
    'metrics',
    columns=3
)
matrix

In [13]:
chart2 = alt.LayerChart(ratings_long).configure_axis(
    labelFontSize=16,
    titleFontSize=16
).configure_title(fontSize=20)

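base2 = alt.Chart(
    ratings_long, 
    width=400, 
    height=300, 
    title="Work-life balance rating"
).transform_filter(
    # keep only the work-life balance ratings so the plot matches its title
    "datum.metrics == 'rating_balance'"
).transform_density(
    'values',
    as_=['values', 'density'],
    extent=[0.5, 5.5],
).mark_area().encode(    
    x='values:Q',
    y='density:Q',
)
chart2+base2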

In [None]:
#write the full dataframe to csv

review_df.to_csv("ReviewData.csv")

In [16]:
#get a better sense of the average of each rating type by hospital

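# select the rating columns with a list (double brackets); tuple-style selection on a groupby fails in recent pandas
EmpRatings = review_df.groupby(['Hosp_Name'])[['rating_overall', 'rating_balance', 'rating_culture', 'rating_career', 'rating_comp', 'rating_mgmt']].mean()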
EmpRatings.head()

Hosp_Name,rating_overall,rating_balance,rating_culture,rating_career,rating_comp,rating_mgmt
Adventist Medical Center Glen Oaks - AMITA,4.2,3.8,4.2,4.0,4.2,4.2
Adventist Medical Center Hinsdale - AMITA,3.625,3.25,3.875,3.375,3.25,2.75
Advocate BroMenn Medical Center,3.5,4.0,3.5,3.333333,3.166667,3.333333
Advocate Christ Medical Center - Oak Lawn,1.0,1.0,1.0,2.0,1.0,1.0
Advocate Good Samaritan - Downers Grove,4.0,3.0,4.0,3.0,3.0,5.0
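Because the number of reviews varies by hospital, a mean based on a single review can look extreme. The sketch below shows one way to report the review count next to the mean. The toy data, the `min_reviews` threshold, and the column names `mean_rating` and `n_reviews` are all assumptions made for illustration.

```python
import pandas as pd

# Toy data (made up): Hospital A has three reviews, Hospital B only one
toy = pd.DataFrame({
    "Hosp_Name": ["Hospital A", "Hospital A", "Hospital A", "Hospital B"],
    "rating_overall": [4.0, 5.0, 3.0, 1.0],
})

# Named aggregation: mean rating and number of non-missing ratings per hospital
summary = toy.groupby("Hosp_Name")["rating_overall"].agg(
    mean_rating="mean",
    n_reviews="count",
)

# Hypothetical cutoff: flag means backed by fewer reviews as unreliable
min_reviews = 3
reliable = summary[summary["n_reviews"] >= min_reviews]

print(summary)
print(reliable)
```

Keeping `n_reviews` in the exported summary would let the web app show or filter on sample size instead of presenting every hospital mean as equally trustworthy.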


In [None]:
#write ratings to a csv file and look through the ratings to make sure the locations look correct 
#validate the quality of results and fill in any missing values from Indeed 5-star ratings

EmpRatings.to_csv("EmployeeRatings.csv")

Perfect, we have code chunks that can be used to build pieces of an interactive web app made with Streamlit and deployed on the web with Heroku. We also have some summarized ratings data frames, saved as .csv files, from the web scraping results!