### Overview


In this notebook, based on our KPIs, we test the following hypotheses:
- Is there a significant dependency between work/life balance and other workplace-related metrics (culture and values, senior management, diversity and inclusion, compensation and benefits)?
- Do higher levels of well-being metrics correlate with lower rates of burnout?
- Does a supportive work environment reduce the prevalence of stress and anxiety?
- Do employees in industries with higher stress flags have lower retention rates?

Initial dataset
- https://www.kaggle.com/datasets/davidgauthier/glassdoor-job-reviews-2

Sample: 100 000 reviews
- Current Employees: 53730
- Former Employees: 46270

Timeframe: 2015 to 2023 (the cleaning step keeps only reviews dated after 2014)

Columns: 

Well-being indicators

- Stress Flag: Indicators of job-related stress
- Burnout Flag: Signs of employee burnout
- Anxiety Flag: Mentions of anxiety in reviews
- Supportive Environment Flag: Indicators of supportive workplaces
- Toxic Flag: Signs of toxic work environments
- Depression Flag: Mentions of depressive symptoms


### Import libraries with necessary functions

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import re
import spacy
from collections import Counter
from textblob import TextBlob
from tqdm.notebook import tqdm
from scipy.stats import ttest_ind, chi2_contingency


### Load the initial dataset from Kaggle

In [2]:
reviews = pd.read_csv("../data/raw/all_reviews.csv", low_memory=False)
pd.set_option('display.max_columns', None)


### Initial cleaning

In [3]:
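# Function to clean numerical rating columns
def clean_rating_columns(df, columns):
    """
    Clean numerical rating columns by converting them to float and removing invalid entries
    """
    # Work on a copy so later assignments do not trigger SettingWithCopyWarning
    df = df.copy()
    for col in columns:
        # Entries that cannot be parsed as numbers become NaN
        df[col] = pd.to_numeric(df[col], errors='coerce').astype(float)
    # Drop rows where any rating column is missing or invalid
    df.dropna(subset=columns, inplace=True)
    return df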

# Function to handle missing values
def handle_missing_values(df, columns_to_fill, fill_value=0.0):
    """
    Fill missing values for specified columns with a given value
    """
    for col in columns_to_fill:
        df[col] = df[col].fillna(fill_value).astype(float)
    return df

# Function to extract year from date
def extract_year(df, date_column, year_column):
    """
    Extract the year from a date column and create a new year column
    """
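    # Assumes the year is the third whitespace-separated token of the date string
    df[year_column] = df[date_column].str.split().str[2].astype(int)
    # Drop the column passed in as date_column instead of a hard-coded name
    df = df.drop(columns=date_column)
    return df[df[year_column] > 2014].copy()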

# Function to extract firm names from the firm_link
def extract_firm_names(df, firm_link_column, firm_name_column):
    """
    Extract firm names from firm_link and filter firms with frequent mentions
    """
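    def extract_name(firm_link):
        # Missing or non-string links cannot be parsed
        if not isinstance(firm_link, str):
            return None
        match = re.search(r'(?<=/)([A-Za-z0-9\-]+)(?=-Reviews)', firm_link)
        return match.group(1) if match else None

    df[firm_name_column] = df[firm_link_column].apply(extract_name)
    # Drop the column passed in as firm_link_column instead of a hard-coded name
    df = df.drop(columns=firm_link_column)
    # Keep only firms with at least 100 reviews
    firm_counts = df[firm_name_column].value_counts()
    valid_firms = firm_counts[firm_counts >= 100].index

    return df[df[firm_name_column].isin(valid_firms)].copy()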

# function to unify employment statuses
def unify_employment_statuses(df, status_column):
    """
    Standardize employment statuses
    """
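    def unify_status(status):
        # Missing or non-string statuses fall into the catch-all category
        if not isinstance(status, str):
            return "Other"
        if "Current" in status: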
            if "Contractor" in status:
                return "Current Contractor"
            elif "Freelancer" in status:
                return "Current Freelancer"
            elif "Intern" in status:
                return "Current Intern"
            elif "Self-employed" in status:
                return "Current Self-employed"
            else:
                return "Current Employee"
        elif "Former" in status:
            if "Contractor" in status:
                return "Former Contractor"
            elif "Freelancer" in status:
                return "Former Freelancer"
            elif "Intern" in status:
                return "Former Intern"
            elif "PhD" in status:
                return "Former PhD"
            elif "Self-employed" in status:
                return "Former Self-employed"
            else:
                return "Former Employee"
        else:
            return "Other"

    df[status_column] = df[status_column].apply(unify_status)
    return df

# main cleaning function
def clean_dataset(file_path, rating_columns, missing_columns, firm_link_column, status_column, date_column, year_column, job_column):
    """
    Clean the dataset by applying all cleaning functions step by step
    """
    # Read the data
    df = pd.read_csv(file_path, low_memory=False)
    
    # Clean rating columns
    df = clean_rating_columns(df, rating_columns)
    
    # Drop unnecessary columns
    df.drop(columns=["title", "advice", "index", "Business Outlook"], inplace=True)
    
    # Drop rows with significant missing values
    df.dropna(subset=['pros', 'cons', 'Culture & Values'], inplace=True)

    # Lowercase the review text and replace newlines with spaces
    df['pros'] = df['pros'].str.lower().str.replace(r'\n', ' ', regex=True)
    df['cons'] = df['cons'].str.lower().str.replace(r'\n', ' ', regex=True)
    
    
    # Extract year and filter rows
    df = extract_year(df, date_column, year_column)
    
    # Handle missing values in specific columns
    df = handle_missing_values(df, missing_columns, fill_value=0.0)
    
    # Extract and filter firm names
    df = extract_firm_names(df, firm_link_column, 'firm_name')

    # Unify employment statuses
    df = unify_employment_statuses(df, status_column)
    
    return df

# Usage
file_path = "../data/raw/all_reviews.csv"
rating_columns = ['Work/Life Balance', 'Compensation and Benefits', 'Senior Management', 'Career Opportunities']
missing_columns = ['Diversity & Inclusion']
firm_link_column = 'firm_link'
status_column = 'status'
date_column = 'date'
year_column = 'year'
job_column = 'job'

cleaned_reviews = clean_dataset(
    file_path=file_path,
    rating_columns=rating_columns,
    missing_columns=missing_columns,
    firm_link_column=firm_link_column,
    status_column=status_column,
    date_column=date_column,
    year_column=year_column,
    job_column=job_column
)
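
The two parsing steps in the cleaning cell depend on the raw string formats. The sketch below runs the same regular expression and the same whitespace split on plain strings. The example link and date values are made up for illustration and are not rows from the dataset; `extract_year_token` is a hypothetical helper name.

```python
import re

# Same pattern as in extract_firm_names
FIRM_PATTERN = re.compile(r'(?<=/)([A-Za-z0-9\-]+)(?=-Reviews)')


def extract_name(firm_link):
    """Return the firm slug between '/' and '-Reviews', or None if it is absent."""
    if not isinstance(firm_link, str):
        return None
    match = FIRM_PATTERN.search(firm_link)
    return match.group(1) if match else None


def extract_year_token(date_text):
    """Return the third whitespace-separated token as an int (assumes a format like 'Jun 24, 2021')."""
    return int(date_text.split()[2])


# Hypothetical inputs, chosen only to exercise both helpers
print(extract_name("Reviews/Goldman-Sachs-Reviews-E2800.htm"))
print(extract_name("no-slash-here"))
print(extract_year_token("Jun 24, 2021"))
```

If the real date strings follow a different layout, the index `2` in the split would need to change accordingly.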



### Clean job titles

In [4]:
def clean_job_titles(df, job_column):
    """
    Clean and standardize job titles in a DataFrame

    """
    # Work on a copy so the assignments below do not trigger SettingWithCopyWarning
    df = df.copy()

    # Replace NaN values with "Anonymous Employee" and strip leading/trailing spaces
    df[job_column] = df[job_column].fillna('Anonymous Employee').str.strip()

    # Titles that were empty or whitespace-only are now '', so label them as well
    df[job_column] = df[job_column].replace('', 'Anonymous Employee')

    # Count occurrences of each job title
    job_counts = df[job_column].value_counts()

    # Filter out jobs mentioned fewer than 100 times
    valid_jobs = job_counts[job_counts >= 100].index
    df = df[df[job_column].isin(valid_jobs)].copy()
    
    # Replace specific job titles
    job_replacements = {
        'Salees Associate': 'Sales Associate', 'Sales Executive': 'Sales Associate',
        'IT Strategy Consultant': 'IT Consultant',
        'Sales Associate/Cashier': 'Cashier', 'Front End Cashier': 'Cashier',
        'Senior Software Engineer': 'Software Engineer', 
        'Senior Systems Engineer': 'Software Engineer', 
        'Senior Software Developer': 'Software Engineer', 
        'Associate Software Engineer': 'Software Engineer',
        'Sales': 'Sales Associate', 
        'Salees Associate ': 'Sales Associate', 
        'Retail Sales Associate': 'Sales Associate',
        'Sales Assistant': 'Sales Associate', 
        'Sales Representative': 'Sales Associate', 'Inside Sales Representative': 'Sales Associate',
        'Systems Engineer': 'Software Engineer', 'Senior Engineer':'Software Engineer',
        'Senior Associate': 'Associate', 
        'Senior Analyst ': 'Analyst',
        'Business Analyst': 'Analyst', 'Senior Business Analyst ': 'Analyst','Senior Business Analyst': 'Analyst', 'Financial Analyst': 'Analyst',
        'IT Analyst': 'Analyst', 'Senior Analyst': 'Analyst', 'Programmer Analyst': 'Analyst',
        'Server': 'Waiter/Waitress', 'Waiter': 'Waiter/Waitress', 'Waitress': 'Waiter/Waitress',
        'Warehouse Worker': 'Warehouse Associate',
        'Human Resources': 'HR', 
        'Recruiter': 'HR',
        'Customer Service Representative': 'Customer Service', 
        'Customer Service Representative (CSR)': 'Customer Service', 'Customer Service Specialist': 'Customer Service',
        'Customer Assistant': 'Customer Service', 
        'Customer Service Associate': 'Customer Service',
        'Executive Assistant': 'Administrative Assistant', 
        'Associate Director': 'Director',
        'Shift Supervisor': 'Supervisor',
        'Graduate Research Assistant': 'Researcher','Research Assistant': 'Researcher',
    }
    df[job_column] = df[job_column].replace(job_replacements)
    
    # Apply broader job title standardizations
    def standardize_jobs(job):
        if 'Teacher' in job:
            return 'Teacher'
        elif 'Manager' in job:
            return 'Manager'
        elif 'Consultant' in job:
            return 'Consultant'
        elif 'Teller' in job:  
            return 'Bank Teller'
        elif 'Intern' in job:
            return 'Intern'
        elif 'Pharmacy' in job:
            return 'Pharmacy'
        elif 'Team Member' in job or 'Crew Member' in job:
            return 'Team Member'
        return job
    
    df[job_column] = df[job_column].apply(standardize_jobs)
    
    return df

cleaned_reviews = clean_job_titles(cleaned_reviews, 'job') #apply the job cleaning function
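
Because `clean_job_titles` strips whitespace from the titles before applying `job_replacements`, dictionary keys that end in a space (such as `'Senior Analyst '`) can no longer match any title. The sketch below is one possible way to detect and normalize such keys. It uses a small excerpt of the dictionary, and the names `unreachable_keys` and `normalized` are introduced here for illustration only.

```python
# Small excerpt of job_replacements from the cell above
job_replacements = {
    'Salees Associate ': 'Sales Associate',
    'Senior Analyst ': 'Analyst',
    'Senior Analyst': 'Analyst',
    'Senior Business Analyst ': 'Analyst',
}

# Keys with leading or trailing whitespace cannot match a stripped title
unreachable_keys = [key for key in job_replacements if key != key.strip()]

# Stripping the keys makes every entry reachable again
normalized = {key.strip(): value for key, value in job_replacements.items()}

print(unreachable_keys)
print(normalized)
```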


### Create industries

In [5]:
# dictionary mapping companies to industries
industry_mapping = {
    'Technology': [
        'CGI', 'Amazon-Flex', 'Publicis-Sapient', 'Pegasystems', 'Tech-Data', 'Unity', 
        'Stefanini', 'Tata-Group', 'Tata-AIA-Life', 'Hi-Rez-Studios', 'Aujas', 
        'S-and-C-Electric-Company', 'LGC', 'TecOrb-Technologies', 'Tower-Semiconductor',
        'CA-Technologies', 'Sysco', 'ALSTOM', 'The-UPS-Store', 'COLSA', 'Hindustan-Times', 
        'Vista-College', 'DreamWorks-Animation', 'Hi-Rez-Studios', 'GlobalData', 'RH', 
        'HUB-International', 'Apex-Tool-Group', 'Conair', 'Emco-Corporation', 'CARQUEST', 
        'Orora', 'Hi-Rez-Studios', 'Electronic-Transaction-Consultants', 'Tower-Semiconductor',
        '3D-Systems', 'Toshiba-America-Business-Solutions', 'Jet', 'Rexel-Group', 'Portland-Communications',
        'Quinnox', 'Quikr-India', 'Tradeindia-com-Infocom-Network','Seiko-Epson', 'SKP-Group', 'REVE-Systems', 'Promact-InfoTech-Pvt'
    ],
    'Finance': [
        'Goldman-Sachs', 'Canada-Life', 'UBS', 'United-Bank-for-Africa', 'Rabobank', 
        'AXA-UK', 'Mercury-General', 'RLI', 'Ecobank', 'Intrum', 'Cadence-Financial', 'ICBC'
    ],
    'Retail': [
        'Chipotle', 'Dunkin', 'H-E-B', 'Ross-Stores', 'GameStop', 'Swarovski', 
        'Calvin-Klein', 'Gymboree', 'Men-s-Wearhouse', 'Famous-Dave-s', 'Savers-Health-and-Beauty', 
        'CARQUEST', 'Orora', 'CALVIN-KLEIN','SeaWorld-Parks-and-Entertainment', 'Lettuce-Entertain-You', 'Wayne-Sanderson-Farms', 'Marc-Glassman'
    ],
    'Healthcare': [
        'Beaumont-Health', 'Nuffield-Health', 'Texas-Health-and-Human-Services-Commission', 
        'Special-Olympics', 'Legacy-Healthcare-Services', 'Indian-Health-Service', 'Extendicare', 
        'Summa-Health-System', 'WW-WeightWatchers', 'WellSky', 'Atrium'
    ],
    'Energy': [
        'Valeo', 'Enersys', 'EnerSys','Hydro-One', 'Vallourec', 'Petrofac'
    ],
    'Media_and_Entertainment': [
        'Norwegian-Cruise-Line-Holdings-Ltd', 'Newsquest', 'Schibsted','DreamWorks-Animation', 'National-Geographic-Society'
    ],
    'Food_and_Beverage': [
        'Tyson-Foods', 'HP-Hood', 'Famous-Dave-s', 'Pernod-Ricard'
    ],
    'Aerospace_and_Defense': [
        'COLSA', 'Stefanini', 'Hi-Rez-Studios'
    ],
    'Consumer_Goods': [
        'Coty', 'GlobalData', 'Movado-Group','Milliken', 'Do-it-Best', 'BBDO', 'Savers-Health-and-Beauty', 'Purity-Cosmetics'
    ],
    'Public_Sector': [
        'New-York-County-District-Attorney', 'Army-Family-and-MWR-Programs', 'Commonwealth-of-Virginia',
        'Palm-Beach-County-School-District', 'University-of-Nevada-Las-Vegas', 'Binding-Site'
    ],
    'Transportation_and_Logistics': [
        'Alaska-Airlines', 'DHL', 'Trinity-Industries','Crowley','Group-1-Automotive', 'Meritage-Homes', 'Jet', 'Rexel-Group',  'Avianca', 'Arriva-Group'
    ],
    'Business_Services': [
        'TriNet', 'NIBCO', 'The-Select-Group','Cpl-Resources', 'Afni', 'EOS-Group', 'Barton-Malow', 'ProTransport-1', 'Portland-Communications', 'Vivo-Global'
    ],
    'Automotive': [
        'Hendrick-Automotive-Group', 'Brose', 'Wayne-Sanderson-Farms'
    ],
    'Education': [
        'Carleton-University', 'Brown-University', 'William-and-Mary', 'Valencia-College', 'Toshiba-America-Business-Solutions'
    ],
    'Telecommunication': ['M1', 'AFL']

}


# create a new column by mapping the firm names to their respective industry
def map_industry(firm_name):
    for industry, companies in industry_mapping.items():
        if firm_name in companies:
            return industry
    return 'Other'  # default category for companies not in the mapping


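# take the first 100,000 cleaned reviews; .copy() makes the sample independent of cleaned_reviews
sampled_reviews = cleaned_reviews.head(100000).copy()
# apply the function to the sampled_reviews DataFrame
sampled_reviews['industry'] = sampled_reviews['firm_name'].apply(map_industry)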
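
`map_industry` returns the first industry whose list contains the firm, so a firm listed under several industries always receives the first one. The sketch below, which uses only a small excerpt of `industry_mapping`, builds a reverse lookup with the same first-match behaviour and collects the firms that appear under more than one industry. The names `firm_to_industry` and `duplicates` are assumptions of this sketch, not part of the notebook.

```python
# Excerpt of industry_mapping; both duplicated firms below are listed twice in the notebook
industry_mapping = {
    'Technology': ['CGI', 'COLSA', 'DreamWorks-Animation'],
    'Media_and_Entertainment': ['Schibsted', 'DreamWorks-Animation'],
    'Aerospace_and_Defense': ['COLSA'],
}

firm_to_industry = {}  # firm -> first industry that lists it
duplicates = {}        # firm -> every industry that lists it, if more than one

for industry, companies in industry_mapping.items():
    for firm in companies:
        if firm not in firm_to_industry:
            firm_to_industry[firm] = industry
        elif firm_to_industry[firm] != industry:
            duplicates.setdefault(firm, [firm_to_industry[firm]]).append(industry)


def map_industry(firm_name):
    """Dictionary lookup with the same 'Other' default as the original function."""
    return firm_to_industry.get(firm_name, 'Other')


print(map_industry('COLSA'))
print(map_industry('Unknown-Firm'))
print(duplicates)
```

With a lookup dictionary like this, the column could also be filled with `sampled_reviews['firm_name'].map(firm_to_industry).fillna('Other')`, which avoids looping over every industry list for each row.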


### Assign well-being indicators

In [6]:
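# Function to flag stress, burnout, anxiety, toxicity, supportive environment, and depression
# Returns a tuple in this order: (stress, burnout, anxiety, toxic, supportive, depression)
def emotional_status(text):
    if not isinstance(text, str):  # text is None, NaN, or another non-string value
        return False, False, False, False, False, False  # return False for all flags if the text is invalid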

    text = text.lower()

    # Keywords for flags
    stress_keywords = ['stress', 'overwhelmed', 'nervous', 'pressure', 'stressful', 'deadlines', 'high demand', 'time pressure',
                       'workload', 'tension', 'deadlines pressure', 'tight schedules','high workload']
    toxic_keywords = ['toxic','negative', 'micromanagement', 'lack of support', 'hostile', 'bullying', 
                      'favoritism', 'harassment', 'isolation', 'unfair', 'discrimination']
    burnout_keywords = ['burnout', 'overtasked', 'overworked', 'burned out', 'burning out', 
                        'drained', 'mental fatigue', 'overcommitment', 'work-life imbalance','exhaustion', 'fatigue']
    anxiety_keywords = ['anxious', 'anxiety', 'panic', 'workload stress', 'job insecurity', 
                        'constant worry', 'fear', 'insecurity']
    depression_keywords= ['depressed','sad','no motivation', 'sleepless', 'unmotivated', 'headaches',
                          'hopelessness', 'lethargy', 'apathy', 'no energy', 'self-doubt', 'dissatisfaction', 'feeling worthless']
    supportive_keywords = ['supportive', 'understanding', 'encouraging', 'teamwork', 'mentorship', 'recognition',
                           'respect', 'empowerment', 'inclusion', 'constructive feedback']    

    # removing unwanted phrases
    unwanted_phrases = ['not too stressful', 'no negatives', 'no cons', 'no complaints', 'not too stressfull']
    if any(phrase in text for phrase in unwanted_phrases):
        if any(keyword in text for keyword in supportive_keywords):
            # keep the text because it includes supportive aspects
            return False, False, False, False, True, False  # supportive flag is True
        else:
            # discard the text as it is negating sentiment
            return False, False, False, False, False, False



    # check for each set of keywords
    stress_flag = any(keyword in text for keyword in stress_keywords)
    burnout_flag = any(keyword in text for keyword in burnout_keywords)
    anxiety_flag = any(keyword in text for keyword in anxiety_keywords)
    toxic_flag = any(keyword in text for keyword in toxic_keywords)
    supportive_env_flag = any(keyword in text for keyword in supportive_keywords)
    depression_flag = any(keyword in text for keyword in depression_keywords)
    
    return stress_flag, burnout_flag, anxiety_flag, toxic_flag, supportive_env_flag, depression_flag
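
`emotional_status` uses substring checks (`keyword in text`), so a short keyword such as `'sad'` also matches inside unrelated words. The sketch below contrasts that behaviour with a whole-word match built from `re` word boundaries. This is one possible alternative, not the method used in the notebook; the helper names and example sentences are made up for illustration.

```python
import re


def contains_substring(text, keywords):
    """Substring check, as used in emotional_status."""
    return any(keyword in text for keyword in keywords)


def contains_whole_word(text, keywords):
    """Match keywords only when they are delimited by word boundaries."""
    pattern = r'\b(?:' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b'
    return re.search(pattern, text) is not None


depression_keywords = ['depressed', 'sad', 'no motivation']

# 'sad' occurs inside 'ambassador', so the substring check raises a flag here
unrelated_text = "great brand ambassador program"
print(contains_substring(unrelated_text, depression_keywords))
print(contains_whole_word(unrelated_text, depression_keywords))

# Both approaches agree when the keyword appears as a separate word
related_text = "i felt sad and drained"
print(contains_substring(related_text, depression_keywords))
print(contains_whole_word(related_text, depression_keywords))
```

The trade-off is that whole-word matching no longer catches inflected forms: `'stress'` would not match "stressed", so such variants would need to be listed explicitly.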



In [7]:
# apply the emotional_status function to "pros" and "cons"
sampled_reviews[['stress_flag', 'burnout_flag', 'anxiety_flag', 'toxic_flag', 'supportive_env_flag', 'depression_flag']] = pd.DataFrame(
    sampled_reviews['pros'].apply(emotional_status).tolist(), index=sampled_reviews.index)

sampled_reviews[['stress_flag_cons', 'burnout_flag_cons', 'anxiety_flag_cons', 'toxic_flag_cons', 'supportive_env_flag_cons', 'depression_flag_cons']] = pd.DataFrame(
    sampled_reviews['cons'].apply(emotional_status).tolist(), index=sampled_reviews.index)

### Unify the column names

In [8]:
sampled_reviews.columns = sampled_reviews.columns.str.lower().str.replace(" ", "_", regex=True).str.replace("&", "and", regex=True)
sampled_reviews.reset_index(drop=True, inplace=True)
print(sampled_reviews.shape)

(100000, 28)
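
The renaming step explains why later cells refer to columns such as `work/life_balance` and `culture_and_values`. A minimal sketch of the same three string operations on plain Python strings follows; `normalize_column_name` is a hypothetical helper, and the input names are taken from the cleaning cell.

```python
def normalize_column_name(name):
    """Lowercase, replace spaces with underscores, and spell out '&' as 'and'."""
    return name.lower().replace(" ", "_").replace("&", "and")


raw_names = [
    'Work/Life Balance',
    'Culture & Values',
    'Diversity & Inclusion',
    'Compensation and Benefits',
]

for raw_name in raw_names:
    print(f"{raw_name!r} -> {normalize_column_name(raw_name)!r}")
```

Note that the slash in `Work/Life Balance` is kept, which is why the column is later accessed as `'work/life_balance'`.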


### Get the sentiment of each review's pros and cons

In [1]:
# Define a function that scores the polarity and subjectivity of a text with TextBlob,
# then apply it to the 'pros' and 'cons' columns of the sampled reviews,
# creating new columns for the sentiment and subjectivity scores


def get_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

# Apply the sentiment analysis function to 'pros' and 'cons'
sampled_reviews[['pros_sentiment', 'pros_subjectivity']] = pd.DataFrame(
    sampled_reviews['pros'].apply(get_sentiment).tolist(), index=sampled_reviews.index)
sampled_reviews[['cons_sentiment', 'cons_subjectivity']] = pd.DataFrame(
    sampled_reviews['cons'].apply(get_sentiment).tolist(), index=sampled_reviews.index)


### Additional well-being Indicators

In [11]:
# list of emotional flags (the pros flags; the cons flags use the same names plus a '_cons' suffix)
combined_flags = [
    'stress_flag', 'burnout_flag', 'anxiety_flag', 'toxic_flag', 'supportive_env_flag', 'depression_flag'
]

combined_flags_df = sampled_reviews.copy()

# combine the pros and cons flags for each emotional state
for flag in combined_flags:
    # Combine the 'pros' and 'cons' flags using the logical OR (|)
    combined_flags_df[flag] = combined_flags_df[flag] | combined_flags_df[flag + '_cons']

# drop the cons flags from the dataframe
combined_flags_df.drop(columns=[flag + '_cons' for flag in combined_flags], inplace=True)

# Check the dataframe to see if the flags have been combined correctly
display(combined_flags_df[['stress_flag', 'burnout_flag', 'anxiety_flag', 'toxic_flag', 'supportive_env_flag', 'depression_flag']].head())

| | stress_flag | burnout_flag | anxiety_flag | toxic_flag | supportive_env_flag | depression_flag |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False |


### Hypothesis testing

Is there a significant dependency between work/life balance and other workplace-related metrics (culture and values, senior management, diversity and inclusion, compensation and benefits)?

- Null Hypothesis: There is no significant difference in work/life balance scores between high and low levels of the metric (culture and values, senior management, diversity and inclusion, compensation and benefits)
- Alternative Hypothesis: There is a significant difference in work/life balance scores between high and low levels of the metric


In [12]:
# list of metrics to compare with work/life balance
metrics_to_compare = [
    'compensation_and_benefits',
    'culture_and_values',
    'senior_management',
    'diversity_and_inclusion',
    'career_opportunities',
    'rating'
]


threshold = 3.3 # threshold for splitting high vs. low groups
alpha = 0.05  # significance level


for metric in metrics_to_compare:
    # split work/life balance into high and low groups based on the current metric
    high_group = sampled_reviews[sampled_reviews[metric] > threshold]['work/life_balance']
    low_group = sampled_reviews[sampled_reviews[metric] <= threshold]['work/life_balance']

    t_stat, p_value = ttest_ind(high_group, low_group, equal_var=False) #performing t-test

    # print results
    print(f"\nMetric: {metric}")
    print(f"T-stat: {t_stat:.4f}, P-value: {p_value:.4f}")

    # conclusion
    if p_value < alpha:
        print(
            f"Reject the null hypothesis: The data provides evidence that work/life balance differs significantly based on {metric}.\n"
            f"Conclusion: Higher levels of {metric} have a statistically significant impact on work/life balance.\n"
        )
    else:
        print(
            f"Fail to reject the null hypothesis: The data does not provide sufficient evidence to suggest a significant difference in work/life balance based on {metric}.\n"
        )



Metric: compensation_and_benefits
T-stat: 129.4123, P-value: 0.0000
Reject the null hypothesis: The data provides evidence that work/life balance differs significantly based on compensation_and_benefits.
Conclusion: Higher levels of compensation_and_benefits have a statistically significant impact on work/life balance.


Metric: culture_and_values
T-stat: 196.1290, P-value: 0.0000
Reject the null hypothesis: The data provides evidence that work/life balance differs significantly based on culture_and_values.
Conclusion: Higher levels of culture_and_values have a statistically significant impact on work/life balance.


Metric: senior_management
T-stat: 194.1637, P-value: 0.0000
Reject the null hypothesis: The data provides evidence that work/life balance differs significantly based on senior_management.
Conclusion: Higher levels of senior_management have a statistically significant impact on work/life balance.


Metric: diversity_and_inclusion
T-stat: 89.4152, P-value: 0.0000
Reject the

### Key findings

- Higher compensation and benefits ratings are associated with higher work/life balance ratings (T-stat = 129.4123), which suggests that competitive pay and benefits may ease work-related stress and help employees manage their professional and personal responsibilities

- Positive perceptions of culture and values, senior management, career opportunities, and diversity and inclusion are statistically linked to better work/life balance. These factors point to the role of organizational support, leadership quality, and inclusivity in fostering a balanced work environment

- Higher overall workplace ratings are significantly associated with better work/life balance, which indicates that employee satisfaction across multiple dimensions correlates with a healthier equilibrium between work and personal life
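
With about 100,000 reviews, even small differences in means can reach statistical significance, so an effect size helps to judge how large each difference is. The sketch below computes Cohen's d with a pooled standard deviation using only the standard library. The choice of Cohen's d is an assumption of this sketch rather than part of the original analysis, and the two rating lists are made up.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(high_group, low_group):
    """Standardized mean difference between two groups, using the pooled sample variance."""
    n_high, n_low = len(high_group), len(low_group)
    pooled_var = (
        (n_high - 1) * variance(high_group) + (n_low - 1) * variance(low_group)
    ) / (n_high + n_low - 2)
    return (mean(high_group) - mean(low_group)) / sqrt(pooled_var)


# Hypothetical work/life balance ratings for the high and low groups of one metric
high_group = [4, 5, 4, 5]
low_group = [2, 3, 2, 3]

d = cohens_d(high_group, low_group)
print(f"Cohen's d: {d:.4f}")
```

In the notebook, the same function could be called with the `high_group` and `low_group` Series inside the loop, after converting them with `.tolist()`.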

### Burnout x well-being metrics


Do higher levels of well-being metrics correlate with lower rates of burnout?
- Null Hypothesis (H0): There is no significant difference in burnout rates between employees who rate a metric (for example, compensation and benefits) high and those who rate it low
- Alternative Hypothesis (H1): Employees who rate the metric high experience significantly lower burnout rates

In [13]:
#calculating total flags related to potential burnout
sampled_reviews['total_burnout_flags'] = (
    sampled_reviews['burnout_flag'] + sampled_reviews['burnout_flag_cons'] 
)

#threshold for what we consider high burnout
sampled_reviews['burnout_group'] = (sampled_reviews['total_burnout_flags'] > 0).astype(int)

# list of metrics to compare with burnout group
metrics_to_compare = [
    'compensation_and_benefits',
    'culture_and_values',
    'senior_management',
    'diversity_and_inclusion',
    'work/life_balance',
    'rating'
]


threshold = 3.3 # threshold for splitting high vs. low groups
alpha = 0.05  # significance level


for metric in metrics_to_compare:
    # split burnout_group into high and low groups based on the current metric
    high_group = sampled_reviews[sampled_reviews[metric] > threshold]['burnout_group']
    low_group = sampled_reviews[sampled_reviews[metric] <= threshold]['burnout_group']

    t_stat, p_value = ttest_ind(high_group, low_group, equal_var=False) #performing t-test

    # print results
    print(f"\nMetric: {metric}")
    print(f"T-stat: {t_stat:.4f}, P-value: {p_value:.4f}")

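    # conclusion
    if p_value < alpha:
        # a negative t-statistic means the high-rating group has the lower mean burnout rate
        direction = "lower" if t_stat < 0 else "higher"
        print(
            f"Reject the null hypothesis: {metric} has a significant impact on burnout.\n"
            f"Conclusion: Higher levels of {metric} are statistically associated with {direction} burnout.\n"
        )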
    else:
        print(
            f"Fail to reject the null hypothesis: {metric} does not significantly affect burnout.\n"
        )





Metric: compensation_and_benefits
T-stat: -14.0567, P-value: 0.0000
Reject the null hypothesis: compensation_and_benefits has a significant impact on burnout.
Conclusion: Higher levels of compensation_and_benefits are statistically associated with lower burnout.


Metric: culture_and_values
T-stat: -16.7341, P-value: 0.0000
Reject the null hypothesis: culture_and_values has a significant impact on burnout.
Conclusion: Higher levels of culture_and_values are statistically associated with lower burnout.


Metric: senior_management
T-stat: -17.9235, P-value: 0.0000
Reject the null hypothesis: senior_management has a significant impact on burnout.
Conclusion: Higher levels of senior_management are statistically associated with lower burnout.


Metric: diversity_and_inclusion
T-stat: -2.8987, P-value: 0.0037
Reject the null hypothesis: diversity_and_inclusion has a significant impact on burnout.
Conclusion: Higher levels of diversity_and_inclusion are statistically associated with lower bu


### Key findings
- Among all metrics, work/life balance showed the strongest statistical association with lower burnout levels, which points to its role in limiting employee exhaustion and emotional fatigue
- Senior management (T-stat = -17.9235), culture and values (T-stat = -16.7341), and rating also show strong associations with lower burnout levels: as these ratings increase, burnout flags become less frequent
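
Since `burnout_group` is a 0/1 indicator, the t-test above effectively compares two proportions. A two-proportion z-test (or `chi2_contingency` on the 2x2 table) is a common alternative for this kind of outcome. The sketch below implements the z-test with the standard library only; it is offered as an assumption-labelled alternative rather than the notebook's method, and the counts are made up.

```python
from math import erfc, sqrt


def two_proportion_z_test(flagged_high, n_high, flagged_low, n_low):
    """Two-sided z-test for the difference between two proportions."""
    rate_high = flagged_high / n_high
    rate_low = flagged_low / n_low
    # Pooled proportion under the null hypothesis of equal rates
    pooled = (flagged_high + flagged_low) / (n_high + n_low)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_high + 1 / n_low))
    z = (rate_high - rate_low) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts: burnout-flagged reviews in the high and low rating groups
z, p_value = two_proportion_z_test(flagged_high=30, n_high=1000, flagged_low=60, n_low=1000)
print(f"z = {z:.4f}, p-value = {p_value:.4f}")
```

A negative `z` has the same reading as a negative t-statistic in the cell above: the high-rating group has the lower share of flagged reviews.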

### Check the stress status only

In [14]:
#calculating total flags related to stress levels
sampled_reviews['total_stress_flags'] = (
    sampled_reviews['stress_flag'] + sampled_reviews['stress_flag_cons'] 
)

#threshold for what we consider high stress
sampled_reviews['stress_group'] = (sampled_reviews['total_stress_flags'] > 0).astype(int)

# list of metrics to compare with stressed group
metrics_to_compare = [
    'compensation_and_benefits',
    'culture_and_values',
    'career_opportunities',
    'senior_management',
    'diversity_and_inclusion',
    'work/life_balance',
    'rating'
]


threshold = 3.3 # threshold for splitting high vs. low groups
alpha = 0.05  # significance level


for metric in metrics_to_compare:
    # split stress_group into high and low groups based on the current metric
    high_group = sampled_reviews[sampled_reviews[metric] > threshold]['stress_group']
    low_group = sampled_reviews[sampled_reviews[metric] <= threshold]['stress_group']

    t_stat, p_value = ttest_ind(high_group, low_group, equal_var=False) #performing t-test

    # print results
    print(f"\nMetric: {metric}")
    print(f"T-stat: {t_stat:.4f}, P-value: {p_value:.4f}")

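    # conclusion
    if p_value < alpha:
        # a negative t-statistic means the high-rating group has the lower mean stress rate
        direction = "lower" if t_stat < 0 else "higher"
        print(
            f"Reject the null hypothesis: {metric} has a significant impact on stress.\n"
            f"Conclusion: Higher levels of {metric} are statistically associated with {direction} stress level.\n"
        )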
    else:
        print(
            f"Fail to reject the null hypothesis: {metric} does not significantly affect stress level.\n"
        )






Metric: compensation_and_benefits
T-stat: -4.0176, P-value: 0.0001
Reject the null hypothesis: compensation_and_benefits has a significant impact on stress.
Conclusion: Higher levels of compensation_and_benefits are statistically associated with lower stress level.


Metric: culture_and_values
T-stat: -8.3542, P-value: 0.0000
Reject the null hypothesis: culture_and_values has a significant impact on stress.
Conclusion: Higher levels of culture_and_values are statistically associated with lower stress level.


Metric: career_opportunities
T-stat: -0.8983, P-value: 0.3690
Fail to reject the null hypothesis: career_opportunities does not significantly affect stress level.


Metric: senior_management
T-stat: -9.0965, P-value: 0.0000
Reject the null hypothesis: senior_management has a significant impact on stress.
Conclusion: Higher levels of senior_management are statistically associated with lower stress level.


Metric: diversity_and_inclusion
T-stat: -2.4684, P-value: 0.0136
Reject the

### Key Findings

- Work/life balance has the strongest statistical association with lower stress levels: it has the largest T-statistic in absolute value (-23.3003) and a p-value below 0.0001
- Overall rating (T-stat = -11.6522), senior management (T-stat = -9.0965), and culture and values (T-stat = -8.3542) are also strongly associated with lower stress levels, which points to the importance of organizational leadership and cultural alignment
- Compensation and benefits (T-stat = -4.0176) and diversity and inclusion (T-stat = -2.4684, p-value = 0.0136) show weaker but still significant associations
- Career opportunities shows no statistically significant association with stress level (T-stat = -0.8983, p-value = 0.3690)
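
A T-statistic grows with sample size, so it helps to report the raw difference in the share of stressed reviews next to it, and to derive the direction of the effect from the data instead of hard-coding it in the printed conclusion. The sketch below is one way to do that. The helper name `compare_stress_by_metric`, the toy data, and the threshold of 3 are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd
from scipy.stats import ttest_ind


def compare_stress_by_metric(df, metric, threshold, outcome="stress_group"):
    """Welch's t-test on a binary outcome plus the difference in flagged share."""
    high = df.loc[df[metric] > threshold, outcome]
    low = df.loc[df[metric] <= threshold, outcome]
    t_stat, p_value = ttest_ind(high, low, equal_var=False)
    diff = high.mean() - low.mean()  # negative: less stress in the high group
    return {
        "metric": metric,
        "share_high": high.mean(),
        "share_low": low.mean(),
        "diff": diff,
        "direction": "lower" if diff < 0 else "higher",
        "t_stat": t_stat,
        "p_value": p_value,
    }


# Toy data (hypothetical, not the Glassdoor sample)
toy = pd.DataFrame({
    "work/life_balance": [1, 2, 2, 3, 3, 4, 4, 5, 5, 5],
    "stress_group":      [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

result = compare_stress_by_metric(toy, "work/life_balance", threshold=3)
print(result)
```

The `direction` field can then drive the printed conclusion, so a positive T-statistic would not be reported as "lower stress".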


### Supportive environment x stress

Does a supportive work environment reduce the prevalence of stress and anxiety?
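- Null Hypothesis (H0): There is no relationship between a supportive environment and the prevalence of stress or burnout flags
- Alternative Hypothesis (H1): Reviews that describe a supportive environment are significantly less likely to carry stress or burnout flags

Note: the cell below tests `burnout_flag` and `stress_flag`; `anxiety_flag` is not included in the test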

In [15]:
# define the flags
flags = ['burnout_flag', 'stress_flag']

# initialize a results dictionary
results = {}

for flag in flags:
    # create a contingency table
    contingency_table = pd.crosstab(combined_flags_df['supportive_env_flag'], combined_flags_df[flag])
    
    # perform Chi-Square Test
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    
    # Store results
    results[flag] = {'Chi2 Statistic': chi2, 'P-value': p_value}

    # Print results
    print(f"\nFlag: {flag}")
    print(f"Contingency Table:\n{contingency_table}")
    print(f"Chi2 Statistic: {chi2:.4f}, P-value: {p_value:.4f}")
    
    # Conclusion
    if p_value < 0.05:
        print(f"Reject the null hypothesis: There is a relationship between a supportive environment and {flag}.")
    else:
        print(f"Fail to reject the null hypothesis: No significant relationship between a supportive environment and {flag}.")



Flag: burnout_flag
Contingency Table:
burnout_flag         False  True 
supportive_env_flag              
False                93321    872
True                  5720     87
Chi2 Statistic: 18.2727, P-value: 0.0000
Reject the null hypothesis: There is a relationship between a supportive environment and burnout_flag.

Flag: stress_flag
Contingency Table:
stress_flag          False  True 
supportive_env_flag              
False                87895   6298
True                  5317    490
Chi2 Statistic: 26.2538, P-value: 0.0000
Reject the null hypothesis: There is a relationship between a supportive environment and stress_flag.


### Key insights

The chi-square tests show a statistically significant association between the supportive environment flag and both well-being flags:

- Burnout: Chi2 = 18.2727, p-value < 0.0001
- Stress: Chi2 = 26.2538, p-value < 0.0001

The test does not indicate direction. In the contingency tables, reviews with the supportive environment flag carry the burnout flag more often (87 of 5,807, about 1.50%) than reviews without it (872 of 94,193, about 0.93%), and the same holds for the stress flag (490 of 5,807, about 8.44%, versus 6,298 of 94,193, about 6.69%). These results therefore do not support the claim that a supportive environment reduces burnout or stress; they show only that the flags are not independent of each other in this sample
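
A chi-square statistic does not say which group has the higher flag rate, so row-wise rates are a useful companion to the test. The sketch below recomputes them from the counts printed in the contingency tables above. The helper name `flag_rates` and the row and column labels are illustrative and not part of the notebook.

```python
import pandas as pd


def flag_rates(table: pd.DataFrame) -> pd.Series:
    """Share of flagged reviews in each row of a 2x2 table of counts."""
    return table["flagged"] / table.sum(axis=1)


# Counts copied from the contingency tables in the cell output above
burnout = pd.DataFrame(
    {"not_flagged": [93321, 5720], "flagged": [872, 87]},
    index=["not_supportive", "supportive"],
)
stress = pd.DataFrame(
    {"not_flagged": [87895, 5317], "flagged": [6298, 490]},
    index=["not_supportive", "supportive"],
)

burnout_rates = flag_rates(burnout)
stress_rates = flag_rates(stress)

print(burnout_rates.round(4))
print(stress_rates.round(4))
```

In the notebook itself, `pd.crosstab(combined_flags_df['supportive_env_flag'], combined_flags_df[flag], normalize='index')` should give the same rates directly.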

### Well-being metrics x Retention

Do employees in industries with higher stress flags have lower retention rates?

- Null Hypothesis (H0): Industries with a higher share of well-being flags (stress, burnout, depression, anxiety) do not show lower retention rates
- Alternative Hypothesis (H1): Industries with a higher share of well-being flags are associated with lower retention rates


In [16]:
combined_flags_df["status"].value_counts()  # check the current/former employee split so the retention rate is not biased

status
Current Employee    53730
Former Employee     46270
Name: count, dtype: int64

In [17]:
# aggregate data by industry, considering multiple well-being flags
industry_group = combined_flags_df.groupby('industry').agg(
    # count the stress, burnout, depression, and anxiety flags
    stress_flag_count=('stress_flag', 'sum'),
    burnout_flag_count=('burnout_flag', 'sum'),
    depression_flag_count=('depression_flag', 'sum'),
    anxiety_flag_count=('anxiety_flag', 'sum'),
    
    # total reviews and reviews from current employees (each row is one review)
    total_employees=('status', 'count'),  # number of reviews, used as the employee count
    current_employees=('status', lambda x: (x == 'Current Employee').sum())
).reset_index()

# retention rate proxy: share of reviews written by current employees
industry_group['retention_rate'] = industry_group['current_employees'] / industry_group['total_employees']

# calculate the combined well-being flag count (sum of all flags)
industry_group['combined_flag_count'] = (industry_group['stress_flag_count'] +
                                          industry_group['burnout_flag_count'] +
                                          industry_group['depression_flag_count'] +
                                          industry_group['anxiety_flag_count'])

# calculate the combined flag percentage (as a percentage of total employees)
industry_group['combined_flag_percentage'] = industry_group['combined_flag_count'] / industry_group['total_employees']

# get the percentage of each individual flag (stress, burnout, depression, anxiety)
industry_group['stress_flag_percentage'] = industry_group['stress_flag_count'] / industry_group['total_employees']
industry_group['burnout_flag_percentage'] = industry_group['burnout_flag_count'] / industry_group['total_employees']
industry_group['depression_flag_percentage'] = industry_group['depression_flag_count'] / industry_group['total_employees']
industry_group['anxiety_flag_percentage'] = industry_group['anxiety_flag_count'] / industry_group['total_employees']



industry_group
#industry_group.to_csv("../data/clean/industry_retention.csv")




Unnamed: 0,industry,stress_flag_count,burnout_flag_count,depression_flag_count,anxiety_flag_count,total_employees,current_employees,retention_rate,combined_flag_count,combined_flag_percentage,stress_flag_percentage,burnout_flag_percentage,depression_flag_percentage,anxiety_flag_percentage
0,Automotive,40,2,2,3,835,515,0.616766,47,0.056287,0.047904,0.002395,0.002395,0.003593
1,Business_Services,142,12,12,26,2399,1305,0.543977,192,0.080033,0.059191,0.005002,0.005002,0.010838
2,Consumer_Goods,143,16,22,20,1901,931,0.489742,201,0.105734,0.075224,0.008417,0.011573,0.010521
3,Education,50,7,8,2,1118,677,0.605546,67,0.059928,0.044723,0.006261,0.007156,0.001789
4,Energy,507,10,18,7,4127,2635,0.638478,542,0.13133,0.12285,0.002423,0.004362,0.001696
5,Finance,1214,110,45,47,15180,9165,0.603755,1416,0.093281,0.079974,0.007246,0.002964,0.003096
6,Food_and_Beverage,66,11,8,16,1827,985,0.539135,101,0.055282,0.036125,0.006021,0.004379,0.008758
7,Healthcare,173,52,26,17,3115,1751,0.562119,268,0.086035,0.055538,0.016693,0.008347,0.005457
8,Media_and_Entertainment,54,8,7,12,924,417,0.451299,81,0.087662,0.058442,0.008658,0.007576,0.012987
9,Other,36,3,3,5,608,371,0.610197,47,0.077303,0.059211,0.004934,0.004934,0.008224


In [18]:
# calculate correlation matrix between the flags and retention rate
correlation_matrix = industry_group[['stress_flag_percentage', 'burnout_flag_percentage', 
                                     'depression_flag_percentage', 'anxiety_flag_percentage', 
                                     'combined_flag_percentage', 'retention_rate']].corr()

# Display the correlation matrix
print(correlation_matrix)



                            stress_flag_percentage  burnout_flag_percentage  \
stress_flag_percentage                    1.000000                -0.038986   
burnout_flag_percentage                  -0.038986                 1.000000   
depression_flag_percentage               -0.105803                 0.327358   
anxiety_flag_percentage                  -0.280347                 0.004458   
combined_flag_percentage                  0.934269                 0.239350   
retention_rate                            0.052813                -0.501159   

                            depression_flag_percentage  \
stress_flag_percentage                       -0.105803   
burnout_flag_percentage                       0.327358   
depression_flag_percentage                    1.000000   
anxiety_flag_percentage                       0.387693   
combined_flag_percentage                      0.153528   
retention_rate                               -0.255183   

                            anxiety_fla

### Key findings 
Burnout flag and retention (correlation: -0.5012)
- Industries with a higher burnout flag percentage tend to have lower retention; this moderate negative correlation is consistent with the alternative hypothesis

Anxiety flag and retention (correlation: -0.5889)
- A moderate negative correlation, the strongest of the four flags: higher anxiety flag percentages are associated with lower retention rates, which is also consistent with H1

Stress and depression flags
- The stress flag percentage, the metric named in the question, is essentially uncorrelated with retention (0.0528), and the depression flag shows only a weak negative correlation (-0.2552)

These are correlations across industries, so they describe association rather than impact. With only one data point per industry (15 in this sample), they should be read as indicative rather than conclusive

Well-being and workplace policies: given the negative correlations for burnout and anxiety, policies that focus on mental health support, stress management, and well-being initiatives are a reasonable area for organizations to explore when trying to retain employees
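
The correlations above come without a significance test, unlike the earlier hypotheses. A minimal sketch of how a two-sided p-value can be derived from a Pearson correlation and the number of data points is shown below. The value `n_industries = 15` is an assumption based on the number of industries listed in the sample, and the helper name `pearson_p_value` is illustrative.

```python
import math

from scipy.stats import t as student_t


def pearson_p_value(r: float, n: int) -> float:
    """Two-sided p-value for a Pearson correlation r computed from n pairs."""
    dof = n - 2
    t_stat = r * math.sqrt(dof / (1.0 - r * r))
    return 2.0 * student_t.sf(abs(t_stat), dof)


n_industries = 15  # assumption: one row per industry in industry_group

# Correlations with retention_rate, copied from the key findings above
correlations = {"burnout": -0.5012, "anxiety": -0.5889}

p_values = {name: pearson_p_value(r, n_industries) for name, r in correlations.items()}
for name, p in p_values.items():
    print(f"{name}: r = {correlations[name]:.4f}, p = {p:.3f}")
```

Under the 15-industry assumption, the anxiety correlation falls below the 0.05 level used elsewhere in the notebook, while the burnout correlation lands slightly above it. In the notebook itself, `scipy.stats.pearsonr` applied to the two columns of `industry_group` returns the correlation and its p-value in one call.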

### Correlation between CEO approval/recommendation and well-being metrics

In [19]:
values_replacement = {"v": "2",   # Positive
                      "r": "1",   # Mild
                      "x": "-1",  # Negative
                      "o": "0"}   # No opinion

sampled_reviews["ceo_approval"] = sampled_reviews["ceo_approval"].replace(values_replacement)
sampled_reviews["recommend"] = sampled_reviews["recommend"].replace(values_replacement)

print(sampled_reviews['ceo_approval'].isna().sum())
print(sampled_reviews['recommend'].isna().sum())

sampled_reviews['ceo_approval'] = sampled_reviews['ceo_approval'].astype(int)
sampled_reviews['recommend'] = sampled_reviews['recommend'].astype(int)


0
0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sampled_reviews["ceo_approval"] = sampled_reviews["ceo_approval"].replace(values_replacement)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sampled_reviews["recommend"] = sampled_reviews["recommend"].replace(values_replacement)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sampled_reviews['ceo_ap

In [20]:
correlation_matrix = sampled_reviews[['ceo_approval', 'work/life_balance', 'stress_flag', 'burnout_flag']].corr()

print(correlation_matrix)

                   ceo_approval  work/life_balance  stress_flag  burnout_flag
ceo_approval           1.000000           0.331498     0.007540     -0.009560
work/life_balance      0.331498           1.000000     0.036927     -0.016641
stress_flag            0.007540           0.036927     1.000000      0.004427
burnout_flag          -0.009560          -0.016641     0.004427      1.000000


In [22]:
# Grouping by 'industry' and 'year' to calculate mean for each column
industry_trends = sampled_reviews.groupby(['industry', 'year']).agg({
    'pros_sentiment': 'mean',
    'cons_sentiment': 'mean',
    'rating': 'mean',
    'work/life_balance': 'mean',
    'compensation_and_benefits': 'mean',
    'culture_and_values': 'mean',
    'senior_management': 'mean',
    'diversity_and_inclusion': 'mean',
}).reset_index()

industry_trends = industry_trends.round(2)
# Display the resulting aggregated data
display(industry_trends)

Unnamed: 0,industry,year,pros_sentiment,cons_sentiment,rating,work/life_balance,compensation_and_benefits,culture_and_values,senior_management,diversity_and_inclusion
0,Automotive,2015,0.48,0.01,2.91,2.18,3.13,3.04,2.36,0.00
1,Automotive,2016,0.40,0.01,3.29,2.84,3.27,3.33,2.83,0.00
2,Automotive,2017,0.40,0.03,3.32,2.85,3.41,3.45,3.02,0.00
3,Automotive,2018,0.37,0.00,3.08,2.72,3.42,3.08,2.78,0.00
4,Automotive,2019,0.38,0.00,3.60,3.13,3.49,3.58,3.11,0.00
...,...,...,...,...,...,...,...,...,...,...
130,Transportation_and_Logistics,2019,0.36,0.02,3.50,3.24,3.48,3.43,3.09,0.00
131,Transportation_and_Logistics,2020,0.39,0.01,3.79,3.51,3.70,3.76,3.41,1.13
132,Transportation_and_Logistics,2021,0.41,0.01,3.77,3.30,3.64,3.61,3.33,3.70
133,Transportation_and_Logistics,2022,0.38,-0.00,3.64,3.26,3.50,3.55,3.26,3.80
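
The 0.00 means for `diversity_and_inclusion` before 2020 look like an artifact of missing ratings having been filled with 0.0 during cleaning, not genuine scores. This is an assumption, since the fill step is outside this section. If it holds, one option is to treat zeros as missing before averaging, as in the sketch below, which uses hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): zeros stand in for ratings that were never given
toy = pd.DataFrame({
    "industry": ["A", "A", "A", "A"],
    "year": [2019, 2019, 2020, 2020],
    "diversity_and_inclusion": [0.0, 0.0, 0.0, 4.0],
})

# Treat 0.0 as missing so it is skipped by the mean instead of dragging it down
masked = toy.assign(
    diversity_and_inclusion=toy["diversity_and_inclusion"].replace(0.0, np.nan)
)

trend = masked.groupby(["industry", "year"])["diversity_and_inclusion"].mean()
print(trend)
```

Years in which no review has a rating then show up as NaN instead of 0.00, and partially covered years such as 2020 are averaged only over the reviews that have a rating.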


In [24]:
#Save for further analysis
#industry_trends.to_csv("../data/clean/industry_trends.csv")

In [25]:
# list of emotional flags (base names; the matching cons flags carry a '_cons' suffix)
combined_flags = [
    'stress_flag', 'burnout_flag', 'anxiety_flag', 'toxic_flag', 'supportive_env_flag', 'depression_flag'
]

combined_flags_df = sampled_reviews.copy()

# combine the pros and cons flags for each emotional state
for flag in combined_flags:
    # Combine the 'pros' and 'cons' flags using the logical OR (|)
    combined_flags_df[flag] = combined_flags_df[flag] | combined_flags_df[flag + '_cons']

# drop the cons flags from the dataframe
for flag in combined_flags:
    combined_flags_df.drop(columns=[flag + '_cons'], inplace=True)

# Check the dataframe to see if the flags have been combined correctly
display(combined_flags_df[['stress_flag', 'burnout_flag', 'anxiety_flag', 'toxic_flag', 'supportive_env_flag', 'depression_flag']].head())

Unnamed: 0,stress_flag,burnout_flag,anxiety_flag,toxic_flag,supportive_env_flag,depression_flag
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [28]:
# Group by 'industry' and 'year', summing the flag columns
industry_year_flag_counts = combined_flags_df.groupby(['industry', 'year'])[
    ['stress_flag', 'burnout_flag', 'anxiety_flag', 'toxic_flag', 
     'supportive_env_flag', 'depression_flag']
].sum()

# Create a 'total_flags' column by summing across all emotional flag columns
industry_year_flag_counts['total_flags'] = industry_year_flag_counts.sum(axis=1)

# Ensure total_flags is of numeric type
industry_year_flag_counts['total_flags'] = pd.to_numeric(industry_year_flag_counts['total_flags'], errors='coerce')

# Reset the index to bring 'industry' and 'year' back as columns
industry_year_flag_counts = industry_year_flag_counts.reset_index()

# Sort by 'industry' in descending (reverse alphabetical) order
industry_year_flag_counts_sorted = industry_year_flag_counts.sort_values(by='industry', ascending=False)

# Display the sorted result
display(industry_year_flag_counts_sorted)



Unnamed: 0,industry,year,stress_flag,burnout_flag,anxiety_flag,toxic_flag,supportive_env_flag,depression_flag,total_flags
134,Transportation_and_Logistics,2023,7,1,0,12,6,0,26
133,Transportation_and_Logistics,2022,24,9,0,31,30,6,100
132,Transportation_and_Logistics,2021,26,5,1,26,21,0,79
131,Transportation_and_Logistics,2020,17,2,2,19,18,2,60
130,Transportation_and_Logistics,2019,10,3,4,25,13,2,57
...,...,...,...,...,...,...,...,...,...
1,Automotive,2016,1,0,0,5,4,1,11
6,Automotive,2021,14,0,0,13,21,1,49
7,Automotive,2022,9,1,0,9,9,0,28
8,Automotive,2023,2,0,0,1,1,0,4


In [41]:
retail_reviews = sampled_reviews[sampled_reviews["industry"] == "Retail"]
retail_reviews["rating"].size

33549

In [42]:
sampled_reviews["industry"].value_counts()

industry
Retail                          33549
Technology                      30310
Finance                         15180
Energy                           4127
Healthcare                       3115
Transportation_and_Logistics     2809
Business_Services                2399
Consumer_Goods                   1901
Food_and_Beverage                1827
Education                        1118
Public_Sector                     958
Media_and_Entertainment           924
Automotive                        835
Other                             608
Telecommunication                 340
Name: count, dtype: int64