# Work Performance & Productivity Analysis - Executive Report (Data Analyst Project)
This notebook performs **exploratory data analysis (EDA)** on a synthetic corporate IT employee dataset.

Tools used:
- **pandas, numpy** for data manipulation
- **matplotlib** for visualizations
## EDA and **business insights**

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# display all columns
pd.set_option('display.max_columns',None)
%matplotlib inline


1. Dataset overview

In [28]:
employee_df=pd.read_csv(r'c:\Users\dhanu\Downloads\employee_stress_productivity.csv')
print('shape of data:',employee_df.shape)
employee_df.head()

shape of data: (600, 15)


|   | EmployeeID | Age | Gender | JobRole | TenureYears | WorkHoursPerWeek | WFH_percent | SleepHours | OvertimeHours | MeetingsPerDay | BreaksPerDay | ManagerRating | StressScore | ProductivityScore | BurnoutRisk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 50 | Female | Analyst | 3.5 | 45.3 | 71.0 | 8.52 | 2.0 | 3 | 3.0 | 3.8 | 31.9 | 51.8 | Low |
| 1 | 1002 | 36 | Male | Developer | 2.0 | 49.9 | 48.6 | 6.76 | 11.0 | 5 | 2.0 | 3.8 | 65.4 | 35.2 | Medium |
| 2 | 1003 | 29 | Male | Developer | 1.3 | 41.1 | 52.0 | 8.21 | 3.0 | 2 | 1.0 | 4.1 | 38.7 | 60.5 | Low |
| 3 | 1004 | 42 | Male | Support | 2.9 | 30.8 | 59.7 | 8.12 | 2.0 | 6 | 3.0 | 4.4 | 28.8 | 55.5 | Low |
| 4 | 1005 | 40 | Male | Developer | 0.5 | 49.2 | 51.4 | 9.17 | 9.0 | 7 | 1.0 | 2.9 | 68.8 | 23.9 | Medium |


2. Data quality assessment

In [29]:
# column names
employee_df.columns

Index(['EmployeeID', 'Age', 'Gender', 'JobRole', 'TenureYears',
       'WorkHoursPerWeek', 'WFH_percent', 'SleepHours', 'OvertimeHours',
       'MeetingsPerDay', 'BreaksPerDay', 'ManagerRating', 'StressScore',
       'ProductivityScore', 'BurnoutRisk'],
      dtype='object')

In [30]:
# information about the data
employee_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   EmployeeID         600 non-null    int64  
 1   Age                600 non-null    int64  
 2   Gender             600 non-null    object 
 3   JobRole            600 non-null    object 
 4   TenureYears        600 non-null    float64
 5   WorkHoursPerWeek   600 non-null    float64
 6   WFH_percent        594 non-null    float64
 7   SleepHours         594 non-null    float64
 8   OvertimeHours      600 non-null    float64
 9   MeetingsPerDay     600 non-null    int64  
 10  BreaksPerDay       600 non-null    float64
 11  ManagerRating      594 non-null    float64
 12  StressScore        600 non-null    float64
 13  ProductivityScore  600 non-null    float64
 14  BurnoutRisk        600 non-null    object 
dtypes: float64(9), int64(3), object(3)
memory usage: 70.4+ KB
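
The `info()` output lists three `object` columns (`Gender`, `JobRole`, `BurnoutRisk`) and an `EmployeeID` key. A quick look at the distinct levels of the text columns, plus a uniqueness check on the ID, could round out this assessment. The sketch below is only illustrative: `sample_df` and its values are made-up stand-ins (including a deliberately repeated ID), not rows from the real file; in the notebook the same calls would be pointed at `employee_df`.

```python
import pandas as pd

# Hypothetical miniature stand-in for employee_df (values are illustrative only)
sample_df = pd.DataFrame({
    "EmployeeID": [1001, 1002, 1003, 1003],  # 1003 repeated on purpose
    "Gender": ["Female", "Male", "Male", "Male"],
    "JobRole": ["Analyst", "Developer", "Developer", "Developer"],
    "BurnoutRisk": ["Low", "Medium", "Low", "Low"],
})

# Text columns as reported by info() in the notebook
text_cols = ["Gender", "JobRole", "BurnoutRisk"]

# Distinct levels and their frequencies for each text column
level_counts = {col: sample_df[col].value_counts() for col in text_cols}
for col, counts in level_counts.items():
    print(f"{col}: {counts.to_dict()}")

# Number of rows whose EmployeeID repeats an earlier row
duplicate_ids = int(sample_df["EmployeeID"].duplicated().sum())
print("duplicate EmployeeID rows:", duplicate_ids)
```

Unexpected spellings of a level (for example two variants of the same job role) or a non-zero duplicate count would be worth resolving before any grouping or aggregation.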


In [31]:
# summary statistics
employee_df.describe()


|   | EmployeeID | Age | TenureYears | WorkHoursPerWeek | WFH_percent | SleepHours | OvertimeHours | MeetingsPerDay | BreaksPerDay | ManagerRating | StressScore | ProductivityScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 600.0 | 600.0 | 600.0 | 600.0 | 594.0 | 594.0 | 600.0 | 600.0 | 600.0 | 594.0 | 600.0 | 600.0 |
| mean | 1300.5 | 38.558333 | 3.845333 | 44.542167 | 40.641246 | 6.944949 | 3.311667 | 3.716667 | 2.025 | 3.719192 | 42.662833 | 49.096167 |
| std | 173.349358 | 9.87822 | 2.515902 | 6.434483 | 26.620465 | 0.990858 | 2.781623 | 1.411752 | 1.037385 | 0.735284 | 15.807614 | 16.012119 |
| min | 1001.0 | 22.0 | 0.0 | 30.0 | 0.0 | 3.73 | 0.0 | 2.0 | 0.0 | 1.2 | 0.0 | 0.0 |
| 25% | 1150.75 | 30.0 | 1.8 | 40.2 | 19.9 | 6.3 | 1.0 | 3.0 | 1.0 | 3.2 | 32.2 | 38.3 |
| 50% | 1300.5 | 39.0 | 3.55 | 44.55 | 41.1 | 6.915 | 3.0 | 3.0 | 2.0 | 3.7 | 41.6 | 48.8 |
| 75% | 1450.25 | 47.0 | 5.7 | 48.8 | 58.625 | 7.6575 | 4.0 | 5.0 | 3.0 | 4.2 | 53.3 | 60.25 |
| max | 1600.0 | 54.0 | 12.6 | 66.7 | 100.0 | 9.95 | 18.0 | 10.0 | 6.0 | 5.0 | 100.0 | 100.0 |
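
The summary shows some extremes worth a second look: `StressScore` and `ProductivityScore` span the full 0 to 100 range, `OvertimeHours` reaches 18.0, and `SleepHours` drops to 3.73. One way to turn that reading into a repeatable check is to count values outside an expected range per column. This is a minimal sketch: the bounds in `expected_ranges` are assumptions chosen for illustration, not definitions from the dataset, and `sample_df` is a made-up frame standing in for `employee_df`.

```python
import pandas as pd

# Hypothetical rows; the real notebook would use employee_df instead
sample_df = pd.DataFrame({
    "SleepHours": [8.52, 3.73, 6.76],
    "OvertimeHours": [2.0, 18.0, 11.0],
    "StressScore": [31.9, 0.0, 100.0],
})

# Assumed plausible ranges (inclusive); adjust to the business definition
expected_ranges = {
    "SleepHours": (4.0, 12.0),
    "OvertimeHours": (0.0, 15.0),
    "StressScore": (0.0, 100.0),
}

# Count non-missing values that fall outside each assumed range
out_of_range = {}
for col, (low, high) in expected_ranges.items():
    outside = sample_df[col].notna() & ~sample_df[col].between(low, high)
    out_of_range[col] = int(outside.sum())

print(out_of_range)
```

Missing values are excluded from the count on purpose, so this check stays separate from the completeness review.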


3. Completeness & data integrity review

In [32]:
# count missing values per column
employee_df.isna().sum().sort_values(ascending=False)


WFH_percent          6
ManagerRating        6
SleepHours           6
Gender               0
JobRole              0
Age                  0
EmployeeID           0
WorkHoursPerWeek     0
TenureYears          0
MeetingsPerDay       0
OvertimeHours        0
BreaksPerDay         0
StressScore          0
ProductivityScore    0
BurnoutRisk          0
dtype: int64
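
With 600 rows, 6 missing values amount to 1.0% of each affected column (`WFH_percent`, `ManagerRating`, `SleepHours`). Reporting the share next to the raw count makes the scale easier to judge. The sketch below shows one way to build such a summary; `sample_df` is a small made-up frame used only so the example runs on its own, and `missing_summary` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; in the notebook this would be employee_df
sample_df = pd.DataFrame({
    "WFH_percent": [71.0, np.nan, 52.0, 59.7],
    "SleepHours": [8.52, 6.76, 8.21, 8.12],
})

# Missing count and missing share (in percent) per column
missing_summary = pd.DataFrame({
    "missing_count": sample_df.isna().sum(),
    "missing_pct": (sample_df.isna().mean() * 100).round(2),
})

print(missing_summary.sort_values("missing_count", ascending=False))
```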

In [33]:
# create a copy for cleaning so we don't modify the original data directly
df_clean=employee_df.copy()

# numerical columns: fill missing values with the column median
num_cols=df_clean.select_dtypes(include=['int64','float64']).columns
for col in num_cols:
    if df_clean[col].isna().sum()>0:
        median_value=df_clean[col].median()
        df_clean[col]=df_clean[col].fillna(median_value)
        
# categorical columns: fill missing values with the mode (most frequent value)
cat_cols=df_clean.select_dtypes(include=['object']).columns
for col in cat_cols:
    if df_clean[col].isna().sum()>0:
        mode_value=df_clean[col].mode()[0]
        df_clean[col]=df_clean[col].fillna(mode_value)
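
The two loops above can also be written without explicit iteration, because `DataFrame.fillna` accepts a Series of per-column fill values. The following is a sketch of that alternative on a small made-up frame (`sample_df`, `filled_df`, and the values are assumptions for illustration); it is meant to give the same result as the loops, not to replace them.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in two numeric columns and one text column
sample_df = pd.DataFrame({
    "SleepHours": [8.0, np.nan, 6.0, 7.0],
    "ManagerRating": [3.8, 4.1, np.nan, 2.9],
    "JobRole": ["Analyst", "Developer", None, "Developer"],
})

filled_df = sample_df.copy()

# Numeric columns: per-column medians; fillna matches them to columns by name
medians = filled_df.median(numeric_only=True)
filled_df = filled_df.fillna(medians)

# Text columns: most frequent value per column (first one if there is a tie)
text_cols = ["JobRole"]
modes = filled_df[text_cols].mode().iloc[0]
filled_df = filled_df.fillna(modes)

print(filled_df)
```

Median imputation is a reasonable default when only about 1% of a column is missing, but it does pull the filled rows toward the center of the distribution, which is worth keeping in mind when reading later variance or correlation results.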
        


In [34]:
# check missing values again 
df_clean.isna().sum()

EmployeeID           0
Age                  0
Gender               0
JobRole              0
TenureYears          0
WorkHoursPerWeek     0
WFH_percent          0
SleepHours           0
OvertimeHours        0
MeetingsPerDay       0
BreaksPerDay         0
ManagerRating        0
StressScore          0
ProductivityScore    0
BurnoutRisk          0
dtype: int64
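
Beyond confirming that no gaps remain, it can be useful to verify that imputation left the originally observed values untouched and to record how many cells were filled. The sketch below illustrates this on a made-up single-column frame; `raw` and `clean` are hypothetical names standing in for `employee_df` and `df_clean`.

```python
import numpy as np
import pandas as pd

# Hypothetical before/after pair; in the notebook: employee_df and df_clean
raw = pd.DataFrame({"SleepHours": [8.0, np.nan, 6.0, 7.0]})
clean = raw.fillna(raw.median(numeric_only=True))

# Cells that held a real (non-missing) value before imputation
observed = raw.notna()

# Compare only the observed cells; equals() treats aligned NaNs as equal
unchanged = raw.where(observed).equals(clean.where(observed))

# Number of cells that were filled in
n_imputed = int((~observed).sum().sum())

print("observed values unchanged:", unchanged)
print("cells imputed:", n_imputed)
```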

4. Anomaly & outlier diagnostics
