PHASE 1: DATA LOADING & INITIAL OVERVIEW

EXECUTIVE SUMMARY

Data loading and initial overview is used to import the dataset and understand its basic structure, size, and variables. This phase helps identify data types, missing values, and overall data composition, providing a foundation for further cleaning and analysis.

PHASE 1: DATA LOADING & INITIAL OVERVIEW – SUMMARY

This phase focuses on loading the HR dataset and understanding its structure and basic characteristics.

Using:

- Data loading techniques

- Dataset shape and dimension checks

- Data type inspection

- Initial data preview (head & tail)

- Basic descriptive statistics

this phase provides a high-level understanding of the dataset and identifies potential data quality issues to guide further cleaning and analysis.

STEP 1: IMPORT NECESSARY LIBRARIES

Objective: Load essential Python libraries required for data handling and inspection.

In [1]:
import pandas as pd

In [2]:
import numpy as np


STEP 2: LOAD THE DATASET

Objective: Import the dataset into the Python environment.

In [3]:
data = pd.read_csv("IBM_HR_Data.csv", low_memory=False)

STEP 3: DATA TYPE INSPECTION

Objective: Understand the data types and identify potential issues such as missing values or numeric fields stored as text.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23530 entries, 0 to 23529
Data columns (total 37 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       23527 non-null  float64
 1   Attrition                 23517 non-null  object 
 2   BusinessTravel            23522 non-null  object 
 3   DailyRate                 23519 non-null  float64
 4   Department                23519 non-null  object 
 5   DistanceFromHome          23521 non-null  float64
 6   Education                 23518 non-null  float64
 7   EducationField            23521 non-null  object 
 8   EmployeeCount             23525 non-null  float64
 9   EmployeeNumber            23530 non-null  object 
 10  Application ID            23527 non-null  object 
 11  EnvironmentSatisfaction   23521 non-null  float64
 12  Gender                    23520 non-null  object 
 13  HourlyRate                23521 non-null  float64
 14  JobInv

Insight:

- Reveals column data types
- Highlights missing values
- Identifies columns requiring type conversion
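
To make the "columns requiring type conversion" point concrete, the sketch below measures how much of each non-numeric column would parse as a number. It runs on a small made-up frame, not the HR data, and the helper name numeric_parse_rate is hypothetical.

```python
import pandas as pd

# Toy frame standing in for the HR data (values are made up for illustration).
toy = pd.DataFrame({
    "EmployeeNumber": ["1", "2", "x3", "4"],
    "Department": ["Sales", "Sales", "HR", "HR"],
})

def numeric_parse_rate(frame: pd.DataFrame) -> pd.Series:
    """Share of non-null values in each non-numeric column that parse as numbers."""
    rates = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue  # already numeric, nothing to check
        non_null = frame[col].dropna()
        parsed = pd.to_numeric(non_null, errors="coerce")  # unparseable -> NaN
        rates[col] = parsed.notna().mean() if len(non_null) else 0.0
    return pd.Series(rates, dtype="float64")

rates = numeric_parse_rate(toy)
print(rates)
```

A rate close to 1.0 suggests a numeric column polluted by a few bad entries, while a rate of 0.0 suggests a genuinely categorical column.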

STEP 4: DATASET DIMENSIONS

Objective: Identify the size of the dataset.

In [5]:
print("shape:",data.shape)

shape: (23530, 37)


Insight:

- Displays the total number of rows and columns
- Helps assess whether the dataset meets minimum project requirements

STEP 5: PREVIEW DATA RECORDS

Objective: Examine sample records to understand variable content.

In [6]:
display(data.head(5))

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Employee Source
0,41.0,Voluntary Resignation,Travel_Rarely,1102.0,Sales,1.0,2.0,Life Sciences,1.0,1,...,80.0,0.0,8.0,0.0,1.0,6.0,4.0,0.0,5.0,Referral
1,37.0,Voluntary Resignation,Travel_Rarely,807.0,Human Resources,6.0,4.0,Human Resources,1.0,1,...,80.0,0.0,8.0,0.0,1.0,6.0,4.0,0.0,5.0,Referral
2,41.0,Voluntary Resignation,Travel_Rarely,1102.0,Sales,1.0,2.0,Life Sciences,1.0,1,...,80.0,0.0,8.0,0.0,1.0,6.0,4.0,0.0,5.0,Referral
3,37.0,Voluntary Resignation,Travel_Rarely,807.0,Human Resources,6.0,4.0,Marketing,1.0,4,...,80.0,0.0,8.0,0.0,1.0,6.0,4.0,0.0,5.0,Referral
4,37.0,Voluntary Resignation,Travel_Rarely,807.0,Human Resources,6.0,4.0,Human Resources,1.0,5,...,80.0,0.0,8.0,0.0,1.0,6.0,4.0,0.0,5.0,Referral


Insight:

- Confirms correct data loading
- Helps spot obvious anomalies or formatting issues

STEP 6: INITIAL STATISTICAL OVERVIEW

Objective: Generate basic descriptive statistics.

In [7]:
display(data.describe(include='all'))

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Employee Source
count,23527.0,23517,23522,23519.0,23519,23521.0,23518.0,23521,23525.0,23530.0,...,23520.0,23521.0,23522.0,23519.0,23520.0,23517.0,23515.0,23519.0,23523.0,23518
unique,,3,3,,3,,,7,,23462.0,...,,,,,,,,,,10
top,,Current employee,Travel_Rarely,,Research & Development,,,Life Sciences,,23244.0,...,,,,,,,,,,Company Website
freq,,19712,16700,,15349,,,9725,,7.0,...,,,,,,,,,,5428
mean,36.914354,,,802.168375,,9.193019,2.910962,,1.0,,...,80.0,0.791548,11.2632,2.796973,2.761437,7.006081,4.225048,2.179642,4.122136,
std,9.130563,,,403.198769,,8.098043,1.023755,,0.0,,...,0.0,0.850294,7.785116,1.289328,0.705991,6.13242,3.624251,3.213205,3.57317,
min,18.0,,,102.0,,1.0,1.0,,1.0,,...,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,
25%,30.0,,,465.0,,2.0,2.0,,1.0,,...,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0,
50%,36.0,,,802.0,,7.0,3.0,,1.0,,...,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0,
75%,43.0,,,1157.0,,14.0,4.0,,1.0,,...,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0,


Insight:

- Summarizes numerical and categorical variables
- Identifies outliers, ranges, and distributions
- Highlights missing or unusual values

PHASE 2: DATA CLEANING & PRE-PROCESSING

Data cleaning and preprocessing is used to improve data quality by handling missing values, correcting inconsistencies, removing duplicates, and validating logical relationships. This process ensures the dataset is accurate, consistent, and ready for reliable exploratory data analysis and modeling.

SUMMARY

This phase prepares the dataset for analysis by improving data quality and consistency.

Using:

- Missing value detection and treatment
- Duplicate record checks
- Data type correction
- Logical relationship validation
- Data consistency and integrity checks
- Outlier and anomaly identification
- Feature standardization and formatting

this phase ensures the dataset is accurate, consistent, and analysis-ready for exploratory data analysis and visualization.

STEP 1: MISSING VALUE ANALYSIS

Objective: Identify and handle missing or null values to avoid biased analysis.

Actions Performed:

- Checked missing values across all columns
- Identified numerical and categorical columns with missing data
- Evaluated appropriate handling strategies based on column type

In [8]:
data.isnull().sum()

Age                          3
Attrition                   13
BusinessTravel               8
DailyRate                   11
Department                  11
DistanceFromHome             9
Education                   12
EducationField               9
EmployeeCount                5
EmployeeNumber               0
Application ID               3
EnvironmentSatisfaction      9
Gender                      10
HourlyRate                   9
JobInvolvement               9
JobLevel                     7
JobRole                      9
JobSatisfaction              9
MaritalStatus               11
MonthlyIncome               14
MonthlyRate                 11
NumCompaniesWorked           9
Over18                      10
OverTime                    12
PercentSalaryHike           14
PerformanceRating           10
RelationshipSatisfaction     8
StandardHours               10
StockOptionLevel             9
TotalWorkingYears            8
TrainingTimesLastYear       11
WorkLifeBalance             10
YearsAtC
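
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up frame (not the HR data) that reports both, keeping only the columns that actually have gaps.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [41.0, np.nan, 37.0, 29.0],
    "Department": ["Sales", None, None, "Sales"],
    "JobLevel": [1.0, 2.0, 3.0, 4.0],
})

missing = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
})
# Keep only columns with at least one missing value, worst first.
missing = missing[missing["missing_count"] > 0].sort_values(
    "missing_count", ascending=False
)
print(missing)
```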

STEP 2: DUPLICATE RECORD CHECK

Objective: Ensure each employee record is unique.

Actions Performed:

- Checked for duplicate rows in the dataset
- Removed duplicate records where detected

In [9]:
print(data.duplicated().sum())

14


In [10]:
data=data.drop_duplicates()
print(data.duplicated().sum())

0


Outcome:
- Dataset now contains unique employee entries only
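
Before dropping duplicates it can help to look at every copy involved. The sketch below uses a made-up frame and shows duplicated(keep=False) for inspection. It also shows that drop_duplicates keeps the original index labels, which would explain why later outputs in this notebook report Length: 23516 while the last index label is still 23529.

```python
import pandas as pd

# Toy frame with one fully duplicated row (values are made up for illustration).
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 2, 3],
    "Department": ["Sales", "HR", "HR", "Sales"],
})

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = toy[toy.duplicated(keep=False)]

deduped = toy.drop_duplicates()                 # original index labels survive
deduped_reset = deduped.reset_index(drop=True)  # optional: renumber from 0

print(dupes)
print(deduped.index.tolist())
```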

STEP 3: REMOVE IRRELEVANT / CONSTANT COLUMNS

Objective:
- Remove columns that do not contribute meaningful information to analysis.

In [11]:
cols_to_drop=['Education','EmployeeCount','EmployeeNumber','Over18','StandardHours']
data.drop(columns=[i for i in cols_to_drop if i in data.columns],inplace=True)
print('\n Remaining columns:',data.columns.tolist())


 Remaining columns: ['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'EducationField', 'Application ID', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager', 'Employee Source']


Actions Performed:

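- Identified constant columns: EmployeeCount and StandardHours both show a standard deviation of 0 in the descriptive statistics
- Removed the EmployeeNumber identifier and the Over18 column as non-informative
- Also dropped Education, although it is not constant (mean 2.91, standard deviation 1.02 in the descriptive statistics), so it may still carry analytical value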

Outcome:
- Dataset contains only relevant and informative features.
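
Constant columns can be found programmatically instead of by reading the describe output. This is a minimal sketch on a made-up frame; the column names mirror the HR data but the values are invented.

```python
import pandas as pd

# Toy frame: two constant columns and one that varies (made-up values).
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "Education": [1, 2, 3, 4],
})

# A column with at most one distinct non-null value carries no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]
print(constant_cols)
```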

DISPLAYING SHAPE AFTER REMOVAL

In [12]:
print("Remaining Data shape:",data.shape)

Remaining Data shape: (23516, 32)


STEP 4: DATA TYPE CORRECTION

Objective:
- Ensure columns have appropriate and consistent data types.

In [13]:
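for col in data.columns:
    if data[col].dtype == 'object':
        try:
            # Convert only if every value in the column parses as a number
            data[col] = pd.to_numeric(data[col])
        except (ValueError, TypeError):
            # Leave genuinely categorical columns unchanged
            pass
data.dtypes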

Age                         float64
Attrition                    object
BusinessTravel               object
DailyRate                   float64
Department                   object
DistanceFromHome            float64
EducationField               object
Application ID               object
EnvironmentSatisfaction     float64
Gender                       object
HourlyRate                  float64
JobInvolvement              float64
JobLevel                    float64
JobRole                      object
JobSatisfaction             float64
MaritalStatus                object
MonthlyIncome               float64
MonthlyRate                 float64
NumCompaniesWorked          float64
OverTime                     object
PercentSalaryHike           float64
PerformanceRating           float64
RelationshipSatisfaction    float64
StockOptionLevel            float64
TotalWorkingYears           float64
TrainingTimesLastYear       float64
WorkLifeBalance             float64
YearsAtCompany              

Actions Performed:

- Converted numeric fields stored as text into numeric format
- Ensured categorical variables were correctly stored as object types
- Verified date-related fields (if any)

Outcome:
- All columns now have correct and analysis-ready data types

STEP 5: FIX MISSING VALUES

Objective:
- Handle missing values to ensure completeness and avoid bias in analysis.

In [14]:
num_cols = data.select_dtypes(include=['int64','float64']).columns
for col in num_cols:
    data[col] = data[col].fillna(data[col].median())
    
cat_cols = data.select_dtypes(include=['object']).columns
for col in cat_cols:
    data[col] = data[col].replace("Unknown", data[col].mode()[0])
    data[col] = data[col].fillna(data[col].mode()[0])
data.isna().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
EducationField              0
Application ID              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
Employee Source             0
dtype: int64

Actions Performed:

- Filled missing values in numerical columns with the column median, which is robust to outliers
- Replaced "Unknown" placeholders in categorical columns with the column mode, then filled the remaining missing values with that same mode

Outcome:
- The dataset contains no missing values in any column and is ready for further analysis.
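
The effect of median and mode imputation is easiest to see on a few rows. The sketch below uses a made-up frame and one variation on the cell above, which is an assumption of this example and not the notebook's method: "Unknown" is masked to missing before the mode is computed, so the placeholder itself can never be chosen as the fill value.

```python
import numpy as np
import pandas as pd

# Toy frame with one outlier income and two kinds of gaps (made-up values).
toy = pd.DataFrame({
    "MonthlyIncome": [3000.0, np.nan, 5000.0, 100000.0],
    "Gender": ["Female", "Unknown", None, "Female"],
})

# Median of the observed incomes is 5000, unaffected by the 100000 outlier.
toy["MonthlyIncome"] = toy["MonthlyIncome"].fillna(toy["MonthlyIncome"].median())

# Treat the "Unknown" placeholder as missing, then fill with the mode.
gender = toy["Gender"].mask(toy["Gender"] == "Unknown")
toy["Gender"] = gender.fillna(gender.mode()[0])

print(toy)
```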

DATA VALIDATION

STEP 6: CHECK FOR NEGATIVE NUMBERS

Objective:
Validate that numerical values fall within logical ranges.

In [15]:
numeric_cols=data.select_dtypes(include=['float64','int64']).columns
negatives=(data[numeric_cols] <0).sum()
negatives[negatives > 0]

Series([], dtype: int64)

Validations Performed:

- Checked every numeric column for negative values, including Age, MonthlyIncome, YearsAtCompany, and YearsWithCurrManager

Corrections Applied:
- None required: the check returned an empty Series, so no numeric column contains negative values

In [16]:
print(negatives)

Age                         0
DailyRate                   0
DistanceFromHome            0
EnvironmentSatisfaction     0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobSatisfaction             0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64


STEP 7: DATA CONSISTENCY & LOGICAL VALIDATION

Objective:
- Ensure logical consistency between related variables.

In [17]:
# Flag ages outside the expected working range (under 18 or over 60)
invalid_age = data[(data['Age'] < 18) | (data['Age'] > 60)]
invalid_age

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,EducationField,Application ID,EnvironmentSatisfaction,Gender,...,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Employee Source


Validations Performed:
- This check ensures that no employee is younger than 18 and flags unusually high ages (e.g. > 60). The result is empty, so every age lies between 18 and 60.

7.1 DATA INCONSISTENCY

Objective:
- Monthly income must be greater than 0; no employee should have an invalid or impossible value.

In [18]:
data['MonthlyIncome'] = pd.to_numeric(data['MonthlyIncome'])
print("\nEmployees with MonthlyIncome less than or equal to 0:")
print(data[data['MonthlyIncome'] <= 0])


Employees with MonthlyIncome less than or equal to 0:
Empty DataFrame
Columns: [Age, Attrition, BusinessTravel, DailyRate, Department, DistanceFromHome, EducationField, Application ID, EnvironmentSatisfaction, Gender, HourlyRate, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome, MonthlyRate, NumCompaniesWorked, OverTime, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, Employee Source]
Index: []

[0 rows x 32 columns]


Outcome: 
- The dataset contains no employees with a MonthlyIncome less than or equal to zero.
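
The age and income checks follow the same pattern, so they can be expressed as a small table of rules. The sketch below is one possible arrangement on a made-up frame; the rules dictionary and its thresholds are assumptions that mirror the checks in this step.

```python
import pandas as pd

# Toy frame with deliberately invalid values (made up for illustration).
toy = pd.DataFrame({
    "Age": [17.0, 41.0, 63.0],
    "MonthlyIncome": [2500.0, 0.0, 7200.0],
})

# Each rule returns True where a value is acceptable.
rules = {
    "Age": lambda s: s.between(18, 60),   # inclusive on both ends
    "MonthlyIncome": lambda s: s > 0,
}

# Count rows that fail each rule; missing values also count as failures.
violations = {col: int((~ok(toy[col])).sum()) for col, ok in rules.items()}
print(violations)
```

On the cleaned HR data every count would be expected to be 0, matching the empty results shown in this step.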

STEP 8: DERIVED COLUMN CREATION

Objective:
- Create new features to support better analysis.

Derived Features Created:

- Age groups
- Tenure categories
- Attrition flags (binary indicators)

Outcome:
- Enhanced analytical depth and improved interpretability.

TENURE GROUP
- Groups TotalWorkingYears into meaningful ranges
- Categorizes employees into groups based on total years of experience (not years at the current company)

In [19]:
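# pd.cut intervals are right-inclusive: (-1, 2], (2, 5], (5, 10], (10, 20], (20, 60]
# so exactly 20 years falls in '11-20' and '20+' means more than 20 years
data['TenureGroup'] = pd.cut(
    data['TotalWorkingYears'],
    bins=[-1, 2, 5, 10, 20, 60],
    labels=['<3', '3-5', '6-10', '11-20', '20+']
)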

In [20]:
data['TenureGroup']

0        6-10
1        6-10
2        6-10
3        6-10
4        6-10
         ... 
23525    6-10
23526     3-5
23527     20+
23528    6-10
23529    6-10
Name: TenureGroup, Length: 23516, dtype: category
Categories (5, object): ['<3' < '3-5' < '6-10' < '11-20' < '20+']
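
Because pd.cut uses right-inclusive intervals by default, values that sit exactly on a bin edge are worth checking. The sketch below applies the same bins and labels to a handful of made-up boundary values.

```python
import pandas as pd

# Boundary values chosen to sit on or next to each bin edge (made up).
years = pd.Series([0, 2, 3, 5, 10, 20, 21])

groups = pd.cut(
    years,
    bins=[-1, 2, 5, 10, 20, 60],
    labels=['<3', '3-5', '6-10', '11-20', '20+'],
)
print(pd.DataFrame({"years": years, "group": groups}))
```

Each edge value lands in the lower of its two neighbouring groups: 20 years maps to '11-20', and only 21 or more maps to '20+'.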

AGE GROUPING
- Segments employees by age.

In [21]:
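# Right-inclusive bins: (17, 25], (25, 35], (35, 45], (45, 55], (55, 70]
# so age 55 falls in "46-55" and "55+" means older than 55
data['Age_Group'] = pd.cut(
    data['Age'],
    bins=[17, 25, 35, 45, 55, 70],
    labels=["18-25", "26-35", "36-45", "46-55", "55+"]
)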


In [22]:
data['Age_Group']

0        36-45
1        36-45
2        36-45
3        36-45
4        36-45
         ...  
23525    26-35
23526    36-45
23527    46-55
23528    26-35
23529    18-25
Name: Age_Group, Length: 23516, dtype: category
Categories (5, object): ['18-25' < '26-35' < '36-45' < '46-55' < '55+']

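Converts Attrition into a binary value (1 = left, 0 = stayed). Attrition has three unique values in the descriptive statistics, but only two are mapped, so rows with the third value receive NaN; this is why the column below has dtype float64 rather than an integer type.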

In [23]:
data['AttritionFlag'] = data['Attrition'].map({
    'Voluntary Resignation': 1,
    'Current employee': 0
})


In [24]:
data['AttritionFlag']

0        1.0
1        1.0
2        1.0
3        1.0
4        1.0
        ... 
23525    0.0
23526    0.0
23527    0.0
23528    0.0
23529    1.0
Name: AttritionFlag, Length: 23516, dtype: float64
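
Series.map returns NaN for any value that is missing from the mapping, so it is worth listing what was left unmapped. The sketch below uses a made-up series; "Other status" is a placeholder for any category outside the mapping, not a value taken from the HR data.

```python
import pandas as pd

# Made-up series; "Other status" stands in for any unmapped category.
attrition = pd.Series(
    ["Voluntary Resignation", "Current employee", "Other status"]
)
mapping = {"Voluntary Resignation": 1, "Current employee": 0}

flag = attrition.map(mapping)                      # unmapped values become NaN
unmapped = attrition[flag.isna()].value_counts()   # which categories were missed
flag_int = flag.astype("Int64")                    # nullable integer keeps the gap as <NA>

print(unmapped)
print(flag_int)
```

Whether such rows should be dropped, mapped to 1, or kept as missing depends on how attrition is defined for the analysis.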

STEP 9: FILTERING & AGGREGATING DATA

Objective:
- Prepare summarized datasets for EDA and visualization.

Actions Performed:

- Used filtering to isolate key employee groups
- Created aggregated metrics (mean income, count of employees by department, max age)

FILTERING

Objective:
- Filter employees who left the company through voluntary resignation

In [25]:
left_employees = data[data['Attrition']=='Voluntary Resignation']

In [26]:
left_employees.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,EducationField,Application ID,EnvironmentSatisfaction,Gender,...,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Employee Source,TenureGroup,Age_Group,AttritionFlag
0,41.0,Voluntary Resignation,Travel_Rarely,1102.0,Sales,1.0,Life Sciences,123456,2.0,Female,...,0.0,1.0,6.0,4.0,0.0,5.0,Referral,6-10,36-45,1.0
1,37.0,Voluntary Resignation,Travel_Rarely,807.0,Human Resources,6.0,Human Resources,123457,1.0,Female,...,0.0,1.0,6.0,4.0,0.0,5.0,Referral,6-10,36-45,1.0
2,41.0,Voluntary Resignation,Travel_Rarely,1102.0,Sales,1.0,Life Sciences,123458,2.0,Female,...,0.0,1.0,6.0,4.0,0.0,5.0,Referral,6-10,36-45,1.0
3,37.0,Voluntary Resignation,Travel_Rarely,807.0,Human Resources,6.0,Marketing,123459,1.0,Female,...,0.0,1.0,6.0,4.0,0.0,5.0,Referral,6-10,36-45,1.0
4,37.0,Voluntary Resignation,Travel_Rarely,807.0,Human Resources,6.0,Human Resources,123460,1.0,Female,...,0.0,1.0,6.0,4.0,0.0,5.0,Referral,6-10,36-45,1.0


Objective:
- Filter active employees still working in the company

In [27]:
current_employees = data[data['Attrition'] == "Current employee"]
current_employees.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,EducationField,Application ID,EnvironmentSatisfaction,Gender,...,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Employee Source,TenureGroup,Age_Group,AttritionFlag
18,49.0,Current employee,Travel_Frequently,279.0,Research & Development,8.0,Life Sciences,123474,3.0,Male,...,3.0,3.0,10.0,7.0,1.0,7.0,Seek,6-10,46-55,0.0
19,59.0,Current employee,Non-Travel,1420.0,Human Resources,2.0,Human Resources,123475,1.0,Male,...,3.0,3.0,10.0,7.0,1.0,7.0,Seek,6-10,55+,0.0
20,59.0,Current employee,Non-Travel,1420.0,Human Resources,2.0,Life Sciences,123476,1.0,Male,...,3.0,3.0,10.0,7.0,1.0,7.0,Seek,6-10,55+,0.0
21,49.0,Current employee,Travel_Frequently,279.0,Research & Development,8.0,Marketing,123477,3.0,Male,...,3.0,3.0,10.0,7.0,1.0,7.0,Seek,6-10,46-55,0.0
22,49.0,Current employee,Travel_Frequently,279.0,Research & Development,8.0,Life Sciences,123478,3.0,Male,...,3.0,3.0,10.0,7.0,1.0,7.0,Seek,6-10,46-55,0.0


Objective:
- Display employees belonging to the Sales department

In [28]:
sales_emp=data[data['Department']=='Sales']
sales_emp.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,EducationField,Application ID,EnvironmentSatisfaction,Gender,...,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Employee Source,TenureGroup,Age_Group,AttritionFlag
0,41.0,Voluntary Resignation,Travel_Rarely,1102.0,Sales,1.0,Life Sciences,123456,2.0,Female,...,0.0,1.0,6.0,4.0,0.0,5.0,Referral,6-10,36-45,1.0
2,41.0,Voluntary Resignation,Travel_Rarely,1102.0,Sales,1.0,Life Sciences,123458,2.0,Female,...,0.0,1.0,6.0,4.0,0.0,5.0,Referral,6-10,36-45,1.0
6,41.0,Voluntary Resignation,Travel_Rarely,1102.0,Sales,1.0,Life Sciences,123462,2.0,Female,...,0.0,1.0,6.0,4.0,0.0,5.0,Referral,6-10,36-45,1.0
7,41.0,Voluntary Resignation,Travel_Rarely,1102.0,Sales,1.0,Life Sciences,123463,2.0,Female,...,0.0,1.0,6.0,4.0,0.0,5.0,Referral,6-10,36-45,1.0
8,41.0,Voluntary Resignation,Travel_Rarely,1102.0,Sales,1.0,Life Sciences,123464,2.0,Female,...,0.0,1.0,6.0,4.0,0.0,5.0,Referral,6-10,36-45,1.0


AGGREGATION

Objective:
- Compute the average monthly income by department

In [29]:
avg_sal_dept=data.groupby('Department')['MonthlyIncome'].mean()

Displays the average monthly income for each department.

In [30]:
avg_sal_dept.head()

Department
Human Resources           6442.313300
Research & Development    6357.825441
Sales                     6824.213067
Name: MonthlyIncome, dtype: float64

Objective:
- Count the employees in each department

In [31]:
count_emp=data['Department'].value_counts()

Objective:
- Display the number of employees working in each department

In [32]:
count_emp.head()

Department
Research & Development    15353
Sales                      7148
Human Resources            1015
Name: count, dtype: int64

Objective:
- Find the maximum age of employees

In [33]:
max_age=data['Age'].max()

In [34]:
max_age

60.0
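
The three aggregates above can also be produced per department in a single table with named aggregation. This is a minimal sketch on a made-up frame; the output column names avg_income, headcount, and max_age are hypothetical.

```python
import pandas as pd

# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Human Resources"],
    "MonthlyIncome": [5000.0, 7000.0, 4000.0],
    "Age": [30.0, 45.0, 52.0],
})

summary = toy.groupby("Department").agg(
    avg_income=("MonthlyIncome", "mean"),
    headcount=("MonthlyIncome", "size"),   # number of rows per department
    max_age=("Age", "max"),
)
print(summary)
```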

STEP 10: EXPORT CLEANED DATASET

Objective:
- Save the cleaned and validated HR dataset for further exploratory data analysis and modeling.

In [49]:
data.to_csv("IBM_HR_CLEANED.csv", index=False)

Action Performed:

- Exported the final cleaned dataset to a CSV file using data.to_csv with index=False, so the row index is not written as an extra column

Outcome:
- The cleaned dataset is successfully stored as IBM_HR_CLEANED.csv, ensuring reproducibility and easy reuse for EDA and future analysis phases.
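
A quick round trip confirms that an export can be read back with the same shape and columns. The sketch below writes a made-up frame to a temporary file, so the file name toy_cleaned.csv is hypothetical. Note that CSV does not store dtypes, so categorical columns such as TenureGroup and Age_Group would come back as plain text and need to be recreated after loading.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data (made-up values).
toy = pd.DataFrame({
    "Age": [41.0, 37.0],
    "Department": ["Sales", "Human Resources"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_cleaned.csv")
    toy.to_csv(path, index=False)   # same call pattern as the export above
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```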