# 📊 Employee Performance Analysis Report
- Prepared by: PolyTech [Capilitan, Niones, Raquem, Victorio, Villarta]
- Role: Junior Data Analyst  
- Objective: To uncover insights in employee performance using real-world HR data.
- Date: July 2025

This report explores how age, experience, education, and department 
relate to employee performance.
Findings will guide HR in training, hiring, and promotion decisions.


## About Dataset

- This dataset provides an overview of employee performance and is tailored for machine learning tasks such as clustering and classification. It contains essential attributes including age, total professional experience, educational attainment, department, and a numerical performance score. A derived target column classifies overall performance into five categories: Poor, Average, Good, Very Good, and Excellent. This dataset is well-suited for evaluating models in clustering, predictive analytics, and performance-based insights.
- Column Descriptions:
    - age: The employee’s age (ranging from 20 to 60 years).
    - years_experience: Total number of years in professional work (1 to 40 years).
    - education_level: Highest education achieved (High School, Bachelor, Master, PhD).
    - group_department: Department or functional group of the employee (Sales, Tech, HR, Finance).
    - performance_score: A numeric score representing the employee’s performance, on a scale from 1 to 10.
    - performance_category: A categorical label for performance classification (Poor, Average, Good, Very Good, Excellent).
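
The documented ranges above can be checked directly against the data. The following is a minimal sketch, not part of the original notebook: the `sample` rows and the helper `out_of_range` are hypothetical, and the bounds are taken from the column descriptions.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real CSV (illustrative values only).
sample = pd.DataFrame({
    "age": [58, 34, 19],
    "years_experience": [23, 15, 2],
    "performance_score": [10.0, 5.0, 11.0],
})

# Inclusive bounds taken from the column descriptions.
expected_ranges = {
    "age": (20, 60),
    "years_experience": (1, 40),
    "performance_score": (1, 10),
}

def out_of_range(frame, ranges):
    """Count, per column, the non-null values outside [low, high]."""
    counts = {}
    for column, (low, high) in ranges.items():
        values = frame[column].dropna()
        counts[column] = int((~values.between(low, high)).sum())
    return counts

violations = out_of_range(sample, expected_ranges)
print(violations)
```

Running the same helper on the loaded `df` would flag any rows that contradict the dataset description before analysis begins.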

## Import Libraries

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Data Loading & Initial Exploration


In [15]:
df = pd.read_csv('employee_performance.csv')
df.head()

Unnamed: 0,age,years_experience,education_level,group_department,performancescore,performance_category
0,58,23,High School,Finance,10.0,Excellent
1,48,39,Bchelor,HR,,
2,34,15,High School,HR,5.0,
3,27,29,PhD,HR,4.0,Average
4,40,36,PhD,Finance,2.0,Poor


## Cleaning Dataset

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 791 entries, 0 to 790
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   age                   791 non-null    int64  
 1   years_experience      791 non-null    int64  
 2   education_level       791 non-null    object 
 3   group_department      791 non-null    object 
 4   performancescore      790 non-null    float64
 5   performance_category  789 non-null    object 
dtypes: float64(1), int64(2), object(3)
memory usage: 37.2+ KB


#### Check Inconsistent Values

In [18]:
df['education_level'].unique()

array(['High School', 'Bchelor', 'PhD', "Master's", 'Master', 'Hs',
       'Bachelor'], dtype=object)

In [19]:
df['group_department'].unique()

array(['Finance', 'HR', 'Tech', 'Sales'], dtype=object)

#### Observed Issues in the Dataset

After reviewing the initial dataset structure and unique values, we identified the following issues:

---

##### 1. Missing Values in Key Columns
| Column                 | Non-Null Count | Expected | Issue            |
|------------------------|----------------|----------|------------------|
| `performancescore`     | 790            | 791      | 1 missing value  |
| `performance_category` | 789            | 791      | 2 missing values |

- These columns are critical for performance analysis and cannot be left null.
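
The counts in the table can also be derived without reading `df.info()` by eye. This is a sketch on hypothetical rows that mirror the first rows of `df.head()`; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mirroring the first rows shown by df.head().
sample = pd.DataFrame({
    "performancescore": [10.0, np.nan, 5.0, 4.0],
    "performance_category": ["Excellent", None, None, "Average"],
})

# Missing values per column.
missing_per_column = sample.isna().sum()

# Row-level breakdown: missing both values vs. missing only the category.
rows_missing_both = int(sample.isna().all(axis=1).sum())
rows_missing_category_only = int(
    (sample["performancescore"].notna() & sample["performance_category"].isna()).sum()
)

print(missing_per_column)
print(rows_missing_both, rows_missing_category_only)
```

The row-level breakdown matters because a row missing only its category can be repaired from the score, while a row missing both cannot.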

---

##### 2. Inconsistent `education_level` Values

| Raw Value   | Issue Description                      |
|-------------|----------------------------------------|
| `Bchelor`   | Typo of "Bachelor"                     |
| `"Master's"`| Contains apostrophe, non-standard      |
| `Hs`        | Abbreviation for "High School"         |
| Mixed casing| Some entries are not properly capitalized |

```python
df['education_level'].unique()
```


#### Rename Columns

In [20]:
df.rename(columns={
    'group_department': 'department',
    'performancescore': 'performance_score'
}, inplace=True)

#####  Clean education_level

In [21]:
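# Normalize whitespace and casing. Note that str.title() also turns
# "Master's" into "Master'S" and "PhD" into "Phd", so the map below
# is keyed on the title-cased forms.
df['education_level'] = df['education_level'].str.strip().str.title()

edu_map = {
    'Bchelor': 'Bachelor',
    "Master'S": 'Master',
    'Hs': 'High School',
    'Phd': 'PhD'
}
df['education_level'] = df['education_level'].replace(edu_map)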
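
Title-casing has a side effect worth checking: Python's `str.title()` capitalizes the letter after an apostrophe and lowercases the rest of each word. The sketch below applies it to the raw values listed by `unique()` above; `edu_map_titled` is a hypothetical mapping keyed on the title-cased forms.

```python
import pandas as pd

# Raw values as listed by df['education_level'].unique().
raw = pd.Series(["High School", "Bchelor", "PhD", "Master's", "Master", "Hs", "Bachelor"])

titled = raw.str.strip().str.title()
print(titled.tolist())
# "Master's" becomes "Master'S" and "PhD" becomes "Phd"

# Hypothetical mapping keyed on the title-cased forms.
edu_map_titled = {
    "Bchelor": "Bachelor",
    "Master'S": "Master",
    "Hs": "High School",
    "Phd": "PhD",
}
cleaned = titled.replace(edu_map_titled)
print(sorted(cleaned.unique()))
```

A mapping keyed on the raw spelling `"Master's"` would not match after `.str.title()`, so checking `unique()` again after cleaning is a cheap safeguard.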

##### Handle Missing Values

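Drop rows where both critical values are missing. Passing `how='all'` is required here: the default `how='any'` would also drop rows missing only one of the two values, leaving nothing for the fill step below.

In [22]:
df.dropna(subset=['performance_score', 'performance_category'], how='all', inplace=True)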
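
The `how` argument of `DataFrame.dropna` decides whether a row is dropped when any or when all of the `subset` columns are missing (the default is `how='any'`). A minimal sketch on hypothetical rows shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one missing both values,
# one missing only the category.
sample = pd.DataFrame({
    "performance_score": [10.0, np.nan, 5.0],
    "performance_category": ["Excellent", None, None],
})
critical = ["performance_score", "performance_category"]

dropped_any = sample.dropna(subset=critical)             # default how="any"
dropped_all = sample.dropna(subset=critical, how="all")  # drop only rows missing both

print(len(dropped_any), len(dropped_all))
```

With `how="all"`, the row that still has a score survives and its category can be filled in afterwards.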

Fill any remaining missing `performance_category` values from `performance_score`:

In [23]:
def categorize(score):
    if score <= 2:
        return 'Poor'
    elif score <= 4:
        return 'Average'
    elif score <= 6:
        return 'Good'
    elif score <= 8:
        return 'Very Good'
    else:
        return 'Excellent'

df['performance_category'] = df['performance_category'].fillna(
    df['performance_score'].apply(categorize)
)
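
As an alternative sketch (not the notebook's method), the same thresholds can be expressed with `pd.cut`, which bins a whole column at once. The `scores` values are hypothetical; `categorize` is repeated only to compare the two approaches.

```python
import numpy as np
import pandas as pd

def categorize(score):
    # Same thresholds as the function above.
    if score <= 2:
        return 'Poor'
    elif score <= 4:
        return 'Average'
    elif score <= 6:
        return 'Good'
    elif score <= 8:
        return 'Very Good'
    else:
        return 'Excellent'

# Hypothetical scores covering every boundary.
scores = pd.Series([1.0, 2.0, 2.5, 4.0, 6.0, 7.5, 8.0, 9.0, 10.0])

labels = ["Poor", "Average", "Good", "Very Good", "Excellent"]
bins = [-np.inf, 2, 4, 6, 8, np.inf]  # right-inclusive edges, matching the <= checks

binned = pd.cut(scores, bins=bins, labels=labels)
same = bool((binned.astype(str) == scores.apply(categorize)).all())
print(same)
```

One difference to keep in mind: `categorize` returns `'Excellent'` for a `NaN` score because every comparison with `NaN` is false, whereas `pd.cut` leaves a `NaN` score unlabelled.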

### Check the Cleaned Data

In [None]:
df.info()
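
Beyond reading the `df.info()` output, the cleaning steps can be verified programmatically. The sketch below is an assumption-laden illustration: the helper `find_problems`, the allowed-value sets, and the two small frames are hypothetical and only mirror the categories named in the dataset description.

```python
import pandas as pd

# Allowed values taken from the dataset description.
ALLOWED_EDUCATION = {"High School", "Bachelor", "Master", "PhD"}
ALLOWED_CATEGORIES = {"Poor", "Average", "Good", "Very Good", "Excellent"}

def find_problems(frame):
    """Return a list of human-readable problems found in a cleaned frame."""
    problems = []
    for column in ("performance_score", "performance_category"):
        n_missing = int(frame[column].isna().sum())
        if n_missing:
            problems.append(f"{column}: {n_missing} missing")
    unexpected_edu = set(frame["education_level"].dropna()) - ALLOWED_EDUCATION
    if unexpected_edu:
        problems.append(f"education_level: unexpected {sorted(unexpected_edu)}")
    unexpected_cat = set(frame["performance_category"].dropna()) - ALLOWED_CATEGORIES
    if unexpected_cat:
        problems.append(f"performance_category: unexpected {sorted(unexpected_cat)}")
    return problems

# Hypothetical frames: one clean, one with leftover issues.
clean = pd.DataFrame({
    "education_level": ["High School", "PhD"],
    "performance_score": [10.0, 4.0],
    "performance_category": ["Excellent", "Average"],
})
dirty = pd.DataFrame({
    "education_level": ["Hs", "PhD"],
    "performance_score": [5.0, 4.0],
    "performance_category": [None, "Average"],
})

print(find_problems(clean))
print(find_problems(dirty))
```

Calling `find_problems(df)` after the cleaning cells should return an empty list if every step behaved as intended.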