# Employee Performance Analysis  
### INX Future Inc.

| **Field**                 | **Details**                               |
|---------------------------|-------------------------------------------|
| **Candidate Name**        | Vimal Raj J                            |
| **Candidate E-mail**      | mrvimalofficiall@gmail.com                 |
| **REP Name**              | DataMites™ Solutions Pvt Ltd             |
| **Venue Name**            | Open Project                             |
| **Exam Country**          | India                                    |
| **Assessment ID**         | E10901-PR2-V18                          |
| **Module**                | Certified Data Scientist - Project       |
| **Language**              | English                                  |
| **Exam Format**           | Open Project - IABAC™ Project Submission |
| **Submission Deadline**   | 08-Apr-2025                             |
| **Registered Trainer**    | Ashok Kumar A                                    |
| **Project Assessment**    | IABAC™                                   |



# **2. Analysis**
## **Feature Description**
Data analysis began by examining the features in the dataset. The features play a crucial role in understanding the relationships between dependent and independent variables. Using `pandas`, the dataset was explored to answer preliminary questions, dividing the data into **numerical** and **categorical** features.
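As a rough illustration of the split described above, the following sketch separates columns by dtype with `pandas`. The tiny DataFrame is hypothetical: the column names come from this report, but the values are invented, and the report does not state exactly how the split was performed.

```python
import pandas as pd

# Hypothetical miniature frame; column names come from the report,
# the values are invented for illustration only.
df = pd.DataFrame({
    "EmpNumber": ["E1001", "E1002", "E1003"],
    "Gender": ["Male", "Female", "Male"],
    "Age": [32, 41, 28],
    "EmpHourlyRate": [65, 90, 72],
})

# Numeric dtypes go to one list, everything else (strings) to the other.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Note that a dtype-based split would place integer-coded ordinal features such as **EmpJobLevel** among the numerical columns, so those still need to be assigned by hand.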

### **Categorical Features**  
Categorical values classify samples into distinct groups. These features may be **nominal** (unordered) or **ordinal** (ordered); the features below are treated as nominal:  
- **EmpNumber**  
- **Gender**  
- **EducationBackground**  
- **MaritalStatus**  
- **EmpDepartment**  
- **EmpJobRole**  
- **BusinessTravelFrequency**  
- **OverTime**  
- **Attrition**

---

### **Numerical Features**  
Numerical values vary between samples and may be **discrete** or **continuous**:  
- **Age**  
- **DistanceFromHome**  
- **EmpHourlyRate**  
- **NumCompaniesWorked**  
- **EmpLastSalaryHikePercent**  
- **TotalWorkExperienceInYears**  
- **TrainingTimesLastYear**  
- **ExperienceYearsAtThisCompany**  
- **ExperienceYearsInCurrentRole**  
- **YearsSinceLastPromotion**  
- **YearsWithCurrManager**

---

### **Ordinal Features**  
Ordinal features include variables with a meaningful order:  
- **EmpEducationLevel**  
- **EmpEnvironmentSatisfaction**  
- **EmpJobInvolvement**  
- **EmpJobLevel**  
- **EmpJobSatisfaction**  
- **EmpRelationshipSatisfaction**  
- **EmpWorkLifeBalance**  
- **PerformanceRating**

---

### **Alphanumeric Features**  
These are features whose values mix letters and digits. For instance:  
- **EmpNumber** (contains unique identifiers that do not contribute to performance ratings).

---

## **Distribution of Numerical Features**  
Distribution analysis provides early insights into how representative the dataset is of the problem domain:  
- **Age:** Most employees are aged between 30 and 40 years.  
- **NumCompaniesWorked:** Most employees have worked at up to 2 companies before joining INX.  
- **EmpHourlyRate:** The majority of employees earn between $65 and $95 per hour.  
- **ExperienceYearsAtThisCompany:** Most employees have worked at INX for up to 5 years.  
- **EmpLastSalaryHikePercent:** The typical salary hike is between 11% and 15%.  

---

### **Checking Normal Distribution**
- Normality was assessed using **skewness** and **kurtosis**.  
- Example: The feature **YearsSinceLastPromotion** is right-skewed (positive skewness):  
  - **Skewness:** 1.972  
  - **Kurtosis:** 3.519  
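
A minimal sketch of how such figures might be computed with `pandas`. The series below is invented stand-in data, not the project's dataset, so it will not reproduce the values above.

```python
import pandas as pd

# Invented right-skewed values standing in for YearsSinceLastPromotion.
years_since_promotion = pd.Series([0, 0, 0, 1, 1, 1, 2, 2, 3, 7, 11, 15])

# Series.skew() returns sample skewness (0 for symmetric data).
skewness = years_since_promotion.skew()
# Series.kurt() returns excess kurtosis (0 for a normal distribution).
kurtosis = years_since_promotion.kurt()

print(f"Skewness: {skewness:.3f}")
print(f"Kurtosis: {kurtosis:.3f}")
```

A positive skewness indicates a long right tail: most values are small, with a few large ones.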

---

### **Skewed Data Handling**  
**Square Root Transformation** was applied to handle skewed data. This transformation is particularly useful for count data or small whole numbers.
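
The effect of the transformation can be sketched as follows, assuming it was applied with `numpy.sqrt`. The values are invented for illustration, and the report does not state which columns were transformed beyond the example above.

```python
import numpy as np
import pandas as pd

# Invented right-skewed count data (non-negative, as sqrt requires).
values = pd.Series([0, 0, 0, 1, 1, 1, 2, 2, 3, 7, 11, 15])

# The square root compresses large values more than small ones,
# which pulls in the long right tail.
transformed = np.sqrt(values)

print(f"Skewness before: {values.skew():.3f}")
print(f"Skewness after:  {transformed.skew():.3f}")
```

The transformation is only defined for non-negative values, which holds for count-like features such as years or number of companies.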

---

## **Distribution of Categorical Features**
- **Gender:** 60% Male, 40% Female.  
- **EducationBackground:** Six unique education backgrounds are present.  
- **EmpJobRole:** 19 unique job roles exist in the dataset.  
- **EmpJobSatisfaction:** The majority report a high level of satisfaction.  
- **Attrition:** 85% of employees do not exhibit attrition.  
- **PerformanceRating:** Only 11% of employees achieved an "Outstanding" rating.  
- **OverTime:** 30% of employees work overtime.  

---

# **3. Data Cleaning**
Data cleaning ensures the quality of the dataset. While no missing data was found, outliers were detected in the following features:  
- **NumCompaniesWorked**  
- **TotalWorkExperienceInYears**  
- **TrainingTimesLastYear**  
- **ExperienceYearsAtThisCompany**  
- **ExperienceYearsInCurrentRole**  
- **YearsWithCurrManager**  
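
The report does not say which rule was used to flag outliers. As one common possibility, the sketch below assumes the 1.5 × IQR rule; the helper `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]. Hypothetical helper for illustration."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Invented values standing in for NumCompaniesWorked.
num_companies = pd.Series([0, 1, 1, 2, 2, 3, 3, 4, 20])

mask = iqr_outlier_mask(num_companies)
print(num_companies[mask].tolist())  # values flagged as outliers
```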

---

# **4. Data Preprocessing**
Data preprocessing transforms raw data into a format suitable for modeling. Key preprocessing steps included:  
1. **Handling Outliers:** Addressed the outliers in numerical features.  
2. **Encoding Categorical Data:** Used **Label Encoding** to convert string-based categorical data into numerical values.  
3. **Ensuring Consistency:** Verified data completeness and addressed any inconsistencies.
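
The encoding step might look like the sketch below, which uses scikit-learn's `LabelEncoder` on two of the categorical columns. The sample rows are invented, and keeping one encoder per column in a dictionary is an assumption made here so that the mapping can be inspected or reversed later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented rows; column names come from the report.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "OverTime": ["No", "Yes", "No"],
})

encoders = {}
for col in ["Gender", "OverTime"]:
    encoder = LabelEncoder()
    # Classes are sorted alphabetically, then mapped to 0, 1, 2, ...
    df[col] = encoder.fit_transform(df[col])
    encoders[col] = encoder

print(df)
print(dict(zip(encoders["Gender"].classes_, range(2))))
```

scikit-learn documents `LabelEncoder` as intended for target labels; for input features, `OrdinalEncoder` or `OneHotEncoder` are the usual alternatives, since integer codes impose an arbitrary order on nominal categories.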

## **Analysis by Visualization**

Visualization helps in understanding the data better by providing graphical insights. In this project, two primary methods of visualization were used:

---

### **1. Distribution Plot**
One of the first steps in data exploration is to understand how features are distributed. The distribution of data was analyzed using **Seaborn's `distplot`** function for both **numerical** and **categorical** features.  

- **Purpose:**  
  - Provides an overview of the density and concentration of data at different levels.
  - Helps identify trends, patterns, and potential outliers.

- **Insights Derived:**  
  - The distribution plots reveal the overall density and majority of data for various features.  
  - Key features such as **Age**, **TotalWorkExperienceInYears**, and **EmpHourlyRate** give valuable insights into the workforce demographics and salary distribution.

---

### **2. Correlation Analysis**
Correlation analysis was performed using **bar plots** and **heatmaps** to visualize the relationships between numerical features.  

- **Methodology:**  
  - Used **Pandas `.corr()`** function to calculate the **Pearson Correlation Coefficient** for pairwise relationships.  
  - The results were visualized using heatmaps. Pearson coefficients range from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear relationship.  

- **Key Observations from the Correlation Heatmap:**  
  1. **Total Work Experience & Job Level:**  
     - A high correlation indicates that employees with more experience are likely to be in higher job levels.  
  2. **Experience at the Company & Years with Current Manager:**  
     - Strong correlation suggests that longer tenure in the company often aligns with prolonged reporting to the same manager.  
  3. **Experience at the Company & Experience in Current Role:**  
     - Correlation here highlights that employees tend to gain deeper expertise in their current roles over time.  
  4. **Experience & Promotions:**  
     - Employees with longer tenure have a higher probability of receiving promotions, as evident from the correlation.  
  5. **Age & Total Work Experience:**  
     - A strong correlation is expected here, since older employees have generally had more time to accumulate work experience.  
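
The correlation step can be sketched with `DataFrame.corr()`. The numbers below are invented and only mimic the kind of relationships listed above; the heatmap call is left as a comment so the snippet runs without a plotting backend.

```python
import pandas as pd

# Invented values; column names come from the report.
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50],
    "TotalWorkExperienceInYears": [2, 6, 10, 15, 22, 27],
    "EmpJobLevel": [1, 1, 2, 3, 4, 5],
    "DistanceFromHome": [12, 3, 20, 7, 15, 5],
})

# Pairwise Pearson correlation; every coefficient lies in [-1, 1].
corr = df.corr(method="pearson")
print(corr.round(2))

# To draw the heatmap (requires seaborn and matplotlib):
# import seaborn as sns
# sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
```

Fixing the color scale to the full -1 to 1 range keeps negative correlations visible instead of compressing them into the low end of the scale.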

---

### **Conclusion**  
By combining distribution analysis and correlation heatmaps, we obtained a comprehensive understanding of how features interact. These insights are instrumental in building a predictive model and making data-driven decisions.

# **Machine Learning Model**

In this project, six machine learning algorithms were implemented to predict employee performance ratings, each evaluated with `accuracy_score`:

- **LogisticRegression**
- **DecisionTreeClassifier**
- **RandomForestClassifier**
- **KNeighborsClassifier**
- **MLPClassifier**
- **SVC**

All six are supervised classification algorithms and therefore suit this labeled dataset.  

---

## **Methodology**
### **1. Data Preparation**
- The dataset was split into **training** and **testing** sets.
- Due to an observed imbalance in the target variable, the **SMOTE (Synthetic Minority Over-sampling Technique)** method was applied using the `imbalanced-learn` Python library.  
  - SMOTE helps in addressing the imbalance by generating synthetic samples for the minority class, ensuring better model performance.

### **2. Model Training and Testing**
- The training data was fitted into the models, and predictions were generated for the test set.
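
The fit-and-predict loop could be sketched as below. The data is synthetic and the hyperparameters (`max_iter`, `random_state`) are assumptions, so the scores printed here are unrelated to the results reported in this document.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the employee dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "MLPClassifier": MLPClassifier(max_iter=500, random_state=0),
    "SVC": SVC(),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} {score:.4f}")
```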
 

---

## **Results**
The performance of the models was as follows:  

| **Model**                  | **Accuracy** |
|----------------------------|--------------|
| **LogisticRegression**     | **0.752778** |
| **DecisionTreeClassifier** | **0.600000** |
| **RandomForestClassifier** | **0.772222** |
| **KNeighborsClassifier**   | **0.525000** |
| **MLPClassifier**          | **0.841667** |
| **SVC**                    | **0.838889** |
 

### **Insights**
- MLPClassifier (0.842) and SVC (0.839) are the top performers.
- RandomForestClassifier (0.772) is also strong.
- LogisticRegression (0.753) shows moderate performance.
- DecisionTreeClassifier (0.600) and KNeighborsClassifier (0.525) are the least effective.
- Neural networks and SVMs appear best suited for this task.

---

### **Conclusion**
Combining these models with SMOTE to handle the imbalanced target produced usable predictions: the two best models, MLPClassifier and SVC, each reached roughly 84% accuracy. These models can support employee performance analysis and decision-making.