Dataset: https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset/data

## 1. Business Understanding

### Objective:
Predict whether an employee will leave, using the LeaveOrNot column as the target.

### Context:
This dataset contains information about employees in a company, including their educational backgrounds, work history, demographics, and employment-related factors. It has been anonymized to protect privacy while still providing valuable insights into the workforce.

### Goal:
Develop a binary classification model to predict employee leave status based on various features, such as education, joining year, city, payment tier, age, gender, bench status, and experience in the current domain.



## 2. Data Understanding
We'll start by loading the dataset, checking its structure, and understanding any potential issues (e.g., missing data).

In [33]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
file_path = '/content/Employee.csv'  # Modify this path based on where your dataset is
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
print(data.head())

# Check the structure of the dataset (info() prints directly and returns None)
data.info()

# Get descriptive statistics of the dataset
print(data.describe())

# Check for missing values
print(data.isnull().sum())



   Education  JoiningYear       City  PaymentTier  Age  Gender EverBenched  \
0  Bachelors         2017  Bangalore            3   34    Male          No   
1  Bachelors         2013       Pune            1   28  Female          No   
2  Bachelors         2014  New Delhi            3   38  Female          No   
3    Masters         2016  Bangalore            3   27    Male          No   
4    Masters         2017       Pune            3   24    Male         Yes   

   ExperienceInCurrentDomain  LeaveOrNot  
0                          0           0  
1                          3           1  
2                          2           0  
3                          5           1  
4                          2           1  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Education                  4653 non-null   object
 1
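Because the task is binary classification, it can also help to look at how the target classes are distributed before modeling. The sketch below is illustrative only: instead of reading the CSV, it rebuilds part of the five rows printed by `data.head()` above into a small frame called `sample` (a hypothetical name), so the counts describe those five rows and not the full dataset.

```python
import pandas as pd

# Toy frame built from the five rows printed by data.head() above;
# `sample` is a hypothetical name, not part of the original notebook.
sample = pd.DataFrame({
    "Education": ["Bachelors", "Bachelors", "Bachelors", "Masters", "Masters"],
    "PaymentTier": [3, 1, 3, 3, 3],
    "LeaveOrNot": [0, 1, 0, 1, 1],
})

# Absolute and relative frequency of each target class
class_counts = sample["LeaveOrNot"].value_counts()
class_share = sample["LeaveOrNot"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```

On the real data, the same two calls on `data['LeaveOrNot']` would show whether one class dominates, which affects how an accuracy figure should be read.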

## 3. Data Preparation
Clean the data and handle missing values, if any. Also, encode categorical variables for the model.

In [36]:
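# Drop rows with a missing target so labels are never imputed, then fill
# missing feature values (numeric columns with the mean, categorical with the mode)
data = data.dropna(subset=['LeaveOrNot']).copy()
numeric_columns = data.select_dtypes(include=[np.number]).columns.drop('LeaveOrNot')
categorical_columns = data.select_dtypes(exclude=[np.number]).columns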

data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())
data[categorical_columns] = data[categorical_columns].fillna(data[categorical_columns].mode().iloc[0])

# Check if any missing values remain
print("Missing values after filling:")
print(data.isnull().sum())

# Convert categorical variables to numeric using one-hot encoding
data = pd.get_dummies(data, drop_first=True)

# Show the cleaned dataset
print(data.head())

# Split the dataset into features and target
X = data.drop(columns=['LeaveOrNot'])
y = data['LeaveOrNot']


Missing values after filling:
JoiningYear                  0
PaymentTier                  0
Age                          0
ExperienceInCurrentDomain    0
LeaveOrNot                   0
Education_Masters            0
Education_PHD                0
City_New Delhi               0
City_Pune                    0
Gender_Male                  0
EverBenched_Yes              0
dtype: int64
   JoiningYear  PaymentTier  Age  ExperienceInCurrentDomain  LeaveOrNot  \
0         2017            3   34                          0           0   
1         2013            1   28                          3           1   
2         2014            3   38                          2           0   
3         2016            3   27                          5           1   
4         2017            3   24                          2           1   

   Education_Masters  Education_PHD  City_New Delhi  City_Pune  Gender_Male  \
0              False          False           False      False         True   
1      
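To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame assembled from the five rows shown earlier (the name `sample` is hypothetical). For each categorical column, the alphabetically first category gets no column of its own and is represented by all of that column's dummies being false or zero.

```python
import pandas as pd

# Toy frame with a subset of the columns, taken from the first five rows
sample = pd.DataFrame({
    "JoiningYear": [2017, 2013, 2014, 2016, 2017],
    "City": ["Bangalore", "Pune", "New Delhi", "Bangalore", "Pune"],
    "Gender": ["Male", "Female", "Female", "Male", "Male"],
    "EverBenched": ["No", "No", "No", "No", "Yes"],
})

# drop_first=True drops the first category of each column (Bangalore, Female, No)
encoded = pd.get_dummies(sample, drop_first=True)

print(list(encoded.columns))
print(encoded.head())
```

Row 0 is a Bangalore employee, so both `City_New Delhi` and `City_Pune` are false for it. Dropping one level per column avoids perfectly collinear dummy columns, which matters more for linear models than for tree ensembles.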

## 4. Modeling
Now we'll build and train a binary classification model using a Random Forest classifier. You can experiment with other models later.

In [37]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model
model = RandomForestClassifier(random_state=42)

# Fit the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
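One way to experiment with other models is to compare candidates with cross-validation rather than a single split. The following is a sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `X` and `y`, and the names `X_demo`, `y_demo`, `candidates`, and `cv_results` are hypothetical. The scores it prints say nothing about the employee dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix and target (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Candidate models to compare
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy for each candidate
cv_results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

To apply this to the notebook's data, replace `X_demo` and `y_demo` with `X_train` and `y_train`, keeping the test set aside for the final evaluation.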


## 5. Evaluation
Evaluate model performance using key metrics like accuracy, precision, recall, and F1-score.

In [38]:
# Evaluate model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Classification report (precision, recall, F1-score)
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)


Accuracy: 0.8518
Confusion Matrix:
[[558  52]
 [ 86 235]]
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.91      0.89       610
           1       0.82      0.73      0.77       321

    accuracy                           0.85       931
   macro avg       0.84      0.82      0.83       931
weighted avg       0.85      0.85      0.85       931
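The figures in the report can be traced back to the confusion matrix. As a small worked example, the sketch below recomputes accuracy and the class-1 precision, recall, and F1-score from the four counts printed above, relying on scikit-learn's layout in which rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
conf = [[558, 52],
        [86, 235]]

tn, fp = conf[0]  # true class 0: correctly predicted 0, wrongly predicted 1
fn, tp = conf[1]  # true class 1: wrongly predicted 0, correctly predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of those predicted 1, how many were 1
recall = tp / (tp + fn)     # of those truly 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.8518 precision=0.82 recall=0.73 f1=0.77
```

The recall of 0.73 for class 1 means that 86 of the 321 positive cases in the test set were missed, which the overall accuracy alone does not reveal.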



## 6. Deployment
Once you are satisfied with your model's performance, you may:

*   Save the model for future predictions.
*   Deploy the model in a production environment (e.g., Flask or FastAPI web apps).

In [40]:
# Deployment can involve saving the model and creating a service to use it
import joblib

# Save the model
joblib.dump(model, 'employee_leave_model.pkl')

# Load the model when needed
loaded_model = joblib.load('employee_leave_model.pkl')
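A loaded model expects the same feature columns, in the same order, as it was trained on. The sketch below shows one possible way to prepare a single raw record for prediction. The record itself and the names `feature_columns`, `new_record`, and `aligned` are hypothetical; the column list is copied from the encoded output shown earlier, and in practice it would be safer to save `list(X.columns)` alongside the model.

```python
import pandas as pd

# Feature columns as they appear after encoding in the output above
# (LeaveOrNot is omitted because it is the target)
feature_columns = [
    "JoiningYear", "PaymentTier", "Age", "ExperienceInCurrentDomain",
    "Education_Masters", "Education_PHD", "City_New Delhi", "City_Pune",
    "Gender_Male", "EverBenched_Yes",
]

# Hypothetical new employee record in the raw (pre-encoding) format
new_record = pd.DataFrame([{
    "Education": "Masters", "JoiningYear": 2016, "City": "Pune",
    "PaymentTier": 2, "Age": 30, "Gender": "Female",
    "EverBenched": "No", "ExperienceInCurrentDomain": 4,
}])

# One-hot encode without drop_first, then keep exactly the training columns:
# dummies the model never saw are dropped, missing ones are filled with 0
aligned = (
    pd.get_dummies(new_record)
    .reindex(columns=feature_columns, fill_value=0)
    .astype(int)
)

print(aligned.iloc[0].to_dict())
# With the model loaded above, a prediction would be: loaded_model.predict(aligned)
```

Encoding a single record with `drop_first=True` would drop every dummy column, since each categorical column has only one level in a one-row frame. That is why the sketch encodes fully and then reindexes.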

