# HR Analytics — Predict Employee Attrition

**Objective:**  
To analyze HR data and predict employee attrition using classification models.

**Dataset:**  
IBM HR Analytics Employee Attrition & Performance dataset.

**Tools Used:**  
- Python (Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn)
- Jupyter Notebook

**Deliverables:**  
- Data Cleaning & Preprocessing
- Exploratory Data Analysis (EDA)
- Machine Learning Models (Logistic Regression, Decision Tree)
- Model Evaluation Metrics (Accuracy, Confusion Matrix, Classification Report)
- Insights & Conclusion

---


In [2]:
import pandas as pd

file_path = r"C:\Users\susan\Downloads\archive (10)\WA_Fn-UseC_-HR-Employee-Attrition.csv"
data = pd.read_csv(file_path)

data.head()


|   | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


## Exploratory Data Analysis (EDA)

- Shape of the dataset
- Data types and null values
- Summary statistics for numerical features
- Target variable ('Attrition') distribution


In [3]:
# Shape of dataset
print("Shape of the dataset:", data.shape)

# Information about columns
print("\nInfo about the dataset:")
print(data.info())

# Check for missing values
print("\nMissing values in each column:")
print(data.isnull().sum())

# Summary statistics of numerical features
print("\nSummary Statistics:")
print(data.describe())

# Distribution of Target Variable - 'Attrition'
print("\nAttrition value counts:")
print(data['Attrition'].value_counts())


Shape of the dataset: (1470, 35)

Info about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement      
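
Raw counts for the target are easier to interpret alongside class proportions, because accuracy on an imbalanced target can look high for a model that never predicts the minority class. The sketch below uses a hypothetical toy column (`toy`, with made-up values) rather than the real data; in the notebook the same calls would apply to `data['Attrition']`.

```python
import pandas as pd

# Hypothetical toy target column; values are made up for illustration.
toy = pd.DataFrame({"Attrition": ["No"] * 8 + ["Yes"] * 2})

# Absolute counts and relative frequencies of each class.
counts = toy["Attrition"].value_counts()
shares = toy["Attrition"].value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = shares.max()

print(counts)
print(shares.round(3))
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model accuracy reported later is only meaningful relative to this baseline.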

## Data Preprocessing

- Drop uninformative columns: 'EmployeeCount', 'Over18', and 'StandardHours' hold a single constant value, and 'EmployeeNumber' is only a row identifier
- Encode categorical variables using Label Encoding
- Verify data after encoding
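
One way to decide which columns to drop is to count distinct values per column: a column with one distinct value is constant, and a column with as many distinct values as rows behaves like an identifier. This is a minimal sketch on a hypothetical toy frame (`toy`), not the real dataset, and the identifier check is only a heuristic, since a continuous measurement could also be unique per row.

```python
import pandas as pd

# Hypothetical toy frame; column names echo the dataset, values are made up.
toy = pd.DataFrame({
    "Age": [41, 49, 41, 33],
    "EmployeeCount": [1, 1, 1, 1],       # same value in every row
    "StandardHours": [80, 80, 80, 80],   # same value in every row
    "EmployeeNumber": [1, 2, 4, 5],      # different value in every row
    "Department": ["Sales", "R&D", "R&D", "Sales"],
})

# Number of distinct values per column.
n_unique = toy.nunique()

# Constant columns carry no information for a classifier.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns unique per row are identifier candidates and need a manual review.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("Constant columns:", constant_cols)
print("Identifier-like columns:", id_like_cols)

# Drop both groups once they have been reviewed.
reduced = toy.drop(columns=constant_cols + id_like_cols)
print(reduced.columns.tolist())
```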


In [4]:
from sklearn.preprocessing import LabelEncoder

# Drop irrelevant columns
data = data.drop(['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber'], axis=1)

# Encode 'Attrition' (Target Variable)
le = LabelEncoder()
data['Attrition'] = le.fit_transform(data['Attrition'])  # Yes=1, No=0

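# Identify the remaining categorical features
# ('Attrition' is already numeric at this point, so it is not selected again)
cat_cols = data.select_dtypes(include='object').columns

# Encode each categorical column with its own encoder so that the fitted
# mappings stay available (for example for inverse_transform)
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])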

# Check dataset after encoding
data.head()


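|   | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1 | 2 | 1102 | 2 | 1 | 2 | 1 | 2 | 0 | ... | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | 0 | 1 | 279 | 1 | 8 | 1 | 1 | 3 | 1 | ... | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | 1 | 2 | 1373 | 1 | 2 | 2 | 4 | 4 | 1 | ... | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | 0 | 1 | 1392 | 1 | 3 | 4 | 1 | 4 | 0 | ... | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | 0 | 2 | 591 | 1 | 2 | 1 | 3 | 1 | 1 | ... | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |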
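
Label encoding maps each category to an arbitrary integer, which a linear model such as Logistic Regression reads as an ordering (for example, Department 2 counting as "more" than Department 1). A common alternative for nominal columns is one-hot encoding. The sketch below is an assumption about how that could look, not the approach used in this notebook; it runs on a hypothetical toy frame (`toy`) with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Department": [
        "Sales",
        "Research & Development",
        "Research & Development",
        "Human Resources",
    ],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# One indicator column per category; drop_first removes one level per
# feature so the indicators are not perfectly collinear.
encoded = pd.get_dummies(
    toy,
    columns=["Department", "Gender"],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.tolist())
print(encoded)
```

Tree-based models are less sensitive to this choice, so the difference matters mainly for the Logistic Regression model.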


## Feature Selection & Train-Test Split

- Separate Features (X) and Target (y)
- Split data into Training and Testing sets (80/20 split)
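- Standardize features with StandardScaler (zero mean, unit variance), fitting on the training set only and reusing the fitted scaler on the test set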


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define Features & Target
X = data.drop('Attrition', axis=1)
y = data['Attrition']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

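# Scale features: fit on the training set only so that no test-set
# statistics leak into training, then apply the same transform to the test set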
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
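
With an imbalanced target, a plain random split can leave the training and test sets with different class proportions. Passing `stratify` to `train_test_split` keeps them aligned, and wrapping the scaler and model in a pipeline keeps the fit-on-train-only rule automatic. This is a sketch under assumptions, not the notebook's method: the data is synthetic (`make_classification` with arbitrary class weights), and the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for X and y (arbitrary 80/20 weights).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)

# stratify=y_demo preserves the class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only and reuses
# the fitted scaler whenever it predicts or scores.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

print("Positive share (train):", round(y_tr.mean(), 3))
print("Positive share (test): ", round(y_te.mean(), 3))
print("Test accuracy:", round(pipe.score(X_te, y_te), 3))
```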


## Model Building

- Logistic Regression
- Decision Tree Classifier


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)
lr_preds = lr.predict(X_test_scaled)

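# Decision Tree Classifier
# (tree splits do not depend on feature scale; the scaled inputs are
# reused here only so both models see the same data)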
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train_scaled, y_train)
dt_preds = dt.predict(X_test_scaled)
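
The deliverables list accuracy, a confusion matrix, and a classification report as evaluation metrics. The sketch below shows one way these could be computed for both models with `sklearn.metrics`. It is self-contained and runs on synthetic data, so the numbers it prints say nothing about the attrition dataset; the names `X_demo`, `y_demo`, and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and target.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same scaling pattern as above: fit on train, transform both.
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr_s, y_tr)
    preds = model.predict(X_te_s)
    results[name] = {
        "accuracy": accuracy_score(y_te, preds),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_te, preds),
    }
    print(f"\n{name}")
    print("Accuracy:", round(results[name]["accuracy"], 3))
    print(results[name]["confusion_matrix"])
    print(classification_report(
        y_te, preds, target_names=["No", "Yes"], zero_division=0
    ))
```

In the notebook itself, the same three metric calls would take `y_test` together with `lr_preds` or `dt_preds`. For an imbalanced target, the recall on the "Yes" class in the classification report is usually more informative than overall accuracy.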
