Importing Libraries
First, we import the libraries needed for data handling, preprocessing, modeling, and evaluation:

In [2]:
import random
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier



Sample Employee Data
We define a sample employee record, which we will use later to predict whether that employee will be promoted, and load the synthetic dataset used to train the model.

In [3]:
# Sample employee data
sample_employee = {
    "Job Title": "Software Engineer",
    "Years in Role": 5,
    "Performance Rating": 8,
    "Satisfaction Level": 7,
    "Job Stability": "Yes",
    "Salary Growth Rate": 12.5,
}

# Load the synthetic dataset from CSV into the df variable
df = pd.read_csv("synthetic_employee_promotion_data.csv")





Data Preprocessing: Encoding, Missing Values, Scaling
We separate the features (X) from the target variable (y). The target, Promotion, is converted to binary values (1 for "Yes", 0 for any other value), and the data is split into training (80%) and test (20%) sets. Missing values, scaling, and categorical encoding are handled by the preprocessing pipeline defined in the next step:

In [4]:
# Separate the features from the target
X = df.drop('Promotion', axis=1)
y = df['Promotion'].apply(lambda x: 1 if x == 'Yes' else 0)  # Convert labels to binary (0 or 1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
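
The split above is not stratified, so the share of promoted employees can differ between the training and test sets. Below is a minimal sketch of a stratified split on toy data. The names `toy_X` and `toy_y` and their values are illustrative assumptions, not the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 rows, 30% positive labels (hypothetical values)
toy_X = pd.DataFrame({"Years in Role": range(100)})
toy_y = pd.Series([1] * 30 + [0] * 70, name="Promotion")

# stratify=toy_y asks the splitter to preserve the class ratio in both subsets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

train_pos_rate = ys_train.mean()
test_pos_rate = ys_test.mean()
print(f"Positive rate in train: {train_pos_rate:.2f}, test: {test_pos_rate:.2f}")
```

Passing `stratify=y` to the split in the cell above would apply the same idea to the real data.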



Preprocessing Pipeline
We define a column transformer that handles numerical and categorical columns separately. Numerical columns are imputed with the mean and standardized; categorical columns are imputed with the most frequent value and one-hot encoded. The transformer is then chained with a logistic regression classifier in a single pipeline:

In [6]:
# Define a column transformer to handle categorical and numerical columns separately
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),  # Handle missing values in numerical columns
            ('scaler', StandardScaler())  # Scale numerical data
        ]), ['Years in Role', 'Performance Rating', 'Satisfaction Level', 'Salary Growth Rate']),

        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values in categorical columns
            ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical data
        ]), ['Job Title', 'Job Stability'])
    ])

# Create a pipeline with preprocessing and the model
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=500, random_state=42))
])
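
To see what the preprocessing step produces, the sketch below applies a transformer with the same layout to a four-row toy frame that contains missing values. The toy rows and the names `toy_df` and `toy_preprocessor` are assumptions made for illustration; the real dataset will yield a different number of one-hot columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ['Years in Role', 'Performance Rating', 'Satisfaction Level', 'Salary Growth Rate']
cat_cols = ['Job Title', 'Job Stability']

# Toy rows with deliberately missing entries (hypothetical values)
toy_df = pd.DataFrame({
    'Years in Role': [1, 5, np.nan, 3],
    'Performance Rating': [6, 8, 7, np.nan],
    'Satisfaction Level': [5, 7, 6, 8],
    'Salary Growth Rate': [3.0, 12.5, np.nan, 8.0],
    'Job Title': ['Software Engineer', 'Manager', np.nan, 'Software Engineer'],
    'Job Stability': ['Yes', 'No', 'Yes', np.nan],
})

# Same structure as the preprocessor defined above
toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), num_cols),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), cat_cols),
])

transformed = toy_preprocessor.fit_transform(toy_df)
# The result may be a sparse matrix; convert it so it can be inspected uniformly
if hasattr(transformed, 'toarray'):
    transformed = transformed.toarray()

print("Output shape:", transformed.shape)
print("Any missing values left:", bool(np.isnan(transformed).any()))
```

The output has one column per numerical feature plus one column per distinct category seen during fitting, and the imputers leave no missing values behind.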



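Fitting the Model Pipeline
We fit the pipeline on the training set. It first preprocesses the data (imputation, scaling, encoding) and then trains the logistic regression model. We also convert the sample employee's data into a DataFrame so that it matches the model's input format: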

In [7]:
# Fit the model
model_pipeline.fit(X_train, y_train)

# Convert sample employee data into a DataFrame (so it matches the model input format)
sample_df = pd.DataFrame([sample_employee])
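
Because the column transformer selects columns by name, a sample row that lacks one of those columns would make prediction fail. The sketch below is one possible sanity check; the `expected_columns` list is copied by hand from the transformer definition, and the names `missing` and `extra` are illustrative.

```python
import pandas as pd

# Column names the preprocessing step selects (copied from the column transformer)
expected_columns = [
    'Years in Role', 'Performance Rating', 'Satisfaction Level',
    'Salary Growth Rate', 'Job Title', 'Job Stability',
]

sample_employee = {
    "Job Title": "Software Engineer",
    "Years in Role": 5,
    "Performance Rating": 8,
    "Satisfaction Level": 7,
    "Job Stability": "Yes",
    "Salary Growth Rate": 12.5,
}

# A list containing one dict yields a one-row DataFrame
sample_df = pd.DataFrame([sample_employee])

missing = [c for c in expected_columns if c not in sample_df.columns]
extra = [c for c in sample_df.columns if c not in expected_columns]
print("Missing columns:", missing)
print("Unexpected columns:", extra)
```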




Predicting the Sample Employee's Promotion
We use the trained model to predict whether the sample employee will be promoted, then evaluate the model on the test set with an accuracy score, a confusion matrix, and a classification report:

In [8]:
# Use the trained model to predict the promotion (1 = Yes, 0 = No)
predicted_promotion = model_pipeline.predict(sample_df)

print(predicted_promotion)


# Evaluate the model on the test data
y_pred = model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy Score on Test Data: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")

[0]
Accuracy Score on Test Data: 0.71
Confusion Matrix:
[[142   0]
 [ 58   0]]
Classification Report:
              precision    recall  f1-score   support

           0       0.71      1.00      0.83       142
           1       0.00      0.00      0.00        58

    accuracy                           0.71       200
   macro avg       0.35      0.50      0.42       200
weighted avg       0.50      0.71      0.59       200
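
The confusion matrix shows that the model predicted class 0 for every test row, so the 0.71 accuracy equals the share of the majority class in the test set. Below is a small sketch of that arithmetic, with the matrix values copied from the output above (variable names are illustrative):

```python
# Confusion matrix from the output above: rows = actual class, columns = predicted class
conf = [[142, 0],
        [58, 0]]

total = sum(sum(row) for row in conf)

# Accuracy: correctly classified rows (the diagonal) over all rows
accuracy = (conf[0][0] + conf[1][1]) / total

# Accuracy of always predicting the most common actual class
majority_baseline = max(sum(row) for row in conf) / total

# Number of rows the model assigned to class 1
predicted_positive = conf[0][1] + conf[1][1]

print(f"Accuracy: {accuracy:.2f}")
print(f"Majority-class baseline: {majority_baseline:.2f}")
print(f"Rows predicted as promoted: {predicted_positive}")
```

Since no row is predicted as class 1, precision for that class has no defined value, which is why the report lists it as 0.00.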



