# HumanForYou Employee Turnover Analysis

## Project Overview

This notebook analyzes employee turnover at HumanForYou, a pharmaceutical company in India with approximately 4,000 employees and an annual turnover rate of about 15%.

**Objective**: Identify factors influencing employee turnover and develop predictive models to help reduce attrition.

**Deliverables**:
1. Data exploration and preprocessing
2. Feature engineering
3. Development and comparison of multiple ML models
4. Model interpretation and insights
5. Actionable recommendations



In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Import custom modules
import sys
sys.path.append('../src')
from data_loader import (
    load_general_data, load_manager_survey, load_employee_survey,
    load_working_hours_data, merge_all_data
)
from preprocessing import (
    handle_missing_values, encode_categorical_variables,
    create_features, prepare_features_for_modeling, scale_features
)
from model_evaluation import (
    evaluate_model, plot_confusion_matrix, plot_roc_curve,
    compare_models, print_classification_report
)

# Machine Learning libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import xgboost as xgb
import lightgbm as lgb

# Model interpretation
import shap

print("Libraries imported successfully!")



## 1. Data Loading

Load all available datasets:
- General HR data
- Manager survey data
- Employee survey data
- Working hours data (from CSV files, falling back to the ZIP archive)



In [None]:
# Load all datasets
print("Loading datasets...")
general_df = load_general_data('../data/general_data.csv')
manager_df = load_manager_survey('../data/manager_survey_data.csv')
employee_df = load_employee_survey('../data/employee_survey_data.csv')

# Load working hours data (try CSV files first, then ZIP)
try:
    in_time_df, out_time_df = load_working_hours_data(
        zip_path='../data/in_out_time.zip',
        in_time_path='../data/in_time.csv',
        out_time_path='../data/out_time.csv'
    )
    if in_time_df is not None and out_time_df is not None:
        has_working_hours = True
        print("Working hours data loaded successfully!")
    else:
        has_working_hours = False
        print("Warning: Working hours data not available")
except Exception as e:
    print(f"Warning: Could not load working hours data: {e}")
    has_working_hours = False
    in_time_df, out_time_df = None, None

print("\nData loading complete!")
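
The loaders above come from the project's `data_loader` module, whose source is not shown here. As a rough sketch of how badge-in and badge-out times could be turned into a per-employee feature, assuming wide tables with one row per employee and one timestamp column per day (the layout, the sample values, and the `average_daily_hours` helper are all hypothetical, not the module's actual code):

```python
import pandas as pd

def average_daily_hours(in_time: pd.DataFrame, out_time: pd.DataFrame) -> pd.Series:
    """Mean hours worked per employee; days with a missing badge time are skipped."""
    # Parse every cell as a timestamp; empty or unparseable cells become NaT.
    t_in = in_time.apply(pd.to_datetime, errors="coerce")
    t_out = out_time.apply(pd.to_datetime, errors="coerce")
    # Element-wise difference, converted from Timedelta to fractional hours.
    hours = (t_out - t_in).apply(lambda col: col.dt.total_seconds() / 3600.0)
    # The row-wise mean ignores NaN entries (absences, holidays).
    return hours.mean(axis=1)

# Made-up sample: two employees, two working days.
in_time = pd.DataFrame(
    {"2015-01-02": ["2015-01-02 09:00:00", "2015-01-02 10:00:00"],
     "2015-01-05": ["2015-01-05 09:30:00", None]},
    index=[1, 2],
)
out_time = pd.DataFrame(
    {"2015-01-02": ["2015-01-02 17:00:00", "2015-01-02 17:30:00"],
     "2015-01-05": ["2015-01-05 17:00:00", None]},
    index=[1, 2],
)
avg_hours = average_daily_hours(in_time, out_time)
print(avg_hours)
```

Skipping missing days rather than counting them as zero hours keeps absences from dragging down the average; whether that is the right choice depends on how absences should be interpreted.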



## 2. Data Exploration

### 2.1 Initial Data Inspection



In [None]:
# Display basic information about each dataset
print("="*60)
print("GENERAL DATA")
print("="*60)
print(f"Shape: {general_df.shape}")
print(f"\nColumns: {list(general_df.columns)}")
print(f"\nFirst few rows:")
display(general_df.head())
print(f"\nData types:")
print(general_df.dtypes)
print(f"\nMissing values:")
print(general_df.isnull().sum()[general_df.isnull().sum() > 0])

print("\n" + "="*60)
print("MANAGER SURVEY DATA")
print("="*60)
print(f"Shape: {manager_df.shape}")
display(manager_df.head())
print(f"\nMissing values:")
print(manager_df.isnull().sum()[manager_df.isnull().sum() > 0])

print("\n" + "="*60)
print("EMPLOYEE SURVEY DATA")
print("="*60)
print(f"Shape: {employee_df.shape}")
display(employee_df.head())
print(f"\nMissing values:")
print(employee_df.isnull().sum()[employee_df.isnull().sum() > 0])
print(f"\nNA values (string):")
for col in employee_df.columns:
    if col != 'EmployeeID':
        na_count = (employee_df[col] == 'NA').sum()
        if na_count > 0:
            print(f"{col}: {na_count}")



### 2.2 Target Variable Analysis



In [None]:
# Analyze target variable (Attrition)
if 'Attrition' in general_df.columns:
    print("Attrition Distribution:")
    print(general_df['Attrition'].value_counts())
    print(f"\nAttrition Rate: {(general_df['Attrition'] == 'Yes').sum() / len(general_df) * 100:.2f}%")
    
    # Visualize attrition distribution
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Count plot
    general_df['Attrition'].value_counts().plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c'])
    axes[0].set_title('Attrition Distribution')
    axes[0].set_xlabel('Attrition')
    axes[0].set_ylabel('Count')
    axes[0].tick_params(axis='x', rotation=0)
    
    # Pie chart
    general_df['Attrition'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%', 
                                                colors=['#2ecc71', '#e74c3c'])
    axes[1].set_title('Attrition Proportion')
    axes[1].set_ylabel('')
    
    plt.tight_layout()
    plt.show()
    
    print("\nNote: This is an imbalanced dataset, which we'll need to address during modeling.")
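
To see why the imbalance noted above matters, consider a trivial classifier that always predicts "No". Using the approximate figures quoted in the overview (illustrative counts, not values read from the data), it reaches high accuracy while catching no leavers at all:

```python
# Illustrative counts only: 4,000 employees at a 15% attrition rate.
n_total = 4000
n_leavers = n_total * 15 // 100   # 600
n_stayers = n_total - n_leavers   # 3400

# A "model" that always predicts "No" gets every stayer right and every leaver wrong.
baseline_accuracy = n_stayers / n_total
baseline_recall = 0 / n_leavers

print(f"Accuracy: {baseline_accuracy:.2%}, recall on leavers: {baseline_recall:.2%}")
```

This is why the later sections report precision, recall, F1, and ROC-AUC alongside accuracy.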



### 2.3 Exploratory Data Analysis

#### 2.3.1 Demographic Analysis



In [None]:
# Analyze demographic factors
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Age distribution
if 'Age' in general_df.columns:
    general_df.groupby('Attrition')['Age'].hist(alpha=0.7, bins=20, ax=axes[0,0], legend=True)
    axes[0,0].set_title('Age Distribution by Attrition')
    axes[0,0].set_xlabel('Age')
    axes[0,0].set_ylabel('Frequency')
    axes[0,0].legend(['No', 'Yes'])

# Gender analysis
if 'Gender' in general_df.columns:
    gender_attrition = pd.crosstab(general_df['Gender'], general_df['Attrition'], normalize='index') * 100
    gender_attrition.plot(kind='bar', ax=axes[0,1], color=['#2ecc71', '#e74c3c'])
    axes[0,1].set_title('Attrition Rate by Gender')
    axes[0,1].set_xlabel('Gender')
    axes[0,1].set_ylabel('Percentage')
    axes[0,1].legend(['No', 'Yes'])
    axes[0,1].tick_params(axis='x', rotation=0)

# Marital Status
if 'MaritalStatus' in general_df.columns:
    marital_attrition = pd.crosstab(general_df['MaritalStatus'], general_df['Attrition'], normalize='index') * 100
    marital_attrition.plot(kind='bar', ax=axes[1,0], color=['#2ecc71', '#e74c3c'])
    axes[1,0].set_title('Attrition Rate by Marital Status')
    axes[1,0].set_xlabel('Marital Status')
    axes[1,0].set_ylabel('Percentage')
    axes[1,0].legend(['No', 'Yes'])
    axes[1,0].tick_params(axis='x', rotation=45)

# Education Field
if 'EducationField' in general_df.columns:
    edu_attrition = pd.crosstab(general_df['EducationField'], general_df['Attrition'], normalize='index') * 100
    edu_attrition.plot(kind='barh', ax=axes[1,1], color=['#2ecc71', '#e74c3c'])
    axes[1,1].set_title('Attrition Rate by Education Field')
    axes[1,1].set_xlabel('Percentage')
    axes[1,1].legend(['No', 'Yes'])

plt.tight_layout()
plt.show()



#### 2.3.2 Job-Related Factors



In [None]:
# Analyze job-related factors
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Monthly Income
if 'MonthlyIncome' in general_df.columns:
    general_df.boxplot(column='MonthlyIncome', by='Attrition', ax=axes[0,0])
    axes[0,0].set_title('Monthly Income by Attrition')
    axes[0,0].set_xlabel('Attrition')
    axes[0,0].set_ylabel('Monthly Income (Rupees)')
    axes[0,0].set_xticklabels(['No', 'Yes'])

# Job Level
if 'JobLevel' in general_df.columns:
    joblevel_attrition = pd.crosstab(general_df['JobLevel'], general_df['Attrition'], normalize='index') * 100
    joblevel_attrition.plot(kind='bar', ax=axes[0,1], color=['#2ecc71', '#e74c3c'])
    axes[0,1].set_title('Attrition Rate by Job Level')
    axes[0,1].set_xlabel('Job Level')
    axes[0,1].set_ylabel('Percentage')
    axes[0,1].legend(['No', 'Yes'])

# Years at Company
if 'YearsAtCompany' in general_df.columns:
    general_df.boxplot(column='YearsAtCompany', by='Attrition', ax=axes[1,0])
    axes[1,0].set_title('Years at Company by Attrition')
    axes[1,0].set_xlabel('Attrition')
    axes[1,0].set_ylabel('Years at Company')
    axes[1,0].set_xticklabels(['No', 'Yes'])

# Job Role
if 'JobRole' in general_df.columns:
    jobrole_attrition = pd.crosstab(general_df['JobRole'], general_df['Attrition'], normalize='index') * 100
    jobrole_attrition['Yes'].sort_values(ascending=True).plot(kind='barh', ax=axes[1,1], color='#e74c3c')
    axes[1,1].set_title('Attrition Rate by Job Role')
    axes[1,1].set_xlabel('Attrition Rate (%)')

fig.suptitle('')  # boxplot(by=...) adds a figure-level title; clear it
plt.tight_layout()
plt.show()



#### 2.3.3 Survey Data Analysis



In [None]:
# Merge data for survey analysis
merged_temp = general_df.merge(manager_df, on='EmployeeID', how='left')
merged_temp = merged_temp.merge(employee_df, on='EmployeeID', how='left')

# Analyze survey responses
survey_cols = ['JobInvolvement', 'PerformanceRating', 'EnvironmentSatisfaction', 
               'JobSatisfaction', 'WorkLifeBalance']

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

for idx, col in enumerate(survey_cols):
    if col in merged_temp.columns:
        # Replace 'NA' with NaN for proper handling
        merged_temp[col] = merged_temp[col].replace('NA', np.nan)
        merged_temp[col] = pd.to_numeric(merged_temp[col], errors='coerce')
        
        # Create crosstab
        survey_attrition = pd.crosstab(merged_temp[col], merged_temp['Attrition'], normalize='index') * 100
        survey_attrition.plot(kind='bar', ax=axes[idx], color=['#2ecc71', '#e74c3c'])
        axes[idx].set_title(f'Attrition Rate by {col}')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Percentage')
        axes[idx].legend(['No', 'Yes'])
        axes[idx].tick_params(axis='x', rotation=0)

# Remove extra subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()



## 3. Data Preprocessing

### 3.1 Merge All Data Sources



In [None]:
# Merge all datasets
if has_working_hours:
    df = merge_all_data(general_df, manager_df, employee_df, in_time_df, out_time_df)
else:
    df = merge_all_data(general_df, manager_df, employee_df)

print(f"Final merged dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")

# Check for duplicates
print(f"\nDuplicate EmployeeIDs: {df['EmployeeID'].duplicated().sum()}")



### 3.2 Handle Missing Values



In [None]:
# Display missing values before preprocessing
print("Missing values before preprocessing:")
missing_before = df.isnull().sum()
missing_before = missing_before[missing_before > 0]
if len(missing_before) > 0:
    print(missing_before)
else:
    print("No missing values found (except 'NA' strings in survey data)")

# Handle missing values
df_clean = handle_missing_values(df, strategy='median')

# Display missing values after preprocessing
print("\nMissing values after preprocessing:")
missing_after = df_clean.isnull().sum()
missing_after = missing_after[missing_after > 0]
if len(missing_after) > 0:
    print(missing_after)
else:
    print("All missing values handled!")
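
The source of `handle_missing_values` is not shown in this notebook. A minimal sketch of what a median strategy could look like for the survey columns, assuming `'NA'` strings mark missing answers (the `impute_median` helper and the sample values are hypothetical):

```python
import pandas as pd

def impute_median(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with 'NA' strings coerced to NaN and NaN filled by the column median."""
    out = df.copy()
    for col in columns:
        # 'NA' strings (and anything else non-numeric) become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
        # The median is computed over the observed values only.
        out[col] = out[col].fillna(out[col].median())
    return out

survey = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "JobSatisfaction": ["4", "NA", "2", "3"],
})
clean = impute_median(survey, ["JobSatisfaction"])
print(clean)
```

In a stricter setup the median would be computed on the training split only and then reused for the test split, so that no test information reaches the model.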



### 3.3 Feature Engineering



In [None]:
# Create additional features
df_features = create_features(df_clean)

print(f"Original features: {df_clean.shape[1]}")
print(f"Features after engineering: {df_features.shape[1]}")
print(f"New features added: {df_features.shape[1] - df_clean.shape[1]}")

# Display new features
new_features = set(df_features.columns) - set(df_clean.columns)
if new_features:
    print(f"\nNew features created: {list(new_features)}")



### 3.4 Encode Categorical Variables



In [None]:
# Encode categorical variables
df_encoded, encoders = encode_categorical_variables(df_features, target_col='Attrition')

print("Categorical variables encoded:")
for col, encoder in encoders.items():
    if col != 'Attrition':
        print(f"  {col}: {len(encoder.classes_)} unique values")

# Check target variable encoding
if 'Attrition' in encoders:
    print(f"\nAttrition encoding: {dict(zip(encoders['Attrition'].classes_, range(len(encoders['Attrition'].classes_))))}")

print(f"\nFinal dataset shape: {df_encoded.shape}")



### 3.5 Prepare Features for Modeling



In [None]:
# Prepare features and target
X, y = prepare_features_for_modeling(df_encoded, target_col='Attrition')

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nTarget distribution (%):")
print(y.value_counts(normalize=True) * 100)

# Display feature names
print(f"\nFeature names ({len(X.columns)} features):")
print(list(X.columns))



### 3.6 Train-Test Split



In [None]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nTraining set target distribution:")
print(y_train.value_counts(normalize=True) * 100)
print(f"\nTest set target distribution:")
print(y_test.value_counts(normalize=True) * 100)



### 3.7 Handle Class Imbalance



In [None]:
# Apply SMOTE to handle class imbalance
print("Before SMOTE:")
print(f"Class distribution: {pd.Series(y_train).value_counts().to_dict()}")

smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

print("\nAfter SMOTE:")
print(f"Class distribution: {pd.Series(y_train_balanced).value_counts().to_dict()}")
print(f"Training set size: {X_train_balanced.shape[0]} samples")
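
SMOTE balances the classes by creating synthetic minority samples that lie on the line segment between a minority sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step with made-up values; it is not imblearn's implementation, and `synthesize` is a hypothetical helper:

```python
import numpy as np

def synthesize(x: np.ndarray, neighbor: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """One SMOTE-style synthetic point on the segment between x and a neighbor."""
    lam = rng.uniform(0.0, 1.0)  # interpolation weight in [0, 1)
    return x + lam * (neighbor - x)

rng = np.random.default_rng(42)
x = np.array([30.0, 2.0])         # e.g. (Age, YearsAtCompany) of one leaver; made up
neighbor = np.array([34.0, 4.0])  # a nearby leaver; made up
new_point = synthesize(x, neighbor, rng)
print(new_point)
```

Because the synthetic points are interpolations, they only make sense for the training split; the test set above is left untouched so that it still reflects the real class distribution.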



### 3.8 Feature Scaling



In [None]:
# Scale features (for algorithms that require it)
X_train_scaled, X_test_scaled, scaler = scale_features(
    pd.DataFrame(X_train_balanced, columns=X.columns),
    pd.DataFrame(X_test, columns=X.columns)
)

print("Features scaled successfully!")
print(f"Scaled training set shape: {X_train_scaled.shape}")
print(f"Scaled test set shape: {X_test_scaled.shape}")

# Also keep unscaled versions for tree-based models.
# Both are converted to NumPy arrays so training and test inputs share the same type.
X_train_unscaled = np.asarray(X_train_balanced)
X_test_unscaled = X_test.values
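
The source of `scale_features` is not shown. Assuming it standardizes each feature to zero mean and unit variance, the key point is that the statistics come from the training data and are then reused on the test data. A small sketch with made-up numbers:

```python
import numpy as np

train = np.array([[20.0, 1000.0], [30.0, 3000.0], [40.0, 5000.0]])
test = np.array([[30.0, 7000.0]])

# Statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics; never refit on test data

print(train_scaled)
print(test_scaled)
```

The test row's second feature lands well above 1 after scaling, which is expected: test values are not guaranteed to fall inside the range seen during training.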



## 4. Model Development

We will train and compare multiple classification models:
1. Logistic Regression
2. Random Forest
3. Gradient Boosting
4. XGBoost
5. LightGBM
6. Support Vector Machine (SVM)
7. K-Nearest Neighbors (KNN)



In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': xgb.XGBClassifier(random_state=42, eval_metric='logloss'),
    'LightGBM': lgb.LGBMClassifier(random_state=42, verbose=-1),
    'SVM': SVC(probability=True, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

# Store results
results = []

print("Training models...")
print("="*60)



In [None]:
# Train and evaluate each model
trained_models = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Use scaled data for scale-sensitive models (logistic regression, SVM, KNN);
    # tree-based models are trained on the unscaled data
    if name in ['Logistic Regression', 'SVM', 'KNN']:
        X_train_model = X_train_scaled
        X_test_model = X_test_scaled
    else:
        X_train_model = X_train_unscaled
        X_test_model = X_test_unscaled
    
    # Train model
    model.fit(X_train_model, y_train_balanced)
    trained_models[name] = model
    
    # Make predictions
    y_pred = model.predict(X_test_model)
    y_pred_proba = model.predict_proba(X_test_model)[:, 1]
    
    # Evaluate model
    metrics = evaluate_model(y_test, y_pred, y_pred_proba, model_name=name)
    results.append(metrics)
    
    print(f"  Accuracy: {metrics['Accuracy']:.4f}")
    print(f"  Precision: {metrics['Precision']:.4f}")
    print(f"  Recall: {metrics['Recall']:.4f}")
    print(f"  F1-Score: {metrics['F1-Score']:.4f}")
    print(f"  ROC-AUC: {metrics['ROC-AUC']:.4f}")

print("\n" + "="*60)
print("All models trained!")
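
The metrics printed above come from the project's `evaluate_model` helper, whose source is not shown. Under the standard definitions, they can be derived from the four confusion-matrix counts, as in this sketch (the `binary_metrics` helper and the labels are made up for illustration):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted leavers, how many left
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual leavers, how many were caught
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1-Score": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics_demo = binary_metrics(y_true, y_pred)
print(metrics_demo)
```

For attrition, recall on the positive class is often the metric of interest, since a missed leaver is usually costlier than a false alarm.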



### 4.1 Model Comparison



In [None]:
# Create results dataframe
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('ROC-AUC', ascending=False)

print("Model Performance Comparison:")
print("="*60)
display(results_df)

# Visualize comparison
fig = compare_models(results_df)
plt.show()
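
The table is sorted by ROC-AUC, which can be read as the probability that a randomly chosen leaver receives a higher predicted score than a randomly chosen stayer. A brute-force sketch of that interpretation (the `roc_auc_pairwise` helper and the scores are made up; library implementations compute this more efficiently):

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc_pairwise(y_true, scores)
print(f"ROC-AUC: {auc:.4f}")
```

Because it depends only on the ranking of scores, ROC-AUC is unaffected by the choice of decision threshold, which makes it a convenient sorting key when models are compared before any threshold is fixed.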



### 4.2 Detailed Evaluation of Best Model



In [None]:
# Select best model based on ROC-AUC
best_model_name = results_df.iloc[0]['Model']
best_model = trained_models[best_model_name]

print(f"Best Model: {best_model_name}")
print("="*60)

# Get predictions from best model
if best_model_name in ['Logistic Regression', 'SVM', 'KNN']:
    X_test_model = X_test_scaled
else:
    X_test_model = X_test_unscaled

y_pred_best = best_model.predict(X_test_model)
y_pred_proba_best = best_model.predict_proba(X_test_model)[:, 1]

# Detailed evaluation
print_classification_report(y_test, y_pred_best, model_name=best_model_name)

# Visualizations
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Confusion Matrix
plot_confusion_matrix(y_test, y_pred_best, model_name=best_model_name, ax=axes[0])

# ROC Curve
plot_roc_curve(y_test, y_pred_proba_best, model_name=best_model_name, ax=axes[1])

# Precision-Recall Curve
from model_evaluation import plot_precision_recall_curve
plot_precision_recall_curve(y_test, y_pred_proba_best, model_name=best_model_name, ax=axes[2])

plt.tight_layout()
plt.show()



## 5. Model Interpretation

### 5.1 Feature Importance



In [None]:
# Extract feature importance from tree-based models
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Importance': best_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    # Plot top 20 features
    plt.figure(figsize=(12, 8))
    top_features = feature_importance.head(20)
    plt.barh(range(len(top_features)), top_features['Importance'])
    plt.yticks(range(len(top_features)), top_features['Feature'])
    plt.xlabel('Feature Importance')
    plt.title(f'Top 20 Feature Importance - {best_model_name}')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
    print("Top 10 Most Important Features:")
    print(feature_importance.head(10))
elif hasattr(best_model, 'coef_'):
    # For linear models, use coefficients
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Coefficient': best_model.coef_[0]
    }).sort_values('Coefficient', key=abs, ascending=False)
    
    plt.figure(figsize=(12, 8))
    top_features = feature_importance.head(20)
    colors = ['red' if x < 0 else 'green' for x in top_features['Coefficient']]
    plt.barh(range(len(top_features)), top_features['Coefficient'], color=colors)
    plt.yticks(range(len(top_features)), top_features['Feature'])
    plt.xlabel('Coefficient Value')
    plt.title(f'Top 20 Feature Coefficients - {best_model_name}')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
    print("Top 10 Most Important Features (by absolute coefficient):")
    print(feature_importance.head(10))
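
When the best model is linear, the coefficient branch above ranks features by absolute coefficient. For logistic regression, each coefficient is a change in log-odds per unit of the (scaled) feature, so exponentiating it gives an odds ratio that is often easier to communicate. A sketch with made-up coefficient values:

```python
import math

# Made-up coefficients for illustration; real values come from best_model.coef_[0].
coefficients = {"JobSatisfaction": -0.40, "YearsSinceLastPromotion": 0.25}

odds_ratios = {feature: math.exp(coef) for feature, coef in coefficients.items()}

for feature, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{feature}: odds ratio {ratio:.3f} ({direction} the odds of attrition)")
```

An odds ratio below 1 means a one-unit increase in the scaled feature is associated with lower odds of attrition; above 1 means higher odds.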



### 5.2 SHAP Values (if applicable)



In [None]:
# Calculate SHAP values for tree-based models
if hasattr(best_model, 'feature_importances_') or isinstance(best_model, (xgb.XGBClassifier, lgb.LGBMClassifier)):
    try:
        # Use a sample for SHAP (it can be computationally expensive)
        sample_size = min(100, X_test_model.shape[0])
        X_sample = X_test_model[:sample_size]
        
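        # Create SHAP explainer
        explainer = shap.TreeExplainer(best_model)
        shap_values = explainer.shap_values(X_sample)

        # Some classifiers return one set of SHAP values per class
        # (a list, or a 3-D array); keep the positive class (attrition = 1)
        if isinstance(shap_values, list):
            shap_values = shap_values[1]
        elif getattr(shap_values, 'ndim', 2) == 3:
            shap_values = shap_values[:, :, 1]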
        
        # Plot SHAP summary
        plt.figure(figsize=(12, 8))
        shap.summary_plot(shap_values, X_sample, feature_names=X.columns, show=False, max_display=20)
        plt.tight_layout()
        plt.show()
        
        print("SHAP values calculated successfully!")
    except Exception as e:
        print(f"Could not calculate SHAP values: {e}")
        print("This is normal for some model types or if SHAP is not properly configured.")
else:
    print("SHAP analysis is most effective for tree-based models.")
    print("For linear models, coefficient analysis is provided above.")



## 6. Model Improvement

### 6.1 Hyperparameter Tuning



In [None]:
# Hyperparameter tuning for the best model
print(f"Tuning hyperparameters for {best_model_name}...")

# Define parameter grids based on model type
if best_model_name == 'Random Forest':
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 20, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
    base_model = RandomForestClassifier(random_state=42, n_jobs=-1)
    X_train_tune = X_train_unscaled
    X_test_tune = X_test_unscaled
    
elif best_model_name == 'XGBoost':
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2],
        'subsample': [0.8, 1.0]
    }
    base_model = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
    X_train_tune = X_train_unscaled
    X_test_tune = X_test_unscaled
    
elif best_model_name == 'LightGBM':
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2],
        'num_leaves': [31, 50, 70]
    }
    base_model = lgb.LGBMClassifier(random_state=42, verbose=-1)
    X_train_tune = X_train_unscaled
    X_test_tune = X_test_unscaled
    
elif best_model_name == 'Gradient Boosting':
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2]
    }
    base_model = GradientBoostingClassifier(random_state=42)
    X_train_tune = X_train_unscaled
    X_test_tune = X_test_unscaled
    
else:
    print("Hyperparameter tuning skipped for this model type.")
    param_grid = None

if param_grid:
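    # Perform grid search with cross-validation.
    # Note: the folds are drawn from SMOTE-balanced data, so validation folds contain
    # synthetic samples and the CV score is optimistic; rely on the test-set metrics below.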
    grid_search = GridSearchCV(
        base_model, 
        param_grid, 
        cv=5, 
        scoring='roc_auc', 
        n_jobs=-1, 
        verbose=1
    )
    
    grid_search.fit(X_train_tune, y_train_balanced)
    
    print(f"\nBest parameters: {grid_search.best_params_}")
    print(f"Best CV score: {grid_search.best_score_:.4f}")
    
    # Evaluate tuned model
    tuned_model = grid_search.best_estimator_
    y_pred_tuned = tuned_model.predict(X_test_tune)
    y_pred_proba_tuned = tuned_model.predict_proba(X_test_tune)[:, 1]
    
    metrics_tuned = evaluate_model(y_test, y_pred_tuned, y_pred_proba_tuned, 
                                   model_name=f"{best_model_name} (Tuned)")
    
    print("\nTuned Model Performance:")
    print(f"  Accuracy: {metrics_tuned['Accuracy']:.4f}")
    print(f"  Precision: {metrics_tuned['Precision']:.4f}")
    print(f"  Recall: {metrics_tuned['Recall']:.4f}")
    print(f"  F1-Score: {metrics_tuned['F1-Score']:.4f}")
    print(f"  ROC-AUC: {metrics_tuned['ROC-AUC']:.4f}")
    
    # Compare with original
    original_metrics = results_df[results_df['Model'] == best_model_name].iloc[0]
    print(f"\nImprovement in ROC-AUC: {metrics_tuned['ROC-AUC'] - original_metrics['ROC-AUC']:.4f}")
    
    # Update best model
    best_model = tuned_model
    best_model_name = f"{best_model_name} (Tuned)"
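
Grid search cost grows multiplicatively with the number of options per parameter, so it is worth estimating the number of model fits before running the cell above. A quick count for the Random Forest grid defined there:

```python
from math import prod

# Same grid as the Random Forest branch above.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best candidate on the full training set. If the runtime is too long, a randomized search over the same ranges is a common alternative.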



## 7. Key Findings

### 7.1 Key Factors Influencing Turnover



In [None]:
# Extract and summarize key findings
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Importance': best_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print("KEY FACTORS INFLUENCING EMPLOYEE TURNOVER")
    print("="*60)
    print("\nTop 15 Factors (by importance):")
    print(feature_importance.head(15).to_string(index=False))
    
    # Categorize factors
    print("\n" + "="*60)
    print("FACTOR CATEGORIZATION")
    print("="*60)
    
    # Job-related
    job_factors = [f for f in feature_importance.head(15)['Feature'] 
                   if any(x in f.lower() for x in ['job', 'role', 'level', 'department'])]
    print(f"\nJob-Related Factors: {len(job_factors)}")
    for f in job_factors:
        print(f"  - {f}")
    
    # Compensation
    comp_factors = [f for f in feature_importance.head(15)['Feature'] 
                    if any(x in f.lower() for x in ['income', 'salary', 'stock', 'hike'])]
    print(f"\nCompensation-Related Factors: {len(comp_factors)}")
    for f in comp_factors:
        print(f"  - {f}")
    
    # Experience
    exp_factors = [f for f in feature_importance.head(15)['Feature'] 
                   if any(x in f.lower() for x in ['year', 'experience', 'tenure', 'promotion'])]
    print(f"\nExperience/Tenure-Related Factors: {len(exp_factors)}")
    for f in exp_factors:
        print(f"  - {f}")
    
    # Satisfaction
    sat_factors = [f for f in feature_importance.head(15)['Feature'] 
                   if any(x in f.lower() for x in ['satisfaction', 'involvement', 'balance', 'rating'])]
    print(f"\nSatisfaction-Related Factors: {len(sat_factors)}")
    for f in sat_factors:
        print(f"  - {f}")
    
    # Work-life
    wlb_factors = [f for f in feature_importance.head(15)['Feature'] 
                   if any(x in f.lower() for x in ['travel', 'distance', 'overtime', 'hours', 'balance'])]
    print(f"\nWork-Life Balance Factors: {len(wlb_factors)}")
    for f in wlb_factors:
        print(f"  - {f}")



### 7.2 Statistical Analysis of Key Factors



In [None]:
# Analyze key factors statistically
key_factors = ['MonthlyIncome', 'YearsAtCompany', 'JobSatisfaction', 
               'EnvironmentSatisfaction', 'WorkLifeBalance', 'JobLevel',
               'YearsSinceLastPromotion', 'DistanceFromHome', 'OverTime']

# Merge back with target for analysis
analysis_df = df_encoded.copy()

print("STATISTICAL ANALYSIS OF KEY FACTORS")
print("="*60)

for factor in key_factors:
    if factor in analysis_df.columns:
        print(f"\n{factor}:")
        print("-" * 40)
        
        # Compare means between attrition groups
        if pd.api.types.is_numeric_dtype(analysis_df[factor]):
            no_attrition = analysis_df[analysis_df['Attrition'] == 0][factor]
            yes_attrition = analysis_df[analysis_df['Attrition'] == 1][factor]
            
            print(f"  No Attrition - Mean: {no_attrition.mean():.2f}, Median: {no_attrition.median():.2f}")
            print(f"  Attrition - Mean: {yes_attrition.mean():.2f}, Median: {yes_attrition.median():.2f}")
            print(f"  Difference: {yes_attrition.mean() - no_attrition.mean():.2f}")
        else:
            # For categorical variables
            crosstab = pd.crosstab(analysis_df[factor], analysis_df['Attrition'], normalize='index') * 100
            print(f"  Attrition rates by category:")
            print(crosstab[1].sort_values(ascending=False).head())
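
The raw mean differences printed above are hard to compare across factors measured in different units (rupees versus years versus a 1-4 rating). One option, not part of the original analysis, is a standardized effect size such as Cohen's d. A sketch with made-up satisfaction scores (`cohens_d` is a hypothetical helper):

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference (b minus a) using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)

stayed = [4, 3, 4, 3, 4]  # made-up JobSatisfaction scores for employees who stayed
left = [2, 3, 2, 1, 2]    # made-up scores for employees who left
d = cohens_d(stayed, left)
print(f"Cohen's d: {d:.2f}")
```

A negative value here means leavers scored lower on the factor than stayers. By a common rule of thumb, magnitudes around 0.2, 0.5, and 0.8 are read as small, medium, and large effects.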



## 8. Recommendations and Actionable Insights

Based on our analysis, here are the key recommendations for HumanForYou:



### 8.1 Priority Recommendations

1. **Address Job Satisfaction Issues**
   - Job satisfaction is a critical factor in employee retention
   - Implement regular satisfaction surveys and act on feedback
   - Create clear career progression paths

2. **Improve Work-Life Balance**
   - Dissatisfaction with work-life balance correlates with turnover
   - Consider flexible working arrangements
   - Review workload distribution and overtime policies

3. **Enhance Compensation and Benefits**
   - Review salary structures, especially for high-risk roles
   - Ensure competitive compensation packages
   - Consider performance-based bonuses and stock options

4. **Focus on Career Development**
   - Employees who have gone longer since their last promotion show higher turnover
   - Implement regular promotion cycles
   - Provide clear advancement opportunities

5. **Reduce Travel Burden**
   - Frequent business travel is associated with higher turnover
   - Consider alternatives to frequent travel (video conferencing)
   - Provide adequate compensation for travel

6. **Improve Manager Relationships**
   - Years with current manager impacts retention
   - Provide manager training programs
   - Ensure stable manager-employee relationships

7. **Target High-Risk Employee Segments**
   - Focus retention efforts on:
     - Employees with 1-3 years tenure
     - Lower job levels
     - Specific job roles with high turnover rates
     - Employees living far from office

### 8.2 Implementation Strategy

1. **Short-term (0-3 months)**
   - Conduct detailed exit interviews
   - Launch employee satisfaction survey
   - Review compensation for high-risk roles

2. **Medium-term (3-6 months)**
   - Implement flexible work policies
   - Establish clear promotion criteria and timelines
   - Launch manager training programs

3. **Long-term (6-12 months)**
   - Develop comprehensive retention strategy
   - Create career development programs
   - Establish predictive monitoring system using the model

### 8.3 Model Deployment

The developed model can be used to:
- Identify employees at high risk of leaving
- Prioritize retention efforts
- Monitor effectiveness of interventions
- Conduct "what-if" analyses for policy changes
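
As a sketch of the first two uses listed above, predicted probabilities can be turned into a short, ranked list for HR follow-up. The `top_risk` helper, the threshold, and the sample values are hypothetical; in practice the probabilities would come from `best_model.predict_proba`:

```python
def top_risk(employee_ids, probabilities, threshold=0.5, k=3):
    """Return up to k (employee_id, probability) pairs at or above the threshold, highest first."""
    flagged = [(eid, p) for eid, p in zip(employee_ids, probabilities) if p >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)[:k]

ids = [101, 102, 103, 104, 105]
probs = [0.12, 0.81, 0.55, 0.47, 0.93]  # made-up attrition probabilities
shortlist = top_risk(ids, probs, threshold=0.5, k=2)
print(shortlist)
```

The threshold and list size are policy choices: a lower threshold catches more potential leavers at the cost of more false alarms, so they should be set with the capacity of the retention program in mind.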



## 9. Conclusion

This analysis identifies key factors influencing employee turnover at HumanForYou and develops a predictive model, with measured performance reported in Section 4.1. The model can help management:

1. **Predict** which employees are at risk of leaving
2. **Understand** the key drivers of turnover
3. **Prioritize** retention efforts effectively
4. **Evaluate** the impact of policy changes

**Next Steps:**
- Deploy the model for ongoing monitoring
- Implement recommended interventions
- Track model performance and update regularly
- Conduct follow-up analysis to measure intervention effectiveness

---

**Model Performance Summary:**
- Best Model: [Will be displayed after model training]
- ROC-AUC: [Will be displayed after model training]
- Key Strengths: [Will be displayed after model training]
- Areas for Improvement: [Will be displayed after model training]

