# Comparing Different Classification Models for Predicting Probability of Quitting
In this notebook, we implement and compare several classification models that predict the probability of an employee quitting, using the 'left' column as the target.

## 1. Import Necessary Libraries

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, classification_report, f1_score
import xgboost as xgb


## 2. Load and Explore the Dataset

In [22]:
# Load the dataset
file_path = 'D:\\Python_Projects\\attrition_predictor\\data\\HR_Dataset.csv'
df = pd.read_csv(file_path)

# Display the first few rows
df.head()

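| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Departments | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |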


In [23]:
# Check for missing data and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   Departments            14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


## 3. Prepare the Data

In [24]:
# Define the feature columns and the target column
X = df.drop(columns=['left'])
y = df['left']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## 4. Define and Train Different Models

In [25]:
# Define different models
logistic = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

random_forest = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

gradient_boosting = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

xgboost = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', xgb.XGBClassifier(random_state=42))
])

# Collect the pipelines by display name, then train each one
models = {
    'Logistic Regression': logistic,
    'Random Forest': random_forest,
    'Gradient Boosting': gradient_boosting,
    'XGBoost': xgboost
}

for name, model in models.items():
    print(f'Training {name}...')
    model.fit(X_train, y_train)

Training Logistic Regression...
Training Random Forest...
Training Gradient Boosting...
Training XGBoost...
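
One design detail worth noting: the four pipelines above share a single `preprocessor` object, and `Pipeline.fit` fits its steps in place. That is harmless here because every pipeline is fitted on the same `X_train`, but it could surprise if the pipelines were ever trained on different data. A possible alternative, sketched under assumptions (the helper `make_pipeline_for`, the `demo_*` names, and the random data are all hypothetical), gives each pipeline its own unfitted copy via `sklearn.base.clone`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_pipeline_for(classifier, preprocessor):
    """Hypothetical helper: pair a classifier with its own preprocessor copy."""
    return Pipeline([
        ("preprocessor", clone(preprocessor)),  # unfitted copy, same parameters
        ("classifier", classifier),
    ])


# A plain scaler stands in for the notebook's ColumnTransformer.
shared_scaler = StandardScaler()
demo_models = {
    "Logistic Regression": make_pipeline_for(
        LogisticRegression(random_state=42), shared_scaler
    ),
    "Random Forest": make_pipeline_for(
        RandomForestClassifier(n_estimators=10, random_state=42), shared_scaler
    ),
}

# Random stand-in data: the label depends only on the first feature.
rng = np.random.default_rng(42)
demo_X = rng.normal(size=(40, 3))
demo_y = (demo_X[:, 0] > 0).astype(int)

for name, model in demo_models.items():
    model.fit(demo_X, demo_y)

lr_step = demo_models["Logistic Regression"].named_steps["preprocessor"]
rf_step = demo_models["Random Forest"].named_steps["preprocessor"]
print(lr_step is rf_step)                # each pipeline owns a separate object
print(hasattr(shared_scaler, "mean_"))   # the template itself is never fitted
```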


## 5. Evaluate the Models

In [26]:
# Evaluate each model
evaluation_results = []

for name, model in models.items():
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    f1 = f1_score(y_test, y_pred, pos_label=1)
    report = classification_report(y_test, y_pred, output_dict=True)
    evaluation_results.append({
        'Model': name,
        'ROC AUC Score': roc_auc,
        'F1 Score': f1,
        'Precision (1)': report['1']['precision'],
        'Recall (1)': report['1']['recall']
    })

# Create a DataFrame for the evaluation results
evaluation_df = pd.DataFrame(evaluation_results)
evaluation_df

| | Model | ROC AUC Score | F1 Score | Precision (1) | Recall (1) |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.815156 | 0.431604 | 0.586538 | 0.341418 |
| 1 | Random Forest | 0.990828 | 0.97151 | 0.989362 | 0.954291 |
| 2 | Gradient Boosting | 0.987917 | 0.940896 | 0.961988 | 0.920709 |
| 3 | XGBoost | 0.991231 | 0.964049 | 0.977927 | 0.95056 |
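
The ROC AUC column is computed from `predict_proba`, so it reflects how well each model ranks employees across all thresholds, while F1, precision, and recall come from `predict`, which for these classifiers effectively applies a 0.5 cut-off. The following sketch uses hypothetical labels and probabilities (the `toy_*` arrays are made up for illustration) to show how a model can have a moderate AUC yet low recall at 0.5, as logistic regression does in the table.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical labels and predicted probabilities for ten employees.
toy_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.25, 0.40, 0.45, 0.90])

# ROC AUC depends only on how the probabilities rank the two classes.
toy_auc = roc_auc_score(toy_true, toy_proba)
print(f"ROC AUC: {toy_auc:.3f}")

# Precision and recall depend on the cut-off applied to the probabilities.
threshold_metrics = {}
for threshold in (0.5, 0.3):
    toy_pred = (toy_proba >= threshold).astype(int)
    threshold_metrics[threshold] = {
        "precision": precision_score(toy_true, toy_pred),
        "recall": recall_score(toy_true, toy_pred),
    }
    print(threshold, threshold_metrics[threshold])
```

In this toy data, lowering the cut-off from 0.5 to 0.3 raises recall from 0.25 to 0.75 while the AUC stays the same, which suggests that threshold choice is worth examining before ruling a model out on recall alone.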


## 6. Conclusion and Next Steps

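From the comparison above, the three tree-based ensembles clearly outperform logistic regression, whose recall for the positive class is only 0.34. XGBoost has the highest ROC AUC (0.9912), narrowly ahead of Random Forest (0.9908), while Random Forest has the best F1 score (0.9715), precision (0.9894), and recall (0.9543). These figures come from a single 70/30 split, so the small gaps among the top models are indicative rather than conclusive.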

**Next Steps:**
- **Parameter Tuning:** Further improve the best-performing models by tuning their hyperparameters, for example with a cross-validated search.
- **Feature Engineering:** Explore additional features to improve prediction accuracy.
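
As one possible starting point for the tuning step, the sketch below runs a small cross-validated grid search over a pipeline. It is an illustration under stated assumptions, not the notebook's method: the data is synthetic (`make_classification` with arbitrary shape and class weights), a plain `StandardScaler` stands in for the notebook's preprocessor, and the grid values are placeholders rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the shape and class weights are arbitrary.
syn_X, syn_y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=42,
)

tuning_pipeline = Pipeline([
    ("preprocessor", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# The "classifier__" prefix routes each parameter to that pipeline step.
param_grid = {
    "classifier__n_estimators": [25, 50],
    "classifier__max_depth": [3, None],
}

# Score each combination by ROC AUC averaged over 3 cross-validation folds.
search = GridSearchCV(tuning_pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(syn_X, syn_y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the search wraps the whole pipeline, preprocessing is refitted inside each fold, which avoids leaking validation data into the scaler.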