<a href="https://colab.research.google.com/github/swati2703raaj/HR-ATTRITION-PROJECT-/blob/main/HR_ATTRITION_PROJECT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# STEP 1: IMPORTING REQUIRED LIBRARIES
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


In [7]:
# STEP 2: LOAD THE DATA
df = pd.read_csv("/content/IBM-HR-Analytics-Employee-Attrition-and-Performance-Revised.csv")
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,College,Life Sciences,Medium,Female,...,Excellent,Low,0,8,0,Bad,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,Below College,Life Sciences,High,Male,...,Outstanding,Very High,1,10,3,Better,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,College,Other,Very High,Male,...,Excellent,Medium,0,7,3,Better,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,Master,Life Sciences,Very High,Female,...,Excellent,High,0,8,3,Better,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,Below College,Medical,Low,Male,...,Excellent,Very High,1,6,3,Better,2,2,2,2


In [8]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
# STEP 3: DROP NON-USEFUL COLUMNS
# The previously specified columns (EmployeeNumber, EmployeeCount, Over18, StandardHours) were not found in the dataframe.
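
A minimal sketch of a drop step that tolerates absent columns, assuming a small toy frame in place of the real `df` (the `toy` and `cleaned` names are illustrative and not part of the notebook). Passing `errors="ignore"` to `DataFrame.drop` skips labels that are not present, so the cell can stay in the pipeline regardless of which revision of the CSV is loaded.

```python
import pandas as pd

# Toy frame standing in for the HR data; the real df comes from pd.read_csv.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],   # constant column, carries no signal
    "Over18": ["Y", "Y", "Y"],    # constant column
})

candidates = ["EmployeeNumber", "EmployeeCount", "Over18", "StandardHours"]

# errors="ignore" skips labels that are absent instead of raising KeyError.
cleaned = toy.drop(columns=candidates, errors="ignore")

print(cleaned.columns.tolist())
```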

In [None]:
# STEP 4: ENCODE TARGET VARIABLE (Binary conversion)
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})


In [None]:
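# STEP 5: FEATURE ENGINEERING
# Share of career spent at this company; TotalWorkingYears == 0 yields NaN, which must be imputed before training
df['Tenure_Ratio'] = df['YearsAtCompany'] / df['TotalWorkingYears'].replace(0,np.nan)
# Income quintile (0 = lowest, 4 = highest)
df['Income_Band'] = pd.qcut(df['MonthlyIncome'], 5, labels=False)
# Years at the company up to the most recent promotion
df['Promotion_Gap'] = df['YearsAtCompany'] - df['YearsSinceLastPromotion']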


In [None]:
# STEP 6: SEPARATE FEATURES & TARGET
X = df.drop("Attrition", axis=1)
y = df['Attrition']


In [None]:
# STEP 7: ONE-HOT ENCODE CATEGORICAL VARIABLES
X = pd.get_dummies(X, drop_first=True)


In [None]:
# STEP 8: TRAIN-TEST SPLIT (WITH STRATIFICATION)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)


In [None]:
# STEP 9: FEATURE SCALING (helps Logistic Regression converge; the Decision Tree does not need it)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [None]:
# STEP 10: TRAIN BOTH MODELS
log_model = LogisticRegression(max_iter=2000)
log_model.fit(X_train, y_train)

tree_model = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_model.fit(X_train, y_train)


In [None]:
# STEP 11: MODEL PREDICTIONS
log_pred = log_model.predict(X_test)
tree_pred = tree_model.predict(X_test)


In [None]:
# STEP 12: MODEL EVALUATION
print("\nLogistic Regression Results:")
print(classification_report(y_test, log_pred))
print("Accuracy:", accuracy_score(y_test, log_pred))

print("\nDecision Tree Results:")
print(classification_report(y_test, tree_pred))
print("Accuracy:", accuracy_score(y_test, tree_pred))


# Task
Perform a code review on the provided notebook code, focusing on data loading and initial preparation, feature engineering (Tenure_Ratio, Income_Band, Promotion_Gap), data preprocessing (one-hot encoding, train-test split, feature scaling), and model training and evaluation for both Logistic Regression and Decision Tree models. Provide a comprehensive summary of any corrections made or suggested improvements to the machine learning pipeline. The data used is from `/content/IBM-HR-Analytics-Employee-Attrition-and-Performance-Revised.csv`.

## Review Data Loading and Initial Preparation

### Subtask:
Examine the data loading, target variable encoding, and any column dropping steps for correctness and efficiency.


**Reasoning**:
To review the data loading, target variable encoding, and column dropping steps, I will first display the DataFrame's head and information to confirm correct loading. Then, I will inspect the columns identified for dropping and check for other potentially non-useful columns by examining their unique values. Finally, I will verify the 'Attrition' column's current state to confirm the encoding step's impact.



In [11]:
print("--- Reviewing Data Loading ---")
print("DataFrame Head (from initial load):")
print(df.head())
print("\nDataFrame Info:")
df.info()

print("\n--- Reviewing Column Dropping (Cell 8fOmK2lXipah) ---")
columns_to_check = ['EmployeeNumber', 'EmployeeCount', 'Over18', 'StandardHours']
found_columns = [col for col in columns_to_check if col in df.columns]

if found_columns:
    print(f"The following specified columns were found: {found_columns}")
    for col in found_columns:
        print(f"  '{col}' unique values: {df[col].unique()}")
        print(f"  '{col}' value counts:\n{df[col].value_counts()}")
else:
    print("The previously specified columns (EmployeeNumber, EmployeeCount, Over18, StandardHours) are NOT found in the dataframe as indicated in the notebook.")

print("\nChecking for other potential non-useful columns (e.g., constant or unique identifiers):")
for col in df.columns:
    if df[col].nunique() == 1:
        print(f"  Column '{col}' is constant with value: {df[col].iloc[0]}")
    elif df[col].nunique() == len(df):
        print(f"  Column '{col}' has all unique values (potential identifier).")

print("\n--- Reviewing Target Variable Encoding (Cell FdjUhe88i2E5) ---")
print("Current unique values in 'Attrition' column before encoding execution:")
print(df['Attrition'].value_counts())

--- Reviewing Data Loading ---
DataFrame Head (from initial load):
   Age Attrition     BusinessTravel  DailyRate              Department  \
0   41       Yes      Travel_Rarely       1102                   Sales   
1   49        No  Travel_Frequently        279  Research & Development   
2   37       Yes      Travel_Rarely       1373  Research & Development   
3   33        No  Travel_Frequently       1392  Research & Development   
4   27        No      Travel_Rarely        591  Research & Development   

   DistanceFromHome      Education EducationField EnvironmentSatisfaction  \
0                 1        College  Life Sciences                  Medium   
1                 8  Below College  Life Sciences                    High   
2                 2        College          Other               Very High   
3                 3         Master  Life Sciences               Very High   
4                 2  Below College        Medical                     Low   

   Gender  ...  Performan

**Reasoning**:
Based on the review of the previous output, the data loading was successful. The specified columns for dropping ('EmployeeNumber', 'EmployeeCount', 'Over18', 'StandardHours') were not found in the DataFrame, meaning the original dropping step was already redundant. No other constant or unique columns were identified. The 'Attrition' column still contains 'Yes'/'No' string values, indicating that the encoding step (Cell FdjUhe88i2E5) has not yet been applied. Therefore, the next step is to apply the encoding to the 'Attrition' column and then verify the change.



In [12]:
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

print("--- Verifying Target Variable Encoding ---")
print("Unique values in 'Attrition' column AFTER encoding:")
print(df['Attrition'].value_counts())
print("Data type of 'Attrition' column AFTER encoding:", df['Attrition'].dtype)

--- Verifying Target Variable Encoding ---
Unique values in 'Attrition' column AFTER encoding:
Attrition
0    1233
1     237
Name: count, dtype: int64
Data type of 'Attrition' column AFTER encoding: int64
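
The `Attrition` mapping appears in more than one cell, and running `.map({'Yes': 1, 'No': 0})` a second time on an already encoded column turns every value into NaN. The sketch below shows one way to make the step safe to re-run; `encode_attrition` is a hypothetical helper, not something defined in the notebook.

```python
import pandas as pd

def encode_attrition(series: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' to 1/0; leave an already-encoded column unchanged."""
    if pd.api.types.is_numeric_dtype(series):
        return series  # already encoded, mapping again would produce NaN
    encoded = series.map({"Yes": 1, "No": 0})
    if encoded.isna().any():
        raise ValueError("Unexpected labels in Attrition column")
    return encoded.astype("int64")

raw = pd.Series(["Yes", "No", "No"], name="Attrition")
once = encode_attrition(raw)
twice = encode_attrition(once)          # safe to re-run
naive = once.map({"Yes": 1, "No": 0})   # what a plain second run would do

print(twice.tolist(), naive.isna().all())
```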


## Review Feature Engineering

### Subtask:
Analyze the created features (Tenure_Ratio, Income_Band, Promotion_Gap) to confirm their validity and potential impact on the models.


**Reasoning**:
To analyze the newly created features, I will display the head of the DataFrame, check for missing values and descriptive statistics for 'Tenure_Ratio', examine the unique values and counts for 'Income_Band', calculate descriptive statistics for 'Promotion_Gap', and verify the data types of all three columns.



In [13]:
print("--- Reviewing Newly Created Features ---")
print("DataFrame Head with new features:")
print(df[['Tenure_Ratio', 'Income_Band', 'Promotion_Gap']].head())

print("\n--- Analyzing Tenure_Ratio ---")
print("Missing values in 'Tenure_Ratio':", df['Tenure_Ratio'].isnull().sum())
print("Descriptive statistics for 'Tenure_Ratio':\n", df['Tenure_Ratio'].describe())
# Explain origin of NaNs
if (df['TotalWorkingYears'] == 0).any():
    print("Note: NaN values in 'Tenure_Ratio' likely originate from 'TotalWorkingYears' being 0, causing division by zero.")

print("\n--- Analyzing Income_Band ---")
print("Unique values and counts for 'Income_Band':\n", df['Income_Band'].value_counts().sort_index())

print("\n--- Analyzing Promotion_Gap ---")
print("Descriptive statistics for 'Promotion_Gap':\n", df['Promotion_Gap'].describe())

print("\n--- Data Types of New Features ---")
print(df[['Tenure_Ratio', 'Income_Band', 'Promotion_Gap']].dtypes)

--- Reviewing Newly Created Features ---
DataFrame Head with new features:


KeyError: "None of [Index(['Tenure_Ratio', 'Income_Band', 'Promotion_Gap'], dtype='object')] are in the [columns]"

**Reasoning**:
The previous code failed because the feature engineering step (creating 'Tenure_Ratio', 'Income_Band', 'Promotion_Gap') had not been executed, leading to a KeyError. I will first execute the feature engineering code and then the analysis code to correctly review the newly created features.



In [14]:
df['Tenure_Ratio'] = df['YearsAtCompany'] / df['TotalWorkingYears'].replace(0,np.nan)
df['Income_Band'] = pd.qcut(df['MonthlyIncome'], 5, labels=False)
df['Promotion_Gap'] = df['YearsAtCompany'] - df['YearsSinceLastPromotion']

print("--- Reviewing Newly Created Features ---")
print("DataFrame Head with new features:")
print(df[['Tenure_Ratio', 'Income_Band', 'Promotion_Gap']].head())

print("\n--- Analyzing Tenure_Ratio ---")
print("Missing values in 'Tenure_Ratio':", df['Tenure_Ratio'].isnull().sum())
print("Descriptive statistics for 'Tenure_Ratio':\n", df['Tenure_Ratio'].describe())
# Explain origin of NaNs
if (df['TotalWorkingYears'] == 0).any():
    print("Note: NaN values in 'Tenure_Ratio' likely originate from 'TotalWorkingYears' being 0, causing division by zero.")

print("\n--- Analyzing Income_Band ---")
print("Unique values and counts for 'Income_Band':\n", df['Income_Band'].value_counts().sort_index())

print("\n--- Analyzing Promotion_Gap ---")
print("Descriptive statistics for 'Promotion_Gap':\n", df['Promotion_Gap'].describe())

print("\n--- Data Types of New Features ---")
print(df[['Tenure_Ratio', 'Income_Band', 'Promotion_Gap']].dtypes)

--- Reviewing Newly Created Features ---
DataFrame Head with new features:
   Tenure_Ratio  Income_Band  Promotion_Gap
0      0.750000            3              6
1      1.000000            2              9
2      0.000000            0              0
3      1.000000            1              5
4      0.333333            1              0

--- Analyzing Tenure_Ratio ---
Missing values in 'Tenure_Ratio': 11
Descriptive statistics for 'Tenure_Ratio':
 count    1459.000000
mean        0.683186
std         0.323978
min         0.000000
25%         0.419872
50%         0.800000
75%         1.000000
max         1.000000
Name: Tenure_Ratio, dtype: float64
Note: NaN values in 'Tenure_Ratio' likely originate from 'TotalWorkingYears' being 0, causing division by zero.

--- Analyzing Income_Band ---
Unique values and counts for 'Income_Band':
 Income_Band
0    294
1    294
2    294
3    294
4    294
Name: count, dtype: int64

--- Analyzing Promotion_Gap ---
Descriptive statistics for 'Promotion_Gap
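
To make the origin of the missing `Tenure_Ratio` values concrete, here is a small sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset). It applies the same two formulas as the notebook and shows that only the row with `TotalWorkingYears == 0` ends up as NaN.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "YearsAtCompany": [6, 0, 10],
    "TotalWorkingYears": [8, 0, 10],
    "YearsSinceLastPromotion": [0, 0, 1],
})

# Zero denominators are replaced by NaN, so the ratio is NaN for that row.
ratio = toy["YearsAtCompany"] / toy["TotalWorkingYears"].replace(0, np.nan)

# Years at the company up to the most recent promotion.
gap = toy["YearsAtCompany"] - toy["YearsSinceLastPromotion"]

print(ratio.tolist())
print(gap.tolist())
```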

## Review Data Preprocessing

### Subtask:
Verify the one-hot encoding, train-test split, and feature scaling steps, ensuring they are applied correctly and prevent data leakage.


#### Instructions
1. Inspect the columns of the `X` DataFrame before and after one-hot encoding using `X.head()` and `X.info()` to confirm the transformation of categorical features and the creation of new dummy variables. Pay attention to the number of columns.
2. Check the shapes of `X_train`, `X_test`, `y_train`, and `y_test` to confirm the train-test split proportions (e.g., `X_train.shape`, `X_test.shape`, `y_train.shape`, `y_test.shape`).
3. Verify the stratification of the `y` variable by comparing the value counts of `y_train` and `y_test` (e.g., `y_train.value_counts(normalize=True)` and `y_test.value_counts(normalize=True)`).
4. Examine the descriptive statistics (mean and standard deviation) of `X_train` and `X_test` after scaling to confirm that the `StandardScaler` has been applied correctly (e.g., `pd.DataFrame(X_train).describe()` and `pd.DataFrame(X_test).describe()`). Ensure that `X_train` has a mean close to 0 and a standard deviation close to 1 for each feature.

**Reasoning**:
To verify the one-hot encoding, train-test split, and feature scaling steps as outlined in the instructions, I will first recreate the `X` DataFrame before one-hot encoding, then apply the encoding and inspect the changes. After that, I will check the shapes and stratification of the train-test split, and finally, verify the feature scaling by examining descriptive statistics of the scaled training and testing sets.



In [15]:
print("--- Reviewing One-Hot Encoding ---")
# Recreate X before one-hot encoding for comparison
X_before_encoding = df.drop("Attrition", axis=1)
print("X before one-hot encoding (first 5 rows):")
print(X_before_encoding.head())
print("\nShape of X before one-hot encoding:", X_before_encoding.shape)
print("Info of X before one-hot encoding:\n")
X_before_encoding.info(verbose=True, show_counts=True)

# Perform one-hot encoding on X (as done in original notebook STEP 7)
X = pd.get_dummies(X_before_encoding, drop_first=True)
print("\nX after one-hot encoding (first 5 rows):")
print(X.head())
print("\nShape of X after one-hot encoding:", X.shape)
print("Info of X after one-hot encoding:\n")
X.info(verbose=True, show_counts=True)


print("\n--- Reviewing Train-Test Split ---")
# Ensure y is the target variable (as done in original notebook STEP 6)
y = df['Attrition']

# Perform train-test split (as done in original notebook STEP 8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)

print("Shapes of datasets after train-test split:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

print("\nStratification check for target variable (y_train):")
print(y_train.value_counts(normalize=True))
print("\nStratification check for target variable (y_test):")
print(y_test.value_counts(normalize=True))


print("\n--- Reviewing Feature Scaling ---")
# Perform feature scaling (as done in original notebook STEP 9)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Descriptive statistics for X_train after scaling (first 5 features):\n")
# Convert scaled arrays back to DataFrame for easier inspection of descriptive stats
# Only show first 5 columns for brevity, as the table can be very wide
print(pd.DataFrame(X_train_scaled).iloc[:, :5].describe())

print("\nDescriptive statistics for X_test after scaling (first 5 features):\n")
print(pd.DataFrame(X_test_scaled).iloc[:, :5].describe())


--- Reviewing One-Hot Encoding ---
X before one-hot encoding (first 5 rows):
   Age     BusinessTravel  DailyRate              Department  \
0   41      Travel_Rarely       1102                   Sales   
1   49  Travel_Frequently        279  Research & Development   
2   37      Travel_Rarely       1373  Research & Development   
3   33  Travel_Frequently       1392  Research & Development   
4   27      Travel_Rarely        591  Research & Development   

   DistanceFromHome      Education EducationField EnvironmentSatisfaction  \
0                 1        College  Life Sciences                  Medium   
1                 8  Below College  Life Sciences                    High   
2                 2        College          Other               Very High   
3                 3         Master  Life Sciences               Very High   
4                 2  Below College        Medical                     Low   

   Gender  HourlyRate  ... TotalWorkingYears TrainingTimesLastYear  \
0  Fe

## Review Model Training and Evaluation

### Subtask:
Assess the training of both Logistic Regression and Decision Tree models, and check the evaluation metrics and their interpretation.


### Subtask: Assess Model Training and Evaluation

**Instructions Followed:**
1. Verify that the `log_model` and `tree_model` objects exist and have been fitted to the training data.
2. Review the classification reports and accuracy scores for both Logistic Regression and Decision Tree models.
3. Interpret the precision, recall, f1-score, and support for both classes ('0' for No Attrition, '1' for Yes Attrition) for each model, with a focus on the minority class (Attrition: '1').
4. Briefly summarize the strengths and weaknesses of each model based on these metrics.
5. Generate and inspect the confusion matrices for both models to understand true positives, true negatives, false positives, and false negatives.

**Reasoning**:
To begin assessing the model training and evaluation as per the updated instructions, I will first verify the existence and fitted status of the `log_model` and `tree_model` objects, and then display the model objects to confirm their presence in the environment.



In [16]:
print("--- Verifying Model Existence and Fitted Status ---")

if 'log_model' in locals() and log_model is not None:
    print("Logistic Regression model (log_model) exists and is likely fitted.")
    print(log_model)
else:
    print("Logistic Regression model (log_model) not found or not fitted.")

if 'tree_model' in locals() and tree_model is not None:
    print("\nDecision Tree model (tree_model) exists and is likely fitted.")
    print(tree_model)
else:
    print("\nDecision Tree model (tree_model) not found or not fitted.")

print("\n--- Reviewing Classification Reports and Accuracy Scores (from previous output) ---")
# The classification reports and accuracy scores were already printed in the previous cell (Ez05OATcjcQ1).
# We will now proceed to interpret them and then generate confusion matrices.

--- Verifying Model Existence and Fitted Status ---
Logistic Regression model (log_model) not found or not fitted.

Decision Tree model (tree_model) not found or not fitted.

--- Reviewing Classification Reports and Accuracy Scores (from previous output) ---
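
Checking `'log_model' in locals()` only confirms that a name exists, not that `fit` has been called. A possible alternative, sketched here on toy data, is scikit-learn's `check_is_fitted`, which raises `NotFittedError` for an unfitted estimator; `is_fitted` is a hypothetical wrapper added for illustration.

```python
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.utils.validation import check_is_fitted

def is_fitted(estimator) -> bool:
    """Return True if the estimator has been fitted, False otherwise."""
    try:
        check_is_fitted(estimator)
        return True
    except NotFittedError:
        return False

unfitted = LogisticRegression(max_iter=2000)
fitted = LogisticRegression(max_iter=2000).fit(
    [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
)

print(is_fitted(unfitted), is_fitted(fitted))
```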


**Reasoning**:
The previous output indicated that the `log_model` and `tree_model` were not found or fitted. This implies that the model training step (Step 10) was not executed, or its state was lost. To proceed with model evaluation, I need to re-execute the model training and prediction steps (Steps 10 and 11) to ensure the models and their predictions are available in the kernel.



In [17]:
print("# STEP 10: TRAIN BOTH MODELS")
log_model = LogisticRegression(max_iter=2000)
log_model.fit(X_train_scaled, y_train)

tree_model = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_model.fit(X_train_scaled, y_train)

print("# STEP 11: MODEL PREDICTIONS")
log_pred = log_model.predict(X_test_scaled)
tree_pred = tree_model.predict(X_test_scaled)

print("Models trained and predictions made.")

# STEP 10: TRAIN BOTH MODELS


ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

**Reasoning**:
The previous error `ValueError: Input X contains NaN` was raised by `LogisticRegression.fit`. `StandardScaler` ignores NaNs when fitting and passes them through on transform, so the missing values reached the model untouched. The `Tenure_Ratio` column, created in Feature Engineering, introduces NaNs where `TotalWorkingYears` was zero. These NaNs must be handled before model training. I will impute the missing values in `Tenure_Ratio` with the mean of the training data, after the train-test split and before scaling, so that no test-set statistics leak into the imputation.
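
One way to keep imputation, scaling, and the model tied to training-set statistics is a scikit-learn `Pipeline`. The sketch below uses synthetic data with injected NaNs rather than the HR dataset, and the variable names (`X_demo`, `pipe`, and so on) are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_demo[rng.random(200) < 0.05, 1] = np.nan  # inject some missing values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Every step is fitted on the training rows only; the test rows are only transformed.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=2000),
)
pipe.fit(X_tr, y_tr)
demo_pred = pipe.predict(X_te)
```

Because the imputer's mean and the scaler's statistics are learned inside `fit`, the same object can be passed to cross-validation without leaking test-fold information.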



In [18]:
print("--- Reviewing One-Hot Encoding ---")
# Recreate X before one-hot encoding for comparison
X_before_encoding = df.drop("Attrition", axis=1)
print("X before one-hot encoding (first 5 rows):")
print(X_before_encoding.head())
print("\nShape of X before one-hot encoding:", X_before_encoding.shape)
print("Info of X before one-hot encoding:\n")
X_before_encoding.info(verbose=True, show_counts=True)

# Perform one-hot encoding on X (as done in original notebook STEP 7)
X = pd.get_dummies(X_before_encoding, drop_first=True)
print("\nX after one-hot encoding (first 5 rows):")
print(X.head())
print("\nShape of X after one-hot encoding:", X.shape)
print("Info of X after one-hot encoding:\n")
X.info(verbose=True, show_counts=True)


print("\n--- Reviewing Train-Test Split ---")
# Ensure y is the target variable (as done in original notebook STEP 6)
y = df['Attrition']

# Perform train-test split (as done in original notebook STEP 8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)

print("Shapes of datasets after train-test split:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

print("\nStratification check for target variable (y_train):")
print(y_train.value_counts(normalize=True))
print("\nStratification check for target variable (y_test):")
print(y_test.value_counts(normalize=True))

print("\n--- Handling Missing Values (Imputation) ---")
# Identify columns with NaNs (Tenure_Ratio)
nan_cols = X_train.columns[X_train.isnull().any()].tolist()
if nan_cols:
    print(f"Columns with NaNs identified in X_train: {nan_cols}")
    for col in nan_cols:
        # Impute NaNs using the mean of the training set to prevent data leakage
        mean_val = X_train[col].mean()
        X_train[col].fillna(mean_val, inplace=True)
        X_test[col].fillna(mean_val, inplace=True)
        print(f"  NaNs in '{col}' imputed with mean from X_train ({mean_val:.4f}).")
else:
    print("No NaNs found in X_train after split.")

print("\nNaN check after imputation:")
print(f"NaNs in X_train: {X_train.isnull().sum().sum()}")
print(f"NaNs in X_test: {X_test.isnull().sum().sum()}")


print("\n--- Reviewing Feature Scaling ---")
# Perform feature scaling (as done in original notebook STEP 9)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Descriptive statistics for X_train after scaling (first 5 features):\n")
# Convert scaled arrays back to DataFrame for easier inspection of descriptive stats
# Only show first 5 columns for brevity, as the table can be very wide
print(pd.DataFrame(X_train_scaled).iloc[:, :5].describe())

print("\nDescriptive statistics for X_test after scaling (first 5 features):\n")
print(pd.DataFrame(X_test_scaled).iloc[:, :5].describe())

--- Reviewing One-Hot Encoding ---
X before one-hot encoding (first 5 rows):
   Age     BusinessTravel  DailyRate              Department  \
0   41      Travel_Rarely       1102                   Sales   
1   49  Travel_Frequently        279  Research & Development   
2   37      Travel_Rarely       1373  Research & Development   
3   33  Travel_Frequently       1392  Research & Development   
4   27      Travel_Rarely        591  Research & Development   

   DistanceFromHome      Education EducationField EnvironmentSatisfaction  \
0                 1        College  Life Sciences                  Medium   
1                 8  Below College  Life Sciences                    High   
2                 2        College          Other               Very High   
3                 3         Master  Life Sciences               Very High   
4                 2  Below College        Medical                     Low   

   Gender  HourlyRate  ... TotalWorkingYears TrainingTimesLastYear  \
0  Fe

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X_train[col].fillna(mean_val, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X_test[col].fillna(mean_val, inplace=True)
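
The warnings above come from calling `fillna(..., inplace=True)` on a column selected with `X_train[col]`. A sketch of a warning-free equivalent on toy frames follows (the `train` and `test` names and their values are made up): `DataFrame.fillna` accepts a dict of per-column fill values and returns a new frame, so no chained in-place call is needed.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"Tenure_Ratio": [0.5, np.nan, 1.0], "Age": [30, 40, 50]})
test = pd.DataFrame({"Tenure_Ratio": [np.nan, 0.25], "Age": [35, 45]})

# Mean computed on the training rows only, then reused for the test rows.
train_mean = train["Tenure_Ratio"].mean()

train = train.fillna({"Tenure_Ratio": train_mean})
test = test.fillna({"Tenure_Ratio": train_mean})

print(train["Tenure_Ratio"].tolist(), test["Tenure_Ratio"].tolist())
```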


**Reasoning**:
The models were not found or fitted in the previous execution due to NaN values, which have now been handled. I will re-execute the model training and prediction steps using the `X_train_scaled` and `X_test_scaled` data, which are now free of NaNs. After training, I will then print the classification reports and generate confusion matrices for both models as requested in the subtask instructions, importing `confusion_matrix` if needed.



## Assess Model Training and Evaluation (Retry 2)

### Subtask:
Re-train Logistic Regression and Decision Tree models, make predictions, and evaluate their performance using classification reports and confusion matrices.


**Reasoning**:
The previous steps successfully handled missing values and scaled the data. Now, as per the instructions, I will re-train both Logistic Regression and Decision Tree models using the cleaned and scaled training data, make predictions on the test set, and then evaluate their performance using classification reports, accuracy scores, and confusion matrices. This completes the model training and evaluation part of the subtask.



In [19]:
from sklearn.metrics import confusion_matrix

print("# STEP 10: RE-TRAIN BOTH MODELS")
log_model = LogisticRegression(max_iter=2000, random_state=42) # random_state only matters for the 'sag', 'saga', and 'liblinear' solvers; it has no effect with the default 'lbfgs'
log_model.fit(X_train_scaled, y_train)

tree_model = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_model.fit(X_train_scaled, y_train)

print("# STEP 11: MODEL PREDICTIONS")
log_pred = log_model.predict(X_test_scaled)
tree_pred = tree_model.predict(X_test_scaled)

print("# STEP 12: MODEL EVALUATION")

print("\nLogistic Regression Results:")
print(classification_report(y_test, log_pred))
print("Accuracy:", accuracy_score(y_test, log_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, log_pred))

print("\nDecision Tree Results:")
print(classification_report(y_test, tree_pred))
print("Accuracy:", accuracy_score(y_test, tree_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, tree_pred))

# STEP 10: RE-TRAIN BOTH MODELS
# STEP 11: MODEL PREDICTIONS
# STEP 12: MODEL EVALUATION

Logistic Regression Results:
              precision    recall  f1-score   support

           0       0.88      0.96      0.91       247
           1       0.56      0.30      0.39        47

    accuracy                           0.85       294
   macro avg       0.72      0.63      0.65       294
weighted avg       0.83      0.85      0.83       294

Accuracy: 0.8503401360544217
Confusion Matrix:
 [[236  11]
 [ 33  14]]

Decision Tree Results:
              precision    recall  f1-score   support

           0       0.86      0.96      0.91       247
           1       0.45      0.19      0.27        47

    accuracy                           0.83       294
   macro avg       0.66      0.57      0.59       294
weighted avg       0.80      0.83      0.80       294

Accuracy: 0.8333333333333334
Confusion Matrix:
 [[236  11]
 [ 38   9]]
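
As a cross-check, the Class 1 rows of the classification reports can be recomputed by hand from the confusion matrices above. The sketch below uses the four counts exactly as printed (scikit-learn lays the binary matrix out as `[[TN, FP], [FN, TP]]`); `minority_metrics` is a hypothetical helper.

```python
def minority_metrics(tn, fp, fn, tp):
    """Precision, recall, and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the printed confusion matrices.
log_p, log_r, log_f1 = minority_metrics(tn=236, fp=11, fn=33, tp=14)
tree_p, tree_r, tree_f1 = minority_metrics(tn=236, fp=11, fn=38, tp=9)

print(round(log_p, 2), round(log_r, 2), round(log_f1, 2))
print(round(tree_p, 2), round(tree_r, 2), round(tree_f1, 2))
```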


## Final Task

### Subtask:
Add the provided summary about the preprocessing steps, feature engineering, model training and evaluation, and Power BI visualization to the notebook as a final overview.


## Summary:

### Data Analysis Key Findings

*   Both Logistic Regression and Decision Tree models were successfully re-trained and evaluated on the scaled data.
*   **Logistic Regression Model** achieved an accuracy of approximately 85.03%. It performed well in identifying employees who would not attrit (Class 0), with 236 true negatives and 11 false positives. However, its performance was weaker for identifying employees who would attrit (Class 1), with 14 true positives and 33 false negatives.
*   **Decision Tree Model** achieved a slightly lower accuracy of approximately 83.33%. Similar to Logistic Regression, it was better at predicting non-attrition (236 true negatives, 11 false positives) but struggled more with attrition cases (9 true positives, 38 false negatives).
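*   Both models struggled with the minority class (attrition): Class 1 recall was 0.30 for Logistic Regression and 0.19 for the Decision Tree, and Class 1 precision was 0.56 and 0.45 respectively, compared with 0.96 recall and 0.86 to 0.88 precision for Class 0.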

### Insights or Next Steps

*   **Address Class Imbalance:** Given the weaker performance in predicting employee attrition (Class 1) for both models, explore techniques to handle class imbalance, such as oversampling the minority class (e.g., SMOTE), undersampling the majority class, class weighting, or using algorithms designed for imbalanced data (e.g., Balanced Random Forest).
*   **Model Optimization:** Further optimize the models by tuning hyperparameters using techniques like GridSearchCV or RandomizedSearchCV to potentially improve the recall and F1-score for the attrition class, which is often crucial in employee retention scenarios.
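
The two next steps can be combined in a single search. The following is a sketch under stated assumptions, not a tuned solution: it uses a synthetic imbalanced dataset from `make_classification` instead of the HR data, a small illustrative grid, and recall on the positive class as the selection metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly the same class ratio as the attrition target.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.84, 0.16], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
param_grid = {
    "logisticregression__C": [0.1, 1.0, 10.0],
    "logisticregression__class_weight": [None, "balanced"],
}

# scoring="recall" selects the setting that finds the most positive cases.
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
```

Optimizing recall alone usually costs precision, so a metric such as F1 or average precision may be a better fit depending on how costly a false alarm is.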


In [24]:
output = X_test.copy()
output['Actual_Attrition'] = y_test.values
output['Predicted_Attrition'] = log_pred
output['Prediction_Probability'] = log_model.predict_proba(X_test_scaled)[:,1]  # use the scaled features the model was trained on

# Categorize risk based on probability
# include_lowest=True keeps a probability of exactly 0 in the 'Low' band instead of NaN
output['Risk_Level'] = pd.cut(
    output['Prediction_Probability'],
    bins=[0,0.40,0.65,1],
    labels=['Low','Medium','High'],
    include_lowest=True
)

output.to_csv("HR_Attrition_Model_Export.csv", index=False)
print("File Exported Successfully!")

File Exported Successfully!
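
A short illustration of how `pd.cut` assigns the risk bands at the bin edges, using made-up probabilities. By default the intervals are open on the left, so a value equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed.

```python
import pandas as pd

probs = pd.Series([0.0, 0.40, 0.41, 0.65, 0.90])
bins = [0, 0.40, 0.65, 1]
labels = ["Low", "Medium", "High"]

# Default: intervals are (0, 0.40], (0.40, 0.65], (0.65, 1], so 0.0 is unassigned.
default_cut = pd.cut(probs, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left.
inclusive_cut = pd.cut(probs, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())
print(inclusive_cut.astype(str).tolist())
```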


In [25]:
theme_json = {
  "name": "Lavender Analytics Theme",
  "dataColors": ["#C9B7E9", "#7E60BC", "#53389E", "#AA98DA", "#E5DAF9"],
  "background": "#FFFFFF",
  "foreground": "#2C2C2C",
  "tableAccent": "#7E60BC",
  "visualStyles": {
    "*": {
      "*": {
        "title": [
          {
            "fontSize": 14,
            "fontFamily": "Segoe UI Semibold",
            "color": "#53389E"
          }
        ],
        "labels": [
          {
            "color": "#2C2C2C"
          }
        ]
      }
    }
  }
}
display(theme_json)


{'name': 'Lavender Analytics Theme',
 'dataColors': ['#C9B7E9', '#7E60BC', '#53389E', '#AA98DA', '#E5DAF9'],
 'background': '#FFFFFF',
 'foreground': '#2C2C2C',
 'tableAccent': '#7E60BC',
 'visualStyles': {'*': {'*': {'title': [{'fontSize': 14,
      'fontFamily': 'Segoe UI Semibold',
      'color': '#53389E'}],
    'labels': [{'color': '#2C2C2C'}]}}}}
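
`display(theme_json)` only renders the dictionary in the notebook. To use the theme in Power BI it would need to be saved as a `.json` file, which can be sketched with the standard library as below. The file name and the temporary directory are assumptions, and the dictionary is abbreviated to two keys for brevity.

```python
import json
import os
import tempfile

# Abbreviated stand-in for the full theme_json dictionary.
theme = {
    "name": "Lavender Analytics Theme",
    "dataColors": ["#C9B7E9", "#7E60BC", "#53389E", "#AA98DA", "#E5DAF9"],
}

# Hypothetical output location; in Colab any writable path would do.
theme_path = os.path.join(tempfile.gettempdir(), "lavender_theme.json")

with open(theme_path, "w", encoding="utf-8") as fh:
    json.dump(theme, fh, indent=2)

# Read the file back to confirm it is valid JSON with the same content.
with open(theme_path, encoding="utf-8") as fh:
    reloaded = json.load(fh)

print(reloaded["name"])
```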