# Capstone project: Providing data-driven suggestions for HR

### Business scenario and problem

The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, but now they don’t know what to do with it. They have turned to you as a data analytics professional and asked you to provide data-driven suggestions based on your understanding of the data. Their question is: what is likely to make an employee leave the company?

Your goals in this project are to analyze the data collected by the HR department and to build a model that predicts whether or not an employee will leave the company.

If you can predict employees likely to quit, it might be possible to identify factors that contribute to their leaving. Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company.

In [None]:
# Import packages
import pandas as pd
import numpy as np
from scipy.stats import f_oneway
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:

df0 = pd.read_csv("HR_capstone_dataset.csv")


df0.head(10)

In [None]:
df0.info()

### Descriptive statistics about the data

In [None]:
df0.describe()

### Renaming columns

In [None]:
df0.columns

In [None]:
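# Rename columns to consistent, correctly spelled snake_case names.
# Note: rename() silently ignores keys that are not in df0.columns,
# so compare the keys below with the output of the previous cell.
df0.rename(columns={'satisfaction_level': 'satisfaction',
                    'number_project': 'projects',
                    'average_montly_hours': 'average_monthly_hours',
                    'time_spend_company': 'time_spent',
                    'work_accident': 'has_work_accident',
                    'promotion_last_5years': 'has_promotion_last_5years',
                    'Department': 'department'}, inplace=True)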



### Check missing values

In [None]:
# Checking for missing values
df0.isnull().sum()

### Check duplicates

In [None]:
# Checking for duplicates
df0.duplicated().sum()

In [None]:
# Inspecting some rows containing duplicates as needed
df0[df0.duplicated()]

In [None]:
# Dropping duplicates and saving the resulting dataframe
df0 = df0.drop_duplicates()

df0.head(5)

### Check outliers

In [None]:
# Draw one box plot per numerical column to visualize distributions and detect outliers
numerical_cols = df0.iloc[:, :7]
for col in numerical_cols.columns:
    plt.figure(figsize=(5, 1))
    sns.boxplot(x=numerical_cols[col], fliersize=1)
    plt.title(f'{col} box plot')
    plt.show()


In [None]:
# Rows where time_spent is 6 or more, treated here as outliers
df0[df0["time_spent"] >= 6]
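
The cutoff of 6 used above is hard-coded. A minimal sketch of how such a cutoff could be derived with the 1.5 × IQR rule that box plots use for their fliers follows; the `tenure` values are made up for illustration and stand in for `df0["time_spent"]`.

```python
import pandas as pd

# Hypothetical tenure values in years; the notebook would use df0["time_spent"].
tenure = pd.Series([2, 3, 3, 3, 4, 4, 5, 6, 8, 10], name="time_spent")

q1 = tenure.quantile(0.25)
q3 = tenure.quantile(0.75)
iqr = q3 - q1

# Points further than 1.5 * IQR from the quartiles are drawn as fliers in a box plot.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = tenure[(tenure < lower_limit) | (tenure > upper_limit)]
print(f"Limits: [{lower_limit}, {upper_limit}]")
print(f"Number of outlier rows: {len(outliers)}")
```

Using `len(outliers)` gives the count directly, rather than displaying the rows and reading the count off the output.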

In [None]:
df0 = df0.sort_values(by='satisfaction', ascending=True)

In [None]:
df0.head(10)

In [None]:
# numbers of people who left vs. stayed
print("people who left vs. stayed")
print("How many?")
print(df0["left"].value_counts())

# Percentages of people who left vs. stayed
print("In what percentages?")
print(df0["left"].value_counts() / len(df0["left"]) * 100)

### Data visualizations

In [None]:
for col in numerical_cols.columns:
    plt.figure(figsize=(5, 3))  
    sns.histplot(numerical_cols[col], kde=True )
    median = df0[col].median()
    plt.axvline(median, color='red', linestyle='--')
    plt.title(f'{col} Histogram')
    plt.show()

In [None]:
sns.pairplot(df0)

In [None]:
# What are the mean satisfaction, last evaluation, average monthly hours and time spent for employees who left and for those who stayed?
df0.groupby("left")[['satisfaction', 'last_evaluation', 'average_monthly_hours',
       'time_spent']].mean()

In [None]:
# What are the mean satisfaction, last evaluation, average monthly hours and time spent in each salary category?
df0.groupby("salary")[['satisfaction', 'last_evaluation', 'average_monthly_hours',
       'time_spent']].mean()

In [None]:
# What are the mean satisfaction, last evaluation, average monthly hours and time spent in each department?
df0.groupby("department")[['satisfaction', 'last_evaluation', 'average_monthly_hours',
       'time_spent']].mean()

In [None]:
#How many employees left in each department?
df0.groupby("department")["left"].sum()

In [None]:
# Mean satisfaction per salary category; the ANOVA in the next cell tests whether the differences are statistically significant
df0.groupby("salary")["satisfaction"].mean()

In [None]:
groups = []
for salary, group in df0.groupby('salary')['satisfaction']:
    groups.append(group)

result = f_oneway(*groups)
alpha = 0.05
if result.pvalue < alpha:
    print("There are significant differences in satisfaction means among salary categories.")
else:
    print("There are no significant differences in satisfaction means among salary categories.")
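
With thousands of rows, an ANOVA will flag even tiny differences as significant, so it can help to report the F statistic and an effect size next to the verdict. The sketch below does this on a made-up `toy` table (not the HR data); eta squared is computed by hand as the between-group share of the total sum of squares.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical data standing in for df0[["salary", "satisfaction"]].
toy = pd.DataFrame({
    "salary": ["low"] * 4 + ["medium"] * 4 + ["high"] * 4,
    "satisfaction": [0.40, 0.45, 0.50, 0.55,
                     0.55, 0.60, 0.65, 0.70,
                     0.70, 0.75, 0.80, 0.85],
})

# One array of satisfaction values per salary category.
groups = [g.to_numpy() for _, g in toy.groupby("salary")["satisfaction"]]
result = f_oneway(*groups)

# Eta squared: share of the total variance explained by the grouping.
values = toy["satisfaction"]
grand_mean = values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((values - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {result.statistic:.2f}, p = {result.pvalue:.4g}, eta squared = {eta_squared:.3f}")
```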

In [None]:
# How many employees left in each salary category?
df0.groupby("salary")["left"].sum()

In [None]:
# what is the mode salary of each of the departments?
df0.groupby('department')['salary'].apply(lambda x: x.mode().iloc[0]).reset_index()

In [None]:
plt.figure(figsize=(9,10))
sns.scatterplot(x=df0["last_evaluation"],y=df0["average_monthly_hours"],hue=df0["left"] )

In [None]:
plt.figure(figsize=(12,8))
# Map the 0/1 target to text so the legend labels cannot be swapped
left_label = df0['left'].map({0: 'Not left', 1: 'Left'})
sns.scatterplot(data=df0, x='average_monthly_hours', y='salary', hue=left_label, palette='coolwarm', alpha=0.7)
plt.xlabel("Average Monthly Hours")
plt.ylabel("Salary")
plt.title("Employees Who Left vs. Stayed by Salary and Average Monthly Hours")
plt.legend(title="Left", loc='upper left', bbox_to_anchor=(1.02, 1.0))

plt.tight_layout()
plt.show()

In [None]:
from scipy.stats import pointbiserialr
correlation_coefficient, p_value = pointbiserialr(df0['average_monthly_hours'], df0['left'])
alpha = 0.05  
if p_value < alpha:
    print("There is a statistically significant relationship between the two variables.")
else:
    print("There is no statistically significant relationship between the two variables.")

In [None]:
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df0['salary'], df0['left'])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

alpha = 0.05

if p_value < alpha:
    print("There is a statistically significant relationship between the salary and the left variable.")
else:
    print("There is no statistically significant relationship between the salary and the left variable.")
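
A chi-square test says whether an association exists, not how strong it is. One common complement is Cramér's V, sketched below on a made-up contingency table; the counts are illustrative only and are not taken from the HR dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are salary levels, columns are left = 0 / 1.
table = pd.DataFrame(
    {0: [400, 450, 150], 1: [120, 70, 10]},
    index=["low", "medium", "high"],
)

chi2, p_value, dof, expected = chi2_contingency(table)

# Cramér's V rescales chi-square to the range [0, 1].
n = table.to_numpy().sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}, Cramér's V = {cramers_v:.3f}")
```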

In [None]:
# Mean monthly hours of employees who left and had a low or medium salary
mask_left = df0["left"] == 1
mask_salary = df0["salary"].isin(["low", "medium"])
combined_mask = mask_left & mask_salary

df0.loc[combined_mask, "average_monthly_hours"].mean()

In [None]:
from scipy.stats import pearsonr
correlation_coefficient, p_value = pearsonr(df0['has_promotion_last_5years'], df0['satisfaction'])

alpha = 0.05  
if p_value < alpha:
    print("There is a statistically significant relationship between has_promotion_last_5years and satisfaction.")
else:
    print("There is no statistically significant relationship between has_promotion_last_5years and satisfaction.")

In [None]:
correlation_coefficient, p_value = pearsonr(df0['last_evaluation'], df0['satisfaction'])

alpha = 0.05  
if p_value < alpha:
    print("There is a statistically significant relationship between last_evaluation and satisfaction.")
else:
    print("There is no statistically significant relationship between last_evaluation and satisfaction.")

In [None]:
correlation_coefficient, p_value = pearsonr(df0['last_evaluation'], df0['average_monthly_hours'])

alpha = 0.05  
if p_value < alpha:
    print("There is a statistically significant relationship between last_evaluation and average_monthly_hours.")
else:
    print("There is no statistically significant relationship between last_evaluation and average_monthly_hours.")

In [None]:
correlation_coefficient, p_value = pearsonr(df0['left'], df0['satisfaction'])

alpha = 0.05  
if p_value < alpha:
    print("There is a statistically significant relationship between left and satisfaction.")
else:
    print("There is no statistically significant relationship between left and satisfaction.")
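
The four Pearson cells above repeat the same pattern and print only the verdict. A possible refactoring is sketched below: `report_correlation` is a hypothetical helper (not part of the original notebook) that also reports the coefficient and the p-value, and the `hours` and `evaluation` arrays are synthetic.

```python
import numpy as np
from scipy.stats import pearsonr


def report_correlation(x, y, x_name, y_name, alpha=0.05):
    """Run a Pearson test and print the verdict with r and the p-value."""
    r, p_value = pearsonr(x, y)
    verdict = "a" if p_value < alpha else "no"
    print(
        f"There is {verdict} statistically significant relationship between "
        f"{x_name} and {y_name} (r = {r:.3f}, p = {p_value:.3g})."
    )
    return r, p_value


# Synthetic data with a built-in positive relationship, for illustration only.
rng = np.random.default_rng(0)
hours = rng.normal(200, 50, size=500)
evaluation = 0.5 + 0.001 * hours + rng.normal(0, 0.1, size=500)

r, p = report_correlation(hours, evaluation, "average_monthly_hours", "last_evaluation")
```

In the notebook the call would take two columns of `df0`, for example `df0['last_evaluation']` and `df0['satisfaction']`.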

In [None]:
plt.figure(figsize=(12,10))
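# numeric_only=True skips the text columns (salary, department);
# correlations range from -1 to 1, so the color scale must cover both signs
corr = df0.corr(numeric_only=True)
sns.heatmap(corr, vmin=-1, vmax=1, cmap='coolwarm')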

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(x='left', y='last_evaluation', data=df0, palette='pastel')
plt.xlabel('Left (0: Not Left, 1: Left)')
plt.ylabel('Last Evaluation')
plt.title('Distribution of Last Evaluation for Different "Left" Categories')
plt.show()


plt.figure(figsize=(8, 6))
sns.violinplot(x='left', y='last_evaluation', data=df0, palette='pastel')
plt.xlabel('Left (0: Not Left, 1: Left)')
plt.ylabel('Last Evaluation')
plt.title('Distribution of Last Evaluation for Different "Left" Categories')
plt.show()

correlation_matrix = df0[['last_evaluation', 'average_monthly_hours', 'satisfaction', 'left']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

### Insights

* 59% of the employees who left had a low salary and almost 39% had a medium one.
* Most of the employees who left worked in the technical (19.58% of all employees who left), support (15.67%) and especially sales (27.62%) departments; the most frequent salary in all three is low.
* There is a statistically significant relationship between last_evaluation and satisfaction.
* There is a statistically significant relationship between last_evaluation and average_monthly_hours.
* There is a statistically significant relationship between has_promotion_last_5years and satisfaction.
* There is a statistically significant relationship between left and satisfaction.
* There are significant differences in mean satisfaction among salary categories (low, medium, high).
* There is a statistically significant relationship between average monthly hours and whether an employee left.
* There is a statistically significant relationship between the salary and the left variable.
* The employees who left fall into two groups: a minority with low evaluations and a majority with higher evaluations.
* The majority of employees who left had a high evaluation but worked more hours than average.
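
The percentage figures in the first two insights are shares among employees who left. A minimal sketch of how such shares can be computed is shown below on a made-up `toy` table; the same pattern would apply to `df0` with the `salary` or `department` column.

```python
import pandas as pd

# Hypothetical rows standing in for df0[["left", "salary"]].
toy = pd.DataFrame({
    "left": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "salary": ["low", "low", "low", "medium", "medium",
               "low", "medium", "medium", "high", "high"],
})

# Keep only the employees who left, then compute each category's share in percent.
leavers = toy[toy["left"] == 1]
share_by_salary = leavers["salary"].value_counts(normalize=True) * 100

print(share_by_salary.round(2))
```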


------------------------------------------------------------------------------------------------------------------

<b><span style="font-size: 24px;">In conclusion: most employees who left had a low (59%) or medium (39%) salary and worked about 208 hours per month on average; the most affected department is Sales.</span></b>


-------------------------------------------------------------------------------------------------------------------

### Modeling



In [None]:
from sklearn.model_selection import PredefinedSplit, cross_val_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, precision_score,recall_score,f1_score, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance


In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df0['salary'] = label_encoder.fit_transform(df0['salary'])
df0 = pd.get_dummies(df0, columns=['department'], drop_first=True)
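
`LabelEncoder` assigns codes in alphabetical order (high = 0, low = 1, medium = 2), which does not reflect the natural ranking of salary levels. Tree-based models can still split on such codes, but an explicit ordinal mapping is easier to interpret. The sketch below shows one possible alternative on a made-up `toy` frame; `salary_order` and `salary_ordinal` are illustrative names.

```python
import pandas as pd

# Hypothetical rows standing in for df0["salary"] before encoding.
toy = pd.DataFrame({"salary": ["low", "high", "medium", "low"]})

# Explicit mapping that preserves the ranking low < medium < high.
salary_order = {"low": 0, "medium": 1, "high": 2}
toy["salary_ordinal"] = toy["salary"].map(salary_order)

print(toy)
```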

In [None]:

X = df0.drop("left", axis=1)
y = df0["left"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.25,random_state = 0)
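
Because far fewer employees left than stayed, a purely random split can give the train and test sets slightly different class ratios. The sketch below shows the `stratify` option of `train_test_split` on synthetic labels; it is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 80 stayed (0), 20 left (1).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the share of each class the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(f"Share of leavers in train: {y_train.mean():.2f}, in test: {y_test.mean():.2f}")
```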


In [None]:
models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "XGBClassifier": XGBClassifier(objective='binary:logistic',random_state=0)
}

# Define hyperparameter grids for each model
cv_param = {
    "DecisionTreeClassifier": {'max_depth': [3, 5, 7]},
    "RandomForestClassifier": {'n_estimators': [100, 200, 300], 'max_depth': [3, 5, 7]},
    "XGBClassifier": {'n_estimators': [100, 200, 300], 'max_depth': [3, 5, 7]}
}
# Use a list rather than a set: a list is an accepted type for GridSearchCV's scoring parameter
scoring = ['accuracy', 'precision', 'recall', 'f1']
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
custom_split = PredefinedSplit(split_index)
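
`PredefinedSplit` reads the list built above as follows: rows marked `-1` always stay in the training part, and rows marked `0` form the single validation fold. A small sketch with a made-up `test_fold` array shows the indices it produces.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 = always used for training, 0 = member of validation fold 0.
test_fold = np.array([-1, -1, 0, -1, 0, -1])
split = PredefinedSplit(test_fold)

# With a single fold id there is exactly one train/validation split.
splits = list(split.split())
train_idx, val_idx = splits[0]

print("Number of splits:", split.get_n_splits())
print("Train indices:", train_idx)
print("Validation indices:", val_idx)
```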

In [None]:
best_params_dict = {}  # Dictionary to store the best hyperparameters of each model

for model_name, model in models.items():
    model_gs = GridSearchCV(model, cv_param[model_name], scoring=scoring, cv=custom_split, refit='f1')
    model_gs.fit(X_train, y_train)
    best_model = model_gs.best_estimator_

    # Store the best hyperparameters of each model in the dictionary
    best_params_dict[model_name] = model_gs.best_params_

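    # Read the validation metrics from cv_results_: they come from the model
    # fitted on the training part only. best_estimator_ is refit on all of
    # X_train (which contains X_val), so scoring it on X_val would be optimistic.
    best = model_gs.best_index_
    val_accuracy = model_gs.cv_results_['mean_test_accuracy'][best]
    val_precision = model_gs.cv_results_['mean_test_precision'][best]
    val_recall = model_gs.cv_results_['mean_test_recall'][best]
    val_f1 = model_gs.cv_results_['mean_test_f1'][best]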

    print(f"Model: {model_name}")
    print(f"Validation Accuracy: {val_accuracy:.4f}, Validation Precision: {val_precision:.4f}, Validation Recall: {val_recall:.4f}, Validation F1: {val_f1:.4f}")
    print("---------------------------------------------------------")



best_params_xgb = best_params_dict["XGBClassifier"]
print("Best Hyperparameters for XGBoostClassifier:")
print(best_params_xgb)



In [None]:
XGB_optimal = XGBClassifier(max_depth=3, n_estimators=300,random_state=1)
XGB_optimal.fit(X_train, y_train)
y_pred = XGB_optimal.predict(X_test)
pc_test = precision_score(y_test, y_pred)
print("The precision score is {pc:.3f}".format(pc = pc_test))
rc_test = recall_score(y_test, y_pred)
print("The recall score is {rc:.3f}".format(rc = rc_test))
ac_test = accuracy_score(y_test, y_pred)
print("The accuracy score is {ac:.3f}".format(ac = ac_test))
f1_test = f1_score(y_test, y_pred)
print("The F1 score is {f1:.3f}".format(f1 = f1_test))

In [None]:

# Normalize over the true labels so that each row sums to 1
cm = confusion_matrix(y_test, y_pred, labels=XGB_optimal.classes_, normalize='true')

# ConfusionMatrixDisplay takes only the matrix and the labels;
# styling options go to plot(), which also draws the colorbar
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=XGB_optimal.classes_)
disp.plot(cmap='viridis', include_values=True, xticks_rotation='horizontal', values_format='.2f')
plt.title('Normalized Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

## The normalized confusion matrix shows that the model is more likely to produce false negatives than false positives
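
The comparison between false negatives and false positives can be made explicit by computing both rates from the confusion matrix. The sketch below uses made-up labels (`y_true`, `y_hat`) rather than the notebook's test predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = left, 0 = stayed).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Share of actual leavers the model missed, and share of stayers it flagged.
false_negative_rate = fn / (fn + tp)
false_positive_rate = fp / (fp + tn)

print(f"False negative rate: {false_negative_rate:.2f}")
print(f"False positive rate: {false_positive_rate:.2f}")
```

The false negative rate equals 1 minus recall, which is why recall is the metric to watch when missed leavers are the costly error.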

In [None]:
plot_importance(XGB_optimal, max_num_features=10);

## Step 4. Results and Evaluation



### Conclusion, Recommendations, Next Steps

In conclusion:
* The best model was the XGBoost classifier, with an accuracy of 98%, a precision of 95.3%, an F1 score of 93.9% and a recall of 92.4%. Recall is the most important metric in our case because we want to reduce false negatives.
* We also discovered that the variables that most strongly determine whether an employee will leave are average_monthly_hours, satisfaction and last_evaluation.





We recommend the following actions to the stakeholders to retain employees:

* Improve employee satisfaction by raising salaries (salary is an important factor in satisfaction).
* Provide more promotions, especially for employees who have high evaluations.
* Reduce working hours, at least to the average.
* Offer training and development opportunities to employees who do not perform well (low evaluations).

The next steps are:
* Gathering more information about the employees to find out whether other factors influence the left variable.
* Searching for additional sources of employee satisfaction.
* Attempting to improve model performance.
