# <font face = 'Impact' color = '#FFAEBC' > Sample Demonstration on Machine Learning for Regression</font>
#### <font face = 'Times New Roman' color = '#B5E5CF'> License: GPL v3.0</font>
#### <font face = 'Times New Roman' color = '#B5E5CF'> Author and Trainer: Paolo Hilado MSc. (Data Science)</font>
This notebook provides a background on doing machine learning for regression in Python, using Ridge Regression, LASSO Regression, Elastic Net, and a Random Forest Regressor.

# <font face = 'Palatino Linotype' color = '#5885AF'> Business Understanding:</font>
Management seeks to gain a deeper understanding of the factors that drive employee productivity across the organization. The Human Resources and Development department has provided a comprehensive employee dataset containing demographic, performance, and engagement-related variables. The primary business objective is to DEVELOP A PREDICTIVE MODEL that accurately estimates an employee’s ProductivityScore based on key predictors such as age, department, tenure, education level, remote work ratio, job satisfaction, work hours, project load, managerial feedback, training participation, promotion history, and recent performance ratings.

By identifying and quantifying the most influential drivers of productivity, the organization aims to:
- Improve workforce management and development strategies,
- Optimize training and performance review programs,
- Support data-driven decision-making in promotions, hiring, and resource allocation, and
- Enhance overall organizational efficiency and employee satisfaction.

The success of this initiative will be measured by the model’s ability to accurately predict productivity and provide actionable insights that inform HR and management policies.
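
One way to make "accurately predict" concrete is root mean squared error (RMSE), the metric this notebook reports for each model. The sketch below computes it by hand on made-up scores; the values in `y_true` and `y_pred` are illustrative assumptions, not results from the employee dataset.

```python
import numpy as np

# Hypothetical productivity scores, invented for illustration only.
y_true = np.array([70.0, 80.0, 90.0])
y_pred = np.array([72.0, 78.0, 91.0])


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    residuals = np.asarray(actual) - np.asarray(predicted)
    return float(np.sqrt(np.mean(residuals ** 2)))


demo_rmse = rmse(y_true, y_pred)
print("RMSE:", round(demo_rmse, 3))
```

RMSE is expressed in the same units as ProductivityScore, so lower values mean closer predictions and 0 would be a perfect fit.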

In [None]:
# Load the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split # used to split the data into train and test sets

In [None]:
# Load the data set
df = pd.read_csv("employee_productivity_data.csv")
df.head()

In [None]:
df.info()

In [None]:
# Flag columns that contain a single-space string, a common placeholder for missing values.
df.eq(' ').any()

# <font face = 'Palatino Linotype' color = '#5885AF'> Data Understanding:</font>
   
The dataset provided by the Human Resources and Development department contains detailed records for all employees within the organization. The primary purpose of this data is to support an analytical exploration of factors influencing employee productivity, represented by the response variable ProductivityScore.

The dataset includes a mix of demographic, behavioral, and performance-related variables that may influence productivity. The predictor variables are as follows:
- Age – The employee’s age in years, representing workforce demographics.
- Department – The functional area or division where the employee works (categorical).
- YearsAtCompany – The number of years the employee has been with the organization, reflecting experience and organizational familiarity.
- EducationLevel – The highest education qualification level attained (ordinal).
- RemoteWorkRatio – The proportion of work done remotely, indicating work flexibility.
- JobSatisfactionScore – A self-reported or survey-based score indicating job satisfaction.
- AverageWeeklyHours – The average number of hours worked per week.
- NumProjects – The number of projects currently or recently handled by the employee.
- ManagerFeedbackScore – The manager’s performance feedback rating.
- TrainingHoursLastYear – The total number of hours spent in training over the past year.
- PromotionsLast5Years – The number of promotions received within the last five years.
- RecentPerformanceRating – The most recent formal performance appraisal score.

The dataset is expected to include both numerical and categorical data types, potentially with varying scales and distributions. Before modeling, the data will need to be explored and preprocessed to ensure quality and reliability. This will involve checking for missing values, outliers, inconsistent data entries, and correlations among variables. Exploratory Data Analysis (EDA) will also be performed to uncover patterns, relationships, and possible drivers of productivity.

Understanding these characteristics will guide appropriate feature engineering, data transformation, and model selection steps to ensure that the resulting machine learning model accurately reflects the underlying dynamics of employee productivity within the organization.
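
The checks named above (missing values, outliers, correlations, and category levels) could be sketched roughly as follows. This is a minimal example on a small synthetic frame, not on the HR dataset: the frame `demo`, its values, the injected defects, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; only the column names are borrowed from the dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(22, 60, size=50).astype(float),
    "AverageWeeklyHours": rng.normal(40, 5, size=50),
    "Department": rng.choice(["HR", "IT", "Sales"], size=50),
})
demo.loc[3, "Age"] = np.nan                 # inject one missing value
demo.loc[7, "AverageWeeklyHours"] = 120.0   # inject one implausible value

# 1. Missing values per column.
missing_counts = demo.isna().sum()

# 2. Outliers among numeric columns using the 1.5 * IQR rule.
numeric = demo.select_dtypes(include="number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

# 3. Pairwise correlations among numeric columns.
corr = numeric.corr()

# 4. Levels of the categorical column, to spot inconsistent labels.
category_levels = demo["Department"].value_counts()

print(missing_counts, outlier_counts, corr, category_levels, sep="\n\n")
```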

# <font face = 'Palatino Linotype' color = '#5885AF'> Data Preparation:</font>
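
An alternative way to organize the preparation steps in this section is a scikit-learn `ColumnTransformer`, which learns scaling statistics and category levels from the train set and reapplies them unchanged to the test set. The sketch below is not the notebook's own method; the frame `demo` and its values are synthetic assumptions, and only three of the dataset's column names are borrowed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data for illustration only.
rng = np.random.default_rng(42)
n_rows = 200
demo = pd.DataFrame({
    "Age": rng.integers(22, 60, size=n_rows),
    "AverageWeeklyHours": rng.normal(40, 5, size=n_rows),
    "Department": rng.choice(["HR", "IT", "Sales"], size=n_rows),
})
demo_train, demo_test = train_test_split(demo, test_size=0.30, random_state=42)

# Scale numeric columns and one-hot encode the categorical column in one object.
# handle_unknown="ignore" encodes unseen categories as all zeros instead of failing.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Age", "AverageWeeklyHours"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Department"]),
    ],
    sparse_threshold=0,
)

# Fit on the train set only, then reuse the fitted transformer on the test set.
train_matrix = preprocess.fit_transform(demo_train)
test_matrix = preprocess.transform(demo_test)
print(train_matrix.shape, test_matrix.shape)
```

Because the transformer is fitted once, the train and test matrices always share the same columns in the same order.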

In [None]:
# Drop the irrelevant feature for developing the machine learning model.
df = df.drop(['EmployeeID'], axis = 1)
df.head()

In [None]:
# Split the dataset into train and test sets.
# With 12 explanatory variables we would need at least 146 observations
# (N >= 50 + 8m, with m = 12) to train a regression model (Tabachnick and
# Fidell, 2013). A 70-30 split will be used for this project.
train, test = train_test_split(df, test_size=0.30, random_state=42)
print(f'''The number of records for the train set is {len(train)}.
The number of records for the test set is {len(test)}.''')
# Source: Tabachnick, B.G., Fidell, L.S., 2013. Using Multivariate Statistics,
#         6th ed. Pearson Education, Inc., Boston.

In [None]:
# Separating the explanatory variables from the outcome variable.
x_train = train.drop(['ProductivityScore'], axis = 1)
y_train = train['ProductivityScore']
x_train.head()

In [None]:
# Separating the explanatory variables from the outcome variable.
x_test = test.drop(['ProductivityScore'], axis = 1)
y_test = test['ProductivityScore']
x_test.head()

In [None]:
# Standardize all the continuous variables.
from sklearn.preprocessing import StandardScaler

# Assigning feature labels to variable continuous_vars.
continuous_vars = ['Age','YearsAtCompany', 'RemoteWorkRatio', 'JobSatisfactionScore',
                   'AverageWeeklyHours','NumProjects', 'ManagerFeedbackScore', 'TrainingHoursLastYear',
                  'PromotionsLast5Years', 'RecentPerformanceRating']

# Initialize StandardScaler.
scaler = StandardScaler()

# Fit scaler to the continuous variables and transform them.
x_train[continuous_vars] = scaler.fit_transform(x_train[continuous_vars])

In [None]:
# Standardize the continuous variables in the test set.
# Reuse the scaler fitted on the train set: calling transform (not fit_transform)
# keeps test-set statistics from leaking into the preprocessing.
x_test[continuous_vars] = scaler.transform(x_test[continuous_vars])

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# Checking for Multicollinearity among continuous variables using correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(x_train[continuous_vars].corr(), annot=True, cmap='coolwarm')
plt.show()

In [None]:
# One-hot encode categorical variables automatically
x_train = pd.get_dummies(x_train, drop_first=True)
x_train.head()

In [None]:
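# One-hot encode the test set, then align its columns with the train set so that
# both contain the same dummy columns in the same order.
x_test = pd.get_dummies(x_test, drop_first=True)
x_test = x_test.reindex(columns=x_train.columns, fill_value=0)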
x_test.head()

# <font face = 'Palatino Linotype' color = '#5885AF'> Modelling: Ridge Regression</font>
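
Ridge regression adds an L2 penalty, `alpha` times the sum of squared coefficients, to the least-squares loss. As a minimal sketch of what the `alpha` hyperparameter does, the example below solves the penalized normal equations directly and compares the result with scikit-learn's `Ridge`. The data are synthetic, and the intercept is switched off (`fit_intercept=False`) so the closed form applies without centering; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 1.0

# Closed-form ridge solution without an intercept: (X'X + alpha * I)^-1 X'y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# The same model fitted by scikit-learn.
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

# Larger alpha values shrink the coefficient vector toward zero.
norms = [
    np.linalg.norm(Ridge(alpha=a, fit_intercept=False).fit(X, y).coef_)
    for a in (0.01, 1.0, 100.0)
]

print("closed form :", np.round(w_closed, 4))
print("scikit-learn:", np.round(w_sklearn, 4))
print("coefficient norms for alpha = 0.01, 1, 100:", np.round(norms, 4))
```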

In [None]:
# Training a machine learning model for a regression problem using the x_train dataset and the
# outcome variable y_train.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')  # Suppress library warnings to keep the output readable

# Define Ridge regression model
ridge = Ridge()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0],  # Regularization strength (L2 penalty)
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']  # Solver options
}
# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=ridge, param_grid=param_grid, cv=5, scoring='neg_root_mean_squared_error' ) # cv=5 for 5-fold cross-validation
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_
# CV RMSE of best model
cv_rmse = -grid_search.best_score_  # negate because sklearn uses "maximize" convention
print("Mean 5-fold CV RMSE:", np.round(cv_rmse,2))

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

In [None]:
# Evaluate the best model on the test set using RMSE
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE on test set
print("Root Mean Squared Error on test set:", np.round(rmse_test,2))

# <font face = 'Palatino Linotype' color = '#5885AF'> Modelling: LASSO Regression</font>
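
LASSO uses an L1 penalty, which can set some coefficients to exactly zero and so acts as a form of feature selection. The sketch below illustrates this on synthetic data in which only two of five features influence the target; the data, the true coefficients, and the choice `alpha=0.5` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: features 0 and 3 matter, features 1, 2, and 4 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X, y)
coef = lasso_demo.coef_

# Indices of the features the model kept (non-zero coefficients).
selected = np.flatnonzero(coef)

print("coefficients:", np.round(coef, 3))
print("selected feature indices:", selected)
```

The surviving coefficients are also shrunk toward zero relative to the true values, which is the price paid for the sparsity.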

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

# Define the Lasso regression model
lasso = Lasso()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0]  # Regularization strength
}

# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=lasso, param_grid=param_grid, cv=5, scoring='neg_root_mean_squared_error' )
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_
# CV RMSE of best model
cv_rmse = -grid_search.best_score_  # negate because sklearn uses "maximize" convention
print("Mean 5-fold CV RMSE:", np.round(cv_rmse,2))

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

In [None]:
# Evaluate the best model on the test set using RMSE
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE on test set
print("Root Mean Squared Error on test set:", np.round(rmse_test,2))

# <font face = 'Palatino Linotype' color = '#5885AF'> Modelling: Elastic Net</font>
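
Elastic Net blends the L1 and L2 penalties, with `l1_ratio` controlling the mix: a value of 1 gives a pure LASSO penalty, and values near 0 approach a ridge-style penalty. The sketch below writes out the penalty term as documented for scikit-learn's `ElasticNet` and checks that `l1_ratio=1.0` reproduces `Lasso`. The data and the helper `elastic_net_penalty` are hypothetical, introduced only for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.0]) + rng.normal(scale=0.2, size=150)


def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty term: alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||_2^2."""
    l1_norm = np.sum(np.abs(coef))
    l2_norm_squared = np.sum(coef ** 2)
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1 - l1_ratio) * l2_norm_squared


# With l1_ratio=1.0 the L2 part vanishes and the model matches Lasso.
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso_ref = Lasso(alpha=0.1).fit(X, y)

# An even mix of the two penalties.
mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("l1_ratio=1.0:", np.round(enet_as_lasso.coef_, 3))
print("Lasso       :", np.round(lasso_ref.coef_, 3))
print("l1_ratio=0.5:", np.round(mixed.coef_, 3))
print("penalty at l1_ratio=0.5:", elastic_net_penalty(mixed.coef_, 0.1, 0.5))
```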

In [None]:
# Performing Elastic Net Regression
# Import necessary libraries
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

# Define the hyperparameter grid for Elastic Net
parametersGrid = {
    "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    "l1_ratio": np.arange(0.1, 0.9, 0.1)
}

# Initialize the Elastic Net model
eNet = ElasticNet()

# Perform grid search to find the best hyperparameters
grid_search  = GridSearchCV(eNet, parametersGrid, scoring='neg_root_mean_squared_error', cv=5)
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_
# CV RMSE of best model
cv_rmse = -grid_search.best_score_  # negate because sklearn uses "maximize" convention
print("Mean 5-fold CV RMSE:", np.round(cv_rmse,2))

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

In [None]:
# Evaluate the best model on the test set using RMSE
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE on test set
print("Root Mean Squared Error on test set:", np.round(rmse_test,2))

# <font face = 'Palatino Linotype' color = '#5885AF'> Modelling: Random Forest Regressor</font>
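
Since the business objective includes identifying the most influential drivers of productivity, a fitted random forest's `feature_importances_` attribute is one place to look. The sketch below shows the idea on synthetic data; the feature names `strong_driver`, `weak_driver`, and `noise` and the way the target is generated are assumptions for illustration, not findings about the employee dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on one feature, weakly on another.
rng = np.random.default_rng(7)
X = pd.DataFrame(
    rng.normal(size=(300, 3)),
    columns=["strong_driver", "weak_driver", "noise"],
)
y = 5.0 * X["strong_driver"] + 0.5 * X["weak_driver"] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1; rank them from high to low.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a precise measure of effect.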

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn.metrics import mean_squared_error

# Define Random Forest regressor (fixed seed so the grid search is reproducible)
rf_regressor = RandomForestRegressor(random_state=42)

# Define hyperparameters grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Define GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf_regressor, param_grid=param_grid, cv=5, scoring='neg_root_mean_squared_error' )

# Perform GridSearchCV
grid_search.fit(x_train, y_train)

# Print best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)
# Get the best model
best_model = grid_search.best_estimator_

# CV RMSE of best model
cv_rmse = -grid_search.best_score_  # negate because sklearn uses "maximize" convention
print("Mean 5-fold CV RMSE:", np.round(cv_rmse,2))
# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

In [None]:
# Evaluate the best model on the test set using RMSE
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE on test set
print("Root Mean Squared Error on test set:", np.round(rmse_test,2))

# <font face = 'Palatino Linotype' color = '#5885AF'> Saving the Model for Future Deployment</font>
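
A pickled model is only useful if it can be reloaded and still returns the same predictions. The sketch below shows a save-and-load round trip on a small synthetic model written to a temporary directory; the file name `demo_model.pkl` and the data are assumptions for illustration. Note that the notebook's model was trained on standardized and one-hot encoded features, so new data would need the same preparation before calling `predict`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model, standing in for the tuned model from the notebook.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=80)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name

    # Context managers close the file even if dumping or loading fails.
    with open(path, "wb") as model_file:
        pickle.dump(model, model_file)

    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same_predictions)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.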

In [None]:
# Save a copy of the Random Forest model.
import pickle
with open('RFEMPmodel.pkl', 'wb') as model_file:  # the context manager closes the file after writing
    pickle.dump(best_model, model_file)