Phase 3 - Machine Learning: Apply machine learning to analyze satisfaction, segment employees, and recommend benefits.

Employee Satisfaction Analysis
- analyze SatisfactionScore distributions by BenefitID, BenefitSubType, and demographics
- compute correlations between UsageFrequency and SatisfactionScore
- perform predictive sentiment analysis on Comments (positive, neutral, negative)
- identify satisfaction drivers
    - ex: regression on SatisfactionScore
- create scorecards by BenefitType / BenefitSubType with sentiment insights
- deliverable → Employee Satisfaction Insights Report
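
The correlation and driver steps above could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual analysis: the `toy` frame, its values, and the linear relationship baked into it are all assumptions, and only the column names come from the plan.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real data. Column names follow the plan,
# but every value (and the usage -> satisfaction link) is made up.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "UsageFrequency": rng.integers(0, 12, size=n),
    "Age": rng.integers(22, 65, size=n),
    "Tenure": rng.integers(0, 30, size=n),
})
toy["SatisfactionScore"] = (
    2.0 + 0.2 * toy["UsageFrequency"] + rng.normal(0, 0.5, size=n)
).clip(1, 5)

# Rank (Spearman) correlation, a reasonable default for ordinal scores.
corr = toy["UsageFrequency"].corr(toy["SatisfactionScore"], method="spearman")

# Standardize features so the regression coefficients are comparable,
# then rank features by absolute coefficient as a rough "driver" ordering.
features = ["UsageFrequency", "Age", "Tenure"]
X = (toy[features] - toy[features].mean()) / toy[features].std()
model = LinearRegression().fit(X, toy["SatisfactionScore"])
drivers = pd.Series(model.coef_, index=features).sort_values(
    key=abs, ascending=False
)

print(f"Spearman correlation: {corr:.2f}")
print(drivers)
```

Standardized coefficients only give a rough ordering; on the real data, correlated predictors (for example Age and Tenure) would make individual coefficients harder to interpret.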

Employee Segmentation
- create usage vectors (total UsageFrequency per BenefitSubType)
- generate temporal profiles
    - ex: monthly usage via PCA or summary stats
- apply clustering
    - ex: K-Means with silhouette score, Gaussian Mixture Models
- validate clusters (silhouette score, Davies-Bouldin index)
- profile clusters by usage patterns and demographics
    - ex: "wellness enthusiasts"
- deliverable → Segmentation Analysis Report with Visualizations
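
A rough sketch of the segmentation steps listed above (usage vectors, K-Means, validation, profiling). The data is synthetic and the three sub-type names are hypothetical placeholders; the candidate range of `k` is also an assumption, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic long-format usage log: 90 employees, each favouring one of
# three hypothetical sub-types. Values are made up for illustration.
rng = np.random.default_rng(42)
subtypes = ["Gym", "Transit", "Tuition"]
rows = []
for emp in range(90):
    favourite = emp % 3
    for j, sub in enumerate(subtypes):
        rate = 8 if j == favourite else 1
        rows.append({
            "EmployeeID": emp,
            "BenefitSubType": sub,
            "UsageFrequency": rng.poisson(rate),
        })
usage = pd.DataFrame(rows)

# Usage vectors: one row per employee, total usage per sub-type.
vectors = usage.pivot_table(
    index="EmployeeID", columns="BenefitSubType",
    values="UsageFrequency", aggfunc="sum", fill_value=0,
)
X = StandardScaler().fit_transform(vectors)

# Score each candidate k with both validation metrics
# (silhouette: higher is better; Davies-Bouldin: lower is better).
scores = {}
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (silhouette_score(X, candidate), davies_bouldin_score(X, candidate))
best_k = max(scores, key=lambda k: scores[k][0])

# Refit with the chosen k and profile clusters by mean usage per sub-type.
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
profile = vectors.assign(cluster=labels).groupby("cluster").mean()
print(best_k)
print(profile.round(1))
```

Here `best_k` is picked by silhouette alone; in practice the two metrics can disagree, and the cluster profiles should also be checked for interpretability before naming segments.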

Recommender System
- build a user-item matrix (EmployeeID vs. BenefitID / BenefitSubType, values = UsageFrequency)
- use collaborative filtering or content-based filtering
    - ex (collaborative): k-NN, SVD
    - ex (content-based): cosine similarity on metadata
- evaluate offline (Precision@K, Recall@K, MAP) and propose an A/B test for online evaluation
- suggest benefits based on peer usage or metadata
    - ex: department, tenure
- deliverable → Recommender System Notebook with evaluation metrics and sample recommendations
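
As one possible starting point for the steps above, the sketch below builds the user-item matrix and applies a simple item-based collaborative filter using cosine similarity. The tiny `usage` table is invented, and `recommend` and `precision_at_k` are hypothetical helper names, not part of any library.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up usage log with the column names from the plan.
usage = pd.DataFrame({
    "EmployeeID":     [1, 1, 2, 2, 2, 3, 3, 4],
    "BenefitID":      [10, 11, 10, 11, 12, 11, 12, 10],
    "UsageFrequency": [5, 3, 4, 2, 1, 6, 2, 3],
})

# User-item matrix: rows = employees, columns = benefits, values = usage.
matrix = usage.pivot_table(
    index="EmployeeID", columns="BenefitID",
    values="UsageFrequency", aggfunc="sum", fill_value=0,
)

# Item-item cosine similarity computed from the usage columns.
item_sim = pd.DataFrame(
    cosine_similarity(matrix.T),
    index=matrix.columns, columns=matrix.columns,
)

def recommend(employee_id, k=2):
    """Rank benefits the employee has not used by similarity-weighted usage."""
    used = matrix.loc[employee_id]
    scores = item_sim.dot(used)   # similarity to each item, weighted by usage
    scores = scores[used == 0]    # keep only benefits not yet used
    return scores.sort_values(ascending=False).head(k).index.tolist()

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that appear in the relevant set."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return len(set(top_k) & set(relevant)) / k

print(recommend(1))
print(recommend(4))
```

For offline evaluation, `relevant` would come from held-out interactions (for example, each employee's most recent usage records), which this toy example does not split out.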


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, r2_score, classification_report, silhouette_score

In [4]:
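# Load the cleaned data.
# Note: the "Unnamed: 0" column in the output below is likely a row index
# saved by an earlier to_csv call; it is kept here as loaded.
cleaned_data = pd.read_csv('data/cleaned_data.csv')
# Preview the first five rows
cleaned_data.head()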

Unnamed: 0,EmployeeID,BenefitID,UsageFrequency,LastUsedDate,Age,Gender,Department,Tenure,BenefitType,BenefitSubType,...,subcat_PPO Individual,subcat_Premium Discount Tier 1,subcat_Professional Certification,subcat_Supplemental High Amount,subcat_Supplemental Standard,subcat_Tier 1 Partners,subcat_Tier 2 Partners,subcat_Tier 3 Partners,subcat_Transit Subsidy,subcat_Undergraduate Degree
0,220,20,4,2024-05-03,64,Male,HR,35,Tuition Reimbursement,Undergraduate Degree,...,False,False,False,False,False,False,False,False,False,True
1,1820,26,1,2024-02-08,53,Male,Finance,2,Gym Membership,Family Membership,...,False,False,False,False,False,False,False,False,False,False
2,285,16,2,2023-10-27,64,Male,Marketing,35,Health Insurance,HDHP Individual,...,False,False,False,False,False,False,False,False,False,False
3,4536,8,8,2024-07-03,32,Female,Sales,10,Wellness Programs,Premium Discount Tier 1,...,False,True,False,False,False,False,False,False,False,False
4,1262,12,3,2024-04-13,42,Male,Finance,1,Tuition Reimbursement,Graduate Degree,...,False,False,False,False,False,False,False,False,False,False
