### Dataset Information:

IBM has gathered information on employee satisfaction, income, seniority and some demographics for 1470 employees. Clustering analysis can be performed on this dataset to group employees by similar characteristics.

### ATTRIBUTES:

1 Age
2 Attrition
3 BusinessTravel
4 DailyRate
5 Department
6 DistanceFromHome
7 Education
8 EducationField
9 EmployeeCount
10 EmployeeNumber
11 EnvironmentSatisfaction
12 Gender
13 HourlyRate
14 JobInvolvement
15 JobLevel
16 JobRole
17 JobSatisfaction
18 MaritalStatus
19 MonthlyIncome
20 MonthlyRate
21 NumCompaniesWorked
22 Over18
23 OverTime
24 PercentSalaryHike
25 PerformanceRating
26 RelationshipSatisfaction
27 StandardHours
28 StockOptionLevel
29 TotalWorkingYears
30 TrainingTimesLastYear
31 WorkLifeBalance
32 YearsAtCompany
33 YearsInCurrentRole
34 YearsSinceLastPromotion
35 YearsWithCurrManager

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans,AgglomerativeClustering
from scipy.stats import zscore
from sklearn.metrics import silhouette_score,classification_report
import pandas as pd

pd.options.display.max_columns=1000

In [None]:
# Load dataset
    
df=pd.read_csv("../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv")

df.head()


## Data Understanding

In [None]:
#data
df.head()

In [None]:
#shape of the dataset
print('number of rows:',df.shape[0])
print('number of columns:',df.shape[1])

In [None]:
df.info()

In [None]:
df.describe().transpose()

In [None]:
# categorical variables
df.select_dtypes(include='object').columns

In [None]:
# percentage of each Attrition category
a=df['Attrition'].value_counts()
per=(a.values/df.shape[0])*100
p1=pd.DataFrame()
# use the value_counts index so that labels and percentages stay aligned
p1['Attrition']=a.index
p1['Percentage']=per
p1

In [None]:
# percentage of each Department category
a=df['Department'].value_counts()
per=(a.values/df.shape[0])*100
p1=pd.DataFrame()
# use the value_counts index so that labels and percentages stay aligned
p1['Department_name']=a.index
p1['Percentage']=per
p1

In [None]:
# percentage of each MaritalStatus category
a=df['MaritalStatus'].value_counts()
per=(a.values/df.shape[0])*100
p1=pd.DataFrame()
# use the value_counts index so that labels and percentages stay aligned
p1['MaritalStatus']=a.index
p1['Percentage']=per
p1

In [None]:
# percentage of each Gender category
a=df['Gender'].value_counts()
per=(a.values/df.shape[0])*100
p1=pd.DataFrame()
# use the value_counts index so that labels and percentages stay aligned
p1['Gender']=a.index
p1['Percentage']=per
p1

In [None]:
# percentage of each OverTime category
a=df['OverTime'].value_counts()
per=(a.values/df.shape[0])*100
p1=pd.DataFrame()
# use the value_counts index so that labels and percentages stay aligned
p1['OverTime']=a.index
p1['Percentage']=per
p1
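The five percentage tables above repeat the same steps, so they could be folded into one small helper. The sketch below is only an illustration: the function name `category_percentages` and the toy frame are assumptions, not part of the original notebook. It relies on `value_counts(normalize=True)`, which keeps each label attached to its own share.

```python
import pandas as pd

def category_percentages(frame, column):
    """Return each category of `column` with its share of rows, in percent."""
    shares = frame[column].value_counts(normalize=True) * 100
    # rename_axis labels the category column; reset_index names the value column
    return shares.rename_axis(column).reset_index(name="Percentage")

# Tiny made-up frame, used only to show the shape of the result
toy = pd.DataFrame({"OverTime": ["Yes", "No", "No", "No"]})
print(category_percentages(toy, "OverTime"))
```

On the real data this would be called as `category_percentages(df, 'Department')`, and likewise for the other categorical columns.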

In [None]:
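#Correlation (numeric columns only; string columns cannot be correlated)
df.select_dtypes(exclude='object').corr()

In [None]:
# covariance (numeric columns only)
df.select_dtypes(exclude='object').cov()

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(df.select_dtypes(exclude='object').corr() , annot=True)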
plt.show()

#### INFERENCES:
1. The correlation matrix above shows strong positive correlations between several variables, for example between TotalWorkingYears and Age, JobLevel, MonthlyIncome and YearsAtCompany.
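Reading strongly correlated pairs off a large annotated heatmap is error-prone, so it can help to list them directly. The following is a minimal sketch under stated assumptions: the helper name `top_correlated_pairs`, the 0.7 threshold and the toy columns are all illustrative and do not come from the notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.7):
    """List numeric column pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # keep only the upper triangle so that each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=np.abs, ascending=False)

# Made-up data: y is a linear function of x, z alternates in sign
toy = pd.DataFrame({
    "x": np.arange(10),
    "y": 2 * np.arange(10) + 1,
    "z": [1, -1] * 5,
})
found = top_correlated_pairs(toy)
print(found)
```

Applied to the HR data, the call would be `top_correlated_pairs(df)`; the threshold is a judgment call and may need tuning.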


In [None]:
# boxplot
df_num=df.select_dtypes(exclude='object')
df_num.drop(columns='PerformanceRating' , inplace=True)
for i in range(len(df_num.columns)):
    sns.boxplot(df_num.iloc[:,i])
    plt.show()

## Data Preparation 

a. Scale, transform and clean the data so that it is suitable for model building.


In [None]:
# Check unique values
df.nunique()

In [None]:
# EmployeeCount and StandardHours each have only 1 unique value, so we remove these columns
df.drop(columns=['EmployeeCount','StandardHours'] , inplace=True)

In [None]:
# null values
df.isnull().sum()

#### There are no null values present in the data

In [None]:
df.columns

In [None]:
# check Outliers
# boxplot
df_num=df.select_dtypes(exclude='object')
for i in range(len(df_num.columns)):
    sns.boxplot(df_num.iloc[:,i])
    plt.show()

In [None]:
df_num.skew()

1. The boxplots and skewness values above show no extreme values in most columns.
2. Most skewness values are small enough to proceed with further analysis.
3. The two most skewed columns, YearsAtCompany and YearsSinceLastPromotion, are square-root transformed below.
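To see why a square-root transform helps with right-skewed, non-negative columns, the sketch below compares skewness before and after on synthetic data. The exponential sample is an assumption standing in for a tenure-like feature; it is not the HR data, and `log1p` is shown only as a common alternative, not as something the notebook uses.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed, non-negative values (stand-in for a tenure-like column)
tenure = pd.Series(rng.exponential(scale=5.0, size=1000))

skew_raw = tenure.skew()
skew_sqrt = np.sqrt(tenure).skew()     # transform used in this notebook
skew_log1p = np.log1p(tenure).skew()   # alternative that also handles zeros

print("raw:", round(skew_raw, 2))
print("sqrt:", round(skew_sqrt, 2))
print("log1p:", round(skew_log1p, 2))
```

Which transform is preferable depends on the column; comparing the resulting skewness values, as here, is one simple way to decide.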

In [None]:
# transformation of yearsatcompany

print('\nSkewness before transformation:',df['YearsAtCompany'].skew())
df['YearsAtCompany']=np.sqrt(df['YearsAtCompany'])
print('\nSkewness after transformation:',df['YearsAtCompany'].skew())

In [None]:
# transformation of YearsSinceLastPromotion

print('\nSkewness before transformation:',df['YearsSinceLastPromotion'].skew())
df['YearsSinceLastPromotion']=np.sqrt(df['YearsSinceLastPromotion'])
print('\nSkewness after transformation:',df['YearsSinceLastPromotion'].skew())

In [None]:
# we will scale the numerical data
df_num=df.select_dtypes(exclude='object')
df_num_scaled=df_num.apply(zscore)

# we will encode categorical data
df_cat=df.select_dtypes(include='object')
df_cat_dummy=pd.get_dummies(df_cat, drop_first=True)


# concat numerical & categorical data
xscaled=pd.concat([df_num_scaled,df_cat_dummy] , axis=1).reset_index(drop=True)
xscaled.head()

In [None]:
xscaled.isnull().sum()

## Dimensionality Reduction : Principal component analysis


1. Correlation between independent features in the dataset indicates that multicollinearity exists in the data.
Principal component analysis (PCA) can be applied to reduce the number of feature dimensions and remove the multicollinearity effect in the dataset.

2. Features that are strongly correlated with each other are good candidates for PCA.

3. From the correlation matrix in the Data Understanding section, we can observe that some variables are strongly correlated with other variables.

4. Total working years is strongly correlated with age, job level, monthly income and years at company;
years at company is also strongly correlated with years in current role and years with current manager.

5. Strongly correlated features carry redundant information, so one or more of them would otherwise have to be removed before building a model.

6. We will apply PCA to remove the multicollinearity effect.



In [None]:
from sklearn.decomposition import PCA


In [None]:
# We will use the scaled data - xscaled
# COLUMNS 
xscaled.shape[1]

In [None]:
# Fit PCA with 29 components to inspect the cumulative explained variance
# (n_components must not exceed xscaled.shape[1])
pca=PCA(n_components=29)
compo=pca.fit_transform(xscaled)

In [None]:
# Explained variance ratio
evr=pca.explained_variance_ratio_*100
evr

In [None]:
# Cumulative explained variance 
cevr=np.cumsum(evr)
cevr

In [None]:
# plot explained variance ratio & cumulative evr
plt.figure(figsize=(20,12))
plt.bar(np.arange(29),evr)
plt.step(np.arange(29),cevr)
plt.xticks(np.arange(29))
plt.show()

#### INFERENCES:
1. From the explained variance bars and the cumulative curve above, we can see that each component beyond the 18th adds little variance.

2. To retain about 90% of the variance, we choose 18 components, which explain 90.53% of the variance in the data.

3. The number of components chosen using PCA is 18.
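The component count can also be derived programmatically instead of being read off the plot. The sketch below is an assumption-laden illustration: `components_for_variance` is a hypothetical helper and the data is synthetic. It also shows scikit-learn's built-in shortcut of passing a fraction as `n_components` together with `svd_solver="full"`.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(data, target=0.90):
    """Smallest number of components whose cumulative variance ratio reaches `target`."""
    pca = PCA().fit(data)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n = int(np.searchsorted(cumulative, target)) + 1
    # guard against floating point leaving the last cumulative value just below target
    return min(n, len(cumulative)), cumulative

rng = np.random.default_rng(0)
# Synthetic data: two high-variance directions and three low-variance ones
demo = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.1, 0.1])

n_needed, cumulative = components_for_variance(demo, target=0.90)
print(n_needed, np.round(cumulative, 3))

# Equivalent shortcut: a float n_components asks PCA to pick the count itself
auto = PCA(n_components=0.90, svd_solver="full").fit(demo)
print(auto.n_components_)
```

On the notebook's data the call would be `components_for_variance(xscaled, 0.90)`, which avoids hardcoding the number of components in several places.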

In [None]:
# xpca - for further analysis
# note: 22 components are kept here, although the inference above selects 18

xpca=PCA(n_components=22).fit_transform(xscaled)

#shape of xpca
xpca.shape

In [None]:
# PCA-transformed array (one column per retained component)
xpca

In [None]:
# dataframe using pca
dfpca=pd.DataFrame(data=xpca, columns=np.arange(1,23))
dfpca.head()

In [None]:
# check multicollinearity
plt.figure(figsize=(20,10))
sns.heatmap(dfpca.corr() , annot=True)
plt.show()

### INFERENCES:

1. The correlation matrix above shows no multicollinearity among the features obtained using PCA, which is expected because principal components are uncorrelated by construction.
2. Our main objective was to remove the multicollinearity using PCA.
3. We can proceed with the above dimensions for further analysis.
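The absence of correlation among principal components can be checked numerically as well as visually. This is a small sketch on synthetic, deliberately collinear features (an assumption for illustration, not the HR data): after PCA, the largest off-diagonal correlation is zero up to floating-point error.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
# Three strongly correlated synthetic features plus a little independent noise
features = np.hstack([base, 2 * base, -base]) + 0.1 * rng.normal(size=(300, 3))

scores = PCA().fit_transform(features)
corr = np.corrcoef(scores, rowvar=False)

# Largest absolute correlation between two different components
max_abs = np.abs(corr - np.eye(corr.shape[0])).max()
print(max_abs)
```

The same check on the notebook's data would use `dfpca.corr()` in place of `np.corrcoef`.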

In [None]:
# spread of the data using distribution plot
# (histplot with kde=True replaces the deprecated sns.distplot)
for i in range(len(dfpca.columns)):
    sns.histplot(dfpca.iloc[:,i], kde=True)
    plt.show()

In [None]:
#skewness in the pca dimensions
dfpca.skew()

In [None]:
# check Outliers
# boxplot
for i in range(len(dfpca.columns)):
    sns.boxplot(dfpca.iloc[:,i])
    plt.show()

### INFERENCES

1. From the distribution plots above, the data in each feature is approximately normally distributed.
2. The boxplots and skewness values show no extreme values in the features.
3. The skewness of each feature is small, so we could proceed with the PCA dimensions without treating outliers.
4. Although outlier treatment is not strictly needed, we still apply the interquartile range (IQR) method to remove any outlying rows.

In [None]:
q1=dfpca.quantile(0.25)
q3=dfpca.quantile(0.75)
iqr=q3-q1

# fences at 1.0*IQR beyond the quartiles (narrower than the conventional 1.5*IQR rule)
ll=q1-iqr
ul=q3+iqr

dfpca=dfpca[~((dfpca<ll) | (dfpca>ul)).any(axis=1)]
dfpca=dfpca.reset_index(drop=True)
dfpca.head()
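The IQR filter above can be written as a reusable function with the fence multiplier exposed as a parameter, which makes it easy to see how many rows each choice removes. This is a sketch only: `iqr_filter` and the toy frame are hypothetical names and data, not part of the notebook.

```python
import pandas as pd

def iqr_filter(frame, k=1.5):
    """Keep rows where every column lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    inside = ((frame >= q1 - k * iqr) & (frame <= q3 + k * iqr)).all(axis=1)
    return frame[inside].reset_index(drop=True)

# Made-up values: 6.5 in column "a" is inside the 1.5*IQR fence but outside the 1.0*IQR fence
toy = pd.DataFrame({"a": [1, 2, 3, 4, 6.5], "b": [1, 2, 3, 4, 5]})
print(len(iqr_filter(toy, k=1.5)))
print(len(iqr_filter(toy, k=1.0)))
```

Because a row is dropped if any one column is outside its fence, the number of removed rows grows quickly with the number of columns; printing the row count before and after is a useful sanity check.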

In [None]:
# check Outliers after removing them
# boxplot
for i in range(len(dfpca.columns)):
    sns.boxplot(dfpca.iloc[:,i])
    plt.show()

In [None]:
# check skewness
dfpca.skew()

In [None]:
# we can observe that outliers are removed from pca dataframe

## KMeans clustering

In [None]:
# kmeans clustering 

#INERTIA VALUES

inert=[]

for k in range(1,12):
    # fixed n_init and random_state make the inertia values reproducible
    kmeans=KMeans(n_clusters=k , n_init=15, random_state=10)
    kmeans.fit(dfpca)
    inert.append(kmeans.inertia_)
    
    
    
# inertia values for each cluster number
inertia=pd.DataFrame()
inertia['Clusters']=np.arange(1,12)
inertia['inertia']=inert
inertia 
 

In [None]:
# ELBOW PLOT - to find best value for cluster number
plt.figure(figsize=(15,8))
plt.plot(range(1,12),inertia['inertia'],color='red',marker='*')
plt.xticks(np.arange(1,12))
plt.xlabel('Number of clusters',fontsize=15)
plt.ylabel('Inertia',fontsize=15)
plt.show()

1. In the elbow plot and the inertia values above, the sharpest bend is at 2 clusters; beyond that, the decrease in inertia is small compared with the earlier drop.

2. We therefore build the KMeans model using 2 clusters.
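An elbow can be ambiguous, so the silhouette score across candidate values of k is a common second opinion. The sketch below uses synthetic blobs with hand-picked centers (an assumption for illustration, not the HR data) and picks the k with the highest silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups; the centers are chosen only for illustration
points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On the notebook's data, `points` would be replaced by `dfpca`; if the silhouette choice disagrees with the elbow, that is worth reporting rather than hiding.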

In [None]:
# build the kmeans model
kmeans=KMeans(n_clusters=2 , n_init=15, random_state=10)
kmeans.fit(dfpca)

In [None]:
# inertia score
print('\nInertia in the kmeans clustering',kmeans.inertia_)

#silhouette score
print('\nSilhouette score for kmeans clustering',silhouette_score(dfpca,kmeans.labels_))

In [None]:
# dataframe with label attached for kmeans clustering 

dfkmean=dfpca.copy()
dfkmean['label']=kmeans.labels_
dfkmean.head()

In [None]:
# Visualization of clusters 
# 2D
# pca component 1 & 2
plt.figure(figsize=(10,6))
plt.scatter(dfkmean[dfkmean.columns[0]],dfkmean[dfkmean.columns[1]],c=kmeans.labels_ , cmap=plt.cm.Set1)
plt.show()

In [None]:
# pca component 2 & 3
plt.figure(figsize=(10,8))
plt.scatter(dfkmean[dfkmean.columns[1]],dfkmean[dfkmean.columns[2]],c=kmeans.labels_ , cmap=plt.cm.Set1)
plt.show()

## Agglomerative clustering

In [None]:
# dendrogram to find best clusters number

from scipy.cluster.hierarchy import linkage,dendrogram

plt.figure(figsize=(15,10))

z=linkage(dfpca, method='ward')


# note that, color threshold is adjusted after observing dendrogram

dendrogram(z , leaf_rotation=90,color_threshold=21)
plt.show()

1. The dendrogram above shows 3 clearly separated clusters.
2. The optimum number of clusters indicated by the dendrogram is therefore 3.
3. We will build the agglomerative clustering model using 3 clusters.

In [None]:
# Agglomerative clustering

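# ward linkage always uses Euclidean distance; the affinity argument is omitted
# because recent scikit-learn versions no longer accept it
aglo=AgglomerativeClustering(n_clusters=3 , linkage='ward')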
aglo.fit(dfpca)

In [None]:
# dataframe is saved with agglomerative clustering labels
dfaglo=dfpca.copy()
dfaglo['label']=aglo.labels_
dfaglo.head()

#### Inertia for agglomerative clusters - calculated manually

In [None]:
# group by labels
agc=dfaglo.groupby(['label'])
df0=agc.get_group(0)
df1=agc.get_group(1)
df2=agc.get_group(2)


In [None]:
# find centroids for each cluster
c0=np.array(df0.mean())
c1=np.array(df1.mean())
c2=np.array(df2.mean())


In [None]:

# exclude last column of labels
c0=c0[:-1]
c1=c1[:-1]
c2=c2[:-1]

In [None]:
# find inertia for each cluster 
agi0=0
agi1=0
agi2=0

for i in np.arange(df0.shape[0]):
    agi0=agi0+np.sum((df0.iloc[i,:-1]-c0)**2)
    

for i in np.arange(df1.shape[0]):
    agi1=agi1+np.sum((df1.iloc[i,:-1]-c1)**2)    
    

for i in np.arange(df2.shape[0]):
    agi2=agi2+np.sum((df2.iloc[i,:-1]-c2)**2)    

In [None]:
# Add the inertia scores of all three clusters

total_aglo_inertia=agi0+agi1+agi2


In [None]:
# inertia score
print('\nInertia in the agglomerative clustering',total_aglo_inertia)

#silhouette score
print('\nSilhouette score for agglomerative clustering',silhouette_score(dfpca,aglo.labels_))
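The manual inertia calculation above can be written once for any number of clusters, which removes the risk of leaving a cluster out of the total. The helper name `within_cluster_ss` and the four toy points are assumptions made for this sketch.

```python
import numpy as np

def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to the mean of its own cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Two clusters of two points each; every point is 1 unit from its cluster mean
toy_points = [[0, 0], [2, 0], [10, 0], [12, 0]]
toy_labels = [0, 0, 1, 1]
print(within_cluster_ss(toy_points, toy_labels))  # 4.0
```

For the agglomerative model the call would be `within_cluster_ss(dfpca, aglo.labels_)`, which is directly comparable to `kmeans.inertia_` when both models use the same number of clusters.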

### Clusters plot for agglomerative clustering

In [None]:
# Visualization of clusters in agglomerative clustering 
# 2D
# pca component 1 & 2
plt.figure(figsize=(10,6))
plt.scatter(dfaglo[dfaglo.columns[0]],dfaglo[dfaglo.columns[1]],c=aglo.labels_ , cmap=plt.cm.Set1)
plt.show()

In [None]:
# pca component 2 & 3
plt.figure(figsize=(10,6))
plt.scatter(dfaglo[dfaglo.columns[1]],dfaglo[dfaglo.columns[2]],c=aglo.labels_ , cmap=plt.cm.Set1)
plt.show()

## COMPARISON: silhouette_score and Inertia of KMeans & Agglomerative clustering

In [None]:
# Kmeans clustering

# inertia score
print('\nInertia in the kmeans clustering',kmeans.inertia_)
#silhouette score
print('\nSilhouette score for kmeans clustering',silhouette_score(dfpca,kmeans.labels_))

In [None]:
# agglomerative clustering

# inertia score
print('\nInertia in the agglomerative clustering',total_aglo_inertia)
#silhouette score
print('\nSilhouette score for agglomerative clustering',silhouette_score(dfpca,aglo.labels_))

### INFERENCES

1. In the run reported here, the inertia of the clusters obtained using KMeans was lower than the inertia of the clusters obtained using agglomerative clustering, which means the KMeans clusters are tighter. Note that the two models use different numbers of clusters (2 for KMeans, 3 for agglomerative) and inertia generally decreases as the number of clusters grows, so this comparison is not strictly like-for-like.

2. The silhouette score is another metric for checking the quality of the clusters; it is higher for KMeans clustering than for agglomerative clustering.

3. Comparing inertia and silhouette score, we choose the clusters obtained from the KMeans model for the classification model built later.
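A fairer way to compare the two algorithms is to score them at the same number of clusters. The sketch below does this on synthetic blobs (an assumption standing in for the PCA features) and records the silhouette score of each method for k = 2 and k = 3.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA features, not the HR data
points, _ = make_blobs(n_samples=200, centers=3, random_state=0)

comparison = {}
for k in (2, 3):
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    ag_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(points)
    comparison[k] = {
        "kmeans": silhouette_score(points, km_labels),
        "agglomerative": silhouette_score(points, ag_labels),
    }

for k, result in comparison.items():
    print(k, result)
```

With `points` replaced by `dfpca`, this gives one row per k in which both methods are judged under the same conditions.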

## COMPARISON: Cluster visualization of KMeans & Agglomerative clustering

In [None]:
# Visualization of clusters with kmeans clustering
# 2D
# pca component 1 & 2
plt.figure(figsize=(10,6))
plt.scatter(dfkmean[dfkmean.columns[0]],dfkmean[dfkmean.columns[1]],c=kmeans.labels_ , cmap=plt.cm.Set1)
plt.show()

In [None]:
# Visualization of clusters in agglomerative clustering 
# 2D
# pca component 1 & 2
plt.figure(figsize=(10,6))
plt.scatter(dfaglo[dfaglo.columns[0]],dfaglo[dfaglo.columns[1]],c=aglo.labels_ , cmap=plt.cm.Set1)
plt.show()

### INFERENCES:

1. In the KMeans plot of the first 2 components, the 2 clusters are clearly visible,
but we cannot observe separate clusters in the agglomerative cluster plot.

2. Using agglomerative clustering we got 3 clusters.

3. We will use the KMeans-labelled dataframe to build the classification model, because KMeans gave the better inertia and silhouette scores.

In [None]:
# spread of the data using distribution plot
# (histplot with kde=True replaces the deprecated sns.distplot)
for i in range(len(dfpca.columns)):
    sns.histplot(dfpca.iloc[:,i], kde=True)
    plt.show()

## Use the cluster labels from the best method above and interpret the clusters formed.

### The KMeans-labelled dataframe is used and a logistic regression classification model is built to check the accuracy of the model

In [None]:
# As decided above, we go with the labelled dataframe obtained from kmeans clustering.
# To interpret the clusters formed by the clustering model above,
# we build a classification model and check its accuracy, which shows how consistently
# the kmeans cluster labels can be recovered from the features

In [None]:
# we will use this dataframe
dfkmean.head()

In [None]:
# split x & y
from sklearn.model_selection import train_test_split
x=dfkmean.drop(columns=['label'])
y=dfkmean['label']

xtrain,xtest,ytrain,ytest=train_test_split(x , y , test_size=0.3, random_state=20)

#check shape of the dataframe
(xtrain.shape, xtest.shape , ytrain.shape , ytest.shape)

In [None]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(xtrain,ytrain)


In [None]:
from sklearn.metrics import accuracy_score

# train data accuracy
ytrain_pred=lr.predict(xtrain)
acc_train=accuracy_score(ytrain , ytrain_pred)
print('\nAccuracy for train data : ',acc_train)

# test data accuracy
ytest_pred=lr.predict(xtest)
acc_test=accuracy_score(ytest , ytest_pred)
print('\nAccuracy for test data : ',acc_test)

In [None]:
# Classification report
print('\nClassification Report : \n')
print(classification_report(ytest,ytest_pred))

### INFERENCES

1. From the above measures,
the accuracy score is 99.3% on the train data and 98.38% on the test data.
Both scores are high and close to each other, so the model is not overfitting and the KMeans cluster labels can be predicted reliably from the PCA features. Because KMeans separates its clusters with linear boundaries, a linear classifier is expected to recover them well; the high accuracy therefore shows that the labels are consistent, not by itself that the clusters are meaningful.

2. The classification report shows precision, recall and f1-score for each label.
All the scores are high.

3. From the above accuracy scores, we can say that the clusters obtained from the KMeans model are well separated.
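How much weight the classifier accuracy can carry is easier to judge with a control experiment. The sketch below, which uses pure Gaussian noise with no group structure (an assumption chosen for illustration), clusters the noise with KMeans and then trains a logistic regression on the resulting labels. Accuracy is still high, because the classifier only has to relearn the linear boundary that KMeans drew.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Pure noise: by construction there are no real groups to discover
noise = rng.normal(size=(1000, 5))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(noise)
x_tr, x_te, y_tr, y_te = train_test_split(noise, labels, test_size=0.3, random_state=0)

accuracy = LogisticRegression(max_iter=1000).fit(x_tr, y_tr).score(x_te, y_te)
print("test accuracy on KMeans labels of pure noise:", accuracy)
```

A high score here suggests treating classifier accuracy as a consistency check on the labels and relying on measures such as the silhouette score, or on domain interpretation, to judge whether the clusters are meaningful.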

## Summary

1. Initially the data is explored: number of rows and columns, and the five-point summary.
2. The data is cleaned: null values are checked and outliers are treated.
3. Multicollinearity is checked; to remove it we used PCA, which also reduces the dimensionality and redundancy in the data.
4. We then used two clustering models: KMeans and Agglomerative.
5. Before building each clustering model, we found the best number of clusters.
For KMeans, we chose the number of clusters using inertia and the elbow plot.
For Agglomerative, we chose the number of clusters using the dendrogram.
6. The elbow plot suggested 2 clusters for KMeans, and the dendrogram suggested 3 clusters for agglomerative clustering.
7. We built the two models and computed inertia and silhouette scores for each; the scores are printed below.
We got lower inertia and a higher silhouette score for the KMeans model, so for this dataset KMeans gives better clusters than agglomerative clustering.
8. To check how consistent the clustering is, we built a classification model and looked at its accuracy on the train and test data.
9. The accuracy score is 99.3% on the train data and 98.38% on the test data.
Both scores are high and close to each other, so the KMeans cluster labels can be predicted reliably from the features.

10. To check whether the models are good, we used the measures below:

   1. To choose the best clustering model  ->  inertia, silhouette score
   2. To check the clusters' accuracy   ->  accuracy score, precision, recall, f1-score

In [None]:
# Kmeans clustering

# inertia score
print('\nInertia in the kmeans clustering',kmeans.inertia_)
#silhouette score
print('\nSilhouette score for kmeans clustering',silhouette_score(dfpca,kmeans.labels_))

In [None]:
# agglomerative clustering

# inertia score
print('\nInertia in the agglomerative clustering',total_aglo_inertia)
#silhouette score
print('\nSilhouette score for agglomerative clustering',silhouette_score(dfpca,aglo.labels_))

In [None]:
# For business interpretation, cluster plots are used to check on which basis the clusters are formed

In [None]:
# build the kmeans model on the scaled (pre-PCA) features
kmeans1=KMeans(n_clusters=2 , n_init=15, random_state=10)
kmeans1.fit(xscaled)

In [None]:
xscaled.columns

In [None]:
# Visualization of clusters with kmeans clustering
# 2D
plt.figure(figsize=(10,6))
plt.scatter(xscaled['YearsAtCompany'],xscaled['MonthlyIncome'],c=kmeans1.labels_ , cmap=plt.cm.Set1)
plt.xlabel('Years at company', fontsize=15)
plt.ylabel('Monthly income', fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(xscaled['YearsAtCompany'],xscaled['PercentSalaryHike'],c=kmeans1.labels_ , cmap=plt.cm.Set1)
plt.xlabel('Years at company', fontsize=15)
plt.ylabel('Percent salary hike', fontsize=15)
plt.show()

### INFERENCES:

1. We built the model using 2 clusters and got good accuracy scores, so we can say that the employees fall into 2 groups based on the features given in the dataset.
2. This analysis says that the employees are grouped into 2 categories based on similar characteristics.
3. In the above plots, the clusters are plotted for monthly income and percent salary hike against years at company.
4. We can clearly observe 2 categories of employees for the 2 features in each plot.
5. From the 2nd plot, the following employee profiles can be read off:
  1. Years at company low, percent salary hike low
  2. Years at company high, percent salary hike low
  3. Years at company high, percent salary hike moderate
6. In the same way, we can explore every feature and describe the 2 clusters of employees.
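Besides scatter plots, a per-cluster summary table is a compact way to describe what distinguishes the groups. The sketch below is illustrative only: `profile_clusters` is a hypothetical helper and the four rows are made-up values. It assumes the labels line up row-for-row with the frame, which holds for a model fitted on `xscaled` but not for one fitted on `dfpca` after rows were removed.

```python
import pandas as pd

def profile_clusters(frame, labels, columns):
    """Mean of each selected column per cluster, plus the cluster size."""
    labelled = frame[columns].assign(cluster=list(labels))
    summary = labelled.groupby("cluster").mean()
    summary["n_employees"] = labelled.groupby("cluster").size()
    return summary

# Made-up rows, used only to show the shape of the result
toy = pd.DataFrame({
    "YearsAtCompany": [1, 2, 10, 12],
    "MonthlyIncome": [2000, 3000, 9000, 11000],
})
summary = profile_clusters(toy, [0, 0, 1, 1], ["YearsAtCompany", "MonthlyIncome"])
print(summary)
```

Running this on the unscaled columns keeps the means in their original units, which tends to be easier to explain to a business audience than z-scores.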
  

1. The accuracy of the classification model built on the KMeans clusters is good, but we still cannot be 100% sure that the results are correct.
2. Some interpretations drawn from the data may be misleading.
3. Some unnecessary columns may be present in the dataset; these should be removed at the start of data cleaning,
which requires logical thinking and domain knowledge.
4. Choosing a different share of explained variance than 90% might give better or more accurate clusters.
5. No model can be guaranteed to be 100% accurate, and the accuracy scores depend on how the model was tested.
6. There may be more than 3 employee groups if we use hierarchical clustering.
We could find more insights by exploring agglomerative clustering in more depth.
Here we concluded that KMeans is the best clustering model for the given data, so this result could mislead us about the employee groups.

