<a href="https://www.kaggle.com/code/sivm205/eda-of-hr-attrition?scriptVersionId=174722601" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns 

In [None]:
df = pd.read_csv("HR Data.xlsx - HR data.csv")
df.head()

Breakdown of each individual attribute and its meaning

1. **Attrition**: This column likely indicates whether an employee has left the company (attrition = Yes) or is still with the company (attrition = No).

2. **Business Travel**: This column may indicate the frequency or nature of business travel for employees, such as "Travel_Rarely," "Travel_Frequently," or "Non-Travel."

3. **CF_age band**: This could be a custom field indicating the age band or age group to which an employee belongs.

4. **Department**: This column likely denotes the department or division within the company where the employee works, such as "Sales," "Human Resources," "Research & Development," etc.

5. **Education Field**: This column may specify the field of education or academic background of the employee, such as "Life Sciences," "Medical," "Marketing," etc.

6. **emp no** / **Employee Number**: These columns likely represent unique identifiers for each employee in the dataset.

7. **Gender**: This column likely indicates the gender of the employee, such as "Male" or "Female."

8. **Job Role**: This column denotes the specific role or position held by the employee within the company, such as "Sales Executive," "Research Scientist," "Human Resources Manager," etc.

9. **Marital Status**: This column indicates the marital status of the employee, such as "Single," "Married," or "Divorced."

10. **Over Time**: This column may indicate whether the employee works overtime, such as "Yes" or "No."

11. **Training Times Last Year**: This column might represent the number of training sessions or courses attended by the employee in the last year.

12. **Age**: This column denotes the age of the employee.

13. **Monthly Income**: This column represents the monthly income or salary of the employee.

14. **Years At Company**, **Years In Current Role**, **Years Since Last Promotion**, **Years With Curr Manager**: These columns likely represent the respective durations (in years) for which the employee has been with the company, in their current role, since their last promotion, and with their current manager.

15. **Over18**: This column may indicate whether the employee is over 18 years old, typically with values like "Y" or "N."

16. **Environment Satisfaction**, **Job Involvement**, **Job Level**, **Job Satisfaction**: These columns likely represent employee satisfaction or engagement scores in various aspects of their work environment, job involvement, level within the company, and overall job satisfaction.

17. **Distance From Home**: This column indicates the distance of the employee's residence from their workplace.

18. **Daily Rate**, **Hourly Rate**, **Monthly Rate**: These columns might represent different rates or salaries paid to employees, possibly on a daily, hourly, or monthly basis.

19. **Education**: This column indicates the level of education attained by the employee, such as "Bachelor's," "Master's," "PhD," etc.

20. **Employee Count**: This column could represent the count of employees, possibly used for validation or aggregation purposes.

21. **Percent Salary Hike**: This column may represent the percentage increase in salary that the employee received in a particular period.

22. **Num Companies Worked**: This column might denote the number of different companies that the employee has worked for previously.

23. **Performance Rating**: This column could indicate the performance rating or evaluation score given to the employee, typically on a scale from 1 to 5.

24. **Standard Hours**: This column may represent the standard number of working hours per day or per week in the company.

25. **Stock Option Level**: This column could represent the level or extent of stock options granted to the employee as part of their compensation package.

26. **Total Working Years**: This column indicates the total number of years of work experience or employment history of the employee.

27. **Work Life Balance**: This column may represent the perceived balance between work and personal life, often rated on a scale from 1 to 5.

28. **Years At Company**: This column indicates the number of years the employee has been with the current company.

29. **Years In Current Role**: This column represents the number of years the employee has been in their current role or position.

30. **Years Since Last Promotion**: This column indicates the number of years since the employee's last promotion.

31. **Years With Curr Manager**: This column represents the number of years the employee has been working under their current manager.



In [None]:
df.info()

Data Cleaning


In [None]:
# Check for missing values in the dataframe
df.isnull().sum()

In [None]:
# Count fully duplicated rows (0 means every row is unique)
duplicate_count = df.duplicated().sum()
duplicate_count

The cell above shows that the dataframe contains no duplicated rows. Next, let's check which attributes are relevant.

In [None]:
df.describe(include='all')

In [None]:
# Remove 'emp no' because it duplicates another column in the dataset
df = df.drop(columns=['emp no'])
df.info()

It is unlikely that the "Over18" attribute plays a significant role in predicting attrition, since virtually every employee in a workforce is expected to be over 18. Because "Over18" is essentially constant across the dataset, it provides no meaningful variation or predictive power. It is therefore a redundant feature and can be dropped during preprocessing without affecting the analysis or modeling.
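
The same reasoning applies to any column that holds a single value. A minimal sketch of how such columns could be detected, using a small made-up dataframe (the values below are illustrative and are not taken from the HR dataset):

```python
import pandas as pd

# Toy stand-in for the HR dataframe; all values are made up for illustration.
toy = pd.DataFrame({
    "Over18": ["Y", "Y", "Y", "Y"],
    "Employee Count": [1, 1, 1, 1],
    "Age": [41, 49, 37, 33],
})

# A column with at most one distinct value has no variation to learn from.
constant_columns = [
    col for col in toy.columns if toy[col].nunique(dropna=False) <= 1
]
print(constant_columns)
```

Running the same list comprehension on `df` would list the candidates for removal before any column is dropped by hand.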

In [None]:
# Further attributes that can be removed from the dataset
attributes_to_drop = [
    'CF_attrition label',   # current vs. ex-employee label; has no impact on the analysis
    'Over18',               # every employee in the workforce is expected to be over 18
    '-2',                   # the column name does not convey what the attribute means
    '0',                    # irrelevant
    'CF_current Employee',
    'Employee Count',       # contains only a single value, so it carries no useful information
    'Standard Hours'        # varies by company, so it provides no information within this dataset
]
df = df.drop(columns=attributes_to_drop)
df.info()

In-depth analysis of the 'CF_age band' attribute to determine whether it should be dropped or not

In [None]:
df['CF_age band'].unique()

In [None]:
import matplotlib.pyplot as plt
from collections import Counter

# Count how many employees fall into each age band
# (a dict keeps the bands in order of first appearance)
map_value = dict(Counter(df['CF_age band']))

print(map_value)


In [None]:
# Visualise the employees' age bands with a bar graph
import matplotlib.pyplot as plt

def plot_bar_graph(data):
    keys = list(data.keys())
    values = list(data.values())

    # Set a style, then adjust the figure size
    # (seaborn's own setter is used because newer matplotlib releases
    # no longer ship the 'seaborn-darkgrid' style name)
    sns.set_style('darkgrid')
    plt.figure(figsize=(10, 6))

    # Plotting the bar graph with customized colors and edge colors
    plt.bar(keys, values, color='skyblue', edgecolor='gray')

    # Adding labels and title with larger fonts
    plt.xlabel('Age band', fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
    plt.title('Distribution of Age Bands', fontsize=14)

    # Rotating x-axis labels for better readability
    plt.xticks(rotation=45, ha='right')

    # Adding grid lines for better readability
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # Displaying the plot
    plt.tight_layout()  # Adjust layout to prevent clipping of labels
    plt.show()

# Plotting the bar graph
plot_bar_graph(map_value)


In [None]:
# seaborn's own setter is used because newer matplotlib releases
# no longer ship the 'seaborn-darkgrid' style name
sns.set_style('darkgrid')
plt.figure(figsize=(10, 6))
# hue='Attrition' splits each age band into one bar per attrition value;
# the colors come from the default hue palette
sns.countplot(x='CF_age band', hue='Attrition', data=df, edgecolor='gray')
plt.title('Distribution of Age Bands by Attrition')
plt.xlabel('Age Band')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
# Perform a chi-square test of independence to determine whether
# attrition rates differ significantly across age bands.

from scipy.stats import chi2_contingency

# Create a contingency table
contingency_table = pd.crosstab(df['CF_age band'], df['Attrition'])

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-square statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:")
print(expected)
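
To read the test result, the p-value is usually compared against a chosen significance level. The sketch below assumes a level of 0.05 and uses a made-up contingency table (the band labels and counts are hypothetical, not the notebook's output):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table with made-up counts:
# rows are age bands, columns are attrition outcomes.
toy_table = pd.DataFrame(
    {"No": [60, 300, 400], "Yes": [40, 100, 50]},
    index=["Under 25", "25 - 34", "35 - 44"],
)

chi2_stat, p_value, dof, expected = chi2_contingency(toy_table)

alpha = 0.05  # assumed significance level
# Rule of thumb: the approximation is reasonable when expected counts are >= 5.
min_expected = expected.min()
reject_null = p_value < alpha

print(f"dof={dof}, smallest expected count={min_expected:.1f}")
print("Attrition depends on age band" if reject_null
      else "No evidence that attrition depends on age band")
```

A small p-value only says that attrition and age band are not independent; it does not say which band drives the difference or how large the effect is.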

In [None]:
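from sklearn.preprocessing import LabelEncoder

# Convert 'CF_age band' to numerical values.
# Note: LabelEncoder assigns codes in sorted label order,
# which need not match the natural order of the age bands.
encoder = LabelEncoder()
df['CF_age band_encoded'] = encoder.fit_transform(df['CF_age band'])

# Calculate correlation coefficients; numeric_only=True skips the object
# columns, which recent pandas versions otherwise refuse to correlate
correlations = df.corr(numeric_only=True)['CF_age band_encoded']
print(correlations)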

Based on these findings, age band appears to be an important factor associated with attrition rate and is moderately correlated with various other attributes in the dataset. Therefore, it may be valuable to retain 'CF_age band' (or its numerical encoding) for further analysis or modeling.
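
One way to make the link between age band and attrition concrete is to compute the attrition rate within each band. This is a minimal sketch on made-up rows; the band labels and the "Yes"/"No" values are assumptions for illustration:

```python
import pandas as pd

# Toy rows with made-up values; the real notebook would use df instead.
toy = pd.DataFrame({
    "CF_age band": ["Under 25", "Under 25", "25 - 34",
                    "25 - 34", "25 - 34", "35 - 44"],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "No"],
})

# normalize="index" turns each row of the table into proportions,
# so the "Yes" column is the attrition rate of that age band.
rates = pd.crosstab(toy["CF_age band"], toy["Attrition"], normalize="index")
attrition_rate = rates["Yes"].sort_values(ascending=False)
print(attrition_rate)
```

Rates are easier to compare than raw counts because the age bands differ in size.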

In [None]:
df.info()

In [None]:
df['Business Travel']

## Exploratory Data Analysis

1. **Basic Information:**

In [None]:
df.info()
# The data contains 1470 rows and 34 columns.
# The columns have two dtypes: object and int.
# The data does not contain any missing values.

In [None]:
# Unique values for categorical columns
dictionary_type = {}
for col in df:
    if df[col].dtype == 'O':  # Check if the column type is object (categorical)
        dictionary_type[col] = df[col].unique()

# Print the unique values for each categorical column
for col, unique_values in dictionary_type.items():
    print(f"Unique values for '{col}': {unique_values}")


2. **Distribution of Variables:**
   - What is the distribution of the target variable (Attrition)?
   - What is the distribution of numeric variables like Age, Daily Rate, Monthly Income, etc.?
   - How are categorical variables distributed, e.g., Business Travel, Department, Marital Status, etc.?

In [None]:
# Build separate lists for the numerical and the categorical attributes

numerical_atttribute = []
categorical_attribute = [] 
def check_type(data):
    for col in data:
        if data[col].dtype == 'O':
            categorical_attribute.append(col)
        else:
            numerical_atttribute.append(col)

check_type(df)
print(categorical_attribute)
print(numerical_atttribute)


In [None]:
# Calculate the number of each type of attribute

num_numerical = len(numerical_atttribute)
num_categorical = len(categorical_attribute)

# Create a pie chart
plt.figure(figsize=(6, 6))
plt.pie([num_numerical, num_categorical], labels=['Numerical Attribute', 'Categorical Attribute'], autopct='%1.1f%%',
        colors=['skyblue', 'yellowgreen'])
plt.title('Distribution of Attribute Types')
plt.show()

In [None]:
# Distribution of the target variable

attrition_counts = df['Attrition'].value_counts()

# seaborn's own setter is used because newer matplotlib releases
# no longer ship the 'seaborn-darkgrid' style name
sns.set_style('darkgrid')
plt.figure(figsize=(10, 6))
# Compare frequencies across the two attrition categories
plt.bar(attrition_counts.index, attrition_counts.values, color='skyblue', edgecolor='gray')
plt.title('Distribution of Attrition')
plt.xlabel('Attrition') 
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
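# Distribution of Age

# seaborn's own setter is used because newer matplotlib releases
# no longer ship the 'seaborn-darkgrid' style name
sns.set_style('darkgrid')
plt.figure(figsize=(10, 6))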
sns.histplot(df['Age'], color='skyblue', edgecolor='gray', kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()



The distribution is unimodal, with a peak around the 30-35 age group. The frequency gradually decreases on both sides of the mode, so there are fewer employees at the younger and older extremes. The distribution is slightly skewed to the right, meaning the tail extends further toward the older ages while most employees sit near or below the peak. Overall, the workforce is concentrated in the middle age groups.
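
The visual impression of skew can be checked numerically. A minimal sketch with made-up ages (not the dataset's values): a positive skewness and a mean above the median both indicate a right-skewed distribution.

```python
import pandas as pd

# Made-up ages for illustration; the real check would use df['Age'].
ages = pd.Series([22, 25, 28, 30, 31, 33, 34, 36, 40, 48, 58])

skewness = ages.skew()          # > 0 means a longer tail on the right
mean_age = ages.mean()
median_age = ages.median()

print(f"skewness={skewness:.2f}, mean={mean_age:.1f}, median={median_age:.1f}")
```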

In [None]:
import math

# Calculate the number of rows needed for the subplots
num_attributes = len(numerical_atttribute)
num_rows = math.ceil(num_attributes / 2)  # Adjust as needed

# Create a figure for the subplots
fig, axs = plt.subplots(num_rows, 2, figsize=(10, num_rows*5))

# Flatten the axes array, to make iterating over it easier
axs = axs.flatten()

for i, attribute in enumerate(numerical_atttribute):
    sns.histplot(df[attribute], color='skyblue', edgecolor='gray', kde=True, ax=axs[i])
    axs[i].set_title(f'Distribution of {attribute}')
    axs[i].set_xlabel(attribute)
    axs[i].set_ylabel('Frequency')
    axs[i].grid(axis='y', linestyle='--', alpha=0.7)

# Remove any unused subplots
if len(numerical_atttribute) % 2 != 0:
    fig.delaxes(axs[-1])

plt.tight_layout()
plt.show()

In [None]:
#Distribution of categorical attribute
import math
def get_plot(data):
    num_attributes = len(data)
    num_rows = math.ceil(num_attributes / 2)  

    # Create a figure for the subplots
    fig, axs = plt.subplots(num_rows, 2, figsize=(10, num_rows*5))

    # Flatten the axes array, to make iterating over it easier
    axs = axs.flatten()

    for i, attribute in enumerate(data):
        sns.countplot(x=attribute, data=df, color='skyblue', edgecolor='gray', ax=axs[i])
        axs[i].set_title(f'Distribution of {attribute}')
        axs[i].set_xlabel(attribute)
        axs[i].set_ylabel('Frequency')
        axs[i].grid(axis='y', linestyle='--', alpha=0.7)

    # Remove any unused subplots
    if len(data) % 2 != 0:
        fig.delaxes(axs[-1])

    plt.tight_layout()
    plt.show()


get_plot(categorical_attribute)


3. **Correlation Analysis:**
   - What is the correlation between numeric variables?
   - Is there any correlation between numeric variables and the target variable (Attrition)?
   - How does the correlation vary across different departments, job roles, or other categorical variables?
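
One way to approach the last question is to compute the same correlation separately within each group of a categorical variable. The sketch below assumes made-up rows and picks "Age" and "Monthly Income" purely as an example pair:

```python
import pandas as pd

# Toy rows with made-up values; the real notebook would use df instead.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "R&D", "R&D", "R&D"],
    "Age": [30, 40, 50, 25, 35, 45],
    "Monthly Income": [3000, 4000, 5000, 6000, 5000, 4000],
})

# Pearson correlation of the two columns, computed once per department.
corr_by_department = {
    department: group["Age"].corr(group["Monthly Income"])
    for department, group in toy.groupby("Department")
}
print(corr_by_department)
```

Groups with very few rows give unstable correlations, so group sizes are worth checking alongside the coefficients.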

In [None]:
# Correlation between the numerical attributes
correlation_matrix = np.corrcoef(df[numerical_atttribute].values.T)
correlation_matrix = pd.DataFrame(correlation_matrix, columns=numerical_atttribute, index=numerical_atttribute)
correlation_matrix

In [None]:
#Make it more visually appealing by plotting a heatmap of the correlation matrix 
plt.figure(figsize=(20, 10))
sns.heatmap(correlation_matrix, annot=True, cmap= 'inferno', fmt=".2f")

Very few attributes are correlated with each other; one example is Years In Current Role and Years With Curr Manager.
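
Rather than reading pairs off the heatmap, the strongest correlations can be listed directly. A minimal sketch on made-up columns (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy numeric columns with made-up values.
toy = pd.DataFrame({
    "Years In Current Role": [1, 2, 3, 4, 5, 6],
    "Years With Curr Manager": [1, 2, 2, 4, 5, 7],
    "Distance From Home": [9, 1, 20, 3, 15, 2],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal),
# so each pair of attributes appears exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```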

In [None]:
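# Correlation between the numerical attributes and the target variable
# pd.factorize assigns codes in order of first appearance,
# so the Attrition value in the first row becomes 0
df['Attrition'] = pd.factorize(df['Attrition'])[0] #convert the target variable to numerical value
df['Attrition'] = df['Attrition'].astype('int64')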

In [None]:
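# Add the encoded target only once, so re-running the cell does not append it twice
if 'Attrition' not in numerical_atttribute:
    numerical_atttribute.append('Attrition')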
corr_tar_num = np.corrcoef(df[numerical_atttribute].values.T)
corr_tar_num = pd.DataFrame(corr_tar_num, columns=numerical_atttribute, index=numerical_atttribute)
corr_tar_num

In [None]:
for col in numerical_atttribute:
    if col != 'Attrition':
        # np.corrcoef returns a 2x2 matrix; element [0, 1] is the correlation between the two inputs
        r = np.corrcoef(df[col], df['Attrition'])[0, 1]
        print(f"Correlation between {col} and Attrition: {r}\n")


The correlations between Attrition and the numerical attributes are very low, which means that no numerical attribute has a strong linear relationship with the target variable.
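
The Pearson correlation between a 0/1 variable and a numeric variable is the point-biserial correlation, for which SciPy also reports a p-value. A minimal sketch with made-up values (the income figures and the coding 1 = left are assumptions):

```python
import numpy as np
from scipy.stats import pointbiserialr

# Made-up example: 1 = employee left, 0 = employee stayed.
attrition = np.array([0, 0, 0, 0, 1, 1, 1, 0])
monthly_income = np.array([5200, 6100, 4800, 7000, 2500, 3100, 2900, 5600])

# Point-biserial correlation and its p-value.
r_pb, p_value = pointbiserialr(attrition, monthly_income)

# The same coefficient, computed as a plain Pearson correlation.
r_pearson = np.corrcoef(attrition, monthly_income)[0, 1]

print(f"point-biserial r={r_pb:.3f}, p={p_value:.4f}")
```

A low coefficient rules out a strong linear relationship only; a variable can still matter through a non-linear pattern or in combination with others.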

In [None]:
plt.figure(figsize=(20, 10))
sns.heatmap(corr_tar_num, annot=True, cmap= 'inferno', fmt=".2f")


In [None]:
# Convert categorical variables to numerical values
# pd.factorize encodes each category as an integer code, in order of first appearance
df_categorical = df[categorical_attribute].apply(lambda x: pd.factorize(x)[0])
# Calculate the correlation coefficient
cor_cof = np.corrcoef(df_categorical.values.T)
cor_cof = pd.DataFrame(cor_cof, index=categorical_attribute, columns=categorical_attribute)
cor_cof


In [None]:
plt.figure(figsize=(20, 10))
sns.heatmap(cor_cof, annot=True, cmap= 'inferno', fmt=".2f")
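
Integer codes from `pd.factorize` have no inherent order for nominal categories, so Pearson correlations between them are hard to interpret. A common alternative, swapped in here and not used in the notebook, is Cramér's V, which is derived from the chi-square statistic. The helper `cramers_v` and the toy values below are hypothetical:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    # correction=False gives the plain chi-square statistic.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    # Assumes both variables have at least two categories.
    min_dim = min(table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))

# Made-up categories: each department maps to exactly one role.
department = pd.Series(["Sales", "Sales", "R&D", "R&D", "HR", "HR"])
role = pd.Series(["A", "A", "B", "B", "C", "C"])
v_perfect = cramers_v(department, role)

# Made-up categories with no association at all.
left = pd.Series(["a", "a", "b", "b"])
right = pd.Series(["c", "d", "c", "d"])
v_none = cramers_v(left, right)

print(v_perfect, v_none)
```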

4. **Employee Demographics:**
   - What is the age distribution of employees?
   - How many employees belong to each gender category?
   - What is the distribution of educational backgrounds (Education Field)?

In [None]:
df['Age'].plot(kind='hist', bins=20, color='skyblue', edgecolor='gray', figsize=(10, 6))

In [None]:
df['Gender'].unique()

In [None]:
# Calculate the counts of male and female
number_of_male = (df['Gender'] == 'Male').sum()
number_of_female = (df['Gender'] == 'Female').sum()

# Create a list of categories and corresponding counts
categories = ['Male', 'Female']
counts = [number_of_male, number_of_female]

# Plotting
plt.bar(categories, counts, color=['skyblue', 'pink'])
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Distribution of Gender in the Employees')
plt.show()


In [None]:
edu_foelds = df['Education Field'].unique()
edu_foelds

In [None]:
count_edu_fields = []
for i in edu_foelds:
    print(f"Number of {i} employees: {(df['Education Field'] == i).sum()}")
    count_edu_fields.append((df['Education Field'] == i).sum())

In [None]:
count_edu_fields , edu_foelds

In [None]:
plt.bar(edu_foelds, count_edu_fields, color='skyblue', edgecolor='gray')
plt.xlabel('Education Field')
plt.ylabel('Count')
plt.title('Distribution of Education Fields in the Employees')
plt.xticks(rotation=45, ha='right')
plt.show()


5. **Work Environment:**
   - How far do employees live from work (Distance From Home)?
   - What is the average job satisfaction level among employees?
   - How satisfied are employees with their work-life balance?

In [None]:
#distribution of Distance from Home
df['Distance From Home'].plot(kind='hist', bins=20, color='skyblue', edgecolor='gray', figsize=(10, 6))

In [None]:
# How far employees live from work
df['Distance From Home'].describe()  # mean distance is 9.19 and max distance is 29

In [None]:
df['Job Satisfaction'].plot(kind='hist')

In [None]:
np.percentile(df['Job Satisfaction'], 50)

In [None]:
from scipy.stats import mode
mode(df['Job Satisfaction'])   

In [None]:
''' By above statistics it is clear that no employee lived more than 29 km from work and mean employees are 9km far from work
median value of job satisfaction is 3 however most of the employees highly satisfied with a score of 4'''
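
The median and the mode can disagree, and the mode does not have to cover a majority of employees. A minimal sketch with made-up satisfaction scores that shows how the share of each score could be checked:

```python
import pandas as pd

# Made-up satisfaction scores; the real check would use df['Job Satisfaction'].
scores = pd.Series([4, 3, 4, 1, 2, 3, 4, 3, 4, 2])

median_score = scores.median()
mode_score = scores.mode().iloc[0]               # most frequent score
share = scores.value_counts(normalize=True)      # fraction of employees per score

print(f"median={median_score}, mode={mode_score}")
print(share.sort_index())
```

In this toy data the mode is 4 even though only 40% of the scores are 4, which is why the share per score is worth reporting next to the mode.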