<center><h1 style="font-size:280%; font-family:cursive; background:yellow; color:black; border-radius:10px 10px; padding:10px;"> HR Analysis, Prediction and Visualization</h1></center>

<h1 style="font-size:180%; font-family:cursive; color:red;"><b>What Is HR Analytics?</b></h1>

<p style="font-size:150%; font-family:cursive;">To understand the essence of HR analytics and to explain how it impacts business performance, we asked Mick Collins, Global Vice President, Workforce Analytics & Planning Solution Strategy and Chief Expert at SAP SuccessFactors, to break it down for us. “HR analytics is a methodology for creating insights on how investments in human capital assets contribute to the success of four principal outcomes: (a) generating revenue, (b) minimizing expenses, (c) mitigating risks, and (d) executing strategic plans. This is done by applying statistical methods to integrated HR, talent management, financial, and operational data,” says Collins in an exclusive discussion with HR Technologist. HR analytics focuses primarily on the HR function and is not – as is largely believed – exactly interchangeable with people analytics or workforce analytics.</p>

<center><img src="https://careerbright.com/wp-content/uploads/2020/03/HR-career-day-in-life-1024x681.jpg"></center>

<h1 style="font-size:180%; font-family:cursive; color:red;"><b>How does HR analytics work?</b></h1>

<p style="font-size:150%; font-family:cursive;">HR analytics follows a pattern of acquiring and analyzing information, which helps the organization derive crucial insight into its functions. Overall, it is a multi-step process that lets the organization understand its workforce more thoroughly.</p>

<h1 style="font-size:180%; font-family:cursive; color:red;"><b>Should We Invest in an HR Analytics Solution?</b></h1>

<p style="font-size:150%; font-family:cursive;">HR analytics offers some undoubted benefits. It allows HR teams to significantly streamline processes, which reduces costs, reduces attrition, and consequently improves the bottom line. With task automation, you are freed up to innovate and explore the human aspect of human resources without spending time on tracking mountains of data from multiple sources. Overall, the use of HR analytics has been established as an HR technology trend for 2019, as it is poised to improve the employee experience, which directly translates into improved business outcomes. However, it also presents some real challenges. As Collins tells us, “While HR is ambitious about the use of predictive HR analytics, two HR leaders I recently spoke with said ‘we want to be able to predict everything!’ The vision for how analytics will become an HR core competency is being constrained by limited consumption (insights being shared only within the four walls of HR) and action (the research does not lead to a program change or new investment). There is still much progress to be made.” In addition, because data is siloed across the organization and conversations about the goal of implementing analytics are unclear, the valuable data HR requires for analytics is often underutilized. The challenge is in waiting to actually see results. Predictive analytics is likely to take a minimum of 24 months to show meaningful results, but it will help get started with HR transformation. So, the time to get started with HR analytics is now. By making a strong business case to key stakeholders, as an HR practitioner, you can leverage the power of analytics to become a strategic business partner who makes substantial contributions to the business.</p>

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

In [None]:
# Load train and Test data
train_data=pd.read_csv("../input/hranalysis/train.csv")
test_data=pd.read_csv("../input/hranalysis/test.csv")

In [None]:
train_data.head()

In [None]:
train_data.shape

In [None]:
# Check all columns in train data
train_data.columns

In [None]:
train_data.info()

<center><h1 style="font-size:230%; font-family:cursive; background:pink; color:black; border-radius:10px 10px; padding:10px;"> Checking the Null Values</h1></center>

In [None]:
# Check null values in train data
train_data.isnull().sum()

<h1 style="font-size:180%; font-family:cursive; color:red;"><b>Visualizing the null values using the missingno library</b></h1>

<p style="font-size:150%; font-family:cursive;">Missingno is a Python library that helps you understand the distribution of missing values through informative visualizations. The visualizations can take the form of matrices, heat maps, or bar charts. With this library, it is possible to observe where the missing values occur and to check how the columns containing missing values relate to one another and to the target column.</p>
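<p style="font-size:150%; font-family:cursive;">As a minimal sketch of what the bar chart encodes, the share of missing values per column can also be computed in plain pandas. The toy frame below is hypothetical and its values are made up for illustration; it is not the competition data.</p>

```python
import numpy as np
import pandas as pd

# hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", None],
    "previous_year_rating": [5.0, 3.0, np.nan, 4.0],
    "age": [35, 30, 34, 39],
})

# fraction of missing entries per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

<p style="font-size:150%; font-family:cursive;">Columns with a share of 0 are complete; anything above 0 needs imputation or removal before modelling.</p>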

In [None]:
import missingno as msno
msno.matrix(train_data)

In [None]:
test_data.shape

In [None]:
# Check all the information of test data
test_data.info()

In [None]:
test_data.isnull().sum()

In [None]:
# Check the null values in test data
msno.bar(test_data,color = 'y', figsize = (10,8))

<center><h1 style="font-size:230%; font-family:cursive; background:pink; color:black; border-radius:10px 10px; padding:10px;"> Exploratory Data Analysis</h1></center>

In [None]:
# pairplot using seaborn
sns.pairplot(train_data)

In [None]:
# Visualizing the distribution of the data for every feature
train_data.hist(edgecolor='black',linewidth=1.2, figsize=(20,20));

In [None]:
plt.figure(figsize=(30,30))
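# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate
sns.heatmap(train_data.corr(numeric_only=True),annot=True,cmap="RdYlGn", annot_kws={"size":15})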

In [None]:
# check value counts in " department " columns
train_data['department'].value_counts()

In [None]:
# visualizing the different groups in the dataset
plt.subplots(figsize=(15,5))
train_data["department"].value_counts(dropna=False).plot.bar(color=['black', 'red', 'green', 'blue', 'cyan'])
plt.show()

In [None]:
# checking the different regions of the company
plt.subplots(figsize=(15,5))
sns.countplot(x=train_data["region"],color="red")
plt.title("Different Regions of the company",fontsize=30)
plt.xticks(rotation = 60)
plt.xlabel('Region Code')
plt.ylabel('count')
plt.show()

In [None]:
# Check most popular department
from wordcloud import WordCloud
from wordcloud import STOPWORDS

stopword=set(STOPWORDS)
# join every row into one string; str(Series) only contains the truncated preview
text=" ".join(train_data["department"].astype(str))
wordcloud=WordCloud(stopwords=stopword).generate(text)
plt.rcParams['figure.figsize'] = (15, 8)
plt.imshow(wordcloud)
plt.title('Most Popular Departments', fontsize = 30)
plt.axis('off')
plt.show()

In [None]:
train_data["education"].value_counts()

In [None]:
# prepare the data
data=train_data.groupby('education').size()

# Make plot with pandas
data.plot(kind='pie',subplots=True,figsize=(15,8))
plt.title("Pie Chart of different types of education")
plt.ylabel("")
plt.show()

In [None]:
# most popular education degree among the employees
from wordcloud import WordCloud
from wordcloud import STOPWORDS

stopword=set(STOPWORDS)
# drop missing values and join every row; str(Series) only contains the truncated preview
text=" ".join(train_data["education"].dropna().astype(str))
wordcloud=WordCloud(stopwords=stopword,max_words=5).generate(text)
plt.rcParams['figure.figsize'] = (15, 8)
plt.imshow(wordcloud)
plt.title('Most Popular Degrees among the Employees', fontsize = 30)
plt.axis('off')
plt.show()


In [None]:
# checking the gender gap

train_data["gender"].value_counts()

In [None]:
# plotting a pie chart
size=[38496, 16312]
labels="Male","Female"
colors=["yellow","red"]
explod=[0.0,0.1]
plt.subplots(figsize=(8,8))
plt.pie(size,labels=labels,colors=colors,explode=explod,autopct = "%.2f%%")
plt.title('A Pie Chart Representing Gender Gap', fontsize = 30)
plt.axis('off')
plt.legend()
plt.show()

In [None]:
# comparison of education level between male & female employees
plt.subplots(figsize=(15,5))
sns.countplot(x="education",data=train_data,hue="gender",palette = 'dark')
plt.show()

In [None]:
# comparison of promoted male & female employees
plt.subplots(figsize=(15,5))
sns.countplot(x = 'gender', data = train_data, hue = 'is_promoted', palette = 'dark')
plt.show()

In [None]:
# comparison of recruitment channel between male & female employees
plt.subplots(figsize=(15,5))
sns.countplot(x = 'recruitment_channel', data = train_data, hue = 'gender', palette = 'dark')
plt.show()

In [None]:
train_data['recruitment_channel'].value_counts()

In [None]:
# plotting a donut chart for visualizing each of the recruitment channel's share

size=[30446,23220,1142]
labels=["other","sourcing","referred"]
color=["black","yellow","red"]

my_circle = plt.Circle((0, 0), 0.7, color = 'white')

plt.rcParams['figure.figsize']=(9,9)
plt.pie(size,labels=labels,colors=color, shadow = True, autopct = '%.2f%%')
plt.title('Showing share of different Recruitment Channels', fontsize = 30)
p = plt.gcf()
p.gca().add_artist(my_circle)
plt.legend()
plt.show()

<h1 style="font-size:180%; font-family:cursive; color:red;"><b>Plot Distplot for 'Distribution of Age of Employees'</b></h1>

<p style="font-size:150%; font-family:cursive;">Seaborn's distplot shows a histogram with a density curve drawn over it, and it supports many variations. We use seaborn in combination with matplotlib, the Python plotting module. A distplot plots a univariate distribution of observations. The distplot() function combines the matplotlib hist function with the seaborn kdeplot() and rugplot() functions. Note that distplot() is deprecated in recent seaborn releases in favour of histplot() and displot().</p>
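<p style="font-size:150%; font-family:cursive;">To make the two ingredients concrete, here is a rough sketch in plain NumPy of a normalised histogram and a Gaussian kernel density estimate. The ages are randomly generated for illustration, and the bandwidth rule (1.06 * sigma * n^(-1/5)) is an assumption chosen for simplicity; it is not necessarily the rule seaborn applies.</p>

```python
import numpy as np

# hypothetical ages, generated only for illustration
rng = np.random.default_rng(0)
ages = rng.normal(loc=35, scale=7, size=500)

# histogram normalised so that the bar areas sum to 1
density, edges = np.histogram(ages, bins=20, density=True)

# Gaussian kernel density estimate evaluated on a grid;
# the bandwidth is a simple normal-reference rule of thumb (an assumption)
grid = np.linspace(ages.min(), ages.max(), 200)
bandwidth = 1.06 * ages.std(ddof=1) * len(ages) ** (-1 / 5)
z = (grid[:, None] - ages[None, :]) / bandwidth
kde = np.exp(-0.5 * z**2).sum(axis=1) / (len(ages) * bandwidth * np.sqrt(2 * np.pi))

print("histogram area:", (density * np.diff(edges)).sum())
```

<p style="font-size:150%; font-family:cursive;">Both curves are on the density scale, which is why they can share one y-axis in the combined plot.</p>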

In [None]:
plt.subplots(figsize=(15,5))
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn
sns.histplot(train_data["age"],kde=True)
plt.title('Distribution of Age of Employees', fontsize = 30)

In [None]:
train_data["previous_year_rating"].value_counts().sort_index().plot.bar(color="red",figsize=(15,5))
plt.title('Distribution of Previous year rating of the Employees', fontsize = 30)
plt.xlabel('Ratings', fontsize = 15)
plt.ylabel('count')
plt.show()

In [None]:
# checking the distribution of length of service
plt.subplots(figsize=(15,8))
# histplot with kde=True replaces the deprecated distplot and keeps the y-axis in counts
sns.histplot(train_data["length_of_service"],color="green",kde=True)
plt.title('Distribution of length of service among the Employees', fontsize = 30)
plt.xlabel('Length of Service in years')
plt.ylabel('count')
plt.show()

In [None]:
train_data["KPIs_met >80%"].value_counts()

In [None]:
# plotting a pie chart


size = [35517, 19291]
labels = "Not Met KPI > 80%", "Met KPI > 80%"
colors = ['violet', 'grey']
explode = [0, 0.1]

plt.rcParams['figure.figsize'] = (8, 8)
plt.pie(size, labels = labels, colors = colors, explode = explode, shadow = True, autopct = "%.2f%%")
plt.title('A Pie Chart Representing Gap in Employees in terms of KPI', fontsize = 30)
plt.axis('off')
plt.legend()
plt.show()

In [None]:
train_data['awards_won?'].value_counts()

In [None]:
# plotting a donut chart for visualizing the share of employees who won awards

size = [53538, 1270]
colors = ['black', 'red']
# value_counts() lists the larger group first, i.e. employees without an award
labels = "No Awards Won", "Awards Won"

my_circle = plt.Circle((0, 0), 0.7, color = 'white')

plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(size, colors = colors, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Showing a Percentage of employees who won awards', fontsize = 30)
p = plt.gcf()
p.gca().add_artist(my_circle)
plt.legend()
plt.show()

In [None]:
# checking the distribution of the avg_training score of the Employees
plt.subplots(figsize=(15,8))
# histplot with kde=True replaces the deprecated distplot and keeps the y-axis in counts
sns.histplot(train_data["avg_training_score"],color="green",kde=True)
plt.title('Distribution of average training score among the Employees', fontsize = 30)
plt.xlabel('Average training score')
plt.ylabel('count')
plt.show()

In [None]:
# checking the no. of Employees Promoted
train_data["is_promoted"].value_counts()

In [None]:
# finding the percentage of people promoted
# (the mean of a 0/1 column is the share of ones)

promoted = train_data["is_promoted"].mean()*100
print("Percentage of Promoted Employees is {:.2f}%".format(promoted))

In [None]:
# plotting a histogram of the target
plt.hist(train_data["is_promoted"])
plt.title("plot to show the gap in Promoted and Non-Promoted Employees", fontsize=20)
plt.xlabel('0 -No Promotion and 1- Promotion', fontsize = 20)
plt.ylabel('count')
plt.show()


<h1 style="font-size:180%; font-family:cursive; color:red;"><b>Bivariate Plots</b></h1>

<p style="font-size:150%; font-family:cursive;">A bivariate plot graphs the relationship between two variables that have been measured on a single sample of subjects. Such a plot permits you to see at a glance the degree and pattern of relation between the two variables. On a bivariate plot, the abscissa (X-axis) represents the predictor variable and the ordinate (Y-axis) represents the predicted or outcome variable. In the plots below, each bar stands for one value of a predictor and is split by the share of employees who were and were not promoted.</p>
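<p style="font-size:150%; font-family:cursive;">The stacked bars in this section are all built from a row-normalised cross-tabulation. A minimal sketch on a hypothetical toy frame (the values are made up) shows the table behind such a plot; passing normalize="index" to pd.crosstab gives the same result as dividing each row by its total.</p>

```python
import pandas as pd

# hypothetical toy data: one categorical feature and the binary target
toy = pd.DataFrame({
    "awards_won?": [0, 0, 0, 0, 1, 1],
    "is_promoted": [0, 0, 0, 1, 1, 0],
})

# normalize="index" divides every row by its total, so each row sums to 1
share = pd.crosstab(toy["awards_won?"], toy["is_promoted"], normalize="index")
print(share)
```

<p style="font-size:150%; font-family:cursive;">Each row is one bar, and the column for is_promoted = 1 is the promotion rate of that group, which makes groups of very different sizes comparable.</p>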

In [None]:
# stacked bar plot of promotion share for each average training score
data=pd.crosstab(train_data["avg_training_score"],train_data["is_promoted"])
data.div(data.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = True, figsize = (20, 9), color = ['darkred', 'lightgreen'])
plt.title('Looking at the Dependency of Training Score in promotion', fontsize = 30)
plt.xlabel('Average Training Scores', fontsize = 15)
plt.legend()
plt.show()

<p style="font-size:150%; font-family:cursive;">As the training score increases, the chance of promotion increases sharply.</p>

In [None]:
# checking dependency of different regions in promotion

data = pd.crosstab(train_data['region'], train_data['is_promoted'])
data.div(data.sum(1).astype('float'), axis = 0).plot(kind = 'bar', stacked = True, figsize = (20, 8), color = ['lightblue', 'purple'])

plt.title('Dependency of Regions in determining Promotion of Employees', fontsize = 30)
plt.xlabel('Different Regions of the Company', fontsize = 20)
plt.legend()
plt.show()


<p style="font-size:150%; font-family:cursive;">The above graph shows no apparent bias across regions in terms of promotion, as all the regions share promotions almost equally.</p>

In [None]:
# dependency of awards won on promotion
data = pd.crosstab(train_data['awards_won?'], train_data['is_promoted'])
data.div(data.sum(1).astype('float'), axis = 0).plot(kind = 'bar', stacked = True, figsize = (10, 8), color = ['magenta', 'purple'])

plt.title('Dependency of Awards in determining Promotion', fontsize = 30)
plt.xlabel('Awards Won or Not', fontsize = 20)
plt.legend()
plt.show()



<p style="font-size:150%; font-family:cursive;">There is a very good chance of getting promoted if the employee has won an award.</p>

In [None]:
#dependency of KPIs with Promotion

data = pd.crosstab(train_data['KPIs_met >80%'], train_data['is_promoted'])
data.div(data.sum(1).astype('float'), axis = 0).plot(kind = 'bar', stacked = True, figsize = (10, 8), color = ['pink', 'darkred'])

plt.title('Dependency of KPIs in determining Promotion', fontsize = 30)
plt.xlabel('KPIs Met or Not', fontsize = 20)
plt.legend()
plt.show()


<p style="font-size:150%; font-family:cursive;">Again, meeting more than 80% of KPIs increases the chances of getting promoted in the company.</p>

In [None]:
# checking dependency on previous years' ratings

data = pd.crosstab(train_data['previous_year_rating'], train_data['is_promoted'])
data.div(data.sum(1).astype('float'), axis = 0).plot(kind = 'bar', stacked = True, figsize = (15, 8), color = ['violet', 'pink'])

plt.title('Dependency of Previous year Ratings in determining Promotion', fontsize = 30)
plt.xlabel('Different Ratings', fontsize = 20)
plt.legend()
plt.show()

<p style="font-size:150%; font-family:cursive;">The above graph clearly suggests that previous ratings matter a lot: the higher the rating, the greater the chance of being promoted, and there are no promotions at all for employees with a previous year rating of 0.</p>

In [None]:
# checking how length of service determines the promotion of employees

data = pd.crosstab(train_data['length_of_service'], train_data['is_promoted'])
data.div(data.sum(1).astype('float'), axis = 0).plot(kind = 'bar', stacked = True, figsize = (20, 8), color = ['pink', 'lightblue'])

plt.title('Dependency of Length of service in Promotions of Employees', fontsize = 30)
plt.xlabel('Length of service of employees', fontsize = 20)
plt.legend()
plt.show()

In [None]:
# checking dependency of age factor in promotion of employees

data = pd.crosstab(train_data['age'], train_data['is_promoted'])
data.div(data.sum(1).astype('float'), axis = 0).plot(kind = 'bar', stacked = True, figsize = (20, 8), color = ['lightblue', 'green'])

plt.title('Dependency of Age in determining Promotion of Employees', fontsize = 30)
plt.xlabel('Age of Employees', fontsize = 20)
plt.legend()
plt.show()

<p style="font-size:150%; font-family:cursive;">Impressively, the company promotes employees of all ages at nearly equal rates: freshers and the oldest employees both get a similar share of promotions.</p>

In [None]:
# checking which department got most number of promotions

data = pd.crosstab(train_data['department'], train_data['is_promoted'])
data.div(data.sum(1).astype('float'), axis = 0).plot(kind = 'bar', stacked = True, figsize = (20, 8), color = ['orange', 'lightgreen'])

plt.title('Dependency of Departments in determining Promotion of Employees', fontsize = 30)
plt.xlabel('Different Departments of the Company', fontsize = 20)
plt.legend()
plt.show()

<p style="font-size:150%; font-family:cursive;">Again, each department has a nearly equal share of promotions, showing equal development across the departments of the company.</p>

In [None]:
# checking dependency of gender over promotion

data = pd.crosstab(train_data['gender'], train_data['is_promoted'])
data.div(data.sum(1).astype('float'), axis = 0).plot(kind = 'bar', stacked = True, figsize = (7, 5), color = ['pink', 'yellow'])

plt.title('Dependency of Genders in determining Promotion of Employees', fontsize = 30)
plt.xlabel('Gender', fontsize = 20)
plt.legend()
plt.show()

<p style="font-size:150%; font-family:cursive;">The above plot shows that there is no partiality between males and females in terms of promotion.</p>

<center><h1 style="font-size:230%; font-family:cursive; background:pink; color:black; border-radius:10px 10px; padding:10px;"> Data Pre-processing</h1></center>

In [None]:
# find the null values
train_data.isna().sum()

In [None]:
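# filling missing values in train data
# (assigning the result back avoids the chained inplace fill that newer pandas warns about)
train_data["education"] = train_data["education"].fillna(train_data["education"].mode()[0])
train_data["previous_year_rating"] = train_data["previous_year_rating"].fillna(1)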

# again checking if there is any Null value left in the data
train_data.isnull().sum().sum()

In [None]:
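# filling missing values in test data
# (assigning the result back avoids the chained inplace fill that newer pandas warns about)
test_data["education"] = test_data["education"].fillna(test_data["education"].mode()[0])
test_data["previous_year_rating"] = test_data["previous_year_rating"].fillna(1)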

# again checking if there is any Null value left in the data
test_data.isnull().sum().sum()
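
<p style="font-size:150%; font-family:cursive;">One common variant of this step, sketched here on hypothetical toy frames, learns the fill value from the training data only and reuses it for the test data, so that both sets are imputed with the same category. This is an illustration of the idea rather than the approach taken in the cells above.</p>

```python
import numpy as np
import pandas as pd

# hypothetical toy frames; the values are made up for illustration
train = pd.DataFrame({"education": ["Bachelor's", "Bachelor's", "Master's & above", np.nan]})
test = pd.DataFrame({"education": [np.nan, "Master's & above"]})

# the most frequent category is learned from the training data only
fill_value = train["education"].mode()[0]

# the same value fills the gaps in both sets
train["education"] = train["education"].fillna(fill_value)
test["education"] = test["education"].fillna(fill_value)

print(fill_value)
```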

In [None]:
# removing the employee_id column because it's useless in our data
train_data=train_data.drop(["employee_id"],axis=1)
train_data.columns

In [None]:
# saving the employee_id

emp_id = test_data['employee_id']

# removing the employee_id column also in test data

test_data = test_data.drop(['employee_id'], axis = 1)

test_data.columns

In [None]:
train_data.shape

In [None]:
# defining the test set

x_test=test_data
x_test.columns


In [None]:
# one hot encoding for the test set
x_test=pd.get_dummies(x_test)

x_test.columns

In [None]:
# splitting the train set into dependent and independent sets
x=train_data.drop("is_promoted",axis=1)
y=train_data["is_promoted"]

print("Shape of x:", x.shape)
print("Shape of y:", y.shape)

In [None]:
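# one hot encoding for the train set

x = pd.get_dummies(x)

# give the test set the same columns in the same order, filling absent dummies with 0
x_test = x_test.reindex(columns = x.columns, fill_value = 0)

x.columns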

<center><h1 style="font-size:230%; font-family:cursive; background:pink; color:black; border-radius:10px 10px; padding:10px;"> Model Building </h1></center>

<h1 style="font-size:180%; font-family:cursive; color:red;"><b>What Is Imblearn?</b></h1>

<p style="font-size:150%; font-family:cursive;">Imblearn (imbalanced-learn) is a library of resampling methods that produce a data set with an equal ratio of classes. A predictive model built on a balanced data set is less inclined to ignore the minority class. We mainly have two options to treat an imbalanced data set: upsampling the minority class and downsampling the majority class. SMOTE, used here, is an upsampling method.</p>
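<p style="font-size:150%; font-family:cursive;">A rough sketch of the two options using plain pandas random resampling on a hypothetical toy set. This is not SMOTE: SMOTE synthesises new minority rows by interpolating between neighbouring minority samples, whereas the upsampling below simply repeats existing rows.</p>

```python
import pandas as pd

# hypothetical imbalanced toy set: 8 negatives, 2 positives
toy = pd.DataFrame({"score": range(10), "is_promoted": [0] * 8 + [1] * 2})
majority = toy[toy["is_promoted"] == 0]
minority = toy[toy["is_promoted"] == 1]

# upsampling: draw minority rows with replacement until the classes match
upsampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

# downsampling: keep only as many majority rows as there are minority rows
downsampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

print(upsampled["is_promoted"].value_counts().to_dict())
print(downsampled["is_promoted"].value_counts().to_dict())
```

<p style="font-size:150%; font-family:cursive;">Upsampling keeps every original row at the cost of duplicates, while downsampling throws information away, which matters when the minority class is as small as it is here.</p>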

In [None]:
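from imblearn.over_sampling import SMOTE
# note: the validation split is later taken from this resampled data,
# so it also contains synthetic rows
x_sample, y_sample = SMOTE(random_state = 0).fit_resample(x, y.values.ravel())

x_sample = pd.DataFrame(x_sample)
# keep the target one-dimensional, the shape the classifiers expect
y_sample = pd.Series(y_sample, name = "is_promoted")

# checking the sizes of the sample data
print("Size of x-sample :", x_sample.shape)
print("Size of y-sample :", y_sample.shape)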

In [None]:
# splitting x and y into train and validation sets

from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(x_sample, y_sample, test_size = 0.2, random_state = 0)

print("Shape of x_train: ", x_train.shape)
print("Shape of x_valid: ", x_valid.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_valid: ", y_valid.shape)

In [None]:
# standard scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test  = sc.transform(x_test)
x_valid = sc.transform(x_valid)
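
<p style="font-size:150%; font-family:cursive;">The cell above fits the scaler on the training set and only transforms the other sets. A minimal NumPy sketch with made-up numbers shows what that means: the mean and standard deviation come from the training rows and are reused unchanged elsewhere.</p>

```python
import numpy as np

# hypothetical feature matrices, made up for illustration
train = np.array([[20.0, 1.0], [30.0, 3.0], [40.0, 5.0]])
valid = np.array([[30.0, 5.0], [50.0, 1.0]])

# statistics are estimated on the training rows only ...
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# ... and reused unchanged for every other split
train_scaled = (train - mean) / std
valid_scaled = (valid - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(valid_scaled)
```

<p style="font-size:150%; font-family:cursive;">The scaled training columns have mean 0 and standard deviation 1 by construction; the validation columns generally do not, and that is expected.</p>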

<center><h1 style="font-size:180%; font-family:cursive; background:skyblue; color:black; border-radius:10px 10px; padding:10px;">Random Forest Classifier </h1></center>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import average_precision_score

rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)

rfc_pred=rfc.predict(x_test)
print("Training Accuracy :", rfc.score(x_train,y_train))
print("Validation Accuracy :", rfc.score(x_valid,y_valid))

<center><h1 style="font-size:180%; font-family:cursive; background:skyblue; color:black; border-radius:10px 10px; padding:10px;">XGBoost Classifier </h1></center>

In [None]:
from xgboost.sklearn import XGBClassifier
xgb = XGBClassifier()
xgb.fit(x_train, y_train)

xgb_pred = xgb.predict(x_test)

print("Training Accuracy :", xgb.score(x_train, y_train))
print("Validation Accuracy :", xgb.score(x_valid, y_valid))

<center><h1 style="font-size:180%; font-family:cursive; background:skyblue; color:black; border-radius:10px 10px; padding:10px;">Light Gradient Boosting Classifier </h1></center>

In [None]:
from lightgbm import LGBMClassifier
lgb = LGBMClassifier()
lgb.fit(x_train, y_train)

lgb_pred = lgb.predict(x_test)

print("Training Accuracy :", lgb.score(x_train, y_train))
print("Validation Accuracy :", lgb.score(x_valid, y_valid))

<center><h1 style="font-size:180%; font-family:cursive; background:skyblue; color:black; border-radius:10px 10px; padding:10px;">Extra Trees Classifier </h1></center>

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
etc = ExtraTreesClassifier()
etc.fit(x_train, y_train)

etc_pred = etc.predict(x_test)

print("Training Accuracy :", etc.score(x_train, y_train))
print("Validation Accuracy :", etc.score(x_valid, y_valid))
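
<p style="font-size:150%; font-family:cursive;">The Random Forest cell imports confusion_matrix and classification_report without using them. As a hedged sketch of what those reports contain, the same quantities can be computed by hand; the labels and predictions below are hypothetical and are not model output from this notebook.</p>

```python
import numpy as np

# hypothetical validation labels and predictions (1 = promoted)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# the four cells of the confusion matrix
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())
tn = int(((y_true == 0) & (y_pred == 0)).sum())

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # how many predicted promotions were real
recall = tp / (tp + fn)      # how many real promotions were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```

<p style="font-size:150%; font-family:cursive;">On a target as skewed as is_promoted, precision and recall for the promoted class say more about a model than accuracy alone.</p>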

<center><h1 style="font-size:130%; font-family:cursive; background:pink; color:black; border-radius:10px 10px; padding:10px;"> I hope you enjoyed this kernel. Please don't forget to appreciate it with an upvote.</h1></center>