## IBM HR DATA VISUALIZATION
Draw graphs to understand why people are leaving the organization; in other words, identify which attributes appear most important for predicting attrition.

In [1]:
# Import the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings    # We want to suppress warnings
import os

##### Read data and look at it

In [2]:
warnings.filterwarnings("ignore")    # Ignore warnings


In [3]:
hrdata = pd.read_csv('../input/WA_Fn-UseC_-HR-Employee-Attrition.csv')
hrdata.info()
hrdata.head(10)

#### Number of unique values per column.

In [4]:
Nunique = hrdata.nunique()
Nunique = Nunique.sort_values()
Nunique

We will ignore the EmployeeNumber attribute for prediction.
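As a rough sketch of how the unique-value counts can guide this choice, the snippet below flags columns that hold a single value and columns that hold one distinct value per row. The `toy` frame and its values are invented for illustration; only the column names mirror the dataset, and the two selection rules are assumptions rather than the notebook's own method.

```python
import pandas as pd

# Toy stand-in for hrdata; the values are invented for illustration only.
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 4, 5, 7, 8],      # unique per row, like an ID
    "EmployeeCount":  [1, 1, 1, 1, 1, 1],      # constant column
    "Age":            [41, 49, 37, 33, 27, 41],
    "Attrition":      ["Yes", "No", "Yes", "No", "No", "No"],
})

n_unique = toy.nunique()

# Columns with a single value carry no information for prediction.
constant_cols = n_unique[n_unique == 1].index.tolist()
# Columns with one distinct value per row behave like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

features = toy.drop(columns=constant_cols + id_like_cols)
print(constant_cols, id_like_cols, list(features.columns))
```

The identifier rule would also flag a continuous column that happens to have no repeated values, so the flagged names are worth reviewing by hand before dropping anything.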

### Data Visualization 

We use the Seaborn library for data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.

Seaborn color palettes:


In [5]:
sns.palplot(sns.color_palette("hls", 8))
plt.show() 

### Distribution Graph on Age Attribute

In [6]:
sns.distplot(hrdata['Age'])
plt.show() 

### Display Multiple Distribution Plots.

In [7]:
#  Plot areas are called axes

fig,ax = plt.subplots(3,3, figsize=(10,10))               # 'ax' is a 3x3 array with references to all nine axes
sns.distplot(hrdata['TotalWorkingYears'], ax = ax[0,0]) 
sns.distplot(hrdata['YearsAtCompany'], ax = ax[0,1]) 
sns.distplot(hrdata['DistanceFromHome'], ax = ax[0,2]) 
sns.distplot(hrdata['YearsInCurrentRole'], ax = ax[1,0]) 
sns.distplot(hrdata['YearsWithCurrManager'], ax = ax[1,1]) 
sns.distplot(hrdata['YearsSinceLastPromotion'], ax = ax[1,2]) 
sns.distplot(hrdata['PercentSalaryHike'], ax = ax[2,0]) 
sns.distplot(hrdata['YearsSinceLastPromotion'], ax = ax[2,1])   # NOTE: same column as ax[1,2]
sns.distplot(hrdata['TrainingTimesLastYear'], ax = ax[2,2]) 
plt.show()

### Multiple Count Plots

In [11]:
total_records= len(hrdata)
columns = ["Gender","MaritalStatus","WorkLifeBalance","EnvironmentSatisfaction","JobSatisfaction",
           "JobLevel","BusinessTravel","Department"]
plt.figure(figsize=(12,8))
for j, col in enumerate(columns, start=1):
    plt.subplot(4, 2, j)
    ax1 = sns.countplot(data=hrdata, x=col, hue="Attrition")
    if j in (7, 8):                      # the last two plots have long labels
        plt.xticks(rotation=90)
    # Label each bar with its share of all records
    for p in ax1.patches:
        height = p.get_height()
        ax1.text(p.get_x() + p.get_width() / 2.,
                 height + 3,
                 '{:1.2f}'.format(height / total_records),
                 ha="center", rotation=0)

# Adjust the subplot layout
plt.subplots_adjust(bottom=-0.9, top=2)
plt.show()


###### Observation of above Count Plot Graph
High attrition rates appear in the following attributes:

1. Marital status: the attrition rate for Single employees is 50%.
2. JobLevel 1 has a high attrition rate compared to other job levels.
3. EnvironmentSatisfaction level 1 has a high attrition rate.
4. Attrition rates are also high for the Sales department, male employees, and JobSatisfaction level 1.
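The labels drawn on the count plot bars are each bar's share of all records (`height / total_records`), which is not the same as an attrition rate within a category. The sketch below, on an invented `toy` frame, shows two ratios that are easy to confuse: the rate within each category and each category's share of all leavers. The data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for hrdata; the values are invented for illustration only.
toy = pd.DataFrame({
    "MaritalStatus": ["Single", "Single", "Single", "Single",
                      "Married", "Married", "Married", "Married",
                      "Divorced", "Divorced"],
    "Attrition":     ["Yes", "Yes", "No", "No",
                      "Yes", "No", "No", "No",
                      "No", "No"],
})

left = toy["Attrition"].eq("Yes")

# Rate within each category: leavers in the group / size of the group.
rate_within = left.groupby(toy["MaritalStatus"]).mean()

# Share of all leavers: leavers in the group / all leavers.
share_of_leavers = toy.loc[left, "MaritalStatus"].value_counts(normalize=True)

print(rate_within)
print(share_of_leavers)
```

Stating which of the two ratios a percentage refers to makes observations like the ones above easier to verify.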



### Bar Plot
Bar plots combine a categorical variable with a numerical summary. A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle and provides some indication of the uncertainty around that estimate using error bars. Bar plots include 0 in the quantitative axis range, so they are a good choice when 0 is a meaningful value for the quantitative variable and you want to make comparisons against it. The default confidence interval is 95%.
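To make the bar height and the error bar concrete, here is a minimal sketch of a mean with a bootstrapped 95% interval, computed with NumPy alone. It illustrates the general idea only and is not Seaborn's actual implementation; the sample values, the seed, and the number of resamples are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the sketch is repeatable

# Made-up sample standing in for one group's numeric values.
sample = np.array([2, 9, 1, 24, 8, 3, 10, 7, 16, 5], dtype=float)

bar_height = sample.mean()               # the bar shows the group mean

# Bootstrap: resample with replacement, recompute the mean each time.
n_boot = 1000
idx = rng.integers(0, len(sample), size=(n_boot, len(sample)))
boot_means = sample[idx].mean(axis=1)

# The 2.5th and 97.5th percentiles bracket a 95% interval.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(bar_height, ci_low, ci_high)
```

A wide interval relative to the difference between two bars suggests the difference may not be meaningful.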

In [13]:
# MaritalStatus wise
columns = ["DistanceFromHome",
"WorkLifeBalance"]
plt.figure(figsize=(12,8))
for j, col in enumerate(columns, start=1):
    plt.subplot(1, 2, j)
    sns.barplot(x='Attrition', y=col, hue="MaritalStatus", data=hrdata)

#plt.subplots_adjust(bottom=-0.9, top=2)

plt.show()

#JobLevel wise
columns = ["DistanceFromHome",
"WorkLifeBalance",
"PercentSalaryHike"]
plt.figure(figsize=(12,8))
for j, col in enumerate(columns, start=1):
    plt.subplot(3, 1, j)
    sns.barplot(x='Attrition', y=col, hue="JobLevel", data=hrdata)

plt.subplots_adjust(bottom=-0.9, top=2)

plt.show()



Employees are more likely to quit when:
1. DistanceFromHome is above 8 km (more so for married employees).
2. DistanceFromHome is above 2.5 for JobLevel 5.
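The bar heights behind observations like these can be read off as numbers with a pivot table. The sketch below uses an invented `toy` frame, so its values say nothing about the real dataset; it only shows the shape of the calculation.

```python
import pandas as pd

# Toy stand-in for hrdata; the values are invented for illustration only.
toy = pd.DataFrame({
    "Attrition":        ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
    "MaritalStatus":    ["Married", "Single", "Married", "Married",
                         "Single", "Single", "Married", "Single"],
    "DistanceFromHome": [12, 6, 10, 7, 5, 9, 9, 8],
})

# Mean distance per (Attrition, MaritalStatus) cell: the bar heights.
mean_distance = toy.pivot_table(index="Attrition",
                                columns="MaritalStatus",
                                values="DistanceFromHome",
                                aggfunc="mean")
print(mean_distance)
```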
 

### Box plots

In [None]:
# Display multiple box plots.
#  Plot areas are called axes

fig, ax = plt.subplots(2, 2, figsize=(10, 10))                             # 'ax' has references to all four axes
sns.boxplot(x='Attrition', y='MonthlyIncome', data=hrdata, ax=ax[0, 0])    # Plot on 1st axes
sns.boxplot(x='Gender', y='MonthlyIncome', data=hrdata, ax=ax[0, 1])       # Plot on 2nd axes
sns.boxplot(x='Department', y='MonthlyIncome', data=hrdata, ax=ax[1, 0])   # Plot on 3rd axes
ax[1, 0].tick_params(axis='x', labelrotation=90)    # plt.xticks would act on the current axes only

sns.boxplot(x='JobRole', y='MonthlyIncome', data=hrdata, ax=ax[1, 1])      # Plot on 4th axes
ax[1, 1].tick_params(axis='x', labelrotation=90)
plt.show() 



In [None]:
sns.swarmplot(x="Department", y="MonthlyIncome", hue="Attrition", data=hrdata);
plt.show()

sns.swarmplot(x="JobRole", y="MonthlyIncome", hue="Attrition", data=hrdata);
plt.xticks( rotation=90 )
plt.show()


sns.swarmplot(x="JobLevel", y="MonthlyIncome", hue="Attrition", data=hrdata);
plt.show()

In [None]:
# sns.factorplot was renamed to sns.catplot in Seaborn 0.9
g = sns.catplot(x =   'Department',     # Categorical
                y =   'MonthlyIncome',  # Continuous
                hue = 'Attrition',      # Categorical
                col = 'JobLevel',
                col_wrap=2,             # Wrap facet after two axes
                kind = 'swarm',
                data = hrdata)
g.set_xticklabels(rotation=90)          # Rotate the labels on every facet
plt.show()


###### Observations from above graphical representations:

1. The attrition rate is high in JobLevel 1 at low salaries (within ±10% of 2500), and then in JobLevel 2 and JobLevel 3 in the salary range between 7500 and 10000.
2. The attrition rate is high in the Sales and Research & Development departments, especially at JobLevel 1 in both departments.

Conclusion:
High attrition rates are found among Sales Representatives (JobLevel 1 and single), Laboratory Technicians (JobLevel 1), and Sales Executives (JobLevel 2 and JobLevel 3 with a salary between 7500 and 10000).
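Swarm plots show where the leavers cluster, but the salary ranges mentioned above can also be checked numerically by binning income and computing the attrition rate per band. The sketch below is one possible way to do that; the `toy` data and the band edges are hypothetical and chosen only to mirror the ranges named in the observations.

```python
import pandas as pd

# Toy stand-in for hrdata; the values are invented for illustration only.
toy = pd.DataFrame({
    "MonthlyIncome": [2300, 2600, 2450, 4800, 8200, 9100, 9900, 12500],
    "Attrition":     ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# Hypothetical band edges; each band is open on the left, closed on the right.
edges = [0, 2500, 5000, 7500, 10000, 20000]
toy["IncomeBand"] = pd.cut(toy["MonthlyIncome"], bins=edges)

# Attrition rate and head count within each income band.
summary = (toy.assign(left=toy["Attrition"].eq("Yes"))
              .groupby("IncomeBand", observed=False)["left"]
              .agg(rate="mean", n="size"))
print(summary)
```

Reporting the head count next to the rate helps avoid over-reading bands that contain only a few employees.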


### Joint Plots

In [None]:
# Joint scatter plot
sns.jointplot(x='Age', y='MonthlyIncome', data=hrdata, kind="scatter")
plt.show()

# Joint scatter plot with a least-squares regression line
sns.jointplot(x='TotalWorkingYears', y='MonthlyIncome', data=hrdata, kind="reg")
plt.show()



### Pair Plots
Plot pairwise relationships in a dataset.

In [None]:
cont_col= ['Attrition','Age','MonthlyIncome', 'JobLevel','DistanceFromHome']
sns.pairplot(hrdata[cont_col],  kind="reg", diag_kind = "kde"  , hue = 'Attrition' )
plt.show()

In [None]:
cont_col= ['Attrition','JobLevel','TotalWorkingYears', 'PercentSalaryHike','PerformanceRating']
sns.pairplot(hrdata[cont_col], kind="reg", diag_kind = "kde" , hue = 'Attrition' )
plt.show()

### Factor Plots
Factor plots show the relationship between one continuous variable and one categorical variable, conditioned on one or two other categorical variables.

In [None]:
# sns.factorplot was renamed to sns.catplot in Seaborn 0.9
sns.catplot(x =   'Attrition',     # Categorical
               y =   'MonthlyIncome',      # Continuous
               hue = 'Gender',    # Categorical
               col = 'Department',
               col_wrap=2,           # Wrap facet after two axes
               kind = 'box',
               data = hrdata)
plt.show()
