In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline

In [None]:
hr = pd.read_csv("../input/WA_Fn-UseC_-HR-Employee-Attrition.csv", header =0)
hr.head()

In [None]:
hr.shape

In [None]:
hr.dtypes

In [None]:
hr[hr.isnull().any(axis=1)]

In [None]:
hr.nunique()

The columns Over18, StandardHours, and EmployeeCount each hold a single value across all rows (see the nunique() output above), so they carry no information and we can drop them.

In [None]:
# Drop the constant columns identified above
cols = ["Over18", "StandardHours", "EmployeeCount"]
hr = hr.drop(columns=cols)
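
Rather than reading the constant columns off the nunique() output by eye, they can also be found programmatically. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the HR file); the variable names constant_cols and trimmed are assumptions introduced here.

```python
import pandas as pd

# Synthetic stand-in for the HR table; the values are made up for illustration.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
    "Department": ["Sales", "R&D", "R&D", "Sales"],
})

# A column with exactly one distinct value cannot help distinguish rows.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
trimmed = df.drop(columns=constant_cols)

print(constant_cols)
print(trimmed.columns.tolist())
```

The same list comprehension should work on hr directly, assuming the column layout shown earlier in the notebook.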

In [None]:
hr.nunique()

Below are the categorical variables, that is, the columns whose dtype is not int64.

In [None]:
(hr.select_dtypes(exclude=['int64'])).head(0)

In [None]:
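# Iterate over the non-integer columns themselves; indexing hr.columns by position
# would print the first few columns of hr instead of the categorical ones.
for col in hr.select_dtypes(exclude=['int64']).columns:
    print(col, ":", np.unique(hr[col]), "\n")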

**Plotting univariate distributions**
A univariate distribution is the probability distribution of a single random variable.

1.  Histogram
A histogram visualises the distribution of data over a continuous interval or time period. Each bar represents the number of observations that fall into one interval (bin), so the bar heights add up to the number of observations.

Histograms help estimate where values are concentrated, what the extremes are, and whether there are any gaps or unusual values. They are also useful for giving a rough view of the probability distribution.
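
The relationship between bar heights and sample size can be checked numerically. This is a small sketch with np.histogram on made-up ages (random values, not the HR data): with raw counts the bar heights sum to the number of observations, and with density=True the bars enclose an area of one.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(loc=37, scale=9, size=500)   # made-up values, not the HR data

# Raw counts per bin
counts, edges = np.histogram(ages, bins=10)
widths = np.diff(edges)

# Same bins, but normalised so that the bars integrate to 1
density, _ = np.histogram(ages, bins=10, density=True)

print(counts.sum())               # total of the bar heights
print((density * widths).sum())   # total area of the density bars
```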

Let's plot histograms for several variables in a single figure, using a grid of axes.

2. KDE plot
Rather than a histogram, we can get a smooth estimate of the distribution using kernel density estimation, which seaborn provides as sns.kdeplot.

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(10, 10))   # 'ax' is a 2x3 array holding all six axes
plt.suptitle("Distribution of various factors", fontsize=20)
# Top row: histograms with a KDE overlay (histplot replaces the deprecated distplot)
sns.histplot(hr['Age'], kde=True, ax=ax[0, 0])
sns.histplot(hr['MonthlyIncome'], kde=True, ax=ax[0, 1])
sns.histplot(hr['DistanceFromHome'], kde=True, ax=ax[0, 2])
# Bottom row: KDE curves only
sns.kdeplot(hr['YearsInCurrentRole'], ax=ax[1, 0])
sns.kdeplot(hr['TotalWorkingYears'], ax=ax[1, 1])
sns.kdeplot(hr['YearsAtCompany'], ax=ax[1, 2])
plt.show()

The bandwidth parameter of the KDE controls how tightly the estimate is fit to the data, much like the bin width in a histogram. It corresponds to the width of the kernels used to build the curves plotted above. The default behavior tries to guess a good value using a common reference rule, but it may be helpful to try larger or smaller values:

In [None]:
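sns.kdeplot(hr['WorkLifeBalance'], label="default")
# bw_method replaces the 'bw' argument, which was deprecated in seaborn 0.11;
# the legend labels now match the values actually passed
sns.kdeplot(hr['WorkLifeBalance'], bw_method=.2, label="bw: 0.2")
sns.kdeplot(hr['WorkLifeBalance'], bw_method=2, label="bw: 2")
plt.legend();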
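
To see what the bandwidth does independently of seaborn, here is a minimal NumPy sketch of a Gaussian kernel density estimate. The function names gaussian_kde_1d and count_peaks, the sample ratings, and the grid are all illustrative assumptions, and this is not a reproduction of seaborn's internal computation; it only contrasts a narrow kernel with a wide one.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, h):
    """Evaluate a Gaussian KDE with bandwidth h at every point of grid."""
    z = (grid[:, None] - samples[None, :]) / h          # scaled distances
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * h)      # average of the kernels

def count_peaks(y):
    """Count strict local maxima of a sampled curve."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

samples = np.array([1, 2, 2, 3, 3, 3, 3, 4], dtype=float)  # made-up ratings
grid = np.linspace(-15, 20, 3501)
dx = grid[1] - grid[0]

narrow = gaussian_kde_1d(samples, grid, h=0.2)   # tight fit: one bump per distinct value
wide = gaussian_kde_1d(samples, grid, h=2.0)     # loose fit: a single smooth bump

print(count_peaks(narrow), count_peaks(wide))
print(narrow.sum() * dx, wide.sum() * dx)        # both curves integrate to about 1
```

A small bandwidth follows the individual values closely, while a large one smooths them into one broad curve; both remain valid densities.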

3. Count Plot
A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(20, 20))   # 'ax' is a 2x3 array holding all six axes
plt.suptitle("Distribution of various factors", fontsize=20)
# Keyword arguments work across seaborn versions; newer releases reject positional x
sns.countplot(x='Attrition', data=hr, ax=ax[0, 0])
sns.countplot(x='BusinessTravel', data=hr, ax=ax[0, 1])
sns.countplot(x='Department', data=hr, ax=ax[0, 2])
sns.countplot(x='EducationField', data=hr, ax=ax[1, 0])
sns.countplot(x='Gender', data=hr, ax=ax[1, 1])
sns.countplot(x='OverTime', data=hr, ax=ax[1, 2])
# Rotate the tick labels on every subplot (plt.xticks only affects the last one)
for a in ax.flat:
    a.tick_params(axis='x', labelrotation=20)
plt.subplots_adjust(bottom=0.4)
plt.show()
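
The numbers behind a count plot are simply the frequencies of each category. As a quick sketch on made-up labels (not the real Attrition column), value_counts returns the bar heights, and normalize=True turns them into shares.

```python
import pandas as pd

# Hypothetical labels for illustration only
attrition = pd.Series(["No", "No", "Yes", "No", "Yes", "No"], name="Attrition")

counts = attrition.value_counts()                 # bar heights of a count plot
shares = attrition.value_counts(normalize=True)   # same bars as proportions

print(counts)
print(shares)
```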

Univariate plots already give insight into the data. In this dataset, for example, most employees are between 30 and 45 years old, with a monthly income of roughly 3000 to 8000.

**Plotting bivariate distributions**

A joint distribution over two random variables is called a bivariate distribution; the concept generalizes to any number of random variables, giving a multivariate distribution.

1. Scatter Plot

The most familiar way to visualize a bivariate distribution is a scatterplot, where each observation is shown as a point at its x and y values. We can draw a scatterplot with the matplotlib plt.scatter function, and it is also the default kind of plot shown by the jointplot() function. When many points overlap, the kind argument of jointplot() can show density instead: 'hex' bins the points into hexagons and 'kde' draws a smoothed density, as in the next two cells:

In [None]:
sns.jointplot(x='MonthlyIncome', y='YearsAtCompany', data=hr,kind = 'hex');
plt.show()

In [None]:
# Pass column names with data=hr; newer seaborn versions no longer accept positional x, y
sns.jointplot(x='MonthlyIncome', y='YearsAtCompany', data=hr, kind='kde');
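
Both the hex and the kde views summarise how many observations fall in each region of the plane. A rectangular-bin analogue of that idea can be sketched with np.histogram2d; note that it uses square cells rather than hexagons, and that the income and tenure values below are randomly generated for illustration, not taken from the HR data.

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.uniform(1000, 20000, size=300)            # made-up incomes
years = income / 2000 + rng.normal(0, 1.5, size=300)   # made-up tenure, loosely tied to income

# Count the observations falling in each cell of an 8x8 grid
counts, x_edges, y_edges = np.histogram2d(income, years, bins=(8, 8))

print(counts.shape)
print(int(counts.sum()))   # every observation lands in exactly one cell
```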

* Pair Plots
A pair plot draws a scatterplot for every pair of the selected variables and shows the univariate distribution of each variable on the diagonal axes. Any number of variables can be plotted together this way.

In [None]:
# 'height' is the current name of the former 'size' argument (renamed in seaborn 0.9)
sns.pairplot(hr.iloc[:,[1,29,30,31]], hue='Attrition', height=3.5);

**Stripplot** for categorical variables
When one of the variables is categorical, the scatterplot points overlap and it is difficult to see the full distribution of the data (ax[0,0]). One easy solution is to adjust the positions, only along the categorical axis, using some random “jitter” (ax[0,1]).
A different approach is to use the function swarmplot(), which positions each scatterplot point on the categorical axis with an algorithm that avoids overlapping points. We can also add a nested categorical variable with the hue parameter (ax[1,0] and ax[1,1]).

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(20, 20))
# jitter=False is set explicitly because recent seaborn versions jitter by default
sns.stripplot(x="Attrition", y="MonthlyIncome", data=hr, jitter=False, ax=ax[0,0]);
sns.stripplot(x="Attrition", y="MonthlyIncome", data=hr, jitter=True, ax=ax[0,1]);
sns.swarmplot(x="Department", y="MonthlyIncome", hue="Attrition", data=hr, ax=ax[1,0]);
sns.swarmplot(x="MaritalStatus", y="MonthlyIncome", hue="Attrition", data=hr, ax=ax[1,1]);
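
The jitter idea itself is simple: add a small random offset along the categorical axis only, leaving the measured values untouched. The sketch below illustrates this with NumPy on made-up points; the variable names and the jitter width of 0.1 are assumptions chosen for the example, not seaborn's internal settings.

```python
import numpy as np

rng = np.random.default_rng(2)
codes = np.array([0, 0, 0, 1, 1, 1, 1])                         # category positions, e.g. 0 = "No", 1 = "Yes"
income = np.array([2500, 2500, 4100, 2500, 3900, 3900, 6000])   # made-up values

jitter_width = 0.1
# Offset only the categorical coordinate; the y values stay exact
x_jittered = codes + rng.uniform(-jitter_width, jitter_width, size=codes.size)

distinct_raw = len(set(zip(codes.tolist(), income.tolist())))
distinct_jittered = len(set(zip(x_jittered.tolist(), income.tolist())))
print(distinct_raw, distinct_jittered)   # overlapping points become separately visible
```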

* Boxplots
A box plot shows the three quartile values of the distribution along with extreme values. The “whiskers” extend to points that lie within 1.5 IQRs of the lower and upper quartile, and observations that fall outside this range are displayed independently. Importantly, this means that each value in the boxplot corresponds to an actual observation in the data:

In [None]:
fig,ax = plt.subplots(figsize=(15,10))
sns.boxplot(x = 'Gender',y = 'MonthlyIncome',data=hr, hue='Attrition',palette='Set3')
plt.legend(loc='best')
plt.show()
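
The quartiles, whiskers, and outliers described above can be computed by hand. This is a minimal sketch on ten made-up values, assuming the 1.5 IQR rule from the text and NumPy's default (linear) percentile interpolation; the variable names are illustrative.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)   # made-up data

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that still lie inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the fences is drawn as an individual point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high, outliers)
```

Note that the whiskers stop at actual data points (1 and 9 here), not at the fences themselves.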

* Violinplots
A different approach is a violinplot(), which combines a boxplot with the kernel density estimation procedure. This approach uses the kernel density estimate to provide a better description of the distribution of values. Additionally, the quartile and whisker values from the boxplot are shown inside the violin.

In [None]:
fig,ax = plt.subplots(figsize=(10,10))
sns.violinplot(x = 'Gender',y = 'MonthlyIncome',data=hr, hue='Attrition',split=True,palette='Set3')
plt.legend(loc='best')
plt.show()

* Bar plots

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(20, 20))   # 'ax' is a 2x3 array holding all six axes
plt.suptitle("Distribution of various factors", fontsize=20)
# Each bar shows the mean of y for one Gender/Attrition group;
# keyword arguments are used because newer seaborn versions reject positional x, y
sns.barplot(x='Gender', y='DistanceFromHome', hue='Attrition', data=hr, ax=ax[0, 0]);
sns.barplot(x='Gender', y='YearsAtCompany', hue='Attrition', data=hr, ax=ax[0, 1]);
sns.barplot(x='Gender', y='TotalWorkingYears', hue='Attrition', data=hr, ax=ax[0, 2]);
sns.barplot(x='Gender', y='YearsInCurrentRole', hue='Attrition', data=hr, ax=ax[1, 0]);
sns.barplot(x='Gender', y='YearsSinceLastPromotion', hue='Attrition', data=hr, ax=ax[1, 1]);
sns.barplot(x='Gender', y='NumCompaniesWorked', hue='Attrition', data=hr, ax=ax[1, 2]);
plt.show()
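
By default a seaborn bar plot draws the mean of the y variable for each group, so the bar heights can be reproduced with a groupby. The sketch below uses a tiny made-up frame that borrows the column names from the notebook; the values are illustrative only.

```python
import pandas as pd

# Made-up rows using the notebook's column names
df = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "F", "M"],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "Yes"],
    "DistanceFromHome": [10, 4, 6, 8, 6, 2],
})

# Mean distance for every Gender/Attrition combination (one value per bar)
means = df.groupby(["Gender", "Attrition"])["DistanceFromHome"].mean()
print(means)
```

Applying the same groupby to hr should give the exact numbers behind the bars, which is useful when differences are too small to read off the figure.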

Conclusions from the figure above
1. Distance from home appears to matter more to female employees than to male employees. Female employees also spend more years at one company than their male counterparts.
2. Female employees who have spent more years at their current company are more inclined to switch.

**Factor Plot and Facet Grid**


In [None]:
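# Bin the salary hike percentage into three equal-width levels
hr['hike_level'] = pd.cut(hr['PercentSalaryHike'], 3, labels=['Low', 'Avg', 'High'])
# catplot is the current name of factorplot (renamed in seaborn 0.9)
g = sns.catplot(x='JobRole', y='MonthlyIncome', hue='Attrition', col='hike_level', col_wrap=2,
                kind='box',
                data=hr)
g.set_xticklabels(rotation=30)   # rotate the labels on every facet, not only the last one
plt.show()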
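
The hike_level column relies on pd.cut with an integer number of bins, which splits the observed range into intervals of equal width, not into groups of equal size. A small sketch on made-up percentages (not the real PercentSalaryHike values) shows the effect:

```python
import pandas as pd

hikes = pd.Series([11, 12, 13, 14, 15, 18, 20, 22, 25])   # made-up percentages

# Three equal-width intervals over the range 11 to 25
levels = pd.cut(hikes, 3, labels=["Low", "Avg", "High"])
level_counts = levels.value_counts()

print(level_counts)
```

Because the values cluster at the low end, the "Low" bin holds most of them. If groups of roughly equal size were wanted instead, pd.qcut would be the usual alternative.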

In [None]:
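# kind='point' reproduces the default of the old factorplot (catplot defaults to 'strip')
g = sns.catplot(x='MaritalStatus', y='TotalWorkingYears', hue='Attrition', col='hike_level', col_wrap=2,
                kind='point',
                data=hr)
g.set_xticklabels(rotation=30)
plt.show()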

The salary hike seems to affect married employees more when the monthly income is around 6k. When the hike level is average, single employees tend more towards switching. When the hike level is high, everyone seems to be happier and attrition is lower.

In [None]:
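# Plot a correlation map for all numeric variables.
# Non-numeric columns are excluded explicitly; recent pandas versions raise an error otherwise.
f, ax = plt.subplots(figsize=(18, 18))
sns.heatmap(hr.select_dtypes(include='number').corr(), annot=True, linewidths=.5, fmt='.1f', ax=ax)
plt.show()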
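
The heatmap is a picture of the pairwise Pearson correlation matrix of the numeric columns. As a minimal sketch on a made-up frame (the column names x, y, z, and label are hypothetical), the matrix can be inspected directly:

```python
import pandas as pd

# Made-up frame: y rises with x, z falls with x, label is non-numeric
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
    "label": ["a", "b", "a", "b", "a"],
})

# Keep only numeric columns before computing correlations
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.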