# Introduction #
Finding the right people is indispensable for the success of any company. Hiring not only costs time _and_ money, it can also set a company back in revenue for a while. Therefore, understanding why people join, stay at, or leave an organization is essential to maximizing its success.

I come from _R_ and this will be my first _Python_ notebook. **Exciting!** I think this is a very nice data set to try out some new ideas and get familiar with the syntax and functions.

This notebook will be structured as follows:

 1. **Exploratory Data Analysis**
 2. **Feature Engineering**
 3. **Training Machine Learning Models**
 4. **Assessing Models and Performance**

Let's see what we can discover!

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.matrix as smatrix
import matplotlib.pyplot as plt

# 1 | Exploratory Data Analysis #
We first load in the data and look at the head of the data set, to get an idea of what kind of variables we will be dealing with.

## 1.1 | Load and first look ##

In [None]:
df = pd.read_csv('../input/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()

We see right away what will be the _dependent variable_: **Attrition**. 

Let us quickly move on to check some other important things: the completeness of the data set and the distribution of the variables. We might even get a quick idea of whether some variables are strongly correlated, which has implications for training our models at a later stage.

## 1.2 | Data Quality Assessment & Variable Distributions ##
We will now quickly assess if we have any missing values (using `isnull` from _pandas_).
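
To make clear what `isnull` gives back, here is a minimal sketch on a tiny made-up frame (the values are hypothetical and not taken from the HR data set). `isnull()` returns a frame of booleans; `.any()` collapses it to one True/False per column, while `.sum()` counts the missing cells instead.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values, not the HR data set) with two missing cells
toy = pd.DataFrame({
    "Age": [41, 49, np.nan, 33],
    "DailyRate": [1102, 279, 1373, np.nan],
    "Education": [2, 1, 2, 4],
})

has_missing = toy.isnull().any()      # True/False per column
missing_counts = toy.isnull().sum()   # number of missing cells per column
total_missing = int(missing_counts.sum())

print(has_missing)
print(missing_counts)
print("Total missing cells:", total_missing)
```

On a wide data set, the per-column counts are usually more informative than the plain True/False answer, because they show how much is missing and where.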

In [None]:
df.isnull().any()

Cool! Pristine data set. Now we'll have a look at the distributions. A Kernel Density Estimation (KDE) is our friend here.
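
For anyone new to KDE, here is a rough one-dimensional sketch of the idea, assuming a Gaussian kernel and a hand-picked bandwidth: every observation gets a small bell curve, and the density estimate is the average of all those curves. The function name `gaussian_kde_1d` and the simulated ages are my own illustration; the joint plots in this notebook use seaborn's two-dimensional version, which picks a bandwidth for you.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance of every grid point to every sample:
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
ages = rng.normal(loc=37, scale=9, size=500)   # simulated ages, not real data
grid = np.linspace(0, 80, 401)
density = gaussian_kde_1d(ages, grid, bandwidth=3.0)

# A density should integrate to roughly 1 over a wide enough grid
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points, a larger one gives a smoother curve that can hide structure.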

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments

# Plot 1
x = df['Age']
y = df['DailyRate']
sns.jointplot(x=x, y=y, kind="kde");

# Plot 2
x = df['Age']
y = df['DistanceFromHome']
sns.jointplot(x=x, y=y, kind="kde");

# Plot 3
x = df['Age']
y = df['Education']
sns.jointplot(x=x, y=y, kind="kde");

# Plot 4
x = df['JobSatisfaction']
y = df['DailyRate']
sns.jointplot(x=x, y=y, kind="kde");

# Plot 5
x = df['YearsAtCompany']
y = df['DailyRate']
sns.jointplot(x=x, y=y, kind="kde");

# Plot 6
x = df['Education']
y = df['DailyRate']
sns.jointplot(x=x, y=y, kind="kde");

We now have some first insights (for example, working at the company for a long time doesn't really seem to mean higher daily rates, an eye-opener for me!). But of course, given the many variables, making a plot for every combination of two variables is way too much work.

So we'll need either a heatmap or some kind of correlogram here. I'll follow [these][1] instructions to come up with something nice.


  [1]: http://www.marketcalls.in/python/nifty-returns-heatmap-generation-using-nsepy-seaborn.html

In [None]:
# Select only the numerical variables
list_numerical = ['Age','DailyRate','DistanceFromHome','Education','EmployeeNumber', 
                  'EnvironmentSatisfaction','HourlyRate','JobInvolvement','JobLevel', 
                  'JobSatisfaction','MonthlyIncome','MonthlyRate','NumCompaniesWorked',
                  'PercentSalaryHike','PerformanceRating','RelationshipSatisfaction',
                  'StockOptionLevel','TotalWorkingYears',
                  'TrainingTimesLastYear','WorkLifeBalance','YearsAtCompany',
                  'YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']

data = df[list_numerical] # We can use data.head() to check if everything went correct

fig, ax = plt.subplots(figsize=(14, 12))

# Heatmap of the pairwise (Pearson) correlations between the numerical variables
sns.heatmap(data.corr(), cbar=False, square=False,
            robust=True, annot=True, fmt=".1f",
            annot_kws={"size":8}, linewidths=0.5,
            cmap="RdYlGn", ax=ax)
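
A heatmap is nice to look at, but with this many variables a ranked list of the strongest pairs can be easier to read. The sketch below is one possible way to get there, run on a small simulated frame: the column names echo the data set, but all values are made up. It also uses `select_dtypes` as an alternative to typing the numerical column names by hand.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
years = rng.integers(0, 30, size=n)

# Toy frame (simulated values; only the column names echo the data set)
toy = pd.DataFrame({
    "YearsAtCompany": years,
    "MonthlyIncome": 2000 + 300 * years + rng.normal(0, 500, size=n),
    "DailyRate": rng.integers(100, 1500, size=n),
    "Department": rng.choice(["Sales", "R&D"], size=n),  # non-numeric column
})

# Keep only the numerical columns, then compute pairwise correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

# Keep each pair once: mask everything except the strict upper triangle
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order the pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Pairs near the top of such a list are the ones to keep an eye on later, since strongly correlated inputs can matter for some models.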