# Can you help reduce employee turnover?

## 📖 Background
You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company.

The team needs to understand the situation better: which employees are more likely to leave, and why. Once it is clear which variables impact employee churn, you can present your findings along with your ideas on how to attack the problem.

## 💾 The data
The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.

- "department" - the department the employee belongs to.
- "promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
- "review" - the composite score the employee received in their last evaluation.
- "projects" - how many projects the employee is involved in.
- "salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
- "tenure" - how many years the employee has been at the company.
- "satisfaction" - a measure of employee satisfaction from surveys.
- "avg_hrs_month" - the average hours the employee worked in a month.
- "left" - "yes" if the employee ended up leaving, "no" otherwise.

In [None]:
import itertools
import numpy as np
import pandas as pd

df = pd.read_csv('../data/employee_churn_data.csv')
df.head()

In [None]:
df.shape

## 💪 Competition challenge

Create a report that covers the following:
1. Which department has the highest employee turnover? Which one has the lowest?
2. Investigate which variables seem to be better predictors of employee departure.
3. What recommendations would you make regarding ways to reduce employee turnover?

## 🧑‍⚖️ Judging criteria

| CATEGORY | WEIGHTING | DETAILS                                                              |
|:---------|:----------|:---------------------------------------------------------------------|
| **Recommendations** | 35%       | <ul><li>Clarity of recommendations - how clear and well presented the recommendation is.</li><li>Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid?</li><li>Number of relevant insights found for the target audience.</li></ul>       |
| **Storytelling**  | 35%       | <ul><li>How well the data and insights are connected to the recommendation.</li><li>How the narrative and whole report connects together.</li><li>Balancing making the report in-depth enough but also concise.</li></ul> |
| **Visualizations** | 20% | <ul><li>Appropriateness of visualization used.</li><li>Clarity of insight from visualization.</li></ul> |
| **Votes** | 10% | <ul><li>Up voting - most upvoted entries get the most points.</li></ul> |

## ✅ Checklist before publishing into the competition
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your story.
- Make sure the workbook reads well and explains how you found your insights.
- Check that all the cells run without error.

## ⌛️ Time is ticking. Good luck!

In [None]:
df.columns

In [None]:
import matplotlib.pyplot as plt

df.hist()

plt.tight_layout()

Columns missing from the histograms because they are not numeric: `'department'`, `'salary'`, `'left'`

In [None]:
departments = sorted(set(df.department), key=lambda x: x.lower())

In [None]:
departments

In [None]:
for i, dep in enumerate(departments):
    df.department = df.department.replace(dep, i)

In [None]:
# Encode the ordered salary tiers as integers (low < medium < high).
df.salary = df.salary.replace({'low': 0, 'medium': 1, 'high': 2})

In [None]:
# Encode the target as 0 (stayed) / 1 (left).
df.left = df.left.replace({'no': 0, 'yes': 1})

# Question 1

Which department has the highest employee turnover? Which one has the lowest?

In [None]:
df_left = df[df.left == 1]
df_stay = df[df.left == 0]

In [None]:
leave_counts = {
    dep: df_left[df_left.department == i].shape[0]
    for i, dep in enumerate(departments)
}

In [None]:
for k, v in sorted(leave_counts.items(), key=lambda x: x[1]):
    print('{:<20}: {}'.format(k, v))

In [None]:
print("The department with the most turnover is: '{}' ({})".format(*max(leave_counts.items(), key=lambda x: x[1])))

In [None]:
print("The department with the least turnover is: '{}' ({})".format(*min(leave_counts.items(), key=lambda x: x[1])))
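
The counts above depend on department size, so a large department can top the list even when only a small share of its staff leaves. A minimal sketch of comparing departure rates instead, using a small made-up frame (`toy` and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Toy stand-in for the encoded data: `left` is 1 if the employee left, 0 otherwise.
toy = pd.DataFrame({
    'department': ['sales'] * 5 + ['IT'] * 2,
    'left':       [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the departure rate.
summary = toy.groupby('department')['left'].agg(
    leavers='sum', headcount='count', rate='mean'
)
summary = summary.sort_values('rate', ascending=False)
print(summary)
```

In this toy frame `sales` has more leavers (2 versus 1), but `IT` has the higher rate (0.5 versus 0.4), so the two rankings can disagree. The same `groupby` should apply to the notebook's `df` once `left` is encoded as 0/1.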

# Question 2

Investigate which variables seem to be better predictors of employee departure.

In [None]:
df.corr()

In [None]:
import seaborn as sns

ax = sns.heatmap(df.corr())
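
A full heatmap can be hard to read when only one column matters. A minimal sketch of ranking features by the strength of their correlation with `left`, on synthetic data (the frame `toy`, the column `noise`, and the rule generating `left` are all made up for illustration). Note that Pearson correlation only captures linear relationships, and that the integer codes assigned to `department` are arbitrary, so its correlation is not meaningful.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

toy = pd.DataFrame({
    'satisfaction': rng.uniform(0, 1, n),
    'noise': rng.normal(size=n),          # unrelated to the target by construction
})
# Hypothetical rule: the less satisfied an employee, the more likely they leave.
toy['left'] = (rng.uniform(0, 1, n) > toy['satisfaction']).astype(int)

# Correlation of every feature with the target, ordered by absolute value.
corr_with_left = toy.corr()['left'].drop('left')
order = corr_with_left.abs().sort_values(ascending=False).index
ranked = corr_with_left.reindex(order)
print(ranked)
```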

In [None]:
axes = df_left.hist(figsize=(12, 9), alpha=0.3, density=True)

df_stay.hist(figsize=(12, 9), ax=list(itertools.chain.from_iterable(axes))[:10], alpha=0.3, density=True)

plt.tight_layout()

In [None]:
axes = df_left.hist(figsize=(12, 9), alpha=0.3)

df_stay.hist(figsize=(12, 9), ax=list(itertools.chain.from_iterable(axes))[:10], alpha=0.3)

plt.tight_layout()

Brainstorming:

* "Maybe they work too hard, then want to leave, but they get reviewed well because they worked so hard"
    * So is there a positive correlation between `'avg_hrs_month'` and `'review'`? No... actually, they're negatively correlated.
* Spring-board theory -- they're reviewed too well, they leave
* Start with a model using only `'review'`, `'tenure'`, `'avg_hrs_month'`, and `'satisfaction'`

# Naive Bayes

General description of technique:
    
* Assumes that all variables are independent of each other, given the class (conditional independence)
* Uses Bayes' Theorem to make an updated prediction of the probability of a result given each of the variables
    * Recall Bayes' Theorem:
        * $P(\text{A}|\text{B}) = \frac{P(\text{B}|\text{A}) * P(\text{A})}{P(\text{B})}$
        * $ \text{posterior} = \frac{\text{likelihood} * \text{prior}}{\text{evidence}} $
    * For example:
        * $P(\text{left}|\text{promoted}) = \frac{P(\text{promoted}|\text{left})*P(\text{left})}{P(\text{promoted})}$
* Note that `likelihood` and `prior` should both be computed using _only_ the training set.
* Also note, `evidence` is a constant for a given observation, so it can be ignored when comparing classes. It is essentially the normalization of the distribution.
* For multiple $x$ (because of the assumption of independence), the posterior can be written, up to the constant evidence term, as $$ P(y | x_1, ..., x_n) \propto P(y) \prod_{i=1}^n P(x_i|y)$$
* We can of course use the log-probability instead, where the constant $C = -\log{P(x_1, ..., x_n)}$ does not depend on $y$: $$ \log{P(y | x_1, ..., x_n)} = \log{P(y)} + \sum_{i=1}^n \log{P(x_i|y)} + C $$
* This requires assuming a form for the likelihood $P(x_i|y)$ of each feature under each possible class.
    * For example, for "Gaussian Naive Bayes":
        * $ P(x_i|y) = \frac{1}{\sigma_{iy} \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu_{iy}}{\sigma_{iy}}\right)^2\right)$
        * $\mu_{iy}$ and $\sigma_{iy}$ are the mean and standard deviation of feature $x_i$ among the training rows of class $y$
* The likelihoods for the categorical features must be handled differently: $$ P(x_i = t | y=c; \alpha ) = \frac{N_{xc} + \alpha}{N_c + \alpha n_i} $$
    * $N_{xc}$ is the number of training rows of category $c$ in which feature $x_i$ takes the value $t$
    * $N_c$ is the number of training rows of category $c$
    * $\alpha$ is a smoothing parameter
    * $n_i$ is the number of distinct values feature $x_i$ can take
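
The smoothed categorical likelihood above can be sketched with a contingency table. This is a minimal illustration on made-up data; the helper name `smoothed_likelihoods` and the frame `toy` are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def smoothed_likelihoods(feature, target, alpha=1.0):
    """Table of P(feature = t | target = c) with additive smoothing."""
    counts = pd.crosstab(target, feature)    # rows: class c, columns: value t (N_xc)
    n_values = counts.shape[1]               # n_i: number of distinct feature values
    class_totals = counts.sum(axis=1)        # N_c: rows per class
    return (counts + alpha).div(class_totals + alpha * n_values, axis=0)

toy = pd.DataFrame({
    'salary': ['low', 'low', 'medium', 'high', 'low', 'medium'],
    'left':   [1, 1, 1, 0, 0, 0],
})

table = smoothed_likelihoods(toy['salary'], toy['left'])
print(table)
```

No leaver in the toy data has a `high` salary, yet with $\alpha = 1$ the estimate for that cell is $(0 + 1)/(3 + 3) = 1/6$ rather than zero, so a single unseen combination cannot zero out the whole product.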

In [None]:
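def train_test_split(data, percent_train):
    """Randomly split `data` into disjoint train and test sets."""
    num_data = len(data)

    # A permutation uses every row position exactly once, so no row can
    # land in both sets or appear twice (np.random.choice samples with
    # replacement by default, which would allow both).
    all_indices = np.random.permutation(num_data)

    split_index = int(np.floor(num_data*percent_train))
    train_indices, test_indices = np.split(all_indices, [split_index])

    return data.iloc[train_indices], data.iloc[test_indices]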

## Manual version

In [None]:
def log_normal(x, mu, sigma):
    return -(((x-mu)/sigma)**2)/2 - np.log(sigma) - np.log(2*np.pi)/2
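
A quick sanity check of the log-density, and of why working in log space helps. The function `normal_pdf` is a helper added here for comparison only; `log_normal` is repeated so the snippet runs on its own.

```python
import numpy as np

def log_normal(x, mu, sigma):
    return -(((x - mu) / sigma) ** 2) / 2 - np.log(sigma) - np.log(2 * np.pi) / 2

def normal_pdf(x, mu, sigma):
    # Plain Gaussian density, used only to cross-check log_normal.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# The two should agree wherever the density is representable.
x = np.linspace(-3.0, 7.0, 11)
assert np.allclose(log_normal(x, 2.0, 1.5), np.log(normal_pdf(x, 2.0, 1.5)))

# Far in the tail the density underflows to 0.0, while its log stays finite.
tail_density = normal_pdf(100.0, 0.0, 1.0)
tail_log_density = log_normal(100.0, 0.0, 1.0)
print(tail_density, tail_log_density)
```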

In [None]:
df_train, df_test = train_test_split(df, percent_train=0.9)
df_train.shape, df_test.shape

In [None]:
df_train.head()

In [None]:
train_left = df_train[df_train.left == 1]
train_stay = df_train[df_train.left == 0]

train_left.shape, train_stay.shape

In [None]:
test_left = df_test[df_test.left == 1]
test_stay = df_test[df_test.left == 0]

test_left.shape, test_stay.shape

In [None]:
means = train_left.mean()
stdvs = train_left.std()

In [None]:
# Log prior: the fraction of *training* rows that left.
p_left_given_data = np.log(train_left.shape[0]/df_train.shape[0])

for c in ['review', 'tenure', 'avg_hrs_month', 'satisfaction']:
    p_left_given_data += log_normal(df_test[c], means[c], stdvs[c])
    
p_left_given_data

In [None]:
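# Class-conditional statistics for the employees who stayed.
stay_means = train_stay.mean()
stay_stdvs = train_stay.std()

# Log prior: the fraction of *training* rows that stayed.
p_stay_given_data = np.log(train_stay.shape[0]/df_train.shape[0])

for c in ['review', 'tenure', 'avg_hrs_month', 'satisfaction']:
    p_stay_given_data += log_normal(df_test[c], stay_means[c], stay_stdvs[c])
    
p_stay_given_data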

In [None]:
predict_left_indices = np.where(p_left_given_data > p_stay_given_data)

predictions = np.zeros(len(df_test))
predictions[predict_left_indices] = 1

In [None]:
incorrect = np.sum(np.abs(predictions - df_test['left']))
percent_incorrect = incorrect / len(df_test)
percent_incorrect
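
An error rate is hard to judge on its own when one class is much more common than the other. A minimal sketch of comparing it against a baseline that always predicts the majority class of the training labels (the helper names and the label arrays are made up for illustration):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def majority_baseline_error(y_train, y_test):
    """Error rate of always predicting the most common training label."""
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return error_rate(y_test, np.full(len(y_test), majority))

y_train = np.array([0, 0, 0, 1, 0, 1, 0, 0])
y_test = np.array([0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 1, 1])    # made-up model output

model_error = error_rate(y_test, y_pred)                     # 1 of 5 wrong
baseline_error = majority_baseline_error(y_train, y_test)    # 2 of 5 wrong
print(model_error, baseline_error)
```

A model is only adding information if its error is clearly below the baseline's.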

## sklearn version

In [None]:
df_left.columns

In [None]:
from sklearn.naive_bayes import GaussianNB

# Columns excluded from the model: the target plus the features that the
# manual version does not use.
excluded_columns = ['left', 'department', 'promoted', 'projects', 'salary', 'bonus']

gnb = GaussianNB()
gnb.fit(df_train.drop(columns=excluded_columns), df_train['left'])

predictions = gnb.predict(df_test.drop(columns=excluded_columns))

# Fraction of test rows that were misclassified.
incorrect = np.sum(np.abs(predictions - df_test['left']))
percent_incorrect = incorrect / len(df_test)
percent_incorrect

## Questions

* How could we improve NB?
    * Use KDE for the likelihoods
    * Linear Discriminant Analysis (LDA) or Quadratic Discriminant Analysis (QDA)
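
One way to read the KDE suggestion: replace the single-Gaussian likelihood for each feature and class with a kernel density estimate fitted to that class's training values. A minimal NumPy-only sketch, where `kde_log_density`, the rule-of-thumb bandwidth, and the synthetic samples are all illustrative assumptions:

```python
import numpy as np

def kde_log_density(x, samples, bandwidth=None):
    """Log of a Gaussian kernel density estimate built from `samples`, evaluated at `x`."""
    samples = np.asarray(samples, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    n = samples.size
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption, not a tuned value.
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    # z[i, j]: scaled distance from evaluation point i to sample j.
    z = (x[:, None] - samples[None, :]) / bandwidth
    log_kernels = -0.5 * z ** 2 - np.log(bandwidth) - 0.5 * np.log(2 * np.pi)
    # Log of the mean kernel value, computed stably via log-sum-exp.
    m = log_kernels.max(axis=1, keepdims=True)
    log_sum = m + np.log(np.exp(log_kernels - m).sum(axis=1, keepdims=True))
    return log_sum.ravel() - np.log(n)

# Synthetic two-peaked feature, which a single Gaussian cannot represent.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])

log_p = kde_log_density([0.0, 2.0], samples)
print(log_p)   # the estimate is lower in the gap at 0 than at the peak near 2
```

In a Naive Bayes setting, `kde_log_density(df_test[c], train_left[c])` would stand in for the `log_normal` term of one class, at the cost of evaluating every test point against every training point.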