# Module 3 -  Structured Data

In this module, we are going to learn how to manipulate structured data in a real-world scenario using the Data Analytics Cycle that you learned in previous sessions.



## Learning about Organisations with Structured Data: Why are people leaving my company?

Case studies help students learn by immersing them in a real-world business scenario where they act as problem solvers and decision-makers.

A case study analysis must not merely summarize the case. It should identify the **key features** and **key problems**, then **outline** and **assess** alternative courses of action to deal with the problems you identify.


In this module, we are going to analyse a structured dataset containing human resources information about the staff of a company.


### BUSINESS CONCERN: Understand why the best and most experienced employees are leaving the company.

In [None]:
import pandas as pd    # used for data manipulation and data analysis
import numpy as np     # used for algebraic operations               

import matplotlib.pyplot as plt  # used for visualisations
import seaborn as sns            # used for visualisations

pd.set_option('display.max_rows', 250)  # maximum number of rows to display in a dataframe
#pd.options.display.max_rows            # prints the maximum number of rows to display in a dataframe

This block of code uses a CSS rule (inside an HTML cell) to align the DATASET INFORMATION table to the left.

In [None]:
%%html 
<style>
table {float:left} 
</style>

### DATASET INFORMATION

The Human Resources department of the company has collected the following information about the staff members:

| Variable              | Description                                          |
|-----------------------|------------------------------------------------------|
| satisfaction_level    | Satisfaction level                                   |
| last_evaluation       | Last evaluation                                      |
| number_project        | Number of projects                                   |
| average_montly_hours  | Average monthly hours                                |
| time_spend_company    | Time spent at the company                            |
| Work_accident         | Whether they have had a work accident                |
| left                  | Whether the employee has left                        |
| promotion_last_5years | Whether they had a promotion in the last 5 years     |
| department            | Department                                           |
| salary                | Salary range                                         |

### Taking a look at the dataset

In [None]:
# dataset filepath
dataset_path = "data/dataset_week4.csv"

# load dataset into a dataframe
data = pd.read_csv( dataset_path )
data

Take a look at the "salary" column. Its values are the text labels "low", "medium" and "high". These labels are meaningful to a human reader, but most numerical methods need numbers, so we will need to encode them as integers.
The variable "salary" is called a categorical variable because it takes one of a fixed set of categories: "low", "medium" and "high".
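
Because these categories have a natural order, one option is to declare that order explicitly and let pandas derive the integer codes. The following is only a sketch on a small made-up frame (`toy` and `salary_levels` are illustrative names, not part of the dataset); the notebook itself uses a hand-written mapping instead.

```python
import pandas as pd

# Hypothetical toy column standing in for data['salary']
toy = pd.DataFrame({"salary": ["low", "high", "medium", "low"]})

# Declare the categories in their natural order: low < medium < high
salary_levels = ["low", "medium", "high"]
toy["salary"] = pd.Categorical(toy["salary"], categories=salary_levels, ordered=True)

# .cat.codes gives the position of each value in salary_levels (0, 1 or 2)
toy["salary_enc"] = toy["salary"].cat.codes
print(toy)
```

Either approach yields the same integers as long as the declared order matches the mapping.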

In [None]:
# how many categories of salary do we have?

np.unique(data['salary']) # the unique function removes all duplicates from a list

In [None]:
data.salary

In [None]:
# encoding categorical variables
# we are going to map the attributes "low", "medium", and "high" of the "salary" column to 0, 1, and 2, respectively
# and put the results in a new column "salary_enc"
data['salary_enc'] = data.salary.map({"low": 0, "medium": 1, "high": 2})
data

In [None]:
# which integer codes does the encoded salary column contain?
np.unique(data['salary_enc'])

The first step when analysing structured data is to clean the data. In other words, to look for values that are:
- missing,
- incomplete, or
- noisy.

In [None]:
# check if there are null (missing) entries in the dataset
data.info()

The dataset is complete: every column has the same number of non-null entries, so there are no missing values.
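
A more direct check is to count the missing cells per column. The sketch below uses a made-up frame (`toy`) with one deliberately missing value so that the count is visible; on the real dataset the same call would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, standing in for the HR dataset
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11],
    "number_project": [2, 5, 7],
})

# isna() marks each missing cell as True; summing counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a count above zero would need attention before the analysis continues.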

### Performing Descriptive Statistics

A descriptive statistic is a quantitative summary that describes features of a dataset.

Pandas has a descriptive statistics method, **describe**, that summarizes the central tendency, dispersion and shape of the dataset.


In [None]:
# descriptive statistics
descr = data.describe()
descr

In [None]:
# fraction of employees who left (the mean of the 0/1 "left" column)
employees_who_left = np.round(descr['left']['mean'],2)

# average satisfaction level (on a 0 to 1 scale)
satisfaction_lv = np.round(descr['satisfaction_level']['mean'],2)

# average performance level (last evaluation, on a 0 to 1 scale)
performance_lv = np.round(descr['last_evaluation']['mean'],2)

# average number of projects
projects = np.round(descr['number_project']['mean'])

# average monthly hours
num_hours = np.round(descr['average_montly_hours']['mean'])

In [None]:
print("Summary of Descriptive Statistics:\n")
# the means above are fractions between 0 and 1, so multiply by 100 to print percentages
print("\tPercentage of employees who left the company: %.2f%%;" %(employees_who_left * 100))
print("\tAverage satisfaction level amongst employees: %.2f%%;" %(satisfaction_lv * 100))
print("\tAverage performance level of employees: %.2f%%;" %(performance_lv * 100))
print("\tEmployees work on average on %d projects and spend %d hours per month\n" %(projects, num_hours))

#### Graphical Representation of Descriptive Statistics: Box Plots

A box plot is a good statistical graphic for analysing a dataset and identifying outliers. An outlier is an observation that lies an abnormal distance from the other values. The data analyst has to decide what counts as abnormal for a given data distribution.
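
One common convention for "abnormal", and the default whisker rule in matplotlib and seaborn box plots, is to flag values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to made-up values (`years` is an illustrative series, not taken from the dataset).

```python
import pandas as pd

# Hypothetical values standing in for a column such as time_spend_company
years = pd.Series([2, 3, 3, 3, 4, 4, 5, 10])

# Quartiles and interquartile range
q1 = years.quantile(0.25)
q3 = years.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are drawn as individual points in a box plot
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = years[(years < lower) | (years > upper)]
print(outliers.tolist())
```

The 1.5 multiplier is a convention rather than a law; a different threshold may suit a different distribution.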


In [None]:
f, axes = plt.subplots(3,2, figsize=(10,10))

plt.subplots_adjust(wspace=1) # adjust the space between the plots

# passing the column as y draws a vertical box, so no orient argument is needed

# plot a boxplot of satisfaction_level to see if there are outliers
sns.boxplot(y='satisfaction_level', data=data, ax=axes[0,0])

# plot a boxplot of last_evaluation to see if there are outliers
sns.boxplot(y='last_evaluation', data=data, ax=axes[0,1])

# plot a boxplot of number_project to see if there are outliers
sns.boxplot(y='number_project', data=data, ax=axes[1,0])

# plot a boxplot of average_montly_hours to see if there are outliers
sns.boxplot(y='average_montly_hours', data=data, ax=axes[1,1])

# plot a boxplot of the encoded salary to see if there are outliers
sns.boxplot(y='salary_enc', data=data, ax=axes[2,0])

# plot a boxplot of time_spend_company to see if there are outliers
sns.boxplot(y='time_spend_company', data=data, ax=axes[2,1])


The box plots show our analysis graphically in terms of data distributions:

- Satisfaction level and last evaluation (performance) have left-skewed (negative) distributions.
- Number of projects has a right-skewed (positive) distribution.
- Average monthly hours has a symmetric distribution.
- Analysing the distribution of the variables is important because many statistical tests assume a normal distribution.
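
Skewness can also be read as a number rather than from the plot. As a rough sketch on made-up columns (the names in `toy` are illustrative), the pandas `skew` method returns a negative value for a left-skewed column, a positive value for a right-skewed one, and a value near zero for a symmetric one.

```python
import pandas as pd

# Hypothetical toy columns, one per distribution shape
toy = pd.DataFrame({
    "left_skewed": [1, 7, 8, 8, 9, 9, 9, 10],   # long tail towards low values
    "right_skewed": [1, 2, 2, 2, 3, 3, 4, 10],  # long tail towards high values
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Sample skewness of each column
skewness = toy.skew()
print(skewness)
```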


#### How Many Departments Does Our Organisation Have?

In [None]:
data_by_dept = data[['department','left']].groupby('department', sort=True).count()
data_by_dept

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))

# data_by_dept['left'] holds the number of employees in each department (a 1-D series)
axes[0].pie(data_by_dept['left'],
        labels=data_by_dept.index,
        shadow=True,
        colors = ['#fc910d','#fcb13e','#239cd3','#1674b1','#ed6d50'],
        explode=[0.2] + [0.1] * (len(data_by_dept) - 1)) # to separate the "slices"
axes[0].set_title('Distribution of the number of employees across the different departments\n\n')
axes[0].axis("equal")

axes[1].bar(x = data_by_dept.index.to_list(), height = data_by_dept['left'])
fig.tight_layout()


#### How many employees per salary range?

In [None]:
counts_per_salary = data['salary'].value_counts()  # indexed by the labels 'low', 'medium' and 'high'
total = counts_per_salary.sum()

print("There are:\n")
# look each count up by its label, and multiply the fraction by 100 to print a percentage
for label in ['low', 'medium', 'high']:
    count = counts_per_salary[label]
    print("%d employees with a %s salary, which represents %.2f%% of the employees" %(count, label, 100 * count / total))

#### How many employees per salary range and per department?

In [None]:
# aggfunc="count" counts the rows (employees) in each department/salary cell
table = data.pivot_table(values="satisfaction_level", index="department", columns="salary", aggfunc="count")
table

In [None]:
table.plot(kind="bar", figsize=(10,6), title="Employees per department and per salary range")

#### Correlation Analysis

Correlation is a very useful statistical measure that describes the degree of relationship between two variables. It can be of two types:
- positive correlation: the two variables move in the same direction
- negative correlation: the two variables move in opposite directions
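
To make the definition concrete, the sketch below computes the Pearson correlation coefficient (the default method of pandas `corr`) by hand on made-up values. The arrays `hours`, `projects` and `satisfaction` and the helper `pearson` are illustrative assumptions, not part of the dataset or the notebook.

```python
import numpy as np

# Hypothetical toy measurements for five employees
hours = np.array([120.0, 150.0, 180.0, 210.0, 240.0])
projects = np.array([2.0, 3.0, 3.0, 5.0, 6.0])
satisfaction = np.array([0.9, 0.8, 0.6, 0.5, 0.2])

def pearson(x, y):
    """Pearson r: co-variation of x and y divided by the product of their spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pos = pearson(hours, projects)      # both rise together: positive
r_neg = pearson(hours, satisfaction)  # one rises while the other falls: negative
print(r_pos, r_neg)
```

The coefficient always lies between -1 and 1, and values close to 0 indicate little linear relationship.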

In [None]:
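# correlation is only defined for numeric columns, so drop the text columns (department, salary) first
correlation_matrix = data.select_dtypes(include='number').corr()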
correlation_matrix

The correlation matrix shows:

- A negative correlation (-0.39) between **satisfaction_level** and **left**. This suggests that employees with lower satisfaction were more likely to leave the company.
- The highest positive correlation (0.42) is between **number_project** and **average_montly_hours**. This means that employees working on more projects tended to work more hours.
- **last_evaluation** is moderately correlated with **number_project** (0.35) and **average_montly_hours** (0.34). This means that employees who worked more hours and on more projects tended to receive a higher evaluation.
- A weak negative correlation (-0.16) between **salary_enc** and **left**. This suggests that lower-paid employees were somewhat more likely to leave, although correlation alone does not tell us why.

In [None]:
sns.set(style='white')

mask = np.zeros_like(correlation_matrix, dtype=bool)  # np.bool was removed from NumPy; use the built-in bool
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(13,8))

cmap = sns.diverging_palette(10,220, as_cmap=True)

ax = sns.heatmap(correlation_matrix, mask=mask, 
                 cmap=cmap, vmax= .5, annot=True, annot_kws= {'size':11}, 
                 square=True, xticklabels=True, yticklabels=True, linewidths=.5, 
                 cbar_kws={'shrink': .5}, ax=ax)

ax.set_title('Correlation between variables', fontsize=20);

#### Hypothesis 1: Employees leave because of their salary

In [None]:
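# catplot replaces factorplot, which is no longer available in recent seaborn releases
# 'order' fixes the bar order so that the tick labels match the bars
j = sns.catplot(x='salary', y='left', kind='bar', order=['low', 'medium', 'high'], data=data)
plt.title('Employees that left by salary level', fontsize=14)
j.set_xticklabels(['Low', 'Medium', 'High']);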

In [None]:
# Employees who left by salary range and department
table_leave = data[data['left']==1].pivot_table(values="satisfaction_level", index="department", columns="salary", aggfunc="count")
table_leave

In [None]:
table_leave.plot(kind="bar", figsize=(10,6), title="Employees who left per department and per salary range")

In [None]:
# Employees who stayed by salary range and department
table_stay = data[data['left']==0].pivot_table(values="satisfaction_level", index="department", columns="salary", aggfunc="count")
table_stay

In [None]:
table_stay.plot(kind="bar", figsize=(10,6), title="Employees who stayed per department and per salary range")

The analysis shows that:
- the majority of the employees who left had a low salary
- the sales department had the highest number of employees leaving. However, this is still not enough to draw a conclusion, because many employees with a low salary also stayed in the company
- the technical and support departments also have many employees in the low salary range, which might be another reason why their employees are leaving
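
Raw counts are dominated by the largest groups, so one way to compare salary ranges fairly is to look at the proportion of each group that left. The sketch below does this on a made-up frame (`toy` is illustrative; the numbers are not from the dataset): because `left` is 0 or 1, its mean within a group is the leave rate of that group.

```python
import pandas as pd

# Hypothetical toy data: 4 low, 4 medium and 2 high salary employees
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "low",
               "medium", "medium", "medium", "medium",
               "high", "high"],
    "left":   [1, 1, 0, 0,
               1, 0, 0, 0,
               0, 0],
})

# Mean of the 0/1 column per group = fraction of that group who left
leave_rate = toy.groupby("salary")["left"].mean()
print(leave_rate)
```

Applied to the real data, the same grouping could be extended to `["department", "salary"]` to compare departments on an equal footing.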

#### Hypothesis 2: Employees who are not satisfied tend to leave the company

In [None]:
plt.figure(figsize=(10,6))

bins = np.linspace(0.006,1.000, 15)
    
n, b, patches = plt.hist(data[(data['left']==1) ]['satisfaction_level'], bins=bins, alpha=1, label='Employees Left', color = "skyblue")
patches[8].set_fc('r') # highlight the bin that contains the average satisfaction level
n, b, patches = plt.hist(data[(data['left']==0) ]['satisfaction_level'], bins=bins, alpha = 0.5, label = 'Employees Stayed', color = "green")
patches[8].set_fc('r')  # highlight the bin that contains the average satisfaction level
plt.title('Employees Satisfaction', fontsize=14)
plt.xlabel('satisfaction_level')
plt.xlim((0,1.05))
plt.legend(loc='best');

There are 3 interesting peaks in the satisfaction levels of the employees who left the company:

- A peak of employees who are totally disappointed.
- Another peak at around 0.4, representing a group with a satisfaction level below the average.
- A third group in the range 0.7 to 0.9, made up of employees who left despite their high satisfaction (probably because they wanted to progress in their career in another job).

#### When did the employees start to feel unsatisfied?

Are the employees working too much?

In [None]:
sns.set()

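# catplot(kind='point') replaces the old factorplot, and 'height' replaces its 'size' argument
g = sns.catplot(x="number_project", y="satisfaction_level", col="time_spend_company", col_wrap=4, kind='point', height=3, color='blue', sharex=False, data=data)
g.set_xlabels('Number of Projects');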

The results show a clear drop in satisfaction for employees working on 6 or more projects, which suggests that they are overworked.

#### Why are the most valuable employees leaving the company? To be continued in the tutorial!

### Interactive Discussion: Strengths and Weaknesses

Please go to http://www.wooclap.com/IAB303 and add some **Strengths** and **Weaknesses** that you think are related to this business concern.