<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2023_sem1)</div>

# IFN619 :: B2-StructuredAnalytics

### Business Concern

Analysis of Human Resources (HR) data can help provide insights on issues associated with employees and employment.

In this fictitious scenario, an HR department has access to employee data from the company database that includes the following fields:

| Field                 | Description                                 |
|-----------------------|---------------------------------------------|
| satisfaction_level    | Satisfaction Level                          |
| last_evaluation       | Last evaluation                             |
| number_project        | Number of projects                          |
| average_monthly_hours | Average monthly hours                       |
| time_spend_company    | Time spent at the company                   |
| Work_accident         | Whether they have had a work accident       |
| left                  | Whether the employee has left               |
| promotion_last_5years | Whether promoted in the last 5 years        |
| department            | Department                                  |
| salary                | Salary                                      |

### Question

Which employees are leaving the company, and what are the likely reasons?

### Data

The data can be loaded from the file `b2-structured_analytics-data.csv`. 

Before loading the data, take a look at the CSV to check column names and whether there is any index. 

After loading the data, check how many records we have and get the column headings as a list to check against the table above.

In [None]:
# Load necessary libraries
# Use pandas for dataframe representation of structured data
import ??? as ???
# Use the plotly library to visualise the distribution as a histogram
import plotly.express as px

In [None]:
# Locate the data and set appropriate variables
file_path = ???
file_name = ???

# Load the data into a dataframe assigned to an appropriately named variable
hr_df = pd.read_csv(f"{file_path}{file_name}",index_col=???)

# View the dataframe
???

If we want to report on the number of records, we can place a variable inside a string (using an f-string) which can be printed.

In [None]:
# Check the number of records
# Note: .size counts cells (rows x columns), whereas len() counts rows (records)
len(hr_df)

In [None]:
# Report on important information
report_records = f"There are {???} records in the database"
print(report_records)

In [None]:
# An alternative to printing
print("Number of database records: ",???)
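
As a minimal sketch of the different ways a dataframe reports its dimensions, the example below uses a small made-up dataframe (`toy_df` and its values are hypothetical, not the HR data). It shows that `len()` and `shape` count records, while `size` counts individual cells.

```python
import pandas as pd

# Hypothetical toy data (not the HR dataset): 3 records, 2 columns
toy_df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "left": ["yes", "no", "yes"],
})

n_records = len(toy_df)         # number of rows (records)
n_rows, n_cols = toy_df.shape   # tuple of (rows, columns)
n_cells = toy_df.size           # rows multiplied by columns

# Place the variable inside an f-string to report it
print(f"There are {n_records} records in the toy data")
print("Rows:", n_rows, "Columns:", n_cols, "Cells:", n_cells)
```

When reporting a number of records, `len()` or the first element of `shape` is usually what we want.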

The dataframe can return its column headings through the `columns` property. However, the value returned is a pandas `Index` object. This object can be changed into a Python list using the `list()` function.

In [None]:
# Get columns property
hr_df.columns

In [None]:
#Capture the column headings as a list
hr_headings = ???

hr_headings
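
The conversion described above can be sketched on a small made-up dataframe (`toy_df` is hypothetical and only stands in for the HR data):

```python
import pandas as pd

# Hypothetical toy data with two columns
toy_df = pd.DataFrame({"satisfaction_level": [0.38], "left": ["yes"]})

cols = toy_df.columns        # a pandas Index object
headings = list(cols)        # a plain Python list of column names

print(type(cols).__name__)
print(headings)
```

A plain list is convenient when we want to compare headings with another list or use set operations on them.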

### Analysis

When analysing structured data, we often want to use statistics to *describe* the data. Before doing this, we need to understand the types of **variables** in the data. 

In structured data, each column of data is considered to be a **variable**. For example, `satisfaction_level` is one variable, and `Work_accident` is another. For each record in the data, these variables take on a potentially different value, which is why they are referred to as variables!

There are 2 main types of variable:
1. **Quantitative** (or Numeric) - values are quantities represented by numbers
2. **Categorical** - values are categories that may or may not be represented by numbers

#### Which columns are which variable types in the HR data?

Reading the descriptions of fields in the table above gives us an indication of which variables might be quantitative and which might be categorical.

We can also get an indication using the `describe()` function. 

Take note of which variables appear in the output. Why is this?


In [None]:
# Note which columns can be described
hr_df.???

In [None]:
# Capture the columns in a variable
quant_vars = list(hr_df.describe().columns)
quant_vars

In [None]:
# The other columns could be categorical
cat_vars = set(???) - set(???)
cat_vars
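
The following is a minimal sketch of the idea above, using a made-up dataframe (`toy_df` and its values are hypothetical). By default, `describe()` summarises only the numeric columns of a mixed dataframe, so a set difference against all of the column names leaves the non-numeric ones.

```python
import pandas as pd

# Hypothetical toy data mixing numeric and text columns
toy_df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "number_project": [2, 5, 7, 5],
    "department": ["sales", "accounting", "hr", "sales"],
    "salary": ["low", "medium", "medium", "high"],
})

# Columns that describe() summarises by default (numeric only)
quant = list(toy_df.describe().columns)

# Everything else is a candidate categorical variable
cat = set(toy_df.columns) - set(quant)

print("Quantitative:", quant)
print("Categorical:", sorted(cat))  # sets are unordered, so sort for display
```

This is only an indication. A category that is stored as a number (for example, a flag coded 0 or 1) would still show up as numeric, so the field descriptions need to be checked as well.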

The `describe()` function is a quick and easy way to obtain descriptive statistics on data that is quantitative.

Take a look at just the `satisfaction_level` and note how the statistics describe the data.

How would this be reported appropriately to humans?

In [None]:
# Describe a single column
hr_df[???].describe()

In [None]:
# Get a particular value
sat_desc = hr_df[???].describe()
print("mean:",sat_desc[???])
print("max:",sat_desc[???])

In [None]:
# We can round the mean
round(sat_desc[???],???)

In [None]:
# More human approach by using a format string
hr_sat = hr_df.satisfaction_level
print(f"The average employee satisfaction is {hr_sat.mean():.0%} (over {hr_sat.count():,} past and present employees).")
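
The format codes after the colon inside an f-string control how a number is displayed. The sketch below uses made-up values (both numbers are hypothetical, not results from the HR data) to show a few common codes.

```python
# Hypothetical values for illustration only
mean_satisfaction = 0.4567
n_employees = 12345

as_percent = f"{mean_satisfaction:.0%}"    # multiply by 100, no decimals
one_decimal = f"{mean_satisfaction:.1%}"   # multiply by 100, one decimal
with_commas = f"{n_employees:,}"           # thousands separator
two_places = f"{mean_satisfaction:.2f}"    # fixed point, two decimals

print(as_percent, one_decimal, with_commas, two_places)
# prints: 46% 45.7% 12,345 0.46
```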

We can obtain more meaningful information by grouping on the relevant categorical variable.

In [None]:
# Grouping according to current and past employees
left_sat = hr_df.groupby([???])[???].describe()
left_sat

We can also obtain particular rows of data using the `loc[]` property.

In [None]:
left_sat.loc[???] # locate row with index 'no'

In [None]:
left_sat.loc[???] # locate row with index 'yes'
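
As a rough sketch of grouping and then locating a row, the example below uses a made-up dataframe (`toy_df` and its values are hypothetical). The grouped `describe()` output has one row per category, and `loc[]` selects a row by its index label.

```python
import pandas as pd

# Hypothetical toy data: three current employees and two past employees
toy_df = pd.DataFrame({
    "left": ["no", "no", "no", "yes", "yes"],
    "satisfaction_level": [0.60, 0.70, 0.80, 0.10, 0.30],
})

# One row of descriptive statistics per value of 'left'
by_left = toy_df.groupby("left")["satisfaction_level"].describe()

stayed = by_left.loc["no"]     # statistics for the row labelled 'no'
departed = by_left.loc["yes"]  # statistics for the row labelled 'yes'

# describe() stores the count as a float, so format it with no decimals
print(f"{stayed['count']:,.0f} current employees, mean satisfaction {stayed['mean']:.0%}")
print(f"{departed['count']:,.0f} past employees, mean satisfaction {departed['mean']:.0%}")
```

Because every value in a `describe()` row is a float, a format such as `:,.0f` avoids a trailing `.0` when reporting counts.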

In [None]:
# More human, more nuance
hr_sat_past = left_sat.loc[???] # have left
hr_sat_current = left_sat.loc[???] # have not left

# Note formatting of numbers within the fstring
print(f"The average employee satisfaction recorded for the company's {hr_sat_current[???]:,} current employees is {hr_sat_current[???]:.0%}.")
print(f"The average employee satisfaction recorded for the {hr_sat_past[???]???}  employees who have left the company is {hr_sat_past[???]???}.")

### Visualise

Sometimes it is helpful to visualise the data to help understand its distribution (shape).

**Histograms** are a good way of visualising the distribution of quantitative data. We can create a histogram using the `histogram()` function in the `plotly.express` library.

In [None]:
# Create a histogram of satisfaction_level
fig = px.histogram(hr_df[???])
fig.show()

The histogram allows us to see a significant group of very low satisfaction records. By hovering over the bar, we can see that the range is less than 11% satisfaction.

This would suggest that we should drill down and find out more.

In [None]:
# What do the descriptive statistics look like for this very low group (<15%)
lowsat_df = hr_df[hr_df['satisfaction_level']<???]
lowsat_df.describe()

In [None]:
# What about those that are very satisfied? (> 75%)
highsat_df = hr_df[hr_df['satisfaction_level']>???]
highsat_df.describe()
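
Filtering like this works by building a column of True/False values (a boolean mask) and keeping only the rows where the mask is True. The sketch below uses made-up values (`toy_df` is hypothetical), with the 15% and 75% thresholds mentioned in the comments above.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"satisfaction_level": [0.09, 0.45, 0.92, 0.10, 0.81]})

# The comparison produces one True/False value per row
low_mask = toy_df["satisfaction_level"] < 0.15
print(low_mask.tolist())

# Indexing with the mask keeps only the rows marked True
low_df = toy_df[low_mask]
high_df = toy_df[toy_df["satisfaction_level"] > 0.75]

print(len(low_df), "low satisfaction rows;", len(high_df), "high satisfaction rows")
```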

In [None]:
# average number of projects
hprojm = highsat_df.describe()[???].loc[???]
lprojm = lowsat_df.describe()[???].loc[???]
projd = ??? - ???
projmore  = ???/???

# average hours worked
hhourm = highsat_df.describe()[???].loc[???]
lhourm = lowsat_df.describe()[???].loc[???]
hourd = ??? - ???
hourmore = hourd/hhourm

print(f"Employees with low satisfaction were assigned to {lprojm:.2f} projects on average, compared to high satisfaction employees with {hprojm:.2f} on average. This represents an additional project load of {projd:.2f} projects for these employees with low satisfaction, or {projmore:.1%} more than those with high satisfaction.")
print(f"It is not surprising then that the low satisfaction group worked an average {lhourm:.0f} hours per month, which is {hourd:.0f} ({hourmore:.1%}) more than the high satisfaction group with an average of {hhourm:.0f} hours per month.")
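
The "how much more" figures above are relative differences: the difference between two means divided by the mean used as the baseline. A minimal worked example with made-up numbers (both means are hypothetical, not results from the HR data):

```python
# Hypothetical group means for illustration only
low_group_mean = 6.0    # e.g. average projects for one group
high_group_mean = 4.0   # e.g. average projects for the baseline group

difference = low_group_mean - high_group_mean   # absolute difference
relative = difference / high_group_mean         # difference as a share of the baseline

print(f"Difference: {difference:.2f}, which is {relative:.1%} more than the baseline")
# prints: Difference: 2.00, which is 50.0% more than the baseline
```

The choice of baseline matters: dividing by the other group's mean would give a different percentage, so it helps to state which group the comparison is relative to.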




#### Using visualisation to help explore

It looks like we have found a possible answer to the **why** part of our question. Very low satisfaction appears to be related to higher work load.

We need to continue to explore by comparing to get a better understanding of **who**.

One approach is to use visualisations to help compare the data. Using our histogram above, we can add a box plot that summarises the distribution.

After plotting the distribution with the box plot, hover over the box plot to see what it means. See this [Chartio data tutorial](https://chartio.com/learn/charts/box-plot-complete-guide/) for a deeper dive.

![img](https://chartio.com/assets/26dba4/tutorials/charts/box-plots/046df50d3e23296f1dda99a385bd54925317c413ffff2a63779ffef0a42b9434/box-plot-construction.png)

In [None]:
# Add a box plot to the histogram
fig = px.histogram(hr_df['satisfaction_level'],nbins=10, marginal=???,text_auto=True)
fig.show()
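
To get a feel for what the box plot summarises, the sketch below computes the same kind of summary values with pandas on made-up numbers (`values` is hypothetical). It assumes the common convention of whiskers reaching at most 1.5 times the interquartile range beyond the box. Plotly may calculate quartiles with a slightly different method, so hover values on a chart may not match these exactly.

```python
import pandas as pd

# Hypothetical satisfaction values for illustration
values = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

q1 = values.quantile(0.25)      # lower edge of the box
median = values.quantile(0.50)  # line inside the box
q3 = values.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                   # interquartile range (height of the box)

# Points beyond these limits would typically be drawn as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"Outlier limits: below {lower_limit:.2f} or above {upper_limit:.2f}")
```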

However, this visualisation doesn't tell the story reflected in our more nuanced stats above, and it still doesn't help address the **who** part of the question.

To tell a more nuanced story, we can split the dataframe into different subgroups by filtering just the rows that we want.

To answer our question about which employees leave the company and why, we're pursuing a hypothesis around satisfaction. But to check if we're on the right track we need to see if there is a difference between *current employees* `left=no` and *past employees* `left=yes`.

In [None]:
# filter the dataframe to those rows containing a string
current_df = hr_df[hr_df['left'].str.contains(???)]
current_df

In [None]:
# describe the data (check by comparing with our previous results)
current_df[???].describe()

In [None]:
# visualise the distribution of satisfaction for current employees
fig = px.histogram(current_df['satisfaction_level'],nbins=10, marginal="box",text_auto=True)
fig.show()

In [None]:
# filter the dataframe to those rows containing 'yes' indicating they have left
past_df = hr_df[hr_df['left'].str.contains(???)]
past_df

In [None]:
# describe the data (check by comparing with our previous results)
past_df['satisfaction_level'].???

In [None]:
# visualise the distribution of satisfaction for past employees
fig = px.histogram(past_df['satisfaction_level'],nbins=10, marginal="box",text_auto=True)
fig.show()

These charts show a much more nuanced view of the data for each group. However, it would be helpful to see them together in order to compare.

Visual comparison makes the big difference in satisfaction level between the 2 groups more obvious, and the chart could be accompanied by the text that we printed out above.

This is an example of *making the narrative clearer*. We are telling a story from the data about **why** employees might be leaving, and both the text and the visualisation help tell the story.

In [None]:
# view the box plots for satisfaction level, but use different colours for the 'left' values
fig = px.box(hr_df,x='satisfaction_level',color='left')
fig.show()
print(f"Employees with low satisfaction were assigned to {lprojm:.2f} projects on average, compared to high satisfaction employees with {hprojm:.2f} on average. This represents an additional project load of {projd:.2f} projects for these employees with low satisfaction, or {projmore:.1%} more than those with high satisfaction.")
print(f"It is not surprising then that the low satisfaction group worked an average {lhourm:.0f} hours per month, which is {hourd:.0f} ({hourmore:.1%}) more than the high satisfaction group with an average of {hhourm:.0f} hours per month.")

**Note:** The chart above is fine for exploration, but it would need improvements before being used in a report. For example, the `satisfaction_level` format in the chart does not match the format in the text.

However, continuing our exploration, we can use this same approach to check if `satisfaction_level` varies according to other variables, like `department`.

In [None]:
# How does satisfaction level look for each department?
px.box(past_df,x='satisfaction_level',color=???)

We can filter the past employees data to include only certain departments. 

**TIP:** To see what the filtering is doing, try running `past_df['department'].str.contains('accounting')` in a cell by itself.

In [None]:
# Selected departments of past employees
deps_df = past_df[past_df['department'].str.contains('accounting') | # The vertical line is a logical OR operator
                past_df['department'].str.contains('product_mng')]

px.histogram(deps_df, x='satisfaction_level',nbins=10, marginal="box",text_auto=True,color='department')
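
The sketch below shows what the two `str.contains()` masks and the OR operator produce, using a made-up column (`toy_df` is hypothetical). It also shows `isin()`, an alternative that matches whole values rather than substrings and gives the same result for this example.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame(
    {"department": ["accounting", "sales", "product_mng", "hr", "accounting"]}
)

# Each str.contains() call returns a True/False value per row
is_accounting = toy_df["department"].str.contains("accounting")
is_product = toy_df["department"].str.contains("product_mng")

# The | operator is True where either mask is True
mask_or = is_accounting | is_product

# Alternative: exact matching against a list of values
mask_isin = toy_df["department"].isin(["accounting", "product_mng"])

print(mask_or.tolist())
print(mask_isin.tolist())
```

Note that `str.contains()` matches substrings, so it would also match a longer name that happens to contain the text, whereas `isin()` only matches the exact values listed.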

You'll notice that the box plots reflect something that is difficult to see in the histogram: proportionately, accounting employees are less satisfied than product management employees.

When comparing groups of different sizes, *normalising* can help tell a better story. (e.g. percent of total employees in each group)

In [None]:
# normalise the histogram by percent
px.histogram(deps_df, x='satisfaction_level',nbins=10, marginal="box",text_auto=True,color='department', histnorm=???)
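
To illustrate why normalising helps when groups differ in size, the sketch below compares raw counts with proportions on made-up data (`toy_df` and the `low_satisfaction` flag are hypothetical).

```python
import pandas as pd

# Hypothetical toy data: 4 accounting rows and 2 product_mng rows
toy_df = pd.DataFrame({
    "department": ["accounting"] * 4 + ["product_mng"] * 2,
    "low_satisfaction": [True, True, False, False, True, False],
})

grouped = toy_df.groupby("department")["low_satisfaction"]

counts = grouped.sum()    # raw number of low satisfaction rows per department
shares = grouped.mean()   # proportion of each department's rows that are low

print(counts.to_dict())
print({dept: f"{share:.0%}" for dept, share in shares.items()})
```

In this toy example the raw counts differ (2 versus 1) only because the groups differ in size, while the proportions are the same (50% each).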

Also, using a different kind of visualisation designed for comparisons can help.

In [None]:
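# deps_df only holds past employees, so filter hr_df instead to compare both values of 'left'
both_df = hr_df[hr_df['department'].isin(['accounting', 'product_mng'])]
fig = px.density_heatmap(both_df, y='satisfaction_level',x='average_monthly_hours',facet_row='department',facet_col='left',histnorm='percent')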

# Uncomment the following line to see how to change the dimensions of the image
#fig.update_layout(autosize=False,width = 600,height = 800,margin=dict(l=10,r=10,b=20,t=20,pad=4),paper_bgcolor="LightSteelBlue")

fig.show()

This visualisation makes it obvious that the difference between the departments is minimal for current employees. For past employees, however, accounting shows a greater proportion of high hours and low satisfaction than product management, where the greater proportion of people who left had moderate satisfaction and relatively low hours.

#### Correlation Analysis

Another kind of analysis used for comparing quantitative data is *correlation analysis*.
Correlation analysis is a very useful statistical analysis that describes the degree of relationship between two variables. It can describe two main types of relationship between variables:
- positive correlation: two variables move in the same direction (correlated)
- negative correlation: two variables move in opposite directions (anti-correlated, or inverse correlation)

Note that the strength of the relationship between any 2 variables is given by how close the correlation value is to either +1.0 (correlation) or -1.0 (inverse correlation). The closer the value is to 0, the weaker the relationship between the variables.

When observing more than 2 variables, the results are usually presented in matrix form.

In [None]:
# Create a correlation matrix from variables: 'satisfaction_level','number_project','average_monthly_hours','time_spend_company'
hr_cor_matrix = hr_df[['satisfaction_level','number_project','average_monthly_hours','time_spend_company']].???
hr_cor_matrix
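
As a small illustration of positive and negative correlation, the sketch below uses made-up data (`toy_df`, its column names and values are hypothetical) and the pandas `corr()` method, which is one common way to produce a correlation matrix.

```python
import pandas as pd

# Hypothetical toy data with exact linear relationships
toy_df = pd.DataFrame({
    "hours": [100, 150, 200, 250],
    "projects": [2, 3, 4, 5],               # rises as hours rise
    "satisfaction": [0.9, 0.7, 0.5, 0.3],   # falls as hours rise
})

# Pairwise correlations between every pair of columns
toy_cor = toy_df.corr()

print(toy_cor.round(2))
print("hours vs projects:", round(toy_cor.loc["hours", "projects"], 2))
print("hours vs satisfaction:", round(toy_cor.loc["hours", "satisfaction"], 2))
```

Each variable correlates perfectly with itself, so the diagonal of the matrix is always 1. Real data rarely shows values this close to +1 or -1.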

Matrix style data (particularly correlations) can be visualised using *Heatmap* visualisations.

In [None]:
# Create a heatmap using plotly image show by assigning colours to values in the dataframe
px.imshow(???,color_continuous_scale='Viridis', text_auto=True)

Compare this whole-of-company view with just the past employees. What does it show?

In [None]:
# Create a correlation matrix for the past employees
past_cor_matrix = ???[['satisfaction_level','number_project','average_monthly_hours','time_spend_company']].???

px.imshow(past_cor_matrix,color_continuous_scale='Viridis', text_auto=True)

And the same for the current employees:

In [None]:
current_cor_mat = ???[['satisfaction_level','number_project','average_monthly_hours','time_spend_company']].???

px.imshow(current_cor_mat,color_continuous_scale='Viridis', text_auto=True)

### Insights

We have been able to identify that low satisfaction appears to be a key factor with people who have left the company, and that accounting employees had the lowest satisfaction levels. We also identified relationships between low satisfaction and high numbers of projects and longer hours worked. This suggests that those who have left dissatisfied were overworked.

However, from the correlation analysis of past and current employees, it seems that the time spent at the company is more highly correlated with satisfaction levels for people who have left the company. This suggests that out of those who left, the most satisfied were also those that had been at the company the longest. Perhaps they were more persistent or more tolerant of the longer work hours. More analysis would be required to see if this sub-group of longer serving past employees actually had fewer work hours than more recent past employees.

The overall picture that has emerged (so far) suggests that employees with low satisfaction who have been recently employed by the company are not staying, that the likely reason for this is additional projects and longer hours, and that this seems to be more prevalent in the accounting department.