[credit: The Data Analysis Workshop](https://smile.amazon.com/Data-Analysis-Workshop-state-art/dp/1839211385/ref=sr_1_1_sspa?crid=PANVOH9YPUZP&dchild=1&keywords=data+analysis+workshop&qid=1612145266&sprefix=data+analysis+work%2Caps%2C212&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzUFozUTJYUDJOQjgzJmVuY3J5cHRlZElkPUEwNTQyMDM2MkRBQ1U2NlgwM1hJSSZlbmNyeXB0ZWRBZElkPUEwOTA0Mjg3TllJTTNLUTA4R05OJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ==)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = pd.read_csv('../input/absenteeism-at-work-an-uci-dataset/Absenteeism_at_work.csv')

Print the dimensionality of the data, the column types, and the number of missing values per column:

In [None]:
print(f"Data dimension: {data.shape}")
for col in data.columns:
    # adjacent f-strings are concatenated, so no stray indentation ends up in the output
    print(f"Column: {col:35} | type: {str(data[col].dtype):7} "
          f"| missing values: {data[col].isna().sum():3d}")
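
The same overview can also be collected into a single table instead of printed line by line. The following is a minimal sketch on a small made-up frame: `toy` and its values are hypothetical stand-ins for `data`, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [33.0, np.nan, 41.0],
    "Education": [1, 1, 3],
})

# One row per column: its dtype and its number of missing values
summary = pd.DataFrame({
    "type": toy.dtypes.astype(str),
    "missing values": toy.isna().sum(),
})

print(f"Data dimension: {toy.shape}")
print(summary)
```

Because `summary` is itself a DataFrame, it can be sorted or filtered, for example to list only the columns that have missing values.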

In [None]:
data.describe().T

# Data Preprocessing

*Month of absence, Day of the week, Seasons, Education, Disciplinary failure, Social drinker, and Social smoker* are **categorical variables encoded as numbers**.  
We can therefore back-transform the numerical values to their original categories, which gives more readable labels when plotting.

In [None]:
# define encoding dictionaries
month_encoding = {1: "January", 2: "February", 3: "March",
    4: "April", 5: "May", 6: "June", 7: "July",
    8: "August", 9: "September", 10: "October",
    11: "November", 12: "December", 0: "Unknown"}
dow_encoding = {2: "Monday", 3: "Tuesday", 4: "Wednesday",
    5: "Thursday", 6: "Friday"}
season_encoding = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}
education_encoding = {1: "high_school", 2: "graduate",
    3: "postgraduate", 4: "master_phd"}
yes_no_encoding = {0: "No", 1: "Yes"}

In [None]:
# backtransform numerical variables to categorical
preprocessed_data = data.copy()
preprocessed_data["Month of absence"] = preprocessed_data["Month of absence"]\
    .apply(lambda x: month_encoding[x])
preprocessed_data["Day of the week"] = preprocessed_data["Day of the week"]\
    .apply(lambda x: dow_encoding[x])
preprocessed_data["Seasons"] = preprocessed_data["Seasons"]\
    .apply(lambda x: season_encoding[x])
preprocessed_data["Education"] = preprocessed_data["Education"]\
    .apply(lambda x: education_encoding[x])
preprocessed_data["Disciplinary failure"] = preprocessed_data["Disciplinary failure"]\
    .apply(lambda x: yes_no_encoding[x])
preprocessed_data["Social drinker"] = preprocessed_data["Social drinker"]\
    .apply(lambda x: yes_no_encoding[x])
preprocessed_data["Social smoker"] = preprocessed_data["Social smoker"]\
    .apply(lambda x: yes_no_encoding[x])
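
The repeated `.apply(lambda ...)` calls can also be written as a loop over a column-to-dictionary mapping using `Series.map`. This is only a sketch on a hypothetical two-column frame (`toy`, `encodings`, and `decoded` are names introduced here for illustration). Note one behavioral difference: `map` turns values without a matching key into `NaN`, whereas the dictionary lookup in the lambda raises a `KeyError`.

```python
import pandas as pd

# Hypothetical toy frame; the notebook itself works on `preprocessed_data`
toy = pd.DataFrame({
    "Seasons": [1, 2, 4],
    "Social drinker": [1, 0, 1],
})

season_encoding = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}
yes_no_encoding = {0: "No", 1: "Yes"}

# Column name -> encoding dictionary to apply to that column
encodings = {
    "Seasons": season_encoding,
    "Social drinker": yes_no_encoding,
}

decoded = toy.copy()
for column, encoding in encodings.items():
    # Series.map looks up each value in the dictionary;
    # values without a key become NaN instead of raising KeyError
    decoded[column] = decoded[column].map(encoding)

print(decoded)
```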

In [None]:
preprocessed_data.head().T

# Identifying Reasons for Absence  
We will create a new variable, called Disease, which indicates whether a given reason for absence is listed in the ICD table or not.

Define a function that returns Yes if the provided encoded value is contained in the ICD (values 1 to 21) and No otherwise:

In [None]:
def in_icd(x):
    # ICD-listed reasons are encoded as the integers 1 to 21 (inclusive)
    return 'Yes' if x in range(1, 21 + 1) else 'No'

Combine the .apply() method with the previously defined in_icd() function in order to create the new Disease column in the preprocessed dataset:

In [None]:
preprocessed_data['Disease'] = preprocessed_data['Reason for absence'].apply(in_icd)
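
The same column can be built without a row-by-row function call. As a sketch, assuming the reason codes are plain integers, `Series.between` (inclusive on both ends by default) combined with `numpy.where` gives the same Yes/No labels. The `reasons` values below are hypothetical and chosen to cover the boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical reason codes, including the boundaries 0, 1, 21 and 22
reasons = pd.Series([0, 1, 13, 21, 22, 28], name="Reason for absence")

# between(1, 21) is True for codes in the ICD range, False otherwise
disease = pd.Series(
    np.where(reasons.between(1, 21), "Yes", "No"),
    index=reasons.index,
    name="Disease",
)

print(disease.tolist())
```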

Use a bar plot to compare the number of absences with and without a disease-related (ICD-listed) reason:

In [None]:
plt.figure(figsize=(10, 8))
sns.countplot(data=preprocessed_data, x='Disease')

As we can see, absences whose reason is not listed in the ICD table are almost twice as frequent as those with a listed reason.
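
To put a number on a comparison like this, the counts behind the bar plot can be read with `value_counts()`. The labels below are made up for illustration. In the notebook, the input would be `preprocessed_data['Disease']`.

```python
import pandas as pd

# Hypothetical Disease labels, not taken from the dataset
disease = pd.Series(["No", "No", "Yes", "No", "Yes", "No"], name="Disease")

# Number of entries per label, then the ratio of non-ICD to ICD entries
counts = disease.value_counts()
ratio = counts["No"] / counts["Yes"]

print(counts)
print(f"No/Yes ratio: {ratio:.2f}")
```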

# Analysis of the Reason for Absence

The first thing we are interested in is the overall distribution of the absence reasons in the data.  
We also use the Disease column as the hue parameter, which helps us distinguish between disease-related reasons (listed in the ICD encoding) and those that are not.

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data=preprocessed_data, x="Reason for absence", hue='Disease')
ax.set_ylabel("Number of entries per reason of absence")