# Content:

- <a href='#1.'> 1. Importing Libraries</a>
- <a href='#2.'> 2. Loading and Checking Data</a>
- <a href='#3.'> 3. Variable Description</a>
- <a href='#4.'> 4. Data Analysis</a>
- <a href='#5.'> 5. Preprocessing</a>
    - <a href='#5.1.'> 5.1. Outlier Detection and Missing Values</a>
    - <a href='#5.2.'> 5.2. Dropping Useless Variables</a>
    - <a href='#5.3.'> 5.3. Feature Engineering</a>
        - <a href='#5.3.1.'> 5.3.1. Checking Reason for Absence Feature</a>
        - <a href='#5.3.2.'> 5.3.2. Checking Date Feature</a>
        - <a href='#5.3.3.'> 5.3.3. Checking Other Features</a>
        - <a href='#5.3.4.'> 5.3.4. Creating Target Variable</a>
    - <a href='#5.4.'> 5.4. Scaling and Splitting</a>
- <a href='#6.'> 6. Single Logistic Regression and Evaluation</a>
- <a href='#7.'> 7. References</a>

## <a id='1.'> 1. Importing Libraries</a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## <a id='2.'> 2. Loading and Checking Data</a>

In [None]:
# Loading and checking data
raw_data = pd.read_csv("../input/employee-absenteeism-prediction/Absenteeism-data.csv")
raw_data.head()

In [None]:
# Creating checkpoint
df = raw_data.copy()
df.head()

In [None]:
# Checking the shape of dataset
df.shape

In [None]:
# Checking info
df.info()

In [None]:
# Checking the statistical summary of dataset
df.describe()

## <a id='3.'> 3. Variable Description</a>

* ID: Identification number of the employee (categorical)
* Reason for Absence: Coded reason for the absence (categorical)
* Date: Date of the absence (stored as a string)
* Transportation Expense: Transportation expense of the employee (numerical)
* Distance to Work: Distance from the employee's home to work (numerical)
* Age: Age of the employee (numerical)
* Daily Work Load Average: Average daily work load of the employee (numerical)
* Body Mass Index: Body mass index of the employee (numerical)
* Education: Education level (categorical)
* Children: Number of children (numerical)
* Pets: Number of pets (numerical)
* Absenteeism Time in Hours: Absenteeism time of the employee in hours (numerical)
 
dtypes: float64(1), int64(10), object(1)

## <a id='4.'> 4. Data Analysis</a>

In [None]:
# A quick automated EDA with pandas_profiling (the package was later renamed ydata-profiling).
from pandas_profiling import ProfileReport
df_profile = ProfileReport(df, title='Pandas Profiling Report', html={'style':{'full_width':True}})
df_profile

## <a id='5.'> 5. Preprocessing</a>

## <a id='5.1.'> 5.1. Outlier Detection and Missing Values</a>

In [None]:
# Outlier detection with the 1.5 * IQR rule
from collections import Counter

def detect_outlier(df, features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # outlier step
        outlier_step = IQR * 1.5
        # detect outlier and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_col)
    
    # count how many features flag each row; keep rows flagged in more than two
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]
    
    return multiple_outliers

In [None]:
df.columns.values

In [None]:
from collections import Counter
df.loc[detect_outlier(df,['Transportation Expense','Distance to Work', 'Age', 'Daily Work Load Average',
                              'Body Mass Index', 'Children', 'Pets'])]

No rows are flagged as outliers in more than two of these features.
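
To make the rule concrete, here is a minimal sketch on made-up data. The `toy` frame, its column names, and the helper `iqr_flags` are hypothetical and not part of the dataset. It applies the same 1.5 * IQR fences per column and then counts how many columns flag each row, which mirrors the "more than two features" threshold used by `detect_outlier`.

```python
import pandas as pd

# Hypothetical toy data: row 5 is extreme in all three columns,
# row 0 is extreme only in column "a".
toy = pd.DataFrame({
    "a": [-100, 2, 3, 4, 5, 200],
    "b": [1, 2, 3, 4, 5, 300],
    "c": [5, 4, 3, 2, 1, 400],
})

def iqr_flags(frame):
    """Return a boolean frame: True where a value lies outside the 1.5 * IQR fences."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    step = 1.5 * (q3 - q1)
    return (frame < q1 - step) | (frame > q3 + step)

flags = iqr_flags(toy)
flag_counts = flags.sum(axis=1)  # number of columns that flag each row
multi_outlier_rows = flag_counts[flag_counts > 2].index.tolist()
print(multi_outlier_rows)
```

Row 0 is an outlier in one column only, so it is not reported; a row has to be flagged in at least three columns to be returned.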

In [None]:
# To check the missing values
df.isnull().sum()

There are no missing values in the dataset.

## <a id='5.2.'> 5.2. Dropping Useless Variables</a>

In [None]:
# ID variable is a number that is there to distinguish the individuals from one another, not to carry any numeric information.
# We should drop it. 
df = df.drop(["ID"], axis = 1) # axis=1 for columns, axis=0 for rows.
df.head()

## <a id='5.3.'> 5.3. Feature Engineering</a>

## <a id='5.3.1.'> 5.3.1. Checking Reason for Absence Feature</a>

In [None]:
# Checking Reason for Absence feature.
print(df["Reason for Absence"].max())
print(df["Reason for Absence"].min())
print(df["Reason for Absence"].unique())

In [None]:
len(df["Reason for Absence"].unique())

One code between the minimum and the maximum never appears.

In [None]:
sorted(df["Reason for Absence"].unique())

- The code 20 never appears in "Reason for Absence".
- "Reason for Absence" is a categorical feature. We can create dummy variables for it, but first we must make sure that every employee record has exactly one "Reason for Absence".

In [None]:
# Convert categorical variable into dummy variables.
reason_columns = pd.get_dummies(df["Reason for Absence"])
reason_columns

In [None]:
reason_columns["Check"] = reason_columns.sum(axis=1)
reason_columns

- 0 : missing value, 
- 1: single value, 
- 2,3,4...: Not possible because there can only be one reason for absence.
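
The same check can be sketched on a small made-up series (the `toy_reasons` values are hypothetical). Each row of a dummy matrix built from a single column should sum to exactly 1.

```python
import pandas as pd

# Hypothetical reason codes for five absences.
toy_reasons = pd.Series([0, 7, 23, 7, 28], name="Reason for Absence")

dummies = pd.get_dummies(toy_reasons)
row_sums = dummies.sum(axis=1)

# Every row should have exactly one active dummy column.
all_single = bool((row_sums == 1).all())
print(all_single)
```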

In [None]:
reason_columns["Check"].sum(axis=0)

In [None]:
reason_columns["Check"].unique()

In [None]:
reason_columns = reason_columns.drop(["Check"], axis=1)
reason_columns

In [None]:
# Drop the first dummy column to avoid potential multicollinearity issues
reason_columns = pd.get_dummies(df["Reason for Absence"], drop_first = True)
reason_columns
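
Why dropping one column helps can be seen on a tiny made-up example (the `toy` series is hypothetical). With the full set of dummies, the columns always add up to 1, so any one of them is fully determined by the others; `drop_first=True` removes that redundancy.

```python
import pandas as pd

toy = pd.Series([0, 1, 2, 1])  # hypothetical category codes

full = pd.get_dummies(toy)
reduced = pd.get_dummies(toy, drop_first=True)

# In the full encoding every row sums to 1 (perfectly collinear with an intercept).
full_row_sums = full.sum(axis=1).tolist()

# After dropping the first level, category 0 is encoded as "all zeros".
reduced_columns = list(reduced.columns)
reduced_row_sums = reduced.sum(axis=1).tolist()
print(full_row_sums, reduced_columns, reduced_row_sums)
```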

Group the Reasons for Absence

In [None]:
df.columns.values

In [None]:
reason_columns.columns.values

In [None]:
df = df.drop(["Reason for Absence"], axis=1)

Classifying Reason for Absence:

- 1-14: Disease
- 15-17: Pregnancy
- 18-21: Emergency issues
- 22-28: Light reasons
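
As an alternative sketch (not the approach used in this notebook), the same four groups could be built directly from the codes with `pd.cut`. The `toy_codes` values are made up, and the assumption is that code 0 means "no reason given" and therefore falls outside every group.

```python
import pandas as pd

# Hypothetical reason codes covering the edges of every group.
toy_codes = pd.Series([0, 1, 14, 15, 17, 18, 21, 22, 28])

# Bins are right-inclusive: (0, 14], (14, 17], (17, 21], (21, 28].
group = pd.cut(toy_codes, bins=[0, 14, 17, 21, 28], labels=[1, 2, 3, 4])

# One dummy column per group; code 0 falls in no bin and becomes all zeros.
grouped = pd.get_dummies(group, prefix="Reason").astype(int)
print(grouped)
```

The notebook's own route (dummies per code, then `max` across each label range) gives the same kind of 0/1 columns; the binning version is simply shorter when the groups are contiguous ranges.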

In [None]:
reason_type_1 = reason_columns.loc[:,1:14].max(axis = 1)
reason_type_2 = reason_columns.loc[:,15:17].max(axis = 1)
reason_type_3 = reason_columns.loc[:,18:21].max(axis = 1)
reason_type_4 = reason_columns.loc[:,22:28].max(axis = 1)

In [None]:
# Concatenate column values
df = pd.concat([df, reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis = 1)

In [None]:
df.head()

In [None]:
# Rename columns
df.columns.values

In [None]:
column_names = ['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Reason_1',
       'Reason_2', 'Reason_3', 'Reason_4']

In [None]:
df.columns = column_names

In [None]:
df.head()

In [None]:
# Reorder columns
column_names_reordered = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4',
                          'Date', 'Transportation Expense', 'Distance to Work', 'Age',
                          'Daily Work Load Average', 'Body Mass Index', 'Education',
                          'Children', 'Pets', 'Absenteeism Time in Hours']

In [None]:
df = df[column_names_reordered]

In [None]:
df.head()

In [None]:
# Create a checkpoint
df_reason_mod = df.copy()

## <a id='5.3.2.'> 5.3.2. Checking Date Feature</a>

In [None]:
# Analysis of Date feature
df_reason_mod["Date"]

In [None]:
# Checking type of Date variable
type(df_reason_mod["Date"][0])

In [None]:
# Converting datatype to datetime
df_reason_mod["Date"] = pd.to_datetime(df_reason_mod["Date"], format = "%d/%m/%Y")

In [None]:
type(df_reason_mod["Date"][0])

In [None]:
df_reason_mod.info()

In [None]:
# Extracting month value
df_reason_mod["Date"][0].month

In [None]:
# Creating month variable
list_months = []

for i in range(df_reason_mod.shape[0]):
    list_months.append(df_reason_mod["Date"][i].month)

In [None]:
len(list_months)

In [None]:
df_reason_mod["Month Value"] = list_months

In [None]:
df_reason_mod.head()

In [None]:
# Extract the day of the week
df_reason_mod["Date"][699].weekday()

In [None]:
def date_to_weekday(date_value):
    return date_value.weekday()

In [None]:
# Creating Day of the Week variable
df_reason_mod["Day of the Week"] = df_reason_mod["Date"].apply(date_to_weekday)

In [None]:
df_reason_mod.head()

In [None]:
df_reason_mod = df_reason_mod.drop(["Date"], axis=1)

In [None]:
df_reason_mod.head()
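
The loop and the `apply` call above can also be written with the vectorized `.dt` accessor. This is a minimal sketch on made-up dates (`toy_dates` is hypothetical), using the same `%d/%m/%Y` format as the notebook.

```python
import pandas as pd

toy_dates = pd.Series(["07/07/2015", "14/07/2015", "31/05/2018"])
parsed = pd.to_datetime(toy_dates, format="%d/%m/%Y")

month_value = parsed.dt.month      # 1 = January ... 12 = December
day_of_week = parsed.dt.weekday    # 0 = Monday ... 6 = Sunday
print(month_value.tolist(), day_of_week.tolist())
```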

## <a id='5.3.3.'> 5.3.3. Checking Other Features</a>

In [None]:
# Analysis of the other columns
type(df_reason_mod["Transportation Expense"][0])

In [None]:
type(df_reason_mod["Distance to Work"][0])

In [None]:
type(df_reason_mod["Age"][0])

In [None]:
type(df_reason_mod["Daily Work Load Average"][0])

In [None]:
type(df_reason_mod["Body Mass Index"][0])

In [None]:
df_reason_mod["Education"].unique()

In [None]:
df_reason_mod["Education"].value_counts()

In [None]:
# Converting Education Feature
df_reason_mod["Education"] = df_reason_mod["Education"].map({1:0, 2:1, 3:1, 4:1})

In [None]:
df_reason_mod["Education"].value_counts()

In [None]:
# Final checkpoint
df_preprocessed = df_reason_mod.copy()

In [None]:
# Creating the targets
df_preprocessed["Absenteeism Time in Hours"].median()

- Absenteeism Time in Hours <= 3: moderately absent
- Absenteeism Time in Hours > 3: excessively absent
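
A small sketch of the median cutoff on made-up hours (`toy_hours` is hypothetical): values strictly above the median become 1, everything else 0, which keeps the two classes roughly balanced.

```python
import numpy as np
import pandas as pd

toy_hours = pd.Series([0, 1, 2, 3, 4, 8, 40])  # hypothetical absence hours

cutoff = toy_hours.median()
toy_targets = np.where(toy_hours > cutoff, 1, 0)

# Share of rows labelled as "excessively absent".
share_positive = toy_targets.mean()
print(cutoff, toy_targets, share_positive)
```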

## <a id='5.3.4.'> 5.3.4. Creating Target Variable</a>

In [None]:
targets = np.where(df_preprocessed["Absenteeism Time in Hours"] > 3,1,0)

In [None]:
targets

In [None]:
df_preprocessed["Excessive Absenteeism"] = targets

In [None]:
df_preprocessed.head()

In [None]:
# Checking imbalance
targets.sum()/targets.shape[0]

In [None]:
# Create a checkpoint by dropping the variables we no longer need.
# Besides the original target column, we also drop three features that were
# 'eliminated' after exploring the model weights, because they have very little effect on the target.
df_with_targets = df_preprocessed.drop(['Absenteeism Time in Hours','Day of the Week',
                                        'Daily Work Load Average','Distance to Work'],axis=1)

In [None]:
df_with_targets is df_preprocessed

In [None]:
df_with_targets.head()

In [None]:
# Selecting input for the regression
df_with_targets.shape

## <a id='5.4.'> 5.4. Scaling and Splitting</a>

In [None]:
unscaled_inputs = df_with_targets.iloc[:,:-1]

In [None]:
# import the libraries needed to create the Custom Scaler
# note that all of them are part of the sklearn package
# one of them is the StandardScaler class itself,
# because the Custom Scaler is built on top of it

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# create the Custom Scaler class

class CustomScaler(BaseEstimator,TransformerMixin): 
    
    # init or what information we need to declare a CustomScaler object
    # and what is calculated/declared as we do
    
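    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        
        # store the init arguments so that get_params() and repr() work
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # scaler is nothing but a Standard Scaler object
        # (recent scikit-learn versions accept these arguments by keyword only)
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.mean_ = None
        self.var_ = None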
        
    
    # the fit method, which is again based on StandardScaler
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # column-wise mean and (population) variance of the scaled columns
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self
    
    # the transform method which does the actual scaling

    def transform(self, X, y=None, copy=None):
        
        # record the initial order of the columns
        init_col_order = X.columns
        
        # scale all features that you chose when creating the instance of the class
        # (keep the original index so that the concat below aligns row by row)
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        
        # declare a variable containing all information that was not scaled
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        
        # return a data frame which contains all scaled features and all 'not scaled' features
        # use the original order (that you recorded in the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [None]:
# check what are all columns that we've got
unscaled_inputs.columns.values

In [None]:
# create the columns to scale, based on the columns to omit
# use list comprehension to iterate over the list
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4','Education']
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [None]:
# declare a scaler object, specifying the columns you want to scale
absenteeism_scaler = CustomScaler(columns_to_scale)

In [None]:
# fit the data (calculate mean and standard deviation); they are automatically stored inside the object 
absenteeism_scaler.fit(unscaled_inputs)

In [None]:
# standardize the data using the transform method
# in the previous cell we fitted the scaler, in other words
# we found the internal parameters (mean and standard deviation) used to transform the data
# transforming applies these parameters to our data
# note that when you get new data, you can call absenteeism_scaler.transform again to scale it in the same way
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)
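
To illustrate what "scale only some columns" means, here is a minimal sketch that does not use `CustomScaler`. The `toy` frame is made up. Only the listed columns are standardized (mean 0, standard deviation 1), while the dummy column is left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: one dummy column and two numeric columns.
toy = pd.DataFrame({
    "Reason_1": [1, 0, 0, 1],
    "Age": [30.0, 40.0, 50.0, 60.0],
    "Pets": [0.0, 1.0, 2.0, 1.0],
})
cols_to_scale = ["Age", "Pets"]

scaled = toy.copy()
scaled[cols_to_scale] = StandardScaler().fit_transform(toy[cols_to_scale])

# StandardScaler uses the population standard deviation (ddof=0).
means = scaled[cols_to_scale].mean()
stds = scaled[cols_to_scale].std(ddof=0)
print(scaled)
```

Dummy variables are left out of scaling so that their coefficients keep the simple "category present vs. absent" interpretation.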

In [None]:
# scaled_inputs is still a DataFrame, because CustomScaler.transform rebuilds one with the original column order
scaled_inputs

In [None]:
# check the shape of the inputs
scaled_inputs.shape

In [None]:
# Split the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size = 0.8, random_state = 42)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
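
If the classes were noticeably imbalanced, one option (not used in this notebook) is a stratified split, which keeps the class ratio the same in both parts. A minimal sketch on made-up arrays, where `X_toy` and `y_toy` are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(40).reshape(20, 2)      # 20 hypothetical samples, 2 features
y_toy = np.array([0] * 10 + [1] * 10)     # balanced hypothetical labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42, stratify=y_toy
)

# Both parts keep the 50/50 class ratio of y_toy.
print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```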

## <a id='6.'> 6. Single Logistic Regression and Evaluation</a>

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

In [None]:
reg = LogisticRegression()

In [None]:
reg.fit(x_train, y_train)

In [None]:
reg.score(x_train, y_train)

In [None]:
# Manually check the accuracy
model_outputs = reg.predict(x_train)
np.sum(model_outputs == y_train) / model_outputs.shape[0]

In [None]:
# Finding the intercept and coefficients
reg.intercept_

In [None]:
reg.coef_

In [None]:
unscaled_inputs.columns.values

In [None]:
feature_name = unscaled_inputs.columns.values

In [None]:
summary_table = pd.DataFrame(columns=["feature_name"], data=feature_name)
summary_table["Coefficient"] = np.transpose(reg.coef_)

In [None]:
summary_table

In [None]:
summary_table.index = summary_table.index +1

In [None]:
summary_table

In [None]:
summary_table.loc[0] = ["Intercept", reg.intercept_[0]]

In [None]:
summary_table = summary_table.sort_index()
summary_table

In [None]:
summary_table["Odds_ratio"] = np.exp(summary_table.Coefficient)

In [None]:
summary_table

In [None]:
summary_table.sort_values("Odds_ratio", ascending=False)
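
How to read the odds ratio: increasing a feature by one unit multiplies the odds of the positive class by `exp(coefficient)`. For the standardized features, one unit is one standard deviation. The sketch below checks this identity with made-up numbers; `intercept` and `coefficient` are hypothetical, not values from the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

intercept = -1.0     # hypothetical
coefficient = 0.7    # hypothetical

p_before = sigmoid(intercept)                # feature at 0
p_after = sigmoid(intercept + coefficient)   # feature increased by one unit

odds_before = p_before / (1.0 - p_before)
odds_after = p_after / (1.0 - p_after)

ratio = odds_after / odds_before
print(ratio, np.exp(coefficient))
```

An odds ratio close to 1 (a coefficient close to 0) therefore means the feature barely changes the odds, which is the reasoning behind dropping low-weight features earlier.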

In [None]:
# Testing the model
reg.score(x_test, y_test)

In [None]:
predicted_proba = reg.predict_proba(x_test)
predicted_proba

In [None]:
predicted_proba[:,1]
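
The second column of `predict_proba` is the estimated probability of class 1. As a sketch of how such probabilities turn into predictions and a confusion matrix, here is an example with made-up values (`toy_proba` and `toy_true` are hypothetical). It uses a 0.5 threshold, which corresponds to the default decision rule of `predict` for a binary logistic regression.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

toy_proba = np.array([0.1, 0.4, 0.6, 0.9, 0.55, 0.3])  # hypothetical P(class 1)
toy_true = np.array([0, 0, 1, 1, 0, 1])                # hypothetical labels

# Predict class 1 when the probability reaches the threshold.
toy_pred = (toy_proba >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_true, toy_pred)
accuracy = np.trace(cm) / cm.sum()
print(cm, accuracy)
```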

## <a id='7.'> 7. References</a>

* https://www.udemy.com/course/the-data-science-course-complete-data-science-bootcamp/
* https://365datascience.com/