## Creating a logistic regression to predict absenteeism

In [2]:
import numpy as np
import pandas as pd

In [3]:
data_preprocessed = pd.read_csv("Absenteeism_preprocessed.csv");data_preprocessed.head()

Unnamed: 0,reason_1,reason_2,reason_3,reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,2,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,2,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,3,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,4,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,4,289,36,33,239.554,30,0,2,1,2


#### creating targets

Absenteeism Time in Hours > median ---> excessive absenteeism ---> 1

Absenteeism Time in Hours ≤ median ---> moderate absenteeism ---> 0

Using the median as the cutoff keeps the two classes roughly balanced, which prevents the model from favoring one class.

In [5]:
targets = np.where(data_preprocessed["Absenteeism Time in Hours"] > data_preprocessed["Absenteeism Time in Hours"].median(), 1, 0)

In [6]:
data_preprocessed["Excessive Absenteeism"] = targets

In [7]:
data_preprocessed["Excessive Absenteeism"].value_counts(normalize = True).to_frame().T.style.format(precision = 2)

Excessive Absenteeism,0,1
proportion,0.54,0.46
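
The split is not exactly 50/50 because every row whose value equals the median falls into class 0. A minimal sketch of that effect on made-up numbers (the `hours` and `tied_hours` series are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy values with no ties at the median
hours = pd.Series([0, 1, 2, 3, 4, 8, 8, 40])
toy_targets = np.where(hours > hours.median(), 1, 0)

# Hypothetical toy values where several rows equal the median (2)
tied_hours = pd.Series([0, 1, 2, 2, 2, 3, 8])
tied_targets = np.where(tied_hours > tied_hours.median(), 1, 0)

# Share of each class; ties at the median all land in class 0
print(pd.Series(toy_targets).value_counts(normalize=True).sort_index())
print(pd.Series(tied_targets).value_counts(normalize=True).sort_index())
```

With no ties the classes split evenly; with ties the "moderate" class becomes the larger one, which is consistent with the 0.54 / 0.46 proportions above.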


In [8]:
data_with_targets = data_preprocessed.drop(["Absenteeism Time in Hours", "Daily Work Load Average", "Distance to Work", "Day of the Week", "Education"], axis = 1)

In [9]:
data_with_targets.shape

(700, 11)

In [10]:
unscaled_inputs = data_with_targets.iloc[:, :-1]

#### Standardize the Input

In [12]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(unscaled_inputs.iloc[:, 4:])
scaler.transform(unscaled_inputs.iloc[:, 4:]).shape

(700, 6)

In [13]:
scaled_inputs = pd.DataFrame(data = np.concatenate([unscaled_inputs.iloc[:, :4].values, scaler.transform(unscaled_inputs.iloc[:, 4:])], axis = 1))

#### splitting data into train and test

In [15]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, test_size=0.10, random_state=365)

In [16]:
x_train.shape, y_train.shape

((630, 10), (630,))

In [17]:
x_test.shape, y_test.shape

((70, 10), (70,))
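
In the cells above the scaler is fit on all 700 rows before the split, so the test rows influence the scaling statistics. One common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. The `toy` frame, its three columns, and the `preprocess` name are assumptions made for illustration; `ColumnTransformer` is used so that the dummy column is passed through unscaled.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real inputs (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "reason_1": rng.integers(0, 2, size=100),   # dummy column, left unscaled
    "Age": rng.integers(20, 60, size=100),
    "Children": rng.integers(0, 4, size=100),
})
toy_targets = rng.integers(0, 2, size=100)

# Split first, so the test rows never influence the scaler
x_tr, x_te, y_tr, y_te = train_test_split(
    toy, toy_targets, test_size=0.10, random_state=365
)

# Scale only the numeric columns; pass the dummy column through unchanged
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["Age", "Children"])],
    remainder="passthrough",
)
x_tr_scaled = preprocess.fit_transform(x_tr)   # fit on training rows only
x_te_scaled = preprocess.transform(x_te)       # reuse the training mean and std
```

Note that `ColumnTransformer` places the transformed columns first and the passed-through columns last, so the column order differs from the original frame.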

## Logistic Model

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [21]:
reg = LogisticRegression()
reg.fit(x_train, y_train)

#### Accuracy and other metrics

In [73]:
y_pred = reg.predict(x_train)

In [95]:
metrics_table = pd.DataFrame()

In [107]:
metrics_table["accuracy"] = [reg.score(x_train, y_train)]
metrics_table["precision"] = metrics.precision_score(y_train, y_pred)
metrics_table["recall"] = metrics.recall_score(y_train, y_pred)
metrics_table["f-1 score"] = metrics.f1_score(y_train, y_pred)
metrics_table.index = ["metrics"]
metrics_table.style.format(precision = 4)

Unnamed: 0,accuracy,precision,recall,f-1 score
metrics,0.773,0.7662,0.732,0.7487


**On the training data, precision (0.77) and recall (0.73) are close, so the model keeps false positives and false negatives at a similar level rather than trading one for the other.**
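
As a quick sanity check, the F1 score can be recomputed from the precision and recall in the table above, since F1 is their harmonic mean. The two input values are copied from the training metrics table; nothing else is assumed.

```python
# Precision and recall as reported in the training metrics table
precision = 0.7662
recall = 0.7320

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7487
```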

#### summary table

In [27]:
summary_table = pd.DataFrame(data = reg.coef_, columns = unscaled_inputs.columns.values).T.rename(columns = {0: "coefficients-weights"}).reset_index(names = "feature_names")

In [28]:
summary_table.index = summary_table.index + 1

In [29]:
summary_table.loc[0] = ["intercept", reg.intercept_[0]]

In [30]:
summary_table.sort_index()

Unnamed: 0,feature_names,coefficients-weights
0,intercept,-1.789414
1,reason_1,2.887458
2,reason_2,0.917974
3,reason_3,3.252387
4,reason_4,1.030756
5,Month Value,0.11397
6,Transportation Expense,0.575575
7,Age,-0.236075
8,Body Mass Index,0.282515
9,Children,0.436241
10,Pets,-0.313436


In [31]:
summary_table["odds_ratio"] = np.exp(summary_table["coefficients-weights"])

#### interpretation of summary table
<p>
    A feature is not particularly important:
    <ul>
        <li>if its coefficient is around 0</li>
        <li>if its odds ratio is around 1</li>
    </ul>
</p>

The features that met these criteria, and were therefore dropped from the inputs above, are:

- Daily Work Load Average
- Day of the Week
- Distance to Work
- Education

In [33]:
summary_table.sort_values(by = "odds_ratio", ascending = False)

Unnamed: 0,feature_names,coefficients-weights,odds_ratio
3,reason_3,3.252387,25.851986
1,reason_1,2.887458,17.947637
4,reason_4,1.030756,2.803183
2,reason_2,0.917974,2.504213
6,Transportation Expense,0.575575,1.778152
9,Children,0.436241,1.546882
8,Body Mass Index,0.282515,1.326462
5,Month Value,0.11397,1.120719
7,Age,-0.236075,0.789721
10,Pets,-0.313436,0.730931


The most important features are the reason features:

- reason_0 ---> No reason (baseline)
- reason_1 ---> Various Diseases
- reason_2 ---> Pregnancy and giving birth
- reason_3 ---> Peculiar reasons not categorized elsewhere (e.g. poisoning)
- reason_4 ---> Light Diseases

The reason features have by far the largest effect on the target: reason_3 and reason_1 have odds ratios of about 25.9 and 17.9 relative to the baseline of no reason given.
The other features have limited effects; Transportation Expense and Children increase the odds of excessive absenteeism.
Pets and Age have odds ratios below 1, so they decrease the odds of excessive absenteeism.
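
To make the link between coefficients, odds ratios, and probabilities concrete, here is a small sketch using the intercept and the reason_3 coefficient from the summary table. The `sigmoid` helper is defined here for illustration, and the "baseline" case assumes all reason dummies are 0 and every standardized feature sits at its mean (0).

```python
import numpy as np

# Values copied from the summary table
intercept = -1.789414
coef_reason_3 = 3.252387

# Odds ratio: multiplicative change in the odds when reason_3 goes from 0 to 1
odds_ratio = np.exp(coef_reason_3)   # about 25.85, as in the table

def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Probability of excessive absenteeism for the baseline case
p_baseline = sigmoid(intercept)

# Same case, but with reason_3 = 1
p_reason_3 = sigmoid(intercept + coef_reason_3)

# The ratio of the two odds recovers the odds ratio
odds_baseline = p_baseline / (1 - p_baseline)
odds_reason_3 = p_reason_3 / (1 - p_reason_3)
print(odds_ratio, odds_reason_3 / odds_baseline)
```

For the standardized features (Month Value through Pets), the odds ratio describes a change of one standard deviation rather than one raw unit.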
 

#### Backward Elimination

Features that contribute nothing to the model can be dropped. In this case they are:
- Daily Work Load Average
- Day of the Week
- Distance to Work
- Education

**(after these features were dropped, accuracy increased by about 1.02 percentage points, from 76.12% to 77.14%)**
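
The comparison described above can be sketched as follows. This is a simplified illustration on synthetic data, not the notebook's actual experiment: it fits a full model, drops the features with the smallest absolute coefficients (only meaningful when inputs are on a comparable scale), refits, and compares test accuracy. All names and data here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: two informative columns and two pure-noise columns
rng = np.random.default_rng(0)
n = 400
informative = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 2))
X = np.hstack([informative, noise])
signal = informative[:, 0] + 0.5 * informative[:, 1]
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit with all features
full = LogisticRegression().fit(X_tr, y_tr)

# Indices of the two features whose coefficients are closest to 0
weakest = np.argsort(np.abs(full.coef_[0]))[:2]
keep = np.setdiff1d(np.arange(X.shape[1]), weakest)

# Refit without them and compare accuracy on the held-out rows
reduced = LogisticRegression().fit(X_tr[:, keep], y_tr)
acc_full = full.score(X_te, y_te)
acc_reduced = reduced.score(X_te[:, keep], y_te)
print(keep, acc_full, acc_reduced)
```

A difference of about one percentage point on a small test set corresponds to very few rows, so a change of that size is better treated as "no loss in accuracy" than as a clear improvement.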
  

## testing model accuracy with test data

In [37]:
reg.score(x_test, y_test)

0.7714285714285715

In [38]:
predicted_proba = reg.predict_proba(x_test)
predicted_proba_df = pd.DataFrame(data = predicted_proba, columns = ["moderately absenteeism probability", "excessive absenteeism probability"]);predicted_proba_df

Unnamed: 0,moderately absenteeism probability,excessive absenteeism probability
0,0.547906,0.452094
1,0.587808,0.412192
2,0.420258,0.579742
3,0.145271,0.854729
4,0.199176,0.800824
...,...,...
65,0.452909,0.547091
66,0.259079,0.740921
67,0.785276,0.214724
68,0.756574,0.243426
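
The second column of `predict_proba` is what `predict` thresholds at 0.5 for a binary model. A minimal sketch on synthetic data (the data, the `model` name, and the 0.7 cutoff are illustrative assumptions) shows the equivalence and how a stricter cutoff could be applied instead:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Column 0 is P(class 0), column 1 is P(class 1); each row sums to 1
proba = model.predict_proba(X)
excessive = proba[:, 1]

# Thresholding P(class 1) at 0.5 reproduces predict()
default_pred = (excessive > 0.5).astype(int)

# A stricter, arbitrarily chosen cutoff flags fewer rows as class 1
strict_pred = (excessive > 0.7).astype(int)
print(default_pred.sum(), strict_pred.sum())
```

Raising the cutoff generally trades recall for precision, which may matter if false alarms about excessive absenteeism are costly.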


In [105]:
y_pred_test = reg.predict(x_test)
metrics_table_test = pd.DataFrame()
metrics_table_test["accuracy"] = [reg.score(x_test, y_test)]
metrics_table_test["precision"] = metrics.precision_score(y_test, y_pred_test)
metrics_table_test["recall"] = metrics.recall_score(y_test, y_pred_test)
metrics_table_test["f-1 score"] = metrics.f1_score(y_test, y_pred_test)
metrics_table_test.index = ["metrics"]
metrics_table_test.style.format(precision = 4)

Unnamed: 0,accuracy,precision,recall,f-1 score
metrics,0.7714,0.7,0.75,0.7241
