# Creating a Logistic Regression to Predict Absenteeism

### Import Relevant Libraries

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

### Load the Data

In [5]:
data_preprocessed = pd.read_csv('absenteeism_preprocessed.csv')
data_preprocessed.drop('Unnamed: 0', axis=1, inplace=True)
data_preprocessed.head()

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month,transportaion_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets,absenteeism_time_hours
0,0,0,0,1,1,7,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,1,7,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,2,7,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,3,7,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,3,7,289,36,33,239.554,30,0,2,1,2


The task is more or less straightforward. We will use a logistic regression that takes the reason for absence, month of the year, day of the week, transportation expense, distance to work, age, daily workload average, education, children and pets of a given employee and predicts their absenteeism.

We expect that half of those predictors won't have merit. To me, it seems that the reason for absence will be the most indicative. Maybe workload will have something to do with it as well, since the busier a person is, the less he or she will want to skip work. Finally, children and pets, together with distance from work, should also have something to do with absenteeism: if your child or pet is sick at home, you'll have to go home, take them to the doctor and get them back, which will be much more time consuming than a simple visit to the doctor.

OK, we have a good idea of what to expect. The nice thing about regressions is that the model itself will give us a fair indication of which variables are important for the analysis.

### Create the Targets


In [8]:
data_preprocessed['absenteeism_time_hours'].describe()

count    700.000000
mean       6.761429
std       12.670082
min        0.000000
25%        2.000000
50%        3.000000
75%        8.000000
max      120.000000
Name: absenteeism_time_hours, dtype: float64

The cutoff is the median, 3 hours. Anyone absent for more than 3 hours is classed as excessively absent (value 1); anyone absent for 3 hours or less is classed as moderately absent (value 0). These are the targets in supervised learning, the values we are aiming for. Our task is to predict whether we'll get a 0 or a 1.

In [10]:
cutoff = data_preprocessed['absenteeism_time_hours'].median()
targets = np.where(data_preprocessed['absenteeism_time_hours'] > cutoff, 1, 0)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

The result is a NumPy array of ones and zeros. Add it to the DataFrame as a new column, 'excessive_absenteeism'.
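To make the tie handling at the cutoff concrete, here is a minimal sketch. The values in `hours` are hypothetical toy numbers, not taken from the dataset, and `toy_targets` is a name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, chosen only to show how ties at the median are handled
hours = pd.Series([0, 2, 3, 3, 4, 8, 120])

cutoff = hours.median()                        # middle value of the sorted series
toy_targets = np.where(hours > cutoff, 1, 0)   # strict ">" sends ties to class 0

print(cutoff)
print(toy_targets)
```

Because the comparison is strict, every observation exactly equal to the median lands in class 0, which is one reason the two classes do not have to split exactly 50/50.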

In [12]:
data_preprocessed['excessive_absenteeism'] = targets
data_preprocessed.head()

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month,transportaion_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets,absenteeism_time_hours,excessive_absenteeism
0,0,0,0,1,1,7,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,1,7,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,2,7,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,3,7,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,3,7,289,36,33,239.554,30,0,2,1,2,0


We've mapped the data into two classes now.


Using the median as a cutoff line is numerically stable and rigid. That's because by using the median we have implicitly balanced the dataset: roughly half of the targets are zeros, while the other half are ones. As you may remember, this prevents our model from learning to output one of the two classes exclusively and thinking it did very well.

To prove that, let's divide the number of targets that are ones by the total number of targets. The number of ones can be found by summing all values of targets, while the total number of targets is simply the shape on axis 0. The result is around 0.46, so around 46% of the targets are ones and around 54% are zeros.

Please remember that when balancing your dataset, the two classes need not each represent exactly 50% of the sample. Usually a 60-40 split will work equally well for logistic regression, but that's not true for other algorithms, such as neural networks. From my personal experience, a balance of 45 to 55% is almost always sufficient, so our result will do for this exercise. Let's proceed, noting that our two groups are distributed roughly equally.

Finally, let's drop absenteeism_time_hours from the DataFrame, since we won't be needing it. I'll call the new variable data_with_targets.

In [14]:
targets.sum() / targets.shape[0]

0.45571428571428574
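The same balance check can be written in a couple of equivalent ways. The sketch below uses a hypothetical array of labels (`toy_targets`), not the notebook's targets, and shows that for 0/1 labels the mean is the share of ones.

```python
import numpy as np

# Hypothetical labels, used only to illustrate the balance check
toy_targets = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])

share_of_ones = toy_targets.mean()         # same as sum() / shape[0] for 0/1 labels
class_counts = np.bincount(toy_targets)    # counts of zeros and ones, in that order

print(share_of_ones)
print(class_counts)
```

`np.bincount` is handy when you want the raw counts of both classes rather than a single ratio.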

In [15]:
data_with_targets = data_preprocessed.drop(['absenteeism_time_hours'], axis=1)
data_with_targets.head()

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month,transportaion_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets,excessive_absenteeism
0,0,0,0,1,1,7,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,1,7,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,2,7,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,3,7,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,3,7,289,36,33,239.554,30,0,2,1,0


### Selecting the Inputs for the Regression

In [17]:
data_with_targets.iloc[:, :-1]

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month,transportaion_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets
0,0,0,0,1,1,7,289,36,33,239.554,30,0,2,1
1,0,0,0,0,1,7,118,13,50,239.554,31,0,1,0
2,0,0,0,1,2,7,179,51,38,239.554,31,0,0,0
3,1,0,0,0,3,7,279,5,39,239.554,24,0,2,0
4,0,0,0,1,3,7,289,36,33,239.554,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,2,5,179,22,40,237.656,22,1,2,0
696,1,0,0,0,2,5,225,26,28,237.656,24,0,1,2
697,1,0,0,0,3,5,330,16,28,237.656,25,1,0,0
698,0,0,0,1,3,5,235,16,32,237.656,25,1,0,0


It is time to select the inputs for our regression. We will use the pandas method iloc to select all rows and all columns except excessive_absenteeism. To select all rows, I simply leave a colon as the first argument. Regarding the columns, I want to keep the first 14, since excessive_absenteeism is the last column; the slice :-1 does exactly that.

In [19]:
unscaled_inputs = data_with_targets.iloc[:, :-1]
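Selecting by position works as long as the target stays in the last column. As a small sketch on a hypothetical three-column frame (`toy`), the positional slice and a drop by name give the same inputs; the name-based form is an alternative that does not depend on column order.

```python
import pandas as pd

# Hypothetical miniature of the notebook's DataFrame: two inputs plus the target
toy = pd.DataFrame({
    "age": [33, 50],
    "pets": [1, 0],
    "excessive_absenteeism": [1, 0],
})

by_position = toy.iloc[:, :-1]                          # all rows, all but the last column
by_name = toy.drop(columns=["excessive_absenteeism"])   # same result, selected by name

print(by_position.equals(by_name))
```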

### Standardize the Data

It is time to standardize our data. We have discussed why standardization is important. There are several ways to perform standardization; we will import the relevant module from sklearn.

Here is a big problem, which you are not going to like: when we standardize the inputs, we also standardize the dummies. This is bad practice, because when we standardize we lose the whole interpretability of a dummy. If we had left them as zeros and ones, we could have said, for instance, that for a unit change it is 7.2 times more likely that a person will be excessively absent. A unit change in the dummy variable universe means a change from disregarding this dummy to taking only this dummy into account. So if the reason is reason 1, we would have said it is around 7.7 times more likely that a person will be absent compared to no reason given. When we standardize the reasons, a unit change becomes completely uninterpretable. The predictive power of the model is still valid and it is a good classifier, but we don't know how the different reasons compare. This is a problem, because those are the most important features.

This brings us to a correction of our code. Good thing we had all those checkpoints; maybe they'll help us do that effortlessly. I'll go back to the part where we standardize the data and put all the code in comments, since we won't be needing it. Now I'll copy-paste some code that I prepared prior to this lecture. The idea is that this is a custom scaler based on the StandardScaler from sklearn. However, when we declare the scaler object, there's an extra argument, columns, which holds the columns to scale. So our custom scaler will not standardize all of the inputs but only the ones we choose. In this way we will be able to preserve the dummies untouched. In practice, we would avoid this step by standardizing prior to creating the dummies, but we didn't do it this time so we can show you yet another nifty tool.

So let's see how this will work. First, we have the CustomScaler class, which is no different from the StandardScaler in the way it works. You don't necessarily need to understand the code; all you need to know is how to use it. Second, we must choose the columns to be scaled. To see what columns we've got, we must return to the last checkpoint before we standardized. This is the unscaled_inputs variable, so let's see what its column values are. We will create a new variable called columns_to_scale that will contain the names of the features we'd like to scale, so we will omit the dummy variables from this list.

In [22]:
# absenteeism_scaler = StandardScaler()

The object we just created will be used to scale our data. In other words, it will subtract the mean and divide by the standard deviation, variable-wise, for each data point. Again, if you feel uncomfortable with this idea, be sure to check the standardization lessons. The next step is to fit our input data: fitting calculates and stores the mean and standard deviation of each feature. The absenteeism_scaler object will then contain that information, so when we get new data we can standardize it in the same way.
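As a quick check of what "subtract the mean and divide by the standard deviation" means, the sketch below compares sklearn's StandardScaler with the manual formula. The array `toy` and the row `new_row` are hypothetical values used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: two numeric features, four observations
toy = np.array([[289.0, 33.0], [118.0, 50.0], [179.0, 38.0], [279.0, 39.0]])

scaler = StandardScaler().fit(toy)        # fit stores the per-column mean and scale
scaled = scaler.transform(toy)

# Manual standardization; StandardScaler uses the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)
print(np.allclose(scaled, manual))

# New data is standardized with the statistics stored during fit
new_row = np.array([[200.0, 40.0]])
print(scaler.transform(new_row))
```

The last two lines are the reason fit and transform are separate steps: the stored mean and standard deviation can be reused on data the scaler has never seen.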

In [24]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    """Standardize only the chosen columns and leave the rest (e.g. dummies) untouched."""

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # Fit the wrapped StandardScaler on the chosen columns only
        self.scaler.fit(X[self.columns], y)
        # Per-column mean and population variance (ddof=0, as StandardScaler uses);
        # the pandas methods avoid calling NumPy reductions on a DataFrame
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self

    def transform(self, X, y=None, copy=None):
        # Remember the original column order so it can be restored at the end
        init_col_order = X.columns

        # Scale the chosen columns; keep X's index so that rows align in pd.concat
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)

        # Columns that were not chosen pass through unchanged
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [25]:
unscaled_inputs.columns.values

array(['reason_group_1', 'reason_group_2', 'reason_group_3',
       'reason_group_4', 'day_of_week', 'month',
       'transportaion_expense_dollars', 'distance_to_work_miles', 'age',
       'daily_work_load_average', 'body_mass_index', 'education',
       'children', 'pets'], dtype=object)

In [26]:
columns_to_scale = ['day_of_week', 'month', 'transportaion_expense_dollars', 'distance_to_work_miles', 'age',
            'daily_work_load_average', 'body_mass_index', 'children', 'pets']

absenteeism_scaler = CustomScaler(columns_to_scale)

In [27]:
absenteeism_scaler.fit(unscaled_inputs)



We've created a scaling mechanism. In order to apply it we must use transform.

In [29]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)
scaled_inputs

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month,transportaion_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets
0,0,0,0,1,-0.683704,0.182726,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
1,0,0,0,0,-0.683704,0.182726,-1.574681,-1.141882,2.130803,-0.806331,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,-0.007725,0.182726,-0.654143,1.426749,0.248310,-0.806331,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.668253,0.182726,0.854936,-1.682647,0.405184,-0.806331,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.668253,0.182726,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.007725,-0.388293,-0.654143,-0.533522,0.562059,-0.853789,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.007725,-0.388293,0.040034,-0.263140,-1.320435,-0.853789,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,0.668253,-0.388293,1.624567,-0.939096,-1.320435,-0.853789,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,0.668253,-0.388293,0.190942,-0.939096,-0.692937,-0.853789,-0.408580,1,-0.919030,-0.589690


The dummies are untouched.

In [31]:
scaled_inputs.shape

(700, 14)

All non-dummy inputs have been standardized, and the shape is unchanged.
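As an aside, sklearn also ships a built-in tool for scaling only some columns: ColumnTransformer from sklearn.compose. The sketch below is an alternative to the custom scaler, not the method used in this notebook. The frame `toy` and the names `numeric_cols` and `scaled_toy` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature of the inputs: one dummy and two numeric features
toy = pd.DataFrame({
    "reason_group_1": [0, 0, 1, 0],
    "age": [33, 50, 39, 38],
    "pets": [1, 0, 0, 0],
})
numeric_cols = ["age", "pets"]

# Scale the numeric columns; every other column passes through unchanged
ct = ColumnTransformer([("scale", StandardScaler(), numeric_cols)],
                       remainder="passthrough")
out = ct.fit_transform(toy)

# ColumnTransformer puts the transformed columns first, then the passthrough ones,
# so rebuild a DataFrame and restore the original column order
out_cols = numeric_cols + [c for c in toy.columns if c not in numeric_cols]
scaled_toy = pd.DataFrame(out, columns=out_cols, index=toy.index)[toy.columns]
print(scaled_toy)
```

The trade-off is that the output is a plain array with reordered columns, so the column names have to be restored by hand, which the custom class does for you.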

## Split the Data Into Train & Test and Shuffle
We want to test the accuracy of the model on data it has never seen before. We also shuffle the data to remove any dependency on the order in which it was recorded.

### Split

In [35]:
train_test_split(scaled_inputs, targets)

[     reason_group_1  reason_group_2  reason_group_3  reason_group_4  \
 103               0               0               0               1   
 102               0               0               0               1   
 352               1               0               0               0   
 218               1               0               0               0   
 632               0               0               0               1   
 ..              ...             ...             ...             ...   
 441               0               0               0               1   
 14                0               0               0               1   
 2                 0               0               0               1   
 493               0               0               0               1   
 598               0               0               0               1   
 
      day_of_week     month  transportaion_expense_dollars  \
 103     1.344231  1.610276                       0.568211   
 102     0

We receive four arrays: training inputs, testing inputs, training targets and testing targets.

In [37]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets)

In [38]:
print(x_train.shape, y_train.shape)

(525, 14) (525,)


In [39]:
print(x_test.shape, y_test.shape)

(175, 14) (175,)


The training inputs contain 525 observations along 14 variables, with 525 matching targets (excessive_absenteeism). By default, 75% of the data goes to training and 25% to testing.

In [41]:
# 80-20 split. shuffle=True by default; a fixed random_state makes the pseudo-random shuffle reproducible
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size = 0.8, random_state = 20)

In [42]:
print(x_train.shape, y_train.shape)

(560, 14) (560,)


In [43]:
print(x_test.shape, y_test.shape)

(140, 14) (140,)
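To see what random_state buys us, here is a minimal sketch on hypothetical arrays `X` and `y`: two calls with the same random_state give identical splits, so results can be reproduced from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations, 2 features, alternating labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

a_train, a_test, ya_train, ya_test = train_test_split(X, y, train_size=0.8, random_state=20)
b_train, b_test, yb_train, yb_test = train_test_split(X, y, train_size=0.8, random_state=20)

print(a_train.shape, a_test.shape)       # 8 training rows, 2 testing rows
print(np.array_equal(a_test, b_test))    # same seed, same shuffle
```

Without random_state, each call shuffles differently, which is why the first split in this notebook cannot be reproduced exactly.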


## Logistic Regression with sklearn

### Training the Model

In [46]:
reg = LogisticRegression()
reg.fit(x_train, y_train) # Returns the fitted model; the notebook displays its parameters

In [47]:
reg.score(x_train, y_train) # Mean accuracy: the model classifies 77.5% of the training observations correctly

0.775

### Manually Check Accuracy

If we want to find the accuracy of a model manually, we should find the outputs and compare them with the targets. Let's do that. To find the model outputs, we will use a simple sklearn method called predict. If I write reg.predict(x_train), this method will find the predicted outputs of the regression. The model itself is contained in the variable reg, and we are choosing to predict the outputs associated with the training inputs, contained in x_train. Let's store all that in a new variable called model_outputs and see what's inside.

In [49]:
model_outputs = reg.predict(x_train) #  (Training Data) - Predictions of model
model_outputs

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [50]:
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False, False,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True, False,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [51]:
np.sum(model_outputs == y_train) # Predicted correctly

434

In [52]:
model_outputs.shape[0] # Total observations

560

In [53]:
np.sum(model_outputs == y_train) / model_outputs.shape[0] # Same value as reg.score: this is what score computes behind the scenes

0.775
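The metrics module imported at the top offers the same calculation ready-made. The sketch below uses hypothetical label arrays (`y_true`, `y_pred`) to show that the manual ratio matches metrics.accuracy_score, and adds a confusion matrix, which breaks the errors down by class.

```python
import numpy as np
from sklearn import metrics

# Hypothetical targets and predictions, used only for illustration
y_true = np.array([1, 0, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])

manual_accuracy = np.sum(y_pred == y_true) / y_true.shape[0]
library_accuracy = metrics.accuracy_score(y_true, y_pred)
confusion = metrics.confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print(manual_accuracy, library_accuracy)
print(confusion)
```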

### Finding the Intercept and Coefficients
In this Python, SQL and Tableau integration, the ultimate goal is to create a function that can easily and reliably predict values from within Tableau. Since Tableau is a nice-looking, manager-friendly software, that's the place where the end users of our analysis will likely take advantage of our model. Regression analysis, linear or nonlinear, is about determining certain coefficients, or weights, which we apply to the inputs to obtain a final result. So to use this logistic regression model outside of Python, we must get our hands on the coefficients and the intercept. Moreover, we need them in order to interpret the model. Let's get on with that. The intercept can be found very easily through the intercept_ attribute.

In [55]:
reg.intercept_

array([-1.65662792])

In [56]:
reg.coef_

array([[ 2.80136327e+00,  9.33540824e-01,  3.09673857e+00,
         8.57183147e-01, -8.43159241e-02,  1.66403124e-01,
         6.13215559e-01, -7.77871894e-03, -1.65545282e-01,
        -7.68487792e-05,  2.71154773e-01, -2.06026920e-01,
         3.61897667e-01, -2.85728905e-01]])

In [57]:
feature_name = unscaled_inputs.columns.values # variables corresponding to the coefficients
feature_name

array(['reason_group_1', 'reason_group_2', 'reason_group_3',
       'reason_group_4', 'day_of_week', 'month',
       'transportaion_expense_dollars', 'distance_to_work_miles', 'age',
       'daily_work_load_average', 'body_mass_index', 'education',
       'children', 'pets'], dtype=object)

In [58]:
summary_table = pd.DataFrame(columns = ['Feature Name'], data = feature_name)
summary_table['Coefficient'] = np.transpose(reg.coef_)
summary_table

Unnamed: 0,Feature Name,Coefficient
0,reason_group_1,2.801363
1,reason_group_2,0.933541
2,reason_group_3,3.096739
3,reason_group_4,0.857183
4,day_of_week,-0.084316
5,month,0.166403
6,transportaion_expense_dollars,0.613216
7,distance_to_work_miles,-0.007779
8,age,-0.165545
9,daily_work_load_average,-7.7e-05


In [59]:
# Add intercept
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature Name,Coefficient
0,Intercept,-1.656628
1,reason_group_1,2.801363
2,reason_group_2,0.933541
3,reason_group_3,3.096739
4,reason_group_4,0.857183
5,day_of_week,-0.084316
6,month,0.166403
7,transportaion_expense_dollars,0.613216
8,distance_to_work_miles,-0.007779
9,age,-0.165545


Let me remind you that the coefficients are also called weights, while the intercept is also called the bias. These notions are useful because the weights show how we weigh a certain input: the closer a weight is to 0, the smaller its impact, and the further away from zero, no matter if positive or negative, the bigger the weight of the feature. Note that this is true for our model but not universally true. It holds only for models where all variables are on the same scale, such as the one we just built (apart from the dummies, which we left as zeros and ones).

There are coefficient values and standardized coefficient values. Standardized coefficients are basically the coefficient values of a regression where all variables have been standardized. Other packages and software report the standardized coefficients because they allow for a simple and easy-to-understand comparison between the variables. Since in such cases the features are standardized, they all have a variance of one, that is, the same scale. Whenever the scale is the same, we can simply say that the bigger the weight, the more important its corresponding feature. For machine learning purposes, and prediction in general, we usually standardize the variables, like we did now.

Another notion we must emphasize is that whenever we are dealing with a logistic regression, the coefficients we are predicting are the so-called log odds. This is a consequence of the choice of model: logistic regressions are nothing but a linear function predicting log odds, which are later transformed into zeros and ones. To make this clear, here is the logistic regression equation: log(odds) = b0 + b1*x1 + ... + bk*xk. All the coefficients we have refer to the log odds, so to make them more interpretable, let's find the exponentials of these coefficients. I'll create a new Series in our DataFrame called Odds_ratio. Odds ratio is the correct term for what we get after we find the exponentials of the coefficients. It is equal to np.exp(summary_table.Coefficient).
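To make the log-odds idea tangible, here is a minimal sketch on randomly generated, hypothetical data (not the absenteeism dataset). It rebuilds the model's probabilities by hand from intercept_ and coef_, which is also what you would do to use the model outside of Python.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: three features and a noisy binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# The linear part of the model predicts log odds
log_odds = model.intercept_ + X @ model.coef_.ravel()

# The sigmoid turns log odds into the probability of class 1
probabilities = 1 / (1 + np.exp(-log_odds))

# Exponentiating the coefficients gives the odds ratios
odds_ratios = np.exp(model.coef_.ravel())

print(np.allclose(probabilities, model.predict_proba(X)[:, 1]))
print(odds_ratios)
```

A predicted class of 1 corresponds to positive log odds, that is, a probability above 0.5.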

## Interpreting the Coefficients

In [62]:
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)

In [63]:
summary_table.sort_values('Odds_ratio', ascending = False)

Unnamed: 0,Feature Name,Coefficient,Odds_ratio
3,reason_group_3,3.096739,22.125672
1,reason_group_1,2.801363,16.467081
2,reason_group_2,0.933541,2.543499
4,reason_group_4,0.857183,2.356513
7,transportaion_expense_dollars,0.613216,1.846359
13,children,0.361898,1.436052
11,body_mass_index,0.271155,1.311478
6,month,0.166403,1.181049
10,daily_work_load_average,-7.7e-05,0.999923
8,distance_to_work_miles,-0.007779,0.992251


So how can we interpret them? If a coefficient is around 0, or its odds ratio is close to 1, the corresponding feature is not particularly important. The reasoning in terms of weights is that a weight of 0 implies that no matter the feature value, we will multiply it by zero and the whole term will be 0. The meaning in terms of odds ratios is the following: for a one-unit change in the standardized feature, the odds are multiplied by the odds ratio. For example, if the odds are 5:1 and the odds ratio is 2, we would say that for a one-unit change the odds change from 5:1 to 10:1, because we multiplied them by the odds ratio. Alternatively, if the odds ratio is 0.2, the odds would change to 1:1. When the odds ratio is 1, we don't have a change, as multiplication by one keeps things equal. This makes sense, as the odds ratio is 1 whenever the weight is 0.

The coefficient of the average daily workload is about -0.00008, so almost zero, and its odds ratio is about 0.9999, so almost 1. This feature is almost useless for our model; with or without it, the result would likely be the same. Do we have other variables that could fall into this category? Of course: day of the week and distance to work. To be honest, this is a bit surprising to me. Before justifying my mistake in logic, this is the time to note that they may not necessarily be useless. A more accurate statement is that, given all the features, these seem to be the ones that make no difference. We will keep the features for now but consider dropping them later on.

What else can we see? We've got the four reasons for absence, which are the most important predictors. When we were creating the dummies, the one we dropped was reason 0. Reason 0 represented a situation when a person was absent but no particular reason was given. Therefore, the base model is the case where there is no reason, a.k.a. reason 0. From the coefficients, it seems that whenever a person has stated any reason, we have a much higher chance of getting excessive absence.
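The odds arithmetic from the example above can be written out in a few lines. The two helper functions are hypothetical names introduced only for this sketch.

```python
def apply_odds_ratio(odds, odds_ratio):
    """Return the new odds after a one-unit change in the feature."""
    return odds * odds_ratio


def odds_to_probability(odds):
    """Convert odds (e.g. 5 for 5:1) into a probability."""
    return odds / (1 + odds)


starting_odds = 5                                # odds of 5:1
doubled = apply_odds_ratio(starting_odds, 2)     # odds ratio of 2
reduced = apply_odds_ratio(starting_odds, 0.2)   # odds ratio of 0.2
unchanged = apply_odds_ratio(starting_odds, 1)   # odds ratio of 1 changes nothing

print(doubled, reduced, unchanged)
print(odds_to_probability(starting_odds), odds_to_probability(doubled))
```

Note that doubling the odds does not double the probability: odds of 5:1 are a probability of about 0.83, while odds of 10:1 are about 0.91.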


Since the dummies weren't standardized, we can interpret their coefficients directly.

Let's look at the features. I'll just remind you that the further away from zero a coefficient is, the bigger its importance. By looking at the coefficients table, we notice that the most strongly pronounced features seem to be the four reasons for absence, transportation expense, and whether a person has children, pets and education. Note that pets and education are at the bottom of the table, but their weights are still far away from zero, so they are indeed important. We can carry on in this way, finishing with the daily workload, distance to work and day of the week, which seem to have the smallest impact. Their weights are almost 0, so regardless of the particular values they will barely affect our model.

What about the reasons? We said that the base model includes no reason, but what is the impact of the various reasons? I'll quickly recap what the five reason variables stand for: reason 0, or no reason, which is the baseline model; reason 1, which comprises various diseases; reason 2, relating to pregnancy and giving birth; reason 3, regarding poisoning and peculiar reasons not categorized elsewhere; and reason 4, which relates to light diseases.

In light of this clarification, we can easily understand our coefficients. The most crucial reason for excessive absence is poisoning. Not much of a surprise there: if you are poisoned, you just won't go to work. The weight means that the odds of someone being excessively absent after being poisoned are about 22 times higher than when no reason was reported. Another very important reason seems to be number 1, or various diseases. I've called this the normal absenteeism case: you got sick, you skipped work, no drama. A person who has reported this is about 16 times more likely to be excessively absent than a person who didn't specify a reason.

Then we have pregnancy and giving birth. I particularly like this one because it's a prominent cause of absenteeism, but at the same time it is way less pronounced than reasons 1 and 3. My explanation is this: a woman is pregnant, she goes to the gynecologist, gets her regular pregnancy check and comes back to work. Nothing excessive about that. From time to time there are some emergencies, but from the odds ratio we can verify that she is only around 2.5 times more likely to be excessively absent than in the base model.

After that, we've got transportation expense. This is the most important non-dummy feature in the model. But here's the problem: it is one of our standardized variables, so we don't have direct interpretability of it. Its odds ratio implies that for one standardized unit, that is, one standard deviation increase in transportation expense, it is close to twice as likely that a person will be excessively absent. This is the main drawback of standardization. Standardized models almost always yield higher accuracy, because the optimization algorithms work better this way. Machine learning engineers prefer models with higher accuracy, so they normally go for standardization. Econometricians and statisticians, however, prefer less accurate but more interpretable models, because they care about the underlying reasons behind different phenomena. Data scientists may be in either position: sometimes they need higher accuracy, other times they must find the main drivers of a problem. So it makes sense to create two different models, one with standardized features and one without them, and then draw insights from both. However, should we opt for predicting values, we definitely prefer higher accuracy, so standardization is more often the choice.

As an example of a negative coefficient, I'll go for pets. Pets is a continuous variable. Its odds ratio is about 0.75, so for each additional standardized unit of pets, the odds are 1 minus the odds ratio, or about 25%, lower (1 - 0.75 = 0.25). One explanation may be that if you have several pets, you're probably not taking care of them on your own. Not being solely responsible for them implies that somebody else can take them to the doctor if something's wrong.

Finally, I want to make a note on the intercept. It is used to get more accurate predictions, but there's no specific meaning attached to it. That's why in machine learning you can say that it calibrates the model, and you could also call it a bias. Without it, each log-odds prediction would be off by precisely that value.


## Backward Elimination and Simplifying Model

Remove columns that have no impact or predictive power. The month value is still useful to keep, even if it adds little predictive power.

The idea is to remove all features that have close to no contribution to the model. Usually, when we have the p-values of the variables, we get rid of all coefficients with p-values above 0.05. When learning with sklearn, we don't have p-values, because we don't necessarily need them. The reasoning of the engineers who created the package is that if the weight is small enough, it won't make a difference anyway, and we trust their work. So I say that if we remove these variables, the rest of our model should not really change in terms of coefficient values. Let's go back to the checkpoint where we created the targets. This was our last DataFrame manipulation step before we started standardizing. I'll take this checkpoint, data_with_targets, and drop the three features we were just discussing.
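A minimal sketch of that drop step follows. It assumes the three features are day_of_week, daily_work_load_average and distance_to_work_miles, as inferred from the discussion of near-zero weights. The frame `toy_with_targets` is a hypothetical two-row stand-in for data_with_targets, and `reduced` is a name chosen only for this example.

```python
import pandas as pd

# Hypothetical two-row stand-in for data_with_targets (only a few columns shown)
toy_with_targets = pd.DataFrame({
    "reason_group_1": [0, 1],
    "day_of_week": [1, 3],
    "distance_to_work_miles": [36, 5],
    "daily_work_load_average": [239.554, 239.554],
    "pets": [1, 0],
    "excessive_absenteeism": [1, 1],
})

# Assumed list of low-impact features, taken from the interpretation above
features_to_drop = ["day_of_week", "daily_work_load_average", "distance_to_work_miles"]

reduced = toy_with_targets.drop(columns=features_to_drop)
print(reduced.columns.tolist())
```

After the drop, the remaining steps (selecting inputs, scaling, splitting and fitting) would need to be rerun, and the dropped names removed from columns_to_scale.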