## Creating a Logistic Regression using machine learning to predict Absenteeism at work

### Import the relevant libraries.

In [2]:
import pandas as pd
import numpy as np 

### Load the data.

In [3]:
data_preprocessed = pd.read_csv('/Users/ranjanadobal/Documents/Github/Repos/My_folder/Python_Programming_Data_Science/Data_Science_Bootcamp/5.Absenteeism Project/Data/Absenteeism_preprocessed.csv')

### Eyeball the data.

In [4]:
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


###We will fit a logistic regression that uses all of these independent variables to predict absenteeism. 
Logistic regression is a classification method, so we will be assigning people to classes.

The model itself will give a good indication of which variables are important for the analysis and which are not. First, split people into two classes: those who are excessively absent and those who are moderately absent. Take the median value of 'Absenteeism Time in Hours' as the cut-off. Everything above the median is considered excessive, and everything at or below the median is considered moderate. 

#It seems that the reason for absence will be the most indicative of absenteeism at work.  Maybe workload will have something to do with it as well since the busier a person is the less he or she will want to skip work. Finally, children and pets together with distance from work should also have something to do with absenteeism. If your child or pet is sick at home, you'll have to go home, take them to the doctor, and get them back which will be much more time-consuming than a simple visit to the doctor.


### Find the median of 'Absenteeism Time in Hours'. 

In [5]:
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

###The result is 3 and its datatype is float. 3 hours will be our cut-off line. People who are absent for more than 3 hours will be considered excessively absent, and people who are absent for 3 hours or less will be considered moderately absent. An observation with an absence of 3 hours or less is assigned the value 0; otherwise it is assigned the value 1. 

###In Supervised machine learning, these 0s and 1s are called TARGETS. These are the values we are aiming for. We will predict whether we obtain a 0 or a 1.  

### Create a new variable 'targets' which will measure if a person has been absent for more than 3 hours. Parameterize the code by specifying the method which finds the median. 
Parameterization makes the code easy to understand and follow and this minimizes the chance of making mistakes.


In [6]:
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 
                   data_preprocessed['Absenteeism Time in Hours'].median(),1, 0) 

In [7]:
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [8]:
targets.shape

(700,)

###'targets' is an np array which contains 0s and 1s. 
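
The same thresholding can be sketched on a handful of made-up numbers. The series 'hours' below is hypothetical and not taken from the dataset; it only illustrates that the comparison is strict, so an observation exactly at the median is assigned 0.

```python
import numpy as np
import pandas as pd

# Hypothetical absence durations in hours (illustrative values only)
hours = pd.Series([0, 2, 3, 3, 4, 8])

# The median is computed from the data instead of being hard-coded
cutoff = hours.median()

# Strictly greater than the median -> 1, otherwise (including equal) -> 0
toy_targets = np.where(hours > cutoff, 1, 0)

print(cutoff)
print(toy_targets)
```

Because the cut-off is computed by median() rather than typed in as 3, the same line keeps working if the data changes.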

### Add the targets to the dataframe data_preprocessed in a new column 'Excessive Absenteeism'

In [9]:
data_preprocessed['Excessive Absenteeism'] = targets 

In [10]:
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


###This is another method of mapping data into two classes 0 and 1. 

Using the median as a cutoff line is numerically stable and rigid. That's because by using the median, we have implicitly balanced the dataset: roughly half of the targets are zeros and the other half are ones. This prevents the model from learning to output only one of the two classes and still appearing to do well.

In order to prove that, let's divide the number of targets that are ones by the total number of targets. The number of targets that are ones can be found by summing up all values of targets while the total number of targets is simply the shape on axis zero.

### A comment on the targets. 

In [11]:
targets.sum()/targets.shape[0]

0.45571428571428574

###Around 46% of the targets are 1 and 54% of the targets are 0. When balancing a dataset, the two classes need not represent exactly 50% of the sample each. A 60-40 split usually works equally well for a logistic regression, although this is not true for neural network algorithms. A 45-55 balance is almost always sufficient, so our two groups are distributed roughly equally. 
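
As a small sketch of the same balance check, the share of ones in a 0/1 array is simply its mean, and np.bincount gives the count per class. The array toy_targets is hypothetical and only stands in for the real targets.

```python
import numpy as np

# Hypothetical 0/1 targets (illustrative values only)
toy_targets = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# For a 0/1 array, the mean equals the share of ones
share_of_ones = toy_targets.mean()

# Count of each class: position 0 holds the zeros, position 1 the ones
class_counts = np.bincount(toy_targets)

print(share_of_ones)
print(class_counts)
```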

###  Create a checkpoint by dropping the unnecessary variables after exploring the coefficients. 

Drop the 'Absenteeism Time in Hours' column, as it has been replaced by the targets and will not be needed anymore. 

#Also drop the variables we 'eliminated' after exploring the weights: 'Day of the week', 'Daily Work Load Average', and 'Distance to Work'. 

In [12]:
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


In [13]:
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours','Day of the week',
                                            'Daily Work Load Average','Distance to Work'],axis=1)

### Check if this new dataframe is same as the dataframe at the beginning of this notebook 'data_preprocessed'. This is to check if we need to create another checkpoint now. 

In [14]:
data_with_targets is data_preprocessed

False

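###The 'is' operator checks whether two names refer to the same object. drop() returns a new dataframe, so the result is False: data_with_targets is a separate object that no longer contains 'Absenteeism Time in Hours', 'Day of the week', 'Daily Work Load Average', and 'Distance to Work'.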
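
Note that 'is' tests object identity rather than content. A minimal sketch on a hypothetical two-column frame shows the difference between an identity check and a content check with equals():

```python
import pandas as pd

# Hypothetical frame (illustrative values only)
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

dropped = df.drop(['b'], axis=1)   # drop returns a new DataFrame
same_content = df.copy()           # equal values, but a different object

print(dropped is df)               # identity check
print(same_content is df)          # identity check
print(same_content.equals(df))     # content check
```

Even an exact copy is not the same object, so 'is' returning False says nothing about whether the columns differ.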

In [15]:
data_with_targets.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0


###data_with_targets is the checkpoint for the machine learning models. 

## Select the inputs for the regression. 

### Check the dimensions of the new dataframe 'data_with_targets'.

In [16]:
data_with_targets.shape

(700, 12)

### To select the inputs for regression, we need to select all rows and columns except the last column 'Excessive Absenteeism'. 

The iloc indexer selects by position in the dataframe. It takes two arguments: the first refers to the row indices and the second to the column indices.

In [17]:
data_with_targets.iloc[:,0:14]

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,1,2,0,1
696,1,0,0,0,5,225,28,24,0,1,2,0
697,1,0,0,0,5,330,28,25,1,0,0,1
698,0,0,0,1,5,235,32,25,1,0,0,0


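###iloc excludes the ending index, and a colon in place of the row indices returns all the rows. Note that data_with_targets has only 12 columns, so the end index 14 is out of range and is simply clipped: the slice 0:14 returns every column, including 'Excessive Absenteeism'. The result is the same when the 0 is removed.  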
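
A short sketch of how positional slices behave, using a hypothetical three-column frame: an end index beyond the last column is clipped rather than raising an error, while an explicit end index or a negative index can exclude the last column.

```python
import pandas as pd

# Hypothetical frame with 3 columns (illustrative values only)
df = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'target': [0, 1]})

all_cols = df.iloc[:, 0:14]        # end index beyond the last column is clipped
inputs_by_count = df.iloc[:, 0:2]  # columns 0 and 1; the end index 2 is excluded
inputs_by_neg = df.iloc[:, :-1]    # everything except the last column

print(list(all_cols.columns))
print(list(inputs_by_count.columns))
print(list(inputs_by_neg.columns))
```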

In [18]:
data_with_targets.iloc[:,:14]

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,1,2,0,1
696,1,0,0,0,5,225,28,24,0,1,2,0
697,1,0,0,0,5,330,28,25,1,0,0,1
698,0,0,0,1,5,235,32,25,1,0,0,0


### Display the same result with negative indices. Indicate the number of columns at the end that we want to skip. 

We wanted to select all rows and all columns but the last one.

In [19]:
data_with_targets.iloc[:,:-1]

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,289,33,30,0,2,1
1,0,0,0,0,7,118,50,31,0,1,0
2,0,0,0,1,7,179,38,31,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0
4,0,0,0,1,7,289,33,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,1,2,0
696,1,0,0,0,5,225,28,24,0,1,2
697,1,0,0,0,5,330,28,25,1,0,0
698,0,0,0,1,5,235,32,25,1,0,0


### Store the result in another variable 'unscaled_inputs'.

In [20]:
unscaled_inputs = data_with_targets.iloc[:,:-1]

## Standardize the data

### Import the relevant modules. 

In [21]:
# from sklearn.preprocessing import StandardScaler

### Declare a StandardScaler object. 

In [22]:
# absenteeism_scaler = StandardScaler()

###This is an empty StandardScaler object with no information. 

### Fit the input data. 

In [23]:
# import the libraries needed to create the Custom Scaler
# note that all of them are part of the sklearn package
# moreover, one of them is the StandardScaler class itself,
# because the Custom Scaler is built on top of it
# CustomScaler does not standardize all inputs, only the columns we choose
# This leaves the dummy variables untouched.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# create the Custom Scaler class

class CustomScaler(BaseEstimator, TransformerMixin):

    # init: the information we need to declare a CustomScaler object

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # the same options as StandardScaler, plus the list of columns to scale
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    # the fit method, which is again based on StandardScaler

    def fit(self, X, y=None):
        self.scaler = StandardScaler(copy=self.copy, with_mean=self.with_mean, with_std=self.with_std)
        self.scaler.fit(X[self.columns], y)
        # column-wise mean and population variance (ddof=0, as np.var computes it)
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self

    # the transform method, which does the actual scaling

    def transform(self, X, y=None, copy=None):

        # record the initial order of the columns
        init_col_order = X.columns

        # scale the features chosen when creating the instance of the class
        # keep the index of X so that the rows line up again in pd.concat
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)

        # declare a variable containing all information that was not scaled
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]

        # return a data frame which contains all scaled and all 'not scaled' features
        # use the original order (recorded at the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
    

#In practice, we would avoid this above step by standardizing prior to creating the dummies. 
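
For comparison, a minimal sketch of a simpler way to scale only selected columns without a custom class: copy the dataframe and overwrite the chosen columns with the output of StandardScaler. The frame toy_inputs, its values, and the list toy_columns_to_scale are hypothetical; this is an alternative shown for illustration, not the approach used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: one dummy column and two numerical columns
toy_inputs = pd.DataFrame({
    'Reason_1': [1, 0, 0, 1],
    'Age': [30, 40, 50, 60],
    'Pets': [0, 1, 2, 1],
})
toy_columns_to_scale = ['Age', 'Pets']

toy_scaler = StandardScaler()
toy_scaled = toy_inputs.copy()

# Overwrite only the chosen columns; index and column order are preserved
toy_scaled[toy_columns_to_scale] = toy_scaler.fit_transform(
    toy_inputs[toy_columns_to_scale])

print(toy_scaled)
```

The fitted toy_scaler still holds the means and standard deviations, so new data can be scaled later with toy_scaler.transform on the same columns.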

###Check the column values of unscaled_inputs dataframe.

In [24]:
unscaled_inputs.columns.values 

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [25]:
type(unscaled_inputs)

pandas.core.frame.DataFrame

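###Create a new variable called columns_to_scale that will contain the names of the features we would like to scale. We omit the dummy variables (the four reasons and 'Education') from this list. 

Before the eliminated variables were dropped at the checkpoint, the list also contained 'Day of the week', 'Distance to Work', and 'Daily Work Load Average':

columns_to_scale = ['Month Value','Day of the week', 'Transportation Expense', 
                    'Distance to Work','Age', 'Daily Work Load Average', 'Body Mass Index',
       'Children', 'Pets']

Those three columns no longer exist in unscaled_inputs, so the list used now is: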

In [26]:
columns_to_scale = ['Month Value', 'Transportation Expense','Age', 'Body Mass Index',
       'Children', 'Pets']

###Now we can declare the absenteeism_scaler and it will be equal to CustomScaler of columns_to_scale. 

In [27]:
absenteeism_scaler = CustomScaler(columns_to_scale)

In [28]:
absenteeism_scaler.fit(unscaled_inputs)


CustomScaler(columns=['Month Value', 'Transportation Expense', 'Age',
                      'Body Mass Index', 'Children', 'Pets'])

###This line calculates the mean and standard deviation of each column in columns_to_scale from the unscaled inputs. This information is stored in the absenteeism_scaler object. Whenever we get new data, the standardization information is contained in absenteeism_scaler, so we can standardize the new data in the same way. We have just prepared the 'scaling mechanism'. 

### To apply the 'scaling mechanism', use another method called 'Transform'. 

In [29]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

###This operation transforms the unscaled inputs using the information contained in absenteeism_scaler. We subtract the mean and divide by the standard deviation. This is the most common and useful way to transform the data when deploying a model. 
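
A small sketch of what fit and transform do, on a single hypothetical feature. It checks that StandardScaler matches the manual formula (value minus mean, divided by the population standard deviation) and that new data is scaled with the statistics learned during fit rather than its own. The arrays train_values and new_values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature (illustrative values only)
train_values = np.array([[10.0], [20.0], [30.0], [40.0]])
new_values = np.array([[25.0], [50.0]])

toy_scaler = StandardScaler().fit(train_values)

# StandardScaler uses the population standard deviation (ddof=0)
manual = (train_values - train_values.mean(axis=0)) / train_values.std(axis=0)
from_scaler = toy_scaler.transform(train_values)

# New data is scaled with the mean and std learned during fit, not its own
new_scaled = toy_scaler.transform(new_values)
new_manual = (new_values - toy_scaler.mean_) / toy_scaler.scale_

print(np.allclose(manual, from_scaler))
print(np.allclose(new_scaled, new_manual))
```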

In [30]:
scaled_inputs

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.182726,0.854936,0.405184,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.388293,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.388293,0.040034,-1.320435,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.388293,1.624567,-1.320435,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.190942,-0.692937,-0.408580,1,-0.919030,-0.589690


###All the columns in columns_to_scale have been standardized, while the dummy variables still contain 0s and 1s. 

### Check the dimensions of the scaled inputs. 

In [31]:
scaled_inputs.shape

(700, 11)

###We have 700 observations and 11 features. 

## Split the data into train and test and shuffle. 

#Overfitting occurs when the model learns to predict the data we've given it so well that when applied in a real life situation with new data, it fails miserably. One way to deal with overfitting is to hide a small part of the data set from the algorithm. So we train the model based on most of the data but not all of it. After that, we use the small piece of data we left aside to test if the model will do well in real life.

We also want to shuffle the data so that we remove all types of dependencies that come from the order of the data set like day of the week.


### Import the relevant module. 

In [32]:
from sklearn.model_selection import train_test_split

###The train_test_split function has many arguments; the two most important ones are the inputs and the targets. 

### Indicate the inputs and targets in the train_test_split method. 

In [33]:
train_test_split(scaled_inputs,targets)

[     Reason_1  Reason_2  Reason_3  Reason_4  Month Value  \
 184         0         0         0         1    -0.673803   
 169         1         0         0         0    -0.959313   
 228         1         0         0         0    -0.102784   
 131         0         0         0         1    -1.530333   
 527         0         0         0         1     1.039256   
 ..        ...       ...       ...       ...          ...   
 549         0         0         0         0     1.324766   
 16          0         0         0         1     0.182726   
 313         0         0         0         0     1.039256   
 595         0         0         0         1    -1.244823   
 337         0         0         0         0     1.324766   
 
      Transportation Expense       Age  Body Mass Index  Education  Children  \
 184                1.036026  0.562059        -0.408580          0 -0.019280   
 169               -0.654143  0.562059        -1.114186          1  0.880469   
 228               -1.5746

###The output has 4 arrays: a training dataset with inputs, then a test dataset with inputs,  a training dataset with targets, then a test dataset with targets. 

### To make this output useful, declare 4 variables that will contain the 4 outputs. 

In [34]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs,targets)

### Check the shape of these variables. 

This will be very indicative of what the train-test split has actually achieved.

In [35]:
print(x_train.shape, y_train.shape)

(525, 11) (525,)


In [36]:
print(x_test.shape, y_test.shape)

(175, 11) (175,)


###Training inputs contain 525 observations along 11 features. Training targets are a vector of length 525, which corresponds to the 'Excessive Absenteeism' column. 
Test inputs contain 175 observations along 11 features, and test targets are a vector of length 175. This function has split the scaled inputs and targets into matching parts that can be used in the machine learning part. 

###This split is 3 to 1 (75% of the observations are used for training and 25% for testing). This is the default split. Usually we opt for splits like 90-10 or 80-20 because we want to train on more data: setting aside more data for testing means training the model on less. We specify the split with the train_size argument, which takes values between 0 and 1. A train_size of 0.9 means 90% of the data will be used for training and 10% for testing. A train_size of 0.8 is a common choice.  

###By default shuffle is set to True, so every time we run the code we get a different train and test set. A different split each time causes the final model, and its accuracy, to differ from run to run. If we set random_state = 20, the shuffle becomes pseudo-random in a reproducible way: the observations are always shuffled in the same order. 
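
The effect of train_size and random_state can be sketched on a tiny hypothetical dataset of 10 observations. The arrays toy_inputs and toy_targets are made up; the point is only that two calls with the same random_state return the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical inputs and targets: 10 observations, 2 features
toy_inputs = np.arange(20).reshape(10, 2)
toy_targets = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Two calls with the same random_state give the same split
a_train, a_test, ya_train, ya_test = train_test_split(
    toy_inputs, toy_targets, train_size=0.8, random_state=20)
b_train, b_test, yb_train, yb_test = train_test_split(
    toy_inputs, toy_targets, train_size=0.8, random_state=20)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_train, b_train))
```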

In [37]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs,targets, train_size = 0.8, random_state = 20)

In [38]:
x_train

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
346,0,0,0,1,1.610276,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
91,0,0,1,0,1.324766,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690
299,1,0,0,0,1.039256,-0.654143,-1.006686,-1.819793,1,-0.919030,-0.589690
129,0,0,1,0,-1.530333,-0.654143,-1.006686,-1.819793,1,-0.919030,-0.589690
695,1,0,0,0,-0.388293,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690
...,...,...,...,...,...,...,...,...,...,...,...
218,1,0,0,0,-0.388293,-1.574681,2.130803,1.002633,0,-0.019280,-0.589690
223,0,0,0,1,-0.102784,1.036026,0.562059,-0.408580,0,-0.019280,0.268487
271,0,0,0,1,0.753746,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
474,0,0,0,1,0.182726,2.092381,-1.320435,0.061825,0,-0.019280,2.843016


In [39]:
x_test

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
535,0,0,0,1,1.324766,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
281,0,0,0,1,0.753746,1.036026,0.562059,-0.408580,0,-0.019280,0.268487
324,0,0,0,1,1.324766,0.190942,1.032682,2.649049,0,-0.019280,-0.589690
645,0,0,0,1,-0.959313,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
10,1,0,0,0,0.182726,0.568211,-0.065439,-0.878984,0,2.679969,-0.589690
...,...,...,...,...,...,...,...,...,...,...,...
136,0,0,0,1,-1.530333,1.005844,-0.536062,0.767431,0,0.880469,0.268487
430,0,0,0,1,-0.388293,2.092381,-1.320435,0.061825,0,-0.019280,2.843016
32,0,0,0,1,0.468236,0.190942,0.091435,0.532229,1,-0.019280,0.268487
449,0,0,0,1,-0.102784,0.356940,0.718933,-0.878984,0,-0.919030,-0.589690


In [40]:
print(x_train.shape, y_train.shape)

(560, 11) (560,)


In [41]:
print(x_test.shape, y_test.shape)

(140, 11) (140,)


## Logistic regression with sklearn.

### Import the relevant modules. 

In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics 

###The metrics module will be useful in evaluating the model. StatsModels is not always numerically stable for machine learning models.

## Training the model.

### Declare a new variable which will be a LogisticRegression object. 

In [43]:
reg = LogisticRegression()

### Fit the regression. 

In [44]:
reg.fit(x_train,y_train)

LogisticRegression()

###This method does all the machine learning. 

### Get all the default parameters of the logistic regression object. 

In [45]:
print(reg.get_params())

{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}


###Each of these parameters can help improve the model one way or the other. 

### Evaluate the accuracy of the model. 

In [46]:
reg.score(x_train,y_train)

0.7732142857142857

###The model has a training accuracy of about 0.77, which means that the model learned to classify about 77% of the observations it was trained on correctly. 
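
The score above is computed on the data the model was trained on. A minimal sketch of comparing it with the score on held-out data follows. It uses synthetic data from make_classification as a stand-in (an assumption for illustration, not the absenteeism data), and all variable names in it are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the absenteeism dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train_toy, y_test_toy = train_test_split(
    X, y, train_size=0.8, random_state=20)

toy_reg = LogisticRegression().fit(X_train, y_train_toy)

train_accuracy = toy_reg.score(X_train, y_train_toy)  # accuracy on seen data
test_accuracy = toy_reg.score(X_test, y_test_toy)     # accuracy on unseen data

print(train_accuracy, test_accuracy)
```

A test accuracy far below the training accuracy would be a sign of the overfitting described earlier.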

### Manually check the accuracy. 

###The LogisticRegression model is trained on the train inputs. Based on them, it finds outputs which are trying to be as close to the targets as possible. Accuracy means that x% of the model outputs match the targets. To find the accuracy manually, we should find the outputs and compare them with the targets. 

### Find model outputs using sklearn. 

reg.predict method finds the predicted output of the regression for the inputs in x_train. 

In [47]:
model_outputs = reg.predict(x_train)
model_outputs

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

###The predictions of the model are an array of 0s and 1s. 

### Display the targets.

We will now compare these targets with the predictions. 

In [48]:
y_train

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,

### Compare the model predictions with the targets.  

In [49]:
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

### Sum the True which is equal to 1 and False which is equal to 0 in the above array. 

In [50]:
np.sum((model_outputs == y_train))

433

###Accuracy will be Number of correct predictions / total number of observations. 

In [51]:
model_outputs.shape[0]

560

###Total number of observations is 560. 

In [52]:
Accuracy = np.sum((model_outputs == y_train))/model_outputs.shape[0]
Accuracy

0.7732142857142857

###This result is the same as the accuracy calculated using sklearn method score(). 
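
The metrics module imported earlier offers the same calculation through accuracy_score. A short sketch on hypothetical predictions and targets (the arrays toy_outputs and toy_y are made up) shows that it agrees with the manual formula:

```python
import numpy as np
from sklearn import metrics

# Hypothetical predictions and targets (illustrative values only)
toy_outputs = np.array([0, 1, 1, 0, 1, 0, 0, 1])
toy_y = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Correct predictions divided by the total number of observations
manual_accuracy = np.sum(toy_outputs == toy_y) / toy_outputs.shape[0]

# The same quantity computed by sklearn
sklearn_accuracy = metrics.accuracy_score(toy_y, toy_outputs)

print(manual_accuracy, sklearn_accuracy)
```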

## Finding the intercept and coefficients. 

To use this regression model outside Python, we need to determine coefficients or weights which we apply to the inputs to obtain a final result. 


#Intercept

In [53]:
reg.intercept_

array([-1.6474549])

#Coefficients

In [54]:
reg.coef_

array([[ 2.80019733,  0.95188356,  3.11555338,  0.83900082,  0.1589299 ,
         0.60528415, -0.16989096,  0.27981088, -0.21053312,  0.34826214,
        -0.27739602]])
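
To illustrate how the intercept and coefficients could be applied outside sklearn, here is a minimal sketch on a small hypothetical dataset. It computes the log-odds as the intercept plus the weighted sum of the inputs, converts them to probabilities with the logistic function, and compares the result with sklearn's own output. The arrays X and y and the model toy_reg are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: 2 features, binary target
X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 2.0],
              [3.0, 0.0], [4.0, 1.5], [5.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])
toy_reg = LogisticRegression().fit(X, y)

# Linear part: intercept + sum of coefficient * input
log_odds = toy_reg.intercept_[0] + X @ toy_reg.coef_[0]

# The logistic (sigmoid) function turns log-odds into P(class 1)
manual_proba = 1 / (1 + np.exp(-log_odds))

# Default decision rule: class 1 when the probability exceeds 0.5
manual_pred = (manual_proba > 0.5).astype(int)

print(np.allclose(manual_proba, toy_reg.predict_proba(X)[:, 1]))
print(np.array_equal(manual_pred, toy_reg.predict(X)))
```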

### We want to know what variables those coefficients refer to. 

###scaled_inputs variable has the inputs. We can use scaled_inputs.columns.values because scaled_inputs is a dataframe whereas an ndarray will not have columns. 

In [55]:
type(scaled_inputs)

pandas.core.frame.DataFrame

In [56]:
scaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

###Unscaled_inputs variable has the pandas dataframe. 

In [57]:
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

### Declare a new variable that will contain these column values. 

In [58]:
feature_name = unscaled_inputs.columns.values

### Create a dataframe that will contain the intercept, feature_name, and the corresponding coefficients. The dataframe will be called summary_table. 

We must transpose this array because reg.coef_ is a 2D ndarray with a single row (shape (1, 11)), while a dataframe column needs one value per row.

In [59]:
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)

summary_table['Coefficient'] = np.transpose(reg.coef_)

summary_table

Unnamed: 0,Feature name,Coefficient
0,Reason_1,2.800197
1,Reason_2,0.951884
2,Reason_3,3.115553
3,Reason_4,0.839001
4,Month Value,0.15893
5,Transportation Expense,0.605284
6,Age,-0.169891
7,Body Mass Index,0.279811
8,Education,-0.210533
9,Children,0.348262
10,Pets,-0.277396


###We need to transpose reg.coef_ because it is stored as a single row and not as a column. The intercept can be added using the append (removed in pandas 2.0) or concatenate method, which by default places the new data at the end of the dataframe. Another method, used here, is to shift the indices of the summary table by 1 and place the intercept at index 0.

We'll specify the zeroth element so we can extract the float rather than the whole array.
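
The shapes involved can be sketched with two hypothetical arrays that mimic what sklearn stores for a binary model with three features: a coefficient array of shape (1, n_features) and an intercept array of shape (1,). The names toy_coef and toy_intercept and their values are made up.

```python
import numpy as np

# Hypothetical arrays with the shapes sklearn uses for a binary model
# with 3 features: coef_ is (1, n_features), intercept_ is (1,)
toy_coef = np.array([[0.5, -1.2, 2.0]])
toy_intercept = np.array([-0.3])

as_column = np.transpose(toy_coef)   # shape (3, 1): one row per feature
as_flat = toy_coef[0]                # shape (3,): the single row as a 1D array
bias_value = toy_intercept[0]        # a single float rather than a 1-element array

print(toy_coef.shape, as_column.shape, as_flat.shape)
print(bias_value)
```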

In [60]:
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Bias', reg.intercept_[0]]
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature name,Coefficient
0,Bias,-1.647455
1,Reason_1,2.800197
2,Reason_2,0.951884
3,Reason_3,3.115553
4,Reason_4,0.839001
5,Month Value,0.15893
6,Transportation Expense,0.605284
7,Age,-0.169891
8,Body Mass Index,0.279811
9,Education,-0.210533
10,Children,0.348262
11,Pets,-0.277396
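
As an alternative to shifting the index, the intercept row could be placed first with pd.concat. This is a sketch on a hypothetical two-feature table (toy_summary and toy_bias are made-up names and values), not the method used in the cell above.

```python
import pandas as pd

# Hypothetical summary table (illustrative values only)
toy_summary = pd.DataFrame({
    'Feature name': ['x1', 'x2'],
    'Coefficient': [0.5, -1.2],
})
toy_bias = pd.DataFrame({'Feature name': ['Bias'], 'Coefficient': [-0.3]})

# Put the bias row first, then renumber the index from 0
with_bias = pd.concat([toy_bias, toy_summary], ignore_index=True)

print(with_bias)
```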


## Interpreting the coefficients. 

### All these coefficients are expressed in log-odds. Find the exponentials of these coefficients to make them more interpretable. 

The coefficients are also called weights, while the intercept is called the bias. The weights show how much we weight a certain input. The closer a weight is to zero, the smaller its effect. Let's calculate the odds ratio, which is equal to the exponential of the coefficient.  


In [61]:
summary_table['Odds_ratio']= np.exp(summary_table.Coefficient)
summary_table

Unnamed: 0,Feature name,Coefficient,Odds_ratio
0,Bias,-1.647455,0.192539
1,Reason_1,2.800197,16.447892
2,Reason_2,0.951884,2.590585
3,Reason_3,3.115553,22.545903
4,Reason_4,0.839001,2.314054
5,Month Value,0.15893,1.172256
6,Transportation Expense,0.605284,1.831773
7,Age,-0.169891,0.843757
8,Body Mass Index,0.279811,1.32288
9,Education,-0.210533,0.810152
10,Children,0.348262,1.416604
11,Pets,-0.277396,0.757754


### Sort the dataframe using the sort_values method and 'Odds_ratio' column. 

In [62]:
summary_table.sort_values('Odds_ratio')

Unnamed: 0,Feature name,Coefficient,Odds_ratio
0,Bias,-1.647455,0.192539
11,Pets,-0.277396,0.757754
9,Education,-0.210533,0.810152
7,Age,-0.169891,0.843757
5,Month Value,0.15893,1.172256
8,Body Mass Index,0.279811,1.32288
10,Children,0.348262,1.416604
6,Transportation Expense,0.605284,1.831773
4,Reason_4,0.839001,2.314054
2,Reason_2,0.951884,2.590585
1,Reason_1,2.800197,16.447892
3,Reason_3,3.115553,22.545903


###By default, the coefficients are sorted in the ascending order. The most important ones are at the bottom. Specify ascending = False to order the coefficients in the descending order. 

In [63]:
summary_table.sort_values('Odds_ratio', ascending = False)

Unnamed: 0,Feature name,Coefficient,Odds_ratio
3,Reason_3,3.115553,22.545903
1,Reason_1,2.800197,16.447892
2,Reason_2,0.951884,2.590585
4,Reason_4,0.839001,2.314054
6,Transportation Expense,0.605284,1.831773
10,Children,0.348262,1.416604
8,Body Mass Index,0.279811,1.32288
5,Month Value,0.15893,1.172256
7,Age,-0.169891,0.843757
9,Education,-0.210533,0.810152
11,Pets,-0.277396,0.757754
0,Bias,-1.647455,0.192539


###If a coefficient is around 0, or equivalently if its odds ratio is around 1, the corresponding feature is not particularly important. For a one-unit change in the standardized feature, the odds are multiplied by the odds ratio. A weight of 0 means the feature is multiplied by 0 in the model, so its contribution is 0 whatever its value.
An odds ratio of 1 means that a one-unit change in the standardized feature multiplies the odds by 1, which leaves them unchanged. This is consistent: the odds ratio is exactly 1 whenever the weight is 0, because exp(0) = 1.
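The link between a weight and its odds ratio can be checked with a small sketch. The helper name `odds` and the feature values 0 and 1 are made up for illustration; the bias and the weight are the Bias and Reason_4 values from the table above.

```python
import math

def odds(bias, weight, x):
    """Odds of class 1 for a one-feature logistic model: exp(bias + weight * x)."""
    return math.exp(bias + weight * x)

bias = -1.647455  # Bias from the summary table

# A weight of 0: the feature contributes nothing, so the odds do not change.
ratio_zero = odds(bias, 0.0, 1.0) / odds(bias, 0.0, 0.0)

# A non-zero weight: a one-unit increase multiplies the odds by exp(weight).
weight = 0.839001  # Reason_4 coefficient from the summary table
ratio_nonzero = odds(bias, weight, 1.0) / odds(bias, weight, 0.0)

print(ratio_zero)        # 1.0
print(ratio_nonzero)     # about 2.314, the Reason_4 odds ratio
print(math.exp(weight))  # same value, computed directly
```

Note that the bias cancels out in the ratio, which is why the odds ratio depends only on the weight.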

Daily Work Load Average, Distance to Work, and Day of the week have coefficients close to 0 and odds ratios close to 1. Given all the other features, these three seem to make no difference in the model. We will consider dropping these features from the model later.

Reason_1, Reason_2, Reason_3, and Reason_4 are the most important predictors. We dropped Reason_0 when creating the dummies because Reason_0 is the case where no reason was given, so the baseline is the case with no reason.

### Interpretation when we standardized dummy variables as well:

When we standardized the inputs, we also standardized the dummies, so we can no longer tell how the different reasons compare. This is bad practice: standardizing a dummy destroys its interpretability. If we had left the dummies as 0 and 1, we could have said: if the reason given is Reason_1, the odds that a person will be excessively absent are 7.92 times higher than when no reason is given.

We need to correct the code where we standardized the data. In practice, we would avoid this problem by standardizing before creating the dummies.
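One way to standardize only the numerical columns is sketched below. This is not the notebook's own scaler, only a plain pandas illustration: the toy DataFrame `df` and the list `columns_to_scale` are assumptions, with values borrowed from the first rows of the data.

```python
import pandas as pd

# Toy data: one dummy column and two numerical columns.
df = pd.DataFrame({
    'Reason_1': [0, 1, 0, 1],
    'Age': [33, 50, 38, 39],
    'Transportation Expense': [289, 118, 179, 279],
})

# Only these columns are standardized; the dummy stays as 0 and 1.
columns_to_scale = ['Age', 'Transportation Expense']

scaled = df.copy()
# Subtract the mean and divide by the standard deviation, column by column.
# ddof=0 gives the population standard deviation, as sklearn's StandardScaler uses.
means = df[columns_to_scale].mean()
stds = df[columns_to_scale].std(ddof=0)
scaled[columns_to_scale] = (df[columns_to_scale] - means) / stds

print(scaled)
```

The means and standard deviations must come from the training data and be reused on any new data, which is why a scaler object that stores them is convenient.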

### Interpretation after standardizing only the numerical variables:

#Looking at the coefficients table, the most strongly pronounced features seem to be the four reasons for absence, Transportation Expense, Body Mass Index, Children, Pets, and Education. Note that Pets and Education are at the bottom of the table, but their weights are still far from zero.

### Reasons of Absence:
The baseline is the case where no reason was given (Reason_0). The four reason dummies stand for: Reason_1, comprising various diseases; Reason_2, relating to pregnancy and giving birth; Reason_3, regarding poisoning and peculiar reasons not categorized elsewhere; and Reason_4, which relates to light diseases.

The most crucial reason for excessive absence is Reason_3, poisoning. Its odds ratio means the odds of someone being excessively absent after being poisoned are about 22 times higher than when no reason was reported. For a person who reported Reason_1, various diseases, the odds are about 16 times higher than for a person who didn't specify a reason. For the pregnancy reason, Reason_2, the odds are about 2.6 times higher than the baseline, and for the light diseases reason, Reason_4, about 2.3 times higher. A possible explanation for this pattern: people with Reason_2 and Reason_4 go to the doctor for a medical checkup and come back to work.

### Other numerical variables:

Transportation Expense is the most important non-dummy feature in the model. Its odds ratio means that for one standardized unit, that is, for a one standard deviation increase in Transportation Expense, a person's odds of being excessively absent are multiplied by 1.83, which is a (1.83 - 1) = 83% increase in the odds.

Standardized models often yield higher accuracy because the optimization algorithms tend to work better on standardized inputs. Machine learning engineers prefer models with higher accuracy, so they normally go for standardization. Econometricians and statisticians, however, prefer less accurate but more interpretable models, because they care about the underlying reasons behind different phenomena. So it makes sense to create two different models, one with standardized features and one without them, and then draw insights from both.

For Pets, the odds ratio is 0.76. So for each additional standardized unit of Pets, the odds are multiplied by 0.76, that is, they are (1 - 0.76) = 24% lower. This could be because if you have several pets, you're probably not taking care of them on your own. Not being solely responsible for them implies somebody else can take them to the vet if something is wrong.
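The percentage readings above follow directly from the coefficients. As a minimal sketch (the helper `percent_change_in_odds` is a made-up name; the coefficients are copied from the summary table):

```python
import math

# Coefficients copied from the summary table above.
coefficients = {
    'Transportation Expense': 0.605284,
    'Pets': -0.277396,
}

def percent_change_in_odds(coefficient):
    """Percent change in the odds for a one-unit increase in the feature."""
    return (math.exp(coefficient) - 1) * 100

for name, coef in coefficients.items():
    odds_ratio = math.exp(coef)
    change = percent_change_in_odds(coef)
    print(f"{name}: odds ratio {odds_ratio:.2f}, change in odds {change:+.1f}%")
```

A positive coefficient gives an odds ratio above 1 and an increase in the odds; a negative coefficient gives an odds ratio below 1 and a decrease.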





### Backward Elimination

#We can simplify our model by removing all features which have close to no contribution to the model. We can drop the three features we were just discussing: Day of the Week, Daily Work Load Average, and Distance to Work.
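Dropping the three features could look like the sketch below. The small DataFrame `data` is only a stand-in built from two rows of the head() output shown earlier, and the variable names are assumptions.

```python
import pandas as pd

# Toy stand-in for the preprocessed data, restricted to a few columns.
data = pd.DataFrame({
    'Reason_1': [0, 1],
    'Day of the week': [1, 3],
    'Transportation Expense': [289, 279],
    'Distance to Work': [36, 5],
    'Daily Work Load Average': [239.554, 239.554],
    'Age': [33, 39],
})

features_to_drop = ['Day of the week', 'Daily Work Load Average', 'Distance to Work']

# drop() returns a new DataFrame; the original is left untouched.
data_reduced = data.drop(columns=features_to_drop)

print(data_reduced.columns.tolist())
```

After removing features, the model has to be trained again on the reduced inputs, since the remaining weights will change.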


## Testing the model 

#So far, when referring to the model accuracy, we meant the train accuracy. The train accuracy is around 77%, but it does not mean much because the algorithm has seen the train data many times during training, so it has learned to model it quite well. It may still fail miserably on new data, so we should test it on data it has never seen. It is time to use the test data. Once we have tested the model, we are conceptually not allowed to touch it anymore.

Assess the test accuracy of the model.

In [64]:
reg.score(x_test,y_test)

0.75

#So, based on data that the model has never seen before, we can say that in 75% of the cases the model correctly predicts whether a person is going to be excessively absent. The test accuracy is usually somewhat lower than the train accuracy; here it is about 2 percentage points lower (75% versus around 77%).
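For a classifier, the score method returns the mean accuracy, which is the share of predictions that match the targets. The sketch below shows that calculation by hand; the arrays `y_true` and `y_pred` are made-up values, not the real test set.

```python
import numpy as np

# Made-up targets and predictions for eight observations.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Comparing element by element gives True/False; the mean of that is the accuracy.
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 0.75, since 6 of the 8 predictions are correct
```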

Apart from the accuracy, we can also get the outputs using the predict method as before. Instead of zero and one, we can also get the probability of an output being zero or one. There is an sklearn method for this called predict_proba.

#Find the predicted probabilities of each output 0 and 1. The first column shows the probability that a particular observation is 0, and the second the probability that it is 1. That's why the two numbers in any row sum to one.

In [65]:
predicted_proba = reg.predict_proba(x_test)
predicted_proba

array([[0.71340413, 0.28659587],
       [0.58724228, 0.41275772],
       [0.44020821, 0.55979179],
       [0.78159464, 0.21840536],
       [0.08410854, 0.91589146],
       [0.33487603, 0.66512397],
       [0.29984576, 0.70015424],
       [0.13103971, 0.86896029],
       [0.78625404, 0.21374596],
       [0.74903632, 0.25096368],
       [0.49397598, 0.50602402],
       [0.22484913, 0.77515087],
       [0.07129151, 0.92870849],
       [0.73178133, 0.26821867],
       [0.30934135, 0.69065865],
       [0.5471671 , 0.4528329 ],
       [0.55052275, 0.44947725],
       [0.5392707 , 0.4607293 ],
       [0.40201117, 0.59798883],
       [0.05361575, 0.94638425],
       [0.7003009 , 0.2996991 ],
       [0.78159464, 0.21840536],
       [0.42037128, 0.57962872],
       [0.42037128, 0.57962872],
       [0.24783565, 0.75216435],
       [0.74566259, 0.25433741],
       [0.51017274, 0.48982726],
       [0.85690195, 0.14309805],
       [0.20349733, 0.79650267],
       [0.78159464, 0.21840536],
       [0.

In [66]:
predicted_proba.shape

(140, 2)
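As a quick sanity check, the two probabilities in each row should add up to one. A minimal sketch using the first three rows of the output above (the variable names are assumptions):

```python
import numpy as np

# First three rows of the predict_proba output shown above.
first_rows = np.array([
    [0.71340413, 0.28659587],
    [0.58724228, 0.41275772],
    [0.44020821, 0.55979179],
])

# Sum across the columns of each row.
row_sums = first_rows.sum(axis=1)
print(np.allclose(row_sums, 1.0))
```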

#Select ONLY the probabilities referring to 1s. We are interested in the probability of excessive absenteeism, so the probability of getting one. We can simply print out all values from the second column. This will give us the probabilities of absenteeism. Logistic regression models calculate these probabilities in the background. If the probability is below 0.5 it places a zero, otherwise a one.



In [67]:
predicted_proba[:,1]

array([0.28659587, 0.41275772, 0.55979179, 0.21840536, 0.91589146,
       0.66512397, 0.70015424, 0.86896029, 0.21374596, 0.25096368,
       0.50602402, 0.77515087, 0.92870849, 0.26821867, 0.69065865,
       0.4528329 , 0.44947725, 0.4607293 , 0.59798883, 0.94638425,
       0.2996991 , 0.21840536, 0.57962872, 0.57962872, 0.75216435,
       0.25433741, 0.48982726, 0.14309805, 0.79650267, 0.21840536,
       0.36956558, 0.67906035, 0.68502567, 0.52868083, 0.21840536,
       0.53506551, 0.22147081, 0.73692105, 0.40498044, 0.60505988,
       0.21075848, 0.45224466, 0.23751292, 0.39833498, 0.82755447,
       0.56797575, 0.69113325, 0.28659587, 0.21935267, 0.2033097 ,
       0.57628256, 0.3294664 , 0.66512397, 0.26949499, 0.83321968,
       0.43491525, 0.88374612, 0.23127072, 0.33415858, 0.34432939,
       0.69909345, 0.65494263, 0.29244941, 0.79200758, 0.20750276,
       0.26840558, 0.08708566, 0.22147081, 0.73245417, 0.30530219,
       0.22147081, 0.29014408, 0.90438191, 0.46061297, 0.60174
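The 0.5 cut-off described above can be applied by hand. A minimal sketch using the first five probabilities from the output (the variable names are assumptions):

```python
import numpy as np

# First five class-1 probabilities from the output above.
proba_of_one = np.array([0.28659587, 0.41275772, 0.55979179, 0.21840536, 0.91589146])

# Assign class 1 when the probability is at least 0.5, otherwise class 0.
predicted_class = np.where(proba_of_one >= 0.5, 1, 0)
print(predicted_class)  # [0 0 1 0 1]
```

Working with the probabilities directly also makes it possible to try a different cut-off than 0.5 if the two kinds of error are not equally costly.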

## Save the model. 

First, we will save our model so we can use it later on. We don't need to train it every time; we just need to determine the weights once and then save them for later use. Second, we will create our own module so that others can use this model too. Finally, we will get completely new data, classify it, pass it through SQL, and then analyze it in Tableau.

There are several popular ways to save (and finalize) a model, such as pickle, Joblib (a part of the SciPy ecosystem), and JSON. 'pickle' is the standard Python tool for serialization and deserialization. Pickling means converting a Python object into a byte stream. Logically, unpickling means converting a byte stream (one that has been pickled) back into a Python object.


Saving a model is the process of creating a file that contains all the information about the machine learning model. We want the file to store the following information: this machine learning model is a logistic regression; it has these coefficients and this intercept; the random state that was chosen for the shuffling was 20. The object 'reg', which is an instance of the sklearn logistic regression class, contains all this information.

Saving the model is equivalent to saving the 'reg' object. Pickling is the process of converting a Python object into a byte stream. The main idea is that this byte stream contains sufficient information to rebuild the object. The file will then be loaded in a new notebook, and thus we'll be able to use the machine learning algorithm there. The file size will be less than one kilobyte.










#Import the relevant module.

In [68]:
import pickle

#Pickle the model file. First, the file name is 'model', as the file contains the model. Second, 'wb' stands for write bytes (binary mode). When we unpickle, we will use 'rb', or read bytes. Third, we use the dump method: when we pickle, we dump the information into a file; when we unpickle, we load it. In the dump method, we specify the object to be dumped and the file to dump it into.

In [69]:
with open('model', 'wb') as file:
    pickle.dump(reg, file)

#Pickle the scaler file. 

The absenteeism scaler object was used to standardize all numerical variables. It stored the columns to scale as well as the mean and the standard deviation of each feature. Until now, our code was heavily dependent on training data. Without training data, the machine learning could not be executed at all. But once the model is trained and we have obtained the coefficients, we can save it as we just have. In this way, we are separating the model from the training data for good. The information in the absenteeism scaler is needed to preprocess any new data using the same rules as the ones applied to training data. Thus, we must pickle the scaler too.


In [70]:
with open('scaler','wb') as file:
    pickle.dump(absenteeism_scaler, file)
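The reverse step, unpickling, uses 'rb' and the load method. The sketch below shows a full round trip. To keep it self-contained it pickles a plain dictionary as a stand-in for the trained model and writes to a temporary folder; both are assumptions made for illustration, not part of the notebook.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable Python object works the same way.
stand_in_model = {'coefficients': [2.800197, 0.951884], 'intercept': -1.647455}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'model')

    # 'wb' opens the file for writing in binary mode.
    with open(path, 'wb') as file:
        pickle.dump(stand_in_model, file)

    # 'rb' opens the file for reading in binary mode.
    with open(path, 'rb') as file:
        restored_model = pickle.load(file)

# The restored object is an equal copy, not the same object in memory.
print(restored_model == stand_in_model)  # True
```

Only unpickle files from a source you trust, because loading a pickle can execute arbitrary code.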

#We have preprocessed the data, trained a machine learning algorithm, and fine-tuned it a bit.
Now we will explore how to deploy it. Deploying a model consists of making it available and ready to use.
Generally it consists of two steps: saving the model and then applying it to new data. We prefer creating a module because storing code in a module allows us to reuse it without trouble. In essence, we will treat the methods in this module in the same way we treat the NumPy, sklearn, and pandas methods.

The absenteeism module contains all the preprocessing and machine learning in one clean file. Given that Python is an object-oriented programming language, everything is organized in classes.



