## OBJECTIVE 

To predict absenteeism at a company during work time. Absenteeism here means absence from work during normal working hours, resulting in a temporary incapacity to carry out regular working activity.

## PROBLEM

Today's business environment is more competitive than ever, which increases pressure in the workplace. It is reasonable to expect that unachievable business goals and an elevated risk of unemployment raise people's stress levels, and the continuous presence of such factors can become detrimental to a person's health. Sometimes this results in a minor illness, which can develop into a long-term condition.
We approach the problem from the point of view of the person in charge of productivity at the company, so we focus on predicting an employee's absenteeism during work time based on certain characteristics. We want to know whether an employee can be expected to be missing for a specific number of hours on a given workday. Having such information in advance can improve our decision making: we can reorganize the work process to avoid a loss of productivity and increase the quality of the work produced in the organization.

### Let's analyse the dataset

In [1]:
import pandas as pd

In [2]:
raw_data = pd.read_csv("Absenteeism_data.csv")

In [3]:
raw_data

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...
695,17,10,23/05/2018,179,22,40,237.656,22,2,2,0,8
696,28,6,23/05/2018,225,26,28,237.656,24,1,1,2,3
697,18,10,24/05/2018,330,16,28,237.656,25,2,0,0,8
698,25,23,24/05/2018,235,16,32,237.656,25,3,0,0,2


In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


In [5]:
raw_data.isnull().sum()

ID                           0
Reason for Absence           0
Date                         0
Transportation Expense       0
Distance to Work             0
Age                          0
Daily Work Load Average      0
Body Mass Index              0
Education                    0
Children                     0
Pets                         0
Absenteeism Time in Hours    0
dtype: int64

In [6]:
df= raw_data.copy()
df.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [7]:
# pd.options.display.max_columns = None
# pd.options.display.max_rows = None

In [8]:
df.drop(columns=['ID'], inplace= True)

In [9]:
df.head()

Unnamed: 0,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [10]:
# analysing Reason for Absence column

df['Reason for Absence'].value_counts()

23    147
28    110
27     66
13     52
0      38
19     36
22     32
26     31
25     29
11     24
10     22
18     21
14     18
1      16
7      13
12      8
21      6
6       6
8       5
9       4
5       3
16      3
24      3
15      2
4       2
3       1
2       1
17      1
Name: Reason for Absence, dtype: int64

In [11]:
df['Reason for Absence'].max(), df['Reason for Absence'].min()

(28, 0)

In [12]:
df['Reason for Absence'].unique()

array([26,  0, 23,  7, 22, 19,  1, 11, 14, 21, 10, 13, 28, 18, 25, 24,  6,
       27, 17,  8, 12,  5,  9, 15,  4,  3,  2, 16], dtype=int64)

In [13]:
df['Reason for Absence'].nunique()

28

In [14]:
sorted(df['Reason for Absence'].unique())

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28]

The value 20 never appears in this column: the codes run from 0 to 28, but there are only 28 unique values.
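
The same check can be done programmatically instead of by reading the sorted list. This is a minimal sketch that assumes the valid codes are the contiguous integers 0 to 28 (an assumption based on the minimum and maximum shown above); the `observed` list is copied from the `unique()` output.

```python
# Unique reason codes as printed in the notebook output above.
observed = [26, 0, 23, 7, 22, 19, 1, 11, 14, 21, 10, 13, 28, 18, 25, 24, 6,
            27, 17, 8, 12, 5, 9, 15, 4, 3, 2, 16]

# Assumption: valid codes are the contiguous integers 0..28.
expected = range(0, 29)

# Codes that are expected but never observed.
missing_codes = sorted(set(expected) - set(observed))
print(missing_codes)  # [20]
```

This matters for the label-based slices used later (for example `.loc[:, 18:21]`): they still work because both endpoints, 18 and 21, exist as column labels.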

In [15]:
# Transforming Reason for Absence column
One_hot_encode_reason = pd.get_dummies(df['Reason for Absence'], drop_first= True)
One_hot_encode_reason

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
696,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
697,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [16]:
One_hot_encode_reason.sum(axis= 1).value_counts()

1    662
0     38
dtype: int64

In [17]:
One_hot_encode_reason.columns.values

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 21, 22, 23, 24, 25, 26, 27, 28], dtype=int64)

In [18]:
# directly addition to df dataset
# df['reason_type_1'] = One_hot_encode_reason.loc[:,1:14].max(axis= 1)

In [19]:
df.head()

Unnamed: 0,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [20]:
reason_type_1 = One_hot_encode_reason.loc[:,1:14].max(axis= 1)

In [21]:
reason_type_2 = One_hot_encode_reason.loc[:,15:17].max(axis= 1)

In [22]:
reason_type_3 = One_hot_encode_reason.loc[:,18:21].max(axis= 1)

In [23]:
reason_type_4 = One_hot_encode_reason.loc[:,22:].max(axis= 1)

In [24]:
df = pd.concat([df, reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis= 1)

In [25]:
df.head()

Unnamed: 0,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,0,1,2,3
0,26,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,0,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,23,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1


In [26]:
df.columns.values

array(['Reason for Absence', 'Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets',
       'Absenteeism Time in Hours', 0, 1, 2, 3], dtype=object)

In [27]:
df.columns = ['Reason for Absence', 'Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets',
       'Absenteeism Time in Hours', 'reason_type_1', 'reason_type_2', 'reason_type_3', 'reason_type_4']

In [28]:
df.head()

Unnamed: 0,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,reason_type_1,reason_type_2,reason_type_3,reason_type_4
0,26,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,0,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,23,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1
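
The grouping above can also be sketched without the intermediate one-hot table and without renaming columns afterwards. The example below is only an illustration on the first five reason codes; it assumes the same group boundaries as the slices used in the notebook (1-14, 15-17, 18-21, 22-28) and treats code 0 as the baseline, as `drop_first=True` does. The names `codes`, `group` and `reason_dummies` are hypothetical.

```python
import pandas as pd

# Reason codes of the first five rows shown above.
codes = pd.Series([26, 0, 23, 7, 23], name="Reason for Absence")

# Bin edges mirror the notebook's slices: 0 | 1-14 | 15-17 | 18-21 | 22-28.
# With labels=False, pd.cut returns the integer index of the bin (0..4).
group = pd.cut(codes, bins=[-1, 0, 14, 17, 21, 28], labels=False)

# One named indicator column per group; group 0 is the dropped baseline.
reason_dummies = pd.DataFrame(
    {f"reason_type_{k}": (group == k).astype(int) for k in range(1, 5)}
)
print(reason_dummies)
```

Because the columns are named when they are created, the later `df.columns = [...]` assignment would not be needed with this approach.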


In [29]:
df.drop(columns=['Reason for Absence'], inplace= True)

In [30]:
df.columns = ['Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets',
       'Absenteeism Time in Hours', 'reason_type_1', 'reason_type_2', 'reason_type_3', 'reason_type_4']

In [31]:
df.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,reason_type_1,reason_type_2,reason_type_3,reason_type_4
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1


In [32]:
reordered_col_names = ['reason_type_1', 'reason_type_2', 'reason_type_3', 'reason_type_4','Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets',
       'Absenteeism Time in Hours']

In [33]:
# rearranging columns
df = df[reordered_col_names]
df.head()

Unnamed: 0,reason_type_1,reason_type_2,reason_type_3,reason_type_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,23/07/2015,289,36,33,239.554,30,1,2,1,2


## Creating checkpoints

In [34]:
df_reason_mod = df.copy()  # copy, so later changes do not affect df
df_reason_mod.shape

(700, 14)

In [35]:
df_reason_mod.head()

Unnamed: 0,reason_type_1,reason_type_2,reason_type_3,reason_type_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,23/07/2015,289,36,33,239.554,30,1,2,1,2


## Now we manipulate the Date column

In [36]:
df_reason_mod['Date'] = pd.to_datetime(df_reason_mod['Date'], format= '%d/%m/%Y')

In [37]:
df_reason_mod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   reason_type_1              700 non-null    uint8         
 1   reason_type_2              700 non-null    uint8         
 2   reason_type_3              700 non-null    uint8         
 3   reason_type_4              700 non-null    uint8         
 4   Date                       700 non-null    datetime64[ns]
 5   Transportation Expense     700 non-null    int64         
 6   Distance to Work           700 non-null    int64         
 7   Age                        700 non-null    int64         
 8   Daily Work Load Average    700 non-null    float64       
 9   Body Mass Index            700 non-null    int64         
 10  Education                  700 non-null    int64         
 11  Children                   700 non-null    int64         
 12  Pets    

In [38]:
# Let's create a new month_value column from the dates.
df_reason_mod['month_value'] = df_reason_mod['Date'].dt.month

In [39]:
# Creating a weekday column from the dates (Monday = 0, Sunday = 6).
df_reason_mod['weekday'] = df_reason_mod['Date'].dt.weekday
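
As a quick sanity check of the two accessors, the sketch below parses the first three dates from the dataset with the same `%d/%m/%Y` format. The variable names are illustrative only.

```python
import pandas as pd

# First three dates from the dataset, in day/month/year order.
dates = pd.to_datetime(
    pd.Series(["07/07/2015", "14/07/2015", "15/07/2015"]), format="%d/%m/%Y"
)

months = dates.dt.month.tolist()      # calendar month, 1..12
weekdays = dates.dt.weekday.tolist()  # Monday = 0 ... Sunday = 6
day_names = dates.dt.day_name().tolist()

print(months)     # [7, 7, 7]
print(weekdays)   # [1, 1, 2]
print(day_names)  # ['Tuesday', 'Tuesday', 'Wednesday']
```

Passing the format explicitly matters here: without it, a date such as 07/07/2015 is unambiguous, but one such as 03/07/2015 could be read month-first.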

In [40]:
df_reason_mod.columns.values

array(['reason_type_1', 'reason_type_2', 'reason_type_3', 'reason_type_4',
       'Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'month_value',
       'weekday'], dtype=object)

In [41]:
reset_col_order = ['reason_type_1', 'reason_type_2', 'reason_type_3', 'reason_type_4',
       'Date','month_value', 'weekday', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']

In [42]:
df_reason_mod = df_reason_mod[reset_col_order].copy()  # explicit copy avoids a SettingWithCopyWarning on later in-place changes

In [43]:
df_reason_mod.drop(columns=['Date'], inplace= True)


In [44]:
df_reason_mod

Unnamed: 0,reason_type_1,reason_type_2,reason_type_3,reason_type_4,month_value,weekday,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,2,2,0,8
696,1,0,0,0,5,2,225,26,28,237.656,24,1,1,2,3
697,1,0,0,0,5,3,330,16,28,237.656,25,2,0,0,8
698,0,0,0,1,5,3,235,16,32,237.656,25,3,0,0,2


## Now let's examine the Education column

In [45]:
df_reason_mod['Education'].value_counts()

1    583
3     73
2     40
4      4
Name: Education, dtype: int64

Here 1 means high school, 2 means graduate, 3 means postgraduate and 4 means PhD. The value 1 appears far more often than the others (583 of 700 rows), so we can encode 1 as 0 and all other values as 1.

In [46]:
import numpy as np

In [47]:
# assign returns a new DataFrame, so no SettingWithCopyWarning is raised
df_reason_mod = df_reason_mod.assign(Education=np.where(df_reason_mod['Education'] == 1, 0, 1))


In [48]:
df_reason_mod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   reason_type_1              700 non-null    uint8  
 1   reason_type_2              700 non-null    uint8  
 2   reason_type_3              700 non-null    uint8  
 3   reason_type_4              700 non-null    uint8  
 4   month_value                700 non-null    int64  
 5   weekday                    700 non-null    int64  
 6   Transportation Expense     700 non-null    int64  
 7   Distance to Work           700 non-null    int64  
 8   Age                        700 non-null    int64  
 9   Daily Work Load Average    700 non-null    float64
 10  Body Mass Index            700 non-null    int64  
 11  Education                  700 non-null    int32  
 12  Children                   700 non-null    int64  
 13  Pets                       700 non-null    int64  

## Final checkpoint

In [49]:
df_preprocessed = df_reason_mod.copy()
df_preprocessed.head(10)

Unnamed: 0,reason_type_1,reason_type_2,reason_type_3,reason_type_4,month_value,weekday,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2
5,0,0,0,1,7,4,179,51,38,239.554,31,0,0,0,2
6,0,0,0,1,7,4,361,52,28,239.554,27,0,1,4,8
7,0,0,0,1,7,4,260,50,36,239.554,23,0,4,0,4
8,0,0,1,0,7,0,155,12,34,239.554,25,0,2,0,40
9,0,0,0,1,7,0,235,11,37,239.554,29,1,1,1,8


In [50]:
df_preprocessed.shape

(700, 15)

## Manipulating the 'Absenteeism Time in Hours' column (target variable)

Since we use logistic regression to predict absenteeism, we have to turn the target into two classes: moderately absent and excessively absent. To do that we take the median of the column and encode values at or below the median as 0 (moderately absent) and values above the median as 1 (excessively absent).

In [51]:
df_preprocessed['Absenteeism Time in Hours'].median()

3.0

In [52]:
df_preprocessed['excessive_absenteeism'] = np.where(df_preprocessed['Absenteeism Time in Hours'] > df_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)

In [53]:
df_preprocessed.head()

Unnamed: 0,reason_type_1,reason_type_2,reason_type_3,reason_type_4,month_value,weekday,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,excessive_absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


In [54]:
# dropping Absenteeism Time in Hours column

df_preprocessed.drop(columns= ['Absenteeism Time in Hours'], inplace= True)

In [55]:
df_preprocessed.head()

Unnamed: 0,reason_type_1,reason_type_2,reason_type_3,reason_type_4,month_value,weekday,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,excessive_absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0


In [56]:
#checking proportion of data in target column

df_preprocessed['excessive_absenteeism'].sum()/df_preprocessed.shape[0]*100

45.57142857142858
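
A class share of about 46% means the two classes are roughly balanced, which gives a simple reference point for the accuracy figures reported later in the notebook. The sketch below computes the accuracy of a trivial classifier that always predicts the more frequent class; the only input is the percentage printed above.

```python
# Share of class 1 (excessively absent), taken from the output above.
share_excessive = 45.57142857142858 / 100

# A classifier that always predicts the more frequent class gets this accuracy.
baseline_accuracy = max(share_excessive, 1 - share_excessive)
print(round(baseline_accuracy * 100, 2))  # 54.43
```

Any model worth keeping should score clearly above this baseline on unseen data.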

## Let's split the data into input and target variables

In [57]:
input_df = df_preprocessed.iloc[:,:14]

In [58]:
input_df.drop(columns=['weekday','Distance to Work','Daily Work Load Average'], inplace= True)

In [59]:
input_df.head()

Unnamed: 0,reason_type_1,reason_type_2,reason_type_3,reason_type_4,month_value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,289,33,30,0,2,1
1,0,0,0,0,7,118,50,31,0,1,0
2,0,0,0,1,7,179,38,31,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0
4,0,0,0,1,7,289,33,30,0,2,1


In [60]:
target_df = df_preprocessed.iloc[:,-1].to_numpy()  # 1-D NumPy array; Series.ravel() is deprecated in recent pandas

### Scaling the input variables

In [61]:
from sklearn.preprocessing import StandardScaler

In [62]:
input_df.columns.values

array(['reason_type_1', 'reason_type_2', 'reason_type_3', 'reason_type_4',
       'month_value', 'Transportation Expense', 'Age', 'Body Mass Index',
       'Education', 'Children', 'Pets'], dtype=object)

In [63]:
dic = dict(enumerate(input_df.columns.values))
dic

{0: 'reason_type_1',
 1: 'reason_type_2',
 2: 'reason_type_3',
 3: 'reason_type_4',
 4: 'month_value',
 5: 'Transportation Expense',
 6: 'Age',
 7: 'Body Mass Index',
 8: 'Education',
 9: 'Children',
 10: 'Pets'}

In [104]:
columns_to_scale = [4,5,6,7,9,10]

In [105]:
#columns_to_scale = [x for x in input_df.columns.values if x not in 
   #                 ['reason_type_1', 'reason_type_2', 'reason_type_3', 'reason_type_4', 'Education']]

In [106]:
columns_to_scale

[4, 5, 6, 7, 9, 10]

In [108]:
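# The cell defining this list is not shown; this reconstruction matches the output of In [128]:
# scaled columns first, then the passthrough columns in their original order.
columns_ORDER_after_scale = [dic[i] for i in columns_to_scale] + [dic[i] for i in dic if i not in columns_to_scale]
len(columns_ORDER_after_scale)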

11

In [109]:
#Importing column transformer class to scale a part of our dataset
from sklearn.compose import ColumnTransformer

In [110]:
transformer = ColumnTransformer(transformers=[('standardization', StandardScaler(), columns_to_scale)], remainder='passthrough')

In [111]:
X = transformer.fit_transform(input_df)

In [112]:
pd.DataFrame(X).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.182726,1.005844,-0.536062,0.767431,0.880469,0.268487,0.0,0.0,0.0,1.0,0.0
1,0.182726,-1.574681,2.130803,1.002633,-0.01928,-0.58969,0.0,0.0,0.0,0.0,0.0
2,0.182726,-0.654143,0.24831,1.002633,-0.91903,-0.58969,0.0,0.0,0.0,1.0,0.0
3,0.182726,0.854936,0.405184,-0.643782,0.880469,-0.58969,1.0,0.0,0.0,0.0,0.0
4,0.182726,1.005844,-0.536062,0.767431,0.880469,0.268487,0.0,0.0,0.0,1.0,0.0


In [113]:
#sc = StandardScaler()

In [114]:
#X = sc.fit_transform(input_df)

In [115]:
#X.shape

### Splitting the data into train and test sets

In [116]:
from sklearn.model_selection import train_test_split

In [117]:
x_train, x_test, y_train, y_test = train_test_split(X, target_df, test_size=0.2, random_state= 20)

In [118]:
print(x_train.shape,y_train.shape)

(560, 11) (560,)


In [119]:
print(x_test.shape,y_test.shape)

(140, 11) (140,)


### Let's train our model

In [120]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [121]:
reg = LogisticRegression()

In [122]:
reg.fit(x_train,y_train)

In [123]:
reg.score(x_train,y_train)

0.7732142857142857

That means the model's accuracy on the training data is about 77.3%.

### Finding the intercept and coefficient values

In [124]:
reg.intercept_

array([-1.6474549])

In [125]:
reg.coef_

array([[ 0.1589299 ,  0.60528415, -0.16989096,  0.27981088,  0.34826214,
        -0.27739602,  2.80019733,  0.95188356,  3.11555338,  0.83900082,
        -0.21053312]])

In [126]:
reg.coef_.shape

(1, 11)

In [127]:
coefficients = np.transpose(reg.coef_)

In [128]:
columns_ORDER_after_scale

['month_value',
 'Transportation Expense',
 'Age',
 'Body Mass Index',
 'Children',
 'Pets',
 'reason_type_1',
 'reason_type_2',
 'reason_type_3',
 'reason_type_4',
 'Education']

In [129]:
feature_name = columns_ORDER_after_scale

In [130]:
# now we will form a data frame for intercept and coefficient values

summary_table = pd.DataFrame()

In [131]:
summary_table['feature_name'] = feature_name

In [132]:
summary_table['coefficients'] = coefficients

In [133]:
summary_table

Unnamed: 0,feature_name,coefficients
0,month_value,0.15893
1,Transportation Expense,0.605284
2,Age,-0.169891
3,Body Mass Index,0.279811
4,Children,0.348262
5,Pets,-0.277396
6,reason_type_1,2.800197
7,reason_type_2,0.951884
8,reason_type_3,3.115553
9,reason_type_4,0.839001
10,Education,-0.210533


In [134]:
summary_table.iloc[4,0]

'Children'

In [135]:
summary_table.index = summary_table.index + 1

In [136]:
summary_table.loc[0] = ['Intercept',reg.intercept_[0]]

In [137]:
summary_table = summary_table.sort_index()

In [138]:
summary_table

Unnamed: 0,feature_name,coefficients
0,Intercept,-1.647455
1,month_value,0.15893
2,Transportation Expense,0.605284
3,Age,-0.169891
4,Body Mass Index,0.279811
5,Children,0.348262
6,Pets,-0.277396
7,reason_type_1,2.800197
8,reason_type_2,0.951884
9,reason_type_3,3.115553
10,reason_type_4,0.839001
11,Education,-0.210533


### Interpreting coefficients 

In [139]:
summary_table['odds_ratio'] = np.exp(summary_table.coefficients)

In [140]:
summary_table = summary_table.sort_values('odds_ratio', ascending= False)

In [141]:
summary_table

Unnamed: 0,feature_name,coefficients,odds_ratio
9,reason_type_3,3.115553,22.545903
7,reason_type_1,2.800197,16.447892
8,reason_type_2,0.951884,2.590585
10,reason_type_4,0.839001,2.314054
2,Transportation Expense,0.605284,1.831773
5,Children,0.348262,1.416604
4,Body Mass Index,0.279811,1.32288
1,month_value,0.15893,1.172256
3,Age,-0.169891,0.843757
11,Education,-0.210533,0.810152
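
To read the table: the odds ratio is `exp(coefficient)`, the factor by which the odds of excessive absenteeism are multiplied when the feature increases by one unit, holding the other features fixed. For the unscaled dummy columns (`reason_type_*`, `Education`) one unit means switching from 0 to 1; for the standardized columns it means an increase of one standard deviation. A minimal sketch using three coefficients copied from the table above:

```python
import numpy as np

# Coefficients copied from the summary table above.
coefs = {
    "reason_type_3": 3.115553,
    "Transportation Expense": 0.605284,
    "Age": -0.169891,
}

odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    # (ratio - 1) * 100 is the percentage change in the odds per unit increase.
    print(f"{name}: odds ratio {ratio:.3f}, change in odds {100 * (ratio - 1):+.1f}%")
```

An odds ratio close to 1 (a coefficient close to 0) means the feature barely changes the odds, which is one way to spot candidates for removal.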


## Testing the model

In [142]:
reg.score(x_test, y_test)

0.75

So the model reaches an accuracy of 75% on the test data.
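
Accuracy alone does not show which kind of error the model makes. A confusion matrix splits the predictions into true and false positives and negatives. The sketch below uses small, hypothetical label arrays purely for illustration; with the notebook's objects the equivalent call would be `metrics.confusion_matrix(y_test, reg.predict(x_test))`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, for illustration only (not the notebook's test set).
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```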

In [143]:
predicted_probability = reg.predict_proba(x_test)
predicted_probability

array([[0.71340413, 0.28659587],
       [0.58724228, 0.41275772],
       [0.44020821, 0.55979179],
       [0.78159464, 0.21840536],
       [0.08410854, 0.91589146],
       [0.33487603, 0.66512397],
       [0.29984576, 0.70015424],
       [0.13103971, 0.86896029],
       [0.78625404, 0.21374596],
       [0.74903632, 0.25096368],
       [0.49397598, 0.50602402],
       [0.22484913, 0.77515087],
       [0.07129151, 0.92870849],
       [0.73178133, 0.26821867],
       [0.30934135, 0.69065865],
       [0.5471671 , 0.4528329 ],
       [0.55052275, 0.44947725],
       [0.5392707 , 0.4607293 ],
       [0.40201117, 0.59798883],
       [0.05361575, 0.94638425],
       [0.7003009 , 0.2996991 ],
       [0.78159464, 0.21840536],
       [0.42037128, 0.57962872],
       [0.42037128, 0.57962872],
       [0.24783565, 0.75216435],
       [0.74566259, 0.25433741],
       [0.51017274, 0.48982726],
       [0.85690195, 0.14309805],
       [0.20349733, 0.79650267],
       [0.78159464, 0.21840536],
       [0.

In [144]:
predicted_probability.shape

(140, 2)

predicted_probability is an array of shape (140, 2): the first column gives the probability of class 0 (moderately absent) and the second column gives the probability of class 1 (excessively absent).
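
The class labels follow from these probabilities by applying a cut-off to the second column. The sketch below uses the first five rows of the output above; the 0.5 cut-off corresponds to what `predict` does for a binary logistic regression, while the 0.6 cut-off is a hypothetical alternative shown only to illustrate the effect of a stricter rule.

```python
import numpy as np

# First five rows of predicted_probability from the output above.
proba = np.array([
    [0.71340413, 0.28659587],
    [0.58724228, 0.41275772],
    [0.44020821, 0.55979179],
    [0.78159464, 0.21840536],
    [0.08410854, 0.91589146],
])

# Each row sums to 1 because the two classes are exhaustive.
row_sums = proba.sum(axis=1)

# Default decision rule: class 1 when P(class 1) > 0.5.
default_classes = (proba[:, 1] > 0.5).astype(int)

# A stricter, hypothetical cut-off flags fewer observations as excessive.
strict_classes = (proba[:, 1] > 0.6).astype(int)

print(default_classes)  # [0 0 1 0 1]
print(strict_classes)   # [0 0 0 0 1]
```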

## Save the Model

In [145]:
import pickle

In [146]:
with open('model','wb') as file:
    pickle.dump(reg, file)

In [147]:
with open('scaler','wb') as file:
    pickle.dump(transformer, file)
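
To use the saved objects later, they are loaded back with `pickle.load`. The sketch below demonstrates the round trip on a small, hypothetical model written to a temporary directory, so it does not touch the `model` and `scaler` files created above. In practice the loaded transformer would be applied to new data with the same column layout before calling `predict`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the notebook's scaled inputs.
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

# Save and reload in a temporary directory so nothing is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model")
    with open(path, "wb") as file:
        pickle.dump(model, file)
    with open(path, "rb") as file:
        loaded_model = pickle.load(file)

# The reloaded model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X_toy), loaded_model.predict(X_toy))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.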