# Mode of work - Are we in the future?
In March 2020, the world as we knew it changed, and with it the way people work. Millions of workers had to adapt to what was then described as a ‘new way of working’. Three years on, people have begun to settle into one of three broad work models: working from the office, working from home, or a hybrid of the two. We are a group of students from the Indian Institute of Management, Kozhikode, working on an academic submission. We collected survey data to study individual preferences among these work models and the factors that influence those choices across regions.

In [957]:
import pandas as pd
import numpy as np
from scipy import stats
import statistics as st
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from scipy.stats import pearsonr
from statsmodels.stats.weightstats import ztest
import plotly.io as pio
pio.renderers.default = "notebook_connected"

# Load the survey responses from the CSV export
study_records = pd.read_csv('qt-survey-updated-2023-03-17.csv')

#Display the first five rows of study_records using the .head() method
print(study_records.head())

   age  gender marital_status  ... entry_id entry_status           created_at
0   48    Male        Married  ...      280       unread  2023-02-25 08:11:19
1   46  Female        Married  ...      279       unread  2023-02-24 17:39:37
2   48    Male        Married  ...      278       unread  2023-02-21 23:10:20
3   39  Female        Married  ...      277       unread  2023-02-19 22:04:31
4   32  Female        Married  ...      276       unread  2023-02-16 04:51:20

[5 rows x 30 columns]


In [958]:
#Use .info() to inspect the DataFrame study_records
print(study_records.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 276 entries, 0 to 275
Data columns (total 30 columns):
 #   Column                                                      Non-Null Count  Dtype  
---  ------                                                      --------------  -----  
 0   age                                                         276 non-null    int64  
 1   gender                                                      276 non-null    object 
 2   marital_status                                              276 non-null    object 
 3   country                                                     276 non-null    object 
 4   place_you_live_in                                           276 non-null    object 
 5   no_of_family_members_living_with                            276 non-null    int64  
 6   employment_type                                             276 non-null    object 
 7   current_industry                                            276 non-null    object 
 8   

In [959]:
# print column names - iterating the columns
for col in study_records.columns:
    print(col)

age
gender
marital_status
country
place_you_live_in
no_of_family_members_living_with
employment_type
current_industry
current_industry_other
Number of employees in your organization
total_work_experience
work_experience_with_current_company
current_role_category
current_role_category_other
current_annual_salary
mode_of_working
how_many_days_you_go_to_office
how_many_days_personal_choice_or_prescribed_by_company
if_personal_choice_how_many_days_willing_to_work_in_office
total_commute_time_in_minutes
rating_wfh_hybrid_wfo
factors_influenced_wfo
factors_influenced_wfo_other
factors_influenced_hybrid
factors_influenced_hybrid_other
factors_influenced_wfh
factors_influenced_wfh_other
entry_id
entry_status
created_at


In [960]:
# shape of the data frame
study_records.shape

(276, 30)

In [961]:
#Finding Null Values
print(study_records.isnull().sum())

age                                                             0
gender                                                          0
marital_status                                                  0
country                                                         0
place_you_live_in                                               0
no_of_family_members_living_with                                0
employment_type                                                 0
current_industry                                                0
current_industry_other                                        247
Number of employees in your organization                        0
total_work_experience                                           0
work_experience_with_current_company                            0
current_role_category                                           0
current_role_category_other                                   260
current_annual_salary                                          73
mode_of_wo

In [962]:
# Data type of each column
study_records.dtypes

age                                                             int64
gender                                                         object
marital_status                                                 object
country                                                        object
place_you_live_in                                              object
no_of_family_members_living_with                                int64
employment_type                                                object
current_industry                                               object
current_industry_other                                         object
Number of employees in your organization                       object
total_work_experience                                         float64
work_experience_with_current_company                          float64
current_role_category                                          object
current_role_category_other                                    object
current_annual_salar

# Data Validation - Total Cells vs Missing %

In [963]:
#Find % of missing data
missing_count = study_records.isnull().sum() #number of missing
total_cells = np.prod(study_records.shape) # number of cells (rows x cols); np.product is deprecated and was removed in NumPy 2.0
total_missing = missing_count.sum()
missing_percent = (total_missing*100)/total_cells

print('Total : ', total_cells)
print('Total missing : ', total_missing)
print('Missing Percentage: ', missing_percent, '%')

Total :  8280
Total missing :  2181
Missing Percentage:  26.340579710144926 %
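The overall figure hides which columns carry the gaps. A minimal sketch of a per-column breakdown follows, using a small made-up frame (`toy`) rather than the survey data; the column names are borrowed from the survey only for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for study_records (values are made up).
toy = pd.DataFrame({
    "age": [48, 46, 39, 32],
    "current_annual_salary": [70.0, np.nan, np.nan, 50.0],
    "current_industry_other": [np.nan, np.nan, np.nan, "Defence"],
})

# isnull().mean() gives the fraction of missing cells in each column.
missing_by_column = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_by_column)

# DataFrame.size is rows x columns, i.e. the total number of cells.
overall_missing = toy.isnull().sum().sum() * 100 / toy.size
print(overall_missing)
```

Sorting the result puts the sparsest columns first, which is usually where optional free-text fields such as the "other" columns end up.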


# #1 Descriptive Statistics - Pie chart, Bar chart, Histogram and Box plots

# Age
- Interested in descriptive statistics

In [964]:
study_records['age'].unique()
study_records['age'].describe()

count    276.000000
mean      33.942029
std        6.996382
min       18.000000
25%       29.000000
50%       34.000000
75%       38.000000
max       69.000000
Name: age, dtype: float64

### Median Age

In [965]:
np.median(study_records['age'])

34.0

### Mode Age

In [966]:
stats.mode(study_records['age'])

ModeResult(mode=array([32]), count=array([23]))

# Age range

In [967]:
study_records['age_range'] = 0
study_records['age_range']= np.where((study_records['age']>=18) & (study_records['age']<=19), '18 - 19 years', study_records.age_range)
study_records['age_range']= np.where((study_records['age']>=20) & (study_records['age']<=24), '20 - 24 years', study_records.age_range)
study_records['age_range']= np.where((study_records['age']>=25) & (study_records['age']<=29), '25 - 29 years', study_records.age_range)
study_records['age_range']= np.where((study_records['age']>=30) & (study_records['age']<=34), '30 - 34 years', study_records.age_range)
study_records['age_range']= np.where((study_records['age']>=35) & (study_records['age']<=39), '35 - 39 years', study_records.age_range)
study_records['age_range']= np.where((study_records['age']>=40) & (study_records['age']<=44), '40 - 44 years', study_records.age_range)
study_records['age_range']= np.where((study_records['age']>=45) & (study_records['age']<=49), '45 - 49 years', study_records.age_range)
study_records['age_range']= np.where((study_records['age']>=50), '50 and above years', study_records.age_range)
age_range_df = study_records.groupby('age_range')['age_range'].count().to_frame('count').reset_index()
age_range_df

Unnamed: 0,age_range,count
0,18 - 19 years,4
1,20 - 24 years,9
2,25 - 29 years,62
3,30 - 34 years,74
4,35 - 39 years,79
5,40 - 44 years,36
6,45 - 49 years,6
7,50 and above years,6
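The same binning can be written more compactly with `pandas.cut`. The sketch below is an alternative, not the notebook's own method; the ages are made up, and the labels are written to match the bin bounds.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 24, 25, 34, 40, 44, 45, 69])  # made-up ages

# Left-closed bins: [18, 20), [20, 25), ..., [50, inf)
edges = [18, 20, 25, 30, 35, 40, 45, 50, np.inf]
labels = ["18 - 19 years", "20 - 24 years", "25 - 29 years", "30 - 34 years",
          "35 - 39 years", "40 - 44 years", "45 - 49 years", "50 and above years"]

# right=False makes each bin include its lower edge and exclude its upper edge.
age_range = pd.cut(ages, bins=edges, labels=labels, right=False)
print(age_range.value_counts(sort=False))
```

Because the edges and labels sit side by side in two lists, a mismatch between a bound and its label is easier to spot than in a chain of `np.where` calls.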


In [968]:
fig = px.pie(age_range_df, values='count', names='age_range', title='Age range')
fig.show()

In [969]:
fig = px.bar(age_range_df, x='age_range', y='count')
fig.show()

# Gender
- Interested in the number of participants of each gender in the survey

In [970]:
study_records['gender'].unique()
#count number of each gender
study_records.groupby('gender')['gender'].count()

gender
Female         60
Male          214
Non-binary      2
Name: gender, dtype: int64
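Raw counts are easier to compare as shares. The sketch below rebuilds a series from the three counts printed above (60, 214 and 2) and uses `value_counts(normalize=True)`; the reconstructed series is only a stand-in for the real column.

```python
import pandas as pd

# Stand-in series rebuilt from the counts shown in the output above.
gender = pd.Series(["Female"] * 60 + ["Male"] * 214 + ["Non-binary"] * 2)

# normalize=True returns proportions instead of counts.
share = gender.value_counts(normalize=True) * 100
print(share.round(1))
```

On these counts roughly three quarters of the respondents are male, which is worth keeping in mind when reading any gender comparison later in the analysis.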

In [971]:
study_records.isnull().sum()['gender']
study_records['gender'].describe()

count      276
unique       3
top       Male
freq       214
Name: gender, dtype: object

In [972]:
gender_df = study_records.groupby('gender')['gender'].count().to_frame('count').reset_index()
gender_df

Unnamed: 0,gender,count
0,Female,60
1,Male,214
2,Non-binary,2


In [973]:
fig = px.pie(gender_df, values='count', names='gender', title='Gender diversity')
fig.show()

In [974]:
fig = px.bar(gender_df, x='gender', y='count')
fig.show()

# Marital Status

In [975]:

study_records['marital_status'].unique()
#count number of each marital_status
study_records.groupby('marital_status')['marital_status'].count()

marital_status
Married                   185
Registered Partnership      1
Seperated                   2
Single                     87
Widowed                     1
Name: marital_status, dtype: int64

In [976]:
study_records['marital_status'].describe()

count         276
unique          5
top       Married
freq          185
Name: marital_status, dtype: object

In [977]:
marital_status_df = study_records.groupby('marital_status')['marital_status'].count().to_frame('count').reset_index()
marital_status_df

Unnamed: 0,marital_status,count
0,Married,185
1,Registered Partnership,1
2,Seperated,2
3,Single,87
4,Widowed,1


In [978]:
fig = px.pie(marital_status_df, values='count', names='marital_status', title='Marital status')
fig.show()

In [979]:
fig = px.bar(marital_status_df, x='marital_status', y='count')
fig.show()

# Country

In [980]:
study_records['country'].unique()
#count number of each country
study_records.groupby('country')['country'].count()

country
Australia               1
India                 267
Japan                   1
Luxembourg              1
Taiwan                  1
United States (US)      5
Name: country, dtype: int64

In [981]:
study_records['country'].describe()

count       276
unique        6
top       India
freq        267
Name: country, dtype: object

In [982]:
country_df = study_records.groupby('country')['country'].count().to_frame('count').reset_index()
country_df

Unnamed: 0,country,count
0,Australia,1
1,India,267
2,Japan,1
3,Luxembourg,1
4,Taiwan,1
5,United States (US),5


In [983]:
fig = px.pie(country_df, values='count', names='country', title='Country')
fig.show()

# Place you live in

In [984]:

study_records['place_you_live_in'].unique()
#count number of each place_you_live_in
study_records.groupby('place_you_live_in')['place_you_live_in'].count()

place_you_live_in
Metro         159
Semi Urban      6
Town           24
Urban          80
Village         7
Name: place_you_live_in, dtype: int64

In [985]:
study_records['place_you_live_in'].describe()

count       276
unique        5
top       Metro
freq        159
Name: place_you_live_in, dtype: object

In [986]:
place_you_live_in_df = study_records.groupby('place_you_live_in')['place_you_live_in'].count().to_frame('count').reset_index()
place_you_live_in_df

fig = px.pie(place_you_live_in_df, values='count', names='place_you_live_in', title='Place you live in')
fig.show()

In [987]:
fig = px.bar(place_you_live_in_df, x='place_you_live_in', y='count')
fig.show()

# No of family members you are living with

In [988]:

study_records['no_of_family_members_living_with'].describe()

count    276.000000
mean       2.818841
std        2.282695
min        0.000000
25%        2.000000
50%        3.000000
75%        4.000000
max       27.000000
Name: no_of_family_members_living_with, dtype: float64

### Mean family members

In [989]:
np.mean(study_records['no_of_family_members_living_with'])

2.818840579710145

### Median family members

In [990]:
np.median(study_records['no_of_family_members_living_with'])

3.0

### Mode family members

In [991]:
stats.mode(study_records['no_of_family_members_living_with'])

ModeResult(mode=array([3]), count=array([70]))

In [992]:
no_of_family_members_living_with_df = study_records.groupby('no_of_family_members_living_with')['no_of_family_members_living_with'].count().to_frame('count').reset_index()
no_of_family_members_living_with_df
fig = px.bar(no_of_family_members_living_with_df, x='no_of_family_members_living_with', y='count')
fig.show()

In [993]:
fig = px.pie(no_of_family_members_living_with_df, values='count', names='no_of_family_members_living_with', title='No of family members living with')
fig.show()

# Employment Type

In [994]:
study_records['employment_type'].unique()
#count number of each employment_type
study_records.groupby('employment_type')['employment_type'].count()

employment_type
Government        28
Private          233
Self Employed     15
Name: employment_type, dtype: int64

In [995]:
study_records['employment_type'].describe()

count         276
unique          3
top       Private
freq          233
Name: employment_type, dtype: object

In [996]:
employment_type_df = study_records.groupby('employment_type')['employment_type'].count().to_frame('count').reset_index()
employment_type_df

Unnamed: 0,employment_type,count
0,Government,28
1,Private,233
2,Self Employed,15


In [997]:

fig = px.pie(employment_type_df, values='count', names='employment_type', title='Employment Type')
fig.show()

In [998]:
fig = px.bar(employment_type_df, x='employment_type', y='count')
fig.show()

# Current Industry

In [999]:
 
study_records['current_industry'].unique()
#count number of each current_industry
study_records.groupby('current_industry')['current_industry'].count()

current_industry
E-commerce         8
Educational        7
Entertainment      1
Finance           22
Healthcare        14
Hospitality        3
Manufacturing     29
Other             33
Real estate        2
Retail             2
Technology       155
Name: current_industry, dtype: int64

In [1000]:
study_records['current_industry'].describe()

count            276
unique            11
top       Technology
freq             155
Name: current_industry, dtype: object

In [1001]:
current_industry_df = study_records.groupby('current_industry')['current_industry'].count().to_frame('count').reset_index()
current_industry_df

fig = px.pie(current_industry_df, values='count', names='current_industry', title='Current Industry')
fig.show()

In [1002]:
fig = px.bar(current_industry_df, x='current_industry', y='count')
fig.show()

### Current industry other than provided 
To do: map these free-text industries to an existing category, or create new categories where none fits, and merge them into the main current_industry column.

In [1003]:
study_records['current_industry_other'].unique()
#count number of each current_industry_other
study_records.groupby('current_industry_other')['current_industry_other'].count()

current_industry_other
Administration                            1
Advisory                                  1
Consultancy                               1
Consulting                                2
Content Writer for multiple industries    1
Def.                                      1
Defence                                   3
Engineering                               2
Engineering services                      1
Healthcare and lifescience                1
Information Technology                    1
Infrastructure                            1
Intellectual Property                     1
Ip                                        1
Legal                                     1
Logistics                                 1
Oil                                       1
Oil and Gas                               1
Railway (Transport)                       1
Regulatory                                1
Renewable Energy                          1
Sea Food Processing & Exports             1
Telecom  

In [1004]:
current_industry_other_df = study_records.groupby('current_industry_other')['current_industry_other'].count().to_frame('count').reset_index()
current_industry_other_df

fig = px.pie(current_industry_other_df, values='count', names='current_industry_other', title='Current Industry Other')
fig.show()
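One way to approach the to-do above is a lookup table from free-text answers to categories. The sketch below is an assumption-laden illustration: the toy frame, the `other_to_industry` mapping and the `industry_clean` column are all hypothetical, and which answers belong together remains a judgment call.

```python
import numpy as np
import pandas as pd

# Toy rows; the free-text values are taken from the listing above.
toy = pd.DataFrame({
    "current_industry": ["Technology", "Other", "Other", "Other", "Other"],
    "current_industry_other": [np.nan, "Def.", "Defence", "Information Technology", "Logistics"],
})

# Hypothetical mapping from normalised free text to a target category.
other_to_industry = {
    "def.": "Defence",
    "defence": "Defence",
    "information technology": "Technology",
}

# Normalise whitespace and case before looking up the mapping.
cleaned = toy["current_industry_other"].str.strip().str.lower()
mapped = cleaned.map(other_to_industry)

# Prefer the mapped value, then the raw free text, then the original category.
toy["industry_clean"] = mapped.fillna(toy["current_industry_other"]).fillna(toy["current_industry"])
print(toy["industry_clean"].value_counts())
```

Writing to a new column keeps the original answers intact, so the mapping can be revised without reloading the data.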

# Number of employees in your organization

In [1005]:

study_records['Number of employees in your organization'].unique()
#count number of each Number of employees in your organization
study_records.groupby('Number of employees in your organization')['Number of employees in your organization'].count()

Number of employees in your organization
1-300          37
10000+        144
1001-10000     65
301-1000       30
Name: Number of employees in your organization, dtype: int64

In [1006]:
study_records['Number of employees in your organization'].describe()

count        276
unique         4
top       10000+
freq         144
Name: Number of employees in your organization, dtype: object

In [1007]:
no_of_employees_org_df = study_records.groupby('Number of employees in your organization')['Number of employees in your organization'].count().to_frame('count').reset_index()
no_of_employees_org_df
fig = px.bar(no_of_employees_org_df, x='Number of employees in your organization', y='count')
fig.show()
fig = px.pie(no_of_employees_org_df, values='count', names='Number of employees in your organization', title='No of Employees in the organization')
fig.show()

# Total Work Experience

In [1008]:

study_records['total_work_experience'].describe()

count    276.000000
mean      10.936123
std        6.370373
min        0.000000
25%        6.000000
50%       11.000000
75%       15.000000
max       49.000000
Name: total_work_experience, dtype: float64

## Box plot for Total work experience - Finding outliers

In [1009]:
fig = px.box(study_records, y="total_work_experience")
fig.show()

In [1010]:
np.median(study_records['total_work_experience'])

11.0

In [1011]:
st.mode(study_records['total_work_experience'])

12.0

In [1012]:
print(study_records[study_records['total_work_experience']>10]['age'].count())
print(study_records[study_records['total_work_experience']>15]['age'].count())
print(study_records[study_records['total_work_experience']>20]['age'].count())
print(study_records[study_records['total_work_experience']>25]['age'].count())

141
53
12
5


In [1013]:
study_records[study_records['total_work_experience']>25]

Unnamed: 0,age,gender,marital_status,country,place_you_live_in,no_of_family_members_living_with,employment_type,current_industry,current_industry_other,Number of employees in your organization,total_work_experience,work_experience_with_current_company,current_role_category,current_role_category_other,current_annual_salary,mode_of_working,how_many_days_you_go_to_office,how_many_days_personal_choice_or_prescribed_by_company,if_personal_choice_how_many_days_willing_to_work_in_office,total_commute_time_in_minutes,rating_wfh_hybrid_wfo,factors_influenced_wfo,factors_influenced_wfo_other,factors_influenced_hybrid,factors_influenced_hybrid_other,factors_influenced_wfh,factors_influenced_wfh_other,entry_id,entry_status,created_at,age_range
9,60,Male,Married,India,Metro,1,Private,Manufacturing,,301-1000,38.0,5.0,Top Level Management,,48.0,Complete work from office,6.0,Prescribed by company,5.0,210,4,,,Flexibility - Better Control over work hours a...,,,,271,unread,2023-02-11 22:56:20,50 and above years
29,56,Male,Married,United States (US),Urban,2,Private,Technology,,10000+,30.0,7.0,Mid-Level Management,,,Complete work from home,,,,0,10,,,,,"Better focus on tasks & less distractions, Abl...",,251,unread,2023-02-08 17:37:21,50 and above years
67,59,Male,Married,India,Metro,2,Government,Healthcare,,1-300,30.0,24.0,Mid-Level Management,,30.0,Complete work from office,7.0,Prescribed by company,3.0,10,5,,,Flexibility - Better Control over work hours a...,,,,213,unread,2023-02-07 17:28:13,50 and above years
70,69,Non-binary,Widowed,India,Metro,7,Self Employed,Real estate,,1-300,49.0,49.0,Top Level Management,,100.0,Complete work from office,5.0,Personal choice,,10,1,"Better interpersonal connect, Better infrastru...",,,,,,210,unread,2023-02-07 17:19:02,50 and above years
221,23,Non-binary,Registered Partnership,India,Town,27,Private,Educational,,10000+,34.0,29.0,Associates,,421.0,Complete work from home,,,,50,10,,,,,"Avoid the long commute to office, Work life ba...",,59,unread,2023-02-06 11:50:57,20 - 24 years


In [1014]:
study_records[study_records['total_work_experience']<=2]

Unnamed: 0,age,gender,marital_status,country,place_you_live_in,no_of_family_members_living_with,employment_type,current_industry,current_industry_other,Number of employees in your organization,total_work_experience,work_experience_with_current_company,current_role_category,current_role_category_other,current_annual_salary,mode_of_working,how_many_days_you_go_to_office,how_many_days_personal_choice_or_prescribed_by_company,if_personal_choice_how_many_days_willing_to_work_in_office,total_commute_time_in_minutes,rating_wfh_hybrid_wfo,factors_influenced_wfo,factors_influenced_wfo_other,factors_influenced_hybrid,factors_influenced_hybrid_other,factors_influenced_wfh,factors_influenced_wfh_other,entry_id,entry_status,created_at,age_range
21,25,Male,Single,India,Urban,5,Private,Healthcare,,10000+,1.5,0.5,Associates,,7.5,Hybrid,2.0,Prescribed by company,1.0,240,4,,,Flexibility - Better Control over work hours a...,,,,259,unread,2023-02-09 08:16:29,25 - 29 years
34,25,Male,Single,India,Metro,2,Private,Technology,,1001-10000,2.0,2.0,Associates,,7.0,Hybrid,2.0,Personal choice,,25,5,,,Flexibility - Better Control over work hours a...,,,,246,unread,2023-02-08 13:34:17,25 - 29 years
43,25,Male,Single,India,Metro,0,Private,Finance,,10000+,0.67,0.67,Frontline Managers,,,Hybrid,2.0,Personal choice,,5,10,,,,,"Better focus on tasks & less distractions, Get...",,237,unread,2023-02-08 04:00:12,25 - 29 years
59,25,Female,Single,India,Metro,0,Private,E-commerce,,1001-10000,2.0,1.0,Top Level Management,,11.0,Complete work from office,5.0,Prescribed by company,3.0,60,8,,,,,"Work life balance (Ex., better work life seper...",,221,unread,2023-02-07 17:58:17,25 - 29 years
62,24,Male,Single,India,Urban,3,Private,Technology,,10000+,2.0,2.0,Others,Software developer,15.0,Complete work from home,,,,0,10,,,,,"Work life balance (Ex., better work life seper...",,218,unread,2023-02-07 17:49:32,20 - 24 years
71,24,Male,Single,India,Urban,0,Private,Technology,,1001-10000,2.0,2.0,Others,Software Developer,12.0,Hybrid,2.0,Prescribed by company,3.0,30,7,,,Flexibility - Better Control over work hours a...,,,,209,unread,2023-02-07 17:14:37,20 - 24 years
73,26,Male,Single,India,Metro,0,Private,Finance,,301-1000,1.0,1.0,Associates,,6.5,Complete work from office,5.0,Personal choice,,10,7,,,Flexibility - Better Control over work hours a...,,,,207,unread,2023-02-07 17:12:35,25 - 29 years
74,25,Female,Single,India,Urban,0,Private,Finance,,10000+,0.0,1.0,Associates,,11.0,Hybrid,5.0,Prescribed by company,5.0,40,5,,,Flexibility - Better Control over work hours a...,,,,206,unread,2023-02-07 17:11:37,25 - 29 years
76,25,Male,Single,India,Town,4,Private,Technology,,10000+,2.0,2.0,Others,SDE,10.0,Hybrid,1.0,Prescribed by company,1.0,60,9,,,,,"Avoid the long commute to office, Work life ba...",,204,unread,2023-02-07 17:08:16,25 - 29 years
102,20,Male,Single,India,Town,3,Private,Technology,,1-300,0.0,0.6,Others,Intern,,Complete work from office,5.0,Prescribed by company,4.0,480,3,"Better interpersonal connect, Better infrastru...",,,,,,178,unread,2023-02-07 06:53:19,20 - 24 years


In [1015]:
# Bin experience into left-closed ranges so that fractional values (e.g. 2.5 years)
# and boundary values (15, 20) each fall into exactly one range
exp_bins = [0, 3, 6, 11, 16, 21, np.inf]
exp_labels = ['0 - 2 years', '3 - 5 years', '6 - 10 years', '11 - 15 years', '16 - 20 years', '21 and above years']
study_records['work_exp_range'] = pd.cut(study_records['total_work_experience'], bins=exp_bins, labels=exp_labels, right=False).astype(str)
work_exp_range_df = study_records.groupby('work_exp_range')['work_exp_range'].count().to_frame('count').reset_index()
work_exp_range_df

fig = px.pie(work_exp_range_df, values='count', names='work_exp_range', title='Work Experience')
fig.show()

total_work_experience_df = study_records.groupby('total_work_experience')['total_work_experience'].count().to_frame('count').reset_index()
total_work_experience_df
fig = px.bar(total_work_experience_df, x='total_work_experience', y='count')
fig.show()



# Work Experience with Current Company

In [1016]:

study_records['work_experience_with_current_company'].describe()

count    276.000000
mean       5.226703
std        5.699306
min        0.000000
25%        1.000000
50%        3.000000
75%        7.000000
max       49.000000
Name: work_experience_with_current_company, dtype: float64

In [1017]:
np.median(study_records['work_experience_with_current_company'])

3.0

In [1018]:
st.mode(study_records['work_experience_with_current_company'])

1.0

In [1019]:
print(study_records[study_records['work_experience_with_current_company']>10]['age'].count())
print(study_records[study_records['work_experience_with_current_company']>15]['age'].count())
print(study_records[study_records['work_experience_with_current_company']>20]['age'].count())

36
12
6


## Records with wrong data where work_experience_with_current_company > total_work_experience

In [1020]:
study_records[study_records['work_experience_with_current_company'] > study_records['total_work_experience']]

Unnamed: 0,age,gender,marital_status,country,place_you_live_in,no_of_family_members_living_with,employment_type,current_industry,current_industry_other,Number of employees in your organization,total_work_experience,work_experience_with_current_company,current_role_category,current_role_category_other,current_annual_salary,mode_of_working,how_many_days_you_go_to_office,how_many_days_personal_choice_or_prescribed_by_company,if_personal_choice_how_many_days_willing_to_work_in_office,total_commute_time_in_minutes,rating_wfh_hybrid_wfo,factors_influenced_wfo,factors_influenced_wfo_other,factors_influenced_hybrid,factors_influenced_hybrid_other,factors_influenced_wfh,factors_influenced_wfh_other,entry_id,entry_status,created_at,age_range,work_exp_range
74,25,Female,Single,India,Urban,0,Private,Finance,,10000+,0.0,1.0,Associates,,11.0,Hybrid,5.0,Prescribed by company,5.0,40,5,,,Flexibility - Better Control over work hours a...,,,,206,unread,2023-02-07 17:11:37,25 - 29 years,0 - 2 years
102,20,Male,Single,India,Town,3,Private,Technology,,1-300,0.0,0.6,Others,Intern,,Complete work from office,5.0,Prescribed by company,4.0,480,3,"Better interpersonal connect, Better infrastru...",,,,,,178,unread,2023-02-07 06:53:19,20 - 24 years,0 - 2 years
140,35,Male,Married,India,Metro,4,Government,Finance,,1-300,3.0,5.0,Mid-Level Management,,500.0,Complete work from office,3.0,Prescribed by company,5.0,320,3,Better infrastructure,,,,,,140,unread,2023-02-06 22:36:09,35 - 39 years,3 - 5 years
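Checks like the one above can be collected into a single flag column. In the sketch below, the first three toy rows reuse the age and experience values of survey rows 74, 140 and 221 shown earlier; the fourth row, the `MIN_WORKING_AGE` threshold and the `suspect` column are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [25, 35, 23, 40],
    "total_work_experience": [0.0, 3.0, 34.0, 15.0],
    "work_experience_with_current_company": [1.0, 5.0, 29.0, 4.0],
})

# Rule from the cell above: tenure at the current company cannot exceed total experience.
tenure_exceeds_total = toy["work_experience_with_current_company"] > toy["total_work_experience"]

# Assumed rule (not from the survey): nobody starts working before MIN_WORKING_AGE.
MIN_WORKING_AGE = 14
experience_exceeds_age = toy["total_work_experience"] > (toy["age"] - MIN_WORKING_AGE)

# A row is suspect if it breaks either rule.
toy["suspect"] = tenure_exceeds_total | experience_exceeds_age
print(toy[toy["suspect"]])
```

The second rule catches a case the first one misses: a 23-year-old reporting 34 years of experience is internally consistent on tenure but implausible on age.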


# Current role category

In [1021]:
 
study_records['current_role_category'].unique()
#count number of each current_role_category
study_records.groupby('current_role_category')['current_role_category'].count()

current_role_category
Associates               58
Frontline Managers       52
Mid-Level Management    127
Others                   20
Top Level Management     19
Name: current_role_category, dtype: int64

In [1022]:
study_records['current_role_category'].describe()

count                      276
unique                       5
top       Mid-Level Management
freq                       127
Name: current_role_category, dtype: object

In [1023]:
current_role_category_df = study_records.groupby('current_role_category')['current_role_category'].count().to_frame('count').reset_index()
current_role_category_df
fig = px.pie(current_role_category_df, values='count', names='current_role_category', title='Current role category')
fig.show()

In [1024]:
fig = px.bar(current_role_category_df, x='current_role_category', y='count')
fig.show()

# Current role category other
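The free-text answers below need to be reviewed and merged into the main current role categories. Some differ only in capitalisation, for example 'Software Developer' and 'Software developer'.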

In [1025]:

study_records['current_role_category_other'].unique()
#count number of each current_role_category_other
study_records.groupby('current_role_category_other')['current_role_category_other'].count()

current_role_category_other
Air force officer           1
Consultant                  2
Data analyst                1
Developer                   1
Engineer                    2
IP Counsel                  1
Intern                      1
SDE                         1
Senior Designer Engineer    1
Senior Project Engineer     1
Software Developer          1
Software developer          1
Team Lead                   1
Team lead                   1
Name: current_role_category_other, dtype: int64

In [1026]:
study_records['current_role_category_other'].describe()

count             16
unique            14
top       Consultant
freq               2
Name: current_role_category_other, dtype: object

In [1027]:
current_role_category_other_df = study_records.groupby('current_role_category_other')['current_role_category_other'].count().to_frame('count').reset_index()
current_role_category_other_df
fig = px.pie(current_role_category_other_df, values='count', names='current_role_category_other', title='Current role category other')
fig.show()

# Salary

In [1028]:

study_records['current_annual_salary'].describe()

count    203.000000
mean      27.524631
std       48.460271
min        0.000000
25%       12.000000
50%       20.000000
75%       30.000000
max      500.000000
Name: current_annual_salary, dtype: float64

In [1029]:
study_records.isnull().sum()['current_annual_salary']

73

In [1030]:
study_records['current_annual_salary'].median()

20.0

In [1031]:
study_records['current_annual_salary'].value_counts()

30.0     21
15.0     14
25.0     14
20.0     13
18.0      9
12.0      7
0.0       7
10.0      6
14.0      6
11.0      5
40.0      5
22.0      5
8.0       5
24.0      5
16.0      4
7.0       4
6.0       4
5.0       4
9.0       4
42.0      4
28.0      3
7.5       3
1.0       3
50.0      3
32.0      3
19.0      3
13.0      2
21.0      2
70.0      2
37.0      2
48.0      2
23.0      2
17.0      1
80.0      1
5.5       1
4.0       1
35.0      1
421.0     1
27.0      1
38.0      1
2.5       1
3.0       1
55.0      1
33.0      1
19.5      1
6.5       1
24.5      1
500.0     1
26.0      1
225.0     1
65.0      1
18.5      1
45.0      1
57.0      1
60.0      1
100.0     1
95.0      1
36.0      1
47.0      1
Name: current_annual_salary, dtype: int64

## Filter Dataframe for null values

In [1032]:
study_records[study_records['current_annual_salary'].isna()]['current_annual_salary']

1     NaN
2     NaN
5     NaN
7     NaN
8     NaN
       ..
247   NaN
259   NaN
268   NaN
269   NaN
270   NaN
Name: current_annual_salary, Length: 73, dtype: float64

### Replace NaN salary records with the median salary

In [1033]:
study_records['current_annual_salary'].fillna(study_records['current_annual_salary'].median())

0      70.0
1      20.0
2      20.0
3      50.0
4      50.0
       ... 
271    37.0
272    32.0
273     1.0
274    30.0
275    30.0
Name: current_annual_salary, Length: 276, dtype: float64

In [1034]:
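# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the original frame
study_records['current_annual_salary'] = study_records['current_annual_salary'].fillna(study_records['current_annual_salary'].median())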


### Check for remaining NaN salary records in study_records

In [1035]:
study_records.isnull().sum()['current_annual_salary']

0

In [1036]:
np.mean(study_records['current_annual_salary'])

25.534420289855074

In [1037]:
np.median(study_records['current_annual_salary'])

20.0

In [1038]:
st.mode(study_records['current_annual_salary'])

20.0

In [1039]:
study_records['current_annual_salary'].describe()

count    276.00000
mean      25.53442
std       41.66605
min        0.00000
25%       15.00000
50%       20.00000
75%       25.00000
max      500.00000
Name: current_annual_salary, dtype: float64

In [1040]:
# variance of population
st.pvariance(study_records['current_annual_salary'])

1729.7696485769798

In [1041]:
# variance of sample data
st.variance(study_records['current_annual_salary'])

1736.0597200263505
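
The two variance figures differ only in the divisor: the population variance divides by n, the sample variance by n - 1. A small sketch of that relation, using made-up salary values (an assumption for illustration, not the survey data) and the same `statistics` functions:

```python
import statistics as st

# Illustrative values only, not the survey data
salaries = [12.0, 15.0, 20.0, 20.0, 25.0, 30.0, 70.0]
n = len(salaries)

pop_var = st.pvariance(salaries)     # sum of squared deviations divided by n
sample_var = st.variance(salaries)   # sum of squared deviations divided by n - 1

# The sample variance is the population variance rescaled by n / (n - 1)
rescaled = pop_var * n / (n - 1)
print(pop_var, sample_var, rescaled)
```

With n = 276 the factor is 276/275, which turns 1729.77 into 1736.06, in line with the two outputs above.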

In [1042]:
# high median
st.median_high(study_records['current_annual_salary'])

20.0

In [1043]:
# low median
st.median_low(study_records['current_annual_salary'])

20.0

In [1044]:
st.median_grouped(study_records['current_annual_salary'])

19.96511627906977

In [1045]:
# standard deviation of salary.
st.stdev(study_records['current_annual_salary'])

41.66604996908575

## Finding outliers in salary
The interquartile range (IQR = Q3 - Q1) measures the spread of the middle 50% of salaries and is the basis for flagging outliers.

In [1046]:
Q1 = study_records['current_annual_salary'].quantile(0.25)
Q3 = study_records['current_annual_salary'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

10.0
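
The IQR on its own does not flag anything yet. A common convention, assumed here because the notebook does not state a rule, is to treat values outside Q1 - 1.5 × IQR and Q3 + 1.5 × IQR as outliers. The helper name `iqr_outlier_mask` and the sample values are hypothetical:

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Illustrative values only, not the survey data
demo = pd.Series([7.5, 12.0, 15.0, 20.0, 20.0, 25.0, 30.0, 500.0])
mask = iqr_outlier_mask(demo)
print(demo[mask])
```

With the quartiles reported in this notebook (Q1 = 15, Q3 = 25, IQR = 10), the 1.5 × IQR rule would put the fences at 0 and 40.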


In [1047]:
print(study_records['current_annual_salary'].skew())
print(study_records['current_annual_salary'].describe())

8.988049299616398
count    276.00000
mean      25.53442
std       41.66605
min        0.00000
25%       15.00000
50%       20.00000
75%       25.00000
max      500.00000
Name: current_annual_salary, dtype: float64


A skewness of 8.99 indicates a strongly right-skewed distribution: most salaries cluster at the low end, and a few very high values form a long right tail.
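
To make the skewness figure less of a black box, here is a minimal sketch of the bias-adjusted sample skewness computed by hand. It assumes that `Series.skew()` reports this adjusted Fisher-Pearson coefficient, which is my understanding of the pandas behaviour. The function name and the values are illustrative only:

```python
import math
import pandas as pd

def adjusted_skewness(values):
    """Adjusted Fisher-Pearson sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5                             # unadjusted skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1    # small-sample correction

# Illustrative values only: one large value creates a long right tail
demo = [8.0, 12.0, 15.0, 20.0, 20.0, 25.0, 30.0, 120.0]
manual = adjusted_skewness(demo)
from_pandas = pd.Series(demo).skew()
print(manual, from_pandas)
```

A positive result means the tail extends to the right, as it does for the salary column.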

# Visualizing Outliers
Outliers can be visualized with
1. Box plot
2. Histogram
3. Scatterplot (with two variables)

In [1048]:
# Box plot
fig = px.box(study_records, y="current_annual_salary")
fig.show()

In [1049]:
#By histogram

fig = px.histogram(study_records, x="current_annual_salary")
fig.show()

## Outlier treatment
Cap the extreme values at the 10th and 90th percentiles so that the skewness falls between -0.5 and 0.5.

In [1050]:
print(study_records['current_annual_salary'].quantile(0.10))
print(study_records['current_annual_salary'].quantile(0.90))

7.5
39.0


In [1051]:

# Cap salaries at the 10th percentile (7.5) and the 90th percentile (39.0)
study_records['current_annual_salary'] = np.where(study_records['current_annual_salary'] < 7.5, 7.5, study_records['current_annual_salary'])
study_records['current_annual_salary'] = np.where(study_records['current_annual_salary'] > 39.0, 39.0, study_records['current_annual_salary'])
print(study_records['current_annual_salary'].skew())

0.49511902939165964
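
The capping step can also be written without hard-coded thresholds. The sketch below uses `Series.clip` with the percentiles computed from the data; the helper name `cap_at_percentiles` and the sample values are hypothetical, and the 10%/90% cut-offs simply mirror the ones used in this notebook:

```python
import pandas as pd

def cap_at_percentiles(values: pd.Series, lower_q: float = 0.10, upper_q: float = 0.90) -> pd.Series:
    """Replace values below/above the given percentiles with the percentile itself."""
    lower = values.quantile(lower_q)
    upper = values.quantile(upper_q)
    return values.clip(lower=lower, upper=upper)

# Illustrative values only, not the survey data
demo = pd.Series([0.0, 5.0, 10.0, 15.0, 20.0, 20.0, 25.0, 30.0, 40.0, 60.0, 500.0])
capped = cap_at_percentiles(demo)
print(demo.skew(), capped.skew())
```

Capping keeps every record but changes the extreme values, so statistics computed afterwards describe the treated column, not the raw answers.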


In [1052]:
#By histogram

fig = px.histogram(study_records, x="current_annual_salary")
fig.show()

In [1053]:
# Box plot after outlier treatment
fig = px.box(study_records, y="current_annual_salary")
fig.show()

# Finding ranges in salary

In [1054]:
study_records['salary_range'] = 0
study_records['salary_range']= np.where((study_records['current_annual_salary']>=0) & (study_records['current_annual_salary']<=4), '0 - 4 Lakhs', study_records.salary_range)
study_records['salary_range']= np.where((study_records['current_annual_salary']>=5) & (study_records['current_annual_salary']<=9), '5 - 9 Lakhs', study_records.salary_range)
study_records['salary_range']= np.where((study_records['current_annual_salary']>=10) & (study_records['current_annual_salary']<=14), '10 - 14 Lakhs', study_records.salary_range)
study_records['salary_range']= np.where((study_records['current_annual_salary']>=15) & (study_records['current_annual_salary']<=19), '15 - 19 Lakhs', study_records.salary_range)
study_records['salary_range']= np.where((study_records['current_annual_salary']>=20) & (study_records['current_annual_salary']<=24), '20 - 24 Lakhs', study_records.salary_range)
study_records['salary_range']= np.where((study_records['current_annual_salary']>=25) & (study_records['current_annual_salary']<=29), '25 - 29 Lakhs', study_records.salary_range)
study_records['salary_range']= np.where((study_records['current_annual_salary']>=30) & (study_records['current_annual_salary']<=34), '30 - 34 Lakhs', study_records.salary_range)
study_records['salary_range']= np.where((study_records['current_annual_salary']>=35) & (study_records['current_annual_salary']<=39), '35 - 39 Lakhs', study_records.salary_range)

study_records['salary_range']= np.where((study_records['current_annual_salary']>=40), '40 and above lakhs', study_records.salary_range)
salary_range_df = study_records.groupby('salary_range')['salary_range'].count().to_frame('count').reset_index()
salary_range_df

Unnamed: 0,salary_range,count
0,0,2
1,10 - 14 Lakhs,26
2,15 - 19 Lakhs,32
3,20 - 24 Lakhs,100
4,25 - 29 Lakhs,19
5,30 - 34 Lakhs,25
6,35 - 39 Lakhs,33
7,5 - 9 Lakhs,39
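
The ranges above are defined on whole numbers with inclusive ends, so a salary such as 19.5 or 24.5 falls between two ranges and keeps the placeholder `0`. That is probably where the two rows labelled `0` come from. Since salaries were capped at 39.0 earlier, the "40 and above" range can no longer match anything. A possible alternative, sketched with illustrative values and assuming half-open bins are acceptable, is `pd.cut`:

```python
import pandas as pd

# Illustrative values only, including half-lakh amounts that fall between integer bounds
salaries = pd.Series([7.5, 12.0, 19.5, 20.0, 24.5, 39.0])

edges = [0, 5, 10, 15, 20, 25, 30, 35, 40, float("inf")]
labels = ['0 - 4 Lakhs', '5 - 9 Lakhs', '10 - 14 Lakhs', '15 - 19 Lakhs',
          '20 - 24 Lakhs', '25 - 29 Lakhs', '30 - 34 Lakhs', '35 - 39 Lakhs',
          '40 and above lakhs']

# right=False makes each bin half-open: [0, 5), [5, 10), ... so no value is left out
salary_range = pd.cut(salaries, bins=edges, labels=labels, right=False)
print(salary_range.tolist())
```

Because `pd.cut` returns an ordered categorical, grouping on it should list the ranges in salary order instead of alphabetical order.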


In [1055]:
# Pie chart of the salary ranges
fig = px.pie(salary_range_df, values='count', names='salary_range', title='Salary range')
fig.show()

In [1056]:
fig = px.bar(salary_range_df, x='salary_range', y='count')
fig.show()

# Mode of Working

In [1057]:

study_records['mode_of_working'].describe()

count        276
unique         3
top       Hybrid
freq         127
Name: mode_of_working, dtype: object

In [1058]:
study_records['mode_of_working'].unique()
#count number of each mode_of_working
study_records.groupby('mode_of_working')['mode_of_working'].count()

mode_of_working
Complete work from home       54
Complete work from office     95
Hybrid                       127
Name: mode_of_working, dtype: int64
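
The same frequency table can be produced with `value_counts`, which also gives relative shares directly. A minimal sketch with made-up responses (not the survey data):

```python
import pandas as pd

# Illustrative responses only, not the survey data
modes = pd.Series(['Hybrid', 'Hybrid', 'Complete work from office',
                   'Complete work from home', 'Hybrid'], name='mode_of_working')

counts = modes.value_counts()                 # absolute frequencies
shares = modes.value_counts(normalize=True)   # relative frequencies that sum to 1
print(counts)
print(shares.round(3))
```

For the survey counts above, this corresponds to roughly 46% Hybrid (127/276), 34% complete work from office (95/276) and 20% complete work from home (54/276).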

In [1059]:
mode_of_working_df = study_records.groupby('mode_of_working')['mode_of_working'].count().to_frame('count').reset_index()
mode_of_working_df
# Pie chart of the mode of working
fig = px.pie(mode_of_working_df, values='count', names='mode_of_working', title='Mode of working')
fig.show()

In [1060]:
fig = px.bar(mode_of_working_df, x='mode_of_working', y='count')
fig.show()

# How many days do you go to the office?

In [1061]:

study_records.isnull().sum()['how_many_days_you_go_to_office']

55

In [1062]:
study_records['how_many_days_you_go_to_office'].describe()

count    221.000000
mean       3.647059
std        1.711552
min        1.000000
25%        2.000000
50%        3.000000
75%        5.000000
max        7.000000
Name: how_many_days_you_go_to_office, dtype: float64

In [1063]:
# Median
print(study_records['how_many_days_you_go_to_office'].median())

3.0


In [1064]:
how_many_days_you_go_to_office_df = study_records.groupby('how_many_days_you_go_to_office')['how_many_days_you_go_to_office'].count().to_frame('count').reset_index()
how_many_days_you_go_to_office_df
fig = px.bar(how_many_days_you_go_to_office_df, x='how_many_days_you_go_to_office', y='count')
fig.show()

In [1065]:
fig = px.pie(how_many_days_you_go_to_office_df, values='count', names='how_many_days_you_go_to_office', title='How many days you go to office')
fig.show()

# Personal choice or prescribed by company

In [1066]:
study_records.groupby('how_many_days_personal_choice_or_prescribed_by_company')['how_many_days_personal_choice_or_prescribed_by_company'].count()         

how_many_days_personal_choice_or_prescribed_by_company
Personal choice           58
Prescribed by company    163
Name: how_many_days_personal_choice_or_prescribed_by_company, dtype: int64

In [1067]:
study_records['how_many_days_personal_choice_or_prescribed_by_company'].describe()

count                       221
unique                        2
top       Prescribed by company
freq                        163
Name: how_many_days_personal_choice_or_prescribed_by_company, dtype: object

In [1068]:
how_many_days_personal_choice_or_prescribed_by_company_df = study_records.groupby('how_many_days_personal_choice_or_prescribed_by_company')['how_many_days_personal_choice_or_prescribed_by_company'].count().to_frame('count').reset_index()
how_many_days_personal_choice_or_prescribed_by_company_df

fig = px.pie(how_many_days_personal_choice_or_prescribed_by_company_df, values='count', names='how_many_days_personal_choice_or_prescribed_by_company', title='Personal choice or prescribed by company')
fig.show()

In [1069]:
fig = px.bar(how_many_days_personal_choice_or_prescribed_by_company_df, x='how_many_days_personal_choice_or_prescribed_by_company', y='count')
fig.show()

# How many days willing to work in office, if it's a personal choice

In [1070]:

study_records['if_personal_choice_how_many_days_willing_to_work_in_office'].describe()

count    163.00000
mean       2.90184
std        1.72925
min        0.00000
25%        2.00000
50%        3.00000
75%        5.00000
max        7.00000
Name: if_personal_choice_how_many_days_willing_to_work_in_office, dtype: float64

# Total commute time

In [1071]:

study_records['total_commute_time_in_minutes'].describe()

count     276.000000
mean       93.105072
std       137.954247
min         0.000000
25%        30.000000
50%        60.000000
75%       120.000000
max      1440.000000
Name: total_commute_time_in_minutes, dtype: float64

In [1072]:
# Box plot
fig = px.box(study_records, y="total_commute_time_in_minutes")
fig.show()

In [1073]:
print(study_records['total_commute_time_in_minutes'].quantile(0.10))
print(study_records['total_commute_time_in_minutes'].quantile(0.90))

1.0
180.0


In [1074]:

# Cap commute time at the 10th percentile (1.0) and the 90th percentile (180.0)
study_records['total_commute_time_in_minutes'] = np.where(study_records['total_commute_time_in_minutes'] < 1.0, 1.0, study_records['total_commute_time_in_minutes'])
study_records['total_commute_time_in_minutes'] = np.where(study_records['total_commute_time_in_minutes'] > 180.0, 180.0, study_records['total_commute_time_in_minutes'])
print(study_records['total_commute_time_in_minutes'].skew())

0.6940746637959906


# Rating WFO, Hybrid and WFH

In [1075]:

study_records['rating_wfh_hybrid_wfo'].describe()

count    276.000000
mean       6.119565
std        2.811052
min        0.000000
25%        5.000000
50%        6.000000
75%        8.250000
max       10.000000
Name: rating_wfh_hybrid_wfo, dtype: float64

In [1076]:
# Skewness and histogram of the rating distribution
print(study_records['rating_wfh_hybrid_wfo'].skew())

fig = px.histogram(study_records, x="rating_wfh_hybrid_wfo")
fig.show()

-0.25871670050620477


In [1077]:
rating_wfh_hybrid_wfo_df = study_records.groupby('rating_wfh_hybrid_wfo')['rating_wfh_hybrid_wfo'].count().to_frame('count').reset_index()
rating_wfh_hybrid_wfo_df
# Pie chart of the ratings
fig = px.pie(rating_wfh_hybrid_wfo_df, values='count', names='rating_wfh_hybrid_wfo', title='Rating WFO, Hybrid and WFH')
fig.show()

In [1078]:
fig = px.bar(rating_wfh_hybrid_wfo_df, x='rating_wfh_hybrid_wfo', y='count')
fig.show()

# Factors WFO

In [1079]:


study_records['factors_influenced_wfo'].unique()
#count number of each factors_influenced_wfo
study_records.groupby('factors_influenced_wfo')['factors_influenced_wfo'].count()

factors_influenced_wfo
Better Productivity                                                                                                                                                                                                                                             1
Better focus on tasks & less distractions, Better Productivity, Other                                                                                                                                                                                           1
Better focus on tasks & less distractions, Having a more physically active life, Better Productivity, Optimize Innovation & Collaboration                                                                                                                       1
Better infrastructure                                                                                                                                                                                      

In [1080]:
wfo_list = study_records['factors_influenced_wfo'].unique().tolist()
wfo_list

['Better interpersonal connect, Better infrastructure',
 nan,
 'Better interpersonal connect, Better infrastructure, Work life balance (Ex., better work life seperation)',
 'Better interpersonal connect, Better infrastructure, Work life balance (Ex., better work life seperation), Better focus on tasks & less distractions, Having a more physically active life, Better Productivity, Optimize Innovation & Collaboration',
 'Better interpersonal connect, Better infrastructure, Work life balance (Ex., better work life seperation), Better focus on tasks & less distractions, Having a more physically active life, Better Productivity',
 'Better interpersonal connect, Having a more physically active life, Optimize Innovation & Collaboration',
 'Better interpersonal connect',
 'Better interpersonal connect, Better infrastructure, Better focus on tasks & less distractions, Having a more physically active life, Better Productivity, Optimize Innovation & Collaboration',
 'Better interpersonal connect,

# #2 Correlation between two variables - Scatter plots

In [1081]:
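# numeric_only=True restricts the matrix to numeric columns (pandas 2.x raises an error on text columns otherwise)
study_records.corr(method='pearson', numeric_only=True)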

Unnamed: 0,age,no_of_family_members_living_with,total_work_experience,work_experience_with_current_company,current_annual_salary,how_many_days_you_go_to_office,if_personal_choice_how_many_days_willing_to_work_in_office,total_commute_time_in_minutes,rating_wfh_hybrid_wfo,factors_influenced_hybrid_other,entry_id
age,1.0,0.084724,0.840945,0.532178,0.283748,0.217475,0.223381,0.105994,-0.061401,,0.213679
no_of_family_members_living_with,0.084724,1.0,0.316182,0.27415,0.114005,0.022081,-0.065555,0.104524,0.118994,,-0.09486
total_work_experience,0.840945,0.316182,1.0,0.64019,0.418854,0.170396,0.159409,0.137982,-0.05818,,0.124257
work_experience_with_current_company,0.532178,0.27415,0.64019,1.0,0.170196,0.331645,0.183969,0.062431,-0.111762,,0.106891
current_annual_salary,0.283748,0.114005,0.418854,0.170196,1.0,0.020947,0.044075,0.04053,-0.099573,,-0.005011
how_many_days_you_go_to_office,0.217475,0.022081,0.170396,0.331645,0.020947,1.0,0.574552,-0.009761,-0.244622,,0.08985
if_personal_choice_how_many_days_willing_to_work_in_office,0.223381,-0.065555,0.159409,0.183969,0.044075,0.574552,1.0,0.051464,-0.360461,,0.070328
total_commute_time_in_minutes,0.105994,0.104524,0.137982,0.062431,0.04053,-0.009761,0.051464,1.0,-0.049894,,-0.00395
rating_wfh_hybrid_wfo,-0.061401,0.118994,-0.05818,-0.111762,-0.099573,-0.244622,-0.360461,-0.049894,1.0,,-0.007479
factors_influenced_hybrid_other,,,,,,,,,,,


# Is there any correlation between age and salary?

-1 --- 0 --- +1  
- Close to +1: strong positive linear correlation
- Close to 0: little or no linear correlation
- Close to -1: strong negative linear correlation

In [1082]:
corr = study_records['age'].corr(study_records['current_annual_salary'])
print(corr)

0.28374831039957016
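
A coefficient alone does not say whether the relationship could be due to chance. `pearsonr`, already imported at the top of the notebook, also returns a p-value. The sketch below uses made-up values (not the survey data) and drops incomplete pairs first, since `pearsonr` does not skip missing values:

```python
import pandas as pd
from scipy.stats import pearsonr

# Illustrative values only, not the survey data; None marks a missing answer
demo = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50, 55],
    'current_annual_salary': [8.0, 12.0, None, 20.0, 22.0, 30.0, 28.0],
})

# Keep only rows where both values are present
pairs = demo[['age', 'current_annual_salary']].dropna()
r, p_value = pearsonr(pairs['age'], pairs['current_annual_salary'])
print(f"r = {r:.3f}, p = {p_value:.4f}, n = {len(pairs)}")
```

The coefficient matches what `Series.corr` reports, because `Series.corr` ignores incomplete pairs in the same way.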


In [1083]:
# Pass the DataFrame first, then name the x and y columns
fig = px.scatter(study_records, x='age', y='current_annual_salary')
fig.show()

# Is there any correlation between age and total_work_experience?

-1 --- 0 --- +1  
- Close to +1: strong positive linear correlation
- Close to 0: little or no linear correlation
- Close to -1: strong negative linear correlation

In [1084]:

corr = study_records['age'].corr(study_records['total_work_experience'])
print(corr)

0.8409448240198869


In [1085]:
# Pass the DataFrame first, then name the x and y columns
fig = px.scatter(study_records, x='age', y='total_work_experience')
fig.show()

# Is there any correlation between age and how many days you go to office?

In [1086]:
corr = study_records['age'].corr(study_records['how_many_days_you_go_to_office'])
print(corr)

0.21747452818160626


In [1087]:
# Pass the DataFrame first, then name the x and y columns
fig = px.scatter(study_records, x='age', y='how_many_days_you_go_to_office')
fig.show()

# Is there any correlation between work experience and current salary?

In [1088]:
corr = study_records['total_work_experience'].corr(study_records['current_annual_salary'])
print(corr)

0.4188541289391467


In [1089]:
# Pass the DataFrame first, then name the x and y columns
fig = px.scatter(study_records, x='total_work_experience', y='current_annual_salary')
fig.show()

# Is there any correlation between work experience and how many days you go to office?

In [1090]:
corr = study_records['total_work_experience'].corr(study_records['how_many_days_you_go_to_office'])
print(corr)

0.17039595092351234


In [1091]:
# Pass the DataFrame first, then name the x and y columns
fig = px.scatter(study_records, x='total_work_experience', y='how_many_days_you_go_to_office')
fig.show()

# Is there any correlation between commute time and how many days you go to office?

In [1092]:
corr = study_records['total_commute_time_in_minutes'].corr(study_records['how_many_days_you_go_to_office'])
print(corr)

-0.009761444835119282


In [1093]:
# Pass the DataFrame first, then name the x and y columns
fig = px.scatter(study_records, x='total_commute_time_in_minutes', y='how_many_days_you_go_to_office')
fig.show()

In [1094]:
study_records.groupby('factors_influenced_wfo')['factors_influenced_wfo'].count()

factors_influenced_wfo
Better Productivity                                                                                                                                                                                                                                             1
Better focus on tasks & less distractions, Better Productivity, Other                                                                                                                                                                                           1
Better focus on tasks & less distractions, Having a more physically active life, Better Productivity, Optimize Innovation & Collaboration                                                                                                                       1
Better infrastructure                                                                                                                                                                                      

# Factors for WFO Expanded

In [1095]:
wfo_expanded = study_records['factors_influenced_wfo'].str.split(',', expand=True)
wfo_expanded.columns = ['factors_influenced_wfo_'+str(i) for i in wfo_expanded.columns]

wfo_expanded_concat = pd.concat([study_records,wfo_expanded], axis=1)

rating_wfo_df = pd.melt(wfo_expanded_concat, id_vars=['entry_id','rating_wfh_hybrid_wfo'], value_vars=wfo_expanded.columns, var_name='factors_wfo', value_name='factors_wfo_val').dropna()

rating_wfo_df

Unnamed: 0,entry_id,rating_wfh_hybrid_wfo,factors_wfo,factors_wfo_val
0,280,1,factors_influenced_wfo_0,Better interpersonal connect
11,269,2,factors_influenced_wfo_0,Better interpersonal connect
33,247,3,factors_influenced_wfo_0,Better interpersonal connect
35,245,1,factors_influenced_wfo_0,Better interpersonal connect
38,242,2,factors_influenced_wfo_0,Better interpersonal connect
...,...,...,...,...
2111,101,3,factors_influenced_wfo_7,Optimize Innovation & Collaboration
2125,87,2,factors_influenced_wfo_7,Optimize Innovation & Collaboration
2183,29,2,factors_influenced_wfo_7,Optimize Innovation & Collaboration
2200,12,2,factors_influenced_wfo_7,Optimize Innovation & Collaboration
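
Splitting on every comma has two side effects that show up in the later counts: the option "Work life balance (Ex., better work life seperation)" is cut into two pieces, and all factors after the first keep a leading space, so the same factor can be counted under two keys. One possible way around this, sketched with a few illustrative responses, is to split only on commas outside parentheses. The regular expression and the name `factors_long` are assumptions for this example, not part of the original notebook:

```python
import re
import pandas as pd

# Illustrative responses only; the option text is copied from the survey output
demo = pd.DataFrame({
    'entry_id': [1, 2, 3],
    'factors_influenced_wfo': [
        'Better interpersonal connect, Work life balance (Ex., better work life seperation)',
        'Better Productivity, Better infrastructure',
        None,  # respondent who did not answer this question
    ],
})

# Split on commas that are NOT inside parentheses, swallowing the spaces after them
pattern = re.compile(r',\s*(?![^()]*\))')

# Skip missing answers, then turn each answer into a list of factors
split_factors = demo['factors_influenced_wfo'].dropna().apply(pattern.split)

# One row per (entry_id, factor) pair
factors_long = (
    demo[['entry_id']]
    .join(split_factors.rename('factor'), how='inner')
    .explode('factor')
    .reset_index(drop=True)
)
factors_long['factor'] = factors_long['factor'].str.strip()
print(factors_long)
```

Counting on `factor` after this step keeps each survey option under a single key.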


# Factors for WFH Expanded

In [1096]:
wfh_expanded = study_records['factors_influenced_wfh'].str.split(',', expand=True)
wfh_expanded.columns = ['factors_influenced_wfh_'+str(i) for i in wfh_expanded.columns]

wfh_expanded_concat = pd.concat([study_records,wfh_expanded], axis=1)

rating_wfh_df = pd.melt(wfh_expanded_concat, id_vars=['entry_id','rating_wfh_hybrid_wfo'], value_vars=wfh_expanded.columns, var_name='factors_wfh', value_name='factors_wfh_val').dropna()

rating_wfh_df

Unnamed: 0,entry_id,rating_wfh_hybrid_wfo,factors_wfh,factors_wfh_val
1,279,10,factors_influenced_wfh_0,Avoid the long commute to office
3,277,10,factors_influenced_wfh_0,Avoid the long commute to office
5,275,9,factors_influenced_wfh_0,Work life balance (Ex.
8,272,10,factors_influenced_wfh_0,Avoid the long commute to office
10,270,9,factors_influenced_wfh_0,Avoid the long commute to office
...,...,...,...,...
2178,34,9,factors_influenced_wfh_7,Better Productivity
2186,26,10,factors_influenced_wfh_7,Better Productivity
2193,19,10,factors_influenced_wfh_7,Better Productivity
2195,17,8,factors_influenced_wfh_7,Better Productivity


# Factors for Hybrid Expanded

In [1097]:
hybrid_expanded = study_records['factors_influenced_hybrid'].str.split(',', expand=True)
hybrid_expanded.columns = ['factors_influenced_hybrid_'+str(i) for i in hybrid_expanded.columns]

hybrid_expanded_concat = pd.concat([study_records,hybrid_expanded], axis=1)

rating_hybrid_df = pd.melt(hybrid_expanded_concat, id_vars=['entry_id','rating_wfh_hybrid_wfo'], value_vars=hybrid_expanded.columns, var_name='factors_hybrid', value_name='factors_hybrid_val').dropna()

rating_hybrid_df

Unnamed: 0,entry_id,rating_wfh_hybrid_wfo,factors_hybrid,factors_hybrid_val
2,278,5,factors_influenced_hybrid_0,Flexibility - Better Control over work hours a...
4,276,7,factors_influenced_hybrid_0,Relatively less commute as compared to only wo...
6,274,6,factors_influenced_hybrid_0,Better Productivity
7,273,7,factors_influenced_hybrid_0,Flexibility - Better Control over work hours a...
9,271,4,factors_influenced_hybrid_0,Flexibility - Better Control over work hours a...
...,...,...,...,...
2425,63,5,factors_influenced_hybrid_8,Better Productivity
2432,56,6,factors_influenced_hybrid_8,Better Productivity
2437,51,7,factors_influenced_hybrid_8,Better Productivity
2455,33,5,factors_influenced_hybrid_8,Better Productivity


In [1098]:
rating_wfo_df.groupby(['rating_wfh_hybrid_wfo','factors_wfo_val'])['factors_wfo_val'].count()

rating_wfh_hybrid_wfo  factors_wfo_val                           
0                       Better Productivity                           7
                        Better focus on tasks & less distractions     6
                        Better infrastructure                         5
                        Having a more physically active life          5
                        Optimize Innovation & Collaboration           7
                        Work life balance (Ex.                        6
                        better work life seperation)                  6
                       Better focus on tasks & less distractions      1
                       Better interpersonal connect                   6
                       Other                                          1
1                       Better Productivity                           9
                        Better focus on tasks & less distractions     9
                        Better infrastructure                         

In [1099]:
rating_wfh_df.groupby(['rating_wfh_hybrid_wfo','factors_wfh_val'])['factors_wfh_val'].count()

rating_wfh_hybrid_wfo  factors_wfh_val                                                             
8                       Able to make healthier choices like exercising or preparing healthire meals    12
                        Allows more time to spend with friends and family                              10
                        Better Productivity                                                            11
                        Better focus on tasks & less distractions                                       7
                        Get better sleep                                                                7
                        Other                                                                           1
                        Work life balance (Ex.                                                         11
                        better work life seperation)                                                   12
                       Allows more time to spend wit

In [1100]:
rating_hybrid_df.groupby(['rating_wfh_hybrid_wfo','factors_hybrid_val'])['factors_hybrid_val'].count()

rating_wfh_hybrid_wfo  factors_hybrid_val                                              
4                       Better Productivity                                                 2
                        Get time to spend with colleagues as well as family                 2
                        Having a more physically active life                                4
                        Not all days are as demanding to be in office                       4
                        Relatively less commute as compared to only working from office     5
                        Sense of being empowered and trusted                                3
                        Work life balance (Ex.                                              3
                        better work life seperation)                                        3
                       Flexibility - Better Control over work hours and work location       4
                       Get time to spend with colleagues as well a

# Factors WFO for rating 2

In [1101]:

rating_2_df = rating_wfo_df[rating_wfo_df['rating_wfh_hybrid_wfo'] == 2].groupby('factors_wfo_val')['factors_wfo_val'].count().to_frame('factors_wfo_val_count').reset_index()

In [1102]:
rating_2_df

Unnamed: 0,factors_wfo_val,factors_wfo_val_count
0,Better Productivity,9
1,Better focus on tasks & less distractions,9
2,Better infrastructure,5
3,Having a more physically active life,9
4,Optimize Innovation & Collaboration,8
5,Other,1
6,Work life balance (Ex.,8
7,better work life seperation),10
8,Better Productivity,1
9,Better focus on tasks & less distractions,1


In [1103]:
fig = px.pie(rating_2_df, values='factors_wfo_val_count', names='factors_wfo_val', title='Factors WFO for rating 2')
fig.show()

# Factors Hybrid for rating 5

In [1104]:
rating_5_df = rating_hybrid_df[rating_hybrid_df['rating_wfh_hybrid_wfo'] == 5].groupby('factors_hybrid_val')['factors_hybrid_val'].count().to_frame('factors_hybrid_val_count').reset_index()
rating_5_df

Unnamed: 0,factors_hybrid_val,factors_hybrid_val_count
0,Better Productivity,44
1,Get time to spend with colleagues as well as ...,50
2,Having a more physically active life,36
3,Not all days are as demanding to be in office,37
4,Relatively less commute as compared to only w...,38
5,Sense of being empowered and trusted,31
6,Work life balance (Ex.,55
7,better work life seperation),59
8,Flexibility - Better Control over work hours a...,63
9,Get time to spend with colleagues as well as f...,9


In [1105]:
fig = px.pie(rating_5_df, values='factors_hybrid_val_count', names='factors_hybrid_val', title='Factors Hybrid')
fig.show()

# Factors Work from Home for rating 9

In [1106]:

rating_9_df = rating_wfh_df[rating_wfh_df['rating_wfh_hybrid_wfo'] == 9].groupby('factors_wfh_val')['factors_wfh_val'].count().to_frame('factors_wfh_val_count').reset_index()
rating_9_df

Unnamed: 0,factors_wfh_val,factors_wfh_val_count
0,Able to make healthier choices like exercisin...,9
1,Allows more time to spend with friends and fa...,11
2,Better Productivity,13
3,Better focus on tasks & less distractions,11
4,Get better sleep,8
5,Work life balance (Ex.,13
6,better work life seperation),14
7,Allows more time to spend with friends and family,2
8,Avoid the long commute to office,13
9,Work life balance (Ex.,1


In [1107]:
fig = px.pie(rating_9_df, values='factors_wfh_val_count', names='factors_wfh_val', title='Factors Work from Home for rating 9')
fig.show()

# #3 Analysis - Hypothesis Testing
- Z-Test
- Chi-square Test

# Z Test for rating
- H0: Mean rating is equal to 5
- H1: Mean rating is > 5

In [1108]:
null_mean = 5

# One-sample, one-sided z-test: the alternative hypothesis is mean rating > 5
ztest_score, p_value = ztest(study_records['rating_wfh_hybrid_wfo'], x2=None, value=null_mean, alternative='larger')

# The test statistic does not depend on alpha, so compute it once
# and compare the p-value against both significance levels
for alpha in (0.05, 0.01):
    print(ztest_score)
    print(p_value)
    if p_value < alpha:
        print("Reject null hypothesis")
    else:
        print("Do not reject null hypothesis")

6.616601685915414
1.837750401518768e-11
Reject null hypothesis
6.616601685915414
1.837750401518768e-11
Reject null hypothesis
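
The z statistic can be checked by hand from the summary statistics reported earlier for `rating_wfh_hybrid_wfo`. This sketch assumes the test uses the sample standard deviation (ddof = 1) and a standard normal reference distribution:

```python
import math

# Summary statistics copied from the describe() output of rating_wfh_hybrid_wfo
n = 276
sample_mean = 6.119565
sample_std = 2.811052   # sample standard deviation (ddof=1)
null_mean = 5

# z = (sample mean - hypothesised mean) / standard error of the mean
standard_error = sample_std / math.sqrt(n)
z = (sample_mean - null_mean) / standard_error

# One-sided p-value P(Z > z) for the alternative "mean > 5",
# computed from the complementary error function
p_value = 0.5 * math.erfc(z / math.sqrt(2))
print(z, p_value)
```

The result should land close to the statistic of about 6.62 printed above, with a p-value far below both 0.05 and 0.01.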


In [1109]:
rating_hybrid_df

Unnamed: 0,entry_id,rating_wfh_hybrid_wfo,factors_hybrid,factors_hybrid_val
2,278,5,factors_influenced_hybrid_0,Flexibility - Better Control over work hours a...
4,276,7,factors_influenced_hybrid_0,Relatively less commute as compared to only wo...
6,274,6,factors_influenced_hybrid_0,Better Productivity
7,273,7,factors_influenced_hybrid_0,Flexibility - Better Control over work hours a...
9,271,4,factors_influenced_hybrid_0,Flexibility - Better Control over work hours a...
...,...,...,...,...
2425,63,5,factors_influenced_hybrid_8,Better Productivity
2432,56,6,factors_influenced_hybrid_8,Better Productivity
2437,51,7,factors_influenced_hybrid_8,Better Productivity
2455,33,5,factors_influenced_hybrid_8,Better Productivity


In [1110]:
# Placeholder value; the np.where calls below overwrite it for every rating from 0 to 10
rating_hybrid_df['mode_of_working_coded'] = 0
rating_hybrid_df['mode_of_working_coded']= np.where((rating_hybrid_df['rating_wfh_hybrid_wfo']>=0) & (rating_hybrid_df['rating_wfh_hybrid_wfo']<=3), 'WFO', rating_hybrid_df.mode_of_working_coded)
rating_hybrid_df['mode_of_working_coded']= np.where((rating_hybrid_df['rating_wfh_hybrid_wfo']>=4) & (rating_hybrid_df['rating_wfh_hybrid_wfo']<=7), 'Hybrid', rating_hybrid_df.mode_of_working_coded)
rating_hybrid_df['mode_of_working_coded']= np.where((rating_hybrid_df['rating_wfh_hybrid_wfo']>=8) & (rating_hybrid_df['rating_wfh_hybrid_wfo']<=10), 'WFH', rating_hybrid_df.mode_of_working_coded)

rating_hybrid_df


Unnamed: 0,entry_id,rating_wfh_hybrid_wfo,factors_hybrid,factors_hybrid_val,mode_of_working_coded
2,278,5,factors_influenced_hybrid_0,Flexibility - Better Control over work hours a...,Hybrid
4,276,7,factors_influenced_hybrid_0,Relatively less commute as compared to only wo...,Hybrid
6,274,6,factors_influenced_hybrid_0,Better Productivity,Hybrid
7,273,7,factors_influenced_hybrid_0,Flexibility - Better Control over work hours a...,Hybrid
9,271,4,factors_influenced_hybrid_0,Flexibility - Better Control over work hours a...,Hybrid
...,...,...,...,...,...
2425,63,5,factors_influenced_hybrid_8,Better Productivity,Hybrid
2432,56,6,factors_influenced_hybrid_8,Better Productivity,Hybrid
2437,51,7,factors_influenced_hybrid_8,Better Productivity,Hybrid
2455,33,5,factors_influenced_hybrid_8,Better Productivity,Hybrid
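
The three `np.where` calls encode a fixed mapping: ratings 0 to 3 become WFO, 4 to 7 Hybrid and 8 to 10 WFH. The same mapping can be sketched in one step with `pd.cut`. The sample ratings are illustrative, and the bin edges are my choice to reproduce those ranges:

```python
import pandas as pd

# Illustrative ratings covering the boundaries of each range
ratings = pd.Series([0, 3, 4, 7, 8, 10], name='rating_wfh_hybrid_wfo')

# Intervals are (-1, 3], (3, 7], (7, 10], so 0-3 -> WFO, 4-7 -> Hybrid, 8-10 -> WFH
mode_of_working_coded = pd.cut(
    ratings,
    bins=[-1, 3, 7, 10],
    labels=['WFO', 'Hybrid', 'WFH'],
)
print(mode_of_working_coded.tolist())
```

Any rating outside 0 to 10 would come back as NaN instead of silently keeping a placeholder value.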


# Factors affecting rating

In [1111]:
# Hybrid: split the comma-separated factors into one column per factor
factors_hybrid_expanded = study_records['factors_influenced_hybrid'].str.split(',', expand=True)
factors_hybrid_expanded.columns = ['factors_influenced_hybrid_'+str(i) for i in factors_hybrid_expanded.columns]

factors_hybrid_expanded_concat = pd.concat([study_records,factors_hybrid_expanded], axis=1)

# Melt to one row per (respondent, factor) and drop the empty factor slots
factors_rating_hybrid_df = pd.melt(factors_hybrid_expanded_concat, id_vars=['entry_id','rating_wfh_hybrid_wfo','gender','place_you_live_in'], value_vars=factors_hybrid_expanded.columns, var_name='factors', value_name='factors_val').dropna()

# WFH: same steps for the work-from-home factors
factors_wfh_expanded = study_records['factors_influenced_wfh'].str.split(',', expand=True)
factors_wfh_expanded.columns = ['factors_influenced_wfh_'+str(i) for i in factors_wfh_expanded.columns]

factors_wfh_expanded_concat = pd.concat([study_records,factors_wfh_expanded], axis=1)

factors_rating_wfh_df = pd.melt(factors_wfh_expanded_concat, id_vars=['entry_id','rating_wfh_hybrid_wfo','gender','place_you_live_in'], value_vars=factors_wfh_expanded.columns, var_name='factors', value_name='factors_val').dropna()

# WFO: same steps for the work-from-office factors
factors_wfo_expanded = study_records['factors_influenced_wfo'].str.split(',', expand=True)
factors_wfo_expanded.columns = ['factors_influenced_wfo_'+str(i) for i in factors_wfo_expanded.columns]

factors_wfo_expanded_concat = pd.concat([study_records,factors_wfo_expanded], axis=1)

factors_rating_wfo_df = pd.melt(factors_wfo_expanded_concat, id_vars=['entry_id','rating_wfh_hybrid_wfo','gender','place_you_live_in'], value_vars=factors_wfo_expanded.columns, var_name='factors', value_name='factors_val').dropna()

# Stack the three long-format frames into one
frames = [factors_rating_wfo_df,factors_rating_hybrid_df,factors_rating_wfh_df ]

factors_wfo_hybrid_wfh_result_df = pd.concat(frames)
factors_wfo_hybrid_wfh_result_df

Unnamed: 0,entry_id,rating_wfh_hybrid_wfo,gender,place_you_live_in,factors,factors_val
0,280,1,Male,Metro,factors_influenced_wfo_0,Better interpersonal connect
11,269,2,Male,Urban,factors_influenced_wfo_0,Better interpersonal connect
33,247,3,Male,Metro,factors_influenced_wfo_0,Better interpersonal connect
35,245,1,Male,Metro,factors_influenced_wfo_0,Better interpersonal connect
38,242,2,Male,Metro,factors_influenced_wfo_0,Better interpersonal connect
...,...,...,...,...,...,...
2178,34,9,Male,Semi Urban,factors_influenced_wfh_7,Better Productivity
2186,26,10,Male,Metro,factors_influenced_wfh_7,Better Productivity
2193,19,10,Female,Town,factors_influenced_wfh_7,Better Productivity
2195,17,8,Male,Town,factors_influenced_wfh_7,Better Productivity
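
Several labels in the factors_val crosstab further down appear with a leading space, which is what splitting on ',' alone produces when the answers are separated by ', '. The sketch below shows one way to split, reshape and trim in a single pass. The three-row `toy_records` frame and the name `factors_long` are made up for illustration, and the sketch assumes that no factor label contains a comma itself.

```python
import pandas as pd

# Hypothetical three-row stand-in for study_records (the real survey has 276 rows).
toy_records = pd.DataFrame({
    "entry_id": [1, 2, 3],
    "factors_influenced_wfh": [
        "Better Productivity, Get better sleep",
        "Get better sleep",
        None,
    ],
})

# Split on commas, give every factor its own row, and drop respondents with no answer.
factors_long = (
    toy_records
    .assign(factors_val=toy_records["factors_influenced_wfh"].str.split(","))
    .explode("factors_val")
    .dropna(subset=["factors_val"])
)

# Trim stray whitespace so " Get better sleep" and "Get better sleep" count as one label.
factors_long["factors_val"] = factors_long["factors_val"].str.strip()

print(factors_long[["entry_id", "factors_val"]])
print(factors_long["factors_val"].value_counts())
```

Trimming before building a crosstab keeps the same factor from being counted under two different row labels.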


In [1112]:
# Code each rating into a preferred mode of working:
# 0-3 -> WFO, 4-7 -> Hybrid, 8-10 -> WFH (any other value is left as '')
rating = factors_wfo_hybrid_wfh_result_df['rating_wfh_hybrid_wfo']
factors_wfo_hybrid_wfh_result_df['mode_of_working_coded'] = np.select(
    [rating.between(0, 3), rating.between(4, 7), rating.between(8, 10)],
    ['WFO', 'Hybrid', 'WFH'],
    default='')

factors_wfo_hybrid_wfh_result_df


Unnamed: 0,entry_id,rating_wfh_hybrid_wfo,gender,place_you_live_in,factors,factors_val,mode_of_working_coded
0,280,1,Male,Metro,factors_influenced_wfo_0,Better interpersonal connect,WFO
11,269,2,Male,Urban,factors_influenced_wfo_0,Better interpersonal connect,WFO
33,247,3,Male,Metro,factors_influenced_wfo_0,Better interpersonal connect,WFO
35,245,1,Male,Metro,factors_influenced_wfo_0,Better interpersonal connect,WFO
38,242,2,Male,Metro,factors_influenced_wfo_0,Better interpersonal connect,WFO
...,...,...,...,...,...,...,...
2178,34,9,Male,Semi Urban,factors_influenced_wfh_7,Better Productivity,WFH
2186,26,10,Male,Metro,factors_influenced_wfh_7,Better Productivity,WFH
2193,19,10,Female,Town,factors_influenced_wfh_7,Better Productivity,WFH
2195,17,8,Male,Town,factors_influenced_wfh_7,Better Productivity,WFH


In [1113]:
mode_of_working_group_df = factors_wfo_hybrid_wfh_result_df.groupby('mode_of_working_coded')['mode_of_working_coded'].count().to_frame('mode_of_working_coded_count').reset_index()

In [1114]:
# Pie chart of the number of factor rows in each coded mode of working
fig = px.pie(mode_of_working_group_df, values='mode_of_working_coded_count', names='mode_of_working_coded', title='Mode of working')
fig.show()

## Chi-squared test #1 - Between Gender and Mode of working (WFO, Hybrid and WFH)

Check whether gender and mode_of_working_coded are associated (dependent) or independent.

- H0: gender and mode_of_working_coded are independent (there is no relationship between the two categorical variables)
- H1: gender and mode_of_working_coded are not independent (there is a relationship between the two categorical variables)
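
For reference, the quantity used to decide between H0 and H1 is Pearson's chi-squared statistic. The notation below is introduced here for illustration and is not the notebook's own: $O_{ij}$ and $E_{ij}$ are the observed and expected counts in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $r$ and $c$ are the numbers of rows and columns.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\mathrm{df} = (r - 1)(c - 1)
```

H0 is rejected at significance level $\alpha$ when $\chi^2$ exceeds the $(1-\alpha)$ quantile of the chi-squared distribution with $\mathrm{df}$ degrees of freedom, or equivalently when the p-value is at most $\alpha$. The double sum runs over every cell of the table.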

In [1115]:


contingency_table=pd.crosstab(factors_wfo_hybrid_wfh_result_df["gender"],factors_wfo_hybrid_wfh_result_df["mode_of_working_coded"])
print('contingency_table :-\n',contingency_table)

contingency_table :-
 mode_of_working_coded  Hybrid  WFH  WFO
gender                                 
Female                    185  102   30
Male                      548  377  185
Non-binary                  0    3    6


In [1116]:
# Observed Values
Observed_Values = contingency_table.values 
print("Observed Values :-\n",Observed_Values)

Observed Values :-
 [[185 102  30]
 [548 377 185]
 [  0   3   6]]


In [1117]:
# Expected Values
b=stats.chi2_contingency(contingency_table)
Expected_Values = b[3]
print("Expected Values :-\n",Expected_Values)

Expected Values :-
 [[161.81128134 106.40250696  48.7862117 ]
 [566.59470752 372.57660167 170.82869081]
 [  4.59401114   3.02089136   1.38509749]]
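
The expected values returned by `stats.chi2_contingency` are the counts one would see if gender and mode of working were independent: each cell is its row total times its column total, divided by the grand total. A minimal sketch that reproduces them from the printed contingency table, with the counts typed in by hand:

```python
import numpy as np

# Observed counts copied from the contingency table printed above
# (rows: Female, Male, Non-binary; columns: Hybrid, WFH, WFO).
observed = np.array([[185, 102, 30],
                     [548, 377, 185],
                     [0, 3, 6]])

row_totals = observed.sum(axis=1)   # one total per gender
col_totals = observed.sum(axis=0)   # one total per mode of working
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

All three expected counts in the Non-binary row are below 5, a range in which the chi-squared approximation is commonly treated with caution.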


In [1118]:
# Degrees of freedom = (rows - 1) * (columns - 1), read from the table's shape
no_of_rows, no_of_columns = contingency_table.shape
ddof=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:-",ddof)

Degree of Freedom:- 4


In [1119]:
alpha = 0.05

from scipy.stats import chi2
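# Summing over the rows leaves one partial sum per column (Hybrid, WFH, WFO)
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
# Add the partial sums of all columns. chi_square[0]+chi_square[1] alone leaves out
# the WFO column, which is what the recorded statistic below reflects.
chi_square_statistic=chi_square.sum()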
print("chi-square statistic:-",chi_square_statistic)
critical_value=chi2.ppf(q=1-alpha,df=ddof)
print('critical_value:',critical_value)

chi-square statistic:- 8.762187874879546
critical_value: 9.487729036781154


In [1120]:
#p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)
if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
    
if p_value<=alpha:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")

p-value: 0.06732644012714106
Significance level:  0.05
Degree of Freedom:  4
chi-square statistic: 8.762187874879546
critical_value: 9.487729036781154
p-value: 0.06732644012714106
Retain H0,There is no relationship between 2 categorical variables
Retain H0,There is no relationship between 2 categorical variables
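
The recorded statistic of 8.762 matches the contributions of the Hybrid and WFH columns alone. As a cross-check, the sketch below recomputes the test over all nine cells with `stats.chi2_contingency`, the same SciPy function the notebook already calls for the expected values. The counts are typed in from the printed contingency table and `per_column` is a name introduced here.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the gender contingency table printed above
# (rows: Female, Male, Non-binary; columns: Hybrid, WFH, WFO).
observed = np.array([[185, 102, 30],
                     [548, 377, 185],
                     [0, 3, 6]])

# No continuity correction is applied here because the table is larger than 2x2.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Contribution of each column (Hybrid, WFH, WFO) to the statistic
per_column = ((observed - expected) ** 2 / expected).sum(axis=0)

print("per-column contributions:", per_column)
print("first two columns only  :", per_column[:2].sum())
print("all three columns       :", per_column.sum())
print("scipy statistic:", chi2_stat, "dof:", dof, "p-value:", p_value)
```

With the WFO column included the statistic comes to roughly 32.5, above the critical value of 9.49, so on these counts test #1 would reject H0 rather than retain it. The table also counts one row per selected factor rather than one per respondent, so the same person contributes to several cells; that is worth keeping in mind when reading any of these p-values.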


## Chi-squared test #2 - Between Category of place of residence and Mode of working (WFO, Hybrid and WFH)

Check whether place_you_live_in and mode_of_working_coded are associated (dependent) or independent.

- H0: place_you_live_in and mode_of_working_coded are independent (there is no relationship between the two categorical variables)
- H1: place_you_live_in and mode_of_working_coded are not independent (there is a relationship between the two categorical variables)

In [1121]:

contingency_table=pd.crosstab(factors_wfo_hybrid_wfh_result_df["place_you_live_in"],factors_wfo_hybrid_wfh_result_df["mode_of_working_coded"])
print('contingency_table :-\n',contingency_table)

contingency_table :-
 mode_of_working_coded  Hybrid  WFH  WFO
place_you_live_in                      
Metro                     463  222  123
Semi Urban                 12   23    0
Town                       24   66   13
Urban                     205  162   79
Village                    29    9    6


In [1122]:
#Observed Values
Observed_Values = contingency_table.values 
print("Observed Values :-\n",Observed_Values)

Observed Values :-
 [[463 222 123]
 [ 12  23   0]
 [ 24  66  13]
 [205 162  79]
 [ 29   9   6]]


In [1123]:
b=stats.chi2_contingency(contingency_table)
Expected_Values = b[3]
print("Expected Values :-\n",Expected_Values)

Expected Values :-
 [[412.44011142 271.20891365 124.35097493]
 [ 17.86559889  11.74791086   5.38649025]
 [ 52.57590529  34.5724234   15.85167131]
 [227.65877437 149.70194986  68.63927577]
 [ 22.45961003  14.76880223   6.77158774]]


In [1124]:
# Degrees of freedom = (rows - 1) * (columns - 1), read from the table's shape
no_of_rows, no_of_columns = contingency_table.shape
ddof=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:-",ddof)

Degree of Freedom:- 8


In [1125]:
alpha = 0.05

from scipy.stats import chi2
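# Summing over the rows leaves one partial sum per column (Hybrid, WFH, WFO)
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
# Add the partial sums of all columns. chi_square[0]+chi_square[1] alone leaves out
# the WFO column, which is what the recorded statistic below reflects.
chi_square_statistic=chi_square.sum()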
print("chi-square statistic:-",chi_square_statistic)
critical_value=chi2.ppf(q=1-alpha,df=ddof)
print('critical_value:',critical_value)

chi-square statistic:- 79.35331592213029
critical_value: 15.50731305586545


In [1126]:
#p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)
if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
    
if p_value<=alpha:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")

p-value: 6.59472476627343e-14
Significance level:  0.05
Degree of Freedom:  8
chi-square statistic: 79.35331592213029
critical_value: 15.50731305586545
p-value: 6.59472476627343e-14
Reject H0,There is a relationship between 2 categorical variables
Reject H0,There is a relationship between 2 categorical variables
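
Each test in this section repeats the same six cells. One way to keep the steps together is a small helper, sketched below. The function name `chi_square_independence` and its return keys are hypothetical and not part of the notebook; it relies only on `stats.chi2_contingency` and `stats.chi2.ppf`, and the counts are typed in from the place_you_live_in table.

```python
import numpy as np
from scipy import stats


def chi_square_independence(observed, alpha=0.05):
    """Pearson chi-squared test of independence on a table of counts.

    Hypothetical helper for illustration; accepts a 2-D array or a crosstab DataFrame.
    """
    observed = np.asarray(observed)
    # correction=False: plain Pearson statistic (the correction only affects 2x2 tables)
    statistic, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
    critical_value = stats.chi2.ppf(1 - alpha, dof)
    return {
        "statistic": statistic,
        "dof": dof,
        "critical_value": critical_value,
        "p_value": p_value,
        "reject_h0": p_value <= alpha,
    }


# Counts copied from the place_you_live_in table printed above
# (rows: Metro, Semi Urban, Town, Urban, Village; columns: Hybrid, WFH, WFO).
place_table = np.array([[463, 222, 123],
                        [12, 23, 0],
                        [24, 66, 13],
                        [205, 162, 79],
                        [29, 9, 6]])

result = chi_square_independence(place_table)
for name, value in result.items():
    print(f"{name}: {value}")
```

Because the helper sums over every cell, its statistic is somewhat larger than the recorded 79.35, which covers the Hybrid and WFH columns only; the decision to reject H0 is the same either way.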


## Chi-squared test #3 - Between Factors Affecting Mode of working and Mode of working (WFO, Hybrid and WFH)

Check whether factors_val and mode_of_working_coded are associated (dependent) or independent.

- H0: factors_val and mode_of_working_coded are independent (there is no relationship between the two categorical variables)
- H1: factors_val and mode_of_working_coded are not independent (there is a relationship between the two categorical variables)

In [1127]:

contingency_table=pd.crosstab(factors_wfo_hybrid_wfh_result_df["factors_val"],factors_wfo_hybrid_wfh_result_df["mode_of_working_coded"])
print('contingency_table :-\n',contingency_table)

contingency_table :-
 mode_of_working_coded                               Hybrid  WFH  WFO
factors_val                                                         
 Able to make healthier choices like exercising...       0   59    0
 Allows more time to spend with friends and family       0   54    0
 Better Productivity                                    72   58   29
 Better focus on tasks & less distractions               0   48   29
 Better infrastructure                                   0    0   23
 Get better sleep                                        0   48    0
 Get time to spend with colleagues as well as f...      80    0    0
 Having a more physically active life                   59    0   25
 Not all days are as demanding to be in office          68    0    0
 Optimize Innovation & Collaboration                     0    0   23
 Other                                                   0    1    2
 Relatively less commute as compared to only wo...      76    0    0
 Sense of be

In [1128]:
#Observed Values
Observed_Values = contingency_table.values 
print("Observed Values :-\n",Observed_Values)

Observed Values :-
 [[  0  59   0]
 [  0  54   0]
 [ 72  58  29]
 [  0  48  29]
 [  0   0  23]
 [  0  48   0]
 [ 80   0   0]
 [ 59   0  25]
 [ 68   0   0]
 [  0   0  23]
 [  0   1   2]
 [ 76   0   0]
 [ 53   0   0]
 [ 89  60  22]
 [ 95  64  24]
 [  0   7   0]
 [  0  73   0]
 [  1   1   1]
 [  0   4   2]
 [  0   0   2]
 [  0   0  35]
 [108   0   0]
 [ 20   0   0]
 [  3   0   0]
 [  0   1   2]
 [  3   0   0]
 [  6   4   2]]


In [1129]:
b=stats.chi2_contingency(contingency_table)
Expected_Values = b[3]
print("Expected Values :-\n",Expected_Values)

Expected Values :-
 [[30.11629526 19.80362117  9.08008357]
 [27.56406685 18.12534819  8.31058496]
 [81.16086351 53.36908078 24.47005571]
 [39.30431755 25.8454039  11.85027855]
 [11.7402507   7.72005571  3.53969359]
 [24.50139276 16.11142061  7.38718663]
 [40.8356546  26.85236769 12.31197772]
 [42.87743733 28.19498607 12.9275766 ]
 [34.71030641 22.82451253 10.46518106]
 [11.7402507   7.72005571  3.53969359]
 [ 1.53133705  1.00696379  0.46169916]
 [38.79387187 25.5097493  11.69637883]
 [27.05362117 17.78969359  8.15668524]
 [87.2862117  57.39693593 26.31685237]
 [93.41155989 61.42479109 28.16364903]
 [ 3.57311978  2.34958217  1.07729805]
 [37.26253482 24.50278552 11.23467967]
 [ 1.53133705  1.00696379  0.46169916]
 [ 3.06267409  2.01392758  0.92339833]
 [ 1.02089136  0.67130919  0.30779944]
 [17.86559889 11.74791086  5.38649025]
 [55.1281337  36.25069638 16.62116992]
 [10.20891365  6.71309192  3.07799443]
 [ 1.53133705  1.00696379  0.46169916]
 [ 1.53133705  1.00696379  0.46169916]
 [ 1.

In [1130]:
# Degrees of freedom = (rows - 1) * (columns - 1), read from the table's shape
no_of_rows, no_of_columns = contingency_table.shape
ddof=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:-",ddof)

Degree of Freedom:- 52


In [1131]:
alpha = 0.05

from scipy.stats import chi2
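# Summing over the rows leaves one partial sum per column (Hybrid, WFH, WFO)
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
# Add the partial sums of all columns. chi_square[0]+chi_square[1] alone leaves out
# the WFO column, which is what the recorded statistic below reflects.
chi_square_statistic=chi_square.sum()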
print("chi-square statistic:-",chi_square_statistic)
critical_value=chi2.ppf(q=1-alpha,df=ddof)
print('critical_value:',critical_value)

chi-square statistic:- 943.6086382431853
critical_value: 69.83216033984813


In [1132]:
#p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)
if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
    
if p_value<=alpha:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")

p-value: 0.0
Significance level:  0.05
Degree of Freedom:  52
chi-square statistic: 943.6086382431853
critical_value: 69.83216033984813
p-value: 0.0
Reject H0,There is a relationship between 2 categorical variables
Reject H0,There is a relationship between 2 categorical variables
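
The p-value prints as exactly 0.0 because `1-chi2.cdf(...)` subtracts a number that has already rounded to 1.0 in floating point. A minimal sketch of the usual alternative, the survival function `chi2.sf`, which evaluates the upper tail directly; the statistic and degrees of freedom are the ones printed above.

```python
from scipy.stats import chi2

statistic = 943.6086382431853  # chi-square statistic printed above
dof = 52                       # degrees of freedom printed above

p_from_cdf = 1 - chi2.cdf(statistic, dof)  # the cdf rounds to 1.0, so this collapses to 0.0
p_from_sf = chi2.sf(statistic, dof)        # upper tail computed directly

print("1 - cdf:", p_from_cdf)
print("sf     :", p_from_sf)
```

Either way H0 is rejected at alpha = 0.05; the survival function simply reports a very small positive number instead of an exact zero.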
