## Table of Contents

1. **[Chi-Square Test](#chisq)**
2. **[One-way ANOVA](#1way)**

**Import the required libraries**

In [199]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import 'random' to generate random sample
import random

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

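# to perform tests on proportions
from statsmodels.stats import proportion

# import the statsmodels API
import statsmodels.api as sm

# to perform post-hoc multiple comparisons (Tukey's HSD)
import statsmodels.stats.multicomp as mc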

### Let's begin with some hands-on practice exercises

<a id = "chisq"> </a>
## 1. Chi-Square Test

Use the data available in the CSV file `Employee_Attrition.csv` for questions 1 to 6.

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>1. A company in Los Angeles has three functional departments - Research and Development, Sales, and Human Resources. The company claims that the percentage of employees in these 3 departments is 55%, 35% and 10% respectively. Check the company's claim using p-value criteria. Consider a 5% level of significance.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [200]:
# The null and alternative hypotheses are:
# H0: there is no significant difference between the observed and expected values
# H1: there is a significant difference between the observed and expected values

In [208]:
df = pd.read_csv('Employee_Attrition.csv')
df.head(2)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7


In [155]:
df['Department'].value_counts()

Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64

In [156]:
df['Department'].unique()

array(['Sales', 'Research & Development', 'Human Resources'], dtype=object)

In [157]:
#given observed values
observed_value =[961,446,63]   #observed values are taken from given data set through value counts

In [158]:
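# expected counts = claimed proportions (R&D, Sales, HR) * total number of employees
expected_prop = np.array([0.55, 0.35, 0.1])
expected_value = expected_prop * 1470
observed_value = np.array([961, 446, 63])

# chi-square goodness-of-fit test
test_stat, p_val = stats.chisquare(f_obs=observed_value, f_exp=expected_value)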

In [159]:
p_val

2.2406437256053955e-19

In [160]:
sl = 0.05

In [161]:
p_val<sl

True

In [162]:
# p-value < 0.05, so we reject the null hypothesis

The above output shows that the p-value is less than 0.05. Thus, we reject the null hypothesis and conclude that the company's claim is not supported, i.e. the distribution of employees across the three departments differs from the claimed 55%, 35% and 10%.
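As a cross-check on the result above, the chi-square goodness-of-fit statistic can be computed by hand as the sum of (O - E)^2 / E. The sketch below reuses the department counts and claimed proportions from this question; the variable names are illustrative and not part of the original solution.

```python
import numpy as np
from scipy import stats

# observed department counts (R&D, Sales, HR) taken from value_counts() above
observed = np.array([961, 446, 63])
# expected counts under the company's claimed proportions
expected = np.array([0.55, 0.35, 0.10]) * observed.sum()

# chi-square statistic: sum of (O - E)^2 / E
manual_stat = ((observed - expected) ** 2 / expected).sum()
# p-value from the chi-square distribution with k - 1 = 2 degrees of freedom
manual_p = stats.chi2.sf(manual_stat, df=len(observed) - 1)

# the same test through scipy, for comparison
scipy_stat, scipy_p = stats.chisquare(f_obs=observed, f_exp=expected)
print(manual_stat, manual_p)
print(np.isclose(manual_stat, scipy_stat), np.isclose(manual_p, scipy_p))
```

Most of the statistic comes from the Human Resources term, where 63 employees are observed against 147 expected.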

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>2. The employees in an IT firm undergo an online assessment survey. The survey reveals that 20% of employees are least satisfied, 18% are fairly satisfied, 30% are moderately satisfied and 32% are highly satisfied with their job. Use a critical value method to test the survey result with 90% confidence.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [163]:
# The null and alternative hypotheses are:
# H0: there is no significant difference between the observed and expected values
# H1: there is a significant difference between the observed and expected values

In [164]:
df.shape

(1470, 35)

In [165]:
df['JobSatisfaction'].value_counts()

4    459
3    442
1    289
2    280
Name: JobSatisfaction, dtype: int64

In [166]:
df['JobSatisfaction']=df['JobSatisfaction'].astype('object')

In [167]:
critical_value = np.abs(round(stats.chi2.isf(0.1,df=3),4))

In [168]:
critical_value

6.2514

In [169]:
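# observed counts for satisfaction levels 1 to 4 (least to highly satisfied)
observed_value = [289,280,442,459]
# proportions reported by the survey
exp_count = [0.2,0.18,0.3,0.32]
exp_val = np.array(exp_count)*len(df)
# expected counts are rounded to whole numbers; the rounded values still sum to 1470
exp_val = round(pd.Series(exp_val))
stat,p_val =stats.chisquare(f_obs=observed_value,f_exp=exp_val)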

In [170]:
stat

1.1938049995858107

In [171]:
critical_value

6.2514

In [172]:
# compare the statistic of this test (stat), not test_stat from question 1
stat<critical_value

True

In [173]:
#we fail to reject null hypothesis

The above output shows that the test statistic (1.1938) is less than the critical value 6.2514. Thus, we fail to reject the null hypothesis: the data give no evidence that the sample percentages differ from the percentages revealed by the survey.
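The critical value method and the p-value method always lead to the same decision. The following sketch illustrates this with the test statistic reported above; `alpha`, `dof` and the `reject_by_*` names are illustrative choices.

```python
from scipy import stats

alpha = 0.10               # 90% confidence
dof = 4 - 1                # four satisfaction levels
stat = 1.1938049995858107  # test statistic reported above

# upper-tail critical value and p-value of the chi-square distribution
critical_value = stats.chi2.isf(alpha, df=dof)
p_value = stats.chi2.sf(stat, df=dof)

reject_by_critical_value = stat > critical_value
reject_by_p_value = p_value < alpha
print(round(critical_value, 4), reject_by_critical_value, reject_by_p_value)
```

Both flags agree because `isf` and `sf` are inverse functions of each other: the statistic exceeds the critical value exactly when its p-value falls below `alpha`.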

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>3. The company claims that 68% of its employees travel rarely, 20% travel frequently and 12% do not need to travel for business. Check if the given data fits the company's claimed distribution. Use the p-value technique with 99% confidence.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

The null and alternative hypotheses are:

H<sub>0</sub>: There is no significant difference between the observed and expected values. <br>
H<sub>1</sub>: There is a significant difference between the observed and expected values.

In [174]:
df.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [175]:
df['BusinessTravel'].value_counts()

Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: BusinessTravel, dtype: int64

In [176]:
observed_value = [1043, 277, 150]
exp_prop = [0.68,0.2,0.12]
exp_val=np.array(exp_prop) * 1470
stat,p_val= stats.chisquare(f_obs=observed_value,f_exp=exp_val)


In [177]:
sl = 0.01

In [178]:
p_val>sl

True

In [179]:
# p-value > 0.01, so we fail to reject the null hypothesis

The above output shows that the p-value is greater than 0.01. Thus, we fail to reject the null hypothesis and conclude that the data are consistent with the company's claim, i.e. there is no significant difference between the observed and expected values at the 1% level.
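Typing the observed counts by hand makes it easy to mismatch the order of the counts and the claimed proportions. One way to avoid this, sketched below, is to key the proportions by category and reindex the counts. The `travel` Series is a stand-in for `df['BusinessTravel']`, rebuilt from the counts shown above, and `claimed` is an illustrative name.

```python
import numpy as np
import pandas as pd
from scipy import stats

# stand-in for df['BusinessTravel'], rebuilt from the counts shown above
travel = pd.Series(['Travel_Rarely'] * 1043
                   + ['Travel_Frequently'] * 277
                   + ['Non-Travel'] * 150)

# claimed proportions, keyed by category so the order cannot drift
claimed = pd.Series({'Travel_Rarely': 0.68,
                     'Travel_Frequently': 0.20,
                     'Non-Travel': 0.12})

# align the observed counts with the order of the claimed proportions
observed = travel.value_counts().reindex(claimed.index)
expected = claimed * len(travel)

stat, p_val = stats.chisquare(f_obs=observed, f_exp=expected)
print(stat, p_val)
```

For these counts the p-value lies between 0.01 and 0.05, so the decision depends on the chosen significance level: the claim is not rejected at 1%, but it would be rejected at 5%.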

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>4. Check whether travelling for work depends upon the job role of an employee. Use p-value criteria to test the dependence with 99% confidence.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [180]:
df['BusinessTravel'].unique()

array(['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'], dtype=object)

In [181]:
df['JobRole'].unique()

array(['Sales Executive', 'Research Scientist', 'Laboratory Technician',
       'Manufacturing Director', 'Healthcare Representative', 'Manager',
       'Sales Representative', 'Research Director', 'Human Resources'],
      dtype=object)

In [182]:
test_stat,p_val,dof,exp_val =stats.chi2_contingency(pd.crosstab(df['BusinessTravel'],df['JobRole']))

In [183]:
p_val

0.7448263418408124

In [184]:
sl = 0.01

In [185]:
p_val> sl

True

In [186]:
# we fail to reject the null hypothesis, so there is no evidence that business travel depends on job role

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>5. Is there any relationship between the attrition of an employee and his/her marital status? Use the critical value technique to test the relationship with 95% confidence. </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [187]:
table = pd.crosstab(df['Attrition'],df['MaritalStatus'])
table

Attrition,Divorced,Married,Single
No,294,589,350
Yes,33,84,120


In [188]:
observed_values = table.values
observed_values

array([[294, 589, 350],
       [ 33,  84, 120]], dtype=int64)

In [198]:
# dof = (number of rows - 1) * (number of columns - 1) = (2 - 1) * (3 - 1)
dof = 2

In [190]:
cric_val = np.abs(round(stats.chi2.isf(q=0.05,df=2),4))
cric_val

5.9915

In [191]:
test_stat, p, dof, expected_value = stats.chi2_contingency(observed =observed_values,correction = False)

test_stat

In [192]:
test_stat>cric_val

True

The above output shows that the test statistic is greater than 5.9915, thus we reject the null hypothesis and conclude that attrition and marital status are dependent.
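To see where the test statistic comes from, the expected counts under independence can be derived from the row and column totals of the contingency table. This is a minimal sketch using the observed table above; the names `row_totals`, `col_totals` and `manual_stat` are illustrative.

```python
import numpy as np
from scipy import stats

# observed Attrition x MaritalStatus counts from the crosstab above
observed = np.array([[294, 589, 350],
                     [ 33,  84, 120]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / n
manual_stat = ((observed - expected) ** 2 / expected).sum()

# degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
critical_value = stats.chi2.isf(0.05, df=dof)

# the same test through scipy, for comparison
scipy_stat, _, scipy_dof, scipy_expected = stats.chi2_contingency(observed, correction=False)
print(manual_stat, dof, round(critical_value, 4))
```

The hand-computed statistic and expected counts match the values returned by `chi2_contingency` with `correction=False`.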

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>6. Check whether travelling for the business depends on the gender of an employee. Use the p-value technique to test the claim with 90% confidence. </b>
                </font>
            </div>
        </td>
    </tr>
</table>

The null and alternative hypotheses are:

H<sub>0</sub>: Business travel and gender of an employee are independent<br>
H<sub>1</sub>: Business travel and gender of an employee are not independent

In [204]:
df['BusinessTravel'].unique()

array(['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'], dtype=object)

In [206]:
df['Gender'].unique()

array(['Female', 'Male'], dtype=object)

In [215]:
table = pd.crosstab(df['BusinessTravel'],df['Gender'])
observed_values=table.values

In [216]:
test_stat,p_val,dof,expected_value = stats.chi2_contingency(observed=observed_values,correction=False)

In [217]:
p_val

0.13322895625828154

In [218]:
sl = 0.1

In [220]:
p_val>sl

True

In [221]:
#we fail to reject the null hypothesis

The above output shows that the p-value is greater than 0.1. Thus, we fail to reject the null hypothesis: there is no evidence that business travel depends on the gender of an employee.

<a id = "1way"> </a>
## 2. One-way ANOVA

Use the data available in the CSV file `sales_emp.csv` for questions 7 to 11.

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>7. Check whether we can use the given dataset to study the equality of average monthly income of sales executives with a different education background in the company. Use a p-value technique to test at a 5% level of significance.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [226]:
df= pd.read_csv('sales_emp.csv')
df.head(2)

Unnamed: 0,Age,BusinessTravel,DailyRate,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,36,Travel_Rarely,1218,9,4,Life Sciences,1,27,3,82,...,2,80,0,10,4,3,5,3,0,3
1,39,Travel_Rarely,895,5,3,Technical Degree,1,42,4,56,...,3,80,1,19,6,4,1,0,0,0


In [228]:
df.columns

Index(['Age', 'BusinessTravel', 'DailyRate', 'DistanceFromHome', 'Education',
       'EducationField', 'EmployeeCount', 'EmployeeNumber',
       'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel',
       'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')

In the given dataset, `EducationField` is a categorical variable and `MonthlyIncome` is a numeric variable. To use the one-way ANOVA test for the equality of average income across education fields, the sample data should be drawn from a normally distributed population, and the population variances of all the groups should be equal.

In [232]:
df['EducationField'].unique()

array(['Life Sciences', 'Technical Degree', 'Marketing', 'Medical',
       'Other'], dtype=object)

In [230]:
# perform Shapiro-Wilk test to test the normality
# shapiro() returns a tuple having the values of test statistics and the corresponding p-value
stat, p_value = stats.shapiro(df['MonthlyIncome'])

# print the p-value
print('p-value:', p_value)

p-value: 0.05732710659503937


From the above result, we can see that the p-value is greater than 0.05, so we fail to reject the hypothesis that the monthly income is normally distributed. Thus the assumption of normality is considered satisfied.

Let us check the equality of variances.

In [235]:
stats.levene(df[df['EducationField']=='Life Sciences']['MonthlyIncome'],
              df[df['EducationField']=='Technical Degree']['MonthlyIncome'],
              df[df['EducationField']=='Marketing']['MonthlyIncome'],
              df[df['EducationField']=='Medical']['MonthlyIncome'],
              df[df['EducationField']=='Other']['MonthlyIncome'])

LeveneResult(statistic=1.4405771214255787, pvalue=0.23476859109336565)

In [236]:
# p-value of Levene's test, truncated from the output above
p_val = 0.234

In [238]:
sl = 0.05

In [240]:
p_val > sl

True

From the above result, we can see that the p-value is greater than 0.05, so there is no evidence that the population variances differ across the groups; the equal-variance assumption is considered satisfied.

Thus we can use the one-way ANOVA test to check the equality of average monthly income for all the education fields.
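The assumption checks can also be written without repeating the filter for every education field. The sketch below uses `groupby` on a synthetic stand-in for `sales_emp.csv` (the `demo` data are randomly generated for illustration and are not the real dataset). It also runs the Shapiro-Wilk test within each group, since the normality assumption of ANOVA applies to each group's population; this per-group check is an addition, not the method used above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# synthetic stand-in for sales_emp.csv (values are made up for illustration)
rng = np.random.default_rng(0)
fields = ['Life Sciences', 'Technical Degree', 'Marketing', 'Medical', 'Other']
demo = pd.DataFrame({
    'EducationField': np.repeat(fields, 12),
    'MonthlyIncome': rng.normal(loc=6000, scale=1500, size=60),
})

# one income Series per education field, without repeating the filter
groups = [g['MonthlyIncome'] for _, g in demo.groupby('EducationField')]

# Levene's test for equal variances across all groups
levene_stat, levene_p = stats.levene(*groups)

# Shapiro-Wilk p-value within each group
shapiro_p = {name: stats.shapiro(g['MonthlyIncome'])[1]
             for name, g in demo.groupby('EducationField')}
print(levene_p, shapiro_p)
```

Unpacking the list with `*groups` passes each group as a separate argument, which is the form `stats.levene` and `stats.f_oneway` expect.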

In [241]:
# we fail to reject the null hypotheses of both tests, so the normality and equal-variance assumptions are reasonable

In [244]:
s1 = df[df['EducationField']=='Life Sciences']['MonthlyIncome']
s2 = df[df['EducationField']=='Technical Degree']['MonthlyIncome']
s3 = df[df['EducationField']=='Marketing']['MonthlyIncome']
s4 = df[df['EducationField']=='Medical']['MonthlyIncome']
s5 = df[df['EducationField']=='Other']['MonthlyIncome']

In [245]:
stats.f_oneway(s1,s2,s3,s4,s5)

F_onewayResult(statistic=0.9546031924978221, pvalue=0.4407865128409415)

In [246]:
sl

0.05

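In [247]:
# use the ANOVA p-value from the result above, not the stale Levene value
p_val = stats.f_oneway(s1,s2,s3,s4,s5).pvalue
p_val

0.4407865128409415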

In [249]:
p_val > sl

True

 <table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>8. Use the sales employees' dataset to test whether the average monthly income of sales executives with different education background is equal or not. Use a critical value method with 95% confidence.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

The null and alternative hypotheses are:

H<sub>0</sub>: The average monthly income for all the education fields is the same<br>
H<sub>1</sub>: The average monthly income for at least one education field is different

In [267]:
df=pd.read_csv('sales_emp.csv')
df.head(2)

Unnamed: 0,Age,BusinessTravel,DailyRate,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,36,Travel_Rarely,1218,9,4,Life Sciences,1,27,3,82,...,2,80,0,10,4,3,5,3,0,3
1,39,Travel_Rarely,895,5,3,Technical Degree,1,42,4,56,...,3,80,1,19,6,4,1,0,0,0


In [268]:
# number of education fields in the data

df['EducationField'].nunique()

5

In [254]:
len(df)

54

In [256]:
# degrees of freedom: dfn = k - 1 = 4 and dfd = N - k = 54 - 5 = 49
cric_val = np.abs(round(stats.f.isf(q=0.05,dfn=4,dfd=49),4))
print(cric_val)

2.5611


In [272]:
test_stat,p_val=stats.f_oneway(df[df['EducationField'] == 'Life Sciences']['MonthlyIncome'],
                                  df[df['EducationField'] == 'Technical Degree']['MonthlyIncome'],
                                  df[df['EducationField'] == 'Marketing']['MonthlyIncome'],
                                  df[df['EducationField'] == 'Medical']['MonthlyIncome'],
                                  df[df['EducationField'] == 'Other']['MonthlyIncome'])

In [275]:
test_stat < cric_val

True

The above output shows that the test statistic is less than the critical value 2.5611. Thus, we fail to reject the null hypothesis: there is no evidence that the average monthly income differs across education fields.
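The degrees of freedom behind the critical value follow from the number of groups and the sample size. As a rough sketch, the snippet below derives them and recovers the p-value from the F statistic that `f_oneway` reported in question 7; `k`, `n` and `f_stat` are illustrative names.

```python
from scipy import stats

k = 5                     # number of education fields
n = 54                    # number of employees in sales_emp.csv
dfn, dfd = k - 1, n - k   # 4 and 49

f_stat = 0.9546031924978221   # F statistic reported by f_oneway in question 7

# upper-tail critical value and p-value of the F distribution
critical_value = stats.f.isf(0.05, dfn=dfn, dfd=dfd)
p_value = stats.f.sf(f_stat, dfn=dfn, dfd=dfd)
print(round(critical_value, 4), p_value)
```

The recovered p-value agrees with the one returned by `f_oneway`, so the critical value method and the p-value method give the same decision here.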

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>9. Can we use the given data of sales executives to check the equality of the average daily rate for different types of business travellers? Use a p-value technique to test at a 1% level of significance.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [277]:
# In the given dataset, BusinessTravel is a categorical variable and
# DailyRate is a numeric variable. To use the ANOVA test for the
# equality of the average daily rate across all types of business travellers,
# the sample should be drawn from a normally distributed population.
# Also, the population variances must be equal.

In [281]:
test_stat,p_val=stats.shapiro(df['DailyRate'])

In [282]:
sl = 0.01

In [285]:
p_val > sl 

True

In [286]:
# From the above result, the p-value is greater than 0.01, so we fail to
# reject the hypothesis that the daily rate is normally distributed

In [287]:
#Let us check the equality of variances.

In [288]:
df['BusinessTravel'].unique()

array(['Travel_Rarely', 'Non-Travel', 'Travel_Frequently'], dtype=object)

In [291]:
stats.levene(df[df['BusinessTravel']=='Travel_Rarely']['DailyRate'],
             df[df['BusinessTravel']=='Non-Travel']['DailyRate'],
             df[df['BusinessTravel']=='Travel_Frequently']['DailyRate'])

LeveneResult(statistic=1.506521535938148, pvalue=0.23137904227432696)

In [292]:
# p-value of Levene's test, truncated from the output above
p_val = 0.2

In [293]:
sl

0.01

In [294]:
p_val > sl

True

From the above result, we can see that the p-value is greater than 0.01, so there is no evidence that the population variances differ across the groups; the equal-variance assumption is considered satisfied.

Thus we can use the one-way ANOVA test to check the equality of the average daily rate for all the types of business travelling.

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>10. Use the parametric test to check the equality of the average daily rate for all the types of business travelling. Use a p-value technique to test the data with 99% confidence.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [295]:
s1=df[df['BusinessTravel']=='Travel_Rarely']['DailyRate']
s2=df[df['BusinessTravel']=='Non-Travel']['DailyRate']
s3=df[df['BusinessTravel']=='Travel_Frequently']['DailyRate']

In [303]:
test_stat,p_val=stats.f_oneway(s1,s2,s3)

In [305]:
p_val

4.527131533131016e-06

In [306]:
sl =0.01

In [310]:
p_val<sl

True

The above output shows that the p-value is less than 0.01. Thus we reject the null hypothesis and conclude that the average daily rate for at least one type of business travel is different.
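The F statistic used above is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it by hand on made-up daily rates (toy values, not taken from `sales_emp.csv`) and compares it with `stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# toy daily rates for three groups (made-up values, not from sales_emp.csv)
groups = [np.array([410., 520., 480., 450.]),
          np.array([700., 650., 720., 690.]),
          np.array([980., 1010., 940., 1000.])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), len(all_values)

# between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

# the same test through scipy, for comparison
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, p_manual)
```

A large F value means the group means are far apart relative to the spread within each group, which is what leads to a small p-value.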

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>11. Find the types of business travel for which the average daily rate is different. Use a 1% level of significance.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [311]:
df['BusinessTravel'].value_counts()

Travel_Rarely        18
Non-Travel           18
Travel_Frequently    18
Name: BusinessTravel, dtype: int64

The count of employees for each type of business travel is equal. Thus we use Tukey's HSD test for post-hoc analysis.

In [312]:
import statsmodels.stats.multicomp as mc

In [313]:
comp=mc.MultiComparison(data=df['DailyRate'],groups=df['BusinessTravel'])

In [314]:
post_hoc = comp.tukeyhsd(alpha=0.01)

In [315]:
post_hoc.summary()

group1,group2,meandiff,p-adj,lower,upper,reject
Non-Travel,Travel_Frequently,639.7778,0.0,289.5928,989.9627,True
Non-Travel,Travel_Rarely,242.6667,0.0972,-107.5183,592.8516,False
Travel_Frequently,Travel_Rarely,-397.1111,0.0031,-747.2961,-46.9262,True


The `reject=False` for the pair (Non-Travel, Travel_Rarely) denotes that we fail to reject the null hypothesis: there is no evidence that the average daily rate differs between employees who do not travel and employees who travel rarely.

The `reject=True` for the pairs (Non-Travel, Travel_Frequently) and (Travel_Frequently, Travel_Rarely) denotes that we reject the null hypothesis and conclude that the average daily rate is not the same for the two groups in each pair.
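The `reject` flags can also be read programmatically instead of from the summary table. The following sketch uses `pairwise_tukeyhsd` from `statsmodels.stats.multicomp` on made-up data (the groups `A`, `B`, `C` and their values are hypothetical, not taken from `sales_emp.csv`).

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# toy data (made up): A and B have similar means, C is far from both
values = np.array([10, 11, 12, 13, 14,
                   10.5, 11.5, 12.5, 13.5, 14.5,
                   30, 31, 32, 33, 34])
labels = np.repeat(['A', 'B', 'C'], 5)

result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.01)

# pairs are listed in sorted group order: (A, B), (A, C), (B, C)
for (g1, g2), reject in zip([('A', 'B'), ('A', 'C'), ('B', 'C')], result.reject):
    print(g1, g2, 'different' if reject else 'not significantly different')
```

Here only the pairs involving `C` are flagged as different, mirroring how the table above singles out `Travel_Frequently`.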