### An F&B manager wants to determine whether there is any significant difference in the diameter of the cutlet between two units. A randomly selected sample of cutlets was collected from both units and measured. Analyze the data and draw inferences at 5% significance level. Please state the assumptions and tests that you carried out to check validity of the assumptions.

Null hypothesis Ho: μ1 = μ2 (There is no difference in the mean diameter of cutlets between the two units)

Alternate hypothesis Ha: μ1 ≠ μ2 (There is a significant difference in the mean diameter of cutlets between the two units)

A two-sample, two-tailed t-test for independent samples is applicable.

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import norm

In [2]:
# import the dataset
data=pd.read_csv('Cutlets.csv')
data.head()

,Unit A,Unit B
0,6.809,6.7703
1,6.4376,7.5093
2,6.9157,6.73
3,7.3012,6.7878
4,7.4488,7.1522


In [3]:
unit_A = pd.Series(data.iloc[:,0])
unit_A

0     6.8090
1     6.4376
2     6.9157
3     7.3012
4     7.4488
5     7.3871
6     6.8755
7     7.0621
8     6.6840
9     6.8236
10    7.3930
11    7.5169
12    6.9246
13    6.9256
14    6.5797
15    6.8394
16    6.5970
17    7.2705
18    7.2828
19    7.3495
20    6.9438
21    7.1560
22    6.5341
23    7.2854
24    6.9952
25    6.8568
26    7.2163
27    6.6801
28    6.9431
29    7.0852
30    6.7794
31    7.2783
32    7.1561
33    7.3943
34    6.9405
Name: Unit A, dtype: float64

In [5]:
unit_B = pd.Series(data.iloc[:,1])
unit_B

0     6.7703
1     7.5093
2     6.7300
3     6.7878
4     7.1522
5     6.8110
6     7.2212
7     6.6606
8     7.2402
9     7.0503
10    6.8810
11    7.4059
12    6.7652
13    6.0380
14    7.1581
15    7.0240
16    6.6672
17    7.4314
18    7.3070
19    6.7478
20    6.8889
21    7.4220
22    6.5217
23    7.1688
24    6.7594
25    6.9399
26    7.0133
27    6.9182
28    6.3346
29    7.5459
30    7.0992
31    7.1180
32    6.6965
33    6.5780
34    7.3875
Name: Unit B, dtype: float64

In [6]:
# Two-sample, two-tailed t-test for independent samples (assumes equal variances by default)
p_value = stats.ttest_ind(unit_A,unit_B)
p_value

Ttest_indResult(statistic=0.7228688704678063, pvalue=0.4722394724599501)

In [9]:
np.round(p_value[1],2)     # 2-tail probability 

0.47

Compare p_value with α = 0.05 (5% significance level)

Since p_value = 0.47 > α = 0.05

Do Not Reject Ho

Therefore, there is no evidence of a significant difference in the mean diameter of cutlets between the two units.
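The question also asks for the assumptions behind the test. The pooled two-sample t-test assumes independent samples, approximately normal data in each group, and equal variances. The sketch below shows one possible way to check the last two. The choice of Shapiro-Wilk and Levene is an assumption of this example, not something the notebook ran, and `check_ttest_assumptions` is a hypothetical helper. It is demonstrated on the first ten diameters of each unit from the listings above, so its results say nothing definitive about the full samples.

```python
import numpy as np
from scipy import stats

def check_ttest_assumptions(sample_a, sample_b, alpha=0.05):
    """Hypothetical helper: Shapiro-Wilk per sample, Levene across both samples."""
    _, p_norm_a = stats.shapiro(sample_a)        # Ho: sample A is normally distributed
    _, p_norm_b = stats.shapiro(sample_b)        # Ho: sample B is normally distributed
    _, p_var = stats.levene(sample_a, sample_b)  # Ho: both samples have equal variance
    return {
        'normal_a': bool(p_norm_a > alpha),
        'normal_b': bool(p_norm_b > alpha),
        'equal_variance': bool(p_var > alpha),
        'p_values': (float(p_norm_a), float(p_norm_b), float(p_var)),
    }

# First ten diameters of each unit, copied from the listings above (illustration only)
unit_a = np.array([6.8090, 6.4376, 6.9157, 7.3012, 7.4488,
                   7.3871, 6.8755, 7.0621, 6.6840, 6.8236])
unit_b = np.array([6.7703, 7.5093, 6.7300, 6.7878, 7.1522,
                   6.8110, 7.2212, 6.6606, 7.2402, 7.0503])

checks = check_ttest_assumptions(unit_a, unit_b)
print(checks)

# If the equal-variance check fails, Welch's t-test drops that assumption
welch = stats.ttest_ind(unit_a, unit_b, equal_var=False)
print(welch)
```

On the real data, the same helper could be called with `unit_A` and `unit_B` from the notebook.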

### A hospital wants to determine whether there is any difference in the average Turn Around Time (TAT) of reports of the laboratories on their preferred list. They collected a random sample and recorded TAT for reports of 4 laboratories. TAT is defined as the time from sample collection to report dispatch. Analyze the data and determine whether there is any difference in average TAT among the different laboratories at 5% significance level.

Null Hypothesis Ho: The population mean TAT is the same for all laboratories

Alternate Hypothesis Ha: At least one laboratory's population mean TAT is different

One-way ANOVA F-test: analysis of variance to compare the means of more than 2 samples (columns)
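To make the F statistic less of a black box, the following sketch computes it by hand as the ratio of between-group to within-group mean squares and compares it with `stats.f_oneway`. It uses only the first five readings per laboratory (the rows shown by `data1.head()` in this section), so the numbers are illustrative and differ from the full-data result.

```python
import numpy as np
from scipy import stats

# First five TAT readings per laboratory (illustration only, not the full dataset)
groups = [
    np.array([185.35, 170.49, 192.77, 177.33, 193.41]),  # Laboratory 1
    np.array([165.53, 185.91, 194.92, 183.00, 169.57]),  # Laboratory 2
    np.array([176.70, 198.45, 201.23, 199.61, 204.63]),  # Laboratory 3
    np.array([166.13, 160.79, 185.18, 176.42, 152.60]),  # Laboratory 4
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)   # upper-tail probability

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```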

In [10]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import norm

In [11]:
# import the dataset
data1 = pd.read_csv('LabTAT.csv')
data1.head()

,Laboratory 1,Laboratory 2,Laboratory 3,Laboratory 4
0,185.35,165.53,176.7,166.13
1,170.49,185.91,198.45,160.79
2,192.77,194.92,201.23,185.18
3,177.33,183.0,199.61,176.42
4,193.41,169.57,204.63,152.6


In [13]:
# One-way ANOVA F-test across the four laboratories
p_value = stats.f_oneway(data1.iloc[:,0],data1.iloc[:,1],data1.iloc[:,2],data1.iloc[:,3])
p_value

F_onewayResult(statistic=118.70421654401437, pvalue=2.1156708949992414e-57)

In [17]:
np.round(p_value[1],4)

0.0

Compare p_value with α = 0.05 (5% significance level)

Since p_value ≈ 2.1e-57 (which rounds to 0.0) < α = 0.05

Reject Ho

Therefore, at least one laboratory's population mean TAT is different.

### Sales of products in four different regions are tabulated for males and females. Find if male-female buyer ratios are similar across regions.

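Null Hypothesis Ho: Male-female buyer ratio is the same across all regions (gender and region are independent)

Alternate Hypothesis Ha: Male-female buyer ratio differs in at least one region (gender and region are dependent)

The data are counts of categorical variables, so we use the chi-square test of independence to check whether there is an association between gender and region.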
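As a minimal sketch of what the chi-square test of independence computes, the block below derives the expected counts from the row and column totals, sums (observed - expected)² / expected over all eight cells, and compares the result with `chi2_contingency`. The observed counts are the ones tabulated in this section; the variable names are chosen only for this illustration.

```python
import numpy as np
from scipy import stats

# Observed counts: rows = Males, Females; columns = East, West, North, South
observed = np.array([[50, 142, 131, 70],
                     [435, 1523, 1356, 750]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 4)

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / observed.sum()

# Sum over every cell of the table, not just some of the columns
chi_sq = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p = stats.chi2.sf(chi_sq, dof)                     # upper-tail probability

stat_ref, p_ref, dof_ref, expected_ref = stats.chi2_contingency(observed)
print(chi_sq, stat_ref)
print(p, p_ref)
```

The hand-computed statistic and p-value should agree with the values returned by `chi2_contingency`, which is a useful cross-check when summing the cells manually.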

In [18]:
import pandas as pd
from scipy import stats as stats
import numpy as np

In [20]:
data2 = pd.read_csv('BuyerRatio.csv')
data2.head()

,Observed Values,East,West,North,South
0,Males,50,142,131,70
1,Females,435,1523,1356,750


In [21]:
data_table = data2.iloc[:,1:]   # drop the label column, keep the four region columns
data_table

,East,West,North,South
0,50,142,131,70
1,435,1523,1356,750


In [22]:
data_table.values

array([[  50,  142,  131,   70],
       [ 435, 1523, 1356,  750]], dtype=int64)

In [23]:
val = stats.chi2_contingency(data_table)
val

(1.595945538661058,
 0.6603094907091882,
 3,
 array([[  42.76531299,  146.81287862,  131.11756787,   72.30424052],
        [ 442.23468701, 1518.18712138, 1355.88243213,  747.69575948]]))

In [25]:
# Degrees of freedom for a contingency table: (rows - 1) * (columns - 1)
no_of_rows, no_of_columns = data_table.shape
degree_of_f=(no_of_rows-1)*(no_of_columns-1)
print('Degree of Freedom=',degree_of_f)

Degree of Freedom= 3


In [27]:
Expected_value = val[3]
Expected_value

array([[  42.76531299,  146.81287862,  131.11756787,   72.30424052],
       [ 442.23468701, 1518.18712138, 1355.88243213,  747.69575948]])

In [29]:
from scipy.stats import chi2
# Chi-square statistic: (observed - expected)^2 / expected, summed over every cell of the table
chi_square = (data_table.values - Expected_value)**2 / Expected_value
chi_square_statistics = chi_square.sum()
np.round(chi_square_statistics,2)

1.6

In [30]:
critical_value=chi2.ppf(0.95,3)
np.round(critical_value,2)

7.81

In [37]:
if chi_square_statistics >= critical_value:
    print('Dependent (Reject Ho) - Male-female ratio differs across regions')
else:
    print('Independent (Do Not Reject Ho) - No evidence that the male-female ratio differs across regions')

Independent (Do Not Reject Ho) - No evidence that the male-female ratio differs across regions


In [38]:
pvalue = 1-chi2.cdf(chi_square_statistics,3)
np.round(pvalue,2)

0.66

In [36]:
if pvalue <= 0.05:
    print('Dependent (Reject Ho) - Male-female ratio differs across regions')
else:
    print('Independent (Do Not Reject Ho) - No evidence that the male-female ratio differs across regions')

Independent (Do Not Reject Ho) - No evidence that the male-female ratio differs across regions


### TeleCall uses 4 centres around the globe to process customer order forms. They audit a certain % of the customer order forms. Any error in an order form renders it defective, and it has to be reworked before processing. The manager wants to check whether the defective % varies by centre. Please analyze the data at 5% significance level and help the manager draw appropriate inferences.

Null Hypothesis Ho: Independence of categorical variables (the defective % of customer order forms does not vary by centre)

Alternative Hypothesis Ha: Dependence of categorical variables (the defective % of customer order forms varies by centre)

In [39]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import norm
from scipy.stats import chi2_contingency

In [41]:
# import the dataset
data3 = pd.read_csv('Costomer+OrderForm.csv')
data3.head()

,Phillippines,Indonesia,Malta,India
0,Error Free,Error Free,Defective,Error Free
1,Error Free,Error Free,Error Free,Defective
2,Error Free,Defective,Defective,Error Free
3,Error Free,Error Free,Error Free,Error Free
4,Error Free,Error Free,Defective,Error Free


In [50]:
print(data3.Phillippines.value_counts(),'\n',data3.Indonesia.value_counts(),'\n',data3.Malta.value_counts(),'\n',data3.India.value_counts())

Error Free    271
Defective      29
Name: Phillippines, dtype: int64 
 Error Free    267
Defective      33
Name: Indonesia, dtype: int64 
 Error Free    269
Defective      31
Name: Malta, dtype: int64 
 Error Free    280
Defective      20
Name: India, dtype: int64


In [51]:
# Make a contingency table
obs = np.array([[271,267,269,280],[29,33,31,20]])
obs

array([[271, 267, 269, 280],
       [ 29,  33,  31,  20]])
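The counts above were typed in by hand from the `value_counts()` output. As a sketch of an alternative, the same kind of table can be built programmatically with `melt` and `pd.crosstab`, which avoids transcription errors. The small `forms` frame below is a stand-in built from the five rows shown by `data3.head()`, so its counts are illustrative only. On the real data, `data3` would take its place.

```python
import pandas as pd

# Stand-in for data3: the five rows shown by data3.head() (illustration only)
forms = pd.DataFrame({
    'Phillippines': ['Error Free', 'Error Free', 'Error Free', 'Error Free', 'Error Free'],
    'Indonesia':    ['Error Free', 'Error Free', 'Defective',  'Error Free', 'Error Free'],
    'Malta':        ['Defective',  'Error Free', 'Defective',  'Error Free', 'Defective'],
    'India':        ['Error Free', 'Defective',  'Error Free', 'Error Free', 'Error Free'],
})

# Reshape to one row per audited form: (centre, status)
long_form = forms.melt(var_name='centre', value_name='status')

# Count forms per status and centre, then restore the row/column order used above
counts = pd.crosstab(long_form['status'], long_form['centre']).reindex(
    index=['Error Free', 'Defective'], columns=forms.columns, fill_value=0
)
print(counts)

# counts.to_numpy() can then be passed to chi2_contingency in place of a hand-typed array
obs_from_data = counts.to_numpy()
```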

In [56]:
# Chi-square contingency test of independence
val = chi2_contingency(obs) # output is (chi2 statistic, p_value, df, expected frequencies)
val

(3.858960685820355,
 0.2771020991233135,
 3,
 array([[271.75, 271.75, 271.75, 271.75],
        [ 28.25,  28.25,  28.25,  28.25]]))

In [60]:
np.round(val[1],2)

0.28

Compare p_value with α = 0.05 (5% significance level)

Since p_value = 0.28 > α = 0.05

Do Not Reject Ho

Therefore, there is no evidence of dependence between the categorical variables (the defective % of customer order forms does not vary significantly by centre).