# Assignment 3 Solutions



## Importing Required Packages

In [43]:
# importing required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import spearmanr
from scipy.stats import shapiro
from scipy.stats import ttest_ind
from scipy.stats import f_oneway
from scipy.stats import chi2_contingency

### Question 1

An F&B manager wants to determine whether there is any significant difference in the diameter of the cutlet between two units. A randomly selected sample of cutlets was collected from both units and measured. Analyze the data and draw inferences at the 5% significance level. Please state the assumptions and the tests that you carried out to check the validity of the assumptions.


### Solution

Import the data and take a look at it.

In [44]:
cutlets = pd.read_csv('Cutlets.csv')
cutlets.head()


|   | Unit A | Unit B |
|---|--------|--------|
| 0 | 6.809 | 6.7703 |
| 1 | 6.4376 | 7.5093 |
| 2 | 6.9157 | 6.73 |
| 3 | 7.3012 | 6.7878 |
| 4 | 7.4488 | 7.1522 |


**Step 1:** State the hypotheses

Null hypothesis (H₀): There is no significant difference in the diameter of the 
cutlets between the two units.

Alternative hypothesis (H₁): There is a significant difference in the diameter 
of the cutlets between the two units.
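
The same hypotheses can be written in terms of population means. The symbols $\mu_A$ and $\mu_B$ are notation introduced here for the mean cutlet diameters of Unit A and Unit B; they do not appear in the assignment itself.

$$
H_0: \mu_A = \mu_B \qquad \text{versus} \qquad H_1: \mu_A \neq \mu_B
$$

This is a two-sided alternative, which matches the question's wording of "any significant difference".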

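**Step 2:** Test the validity of the assumptions: that the samples were randomly selected, are independent of one another, and follow a normal distribution.

First, we compute the **Durbin-Watson statistic** as a rough check on randomness. Strictly, it measures first-order autocorrelation in regression residuals (here, from regressing Unit B on Unit A in row order), so it only tells us whether consecutive observations are serially correlated.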

In [46]:
# Extract the data from the dataframe
x = cutlets['Unit A']
y = cutlets['Unit B']

# Add a constant term to the predictor variable
x = sm.add_constant(x)

# Fit an ordinary least squares (OLS) regression model
model = sm.OLS(y, x)
results = model.fit()

# Perform the Durbin-Watson test on the residuals
durbin_watson_statistic = sm.stats.stattools.durbin_watson(results.resid)

# Print the Durbin-Watson statistic
print("Durbin-Watson statistic:", durbin_watson_statistic)

Durbin-Watson statistic: 2.6042413709944587


The **Durbin-Watson statistic** ranges from 0 to 4: a value around 2 indicates no autocorrelation, values below 2 suggest positive autocorrelation, and values above 2 suggest negative autocorrelation. The observed value of 2.60 (rounded to two decimal places) points to slight negative autocorrelation at most, so we find no strong evidence against the assumption that the data were randomly sampled.
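
To make the 0-to-4 scale concrete, here is a minimal sketch that computes the statistic directly from its definition (sum of squared successive differences divided by the sum of squared residuals). The helper name `durbin_watson_by_hand` and both residual lists are made up for illustration; they are not taken from `Cutlets.csv`.

```python
import numpy as np

def durbin_watson_by_hand(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Illustrative residuals (hypothetical): signs alternate, i.e. negative autocorrelation
alternating = [1.0, -1.0, 1.0, -1.0]
# Illustrative residuals (hypothetical): slow drift, i.e. positive autocorrelation
drifting = [1.0, 1.1, 1.2, 1.3]

dw_alternating = durbin_watson_by_hand(alternating)
dw_drifting = durbin_watson_by_hand(drifting)

print("Alternating residuals:", dw_alternating)  # above 2
print("Drifting residuals:   ", dw_drifting)     # close to 0
```

Alternating residuals push the statistic above 2 and drifting residuals push it toward 0, which is the reading applied to the value of 2.60 above.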

Next, we conduct **Spearman's rank correlation test** to check whether the two samples are independent of each other. It suits continuous data and does not assume normality.

In [47]:
# Extract the data from the dataframe
a = cutlets['Unit A']
b = cutlets['Unit B']

# Perform the Spearman's rank correlation test
correlation_coefficient, p_value = spearmanr(a, b)

# Print the correlation coefficient and p-value
print("Correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)

Correlation coefficient: 0.04537815126050421
P-value: 0.7957554247873968


Here the correlation coefficient is close to 0, which indicates no monotonic relationship between the samples. The p-value is the probability of observing a correlation at least this strong by chance alone if the null hypothesis of no correlation were true. Since the p-value (about 0.80) is well above 0.05, we fail to reject that null hypothesis and treat the samples as independent of one another.
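
As a sketch of what the coefficient measures: Spearman's correlation is the Pearson correlation computed on ranks rather than on raw values. The two short arrays below are made-up illustrative data, not values from `Cutlets.csv`.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Hypothetical data chosen so the ranks are easy to follow
a_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b_demo = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Replace each value by its rank, then take the ordinary Pearson correlation
rank_a = rankdata(a_demo)
rank_b = rankdata(b_demo)
rho_by_hand = np.corrcoef(rank_a, rank_b)[0, 1]

# scipy's implementation for comparison
rho_scipy, p_demo = spearmanr(a_demo, b_demo)

print("By hand:", rho_by_hand)
print("scipy:  ", rho_scipy)
```

Because only ranks are used, the test detects any monotonic relationship, not just a linear one.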

Finally, we conduct the **Shapiro-Wilk Test** to test the normality of the data.

In [48]:
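# Note: shapiro flattens its input, so passing the whole DataFrame
# pools Unit A and Unit B into a single sample
stat, p = shapiro(cutlets)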

if p > 0.05:
    print('Using Shapiro-Wilk Test we conclude that the cutlets data is nearly normally distributed')
else:
    print('Using Shapiro-Wilk Test we conclude that the cutlets data is probably not normally distributed')

Using Shapiro-Wilk Test we conclude that the cutlets data is nearly normally distributed
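
The t-test assumes normality within each group, so it can be more informative to test each unit separately rather than pooling them. The sketch below shows one way to do that. It uses randomly generated stand-in data (the `demo` DataFrame, its sample size of 30, and the distribution parameters are all assumptions), so its output says nothing about the real cutlets data.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Stand-in data (hypothetical); replace `demo` with the real DataFrame
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Unit A': rng.normal(loc=7.0, scale=0.3, size=30),
    'Unit B': rng.normal(loc=7.0, scale=0.3, size=30),
})

# Run the Shapiro-Wilk test once per column and keep the results
shapiro_results = {}
for column in demo.columns:
    stat_col, p_col = shapiro(demo[column])
    shapiro_results[column] = (stat_col, p_col)
    print(f'{column}: W = {stat_col:.4f}, p = {p_col:.4f}')
```

A p-value above 0.05 for a column means the test found no evidence against normality for that unit; it does not prove the data are normal.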


**Step 3:** Choose the Significance Level

In [49]:
alpha = 0.05

**Step 4:** Conduct the appropriate hypothesis test. Since we are comparing the means of two independent groups, we can use the independent **two-sample t-test**.

In [50]:
print('Average diameter of cutlets in Unit A:', np.mean(cutlets['Unit A']))    
print('Average diameter of cutlets in Unit B:', np.mean(cutlets['Unit B']))

t_statistic, p_value = ttest_ind(cutlets['Unit A'], cutlets['Unit B'])

print('The t-statistic is:', t_statistic)
print('The p-value is    :', p_value)

if p_value < alpha:
    print('We reject the Null hypothesis and conclude that there is a significant difference in the diameter of the cutlets between the two units')
else:
    print('We fail to reject the Null hypothesis and conclude that there is no significant difference in the diameter of cutlets between the two units')

Average diameter of cutlets in Unit A: 7.0190914285714285
Average diameter of cutlets in Unit B: 6.964297142857142
The t-statistic is: 0.7228688704678063
The p-value is    : 0.47223947245995
We fail to reject the Null hypothesis and conclude that there is no significant difference in the diameter of cutlets between the two units
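
By default, `ttest_ind` assumes the two groups have equal variances, an assumption the steps above do not check. One possible extension, sketched below, runs Levene's test first and switches to Welch's t-test (`equal_var=False`) if the variances appear to differ. The helper `compare_two_means` is a hypothetical name, and the demo uses only the five rows displayed by `cutlets.head()`, so its numbers are not the full-sample result.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

def compare_two_means(sample_a, sample_b, alpha=0.05):
    """Levene's test for equal variances, then a pooled or Welch t-test."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    _, levene_p = levene(a, b)
    # Keep the pooled-variance test only if Levene's test finds no difference
    equal_var = bool(levene_p > alpha)
    t_stat, p_val = ttest_ind(a, b, equal_var=equal_var)
    return {
        'levene_p': float(levene_p),
        'equal_var': equal_var,
        't_statistic': float(t_stat),
        'p_value': float(p_val),
        'reject_null': bool(p_val < alpha),
    }

# First five rows of each unit as displayed by cutlets.head()
unit_a_head = [6.809, 6.4376, 6.9157, 7.3012, 7.4488]
unit_b_head = [6.7703, 7.5093, 6.73, 6.7878, 7.1522]

result = compare_two_means(unit_a_head, unit_b_head)
print(result)
```

Passing `cutlets['Unit A']` and `cutlets['Unit B']` instead would apply the same check to the full samples.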


### Question 2

A hospital wants to determine whether there is any difference in the average Turn Around Time (TAT) of reports of the laboratories on their preferred list. They collected a random sample and recorded TAT for reports of 4 laboratories. TAT is defined as sample collected to report dispatch.
   
Analyze the data and determine whether there is any difference in average TAT among the different laboratories at 5% significance level.


### Solution

Import the data and take a look at it.

In [51]:
lab = pd.read_csv('LabTAT.csv')
lab.head()

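|   | Laboratory 1 | Laboratory 2 | Laboratory 3 | Laboratory 4 |
|---|--------------|--------------|--------------|--------------|
| 0 | 185.35 | 165.53 | 176.7 | 166.13 |
| 1 | 170.49 | 185.91 | 198.45 | 160.79 |
| 2 | 192.77 | 194.92 | 201.23 | 185.18 |
| 3 | 177.33 | 183.0 | 199.61 | 176.42 |
| 4 | 193.41 | 169.57 | 204.63 | 152.6 |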


In [52]:
# Extract TAT values for each laboratory
lab1 = lab['Laboratory 1']
lab2 = lab['Laboratory 2']
lab3 = lab['Laboratory 3']
lab4 = lab['Laboratory 4']

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(lab1, lab2, lab3, lab4)

# Print the F-statistic and p-value
print("F-statistic:", f_statistic)
print("P-value:", p_value)

F-statistic: 118.70421654401437
P-value: 2.1156708949992414e-57


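Since the p-value (about 2.1e-57) is far below the significance level of 0.05, we reject the null hypothesis that all four laboratories have the same average TAT. At least one laboratory's mean TAT differs from the others, although the ANOVA alone does not tell us which one.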
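
To show what the F-statistic measures, the sketch below computes it by hand as the ratio of between-group to within-group mean squares and compares it with `f_oneway`. The three small groups and the helper name `f_statistic_by_hand` are made up for illustration; they are not taken from `LabTAT.csv`.

```python
import numpy as np
from scipy.stats import f_oneway

def f_statistic_by_hand(*groups):
    """One-way ANOVA F = (between-group mean square) / (within-group mean square)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)       # number of groups
    n = all_values.size   # total number of observations
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical groups whose means are 2, 3 and 4
g1, g2, g3 = [1, 2, 3], [2, 3, 4], [3, 4, 5]

f_by_hand = f_statistic_by_hand(g1, g2, g3)
f_scipy, p_scipy = f_oneway(g1, g2, g3)

print("By hand:", f_by_hand)
print("scipy:  ", f_scipy)
```

A large F means the group means are spread far apart relative to the scatter inside each group, which is the situation the laboratory data shows.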

### Question 3

Sales of products in four different regions are tabulated for males and females. Find if male-female buyer ratios are similar across regions.

### Solution

Import the data and take a look at it.

In [91]:
buyr = pd.read_csv('BuyerRatio.csv')
buyr.head()

|   | Observed Values | East | West | North | South |
|---|-----------------|------|------|-------|-------|
| 0 | Males | 50 | 142 | 131 | 70 |
| 1 | Females | 435 | 1523 | 1356 | 750 |


To determine if the male-female buyer ratios are similar across regions, we can perform the **Chi-Square test of independence**. This test assesses whether there is an association between two categorical variables, here gender and region.

The question does not specify a significance level, so we take 0.1.

In [94]:
# Create a dataframe with the sales data
buyr = pd.DataFrame({
    'East': [50, 435],
    'West': [142, 1523],
    'North': [131, 1356],
    'South': [70, 750]
}, index=['Male', 'Female'])

# Display the contingency table
print(buyr)

        East  West  North  South
Male      50   142    131     70
Female   435  1523   1356    750


In [95]:
# Perform the Chi-Square test of independence
chi2, p_value, dof, expected = chi2_contingency(buyr)

# Print the test statistics and p-value
print("Chi-Square statistic:", chi2)
print("P-value:", p_value)

Chi-Square statistic: 1.595945538661058
P-value: 0.6603094907091882


Since the p-value (about 0.66) is greater than the significance level of 0.1, we fail to reject the null hypothesis of independence. The data give no evidence that the male-female buyer ratio differs across regions.
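
As a check on what `chi2_contingency` does, the sketch below recomputes the statistic from the observed table: each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The variable names are introduced here for illustration. No continuity correction is applied, since `chi2_contingency` uses one only when there is a single degree of freedom.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts from the table above (rows: Male, Female; columns: East, West, North, South)
observed = np.array([
    [50, 142, 131, 70],
    [435, 1523, 1356, 750],
], dtype=float)

# Expected counts under independence of gender and region
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected_by_hand = row_totals * col_totals / grand_total

# Pearson chi-square statistic, degrees of freedom, and upper-tail p-value
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()
dof_by_hand = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_by_hand = chi2.sf(chi2_by_hand, dof_by_hand)

print("Chi-Square statistic:", chi2_by_hand)
print("Degrees of freedom:", dof_by_hand)
print("P-value:", p_by_hand)
```

The result should agree with the `chi2_contingency` output above up to floating-point rounding.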

### Question 4

TeleCall uses 4 centers around the globe to process customer order forms. They audit a certain % of the customer order forms. Any error in an order form renders it defective, and the form has to be reworked before processing. The manager wants to check whether the defective % varies by center. Please analyze the data at the 5% significance level and help the manager draw appropriate inferences.

### Solution

Import the data and take a look at it.

In [126]:
cof = pd.read_csv('Costomer+OrderForm.csv')
cof.head()

|   | Phillippines | Indonesia | Malta | India |
|---|--------------|-----------|-------|-------|
| 0 | Error Free | Error Free | Defective | Error Free |
| 1 | Error Free | Error Free | Error Free | Defective |
| 2 | Error Free | Defective | Defective | Error Free |
| 3 | Error Free | Error Free | Error Free | Error Free |
| 4 | Error Free | Error Free | Defective | Error Free |


To analyze the data and determine if the defective percentage varies by center, we can use the **Chi-Square test of independence** on the table of error-free and defective counts per center.

In [123]:
# Compute value counts for each column
value_counts_cof = pd.DataFrame({
    'Phillippines': cof['Phillippines'].value_counts(),
    'Indonesia': cof['Indonesia'].value_counts(),
    'Malta': cof['Malta'].value_counts(),
    'India': cof['India'].value_counts()
})

# Display the value counts dataframe
print(value_counts_cof)


            Phillippines  Indonesia  Malta  India
Error Free           271        267    269    280
Defective             29         33     31     20
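
The same table can be built without naming each column by applying `value_counts` to every column at once. The sketch below uses only the five rows shown by `cof.head()` as a stand-in (the `cof_head` name is introduced here), so its counts are not the full-data counts above.

```python
import pandas as pd

# Stand-in data: the five rows displayed by cof.head()
cof_head = pd.DataFrame({
    'Phillippines': ['Error Free'] * 5,
    'Indonesia': ['Error Free', 'Error Free', 'Defective', 'Error Free', 'Error Free'],
    'Malta': ['Defective', 'Error Free', 'Defective', 'Error Free', 'Defective'],
    'India': ['Error Free', 'Defective', 'Error Free', 'Error Free', 'Error Free'],
})

# Count each category per column; a category absent from a column yields NaN,
# so fill those gaps with 0 before converting back to integers
counts_head = cof_head.apply(pd.Series.value_counts).fillna(0).astype(int)

print(counts_head)
```

The `fillna(0)` step matters whenever a center has no defective (or no error-free) forms, as the Phillippines column does in this small sample.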


In [125]:
# Perform the Chi-Square test of independence
chi2, p_value, dof, expected = chi2_contingency(value_counts_cof)

# Print the test statistics and p-value
print("Chi-Square statistic:", chi2)
print("P-value:", p_value)

Chi-Square statistic: 3.858960685820355
P-value: 0.2771020991233135


Since the p-value (about 0.28) is greater than the significance level of 0.05, we fail to reject the null hypothesis of independence. The data give no evidence that the defective percentage varies by center.
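
For context, the defective percentage per center can be read off the value counts printed earlier. The sketch below rebuilds that table by hand (the `counts` name is introduced here) and divides the defective row by each column total.

```python
import pandas as pd

# Counts copied from the value counts table printed above
counts = pd.DataFrame({
    'Phillippines': [271, 29],
    'Indonesia': [267, 33],
    'Malta': [269, 31],
    'India': [280, 20],
}, index=['Error Free', 'Defective'])

# Defective forms as a percentage of all audited forms, per center
defective_pct = 100 * counts.loc['Defective'] / counts.sum(axis=0)

print(defective_pct.round(2))
```

The percentages differ somewhat between centers, but the Chi-Square test above indicates that differences of this size are consistent with chance variation at the 5% level.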