## Table of Contents

1. **[Large Sample Test](#z)**
2. **[Small Sample Test](#t)**
3. **[Z Proportion Test](#prop)**

**Import the required libraries**

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt

### Let's begin with some hands-on practice exercises

<a id = "z"> </a>
## 1. Large Sample Test

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>1. The company MangoFun, which produces mango juice, claims that the protein content of its juice is higher than that of the juice produced by its competitor FruitMix. Protein content is measured in 50 boxes of MangoFun juice and 80 boxes of FruitMix juice, each drawn from a normal population; the sample means are 0.4 g and 0.35 g, and the sample standard deviations are 0.08 g and 0.05 g, respectively. Test the claim using the critical value method at a 1% level of significance.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

Use the data given below:

time =  [2.51, 2.1, 2.18, 2.61, 1.9, 1.7, 1.95, 2.1, 1.58, 2.5, 2.19, 1.85, 1.35, 2.4, 2.22, 2.15, 1.82, 1.62, 1.26, 
         2.2, 2.35, 2.5, 2.18, 1.58, 2.24, 1.84, 2.31, 2.24, 1.75, 1.41, 1.57, 2.4]

In [3]:
time = [2.51, 2.1, 2.18, 2.61, 1.9, 1.7, 1.95, 2.1, 1.58, 2.5, 2.19, 1.85, 1.35, 2.4, 2.22, 2.15, 1.82, 1.62, 1.26, 2.2, 2.35, 2.5, 2.18, 1.58, 2.24, 1.84, 2.31, 2.24, 1.75, 1.41, 1.57, 2.4]

In [4]:
#mu1 : mean protein content of MangoFun juice, mu2 : mean protein content of FruitMix juice
#H0 : mu1 <= mu2
#Ha : mu1 > mu2

In [6]:
#calculating the right-tailed critical value at the 0.01 level of significance
alpha = 0.01
z_crit = stats.norm.isf(alpha)
z_crit

2.3263478740408408

In [7]:
n1 = 50
n2 = 80
x1_bar = 0.4
x2_bar = 0.35
sig_1 =0.08
sig_2 = 0.05
sl =0.01

In [8]:
num = (x1_bar - x2_bar)

In [9]:
den = np.sqrt(       ((sig_1**2)/n1)    +    ((sig_2**2)/n2)     )

In [10]:
z_stat =  num/den
z_stat

3.96214425875164

In [11]:
p_val = stats.norm.sf(z_stat)
p_val

3.7139817357365585e-05

In [13]:
print(z_stat)
print(z_crit)

3.96214425875164
2.3263478740408408


In [17]:
z_stat > z_crit

True

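In [None]:
#z_stat is greater than z_crit, so we reject the null hypothesis

In [18]:
#Ha : mu1 > mu2 (the protein content of MangoFun juice is higher than that of FruitMix juice)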
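
The steps of Question 1 can be collected into one small helper. This is a minimal sketch rather than part of the original solution: the function name `two_sample_z` and its argument names are assumptions made for illustration, and it covers only the right-tailed case (H0: mu1 <= mu2).

```python
import numpy as np
from scipy import stats

def two_sample_z(x1_bar, x2_bar, sd_1, sd_2, n1, n2):
    """Return (z statistic, right-tailed p-value) for H0: mu1 <= mu2."""
    # standard error of the difference between the two sample means
    std_err = np.sqrt(sd_1**2 / n1 + sd_2**2 / n2)
    z = (x1_bar - x2_bar) / std_err
    # sf(z) is the area to the right of z under the standard normal curve
    return z, stats.norm.sf(z)

# summary values from Question 1
z_stat, p_val = two_sample_z(0.4, 0.35, 0.08, 0.05, 50, 80)
z_crit = stats.norm.isf(0.01)   # right-tailed critical value at alpha = 0.01
reject_h0 = z_stat > z_crit
print(z_stat, z_crit, p_val, reject_h0)
```

The critical value rule (`z_stat > z_crit`) and the p-value rule (`p_val < 0.01`) always lead to the same decision for a right-tailed test.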

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>2. The technician wants to test whether the price of a phablet is different from the price of a tablet. For the study, the technician collects data on 50 gadgets of each type. Check whether the data satisfies the conditions for using the two sample Z-test. (Use a 5% level of significance)</b>
                </font>
            </div>
        </td>
    </tr>
</table>

Consider the dataset given in the CSV file `electronic.csv`. 

To use the two sample Z-test, both the samples should be drawn from a normal 
population and the population variances should be equal. Let us use the 
`Shapiro-Wilk` test to check the normality of the data and use 
`Levene's` test to test the equality of variances.

In [21]:
df = pd.read_csv('electronic.csv')
df.head()

Unnamed: 0,Screensize (inch),Price ($),Weight (g),Type
0,6.5,245,252,Phablet
1,6.3,210,252,Phablet
2,6.1,224,234,Phablet
3,6.0,217,212,Phablet
4,6.4,252,252,Phablet


In [22]:
df['Type'].value_counts()

Phablet    50
Tablet     50
Name: Type, dtype: int64

In [25]:
sample_1 = df[df['Type']=='Phablet']['Price ($)']
sample_2 = df[df['Type']=='Tablet']['Price ($)']

In [17]:
#ho  : price(phablet) = price(tablet)
# ha : price(phablet) != price(tablet)

In [18]:
#test for normality

#ho : Data is normal / skew =0
#ha : Data is not normal / skew !=0

In [None]:
#stats.shapiro(sample_1)
#stats.shapiro(sample_2)

In [27]:
print(stats.shapiro(sample_1))
print(stats.shapiro(sample_2))

ShapiroResult(statistic=0.9686920642852783, pvalue=0.20465952157974243)
ShapiroResult(statistic=0.9741330742835999, pvalue=0.33798879384994507)


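In [28]:
#p-values of the Shapiro-Wilk test for the phablet and tablet samples
pval = 0.20465952157974243
pval_2 = 0.33798879384994507
sl = 0.05

In [30]:
(pval>sl) and (pval_2>sl)  #we fail to reject H0 for both samples

True

In [31]:
#both samples can be treated as normally distributed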

In [32]:
stats.levene(sample_1,sample_2)

LeveneResult(statistic=0.6635459332943239, pvalue=0.4172859681296204)

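In [33]:
#the p-value of Levene's test is used as it is; it should not be halved
pvalue=0.4172859681296204
pvalue

0.4172859681296204

In [34]:
pvalue>sl  #we fail to reject H0: the variances are equal

True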

In [27]:
#we can say that the prices of both gadget types are normally distributed
#and that their variances are equal.
#Thus we can perform the two sample Z-test on the sample data.

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>3. The technician claims that the price of a phablet is different from the price of a tablet. For the study, the technician collects data on 50 gadgets of each type. Test the technician's claim using the 95% confidence interval.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

Consider the dataset given in the CSV file `electronic.csv`. 

In [71]:
df = pd.read_csv('electronic.csv')

In [72]:
df.head()

Unnamed: 0,Screensize (inch),Price ($),Weight (g),Type
0,6.5,245,252,Phablet
1,6.3,210,252,Phablet
2,6.1,224,234,Phablet
3,6.0,217,212,Phablet
4,6.4,252,252,Phablet


In [28]:
sample_1 = df[df['Type']=='Phablet']['Price ($)']
sample_2 =df[df['Type']=='Tablet']['Price ($)']

In [29]:
#ho  : price(phablet) = price(tablet)
#ha : price(phablet) != price(tablet)
sl = 0.05

In [30]:
x1_bar = np.mean(sample_1)
n1 = len(sample_1)
sig_1 = np.std(sample_1)
x2_bar = np.mean(sample_2)
n2 = len(sample_2)
sig_2 = np.std(sample_2)
sig_lvl = 0.05

In [31]:
num = (x1_bar - x2_bar)

In [32]:
den = np.sqrt(       ((sig_1**2)/n1)    +    ((sig_2**2)/n2)     )

In [33]:
z_stat = (num / den)
z_stat

9.377016317496455

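In [34]:
#two-tailed p-value: twice the upper-tail area beyond |z_stat|
#(the tail area is multiplied by 2, not squared, and is not halved afterwards)
p_val = stats.norm.sf(abs(z_stat))*2
p_val   #approximately 6.8e-21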

In [36]:
sl

0.05

In [37]:
p_val< sl

True

In [83]:
#ho is rejected 
#ha is accepted

In [38]:
#ha : price(phablet) != price(tablet)
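
Question 3 asks for a 95% confidence interval, which the cells above do not compute. A minimal sketch of the interval for the difference of two means is given below. The helper name `diff_mean_ci` is an assumption, and the summary values passed to it are illustrative only; they are not taken from `electronic.csv`. In the notebook, `x1_bar`, `x2_bar`, `sig_1`, `sig_2`, `n1` and `n2` would be passed instead.

```python
import math
from statistics import NormalDist

def diff_mean_ci(x1_bar, x2_bar, sd_1, sd_2, n1, n2, conf=0.95):
    """Two-sided z confidence interval for mu1 - mu2."""
    std_err = math.sqrt(sd_1**2 / n1 + sd_2**2 / n2)
    # two-sided critical value, e.g. about 1.96 for conf = 0.95
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    diff = x1_bar - x2_bar
    return diff - z * std_err, diff + z * std_err

# hypothetical summary statistics, used only to show the call
lower, upper = diff_mean_ci(250.0, 230.0, 12.0, 10.0, 50, 50)
zero_inside = lower <= 0 <= upper
print(lower, upper, zero_inside)
```

If the interval does not contain 0, the null hypothesis of equal mean prices is rejected at the matching significance level (5% for a 95% interval).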

<a id="t"></a>
## 2. Small Sample Test

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>4. The quality assurance department has collected 15 packets of potato chips. The department wants to test the average weight of the packets. Check whether they can use the one sample t-test for the population mean with 95% confidence.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

Use the data given below:
        
    pack_wt = [26.8, 29.6, 27.8, 31.2, 30.9, 27.1, 28, 28.6, 29.4, 29.3, 31.5, 32.4, 29.7, 28.1, 31.9]

In [39]:
pack_wt = [26.8, 29.6, 27.8, 31.2, 30.9, 27.1, 28, 28.6, 29.4, 29.3, 31.5, 32.4, 29.7, 28.1, 31.9]

In [40]:
#In general, the one sample t-test requires the following conditions:
#the sample size should be less than 30
#the sample must be drawn from a normal population (check with the Shapiro-Wilk test)

In [41]:
#from given question sample size is 15

In [44]:
test_stat,p_value=stats.shapiro(pack_wt)

In [45]:
sl=0.05

In [47]:
p_value>sl

True

In [48]:
#so we fail to reject null hypothesis,data is normal

In [49]:
#so we can perform one sample t test

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>5. Use the weights of the potato chips packets given in the previous question and test whether the average weight of a packet is more than 30 g using the p-value method. (Use a 10% level of significance)</b>
                </font>
            </div>
        </td>
    </tr>
</table>

Use the data given below:
        
    pack_wt = [26.8, 29.6, 27.8, 31.2, 30.9, 27.1, 28, 28.6, 29.4, 29.3, 31.5, 32.4, 29.7, 28.1, 31.9]

In [50]:
#ho : mu <= 30
#ha : mu > 30

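In [51]:
#right-tailed test: ask SciPy (1.6 or later) for the one-sided p-value directly
test_stat,p_val = stats.ttest_1samp(pack_wt,popmean=30,alternative='greater')

In [52]:
#the sample mean (about 29.49 g) is below 30, so halving the two-sided p-value
#would give the left-tail area (0.1407) instead of the right-tail area
p_val   #approximately 0.8593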

In [53]:
sl =0.1

In [54]:
p_val>sl #ho is selected 

True

In [55]:
#we fail to reject the null hypothesis

In [56]:
#there is not enough evidence that the average weight of a packet is more than 30 g
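
For a one-sided test, the tail used for the p-value depends on the sign of the test statistic. The sketch below computes the t statistic for Question 5 by hand and reports both tail areas; the variable names `p_left` and `p_right` are assumptions introduced for illustration.

```python
import numpy as np
from scipy import stats

pack_wt = [26.8, 29.6, 27.8, 31.2, 30.9, 27.1, 28, 28.6, 29.4, 29.3,
           31.5, 32.4, 29.7, 28.1, 31.9]

n = len(pack_wt)
x_bar = np.mean(pack_wt)
s = np.std(pack_wt, ddof=1)            # sample standard deviation
t_stat = (x_bar - 30) / (s / np.sqrt(n))

p_left = stats.t.cdf(t_stat, df=n - 1)   # p-value for Ha: mu < 30
p_right = stats.t.sf(t_stat, df=n - 1)   # p-value for Ha: mu > 30
print(t_stat, p_left, p_right)
```

Because the sample mean is below 30, `t_stat` is negative and the p-value for Ha: mu > 30 is the larger of the two tail areas.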

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>6. The orthopaedic surgeon states that a T-score for females older than 30 years is less than -1.2 which indicates the low bone density. To test the claim a sample of 10 women was selected and the average T-score was found to be -1.34 with a standard deviation of 0.8. Test the surgeon's claim using a critical value method with 95% confidence. (Assume the normality of the T-score). </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [57]:
#H0 : mu >= -1.2
#Ha : mu < -1.2 (low bone density), where mu is the mean T-score of females older than 30 years

In [58]:
x_bar = -1.34
mu = -1.2
S = 0.8
sl =0.05
n=10

In [63]:
cric_val = stats.t.isf(1-0.05,df=n-1)
cric_val

-1.8331129326536335

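In [64]:
#the test statistic follows a t-distribution with n-1 degrees of freedom
t_stat = (x_bar-mu)/(S/(n**0.5))
t_stat

-0.5533985905294669

In [68]:
t_stat>cric_val  #left-tailed test: H0 is rejected only if t_stat < cric_val

True

In [69]:
#we fail to reject the null hypothesis

In [70]:
#there is not enough evidence that the mean T-score is less than -1.2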
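
The same decision can be cross-checked with the p-value of the left-tailed test. This is a short sketch using the summary values of Question 6; the name `p_left` is an assumption.

```python
from scipy import stats

x_bar, mu, s, n = -1.34, -1.2, 0.8, 10

t_stat = (x_bar - mu) / (s / n**0.5)
# left-tailed test: the p-value is the area to the left of t_stat
p_left = stats.t.cdf(t_stat, df=n - 1)
reject_h0 = p_left < 0.05
print(t_stat, p_left, reject_h0)
```

The p-value is well above 0.05, which agrees with the critical value method: the null hypothesis is not rejected.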

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>7. The newspaper agency wants to test whether Harry takes less time to deliver the newspapers than his colleague Ron. The manager has collected time (in minutes) taken by Harry and Ron for 7 days. Is the given dataset unpaired? If yes, check whether the manager at newspaper agency can use the two sample t-test for unpaired data with 99% confidence.  </b>
                </font>
            </div>
        </td>
    </tr>
</table>

Use the data given below:
        
    harry = [18.5, 17.4, 19.2, 16, 15.8, 13.4, 19.5]
    ron = [19.7, 18.6, 21.3, 17.5, 23.8, 20.7, 21]

In [85]:
#The given data is the time taken by Harry and Ron to deliver the newspapers.
#The time taken by one of them does not depend on the time taken by the other,
#so the data is unpaired. Now, to use the two sample t-test, let us check the
#normality of the samples and the equality of population variances for both
#samples.

In [86]:
harry = [18.5, 17.4, 19.2, 16, 15.8, 13.4, 19.5]
ron = [19.7, 18.6, 21.3, 17.5, 23.8, 20.7, 21]

In [87]:
# time taken by Harry and Ron
time = [18.5, 17.4, 19.2, 16, 15.8, 13.4, 19.5, 19.7, 18.6, 21.3, 17.5, 23.8, 20.7, 21]

In [88]:
#checking normality by shapiro wilk test

In [89]:
test_stat,p_val=stats.shapiro(time)

In [90]:
sl = 0.01

In [91]:
p_val>sl

True

In [80]:
#Data is normal

In [92]:
#The p-value is greater than 0.01, thus we can say that 
#the time taken by Harry and Ron is normally distributed.

In [96]:
# perform Shapiro-Wilk test to test the normality
# shapiro() returns a tuple having the values of test statistics and the corresponding p-value
stat, p_value = stats.shapiro(time)

# print the p-value 
print('P-Value:', p_value)

P-Value: 0.9969169497489929


In [97]:
p_val

0.9969169497489929

In [98]:
sl

0.01

In [99]:
p_val > sl

True

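In [101]:
#From the above result, we can see that the p-value is greater than 0.01,
#thus we fail to reject the hypothesis that the data is normally distributed.

#The Shapiro-Wilk test does not check the equality of population variances;
#that assumption needs a separate test such as Levene's test
#(stats.levene(harry, ron)). If its p-value is also greater than 0.01, the
#data satisfies both assumptions and the manager at the newspaper agency can
#use the two sample t-test for unpaired data.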
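
Equality of variances can be checked with Levene's test, and normality can be checked for each sample separately rather than on the combined list. The sketch below shows one way to do this; the names `p_harry`, `p_ron`, `p_levene` and `assumptions_ok` are assumptions introduced for illustration.

```python
from scipy import stats

harry = [18.5, 17.4, 19.2, 16, 15.8, 13.4, 19.5]
ron = [19.7, 18.6, 21.3, 17.5, 23.8, 20.7, 21]
sl = 0.01

# Shapiro-Wilk test on each sample (H0: the sample comes from a normal population)
_, p_harry = stats.shapiro(harry)
_, p_ron = stats.shapiro(ron)

# Levene's test (H0: the two population variances are equal)
_, p_levene = stats.levene(harry, ron)

# both assumptions hold only if none of the three tests rejects its H0
assumptions_ok = min(p_harry, p_ron, p_levene) > sl
print(p_harry, p_ron, p_levene, assumptions_ok)
```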

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>8. The newspaper agency wants to test whether Harry takes less time to deliver the newspapers than his colleague Ron. The manager has collected time (in minutes) taken by Harry and Ron for 7 days. Use the given data and test the hypothesis for population mean time using a critical value method and p-value criteria with 90% confidence.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

Use the data given below:
        
    harry = [18.5, 17.4, 19.2, 16, 15.8, 13.4, 19.5]
    ron = [19.7, 18.6, 21.3, 17.5, 23.8, 20.7, 21]

In [115]:
harry = [18.5, 17.4, 19.2, 16, 15.8, 13.4, 19.5]
ron = [19.7, 18.6, 21.3, 17.5, 23.8, 20.7, 21]

In [83]:
#H0 mu(harry)>=mu(ron)
#Ha mu(harry)<mu(ron)

In [84]:
test_stat,p_val =stats.ttest_ind(harry,ron)

In [118]:
p_val =p_val/2
p_val

0.00692997580305821

In [119]:
sl =0.1

In [120]:
p_val<sl #h0 rejected #ha is accepted

True

In [94]:
#Harry takes less time to deliver the newspapers than his colleague Ron.
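
Question 8 also asks for the critical value method, which the cells above do not show. A minimal sketch under the equal-variance assumption (the default of `stats.ttest_ind`) is given below; the names `pooled_var`, `dof` and `t_crit` are assumptions introduced for illustration.

```python
import numpy as np
from scipy import stats

harry = [18.5, 17.4, 19.2, 16, 15.8, 13.4, 19.5]
ron = [19.7, 18.6, 21.3, 17.5, 23.8, 20.7, 21]

n1, n2 = len(harry), len(ron)
dof = n1 + n2 - 2

# pooled variance of the two samples (equal-variance two sample t-test)
pooled_var = ((n1 - 1) * np.var(harry, ddof=1)
              + (n2 - 1) * np.var(ron, ddof=1)) / dof
t_stat = (np.mean(harry) - np.mean(ron)) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# left-tailed critical value at the 10% level of significance
t_crit = stats.t.ppf(0.10, df=dof)
p_left = stats.t.cdf(t_stat, df=dof)

reject_h0 = t_stat < t_crit
print(t_stat, t_crit, p_left, reject_h0)
```

For the left-tailed alternative (Harry takes less time than Ron), the null hypothesis is rejected when the test statistic falls below the critical value, which matches the p-value decision.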

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                        <b>9. The pharmaceutical company organized a program to introduce its new drug for lowering the sugar level. They recorded the fasting sugar (in mg/dl) of 25 diabetic people. Those people took the new drug for 10 days and then took the fasting sugar level test again. The company claims that the sugar level decreases due to its new drug. Test the claim using the p-value technique with 95% confidence.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

Use the data given in the file `sugar_level.xlsx`

In [104]:
df =pd.read_excel("sugar_level.xlsx")
df.head()

Unnamed: 0,sugar_before,sugar_after
0,135,132
1,132,136
2,142,139
3,154,151
4,198,192


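The null and alternative hypotheses are given below, where $\mu_{d}$ is the mean of the differences (sugar level after the medication minus sugar level before the medication):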

H<sub>0</sub>: The new drug was not effective in reducing the fasting sugar level ($\mu_{d} \geq 0$)<br>
H<sub>1</sub>: The new drug was effective in reducing the fasting sugar level ($\mu_{d} < 0$)

In [105]:
#let us check the normality of the sugar level before and after the medication
#by performing the Shapiro-Wilk test


In [110]:
test_stat,p_value_1=stats.shapiro(df['sugar_before'])
test_stat,p_value_2=stats.shapiro(df['sugar_after'])
sl = 0.05

In [112]:
p_value_1>sl

True

In [113]:
p_value_2>sl

True

In [114]:
#From the above result,
#we can see that p_value_1 and p_value_2 are both greater than 0.05, 
#thus the sugar levels before and after the medication
#are normally distributed.

In [116]:
#levene test
test_stat,p_val=stats.levene(df['sugar_before'],df['sugar_after'])

In [117]:
p_val

0.9172646929696414

In [120]:
p_val< sl

False

In [121]:
#p_val is not less than sl, so we fail to reject the null hypothesis of Levene's test;
#the variances of the two samples are equal

In [122]:
test_stat,p_val=stats.ttest_rel(df['sugar_before'],df['sugar_after'])

In [123]:
p_val = p_val/2

In [124]:
p_val

1.0544706972911436e-05

In [125]:
sl

0.05

In [126]:
p_val < sl 

True

In [None]:
#we reject the null hypothesis
#we accept ha i.e alternate hypothesis

We can see that the p-value is less than 0.05. Thus, we reject the null hypothesis and conclude that the new drug was effective in reducing the fasting sugar level.
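
A paired t-test is equivalent to a one sample t-test on the pairwise differences, so the normality assumption concerns the differences. The sketch below illustrates this with only the first five rows displayed by `df.head()` above; with the full file, `df['sugar_before']` and `df['sugar_after']` would be used instead. The name `diff` is an assumption.

```python
import numpy as np
from scipy import stats

# first five rows of the data shown above (illustration only)
sugar_before = np.array([135, 132, 142, 154, 198])
sugar_after = np.array([132, 136, 139, 151, 192])

diff = sugar_before - sugar_after   # positive values mean the sugar level dropped

# paired t-test and one sample t-test on the differences
t_rel, p_rel = stats.ttest_rel(sugar_before, sugar_after)
t_one, p_one = stats.ttest_1samp(diff, popmean=0)

# Shapiro-Wilk test on the differences
_, p_norm = stats.shapiro(diff)
print(t_rel, t_one, p_rel, p_one, p_norm)
```

The two tests return the same statistic and the same two-sided p-value.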

<a id= "prop"></a>
## 3. Z Proportion Test

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>10. The physics department claims that 14% of the applications it received this year were fraudulent. The university head wants to test whether the percentage is different from what the department claims. A sample of 250 applications is selected, out of which 38 applications are found to be fraudulent. Test the claim with a 95% confidence interval.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [131]:
#The null and alternate hypothesis 
#ho : p = 0.14 
#ha : p != 0.14

In [153]:
n = 250 
x = 38

#sample proportion
p_samp = x/n

#hypothesised proportion
p_hypo = 0.14

In [154]:
num = (p_samp-p_hypo)
num

0.011999999999999983

In [155]:
den = np.sqrt((p_hypo*(1-p_hypo))/n)
den

0.021945386758952325

In [156]:
z_prop = num/den
z_prop

0.5468119624323661

In [162]:
p_val = stats.norm.sf(abs(z_prop))*2
p_val

0.5845079238217354

In [163]:
sl = 0.05

In [164]:
p_val > sl

True

In [165]:
#we fail to reject the null hypothesis; there is not enough evidence that the percentage differs from 14%
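
Since Question 10 mentions a 95% confidence interval, the interval for the population proportion can be sketched as follows. This uses the normal approximation with the sample proportion in the standard error, which is a common choice and an assumption here; the variable names are introduced for illustration.

```python
import math
from statistics import NormalDist

n, x = 250, 38
p_samp = x / n

# two-sided critical value for 95% confidence (about 1.96)
z = NormalDist().inv_cdf(0.975)
std_err = math.sqrt(p_samp * (1 - p_samp) / n)

lower, upper = p_samp - z * std_err, p_samp + z * std_err
claim_inside = lower <= 0.14 <= upper
print(lower, upper, claim_inside)
```

The claimed value 0.14 lies inside the interval, which agrees with the p-value decision of not rejecting the null hypothesis.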

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>11. The store manager claims that more than 20% of the plastic boxes in his previous order were defective. From a sample of 120 boxes, 42 were found to be defective. Test the manager's claim using the critical value method with 90% confidence.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [1]:
#ho : p<=0.2
#ha : p>0.2

In [180]:
cric_val = np.abs(round(stats.norm.isf(0.1),2))
cric_val

1.28

In [181]:
n = 120
x = 42

p_samp = x/n
hypo_p = 0.2

In [182]:
z_prop = (p_samp-hypo_p)/np.sqrt((hypo_p*(1-hypo_p))/n)
z_prop

4.107919181288744

In [184]:
z_prop > cric_val

True

  Here the test statistic is greater than the critical value (= 1.28). Thus, we reject the null hypothesis and conclude that there is enough evidence to state that the percentage of defective boxes is greater than 20%.

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                        <b>12. The HR head of a company claims that the company supports women's empowerment and that the proportion of female employees is the same in the New York and Oneonta branches. The women's empowerment cell wants to check whether the proportion is different for the two branches. A sample of 150 employees is selected from the New York branch, out of which 53 are females, and a sample of 170 employees is selected from the Oneonta branch, out of which 76 are females. Use the p-value technique with 95% confidence. </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [185]:
#ho : p1 = p2
#ha : p1!= p2

In [None]:
n1 = 150
n2 = 170
#sample proportions of female employees in the two branches
p1 = 53/150
p2 = 76/170

In [186]:
import statsmodels.api as sm

In [198]:
z_prop,p_val=sm.stats.proportions_ztest(count=np.array([53,76]),nobs=np.array([150,170]),alternative='two-sided')

In [199]:
p_val

0.08807228185564836

In [200]:
sl

0.05

In [201]:
p_val > sl

True

In [202]:
#we fail to reject the null hypothesis; there is not enough evidence that the proportions differ
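
The two sample proportion test can also be computed by hand with the pooled proportion, which helps to see what the library call does. This is a minimal sketch using only the standard library; the helper name `two_prop_z` is an assumption.

```python
import math
from statistics import NormalDist

def two_prop_z(x1, n1, x2, n2):
    """Return (z statistic, two-sided p-value) for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    # pooled proportion under H0
    p_pool = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # two-sided p-value: twice the tail area beyond |z|
    p_two = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two

# counts from Question 12
z_prop, p_val = two_prop_z(53, 150, 76, 170)
print(z_prop, p_val)
```

The p-value matches the one reported above, about 0.088.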

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>13. The education department claims that the proportion of students who failed in Mathematics is more than the proportion of students who failed in English. To test the claim, a sample of 200 students enrolled for Mathematics was considered, out of which 73 students failed the exam, and a sample of 150 students enrolled for English was considered, out of which 53 students failed the exam. Use the critical value method to test the department's claim with 90% confidence. </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [203]:
#ho : p1<=p2
#ha : p1>p2

In [204]:
n_math = 200
n_eng = 150
fail_math = 73
fail_eng = 53

In [207]:
cric_val = np.abs(round(stats.norm.isf(0.1),2))
cric_val

1.28

In [209]:
z_prop,p_val = sm.stats.proportions_ztest(count=np.array([73,53]),nobs=np.array([200,150]))

In [210]:
z_prop > cric_val

False

In [211]:
#we fail to reject the null hypothesis; there is not enough evidence that
#the proportion of failures in Mathematics is more than that in English