# Blood glucose levels for obese patients have a mean of 100 with a standard deviation of 15. A researcher thinks that a diet high in raw cornstarch will have a positive effect on blood glucose levels. A sample of 36 patients who have tried the raw cornstarch diet have a mean glucose level of 108. Test the hypothesis that the raw cornstarch had an effect or not.

In [1]:
import scipy.stats as st
import math

In [2]:
x = 108                   # x is the sample mean

m = 100                   # m is the population mean under H0

s = (15/math.sqrt(36))    # s is the standard error of the mean: sigma / sqrt(n)

Our Hypotheses

H0 : The cornstarch diet has no effect on blood glucose levels, i.e. the population mean (m) = 100

H1 : Blood glucose levels increase (positive effect) after following the cornstarch diet, i.e. the population mean (m) > 100

Because H1 is directional, this is a one-tailed (upper-tail) test.


In [6]:
probab = st.norm.cdf(x,m,s)
probab

0.9993128620620841

In [7]:
p = 1 - probab
p

0.0006871379379158604

In [8]:
if p > .05:
    print('Fail to reject the Null Hypothesis')

else:
    print('Reject Null Hypothesis and Accept Alternative Hypothesis. \nRaw cornstarch has +ve effect on blood glucose.')

Reject Null Hypothesis and Accept Alternative Hypothesis. 
Raw cornstarch has +ve effect on blood glucose.
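
The cells above can be folded into a single helper. The following is a minimal sketch, assuming the population standard deviation is known (so a z-test applies) and the alternative is one-sided in the upper tail. The name `one_sided_z_test` is illustrative and not part of SciPy.

```python
import math
import scipy.stats as st

def one_sided_z_test(sample_mean, pop_mean, pop_sd, n):
    """Upper-tail z-test for a mean with known population SD (hypothetical helper)."""
    standard_error = pop_sd / math.sqrt(n)          # sigma / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error   # standardized sample mean
    p_value = st.norm.sf(z)                         # sf(z) = 1 - cdf(z), the upper-tail area
    return z, p_value

z, p_value = one_sided_z_test(sample_mean=108, pop_mean=100, pop_sd=15, n=36)
print("z =", z, "p =", p_value)
```

Using `st.norm.sf` instead of `1 - st.norm.cdf` avoids a subtraction that can lose precision when the p-value is very small.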


# In one state, 52% of the voters are Republicans, and 48% are Democrats. In a second state, 47% of the voters are Republicans, and 53% are Democrats. Suppose a simple random sample of 100 voters are surveyed from each state. What is the probability that the survey will show a greater percentage of Republican voters in the second state than in the first state?

In [7]:
import numpy as np
import scipy.stats as st
import math
import scipy

In [8]:
R1 = 0.52    # The proportion of Republican voters in the first state
R2 = 0.47    # The proportion of Republican voters in the second state 
D1 = .48     # The proportion of Democrats voters in the first state
D2 = .53     # The proportion of Democrats voters in the second state

n = 100      # sample size

In [9]:
print(n*R1)
print(n*R2)
print(n*D1)
print(n*D2)

52.0
47.0
48.0
53.0


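Every expected count above (n*p and n*(1-p) for each state) is greater than 10, so the normal approximation is reasonable. We rely on the following:

     1. Each population is assumed to be large relative to its sample (at least 10 times bigger than the sample).
     2. The two samples are independent, because they are drawn from different states.
     3. The difference between the sample proportions is approximately normally distributed (central limit theorem).

Let p1 and p2 be the sample proportions of Republican voters in the first and second state. This problem requires us to find the probability that p1 is less than p2, which is equivalent to the probability that p1 - p2 is less than zero.
The difference p1 - p2 has mean R1 - R2 and standard deviation sqrt(R1*(1-R1)/n + R2*(1-R2)/n). To find the probability, we transform the value 0 into a z-score using this mean and standard deviation.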

In [11]:
mean_difference = R1-R2
mean_difference

0.050000000000000044

In [12]:
standard_dev = math.sqrt ((R1 * (1-R1) / n)  + (R2 * (1-R2) / n) )
standard_dev

0.07061869440877536

In [13]:
z_score = ( 0 - mean_difference) / standard_dev
z_score       # z-score of a zero difference between the sample proportions

-0.7080278164104213

Using a normal distribution calculator, we find
P(z <= -0.708) = 0.24

In [15]:
p_value = scipy.special.ndtr(z_score)
p_value

0.23946399182220013

In [16]:
print("Hence, the probability that the survey will show a greater percentage of Republican voters in the second state than in the first state is " +str(p_value))

Hence, the probability that the survey will show a greater percentage of Republican voters in the second state than in the first state is 0.23946399182220013
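
As a cross-check on the normal approximation, the same probability can be computed directly from the two binomial distributions. This is a sketch under the assumption that each survey count is Binomial(100, p) and that "greater percentage" means strictly greater (ties do not count). The variable names are illustrative. The exact figure need not match the approximation, because the normal curve ignores the discreteness of the counts.

```python
import numpy as np
import scipy.stats as st

n = 100                  # voters sampled per state
R1, R2 = 0.52, 0.47      # true Republican proportions in state 1 and state 2

# Exact: P(X2 > X1) = sum over k of P(X1 = k) * P(X2 > k)
k = np.arange(n + 1)
pmf_first = st.binom.pmf(k, n, R1)     # P(X1 = k)
sf_second = st.binom.sf(k, n, R2)      # P(X2 > k)
exact_prob = float(np.sum(pmf_first * sf_second))

# Normal approximation used in the notebook (no continuity correction)
se = np.sqrt(R1 * (1 - R1) / n + R2 * (1 - R2) / n)
normal_approx = float(st.norm.cdf((0 - (R1 - R2)) / se))

print("exact binomial:", exact_prob)
print("normal approximation:", normal_approx)
```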


# You take the SAT and score 1100. The mean score for the SAT is 1026 and the standard deviation is 209. How well did you score on the test compared to the average test taker?

In [17]:
import numpy as np
import scipy.stats as st
import scipy

In [18]:
x      = 1100            # given score
u      = 1026            # given mean score
sigma  = 209             # given standard deviation

In [19]:
z_value =  ( x - u) / sigma
z_value

0.35406698564593303

In [20]:
p_value = scipy.special.ndtr(z_value)
p_value

0.6383556584353189

In [21]:
print("Percentage of test takers scoring below me = " +str(p_value*100))

Percentage of test takers scoring below me = 63.835565843531896
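
The z-score and percentile steps can be wrapped in one function. This is a minimal sketch that assumes SAT scores are normally distributed. `percentile_rank` is a hypothetical helper name, and `st.norm.cdf` is used here as an equivalent of `scipy.special.ndtr`.

```python
import scipy.stats as st

def percentile_rank(score, mean, sd):
    """Return the z-score and the percentage of a normal population below `score`."""
    z = (score - mean) / sd
    return z, 100 * st.norm.cdf(z)   # cdf(z) is the area to the left of z

z_value, percentile = percentile_rank(score=1100, mean=1026, sd=209)
print("z =", round(z_value, 3), "percentile =", round(percentile, 1))
```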


# Is gender independent of education level? A random sample of 395 people were surveyed and each person was asked to report the highest education level they obtained. The data that resulted from the survey is summarized in the following table:

| | High School | Bachelors | Masters | Ph.D. | Total |
|---|---|---|---|---|---|
| Female | 60 | 54 | 46 | 41 | 201 |
| Male | 40 | 44 | 53 | 57 | 194 |
| Total | 100 | 98 | 99 | 98 | 395 |

Question: Are gender and education level dependent at 5% level of significance? In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education that they have obtained?

# We will follow these steps to solve this problem:

1. State the hypotheses.
2. Formulate an analysis plan.
3. Analyze sample data.

# Step 1: State the hypotheses

H0 (Null hypothesis) : Gender and level of education are independent.

H1 (Alternate hypothesis) : Gender and level of education are dependent.

# Step 2: Formulate an analysis plan.

For this analysis, the significance level is 0.05. Using sample data, we will conduct a chi-square test for independence

In [22]:
import numpy as np
import scipy.stats as st
import pandas as pd

In [23]:
males=[40,44,53,57]

females=[60,54,46,41]

In [25]:
# st.chisquare() is a goodness-of-fit test that would treat `females` as expected counts.
# A test for independence needs chi2_contingency() on the full 2x4 table.
chi2, p_value, dof, expected = st.chi2_contingency([females, males])
print(round(chi2, 3), round(p_value, 3), dof)

8.006 0.046 3

# Step 3: Analyze sample data

Applying the chi-square test for independence to sample data, we compute the degrees of freedom, the expected frequency counts, and the chi-square test statistic. Based on the chi-square statistic and the degrees of freedom, we determine the P-value.

Since the P-value (0.046) is less than the significance level (0.05), we reject the null hypothesis. Thus, we conclude that there is a relationship between gender and level of education.
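
The quantities named in Step 3 (degrees of freedom, expected counts, test statistic) can also be computed by hand from the table margins. The following is a sketch using NumPy, with `st.chi2_contingency` called only as a cross-check. The variable names are illustrative.

```python
import numpy as np
import scipy.stats as st

# Rows: Female, Male. Columns: High School, Bachelors, Masters, Ph.D.
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 4)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = st.chi2.sf(chi2_stat, dof)               # upper-tail area of chi-square(dof)

# Cross-check against SciPy's built-in test for independence
chi2_check, p_check, dof_check, expected_check = st.chi2_contingency(observed)

print("chi-square:", round(chi2_stat, 3), "dof:", dof, "p-value:", round(p_value, 3))
```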

# Using the following data, perform a oneway analysis of variance using α=.05. Write up the results in APA format.

[Group1: 51, 45, 33, 45, 67]

[Group2: 23, 43, 23, 43, 45]

[Group3: 56, 76, 74, 87, 56]

A one-way analysis of variance is used to compare the means of two or more independent (unrelated) groups using the F-distribution. The null hypothesis for the test is that all group means are equal. Therefore, a significant result means that at least one group mean differs from the others.
The ANOVA test has important assumptions that must be satisfied in order for the associated p-value to be valid:

The samples shall be independent.

Each sample shall be from a normally distributed population.

The population standard deviations of the groups shall be equal. This property is known as homoscedasticity.
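
The normality and equal-variance assumptions can be screened with standard tests before running the ANOVA. This is a rough sketch that uses the Shapiro-Wilk test (`st.shapiro`) and Levene's test (`st.levene`). With only five observations per group these tests have little power, so a non-significant result is weak evidence at best.

```python
import scipy.stats as st

groups = {
    "Group1": [51, 45, 33, 45, 67],
    "Group2": [23, 43, 23, 43, 45],
    "Group3": [56, 76, 74, 87, 56],
}

# Normality: Shapiro-Wilk per group (H0: the sample comes from a normal population)
shapiro_p = {}
for name, values in groups.items():
    stat, p = st.shapiro(values)
    shapiro_p[name] = p
    print(name, "Shapiro-Wilk p =", round(p, 3))

# Homoscedasticity: Levene's test (H0: all groups have equal variance)
levene_stat, levene_p = st.levene(*groups.values())
print("Levene p =", round(levene_p, 3))
```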

In [15]:
import scipy.stats as st
import pandas as pd

In [16]:
data = pd.DataFrame({'Group1':[51,45,33,45,67], 'Group2':[23,43,23,43,45], 'Group3':[56,76,74,87,56]})
data

Unnamed: 0,Group1,Group2,Group3
0,51,23,56
1,45,43,76
2,33,23,74
3,45,43,87
4,67,45,56


In [17]:
data.describe()

Unnamed: 0,Group1,Group2,Group3
count,5.0,5.0,5.0
mean,48.2,35.4,69.8
std,12.377399,11.349009,13.535139
min,33.0,23.0,56.0
25%,45.0,23.0,56.0
50%,45.0,43.0,74.0
75%,51.0,43.0,76.0
max,67.0,45.0,87.0


We can see that the standard deviations are close to each other, so the equal-variance assumption looks reasonable and we can perform the ANOVA test.

Limitations of the One Way analysis

A one-way analysis will tell us that at least two groups differ from each other, but it won't tell us which groups differ. If our test returns a significant F-statistic, we may need to run a post hoc test (like the Least Significant Difference test) to tell us exactly which groups differ in their means.
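
One simple way to follow up a significant F-statistic is sketched below. It is not the Least Significant Difference test mentioned above. Instead it runs pairwise two-sample t-tests with a Bonferroni correction, chosen here only because it needs nothing beyond `st.ttest_ind`. The dictionary layout and variable names are assumptions made for illustration.

```python
from itertools import combinations
import scipy.stats as st

groups = {
    "Group1": [51, 45, 33, 45, 67],
    "Group2": [23, 43, 23, 43, 45],
    "Group3": [56, 76, 74, 87, 56],
}

pairs = list(combinations(groups, 2))   # every unordered pair of group names
adjusted = {}
for a, b in pairs:
    t_stat, p = st.ttest_ind(groups[a], groups[b])    # pooled-variance t-test
    adjusted[(a, b)] = min(1.0, p * len(pairs))       # Bonferroni: multiply by number of comparisons

for pair, p_adj in adjusted.items():
    verdict = "differ" if p_adj < 0.05 else "no significant difference"
    print(pair, "adjusted p =", round(p_adj, 4), "->", verdict)
```

Bonferroni is conservative, so a pair that is borderline here might still be flagged by a less strict procedure.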

Defining the Null Hypothesis, Alternate Hypothesis and Alpha

H0 (Null Hypothesis) : The means of Group1, Group2 and Group3 are equal

H1 (Alternate Hypothesis): At least one of the means of Group1, Group2 and Group3 differs from the others

Rejection Criteria: p-value < alpha (0.05)
Equivalently, if the calculated F value is greater than the critical F value from the table for alpha = 0.05, we will reject the Null Hypothesis
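
The critical F value that a printed table would give can be obtained from the F distribution directly. A small sketch, assuming 3 groups and 15 observations in total as in this problem:

```python
import scipy.stats as st

alpha = 0.05
df_between = 3 - 1     # k - 1, with k = 3 groups
df_within = 15 - 3     # N - k, with N = 15 observations

# The critical value cuts off the upper alpha tail of F(df_between, df_within)
f_critical = st.f.ppf(1 - alpha, df_between, df_within)
print("Reject H0 if F >", round(f_critical, 3))
```

Comparing the calculated F against `f_critical` and comparing the p-value against `alpha` always lead to the same decision.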

In [18]:
mean = (data.sum().sum()) / 15
mean

51.13333333333333

The grand mean of all 15 observations is 51.1333. Next, we compute the F-statistic step by step.

In [19]:
data.var()

Group1    153.2
Group2    128.8
Group3    183.2
dtype: float64

In [20]:
ms_error = (153.2+128.8+183.2) / 3       # mean square error (within groups): average of the group variances, valid for equal group sizes
ms_error

155.06666666666666

In [21]:
n = 15                ## n is the total number of observations across all groups
k = 3                 # k is the number of groups
deg_freedom = n-k     ## degrees of freedom within groups (error)
deg_freedom

12

In [22]:
ss_error = ms_error * deg_freedom     ## sum of squares error
ss_error

1860.8

In [23]:
data.sum().sum()

767

In [24]:
sq_dev = (data.mean()-mean)**2     ## square deviations
sq_dev

Group1      8.604444
Group2    247.537778
Group3    348.444444
dtype: float64

In [25]:
ss_means = (8.604+247.537+348.444)     ## sum of squared deviations of the group means (values rounded from sq_dev, so later results differ slightly from f_oneway)
ss_means

604.585

In [26]:
var_means = ss_means/(3-1)
var_means

302.2925

In [27]:
ms_between= var_means * 5        # mean square between groups: variance of the group means times the group size (5)
ms_between

1511.4625

In [28]:
group_freedom = k-1                         # degree of freedom of groups 

ss_group = ms_between * group_freedom  

ss_group                                    # calculating the remaining between(or group) terms of the ANOVA table

3022.925

In [29]:
F = ms_between / ms_error            # Test Statistic
F

9.747178632846088

In [30]:
Group1 = [51, 45, 33, 45, 67]
Group2 = [23, 43, 23, 43, 45]
Group3 = [56, 76, 74, 87, 56]

In [31]:
st.f_oneway(Group1,Group2,Group3)

F_onewayResult(statistic=9.747205503009463, pvalue=0.0030597541434430556)

Since pvalue < 0.05, the differences between the means are statistically significant. Therefore, we reject the null hypothesis that the means of Group1, Group2 and Group3 are equal.
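
The step-by-step calculation can be condensed into a few lines that work from the raw data, which avoids the small rounding differences introduced by typing intermediate values by hand. This is a sketch in plain NumPy with illustrative variable names, cross-checked against `st.f_oneway`.

```python
import numpy as np
import scipy.stats as st

groups = [np.array([51, 45, 33, 45, 67]),
          np.array([23, 43, 23, 43, 45]),
          np.array([56, 76, 74, 87, 56])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)              # number of groups
n_total = all_values.size    # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_stat = ms_between / ms_within
p_value = st.f.sf(f_stat, k - 1, n_total - k)           # upper-tail p-value
eta_squared = ss_between / (ss_between + ss_within)     # effect size

print("F =", round(f_stat, 4), "p =", round(p_value, 4), "eta^2 =", round(eta_squared, 3))
```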

In [32]:
eta_squared = ss_group / (ss_error + ss_group)     # effect size (eta squared): share of total variation explained by the groups
eta_squared

0.6189793651362433

# APA Writeup:


F(2, 12) = 9.75, p = .003, η² = .62

# Calculate F Test for given 10, 20, 30, 40, 50 and 5,10,15, 20, 25.

What is F-test

The F-test is a statistical test used to determine whether two normally distributed populations have the same variance (or standard deviation). It is an important part of Analysis of Variance (ANOVA). However, the F-test is sensitive to departures from normality, so if the populations are not normal, an alternative that is more robust to non-normality, such as Levene's test, may be used instead. The comparison is generally done through the ratio of the two sample variances: if the population variances are equal, this ratio should be close to 1.

In order to carry out the F-test, we first fix the level of significance and then find the degrees of freedom of the numerator and denominator in order to determine the critical values. The null hypothesis in this case is H0 : σ1² = σ2², and an appropriate alternate hypothesis is chosen. The F value is calculated as F = s1² / s2², where s1² and s2² are the sample variances. The degrees of freedom are n-1 and m-1, where n and m are the two sample sizes. The F value is then compared to the table value of the F statistic for the required significance level and degrees of freedom.

H0: The variances are equal

H1: The variances are not equal
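
The procedure described above can be sketched as a single function. The assumptions are that both samples come from normal populations, that sample variances (`ddof=1`) are used, and that the two-sided p-value is taken as twice the smaller tail area. `variance_f_test` is a hypothetical helper name.

```python
import numpy as np
import scipy.stats as st

def variance_f_test(a, b):
    """F-test for equality of two variances (hypothetical helper)."""
    var_a = np.var(a, ddof=1)            # sample variance, divides by n - 1
    var_b = np.var(b, ddof=1)
    f_stat = var_a / var_b
    dfn, dfd = len(a) - 1, len(b) - 1    # numerator and denominator degrees of freedom
    p_upper = st.f.sf(f_stat, dfn, dfd)  # one-sided: P(F >= f_stat)
    p_lower = st.f.cdf(f_stat, dfn, dfd)
    p_two_sided = min(1.0, 2 * min(p_upper, p_lower))
    return f_stat, dfn, dfd, p_upper, p_two_sided

f_stat, dfn, dfd, p_upper, p_two_sided = variance_f_test([10, 20, 30, 40, 50],
                                                         [5, 10, 15, 20, 25])
f_critical = st.f.ppf(0.95, dfn, dfd)    # one-sided critical value at alpha = 0.05
print("F =", f_stat, "one-sided p =", round(p_upper, 3),
      "two-sided p =", round(p_two_sided, 3), "critical F =", round(f_critical, 3))
```

For the "variances are not equal" alternative, the two-sided p-value is the appropriate one to compare against alpha.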

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as st

In [7]:
X = [10,20,30,40,50]

Y = [5,10,15,20,25]

In [10]:
X_var = np.var(X, ddof=1)                           # sample variance of X (ddof=1 divides by n-1)
print("Variance of X = ", X_var)

Variance of X =  250.0


In [9]:
Y_var=np.var(Y, ddof=1)                             # sample variance of Y (ddof=1 divides by n-1)
print('The variance of Y=',Y_var)

The variance of Y= 62.5


In [11]:
F_test = X_var / Y_var                              # F-test is ratio of variance
print("F-test value = ", F_test)

F-test value =  4.0


In [None]:
# degrees of freedom

n1 = len(X) - 1          # numerator degrees of freedom: 5 - 1 = 4
m1 = len(Y) - 1          # denominator degrees of freedom: 5 - 1 = 4

In [13]:
n1=4
m1=4

Since significance level is not given we will assume alpha = .05

In [15]:
alpha = 0.05 
p_value = st.f.sf(F_test, n1, m1)     # upper-tail (one-sided) p-value; doubling it gives a two-sided p-value for H1
print("p value = ", p_value)

p value =  0.10400000000000002


In [16]:
if p_value < alpha:
    print('Reject null hypothesis that variances are equal')
else:
    print('Fail to reject null hypothesis that variances are equal')

Fail to reject null hypothesis that variances are equal
