## Practice Problem Set

Let’s work through another example of a t-test problem. A local movie theatre manager has been in trouble with her boss for not selling enough popcorn. To increase popcorn revenue, she decides to lower the price for two weeks to see whether enough additional people buy popcorn at the cheaper price to raise overall revenue. Her boss is skeptical, suggesting that the lower price will offset the increased sales and that there will be no significant difference in revenue before and after the discount. The manager decides to collect data for two weeks before the discount and two weeks after it begins to find out who is correct.

Let’s analyze this problem statistically. The null hypothesis, with which the boss would agree, states that **there is no significant difference in revenue from popcorn sales before and after the discount**. The research hypothesis, which the manager hopes is correct, is that **revenue will be significantly higher after the discount than before**. Let’s find out who is right! The data below show how many thousands of dollars’ worth of popcorn were sold each day for the two weeks before and the two weeks after the discount began. (For example, the 8.0 on the first day after the discount indicates that 8000 USD worth of popcorn was sold, and the 6.2 on the second day after the discount indicates that 6200 USD worth was sold.)
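In symbols, the two hypotheses can be sketched as follows. The notation $\mu_{\text{before}}$ and $\mu_{\text{after}}$ for the mean daily popcorn revenue in each period is introduced here for illustration and is not part of the original problem statement:

$$
H_0:\ \mu_{\text{after}} = \mu_{\text{before}}
\qquad
H_1:\ \mu_{\text{after}} > \mu_{\text{before}}
$$

Because $H_1$ names a direction, a one-tailed test is the natural fit for this problem.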

In [69]:
import pandas as pd
import numpy as np

In [70]:
df = pd.read_excel('pipoca.xlsx')
df.head(4)

Unnamed: 0,before,after
0,4.2,8.0
1,7.7,6.2
2,3.3,4.1
3,4.7,7.3


In [71]:
mean_bef=np.mean(df.iloc[:,0])
mean_aft=np.mean(df.iloc[:,1])
df['mean_bef']=mean_bef
df['mean_aft']=mean_aft

In [72]:
df.head(2)

Unnamed: 0,before,after,mean_bef,mean_aft
0,4.2,8.0,5.292857,6.807143
1,7.7,6.2,5.292857,6.807143


In [73]:
df['sqr_dev0']=(df.before-df.mean_bef)**2
df['sqr_dev1']=(df.after-df.mean_aft)**2

In [74]:
df.head(2)

Unnamed: 0,before,after,mean_bef,mean_aft,sqr_dev0,sqr_dev1
0,4.2,8.0,5.292857,6.807143,1.194337,1.422908
1,7.7,6.2,5.292857,6.807143,5.794337,0.368622


In [75]:
s_bef=np.sqrt(round(df.sqr_dev0.sum(),2)/len(df.index)) # standard deviation before (population form: divides by n)
s_aft=np.sqrt(round(df.sqr_dev1.sum(),2)/len(df.index)) # standard deviation after (population form: divides by n)

In [76]:
s_bef

1.4200100603265358

In [77]:
s_aft

1.4964243095745653

In [78]:
len(df.index)

14

In [79]:
from scipy.stats import ttest_ind
from scipy.stats import t
from scipy.stats import sem

In [80]:
stat, p = ttest_ind(df.before,df.after)
print('t=%.3f, p=%.3f' % (stat, p))

t=-2.647, p=0.014
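`ttest_ind` reports a two-sided p-value by default, whereas the research hypothesis here is directional (revenue higher after the discount). The following is a minimal sketch of how the two p-values relate, assuming the t statistic printed above and 14 + 14 - 2 = 26 degrees of freedom; the variable names are illustrative only.

```python
from scipy.stats import t

t_stat = -2.647      # statistic printed by ttest_ind(df.before, df.after)
dof = 14 + 14 - 2    # pooled degrees of freedom for two samples of 14 days

# Two-sided p-value: probability of a statistic at least this extreme in either tail.
p_two_sided = 2 * t.cdf(-abs(t_stat), dof)

# One-sided p-value for "before < after": probability in the lower tail only.
p_one_sided = t.cdf(t_stat, dof)

print('two-sided p=%.3f, one-sided p=%.3f' % (p_two_sided, p_one_sided))
```

Newer SciPy releases also accept an `alternative` argument in `ttest_ind` (for example `alternative='less'`), which should give the one-sided value directly.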


In [82]:
from scipy.stats import t
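# one-tailed test at alpha = 0.05, so we need the 95th percentile
p = 0.95
dof = 26  # (14 - 1) + (14 - 1)
# critical value: the point below which 95% of the t distribution lies
value = t.ppf(p, dof)
print(value)
# confirm with cdf
p = t.cdf(value, dof)
print(p)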

1.70561791976
0.95


In [85]:
# calculate standard errors
se0, se1 = sem(df.before.values), sem(df.after.values)
# standard error on the difference between the samples
sed = np.sqrt(se0**2.0 + se1**2.0)
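With the standard error of the difference in hand, the t statistic can be rebuilt by hand as a check on `ttest_ind`. The sketch below uses only numbers printed in this notebook. One assumption is flagged in the comments: the standard error of the "after" sample is not printed anywhere, so it is recovered from `s_aft`, which was computed from a rounded sum of squares.

```python
import numpy as np

n = 14                                   # days in each sample
mean_bef, mean_aft = 5.292857, 6.807143  # column means shown in the notebook
se_bef = 0.39383494698046834             # sem(df.before.values), from the notebook output
s_aft = 1.4964243095745653               # SD of "after" computed by dividing by n

# Assumption: recover the standard error of "after" from s_aft.
# sem uses the sample SD (divides by n - 1), so
# se = s_pop * sqrt(n / (n - 1)) / sqrt(n) = s_pop / sqrt(n - 1).
se_aft = s_aft / np.sqrt(n - 1)

# Standard error of the difference, then the t statistic.
sed = np.sqrt(se_bef**2 + se_aft**2)
t_stat = (mean_bef - mean_aft) / sed
print('t=%.3f' % t_stat)
```

Because both samples have the same size, this unpooled standard error coincides with the pooled one that `ttest_ind` uses by default, so the result should agree with the statistic it reports.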

In [118]:
np.std(df.before)

1.4199920954077203

In [119]:
se0

0.39383494698046834

In [49]:
sem?

In [83]:
df.before.values

array([ 4.2,  7.7,  3.3,  4.7,  4.9,  5.1,  6.8,  4.7,  3.3,  7. ,  5.1,
        5.9,  7.5,  3.9])
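The standard deviation from `np.std` and the standard error from `sem` shown above follow different conventions: `np.std` divides by n by default, while `sem` uses the sample standard deviation (dividing by n - 1). The sketch below makes the link explicit, using the `before` values copied from the array printed above.

```python
import numpy as np
from scipy.stats import sem

# "before" revenues (thousands of USD), copied from the array printed above
before = np.array([4.2, 7.7, 3.3, 4.7, 4.9, 5.1, 6.8,
                   4.7, 3.3, 7.0, 5.1, 5.9, 7.5, 3.9])
n = len(before)

sd_population = np.std(before)        # ddof=0: divides by n
sd_sample = np.std(before, ddof=1)    # ddof=1: divides by n - 1

# scipy.stats.sem defaults to ddof=1, so it matches sd_sample / sqrt(n)
se_manual = sd_sample / np.sqrt(n)
print(sd_population, sd_sample, se_manual, sem(before))
```

The t-test relies on the sample (n - 1) form, which is why `se0` is slightly larger than `np.std(df.before) / np.sqrt(14)`.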

We found t = -2.647, which in absolute value exceeds the one-tailed critical t value of 1.706 (alpha = 0.05, 26 degrees of freedom). We can therefore reject the null hypothesis that revenue does not change when the discount is offered.

### Homework A

Problem Set-up and Data  

At work, you manage two projects. Both projects will be expanding over the next year, so your boss is going to hire someone to help you out by taking over all of the communication tasks for one of the projects. To determine how your new colleague can be of most help to you, you need to determine which project has the greater communication burden.

To do this, you look back through your email Inbox for the past two weeks and count up the number of emails you’ve received for each project. Your plan is to compare the data for each project to determine if you receive significantly more emails regarding one project than the other.

Your **null hypothesis is that there is no significant difference in the number of emails that you receive by project**. Your **research hypothesis is that the number of emails that you receive varies significantly by project**.

In [121]:
hwa=pd.read_excel('hwa.xlsx')
hwa.head(3)

Unnamed: 0,Project A,Project B
0,17,28
1,24,30
2,23,32


In [122]:
hwa.columns=['a','b']
hwa.head(3)
sta=np.std(hwa.a.values)
sta

3.1320919526731648

In [123]:
stat, p = ttest_ind(hwa.a,hwa.b)
print('t=%.3f, p=%.3f' % (stat, p))

t=-2.026, p=0.058


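For a two-tailed test, alpha is split evenly between the two tails, so alpha = 0.05 puts 0.025 in each tail. The critical values are the 0.025 and 0.975 quantiles of the t distribution, and the null hypothesis is rejected only if the calculated t falls outside the interval between them: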

In [136]:
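# two-tailed test: split alpha = 0.05 evenly between the tails
p = 0.05/2
dof = (10-1)+(10-1)  # two samples of 10 observations each
# lower critical value: the point below which 2.5% of the t distribution lies
value = t.ppf(p, dof)
print(value)
# confirm with cdf
p = t.cdf(value, dof)
print(p)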

-2.10092204024
0.025
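The two-tailed decision rule can be written out explicitly. This is a minimal sketch assuming the t statistic printed by `ttest_ind` above and two samples of 10 observations each; `reject_null` and the other variable names are illustrative.

```python
from scipy.stats import t

alpha = 0.05
t_stat = -2.026              # statistic printed by ttest_ind(hwa.a, hwa.b)
dof = (10 - 1) + (10 - 1)    # two samples of 10 observations each

# Upper critical value; the lower one is its negative by symmetry.
t_crit = t.ppf(1 - alpha / 2, dof)

# Reject only if the statistic falls outside [-t_crit, +t_crit].
reject_null = abs(t_stat) > t_crit

# Equivalent check with the two-sided p-value.
p_value = 2 * t.cdf(-abs(t_stat), dof)
print('t_crit=%.3f, p=%.3f, reject=%s' % (t_crit, p_value, reject_null))
```

Comparing |t| with the critical value and comparing the two-sided p-value with alpha are two views of the same test, so they should always lead to the same decision.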


In [132]:
# calculate standard errors
sea, seb = sem(hwa.a.values), sem(hwa.b.values)
# standard error on the difference between the samples
sed = np.sqrt(sea**2.0 + seb**2.0)

In [133]:
print(sea,seb,sed)

1.04403065089 2.34497097826 2.56688310776


#### Calculated t = -2.026; critical t value = 2.101. Since the calculated value does not exceed the critical value in absolute terms, we cannot reject the null hypothesis!