# Unit4 Day1, 2, 3
### Sample mean and population mean
* The population mean is a constant value no matter how many times it's recalculated.
* The sample mean depends on exactly what samples we happened to choose.
* The sample means were all close to the population mean, but were all slightly different from the population mean and from each other.

In [1]:
import numpy as np

population = np.random.normal(loc=65, scale=3.5, size=300)
population_mean = np.mean(population)

print("Population Mean: {}".format(population_mean))

sample_1 = np.random.choice(population, size=30, replace=False)
sample_2 = np.random.choice(population, size=30, replace=False)
sample_3 = np.random.choice(population, size=30, replace=False)
sample_4 = np.random.choice(population, size=30, replace=False)
sample_5 = np.random.choice(population, size=30, replace=False)

sample_1_mean = np.mean(sample_1)
print("Sample 1 Mean: {}".format(sample_1_mean))

sample_2_mean = np.mean(sample_2)
sample_3_mean = np.mean(sample_3)
sample_4_mean = np.mean(sample_4)
sample_5_mean = np.mean(sample_5)

print("Sample 2 Mean: {}".format(sample_2_mean))
print("Sample 3 Mean: {}".format(sample_3_mean))
print("Sample 4 Mean: {}".format(sample_4_mean))
print("Sample 5 Mean: {}".format(sample_5_mean))



Population Mean: 65.02159180222115
Sample 1 Mean: 65.41866229107389
Sample 2 Mean: 64.63847323845235
Sample 3 Mean: 64.36252748507444
Sample 4 Mean: 63.93718641731621
Sample 5 Mean: 64.47761627675817


### Central Limit Theorem
If the _sample size is **large** enough_, the sample means cluster around the population mean in an approximately normal distribution, so a sample mean is likely to be _sufficiently **close**_ to the population mean.
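
One way to see this effect is to draw many samples of two different sizes and compare how widely their means scatter. The sketch below is only an illustration: it uses the standard library instead of NumPy, and the seed, the number of repeats, and the helper name `spread_of_sample_means` are assumptions made for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable (arbitrary choice)

# Simulated population, loosely mirroring the parameters used in these notes
population = [random.gauss(65, 100) for _ in range(3000)]

def spread_of_sample_means(sample_size, repeats=500):
    """Standard deviation of the means of `repeats` random samples."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

spread_small = spread_of_sample_means(10)
spread_large = spread_of_sample_means(500)

print("Spread of means, n=10:  {}".format(spread_small))
print("Spread of means, n=500: {}".format(spread_large))
```

The larger samples should give a visibly smaller spread, which is the sense in which their means are "close" to the population mean.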

In [2]:
import numpy as np

# Create population and find population mean
population = np.random.normal(loc=65, scale=100, size=3000)
population_mean = np.mean(population)

# Select increasingly larger samples
extra_small_sample = population[:10]
small_sample = population[:50]
medium_sample = population[:100]
large_sample = population[:500]
extra_large_sample = population[:1000]

# Calculate the mean of those samples
extra_small_sample_mean = np.mean(extra_small_sample)
small_sample_mean = np.mean(small_sample)
medium_sample_mean = np.mean(medium_sample)
large_sample_mean = np.mean(large_sample)
extra_large_sample_mean = np.mean(extra_large_sample)

# Print them all out!
print("Extra Small Sample Mean: {}".format(extra_small_sample_mean))
print("Small Sample Mean: {}".format(small_sample_mean))
print("Medium Sample Mean: {}".format(medium_sample_mean))
print("Large Sample Mean: {}".format(large_sample_mean))
print("Extra Large Sample Mean: {}".format(extra_large_sample_mean))

print("\nPopulation Mean: {}".format(population_mean))

Extra Small Sample Mean: 77.86924935090619
Small Sample Mean: 69.1508367889319
Medium Sample Mean: 70.37883376766264
Large Sample Mean: 68.8154152167034
Extra Large Sample Mean: 64.77455651054326

Population Mean: 64.42972736330299


### Hypothesis Tests
* A **null hypothesis** is a statement that _the observed difference is the result of **chance**_.
  * A prediction that there is no significant difference. 
* Hypothesis testing is a mathematical way of determining whether we can be confident that the null hypothesis is false.
  * Different situations will require different types of hypothesis testing
* Type I error
  * "**false positive**"
  * is finding a correlation between things that are **not related**. 
  * occurs when the null hypothesis is **rejected** even though it is **true**.
* Type II error
  * "**false negative**"
  * is **failing** to find a correlation between things that are **actually related**.
  * occurs when we **fail to reject** the null hypothesis even though it is **false**.

In [3]:
import numpy as np

def intersect(list1, list2):
  return [sample for sample in list1 if sample in list2]

# the true positives and negatives:
actual_positive = [2, 5, 6, 7, 8, 10, 18, 21, 24, 25, 29, 30, 32, 33, 38, 39, 42, 44, 45, 47]
actual_negative = [1, 3, 4, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 22, 23, 26, 27, 28, 31, 34, 35, 36, 37, 40, 41, 43, 46, 48, 49]

# the positives and negatives we determine by running the experiment:
experimental_positive = [2, 4, 5, 7, 8, 9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 27, 28, 32, 35, 36, 38, 39, 40, 45, 46, 49]
experimental_negative = [1, 3, 6, 12, 14, 23, 25, 29, 30, 31, 33, 34, 37, 41, 42, 43, 44, 47, 48]

#define type_i_errors and type_ii_errors here
type_i_errors = intersect(experimental_positive, actual_negative)
type_ii_errors = intersect(experimental_negative, actual_positive)
print(type_i_errors)
print(type_ii_errors)

[4, 9, 11, 13, 15, 16, 17, 19, 20, 22, 26, 27, 28, 35, 36, 40, 46, 49]
[6, 25, 29, 30, 33, 42, 44, 47]


### P-Values
* A hypothesis test provides a numerical answer, called a p-value.
* A p-value is the probability of seeing a result at least as extreme as the observed one, assuming that the null hypothesis is **true**.
  * P-values give us an idea of how confident we can be in a result.
  * A p-value of 0.05 means that, if the null hypothesis were true, a result this extreme would occur about 5% of the time by chance.
  * A higher significance threshold is more likely to give a **false positive**.
    * If we want to be very sure that the result is not due to just chance, we will select a very **small** threshold.
  * A common threshold is p-value < 0.05: we call a result significant when a result this extreme would occur less than 5% of the time by random chance alone.
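
To make the idea of "a result at least as extreme" concrete, here is a small sketch that computes an exact two-sided p-value for a coin-flip experiment. The scenario (60 heads in 100 flips of a supposedly fair coin) and the function name are assumptions for illustration; they do not come from the course data.

```python
from math import comb

def two_sided_coin_p_value(heads, flips):
    """Probability, for a fair coin, of a head count at least as far
    from flips / 2 as the observed count."""
    distance = abs(heads - flips / 2)
    extreme = [k for k in range(flips + 1) if abs(k - flips / 2) >= distance]
    # Each exact sequence of flips has probability 1 / 2**flips
    return sum(comb(flips, k) for k in extreme) / 2 ** flips

p_value = two_sided_coin_p_value(60, 100)
print(p_value)
print(p_value < 0.05)  # is the result significant at the 0.05 threshold?
```

Under these assumptions the p-value lands slightly above 0.05, so 60 heads in 100 flips would not quite count as significant.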

In [4]:
def accept_null_hypothesis(p_value):
  """
  Returns True if we fail to reject the null hypothesis, False if we reject it

  Takes a p-value as its input and assumes p < 0.05 is significant
  """
  return p_value >= 0.05

hypothesis_tests = [0.1, 0.009, 0.051, 0.012, 0.37, 0.6, 0.11, 0.025, 0.0499, 0.0001]

for p_value in hypothesis_tests:
  print(accept_null_hypothesis(p_value))

True
False
True
False
True
True
True
False
False
False


# Unit4 Day4 & Day5
### Types of Hypothesis Test
* For numerical data, we will cover:
  * One Sample T-Tests
  * Two Sample T-Tests
  * ANOVA (Analysis of Variance)
  * Tukey Tests
* For categorical data, we will cover:
  * Binomial Tests
  * Chi Square

### 1 Sample T-Testing
* A univariate T-test compares a **sample mean** to a **hypothetical population mean**.
  * Null hypothesis: the set of samples belongs to a population with the target mean
* `from scipy.stats import ttest_1samp`
* `tstat, pval = ttest_1samp(example_distribution, expected_mean)`
* If the p-value is less than 0.05, we can reject the null hypothesis and state that there is a significant difference between the sample mean and the hypothetical population mean.

In [5]:
from scipy.stats import ttest_1samp
import numpy as np

#ages = np.genfromtxt("unit4_ages.csv")
ages = np.array([ 32.,  34.,  29.,  29.,  22.,  39.,  38.,  37.,  38.,  36.,  30.,  26.,  22.,  22.])

print(ages)
ages_mean = np.mean(ages)
print(ages_mean)
tstat, pval = ttest_1samp(ages, 30)
print(pval)

[32. 34. 29. 29. 22. 39. 38. 37. 38. 36. 30. 26. 22. 22.]
31.0
0.5605155888171379
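
The p-value above is derived from a t statistic. As a rough sketch of what that statistic measures, the block below recomputes it by hand for the same `ages` values using only the standard library. It stops at the t statistic, because turning it into a p-value needs the t distribution, which `ttest_1samp` handles.

```python
import math
import statistics

ages = [32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22]
expected_mean = 30

sample_mean = statistics.mean(ages)
sample_std = statistics.stdev(ages)  # sample standard deviation (divides by n - 1)
standard_error = sample_std / math.sqrt(len(ages))

# How many standard errors the sample mean lies from the expected mean
t_statistic = (sample_mean - expected_mean) / standard_error
print(t_statistic)
```

A t statistic well below 1 means the sample mean is within one standard error of 30, which fits the large p-value reported by the test.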


In [6]:
from scipy.stats import ttest_1samp
import numpy as np

correct_results = 0 # Start the counter at 0

daily_visitors = np.genfromtxt("unit4_daily_visitors.csv", delimiter=",")

for i in range(1000): # 1000 experiments
  # one-sample t-test of this experiment's visitors against an expected mean of 30
  tstat, pval = ttest_1samp(daily_visitors[i], 30)
  print(pval)
  if pval < 0.05:
    correct_results += 1
print("We correctly recognized that the distribution was different in " + str(correct_results) + " out of 1000 experiments.")

0.23695942473632142
0.005511750046377255
0.23636795865412213
0.10783517811607937
0.004414882120780694
0.16214048205543172
0.1600829236665709
0.008752908031478393
0.009413759844110177
0.28829862847381077
0.035310467589078576
0.2144769257585793
0.0006227718256323399
9.588258876957111e-06
0.2351509229750954
0.00026342324862845364
0.8039284954571189
0.030165739305163603
0.6706988618319033
0.08734286055124148
2.907880833989432e-05
0.02649255683160421
0.04467245164993898
0.13544331926124387
0.016612898432082093
0.14892019305948018
0.037645152288740394
0.015376369766211437
0.15679877794158653
0.15065054103948747
0.0685476997160581
0.14946579127226753
0.00035270120578793457
0.011641901238399317
0.7993493159918439
0.01625350692456645
0.0271314568134415
0.04835788738829874
0.11758287850526042
0.8801423342575648
0.05547696429556022
0.005089847690879995
0.006679724618281934
0.12478644532611288
0.00373992490997133
0.0016715530036133515
0.0012944771181846912
0.153507674004558
0.7798467721041478
0.00

0.1085417229023679
0.0011170465170635168
0.0007737011084607557
0.16156282362026705
0.029116916174228097
0.0364212432111863
0.05846710727831822
0.0025268041546313725
0.009834699875935049
0.023737761768635877
0.3723923176325392
0.017290944281008653
4.1476856331304464e-05
0.5791215864370689
0.4148459693995077
0.05933483124928568
0.00959738082293534
0.002429629979650248
0.07542348064577276
0.24602330373620324
0.4857941597507165
0.008792343808282343
0.5810831838756756
0.021114904390651345
0.006274811781082217
0.06745406578382171
0.014190701002785503
0.09773865434244938
0.005429312197087601
0.000260433608933623
0.06144967233382448
0.008874933100824023
0.041454905906007346
0.3416609801709968
0.00011638045220906159
0.09001838333051311
0.8772316839667647
0.060755850855216564
0.6668933487489712
0.0033355142855807955
0.0168001523027963
0.0002586725329523724
0.04507424752648597
0.04276512258477841
0.006734386330611377
0.1143723881294428
0.028722783659638797
0.02089832284687067
0.3153925158476216
0

### 2 Sample T-Test
* A 2 Sample T-Test compares **two** sets of data, which are both approximately **normally distributed**.
* The null hypothesis, in this case, is that the two distributions have **the same mean**.
* `from scipy.stats import ttest_ind`
* `ttest_ind()`
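
As a sketch of what a two-sample test compares, the block below computes the pooled-variance t statistic by hand, which is the equal-variance form that `ttest_ind` uses with its default settings. The two small lists and the function name are made up for illustration.

```python
import math
import statistics

def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic assuming both groups share one variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error

# Hypothetical data: the second group is shifted up by exactly 1
t_statistic = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_statistic)
```

The sign only tells us which group has the larger mean; the magnitude is the difference in means measured in standard errors.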

In [7]:
from scipy.stats import ttest_ind
import numpy as np

week1 = np.genfromtxt("unit4_week1.csv",  delimiter=",")
week2 = np.genfromtxt("unit4_week2.csv",  delimiter=",")

week1_mean = np.mean(week1)
week2_mean = np.mean(week2)
print(week1_mean)
print(week2_mean)

week1_std = np.std(week1)
week2_std = np.std(week2)
print(week1_std)
print(week2_std)

tstat, pval = ttest_ind(week1, week2)
print(pval)

25.4480593952
29.0215681076
4.531693386680561
5.497966708987187
0.000676767690454633


### Dangers of Multiple T-Tests
* The significance level (here 0.05) is the probability that we **incorrectly reject** a true null hypothesis on each t-test.
* The more t-tests we perform, the more likely that we are to get a **false positive**, a _Type I error_.
* For a significance level of 0.05
  * If the null hypothesis is _true_ then the probability of correctly obtaining a non-significant result is 1 - 0.05 = 0.95.
  * When we run another t-test, the probability of still getting a correct result is 0.95 * 0.95, or 0.9025.
    * That means our probability of making an error is now close to 10%!
    * This error probability only gets bigger with the more t-tests we do.
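
The arithmetic above generalizes to any number of tests. The sketch below assumes the tests are independent, which is a simplification, and uses a hypothetical helper name.

```python
def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests,
    each run at significance level alpha."""
    return 1 - (1 - alpha) ** num_tests

for num_tests in (1, 2, 3, 5, 10):
    print(num_tests, round(family_wise_error_rate(0.05, num_tests), 4))
```

With three tests this reproduces the 1 - 0.95**3 calculation used for the three-store comparison.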

In [8]:
from scipy.stats import ttest_ind
import numpy as np

a = np.genfromtxt("unit4_store_a.csv",  delimiter=",")
b = np.genfromtxt("unit4_store_b.csv",  delimiter=",")
c = np.genfromtxt("unit4_store_c.csv",  delimiter=",")

a_mean = np.mean(a)
b_mean = np.mean(b)
c_mean = np.mean(c)
print(a_mean)
print(b_mean)
print(c_mean)

a_std = np.std(a)
b_std = np.std(b)
c_std = np.std(c)
print(a_std)
print(b_std)
print(c_std)

a_b_pval = ttest_ind(a, b).pvalue
a_c_pval = ttest_ind(a, c).pvalue
b_c_pval = ttest_ind(b, c).pvalue
print(a_b_pval)
print(a_c_pval)
print(b_c_pval)

error_prob = (1-(0.95**3))
print(error_prob)

58.34963608399999
65.62628713553332
62.361173186000016
14.753704052425187
14.746564490177391
15.092458510871044
2.766762940022801e-05
0.021012051693103415
0.0598856352542542
0.1426250000000001


### ANOVA
* ANOVA (Analysis of Variance) tests the null hypothesis that all of the datasets have **the same mean**.
* When comparing **more than two** numerical datasets, the best way to preserve a Type I error probability of 0.05 is to use ANOVA.
* If we reject the null hypothesis with ANOVA, we're saying that **at least one** of the sets has a different mean
  * It does not tell us which datasets are different.
* `from scipy.stats import f_oneway`
* `f_oneway()`
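
To show what a one-way ANOVA measures, here is a minimal sketch of the F statistic computed by hand: the variance between group means divided by the variance within groups. The tiny groups and the function name are assumptions for illustration, and the sketch does not compute the p-value that `f_oneway` also returns.

```python
import statistics

def f_statistic(*groups):
    """One-way ANOVA F statistic: between-group mean square divided by
    within-group mean square."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_values)

    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2
        for group in groups
    )
    ss_within = sum(
        (x - statistics.mean(group)) ** 2
        for group in groups
        for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

f_value = f_statistic([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f_value)
```

A large F value means the group means differ by more than the scatter inside each group would explain.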

In [9]:
from scipy.stats import ttest_ind
from scipy.stats import f_oneway
import numpy as np

a = np.genfromtxt("unit4_store_a.csv",  delimiter=",")
b = np.genfromtxt("unit4_store_b.csv",  delimiter=",")
c = np.genfromtxt("unit4_store_c.csv",  delimiter=",")

fstat, pval = f_oneway(a, b, c)
print(pval)

print(np.mean(b))

b_new = np.genfromtxt('unit4_store_b_new.csv', delimiter=",")

print(np.mean(b_new))
#fstat, pval = f_oneway(a, b_new, c)

b = b_new
fstat, pval = f_oneway(a, b, c)

0.00015341166007885106
65.62628713553332
148.35494018599996


### Assumptions of Numerical Hypothesis Tests
* The samples should each be **normally distributed...ish**.
* The population standard deviations of the groups should be **equal**.
* The samples must be **independent**.

In [10]:
#import codecademylib
import numpy as np
import matplotlib.pyplot as plt

dist_1 = np.genfromtxt("unit4_1.csv",  delimiter=",")
dist_2 = np.genfromtxt("unit4_2.csv",  delimiter=",")
dist_3 = np.genfromtxt("unit4_3.csv",  delimiter=",")
dist_4 = np.genfromtxt("unit4_4.csv",  delimiter=",")

#plot your histogram here
plt.hist(dist_1)
plt.show()
plt.subplot(2,2,1)
plt.hist(dist_1)
plt.subplot(2,2,2)
plt.hist(dist_2)
plt.subplot(2,2,3)
plt.hist(dist_3)
plt.subplot(2,2,4)
plt.hist(dist_4)
plt.show()
not_normal = 4
ratio = np.std(dist_2) / np.std(dist_3)

<Figure size 640x480 with 1 Axes>

<Figure size 640x480 with 4 Axes>

### Tukey's Range Test
* Perform a Tukey's Range Test to determine the difference between datasets
* `from statsmodels.stats.multicomp import pairwise_tukeyhsd`
* The function to perform Tukey's Range Test is `pairwise_tukeyhsd([list of all of the data], [list of labels], the significance level we want)`, which is found in `statsmodels`, not `scipy`.

In [11]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import f_oneway
import numpy as np

a = np.genfromtxt("unit4_store_a.csv",  delimiter=",")
b = np.genfromtxt("unit4_store_b.csv",  delimiter=",")
c = np.genfromtxt("unit4_store_c.csv",  delimiter=",")

stat, pval = f_oneway(a, b, c)
print(pval)

# Using our data from ANOVA, we create v (all values) and labels (the store each value came from)
v = np.concatenate([a, b, c])
labels = ['a'] * len(a) + ['b'] * len(b) + ['c'] * len(c)

tukey_results = pairwise_tukeyhsd(v, labels, 0.05)
print(tukey_results)

0.00015341166007885106
Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  a      b     7.2767   3.2266 11.3267  True 
  a      c     4.0115  -0.0385  8.0616 False 
  b      c    -3.2651  -7.3152  0.7849 False 
---------------------------------------------


### Binomial Test
* A Binomial Test compares **a categorical dataset** to some expectation.
* The null hypothesis, in this case, would be that there is no difference between the observed behavior and the expected behavior.
  * If we get a p-value of less than 0.05, we can reject that hypothesis and determine that there is a difference between the observation and expectation.
* `from scipy.stats import binom_test`
* `binom_test(number of successes, number of trials, expected probability of success)`
  * It returns a p-value: the probability of seeing a result at least as extreme as the observed number of successes if the true success probability is the specified one.
  * If we get a p-value less than 0.05, we can reject the null hypothesis
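
As a sketch of what a two-sided binomial test computes, the block below sums the probabilities of every outcome that is no more likely than the observed one. It works in log space with the standard library to avoid underflow; the function names and the small tolerance are assumptions for the example, and SciPy's own implementation may differ in detail.

```python
import math

def binomial_log_pmf(k, n, p):
    """Log of the probability of exactly k successes in n trials."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )

def two_sided_binomial_p_value(successes, trials, p):
    """Total probability of all outcomes at most as likely as the observed one."""
    observed = binomial_log_pmf(successes, trials, p)
    tolerance = 1e-7  # guards against floating-point ties
    total = 0.0
    for k in range(trials + 1):
        log_prob = binomial_log_pmf(k, trials, p)
        if log_prob <= observed + tolerance:
            total += math.exp(log_prob)
    return min(1.0, total)

p_far = two_sided_binomial_p_value(510, 10000, 0.06)
p_near = two_sided_binomial_p_value(590, 10000, 0.06)
print(p_far)
print(p_near)
```

Newer SciPy releases provide `binomtest`, which returns a result object with a `pvalue` attribute; check the installed version's documentation before relying on either name.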

In [12]:
from scipy.stats import binom_test
pval = binom_test(510,10000,0.06)
print(pval)
pval2 = binom_test(590,10000,0.06)
print(pval2)

0.00011592032724546606
0.6891529835730346


### Chi Square Test
* If we have **two or more categorical datasets** that we want to compare, we should use a Chi Square test.
* `from scipy.stats import chi2_contingency`
* `chi2_contingency()`
  * The function chi2_contingency takes a contingency table as its main argument.
* The null hypothesis is that there's no significant difference between the datasets.
  * We reject that hypothesis, and state that there is a significant difference between at least two of the datasets, if we get a p-value less than 0.05.
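
The statistic behind this test compares each observed count with the count we would expect if the rows and columns were unrelated. Below is a minimal hand-rolled sketch using the 4 x 2 ant table from this section's cell; the function name is hypothetical, and the sketch stops at the statistic and degrees of freedom instead of computing the p-value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

statistic, dof = chi_square_statistic([[30, 10], [35, 5], [28, 12], [20, 20]])
print(statistic, dof)
```

The larger the statistic relative to the degrees of freedom, the smaller the p-value that `chi2_contingency` reports.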

In [13]:
from scipy.stats import chi2_contingency

# Contingency table
#        | harvester | leaf cutter
# -------+-----------+------------
# 1st gr | 30        | 10
# 2nd gr | 35        | 5
# 3rd gr | 28        | 12
# 4th gr | 20        | 20

X = [[30, 10],
     [35, 5],
     [28, 12],
     [20, 20]]
chi2, pval, dof, expected = chi2_contingency(X)
print(pval)

0.002812834559546625


||Numerical|Categorical|
|:--:|:--:|:--:|
|Sample vs. Known Quantity|1 Sample T-Test|Binomial Test|
|2 Samples|2 Sample T-Test|Chi Square|
|More Than 2 Samples|ANOVA and/or Tukey|Chi Square|

# Unit 4 Day 6, 7, 8
## Project: Familiar

In [14]:
# 1
import familiar
# 2
vein_pack_lifespans = familiar.lifespans(package='vein')
print(vein_pack_lifespans)
# 3
from scipy.stats import ttest_1samp
# 4
vein_pack_test = ttest_1samp(vein_pack_lifespans, 71)
print(vein_pack_test)
# 5
if vein_pack_test.pvalue < 0.05:
  print("significance")
# 6
if vein_pack_test.pvalue < 0.05:
  print("The Vein Pack Is Proven To Make You Live Longer!")
else:
  print("The Vein Pack Is Probably Good For You Somehow!")
# 7
artery_pack_lifespans = familiar.lifespans(package='artery')
print(artery_pack_lifespans)
# 8
from scipy.stats import ttest_ind
# 9
package_comparison_results = ttest_ind(vein_pack_lifespans, artery_pack_lifespans)
print(package_comparison_results)
# 10
if package_comparison_results.pvalue < 0.05:
  # significant
  print("the Artery Package guarantees even stronger results!")
else:
  print("the Artery Package is also a great product!")
# 11
# Since the p-value was greater than 0.05, we can't say that there is a significant difference between the life expectancy of the two packages.
# 12
# We received 200 responses from our Vein Package subscribers. 70% of them had low iron counts, 20% had normal, and 10% of them have high iron counts.
# We were only able to get 145 responses from our Artery Package subscribers, but only 20% of them had low iron counts. 60% had normal, and 20% have high iron counts.
# 13
iron_contingency_table = familiar.iron_counts_for_package()
# 14
from scipy.stats import chi2_contingency
# 15
_, iron_pvalue, _, _ = chi2_contingency(iron_contingency_table)
# 16
if iron_pvalue < 0.05:
  print("The Artery Package Is Proven To Make You Healthier!")
else:
  print("While We Can't Say The Artery Package Will Help You, I Bet It's Nice!")
# 17

[76.93767431371617, 75.99335913014681, 74.79815012354048, 74.50202147158551, 77.48888897587436, 72.14256573154043, 75.99303167191182, 76.34155048095228, 77.48475562999882, 76.5321014800867, 76.25508955276418, 77.58398316566651, 77.04737034962294, 72.87475174594711, 77.43504547002844, 77.4923414107892, 78.32672046879952, 73.34370246887067, 79.96915765236346, 74.83800583300325]
Ttest_1sampResult(statistic=11.958665180208271, pvalue=2.7463117986584107e-10)
significance
The Vein Pack Is Proven To Make You Live Longer!
[76.33537008426835, 76.92308231559062, 75.9524416448778, 74.5449834807203, 76.4045042754472, 73.07924888636576, 77.02354461052992, 74.1174204200688, 77.38650656208344, 73.04476583718993, 74.96311850866167, 73.31954301933486, 75.85740137696862, 76.15265351351255, 73.3551028632267, 73.90221256458788, 73.77121195092475, 68.31489830285578, 74.63975717775328, 78.38547730843979]
Ttest_indResult(statistic=1.9722687784695117, pvalue=0.05588883079070819)
the Artery Package is also a g

# Unit 4 Day 9 & 10
## Project: Fetchmaker

In [15]:
import numpy as np
# 1
import fetchmaker
# 2
rottweiler_tl = fetchmaker.get_tail_length("rottweiler")
print(rottweiler_tl.head())
# 3
print(np.mean(rottweiler_tl))
print(np.std(rottweiler_tl))
# 4
whippet_rescue = fetchmaker.get_is_rescue("whippet")
print(whippet_rescue.head())
# 5
num_whippet_rescues = np.count_nonzero(whippet_rescue)
print(num_whippet_rescues)
# 6
num_whippets = np.size(whippet_rescue)
print(num_whippets)
# 7
from scipy.stats import binom_test
num_whippet_rescues_pvalue = binom_test(num_whippet_rescues, num_whippets, 0.08)
# 8
print(num_whippet_rescues_pvalue)
# the result is not significant (the p-value is well above 0.05)
# 9
whippets_weight = fetchmaker.get_weight("whippet")
terriers_weight = fetchmaker.get_weight("terrier")
pitbulls_weight = fetchmaker.get_weight("pitbull")
print(whippets_weight)
print(terriers_weight)
print(pitbulls_weight)
from scipy.stats import f_oneway
_, mid_sized_pvalue = f_oneway(whippets_weight, terriers_weight, pitbulls_weight)
print(mid_sized_pvalue)
# 10
v = np.concatenate([whippets_weight, terriers_weight, pitbulls_weight])
labels = ['whippets_weight'] * len(whippets_weight) + ['terriers_weight'] * len(terriers_weight) + ['pitbulls_weight'] * len(pitbulls_weight)
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey_results = pairwise_tukeyhsd(v, labels, 0.05)
print(tukey_results)
# 11
poodle_colors = fetchmaker.get_color("poodle")
shihtzu_colors = fetchmaker.get_color("shihtzu")
# 12
poodle_black = np.count_nonzero(poodle_colors == "black")
poodle_brown = np.count_nonzero(poodle_colors == "brown")
poodle_gold = np.count_nonzero(poodle_colors == "gold")
poodle_grey = np.count_nonzero(poodle_colors == "grey")
poodle_white = np.count_nonzero(poodle_colors == "white")
print(poodle_black)
print(poodle_brown)
print(poodle_gold)
print(poodle_grey)
print(poodle_white)

shihtzu_black = np.count_nonzero(shihtzu_colors == "black")
shihtzu_brown = np.count_nonzero(shihtzu_colors == "brown")
shihtzu_gold = np.count_nonzero(shihtzu_colors == "gold")
shihtzu_grey = np.count_nonzero(shihtzu_colors == "grey")
shihtzu_white = np.count_nonzero(shihtzu_colors == "white")
print(shihtzu_black)
print(shihtzu_brown)
print(shihtzu_gold)
print(shihtzu_grey)
print(shihtzu_white)

# Contingency table
#        Poodle | Shih Tzu
# ------+-------+----------+
# Black | 17    |  10
# Brown | 13    |  36
# Gold  | 8     |  6
# Grey  | 52    | 41
# White | 10    | 7
color_table = [[17, 10],
               [13, 36],
               [8, 6],
               [52, 41],
               [10, 7]]
# 13
from scipy.stats import chi2_contingency
chi2, pval, dof, expected = chi2_contingency(color_table)
print(pval)
# significant: for this table the chi-square p-value is below 0.05
# 14

400    3.13
401    3.32
402    1.16
403    2.23
404    8.86
Name: tail_length, dtype: float64
4.2361
2.0647536874891395
700    0
701    0
702    0
703    0
704    0
Name: is_rescue, dtype: int64
6
100
0.5811780106238098
700    12
701    46
702    13
703    52
704    53
705    44
706    54
707    45
708    39
709    26
710    56
711    32
712    23
713     6
714    47
715    30
716    60
717    70
718    34
719    54
720    47
721    35
722    29
723    26
724    44
725    58
726    31
727    28
728    49
729    40
       ..
770    60
771    44
772    54
773    43
774    27
775    49
776    47
777    47
778    25
779    26
780    38
781    43
782    54
783    32
784    37
785    32
786    29
787    59
788    48
789    37
790    54
791    31
792    56
793    68
794    37
795    24
796    42
797    58
798    62
799    27
Name: weight, Length: 100, dtype: int64
600    26
601     2
602    38
603    31
604    33
605    15
606    22
607    29
608    33
609    25
610    36
611    21
612    26


# Unit 4 Day 11
## Sample Size Determination
### A/B Test
* An A/B Test is a scientific method of choosing between two options (Option A and Option B).
  * A/B tests compare an option that we’re currently using to a new option that we suspect might be better.
  * Whenever we want to make comparisons between subpopulations in our survey, we must use the A/B Test Calculator in order to get our desired survey size.
* In order to determine the sample size necessary for an A/B test, a sample size calculator requires three numbers:
  * The **Baseline conversion rate**
  * The **Minimum detectable effect**
  * The **Statistical significance**
* Minimum detectable effect (lift): the smallest difference that we want the test to be able to detect.
  * Lift is generally expressed as a percent of the baseline conversion rate.
  * lift = 100 * (new - old) / old
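
The lift formula can be wrapped in two small helpers, one for each direction of the calculation. The function names are hypothetical.

```python
def lift(old_rate, new_rate):
    """Lift as a percent of the baseline (old) conversion rate."""
    return 100 * (new_rate - old_rate) / old_rate

def rate_after_lift(old_rate, lift_percent):
    """Conversion rate reached after applying a given lift to the baseline."""
    return old_rate * (1 + lift_percent / 100)

print(lift(8, 12))              # lift needed to go from 8% to 12%
print(rate_after_lift(10, 50))  # a 50% lift on a 10% baseline
```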

In [16]:
# Number of site visitors and number of those visitors who converted
number_of_site_visitors = 2000.0
number_of_converted_visitors = 1300.0

# Calculate the conversion rate in terms of the above two variables here
conversion_rate = number_of_converted_visitors / number_of_site_visitors
print(conversion_rate)

##########

# From 8%, calculate what lift you would need to get to 12%.
lift_eight_percent_to_twelve_percent = 100 * (12 - 8) / 8
# From 10%, calculate what a 50% increase would be.
# 50% of 10% is 5%
# 10% + 5% is 15%
ten_percent_up_fifty_percent = 15

##########

# 350 users daily
# If we only show 20% of our users the new headline, 
# how many days will it take for us to get a sample size of at least 910 for each headline?
hamster_headline_experiment_length = 910 / (350 * 0.2)
print(hamster_headline_experiment_length)

0.65
13.0


### Generally, sample size calculators use 4 parameters:
* Margin of error:
  * The margin of error is the furthest we expect the true value to be from what we measure in our survey.
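  * The smaller we make the margin of error, the more certainty we have in the results, and the larger the sample we need.
  * The larger we make the margin of error, the more uncertainty we accept, and the smaller the sample we need.
* Confidence level
  * The confidence level is the probability that the interval formed by our measurement plus or minus the margin of error contains the true proportion.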
  * As we increase the confidence level, we must have a larger sample size.
* Population size
* Expected proportion
  * If we do not have historical data, we normally use 50%, which gives the most conservative (largest) sample size.
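
To show how the four parameters interact, here is a sketch based on the standard normal-approximation formula for estimating a proportion, with a finite population correction. This is an assumption about how such calculators typically work; the calculator used in the course may apply a different formula, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def survey_sample_size(margin_of_error, confidence_level,
                       population_size, expected_proportion=0.5):
    """Approximate survey sample size for estimating a proportion."""
    # z-score that leaves (1 - confidence_level) split across both tails
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for an effectively infinite population
    n_infinite = (z ** 2 * expected_proportion * (1 - expected_proportion)
                  / margin_of_error ** 2)
    # Finite population correction
    n_adjusted = n_infinite / (1 + (n_infinite - 1) / population_size)
    return math.ceil(n_adjusted)

default_size = survey_sample_size(0.05, 0.95, 10000)
skewed_size = survey_sample_size(0.05, 0.95, 10000, expected_proportion=0.2)
print(default_size)
print(skewed_size)
```

Because p * (1 - p) peaks at p = 0.5, the 50% default yields the largest sample size of any expected proportion.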

## Project: Nosh Mish Mosh

In [17]:
# 1
import noshmishmosh
# 2
import numpy as np
# 3, 4
all_visitors = noshmishmosh.customer_visits
#print(all_visitors)
# 5
paying_visitors = noshmishmosh.purchasing_customers
#print(paying_visitors)
# 6
total_visitor_count = len(all_visitors)
paying_visitor_count = len(paying_visitors)
#print(total_visitor_count)
#print(paying_visitor_count)
# 7
baseline_percent = 100 * paying_visitor_count / total_visitor_count
# 8
print(baseline_percent)
# 9
payment_history = noshmishmosh.money_spent
#print(payment_history)
# 10
average_payment = np.mean(payment_history)
#print(average_payment)
# 11
new_customers_needed = np.ceil(1240 / average_payment)
#print(new_customers_needed)
# 12
percentage_point_increase = 100. * new_customers_needed / total_visitor_count
print(percentage_point_increase)
# 13
minimum_detectable_effect = 100. * percentage_point_increase / baseline_percent
# 14
print(minimum_detectable_effect)
# 15
# Use 90% significance
# 16
ab_sample_size = 280

18.6
9.4
50.537634408602145


## Project: A/B Testing for FarmBurg

In [18]:
# 1
import pandas as pd
# 2
df = pd.read_csv('unit4_clicks.csv')
# 3
print(df.head())

                                user_id group click_day
0  8e27bf9a-5b6e-41ed-801a-a59979c0ca98     A       NaN
1  eb89e6f0-e682-4f79-99b1-161cc1c096f1     A       NaN
2  7119106a-7a95-417b-8c4c-092c12ee5ef7     A       NaN
3  e53781ff-ff7a-4fcd-af1a-adba02b2b954     A       NaN
4  02d48cf1-1ae6-40b3-9d8b-8208884a0904     A  Saturday


In [19]:
# 1
df['is_purchase'] = df.click_day.apply(lambda x: "Purchase" if pd.notnull(x) else "No Purchase")
# 2
purchase_counts = df.groupby(["group", "is_purchase"]).user_id.count().reset_index()
# 3
print(purchase_counts)

  group  is_purchase  user_id
0     A  No Purchase     1350
1     A     Purchase      316
2     B  No Purchase     1483
3     B     Purchase      183
4     C  No Purchase     1583
5     C     Purchase       83


In [20]:
# 1
from scipy.stats import chi2_contingency
# 2
contingency = [[316, 1350],
               [183, 1483],
               [83, 1583]]
# 3
chi2, pvalue, dof, expected = chi2_contingency(contingency)
# 4
is_significant = pvalue < 0.05

In [21]:
# 1
num_visits = len(df)
# 2
p_clicks_099 = (1000 / 0.99) / num_visits
p_clicks_199 = (1000 / 1.99) / num_visits
p_clicks_499 = (1000 / 4.99) / num_visits

In [22]:
# 1
from scipy.stats import binom_test
# 2
pvalueA = binom_test(316, 316 + 1350, p_clicks_099)
# 3
pvalueB = binom_test(183, 183 + 1483, p_clicks_199)
# 4
pvalueC = binom_test(83, 83 + 1583, p_clicks_499)
# 5
print(pvalueA, pvalueB, pvalueC)
final_answer = 4.99

0.2111287299402726 0.20660209246555486 0.045623672477172125
