# Homework 6: Percentiles, Bootstrap, A/B Testing
## Due Thursday Feb 27th, 11:59pm

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck.

In [1]:
#:
import math
import numpy as np
import babypandas as bpd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

from client.api.notebook import Notebook
ok = Notebook('hw.ok')
_ = ok.auth(inline=True)

**You do not need to submit anything to Gradescope!** The short answer problems in this homework are optional but recommended.

**Important**: The `ok` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach).

## 1. Ramen Ratings

![](menya.jpg)

In this section, we will be using a ramen rating dataset to better our understanding of A/B testing. The dataset can be found on [kaggle](https://www.kaggle.com/residentmario/ramen-ratings), but the data has been cleaned and condensed for the purposes of this question. We (the writers) also recommend eating at [Menya Ultra](http://menya-ultra.com/) before completing this section, as we did before we wrote these questions.

The ramen data is recorded in a CSV file called `ramen.csv`. It contains five columns: `Brand`, `Variety`, `Style`, `Country`, `Stars`. Read this file into a table called `ramen`.

In [2]:
ramen = bpd.read_csv('ramen.csv')
ramen

**Question 1.1.** You may have noticed that the `Stars` column contains strings instead of floats. Because we cannot do computations on strings, we need to convert these values into floats. In your `ramen` table, replace the `Stars` column so that all the data values are floats instead of strings. Find the mean star rating of all the ramen, and save it into a variable called `mean_star`.
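
As a small illustration of why the conversion matters, here is a sketch on a made-up array of rating strings (the names `rating_strings`, `ratings`, and `mean_rating` are hypothetical and are not part of the `ramen` table). It uses NumPy's `astype`; with a table column you would need to adapt the idea yourself.

```python
import numpy as np

# Hypothetical toy data: ratings stored as text, as they might appear after reading a CSV.
rating_strings = np.array(['3.5', '4', '2.75'])

# Convert every string to a float so that arithmetic becomes possible.
ratings = rating_strings.astype(float)

# The mean is only meaningful after the conversion.
mean_rating = ratings.mean()
print(mean_rating)
```

Applying Python's built-in `float` to each value is another way to do the same conversion.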

In [6]:
mean_star = ...
mean_star

In [7]:
#: grade 1.1
_ = ok.grade('q1_1')

**Question 1.2.** Notice that there are two styles of ramen: "Pack" and "Cup". Using `ramen`, calculate the difference between the mean star ratings of Pack and Cup ramen. Assign your answer to `observed_difference`.

$$\text{observed difference} := \text{mean Pack stars} - \text{mean Cup stars}$$

In [8]:
observed_difference = ...
observed_difference

In [9]:
#: grade 1.2
_ = ok.grade('q1_2')

**Question 1.3.** Interpret in words the number you obtained for `observed_difference` and assign either 1, 2, 3, or 4 to `q1_3`.

1. In our sample, the mean cup stars is lower than the mean pack stars by about 0.20 stars.
2. In our sample, the mean pack stars is lower than the mean cup stars by about 0.20 stars.
3. In our sample, the mean cup stars is lower than the mean pack stars by about 20 percent.
4. In our sample, the mean cup stars is higher than the mean pack stars by about 20 percent.

In [10]:
q1_3 = ...
q1_3

In [11]:
#: grade 1.3
_ = ok.grade('q1_3')

Now we want to conduct an A/B test (i.e., a permutation test) to see whether the higher average star rating for pack ramen could be due to chance alone, or whether pack ramen really does have higher ratings than cup ramen. As a reminder, the textbook describes the [process](http://sierra.ucsd.edu/dsc10-book/chapters/12/1/AB_Testing.html) of an A/B test. In your upcoming A/B test, shuffle the `Stars` column and keep the `Style` column in the same order.


**Null hypothesis:** Star ratings of pack ramen and cup ramen come from the same distribution.  
**Alternative hypothesis:** Star ratings of pack ramen are typically higher than those of cup ramen.

Hint: To make your simulation go faster, drop the irrelevant columns before running the A/B test. Make another table called `small_ramen` that only has the `Stars` and `Style` columns, and shuffle using `small_ramen`.
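
To make the shuffling idea concrete, here is a minimal sketch of a permutation test on made-up data. Everything in it is an assumption for illustration: the groups `'A'` and `'B'`, the arrays `labels` and `scores`, and the helper `difference_of_means` are hypothetical, and it uses plain NumPy arrays rather than a `babypandas` table.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only so the sketch is reproducible

# Hypothetical toy data, not the ramen dataset.
labels = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'])
scores = np.array([4.0, 3.5, 5.0, 4.5, 3.0, 3.5, 2.5, 4.0])

def difference_of_means(values, groups):
    """Mean of group 'A' minus mean of group 'B'."""
    return values[groups == 'A'].mean() - values[groups == 'B'].mean()

# The statistic computed on the data as observed.
observed = difference_of_means(scores, labels)

# Simulate the statistic under the null hypothesis:
# shuffle the values while keeping the group labels in place.
simulated = np.array([])
for _ in range(500):
    shuffled_scores = rng.permutation(scores)
    simulated = np.append(simulated, difference_of_means(shuffled_scores, labels))

print(observed)
print(simulated.min(), simulated.max())
```

Each shuffle breaks any real link between group and value, so the collection of simulated differences shows what differences look like when the group makes no difference.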

**Question 1.4.** Use a permutation test to calculate 500 differences using random permutations of the data. Store your 500 differences in the `differences` array.

In [14]:
differences = ...
differences

In [15]:
min(differences), max(differences)

In [16]:
#: grade 1.4
_ = ok.grade('q1_4')

**Question 1.5.** Which of the following choices best describes the purpose of the permutation test in A/B testing? Assign either 1, 2, or 3 to `q1_5`.
1. The permutation test generates a null distribution which we can use in testing our hypothesis.
2. The permutation test mitigates noise in our data by generating new permutations of the data.
3. The permutation test is a special case of the bootstrap and allows us to produce interval estimates.

In [13]:
q1_5 = ...
q1_5

In [15]:
#: grade 1.5
_ = ok.grade('q1_5')

**Question 1.6.** Compute a p-value for the hypothesis. That is, under the null hypothesis, compute the probability that we would have obtained a difference greater than or equal to `observed_difference` by chance alone. Assign your answer to `p_val`.
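
As an illustration of the arithmetic only, here is a sketch that turns an array of simulated differences into a proportion. The numbers are invented for the example (real simulations would have many more values), and the names `simulated_differences`, `observed`, and `p_value` are hypothetical.

```python
import numpy as np

# Hypothetical simulated differences under the null hypothesis (toy values).
simulated_differences = np.array([-0.3, -0.1, 0.0, 0.1, 0.2, 0.4, -0.2, 0.05, 0.25, -0.05])

# Hypothetical observed difference.
observed = 0.2

# Proportion of simulated differences at least as large as the observed one.
p_value = np.count_nonzero(simulated_differences >= observed) / len(simulated_differences)
print(p_value)  # 3 of the 10 toy values are >= 0.2, so this prints 0.3
```

The direction of the comparison (`>=` here) should match the alternative hypothesis being tested.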

In [16]:
p_val = ...
p_val

In [17]:
#: grade 1.6
_ = ok.grade('q1_6')

**Question 1.7.** Do you reject or fail to reject the null hypothesis at the 0.05 significance level? What conclusion can you make with regards to the star ratings of pack ramen and cup ramen?

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

...

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

**Question 1.8.** Suppose in this question you had shuffled the `Style` column and kept the `Stars` column in the same order. 
Which of the following is a true statement?

1. Your new p-value would be 1 - (old p-value), where new p-value is with `Style` shuffled and old p-value is with `Stars` shuffled.
2. We would conclude that pack ramen would have lower star ratings than cup ramen.
3. The `Style` column cannot be shuffled because there are only two unique values.
4. There would be no difference in the A/B Test if we had shuffled the `Style` column instead.

In [18]:
q1_8 = ...
q1_8

In [19]:
#: grade 1.8
_ = ok.grade('q1_8')

## 2. Percentiles

**The General Definition**

> Let $p$ be a number between 0 and 100. The $p$th percentile of a collection is the smallest value in the collection that is *at least as large* as $p$% of all the values. 

![](percentile_example.jpg)

By this definition, any percentile between 0 and 100 can be computed for any collection of values and is always an element of the collection. Suppose there are $n$ elements in the collection. To find the $p$th percentile:

1. Sort the collection in increasing order.
2. Find $p$% of $n$: $\frac{p}{100} \cdot n$. Call that $h$. If $h$ is an integer, define $k = h$. Otherwise, let $k$ be the smallest integer greater than $h$.
3. Take the $k$th element of the sorted collection.
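
The three steps above can be sketched as a short function. This is only an illustration on made-up numbers: the function name `percentile_by_definition` is hypothetical, and the sketch assumes $0 < p \le 100$ so that $k$ is at least 1.

```python
import math

def percentile_by_definition(collection, p):
    """Return the p-th percentile using the general definition above.

    Assumes 0 < p <= 100 (an assumption of this sketch).
    """
    if not 0 < p <= 100:
        raise ValueError("p must satisfy 0 < p <= 100")
    ordered = sorted(collection)   # Step 1: sort in increasing order
    n = len(ordered)
    h = p * n / 100                # Step 2: p% of n
    k = math.ceil(h)               # k = h if h is an integer, otherwise round up
    return ordered[k - 1]          # Step 3: k-th element (Python indexes from 0)

toy = [5, 1, 3, 9, 7]              # hypothetical data; sorted: [1, 3, 5, 7, 9]
print(percentile_by_definition(toy, 50))   # h = 2.5, k = 3, prints 5
print(percentile_by_definition(toy, 40))   # h = 2.0, k = 2, prints 3
```

Note the `k - 1`: the definition counts elements starting from 1, while Python arrays and lists start at index 0.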

**Question 2.1.** Assign the number of elements in `values` to the variable `n`. Define `k` as above -- your answer should be an integer. Assign the 36th percentile of the array `values` to `thirty_sixth_percentile`. You must use the variables provided for you when solving this problem. For this problem only, you may *not* use `np.percentile()`.

*Hint:* Using `math.ceil()` will round up a number to the next nearest whole number. `math` has already been imported for you.

In [17]:
#: don't change the values in this array!
values = np.array([23, 76, 94, 60, 70, 34, 23, 106, 54, 86, 39, 10, 47])
values.sort()  # This line sorts the array
values

In [18]:
n = ...
n

In [19]:
k = ...
k

In [20]:
thirty_sixth_percentile = ...
thirty_sixth_percentile

In [21]:
_ = ok.grade('q2_1')

**Question 2.2.** The csv file `mcdonalds.csv` contains some selected information on menu items taken from [kaggle](https://www.kaggle.com/mcdonalds/nutrition-facts). The columns include `Category`, `Item`, `Calories`, `Sodium`, `Total Fat`, `Carbohydrates`, `Sugars`, `Protein`. Plot a histogram showing the distribution of `Calories`. Use the bins provided.

In [22]:
#Do not change this cell
mcd = bpd.read_csv('mcdonalds.csv',index_col = 0)
mcd_bins = np.arange(0, 2000, 100)
mcd

**Question 2.3.** Compare the calorie distribution between categories `Beef & Pork` **AND** `Chicken & Fish` (group 1) **versus** category `Breakfast` (group 2). Find the absolute difference between the 90th percentiles of the two groups' `Calories` columns and assign it to `absolute_difference`. You may use `np.percentile()`.

In [26]:
absolute_difference = ...
absolute_difference

In [27]:
_ = ok.grade('q2_3')

**Question 2.4.** In an array `carb_quartiles`, put the values for the first, second, and third quartiles (in that order) of the `Carbohydrates` data provided, but only for items not in the `Coffee & Tea` category. Make sure your values are in the correct order. You may use `np.percentile()`.

In [28]:
carb_quartiles = ...
carb_quartiles

In [29]:
_ = ok.grade('q2_4')

**Question 2.5.** Say that McDonald's wants to add a new smoothie called `Mocha Almond Fudge (Large)`, which has 90 grams of sugar. What would the `Sugars` percentile range of this new smoothie be among only the **Large** items in the `Smoothies & Shakes` `Category`? Give the result as two numbers (1-100). Assign the smallest percentile that returns the new drink to `lower_bound` and the largest percentile that returns the new drink to `upper_bound`. For example, if the new smoothie would be returned when finding the 70th percentile and the 80th percentile of the Large Smoothies, but not at the 69th or 81st percentile, then `lower_bound = 70` and `upper_bound = 80`.

**Hint:** If you're unsure about percentiles, refer back to the general definition above Question 2.1.

In [30]:
mcd[(mcd.get('Category') == 'Smoothies & Shakes') &(mcd.get('Item').str.contains('Large'))]

In [31]:
lower_bound = ...
lower_bound

In [32]:
_ = ok.grade('q2_5')

**Question 2.6.** Shaun surveyed his class to find the total number of pets each of his classmates has. You can see his findings below in the table `pets`. For instance, 2 people have 0 pets, 4 have 1 pet, and so on. If one of his classmates, Jake, has some number of pets that falls in the 70th percentile of Shaun's data, how many pets does Jake have? Assign your answer to the value `jake_pets`. You may use `np.percentile()`.

*Hint*: A possible solution uses [np.repeat](https://docs.scipy.org/doc/numpy/reference/generated/numpy.repeat.html). (Also described in part 4.3)
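
If `np.repeat` is new to you, here is a small sketch of what it does with a table of counts. The arrays `sizes` and `counts` are made up for the example and have nothing to do with the `pets` data.

```python
import numpy as np

# Hypothetical frequency table: each value and how many times it was observed.
sizes = np.array([1, 2, 3])
counts = np.array([2, 1, 3])

# Expand the table into one entry per observation.
observations = np.repeat(sizes, counts)
print(observations)  # [1 1 2 3 3 3]
```

Once the counts are expanded into individual observations, functions that expect raw data can be applied to the result.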

In [33]:
#: load the data
pets = bpd.read_csv('pets.csv')
pets

In [35]:
jake_pets = ...
jake_pets

In [36]:
_ = ok.grade('q2_6')

## 3. In-N-Out and Five Guys

Suppose you are a burger lover and a regular at In-N-Out. When you get your third In-N-Out burger of the week, you notice that your patty is extremely small. Your friend tells you In-N-Out patties have always been this small, but you are doubtful and decide to investigate.

Ideally, you would want to figure out the exact mean weight of all In-N-Out burger patties. However, it's not feasible to obtain the mean weight of *all* In-N-Out patties (i.e. the mean weight of the population).

**Question 3.1.** Complete the statement below by filling in the blanks.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

Therefore, you want to collect a sample of In-N-Out patties to obtain a __________ statistic to estimate the ___________ parameter.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

Your other friend, who works at In-N-Out, agreed to weigh all the patties during his shift. He also does the same with Five Guys, since he works there as well. You decide to use this data as your sample.

**Question 3.2.** Your data is recorded in a CSV file called `burgers.csv`. Read this file into a table named `burgers`.

In [37]:
burgers = ...
burgers

In [38]:
_ = ok.grade('q3_2')

**Question 3.3.** For now, you only care about the weights of the In-N-Out patties. Create a new table with the rows of `burgers` where the value of `Place` is "In-N-Out". Assign this new table to `in_n_out`.

In [39]:
in_n_out = ...
in_n_out

In [40]:
_ = ok.grade('q3_3')

**Question 3.4.** Calculate the mean weight of `in_n_out` patties and assign it to `in_n_out_mean`. 

In [41]:
in_n_out_mean = ...
in_n_out_mean

In [42]:
_ = ok.grade('q3_4')

You're done! Or are you? You have a single point estimate for the true mean In-N-Out patty weight. However, you don't know how uncertain your estimate is and you don't know how much these estimates could vary. In other words, you don't have a sense of how good your estimate is. You may have gotten a particular statistic for one sample, but you could also get a completely different one for another sample.

This is where the idea of resampling via the [bootstrap](http://sierra.ucsd.edu/dsc10-book/chapters/13/2/Bootstrap.html) comes in. Let's assume that our original sample resembles the population fairly well. We can then resample from our original sample to produce even more estimates, which we can then use to produce an interval estimate for the true mean weight of In-N-Out patties.
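
Here is a minimal sketch of the resampling idea on a made-up sample. All of it is illustrative: the array `sample`, the names `boot_means`, `lower`, and `upper`, and the use of NumPy's random generator are assumptions of this sketch rather than the approach you are required to take with the `in_n_out` table.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Hypothetical toy sample of measurements.
sample = np.array([3.1, 2.9, 3.4, 3.0, 3.3, 2.8, 3.2, 3.5, 3.0, 3.1])

boot_means = np.array([])
for _ in range(1000):
    # Resample WITH replacement, using the same size as the original sample.
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means = np.append(boot_means, resample.mean())

# The middle 95% of the bootstrapped means gives an approximate interval estimate.
lower = np.percentile(boot_means, 2.5)
upper = np.percentile(boot_means, 97.5)
print(lower, upper)
```

Sampling with replacement is what makes each resample differ from the original sample. Without replacement, every resample of the same size would be a reordering of the same values and would have the same mean.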

**Question 3.5.** Fill out the following code to produce 1,000 bootstrapped estimates for the  *mean* weight of In-N-Out patties. Store your 1,000 estimates in the `in_n_out_means` array.

In [43]:
in_n_out_means = ...
for ... in ...:
    resample = ...
    resample_mean = ...
    in_n_out_means = ...   

In [46]:
#: This cell displays a histogram of in_n_out_means
bpd.DataFrame().assign(Estimated_Mean = in_n_out_means).plot(kind = 'hist')

In [47]:
_ = ok.grade('q3_5')

**Question 3.6.** Using the array `in_n_out_means`, compute an approximate 95% confidence interval for the true mean weight of In-N-Out patties. (Compute the lower and upper ends of the interval, named `lower_bound` and `upper_bound`, respectively.)

*Hint:* Use `np.percentile()`.

In [48]:
lower_bound = ...
lower_bound

In [49]:
upper_bound = ...
upper_bound

In [50]:
#: the confidence interval
print("Bootstrapped 95% confidence interval for the true mean weight of In-N-Out patties: [{:f}, {:f}]".format(lower_bound, upper_bound))

In [51]:
_ = ok.grade('q3_6')

**Question 3.7.** Which of the following would make the histogram narrower? Assign either 1 or 2 to `q3_7`.
1. Starting with a larger original sample size.
2. Increasing the number of resamples (repetitions of bootstrap).

In [52]:
q3_7 = ...
q3_7

In [53]:
_ = ok.grade('q3_7')

**Question 3.8.** Suppose you want to find the weight of the heaviest In-N-Out patty (maximum weight out of the entire population). Would your bootstrap procedure be effective in estimating the weight of the heaviest In-N-Out patty? Explain your answer below.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

**Question 3.9.** Suppose you're wondering how heavy the average In-N-Out patty is compared to the average Five Guys patty. Using the same bootstrap procedure, compute an approximate 95% confidence interval for the true mean difference in weight between In-N-Out and Five Guys patties. Store your 1,000 estimates in the `difference_means` array. Use the original `burgers` table for this.

$$\text{difference_mean} := \text{mean weight of In-N-Out} - \text{mean weight of Five Guys}$$

In [57]:
# You may need to add lines for additional code!
difference_means = ...
for ... in ...:
    resample = ...
    resample_mean = ...
    difference_means = ...

In [58]:
#: This cell displays a histogram of difference_means
bpd.DataFrame().assign(Estimated_Mean_Difference = difference_means).plot(kind = 'hist')

In [61]:
_ = ok.grade('q3_9')

**Question 3.10.** Compute the 95% confidence interval for the mean difference in weights of In-N-Out and Five Guys patties. Assign the left and right endpoints to `left_endpoint` and `right_endpoint` respectively. 

In [63]:
left_endpoint = ...
left_endpoint

In [64]:
right_endpoint = ...
right_endpoint

In [65]:
#: the confidence interval
print("Bootstrapped 95% confidence interval for the mean difference in weights of In-N-Out and Five Guys patties: [{:f}, {:f}]".format(left_endpoint, right_endpoint))

In [66]:
_ = ok.grade('q3_10')

**Question 3.11.** Based on your histogram and confidence interval, would you say with high probability that the mean In-N-Out patty is lighter than the mean Five Guys patty? Explain your answer.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

**Question 3.12.** Would changing the units of weight from ounces to grams change your conclusion? Assign a boolean (`True` if it would and `False` otherwise) to the name `q3_12`.

In [55]:
q3_12 = ...
q3_12

In [56]:
_ = ok.grade('q3_12')

# Finish Line

Congratulations, you're done with the homework! Be sure to:

- **Verify that all tests pass** (the next cell has a shortcut for that), 
- **Save and Checkpoint** from the `File` menu,
- **Run the last cell to submit your work**
- **You do not need to submit anything to Gradescope**

In [69]:
#: Run all tests at once
import os
_ = [ok.grade(q[:-3]) for q in os.listdir('tests') if q.startswith('q')]

## Before submitting, select "Kernel" -> "Restart & Run All" from the menu!

Then make sure that all of your cells ran without error.

In [None]:
#: Submit your notebook
_ = ok.submit()