In [None]:
# Run this cell, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

# These lines load the tests.
from client.api.notebook import Notebook
ok = Notebook('lab08.ok')
_ = ok.auth(inline=True)

## 1. Skittles and Conditionals

In Python, Boolean values can either be `True` or `False`, as we have covered before. We will be reviewing comparison operators, among which are `<`, `>`, and `==`. For a longer list, refer to the Booleans and Comparison section in the book [here](https://www.inferentialthinking.com/chapters/09/randomness.html#Booleans-and-Comparison). 

Let's run the cell below to see how a comparison operator works.

In [None]:
5 > 2 + 1

We can even assign the result of a comparison operation to a variable!

In [None]:
result = 6/3 == 2
result

Arrays are also compatible with comparison operators. The comparison is applied to each element, and the output is an array of Boolean values.

In [None]:
make_array(4, 6, 3, 5, 1, -5) < 4

One of your friends loves Skittles so much that they just bought you and your friends a big bag of Skittles to snack on while studying. The flavors and colors of Skittles are Grape (purple), Lemon (yellow), Green Apple (green), Orange (orange), and Strawberry (red).

Using the function call `np.random.choice(array_name)`, let's simulate taking Skittles from the bag at random. Run the cell below a couple of times and see how the results change.

In [None]:
skittles = make_array('purple', 'yellow', 'green', 'orange', 'red')
np.random.choice(skittles)
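
`np.random.choice` also accepts an optional second argument giving the number of draws. It then returns an array of that many random selections, made with replacement by default. A minimal sketch, using a hypothetical coin instead of the Skittles bag (the names `coin` and `five_flips` are made up for illustration):

```python
import numpy as np

# Hypothetical example: the faces of a coin, not part of the lab's data.
coin = np.array(['heads', 'tails'])

# Draw 5 times at random, with replacement; the result is an array of length 5.
five_flips = np.random.choice(coin, 5)
print(five_flips)
```

Because the draws are random, the printed array will usually change each time the code runs.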

**Question 1.** You just took a handful of ten Skittles from the bag at random and stored the results in the array `ten_skittles` in the next cell. Find the number of Skittles that are red (do not hardcode!).

*Hint:* Our solution involves a comparison operator and the `np.count_nonzero` function.
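
As a rough illustration of how a comparison and `np.count_nonzero` fit together, here is a sketch on an unrelated, made-up array of temperatures (the names `temperatures`, `is_warm`, and `num_warm` are hypothetical):

```python
import numpy as np

# Hypothetical temperature readings, not part of the lab's data.
temperatures = np.array([61, 75, 58, 80, 72, 66])

# The comparison is applied to each element and yields an array of Booleans.
is_warm = temperatures > 70

# True counts as nonzero, so this counts how many elements passed the comparison.
num_warm = np.count_nonzero(is_warm)
print(num_warm)  # 3
```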

In [None]:
#Let's grab 10 random skittles from the bag
ten_skittles = ...

#How many red skittles are there?
number_red = ...
number_red

In [None]:
_ = ok.grade('q1_1')

**Conditional Statements**
A conditional statement allows Python to choose from different blocks of code based on whether some condition is true.

For example: 

```
def sign(x):
    if x > 0:
        return 'Positive'
```

The way the function works is this: if the input `x` is greater than `0`, we get the string `'Positive'` back.
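
It can help to see what happens when the condition is false. In the sketch below, which repeats the `sign` function from above, no `return` statement is reached for a non-positive input, so Python returns the special value `None`:

```python
def sign(x):
    if x > 0:
        return 'Positive'

print(sign(3))   # Positive
print(sign(-3))  # None, because the body of the if statement was skipped
```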

If we want to test multiple conditions at once, we use the general format: 


```
if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
...
else:
    <else body>
```
Only one body will ever be executed. Each `if` and `elif` expression is evaluated and considered in order, starting at the top. As soon as a true value is found, the corresponding body is executed, and the rest of the conditional statement is skipped. If none of the `if` or `elif` expressions are true, then the `<else body>` is executed. For more examples and explanation, refer to [Section 9.1](https://www.inferentialthinking.com/chapters/09/1/conditional-statements.html).
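
To see why the order of the conditions matters, consider the following sketch. The function `size_label` and its thresholds are hypothetical, chosen only so that several conditions can be true for the same input:

```python
def size_label(n):
    # Conditions are checked from top to bottom; only the first true one runs.
    if n > 100:
        return 'large'
    elif n > 10:
        return 'medium'
    elif n > 0:
        return 'small'
    else:
        return 'non-positive'

print(size_label(500))  # large (n > 10 and n > 0 are also true, but are never checked)
print(size_label(50))   # medium
print(size_label(-2))   # non-positive
```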


**Question 2.** Let's define a function called `grab_more` that returns `'Yes please'` if the number of red Skittles in `ten_skittles` is less than `4` and `'I'm okay'` if there are `4` or more red Skittles.

In [None]:
def grab_more(skittles):
    ...
    if ...:
        return 'I\'m okay'
    # next condition should return 'Yes please'

grab_more(ten_skittles)

In [None]:
_ = ok.grade('q1_2')

**Question 3.** Write a function called `skittle_reaction` that returns a string based on the color of Skittle passed in as an argument. From top to bottom, the conditions should correspond to: `'purple'`, `'yellow'`, `'green'`, `'orange'`, `'red'`.  

In [None]:
def skittle_reaction(skittle):
    if ...:
        return 'Okay'
    # next condition should return 'Mm'
    ...
    # next condition should return 'Bleh'
    ...
    # next condition should return 'Hmm'
    ...
    # next condition should return 'Yum!'
    ...

best_skittle = skittle_reaction('red')
best_skittle

In [None]:
_ = ok.grade('q1_3')

**Question 4.** Create a table called `ten_skittles_reactions` with one column called `Skittles` for your `ten_skittles` and a column `'Reactions'` that consists of reactions for each of the skittles in `ten_skittles`. 

*Hint:* Use the `apply` method. 

In [None]:
ten_skittles_reactions = ...
ten_skittles_reactions

In [None]:
_ = ok.grade('q1_4')

**Question 5.** Using code, find the number of `'Yum!'` reactions for the Skittles in `ten_skittles_reactions`.

In [None]:
number_yum_reactions = ...
number_yum_reactions

In [None]:
_ = ok.grade('q1_5')

**Question 6.** Complete the function `yum_or_bummed`, which takes in a table of Skittles with reactions (just like the one from Question 4) and returns `'Yes!'` if there are more red Skittles than green Skittles, or `'Bleh.'` if there are more green Skittles than red Skittles. If there are an equal number of each, return `'Alright!'`.

In [None]:
def yum_or_bummed(skittle_table):
    reactions = ...
    number_yum_reactions = ...
    number_bleh_reactions = ...
    if ...:
        return 'Yes!!'
    # next condition should return 'Bleh.'
    ...
    # next condition should return 'Alright!'
    ...

In [None]:
_ = ok.grade('q1_6')

**Question 7.** Let's create a table called `many_skittles` with one column called `Skittles` that contains `100` random Skittles and another column called `Reactions` that contains the reactions for each of the `100` random Skittles. Then, use `yum_or_bummed` to see whether your handful of `100` random Skittles will be a delicious snack or not.

In [None]:
#Create a table with 100 random skittles
many_skittles = Table().with_column(...)

#Add a column for the reactions
many_skittles = many_skittles.with_column(...)

#Test out your function yum_or_bummed!
result = ...
result

In [None]:
_ = ok.grade('q1_7')

## 2. Simulations
Practice using `for` statements to perform a task multiple times. Recall, this is known as iteration. 

**Question 2.1.** Once again, Clay is playing darts, but this time it's against his friend Jimmy. Recall that his dartboard contains ten equal-sized zones with point values from 1 to 10. Write code that simulates whether Clay or Jimmy wins after 100 dart throws. Make sure to use a `for` loop.

Possible outcomes for winner are: `'Clay'`, `'Jimmy'`, or  `'neither'` 

*Hint:* There are a few steps to this problem (and most simulations): 
1. Figuring out the big picture of what we want to simulate (the total score for each of them after 100 dart throws)
2. Deciding the possible values you can take in the experiment (point values in this case) and simulating one example (throwing one dart)
3. Deciding how many times to run through the experiment (100 throws in our case) and keeping track of the total information of each time you ran through the experiment (the total score in this case)
4. Comparing the total scores to figure out who won (Don't forget to consider the possibility that they tie)
5. Coding up the whole simulation!
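
The steps above can be sketched on a different, made-up game: two hypothetical players, A and B, each roll a six-sided die 20 times and compare totals. The names `die_faces`, `a_total`, `b_total`, and `leader` are assumptions for this illustration, and the fixed seed is only there so that reruns give the same result:

```python
import numpy as np

np.random.seed(0)  # fixed seed for repeatability; an assumption of this sketch

die_faces = np.arange(1, 7)   # step 2: the possible values of a single roll
rolls = 20                    # step 3: how many times to repeat the experiment
a_total = 0                   # running totals start at zero
b_total = 0

for _ in range(rolls):
    # Simulate one roll for each player and add it to that player's total.
    a_total = a_total + np.random.choice(die_faces)
    b_total = b_total + np.random.choice(die_faces)

# Step 4: compare the totals, remembering that a tie is possible.
if a_total > b_total:
    leader = 'A'
elif b_total > a_total:
    leader = 'B'
else:
    leader = 'tie'

print(a_total, b_total, leader)
```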

In [None]:
possible_point_values = ...
throws = 100
C_total_score = ...
J_total_score = ...

# a for loop would be useful here (Clay's total score)



# a for loop would be useful here (Jimmy's total score)


# write some comparisons to see whose score is bigger
winner

**Question 2.2.** In the following cell, we've loaded the text of _Pride and Prejudice_ by Jane Austen, split it into individual words, and stored these words in an array (that we called `p_and_p_words`). Using a `for` loop, assign `longer_than_fifteen` to the number of words in the novel that are more than 15 letters long.
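
The general pattern of counting with a `for` loop is to start a counter at zero and add one each time an element meets a condition. A minimal sketch on a short, hypothetical word list (the names `words` and `longer_than_four` are made up, and the threshold of 4 is arbitrary):

```python
# Hypothetical word list, standing in for a much longer array of words.
words = ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

longer_than_four = 0            # running count, starts at zero
for word in words:
    if len(word) > 4:           # the condition each word is checked against
        longer_than_four = longer_than_four + 1

print(longer_than_four)  # 3 ('quick', 'brown', and 'jumps')
```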

In [None]:
austen_string = open('Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
p_and_p_words = np.array(austen_string.split())

longer_than_fifteen = ...

# a for loop would be useful here


longer_than_fifteen

**Question 2.3.** Using simulation with 1,000,000 trials, assign `chance_greater_15` to an estimate of the chance that if you pick a word from _Pride and Prejudice_ uniformly at random (with replacement), the length of the word is greater than 15. 

*This may take a minute to run*
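
As a sketch of how a chance can be estimated by simulation, the code below estimates the chance that a roll of a hypothetical six-sided die is greater than 4. The exact answer is 2/6, so the estimate should land close to 0.33. The names and the number of trials are assumptions for this illustration:

```python
import numpy as np

np.random.seed(1)  # fixed seed for repeatability; an assumption of this sketch

die_faces = np.arange(1, 7)   # hypothetical six-sided die
trials = 10000
greater_than_four = 0         # counts the trials in which the event happened

for _ in range(trials):
    roll = np.random.choice(die_faces)   # one uniformly random draw
    if roll > 4:
        greater_than_four = greater_than_four + 1

# The estimated chance is the proportion of trials in which the event happened.
estimate = greater_than_four / trials
print(estimate)
```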

In [None]:
trials = 1000000
greater_fifteen = ...

for ... in ...:
    ...

chance_greater_15 = ...

chance_greater_15

Notice that the value you get for **Question 2.3.** is close to what you would get if you divided the value from **Question 2.2.** by the total number of words in _Pride and Prejudice_.

Run the following cell to check and see!

In [None]:
print('The simulated chance that a word has a length greater than 15 is:', chance_greater_15)
print('And the proportion of words with a length greater than 15 is:', longer_than_fifteen / len(p_and_p_words))

**Question 2.4.** LeBron James is drafting basketball players for his NBA fantasy league. He chooses 10 players at random from a list of players, drafting each player regardless of whether that player has been chosen before (you could have 10 Kevin Durants on a team!). He does this 100 times (100 drafts, each consisting of 10 players). Count how many times Kevin Durant is the first pick.

*Hint:* You may try using a nested `for` loop (this is not the only way to do it).
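
For readers who have not seen a nested `for` loop before, here is a small sketch on made-up data: 4 rounds of 3 random color draws each, followed by a count of the rounds whose first draw was `'blue'`. All names (`colors`, `rounds`, `blue_first`) and sizes are hypothetical:

```python
import numpy as np

colors = np.array(['red', 'blue', 'green'])  # hypothetical options
num_rounds = 4
draws_per_round = 3

rounds = []                              # will hold one list of draws per round
for r in range(num_rounds):              # outer loop: one pass per round
    one_round = []
    for d in range(draws_per_round):     # inner loop: one draw at a time
        one_round.append(np.random.choice(colors))
    rounds.append(one_round)

blue_first = 0
for one_round in rounds:
    if one_round[0] == 'blue':           # index 0 is the first draw of the round
        blue_first = blue_first + 1

print(rounds)
print(blue_first)
```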

In [None]:
players = ["John Wall", "Kevin Durant", "Kyrie Irving", "Joel Embiid", "Russell Westbrook"]
draft_picks = ... #array of size 100 containing teams of size 10
num_durant_first = ...

# create draft_picks
for ... in ...:
    ...

# check how many times Kevin Durant is the first pick
for ... in ...:
    ...

num_durant_first

## 3. Analyzing Data: Majors at UCSB

In the 2016-17 school year, 5,809 UCSB students graduated from the university with a total of 95 different majors. The most popular major was Psychology (B.A.) with 513 students choosing Psych as their major of choice. Six major programs only had one graduating student, including Geophysics and Portuguese. 

Run this cell to see a full list of all the majors available at UCSB. Note: Some majors are listed twice and considered to be different because they are offered as both a Bachelor of Arts (B.A.) and a Bachelor of Science (B.S.) degree.

In [None]:
major_data = Table.read_table("major_data.csv")
major_data

Replace the three dots below with your current or intended major, then run the cell to see how many students graduated with that major in 2017. If your major does not appear in the table, check the `major_data.csv` file in the lab files for the correct spelling. If you are currently undeclared, pick any major.

In [None]:
major_data.where("Major", are.equal_to(...))

**Question 3.1**  Use the major data to identify the ten most popular majors at UCSB. Make an array called `top_ten` that contains the data for the ten most popular majors only.

In [None]:
top_ten = ...
top_ten

In [None]:
_ = ok.grade('q3_1')

**Question 3.2** You may notice that none of the top ten majors are majors within the College of Engineering. The CoE is relatively small compared to the College of Letters and Science. Make an array called `engineering` that contains only the data for the majors within the College of Engineering.

**Hint 1:** *In 2016-17, there were five majors within the College of Engineering: Chemical Engineering, Computer Engineering, Computer Science, Electrical Engineering, and Mechanical Engineering.*

**Hint 2:** *Computer Science may appear twice because the major used to be offered in the College of Letters & Science in addition to within the engineering department, but the major is no longer offered within L&S.*

In [None]:
engineering = ...
engineering

In [None]:
_ = ok.grade('q3_2')

Next, let's analyze historical data to look for possible trends in the majors that students have selected over the years. Run the next cell to see how many students graduated from a selection of six different majors in the 1980-1981 school year.

In [None]:
past_majors = ["Computer Science", "Electrical Engineering", "Geography", "Linguistics", "Communication", "Statistical Science"]
past_graduates = [21, 91, 66, 6, 125, 0]

past_data = Table().with_columns("Majors", past_majors, "1980 Graduates", past_graduates)
past_data.show()

Let's compare the above numbers from 1980 with the numbers from 2017.

In [None]:
current_majors = ["Computer Science", "Electrical Engineering", "Geography", "Linguistics", "Communication", "Statistical Science"]
current_graduates = [91, 51, 56, 56, 389, 121]

current_data = Table().with_columns("Majors", current_majors, "2017 Graduates", current_graduates)
current_data.show()

**Question 3.3** Use the `join` method to join these two tables together so each row contains the major, the number of 1980 graduates and then the number of 2017 graduates. Save this new table into the variable `major_data`.

In [None]:
major_data = ...
major_data

In [None]:
_ = ok.grade('q3_3')

**Question 3.4** You should notice some trends by looking at the table you just made. Out of the six majors shown, which majors were more popular in 2017 than in 1980? Which were less popular?

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">
Replace this text with your answer
<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

## 4. Hypothesis Testing

**At UCSB, the herbal energy drink Yerba Mate is wildly popular. As an inquisitive UCSB student who happens to be savvy in the art of data science, you want to understand whether Yerba Mate is truly popular because of its energy-enhancing properties or its taste, or whether the love for the drink instead comes from UCSB's campus culture. You claim that Yerba Mate is less popular with people outside of UCSB than it is with people at UCSB.**

**To test this claim, you decide to survey 200 UCSB third-year students during the first week of the fall quarter: 100 third-year transfer students and 100 third-year students who have been at UCSB since their first year. You do this because you assume that, if Yerba Mate's popularity is due only to culture, students who have been at UCSB since their first year will like Yerba Mate more than transfer students who are new to the campus culture.**

**So, you decide to ask each person to rate their preference for Yerba Mate relative to other energy drinks on a scale of 0 to 1, using the following scoring:**

| Score | Meaning of Score                                                             |
|-------|---------------------------------------------------------------------|
| $\frac{1}{5}$     | I never drink Yerba Mate.                                           |
| $\frac{2}{5}$     | I prefer other forms of energy-enhancing beverages.                 |
| $\frac{3}{5}$     | I am impartial to Yerba Mate.                                       |
| $\frac{4}{5}$     | I prefer Yerba Mate over other forms of energy-enhancing beverages. |
| $\frac{5}{5}$     | I only drink Yerba Mate for my energy-enhancing beverage.           |


**After conducting your survey you arrive at the following results:**
- Average Score for Transfer Students (Population 1) = $0.37$
- Average Score for students who have been at UCSB since their first year (Population 2) = $0.46$.

 **As a reminder, a Hypothesis Test has several elements that go into it:**


- **A Null Hypothesis ($H_0$) that describes the claim you wish to reject or otherwise lack evidence to reject.**
    - This is the status quo (i.e. what is considered to be true) 
    - Always contains the equality sign (i.e. =, $\leq$, $\geq$).
    
- **An Alternative Hypothesis ($H_A$) that describes a differing claim from the null hypothesis.** 
    - Research claim
    - Goes against the status quo
    - Always contains the inequality (i.e. $\neq$, <, >)
    
- **Test Statistic (TS)** a metric of the data that can be used to compare what we observe to what we expect to observe under the null hypothesis. ("Under the null hypothesis" means "Assuming the null hypothesis to be true.")
    - It is a function of the data
    - Its distribution is known under $H_0$.
    
- **P-Value** is the probability, assuming the null hypothesis is true, of observing a Test Statistic at least as extreme (in the direction of the alternative hypothesis) as the one we observed in our data.
    - If this probability is low, the data give evidence that the null hypothesis should be rejected, since a Test Statistic like the one we observed would rarely occur under the null hypothesis' distribution.
    - Note: the p-value is not the chance that the null hypothesis is true. It is the chance of seeing data this extreme when the null hypothesis is assumed to be true.

## Gaining Intuition

**Now, you want to use a hypothesis test to check whether Yerba Mate is as popular with people outside of UCSB as it is with people at UCSB.**

**One way to state your claim for a hypothesis test would be: the difference in the average feeling towards Yerba Mate between transfer students (Population 1) and students who have been at UCSB since their first year (Population 2) is less than 0.**

- $H_0$: Average Score of Population 1 - Average Score of Population 2 = 0.
- $H_A$: Average Score of Population 1 - Average Score of Population 2 < 0.

*$H_A$ is the claim that you are testing, and $H_0$ is what you wish to reject or otherwise lack evidence to reject.* 

- Your Test Statistic should thus be (average of Population 1) - (average of Population 2).

**If the averages are equal, then we would expect the proportion of times that the Average Score of Population 1 is larger than that of Population 2 to be $\frac{1}{2}$ (like flipping a fair coin, since our null hypothesis assumes that their averages are equal).**

So we will specify a model for the proportion of times that the Average Score of Population 1 is larger than that of Population 2, with proportions 0.5.

In [None]:
model = [0.5,0.5]
#First element is proportion of times that population 1 has a higher average than population 2.

Next, we need to calculate the observed Test Statistic. Since we defined our Test Statistic to be (average of Population 1) - (average of Population 2), we observe a Test Statistic of -0.09.

In [None]:
TS = 0.37 - 0.46 
TS

Let's now see what the distribution of test statistics is actually like under our fully specified model. Assign `test_statistics` to an array of 1000 test statistics that you simulate assuming the null hypothesis is true.

In [None]:
num_trials = ...
model = ...

test_statistics = ...

for i in range(num_trials):
    # Your code here
    ...
    
plt.hist(test_statistics, bins = 10)
plt.axvline(x=TS, color  = "r")

**Now to calculate the P-Value**

As a reminder from above, the P-Value is the probability, assuming the null hypothesis is true, of observing a Test Statistic at least as extreme as the one we observed in our data.
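
For a left-tailed alternative such as ours, an empirical p-value can be computed as the proportion of simulated test statistics that are less than or equal to the observed one. A minimal sketch with made-up numbers (the array `simulated` and the value `observed` are hypothetical, not results from the survey):

```python
import numpy as np

# Hypothetical test statistics simulated under a null hypothesis.
simulated = np.array([-0.3, -0.1, 0.0, 0.05, 0.1, -0.05, 0.2, -0.2, 0.15, 0.0])
observed = -0.1   # hypothetical observed test statistic

# Count the simulated values at least as far to the left as the observed value,
# then divide by the number of simulations to get a proportion.
empirical_p = np.count_nonzero(simulated <= observed) / len(simulated)
print(empirical_p)  # 0.3 (three of the ten values are <= -0.1)
```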

**Calculate the p-value below:**

In [None]:
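# Proportion of simulated test statistics at least as far below 0 as the observed TS.
P_Value = np.count_nonzero(np.array(test_statistics) <= TS) / len(test_statistics)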
P_Value

### Assuming a significance level of 0.05, does there exist sufficient evidence to reject the null hypothesis?