# Lab 3: DataFrames, Control Flow, and Probability

## Due Saturday, April 29th at 11:59PM

Welcome to Lab 3! This week, we will go over more DataFrame manipulation techniques, conditionals and iteration, and introduce the concept of randomness. You should complete this entire lab so that all tests pass and submit it to Gradescope by 11:59PM on the due date.

Refer to the following readings:
- Grouping with subgroups (see [BPD 11.4](https://notes.dsc10.com/02-data_sets/groupby.html#subgroups))
- Merging DataFrames (see [BPD 13](https://notes.dsc10.com/02-data_sets/merging.html))
- Conditional statements (see [CIT 9.1](https://inferentialthinking.com/chapters/09/1/Conditional_Statements.html))
- Iteration (see [CIT 9.2](https://inferentialthinking.com/chapters/09/2/Iteration.html))
- Probability (see [CIT 9.5](https://inferentialthinking.com/chapters/09/5/Finding_Probabilities.html))

First, set up the tests and imports by running the cell below.

In [1]:
import numpy as np
import babypandas as bpd

# These lines set up graphing capabilities.
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import otter
grader = otter.Notebook()

%reload_ext pandas_tutor

## 1. California National Parks 🏞️ 🐻

In this question, we'll take a closer look at the DataFrame methods `merge` and `groupby`.

We will be working with two datasets, `california_parks.csv` (stored as `parks`) and `california_parks_species.csv` (stored as `species`), which provide information on California National Parks and the species of plants and animals found there, respectively. These are a subset of a [larger dataset the National Park Service provides](https://www.kaggle.com/nationalparkservice/park-biodiversity). We've also created a third DataFrame, `parks_species`, that contains the number of species per park.

Run the cell below to load in our data.

In [2]:
parks = bpd.read_csv("data/california_parks.csv")
species = bpd.read_csv("data/california_parks_species.csv")
parks_species = bpd.DataFrame().assign(
    count=species.groupby('Park Name').count().get('Category')
)

Right now, the information we have on each California National Park is split across two DataFrames. The `parks` DataFrame has the code, state, size, and location of each park, and the `parks_species` DataFrame contains the number of species at each park. Run the cells below to see both DataFrames.

In [3]:
parks

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude
0,CHIS,Channel Islands National Park,CA,249561,34.01,-119.42
1,JOTR,Joshua Tree National Park,CA,789745,33.79,-115.9
2,LAVO,Lassen Volcanic National Park,CA,106372,40.49,-121.51
3,PINN,Pinnacles National Park,CA,26606,36.48,-121.16
4,REDW,Redwood National Park,CA,112512,41.3,-124.0
5,SEKI,Sequoia and Kings Canyon National Parks,CA,865952,36.43,-118.68
6,YOSE,Yosemite National Park,CA,761266,37.83,-119.5


In [4]:
parks_species

Park Name,count
Channel Islands National Park,1885
Joshua Tree National Park,2294
Lassen Volcanic National Park,1797
Pinnacles National Park,1416
Redwood National Park,6310
Sequoia and Kings Canyon National Parks,1995
Yosemite National Park,2088


**Question 1.1.** Below, use the `merge` method to create a new DataFrame named `parks_with_species`, which will have the parks' existing information along with the number of species each has. Make sure the DataFrame only has one row per park. Your DataFrame should look like this:

|    | Park Code   | Park Name                               | State   |   Acres |   Latitude |   Longitude |   count |
|---:|------------|----------------------------------------|--------|--------|-----------|------------|--------|
|  0 | CHIS        | Channel Islands National Park           | CA      |  249561 |      34.01 |     -119.42 |    1885 |
|  1 | JOTR        | Joshua Tree National Park               | CA      |  789745 |      33.79 |     -115.9  |    2294 |
|  2 | LAVO        | Lassen Volcanic National Park           | CA      |  106372 |      40.49 |     -121.51 |    1797 |
|  3 | PINN        | Pinnacles National Park                 | CA      |   26606 |      36.48 |     -121.16 |    1416 |
|  4 | REDW        | Redwood National Park                   | CA      |  112512 |      41.3  |     -124    |    6310 |
|  5 | SEKI        | Sequoia and Kings Canyon National Parks | CA      |  865952 |      36.43 |     -118.68 |    1995 |
|  6 | YOSE        | Yosemite National Park                  | CA      |  761266 |      37.83 |     -119.5  |    2088 |

In [5]:
parks_species_reset = parks_species.reset_index()
parks_with_species = parks.merge(parks_species_reset, on='Park Name')
parks_with_species

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude,count
0,CHIS,Channel Islands National Park,CA,249561,34.01,-119.42,1885
1,JOTR,Joshua Tree National Park,CA,789745,33.79,-115.9,2294
2,LAVO,Lassen Volcanic National Park,CA,106372,40.49,-121.51,1797
3,PINN,Pinnacles National Park,CA,26606,36.48,-121.16,1416
4,REDW,Redwood National Park,CA,112512,41.3,-124.0,6310
5,SEKI,Sequoia and Kings Canyon National Parks,CA,865952,36.43,-118.68,1995
6,YOSE,Yosemite National Park,CA,761266,37.83,-119.5,2088


In [6]:
grader.check("q1_1")
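
The merge above works because `reset_index` turns the `'Park Name'` index of `parks_species` into an ordinary column that `merge` can match on. The sketch below shows the same idea on made-up data, along with an alternative that matches a column against the other DataFrame's index. It uses plain pandas rather than babypandas, on the assumption that the two behave the same for these calls; the names `toy_parks` and `toy_counts` and all numbers are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `parks` and `parks_species` (made-up values).
toy_parks = pd.DataFrame({
    'Park Name': ['Park A', 'Park B', 'Park C'],
    'Acres': [100, 200, 300],
})
toy_counts = pd.DataFrame(
    {'count': [12, 7, 30]},
    index=pd.Index(['Park A', 'Park B', 'Park C'], name='Park Name'),
)

# Option 1: move the index into a column, then match on that column.
merged_on_column = toy_parks.merge(toy_counts.reset_index(), on='Park Name')

# Option 2: match the left column against the right DataFrame's index.
merged_on_index = toy_parks.merge(toy_counts, left_on='Park Name', right_index=True)

print(merged_on_column)
print(merged_on_index)
```

Both options give one row per park because each park name appears exactly once in each toy DataFrame.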

Now, let's take a look at the `species` DataFrame. Each park has a lot of different species, and each species varies in abundance at each park. 

In [7]:
species

Unnamed: 0,Park Name,Category,Order,Family,Common Names,Abundance
0,Channel Islands National Park,Mammal,Carnivora,Canidae,Channel Islands Gray Fox,Rare
1,Channel Islands National Park,Mammal,Carnivora,Mephitidae,Spotted Skunk,Uncommon
2,Channel Islands National Park,Mammal,Carnivora,Mustelidae,Sea Otter,
3,Channel Islands National Park,Mammal,Carnivora,Otariidae,Guadalupe Fur Seal,Occasional
4,Channel Islands National Park,Mammal,Carnivora,Otariidae,Northern Fur Seal,Uncommon
...,...,...,...,...,...,...
17780,Yosemite National Park,Vascular Plant,Solanales,Solanaceae,Parish's Nightshade,Rare
17781,Yosemite National Park,Vascular Plant,Solanales,Solanaceae,"Chaparral Nightshade, Purple Nightshade",Uncommon
17782,Yosemite National Park,Vascular Plant,Vitales,Vitaceae,"Thicket Creeper, Virginia Creeper, Woodbine",Rare
17783,Yosemite National Park,Vascular Plant,Vitales,Vitaceae,"California Grape, California Wild Grape",Uncommon


**Question 1.2.** Using the `groupby` method, assign the variable `species_abundance` to a DataFrame that counts the species in each park at each level of abundance. In other words, group `species` by both `'Park Name'` and `'Abundance'`.

_**Hint:**_ Reset the index and assign columns so that you have three columns: `'Park Name'`, `'Abundance'`, and `'Category'`. The first few rows of your DataFrame should look like this:

|    | Park Name                               | Abundance   |   Category |
|---|----------------------------------------|------------|-----------|
|  0 | Channel Islands National Park           | Abundant    |         48 |
|  1 | Channel Islands National Park           | Common      |        228 |
|  2 | Channel Islands National Park           | Occasional  |        190 |
|  3 | Channel Islands National Park           | Rare        |        368 |
|  4 | Channel Islands National Park           | Uncommon    |        471 |
|  5 | Channel Islands National Park           | Unknown     |        173 |
|  6 | Joshua Tree National Park               | Abundant    |         37 |
|  7 | Joshua Tree National Park               | Common      |        543 |
|  8 | Joshua Tree National Park               | Occasional  |         84 |
|  9 | Joshua Tree National Park               | Rare        |         90 |

In [8]:
# Group the species by Park Name and Abundance
species_abundance = species.groupby(['Park Name', 'Abundance']).count().reset_index()

# Select relevant columns
species_abundance = species_abundance.get(['Park Name', 'Abundance', 'Category'])

species_abundance


Unnamed: 0,Park Name,Abundance,Category
0,Channel Islands National Park,Abundant,48
1,Channel Islands National Park,Common,228
2,Channel Islands National Park,Occasional,190
3,Channel Islands National Park,Rare,368
4,Channel Islands National Park,Uncommon,471
...,...,...,...
37,Yosemite National Park,Common,480
38,Yosemite National Park,Occasional,81
39,Yosemite National Park,Rare,342
40,Yosemite National Park,Uncommon,952


In [11]:
grader.check("q1_2")
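
Grouping by a list of two columns creates one group for every combination of values that actually occurs. Here is a minimal sketch on made-up data, written in plain pandas on the assumption that babypandas behaves the same way for `groupby`, `count`, and `reset_index`. The DataFrame `toy_species` and its contents are hypothetical.

```python
import pandas as pd

toy_species = pd.DataFrame({
    'Park Name': ['Park A', 'Park A', 'Park A', 'Park B', 'Park B'],
    'Abundance': ['Rare', 'Rare', 'Common', 'Common', 'Common'],
    'Category': ['Mammal', 'Bird', 'Bird', 'Fish', 'Mammal'],
})

# One group per (park, abundance) pair that appears in the data.
grouped = toy_species.groupby(['Park Name', 'Abundance']).count()

# The two grouping columns form the index; reset_index makes them columns again.
flat = grouped.reset_index()
print(flat)
```

After `count()`, every remaining column holds the number of rows in its group, which is why `'Category'` ends up containing counts rather than category names.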

## 2. Nachos 🧀 🌶️

In Python, a Boolean value is either `True` or `False`. We get Boolean values when using comparison operators, among which are `<` (less than), `>` (greater than), and `==` (equal to). For a more complete list, [see here](https://www.tutorialspoint.com/python/comparison_operators_example.htm).

Run the cell below to see an example of a comparison operator in action.

In [12]:
3 > 1 + 1

True

We can even assign the result of a comparison operation to a variable.

In [13]:
result = 10 / 2 == 5
result

True

Arrays are compatible with comparison operators. The comparison is applied to each element separately, and the output is an array of Boolean values.

In [14]:
np.array([1, 5, 7, 8, 3, -1]) > 3

array([False,  True,  True,  True, False, False])
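
Because `True` behaves like 1 and `False` like 0, a Boolean array can be counted or summed directly. The short sketch below reuses the array from the cell above; the variable names are illustrative and not part of the lab.

```python
import numpy as np

values = np.array([1, 5, 7, 8, 3, -1])

# Comparing an array to a number compares each element separately.
is_big = values > 3

# Every True is a nonzero entry, so this counts the elements greater than 3.
how_many_big = np.count_nonzero(is_big)

# Summing gives the same count, since True acts like 1 and False like 0.
same_count = is_big.sum()

print(how_many_big, same_count)
```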

Waiting on the dining table just for you is a hot bowl of nachos! Let's say that whenever you take a nacho, it will have cheese, salsa, both, or neither (just a plain tortilla chip).

<img src='images/nacho.png' width=300>

Using the function call `np.random.choice(array_name)`, let's simulate taking nachos from the bowl at random. Start by running the cell below several times, and observe how the results change.

In [15]:
nachos = np.array(['cheese', 'salsa', 'both', 'neither'])
np.random.choice(nachos)

'cheese'

Assume we took ten nachos at random, and stored the results in an array called `ten_nachos`.

In [16]:
ten_nachos = np.array(['neither', 'cheese', 'both', 'both', 'cheese', 'salsa', 'both', 'neither', 'cheese', 'both'])

**Question 2.1.** Find the number of nachos with only cheese using code (do not hardcode the answer).  

_**Hint:**_ Our solution involves a comparison operator and the `np.count_nonzero` function.

In [17]:
number_cheese = np.count_nonzero(ten_nachos == 'cheese')
number_cheese

3

In [18]:
grader.check("q2_1")

**Conditional Statements**

A conditional statement is made up of multiple lines of code that allow Python to choose from different alternatives based on whether some condition is true.

Here is a basic example.

```
def sign(x):
    if x > 0:
        return 'Positive'
```

If the input `x` is greater than `0`, the function returns the string `'Positive'`. For any other input, no `return` statement runs, so the function returns `None`.

If we want to test multiple conditions at once, we use the following general format.

```
if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>
```

Only one of the bodies will ever be executed. Each `if` and `elif` (else-if) expression is evaluated and considered in order, starting at the top. As soon as a true value is found (i.e. once a condition is met), the corresponding body is executed, and the rest of the conditional statement is skipped. If none of the `if` or `elif` expressions are true, then the `else` body is executed. For more examples and explanation, refer to [CIT 9.1](https://inferentialthinking.com/chapters/09/1/Conditional_Statements.html?highlight=else).
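
As a small illustration of this format, here is one way the `sign` function could be extended to cover every input. The `'Negative'` and `'Zero'` branches and the second function, `size_label`, are hypothetical additions for this sketch and are not part of the lab.

```python
def sign(x):
    # Conditions are checked from top to bottom; only the first true one runs.
    if x > 0:
        return 'Positive'
    elif x < 0:
        return 'Negative'
    else:
        # Reached only when neither condition above was true.
        return 'Zero'


def size_label(x):
    if x > 10:
        return 'big'
    elif x > 5:
        # 20 never gets here, even though 20 > 5, because the first body already ran.
        return 'medium'
    else:
        return 'small'


print(sign(3), sign(-2), sign(0))
print(size_label(20), size_label(7), size_label(1))
```

The second function shows why order matters: an input can satisfy several conditions, but only the body of the first one that is true is executed.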

**Question 2.2.** Complete the following conditional statement so that the string `'More please'` is assigned to `say_please` if the number of nachos with only cheese in `ten_nachos` is less than `5`.

In [19]:
if number_cheese<5:
    say_please = 'More please'
    
say_please

'More please'

In [20]:
grader.check("q2_2")

**Question 2.3.** Write a function called `nacho_reaction` that returns a string representing a person's reaction, based on the type of nacho passed in. The reactions should be as shown in the table below.

| Type of nacho    | Reaction |
| ----------- | ----------- |
| `cheese`      | `Cheesy!`      |
| `salsa`  | `Spicy!`        |
| `both`      | `Delicious!`      |
| `neither`  | `Boring.`        |

In [21]:
def nacho_reaction(nacho):
    if nacho == "cheese":
        return "Cheesy!"
    if nacho == "salsa":
        return "Spicy!"
    if nacho == "both":
        return "Delicious!"
    if nacho == "neither":
        return "Boring."

# This is an example call to your function.
spicy_nacho = nacho_reaction('salsa')
spicy_nacho

'Spicy!'

In [22]:
grader.check("q2_3")

Now consider the DataFrame `ten_nachos_reactions` defined below.

In [23]:
ten_nachos_reactions = bpd.DataFrame().assign(Nacho=ten_nachos)
ten_nachos_reactions

Unnamed: 0,Nacho
0,neither
1,cheese
2,both
3,both
4,cheese
5,salsa
6,both
7,neither
8,cheese
9,both


**Question 2.4.** Add a column named `'Reaction'` to the DataFrame `ten_nachos_reactions` that consists of the reaction for each of the nachos in `ten_nachos`. 

_**Hint:**_ Use the `apply` method.

In [24]:
ten_nachos_reactions = ten_nachos_reactions.assign(Reaction=ten_nachos_reactions.get('Nacho').apply(nacho_reaction))
ten_nachos_reactions

Unnamed: 0,Nacho,Reaction
0,neither,Boring.
1,cheese,Cheesy!
2,both,Delicious!
3,both,Delicious!
4,cheese,Cheesy!
5,salsa,Spicy!
6,both,Delicious!
7,neither,Boring.
8,cheese,Cheesy!
9,both,Delicious!


In [25]:
grader.check("q2_4")

**Question 2.5.** Using code, find the number of `'Delicious!'` reactions for the nachos in `ten_nachos_reactions`. Think about how you could find this either by using DataFrame methods or by using `np.count_nonzero`.

In [26]:
num_delicious = np.count_nonzero(ten_nachos_reactions.get('Reaction') == "Delicious!")
num_delicious

4

In [27]:
grader.check("q2_5")
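
The two approaches mentioned in Question 2.5 can be compared side by side. The sketch below uses a made-up DataFrame named `toy` and plain pandas, on the assumption that babypandas supports the same Boolean indexing and `shape` attribute.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Reaction': ['Delicious!', 'Boring.', 'Delicious!', 'Spicy!']})

# A Boolean Series: True where the reaction matches.
is_delicious = toy.get('Reaction') == 'Delicious!'

# Way 1: keep only the matching rows, then count the rows that remain.
count_by_query = toy[is_delicious].shape[0]

# Way 2: count the True values in the Boolean Series directly.
count_by_numpy = np.count_nonzero(is_delicious)

print(count_by_query, count_by_numpy)
```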

**Question 2.6.** Complete the function `both_or_neither` below. The function takes as input any DataFrame of nachos and reactions, with column names `'Nacho'` and `'Reaction'`. The function compares the number of nachos with both cheese and salsa to the number of nachos with neither cheese nor salsa. If there are more nachos with both, the function returns `'These were some yummy nachos!'` and if there are more nachos with neither, the function returns `'These nachos were disappointing.'` If there are an equal number of each, the function returns `'These nachos were hit or miss.'`

In [37]:
def both_or_neither(nacho_df):
    number_both = nacho_df.get('Nacho').loc[nacho_df.get('Nacho') == 'both'].shape[0]
    number_neither = nacho_df.get('Nacho').loc[nacho_df.get('Nacho') == 'neither'].shape[0]
    # Now return the appropriate string describing the nachos overall.
    if number_both > number_neither:
        return 'These were some yummy nachos!'
    elif number_both < number_neither:
        return 'These nachos were disappointing.'
    else:
        return 'These nachos were hit or miss.'


# Below, we create a DataFrame with randomly-generated data and test your function on it.
# Do NOT change anything below this line.
# However, you may want to add a new cell and evaluate both_or_neither(ten_nachos_reactions) to see
# if your function behaves as expected.
np.random.seed(24)
many_nachos = bpd.DataFrame().assign(Nacho=np.random.choice(nachos, 250))
many_nachos = many_nachos.assign(Reaction=many_nachos.get('Nacho').apply(nacho_reaction))
result = both_or_neither(many_nachos)
result

'These nachos were disappointing.'

In [38]:
grader.check("q2_6")

## 3. Hungry Billy 🍗 🍕🍟
After a long day of class, Billy decides to go to Dirty Birds for dinner. Today's menu has Billy's four favorite foods: wings, pizza, fries, and mozzarella sticks. However, each dish has a 25% chance of running out before Billy can get to Dirty Birds.

**Note:** Use Python as your calculator. Your answers should be expressions (like `0.5 ** 2`); don't simplify your answers using an outside calculator. Also, all of your answers should be given as decimals between 0 and 1, not percentages.

**Question 3.1.** What is the probability that Billy will be able to eat wings at Dirty Birds?

In [39]:
wings_prob = 1-0.25
wings_prob

0.75

In [40]:
grader.check("q3_1")

**Question 3.2.** What is the probability that Billy will be able to eat all four of these foods at Dirty Birds?

In [41]:
all_prob = 0.75**4
all_prob

0.31640625

In [42]:
grader.check("q3_2")

**Question 3.3.** What is the probability that Dirty Birds will have run out of at least one of the four foods before Billy can get there?

In [76]:
something_is_out = 1-all_prob
something_is_out

0.68359375

In [77]:
grader.check("q3_3")
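
The answers to Questions 3.2 and 3.3 can be double-checked by listing every possible outcome. This sketch assumes the four dishes run out independently of one another, which the problem statement implies but does not state outright. All variable names are illustrative.

```python
from itertools import product

p_out = 0.25               # chance that a single dish has run out
p_available = 1 - p_out    # chance that a single dish is still available

all_available = 0.0
at_least_one_out = 0.0

# Each outcome is a tuple of four True/False values: is each dish available?
for outcome in product([True, False], repeat=4):
    # Under independence, multiply the probabilities of the four dishes.
    prob = 1.0
    for available in outcome:
        prob *= p_available if available else p_out
    if all(outcome):
        all_available += prob
    else:
        at_least_one_out += prob

print(all_available, at_least_one_out)
```

Exactly one of the 16 outcomes has every dish available, so the other 15 together make up the complement, and the two totals add to 1.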

To make up for their unpredictable food supply, Dirty Birds decides to hold a contest for some free HDH Dining swag. There is a bag with three red marbles, three green marbles, and three blue marbles. Billy has to draw three marbles **without replacement**. In order to win, all three marbles Billy draws must be of different colors.

**Question 3.4.** What is the probability that Billy wins the contest?

_**Hint:**_ If you're stuck, start by determining the probability that the second marble Billy draws is different from the first marble Billy draws.

In [47]:
winning_prob = (9/9)*(6/8)*(3/7)
winning_prob

0.3214285714285714

In [48]:
grader.check("q3_4")
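
One way to check the multiplication in Question 3.4 is to enumerate every ordered way of drawing three of the nine marbles and count the draws with three different colors. This is a verification sketch and not the approach the lab asks for; the names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

bag = ['red'] * 3 + ['green'] * 3 + ['blue'] * 3

# Treat each marble as distinct by using its position in the bag.
# Drawing without replacement gives 9 * 8 * 7 equally likely ordered draws.
ordered_draws = list(permutations(range(len(bag)), 3))

wins = 0
for draw in ordered_draws:
    colors = {bag[i] for i in draw}   # set of distinct colors in this draw
    if len(colors) == 3:
        wins += 1

exact = Fraction(wins, len(ordered_draws))
by_multiplication = Fraction(9, 9) * Fraction(6, 8) * Fraction(3, 7)

print(exact, by_multiplication, float(exact))
```

The factors in the product follow the hint: any first marble works, 6 of the remaining 8 differ from it in color, and 3 of the remaining 7 have the third color.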

## 4. Iteration 🔂
Using a `for` loop, we can perform a task multiple times. This is known as iteration. Here, we'll simulate drawing different suits from a deck of cards. 🃏

In [49]:
suits = np.array(['♣️', '♥️', '♠️', '♦️'])

draws = np.array([])

repetitions = 6

for i in np.arange(repetitions):
    draws = np.append(draws, np.random.choice(suits))

draws

array(['♥️', '♦️', '♠️', '♥️', '♣️', '♦️'], dtype='<U32')

Another use of iteration is to loop through a set of values. For instance, we can print out all of the colors of the rainbow. 🌈

In [50]:
rainbow = np.array(["red", "orange", "yellow", "green", "blue", "indigo", "violet"])

for color in rainbow:
    print(color)

red
orange
yellow
green
blue
indigo
violet


We can see that the indented part of the `for` loop, known as the body, is executed once for each item in `rainbow`. Note that the name `color` is arbitrary; we could replace both instances of `color` in the cell above with any valid variable name and the code would work the same.

We can also use a `for` loop to add to a variable in an iterative fashion. Here, we count the number of even numbers in an array of numbers. Each time we encounter an even number in `num_array`, we increase `even_count` by 1. To check if an individual number is even, we compute its remainder when divided by 2 using the `%` ([modulus](https://www.freecodecamp.org/news/the-python-modulo-operator-what-does-the-symbol-mean-in-python-solved/#:~:text=The%20%25%20symbol%20in%20Python%20is,basic%20syntax%20is%3A%20a%20%25%20b)) operator.

In [51]:
num_array = np.array([1, 3, 4, 7, 21, 23, 28, 28, 30])

even_count = 0

for i in num_array:
    if i % 2 == 0:
        even_count = even_count + 1
        
even_count

4
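
For comparison, the same count can be computed without a loop by applying `%` and `==` to the whole array at once. The sketch below repeats the loop with a different loop variable name, to show that the name is arbitrary, and then checks it against the array version. The names `even_count_loop` and `even_count_array` are illustrative.

```python
import numpy as np

num_array = np.array([1, 3, 4, 7, 21, 23, 28, 28, 30])

# Loop version, using `number` in place of `i`.
even_count_loop = 0
for number in num_array:
    if number % 2 == 0:
        even_count_loop = even_count_loop + 1

# Array version: the remainder and the comparison apply to every element.
even_count_array = np.count_nonzero(num_array % 2 == 0)

print(even_count_loop, even_count_array)
```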

**Question 4.1.** Valentina is playing darts. 🎯 Her dartboard contains ten equal-sized zones with point values from 1 to 10. Write code using `np.random.choice` that simulates her total score after 1000 dart tosses.

In [69]:
possible_point_values = np.arange(1,11)

tosses = 1000

total_score = 0
for i in range(tosses):
    total_score += np.random.choice(possible_point_values)

total_score


5410

In [70]:
grader.check("q4_1")

**Question 4.2.** What is the average point value of a dart thrown by Valentina?

In [71]:
average_score = total_score/tosses
average_score

5.41

In [72]:
grader.check("q4_2")
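
Since all ten zones are equally likely, the long-run average score per toss is the mean of the point values, so a simulated average should land close to it. The sketch below compares the two, drawing all tosses in one call by passing a size to `np.random.choice`. The seed value is arbitrary and only makes the run repeatable.

```python
import numpy as np

possible_point_values = np.arange(1, 11)

# Theoretical average of one toss when every zone is equally likely.
expected_value = possible_point_values.mean()

np.random.seed(0)   # arbitrary seed, chosen only for repeatability
tosses = 1000

# The second argument asks for `tosses` random draws at once.
scores = np.random.choice(possible_point_values, tosses)
simulated_average = scores.sum() / tosses

print(expected_value, simulated_average)
```

The simulated average changes from run to run when no seed is set, but with 1000 tosses it rarely strays far from 5.5.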

**Question 4.3.** In the following cell, we've loaded the text of _The Wonderful Wizard of Oz_ by L. Frank Baum, the book we looked at in Homework 1. We've split the text into individual words, and stored these words in an array. Using a `for` loop, assign `longer_than_four` to the number of words in the novel that are more than 4 letters long.  Look at [CIT 9.2](https://inferentialthinking.com/chapters/09/2/Iteration.html) if you get stuck.

_**Hint:**_ You can find the number of letters in a word with the `len` function.

In [73]:
wizard_string = open('data/the-wonderful-wizard-of-oz.txt', encoding='utf-8').read()
wizard_words = np.array(wizard_string.split())

longer_than_four = 0
for i in wizard_words:
    if len(i)>4:
        longer_than_four+=1
        
longer_than_four

15515

In [74]:
grader.check("q4_3")
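
The counting pattern in Question 4.3 is easier to trace on a single short sentence. The sketch below uses a hypothetical `sentence` variable in place of the full text of the novel.

```python
sentence = "Dorothy lived in the midst of the great Kansas prairies"
words = sentence.split()

long_word_count = 0
for word in words:
    # len counts every character in the string.
    if len(word) > 4:
        long_word_count = long_word_count + 1

print(long_word_count)

# Splitting on whitespace leaves punctuation attached, and len counts it too.
print(len("prairies,"))
```

One consequence of the last line is that a word followed by a comma or period is counted as one character longer than its number of letters.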

## Finish Line 🏁

Congratulations! You are done with Lab 3.

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.
5. Stick around while the Gradescope autograder grades your work. Make sure you see that all tests have passed on Gradescope.
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

In [78]:
# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()

q1_1 results: All test cases passed!

q1_2 results: All test cases passed!

q2_1 results: All test cases passed!

q2_2 results: All test cases passed!

q2_3 results: All test cases passed!

q2_4 results: All test cases passed!

q2_5 results: All test cases passed!

q2_6 results: All test cases passed!

q3_1 results: All test cases passed!

q3_2 results: All test cases passed!

q3_3 results: All test cases passed!

q3_4 results: All test cases passed!

q4_1 results: All test cases passed!

q4_2 results: All test cases passed!

q4_3 results: All test cases passed!