In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# Lab 2 – `pandas` 

## DSC 80, Fall 2022

### Due Date: Monday, October 10th at 11:59 PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying `.py` file will be tested (a la DSC 20),
2. The notebook may be graded (if it contains free response questions or asks you to draw plots).

**Note**: Labs will have public tests and private tests. The public "smoke tests" that you will run below and which appear on Gradescope are generally worth no points. After the due date, we will replace these tests with private tests that will determine your grade. This is different from DSC 10, where labs only had public tests!

**Do not change the function names in the `*.py` file!**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name.
- If you changed something you weren't supposed to, just use git to revert! Ask us if you need help with this, or google around for `git revert`.

**Tips for working in the notebook**:
- The notebooks serve to present the questions and give you a place to present your results for later review.
- The notebooks in *lab assignments* are not graded (only the `.py` file is submitted and graded).
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file. You can write code here, but make sure that all of your real work is in the `.py` file.

**Tips for developing in the `.py` file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional helper functions to solve the lab! 
- Always document your code!

### Importing code from `lab.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because Python caches a module the first time it is imported (in `sys.modules`, with the compiled bytecode stored in the directory `__pycache__`). Subsequent imports of `lab` in the same kernel merely reuse the cached module rather than re-reading `lab.py`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from lab import *

In [3]:
import pandas as pd
import numpy as np
import os
import doctest

## Part 1: `pandas` Basics 👶

In this section, you'll have to implement several functions. The doctests test your functions on an example dataset, which is stored in `data/scores.csv`. You're free to import this `.csv` file as a DataFrame in your notebook and experiment with it. **However,** the functions you write must be general enough to work on other datasets with the same column names but different values.

In addition:
* Do not hard-code any answers.
* Do not use any loops – you will not receive full credit if you do!

### Question 1

#### `data_load`

Write a function called `data_load` that takes in the file path of a dataset to be read as a string and returns the DataFrame that results from following the steps below:
    
a. First, read in only a subset of the columns: `'name'`, `'tries'`, `'highest_score'`, and `'sex'`.

b. Then, drop the `'sex'` column.

c. Rename the `'name'` column to `'firstname'` and the `'tries'` column to `'attempts'`.

d. Turn the `'firstname'` column into the index.
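
The four kinds of steps above (read a subset of columns, drop, rename, set the index) can be chained in `pandas`. The sketch below is only an illustration on a made-up weather table; the CSV text, the column names, and the variable `weather` are all hypothetical and are not part of the lab's data.

```py
import io
import pandas as pd

# Hypothetical CSV contents; in practice you would pass a file path instead.
csv_text = """city,temp,humidity,country
Oslo,3,80,NO
Lima,19,85,PE
Cairo,28,40,EG
"""

weather = (
    pd.read_csv(io.StringIO(csv_text), usecols=['city', 'temp', 'country'])  # read only some columns
    .drop(columns='country')              # drop a column
    .rename(columns={'temp': 'celsius'})  # rename a column
    .set_index('city')                    # turn a column into the index
)
```

Each method returns a new DataFrame, which is why the calls can be chained without intermediate variables.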
    
#### `pass_fail`

Write a function called `pass_fail` that takes a DataFrame returned from `data_load` and adds a column `'pass'` that contains `'Yes'` or `'No'` for each row, based on the following conditions:

* `'No'` if the number of attempts is strictly larger than 1 but the score is less than 60
* `'No'` if the number of attempts is strictly larger than 4 but the score is less than 70
* `'No'` if the number of attempts is strictly larger than 6 but the score is less than 90
* `'No'` if the number of attempts is strictly larger than 8
* `'Yes'` otherwise
 
Your function should return the modified DataFrame with the added column.
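
One loop-free way to build a column from several conditions is to combine Boolean masks and pass the result to `np.where`. The sketch below shows the pattern on a made-up `orders` table with its own (hypothetical) rules and column names; it is not the rule set for this question.

```py
import numpy as np
import pandas as pd

orders = pd.DataFrame({'items': [1, 5, 12, 3], 'total': [20.0, 45.0, 300.0, 150.0]})

# Each mask is a Boolean Series computed for all rows at once.
bulk = orders['items'] > 10
pricey = (orders['items'] > 2) & (orders['total'] >= 100)

# np.where picks 'Yes' where the combined mask is True and 'No' elsewhere.
orders['review'] = np.where(bulk | pricey, 'Yes', 'No')
```

Note the parentheses around each comparison: `&` and `|` bind more tightly than `>` and `>=` in Python.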

In [6]:
# don't change this cell -- it is needed for the tests to work
scores_fp = os.path.join('data', 'scores.csv')
scores = data_load(scores_fp)
passfail = pass_fail(scores.copy())

In [None]:
grader.check("q1")

### Question 2

#### `med_score`

Write a function called `med_score` that takes in a DataFrame that is returned by `pass_fail` and returns the median score amongst students who passed the test.

#### `highest_score_name`
    
Write a function called `highest_score_name` that takes in a DataFrame that is returned by `pass_fail` and returns a tuple in which the first item is the maximum score any student received and the second item is a list of the name(s) of the student(s) with the maximum score (attempts do not count). If just one student received the maximum score, the list you create will have length 1.

As a reminder, please follow these requirements:

* For all questions you need to write code general enough to be applied to another similar dataset. 
* Do not hard-code any answers. 
* Do not use `for` or `while` loops.
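
As a loop-free illustration of collecting every index label tied for a maximum, here is a small sketch on a made-up `laps` table (the data and names are hypothetical):

```py
import pandas as pd

laps = pd.DataFrame({'best_lap': [71.2, 69.8, 71.2, 70.5]},
                    index=['Ana', 'Ben', 'Cy', 'Di'])

largest = laps['best_lap'].max()
# Comparing the column to its maximum keeps every tied row, not just the first one.
tied = laps.index[laps['best_lap'] == largest].tolist()
result = (largest, tied)
print(result)
```

By contrast, `idxmax` returns only the first label at which the maximum occurs, so it would miss ties.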

In [20]:
# don't change this cell -- it is needed for the tests to work
medscore = med_score(passfail.copy())
highest = highest_score_name(passfail)

In [None]:
grader.check("q2")

### Question 3

Write a function called `idx_dup` that does not have any parameters and returns a single integer, answering the question below:

Is it possible for a DataFrame's index to have duplicate values?
1. No, index values must be unique and use non-negative integers only, just like in `numpy` arrays.
2. No, index values must be unique and use integers only.
3. No, index values must be unique but index values are not restricted to integers.
4. Yes, but index values must be non-negative integers only.
5. Yes, but index values must be integers only.
6. Yes, and index values are not restricted to integers.

In [33]:
# don't change this cell -- it is needed for the tests to work
idxdup = idx_dup()

In [None]:
grader.check("q3")

## Part 2: Tricky Pandas 🤔

Sometimes, `pandas` gives you weird outputs that you may not expect. The next set of questions walks you through a few examples that might surprise you. 

### Question 4

The following subparts all require you to define a function and return a number that is the answer to a multiple-choice question. You may need to write code and experiment with DataFrames to arrive at your answers.

#### `trick_me`

`trick_me` should not take any parameters. 
<br>

Inside the function:

* Create a DataFrame `tricky_1` that has three columns labeled `'Name'`, `'Name'`, and `'Age'`. Your DataFrame should have 5 rows; the values are up to you.
* Save this DataFrame in the `.csv` file called `tricky_1.csv` without the index.
* Now create another DataFrame, `tricky_2`, by reading in the file `tricky_1.csv`. What are your observations?

  1. It was not possible to create a DataFrame with the duplicate columns.
  2. `tricky_1` and `tricky_2` have the same column names.
  3. `tricky_1` and `tricky_2` have different column names.
   
Your function should return `1`, `2`, or `3`, answering the above question.

<br>
  
#### `trick_bool`
`trick_bool` should not take any parameters.

To determine the correct answer from the list below, you should follow the steps outlined by experimenting in **the notebook** (or in the Terminal by running `python`). Outside the function:

* Create a DataFrame `bools` that has four columns: `True`, `True`, `False`, `False`. Each column name should be Boolean.
* Your DataFrame should have 4 rows; the values are up to you.
* Predict the shape of the DataFrame that results by running each of the three lines of code below. Pick a corresponding answer from the given list. Your function should return a list with three numbers, one for each line.
* You should be able to answer without running any code, but feel free to run code to check your answer.
* **Your function should not do anything other than return a hardcoded answer.**

```py
bools[True]
bools[[True, True, False, False]]
bools[[True, False]]
```
    
Answer choices:
1. DataFrame: 2 columns, 1 row
2. DataFrame: 2 columns, 2 rows
3. DataFrame: 2 columns, 3 rows
4. DataFrame: 2 columns, 4 rows
5. DataFrame: 3 columns, 1 row
6. DataFrame: 3 columns, 2 rows
7. DataFrame: 3 columns, 3 rows
8. DataFrame: 3 columns, 4 rows
9. DataFrame: 4 columns, 1 row
10. DataFrame: 4 columns, 2 rows
11. DataFrame: 4 columns, 3 rows
12. DataFrame: 4 columns, 4 rows
13. Error

In [40]:
# don't change this cell -- it is needed for the tests to work
trick_ans = trick_bool()

In [None]:
grader.check("q4")

### Question 5

In the notebook, use the line of code given below to create a DataFrame called `nans`. Note that we use `np.NaN` (`numpy`'s representation of "Not a Number") to create missing values.
 
```py
nans = pd.DataFrame([[0, 1, np.NaN], [np.NaN, np.NaN, np.NaN], [1, 2, 3]])
```
Now, suppose you want to make your dataset more readable for people who do not understand `np.NaN` by replacing each `np.NaN` with the string `"MISSING"`. To do that, you've written the following function:

```py
def change(x):
    if x == np.NaN:
        return "MISSING"
    else:
        return x
```

In your notebook, write a line of code that applies the function above to the last column of the `nans` DataFrame. What was the result?
* A: It worked: all `np.NaN`s in the last column were changed to `"MISSING"`.
* B: It did not work.

You should end up answering B. What happened? 🤔 It turns out that you can't use a simple `==` comparison to detect whether a value is `np.NaN`, because `np.NaN == np.NaN` evaluates to `False`. You need another way to check whether a value is `np.NaN`. [Read more about it here](https://stackoverflow.com/questions/41342609/the-difference-between-comparison-to-np-nan-and-isnull).
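
The behavior described above can be checked directly. The sketch below spells the value `np.nan`, the lowercase alias that works in every NumPy version (the `np.NaN` spelling was removed in NumPy 2.0); the Series `col` and the lambda are made up for illustration.

```py
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # False: NaN is not equal to anything, including itself
print(pd.isna(np.nan))    # True
print(pd.isna(3.0))       # False

col = pd.Series([np.nan, np.nan, 3.0])
# The == test never succeeds, so nothing is replaced.
naive = col.apply(lambda x: 'MISSING' if x == np.nan else x)
print(naive.isna().sum())  # 2: both NaNs are still there
```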

<br>

#### `change`

Once you've read the aforementioned article, fix `change` so that it works as intended.

<br>

####  `correct_replacement`
Write a function called `correct_replacement` that takes in a DataFrame like `nans` and uses your updated `change` function to replace all of the `np.NaN`s in the input DataFrame (in all columns) with `"MISSING"`.

You **cannot** use the `fillna` method, though the `apply` method might be useful.

Note that the DataFrame returned here should be a **copy** instead of the original DataFrame.

<br>

####  `missing_ser`

`missing_ser` should not take any parameters.

For a Series called `ser` that has six elements:

```py
ser = pd.Series([np.NaN, 'DSC80', np.NaN, 'Justin Eldridge', 'Justin Long', np.NaN])
```

What would be the result of running the following code?

```py
ser[ser.isna()] = 'MISSING'
```

* Predict the output of running the lines of code above. Pick a corresponding answer from the given options below. Your function should return a number. 
* You should be able to answer without running any code, but feel free to run code to check your answer.
* **Your function should not do anything other than return a hardcoded answer.**


      1. pd.Series([np.NaN, 'MISSING', np.NaN, 'MISSING', 'MISSING', np.NaN])
      2. pd.Series(['MISSING', 'DSC80', 'MISSING', 'Justin Eldridge', 'Justin Long', 'MISSING'])
      3. Error. The code would not run.
      
<br>
        
####  `fill_ser`


Write a function called `fill_ser` that takes in a DataFrame with many `np.NaN` values and replaces each `np.NaN` with the string `'MISSING'`. This modification should be **IN-PLACE**, meaning that the function should not return anything; it should simply modify the DataFrame given as input.

As a reminder, please follow these requirements:

* You need to write code general enough to be applied to a different DataFrame. 
* Do not hard-code any answers. 
* Looping over the columns *is* allowed.
* `apply()` and `fillna()` are not allowed.
* This function should **NOT** return anything since it makes an in-place modification to the input.
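
To see what "in-place" means in practice, compare rebinding a local name with assigning into the DataFrame that was passed in. The sketch below uses a made-up string operation (uppercasing) and hypothetical function names, not the replacement this question asks for.

```py
import pandas as pd

def upper_rebind(df):
    # Rebinds the local name `df` to a brand-new object;
    # the caller's DataFrame is untouched.
    df = pd.DataFrame({col: df[col].str.upper() for col in df.columns})

def upper_in_place(df):
    # Assigning into `df[col]` changes the very object the caller passed in.
    for col in df.columns:
        df[col] = df[col].str.upper()

words = pd.DataFrame({'a': ['x', 'y'], 'b': ['p', 'q']})
upper_rebind(words)
print(words['a'].tolist())   # ['x', 'y']
upper_in_place(words)
print(words['a'].tolist())   # ['X', 'Y']
```

Neither function has a `return` statement, so both return `None`; only the second one leaves a visible effect on `words`.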

In [None]:
grader.check("q5")

## Part 3: Summary Statistics 📊

In this part, you will create two general-purpose functions that make it easy to qualitatively assess the contents of a DataFrame.

### Question 6

Create a function called `population_stats` that takes in a DataFrame `df` and returns a DataFrame indexed by the columns of `df`, with the following columns:
   * `'num_nonnull'` contains the number of non-null entries in each column.
   * `'prop_nonnull'` contains the proportion of entries in each column that are non-null.
   * `'num_distinct'` contains the number of distinct non-null entries in each column.
   * `'prop_distinct'` contains the proportion of non-null entries that are distinct in each column.
       
For example, if `df` had a column with the following elements:
       
```py
[2, 2, 2, np.NaN, 5, 7, 5, 10, 11, np.NaN]
```
- `'num_nonnull'` is 8, and `'prop_nonnull'` is $\frac{8}{10}$ = 0.8.
- There are six distinct entries, `[2, 5, 7, 10, 11, np.NaN]`, but only 5 of them are non-null. So the number of distinct non-null entries, `'num_distinct'`, is 5.
- There are 5 distinct non-null entries, and there are 8 total non-null entries, so `'prop_distinct'` is $\frac{5}{8}$ = 0.625.

***Hint***: you may find the `nunique` Series method useful.
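
The numbers in the worked example above can be reproduced for that single column as follows. This is a sketch for one Series only (the variable names are chosen for illustration, and `np.nan` is the lowercase spelling of `np.NaN`).

```py
import numpy as np
import pandas as pd

col = pd.Series([2, 2, 2, np.nan, 5, 7, 5, 10, 11, np.nan])

num_nonnull = col.count()                   # 8: count() skips nulls
prop_nonnull = num_nonnull / len(col)       # 0.8
num_distinct = col.nunique()                # 5: nunique() drops nulls by default
prop_distinct = num_distinct / num_nonnull  # 0.625
```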

In [66]:
# don't change this cell -- it is needed for the tests to work
pop_data = np.random.choice(range(10), size=(100, 4))
df_pop = pd.DataFrame(pop_data, columns='A B C D'.split())
out_pop = population_stats(df_pop)

In [None]:
grader.check("q6")

### Question 7
    
Write a function called `most_common` that takes in a DataFrame `df` and a number `N` and returns a DataFrame of the `N` most-common values and their counts for each column of `df`. Any column with fewer than `N` distinct values should contain `np.NaN` in those entries.

For example, consider the subset of the `salaries` DataFrame from Lecture 1/2 shown on the left. On the right, the return value of `most_common(salaries, N=5)` is shown.

<table><tr>
    <td><img src="data/imgs/dataframe.png" width="70%"/></td>
    <td><img src="data/imgs/most_common.png" width="70%"/></td>
</tr></table>

***Note:*** you can loop through the *columns* of `df` to construct your output. You should **not** be looping through rows.

***Hint:*** You may find that initializing an empty DataFrame with `N` rows and adding columns to it is useful in your implementation.
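
For a single column, the idea of "top `N` values, padded with `np.NaN`" might look like the sketch below. The Series `letters` is made up, and reindexing is just one possible way to pad.

```py
import pandas as pd

letters = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'])
N = 5

# value_counts() sorts from most to least common; keep at most N entries.
top = letters.value_counts().iloc[:N]

# Reindexing to 0..N-1 pads the missing positions with NaN.
values = pd.Series(top.index).reindex(range(N))
counts = pd.Series(top.to_numpy()).reindex(range(N))
```

Here `letters` has only three distinct values, so the last two entries of both `values` and `counts` are missing.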

In [79]:
# don't change this cell -- it is needed for the tests to work
common_data = np.random.choice(range(10), size=(100, 2))
common_df = pd.DataFrame(common_data, columns='A B'.split())
common_out = most_common(common_df, N=3)

In [None]:
grader.check("q7")

## Part 4: Defective Wet Suits 🏄

### Question 8

In San Diego, students are looking to surf in their free time. There is a pop-up surf store on Library Walk selling wet suits and surfboards to students. Last Saturday, this store sold 250 wet suits to UCSD students. After a surf session, 10 students complained that their wet suits had tears in them, letting the cold ocean water rush into the suit. In response to the student dissatisfaction, the store claims that 98% of their wet suits are produced without any manufacturing defects. You think this seems unlikely and decide to investigate.

First, select a significance level for your investigation. You don't need to turn this in anywhere. Then, complete the following three functions.

#### `null_hyp`

Write a function called `null_hyp` that has no parameters and returns your answer to the following question **as a list**.

What are reasonable choices for the **null hypothesis** for your investigation? Select all that apply:
1. The store sells wet suits that are ~2% defective.
2. The store sells wet suits that are 98% non-defective.
3. The store sells wet suits that are less than 98% non-defective.
4. The store sells wet suits that are at least 2% defective.


#### `simulate_null`

Write a function called `simulate_null` that simulates a single step of the data generation process under the null hypothesis. The function should return a binary array, i.e. an array of 0s and 1s. It is up to you to decide what the 0s and 1s mean.

#### `estimate_p_val`

Write a function called `estimate_p_val` that takes in a number `N` and returns the estimated p-value of your investigation upon simulating the null hypothesis `N` times.

***Note:*** Plot the null distribution and your observed statistic to check your work. (If you decide to plot, you may have to run `import matplotlib.pyplot as plt`.)
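
As a generic illustration of simulating under a null hypothesis and estimating a p-value, here is a sketch for a different, made-up scenario: a coin claimed to be fair that landed heads 60 times in 100 flips. All numbers and function names are hypothetical.

```py
import numpy as np

rng = np.random.default_rng(80)  # seeded only so the sketch is repeatable

# Hypothetical setup: a coin claimed to be fair, 100 flips, 60 observed heads.
n_flips, claimed_p, observed_heads = 100, 0.5, 60

def simulate_flips():
    # One sample generated under the null: an array of 0s (tails) and 1s (heads).
    return rng.binomial(1, claimed_p, size=n_flips)

def estimate_p_value(n_sims):
    # Simulate n_sims samples at once; each row is one sample of n_flips flips.
    flips = rng.binomial(1, claimed_p, size=(n_sims, n_flips))
    heads = flips.sum(axis=1)  # test statistic for each simulated sample
    # Proportion of simulated statistics at least as extreme as the observed one.
    return (heads >= observed_heads).mean()

p_hat = estimate_p_value(10_000)
```

The direction of the comparison (`>=` here) depends on which values of the statistic count as evidence against the null in your own setup.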

In [None]:
grader.check("q8")

## Part 5: Superheroes 🦸

The questions below analyze a dataset of superheroes found in the `data` directory. One of the datasets lists the attributes of each superhero, while the other is a *Boolean* DataFrame describing which superheroes have which superpowers. Note, the datasets contain information on both **good** superheroes, as well as **bad** superheroes (AKA villains). 

### Question 9

Let's start working with the `powers` dataset, which you can see in `data/superheroes_powers.csv`. Write a function called `super_hero_powers` that takes in a DataFrame like `powers` and returns a list with the following three entries:

1. The name of the superhero with the greatest number of superpowers.
2. The name of the second most common superpower among superheroes who can fly (the most common being "Flight" itself).
3. The name of the most common superpower among superheroes with only one superpower.

You should **not** be hard-coding your answers in this question; your function should work on any DataFrame similar to `powers`. You should not be using loops in this question. In each case, you can assume the answer is unique.

***Hint:*** You may find the `idxmax` method useful in this problem.
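
To illustrate `idxmax` on Boolean data, the sketch below uses a tiny made-up `skills` table in which the names are stored in the index. That layout is an assumption for the example; check how names are stored in `powers` before reusing the idea.

```py
import pandas as pd

skills = pd.DataFrame(
    {'swim': [True, False, True],
     'climb': [True, True, True],
     'sing': [True, False, False]},
    index=['Kai', 'Lee', 'Mo'],
)

# True counts as 1, so a row-wise sum counts each person's skills.
most_skilled = skills.sum(axis=1).idxmax()

# A column-wise sum over a filtered subset counts how common each skill is.
swimmers = skills[skills['swim']]
top_other_skill = swimmers.drop(columns='swim').sum().idxmax()
```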

In [100]:
# don't change this cell -- it is needed for the tests to work
super_fp = os.path.join('data', 'superheroes_powers.csv')
powers = pd.read_csv(super_fp)
super_out = super_hero_powers(powers)

In [None]:
grader.check("q9")

### Question 10

In the notebook, load in the dataset in `data/superheroes.csv` as a DataFrame and explore it. Call your `population_stats` function from Question 6 on the DataFrame. You should notice that there are very few actually null (`np.NaN`) values, but there are many entries that **should** be null.

Write a function called `clean_heroes` that takes in a DataFrame like the one mentioned above and returns a new DataFrame with all of the missing values replaced with `np.NaN`.

***Note:*** Most of the work in this question is identifying how the missing values are stored in the DataFrame. The implementation of the function should only take one line.
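
Once the placeholder values are known, swapping them for `np.NaN` can be done with `replace`. The sketch below uses a made-up `pets` table with hypothetical placeholders (`'N/A'` and `-1.0`); the placeholders in the superheroes data may well be different.

```py
import numpy as np
import pandas as pd

pets = pd.DataFrame({
    'name': ['Rex', 'Tom', 'Ivy'],
    'breed': ['N/A', 'Tabby', 'N/A'],
    'weight': [30.0, -1.0, 4.5],
})

# replace() returns a new DataFrame; `pets` itself is left unchanged.
cleaned = pets.replace(['N/A', -1.0], np.nan)
```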

In [116]:
# don't change this cell -- it is needed for the tests to work
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
clean_out = clean_heroes(heroes)

In [None]:
grader.check("q10")

Below, we have displayed the first 10 rows of the cleaned DataFrame.

In [124]:
clean_out.head(10)

### Question 11

Using the **cleaned** superhero data, we will now generate some insights. We are curious about the following questions. The `super_hero_stats` function should return a list of length 6 that contains your answers to the questions below. **Your answers should be hard-coded in the function.**

0. Which publisher has a greater proportion of "bad" characters – `'Marvel Comics'` or `'DC Comics'`?
1. Out of publishers who have strictly more than 5 characters, which publisher's characters are mostly non-Human? For example, around 21% of Marvel's characters are non-Human; is there a publisher which has a greater proportion? There are in fact *two* publishers which both have the maximum proportion; return the publisher whose name is first alphabetically. Consider any string other than `'Human'` to be "Non-Human". For example, a race of `'Human / Radiation'` is Non-Human. (*Hint*: The `.isin` Series method may be useful).
2. There is only one character that is **both** greater than one standard deviation above the mean in height and at least one standard deviation below the mean in weight. What is their name?
3. Who is taller on average – `'good'` characters or `'bad'` characters?
4. What is the name of the tallest `'Mutant'` with `'No Hair'`?
5. Which `Publisher` that isn't `Marvel` or `DC` has the most characters?

***Note:*** Although you'll be writing code to find the answers, you should not include your code in your `.py` file. Just return a hard-coded list with your answers to the 6 questions.

***Note:*** For part 5, you may choose whether you would like to include or exclude null values for the total number of `'Marvel Comics'` characters. 

In [126]:
# don't change this cell -- it is needed for the tests to work
stats_out = super_hero_stats()

In [None]:
grader.check("q11")

### Question 12 

Create a function called `bhbe_col` that takes in a DataFrame like `heroes` and returns a Boolean Series that contains `True` for characters that have **both** blond hair and blue eyes, and `False` for all other characters. 

***Note***: If a character's hair color contains the word `'blond'`, uppercase or lowercase, we consider their hair to be blond for the purposes of this question. Similarly, if a character's eye color contains the word `'blue'`, uppercase or lowercase, we consider their eye color to be blue for the purposes of this question.
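
Case-insensitive substring matching of this kind is available through `Series.str.contains`. The sketch below applies it to a made-up `cars` table; the column names and search words are hypothetical.

```py
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    'paint': ['Dark Red', 'blue', 'red / black', np.nan],
    'trim': ['Leather', 'cloth', 'LEATHER', 'leather'],
})

# case=False ignores capitalization; na=False treats missing values as non-matches.
is_red = cars['paint'].str.contains('red', case=False, na=False)
is_leather = cars['trim'].str.contains('leather', case=False, na=False)

# & combines the two Boolean Series element by element.
both = is_red & is_leather
```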

In [141]:
# don't change this cell -- it is needed for the tests to work
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
bhbe_out = bhbe_col(heroes)

In [None]:
grader.check("q12")

### Question 13

Now, you'd like to answer the question 
> Are blond-haired, blue-eyed characters disproportionately "good"?

To do this, you'd like to test the null hypothesis:
> The proportion of "good" characters among blond-haired, blue-eyed characters is equal to the proportion of "good" characters in the overall population.

Fix a significance level of 1%.

Before proceeding, think about what test statistic to use in this hypothesis test. Once you've done that, complete the implementations of the following functions.

#### `observed_stat`
`observed_stat` takes in the DataFrame `heroes` and returns the observed test statistic.

#### `simulate_bhbe_null` 
`simulate_bhbe_null` takes in a positive integer `n` and returns an array of length `n`, where each element is a simulated test statistic according to the null hypothesis. You should hard-code the simulation parameter within your function; do not read in any data. (The simulation parameter is a probability. You can round it to two decimal places.)

***Hint:*** You can access columns of a multidimensional array the same way you access columns of a DataFrame using `iloc`.
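
The hint above refers to slicing such as `arr[:, 0]`. The sketch below shows it on a 2D array of simulated counts; the sample size (50), the probabilities, and the number of repetitions are made up for illustration.

```py
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Each row is one simulated sample of 50 draws split into two categories.
sims = rng.multinomial(50, [0.3, 0.7], size=4)   # shape (4, 2)

first_col = sims[:, 0]                               # all rows, column 0 of the array
same_col = pd.DataFrame(sims).iloc[:, 0].to_numpy()  # the DataFrame equivalent

proportions = first_col / 50   # one simulated proportion per row
```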

#### `calc_pval` 
`calc_pval` takes in no parameters and returns a list where:
* The first element is the p-value for the hypothesis test (using 100,000 simulations). Please run the code yourself **in your notebook** and hard-code this answer **in your `.py` file**, as actually running the 100,000-simulation hypothesis test will time out on Gradescope.
* The second element is `'Reject'` if you reject the null hypothesis and `'Fail to reject'` if you fail to reject the null hypothesis, at the 1% significance level.

In [147]:
# don't change this cell -- it is needed for the tests to work
obs_stat_out = observed_stat(heroes)

simulate_bhbe_out = simulate_bhbe_null(10)

pval_out = calc_pval()

In [None]:
grader.check("q13")

## Congratulations! You're done! 🏁

Submit your `.py` file to Gradescope. Note that you only need to submit the `.py` file; this notebook should not be uploaded.

Before submitting, you should ensure that all of your work is in the `.py` file. You can do this by running the doctests below, which will verify that your work passes the public tests **and** that your work is in the `.py` file. Run the cell below; you should see no output.

In [163]:
!python -m doctest lab.py

In addition, `grader.check_all()` will verify that your work passes the public tests. Ultimately, the Gradescope autograder is also going to run `grader.check_all()`, so you should ensure these pass as well (which they should if the doctests above passed).

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()