In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# Lab 3 – Hypothesis Testing and DataFrame Manipulation

## DSC 80, Spring 2022

### Due Date: Monday, October 17th at 11:59 PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying `.py` file will be tested (a la DSC 20),
2. The notebook may be graded (if it contains free response questions or asks you to draw plots).

**Note**: Labs will have public tests and private tests. The public "smoke tests" that you will run below and which appear on Gradescope are generally worth no points. After the due date, we will replace these tests with private tests that will determine your grade. This is different from DSC 10, where labs only had public tests!

**Do not change the function names in the `*.py` file!**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name.
- If you changed something you weren't supposed to, just use git to revert! Ask us if you need help with this, or google around for `git revert`.

**Tips for working in the notebook**:
- The notebooks serve to present the questions and give you a place to present your results for later review.
- The notebooks in *lab assignments* are not graded (only the `.py` file is submitted and graded).
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file. You can write code here, but make sure that all of your real work is in the `.py` file.

**Tips for developing in the `.py` file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional helper functions to solve the lab! 
- Always document your code!

### Importing code from `lab.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because Python caches each imported module (in `sys.modules`) for the lifetime of the kernel. Subsequent imports of `lab` merely reuse the already-loaded module, so edits to `lab.py` would otherwise go unnoticed until the kernel is restarted.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from lab import *

In [3]:
import os
import io
import pandas as pd
import numpy as np

## Part 1: Hypothesis Testing

In this section we'll develop an intuition for the terms and structure of hypothesis testing – it's nothing to be afraid of!

The first step is always to define what you're looking at, state your hypotheses, and set a significance level. Once you've done that, you can compute your test statistic and find the corresponding p-value.

If these terms are intimidating, look at the [Lecture 4](https://github.com/dsc-courses/dsc80-2022-fa/blob/main/lectures/04-hypothesis_testing/notebook/lecture.ipynb) notebook and the readings, and don't forget to think about the real-world meaning of these terms! The following example describes a real-world scenario, so you can think about it through a familiar lens.

### Question 1 – Tires 🚗

A tire manufacturer, TritonTire, claims that their tires are so good, they will bring a Toyota Highlander from 60 mph to a complete stop in under 106 feet, 97% of the time.

Now, you own a Toyota Highlander equipped with TritonTire tires, and you decide to test this claim. You take your car to an empty Vons parking lot, speed up to exactly 60 mph, hit the brakes, and measure the stopping distance. As illegal as it is, you repeat this process 50 times and find that **you stopped in under 106 feet only 47 of the 50 times**.

Livid, you call TritonTire and say that their claim is false. They say, no, that you were just unlucky: your experiment is consistent with their claim. But they didn't realize that they are dealing with a *data scientist* 🧑‍🔬.

To settle the matter, you decide to unleash the power of the hypothesis test. The following three subparts ask you to answer a total of four select-all multiple choice questions.

#### Question 1.1

You will set up a hypothesis test in order to test your suspicion that the tires are actually worse than claimed. Which of the following are valid null and alternative hypotheses for this hypothesis test?

1. The tires will stop your car in under 106 feet exactly 97% of the time.
2. The tires will stop your car in under 106 feet less than 97% of the time.
3. The tires will stop your car in under 106 feet greater than 97% of the time.
4. The tires will stop your car in more than 106 feet exactly 3% of the time.
5. The tires will stop your car in more than 106 feet less than 3% of the time.
6. The tires will stop your car in more than 106 feet greater than 3% of the time.

Create a function called `car_null_hypoth` which takes zero arguments and returns a list of integers, corresponding to the valid null hypotheses above.
Also create a function called `car_alt_hypoth` which takes zero arguments and returns a list of integers, corresponding to the valid alternative hypotheses above.

<br>

#### Question 1.2

Which of the following are valid test statistics for our question?

1. The number of times the car stopped in under 106 feet in 50 attempts.
2. The average number of feet the car took to come to a complete stop in 50 attempts.
3. The number of attempts it took before the car stopped in under 95 feet.
4. The proportion of attempts in which the car successfully stopped in under 106 feet.

Create a function called `car_test_stat` which takes zero arguments and returns a list of integers, corresponding to the valid test statistics above.

<br>

#### Question 1.3

The p-value is the probability, under the assumption the null hypothesis is true, of observing a test statistic **equal to our observed statistic, or more extreme in the direction of the alternative hypothesis**.
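
To make this definition concrete, here is a minimal sketch on a different, made-up scenario (a coin flipped 20 times, not the tire experiment). The function name `binomial_tail_p_value`, its parameters, and the coin numbers are all assumptions chosen for illustration; the sketch computes the p-value exactly from the binomial distribution rather than by simulation.

```python
from math import comb

def binomial_tail_p_value(n, p_null, observed, direction="greater"):
    """Probability, if the null is true, of a count at least as extreme as `observed`.

    Hypothetical helper for illustration only; it is not part of lab.py.
    """
    def pmf(k):
        # Binomial probability of exactly k successes in n trials.
        return comb(n, k) * p_null**k * (1 - p_null)**(n - k)

    if direction == "greater":
        ks = range(observed, n + 1)   # observed value and everything larger
    else:
        ks = range(0, observed + 1)   # observed value and everything smaller
    return sum(pmf(k) for k in ks)

# Made-up example: the null says P(heads) = 0.5, the alternative says the coin
# is biased towards heads, and 15 heads were observed in 20 flips.
p_val = binomial_tail_p_value(n=20, p_null=0.5, observed=15, direction="greater")
print(p_val)
```

The `direction` argument is where the alternative hypothesis enters: it decides which side of the observed value counts as "more extreme".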

Why don't we just look at the probability of observing a test statistic equal to our observed statistic? That is, why is the "more extreme in the direction of the alternative hypothesis" part necessary?

1. Because our observed test statistic isn't extreme.
2. Because our null hypothesis isn't suggesting equality.
3. Because our alternative hypothesis isn't suggesting equality.
4. Because the probability of finding our observed test statistic equals the probability of finding something more extreme.
5. Because if we run more and more trials (where a trial is speeding up the car then stopping), the probability of finding *any* observed test statistic gets closer and closer to zero, so if we did this we would always reject the null with more trials even if the null is true.


Create a function `car_p_value` which takes zero arguments and returns the correct reason as an integer (not a list).

In [None]:
grader.check("q1")

## Part 2: Grouping

Last month, the UK 🇬🇧 announced a new ["High Potential Individual" visa](https://www.lexology.com/library/detail.aspx?g=41fa64ec-9272-468c-bdcb-8002745a754f), which allows graduates of universities ranked in the Top 50 globally to move to the UK without a job lined up. This visa has been a subject of much debate, in part due to how much rankings play a role. (Rest assured, UCSD is on the list!)

In this section, you will analyze a dataset of university rankings, collected from [here](https://www.kaggle.com/datasets/mylesoneill/world-university-rankings?datasetId=) (though we have pre-processed and modified the original dataset for the purposes of this question). Our version of the dataset is stored in `data/universities_unified.csv`.

Columns:
* `'world_rank'`: world rank of the institution
* `'institution'`: name of the institution
* `'national_rank'`: rank within the nation, formatted as `'country, rank'`
* `'quality_of_education'`: rank by quality of education
* `'alumni_employment'`: rank by alumni employment
* `'quality_of_faculty'`: rank by quality of faculty
* `'publications'`: rank by publications
* `'influence'`: rank by influence
* `'citations'`: rank by number of citations
* `'broad_impact'`: rank by broad impact
* `'patents'`: rank by number of patents
* `'score'`: overall score of the institution, out of 100
* `'control'`: whether the university is public or private
* `'city'`: city in which the institution is located
* `'state'`: state in which the institution is located

### Question 2 – Rankings 1️⃣

There are (still) a few aspects of the dataset we need to clean before it's ready for analysis.

IMPORTANT: You should NOT use loops in this question.

#### `clean_universities`

Create a function `clean_universities` which takes in the raw rankings DataFrame and returns a cleaned DataFrame, cleaned according to the following information:

- Some `'institution'` names contain `'\n'` characters (e.g. `'University of California\nSan Diego'`). Replace all instances of `'\n'` with `', '` (a comma and a space) in the `'institution'` column.

- Change the data type of the `'broad_impact'` column to `int`.

* Split `'national_rank'` into two columns, `'nation'` and `'national_rank_cleaned'`, where:
    * `'nation'` is the country indicated in the first part of `'national_rank'`. 
        * Note that there are **3** countries that appear under different names for different schools. For all 3 of these countries, you should pick **the name that is longer** and use that name for every occurrence of the country. One of the 3 countries is **`'Czech Republic'`**, which also appears as **`'Czechia'`** – since these refer to the same country and `'Czech Republic'` is longer, all instances of either name should be replaced with `'Czech Republic'`. You need to find the other 2 countries on your own. 
        * As is mentioned below, your function will only be tested on the DataFrame in `data/universities_unified.csv`, so you don't need to worry about country names other than these 3.
    * `'national_rank_cleaned'` is the integer in the latter part of `'national_rank'`. Make sure that the data type of this column is `int`. 
    * Don't include the original `'national_rank'` column in the output DataFrame.
* Create a Boolean column `'is_r1_public'`. This column should contain `True` if a university is public and classified as R1 and `False` otherwise. Treat `np.NaN`s as False. **Note that in the raw DataFrame, a university is classified as R1 if and only if it has non-null values in all of the following columns: `'control'`, `'city'`, and `'state'`.**
    - Read [this page](https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States) to learn more about what it means for a university to be classified as R1.
    
**The only dataset your function will be tested on is `data/universities_unified.csv`; you don't need to worry about other hidden test sets.** In addition, please return a *copy* of the original DataFrame; don't modify the original.
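
If the vectorized string methods are unfamiliar, the following minimal sketch shows the general pattern on a made-up two-row DataFrame. The column names `'name'`, `'label'`, `'place'`, and `'number'` are hypothetical and do not come from `universities_unified.csv`; treat it as an illustration of loop-free string cleaning, not as the solution.

```python
import pandas as pd

# Made-up toy data; these are not rows from universities_unified.csv.
toy = pd.DataFrame({
    'name': ['Alpha\nCollege', 'Beta Institute'],
    'label': ['Atlantis, 1', 'El Dorado, 12'],
})

out = toy.copy()  # work on a copy so that `toy` itself is left unchanged

# Replace every newline in the string column with a comma and a space.
out['name'] = out['name'].str.replace('\n', ', ', regex=False)

# Split 'place, number' into two columns at the comma-space separator.
parts = out['label'].str.split(', ', expand=True)
out['place'] = parts[0]
out['number'] = parts[1].astype(int)

# Drop the original combined column.
out = out.drop(columns='label')
print(out)
```

Every step above operates on a whole column at once, which is what makes an explicit loop unnecessary.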

<br>

Now, we can do some basic exploration.

#### `university_info`

Create a function `university_info` that takes in the **cleaned** DataFrame outputted by `clean_universities` and returns the following values in a list:
* Among states with three or more institutions in the dataset, the `'state'` whose universities have the lowest mean `'score'`.
* The proportion of institutions in the top 100 for which the `'quality_of_faculty'` ranking is also in the top 100.
* The number of states where at least 50% of the institutions are Private (NOT r1_public).
* The lowest ranking `'institution'`, according to `'world_rank'`, that is ranked #1 in its nation (i.e. that has a `'national_rank_cleaned'` of 1).

You can assume there are no ties.

In [25]:
# don't change this cell -- it is needed for the tests to work
fp = os.path.join('data', 'universities_unified.csv')
df = pd.read_csv(fp)
cleaned = clean_universities(df)
info = university_info(cleaned)

In [None]:
grader.check("q2")

### Question 3 – High Standards ™️ 

#### `std_scores_by_nation` 

Create a function `std_scores_by_nation` that takes in a **cleaned** DataFrame, like the one returned by `clean_universities`, and outputs a DataFrame: 
- with the same rows as the input, 
- with three columns: `'institution'`, `'nation'`, and `'score'` (in that order),
- where the `'score'` column is **standardized** by `'nation'` - that is, the `'score'`s for each country are converted to standard units, using the mean and standard deviation of the `'score'`s for that country. If a `'score'` is `np.NaN`, leave it as `np.NaN`.
    - For a review of standard units, see [Computational and Inferential Thinking](https://www.inferentialthinking.com/chapters/15/1/Correlation).
    - ***Hint:*** Use [`groupby` and `transform`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html).
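
As a rough sketch of the `groupby` plus `transform` pattern, the example below standardizes a made-up `'value'` column within each `'group'`. The column names, the helper `to_standard_units`, and the choice of `ddof=0` are assumptions for this illustration only.

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing value.
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 10.0, 14.0, np.nan],
})

def to_standard_units(s):
    # Hypothetical helper: subtract the group mean, divide by the group SD.
    # ddof=0 (n in the denominator) is an assumption made for this sketch.
    return (s - s.mean()) / s.std(ddof=0)

# transform returns a Series aligned with the original rows, so the result
# has the same index as `toy`, and missing values stay missing.
standardized = toy.groupby('group')['value'].transform(to_standard_units)
print(standardized)
```

Unlike an aggregation, `transform` returns one value per input row rather than one value per group, which is why the output keeps the same rows as the input.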

<br>

#### `su_and_spread`

Lastly, create a function `su_and_spread` that returns the answers to the following two questions, as a list.

****Part 1****

Let's compare rankings between two countries – the US 🇺🇸 and Canada 🇨🇦. There are in total $n$ universities in the US and $m$ universities in Canada. Suppose $x_1, x_2, ..., x_n$ are the `'world_rank'`s for US universities in **increasing order**, meaning that $x_1$ is the `'world_rank'` of the "best" US university. Similarly, $y_1, y_2, ..., y_m$ are the `'world_rank'`s for Canadian universities, also in increasing order. 

Suppose we take the aforementioned `'world_rank'`s and sort them together in **increasing order**, e.g. $x_1, x_2, y_1, x_3, ...$. **We define $R$ to be the average of the positions of the $x$ values.**

For example, if there are 3 US universities (so $n=3$) and 2 Canadian universities ($m=2$), and
  
$$x_1 = 1, x_2 = 3, x_3 = 10, \:\:\:\: y_1 = 5, y_2 = 15$$

When we sort the rankings in increasing order, we'd get 1, 3, 5, 10, 15, which correspond to the values $x_1, x_2, y_1, x_3, y_2$. The $x$ values are at positions 1, 2, and 4. Then, $R = \frac{1 + 2 + 4}{3} = \frac{7}{3}$. (Note that this is **not** the average of 1, 3, and 10).
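
The worked example above can be reproduced with a few lines of `numpy`. This is only a sketch: the function name `average_position` is made up for illustration, and it assumes there are no ties between values.

```python
import numpy as np

def average_position(x, y):
    """Average 1-based position of the x values when x and y are sorted together.

    Hypothetical helper for illustration; assumes no tied values.
    """
    combined = np.concatenate([x, y])
    # Remember which entries of `combined` came from x.
    is_x = np.concatenate([np.ones(len(x), dtype=bool), np.zeros(len(y), dtype=bool)])
    order = np.argsort(combined, kind='stable')   # indices that sort `combined`
    positions = np.arange(1, len(combined) + 1)   # positions 1, 2, ..., n + m
    return positions[is_x[order]].mean()

# x = 1, 3, 10 and y = 5, 15, as in the example above.
R = average_position([1, 3, 10], [5, 15])
print(R)  # 7/3, roughly 2.333
```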


**Question:** If we believe that US universities in general rank higher than Canadian universities, should $R$ be
1. larger than $\frac{m + n}{2}$?
2. smaller than $\frac{m + n}{2}$?
3. equal to $\frac{m + n}{2}$?


Store your answer – either 1, 2, or 3 – in the first element of `su_and_spread`'s output list. Note that this is a classical example of a non-parametric hypothesis test called a rank test.

<br>

****Part 2****

Which `'nation'` has the largest variation in `'score'`s before standardization? 

***Note:*** To find the answer to Part 2, you'll need to find the standard deviation of a column. You should use the formula with `n` in the denominator. `numpy`'s `.std()` by default uses that formula, while `pandas`' `.std()` by default uses the formula with `n-1` in the denominator. To force `pandas`' `.std()` to use `n` in the denominator, use the optional argument `ddof=0`.
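
The difference between the two defaults is easy to check on a small made-up Series (the numbers below are illustrative and unrelated to the rankings data):

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sd_numpy = np.std(values.to_numpy())    # numpy default: n in the denominator
sd_pandas_default = values.std()        # pandas default: n - 1 in the denominator
sd_pandas_ddof0 = values.std(ddof=0)    # pandas forced to use n

print(sd_numpy, sd_pandas_default, sd_pandas_ddof0)
```

Here the sum of squared deviations from the mean is 32, so the `n` version gives $\sqrt{32/8} = 2$, while the `n-1` version gives $\sqrt{32/7} \approx 2.14$.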

In [48]:
# do not edit this cell -- it is needed for the tests
fp = os.path.join('data', 'universities_unified.csv')
universities = pd.read_csv(fp)
cleaned = clean_universities(universities)
universities_out = std_scores_by_nation(cleaned)
su_and_spread_out = su_and_spread()

In [None]:
grader.check("q3")

## Part 3: Combining Data

### Question 4 – Making Connections 🤝

A group of students decided to send out a survey to their connections on LinkedIn. Each student asks 1000 of their connections for their first and last name, the company they currently work at, their job title, their email, and the university they attended.

**Your job is to combine all the data contained in the files `survey*.csv` (stored within the `data/responses` folder) into a single DataFrame. The number of files and the number of rows in each file may vary, so don't hardcode your answers!** To do so, implement the following two functions.

#### `read_linkedin_survey`

Create a function `read_linkedin_survey` which takes in a string describing the path to a folder containing `survey*.csv` files and outputs a DataFrame with six columns titled `'first name'`, `'last name'`, `'current company'`, `'job title'`, `'email'`, and `'university'` (in that order) containing the survey information for all files combined. Make sure to reset the index of the combined DataFrame before returning it so that the index is unique. 

***Hints***:

- Take a look at a few of the files in the `responses` folder. You may have to do some data cleaning to combine the DataFrames!

- You can list the files in a directory using `os.listdir`.

***Note***:

- If you are using Windows, do not use "\\\\" to build file paths; build them with `os.path.join` instead so that your code works on every operating system.
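
To illustrate the general approach, the sketch below reads every matching CSV file in a folder and stacks the rows. Everything in it is an assumption made for the example: the helper name `read_folder`, the `part*.csv` file names, the columns, and the particular header mismatch (different capitalization and column order). Inspect the real `survey*.csv` files to see what cleaning they actually need.

```python
import os
import tempfile
import pandas as pd

def read_folder(dirname, prefix='part'):
    """Read every `<prefix>*.csv` file in `dirname` and stack the rows.

    Hypothetical helper for illustration; it is not part of lab.py.
    """
    frames = []
    for fname in sorted(os.listdir(dirname)):
        if fname.startswith(prefix) and fname.endswith('.csv'):
            frame = pd.read_csv(os.path.join(dirname, fname))
            # Normalize headers so 'Name' and 'NAME' land in the same column.
            frame.columns = frame.columns.str.lower()
            frames.append(frame)
    # ignore_index=True gives the combined DataFrame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)

# Build two small made-up files in a temporary folder and combine them.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'Name': ['Ann'], 'City': ['Oslo']}).to_csv(
        os.path.join(tmp, 'part1.csv'), index=False)
    pd.DataFrame({'CITY': ['Lima'], 'NAME': ['Bo']}).to_csv(
        os.path.join(tmp, 'part2.csv'), index=False)
    combined = read_folder(tmp)

print(combined)
```

Because `pd.concat` aligns columns by name, the second file's different column order does not matter once the headers agree.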

<br>

#### `com_stats`

Create a function `com_stats` which takes in a DataFrame returned by `read_linkedin_survey` and returns a list containing, in the following order: 
- The proportion of people who went to a university with `Ohio` in its name who are some kind of `Nurse`
- The number of job titles that **end** in `Engineer`
- The job title that has the longest name (there are no ties)
- The number of managers (a manager is anyone who has the word `'manager'` in their job title, uppercase or lowercase)
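
The `.str` accessor covers most of what these questions need. Here is a small sketch on a made-up Series of job titles (not taken from the survey data) showing a case-insensitive substring match, a suffix match, and finding the longest string:

```python
import pandas as pd

# Made-up job titles for illustration only.
titles = pd.Series(['Project Manager', 'Software Engineer', 'Engineering manager', 'Nurse'])

# Case-insensitive substring match.
has_manager = titles.str.contains('manager', case=False)

# Case-sensitive match on the end of each string.
ends_with_engineer = titles.str.endswith('Engineer')

# Label of the longest string, then the string itself.
longest = titles.loc[titles.str.len().idxmax()]

print(has_manager.sum(), ends_with_engineer.sum(), longest)
```

Summing a Boolean Series counts its `True` values, and taking its mean gives a proportion.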

In [64]:
# do not edit this cell -- it is needed for the tests
dirname = os.path.join('data', 'responses')
q4_out = read_linkedin_survey(dirname)
stats_out = com_stats(q4_out)

In [None]:
grader.check("q4")

### Question 5 – Survey Says... 👨‍👩‍👧‍👦

Professor Billy often sends out extra credit surveys asking students for their favorite animals, movies, and other favorite things. These surveys are stored in the `data/extra-credit-surveys` folder. Each file in that folder corresponds to a different survey question (except for `favorite1.csv`, which contains students' names and IDs).

Here's how extra credit works:
- Each student who has completed at least 50% of the survey questions receives 5 points of extra credit.
- If there is at least one survey question that at least 80% of the class answered (e.g. favorite animal), **everyone** in the class receives 1 point of extra credit. This overall class extra credit only applies twice, so if for example 95% of students answer the favorite color survey question, 91% answer the favorite animal survey question, and 97% answer the favorite movie question, the entire class still receives 2 extra points as a class, not 3.
- Note that this means that the most extra credit any student can earn is 7 points.
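
As a sanity check on the arithmetic, here is one possible reading of these rules applied to a tiny made-up class of five students and four questions. The column names and answers are invented, and the sketch assumes that every column is a survey question and that a missing value means "did not answer"; the real data also contains names and IDs, which you would need to handle separately.

```python
import numpy as np
import pandas as pd

# Made-up responses: rows are students, columns are survey questions.
answers = pd.DataFrame({
    'q_a': ['x', 'y', 'z', 'w', 'v'],
    'q_b': ['x', 'y', 'z', 'w', np.nan],
    'q_c': ['x', np.nan, np.nan, np.nan, np.nan],
    'q_d': ['x', 'y', 'z', 'w', np.nan],
})

answered = answers.notna()

# 5 points for each student who answered at least 50% of the questions.
individual = np.where(answered.mean(axis=1) >= 0.5, 5, 0)

# 1 class-wide point per question answered by at least 80% of students,
# capped at 2 points in total.
class_bonus = min((answered.mean(axis=0) >= 0.8).sum(), 2)

ec = pd.Series(individual + class_bonus, index=answers.index)
print(ec.tolist())
```

In this toy class, three questions clear the 80% threshold but the class bonus is capped at 2, and the last student answered only one of four questions, so that student earns just the class bonus.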

#### `read_student_surveys`

Create a function `read_student_surveys` which takes in a string describing the path to a folder containing `favorite*.csv` files and outputs a DataFrame containing all of the survey data combined, indexed by student ID (a value 1-1000).

<br>

#### `check_credit`

Create a function `check_credit` which takes in a DataFrame returned by `read_student_surveys` and outputs a DataFrame indexed by student ID (a value 1-1000) with two columns:
- `'name'`, containing the name of each student, and
- `'ec'`, containing the number of extra credit points each student earned.

In [81]:
# do not edit this cell -- it is needed for the tests
dirname = os.path.join('data', 'extra-credit-surveys')
q5_out = read_student_surveys(dirname)
check_credit_out = check_credit(q5_out)

In [None]:
grader.check("q5")

### Question 6 – Paw Patrol 🐾

You are analyzing data from a veterinary clinic. The datasets contain several types of information from the clinic, including its customers (pet owners), pets, available procedures, and procedure history. The column names are self-explanatory. These DataFrames are provided to you:
-  `owners` stores the customer information, where every `'OwnerID'` is unique (verify this yourself).
-  `pets` stores the pet information. Each pet belongs to a customer in `owners`.
-  `procedure_detail` contains a catalog of procedures that are offered by the clinic.
-  `procedure_history` has procedure records. Most procedures were given to a pet in `pets`.

<br>

Implement the following three functions, which each ask you to answer a specific question.

#### `most_popular_procedure`

What is the most popular `'ProcedureType'` amongst all pets in the `pets` DataFrame? Create a function `most_popular_procedure` that takes in two DataFrames, `pets` and `procedure_history`, and returns the name of the most popular `'ProcedureType'` as a string.

Note that some pets are registered but haven't had any procedures performed. Also, some pets that have had procedures done are not registered in `pets`.


<br>

#### `pet_name_by_owner`

What is the name of each customer's pet(s)? Create a function `pet_name_by_owner` that takes in two DataFrames, `owners` and `pets`, and returns a Series whose index contains owner first names, and whose values are pet names as **strings**. If an owner has multiple pets, the value corresponding to that owner should instead be a **list of pet names as strings**.

Note that owner first names are not necessarily unique, and so the Series you return will not necessarily have a unique index.

<br>

#### `total_cost_per_city`

Note that the `owners` DataFrame has a `'City'` column, describing the city in which each pet owner and their pets live. How much did each city spend in total on procedures? Create a function `total_cost_per_city` that takes in four DataFrames, `owners`, `pets`, `procedure_history`, and `procedure_detail`, and returns a Series indexed by `'City'` that describes the total amount that each city has spent on pets' procedures.

***Hint:*** At some point, you may have to merge on multiple columns.
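
For reference, merging on multiple columns just means passing a list to `on`. The sketch below uses two made-up DataFrames whose column names (`'kind'`, `'code'`, `'who'`, `'price'`) are hypothetical and are not the columns in the clinic data:

```python
import pandas as pd

# Made-up tables: a row in `prices` is identified by the (kind, code) pair.
visits = pd.DataFrame({
    'kind': ['checkup', 'checkup', 'surgery'],
    'code': [1, 2, 1],
    'who': ['p1', 'p2', 'p1'],
})
prices = pd.DataFrame({
    'kind': ['checkup', 'checkup', 'surgery'],
    'code': [1, 2, 1],
    'price': [10, 25, 300],
})

# Rows match only when BOTH key columns agree.
merged = visits.merge(prices, on=['kind', 'code'], how='left')

# Total price per person after attaching the prices.
total_by_who = merged.groupby('who')['price'].sum()
print(total_by_who)
```

Merging on `'code'` alone would be wrong here, because code 1 refers to a different price depending on `'kind'`.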

In [97]:
# do not edit this cell -- it is needed for the tests
pets_fp = os.path.join('data', 'pets', 'Pets.csv')
procedure_history_fp = os.path.join('data', 'pets', 'ProceduresHistory.csv')
owners_fp = os.path.join('data', 'pets', 'Owners.csv')
procedure_detail_fp = os.path.join('data', 'pets', 'ProceduresDetails.csv')
pets = pd.read_csv(pets_fp)
procedure_history = pd.read_csv(procedure_history_fp)
owners = pd.read_csv(owners_fp)
procedure_detail = pd.read_csv(procedure_detail_fp)

out_01 = most_popular_procedure(pets, procedure_history)
out_02 = pet_name_by_owner(owners, pets)
out_03 = total_cost_per_city(owners, pets, procedure_history, procedure_detail)

In [None]:
grader.check("q6")

## Congratulations! You're done! 🏁

Submit your `.py` file to Gradescope. Note that you only need to submit the `.py` file; this notebook should not be uploaded.

Before submitting, you should ensure that all of your work is in the `.py` file. You can do this by running the doctests below, which will verify that your work passes the public tests **and** that your work is in the `.py` file. Run the cell below; you should see no output.

In [113]:
!python -m doctest lab.py

In addition, `grader.check_all()` will verify that your work passes the public tests. Ultimately, the Gradescope autograder is also going to run `grader.check_all()`, so you should ensure these pass as well (which they should if the doctests above passed).

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()