In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw05.ipynb")

# Homework 5 – Table Fundamentals 🏫

## Data 94, Spring 2021

In this homework assignment, you will exercise your newfound table manipulation skills from Lectures 16 and 18.

This homework is due on **Thursday, March 11th at 11:59PM**. You must submit the assignment to Gradescope. Submission instructions can be found at the bottom of this notebook. See the [syllabus](http://data94.org/syllabus/#late-policy-and-extensions) for our late submission policy.

**Note:** Unlike recent homework assignments, in this assignment most questions depend on all previous work, so it's in your best interest to work through the questions sequentially.

In [None]:
# Run this cell.
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px

## Understanding the data

In this homework, we will ask and answer questions about UC Berkeley's undergraduate admissions numbers for the class that entered in Fall 2020. The data we'll work with in this assignment comes from [this public webpage](https://www.universityofcalifornia.edu/infocenter/admissions-source-school).

Run the cell below to load in our data as a table.

In [None]:
schools = Table.read_table('data/enrollment.csv')

In [None]:
schools

Each row corresponds to a high school. For each high school, we have the following information:
- `'Name'`: The name of the high school. Note, this is not unique – for instance, the top three rows of our table correspond to three different high schools all with the name `'ABRAHAM LINCOLN HIGH SCHOOL'`; one is in Los Angeles, one is in San Francisco, and one is in San Jose.
- `'City'`: The city in which the high school is. Note, only schools within the US have a valid `'City'` listed; international schools have a city of `'nan'`. (`'nan'` means "missing value".) See the code cell below.
- `'Region'`: The county in which the high school is if the high school is in California, or the state in which the high school is if the high school is elsewhere in the US (see `'ADLAI E STEVENSON HIGH SCHOOL'` above). Again, if the high school is not within the US, `'Region'` is `'nan'`.
- `'Applied'`: The number of students who applied to UC Berkeley from that high school for admission in Fall 2020.
- `'Admitted'`: The number of students who were admitted to UC Berkeley from that high school for admission in Fall 2020.
- `'Enrolled'`: The number of students who actually chose to attend UC Berkeley from that high school starting in Fall 2020.

In [None]:
# There's nothing you need to do or change here, this is just
# showing you some of the international high schools in the dataset.
# Notice, they have 'nan' as their city and region.

schools.where('City', 'nan').take(np.arange(5))

**Note:** It's a good idea to have the official [`datascience` documentation](http://data8.org/datascience/tables.html) open while working on the assignment in case you have any questions. The second-to-last slides of Lecture 16 and Lecture 18 will also be quite helpful.
- [Lecture 16](https://docs.google.com/presentation/d/1Oy9PYPbow8OVJBFHuB4yd10ZlYCt6paqtLp6H4EKPxE/edit#slide=id.gbbd6171521_0_0) 
- [Lecture 18](https://docs.google.com/presentation/d/1Eh8WQan8sshT2eDMSFHB3_pdQLOmmNoBl0oL5WnOh6k/edit#slide=id.gbbd6171521_0_0)

You can also easily see the documentation for a function by either:
- typing the name of the function on a new line, followed by a `?`, and running the cell
- typing the name of the function anywhere in a code cell and hitting `Shift + Tab` on your keyboard

Try it out below!

In [None]:
Table.where

We should also note that **throughout this entire assignment**, your answers should be computed using code, not by hard-coding an actual number. What we mean by that is, if the question asks "How many students applied to UC Berkeley from the Bay Area?" you shouldn't write `342`, you should write something like `schools.where(...)`.

Let's get started! 😎

<!-- BEGIN QUESTION -->

## Question 1 – Asking questions

Throughout this homework, you will write code to answer questions that we proposed for you. However, it's helpful to think about what **you** may be interested in exploring before we get started.

In the cell below, write at least three questions you may want to try and answer using this dataset. An example question (that you cannot use) is "What high school sent the most students to UC Berkeley this year?"

<!--
BEGIN QUESTION
name: q1
points: 2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Question 2 – Key Numbers

### Question 2a – How many schools are represented in the dataset?


Below, assign the variable `num_schools` to an integer corresponding to the number of schools represented in the dataset. Note that this is equal to the number of rows of the table `schools`, since each row corresponds to one school.

_Hint: Remember, every table has a `num_rows` attribute that you can access by writing `<table name>.num_rows`._

<!--
BEGIN QUESTION
name: q2a
points: 1
-->

In [None]:
num_schools = ...
num_schools

In [None]:
grader.check("q2a")

### Question 2b – How many students were admitted?

For your convenience, we'll show you the `schools` table again here, though you should get in the habit of making new cells wherever you're working to visualize the table(s) that you're dealing with.

In [None]:
schools

Suppose we're interested in determining the number of students who applied to UC Berkeley. That number is equal to the sum of the `'Applied'` column in our dataset:

In [None]:
# We could also write schools.column('Applied').sum()
np.sum(schools.column('Applied'))

Below, assign the variable `num_admitted` to an integer corresponding to the number of students who were admitted to UC Berkeley in our dataset.

_Hint: Do something similar to the example above._

<!--
BEGIN QUESTION
name: q2b
points: 1
-->

In [None]:
num_admitted = ...
num_admitted

In [None]:
grader.check("q2b")

### Question 2c – What was the overall acceptance rate?

Below, assign the variable `overall_acceptance_rate` to a float corresponding to the proportion of students who applied to UC Berkeley that were admitted.

_Hint: Use `num_admitted` along with the example that came right before it._

<!--
BEGIN QUESTION
name: q2c
points: 1
-->

In [None]:
overall_acceptance_rate = ...
overall_acceptance_rate

In [None]:
grader.check("q2c")

<!-- BEGIN QUESTION -->

### Question 2d – Wait... what?

Something doesn't quite seem right.

In Question 2b, you computed the number of students that UC Berkeley admitted for enrollment in Fall 2020. Scroll back up to Question 2b to look at that number, and then come back to this question.

Strangely, this [news.berkeley.edu](https://news.berkeley.edu/2020/07/16/uc-berkeleys-push-for-more-diversity-shows-in-its-newly-admitted-class/) article from July 2020 states

> Overall, UC Berkeley admitted 14,668 students as freshmen in 2019 and 15,435 for fall 2020. The admit rate remains the same as last year, at 15%.

The number that you computed in Question 2b is much smaller than the 15,435 figure that this article provides. But both are official University of California sources. What's going on here?

In the cell below, write a short answer to the question "**Why is the number of admitted students in our dataset less than the true number of admitted students?**" To find your answer, go to the [UC site where we got this data](https://www.universityofcalifornia.edu/infocenter/admissions-source-school) and look for the fine print. You'll find that only schools with a certain number of applicants and admitted students are represented; **your answer must mention those specific thresholds as well as why you think schools that don't meet the thresholds may have been excluded from the dataset.**

<!--
BEGIN QUESTION
name: q2d
points: 2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Question 3 – Which schools?

Now it's time to answer questions of the form "Which schools \_\_\_\_\_"? In order to proceed, you'll need to make sure you're familiar with selecting/dropping, table sorting, and element-wise array operations.

### Question 3a – Removing columns

In this section, we're not going to worry about the city where each school is – we'll look at cities in the next section. It'll be helpful to keep around the `'Region'` column just so that we can see at a glance if a school is in-state, domestic, or international. (We also need it to tell apart the three `'ABRAHAM LINCOLN HIGH SCHOOL'`s!)

For now, make a new table `schools_stats` that contains all of the columns in `schools` except for `'City'`.

<!--
BEGIN QUESTION
name: q3a
points: 1
-->

In [None]:
schools_stats = ...
schools_stats

In [None]:
grader.check("q3a")

### Question 3b – Which school sent the most students?

The value in the `'Enrolled'` column for each high school is the number of students they sent to UC Berkeley.

Below, assign `feeders` to a table with the same columns as `schools_stats`, but with **only the 14 high schools that sent the most students to UC Berkeley**, sorted in descending order. (If you're wondering, this corresponds to all of the high schools that sent 22 or more students, but this is not something you need to worry about in this question.)

The first five rows of your table should look like this.

| Name                         | Region        |   Applied |   Admitted |   Enrolled |
|-----------------------------:|--------------:|----------:|-----------:|-----------:|
| LOWELL HIGH SCHOOL           | San Francisco |       435 |        106 |         64 |
| IRVINGTON HIGH SCHOOL        | Alameda       |       248 |         63 |         47 |
| DOUGHERTY VALLEY HIGH SCHOOL | Contra Costa  |       430 |         78 |         39 |
| CANYON CREST ACADEMY         | San Diego     |       269 |         66 |         38 |
| PORTOLA HIGH SCHOOL          | Orange        |       175 |         57 |         30 |

_Hint: First, call `.sort` on `schools_stats`. Then, use `.take(np.arange(14))` to get just the first 14 rows._

<!--
BEGIN QUESTION
name: q3b
points: 2
-->

In [None]:
feeders = ...
feeders.show()

In [None]:
grader.check("q3b")

### Question 3c – What was the acceptance rate of each school?

Right now we have the number of students who applied, were admitted, and actually enrolled from each school. We don't have the acceptance rate of students at each school, but we can easily figure that out using some array operations!

Below, assign `schools_stats_acc` to a table with the same five columns as `schools_stats` plus an additional sixth column. This sixth column should have the label `'Acceptance Rate'`, and its values should be the acceptance rates of each school, each as a decimal between 0 (no students were admitted) and 1 (all students were admitted).

There are several steps involved:
- First, create an array containing the acceptance rates for each school. This should be done in one line; remember that each column in a table is an array, and that if you divide two arrays, the division is performed element-wise (as we saw in Homework 4 and Lecture 15).
- Then, use the `with_columns` method to add an `'Acceptance Rate'` column to `schools_stats`, using the array you just created. Store your result in the table `schools_stats_acc` – the `schools_stats` table should not change!
- **Note**: unlike in the previous question, you aren't supposed to sort or take only the first 14 rows.
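
To see what element-wise division looks like on its own, here is a minimal sketch using plain NumPy arrays. The counts are made up for illustration and do not come from `enrollment.csv`, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical counts for three made-up schools (not from enrollment.csv).
toy_applied = np.array([50, 200, 80])
toy_admitted = np.array([10, 30, 20])

# Dividing two arrays of the same length divides them position by position:
# the first element by the first, the second by the second, and so on.
toy_rates = toy_admitted / toy_applied
print(toy_rates)
```

The result is a new array with one rate per school, which is exactly the shape of data that `with_columns` expects for a new column.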

The first few rows of your table should look like this:

| Name                        | Region        |   Applied |   Admitted |   Enrolled |   Acceptance Rate |
|----------------------------:|--------------:|----------:|-----------:|-----------:|------------------:|
| ABRAHAM LINCOLN HIGH SCHOOL | Los Angeles   |        17 |          6 |          3 |          0.352941 |
| ABRAHAM LINCOLN HIGH SCHOOL | San Francisco |       106 |         21 |         14 |          0.198113 |
| ABRAHAM LINCOLN HIGH SCHOOL | Santa Clara   |        48 |         10 |          4 |          0.208333 |
| ACADEMY OF THE CANYONS      | Los Angeles   |        45 |         15 |          6 |          0.333333 |
| ACADEMY-SAN FRAN @ MCATEER  | San Francisco |        19 |          8 |          5 |          0.421053 |


<!--
BEGIN QUESTION
name: q3c
points: 2
-->

In [None]:
acceptance_rates = ...
schools_stats_acc = ...
schools_stats_acc

In [None]:
grader.check("q3c")

### Question 3d – Which schools had the lowest and highest acceptance rates?

Now that we have a table, `schools_stats_acc`, containing the acceptance rate of each school, it's natural to ask which schools had the highest and lowest acceptance rates.

Your job below is to define two **arrays**:
- `top_5_acc`, which contains the **names** of the five schools with the highest acceptance rates, such that the first element of `top_5_acc` has the absolute highest acceptance rate, the second element has the second highest acceptance rate, and so on.
- `bottom_5_acc`, which contains the **names** of the five schools with the lowest acceptance rates, such that the first element of `bottom_5_acc` has the absolute lowest acceptance rate, the second element has the second lowest acceptance rate, and so on.

At some point, you'll need to sort `schools_stats_acc` by acceptance rate. However, how you choose to do that is up to you – you could elect to sort it in both descending and ascending order, or you could just sort it once and be creative with how you use `.take` (which you will need to use regardless).
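
If the "sort once" idea feels abstract, the sketch below shows it on two small NumPy arrays. The labels and scores are invented, and `np.argsort` stands in for table sorting here; with a `datascience` table you would sort the table and pass similar index arrays to `.take`.

```python
import numpy as np

# Hypothetical labels and scores (no ties, so the order is unambiguous).
labels = np.array(['A', 'B', 'C', 'D', 'E', 'F'])
scores = np.array([0.30, 0.10, 0.50, 0.20, 0.60, 0.40])

# argsort gives the positions that would sort scores from lowest to highest.
order = np.argsort(scores)
sorted_labels = labels[order]

n = len(sorted_labels)

# The first two positions hold the two lowest scores, lowest first.
lowest_two = sorted_labels[np.arange(2)]

# Counting down from the last position gives the highest first.
highest_two = sorted_labels[np.arange(n - 1, n - 3, -1)]

print(lowest_two, highest_two)
```

The key detail is the direction of the second index array: it counts down from the end so that the absolute highest score comes first.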

<!--
BEGIN QUESTION
name: q3d
points: 2
-->

In [None]:
...

top_5_acc = ...
bottom_5_acc = ...

# Don't change anything below this comment, it's just for visualization
print('Top 5 acceptance rates:')
for school in top_5_acc:
    print(school)

print('----------\nBottom 5 acceptance rates:')
for school in bottom_5_acc:
    print(school)

In [None]:
grader.check("q3d")

### Question 3e – What does acceptance rate have to do with the number of applicants?

You may note that none of the schools with a top 5 or bottom 5 acceptance rate appeared in our `feeders` table. This might mean that the schools with very high and very low acceptance rates don't send very many students to UC Berkeley.

In this question, your job is to investigate this idea further. In the two cells below, we've displayed two tables: one containing `schools_stats_acc` sorted by `'Acceptance Rate'` in descending order, and one sorted in ascending order. (These tables likely appeared as intermediate steps in your answer to the previous question.)

In [None]:
schools_stats_acc.sort('Acceptance Rate', descending = True)

In [None]:
schools_stats_acc.sort('Acceptance Rate')

<!-- BEGIN QUESTION -->

In the cell below, comment on what you notice about the number of applicants, admitted students, and enrolled students for the two categories of schools – there is a clear pattern that you should identify. _(Hint: Which column(s) look roughly the same in the two tables? Which column(s) look different?)_ You should also comment on the schools' regions.

<!--
BEGIN QUESTION
name: q3e
points: 2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Question 4 – Location

In the previous section, we ignored the `'City'` column from `schools`. In this section, we'll bring that information back in. Here, we're going to heavily rely on the `.where` method and the various `are` predicates, so you may want to open the corresponding documentation (see Lecture 18).

Run the cell below to create a new table, `schools_acc`, with the six original columns in `schools` plus `'Acceptance Rate'` from `schools_stats_acc`.

In [None]:
schools_acc = schools.with_columns('Acceptance Rate', schools_stats_acc.column('Acceptance Rate'))
schools_acc

To show you how `.where` might be useful here, we can find all of the schools in the city of San Francisco like so:

In [None]:
schools_acc.where('City', 'San Francisco')

### Question 4a – How many schools were in Los Angeles county?

Los Angeles is both the name of a city and a county, and counties correspond to regions in our dataset (at least for California high schools).

Below, assign `num_schools_lac` to the **number** of schools in our dataset that are from Los Angeles county.

_Hint: This involves using `.where` and `.num_rows`._

<!--
BEGIN QUESTION
name: q4a
points: 1
-->

In [None]:
num_schools_lac = ...
num_schools_lac

In [None]:
grader.check("q4a")

### Question 4b – How many students actually enrolled from schools in Los Angeles county?

Below, assign `num_students_lac` to the number of students who enrolled at UC Berkeley from high schools in Los Angeles county. This involves using `.where` and some of the techniques we used in Question 2.

Note: While our solution is only one line, yours doesn't have to be.

<!--
BEGIN QUESTION
name: q4b
points: 2
-->

In [None]:
num_students_lac = ...
num_students_lac

In [None]:
grader.check("q4b")

### Question 4c – Which schools in Los Angeles county sent the most students?

Below, assign `top_lac_schools` to a **table** with the same columns as `schools_acc`, but with **only the 10 high schools in Los Angeles county that sent the most students to UC Berkeley**, sorted in descending order. The first five rows of your table should look like this:

| Name                          | City             | Region      |   Applied |   Admitted |   Enrolled |   Acceptance Rate |
|------------------------------:|-----------------:|------------:|----------:|-----------:|-----------:|------------------:|
| PALISADES CHARTER HIGH SCHOOL | Pacific Palisade | Los Angeles |       221 |         46 |         26 |          0.208145 |
| ARCADIA HIGH SCHOOL           | Arcadia          | Los Angeles |       249 |         55 |         21 |          0.220884 |
| DIAMOND BAR HIGH SCHOOL       | Diamond Bar      | Los Angeles |       264 |         39 |         19 |          0.147727 |
| GRETCHEN WHITNEY HIGH SCHOOL  | Cerritos         | Los Angeles |        86 |         21 |         17 |          0.244186 |
| SANTA MONICA HIGH SCHOOL      | Santa Monica     | Los Angeles |       195 |         38 |         15 |          0.194872 |

<!--
BEGIN QUESTION
name: q4c
points: 2
-->

In [None]:
top_lac_schools = ...
top_lac_schools

In [None]:
grader.check("q4c")

### Question 4d – Which schools in Alameda county sent more than 20 students?

Below, assign `big_alameda` to a table containing all of the columns of `schools_acc`, but only the rows corresponding to schools in Alameda county that sent more than 20 students to Berkeley. Don't sort.

_Hint: You can use `.where` multiple times if there are multiple conditions you want to be true; that's what you'll need to do here. Also, `big_alameda` should have exactly 7 rows._ 
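
Applying two filters one after the other keeps only the rows that satisfy both conditions. The sketch below illustrates that idea with NumPy boolean masks as a stand-in for chained `.where` calls; the regions and counts are made up.

```python
import numpy as np

# Hypothetical data: four made-up schools.
toy_regions = np.array(['North', 'South', 'North', 'North'])
toy_counts = np.array([25, 40, 20, 31])

# Condition 1: the school is in the 'North' region.
in_north = toy_regions == 'North'

# Condition 2: strictly more than 20, so a count of exactly 20 does not qualify.
more_than_20 = toy_counts > 20

# A row survives only if both conditions are True.
both = in_north & more_than_20
print(toy_counts[both])
```

Notice that the third school is dropped even though it is in the right region, because "more than 20" excludes 20 itself. The `are.above` predicate is strict in the same way.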

<!--
BEGIN QUESTION
name: q4d
points: 2
-->

In [None]:
big_alameda = ...
big_alameda

In [None]:
grader.check("q4d")

### Question 4e – How many students applied from schools in the Bay Area? 

<img src='https://upload.wikimedia.org/wikipedia/commons/b/bc/Bayarea_map.png' width=400>

The Bay Area consists of the nine counties `'San Francisco'`, `'San Mateo'`, `'Santa Clara'`, `'Alameda'`, `'Contra Costa'`, `'Solano'`, `'Napa'`, `'Sonoma'`, and `'Marin'`.

Below, you have two tasks.
1. Assign `bay_schools` to a table with the same columns as `schools_acc`, but only with rows corresponding to schools in the Bay Area. You should do this by first creating an array of the names of the nine Bay Area counties, and then use `.where` with `are.contained_in` to filter just the relevant rows from `schools_acc`. Don't sort.
2. Assign `bay_acc_rate` to the overall acceptance rate of students from the Bay Area. **This requires a new calculation; you can't just look at the `'Acceptance Rate'` column in your table.** _Hint: How did we calculate the overall acceptance rate in Question 2?_
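
Why can't you just average a column of per-school rates? The sketch below, on made-up numbers, shows that the overall (pooled) rate and the average of the individual rates are generally different. It uses `np.isin` as a NumPy stand-in for a membership filter like `are.contained_in`; all names and values are hypothetical.

```python
import numpy as np

# Hypothetical data: four made-up schools in three made-up regions.
toy_regions = np.array(['X', 'Y', 'Z', 'X'])
toy_applied = np.array([100, 10, 50, 40])
toy_admitted = np.array([10, 5, 25, 10])

# Keep only the rows whose region appears in the array of wanted regions.
wanted = np.array(['X', 'Y'])
keep = np.isin(toy_regions, wanted)

# Pooled rate: total admitted divided by total applied among the kept rows.
pooled_rate = toy_admitted[keep].sum() / toy_applied[keep].sum()

# Average of per-school rates: every school counts equally, regardless of size.
mean_of_rates = np.mean(toy_admitted[keep] / toy_applied[keep])

print(pooled_rate, mean_of_rates)
```

The small school with a 50% rate pulls the average of rates up, but it barely moves the pooled rate because it has so few applicants.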

<!--
BEGIN QUESTION
name: q4e
points: 3
-->

In [None]:
bay_counties = ...
bay_schools = ...
bay_acc_rate = ...

# Don't change anything below this comment, it's just for visualization
display(bay_schools)
print(bay_acc_rate)

In [None]:
grader.check("q4e")

### Question 4f – What proportion of students from the Bay Area came from the top 5 schools in the Bay Area?

Some high schools in the Bay Area are very well represented at UC Berkeley. Your job in this question is to determine **the proportion (i.e. decimal between 0 and 1) of students who are from the Bay Area that come from one of the top 5 Bay Area high schools, and assign your result to `bay_top_5_prop`.** By "top 5 Bay Area high schools", we mean the 5 Bay Area high schools that sent the most students to UC Berkeley; these schools are `'LOWELL HIGH SCHOOL'`, `'IRVINGTON HIGH SCHOOL'`, `'DOUGHERTY VALLEY HIGH SCHOOL'`, `'MISSION SAN JOSE HIGH SCHOOL'`, and `'FOOTHILL HIGH SCHOOL'` (but you don't need to write these names anywhere).

Here are some steps to help guide your work:
- First, sort `bay_schools` by `'Enrolled'` in descending order, and sum the first five elements of the `'Enrolled'` column. This is the number of students from the top 5 Bay Area high schools.
- Then, sum all elements of the `'Enrolled'` column of `bay_schools`. This is the number of students from the Bay Area overall.
- Divide these two numbers.
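
The same three steps can be sketched on a small made-up array. This version finds the share contributed by the top 2 values rather than the top 5, and the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical enrollment counts for five made-up schools.
toy_enrolled = np.array([3, 12, 7, 20, 8])

# Step 1: sort in descending order, then sum the first two elements.
toy_descending = np.sort(toy_enrolled)[::-1]
top_2_sum = toy_descending.take(np.arange(2)).sum()

# Step 2: sum every element.
overall_sum = toy_enrolled.sum()

# Step 3: divide to get a proportion between 0 and 1.
top_2_share = top_2_sum / overall_sum
print(top_2_share)
```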

Again, many of our queries only use one line but yours can take multiple.

<!--
BEGIN QUESTION
name: q4f
points: 2
-->

In [None]:
# You don't have to use bay_top_5_sum and bay_overall_sum in your calculation of bay_top_5_prop
bay_top_5_sum = ...
bay_overall_sum = ...

# Remember, bay_top_5_prop needs to be a number between 0 and 1
bay_top_5_prop = ...
bay_top_5_prop

In [None]:
grader.check("q4f")

If you completed the above question correctly, you'll learn that about 13\% of the students from the Bay Area come from one of 5 high schools – even though there are 190 Bay Area high schools represented in the dataset (and more that aren't)!

### Fun Demo

Run the cell below to generate an interactive scatter plot displaying the number of students who applied and were admitted from high schools in the Bay Area. Look at what happens when you hover over a point. If you're from the Bay Area, can you find your high school?

In [None]:
px.scatter(data_frame = bay_schools.to_df(), x = 'Applied', y = 'Admitted', 
                                             color = 'Region',
                                             hover_data = {'Name': True},
                                             title = 'Number of students admitted to UC Berkeley from Bay Area high schools')

## Question 5 – What about my school?

In this last section, we will ask questions about specific schools.

**Note:** In this section, you'll need to do something you haven't done thus far in this assignment: use `.item`. `.item` gets the element of the array at the index you give it, just like list indexing.

In [None]:
nums = np.array([100, 24, 32, 76, 89])

In [None]:
nums.item(0)

In [None]:
nums.item(-2)

In [None]:
# This also works, but we prefer .item for arrays.
nums[0]

### Question 5a – How many students are at UC Berkeley from Chadwick International School?

Below, assign `num_chadwick` to the number of students enrolled from `'CHADWICK INTERNATIONAL SCHOOL'`.

Remember, the result of calling `.where` is always a table, even if it only has one row. Calling `.column` on a table with just one row will give you an array with just one element. Your answer in this question must be a number, not an array with one number.
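
The difference between "an array with one number" and "a number" is easy to miss, so here is a minimal sketch with a made-up one-element array (the value is illustrative only).

```python
import numpy as np

# A hypothetical array holding a single value, like a column of a one-row table.
one_value = np.array([5])

# .item(0) pulls out the element at index 0 as a plain Python number.
as_number = one_value.item(0)

print(type(one_value).__name__, type(as_number).__name__)  # prints: ndarray int
```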

<!--
BEGIN QUESTION
name: q5a
points: 1
-->

In [None]:
num_chadwick = ...
num_chadwick

In [None]:
grader.check("q5a")

### Question 5b – How many students are at UC Berkeley from \_\_\_\_\_?

Continuing with our work from the previous question, it's natural to ask the same question for other schools. Below, write a function `sent_from_school` which takes in the name of a school as a string and returns the number of students enrolled at UC Berkeley from that school. Example behavior is shown below.

```py
>>> sent_from_school('CHADWICK INTERNATIONAL SCHOOL')
5

>>> sent_from_school('BERKELEY HIGH SCHOOL')
18
```

Note: This is the only function you've had to write in this homework! But don't worry – you've already done most of the work in Question 5a. All you need to do is generalize your calculation in 5a, which was only for `'CHADWICK INTERNATIONAL SCHOOL'`, to work with any school. Also, don't worry about high schools with repeated names.


<!--
BEGIN QUESTION
name: q5b
points: 2
-->

In [None]:
def sent_from_school(name):
    ...

In [None]:
grader.check("q5b")

### Extra

Your function `sent_from_school` from the previous question doesn't work correctly for high schools whose names are not unique. There are many such high schools: run the cell below to see them all (don't worry about how the code works, we'll cover it next week):

In [None]:
schools_acc.group('Name').sort('count', descending = True).where('count', are.above(1)).show()

Let's look at one in particular – `'SAINT FRANCIS HIGH SCHOOL'`:

In [None]:
schools_acc.where('Name', 'SAINT FRANCIS HIGH SCHOOL')

If we call your function on `'SAINT FRANCIS HIGH SCHOOL'`, we only get the enrollment count for one of the schools (the one that appears first in the dataset, most likely):

In [None]:
sent_from_school('SAINT FRANCIS HIGH SCHOOL')

In this dataset, we'd say that name is not a [primary key](https://en.wikipedia.org/wiki/Primary_key), because knowing the name of a school doesn't necessarily tell you which school it is, as name is not unique. This is not something we will cover formally in this class, but feel free to read the linked article if you're interested.
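
If you're curious what checking for uniqueness looks like in code, here is a small sketch in plain Python. The rows are invented, and whether name and region together are unique in the real dataset is not something this example claims.

```python
from collections import Counter

# Hypothetical (name, region) pairs for three made-up schools.
toy_rows = [
    ('LINCOLN HIGH', 'Region A'),
    ('LINCOLN HIGH', 'Region B'),
    ('WASHINGTON HIGH', 'Region A'),
]

# Count how many times each name appears.
name_counts = Counter(name for name, _ in toy_rows)

# Any name that appears more than once cannot identify a single row.
repeated_names = [name for name, count in name_counts.items() if count > 1]

# In this toy data, the (name, region) pair is unique even though the name is not.
pair_is_unique = len(set(toy_rows)) == len(toy_rows)

print(repeated_names, pair_is_unique)
```

A column (or combination of columns) can act as a key only if no value repeats, which is what the two checks above test.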

# Done!

Congrats! You've finished another Data 94 homework assignment!

To submit your work, follow the steps outlined on Ed.

The point breakdown for this assignment is given in the table below:

| **Category** | Points |
| --- | --- |
| Autograder | 25 |
| Written | 6 |
| **Total** | 31 |

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()