### Using Jupyter Notebook

#### The Workflow
1. **Go into your project folder**: Open a **terminal** (PowerShell on Windows, Terminal.app on macOS) and move (**cd**) into your project directory (*USERNAME*/**workspace/introdsm**).
   1. The lab computers can be inconsistent in the virtual environment and package installation.
   2. Bring your own device (BYOD) is strongly encouraged. 
2. **Activate the virtual environment**:
   1. On macOS, enter "**source .venv/bin/activate**".
   2. On Windows, enter "**.venv\Scripts\activate**".
   3. You should see the ".venv" prefix at your terminal prompt when the virtual environment is activated.
3. **Launch Jupyter Notebook** by entering "jupyter notebook".
4. Work on your Notebook.
5. **Shut down Jupyter Notebook** by selecting the File menu => Shut Down.
6. **Deactivate the virtual environment**: At your terminal, enter "**deactivate**".
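
If you are unsure whether the notebook is actually running inside the virtual environment, a quick check from a code cell can help. The sketch below assumes the environment was created with Python's built-in `venv` module; the helper name `in_virtual_environment` is made up for this illustration.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when this interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
print("Interpreter in use:", sys.executable)
```

If the first line prints `False`, the notebook was probably launched before the environment was activated.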

#### Installing packages: otter-grader & datascience
1. Ideally, you should install packages using a terminal:
   1. In a terminal, activate your virtual environment.
   2. Enter "pip install *PACKAGE_NAME*" to install the package/module/library.
2. To install packages from Jupyter Notebook, for example:
   1. Enter **```%pip install datascience```** (this can take a couple of minutes).
   2. Enter **```%pip install otter-grader```**. 
3. **Comment out** the installation line after installation is complete. 
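
One way to confirm that the installation worked, without importing anything heavy, is to ask Python whether it can locate the modules. This is only a sketch: the helper name `is_installed` is hypothetical, and it assumes the `otter-grader` package is imported under the name `otter`, as in the initialization cell of this notebook.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return find_spec(module_name) is not None


# Report the status of the two packages this homework relies on.
for name in ("datascience", "otter"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```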

# Homework 3: Table Manipulation and Visualization

**Attention:**
- Provide your answer in the **designated space**.
- Do not re-assign variables in the notebooks! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!
- Most "tests" in this homework check the **format and data type** of your answers, not their **correctness**. Passing **100%** of the tests **doesn't mean your final grade will be 100%**.
- Points may be scaled in Canvas. 
- DO NOT directly share your answers with others.
- Discussing problems with others is ENCOURAGED.
- Academic honor is important. **DO NOT cheat!**
- Come to the TA's help sessions and the instructor's office hours for help and clarification.

**Getting help**: During the lab, you are welcome to compare notes with the instructor.

**Reference Materials**:
- The [Python Reference sp25](https://www.data8.org/sp25/reference/) or [Python Reference sp24](https://www.data8.org/sp24/reference/).
- The [Data8 datascience Reference](https://www.data8.org/datascience/tables.html) is very helpful for **syntax and examples**. See, for example, the [Table Functions and Methods](https://www.data8.org/datascience/reference-nb/datascience-reference.html) and [Tables](https://www.data8.org/datascience/tables.html) pages.

**Recommended Reading**: 
* [Visualization](https://inferentialthinking.com/chapters/07/Visualization.html)

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw03.ipynb")

In [None]:
### pay attention to what you are importing; know the tools
# Don't change this cell; just run it. 

import numpy as np
from datascience import *
import warnings
warnings.simplefilter('ignore', FutureWarning)

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## 1. Unemployment

The Great Recession of 2008-2009 was a period of economic decline observed globally, with scale and timing varying from country to country. In the United States, it resulted in a rapid rise in unemployment that affected industries and population groups to different extents.

The Federal Reserve Bank of St. Louis publishes data about jobs in the US.  Below, we've loaded data on unemployment in the United States. There are many ways of defining unemployment, and our dataset includes two notions of the unemployment rate:

1. *Non-Employment Index (or NEI)*: Among people who are able to work and are looking for a full-time job, the percentage who can't find a job.
2. *NEI-PTER*: Among people who are able to work and are looking for a full-time job, the percentage who can't find any job *or* are only working at a part-time job.  The latter group is called "Part-Time for Economic Reasons", so the acronym for this index is NEI-PTER.  (Economists are great at marketing.)

The source of the data is [here](https://fred.stlouisfed.org/categories/33509).

**Question 1.** The data are in a CSV file called `unemployment.csv`.  Load that file into a table called `unemployment`. **(4 Points)**

_Hint:_ After loading in the CSV file, the `unemployment` table should look like this:

<img src="unemployment.png" width="20%"/>


In [None]:
unemployment = ...
unemployment

In [None]:
grader.check("q1_1")

**Question 2.** Sort the data in descending order by NEI, naming the sorted table `by_nei`.  Create another table called `by_nei_pter` that's sorted in descending order by NEI-PTER instead. **(4 Points)**


In [None]:
by_nei = ...
by_nei_pter = ...

In [None]:
grader.check("q1_2")

In [None]:
# Run this cell to check your by_nei table. You do not need to change the code.
by_nei.show(5)

In [None]:
# Run this cell to check your by_nei_pter table. You do not need to change the code.
by_nei_pter.show(5)

**Question 3.** Using `take`, assign `greatest_nei` to a table containing the data for the 11 quarters when NEI was greatest.

`greatest_nei` should be sorted in descending order of `NEI`. Note that each row of `unemployment` represents a quarter. **(4 Points)**

**Note:** `take()` operates on rows. References for `take()` can be found:
1. in the [book (6.2.)](https://introdsm.org/chapters/06/2/Selecting_Rows.html) or
2. [datascience reference](https://www.data8.org/datascience/_autosummary/datascience.tables.Table.take.html#datascience.tables.Table.take).
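
As a rough illustration of the index arrays that `take` works with, the sketch below uses a plain NumPy array with made-up values in place of a table. `np.arange(n)` produces the integers 0 through n-1, and `Table.take` accepts an array of row indices in the same spirit.

```python
import numpy as np

# Hypothetical values, already sorted in descending order.
sorted_values = np.array([9.5, 8.1, 7.7, 6.0, 4.2])

# np.arange(3) is the array of indices 0, 1, 2.
first_three = np.arange(3)

# Indexing with that array keeps the first three entries, in order.
top_three = sorted_values[first_three]

print(first_three)
print(top_three)
```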

In [None]:
### it was not explained in the question but you can 
### observe and see that we are dealing with quarterly data.
### to take() rows you probably want to use np.arange() here.

greatest_nei = ...
greatest_nei

### note that np.arange(12) will give you 12 integers

In [None]:
grader.check("q1_3")

**Question 4.** It's believed that many people became PTER (recall: "Part-Time for Economic Reasons") in the "Great Recession" of 2008-2009.  NEI-PTER is the percentage of people who are unemployed (included in the NEI) plus the percentage of people who are PTER.

Compute an array containing the percentage of people who were PTER in each quarter.  (The first element of the array should correspond to the first row of `unemployment`, and so on.) **(4 Points)**

*Note:* Use the original `unemployment` table for this.


In [None]:
pter = ...
pter

In [None]:
grader.check("q1_4")

**Question 5.** Add `pter` as a column to `unemployment` (name the column `PTER`) and sort the resulting table by that column in descending order.  Call the resulting table `by_pter`.

Try to do this with a single line of code, if you can. **(4 Points)**


In [None]:
by_pter = ...
by_pter

In [None]:
grader.check("q1_5")

**Question 6.** Create a line plot of PTER over time. To do this, create a new table called `pter_over_time` with the same columns as the `unemployment` table with the addition of two new columns: `Year` and `PTER` using the `year` array and the `pter` array, respectively. Then, generate a line plot using one of the table methods you've learned in class.

The order of the columns matters for our correctness tests, so be sure `Year` comes before `PTER`. **(4 Points)**

*Note:* When constructing `pter_over_time`, do not just add the `year` column to the `by_pter` table. Please follow the directions in the question above.
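
The `year` array in the cell below encodes quarters as fractions of a year. A small sketch with a made-up number of rows shows the pattern; `num_quarters` is a hypothetical stand-in for `by_pter.num_rows`.

```python
import numpy as np

num_quarters = 8  # hypothetical; the notebook uses by_pter.num_rows

# Each quarter advances the year by 0.25, starting at 1994.
year = 1994 + np.arange(num_quarters) / 4

print(year)
```

The first entry is 1994, and every fourth entry afterwards lands on a whole year.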


In [None]:
year = 1994 + np.arange(by_pter.num_rows)/4
pter_over_time = ...
...
plt.ylim(0,2); # Do not change this line

In [None]:
grader.check("q1_6")

**Question 7.** Were PTER rates high during the Great Recession (that is to say, were PTER rates particularly high in the years 2008 through 2011)? Assign `highPTER` to `True` if you think PTER rates were high in this period, or `False` if you think they weren't. **(4 Points)**


In [None]:
highPTER = ...

In [None]:
grader.check("q1_7")

## 2. Birth Rates

The following table gives Census-based population estimates for each US state on both July 1, 2015 and July 1, 2016. The last four columns describe the components of the estimated change in population during this time interval. **For all questions below, assume that the word "states" refers to all 52 rows including Puerto Rico and the District of Columbia.**

The data was taken from [here](http://www2.census.gov/programs-surveys/popest/datasets/2010-2016/national/totals/nst-est2016-alldata.csv). (Note: If the file doesn't download when you click the link, copy the link address and paste it into your browser's address bar.) If you want to read more about the different column descriptions, click [here](http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/totals/nst-est2015-alldata.pdf).

The raw data is a bit messy—run the cell below to clean the table and make it easier to work with.

In [None]:
# Don't change this cell; just run it.
pop = Table.read_table('nst-est2016-alldata.csv').where('SUMLEV', 40).select([1, 4, 12, 13, 27, 34, 62, 69])
pop = pop.relabeled('POPESTIMATE2015', '2015').relabeled('POPESTIMATE2016', '2016')
pop = pop.relabeled('BIRTHS2016', 'BIRTHS').relabeled('DEATHS2016', 'DEATHS')
pop = pop.relabeled('NETMIG2016', 'MIGRATION').relabeled('RESIDUAL2016', 'OTHER')
pop = pop.with_columns("REGION", np.array([int(region) if region != "X" else 0 for region in pop.column("REGION")]))
pop.set_format([2, 3, 4, 5, 6, 7], NumberFormatter(decimals=0)).show(5)

**Question 1.** Assign `us_birth_rate` to the total US annual birth rate during this time interval. The annual birth rate for a year-long period is the total number of births in that period as a proportion of the total population size at the start of the time period. **(4 Points)**

_Hint:_ Remember that each row in the `pop` table refers to a state, not the US as a whole.
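
To see why the hint matters, consider a toy example with three made-up regions (these numbers are not from the `pop` table). The rate for the whole group comes from the totals, which is generally not the same as averaging the per-region rates.

```python
# Hypothetical figures for three made-up regions.
start_population = [1_000_000, 2_500_000, 500_000]
births = [12_000, 30_000, 5_000]

# Rate for the group as a whole: total births over total starting population.
overall_rate = sum(births) / sum(start_population)

# Averaging the per-region rates ignores how large each region is.
per_region_rates = [b / p for b, p in zip(births, start_population)]
mean_of_rates = sum(per_region_rates) / len(per_region_rates)

print("Overall rate:", overall_rate)
print("Mean of per-region rates:", mean_of_rates)
```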


In [None]:
us_birth_rate = ...
us_birth_rate

In [None]:
grader.check("q2_1")

**Question 2.** Assign `movers` to the number of states for which the **absolute value** of the **annual rate of migration** was higher than 1%. The annual rate of migration for a year-long period is the net number of migrations (in and out) as a proportion of the population size at the start of the period. The `MIGRATION` column contains estimated annual net migration counts by state. **(4 Points)**

*Hint*: `migration_rates` should be a table and `movers` should be a number.
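
The role of the absolute value can be sketched with plain Python lists and made-up numbers (none of them come from the `pop` table): a large net outflow counts just as much as a large net inflow.

```python
# Hypothetical net migration counts and starting populations for four regions.
net_migration = [15_000, -30_000, 2_000, -500]
start_population = [1_000_000, 2_000_000, 1_000_000, 100_000]

# Net migration as a proportion of the starting population.
rates = [m / p for m, p in zip(net_migration, start_population)]

# Count regions whose rate exceeds 1% in either direction.
count_over_one_percent = sum(1 for r in rates if abs(r) > 0.01)

print(rates)
print(count_over_one_percent)
```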


In [None]:
migration_rates = ...
# migration_rates
movers = ...
movers

In [None]:
grader.check("q2_2")

**Question 3.** Assign `west_births` to the total number of births that occurred in region 4 (the Western US). **(4 Points)**

*Hint:* Make sure you double check the type of the values in the `REGION` column and appropriately filter (i.e. the types must match!).


In [None]:
west_births = ...
west_births

In [None]:
grader.check("q2_3")

**Question 4.** In the next question, you will be creating a visualization to understand the relationship between birth and death rates. The annual death rate for a year-long period is the total number of deaths in that period as a proportion of the population size at the start of the time period.

What visualization is most appropriate to see if there is an association between annual birth and death rates across multiple states in the United States?

1. Line Graph
2. Bar Chart
3. Scatter Plot

Assign `visualization` below to the number corresponding to the correct visualization. **(4 Points)**


In [None]:
visualization = ...

In [None]:
grader.check("q2_4")

<!-- BEGIN QUESTION -->

**Question 5.** In the code cell below, create a visualization that will help us determine if there is an association between birth rate and death rate during this time interval. It may be helpful to create an intermediate table containing the birth and death rates for each state. **(4 Points)**

Things to consider:

- What type of chart will help us illustrate an association between 2 variables?
- How can you manipulate a certain table to help generate your chart?
- Check out the [Recommended Reading](https://inferentialthinking.com/chapters/07/Visualization.html) for this homework!


In [None]:
# In this cell, use birth_rates_2015 and death_rates_2015 to generate your visualization
birth_rates_2015 = pop.column('BIRTHS') / pop.column('2015')
death_rates_2015 = pop.column('DEATHS') / pop.column('2015')
...

<!-- END QUESTION -->

**Question 6.** True or False: There is an association between birth rate and death rate during this time interval. 

Assign `assoc` to `True` or `False` in the cell below. **(4 Points)**


In [None]:
assoc = ...

In [None]:
grader.check("q2_6")

## 3. Uber

**Note:** We recommend reading [Chapter 7.2](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html) of the textbook before starting on Question 3.

Below we load tables containing 200,000 weekday Uber rides in the Manila, Philippines, and Boston, Massachusetts metropolitan areas from the [Uber Movement](https://www.uber.com/newsroom/introducing-uber-movement-2/) project. The `sourceid` and `dstid` columns contain codes corresponding to start and end locations of each ride. The `hod` column contains codes corresponding to the hour of the day the ride took place. The `ride time` column contains the length of the ride in minutes.

In [None]:
### note that using print() we can print more than one piece of data in one cell
### always check your data first.

boston = Table.read_table("boston.csv")
manila = Table.read_table("manila.csv")
print("Boston Table")
boston.show(4)
print("Manila Table")
manila.show(4)

<!-- BEGIN QUESTION -->

**Question 1.** Produce a histogram that visualizes the distribution of all ride times in Boston using the given bins in `equal_bins`. **(4 Points)**

*Hint:* See [Chapter 7.2](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html) if you're stuck on how to specify bins.
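
It can help to inspect the bin edges before plotting. The sketch below only looks at the `equal_bins` array from the next cell; note that `np.arange` excludes its stop value, so the last edge is 115 rather than 120.

```python
import numpy as np

equal_bins = np.arange(0, 120, 5)

print("First edges:", equal_bins[:3])
print("Last edge:", equal_bins[-1])
# An array of k edges defines k - 1 bins.
print("Number of bins:", len(equal_bins) - 1)
```

Any ride longer than the last edge falls outside these bins and is not drawn.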

In [None]:
equal_bins = np.arange(0, 120, 5)
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.** Now, produce a histogram that visualizes the distribution of all ride times in Manila using the given bins. **(4 Points)**


In [None]:
equal_bins = np.arange(0, 120, 5)
...

# Don't delete the following line!
plt.ylim(0, 0.05);

<!-- END QUESTION -->

**Question 3.** Let's take a closer look at the y-axis label. Assign `unit_meaning` to an integer (1, 2, 3) that corresponds to the "unit" in "Percent per unit". **(4 Points)**

1. minute  
2. ride time  
3. second


In [None]:
unit_meaning = ...
unit_meaning

In [None]:
grader.check("q3_3")

**Question 4.** Assign `boston_under_15` and `manila_under_15` to the percentage of rides that are less than 15 minutes in their respective metropolitan areas. Use the height variables provided below in order to compute the percentages. Your solution should only use height variables, numbers, and mathematical operations. You should **not** access the tables `boston` and `manila` in any way. **(4 Points)**

> ***Note:*** Each height variable (e.g., `boston_under_5_bin_height`) represents the height of the bin it describes.
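
As a reminder of how heights relate to percentages on a "percent per unit" scale, here is a small sketch with a made-up histogram of ages (the widths and heights are hypothetical and unrelated to the Uber data). The percentage of values in a bin is the bar's area: height times width.

```python
# Hypothetical histogram of ages: each bin is 10 years wide,
# and heights are measured in percent per year.
bin_width = 10
heights = [2.0, 3.5, 1.5]  # bins [0, 10), [10, 20), [20, 30)

# Area of each bar = height * width = percent of values in that bin.
percent_per_bin = [h * bin_width for h in heights]

# Percent of values below 30 is the total area of the three bars.
percent_under_30 = sum(percent_per_bin)

print(percent_per_bin)
print(percent_under_30)
```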


In [None]:
boston_under_5_bin_height = 1.2
manila_under_5_bin_height = 0.6
boston_5_to_under_10_bin_height = 3.2
manila_5_to_under_10_bin_height = 1.4
boston_10_to_under_15_bin_height = 4.9
manila_10_to_under_15_bin_height = 2.2

boston_under_15 = ...
manila_under_15 = ...

boston_under_15, manila_under_15

In [None]:
grader.check("q3_4")

**Question 5.** Let's take a closer look at the distribution of ride times in Boston. Assign `boston_median_bin` to an integer (1, 2, 3, or 4) that corresponds to the bin that contains the median time. **(4 Points)**

1. 0-8 minutes  
2. 8-14 minutes  
3. 14-20 minutes  
4. 20-40 minutes  

*Hint:* The median of a sorted list has half of the list elements to its left, and half to its right.
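
The hint can be checked on a tiny made-up list with Python's standard `statistics` module (the values below are hypothetical, not taken from the ride data).

```python
import statistics

# Hypothetical values, deliberately unsorted.
values = [7, 3, 12, 9, 5]

sorted_values = sorted(values)
median = statistics.median(values)

# With an odd number of distinct values, the same number of elements
# sit on each side of the median.
left = [v for v in sorted_values if v < median]
right = [v for v in sorted_values if v > median]

print(sorted_values)
print(median, left, right)
```

In a histogram, the same idea translates to area: half of the total area lies on each side of the median.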


In [None]:
boston_median_bin = ...
boston_median_bin

In [None]:
grader.check("q3_5")

<!-- BEGIN QUESTION -->

**Question 6.** Identify one difference between the histograms, in terms of the statistical properties. 
> *Hint*: Without performing any calculations, can you comment on the average or skew of each histogram? **(4 Points)**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 7.** Why is your solution in Question 6 the case? Based on one of the following two readings, why are the distributions for Boston and Manila different? **(4 Points)**

- [Boston reading](https://www.climatestotravel.com/climate/united-states/boston)
- [Manila reading](https://manilafyi.com/why-is-manila-traffic-so-bad/)

*Hint:* Try thinking about external factors of the two cities that may be causing the difference! The readings provide some potential factors -- try to connect them to the ride time data.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 4. Histograms

Consider the following scatter plot: 

![Alt text](scatter.png "Scatter plot showing data points for the variables 'x' and 'y'. The data are symmetric about the x-axis centered at 0 and symmetric about the y-axis centered at 0, but with no data in the [-0.5, 0.5] range on the y-axis.")

The axes of the plot represent values of two variables: $x$ and $y$. 

Suppose we have a table called `t` that has two columns in it:

- `x`: a column containing the x-values of the points in the scatter plot
- `y`: a column containing the y-values of the points in the scatter plot

Below, you are given three histograms: one corresponds to column `x`, one corresponds to column `y`, and one does not correspond to either column.

**Histogram A:**
 
![Alt text](var3.png "Symmetrical, bell-shaped histogram centered around 0")

**Histogram B:**

![Alt text](var1.png "Symmetrical histogram with two peaks at -1 and 1 but no data around 0")

**Histogram C:**

![Alt text](var2.png "Asymmetrical histogram with a peak around -0.5 and a right skew")

**Question 1.** Suppose we run `t.hist('x')`. Which histogram does this code produce? Assign `histogram_column_x` to either 1, 2, or 3. **(5 Points)**

1. Histogram A
2. Histogram B
3. Histogram C


In [None]:
histogram_column_x = ...

In [None]:
grader.check("q4_1")

<!-- BEGIN QUESTION -->

**Question 2.** State at least one reason why you chose the histogram from Question 1. **Make sure to clearly indicate which histogram you selected** (ex: "I chose histogram A because ..."). **(5 Points)**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 3.** Suppose we run `t.hist('y')`. Which histogram does this code produce? Assign `histogram_column_y` to either 1, 2, or 3. **(5 Points)**

1. Histogram A
2. Histogram B
3. Histogram C


In [None]:
histogram_column_y = ...

In [None]:
grader.check("q4_3")

<!-- BEGIN QUESTION -->

**Question 4.** State at least one reason why you chose the histogram from Question 3.  **Make sure to clearly indicate which histogram you selected** (ex: "I chose histogram A because ..."). **(5 Points)**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

`Good Job! You are done with this assignment!`

## Submission

After you have completed the assignment, do the following to submit it:

1. **Save** the notebook file (File ==> Save Notebook) (or the Save icon)
2. Go to the notebook menu, choose Kernel ==> **Restart Kernel and Run All Cells**.
3. Scroll through the notebook to make sure everything ran without errors.
4. **Save** the notebook file (File ==> Save Notebook, or use the Save icon).
5. Use the Jupyter Notebook dashboard to create a **duplicate** of this file and then **rename** it from *assignment*.ipynb_**copy** (e.g., a01.ipynb_copy) to *assignment_**FIRSTNAME_LASTNAME***.ipynb (e.g., a01_TSANGYAO_CHEN.ipynb) to be graded. That way you will be able to keep your original file.  
6. **Upload** your <font color="blue">*assignment_**FIRSTNAME_LASTNAME***.ipynb</font> to Canvas.
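
If you want to double-check the file name before uploading, the naming pattern above can be sketched as a small helper. The function name `submission_name` is hypothetical, and upper-casing the name simply mirrors the `a01_TSANGYAO_CHEN.ipynb` example; the snippet only builds the name and does not copy or rename any file.

```python
from pathlib import Path


def submission_name(notebook: str, first: str, last: str) -> str:
    """Build a name of the form assignment_FIRSTNAME_LASTNAME.ipynb."""
    stem = Path(notebook).stem  # e.g. "hw03" from "hw03.ipynb"
    return f"{stem}_{first.upper()}_{last.upper()}.ipynb"


print(submission_name("hw03.ipynb", "Firstname", "Lastname"))
```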