In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("midterm-project-checkpoint.ipynb")

# Project 1: World Progress

In this project, you'll continue to explore data from [Gapminder.org](http://gapminder.org), a website dedicated to providing a fact-based view of the world and how it has changed. That site includes several data visualizations and presentations, but also publishes the raw data that we will use in this project to recreate and extend some of their most famous visualizations.

The Gapminder website collects data from many sources and compiles them into tables that describe many countries around the world. All of the data they aggregate are published in the [Systema Globalis](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/README.md). Their goal is "to compile all public statistics; Social, Economic and Environmental; into a comparable total dataset." All data sets in this project are copied directly from the Systema Globalis without any changes.

This project is dedicated to [Hans Rosling](https://en.wikipedia.org/wiki/Hans_Rosling) (1948-2017), who championed the use of data to understand and prioritize global development challenges.

Watch this brilliant [TED talk](https://www.youtube.com/watch?v=hVimVzgtD6w) by Hans Rosling to see, through visualizations, how the world has been changing.

## Logistics

**Deadline.** The midterm project is due Friday, May 13th, 2022 at 11:59pm PT. 

**Checkpoint.** For full credit, you must also complete the first 8 questions, pass all public autograder tests, and submit to Gradescope by 11:59pm PT on May 6th, 2022. After you've submitted the checkpoint, you may still change your answers before the project deadline; only your final submission will be graded for correctness. **Please do not forget to copy your checkpoint solutions over to the full project notebook - this way, you can also correct your checkpoint answers and get some credit when you submit your full project notebook.**

**Partners.** You may work with one partner; your partner must be from your assigned lab section. Only one of you is required to submit the project. In Gradescope, the person who submits should also designate their partner so that both of you receive credit.

**Rules.** Don't share your code with anybody but your partner. You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for exams (and life). If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve a similar problem.

**Support.** You are not alone! Come to office hours, post on Piazza, and talk to your classmates. If you want to ask about the details of your solution to a problem, make a private Piazza post and the staff will respond. If you're ever feeling overwhelmed or don't know how to make progress, email your TA or tutor for help. 

**Tests.** The tests that are given are **not comprehensive** and passing the tests for a question **does not** mean that you answered the question correctly. Tests usually only check that your table has the correct column labels. However, more tests will be applied to verify the correctness of your submission in order to assign your final score, so be careful and check your work! You might want to create your own checks along the way to see if your answers make sense. Additionally, before you submit, make sure that none of your cells take a very long time to run (several minutes).

**Free Response Questions.** Make sure that you put the answers to the written questions in the indicated cells we provide. Check to make sure that you have a [Gradescope](http://gradescope.com) account, which is where the scores for the free response questions will be posted. If you do not, make sure to reach out to your TA.

**Advice.** Develop your answers incrementally. To perform a complicated table manipulation, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells. Make sure that you are using distinct and meaningful variable names throughout the notebook. Along that line, **DO NOT** reuse the variable names that we use when we grade your answers. For example, in Question 1 of the Global Poverty section, we ask you to assign an answer to `latest`. Do not reassign the variable name `latest` to anything else in your notebook, otherwise there is the chance that our tests grade against what `latest` was reassigned to.

You **never** have to use just one line in this project or any others. Use intermediate variables and multiple lines as much as you would like!  
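
As a rough illustration of the advice above, here is a minimal sketch in plain Python (not the `datascience` library) that breaks a filtering task into named steps and checks each intermediate result. The rows, the codes `aaa` and `bbb`, and all variable names are hypothetical and are not part of the project data or the graded names.

```python
# Hypothetical toy rows (not project data): (geo code, year, value).
rows = [
    ("aaa", 1959, 10), ("aaa", 1960, 11), ("aaa", 1961, 12),
    ("bbb", 1960, 20), ("bbb", 1961, 21),
]

# Step 1: keep one country, and check the result before moving on.
aaa_rows = [row for row in rows if row[0] == "aaa"]
assert len(aaa_rows) == 3

# Step 2: keep the years of interest (both endpoints included).
aaa_from_1960 = [row for row in aaa_rows if 1960 <= row[1] <= 1961]
assert [row[1] for row in aaa_from_1960] == [1960, 1961]

# Step 3: keep only the columns needed for the final answer.
year_and_value = [(year, value) for _, year, value in aaa_from_1960]
print(year_and_value)
```

Each step gets its own descriptive name, so a failed check points at the step that went wrong.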

To get started, load `datascience`, `numpy`, and `matplotlib.pyplot` (imported as `plots`).

In [105]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')


## Global Population Growth

The global population of humans reached 1 billion around 1800, 3 billion around 1960, and 7 billion around 2011. The potential impact of exponential population growth has concerned scientists, economists, and politicians alike.

The UN Population Division estimates that the world population will likely continue to grow throughout the 21st century, but at a slower rate, perhaps reaching 11 billion by 2100. However, the UN does not rule out scenarios of more extreme growth.

<a href="http://www.pewresearch.org/fact-tank/2015/06/08/scientists-more-worried-than-public-about-worlds-growing-population/ft_15-06-04_popcount/"> 
 <img src="data/pew_population_projection.png"/> 
</a>

In this section, we will examine some of the factors that influence population growth and how they are changing around the world.

The first table we will consider is the total population of each country over time. Run the cell below.

In [106]:
population = Table.read_table('data/population.csv')
population.show(3)

**Note:** The population csv file can also be found [here](https://github.com/open-numbers/ddf--gapminder--systema_globalis/raw/master/ddf--datapoints--population_total--by--geo--time.csv). The data for this project was downloaded in February 2017.

## India

In the `population` table, the `geo` column contains three-letter codes established by the [International Organization for Standardization](https://en.wikipedia.org/wiki/International_Organization_for_Standardization) (ISO) in the [Alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3#Current_codes) standard. We will begin by taking a close look at India. Inspect the standard to find the 3-letter code for India.

**Question 1.** Create a table called `ind_pop` that has two columns labeled `time` and `population_total`. The first column should contain the years from 1960 through 2015 (including both 1960 and 2015) and the second should contain the population of India in each of those years.

<!--
BEGIN QUESTION
name: q1_1
points: 1
manual: false
-->

In [107]:
ind_pop = ...
ind_pop

In [None]:
grader.check("q1_1")

Run the following cell to create a table called `ind_five` that has the population of India every five years. At a glance, it appears that the population of India has been growing quickly indeed!

In [112]:
ind_pop.set_format('population_total', NumberFormatter)

fives = np.arange(1960, 2016, 5) # 1960, 1965, 1970, ...
ind_five = ind_pop.sort('time').where('time', are.contained_in(fives))
ind_five

**Question 2.** Assign `initial` to an array that contains the population at five-year intervals from 1960 to 2010. Then, assign `changed` to an array that contains the population at five-year intervals from 1965 to 2015. You should use the `ind_five` table to create both arrays, first filtering the table to only contain the relevant years.

We have provided the code below, which uses `initial` and `changed` to build a table called `ind_five_growth` that has the rows of `ind_five` through 2010 plus a column called `annual_growth`. Don't worry about the calculation of the growth rates; run the test below to check your solution.

If you are interested in how we came up with the formula for growth rates, consult the [growth rates](https://www.inferentialthinking.com/chapters/03/2/1/growth) section of the textbook.
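
For intuition, the provided expression `(changed/initial)**0.2-1` uses the exponent `0.2` because each interval spans 5 years, and 1/5 = 0.2. The sketch below shows the same calculation on single numbers. The helper name `annual_growth_rate` and the values are hypothetical and are not taken from the `population` table.

```python
def annual_growth_rate(initial, changed, years):
    """Constant yearly rate g such that initial * (1 + g) ** years == changed."""
    return (changed / initial) ** (1 / years) - 1

# Hypothetical values, not taken from the population table.
start_value = 1000.0
end_value = start_value * 1.02 ** 5   # grow by 2% per year for 5 years

rate = annual_growth_rate(start_value, end_value, 5)
print(round(rate, 6))
```

Because `end_value` was built by compounding 2% growth over 5 years, the function recovers a rate of about 2% per year.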

<!--
BEGIN QUESTION
name: q1_2
points: 1
manual: false
-->

In [113]:
initial = ...
changed = ...
ind_1960_through_2010 = ind_five.where('time', are.below_or_equal_to(2010))
ind_five_growth = ind_1960_through_2010.with_column('annual_growth', (changed/initial)**0.2-1)
ind_five_growth.set_format('annual_growth', PercentFormatter)

In [None]:
grader.check("q1_2")

Let's take a look at the population of Bangladesh over the years 1980 to 2015. Try to find out how Bangladesh's population has been changing since 1980. While the population has grown every five years since 1980, the annual growth rate decreased dramatically from 1985 to 2005. Let's look at some other information in order to develop a possible explanation. Run the next cell to load three additional tables of measurements about countries over time.

In [120]:
life_expectancy = Table.read_table('data/life_expectancy.csv')
child_mortality = Table.read_table('data/child_mortality.csv').relabel(2, 'child_mortality_under_5_per_1000_born')
fertility = Table.read_table('data/fertility.csv')

The `life_expectancy` table contains a statistic that is often used to measure how long people live, called *life expectancy at birth*. This number, for a country in a given year, [does not measure how long babies born in that year are expected to live](http://blogs.worldbank.org/opendata/what-does-life-expectancy-birth-really-mean). Instead, it measures how long someone would live, on average, if the *mortality conditions* in that year persisted throughout their lifetime. These "mortality conditions" describe what fraction of people at each age survived the year. So, it is a way of measuring the proportion of people that are staying alive, aggregated over different age groups in the population.
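
To make the idea of fixed "mortality conditions" concrete, here is a simplified sketch. It assumes we know, for each age, the probability of dying during that year of age, and it counts only completed years of life. Real life tables are more detailed, for example by crediting part of the year in which a person dies. The function name and the toy probabilities are hypothetical and are not Gapminder data.

```python
def period_life_expectancy(death_prob_by_age):
    """Expected number of completed years of life under fixed mortality conditions.

    death_prob_by_age[x] is the chance that a person aged x dies before age x + 1.
    """
    alive = 1.0            # fraction of a hypothetical cohort still alive
    expected_years = 0.0
    for death_prob in death_prob_by_age:
        alive *= 1 - death_prob    # fraction that survives this year of age
        expected_years += alive    # each survivor adds one completed year
    return expected_years

# Hypothetical conditions: half die before age 1, the rest die before age 2.
toy_conditions = [0.5, 1.0]
print(period_life_expectancy(toy_conditions))
```

In the toy case, half of the cohort completes zero years and half completes one year, so the average is half a year.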

Run the cells below to see `life_expectancy`, `child_mortality`, and `fertility`. Refer back to these tables as they will be helpful for answering further questions!

In [121]:
life_expectancy

In [122]:
child_mortality

In [123]:
fertility

<!-- BEGIN QUESTION -->

**Question 3.** Perhaps population is growing more slowly because people aren't living as long. Use the `life_expectancy` table to draw a line graph with the years 1980 and later on the horizontal axis that shows how the *life expectancy at birth* has changed in Bangladesh.

<!--
BEGIN QUESTION
name: q1_3
points: 1
manual: true
-->

In [124]:
# Fill in code here
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.** Assuming everything else stays the same, do the trends in life expectancy in the graph above directly explain why the population growth rate decreased from 1985 to 2010 in Bangladesh? Why or why not? 

Hint: What happened in Bangladesh in 1991, and does that event explain the overall change in population growth rate?

<!--
BEGIN QUESTION
name: q1_4
points: 1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



The `fertility` table contains a statistic that is often used to measure how many babies are being born, the *total fertility rate*. This number describes the [number of children a woman would have in her lifetime](https://www.measureevaluation.org/prh/rh_indicators/specific/fertility/total-fertility-rate), on average, if the current rates of birth by age of the mother persisted throughout her childbearing years, assuming she survived through age 49.
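
One common way to compute such a statistic is to add up the current yearly birth rates for women at each age of the childbearing span. The sketch below assumes single-year age groups from 15 through 49. The function name and the rates are hypothetical and are not taken from the `fertility` table.

```python
def total_fertility_rate(births_per_woman_by_age):
    """Sum the yearly birth rates over every age in the childbearing span."""
    return sum(births_per_woman_by_age.values())

# Hypothetical age-specific rates: 0.1 births per woman per year
# at ages 20 through 39, and none at other ages from 15 through 49.
toy_rates = {age: (0.1 if 20 <= age <= 39 else 0.0) for age in range(15, 50)}

tfr = total_fertility_rate(toy_rates)
print(round(tfr, 2))
```

With a rate of 0.1 held for 20 years of age, a woman who lived through the whole span under these rates would have about 2 children on average.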

**Question 5.** Write a function `fertility_over_time` that takes the Alpha-3 code of a `country` and a `start` year. It returns a two-column table with labels `Year` and `Children per woman` that can be used to generate a line chart of the country's fertility rate each year, starting at the `start` year. The plot should include the `start` year and all later years that appear in the `fertility` table. 

Then, in the next cell, call your `fertility_over_time` function on the Alpha-3 code for Bangladesh and the year 1980 in order to plot how Bangladesh's fertility rate has changed since 1980. Note that the function `fertility_over_time` should not return the plot itself. **The expression that draws the line plot is provided for you; please don't change it.**

<!--
BEGIN QUESTION
name: q1_5
points: 1
manual: false
-->

In [125]:
def fertility_over_time(country, start):
    """Create a two-column table that describes a country's total fertility rate each year."""
    country_fertility = ...
    country_fertility_after_start = ...
    cleaned_table = ...
    ...
bangladesh_code = ...
fertility_over_time(bangladesh_code, 1980).plot(0, 1) # You should *not* change this line.

In [None]:
grader.check("q1_5")

<!-- BEGIN QUESTION -->

**Question 6.** Assuming everything else is constant, do the trends in fertility in the graph above help directly explain why the population growth rate decreased from 1985 to 2010 in Bangladesh? Why or why not?

<!--
BEGIN QUESTION
name: q1_6
points: 1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



It has been observed that lower fertility rates are often associated with lower child mortality rates. The link has been attributed to family planning: if parents can expect that their children will all survive into adulthood, then they will choose to have fewer children. We can see if this association is evident in Bangladesh by plotting the relationship between total fertility rate and [child mortality rate per 1000 children](https://en.wikipedia.org/wiki/Child_mortality).

**Question 7.** Using both the `fertility` and `child_mortality` tables, draw a scatter diagram that has Bangladesh's total fertility on the horizontal axis and its child mortality on the vertical axis with one point for each year, starting with 1980.

**The expression that draws the scatter diagram is provided for you; please don't change it.** Instead, create a table called `post_1979_fertility_and_child_mortality` with the appropriate column labels and data in order to generate the chart correctly. Use the label `Children per woman` to describe total fertility and the label `Child deaths per 1000 born` to describe child mortality.

<!--
BEGIN QUESTION
name: q1_7
points: 1
manual: false
-->

In [131]:
bgd_fertility = ...
bgd_child_mortality = ...
fertility_and_child_mortality = ...
post_1979_fertility_and_child_mortality = ...
post_1979_fertility_and_child_mortality.scatter('Children per woman', 'Child deaths per 1000 born') # You should *not* change this line.

In [None]:
grader.check("q1_7")

<!-- BEGIN QUESTION -->

**Question 8.** In one or two sentences, describe the association (if any) that is illustrated by this scatter diagram. Does the diagram show that reduced child mortality causes parents to choose to have fewer children?

<!--
BEGIN QUESTION
name: q1_8
points: 1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Checkpoint (due May 6th by 11:59pm PT)
Congratulations, you have reached the Checkpoint!

To submit:
- **Save the notebook** first (**`Save and Checkpoint`** from the `File` menu).
- Go up to the `Kernel` menu and select `Restart & Clear Output` (make sure the notebook is saved first, because otherwise you will lose all your work!).
- Select this Markdown cell and go to `Cell -> Run All Above`. Carefully look through your notebook and verify that all computations execute correctly. You should see **no errors**; if there are any errors, make sure to correct them before you submit the notebook.
- <span style="color:red">The tests don't usually tell you that your answer is correct.</span> Take a look at the results that you are getting and verify that they match what is being asked and what you would expect to see.
- Go to `File -> Download as -> Notebook` and download the notebook to your own computer. ([Please verify](https://ucsb-ds.github.io/ds1-f20/troubleshooting/#i-downloaded-the-notebook-file-but-it-saves-as-the-ipynbjson-extension-so-whenever-i-upload-it-to-gradescope-it-fails) that it got saved as an .ipynb file.)
- Upload the notebook to [Gradescope](https://www.gradescope.com/). You can drag and drop the file when you are uploading it.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()