# Project 2: Meteorites

## Due: 7/31/2021 11:59 PM PST  (Saturday)

**Note**: Because of tight deadlines for submitting grades during Summer Session, we cannot accept slip days for this assignment!

*Welcome to the final project!* This project is a cumulative look at everything we've done in the quarter, from table operations to statistics. It takes you through the entire data science workflow in three parts. In Part 1, we clean the data set. In Part 2, we do exploratory analysis to find interesting trends. And in Part 3, we perform statistical analyses.

This project will focus on *meteorites*. A *meteorite* is a piece of rock from outer space that survives entry into the Earth's atmosphere and strikes the surface.

<img width=70% src="data/meteorite.jpg">
<center>
<i>
<small>Credit: By User:Captmondo - Own work (photo), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=5752726
</small>
</i>
</center>

In this project, you will work with the meteorite landings [dataset](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh) from The Meteoritical Society, which contains information on all known meteorite landings.

In [1]:
import babypandas as bpd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')
np.set_printoptions(legacy='1.13')

import otter
grader = otter.Notebook('tests')

## Introducing the Dataset

The data is contained in the file `data/meteorite_landings.csv`. Peeking into the data set, here is what we see:

In [2]:
!head ./data/meteorite_landings.csv

This dataset contains 45716 rows and 8 columns. Each row represents one known meteorite. Following is the description for each of the eight columns:

- **id**: ID of the meteorite as an integer.
- **name**: Name of the meteorite as a string.
- **nametype**: an identifier that is either *Valid* or *Relict*. A meteorite is a *Relict* if it mostly consists of material that formed after it impacted the Earth.
- **recclass**: the classification of the meteorite.
- **mass (g)**: the mass of the meteorite in grams as a float.
- **fall**: an identifier that is either *Fell* or *Found*. When a meteorite is marked *Fell*, it was observed during its fall. Otherwise, it was found without a recorded event of falling from the sky. Note that a meteorite may be found long after it originally fell!
- **year**: a string containing the time at which the meteorite was recorded. If the meteorite was *Found*, then this is the date on which it was found. If it was observed while falling, this is the time that the meteorite was seen. Note that meteorites which were found could have fallen a *long* time ago (think: many thousands of years).
- **GeoLocation**: the latitude and longitude of the meteorite's location as a string.

## Part 1: Cleaning the Dataset

**Lectures to review: 01-Intro; 02-Expressions, Types; 03-Tables; 06-Functions, Apply; 07-Group, Merge, Conditionals, Iteration**

Not every dataset is perfect. In fact, most of the datasets we've seen thus far have already been "cleaned" so that they can be used out-of-the-box. Now that you have gained sufficient knowledge of data manipulation, let's clean this dataset together. First, let's see what our dataset looks like:

In [3]:
raw_df = bpd.read_csv('data/meteorite_landings.csv')
raw_df

**Question 1.** Let's start by choosing a column to be the index. Remember that a good choice should have unique values. Create a new dataframe named `df_with_index` that has its index set to an appropriate column (note that meteorite names are not guaranteed to be unique).

<!--
BEGIN QUESTION
name: q1_1
-->

In [4]:
df_with_index = ...
df_with_index

In [None]:
grader.check("q1_1")

The *GeoLocation* column provides the location of each meteorite's impact as a string of the form `(LATITUDE, LONGITUDE)`. Unfortunately, some of the meteorites are missing this information and therefore have a *GeoLocation* of `np.nan`, as can be seen for meteorite 5163 in the table below:

In [6]:
df_with_index.loc[[5163, 5164, 5165]]

Let's remove the rows with missing GeoLocation from the table. Note that this isn't always the right thing to do, but we'll assume that it is OK in this case. You will learn more about how to deal with missing information if you take *DSC 80*.

One way to determine if an entry is missing is to check its type. In the *GeoLocation* column, entries that are not missing are strings, and missing entries are NaNs (which are not strings -- in fact `np.nan` is considered a float).

To check if a variable has a certain type, we can use the built-in Python function `isinstance`. This function takes two arguments, an object and a type, and returns `True` or `False` depending on if the object is of that type. Run the following four cells to see some examples:

In [7]:
isinstance("this is a string", str)

In [8]:
isinstance(432, str)

In [9]:
isinstance(np.nan, str)

In [10]:
isinstance(np.nan, float)
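
The same check can be applied to every element of a collection. The sketch below uses a made-up list (not the project data) and the standard library's `math.nan` as a stand-in for `np.nan`; both are ordinary Python floats.

```python
import math

# Hypothetical values: two location strings and one missing entry.
values = ["(32.7, -117.1)", math.nan, "(40.7, -74.0)"]

# isinstance(v, str) is True only for the entries that are strings.
is_string = [isinstance(v, str) for v in values]
only_strings = [v for v in values if isinstance(v, str)]

print(is_string)
print(only_strings)
```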

**Question 2**. Starting from `df_with_index`, create a new dataframe named `df_with_geolocation` containing only those rows for which the *GeoLocation* is not missing (that is, only those rows for which *GeoLocation* is a string).

In [11]:
df_with_geolocation = ...
df_with_geolocation

In [None]:
grader.check("q1_2")

**Question 3**. Create a new dataframe by adding two columns to `df_with_geolocation` named `latitude` and `longitude`, respectively containing the latitude and longitude of each meteorite's impact. The type of each entry in these new columns should be *float*. Additionally, drop the *GeoLocation* column from the table. Store the result in `df_with_lat_lon`.

In [14]:
df_with_lat_lon = ...
df_with_lat_lon

In [None]:
grader.check("q1_3")

**Question 4**. You have likely noticed by now that the variable `year` doesn't contain the year of a meteorite's landing exactly, but instead contains a string of the form *MM/DD/YYYY HH:MM:SS AM/PM*. Taking a closer look, every record shows a month of 1, a day of 1, and a time of 12:00:00 AM. This clearly is not correct -- these times and dates are just dummy values. Therefore, the only useful information in each of the entries is the *year*.

Create a function named `extract_year` that takes in a string of the form *MM/DD/YYYY HH:MM:SS AM/PM* and outputs the year as a number. Then create a new dataframe named `df_with_year` which is the same as `df_with_lat_lon` but where the entries in the *year* column have been replaced with the year as a number.

**Note**: some of the entries in the *year* column are missing. To handle this, if `extract_year` is called with a value that is not a string, it should return `np.nan`.

In [19]:
def extract_year(s):
    ...

df_with_year = ...
df_with_year

In [None]:
grader.check("q1_4")

**Question 5**. Sometimes when dealing with events such as these, it is helpful to aggregate them into *decades*. Write a function named `to_decade(y)` which accepts a year (as a number) and returns the decade. For instance, `to_decade(2008)` should return the number 2000, and `to_decade(2010)` should return 2010. Your function does not need to work for negative years. Next, create a new table named `df_with_decade` that has all of the same information as `df_with_year`, but which has an additional column named *decade* containing the decade of each meteorite's recording.

*Hint*: The *floor division* operator, `//`, might be useful here. This operator divides two numbers as usual, but if the result has a decimal part, it is "forgotten". For example, `35 / 6` is `5.83333`, but `35 // 6` is simply `5`. Try using floor division as part of your answer.
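
As a small illustration of the operator in the hint, the sketch below uses only the numbers from the example. It shows the general behavior of `//` and is not a solution to the question.

```python
# Ordinary division keeps the decimal part; floor division drops it.
print(35 / 6)
print(35 // 6)

# Floor-dividing and then multiplying back rounds down to a multiple of the divisor.
rounded_down = 35 // 6 * 6
print(rounded_down)

# With a float operand the result is still whole-valued, but its type is float.
print(35.0 // 6)
```

For negative numbers, `//` rounds toward negative infinity rather than toward zero, so `-35 // 6` is `-6`.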

In [25]:
def to_decade(y):
    ...

df_with_decade = ...
df_with_decade

In [None]:
grader.check("q1_5")

**Question 6**. As described above, entries in the *nametype* column are strings that can take only one of two values: they are either "Valid" or "Relict". Likewise, entries in the *fall* column can either be "Fell" or "Found". Columns which can take on only one of a handful of possible values are referred to as "categorical" columns. If there are only two possibilities, we sometimes call these "binary" variables.

A common transformation done to columns with binary values is to convert them to `bool`. This simplifies the representation, and has the added benefit of saving memory (a string requires *much* more memory than a single `bool`). You will learn much more about this kind of data management if you take *DSC 100: Intro to Data Management*.

Starting from `df_with_decade`, create a new dataframe named simply `df_bool` that contains two new columns, *is_valid* and *is_found*. An entry of *is_valid* should be `True` if and only if the meteorite's *nametype* is "Valid". Likewise, an entry of *is_found* should be `True` if and only if the meteorite's *fall* is "Found". Additionally, the dataframe should have the `nametype` and `fall` columns removed.

In [30]:
df_bool = ...
df_bool

In [None]:
grader.check("q1_6")

**Question 7.** While the latitude and longitude tell us the precise location of the meteorite, it would be helpful to have a more coarse description of where it was seen or found. The file `data/continents.csv` contains a table with two columns: *id* (the ID of a meteorite) and *continent* (the continent on which it was recorded).

Create a new dataframe, simply named `df`, that contains all of the columns of `df_bool` and one additional column, *continent*, which has the continent in which each meteorite was found or seen.

**Note**: The rows in `data/continents.csv` are not in the same order as the rows in `df_bool`! Also, after this step, double-check to make sure that the *index* is still properly set.

In [36]:
df = ...
df

In [None]:
grader.check("q1_7")

**Question 8**. That is enough cleaning for now, but before we move on we should recall that the dataset contains both meteorites that are witnessed while falling, and meteorites that are found (and which potentially fell many thousands of years ago). To understand a potential trend in the number of meteorites reaching Earth, we should therefore only look at the meteorites which are *witnessed* while falling. That is, the meteorites for which *is_found* is `False`.

Construct a new dataframe named `seen` which contains only those meteorites which were seen falling. We will use `seen` for some questions below, and `df` for others.

In [40]:
seen = ...
seen

In [None]:
grader.check("q1_8")

### Check Your Work

That's all of the cleaning we'll do. Here's what the dataframe looked like when we started:

In [44]:
raw_df

And here's what it looks like now:

In [45]:
df

It looks much better! (but I might be biased...)

Before we move on, let's make sure that your dataframe is properly cleaned. The below test will make sure that your dataframe has these columns and *only* these columns (though not necessarily in this order):

- *name*
- *recclass*
- *mass (g)*
- *year*
- *decade*
- *latitude*
- *longitude*
- *is_valid*
- *is_found*
- *continent*

It will also check that the index is properly set.

In [None]:
grader.check("q1_check")

## Part 2: Exploring the Dataset

**Lectures to review: 04-Queries, Groupby; 05-Visualization; 06-Functions, Apply; 07-Group, Merge, Conditionals, Iteration**

The next step in a data science project is *Exploratory Data Analysis*, or *EDA* for short. We don't always come into a data science project with a clear idea of what trends to expect or what questions to ask. In EDA, we comb through the data in order to identify interesting patterns worth exploring. EDA often involves a lot of data visualization, as we will see.

**Question 1**. One of the simplest questions we might ask of the data is how the number of meteorites falling each year has changed over time. To answer this, we should use only the data from meteorites that are witnessed falling; that is, we should use the data in `seen`.

Create a density histogram that shows the number of meteorites per year which are seen while falling. For bins, use `[1700, 1710, ..., 2020]`. Then, looking at your histogram, find the proportion of meteorites in this subset that were recorded in the interval from 1950 to 2000. Store your answer in `prop_1950`. Your answer should be a number between 0 and 1.
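
Recall that on a density histogram, the *area* of a bar (height times width) is the proportion of the data in that bin. The sketch below works through that arithmetic in plain Python. The values and bin edges are made up for illustration and have nothing to do with the meteorite data.

```python
# Hypothetical data and bins, chosen only to illustrate the arithmetic.
values = [1, 3, 12, 15, 18, 25, 31, 38]
bin_edges = [0, 10, 20, 40]

n = len(values)
areas = []
for left, right in zip(bin_edges[:-1], bin_edges[1:]):
    count = sum(left <= v < right for v in values)
    width = right - left
    height = count / (n * width)   # bar height on a density histogram
    areas.append(height * width)   # proportion of the data in this bin
    print(f"[{left}, {right}): height={height:.4f}, area={areas[-1]:.3f}")

# The areas of all bars add up to 1.
print(sum(areas))
```

The proportion of data in a range that spans several bins is the sum of the areas of those bins.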

In [48]:
# make your plot here
...
prop_1950 = ...
prop_1950

In [None]:
grader.check("q2_1")

**Question 2.** You will likely see that the number of meteorites seen falling grew substantially until it leveled off in the mid-1900s. Using your understanding of the universe, which one of the following is **not** a plausible partial explanation for why this may be?

1. The large increase in the number of meteorites being recorded is because of an actual increase in the number of meteorites hitting the Earth.
2. Population increases in remote areas have led to more people seeing meteorites that before would not have been witnessed.
3. Technological advancements have allowed humans to track and therefore see more meteorites as they approach Earth.
4. The recording of meteorite impacts is a recent phenomenon; in previous centuries, meteorites may have been observed but rarely recorded.

In [51]:
not_plausible = ...

In [None]:
grader.check("q2_2")

Next we might ask: Where are meteorites recorded? We can visualize the locations of impacts using a python package called `folium`. It has been installed for you. Let's import it now:

In [54]:
import folium

The code below shows how to create a map using `folium`. This map is centered at a latitude and longitude of (39, -98), which happens to be in Kansas.

In [55]:
m = folium.Map(location=[39.0, -98.0], zoom_start=4)
m

We can create markers on the map as shown below. Here we've made two markers: one for San Diego, and another for New York City.

In [56]:
folium.Marker(location=[32.7157, -117.1611]).add_to(m)
folium.Marker(location=[40.7, -74]).add_to(m)
m

Let's see if we can detect any patterns in the locations of meteorites. The function below takes in a dataframe of meteorites (like `df`) and uses `folium` to plot the location of each meteorite in the table.

In [57]:
def plot_meteorites(meteorite_df, center=(39.0, -0), zoom=2):
    m = folium.Map(location=center, min_zoom=2, zoom_start=zoom, max_bounds=True)
    for _, row in meteorite_df.get(['latitude', 'longitude', 'name'])._pd.iterrows():
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            popup=row['name'], radius=1.5
        ).add_to(m)
    display(m)

For example, here are 4000 randomly-chosen meteorites, including both those that have been found and those that have been observed falling. You can pan, zoom, and click markers to see the name of each meteorite.

In [58]:
plot_meteorites(df.sample(4000))

We can also plot only those meteorites that have been observed while falling. Can you see a difference between the map below and the one above?

In [59]:
plot_meteorites(seen)

**Question 3**. By looking at the map above, which continent do you think has the most meteorites in the dataset overall? There's no wrong answer to this question -- it won't be graded for correctness.

1. Africa
1. Antarctica
1. Asia
1. Australia
1. Europe
1. North America
1. South America

*Note*: different educational systems have different definitions of "continent". For this question, we're using the 7-continent model -- but since it isn't graded, it doesn't really matter which model we use.

In [60]:
continent_guess = ...

In [None]:
grader.check("q2_3")

**Question 4.** Our table, `df`, includes the continent where each meteorite was recorded. Using table operations, construct a series named `per_continent` containing the number of meteorites recorded in each continent.

**Note**: for the purposes of this project, "recorded" means either found or observed while falling. In other words, use the whole table `df` for this question, not just those that were seen falling.

In [62]:
per_continent = ...
per_continent

In [None]:
grader.check("q2_4")

**Question 5.** The answer to the last question may have surprised you -- let's investigate further. Some continents are much larger than others, and it is plausible that these continents have a larger number of meteorites simply by virtue of their size. To control for this, let's calculate the *number of meteorites per unit of area*.

The file `data/areas.csv` contains the area of each of the seven continents that appear in our data set. Construct a series named `density` containing the number of meteorites per square kilometer for each continent.

In [68]:
density = ...
density

In [None]:
grader.check("q2_5")

If your answer is correct, the continent which has the most meteorites per unit area is also the one that has the most meteorites overall. This suggests that this continent really does have more recorded meteorites than other continents.

The key word in the previous sentence is *recorded*. Is it really the case that a certain part of the Earth *attracts* more meteorites than other parts? Probably not, but there is something going on here. Remember that meteorites are, by definition, meteors that survived the trip through the atmosphere long enough to hit the surface of the Earth. Therefore, it could be that certain regions of the planet have thinner atmospheres, making it more likely that a meteor survives entry. On the other hand, physics tells us that meteorite landings should be more-or-less randomly distributed across the surface of the Earth -- maybe the concentration of meteorites is due to something else? Let's see...

Perhaps visualizing the Antarctic meteorites again can give us a clue. The code below shows meteorites in the vicinity of McMurdo Sound, Antarctica.

In [74]:
plot_meteorites(df[df.get('continent') == 'Antarctica'].sample(4000), center=(-75, 170), zoom=5)

You should notice that the meteorites are clustered. Why is this? Is there something special about the regions these meteorites were found in?

Perhaps the name of the meteorite can give us a clue. Click a marker to see the meteorite's name, and repeat to learn about the different regions where meteorites were found. Search for some of these region names on Wikipedia to determine what the geology of the region is like.

*Hint*: you might learn a new word that starts with "m" -- this is related to the reason.

<!-- BEGIN QUESTION -->

**Question 6**. In a few sentences, explain why meteorites are so commonly recorded in Antarctica. Does the data support the claim that meteorites are more likely to *fall* in Antarctica than elsewhere?

<!--
BEGIN QUESTION
name: q2_6
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



**Question 7**. Now let's look at the various classifications of meteorites. These are stored in the *recclass* column. First, how many unique classifications are there? Don't hardcode your answer -- write code to compute it.

In [75]:
number_of_classes = ...
number_of_classes

In [None]:
grader.check("q2_7")

This is quite a lot! Next, let's understand how the classifications are distributed. Are there roughly the same number of meteorites in each class? Or are some classes more popular?

**Question 8**. What is the median number of meteorites in each class? What is the mean number of meteorites in each class?

In [78]:
median_in_each_class = ...
mean_in_each_class = ...
median_in_each_class, mean_in_each_class

In [None]:
grader.check("q2_8")

**Question 9**. The mean and median are quite different. Which one of the explanations below could be the reason for this difference?

1. The median is much lower than the mean because of floating point error.
1. The mean is much lower than the median because it does not take outliers into account.
1. The mean is much higher than the median due to large outliers.
1. The mean is much higher than the median because of random chance.
1. The median is much smaller than the mean due to small outliers.

In [83]:
could_be_the_reason = ...

In [None]:
grader.check("q2_9")

**Question 10**. Considering only the classes which have more than 100 meteorites, plot a horizontal bar chart of the count of meteorites belonging to each class. Use your plot to answer the question: which class has the fifth most meteorites? Your answer should be in the form of a string.

In [86]:
# make your plot here
...
fifth_most = ...

In [None]:
grader.check("q2_10")

Lastly, let's look at the mass of the meteorites. Here's a simple histogram of the masses:

In [89]:
df.get('mass (g)').plot(kind='hist')

Well, that's unhelpful. We are seeing this because the largest meteorites are much larger than the smallest, but they are relatively infrequent.

In [90]:
df.get('mass (g)').max()

In [91]:
df.get('mass (g)').min()

Apparently some meteorites are given a mass of 0. This could be because the meteorites are very light, or because the weight was unknown and a placeholder of 0 was used. The massive 60 million gram meteorite is [Hoba](https://en.wikipedia.org/wiki/Hoba_meteorite), a meteorite which fell on present-day Namibia some time within the last 80,000 years.

There are ways to improve the visualization. First, we can plot a histogram of a subset of the data; for instance, from 0 to 500 grams:

In [92]:
df.get('mass (g)').plot(kind='hist', bins=np.arange(0, 500, 25), density=True)

We can also use the `logy = True` argument to `.plot()` to plot the y-axis on a logarithmic scale:

In [93]:
df.get('mass (g)').plot(kind='hist', logy=True, bins=20)

Note that we have omitted the `density = True` keyword argument, so this is a histogram of *counts*. 
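
To see why a logarithmic axis helps here, the sketch below compares how a few made-up bin counts would appear on a linear axis and on a log axis. The counts are hypothetical and were chosen only because they span several orders of magnitude.

```python
import math

# Hypothetical bin counts spanning several orders of magnitude.
counts = [40000, 300, 12, 1]

# On a linear axis, each bar's height relative to the tallest bar is count / max.
linear_share = [c / max(counts) for c in counts]

# On a log axis, the vertical position of each bar top is set by log10(count).
log_positions = [math.log10(c) for c in counts]

print(linear_share)
print(log_positions)
```

On the linear axis every bar except the first is less than 1% of the tallest, so it is practically invisible. On the log axis the bar tops are spread more evenly. A bin with a count of zero cannot be placed on a log axis, because the logarithm of zero is undefined.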

**Question 11**. Let's see if there is any trend in the average mass of meteorites over time. As with the first part of this section, we will restrict ourselves to only those meteorites that are observed while falling (that is, meteorites for which *is_found* is `False`).

Draw a bar chart of the median mass of meteorites seen in each decade between 1900 and now. Use your bar chart to answer the question: What was the median mass of meteorites that were seen in the 1990s?

In [94]:
# make your plot here
...
observed_falling_in_90s = ...

In [None]:
grader.check("q2_11")

It looks like some decades have heavier meteorites than others. In the next section, we'll investigate this to see if it is a real phenomenon, or whether it might be due to chance.

Before we move on, let's look at the closest meteorites to San Diego, just for fun:

In [97]:
coords = np.column_stack([
    df.get('latitude'),
    df.get('longitude')
])

distances = np.sum((coords - np.array([32.7, -117.1]))**2, axis=1)

closest = df.iloc[np.argsort(distances)[:10]]

plot_meteorites(closest, center=(32.7, -117.1), zoom=8)
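
The cell above ranks meteorites by the squared difference in latitude and longitude, measured in degrees. That is fine for a quick look, but it is not a true distance, since a degree of longitude covers less ground the farther you are from the equator. As an optional aside, here is a minimal sketch of the haversine (great-circle) distance. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this illustration and are not part of the project.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# San Diego and New York City, using the marker coordinates from earlier in the notebook.
print(haversine_km(32.7157, -117.1611, 40.7, -74))
```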

---

## Part 3: Statistics

**Lectures to review: 08-Simulation; 09-Sampling; 10-Models, Statistics; 11-Hypothesis Testing; 12-AB Testing; 13-Bootstrap; 14-CI; 15-CLT; 16-Normal CI**

In this section we will take a deeper look at some of the trends uncovered in our EDA in order to determine if they are real phenomena, or just due to chance.

In Part 2, Question 11, we plotted the median mass of meteorites per decade. We saw that some decades had much heavier medians than others. Is this a real phenomenon? Or is it possible that this is simply due to chance? We can answer this with a hypothesis test.

Our intuition tells us that a meteorite's mass is essentially random. That is, when Nature decides to throw a meteorite at the Earth, it doesn't first check the date. Of course, there may be reasons for heavier meteorites to cluster together in time -- for instance, maybe a large meteor breaks into several pieces, all of which fall around the same time. However, we might believe these events to be sufficiently rare so as to not affect the decadal medians.

Given this intuition, we'll introduce the following null and alternative hypotheses:

**Null Hypothesis**: Nature assigns meteorite masses randomly, with no relationship to the decade that the meteorite was observed.

**Alternative Hypothesis**: The median meteorite mass is significantly higher in some decades as compared to others.

**Question 1a**. Which of the examples seen in lecture had the most similar null/alternative hypothesis to the current situation? (*Hint*: relating a problem to one seen in lecture is a great way to choose the right statistical test to apply).

1. Is the coin fair? (Lecture 10)
1. The Swain jury panel (Lecture 10)
1. The Alameda county jury panels (Lecture 11)
1. Is the TA a poor teacher? (Lecture 11)
1. Deflategate (Lecture 12)
1. Smoking and Baby Weight (Lecture 12)

In [98]:
most_similar_3_1a = ...

In [None]:
grader.check("q3_1a")

**Question 1b**. Which of the below test statistics would be the most reasonable choice for this situation?

1. The maximum of any decadal median.
2. The minimum of any decadal median.
3. The difference between the maximum decadal median and the minimum decadal median.
4. The average decadal median.
5. The median mass in the 1990s.

In [101]:
test_stat_guess_3_1b = ...

In [None]:
grader.check("q3_1b")

Our test statistic should tell us something about the difference in median mass between decades. While there are several choices, we will use the *difference between the maximum decadal median and the minimum decadal median* (fix your answer above if you guessed something else!). That is, we'll find the decade with the largest median mass and the decade with the smallest, and subtract these two medians. The result will be a non-negative number. If this number is large, it means that there is a pair of decades with a very big difference in median meteorite mass. Under the null hypothesis, this difference should be relatively small.
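
To make the definition concrete, here is a sketch of the statistic computed by hand on a tiny made-up data set in plain Python. The `(decade, mass)` pairs are hypothetical. Your own implementation should use table operations on the project dataframes instead.

```python
from statistics import median

# Hypothetical (decade, mass) pairs, not taken from the project data.
records = [(1900, 10.0), (1900, 30.0), (1910, 5.0), (1910, 7.0), (1910, 100.0), (1920, 50.0)]

# Group the masses by decade.
by_decade = {}
for decade, mass in records:
    by_decade.setdefault(decade, []).append(mass)

# Median of each group, then the gap between the largest and smallest median.
medians = {decade: median(masses) for decade, masses in by_decade.items()}
statistic = max(medians.values()) - min(medians.values())

print(medians)
print(statistic)
```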

**Question 2**. Before computing our test statistic, it will be useful to restrict our data set to only those meteorites that are *seen* (i.e., the ones that were *not* found), and which fell on or after the year 1900. Create a dataframe named `seen_since_1900` which contains only these meteorites.

In [103]:
seen_since_1900 = ...
seen_since_1900

In [None]:
grader.check("q3_2")

**Question 3**. Define a function named `mass_range` which accepts one argument, a dataframe (of the form of `df`, `seen_since_1900`, or similar), and returns the value of the test statistic for the meteorites in that dataframe.

In [105]:
def mass_range(meteorites):
    ...

In [None]:
grader.check("q3_3")

**Question 4**. Under the null hypothesis, the values in the *decade* column are not related to the values in the *mass (g)* column. As a result, we can simulate a new dataset under the null hypothesis by permuting the *decade* column. We can then generate a new test statistic using this simulated data.

Generate 1000 values of the test statistic using the above approach and place them in a numpy array called `test_stats`.

*Hint*: you may create additional cells if you'd like. Your code should take at most a few minutes to run; if it takes too long, it may not receive credit.
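
As a reminder of what "permuting a column" means, the sketch below shuffles a made-up list of decade labels while leaving the masses in place. It uses the standard library's `random.shuffle` on plain lists, and all values are hypothetical. In the notebook you would shuffle a column of the dataframe instead.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical labels and values.
decades = [1900, 1900, 1910, 1910, 1920, 1920]
masses = [10.0, 30.0, 5.0, 7.0, 100.0, 50.0]

# Shuffle a copy of the labels; the masses stay where they are.
shuffled = decades.copy()
random.shuffle(shuffled)

# Each mass is now paired with a randomly reassigned decade.
simulated = list(zip(shuffled, masses))
print(simulated)
```

The shuffled labels contain exactly the same decades as before, so each decade keeps its group size. Only the pairing between decade and mass changes.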

In [110]:
test_stats = ...

In [None]:
grader.check("q3_4")

**Question 5.** Compute the p-value of the observed test statistic.
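
For a statistic where large values favor the alternative, the p-value is the proportion of simulated statistics that are at least as large as the observed one. The sketch below shows that calculation on made-up numbers.

```python
# Hypothetical simulated statistics and observed value.
simulated_stats = [1.2, 3.4, 0.8, 5.1, 2.2, 4.0, 0.5, 2.9, 3.3, 1.7]
observed = 3.3

# Count the simulated statistics at least as large as the observed one.
at_least_as_extreme = sum(s >= observed for s in simulated_stats)
p_value_sketch = at_least_as_extreme / len(simulated_stats)
print(p_value_sketch)
```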

In [114]:
p_value = ...
p_value

In [None]:
grader.check("q3_5")

**Question 6.** Which one of the following conclusions can we make given the result of the test?

1. The null hypothesis is true
1. We should reject the null hypothesis (assuming we use a 5% threshold for significance)
1. The results of the test are consistent with the null hypothesis that the meteorite masses do not depend on the decade

In [117]:
conclusion = ...

In [None]:
grader.check("q3_6")

In the EDA section above, we saw that there was a difference in where meteorites were *found* and where they are *seen*. As a reminder, here is the plot of locations of *seen* meteorites:

In [120]:
seen = df[~df.get('is_found')]
plot_meteorites(seen)

A likely explanation for this is that a meteorite is more likely to be seen if it falls near a populated area. In fact, compare the above map with this composite image of the Earth at night -- the nighttime lights show areas that are densely populated:

<img src="./data/lights.jpg">

We might suppose that the number of meteorites spotted in a continent is proportional to the *population* of the continent, but this is not quite right. Consider, for instance, Asia, which has the greatest population out of any of the continents, by far. This population, however, is not spread evenly over the surface of the continent -- it is concentrated in several areas. We can see this in the nighttime image above -- large parts of Asia are relatively sparsely inhabited.

Instead, we might suppose that the number of meteorites seen in a continent is proportional to the total area of the continent that is "populated". One way to do this is to count (for each continent) the number of pixels in the above image which are brighter than some threshold. The more bright pixels, the more land area that is populated, and the more surface area where a falling meteorite is likely to be seen.
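
The counting idea can be sketched on a tiny made-up grid of brightness values. Both the grid and the threshold of 100 are assumptions for illustration; the cutoff used to build the project's data file is not stated here.

```python
# Hypothetical 3x4 grid of brightness values (0 = dark, 255 = brightest).
image = [
    [0, 12, 200, 220],
    [5, 180, 90, 0],
    [0, 0, 30, 250],
]
threshold = 100  # assumed cutoff, for illustration only

# Count the pixels that are brighter than the threshold.
bright_pixels = sum(pixel > threshold for row in image for pixel in row)
print(bright_pixels)
```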

We have done this work for you. We broke the surface of the Earth up into 259200 rectangular cells and counted the number of cells that are populated for each continent. The results are contained in `data/populated_cells.csv`.

In [121]:
populated_cells = bpd.read_csv('./data/populated_cells.csv').set_index('Continent')
populated_cells

**Question 7.** Assume that a given meteorite will land in one of the seven continents listed above, and that the probability that the meteorite is seen in a given continent is proportional to the number of populated cells in that continent. Calculate a series containing the probability of the meteorite being seen in each continent.

In [122]:
probabilities = ...
probabilities

In [None]:
grader.check("q3_7")

**Question 8**. Restricting yourself to only those meteorites which were *seen* falling on or after 1980, calculate the proportion of meteorites which fell in each continent. Save your answer in a series named `proportions`.

In [126]:
proportions = ...
proportions

In [None]:
grader.check("q3_8")

Consider the null and alternative hypotheses shown below:

**Null Hypothesis**: The locations of meteorites seen since 1980 were generated by the distribution in the `probabilities` series.

**Alternative Hypothesis**: The locations were generated from some other distribution.

**Question 9a**. Which of the examples seen in lecture had the most similar null/alternative hypothesis to the current situation?

1. Is the coin fair? (Lecture 10)
1. The Swain jury panel (Lecture 10)
1. The Alameda county jury panels (Lecture 11)
1. Is the TA a poor teacher? (Lecture 11)
1. Deflategate (Lecture 12)
1. Smoking and Baby Weight (Lecture 12)

In [130]:
most_similar_3_9a = ...

In [None]:
grader.check("q3_9a")

**Question 9b**. Using your answer for the previous part of this question as your guide, which of the below test statistics would be the most reasonable choice for this situation?

1. The proportion of meteorites seen in Asia.
2. The proportion of meteorites that are not seen in Asia.
3. The difference between the proportion of meteorites seen in Asia and the probability that a meteorite falls in Asia (according to `probabilities`).
4. The TVD between the proportion of meteorites falling in each continent and the probabilities in `probabilities`.
5. The difference between the maximum proportion falling in any continent and the maximum probability in `probabilities`.

In [133]:
test_stat_guess_3_9b = ...

In [None]:
grader.check("q3_9b")

**Question 10.** Write a function named `generate_proportions` which accepts no arguments and which simulates the location of 176 meteorite landings according to the distribution in `probabilities` (176 because 176 meteorites have been seen since 1980). The return value of your function should be a numpy array of size 7 which contains the proportion of simulated meteorite landings in each of the 7 continents (in the order that they appear in `probabilities`).
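
As a reminder of what simulating from a categorical distribution looks like, here is a sketch with three made-up categories. It uses the standard library's `random.choices`. The categories, probabilities, and number of draws are all hypothetical, and in the notebook a numpy-based approach may be more convenient.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical categories and probabilities (the probabilities sum to 1).
categories = ["A", "B", "C"]
probs = [0.5, 0.3, 0.2]
n_draws = 20

# Draw n_draws categories independently according to probs.
draws = random.choices(categories, weights=probs, k=n_draws)

# Convert counts to proportions, keeping the original category order.
counts = Counter(draws)
simulated_proportions = [counts[c] / n_draws for c in categories]
print(simulated_proportions)
```

A category that is never drawn gets a proportion of 0, because `Counter` returns 0 for missing keys.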

In [135]:
def generate_proportions():
    ...

In [None]:
grader.check("q3_10")

**Question 11**. Using your function `generate_proportions` to simulate landings and the Total Variation Distance (TVD) as your test statistic, generate 1000 values of the test statistic. Place them in an array named `tvd_stats`.

*Hint*: you may create additional cells if you'd like. Your code should take at most a few minutes to run; if it takes too long, it may not receive credit.
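
As a reminder of how the TVD is computed, here is a minimal sketch on two made-up distributions. The function name `toy_tvd` and the example arrays are hypothetical; the TVD is half the sum of the absolute differences between the two distributions.

```python
import numpy as np

def toy_tvd(dist_a, dist_b):
    """Total variation distance between two categorical distributions,
    given as sequences of proportions over the same categories."""
    dist_a = np.asarray(dist_a)
    dist_b = np.asarray(dist_b)
    # Half the sum of absolute differences, category by category.
    return np.abs(dist_a - dist_b).sum() / 2

# Two hypothetical distributions that differ by 0.1 in two categories.
toy_tvd_value = toy_tvd([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
print(toy_tvd_value)  # approximately 0.1
```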

In [141]:
tvd_stats = ...

In [None]:
grader.check("q3_11")

**Question 12**. Calculate the probability, under the null hypothesis, of observing a TVD at least as extreme as the TVD that was actually observed.
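
One common way to estimate such a probability is to find the fraction of simulated statistics that are at least as large as the observed one. The sketch below does this with made-up numbers; `toy_simulated`, `toy_observed`, and `toy_p_value` are hypothetical names, and the values have nothing to do with the meteorite data.

```python
import numpy as np

# Hypothetical simulated test statistics and a hypothetical observed value.
toy_simulated = np.array([0.02, 0.05, 0.08, 0.11, 0.03,
                          0.07, 0.12, 0.04, 0.06, 0.09])
toy_observed = 0.10

# For a distance like the TVD, "at least as extreme" means "at least as large".
toy_p_value = np.count_nonzero(toy_simulated >= toy_observed) / len(toy_simulated)
print(toy_p_value)  # 2 of the 10 simulated values are >= 0.10, so 0.2
```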

In [145]:
tvd_p_value = ...
tvd_p_value

In [None]:
grader.check("q3_12")

**Question 13**. What can we conclude from the result of this hypothesis test?

1. We reject the null hypothesis that the meteorites seen since 1980 were generated from the distribution in `probabilities`.
2. We accept the null hypothesis.
3. The null hypothesis is consistent with what was observed.

In [148]:
prob_conclusion = ...

In [None]:
grader.check("q3_13")

A potentially interesting observation is that the meteorites seen in Asia since records began have a larger median mass than the meteorites seen in North America:

In [151]:
seen[seen.get('continent') == 'Asia'].get('mass (g)').median()

In [152]:
seen[seen.get('continent') == 'North America'].get('mass (g)').median()

Is this a real difference? Or is it just due to chance?

**Question 14**. We will run a hypothesis test to answer this question. Which of the examples seen in lecture is most similar to the current situation?

1. Is the coin fair? (Lecture 10)
1. The Swain jury panel (Lecture 10)
1. The Alameda county jury panels (Lecture 11)
1. Is the TA a poor teacher? (Lecture 11)
1. Deflategate (Lecture 12)
1. Smoking and Baby Weight (Lecture 12)

In [153]:
most_similar_3_14 = ...

In [None]:
grader.check("q3_14")

**Question 15**. Using the table `seen` as a starting point, run a permutation test for the null hypothesis that the masses of meteorites seen in Asia and the masses of those seen in North America come from the same distribution. The alternative hypothesis is that the mass of meteorites seen in Asia is larger. As a test statistic, use the (signed) difference in medians between the two groups. Use 1000 permutations in your test. For your answer, report the p-value.

*Hint*: think carefully about what data should be included. Should we shuffle all of the rows of `seen`, or work only with data from North America and Asia?

*Hint 2*: you may create additional cells if you'd like. Your code should take at most a few minutes to run; if it takes too long, it may not receive credit.
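
For reference, the general shape of a permutation test can be sketched on made-up data with plain NumPy arrays. Everything here (`group_a`, `group_b`, `diffs`, `toy_perm_p`) is hypothetical and only illustrates the idea: pool the two groups, shuffle, split them back into groups of the original sizes, and recompute the statistic each time.

```python
import numpy as np

# Hypothetical toy measurements for two groups.
group_a = np.array([12.0, 15.0, 9.0, 20.0, 18.0])
group_b = np.array([8.0, 11.0, 7.0, 10.0, 13.0, 9.0])

# Signed difference in medians (group A minus group B).
observed_diff = np.median(group_a) - np.median(group_b)

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_permutations = 1000
diffs = np.zeros(n_permutations)

for i in range(n_permutations):
    # Shuffling the pooled values breaks any link between value and group.
    shuffled = np.random.permutation(pooled)
    diffs[i] = np.median(shuffled[:n_a]) - np.median(shuffled[n_a:])

# One-sided p-value: how often is a shuffled difference at least as large
# as the observed one?
toy_perm_p = np.count_nonzero(diffs >= observed_diff) / n_permutations
print(observed_diff, toy_perm_p)
```

The direction of the comparison in the last step should match the alternative hypothesis and the order in which the difference is taken.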

In [156]:
perm_p_value = ...
perm_p_value

In [None]:
grader.check("q3_14")

**Question 16**. Using data in the `seen` table, construct bootstrapped 95% confidence intervals for both:

1. the median mass of meteorites *seen falling* in North America
2. the median mass of meteorites *seen falling* in Asia

Use 2000 bootstrap samples to construct your confidence intervals.

Your confidence intervals should be lists with two entries: the lower bound and the upper bound of the interval.

*Hint*: you can use your answer for this question as a check on Question 15. They should agree.

*Hint 2*: you may create additional cells if you'd like. Your code should take at most a few minutes to run; if it takes too long, it may not receive credit.
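
As a reminder of how a percentile bootstrap works, here is a minimal sketch on a made-up array. The names `toy_sample`, `boot_medians`, and `toy_ci` are hypothetical. The idea is to resample with replacement, using the same size as the original sample, and take the middle 95% of the resampled medians.

```python
import numpy as np

# Hypothetical toy sample (not the meteorite masses).
toy_sample = np.array([3.0, 7.0, 8.0, 12.0, 15.0, 21.0, 22.0, 30.0, 41.0, 55.0])

n_boot = 2000
boot_medians = np.zeros(n_boot)

for i in range(n_boot):
    # Resample with replacement, keeping the original sample size.
    resample = np.random.choice(toy_sample, size=len(toy_sample), replace=True)
    boot_medians[i] = np.median(resample)

# The 2.5th and 97.5th percentiles bound the middle 95% of the medians.
toy_ci = [np.percentile(boot_medians, 2.5), np.percentile(boot_medians, 97.5)]
print(toy_ci)
```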

In [159]:
asia_ci = ...
na_ci = ...

In [None]:
grader.check("q3_15")

**Question 16**. The table `seen` is not a list of *all* meteorites that fell in recent history; obviously, there were many meteorites that fell but were not seen. In this sense, `seen` is a sample of all meteorites that fell in recent history. Out of these meteorites, roughly 25% are over 10,000 grams in mass, as the cell below shows:

In [166]:
large_prop = seen[seen.get('mass (g)') > 10_000].shape[0] / seen.shape[0]
large_prop

Using the Central Limit Theorem, construct a 95% confidence interval for the population proportion of meteorites which are above 10,000 grams. Your answer should be a list of two numbers: the lower and upper bounds of the confidence interval.
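
As a rough sketch of the normal approximation for a sample proportion, the example below uses made-up numbers. The names `toy_p_hat`, `toy_n`, and `toy_normal_ci` are hypothetical, and the sketch assumes the common convention of going 2 standard deviations either side of the sample proportion (1.96 is a slightly more precise multiplier).

```python
import numpy as np

# Hypothetical sample proportion and sample size.
toy_p_hat = 0.4
toy_n = 600

# By the CLT, the sample proportion is roughly normal with this SD.
toy_sd = np.sqrt(toy_p_hat * (1 - toy_p_hat) / toy_n)

# About 95% of a normal distribution lies within 2 SDs of its center.
toy_normal_ci = [toy_p_hat - 2 * toy_sd, toy_p_hat + 2 * toy_sd]
print(toy_normal_ci)  # approximately [0.36, 0.44]
```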

In [167]:
normal_ci = ...
normal_ci

In [None]:
grader.check("q3_16")

<!-- BEGIN QUESTION -->

**Question 17**. In the last question, we said that `seen` is a sample of the population of meteorites that have fallen in recent years and used this to construct a confidence interval. In order for this confidence interval to be accurate, `seen` should be a *random sample* of meteorites.

Do you believe `seen` is a random sample, or might it be biased? Why or why not?

<!--
BEGIN QUESTION
name: q3_17
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



# Finish Line

Congratulations! You've completed Project 02. To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells. <p style="color: red"><b>Important!</b> We will allot 20 minutes of computer time to run your notebook. If your notebook takes longer than this to run, it may not pass the autograder! Run "Kernel -> Restart and Run All" to time how long your notebook takes. A notebook with correct answers should take less than 5 minutes.</p>
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [None]:
grader.check_all()