# Homework 4 - Investigating Mammalian Fecundity and Conservation using Filtering, Joins, and Arithmetic
 


## Introduction

For this week's homework, we are going to continue to work with the Pantheria Dataset and the IUCN categories. 

We will create a new metric using the Pantheria data that estimates: how many offspring do individuals within each species produce throughout their lifetime, on average? We call this 'lifetime fecundity'. We will be looking to see whether there is a relationship between average lifetime fecundity and a species' risk of going extinct.

### Question

The overarching question you're answering in this homework:

**_Is there a difference in IUCN category between species with smaller mean lifetime fecundity and species with larger mean lifetime fecundity?_**

Because this is an intricate investigation, this homework has you prepare the data to begin to answer the question. The week 5 homework will conclude this data story.

## Lab Instructions and Learning Objectives

Just like in the previous homework, you will be creating and submitting a data story answering a data science question. You will be required to submit your work in the same format as last time, complete with sections for *Introduction*, *Data*, *Methods*, *Computation*, and *Conclusion*.

In this lab, you will:
* Start a data story in a notebook exploring the question: is the number of offspring birthed by a lineage related to its risk of extinction?
* Write and use advanced Boolean expressions to filter specific observations in our dataset. (Specifically, you're encouraged to practice using comparison operators such as `!=`, `<=`, `>=`, `>`, `<`.)
* Join two related datasets to create a larger, more comprehensive dataset.
* Perform arithmetic on several pandas `Series` to estimate the maximum theoretical number of offspring that mothers within each species are capable of producing throughout their lifetime.


## Due date 

You will submit your completed Homework 4 on MarkUs by *Fri, Feb 11 2022 at 11:59 PM EST*. We will send an announcement in a couple days when autotesting has been set up on MarkUs.

## EEB: How to submit

1. Download your homework to your local computer and save it as `EEB125_Homework_4.ipynb`.
2. Log in here: https://markus-ds.teach.cs.toronto.edu.
3. Submit your homework to `HW4: Homework 4`.

## Marking Rubric

Section     | 0 | 1 | 2 | 3
------------|---|---|---|---
Introduction|The question is not stated correctly or left blank | The question is stated correctly | NA | NA 
Data (for each python variable)       |auto test fails | auto test passes | NA | NA 
Methods (for each part) | No answer | The data extracted is specified or a reasonable rationale is given, but not both | Both the data extracted is specified and a reasonable rationale is given | NA
Computation |auto test fails | auto test passes | NA | NA 
Conclusion (for each part) | No answer | The question is answered but no explanation is given | The question is answered but the explanation is not supported or weakly supported by the data | The question is answered and the explanation is supported by the data 

Maximum grade: **20**

## Introduction section

This should introduce the question being explored in a sentence. __(1 mark)__

## Data section

### Step 1: import data

Import the raw data from Pantheria (`pantheria.txt`) and phylacine (`phylacine.csv`) and name the `DataFrame`s as follows:

+ `pantheria_raw`: the `DataFrame` created by reading the `pantheria.txt` file. __(1 mark)__
+ `iucn_raw`: the `DataFrame` created by reading the `phylacine.csv` file. __(1 mark)__
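
As a minimal sketch of reading delimited text with `pd.read_csv`, the example below uses small made-up, in-memory stand-ins rather than the real files. It assumes the `.txt` file is tab-delimited (open the file to confirm) and that the `.csv` file uses commas; the `toy_` names and all values are hypothetical.

```python
import io
import pandas as pd

# Made-up stand-in for a tab-delimited text file (illustrative values only).
toy_txt = io.StringIO(
    "MSW05_Order\tMSW05_Binomial\t15-1_LitterSize\n"
    "Carnivora\tCanis adustus\t4.5\n"
    "Rodentia\tMus musculus\t6.0\n"
)
# sep="\t" tells pandas that fields are separated by tabs (an assumption here).
toy_pantheria = pd.read_csv(toy_txt, sep="\t")

# Made-up stand-in for a comma-separated file; a comma is the default separator.
toy_csv = io.StringIO(
    "Binomial.1.2,IUCN.Status.1.2\n"
    "Canis_adustus,LC\n"
    "Mus_musculus,LC\n"
)
toy_iucn = pd.read_csv(toy_csv)

print(toy_pantheria.shape)
print(toy_iucn.shape)
```

With real files, you would pass the file name (for example `'pantheria.txt'`) in place of the `io.StringIO` object.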

```python
# Step 1 check
pantheria_raw.head()
```

```python
# Step 1 check
iucn_raw.head()
```

### Step 2: select columns
Create new dataframes containing only the columns we need for this homework. Name them as follows:

 + `pantheria_data`: a `DataFrame` containing only the relevant columns from `pantheria_raw`: `'MSW05_Order'`, `'MSW05_Binomial'`, `'23-1_SexualMaturityAge_d'`, `'14-1_InterbirthInterval_d'`, `'17-1_MaxLongevity_m'`, and `'15-1_LitterSize'`. __(1 mark)__ 
 + `iucn_data`: a `DataFrame` containing only the relevant columns from `iucn_raw`: `'Binomial.1.2'` and `'IUCN.Status.1.2'`. __(1 mark)__ 

In a markdown cell, describe what each of the selected columns represents. __(1 mark)__
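
One common way to keep only some columns is to index a `DataFrame` with a list of column names. The sketch below uses a made-up toy table; `toy_raw`, `columns_to_keep`, and the extra column are hypothetical names chosen for illustration.

```python
import pandas as pd

# Made-up toy table with one column we do not need.
toy_raw = pd.DataFrame({
    "MSW05_Binomial": ["Canis adustus", "Mus musculus"],
    "15-1_LitterSize": [4.5, 6.0],
    "extra_column_we_do_not_need": ["a", "b"],  # hypothetical
})

# Indexing with a list of names returns a DataFrame with only those columns,
# in the order given in the list.
columns_to_keep = ["MSW05_Binomial", "15-1_LitterSize"]
toy_selected = toy_raw[columns_to_keep]

print(toy_selected.columns)
```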

```python
# Step 2 check
print(pantheria_data.columns)
print(iucn_data.columns)
```

### Step 3: create new column names

Let's prepare to rename the columns. Create dictionaries mapping the current column names to new, clearer names. Name them as follows:
+ `pantheria_new_column_names`: the `dictionary` mapping the column names from `pantheria_data` to the values `'order'`, `'genus_species'`, `'maturity_d'`, `'interbirth_d'`, `'longevity_m'`, and `'litter_size_ind'`, respectively. __(1 mark)__
+ `iucn_data_new_column_names`: the `dictionary` mapping the column names from `iucn_data` to `'genus_species'` and `'iucn_status'`, respectively. __(1 mark)__
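
A renaming dictionary has the current column names as keys and the new names as values. Here is a small sketch covering only two columns; `toy_new_column_names` is a hypothetical name, and your dictionaries will need one entry per column.

```python
# Keys are the current column names; values are the new, clearer names.
toy_new_column_names = {
    "MSW05_Binomial": "genus_species",
    "15-1_LitterSize": "litter_size_ind",
}

# Looking up an old name gives the new name it maps to.
print(toy_new_column_names["MSW05_Binomial"])
```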

```python
# Step 3 check
print(pantheria_new_column_names)
print(iucn_data_new_column_names)
```

### Step 4: Rename the columns in the dataframes

Use the `DataFrame` method `rename` to rename the columns in `pantheria_data` and `iucn_data` using the dictionaries that we have just created. Name the new dataframes `pantheria_data_clean` and `iucn_data_clean`.

+ `pantheria_data_clean`: the `DataFrame` that is the result of renaming the columns in `pantheria_data`. __(1 mark)__
+ `iucn_data_clean`: the `DataFrame` that is the result of renaming the columns in `iucn_data`. (We will not autotest this `DataFrame` until you have added columns, as described below.) __(1 mark)__
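
The sketch below shows `rename` with the `columns` argument on a made-up toy table. Note that `rename` returns a new `DataFrame` and leaves the original unchanged, which is why the result is assigned to a new name. All `toy_` names and values are hypothetical.

```python
import pandas as pd

toy_data = pd.DataFrame({
    "MSW05_Binomial": ["Canis adustus", "Mus musculus"],
    "15-1_LitterSize": [4.5, 6.0],
})
toy_new_names = {
    "MSW05_Binomial": "genus_species",
    "15-1_LitterSize": "litter_size_ind",
}

# columns=... applies the mapping to column labels (not to row labels).
toy_data_clean = toy_data.rename(columns=toy_new_names)

print(toy_data.columns)        # original names are still here
print(toy_data_clean.columns)  # new names
```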

```python
# Step 4 check
print(pantheria_data_clean.columns)
print(iucn_data_clean.columns)
```

### Step 5: replace spaces in column so that data can be merged

Right now, column `'genus_species'` in `iucn_data_clean` has underscores in the string values, but column `'genus_species'` in `pantheria_data_clean` does not.

Update the `'genus_species'` column in `pantheria_data_clean` as follows.

First, replace the spaces `" "` in the species names stored in `pantheria_data_clean['genus_species']` with underscores `"_"` so that the punctuation matches in both dataframes that we are trying to merge. This creates a new `Series` of strings whose values have underscores, for example `'Canis_adustus'`.

Next, replace the `'genus_species'` column in `pantheria_data_clean` with this new `Series`. __(1 mark)__
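
The two steps above can be sketched on a made-up toy table using the string method `Series.str.replace`. The names `toy_clean` and `toy_with_underscores` are hypothetical; `regex=False` is passed so that the space is treated as a literal character.

```python
import pandas as pd

toy_clean = pd.DataFrame({
    "genus_species": ["Canis adustus", "Mus musculus"],
    "litter_size_ind": [4.5, 6.0],
})

# First: build a new Series in which every space is replaced by an underscore.
toy_with_underscores = toy_clean["genus_species"].str.replace(" ", "_", regex=False)

# Next: overwrite the existing column with the new Series.
toy_clean["genus_species"] = toy_with_underscores

print(toy_clean["genus_species"].tolist())
```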

```python
# Step 5 check
pantheria_data_clean.head()
```

### Step 6: merge the two dataframes

Merge (join) `pantheria_data_clean` and `iucn_data_clean` using function `merge`. Use `pantheria_data_clean` as the main (left) dataframe, and `iucn_data_clean` as the right dataframe. Join on column `'genus_species'`. Name the result `joined_pantheria_iucn_data`.  __(1 mark)__
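
As an illustration of how `merge` matches rows on a shared column, the sketch below joins two made-up toy tables. It shows both the default behaviour (`how="inner"`, which keeps only species found in both tables) and `how="left"` (which keeps every row of the left table). Which option this step expects is not specified here, so treat the choice as yours to check; all `toy_` names and values are hypothetical.

```python
import pandas as pd

toy_left = pd.DataFrame({
    "genus_species": ["Canis_adustus", "Mus_musculus", "Felis_catus"],
    "litter_size_ind": [4.5, 6.0, 4.0],
})
toy_right = pd.DataFrame({
    "genus_species": ["Canis_adustus", "Mus_musculus"],
    "iucn_status": ["LC", "LC"],
})

# Default join: only species present in BOTH tables are kept.
toy_inner = toy_left.merge(toy_right, on="genus_species")

# Left join: every row of toy_left is kept; unmatched rows get NaN for iucn_status.
toy_left_join = toy_left.merge(toy_right, on="genus_species", how="left")

print(len(toy_inner), len(toy_left_join))
```

Comparing the number of rows before and after a merge is a quick way to see how many species failed to match.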

```python
# Step 6 check
joined_pantheria_iucn_data.head()
```

### Step 7: eliminate irrelevant IUCN categories

Values `'DD'` and `'EP'` are not useful. In a Markdown cell, describe why we are eliminating these IUCN categories. __(1 mark)__

Now extract a new dataframe containing all rows with IUCN categories OTHER THAN `'DD'` and `'EP'` (missing data and errors) from `joined_pantheria_iucn_data`. Name this new dataframe `pantheria_iucn_clean`. __(1 mark)__
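
Filtering on two conditions at once can be sketched with `!=` and the element-wise "and" operator `&`. The toy table and the names `toy_joined`, `keep`, and `toy_filtered` below are hypothetical. Each comparison needs its own parentheses because `&` binds more tightly than `!=`.

```python
import pandas as pd

toy_joined = pd.DataFrame({
    "genus_species": ["Species_a", "Species_b", "Species_c", "Species_d"],
    "iucn_status": ["LC", "DD", "EN", "EP"],
})

# A Boolean Series: True where the status is neither 'DD' nor 'EP'.
keep = (toy_joined["iucn_status"] != "DD") & (toy_joined["iucn_status"] != "EP")

# Indexing with a Boolean Series keeps only the rows marked True.
toy_filtered = toy_joined[keep]

print(toy_filtered["iucn_status"].tolist())
```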

```python
# Step 7 check
pantheria_iucn_clean.head()
```

## Methods section

Using `pantheria_iucn_clean`, you will estimate a new measurement that we will call `max_lifetime_fecundity`. This will be computed using the following columns:

`'maturity_d'`: How long it takes for the average individual to grow to maturity. This is measured in days as the interval between birth and the time when the individual first reproduces.
 
`'longevity_m'`: The maximum length of time that individuals within each species can live, expressed in months.

`'interbirth_d'`: How long do adult females wait, on average, between giving birth and becoming pregnant again? This is measured in days.

`'litter_size_ind'`: How many babies do females within each species have at one time, on average?

The Computation section below describes in detail how these will be used.

## Computation section

The three measurements relating to time (`'maturity_d'`, `'longevity_m'`, and `'interbirth_d'`) are expressed in two different units. Convert each of these columns so that they are expressed in years. Name the three resulting `Series` `maturity_yr`, `longevity_yr`, and `interbirth_yr`, respectively.
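
Dividing a `Series` by a number divides every value in it, so a unit conversion is a single line. The sketch below uses made-up values and assumes 365 days per year and 12 months per year; the `toy_` names are hypothetical, and you should confirm which days-per-year value is expected.

```python
import pandas as pd

toy = pd.DataFrame({
    "maturity_d": [365.0, 730.0],   # days (made-up values)
    "longevity_m": [120.0, 24.0],   # months (made-up values)
})

# Days to years: divide by 365 (assumed days per year).
toy_maturity_yr = toy["maturity_d"] / 365

# Months to years: divide by 12.
toy_longevity_yr = toy["longevity_m"] / 12

print(toy_maturity_yr.tolist())
print(toy_longevity_yr.tolist())
```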

Also extract column `'litter_size_ind'` as a series and name it `litter_size_series`.

Estimate the maximum lifetime fecundity metric for each species using the formula: 

((longevity - maturity) / (interbirth)) * litter size

What are the units of our new column? What is its Python data type? __(2 marks)__ 

Create a new column in `pantheria_iucn_clean` called `'max_lifetime_fecundity'` that contains the maximum lifetime fecundity metric previously estimated. __(1 mark)__
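
To see how the formula behaves, here is a worked sketch on two made-up species, with all time values already in years. Arithmetic between `Series` is applied row by row, and assigning the result to a new column name adds that column. The `toy_` names and all numbers are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"genus_species": ["Species_a", "Species_b"]})

# Made-up values, already converted to years (litter size is a count).
toy_longevity_yr = pd.Series([10.0, 2.0])
toy_maturity_yr = pd.Series([2.0, 0.5])
toy_interbirth_yr = pd.Series([0.5, 0.25])
toy_litter_size = pd.Series([3.0, 6.0])

# ((longevity - maturity) / interbirth) * litter size, computed row by row.
toy_fecundity = ((toy_longevity_yr - toy_maturity_yr) / toy_interbirth_yr) * toy_litter_size

# Assigning to a new column name creates the column.
toy["max_lifetime_fecundity"] = toy_fecundity

print(toy)
```

For the first toy species, the reproductive lifespan is 10 - 2 = 8 years, which fits 8 / 0.5 = 16 litters of 3 offspring each, giving 48.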

```python
# Computation check
print(maturity_yr.head())
print(longevity_yr.head())
print(interbirth_yr.head())
print(litter_size_series.head())
pantheria_iucn_clean.head()
```

## Conclusion

Include cells with your answers to each of these questions:
 
1. Explain, in biological terms, what our new `max_lifetime_fecundity` metric measures. __(3 marks)__