# Homework 5 - Investigating Mammalian Fecundity and Conservation using Filtering, Joins, and Arithmetic

## Introduction

For this week's homework, you will build upon last week's work, where you cleaned up the Pantheria and IUCN data, merged them, and then computed a metric of maximum lifetime fecundity across mammals. In this assignment, we will delve more deeply into the conservation side of things to investigate whether there is a relationship between maximum lifetime fecundity and a species' risk of going extinct. 

### Question

The metric we computed last week (maximum lifetime fecundity) estimates a species' __reproductive potential__. That is, _how prodigious is each species at producing new offspring_? As biologists, we may have an intuition that species that are more capable of reproducing quickly may be more resilient to extinction as the environment changes. We will combine the metric from last week with the IUCN data to ask a targeted question about mammalian conservation:

**_Is there a difference in extinction risk between species with higher reproductive potential (greater maximum lifetime fecundity) vs. species with lower reproductive potential (smaller maximum lifetime fecundity)?_**

## Lab Instructions and Learning Objectives

You will be creating and submitting a data story answering a data science question. You will be required to submit your work in the same format as last time, complete with sections for *Introduction*, *Data*, *Methods*, *Computation*, and *Conclusion*.

In this lab, you will:
* Complete a data story in a notebook exploring the question: is the number of offspring birthed by a lineage related to its risk of extinction?
* Write and use Boolean expressions to recode the ordinal IUCN data. (Specifically, you're encouraged to practice using logical operators such as `!=`, `<=`, `>=`, `>`, `<`.)
* Perform a groupby operation to investigate maximum lifetime fecundity and extinction risk across mammalian orders, and then to compare fecundity between species that are at risk and those that are not.

## Due date 

You will submit your completed Homework 5 on MarkUs by *Fri, Feb 18 2022 at 11:59 PM EST*. MarkUs is ready.

##### EEB: How to submit

1. Download your homework to your local computer and save it as `EEB125_Homework_5.ipynb`.
2. Log in here: https://markus-ds.teach.cs.toronto.edu.
3. Submit your homework to `HW5: Homework 5`.

## Marking Rubric

Section     | 0 | 1 | 2 | 3
------------|---|---|---|---
Introduction|The question is not stated correctly or left blank | The question is stated correctly | NA | NA 
Data (for each python variable)       |auto test fails | auto test passes | NA | NA 
Methods (for each part) | No answer | The data extracted is specified or a reasonable rationale is given, but not both | Both the data extracted is specified and a reasonable rationale is given | NA
Computation |auto test fails | auto test passes | NA | NA 
Conclusion (for each part) | No answer | The question is answered but no explanation is given | The question is answered but the explanation is not supported or weakly supported by the data | The question is answered and the explanation is supported by the data 

## Introduction section

This should introduce the question being explored in a sentence. __(1 mark)__

## Data section

### Step 1: Gather your data from last week

Combine all of the code from last week's assignment into a single cell. That cell should produce:

+ `pantheria_iucn_clean`: the `DataFrame` you created in your assignment last week by merging the IUCN and Pantheria data, plus the new `max_lifetime_fecundity` column.


In [None]:
import pandas as pd

pantheria_raw = pd.read_csv('pantheria.txt', sep = '\t')
iucn_raw = pd.read_csv('phylacine.csv')
important_columns = ['MSW05_Order', 'MSW05_Binomial', '23-1_SexualMaturityAge_d',
                     '14-1_InterbirthInterval_d', '17-1_MaxLongevity_m', '15-1_LitterSize']
pantheria_data = pantheria_raw[important_columns]
important_columns = ['Binomial.1.2', 'IUCN.Status.1.2']
iucn_data = iucn_raw[important_columns]
pantheria_new_column_names = {'MSW05_Order': 'order',
                              'MSW05_Binomial': 'genus_species',
                              '23-1_SexualMaturityAge_d': 'maturity_d',
                              '14-1_InterbirthInterval_d': 'interbirth_d',
                              '17-1_MaxLongevity_m': 'longevity_m',
                              '15-1_LitterSize': 'litter_size_ind'}

iucn_data_new_column_names = {'Binomial.1.2':'genus_species', 
                         'IUCN.Status.1.2':'iucn_status'}

pantheria_data_clean = pantheria_data.rename(columns=pantheria_new_column_names)
iucn_data_clean = iucn_data.rename(columns=iucn_data_new_column_names)
pantheria_data_clean['genus_species'] = pantheria_data_clean['genus_species'].str.replace(" ","_")
joined_pantheria_iucn_data = pantheria_data_clean.merge(iucn_data_clean, 
                                                right_on='genus_species', # the right data frame is iucn 
                                                left_on='genus_species')   # the left data frame is pantheria 

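# pantheria_iucn_clean: drop rows whose IUCN status is 'DD' or 'EP'
nomiss = (joined_pantheria_iucn_data['iucn_status'] != 'DD') & (joined_pantheria_iucn_data['iucn_status'] != 'EP')

# .copy() gives an independent DataFrame, so adding a column below does not raise a SettingWithCopyWarning
pantheria_iucn_clean = joined_pantheria_iucn_data[nomiss].copy()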


maturity_yr = pantheria_iucn_clean['maturity_d'] / 365
longevity_yr = pantheria_iucn_clean['longevity_m'] / 12
interbirth_yr = pantheria_iucn_clean['interbirth_d'] / 365
litter_size_series = pantheria_iucn_clean['litter_size_ind']

max_lifetime_fecundity = (((longevity_yr - maturity_yr) / interbirth_yr) * litter_size_series)
pantheria_iucn_clean['max_lifetime_fecundity'] = max_lifetime_fecundity
pantheria_iucn_clean

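As a sanity check on the fecundity arithmetic in the cell above, here is a minimal worked example for a single made-up species. The numbers are hypothetical and chosen only so the result is easy to verify by hand.

```python
# Hypothetical life-history values for one made-up species
maturity_d = 730       # age at sexual maturity, in days
interbirth_d = 365     # interval between births, in days
longevity_m = 144      # maximum longevity, in months
litter_size_ind = 2    # individuals per litter

# Convert everything to years, as in the cell above
maturity_yr = maturity_d / 365      # 2 years
longevity_yr = longevity_m / 12     # 12 years
interbirth_yr = interbirth_d / 365  # 1 year

# Reproductive years divided by the interbirth interval gives the number
# of litters; multiplying by litter size gives lifetime offspring
max_lifetime_fecundity = ((longevity_yr - maturity_yr) / interbirth_yr) * litter_size_ind
print(max_lifetime_fecundity)  # 20.0
```

Ten reproductive years with one litter of two per year gives 20 offspring, which matches the formula.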

## Methods section

Examine the `'iucn_status'` column in `pantheria_iucn_clean`. You will find that it contains seven different categories, each corresponding to a particular level of extinction risk (remembering that we removed all rows corresponding to `'DD'`, or data deficient, last week): 

![](iucn.svg)

(Note that our dataset does not contain any species within the 'CD' category, so we will ignore it.)

These categories can be viewed as an **ordinal** statistical variable, reflecting an ordering of severity from a species having a low risk of going extinct (LC) to fully extinct (EX). **We will consider any level above _Near Threatened (NT)_ to be at risk.** 

Since we are interested in examining fecundity in at-risk species, we can use this ordering to simplify the seven IUCN risk levels to two categories: 'at risk' vs 'not at risk'. Recoding our data in this way will involve two steps.

### Step 1. Recode our IUCN risk categories as numeric levels

In this step, we want to specify the ordering scheme that we will use to represent the level of severity associated with each IUCN category.

Create a dictionary named `iucn_map` that links the IUCN risk categories as displayed in our dataset to the level of severity, expressed numerically from 0 (least severe) to 6 (most severe). __(1 mark)__

Use the `.replace()` function on the `iucn_status` column to change the IUCN risk category labels to numeric levels using the dictionary that we created above. Name the result `iucn_ord`. __(1 mark)__
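The dictionary-plus-`.replace()` pattern can be sketched on a toy example. The labels, the mapping, and the variable names below are hypothetical and unrelated to the IUCN data; the sketch only illustrates how an ordinal label is swapped for a numeric rank.

```python
import pandas as pd

# Hypothetical ordinal labels, unrelated to the homework data
sizes = pd.Series(['small', 'large', 'medium', 'small'])

# Dictionary linking each label to its rank, from 0 (lowest) upward
size_map = {'small': 0, 'medium': 1, 'large': 2}

# .replace() swaps every label for the value it maps to
sizes_ord = sizes.replace(size_map)
print(sizes_ord.tolist())  # [0, 2, 1, 0]
```

The original `sizes` Series is left unchanged; `.replace()` returns a new Series.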

In [None]:
# Step 1 check your work

assert isinstance(iucn_ord, pd.Series)
pantheria_iucn_clean.head()

### Step 2. Recode from set of 7 severity levels to Boolean risk categories

Remember that we consider any species above IUCN level 'Near Threatened' (NT) to be at risk. Create a new column called `'at_risk'` in our `pantheria_iucn_clean` dataset that is `True` for any species at IUCN level 2 (VU, 'Vulnerable') or higher and `False` for any species below IUCN level 2. You may use an intermediate variable to add this new column if you wish, but we will not check for it.
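Comparing a numeric column against a threshold produces one Boolean value per row, which can be stored as a new column. Here is a minimal sketch on made-up exam scores; the table, the column names, and the threshold of 50 are all hypothetical.

```python
import pandas as pd

# Hypothetical table of exam scores, unrelated to the homework data
scores = pd.DataFrame({'student': ['a', 'b', 'c', 'd'],
                       'score': [48, 50, 73, 91]})

# Comparing a column with a number yields one True/False value per row
scores['passed'] = scores['score'] >= 50
print(scores['passed'].tolist())  # [False, True, True, True]
```

Note that `>=` includes the threshold itself, while `>` would not.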

In [None]:
# Step 2 check your work

expected_columnnames = [
    'order', 'genus_species', 'maturity_d', 'interbirth_d', 'longevity_m',
    'litter_size_ind', 'iucn_status', 'max_lifetime_fecundity', 'at_risk'
]
assert expected_columnnames == list(pantheria_iucn_clean)

## Computation section

Using this new representation of conservation risk, we will now apply the `groupby()` function to explore the following biological question:

Do at-risk species have lower reproductive potential, on average, than species not at risk?

Perform the steps below to carry out the computations to gather evidence to inform your answer to this question.

## How does reproductive potential vary across mammalian orders?

### Step 1. Group data according to order

We want to examine how our `max_lifetime_fecundity` metric varies across groups of mammals. Use the `groupby()` function to group the rows of `pantheria_iucn_clean` by the `'order'` column. Name the grouped object `order_grouped`. __(1 mark)__
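To see what a grouped object looks like, here is a small sketch on a made-up table of pets. All names and values are hypothetical. The length of a grouped object is the number of distinct groups, which is what the check in the next cell relies on.

```python
import pandas as pd

# Hypothetical data, unrelated to the homework data
pets = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'bird', 'dog'],
                     'name': ['Tom', 'Rex', 'Kit', 'Tweety', 'Fido'],
                     'weight_kg': [4.0, 20.0, 5.0, 0.1, 30.0]})

# Group the rows by the value in the 'kind' column
kind_grouped = pets.groupby('kind')

# One group per distinct value of 'kind'
print(len(kind_grouped))  # 3

# Number of rows that fall into each group
print(kind_grouped.size())
```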

In [None]:
# check that the data is grouped appropriately

assert len(order_grouped) == 29

### Step 2. Calculate means according to groups.

Use the `mean()` function to calculate the mean for each column within each order across `order_grouped`. Name the result `order_grouped_means`. __(1 mark)__
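Here is a minimal sketch of taking group means, again on a hypothetical pets table. One caveat that may apply depending on your pandas version: from pandas 2.0 onward, calling `mean()` on groups that still contain text columns raises an error unless `numeric_only=True` is passed, so the sketch passes it explicitly.

```python
import pandas as pd

# Hypothetical data, unrelated to the homework data
pets = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'bird', 'dog'],
                     'name': ['Tom', 'Rex', 'Kit', 'Tweety', 'Fido'],
                     'weight_kg': [4.0, 20.0, 5.0, 0.1, 30.0]})

# Mean of every numeric column within each group; the text column
# 'name' is skipped because numeric_only=True
pets_means = pets.groupby('kind').mean(numeric_only=True)
print(pets_means)
```

The result is a `DataFrame` indexed by the grouping column, with one row per group.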

In [None]:
# check your work

order_grouped_means.head()

### Step 3. Sort the values according to `max_lifetime_fecundity` from largest to smallest

Use the `sort_values()` function to reorder `order_grouped_means` so that it descends from the highest `max_lifetime_fecundity` to the lowest. Name this sorted `DataFrame` `order_grouped_means_sorted`. __(1 mark)__

If you are having trouble figuring out how to order the rows from largest to smallest, consult the documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) to see what argument is needed.

In [None]:
# check your work

order_grouped_means_sorted.head()

### Step 4. Visualize the results using a barplot

Create a visualization of `max_lifetime_fecundity` across orders using `plot.bar()`.  Name this plot `order_mean_bar`. __(1 mark)__
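As an illustration of `plot.bar()`, the sketch below draws one bar per index label of a small, made-up Series. The data and names are hypothetical, and the `Agg` backend line is an assumption that only matters when running outside a notebook; in Jupyter it can be left out.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend for running outside a notebook
import pandas as pd

# Hypothetical group means, indexed by group label
mean_weights = pd.Series({'dog': 25.0, 'cat': 4.5, 'bird': 0.1}, name='weight_kg')

# plot.bar() draws one bar per index label and returns the matplotlib Axes
weight_bar = mean_weights.plot.bar()
weight_bar.set_ylabel('mean weight (kg)')
```

Because the plot is drawn in index order, sorting the data first controls the order of the bars.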

## Do at-risk species have lower reproductive potential than species not at risk?

### Step 1. Group rows according to risk category

Group `pantheria_iucn_clean` by column `'at_risk'` and name the result `risk_grouped`. __(1 mark)__
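Grouping by a Boolean column works the same way as grouping by a text column and yields at most two groups. Here is a minimal sketch with hypothetical data, which also shows how to select a single column from the grouped object before averaging.

```python
import pandas as pd

# Hypothetical data with a Boolean column, unrelated to the homework data
pets = pd.DataFrame({'indoor': [True, False, True, False],
                     'weight_kg': [4.0, 20.0, 5.0, 30.0]})

# One group for False and one for True
indoor_grouped = pets.groupby('indoor')

# Select one column from the grouped object, then average it per group
indoor_means = indoor_grouped['weight_kg'].mean()
print(indoor_means)
```

The result is a Series indexed by `False` and `True`, which can be passed straight to a bar plot.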

### Step 2. Visualize our results using a barplot

Make a barplot similar to the one you made for the previous question to visualize average `'max_lifetime_fecundity'` between species at risk vs not at risk. Name the barplot `risk_group_plot`. __(1 mark)__

## Conclusion

Include cells with your answers to each of these questions:

1. Which mammalian order has the highest reproductive potential (maximum lifetime fecundity)? Which has the lowest (do not consider orders with missing values)? Google the names of the highest and lowest orders and describe in 1-2 words what type of animal corresponds to each. __(3 marks)__

2. Are species with greater reproductive potential (higher `max_lifetime_fecundity`) at lower risk of going extinct? If so, why do you think that might be the case? Feel free to speculate! __(3 marks)__
