# Lab 3 — Data Wrangling and Exploratory Data Analysis

## Lab Instructions and Learning Objectives 

Don't delete any of the cells in this notebook, and add markdown/code cells when asked.

In this lab you will:

  - Understand and implement crosstabulation in `pandas`.
  - Read an English specification and translate it into code.
  - Become more familiar with the terminology you've seen so far.

Some rodent species give birth to small litters, say fewer than 3 offspring. Some are active at night. Some are large, some are small.

Here is the question you will explore: _Among rodent species that give birth to small litters and are active during nighttime, what proportion are also small-bodied?_

## How to submit

1. Log in here: https://markus-ds.teach.cs.toronto.edu (Tip: Control/Command-click to open it in a new tab so you can still see these instructions.)

2. Choose your course.

3. Click the `lab3: Lab week 3` assessment.

4. Click the `Submissions` tab. The new page is `lab3: Submissions`.

5. Click the `Upload File` button at the bottom right.

6. Click the `Choose Files` button.

7. Select the `Lab_3.ipynb` file that you downloaded in the previous task, then click `Save`.

## Due Date  

You will submit your completed labs as Notebook files on MarkUs. We have heard feedback that the time pressure to submit the lab is stressful, so we are changing the lab deadlines on MarkUs to Fridays at 10am.

## Marking Rubric

1 mark for having all the right variable names, plus 1 mark per correct variable type and value.

## Data science recipes

We have written up [a few data science recipes](https://jupyter.utoronto.ca/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FUofTCompDSci%2Frecipes&branch=main&urlpath=tree%2Frecipes%2Fdata_science_recipes.ipynb) in a Jupyter notebook. You'll find these:

+ Read a CSV file
+ Calculate a statistic about a single column
+ Subset a DataFrame by extracting columns
+ Rename the columns in a DataFrame
+ Subset a DataFrame by filtering rows that have a particular property

Please feel free to suggest ones that you'd like us to write for you.

## Lab 3 Introduction

In this lab, you will continue to work with the data from the week 3 class: Pantheria, an ecological dataset of mammals. You will keep practising Boolean conditions, combine multiple conditions, rename variables, and add new columns to an existing `DataFrame`.


As usual, these labs are meant to facilitate your understanding of the material from lectures in a low-stakes environment. Please feel free to refer to your lecture content, collaborate with your peers, and seek out help from your TAs. 

### Data Description 

We will work with the subset of the data we created during the lecture, but for the order Rodentia. The subset contains several variables describing taxonomy (order, genus, species), morphological characteristics, and activity patterns.

Notice how we use blank lines to organize the programming concepts.

In [None]:
import pandas as pd
pantheria = pd.read_csv('pantheria.csv', sep="\t")

# Subset by column
important_columns = ["MSW05_Order","MSW05_Genus","MSW05_Species","5-1_AdultBodyMass_g",
                     "12-2_Terrestriality","6-2_TrophicLevel","1-1_ActivityCycle",
                     "15-1_LitterSize"]
sub_pantheria = pantheria[important_columns]

# Rename
columnnames = {'MSW05_Order': 'order',
               'MSW05_Genus': 'genus',
               'MSW05_Species': 'species',
               '5-1_AdultBodyMass_g': 'body_mass_g',
               '12-2_Terrestriality': 'terrestriality',
               '6-2_TrophicLevel': 'trophic_level',
               '1-1_ActivityCycle': 'activity_cycle',
               '15-1_LitterSize': 'litter_size'}
sub_pantheria_col_names = sub_pantheria.rename(columns= columnnames)
print(f'sub_pantheria_col_names shape: {sub_pantheria_col_names.shape}')

# Subset rows: build a Boolean Series that is True for the order Rodentia
order_series = sub_pantheria_col_names['order']
order_rodents = order_series == 'Rodentia'

# This selects only the rows for order 'Rodentia'
rodents_df = sub_pantheria_col_names[order_rodents]

print(f'rodents_df shape: {rodents_df.shape}')

rodents_df.head()

## Question 1

We'll start by partitioning body mass into three categories: small (under 100g), medium (100g to 500g) and large (over 500g).

Write code to extract the `'body_mass_g'` column and name it `body_mass_col`. (What type is it?)

Now create 3 Boolean `Series` based on `body_mass_col`:
+ One named `s_body` that contains `True` only for rodents who weigh less than 100g
+ One named `m_body` that contains `True` only for rodents that weigh between 100g and 500g, inclusive.
    - Hint: Use `(body_mass_col >= 100) & (body_mass_col <= 500)`, which:
        * makes a Boolean `Series` that is `True` when the body mass is >= 100g (and `False` otherwise)
        * makes a Boolean `Series` that is `True` when the body mass is <= 500g (and `False` otherwise)
        * combines the two to make a Boolean `Series` that is `True` only when both of those `Series` are `True`.

+ One named `l_body` that contains `True` only for rodents who weigh more than 500g.
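
As a rough illustration of the hint above, here is a minimal sketch on made-up numbers. The name `toy_mass` and its values are hypothetical and are not taken from Pantheria; the point is only to show how two comparisons combine with `&`. The parentheses matter because `&` binds more tightly than `>=` and `<=` in Python.

```python
import pandas as pd

# Hypothetical body masses in grams (illustration only, not Pantheria data)
toy_mass = pd.Series([20.0, 100.0, 350.0, 500.0, 900.0])

at_least_100 = toy_mass >= 100          # element-wise comparison gives a Boolean Series
at_most_500 = toy_mass <= 500
in_range = at_least_100 & at_most_500   # True only where both Series are True

# The same result in a single expression; the parentheses are required
in_range_one_line = (toy_mass >= 100) & (toy_mass <= 500)

print(in_range.tolist())
print(in_range.sum())   # True counts as 1, so sum() counts the True values
```

Summing a Boolean `Series` is one way to answer a "how many" question about it.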

How many species are large?

In [None]:
# place your answer in this cell


In [None]:
# Q1 check
print(s_body.value_counts())
print(m_body.value_counts())
print(l_body.value_counts())

## Question 2

Now that we have Boolean `Series` for each of the three categories, we can add a new column that has value `1` if the rodent is small, `2` if it is medium, and `3` if it is large.

Below, we make a copy of your `DataFrame` and name it `activity_analysis`. It's good practice to do this before starting to add columns and otherwise modify the data.

Now write 3 assignment statements to add a column called `'size_cat'` to the `activity_analysis` `DataFrame`. You'll need to use `.loc`. Use the Boolean `Series` `s_body`, `m_body`, and `l_body` from Question 1 as row indexers, one per assignment statement. (The 'cat' in `'size_cat'` is short for 'category', not a feline.)
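
If the `.loc` pattern is unfamiliar, the following sketch shows its general shape on a tiny made-up `DataFrame`. Every name here (`toy_df`, `toy_copy`, `is_light`, `is_heavy`, `'heavy_code'`) and every value is hypothetical; it is not the answer to this question, only an example of assigning a value to the rows picked out by a Boolean `Series`.

```python
import pandas as pd

# Hypothetical data (illustration only)
toy_df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                       "mass": [20.0, 300.0, 900.0, 50.0]})

# Work on a copy so the original DataFrame is left untouched
toy_copy = toy_df.copy()

is_light = toy_copy["mass"] < 100
is_heavy = toy_copy["mass"] >= 100

# .loc[row_selector, column_name] = value
# The first assignment creates the column; rows not selected stay missing (NaN)
toy_copy.loc[is_light, "heavy_code"] = 0
# The second assignment fills in the remaining rows
toy_copy.loc[is_heavy, "heavy_code"] = 1

print(toy_copy)
```

Because the two Boolean `Series` cover every row between them, no missing values are left in the new column once both assignments have run.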


In [None]:
# place your answer in this cell
activity_analysis = rodents_df.copy()


In [None]:
# Q2 check
print(activity_analysis.shape)
activity_analysis.head()

## Question 3

The `'activity_cycle'` column describes the time of day in which each species is most active, coded as `1`: nocturnal only, `2`: crepuscular, `3`: diurnal. We're going to create a new column called `'nighttime_activity'` that is `True` for nocturnal and crepuscular rodents, but is `False` for diurnal rodents.

Extract the `'activity_cycle'` column from the `activity_analysis` `DataFrame` and name that column `cycle`. It's a `Series` of `1`s, `2`s and `3`s.

Time to make the two Boolean `Series` that we're going to use to create the new column. Name them `nighttime` and `daytime`.

`daytime` is where `cycle == 3`.

For `nighttime`: create two _other_ Boolean `Series`, one each for nocturnal and crepuscular. Name them anything you want. Then combine them using `|` (or) and assign the result to `nighttime`.

Using `nighttime` and `daytime`, add a new column called `'nighttime_activity'` that is `True` when the `nighttime` `Series` is `True` and `False` when the `daytime` `Series` is `True`.
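
The sketch below illustrates the same idea on made-up codes, assuming (as the hypothetical `toy_codes` does) that some rows have a missing code. All names and values are invented for illustration. It shows `|` combining two Boolean `Series`, and what happens to rows that neither selector picks out.

```python
import numpy as np
import pandas as pd

# Hypothetical category codes, with one missing value (illustration only)
toy_codes = pd.Series([1.0, 3.0, 2.0, np.nan, 3.0])

is_one = toy_codes == 1
is_two = toy_codes == 2
one_or_two = is_one | is_two     # True where either Series is True
is_three = toy_codes == 3

toy_frame = pd.DataFrame({"code": toy_codes})
toy_frame.loc[one_or_two, "is_low"] = True
toy_frame.loc[is_three, "is_low"] = False

# The row with a missing code is selected by neither Boolean Series,
# so its value in the new column stays missing
print(toy_frame)
```

This may explain why the counts of `True` and `False` in a column built this way do not have to add up to the number of rows.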
 
How many rodent species are active during nighttime?

In [None]:
# place your answer in this cell


In [None]:
# Q3 check
print(f'cycle.head\n{cycle.head()}')
print()
print(f'There are {sum(nighttime)} nighttime creatures') # We get 407
print(f'There are {sum(daytime)} daytime creatures') # We get 151
print()
print(activity_analysis.head())

## Question 4

Which species have fewer than 3 offspring and are active at night?

The `'litter_size'` column has the average litter size for each rodent species. You'll notice it contains numbers like `1.94`, because it's a `Series` of floating-point numbers. You'll also notice `NaN`, which stands for "Not a Number". That means the data is missing. Those species won't be included in our results.
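
To see why missing values drop out, here is a small sketch on invented numbers (`toy_litter` and `toy_flag` are hypothetical names and values). A comparison against `NaN` evaluates to `False`, so a row with missing data is never selected by a condition like `< 3`.

```python
import numpy as np
import pandas as pd

# Hypothetical litter sizes with one missing value (illustration only)
toy_litter = pd.Series([1.94, np.nan, 4.2, 2.0])
toy_flag = pd.Series([True, True, True, False])

is_small = toy_litter < 3            # NaN < 3 evaluates to False
small_and_flag = is_small & toy_flag

print(is_small.tolist())
print(small_and_flag.tolist())
print(toy_litter.isna().sum())       # number of missing values
```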

Extract the `'litter_size'` column and name it `litter_size_col`. It is a floating-point `Series`. Use it to make a Boolean `Series` named `small_litter` that is `True` when the litter size is less than 3.

Then do the same for your new `nighttime_activity` column. Luckily, you already have a Boolean `Series` named `nighttime` with the appropriate `True` values.

Combine `small_litter` and `nighttime` using `&` (and), and name it `small_litter_night`. These are the critters that have small litters and are active at night.

In [None]:
# place your answer in this cell


In [None]:
# Q4 check
print(f'litter_size_col:\n{litter_size_col.head()}')
print(f'small_litter:\n{small_litter.head()}')
print(f'small_litter_night:\n{small_litter_night.head()}')

# We get 117 for this next one:
print(f'There are {sum(small_litter_night)} types of small-litter nighttime critters.')

## Question 5

Use the `small_litter_night` Boolean `Series` to select the rows of `activity_analysis` where the rodent species both have small litters and are active at night. Name this new `DataFrame` `s_litter_nighttime_df`.
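
As a reminder of how row selection with a Boolean `Series` behaves, here is a minimal sketch on a made-up `DataFrame` (`toy_df`, `mask`, and `subset` are hypothetical names). The result keeps every column, keeps only the rows where the selector is `True`, and keeps the original row labels.

```python
import pandas as pd

# Hypothetical data (illustration only)
toy_df = pd.DataFrame({"species": ["a", "b", "c", "d"],
                       "litter": [1.5, 4.0, 2.0, 6.0]})

mask = toy_df["litter"] < 3
subset = toy_df[mask]    # rows where mask is True, all columns

print(subset.shape)          # (number of True values, number of columns)
print(subset.index.tolist()) # original row labels are preserved
```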

In [None]:
# place your answer in this cell


In [None]:
# Q5 check
print(f's_litter_nighttime_df.shape: {s_litter_nighttime_df.shape}')  # We get (117, 10)
s_litter_nighttime_df

## Question 6

These two Boolean `Series` have been created:

+ `small_litter_night`: `True` for critters that have small litters and are active at night.

+ `s_body`: `True` for critters that weigh less than 100g.

Let's confirm they have the same number of rows.

In [None]:
print(small_litter_night.shape)
print(s_body.shape)

Let's take the intersection. Combine them with `&` and name the resulting Boolean `Series` `s_body_s_litter_night`. (Zounds, these names are fun.)
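
For a sense of what to expect from the check cell, this sketch combines two invented Boolean `Series` (`toy_a` and `toy_b` are hypothetical) and counts the result in two ways. `value_counts()` reports both the `True` and the `False` counts, while `sum()` reports only the number of `True` values.

```python
import pandas as pd

# Hypothetical Boolean Series of equal length (illustration only)
toy_a = pd.Series([True, True, False, True])
toy_b = pd.Series([True, False, False, True])

both = toy_a & toy_b              # intersection: True where both are True

counts = both.value_counts()      # counts of True and of False
print(counts)
print(both.sum())                 # number of True values only
```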

In [None]:
# place your answer in this cell


In [None]:
# Q6 check
s_body_s_litter_night.value_counts()  # We get 52 True values.

## Question 7 

Remember that in Question 1 you partitioned body mass into three categories (small, medium, large), and in Question 2 you created a column called `'size_cat'` with values `1`, `2`, and `3`?

Let's extract that column as a `Series` and name it `size`. Now compare `size` to `1` using `==` to get a Boolean `Series`, and name it `size_small`.

Subset the rows for size `1` rodents by using `size_small` as an index into the `activity_analysis` `DataFrame`. This makes a `DataFrame` containing only the small rodents. Name it `size_small_rodents`.

Recall that `small_litter_night` is `True` for small-litter rodents that are active at night, and that `s_body` is `True` for small-body rodents. Create a crosstab for these two `Series` and name it `small_nighttime`.

Among rodent species that give birth to small litters and are active during nighttime, what proportion are also small-bodied? You should be able to access that value like this: `small_nighttime.iloc[1, 1]`. Name that value `small_night_pct`. (Note that `pd.crosstab` returns counts by default; its `normalize` parameter converts them to proportions.)
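
Here is a small sketch of `pd.crosstab` on two invented Boolean `Series` (`in_group` and `has_trait` are hypothetical names and values). It shows the default table of counts and, as one possible choice, `normalize='index'`, which divides each row by its row total. Which normalization this lab expects is an assumption on our part, so treat this as an illustration of the function rather than the answer.

```python
import pandas as pd

# Hypothetical Boolean Series (illustration only)
in_group = pd.Series([True, True, True, False, False], name="in_group")
has_trait = pd.Series([True, False, True, True, False], name="has_trait")

# Rows come from the first argument, columns from the second.
# Labels are sorted, so False comes before True on both axes.
counts = pd.crosstab(in_group, has_trait)
print(counts)

# normalize="index" turns each row into proportions that sum to 1
props = pd.crosstab(in_group, has_trait, normalize="index")
print(props)

# Position [1, 1] is the (True, True) cell
print(counts.iloc[1, 1])
print(props.iloc[1, 1])
```

In this toy table, the `[1, 1]` cell of `props` is the share of `in_group` rows that also have `has_trait` set to `True`.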

In [None]:
# place your answer in this cell


In [None]:
# Q7 check
print(f'size.value_counts:\n{size.value_counts()}')
print(f'size_small.value_counts:\n{size_small.value_counts()}')
print(f'size_small_rodents:\n{size_small_rodents.head()}')
print(f'The answer is {small_night_pct}')