# Discussion 14

This Discussion activity is a component of your [group mini-project](https://philchodrow.github.io/PIC16A/project/). While the usual Discussion expectations apply with regard to your participation grade (i.e. if you work for the full 50 minutes, you will get full credit), it is recommended for the purposes of your final project that you coordinate with your group to eventually complete all parts of this assignment. 

## Group Roles

The roles for this Discussion activity are slightly modified. The Driver and Proposer are the same as usual. Instead of a Reviewer, use an **Interpreter**. The job of the Interpreter is to think about the significance of each of the code outputs in the context of the long-term project goal (classifying penguin species). In parts of the Discussion where the problems ask you to explain or interpret your findings, the Interpreter should suggest responses to the Proposer and Driver. The **Interpreter** may also give code feedback when the group is writing functions. 

## Part A

Run the code below to load the necessary libraries and read in the Palmer Penguins data set as a `pandas` data frame called `penguins`. 

In [1]:
import pandas as pd
import numpy as np
import urllib.request  # the request submodule must be imported explicitly
from matplotlib import pyplot as plt

url = "https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv"
filedata = urllib.request.urlopen(url)
to_write = filedata.read()

with open("palmer_penguins.csv", "wb") as f:
    f.write(to_write)
    
penguins = pd.read_csv("palmer_penguins.csv")

# shorten the species name
penguins["Species"] = penguins["Species"].str.split().str.get(0)
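
As a quick illustration of what the last line of the cell above does, here is a minimal sketch on a toy Series. The full species strings are assumed for illustration only (the real file may spell them differently); the point is that `.str.split().str.get(0)` keeps the first whitespace-separated word of each entry.

```python
import pandas as pd

# Hypothetical full species labels, used only for illustration
species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Chinstrap penguin (Pygoscelis antarctica)",
    "Gentoo penguin (Pygoscelis papua)",
])

# split each string on whitespace, then keep the first piece
short = species.str.split().str.get(0)
print(short.tolist())
```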

In [2]:
# optional code here if you need to refresh your memory of the data set
penguins.head()

| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
| 1 | PAL0708 | 2 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
| 2 | PAL0708 | 3 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
| 3 | PAL0708 | 4 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
| 4 | PAL0708 | 5 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |


## Part B

Write a function called `penguin_summary_table` which accepts two arguments, `group_cols` and `value_cols`. This function should create a table in which the mean of each element of `value_cols` is shown, grouped according to the specified `group_cols`. For example, the call 

```python
penguin_summary_table(["Species"], ["Culmen Length (mm)", "Culmen Depth (mm)"])
```

should produce a summary table with the mean culmen length and depth per species. 

For a more pleasant display, **round the numbers in your table to 2 decimal places**. This can be done using the code `my_data_frame.round(2)`. 

This function can be implemented in just a few lines. Comments and docstrings are not necessary. 

In [26]:
# your solution here
def penguin_summary_table(group_cols, value_cols):
    # mean and standard deviation of each value column within each group
    return penguins.groupby(group_cols)[value_cols].aggregate(["mean", "std"])

penguin_summary_table(["Species"], ["Culmen Length (mm)", "Culmen Depth (mm)"])

| Species | Culmen Length (mm), mean | Culmen Length (mm), std | Culmen Depth (mm), mean | Culmen Depth (mm), std |
|---|---|---|---|---|
| Adelie | 38.791391 | 2.663405 | 18.346358 | 1.21665 |
| Chinstrap | 48.833824 | 3.339256 | 18.420588 | 1.135395 |
| Gentoo | 47.504878 | 3.081857 | 14.982114 | 0.98122 |
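
The output above keeps full precision and includes a standard deviation column. For comparison, a minimal sketch of a version that follows the prompt literally (means only, rounded to 2 decimal places) might look like the following. The name `summary_table_rounded` and the explicit `df` argument are assumptions made so that the example runs on its own small data frame; they are not part of the assignment. The toy rows are copied from the `penguins.head()` and `penguins.tail()` outputs in this notebook.

```python
import pandas as pd

def summary_table_rounded(df, group_cols, value_cols):
    # group, average each value column, then round for display
    return df.groupby(group_cols)[value_cols].mean().round(2)

# a few rows standing in for the full penguins data frame
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "Culmen Length (mm)": [39.1, 39.5, 46.8, 50.4],
    "Culmen Depth (mm)": [18.7, 17.4, 14.3, 15.7],
})

table = summary_table_rounded(toy, ["Species"], ["Culmen Length (mm)", "Culmen Depth (mm)"])
print(table)
```

Rounding only at the end keeps the averages themselves exact and changes only what is displayed.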


## Part C

Use your function to explore the data a bit. Focus on the physiological variables:

- `Culmen Length (mm)`
- `Culmen Depth (mm)`
- `Flipper Length (mm)`
- `Body Mass (g)`
- `Delta 15 N (o/oo)`
- `Delta 13 C (o/oo)`

These last two variables are measures of nitrogen and carbon isotopes in the penguins' bloodstreams. 

**Create at least three readable summary tables.** Then, work with your **Interpreter** to explain the significance of each table. Do you observe any important differences between the penguin species?

Make sure that each table has a message, and that no information is shown that is not part of that message. Is there a part of the table that you have nothing to say about? Remove it! 


- **Hint**: "This table suggests that there's not much of a difference between..." is a fine explanation of the table, as long as it's warranted. 
- **Hint**: consider the sex of the penguins as well as the species. 

In [14]:
# Table 1
x = ["Culmen Length (mm)","Culmen Depth (mm)", "Flipper Length (mm)","Body Mass (g)", "Delta 15 N (o/oo)", "Delta 13 C (o/oo)"]
penguin_summary_table(["Species"],x[0:2])

| Species | Culmen Length (mm), mean | Culmen Length (mm), std | Culmen Length (mm), var | Culmen Depth (mm), mean | Culmen Depth (mm), std | Culmen Depth (mm), var |
|---|---|---|---|---|---|---|
| Adelie | 38.791391 | 2.663405 | 7.093725 | 18.346358 | 1.21665 | 1.480237 |
| Chinstrap | 48.833824 | 3.339256 | 11.15063 | 18.420588 | 1.135395 | 1.289122 |
| Gentoo | 47.504878 | 3.081857 | 9.497845 | 14.982114 | 0.98122 | 0.962792 |


#### Discussion of Table 1

The Gentoo population has a noticeably shallower culmen than the Adelie and Chinstrap populations: a mean depth of about 14.98 mm, compared with 18.35 mm and 18.42 mm, respectively. The standard deviations are also small (roughly 1 to 1.2 mm), so the gap between Gentoo and the other two species is roughly three standard deviations wide and should be easy to distinguish.

Culmen length separates the Adelie and Chinstrap populations: their means differ by about 10 mm (38.8 mm vs. 48.8 mm), while the standard deviations of both populations are around 3 mm.

To tell the species apart, we could therefore start by using culmen depth and culmen length as features in a tree-based classifier (e.g. a random forest, RF) to predict species.
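
One rough way to put a number on "how separated" two species are is to divide the difference in means by a typical standard deviation. The sketch below does this with the values from Table 1. The helper `standardized_gap` and the choice of the root-mean-square of the two standard deviations are assumptions made for illustration (the measure ignores group sizes); this is not a method prescribed by the assignment.

```python
import math

def standardized_gap(mean_a, std_a, mean_b, std_b):
    # difference in means divided by the root-mean-square of the two stds
    typical_std = math.sqrt((std_a**2 + std_b**2) / 2)
    return abs(mean_a - mean_b) / typical_std

# numbers copied from Table 1
depth_gap = standardized_gap(18.346358, 1.21665, 14.982114, 0.98122)     # Adelie vs. Gentoo, culmen depth
length_gap = standardized_gap(38.791391, 2.663405, 48.833824, 3.339256)  # Adelie vs. Chinstrap, culmen length
print(round(depth_gap, 2), round(length_gap, 2))
```

Both gaps come out at around three standard deviations, which is consistent with treating these two columns as useful for telling the species apart.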

In [29]:
# Table 2
penguin_summary_table(["Species","Sex"],x[2:4])

| Species | Sex | Flipper Length (mm), mean | Flipper Length (mm), std | Body Mass (g), mean | Body Mass (g), std |
|---|---|---|---|---|---|
| Adelie | FEMALE | 187.794521 | 5.595035 | 3368.835616 | 269.380102 |
| Adelie | MALE | 192.410959 | 6.599317 | 4043.493151 | 346.811553 |
| Chinstrap | FEMALE | 191.735294 | 5.754096 | 3527.205882 | 285.333912 |
| Chinstrap | MALE | 199.911765 | 5.976558 | 3938.970588 | 362.13755 |
| Gentoo | . | 217.0 | NaN | 4875.0 | NaN |
| Gentoo | FEMALE | 212.706897 | 3.897856 | 4679.741379 | 281.578294 |
| Gentoo | MALE | 221.540984 | 5.673252 | 5484.836066 | 313.158596 |


#### Discussion of Table 2
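Gentoo penguins are noticeably heavier than the other two species: pooled across sexes, their mean body mass is about 5076 g, versus roughly 3700 g for both Adelie and Chinstrap. The same pattern holds within each sex in Table 2 (for example, Gentoo females average about 4680 g, while Adelie and Chinstrap females average about 3369 g and 3527 g). Flipper length separates Gentoo from the other two populations as well (a pooled mean of about 217 mm, compared with 190 mm and 196 mm).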
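
Table 2 also contains a `Gentoo` row whose sex is recorded as `.`; its standard deviations are empty, which suggests that it comes from a single penguin. One possible way to keep that row out of a summary table is to filter on the expected labels before grouping. The sketch below uses a small made-up data frame that mimics the stray entry; it is an illustration under that assumption, not a required step.

```python
import pandas as pd

# toy rows (made up) that mimic the stray "." entry seen in Table 2
toy = pd.DataFrame({
    "Species": ["Gentoo", "Gentoo", "Gentoo", "Adelie", "Adelie"],
    "Sex": ["FEMALE", "MALE", ".", "FEMALE", "MALE"],
    "Body Mass (g)": [4850.0, 5750.0, 4875.0, 3800.0, 3750.0],
})

# keep only rows whose Sex is one of the two expected labels
clean = toy[toy["Sex"].isin(["MALE", "FEMALE"])]
table = clean.groupby(["Species", "Sex"])[["Body Mass (g)"]].mean().round(2)
print(table)
```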

In [28]:
# Table 3
penguin_summary_table(["Species","Sex"],x[4:6])

| Species | Sex | Delta 15 N (o/oo), mean | Delta 15 N (o/oo), std | Delta 13 C (o/oo), mean | Delta 13 C (o/oo), std |
|---|---|---|---|---|---|
| Adelie | FEMALE | 8.793275 | 0.475914 | -25.794158 | 0.613175 |
| Adelie | MALE | 8.928437 | 0.362755 | -25.833813 | 0.562443 |
| Chinstrap | FEMALE | 9.250962 | 0.32204 | -24.565405 | 0.241078 |
| Chinstrap | MALE | 9.464535 | 0.386763 | -24.527679 | 0.238612 |
| Gentoo | . | 8.04111 | NaN | -26.18444 | NaN |
| Gentoo | FEMALE | 8.193405 | 0.279057 | -26.197205 | 0.534377 |
| Gentoo | MALE | 8.303429 | 0.245151 | -26.170608 | 0.554716 |


#### Discussion of Table 3
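Stratifying by sex adds little here: within each species, males and females have nearly identical mean Delta 15 N and Delta 13 C values. Across species the gaps are small in absolute terms (roughly 1 o/oo), although Chinstrap penguins do sit somewhat higher on both measures (about 9.3 to 9.5 vs. 8.2 to 8.9 for Delta 15 N, and about -24.5 vs. -25.8 to -26.2 for Delta 13 C). We did not find these differences as easy to use as the body measurements.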


## Part D

Based on your findings from these tables, propose a miniature decision tree to help distinguish between the penguin species. Your decision tree might have rules like the following: 

1. First, check the island on which the penguin was found. 
    1. If Torgersen, then check the body mass. 
        1. If the body mass is over 4,000g, then guess Adelie. 
        1. Otherwise, guess Chinstrap
    1. If Biscoe, then check the sex of the penguin. 
        1. If female, guess Gentoo
        1. Otherwise, guess Chinstrap
    1. If Dream, then guess Adelie.     
      
Your decision tree should operate using no more than three columns from the data. 

Below your decision tree, write an explanation of how you came up with it and how the tables that you created above informed your choices. 

If you like, you may skip ahead to Part E and write your decision tree directly as a Python function. You should then explain your reasoning as a docstring in the function rather than typing it here.  

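1. First, check the body mass. 
    1. If the body mass is under 4,500g, then check the culmen length. 
        1. If the culmen length is 42mm or more, then guess Chinstrap. 
        1. Otherwise, guess Adelie. 
    1. Otherwise, check the culmen depth. 
        1. If the culmen depth is less than 16.5mm, then guess Gentoo. 
        1. Otherwise, guess Adelie. 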


In [31]:
penguin_summary_table(["Species"],x[0:2])

| Species | Culmen Length (mm), mean | Culmen Length (mm), std | Culmen Depth (mm), mean | Culmen Depth (mm), std |
|---|---|---|---|---|
| Adelie | 38.791391 | 2.663405 | 18.346358 | 1.21665 |
| Chinstrap | 48.833824 | 3.339256 | 18.420588 | 1.135395 |
| Gentoo | 47.504878 | 3.081857 | 14.982114 | 0.98122 |


In [33]:
penguin_summary_table(["Species"],x[2:4])

| Species | Flipper Length (mm), mean | Flipper Length (mm), std | Body Mass (g), mean | Body Mass (g), std |
|---|---|---|---|---|
| Adelie | 189.953642 | 6.539457 | 3700.662252 | 458.566126 |
| Chinstrap | 195.823529 | 7.131894 | 3733.088235 | 384.335081 |
| Gentoo | 217.186992 | 6.484976 | 5076.01626 | 504.116237 |


## Part E 

Write a function called `decision_tree` that implements your decision tree. It should accept as input single values of the relevant variables, and then return as output the guessed species of a penguin. Here's an example for the decision tree above: 

```python
def decision_tree(island, mass, sex):
    if island == "Torgersen":
        if mass > 4000:
            return "Adelie"
        else:
            return "Chinstrap"
    elif island == "Biscoe":
        if sex == "FEMALE":
            return "Gentoo"
        else:
            return "Chinstrap"
    else: 
        return "Adelie"
    
decision_tree("Biscoe", 5000, "MALE")
```
```
'Chinstrap'
```

Comments and docstrings are not necessary in this case, unless you skipped Part D. 


In [36]:
# your decision tree function here
def decision_tree(length, depth, mass):
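    if mass < 4500:
        # lighter penguins: culmen length separates Chinstrap from Adelie
        if length >= 42:
            return "Chinstrap"
        else:
            return "Adelie"
    else:
        # heavier penguins: a shallow culmen points to Gentoo
        if depth < 16.5:
            return "Gentoo"
        else:
            return "Adelie"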



In [41]:
penguins.tail()

| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 339 | PAL0910 | 120 | Gentoo | Anvers | Biscoe | Adult, 1 Egg Stage | N38A2 | No | 12/1/09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 340 | PAL0910 | 121 | Gentoo | Anvers | Biscoe | Adult, 1 Egg Stage | N39A1 | Yes | 11/22/09 | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE | 8.41151 | -26.13832 | NaN |
| 341 | PAL0910 | 122 | Gentoo | Anvers | Biscoe | Adult, 1 Egg Stage | N39A2 | Yes | 11/22/09 | 50.4 | 15.7 | 222.0 | 5750.0 | MALE | 8.30166 | -26.04117 | NaN |
| 342 | PAL0910 | 123 | Gentoo | Anvers | Biscoe | Adult, 1 Egg Stage | N43A1 | Yes | 11/22/09 | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE | 8.24246 | -26.11969 | NaN |
| 343 | PAL0910 | 124 | Gentoo | Anvers | Biscoe | Adult, 1 Egg Stage | N43A2 | Yes | 11/22/09 | 49.9 | 16.1 | 213.0 | 5400.0 | MALE | 8.3639 | -26.15531 | NaN |


In [42]:
decision_tree(46.8, 14.3,4850)

'Gentoo'
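
One detail worth keeping in mind: some rows, such as row 339 in the `penguins.tail()` output above, have no measurements at all. Every comparison with `NaN` evaluates to `False`, so a tree built from `<` and `>=` checks falls through to its last branch for such rows. The sketch below repeats the tree from this part so that it runs on its own, and adds a hypothetical wrapper, `decision_tree_or_none`, that declines to guess when a value is missing. The wrapper is an assumption for illustration, not something the assignment asks for.

```python
import math

def decision_tree(length, depth, mass):
    if mass < 4500:
        return "Chinstrap" if length >= 42 else "Adelie"
    return "Gentoo" if depth < 16.5 else "Adelie"

def decision_tree_or_none(length, depth, mass):
    # hypothetical wrapper: return None instead of guessing on missing data
    if any(math.isnan(v) for v in (length, depth, mass)):
        return None
    return decision_tree(length, depth, mass)

nan = float("nan")
print(decision_tree(nan, nan, nan))          # all comparisons are False, so the last branch is used
print(decision_tree_or_none(nan, nan, nan))  # no guess for a row without measurements
```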

## Part F

The following code will generate a guess for each penguin using the `decision_tree` function shown above. Modify the line that defines the `guesser` function according to the variables required by your decision tree. Then, run the code to create a new column called `Guess` containing the species guess for each penguin. 

In [43]:
# modify the first line, then run
guesser = lambda r: decision_tree(r["Culmen Length (mm)"], r["Culmen Depth (mm)"], r["Body Mass (g)"])
penguins["Guess"] = penguins.apply(guesser, axis = 1)
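
To see what `apply` with `axis = 1` is doing, here is a minimal sketch on four rows copied from the `penguins.head()` and `penguins.tail()` outputs in this notebook. The data frame `toy` is an assumption introduced so that the example runs on its own; the tree is repeated from Part E for the same reason.

```python
import pandas as pd

def decision_tree(length, depth, mass):
    if mass < 4500:
        return "Chinstrap" if length >= 42 else "Adelie"
    return "Gentoo" if depth < 16.5 else "Adelie"

toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "Culmen Length (mm)": [39.1, 39.5, 46.8, 50.4],
    "Culmen Depth (mm)": [18.7, 17.4, 14.3, 15.7],
    "Body Mass (g)": [3750.0, 3800.0, 4850.0, 5750.0],
})

# axis=1 hands the function one row at a time; r[...] picks a single value from that row
guesser = lambda r: decision_tree(r["Culmen Length (mm)"], r["Culmen Depth (mm)"], r["Body Mass (g)"])
toy["Guess"] = toy.apply(guesser, axis=1)
print(toy[["Species", "Guess"]])
```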

## Part G

Compute the accuracy of your decision tree -- what percentage of the time does your decision tree give you the right answer? 

**Hint**: this is a one-liner. 

In [55]:
# your solution here

penguins["Accuracy"] = penguins["Guess"] == penguins["Species"]
penguins["Accuracy"].sum() / len(penguins)  # correct guesses divided by the number of rows

0.875

In [57]:
(penguins["Guess"] == penguins["Species"]).mean()  # same result: the mean of a boolean Series is the fraction of True values

0.875
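
A single accuracy number does not show which species get confused with which. One possible follow-up, sketched here on made-up labels and guesses (not the real results), is to cross-tabulate the true species against the guesses with `pd.crosstab`.

```python
import pandas as pd

# made-up labels and guesses, for illustration only
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "Guess":   ["Adelie", "Adelie", "Chinstrap", "Adelie", "Gentoo", "Gentoo"],
})

accuracy = (toy["Guess"] == toy["Species"]).mean()     # fraction of correct guesses
confusion = pd.crosstab(toy["Species"], toy["Guess"])  # rows: true species, columns: guess
print(accuracy)
print(confusion)
```

Off-diagonal entries in the cross-tabulation count the mistakes, so they point at the branch of the tree that might need a different threshold.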


Soon, we'll learn how to use Python to automatically generate good decision trees without us needing to eyeball the data. 