# Exercise 4.1: Long-term trends in hybridization of Darwin's finches

<hr>

[Peter and Rosemary Grant](https://en.wikipedia.org/wiki/Peter_and_Rosemary_Grant) have been working on the Galápagos island of Daphne Major for over forty years.  During this time, they have collected lots and lots of data about physiological features of finches.  In 2014, they published a book with a summary of some of their major results (Grant P. R., Grant B. R., *40 years of evolution. Darwin's finches on Daphne Major Island*, Princeton University Press, 2014). They made their data from the book publicly available via the [Dryad Digital Repository](http://dx.doi.org/10.5061/dryad.g6g3h).

We will investigate their measurements of beak depth (the distance, top to bottom, of a closed beak) and beak length (base to tip on the top) of Darwin's finches.  We will look at data from two species, *Geospiza fortis* and *Geospiza scandens*.  The Grants provided data on the finches of Daphne for the years 1973, 1975, 1987, 1991, and 2012.  I have included the data in the files `grant_1973.csv`, `grant_1975.csv`, `grant_1987.csv`, `grant_1991.csv`, and `grant_2012.csv`. They are in almost exactly the same format as in the Dryad repository; I have only deleted blank entries at the end of the files.

**Note**: If you want to skip the wrangling (which is very valuable experience), you can go directly to part (d). You can load in the data frame you generate in parts (a) through (c) from the file `~/git/bootcamp/data/grant_complete.csv`.

**a)** Load each of the files into separate Pandas data frames.  You might want to inspect the file first to make sure you know what character the comments start with and if there is a header row.
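
As a minimal sketch of the inspect-then-load step, the snippet below parses a small in-memory stand-in for one of the files. The file contents, the `#` comment character, and the use of `*` for missing entries are assumptions made for illustration; check the real files before relying on them.

```python
import io

import pandas as pd

# Hypothetical file contents standing in for one of the Grant files;
# the real files may use a different comment character or header layout.
raw = """# hypothetical comment line describing the data
band,species,beak length,beak depth
20123,fortis,9.25,8.05
20126,fortis,*,10.45
"""

# Peek at the first few raw lines before parsing
for line in raw.splitlines()[:3]:
    print(line)

# comment="#" skips comment lines; na_values="*" treats "*" as missing
df_toy = pd.read_csv(io.StringIO(raw), comment="#", na_values="*")
print(df_toy)
```

For a file on disk, the same keyword arguments can be passed to `pd.read_csv()` with the file name in place of the `io.StringIO` object.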

**b)** We would like to merge these all into one data frame.  The problem is that they have different header names, and only the 1973 file has a year entry (called `yearband`).  This is common with real data.  It is often a bit messy and requires some wrangling.  

1. First, change the name of the `yearband` column of the 1973 data to `year`.  Also, make sure the year format is four digits, not two!  
2. Next, add a `year` column to the other four data frames.  You want tidy data, so each row in the data frame should have an entry for the year.
3. Change the column names so that all the data frames have the same column names.  I would choose column names

    `['band', 'species', 'beak length (mm)', 'beak depth (mm)', 'year']`

4. Concatenate the data frames into a single data frame. Be careful with indices! If you use `pd.concat()`, you will need to use the `ignore_index=True` kwarg. You might also need to use the `axis` kwarg.
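
The four steps above can be sketched on two toy data frames. The values and the second frame's header names (`blength`, `bdepth`) are assumptions for illustration; the real files may use other headers, so inspect them first.

```python
import pandas as pd

# Two toy frames with mismatched headers (hypothetical values)
df_a = pd.DataFrame(
    {
        "band": [1, 2],
        "species": ["fortis", "scandens"],
        "yearband": [73, 73],
        "beak length": [9.2, 14.0],
        "beak depth": [8.0, 9.5],
    }
)
df_b = pd.DataFrame(
    {"band": [3], "species": ["fortis"], "blength": [10.1], "bdepth": [8.8]}
)

# Steps 1 and 3: rename columns to the common names
df_a = df_a.rename(
    columns={
        "yearband": "year",
        "beak length": "beak length (mm)",
        "beak depth": "beak depth (mm)",
    }
)
df_b = df_b.rename(
    columns={"blength": "beak length (mm)", "bdepth": "beak depth (mm)"}
)

# Step 1 (continued): convert two-digit years to four digits
df_a["year"] = df_a["year"] + 1900

# Step 2: add a year column where it is missing
df_b["year"] = 1975

# Step 4: stack the rows and rebuild the index
cols = ["band", "species", "beak length (mm)", "beak depth (mm)", "year"]
df_all = pd.concat([df_a[cols], df_b[cols]], ignore_index=True)
print(df_all)
```

Selecting `cols` from each frame before concatenating fixes the column order and would raise a `KeyError` if a rename was missed, which is a cheap sanity check.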

**c)** The `band` field gives the number of the band on the bird's leg that was used to tag it. Are some birds counted twice? Are they counted twice in the same year? Do you think you should drop duplicate birds from the same year? How about different years? My opinion is that you should drop duplicate birds from the same year and keep the others, but I would be open to discussion on that. To practice your Pandas skills, though, let's delete only duplicate birds from the same year from the data frame. When you have made this data frame, save it as a CSV file.

*Hint*: The data frame methods `duplicated()` and `drop_duplicates()` will be useful.
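
A small sketch of how these two methods behave, using made-up band numbers and measurements rather than the Grant data:

```python
import pandas as pd

# Toy data: band 100 appears twice in 1973 and once more in 1975
df_toy = pd.DataFrame(
    {
        "band": [100, 100, 100, 200],
        "year": [1973, 1973, 1975, 1975],
        "beak depth (mm)": [8.0, 8.0, 8.3, 9.1],
    }
)

# Rows whose band appears more than once in any year
dup_any_year = df_toy["band"].duplicated(keep=False)

# Rows whose (band, year) pair appears more than once
dup_same_year = df_toy.duplicated(subset=["band", "year"], keep=False)

print(dup_any_year.sum(), dup_same_year.sum())

# Keep the first row for each (band, year) pair and rebuild the index
df_clean = df_toy.drop_duplicates(subset=["band", "year"]).reset_index(drop=True)
print(df_clean)
```

With `keep=False`, every member of a duplicated group is flagged, which makes it easier to look at the repeated rows side by side before deciding what to drop.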

After doing this work, it is worth saving your tidy data frame in a CSV document. Do this using the `to_csv()` method of your data frame. Since the indices are uninformative, you should use the `index=False` kwarg. (I have already done this and saved it as `~/git/bootcamp/data/grant_complete.csv`, which will help you do the rest of the exercise if you have problems with this part.)

**d)** Make plots exploring how beak depth changes over time for each species. Think about what might be effective ways to display the data.
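
Before plotting, a numerical summary per species and year can help you decide what to display. The following is a sketch on invented numbers, assuming the column names suggested in part (b):

```python
import pandas as pd

# Toy measurements (hypothetical values, not the Grant data)
df_toy = pd.DataFrame(
    {
        "species": ["fortis"] * 4 + ["scandens"] * 4,
        "year": [1973, 1973, 2012, 2012] * 2,
        "beak depth (mm)": [8.0, 9.0, 9.0, 10.0, 9.0, 9.4, 9.2, 9.6],
    }
)

# Median beak depth and number of birds for each species in each year
summary = (
    df_toy.groupby(["species", "year"])["beak depth (mm)"]
    .agg(["median", "count"])
    .reset_index()
)
print(summary)
```

Grouping by both `species` and `year` keeps the two species separate, which is what the question asks for; grouping by `year` alone would mix them.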

**e)** It is informative to plot the measurement of each bird's beak as a point in the beak depth-beak length plane.  For the 1987 data, plot beak depth vs. beak length for *Geospiza fortis* and for *Geospiza scandens*. The function you wrote in [Exercise 3.5](exercise_3.5.ipynb) will be useful to do this.

**f)** Do part (e) again for all years. _Hint_: To display all of the plots, check out the [Bokeh documentation for layouts](https://bokeh.pydata.org/en/latest/docs/user_guide/layout.html). Make sure all plots have the same range on the axes. If you want to set two plots, say `p1` and `p2`, to have the same axis ranges, you can do the following.

```python
p1.x_range = p2.x_range
p1.y_range = p2.y_range
```
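
An alternative is to compute one common range from the full data set and reuse it for every year's plot. The sketch below does this with pandas on invented numbers; the helper `padded_range` and its `pad_fraction` parameter are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Toy measurements across two years (hypothetical values)
df_toy = pd.DataFrame(
    {
        "year": [1973, 1973, 2012, 2012],
        "beak length (mm)": [9.0, 14.0, 10.0, 15.0],
        "beak depth (mm)": [8.0, 9.5, 9.0, 10.0],
    }
)


def padded_range(values, pad_fraction=0.05):
    """Return (low, high) covering all values, widened by a small margin."""
    low, high = float(values.min()), float(values.max())
    pad = pad_fraction * (high - low)
    return (low - pad, high + pad)


# Ranges computed over all years, so every plot can share them
x_range = padded_range(df_toy["beak length (mm)"])
y_range = padded_range(df_toy["beak depth (mm)"])
print(x_range, y_range)
```

These tuples can be passed as the `x_range` and `y_range` arguments of `bokeh.plotting.figure()` when creating each year's plot, so that all plots start with identical axes.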

<br />

In [8]:
import os
os.chdir('/Users/jschaefer/git/bootcamp/data/grant-data')

#`grant_1973.csv`, `grant_1975.csv`, `grant_1987.csv`, `grant_1991.csv`,`grant_2012.csv`

In [9]:
#A load data into dataframes
import numpy as np
import pandas as pd

df1973 = pd.read_csv('grant_1973.csv', na_values='*', comment = '#')
df1975 = pd.read_csv('grant_1975.csv', na_values='*', comment = '#')
df1987 = pd.read_csv('grant_1987.csv', na_values='*', comment = '#')
df1991 = pd.read_csv('grant_1991.csv', na_values='*', comment = '#')
df2012 = pd.read_csv('grant_2012.csv', na_values='*', comment = '#')

df1973

Unnamed: 0,band,species,yearband,beak length,beak depth
0,20123,fortis,73,9.25,8.05
1,20126,fortis,73,11.35,10.45
2,20128,fortis,73,10.15,9.55
3,20129,fortis,73,9.95,8.75
4,20133,fortis,73,11.55,10.15
...,...,...,...,...,...
84,20224,scandens,73,15.65,9.95
85,20245,scandens,73,14.05,9.55
86,20254,scandens,73,13.85,9.15
87,20259,scandens,73,14.95,10.45


In [43]:
#B

#1.change `yearband` in df1973 to `year`.  
#2.change 'year' format from YY to YYYY  
#3.add 'year' column to all datasets
#4.change all column names to `['band', 'species', 'beak length (mm)', 'beak depth (mm)', 'year']`
#5.Concatenate the df. Be careful with indices! If you use `pd.concat()`, you will need to use the `ignore_index=True` kwarg. You might also need to use the `axis` kwarg.

df1973 = df1973.rename(columns={'yearband': 'year', 'beak length': 'beak length (mm)', 'beak depth': 'beak depth (mm)'})
df1973['year'] = df1973['year'].replace(73, 1973)
df1973 = df1973[[col for col in df1973.columns if col != 'year'] + ['year']]
df1975['year'] = 1975
df1975 = df1975.rename(columns={'Beak length, mm': 'beak length (mm)', 'Beak depth, mm': 'beak depth (mm)'})
df1987['year'] = 1987
df1987 = df1987.rename(columns={'Beak length, mm': 'beak length (mm)', 'Beak depth, mm': 'beak depth (mm)'})
df1991['year'] = 1991
df1991 = df1991.rename(columns={'blength': 'beak length (mm)', 'bdepth': 'beak depth (mm)'})
df2012['year'] = 2012
df2012 = df2012.rename(columns={'blength': 'beak length (mm)', 'bdepth': 'beak depth (mm)'})

df = pd.concat([df1973, df1975, df1987, df1991, df2012], ignore_index=True)
#axis flag not needed, because concatenation is by rows, default is rows (row --> axis=0, column --> axis=1)

df


Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
0,20123,fortis,9.25,8.05,1973
1,20126,fortis,11.35,10.45,1973
2,20128,fortis,10.15,9.55,1973
3,20129,fortis,9.95,8.75,1973
4,20133,fortis,11.55,10.15,1973
...,...,...,...,...,...
2299,21295,scandens,14.20,9.30,2012
2300,21297,scandens,13.00,9.80,2012
2301,21340,scandens,14.60,8.90,2012
2302,21342,scandens,13.10,9.80,2012


In [46]:
#C
#1. remove duplicates in 'band' from the same year; `duplicated()` and `drop_duplicates()
#2. save as .csv; `to_csv()` `index=False` kwarg  `~/git/bootcamp/data/grant_complete.csv`

# Delete rows with the same 'band' and 'year', keeping the first occurrence
df_subset_duplicates = df.drop_duplicates(subset=['band', 'year'])
df_subset_duplicates.to_csv('grant_complete_JHS.csv', index=False)

df_subset_duplicates

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
0,20123,fortis,9.25,8.05,1973
1,20126,fortis,11.35,10.45,1973
2,20128,fortis,10.15,9.55,1973
3,20129,fortis,9.95,8.75,1973
4,20133,fortis,11.55,10.15,1973
...,...,...,...,...,...
2299,21295,scandens,14.20,9.30,2012
2300,21297,scandens,13.00,9.80,2012
2301,21340,scandens,14.60,8.90,2012
2302,21342,scandens,13.10,9.80,2012


In [50]:
#D Plot beak depth changes over time for each species.

import numpy as np
import pandas as pd
import iqplot
import bokeh.io
import bokeh.models
import bokeh.plotting
bokeh.io.output_notebook()


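df = pd.read_csv('grant_complete_JHS.csv', na_values='*', comment = '#')

# Median beak depth for each species in each year
median_depth = df.groupby(['species', 'year'])['beak depth (mm)'].median()

# Box plot of beak depth, split by species and then by year
p = iqplot.box(
    data=df,
    q="beak depth (mm)",
    cats=["species", "year"],
)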

bokeh.io.show(p)

In [82]:
#E

#1 plot 'beak depth (mm)' against 'beak length (mm)' for 1987 Geospiza fortis and Geospiza scandens

import numpy as np
import pandas as pd
import iqplot
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()


df = pd.read_csv('grant_complete_JHS.csv', na_values='*', comment='#')
df_selected = df[(df["year"] == 1987) & (df["species"].isin(["fortis", "scandens"]))]

p = bokeh.plotting.figure(
    title="Beak Depth vs Beak Length (1987)",
    x_axis_label="Beak Length (mm)",
    y_axis_label="Beak Depth (mm)",
    width=400,
    height=400,
)

for species, color in zip(["fortis", "scandens"], ["blue", "orange"]):
    species_data = df_selected[df_selected["species"] == species]
    p.circle(
        x=species_data["beak length (mm)"],
        y=species_data["beak depth (mm)"],
        size=8,
        color=color,
        alpha=0.6,
        legend_label=species,
    )

p.legend.title = "Species"
p.legend.location = "top_left"

bokeh.io.show(p)



In [None]:
#F

# Layout example with toy data (not the finch data)
from bokeh.io import output_file, show
from bokeh.layouts import row
from bokeh.plotting import figure

output_file("layout.html")

x = list(range(11))
y0 = x
y1 = [10 - i for i in x]
y2 = [abs(i - 5) for i in x]

# create a new plot
s1 = figure(width=250, height=250, title=None)
s1.scatter(x, y0, marker="circle", size=10, color="navy", alpha=0.5)

# create another one
s2 = figure(width=250, height=250, title=None)
s2.scatter(x, y1, marker="triangle", size=10, color="firebrick", alpha=0.5)

# create a third one
s3 = figure(width=250, height=250, title=None)
s3.scatter(x, y2, marker="square", size=10, color="olive", alpha=0.5)

# put all the plots in a single row
p = row(s1, s2, s3)

# show the results
show(p)