# Discussion Section 3
***

### Archive exploration

About: 
- Data on snowshoe hares in Bonanza Creek. Data were collected from 1999 to 2025.

Citation:
- Kielland, K., F.S. Chapin, R.W. Ruess, and Bonanza Creek LTER. 2017. Snowshoe hare physical data in Bonanza Creek Experimental Forest: 1999-Present ver 22. Environmental Data Initiative. https://doi.org/10.6073/pasta/03dce4856d79b91557d8e6ce2cbcdc14 (Accessed 2025-10-17).

![hare_image](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg/1452px-SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg?20170313021652)

In [30]:
import pandas as pd
import numpy as np

URL = "https://pasta.lternet.edu/package/data/eml/knb-lter-bnz/55/22/f01f5d71be949b8c700b6ecd1c42c701"
hares = pd.read_csv(URL)

In [31]:
hares

Unnamed: 0,date,time,grid,trap,l_ear,r_ear,sex,age,weight,hindft,notes,b_key,session_id,study
0,11/26/1998,,bonrip,1A,414D096A08,,,,1370.0,160.0,,917.0,51,Population
1,11/26/1998,,bonrip,2C,414D320671,,M,,1430.0,,,936.0,51,Population
2,11/26/1998,,bonrip,2D,414D103E3A,,M,,1430.0,,,921.0,51,Population
3,11/26/1998,,bonrip,2E,414D262D43,,,,1490.0,135.0,,931.0,51,Population
4,11/26/1998,,bonrip,3B,414D2B4B58,,,,1710.0,150.0,,933.0,51,Population
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3375,8/8/2002,18:00:00,bonrip,1b,1201,1202,,,1400.0,,,63.0,64,Population
3376,8/8/2002,6:00:00,bonrip,4b,1201,1202,,,,,,63.0,64,Population
3377,8/7/2002,,bonrip,4b,1217,1218,,,1000.0,134.0,,69.0,64,Population
3378,8/8/2002,,bonrip,6d,1217,1218,,,990.0,,,69.0,64,Population


In [32]:
hares.isna().sum()

date             0
time          3116
grid             0
trap            12
l_ear           48
r_ear          169
sex            352
age           2111
weight         535
hindft        1747
notes         3137
b_key           47
session_id       0
study          163
dtype: int64

In [33]:
print("Max Weight", hares["weight"].max())
print("Min Weight", hares["weight"].min())

print("Max Hind Foot Length", hares["hindft"].max())
print("Min Hind Foot Length", hares["hindft"].min())

Max Weight 2365.0
Min Weight 0.0
Max Hind Foot Length 160.0
Min Hind Foot Length 60.0


In [34]:
hares["notes"].unique()

array([nan, 'No right ear tag', 'Escapee', 'Mortality', 'Mortality ',
       'Old tag lost in L ear',
       'Bunny escaped before second ear tag was added',
       'Rabbit too bloody, released', 'R Front Foot Injured',
       'L Hind Leg Injured',
       'Left Front Foot Injured by Mink. Mink Still Around, Not Shy',
       'Injured Bunny, Released, No Tags', 'Died after release',
       'Dead in trap', 'Dead', 'non-pregnant',
       'pregnant (2 peanut sized babies)', 'pregnant', 'Pregnant',
       'Pregnant; last collar was chewed off',
       '149.074 recapture; collar loose, removed and replaced; non-pregnant',
       'previous collar was chewed off',
       '149.013 came off/removed; replaced',
       '149.033 recapture; collar loose, removed and replaced',
       'previous collar fell off',
       'collar previously chewed off (put back on the same bunny!)',
       'collar broke off, caught in cage', 'dead in trap',
       '149.754 recapture; no VHF signal, removed and replaced',

### Study Question
Is there a relationship between snowshoe hare weight and hind foot size?
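
One way to start on this question is a correlation between the two measurements. The sketch below is only an illustration: it uses a handful of weight and hind foot values copied from the table preview above, not the full dataset, and `toy` and `complete` are names introduced here. Pearson correlation is an assumed choice of statistic, not necessarily the analysis intended for this section.

```python
import pandas as pd

# Toy rows copied from the table preview (weight, hindft); None marks a missing measurement
toy = pd.DataFrame({
    "weight": [1370.0, 1430.0, 1490.0, 1710.0, 1000.0, 840.0],
    "hindft": [160.0, None, 135.0, 150.0, 134.0, 114.0],
})

# Keep only rows where both measurements are present
complete = toy.dropna(subset=["weight", "hindft"])

# Pearson correlation between weight and hind foot length
r = complete["weight"].corr(complete["hindft"])
print(f"n = {len(complete)}, Pearson r = {r:.2f}")
```

On the real data the same two steps would apply to `hares`, keeping in mind that `hindft` is missing for a large share of rows.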

<table width = 50%>
    <tr>
        <th>Value</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>m</td>
        <td>male</td>
    </tr>
    <tr>
        <td>f</td>
        <td>female</td>
    </tr>
    <tr>
        <td>m?</td>
        <td>male not confirmed</td>
    </tr>
</table>

In [35]:
hares['sex'].value_counts()

sex
F     1161
M      730
f      556
m      515
?       40
F?      10
f        4
m        4
f?       3
M?       2
m?       2
pf       1
Name: count, dtype: int64

The `dropna` parameter of `value_counts` controls whether NaN values are excluded from the counts. It defaults to `True`, so the missing values are left out of the output above; passing `dropna=False` includes them.

In [36]:
hares['sex'].value_counts(dropna=False)

sex
F      1161
M       730
f       556
m       515
NaN     352
?        40
F?       10
f         4
m         4
f?        3
M?        2
m?        2
pf        1
Name: count, dtype: int64

**Do the values in the sex column correspond to the values declared in the metadata?**
They do not. The metadata declares only `m`, `f`, and `m?`, but the column also contains uppercase variants (`M`, `F`, `M?`, `F?`), the uncertain codes `?` and `f?`, and an additional value `pf` that the metadata does not define.

**What could have been potential causes for multiple codes?**
This could be due to multiple people working on the dataset without proper communication, leading to a lack of standardization.

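**Are there seemingly repeated values? If so, what could be the cause?**
Yes. In the `sex` column, `f` and `m` each appear twice in the counts, which suggests that some entries differ only by invisible characters such as trailing whitespace. There are also four rows flagged as duplicates in the data frame. This could be due to the same hare being caught multiple times on the same date.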
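
The counts for `sex` list `f` and `m` twice each, so some labels probably differ by characters that are not visible in the printed output. A minimal sketch of how to check this, using a made-up Series (the trailing spaces are an assumption about the real data, not something confirmed here):

```python
import pandas as pd

# Hypothetical sample; the trailing spaces are an assumption about the real data
sex = pd.Series(["f", "f ", "m", "m ", "F", None])

# repr() puts quotes around each label, which makes stray whitespace visible
labels = [repr(v) for v in sex.dropna().unique()]
print(labels)

# Stripping whitespace and lowercasing collapses the variants into one code each
cleaned = sex.str.strip().str.lower()
print(cleaned.value_counts(dropna=False))
```

Running the `repr` step on `hares['sex']` would show whether the repeated labels really are whitespace variants.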


In [37]:
hares[hares.duplicated()]

Unnamed: 0,date,time,grid,trap,l_ear,r_ear,sex,age,weight,hindft,notes,b_key,session_id,study
2893,7/1/2011,,bonbs,10a,,,,,,,juvenile,,23,Population
2894,7/1/2011,,bonbs,10a,,,,,,,juvenile,,23,Population
2895,7/1/2011,,bonbs,10a,,,,,,,juvenile,,23,Population
3071,9/11/2012,,bonbs,10d,b2834,b2835,f,j,840.0,114.0,,838.0,31,Population
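
By default `duplicated()` uses `keep="first"`, so it flags only the later copies and leaves the first occurrence of each repeated row out of the result. A small sketch on made-up rows (loosely modeled on the output above) of how `keep=False` shows every copy instead:

```python
import pandas as pd

# Hypothetical captures: the first three rows are identical
toy = pd.DataFrame({
    "date": ["7/1/2011", "7/1/2011", "7/1/2011", "9/11/2012"],
    "trap": ["10a", "10a", "10a", "10d"],
    "notes": ["juvenile", "juvenile", "juvenile", None],
})

# Default keep="first": the first occurrence is not flagged
repeats_only = toy[toy.duplicated()]

# keep=False flags every row that has an identical copy
all_copies = toy[toy.duplicated(keep=False)]

print(len(repeats_only), len(all_copies))
```

Applied to `hares`, `keep=False` would also return the original rows that the four flagged rows repeat.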


**Brainstorm**
We would take all values with an f and rename them to female.
Then take all values with an m and rename them to male.
Then take all other values and store them as unknown.

In [None]:
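# Conditions to select the male and female codes
# ('m_' and 'f_' do not occur in the value counts above, so they match nothing)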
conditions = [
    (hares['sex'].isin(['m', 'M', 'm_'])),
    (hares['sex'].isin(['f', 'F', 'f_']))
]

choices = ['male', 'female']

default = 'unknown'

hares['sex_simple'] = np.select(conditions, choices, default=default)

In [47]:
hares['sex_simple'].value_counts()

sex_simple
female     1717
male       1245
unknown     418
Name: count, dtype: int64
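
The `unknown` group here also absorbs any `f` or `m` labels that differ only by whitespace, since `isin` matches exact strings. A possible alternative is to normalize the codes first and then map them. The sketch below uses a made-up Series; the whitespace variants in it are an assumption, and treating `m?`, `f?`, `?`, and `pf` as unknown is a choice made for this example.

```python
import pandas as pd

# Hypothetical raw codes; 'f ' and 'm ' (trailing space) are assumed variants
raw = pd.Series(["F", "f", "f ", "M", "m ", "m?", "?", "pf", None])

# Normalize case and whitespace before recoding
normalized = raw.str.strip().str.lower()

# Map only the confirmed codes; everything else (including NaN) becomes 'unknown'
sex_simple = normalized.map({"m": "male", "f": "female"}).fillna("unknown")

print(sex_simple.value_counts())
```

Recoding `hares['sex']` this way could shift a few rows out of `unknown`, so the group counts and means would differ slightly from the ones shown in this notebook.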

In [48]:
hares.groupby('sex_simple')['weight'].mean()

sex_simple
female     1366.920372
male       1352.145553
unknown    1176.511111
Name: weight, dtype: float64

We can see that the mean weight of females is slightly higher than that of males, though in general there is very little difference between the two. The mean weight of the unknown group is noticeably lower, at about 1176.5.
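
Group means are easier to interpret next to the number of weights behind them, because `mean()` skips missing values and `weight` has many. A minimal sketch on made-up numbers (the `toy` frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data; one female weight is missing
toy = pd.DataFrame({
    "sex_simple": ["female", "female", "male", "male", "unknown"],
    "weight": [1400.0, np.nan, 1300.0, 1500.0, 1000.0],
})

# 'count' reports non-missing weights per group; 'mean' ignores NaN
summary = toy.groupby("sex_simple")["weight"].agg(["count", "mean"])
print(summary)
```

The same `agg(["count", "mean"])` call on `hares.groupby('sex_simple')['weight']` would show how many measurements each mean is based on.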

In [None]:
# Final code chunk

# Import required packages
import pandas as pd
import numpy as np

URL = "https://pasta.lternet.edu/package/data/eml/knb-lter-bnz/55/22/f01f5d71be949b8c700b6ecd1c42c701"
hares = pd.read_csv(URL)

# Conditions to select the male and female codes
conditions = [
    (hares['sex'].isin(['m', 'M', 'm_'])),
    (hares['sex'].isin(['f', 'F', 'f_']))
]

choices = ['male', 'female']

default = 'unknown'

# Group by the recoded sex labels and compute the mean weight per group
hares.groupby(np.select(conditions, choices, default=default))['weight'].mean()


female     1366.920372
male       1352.145553
unknown    1176.511111
Name: weight, dtype: float64