## Snowshoe hares at Bonanza Creek Experimental Forest

### Week 3 - Discussion section



1. Archive exploration

Take some time to look through the dataset’s description in EDI and click around. Discuss the following questions with your team:

- What is this data about?
- During what time frame were the observations in the dataset collected?
- Does the dataset contain sensitive data?
- Is there a publication associated with this dataset?

In your notebook: use a markdown cell to add a brief description of the dataset, including a citation, date of access, and a link to the archive.

Back in the EDI repository, click on View Full Metadata to access more information if you haven’t done so already. Go to the “Detailed Metadata” section and click on “Data Entities”. Take some time to look at the descriptions for the dataset’s columns.



This dataset documents snowshoe hare population numbers in the Bonanza Creek area in Alaska. The data is not sensitive and was collected from 1999 to 2017.

Flora, B.K. 2002. Comparison of snowshoe hare populations in Interior. M.S. Thesis. University of Alaska Fairbanks. Fairbanks, AK, USA.

Kielland, Knut (2017) *Snowshoe hare physical data in Bonanza Creek Experimental Forest: 1999-Present*

[doi:10.6073/pasta/03dce4856d79b91557d8e6ce2cbcdc14](https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-bnz.55.22)

![SNOWSHOE HARE (Lepus americanus)](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg/1089px-SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg?20170313021652)

In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('data/55_Hare_Data_2012.txt')

In [2]:
df.head()

Unnamed: 0,date,time,grid,trap,l_ear,r_ear,sex,age,weight,hindft,notes,b_key,session_id,study
0,11/26/1998,,bonrip,1A,414D096A08,,,,1370.0,160.0,,917.0,51,Population
1,11/26/1998,,bonrip,2C,414D320671,,M,,1430.0,,,936.0,51,Population
2,11/26/1998,,bonrip,2D,414D103E3A,,M,,1430.0,,,921.0,51,Population
3,11/26/1998,,bonrip,2E,414D262D43,,,,1490.0,135.0,,931.0,51,Population
4,11/26/1998,,bonrip,3B,414D2B4B58,,,,1710.0,150.0,,933.0,51,Population


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3380 entries, 0 to 3379
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        3380 non-null   object 
 1   time        264 non-null    object 
 2   grid        3380 non-null   object 
 3   trap        3368 non-null   object 
 4   l_ear       3332 non-null   object 
 5   r_ear       3211 non-null   object 
 6   sex         3028 non-null   object 
 7   age         1269 non-null   object 
 8   weight      2845 non-null   float64
 9   hindft      1633 non-null   float64
 10  notes       243 non-null    object 
 11  b_key       3333 non-null   float64
 12  session_id  3380 non-null   int64  
 13  study       3217 non-null   object 
dtypes: float64(3), int64(1), object(10)
memory usage: 369.8+ KB


In [4]:
df.dtypes

date           object
time           object
grid           object
trap           object
l_ear          object
r_ear          object
sex            object
age            object
weight        float64
hindft        float64
notes          object
b_key         float64
session_id      int64
study          object
dtype: object
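The `date` column is stored as `object` (plain strings), and the first rows shown earlier are dated 11/26/1998, so it may be worth checking the time frame directly against the data. A minimal sketch, assuming the dates follow a month/day/year pattern; the `toy` frame is a hypothetical stand-in that reuses a few dates from the outputs in this notebook:

```python
import pandas as pd

# Hypothetical stand-in for df; the dates are copied from outputs in this notebook
toy = pd.DataFrame({"date": ["11/26/1998", "8/7/2002", "8/8/2002"]})

# Parse month/day/year strings; errors="coerce" turns unparseable entries into NaT
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y", errors="coerce")

# Earliest and latest observation dates
first_obs = toy["date"].min()
last_obs = toy["date"].max()
print(first_obs.date(), last_obs.date())
```

Running the same two steps on `df['date']` would give the observed range for the full file.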

In [5]:
df.shape

(3380, 14)

In [6]:
na_counts = df.isna().sum()

print(na_counts)


date             0
time          3116
grid             0
trap            12
l_ear           48
r_ear          169
sex            352
age           2111
weight         535
hindft        1747
notes         3137
b_key           47
session_id       0
study          163
dtype: int64
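Raw counts are easier to compare as a share of all rows; for example, `time` is missing in 3116 of 3380 rows, roughly 92%. A small sketch of that calculation on a hypothetical `toy` frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    "weight": [1370.0, np.nan, 1430.0, 1490.0],
    "age": [np.nan, np.nan, np.nan, "A"],
})

# isna() gives booleans, so the column mean is the fraction of missing values
na_share = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(na_share)
```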


In [7]:
df['weight'].max()

2365.0

In [8]:
df['weight'].min()

0.0

In [13]:
df['hindft'].max()

160.0

In [14]:
df['hindft'].min()

60.0
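The minimum weight of 0.0 stands out next to a maximum of 2365.0. One possible way to handle it, assuming a zero is a placeholder rather than a real measurement (the metadata shown here does not confirm this), is to mask non-positive weights before summarizing. The `toy` frame is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column containing a zero weight and a missing weight
toy = pd.DataFrame({"weight": [0.0, 1370.0, 2365.0, np.nan]})

# Keep only positive weights; zeros (and existing NaN) become NaN
# Assumption: a weight of 0 is a data-entry placeholder, not a measurement
weight_clean = toy["weight"].where(toy["weight"] > 0)
print(weight_clean.min(), weight_clean.max())
```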

| Code      | Definition |
| ----------- | ----------- |
| m      | male      |
| f   | female        |
| m ?     | male not confirmed     |

In [20]:
df['sex'].value_counts()

sex
F     1161
M      730
f      556
m      515
?       40
F?      10
f        4
m        4
f?       3
M?       2
m?       2
pf       1
Name: count, dtype: int64

In [24]:
df['sex'].value_counts(dropna = False)

sex
F      1161
M       730
f       556
m       515
NaN     352
?        40
F?       10
f         4
m         4
f?        3
M?        2
m?        2
pf        1
Name: count, dtype: int64

Do the values in the sex column correspond to the values declared in the metadata?

- No. The column contains both uppercase and lowercase versions of the documented codes (`F`, `M`, `f`, `m`), plus values that are not listed in the metadata table, such as `f?`, `?`, and a single `pf`.

What could have been potential causes for multiple codes?
- Data entry errors, such as accidentally capitalizing letters or not realizing that `f?` is not a listed option. The documentation could also be missing the `f?` code.

Are there seemingly repeated values? If so, what could be the cause?
- It's hard to know for sure. The `f` and `m` codes each appear twice in the counts, which suggests that some entries carry extra whitespace (for example `'f '` instead of `'f'`). Data collectors could also have made errors when judging the sex of a specimen: juveniles may have been harder to identify, and their sex may only have been determined later in life.
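The capitalization and whitespace variants discussed above can be collapsed before counting. This is a sketch on a hypothetical `sex` Series, assuming the repeated entries differ only by trailing whitespace, as the `'f '` and `'m '` entries in the later `isin` lists suggest:

```python
import numpy as np
import pandas as pd

# Hypothetical values mirroring the kinds of variants seen in value_counts()
sex = pd.Series(["F", "f", "f ", "M", "m ", "m?", "?", "pf", np.nan], name="sex")

# Strip surrounding whitespace and lowercase, so 'F', 'f' and 'f ' become one code
sex_clean = sex.str.strip().str.lower()

# dropna=False keeps the missing entries visible in the tally
print(sex_clean.value_counts(dropna=False))
```

Uncertain codes such as `m?` are left as they are, so they can still be reviewed separately.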

In [41]:
df[(df['l_ear'].duplicated()) | (df['date'].duplicated())]

Unnamed: 0,date,time,grid,trap,l_ear,r_ear,sex,age,weight,hindft,notes,b_key,session_id,study
1,11/26/1998,,bonrip,2C,414D320671,,M,,1430.0,,,936.0,51,Population
2,11/26/1998,,bonrip,2D,414D103E3A,,M,,1430.0,,,921.0,51,Population
3,11/26/1998,,bonrip,2E,414D262D43,,,,1490.0,135.0,,931.0,51,Population
4,11/26/1998,,bonrip,3B,414D2B4B58,,,,1710.0,150.0,,933.0,51,Population
5,11/26/1998,,bonrip,3D,414D193011,,F,,1890.0,145.0,,926.0,51,Population
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3375,8/8/2002,18:00:00,bonrip,1b,1201,1202,,,1400.0,,,63.0,64,Population
3376,8/8/2002,6:00:00,bonrip,4b,1201,1202,,,,,,63.0,64,Population
3377,8/7/2002,,bonrip,4b,1217,1218,,,1000.0,134.0,,69.0,64,Population
3378,8/8/2002,,bonrip,6d,1217,1218,,,990.0,,,69.0,64,Population
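Because `date.duplicated()` is true for every row whose date has appeared before, the filter above returns most of the table. To look specifically for animals that were recorded with different sex codes across captures, one option is to group by ear tag. This is a sketch on a hypothetical `toy` frame, assuming `l_ear` identifies an individual; the tag values are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical captures; tag IDs are invented for illustration
toy = pd.DataFrame({
    "l_ear": ["A1", "A1", "B2", "B2", "C3"],
    "date": ["8/7/2002", "8/8/2002", "8/7/2002", "8/8/2002", "8/8/2002"],
    "sex": ["f", "m", "m", "M", np.nan],
})

# keep=False marks every row of a repeated tag, not only the later occurrences
recaptures = toy[toy["l_ear"].duplicated(keep=False)]

# Count distinct sex codes per tag after normalizing case and whitespace
codes = recaptures["sex"].str.strip().str.lower()
n_codes = codes.groupby(recaptures["l_ear"]).nunique()

# Tags recorded with more than one sex code
conflicting = n_codes[n_codes > 1].index.tolist()
print(conflicting)
```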


In [54]:
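# Questions 5 & 6: collapse the raw sex codes into two categories
conditions = [
    df['sex'].isin(['F', 'f', 'f ']),  # female codes, including trailing-space variant
    df['sex'].isin(['M', 'm', 'm ']),  # male codes, including trailing-space variant
]

gender = ['female', 'male']

# Rows matching neither condition get the default; because the choices are strings,
# np.nan ends up as the string 'nan' in the output below, not a true missing value
df['sex_simple'] = np.select(conditions, gender, default=np.nan)

print(df['sex_simple'])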

0        nan
1       male
2       male
3        nan
4        nan
        ... 
3375     nan
3376     nan
3377     nan
3378     nan
3379    male
Name: sex_simple, Length: 3380, dtype: object


In [55]:
# Question 7
df.groupby('sex_simple').weight.mean()

sex_simple
female    1365.164792
male      1349.935542
nan       1193.364055
Name: weight, dtype: float64
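The `nan` row in the result is a group of its own because the unmatched rows hold the string `'nan'`, and newer NumPy releases may refuse to mix string choices with a float default in `np.select`. An alternative sketch that keeps real missing values uses `Series.map`. The `toy` frame and the `codes` dictionary are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; weights are illustrative only
toy = pd.DataFrame({
    "sex": ["F", "f ", "M", "m", "?", np.nan],
    "weight": [1400.0, 1300.0, 1500.0, 1300.0, 1000.0, 1200.0],
})

# Hypothetical lookup from normalized code to label
codes = {"f": "female", "m": "male"}

# Values not found in the dictionary (such as '?') and missing values map to NaN
toy["sex_simple"] = toy["sex"].str.strip().str.lower().map(codes)

# groupby drops NaN keys by default, so only the two labels remain
mean_weight = toy.groupby("sex_simple")["weight"].mean()
print(mean_weight)
```

Passing `dropna=False` to `groupby` would keep the unclassified rows as a separate group if that comparison is wanted.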